Chapter III: Transport Layer
UG3 Computer Communications & Networks (COMN)
Mahesh Marina, mahesh@ed.ac.uk
Slides copyright of Kurose and Ross
Transport services and protocols
• provide logical communication between app processes running on different hosts
• transport protocols run in end systems
  – send side: breaks app messages into segments, passes to network layer
  – rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps
  – Internet: TCP and UDP
[Figure: two end hosts, each with the stack application / transport / network / data link / physical, joined by logical end-end transport]
Transport vs. network layer
• network layer: logical communication between hosts
• transport layer: logical communication between processes
  – relies on, enhances, network layer services
household analogy: 12 kids in Ann's house sending letters to 12 kids in Bill's house
• hosts = houses
• processes = kids
• app messages = letters in envelopes
• transport protocol = Ann and Bill, who demux to in-house siblings
• network-layer protocol = postal service
Internet transport-layer protocols
• reliable, in-order delivery: TCP
  – congestion control
  – flow control
  – connection setup
• unreliable, unordered delivery: UDP
  – no-frills extension of "best-effort" IP
• services not available:
  – delay guarantees
  – bandwidth guarantees
[Figure: the two end hosts run the full stack (application, transport, network, data link, physical); the routers between them run only network, data link, physical; logical end-end transport spans host to host]
UDP: User Datagram Protocol [RFC 768]
• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  – lost
  – delivered out-of-order to app
• connectionless:
  – no handshaking between UDP sender, receiver
  – each UDP segment handled independently of others
• UDP use:
  – streaming multimedia apps (loss tolerant, rate sensitive)
  – DNS
  – SNMP
• reliable transfer over UDP:
  – add reliability at application layer
  – application-specific error recovery
UDP: segment header
UDP segment format (each row 32 bits):
  source port # | dest port #
  length | checksum
  application data (payload)
length: in bytes of UDP segment, including header
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
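The segment format can be sketched in a few lines of Python. This is a minimal illustration, assuming four 16-bit header fields in network byte order as laid out above; the function names `build_udp_segment` and `parse_udp_segment` are hypothetical, and the checksum is left at 0 rather than computed.

```python
import struct

# Four 16-bit fields, network byte order: source port, dest port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_segment(src_port, dst_port, payload, checksum=0):
    """Prefix the payload with an 8-byte UDP-style header (checksum not computed here)."""
    length = UDP_HEADER.size + len(payload)  # length covers header plus payload
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload

def parse_udp_segment(segment):
    """Split a segment back into its header fields and payload."""
    src, dst, length, checksum = UDP_HEADER.unpack_from(segment)
    return src, dst, length, checksum, segment[UDP_HEADER.size:length]

segment = build_udp_segment(5000, 53, b"hello")
print(len(segment), parse_udp_segment(segment))
```

The header costs only 8 bytes per segment, which is the "small header size" point in the list above.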
UDP checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: one's complement of the one's complement sum of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later ...
Internet checksum: example
example: add two 16-bit integers
               1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
               1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound   1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum            1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum       0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1
Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
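The wraparound rule can be checked with a short Python sketch, applied here to the two words 0xE666 and 0xD555. It assumes the data is already split into 16-bit words; the function names are illustrative, not taken from the slides.

```python
def ones_complement_sum(words):
    """Add 16-bit words, folding any carry out of bit 15 back into the sum."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # wraparound
    return total

def internet_checksum(words):
    """Checksum = bitwise complement of the one's complement sum."""
    return ~ones_complement_sum(words) & 0xFFFF

def looks_valid(words, checksum):
    """Receiver check: data words plus checksum must sum to all ones."""
    return ones_complement_sum(list(words) + [checksum]) == 0xFFFF

words = [0xE666, 0xD555]
print(f"{ones_complement_sum(words):016b}")  # sum after wraparound
print(f"{internet_checksum(words):016b}")    # checksum
```

A passing check only means no error was detected: two bit flips that cancel in the sum would go unnoticed.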
Principles of reliable data transfer
• important in application, transport, link layers
  – top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver
FSM notation: an arrow from state 1 to state 2 is labelled with the event causing the state transition, written above the actions taken on the state transition.
state: when in this "state", next state uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel
sender
• in "Wait for call from above", on rdt_send(data): packet = make_pkt(data); udt_send(packet)
receiver
• in "Wait for call from below", on rdt_rcv(packet): extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors?
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr->sender
How do humans recover from "errors" during conversation?
rdt2.0: FSM specification
sender
• in "Wait for call from above", on rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
• in "Wait for ACK or NAK", on rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt); stay
• in "Wait for ACK or NAK", on rdt_rcv(rcvpkt) && isACK(rcvpkt): Λ (no action); go to "Wait for call from above"
receiver
• in "Wait for call from below", on rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
• in "Wait for call from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 FSM with the error-free path highlighted. Sender: rdt_send(data), sndpkt = make_pkt(data, checksum), udt_send(sndpkt). Receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), extract(rcvpkt, data), deliver_data(data), udt_send(ACK). Sender: rdt_rcv(rcvpkt) && isACK(rcvpkt), Λ, back to "Wait for call from above"]
rdt2.0: error scenario
[Figure: the rdt2.0 FSM with the error path highlighted. Sender: rdt_send(data), sndpkt = make_pkt(data, checksum), udt_send(sndpkt). Receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt), udt_send(NAK). Sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt), udt_send(sndpkt), stays in "Wait for ACK or NAK"]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
• in "Wait for call 0 from above", on rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
• in "Wait for ACK or NAK 0", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt); stay
• in "Wait for ACK or NAK 0", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 1 from above"
• in "Wait for call 1 from above", on rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
• in "Wait for ACK or NAK 1", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt); stay
• in "Wait for ACK or NAK 1", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
• in "Wait for 0 from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
• in "Wait for 0 from below", on rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt); stay
• in "Wait for 0 from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); stay
• in "Wait for 1 from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
• in "Wait for 1 from below", on rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt); stay
• in "Wait for 1 from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); stay
rdt2.1: Example 1
(sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1; receiver states: Wait for 0 from below, Wait for 1 from below)
• sender, in "Wait for call 0 from above": rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
• receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
• receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
• end: sender in "Wait for call 1 from above", receiver in "Wait for 1 from below"
rdt2.1: Example 2
(same sender and receiver states as in Example 1)
• receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
• receiver, in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate, not delivered again)
• sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
• end: sender in "Wait for call 1 from above", receiver in "Wait for 1 from below"
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
• in "Wait for call 0 from above", on rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• in "Wait for ACK 0", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt); stay
• in "Wait for ACK 0", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): Λ; leave "Wait for ACK 0"
receiver FSM fragment
• in "Wait for 0 from below", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt); stay
• entering "Wait for 0 from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
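The timeout-and-retransmit idea can be tried out with a small, deterministic Python simulation. This is a sketch under stated assumptions, not the protocol's FSM: the channel only loses packets (corruption is treated as loss), every loss is followed by a timeout, premature timeouts are not modelled, and the function name `rdt30_transfer` and its `drop_data` / `drop_ack` parameters are invented for illustration.

```python
def rdt30_transfer(messages, drop_data=(), drop_ack=()):
    """Alternating-bit stop-and-wait over a lossy channel.

    drop_data / drop_ack hold the 0-based indices of the data packets / ACKs
    (counted in transmission order) that the channel loses.
    Returns (messages delivered to the upper layer, data packets transmitted).
    """
    delivered = []
    expected = 0               # receiver: seq # it is waiting for
    data_sent = acks_sent = 0
    for i, msg in enumerate(messages):
        seq = i % 2            # sender alternates seq # 0, 1, 0, ...
        while True:
            lost = data_sent in drop_data
            data_sent += 1     # udt_send(sndpkt); start_timer
            if lost:
                continue       # no ACK arrives: timeout, retransmit
            if seq == expected:
                delivered.append(msg)  # new packet: deliver_data
                expected ^= 1
            # a duplicate is not delivered again, but is still ACKed
            ack_lost = acks_sent in drop_ack
            acks_sent += 1
            if ack_lost:
                continue       # timeout, retransmit (receiver will see a duplicate)
            break              # ACK for seq received: stop_timer, next message
    return delivered, data_sent

print(rdt30_transfer(["a", "b", "c"], drop_data={1}, drop_ack={1}))
```

In this run the second data transmission and the second ACK are lost, so packet 1 is sent three times, yet each message is delivered exactly once because the receiver uses the sequence number to spot the duplicate.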
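rdt3.0 sender
• in "Wait for call 0 from above", on rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
• in "Wait for call 0 from above", on rdt_rcv(rcvpkt): Λ; stay
• in "Wait for ACK0", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ; stay
• in "Wait for ACK0", on timeout: udt_send(sndpkt); start_timer; stay
• in "Wait for ACK0", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
• in "Wait for call 1 from above", on rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
• in "Wait for call 1 from above", on rdt_rcv(rcvpkt): Λ; stay
• in "Wait for ACK1", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ; stay
• in "Wait for ACK1", on timeout: udt_send(sndpkt); start_timer; stay
• in "Wait for ACK1", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"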
rdt3.0 in action
(a) no loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0
(b) packet loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; pkt1 is lost (X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0
(c) ACK loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1; ack1 is lost (X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (ack1 is delayed)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• sender: rcv ack1 (second copy), do nothing
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$
• $U_{sender}$: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$ (times in msec)
• if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Timing diagram, sender and receiver]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• at receiver: first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Timing diagram, sender and receiver]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• at receiver: first packet bit arrives; last packet bit arrives, send ACK
• at receiver: last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
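The utilization figures for stop-and-wait and for 3-packet pipelining can be reproduced numerically. The sketch below assumes the example values (8000-bit packets, 1 Gbps link, 30 ms RTT); the function name and its `window` parameter are illustrative.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy with `window` packets in flight per RTT."""
    d_trans = packet_bits / rate_bps  # L/R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))

L, R, RTT = 8000, 1e9, 0.030          # bits, bits/sec, seconds
u_stop_and_wait = sender_utilization(L, R, RTT)
u_pipelined = sender_utilization(L, R, RTT, window=3)
throughput_bps = u_stop_and_wait * R  # effective rate of stop-and-wait

print(f"{u_stop_and_wait:.5f} {u_pipelined:.5f} {throughput_bps / 8 / 1000:.1f} kB/s")
```

Utilization grows linearly with the window until the sender is busy the whole time, which is why the function caps the result at 1.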
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially: base = 1; nextseqnum = 1
single state: Wait
on rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
on timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
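The sender FSM translates fairly directly into code. The sketch below is one possible rendering in Python, with some liberties that are assumptions rather than part of the slides: the timer is modelled as a boolean flag, `udt_send` is a caller-supplied callback, packets are plain `(seq, data)` tuples, and `base` is never allowed to move backwards on an old ACK.

```python
class GBNSender:
    """Go-Back-N sender sketch; the timer is a flag, not a real clock."""

    def __init__(self, window_size, udt_send):
        self.N = window_size
        self.udt_send = udt_send   # hypothetical channel callback
        self.base = 1              # oldest unacked seq #
        self.nextseqnum = 1        # next seq # to assign
        self.sndpkt = {}           # seq # -> buffered packet
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False           # refuse_data(data): window is full
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.udt_send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True  # start_timer for the oldest unacked pkt
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True      # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])  # resend every unacked pkt

    def rdt_rcv_ack(self, acknum):
        new_base = max(self.base, acknum + 1)  # cumulative ACK
        for seq in range(self.base, new_base):
            self.sndpkt.pop(seq, None)         # acked pkts need no buffering
        self.base = new_base
        self.timer_running = self.base != self.nextseqnum

sent = []
sender = GBNSender(4, sent.append)
accepted = [sender.rdt_send(c) for c in "abcde"]  # fifth call is refused
```

With a window of 4 the fifth `rdt_send` is refused until an ACK slides the window forward.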
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
single state: Wait
on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
default:
  udt_send(sndpkt)
GBN in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0
• receiver: receive pkt1, send ack1
• receiver: receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4
• sender: rcv ack1, send pkt5
• sender: ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1
• receiver: receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2
• receiver: rcv pkt3, deliver, send ack3
• receiver: rcv pkt4, deliver, send ack4
• receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #'s of sent, unACKed packets
[Figure: Selective repeat: sender, receiver windows]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
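The receiver rules can be sketched as a small Python class. This is an illustration under simplifying assumptions: sequence numbers grow without bound (no modular wraparound), packets arrive uncorrupted, and the class and method names are invented here.

```python
class SRReceiver:
    """Selective-repeat receiver sketch with unbounded seq #'s."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0       # smallest not-yet-received seq #
        self.buffer = {}       # out-of-order pkts awaiting delivery
        self.delivered = []    # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so the sender can move on
        return None            # outside both ranges: ignore

rcv = SRReceiver(window_size=4)
acks = [rcv.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks, rcv.delivered)
```

Packets 3, 4 and 5 are buffered and ACKed individually; once the retransmitted packet 2 arrives, all four are delivered together and the window moves to 6.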
Selective repeat in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0
• receiver: receive pkt1, send ack1
• receiver: receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4
• sender: rcv ack1, send pkt5
• sender: record ack3 arrived
• receiver: receive pkt4, buffer, send ack4
• receiver: receive pkt5, buffer, send ack5
• sender: pkt 2 timeout; send pkt2
• sender: record ack4 arrived
• sender: record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver gets all three and its window (after receipt) advances to 3, 0, 1
• the ACKs arrive; sender window (after receipt) advances; sender sends pkt3, which is lost (X), then a new pkt0
• receiver will accept packet with seq number 0
(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver gets all three and its window (after receipt) advances to 3, 0, 1
• all three ACKs are lost (X X X)
• sender: timeout, retransmit pkt0
• receiver will accept packet with seq number 0
receiver can't see sender side: receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
segment layout (each row 32 bits):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the incoming segment carries acknowledgement number A. Sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]
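Byte stream in TCP
[Figure: an HTTP GET message (K bytes) viewed as a byte stream. A window of N bytes covers part of it; a segment of M bytes starting at the 100th byte is sent with a TCP header (seq no = 100); the part of the message beyond the window cannot be transmitted now]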
TCP seq. numbers, ACKs
simple telnet scenario
• Host A: user types 'C'
  – A to B: Seq=42, ACK=79, data = 'C'
• Host B: ACKs receipt of 'C', echoes back 'C'
  – B to A: Seq=79, ACK=43, data = 'C'
• Host A: ACKs receipt of echoed 'C'
  – A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) against time in seconds]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
  (typically, $\beta = 0.25$)
  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (first term: estimated RTT; second term: "safety margin")
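The three update rules fit in a few lines of Python. The sketch below makes two assumptions the slides leave open: DevRTT is updated with the previous EstimatedRTT before EstimatedRTT itself is updated, and the state is initialised from the first sample with DevRTT set to half of it (a choice borrowed from RFC 6298). The class name is illustrative.

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, giving a timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: RFC 6298-style start value

    def update(self, sample_rtt):
        """Fold in one SampleRTT (not from a retransmission); return the new timeout."""
        # deviation first, using the previous EstimatedRTT (assumption)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt  # estimate plus safety margin

est = RttEstimator(first_sample=100.0)   # milliseconds
timeout = est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, timeout)
```

A single 120 ms sample moves the estimate only from 100 to 102.5 ms, which is the smoothing effect of the small α.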
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
initially: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
single state: wait for event
on data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
on timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer
on ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
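TCP: retransmission scenarios
lost ACK scenario (Host A, Host B)
• A: Seq=92, 8 bytes of data (SendBase=92)
• B: ACK=100, lost (X)
• A: timeout; resend Seq=92, 8 bytes of data
• B: ACK=100 (SendBase=100)
premature timeout (Host A, Host B)
• A: Seq=92, 8 bytes of data (SendBase=92)
• A: Seq=100, 20 bytes of data
• B: ACK=100, then ACK=120
• A: timeout before the ACKs arrive; resend Seq=92, 8 bytes of data
• A: ACK=100 arrives (SendBase=100), ACK=120 arrives (SendBase=120)
• B: ACK=120 for the retransmitted segment (SendBase=120)
cumulative ACK (Host A, Host B)
• A: Seq=92, 8 bytes of data
• A: Seq=100, 20 bytes of data
• B: ACK=100, lost (X)
• B: ACK=120, arrives before the timeout
• A: Seq=120, 15 bytes of data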
TCP ACK generation [RFC 5681]
• event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
  – TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK
• event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
  – TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments
• event at receiver: arrival of out-of-order segment, higher-than-expected seq #. Gap detected
  – TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte
• event at receiver: arrival of segment that partially or completely fills gap
  – TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
• A: Seq=92, 8 bytes of data
• A: Seq=100, 20 bytes of data, lost (X)
• B: ACK=100
• B: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
• A: resend Seq=100, 20 bytes of data, before the timeout expires
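The duplicate-ACK counting can be sketched as follows. This Python fragment is an illustration only: it tracks nothing but SendBase and a duplicate counter, the `retransmit` callback is hypothetical, and timers, flow control and congestion control are left out.

```python
class FastRetransmitSender:
    """Counts duplicate ACKs and triggers a retransmission on the third one."""

    def __init__(self, initial_seq, retransmit):
        self.send_base = initial_seq   # oldest not-yet-acked byte
        self.dup_acks = 0
        self.retransmit = retransmit   # hypothetical callback: resend segment at seq #

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y         # new data ACKed: slide forward
            self.dup_acks = 0
        elif y == self.send_base:
            self.dup_acks += 1         # same ACK again: a later segment arrived
            if self.dup_acks == 3:
                self.retransmit(self.send_base)  # don't wait for the timeout

resent = []
sender = FastRetransmitSender(92, resent.append)
for ack in (100, 100, 100, 100):       # first ACK=100, then three duplicates
    sender.on_ack(ack)
print(resent)
```

As in the scenario above, the first ACK=100 acknowledges new data and the next three are duplicates, so the segment starting at byte 100 is resent once.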
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers ...
• ... slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data is passed to application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
client (application over network):
• Socket clientSocket = new Socket("hostname", "port number");
• connection state: ESTAB
• connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
server (application over network):
• Socket connectionSocket = welcomeSocket.accept();
• connection state: ESTAB
• connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
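The sequence and acknowledgement numbers exchanged in the handshake can be traced with a short sketch. The dictionary layout and the function name `three_way_handshake` are illustrative assumptions, not a real TCP header format:

```python
def three_way_handshake(x, y):
    """Return the three handshake segments for initial seq numbers x (client), y (server)."""
    syn = {"SYNbit": 1, "Seq": x}
    # Server acknowledges the SYN by asking for byte x+1 next.
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    # Client acknowledges the SYNACK by asking for byte y+1 next.
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
    return [syn, synack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```

Each side consumes one sequence number for its SYN, which is why the acknowledgement numbers are x+1 and y+1 even though no data has been sent yet.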
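TCP 3-way handshake: FSM
[Figure: FSM with states closed, listen, SYN rcvd, SYN sent, ESTAB.]
client side:
  closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
  SYN sent -> ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
server side:
  closed -> listen: Socket connectionSocket = welcomeSocket.accept(); action Λ
  listen -> SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB: receive ACK(ACKnum=y+1); action Λ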
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data); client state: FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
5. server: on receiving the ACK, server state: CLOSED
The “Two Army Problem”
Principles of congestion control
congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data at rate λ_in into a router with unlimited shared output link buffers; Host B receives throughput λ_out. Two plots: λ_out versus λ_in, with both axes marked at R/2, and delay versus λ_in, with the x-axis marked at R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A offers λ_in (original data) and λ'_in (original data plus retransmitted data) to a router with finite shared output link buffers; Host B receives λ_out.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet and sends λ_in (original data), λ'_in (original data plus retransmitted data) into finite shared output link buffers only while there is free buffer space, not when there is no buffer space. Plot: λ_out versus λ_in, both axes marked at R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
[Figure: Host A keeps a copy of each packet and resends it when it is dropped at the router for lack of free buffer space. Plot: λ_out versus sending rate, both axes marked at R/2.]
Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[Figure: Host A times out (timeout) while the router still has free buffer space and sends a second copy. Plot: λ_out versus sending rate, both axes marked at R/2.]
“costs” of congestion:
• more work (retransmissions) for given “goodput”
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: throughput.]
Causes/costs of congestion: scenario 3
another “cost” of congestion:
• when packet dropped, any “upstream” transmission capacity used for that packet was wasted!
[Figure: plot of λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), axes marked at R/2. The annotation reads: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior: probing for bandwidth. Plot of cwnd (TCP sender congestion window size) versus time: additively increase window size … until loss occurs (then cut window in half).]
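The saw tooth can be reproduced with a toy simulation. This is a sketch under stated assumptions: cwnd is measured in MSS units, one step is one RTT, and the set of RTTs in which loss is detected (`loss_rounds`) is supplied by hand rather than produced by a network model.

```python
def aimd(cwnd, loss_rounds, rounds, mss=1):
    """Return the cwnd value at the start of each RTT (in MSS units)."""
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease
        else:
            cwnd = cwnd + mss           # additive increase: +1 MSS per RTT
    return trace


print(aimd(cwnd=8, loss_rounds={3}, rounds=7))
# [8, 9, 10, 11, 5.5, 6.5, 7.5]
```

The window climbs by one MSS per RTT, halves at the loss, and then resumes the linear climb, which is the saw tooth shape described on the slide.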
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, then cwnd bytes sent but not-yet ACKed (“in-flight”), up to last byte sent.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments, along a time axis.]
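Why "incrementing cwnd for every ACK" doubles the window each RTT can be seen in a small sketch. It assumes, for illustration only, that every segment sent in an RTT is ACKed within that RTT and that no loss occurs:

```python
def slow_start(rtts, mss=1):
    """Return cwnd (in MSS units) at the start of each RTT during slow start."""
    cwnd = 1 * mss
    trace = []
    for _round in range(rtts):
        trace.append(cwnd)
        acks_this_rtt = cwnd // mss       # one ACK per segment sent
        for _ack in range(acks_this_rtt):
            cwnd += mss                   # +1 MSS for every ACK received
    return trace


print(slow_start(4))  # [1, 2, 4, 8]
```

A window of n segments yields n ACKs, each adding one MSS, so the window goes from n to 2n in one RTT.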
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[Figure: FSM with three states: slow start, congestion avoidance, fast recovery.]
initial transition (Λ) into slow start:
  cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: Λ; go to congestion avoidance
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
  new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
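The three-state FSM can be written as a small Python class. This is a sketch, not TCP as implemented in any operating system: cwnd and ssthresh are kept in MSS units (so the 64 KB initial ssthresh becomes an arbitrary `ssthresh` argument), the class and method names are made up for illustration, and actual segment transmission is left out.

```python
MSS = 1  # all window sizes below are in units of MSS


class TcpCongestionControl:
    def __init__(self, ssthresh=64):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                     # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow start"


cc = TcpCongestionControl(ssthresh=4)
for _ in range(4):
    cc.on_new_ack()
print(cc.state, cc.cwnd)   # congestion avoidance 5
for _ in range(3):
    cc.on_dup_ack()
print(cc.state, cc.cwnd)   # fast recovery 5.5
```

Keeping one handler per event (new ACK, duplicate ACK, timeout) mirrors the way the FSM labels its transitions, which makes it easy to compare the code against the diagram.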
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: window size oscillating in a saw tooth between W/2 and W, with the average at 3/4 W.]
TCP Futures: TCP over “long, fat pipes”
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  - to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
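The two numbers on this slide can be checked with a few lines of arithmetic. The function names are illustrative; the second function simply solves the throughput formula above for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with the given RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


print(round(window_segments(10e9, 0.100, 1500)))    # 83333
print(required_loss_rate(10e9, 0.100, 1500))         # roughly 2.1e-10
```

Both results agree with the slide: about 83,333 segments in flight and a tolerable loss probability of about 2·10^-10.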
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between “congestion avoidance: additive increase” and “loss: decrease window by factor of 2”. Annotations: (X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2.]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
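The parallel-connection example is simple per-connection arithmetic, sketched below under the assumption that the bottleneck is split equally among all TCP connections. The function name is hypothetical.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app that opens new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # 1/10
print(app_share(9, 11))   # 11/20
```

With 11 connections the new app holds 11 of the 20 connections, so it gets 11/20 of R, which the slide rounds to R/2.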
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the destination returns a TCP ACK segment with ECE=1.]
UDP: User Datagram Protocol [RFC 768]
• “bare bones” Internet transport protocol
• “best effort” service, UDP segments may be:
  - lost
  - delivered out-of-order to app
• connectionless:
  - no handshaking between UDP sender, receiver
  - each UDP segment handled independently of others
• UDP use:
  - streaming multimedia apps (loss tolerant, rate sensitive)
  - DNS
  - SNMP
• reliable transfer over UDP:
  - add reliability at application layer
  - application-specific error recovery!
UDP: segment header
UDP segment format (32 bits wide):
  source port # | dest port #
  length | checksum
  application data (payload)
length: in bytes of UDP segment, including header
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
UDP checksum
Goal: detect “errors” (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (1’s complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later …
Internet checksum example
example: add two 16-bit integers
                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound    1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum             1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum        0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1
Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
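The worked example can be reproduced in Python. This sketch operates on a list of 16-bit integers rather than on real UDP segment bytes, and the function names are illustrative.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, wrapping any carry out of bit 15 back into bit 0."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # wraparound of the carry
    return total


def internet_checksum(words):
    """Checksum is the bitwise complement of the 1's complement sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


a = 0b1110011001100110
b = 0b1101010101010101
print(format(ones_complement_sum16([a, b]), "016b"))  # 1011101110111100
print(format(internet_checksum([a, b]), "016b"))      # 0100010001000011

# Receiver check: summing the data words plus the checksum gives all ones.
print(ones_complement_sum16([a, b, internet_checksum([a, b])]) == 0xFFFF)  # True
```

The receiver-side check at the end is the usual way to verify: if the sum over data and checksum is not all ones, at least one bit was flipped.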
Principles of reliable data transfer
• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We’ll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver
state: when in this “state”, next state uniquely determined by next event
[Figure: FSM notation. An arrow from state 1 to state 2 is labelled with the event causing the state transition above the line and the actions taken on the state transition below it.]
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender (single state “Wait for call from above”):
  rdt_send(data)
    packet = make_pkt(data)
    udt_send(packet)
receiver (single state “Wait for call from below”):
  rdt_rcv(packet)
    extract(packet, data)
    deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors? (How do humans recover from “errors” during conversation?)
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender
rdt2.0: FSM specification
sender:
  state “Wait for call from above”:
    rdt_send(data)
      sndpkt = make_pkt(data, checksum)
      udt_send(sndpkt)
      -> Wait for ACK or NAK
  state “Wait for ACK or NAK”:
    rdt_rcv(rcvpkt) && isNAK(rcvpkt)
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt)
      Λ
      -> Wait for call from above
receiver:
  state “Wait for call from below”:
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 sender and receiver FSMs, repeated with the error-free path highlighted.]
1. sender: rdt_send(data); sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt); extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
3. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt); Λ; back to “Wait for call from above”
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSMs, repeated with the error path highlighted.]
1. sender: rdt_send(data); sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); udt_send(NAK)
3. sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt); extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn’t know what happened at receiver!
• can’t just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn’t deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
state “Wait for call 0 from above”:
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK 0
state “Wait for ACK or NAK 0”:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    -> Wait for call 1 from above
state “Wait for call 1 from above”:
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK 1
state “Wait for ACK or NAK 1”:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    -> Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
state “Wait for 0 from below”:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    -> Wait for 1 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
state “Wait for 1 from below”:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    -> Wait for 0 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition highlighted per step.]
1. sender, in “Wait for call 0 from above”: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); moves to “Wait for ACK or NAK 0”
2. receiver, in “Wait for 0 from below”: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in “Wait for ACK or NAK 0”: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver, in “Wait for 0 from below”: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); moves to “Wait for 1 from below”
5. sender, in “Wait for ACK or NAK 0”: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ; moves to “Wait for call 1 from above”
6. sender waits in “Wait for call 1 from above”; receiver waits in “Wait for 1 from below”
rdt2.1: Example 2
[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition highlighted per step.]
1. receiver, in “Wait for 0 from below”: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); moves to “Wait for 1 from below”
2. sender, in “Wait for ACK or NAK 0”: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver, in “Wait for 1 from below”: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); the duplicate is ACKed but not delivered again
4. sender, in “Wait for ACK or NAK 0”: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ; moves to “Wait for call 1 from above”
5. sender waits in “Wait for call 1 from above”; receiver waits in “Wait for 1 from below”
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #’s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must “remember” whether “expected” pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
  state “Wait for call 0 from above”:
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      -> Wait for ACK 0
  state “Wait for ACK 0”:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      Λ
receiver FSM fragment:
  transition into “Wait for 0 from below”:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK1, chksum)
      udt_send(sndpkt)
  state “Wait for 0 from below”:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help … but not enough
approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #’s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
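The timeout-and-retransmit idea can be exercised with a toy stop-and-wait simulation. It is a sketch under strong assumptions: the channel never corrupts or reorders, losses are scripted by transmission index (`drop_data`, `drop_ack`, both hypothetical parameters), and a lost packet or lost ACK is modelled simply as "the sender times out and retransmits".

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Alternating-bit stop-and-wait over a lossy channel.

    drop_data / drop_ack: indices of data transmissions (0-based, counting
    every transmission including retransmissions) whose packet or ACK is lost.
    Returns (messages delivered to the upper layer, total data transmissions).
    """
    delivered = []
    expected = 0   # receiver: seq # it is waiting for
    seq = 0        # sender: seq # of the current packet
    tx = 0         # number of data transmissions so far
    for msg in messages:
        while True:
            this_tx = tx
            tx += 1
            if this_tx in drop_data:
                continue              # packet lost: timeout, retransmit
            if seq == expected:
                delivered.append(msg)  # new packet: deliver it
                expected ^= 1
            # otherwise it is a duplicate: re-ACK without delivering
            if this_tx in drop_ack:
                continue              # ACK lost: timeout, retransmit
            break                     # ACK received: next message
        seq ^= 1
    return delivered, tx


print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={3}))
# (['a', 'b', 'c'], 5)
```

Three messages need five transmissions here: one extra for the lost packet and one extra for the lost ACK, and the one-bit sequence number keeps the duplicate of "c" from being delivered twice.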
rdt3.0 sender
state “Wait for call 0 from above”:
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK0
  rdt_rcv(rcvpkt)
    Λ
state “Wait for ACK0”:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    Λ
  timeout
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    stop_timer
    -> Wait for call 1 from above
state “Wait for call 1 from above”:
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK1
  rdt_rcv(rcvpkt)
    Λ
state “Wait for ACK1”:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    Λ
  timeout
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    stop_timer
    -> Wait for call 0 from above
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1; pkt1 is lost (X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1; ack1 is lost (X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 is delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$
• U_sender: utilization, fraction of time sender busy sending
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
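The utilization and throughput figures can be recomputed directly. The function names are illustrative; the inputs are the numbers from the example (8000-bit packet, 1 Gbps link, 30 msec RTT).

```python
def sender_utilization(packet_bits, rate_bps, rtt_s):
    """Fraction of time a stop-and-wait sender spends transmitting."""
    d_trans = packet_bits / rate_bps           # L / R
    return d_trans / (rtt_s + d_trans)


def stop_and_wait_throughput_bps(packet_bits, rate_bps, rtt_s):
    """One packet delivered per (RTT + L/R) seconds."""
    return packet_bits / (rtt_s + packet_bits / rate_bps)


u = sender_utilization(8000, 1e9, 0.030)
print(round(u, 5))                                                    # 0.00027
print(round(stop_and_wait_throughput_bps(8000, 1e9, 0.030) / 8))      # 33324
```

The second result is in bytes per second, about 33 kB/sec, matching the slide's estimate for a link that could carry 1 Gbps.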
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. first packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. first packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn’t ack packet if there’s a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• “window” of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: “cumulative ACK”
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
(single state: Wait)
Λ (initial transition)
  base = 1
  nextseqnum = 1
rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ
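The sender FSM translates fairly directly into Python. This is a sketch with several assumptions: sequence numbers are unbounded integers rather than k-bit values, the timer is reduced to a boolean flag, `send` is a caller-supplied stand-in for udt_send, and the `max(...)` guard against stale ACKs is an addition that is not in the FSM.

```python
class GbnSender:
    def __init__(self, window, send):
        self.N = window
        self.send = send            # stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # copies of sent, not yet ACKed packets
        self.timer_running = False

    def rdt_send(self, data):
        """Return True if the data was accepted, False if refused (window full)."""
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.send(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True      # start_timer
            self.nextseqnum += 1
            return True
        return False                           # refuse_data

    def timeout(self):
        """Go back N: resend every packet from base up to nextseqnum-1."""
        self.timer_running = True              # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])

    def rdt_rcv_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is acknowledged."""
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum


sent = []
sender = GbnSender(window=4, send=sent.append)
print([sender.rdt_send(d) for d in "abcde"])   # [True, True, True, True, False]
sender.rdt_rcv_ack(1)
sender.timeout()
print([seq for seq, _ in sent])                # [1, 2, 3, 4, 2, 3, 4]
```

The fifth call is refused because four packets are already outstanding; after ACK 1 slides the window, a timeout resends everything still unacknowledged (2, 3 and 4).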
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don’t buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
(single state: Wait)
Λ (initial transition)
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
default
  udt_send(sndpkt)
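A matching sketch of the receiver, again with unbounded sequence numbers and with the ACK returned from the method instead of being sent over a channel (both simplifications are assumptions made for illustration):

```python
class GbnReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0          # ACK to repeat on any unexpected packet
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number to send back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)          # deliver_data
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packet is discarded and the
        # highest in-order seq # is re-ACKed
        return self.last_ack


receiver = GbnReceiver()
arrivals = [(1, "a"), (2, "b"), (4, "d"), (3, "c"), (4, "d")]
print([receiver.rdt_rcv(seq, data) for seq, data in arrivals])  # [1, 2, 2, 3, 4]
print(receiver.delivered)                                       # ['a', 'b', 'c', 'd']
```

Packet 4 arriving early is thrown away and answered with a duplicate ACK 2; it is only delivered once it arrives again in order.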
GBN in action
sender window (N=4), over seq #’s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #’s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
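The receiver rules above translate almost line for line into code. The sketch below is illustrative only: `SRReceiver` is a hypothetical name, sequence numbers are assumed unbounded (no wraparound), and the return value of `receive` stands in for the ACK that would be sent.

```python
class SRReceiver:
    """Selective-repeat receiver with window size N."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0       # smallest not-yet-received sequence number
        self.buffer = {}       # out-of-order packets: seq -> data
        self.delivered = []    # data handed to the upper layer, in order

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase, if any.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore


sr = SRReceiver(window=4)
sr_acks = [sr.receive(n, f"pkt{n}") for n in [0, 1, 3, 4, 5, 2]]
```

With pkt2 arriving last, pkt3 to pkt5 sit in the buffer and are all delivered as soon as pkt2 fills the gap.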
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
sender: send pkt0
sender: send pkt1
sender: send pkt2 (X, loss)
sender: send pkt3, then (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4
sender: rcv ack1, send pkt5
sender: record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2
sender: record ack4 arrived
sender: record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem: sender sends pkt0, pkt1, pkt2 and receives their ACKs; it then sends pkt3, which is lost (X), and pkt0 carrying new data; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0
receiver can't see sender side. Receiver behavior identical in both cases! Something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
[figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for cases (a) and (b)]
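One way to explore the question is to check, for a given sequence-number space and window size, whether a retransmitted old packet can land inside the receiver's next window. The sketch below does this by brute force; `ambiguous` is a hypothetical helper, and it models only the worst case from scenario (b), where a full window was received but every ACK was lost. The result agrees with the commonly stated rule that the window must be at most half the sequence-number space.

```python
def ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for new data.

    Worst case: the sender sent seq numbers 0..window-1 and all ACKs were
    lost, so it may retransmit any of them, while the receiver has already
    moved its window forward by `window` positions.
    """
    old = {s % seq_space for s in range(window)}
    new = {s % seq_space for s in range(window, 2 * window)}
    return bool(old & new)


slide_example = ambiguous(seq_space=4, window=3)   # the case on the slide
smaller_window = ambiguous(seq_space=4, window=2)
```

For the slide's example (4 sequence numbers, window 3) the two sets overlap at sequence number 0; shrinking the window to 2 removes the overlap.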
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
fields (each row is 32 bits):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
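To make the layout concrete, the fixed 20-byte part of the header can be unpacked with Python's `struct` module. This is a sketch for illustration: `parse_tcp_header` is a hypothetical helper, options are skipped rather than decoded, and the sample segment is hand-built, not captured traffic.

```python
import struct

FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]  # bit 0 .. bit 5


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed TCP header fields and split off the payload."""
    (src, dst, seq, ack, off_flags,
     rwnd, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (off_flags >> 12) * 4          # head len is in 32-bit words
    flags = {name for bit, name in enumerate(FLAG_NAMES)
             if off_flags & (1 << bit)}
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags,
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg_ptr,
        "payload": segment[header_len:],
    }


# Hand-built segment: head len = 5 words, SYN and ACK set, one byte of data.
sample = struct.pack("!HHIIHHHH", 12345, 80, 42, 79,
                     (5 << 12) | 0x12, 4096, 0, 0) + b"C"
fields = parse_tcp_header(sample)
```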
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]
Byte stream in TCP
[figure labels: Window (N bytes); HTTP GET message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; Cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario
Host A: user types 'C'
Host A → Host B: Seq=42, ACK=79, data = 'C'
Host B: host ACKs receipt of 'C', echoes back 'C'
Host B → Host A: Seq=79, ACK=43, data = 'C'
Host A: host ACKs receipt of echoed 'C'
Host A → Host B: Seq=43, ACK=80
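The numbers in the telnet exchange follow from two rules: a host's next Seq is the ACK value it last received, and its ACK is the peer's Seq plus the number of data bytes received. A small sketch, assuming no loss and using a hypothetical `reply` helper with segments modelled as dictionaries:

```python
def reply(incoming, data=b""):
    """Build the segment sent in response to `incoming`."""
    return {
        "seq": incoming["ack"],                          # next byte we send
        "ack": incoming["seq"] + len(incoming["data"]),  # next byte we expect
        "data": data,
    }


a_to_b = {"seq": 42, "ack": 79, "data": b"C"}   # user types 'C'
b_to_a = reply(a_to_b, b"C")                    # B ACKs 'C' and echoes it
a_ack = reply(b_to_a)                           # A ACKs the echoed 'C'
```

Running this reproduces Seq=79, ACK=43 for the echo and Seq=43, ACK=80 for the final acknowledgement.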
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) against time in seconds]
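The moving average is easy to compute incrementally. In the sketch below, `estimated_rtt` and `sample_weight` are hypothetical helpers, and seeding the estimate with the first sample is an assumption, since the slide does not say how EstimatedRTT is initialised.

```python
def estimated_rtt(samples, alpha=0.125):
    """Return the EstimatedRTT value after each SampleRTT."""
    est = samples[0]                  # assumption: start from the first sample
    history = [est]
    for sample in samples[1:]:
        est = (1 - alpha) * est + alpha * sample
        history.append(est)
    return history


def sample_weight(age, alpha=0.125):
    """Weight of a sample that is `age` updates old in the current estimate."""
    return alpha * (1 - alpha) ** age


smoothed = estimated_rtt([100.0, 200.0, 100.0])
```

A single 200 ms spike moves the estimate only from 100 ms to 112.5 ms, and `sample_weight` shows why old samples fade: each update multiplies their weight by (1 - α).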
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4·DevRTT
(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
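Putting the two estimators together gives the timeout calculation. This is a sketch under stated assumptions: `RttEstimator` is a hypothetical class, the initial DevRTT of half the first sample and the choice to update DevRTT before EstimatedRTT follow RFC 6298 rather than anything on the slide.

```python
class RttEstimator:
    """Tracks EstimatedRTT and DevRTT and derives TimeoutInterval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated = first_sample
        self.dev = first_sample / 2      # assumption taken from RFC 6298

    def update(self, sample):
        # Deviation is measured against the estimate before it is updated.
        self.dev = (1 - self.beta) * self.dev \
            + self.beta * abs(sample - self.estimated)
        self.estimated = (1 - self.alpha) * self.estimated \
            + self.alpha * sample

    def timeout_interval(self):
        return self.estimated + 4 * self.dev


rtt = RttEstimator(first_sample=100.0)
initial_timeout = rtt.timeout_interval()
rtt.update(120.0)
next_timeout = rtt.timeout_interval()
```

As samples settle close to the estimate, DevRTT shrinks and the safety margin tightens with it.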
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
Λ (initial transition):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
state: wait for event
event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
      start timer
event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer
event: ACK received, with ACK field value y
  if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
          start timer
      else stop timer
  }
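The three events of the simplified sender map onto three methods. The sketch below is illustrative: `SimpleTcpSender` is a hypothetical class, the timer is reduced to a boolean flag, and "pass segment to IP" is modelled by appending to a log instead of sending anything.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one retransmission timer."""

    def __init__(self, initial_seq=0):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq -> data for segments not yet ACKed
        self.timer_running = False
        self.sent = []               # log of (seq, data) passed to IP

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)      # not-yet-acked segment with smallest seq
        self.sent.append((seq, self.unacked[seq]))
        self.timer_running = True    # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1: last cumulatively ACKed byte
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


tcp = SimpleTcpSender(initial_seq=92)
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes of data
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes of data
tcp.on_timeout()              # premature timeout: resend Seq=92
tcp.on_ack(100)
tcp.on_ack(120)
```

A later duplicate `on_ack(120)` leaves the state untouched, because the ACK value is not greater than SendBase.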
TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100 (X, lost)
  timeout at A
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100
premature timeout (Host A, Host B):
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100
  B → A: ACK=120
  timeout at A before the ACKs arrive
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=120
  SendBase values marked on the figure: 92, 100, 120, 120
cumulative ACK (Host A, Host B):
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100 (X, lost)
  B → A: ACK=120, arrives before the timeout
  A → B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data (X, lost)
  B → A: ACK=100
  B → A: ACK=100
  B → A: ACK=100
  B → A: ACK=100
  3 DUP ACKs at A, before the timeout expires
  A → B: Seq=100, 20 bytes of data
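Counting duplicate ACKs takes only a little state. A sketch, assuming the sender sees ACK numbers as a simple list; `fast_retransmit_points` is a hypothetical helper that reports which ACK values would trigger a fast retransmit.

```python
def fast_retransmit_points(acks):
    """Return the ACK numbers at which the third duplicate is seen."""
    last_ack = None
    dup_count = 0
    triggers = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:       # original ACK plus three duplicates
                triggers.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return triggers


figure_case = fast_retransmit_points([100, 100, 100, 100])
```

Four ACK=100 segments, as in the figure, mean one original and three duplicates, so the sender resends the segment starting at byte 100 without waiting for the timer.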
TCP flow control
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[figure: receiver protocol stack: segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
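The bookkeeping on both sides can be written as two one-line functions. This is a sketch: the variable names `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked` are assumptions borrowed from common textbook notation, not from the slide.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises in the rwnd field."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


window = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
fits = sender_may_send(last_byte_sent=5000, last_byte_acked=4000,
                       rwnd=window, nbytes=1000)
too_much = sender_may_send(last_byte_sent=5000, last_byte_acked=4000,
                           rwnd=window, nbytes=1200)
```

With 2000 bytes still unread in a 4096-byte buffer, the receiver advertises 2096 bytes; a sender with 1000 bytes in flight may add 1000 more but not 1200.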
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
state held on each side (application above, network below):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = newSocket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
client: choose init seq num, x; send TCP SYN msg; enter SYNSENT
client → server: SYNbit=1, Seq=x
server: choose init seq num, y; send TCP SYNACK msg, acking SYN; enter SYN RCVD
server → client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data; enter ESTAB
client → server: ACKbit=1, ACKnum=y+1
server: received ACK(y) indicates client is live; enter ESTAB
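The three segments and the state changes can be generated from the two initial sequence numbers. A minimal sketch, with a hypothetical `three_way_handshake` function, segments modelled as dictionaries, and no loss or retransmission.

```python
def three_way_handshake(x, y):
    """Return the three handshake segments and the state sequences."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    client_states = ["CLOSED", "SYNSENT", "ESTAB"]
    server_states = ["LISTEN", "SYN RCVD", "ESTAB"]
    return [syn, synack, ack], client_states, server_states


segments, client_states, server_states = three_way_handshake(x=1000, y=5000)
```

Each side acknowledges the other's initial sequence number plus one, which is how both learn that the peer is live.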
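TCP 3-way handshake: FSM
states: closed, listen, SYN rcvd, SYN sent, ESTAB
transitions (event / action):
  closed → listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
  listen → SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd → ESTAB: ACK(ACKnum=y+1) / Λ
  closed → SYN sent: Socket clientSocket = newSocket("hostname", "port number"); / SYN(seq=x)
  SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)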
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
client: enters FIN_WAIT_2 (wait for server close)
server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
server: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[figure: Host A sends original data at rate λin through unlimited shared output link buffers; Host B receives throughput λout; plots of λout against λin (both up to R/2) and of delay against λin]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ'in ≥ λin
[figure: Host A and Host B with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[figure: sender A keeps a copy; finite shared output link buffers, shown once with free buffer space and once with no buffer space; λin: original data; λ'in: original data, plus retransmitted data; plot of λout against λin, both up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)
[figure: sender A, Host B, free buffer space; λin: original data; λ'in: original data, plus retransmitted data; plot of λout against λin, both up to R/2]
Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
[figure: sender A keeps a copy and times out; Host B; free buffer space; plot of λout against λin, both up to R/2]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[figure: Hosts A, B, C, D with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]
[figure: throughput by blue traffic (λout, up to R/2) against offered load by Host A (λ'in, up to R/2); bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
AIMD saw tooth behavior: probing for bandwidth. Additively increase window size … until loss occurs (then cut window in half)
[figure: cwnd, TCP sender congestion window size, against time]
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd/RTT bytes/sec
[figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight"), spanning cwnd | last byte sent]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
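The doubling per RTT falls out of the per-ACK increment. A sketch, assuming no loss, one ACK per segment (no delayed ACKs) and cwnd counted in whole segments; `slow_start` is a hypothetical helper.

```python
def slow_start(rtts, mss=1):
    """Return cwnd (in units of mss) at the start of each RTT."""
    cwnd = mss
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss          # one ACK per segment sent this RTT
        for _ in range(acks):
            cwnd += mss             # increment cwnd for every ACK received
        history.append(cwnd)
    return history


growth = slow_start(rtts=4)
```

Each segment sent in one RTT earns one ACK, and each ACK adds one MSS, so the window goes 1, 2, 4, 8, 16.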
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
Λ (initial transition): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start
state: slow start
  new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  cwnd > ssthresh / Λ → congestion avoidance
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
state: congestion avoidance
  new ACK / cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
state: fast recovery
  duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
  New ACK / cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment → slow start
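The state machine above can be exercised with a small class. This is a sketch, not a TCP implementation: `RenoSender` is a hypothetical name, cwnd and ssthresh are in units of MSS, the "+ 3" is read as 3 MSS, and the switch to congestion avoidance is taken as soon as cwnd reaches ssthresh (an assumption about the boundary case).

```python
class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow start"

    def new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss
            if self.cwnd >= self.ssthresh:      # assumed boundary condition
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)
        self.dup_acks = 0

    def duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow start"


reno = RenoSender(mss=1.0, ssthresh=4.0)
for _ in range(3):
    reno.new_ack()            # slow start: cwnd 2, 3, 4, then switch to CA
```

From here a further `new_ack()` adds only MSS·(MSS/cwnd), three `duplicate_ack()` calls move the sender into fast recovery, and `timeout()` always drops it back to slow start with cwnd = 1 MSS.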
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4)·W/RTT bytes/sec
[figure: window size oscillating between W/2 and W, with average 3/4 W]
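The 3/4 factor is the average of a sawtooth that climbs linearly from W/2 to W. The sketch below checks this numerically; `average_window` and `avg_throughput` are hypothetical helpers, and the sawtooth is idealised (no slow start, loss exactly at W).

```python
def average_window(w, steps=1000):
    """Average of a window that rises linearly from w/2 to w."""
    samples = [w / 2 + (w / 2) * i / steps for i in range(steps + 1)]
    return sum(samples) / len(samples)


def avg_throughput(w_bytes, rtt_seconds):
    """Average TCP throughput in bytes/sec for loss window W."""
    return 0.75 * w_bytes / rtt_seconds


mean_window = average_window(100)          # close to 75
rate = avg_throughput(w_bytes=100_000, rtt_seconds=0.5)
```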
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
  → to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
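Both numbers in the example can be reproduced directly from the formula. A sketch with hypothetical helper names, using bits for MSS so that the rate comes out in bits per second:

```python
import math


def required_window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), with MSS in bits."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Invert the formula above to find the loss rate for a target rate."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


segments_needed = required_window_segments(10e9, 0.1, 1500)   # about 83,333
loss_needed = required_loss_rate(10e9, 0.1, 1500)             # about 2e-10
```

The loss rate scales with the inverse square of the target throughput, which is why high-speed paths need either extremely clean links or a different congestion control algorithm.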
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[figure: Connection 2 throughput against Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line; the trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"; (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]
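The convergence argument can be checked with a toy simulation. This sketch makes strong assumptions that are not in the slides: both flows have the same RTT, both see every loss at the same time, and rates are abstract units; `aimd_two_flows` is a hypothetical helper.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck link."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both halve their rate
        else:
            x, y = x + 1, y + 1      # no loss: both add one unit
    return x, y


x_final, y_final = aimd_two_flows(x=80.0, y=10.0, capacity=100.0, rounds=1000)
```

Additive increase leaves the gap between the flows unchanged, while every halving cuts it in half, so the rates drift toward the equal-share line regardless of where they start.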
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
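The arithmetic behind the parallel-connection example, as a sketch that assumes every connection gets an equal share of the link; `app_share` is a hypothetical helper.

```python
from fractions import Fraction


def app_share(existing_connections, opened_by_app):
    """Fraction of link rate R obtained by an app opening several connections."""
    total = existing_connections + opened_by_app
    return Fraction(opened_by_app, total)


one_connection = app_share(9, 1)       # 1/10 of R
eleven_connections = app_share(9, 11)  # 11/20 of R, a little over R/2
```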
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and arrives with ECN=11; the TCP ACK segment returns with ECE=1]
UDP: User Datagram Protocol [RFC 768]
• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  – lost
  – delivered out-of-order to app
• connectionless:
  – no handshaking between UDP sender, receiver
  – each UDP segment handled independently of others
• UDP use:
  – streaming multimedia apps (loss tolerant, rate sensitive)
  – DNS
  – SNMP
• reliable transfer over UDP:
  – add reliability at application layer
  – application-specific error recovery!
UDP: segment header
UDP segment format (each row is 32 bits):
  source port # | dest port #
  length | checksum
  application data (payload)
• length: length, in bytes, of UDP segment, including header
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
UDP checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (1's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …
Internet checksum: example
example: add two 16-bit integers
                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound    1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum             1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum        0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1
Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
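The example above can be verified in a few lines. This is a sketch of the 1's complement sum over 16-bit words; `internet_checksum` is a hypothetical helper name, and the input is taken as a list of integers rather than raw bytes.

```python
def internet_checksum(words):
    """1's complement of the 1's complement sum of 16-bit words."""
    total = 0
    for word in words:
        total += word
        # Wrap any carry out of bit 15 back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


words = [0b1110011001100110, 0b1101010101010101]
checksum = internet_checksum(words)

# Receiver side: summing the data together with the checksum gives all
# ones, so checksumming the received words plus the checksum yields zero.
receiver_check = internet_checksum(words + [checksum])
```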
Principles of reliable data transfer
• important in application, transport, link layers
  – top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver
FSM notation: state 1 → state 2, labelled "event / actions": the event causing state transition, and the actions taken on state transition
state: when in this "state", next state uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel
sender, state "Wait for call from above":
  rdt_send(data) / packet = make_pkt(data); udt_send(packet)
receiver, state "Wait for call from below":
  rdt_rcv(packet) / extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors?
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr->sender
How do humans recover from "errors" during conversation?
rdt2.0: FSM specification
sender
state "Wait for call from above":
  rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) → "Wait for ACK or NAK"
state "Wait for ACK or NAK":
  rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call from above"
receiver
state "Wait for call from below":
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
(same FSM as in the specification; with no errors the transitions taken are)
sender: rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.0: error scenario
(same FSM as in the specification; with one corrupted packet the transitions taken are)
sender: rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
state "Wait for call 0 from above":
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → "Wait for ACK or NAK 0"
state "Wait for ACK or NAK 0":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call 1 from above"
state "Wait for call 1 from above":
  rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) → "Wait for ACK or NAK 1"
state "Wait for ACK or NAK 1":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
state "Wait for 0 from below":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → "Wait for 1 from below"
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
state "Wait for 1 from below":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → "Wait for 0 from below"
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
(each step highlights one transition of the rdt2.1 sender and receiver FSMs)
step 1, sender: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
step 2, receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
step 3, sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
step 4, receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 5, sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
step 6: no further transition shown
rdt2.1: Example 2
(each step highlights one transition of the rdt2.1 sender and receiver FSMs)
step 1, receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 2, sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
step 3, receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 4, sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
step 5: no further transition shown
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
      -> "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      Λ
receiver FSM fragment
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help … but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
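The timeout-and-retransmit idea can be sketched as a small deterministic simulation. This is only an illustration, not the protocol's reference implementation: the function name `run_stop_and_wait` and the `drop_data` / `drop_ack` parameters (indices of transmissions that a hypothetical channel loses) are assumptions made for the example, and a "timeout" is modelled simply as "no ACK came back for this transmission".

```python
def run_stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Alternating-bit (rdt3.0-style) transfer over a scripted lossy channel.

    drop_data / drop_ack: 0-based indices of the data / ACK transmissions
    that the channel loses (hypothetical test knobs, not part of rdt3.0).
    Returns (data delivered to the upper layer, number of data transmissions).
    """
    delivered = []      # what the receiver hands to its upper layer
    expected = 0        # receiver: sequence number it expects next
    data_tx = 0         # count of data packets put on the channel
    ack_tx = 0          # count of ACKs put on the channel
    for i, msg in enumerate(messages):
        seq = i % 2     # sender alternates sequence numbers 0, 1
        acked = False
        while not acked:
            tx_index = data_tx
            data_tx += 1
            if tx_index in drop_data:
                continue            # data lost: sender times out and resends
            if seq == expected:     # new packet: deliver it exactly once
                delivered.append(msg)
                expected ^= 1
            # duplicate or not, the receiver ACKs the seq # it just received
            ack_index = ack_tx
            ack_tx += 1
            if ack_index in drop_ack:
                continue            # ACK lost: sender times out and resends
            acked = True
    return delivered, data_tx
```

For instance, `run_stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={0})` loses the first ACK and the first retransmission, yet each message is still delivered exactly once, at the cost of two extra transmissions.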
rdt3.0 sender
[Figure: sender FSM, listed here as states and transitions]
state "Wait for call 0 from above":
  rdt_rcv(rcvpkt)
    Λ
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
    -> "Wait for ACK0"
state "Wait for ACK0":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    stop_timer
    -> "Wait for call 1 from above"
state "Wait for call 1 from above":
  rdt_rcv(rcvpkt)
    Λ
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
    -> "Wait for ACK1"
state "Wait for ACK1":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    stop_timer
    -> "Wait for call 0 from above"
rdt3.0 in action
[Figure: four sender/receiver timelines; S = sender, R = receiver]
(a) no loss
  S: send pkt0 -> R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 -> R: rcv pkt1, send ack1
  S: rcv ack1, send pkt0 -> R: rcv pkt0, send ack0
(b) packet loss
  S: send pkt0 -> R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 -> pkt1 lost (X)
  S: timeout, resend pkt1 -> R: rcv pkt1, send ack1
  S: rcv ack1, send pkt0 -> R: rcv pkt0, send ack0
(c) ACK loss
  S: send pkt0 -> R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 -> R: rcv pkt1, send ack1 -> ack1 lost (X)
  S: timeout, resend pkt1 -> R: rcv pkt1 (detect duplicate), send ack1
  S: rcv ack1, send pkt0 -> R: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  S: send pkt0 -> R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 -> R: rcv pkt1, send ack1 (ack1 delayed)
  S: timeout, resend pkt1 -> R: rcv pkt1 (detect duplicate), send ack1
  S: rcv ack1, send pkt0 -> R: rcv pkt0, send ack0
  S: rcv ack1 (second copy), do nothing
  S: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

• $U_{sender}$: utilization, the fraction of time sender busy sending

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
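The utilization arithmetic can be checked with a few lines of Python. This is a sketch under the slide's assumptions (ACK transmission time ignored); the function name `sender_utilization` and the `window` parameter, which generalizes the formula to several packets in flight, are introduced here for illustration.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting.

    window = 1 is stop-and-wait; window > 1 models that many packets
    sent back-to-back before the first ACK returns.
    """
    d_trans = packet_bits / rate_bps              # L / R, in seconds
    # Utilization cannot exceed 1 once the pipe is full.
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
three_in_flight = sender_utilization(8000, 1e9, 0.030, window=3)
```

With these inputs `stop_and_wait` is roughly 0.00027 and `three_in_flight` is three times that, which is still a tiny fraction of the link.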
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially:
  base = 1
  nextseqnum = 1
state "Wait":
  rdt_send(data)
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
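The sender FSM can be sketched in Python as plain bookkeeping, with no real network or clock. The class name `GBNSender`, the `send` callback, and the boolean `timer_running` flag are assumptions made for this example. One deliberate deviation from the FSM is flagged in the comments: ACKs older than the current window are ignored rather than allowed to move `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender state: window of up to N unacked packets."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # hypothetical callback: send(seq, data)
        self.base = 1               # oldest unacked sequence number
        self.nextseqnum = 1         # next sequence number to use
        self.sndpkt = {}            # seq -> data, kept until ACKed
        self.timer_running = False  # stands in for start_timer/stop_timer

    def rdt_send(self, data):
        """Called from above; returns False when the window is full."""
        if self.nextseqnum >= self.base + self.N:
            return False            # refuse_data(data)
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is ACKed."""
        if acknum < self.base:
            return                  # duplicate/old ACK (guard added for this sketch)
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        # Stop the timer when nothing is in flight, otherwise restart it.
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet that is still unacked."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])
```

A caller would wire `send` to whatever plays the role of `udt_send` and invoke `on_timeout` from its own timer.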
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
initially:
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)
state "Wait":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default
    udt_send(sndpkt)
GBN in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8]
sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets
Selective repeat sender receiver windows
54
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
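The receiver rules above can be sketched as follows. This is an illustrative model, not a full implementation: the class name `SRReceiver` and the `deliver` callback are assumptions, and sequence numbers are treated as unbounded integers (no wraparound) to keep the window tests simple.

```python
class SRReceiver:
    """Selective-repeat receiver with a window of N sequence numbers."""

    def __init__(self, window_size, deliver):
        self.N = window_size
        self.rcvbase = 0        # smallest not-yet-received sequence number
        self.buffer = {}        # out-of-order packets: seq -> data
        self.deliver = deliver  # hypothetical callback to the upper layer

    def on_packet(self, n, data):
        """Return the sequence number to ACK, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer.setdefault(n, data)
            # Deliver any in-order run and slide the window past it.
            while self.rcvbase in self.buffer:
                self.deliver(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n            # already delivered: re-ACK, its ACK may have been lost
        return None             # outside both ranges: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 buffers 3 to 5 and releases 2 through 5 together once the gap is filled, matching the "Selective repeat in action" trace.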
Selective repeat in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8]
sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt) over seq #s 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2; the ACKs arrive and the sender window advances; sender sends pkt3 (lost, X) and then a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X); timeout, sender retransmits pkt0; receiver will accept packet with seq number 0
receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
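One way to explore the question is to model scenario (b) directly. The sketch below assumes the worst case from the figure: the receiver has accepted a full window and advanced, while the sender, having seen no ACKs, retransmits the old window. The function name `sr_ambiguous` is hypothetical. Under this model a sequence number is ambiguous whenever the old and new windows overlap modulo the sequence space, which happens exactly when the window is larger than half the sequence space.

```python
def sr_ambiguous(seq_space, window):
    """True if a retransmitted old packet can be mistaken for a new one."""
    # Sequence numbers the sender may retransmit (first window, all ACKs lost).
    old = {s % seq_space for s in range(window)}
    # Sequence numbers the receiver now accepts as new (window advanced by N).
    new = {s % seq_space for s in range(window, 2 * window)}
    return bool(old & new)
```

With the slide's numbers, `sr_ambiguous(4, 3)` is true, while shrinking the window to 2 removes the overlap.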
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer). Sender sequence number space, with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
62
Window N bytes
HTTP Get Message (K bytes)
100th byte
TCP header(seq no = 100)
M bytes
HTTP Get Message (K bytes)
Cannot be transmitted now
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. A -> B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. B -> A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. A -> B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: SampleRTT and EstimatedRTT for gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100 to 350) versus time (seconds)]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$
(typically, $\beta = 0.25$)

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$
(EstimatedRTT is the estimated RTT; $4\cdot\text{DevRTT}$ is the "safety margin")
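The three formulas combine into a small estimator. The sketch below is illustrative: the class name `RttEstimator` and its constructor arguments are assumptions, and the update order (deviation first, using the previous EstimatedRTT, then the mean) is a choice made here in the style of RFC 6298; the slides do not state an order.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, estimated_rtt, dev_rtt=0.0, alpha=0.125, beta=0.25):
        self.estimated_rtt = estimated_rtt  # initial estimate (caller-supplied)
        self.dev_rtt = dev_rtt              # initial deviation (assumed 0 by default)
        self.alpha = alpha
        self.beta = beta

    def update(self, sample_rtt):
        """Fold one SampleRTT (not from a retransmission) into the estimates."""
        # Deviation uses the estimate from before this sample (assumed order).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        """EstimatedRTT plus a safety margin of four deviations."""
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from an estimate of 100 ms, a single 200 ms sample moves EstimatedRTT to 112.5 ms and DevRTT to 25 ms, so the timeout becomes 212.5 ms.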
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
initially:
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
state "wait for event":
  data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  ACK received, with ACK field value y
    if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
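TCP: retransmission scenarios
[Figure: three Host A / Host B timelines]
lost ACK scenario:
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100 (lost, X)
  timeout at A
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100
premature timeout (SendBase=92 initially):
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100
  B -> A: ACK=120
  timeout at A before the ACKs arrive
  A -> B: Seq=92, 8 bytes of data
  ACK=100 arrives at A: SendBase=100
  ACK=120 arrives at A: SendBase=120
  B -> A: ACK=120 (for the retransmitted segment); SendBase=120
cumulative ACK:
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100 (lost, X)
  B -> A: ACK=120 (arrives before timeout)
  A -> B: Seq=120, 15 bytes of data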
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs:
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK]
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data (lost, X)
  B -> A: ACK=100, then three more ACK=100 (3 DUP ACKs)
  A -> B: Seq=100, 20 bytes of data (resent before the timeout expires)
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads.]
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
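The bookkeeping behind rwnd can be sketched in a few lines. The slides only say that rwnd is the free buffer space; the counters `last_byte_rcvd` and `last_byte_read` used below follow the usual textbook formulation and are assumptions of this example, as are the function names.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free space the receiver advertises as rwnd.

    Buffered data is what has arrived but the application has not yet read.
    """
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered


def can_send(nbytes, last_byte_sent, last_byte_acked, rwnd):
    """Sender-side check: keep unacked ("in-flight") data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd
```

For example, with a 4096-byte RcvBuffer holding 2000 unread bytes, `advertised_window(4096, 3000, 1000)` gives 2096, so a sender with 1000 bytes in flight may send another 1000 but not another 1200.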
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, each holding:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED -> SYNSENT -> ESTAB
server state: LISTEN -> SYN RCVD -> ESTAB
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enters ESTAB
4. server: received ACK(y) indicates client is live; enters ESTAB
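TCP 3-way handshake: FSM
server side:
  closed -> listen
    Socket connectionSocket = welcomeSocket.accept();
    action: Λ
  listen -> SYN rcvd
    event: SYN(x)
    action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB
    event: ACK(ACKnum=y+1)
    action: Λ
client side:
  closed -> SYN sent
    Socket clientSocket = new Socket("hostname", "port number");
    action: SYN(seq=x)
  SYN sent -> ESTAB
    event: SYNACK(seq=y, ACKnum=x+1)
    action: ACK(ACKnum=y+1)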
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: client/server timeline, client state and server state both starting in ESTAB]
1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
3. client: enters FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
5. client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
6. server: enters CLOSED
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A and Host B send original data at rate $\lambda_{in}$ into a router with unlimited shared output link buffers; throughput is $\lambda_{out}$]
[Plots: $\lambda_{out}$ versus $\lambda_{in}$, with both axes marked at R/2; delay versus $\lambda_{in}$, with the axis marked at R/2]
• maximum per-connection throughput: R/2
• large delays as arrival rate, $\lambda_{in}$, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  – transport-layer input includes retransmissions: $\lambda'_{in} \geq \lambda_{in}$
[Figure: Host A sends $\lambda_{in}$ (original data) and $\lambda'_{in}$ (original data, plus retransmitted data) into a router with finite shared output link buffers; Host B receives $\lambda_{out}$]

idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes marked at R/2]

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes marked at R/2] when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes marked at R/2] when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]
[Plot: throughput by blue traffic ($\lambda_{out}$) versus offered load by Host A ($\lambda'_{in}$), both axes marked at R/2; annotation: bandwidth wastage for packets dropped at the 2nd router]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior, probing for bandwidth: cwnd (TCP sender congestion window size) versus time; additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$\text{rate} \approx \frac{\text{cwnd}}{RTT}$ bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight bytes]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A / Host B timeline: one segment in the first RTT, then two segments, then four segments]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[Figure: FSM with states slow start, congestion avoidance, fast recovery]
initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 -> slow start
state "slow start":
  new ACK
    cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  cwnd > ssthresh
    Λ -> congestion avoidance
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
state "congestion avoidance":
  new ACK
    cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
state "fast recovery":
  duplicate ACK
    cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK
    cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout
    ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment -> slow start
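The summary FSM can be sketched as a small state machine that tracks only the window variables. This is an illustration under simplifying assumptions: cwnd and ssthresh are counted in units of MSS (so the initial ssthresh of 64 below is an arbitrary stand-in for the slide's 64 KB), no segments are actually sent or retransmitted, and the class name `RenoCwnd` and its method names are made up for the example.

```python
class RenoCwnd:
    """TCP Reno-style congestion window, in units of MSS."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow_start"
        self.cwnd = 1.0
        self.ssthresh = ssthresh    # assumed initial threshold, in MSS
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1.0                     # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1.0 / self.cwnd         # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0                     # each dup ACK means a segment left the network
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"
```

Driving the object with a sequence of `on_new_ack`, `on_dup_ack`, and `on_timeout` calls reproduces the sawtooth: exponential growth up to ssthresh, linear growth afterwards, a halving on triple duplicate ACKs, and a reset to 1 MSS on timeout.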
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \frac{3}{4}\cdot\frac{W}{RTT}$ bytes/sec

[Figure: sawtooth window size oscillating between W/2 and W, averaging 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
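The slide's numbers can be reproduced from the formula. The helper names below are hypothetical; MSS is given in bytes and converted to bits so that throughput comes out in bits per second.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss(mss_bytes, rtt_s, target_bps):
    """Loss probability L that the formula allows for a target throughput."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


def required_window_segments(mss_bytes, rtt_s, target_bps):
    """In-flight segments needed to fill the pipe: rate * RTT / segment size."""
    return target_bps * rtt_s / (mss_bytes * 8)


# Slide example: 1500-byte segments, 100 ms RTT, 10 Gbps target
loss = required_loss(1500, 0.1, 10e9)               # roughly 2e-10
window = required_window_segments(1500, 0.1, 10e9)  # roughly 83,333 segments
```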
TCP Fairness
fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK
104
TCP connection 1
bottleneckroutercapacity RTCP connection 2
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. Shows the equal bandwidth share line ((X2, Y2) where X2 = Y2) and the full bandwidth utilization line ((X1, Y1) where X1 + Y1 = R). The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
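The convergence argument can be illustrated with a toy model. It is a deliberately idealized sketch: both flows see the same RTT, both detect loss in the same round whenever their combined rate exceeds the capacity, and rates are in arbitrary units. The function name `aimd_two_flows` is an assumption of this example.

```python
def aimd_two_flows(x1, x2, capacity, rounds):
    """Run synchronized AIMD for two flows sharing one bottleneck."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Loss: multiplicative decrease halves both rates,
            # which also halves the gap between them.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # No loss: additive increase leaves the gap unchanged.
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2
```

Starting from a very uneven split such as `aimd_two_flows(1, 50, 60, 200)`, the two rates end up nearly equal, because only the decrease step changes the difference between them and it always shrinks it.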
Fairness (more)
Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
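The parallel-connection arithmetic follows from assuming that every TCP connection gets an equal share of the bottleneck. A minimal sketch, with the hypothetical helper `app_share`, makes the two cases explicit and shows that "R/2" is a rounding of 11/20:

```python
from fractions import Fraction


def app_share(app_connections, other_connections):
    """Fraction of link rate R an app gets if all connections share equally."""
    total = app_connections + other_connections
    return Fraction(app_connections, total)


one_connection = app_share(1, 9)       # 1/10 of R
eleven_connections = app_share(11, 9)  # 11/20 of R, i.e. slightly more than R/2
```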
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical); IP datagram labelled ECN=00 and ECN=11; TCP ACK segment labelled ECE=1]
UDP: User Datagram Protocol [RFC 768]
• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  – lost
  – delivered out-of-order to app
• connectionless:
  – no handshaking between UDP sender, receiver
  – each UDP segment handled independently of others
• UDP use:
  – streaming multimedia apps (loss tolerant, rate sensitive)
  – DNS
  – SNMP
• reliable transfer over UDP:
  – add reliability at application layer
  – application-specific error recovery!
UDP: segment header
[Figure: UDP segment format, 32 bits wide: source port #, dest port #; length, checksum; application data (payload)]
• length: in bytes of UDP segment, including header
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
UDP checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …
Internet checksum example
example: add two 16-bit integers

                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
  wraparound  1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
  sum           1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
  checksum      0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
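The wraparound addition in the example can be sketched in a few lines of Python. This is a minimal illustration, not the code of any real UDP stack; the function names are made up for this sketch, and the input is assumed to be already split into 16-bit integers.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, folding any carry out of bit 15 back into bit 0."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # wraparound (end-around carry)
    return total


def internet_checksum(words):
    """Checksum is the bitwise complement of the 1's complement sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


def verify(words, checksum):
    """Receiver check: the sum over all words plus the checksum is all ones."""
    return ones_complement_sum16(list(words) + [checksum]) == 0xFFFF


# The two 16-bit integers from the example above.
words = [0b1110011001100110, 0b1101010101010101]
print(format(ones_complement_sum16(words), "016b"))  # the "sum" row
print(format(internet_checksum(words), "016b"))      # the "checksum" row
```

A receiver that sees a single flipped bit in either word would find that `verify` returns False, although some multi-bit errors can still cancel out and go undetected.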
Principles of reliable data transfer
• important in application, transport, link layers
  – top-10 list of important networking topics
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
[Figure: send side and receive side of the rdt protocol]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow in both directions
• use finite state machines (FSMs) to specify sender, receiver
[Figure: FSM notation. An arrow from state 1 to state 2 is labelled with the event causing the state transition and, below it, the actions taken on the state transition.]
state: when in this “state”, next state is uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel

sender
  [Wait for call from above]
    rdt_send(data)
      / packet = make_pkt(data); udt_send(packet)
receiver
  [Wait for call from below]
    rdt_rcv(packet)
      / extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors?
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr->sender
How do humans recover from “errors” during conversation?
rdt2.0: FSM specification
sender
  [Wait for call from above]
    rdt_send(data)
      / sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
      -> Wait for ACK or NAK
  [Wait for ACK or NAK]
    rdt_rcv(rcvpkt) && isNAK(rcvpkt)
      / udt_send(sndpkt)
      -> Wait for ACK or NAK
    rdt_rcv(rcvpkt) && isACK(rcvpkt)
      / Λ
      -> Wait for call from above
receiver
  [Wait for call from below]
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 sender and receiver FSMs, repeated for the no-error case]
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSMs, repeated for the error case]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
  [Wait for call 0 from above]
    rdt_send(data)
      / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
      -> Wait for ACK or NAK 0
  [Wait for ACK or NAK 0]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
      / udt_send(sndpkt)
      -> Wait for ACK or NAK 0
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
      / Λ
      -> Wait for call 1 from above
  [Wait for call 1 from above]
    rdt_send(data)
      / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)
      -> Wait for ACK or NAK 1
  [Wait for ACK or NAK 1]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
      / udt_send(sndpkt)
      -> Wait for ACK or NAK 1
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
      / Λ
      -> Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs
  [Wait for 0 from below]
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
      / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
      -> Wait for 1 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
      -> Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
      -> Wait for 0 from below
  [Wait for 1 from below]
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
      -> Wait for 0 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
      -> Wait for 1 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
      / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
      -> Wait for 1 from below
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender and receiver FSMs, one transition shown per slide]
1. sender: rdt_send(data)
     / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
     / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
     / udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
     / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
     / Λ
rdt2.1: Example 2
[Figure sequence: the rdt2.1 sender and receiver FSMs, one transition shown per slide]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
     / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
     / udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
     / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
     / Λ
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must “remember” whether “expected” pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  [Wait for call 0 from above]
    rdt_send(data)
      / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
      -> Wait for ACK 0
  [Wait for ACK 0]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      / udt_send(sndpkt)
      -> Wait for ACK 0
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      / Λ
receiver FSM fragment
  [Wait for 0 from below]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      / udt_send(sndpkt)
      -> Wait for 0 from below
  transition into [Wait for 0 from below]:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough
approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
  [Wait for call 0 from above]
    rdt_send(data)
      / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
      -> Wait for ACK0
    rdt_rcv(rcvpkt)
      / Λ
  [Wait for ACK0]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      / Λ
    timeout
      / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      / stop_timer
      -> Wait for call 1 from above
  [Wait for call 1 from above]
    rdt_send(data)
      / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
      -> Wait for ACK1
    rdt_rcv(rcvpkt)
      / Λ
  [Wait for ACK1]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
      / Λ
    timeout
      / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
      / stop_timer
      -> Wait for call 0 from above
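The alternating-bit behaviour of this sender can be tried out with a small simulation. The sketch below is a simplification, not the FSM itself: it assumes a channel that only loses packets (no bit errors, no premature timeouts), models the timer as "no ACK came back", and uses a hypothetical `run_stop_and_wait` helper whose `drop_data` and `drop_ack` arguments script which transmissions are lost.

```python
def run_stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Simulate a stop-and-wait transfer with 1-bit sequence numbers.

    drop_data / drop_ack hold 0-based indices, counted over every data
    packet (or ACK) put on the channel, that the channel loses.
    Returns (delivered, log).
    """
    delivered, log = [], []
    expected = 0            # receiver: sequence bit it is waiting for
    data_tx = ack_tx = 0    # how many data packets / ACKs were sent so far
    for i, msg in enumerate(messages):
        seq = i % 2         # sender alternates between seq 0 and seq 1
        acked = False
        while not acked:
            log.append(f"send pkt{seq}")
            data_lost = data_tx in drop_data
            data_tx += 1
            if not data_lost:
                if seq == expected:
                    delivered.append(msg)   # new packet: deliver to upper layer
                    expected ^= 1
                    log.append(f"rcv pkt{seq}, send ack{seq}")
                else:
                    # duplicate: do not deliver again, but re-ACK it
                    log.append(f"rcv pkt{seq} (duplicate), send ack{seq}")
                ack_lost = ack_tx in drop_ack
                ack_tx += 1
                acked = not ack_lost
            if not acked:
                log.append("timeout")       # no ACK arrived: retransmit
    return delivered, log


# Lose the first transmission of pkt1, then (separately) the first ack1.
print(run_stop_and_wait(["a", "b", "c"], drop_data={1})[1])
print(run_stop_and_wait(["a", "b", "c"], drop_ack={1})[1])
```

In both runs every message is delivered exactly once; in the lost-ACK run the receiver sees pkt1 twice and relies on the sequence bit to discard the second copy.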
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

rdt3.0 in action
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• $U_{sender}$: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
  sender: first packet bit transmitted, t = 0
  sender: last packet bit transmitted, t = L / R
  receiver: first packet bit arrives
  receiver: last packet bit arrives, send ACK
  sender: ACK arrives, send next packet, t = RTT + L / R
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline]
  sender: first packet bit transmitted, t = 0
  sender: last bit transmitted, t = L / R
  receiver: first packet bit arrives
  receiver: last packet bit arrives, send ACK
  receiver: last bit of 2nd packet arrives, send ACK
  receiver: last bit of 3rd packet arrives, send ACK
  sender: ACK arrives, send next packet, t = RTT + L / R
3-packet pipelining increases utilization by a factor of 3:
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
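The utilization figures for stop-and-wait and for 3-packet pipelining can be reproduced numerically. The helper below is a sketch: `sender_utilization` and its parameter names are invented here, and the cap at 1.0 (a sender cannot be busy more than all of the time) is an addition that the slides do not discuss.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting.

    window is the number of packets sent back-to-back per RTT
    (1 for stop-and-wait).
    """
    d_trans = packet_bits / link_bps                 # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# 1 Gbps link, 30 ms RTT, 8000-bit packets, as in the example above.
u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_pipelined = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"stop-and-wait: {u_stop_and_wait:.5f}")
print(f"3 in flight:   {u_pipelined:.5f}")
```

Raising `window` scales the utilization linearly until the pipe is full, which is the motivation for the Go-Back-N and Selective Repeat protocols.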
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• “window” of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: “cumulative ACK”
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
  initially:
    base = 1
    nextseqnum = 1
  [Wait]
    rdt_send(data)
      / if (nextseqnum < base+N) {
          sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
          udt_send(sndpkt[nextseqnum])
          if (base == nextseqnum)
            start_timer
          nextseqnum++
        }
        else
          refuse_data(data)
    timeout
      / start_timer
        udt_send(sndpkt[base])
        udt_send(sndpkt[base+1])
        …
        udt_send(sndpkt[nextseqnum-1])
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      / base = getacknum(rcvpkt)+1
        if (base == nextseqnum)
          stop_timer
        else
          start_timer
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      / Λ
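The extended FSM translates almost line for line into a small class. The sketch below is only an illustration: `GBNSender` and its attribute names are made up for this example, `udt_send` is replaced by appending to a list, the timer is a boolean flag rather than a real clock, and sequence numbers are unbounded integers instead of k-bit values.

```python
class GBNSender:
    """Go-Back-N sender state, following the extended FSM above."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> packet, kept for retransmission
        self.timer_running = False
        self.sent = []              # stands in for udt_send(): log of packets sent

    def rdt_send(self, data):
        """Accept data from above; return False if the window is full."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = (self.nextseqnum, data)
            self.sent.append(self.sndpkt[self.nextseqnum])
            if self.base == self.nextseqnum:
                self.timer_running = True       # start_timer
            self.nextseqnum += 1
            return True
        return False                            # refuse_data(data)

    def timeout(self):
        """Resend every packet in [base, nextseqnum-1]."""
        self.timer_running = True               # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(self.sndpkt[seq])

    def rdt_rcv_ack(self, acknum):
        """Handle an uncorrupted cumulative ACK, exactly as the FSM does."""
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum


sender = GBNSender(window=4)
accepted = [sender.rdt_send(d) for d in "abcde"]   # fifth call hits a full window
print(accepted, sender.base, sender.nextseqnum)
```

After ACKs 1 and 2 arrive, the window slides and two more packets can be sent; a timeout at that point resends packets 3 through 6 in order.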
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

  initially:
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
  [Wait]
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
      / extract(rcvpkt, data)
        deliver_data(data)
        sndpkt = make_pkt(expectedseqnum, ACK, chksum)
        udt_send(sndpkt)
        expectedseqnum++
    default
      / udt_send(sndpkt)
GBN in action
[Figure: sender/receiver timeline with the sender window (N=4) sliding over seq #'s 0 1 2 3 4 5 6 7 8]
sender events, in order:
  send pkt0
  send pkt1
  send pkt2 (lost, X)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver events, in order:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #'s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender view and receiver view of the sequence number space]
Selective repeat
sender
  data from above:
  • if next available seq # in window, send pkt
  timeout(n):
  • resend pkt n, restart timer
  ACK(n) in [sendbase, sendbase+N-1]:
  • mark pkt n as received
  • if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
  pkt n in [rcvbase, rcvbase+N-1]:
  • send ACK(n)
  • out-of-order: buffer
  • in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
  pkt n in [rcvbase-N, rcvbase-1]:
  • ACK(n)
  otherwise:
  • ignore
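The receiver rules above fit in a short class. This is a sketch under simplifying assumptions: `SRReceiver` is a name invented here, sequence numbers are unbounded integers (no wraparound), packets are assumed uncorrupted, and the return value of `receive` stands in for sending an ACK.

```python
class SRReceiver:
    """Selective repeat receiver: buffer out-of-order, deliver in order."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}        # seq # -> data, received but not yet delivered
        self.delivered = []     # what has been passed to the upper layer

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase, advancing the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n            # already delivered: re-ACK so the sender can advance
        return None             # otherwise: ignore


rcvr = SRReceiver(window=4)
acks = [rcvr.receive(n, f"d{n}") for n in (0, 1, 3, 4, 5)]   # pkt2 is missing
print(acks, rcvr.delivered)
```

When pkt2 finally arrives, the buffered packets 3, 4 and 5 are delivered together with it and the window base jumps to 6.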
Selective repeat in action
[Figure: sender/receiver timeline with the sender window (N=4) sliding over seq #'s 0 1 2 3 4 5 6 7 8]
sender events, in order:
  send pkt0
  send pkt1
  send pkt2 (lost, X)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout: send pkt2
  record ack4 arrived
  record ack5 arrived
receiver events, in order:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt), drawn over the repeating sequence 0 1 2 3 0 1 2, for two scenarios]
(a) no problem
  sender: send pkt0, pkt1, pkt2
  receiver: receives all three; ACKs reach the sender
  sender: send pkt3 (lost, X), then pkt0 (new data)
  receiver: will accept packet with seq number 0
(b) oops!
  sender: send pkt0, pkt1, pkt2
  receiver: receives all three; all ACKs lost (X X X)
  sender: timeout, retransmit pkt0
  receiver: will accept packet with seq number 0
• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
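One way to explore the question is to check when the receiver's advanced window can contain a sequence number that the sender might still retransmit. The sketch below assumes the worst case from scenario (b), where the receiver has accepted a full window but none of the ACKs arrived; `windows_overlap` is a name made up for this illustration. The usual textbook answer, which the slide leaves as an exercise, is that the window size must be at most half the sequence-number space.

```python
def windows_overlap(seq_space, window):
    """True if an old retransmission could be mistaken for new data."""
    # Sequence numbers the sender may still retransmit (nothing ACKed yet).
    old = {n % seq_space for n in range(window)}
    # Sequence numbers the receiver now accepts as new (window fully advanced).
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


print(windows_overlap(seq_space=4, window=3))   # the slide's example
print(windows_overlap(seq_space=4, window=2))
```

With four sequence numbers, a window of 3 is ambiguous while a window of 2 is safe, consistent with the half-the-space rule.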
TCP: Overview  RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
segment layout (32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream “number” of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), next to the sender sequence number space, divided into: sent ACKed; sent, not-yet ACKed (“in-flight”); usable but not yet sent; not usable. Window size: N.]
Byte stream in TCP
[Figure: an HTTP Get Message (K bytes) as a byte stream. Labels: Window (N bytes); 100th byte; TCP header (seq. no = 100); M bytes; “Cannot be transmitted now”.]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types ‘C’.
   A -> B: Seq=42, ACK=79, data = ‘C’
2. host ACKs receipt of ‘C’, echoes back ‘C’.
   B -> A: Seq=79, ACK=43, data = ‘C’
3. host ACKs receipt of echoed ‘C’.
   A -> B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT
[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) against time in seconds (axis 1 to 106).]
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
  (typically, $\beta = 0.25$)
$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (EstimatedRTT: estimated RTT; 4·DevRTT: “safety margin”)
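The three formulas combine into a small estimator. The sketch below makes two choices that the slides do not specify and that should be read as assumptions: the first sample initialises EstimatedRTT directly with DevRTT set to half of it (the initialisation RFC 6298 uses), and DevRTT is updated before EstimatedRTT so that the deviation is measured against the previous estimate. `RttEstimator` is a name invented for this example.

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, giving a timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initial value

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Deviation first, measured against the estimate before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)       # milliseconds
timeout_ms = est.update(200.0)               # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, timeout_ms)
```

A single slow sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the intended effect of the 4·DevRTT term.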
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
  initially:
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
  [wait for event]
    data received from application above
      / create segment, seq. #: NextSeqNum
        pass segment to IP (i.e., “send”)
        NextSeqNum = NextSeqNum + length(data)
        if (timer currently not running)
          start timer
    timeout
      / retransmit not-yet-acked segment with smallest seq. #
        start timer
    ACK received, with ACK field value y
      / if (y > SendBase) {
          SendBase = y
          /* SendBase–1: last cumulatively ACKed byte */
          if (there are currently not-yet-acked segments)
            start timer
          else stop timer
        }
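The three events of the simplified sender can be written out as methods on a small class. This is a sketch, not a TCP implementation: `SimplifiedTcpSender` and its attributes are hypothetical names, handing a segment to IP is replaced by appending to a list, and the timer is a boolean flag whose expiry is triggered by calling `timeout()` by hand.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow or congestion control."""

    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = []            # (seq, data) for not-yet-acked segments, oldest first
        self.timer_running = False
        self.wire = []               # segments passed to IP, logged as (seq, length)

    def data_from_app(self, data):
        seq = self.next_seq_num
        self.unacked.append((seq, data))
        self.wire.append((seq, len(data)))       # pass segment to IP
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True            # start timer

    def timeout(self):
        if not self.unacked:
            return
        seq, data = self.unacked[0]              # smallest not-yet-acked seq #
        self.wire.append((seq, len(data)))       # retransmit only that segment
        self.timer_running = True                # restart timer

    def ack_received(self, y):
        if y > self.send_base:
            self.send_base = y                   # bytes up to y-1 are ACKed
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


tcp = SimplifiedTcpSender(initial_seq=92)
tcp.data_from_app(b"x" * 8)      # Seq=92, 8 bytes of data
tcp.data_from_app(b"y" * 20)     # Seq=100, 20 bytes of data
tcp.timeout()                    # premature timeout: resend Seq=92 only
print(tcp.wire)
```

Feeding in ACK=100 and then ACK=120 moves `send_base` to 100 and then 120 and finally stops the timer; a repeated ACK=120 changes nothing.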
TCP: retransmission scenarios
lost ACK scenario (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100 (lost, X)
  A: timeout
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100
premature timeout (Host A, Host B)
  A: SendBase=92
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100
  B -> A: ACK=120
  A: timeout (before ACK=100 arrives)
  A -> B: Seq=92, 8 bytes of data
  A: SendBase=100
  A: SendBase=120
  B -> A: ACK=120
  A: SendBase=120
TCP: retransmission scenarios
cumulative ACK (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100 (lost, X)
  B -> A: ACK=120 (arrives before the timeout)
  A -> B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq. #; gap detected
  -> immediately send duplicate ACK, indicating seq. # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
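The duplicate-ACK rule amounts to a counter that resets whenever new data is acknowledged. The sketch below is an illustration only: `DupAckCounter` is a made-up name, and the class tracks nothing but `send_base` and the count of duplicates, leaving timers and the actual retransmission to the caller.

```python
class DupAckCounter:
    """Signal a fast retransmit on the third duplicate ACK."""

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return the seq # to resend right now, or None."""
        if y > self.send_base:
            self.send_base = y       # new data ACKed: slide forward, reset count
            self.dup_acks = 0
            return None
        if y == self.send_base:
            self.dup_acks += 1       # another ACK for data already ACKed
            if self.dup_acks == 3:
                return self.send_base   # resend unacked segment with smallest seq #
        return None


counter = DupAckCounter(send_base=92)
# One ACK=100 for the segment at Seq=92, then three duplicates caused by later segments.
decisions = [counter.on_ack(100) for _ in range(4)]
print(decisions)
```

Only the fourth ACK=100 (the third duplicate) triggers the resend of the segment starting at byte 100.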
TCP fast retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)]
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data (lost, X)
  B -> A: ACK=100
  B -> A: ACK=100
  B -> A: ACK=100
  B -> A: ACK=100 (3 DUP ACKs)
  A -> B: Seq=100, 20 bytes of data (resent before the timeout expires)
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments arrive from sender, pass through IP code and TCP code (OS) into the TCP socket receiver buffers, and are read by the application process (application).]
TCP flow control
• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data goes to application process.]
Connection Management
before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, each holding:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]
client: Socket clientSocket = new Socket(hostname, port number);
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg
     client -> server: SYNbit=1, Seq=x
     client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
     server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
     server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
     client -> server: ACKbit=1, ACKnum=y+1
     client state: ESTAB
4. server: received ACK(y) indicates client is live
     server state: ESTAB
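The numbering in the three segments can be checked with a few lines of Python. This is a sketch of the bookkeeping only, with no sockets involved: `three_way_handshake` is a hypothetical helper, the field names mirror the labels in the exchange above, and the modulo reflects that TCP sequence numbers are 32-bit values that wrap around.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32-bit


def three_way_handshake(x, y):
    """Return (who, resulting state, segment sent) for each handshake step.

    x is the client's initial sequence number, y the server's.
    """
    syn = {"SYNbit": 1, "Seq": x}
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": (x + 1) % SEQ_SPACE}
    ack = {"ACKbit": 1, "ACKnum": (y + 1) % SEQ_SPACE}
    return [
        ("client", "SYNSENT", syn),
        ("server", "SYN RCVD", synack),
        ("client", "ESTAB", ack),
        ("server", "ESTAB", None),     # server only receives the final ACK
    ]


steps = three_way_handshake(x=1000, y=5000)
for who, state, segment in steps:
    print(who, state, segment)
```

Each side acknowledges the other's initial sequence number plus one, so the SYN itself consumes one sequence number even though it carries no data.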
TCP 3-way handshake: FSM
server side
  [closed]
    Socket connectionSocket = welcomeSocket.accept();
      / Λ
      -> listen
  [listen]
    SYN(x)
      / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
      -> SYN rcvd
  [SYN rcvd]
    ACK(ACKnum=y+1)
      / Λ
      -> ESTAB
client side
  [closed]
    Socket clientSocket = new Socket(hostname, port number);
      / SYN(seq=x)
      -> SYN sent
  [SYN sent]
    SYNACK(seq=y, ACKnum=x+1)
      / ACK(ACKnum=y+1)
      -> ESTAB
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close()
     client -> server: FINbit=1, seq=x
     client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server -> client: ACKbit=1; ACKnum=x+1
     server state: CLOSE_WAIT (can still send data)
     client state: FIN_WAIT_2 (wait for server close)
3. server -> client: FINbit=1, seq=y
     server state: LAST_ACK (can no longer send data)
4. client -> server: ACKbit=1; ACKnum=y+1
     client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
     server state: CLOSED
The “Two Army Problem”
Principles of congestion control
congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A sends original data at rate λ_in into unlimited shared output link buffers; Host B receives throughput λ_out]
[Plots: λ_out against λ_in, and delay against λ_in, with axis marks at R/2]
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A, Host B, finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy and sends when there is free buffer space, but not when there is no buffer space; finite shared output link buffers; Host B. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
[Plot: λ_out against λ_in, with axis marks at R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A, Host B, router with free buffer space. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: known loss
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
[Figure: plots of λ_out versus λ'_in, both axes marked at R/2, for the known-loss and premature-timeout cases]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted
[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers; λ_in: original data, λ'_in: original data plus retransmitted data]
[Figure: throughput by blue traffic (λ_out, up to R/2) versus offered load by Host A (λ'_in), annotated "bandwidth wastage for packets dropped at the 2nd router"]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size until loss occurs, then cut window in half]
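The sawtooth can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions that are not in the slides: cwnd is counted in MSS units, one update happens per RTT, and a loss is assumed to occur whenever cwnd reaches a fixed, hypothetical limit `loss_at_cwnd`.

```python
def aimd_trace(rounds, loss_at_cwnd, start=1.0, mss=1.0):
    """Return the cwnd value (in MSS units) at the start of each RTT.

    Assumption for illustration: a loss is detected in any RTT in which
    cwnd has reached loss_at_cwnd (a stand-in for the available bandwidth).
    """
    cwnd = start
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd >= loss_at_cwnd:
            cwnd = cwnd / 2      # multiplicative decrease: cut cwnd in half
        else:
            cwnd = cwnd + mss    # additive increase: +1 MSS per RTT
    return trace


if __name__ == "__main__":
    print(aimd_trace(rounds=12, loss_at_cwnd=8, start=4))
```

With these inputs the window climbs from 4 to 8 MSS, is halved, and repeats, which is the sawtooth shape described above.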
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $rate \approx \frac{cwnd}{RTT}$ bytes/sec
[Figure: sender sequence number space showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; cwnd spans the in-flight bytes]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
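A short sketch of why "increment cwnd for every ACK" gives doubling per RTT. It assumes, for illustration only, that every segment sent in an RTT is ACKed individually within that RTT (no loss, no delayed ACKs) and counts cwnd in MSS units.

```python
def slow_start_windows(rtts):
    """Return cwnd (in MSS) at the start of each RTT during slow start."""
    cwnd = 1                     # initially cwnd = 1 MSS
    windows = []
    for _ in range(rtts):
        windows.append(cwnd)
        acks_this_rtt = cwnd     # assumption: one ACK per segment sent
        for _ in range(acks_this_rtt):
            cwnd += 1            # +1 MSS for every ACK received
    return windows


if __name__ == "__main__":
    print(slow_start_windows(5))
```

Each RTT delivers cwnd ACKs, and each ACK adds 1 MSS, so the window at the start of the next RTT is twice as large.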
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM; notation "event / actions -> next state"; Λ means no event or no action]
initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 -> slow start
state "slow start":
  new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  cwnd > ssthresh / Λ -> congestion avoidance
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
state "congestion avoidance":
  new ACK / cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
state "fast recovery":
  duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK / cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment -> slow start
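The three-state machine can be sketched as a small Python class. This is an illustration, not TCP as deployed: the class name `RenoSender` and its method names are hypothetical, cwnd and ssthresh are kept in MSS units, "ssthresh + 3" is read as 3 MSS, and retransmission itself is left out so that only the window bookkeeping is shown.

```python
class RenoSender:
    """Window bookkeeping for the slow start / congestion avoidance /
    fast recovery FSM. Hypothetical helper for illustration only."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss              # cwnd = 1 MSS
        self.ssthresh = ssthresh
        self.dup_ack_count = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh                   # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                       # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd  # about +1 MSS per RTT
        self.dup_ack_count = 0

    def on_duplicate_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss                       # each dup ACK inflates cwnd
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss    # assumption: "+3" means 3 MSS
            self.state = "fast_recovery"                # (missing segment is retransmitted here)

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_ack_count = 0
        self.state = "slow_start"                       # (missing segment is retransmitted here)
```

Driving the object with a sequence of `on_new_ack`, `on_duplicate_ack` and `on_timeout` calls and recording `cwnd` after each gives a trace that can be compared against the transitions listed above.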
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: window size sawtooth between W/2 and W, with average 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}}$
  - to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
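The two numbers on this slide can be checked with a few lines of arithmetic. The function names below are hypothetical; the only inputs are the slide's example values (1500-byte segments, 100 ms RTT, 10 Gbps) and the throughput formula with constant 1.22.

```python
import math


def window_in_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to fill rate_bps: rate * RTT / segment size."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), with MSS converted to bits."""
    return 1.22 * 8 * mss_bytes / (rtt_s * math.sqrt(loss))


def loss_for_throughput(rate_bps, rtt_s, mss_bytes):
    """The same formula solved for the loss probability L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2


if __name__ == "__main__":
    print(window_in_segments(10e9, 0.1, 1500))    # about 83,333 segments
    print(loss_for_throughput(10e9, 0.1, 1500))   # about 2e-10
```

The required loss rate comes out slightly above 2 * 10^-10, in line with the rounded figure quoted above.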
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1 + Y1 = R, and (X2, Y2) where X2 = Y2]
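The convergence argument can be tried numerically. The sketch below is an idealisation, not a model of real TCP: both flows see the same RTT, both add one unit per step, and both detect loss in the same step whenever their combined rate exceeds the capacity.

```python
def two_flow_aimd(x1, x2, capacity, steps):
    """Evolve two AIMD flows that share one bottleneck of the given capacity.

    Assumption for illustration: losses are synchronised, so both flows
    halve in the same step.
    """
    for _ in range(steps):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2    # multiplicative decrease halves the gap too
        else:
            x1, x2 = x1 + 1, x2 + 1    # additive increase keeps the gap unchanged
    return x1, x2


if __name__ == "__main__":
    print(two_flow_aimd(1.0, 50.0, capacity=100.0, steps=500))
```

Additive increase moves the operating point along a line of slope 1 and leaves the difference between the flows unchanged, while every halving cuts that difference in half, so the point drifts towards the equal bandwidth share line.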
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: an IP datagram travels from source to destination, marked ECN=00 before and ECN=11 after a router on the path; the destination returns a TCP ACK segment with ECE=1]
UDP: User Datagram Protocol [RFC 768]
• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  - lost
  - delivered out-of-order to app
• connectionless:
  - no handshaking between UDP sender, receiver
  - each UDP segment handled independently of others
• UDP use:
  - streaming multimedia apps (loss tolerant, rate sensitive)
  - DNS
  - SNMP
• reliable transfer over UDP:
  - add reliability at application layer
  - application-specific error recovery
UDP: segment header
[Figure: UDP segment format, 32 bits wide: source port, dest port; length, checksum; application data (payload)]
• length: in bytes of UDP segment, including header
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
UDP checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later ...
Internet checksum example
example: add two 16-bit integers

                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
  wraparound  1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
  sum           1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
  checksum      0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
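The worked example can be reproduced in a few lines. This is a minimal sketch over a list of 16-bit words; the function names are hypothetical, and a real UDP implementation would also have to build those words from the segment bytes and the pseudo-header, which is not shown here.

```python
def ones_complement_sum(words):
    """Add 16-bit words, folding any carry out of bit 15 back into the sum."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # wraparound of the carry
    return total


def internet_checksum(words):
    """Checksum = one's complement (bit inversion) of the one's complement sum."""
    return ~ones_complement_sum(words) & 0xFFFF


if __name__ == "__main__":
    a = 0b1110011001100110
    b = 0b1101010101010101
    print(format(ones_complement_sum([a, b]), "016b"))   # the "sum" row
    print(format(internet_checksum([a, b]), "016b"))     # the "checksum" row
```

A receiver can run the same sum over the data words plus the received checksum: if nothing was corrupted, the result is all ones (0xFFFF).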
Principles of reliable data transfer
• important in application, transport, link layers
  - top-10 list of important networking topics
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
[Figure: rdt interfaces on the send side and the receive side]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions
• use finite state machines (FSMs) to specify sender, receiver
[Figure: FSM notation. An arrow from state 1 to state 2 is labelled with the event causing the state transition above the line and the actions taken on the state transition below it]
• state: when in this "state", next state is uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender (single state "Wait for call from above"):
  rdt_send(data) / packet = make_pkt(data); udt_send(packet)
receiver (single state "Wait for call from below"):
  rdt_rcv(packet) / extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  (How do humans recover from "errors" during conversation?)
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender
rdt2.0: FSM specification
[notation "event / actions -> next state"; Λ means no action]
sender
  state "Wait for call from above":
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call from above"
receiver
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 sender and receiver FSMs; the transitions taken when nothing is corrupted are]
1. sender: rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
3. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSMs; the transitions taken when the data packet is corrupted are]
1. sender: rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
3. sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
5. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.0 has a fatal flaw
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
[notation "event / actions -> next state"; Λ means no action]
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  state "Wait for 1 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender and receiver FSMs, one highlighted transition per slide]
1. sender, "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
6. sender is now in "Wait for call 1 from above", receiver in "Wait for 1 from below"
rdt2.1: Example 2
[Figure sequence: the rdt2.1 sender and receiver FSMs, one highlighted transition per slide]
1. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
5. sender is now in "Wait for call 1 from above", receiver in "Wait for 1 from below"
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ
receiver FSM fragment
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
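The combination of timeout, retransmission and 1-bit sequence numbers can be exercised with a toy simulation. It rests on assumptions that are not part of the slides: the channel only loses packets (no corruption, no reordering), a timeout is modelled as an immediate retry, and losses are injected by listing which transmissions to drop. The function name `stop_and_wait` is hypothetical.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Deliver messages over a lossy channel with an alternating-bit protocol.

    drop_data / drop_ack hold the 0-based indices of the data and ACK
    transmissions that the channel loses. Returns (delivered, data_sends).
    """
    delivered = []
    expected = 0                  # receiver: seq # it is waiting for
    seq = 0                       # sender: seq # of the current packet
    data_sends = ack_sends = 0
    for msg in messages:
        while True:
            data_lost = data_sends in drop_data
            data_sends += 1
            if data_lost:
                continue          # no ACK arrives: timeout, retransmit
            if seq == expected:   # new packet: deliver it to the upper layer
                delivered.append(msg)
                expected ^= 1
            # a duplicate is not delivered again, but is still ACKed
            ack_lost = ack_sends in drop_ack
            ack_sends += 1
            if ack_lost:
                continue          # ACK lost: timeout, retransmit
            break                 # ACK for seq received
        seq ^= 1                  # move on to the other sequence number
    return delivered, data_sends


if __name__ == "__main__":
    print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={0}))
```

In the example run the first ACK and one retransmission are lost, yet each message is delivered exactly once, at the cost of five data transmissions for three messages.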
rdt3.0 sender
[notation "event / actions -> next state"; Λ means no action]
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK0"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer -> "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK1"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer -> "Wait for call 0 from above"
rdt3.0 in action
[Figure: four sender/receiver timelines]
(a) no loss
  sender: send pkt0                -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1      -> receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0      -> receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0                -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1      -> X (pkt1 lost)
  sender: timeout, resend pkt1     -> receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0      -> receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0                -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1      -> receiver: rcv pkt1, send ack1 -> X (ack1 lost)
  sender: timeout, resend pkt1     -> receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0      -> receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0                -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1      -> receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1     -> receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0      -> receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$
• U_sender: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
3-packet pipelining increases utilization by a factor of 3
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
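The utilization figures for stop-and-wait and for 3-packet pipelining can be recomputed directly. The function name and parameter names below are hypothetical; the numbers are the example's 8000-bit packet, 1 Gbps link and 30 ms RTT.

```python
def sender_utilization(pkt_bits, link_bps, rtt_s, pkts_in_flight=1):
    """Fraction of time the sender is busy: n * (L/R) / (RTT + L/R)."""
    d_trans = pkt_bits / link_bps            # L/R, time to push one packet out
    return pkts_in_flight * d_trans / (rtt_s + d_trans)


if __name__ == "__main__":
    stop_and_wait = sender_utilization(8000, 1e9, 0.030)
    pipelined = sender_utilization(8000, 1e9, 0.030, pkts_in_flight=3)
    print(f"stop-and-wait: {stop_and_wait:.5f}")
    print(f"3 in flight:   {pipelined:.5f}")
    # effective throughput of stop-and-wait in bytes per second
    print(8000 / 8 / (0.030 + 8000 / 1e9))
```

The last line gives roughly 33 kB/sec, matching the "1KB pkt every 30 msec" estimate for stop-and-wait.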
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
[single state "Wait"; notation "event / actions"; Λ means no action]
initial (Λ):
    base = 1
    nextseqnum = 1
rdt_send(data) /
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
timeout /
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
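The sender FSM translates almost line for line into a small class. This is a sketch with several stand-ins: packets are plain `(seqnum, data)` tuples with no checksum, `send` is any callable playing the role of udt_send, the timer is reduced to a boolean flag, and sequence numbers are unbounded rather than k-bit. The class name `GBNSender` is hypothetical.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping (illustrative sketch only)."""

    def __init__(self, window, send):
        self.N = window
        self.send = send              # stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq # -> packet kept for retransmission
        self.timer_running = False

    def rdt_send(self, data):
        """Return True if the data was sent, False if refused (window full)."""
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.send(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True     # start_timer
            self.nextseqnum += 1
            return True
        return False                          # refuse_data

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is ACKed."""
        self.base = acknum + 1
        # stop_timer if nothing is in flight, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True             # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])       # resend every unacked packet
```

For example, `GBNSender(4, print)` prints each packet as it is "sent", which makes the go-back behaviour on `on_timeout()` easy to see.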
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering
  - re-ACK pkt with highest in-order seq #
[single state "Wait"; notation "event / actions"; Λ means no event]
initial (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
default /
    udt_send(sndpkt)
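The receiver side is even smaller. In this sketch (the class name `GBNReceiver` is hypothetical) packets are `(seqnum, data)` tuples, corruption is passed in as a flag instead of being computed from a checksum, and the method returns the ACK number that would be sent back.

```python
class GBNReceiver:
    """Go-Back-N receiver: accepts only the expected packet (sketch)."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0             # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, pkt, corrupt=False):
        """Process one arriving packet and return the ACK number to send."""
        seq, data = pkt
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)       # deliver_data
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets are discarded and
        # the ACK for the highest in-order packet is sent again
        return self.last_ack
```

Feeding it packets 1, 3, 2 in that order returns ACKs 1, 1, 2: packet 3 is discarded and triggers a duplicate ACK, as described in the bullets above.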
GBN in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8]
sender:
  send pkt0
  send pkt1
  send pkt2 (X, lost)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver windows over the sequence number space]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
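The receiver rules map onto a short class. The sketch makes two simplifications that the slides do not: sequence numbers are unbounded integers (no wraparound), and corruption is not modelled. The class name `SRReceiver` is hypothetical; the method returns the ACK number to send, or None when the packet is ignored.

```python
class SRReceiver:
    """Selective repeat receiver with window size N (illustrative sketch)."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}          # out-of-order packets waiting for delivery
        self.delivered = []

    def on_pkt(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n              # send ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n              # already delivered: ACK(n) again
        return None               # otherwise: ignore
```

Re-ACKing packets in [rcvbase-N, rcvbase-1] matters because the sender may still be waiting on an ACK that was lost; without the repeat ACK its window could never advance.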
Selective repeat in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8]
sender:
  send pkt0
  send pkt1
  send pkt2 (X, lost)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout: send pkt2
  record ack4 arrived
  record ack5 arrived
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt) over seq #s 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2 and receives their ACKs; it then sends pkt3, which is lost (X), and a new pkt0. Receiver will accept packet with seq number 0
(b) oops: sender sends pkt0, pkt1, pkt2; all three ACKs are lost (XXX); sender times out and retransmits pkt0. Receiver will accept packet with seq number 0
receiver can't see sender side: receiver behavior identical in both cases. Something's (very) wrong
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
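One way to explore the question is to check, for a given sequence number space and window size, whether an old retransmitted packet can land inside the receiver's new window. The helper below is a hypothetical sketch of that worst case: the sender's window is still at 0..N-1 (all ACKs lost) while the receiver has already advanced to N..2N-1.

```python
def sr_ambiguous(seq_space, window):
    """True if a retransmitted old packet could be mistaken for a new one."""
    old = {s % seq_space for s in range(0, window)}             # sender's unACKed seq #s
    new = {s % seq_space for s in range(window, 2 * window)}    # receiver's next window
    return bool(old & new)


if __name__ == "__main__":
    print(sr_ambiguous(seq_space=4, window=3))   # the slide's example
    print(sr_ambiguous(seq_space=4, window=2))
```

For the slide's example (4 sequence numbers, window 3) the two sets overlap, while a window of 2 does not. More generally the overlap disappears exactly when the window is at most half the sequence number space, which is the usual answer given for selective repeat.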
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment, 32 bits wide, row by row]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
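Byte stream in TCP
[Figure: an HTTP Get Message (K bytes) placed in the byte stream starting at the 100th byte, with a window of N bytes. A segment with TCP header (seq no = 100) carries M bytes of the message; the rest of the HTTP Get Message cannot be transmitted now]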
TCP seq. numbers, ACKs
simple telnet scenario
  Host A: user types 'C'                               -> Seq=42, ACK=79, data = 'C'
  Host B: host ACKs receipt of 'C', echoes back 'C'    <- Seq=79, ACK=43, data = 'C'
  Host A: host ACKs receipt of echoed 'C'              -> Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) plotted against time in seconds (axis 1 to 106).]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
  TimeoutInterval = EstimatedRTT + 4 * DevRTT
  (first term: estimated RTT; second term: "safety margin")
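The two moving averages and the timeout formula can be tried out numerically. The sketch below is one possible reading, not a definitive implementation: the class name `RttEstimator` is hypothetical, and two details the slides leave open are assumptions borrowed from RFC 6298 (the first sample initializes EstimatedRTT and sets DevRTT to half of it, and DevRTT is updated using the EstimatedRTT value from before the new sample is folded in).

```python
class RttEstimator:
    """EWMA-based RTT estimator (hypothetical helper; times in milliseconds)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        # Assumption (RFC 6298 style): initial deviation is half the first sample.
        self.dev_rtt = first_sample / 2

    def update(self, sample_rtt):
        """Fold one SampleRTT (from a non-retransmitted segment) into the estimates."""
        # Assumption: DevRTT uses the EstimatedRTT from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        """EstimatedRTT plus the 4 * DevRTT safety margin."""
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A larger α would make EstimatedRTT track recent samples more closely at the cost of a noisier estimate.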
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
[FSM with a single state: wait for event]
initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
      start timer
event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer
event: ACK received, with ACK field value y
  if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
          start timer
      else stop timer
  }
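The three events of the simplified sender translate almost line by line into code. The following Python sketch is an illustration under stated assumptions: the class `SimplifiedTcpSender`, the `wire` list standing in for "pass segment to IP", and the boolean `timer_running` flag are all hypothetical, and ACKs are assumed to fall on segment boundaries.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow/congestion control."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq     # NextSeqNum
        self.send_base = initial_seq    # SendBase
        self.unacked = {}               # seq # -> data of not-yet-acked segments
        self.timer_running = False
        self.wire = []                  # stand-in for segments passed to IP

    def on_app_data(self, data):
        """Event: data received from application above."""
        self.unacked[self.next_seq] = data
        self.wire.append((self.next_seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        """Event: timeout. Retransmit the not-yet-acked segment with smallest seq #."""
        if not self.unacked:
            return
        seq = min(self.unacked)
        self.wire.append((seq, self.unacked[seq]))
        self.timer_running = True       # restart timer

    def on_ack(self, y):
        """Event: ACK received with ACK field value y (cumulative)."""
        if y > self.send_base:
            self.send_base = y
            for seq in [s for s in self.unacked if s < y]:
                del self.unacked[seq]
            # Keep the timer only while segments remain unacknowledged.
            self.timer_running = bool(self.unacked)


sender = SimplifiedTcpSender(initial_seq=92)
sender.on_app_data(b"a" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"b" * 20)    # Seq=100, 20 bytes
sender.on_timeout()              # premature timeout: resend Seq=92
sender.on_ack(100)
sender.on_ack(120)
print(sender.send_base, sender.timer_running, [seq for seq, _ in sender.wire])
```

Because ACKs are cumulative, a single ACK=120 would have acknowledged both segments even if ACK=100 had been lost.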
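TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
1. A sends Seq=92, 8 bytes of data
2. B sends ACK=100, which is lost (X)
3. A's timer expires (timeout); A resends Seq=92, 8 bytes of data
4. B sends ACK=100
premature timeout (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
2. B sends ACK=100, then ACK=120
3. A's timer expires before the ACKs arrive (timeout); A resends Seq=92, 8 bytes of data
4. ACK=100 arrives (SendBase=100), then ACK=120 arrives (SendBase=120)
5. B answers the retransmitted segment with ACK=120 (SendBase=120)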
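TCP: retransmission scenarios
cumulative ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. B sends ACK=100, which is lost (X), then ACK=120
3. ACK=120 arrives before the timeout, so nothing is retransmitted; A continues with Seq=120, 15 bytes of data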
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, and Seq=100, 20 bytes of data; the Seq=100 segment is lost (X). Host B returns ACK=100 four times (the first ACK plus 3 DUP ACKs), and Host A resends Seq=100, 20 bytes of data, before its timeout expires.]
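The duplicate-ACK rule amounts to a small counter at the sender. A minimal Python sketch, assuming a hypothetical function `fast_retransmit_points` that scans a list of received ACK values and reports where a fast retransmit would fire (the first ACK for a value is not a duplicate; the next three are):

```python
def fast_retransmit_points(acks, dup_threshold=3):
    """Return the ACK values that trigger a fast retransmit.

    A retransmit fires when the duplicate count for one ACK value reaches
    dup_threshold, and only once per run of duplicates.
    """
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dup_threshold:
                retransmits.append(ack)   # resend segment starting at byte `ack`
        else:
            last_ack = ack
            dup_count = 0
    return retransmits


# ACK=100 arrives four times: one original ACK plus three duplicates.
print(fast_retransmit_points([100, 100, 100, 100]))
```

With only two duplicates the function returns an empty list, which is the case the retransmission timer has to cover.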
TCP flow control
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (in the OS) into the TCP socket receiver buffers, from which the application process reads.]
• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which consists of buffered data plus free buffer space (rwnd); buffered data is passed to the application process.]
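The bookkeeping behind rwnd is simple arithmetic: free space is the buffer size minus the data still waiting for the application. The sketch below is illustrative only; `ReceiveBuffer` and `sender_may_send` are hypothetical names, and byte counts stand in for real payloads.

```python
class ReceiveBuffer:
    """Receiver-side buffer that tracks only byte counts (hypothetical helper)."""

    def __init__(self, rcv_buffer=4096):
        self.rcv_buffer = rcv_buffer   # RcvBuffer size in bytes
        self.buffered = 0              # bytes received but not yet read by the app

    @property
    def rwnd(self):
        """Free buffer space, advertised to the sender in the TCP header."""
        return self.rcv_buffer - self.buffered

    def on_segment(self, nbytes):
        """Store an arriving payload; returns how many bytes fit."""
        accepted = min(nbytes, self.rwnd)
        self.buffered += accepted
        return accepted

    def app_read(self, nbytes):
        """Application removes data from the socket buffer."""
        taken = min(nbytes, self.buffered)
        self.buffered -= taken
        return taken


def sender_may_send(last_byte_sent, last_byte_acked, rwnd):
    """Bytes the sender may still put in flight without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


buf = ReceiveBuffer(rcv_buffer=4096)
buf.on_segment(3000)
print(buf.rwnd)                                   # free space after 3000 bytes arrive
buf.app_read(1000)
print(buf.rwnd, sender_may_send(5000, 4000, buf.rwnd))
```

A slow application keeps `buffered` high, which shrinks rwnd and in turn throttles the sender.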
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each an application above the network. Each side holds: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client; rcvBuffer size at server, client.]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x; client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1; server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK: ACKbit=1, ACKnum=y+1 (this segment may contain client-to-server data); client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
TCP 3-way handshake: FSM
[FSM with states: closed, listen, SYN sent, SYN rcvd, ESTAB]
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); no action (Λ)
• listen → SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1) and create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1); no action (Λ)
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)
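The handshake can be traced as three dictionaries passed between two state variables. This Python sketch is a simplification for illustration: the function `three_way_handshake` and its dictionary layout are hypothetical, segments are never lost, and the only fact added beyond the slides is that TCP sequence numbers are 32-bit values, so the +1 wraps modulo 2^32.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit values


def three_way_handshake(x, y):
    """Trace SYN, SYNACK, ACK for initial sequence numbers x (client), y (server)."""
    states = {"client": "CLOSED", "server": "LISTEN"}
    segments = []

    # Step 1: client chooses x and sends SYN.
    syn = {"SYN": 1, "seq": x}
    states["client"] = "SYN_SENT"
    segments.append(syn)

    # Step 2: server chooses y and sends SYNACK, acking the SYN.
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    states["server"] = "SYN_RCVD"
    segments.append(synack)

    # Step 3: client checks that its SYN was acked, then acks the SYNACK.
    if synack["acknum"] == (x + 1) % SEQ_SPACE:
        ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}
        states["client"] = "ESTAB"
        segments.append(ack)
        # Server checks that its SYNACK was acked.
        if ack["acknum"] == (y + 1) % SEQ_SPACE:
            states["server"] = "ESTAB"
    return states, segments


states, segments = three_way_handshake(x=100, y=300)
print(states)
for seg in segments:
    print(seg)
```

Each side reaches ESTAB only after seeing its own initial sequence number acknowledged, which is what tells it the other side is live.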
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
3. client: enter FIN_WAIT_2 (wait for server close)
4. server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
5. client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
6. server: on receiving the ACK, enter CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[Figure: Host A and Host B each send original data at rate λin into a router with unlimited shared output link buffers; throughput is λout. Plot 1: λout versus λin, rising to R/2. Plot 2: delay versus λin, growing sharply as λin approaches R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin
[Figure: Host A and Host B send into a router with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: throughput.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet and sends into the router (finite shared output link buffers) only when there is free buffer space. λin: original data; λ'in: original data, plus retransmitted data. Plot: λout versus λin, both up to R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A sends a packet when the router has no buffer space, so the packet is dropped; Host A resends its copy once there is free buffer space. λin: original data; λ'in: original data, plus retransmitted data.]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: λout versus λin, both up to R/2. When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)]
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: Host A times out and sends a second copy of a packet while the router has free buffer space; both copies reach Host B. Plot: λout versus λin, both up to R/2. When sending at R/2, some packets are retransmissions, including duplicates that are delivered.]
Causes/costs of congestion: scenario 2
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: throughput.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Causes/costs of congestion: scenario 3
[Plot: throughput by blue traffic (λout) versus offered load by Host A (λ'in), both axes up to R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior: probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size ... until loss occurs (then cut window in half).]
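The saw tooth follows directly from the two rules. A small Python sketch, assuming a hypothetical function `aimd_trace` that works in whole MSS units, advances one RTT per step, and is told in advance which RTTs end in a loss:

```python
def aimd_trace(rtts, loss_rtts, cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT under AIMD.

    rtts: number of round trips to simulate.
    loss_rtts: set of RTT indices during which a loss is detected.
    """
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_rtts:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: cut in half
        else:
            cwnd += 1                  # additive increase: 1 MSS per RTT
    return trace


# Window grows 4, 5, 6, 7, 8; a loss in RTT 4 halves it to 4; growth resumes.
print(aimd_trace(rtts=8, loss_rtts={4}, cwnd=4))
```

Plotting the returned list against its index gives the saw tooth shape described above.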
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space. From the last byte ACKed to the last byte sent lie the sent, not-yet ACKed ("in-flight") bytes, spanning at most cwnd.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, two segments one RTT later, then four segments; time runs downward.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with states: slow start, congestion avoidance, fast recovery]
initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: no action (Λ); enter congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; enter slow start
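The three-state machine can be written as one small class with a handler per event. The sketch below is a reading of the FSM under stated assumptions rather than a reference implementation: the class `RenoSender` is hypothetical, cwnd and ssthresh are measured in MSS units (so "+ MSS" becomes "+ 1" and the 64 KB initial threshold becomes a constructor parameter), and actual segment transmission is left out.

```python
class RenoSender:
    """Window bookkeeping for the slow start / CA / fast recovery FSM (in MSS units)."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow_start"
        self.cwnd = 1.0
        self.ssthresh = ssthresh   # assumption: threshold given in segments
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1.0                     # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1.0 / self.cwnd         # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0                     # each dup ACK: a segment left the network
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"         # missing segment is retransmitted here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"                # missing segment is retransmitted here


s = RenoSender(ssthresh=4.0)
for _ in range(4):
    s.on_new_ack()
print(s.state, s.cwnd)      # left slow start once cwnd exceeded ssthresh
for _ in range(3):
    s.on_dup_ack()
print(s.state, s.cwnd, s.ssthresh)
```

Setting cwnd to 1 on every loss, including triple duplicate ACKs, would turn this into the Tahoe behavior mentioned earlier.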
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) * W / RTT bytes/sec
[Figure: saw tooth window size oscillating between W/2 and W, with average 3/4 W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 * MSS / (RTT * sqrt(L))
  - to achieve 10 Gbps throughput, need a loss rate of L = 2 * 10^-10, a very small loss rate!
• new versions of TCP for high-speed
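Both numbers in this example can be rechecked with a few lines of arithmetic. In the Python sketch below the function names are hypothetical; the window is rate * RTT expressed in segments, and the loss rate comes from solving the throughput formula for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with RTT rtt_s."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS taken in bits)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = window_segments(rate_bps=10e9, rtt_s=0.100, mss_bytes=1500)
loss = required_loss_rate(rate_bps=10e9, rtt_s=0.100, mss_bytes=1500)
print(round(w), loss)   # about 83,333 segments and a loss rate near 2e-10
```

Halving the RTT halves the required window and allows four times the loss rate at the same throughput, because L enters the formula under a square root.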
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, each axis from 0 to R. Two reference lines: the full bandwidth utilization line, (X1, Y1) where X1 + Y1 = R, and the equal bandwidth share line, (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal bandwidth share line.]
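The geometric argument can be checked numerically: additive increase leaves the gap between the two rates unchanged, while each halving cuts the gap in half. A small Python sketch, assuming a hypothetical function `aimd_two_flows` in which both sessions see the same RTT and both detect loss whenever their combined rate exceeds the link capacity:

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve the rates of two AIMD sessions sharing one bottleneck link."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both decrease by a factor of 2
        else:
            x, y = x + 1, y + 1      # congestion avoidance: additive increase
    return x, y


x, y = aimd_two_flows(x=1.0, y=40.0, capacity=50.0, rounds=200)
print(x, y, abs(x - y))   # the initial gap of 39 has shrunk to almost nothing
```

The sketch ignores differing RTTs, which in practice let a session with a shorter RTT claim a larger share.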
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the destination returns a TCP ACK segment with ECE=1.]
UDP: segment header
[Figure: UDP segment format, 32 bits wide. Rows: source port #, dest port #; length, checksum; application data (payload). length: in bytes of UDP segment, including header.]
why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
UDP checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later ...
Internet checksum: example
example: add two 16-bit integers
               1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
               1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound   1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum            1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum       0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1
Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
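The wraparound addition is easy to reproduce in code. A minimal Python sketch with hypothetical function names, operating on a list of 16-bit words rather than on a real UDP segment; the end-around carry is folded back in after every addition:

```python
def ones_complement_sum16(words):
    """Add 16-bit words, wrapping any carry out of bit 15 back into bit 0."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return total


def internet_checksum(words):
    """Checksum: one's complement (bit inversion) of the wrapped sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


a = 0b1110011001100110
b = 0b1101010101010101
print(format(ones_complement_sum16([a, b]), "016b"))   # sum after wraparound
print(format(internet_checksum([a, b]), "016b"))       # checksum

# Receiver check: summing the words together with the checksum gives all ones.
print(ones_complement_sum16([a, b, internet_checksum([a, b])]) == 0xFFFF)
```

A receiver that obtains anything other than all ones knows that at least one bit was flipped, although some multi-bit errors can still cancel out and go undetected.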
Principles of reliable data transfer
• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
[Figure: send side and receive side of rdt, connected by an unreliable channel.]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver
[Figure: FSM notation. An arrow from state 1 to state 2 is labeled with the event causing the state transition (above the line) and the actions taken on the state transition (below the line). state: when in this "state", next state uniquely determined by next event.]
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender (state: Wait for call from above):
  event: rdt_send(data)
  actions: packet = make_pkt(data); udt_send(packet)
receiver (state: Wait for call from below):
  event: rdt_rcv(packet)
  actions: extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender
How do humans recover from "errors" during conversation?
rdt2.0: FSM specification
sender:
• Wait for call from above
  - rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK
• Wait for ACK or NAK
  - rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && isACK(rcvpkt): no action (Λ); go to Wait for call from above
receiver:
• Wait for call from below
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 FSM with the error-free path highlighted.]
1. sender: rdt_send(data); sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt); extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
3. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt); no action (Λ); back to Wait for call from above
rdt2.0: error scenario
[Figure: the rdt2.0 FSM with the error path highlighted.]
1. sender: rdt_send(data); sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); udt_send(NAK)
3. sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt); extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
5. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt); no action (Λ); back to Wait for call from above
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
• Wait for call 0 from above
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK 0
• Wait for ACK or NAK 0
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ); go to Wait for call 1 from above
• Wait for call 1 from above
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK 1
• Wait for ACK or NAK 1
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ); go to Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
• Wait for 0 from below
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to Wait for 1 from below
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• Wait for 1 from below
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to Wait for 0 from below
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
[Figure sequence over six slides: the rdt2.1 sender and receiver FSMs, with one transition highlighted per slide.]
1. sender, Wait for call 0 from above: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); no action (Λ)
6. final slide: no transition highlighted
rdt2.1: Example 2
[Figure sequence over five slides: the rdt2.1 sender and receiver FSMs, with one transition highlighted per slide.]
1. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver, Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); no action (Λ)
5. final slide: no transition highlighted
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
• Wait for call 0 from above
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
• Wait for ACK 0
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): no action (Λ)
receiver FSM fragment:
• transition into Wait for 0 from below
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• Wait for 0 from below
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handles this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
• Wait for call 0 from above
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
  - rdt_rcv(rcvpkt): no action (Λ)
• Wait for ACK0
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): no action (Λ)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above
• Wait for call 1 from above
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
  - rdt_rcv(rcvpkt): no action (Λ)
• Wait for ACK1
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): no action (Λ)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above
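The interplay of alternating sequence numbers, timeouts, and duplicate detection can be exercised with a deterministic simulation. The Python sketch below is a simplification and not the FSM itself: the function `stop_and_wait` is hypothetical, a loss is modeled by listing which data or ACK transmissions the channel drops, every loss leads to a timeout and retransmission, and corruption and premature timeouts are not modeled.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Alternating-bit transfer over a lossy channel.

    drop_data / drop_ack: indices (0-based, counted separately) of the data
    and ACK transmissions that the channel loses.
    Returns (messages delivered to the upper layer, number of data transmissions).
    """
    delivered = []
    expected = 0          # receiver: seq # it is waiting for
    data_tx = 0           # data packets handed to the channel so far
    ack_tx = 0            # ACKs handed to the channel so far
    for i, msg in enumerate(messages):
        seq = i % 2       # sender alternates seq # 0, 1, 0, 1, ...
        while True:
            lost = data_tx in drop_data
            data_tx += 1                  # udt_send(sndpkt); start_timer
            if lost:
                continue                  # timeout: retransmit
            if seq == expected:           # receiver: new packet
                delivered.append(msg)
                expected ^= 1
            # Receiver ACKs the seq # it received, whether new or duplicate.
            ack_lost = ack_tx in drop_ack
            ack_tx += 1
            if ack_lost:
                continue                  # timeout: retransmit
            break                         # ACK for seq arrived: stop_timer
    return delivered, data_tx


print(stop_and_wait(["a", "b", "c"]))                  # no loss
print(stop_and_wait(["a", "b", "c"], drop_data={1}))   # second packet lost once
print(stop_and_wait(["a", "b", "c"], drop_ack={1}))    # its ACK lost instead
```

In both loss cases the upper layer still receives each message exactly once, at the cost of one extra transmission.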
rdt3.0 in action
(a) no loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
3. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(b) packet loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; pkt1 is lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (ack1 is delayed)
3. sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
5. sender: rcv ack1 (the second copy), do nothing
6. sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  Dtrans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs
  - U sender: utilization, fraction of time sender busy sending
    U sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
  - if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R
U sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R
3-packet pipelining increases utilization by a factor of 3!
U sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 = 0.0008
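The utilization figures for stop-and-wait and for pipelining come from the same formula with a different number of packets per round trip. A short Python sketch, with a hypothetical function name and the link parameters used in the example (8000 bit packets, 1 Gbps link, 30 ms RTT):

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets are sent per RTT."""
    d_trans = packet_bits / link_bps              # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)             # stop-and-wait
u3 = sender_utilization(8000, 1e9, 0.030, window=3)   # 3-packet pipelining
print(round(u1, 5), round(u3, 4), round(u3 / u1, 2))
```

Utilization grows linearly with the window until the pipe is full, which for these parameters takes a window of several thousand packets.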
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
State: Wait

initial transition:
    base = 1
    nextseqnum = 1

event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
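The sender FSM can be sketched as a small class. This is an illustration under assumptions, not a full implementation: the class name, the `send` callback (standing in for udt_send) and the boolean `timer_running` flag (standing in for a real timer) are hypothetical, packets are plain tuples without a checksum, and stale ACKs below base are ignored as an extra safety check.

```python
class GBNSender:
    """Go-Back-N sender state, following the events of the FSM."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.base = 1              # oldest unACKed sequence number
        self.nextseqnum = 1        # next sequence number to use
        self.sndpkt = {}           # seq -> packet, kept until ACKed
        self.timer_running = False
        self.send = send           # stand-in for udt_send

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False           # refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True      # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is ACKed."""
        if acknum < self.base:
            return                 # duplicate ACK, nothing new
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        # stop the timer if nothing is in flight, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Retransmit every packet in [base, nextseqnum-1]."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])


if __name__ == "__main__":
    sent = []
    sender = GBNSender(4, sent.append)
    for i in range(1, 6):
        print(i, sender.rdt_send(f"data{i}"))   # the 5th call is refused
```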
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

State: Wait

initial transition:
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

event: default
    udt_send(sndpkt)
GBN in action
[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 to 8; pkt2 is lost]
sender:
  send pkt0, send pkt1, send pkt2 (X loss), send pkt3, (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space]
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
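The receiver rules above can be sketched as follows. The class and method names are hypothetical, and sequence numbers are assumed to be unbounded integers (no wraparound), which sidesteps the finite sequence space discussed later.

```python
class SRReceiver:
    """Selective repeat receiver: buffer out-of-order packets, deliver in order."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0       # next sequence number not yet received in order
        self.buffer = {}       # seq -> data for packets inside the window
        self.delivered = []    # data handed to the upper layer, in order

    def receive(self, n, data):
        """Handle pkt n; return the ACK number to send, or None to ignore."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # deliver pkt rcvbase and any buffered packets that follow it
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore


if __name__ == "__main__":
    rx = SRReceiver(4)
    for seq in [0, 1, 3, 4, 5, 2]:      # pkt2 arrives late
        print(seq, rx.receive(seq, f"pkt{seq}"), rx.delivered)
```

Re-ACKing packets below rcvbase matters: if an earlier ACK was lost, the sender's window cannot move until that ACK is repeated.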
Selective repeat in action
[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 to 8; pkt2 is lost]
sender:
  send pkt0, send pkt1, send pkt2 (X loss), send pkt3, (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout: send pkt2
  record ack4 arrived
  record ack5 arrived
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence numbers 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2, then pkt3 (lost, X) and a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2; the ACKs are lost (X X X); timeout, retransmit pkt0; receiver will accept packet with seq number 0
• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
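One way to explore the question is to test, for a given sequence space and window size, whether the worst case of scenario (b) can occur. The sketch below is illustrative; the function name and the modelling of the worst case (every ACK of a full window lost) are assumptions. It agrees with the standard result that the window must be at most half the sequence number space.

```python
def windows_overlap(seq_space, window):
    """True if a retransmitted old packet can carry the same seq # as a new one.

    Worst case from scenario (b): the receiver got a full window and advanced,
    while the sender saw no ACKs and may retransmit the whole old window.
    """
    old = {n % seq_space for n in range(0, window)}            # sender's unACKed pkts
    new = {n % seq_space for n in range(window, 2 * window)}   # receiver's new window
    return bool(old & new)


if __name__ == "__main__":
    print(windows_overlap(4, 3))   # the slide's example: ambiguous
    print(windows_overlap(4, 2))   # window is half the sequence space: safe
```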
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, top to bottom]
source port # | dest port #
sequence number (counting by bytes of data, not segments!)
acknowledgement number
head len | not used | U A P R S F | receive window (# bytes rcvr willing to accept)
checksum (Internet checksum, as in UDP) | Urg data pointer
options (variable length)
application data (variable length)
Flags:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer). Sender sequence number space with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) in the TCP byte stream. Labels: window of N bytes; 100th byte; TCP header (seq. no. = 100); M bytes; the part of the message that cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A sends: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
[Figure: SampleRTT and EstimatedRTT, gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350]
EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
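The moving average can be illustrated with a few samples. This is a sketch: the function name is hypothetical and the sample values (in milliseconds) are made up for illustration, not taken from the measured trace.

```python
def update_estimated_rtt(estimated_rtt, sample_rtt, alpha=0.125):
    """One EWMA step: EstimatedRTT = (1 - alpha)*EstimatedRTT + alpha*SampleRTT."""
    return (1 - alpha) * estimated_rtt + alpha * sample_rtt


if __name__ == "__main__":
    estimated = 100.0                       # hypothetical starting estimate (ms)
    for sample in [200.0, 100.0, 100.0, 100.0]:
        estimated = update_estimated_rtt(estimated, sample)
        print(f"sample={sample:6.1f}  estimated={estimated:7.3f}")
```

A single spike of 200 ms moves the estimate only to 112.5 ms, and its influence shrinks by a factor of (1 - α) with every later sample.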
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4*DevRTT
(EstimatedRTT: estimated RTT; 4*DevRTT: "safety margin")
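The two estimators and the timeout can be combined in one small class. This is a sketch under assumptions: the class name and the initial values are hypothetical, and DevRTT is updated before EstimatedRTT (using the old estimate), which is one possible ordering that the formulas above do not fix.

```python
class RttEstimator:
    """Tracks EstimatedRTT and DevRTT and derives TimeoutInterval."""

    def __init__(self, estimated_rtt, dev_rtt, alpha=0.125, beta=0.25):
        self.estimated_rtt = estimated_rtt
        self.dev_rtt = dev_rtt
        self.alpha = alpha
        self.beta = beta

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Assumption: deviation is measured against the estimate before this sample.
        deviation = abs(sample_rtt - self.estimated_rtt)
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * deviation
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        """TimeoutInterval = EstimatedRTT + 4*DevRTT."""
        return self.estimated_rtt + 4 * self.dev_rtt


if __name__ == "__main__":
    est = RttEstimator(estimated_rtt=100.0, dev_rtt=10.0)   # hypothetical start (ms)
    print(est.update(120.0))
```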
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
State: wait for event

initial transition:
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
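The three events of the simplified sender can be sketched as a class. The names are hypothetical: `send` stands in for passing a segment to IP, segments are (seq, data) tuples, and `timer_running` is a flag rather than a real timer.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.send_base = initial_seq    # oldest unACKed byte
        self.next_seq = initial_seq     # seq # of the next new byte
        self.unacked = []               # (seq, data) segments, oldest first
        self.timer_running = False
        self.send = send                # stand-in for passing a segment to IP

    def on_app_data(self, data):
        segment = (self.next_seq, data)
        self.unacked.append(segment)
        self.send(segment)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True   # start timer

    def on_timeout(self):
        if self.unacked:
            self.send(self.unacked[0])  # not-yet-acked segment with smallest seq #
            self.timer_running = True   # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y
            # keep only segments with bytes at or beyond y
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


if __name__ == "__main__":
    sent = []
    tcp = SimpleTcpSender(92, sent.append)
    tcp.on_app_data(b"x" * 8)      # Seq=92, 8 bytes
    tcp.on_app_data(b"y" * 20)     # Seq=100, 20 bytes
    tcp.on_ack(120)                # cumulative ACK covers both segments
    print(tcp.send_base, tcp.timer_running)
```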
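TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
1. A sends: Seq=92, 8 bytes of data
2. B sends: ACK=100 (lost, X)
3. timeout at A; A resends: Seq=92, 8 bytes of data
4. B sends: ACK=100
premature timeout (Host A, Host B):
1. A sends: Seq=92, 8 bytes of data; then Seq=100, 20 bytes of data (SendBase=92)
2. B sends: ACK=100, then ACK=120
3. timeout at A before the ACKs arrive; A resends: Seq=92, 8 bytes of data
4. ACKs arrive at A: SendBase=100, then SendBase=120
5. B sends: ACK=120 for the retransmitted segment (SendBase=120)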
TCP: retransmission scenarios
cumulative ACK (Host A, Host B):
1. A sends: Seq=92, 8 bytes of data; then Seq=100, 20 bytes of data
2. B sends: ACK=100 (lost, X), then ACK=120
3. ACK=120 arrives at A before the timeout; A sends: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expect seq. #. Gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq. # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediate send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
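Duplicate-ACK detection needs only the last ACK value and a counter. In this sketch the class name is hypothetical, and "triple duplicate" is taken to mean three ACKs that repeat an earlier ACK value (so four identical ACKs in total), which is an interpretation of the rule above.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when fast retransmit should fire."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack):
        """Return True exactly when this ACK is the third duplicate."""
        if ack == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3
        # a new ACK value resets the counter
        self.last_ack = ack
        self.dup_count = 0
        return False


if __name__ == "__main__":
    detector = DupAckDetector()
    for ack in [100, 100, 100, 100, 120]:
        print(ack, detector.on_ack(ack))
```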
TCP fast retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (lost, X). Host B sends ACK=100 four times (the last three are the 3 DUP ACKs). Host A resends Seq=100, 20 bytes of data, before the timeout expires]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, which the application process reads]
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads enter the buffer, data leaves to the application process]
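The bookkeeping behind rwnd can be sketched in two functions. The function names and the byte counters (last byte received/read at the receiver, last byte sent/ACKed at the sender) are assumptions chosen for illustration; the slides only state that rwnd is the free buffer space and that in-flight data is limited to it.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space: buffer size minus data not yet read by the app."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """How many more bytes the sender may put in flight without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


if __name__ == "__main__":
    rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
    print(rwnd)
    print(sendable_bytes(rwnd, last_byte_sent=5000, last_byte_acked=4000))
```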
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, each holding:]
connection state: ESTAB
connection variables:
  seq # client-to-server, server-to-client
  rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state ESTAB
4. server: received ACK(y) indicates client is live; server state ESTAB
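TCP 3-way handshake: FSM
closed -> listen (server):
    event: Socket connectionSocket = welcomeSocket.accept();
    action: Λ
closed -> SYN sent (client):
    event: Socket clientSocket = new Socket("hostname", "port number");
    action: SYN(seq=x)
listen -> SYN rcvd:
    event: SYN(x)
    action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYN sent -> ESTAB:
    event: SYNACK(seq=y, ACKnum=x+1)
    action: ACK(ACKnum=y+1)
SYN rcvd -> ESTAB:
    event: ACK(ACKnum=y+1)
    action: Λ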
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: client/server timeline; client state ESTAB, server state ESTAB]
1. client: clientSocket.close(); sends FINbit=1, seq=x; state FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1; state CLOSE_WAIT (can still send data)
3. client: state FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; state LAST_ACK (can no longer send data)
5. client: sends ACKbit=1, ACKnum=y+1; state TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
6. server: state CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A and Host B send original data at rate λ_in into unlimited shared output link buffers; throughput: λ_out]
[Plots: λ_out versus λ_in, both axes up to R/2; delay versus λ_in]
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A and Host B share finite output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A and Host B with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Labels: copy; free buffer space; no buffer space]
[Plot: λ_out versus λ_in, both axes up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A and Host B. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Label: free buffer space]
[Plot: λ_out versus λ'_in, both axes up to R/2]
when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: Host A and Host B. λ_in, λ'_in, λ_out. Labels: copy; free buffer space; timeout]
[Plot: λ_out versus λ'_in, both axes up to R/2]
when sending at R/2, some packets are retransmissions including duplicated that are delivered!
Causes/costs of congestion: scenario 2
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Host A, Host B, Host C, Host D with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Causes/costs of congestion: scenario 3
[Plot: throughput by blue traffic (λ_out) versus offered load by Host A (λ'_in), both axes up to R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Plot: cwnd (TCP sender congestion window size) versus time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
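The saw tooth can be reproduced with a few lines. This is a sketch only: the function name is hypothetical, cwnd is measured in MSS, one loop iteration stands for one RTT, and the rounds in which loss occurs are supplied by the caller rather than modelled.

```python
def aimd(cwnd_mss, rounds, loss_rounds):
    """Return cwnd (in MSS) at the start of each RTT round."""
    trace = []
    for rtt_round in range(rounds):
        trace.append(cwnd_mss)
        if rtt_round in loss_rounds:
            cwnd_mss = max(1.0, cwnd_mss / 2)   # multiplicative decrease
        else:
            cwnd_mss += 1.0                     # additive increase: 1 MSS per RTT
    return trace


if __name__ == "__main__":
    # hypothetical run: start at 10 MSS, loss detected in rounds 5 and 12
    print(aimd(10.0, 16, {5, 12}))
```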
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked <= cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space. cwnd spans from last byte ACKed, over bytes sent but not-yet ACKed ("in-flight"), to last byte sent]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initial transition (into slow start):
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0

State: slow start
    new ACK:
        cwnd = cwnd + MSS
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    cwnd > ssthresh:
        Λ; go to congestion avoidance
    dupACKcount == 3:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment; go to fast recovery
    timeout:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment

State: congestion avoidance
    new ACK:
        cwnd = cwnd + MSS * (MSS/cwnd)
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    dupACKcount == 3:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment; go to fast recovery
    timeout:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment; go to slow start

State: fast recovery
    duplicate ACK:
        cwnd = cwnd + MSS
        transmit new segment(s), as allowed
    new ACK:
        cwnd = ssthresh
        dupACKcount = 0
        go to congestion avoidance
    timeout:
        ssthresh = cwnd/2
        cwnd = 1
        dupACKcount = 0
        retransmit missing segment; go to slow start
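The three-state FSM can be sketched as a class that only tracks cwnd, ssthresh and the state. Assumptions, all for illustration: the class and method names are hypothetical, cwnd and ssthresh are counted in MSS (so the initial 64 KB threshold becomes a parameter), and retransmission and sending of segments are left out.

```python
class RenoSender:
    """cwnd/ssthresh bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow_start"
        self.cwnd = 1.0              # in MSS
        self.ssthresh = ssthresh     # in MSS (assumed unit)
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1.0                      # cwnd = cwnd + MSS
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1.0 / self.cwnd          # cwnd = cwnd + MSS*(MSS/cwnd)
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"          # missing segment is retransmitted here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"                 # missing segment is retransmitted here


if __name__ == "__main__":
    tcp = RenoSender(ssthresh=4.0)
    for _ in range(5):
        tcp.on_new_ack()
        print(tcp.state, round(tcp.cwnd, 2))
```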
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) * W / RTT bytes/sec
[Plot: window size oscillating between W/2 and W, average 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22 * MSS / (RTT * sqrt(L))
  - to achieve 10 Gbps throughput, need a loss rate of L = 2*10^-10, a very small loss rate!
• new versions of TCP for high-speed
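Both numbers in this example follow from the formulas. The sketch below recomputes them; the function names are hypothetical, and the loss rate is obtained by solving the throughput formula for L.

```python
def required_window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the target throughput."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22*MSS/(RTT*sqrt(L)) for L (MSS converted to bits)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


if __name__ == "__main__":
    # values from the example: 1500-byte segments, 100 ms RTT, 10 Gbps
    print(round(required_window_segments(10e9, 0.100, 1500)))
    print(f"{required_loss_rate(10e9, 0.100, 1500):.2e}")
```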
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Plot: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
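The parallel-connection example is simple arithmetic if every TCP connection is assumed to get an equal share of the link. The function name below is hypothetical.

```python
from fractions import Fraction


def app_share(existing_connections, new_app_connections):
    """Fraction of link rate R the new app gets, assuming equal per-connection shares."""
    total = existing_connections + new_app_connections
    return Fraction(new_app_connections, total)


if __name__ == "__main__":
    print(app_share(9, 1))     # 1 connection among 10
    print(app_share(9, 11))    # 11 connections among 20, slightly more than half
```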
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). IP datagram labelled ECN=00, then ECN=11; TCP ACK segment labelled ECE=1]
UDP checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: 1's complement of the 1's complement sum of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later ...
Internet checksum: example
example: add two 16-bit integers

              1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
              1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound  1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum           1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum      0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
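The example can be reproduced in code. This is a sketch with hypothetical function names; it works on a list of 16-bit integers rather than on a real UDP segment with its pseudo-header.

```python
def ones_complement_sum(words):
    """Add 16-bit words, wrapping any carry out of bit 15 back into the sum."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # wraparound
    return total


def internet_checksum(words):
    """Checksum = 1's complement of the 1's complement sum."""
    return ~ones_complement_sum(words) & 0xFFFF


def verify(words, checksum):
    """Receiver check: sum of all words plus checksum must be all ones."""
    return ones_complement_sum(list(words) + [checksum]) == 0xFFFF


if __name__ == "__main__":
    words = [0b1110011001100110, 0b1101010101010101]   # the two integers above
    checksum = internet_checksum(words)
    print(f"{ones_complement_sum(words):016b}")
    print(f"{checksum:016b}")
    print(verify(words, checksum))
```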
Principles of reliable data transfer
• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver
[Figure: FSM notation. An arrow from state 1 to state 2 is labelled with the event causing state transition and the actions taken on state transition (event / actions)]
state: when in this "state", next state uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender (State: Wait for call from above)
    event: rdt_send(data)
        packet = make_pkt(data)
        udt_send(packet)
receiver (State: Wait for call from below)
    event: rdt_rcv(packet)
        extract(packet, data)
        deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors? (How do humans recover from "errors" during conversation?)
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender
rdt2.0: FSM specification
sender
    State: Wait for call from above
        event: rdt_send(data)
            sndpkt = make_pkt(data, checksum)
            udt_send(sndpkt)
            go to Wait for ACK or NAK
    State: Wait for ACK or NAK
        event: rdt_rcv(rcvpkt) && isNAK(rcvpkt)
            udt_send(sndpkt)
        event: rdt_rcv(rcvpkt) && isACK(rcvpkt)
            Λ (no action)
            go to Wait for call from above
receiver
    State: Wait for call from below
        event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
            udt_send(NAK)
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
            extract(rcvpkt, data)
            deliver_data(data)
            udt_send(ACK)
rdt2.0: operation with no errors
[Figure: repeats the rdt2.0 FSM. Sender sends sndpkt; receiver finds the packet not corrupt, delivers the data and sends ACK; sender returns to Wait for call from above]
rdt2.0: error scenario
[Figure: repeats the rdt2.0 FSM. Receiver finds the packet corrupt and sends NAK; sender retransmits sndpkt]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
    State: Wait for call 0 from above
        event: rdt_send(data)
            sndpkt = make_pkt(0, data, checksum)
            udt_send(sndpkt)
            go to Wait for ACK or NAK 0
    State: Wait for ACK or NAK 0
        event: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) )
            udt_send(sndpkt)
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
            Λ (no action)
            go to Wait for call 1 from above
    State: Wait for call 1 from above
        event: rdt_send(data)
            sndpkt = make_pkt(1, data, checksum)
            udt_send(sndpkt)
            go to Wait for ACK or NAK 1
    State: Wait for ACK or NAK 1
        event: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) )
            udt_send(sndpkt)
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
            Λ (no action)
            go to Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs
    State: Wait for 0 from below
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
            extract(rcvpkt, data)
            deliver_data(data)
            sndpkt = make_pkt(ACK, chksum)
            udt_send(sndpkt)
            go to Wait for 1 from below
        event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
            sndpkt = make_pkt(NAK, chksum)
            udt_send(sndpkt)
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
            sndpkt = make_pkt(ACK, chksum)
            udt_send(sndpkt)
    State: Wait for 1 from below
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
            extract(rcvpkt, data)
            deliver_data(data)
            sndpkt = make_pkt(ACK, chksum)
            udt_send(sndpkt)
            go to Wait for 0 from below
        event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
            sndpkt = make_pkt(NAK, chksum)
            udt_send(sndpkt)
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
            sndpkt = make_pkt(ACK, chksum)
            udt_send(sndpkt)
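The receiver side of rdt2.1 reduces to one bit of state. The sketch below is illustrative: the class name is hypothetical, corruption is passed in as a flag instead of being detected with a checksum, and the reply is returned as a string rather than sent as a packet.

```python
class Rdt21Receiver:
    """rdt2.1 receiver: alternating-bit sequence numbers, ACK/NAK replies."""

    def __init__(self):
        self.expected = 0          # "Wait for 0 from below" or "Wait for 1 from below"
        self.delivered = []        # data passed to the upper layer

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the reply to send: 'ACK' or 'NAK'."""
        if corrupt:
            return "NAK"
        if seq == self.expected:
            self.delivered.append(data)    # extract + deliver_data
            self.expected ^= 1             # switch to the other wait state
        # a duplicate (unexpected seq #) is not delivered, but is still ACKed
        return "ACK"


if __name__ == "__main__":
    rx = Rdt21Receiver()
    print(rx.rdt_rcv(0, "a"))                  # new packet
    print(rx.rdt_rcv(0, "a"))                  # retransmission after a garbled ACK
    print(rx.rdt_rcv(1, "b", corrupt=True))    # corrupted packet
    print(rx.rdt_rcv(1, "b"))
    print(rx.delivered)
```

ACKing the duplicate is what lets the sender move on when the original ACK was garbled.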
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender and receiver FSMs, one transition per step]
1. sender (Wait for call 0 from above): rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ); udt_send(sndpkt)
4. receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
rdt2.1: Example 2
[Figure sequence: same sender and receiver FSMs, one transition highlighted per slide]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ) / udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
5. (final slide: no transition highlighted)
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK was received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
    -> Wait for ACK 0
state: Wait for ACK 0
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    Λ
receiver FSM fragment
state: Wait for 0 from below
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || has_seq1(rcvpkt) )
    udt_send(sndpkt)
transition into Wait for 0 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
    -> Wait for ACK0
  rdt_rcv(rcvpkt)
    Λ
state: Wait for ACK0
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    stop_timer
    -> Wait for call 1 from above
state: Wait for call 1 from above
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
    -> Wait for ACK1
  rdt_rcv(rcvpkt)
    Λ
state: Wait for ACK1
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) )
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
    stop_timer
    -> Wait for call 0 from above
rdt3.0 in action
[Figures: four sender/receiver timing diagrams, summarized as event traces]
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (X: pkt1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (X: ack1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
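The rdt3.0 scenarios can be replayed with a small simulation. This is a minimal sketch, not the protocol as specified in the slides: the function name `run_stop_and_wait` and the `lost_data` / `lost_acks` parameters are hypothetical, corruption is not modelled, and a timeout is modelled simply as "no ACK came back, so send the same packet again".

```python
from typing import AbstractSet, List, Tuple


def run_stop_and_wait(
    messages: List[str],
    lost_data: AbstractSet[int] = frozenset(),
    lost_acks: AbstractSet[int] = frozenset(),
) -> Tuple[List[str], int]:
    """Alternating-bit stop-and-wait transfer over a lossy channel.

    lost_data / lost_acks hold the 0-based index of each data / ACK
    transmission that the channel drops (an assumption of this sketch).
    Returns (messages delivered to the upper layer, data packets sent).
    """
    delivered: List[str] = []
    expected_seq = 0  # receiver state: "Wait for 0/1 from below"
    data_count = 0    # number of data packets put on the channel
    ack_count = 0     # number of ACKs put on the channel
    seq = 0           # sender state: sequence number of the current packet

    for msg in messages:
        acked = False
        while not acked:
            # sender: udt_send(sndpkt); start_timer
            data_lost = data_count in lost_data
            data_count += 1
            if data_lost:
                continue  # timeout fires, loop retransmits the packet

            # receiver: deliver only if this is the expected sequence number
            if seq == expected_seq:
                delivered.append(msg)
                expected_seq ^= 1
            # receiver ACKs the sequence number it just saw (new or duplicate)
            ack_lost = ack_count in lost_acks
            ack_count += 1
            acked = not ack_lost  # a lost ACK also ends in a timeout
        seq ^= 1  # sender: stop_timer, move to the other sequence number

    return delivered, data_count
```

Dropping the second data packet reproduces scenario (b), and dropping the second ACK reproduces scenario (c): in both cases every message is delivered exactly once at the cost of one extra transmission.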
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

\[ D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^{9} \text{ bits/sec}} = 8 \text{ microsecs} \]

  - U_sender: utilization, the fraction of time the sender is busy sending

\[ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 \]

  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timing diagram]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

\[ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 \]
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timing diagram with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R
3-packet pipelining increases utilization by a factor of 3!

\[ U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008 \]
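The utilization arithmetic can be checked with a few lines of Python. This is a sketch under the slide's assumptions (no processing or queueing delay, negligible ACK transmission time); the function name `sender_utilization` and its parameters are hypothetical.

```python
def sender_utilization(packet_bits: float, rate_bps: float, rtt_s: float,
                       window: int = 1) -> float:
    """Fraction of time the sender is busy sending.

    window = 1 is stop-and-wait; window = N keeps N packets in flight.
    The result is capped at 1.0 because a sender cannot be more than
    fully busy once the pipe is full.
    """
    d_trans = packet_bits / rate_bps  # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# The slide's example: 8000-bit packets, 1 Gbps link, RTT = 30 ms.
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined_3 = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"stop-and-wait: {stop_and_wait:.5f}, 3-packet pipeline: {pipelined_3:.5f}")
```

With these numbers roughly 3750 packets would have to be in flight before the sender becomes fully utilized, which motivates the larger windows of Go-Back-N and Selective Repeat.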
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initial transition
  Λ
    base = 1
    nextseqnum = 1
state: Wait
  rdt_send(data)
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
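The GBN sender FSM translates fairly directly into code. The class below is a minimal sketch rather than a definitive implementation: `GBNSender`, the `send` callback (standing in for udt_send) and the boolean `timer_running` (standing in for a real countdown timer) are hypothetical, and the guard against old cumulative ACKs is an addition that the FSM itself does not show.

```python
from typing import Any, Callable, Dict, Tuple

Packet = Tuple[int, Any]  # (sequence number, data); checksum omitted here


class GBNSender:
    def __init__(self, window_size: int, send: Callable[[Packet], None]):
        self.N = window_size
        self.base = 1          # oldest unacked sequence number
        self.nextseqnum = 1    # next sequence number to use
        self.sndpkt: Dict[int, Packet] = {}
        self.timer_running = False
        self.send = send       # hypothetical stand-in for udt_send

    def rdt_send(self, data: Any) -> bool:
        """Send data if the window has room; return False for refuse_data."""
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.send(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True   # start_timer
            self.nextseqnum += 1
            return True
        return False

    def on_ack(self, acknum: int) -> None:
        """Cumulative ACK: everything up to and including acknum is acked."""
        if acknum < self.base or acknum >= self.nextseqnum:
            return  # old or out-of-range ACK (guard added in this sketch)
        for seq in range(self.base, acknum + 1):
            del self.sndpkt[seq]
        self.base = acknum + 1
        # stop_timer if nothing is in flight, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self) -> None:
        """Go back N: resend every packet from base to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])
```

A usage example: with `sent = []` and `s = GBNSender(4, sent.append)`, a fifth call to `s.rdt_send` is refused until `s.on_ack(...)` slides the window forward.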
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
initial transition
  Λ
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
state: Wait
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default
    udt_send(sndpkt)
GBN in action
[Figure: sender/receiver timing diagram; sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]
  sender: send pkt0, send pkt1, send pkt2 (X: loss), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver windows]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
Selective repeat in action
[Figure: sender/receiver timing diagram; sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]
  sender: send pkt0, send pkt1, send pkt2 (X: loss), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2 and all three are ACKed; sender then sends pkt3 (X: lost) and a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2 but all three ACKs are lost (X X X); sender times out and retransmits the old pkt0; receiver will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
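The dilemma can be explored by brute force. The sketch below is an illustration under a worst-case assumption that is not spelled out on the slide: the receiver has accepted a full window and moved on, while the sender, having lost every ACK, may still retransmit the whole old window. The function name `sr_ambiguous` is hypothetical. The check suggests the usual answer to the question: the window may be at most half the sequence number space.

```python
def sr_ambiguous(seq_space: int, window: int) -> bool:
    """True if an old retransmission can be mistaken for new data.

    Worst case (assumption of this sketch): the receiver window has
    advanced by a full window while the sender window has not moved.
    """
    # sequence numbers the sender may still retransmit
    old = {n % seq_space for n in range(0, window)}
    # sequence numbers the receiver now accepts as new packets
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(seq_space=4, window=3))  # the slide's example
print(sr_ambiguous(seq_space=4, window=2))
```

For the slide's example (4 sequence numbers, window 3) the two sets overlap in 0 and 1, which is exactly the retransmitted pkt0 being accepted as new in (b).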
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide; fields listed top to bottom]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer); sender sequence number space with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) as a byte stream; a window of N bytes; a TCP header (seq. no. = 100) in front of M bytes starting at the 100th byte; the remainder is labelled "Cannot be transmitted now"]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types 'C'. A to B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. B to A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout

\[ \text{EstimatedRTT} = (1 - \alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT} \]

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds, axis from 100 to 350) versus time (seconds, axis from 1 to 106), gaia.cs.umass.edu to fantasia.eurecom.fr]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

\[ \text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}| \]

(typically, β = 0.25)

\[ \text{TimeoutInterval} = \underbrace{\text{EstimatedRTT}}_{\text{estimated RTT}} + \underbrace{4\cdot\text{DevRTT}}_{\text{"safety margin"}} \]
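The three formulas combine into a small estimator. This is a sketch, not TCP's actual timer code: the class name `RttEstimator` is hypothetical, and two details are assumptions because the slides do not fix them: DevRTT starts at 0, and DevRTT is updated using the EstimatedRTT from before the current sample (the order RFC 6298 uses).

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimation, following the slide formulas."""

    def __init__(self, first_sample: float,
                 alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0  # assumption: initial deviation is zero

    def update(self, sample_rtt: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # deviation first, against the previous estimate (assumed ordering)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval

    @property
    def timeout_interval(self) -> float:
        # EstimatedRTT plus a safety margin of 4 * DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)    # milliseconds
for sample in (120.0, 95.0, 180.0, 110.0):
    est.update(sample)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A single outlier such as the 180 ms sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the intended behaviour.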
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initial transition
  Λ
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
state: wait for event
  data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  ACK received, with ACK field value y
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
TCP: retransmission scenarios
[Figures: three Host A / Host B timing diagrams, summarized as event traces]
lost ACK scenario
  A: send Seq=92, 8 bytes of data
  B: send ACK=100 (X: lost)
  A: timeout; resend Seq=92, 8 bytes of data
  B: send ACK=100
  A: SendBase=100
premature timeout
  A: SendBase=92; send Seq=92, 8 bytes of data; send Seq=100, 20 bytes of data
  B: send ACK=100, send ACK=120
  A: timeout; resend Seq=92, 8 bytes of data
  A: ACK=100 arrives, SendBase=100; ACK=120 arrives, SendBase=120
  B: send ACK=120
  A: SendBase=120
cumulative ACK
  A: send Seq=92, 8 bytes of data; send Seq=100, 20 bytes of data
  B: send ACK=100 (X: lost), send ACK=120
  A: ACK=120 arrives before the timeout; send Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   -> immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   -> immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (X: lost); Host B sends ACK=100 four times, the last three being the 3 DUP ACKs; Host A resends Seq=100, 20 bytes of data before the timeout expires]
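The simplified sender plus fast retransmit can be sketched as an event-driven class. Everything named here (`SimpleTcpSender`, the `send` callback, the boolean `timer_running`) is hypothetical; flow control and congestion control are ignored, as in the simplified sender, and the code counts duplicates as ACKs that repeat the current SendBase, so the retransmission fires on the third duplicate, as in the figure.

```python
from typing import Callable, Dict


class SimpleTcpSender:
    def __init__(self, initial_seq: int, send: Callable[[int, bytes], None]):
        self.send_base = initial_seq   # oldest unacked byte
        self.next_seq = initial_seq    # next byte-stream number to use
        self.unacked: Dict[int, bytes] = {}  # seq # -> segment data
        self.dup_acks = 0
        self.timer_running = False
        self.send = send               # hypothetical "pass segment to IP"

    def on_app_data(self, data: bytes) -> None:
        self.unacked[self.next_seq] = data
        self.send(self.next_seq, data)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True  # start timer

    def _retransmit_oldest(self) -> None:
        if self.unacked:
            seq = min(self.unacked)    # not-yet-acked segment, smallest seq #
            self.send(seq, self.unacked[seq])

    def on_timeout(self) -> None:
        self._retransmit_oldest()
        self.timer_running = bool(self.unacked)  # restart timer

    def on_ack(self, y: int) -> None:
        if y > self.send_base:
            # new cumulative ACK: bytes up to y-1 are acknowledged
            self.send_base = y
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.dup_acks = 0
            self.timer_running = bool(self.unacked)
        elif y == self.send_base and self.unacked:
            # duplicate ACK for data that is still outstanding
            self.dup_acks += 1
            if self.dup_acks == 3:
                self._retransmit_oldest()  # fast retransmit, no timeout wait
```

Replaying the figure: after segments with Seq=92 and Seq=100 are sent, one ACK=100 advances SendBase and three further ACK=100s trigger the resend of Seq=100.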
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with an application above the network and the same connection state]
connection state: ESTAB
connection variables:
  - seq # client-to-server, server-to-client
  - rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED -> SYNSENT -> ESTAB
server state: LISTEN -> SYN RCVD -> ESTAB
1. client: choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x (client enters SYNSENT)
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1 (server enters SYN RCVD)
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data (client enters ESTAB)
4. server: received ACK(y) indicates client is live (server enters ESTAB)
TCP 3-way handshake: FSM
client side
  closed -> SYN sent
    Socket clientSocket = new Socket("hostname", "port number");
    SYN(seq=x)
  SYN sent -> ESTAB
    SYNACK(seq=y, ACKnum=x+1)
    ACK(ACKnum=y+1)
server side
  closed -> listen
    Socket connectionSocket = welcomeSocket.accept();
    Λ
  listen -> SYN rcvd
    SYN(x)
    SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB
    ACK(ACKnum=y+1)
    Λ
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: client state / server state timing diagram, summarized as an event trace]
  client (ESTAB): clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
  server (ESTAB): send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
  client: enter FIN_WAIT_2 (wait for server close)
  server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
  client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
  server: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data at rate λ_in into unlimited shared output link buffers; Host B receives throughput λ_out]
[Figure: λ_out versus λ_in, axes marked at R/2; delay versus λ_in, with λ_in marked at R/2]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A, Host B, finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: a copy is kept at the sender; the router has free buffer space or no buffer space; λ_out versus λ_in, axes marked at R/2]

idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: λ_out versus λ'_in, axes marked at R/2]
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: timeout at the sender while the router has free buffer space; λ_out versus λ'_in, axes marked at R/2]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
[Figure: Hosts A, B, C, D; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Figure: λ_out versus λ'_in, axes marked at R/2; throughput by blue traffic versus offered load by Host A; bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
AIMD saw tooth behavior: probing for bandwidth
[Figure: cwnd (TCP sender congestion window size) versus time; additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

\[ \text{rate} \approx \frac{\text{cwnd}}{\text{RTT}} \ \text{bytes/sec} \]

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight bytes]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A / Host B timing diagram over successive RTTs: one segment, two segments, four segments]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initial transition
  Λ
    cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
    -> slow start
state: slow start
  new ACK
    cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  cwnd > ssthresh
    Λ
    -> congestion avoidance
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment
    -> fast recovery
state: congestion avoidance
  new ACK
    cwnd = cwnd + MSS (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
    -> slow start
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment
    -> fast recovery
state: fast recovery
  duplicate ACK
    cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK
    cwnd = ssthresh; dupACKcount = 0
    -> congestion avoidance
  timeout
    ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment
    -> slow start
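The three-state machine can be written down as a small class that tracks only cwnd, ssthresh and dupACKcount. This is a sketch for experimenting with the window dynamics, not a TCP implementation: the name `RenoCwnd` is hypothetical, no segments are actually sent, and the "+ 3" in the fast-recovery entry is interpreted as 3 MSS, which is an assumption about the slide's shorthand.

```python
class RenoCwnd:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss: int = 1460, ssthresh: float = 64 * 1024):
        self.mss = mss
        self.cwnd = float(mss)          # cwnd = 1 MSS
        self.ssthresh = float(ssthresh)
        self.dup_ack_count = 0
        self.state = "slow start"

    def on_new_ack(self) -> None:
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh   # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss       # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            # about 1 MSS per RTT, spread over the ACKs of one window
            self.cwnd += self.mss * self.mss / self.cwnd
        self.dup_ack_count = 0

    def on_duplicate_ack(self) -> None:
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss  # "+ 3" read as 3 MSS
            self.state = "fast recovery"

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_ack_count = 0
        self.state = "slow start"
```

Feeding the object a stream of `on_new_ack()` calls with an occasional triple duplicate ACK reproduces the saw tooth of AIMD, while a timeout drops cwnd back to 1 MSS.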
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

\[ \text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT} \ \text{bytes/sec} \]

[Figure: saw tooth of window size between W/2 and W, with the average at 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

\[ \text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}} \]

  - to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
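Both numbers in the example follow from short calculations. The sketch below uses hypothetical function names; it computes the window as bandwidth times RTT divided by the segment size, and inverts the loss-based throughput formula to get the loss rate.

```python
def required_window_segments(throughput_bps: float, rtt_s: float,
                             mss_bytes: int) -> float:
    """In-flight segments needed to sustain the throughput: rate * RTT / MSS."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps: float, rtt_s: float,
                       mss_bytes: int) -> float:
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * throughput_bps)) ** 2


# The slide's example: 1500-byte segments, 100 ms RTT, 10 Gbps target.
w = required_window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.1e}")
```

A loss rate of about two in ten billion segments is far below what ordinary paths deliver, which is why the slide points to new TCP versions for high-speed links.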
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. The equal bandwidth share line consists of points (X2, Y2) where X2 = Y2; the full bandwidth utilization line consists of points (X1, Y1) where X1 + Y1 = R. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]
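The convergence argument can be checked numerically. The sketch below is a deliberately idealized model, not a TCP simulation: both flows see every loss at the same time, rates change in synchronized steps, and the function name `aimd_two_flows` is hypothetical. Additive increase leaves the gap between the two rates unchanged, while each multiplicative decrease halves it, so the rates drift toward the equal-share line.

```python
from typing import Tuple


def aimd_two_flows(x: float, y: float, capacity: float,
                   steps: int) -> Tuple[float, float]:
    """Evolve the throughputs of two AIMD flows sharing one bottleneck."""
    for _ in range(steps):
        if x + y > capacity:
            # loss: both flows cut their rate by a factor of 2
            x, y = x / 2, y / 2
        else:
            # congestion avoidance: both flows add one unit (slope of 1)
            x, y = x + 1, y + 1
    return x, y


x, y = aimd_two_flows(x=1.0, y=80.0, capacity=100.0, steps=500)
print(f"connection 1: {x:.2f}, connection 2: {y:.2f}, gap: {abs(x - y):.5f}")
```

Starting from a very unequal split (1 versus 80), the gap shrinks by half at every loss event and is negligible after a few hundred steps.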
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram is marked ECN=00 and then ECN=11; the TCP ACK segment returns with ECE=1]
Internet checksum example
example: add two 16-bit integers

                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound    1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum             1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum        0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
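The wraparound rule in the note can be expressed in a few lines. This is a minimal sketch of the 16-bit one's complement sum on the two example integers, not a full UDP/TCP checksum routine (no pseudo-header, no byte parsing); the function names are hypothetical.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, folding any carry out of bit 15 back into the sum."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)   # wraparound of the carry
    return total

def internet_checksum(words):
    """The checksum is the one's complement (bit inversion) of the sum."""
    return ~ones_complement_sum16(words) & 0xFFFF

a = 0b1110011001100110
b = 0b1101010101010101
s = ones_complement_sum16([a, b])
c = internet_checksum([a, b])
print(f"sum      {s:016b}")   # 1011101110111100
print(f"checksum {c:016b}")   # 0100010001000011
```

A receiver can verify by summing all words including the checksum: the result is sixteen 1 bits when no error is detected.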
Principles of reliable data transfer
• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
[Figure: send side and receive side of the rdt protocol, connected by an unreliable channel]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver
[Figure: FSM notation. Two states, "state 1" and "state 2", joined by a transition arrow. The event causing the state transition is written above the line and the actions taken on the state transition below it.]
state: when in this "state", next state is uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender (single state "Wait for call from above"):
  event:   rdt_send(data)
  actions: packet = make_pkt(data)
           udt_send(packet)
receiver (single state "Wait for call from below"):
  event:   rdt_rcv(packet)
  actions: extract(packet, data)
           deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender
How do humans recover from "errors" during conversation?
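rdt2.0: FSM specification
(Λ denotes "no action")
sender
  state "Wait for call from above":
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(data, checksum)
             udt_send(sndpkt)
    next:    "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    event:   rdt_rcv(rcvpkt) && isNAK(rcvpkt)
    actions: udt_send(sndpkt)
    next:    "Wait for ACK or NAK"
    event:   rdt_rcv(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next:    "Wait for call from above"
receiver
  state "Wait for call from below":
    event:   rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: udt_send(NAK)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    actions: extract(rcvpkt, data)
             deliver_data(data)
             udt_send(ACK)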
rdt2.0: operation with no errors
[Figure: the rdt2.0 sender and receiver FSMs, tracing the error-free path]
1. sender: rdt_send(data), then sndpkt = make_pkt(data, checksum), udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), then extract(rcvpkt, data), deliver_data(data), udt_send(ACK)
3. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt), then Λ; back to "Wait for call from above"
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSMs, tracing the path when the data packet is corrupted]
1. sender: rdt_send(data), then sndpkt = make_pkt(data, checksum), udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt), then udt_send(NAK)
3. sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt), then udt_send(sndpkt) (retransmission)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), then extract(rcvpkt, data), deliver_data(data), udt_send(ACK)
5. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt), then Λ
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
  state "Wait for call 0 from above":
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum)
             udt_send(sndpkt)
    next:    "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    event:   rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) )
    actions: udt_send(sndpkt)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next:    "Wait for call 1 from above"
  state "Wait for call 1 from above":
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(1, data, checksum)
             udt_send(sndpkt)
    next:    "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    event:   rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) )
    actions: udt_send(sndpkt)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next:    "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
  state "Wait for 0 from below":
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: extract(rcvpkt, data)
             deliver_data(data)
             sndpkt = make_pkt(ACK, chksum)
             udt_send(sndpkt)
    next:    "Wait for 1 from below"
    event:   rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum)
             udt_send(sndpkt)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum)
             udt_send(sndpkt)
  state "Wait for 1 from below":
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: extract(rcvpkt, data)
             deliver_data(data)
             sndpkt = make_pkt(ACK, chksum)
             udt_send(sndpkt)
    next:    "Wait for 0 from below"
    event:   rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum)
             udt_send(sndpkt)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum)
             udt_send(sndpkt)
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition highlighted per step]
1. sender, in "Wait for call 0 from above": rdt_send(data), then sndpkt = make_pkt(0, data, checksum), udt_send(sndpkt)
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt), then sndpkt = make_pkt(NAK, chksum), udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ), then udt_send(sndpkt)
4. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt), then extract(rcvpkt, data), deliver_data(data), sndpkt = make_pkt(ACK, chksum), udt_send(sndpkt)
5. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt), then Λ
6. sender is now in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"
rdt2.1: Example 2
[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition highlighted per step]
1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt), then extract(rcvpkt, data), deliver_data(data), sndpkt = make_pkt(ACK, chksum), udt_send(sndpkt)
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ), then udt_send(sndpkt). The receiver sent an ACK in step 1, so this transition fires when that ACK arrives corrupted.
3. receiver, in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt), then sndpkt = make_pkt(ACK, chksum), udt_send(sndpkt). The duplicate is re-ACKed but not delivered again.
4. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt), then Λ
5. sender is now in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  state "Wait for call 0 from above":
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum)
             udt_send(sndpkt)
    next:    "Wait for ACK 0"
  state "Wait for ACK 0":
    event:   rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )
    actions: udt_send(sndpkt)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    actions: Λ
receiver FSM fragment
  state "Wait for 0 from below":
    event:   rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
    actions: udt_send(sndpkt)
  transition into "Wait for 0 from below":
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: extract(rcvpkt, data)
             deliver_data(data)
             sndpkt = make_pkt(ACK1, chksum)
             udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
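The timeout-and-retransmit approach can be sketched as a small deterministic simulation. It is a simplification, not the full rdt3.0 FSM: it models loss only (no corruption, no premature timeouts), and the function name and the scripted-loss parameters are hypothetical.

```python
def rdt3_transfer(messages, lost_data=(), lost_acks=()):
    """Simulate stop-and-wait with a 1-bit sequence number over a lossy channel.

    lost_data / lost_acks: 0-based indices, counted over all transmissions in
    that direction, of the packets the (hypothetical) channel drops.
    Returns (messages delivered to the upper layer, number of data transmissions).
    """
    lost_data, lost_acks = set(lost_data), set(lost_acks)
    delivered = []
    expected = 0              # receiver: seq # it is waiting for
    seq = 0                   # sender: seq # of the current packet
    data_tx = ack_tx = 0
    for msg in messages:
        while True:
            this_tx, data_tx = data_tx, data_tx + 1
            if this_tx in lost_data:
                continue      # data lost: timer expires, sender retransmits
            if seq == expected:
                delivered.append(msg)   # new packet: deliver up
                expected ^= 1
            # receiver ACKs the seq # it saw, whether new or duplicate
            this_ack, ack_tx = ack_tx, ack_tx + 1
            if this_ack in lost_acks:
                continue      # ACK lost: timer expires, sender retransmits
            break             # ACK arrived: stop timer, move on
        seq ^= 1
    return delivered, data_tx

print(rdt3_transfer(["a", "b", "c"], lost_data={1}, lost_acks={1}))
```

In the run above the second message is transmitted three times (once lost, once with its ACK lost, once acknowledged), yet it is delivered to the upper layer exactly once because the receiver recognises the duplicate by its sequence number.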
rdt3.0 sender
  state "Wait for call 0 from above":
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum)
             udt_send(sndpkt)
             start_timer
    next:    "Wait for ACK0"
    event:   rdt_rcv(rcvpkt)
    actions: Λ
  state "Wait for ACK0":
    event:   rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )
    actions: Λ
    event:   timeout
    actions: udt_send(sndpkt)
             start_timer
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    actions: stop_timer
    next:    "Wait for call 1 from above"
  state "Wait for call 1 from above":
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(1, data, checksum)
             udt_send(sndpkt)
             start_timer
    next:    "Wait for ACK1"
    event:   rdt_rcv(rcvpkt)
    actions: Λ
  state "Wait for ACK1":
    event:   rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) )
    actions: Λ
    event:   timeout
    actions: udt_send(sndpkt)
             start_timer
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
    actions: stop_timer
    next:    "Wait for call 0 from above"
rdt3.0 in action
[Figures: four sender/receiver timelines, listed here as event sequences]
(a) no loss
  sender:   send pkt0
  receiver: rcv pkt0, send ack0
  sender:   rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender:   rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender:   send pkt0
  receiver: rcv pkt0, send ack0
  sender:   rcv ack0, send pkt1 (X: pkt1 lost)
  sender:   timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender:   rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender:   send pkt0
  receiver: rcv pkt0, send ack0
  sender:   rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (X: ack1 lost)
  sender:   timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender:   rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender:   send pkt0
  receiver: rcv pkt0, send ack0
  sender:   rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender:   timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender:   rcv ack1, send pkt0
  sender:   rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0
  sender:   rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \dfrac{L}{R} = \dfrac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• U_sender: utilization, the fraction of time sender busy sending
$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
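The utilization figure can be reproduced directly. The sketch below uses the slide's numbers; the `window` parameter is an added generalisation (an assumption, not on this slide) for a sender allowed several packets in flight per round trip.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: window * (L/R) / (RTT + L/R), capped at 1."""
    d_trans = packet_bits / link_bps
    return min(1.0, window * d_trans / (rtt_s + d_trans))

L_BITS = 8000        # packet size in bits
R_BPS = 1e9          # 1 Gbps link
RTT_S = 0.030        # 2 x 15 ms propagation delay

u1 = sender_utilization(L_BITS, R_BPS, RTT_S)
throughput_bytes_per_s = (L_BITS / 8) / (RTT_S + L_BITS / R_BPS)
print(f"U_sender = {u1:.5f}")                                  # 0.00027
print(f"throughput ~ {throughput_bytes_per_s / 1000:.0f} kB/sec")   # ~33 kB/sec
```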
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• sender: first packet bit transmitted, t = 0
• sender: last packet bit transmitted, t = L / R
• receiver: first packet bit arrives
• receiver: last packet bit arrives, send ACK
• sender: ACK arrives, send next packet, t = RTT + L / R
$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight]
• sender: first packet bit transmitted, t = 0
• sender: last bit transmitted, t = L / R
• receiver: first packet bit arrives
• receiver: last packet bit arrives, send ACK
• receiver: last bit of 2nd packet arrives, send ACK
• receiver: last bit of 3rd packet arrives, send ACK
• sender: ACK arrives, send next packet, t = RTT + L / R
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially:
  base = 1
  nextseqnum = 1
state "Wait":
  event: rdt_send(data)
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
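The sender FSM above translates fairly directly into an event-driven class. The following is a minimal sketch, not a complete implementation: the timer is reduced to a boolean flag, sequence numbers are unbounded rather than k-bit, and the `send` callback is a hypothetical stand-in for udt_send.

```python
class GBNSender:
    """Go-Back-N sender following the extended FSM (timer reduced to a flag)."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> buffered (seq #, data) packet
        self.timer_running = False

    def rdt_send(self, data):
        """Return True if the data was sent, False if refused (window full)."""
        if self.nextseqnum >= self.base + self.N:
            return False            # refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True       # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is acknowledged."""
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        # max() guards against a stale ACK moving base backwards; the FSM itself
        # assigns base = acknum + 1 unconditionally.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: restart the timer and resend every unacked packet."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])

wire = []
sender = GBNSender(window_size=4, send=wire.append)
for i in range(5):
    print(i, sender.rdt_send(f"data{i}"))   # the fifth call is refused
```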
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
initially:
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)
state "Wait":
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  event: default
    udt_send(sndpkt)
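The receiver side is even smaller, since the only state is expectedseqnum and the last ACK sent. A minimal sketch, assuming packets are handed in as (seq #, data) with a hypothetical `corrupt` flag standing in for the checksum test:

```python
class GBNReceiver:
    """Go-Back-N receiver: deliver in-order packets, re-ACK everything else."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0          # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one arriving packet and return the ACK number sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)        # deliver_data(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default branch: out-of-order or corrupt, so resend the last ACK
        return self.last_ack

rcv = GBNReceiver()
print([rcv.rdt_rcv(seq, f"pkt{seq}") for seq in [1, 2, 4, 5, 3, 4]])
# [1, 2, 2, 2, 3, 4]
```

Packets 4 and 5 arrive before 3 and are discarded, producing duplicate ACKs for 2; they are only accepted once retransmitted in order.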
GBN in action
[Figure: sender/receiver timeline with sender window (N=4) sliding over seq #'s 0 1 2 3 4 5 6 7 8; pkt2 is lost (X)]
sender:
  send pkt0
  send pkt1
  send pkt2 (X: loss)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout:
    send pkt2
    send pkt3
    send pkt4
    send pkt5
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
[Figure: Selective repeat: sender, receiver windows]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
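The three receiver cases above can be sketched as a small class. This is an illustration under a simplifying assumption that sequence numbers are unbounded integers (no wraparound); the class and method names are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver: ACK individually, buffer out-of-order packets."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}           # seq # -> data, received but not yet delivered
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n               # already delivered: re-ACK so sender can advance
        return None                # otherwise: ignore

rcv = SRReceiver(window_size=4)
for n in [0, 1, 3, 4, 5, 2]:
    print(n, rcv.on_packet(n, f"pkt{n}"), rcv.delivered)
```

With the arrival order 0, 1, 3, 4, 5, 2, packets 3 to 5 are buffered and all four of pkt2 to pkt5 are delivered together when pkt2 finally arrives.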
Selective repeat in action
[Figure: sender/receiver timeline with sender window (N=4) sliding over seq #'s 0 1 2 3 4 5 6 7 8; pkt2 is lost (X)]
sender:
  send pkt0
  send pkt1
  send pkt2 (X: loss)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout:
    send pkt2
  record ack4 arrived
  record ack5 arrived
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: two sender/receiver timelines over the seq # sequence 0 1 2 3 0 1 2, showing the sender window (after receipt) and the receiver window (after receipt)]
(a) no problem
  sender: send pkt0, pkt1, pkt2
  receiver: receives all three, window advances
  sender: ACKs arrive; send pkt3 (X: lost), then send pkt0 (new data)
  receiver: will accept packet with seq number 0
(b) oops!
  sender: send pkt0, pkt1, pkt2
  receiver: receives all three, window advances; all three ACKs are lost (X X X)
  sender: timeout, retransmit pkt0
  receiver: will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
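One way to explore the question is to test, for a given sequence-number space and window size, whether the numbers the sender might retransmit can collide with the numbers the receiver is ready to accept as new. The sketch below models only the worst case from scenario (b), where every ACK of a full window is lost; the function name is hypothetical.

```python
def sr_is_ambiguous(seq_space, window):
    """True if a retransmitted old packet can fall inside the receiver's new window.

    Worst case: the receiver got a full window (seq 0..window-1) and advanced,
    but every ACK was lost, so the sender may still retransmit any of them.
    """
    old = {n % seq_space for n in range(0, window)}            # possible retransmissions
    new = {n % seq_space for n in range(window, 2 * window)}   # receiver's current window
    return bool(old & new)

print(sr_is_ambiguous(seq_space=4, window=3))   # True: the slide's example
print(sr_is_ambiguous(seq_space=4, window=2))   # False

# largest safe window for a few sequence-number space sizes
for size in (4, 8, 16):
    safe = max(w for w in range(1, size + 1) if not sr_is_ambiguous(size, w))
    print(size, safe)
```

The search suggests the standard answer: the window size must be at most half the size of the sequence-number space.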
TCP: Overview  RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, rows from top to bottom]
  source port #  |  dest port #
  sequence number
  acknowledgement number
  head len  |  not used  |  U A P R S F  |  receive window
  checksum  |  Urg data pointer
  options (variable length)
  application data (variable length)
annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: an outgoing segment from sender and an incoming segment to sender, each showing the header fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Between them is the sender sequence number space with window size N, divided into: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable. The incoming segment carries acknowledgement number A.]
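Byte stream in TCP
[Figure: an HTTP Get Message (K bytes) viewed as a byte stream with a window of N bytes. Starting at the 100th byte, M bytes are carried behind a TCP header (seq no = 100); the part of the HTTP Get Message beyond the window is marked "Cannot be transmitted now".]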
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'.
   A -> B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'.
   B -> A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'.
   A -> B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT (milliseconds, axis from 100 to 350) against time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, plotting SampleRTT and EstimatedRTT]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
$DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
(typically, β = 0.25)
$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
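The two moving averages and the timeout formula fit in a short class. This is a sketch: the slides give only the update equations, so the initial values (EstimatedRTT = first sample, DevRTT = half of it) and the update order (DevRTT computed from the old EstimatedRTT) are assumptions, modelled on RFC 6298.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample      # assumed initialisation
        self.dev_rtt = first_sample / 2        # assumed initialisation

    def update(self, sample_rtt):
        """Fold in one SampleRTT (measured on a non-retransmitted segment)."""
        # assumed order: deviation first, using the previous EstimatedRTT
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(first_sample=100.0)         # milliseconds
for sample in [120.0, 110.0, 250.0, 105.0]:
    est.update(sample)
    print(f"sample={sample:6.1f}  est={est.estimated_rtt:7.2f}  "
          f"dev={est.dev_rtt:6.2f}  timeout={est.timeout_interval:7.2f}")
```

A single outlier such as the 250 ms sample moves EstimatedRTT only slightly but widens the safety margin noticeably through DevRTT.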
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initially:
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
state "wait for event":
  event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  event: ACK received, with ACK field value y
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
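The three events map onto three methods. As with the slide, this sketch ignores duplicate ACKs, flow control, and congestion control; the timer is a boolean flag, and the `send` callback is a hypothetical stand-in for "pass segment to IP".

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: byte-numbered segments, cumulative ACKs, one timer."""

    def __init__(self, initial_seq, send):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.send = send              # hypothetical stand-in for "pass segment to IP"
        self.unacked = {}             # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.send((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True           # start timer

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)             # not-yet-acked segment, smallest seq #
            self.send((seq, self.unacked[seq]))
            self.timer_running = True           # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y                  # bytes up to y-1 are now ACKed
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in fully_acked:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)

wire = []
tcp = SimplifiedTcpSender(initial_seq=92, send=wire.append)
tcp.on_app_data(b"A" * 8)      # Seq=92, 8 bytes
tcp.on_app_data(b"B" * 20)     # Seq=100, 20 bytes
tcp.on_ack(120)                # one cumulative ACK covers both segments
print(tcp.send_base, tcp.timer_running)   # 120 False
```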
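TCP: retransmission scenarios
[Figures: three Host A / Host B timelines, listed here as event sequences]
lost ACK scenario
  A -> B: Seq=92, 8 bytes of data               (SendBase=92)
  B -> A: ACK=100 (X: lost)
  A: timeout
  A -> B: Seq=92, 8 bytes of data (retransmission)
  B -> A: ACK=100                                (SendBase=100)
premature timeout
  A -> B: Seq=92, 8 bytes of data               (SendBase=92)
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100                                (SendBase=100)
  B -> A: ACK=120                                (SendBase=120)
  A: timeout for Seq=92 fires before the ACKs arrive
  A -> B: Seq=92, 8 bytes of data (retransmission)
  B -> A: ACK=120                                (SendBase=120)
cumulative ACK
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100 (X: lost)
  B -> A: ACK=120 (arrives before the timeout)
  A -> B: Seq=120, 15 bytes of data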
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   -> immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   -> immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
• likely that unacked segment lost, so don't wait for timeout
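Detecting the trigger only needs a counter. The sketch below interprets "triple duplicate ACKs" as three duplicates arriving after the first ACK for the same data (four identical ACK numbers in total), which is how the accompanying figure with its four ACK=100 segments reads; that interpretation and the class name are assumptions.

```python
class DupAckDetector:
    """Count duplicate ACKs and signal when fast retransmit should fire."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return True exactly when this ACK is the threshold-th duplicate."""
        if acknum == self.last_ack:
            self.dup_count += 1
            return self.dup_count == self.threshold
        self.last_ack = acknum       # new ACK number: reset the duplicate counter
        self.dup_count = 0
        return False

detector = DupAckDetector()
print([detector.on_ack(a) for a in [100, 100, 100, 100, 120]])
# [False, False, False, True, False]
```

When `on_ack` returns True, the sender would resend the unacked segment with the smallest seq # without waiting for the timer.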
TCP fast retransmit
[Figure: Host A / Host B timeline: fast retransmit after sender receipt of triple duplicate ACK]
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data (X: lost)
  B -> A: ACK=100
  B -> A: ACK=100
  B -> A: ACK=100
  B -> A: ACK=100   (3 DUP ACKs)
  A -> B: Seq=100, 20 bytes of data (fast retransmit, before the timeout expires)
TCP flow control
[Figure: receiver protocol stack. Segments arrive from the sender and pass through IP code and TCP code (in the OS) into the TCP socket receiver buffers, from which the application process reads.]
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data is passed on to the application process.]
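The bookkeeping behind rwnd can be sketched in two small functions. The variable names LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked do not appear on these slides; they are assumed here, following the usual textbook presentation of the same idea.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises:
    rwnd = RcvBuffer - (LastByteRcvd - LastByteRead)."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered

def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """Sender keeps unacked ("in-flight") data within the receiver's rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd

rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                        # 2096
print(sender_may_send(5000, 4000, rwnd, 1000))     # True
print(sender_may_send(5000, 4000, rwnd, 1100))     # False
```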
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: two hosts, each with application and network layers, each holding the same connection information]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg
   client -> server: SYNbit=1, Seq=x
   client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   client -> server: ACKbit=1, ACKnum=y+1
   client state: ESTAB
4. server: received ACK(y) indicates client is live
   server state: ESTAB
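The numbering in the three segments can be made concrete with a few lines. This sketch only builds the header fields named above as plain dictionaries; it is not a socket implementation, and the function name and example sequence numbers are hypothetical.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for client ISN x and server ISN y."""
    syn = {"SYNbit": 1, "Seq": x}
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
    return syn, synack, ack

for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is why the SYN counts as occupying one sequence number even though it carries no data.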
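TCP 3-way handshake: FSM
states: closed, listen, SYN rcvd, SYN sent, ESTAB
  closed -> listen
    event:   Socket connectionSocket = welcomeSocket.accept();
    actions: Λ
  closed -> SYN sent
    event:   Socket clientSocket = new Socket("hostname", "port number");
    actions: SYN(seq=x)
  listen -> SYN rcvd
    event:   SYN(x)
    actions: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN sent -> ESTAB
    event:   SYNACK(seq=y, ACKnum=x+1)
    actions: ACK(ACKnum=y+1)
  SYN rcvd -> ESTAB
    event:   ACK(ACKnum=y+1)
    actions: Λ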
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close()
   client -> server: FINbit=1, seq=x
   client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server -> client: ACKbit=1, ACKnum=x+1
   server state: CLOSE_WAIT (can still send data)
   client state: FIN_WAIT_2 (wait for server close)
3. server -> client: FINbit=1, seq=y
   server state: LAST_ACK (can no longer send data)
4. client -> server: ACKbit=1, ACKnum=y+1
   client state: TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
   server state: CLOSED
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A and Host B send original data at rate λin into a router with unlimited shared output link buffers; throughput is λout]
[Plot: λout against λin, rising to a maximum of R/2 at λin = R/2]
• maximum per-connection throughput: R/2
[Plot: delay against λin, growing without bound as λin approaches R/2]
• large delays as arrival rate, λin, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin
[Figure: Host A and Host B share a router with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: output]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet and sends (λin: original data; λ'in: original data, plus retransmitted data) into the router's finite shared output link buffers only when there is free buffer space; nothing is sent when there is no buffer space; λout is delivered to Host B]
[Plot: λout against λin, rising linearly to R/2 at λin = R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A keeps a copy of each packet; a packet arriving when the router has no free buffer space is dropped and then resent. λin: original data; λ'in: original data, plus retransmitted data]
[Plot: λout against λin, both axes up to R/2; the curve falls below the ideal line as λin approaches R/2]
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: Host A's timeout fires while the original packet is still queued in the router, which has free buffer space; both the original and the copy reach Host B]
[Plot: λout against λin, both axes up to R/2; the curve lies lower still]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: throughput at the receiver.]
Causes/costs of congestion: scenario 3
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Figure: throughput by blue traffic ($\lambda_{out}$) vs. offered load by Host A ($\lambda'_{in}$), axes marked at R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
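The sawtooth described on this slide can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: cwnd is counted in whole MSS units, one step equals one RTT, and a loss is assumed to occur whenever cwnd reaches a hypothetical capacity `loss_at` (a parameter invented for this illustration, not something the slides define).

```python
def aimd_trace(rtts, loss_at, start=1):
    """Return cwnd (in MSS) at the start of each RTT under AIMD.

    loss_at is a hypothetical capacity: a loss is assumed to be detected
    in any RTT where cwnd has reached it.
    """
    cwnd = start
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd >= loss_at:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: halve cwnd
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(rtts=12, loss_at=8, start=4)
print(trace)
```

The printed list climbs linearly to the assumed capacity and then drops to half, which is the sawtooth shape in the figure.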
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked < cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $\text{rate} \approx \frac{\text{cwnd}}{\text{RTT}}$ bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: timeline between Host A and Host B. Host A sends one segment, then two segments one RTT later, then four segments.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with three states: slow start, congestion avoidance, fast recovery. Λ denotes "no event" or "no action".]
initial transition (Λ) into slow start:
  cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh (Λ): go to congestion avoidance
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
  new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
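The three-state machine on this summary slide can be written as a small Python class. This is a sketch, not a TCP implementation: cwnd and ssthresh are counted in MSS units, the class name `RenoSender` and its method names are made up for this example, and the "retransmit" and "transmit new segment(s)" actions are left out so that only the window arithmetic is shown.

```python
MSS = 1  # work in units of one MSS to keep the numbers readable


class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, ssthresh=64):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd    # about +1 MSS per RTT
        else:                                     # leaving fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # window inflation per dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the object a sequence of `on_new_ack()`, `on_dup_ack()` and `on_timeout()` calls and printing `state` and `cwnd` after each one traces a path through the diagram.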
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{\text{RTT}}$ bytes/sec
[Figure: sawtooth of window size oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot \text{MSS}}{\text{RTT} \cdot \sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
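The two numbers on this slide can be rechecked by plugging the example values into the throughput formula. The sketch below assumes MSS is converted to bits (1500 bytes = 12,000 bits) so that throughput comes out in bits per second; the function names are invented for this example.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_prob):
    """Throughput estimate 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))


def required_loss(mss_bytes, rtt_s, target_bps):
    """Solve the same formula for the loss probability L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


def window_segments(mss_bytes, rtt_s, target_bps):
    """Segments that must be in flight to fill target_bps for one RTT."""
    return target_bps * rtt_s / (mss_bytes * 8)


L = required_loss(1500, 0.100, 10e9)
W = window_segments(1500, 0.100, 10e9)
print(f"L ~ {L:.2e}, W ~ {W:.0f} segments")
```

With these inputs L comes out slightly above 2e-10 and W at about 83,333 segments, matching the slide.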
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes running to R, with the equal bandwidth share line (points (X2, Y2) where X2 = Y2) and the full bandwidth utilization line (points (X1, Y1) where X1 + Y1 = R). The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]
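The geometric argument in the figure can be checked numerically. The sketch below makes strong simplifying assumptions that are not in the slides: both flows have the same RTT, both add one unit per step, and both see a loss in the same step whenever their combined rate exceeds the link capacity.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Rates of two AIMD flows sharing one bottleneck link.

    Assumes synchronized losses: when x + y exceeds capacity, both halve.
    """
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # multiplicative decrease halves the gap too
        else:
            x, y = x + 1, y + 1      # additive increase leaves the gap unchanged
    return x, y


x, y = two_flow_aimd(x=1.0, y=70.0, capacity=100.0, rounds=500)
print(f"x = {x:.3f}, y = {y:.3f}, gap = {abs(x - y):.5f}")
```

Additive increase keeps the difference between the two rates constant while every loss halves it, so the rates drift toward the equal-share line regardless of where they start.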
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). Labels: IP datagram with ECN=00 and ECN=11; TCP ACK segment with ECE=1.]
Principles of reliable data transfer
• important in application, transport, link layers
  – top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Reliable data transfer: getting started
[Figure: send side and receive side of the reliable data transfer protocol.]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver
[FSM notation: an arrow from state 1 to state 2 is labeled with the event causing the state transition (above the line) and the actions taken on the state transition (below the line). state: when in this "state", next state is uniquely determined by next event.]
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel
sender (single state: Wait for call from above):
  rdt_send(data): packet = make_pkt(data); udt_send(packet)
receiver (single state: Wait for call from below):
  rdt_rcv(packet): extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors?
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr->sender
How do humans recover from "errors" during conversation?
rdt2.0: FSM specification
sender
  Wait for call from above:
    rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK
  Wait for ACK or NAK:
    rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt): Λ; go to Wait for call from above
receiver
  Wait for call from below:
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 sender and receiver FSMs, repeated for the case where the packet arrives uncorrupted.]
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSMs, repeated for the case where the packet arrives corrupted.]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
  Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK 0
  Wait for ACK or NAK 0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to Wait for call 1 from above
  Wait for call 1 from above:
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK 1
  Wait for ACK or NAK 1:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
  Wait for 0 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to Wait for 1 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  Wait for 1 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to Wait for 0 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1 (data packet arrives corrupted)
  1. sender, Wait for call 0 from above: rdt_send(data) → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
  2. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) → sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  3. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
  4. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  5. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ
  6. end: sender in Wait for call 1 from above, receiver in Wait for 1 from below
rdt2.1: Example 2 (ACK arrives corrupted)
  1. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  2. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
  3. receiver, Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate, not delivered again)
  4. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ
  5. end: sender in Wait for call 1 from above, receiver in Wait for 1 from below
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
  Wait for ACK 0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): Λ
receiver FSM fragment
  transition into Wait for 0 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  Wait for 0 from below:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
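The interplay of timeouts, retransmissions and the one-bit sequence number can be traced with a small deterministic simulation. This is a sketch under assumptions that go beyond the slide: the channel only loses packets (no corruption, no reordering, no delayed copies), a loss is modeled as an immediate timeout, and the losses are scripted through the hypothetical parameters `drop_data` and `drop_ack`.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Alternating-bit transfer over a channel that may lose packets.

    drop_data / drop_ack: indices (0-based, counted per direction) of the
    transmissions that the scripted channel loses.
    Returns (messages delivered to the upper layer, data transmissions made).
    """
    delivered = []
    expected = 0            # receiver: seq bit of the next new packet
    seq = 0                 # sender: seq bit of the current packet
    data_tx = ack_tx = 0
    for msg in messages:
        while True:
            lost = data_tx in drop_data
            data_tx += 1
            if lost:
                continue    # no ACK arrives: timer expires, retransmit
            if seq == expected:
                delivered.append(msg)   # new packet: deliver it once
                expected ^= 1
            # otherwise it is a duplicate: do not deliver, but still ACK it
            ack_lost = ack_tx in drop_ack
            ack_tx += 1
            if ack_lost:
                continue    # ACK lost: timer expires, retransmit
            break           # ACK for the current seq bit received
        seq ^= 1            # move on to the other sequence number
    return delivered, data_tx


delivered, transmissions = stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={1})
print(delivered, transmissions)
```

In this run the first copy of "b" is lost and the ACK for its second copy is lost as well, so "b" is sent three times but delivered only once.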
rdt3.0 sender
  Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
    rdt_rcv(rcvpkt): Λ
  Wait for ACK0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above
  Wait for call 1 from above:
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
    rdt_rcv(rcvpkt): Λ
  Wait for ACK1:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
  – $U_{sender}$: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Timing diagram, sender and receiver: first packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Timing diagram, sender and receiver: first packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
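The utilization figures for stop-and-wait and for 3-packet pipelining can be rechecked with the same formula. The sketch assumes RTT = 30 ms (twice the 15 ms propagation delay of the example) and adds a hypothetical `in_flight` parameter for the number of packets sent back-to-back per round trip.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, in_flight=1):
    """Fraction of time the sender is busy: (n * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps           # transmission delay L/R
    return in_flight * d_trans / (rtt_s + d_trans)


u1 = sender_utilization(8000, 1e9, 0.030)                # stop-and-wait
u3 = sender_utilization(8000, 1e9, 0.030, in_flight=3)   # 3 packets pipelined
bytes_per_sec = (8000 / 8) / (0.030 + 8000 / 1e9)        # one 1 KB packet per cycle
print(f"{u1:.5f} {u3:.5f} {bytes_per_sec:.0f}")
```

The effective throughput of roughly 33 kB/sec on a 1 Gbps link follows directly from sending one 1000-byte packet every 30.008 ms.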
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initial transition (Λ): base = 1; nextseqnum = 1
single state: Wait
rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt): Λ
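The extended FSM above translates almost line for line into a Python class. In this sketch the timer is reduced to a boolean flag, `send` is a hypothetical callback standing in for udt_send, and sequence numbers are unbounded integers rather than k-bit values. The `max()` guard in `on_ack` is an addition of this sketch so that a stale ACK cannot move `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping: window, cumulative ACKs, timeout."""

    def __init__(self, window, send):
        self.N = window
        self.send = send              # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True # start_timer for oldest unACKed pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is confirmed.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)
        # stop_timer if nothing is outstanding, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True     # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])   # resend every unACKed packet
```

A usage example: create the sender with `window=4` and a list's `append` as the `send` callback, call `rdt_send` five times, and observe that the fifth call is refused until an ACK advances `base`.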
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
initial transition (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
single state: Wait
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
default:
  udt_send(sndpkt)
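The receiver side is small enough to fit in a few lines. In this sketch the ACK packet is reduced to the acknowledged sequence number, which `rdt_rcv` returns instead of sending; corruption is passed in as a flag because no checksum is modeled.

```python
class GBNReceiver:
    """Go-Back-N receiver: accept only the expected packet, re-ACK otherwise."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0             # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)          # deliver_data(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets are discarded and
        # the ACK for the highest in-order packet is sent again
        return self.last_ack
```

Because nothing is buffered, a packet that arrives ahead of a gap has to be retransmitted by the sender later, which is the cost of keeping the receiver this simple.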
GBN in action
sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver windows.]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
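The three receiver cases can be sketched as one method. This is an illustration only: sequence numbers are unbounded integers (no wraparound), the class and method names are invented here, and the method returns the ACK number to send instead of transmitting anything.

```python
class SRReceiver:
    """Selective-repeat receiver: buffer out-of-order packets, ACK each one."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}              # seq -> data, received but not delivered
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order prefix and slide the window forward
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                  # already delivered: re-ACK so sender can advance
        return None                   # outside both ranges: ignore
```

Re-ACKing packets below `rcvbase` matters because the sender may have missed the original ACK; without it the sender's window would never move.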
Selective repeat in action
sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem: sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window; the ACKs arrive; sender sends pkt3 (lost, X) and then a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window; all three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0
• receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
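One way to explore the question is to check, for a given sequence-number space and window size, whether a retransmitted old packet can land inside the receiver's advanced window. The helper below is a sketch written for this purpose (its name is made up); it considers the worst case in which the receiver has accepted a full window while the sender has seen none of the ACKs.

```python
def sr_can_confuse(seq_space, window):
    """True if an old seq # can be mistaken for a new one by the receiver.

    Worst case: the sender may still retransmit packets 0..window-1 while
    the receiver's window has already advanced to window..2*window-1.
    """
    old = {s % seq_space for s in range(0, window)}
    new = {s % seq_space for s in range(window, 2 * window)}
    return bool(old & new)


print(sr_can_confuse(seq_space=4, window=3))   # the slide's example
print(sr_can_confuse(seq_space=4, window=2))
```

Running the check over many combinations shows that the overlap disappears exactly when the window size is at most half the size of the sequence-number space.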
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
[Figure: segment layout, 32 bits wide, rows from top to bottom]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure labels: Window, N bytes; HTTP Get Message (K bytes); 100th byte; TCP header (seq. no. = 100); M bytes; HTTP Get Message (K bytes); "Cannot be transmitted now".]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
  Host A: user types 'C'; sends Seq=42, ACK=79, data = 'C'
  Host B: host ACKs receipt of 'C', echoes back 'C'; sends Seq=79, ACK=43, data = 'C'
  Host A: host ACKs receipt of echoed 'C'; sends Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) vs. time in seconds (axis 1 to 106).]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
  (typically, $\beta = 0.25$)
  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
  (EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
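The estimator and the timeout formula can be combined into one small class. Two details are assumptions of this sketch rather than statements from the slides: the first sample initializes EstimatedRTT directly with DevRTT set to half of it, and on each update DevRTT is computed from the previous EstimatedRTT before EstimatedRTT itself is updated.

```python
class RttEstimator:
    """EWMA-based RTT estimate, deviation, and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2       # assumed starting value

    def update(self, sample_rtt):
        # Deviation first, using the estimate from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)        # milliseconds
print(est.update(120.0))
```

Because each update multiplies the old estimate by (1 - alpha), a sample's weight shrinks geometrically with age, which is the "exponentially fast" decay the slide mentions.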
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state: wait for event
event: data received from application above
  create segment, seq #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
event: timeout
  retransmit not-yet-acked segment with smallest seq #
  start timer
event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
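The three sender events can be sketched in Python as follows. This is a minimal model under stated assumptions: the class name SimplifiedTcpSender, the `wire` list standing in for "pass segment to IP", and the boolean `timer_running` standing in for a real timer are all hypothetical names introduced for illustration.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow or congestion control."""

    def __init__(self, initial_seq: int) -> None:
        self.send_base = initial_seq   # oldest unacknowledged byte
        self.next_seq = initial_seq    # sequence number of the next new byte
        self.unacked = {}              # seq -> data for not-yet-acked segments
        self.timer_running = False     # stand-in for the single retransmission timer
        self.wire = []                 # (seq, data) segments handed to IP

    def on_app_data(self, data: bytes) -> None:
        seq = self.next_seq
        self.unacked[seq] = data
        self.wire.append((seq, data))          # "send"
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True          # start timer

    def on_timeout(self) -> None:
        if self.unacked:
            seq = min(self.unacked)            # not-yet-acked segment with smallest seq
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True          # restart timer

    def on_ack(self, y: int) -> None:
        if y > self.send_base:                 # cumulative ACK covers new data
            self.send_base = y
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in fully_acked:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Replaying the "premature timeout" case (segments at Seq=92 and Seq=100, a timeout, then ACK=120) against this class leaves SendBase at 120 with the timer stopped.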
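TCP retransmission scenarios
[Figure: lost ACK scenario. Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); A times out and resends Seq=92, 8 bytes of data; B replies with ACK=100 again.]
[Figure: premature timeout. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; Host B returns ACK=100 and ACK=120, but A's timer expires first and A resends Seq=92, 8 bytes of data; B answers with the cumulative ACK=120. SendBase values marked in the figure: 92, 100, 120, 120.]
[Figure: cumulative ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; ACK=100 is lost (X), but ACK=120 arrives before the timeout, so nothing is retransmitted and A continues with Seq=120, 15 bytes of data.]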
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
if sender receives 3 additional ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 and then three duplicate ACK=100s; on the 3 DUP ACKs, A resends Seq=100, 20 bytes of data before its timeout expires.]
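The duplicate-ACK rule can be expressed as a small counter. The sketch below is a hypothetical helper (the name DupAckDetector is not from the slides); it assumes ACK values are cumulative byte numbers, as in the scenarios above, and it only signals when to retransmit rather than sending anything.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when fast retransmit should fire."""

    def __init__(self, send_base: int) -> None:
        self.send_base = send_base   # oldest unacknowledged byte
        self.dup_acks = 0            # duplicates seen for send_base

    def on_ack(self, y: int) -> bool:
        """Return True when the segment starting at send_base should be resent."""
        if y > self.send_base:       # new data acknowledged
            self.send_base = y
            self.dup_acks = 0
            return False
        if y == self.send_base:      # duplicate of the last cumulative ACK
            self.dup_acks += 1
            return self.dup_acks == 3
        return False                 # stale ACK for already-acknowledged data


detector = DupAckDetector(send_base=92)
for ack in (100, 100, 100, 100):     # first ACK=100, then three duplicates
    if detector.on_ack(ack):
        print("fast retransmit segment starting at byte", detector.send_base)
```

Returning True only on exactly the third duplicate keeps the sender from retransmitting the same segment again on a fourth or fifth duplicate.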
TCP flow control
[Figure: receiver protocol stack. Segments from the sender pass through IP code and TCP code into the TCP socket receiver buffers inside the OS; the application process removes data from those buffers.]
application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems auto-adjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads arrive into RcvBuffer, which is divided into buffered data and free buffer space (rwnd); buffered data is passed on to the application process.]
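The bookkeeping behind rwnd can be sketched with two small functions. The variable names LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked are assumptions borrowed from the usual textbook presentation; the slides themselves only state that rwnd is the free buffer space and that in-flight data must stay within it.

```python
def advertised_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Free receive-buffer space (rwnd) the receiver would advertise."""
    buffered = last_byte_rcvd - last_byte_read   # data received but not yet read by the app
    return rcv_buffer - buffered


def sendable_bytes(rwnd: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """How many more bytes the sender may put in flight without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


# Receiver with a 4096-byte buffer holding 2000 unread bytes.
rwnd = advertised_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sendable_bytes(rwnd, last_byte_sent=5000, last_byte_acked=4000))
```

When the application stops reading, the buffered amount grows, rwnd shrinks toward zero, and the sender's allowance shrinks with it.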
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server each hold, between application and network: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
initial states: client CLOSED, server LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
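TCP 3-way handshake: FSM
states: closed, listen, SYN rcvd, SYN sent, ESTAB
• closed -> listen (server): Socket connectionSocket = welcomeSocket.accept();
• closed -> SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• listen -> SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent -> ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)
• SYN rcvd -> ESTAB: on ACK(ACKnum=y+1), no action (Λ)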
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: closing a connection; client state and server state both start in ESTAB]
1. clientSocket.close(): client sends FINbit=1, seq=x and enters FIN_WAIT_1 (can no longer send, but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED; server enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[Figure: Host A sends original data at rate λin into unlimited shared output link buffers; Host B receives throughput λout. Plots: λout versus λin, levelling off at R/2, and delay versus λin, growing sharply as λin approaches R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin
[Figure: Host A offers λin (original data) and λ'in (original data plus retransmitted data) into finite shared output link buffers; Host B receives λout.]
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: λout versus λin, reaching R/2 at λin = R/2. The sender keeps a copy of each packet and transmits only when the router has free buffer space, holding back when there is no buffer space.]
Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: λout versus λin, with both axes marked at R/2. When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)]
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: λout versus λin, with both axes marked at R/2. When sending at R/2, some packets are retransmissions, including duplicates that are delivered!]
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; each host offers λin (original data) and λ'in (original data plus retransmitted data) and receives λout.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Figure: throughput by blue traffic (λout) versus offered load by Host A (λ'in), with axis marks at R/2; throughput collapses at high load because of bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth: cwnd (TCP sender congestion window size) versus time; additively increase window size until loss occurs, then cut window in half.]
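The sawtooth can be reproduced with a few lines of Python. This is a toy model, not TCP: the function name aimd_trace is hypothetical, cwnd is counted in whole MSS units, one loop iteration stands for one RTT, and a loss is simply assumed to occur whenever cwnd exceeds a fixed capacity.

```python
def aimd_trace(capacity_mss: int, rounds: int, cwnd: int = 1) -> list:
    """Return cwnd (in MSS) at the start of each RTT under a toy AIMD model."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:        # assumed loss signal: window exceeded capacity
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: cut in half
        else:
            cwnd += 1                  # additive increase: 1 MSS per RTT
    return trace


print(aimd_trace(capacity_mss=4, rounds=8))
```

Plotting the returned list against its index gives the sawtooth shape described in the figure.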
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, then the sent but not-yet-ACKed ("in-flight") bytes spanning cwnd, up to last byte sent.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: over successive RTTs, Host A sends one segment, then two segments, then four segments to Host B.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
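The three-state machine above can be sketched in Python. This is only an illustration of the window arithmetic: the class name RenoCongestionControl and the string state labels are assumptions, actual retransmission and segment sending are left out, and the "+ 3" in the fast-recovery entry is read as 3 MSS.

```python
class RenoCongestionControl:
    """Window bookkeeping for slow start, congestion avoidance and fast recovery."""

    def __init__(self, mss: int = 1460) -> None:
        self.mss = mss
        self.cwnd = float(mss)            # cwnd = 1 MSS
        self.ssthresh = 64.0 * 1024       # ssthresh = 64 KB
        self.dup_ack_count = 0
        self.state = "slow_start"

    def on_new_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh     # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss         # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # about 1 MSS per RTT
        self.dup_ack_count = 0

    def on_duplicate_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:       # caller would retransmit the missing segment
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self) -> None:         # caller would retransmit the missing segment
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_ack_count = 0
        self.state = "slow_start"
```

Keeping the window logic separate from packet handling makes it easy to drive the class with a scripted sequence of ACK, duplicate-ACK and timeout events and watch how cwnd and ssthresh respond.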
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) * W / RTT bytes/sec
[Figure: sawtooth of window size oscillating between W/2 and W, averaging 3/4 W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22 * MSS / (RTT * sqrt(L))
-> to achieve 10 Gbps throughput, need a loss rate of L = 2 * 10^-10, a very small loss rate!
• new versions of TCP for high-speed
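The figures in this example can be checked numerically. The sketch below assumes throughput is expressed in bits per second with MSS converted from bytes to bits; the function names are illustrative.

```python
import math


def window_segments(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), with MSS converted to bits."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Loss probability L at which the formula above yields rate_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


print(round(window_segments(10e9, 0.1, 1500)))   # in-flight segments for 10 Gbps
print(required_loss(10e9, 0.1, 1500))            # loss rate, on the order of 2e-10
```

Because L enters under a square root, every tenfold increase in target throughput requires a hundredfold lower loss rate, which is why the required loss rate becomes so small at 10 Gbps.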
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes running to R, with the equal bandwidth share line (points (X2, Y2) where X2 = Y2) and the full bandwidth utilization line (points (X1, Y1) where X1 + Y1 = R). The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]
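The convergence argument can be simulated directly. This is a toy model under stated assumptions: both flows see the same RTT, both detect loss in the same round whenever their combined rate exceeds the capacity, and rates are in arbitrary units; the function name is hypothetical.

```python
def aimd_two_flows(x: float, y: float, capacity: float, rounds: int) -> tuple:
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2    # loss: both halve, so the gap between them halves
        else:
            x, y = x + 1, y + 1    # additive increase: gap unchanged (slope 1)
    return x, y


x, y = aimd_two_flows(10.0, 70.0, capacity=100.0, rounds=300)
print(x, y)   # the two rates end up close to each other
```

Additive increase moves the operating point parallel to the equal-share line, and each halving moves it toward the origin along a line that halves the difference between the flows, so repeated cycles pull the point onto the equal-share line.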
Fairness (more)
Fairness and UDP:
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
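The parallel-connection example is simple arithmetic if every TCP connection is assumed to receive an equal share of the link. The helper below (a hypothetical name) returns the new application's share as an exact fraction of R.

```python
from fractions import Fraction


def app_share(existing_connections: int, new_connections: int) -> Fraction:
    """Fraction of link rate R obtained by an app opening new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)   # each connection gets R / total


print(app_share(9, 1))    # one connection among ten
print(app_share(9, 11))   # eleven connections among twenty
```

With 11 connections out of 20 the share is 11/20 of R, slightly more than the R/2 quoted on the slide as a round figure.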
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]
Principles of reliable data transfer
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
• important in application, transport, link layers
  - top-10 list of important networking topics!
Reliable data transfer: getting started
[Figure: send side and receive side of the rdt protocol, connected by the unreliable channel]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver
  - state: when in this "state", next state uniquely determined by next event
  - each transition from state 1 to state 2 is labelled with the event causing the state transition and the actions taken on the state transition (event / actions)
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender, state "Wait for call from above":
• rdt_send(data): packet = make_pkt(data); udt_send(packet)
receiver, state "Wait for call from below":
• rdt_rcv(packet): extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - feedback: control msgs (ACK, NAK) from receiver to sender
How do humans recover from "errors" during conversation?
rdt2.0: FSM specification
sender:
• state "Wait for call from above"
  - rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
• state "Wait for ACK or NAK"
  - rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && isACK(rcvpkt): no action (Λ); go to "Wait for call from above"
receiver:
• state "Wait for call from below"
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 FSM with the error-free path: sender sends sndpkt; receiver finds it not corrupt, delivers the data and sends ACK; sender receives the ACK and returns to "Wait for call from above".]
rdt2.0: error scenario
[Figure: the rdt2.0 FSM with the error path: receiver finds the packet corrupt and sends NAK; sender receives the NAK and resends sndpkt.]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
• state "Wait for ACK or NAK 0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ); go to "Wait for call 1 from above"
• state "Wait for call 1 from above"
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
• state "Wait for ACK or NAK 1"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ); go to "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
• state "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• state "Wait for 1 from below"
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
Sender states: "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1". Receiver states: "Wait for 0 from below", "Wait for 1 from below".
1. sender, rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
4. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ)
rdt2.1: Example 2
1. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
3. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ)
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): no action (Λ)
receiver FSM fragment:
• transition into "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  - rdt_rcv(rcvpkt): no action (Λ)
• state "Wait for ACK0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): no action (Λ)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above"
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  - rdt_rcv(rcvpkt): no action (Λ)
• state "Wait for ACK1"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): no action (Λ)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"
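The rdt3.0 sender can be sketched as a small Python class. This is a simplified model under stated assumptions: the class name Rdt30Sender is hypothetical, the checksum is replaced by a `corrupt` flag supplied by the caller, the timer is a boolean rather than a real countdown, and `udt_send` is any callable that accepts a packet.

```python
class Rdt30Sender:
    """Alternating-bit, stop-and-wait sender with a retransmission timer."""

    def __init__(self, udt_send) -> None:
        self.udt_send = udt_send       # callable taking one packet
        self.seq = 0                   # sequence number of the next/current packet
        self.waiting_for_ack = False   # True in the "Wait for ACK" states
        self.sndpkt = None
        self.timer_running = False

    def rdt_send(self, data) -> bool:
        if self.waiting_for_ack:
            return False               # stop-and-wait: one packet at a time
        self.sndpkt = (self.seq, data)
        self.udt_send(self.sndpkt)
        self.timer_running = True      # start_timer
        self.waiting_for_ack = True
        return True

    def timeout(self) -> None:
        if self.waiting_for_ack:
            self.udt_send(self.sndpkt) # retransmit the current packet
            self.timer_running = True  # start_timer

    def rdt_rcv(self, ack_seq: int, corrupt: bool = False) -> None:
        if not self.waiting_for_ack or corrupt or ack_seq != self.seq:
            return                     # ignore; the timer covers any real loss
        self.timer_running = False     # stop_timer
        self.waiting_for_ack = False
        self.seq ^= 1                  # alternate between 0 and 1
```

Unlike rdt2.2, a corrupt or wrong-numbered ACK triggers no immediate retransmission here; the sender simply waits for the timeout, which is what lets the same mechanism cover both loss and corruption.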
rdt3.0 in action
(a) no loss:
  sender: send pkt0 -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 -> receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 -> receiver: rcv pkt0, send ack0
(b) packet loss:
  sender: send pkt0 -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 -> pkt1 lost (X)
  sender: timeout, resend pkt1 -> receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 -> receiver: rcv pkt0, send ack0
(c) ACK loss:
  sender: send pkt0 -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 -> receiver: rcv pkt1, send ack1 -> ack1 lost (X)
  sender: timeout, resend pkt1 -> receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 -> receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK:
  sender: send pkt0 -> receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 -> receiver: rcv pkt1, send ack1 (delayed)
  sender: timeout, resend pkt1 -> receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 -> receiver: rcv pkt0, send ack0
  sender: rcv ack1 (second copy), do nothing
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs
• U_sender: utilization, the fraction of time sender busy sending
U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027 (times in msec)
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: timeline between sender and receiver. First packet bit transmitted, t = 0; last packet bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]
U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: timeline between sender and receiver. First packet bit transmitted, t = 0; last bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]
3-packet pipelining increases utilization by a factor of 3!
U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
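The utilization formula is easy to evaluate for both the stop-and-wait and the pipelined case. In the sketch below the function name and the `window` parameter (number of packets sent back-to-back per RTT) are assumptions for illustration; the cap at 1.0 reflects that a sender cannot be busy more than all of the time.

```python
def sender_utilization(packet_bits: float, rate_bps: float, rtt_s: float,
                       window: int = 1) -> float:
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps            # transmission delay L/R in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)
print(stop_and_wait, pipelined, pipelined / stop_and_wait)
```

Utilization grows linearly with the window until the pipe is full, which is the motivation for the Go-Back-N and Selective Repeat protocols.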
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially (Λ): base = 1; nextseqnum = 1
state: Wait
event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  no action (Λ)
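The GBN sender FSM translates almost line for line into Python. The sketch below makes several simplifying assumptions: sequence numbers are unbounded integers rather than k-bit values, packets are (seq, data) tuples without a checksum, the timer is a boolean, and stale ACKs below base are ignored (a guard the FSM itself does not spell out). The class name GbnSender is hypothetical.

```python
class GbnSender:
    """Go-Back-N sender with window size N and a single timer."""

    def __init__(self, window: int, udt_send) -> None:
        self.N = window
        self.udt_send = udt_send   # callable taking one packet
        self.base = 1              # oldest unacked sequence number
        self.nextseqnum = 1        # next sequence number to use
        self.sndpkt = {}           # seq -> packet, kept until acknowledged
        self.timer_running = False

    def rdt_send(self, data) -> bool:
        if self.nextseqnum >= self.base + self.N:
            return False                       # refuse_data: window is full
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.udt_send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True          # start_timer for the oldest packet
        self.nextseqnum += 1
        return True

    def timeout(self) -> None:
        self.timer_running = True              # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])    # go back N: resend every unacked packet

    def rdt_rcv_ack(self, acknum: int) -> None:
        if acknum < self.base:
            return                             # assumed guard: ignore stale ACKs
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)         # cumulative ACK frees these packets
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum
```

Passing `list.append` as `udt_send` records every transmission, which is a convenient way to replay a scenario such as a lost packet followed by a timeout.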
GBN: receiver extended FSM
initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state: Wait
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
event: default
  udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
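The receiver side is even smaller. The sketch below is a minimal model under stated assumptions: the class name GbnReceiver is hypothetical, corruption is signalled by a flag instead of a checksum, and the method returns the ACK number it would send rather than building a packet.

```python
class GbnReceiver:
    """Go-Back-N receiver: delivers in-order packets, re-ACKs everything else."""

    def __init__(self) -> None:
        self.expectedseqnum = 1
        self.last_ack = 0          # mirrors the initial sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []        # data passed up to the application

    def rdt_rcv(self, seq: int, data, corrupt: bool = False) -> int:
        """Process one packet and return the ACK number to send back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)       # deliver_data
            self.last_ack = seq
            self.expectedseqnum += 1
        # Default case (corrupt or out of order): discard and repeat the last ACK.
        return self.last_ack


r = GbnReceiver()
print([r.rdt_rcv(n, f"pkt{n}") for n in (1, 2, 4, 5, 3)])
```

Because nothing out of order is buffered, packets 4 and 5 in this run are discarded and would have to be retransmitted by the sender after packet 3 is finally accepted.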
GBN in action
[Figure: sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]
sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
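The receiver rules above can be sketched in a few lines of Python. This is only an illustration: the class name, the dictionary buffer, and the choice to leave sequence numbers unwrapped (no modulo arithmetic) are assumptions, not part of the slides.

```python
class SelectiveRepeatReceiver:
    """Sketch of the selective repeat receiver rules (sequence numbers not wrapped)."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcv_base = 0
        self.buffer = {}      # out-of-order packets, keyed by sequence number
        self.delivered = []   # data handed to the upper layer, in order

    def on_packet(self, seq, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcv_base <= seq < self.rcv_base + self.N:
            self.buffer[seq] = data
            # Deliver the in-order run starting at rcv_base, if there is one.
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return seq
        if self.rcv_base - self.N <= seq < self.rcv_base:
            return seq        # already delivered: re-ACK so the sender can advance
        return None           # otherwise: ignore


rx = SelectiveRepeatReceiver(window_size=4)
# Arrival order when pkt2 is lost and later retransmitted.
acks = [rx.on_packet(s, f"pkt{s}") for s in (0, 1, 3, 4, 5, 2)]
print(acks)
print(rx.delivered)
```

Packets 3, 4 and 5 are buffered and individually ACKed; nothing past pkt1 reaches the upper layer until pkt2 fills the gap.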
Selective repeat in action
sender window (N = 4), seq #'s 0 1 2 3 4 5 6 7 8
sender: send pkt0, send pkt1, send pkt2 (X: loss), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5
sender: record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2
sender: record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem
sender: send pkt0, pkt1, pkt2; all three arrive and are ACKed
receiver window (after receipt): advances to 3, 0, 1
sender window (after receipt of ACKs): advances; send pkt3 (X: lost), then new pkt0
receiver: will accept packet with seq number 0
(b) oops!
sender: send pkt0, pkt1, pkt2; all three arrive, but all three ACKs are lost (X X X)
receiver window (after receipt): advances to 3, 0, 1
sender: timeout, retransmit pkt0
receiver: will accept packet with seq number 0
• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong: duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
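One way to explore the question is to replay scenario (b) for different window sizes. The sketch below is an assumption-laden toy (the function name and the set-based window are made up for illustration): it checks whether the retransmitted old packet falls inside the receiver's advanced window.

```python
def old_packet_accepted_as_new(seq_space, window):
    """Scenario (b): all `window` packets arrive, every ACK is lost, and the
    sender retransmits its oldest packet (seq # 0). Returns True if the
    receiver's advanced window contains that old sequence number."""
    retransmitted = 0
    # After receiving seq #'s 0..window-1 the receiver expects the next `window` numbers.
    receiver_window = {(window + i) % seq_space for i in range(window)}
    return retransmitted in receiver_window


results = {w: old_packet_accepted_as_new(seq_space=4, window=w) for w in (1, 2, 3)}
print(results)
```

With four sequence numbers, only the window of 3 from the slide misbehaves; in this toy model the problem appears exactly when the window is larger than half the sequence number space.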
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
header rows (each 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
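The fixed 20-byte part of this layout can be decoded with Python's standard struct module. The helper name, the returned dictionary, and the sample segment (including its zero checksum) are made up for illustration; options are skipped rather than parsed.

```python
import struct

# Flag bit masks within the flags byte (standard TCP bit positions).
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment):
    """Decode the fixed 20-byte TCP header (hypothetical helper)."""
    (src_port, dst_port, seq, ack, offset_byte, flag_byte,
     rwnd, checksum, urg_ptr) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (offset_byte >> 4) * 4   # head len field counts 32-bit words
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack, "header_len": header_len,
        "flags": {name for name, bit in FLAG_BITS.items() if flag_byte & bit},
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg_ptr,
        "payload": segment[header_len:],
    }


# Made-up segment: Seq=42, ACK=79, ACK and PSH set, one byte of data 'C'.
raw = struct.pack("!HHIIBBHHH", 12345, 23, 42, 79, 5 << 4, 0x18, 4096, 0, 0) + b"C"
hdr = parse_tcp_header(raw)
print(hdr["seq"], hdr["ack"], sorted(hdr["flags"]), hdr["payload"])
```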
TCP seq. numbers, ACKs
sequence numbers:
- byte stream "number" of first byte in segment's data
acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK
Q: how receiver handles out-of-order segments
- A: TCP spec doesn't say, up to implementor
[figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; sender sequence number space with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
Byte stream in TCP
[figure labels: window (N bytes); HTTP GET message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; portion that cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types 'C'. Host A to Host B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. Host B to Host A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. Host A to Host B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[plot: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) against time in seconds]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")
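The two moving averages and the timeout formula translate directly into code. In this sketch the class name and the initial values (first sample for EstimatedRTT, half of it for DevRTT) are assumptions; the slides do not say how the estimator is seeded or which EstimatedRTT value feeds the DevRTT update, so here the freshly updated one is used.

```python
class RttEstimator:
    """EWMA-based RTT estimator following the formulas above (illustrative)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample      # assumed seed value
        self.dev_rtt = first_sample / 2        # assumed seed value

    def update(self, sample_rtt):
        """Fold one SampleRTT (not taken from a retransmission) into the estimates."""
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A single outlier of 200 ms moves EstimatedRTT only to 112.5 ms, but it widens the safety margin noticeably, which is the intended effect of the DevRTT term.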
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initial transition (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state: wait for event
event: data received from application above
  create segment, seq #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running) start timer
event: timeout
  retransmit not-yet-acked segment with smallest seq #
  start timer
event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments) start timer
    else stop timer
  }
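The three events of the simplified sender can be sketched as a small Python class. Everything here is illustrative: the class and attribute names are assumptions, the timer is reduced to a boolean, and "passing a segment to IP" is modelled by appending to a log.

```python
class SimplifiedTcpSender:
    """Sketch of the simplified TCP sender: no duplicate-ACK handling,
    no flow control, no congestion control."""

    def __init__(self, initial_seq):
        self.send_base = initial_seq
        self.next_seq = initial_seq
        self.unacked = []            # (seq, data) pairs, oldest first
        self.timer_running = False
        self.sent = []               # log of (seq, length) handed to IP

    def on_app_data(self, data):
        self.unacked.append((self.next_seq, data))
        self.sent.append((self.next_seq, len(data)))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq, data = self.unacked[0]  # not-yet-acked segment with smallest seq #
        self.sent.append((seq, len(data)))
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1 is the last cumulatively ACKed byte
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


# Premature timeout: two segments sent, timer fires before the ACKs arrive.
tx = SimplifiedTcpSender(initial_seq=92)
tx.on_app_data(b"x" * 8)
tx.on_app_data(b"y" * 20)
tx.on_timeout()
tx.on_ack(100)
tx.on_ack(120)
print(tx.sent, tx.send_base, tx.timer_running)
```

Only the oldest segment (Seq=92) is retransmitted on timeout, and the cumulative ACK=120 clears everything outstanding.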
TCP: retransmission scenarios
lost ACK scenario (Host A, Host B)
A: Seq=92, 8 bytes of data
B: ACK=100 (X: lost)
A: timeout; Seq=92, 8 bytes of data
B: ACK=100
premature timeout (Host A, Host B)
A: Seq=92, 8 bytes of data (SendBase=92)
A: Seq=100, 20 bytes of data
B: ACK=100
B: ACK=120
A: timeout; Seq=92, 8 bytes of data
A: ACK=100 arrives (SendBase=100)
A: ACK=120 arrives (SendBase=120)
B: ACK=120
A: (SendBase=120)
TCP: retransmission scenarios
cumulative ACK (Host A, Host B)
A: Seq=92, 8 bytes of data
A: Seq=100, 20 bytes of data
B: ACK=100 (X: lost)
B: ACK=120 (arrives before timeout)
A: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
TCP fast retransmit
fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
A: Seq=92, 8 bytes of data
A: Seq=100, 20 bytes of data (X: lost)
B: ACK=100
B: ACK=100, ACK=100, ACK=100 (3 DUP ACKs, "triple duplicate ACKs")
A: Seq=100, 20 bytes of data (resent before timeout expires)
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
[figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS); the application process reads from those buffers]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process]
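A minimal sketch of the receiver-side bookkeeping follows. The names LastByteRcvd and LastByteRead come from the usual textbook treatment rather than from this slide, and the class and helper names are assumptions made for illustration.

```python
class ReceiveBuffer:
    """Tracks free space in RcvBuffer; rwnd is what the receiver advertises."""

    def __init__(self, rcv_buffer=4096):
        self.rcv_buffer = rcv_buffer
        self.last_byte_rcvd = 0   # highest byte placed in the buffer
        self.last_byte_read = 0   # highest byte consumed by the application

    @property
    def rwnd(self):
        return self.rcv_buffer - (self.last_byte_rcvd - self.last_byte_read)

    def on_segment(self, nbytes):
        """Accept as many payload bytes as fit; return the number accepted."""
        accepted = min(nbytes, self.rwnd)
        self.last_byte_rcvd += accepted
        return accepted

    def app_read(self, nbytes):
        """Application removes up to nbytes of buffered data."""
        n = min(nbytes, self.last_byte_rcvd - self.last_byte_read)
        self.last_byte_read += n
        return n


def sender_may_send(last_byte_sent, last_byte_acked, rwnd):
    """Bytes the sender may still put in flight without exceeding rwnd."""
    return max(0, rwnd - (last_byte_sent - last_byte_acked))


buf = ReceiveBuffer(rcv_buffer=4096)
buf.on_segment(3000)
window_before_read = buf.rwnd
buf.app_read(1000)
window_after_read = buf.rwnd
print(window_before_read, window_after_read)
```

The advertised window shrinks as segments arrive and grows again only when the application reads, which is how a slow application throttles the sender.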
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
each side (client and server) holds:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket(hostname, port number)
server: Socket connectionSocket = welcomeSocket.accept()
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
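The numbering in the three segments can be checked with a tiny sketch. The function name and the dictionary representation are made up for illustration; the sequence number on the third segment (x+1) is not shown on the slide but follows from the SYN consuming one sequence number, and the modulo keeps values inside TCP's 32-bit sequence space.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32 bits wide


def three_way_handshake(x, y):
    """Return the three handshake segments for client ISN x and server ISN y."""
    syn = {"from": "client", "SYN": 1, "ACK": 0, "seq": x, "acknum": None}
    synack = {"from": "server", "SYN": 1, "ACK": 1, "seq": y,
              "acknum": (x + 1) % SEQ_SPACE}
    ack = {"from": "client", "SYN": 0, "ACK": 1,
           "seq": (x + 1) % SEQ_SPACE, "acknum": (y + 1) % SEQ_SPACE}
    return [syn, synack, ack]


segments = three_way_handshake(x=1000, y=5000)
for seg in segments:
    print(seg)
```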
TCP 3-way handshake: FSM
server side:
  closed -> listen: Socket connectionSocket = welcomeSocket.accept() / Λ
  listen -> SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB: ACK(ACKnum=y+1) / Λ
client side:
  closed -> SYN sent: Socket clientSocket = new Socket(hostname, port number) / SYN(seq=x)
  SYN sent -> ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection (client state, server state; both start in ESTAB)
1. client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data)
3. client: on receiving the ACK, client state: FIN_WAIT_2 (wait for server close)
4. server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
5. client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
6. server: on receiving the ACK, server state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[figure: Host A and Host B send original data at rate λin into unlimited shared output link buffers; throughput is λout. Plots: λout against λin, levelling off at R/2; delay against λin, growing without bound as λin approaches R/2]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin
[figure: Host A offers λin (original data) and λ'in (original data, plus retransmitted data) into finite shared output link buffers; Host B receives λout]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[figure: sender A keeps a copy of each packet; with free buffer space at the router the packet is sent, with no buffer space it is held back. λin: original data; λ'in: original data, plus retransmitted data. Plot: λout against λin, rising to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[plot: λout against λ'in, both axes up to R/2] when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[plot: λout against λ'in, both axes up to R/2] when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
[figure: Hosts A, B, C, D connected through routers with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout at the receiving host]
another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted!
Causes/costs of congestion: scenario 3
[plot: throughput by blue traffic (λout) against offered load by Host A (λ'in), axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[plot: cwnd (TCP sender congestion window size) against time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked <= cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd/RTT bytes/sec
[figure: sender sequence number space showing last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; the span cwnd]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[figure: Host A sends one segment, then two segments, then four segments to Host B, one burst per RTT]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initial transition (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 -> slow start
slow start:
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: Λ -> congestion avoidance
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
congestion avoidance:
  new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
fast recovery:
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment -> slow start
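The state machine above can be sketched as a small class that tracks only cwnd, ssthresh, and the state. This is a simplified illustration, not a TCP implementation: cwnd is counted in segments (MSS = 1) instead of bytes, the "+ 3" is read as 3 MSS, retransmissions are omitted, and the slow-start exit test is applied right after each new ACK. All names are assumptions.

```python
MSS = 1   # assumption: count the window in segments for readability


class RenoSender:
    def __init__(self, ssthresh=64):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow start":
            self.cwnd += MSS                      # doubles cwnd once per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        elif self.state == "congestion avoidance":
            self.cwnd += MSS * MSS / self.cwnd    # about 1 MSS per RTT
        else:                                     # fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow start"


s = RenoSender(ssthresh=4)
for _ in range(4):
    s.on_new_ack()                  # slow start until cwnd exceeds ssthresh
state_after_ramp, cwnd_after_ramp = s.state, s.cwnd
for _ in range(3):
    s.on_dup_ack()                  # triple duplicate ACK
state_after_dups, cwnd_after_dups = s.state, s.cwnd
s.on_new_ack()                      # leaves fast recovery
s.on_timeout()
print(state_after_ramp, cwnd_after_ramp, state_after_dups, cwnd_after_dups, s.state, s.cwnd)
```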
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) · W/RTT bytes/sec
[plot: window size oscillating between W/2 and W, averaging 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
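Both numbers on this slide can be reproduced with a few lines of arithmetic. The variable names below are only for illustration; the loss rate comes from solving the throughput formula for L.

```python
mss_bits = 1500 * 8        # segment size in bits
rtt = 0.100                # seconds
target_bps = 10e9          # desired throughput: 10 Gbps

# Window needed to keep the pipe full: bandwidth-delay product in segments.
window_segments = target_bps * rtt / mss_bits

# Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L.
loss_rate = (1.22 * mss_bits / (rtt * target_bps)) ** 2

print(f"W = {window_segments:.0f} segments, L = {loss_rate:.2e}")
```

The result, roughly one lost segment in five billion, is why the slide calls the required loss rate very small.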
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[plot: Connection 2 throughput against Connection 1 throughput, both axes up to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving towards the equal share line]
(X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2
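The convergence argument can be tried numerically. The sketch below makes strong simplifying assumptions that are not in the slides: both connections have the same RTT, increase by one unit per round, and both see a loss in the same round whenever their combined rate exceeds the capacity R. The function name is made up.

```python
def share_bottleneck(x1, x2, capacity, rounds):
    """Two AIMD flows on one bottleneck, with synchronized loss (toy model)."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2    # loss: decrease window by factor of 2
        else:
            x1, x2 = x1 + 1, x2 + 1    # congestion avoidance: additive increase
    return x1, x2


final_x1, final_x2 = share_bottleneck(x1=10, x2=70, capacity=100, rounds=500)
print(final_x1, final_x2)
```

Additive increase leaves the gap between the two flows unchanged, while each halving cuts it in half, so the rates drift towards the equal share line.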
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and arrives with ECN=11; the TCP ACK segment returns with ECE=1]
Principles of reliable data transfer
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
• important in application, transport, link layers
  - top-10 list of important networking topics!
Reliable data transfer: getting started
send side:
• rdt_send(): called from above (e.g., by app.). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver
FSM notation: state 1 -> state 2, each transition labelled "event / actions"
  event: event causing state transition
  actions: actions taken on state transition
  state: when in this "state", next state uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender, state "Wait for call from above":
  rdt_send(data) / packet = make_pkt(data); udt_send(packet)
receiver, state "Wait for call from below":
  rdt_rcv(packet) / extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors:
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr -> sender
How do humans recover from "errors" during conversation?
rdt2.0: FSM specification
sender:
  Wait for call from above: rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) -> Wait for ACK or NAK
  Wait for ACK or NAK: rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt) -> Wait for ACK or NAK
  Wait for ACK or NAK: rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ -> Wait for call from above
receiver:
  Wait for call from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
  Wait for call from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
(same FSM as in the rdt2.0 specification; the transitions taken are: sender sends sndpkt, receiver finds it notcorrupt, delivers the data and sends ACK, sender receives the ACK and returns to "Wait for call from above")
rdt2.0: error scenario
(same FSM as in the rdt2.0 specification; the transitions taken are: sender sends sndpkt, receiver finds it corrupt and sends NAK, sender receives the NAK and calls udt_send(sndpkt) again, receiver finds the retransmission notcorrupt, delivers the data and sends ACK)
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
  Wait for call 0 from above: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> Wait for ACK or NAK 0
  Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt) -> Wait for ACK or NAK 0
  Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> Wait for call 1 from above
  Wait for call 1 from above: rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) -> Wait for ACK or NAK 1
  Wait for ACK or NAK 1: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt) -> Wait for ACK or NAK 1
  Wait for ACK or NAK 1: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
  Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> Wait for 1 from below
  Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> Wait for 0 from below
  Wait for 1 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
(sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1; receiver states: Wait for 0 from below, Wait for 1 from below)
step 1, sender: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
step 2, receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
step 3, sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
step 4, receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 5, sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
step 6: sender waits for call 1 from above; receiver waits for 1 from below
rdt2.1: Example 2
(sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1; receiver states: Wait for 0 from below, Wait for 1 from below)
step 1, receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 2, sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
step 3, receiver (now waiting for 1): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 4, sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
step 5: sender waits for call 1 from above; receiver waits for 1 from below
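The receiver side of rdt2.1 fits in a few lines. The sketch below is an illustration only: corruption is passed in as a flag instead of being detected with a checksum, and the class and method names are assumptions. The replay mirrors a garbled-ACK case, where the sender retransmits a packet the receiver already delivered.

```python
class Rdt21Receiver:
    """Alternating-bit receiver: delivers each packet once, re-ACKs duplicates."""

    def __init__(self):
        self.expected = 0       # state: waiting for 0 or for 1 from below
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if corrupt:
            return "NAK"
        if seq == self.expected:
            self.delivered.append(data)   # deliver_data(data)
            self.expected ^= 1            # switch to waiting for the other seq #
        # A duplicate (seq != expected before the flip) is ACKed but not delivered.
        return "ACK"


rx = Rdt21Receiver()
replies = [
    rx.rdt_rcv(0, "a"),                 # new packet 0: deliver, ACK
    rx.rdt_rcv(0, "a"),                 # retransmission after garbled ACK: ACK only
    rx.rdt_rcv(1, "b", corrupt=True),   # corrupted packet 1: NAK
    rx.rdt_rcv(1, "b"),                 # retransmitted packet 1: deliver, ACK
]
print(replies, rx.delivered)
```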
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
  Wait for call 0 from above: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> Wait for ACK 0
  Wait for ACK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt) -> Wait for ACK 0
  Wait for ACK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ
receiver FSM fragment:
  Wait for 0 from below: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt) -> Wait for 0 from below
  (state not shown) -> Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
  Wait for call 0 from above: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> Wait for ACK0
  Wait for call 0 from above: rdt_rcv(rcvpkt) / Λ
  Wait for ACK0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
  Wait for ACK0: timeout / udt_send(sndpkt); start_timer
  Wait for ACK0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer -> Wait for call 1 from above
  Wait for call 1 from above: rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> Wait for ACK1
  Wait for call 1 from above: rdt_rcv(rcvpkt) / Λ
  Wait for ACK1: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
  Wait for ACK1: timeout / udt_send(sndpkt); start_timer
  Wait for ACK1: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer -> Wait for call 0 from above
rdt3.0 in action
(a) no loss
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1
receiver: rcv pkt1, send ack1
sender: rcv ack1, send pkt0
receiver: rcv pkt0, send ack0
(b) packet loss
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1 (X: loss)
sender: timeout, resend pkt1
receiver: rcv pkt1, send ack1
sender: rcv ack1, send pkt0
receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (X: loss)
sender: timeout, resend pkt1
receiver: rcv pkt1 (detect duplicate), send ack1
sender: rcv ack1, send pkt0
receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (delayed)
sender: timeout, resend pkt1
receiver: rcv pkt1 (detect duplicate), send ack1
sender: rcv ack1, send pkt0
sender: rcv ack1, do nothing
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• $U_{sender}$: utilization, the fraction of time the sender is busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
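The utilization arithmetic above can be checked with a few lines of Python. The function name and parameters are illustrative assumptions; `window=1` corresponds to stop-and-wait, and a larger `window` models a sender that keeps that many packets in flight.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R), capped at 1."""
    d_trans = packet_bits / link_bps            # D_trans = L/R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u = sender_utilization(packet_bits=8000, link_bps=1e9, rtt_s=0.030)
print(f"U_sender = {u:.5f}")                    # prints: U_sender = 0.00027
print(f"throughput = {u * 1e9 / 8 / 1000:.0f} kB/sec")
```

The second line converts the utilization back into a data rate, which gives roughly the 33 kB/sec quoted on the slide.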
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
- range of sequence numbers must be increased
- buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially: base = 1; nextseqnum = 1
single state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
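The GBN sender FSM above can be sketched as a small Python class. This is a minimal sketch, not a full implementation: the `send` callback and the boolean `timer_running` flag are hypothetical stand-ins for `udt_send` and a real timer, and, unlike the FSM, stale ACKs below `base` are ignored rather than moving `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender state, following the extended FSM (illustrative sketch)."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send              # hypothetical callback: send(seqnum, data)
        self.base = 1                 # oldest unacked sequence number
        self.nextseqnum = 1           # next sequence number to use
        self.sndpkt = {}              # buffered copies of unacked packets
        self.timer_running = False    # stands in for start_timer / stop_timer

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data(data)
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is acknowledged.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)   # assumption: ignore stale ACKs
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        # Go back N: resend every packet in [base, nextseqnum - 1].
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])


sent = []
sender = GBNSender(4, lambda seq, data: sent.append(seq))
for payload in "abcde":
    sender.rdt_send(payload)          # the fifth call is refused (window of 4)
sender.on_ack(2)                      # packets 1 and 2 acknowledged
sender.on_timeout()                   # resends packets 3 and 4
print(sent)
```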
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
- may generate duplicate ACKs
- need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
single state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)
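A matching receiver sketch for the FSM above, again with hypothetical names. It returns the ACK number instead of sending a packet, and a `corrupt` flag stands in for the checksum test.

```python
class GBNReceiver:
    """Go-Back-N receiver: no buffering, only expectedseqnum is remembered."""

    def __init__(self):
        self.expectedseqnum = 1
        self.delivered = []           # data passed to the upper layer, in order

    def rdt_rcv(self, seqnum, data, corrupt=False):
        """Process one packet and return the (cumulative) ACK number to send."""
        if not corrupt and seqnum == self.expectedseqnum:
            self.delivered.append(data)
            self.expectedseqnum += 1
        # Default action: re-ACK the highest in-order packet (0 if none yet).
        return self.expectedseqnum - 1


rcv = GBNReceiver()
acks = [rcv.rdt_rcv(1, "a"), rcv.rdt_rcv(3, "c"), rcv.rdt_rcv(2, "b")]
print(acks, rcv.delivered)
```

Packet 3 arrives out of order here, so it is discarded and ACK 1 is repeated; the sender has to resend it after packet 2.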
GBN in action
[Figure: sender/receiver timeline with sender window (N=4) over seq #'s 0 1 2 3 4 5 6 7 8; pkt2 is lost.]
• sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #'s of sent, unACKed packets
Selective repeat sender receiver windows
54
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
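The receiver rules above can be sketched as follows. This is an illustration under a simplifying assumption: sequence numbers are unbounded integers (no wrap-around), and the class and method names are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver: buffers out-of-order packets, ACKs individually."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0              # smallest not-yet-received sequence number
        self.buffer = {}              # out-of-order packets waiting for delivery
        self.delivered = []           # data passed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase and slide the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                  # already delivered: re-ACK so sender can advance
        return None                   # otherwise: ignore


rcv = SRReceiver(window_size=4)
for n in (0, 1, 3, 4, 5):             # pkt2 is lost on its first transmission
    rcv.on_packet(n, f"d{n}")
print(rcv.delivered)                  # only d0 and d1 so far
rcv.on_packet(2, "d2")                # the retransmission fills the gap
print(rcv.delivered)
```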
Selective repeat in action
[Figure: sender/receiver timeline with sender window (N=4) over seq #'s 0 1 2 3 4 5 6 7 8; pkt2 is lost.]
• sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt) over seq #'s 0 1 2 3 0 1 2, in two scenarios.
(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; the sender advances and sends pkt3 (lost) and a new pkt0; the receiver will accept the packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost; the sender times out and retransmits the old pkt0; the receiver will accept the packet with seq number 0.]
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
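One way to explore the question above is to test the worst case directly: the receiver has advanced by a full window while the sender, having seen no ACKs, still holds the old window. The helper below is a hypothetical sketch; it reports a window as safe only when those two windows share no sequence number, which is the standard condition that the window size be at most half the sequence-number space.

```python
def sr_window_is_safe(seq_space, window):
    """True if the old sender window and the advanced receiver window are disjoint."""
    # Sender still holds seq #'s 0 .. window-1 (all of its ACKs were lost).
    sender_window = {s % seq_space for s in range(0, window)}
    # Receiver got all of them and now expects window .. 2*window-1 (mod seq_space).
    receiver_window = {s % seq_space for s in range(window, 2 * window)}
    return sender_window.isdisjoint(receiver_window)


print(sr_window_is_safe(seq_space=4, window=3))   # the slide's example: False
print(sr_window_is_safe(seq_space=4, window=2))   # True
```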
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide. Rows: source port #, dest port #; sequence number; acknowledgement number; head len, not used, flag bits U A P R S F, receive window; checksum, Urg data pointer; options (variable length); application data (variable length).]
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
- byte stream "number" of first byte in segment's data
acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK
Q: how does receiver handle out-of-order segments?
- A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), mapped onto the sender sequence number space: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable; window size N.]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) in the byte stream, starting at the 100th byte. With window N bytes, the first M bytes are sent in a segment with TCP header (seq. no. = 100); the rest of the message cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
• user types 'C'; A sends Seq=42, ACK=79, data = 'C'
• host B ACKs receipt of 'C', echoes back 'C'; B sends Seq=79, ACK=43, data = 'C'
• host A ACKs receipt of echoed 'C'; A sends Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr]
$EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
  (typically, $\beta = 0.25$)
$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
(EstimatedRTT is the estimated RTT; $4 \cdot DevRTT$ is the "safety margin")
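The three formulas above can be combined into a small estimator. This is a sketch under stated assumptions: the class name is hypothetical, the first sample initialises EstimatedRTT with DevRTT = 0, and DevRTT is updated before EstimatedRTT (the ordering used in RFC 6298); the slides do not fix either choice.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout (illustrative sketch)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0            # assumption: no deviation known yet

    def update(self, sample_rtt):
        # Assumption: deviation is measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)     # milliseconds
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)   # 102.5 5.0 122.5
```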
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
- ignore duplicate acks
- ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initially: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
single state: wait for event

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
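The three sender events above can be sketched in Python. The class is a hypothetical illustration: `send` stands in for passing a segment to IP, `timer_running` replaces a real timer, and flow and congestion control are ignored, as on the slide.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: cumulative ACKs, single retransmission timer."""

    def __init__(self, initial_seq_num, send):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.send = send              # hypothetical callback: send(seq, data)
        self.unacked = {}             # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)   # not-yet-acked segment with smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # send_base - 1: last cumulatively ACKed byte
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


log = []
tcp = SimplifiedTcpSender(92, lambda seq, data: log.append((seq, len(data))))
tcp.on_app_data(b"x" * 8)             # Seq=92, 8 bytes of data
tcp.on_app_data(b"y" * 20)            # Seq=100, 20 bytes of data
tcp.on_timeout()                      # premature timeout: resend Seq=92 only
tcp.on_ack(120)                       # cumulative ACK covers both segments
print(log, tcp.send_base)
```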
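TCP: retransmission scenarios
[Figure: three Host A / Host B timelines.]
lost ACK scenario:
• A sends Seq=92, 8 bytes of data (SendBase=92); B replies ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data; B replies ACK=100 (SendBase=100)
premature timeout:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; B replies ACK=100 and ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive (SendBase=100, then SendBase=120); B answers the duplicate with ACK=120 (SendBase=120)
cumulative ACK:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• ACK=100 is lost (X), but ACK=120 arrives before the timeout
• A continues with Seq=120, 15 bytes of data; nothing is retransmitted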
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
- likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (lost, X). Host B sends ACK=100 four times. On the 3 DUP ACKs, A resends Seq=100, 20 bytes of data before the timeout expires.]
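A duplicate-ACK counter is enough to sketch the fast-retransmit trigger. The class below is hypothetical, and it assumes the usual reading of "triple duplicate ACKs": the third duplicate after the original ACK (the fourth ACK with the same value) triggers the retransmission.

```python
class DupAckDetector:
    """Signals fast retransmit on the third duplicate ACK (illustrative sketch)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack):
        """Return True exactly when this ACK should trigger a fast retransmit."""
        if ack == self.last_ack:
            self.dup_count += 1
            return self.dup_count == self.threshold
        self.last_ack = ack           # new ACK value: reset the duplicate count
        self.dup_count = 0
        return False


detector = DupAckDetector()
print([detector.on_ack(a) for a in (100, 100, 100, 100)])   # [False, False, False, True]
```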
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads. The application may remove data from TCP socket buffers slower than the TCP receiver is delivering (sender is sending).]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to the application process.]
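The advertised window and the sender-side limit can be sketched as two small functions. The names `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked` follow the usual textbook bookkeeping and are assumptions here; the slides only name rwnd and RcvBuffer.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space: RcvBuffer minus data received but not yet read by the app."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered


def sender_may_send(nbytes, last_byte_sent, last_byte_acked, rwnd):
    """True if sending nbytes more keeps in-flight data within the receiver's rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                               # 2096
print(sender_may_send(1000, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
print(sender_may_send(1500, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
```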
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with an application above the network and the same connection information:
connection state: ESTAB
connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
• server: received ACK(y) indicates client is live; server state: ESTAB

TCP 3-way handshake: FSM
(event / actions -> next state; Λ = no action)
client:
  closed: Socket clientSocket = new Socket("hostname", "port number") / SYN(seq=x) -> SYN sent
  SYN sent: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1) -> ESTAB
server:
  closed: Socket connectionSocket = welcomeSocket.accept() / Λ -> listen
  listen: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client -> SYN rcvd
  SYN rcvd: ACK(ACKnum=y+1) / Λ -> ESTAB
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection
client state: ESTAB; server state: ESTAB
• client: clientSocket.close(); sends FINbit=1, seq=x; state FIN_WAIT_1 (can no longer send but can receive data)
• server: sends ACKbit=1, ACKnum=x+1; state CLOSE_WAIT (can still send data)
• client: state FIN_WAIT_2 (wait for server close)
• server: sends FINbit=1, seq=y; state LAST_ACK (can no longer send data)
• client: sends ACKbit=1, ACKnum=y+1; state TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
• server: on receiving the ACK, state CLOSED
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A and Host B send original data at rate λ_in through a router with unlimited shared output link buffers; throughput is λ_out. Plots: λ_out vs. λ_in, rising to R/2 at λ_in = R/2; delay vs. λ_in, growing without bound as λ_in approaches R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A and Host B, finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: λ_out vs. λ_in, rising linearly to R/2 at λ_in = R/2.]

Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: λ_out vs. λ'_in. When sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)]

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λ_out vs. λ'_in. When sending at R/2, some packets are retransmissions, including duplicates that are delivered!]

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
[Figure: Hosts A, B, C, D with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted!
[Plot: λ_out (throughput by blue traffic) vs. λ'_in (offered load by Host A), peaking below R/2 and then falling. Bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) vs. time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half).]
TCP Congestion Control: details
• sender limits transmission:
  $LastByteSent - LastByteAcked \le cwnd$
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $rate \approx \frac{cwnd}{RTT}$ bytes/sec
[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; the in-flight span is bounded by cwnd.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A / Host B timeline: one segment in the first RTT, two segments in the second, four segments in the third.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
(FSM, written as event / actions -> next state; Λ = no action)
initially: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 -> slow start

state "slow start":
  new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  cwnd > ssthresh / Λ -> congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery

state "congestion avoidance":
  new ACK / cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery

state "fast recovery":
  duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK / cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
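The congestion-control FSM above can be sketched as a Python class that tracks only cwnd, ssthresh and the state. This is an illustration under stated assumptions: the class name is hypothetical, retransmission and segment sending are left out, the "+ 3" in the fast-recovery entry is taken to mean 3 MSS, and the slow-start exit test is the slide's cwnd > ssthresh.

```python
class RenoSender:
    """cwnd / ssthresh bookkeeping for the three-state TCP Reno FSM (sketch)."""

    def __init__(self, mss=1, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = mss               # initially cwnd = 1 MSS
        self.ssthresh = ssthresh      # initially 64 KB on the slide
        self.dup_ack_count = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                     # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # about 1 MSS per RTT
        self.dup_ack_count = 0

    def on_duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss   # assumption: "+ 3" means 3 MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_ack_count = 0
        self.state = "slow start"


reno = RenoSender(mss=1, ssthresh=4)
for _ in range(4):
    reno.on_new_ack()                 # slow start: cwnd 2, 3, 4, 5
print(reno.state, reno.cwnd)          # congestion avoidance 5
```

With `mss=1` the window is counted in segments, which keeps the trace easy to follow; passing a byte value for `mss` gives cwnd in bytes instead.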
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: saw tooth of window size over time, oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
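The numbers in this example can be reproduced from the formula above. The function names are illustrative assumptions; MSS is given in bytes and converted to bits so that the throughput is in bits/sec.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_prob):
    """TCP throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))


def required_loss_rate(mss_bytes, rtt_s, target_bps):
    """Solve the same formula for L, given a target throughput."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


def window_segments(mss_bytes, rtt_s, target_bps):
    """In-flight segments needed to fill the pipe: throughput * RTT / MSS."""
    return target_bps * rtt_s / (mss_bytes * 8)


print(int(window_segments(1500, 0.100, 10e9)))        # 83333
print(f"{required_loss_rate(1500, 0.100, 10e9):.1e}")  # 2.1e-10
```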
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the "equal bandwidth share" line and the "full bandwidth utilization" line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving towards the equal-share line.]
(X1, Y1) where X1 + Y1 = R: a point on the full bandwidth utilization line
(X2, Y2) where X2 = Y2: a point on the equal bandwidth share line
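The convergence argument can be illustrated with a toy simulation. It is a sketch under simplifying assumptions that are not on the slide: both sessions have the same RTT, both see every loss, and a loss occurs whenever the combined rate exceeds R. The function name is hypothetical.

```python
def simulate_aimd(x1, x2, capacity, rounds):
    """Two AIMD flows sharing one link; returns their rates after `rounds` steps."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2       # loss: both halve (gap between them halves)
        else:
            x1, x2 = x1 + 1, x2 + 1       # additive increase (gap stays the same)
    return x1, x2


x1, x2 = simulate_aimd(1.0, 70.0, capacity=100.0, rounds=500)
print(round(x1, 2), round(x2, 2), round(abs(x1 - x2), 4))
```

Additive increase leaves the difference between the two rates unchanged and each multiplicative decrease halves it, so the rates approach the equal-share line.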
Fairness (more)
Fairness and UDP:
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1.]
Reliable data transfer: getting started
[Figure: send side and receive side of the rdt protocol over an unreliable channel.]
• rdt_send(): called from above (e.g., by app.). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver
[Figure: FSM notation. An arrow from state 1 to state 2 is labelled "event / actions": the event causing the state transition, and the actions taken on the state transition.]
state: when in this "state", next state uniquely determined by next event
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender:
  state "Wait for call from above":
    rdt_send(data) / packet = make_pkt(data); udt_send(packet)
receiver:
  state "Wait for call from below":
    rdt_rcv(packet) / extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  (How do humans recover from "errors" during conversation?)
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - feedback: control msgs (ACK, NAK) from receiver to sender
rdt2.0: FSM specification
(event / actions -> next state; Λ = no action)
sender:
  state "Wait for call from above":
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call from above"
receiver:
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 FSM for the error-free case: the sender makes sndpkt and sends it; the receiver gets an uncorrupted packet, extracts and delivers the data and sends ACK; the sender receives the ACK and returns to "Wait for call from above".]
rdt2.0: error scenario
[Figure: the rdt2.0 FSM for the error case: the receiver gets a corrupted packet and sends NAK; the sender receives the NAK and retransmits sndpkt; the receiver then gets an uncorrupted packet, delivers the data and sends ACK.]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt21 sender handles garbled ACKNAKs
21
Wait for call 0 from above
sndpkt = make_pkt(0 data checksum)udt_send(sndpkt)
rdt_send(data)
Wait for ACK or NAK 0 udt_send(sndpkt)
rdt_rcv(rcvpkt) ampamp ( corrupt(rcvpkt) ||isNAK(rcvpkt) )
sndpkt = make_pkt(1 data checksum)udt_send(sndpkt)
rdt_send(data)
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp isACK(rcvpkt)
udt_send(sndpkt)
rdt_rcv(rcvpkt) ampamp ( corrupt(rcvpkt) ||isNAK(rcvpkt) )
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp isACK(rcvpkt)
Wait forcall 1 from above
Wait for ACK or NAK 1
LL
rdt2.1: receiver, handles garbled ACK/NAKs
  [Wait for 0 from below]
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> [Wait for 1 from below]
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  [Wait for 1 from below]
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> [Wait for 0 from below]
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
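The rdt2.1 receiver can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the `Packet` class, its `corrupt` flag (standing in for a failed checksum test) and the string replies "ACK"/"NAK" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int               # alternating sequence number: 0 or 1
    data: str
    corrupt: bool = False  # hypothetical stand-in for a failed checksum


class Rdt21Receiver:
    """Two states: waiting for seq 0 or waiting for seq 1."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def rdt_rcv(self, pkt):
        if pkt.corrupt:
            return "NAK"                       # garbled packet: ask for a resend
        if pkt.seq == self.expected:
            self.delivered.append(pkt.data)    # deliver_data(data)
            self.expected = 1 - self.expected  # move to the other state
        # A duplicate (unexpected seq) is not delivered, but is still ACKed,
        # because the sender may have missed the earlier ACK.
        return "ACK"


rx = Rdt21Receiver()
replies = [
    rx.rdt_rcv(Packet(0, "a", corrupt=True)),  # corrupted first copy
    rx.rdt_rcv(Packet(0, "a")),                # clean retransmission
    rx.rdt_rcv(Packet(0, "a")),                # duplicate after a garbled ACK
    rx.rdt_rcv(Packet(1, "b")),
]
```

The third call is the interesting one: the duplicate is acknowledged again but not delivered a second time.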
rdt2.1: Example 1 (animation frames; data packet corrupted on first transmission)
1. sender, [Wait for call 0 from above]: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> [Wait for ACK or NAK 0]
2. receiver, [Wait for 0 from below]: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, [Wait for ACK or NAK 0]: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, [Wait for 0 from below]: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> [Wait for 1 from below]
5. sender, [Wait for ACK or NAK 0]: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> [Wait for call 1 from above]
6. final frame: sender in [Wait for call 1 from above], receiver in [Wait for 1 from below]
rdt2.1: Example 2 (animation frames; ACK garbled on the way back)
1. receiver, [Wait for 0 from below]: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> [Wait for 1 from below]
2. sender, [Wait for ACK or NAK 0]: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, [Wait for 1 from below]: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate: re-ACKed, not delivered again)
4. sender, [Wait for ACK or NAK 0]: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> [Wait for call 1 from above]
5. final frame: sender in [Wait for call 1 from above], receiver in [Wait for 1 from below]
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  [Wait for call 0 from above]
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> [Wait for ACK 0]
  [Wait for ACK 0]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ
receiver FSM fragment
  (into [Wait for 0 from below])
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  [Wait for 0 from below]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
  [Wait for call 0 from above]
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> [Wait for ACK0]
    rdt_rcv(rcvpkt) / Λ
  [Wait for ACK0]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer -> [Wait for call 1 from above]
  [Wait for call 1 from above]
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> [Wait for ACK1]
    rdt_rcv(rcvpkt) / Λ
  [Wait for ACK1]
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer -> [Wait for call 0 from above]
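The stop-and-wait behaviour of rdt3.0 can be simulated with a short Python sketch. It is only an illustration under simplifying assumptions: there is no real timer or channel, a timeout is modelled as "the data or its ACK was lost, so send again", and the function name `stop_and_wait` and the `drop_data`/`drop_ack` parameters are invented for the example.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Simulate an rdt3.0-style exchange over a lossy channel.

    drop_data / drop_ack hold the indexes (0, 1, 2, ...) of the data
    transmissions whose packet, or whose returning ACK, is lost.
    Returns (delivered messages, number of data transmissions).
    """
    delivered = []
    expected = 0   # receiver state: next sequence number it will accept
    seq = 0        # sender state: sequence number of the current packet
    tx = 0         # counts every data transmission, including resends
    for data in messages:
        while True:
            attempt = tx
            tx += 1
            if attempt in drop_data:
                continue                  # packet lost: timeout, resend
            if seq == expected:           # receiver: new packet
                delivered.append(data)
                expected = 1 - expected
            # receiver ACKs the received seq, duplicate or not
            if attempt in drop_ack:
                continue                  # ACK lost: timeout, resend
            break                         # ACK arrived: stop_timer
        seq = 1 - seq                     # alternate 0, 1, 0, 1, ...
    return delivered, tx


# second transmission's packet and fourth transmission's ACK are lost
result = stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={3})
```

Each loss costs exactly one extra transmission, and the sequence number lets the receiver drop the duplicate caused by the lost ACK.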
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost: X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost: X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
  - U_sender: utilization, fraction of time sender busy sending (times below in msec)
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
sender timeline:
  - first packet bit transmitted, t = 0
  - last packet bit transmitted, t = L / R
  - ACK arrives, send next packet, t = RTT + L / R
receiver timeline:
  - first packet bit arrives
  - last packet bit arrives, send ACK
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
sender timeline:
  - first packet bit transmitted, t = 0
  - last bit transmitted, t = L / R
  - ACK arrives, send next packet, t = RTT + L / R
receiver timeline:
  - first packet bit arrives
  - last packet bit arrives, send ACK
  - last bit of 2nd packet arrives, send ACK
  - last bit of 3rd packet arrives, send ACK
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
(three times the unrounded stop-and-wait value of 0.000267; the slide's 0.00081 is 3 x the rounded 0.00027)
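The utilization figures above can be checked numerically. The sketch below is an illustration only; the function name `sender_utilization` and its `window` parameter (number of packets sent back-to-back per round trip) are assumptions for the example, with values taken from the slide (8000-bit packets, 1 Gbps, 30 ms RTT).

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps        # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_three_in_flight = sender_utilization(8000, 1e9, 0.030, window=3)

print(f"stop-and-wait: {u_stop_and_wait:.5f}")
print(f"3 in flight:   {u_three_in_flight:.5f}")
```

With these numbers the link would need a window of several thousand packets before the sender is kept busy all the time, which is the motivation for pipelining.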
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
  initially: Λ / base = 1; nextseqnum = 1 -> [Wait]
  [Wait]
    rdt_send(data) /
      if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
          start_timer
        nextseqnum++
      }
      else
        refuse_data(data)
    timeout /
      start_timer
      udt_send(sndpkt[base])
      udt_send(sndpkt[base+1])
      ...
      udt_send(sndpkt[nextseqnum-1])
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
      base = getacknum(rcvpkt)+1
      if (base == nextseqnum)
        stop_timer
      else
        start_timer
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
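The GBN sender FSM translates almost line for line into a small class. The following is a sketch under stated assumptions rather than a full implementation: the timer is only a boolean flag, `udt_send` is replaced by appending to a `sent` log, and the `max()` guard in `rdt_rcv_ack` is an addition made here (the slide assigns base = acknum + 1 directly).

```python
class GbnSender:
    def __init__(self, N):
        self.N = N
        self.base = 1              # oldest unacked sequence number
        self.nextseqnum = 1        # next sequence number to use
        self.sndpkt = {}           # seq -> data, kept until ACKed
        self.timer_running = False
        self.sent = []             # log of sequence numbers "sent"

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False                       # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)      # udt_send
        if self.base == self.nextseqnum:
            self.timer_running = True          # start_timer
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True              # restart timer
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)              # resend every unacked packet

    def rdt_rcv_ack(self, acknum):
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)         # cumulative ACK frees buffers
        self.base = max(self.base, acknum + 1) # guard added in this sketch
        self.timer_running = self.base != self.nextseqnum


s = GbnSender(N=4)
accepted = [s.rdt_send(d) for d in "abcde"]    # fifth call exceeds the window
s.rdt_rcv_ack(2)                               # ACKs packets 1 and 2
s.timeout()                                    # go back: resend 3 and 4
```

After the timeout the log shows packets 3 and 4 sent twice, which is the "go back N" behaviour.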
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
  initially: Λ / expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum) -> [Wait]
  [Wait]
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(expectedseqnum, ACK, chksum)
      udt_send(sndpkt)
      expectedseqnum++
    default / udt_send(sndpkt)
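A matching sketch of the GBN receiver follows. It is illustrative only: the class name, the `corrupt` flag and the choice to return the ACK number from `rdt_rcv` (instead of sending a packet) are assumptions made for the example.

```python
class GbnReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0          # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Return the ACK number that would be sent in response."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)    # in-order: deliver and ACK it
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets are discarded and the
        # highest in-order sequence number is ACKed again
        return self.last_ack


r = GbnReceiver()
arrivals = [(1, "a"), (2, "b"), (4, "d"), (5, "e"), (3, "c")]
acks = [r.rdt_rcv(seq, data) for seq, data in arrivals]
```

Packets 4 and 5 arrive before 3, so they are thrown away and ACK 2 is repeated; the sender has to resend them after its timeout.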
GBN in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost: X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #'s of sent, unACKed packets
[figure: Selective repeat: sender, receiver windows]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
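The receiver rules above can be sketched as follows. This is an illustration under assumptions: sequence numbers are treated as unbounded integers (no wrap-around), the class and method names are invented for the example, and `rdt_rcv` returns the ACK number (or `None` when the packet is ignored) rather than sending anything.

```python
class SrReceiver:
    def __init__(self, N):
        self.N = N
        self.rcvbase = 0
        self.buffer = {}        # out-of-order packets waiting for a gap to fill
        self.delivered = []

    def rdt_rcv(self, n, data):
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n            # send ACK(n)
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n            # already delivered: re-ACK so the sender can advance
        return None             # otherwise: ignore


rx = SrReceiver(N=4)
arrivals = [(0, "p0"), (1, "p1"), (3, "p3"), (4, "p4"), (5, "p5"), (2, "p2")]
acks = [rx.rdt_rcv(n, d) for n, d in arrivals]
late_duplicate = rx.rdt_rcv(2, "p2")   # retransmission after delivery
far_too_old = rx.rdt_rcv(0, "p0")      # outside both ranges
```

Unlike GBN, packets 3, 4 and 5 are kept while packet 2 is missing, and all four are delivered together once it arrives.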
Selective repeat in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost: X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem
  sender: send pkt0, pkt1, pkt2; ACKs arrive, sender window (after receipt) advances
  sender: send pkt3 (lost: X), then new pkt0
  receiver window (after receipt): will accept packet with seq number 0
(b) oops!
  sender: send pkt0, pkt1, pkt2; all three ACKs lost (X X X)
  sender: timeout, retransmit pkt0
  receiver window (after receipt): will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
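One way to explore the question is to test, for a given sequence-number space and window size, whether the receiver's next window shares any sequence number with the window the sender may still be retransmitting. The sketch below is an illustration, not part of the slides; the function name `sr_window_is_safe` is an assumption. The well-known answer it reproduces is that the window size must be at most half the sequence-number space.

```python
def sr_window_is_safe(seq_space, N):
    """True if a retransmitted old packet can never look like new data.

    Worst case: the receiver got all N packets and moved its window on,
    but every ACK was lost, so the sender may resend the old window.
    """
    old_window = {s % seq_space for s in range(0, N)}       # may be resent
    new_window = {s % seq_space for s in range(N, 2 * N)}   # receiver accepts
    return old_window.isdisjoint(new_window)


slide_example = sr_window_is_safe(seq_space=4, N=3)   # scenario (b) on the slide
smaller_window = sr_window_is_safe(seq_space=4, N=2)
```

With 4 sequence numbers and a window of 3, sequence number 0 is in both windows, which is exactly the ambiguity in case (b); a window of 2 removes it.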
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
fields (each row 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
annotations:
  URG: urgent data (generally not used)
  ACK: ACK # valid
  PSH: push data now
  RST, SYN, FIN: connection estab (setup, teardown commands)
  receive window: # bytes rcvr willing to accept
  sequence number, acknowledgement number: counting by bytes of data (not segments!)
  checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the incoming acknowledgement number is labelled A]
sender sequence number space (window size N):
  sent, ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable
Byte stream in TCP
[figure: HTTP GET message (K bytes) in the byte stream; window of N bytes; TCP header (seq no = 100) followed by M bytes starting at the 100th byte; the part of the message outside the window cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types 'C'
   A -> B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'
   B -> A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'
   A -> B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT, RTT (milliseconds, 100 to 350) versus time (seconds)]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
  (typically, $\beta = 0.25$)
  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
  (estimated RTT plus "safety margin")
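The three formulas can be combined into a small estimator. This is a sketch under assumptions the slides do not state: the first sample initialises EstimatedRTT directly, DevRTT starts at half the first sample, and DevRTT is updated before EstimatedRTT (these choices follow RFC 6298). The class name `RttEstimator` is invented for the example.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initial value

    def update(self, sample_rtt):
        # deviation first, using the EstimatedRTT from before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)                        # one unusually slow round trip
timeout_ms = est.timeout_interval()
```

A single slow sample moves EstimatedRTT only from 100 to 112.5 ms, but it raises DevRTT, so the timeout grows noticeably; that is the "safety margin" at work.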
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
  initially: Λ / NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum -> [wait for event]
  [wait for event]
    data received from application above /
      create segment, seq #: NextSeqNum
      pass segment to IP (i.e., "send")
      NextSeqNum = NextSeqNum + length(data)
      if (timer currently not running)
        start timer
    timeout /
      retransmit not-yet-acked segment with smallest seq #
      start timer
    ACK received, with ACK field value y /
      if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
          start timer
        else stop timer
      }
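The simplified sender can be written out as a small class. This is a sketch under assumptions, not TCP itself: the timer is a boolean flag, "pass segment to IP" is modelled by appending `(seq, length)` to a `wire` log, and the class and attribute names are invented for the example.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = []            # (seq, data) in send order
        self.timer_running = False
        self.wire = []               # (seq, length) of every segment sent

    def send(self, data):
        self.unacked.append((self.next_seq_num, data))
        self.wire.append((self.next_seq_num, len(data)))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True            # start timer

    def timeout(self):
        if self.unacked:
            seq, data = self.unacked[0]          # smallest seq # not yet ACKed
            self.wire.append((seq, len(data)))   # retransmit only that segment
            self.timer_running = True            # restart timer

    def ack(self, y):
        if y > self.send_base:
            self.send_base = y                   # cumulative ACK
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


# values from the "premature timeout" scenario on the slides
tx = SimpleTcpSender(initial_seq=92)
tx.send(b"x" * 8)      # Seq=92, 8 bytes
tx.send(b"y" * 20)     # Seq=100, 20 bytes
tx.timeout()           # fires before ACK=100 arrives: resend Seq=92 only
tx.ack(100)
tx.ack(120)
tx.ack(120)            # late duplicate ACK: no effect
```

Only the oldest segment is retransmitted on timeout, and the cumulative ACK=120 clears everything that was outstanding.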
TCP: retransmission scenarios (Host A, Host B)
lost ACK scenario
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100 (lost: X)
  A: timeout
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100
premature timeout
  A: SendBase=92
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100
  B -> A: ACK=120
  A: timeout (before the ACKs arrive)
  A -> B: Seq=92, 8 bytes of data
  A: SendBase=100, then SendBase=120 as the ACKs arrive
  B -> A: ACK=120
  A: SendBase=120
cumulative ACK
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100 (lost: X)
  B -> A: ACK=120 (arrives before timeout)
  A -> B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   -> immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment, higher-than-expected seq #; gap detected
   -> immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data (lost: X)
  B -> A: ACK=100
  B -> A: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
  A -> B: Seq=100, 20 bytes of data (resent before the timeout expires)
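Duplicate-ACK counting can be sketched as below. The interpretation is an assumption made for the example: the first ACK for a value is the "original", and the retransmission fires on the third further ACK carrying the same value, matching the four ACK=100 arrows in the scenario. The function name `fast_retransmits` is invented.

```python
def fast_retransmits(acks, threshold=3):
    """Return the seq numbers that would be fast-retransmitted.

    acks is the sequence of ACK field values seen by the sender.
    """
    last_ack = None
    dup_count = 0
    resent = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                resent.append(ack)     # resend segment starting at this byte
        else:
            last_ack = ack             # new data ACKed: reset the counter
            dup_count = 0
    return resent


scenario = fast_retransmits([100, 100, 100, 100, 120])
too_few = fast_retransmits([100, 100, 100])
```

In the slide's scenario the segment with Seq=100 is resent as soon as the fourth ACK=100 arrives, well before the timer would expire.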
TCP flow control
[figure: receiver protocol stack; data from sender passes through IP code and TCP code (OS) into TCP socket receiver buffers, then to the application process]
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
state held at each end (client and server), between application and network:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg
   client -> server: SYNbit=1, Seq=x
   client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   client -> server: ACKbit=1, ACKnum=y+1
   client state: ESTAB
4. server: received ACK(y) indicates client is live
   server state: ESTAB
TCP 3-way handshake: FSM
client side
  [closed]
    Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x) -> [SYN sent]
  [SYN sent]
    SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1) -> [ESTAB]
server side
  [closed]
    Socket connectionSocket = welcomeSocket.accept(); / Λ -> [listen]
  [listen]
    SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client -> [SYN rcvd]
  [SYN rcvd]
    ACK(ACKnum=y+1) / Λ -> [ESTAB]
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection (sequence)
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close()
   client -> server: FINbit=1, seq=x
   client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server -> client: ACKbit=1, ACKnum=x+1
   server state: CLOSE_WAIT (can still send data)
   client state: FIN_WAIT_2 (wait for server close)
3. server -> client: FINbit=1, seq=y
   server state: LAST_ACK (can no longer send data)
4. client -> server: ACKbit=1, ACKnum=y+1
   client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
   server state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[figure: Host A and Host B send original data at rate λ_in through a router with unlimited shared output link buffers; throughput λ_out]
[plots: λ_out versus λ_in, levelling off at R/2; delay versus λ_in, growing without bound as λ_in approaches R/2]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in >= λ_in
[figure: Host A and Host B; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[figure: Host A keeps a copy and sends only when there is free buffer space in the finite shared output link buffers, not when there is no buffer space; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
[plot: λ_out versus λ_in, both axes up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[figure: Host A, Host B; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out; free buffer space]
[plot: λ_out versus λ_in, both axes up to R/2] when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[figure: Host A keeps a copy, times out and resends although there is free buffer space]
[plot: λ_out versus λ_in, both axes up to R/2] when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
[figure: Hosts A, B, C, D; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[plot: λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), both axes up to R/2; bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[plot: cwnd (TCP sender congestion window size) versus time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked <= cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $\text{rate} \approx \frac{\text{cwnd}}{RTT}$ bytes/sec
[figure: sender sequence number space; last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is cwnd]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
  initially: Λ / cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 -> [slow start]
  [slow start]
    new ACK / cwnd = cwnd+MSS; dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK / dupACKcount++
    timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
    cwnd > ssthresh / Λ -> [congestion avoidance]
    dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> [fast recovery]
  [congestion avoidance]
    new ACK / cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK / dupACKcount++
    timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> [slow start]
    dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> [fast recovery]
  [fast recovery]
    duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
    new ACK / cwnd = ssthresh; dupACKcount = 0 -> [congestion avoidance]
    timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment -> [slow start]
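The summary FSM can be exercised with a compact sketch. It is an illustration under assumptions: cwnd and ssthresh are measured in units of MSS (so "+ MSS" becomes "+ 1" and "MSS * (MSS/cwnd)" becomes "1/cwnd"), the slow start exit test "cwnd > ssthresh" is evaluated right after each increase, and no segments are actually sent. The class name `RenoCwnd` is invented for the example.

```python
class RenoCwnd:
    def __init__(self, ssthresh=64.0):
        self.state = "slow_start"
        self.cwnd = 1.0              # in MSS
        self.ssthresh = ssthresh     # in MSS (assumed unit for this sketch)
        self.dup_ack_count = 0

    def new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1                         # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1 / self.cwnd             # about +1 MSS per RTT
        self.dup_ack_count = 0

    def dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_ack_count = 0
        self.state = "slow_start"


c = RenoCwnd(ssthresh=4.0)
trace = []
for _ in range(5):                   # four slow start ACKs, one in avoidance
    c.new_ack()
    trace.append((c.state, round(c.cwnd, 3)))
for _ in range(3):                   # triple duplicate ACK
    c.dup_ack()
trace.append((c.state, round(c.cwnd, 3)))
c.new_ack()                          # leaves fast recovery
trace.append((c.state, round(c.cwnd, 3)))
c.timeout()
trace.append((c.state, round(c.cwnd, 3)))
```

The trace shows the three regimes in turn: exponential growth, the gentler per-ACK increase, the halving on triple duplicate ACK, and the reset to 1 MSS on timeout.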
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4} \frac{W}{RTT}$ bytes/sec
[figure: window saw tooth oscillating between W/2 and W, average 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  - to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
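Both numbers on this slide follow from short calculations, sketched below. The function names are invented for the example; the only formula used beyond simple arithmetic is the throughput relation quoted on the slide, rearranged to solve for L.

```python
from math import sqrt


def window_segments(rate_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to fill the pipe: rate * RTT / MSS."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput relation from the slide: 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * mss_bytes * 8 / (rtt_s * sqrt(loss))


def loss_for_throughput(rate_bps, rtt_s, mss_bytes):
    """The same relation solved for the loss probability L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


W = window_segments(10e9, 0.100, 1500)
L = loss_for_throughput(10e9, 0.100, 1500)
print(f"W = {W:.0f} segments, L = {L:.2e}")
```

The required loss rate comes out at roughly two lost segments in every ten billion, which is why standard TCP struggles on such paths.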
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[plot: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R; equal bandwidth share line; full bandwidth utilization line; trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]
  (X1, Y1) where X1+Y1 = R: point on the full bandwidth utilization line
  (X2, Y2) where X2 = Y2: point on the equal bandwidth share line
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 near the source and ECN=11 after the congested router; the TCP ACK segment returned by the destination carries ECE=1.]
Reliable data transfer: getting started
We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions
• use finite state machines (FSMs) to specify sender, receiver
[Figure: FSM notation. An arrow from state 1 to state 2 is labelled with the event causing the state transition (above the line) and the actions taken on the state transition (below the line). state: when in this "state", the next state is uniquely determined by the next event.]
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel
sender (single state "Wait for call from above"):
  rdt_send(data): packet = make_pkt(data); udt_send(packet)
receiver (single state "Wait for call from below"):
  rdt_rcv(packet): extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors? (How do humans recover from "errors" during conversation?)
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr -> sender
rdt2.0: FSM specification
sender
  state "Wait for call from above":
    rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt); stay in this state
    rdt_rcv(rcvpkt) && isACK(rcvpkt): Λ (no action); go to "Wait for call from above"
receiver
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 sender and receiver FSM repeated, illustrating operation with no errors.]
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSM repeated, illustrating the error scenario.]
rdt2.0 has a fatal flaw
what happens if ACK/NAK is corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
  state "Wait for call 0 from above":
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt); stay in this state
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt); stay in this state
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate, not delivered)
  state "Wait for 1 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate, not delivered)
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition shown per step.]
1. sender: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
6. final frame: the FSMs with no transition shown
rdt2.1: Example 2
[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition shown per step.]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (the duplicate is re-ACKed, not delivered)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
5. final frame: the FSMs with no transition shown
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK was received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt); stay in this state
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): Λ
receiver FSM fragment
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt) (resend the last ACK)
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
  state "Wait for call 0 from above":
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
    rdt_rcv(rcvpkt): Λ
  state "Wait for ACK0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
    rdt_rcv(rcvpkt): Λ
  state "Wait for ACK1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1
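The loss scenarios can be replayed with a small simulation. This is a sketch under simplifying assumptions, not the rdt3.0 FSM itself: there is no bit corruption and no premature timeout, a lost transmission is always followed by a timeout, and the function `run_rdt3` with its `drop_data` / `drop_acks` parameters is a hypothetical name for this illustration.

```python
def run_rdt3(messages, drop_data=(), drop_acks=()):
    """Stop-and-wait transfer with alternating sequence numbers 0 and 1.

    drop_data / drop_acks hold the 0-based indices of the data / ACK
    transmissions that the channel loses.
    """
    delivered = []
    log = []
    expected_seq = 0        # receiver: sequence number it is waiting for
    data_tx = ack_tx = 0    # how many data pkts / ACKs have been sent so far
    for i, msg in enumerate(messages):
        seq = i % 2
        while True:
            lost = data_tx in drop_data
            data_tx += 1
            log.append(f"send pkt{seq}" + (" (lost)" if lost else ""))
            if lost:
                log.append("timeout")
                continue                      # resend the same packet
            if seq == expected_seq:
                delivered.append(msg)         # new data: deliver upward
                expected_seq ^= 1
            # a duplicate is not delivered again, but it is still ACKed
            ack_lost = ack_tx in drop_acks
            ack_tx += 1
            log.append(f"send ack{seq}" + (" (lost)" if ack_lost else ""))
            if ack_lost:
                log.append("timeout")
                continue                      # sender never saw the ACK
            break
    return delivered, log


delivered, log = run_rdt3(["a", "b", "c"], drop_data={1}, drop_acks={1})
print(delivered)
print("\n".join(log))
```

Each message is delivered exactly once even though one data packet and one ACK are lost, because the receiver recognises the retransmitted duplicate by its sequence number.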
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

D_trans = L/R = 8000 bits / 10^9 bits/sec = 8 microsecs

• U_sender: utilization, the fraction of time the sender is busy sending

U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
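The utilization figures are easy to reproduce. The helper below is an illustrative sketch; `sender_utilization` and its `window` parameter (the number of packets sent back-to-back before waiting for an ACK) are names introduced here, not part of the slides.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender spends transmitting."""
    d_trans = packet_bits / link_bps            # transmission delay L/R
    return min(1.0, window * d_trans / (rtt_s + d_trans))


L_BITS = 8000      # packet size
R_BPS = 1e9        # 1 Gbps link
RTT_S = 0.030      # 2 x 15 ms propagation delay

u_stop_and_wait = sender_utilization(L_BITS, R_BPS, RTT_S)
u_three_in_flight = sender_utilization(L_BITS, R_BPS, RTT_S, window=3)
throughput_bytes_per_s = (L_BITS / 8) / (RTT_S + L_BITS / R_BPS)

print(f"stop-and-wait: {u_stop_and_wait:.5f}")
print(f"3 packets in flight: {u_three_in_flight:.5f}")
print(f"throughput: {throughput_bytes_per_s / 1000:.1f} kB/sec")
```

With `window=1` this is the stop-and-wait case; larger windows show how sending several packets per round trip raises utilization proportionally until the link is saturated.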
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives; last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
3-packet pipelining increases utilization by a factor of 3:

U_sender = (3L/R) / (RTT + L/R) = 0.024 / 30.008 = 0.0008
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially (Λ): base = 1; nextseqnum = 1
state "Wait":
  rdt_send(data):
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt): Λ
GBN: receiver extended FSM
initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state "Wait":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default: udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering
  - re-ACK pkt with highest in-order seq #
GBN in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: ignore duplicate ACK
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
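The Go-Back-N trace can be reproduced with a short simulation. This is a rough sketch with strong simplifications: ACKs are never lost and arrive instantly, sequence numbers start at 0 and never wrap, and a timeout is modelled as "the sender cannot send anything new and the oldest packet is still unACKed". The function name `gbn_wire_trace` and the `lose_once` parameter are hypothetical.

```python
def gbn_wire_trace(num_pkts, window, lose_once=()):
    """Return every sequence number put on the wire, in order.

    lose_once: sequence numbers whose first transmission is lost.
    """
    pending_loss = set(lose_once)
    base = next_seq = 0      # sender: oldest unACKed seq, next seq to send
    expected = 0             # receiver: next in-order seq (expectedseqnum)
    wire = []

    def transmit(seq):
        nonlocal expected
        wire.append(seq)
        if seq in pending_loss:
            pending_loss.discard(seq)   # dropped this one time only
            return
        if seq == expected:
            expected += 1               # in order: deliver and ACK
        # out-of-order packets are discarded; receiver re-ACKs expected-1

    while base < num_pkts:
        if next_seq < min(base + window, num_pkts):
            transmit(next_seq)          # window has room: send new packet
            next_seq += 1
        else:
            for seq in range(base, next_seq):
                transmit(seq)           # timeout: go back N, resend all
        base = expected                 # cumulative ACK slides the window
    return wire


print(gbn_wire_trace(num_pkts=6, window=4, lose_once={2}))
```

With pkt2 lost once, packets 3, 4 and 5 are each transmitted twice, which is the cost of a receiver that does not buffer out-of-order packets.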
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space.]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2; the ACKs arrive and both windows advance. The sender then sends pkt3 (lost, X) and a new pkt0; the receiver will accept the packet with seq number 0.
(b) oops: sender sends pkt0, pkt1, pkt2; the receiver's window advances, but all three ACKs are lost (X X X). The sender times out and retransmits the old pkt0; the receiver will accept the packet with seq number 0.
• receiver can't see sender side
• receiver behavior identical in both cases: receiver sees no difference in the two scenarios
• something's (very) wrong: duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size is needed to avoid the problem in (b)?
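One way to explore the closing question is to check, for a given sequence-number space and window size, whether the sequence numbers the sender might still retransmit overlap with the ones the receiver is ready to accept as new. The sketch below assumes the worst case from scenario (b), where a full window was received but none of its ACKs reached the sender; `sr_ambiguous` is a hypothetical helper name.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission can look like new data to the receiver."""
    # Sender still holds the old window (none of its ACKs arrived).
    old_window = {s % seq_space for s in range(0, window)}
    # Receiver got everything and has advanced by a full window.
    new_window = {s % seq_space for s in range(window, 2 * window)}
    return bool(old_window & new_window)


for window in (1, 2, 3):
    print(f"seq space 4, window {window}: ambiguous = {sr_ambiguous(4, window)}")
```

With 4 sequence numbers, a window of 3 is ambiguous while a window of 2 is not, which suggests the usual rule that the window size should be at most half the size of the sequence-number space.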
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
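To make the field layout concrete, the sketch below unpacks the fixed 20-byte part of a TCP header with Python's standard `struct` module. The function name `parse_tcp_header` and the sample segment are illustrative; options are skipped rather than decoded, and the checksum is not verified.

```python
import struct

FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed TCP header fields and return them with the payload."""
    (src_port, dst_port, seq, ack, offset_flags,
     rwnd, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (offset_flags >> 12) * 4      # head len is in 32-bit words
    flags = {name: bool(offset_flags & bit) for name, bit in FLAG_BITS.items()}
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags,
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg_ptr,
        "payload": segment[header_len:],
    }


# Hand-built sample: Seq=42, ACK=79, one byte of data, ACK and PSH set.
sample = struct.pack("!HHIIHHHH", 12345, 23, 42, 79,
                     (5 << 12) | 0x18, 4096, 0, 0) + b"C"
fields = parse_tcp_header(sample)
print(fields["seq"], fields["ack"], fields["flags"], fields["payload"])
```

The `!` prefix selects network (big-endian) byte order, which is how the header fields are laid out on the wire.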
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how does the receiver handle out-of-order segments?
  - A: TCP spec doesn't say; up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number (A in the incoming segment), rwnd, checksum, urg pointer.]
[Figure: sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]
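Byte stream in TCP
[Figure: an HTTP GET message (K bytes) in the sender's byte stream with a window of N bytes. A segment starting at the 100th byte carries a TCP header (seq no = 100) and M bytes of the message; the part of the message outside the window cannot be transmitted now.]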
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types 'C' at Host A. A -> B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. B -> A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. A -> B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: SampleRTT and EstimatedRTT for gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100 to 350) versus time (seconds)]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4·DevRTT
(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
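The two moving averages and the timeout formula fit in a few lines. This is a sketch with one ordering assumption that the slides do not settle: DevRTT is updated first, using the EstimatedRTT from before the new sample is folded in. The function name `update_rtt` is hypothetical.

```python
def update_rtt(estimated_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
    """Fold one SampleRTT into the estimators and return the new timeout."""
    # Assumption: deviation is measured against the previous estimate.
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - estimated_rtt)
    estimated_rtt = (1 - alpha) * estimated_rtt + alpha * sample_rtt
    timeout_interval = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout_interval


est, dev = 100.0, 10.0                      # milliseconds
for sample in (140.0, 90.0, 110.0):
    est, dev, timeout = update_rtt(est, dev, sample)
    print(f"sample={sample:.0f}  est={est:.2f}  dev={dev:.2f}  timeout={timeout:.2f}")
```

A single outlier such as the 140 ms sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the intended effect of the 4·DevRTT term.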
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state "wait for event":
  data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  ACK received, with ACK field value y:
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else
        stop timer
    }
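TCP: retransmission scenarios
lost ACK scenario (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100 (lost, X)
  A: timeout; A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100
premature timeout (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data (SendBase=92)
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100
  B -> A: ACK=120
  A: timeout before the ACKs arrive; A -> B: Seq=92, 8 bytes of data
  A: ACK=100 arrives (SendBase=100); ACK=120 arrives (SendBase=120)
  B -> A: ACK=120 in response to the retransmission (SendBase=120)
cumulative ACK (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100 (lost, X)
  B -> A: ACK=120
  A: ACK=120 arrives before the timeout; A -> B: Seq=120, 15 bytes of data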
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)]
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data (lost, X)
  B -> A: ACK=100, followed by three more ACK=100 (3 DUP ACKs)
  A -> B: Seq=100, 20 bytes of data, resent before the timeout expires
TCP flow control
[Figure: receiver protocol stack. Segments from the sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, which the application process reads.]
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to the application process.]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, both holding:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
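TCP 3-way handshake: FSM
server side:
  closed -> listen: Socket connectionSocket = welcomeSocket.accept(); action Λ
  listen -> SYN rcvd: event SYN(x); action SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
  SYN rcvd -> ESTAB: event ACK(ACKnum=y+1); action Λ
client side:
  closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); action SYN(seq=x)
  SYN sent -> ESTAB: event SYNACK(seq=y, ACKnum=x+1); action ACK(ACKnum=y+1)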
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data); client state: FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
5. server: on receiving the ACK, server state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data at rate λ_in into unlimited shared output link buffers; Host B receives throughput λ_out. Plots: λ_out versus λ_in, reaching R/2 at λ_in = R/2; delay versus λ_in.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) into finite shared output link buffers; Host B receives λ_out.]
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: a copy is kept at the sender and sent only when there is free buffer space. Plot: λ_out versus λ_in, both up to R/2.]
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: with no buffer space the packet is dropped; the copy is resent once there is free buffer space. Plot: λ_out versus λ_in, both up to R/2.]
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: a timeout fires while there is free buffer space, so two copies are sent. Plot: λ_out versus λ_in, both up to R/2.]
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D connected over multihop paths through routers with finite shared output link buffers. λin: original data; λ'in: original data plus retransmitted data; λout.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted
[Figure: throughput of blue traffic (λout, up to R/2) vs. offered load by Host A (λ'in). Annotation: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
- single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
- explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) vs. time: additively increase window size until loss occurs, then cut window in half.]
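The sawtooth can be reproduced with a few lines of Python. This is only a sketch: it assumes loss happens exactly when cwnd exceeds a fixed path capacity, which is a simplification not taken from the slides, and `aimd_trace` with its parameters is a hypothetical name.

```python
def aimd_trace(capacity_mss, rounds, start=1.0):
    """Return cwnd (in MSS) observed at the start of each RTT round."""
    cwnd = start
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:   # assumed loss signal: window exceeds capacity
            cwnd = cwnd / 2       # multiplicative decrease: halve cwnd
        else:
            cwnd = cwnd + 1       # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(capacity_mss=8, rounds=14, start=5.0))
```

The printed window climbs by 1 MSS per round, drops to half once it passes the assumed capacity, and then climbs again.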
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
$\text{rate} \approx \frac{cwnd}{RTT}$ bytes/sec
[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent. cwnd spans the in-flight bytes.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
- initially cwnd = 1 MSS
- double cwnd every RTT
- done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT.]
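The link between "increment cwnd for every ACK" and "double cwnd every RTT" can be checked with a short sketch. It assumes every segment sent in a round is ACKed within that round and no loss occurs; `slow_start_windows` is a hypothetical helper name.

```python
def slow_start_windows(rounds):
    """Return cwnd (in MSS) at the start of each RTT during slow start."""
    cwnd = 1                      # initially cwnd = 1 MSS
    windows = []
    for _rtt in range(rounds):
        windows.append(cwnd)
        acks = cwnd               # one ACK per segment sent this RTT
        for _ack in range(acks):
            cwnd += 1             # +1 MSS for every ACK received
    return windows


print(slow_start_windows(5))      # [1, 2, 4, 8, 16]
```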
TCP: detecting, reacting to loss
• loss indicated by timeout:
- cwnd set to 1 MSS
- window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
- dup ACKs indicate network capable of delivering some segments
- cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with three states: slow start, congestion avoidance, fast recovery. Each entry below reads: event / actions → next state.]
initially: Λ / cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start
slow start:
  new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  cwnd > ssthresh / Λ → congestion avoidance
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
congestion avoidance:
  new ACK / cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
fast recovery:
  duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK / cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment → slow start
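The three-state FSM can be sketched as a small Python class. Assumptions made for illustration: cwnd and ssthresh are counted in units of one MSS, the initial ssthresh of 64 MSS stands in for the 64 KB on the slide, segment transmission is left out, and the names `RenoSender`, `on_new_ack`, `on_dup_ack` and `on_timeout` are hypothetical.

```python
from dataclasses import dataclass

MSS = 1.0  # assumption: windows are counted in units of one MSS


@dataclass
class RenoSender:
    cwnd: float = 1.0
    ssthresh: float = 64.0        # assumption: 64 MSS stands in for 64 KB
    dup_acks: int = 0
    state: str = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh               # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                        # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)    # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                      # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the object a sequence of ACK, duplicate-ACK and timeout events shows how a timeout resets cwnd to 1 MSS while a triple duplicate ACK only halves it.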
TCP throughput
• avg. TCP throughput as function of window size, RTT?
- ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
- avg. window size (# in-flight bytes) is 3/4 W
- avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: sawtooth window oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
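Both numbers on this slide follow directly from the formulas. A minimal sketch, with hypothetical function names, that solves the Mathis formula for L and computes the window in segments:

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps: W = rate * RTT / MSS."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def loss_rate_for(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


print(int(window_segments(10e9, 0.1, 1500)))    # 83333
print(f"{loss_rate_for(10e9, 0.1, 1500):.2e}")  # 2.14e-10
```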
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the "equal bandwidth share" line and the "full bandwidth utilization" line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving from (X1, Y1), where X1 + Y1 = R, toward (X2, Y2), where X2 = Y2.]
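The convergence argument can be tried numerically. The sketch below assumes both flows see a loss in the same round whenever their combined rate exceeds the link capacity and that they have equal RTTs; those assumptions and the name `aimd_two_flows` are illustrative, not from the slides.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(rounds):
        if x + y > capacity:          # shared loss event: both flows halve
            x, y = x / 2, y / 2
        else:                         # additive increase: slope 1 in the x-y plane
            x, y = x + 1, y + 1
    return x, y


x, y = aimd_two_flows(1.0, 40.0, capacity=50.0, rounds=200)
print(round(x, 2), round(y, 2))
```

Additive increase keeps the gap between the two rates unchanged and each halving cuts it in half, so the rates end up nearly equal even from a very uneven start.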
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
- do not want rate throttled by congestion control
• instead use UDP:
- send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
- new app asks for 1 TCP, gets rate R/10
- new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is labelled ECN=00 near the source and ECN=11 near the destination; the TCP ACK segment travelling back carries ECE=1.]
rdt1.0: reliable transfer over a reliable channel
• underlying channel perfectly reliable
- no bit errors
- no loss of packets
• separate FSMs for sender, receiver:
- sender sends data into underlying channel
- receiver reads data from underlying channel
sender (state: Wait for call from above)
  rdt_send(data) / packet = make_pkt(data); udt_send(packet)
receiver (state: Wait for call from below)
  rdt_rcv(packet) / extract(packet, data); deliver_data(data)
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
- checksum to detect bit errors
• the question: how to recover from errors?
  How do humans recover from "errors" during conversation?
- acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
- negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
- sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
- error detection
- feedback: control msgs (ACK, NAK) from receiver to sender
rdt2.0: FSM specification
sender
state: Wait for call from above
  rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) → Wait for ACK or NAK
state: Wait for ACK or NAK
  rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ → Wait for call from above
receiver
state: Wait for call from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Figure: the rdt2.0 FSM repeated, tracing the path in which the receiver finds the packet not corrupt, delivers the data and sends ACK, and the sender returns to "Wait for call from above".]
rdt2.0: error scenario
[Figure: the rdt2.0 FSM repeated, tracing the path in which the receiver finds the packet corrupt and sends NAK, and the sender retransmits sndpkt.]
rdt2.0 has a fatal flaw
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
state: Wait for call 0 from above
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → Wait for ACK or NAK 0
state: Wait for ACK or NAK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → Wait for call 1 from above
state: Wait for call 1 from above
  rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) → Wait for ACK or NAK 1
state: Wait for ACK or NAK 1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
state: Wait for 0 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → Wait for 1 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
state: Wait for 1 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → Wait for 0 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
[Figure sequence over the rdt2.1 sender and receiver FSMs. Transitions highlighted, in order:]
1. sender: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.1: Example 2
[Figure sequence over the rdt2.1 sender and receiver FSMs. Transitions highlighted, in order:]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
- state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
- state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
- receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
state: Wait for call 0 from above
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → Wait for ACK 0
state: Wait for ACK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ
receiver FSM fragment
state: Wait for 0 from below
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
transition into Wait for 0 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
- checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
- retransmission will be duplicate, but seq #'s already handles this
- receiver must specify seq # of pkt being ACKed
• requires countdown timer
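The approach can be exercised with a small alternating-bit simulation. This is a sketch under stated assumptions: losses are scripted by transmission index rather than random, a lost packet or ACK is modelled as a timeout followed by retransmission, bit errors are not modelled, and `stop_and_wait` is a hypothetical name.

```python
def stop_and_wait(messages, drop_data=(), drop_acks=()):
    """Simulate a stop-and-wait sender/receiver over a lossy channel.

    drop_data / drop_acks hold the 0-based indices of the data and ACK
    transmissions that the channel loses.
    Returns (messages delivered to the application, data transmissions used).
    """
    delivered = []
    expected = 0              # receiver: seq # it is waiting for
    seq = 0                   # sender: seq # of the current packet
    data_tx = ack_tx = 0
    for msg in messages:
        while True:
            lost = data_tx in drop_data
            data_tx += 1
            if lost:
                continue      # no ACK arrives: timer expires, retransmit
            if seq == expected:
                delivered.append(msg)   # new packet: deliver it
                expected ^= 1
            # receiver ACKs the seq # it just saw, duplicate or not
            ack_lost = ack_tx in drop_acks
            ack_tx += 1
            if ack_lost:
                continue      # timer expires, retransmit (a duplicate)
            break             # ACK received: move to the next packet
        seq ^= 1
    return delivered, data_tx


print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_acks={0}))
# (['a', 'b', 'c'], 5)
```

Each message is delivered exactly once even though packet "a" had to be sent three times: the sequence number lets the receiver recognise the retransmission after the lost ACK as a duplicate.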
rdt3.0 sender
state: Wait for call 0 from above
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK0
  rdt_rcv(rcvpkt) / Λ
state: Wait for ACK0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
  timeout / udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer → Wait for call 1 from above
state: Wait for call 1 from above
  rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK1
  rdt_rcv(rcvpkt) / Λ
state: Wait for ACK1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
  timeout / udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer → Wait for call 0 from above
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• U sender: utilization, fraction of time sender busy sending
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
- range of sequence numbers must be increased
- buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
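The utilization figures for stop-and-wait and for 3-packet pipelining can be reproduced from the same expression. A minimal sketch; `sender_utilization` and its `window` parameter are hypothetical names, and the cap at 1.0 is an added guard for large windows.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: window * (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps          # L/R, transmission delay
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)             # stop-and-wait
u3 = sender_utilization(8000, 1e9, 0.030, window=3)   # 3 packets in flight
print(f"{u1:.5f} {u3:.5f}")                           # 0.00027 0.00080
```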
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
- doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
- when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
- when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially: Λ / base = 1; nextseqnum = 1
state: Wait
  rdt_send(data) /
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout /
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
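The sender FSM translates almost line for line into Python. This is a sketch, not a full protocol: the timer is reduced to a boolean flag, corruption checks are omitted, `send` is a caller-supplied stand-in for udt_send, and the `max` guard against stale ACKs is an addition not present in the FSM. The class and method names are hypothetical.

```python
class GBNSender:
    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True       # start_timer for oldest pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is done.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])     # go back N: resend all unACKed
```

With a window of 4, a fifth `rdt_send` is refused until an ACK advances `base`, and a timeout resends every packet from `base` to `nextseqnum - 1`.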
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
- may generate duplicate ACKs
- need only remember expectedseqnum
• out-of-order pkt:
- discard (don't buffer): no receiver buffering
- re-ACK pkt with highest in-order seq #
initially: Λ / expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state: Wait
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default / udt_send(sndpkt)
GBN in action
sender window (N=4)
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
- buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
- sender timer for each unACKed packet
• sender window
- N consecutive seq #'s
- limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
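The receiver rules map onto a small class. The sketch assumes sequence numbers that never wrap around (a simplification; real headers use k-bit numbers) and ignores corruption; `SRReceiver` and `on_packet` are hypothetical names.

```python
class SRReceiver:
    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}          # out-of-order packets awaiting delivery
        self.delivered = []       # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver every in-order packet now available, advancing the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n              # already delivered: re-ACK so sender can advance
        return None               # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 buffers 3 to 5 and delivers 2 through 5 together once the gap is filled.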
Selective repeat in action
sender window (N=4)
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
  receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
  sender: pkt 2 timeout: send pkt2; record ack4 arrived; record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: two scenarios showing the sender window (after receipt) and the receiver window (after receipt) over the sequence 0 1 2 3 0 1 2.
(a) no problem: pkt0, pkt1, pkt2 are sent and received; the sender goes on to send pkt3, which is lost (X), and then a new pkt0; the receiver will accept packet with seq number 0.
(b) oops: pkt0, pkt1, pkt2 are sent and received, but the ACKs are lost (X X X); the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0.]
receiver can't see sender side: receiver behavior identical in both cases, something's (very) wrong
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
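The question can be explored by brute force. The sketch below checks the worst case from scenario (b): the receiver has accepted a full window and advanced, while every ACK was lost so the sender retransmits the old window. The function name is hypothetical, and the conclusion it supports (window size at most half the sequence-number space) is the standard answer rather than something stated on the slide.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmitted old packet can never land in the new receiver window."""
    # Sender may still retransmit seq #s 0 .. window-1 (all ACKs lost).
    old = set(range(window))
    # Receiver has advanced by `window`, so it now accepts these seq #s.
    new = {(window + i) % seq_space for i in range(window)}
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3))   # False: the slide's example is ambiguous
print(sr_window_is_safe(4, 2))   # True
```

For every sequence-number space tried, the check passes exactly when 2 · window ≤ seq_space.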
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
- one sender, one receiver
• reliable, in-order byte stream:
- no "message boundaries"
• pipelined:
- TCP congestion and flow control set window size
• full duplex data:
- bi-directional data flow in same connection
- MSS: maximum segment size
• connection-oriented:
- handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
- sender will not overwhelm receiver
TCP segment structure
[Figure: segment layout, 32 bits wide, top to bottom:
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)]
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
- byte stream "number" of first byte in segment's data
acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK
Q: how receiver handles out-of-order segments
- A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each with fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure labels: Window (N bytes); HTTP GET message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; remainder of the HTTP GET message cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types 'C'. A → B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. B → A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. A → B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
- but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
- ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
- average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. RTT (milliseconds, 100 to 350) vs. time (seconds), showing SampleRTT and EstimatedRTT.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
- large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
$\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
(typically, $\beta = 0.25$)
$\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
(first term: estimated RTT; second term: "safety margin")
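The two moving averages and the timeout rule fit in a few lines. This is a sketch under assumptions the slides do not state: the first sample initialises EstimatedRTT, DevRTT starts at half the first sample, and DevRTT is updated before EstimatedRTT (the order RFC 6298 uses). `RttEstimator` is a hypothetical name.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2       # assumed initial deviation

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Deviation uses the estimate from before this sample is folded in.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
print(est.update(100.0))                 # 250.0
print(est.update(200.0))                 # 325.0
```

A single 200 ms sample moves EstimatedRTT only from 100 to 112.5 ms, but it widens the safety margin enough to raise the timeout from 250 to 325 ms.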
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
- pipelined segments
- cumulative acks
- single retransmission timer
• retransmissions triggered by:
- timeout events
- duplicate acks
let's initially consider simplified TCP sender:
- ignore duplicate acks
- ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
- think of timer as for oldest unacked segment
- expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
- update what is known to be ACKed
- start timer if there are still unacked segments
TCP sender (simplified)
initially: Λ / NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state: wait for event
  data received from application above /
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout /
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  ACK received, with ACK field value y /
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
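The simplified sender can be sketched in Python. Assumptions: the single retransmission timer is modelled as a boolean flag, `send` is a caller-supplied stand-in for handing a segment to IP, and the class and method names are hypothetical.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq, send):
        self.send = send              # stand-in for "pass segment to IP"
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq # -> data of not-yet-acked segments
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.send((seq, data))
        self.next_seq += len(data)    # seq #s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)   # not-yet-acked segment with smallest seq #
            self.send((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s in self.unacked
                        if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100, then receiving ACK=100 and ACK=120, moves `send_base` from 92 to 100 to 120.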
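TCP: retransmission scenarios
lost ACK scenario (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. B → A: ACK=100 (lost, X)
3. timeout at A. A → B: Seq=92, 8 bytes of data
4. B → A: ACK=100
premature timeout (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data (SendBase=92)
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100, then ACK=120
4. timeout at A before the ACKs arrive. A → B: Seq=92, 8 bytes of data
5. ACK=100 arrives (SendBase=100); ACK=120 arrives (SendBase=120)
6. B → A: ACK=120 for the duplicate segment (SendBase=120)
cumulative ACK (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100 (lost, X)
4. B → A: ACK=120, arrives before the timeout
5. A → B: Seq=120, 15 bytes of data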
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
bull time-out period often relatively longndash long delay before resending
lost packet
bull detect lost segments via duplicate ACKsndash sender often sends many
segments back-to-backndash if segment is lost there will
likely be many duplicate ACKs
73
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo)resend unackedsegment with smallest seq sect likely that unacked
segment lost so donrsquot wait for timeout
TCP fast retransmit
Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; the Seq=100 segment is lost (X). Host B returns ACK=100 four times (the original ACK plus 3 DUP ACKs), and Host A resends Seq=100, 20 bytes of data, before its timeout expires.
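The duplicate-ACK rule can be sketched as a small counter. This is an illustrative sketch only; the class name `DupAckDetector` and its `threshold` parameter are assumptions for the example, not part of any TCP API.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when fast retransmit should fire."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold   # number of duplicates that triggers a resend
        self.last_ack = None         # most recent ACK number seen
        self.dup_count = 0           # duplicates seen for last_ack

    def on_ack(self, ack_num: int) -> bool:
        """Return True exactly when this ACK is the threshold-th duplicate."""
        if ack_num == self.last_ack:
            self.dup_count += 1
            return self.dup_count == self.threshold
        # A different ACK number: remember it and reset the duplicate count.
        self.last_ack = ack_num
        self.dup_count = 0
        return False
```

Feeding it ACK=100 four times in a row, as in the figure, returns `False` three times and `True` on the fourth ACK (the third duplicate).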
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code (in the OS) into the TCP socket receiver buffers, from which the application process reads.
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
Figure: receiver-side buffering. TCP segment payloads arrive into RcvBuffer, which holds buffered data waiting to go to the application process; the remaining free buffer space is rwnd.
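The bookkeeping behind rwnd can be sketched as follows. The variable names `last_byte_rcvd` and `last_byte_read` are assumptions introduced for the example (the slide only shows "buffered data" and "free buffer space"), and both functions are hypothetical helpers rather than a socket API.

```python
def advertised_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Free buffer space the receiver advertises as rwnd."""
    buffered = last_byte_rcvd - last_byte_read   # bytes not yet read by the app
    if not 0 <= buffered <= rcv_buffer:
        raise ValueError("buffered data must fit in RcvBuffer")
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent: int, last_byte_acked: int,
                    rwnd: int, nbytes: int) -> bool:
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd
```

With a 4096-byte RcvBuffer holding 2000 unread bytes, `advertised_window(4096, 3000, 1000)` gives 2096, so a sender with 1000 bytes in flight may send another 1000 bytes but not 1100.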
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
each side (client and server, between application and network) holds:
  connection state: ESTAB
  connection variables: seq # client-to-server and server-to-client; rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
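TCP 3-way handshake: FSM
client side:
  closed -> SYN sent: on Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
  SYN sent -> ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
server side:
  closed -> listen: on Socket connectionSocket = welcomeSocket.accept(); no action (Λ)
  listen -> SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB: on ACK(ACKnum=y+1); no action (Λ)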
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection (client state and server state both start in ESTAB)
1. clientSocket.close(): client sends FINbit=1, seq=x and enters FIN_WAIT_1 (can no longer send but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED; server enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
Figure: Host A sends original data at rate λin into unlimited shared output link buffers; Host B receives throughput λout. Two plots against λin (up to R/2): throughput λout, which tops out at R/2, and delay.
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in >= λin
Figure: Host A offers λin (original data) and λ'in (original data, plus retransmitted data) into finite shared output link buffers; Host B receives λout.
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
Figure: Host A sends a copy only when there is free buffer space in the finite shared output link buffers, and holds it back when there is no buffer space. Plot: λout vs. λin, both axes up to R/2.
Causes/costs of congestion: scenario 2
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Figure: plots of λout vs. λ'in for each case, both axes up to R/2.
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; each offers λin (original data) and λ'in (original data, plus retransmitted data).
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Figure: throughput of blue traffic (λout, up to R/2) vs. offered load by Host A (λ'in, up to R/2), showing bandwidth wastage for packets dropped at the 2nd router.
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) vs. time: additively increase window size until loss occurs, then cut window in half.
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked < cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $\text{rate} \approx \frac{\text{cwnd}}{\text{RTT}}$ bytes/sec
Figure: sender sequence number space, showing last byte ACKed, bytes sent but not-yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
Figure: Host A sends one segment to Host B, then two segments, then four segments, one batch per RTT.
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: no action (Λ); go to congestion avoidance
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
  new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
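The three-state summary can be sketched as a small Python class. This is an illustration under simplifying assumptions: cwnd and ssthresh are measured in units of one MSS, no segments are actually sent or retransmitted, and the class and method names (`RenoSender`, `on_new_ack`, `on_dup_ack`, `on_timeout`) are invented for the example.

```python
MSS = 1.0  # assumption: measure cwnd and ssthresh in units of one MSS


class RenoSender:
    """State machine for slow start, congestion avoidance and fast recovery."""

    def __init__(self, ssthresh: float = 64.0):
        self.state = "slow_start"
        self.cwnd = 1.0 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate window, resume CA
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # doubles cwnd once per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

For example, with `ssthresh=4`, four new ACKs take cwnd from 1 to 5 and move the sender into congestion avoidance; three duplicate ACKs then set ssthresh to 2.5 and cwnd to 5.5 in fast recovery.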
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
Figure: sawtooth of window size oscillating between W/2 and W, with average 3/4 W.
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  - to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
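The two numbers in this example can be checked with a few lines of Python. The function names are assumptions for the sketch; the second function simply solves the throughput formula above for L.

```python
def window_segments(throughput_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """In-flight segments needed to sustain the throughput: rate * RTT / MSS."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * throughput_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)       # about 83,333 segments
loss = required_loss_rate(10e9, 0.100, 1500)  # about 2e-10
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The loss rate comes out at roughly 2.1·10^-10, which the slide rounds to 2·10^-10.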
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
Figure: Connection 2 throughput vs. Connection 1 throughput, each axis from 0 to R, with the full bandwidth utilization line, (X1, Y1) where X1 + Y1 = R, and the equal bandwidth share line, (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".
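A toy simulation makes the convergence argument concrete. It assumes both flows see every loss at the same time and have the same RTT, which is the idealization behind the figure; the function name `aimd_two_flows` and the unit step of 1 per round are assumptions for the sketch.

```python
def aimd_two_flows(x1: float, x2: float, capacity: float, rounds: int):
    """Two synchronized AIMD flows sharing one bottleneck link."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Loss: both flows halve, which also halves the gap between them.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # Additive increase: both gain the same amount, the gap is unchanged.
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2


a, b = aimd_two_flows(1.0, 70.0, capacity=100.0, rounds=1000)
print(f"flow 1 = {a:.3f}, flow 2 = {b:.3f}")
```

Because additive increase preserves the difference between the two rates and every multiplicative decrease halves it, the rates end up essentially equal no matter how unequal they start.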
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination marked ECN=11; the destination returns a TCP ACK segment with ECE=1.
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) from receiver to sender
How do humans recover from "errors" during conversation?
rdt2.0: FSM specification
sender:
  state "Wait for call from above":
    rdt_send(data) -> sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) -> udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) -> Λ (no action); go to "Wait for call from above"
receiver:
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
Path through the rdt2.0 FSM:
1. sender: rdt_send(data) -> sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
2. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
3. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) -> Λ; go to "Wait for call from above"
rdt2.0: error scenario
Path through the rdt2.0 FSM:
1. sender: rdt_send(data) -> sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> udt_send(NAK)
3. sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt) -> udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
5. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) -> Λ; go to "Wait for call from above"
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
  state "Wait for call 0 from above":
    rdt_send(data) -> sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) -> udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) -> Λ; go to "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) -> sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) -> udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) -> Λ; go to "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) -> sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  state "Wait for 1 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
1. sender, in "Wait for call 0 from above": rdt_send(data) -> sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) -> udt_send(sndpkt)
4. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
5. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) -> Λ; go to "Wait for call 1 from above"
rdt2.1: Example 2
1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) -> udt_send(sndpkt)
3. receiver, in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) -> Λ; go to "Wait for call 1 from above"
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
  state "Wait for call 0 from above":
    rdt_send(data) -> sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) -> udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) -> Λ
receiver FSM fragment:
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) -> udt_send(sndpkt)
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
  state "Wait for call 0 from above":
    rdt_send(data) -> sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
    rdt_rcv(rcvpkt) -> Λ
  state "Wait for ACK0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) -> Λ
    timeout -> udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) -> stop_timer; go to "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) -> sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
    rdt_rcv(rcvpkt) -> Λ
  state "Wait for ACK1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) -> Λ
    timeout -> udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) -> stop_timer; go to "Wait for call 0 from above"
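The alternating-bit behavior of this sender can be exercised with a small deterministic simulation. This is a sketch under stated assumptions: the function name `run_rdt3` is invented, loss is scripted by transmission index rather than random, corruption and premature timeouts are not modeled, and a timeout is represented simply by looping back to retransmit.

```python
def run_rdt3(messages, data_drops=(), ack_drops=()):
    """Alternating-bit stop-and-wait over a channel with scripted loss.

    data_drops / ack_drops hold the indices (counting from 0) of the data
    packets / ACKs that the channel loses.
    Returns (messages delivered to the receiver's app, data transmissions).
    """
    delivered = []
    expected = 0        # receiver: seq # it is waiting for
    seq = 0             # sender: seq # of the packet being sent
    data_tx = 0         # data packets put on the channel so far
    ack_tx = 0          # ACKs put on the channel so far
    for msg in messages:
        while True:
            lost = data_tx in data_drops
            data_tx += 1
            if lost:
                continue                # timer expires, retransmit
            if seq == expected:         # new data: deliver and flip expected bit
                delivered.append(msg)
                expected ^= 1
            # Otherwise a duplicate: the receiver just re-ACKs it.
            ack_lost = ack_tx in ack_drops
            ack_tx += 1
            if ack_lost:
                continue                # no ACK arrives, timer expires
            break                       # ACK for seq received, stop timer
        seq ^= 1
    return delivered, data_tx
```

Dropping one data packet or one ACK costs exactly one extra transmission, and in the ACK-loss case the receiver sees a duplicate but delivers each message only once.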
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$
  - U_sender: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.
3-packet pipelining increases utilization by a factor of 3!
  $U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
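The utilization figures for stop-and-wait and for 3-packet pipelining can be reproduced with a short helper. The function name `sender_utilization` and its `window_pkts` parameter are assumptions for this sketch; the cap at 1.0 reflects that a sender cannot be busy more than all of the time.

```python
def sender_utilization(packet_bits: float, rate_bps: float,
                       rtt_s: float, window_pkts: int = 1) -> float:
    """Fraction of time the sender is busy: (N * L/R) / (RTT + L/R), capped at 1."""
    d_trans = packet_bits / rate_bps          # L/R, time to push one packet out
    return min(1.0, window_pkts * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)        # 1 packet per RTT
pipelined = sender_utilization(8000, 1e9, 0.030, window_pkts=3)
print(f"stop-and-wait: {stop_and_wait:.5f}, 3-packet pipeline: {pipelined:.5f}")
```

Under this formula the link in the example only becomes fully utilized once the window reaches several thousand packets (about 30.008 / 0.008, roughly 3751).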
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially: base = 1; nextseqnum = 1
state "Wait":
  rdt_send(data):
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
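The sender FSM can be sketched in Python as follows. This is a minimal illustration, not a full implementation: the class name `GbnSender`, the `sent` log standing in for `udt_send`, and the boolean `timer_running` are assumptions, and the sketch ignores ACKs below `base` (an added guard that the FSM above does not spell out).

```python
class GbnSender:
    """Go-Back-N sender with window N, following the extended FSM."""

    def __init__(self, window: int):
        self.N = window
        self.base = 1                 # oldest unACKed seq #
        self.nextseqnum = 1           # next seq # to assign
        self.sndpkt = {}              # seq # -> data, kept until ACKed
        self.timer_running = False
        self.sent = []                # log of seq #s handed to the channel

    def rdt_send(self, data) -> bool:
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self) -> None:
        self.timer_running = True     # restart timer
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)     # resend every unACKed packet

    def rdt_rcv_ack(self, acknum: int) -> None:
        if acknum < self.base:
            return                    # duplicate or old ACK: ignore (added guard)
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1        # cumulative ACK slides the window
        self.timer_running = self.base != self.nextseqnum
```

With N = 4, a fifth call to `rdt_send` is refused until an ACK slides the window, and a timeout with `base` at 3 and `nextseqnum` at 7 resends packets 3, 4, 5 and 6.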
GBN: receiver extended FSM
initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state "Wait":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default:
    udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
GBN in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
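The receiver rules above can be sketched in Python. For simplicity this sketch assumes unbounded sequence numbers (no wraparound), and the class name `SrReceiver` and the `delivered` list standing in for delivery to the upper layer are assumptions for the example.

```python
class SrReceiver:
    """Selective Repeat receiver with window N and unbounded seq #s."""

    def __init__(self, window: int):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}        # out-of-order packets waiting for the gap to fill
        self.delivered = []     # data handed to the upper layer, in order

    def on_packet(self, n: int, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase, advancing the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n            # already delivered: re-ACK so the sender can advance
        return None             # otherwise: ignore
```

Replaying the "Selective repeat in action" trace with N = 4 (pkt0, pkt1, pkt3, pkt4, pkt5, then the retransmitted pkt2) delivers pkt0 and pkt1 immediately, buffers pkt3 to pkt5, and delivers pkt2 through pkt5 together once pkt2 arrives.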
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
  receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
  sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem: sender sends pkt0, pkt1, pkt2; all three arrive and are ACKed, and both windows advance. Sender then sends pkt3 (lost, X) and a new pkt0; receiver will accept packet with seq number 0.
(b) oops!: sender sends pkt0, pkt1, pkt2; all three arrive and the receiver window advances, but all three ACKs are lost (X X X). Sender times out and retransmits pkt0; receiver will accept packet with seq number 0.
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
receiver can't see sender side; receiver behavior identical in both cases: something's (very) wrong!
Q: what relationship between seq # size and window size to avoid problem in (b)?
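One way to explore the closing question is to check, for a given sequence-number space and window size, whether a retransmitted old packet can collide with a sequence number the receiver now treats as new. The sketch below does that for the worst case in (b), where the receiver has advanced a full window but the sender has not; the function name `sr_is_ambiguous` is an assumption for the example.

```python
def sr_is_ambiguous(seq_space: int, window: int) -> bool:
    """True if an old retransmission can be mistaken for new data."""
    # Sender still holds the first window (no ACKs arrived) ...
    old = {n % seq_space for n in range(0, window)}
    # ... while the receiver has moved on to the next full window.
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


print(sr_is_ambiguous(4, 3))   # the example above: seq #'s 0..3, window size 3
print(sr_is_ambiguous(4, 2))   # same seq # space with a smaller window
```

The two sets overlap exactly when twice the window size exceeds the number of sequence numbers, so the problem in (b) is avoided when the window size is at most half the sequence-number space.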
TCP: Overview  RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure (each row 32 bits wide)
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F:
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor
Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), shown against the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.
Byte stream in TCP
[Figure: an HTTP Get Message (K bytes) viewed as a byte stream; a segment carrying M bytes starts at the 100th byte, so its TCP header has seq no = 100; the window covers N bytes; bytes beyond the window cannot be transmitted now]
TCP seq. numbers, ACKs: simple telnet scenario
1. Host A: user types 'C'
     A → B: Seq=42, ACK=79, data = 'C'
2. Host B: host ACKs receipt of 'C', echoes back 'C'
     B → A: Seq=79, ACK=43, data = 'C'
3. Host A: host ACKs receipt of echoed 'C'
     A → B: Seq=43, ACK=80
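The telnet exchange above can be replayed with a small sketch. The `Segment` and `Endpoint` names are hypothetical, and the model assumes in-order arrival with no loss; it only tracks the two counters each side needs (next byte to send, next byte expected).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int           # byte-stream number of the first data byte
    ack: int           # next byte expected from the other side
    data: bytes = b""


class Endpoint:
    def __init__(self, next_seq: int, next_expected: int):
        self.next_seq = next_seq
        self.next_expected = next_expected

    def send(self, data: bytes = b"") -> Segment:
        seg = Segment(self.next_seq, self.next_expected, data)
        self.next_seq += len(data)      # sequence numbers count bytes
        return seg

    def receive(self, seg: Segment) -> None:
        # Assumption: only in-order segments arrive in this sketch.
        if seg.seq == self.next_expected:
            self.next_expected += len(seg.data)


def telnet_exchange():
    host_a = Endpoint(next_seq=42, next_expected=79)
    host_b = Endpoint(next_seq=79, next_expected=42)
    s1 = host_a.send(b"C")      # user types 'C'
    host_b.receive(s1)
    s2 = host_b.send(b"C")      # B ACKs 'C' and echoes it back
    host_a.receive(s2)
    s3 = host_a.send()          # A ACKs the echoed 'C', no data
    return s1, s2, s3
```

The ACK-only segment at the end carries Seq=43 because the one byte already sent advanced Host A's sequence number.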
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT: gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds), y-axis: RTT (milliseconds, 100 to 350); curves: SampleRTT and EstimatedRTT]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
    DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
    (typically, β = 0.25)
    TimeoutInterval = EstimatedRTT + 4 · DevRTT
    (EstimatedRTT is the estimated RTT, 4 · DevRTT is the "safety margin")
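The EstimatedRTT, DevRTT and TimeoutInterval updates above can be sketched as a small estimator. Two choices here are assumptions rather than something the slides state: the initial DevRTT is half of the first sample, and DevRTT is updated before EstimatedRTT (both follow the convention in RFC 6298). The class name is hypothetical.

```python
class RttEstimator:
    def __init__(self, first_sample: float, alpha: float = 0.125,
                 beta: float = 0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: RFC 6298-style seed

    def update(self, sample_rtt: float) -> None:
        # Assumption: deviation is measured against the old estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt
```

For example, starting from a 100 ms sample and feeding one 120 ms sample moves EstimatedRTT only slightly (to 102.5 ms), which is the smoothing effect the moving average is meant to give.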
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
[FSM with a single state: wait for event]
initialization (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
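A minimal sketch of the simplified sender's three events, assuming no sequence number wraparound and modelling the timer as a boolean flag rather than a real clock. The class and attribute names (`SimplifiedTcpSender`, `sent`, `unacked`) are hypothetical; `sent` is just a log of what would be passed to IP.

```python
class SimplifiedTcpSender:
    def __init__(self, initial_seq: int):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq -> data still awaiting an ACK
        self.timer_running = False
        self.sent = []               # log of (seq, length) passed to IP

    def on_data(self, data: bytes) -> None:
        seq = self.next_seq
        self.unacked[seq] = data
        self.sent.append((seq, len(data)))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self) -> None:
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment, smallest seq
            self.sent.append((seq, len(self.unacked[seq])))
            self.timer_running = True

    def on_ack(self, y: int) -> None:
        if y > self.send_base:       # ACK covers previously unacked bytes
            self.send_base = y
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in done:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)
```

Replaying a premature timeout (two segments sent, the first resent before its ACK arrives) leaves SendBase at 120 and ignores the later duplicate ACK=120.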
TCP: retransmission scenarios
lost ACK scenario:
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100 (lost, X)
  A: timeout
  A → B: Seq=92, 8 bytes of data (retransmission)
  B → A: ACK=100
premature timeout:
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100
  B → A: ACK=120
  A: timeout (before the ACKs arrive)
  A → B: Seq=92, 8 bytes of data (retransmission)
  B → A: ACK=120
  SendBase values marked in the figure: 92, 100, 120, 120
cumulative ACK:
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100 (lost, X)
  B → A: ACK=120 (arrives before the timeout)
  A → B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• event: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
  action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK
• event: arrival of in-order segment with expected seq #. One other segment has ACK pending
  action: immediately send single cumulative ACK, ACKing both in-order segments
• event: arrival of out-of-order segment higher-than-expect seq. #. Gap detected
  action: immediately send duplicate ACK, indicating seq. # of next expected byte
• event: arrival of segment that partially or completely fills gap
  action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK]
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data (lost, X)
  B → A: ACK=100
  B → A: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
  A → B: Seq=100, 20 bytes of data (resent before the timeout expires)
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack: from sender → IP code → TCP code → TCP socket receiver buffers → application process, with the application/OS boundary above the socket buffers]
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
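A minimal sketch of the receiver-side accounting described above, assuming rwnd is simply the free space in RcvBuffer (buffer size minus bytes not yet read by the application). The names `ReceiveBuffer` and `sender_may_send` are hypothetical.

```python
class ReceiveBuffer:
    def __init__(self, rcv_buffer: int = 4096):
        self.rcv_buffer = rcv_buffer   # total buffer size in bytes
        self.buffered = 0              # bytes received but not yet read

    @property
    def rwnd(self) -> int:
        # Free buffer space, advertised to the sender in each segment.
        return self.rcv_buffer - self.buffered

    def on_segment(self, nbytes: int) -> None:
        if nbytes > self.rwnd:
            raise ValueError("sender exceeded advertised window")
        self.buffered += nbytes

    def app_read(self, nbytes: int) -> int:
        n = min(nbytes, self.buffered)
        self.buffered -= n
        return n


def sender_may_send(in_flight: int, rwnd: int, nbytes: int) -> bool:
    # Sender keeps unacked ("in-flight") data within the receiver's rwnd.
    return in_flight + nbytes <= rwnd
```

If the application reads slowly, rwnd shrinks and the sender has to hold back; once the application drains the buffer, rwnd grows again.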
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, each holding:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED → SYNSENT → ESTAB
server state: LISTEN → SYN RCVD → ESTAB
1. client: choose init seq num, x; send TCP SYN msg
     client → server: SYNbit=1, Seq=x
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
     server → client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
     client → server: ACKbit=1, ACKnum=y+1
4. server: received ACK(y) indicates client is live
TCP 3-way handshake: FSM
server side:
  closed → listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
  listen → SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
  SYN rcvd → ESTAB: ACK(ACKnum=y+1) / Λ
client side:
  closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
  SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
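The three handshake messages and the state changes they trigger can be traced with a short sketch. It assumes a loss-free exchange and represents segments as plain dictionaries; the function name and the dictionary keys are hypothetical, chosen to match the labels on the slide.

```python
def three_way_handshake(x: int, y: int):
    """Trace a loss-free handshake with initial sequence numbers x and y."""
    client_state, server_state = "CLOSED", "LISTEN"

    # 1. Client chooses x and sends SYN.
    syn = {"SYNbit": 1, "Seq": x}
    client_state = "SYNSENT"

    # 2. Server chooses y and sends SYNACK, acking the SYN.
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    server_state = "SYN RCVD"

    # 3. Client checks the SYNACK acks its SYN, then ACKs the SYNACK.
    if synack["ACKnum"] == x + 1:
        ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
        client_state = "ESTAB"
        # 4. Server sees ACK(y) and knows the client is live.
        if ack["ACKnum"] == y + 1:
            server_state = "ESTAB"
        return [syn, synack, ack], client_state, server_state
    return [syn, synack], client_state, server_state
```

Each ACKnum is the other side's initial sequence number plus one, because the SYN itself consumes one sequence number.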
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
5. server: on receiving the ACK, enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A and Host B send original data at rate λ_in through a router with unlimited shared output link buffers; throughput is λ_out]
[Plots: λ_out vs λ_in, levelling off at R/2; delay vs λ_in, growing sharply as λ_in nears R/2]
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A and Host B send through a router with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: throughput]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: the sender keeps a copy of each packet and transmits only while the router has free buffer space, not when there is no buffer space]
[Plot: λ_out vs λ_in, both axes up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: λ_out vs λ'_in, both axes up to R/2; when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)]
Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λ_out vs λ'_in, both axes up to R/2; when sending at R/2, some packets are retransmissions, including duplicates that are delivered]
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Host A, Host B, Host C and Host D send over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: throughput]
Causes/costs of congestion: scenario 3
[Plot: λ_out (throughput by blue traffic) vs λ'_in (offered load by Host A), both axes up to R/2; the drop at high load is bandwidth wastage for packets dropped at the 2nd router]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Plot: cwnd (TCP sender congestion window size) vs time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
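The saw tooth can be reproduced with a toy trace. This is only a sketch under a strong assumption: a loss is declared in any RTT in which cwnd exceeds a fixed capacity. The function name and the `capacity_mss` parameter are hypothetical, and cwnd is measured in units of MSS.

```python
def aimd_trace(rtts: int, capacity_mss: float, cwnd: float = 1.0):
    """Return cwnd (in MSS) at the start of each RTT."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd > capacity_mss:   # assumption: loss when over capacity
            cwnd = cwnd / 2       # multiplicative decrease
        else:
            cwnd = cwnd + 1       # additive increase: 1 MSS per RTT
    return trace
```

Starting at 5 MSS with a capacity of 8 MSS, the window climbs 5, 6, 7, 8, 9 and then drops to 4.5, after which the climb repeats.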
TCP Congestion Control: details
[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight"), spanning cwnd | last byte sent]
• sender limits transmission:
    LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
    rate ≈ cwnd / RTT bytes/sec
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B, one burst per RTT, along the time axis]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with three states: slow start, congestion avoidance, fast recovery]
initialization (Λ), enter slow start:
    cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
    new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    cwnd > ssthresh: Λ → congestion avoidance
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
congestion avoidance:
    new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
fast recovery:
    duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
    New ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
    timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment → slow start
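The three-state FSM above can be written as a small class that only tracks state, cwnd, ssthresh and the duplicate ACK counter. This is a sketch, not a TCP implementation: it does not send or retransmit anything, it assumes MSS = 1460 bytes, and it reads "ssthresh + 3" as ssthresh plus 3 MSS. The class name is hypothetical.

```python
MSS = 1460  # assumption: a typical MSS in bytes


class RenoSender:
    def __init__(self):
        self.state = "slow start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd   # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS  # assumption: "+3" is 3 MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow start"
```

Feeding three new ACKs, then three duplicate ACKs, then one new ACK walks the sender from slow start into fast recovery and out into congestion avoidance with cwnd equal to ssthresh.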
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
    avg TCP throughput = (3/4) · W / RTT bytes/sec
[Plot: window size saw tooth oscillating between W/2 and W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
    TCP throughput = 1.22 · MSS / (RTT · √L)
  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
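The two numbers in the example can be checked by rearranging the formulas above: W = throughput · RTT / segment size, and L = (1.22 · MSS / (RTT · throughput))². The function names below are hypothetical, and MSS is taken to be the full 1500-byte segment, as the example does.

```python
def required_window_segments(rate_bps: float, rtt_s: float,
                             mss_bytes: int) -> float:
    """In-flight segments needed to sustain rate_bps."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps: float, rtt_s: float,
                       mss_bytes: int) -> float:
    """Loss probability L from throughput = 1.22 * MSS / (RTT * sqrt(L))."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = required_window_segments(10e9, 0.100, 1500)   # about 83,333 segments
loss = required_loss_rate(10e9, 0.100, 1500)      # about 2e-10
```

Both values agree with the slide to the precision it quotes.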
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Plot: Connection 2 throughput vs Connection 1 throughput, both axes from 0 to R, showing the equal bandwidth share line and the full bandwidth utilization line; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"; (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
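The parallel-connection example is plain arithmetic, assuming every TCP connection on the link gets an equal share. The function name is hypothetical; the result is the fraction of R that the new application receives.

```python
from fractions import Fraction


def new_app_share(existing_conns: int, new_app_conns: int) -> Fraction:
    """Fraction of link rate R obtained by the new application."""
    total = existing_conns + new_app_conns
    return Fraction(new_app_conns, total)
```

With 9 existing connections, 1 new connection gets exactly R/10, while 11 new connections get 11/20 of R, which the slide rounds to R/2.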
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 on the way; the destination returns a TCP ACK segment with ECE=1]
rdt2.0: channel with bit errors
• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors:
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - feedback: control msgs (ACK, NAK) from receiver to sender
rdt2.0: FSM specification
sender:
  state "Wait for call from above":
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) → "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call from above"
receiver:
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: operation with no errors
[Same sender and receiver FSM as in the rdt2.0 FSM specification]
rdt2.0: error scenario
[Same sender and receiver FSM as in the rdt2.0 FSM specification]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) → "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call 0 from above"
rdt2.1: receiver, handles garbled ACK/NAKs
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  state "Wait for 1 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
[Each step shows the sender states (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and the receiver states (Wait for 0 from below, Wait for 1 from below) with one transition taken]
1. sender: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
6. no transition shown (states only)
rdt2.1: Example 2
[Same sender and receiver states as in Example 1, one transition taken per step]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
5. no transition shown (states only)
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ
receiver FSM fragment:
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help … but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handles this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → "Wait for ACK0"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer → "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → "Wait for ACK1"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer → "Wait for call 0 from above"
rdt3.0 in action
(a) no loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → pkt1 lost (X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1; ack1 lost (X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
    D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs
  - U_sender: utilization, fraction of time sender busy sending
    U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
  - if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Timing diagram, sender and receiver, one RTT apart]
  t = 0: first packet bit transmitted
  t = L / R: last packet bit transmitted
  receiver: first packet bit arrives; last packet bit arrives, send ACK
  t = RTT + L / R: ACK arrives, send next packet
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Timing diagram, sender and receiver, one RTT apart]
  t = 0: first packet bit transmitted
  t = L / R: last bit transmitted
  receiver: first packet bit arrives; last packet bit arrives, send ACK
  receiver: last bit of 2nd packet arrives, send ACK
  receiver: last bit of 3rd packet arrives, send ACK
  t = RTT + L / R: ACK arrives, send next packet
3-packet pipelining increases utilization by a factor of 3!
    U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
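The stop-and-wait and 3-packet pipelining utilization figures above follow from the same expression, with the number of packets sent per round trip as a parameter. The function name and the `window` parameter are hypothetical; the cap at 1.0 is an added assumption so that a very large window does not report more than full utilization.

```python
def sender_utilization(packet_bits: float, rate_bps: float,
                       rtt_s: float, window: int = 1) -> float:
    """Fraction of time the sender is busy sending."""
    d_trans = packet_bits / rate_bps          # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)        # about 0.00027
pipelined_3 = sender_utilization(8000, 1e9, 0.030, window=3)  # about 0.0008
```

With the slide's numbers (8000-bit packets, 1 Gbps link, 30 ms RTT), three packets in flight triple the utilization, but the sender is still idle more than 99.9% of the time.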
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
[FSM with a single state: Wait]
initialization (Λ):
    base = 1
    nextseqnum = 1
rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ
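A minimal sketch of the GBN sender events, following the FSM's variable names (base, nextseqnum, sndpkt). It is an illustration under assumptions: the timer is a boolean flag, sequence numbers do not wrap, and `wire` is a hypothetical log standing in for udt_send. One addition not in the FSM is the guard that ignores an ACK which would move base backwards.

```python
class GbnSender:
    def __init__(self, window: int):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq -> data kept for retransmission
        self.timer_running = False
        self.wire = []               # sequence numbers handed to the channel

    def rdt_send(self, data) -> bool:
        if self.nextseqnum >= self.base + self.N:
            return False             # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.wire.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self) -> None:
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(seq)    # go back N: resend all unacked pkts

    def rdt_rcv_ack(self, acknum: int) -> None:
        if acknum + 1 <= self.base:
            return                   # added guard: stale or duplicate ACK
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1       # cumulative ACK slides the window
        self.timer_running = self.base != self.nextseqnum
```

With N = 4, a fifth send is refused until an ACK arrives; after a cumulative ACK for packet 2, a timeout resends only packets 3 and 4.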
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
[FSM with a single state: Wait]
initialization (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
default:
    udt_send(sndpkt)
GBN in action
[Figure: sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]
  sender: send pkt0 → receiver: receive pkt0, send ack0
  sender: send pkt1 → receiver: receive pkt1, send ack1
  sender: send pkt2 → lost (X)
  sender: send pkt3 → receiver: receive pkt3, discard, (re)send ack1
  sender: (wait)
  sender: rcv ack0, send pkt4 → receiver: receive pkt4, discard, (re)send ack1
  sender: rcv ack1, send pkt5 → receiver: receive pkt5, discard, (re)send ack1
  sender: ignore duplicate ACK
  sender: pkt 2 timeout
  sender: send pkt2 → receiver: rcv pkt2, deliver, send ack2
  sender: send pkt3 → receiver: rcv pkt3, deliver, send ack3
  sender: send pkt4 → receiver: rcv pkt4, deliver, send ack4
  sender: send pkt5 → receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
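The three receiver cases above can be sketched as follows. This is a minimal sketch under one simplifying assumption: sequence numbers grow without wrapping, so the finite sequence-number space is not modelled here. The class name `SRReceiver` and its attributes are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver with an unbounded sequence-number space."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets waiting for a gap to fill
        self.delivered = []    # hypothetical stand-in for delivery to upper layer

    def rdt_rcv(self, n, data):
        """Return the ACK number sent for packet n, or None if it is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver every packet that is now in order and slide the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore


rx = SRReceiver(window_size=4)
print([rx.rdt_rcv(n, f"p{n}") for n in (0, 1, 3, 4, 5, 2)], rx.delivered)
```

Re-ACKing packets in [rcvbase-N, rcvbase-1] matters because the sender's window may still cover them if the earlier ACKs were lost.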
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender:   send pkt0
sender:   send pkt1
sender:   send pkt2 (X loss)
sender:   send pkt3
sender:   (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender:   rcv ack0, send pkt4
sender:   rcv ack1, send pkt5
sender:   record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender:   pkt 2 timeout
sender:   send pkt2
sender:   record ack4 arrived
sender:   record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3

(a) no problem
sender:   send pkt0, pkt1, pkt2
receiver: receives pkt0, pkt1, pkt2; ACKs arrive at sender; receiver window (after receipt) moves to 3, 0, 1
sender:   send pkt3 (X lost), then send pkt0 (new data)
receiver: will accept packet with seq number 0

(b) oops!
sender:   send pkt0, pkt1, pkt2
receiver: receives pkt0, pkt1, pkt2; all three ACKs are lost (X X X); receiver window (after receipt) moves to 3, 0, 1
sender:   timeout, retransmit pkt0
receiver: will accept packet with seq number 0

receiver can't see sender side; receiver behavior is identical in both cases: something's (very) wrong!
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
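One way to explore the question above is to enumerate the worst case directly. The sketch below assumes the scenario from (b): the receiver has accepted a full window, but every ACK was lost, so the sender may still retransmit the old window while the receiver already accepts the next one. The function name `sr_is_ambiguous` is hypothetical.

```python
def sr_is_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for a new packet."""
    # seq #s the sender may still retransmit (first window, ACKs all lost)
    old = {n % seq_space for n in range(0, window)}
    # seq #s the receiver accepts as new after advancing by one full window
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


for space, win in [(4, 3), (4, 2), (8, 4), (8, 5)]:
    print(space, win, sr_is_ambiguous(space, win))
```

Under this model the two sets stay disjoint exactly when the window is at most half the sequence-number space, which is the standard answer given for selective repeat.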
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
(each row is 32 bits wide)
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
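The fixed 20-byte part of the header can be unpacked with Python's standard `struct` module. This is a minimal sketch: the function name `parse_tcp_header` and the returned dictionary keys are assumptions for illustration, and options are skipped rather than decoded.

```python
import struct

# flag bit positions in the low 6 bits of the 16-bit "head len / flags" word
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed TCP header fields (network byte order)."""
    (src, dst, seq, ack, off_flags, rwnd, checksum, urg_ptr) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    header_len = (off_flags >> 12) * 4      # head len is counted in 32-bit words
    flags = {name for name, bit in FLAG_BITS.items() if off_flags & bit}
    return {"src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
            "header_len": header_len, "flags": flags, "rwnd": rwnd,
            "checksum": checksum, "urg_ptr": urg_ptr,
            "data": segment[header_len:]}


# hand-built example segment: 5-word header, ACK and PSH set, one data byte
raw = struct.pack("!HHIIHHHH", 12345, 23, 42, 79, (5 << 12) | 0x18, 4096, 0, 0) + b"C"
print(parse_tcp_header(raw))
```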
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how does receiver handle out-of-order segments?
  - A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number A, rwnd, checksum, urg pointer), shown against the sender sequence number space: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable; window size N]
Byte stream in TCP
[Figure labels: HTTP GET message (K bytes); window: N bytes; segment of M bytes starting at the 100th byte, TCP header (seq no = 100); remainder of message: cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario

Host A: user types 'C'
Host A -> Host B: Seq=42, ACK=79, data = 'C'
Host B: host ACKs receipt of 'C', echoes back 'C'
Host B -> Host A: Seq=79, ACK=43, data = 'C'
Host A: host ACKs receipt of echoed 'C'
Host A -> Host B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
[Figure: SampleRTT and EstimatedRTT, gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350]

$$\mathrm{EstimatedRTT} = (1-\alpha)\cdot\mathrm{EstimatedRTT} + \alpha\cdot\mathrm{SampleRTT}$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$\mathrm{DevRTT} = (1-\beta)\cdot\mathrm{DevRTT} + \beta\cdot|\mathrm{SampleRTT}-\mathrm{EstimatedRTT}|$$

(typically, β = 0.25)

$$\mathrm{TimeoutInterval} = \mathrm{EstimatedRTT} + 4\cdot\mathrm{DevRTT}$$

(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
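The three formulas combine into a small estimator. This is a minimal sketch: the slides do not say how the estimator is initialised or in which order the two averages are updated, so seeding EstimatedRTT with the first sample, seeding DevRTT with half of it, and updating DevRTT before EstimatedRTT are all assumptions here. The class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample      # assumption: seed with first sample
        self.dev_rtt = first_sample / 2        # assumption: seed with half of it

    def update(self, sample_rtt):
        """Fold one SampleRTT (not from a retransmission) into the estimates."""
        # assumption: deviation is measured against the previous EstimatedRTT
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A single outlier of 200 ms moves EstimatedRTT only from 100 to 112.5 ms, but it widens the safety margin noticeably, which is the intended effect of the DevRTT term.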
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)

State: wait for event
  initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

  on data received from application above:
    create segment, seq #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer

  on timeout:
    retransmit not-yet-acked segment with smallest seq #
    start timer

  on ACK received, with ACK field value y:
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
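The three event handlers above translate almost line by line into code. This is a minimal sketch, not a real TCP: the class name `SimplifiedTcpSender`, the `sent` log (standing in for "pass segment to IP") and the boolean `timer_running` (standing in for an actual timer) are assumptions for illustration.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow or congestion control."""

    def __init__(self, initial_seq_num=0):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}            # seq # -> data of not-yet-acked segments
        self.timer_running = False
        self.sent = []               # hypothetical log of (seq #, data) handed to IP

    def send(self, data: bytes):
        """Event: data received from application above."""
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        """Event: timer expired. Retransmit the oldest unacked segment only."""
        if self.unacked:
            seq = min(self.unacked)
            self.sent.append((seq, self.unacked[seq]))
        self.timer_running = bool(self.unacked)

    def ack(self, y):
        """Event: ACK received with ACK field value y (cumulative)."""
        if y > self.send_base:
            self.send_base = y
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


tx = SimplifiedTcpSender(initial_seq_num=92)
tx.send(b"12345678")      # Seq=92, 8 bytes
tx.timeout()              # ACK lost: retransmit Seq=92
tx.ack(100)
print(tx.send_base, tx.timer_running, [s for s, _ in tx.sent])
```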
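TCP: retransmission scenarios

lost ACK scenario
Host A -> Host B: Seq=92, 8 bytes of data
Host B -> Host A: ACK=100 (X lost)
Host A: timeout (SendBase=92)
Host A -> Host B: Seq=92, 8 bytes of data
Host B -> Host A: ACK=100

premature timeout
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data
Host B -> Host A: ACK=100
Host B -> Host A: ACK=120
Host A: timeout before the ACKs arrive
Host A -> Host B: Seq=92, 8 bytes of data
Host A: ACKs arrive (SendBase=100, then SendBase=120)
Host B -> Host A: ACK=120 (SendBase=120)

cumulative ACK
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data
Host B -> Host A: ACK=100 (X lost)
Host B -> Host A: ACK=120 (arrives before timeout)
Host A -> Host B: Seq=120, 15 bytes of data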
TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action

1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   -> immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   -> immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
• likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data (X lost)
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100 (3 DUP ACKs)
Host A -> Host B: Seq=100, 20 bytes of data (resent before the timeout expires)
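Counting duplicate ACKs at the sender takes only a few lines. This is a minimal sketch: the class name `FastRetransmitDetector` is hypothetical, and the caller is assumed to do the actual retransmission when a sequence number is returned.

```python
class FastRetransmitDetector:
    """Tracks duplicate ACKs and signals when to resend without waiting for timeout."""

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return the seq # to retransmit now, or None."""
        if y > self.send_base:           # new data acknowledged
            self.send_base = y
            self.dup_acks = 0
            return None
        if y == self.send_base:          # duplicate ACK for the same data
            self.dup_acks += 1
            if self.dup_acks == 3:
                return y                 # unacked segment with smallest seq #
        return None


det = FastRetransmitDetector(send_base=92)
print([det.on_ack(100) for _ in range(4)])
```

The first ACK=100 is a normal acknowledgement; the next three are duplicates, and the third duplicate triggers the resend of the segment starting at byte 100.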
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. From sender -> IP code -> TCP code -> TCP socket receiver buffers (OS) -> application process]
• application may remove data from TCP socket buffers ...
• ... slower than TCP receiver is delivering (sender is sending)

TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

client side (application / network):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
  Socket clientSocket = new Socket("hostname", "port number");

server side (application / network):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
  Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake

client state: CLOSED; server state: LISTEN

client: choose init seq num, x; send TCP SYN msg
client -> server: SYNbit=1, Seq=x
client state: SYNSENT

server: choose init seq num, y; send TCP SYNACK msg, acking SYN
server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
server state: SYN RCVD

client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
client -> server: ACKbit=1, ACKnum=y+1
client state: ESTAB

server: received ACK(y) indicates client is live
server state: ESTAB
TCP 3-way handshake: FSM

closed -> listen
  on: Socket connectionSocket = welcomeSocket.accept();
  action: Λ
listen -> SYN rcvd
  on: SYN(x)
  action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYN rcvd -> ESTAB
  on: ACK(ACKnum=y+1)
  action: Λ
closed -> SYN sent
  on: Socket clientSocket = new Socket("hostname", "port number");
  action: SYN(seq=x)
SYN sent -> ESTAB
  on: SYNACK(seq=y, ACKnum=x+1)
  action: ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection
client state: ESTAB; server state: ESTAB

client: clientSocket.close()
client -> server: FINbit=1, seq=x
client state: FIN_WAIT_1 (can no longer send but can receive data)

server -> client: ACKbit=1, ACKnum=x+1
server state: CLOSE_WAIT (can still send data)
client state: FIN_WAIT_2 (wait for server close)

server -> client: FINbit=1, seq=y
server state: LAST_ACK (can no longer send data)

client -> server: ACKbit=1, ACKnum=y+1
client state: TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
server state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission

[Figure: Host A and Host B send original data at rate λ_in into unlimited shared output link buffers; throughput is λ_out]
[Plot 1: λ_out versus λ_in; throughput levels off at R/2]
[Plot 2: delay versus λ_in; delay grows sharply as λ_in approaches R/2]

• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B share finite output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: finite shared output link buffers. Host A keeps a copy of each packet and sends when there is free buffer space, holds back when there is no buffer space; λ_in: original data; λ'_in: original data, plus retransmitted data]
[Plot: λ_out versus λ_in, rising linearly up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Plot: λ_out versus λ'_in, both axes up to R/2]
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Plot: λ_out versus λ'_in, both axes up to R/2]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit

[Figure: Hosts A, B, C, D send over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output]

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Plot: λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), axes up to R/2; the drop in throughput reflects bandwidth wastage for packets dropped at the 2nd router]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Approaches towards congestion control
two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) versus time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
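The saw tooth can be reproduced with a few lines of simulation. This is a minimal sketch: cwnd is counted in MSS units, one loop iteration is one RTT, and the RTTs at which loss is detected are supplied by the caller rather than derived from a network model. The function name `aimd` and the floor of 1 MSS are assumptions.

```python
def aimd(rounds, loss_at, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under AIMD."""
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_at:
            cwnd = max(cwnd / 2, 1.0)   # multiplicative decrease: halve after loss
        else:
            cwnd += 1.0                 # additive increase: +1 MSS per RTT
    return trace


print(aimd(rounds=8, loss_at={3, 6}, cwnd=4.0))
```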
TCP Congestion Control: details
• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$\text{rate} \approx \frac{\mathrm{cwnd}}{\mathrm{RTT}}\ \text{bytes/sec}$$

[Figure: sender sequence number space. last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight region spans at most cwnd]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B in the first RTT, two segments in the second, four segments in the third]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 -> slow start

State: slow start
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: Λ -> congestion avoidance
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

State: congestion avoidance
  new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start

State: fast recovery
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
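The three-state machine above can be sketched as a small class that only tracks the window, leaving transmission to the caller. This is a minimal sketch under stated assumptions: cwnd and ssthresh are in bytes, the "+ 3" in fast recovery is read as 3 MSS, the default MSS of 1460 bytes is an arbitrary illustrative value, and the class name `RenoCwnd` is hypothetical.

```python
class RenoCwnd:
    """Window bookkeeping for slow start, congestion avoidance and fast recovery."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.cwnd = float(mss)
        self.ssthresh = 64 * 1024.0
        self.dup_acks = 0
        self.state = "slow_start"

    def new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh                      # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                          # doubles cwnd per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd) # about +1 MSS per RTT
        self.dup_acks = 0

    def dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"                   # caller retransmits here

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow_start"                          # caller retransmits here


c = RenoCwnd(mss=1000)
c.new_ack()
for _ in range(3):
    c.dup_ack()
print(c.state, c.cwnd, c.ssthresh)
```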
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4}\cdot\frac{W}{\mathrm{RTT}}\ \text{bytes/sec}$$

[Plot: window size oscillating between W/2 and W over time]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22\cdot\mathrm{MSS}}{\mathrm{RTT}\,\sqrt{L}}$$

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
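The two numbers in this example follow directly from the formulas. The sketch below simply rearranges them; the function names are hypothetical, and MSS is taken as the full 1500-byte segment, as the example does.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps: W = rate * RTT / MSS."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


W = window_segments(10e9, 0.100, 1500)
L = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {W:.0f} segments, L = {L:.2e}")
```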
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal-share line]
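The convergence argument can be checked numerically. This is a minimal sketch under idealised assumptions: both connections have the same RTT, both see every loss at the same time, and loss happens exactly when their combined rate exceeds the capacity. The function name `two_flow_aimd` is hypothetical.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Evolve the throughputs of two synchronised AIMD flows for a number of RTTs."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2     # loss: both halve, so the gap x - y halves too
        else:
            x, y = x + 1, y + 1     # additive increase: the gap is unchanged
    return x, y


x, y = two_flow_aimd(10.0, 70.0, capacity=100.0, rounds=200)
print(round(x, 2), round(y, 2))
```

Additive increase moves the operating point along a 45-degree line, while each halving pulls it toward the origin and so halves the distance from the equal-share line; repeated cycles therefore shrink the initial gap of 60 toward zero.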
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1]
rdt2.0: FSM specification

sender
State: Wait for call from above
  on rdt_send(data):
    sndpkt = make_pkt(data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK
State: Wait for ACK or NAK
  on rdt_rcv(rcvpkt) && isNAK(rcvpkt):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && isACK(rcvpkt):
    Λ
    -> Wait for call from above

receiver
State: Wait for call from below
  on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    udt_send(NAK)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    udt_send(ACK)
rdt2.0: operation with no errors
(same FSM as above; path taken)
1. sender, Wait for call from above: rdt_send(data) -> sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver, Wait for call from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
3. sender, Wait for ACK or NAK: rdt_rcv(rcvpkt) && isACK(rcvpkt) -> Λ; back to Wait for call from above
rdt2.0: error scenario
(same FSM as above; path taken)
1. sender, Wait for call from above: rdt_send(data) -> sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver, Wait for call from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> udt_send(NAK)
3. sender, Wait for ACK or NAK: rdt_rcv(rcvpkt) && isNAK(rcvpkt) -> udt_send(sndpkt)
4. receiver, Wait for call from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
5. sender, Wait for ACK or NAK: rdt_rcv(rcvpkt) && isACK(rcvpkt) -> Λ; back to Wait for call from above
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs

State: Wait for call 0 from above
  on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK 0
State: Wait for ACK or NAK 0
  on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
    Λ
    -> Wait for call 1 from above
State: Wait for call 1 from above
  on rdt_send(data):
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK 1
State: Wait for ACK or NAK 1
  on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
    Λ
    -> Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

State: Wait for 0 from below
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    -> Wait for 1 from below
  on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
State: Wait for 1 from below
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    -> Wait for 0 from below
  on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
rdt2.1: Example 1
(sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1; receiver states: Wait for 0 from below, Wait for 1 from below)
1. sender, Wait for call 0 from above: rdt_send(data) -> sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) -> udt_send(sndpkt)
4. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) -> Λ
6. end: sender in Wait for call 1 from above, receiver in Wait for 1 from below
rdt2.1: Example 2
(same sender and receiver states as in Example 1)
1. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) -> udt_send(sndpkt)
3. receiver, Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) -> Λ
5. end: sender in Wait for call 1 from above, receiver in Wait for 1 from below
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments

sender FSM fragment
State: Wait for call 0 from above
  on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK 0
State: Wait for ACK 0
  on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
    Λ

receiver FSM fragment
State: Wait for 0 from below
  entered on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK1, chksum)
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
    udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender

State: Wait for call 0 from above
  on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK0
  on rdt_rcv(rcvpkt):
    Λ
State: Wait for ACK0
  on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)):
    Λ
  on timeout:
    udt_send(sndpkt)
    start_timer
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
    stop_timer
    -> Wait for call 1 from above
State: Wait for call 1 from above
  on rdt_send(data):
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK1
  on rdt_rcv(rcvpkt):
    Λ
State: Wait for ACK1
  on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)):
    Λ
  on timeout:
    udt_send(sndpkt)
    start_timer
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1):
    stop_timer
    -> Wait for call 0 from above
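Because the two halves of the sender FSM differ only in the sequence bit, the four states collapse into one bit plus a waiting flag. This is a minimal sketch: the class name `Rdt30Sender`, the `channel` list (standing in for `udt_send`) and the boolean `timer_running` (standing in for a real countdown timer) are assumptions for illustration.

```python
class Rdt30Sender:
    """Alternating-bit (stop-and-wait) sender with retransmission on timeout."""

    def __init__(self):
        self.seq = 0                  # sequence bit of the current packet
        self.waiting_for_ack = False
        self.sndpkt = None
        self.timer_running = False
        self.channel = []             # hypothetical log of packets handed to udt_send

    def _udt_send(self):
        self.channel.append(self.sndpkt)
        self.timer_running = True     # start_timer

    def rdt_send(self, data):
        """Call from above. Returns False while a packet is still outstanding."""
        if self.waiting_for_ack:
            return False
        self.sndpkt = (self.seq, data)
        self._udt_send()
        self.waiting_for_ack = True
        return True

    def rdt_rcv(self, ack_seq, corrupt=False):
        """ACK arrival. Corrupt or wrong-numbered ACKs are ignored (Λ)."""
        if not self.waiting_for_ack or corrupt or ack_seq != self.seq:
            return
        self.timer_running = False    # stop_timer
        self.waiting_for_ack = False
        self.seq = 1 - self.seq

    def timeout(self):
        if self.waiting_for_ack:
            self._udt_send()          # resend and restart the timer


s = Rdt30Sender()
s.rdt_send("a")
s.timeout()          # pkt0 or its ACK was lost
s.rdt_rcv(0)
s.rdt_send("b")
print(s.channel)
```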
rdt3.0 in action

(a) no loss
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1
receiver: rcv pkt1, send ack1
sender:   rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(b) packet loss
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1 (X loss)
sender:   timeout, resend pkt1
receiver: rcv pkt1, send ack1
sender:   rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

rdt3.0 in action

(c) ACK loss
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (X loss)
sender:   timeout, resend pkt1
receiver: rcv pkt1 (detect duplicate), send ack1
sender:   rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (delayed)
sender:   timeout, resend pkt1
sender:   rcv ack1, send pkt0
receiver: rcv pkt1 (detect duplicate), send ack1
receiver: rcv pkt0, send ack0
sender:   rcv ack1, do nothing
sender:   rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$$

• U_sender: utilization, the fraction of time sender busy sending

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
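The arithmetic on this slide can be checked with a short calculation. This is a minimal sketch; the function names are hypothetical, and RTT is taken as twice the 15 ms propagation delay, as the slide does.

```python
def stop_and_wait_utilization(link_bps, rtt_s, packet_bits):
    """U_sender = (L/R) / (RTT + L/R) for a stop-and-wait sender."""
    d_trans = packet_bits / link_bps
    return d_trans / (rtt_s + d_trans)


def stop_and_wait_throughput_bps(link_bps, rtt_s, packet_bits):
    """One packet of L bits is delivered every RTT + L/R seconds."""
    return packet_bits / (rtt_s + packet_bits / link_bps)


u = stop_and_wait_utilization(1e9, 0.030, 8000)
bps = stop_and_wait_throughput_bps(1e9, 0.030, 8000)
print(f"U_sender = {u:.5f}, throughput = {bps / 8 / 1000:.1f} kB/sec")
```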
rdt3.0: stop-and-wait operation

sender:   first packet bit transmitted, t = 0
sender:   last packet bit transmitted, t = L/R
receiver: first packet bit arrives
receiver: last packet bit arrives, send ACK
sender:   ACK arrives, send next packet, t = RTT + L/R

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization

sender:   first packet bit transmitted, t = 0
sender:   last bit transmitted, t = L/R
receiver: first packet bit arrives
receiver: last packet bit arrives, send ACK
receiver: last bit of 2nd packet arrives, send ACK
receiver: last bit of 3rd packet arrives, send ACK
sender:   ACK arrives, send next packet, t = RTT + L/R

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
single state: Wait
initially:
    base = 1
    nextseqnum = 1
rdt_send(data):
    if (nextseqnum < base+N)
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    else
        refuse_data(data)
timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
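The sender FSM above maps fairly directly onto a small class. The following is a minimal sketch, not a full implementation: the `send` callback is a hypothetical stand-in for udt_send, the timer is reduced to a boolean flag, and the `max()` guard against stale ACKs is an addition that the FSM itself does not contain.

```python
class GBNSender:
    """Go-Back-N sender state machine (sketch; no real timer or network)."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # hypothetical callback standing in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> buffered payload
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is acknowledged.
        # max() guards against a stale ACK moving base backwards (an assumption).
        self.base = max(self.base, acknum + 1)
        for seq in [q for q in self.sndpkt if q < self.base]:
            del self.sndpkt[seq]
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        # Go back N: resend every unacked packet in the window.
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])


sent = []
s = GBNSender(4, lambda seq, data: sent.append(seq))
accepted = [s.rdt_send(f"d{i}") for i in range(5)]   # fifth call hits a full window
s.on_ack(2)       # cumulative ACK for packets 1 and 2
s.on_timeout()    # resends packets 3 and 4
```

Keeping the payloads in a dictionary keyed by sequence number makes the "retransmit all unacked packets" step a simple loop over the window.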
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
– may generate duplicate ACKs
– need only remember expectedseqnum
• out-of-order pkt:
– discard (don't buffer): no receiver buffering!
– re-ACK pkt with highest in-order seq #
single state: Wait
initially:
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
default:
    udt_send(sndpkt)
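The receiver side is even smaller. The sketch below assumes packets arrive as (seq, data) pairs and returns the ACK number instead of sending a packet; the `first_seq` parameter is an illustrative addition so the same class can replay a trace that starts at pkt0.

```python
class GBNReceiver:
    """Go-Back-N receiver (sketch): delivers in order, discards everything else."""

    def __init__(self, first_seq=1):
        self.expectedseqnum = first_seq
        self.last_ack = first_seq - 1   # ACK number re-sent by the default transition
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)          # deliver_data
            self.last_ack = seq
            self.expectedseqnum += 1
        # In every other case the packet is discarded and the last ACK is repeated.
        return self.last_ack


# pkt2 is lost on first transmission, so pkt3..pkt5 arrive out of order,
# then the sender's timeout resends pkt2..pkt5.
r = GBNReceiver(first_seq=0)
arrivals = [0, 1, 3, 4, 5, 2, 3, 4, 5]
acks = [r.rdt_rcv(n, f"pkt{n}") for n in arrivals]
print(acks)
```

Because nothing is buffered, the only state the receiver carries is `expectedseqnum` and the last ACK it produced.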
GBN in action
[Figure: sender/receiver timeline with the sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost.]
sender: send pkt0, send pkt1, send pkt2 (X loss), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
– buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
– sender timer for each unACKed packet
• sender window
– N consecutive seq #s
– limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space.]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
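The three receiver cases above can be sketched as follows. This is a simplified illustration: sequence numbers are unbounded integers (no wraparound), and the class and method names are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver (sketch; sequence numbers do not wrap)."""

    def __init__(self, window_size, base=0):
        self.N = window_size
        self.rcvbase = base
        self.buffer = {}        # out-of-order packets waiting for the gap to fill
        self.delivered = []

    def on_packet(self, n, data):
        """Return the seq # to ACK, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer.setdefault(n, data)
            # Deliver the in-order run starting at rcvbase and slide the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n            # already delivered: re-ACK so the sender can advance
        return None             # otherwise: ignore


r = SRReceiver(window_size=4)
acks = [r.on_packet(n, f"pkt{n}") for n in [0, 1, 3, 4, 5, 2]]
print(acks, r.delivered)
```

The re-ACK case matters because the sender's window cannot move until it sees an ACK for its oldest unACKed packet, even if the receiver delivered that packet long ago.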
Selective repeat in action
[Figure: sender/receiver timeline with the sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost.]
sender: send pkt0, send pkt1, send pkt2 (X loss), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3
[Figure (a), no problem: sender window and receiver window (after receipt) over seq #s 0 1 2 3 0 1 2. Sender sends pkt0, pkt1, pkt2, then pkt3 (lost, X), then a new pkt0; the receiver will accept packet with seq number 0.]
[Figure (b), oops!: sender sends pkt0, pkt1, pkt2; the ACKs are lost (X X X); timeout, retransmit pkt0; the receiver will accept packet with seq number 0.]
• receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
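One way to explore the question is to check, for a given sequence space and window, whether the packets the sender may still retransmit can collide with the sequence numbers the receiver now treats as new. The helper below is a sketch of that worst-case check; the function name is hypothetical, and the "all ACKs lost" worst case is taken from scenario (b).

```python
def sr_window_is_safe(seq_space, window):
    """True if an old retransmission can never land in the receiver's new window."""
    # Worst case from scenario (b): the receiver got all `window` packets and
    # advanced its window, but every ACK was lost, so the sender may still
    # retransmit any of its first `window` packets.
    old = {s % seq_space for s in range(0, window)}
    new = {s % seq_space for s in range(window, 2 * window)}
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3))   # the slide's example
print(sr_window_is_safe(4, 2))
```

Running the check over small cases suggests the usual answer: the window size must be at most half the size of the sequence number space.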
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data:
– bi-directional data flow in same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide.]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
flag bits:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
TCP seq. numbers, ACKs
sequence numbers:
– byte stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how receiver handles out-of-order segments
– A: TCP spec doesn't say; up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), drawn above the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure labels: Window (N bytes); HTTP Get Message (K bytes); 100th byte; TCP header (seq. no. = 100); M bytes; HTTP Get Message (K bytes); cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. A to B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. B to A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
– but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT

$$EstimatedRTT = (1-\alpha)\cdot EstimatedRTT + \alpha\cdot SampleRTT$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds, 100 to 350) versus time (seconds), for gaia.cs.umass.edu to fantasia.eurecom.fr.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$DevRTT = (1-\beta)\cdot DevRTT + \beta\cdot|SampleRTT - EstimatedRTT|$$

(typically, β = 0.25)

$$TimeoutInterval = EstimatedRTT + 4\cdot DevRTT$$

(first term: estimated RTT; second term: "safety margin")
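The three formulas combine into a small estimator. The sketch below is illustrative only: the class name is hypothetical, and two details the slides leave open are filled in as assumptions, namely the initial values (first sample for EstimatedRTT, half of it for DevRTT) and the update order (DevRTT first, using the old EstimatedRTT).

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimator (sketch)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        # Initial values are an assumption; the slides do not specify them.
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2

    def update(self, sample_rtt):
        # Assumed order: update DevRTT with the old EstimatedRTT, then EstimatedRTT.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=0.100)   # seconds
timeout = est.update(0.200)              # one unusually slow sample
print(f"EstimatedRTT={est.estimated_rtt:.4f}s DevRTT={est.dev_rtt:.4f}s "
      f"TimeoutInterval={timeout:.4f}s")
```

A single outlier moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the intended effect of the 4·DevRTT term.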
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
– pipelined segments
– cumulative acks
– single retransmission timer
• retransmissions triggered by:
– timeout events
– duplicate acks
let's initially consider simplified TCP sender:
– ignore duplicate acks
– ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
– think of timer as for oldest unacked segment
– expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
– update what is known to be ACKed
– start timer if there are still unacked segments
TCP sender (simplified)
single state: wait for event
initially:
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
ACK received, with ACK field value y:
    if (y > SendBase)
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
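The three events of the simplified sender can be sketched as a class. This is an illustration under stated assumptions: `send` is a hypothetical callback standing in for "pass segment to IP", the timer is a boolean flag, and segments are tracked in a dictionary keyed by their first byte number.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender (sketch): no duplicate-ACK, flow or congestion control."""

    def __init__(self, initial_seq, send):
        self.send = send              # hypothetical stand-in for passing a segment to IP
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)   # not-yet-acked segment with smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            for seq in [q for q in self.unacked if q + len(self.unacked[q]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


# Premature-timeout pattern: two segments sent, timer fires before the ACKs arrive.
sent, bases = [], []
t = SimplifiedTcpSender(92, lambda seq, data: sent.append(seq))
t.on_app_data(b"x" * 8)     # Seq=92, 8 bytes
t.on_app_data(b"y" * 20)    # Seq=100, 20 bytes
t.on_timeout()              # resends only Seq=92
for ack in (100, 120):
    t.on_ack(ack)
    bases.append(t.send_base)
print(sent, bases)
```

Only the oldest segment is retransmitted on timeout, which is the main behavioural difference from the Go-Back-N sender.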
TCP: retransmission scenarios
[Figure: three Host A / Host B timelines.]
lost ACK scenario: A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X); A times out and resends Seq=92, 8 bytes of data; B sends ACK=100.
premature timeout: A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; B sends ACK=100 and ACK=120; A times out before the ACKs arrive and resends Seq=92, 8 bytes of data; B sends ACK=120. SendBase moves from 92 to 100 to 120 and stays at 120.
cumulative ACK: A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; B's ACK=100 is lost (X), but ACK=120 arrives before the timeout; A goes on to send Seq=120, 15 bytes of data.
TCP ACK generation [RFC 5681]
event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expected seq #; gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
– long delay before resending lost packet
• detect lost segments via duplicate ACKs
– sender often sends many segments back-to-back
– if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 four times (the last three are the 3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires.]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS; the application process may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending).]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process.]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server (application over network) each hold connection state: ESTAB and connection variables: seq # client-to-server and server-to-client; rcvBuffer size at server, client.]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
TCP 3-way handshake: FSM
transitions are written as: event / action
closed to listen (server): Socket connectionSocket = welcomeSocket.accept(); / Λ
listen to SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
SYN rcvd to ESTAB: ACK(ACKnum=y+1) / Λ
closed to SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
SYN sent to ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
– send TCP segment with FIN bit = 1
• respond to received FIN with ACK
– on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data)
3. client state: FIN_WAIT_2 (wait for server close)
4. server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
5. client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
6. server state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput: λ_out.]
[Figure: plots of λ_out versus λ_in and of delay versus λ_in, with λ_in up to R/2.]
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
– application-layer input = application-layer output: λ_in = λ_out
– transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A sends λ_in: original data, and λ'_in: original data plus retransmitted data, into a router with finite shared output link buffers; Host B receives λ_out.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet and sends λ_in: original data, and λ'_in: original data plus retransmitted data, into finite shared output link buffers; Host B receives λ_out. One diagram shows free buffer space, a second shows no buffer space.]
[Figure: plot of λ_out versus λ_in, both axes up to R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A keeps a copy of each packet and sends λ_in: original data, and λ'_in: original data plus retransmitted data; Host B receives λ_out; the router has free buffer space again when the copy is resent.]
[Figure: plot of λ_out versus λ_in, both axes up to R/2.] When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: Host A's timeout fires while the original is still queued; the router has free buffer space, so both copies reach Host B.]
[Figure: plot of λ_out versus λ_in, both axes up to R/2.] When sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
– decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out at the receiving host.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Causes/costs of congestion: scenario 3
[Figure: λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), both axes up to R/2.]
Bandwidth wastage for packets dropped at the 2nd router
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
– additive increase: increase cwnd by 1 MSS every RTT until loss detected
– multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size … until loss occurs (then cut window in half).]
TCP Congestion Control: details
• sender limits transmission:

$$LastByteSent - LastByteAcked \le cwnd$$

• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$rate \approx \frac{cwnd}{RTT}\ \text{bytes/sec}$$

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight bytes.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
– initially cwnd = 1 MSS
– double cwnd every RTT
– done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A / Host B timeline: one segment in the first RTT, then two segments, then four segments.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
– cwnd set to 1 MSS
– window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
– dup ACKs indicate network capable of delivering some segments
– cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially (enter slow start):
    cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
    new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    cwnd > ssthresh: Λ; go to congestion avoidance
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
congestion avoidance:
    new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
fast recovery:
    duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
    new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
    timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
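The three-state summary can be turned into a compact state machine. The sketch below makes some simplifying assumptions: cwnd and ssthresh are counted in units of MSS (so "cwnd + MSS" becomes "+ 1" and "ssthresh + 3" is read as 3 MSS), retransmission itself is left out, and the class and state names are illustrative.

```python
class RenoCwnd:
    """TCP Reno congestion window FSM (sketch; cwnd and ssthresh in units of MSS)."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow_start"
        self.cwnd = 1.0
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1.0                  # +1 MSS per ACK: doubles every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1.0 / self.cwnd      # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"


c = RenoCwnd(ssthresh=8.0)
for _ in range(8):
    c.on_new_ack()                 # slow start until cwnd exceeds ssthresh
after_slow_start = (c.state, c.cwnd)
for _ in range(3):
    c.on_dup_ack()                 # third duplicate triggers fast recovery
after_dup_acks = (c.state, c.cwnd, c.ssthresh)
c.on_new_ack()                     # leave fast recovery
after_recovery = (c.state, c.cwnd)
c.on_timeout()
after_timeout = (c.state, c.cwnd, c.ssthresh)
```

Working in MSS units keeps the arithmetic readable; a byte-based version would multiply the same increments by MSS.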
TCP throughput
• avg. TCP throughput as function of window size, RTT?
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg. window size (# in-flight bytes) is 3/4 W
– avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4}\cdot\frac{W}{RTT}\ \text{bytes/sec}$$

[Figure: sawtooth of window size oscillating between W/2 and W, with average 3/4 W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22\cdot MSS}{RTT\,\sqrt{L}}$$

– to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
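The two numbers in this example follow from the formulas with a little arithmetic. The helper names below are illustrative; MSS is converted from bytes to bits so that throughput comes out in bits/sec.

```python
import math


def window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain throughput_bps over one RTT."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput from the loss-based formula, 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss(target_bps, mss_bytes, rtt_s):
    """Loss probability L that yields target_bps (the formula solved for L)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss(10e9, 1500, 0.100)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The required loss rate is on the order of one lost segment in several billion, which is why the slide points to newer TCP variants for high-speed paths.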
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, showing the equal bandwidth share line, (X2, Y2) where X2 = Y2, and the full bandwidth utilization line, (X1, Y1) where X1 + Y1 = R. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
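The convergence argument can be illustrated with a toy simulation. It is a sketch under strong assumptions: both flows see every loss at the same time, have the same RTT, add 1 unit per round, and halve whenever their combined rate exceeds the capacity. The function name is hypothetical.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Synchronized AIMD for two flows sharing one bottleneck (toy model)."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2     # loss: both decrease window by factor of 2
        else:
            x, y = x + 1, y + 1     # congestion avoidance: additive increase
    return x, y


x, y = aimd_two_flows(1.0, 70.0, capacity=100.0, rounds=1000)
print(f"flow 1: {x:.3f}, flow 2: {y:.3f}")
```

Additive increase leaves the gap between the two rates unchanged, while each multiplicative decrease halves it, so the rates drift toward the equal-share line.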
Fairness (more)
Fairness and UDP:
• multimedia apps often do not use TCP
– do not want rate throttled by congestion control
• instead use UDP:
– send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2
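The parallel-connection example is simple fraction arithmetic, assuming the link is split equally per connection. The helper name is illustrative.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


one = app_share(9, 1)
eleven = app_share(9, 11)
print(one, eleven)
```

With 11 connections the exact share is 11/20 of R, which the slide rounds to R/2.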
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the TCP ACK segment returns with ECE=1.]
rdt2.0: operation with no errors
sender
Wait for call from above:
    rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK
Wait for ACK or NAK:
    rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt): Λ; go to Wait for call from above
receiver
Wait for call from below:
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0: error scenario
[Figure: the rdt2.0 sender and receiver FSMs, repeated from the previous slide.]
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK 0
Wait for ACK or NAK 0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to Wait for call 1 from above
Wait for call 1 from above:
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to Wait for ACK or NAK 1
Wait for ACK or NAK 1:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
Wait for 0 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to Wait for 1 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
Wait for 1 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to Wait for 0 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender FSM (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (Wait for 0 from below, Wait for 1 from below), with one transition shown per step.]
1. sender: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
6. final slide: no transition shown
rdt2.1: Example 2
[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition shown per step.]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
5. final slide: no transition shown
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
– state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
– state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
– receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
Wait for ACK 0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): Λ
receiver FSM fragment
transition into Wait for 0 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
Wait for 0 from below:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
– checksum, seq #, ACKs, retransmissions will be of help … but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
– retransmission will be duplicate, but seq #s already handle this
– receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
Wait for call 0 from above:
    rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
    rdt_rcv(rcvpkt): Λ
Wait for ACK0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to Wait for call 1 from above
Wait for call 1 from above:
    rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
    rdt_rcv(rcvpkt): Λ
Wait for ACK1:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ
    timeout: udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to Wait for call 0 from above
rdt3.0 in action
[Figure: sender/receiver timelines for four scenarios]
(a) no loss: send pkt0, rcv pkt0, send ack0, rcv ack0; send pkt1, rcv pkt1, send ack1, rcv ack1; send pkt0, rcv pkt0, send ack0
(b) packet loss: pkt1 is lost; sender times out and resends pkt1; receiver rcv pkt1, send ack1; transfer continues with pkt0
(c) ACK loss: ack1 is lost; sender times out and resends pkt1; receiver rcv pkt1 (detect duplicate), send ack1; transfer continues with pkt0
(d) premature timeout / delayed ACK: ack1 is delayed; sender times out and resends pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender sends pkt0 on the first ack1 and does nothing on the second
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \mu\text{s}$
  – U_sender: utilization, the fraction of time the sender is busy sending (times below in msec)
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} \approx 0.00027$
  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} \approx 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
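The utilization arithmetic for the stop-and-wait and 3-packet cases can be checked with a few lines of Python. This is only a sketch of the formula U = n(L/R) / (RTT + L/R); the function name `sender_utilization` and its parameter names are illustrative, and the cap at 1.0 is an added assumption for windows large enough to fill the pipe.

```python
def sender_utilization(l_bits, r_bps, rtt_s, n_pkts=1):
    """Fraction of time the sender is busy when n_pkts may be in flight."""
    d_trans = l_bits / r_bps                       # transmission delay L/R, seconds
    return min(1.0, n_pkts * d_trans / (rtt_s + d_trans))

# Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms.
u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_pipelined = sender_utilization(8000, 1e9, 0.030, n_pkts=3)
print(f"{u_stop_and_wait:.5f} {u_pipelined:.5f}")  # prints 0.00027 0.00080
```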
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unACKed packets in pipeline
• receiver only sends cumulative ACK
  – doesn't ACK packet if there's a gap
• sender has timer for oldest unACKed packet
  – when timer expires, retransmit all unACKed packets
Selective Repeat:
• sender can have up to N unACKed packets in pipeline
• rcvr sends individual ACK for each packet
• sender maintains timer for each unACKed packet
  – when timer expires, retransmit only that unACKed packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unACKed pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
[single state: Wait]
  initially (Λ):
    base = 1
    nextseqnum = 1
  rdt_send(data):
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
GBN: receiver extended FSM
[single state: Wait]
  initially (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default:
    udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
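The sender and receiver FSMs above can be exercised together in a small Python sketch. It is a simplification under stated assumptions rather than the FSM itself: sequence numbers start at 0 and never wrap (no k-bit field), ACKs are never lost, a timeout is modelled as "the window is full and no ACK advanced it", and `gbn_transfer` and `drop_first_tx` are hypothetical names.

```python
def gbn_transfer(data, N=4, drop_first_tx=frozenset()):
    """Go-Back-N sketch. drop_first_tx holds the seq #s whose first
    transmission is lost. Returns (delivered, sent_log)."""
    base = nextseqnum = 0        # sender state
    expectedseqnum = 0           # receiver state
    delivered, sent_log = [], []
    already_sent = set()
    while base < len(data):
        burst = range(nextseqnum, min(base + N, len(data)))
        if not burst:            # nothing new may be sent: timeout, go back N
            nextseqnum = base
            continue
        last_ack = None
        for seq in burst:
            sent_log.append(seq)
            lost = seq in drop_first_tx and seq not in already_sent
            already_sent.add(seq)
            if lost:
                continue
            if seq == expectedseqnum:      # in-order: deliver
                delivered.append(data[seq])
                expectedseqnum += 1
            # out-of-order packets are discarded, not buffered
            if expectedseqnum > 0:         # (re)send cumulative ACK
                last_ack = expectedseqnum - 1
        nextseqnum = burst[-1] + 1
        if last_ack is not None and last_ack + 1 > base:
            base = last_ack + 1            # cumulative ACK slides the window
    return delivered, sent_log

delivered, log = gbn_transfer(list("abcdef"), N=4, drop_first_tx={2})
print(log)  # prints [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]
```

With pkt2 lost once, the log shows packets 3, 4 and 5 being sent twice: the receiver discarded them the first time because they arrived out of order.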
GBN in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost]
• sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait); pkt2 is lost
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #s
  – limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space]
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n is the smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
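The receiver rules above map directly onto a small class. The Python below is a sketch under assumptions: the class name `SRReceiver` and its method are hypothetical, and sequence numbers are unbounded integers, so wraparound of a k-bit sequence space is not handled.

```python
class SRReceiver:
    """Selective-repeat receiver: ACK each packet, buffer out-of-order ones."""

    def __init__(self, N):
        self.N = N
        self.rcv_base = 0      # smallest seq # not yet received
        self.buffer = {}       # out-of-order packets, keyed by seq #
        self.delivered = []    # data passed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcv_base <= n < self.rcv_base + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcv_base, advancing the window
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n
        if self.rcv_base - self.N <= n < self.rcv_base:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore

rcvr = SRReceiver(N=4)
acks = [rcvr.on_packet(n, f"pkt{n}") for n in [0, 1, 3, 4, 5, 2]]
print(acks, rcvr.delivered)
```

The arrival order 0, 1, 3, 4, 5, 2 mirrors a lost pkt2 that is retransmitted last: packets 3 to 5 wait in the buffer and are delivered together once pkt2 arrives.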
Selective repeat in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost]
• sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait); pkt2 is lost
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3
[Figure: two scenarios over the sequence 0 1 2 3 0 1 2, showing the sender window (after receipt) and the receiver window (after receipt)]
(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; sender advances its window and sends pkt3 (lost) and a new pkt0; receiver will accept packet with seq number 0
(b) oops!: pkt0, pkt1, pkt2 are received, but all three ACKs are lost; sender times out and retransmits pkt0; receiver will accept packet with seq number 0
receiver can't see sender side; receiver behavior is identical in both cases: something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
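One way to explore the question is to compare the two windows directly. The Python sketch below uses a hypothetical helper `sr_ambiguous` and assumes the worst case from scenario (b): the receiver has accepted a full window and advanced, while every ACK was lost, so the sender may still retransmit any packet of the old window. Under that assumption the check reports no ambiguity exactly when the window size is at most half the sequence-number space.

```python
def sr_ambiguous(seq_space, N):
    """True if a retransmitted old packet could be accepted as new data."""
    old_window = {s % seq_space for s in range(0, N)}       # sender may resend these
    new_window = {s % seq_space for s in range(N, 2 * N)}   # receiver now accepts these
    return bool(old_window & new_window)

print(sr_ambiguous(4, 3))  # prints True  (the slide's example)
print(sr_ambiguous(4, 2))  # prints False
```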
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
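To make the layout concrete, the fixed 20-byte part of the header can be decoded with Python's standard `struct` module. This is a minimal sketch: the function name `parse_tcp_header` and the dictionary keys are illustrative, options are skipped rather than parsed, and the sample segment is made up for the example.

```python
import struct

# Flag bit positions in the low 6 bits of the 16-bit header-length/flags word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed part of a TCP header; network byte order throughout."""
    (src_port, dst_port, seq, ack,
     offset_flags, rwnd, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    head_len = (offset_flags >> 12) * 4   # header length field counts 32-bit words
    flags = {name for name, bit in FLAG_BITS.items() if offset_flags & bit}
    return {"src_port": src_port, "dst_port": dst_port, "seq": seq, "ack": ack,
            "head_len": head_len, "flags": flags, "rwnd": rwnd,
            "checksum": checksum, "urg_ptr": urg_ptr,
            "payload": segment[head_len:]}

# Made-up segment: ports 12345 -> 23, Seq=42, ACK=79, PSH+ACK set, 1 data byte.
sample = struct.pack("!HHIIHHHH", 12345, 23, 42, 79, (5 << 12) | 0x18, 4096, 0, 0) + b"C"
print(parse_tcp_header(sample))
```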
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how does receiver handle out-of-order segments?
  – A: TCP spec doesn't say; up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer]
[Figure: sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]
Byte stream in TCP
[Figure: an HTTP Get Message (K bytes) viewed as a byte stream; a segment carrying M bytes starting at the 100th byte has TCP header seq no = 100; window of N bytes; bytes beyond the window cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
• user types 'C'; A to B: Seq=42, ACK=79, data = 'C'
• host ACKs receipt of 'C', echoes back 'C'; B to A: Seq=79, ACK=43, data = 'C'
• host ACKs receipt of echoed 'C'; A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
[Figure: SampleRTT and EstimatedRTT for gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350]
$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT}-\text{EstimatedRTT}|$
  (typically, β = 0.25)
  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$
  (EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
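The estimator formulas can be combined into a few lines of Python. This is a sketch, not a TCP implementation: the class name `RttEstimator` is hypothetical, and two details the slides leave open are filled in as labelled assumptions taken from RFC 6298 (initial values, and updating DevRTT with the previous EstimatedRTT).

```python
class RttEstimator:
    """EWMA-based timeout estimator following the formulas above."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        # Assumption: initial values as in RFC 6298 (the slides give none).
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Assumption: DevRTT uses EstimatedRTT from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(first_sample=100.0)    # milliseconds
for sample in [120.0, 95.0, 110.0]:
    print(round(est.update(sample), 2))
```

Samples from retransmitted segments should not be fed to `update`, matching the "ignore retransmissions" rule stated earlier.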
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative ACKs
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate ACKs
let's initially consider simplified TCP sender:
  – ignore duplicate ACKs
  – ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unACKed segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ACK rcvd:
• if ACK acknowledges previously unACKed segments
  – update what is known to be ACKed
  – start timer if there are still unACKed segments
TCP sender (simplified)
[single state: wait for event]
  initially (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
  data received from application above:
    create segment, seq #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout:
    retransmit not-yet-ACKed segment with smallest seq #
    start timer
  ACK received, with ACK field value y:
    if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-ACKed segments)
        start timer
      else stop timer
    }
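The three events of the simplified sender translate into a short Python class. This is a sketch under assumptions: the names `SimplifiedTcpSender`, `wire` and `unacked` are hypothetical, the timer is reduced to a boolean flag (no real clock), and "passing a segment to IP" is modelled by appending to a list.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: cumulative ACKs, single retransmission timer."""

    def __init__(self, initial_seq=0):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # -> payload of not-yet-ACKed segments
        self.timer_running = False
        self.wire = []               # (seq, payload) tuples "passed to IP"

    def on_app_data(self, data: bytes):
        seq = self.next_seq_num      # seq # is byte number of first data byte
        self.unacked[seq] = data
        self.wire.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)      # not-yet-ACKed segment with smallest seq #
        self.wire.append((seq, self.unacked[seq]))
        self.timer_running = True    # restart timer

    def on_ack(self, y):
        if y > self.send_base:       # ACK covers previously unACKed bytes
            self.send_base = y
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)

sender = SimplifiedTcpSender(initial_seq=92)
sender.on_app_data(b"x" * 8)     # Seq=92, 8 bytes of data
sender.on_app_data(b"y" * 20)    # Seq=100, 20 bytes of data
sender.on_ack(120)               # one cumulative ACK covers both segments
print(sender.send_base, sender.timer_running)  # prints 120 False
```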
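TCP: retransmission scenarios
[Figure: two Host A / Host B timelines]
lost ACK scenario:
• A sends Seq=92, 8 bytes of data; B replies ACK=100, which is lost
• timeout: A resends Seq=92, 8 bytes of data; B replies ACK=100
• SendBase=100 once the ACK arrives
premature timeout:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100 and ACK=120, but A's timer expires first
• timeout: A resends Seq=92, 8 bytes of data
• the ACKs arrive: SendBase=100, then SendBase=120
• B answers the retransmission with ACK=120; SendBase stays at 120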
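TCP: retransmission scenarios
[Figure: Host A / Host B timeline]
cumulative ACK:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost, and ACK=120, which arrives before the timeout
• ACK=120 covers both segments, so nothing is retransmitted; A continues with Seq=120, 15 bytes of data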
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 duplicate ACKs for same data ("triple duplicate ACKs"), resend unACKed segment with smallest seq #
  – likely that unACKed segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)]
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; the Seq=100 segment is lost
• B replies ACK=100, followed by three duplicate ACK=100
• on the 3 dup ACKs, A resends Seq=100, 20 bytes of data, before the timeout expires
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack: segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unACKed ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process]
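The bookkeeping behind rwnd can be written out in Python. The variable names LastByteRcvd and LastByteRead do not appear on the slide; they are assumptions borrowed from the usual textbook treatment, where free space is RcvBuffer minus the bytes received but not yet read. Both function names are hypothetical.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space the receiver advertises to the sender."""
    buffered = last_byte_rcvd - last_byte_read   # bytes waiting for the application
    return rcv_buffer - buffered

def may_send(n_bytes, last_byte_sent, last_byte_acked, rwnd):
    """Sender-side check: keep unACKed (in-flight) data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + n_bytes <= rwnd

rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                               # prints 2096
print(may_send(1000, 5000, 4000, rwnd))   # prints True
print(may_send(1000, 5100, 4000, rwnd))   # prints False
```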
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, each holding:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enters SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enters SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enters ESTAB
• server: received ACK(y) indicates client is live; enters ESTAB
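TCP 3-way handshake: FSM
client:
  [closed] to [SYN sent]
    event: Socket clientSocket = new Socket("hostname", "port number");
    action: SYN(seq=x)
  [SYN sent] to [ESTAB]
    event: SYNACK(seq=y, ACKnum=x+1)
    action: ACK(ACKnum=y+1)
server:
  [closed] to [listen]
    event: Socket connectionSocket = welcomeSocket.accept();
    action: Λ (none)
  [listen] to [SYN rcvd]
    event: SYN(x)
    action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  [SYN rcvd] to [ESTAB]
    event: ACK(ACKnum=y+1)
    action: Λ (none)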
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: closing timeline; client state and server state both start in ESTAB]
• client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
• server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
• client: enters FIN_WAIT_2 (wait for server close)
• server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
• client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
• server: on receiving the ACK, enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out]
[Plots: λ_out vs. λ_in, levelling off at R/2; delay vs. λ_in, growing sharply as λ_in approaches R/2]
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A, Host B, finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy and sends only when the router has free buffer space, not when there is no buffer space; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
[Plot: λ_out vs. λ_in, both axes up to R/2]
Causes/costs of congestion: scenario 2
idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A, Host B; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
[Plot: λ_out vs. λ'_in, both axes up to R/2]
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: Host A keeps a copy, times out while the router still has free buffer space, and sends it again; λ_in, λ'_in, λ_out]
[Plot: λ_out vs. λ'_in, both axes up to R/2]
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
Causes/costs of congestion: scenario 2
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λ_out vs. λ'_in, both axes up to R/2]
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
[Figure: Hosts A, B, C, D with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Causes/costs of congestion: scenario 3
[Plot: throughput by blue traffic (λ_out, up to R/2) vs. offered load by Host A (λ'_in, up to R/2)]
• bandwidth wastage for packets dropped at the 2nd router
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Plot: cwnd (TCP sender congestion window size) vs. time; AIMD sawtooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $\text{rate} \approx \frac{\text{cwnd}}{RTT}$ bytes/sec
[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight"), spanning cwnd | last byte sent]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
[slow start]
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: Λ (no action); go to congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
[congestion avoidance]
  new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
[fast recovery]
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
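The three-state machine summarized above can be sketched as a small Python class that tracks only cwnd, ssthresh and the duplicate-ACK counter. This is an illustration under assumptions: the class name `RenoCongestionControl` and the default MSS of 1460 bytes are hypothetical, no segments are actually transmitted or retransmitted, and the switch to congestion avoidance is tested right after each slow-start increase.

```python
class RenoCongestionControl:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss                      # assumed segment size in bytes
        self.cwnd = float(mss)              # cwnd = 1 MSS
        self.ssthresh = float(ssthresh)     # ssthresh = 64 KB
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                       # doubles cwnd per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)   # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                          # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow start"

cc = RenoCongestionControl(mss=1000, ssthresh=4000)
for _ in range(4):
    cc.on_new_ack()
print(cc.state, cc.cwnd)
```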
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}$ bytes/sec
[Figure: sawtooth of window size between W/2 and W, average 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$
  – to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
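The numbers in this example can be reproduced from the formula with a few lines of Python. The function names are illustrative, and the only assumption is the unit choice: MSS is given in bytes and converted to bits so that throughput is in bits per second.

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * sqrt(loss))

def required_loss(mss_bytes, rtt_s, target_bps):
    """Loss probability L that yields target_bps (formula solved for L)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

def window_segments(target_bps, rtt_s, mss_bytes):
    """Number of in-flight segments needed to sustain target_bps."""
    return target_bps * rtt_s / (mss_bytes * 8)

print(round(window_segments(10e9, 0.100, 1500)))   # prints 83333
print(f"{required_loss(1500, 0.100, 10e9):.1e}")   # prints 2.1e-10
```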
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Plot: Connection 2 throughput vs. Connection 1 throughput, both axes 0 to R; "equal bandwidth share" line: (X2, Y2) where X2 = Y2; "full bandwidth utilization" line: (X1, Y1) where X1 + Y1 = R; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]
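The convergence argument can be checked numerically. The Python sketch below is an idealized model, not TCP: both flows see every loss at the same instant, rates change once per round, and `aimd_two_flows` with its parameters is a hypothetical name. Additive increase leaves the difference between the two rates unchanged, while each halving cuts it in two, so the rates drift toward the equal-share line.

```python
def aimd_two_flows(x1, x2, R, rounds=500):
    """Two synchronized AIMD flows sharing a link of capacity R."""
    for _ in range(rounds):
        if x1 + x2 > R:                  # link overloaded: both flows see loss
            x1, x2 = x1 / 2, x2 / 2      # multiplicative decrease
        else:
            x1, x2 = x1 + 1, x2 + 1      # additive increase (slope 1)
    return x1, x2

x1, x2 = aimd_two_flows(80.0, 10.0, R=100.0)
print(round(abs(x1 - x2), 4))
```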
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets about R/2 (11 of 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 on the way; the destination returns a TCP ACK segment with ECE=1]
rdt2.0: error scenario
sender:
  [Wait for call from above]
    rdt_send(data)
      actions: sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
      next: Wait for ACK or NAK
  [Wait for ACK or NAK]
    rdt_rcv(rcvpkt) && isNAK(rcvpkt)
      actions: udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt)
      actions: Λ (none)
      next: Wait for call from above
receiver:
  [Wait for call from below]
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      actions: udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      actions: extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
rdt2.0 has a fatal flaw!
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
[Wait for call 0 from above]
  rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
    next: Wait for ACK or NAK 0
[Wait for ACK or NAK 0]
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    actions: udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ (none)
    next: Wait for call 1 from above
[Wait for call 1 from above]
  rdt_send(data)
    actions: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)
    next: Wait for ACK or NAK 1
[Wait for ACK or NAK 1]
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    actions: udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ (none)
    next: Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
[Wait for 0 from below]
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next: Wait for 1 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
[Wait for 1 from below]
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next: Wait for 0 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
[Figure sequence: the rdt2.1 sender and receiver FSMs, with the active transition shown at each step; here the first copy of pkt 0 arrives corrupted]
• sender (Wait for call 0 from above): rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); moves to Wait for ACK or NAK 0
• receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
• receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); moves to Wait for 1 from below
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ; moves to Wait for call 1 from above
rdt2.1: Example 2
[Figure sequence: the rdt2.1 sender and receiver FSMs, with the active transition shown at each step; here pkt 0 arrives intact but the receiver's reply is garbled in transit]
• receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); moves to Wait for 1 from below
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
• receiver (Wait for 1 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); the duplicate is not delivered again
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ; moves to Wait for call 1 from above
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
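The NAK-free idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Rdt22Receiver`, the helper `sender_should_retransmit`, and the boolean `corrupt` flag (standing in for a checksum test) are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Rdt22Receiver:
    """Hypothetical rdt2.2-style receiver: ACKs carry a sequence number, no NAKs."""
    expected: int = 0                      # sequence number (0 or 1) we are waiting for
    delivered: list = field(default_factory=list)

    def on_packet(self, seq: int, data: str, corrupt: bool) -> int:
        """Process one arriving packet and return the seq # placed in the ACK."""
        if not corrupt and seq == self.expected:
            self.delivered.append(data)    # deliver new data to the upper layer
            self.expected = 1 - self.expected
            return seq                     # ACK the packet just received
        # Corrupt or duplicate packet: re-ACK the last packet received OK.
        return 1 - self.expected


def sender_should_retransmit(current_seq: int, ack_seq: int, ack_corrupt: bool) -> bool:
    """A corrupt ACK or an ACK for the other seq # is treated like a NAK."""
    return ack_corrupt or ack_seq != current_seq
```

Before anything has been received, this sketch answers a bad packet with ACK 1, which the sender of packet 0 reads as "retransmit".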
rdt2.2: sender, receiver fragments
(Λ = no action)
sender FSM fragment
state: Wait for call 0 from above
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)  → Wait for ACK 0
state: Wait for ACK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / Λ
receiver FSM fragment
transition into Wait for 0 from below:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
state: Wait for 0 from below
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
(Λ = no action)
state: Wait for call 0 from above
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer  → Wait for ACK0
  rdt_rcv(rcvpkt) / Λ
state: Wait for ACK0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / Λ
  timeout / udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / stop_timer  → Wait for call 1 from above
state: Wait for call 1 from above
  rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer  → Wait for ACK1
  rdt_rcv(rcvpkt) / Λ
state: Wait for ACK1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)) / Λ
  timeout / udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) / stop_timer  → Wait for call 0 from above
rdt3.0 in action
[four sender/receiver timelines; S = sender, R = receiver]
(a) no loss
  S: send pkt0 → R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 → R: rcv pkt1, send ack1
  S: rcv ack1, send pkt0 → R: rcv pkt0, send ack0
(b) packet loss
  S: send pkt0 → R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 → pkt1 lost (X)
  S: timeout, resend pkt1 → R: rcv pkt1, send ack1
  S: rcv ack1, send pkt0 → R: rcv pkt0, send ack0
(c) ACK loss
  S: send pkt0 → R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 → R: rcv pkt1, send ack1; ack1 lost (X)
  S: timeout, resend pkt1 → R: rcv pkt1 (detect duplicate), send ack1
  S: rcv ack1, send pkt0 → R: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  S: send pkt0 → R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 → R: rcv pkt1, send ack1 (ack1 delayed)
  S: timeout, resend pkt1 → R: rcv pkt1 (detect duplicate), send ack1
  S: rcv ack1, send pkt0 → R: rcv pkt0, send ack0
  S: rcv ack1 (second copy), do nothing
  S: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = L/R = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$
• $U_{sender}$: utilization, the fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
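The arithmetic above can be checked with a short Python sketch. The function names and the `window` parameter (number of packets sent back-to-back per round trip) are assumptions added for illustration; `window=1` corresponds to stop-and-wait.

```python
def sender_utilization(L_bits: float, R_bps: float, rtt_s: float, window: int = 1) -> float:
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R), capped at 1."""
    d_trans = L_bits / R_bps               # transmission delay of one packet, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


def effective_throughput_bps(L_bits: float, R_bps: float, rtt_s: float, window: int = 1) -> float:
    """Useful bits delivered per second at that utilization."""
    return sender_utilization(L_bits, R_bps, rtt_s, window) * R_bps


if __name__ == "__main__":
    u = sender_utilization(8000, 1e9, 0.030)
    print(f"stop-and-wait utilization: {u:.5f}")
    print(f"throughput: {effective_throughput_bps(8000, 1e9, 0.030) / 8 / 1000:.1f} kB/sec")
```

With the slide's numbers (L = 8000 bits, R = 1 Gbps, RTT = 30 ms) the utilization comes out near 0.00027 and the throughput near 33 kB/sec.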
rdt3.0: stop-and-wait operation
[timing diagram, sender and receiver, one RTT shown]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[timing diagram, sender and receiver, one RTT shown]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
3-packet pipelining increases utilization by a factor of 3:
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
(single state: Wait; Λ = no action)
initially:
  base = 1
  nextseqnum = 1
rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
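The extended FSM above maps fairly directly onto a small Python class. The following is a sketch rather than a reference implementation: `GBNSender`, the `udt_send` callback, and the boolean `timer_running` (standing in for a real countdown timer) are hypothetical names, and the guard that refuses to move `base` backwards is an addition not shown in the FSM.

```python
class GBNSender:
    """Hypothetical Go-Back-N sender following the extended FSM (base, nextseqnum, window N)."""

    def __init__(self, N, udt_send):
        self.N = N
        self.udt_send = udt_send           # callback receiving a (seq, data) tuple
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}                   # seq -> packet kept for retransmission
        self.timer_running = False         # stand-in for start_timer / stop_timer

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = (self.nextseqnum, data)
            self.udt_send(self.sndpkt[self.nextseqnum])
            if self.base == self.nextseqnum:
                self.timer_running = True
            self.nextseqnum += 1
            return True
        return False

    def timeout(self):
        """Retransmit every sent-but-unacked packet, oldest first."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])

    def rdt_rcv_ack(self, acknum, corrupt=False):
        """Cumulative ACK: everything up to and including acknum is acknowledged."""
        if corrupt:
            return                         # Λ: ignore corrupt ACKs
        if acknum + 1 > self.base:         # added guard: never slide the window backwards
            for seq in range(self.base, acknum + 1):
                self.sndpkt.pop(seq, None)
            self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum
```

A caller would invoke `timeout()` from whatever timer mechanism it uses; the class itself only records whether a timer should be running.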
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering
  – re-ACK pkt with highest in-order seq #
(single state: Wait)
initially:
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt,data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
default:
  udt_send(sndpkt)
GBN in action
[timeline; sender window (N=4) slides over seq #s 0 1 2 3 4 5 6 7 8]
sender:
  send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[figure: sender and receiver views of the sequence number space]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
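The receiver rules above can be sketched as follows. This is an illustrative sketch under simplifying assumptions: sequence numbers are unbounded integers (no wrap-around), and the names `SRReceiver`, `on_packet`, and `delivered` are hypothetical.

```python
class SRReceiver:
    """Hypothetical selective-repeat receiver with window size N (no seq # wrap-around)."""

    def __init__(self, N):
        self.N = N
        self.rcvbase = 0
        self.buffer = {}                   # out-of-order packets, keyed by seq #
        self.delivered = []                # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer.setdefault(n, data)
            # Deliver the in-order run starting at rcvbase and advance the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n                       # already delivered: re-ACK so the sender can advance
        return None                        # otherwise: ignore
```

Re-ACKing packets below `rcvbase` matters because the earlier ACK may have been lost; without it the sender's window would never advance.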
Selective repeat in action
[timeline; sender window (N=4) slides over seq #s 0 1 2 3 4 5 6 7 8]
sender:
  send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout: send pkt2
  record ack4 arrived
  record ack5 arrived
receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[two timelines showing sender window (after receipt) and receiver window (after receipt) over the seq # sequence 0 1 2 3 0 1 2]
(a) no problem
  sender sends pkt0, pkt1, pkt2; receiver ACKs all three and will accept packet with seq number 0
  sender receives the ACKs and sends pkt3 (lost, X), then a new pkt0
(b) oops!
  sender sends pkt0, pkt1, pkt2; receiver ACKs all three, but all three ACKs are lost (X X X)
  receiver will accept packet with seq number 0
  sender: timeout, retransmit pkt0
receiver can't see sender side; receiver behavior identical in both cases; something's (very) wrong
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
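One way to explore the question is to model the worst case directly: the sender's window still covers the old packets (all ACKs lost) while the receiver's window has already moved on by a full window. The Python sketch below is an illustration with a hypothetical helper name, `sr_is_ambiguous`; it reproduces the standard result that the window size must be at most half the sequence-number space.

```python
def sr_is_ambiguous(seq_space: int, window: int) -> bool:
    """True if a retransmitted old packet could be mistaken for new data.

    Worst case: the sender may still retransmit packets 0..window-1 while the
    receiver already accepts packets window..2*window-1 (numbers taken mod seq_space).
    """
    old = {n % seq_space for n in range(0, window)}
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


if __name__ == "__main__":
    print(sr_is_ambiguous(4, 3))   # the slide's example: 4 seq #s, window 3
    print(sr_is_ambiguous(4, 2))   # window shrunk to half the seq # space
```

For every window no larger than the sequence space, the two sets overlap exactly when `2 * window > seq_space`.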
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
(each row is 32 bits wide)
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number (A), rwnd, checksum, urg pointer]
sender sequence number space (window size N): sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable
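Byte stream in TCP
[figure: an HTTP GET message (K bytes) in the sender's byte stream; the window covers N bytes starting at the 100th byte; a segment with TCP header (seq. no. = 100) carries M bytes; the part of the message beyond the window cannot be transmitted now]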
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types 'C'. A → B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. B → A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. A → B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[plot: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350; series: SampleRTT, EstimatedRTT]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4*DevRTT
  (estimated RTT plus "safety margin")
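The three formulas combine into a small estimator. The sketch below is illustrative only: the class name `RttEstimator` is hypothetical, and two details are assumptions the slides do not state. The first sample initializes EstimatedRTT to the sample and DevRTT to half of it (the initialization used in RFC 6298), and DevRTT is updated before EstimatedRTT, so the deviation uses the previous estimate.

```python
class RttEstimator:
    """Hypothetical EWMA-based RTT estimator (alpha = 0.125, beta = 0.25 by default)."""

    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample      # assumption: seed with the first SampleRTT
        self.dev_rtt = first_sample / 2        # assumption: initial deviation = sample / 2

    def update(self, sample_rtt: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Assumption: deviation is computed against the estimate from before this sample.
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * abs(sample_rtt - self.estimated_rtt)
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        return self.timeout_interval()

    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt
```

Samples taken from retransmitted segments would be skipped by the caller, matching the "ignore retransmissions" rule.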
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
(single state: wait for event)
initially:
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer
ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
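The three events of the simplified sender can be sketched in Python as below. This is an illustration under assumptions: `SimpleTcpSender`, the `send` callback, and the boolean `timer_running` (standing in for a real timer) are hypothetical, and duplicate ACKs, flow control, and congestion control are ignored as the slides specify.

```python
class SimpleTcpSender:
    """Hypothetical simplified TCP sender: cumulative ACKs, single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq        # NextSeqNum
        self.send_base = initial_seq       # SendBase
        self.send = send                   # callback taking (seq, data)
        self.unacked = {}                  # seq of first byte -> segment data
        self.timer_running = False

    def data_from_app(self, data: bytes):
        self.send(self.next_seq, data)
        self.unacked[self.next_seq] = data
        self.next_seq += len(data)         # seq # counts bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        if self.unacked:
            seq = min(self.unacked)        # not-yet-acked segment with smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True

    def ack_received(self, y: int):
        if y > self.send_base:
            self.send_base = y             # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Replaying a premature timeout with this sketch (send 8 bytes at Seq=92 and 20 bytes at Seq=100, time out, then receive ACK=100 and ACK=120) retransmits only the segment at Seq=92 and ends with SendBase=120.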
TCP: retransmission scenarios
[three Host A / Host B timelines]
lost ACK scenario
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100 (lost, X)
  timeout at A
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100
premature timeout
  SendBase=92
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100
  B → A: ACK=120
  timeout at A (before the ACKs arrive)
  A → B: Seq=92, 8 bytes of data
  SendBase=100, then SendBase=120
  B → A: ACK=120
  SendBase=120
cumulative ACK
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100 (lost, X)
  B → A: ACK=120 (arrives before timeout)
  A → B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   → immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   → immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
TCP fast retransmit
[Host A / Host B timeline: fast retransmit after sender receipt of triple duplicate ACK]
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data (lost, X)
  B → A: ACK=100
  B → A: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
  A → B: Seq=100, 20 bytes of data (resent before the timeout expires)
TCP flow control
• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
• flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[figure: receiver protocol stack; data from sender passes through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
state held at each end (client and server):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x). Client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1). Server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data. Client state: ESTAB
4. server: received ACK(y) indicates client is live. Server state: ESTAB
TCP 3-way handshake: FSM
(event / action; Λ = none)
server side:
  closed → listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
  listen → SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd → ESTAB: ACK(ACKnum=y+1) / Λ
client side:
  closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
  SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
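TCP: closing a connection
[timeline; client state and server state both start at ESTAB]
1. client: clientSocket.close(); sends FINbit=1, seq=x. Client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1. Server state: CLOSE_WAIT (can still send data)
3. client: on receiving the ACK, state FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y. Server state: LAST_ACK (can no longer send data)
5. client: sends ACKbit=1, ACKnum=y+1. Client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
6. server: on receiving the ACK, state CLOSED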
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[figure: Host A sends original data at rate λ_in; Host B likewise; unlimited shared output link buffers; throughput λ_out]
[plot: λ_out versus λ_in, levelling off at R/2]
• maximum per-connection throughput: R/2
[plot: delay versus λ_in, growing without bound as λ_in nears R/2]
• large delays as arrival rate, λ_in, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in
[figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) into finite shared output link buffers; Host B receives λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[figure: finite shared output link buffers; sender A keeps a copy of each packet; λ_in: original data; λ'_in: original data, plus retransmitted data; the router shows free buffer space in one panel and no buffer space in the other; Host B receives λ_out]
[plot: λ_out versus λ_in, both axes up to R/2]

Causes/costs of congestion: scenario 2
idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[plot: λ_out versus λ'_in, both axes up to R/2]
when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[figure: sender A times out while the original copy is still queued in free buffer space, so λ'_in carries a second copy]
[plot: λ_out versus λ'_in, both axes up to R/2]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
[figure: Hosts A, B, C, D; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted
[plot: λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), both axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[plot: cwnd (TCP sender congestion window size) versus time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked < cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd/RTT bytes/sec
[figure: sender sequence number space; last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is cwnd]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[timeline, Host A to Host B, one RTT per round: one segment, then two segments, then four segments]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
(FSM with three states; event / actions; Λ = no event or no action)
initially:
  Λ / cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0  → slow start
state: slow start
  new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  cwnd > ssthresh / Λ  → congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment  → fast recovery
state: congestion avoidance
  new ACK / cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment  → slow start
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment  → fast recovery
state: fast recovery
  duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK / cwnd = ssthresh; dupACKcount = 0  → congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment  → slow start
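The state machine above can be traced with a short Python sketch that tracks only cwnd, ssthresh, and the state; it sends nothing. The class name `RenoCwnd` is hypothetical, and one reading is assumed: the "+ 3" in "cwnd = ssthresh + 3" is taken to mean 3 MSS.

```python
class RenoCwnd:
    """Hypothetical tracker for the slow start / congestion avoidance / fast recovery FSM."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = mss                    # cwnd = 1 MSS initially
        self.ssthresh = ssthresh
        self.dup = 0                       # dupACKcount
        self.state = "slow start"

    def new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh      # deflate the window and leave fast recovery
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss          # +1 MSS per ACK doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # about +1 MSS per RTT
        self.dup = 0

    def dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup += 1
        if self.dup == 3:                  # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss       # assumption: "+ 3" means 3 MSS
            self.state = "fast recovery"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup = 0
        self.state = "slow start"
```

Feeding it a sequence of `new_ack()`, `dup_ack()`, and `timeout()` calls and logging `cwnd` after each one gives the kind of saw tooth trace the AIMD slide describes.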
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT} \text{ bytes/sec}$
[plot: window size oscillating between W/2 and W, averaging 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
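Both numbers in the example follow from simple arithmetic, which the Python sketch below reproduces. The function names are hypothetical; the second function just inverts the throughput formula above to solve for L.

```python
def required_window_segments(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """In-flight segments needed to fill the pipe: rate * RTT / segment size."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L, with MSS in bits."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


if __name__ == "__main__":
    print(f"W = {required_window_segments(10e9, 0.100, 1500):.0f} segments")
    print(f"L = {required_loss_rate(10e9, 0.100, 1500):.2e}")
```

For 1500-byte segments, a 100 ms RTT, and a 10 Gbps target this gives about 83,333 segments and a loss rate of roughly 2e-10, in line with the slide.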
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[plot: Connection 2 throughput versus Connection 1 throughput, both axes up to R; shows the equal bandwidth share line and the full bandwidth utilization line; the trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"; marked points (X1, Y1), where X1+Y1 = R, and (X2, Y2), where X2 = Y2]
Fairness (more)
Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the returning TCP ACK segment carries ECE=1]
rdt2.0 has a fatal flaw
what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate
handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt
stop and wait: sender sends one packet, then waits for receiver response
rdt2.1: sender, handles garbled ACK/NAKs
(Λ = no action)
state: Wait for call 0 from above
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)  → Wait for ACK or NAK 0
state: Wait for ACK or NAK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ  → Wait for call 1 from above
state: Wait for call 1 from above
  rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)  → Wait for ACK or NAK 1
state: Wait for ACK or NAK 1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ  → Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs
state: Wait for 0 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)  → Wait for 1 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
state: Wait for 1 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)  → Wait for 0 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
(animation frames over the sender and receiver FSMs; each step gives the side, its state, and the event / action highlighted on that frame)
1. sender, Wait for call 0 from above: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
6. end: sender in Wait for call 1 from above, receiver in Wait for 1 from below
rdt2.1: Example 2
(animation frames over the sender and receiver FSMs; each step gives the side, its state, and the event / action highlighted on that frame)
1. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
5. end: sender in Wait for call 1 from above, receiver in Wait for 1 from below
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment (state: event / action -> next state; Λ = no action)
Wait for call 0 from above:
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> Wait for ACK 0
Wait for ACK 0:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt) -> Wait for ACK 0
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ -> (next state, not shown in fragment)
receiver FSM fragment
(previous state, not shown in fragment):
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt) -> Wait for 0 from below
Wait for 0 from below:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt) -> Wait for 0 from below
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
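The timeout-and-retransmit idea above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `run_rdt30` and the `drop_data` / `drop_acks` parameters are hypothetical, the channel loses packets but never corrupts or reorders them, and a "timeout" is modelled simply as the sender getting no ACK for a transmission.

```python
def run_rdt30(messages, drop_data=(), drop_acks=()):
    """Toy stop-and-wait (alternating-bit) transfer over a lossy channel.

    drop_data / drop_acks hold 0-based indices of the data transmissions
    and ACK transmissions that the channel loses (hypothetical knobs).
    Returns (messages delivered to the application, data transmissions).
    """
    drop_data, drop_acks = set(drop_data), set(drop_acks)
    delivered = []
    seq = 0          # sender: sequence number of the current packet
    expected = 0     # receiver: sequence number it is waiting for
    data_sent = 0
    acks_sent = 0

    for msg in messages:
        while True:                      # one pass = send, then wait for ACK
            lost = data_sent in drop_data
            data_sent += 1
            if lost:
                continue                 # no ACK arrives: timeout, retransmit

            # Receiver: deliver only if this is the expected packet;
            # a duplicate is discarded but still ACKed.
            if seq == expected:
                delivered.append(msg)
                expected ^= 1

            ack_lost = acks_sent in drop_acks
            acks_sent += 1
            if ack_lost:
                continue                 # timeout, retransmit (duplicate)
            break                        # ACK for current seq # received
        seq ^= 1                         # alternate 0, 1, 0, 1, ...
    return delivered, data_sent
```

With the second data transmission and the second ACK lost, every message is still delivered exactly once, at the cost of two extra transmissions: `run_rdt30(["a", "b", "c"], drop_data={1}, drop_acks={1})` returns `(["a", "b", "c"], 5)`.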
rdt3.0 sender
(FSM listed as state: event / action -> next state; Λ = no action)
Wait for call 0 from above:
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> Wait for ACK0
  rdt_rcv(rcvpkt) / Λ -> Wait for call 0 from above
Wait for ACK0:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ -> Wait for ACK0
  timeout / udt_send(sndpkt); start_timer -> Wait for ACK0
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer -> Wait for call 1 from above
Wait for call 1 from above:
  rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> Wait for ACK1
  rdt_rcv(rcvpkt) / Λ -> Wait for call 1 from above
Wait for ACK1:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ -> Wait for ACK1
  timeout / udt_send(sndpkt); start_timer -> Wait for ACK1
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer -> Wait for call 0 from above
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• U_sender: utilization, the fraction of time sender busy sending
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
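As a quick check of the utilization arithmetic, the formula can be evaluated directly. This is a minimal sketch; the function name `sender_utilization` and its parameter names are illustrative assumptions, and the cap at 1.0 is added here because a sender cannot be busy more than all of the time.

```python
def sender_utilization(packet_bits, link_bps, rtt_seconds, packets_in_flight=1):
    """Fraction of time the sender is busy transmitting.

    Implements U = n * (L/R) / (RTT + L/R), where n is the number of
    packets sent back-to-back before waiting (n = 1 is stop-and-wait).
    """
    d_trans = packet_bits / link_bps              # L/R, in seconds
    utilization = packets_in_flight * d_trans / (rtt_seconds + d_trans)
    return min(1.0, utilization)                  # cannot exceed 100%


# Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
three_pipelined = sender_utilization(8000, 1e9, 0.030, packets_in_flight=3)
```

Here `stop_and_wait` is about 0.00027 and `three_pipelined` is about 0.0008, three times larger.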
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially: base = 1; nextseqnum = 1
state: Wait
rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
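The sender FSM above maps fairly directly onto a small class. The sketch below is an illustration rather than a reference implementation: the class name `GBNSender`, the `sent_log` list standing in for `udt_send`, and the boolean `timer_running` standing in for a real timer are all assumptions. One deliberate deviation is labelled in the code: `base` never moves backwards, which guards against a stale ACK.

```python
class GBNSender:
    """Toy Go-Back-N sender following the extended FSM (base, nextseqnum)."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> data, kept until ACKed
        self.timer_running = False  # stands in for start_timer/stop_timer
        self.sent_log = []          # every seq # handed to the channel

    def _udt_send(self, seq):
        self.sent_log.append(seq)

    def rdt_send(self, data):
        """Returns True if the data was sent, False if refused (window full)."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self._udt_send(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True
            self.nextseqnum += 1
            return True
        return False                # refuse_data(data)

    def timeout(self):
        """Go back N: resend every unACKed packet in the window."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self._udt_send(seq)

    def rdt_rcv_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is ACKed."""
        # Deviation from the slide: max() keeps base from moving backwards.
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        self.timer_running = self.base != self.nextseqnum
```

For example, with N = 4, sending five items refuses the fifth; after ACKs 1 and 2 arrive and two more packets are sent, a timeout resends packets 3 through 6.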
GBN: receiver extended FSM
initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state: Wait
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
default:
  udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
GBN in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #'s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
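The receiver rules above can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions: the class name `SRReceiver` and method `on_packet` are hypothetical, sequence numbers are unbounded integers (no wraparound), and packets are assumed to arrive uncorrupted.

```python
class SRReceiver:
    """Toy Selective Repeat receiver with unbounded sequence numbers."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets awaiting delivery
        self.delivered = []    # data passed up to the application, in order

    def on_packet(self, n, data):
        """Handle packet n; return the ACK number to send, or None to ignore."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            # In window: buffer it, then deliver any in-order run.
            self.buffer[n] = data
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            # Already delivered: re-ACK so the sender's window can advance.
            return n
        return None            # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then the retransmitted 2 (the order in the "Selective repeat in action" slide) ACKs each packet individually and delivers all six in order once packet 2 arrives.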
Selective repeat in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
  receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem
  sender sends pkt0, pkt1, pkt2; the ACKs arrive and the sender window advances
  sender sends pkt3 (lost, X), then pkt0
  receiver (window after receipt: 3, 0, 1) will accept packet with seq number 0
(b) oops!
  sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X)
  sender times out, retransmits pkt0
  receiver (window after receipt: 3, 0, 1) will accept packet with seq number 0
• receiver can't see sender side
• receiver behavior identical in both cases: receiver sees no difference in two scenarios!
• something's (very) wrong: duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
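One way to explore the question is to check, for a given sequence-number space and window size, whether the receiver's advanced window can overlap the packets the sender may still retransmit. The sketch below is an illustration; the function name `sr_window_is_safe` is hypothetical. The result it exhibits, that the window must be at most half the sequence-number space, is the standard textbook answer rather than something stated on the slide.

```python
def sr_window_is_safe(seq_space, window):
    """True if old and new packets can never share a sequence number.

    Worst case: the receiver got a full window and advanced, but every
    ACK was lost, so the sender may still retransmit the whole old window.
    """
    # Sequence numbers the sender may retransmit (old window).
    old = {s % seq_space for s in range(0, window)}
    # Sequence numbers the receiver now accepts as new (advanced window).
    new = {s % seq_space for s in range(window, 2 * window)}
    return old.isdisjoint(new)


# The slide's example: 4 sequence numbers, window size 3 -> ambiguous.
dilemma = sr_window_is_safe(4, 3)     # False
fixed = sr_window_is_safe(4, 2)       # True
```

Over small cases the check agrees with the rule `2 * window <= seq_space`.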
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
flag bits:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]
Byte stream in TCP
[Figure labels: Window, N bytes; HTTP Get Message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; Cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A -> Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B -> Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A -> Host B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT (milliseconds, 100 to 350) vs. time (seconds)]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$ (typically, $\beta = 0.25$)
$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$
(estimated RTT plus "safety margin")
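The three formulas above fit in a small estimator. This sketch makes two assumptions that the slides do not state: the initial DevRTT is set to half of the first sample, and DevRTT is updated before EstimatedRTT (both choices follow the initialization and ordering described in RFC 6298). The class name `RttEstimator` is illustrative.

```python
class RttEstimator:
    """EWMA-based RTT estimator with the slide's alpha and beta defaults."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        # Assumption: initial deviation is half the first sample.
        self.dev_rtt = first_sample / 2

    def update(self, sample_rtt):
        """Fold one new SampleRTT measurement into the estimates."""
        # Assumption: deviation uses the previous EstimatedRTT.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        """EstimatedRTT plus a safety margin of four deviations."""
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from a 100 ms sample, a single 200 ms sample moves EstimatedRTT to 112.5 ms and DevRTT to 62.5 ms, giving a TimeoutInterval of 362.5 ms.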
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initially: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state: wait for event
data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer
ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100 (lost, X)
  A: timeout; A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100
premature timeout (Host A, Host B):
  A: SendBase=92
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100; B -> A: ACK=120
  A: timeout before the ACKs arrive; A -> B: Seq=92, 8 bytes of data
  A: ACK=100 arrives, SendBase=100; ACK=120 arrives, SendBase=120
  B -> A: ACK=120 (for the duplicate); A: SendBase=120
cumulative ACK (Host A, Host B):
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100 (lost, X)
  B -> A: ACK=120 (arrives before timeout)
  A -> B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 duplicate ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (lost, X); Host B returns ACK=100 four times (3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, and from there to the application process]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server (application over network) each hold connection state: ESTAB, and connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state -> SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state -> SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state -> ESTAB
4. server: received ACK(y) indicates client is live; server state -> ESTAB
TCP 3-way handshake: FSM
(transition: event / action; Λ = none)
closed -> listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
listen -> SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYN rcvd -> ESTAB: ACK(ACKnum=y+1) / Λ
closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
SYN sent -> ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); sends FINbit=1, seq=x; state FIN_WAIT_1 (can no longer send but can receive data)
2. server: replies ACKbit=1, ACKnum=x+1; state CLOSE_WAIT (can still send data)
3. client: state FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; state LAST_ACK (can no longer send data)
5. client: replies ACKbit=1, ACKnum=y+1; state TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
6. server: state CLOSED
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, $\lambda_{in}$, approaches capacity
[Figure: Host A sends original data at rate $\lambda_{in}$ into a router with unlimited shared output link buffers; Host B receives throughput $\lambda_{out}$. Plots: $\lambda_{out}$ vs. $\lambda_{in}$ (both up to R/2) and delay vs. $\lambda_{in}$ (up to R/2)]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$
[Figure: Host A and Host B share a router with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]
idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: $\lambda_{out}$ vs. $\lambda_{in}$, both up to R/2]
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]
[Plot: throughput by blue traffic ($\lambda_{out}$, up to R/2) vs. offered load by Host A ($\lambda'_{in}$, up to R/2); bandwidth wastage for packets dropped at the 2nd router]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) vs. time: additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
$\text{rate} \approx \frac{cwnd}{RTT}$ bytes/sec
[Figure: sender sequence number space. last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is cwnd]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends to Host B one segment, then two segments, then four segments, one RTT apart]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
(FSM listed as state: event / action -> next state; Λ = no event)
initially: Λ / cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 -> slow start
slow start:
  new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed -> slow start
  duplicate ACK / dupACKcount++ -> slow start
  Λ, cwnd > ssthresh -> congestion avoidance
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
congestion avoidance:
  new ACK / cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed -> congestion avoidance
  duplicate ACK / dupACKcount++ -> congestion avoidance
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment -> fast recovery
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
fast recovery:
  duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed -> fast recovery
  new ACK / cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment -> slow start
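The three-state machine summarized above can be sketched as a small class. This is a simplified illustration, not an implementation of any real TCP stack: the class name `RenoSender` and its method names are assumptions, cwnd and ssthresh are counted in units of MSS rather than bytes, and retransmission and sending of segments are left out so that only the window arithmetic remains.

```python
class RenoSender:
    """Toy TCP Reno congestion-control state machine (window in MSS units)."""

    SLOW_START = "slow_start"
    CONGESTION_AVOIDANCE = "congestion_avoidance"
    FAST_RECOVERY = "fast_recovery"

    def __init__(self, ssthresh=64.0):
        self.state = self.SLOW_START
        self.cwnd = 1.0
        self.ssthresh = ssthresh
        self.dup_ack_count = 0

    def on_timeout(self):
        # Same reaction in every state: back to slow start with cwnd = 1.
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_ack_count = 0
        self.state = self.SLOW_START

    def on_new_ack(self):
        if self.state == self.SLOW_START:
            self.cwnd += 1.0                      # +1 MSS per ACK: doubling per RTT
            if self.cwnd > self.ssthresh:
                self.state = self.CONGESTION_AVOIDANCE
        elif self.state == self.CONGESTION_AVOIDANCE:
            self.cwnd += 1.0 / self.cwnd          # about +1 MSS per RTT
        else:                                     # leaving fast recovery
            self.cwnd = self.ssthresh
            self.state = self.CONGESTION_AVOIDANCE
        self.dup_ack_count = 0

    def on_dup_ack(self):
        if self.state == self.FAST_RECOVERY:
            self.cwnd += 1.0                      # inflate window per extra dup ACK
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:               # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3.0
            self.state = self.FAST_RECOVERY
```

Driving it with a sequence of `on_new_ack()`, `on_dup_ack()` and `on_timeout()` calls reproduces the slow start, fast recovery and congestion avoidance transitions listed above.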
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}$ bytes/sec
[Figure: window size saw tooth oscillating between W/2 and W, average 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2\cdot10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
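The numbers in this example can be reproduced from the formula. The sketch below is only a worked calculation; the function names are illustrative, and MSS is converted from bytes to bits so that throughput comes out in bits per second.

```python
import math


def segments_in_flight(throughput_bps, rtt_seconds, mss_bytes):
    """Window (in segments) needed to sustain a throughput over one RTT."""
    return throughput_bps * rtt_seconds / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_probability):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_seconds * math.sqrt(loss_probability))


def required_loss_rate(throughput_bps, mss_bytes, rtt_seconds):
    """Solve the throughput formula for L at a target throughput."""
    return (1.22 * mss_bytes * 8 / (rtt_seconds * throughput_bps)) ** 2


# Slide example: 1500-byte segments, 100 ms RTT, 10 Gbps target
window = segments_in_flight(10e9, 0.100, 1500)    # about 83,333 segments
loss = required_loss_rate(10e9, 1500, 0.100)      # about 2.1e-10
```

Plugging `loss` back into `mathis_throughput_bps` returns the 10 Gbps target, which confirms that the two functions are inverses of each other.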
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal bandwidth share line]
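The convergence argument can be checked numerically. The following is a minimal sketch under strong assumptions that are not on the slide: both flows have the same RTT, both see every loss at the same time, and rates move in abstract units. The name `aimd_two_flows` is hypothetical.

```python
def aimd_two_flows(x1, x2, capacity, rounds=200):
    """Idealized AIMD for two flows sharing one bottleneck (synchronized loss)."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # loss: multiplicative decrease halves both rates,
            # which also halves the gap between them
            x1, x2 = x1 / 2, x2 / 2
        else:
            # no loss: additive increase adds one unit to each flow,
            # which leaves the gap unchanged (slope 1 in the figure)
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2


a, b = aimd_two_flows(x1=80.0, x2=5.0, capacity=100.0)
print(f"flow 1 = {a:.1f}, flow 2 = {b:.1f}, gap = {abs(a - b):.3f}")
```

Starting from a very unequal split, the gap shrinks by half at every loss event, so the two rates approach the equal-share line.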
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1]
rdt2.1: sender, handles garbled ACK/NAKs
(FSM; transitions written as "event / actions -> next state"; Λ means no action)
Wait for call 0 from above:
• rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> Wait for ACK or NAK 0
Wait for ACK or NAK 0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt) -> Wait for ACK or NAK 0
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> Wait for call 1 from above
Wait for call 1 from above:
• rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) -> Wait for ACK or NAK 1
Wait for ACK or NAK 1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt) -> Wait for ACK or NAK 1
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> Wait for call 0 from above
rdt2.1: receiver, handles garbled ACK/NAKs
(FSM; transitions written as "event / actions -> next state")
Wait for 0 from below:
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> Wait for 1 from below
• rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt) -> Wait for 0 from below
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> Wait for 0 from below
Wait for 1 from below:
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> Wait for 0 from below
• rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt) -> Wait for 1 from below
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> Wait for 1 from below
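The receiver FSM boils down to one bit of state, the expected sequence number. A minimal sketch, assuming packets arrive as a (seq, data, corrupt) triple rather than real checksummed bytes; the class name `Rdt21Receiver` and its `receive` method are hypothetical names chosen for this example.

```python
ACK, NAK = "ACK", "NAK"


class Rdt21Receiver:
    """One-bit (alternating) sequence number receiver in the style of rdt2.1."""

    def __init__(self):
        self.expected = 0      # "Wait for 0 from below" or "Wait for 1 from below"
        self.delivered = []    # data handed to the upper layer

    def receive(self, seq, data, corrupt=False):
        if corrupt:
            return NAK                     # garbled packet: ask for a resend
        if seq == self.expected:
            self.delivered.append(data)    # new packet: deliver it
            self.expected ^= 1             # and flip the expected bit
        # An uncorrupted packet with the other seq # is a duplicate:
        # it is not delivered again, but it is still ACKed.
        return ACK


r = Rdt21Receiver()
replies = [
    r.receive(0, "a", corrupt=True),  # corrupted data packet
    r.receive(0, "a"),                # retransmission arrives intact
    r.receive(0, "a"),                # duplicate (sender saw a garbled ACK)
    r.receive(1, "b"),
]
print(replies, r.delivered)
```

The third call shows why the sequence bit is needed: without it the receiver could not tell the duplicate from new data.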
rdt2.1: Example 1
(animation frames over the sender and receiver FSMs; one transition is highlighted per frame)
• frame 1, sender in "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
• frame 2, receiver in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• frame 3, sender in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
• frame 4, receiver in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• frame 5, sender in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
• frame 6: no transition highlighted
rdt2.1: Example 2
(animation frames over the sender and receiver FSMs; one transition is highlighted per frame)
• frame 1, receiver in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• frame 2, sender in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
• frame 3, receiver in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• frame 4, sender in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
• frame 5: no transition highlighted
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
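rdt2.2: sender, receiver fragments
(transitions written as "event / actions -> next state"; Λ means no action)
sender FSM fragment:
• Wait for call 0 from above: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> Wait for ACK 0
• Wait for ACK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt) -> Wait for ACK 0
• Wait for ACK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ -> (next state, not shown in the fragment)
receiver FSM fragment:
• (previous state, not shown in the fragment): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt) -> Wait for 0 from below
• Wait for 0 from below: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt) -> Wait for 0 from below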
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
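rdt3.0 sender
(FSM; transitions written as "event / actions -> next state"; Λ means no action)
Wait for call 0 from above:
• rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> Wait for ACK0
• rdt_rcv(rcvpkt) / Λ -> Wait for call 0 from above
Wait for ACK0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ -> Wait for ACK0
• timeout / udt_send(sndpkt); start_timer -> Wait for ACK0
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer -> Wait for call 1 from above
Wait for call 1 from above:
• rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> Wait for ACK1
• rdt_rcv(rcvpkt) / Λ -> Wait for call 1 from above
Wait for ACK1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ -> Wait for ACK1
• timeout / udt_send(sndpkt); start_timer -> Wait for ACK1
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer -> Wait for call 0 from above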
rdt3.0 in action
(sender/receiver timelines, one per scenario)
(a) no loss:
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(b) packet loss:
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; pkt1 is lost (X)
• sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(c) ACK loss:
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK:
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (delayed)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack1 (second copy), do nothing
• sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g., 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \dfrac{L}{R} = \dfrac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• $U_{sender}$: utilization, the fraction of time sender is busy sending
  $U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, one 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]
$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$
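The utilization figures for stop-and-wait and for 3-packet pipelining follow from the same formula. A small sketch for checking the arithmetic; the function name `sender_utilization` and its `window` parameter are assumptions made for this example.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets are sent per RTT + L/R."""
    d_trans = packet_bits / link_bps          # L/R, time to push one packet onto the link
    cycle = rtt_s + d_trans                   # time until the first ACK comes back
    busy = min(window * d_trans, cycle)       # the sender cannot be busy longer than the cycle
    return busy / cycle


# Slide example: 8000 bit packets, 1 Gbps link, RTT = 30 ms.
u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_pipelined = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"stop-and-wait: {u_stop_and_wait:.5f}, 3-packet pipelining: {u_pipelined:.5f}")
```

With these numbers a window of a few thousand packets would be needed before the link is kept busy all the time.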
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
(single state "Wait"; transitions written as "event / actions"; Λ means no event or no action)
• initial: Λ / base = 1; nextseqnum = 1
• rdt_send(data) /
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
• timeout /
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
• rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
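The extended FSM maps almost line by line onto a small class. The sketch below is an illustration under assumptions, not a full protocol: the channel is replaced by a list `sent` that logs sequence numbers handed to it, the timer is a boolean, and the class name `GbnSender` is hypothetical. Unlike the FSM, `on_ack` never moves `base` backwards, a defensive choice in case an old ACK arrives late.

```python
class GbnSender:
    """Go-Back-N sender bookkeeping: base, nextseqnum, one timer."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> data, kept for retransmission
        self.timer_running = False
        self.sent = []              # seq numbers passed to the channel, in order

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data(data)
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True   # first in-flight packet starts the timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # cumulative ACK: everything up to and including acknum is acknowledged
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)   # go back N: resend every unacked packet


s = GbnSender(window=4)
accepted = [s.rdt_send(d) for d in "abcde"]   # fifth call hits the window limit
s.on_ack(2)                                   # packets 1 and 2 acknowledged
s.on_timeout()                                # packets 3 and 4 are resent
print(accepted, s.base, s.sent)
```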
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
FSM (single state "Wait"; transitions written as "event / actions"; Λ means no event):
• initial: Λ / expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(expectedseqnum, ACK, chksum); udt_send(sndpkt); expectedseqnum++
• default / udt_send(sndpkt)
GBN in action
(sender window N = 4, sequence numbers 0 to 8; sender/receiver timeline)
• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets
Selective repeat sender receiver windows
54
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
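The three receiver cases translate directly into code. A minimal sketch, assuming sequence numbers that never wrap around (a real implementation uses a finite sequence space); the class name `SrReceiver` is hypothetical, and returning `n` stands in for sending ACK(n).

```python
class SrReceiver:
    """Selective repeat receiver: buffer out-of-order packets, deliver in order."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}       # seq -> data for packets received out of order
        self.delivered = []    # data handed to the upper layer, in order

    def receive(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase and advance the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n           # ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: ACK(n) again, the old ACK may be lost
        return None            # otherwise: ignore


r = SrReceiver(window=4)
acks = [r.receive(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]   # pkt2 arrives last
print(acks, r.delivered, r.rcvbase)
```

In this run packets 3, 4 and 5 wait in the buffer and are delivered together once pkt2 arrives.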
Selective repeat in action
(sender window N = 4, sequence numbers 0 to 8; sender/receiver timeline)
• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem:
• sender sends pkt0, pkt1, pkt2; all three ACKs arrive and the sender window advances
• receiver window (after receipt) covers 3, 0, 1: will accept packet with seq number 0
• sender sends pkt3, which is lost (X), then a new pkt0, which the receiver accepts
(b) oops!:
• sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X)
• receiver window (after receipt) covers 3, 0, 1: will accept packet with seq number 0
• sender times out and retransmits the old pkt0, which the receiver accepts
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
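One way to explore the question is to test the worst case from scenario (b) for different window sizes. The sketch below is an illustration, with the hypothetical helper `sr_is_ambiguous`: the receiver has accepted a full window and moved on, every ACK was lost, and the sender retransmits the whole old window.

```python
def sr_is_ambiguous(seq_space, window):
    """True if an old, retransmitted packet can land inside the receiver's new window."""
    # Old window, resent by the sender after its timers expire.
    sender_retransmits = set(range(window))
    # New receiver window after accepting seq 0..window-1 (numbers wrap around).
    receiver_expects = {(window + i) % seq_space for i in range(window)}
    return bool(sender_retransmits & receiver_expects)


print(sr_is_ambiguous(seq_space=4, window=3))   # the slide's example
print(sr_is_ambiguous(seq_space=4, window=2))
```

In this model the two windows stay disjoint exactly when the window size is at most half the sequence number space, which matches the usual textbook rule for selective repeat.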
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
(segment is 32 bits wide; fields in order)
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[figure: sender sequence number space with window size N, split into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. The outgoing segment from sender carries the sequence number; the incoming segment to sender carries the acknowledgement number A and rwnd]
Byte stream in TCP
[figure: an HTTP GET message (K bytes) viewed as a byte stream with a window of N bytes. A segment carrying M bytes that start at the 100th byte gets a TCP header with seq. no. = 100; bytes beyond the window cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
• User types 'C'. A to B: Seq=42, ACK=79, data = 'C'
• host B ACKs receipt of 'C', echoes back 'C'. B to A: Seq=79, ACK=43, data = 'C'
• host A ACKs receipt of echoed 'C'. A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[figure: SampleRTT and EstimatedRTT for the path gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100 to 350) vs. time (seconds, 1 to 106)]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT means larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
  (typically, β = 0.25)
  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
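The three update rules fit in a few lines. The sketch below makes two assumptions that the slides leave open and that follow RFC 6298: the first sample initializes EstimatedRTT to the sample and DevRTT to half of it, and DevRTT is updated before EstimatedRTT on each new sample. The class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    """EWMA-based EstimatedRTT, DevRTT and TimeoutInterval (times in milliseconds)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2     # assumption: initial value as in RFC 6298

    def update(self, sample_rtt):
        # DevRTT is computed against the estimate from before this sample.
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * abs(sample_rtt - self.estimated_rtt)
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
timeout = est.update(200.0)     # one much slower sample
print(est.estimated_rtt, est.dev_rtt, timeout)
```

A single outlier moves EstimatedRTT only slightly, while the larger DevRTT widens the safety margin.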
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
(single state "wait for event"; transitions written as "event / actions"; Λ means no event)
• initial: Λ / NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
• data received from application above /
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
• timeout /
    retransmit not-yet-acked segment with smallest seq. #
    start timer
• ACK received, with ACK field value y /
    if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
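The three events of the simplified sender can be sketched as a small class. This is an illustration under assumptions: passing a segment to IP is modelled by appending to a list `wire`, the timer is a boolean, and the names `SimpleTcpSender`, `on_app_data`, `on_timeout` and `on_ack` are hypothetical.

```python
class SimpleTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow or congestion control."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq of first byte -> data, not yet ACKed
        self.timer_running = False
        self.wire = []               # (seq, data) segments passed to IP, in order

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.wire.append((self.next_seq, data))
        self.next_seq += len(data)   # seq numbers count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            self.timer_running = False
            return
        seq = min(self.unacked)      # not-yet-acked segment with smallest seq #
        self.wire.append((seq, self.unacked[seq]))
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:       # cumulative ACK for previously unacked data
            self.send_base = y
            for seq in [s for s in self.unacked if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


s = SimpleTcpSender(initial_seq=92)
s.on_app_data(b"x" * 8)      # Seq=92, 8 bytes of data
s.on_app_data(b"y" * 20)     # Seq=100, 20 bytes of data
s.on_timeout()               # premature timeout: Seq=92 is resent
s.on_ack(100)
s.on_ack(120)
print(s.send_base, s.timer_running, [seq for seq, _ in s.wire])
```

The numbers follow the premature timeout scenario with Seq=92 and Seq=100 used in the retransmission examples.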
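TCP: retransmission scenarios
(Host A sends data, Host B replies with ACKs)
lost ACK scenario:
• A sends Seq=92, 8 bytes of data
• B replies ACK=100; the ACK is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100
premature timeout:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100, then ACK=120
• A's timeout for Seq=92 fires before the ACKs arrive; A resends Seq=92, 8 bytes of data
• ACK=100 arrives (SendBase=100), then ACK=120 (SendBase=120)
• B answers the retransmission with cumulative ACK=120 (SendBase=120)
cumulative ACK:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so nothing is resent; A continues with Seq=120, 15 bytes of data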
TCP ACK generation [RFC 5681]
(each entry: event at receiver, then TCP receiver action)
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  – delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  – immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq #; gap detected
  – immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  – immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
• likely that unacked segment lost, so don't wait for timeout
[figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 four times (the last three are the 3 DUP ACKs); A resends Seq=100, 20 bytes of data before the timeout expires]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
[figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code (in the OS) into the TCP socket receiver buffers, from which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data goes to the application process]
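The bookkeeping behind rwnd can be sketched with two helper functions. The variable names `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked` are assumptions taken from the usual textbook presentation, not from these slides, and both functions are hypothetical helpers.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises: buffer size minus buffered, unread data."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def can_send(nbytes, last_byte_sent, last_byte_acked, rwnd):
    """Sender-side check: in-flight data plus the new bytes must fit within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


# Receiver with a 4096 byte buffer that holds 2000 bytes the application has not read yet.
rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)
print(can_send(1000, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
print(can_send(1200, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
```

As long as the sender respects this check, the buffered data can never exceed RcvBuffer.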
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[figure: client and server, each sitting between application and network, holding: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
(client state starts as CLOSED, server state starts as LISTEN)
• client: choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x; client enters SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1; server enters SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK: ACKbit=1, ACKnum=y+1 (this segment may contain client-to-server data); client enters ESTAB
• server: received ACK(y) indicates client is live; server enters ESTAB
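TCP 3-way handshake: FSM
(transitions written as "event / actions -> next state"; Λ means no action)
client side:
• closed: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x) -> SYN sent
• SYN sent: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1) -> ESTAB
server side:
• closed: Socket connectionSocket = welcomeSocket.accept(); / Λ -> listen
• listen: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client -> SYN rcvd
• SYN rcvd: ACK(ACKnum=y+1) / Λ -> ESTAB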
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
closing sequence (client state and server state both start in ESTAB):
• client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
• server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
• server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
• client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
• server receives the ACK and enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
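Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, $\lambda_{in}$, approaches capacity
[figure: Host A sends original data at rate $\lambda_{in}$ into unlimited shared output link buffers; Host B receives throughput $\lambda_{out}$. Two plots: $\lambda_{out}$ vs. $\lambda_{in}$, which levels off at R/2, and delay vs. $\lambda_{in}$, which grows sharply as $\lambda_{in}$ approaches R/2]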
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  – transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$
[figure: Host A sends into finite shared output link buffers toward Host B. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$: output]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[figure: Host A keeps a copy of each packet and sends only while the finite shared output link buffers have free buffer space, and holds back when there is no buffer space. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data. Plot: $\lambda_{out}$ vs. $\lambda_{in}$, both up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
[figure: Host A keeps a copy and resends into free buffer space toward Host B. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data. Plot: $\lambda_{out}$ vs. $\lambda_{in}$, both up to R/2]
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[figure: Host A times out and sends a second copy while free buffer space exists; both copies reach Host B. Plot: $\lambda_{out}$ vs. $\lambda_{in}$, both up to R/2]
Causes/costs of congestion: scenario 2
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[figure: plot of $\lambda_{out}$ vs. $\lambda_{in}$, both up to R/2, for the realistic case with duplicates]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput goes to 0
[figure: Hosts A, B, C, D connected over multihop paths through routers with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$: output]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[figure: throughput by blue traffic ($\lambda_{out}$) vs. offered load by Host A ($\lambda'_{in}$), both up to R/2; the drop in the curve is labelled as bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[figure: AIMD saw tooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) vs. time: additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
  $LastByteSent - LastByteAcked \le cwnd$
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $rate \approx \dfrac{cwnd}{RTT}$ bytes/sec
[figure: sender sequence number space, showing last byte ACKed, the bytes sent but not-yet ACKed ("in-flight"), and last byte sent; cwnd spans the in-flight region]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initialization (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

state "slow start":
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ (no action); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start

state "congestion avoidance":
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

state "fast recovery":
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
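The three-state machine summarized above can be sketched in a few lines of Python. This is only an illustration of the window arithmetic, not a TCP implementation: the class name `RenoSender`, the method names and the MSS value of 1460 bytes are assumptions made for the example, and nothing is actually transmitted or retransmitted.

```python
from dataclasses import dataclass

MSS = 1460  # bytes; assumed segment size, chosen only for illustration


@dataclass
class RenoSender:
    """Tracks congestion-control state only; no segments are sent."""
    state: str = "slow_start"
    cwnd: float = MSS            # congestion window, bytes
    ssthresh: float = 64 * 1024  # slow-start threshold, bytes
    dup_acks: int = 0

    def on_new_ack(self) -> None:
        if self.state == "slow_start":
            self.cwnd += MSS                      # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        else:                                     # fast_recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # window inflation
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # missing segment would be resent here
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self) -> None:                 # same reaction in every state
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls reproduces the sawtooth: exponential growth up to ssthresh, linear growth afterwards, and a halving or reset on loss.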
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) · W / RTT bytes/sec
[Figure: sawtooth of the window size oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
  - to achieve 10 Gbps throughput, need a loss rate of L = 2 · 10^-10, a very small loss rate!
• new versions of TCP for high-speed
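The two numbers in the example can be checked with a short calculation. The sketch below only restates the arithmetic: the window is the bandwidth-delay product expressed in segments, and the loss rate comes from solving the Mathis formula for L. The variable and function names are chosen for this example.

```python
import math

MSS_BYTES = 1500     # segment size from the example
RTT_S = 0.100        # 100 ms round-trip time
TARGET_BPS = 10e9    # desired throughput, 10 Gbps

# Window needed to keep the pipe full: bandwidth-delay product in segments.
window_segments = TARGET_BPS * RTT_S / (MSS_BYTES * 8)

# Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for the loss probability L.
loss = (1.22 * MSS_BYTES * 8 / (RTT_S * TARGET_BPS)) ** 2


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_prob: float) -> float:
    """Throughput in bits/sec predicted by the formula above."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))


print(f"W = {window_segments:.0f} segments, L = {loss:.2e}")
```

Plugging the computed loss rate back into `mathis_throughput_bps` returns the 10 Gbps target, which is a quick consistency check on the algebra.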
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving from points (X1, Y1) where X1 + Y1 = R towards (X2, Y2) where X2 = Y2.]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
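The parallel-connection example follows from assuming that every TCP connection on the bottleneck ends up with an equal share. A small sketch of that arithmetic (the function name `app_share` is made up for this example):

```python
from fractions import Fraction


def app_share(app_conns: int, other_conns: int) -> Fraction:
    """Fraction of the link rate R an application receives, assuming the
    bottleneck is split equally among all TCP connections."""
    return Fraction(app_conns, app_conns + other_conns)


print(app_share(1, 9))    # one connection next to 9 existing ones
print(app_share(11, 9))   # eleven connections next to 9 existing ones
```

With 11 connections the exact share is 11/20 of R, which the slide rounds to R/2.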
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]
rdt2.1: receiver, handles garbled ACK/NAKs
state "Wait for 0 from below":
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
  extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
• rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
  sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
state "Wait for 1 from below":
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
  extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
• rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
  sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
rdt2.1: Example 1
(sender states: "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1"; receiver states: "Wait for 0 from below", "Wait for 1 from below"; each step is the transition highlighted on one slide)
1. sender, in "Wait for call 0 from above": rdt_send(data)
   sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt)
   sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
   udt_send(sndpkt)
4. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
   Λ (no action)
rdt2.1: Example 2
(same sender and receiver states as in Example 1; each step is the transition highlighted on one slide)
1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
   udt_send(sndpkt)
3. receiver, in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
   Λ (no action)
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
• state "Wait for call 0 from above", event rdt_send(data):
  sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0", event rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)):
  udt_send(sndpkt)
• state "Wait for ACK 0", event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0):
  Λ (no action); leave "Wait for ACK 0"
receiver FSM fragment:
• state "Wait for 0 from below", event rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
  udt_send(sndpkt)
• transition into "Wait for 0 from below", event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
  extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
state "Wait for call 0 from above":
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
• rdt_rcv(rcvpkt): Λ (no action)
state "Wait for ACK0":
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ (no action)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to "Wait for call 1 from above"
state "Wait for call 1 from above":
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
• rdt_rcv(rcvpkt): Λ (no action)
state "Wait for ACK1":
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ (no action)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to "Wait for call 0 from above"
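The sender above, paired with an alternating-bit receiver, can be simulated in a few lines. The sketch below is a simplification rather than the FSM itself: loss is scripted through the hypothetical `drop_data` and `drop_ack` arguments, a lost packet or ACK is treated as an immediate timeout, and corruption and premature timeouts are not modeled.

```python
def run_rdt3(messages, drop_data=(), drop_ack=()):
    """Simulate stop-and-wait with alternating sequence numbers 0/1.

    drop_data / drop_ack hold the 0-based indices, counted over all
    transmissions of that kind, that the channel loses.
    Returns (messages delivered to the upper layer, data transmissions).
    """
    delivered = []
    expected = 0              # receiver: sequence number it is waiting for
    data_tx = ack_tx = 0      # transmission counters used to look up losses
    for i, msg in enumerate(messages):
        seq = i % 2
        while True:
            lost = data_tx in drop_data
            data_tx += 1
            if lost:
                continue      # no ACK will arrive: timeout, resend
            if seq == expected:
                delivered.append(msg)   # new packet: deliver, flip expectation
                expected ^= 1
            # duplicate or not, the receiver ACKs the sequence number it got
            ack_lost = ack_tx in drop_ack
            ack_tx += 1
            if not ack_lost:
                break         # sender gets ACK(seq) and stops its timer
            # ACK lost: timeout, the same packet is sent again
    return delivered, data_tx


print(run_rdt3(["a", "b", "c"]))
print(run_rdt3(["a", "b"], drop_ack={1}))
```

In the second call the ACK for "b" is lost, so "b" is transmitted twice, but the receiver recognizes the duplicate by its sequence number and delivers it only once.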
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  Dtrans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs
• U sender: utilization, the fraction of time sender is busy sending
  U sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last packet bit transmitted at t = L / R; first packet bit arrives, then last packet bit arrives and the receiver sends an ACK; the ACK arrives and the sender sends the next packet at t = RTT + L / R.]
U sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight. First packet bit transmitted at t = 0; last bit transmitted at t = L / R; the first packet arrives and is ACKed, then the last bit of the 2nd packet arrives (send ACK), then the last bit of the 3rd packet arrives (send ACK); the first ACK arrives and the sender sends the next packet at t = RTT + L / R.]
3-packet pipelining increases utilization by a factor of 3!
U sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
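The utilization figures for stop-and-wait and for pipelining come from the same expression, which the sketch below evaluates for the 1 Gbps example. The function name and the `window_pkts` parameter are assumptions for illustration; `window_pkts=1` corresponds to stop-and-wait, and the result is capped at 1 because a sender cannot be busy more than all of the time.

```python
def sender_utilization(packet_bits: float, link_bps: float,
                       rtt_s: float, window_pkts: int = 1) -> float:
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / link_bps          # transmission delay L/R, seconds
    return min(1.0, window_pkts * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, window_pkts=3)
print(f"{stop_and_wait:.5f} {pipelined:.5f}")
```

Raising `window_pkts` until the result reaches 1.0 gives a rough idea of how large the window must be to keep this link fully used.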
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initialization (Λ): base = 1; nextseqnum = 1
single state "Wait":
• rdt_send(data):
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
• rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
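The extended FSM translates almost line by line into code. The following sketch keeps the variable names `base`, `nextseqnum` and `sndpkt` from the FSM; the class name, the `send` callback and the boolean `timer_running` flag (standing in for a real timer) are assumptions for illustration. One deliberate deviation is marked in a comment: an old cumulative ACK never moves `base` backwards.

```python
class GBNSender:
    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send           # callable(seq, data): hypothetical unreliable send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}           # seq -> data, kept until cumulatively ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False           # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is confirmed.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        # Deviation from the FSM: max() makes a duplicate (older) ACK harmless.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])   # go back N: resend whole window
```

A quick way to exercise it is to pass `send=lambda seq, data: log.append(seq)` with `log = []` and inspect which sequence numbers are transmitted after `on_ack` and `on_timeout` calls.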
GBN: receiver extended FSM
initialization (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
single state "Wait":
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(expectedseqnum, ACK, chksum); udt_send(sndpkt); expectedseqnum++
• default:
    udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
GBN in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
[Figure: Selective repeat: sender, receiver windows.]
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
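The receiver rules above fit in one small method. This sketch assumes unbounded sequence numbers (no wraparound) to keep the window tests simple; the class name and the `deliver` and `send_ack` callbacks are hypothetical stand-ins for the upper layer and the channel.

```python
class SRReceiver:
    def __init__(self, window_size, deliver, send_ack):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}          # seq -> data received out of order
        self.deliver = deliver    # callable(data): hands data to upper layer
        self.send_ack = send_ack  # callable(seq): sends an individual ACK

    def on_packet(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.send_ack(n)
            self.buffer.setdefault(n, data)
            # Deliver the in-order run starting at rcvbase and slide the window.
            while self.rcvbase in self.buffer:
                self.deliver(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
        elif self.rcvbase - self.N <= n < self.rcvbase:
            self.send_ack(n)      # already delivered: re-ACK so sender can advance
        # otherwise: ignore the packet
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 with a window of 4 buffers packets 3 to 5 and delivers packets 2 to 5 together once the gap is filled.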
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
  receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
  sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem: the sender sends pkt0, pkt1, pkt2 and receives all three ACKs; it then sends pkt3, which is lost (X), followed by a new pkt0. The receiver window has advanced, so the receiver will accept packet with seq number 0.
(b) oops!: the sender sends pkt0, pkt1, pkt2, but all three ACKs are lost (XXX); it times out and retransmits pkt0. The receiver window has advanced, so the receiver will accept packet with seq number 0.
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
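One way to explore the question is to test directly whether the sender's old window and the receiver's advanced window can contain the same sequence number. The sketch below models only the worst case from scenario (b), where every ACK of a full window is lost; the function name is made up for this example.

```python
def sr_window_is_safe(seq_space: int, window: int) -> bool:
    """True if the receiver can never mistake a retransmission for new data.

    Worst case: the receiver got a full window and advanced, but every ACK
    was lost, so the sender still sits one window behind. The sender's old
    window and the receiver's new window then cover 2*window consecutive
    sequence numbers, which must all be distinct modulo seq_space.
    """
    old = {n % seq_space for n in range(0, window)}
    new = {n % seq_space for n in range(window, 2 * window)}
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3))   # the example above: 4 seq #'s, window 3
print(sr_window_is_safe(4, 2))
```

Under this model the check passes exactly when the window size is at most half the size of the sequence number space.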
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
(each row is 32 bits wide)
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flags U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
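To make the layout concrete, the fixed 20-byte part of the header can be unpacked with Python's `struct` module. This is a sketch for illustration: the function name and the dictionary keys are invented, options are skipped rather than parsed, and the checksum is read but not verified.

```python
import struct


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header and split off the payload."""
    (src, dst, seq, ack, off_flags, rwnd, checksum, urg_ptr) = struct.unpack(
        "!HHIIHHHH", segment[:20])            # network byte order
    header_len = (off_flags >> 12) * 4        # head len counts 32-bit words
    flag_bits = off_flags & 0x3F              # low six bits: U A P R S F
    names = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]  # bit 0 upwards
    return {
        "src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
        "header_len": header_len,
        "flags": [n for i, n in enumerate(names) if flag_bits & (1 << i)],
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg_ptr,
        "payload": segment[header_len:],      # data starts after any options
    }


# A hand-built segment: ports 12345 -> 80, Seq=42, ACK=79, PSH+ACK, data 'C'.
demo = struct.pack("!HHIIHHHH", 12345, 80, 42, 79, (5 << 12) | 0x18, 4096, 0, 0) + b"C"
print(parse_tcp_header(demo))
```

The demo segment uses a head len of 5 words, meaning 20 bytes and no options, so the payload begins immediately after the fixed header.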
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), and the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
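Byte stream in TCP
[Figure: an HTTP GET message of K bytes in the byte stream. A segment whose data starts at the 100th byte carries a TCP header with seq. no. = 100 followed by M bytes of data; the window is N bytes, and bytes beyond it cannot be transmitted now.]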
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. A to B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. B to A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) versus time in seconds.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
  (estimated RTT plus "safety margin")
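The two moving averages and the timeout formula combine into a small estimator. The sketch below follows the formulas above with the typical weights 0.125 and 0.25. Two details are assumptions not stated on the slides: the first sample initializes EstimatedRTT and sets DevRTT to half of it (the initialization used in RFC 6298), and DevRTT is updated before EstimatedRTT.

```python
class RttEstimator:
    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumed initial value, see lead-in

    def update(self, sample_rtt: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Deviation first, measured against the previous EstimatedRTT.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(100.0)            # milliseconds
for sample in (200.0, 120.0, 110.0):
    print(round(est.update(sample), 1))
```

A single outlier, such as the 200 ms sample here, moves EstimatedRTT only slightly but widens the safety margin noticeably through DevRTT.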
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initialization (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
single state "wait for event":
• data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
• timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
• ACK received, with ACK field value y:
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
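TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
  A: sends Seq=92, 8 bytes of data (SendBase=92)
  B: sends ACK=100, which is lost (X)
  A: timeout, resends Seq=92, 8 bytes of data
  B: sends ACK=100
  A: SendBase=100
premature timeout (Host A, Host B):
  A: sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
  B: sends ACK=100, then ACK=120
  A: timeout fires before the ACKs arrive, resends Seq=92, 8 bytes of data
  A: receives ACK=100 (SendBase=100), then ACK=120 (SendBase=120)
  B: sends ACK=120 for the retransmitted segment
  A: SendBase=120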
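TCP: retransmission scenarios
cumulative ACK (Host A, Host B):
  A: sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  B: sends ACK=100, which is lost (X), then ACK=120
  A: receives ACK=120 before the timeout, so nothing is retransmitted; sends Seq=120, 15 bytes of data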
TCP ACK generation [RFC 5681]
event at receiver, followed by TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  - delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  - immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq. #; gap detected
  - immediately send duplicate ACK, indicating seq. # of next expected byte
• arrival of segment that partially or completely fills gap
  - immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 duplicate ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B returns ACK=100 four times (the original plus 3 duplicate ACKs). Host A resends Seq=100, 20 bytes of data, before the timeout expires.]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from the sender pass through the IP code and TCP code in the OS into the TCP socket receiver buffers, from which the application process reads.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process.]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server (application over network) each hold connection state: ESTAB, and connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
initial states: client state CLOSED, server state LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
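TCP 3-way handshake: FSM
states: closed, listen, SYN rcvd, SYN sent, ESTAB
server side:
• closed to listen: Socket connectionSocket = welcomeSocket.accept(); Λ (no action)
• listen to SYN rcvd: SYN(x) received; send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd to ESTAB: ACK(ACKnum=y+1) received; Λ (no action)
client side:
• closed to SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent to ESTAB: SYNACK(seq=y, ACKnum=x+1) received; send ACK(ACKnum=y+1)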
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
sequence (client state and server state both start in ESTAB):
1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 · max segment lifetime), then CLOSED; server enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[Figure: Host A sends original data at rate λin into unlimited shared output link buffers; Host B receives throughput λout. Plots: λout versus λin (levels off at R/2) and delay versus λin (grows as λin approaches R/2).]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin
[Figure: Host A offers λin (original data) and λ'in (original data, plus retransmitted data) to finite shared output link buffers; Host B receives λout.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet and sends λin (original data) and λ'in (original data, plus retransmitted data) into finite shared output link buffers, shown once with free buffer space and once with no buffer space; Host B receives λout. Plot: λout versus λin, both up to R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
[Figure: Host A sends λin (original data) and λ'in (original data, plus retransmitted data) to a router with free buffer space; Host B receives λout. Plot: λout versus λin, both up to R/2.]
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[Figure: Host A keeps a copy of the packet and resends it after a timeout although the router has free buffer space; Host B receives both copies. Plot: λout versus λin, both up to R/2.]
Causes/costs of congestion: scenario 2
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
[Figure: plot of λout versus λin, both up to R/2. When sending at R/2, some packets are retransmissions, including duplicates that are delivered.]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λin' increase?
A: as red λin' increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers; Host A sends λin (original data) and λin' (original data, plus retransmitted data); λout is the delivered throughput.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Causescosts of congestion scenario 3
94
R2
R2
l out
linrsquo
Bandwidth wastage for packets dropped at the 2nd router
Offered load by Host A
Thro
ughp
ut b
y bl
ue tr
affic
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control
• routers provide feedback to end systems
 - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
 - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
 - additive increase: increase cwnd by 1 MSS every RTT until loss detected
 - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size until loss occurs, then cut window in half]
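The sawtooth can be reproduced with a few lines of Python. This is only a sketch: the function name `aimd_trace` and the `loss_threshold` parameter are hypothetical, and the rule "a loss happens whenever cwnd exceeds the threshold" is an assumption standing in for the unknown usable bandwidth.

```python
def aimd_trace(rounds, loss_threshold, mss=1):
    """Return cwnd (in units of MSS) observed at the start of each RTT.

    Assumption: a loss is detected in any RTT where cwnd exceeds
    loss_threshold; real TCP learns about loss from timeouts or dup ACKs.
    """
    cwnd = mss
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > loss_threshold:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease: halve cwnd
        else:
            cwnd += mss                 # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(8, 4))
```

With a threshold of 4 MSS the window climbs by one each RTT, is halved once it reaches 5, and then climbs again.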
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
• rate ≈ cwnd/RTT bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent, with cwnd spanning the in-flight bytes]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
 - initially cwnd = 1 MSS
 - double cwnd every RTT
 - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments]
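The link between "increment cwnd for every ACK" and "double cwnd every RTT" can be checked with a short sketch. The function name `slow_start` is hypothetical, and the model assumes no loss and exactly one ACK per segment.

```python
def slow_start(rtts, mss=1):
    """Return the number of segments sent in each RTT during slow start.

    Assumptions: no loss, and every segment sent in an RTT is ACKed
    before the next RTT begins.
    """
    cwnd = mss
    per_rtt = []
    for _ in range(rtts):
        per_rtt.append(cwnd)
        acks = cwnd // mss          # one ACK per segment sent this RTT
        for _ in range(acks):
            cwnd += mss             # cwnd grows by 1 MSS for every ACK
    return per_rtt


print(slow_start(4))
```

Each RTT delivers as many ACKs as segments were sent, so adding 1 MSS per ACK doubles the window: one, two, four, eight segments.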
TCP: detecting, reacting to loss
• loss indicated by timeout:
 - cwnd set to 1 MSS
 - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
 - dup ACKs indicate network capable of delivering some segments
 - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA (congestion avoidance)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; start in slow start
state: slow start
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh (Λ): go to congestion avoidance
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
state: congestion avoidance
  new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
state: fast recovery
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
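The three-state machine above can be written as a small event-driven class. This is a hedged sketch, not a TCP implementation: the class name `RenoSender` and its method names are hypothetical, the window is counted in segments (MSS = 1) rather than bytes, and actual transmission and retransmission are left out.

```python
MSS = 1  # assumption: window counted in segments, so 1 MSS == 1


class RenoSender:
    """Tracks cwnd, ssthresh and state for the slide's congestion-control FSM."""

    def __init__(self):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64          # slide uses 64 KB; 64 segments here (assumption)
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh               # deflate window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                        # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)    # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS     # slide writes "ssthresh + 3"
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow start"


s = RenoSender()
for _ in range(7):
    s.on_new_ack()
for _ in range(3):
    s.on_dup_ack()
print(s.state, s.cwnd, s.ssthresh)
```

Feeding the object a sequence of events makes it easy to see which state each loss signal leads to and how far cwnd falls.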
TCP throughput
• avg. TCP throughput as function of window size, RTT?
 - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
 - avg. window size (# in-flight bytes) is 3/4 W
 - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) · W/RTT bytes/sec
[Figure: sawtooth of window size oscillating between W/2 and W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \dfrac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
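Both numbers on this slide follow from simple arithmetic, which the sketch below reproduces. The function names are hypothetical; the only inputs are the slide's example values and the throughput formula attributed to [Mathis 1997].

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps: rate * RTT / segment size."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS taken in bits)."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2


print(int(window_segments(10e9, 0.1, 1500)))
print(required_loss_rate(10e9, 0.1, 1500))
```

The first value is the 83,333 segments quoted above; the second comes out slightly above 2 · 10^-10, which the slide rounds.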
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2]
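The convergence argument can be checked numerically. The sketch below is a simplification with hypothetical names: both sessions see every loss at the same instant, have the same RTT, and a loss occurs exactly when their combined rate exceeds the link capacity.

```python
def two_flow_aimd(x, y, capacity, steps):
    """Evolve two AIMD flows sharing one bottleneck and return their rates.

    Assumptions: synchronized losses, equal RTTs, loss iff x + y > capacity.
    """
    for _ in range(steps):
        if x + y > capacity:
            x, y = x / 2, y / 2      # multiplicative decrease halves the gap
        else:
            x, y = x + 1, y + 1      # additive increase keeps the gap unchanged
    return x, y


x, y = two_flow_aimd(1, 80, capacity=100, steps=500)
print(round(x, 3), round(y, 3))
```

Additive increase moves the operating point parallel to the equal-share line and every halving cuts the distance to it in half, so the two rates end up nearly equal even from a very unequal start.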
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
 - do not want rate throttled by congestion control
• instead use UDP:
 - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
 - new app asks for 1 TCP, gets rate R/10
 - new app asks for 11 TCPs, gets R/2
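The parallel-connection example is a one-line calculation if every connection is assumed to get an equal share of the link. The function name `app_share` is hypothetical.

```python
from fractions import Fraction


def app_share(existing, new_conns):
    """Fraction of link rate R obtained by an app opening new_conns connections,

    assuming all connections on the link receive equal shares.
    """
    return Fraction(new_conns, existing + new_conns)


print(app_share(9, 1))    # one new connection among ten
print(app_share(9, 11))   # eleven new connections among twenty
```

With 11 connections the app holds 11 of 20 equal shares, which the slide rounds to R/2.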
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and arrives marked ECN=11; the destination returns a TCP ACK segment with ECE=1]
rdt2.1: Example 1
sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1
receiver states: Wait for 0 from below, Wait for 1 from below
step 1, sender (Wait for call 0 from above):
  event: rdt_send(data)
  action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
step 2, receiver (Wait for 0 from below):
  event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  action: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
step 3, sender (Wait for ACK or NAK 0):
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
  action: udt_send(sndpkt)
step 4, receiver (Wait for 0 from below):
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  action: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 5, sender (Wait for ACK or NAK 0):
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  action: Λ (none)
step 6: sender is in Wait for call 1 from above, receiver is in Wait for 1 from below
rdt2.1: Example 2
step 1, receiver (Wait for 0 from below):
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  action: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 2, sender (Wait for ACK or NAK 0):
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
  action: udt_send(sndpkt)
step 3, receiver (Wait for 1 from below):
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  action: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
step 4, sender (Wait for ACK or NAK 0):
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  action: Λ (none)
step 5: sender is in Wait for call 1 from above, receiver is in Wait for 1 from below
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
 - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
 - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
 - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
state: Wait for call 0 from above
  rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
state: Wait for ACK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): Λ (no action)
receiver FSM fragment
transition into Wait for 0 from below:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
state: Wait for 0 from below
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
 - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
 - retransmission will be duplicate, but seq #'s already handle this
 - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
state: Wait for call 0 from above
  rdt_rcv(rcvpkt): Λ (no action)
  rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
state: Wait for ACK0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ (no action)
  timeout: udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to Wait for call 1 from above
state: Wait for call 1 from above
  rdt_rcv(rcvpkt): Λ (no action)
  rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
state: Wait for ACK1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ (no action)
  timeout: udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to Wait for call 0 from above
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
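Scenarios (b) and (c) can be replayed with a small deterministic simulation. This is a sketch under stated assumptions: the function `stop_and_wait` and its `drop_data` / `drop_ack` parameters are hypothetical test hooks, corruption and premature timeouts are not modeled, and a lost packet or ACK is followed directly by a timeout.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Simulate alternating-bit transfer with timeout-driven retransmission.

    drop_data / drop_ack hold 0-based indices of the data / ACK
    transmissions that the channel loses (hypothetical test hooks).
    Returns (messages delivered to the application, event log).
    """
    delivered, log = [], []
    expected = 0                    # receiver: next expected sequence number
    data_tx = ack_tx = 0            # counters over all transmissions
    for i, msg in enumerate(messages):
        seq = i % 2                 # sequence numbers alternate 0, 1, 0, ...
        while True:
            log.append(f"send pkt{seq}")
            lost = data_tx in drop_data
            data_tx += 1
            if lost:
                log.append("timeout")
                continue            # no ACK will come: retransmit
            if seq == expected:     # new packet: deliver exactly once
                delivered.append(msg)
                expected = 1 - expected
            # duplicates are not delivered again, but are still ACKed
            log.append(f"send ack{seq}")
            ack_lost = ack_tx in drop_ack
            ack_tx += 1
            if ack_lost:
                log.append("timeout")
                continue            # sender never saw the ACK: retransmit
            break
    return delivered, log


data, events = stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={1})
print(data)
print(events)
```

In this run pkt1 is transmitted three times (once lost, once with its ACK lost, once successfully), yet the application receives each message exactly once, which is the job of the sequence number.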
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \dfrac{L}{R} = \dfrac{8000 \text{ bits}}{10^{9} \text{ bits/sec}} = 8 \text{ microsecs}$
• U_sender: utilization, fraction of time sender busy sending (times in msec):
  $U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
sender:
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• ACK arrives, send next packet, t = RTT + L/R
receiver:
• first packet bit arrives
• last packet bit arrives, send ACK
$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
 - range of sequence numbers must be increased
 - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
sender:
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• ACK arrives, send next packet, t = RTT + L/R
receiver:
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$
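The utilization formula for both the stop-and-wait and the pipelined case fits in one function. The name `sender_utilization` and the `window` parameter are hypothetical; the cap at 1.0 is an added assumption for windows large enough to keep the link permanently busy.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps            # L/R, transmission delay
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms
print(sender_utilization(8000, 1e9, 0.030))             # stop-and-wait
print(sender_utilization(8000, 1e9, 0.030, window=3))   # 3-packet pipeline
```

Because RTT dominates L/R here, utilization grows almost linearly with the window until the cap is reached.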
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
 - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
 - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
 - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
 - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially (Λ): base = 1; nextseqnum = 1
state: Wait
event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ (no action)
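The sender FSM translates almost line for line into a class. This is a sketch, not a network implementation: the class `GBNSender` is hypothetical, `self.sent` is a log that stands in for udt_send, the timer is a boolean flag, and the `max(...)` guard that keeps a stale ACK from moving base backwards is an addition not present in the FSM.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping: base, nextseqnum, window of size N."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq number -> buffered data
        self.timer_running = False
        self.sent = []              # log of transmitted seq numbers

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.sent.append(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True       # first in-flight packet
            self.nextseqnum += 1
            return True
        return False                            # refuse_data(data)

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is acked."""
        self.base = max(self.base, acknum + 1)  # guard is an assumption
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Retransmit every packet from base to nextseqnum - 1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)


s = GBNSender(window=4)
print([s.rdt_send(f"d{i}") for i in range(5)])  # fifth call is refused
s.on_ack(2)
s.rdt_send("d4")
s.on_timeout()
print(s.sent)
```

The log shows the characteristic cost of Go-Back-N: one timeout resends every unacknowledged packet, not just the missing one.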
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
 - may generate duplicate ACKs
 - need only remember expectedseqnum
• out-of-order pkt:
 - discard (don't buffer): no receiver buffering!
 - re-ACK pkt with highest in-order seq #
initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state: Wait
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
event: default
  udt_send(sndpkt)
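The receiver side is even smaller. In this sketch the class `GBNReceiver` is hypothetical, the return value of `rdt_rcv` stands in for the ACK that would be sent, and corruption is passed in as a flag rather than detected with a checksum.

```python
class GBNReceiver:
    """Go-Back-N receiver: accept only the expected packet, re-ACK otherwise."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0           # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)         # deliver_data(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: discard and resend the ACK for the last in-order packet
        return self.last_ack


r = GBNReceiver()
print([r.rdt_rcv(seq, f"d{seq}") for seq in (1, 2, 4, 5, 3)])
print(r.delivered)
```

Packets 4 and 5 arrive before 3, so they are discarded and ACK 2 is repeated; only the expected packet advances the ACK number.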
GBN in action
sender window (N = 4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
 - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
 - sender timer for each unACKed packet
• sender window
 - N consecutive seq #'s
 - limits seq #s of sent, unACKed packets
[Figure: Selective repeat: sender, receiver windows]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
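The receiver rules can be sketched as a class with one method. The name `SRReceiver` is hypothetical, and the sketch assumes unbounded sequence numbers (no wrap-around), which sidesteps the window-size question raised later.

```python
class SRReceiver:
    """Selective-repeat receiver with per-packet ACKs and a reorder buffer."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}            # seq number -> data awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data               # buffer (possibly out of order)
            while self.rcvbase in self.buffer:  # deliver any in-order run
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n                            # send ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                            # already delivered: re-ACK
        return None                             # otherwise: ignore


r = SRReceiver(window=4)
print([r.on_packet(n, f"d{n}") for n in (0, 1, 3, 4, 5, 2)])
print(r.delivered)
```

Packets 3, 4 and 5 are ACKed individually while they wait in the buffer; the arrival of packet 2 releases all four to the upper layer in order.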
Selective repeat in action
sender window (N = 4), sequence numbers 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem
  sender: send pkt0, pkt1, pkt2
  receiver: receives all three; receiver window (after receipt) advances to 3, 0, 1
  sender: ACKs arrive; send pkt3 (lost, X), then pkt0
  receiver: will accept packet with seq number 0
(b) oops!
  sender: send pkt0, pkt1, pkt2
  receiver: receives all three; receiver window (after receipt) advances to 3, 0, 1
  all three ACKs are lost (X X X)
  sender: timeout, retransmit pkt0
  receiver: will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
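One way to explore the question is to compare the sequence numbers the sender might still retransmit with those the receiver would accept as new. The function `sr_ambiguous` is a hypothetical helper, and it only examines the worst case from (b): a full window received and every ACK lost.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission could be mistaken for a new packet.

    Worst case: the sender sent packets 0..window-1, the receiver got
    them all and advanced its window, but every ACK was lost.
    """
    may_retransmit = {n % seq_space for n in range(window)}
    accepts_as_new = {n % seq_space for n in range(window, 2 * window)}
    return bool(may_retransmit & accepts_as_new)


print(sr_ambiguous(4, 3))   # slide example: seq #'s 0..3, window 3
print(sr_ambiguous(4, 2))   # smaller window, same seq #'s
```

With 4 sequence numbers a window of 3 overlaps but a window of 2 does not, consistent with the usual rule that the window should be at most half the sequence number space.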
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
 - one sender, one receiver
• reliable, in-order byte stream:
 - no "message boundaries"
• pipelined:
 - TCP congestion and flow control set window size
• full duplex data:
 - bi-directional data flow in same connection
 - MSS: maximum segment size
• connection-oriented:
 - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
 - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, with the fields below]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
 - URG: urgent data (generally not used)
 - ACK: ACK # valid
 - PSH: push data now
 - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
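The fixed 20-byte part of this layout can be decoded with Python's standard `struct` module. This is a minimal sketch: the function `parse_tcp_header` and the dictionary keys are hypothetical names, options are not decoded, and the flag bit positions follow the standard TCP header (FIN is the lowest bit).

```python
import struct


def parse_tcp_header(raw):
    """Decode the fixed 20-byte TCP header from the start of raw (bytes)."""
    (src, dst, seq, ack, off_reserved, flags,
     rwnd, checksum, urg_ptr) = struct.unpack("!HHIIBBHHH", raw[:20])
    names = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]   # bit 0 .. bit 5
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": (off_reserved >> 4) * 4,   # head len is in 32-bit words
        "flags": [n for i, n in enumerate(names) if flags & (1 << i)],
        "rwnd": rwnd,
        "checksum": checksum,
        "urg_ptr": urg_ptr,
    }


# Build a sample SYN+ACK header (values are made up for illustration)
sample = struct.pack("!HHIIBBHHH", 1234, 80, 42, 79, 5 << 4, 0x12, 4096, 0, 0)
print(parse_tcp_header(sample))
```

The `!` prefix selects network (big-endian) byte order, which is how the fields appear on the wire.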
TCP seq. numbers, ACKs
sequence numbers:
 - byte stream "number" of first byte in segment's data
acknowledgements:
 - seq # of next byte expected from other side
 - cumulative ACK
Q: how receiver handles out-of-order segments
 - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer). Sender sequence number space with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
Byte stream in TCP
[Figure labels: HTTP GET message (K bytes); window: N bytes; 100th byte; TCP header (seq. no. = 100); M bytes; "cannot be transmitted now"]
TCP seq. numbers, ACKs: simple telnet scenario
1. User types 'C'. Host A to Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B to Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A to Host B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
 - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
 - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
 - average several recent measurements, not just current SampleRTT
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) versus time in seconds]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
 - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
  (typically, β = 0.25)
  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (estimated RTT plus "safety margin")
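The three formulas combine into a small estimator. This is a sketch with assumptions: the class `RttEstimator` is hypothetical, the estimate is initialized to the first sample with zero deviation, and EstimatedRTT is updated before DevRTT (the slides do not fix an order, and real TCP stacks differ in such details).

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimation following the slide formulas."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample   # initialization is an assumption
        self.dev_rtt = 0.0

    def update(self, sample_rtt):
        """Fold one SampleRTT into the estimates; return TimeoutInterval."""
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)      # milliseconds
print(est.update(200.0), est.estimated_rtt, est.dev_rtt)
```

A single outlier of 200 ms moves the estimate only from 100 to 112.5 ms, but it widens the safety margin enough that the timeout jumps to 200 ms.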
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
 - pipelined segments
 - cumulative acks
 - single retransmission timer
• retransmissions triggered by:
 - timeout events
 - duplicate acks
let's initially consider simplified TCP sender:
 - ignore duplicate acks
 - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
 - think of timer as for oldest unacked segment
 - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
 - update what is known to be ACKed
 - start timer if there are still unacked segments
TCP sender (simplified)
initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state: wait for event
event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer
event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
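The three events map onto three methods. In this sketch the class `SimpleTcpSender` is hypothetical, `self.wire` is a log standing in for handing segments to IP, and the timer is reduced to a boolean flag, so timing itself is not modeled.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one timer, no flow/congestion control."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = []               # (seq, data) in send order
        self.timer_running = False
        self.wire = []                  # log of (seq, length) passed to IP

    def send(self, data):
        """Data received from application above."""
        self.unacked.append((self.next_seq, data))
        self.wire.append((self.next_seq, len(data)))
        self.next_seq += len(data)      # seq numbers count bytes
        self.timer_running = True       # start timer if not already running

    def on_timeout(self):
        """Retransmit the not-yet-acked segment with the smallest seq number."""
        if self.unacked:
            seq, data = self.unacked[0]
            self.wire.append((seq, len(data)))
            self.timer_running = True

    def on_ack(self, y):
        """ACK received with ACK field value y (next byte expected)."""
        if y > self.send_base:
            self.send_base = y
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


tx = SimpleTcpSender(initial_seq=92)
tx.send(b"x" * 8)       # Seq=92, 8 bytes
tx.send(b"y" * 20)      # Seq=100, 20 bytes
tx.on_timeout()         # resends only Seq=92
tx.on_ack(120)          # cumulative ACK covers both segments
print(tx.wire, tx.send_base, tx.timer_running)
```

Only the oldest segment is resent on timeout, and a single cumulative ACK of 120 clears both outstanding segments.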
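TCP: retransmission scenarios
lost ACK scenario
  1. Host A to Host B: Seq=92, 8 bytes of data
  2. Host B to Host A: ACK=100 (lost, X)
  3. timeout at Host A; Host A to Host B: Seq=92, 8 bytes of data
  4. Host B to Host A: ACK=100
premature timeout
  1. Host A to Host B: Seq=92, 8 bytes of data
  2. Host A to Host B: Seq=100, 20 bytes of data
  3. Host B to Host A: ACK=100 and ACK=120, both still in transit when the timer expires
  4. timeout at Host A; Host A to Host B: Seq=92, 8 bytes of data
  5. Host B to Host A: ACK=120
  (SendBase values annotated in the figure: 92, 100, 120, 120)
cumulative ACK
  1. Host A to Host B: Seq=92, 8 bytes of data
  2. Host A to Host B: Seq=100, 20 bytes of data
  3. Host B to Host A: ACK=100 (lost, X)
  4. Host B to Host A: ACK=120, arrives before the timeout
  5. Host A to Host B: Seq=120, 15 bytes of data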
TCP ACK generation [RFC 5681]
event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  TCP receiver action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK
event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
  TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments
event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected
  TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte
event at receiver: arrival of segment that partially or completely fills gap
  TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
 - long delay before resending lost packet
• detect lost segments via duplicate ACKs
 - sender often sends many segments back-to-back
 - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
 - likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK:
  1. Host A to Host B: Seq=92, 8 bytes of data
  2. Host A to Host B: Seq=100, 20 bytes of data (lost, X)
  3. Host B to Host A: ACK=100, followed by three duplicate ACKs with ACK=100
  4. Host A to Host B: Seq=100, 20 bytes of data, resent before the timeout expires
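Detecting the trigger needs only a counter. The class `DupAckDetector` is a hypothetical helper; it assumes the reading shown in the figure, where the first ACK=100 is the original and the three that follow are the duplicates.

```python
class DupAckDetector:
    """Signals fast retransmit on the third duplicate of the same ACK number."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return True when this ACK should trigger a fast retransmit."""
        if acknum == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3      # fire once, on the third duplicate
        self.last_ack, self.dup_count = acknum, 0
        return False


d = DupAckDetector()
print([d.on_ack(100) for _ in range(4)])
```

The fourth ACK=100 is the third duplicate, so that is when the sender resends the segment starting at byte 100.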
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
 - RcvBuffer size set via socket options (typical default is 4096 bytes)
 - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver protocol stack: data from sender passes through IP code and TCP code (OS) into TCP socket receiver buffers, then to the application process. Receiver-side buffering: RcvBuffer holds buffered data and free buffer space (rwnd); TCP segment payloads arrive and data passes to the application process]
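The two sides of flow control reduce to two small checks. The function names `rwnd` and `can_send` are hypothetical, and treating rwnd as RcvBuffer minus buffered data is read off the buffering figure rather than stated as a formula in the slides.

```python
def rwnd(rcv_buffer, buffered):
    """Receiver: advertised window = free space left in RcvBuffer."""
    return rcv_buffer - buffered


def can_send(last_byte_sent, last_byte_acked, nbytes, rwnd_value):
    """Sender: may nbytes more be sent without exceeding the advertised window?"""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd_value


window = rwnd(rcv_buffer=4096, buffered=1000)
print(window)
print(can_send(2000, 1000, 2000, window))
print(can_send(2000, 1000, 2100, window))
```

With 1000 bytes already in flight and 3096 bytes advertised, 2000 more bytes fit but 2100 would risk overflowing the receive buffer.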
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
on each side (between application and network):
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
TCP 3-way handshake: FSM
closed to listen: Socket connectionSocket = welcomeSocket.accept(); action: Λ
listen to SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYN rcvd to ESTAB: on ACK(ACKnum=y+1); action: Λ
closed to SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
SYN sent to ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)
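The sequence and acknowledgement numbers exchanged during the handshake can be generated mechanically. The function `three_way_handshake` is a hypothetical helper that builds the three segments as dictionaries; it does no networking.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYN": 1, "seq": x}                                   # client to server
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": syn["seq"] + 1}                          # server to client
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}                # client to server
    return [syn, synack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is how both learn that the other is live.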
TCP: closing a connection
• client, server each close their side of connection
 - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
 - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
1. clientSocket.close(): client sends FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server sends ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data); client state: FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
4. client sends ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED; server state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
 - lost packets (buffer overflow at routers)
 - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data at rate λ_in through a router with unlimited shared output link buffers; Host B receives throughput λ_out. Plots of λ_out versus λ_in and of delay versus λ_in, with R/2 marked on the axes]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
 - application-layer input = application-layer output: λ_in = λ_out
 - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A and Host B connected through a router with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet and sends to Host B through a router with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Two panels, labeled "free buffer space!" and "no buffer space!". Plot of λ_out versus λ_in, both axes marked at R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost (dropped at router) due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A sends to Host B through a router with free buffer space; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Plot of λ_out versus λ_in, both axes marked at R/2]
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
A
lin loutlincopy
free buffer space
timeout
R2
R2lin
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
Host B
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 2
91
R2
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt
sect decreasing goodput
R2lin
Causescosts of congestion scenario 2
92
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at the upper queue are dropped; blue throughput → 0
[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out.]
Another "cost" of congestion:
• when a packet is dropped, any upstream transmission capacity used for that packet was wasted
[Figure: throughput of blue traffic (λ_out, up to R/2) versus load offered by Host A (λ'_in, up to R/2), showing bandwidth wasted on packets dropped at the 2nd router.]
Approaches towards congestion control
Two broad approaches towards congestion control:
End-end congestion control
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
Network-assisted congestion control
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
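The sawtooth can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions that are not in the slides: cwnd is counted in whole MSS, one update happens per RTT, and a loss is assumed to occur whenever cwnd exceeds a fixed, hypothetical `capacity_mss`.

```python
def aimd_trace(capacity_mss, rounds, cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT under idealized AIMD."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:        # assumed loss signal (hypothetical model)
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: halve cwnd
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(capacity_mss=8, rounds=12)
print(trace)
```

The printed trace climbs by one MSS per RTT, drops to half after the assumed loss, and starts climbing again, which is the sawtooth described above.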
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd/RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent. cwnd spans the in-flight bytes.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments, then four segments, one RTT apart.]
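A minimal sketch of why "increment cwnd for every ACK" doubles the window each RTT. It assumes an idealized, loss-free path in which every segment sent in one RTT is ACKed within that RTT; the function name and units are illustrative only.

```python
def slow_start(rtts, mss=1):
    """Return cwnd after each RTT, starting from 1 MSS, with no loss."""
    cwnd = mss
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss          # one ACK per segment sent in this RTT
        for _ in range(acks):
            cwnd += mss             # +1 MSS for every ACK received
        history.append(cwnd)
    return history


print(slow_start(4))
```

Each RTT delivers as many ACKs as there were segments in flight, so the per-ACK increment adds up to a doubling per RTT.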
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
Initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
Slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ (no action); enter congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
Congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
Fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
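The three-state FSM can be sketched as a small Python class. This is an illustration, not an actual TCP stack: the class and method names are hypothetical, cwnd and ssthresh are kept in units of MSS, and retransmission and segment transmission are left out so that only the window arithmetic is shown.

```python
class RenoSender:
    """Window bookkeeping for the slow start / CA / fast recovery FSM."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += self.mss                      # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += self.mss * (self.mss / self.cwnd)   # ~1 MSS per RTT
        else:                                          # fast recovery ends
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                         # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack`, and `on_timeout` calls traces a path through the diagram summarized above.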
TCP throughput
• average TCP throughput as a function of window size and RTT?
  – ignore slow start, assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – average window size (number of in-flight bytes) is 3/4 W
  – average throughput is 3/4 W per RTT
  avg TCP throughput = $\frac{3}{4} \frac{W}{RTT}$ bytes/sec
[Figure: window sawtooth oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  → to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
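The two numbers in this example can be checked directly. The sketch below assumes MSS is converted to bits when used in the throughput formula; the function names are illustrative.

```python
def required_window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to fill a pipe of rate_bps for one RTT."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS in bits)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


W = required_window_segments(10e9, 0.1, 1500)
L = required_loss_rate(10e9, 0.1, 1500)
print(round(W), L)
```

W comes out at about 83,333 segments and L at roughly 2·10^-10, matching the figures quoted above.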
TCP Fairness
Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K.
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]
Why is TCP fair?
Two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R. The "full bandwidth utilization" line is the set of points (X1, Y1) where X1 + Y1 = R; the "equal bandwidth share" line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]
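A small simulation makes the geometric argument concrete. It is a sketch under idealized assumptions that are not in the slides: both flows have the same RTT, both add one unit per round, and both see a loss in the same round whenever their combined rate exceeds the capacity.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both halve, so the gap halves too
        else:
            x, y = x + 1, y + 1      # additive increase keeps the gap constant
    return x, y


x, y = two_flow_aimd(1.0, 41.0, capacity=50.0, rounds=200)
print(x, y, abs(x - y))
```

Additive increase never changes the difference between the two rates, while every halving cuts it in two, so the flows drift toward the equal-share line.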
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets roughly R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
Network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 along the path; the destination returns a TCP ACK segment with ECE=1.]
rdt2.1: Example 1
Sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1. Receiver states: Wait for 0 from below, Wait for 1 from below.
The slides step through this sequence (packet 0 arrives corrupted):
1. Receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && corrupt(rcvpkt)
   action: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
2. Sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
   action: udt_send(sndpkt)
3. Receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   action: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. Sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
   action: Λ (no action)
rdt2.1: Example 2
Same sender and receiver states as in Example 1. The slides step through this sequence (the ACK for packet 0 arrives corrupted):
1. Receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   action: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. Sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
   action: udt_send(sndpkt)
3. Receiver (Wait for 1 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   action: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate, so nothing is delivered)
4. Sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
   action: Λ (no action)
rdt2.1: discussion
Sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
Receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK was received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
Sender FSM fragment:
• Wait for call 0 from above, on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); enter Wait for ACK 0
• Wait for ACK 0, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)):
    udt_send(sndpkt)
• Wait for ACK 0, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
    Λ (no action)
Receiver FSM fragment:
• Wait for 0 from below, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
    udt_send(sndpkt)
• entering Wait for 0 from below, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
New assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help ... but not enough
Approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
Wait for call 0 from above:
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; enter Wait for ACK0
• rdt_rcv(rcvpkt): Λ (no action)
Wait for ACK0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ (no action)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; enter Wait for call 1 from above
Wait for call 1 from above:
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; enter Wait for ACK1
• rdt_rcv(rcvpkt): Λ (no action)
Wait for ACK1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ (no action)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; enter Wait for call 0 from above
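The alternating-bit behavior of this sender can be exercised with a toy simulation. It is a sketch, not the slides' FSM verbatim: the channel is a hypothetical one that only loses packets (no corruption, no delay), losses are scripted by transmission index, and a lost packet or ACK is treated as an immediate timeout.

```python
def rdt30_transfer(messages, drop_data=(), drop_acks=()):
    """Simulate stop-and-wait with a 1-bit seq #; return (delivered, data_tx).

    drop_data / drop_acks hold the 0-based indices of the data and ACK
    transmissions that the (hypothetical) channel loses.
    """
    delivered = []
    expected = 0              # receiver: sequence bit it expects next
    seq = 0                   # sender: sequence bit of the current packet
    data_tx = ack_tx = 0
    for msg in messages:
        while True:
            lost = data_tx in drop_data
            data_tx += 1                  # (re)transmit and (re)start timer
            if lost:
                continue                  # timeout: no ACK will come back
            if seq == expected:           # new packet: deliver exactly once
                delivered.append(msg)
                expected ^= 1
            # new or duplicate, the receiver ACKs the seq bit it just saw
            ack_lost = ack_tx in drop_acks
            ack_tx += 1
            if ack_lost:
                continue                  # timeout: retransmit same packet
            break                         # matching ACK arrived: stop timer
        seq ^= 1
    return delivered, data_tx


print(rdt30_transfer(["a", "b", "c"], drop_data={1}, drop_acks={1}))
```

In the example run, the second message is sent three times (one lost packet, one lost ACK) yet is delivered only once, because the receiver recognizes the repeated sequence bit as a duplicate.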
rdt3.0 in action
(a) no loss
  sender: send pkt0; receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0; receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1; pkt1 is lost (X)
  sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
  sender: send pkt0; receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
  sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0; receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (delayed)
  sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
  sender: rcv ack1 (second copy), do nothing
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \mu\text{s}$
  – U_sender: utilization, the fraction of time the sender is busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over a 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, receiver sends ACK; ACK arrives and sender sends next packet at t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
Pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives and sender sends next packet at t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3:
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$
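Both utilization figures follow from the same formula. The sketch below assumes the slide's numbers (8000-bit packets, 1 Gbps link, 30 ms RTT); the `window` parameter and the cap at 1.0 are illustrative additions.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy, with `window` packets per RTT."""
    d_trans = packet_bits / link_bps            # L/R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)            # stop-and-wait
u3 = sender_utilization(8000, 1e9, 0.030, window=3)  # 3-packet pipelining
print(u1, u3)
```

With these inputs u1 is about 0.00027 and u3 is three times that, about 0.0008.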
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
Initially (Λ): base = 1; nextseqnum = 1
Single state: Wait
• rdt_send(data):
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
• rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
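The extended FSM translates almost line for line into code. The class below is a sketch: the names are hypothetical, `udt_send` is replaced by appending the sequence number to a log, and the timer is a boolean flag rather than a real countdown.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping (no real network, no real timer)."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}
        self.timer_running = False
        self.sent = []                    # log standing in for udt_send

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False                  # refuse_data: window is full
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True     # start_timer for oldest unacked pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        self.base = acknum + 1            # cumulative ACK slides the window
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)         # go back N: resend all unacked pkts
```

A window of 4 accepts four packets and refuses the fifth; after a cumulative ACK for packet 2 the window slides, and a timeout resends everything from `base` up to `nextseqnum - 1`.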
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
Initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
Single state: Wait
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
• default:
    udt_send(sndpkt)
GBN in action
Sender window N = 4; sequence numbers 0 1 2 3 4 5 6 7 8.
sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets
[Figure: selective repeat sender and receiver windows.]
Selective repeat
Sender:
• data from above:
  – if next available seq # in window, send pkt
• timeout(n):
  – resend pkt n, restart timer
• ACK(n) in [sendbase, sendbase+N-1]:
  – mark pkt n as received
  – if n is smallest unACKed pkt, advance window base to next unACKed seq #
Receiver:
• pkt n in [rcvbase, rcvbase+N-1]:
  – send ACK(n)
  – out-of-order: buffer
  – in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
• pkt n in [rcvbase-N, rcvbase-1]:
  – ACK(n)
• otherwise:
  – ignore
Selective repeat in action
Sender window N = 4; sequence numbers 0 1 2 3 4 5 6 7 8.
sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
Example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem: sender sends pkt0, pkt1, pkt2 and receives all three ACKs, so its window advances; it sends pkt3 (lost, X) and then a new pkt0. The receiver window after receipt covers 3, 0, 1, so it will accept the packet with seq number 0.
(b) oops!: sender sends pkt0, pkt1, pkt2, but all three ACKs are lost (X X X); on timeout it retransmits the old pkt0. The receiver window again covers 3, 0, 1, so it will accept the packet with seq number 0.
• receiver can't see sender side: receiver behavior is identical in both cases. Something's (very) wrong!
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
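One way to explore the question is to check, for a given sequence-number space and window size, whether a retransmitted old packet can land inside the receiver's new window. The helper below is a sketch built on the worst case from scenario (b): the receiver has accepted a full window and advanced, while every ACK was lost. The function name is illustrative.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for new data."""
    # Sender may still retransmit seq #s of the old window ...
    old = {n % seq_space for n in range(0, window)}
    # ... while the receiver already expects the next full window.
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(4, 3), sr_ambiguous(4, 2))
```

For 4 sequence numbers the check fails with window size 3 and passes with window size 2, which is consistent with the standard rule that the window must be at most half the sequence-number space.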
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
Segment layout (each row is 32 bits wide):
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
Field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
Sequence numbers:
  – byte stream "number" of first byte in segment's data
Acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how does the receiver handle out-of-order segments?
  – A: TCP spec doesn't say; up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), shown next to the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
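Byte stream in TCP
[Figure: an HTTP GET message (K bytes) in the sender's byte stream, with a window of N bytes. A segment of M bytes starting at the 100th byte carries a TCP header with seq. no. = 100; the part of the message outside the window cannot be transmitted now.]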
TCP seq. numbers, ACKs
Simple telnet scenario:
  Host A: user types 'C'; sends Seq=42, ACK=79, data = 'C'
  Host B: ACKs receipt of 'C', echoes back 'C'; sends Seq=79, ACK=43, data = 'C'
  Host A: ACKs receipt of echoed 'C'; sends Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. RTT (milliseconds, 100 to 350) versus time (seconds, 1 to 106), plotting SampleRTT and EstimatedRTT.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
  TimeoutInterval = EstimatedRTT + 4 · DevRTT
  (estimated RTT plus "safety margin")
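The three formulas combine into a small estimator. This is a sketch with two assumptions that the slides do not fix: DevRTT starts at 0, and on each sample DevRTT is updated before EstimatedRTT, so the deviation is measured against the previous estimate. The class name is illustrative.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout interval (times in ms)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0            # assumed initial value

    def update(self, sample_rtt):
        # Deviation is measured against the previous estimate (assumption).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
print(est.update(120.0))
```

After one 120 ms sample on top of a 100 ms estimate, EstimatedRTT moves to 102.5 ms, DevRTT to 5 ms, and the timeout interval to 122.5 ms.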
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
Let's initially consider a simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events
Data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
Timeout:
• retransmit segment that caused timeout
• restart timer
Ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
Initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
Single state: wait for event
• data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
• timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
• ACK received, with ACK field value y:
    if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
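The three events map onto three methods. The class below is a sketch only: names are hypothetical, "passing a segment to IP" is modeled by appending `(seq, length)` to a list, and the single retransmission timer is a boolean flag.

```python
class SimpleTcpSender:
    """Simplified TCP sender: no dup-ACK handling, no flow/congestion control."""

    def __init__(self, initial_seq=0):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq # of first byte -> payload
        self.timer_running = False
        self.wire = []                # (seq, length) of each segment "sent"

    def send(self, data):
        self.unacked[self.next_seq] = data
        self.wire.append((self.next_seq, len(data)))
        self.next_seq += len(data)    # seq #s count bytes, not segments
        self.timer_running = True     # start timer if not already running

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)       # oldest not-yet-acked segment
        self.wire.append((seq, len(self.unacked[seq])))
        self.timer_running = True     # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            self.unacked = {s: d for s, d in self.unacked.items()
                            if s + len(d) > y}
            self.timer_running = bool(self.unacked)
```

Starting at sequence number 92 and sending 8 bytes followed by 20 bytes gives segments with Seq=92 and Seq=100; a single ACK with value 120 then acknowledges both.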
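TCP: retransmission scenarios
Lost ACK scenario (SendBase=92 at the start):
  Host A: sends Seq=92, 8 bytes of data
  Host B: sends ACK=100, which is lost (X)
  Host A: timeout; resends Seq=92, 8 bytes of data
  Host B: sends ACK=100; on receipt, SendBase=100
Premature timeout (SendBase=92 at the start):
  Host A: sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  Host B: sends ACK=100, then ACK=120
  Host A: timeout fires before the ACKs arrive; resends Seq=92, 8 bytes of data
  Host A: receives ACK=100 (SendBase=100), then ACK=120 (SendBase=120)
  Host B: answers the duplicate segment with ACK=120; SendBase stays at 120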
TCP: retransmission scenarios
Cumulative ACK:
  Host A: sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  Host B: sends ACK=100, which is lost (X), then ACK=120
  Host A: ACK=120 arrives before the timeout; next segment is Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
Event at receiver, followed by TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
TCP fast retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B returns ACK=100 four times (the original plus 3 duplicate ACKs). Host A resends Seq=100, 20 bytes of data, before its timeout expires.]
TCP flow control
• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
Flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack. Segments from the sender pass up through IP code and TCP code into the TCP socket receiver buffers (in the OS), from which the application process reads.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
Connection Management
Before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, each holding connection state: ESTAB and connection variables: seq # client-to-server and server-to-client; rcvBuffer size at server, client.]
Client: Socket clientSocket = newSocket("hostname", "port number");
Server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
Client state: CLOSED; server state: LISTEN
1. Client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. Server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. Client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. Server: received ACK(y) indicates client is live; server enters ESTAB
TCP 3-way handshake: FSM
Server side:
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); action: Λ
• listen → SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1); action: Λ
Client side:
• closed → SYN sent: Socket clientSocket = newSocket(hostname, "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
Sequence (client state and server state both start in ESTAB):
1. Client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
2. Server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. Server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. Client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server enters CLOSED
The "Two Army Problem"
Principles of congestion control
Congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data at rate λ_in and Host B receives throughput λ_out through a router with unlimited shared output link buffers. One plot shows λ_out versus λ_in, both up to R/2; the other shows delay versus λ_in, up to R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A offers λin (original data) and λ'in (original data plus retransmitted data) into finite shared output link buffers; a copy is sent while there is free buffer space and held back when there is no buffer space. Plot: λout vs. λin, up to R/2.]
Causes/costs of congestion: scenario 2
idealization: known loss. Packets can be lost (dropped at router) due to full buffers
• sender only resends if packet known to be lost
[Plot: λout vs. λin, up to R/2.] When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
Causes/costs of congestion: scenario 2
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λout vs. λin, up to R/2.] When sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; each offers λin (original data) and λ'in (original data plus retransmitted data), and λout is the delivered throughput.]
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
Causes/costs of congestion: scenario 3
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Plot: throughput of blue traffic (λout, up to R/2) vs. offered load by Host A (λ'in); bandwidth is wasted for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Plot: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd/RTT bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight", spanning cwnd), and last byte sent.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B; one RTT later it sends two segments, then four segments.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with three states: slow start, congestion avoidance, fast recovery. Transitions are written as event / actions; Λ denotes no event or no action.]
initial: Λ / cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 (enter slow start)
slow start:
• new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK / dupACKcount++
• cwnd > ssthresh / Λ (go to congestion avoidance)
• dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment (go to fast recovery)
• timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK / cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK / dupACKcount++
• dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment (go to fast recovery)
• timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment (go to slow start)
fast recovery:
• duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK / cwnd = ssthresh; dupACKcount = 0 (go to congestion avoidance)
• timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment (go to slow start)
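The three-state FSM summarised on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the slides' code: the class name `RenoSender`, its method names, and the choice to count cwnd in units of MSS (so ssthresh starts at an arbitrary 64 MSS rather than 64 KB) are assumptions made here, and actual segment transmission is left out.

```python
from dataclasses import dataclass

MSS = 1.0  # assumption: cwnd and ssthresh are counted in units of one MSS


@dataclass
class RenoSender:
    """Hypothetical sketch of the slow start / congestion avoidance / fast recovery FSM."""
    cwnd: float = 1.0        # congestion window, in MSS
    ssthresh: float = 64.0   # assumption: 64 MSS stands in for the slide's 64 KB
    dup_acks: int = 0
    state: str = "slow_start"

    def on_new_ack(self) -> None:
        if self.state == "fast_recovery":
            # leave fast recovery: deflate the window back to ssthresh
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                      # +1 MSS per ACK, doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # each further dup ACK inflates cwnd
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding it a sequence of `on_new_ack()`, `on_dup_ack()` and `on_timeout()` calls and printing `cwnd` after each one reproduces the sawtooth from the AIMD slide.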
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Plot: window size oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
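The two numbers on this slide can be checked with a short calculation. The sketch below simply inverts the throughput formula for L; the function names are made up for illustration, and MSS is converted to bits so that the units match a rate in bits/sec.

```python
def required_window_segments(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """In-flight segments needed to keep a pipe of rate_bps full for one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_loss_rate(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Loss probability L solving rate = 1.22 * MSS / (RTT * sqrt(L))."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


# Slide example: 1500 byte segments, 100 ms RTT, 10 Gbps target
w = required_window_segments(10e9, 0.100, 1500)
loss = mathis_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The result for L is on the order of 2·10^-10, in line with the figure quoted on the slide.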
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Plot: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", passing through points (X1, Y1) where X1 + Y1 = R and approaching points (X2, Y2) where X2 = Y2.]
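The convergence argument can be illustrated with a toy simulation. It rests on strong simplifying assumptions that are not in the slides: both sessions have the same RTT, both increase by one unit per round, and both see a loss in the same round whenever their combined rate exceeds the link capacity. The function name is hypothetical.

```python
def aimd_two_flows(x1: float, x2: float, capacity: float, rounds: int = 200):
    """Toy model: two AIMD flows sharing one bottleneck with synchronized losses."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # loss: multiplicative decrease halves both rates, and so halves their gap
            x1, x2 = x1 / 2, x2 / 2
        else:
            # no loss: additive increase moves both up equally, the gap is unchanged
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2


a, b = aimd_two_flows(x1=1.0, x2=70.0, capacity=100.0)
print(f"flow 1: {a:.2f}, flow 2: {b:.2f}, gap: {abs(a - b):.2f}")
```

Because each loss halves the difference between the two rates while additive increase leaves it unchanged, the pair drifts toward the equal bandwidth share line no matter where it starts.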
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the destination returns a TCP ACK segment with ECE=1.]
rdt2.1: Example 1
[Figure sequence: rdt2.1 sender FSM (states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (states: Wait for 0 from below, Wait for 1 from below). Each slide highlights one transition, written as event / actions:]
• sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.1: Example 2
[Figure sequence: the same rdt2.1 sender and receiver FSMs. Each slide highlights one transition, written as event / actions:]
• receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && not corrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
[Transitions are written as event / actions; Λ denotes no action.]
sender FSM fragment:
• Wait for call 0 from above: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) (go to Wait for ACK 0)
• Wait for ACK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / udt_send(sndpkt)
• Wait for ACK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / Λ
receiver FSM fragment:
• Wait for 0 from below: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
• entering Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
[FSM with four states. Transitions are written as event / actions; Λ denotes no action.]
Wait for call 0 from above:
• rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer (go to Wait for ACK0)
• rdt_rcv(rcvpkt) / Λ
Wait for ACK0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / Λ
• timeout / udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / stop_timer (go to Wait for call 1 from above)
Wait for call 1 from above:
• rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer (go to Wait for ACK1)
• rdt_rcv(rcvpkt) / Λ
Wait for ACK1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)) / Λ
• timeout / udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) / stop_timer (go to Wait for call 0 from above)
rdt3.0 in action
[Figure: four sender/receiver timelines.]
(a) no loss: sender sends pkt0, receiver receives pkt0 and sends ack0; sender receives ack0 and sends pkt1, receiver receives pkt1 and sends ack1; sender receives ack1 and sends pkt0, receiver receives pkt0 and sends ack0.
(b) packet loss: as in (a), but pkt1 is lost; sender times out and resends pkt1; receiver receives pkt1 and sends ack1; sender receives ack1 and sends pkt0.
(c) ACK loss: as in (a), but ack1 is lost; sender times out and resends pkt1; receiver receives pkt1 (detects duplicate) and sends ack1 again; sender receives ack1 and sends pkt0.
(d) premature timeout / delayed ACK: ack1 is delayed; sender times out and resends pkt1; the first ack1 then arrives and sender sends pkt0; receiver receives the duplicate pkt1 (detects duplicate) and sends ack1 again; sender receives the second ack1 and does nothing; receiver receives pkt0 and sends ack0.
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$
  - U_sender: utilization, the fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
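The utilization figure can be reproduced with a small helper. This is a sketch with made-up names; the `window_packets` parameter is an added assumption so that the same function also covers a sender that keeps several packets in flight (with 1 it models stop-and-wait).

```python
def sender_utilization(packet_bits: float, link_bps: float, rtt_s: float,
                       window_packets: int = 1) -> float:
    """Fraction of time the sender is busy transmitting.

    window_packets = 1 models stop-and-wait; larger values model a sender
    that transmits that many packets before waiting for the first ACK.
    """
    d_trans = packet_bits / link_bps                   # L / R, in seconds
    busy = window_packets * d_trans
    return min(1.0, busy / (rtt_s + d_trans))


# Slide example: 8000 bit packet, 1 Gbps link, RTT = 30 ms
u = sender_utilization(8000, 1e9, 0.030)
print(f"U_sender = {u:.5f}")
```

The transmission time (8 microseconds) is tiny next to the 30 ms round trip, which is why the stop-and-wait sender sits idle almost all of the time.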
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, receiver sends ACK; ACK arrives, sender sends next packet at t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet at t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
[FSM with a single state, Wait. Transitions are written as event / actions; Λ denotes no event or no action.]
initial: Λ / base = 1; nextseqnum = 1
rdt_send(data) /
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
timeout /
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
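The sender FSM above translates almost line for line into a small class. The sketch below is illustrative only: the class name, the `send` callback standing in for udt_send, the boolean `timer_running` standing in for a real timer, and the `max()` guard against stale ACKs are all assumptions added here, and corruption checking is omitted.

```python
class GBNSender:
    """Hypothetical Go-Back-N sender following the extended FSM (no real timer or network)."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # callback standing in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> packet, kept until cumulatively ACKed
        self.timer_running = False

    def rdt_send(self, data):
        """Returns False (refuse_data) when the window is full."""
        if self.nextseqnum >= self.base + self.N:
            return False
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True          # start_timer for the oldest unacked pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is acknowledged."""
        self.base = max(self.base, acknum + 1)  # max() guards against stale ACKs (added here)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        # stop the timer if nothing is in flight, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet that is still unacknowledged."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])
```

Passing `sent.append` as the `send` callback (with `sent` a plain list) is enough to trace which packets go out, for example to replay a scenario where a single loss triggers a timeout.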
GBN: receiver extended FSM
[FSM with a single state, Wait. Transitions are written as event / actions; Λ denotes no event.]
initial: Λ / expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
default / udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
GBN in action
[Figure: sender/receiver timeline with sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost.]
sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space.]
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
Selective repeat in action
[Figure: sender/receiver timeline with sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost.]
sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, in two scenarios.]
(a) no problem: pkt0, pkt1 and pkt2 are received and ACKed; the sender moves on to pkt3 (lost) and a new pkt0; receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1 and pkt2 are received, but all three ACKs are lost; sender times out and retransmits pkt0; receiver will accept packet with seq number 0.
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
receiver can't see sender side; receiver behavior identical in both cases: something's (very) wrong!
Q: what relationship between seq # size and window size to avoid problem in (b)?
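One way to explore the question from the dilemma slide is to compare, for a given sequence-number space and window size, the seq #'s an out-of-date sender may still retransmit with the seq #'s a fully advanced receiver is ready to accept as new. The helper below is a hypothetical sketch of that check; it models only the worst case from scenario (b), where every packet of the first window is delivered but none of the ACKs get back.

```python
def sr_window_is_safe(seq_space: int, window: int) -> bool:
    """True if an old retransmission can never be mistaken for new data.

    Worst case: the receiver got a full window (seq 0 .. window-1) and has
    advanced, while the sender saw no ACKs and may resend any of those packets.
    """
    old = {s % seq_space for s in range(0, window)}           # may be retransmitted
    new = {s % seq_space for s in range(window, 2 * window)}  # receiver's new window
    return old.isdisjoint(new)


print(sr_window_is_safe(seq_space=4, window=3))  # the slide's example
print(sr_window_is_safe(seq_space=4, window=2))
```

In this model the two sets stay disjoint exactly when the window size is at most half the size of the sequence-number space, which is the usual answer given for selective repeat.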
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, top to bottom:]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing the header fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Below, the sender sequence number space with window size N: sent and ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure: byte stream of an HTTP GET message (K bytes). A segment whose data begins at the 100th byte carries a TCP header with seq no = 100, followed by M bytes of data. Window: N bytes; data beyond the window cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
• user types 'C': A sends Seq=42, ACK=79, data = 'C'
• host ACKs receipt of 'C', echoes back 'C': B sends Seq=79, ACK=43, data = 'C'
• host ACKs receipt of echoed 'C': A sends Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Plot: SampleRTT and EstimatedRTT, RTT (milliseconds, 100 to 350) vs. time (seconds), measured from gaia.cs.umass.edu to fantasia.eurecom.fr.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$ (typically, β = 0.25)
  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$ (estimated RTT plus "safety margin")
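The two moving averages and the timeout formula fit in a few lines. The sketch below uses made-up names and makes two choices the slides do not specify: the first sample initialises EstimatedRTT directly with DevRTT set to half of it (the initialisation used by RFC 6298), and DevRTT is updated before EstimatedRTT so that the deviation is measured against the previous estimate.

```python
class RttEstimator:
    """Hypothetical EWMA-based RTT estimator (units are whatever the samples use)."""

    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumption: RFC 6298-style initialisation

    def update(self, sample_rtt: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # assumption: deviation is measured against the estimate *before* this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval

    @property
    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)        # milliseconds
for sample in (120.0, 90.0, 300.0, 110.0):
    print(sample, round(est.update(sample), 1))
```

A single outlier (the 300 ms sample) moves EstimatedRTT only slightly but widens the safety margin, which is the behaviour the slide describes.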
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
[FSM with a single state, wait for event. Transitions are written as event / actions; Λ denotes no event.]
initial: Λ / NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
data received from application above /
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
timeout /
  retransmit not-yet-acked segment with smallest seq. #
  start timer
ACK received, with ACK field value y /
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
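TCP: retransmission scenarios
[Figure: three Host A / Host B timelines.]
lost ACK scenario: A sends Seq=92, 8 bytes of data; B replies ACK=100, which is lost; A times out and resends Seq=92, 8 bytes of data; B replies ACK=100 again.
premature timeout: A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92); B replies ACK=100 and ACK=120; A times out before the ACKs arrive and resends Seq=92, 8 bytes of data; the ACKs then move SendBase to 100 and to 120; B answers the duplicate with ACK=120 (SendBase stays 120).
cumulative ACK: A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; ACK=100 is lost, but ACK=120 arrives before the timeout, so nothing is retransmitted and A continues with Seq=120, 15 bytes of data.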
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
TCP fast retransmit
[Figure: Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost; Host B keeps replying ACK=100. After sender receipt of the triple duplicate ACK (3 DUP ACKs), A fast-retransmits Seq=100, 20 bytes of data before the timeout expires.]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS; the application process reads from them. Application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending).]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, holding:]
• connection state: ESTAB
• connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
[Figure: client (initially CLOSED) and server (initially LISTEN) timelines.]
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1), this segment may contain client-to-server data; enter ESTAB
• server: received ACK(y) indicates client is live; enter ESTAB
TCP 3-way handshake: FSM
[Transitions are written as event / actions; Λ denotes no event or no action.]
client side:
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
• SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
server side:
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
• listen → SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: ACK(ACKnum=y+1) / Λ
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
[Figure: client state and server state timelines, both starting in ESTAB.]
• client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
• server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
• client: enter FIN_WAIT_2 (wait for server close)
• server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
• client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
• server: enter CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causescosts of congestion scenario 1
bull two senders two receivers
bull one router infinite buffers
bull output link capacity Rbull no retransmission
bull maximum per-connection throughput R2
86
unlimited shared output link buffers
Host A
original data lin
Host B
throughput lout
R2
R2
l out
lin R2
dela
ylin
v large delays as arrival rate lin approaches capacity
Causescosts of congestion scenario 2
bull one router finite buffers bull sender retransmission of timed-out packet
ndash application-layer input = application-layer output lin = lout
ndash transport-layer input includes retransmissions lrsquoin lin
87
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causescosts of congestion scenario 2
idealization perfect knowledgebull sender sends only when router
buffers available
88
finite shared output link buffers
lin original dataloutlin original data plus
retransmitted datacopy
free buffer space
R2
R2
l out
lin
Host B
A
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causescosts of congestion scenario 2
90
R2
R2lin
l out
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
A
lin loutlincopy
free buffer space
timeout
R2
R2lin
l out
Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
[Figure: Host A times out and sends a second copy while the router still has free buffer space. Plot: $\lambda_{out}$ vs. $\lambda'_{in}$, axes marked at R/2.]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers; $\lambda_{in}$: original data, $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$ delivered.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted
[Figure: throughput by blue traffic ($\lambda_{out}$) vs. offered load by Host A ($\lambda'_{in}$), axes marked at R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
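The sawtooth described above can be sketched in a few lines of Python. This is only an illustrative model, not TCP itself: the function name `aimd_trace` and the rule "a loss happens whenever cwnd exceeds `loss_threshold`" are assumptions made for the example, and cwnd is counted in whole MSS units.

```python
def aimd_trace(rounds, loss_threshold, mss=1, start=1):
    """Return cwnd (in MSS units) at the start of each RTT round.

    Assumed loss model (hypothetical): a loss is detected in any round
    where cwnd is larger than loss_threshold.
    """
    cwnd = start
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > loss_threshold:
            # multiplicative decrease: cut cwnd in half after loss
            cwnd = max(mss, cwnd // 2)
        else:
            # additive increase: +1 MSS every RTT
            cwnd += mss
    return trace


print(aimd_trace(10, 4))
```

With a threshold of 4 MSS the trace climbs one MSS per round, halves after the first round above the threshold, and then repeats, which is the sawtooth shape.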
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $\text{rate} \approx \frac{cwnd}{RTT}$ bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A/Host B timeline. Host A sends one segment, then two segments, then four segments in successive RTTs.]
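Why "incrementing cwnd for every ACK received" doubles the window each RTT can be checked with a small sketch. It assumes an idealized round (every segment in the window is ACKed within one RTT, no loss); `slow_start_rounds` is a name chosen for this example.

```python
def slow_start_rounds(rounds, mss=1):
    """Return cwnd (in MSS units) at the start of each RTT during slow start.

    Assumption: all segments sent in a round are ACKed within that round.
    """
    cwnd = mss
    sizes = []
    for _ in range(rounds):
        sizes.append(cwnd)
        acks_this_round = cwnd // mss   # one ACK per segment in flight
        for _ in range(acks_this_round):
            cwnd += mss                 # +1 MSS per ACK received
    return sizes


print(slow_start_rounds(4))
```

A window of k segments produces k ACKs, and each ACK adds one MSS, so the window goes from k to 2k in one RTT.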
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with three states: slow start, congestion avoidance, fast recovery. Λ means "no action".]
initialization (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
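The three-state machine summarized on this slide can be sketched as a small Python class. This is a toy model under stated assumptions, not a real TCP stack: the class and method names are invented for the example, cwnd and ssthresh are kept in bytes, the "+ 3" on entering fast recovery is interpreted as 3 MSS, and actual (re)transmission of segments is left out.

```python
class RenoSender:
    """Toy model of slow start / congestion avoidance / fast recovery."""

    def __init__(self, mss=1000, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = mss            # cwnd = 1 MSS
        self.ssthresh = ssthresh   # 64 KB initial threshold
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                        # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss      # assumption: "+3" means 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender(mss=1000, ssthresh=4000)
for _ in range(4):
    s.on_new_ack()
print(s.state, s.cwnd)
```

Driving the object with a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls reproduces the transitions listed above.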
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - window oscillates between W/2 and W, so avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: sawtooth of window size over time, oscillating between W/2 and W.]
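The 3/4 W average can be checked numerically. The sketch below assumes an idealized sawtooth in which the window rises linearly from W/2 to W and then drops back; the function names are chosen for this example only.

```python
def avg_sawtooth_window(w, steps=1000):
    """Average of a window that rises linearly from w/2 to w."""
    samples = [w / 2 + (w / 2) * i / steps for i in range(steps + 1)]
    return sum(samples) / len(samples)


def avg_throughput(w_bytes, rtt_s):
    """Average throughput in bytes/sec: (3/4) * W / RTT."""
    return 0.75 * w_bytes / rtt_s


print(avg_sawtooth_window(100))      # close to 75
print(avg_throughput(100_000, 0.1))  # bytes per second
```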
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
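Both numbers in this example can be reproduced directly. The sketch below takes the window as rate × RTT divided by the segment size in bits, and solves the throughput formula above for L; the function names are assumptions made for this example.

```python
def required_window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with rtt_s."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve rate = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS taken in bits)."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


w = required_window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(int(w))   # whole segments in flight
print(loss)     # on the order of 2e-10
```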
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line (points (X2, Y2) where X2 = Y2) and the full bandwidth utilization line (points (X1, Y1) where X1 + Y1 = R). The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
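The convergence argument can be tried out with a tiny simulation. It rests on simplifying assumptions that are not part of the slide: both sessions have the same RTT, both see a loss in the same round whenever their combined rate exceeds the capacity, and rates move in unit steps. `two_flow_aimd` is a hypothetical helper.

```python
def two_flow_aimd(x, y, capacity, steps):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(steps):
        if x + y > capacity:
            # loss: both flows halve (multiplicative decrease)
            x, y = x / 2, y / 2
        else:
            # no loss: both flows add one unit (additive increase)
            x, y = x + 1, y + 1
    return x, y


x, y = two_flow_aimd(1.0, 80.0, capacity=100.0, steps=1000)
print(abs(x - y))  # the gap between the two flows shrinks toward zero
```

Additive increase keeps the gap between the two rates constant, while each halving cuts the gap in half, so the pair drifts toward the equal-share line.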
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
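The two figures in the parallel-connection example follow from giving every connection an equal share of the link. A short check, assuming perfectly equal per-connection shares (`app_share` is a name made up for the example):

```python
from fractions import Fraction


def app_share(existing_conns, new_conns):
    """Fraction of link rate R obtained by an app opening new_conns connections."""
    return Fraction(new_conns, existing_conns + new_conns)


print(app_share(9, 1))    # 1/10 of R
print(app_share(9, 11))   # 11/20 of R, i.e. a bit more than R/2
```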
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 on the way; the TCP ACK segment returns with ECE=1.]
rdt2.1: Example 1
[Figure, three animation frames. Sender states: "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1". Receiver states: "Wait for 0 from below", "Wait for 1 from below".]
frame 1, receiver:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
frame 2, sender:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  Λ
rdt2.1: Example 2
[Figure, five animation frames. Sender states: "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1". Receiver states: "Wait for 0 from below", "Wait for 1 from below".]
frame 1, receiver:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
frame 2, sender:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
  udt_send(sndpkt)
frame 3, receiver:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
frame 4, sender:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  Λ
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
• state "Wait for call 0 from above":
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0":
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): Λ
receiver FSM fragment:
• transition into "Wait for 0 from below":
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below":
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
• state "Wait for call 0 from above":
  - rdt_rcv(rcvpkt): Λ
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
• state "Wait for ACK0":
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above":
  - rdt_rcv(rcvpkt): Λ
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
• state "Wait for ACK1":
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to "Wait for call 0 from above"
rdt3.0 in action
[Figure: sender/receiver timelines for four scenarios.
 (a) no loss: pkt0/ack0, pkt1/ack1, pkt0/ack0 exchanged in order.
 (b) packet loss: pkt1 is lost; sender times out and resends pkt1; receiver sends ack1.
 (c) ACK loss: ack1 is lost; sender times out and resends pkt1; receiver detects duplicate and sends ack1 again.
 (d) premature timeout / delayed ACK: sender times out and resends pkt1 before the delayed ack1 arrives; receiver detects duplicate and sends ack1 again; sender does nothing on the second ack1.]
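The packet-loss and ACK-loss scenarios can be replayed with a compact stop-and-wait sketch. It is a simplified model, not the full rdt3.0 FSM: corruption and delayed ACKs are left out, a lost packet or ACK is assumed to always end in a timeout and retransmission, and the names `stop_and_wait`, `drop_data` and `drop_ack` are invented for this example.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Deliver messages with alternating-bit stop-and-wait.

    drop_data / drop_ack: indices (counted over all data transmissions)
    for which the hypothetical channel loses the data packet or its ACK.
    Returns (delivered messages, total data transmissions).
    """
    delivered = []
    expected = 0   # receiver: sequence number it is waiting for
    tx = 0         # number of data transmissions so far
    for i, msg in enumerate(messages):
        seq = i % 2                   # sender alternates 0, 1, 0, 1, ...
        while True:
            this_tx = tx
            tx += 1
            if this_tx in drop_data:
                continue              # data lost: sender times out, resends
            if seq == expected:
                delivered.append(msg) # new packet: deliver to upper layer
                expected ^= 1
            # otherwise it is a duplicate: receiver only re-ACKs it
            if this_tx in drop_ack:
                continue              # ACK lost: sender times out, resends
            break                     # ACK received: move to next message
    return delivered, tx


print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={3}))
```

Each message is delivered exactly once even though two extra transmissions were needed, which is the role of the 1-bit sequence number.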
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$
• $U_{sender}$: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3:
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
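The utilization figures for stop-and-wait and for 3-packet pipelining come from the same expression, with the number of packets in flight as a parameter. A minimal sketch, assuming the 1 Gbps link, 30 ms RTT and 8000-bit packets of the example (`sender_utilization` is a name chosen here):

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / link_bps          # L/R, transmission delay
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)             # stop-and-wait
u3 = sender_utilization(8000, 1e9, 0.030, window=3)   # 3 packets in flight
print(u1, u3, u3 / u1)
```

The ratio between the two results is 3, and the pipelined value evaluates to roughly 0.0008.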
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initialization (Λ): base = 1; nextseqnum = 1
state "Wait":
• rdt_send(data):
    if (nextseqnum < base+N)
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    else
        refuse_data(data)
• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
• rdt_rcv(rcvpkt) && corrupt(rcvpkt): Λ
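The sender FSM translates fairly directly into code. The sketch below is illustrative only: the class name, the `send` callback (standing in for udt_send), and the boolean `timer_running` (standing in for a real timer) are assumptions of the example, and sequence numbers are not wrapped to k bits. It also guards `base` against moving backwards on a stale ACK, which the FSM above does not spell out.

```python
class GBNSender:
    """Go-Back-N sender sketch with window size N."""

    def __init__(self, window, send):
        self.N = window
        self.send = send            # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # refuse_data: window is full
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True   # start_timer for oldest unacked pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # cumulative ACK: everything up to and including acknum is done
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])  # resend every unacked packet


sent = []
s = GBNSender(4, sent.append)
print([s.rdt_send(d) for d in "abcde"])
```

With N = 4 the fifth call is refused until an ACK slides the window forward.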
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering
  - re-ACK pkt with highest in-order seq #
initialization (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state "Wait":
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
• default: udt_send(sndpkt)
GBN in action
[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost.]
sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender:
• data from above:
  - if next available seq # in window, send pkt
• timeout(n):
  - resend pkt n, restart timer
• ACK(n) in [sendbase, sendbase+N-1]:
  - mark pkt n as received
  - if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
• pkt n in [rcvbase, rcvbase+N-1]:
  - send ACK(n)
  - out-of-order: buffer
  - in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
• pkt n in [rcvbase-N, rcvbase-1]:
  - ACK(n)
• otherwise:
  - ignore
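The receiver rules above can be sketched as follows. This is an illustration under simplifying assumptions: sequence numbers are plain integers that never wrap around, the class name `SRReceiver` and its fields are invented for the example, and "delivering to the upper layer" is modelled by appending to a list.

```python
class SRReceiver:
    """Selective Repeat receiver sketch with window size N."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}      # out-of-order packets awaiting delivery
        self.delivered = []   # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver this packet and any buffered in-order successors
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n          # already delivered: ACK again
        return None           # otherwise: ignore


r = SRReceiver(window=4)
for n in (0, 1, 3, 4, 5, 2):  # pkt2 arrives last, after a retransmission
    r.on_packet(n, f"pkt{n}")
print(r.delivered)
```

Packets 3, 4 and 5 wait in the buffer and are released together once packet 2 fills the gap.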
Selective repeat in action
[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost.]
sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
[Figure: sender window and receiver window (after receipt) over the sequence space 0 1 2 3 0 1 2, for two scenarios.
 (a) no problem: pkt0, pkt1, pkt2 are received and ACKed; sender then sends pkt3 (lost) and a new pkt0; receiver will accept packet with seq number 0.
 (b) oops: pkt0, pkt1, pkt2 are received but all three ACKs are lost; sender times out and retransmits pkt0; receiver will accept packet with seq number 0.
 Receiver can't see sender side; receiver behavior identical in both cases; something's (very) wrong.]
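One way to explore the question is to compare the sequence numbers the sender might still retransmit (its old window) with those the receiver is prepared to accept as new (its advanced window). The sketch below assumes the worst case from scenario (b), where the receiver has advanced by a full window while no ACK reached the sender; `sr_ambiguous` is a hypothetical helper.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission could be mistaken for a new packet."""
    # seq #'s the sender may still retransmit (old window)
    old = {s % seq_space for s in range(0, window)}
    # seq #'s the receiver now accepts as new (window advanced by N)
    new = {s % seq_space for s in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(4, 3))  # the slide's example: the two ranges overlap
print(sr_ambiguous(4, 2))  # window of half the sequence space: no overlap
```

In this model the two ranges are disjoint exactly when the window size is at most half the size of the sequence number space.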
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, rows from top to bottom:]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure: byte stream with a window of N bytes. Labels: HTTP GET message (K bytes); 100th byte; TCP header (seq. no. = 100); M bytes; "Cannot be transmitted now".]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. A → B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. B → A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. A → B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT (milliseconds, 100 to 350) vs. time (seconds, 1 to 106), gaia.cs.umass.edu to fantasia.eurecom.fr; curves for SampleRTT and EstimatedRTT.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$  (typically, $\beta = 0.25$)
  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (estimated RTT plus "safety margin")
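The EstimatedRTT, DevRTT and TimeoutInterval formulas combine into a small estimator. The sketch below makes choices the slides leave open, so treat them as assumptions: the estimate is seeded with the first sample, DevRTT starts at 0, and DevRTT is updated using the EstimatedRTT from before the current sample is folded in. `RttEstimator` is a name invented for this example.

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimator (units are up to the caller)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample  # assumption: seed with first sample
        self.dev_rtt = 0.0                 # assumption: no deviation seen yet

    def update(self, sample_rtt):
        # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        # TimeoutInterval = EstimatedRTT + 4 * DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
print(est.update(200.0))                 # one slow sample widens the timeout
```

A single outlier moves EstimatedRTT only slightly, but it raises DevRTT and therefore the safety margin.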
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initialization (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state "wait for event":
• data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
• timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
• ACK received, with ACK field value y:
    if (y > SendBase)
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
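The three events of the simplified sender map onto three methods. This is a sketch under stated assumptions: `send` is a hypothetical callback standing in for "pass segment to IP", the timer is reduced to a boolean flag, and duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.send = send          # hypothetical stand-in for "pass segment to IP"
        self.unacked = {}         # seq # -> data of not-yet-acked segments
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send((self.next_seq, data))
        self.next_seq += len(data)         # seq #'s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True      # start timer

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)        # not-yet-acked segment, smallest seq #
            self.send((seq, self.unacked[seq]))
            self.timer_running = True      # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y             # bytes up to y-1 are cumulatively ACKed
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in done:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)


out = []
tcp = SimpleTcpSender(92, out.append)
tcp.on_app_data(b"x" * 8)    # Seq=92, 8 bytes of data
tcp.on_app_data(b"y" * 20)   # Seq=100, 20 bytes of data
tcp.on_timeout()             # resends only the Seq=92 segment
tcp.on_ack(120)              # cumulative ACK covers both segments
print(tcp.send_base, tcp.timer_running)
```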
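TCP: retransmission scenarios
[Figure: three Host A/Host B timelines.]
lost ACK scenario:
• A sends Seq=92, 8 bytes of data (SendBase=92)
• B replies ACK=100, which is lost
• timeout: A resends Seq=92, 8 bytes of data
• B replies ACK=100 (SendBase=100)
premature timeout:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100 and ACK=120
• timeout fires before the ACKs arrive: A resends Seq=92, 8 bytes of data
• ACKs arrive (SendBase=100, then SendBase=120); B answers the duplicate with ACK=120 (SendBase=120)
cumulative ACK:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost, and ACK=120, which arrives before the timeout
• A continues with Seq=120, 15 bytes of data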
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost; Host B returns ACK=100 four times (the original plus 3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires.]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers inside the OS; the application process reads from those buffers.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process.]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server each hold, between application and network: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
4. server: received ACK(y) indicates client is live; enter ESTAB
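The sequence and acknowledgement numbers exchanged in the handshake can be written out as data. The sketch below only builds the three segments as dictionaries; the field names and the function `three_way_handshake` are assumptions of the example, and no real sockets are involved.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x, y."""
    syn = {"SYN": 1, "seq": x}                            # client -> server
    synack = {"SYN": 1, "ACK": 1, "seq": y, "ack": x + 1} # server -> client
    ack = {"ACK": 1, "ack": y + 1}                        # client -> server
    return [syn, synack, ack]


for segment in three_way_handshake(x=100, y=300):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is how both ends learn that their peer is live.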
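TCP 3-way handshake: FSM
client side: closed → SYN sent → ESTAB
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
server side: closed → listen → SYN rcvd → ESTAB
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); Λ
• listen → SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: receive ACK(ACKnum=y+1); Λ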
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: client/server timeline, both starting in ESTAB.
 1. client calls clientSocket.close(), sends FINbit=1, seq=x and enters FIN_WAIT_1 (can no longer send but can receive data)
 2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
 3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
 4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server enters CLOSED]
The "Two Army Problem"
Principles of congestion control
congestionbull informally ldquotoo many sources sending too much data
too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
bull a top-10 problem
85
Causescosts of congestion scenario 1
bull two senders two receivers
bull one router infinite buffers
bull output link capacity Rbull no retransmission
bull maximum per-connection throughput R2
86
unlimited shared output link buffers
Host A
original data lin
Host B
throughput lout
R2
R2
l out
lin R2
dela
ylin
v large delays as arrival rate lin approaches capacity
Causescosts of congestion scenario 2
bull one router finite buffers bull sender retransmission of timed-out packet
ndash application-layer input = application-layer output lin = lout
ndash transport-layer input includes retransmissions lrsquoin lin
87
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causescosts of congestion scenario 2
idealization perfect knowledgebull sender sends only when router
buffers available
88
finite shared output link buffers
lin original dataloutlin original data plus
retransmitted datacopy
free buffer space
R2
R2
l out
lin
Host B
A
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causes/costs of congestion: scenario 2
idealization: known loss. Packets can be lost (dropped at router) due to full buffers
• sender only resends if packet known to be lost
[Plot: λ_out vs. λ'_in, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost (dropped at router) due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: the sender's timeout fires although the router has free buffer space, so a second copy is sent. Plot: λ_out vs. λ'_in.]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
Causes/costs of congestion: scenario 2
realistic: duplicates
• packets can be lost (dropped at router) due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λ_out vs. λ'_in, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
[Figure: Hosts A, B, C and D send across routers with finite shared output link buffers; each offers λ_in (original data) and λ'_in (original data plus retransmitted data); λ_out is the delivered throughput.]
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Causes/costs of congestion: scenario 3
[Plot: throughput by blue traffic (λ_out, up to R/2) vs. offered load by Host A (λ'_in, up to R/2), annotated "bandwidth wastage for packets dropped at the 2nd router".]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Plot: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
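The sawtooth described on this slide can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the window is counted in whole MSS units, and the RTTs at which a loss is detected are passed in by hand (`loss_at` is a hypothetical parameter, not something the slides define).

```python
def aimd_trace(rounds, loss_at, start=10):
    """Return cwnd (in MSS) at the start of each RTT.

    loss_at: set of RTT indices at which a loss is assumed to be detected.
    """
    cwnd = start
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_at:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: halve the window
        else:
            cwnd += 1                 # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(6, loss_at={3}))  # [10, 11, 12, 13, 6, 7]
```

Plotting the returned list against the RTT index gives the sawtooth shape.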
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space, showing last byte ACKed; sent, not-yet ACKed ("in-flight") bytes spanning cwnd; last byte sent.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments.]
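To see why incrementing cwnd by one MSS per ACK doubles the window every RTT, here is a minimal sketch. It assumes no loss, one ACK per segment, and a window counted in segments; `slow_start` is an illustrative name, not part of the slides.

```python
MSS = 1  # count the window in segments for readability


def slow_start(rtts, cwnd=1 * MSS):
    """cwnd at the start of each RTT, assuming no loss and one ACK per segment."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        acks = cwnd // MSS   # one ACK comes back for every segment sent this RTT
        cwnd += acks * MSS   # +1 MSS per ACK, so the window doubles each RTT
    return trace


print(slow_start(4))  # [1, 2, 4, 8]
```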
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM figure, written out as state / event / actions]
initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
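The three-state summary can be expressed as a small event-driven class. The sketch below is one possible reading of the FSM, not a real TCP stack: the MSS value of 1460 bytes is an assumption, "ssthresh + 3" is interpreted as ssthresh + 3 MSS, and actual (re)transmission is left to the caller.

```python
MSS = 1460  # bytes; an assumed value, the slides only say "MSS"


class RenoSender:
    """Tracks cwnd, ssthresh and state only; sending segments is the caller's job."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd     # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"           # caller retransmits missing segment

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"                  # caller retransmits missing segment
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls and recording `cwnd` after each one traces the path through the FSM.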
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) · W / RTT bytes/sec
[Plot: window size oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
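The numbers on the two throughput slides can be checked with a short calculation. The function names are illustrative; the formulas are the 3/4 · W/RTT average and the [Mathis 1997] relation quoted above, with throughput taken in bytes per second.

```python
from math import sqrt


def avg_throughput(w_bytes, rtt_s):
    """Average sawtooth throughput: 3/4 of W per RTT (bytes/sec)."""
    return 0.75 * w_bytes / rtt_s


def mathis_throughput(mss_bytes, rtt_s, loss):
    """Throughput (bytes/sec) as a function of segment loss probability."""
    return 1.22 * mss_bytes / (rtt_s * sqrt(loss))


def required_loss(mss_bytes, rtt_s, target_bytes_per_s):
    """Invert the relation above: loss probability needed to reach a target rate."""
    return (1.22 * mss_bytes / (rtt_s * target_bytes_per_s)) ** 2


mss, rtt = 1500, 0.100          # 1500-byte segments, 100 ms RTT
target = 10e9 / 8               # 10 Gbps expressed in bytes/sec
segments_in_flight = target * rtt / mss

print(round(segments_in_flight))          # 83333
print(required_loss(mss, rtt, target))    # roughly 2.1e-10
```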
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Plot: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the "equal bandwidth share" line (X2 = Y2) and the "full bandwidth utilization" line (X1 + Y1 = R). The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving towards the equal-share line.]
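The convergence argument in the plot can be tried numerically. The sketch below assumes two flows with identical RTTs that both see a loss whenever their combined rate exceeds the link capacity; `share_link` and its parameters are hypothetical names chosen for illustration.

```python
def share_link(x1, x2, capacity, rounds):
    """Two AIMD flows on one bottleneck; returns their rates after `rounds` RTTs."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2   # loss: both flows halve (gap halves too)
        else:
            x1, x2 = x1 + 1, x2 + 1   # additive increase: gap stays the same
    return x1, x2


a, b = share_link(80.0, 10.0, capacity=100.0, rounds=500)
print(abs(a - b))  # the gap shrinks towards 0, i.e. towards the equal share
```

Because additive increase keeps the difference between the two rates constant while every multiplicative decrease halves it, the difference can only shrink over time.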
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and arrives marked ECN=11; the destination returns a TCP ACK segment with ECE=1.]
rdt2.1: Example 1
[FSM figure. Sender states: "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1". Receiver states: "Wait for 0 from below", "Wait for 1 from below".]
transition shown (sender):
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  Λ
rdt2.1: Example 2
[Same FSM figure, stepped through the following transitions.]
receiver:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
sender:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
  udt_send(sndpkt)
receiver:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
sender:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  Λ
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
• state "Wait for call 0 from above", event rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    go to "Wait for ACK 0"
• state "Wait for ACK 0", event rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)):
    udt_send(sndpkt)
• state "Wait for ACK 0", event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
    Λ
receiver FSM fragment
• state "Wait for 0 from below", event rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
    udt_send(sndpkt)
• transition into "Wait for 0 from below", event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK1, chksum)
    udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
[FSM figure, written out as state / event / actions]
state "Wait for call 0 from above"
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
• rdt_rcv(rcvpkt): Λ
state "Wait for ACK0"
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
state "Wait for call 1 from above"
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
• rdt_rcv(rcvpkt): Λ
state "Wait for ACK1"
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"
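The four-state sender above collapses naturally into a class with one sequence bit and a "waiting for ACK" flag. This is a sketch only: the real timer is replaced by a boolean plus a `timeout()` method the caller invokes, and `udt_send` is a callback supplied by the caller (both are assumptions made for illustration).

```python
class Rdt30Sender:
    """Alternating-bit (stop-and-wait) sender in the style of rdt3.0."""

    def __init__(self, udt_send):
        self.udt_send = udt_send     # hypothetical callback standing in for the channel
        self.seq = 0                 # sequence bit of the next/current packet
        self.waiting = False         # True while in a "Wait for ACK" state
        self.sndpkt = None
        self.timer_running = False

    def rdt_send(self, data):
        if self.waiting:
            return False             # still waiting for an ACK: cannot accept new data
        self.sndpkt = (self.seq, data)
        self.udt_send(self.sndpkt)
        self.timer_running = True    # start_timer
        self.waiting = True
        return True

    def rdt_rcv(self, acknum, corrupt=False):
        if not self.waiting or corrupt or acknum != self.seq:
            return                   # Λ: do nothing, the timer will trigger a resend
        self.timer_running = False   # stop_timer
        self.waiting = False
        self.seq ^= 1                # alternate between 0 and 1

    def timeout(self):
        if self.waiting:
            self.udt_send(self.sndpkt)
            self.timer_running = True
```

Calling `timeout()` while a packet is outstanding resends the same packet with the same sequence bit, which is what lets the receiver detect the duplicate.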
rdt3.0 in action
[Four sender/receiver timelines]
(a) no loss: sender sends pkt0; receiver gets pkt0, sends ack0; sender gets ack0, sends pkt1; receiver gets pkt1, sends ack1; sender gets ack1, sends pkt0; receiver gets pkt0, sends ack0.
(b) packet loss: as in (a), but pkt1 is lost; sender times out and resends pkt1; receiver gets pkt1, sends ack1; sender gets ack1, sends pkt0; receiver gets pkt0, sends ack0.
(c) ACK loss: receiver gets pkt1 and sends ack1, but ack1 is lost; sender times out and resends pkt1; receiver gets pkt1 again (detects duplicate), sends ack1; sender gets ack1, sends pkt0; receiver gets pkt0, sends ack0.
(d) premature timeout / delayed ACK: ack1 is delayed; sender times out and resends pkt1; sender then receives the first ack1 and sends pkt0; receiver gets the duplicate pkt1 (detects duplicate) and sends ack1 again; sender receives the second ack1 and does nothing; receiver gets pkt0, sends ack0.
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs
• U_sender: utilization, fraction of time sender busy sending
  U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R
U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R
3-packet pipelining increases utilization by a factor of 3!
U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
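The utilization figures for stop-and-wait and for 3-packet pipelining follow from one formula, which the sketch below evaluates for the slide's example (8000-bit packets, 1 Gbps link, 30 ms RTT). The function name and the `packets_in_flight` parameter are illustrative choices.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, packets_in_flight=1):
    """Fraction of time the sender is busy: n * (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / link_bps          # time to push one packet onto the link
    return packets_in_flight * d_trans / (rtt_s + d_trans)


stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, packets_in_flight=3)

print(f"{stop_and_wait:.5f}")  # 0.00027
print(f"{pipelined:.5f}")      # 0.00080
```

The formula only holds while the sender is not yet saturating the link, that is, while n · L/R is smaller than RTT + L/R.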
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initial (Λ):
    base = 1
    nextseqnum = 1
single state: Wait
event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
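The sender FSM translates almost line for line into Python. The sketch below makes a few simplifying assumptions that are not in the slides: sequence numbers are unbounded integers rather than k-bit values, the timer is a boolean flag with `on_timeout()` called by the caller, `send` is a hypothetical callback standing in for udt_send, and stale ACKs below `base` are ignored rather than moving the window backwards.

```python
class GBNSender:
    def __init__(self, window, send):
        self.N = window
        self.send = send             # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq -> packet, kept until cumulatively ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False             # window full: refuse_data
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True        # timer covers the oldest unacked pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        if acknum < self.base:
            return                   # duplicate ACK for already-acked data: ignore
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)       # cumulative ACK frees everything <= acknum
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])      # go back N: resend all unacked pkts
```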
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
initial (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
single state: Wait
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
event: default
    udt_send(sndpkt)
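The receiver side is short enough to sketch directly. As an illustration only: `deliver` and `send` are hypothetical callbacks for deliver_data and udt_send, packets are passed in as (seq, data) with a `corrupt` flag instead of a real checksum, and sequence numbers are unbounded.

```python
class GBNReceiver:
    def __init__(self, deliver, send):
        self.expectedseqnum = 1
        self.deliver = deliver
        self.send = send
        self.sndpkt = ("ACK", 0)     # ACK to repeat until the first pkt arrives

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.deliver(data)                       # in-order: pass up
            self.sndpkt = ("ACK", self.expectedseqnum)
            self.expectedseqnum += 1
        # default case (corrupt or out-of-order): discard, re-send last ACK
        self.send(self.sndpkt)
```

With packets arriving in the order 1, 3, 2, the receiver delivers 1, discards 3 while repeating ACK 1, then delivers 2; packet 3 has to be retransmitted by the sender.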
GBN in action
[Timeline figure; sender window (N=4) slides over sequence numbers 0 1 2 3 4 5 6 7 8]
sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure not recoverable from the transcript.]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]
• ACK(n)
otherwise:
• ignore
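The receiver rules above fit in a few lines. The following is a sketch under simplifying assumptions: sequence numbers are unbounded integers (no wrap-around), and `deliver` and `send_ack` are hypothetical callbacks supplied by the caller.

```python
class SRReceiver:
    def __init__(self, window, deliver, send_ack):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}             # out-of-order packets waiting for the gap to fill
        self.deliver = deliver
        self.send_ack = send_ack

    def on_packet(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.send_ack(n)
            self.buffer[n] = data
            # deliver pkt n and any buffered in-order successors, advancing the window
            while self.rcvbase in self.buffer:
                self.deliver(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
        elif self.rcvbase - self.N <= n < self.rcvbase:
            self.send_ack(n)         # already delivered: re-ACK so the sender can advance
        # otherwise: ignore
```

Replaying the arrival order 0, 1, 3, 4, 5, 2 with N = 4 buffers packets 3 to 5 and delivers 2, 3, 4, 5 together once packet 2 arrives.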
Selective repeat in action
[Timeline figure; sender window (N=4) slides over sequence numbers 0 1 2 3 4 5 6 7 8]
sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2 and all are ACKed; sender then sends pkt3, which is lost, and a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2 but all three ACKs are lost; sender times out and retransmits pkt0; receiver will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
[Figure: segment layout, 32 bits wide, top to bottom]
• source port #, dest port #
• sequence number
• acknowledgement number
  – both counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F, receive window
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
  – receive window: # bytes rcvr willing to accept
• checksum, Urg data pointer
  – Internet checksum (as in UDP)
• options (variable length)
• application data (variable length)
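As an illustration of the layout above, the fixed 20-byte part of the header can be unpacked with Python's standard `struct` module. The function name and the returned dictionary keys are choices made for this sketch; options are skipped using the header length field and the checksum is not verified.

```python
import struct


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header; options, if any, are skipped."""
    (src, dst, seq, ack, off_flags,
     rwnd, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (off_flags >> 12) * 4   # head len field counts 32-bit words
    flags = off_flags & 0x3F             # the six flag bits U A P R S F
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len": header_len,
        "URG": bool(flags & 0x20), "ACK": bool(flags & 0x10),
        "PSH": bool(flags & 0x08), "RST": bool(flags & 0x04),
        "SYN": bool(flags & 0x02), "FIN": bool(flags & 0x01),
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg_ptr,
        "data": segment[header_len:],
    }


# A hand-built segment: ports 1234 -> 80, Seq=42, ACK=79, flags ACK+PSH, data b"C"
raw = struct.pack("!HHIIHHHH", 1234, 80, 42, 79, (5 << 12) | 0x18, 4096, 0, 0) + b"C"
print(parse_tcp_header(raw)["seq"])  # 42
```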
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) placed in the byte stream starting at the 100th byte; with a window of N bytes, the first M bytes are sent behind a TCP header (seq no = 100) and the rest of the message cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario
[Timeline between Host A and Host B]
• user types 'C'; Host A sends Seq=42, ACK=79, data = 'C'
• host ACKs receipt of 'C', echoes back 'C'; Host B sends Seq=79, ACK=43, data = 'C'
• host ACKs receipt of echoed 'C'; Host A sends Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) vs. time in seconds.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
  (estimated RTT plus "safety margin")
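The EstimatedRTT, DevRTT and TimeoutInterval formulas combine into a small estimator. This is a sketch with two assumptions the slides leave open: the first sample initializes EstimatedRTT with DevRTT starting at 0, and EstimatedRTT is updated before DevRTT on each new sample.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample   # assumption: seed with the first sample
        self.dev_rtt = 0.0                  # assumption: no deviation known yet

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)      # milliseconds
print(est.update(200.0))                    # 200.0 (EstimatedRTT 112.5 + 4 * 21.875)
```

A single outlier moves EstimatedRTT only slightly but widens the safety margin, which is the intended effect of the DevRTT term.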
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
initial (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
single state: wait for event
event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
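The three events of the simplified sender can be sketched as methods on a small class. As elsewhere, this is illustrative rather than a real implementation: `send` is a hypothetical callback standing in for "pass segment to IP", the timer is a boolean flag, and unacked segments are kept in a dictionary keyed by sequence number.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.send = send                 # hypothetical: send(seq, data)
        self.unacked = {}                # seq of first byte -> data
        self.timer_running = False

    def on_app_data(self, data: bytes):
        self.unacked[self.next_seq] = data
        self.send(self.next_seq, data)
        self.next_seq += len(data)       # seq numbers count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)      # not-yet-acked segment with smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y           # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 and then receiving ACK=120 advances `send_base` straight to 120, even if ACK=100 never arrived.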
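TCP: retransmission scenarios
[Three timelines between Host A and Host B]
lost ACK scenario:
• A sends Seq=92, 8 bytes of data
• B sends ACK=100, which is lost
• timeout at A; A resends Seq=92, 8 bytes of data
• B sends ACK=100
premature timeout:
• SendBase=92; A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B sends ACK=100, then ACK=120
• timeout at A before the ACKs arrive; A resends Seq=92, 8 bytes of data
• ACK=100 arrives, SendBase=100; ACK=120 arrives, SendBase=120
• B answers the retransmission with ACK=120; SendBase stays at 120
cumulative ACK:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B sends ACK=100, which is lost, then ACK=120, which arrives before the timeout
• A continues with Seq=120, 15 bytes of data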
TCP ACK generation [RFC 5681]
event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
[Timeline: Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost. Host B returns ACK=100 four times (the original plus 3 DUP ACKs). Fast retransmit after sender receipt of triple duplicate ACK: Host A resends Seq=100, 20 bytes of data before the timeout expires.]
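Counting duplicate ACKs takes only a few lines. The sketch below follows the timeline in the figure, where the original ACK=100 is followed by three duplicates before the retransmission; `retransmit` is a hypothetical callback that resends the segment starting at the given sequence number.

```python
class FastRetransmit:
    def __init__(self, retransmit):
        self.retransmit = retransmit   # hypothetical: retransmit(seq)
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        if acknum == self.last_ack:
            self.dup_count += 1
            if self.dup_count == 3:    # third duplicate: resend without waiting
                self.retransmit(acknum)
        else:
            self.last_ack = acknum     # new ACK: start counting afresh
            self.dup_count = 0
```

Further duplicates after the third do not trigger another retransmission in this sketch.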
TCP flow control
[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads.]
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with an application and a network side, both holding:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
[Timeline: client state starts CLOSED, server state starts LISTEN]
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
• server: received ACK(y) indicates client is live; server enters ESTAB
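The sequence and acknowledgement numbers in the handshake can be traced with a toy function. It models only the three segments and the state names from the timeline; dictionaries stand in for segments, and nothing is sent over a network.

```python
def three_way_handshake(x, y):
    """Trace the handshake for client ISN x and server ISN y.

    Returns (log, client_state, server_state); log lists (sender, segment) pairs.
    """
    client, server = "CLOSED", "LISTEN"
    log = []

    syn = {"SYNbit": 1, "Seq": x}
    client = "SYNSENT"
    log.append(("client", syn))

    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    server = "SYN RCVD"
    log.append(("server", synack))

    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
    client = "ESTAB"                 # SYNACK received: server is live
    log.append(("client", ack))
    server = "ESTAB"                 # ACK received: client is live

    return log, client, server


log, client_state, server_state = three_way_handshake(x=1000, y=5000)
print(log[1][1]["ACKnum"], log[2][1]["ACKnum"])  # 1001 5001
```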
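TCP 3-way handshake: FSM
[FSM figure with states closed, listen, SYN rcvd, SYN sent, ESTAB]
• closed → listen (server): Socket connectionSocket = welcomeSocket.accept();
• closed → SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• listen → SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent → ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: receive ACK(ACKnum=y+1); Λ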
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Timeline: client state and server state both start in ESTAB]
• client calls clientSocket.close(), sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
• server sends ACKbit=1, ACKnum=x+1; server enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
• server sends FINbit=1, seq=y; server enters LAST_ACK (can no longer send data)
• client sends ACKbit=1, ACKnum=y+1; client enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
• server receives the ACK and enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestionbull informally ldquotoo many sources sending too much data
too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
bull a top-10 problem
85
Causescosts of congestion scenario 1
bull two senders two receivers
bull one router infinite buffers
bull output link capacity Rbull no retransmission
bull maximum per-connection throughput R2
86
unlimited shared output link buffers
Host A
original data lin
Host B
throughput lout
R2
R2
l out
lin R2
dela
ylin
v large delays as arrival rate lin approaches capacity
Causescosts of congestion scenario 2
bull one router finite buffers bull sender retransmission of timed-out packet
ndash application-layer input = application-layer output lin = lout
ndash transport-layer input includes retransmissions lrsquoin lin
87
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causescosts of congestion scenario 2
idealization perfect knowledgebull sender sends only when router
buffers available
88
finite shared output link buffers
lin original dataloutlin original data plus
retransmitted datacopy
free buffer space
R2
R2
l out
lin
Host B
A
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causescosts of congestion scenario 2
90
R2
R2lin
l out
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: Host A times out and resends a copy although the router still has free buffer space; plot of λout vs. λ'in, with both axes up to R/2]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of a pkt, decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λin: original data, λ'in: original data plus retransmitted data, λout: throughput]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Figure: throughput of blue traffic (λout) vs. offered load by Host A (λ'in), with both axes up to R/2; bandwidth is wasted for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
- single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
- explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) vs. time: additively increase window size until loss occurs, then cut window in half]
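The sawtooth described on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name `aimd_trace` and the `loss_at` parameter (the set of RTT rounds in which a loss is detected) are illustrative assumptions, and cwnd is counted in units of MSS.

```python
def aimd_trace(rounds, loss_at, mss=1, start=1):
    """Return cwnd (in MSS) at the start of each RTT under AIMD.

    loss_at is a hypothetical input: the RTT indices at which the
    sender detects a loss (a real sender infers this from ACKs).
    """
    cwnd = start
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_at:
            cwnd = max(mss, cwnd / 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd = cwnd + mss          # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(rounds=10, loss_at={4, 8}, start=4)
print(trace)
```

Plotting `trace` against the round index gives the sawtooth: linear ramps separated by halvings.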
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
[Figure: sender sequence number space, showing the last byte ACKed, the bytes sent but not yet ACKed ("in-flight", at most cwnd bytes) and the last byte sent]
rate ≈ cwnd / RTT bytes/sec
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
- initially cwnd = 1 MSS
- double cwnd every RTT
- done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments]
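To see why incrementing cwnd once per ACK doubles it every RTT, the sketch below counts one ACK per segment sent in a round. It assumes no loss and no delayed ACKs; the function name is made up for illustration.

```python
def slow_start_rounds(rounds, mss=1):
    """cwnd (in units of MSS) at the start of each RTT during slow start."""
    cwnd = mss
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        acks = cwnd // mss       # one ACK returns per segment sent this RTT
        for _ in range(acks):
            cwnd += mss          # +1 MSS per ACK, so cwnd doubles each RTT
    return history


print(slow_start_rounds(4))
```

The result matches the one, two, four segments pattern in the figure.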
TCP: detecting, reacting to loss
• loss indicated by timeout:
- cwnd set to 1 MSS
- window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
- dup ACKs indicate network capable of delivering some segments
- cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
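The three-state machine summarised on this slide can be sketched as a small Python class. This is an illustrative model, not a real TCP stack: the class name `RenoSender`, the event method names and the choice to count cwnd in segments (MSS = 1, so ssthresh = 64 segments rather than 64 KB) are assumptions, and retransmissions are not modelled.

```python
MSS = 1  # assumption: measure cwnd and ssthresh in segments


class RenoSender:
    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * MSS   # stands in for the slide's 64 KB
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                       # each dup ACK inflates cwnd
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender()
for _ in range(5):
    s.on_new_ack()
print(s.state, s.cwnd)
```

Driving the object with a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls and recording `cwnd` after each one traces out the slow start ramp, the linear phase and the reactions to loss.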
TCP throughput
• avg. TCP throughput as function of window size, RTT?
- ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
- avg. window size (# in-flight bytes) is 3/4 W
- avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: sawtooth of the window size oscillating between W/2 and W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
- to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
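The two numbers quoted for the long, fat pipe example can be checked directly. The sketch below solves the window requirement and inverts the throughput formula for L; the function names are illustrative only.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

With 1500 byte segments, a 100 ms RTT and a 10 Gbps target, this gives W of about 83,333 segments and L of about 2·10^-10, in line with the slide.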
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes up to R, with the equal bandwidth share line and the full bandwidth utilization line; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]
(X1, Y1) where X1 + Y1 = R
(X2, Y2) where X2 = Y2
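The geometric argument can be checked numerically. In the sketch below two flows share a link of capacity R; both add one unit per round and both halve when their sum exceeds R. The synchronized-loss assumption and the function name are simplifications made for illustration.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Evolve two AIMD flows that see loss at the same time."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2    # multiplicative decrease halves the gap
        else:
            x, y = x + 1, y + 1    # additive increase keeps the gap constant
    return x, y


x, y = two_flow_aimd(80.0, 10.0, capacity=100.0, rounds=500)
print(round(x, 3), round(y, 3))
```

Additive increase moves the operating point parallel to the equal share line, while each halving cuts the difference between the flows in half, so the two rates converge regardless of the starting point.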
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
- do not want rate throttled by congestion control
• instead use UDP:
- send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
- new app asks for 1 TCP, gets rate R/10
- new app asks for 11 TCPs, gets R/2
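The parallel-connection example is simple arithmetic if every connection is assumed to get an equal share of R. The helper below is hypothetical and returns the fraction of R obtained by the new application.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R won by an app opening new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # share with one connection
print(app_share(9, 11))   # share with eleven parallel connections
```

With 11 connections the exact share is 11/20 of R, which the slide rounds to R/2.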
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: an IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router on the way to the destination; the destination returns a TCP ACK segment with ECE=1]
rdt2.1: Example 1
[Figure: rdt2.1 sender FSM with states "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1", and receiver FSM with states "Wait for 0 from below", "Wait for 1 from below"]
rdt2.1: Example 2
[Figure: the same sender and receiver FSMs, with one transition highlighted per step]
• receiver, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
• receiver, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ (no action)
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
- state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
- state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
- receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
• "Wait for call 0 from above", on rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• "Wait for ACK 0", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
• "Wait for ACK 0", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): Λ (no action)
receiver FSM fragment:
• "Wait for 0 from below", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
• transition into "Wait for 0 from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
- checksum, seq. #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
- retransmission will be duplicate, but seq. #'s already handle this
- receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
• "Wait for call 0 from above":
- rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
- rdt_rcv(rcvpkt): Λ (no action)
• "Wait for ACK0":
- rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ (no action)
- timeout: udt_send(sndpkt); start_timer
- rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to "Wait for call 1 from above"
• "Wait for call 1 from above":
- rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
- rdt_rcv(rcvpkt): Λ (no action)
• "Wait for ACK1":
- rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ (no action)
- timeout: udt_send(sndpkt); start_timer
- rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to "Wait for call 0 from above"
rdt3.0 in action
(a) no loss
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1
4. receiver: rcv pkt1, send ack1
5. sender: rcv ack1, send pkt0
6. receiver: rcv pkt0, send ack0
(b) packet loss
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1; pkt1 is lost (X)
4. sender: timeout, resend pkt1
5. receiver: rcv pkt1, send ack1
6. sender: rcv ack1, send pkt0
7. receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1
4. receiver: rcv pkt1, send ack1; ack1 is lost (X)
5. sender: timeout, resend pkt1
6. receiver: rcv pkt1 (detect duplicate), send ack1
7. sender: rcv ack1, send pkt0
8. receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1
4. receiver: rcv pkt1, send ack1 (delayed)
5. sender: timeout, resend pkt1
6. sender: rcv ack1, send pkt0
7. receiver: rcv pkt1 (detect duplicate), send ack1
8. receiver: rcv pkt0, send ack0
9. sender: rcv ack1, do nothing
10. sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$
• U_sender: utilization, the fraction of time sender busy sending
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
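The utilization figure can be reproduced in a few lines. The function below is a sketch with assumed names; the optional `pipelined` argument (number of packets sent back-to-back per RTT) is an added convenience, with `pipelined=1` corresponding to stop-and-wait.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, pipelined=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / link_bps              # L / R
    return min(1.0, pipelined * d_trans / (rtt_s + d_trans))


u = sender_utilization(8000, 1e9, 0.030)
throughput_bps = 8000 / (0.030 + 8000 / 1e9)      # one packet per RTT + L/R
print(f"U_sender = {u:.5f}, throughput = {throughput_bps / 8 / 1000:.1f} kB/sec")
```

For the example on the slide this gives U_sender of about 0.00027 and roughly 33 kB/sec of throughput on a 1 Gbps link.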
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
- range of sequence numbers must be increased
- buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
- doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
- when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
- when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
- may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially (Λ): base = 1; nextseqnum = 1; go to "Wait"
rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
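The sender FSM above translates almost line for line into code. The Python sketch below is only a model: the class name, the `sent` list (which stands in for udt_send) and the boolean `timer_running` (which stands in for a real timer) are assumptions made for illustration, and ACKs are assumed to arrive uncorrupted and in order.

```python
class GBNSender:
    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}           # seq # -> buffered data
        self.timer_running = False
        self.sent = []             # log of seq #s handed to the channel

    def rdt_send(self, data):
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.sent.append(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True      # start_timer for oldest pkt
            self.nextseqnum += 1
            return True
        return False                           # refuse_data: window is full

    def on_ack(self, acknum):
        self.base = acknum + 1                 # cumulative ACK
        # stop_timer if nothing is outstanding, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)              # go back N: resend all unacked


s = GBNSender(window=4)
accepted = [s.rdt_send(f"data{i}") for i in range(5)]
print(accepted, s.sent)
```

With N = 4 the fifth call is refused until an ACK slides the window forward, and a timeout resends every packet from `base` up to `nextseqnum - 1`.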
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
- may generate duplicate ACKs
- need only remember expectedseqnum
• out-of-order pkt:
- discard (don't buffer): no receiver buffering!
- re-ACK pkt with highest in-order seq #
initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum); go to "Wait"
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
default:
    udt_send(sndpkt)
GBN in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
1. sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait); pkt2 is lost (X)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
4. receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
5. sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
6. receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
- buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
- sender timer for each unACKed packet
• sender window
- N consecutive seq #'s
- limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
1. sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait); pkt2 is lost (X)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
4. receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
5. sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
6. receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence numbers 0 1 2 3 0 1 2.
(a) no problem: pkt0, pkt1, pkt2 are delivered; pkt3 is lost (X); the sender then sends a new pkt0; the receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are delivered but the three ACKs are lost (XXX); the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0.]
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
Q: what relationship between seq # size and window size to avoid problem in (b)?
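One way to explore the question is to check, for a given sequence-number space and window size, whether the receiver's next window still contains sequence numbers from the previous window. The sketch below does this by brute force and compares the result with the commonly stated rule that the window must be at most half the sequence-number space; both function names are made up for this illustration.

```python
def stale_seq_accepted(seq_space, window):
    """True if a retransmitted old packet could be taken for a new one.

    Scenario: the receiver got pkts 0..N-1 and advanced its window,
    but the sender (having lost every ACK) still has window 0..N-1.
    """
    old = set(range(window))
    new = {(window + i) % seq_space for i in range(window)}
    return bool(old & new)


def sr_window_is_safe(seq_space, window):
    """Rule of thumb: window size <= half the sequence-number space."""
    return 2 * window <= seq_space


print(stale_seq_accepted(4, 3), sr_window_is_safe(4, 3))
print(stale_seq_accepted(4, 2), sr_window_is_safe(4, 2))
```

For the slide's example (4 sequence numbers, window size 3) the overlap exists, which is exactly case (b); shrinking the window to 2 removes it.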
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
- one sender, one receiver
• reliable, in-order byte stream:
- no "message boundaries"
• pipelined:
- TCP congestion and flow control set window size
• full duplex data:
- bi-directional data flow in same connection
- MSS: maximum segment size
• connection-oriented:
- handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
- sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
- URG: urgent data (generally not used)
- ACK: ACK # valid
- PSH: push data now
- RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
TCP seq. numbers, ACKs
sequence numbers:
- byte stream "number" of first byte in segment's data
acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK
Q: how receiver handles out-of-order segments
- A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), next to the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) viewed as a byte stream with a window of N bytes; a segment carrying M bytes starts at the 100th byte, with TCP header (seq. no = 100); the bytes beyond the window cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario
1. User at Host A types 'C': A sends Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C': B sends Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C': A sends Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
- but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
- ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
- average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT (milliseconds, 100 to 350) vs. time (seconds) for gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
- large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
$DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
(typically, β = 0.25)
$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
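The three update rules fit into a small estimator class. The sketch below makes two assumptions the slides leave open: the estimate is seeded with the first sample (with DevRTT = 0), and DevRTT is updated using the EstimatedRTT from before the current sample. The class and method names are illustrative.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample  # assumption: seed with 1st sample
        self.dev_rtt = 0.0                 # assumption: no deviation yet

    def update(self, sample_rtt):
        # Deviation first, measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        # Exponential weighted moving average of the samples.
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)     # milliseconds
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single 120 ms sample after a 100 ms seed moves the estimate only slightly, while the safety margin widens the timeout noticeably.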
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
- pipelined segments
- cumulative acks
- single retransmission timer
• retransmissions triggered by:
- timeout events
- duplicate acks
let's initially consider simplified TCP sender:
- ignore duplicate acks
- ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
- think of timer as for oldest unacked segment
- expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
- update what is known to be ACKed
- start timer if there are still unacked segments
TCP sender (simplified)
initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum; then wait for event
data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
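The three events of the simplified sender can be modelled in Python as below. This is a sketch only: the class name, the `wire` log (standing in for passing segments to IP) and the boolean timer flag are illustrative assumptions, and duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = []            # (seq, data) in sending order
        self.timer_running = False
        self.wire = []               # log of (seq, length) passed to IP

    def on_app_data(self, data):
        self.unacked.append((self.next_seq, data))
        self.wire.append((self.next_seq, len(data)))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq, data = self.unacked[0]   # smallest not-yet-acked seq #
            self.wire.append((seq, len(data)))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y            # bytes below y are now ACKed
            self.unacked = [(s, d) for s, d in self.unacked
                            if s + len(d) > y]
            self.timer_running = bool(self.unacked)


snd = SimpleTcpSender(initial_seq=92)
snd.on_app_data(b"12345678")              # Seq=92, 8 bytes
snd.on_app_data(b"x" * 20)                # Seq=100, 20 bytes
snd.on_timeout()                          # premature timeout: resend Seq=92
snd.on_ack(100)
snd.on_ack(120)
print(snd.wire, snd.send_base)
```

The example replays a premature timeout: only the oldest segment is resent, and the later cumulative ACK moves SendBase to 120 without any further retransmission.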
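TCP: retransmission scenarios
lost ACK scenario:
1. Host A sends Seq=92, 8 bytes of data
2. Host B replies ACK=100, which is lost (X)
3. Host A times out and resends Seq=92, 8 bytes of data
4. Host B replies ACK=100
premature timeout (SendBase=92 at the start):
1. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. Host B replies ACK=100, then ACK=120
3. Host A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
4. ACK=100 arrives: SendBase=100; ACK=120 arrives: SendBase=120
5. Host B answers the duplicate segment with ACK=120; SendBase=120
TCP: retransmission scenarios
cumulative ACK:
1. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. Host B replies ACK=100, which is lost (X), then ACK=120, which arrives before the timeout
3. Host A sends Seq=120, 15 bytes of data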
TCP ACK generation [RFC 5681]
event at receiver, followed by TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
- delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
- immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
- immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
- immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
- long delay before resending lost packet
• detect lost segments via duplicate ACKs
- sender often sends many segments back-to-back
- if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
- likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 four times; on the 3 DUP ACKs Host A resends Seq=100, 20 bytes of data, before the timeout expires]
TCP flow control
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, from which the application process reads]
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
- RcvBuffer size set via socket options (typical default is 4096 bytes)
- many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); buffered data goes to the application process]
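The bookkeeping behind rwnd can be sketched as follows. The slide does not spell out the arithmetic, so the formulas here (free space as RcvBuffer minus the bytes received but not yet read, and the sender's remaining allowance as rwnd minus in-flight bytes) and the variable names `last_byte_rcvd`, `last_byte_read`, `last_byte_sent`, `last_byte_acked` are assumptions following the usual textbook presentation.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver would advertise."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """How much more the sender may put in flight without overflowing."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sendable_bytes(rwnd, last_byte_sent=5000, last_byte_acked=4000))
```

When the application stops reading, rwnd shrinks towards zero and the sender's allowance shrinks with it.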
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application above network, each holding connection state: ESTAB and connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
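The exchange can be traced with a short simulation. The function below is a toy model with assumed names: segments are plain dictionaries, and the two initial sequence numbers are passed in rather than chosen randomly.

```python
def three_way_handshake(client_isn, server_isn):
    """Return the segments exchanged and the final (client, server) states."""
    client, server = "CLOSED", "LISTEN"
    trace = []

    syn = {"SYN": 1, "seq": client_isn}
    client = "SYNSENT"
    trace.append(("client->server", syn))

    synack = {"SYN": 1, "seq": server_isn, "ACK": 1, "acknum": syn["seq"] + 1}
    server = "SYN_RCVD"
    trace.append(("server->client", synack))

    if synack["acknum"] == client_isn + 1:      # server is live
        ack = {"ACK": 1, "acknum": synack["seq"] + 1}
        client = "ESTAB"
        trace.append(("client->server", ack))
        if ack["acknum"] == server_isn + 1:     # client is live
            server = "ESTAB"

    return trace, client, server


trace, client_state, server_state = three_way_handshake(client_isn=42, server_isn=79)
for direction, segment in trace:
    print(direction, segment)
print(client_state, server_state)
```

Each side reaches ESTAB only after seeing its own initial sequence number acknowledged with ACKnum = ISN + 1.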
TCP 3-way handshake: FSM
server side:
• closed → listen: Socket connectionSocket = welcomeSocket.accept();
• listen → SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1), Λ (no action)
client side:
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
- send TCP segment with FIN bit = 1
• respond to received FIN with ACK
- on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
5. server: on receiving the ACK, enter CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
- lost packets (buffer overflow at routers)
- long delays (queueing in router buffers)
• a top-10 problem!
85
Causescosts of congestion scenario 1
bull two senders two receivers
bull one router infinite buffers
bull output link capacity Rbull no retransmission
bull maximum per-connection throughput R2
86
unlimited shared output link buffers
Host A
original data lin
Host B
throughput lout
R2
R2
l out
lin R2
dela
ylin
v large delays as arrival rate lin approaches capacity
Causescosts of congestion scenario 2
bull one router finite buffers bull sender retransmission of timed-out packet
ndash application-layer input = application-layer output lin = lout
ndash transport-layer input includes retransmissions lrsquoin lin
87
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causescosts of congestion scenario 2
idealization perfect knowledgebull sender sends only when router
buffers available
88
finite shared output link buffers
lin original dataloutlin original data plus
retransmitted datacopy
free buffer space
R2
R2
l out
lin
Host B
A
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causescosts of congestion scenario 2
90
R2
R2lin
l out
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Causes/costs of congestion: scenario 2
realistic: duplicates
• packets can be lost (dropped at router) due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retransmissions) for a given "goodput"
• unneeded retransmissions: link carries multiple copies of a packet
  - decreasing goodput
[Figure: a premature timeout at Host A sends a second copy of a packet while the router still has free buffer space. Plot: λ_out versus λ_in, both axes up to R/2.]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue packets at the upper queue are dropped, and blue throughput → 0
[Figure: Hosts A, B, C, and D send over multihop paths through routers with finite shared output link buffers; each offers original data (λ_in) and original plus retransmitted data (λ'_in), and receives λ_out.]
another "cost" of congestion:
• when a packet is dropped, any upstream transmission capacity used for that packet was wasted
[Figure: λ_out versus λ'_in, axes up to R/2. Throughput of blue traffic against offered load by Host A, showing the bandwidth wasted on packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
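The sawtooth can be reproduced in a few lines of Python. This is a minimal sketch rather than TCP itself: cwnd is counted in whole MSS units, one loop iteration stands for one RTT, and `loss_rounds` is a hypothetical set of rounds in which the sender detects a loss.

```python
def aimd_trace(rounds, loss_rounds, start=1):
    """Return cwnd (in MSS units) at the start of each RTT round.

    loss_rounds is an assumed set of round indices in which a loss is detected.
    """
    cwnd = start
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: halve the window
        else:
            cwnd += 1                 # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(rounds=10, loss_rounds={4, 8}, start=4))
# prints [4, 5, 6, 7, 8, 4, 5, 6, 7, 3]
```

The window climbs by one MSS per round and drops to half at each loss, which gives the sawtooth shape.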
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent, with cwnd spanning the in-flight bytes.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially: state = slow start; cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ (no action); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
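The three-state machine summarized on this slide can be sketched as a small Python class. This is an illustrative sketch only: the class and method names are hypothetical, MSS = 1460 bytes is an assumed value, and retransmission and segment transmission are left out so that only the cwnd, ssthresh and dupACKcount bookkeeping remains.

```python
MSS = 1460  # bytes; assumed segment size for illustration


class RenoSketch:
    """cwnd/ssthresh bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate window, resume linear growth
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd     # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"


tcp = RenoSketch()
for _ in range(3):
    tcp.on_new_ack()        # slow start: cwnd grows to 4 MSS
for _ in range(3):
    tcp.on_dup_ack()        # third duplicate ACK triggers fast recovery
print(tcp.state, tcp.cwnd, tcp.ssthresh)
```

Keeping the three event handlers separate mirrors the slide, where each arrow is labeled with one event (new ACK, duplicate ACK, or timeout).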
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
  avg TCP throughput = (3/4) · W / RTT bytes/sec
[Figure: sawtooth window size oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
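The two numbers on this slide can be checked directly. The sketch below assumes the formula throughput = 1.22 · MSS / (RTT · √L), solved for L; the function names are illustrative.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Loss probability L that yields rate_bps under 1.22*MSS/(RTT*sqrt(L))."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


print(int(window_segments(10e9, 0.1, 1500)))         # 83333 segments
print(f"{required_loss_rate(10e9, 0.1, 1500):.2e}")  # about 2.1e-10
```

Because the required loss rate falls with the square of the target rate, each tenfold increase in throughput demands a hundredfold lower loss rate.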
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2). Points (X1, Y1) with X1 + Y1 = R lie on the full bandwidth utilization line; points (X2, Y2) with X2 = Y2 lie on the equal bandwidth share line.]
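The convergence argument can be illustrated numerically. The sketch below is a deliberately simplified model, not a TCP simulation: both flows are assumed to have the same RTT, to add one unit per round, and to halve together whenever their combined rate exceeds the link capacity.

```python
def two_flow_aimd(x1, x2, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2   # both see loss: multiplicative decrease
        else:
            x1, x2 = x1 + 1, x2 + 1   # additive increase, slope 1
    return x1, x2


a, b = two_flow_aimd(80.0, 10.0, capacity=100.0, rounds=500)
print(round(a, 3), round(b, 3))
```

Additive increase leaves the gap between the two rates unchanged, while every halving cuts the gap in half, so the rates drift toward the equal bandwidth share line.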
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: an IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router on the way to the destination; the destination returns a TCP ACK segment with ECE=1.]
rdt2.1: Example 2
[Figure, built up over five slides. Sender FSM states: "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1". Receiver FSM states: "Wait for 0 from below", "Wait for 1 from below". The transitions highlighted on successive slides are:]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
   udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
   Λ (no action)
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): Λ (no action)
receiver FSM fragment:
• transition into "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  - rdt_rcv(rcvpkt): Λ (no action)
• state "Wait for ACK0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ (no action)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above"
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  - rdt_rcv(rcvpkt): Λ (no action)
• state "Wait for ACK1"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ (no action)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  D_trans = L/R = 8000 bits / 10^9 bits/sec = 8 microsecs
  - U_sender: utilization, the fraction of time sender is busy sending
    U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027
  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline.]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
  U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight.]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
3-packet pipelining increases utilization by a factor of 3!
  U_sender = (3L/R) / (RTT + L/R) = 0.024 / 30.008 = 0.0008
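The utilization figures for stop-and-wait and for 3-packet pipelining follow from one formula. A minimal sketch, assuming a hypothetical helper `sender_utilization` whose `window` parameter is the number of packets sent back-to-back per round trip:

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: window * (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps          # L/R, transmission delay in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"{stop_and_wait:.5f} {pipelined:.5f}")   # 0.00027 0.00080
```

The `min(1.0, ...)` cap reflects that once the window covers the whole round trip, the sender is busy all the time and utilization cannot grow further.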
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially: base = 1; nextseqnum = 1
single state "Wait", with transitions:
• rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
• rdt_rcv(rcvpkt) && corrupt(rcvpkt): Λ (no action)
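The sender FSM translates almost line for line into Python. The sketch below is illustrative, not a full implementation: `send_fn` is a hypothetical stand-in for udt_send, the timer is reduced to a boolean flag, and the `max` in `on_ack` is an added guard (not in the FSM) so that a stale ACK cannot move `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping: window of N, cumulative ACKs, one timer."""

    def __init__(self, window_size, send_fn):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq # -> packet, kept for retransmission
        self.timer_running = False
        self.send_fn = send_fn        # assumed stand-in for udt_send

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # refuse_data: window is full
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send_fn(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True         # start_timer for oldest in-flight pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        self.base = max(self.base, acknum + 1)            # cumulative ACK
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):     # go back N: resend all unacked
            self.send_fn(self.sndpkt[seq])


sent = []
sender = GBNSender(window_size=4, send_fn=sent.append)
accepted = [sender.rdt_send(d) for d in "abcde"]   # fifth call is refused
sender.on_ack(2)                                   # pkts 1 and 2 acknowledged
sender.on_timeout()                                # resend pkts 3 and 4
print(accepted, [seq for seq, _ in sent])
```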
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
single state "Wait", with transitions:
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
• default: udt_send(sndpkt)
GBN in action
sender window (N=4)
  sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space.]
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
Selective repeat in action
sender window (N=4)
  sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem: the sender sends pkt0, pkt1, pkt2; all three are received and ACKed, and the sender's window advances; the sender then sends pkt3 (lost) and a new pkt0; the receiver will accept packet with seq number 0
(b) oops!: the sender sends pkt0, pkt1, pkt2; all three are received, but all three ACKs are lost; the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0
• receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
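One way to explore the question is to test when the ambiguity of case (b) can arise. The sketch below is a hypothetical helper, not part of the protocol: it assumes the worst case, where the receiver has advanced by a full window while the sender may still retransmit anything from its old window, and reports whether the two windows share a sequence number.

```python
def sr_ambiguous(seq_space, window):
    """True if a retransmitted old packet could be mistaken for a new one."""
    # Sender's old window: every ACK was lost, so all of it may be resent.
    old = {i % seq_space for i in range(0, window)}
    # Receiver's window after accepting one full window of packets.
    new = {i % seq_space for i in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(seq_space=4, window=3))   # True: the slide's "oops" case
print(sr_ambiguous(seq_space=4, window=2))   # False
```

Under this model the overlap disappears exactly when the window size is at most half the sequence number space, which is the usual answer given for selective repeat.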
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, with these fields:]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space, with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) beginning at the 100th byte of the stream. A segment with TCP header (seq. no. = 100) carries M bytes within the window (N bytes); the remainder of the message is marked "Cannot be transmitted now".]
TCP seq. numbers, ACKs
simple telnet scenario
  Host A: user types 'C'
  Host A → Host B: Seq=42, ACK=79, data = 'C'
  Host B: host ACKs receipt of 'C', echoes back 'C'
  Host B → Host A: Seq=79, ACK=43, data = 'C'
  Host A: host ACKs receipt of echoed 'C'
  Host A → Host B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. RTT (milliseconds, 100 to 350) versus time (seconds), showing SampleRTT and EstimatedRTT.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)
  TimeoutInterval = EstimatedRTT + 4·DevRTT
  (EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
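The EstimatedRTT, DevRTT and TimeoutInterval formulas combine into a small estimator. The sketch below makes two assumptions that the slides do not state: the first sample initializes EstimatedRTT to the sample and DevRTT to half of it (the initialization used in RFC 6298), and DevRTT is updated before EstimatedRTT so that it uses the previous estimate. The class name is illustrative.

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, in milliseconds."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initial value

    def update(self, sample_rtt):
        # Deviation first, so it is measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
# prints 102.5 42.5 272.5
```

A single 120 ms sample moves the estimate only 2.5 ms, which shows how the small α smooths out individual measurements.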
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initially: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
single state "wait for event", with transitions:
• data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
• timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
• ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
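The three events of the simplified sender map onto three methods. This is a hedged sketch of the bookkeeping only: `send_fn` is a hypothetical stand-in for passing a segment to IP, the timer is a boolean flag, and duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: byte sequence numbers, cumulative ACKs, one timer."""

    def __init__(self, initial_seq, send_fn):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = []             # (seq, data) pairs, oldest first
        self.timer_running = False
        self.send_fn = send_fn        # assumed stand-in for "pass segment to IP"

    def on_app_data(self, data):
        segment = (self.next_seq, data)
        self.unacked.append(segment)
        self.send_fn(segment)
        self.next_seq += len(data)    # seq # counts bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            self.send_fn(self.unacked[0])   # not-yet-acked segment with smallest seq #
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y              # bytes up to y-1 are cumulatively ACKed
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


sent = []
sender = SimpleTcpSender(initial_seq=92, send_fn=sent.append)
sender.on_app_data(b"x" * 8)      # Seq=92, 8 bytes
sender.on_app_data(b"y" * 20)     # Seq=100, 20 bytes
sender.on_ack(120)                # a single cumulative ACK covers both segments
print(sender.send_base, sender.unacked, sender.timer_running)
# prints 120 [] False
```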
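TCP: retransmission scenarios
lost ACK scenario
  Host A → Host B: Seq=92, 8 bytes of data
  Host B → Host A: ACK=100 (lost)
  Host A: timeout
  Host A → Host B: Seq=92, 8 bytes of data
  Host B → Host A: ACK=100
premature timeout
  Host A → Host B: Seq=92, 8 bytes of data (SendBase=92)
  Host A → Host B: Seq=100, 20 bytes of data
  Host B → Host A: ACK=100
  Host B → Host A: ACK=120
  Host A: timeout, before ACK=100 arrives
  Host A → Host B: Seq=92, 8 bytes of data
  Host A: ACK=100 arrives (SendBase=100), then ACK=120 arrives (SendBase=120)
  Host B → Host A: ACK=120 (SendBase=120)
cumulative ACK
  Host A → Host B: Seq=92, 8 bytes of data
  Host A → Host B: Seq=100, 20 bytes of data
  Host B → Host A: ACK=100 (lost)
  Host B → Host A: ACK=120
  Host A → Host B: Seq=120, 15 bytes of data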
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 duplicate ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK:
  Host A → Host B: Seq=92, 8 bytes of data
  Host A → Host B: Seq=100, 20 bytes of data (lost)
  Host B → Host A: ACK=100
  Host B → Host A: ACK=100 (duplicate)
  Host B → Host A: ACK=100 (duplicate)
  Host B → Host A: ACK=100 (duplicate)
  Host A: 3 dup ACKs received before the timeout expires
  Host A → Host B: Seq=100, 20 bytes of data
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process.]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, each holding:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x). Client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1). Server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data. Client state: ESTAB
4. server: received ACK(y) indicates client is live. Server state: ESTAB
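TCP 3-way handshake: FSM
• closed → listen (server): Socket connectionSocket = welcomeSocket.accept(); Λ (no action)
• closed → SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• listen → SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent → ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: receive ACK(ACKnum=y+1); Λ (no action)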
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); send FINbit=1, seq=x. Client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1. Server state: CLOSE_WAIT (can still send data). Client state: FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y. Server state: LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1. Client state: TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
5. server: on receiving the ACK, server state: CLOSED
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data (λ_in) into unlimited shared output link buffers; Host B receives at throughput λ_out. Plots: λ_out versus λ_in (both up to R/2) and delay versus λ_in (up to R/2).]
Causescosts of congestion scenario 2
bull one router finite buffers bull sender retransmission of timed-out packet
ndash application-layer input = application-layer output lin = lout
ndash transport-layer input includes retransmissions lrsquoin lin
87
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causescosts of congestion scenario 2
idealization perfect knowledgebull sender sends only when router
buffers available
88
finite shared output link buffers
lin original dataloutlin original data plus
retransmitted datacopy
free buffer space
R2
R2
l out
lin
Host B
A
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causescosts of congestion scenario 2
90
R2
R2lin
l out
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
A
lin loutlincopy
free buffer space
timeout
R2
R2lin
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
Host B
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 2
91
R2
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt
sect decreasing goodput
R2lin
Causescosts of congestion scenario 2
92
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers; λin: original data; λ'in: original data plus retransmitted data; λout.]
Causes/costs of congestion: scenario 3
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Figure: throughput by blue traffic (λout) versus offered load by Host A (λ'in), both axes marked at R/2. Bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
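The sawtooth described on this slide can be reproduced with a few lines of Python. This is only a sketch: `aimd_trace` and its `loss_rounds` parameter are hypothetical names, cwnd is counted in MSS units, and losses are injected at chosen RTTs rather than inferred from timeouts or duplicate ACKs as real TCP does.

```python
def aimd_trace(rounds, loss_rounds, mss=1, start_cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT under pure AIMD.

    loss_rounds is a hypothetical set of RTT indices at which a loss
    is detected at the end of that RTT.
    """
    cwnd = start_cwnd
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd / 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd = cwnd + mss          # additive increase: +1 MSS per RTT
    return trace
```

Calling `aimd_trace(6, {3})` grows the window by one MSS per RTT, halves it after the loss in RTT 3, and then resumes the linear climb.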
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked < cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
$\text{rate} \approx \frac{\text{cwnd}}{RTT}$ bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not-yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT.]
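A small sketch of why "increment cwnd for every ACK" doubles the window each RTT. The function name `slow_start_rounds` is hypothetical, cwnd is counted in MSS units, and the model assumes no loss and one ACK per segment (no delayed ACKs).

```python
def slow_start_rounds(ssthresh_mss, mss=1):
    """Return cwnd at the start of each RTT until it reaches ssthresh."""
    cwnd = mss
    per_rtt = []
    while cwnd < ssthresh_mss:
        per_rtt.append(cwnd)
        acks = cwnd // mss       # one ACK comes back per segment sent this RTT
        for _ in range(acks):
            cwnd += mss          # +1 MSS per ACK, so cwnd doubles per RTT
    per_rtt.append(cwnd)
    return per_rtt
```

With a threshold of 16 MSS the window sequence is 1, 2, 4, 8, 16, matching the one, two, four segments drawn in the figure.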
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to congestion avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with three states: slow start, congestion avoidance, fast recovery. Each transition is written as event: actions; Λ means no event or no action.]
slow start
• initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
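The three-state machine on this summary slide can be sketched as a small Python class. This is an illustration under stated assumptions, not a real TCP stack: the class name `RenoSender` and its method names are hypothetical, cwnd and ssthresh are counted in MSS units (so the slide's 64 KB initial threshold becomes 64 MSS here), and retransmission itself is left out.

```python
MSS = 1  # count everything in segments for readability (assumption)


class RenoSender:
    def __init__(self):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64        # slide: 64 KB; here 64 MSS (assumption)
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh           # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                    # exponential growth per RTT
            if self.cwnd > self.ssthresh:       # condition as written on the slide
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS                    # one more segment left the network
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"        # missing segment is retransmitted here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow start"               # missing segment is retransmitted here
```

Driving the object with a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls walks through the same transitions as the diagram.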
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}$ bytes/sec
[Figure: sawtooth window oscillating between W/2 and W, with the average at 3/4 W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  – to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
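The two numbers on this slide can be checked numerically. The sketch below assumes the Mathis formula exactly as quoted on the slide, with MSS converted to bits so that throughput comes out in bits/sec; the function names are hypothetical.

```python
import math


def window_segments(rate_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to sustain rate_bps over rtt_s."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def loss_for_throughput(rate_bps, rtt_s, mss_bytes):
    """Invert the formula: loss probability that yields rate_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2
```

For 1500-byte segments, a 100 ms RTT and a 10 Gbps target, `window_segments` gives about 83,333 segments and `loss_for_throughput` gives roughly 2.1e-10, in line with the slide's rounded 2·10^-10.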
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes running to R. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal bandwidth share line. Full bandwidth utilization line: points (X1, Y1) where X1 + Y1 = R. Equal bandwidth share line: points (X2, Y2) where X2 = Y2.]
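The convergence argument in the figure can be simulated. This sketch makes simplifying assumptions that are not on the slide: both sessions have the same RTT, both see a loss in the same round whenever their combined rate exceeds the capacity, and rates are in arbitrary units. The function name `aimd_two_flows` is hypothetical.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2    # loss: both halve, so the gap halves too
        else:
            x, y = x + 1, y + 1    # additive increase: the gap is unchanged
    return x, y
```

Starting from a very unequal split such as `aimd_two_flows(80, 10, 100, 200)`, every loss event halves the difference between the two rates while additive increase leaves it unchanged, so the rates drift toward the equal-share line.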
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
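The parallel-connection example is a one-line computation if each TCP connection is assumed to get an equal share of the link, which is the idealized fairness goal rather than a guarantee. The helper name `app_share` is hypothetical.

```python
from fractions import Fraction


def app_share(app_connections, other_connections):
    """Fraction of link rate R obtained by an app, assuming equal per-connection shares."""
    return Fraction(app_connections, app_connections + other_connections)
```

With 9 existing connections, one new connection gets 1/10 of R, while 11 new connections get 11/20 of R, which the slide rounds to R/2.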
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 and with ECN=11; the TCP ACK segment returning to the source carries ECE=1.]
rdt2.1: Example 2
[FSM walk-through shown over four slides. Sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1. Receiver states: Wait for 0 from below, Wait for 1 from below. Highlighted transitions, written as event: actions:]
• sender, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
• receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ (no action)
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
[Transitions are written as event: actions; Λ means no action.]
sender FSM fragment
• Wait for call 0 from above, rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
• Wait for ACK 0, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
• Wait for ACK 0, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): Λ
receiver FSM fragment
• Wait for 0 from below, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
• transition into Wait for 0 from below, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
[FSM with four states. Each transition is written as event: actions; Λ means no action.]
Wait for call 0 from above
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt): Λ
Wait for ACK0
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above
Wait for call 1 from above
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt): Λ
Wait for ACK1
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above
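The rdt3.0 sender FSM can be folded into a compact alternating-bit sender. This is a sketch under assumptions: `Rdt30Sender` and its `send` callback (standing in for udt_send) are hypothetical names, the timer is modeled as a boolean flag that an outer event loop would act on, and packets are plain tuples without a checksum.

```python
class Rdt30Sender:
    def __init__(self, send):
        self.send = send              # hypothetical stand-in for udt_send
        self.seq = 0                  # sequence number of the next new packet
        self.waiting_for_ack = False  # False: "Wait for call", True: "Wait for ACK"
        self.sndpkt = None
        self.timer_running = False

    def rdt_send(self, data):
        if self.waiting_for_ack:
            return False              # stop-and-wait: one packet at a time
        self.sndpkt = (self.seq, data)
        self.send(self.sndpkt)
        self.timer_running = True     # start_timer
        self.waiting_for_ack = True
        return True

    def rdt_rcv(self, ack_seq, corrupt=False):
        if not self.waiting_for_ack or corrupt or ack_seq != self.seq:
            return                    # no action: the timer handles recovery
        self.timer_running = False    # stop_timer
        self.waiting_for_ack = False
        self.seq ^= 1                 # alternate between 0 and 1

    def timeout(self):
        if self.waiting_for_ack:
            self.send(self.sndpkt)    # retransmit the current packet
            self.timer_running = True
```

A wrong or corrupted ACK is simply ignored here, so recovery always comes from the timeout, as in the FSM.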
rdt3.0 in action
(a) no loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0
(b) packet loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 (pkt1 lost)
• sender: timeout, resend pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0
rdt3.0 in action
(c) ACK loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (ack1 lost)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (ack1 delayed)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• sender: rcv ack1 again, do nothing
• receiver: rcv pkt0, send ack0
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
  – U_sender: utilization, fraction of time sender busy sending
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Timing diagram, sender and receiver:]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Timing diagram, sender and receiver:]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
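The utilization figures on the stop-and-wait and pipelining slides follow from one formula. The sketch below assumes the slides' model, in which ACK transmission time is ignored and utilization is capped at 1 once the pipe is full; the function name `sender_utilization` is hypothetical.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window_packets=1):
    """Fraction of time the sender is busy: (N * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps          # L/R, in seconds
    return min(1.0, window_packets * d_trans / (rtt_s + d_trans))
```

For an 8000-bit packet on a 1 Gbps link with a 30 ms RTT, `sender_utilization(8000, 1e9, 0.030)` is about 0.00027, and passing `window_packets=3` triples it to about 0.0008.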
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
[FSM with a single state, Wait. Each transition is written as event, followed by its actions; Λ means no event or no action.]
initially (Λ):
  base = 1
  nextseqnum = 1
rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
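The GBN sender FSM translates almost line for line into Python. The sketch below is illustrative only: `GBNSender` and its `send` callback (standing in for udt_send) are hypothetical names, the timer is a boolean flag, sequence numbers do not wrap, and a `max` guard is added so that a stale ACK cannot move `base` backwards, which the slide's FSM does not spell out.

```python
class GBNSender:
    def __init__(self, window, send):
        self.N = window
        self.send = send              # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data
        self.sndpkt[self.nextseqnum] = (self.nextseqnum, data)
        self.send(self.sndpkt[self.nextseqnum])
        if self.base == self.nextseqnum:
            self.timer_running = True # start_timer for the oldest packet
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is done.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])   # go back N: resend all unacked
```

After a timeout the sender resends every packet from `base` up to `nextseqnum - 1`, which is the "go back N" behavior the protocol is named for.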
GBN: receiver extended FSM
[FSM with a single state, Wait. Each transition is written as event, followed by its actions.]
initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
default:
  udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
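The receiver side of GBN needs only one variable, as the slide notes. A minimal sketch, with the hypothetical class name `GBNReceiver` and a `corrupt` flag standing in for the checksum test:

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.delivered = []           # data handed to the upper layer, in order

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number to send back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.expectedseqnum += 1
        # Otherwise discard the packet and re-ACK the highest in-order seq.
        return self.expectedseqnum - 1
```

Before any packet has arrived, the returned ACK number is 0, which corresponds to the initial `make_pkt(0, ACK, chksum)` in the FSM.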
GBN in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
sender:
• send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space.]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
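The three receiver cases listed on this slide map onto three branches. The sketch below uses the hypothetical class name `SRReceiver` and, as a simplifying assumption, sequence numbers that grow without wrapping around.

```python
class SRReceiver:
    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}              # out-of-order packets awaiting delivery
        self.delivered = []           # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver any run of in-order packets and slide the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                  # already delivered: re-ACK so sender can advance
        return None                   # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then the retransmitted 2 reproduces the buffering and the burst delivery of pkt2 through pkt5 shown in the "Selective repeat in action" trace.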
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
sender:
• send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, in two scenarios.
(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; the sender then sends pkt3 (lost) and a new pkt0; the receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost; the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0.
Receiver can't see sender side; receiver behavior identical in both cases! Something's (very) wrong!]
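The question on this slide can be explored by brute force. The sketch below checks the worst case from scenario (b): the receiver has advanced by a full window while the sender, having lost every ACK, still retransmits from its old window. The function name `sr_window_is_safe` is hypothetical; the well-known result it reproduces is that the window size must be at most half the sequence number space.

```python
def sr_window_is_safe(seq_space, window):
    """True if old (retransmitted) and new sequence numbers can never collide."""
    old = {s % seq_space for s in range(0, window)}           # sender's stale window
    new = {s % seq_space for s in range(window, 2 * window)}  # receiver's next window
    return old.isdisjoint(new)
```

With 4 sequence numbers, a window of 3 is unsafe (seq number 0 appears in both sets), while a window of 2 is safe.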
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
[Figure: segment layout, 32 bits wide, top to bottom:]
• source port #, dest port #
• sequence number
• acknowledgement number
  – sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F, receive window
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
  – receive window: # bytes rcvr willing to accept
• checksum, Urg data pointer
  – checksum: Internet checksum (as in UDP)
• options (variable length)
• application data (variable length)
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure labels: Window N bytes; HTTP Get Message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; HTTP Get Message (K bytes); cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
• user types 'C': Host A sends Seq=42, ACK=79, data = 'C'
• Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
• Host A ACKs receipt of echoed 'C': Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) versus time in seconds (axis from 1 to 106).]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT}-\text{EstimatedRTT}|$
(typically, β = 0.25)
$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$
(estimated RTT plus "safety margin")
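The three formulas for EstimatedRTT, DevRTT and TimeoutInterval can be combined into a small estimator. This is a sketch with two labeled assumptions that the slides leave open: the first sample initializes EstimatedRTT directly with DevRTT set to half of it (the initialization used by RFC 6298), and DevRTT is updated using the previous EstimatedRTT before EstimatedRTT itself is updated. The class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: RFC 6298-style start value

    def update(self, sample_rtt):
        # Assumption: deviation is measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from a 100 ms sample, a second sample of 120 ms moves EstimatedRTT to 102.5 ms and DevRTT to 42.5 ms, giving a timeout interval of 272.5 ms.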
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks
let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)
[FSM with a single state, wait for event. Each transition is written as event, followed by its actions; Λ means no event.]
initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer
ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
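The simplified sender can be sketched as a Python class with one method per event. The names `SimpleTcpSender` and `send` (a callback standing in for "pass segment to IP") are hypothetical, the timer is modeled as a boolean flag, and duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq, send):
        self.send = send              # hypothetical stand-in for passing a segment to IP
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.send((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)   # not-yet-acked segment with smallest seq
            self.send((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in done:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 and then receiving ACK=120 clears both segments at once, which is the cumulative ACK case.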
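TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data (SendBase=92)
• B replies ACK=100, which is lost
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100 (SendBase=100)
premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive at A (SendBase=100, then SendBase=120)
• B answers the duplicate segment with ACK=120 (SendBase=120)
TCP: retransmission scenarios
cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost, then ACK=120
• ACK=120 arrives before the timeout, so A sends Seq=120, 15 bytes of data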
TCP ACK generation [RFC 5681]
Each entry pairs an event at receiver with the TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  – delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  – immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  – immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  – immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
TCP fast retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):]
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost
• B replies ACK=100, followed by three more ACK=100 (3 DUP ACKs)
• A resends Seq=100, 20 bytes of data, before the timeout expires
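Counting duplicate ACKs is simple enough to show directly. The sketch below follows the figure, where the original ACK=100 is followed by three duplicates before the retransmission; the function name `fast_retransmit_points` is hypothetical.

```python
def fast_retransmit_points(acks):
    """Return the ACK values at which a fast retransmit would be triggered."""
    triggered = []
    last, dups = None, 0
    for ack in acks:
        if ack == last:
            dups += 1
            if dups == 3:             # third duplicate of the same ACK number
                triggered.append(ack)
        else:
            last, dups = ack, 0       # new ACK value: reset the counter
    return triggered
```

A stream of four ACK=100 segments triggers one retransmission of the segment starting at byte 100, while three identical ACKs (one original plus two duplicates) do not.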
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers inside the OS, from which the application process reads.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); buffered data goes to application process.]
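The receive-window bookkeeping can be written as two small functions. The variable names LastByteRcvd and LastByteRead do not appear on the slide; they follow the usual textbook description of receiver-side buffering and are assumptions here, as are the function names.

```python
def rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises to the sender."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sendable(rwnd_value, last_byte_sent, last_byte_acked):
    """Bytes the sender may still put in flight without overflowing the receiver."""
    return max(0, rwnd_value - (last_byte_sent - last_byte_acked))
```

With a 4096-byte RcvBuffer holding 2000 unread bytes, the advertised rwnd is 2096; a sender with 1000 bytes already in flight may send 1096 more.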
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with an application above the network, each holding connection state: ESTAB and connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
• server: received ACK(y) indicates client is live; enter ESTAB
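TCP 3-way handshake: FSM
[FSM with states closed, listen, SYN rcvd, SYN sent, ESTAB. Each transition is written as event / actions; Λ means no action.]
• closed → listen: Socket connectionSocket = welcomeSocket.accept();
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
• listen → SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: ACK(ACKnum=y+1) / Λ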
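The sequence and acknowledgement numbers exchanged in the handshake can be traced with a few lines of Python. This is a toy model, not a socket program: the function name `three_way_handshake` and the dictionary layout are hypothetical, and the only fact added beyond the slide is that TCP sequence numbers are 32-bit and wrap around.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(x, y):
    """Return (sender, sender's new state, segment) for each handshake step."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}
    return [
        ("client", "SYN_SENT", syn),
        ("server", "SYN_RCVD", synack),
        ("client", "ESTAB", ack),
    ]
```

For example, with x = 41 and y = 78 the SYNACK carries ACKnum 42 and the final ACK carries ACKnum 79; the server moves to ESTAB once that ACK arrives.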
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
• client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
• server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
• client: enter FIN_WAIT_2 (wait for server close)
• server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
• client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
• server: on receiving the ACK, enter CLOSED
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[Figure: Host A sends original data at rate λin into a router with unlimited shared output link buffers; Host B receives throughput λout. Plots: λout versus λin, both axes marked at R/2; delay versus λin, with the λin axis marked at R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ'in ≥ λin
[Figure: Host A sends λin (original data) and λ'in (original data plus retransmitted data) into a router with finite shared output link buffers; Host B receives λout.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: same topology; Host A keeps a copy of each packet and the router has free buffer space. Plot: λout versus λin, both axes marked at R/2.]
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causescosts of congestion scenario 2
90
R2
R2lin
l out
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
A
lin loutlincopy
free buffer space
timeout
R2
R2lin
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
Host B
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 2
91
R2
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt
sect decreasing goodput
R2lin
Causescosts of congestion scenario 2
92
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 3
bull four sendersbull multihop pathsbull timeoutretransmit
93
Q what happens as lin and linrsquo
increase
finite shared output link buffers
Host A lout Host B
Host CHost D
lin original datalin original data plus
retransmitted data
A as red linrsquo increases all arriving
blue pkts at upper queue are dropped blue throughput g 0
another ldquocostrdquo of congestionv when packet dropped any ldquoupstream
transmission capacity used for that packet was wasted
Causescosts of congestion scenario 3
94
R2
R2
l out
linrsquo
Bandwidth wastage for packets dropped at the 2nd router
Offered load by Host A
Thro
ughp
ut b
y bl
ue tr
affic
Approaches towards congestion control
95
two broad approaches towards congestion control
end-end congestion control
bull no explicit feedback from network
bull congestion inferred from end-system observed loss delay
bull approach taken by TCP
network-assisted congestion control
bull routers provide feedback to end systemsndashsingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
ndashexplicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
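The sawtooth can be reproduced with a few lines of Python. This is a minimal sketch under simplifying assumptions that are not in the slides: cwnd is counted in MSS units, one decision is made per RTT, and loss is assumed to occur exactly when cwnd exceeds a fixed, hypothetical `capacity_mss`.

```python
def aimd_trace(capacity_mss, rounds, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under idealized AIMD."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:
            # Assumed loss signal: multiplicative decrease, halve the window.
            cwnd = cwnd / 2
        else:
            # No loss this RTT: additive increase by 1 MSS.
            cwnd = cwnd + 1
    return trace


if __name__ == "__main__":
    print(aimd_trace(capacity_mss=8, rounds=12, cwnd=4.0))
```

Plotting the returned list against its index gives the sawtooth shape described in the figure.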
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $$\text{rate} \approx \frac{cwnd}{RTT}\ \text{bytes/sec}$$
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight", spanning cwnd), and last byte sent.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B; one RTT later it sends two segments, then four segments.]
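The link between "increment cwnd for every ACK" and "double cwnd every RTT" can be checked with a short sketch. It assumes, for illustration only, that every segment sent in a round is ACKed within that round and that no loss occurs; the function name is hypothetical.

```python
def slow_start(mss, rounds):
    """Return cwnd (bytes) at the start of each RTT when every segment is ACKed."""
    cwnd = mss
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        segments_in_flight = cwnd // mss
        for _ in range(segments_in_flight):
            cwnd += mss  # one MSS added per ACK received
    return history


if __name__ == "__main__":
    print(slow_start(mss=1000, rounds=4))
```

Because each of the cwnd/MSS segments in flight returns one ACK, adding one MSS per ACK doubles the window once per RTT.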
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment (stay in slow start)
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
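The state machine summarized above can be sketched as a small event handler. This is an illustration, not TCP as implemented in any real stack: the names `RenoState` and `on_event` are hypothetical, MSS is assumed to be 1000 bytes, and the "retransmit" and "transmit new segment(s)" actions are left out so that only the window bookkeeping is modelled.

```python
from dataclasses import dataclass

MSS = 1000  # bytes; an assumed value for illustration


@dataclass
class RenoState:
    state: str = "slow_start"
    cwnd: float = MSS
    ssthresh: float = 64 * 1024
    dup_acks: int = 0


def on_event(s, event):
    """Apply one event ('new_ack', 'dup_ack' or 'timeout') and return s."""
    if event == "timeout":
        # Same reaction in every state: collapse the window, restart slow start.
        s.ssthresh = s.cwnd / 2
        s.cwnd = MSS
        s.dup_acks = 0
        s.state = "slow_start"
    elif event == "dup_ack":
        if s.state == "fast_recovery":
            s.cwnd += MSS  # window inflation while recovering
        else:
            s.dup_acks += 1
            if s.dup_acks == 3:
                s.ssthresh = s.cwnd / 2
                s.cwnd = s.ssthresh + 3 * MSS
                s.state = "fast_recovery"
    elif event == "new_ack":
        if s.state == "fast_recovery":
            s.cwnd = s.ssthresh
            s.state = "congestion_avoidance"
        elif s.state == "slow_start":
            s.cwnd += MSS
            if s.cwnd > s.ssthresh:
                s.state = "congestion_avoidance"
        else:
            s.cwnd += MSS * (MSS / s.cwnd)  # about 1 MSS per RTT overall
        s.dup_acks = 0
    else:
        raise ValueError(f"unknown event: {event}")
    return s
```

Feeding the handler a list of events and recording `s.cwnd` after each one gives a trace that can be compared with the slow start and AIMD behavior described earlier.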
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
  $$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec}$$
[Figure: sawtooth of window size over time, oscillating between W/2 and W, with average 3/4 W.]
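The 3/4 factor follows from averaging the sawtooth. As a short worked step, assuming the window grows linearly from W/2 to W between losses (the idealization used on this slide), and writing $\overline{W}$ for the average window (a symbol introduced here for illustration):

```latex
\overline{W} = \frac{1}{2}\left(\frac{W}{2} + W\right) = \frac{3}{4}\,W,
\qquad
\text{avg TCP throughput} = \frac{\overline{W}}{RTT} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec}
```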
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
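Both numbers on this slide can be reproduced from the stated example. The sketch below uses hypothetical helper names; the only inputs are the values given above (1500-byte segments, 100 ms RTT, 10 Gbps) and the throughput formula, solved for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS in bits)."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


if __name__ == "__main__":
    print(window_segments(10e9, 0.1, 1500))     # about 83,333 segments
    print(required_loss_rate(10e9, 0.1, 1500))  # about 2e-10
```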
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line (X2 = Y2) and the full bandwidth utilization line (X1 + Y1 = R). The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]
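The convergence argument can be checked numerically. This is a minimal sketch assuming two perfectly synchronized flows with equal RTTs that both see a loss whenever their combined rate exceeds the link capacity; real connections are not this regular, and the function name is hypothetical.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Idealized synchronized AIMD for two flows sharing one bottleneck link."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2  # loss: both flows halve, so the gap halves too
        else:
            x, y = x + 1, y + 1  # additive increase: the gap is unchanged
    return x, y


if __name__ == "__main__":
    print(two_flow_aimd(1.0, 41.0, capacity=50.0, rounds=200))
```

Additive increase moves the operating point parallel to the equal-share line, while each halving cuts the difference between the two rates in half, which is why the rates drift towards equal shares.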
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 on the way; the destination returns a TCP ACK segment with ECE=1.]
rdt2.1: Example 2
[Figure, three animation steps: sender FSM states (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM states (Wait for 0 from below, Wait for 1 from below).]
receiver transition shown in step 1:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
sender transition shown in step 2:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
      go to "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
      Λ
receiver FSM fragment
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
state "Wait for call 0 from above":
  rdt_rcv(rcvpkt)
    Λ
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
    go to "Wait for ACK0"
state "Wait for ACK0":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    stop_timer
    go to "Wait for call 1 from above"
state "Wait for call 1 from above":
  rdt_rcv(rcvpkt)
    Λ
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
    go to "Wait for ACK1"
state "Wait for ACK1":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0))
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
    stop_timer
    go to "Wait for call 0 from above"
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$$
• U_sender: utilization, the fraction of time sender is busy sending
  $$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3!
$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$$
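The utilization figures for stop-and-wait and for 3-packet pipelining can be recomputed from the example values (8000-bit packets, 1 Gbps link, 30 ms RTT). A minimal sketch, with a hypothetical function name and a `window` parameter standing for the number of packets sent back-to-back per RTT:

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets go out per RTT."""
    d_trans = packet_bits / link_bps  # L/R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


if __name__ == "__main__":
    print(sender_utilization(8000, 1e9, 0.030))            # stop-and-wait
    print(sender_utilization(8000, 1e9, 0.030, window=3))  # 3-packet pipelining
```

The `min(1.0, ...)` cap reflects that once the window is large enough to fill the pipe, the sender cannot be more than fully busy.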
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially (Λ):
  base = 1
  nextseqnum = 1
state "Wait":
  rdt_send(data)
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
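The sender FSM above translates fairly directly into code. The sketch below is illustrative only: the class and attribute names are hypothetical, the channel is replaced by a `sent` log, the timer by a boolean flag, and corruption checking is omitted. As a small safety deviation from the FSM, `base` is never moved backwards by an old ACK.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping; channel and timer are stubbed out."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq number -> buffered packet
        self.timer_running = False
        self.sent = []              # log of seq numbers handed to the channel

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data(data)
        self.sndpkt[self.nextseqnum] = (self.nextseqnum, data)
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is acknowledged.
        self.base = max(self.base, acknum + 1)
        # Stop the timer if nothing is outstanding, otherwise restart it.
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        # Go back N: resend every unacknowledged packet in the window.
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)
```

With a window of 4, a fifth `rdt_send` call is refused until an ACK advances `base`, and a timeout resends everything from `base` to `nextseqnum - 1`.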
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)
state "Wait":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default
    udt_send(sndpkt)
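A matching receiver needs only one counter. This is a minimal sketch with hypothetical names: packets are passed in as a sequence number plus data, a `corrupt` flag stands in for the checksum test, and the returned value is the ACK number that would be sent.

```python
class GBNReceiver:
    """Go-Back-N receiver: deliver in order, re-ACK the last in-order packet."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0       # ACK 0 is repeated until packet 1 arrives
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)          # deliver_data(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # Default action for anything else: discard and resend the last ACK.
        return self.last_ack
```

An out-of-order packet is dropped and triggers a duplicate ACK, which is exactly the behavior the sender relies on when it goes back N.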
GBN in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
[Figure: sender and receiver windows over the sequence number space.]
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
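The receiver rules above can be sketched as follows. This is an illustration under one simplifying assumption: sequence numbers are unbounded integers (no wraparound), so the window arithmetic is plain comparison. The class and method names are hypothetical.

```python
class SRReceiver:
    """Selective Repeat receiver: buffer out-of-order packets, ACK individually."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}        # seq number -> data waiting for in-order delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # Deliver any run of in-order packets and slide the window forward.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n            # already delivered: re-ACK so the sender can advance
        return None
```

Replaying the arrival order pkt0, pkt1, pkt3, pkt4, pkt5, pkt2 with N = 4 buffers three packets and then delivers pkt2 through pkt5 in one step when pkt2 finally arrives.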
Selective repeat in action
sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8
  sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
  receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
  sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
(a) no problem
  sender: sends pkt0, pkt1, pkt2; all three ACKs arrive and the sender window advances
  sender: sends pkt3 (lost), then a new pkt0
  receiver: window (after receipt) has advanced; will accept packet with seq number 0
(b) oops!
  sender: sends pkt0, pkt1, pkt2; all three ACKs are lost
  sender: timeout, retransmit pkt0
  receiver: window (after receipt) has advanced; will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! Something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
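The question can be explored by checking whether the receiver's advanced window can contain a sequence number the sender might still retransmit. The sketch below assumes the worst case from (b): the receiver has accepted a full window while every ACK was lost. The function name is hypothetical. The usual textbook answer is that the window size must be at most half the size of the sequence number space.

```python
def ambiguous(seq_space, window):
    """True if an old retransmission could be mistaken for new data."""
    # Sequence numbers the sender may still retransmit (its old window).
    old = {i % seq_space for i in range(window)}
    # Sequence numbers the receiver now accepts as new (its advanced window).
    new = {i % seq_space for i in range(window, 2 * window)}
    return bool(old & new)


if __name__ == "__main__":
    print(ambiguous(seq_space=4, window=3))  # the example on this slide
    print(ambiguous(seq_space=4, window=2))
```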
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
Annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
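The fixed part of this layout can be packed and parsed with Python's standard `struct` module. This is a sketch for illustration: the helper names are hypothetical, options are not handled, and the checksum field is simply carried as a number rather than computed.

```python
import struct

# Fixed 20-byte TCP header (no options), network byte order:
# ports (2 x 16 bits), seq and ack (2 x 32 bits), head len byte, flags byte,
# receive window, checksum, urgent pointer (3 x 16 bits).
TCP_HEADER = struct.Struct("!HHIIBBHHH")
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def pack_header(src_port, dst_port, seq, ack, flags, rwnd, checksum=0, urg_ptr=0):
    flag_byte = 0
    for name in flags:
        flag_byte |= FLAG_BITS[name]
    head_len_words = 5  # 5 x 32-bit words = 20 bytes, upper 4 bits of the byte
    return TCP_HEADER.pack(src_port, dst_port, seq, ack,
                           head_len_words << 4, flag_byte, rwnd, checksum, urg_ptr)


def unpack_header(raw):
    src, dst, seq, ack, off, flag_byte, rwnd, checksum, urg_ptr = TCP_HEADER.unpack(raw[:20])
    return {
        "src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
        "header_len_bytes": (off >> 4) * 4,
        "flags": [name for name, bit in FLAG_BITS.items() if flag_byte & bit],
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg_ptr,
    }
```

For example, `unpack_header(pack_header(1234, 23, 42, 79, ["ACK"], 4096))` round-trips the sequence and acknowledgement numbers used in the telnet scenario.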
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each with source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum and urg pointer fields. The sender sequence number space is divided into: sent and ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. The window size is N.]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) and a window of N bytes. M bytes starting at the 100th byte are sent behind a TCP header (seq no = 100); the remainder of the message cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario
  Host A: user types 'C'
    A to B: Seq=42, ACK=79, data = 'C'
  Host B: host ACKs receipt of 'C', echoes back 'C'
    B to A: Seq=79, ACK=43, data = 'C'
  Host A: host ACKs receipt of echoed 'C'
    A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$$EstimatedRTT = (1-\alpha)\cdot EstimatedRTT + \alpha\cdot SampleRTT$$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. RTT (milliseconds, 100 to 350) versus time (seconds), showing SampleRTT and EstimatedRTT.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $$DevRTT = (1-\beta)\cdot DevRTT + \beta\cdot|SampleRTT - EstimatedRTT|$$
  (typically, β = 0.25)
  $$TimeoutInterval = EstimatedRTT + 4\cdot DevRTT$$
  (estimated RTT plus "safety margin")
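The two moving averages and the timeout rule fit in a few lines. The sketch below uses a hypothetical class name and makes two assumptions that the slides do not state and that follow RFC 6298 instead: the first sample initializes EstimatedRTT to the sample and DevRTT to half of it, and DevRTT is updated using the EstimatedRTT value from before the current sample is folded in.

```python
class RttEstimator:
    """EWMA-based RTT estimator returning the timeout interval after each sample."""

    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = None
        self.dev_rtt = None

    def update(self, sample_rtt):
        if self.estimated_rtt is None:
            # Assumed initialization for the first measurement.
            self.estimated_rtt = sample_rtt
            self.dev_rtt = sample_rtt / 2
        else:
            # Deviation first, against the estimate before this sample (assumed order).
            self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                            + self.beta * abs(sample_rtt - self.estimated_rtt))
            self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                                  + self.alpha * sample_rtt)
        # TimeoutInterval = EstimatedRTT + 4 * DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt
```

Samples should come only from segments that were not retransmitted, as noted above; feeding in a sudden RTT spike shows the safety margin widening faster than the estimate itself.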
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
state "wait for event":
  data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  ACK received, with ACK field value y
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
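The three event handlers can be sketched as a small class. This is an illustration of the simplified sender only: duplicate ACKs, flow control and congestion control are ignored, the timer is a boolean flag, and the `wire` list is a hypothetical stand-in for passing segments to IP.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = []            # (seq, data) pairs, oldest first
        self.timer_running = False
        self.wire = []               # (seq, length) of every segment sent

    def send(self, data):
        self.unacked.append((self.next_seq, data))
        self.wire.append((self.next_seq, len(data)))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq, data = self.unacked[0]  # not-yet-acked segment with smallest seq #
        self.wire.append((seq, len(data)))
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1 is the last cumulatively ACKed byte
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 and then calling `on_ack(120)` clears both segments at once, which is the cumulative ACK behavior.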
TCP: retransmission scenarios
lost ACK scenario
  Host A: Seq=92, 8 bytes of data
  Host B: ACK=100 (lost)
  Host A: timeout; resend Seq=92, 8 bytes of data
  Host B: ACK=100
premature timeout
  Host A: Seq=92, 8 bytes of data (SendBase=92)
  Host A: Seq=100, 20 bytes of data
  Host B: ACK=100
  Host B: ACK=120
  Host A: timeout before the ACKs arrive; resend Seq=92, 8 bytes of data
  Host A: ACK=100 arrives (SendBase=100); ACK=120 arrives (SendBase=120)
  Host B: ACK=120 for the duplicate segment (SendBase=120)
cumulative ACK
  Host A: Seq=92, 8 bytes of data
  Host A: Seq=100, 20 bytes of data
  Host B: ACK=100 (lost)
  Host B: ACK=120 arrives before the timeout
  Host A: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (lost); Host B returns ACK=100 followed by 3 DUP ACKs with ACK=100; Host A resends Seq=100, 20 bytes of data before the timeout expires.]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads. Application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending).]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); buffered data passes to the application process.]
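Both sides of the rule reduce to one subtraction each. The sketch below uses the usual textbook variable names (LastByteRcvd, LastByteRead, LastByteSent, LastByteAcked), which are assumptions here since this slide only names rwnd and RcvBuffer; the function names are hypothetical.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Receiver side: free buffer space to advertise as rwnd."""
    buffered = last_byte_rcvd - last_byte_read  # data waiting for the application
    return rcv_buffer - buffered


def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """Sender side: how much more may be sent without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


if __name__ == "__main__":
    rwnd = advertised_window(4096, last_byte_rcvd=3000, last_byte_read=1000)
    print(rwnd, sendable_bytes(rwnd, last_byte_sent=5000, last_byte_acked=4000))
```

When the application stops reading, the buffered amount grows, rwnd shrinks towards zero, and the sender's allowance shrinks with it.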
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with an application above the network, each holding:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
• server: received ACK(y) indicates client is live; server enters ESTAB
TCP 3-way handshake: FSM
server side:
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); action Λ
• listen → SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1); action Λ
client side:
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
• client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
• server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
• client: enter FIN_WAIT_2 (wait for server close)
• server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
• client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
• server: on receiving the ACK, enter CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[Figure: Host A sends original data at rate λin through a router with unlimited shared output link buffers; Host B receives throughput λout. Plots: λout versus λin (both up to R/2), and delay versus λin (up to R/2).]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin
[Figure: Host A sends λin (original data) and λ'in (original data plus retransmitted data) into finite shared output link buffers; Host B receives λout.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet and sends λin (original data) and λ'in (original data plus retransmitted data) into finite shared output link buffers, shown once with free buffer space and once with no buffer space; Host B receives λout. Plot: λout versus λin, both up to R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: λout versus λin, both axes marked at R/2. When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)]
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: Host A keeps a copy and resends on timeout while the router has free buffer space. Plot: λout versus λin, both axes marked at R/2. When sending at R/2, some packets are retransmissions, including duplicates that are delivered.]
Causescosts of congestion scenario 2
91
R2
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt
sect decreasing goodput
R2lin
Causescosts of congestion scenario 2
92
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 3
bull four sendersbull multihop pathsbull timeoutretransmit
93
Q what happens as lin and linrsquo
increase
finite shared output link buffers
Host A lout Host B
Host CHost D
lin original datalin original data plus
retransmitted data
A as red linrsquo increases all arriving
blue pkts at upper queue are dropped blue throughput g 0
another ldquocostrdquo of congestionv when packet dropped any ldquoupstream
transmission capacity used for that packet was wasted
Causescosts of congestion scenario 3
94
R2
R2
l out
linrsquo
Bandwidth wastage for packets dropped at the 2nd router
Offered load by Host A
Thro
ughp
ut b
y bl
ue tr
affic
Approaches towards congestion control
95
two broad approaches towards congestion control
end-end congestion control
bull no explicit feedback from network
bull congestion inferred from end-system observed loss delay
bull approach taken by TCP
network-assisted congestion control
bull routers provide feedback to end systemsndashsingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
ndashexplicit rate for sender to send at
TCP congestion controladditive increase multiplicative decrease (AIMD)
96
v approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurssect additive increase increase cwnd by 1 MSS every
RTT until loss detectedsectmultiplicative decrease cut cwnd in half after loss
cwnd
TCP
send
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion Control details
bull sender limits transmission
bull cwnd is dynamic function of perceived network congestion
TCP sending ratebull roughly send cwnd
bytes wait RTT for ACKs then send more bytes
97
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwndRTT
bytessec
TCP Slow Start
bull when connection begins increase rate exponentially until first loss eventndash initially cwnd = 1 MSSndash double cwnd every RTTndash done by incrementing cwnd for every ACK received
bull summary initial rate is slow but ramps up exponentially fast
98
Host A
one segment
Host B
RTT
time
two segments
four segments
TCP: detecting, reacting to loss
• loss indicated by timeout:
– cwnd set to 1 MSS
– window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
– dup ACKs indicate network capable of delivering some segments
– cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
[FSM with three states: slow start, congestion avoidance, fast recovery. Transitions are written as event / actions; Λ means no event or no action.]
initially: Λ / cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
• new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK / dupACKcount++
• timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• cwnd > ssthresh / Λ; go to congestion avoidance
• dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
congestion avoidance:
• new ACK / cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK / dupACKcount++
• timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
fast recovery:
• duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK / cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
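The three-state FSM above can be sketched as window bookkeeping in Python. This is a minimal illustration, not a TCP implementation: no segments are sent or retransmitted, the class and method names are hypothetical, the MSS value is an assumed example, and the "cwnd > ssthresh" check is applied right after each new ACK in slow start.

```python
MSS = 1460  # bytes; an assumed value for illustration


class RenoSender:
    """Tracks cwnd, ssthresh and the FSM state only."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                     # each dup ACK means a segment left the network
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"         # the missing segment would be retransmitted here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"                # the missing segment would be retransmitted here
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls reproduces the window changes listed in the FSM.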
TCP throughput
• avg. TCP throughput as function of window size, RTT?
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg. window size (# in-flight bytes) is 3/4 W
– avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: sawtooth of the window size oscillating between W/2 and W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
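The two numbers in this example can be checked directly. The sketch below inverts the throughput formula for L; the function names are hypothetical, and MSS is converted from bytes to bits so that it matches a rate given in bits/sec.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def loss_rate_for(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = loss_rate_for(10e9, 0.100, 1500)
print(round(w), f"{loss:.1e}")
```

The result is roughly 83,333 segments and a loss probability of about 2·10^-10, in line with the figures on the slide.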
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput (y-axis, 0 to R) against Connection 1 throughput (x-axis, 0 to R), with the equal bandwidth share line, (X2, Y2) where X2 = Y2, and the full bandwidth utilization line, (X1, Y1) where X1 + Y1 = R. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2), moving toward the equal-share line.]
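The convergence argument can be tried numerically. The sketch below is a simplified model under stated assumptions: both flows see loss at the same instant whenever their combined rate exceeds the capacity, rates are in arbitrary units, and the function name is hypothetical.

```python
def two_flow_aimd(x1, x2, capacity, steps):
    """Two AIMD flows on one link; loss is assumed when x1 + x2 > capacity."""
    for _ in range(steps):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2      # multiplicative decrease halves the gap
        else:
            x1, x2 = x1 + 1, x2 + 1      # additive increase keeps the gap unchanged
    return x1, x2


x1, x2 = two_flow_aimd(x1=80.0, x2=10.0, capacity=100.0, steps=500)
print(abs(x1 - x2))
```

Because every halving cuts the difference between the two rates in half while additive increase leaves it alone, the difference shrinks toward zero, which is the movement toward the equal-share line.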
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
– do not want rate throttled by congestion control
• instead use UDP:
– send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 near the source and ECN=11 after the congested router; the TCP ACK segment returns with ECE=1.]
rdt2.1: Example 2
[Figure, shown over two slides: rdt2.1 sender FSM (states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (states: Wait for 0 from below, Wait for 1 from below). The one labelled sender transition is rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt), with no action (Λ).]
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
– state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
– state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
– receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
• state "Wait for call 0 from above"
– rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0"
– rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
– rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ (no action); leave this state
receiver FSM fragment:
• transition into state "Wait for 0 from below"
– rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below"
– rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
– checksum, seq #, ACKs, retransmissions will be of help … but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
– retransmission will be duplicate, but seq #'s already handle this
– receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender
[FSM with four states. Transitions are written as event / actions; Λ means no action.]
Wait for call 0 from above:
• rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt) / Λ
Wait for ACK0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
• timeout / udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer; go to Wait for call 1 from above
Wait for call 1 from above:
• rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt) / Λ
Wait for ACK1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
• timeout / udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer; go to Wait for call 0 from above
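The stop-and-wait behaviour of the rdt3.0 sender can be exercised with a small simulation. This is a sketch under simplifying assumptions: the channel only loses packets (corruption is not modelled), a missing ACK is treated as a timeout followed by a retransmission, and `drop_data` / `drop_ack` are hypothetical parameters naming which transmissions the channel loses.

```python
def rdt3_transfer(messages, drop_data=(), drop_ack=()):
    """Simulate rdt3.0 over a lossy channel.

    drop_data / drop_ack hold 0-based indices, counted over all data
    transmissions, whose data packet or returning ACK is lost.
    Returns (messages delivered to the receiving app, data transmissions).
    """
    delivered = []
    expected = 0      # receiver: next expected sequence number (0 or 1)
    seq = 0           # sender: sequence number of the current packet
    tx = 0            # number of data transmissions so far
    for data in messages:
        while True:
            this_tx = tx
            tx += 1
            if this_tx in drop_data:
                continue                  # data lost: timer expires, resend
            if seq == expected:
                delivered.append(data)    # new packet: deliver exactly once
                expected ^= 1
            # otherwise it is a duplicate: receiver only re-sends the ACK
            if this_tx in drop_ack:
                continue                  # ACK lost: timer expires, resend
            break                         # ACK received: stop timer
        seq ^= 1                          # alternate between 0 and 1
    return delivered, tx


result = rdt3_transfer(["a", "b", "c"], drop_data={1}, drop_ack={3})
print(result)
```

In the example the second transmission (carrying "b") and the ACK for the fourth transmission (carrying "c") are lost, so five data transmissions are needed, yet each message is delivered once and in order.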
rdt3.0 in action
[Figure: sender/receiver timelines for four scenarios, each exchanging pkt0, pkt1, pkt0 with ack0, ack1, ack0.
(a) no loss: every packet is ACKed in turn.
(b) packet loss: pkt1 is lost (X); the sender times out and resends pkt1; the receiver receives pkt1 and sends ack1.
(c) ACK loss: ack1 is lost (X); the sender times out and resends pkt1; the receiver receives pkt1 again, detects the duplicate and sends ack1.
(d) premature timeout / delayed ACK: the sender times out before the delayed ack1 arrives and resends pkt1; the receiver detects the duplicate and sends ack1 again; the sender sends pkt0 on the first ack1 and does nothing on the second ack1.]
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$
• U_sender: utilization, fraction of time sender busy sending
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
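The utilization figure can be reproduced with a short calculation. The function name is hypothetical, and the `window` parameter is an added generalization (number of packets sent back-to-back per RTT) that is not part of the stop-and-wait formula itself.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / rate_bps              # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)
u3 = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"{u1:.5f} {u3:.5f}")
```

With one packet per RTT the sender is busy for about 0.027% of the time; sending three packets per RTT triples that value.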
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
– range of sequence numbers must be increased
– buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
– doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
– when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
– when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
– may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initially (Λ): base = 1; nextseqnum = 1
state: Wait
on rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
on timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
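The sender FSM translates almost line for line into a small class. This is a sketch, not a reference implementation: the timer is reduced to a boolean flag, `send` is a hypothetical callback standing in for udt_send, sequence numbers are unbounded rather than k-bit, and `max()` is added so that a stale ACK cannot move `base` backwards.

```python
class GBNSender:
    def __init__(self, window, send):
        self.N = window
        self.send = send              # stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        """Return False (refuse_data) when the window is full."""
        if self.nextseqnum >= self.base + self.N:
            return False
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True             # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is ACKed."""
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)
        # stop the timer if nothing is in flight, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet from base to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])


sent = []
sender = GBNSender(window=4, send=sent.append)
accepted = [sender.rdt_send(d) for d in "abcde"]
```

With a window of 4, the fifth call to `rdt_send` is refused until an ACK advances `base`.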
GBN: receiver extended FSM
initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state: Wait
on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
default:
    udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
– may generate duplicate ACKs
– need only remember expectedseqnum
• out-of-order pkt:
– discard (don't buffer): no receiver buffering!
– re-ACK pkt with highest in-order seq #
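A matching sketch of the receiver side. The class name and the `corrupt` flag are hypothetical, and the method returns the ACK number instead of building and sending an ACK packet.

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0             # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Return the ACK number that would be sent for this arrival."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)       # in-order: deliver to upper layer
            self.last_ack = seq
            self.expectedseqnum += 1
        # default action: out-of-order or corrupt packets are discarded and
        # the ACK for the highest in-order packet is sent again
        return self.last_ack


rcv = GBNReceiver()
acks = [rcv.rdt_rcv(1, "a"), rcv.rdt_rcv(3, "c"), rcv.rdt_rcv(2, "b"), rcv.rdt_rcv(3, "c")]
print(acks, rcv.delivered)
```

The out-of-order arrival of packet 3 produces a duplicate ACK for packet 1, and the packet has to be sent again after packet 2 arrives.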
GBN in action
[Figure: sender/receiver timeline with sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost (X).]
sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
– buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
– sender timer for each unACKed packet
• sender window
– N consecutive seq #'s
– limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
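The receiver rules above map onto a short class. This is a sketch under a simplifying assumption: sequence numbers are plain integers that never wrap around. The class and method names are hypothetical.

```python
class SRReceiver:
    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}          # out-of-order packets awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver any in-order run and slide the window forward
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n              # already delivered: re-ACK so the sender can advance
        return None


sr = SRReceiver(window=4)
sr_acks = [sr.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(sr_acks, sr.delivered)
```

Packets 3, 4 and 5 are buffered and ACKed individually; once packet 2 arrives, all four are delivered in order and the window base moves to 6.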
Selective repeat in action
[Figure: sender/receiver timeline with sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost (X).]
sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window and receiver window (after receipt) over seq #s 0 1 2 3 0 1 2, in two scenarios.
(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; the sender sends pkt3, which is lost (X), and then a new pkt0; the receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost (X X X); the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0.]
• receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
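One way to explore the question is to check, for a given sequence-number space and window size, whether the packets the sender might still retransmit can be confused with the packets the receiver is now waiting for. The function below is a hypothetical helper written for this check; it considers the worst case in which the receiver has accepted a full window while none of the ACKs reached the sender.

```python
def sr_window_is_safe(seq_space, window):
    """True if old retransmissions cannot fall inside the receiver's new window."""
    old = {n % seq_space for n in range(0, window)}            # may be retransmitted
    new = {n % seq_space for n in range(window, 2 * window)}   # receiver's next window
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3), sr_window_is_safe(4, 2))
```

With four sequence numbers, a window of 3 is ambiguous (scenario (b)) while a window of 2 is not. Trying other values suggests the general rule: the window size must be at most half the size of the sequence-number space.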
58
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data:
– bi-directional data flow in same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, with the following fields]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
– URG: urgent data (generally not used)
– ACK: ACK # valid
– PSH: push data now
– RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
TCP seq. numbers, ACKs
sequence numbers:
– byte stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how receiver handles out-of-order segments
– A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), related to the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) placed in the TCP byte stream. Labels: Window N bytes; 100th byte; TCP header (seq no = 100); M bytes; Cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario:
• Host A to Host B: Seq=42, ACK=79, data = 'C' (user types 'C')
• Host B to Host A: Seq=79, ACK=43, data = 'C' (host ACKs receipt of 'C', echoes back 'C')
• Host A to Host B: Seq=43, ACK=80 (host ACKs receipt of echoed 'C')
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
– but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: SampleRTT and EstimatedRTT for RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. y-axis: RTT (milliseconds), 100 to 350; x-axis: time (seconds), 1 to 106.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
$DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
(typically, β = 0.25)
$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
(first term: estimated RTT; second term: "safety margin")
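The three formulas combine into a small estimator. This is a sketch with assumptions that the slides do not specify: the class name is hypothetical, the initial values (EstimatedRTT set to the first sample, DevRTT to half of it) and the order of the two updates (DevRTT first, using the previous EstimatedRTT) follow RFC 6298.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2       # assumed initial value

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)        # milliseconds
timeout = est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, timeout)
```

Samples taken from retransmitted segments should not be passed to `update`, matching the "ignore retransmissions" rule.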
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
– pipelined segments
– cumulative acks
– single retransmission timer
• retransmissions triggered by:
– timeout events
– duplicate acks
let's initially consider simplified TCP sender:
– ignore duplicate acks
– ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
– think of timer as for oldest unacked segment
– expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
– update what is known to be ACKed
– start timer if there are still unacked segments
TCP sender (simplified)
initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state: wait for event
on data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
on timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
on ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
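The simplified sender can be sketched as a class with one method per event. The names are hypothetical, the timer is a boolean flag, and segments "passed to IP" are simply appended to a list (`wire`) so they can be inspected.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = []             # (seq, data) pairs, oldest first
        self.timer_running = False
        self.wire = []                # record of every segment handed to IP

    def on_app_data(self, data):
        segment = (self.next_seq, data)
        self.wire.append(segment)
        self.unacked.append(segment)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            self.wire.append(self.unacked[0])   # smallest-seq unacked segment
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y                  # bytes up to y-1 are ACKed
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


tcp = SimpleTcpSender(initial_seq=92)
tcp.on_app_data(b"x" * 8)      # Seq=92, 8 bytes of data
tcp.on_app_data(b"y" * 20)     # Seq=100, 20 bytes of data
tcp.on_timeout()               # only the Seq=92 segment is retransmitted
```

A cumulative ACK with value 120 then acknowledges both segments at once and stops the timer.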
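TCP: retransmission scenarios
[Figure: three Host A / Host B timelines]
lost ACK scenario:
• Host A sends Seq=92, 8 bytes of data
• Host B sends ACK=100, which is lost (X)
• timeout: Host A resends Seq=92, 8 bytes of data
• Host B sends ACK=100 again
premature timeout:
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• Host B sends ACK=100, then ACK=120
• timeout fires before the ACKs arrive: Host A resends Seq=92, 8 bytes of data
• Host B sends ACK=120
• SendBase labels in the figure: 92, 100, 120, 120
cumulative ACK:
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• ACK=100 is lost (X), but ACK=120 arrives before the timeout
• Host A continues with Seq=120, 15 bytes of data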
TCP ACK generation [RFC 5681]
event at receiver | TCP receiver action
arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed | delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
arrival of in-order segment with expected seq #; one other segment has ACK pending | immediately send single cumulative ACK, ACKing both in-order segments
arrival of out-of-order segment, higher-than-expected seq #; gap detected | immediately send duplicate ACK, indicating seq # of next expected byte
arrival of segment that partially or completely fills gap | immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
– long delay before resending lost packet
• detect lost segments via duplicate ACKs
– sender often sends many segments back-to-back
– if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B sends ACK=100 four times (the original plus 3 DUP ACKs). Host A resends Seq=100, 20 bytes of data, before the timeout expires.]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); data leaves to the application process.]
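The bookkeeping behind rwnd can be written out as two small functions. The names and the variables LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked are assumptions in the sense that they do not appear on this slide; they follow the usual textbook formulation in which rwnd is the buffer size minus the data received but not yet read.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver reports in the TCP header."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if nbytes more can be sent while keeping in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sender_may_send(5000, 4000, rwnd, 1096))
```

With a 4096-byte buffer holding 2000 unread bytes, the receiver advertises 2096 bytes; a sender with 1000 bytes in flight may send at most 1096 more.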
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with application and network layers, holding connection state: ESTAB and connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client.]
client: Socket clientSocket = newSocket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
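The sequence and acknowledgement numbers in the three segments can be traced with a short sketch. The dictionaries are an illustrative representation only, not a real segment format, and the function name is hypothetical.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    return [syn, synack, ack]


segments = three_way_handshake(x=100, y=300)
for segment in segments:
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is how the SYN consumes one sequence number in each direction.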
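TCP 3-way handshake: FSM
[FSM with states closed, listen, SYN sent, SYN rcvd, ESTAB. Transitions are written as event / actions; Λ means no action.]
• closed to listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
• closed to SYN sent: Socket clientSocket = newSocket("hostname", "port number"); / SYN(seq=x)
• listen to SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent to ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
• SYN rcvd to ESTAB: ACK(ACKnum=y+1) / Λ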
TCP: closing a connection
• client, server each close their side of connection
– send TCP segment with FIN bit = 1
• respond to received FIN with ACK
– on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: closing sequence; client state and server state both start at ESTAB.]
1. client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data); client state: FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data at rate λ_in and Host B receives throughput λ_out through a router with unlimited shared output link buffers. Plots: λ_out against λ_in, reaching R/2; delay against λ_in, growing as λ_in approaches R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
– application-layer input = application-layer output: λ_in = λ_out
– transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A and Host B connected through a router with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output rate.]
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: a copy of each packet is kept at the sender; the packet is sent while there is free buffer space and held back when there is no buffer space. Plot: λ_out against λ_in, both up to R/2.]
Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
[Plot: λ_out against λ'_in, both axes up to R/2.]
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[Plot: λ_out against λ'_in, both axes up to R/2.]
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
– decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Host A, Host B, Host C and Host D connected over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output rate.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Plot: λ_out (throughput by blue traffic, up to R/2) against λ'_in (offered load by Host A, up to R/2), showing bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
95
two broad approaches towards congestion control
end-end congestion control
bull no explicit feedback from network
bull congestion inferred from end-system observed loss delay
bull approach taken by TCP
network-assisted congestion control
bull routers provide feedback to end systemsndashsingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
ndashexplicit rate for sender to send at
TCP congestion controladditive increase multiplicative decrease (AIMD)
96
v approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurssect additive increase increase cwnd by 1 MSS every
RTT until loss detectedsectmultiplicative decrease cut cwnd in half after loss
cwnd
TCP
send
er
cong
estio
n w
indo
w s
ize
AIMD saw toothbehavior probing
for bandwidth
additively increase window size helliphellip until loss occurs (then cut window in half)
time
TCP Congestion Control details
bull sender limits transmission
bull cwnd is dynamic function of perceived network congestion
TCP sending ratebull roughly send cwnd
bytes wait RTT for ACKs then send more bytes
97
last byteACKed sent not-
yet ACKed(ldquoin-flightrdquo)
last byte sent
cwnd
LastByteSent-LastByteAcked
lt cwnd
sender sequence number space
rate ~~cwndRTT
bytessec
TCP Slow Start
bull when connection begins increase rate exponentially until first loss eventndash initially cwnd = 1 MSSndash double cwnd every RTTndash done by incrementing cwnd for every ACK received
bull summary initial rate is slow but ramps up exponentially fast
98
Host A
one segment
Host B
RTT
time
two segments
four segments
TCP detecting reacting to loss
bull loss indicated by timeoutndash cwnd set to 1 MSS ndash window then grows exponentially (as in slow start) to
threshold then grows linearlybull loss indicated by 3 duplicate ACKs TCP RENO
ndash dup ACKs indicate network capable of delivering some segments
ndash cwnd is cut in half window then grows linearlybull TCP Tahoe always sets cwnd to 1 (timeout or 3
duplicate acks)
99
TCP switching from slow start to CA
100
Implementationbull variable ssthreshbull on loss event ssthresh is
set to 12 of cwnd just before loss event
Q when should the exponential increase switch to linear
A when cwnd gets to 12 of its value before timeout
Summary TCP Congestion Control
101
timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment
Lcwnd gt ssthresh
congestionavoidance
cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed
new ACK
dupACKcount++duplicate ACK
fastrecovery
cwnd = cwnd + MSStransmit new segment(s) as allowed
duplicate ACK
ssthresh= cwnd2cwnd = ssthresh + 3
retransmit missing segment
dupACKcount == 3
timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment
ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment
dupACKcount == 3cwnd = ssthreshdupACKcount = 0
New ACK
slow start
timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment
cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed
new ACKdupACKcount++duplicate ACK
Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0
NewACK
NewACK
NewACK
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: sawtooth of window size over time, oscillating between W/2 and W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
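Both numbers on this slide follow from short arithmetic, sketched here in Python. The function names are made up for illustration; the loss rate is obtained by solving the [Mathis 1997] throughput formula for L.

```python
def window_segments(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """In-flight segments needed to fill the pipe: rate * RTT / segment size."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_loss(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS converted to bits)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


print(window_segments(10e9, 0.100, 1500))  # about 83,333 segments
print(mathis_loss(10e9, 0.100, 1500))      # about 2e-10
```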
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving from (X1, Y1), where X1 + Y1 = R, toward (X2, Y2), where X2 = Y2]
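A toy simulation makes the convergence visible. It is a sketch under strong assumptions that are not in the slides: both sessions have the same RTT, rates change once per round, and both sessions see a loss in the same round whenever their combined rate exceeds the capacity. The name `aimd_two_flows` is illustrative.

```python
def aimd_two_flows(x: float, y: float, capacity: float, rounds: int) -> tuple[float, float]:
    """Evolve two AIMD rates sharing one bottleneck; return the final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # multiplicative decrease: the gap |x - y| halves
        else:
            x, y = x + 1, y + 1      # additive increase: the gap |x - y| is unchanged
    return x, y


x, y = aimd_two_flows(10, 80, capacity=100, rounds=500)
print(abs(x - y))  # close to 0: the two rates have nearly equalized
```

Additive increase moves the operating point parallel to the equal-share line, while each halving moves it toward the origin and so halves its distance from that line, which is why the gap shrinks over time.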
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 on the way; the TCP ACK segment returns with ECE=1]
rdt2.1: Example 2
[Figure: FSM states "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1" (sender) and "Wait for 0 from below", "Wait for 1 from below" (receiver)]
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1
receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments
sender FSM fragment:
state "Wait for call 0 from above"
  event: rdt_send(data)
  action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
  next state: "Wait for ACK 0"
state "Wait for ACK 0"
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
  action: udt_send(sndpkt)
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
  action: Λ
receiver FSM fragment:
state "Wait for 0 from below"
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
  action: udt_send(sndpkt)
transition into "Wait for 0 from below"
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
  action: extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough
approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
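rdt3.0 sender
[FSM with four states; Λ means no action]
state "Wait for call 0 from above"
  event: rdt_send(data)
  action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
  next state: "Wait for ACK0"
  event: rdt_rcv(rcvpkt)
  action: Λ
state "Wait for ACK0"
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
  action: Λ
  event: timeout
  action: udt_send(sndpkt); start_timer
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
  action: stop_timer
  next state: "Wait for call 1 from above"
state "Wait for call 1 from above"
  event: rdt_send(data)
  action: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
  next state: "Wait for ACK1"
  event: rdt_rcv(rcvpkt)
  action: Λ
state "Wait for ACK1"
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0))
  action: Λ
  event: timeout
  action: udt_send(sndpkt); start_timer
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
  action: stop_timer
  next state: "Wait for call 0 from above"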
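The alternating-bit behavior can be sketched as a tiny Python simulation. Simplifying assumptions, not from the slides: the channel only loses packets (no corruption, no reordering), a lost packet or lost ACK always leads to a timeout and retransmission, and premature timeouts are not modeled. The function `stop_and_wait` and its `drop_data` / `drop_ack` parameters are hypothetical names.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Send messages with 1-bit seq #s; drop_* hold the indices (0-based count of
    data or ACK transmissions) that the channel loses. Returns what the receiver
    delivered and how many data packets were transmitted in total."""
    delivered = []
    expected = 0                     # receiver: seq bit it is waiting for
    seq = 0                          # sender: seq bit of the current packet
    data_sends = ack_sends = 0
    for msg in messages:
        while True:
            lost = data_sends in drop_data
            data_sends += 1
            if lost:
                continue             # no ACK arrives: timeout, retransmit
            if seq == expected:      # new packet: deliver it
                delivered.append(msg)
                expected ^= 1
            # otherwise: duplicate detected, not delivered, but still ACKed
            ack_lost = ack_sends in drop_ack
            ack_sends += 1
            if ack_lost:
                continue             # ACK lost: timeout, retransmit
            break                    # ACK received: move to the next message
        seq ^= 1
    return delivered, data_sends


print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={0}))
```

With the first ACK and the second data transmission lost, the receiver still delivers each message exactly once, at the cost of two extra transmissions.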
rdt3.0 in action
[Figure: four sender/receiver timelines]
(a) no loss: sender sends pkt0, receiver sends ack0; sender sends pkt1, receiver sends ack1; sender sends pkt0, receiver sends ack0.
(b) packet loss: pkt1 is lost (X); sender times out and resends pkt1; receiver receives pkt1 and sends ack1; transfer continues with pkt0 and ack0.
(c) ACK loss: ack1 is lost (X); sender times out and resends pkt1; receiver receives pkt1 again, detects the duplicate and sends ack1; transfer continues with pkt0 and ack0.
(d) premature timeout / delayed ACK: ack1 is delayed; sender times out and resends pkt1; receiver detects the duplicate and sends ack1 again; sender receives the first ack1 and sends pkt0; sender receives the second ack1 and does nothing; receiver receives pkt0 and sends ack0.
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• $U_{sender}$: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
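The utilization figures for stop-and-wait and for 3-packet pipelining can be reproduced with a short Python sketch. The function name and the `window` parameter are illustrative; the formula is the one on the slides, capped at 1 because a sender cannot be busy more than all of the time.

```python
def sender_utilization(packet_bits: float, rate_bps: float, rtt_s: float, window: int = 1) -> float:
    """U_sender = (window * L/R) / (RTT + L/R), capped at 1.0."""
    d_trans = packet_bits / rate_bps            # L/R, transmission delay of one packet
    return min(1.0, window * d_trans / (rtt_s + d_trans))


print(sender_utilization(8000, 1e9, 0.030))            # stop-and-wait, about 0.00027
print(sender_utilization(8000, 1e9, 0.030, window=3))  # 3-packet pipelining, about 0.0008
```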
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
[single state "Wait"; Λ means no action]
initially (Λ):
  base = 1
  nextseqnum = 1
event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)
event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer
event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ
GBN: receiver extended FSM
[single state "Wait"; Λ means no event]
initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++
event: default
  udt_send(sndpkt)
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
GBN in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost (X)]
sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
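The scenario above can be replayed with a minimal Python sketch of a Go-Back-N sender and receiver. Assumptions not in the slides: sequence numbers start at 0 and never wrap, there is no corruption, and timers and the channel are driven by hand by the caller. All class and method names are illustrative.

```python
class GBNReceiver:
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, seq, data):
        """Deliver only the expected packet; always return the cumulative ACK."""
        if seq == self.expected:
            self.delivered.append(data)
            self.expected += 1
        return self.expected - 1      # highest in-order seq # (-1 if none yet)


class GBNSender:
    def __init__(self, window):
        self.window = window
        self.base = 0                 # oldest unacked seq #
        self.next_seq = 0
        self.buffer = {}              # seq # -> data, kept for retransmission

    def send(self, data):
        """Return the seq # used, or None if the window is full (refuse_data)."""
        if self.next_seq >= self.base + self.window:
            return None
        seq = self.next_seq
        self.buffer[seq] = data
        self.next_seq += 1
        return seq

    def on_ack(self, ack):
        """Cumulative ACK: everything up to and including `ack` is acknowledged."""
        if ack >= self.base:          # duplicate ACKs fall through and are ignored
            for seq in range(self.base, ack + 1):
                self.buffer.pop(seq, None)
            self.base = ack + 1

    def on_timeout(self):
        """Go back N: return every unacked packet for retransmission."""
        return [(seq, self.buffer[seq]) for seq in range(self.base, self.next_seq)]
```

Sending pkt0 to pkt3 with pkt2 dropped, then pkt4 and pkt5, and finally calling `on_timeout()` yields pkt2 through pkt5 for retransmission, as in the timeline.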
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
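The receiver rules can be sketched in Python as follows. This is an illustration only: it assumes unbounded sequence numbers starting at 0 (no wraparound) and no corruption, and the class name `SRReceiver` is made up.

```python
class SRReceiver:
    def __init__(self, window):
        self.window = window
        self.base = 0             # rcvbase: next not-yet-received seq #
        self.buffer = {}          # out-of-order packets waiting for the gap to fill
        self.delivered = []

    def receive(self, seq, data):
        """Return the seq # to ACK, or None if the packet is ignored."""
        if self.base <= seq < self.base + self.window:
            self.buffer[seq] = data
            # deliver this packet and any buffered in-order successors
            while self.base in self.buffer:
                self.delivered.append(self.buffer.pop(self.base))
                self.base += 1
            return seq
        if self.base - self.window <= seq < self.base:
            return seq            # already delivered: re-ACK so the sender can advance
        return None               # otherwise: ignore
```

The re-ACK branch matters: if an earlier ACK was lost, the sender keeps retransmitting that packet and its window cannot move until the receiver acknowledges it again.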
Selective repeat in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost (X)]
sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt) over seq #s 0 1 2 3 0 1 2, in two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2 and all are ACKed; receiver window advances and will accept packet with seq number 0; sender sends pkt3 (lost, X) and then pkt0, which carries new data.
(b) oops!: sender sends pkt0, pkt1, pkt2 but all three ACKs are lost (X); receiver window advances and will accept packet with seq number 0; sender times out and retransmits pkt0, which the receiver accepts.
• receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
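One way to explore the question is a brute-force check in Python. The sketch assumes the worst case shown in (b): the receiver has accepted a full window and moved on while every ACK was lost, so the sender may still retransmit any packet of the old window. The function name `sr_ambiguous` is illustrative.

```python
def sr_ambiguous(seq_space: int, window: int) -> bool:
    """True if a retransmission from the old window can land in the receiver's new window."""
    old_window = {s % seq_space for s in range(0, window)}            # may be retransmitted
    new_window = {s % seq_space for s in range(window, 2 * window)}   # accepted as new data
    return bool(old_window & new_window)


print(sr_ambiguous(4, 3))  # True: the slide's example, seq # 0 is ambiguous
print(sr_ambiguous(4, 2))  # False
```

Trying all small combinations suggests the pattern: the overlap disappears exactly when the window size is at most half the sequence number space.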
TCP: Overview [RFCs 793, 1122, 1323, 2018, 2581]
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment drawn 32 bits wide, top to bottom]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
annotations:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
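The fixed 20-byte part of this layout can be unpacked with Python's standard `struct` module. This is a sketch: it reads only the fields listed above, returns options as part of the header length without parsing them, and does not verify the checksum. The function name and the dictionary keys are illustrative.

```python
import struct

FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed TCP header fields (network byte order) and split off the payload."""
    (src, dst, seq, ack, off_byte, flag_byte,
     rwnd, checksum, urg_ptr) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (off_byte >> 4) * 4          # head len is counted in 32-bit words
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len": header_len,
        "flags": {name for name, bit in FLAG_BITS.items() if flag_byte & bit},
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg_ptr,
        "payload": segment[header_len:],
    }


# A hand-built segment: Seq=42, ACK=79, flags PSH+ACK, one byte of data.
demo = struct.pack("!HHIIBBHHH", 12345, 80, 42, 79, 5 << 4, 0x18, 4096, 0, 0) + b"C"
print(parse_tcp_header(demo)["flags"])
```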
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer); sender sequence number space with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
Byte stream in TCP
[Figure labels: window of N bytes; HTTP GET message (K bytes) starting at the 100th byte; TCP header (seq no = 100) followed by M bytes of the message; remaining bytes of the message: cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. A to B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. B to A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. A to B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) versus time in seconds]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$ (typically, $\beta = 0.25$)
  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (EstimatedRTT is the estimated RTT; $4 \cdot DevRTT$ is the "safety margin")
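The two moving averages and the timeout formula fit in a small Python class. Two choices here are assumptions rather than something the slides state: the first sample initializes EstimatedRTT to the sample and DevRTT to half the sample, and DevRTT is updated before EstimatedRTT (both follow the convention of RFC 6298). The class name `RttEstimator` is illustrative.

```python
class RttEstimator:
    def __init__(self, alpha: float = 0.125, beta: float = 0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated = None       # EstimatedRTT
        self.dev = 0.0              # DevRTT

    def update(self, sample: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        if self.estimated is None:
            # assumption: first-sample initialization
            self.estimated = sample
            self.dev = sample / 2
        else:
            # assumption: deviation is measured against the previous estimate
            self.dev = (1 - self.beta) * self.dev + self.beta * abs(sample - self.estimated)
            self.estimated = (1 - self.alpha) * self.estimated + self.alpha * sample
        return self.timeout()

    def timeout(self) -> float:
        return self.estimated + 4 * self.dev


est = RttEstimator()
for sample_ms in (100, 200):
    print(est.update(sample_ms))
```

A single slow sample moves EstimatedRTT only slightly but widens the safety margin immediately, which is the point of tracking the deviation separately.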
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
[single state "wait for event"; Λ means no event]
initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum
event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer
event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer
event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
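The three events translate almost line for line into Python. This sketch makes simplifying assumptions: the timer is only a boolean flag (no real clock), segments are kept in a dictionary for retransmission, and duplicate ACKs, flow control and congestion control are ignored as on the slide. The class name `SimpleTcpSender` is illustrative.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq: int):
        self.send_base = initial_seq
        self.next_seq = initial_seq
        self.unacked = {}             # seq # of first byte -> segment data
        self.timer_running = False

    def send(self, data: bytes) -> int:
        """Data received from application: create and 'send' a segment."""
        seq = self.next_seq
        self.unacked[seq] = data
        self.next_seq += len(data)
        self.timer_running = True     # start timer if it is not already running
        return seq

    def on_timeout(self):
        """Retransmit the not-yet-acked segment with the smallest seq #.
        Only meaningful while the timer is running, i.e. unacked data exists."""
        seq = min(self.unacked)
        self.timer_running = True     # restart timer
        return seq, self.unacked[seq]

    def on_ack(self, y: int) -> None:
        """ACK received with ACK field value y (cumulative)."""
        if y > self.send_base:
            self.send_base = y
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

For example, after sending 8 bytes at Seq=92 and 20 bytes at Seq=100, a single ACK=120 acknowledges both segments and stops the timer; a late ACK=100 is then ignored.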
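TCP: retransmission scenarios
[Figure: two Host A / Host B timelines]
lost ACK scenario:
• A sends Seq=92, 8 bytes of data (SendBase=92)
• B replies ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100 (SendBase=100)
premature timeout:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive at A (SendBase=120)
• B replies ACK=120 to the retransmission (SendBase=120)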
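TCP: retransmission scenarios
cumulative ACK:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so nothing is retransmitted; A continues with Seq=120, 15 bytes of data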
TCP ACK generation [RFC 5681]
event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expected seq #; gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B keeps replying ACK=100; after the 3 DUP ACKs, A resends Seq=100, 20 bytes of data before the timeout expires]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack; segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
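The bookkeeping behind rwnd can be written out in a few lines of Python. The variable names `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked` are assumptions borrowed from the usual textbook presentation, not from the slide, and the functions are a sketch rather than how any real TCP stack is structured.

```python
def advertised_rwnd(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Free buffer space = RcvBuffer minus data received but not yet read by the app."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(last_byte_sent: int, last_byte_acked: int, rwnd: int, nbytes: int) -> bool:
    """Sender keeps in-flight data, including the new bytes, within rwnd."""
    return (last_byte_sent + nbytes) - last_byte_acked <= rwnd


rwnd = advertised_rwnd(4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                      # 2096
print(sender_may_send(5000, 4000, rwnd, 1000))   # True: 2000 bytes in flight
print(sender_may_send(5000, 4000, rwnd, 1200))   # False: 2200 bytes would exceed rwnd
```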
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server each hold, between application and network: connection state: ESTAB; connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED -> SYNSENT -> ESTAB; server state: LISTEN -> SYN RCVD -> ESTAB
1. client: choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x (client enters SYNSENT)
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1 (server enters SYN RCVD)
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data (client enters ESTAB)
4. server: received ACK(y) indicates client is live (server enters ESTAB)
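The numbering rule in the exchange (each side acknowledges the other's initial sequence number plus one) can be sketched in Python. The dictionaries stand in for segments and only carry the fields named on the slide; the function name and the modulo-2^32 wraparound of the 32-bit sequence space are additions for illustration.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32 bits wide


def three_way_handshake(x: int, y: int) -> list:
    """Return the three segments exchanged for client ISN x and server ISN y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}
    return [("client", syn), ("server", synack), ("client", ack)]


for sender, segment in three_way_handshake(x=1000, y=5000):
    print(sender, segment)
```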
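TCP 3-way handshake: FSM
[FSM states: closed, listen, SYN rcvd, SYN sent, ESTAB; Λ means no event or no action]
client side:
• closed -> SYN sent: on Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent -> ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
server side:
• closed -> listen: on Socket connectionSocket = welcomeSocket.accept(); Λ
• listen -> SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN rcvd -> ESTAB: on ACK(ACKnum=y+1); Λ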
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
[Figure: client and server timelines, both starting in ESTAB]
1. client calls clientSocket.close(), sends FINbit=1, seq=x and enters FIN_WAIT_1: can no longer send but can receive data
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT: can still send data; client enters FIN_WAIT_2: wait for server close
3. server sends FINbit=1, seq=y and enters LAST_ACK: can no longer send data
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT: timed wait for 2 * max segment lifetime, then CLOSED
5. server receives the ACK and enters CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
[Figure: Host A sends original data at rate $\lambda_{in}$ into unlimited shared output link buffers; Host B receives throughput $\lambda_{out}$]
[Plots: $\lambda_{out}$ versus $\lambda_{in}$, rising to R/2; delay versus $\lambda_{in}$, up to R/2]
• maximum per-connection throughput: R/2
• large delays as arrival rate, $\lambda_{in}$, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \geq \lambda_{in}$
[Figure: Host A and Host B connected through finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet; finite shared output link buffers shown once with free buffer space and once with no buffer space; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$ to Host B]
[Plot: $\lambda_{out}$ versus $\lambda_{in}$, both up to R/2]
Causes/costs of congestion: scenario 2
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A, Host B, router with free buffer space; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]
Causes/costs of congestion: scenario 2
idealization: known loss
[Plot: $\lambda_{out}$ versus $\lambda_{in}$, both up to R/2] when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: $\lambda_{out}$ versus $\lambda_{in}$, both up to R/2] when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]
Causes/costs of congestion: scenario 3
[Plot: $\lambda_{out}$ (throughput by blue traffic) versus $\lambda'_{in}$ (offered load by Host A), both up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth; cwnd (TCP sender congestion window size) versus time; additively increase window size ... until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
  $LastByteSent - LastByteAcked \leq cwnd$
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $rate \approx \frac{cwnd}{RTT}$ bytes/sec
[Figure: sender sequence number space; last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd]
rdt2.1: discussion
sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK was received OK at sender
rdt2.2: a NAK-free protocol
• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
rdt2.2: sender, receiver fragments

sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      -> "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      udt_send(sndpkt)
      -> stay in "Wait for ACK 0"
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      Λ (no action)

receiver FSM fragment
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK1, chksum)
      udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      udt_send(sndpkt)
      -> stay in "Wait for 0 from below"
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
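The retransmit-on-timeout idea can be exercised with a small, deterministic simulation. This is a sketch under simplifying assumptions: losses are scripted by transmission attempt number rather than random, a lost packet or lost ACK is modelled as an immediate sender timeout, and corruption is ignored. The function and parameter names are invented for the example.

```python
def run_rdt30(messages, drop_data=(), drop_ack=()):
    """Alternating-bit sender/receiver; a drop is treated as a sender timeout.

    drop_data / drop_ack hold the transmission attempt numbers (0-based)
    whose data packet or ACK is lost on the channel.
    """
    delivered, expected, seq, attempt = [], 0, 0, 0
    for data in messages:
        while True:
            n, attempt = attempt, attempt + 1
            if n in drop_data:
                continue                  # data lost: timeout, retransmit
            if seq == expected:           # new packet: deliver it once
                delivered.append(data)
                expected ^= 1
            # a duplicate is not delivered again, but it is still ACKed
            if n in drop_ack:
                continue                  # ACK lost: timeout, retransmit
            break                         # ACK for seq arrived
        seq ^= 1
    return delivered, attempt


delivered, attempts = run_rdt30(["a", "b", "c"], drop_data={1}, drop_ack={3})
print(delivered, attempts)
```

Each message is delivered exactly once even though two of the five transmissions hit a loss, which is the job of the one-bit sequence number.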
rdt3.0 sender

state "Wait for call 0 from above":
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> "Wait for ACK0"
  rdt_rcv(rcvpkt)
    Λ (no action)

state "Wait for ACK0":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    Λ (no action)
  timeout
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    stop_timer
    -> "Wait for call 1 from above"

state "Wait for call 1 from above":
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> "Wait for ACK1"
  rdt_rcv(rcvpkt)
    Λ (no action)

state "Wait for ACK1":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    Λ (no action)
  timeout
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    stop_timer
    -> "Wait for call 0 from above"
rdt3.0 in action

(a) no loss
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1
  4. receiver: rcv pkt1, send ack1
  5. sender: rcv ack1, send pkt0
  6. receiver: rcv pkt0, send ack0

(b) packet loss
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1; pkt1 is lost (X)
  4. sender: timeout, resend pkt1
  5. receiver: rcv pkt1, send ack1
  6. sender: rcv ack1, send pkt0
  7. receiver: rcv pkt0, send ack0

(c) ACK loss
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1
  4. receiver: rcv pkt1, send ack1; ack1 is lost (X)
  5. sender: timeout, resend pkt1
  6. receiver: rcv pkt1 (detect duplicate), send ack1
  7. sender: rcv ack1, send pkt0
  8. receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1
  4. receiver: rcv pkt1, send ack1 (ack1 is delayed)
  5. sender: timeout, resend pkt1
  6. receiver: rcv pkt1 (detect duplicate), send ack1
  7. sender: rcv ack1, send pkt0
  8. receiver: rcv pkt0, send ack0
  9. sender: rcv ack1 (second copy), do nothing
  10. sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000-bit packet:

$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$

• U_sender: utilization, the fraction of time sender is busy sending

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec: one 1 KB pkt every 30 msec, i.e. 33 kB/sec throughput over a 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation

[Figure: sender/receiver timing diagram]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization

[Figure: sender/receiver timing diagram with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3:

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
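The utilization figures for stop-and-wait and for 3-packet pipelining follow from the same expression. A short sketch, where the `window` parameter (packets sent per RTT) and the function name are assumptions made for the example:

```python
def sender_utilization(pkt_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy with `window` packets per RTT."""
    d_trans = pkt_bits / rate_bps              # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)
u3 = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"{u1:.5f} {u3:.5f}")
```

Utilization grows linearly with the window until the pipe is full, which is why the result is capped at 1.0.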
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM

initially:
  base = 1
  nextseqnum = 1

state "Wait":
  rdt_send(data)
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)

  timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer

  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
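The sender FSM translates fairly directly into code. The class below is a sketch rather than a full implementation: the timer is reduced to a boolean flag, sequence numbers grow without wrapping, and `send` is a caller-supplied callback standing in for udt_send. The guard that ignores an ACK older than base is an addition of this sketch, not part of the FSM.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping; `send` is a caller-supplied callback."""

    def __init__(self, window, send):
        self.n, self.send = window, send
        self.base = self.nextseqnum = 1
        self.sndpkt = {}              # seq -> buffered data
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.n:
            return False              # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is done."""
        if acknum < self.base:
            return                    # duplicate ACK: ignore
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet that is still unacknowledged."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])


sent = []
s = GBNSender(window=4, send=lambda seq, data: sent.append(seq))
accepted = [s.rdt_send(f"d{i}") for i in range(5)]
s.on_ack(2)
s.on_timeout()
print(accepted, sent)
```

In the usage lines the fifth call is refused because the window of 4 is full, and the timeout resends only packets 3 and 4, the ones not covered by the cumulative ACK for 2.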
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initially:
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state "Wait":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

  default
    udt_send(sndpkt)
GBN in action (sender window N = 4)
  1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
  2. receiver: receive pkt0, send ack0; receive pkt1, send ack1
  3. receiver: receive pkt3, discard, (re)send ack1
  4. sender: rcv ack0, send pkt4; rcv ack1, send pkt5
  5. sender: ignore duplicate ACK
  6. receiver: receive pkt4, discard, (re)send ack1
  7. receiver: receive pkt5, discard, (re)send ack1
  8. sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  9. receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5

[Figure: the sender window slides over seq #'s 0 1 2 3 4 5 6 7 8 as ACKs arrive]
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows
[Figure: sender and receiver windows over the sequence number space]
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
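The receiver rules can be sketched as a small class. It assumes unbounded sequence numbers starting at 0 (no wraparound) and error-free packets; the class and method names are illustrative.

```python
class SRReceiver:
    """Selective-repeat receiver: buffer out-of-order, deliver in order."""

    def __init__(self, window):
        self.n, self.rcvbase = window, 0
        self.buffer, self.delivered = {}, []

    def on_pkt(self, seq, data):
        """Return the ACK number to send, or None to ignore the packet."""
        if self.rcvbase <= seq < self.rcvbase + self.n:
            self.buffer[seq] = data
            while self.rcvbase in self.buffer:   # slide past in-order run
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return seq
        if self.rcvbase - self.n <= seq < self.rcvbase:
            return seq                           # already delivered: re-ACK
        return None


r = SRReceiver(window=4)
acks = [r.on_pkt(s, f"pkt{s}") for s in (0, 1, 3, 4, 5, 2, 3)]
print(acks, r.delivered)
```

The arrival order in the usage lines has pkt2 arriving late: packets 3, 4 and 5 are buffered and ACKed individually, and all four are delivered together once pkt2 arrives. The final, repeated pkt3 falls in [rcvbase-N, rcvbase-1] and is simply re-ACKed.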
Selective repeat in action (sender window N = 4)
  1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
  2. receiver: receive pkt0, send ack0; receive pkt1, send ack1
  3. receiver: receive pkt3, buffer, send ack3
  4. sender: rcv ack0, send pkt4; rcv ack1, send pkt5
  5. sender: record ack3 arrived
  6. receiver: receive pkt4, buffer, send ack4
  7. receiver: receive pkt5, buffer, send ack5
  8. sender: pkt 2 timeout, send pkt2
  9. sender: record ack4 arrived; record ack5 arrived
  10. receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

[Figure: the sender window slides over seq #'s 0 1 2 3 4 5 6 7 8 as ACKs arrive]
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
  1. sender: send pkt0, pkt1, pkt2; all three are received and their ACKs reach the sender
  2. sender: send pkt3 (lost, X), then a new pkt0
  3. receiver: will accept packet with seq number 0

(b) oops!
  1. sender: send pkt0, pkt1, pkt2; all three are received, but all three ACKs are lost (X X X)
  2. sender: timeout, retransmit pkt0
  3. receiver: will accept packet with seq number 0

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
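One way to explore the question is to test, for a given sequence-number space and window size, whether a retransmitted old packet can fall inside the receiver's new window. The check below is a sketch that models only the worst case from (b): a full window received and every ACK lost. The function name is hypothetical.

```python
def ambiguous(seq_space, window):
    """True if an old retransmission can land in the receiver's new window."""
    # After receiving a full window starting at 0, the receiver expects
    # seq numbers window .. 2*window-1 (mod seq_space); the sender may
    # still retransmit 0 .. window-1 if every ACK was lost.
    new = {(window + i) % seq_space for i in range(window)}
    old = set(range(window))
    return bool(new & old)


print(ambiguous(4, 3), ambiguous(4, 2))
```

The overlap disappears once the window is no larger than half the sequence-number space, which is the usual rule for selective repeat.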
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure (each row is 32 bits wide)

  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
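To make the field layout concrete, here is a sketch that packs the fixed 20-byte part of a header with Python's struct module. It assumes no options (header length of 5 words), leaves the checksum at zero rather than computing it, and uses a made-up helper name; the flag values are the standard bit positions.

```python
import struct


def pack_tcp_header(src, dst, seq, ack, flags, window, checksum=0, urg=0):
    """Pack a 20-byte TCP header with no options (data offset = 5 words)."""
    offset_flags = (5 << 12) | flags       # 4-bit header length, then flags
    return struct.pack("!HHIIHHHH", src, dst, seq, ack,
                       offset_flags, window, checksum, urg)


SYN, ACK = 0x02, 0x10
hdr = pack_tcp_header(12345, 80, seq=42, ack=79, flags=ACK, window=4096)
print(len(hdr), struct.unpack("!HHIIHHHH", hdr)[2:4])
```

The "!" prefix selects network byte order, so the 32-bit sequence and acknowledgement numbers come back unchanged when unpacked.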
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how does receiver handle out-of-order segments?
  – A: TCP spec doesn't say; up to implementor

[Figure: sender sequence number space with window size N, divided into: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. The outgoing segment from sender carries the sequence number; the incoming segment to sender carries the acknowledgement number A. Both segments show the fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer.]
Byte stream in TCP

[Figure: byte stream in TCP. Labels: window of N bytes; HTTP GET message (K bytes) starting at the 100th byte; TCP header (seq. no. = 100); M bytes; part that cannot be transmitted now.]
TCP seq. numbers, ACKs: simple telnet scenario
  1. Host A -> Host B: Seq=42, ACK=79, data = 'C' (user types 'C')
  2. Host B -> Host A: Seq=79, ACK=43, data = 'C' (host ACKs receipt of 'C', echoes back 'C')
  3. Host A -> Host B: Seq=43, ACK=80 (host ACKs receipt of echoed 'C')
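The numbers in the telnet scenario follow from two rules: a reply's sequence number is the ACK value it just received, and its ACK number is the received sequence number plus the number of data bytes. A minimal sketch, with segments represented as plain dictionaries (an assumption of the example) and no other data in flight:

```python
def reply(seg, data=b""):
    """Build the peer's reply: ACK the bytes received, continue own stream."""
    return {"seq": seg["ack"], "ack": seg["seq"] + len(seg["data"]), "data": data}


a1 = {"seq": 42, "ack": 79, "data": b"C"}   # A sends 'C'
b1 = reply(a1, b"C")                        # B ACKs and echoes 'C'
a2 = reply(b1)                              # A ACKs the echo
print(b1["seq"], b1["ack"], a2["seq"], a2["ack"])
```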
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP round trip time, timeout

$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (y-axis, 100 to 350) against time in seconds (x-axis, 1 to 106).]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$ (typically, β = 0.25)

$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$

Here EstimatedRTT is the estimated RTT and 4·DevRTT is the "safety margin".
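The two moving averages and the timeout rule fit in a few lines. This sketch makes choices the slides leave open, and they should be read as assumptions: the estimate is seeded with the first sample, the initial deviation is half of that sample, and the deviation is updated before the new sample is folded into the estimate. The class name is hypothetical.

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, plus the timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated = first_sample      # assumption: seed with first sample
        self.dev = first_sample / 2        # assumption: initial deviation

    def update(self, sample):
        # Deviation uses the estimate from before this sample is folded in.
        self.dev = (1 - self.beta) * self.dev + self.beta * abs(sample - self.estimated)
        self.estimated = (1 - self.alpha) * self.estimated + self.alpha * sample
        return self.timeout()

    def timeout(self):
        return self.estimated + 4 * self.dev


est = RttEstimator(first_sample=0.100)
for sample in (0.120, 0.090, 0.300, 0.110):
    print(f"sample={sample:.3f} timeout={est.update(sample):.3f}")
```

A single outlier such as the 0.300 s sample widens the timeout sharply through DevRTT, while EstimatedRTT itself moves only by one eighth of the difference.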
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)

initially:
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state "wait for event":
  data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer

  timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

  ACK received, with ACK field value y
    if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
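The three events of the simplified sender map onto three methods. As before this is a sketch with stated simplifications: the timer is a boolean flag, `send` is a hypothetical callback in place of handing the segment to IP, and flow and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one retransmission timer."""

    def __init__(self, initial_seq, send):
        self.send = send                   # hypothetical callback(seq, data)
        self.send_base = self.next_seq = initial_seq
        self.unacked = {}                  # seq of first byte -> data
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send(self.next_seq, data)
        self.next_seq += len(data)
        self.timer_running = True          # start timer if not already running

    def on_timeout(self):
        seq = min(self.unacked)            # segment with smallest seq #
        self.send(seq, self.unacked[seq])
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in done:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


log = []
tcp = SimpleTcpSender(92, lambda seq, data: log.append((seq, len(data))))
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes
tcp.on_timeout()              # resend Seq=92 only
tcp.on_ack(120)               # cumulative ACK covers both segments
print(log, tcp.send_base, tcp.timer_running)
```

Unlike Go-Back-N, the timeout resends only the oldest unacknowledged segment, and a single cumulative ACK of 120 then clears both segments at once.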
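TCP: retransmission scenarios

lost ACK scenario
  1. Host A -> Host B: Seq=92, 8 bytes of data
  2. Host B -> Host A: ACK=100, lost (X)
  3. Host A: timeout; resends Seq=92, 8 bytes of data
  4. Host B -> Host A: ACK=100

premature timeout
  1. Host A -> Host B: Seq=92, 8 bytes of data (SendBase=92)
  2. Host A -> Host B: Seq=100, 20 bytes of data
  3. Host B sends ACK=100 and ACK=120, but Host A's timer expires before they arrive
  4. Host A: timeout; resends Seq=92, 8 bytes of data
  5. Host A receives ACK=100 (SendBase=100), then ACK=120 (SendBase=120)
  6. Host B -> Host A: ACK=120 for the duplicate segment (SendBase=120)

cumulative ACK
  1. Host A -> Host B: Seq=92, 8 bytes of data
  2. Host A -> Host B: Seq=100, 20 bytes of data
  3. Host B -> Host A: ACK=100, lost (X)
  4. Host B -> Host A: ACK=120, arrives before the timeout, so nothing is retransmitted
  5. Host A -> Host B: Seq=120, 15 bytes of data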
TCP ACK generation [RFC 5681]

event at receiver | TCP receiver action
arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed | delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
arrival of in-order segment with expected seq #; one other segment has ACK pending | immediately send single cumulative ACK, ACKing both in-order segments
arrival of out-of-order segment with higher-than-expected seq #; gap detected | immediately send duplicate ACK, indicating seq # of next expected byte
arrival of segment that partially or completely fills gap | immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK
  1. Host A -> Host B: Seq=92, 8 bytes of data
  2. Host A -> Host B: Seq=100, 20 bytes of data, lost (X)
  3. Host B -> Host A: ACK=100, four times in total (the original plus 3 DUP ACKs)
  4. Host A: resends Seq=100, 20 bytes of data, before the timeout expires
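Counting duplicates is all the sender needs in order to decide when to retransmit early. A small sketch, where the function name and the idea of feeding it a list of received ACK numbers are assumptions for illustration:

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ACK numbers that trigger a fast retransmit."""
    triggers, last, dups = [], None, 0
    for ack in acks:
        if ack == last:
            dups += 1
            if dups == threshold:
                triggers.append(ack)   # resend segment starting at `ack`
        else:
            last, dups = ack, 0
    return triggers


print(fast_retransmit_points([100, 100, 100, 100, 120]))
```

The first ACK=100 is an ordinary ACK; the retransmission of the segment starting at byte 100 is triggered by the third duplicate that follows it.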
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data is passed to the application process.]
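The advertised window and the sender's check can be written down directly. The variable names LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked follow the usual textbook bookkeeping and do not appear on the slide, so treat them as assumptions of this sketch.

```python
def rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space advertised to the sender."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def can_send(nbytes, last_byte_sent, last_byte_acked, advertised):
    """Sender side: keep in-flight data within the advertised window."""
    return (last_byte_sent - last_byte_acked) + nbytes <= advertised


w = rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(w, can_send(1500, 5000, 4000, w), can_send(1500, 5000, 4500, w))
```

In the usage line the same 1500-byte segment is refused while 1000 bytes are in flight and allowed once only 500 remain unacknowledged.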
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]

client: Socket clientSocket = newSocket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake

client state: CLOSED; server state: LISTEN

  1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
  2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
  3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
  4. server: received ACK(y) indicates client is live; enter ESTAB
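The sequence and acknowledgement numbers of the three segments depend only on the two initial sequence numbers. A tiny sketch, with segments as dictionaries and a hypothetical function name:

```python
def three_way_handshake(x, y):
    """Return the three segments exchanged, given initial seq numbers x, y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    return syn, synack, ack


for seg in three_way_handshake(x=1000, y=5000):
    print(seg)
```

Each side acknowledges the other's initial sequence number plus one, which is how a SYN comes to consume one sequence number even though it carries no data.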
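TCP 3-way handshake: FSM

client:
  closed -> SYN sent
    event: Socket clientSocket = newSocket("hostname", "port number");
    action: SYN(seq=x)
  SYN sent -> ESTAB
    event: SYNACK(seq=y, ACKnum=x+1)
    action: ACK(ACKnum=y+1)

server:
  closed -> listen
    event: Socket connectionSocket = welcomeSocket.accept();
    action: Λ (no action)
  listen -> SYN rcvd
    event: SYN(x)
    action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB
    event: ACK(ACKnum=y+1)
    action: Λ (no action)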
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

  1. client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send, but can receive data)
  2. server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
  3. client: enter FIN_WAIT_2 (wait for server close)
  4. server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
  5. client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
  6. server: on receiving the ACK, enter CLOSED
The "Two Army Problem"

Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A sends original data at rate λ_in and Host B receives throughput λ_out through a router with unlimited shared output link buffers. Plots: λ_out against λ_in, rising to R/2; delay against λ_in, growing sharply as λ_in approaches R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A offers λ_in (original data) and λ'_in (original data plus retransmitted data) to a router with finite shared output link buffers; Host B receives λ_out.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: sender A keeps a copy of each packet and transmits only when there is free buffer space at the router, not when there is no buffer space. λ_in: original data; λ'_in: original data plus retransmitted data. Plot: λ_out against λ_in, rising to R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

[Figure: sender A, Host B and a router with finite buffers showing free buffer space; plot of λ_out against λ'_in, both up to R/2.]

Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[Figure: sender A times out and sends a second copy while the router still has free buffer space; plot of λ_out against λ'_in, both up to R/2.]
Causes/costs of congestion: scenario 2
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers. Host A offers λ_in (original data) and λ'_in (original data plus retransmitted data); λ_out is delivered to Host B.]
Causes/costs of congestion: scenario 3
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: throughput by blue traffic (λ_out, up to R/2) against offered load by Host A (λ'_in, up to R/2). Annotation: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) against time: additively increase window size … until loss occurs (then cut window in half).]
TCP Congestion Control: details
• sender limits transmission:

$LastByteSent - LastByteAcked < cwnd$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$rate \approx \frac{cwnd}{RTT}$ bytes/sec

[Figure: sender sequence number space showing last byte ACKed, bytes sent but not-yet ACKed ("in-flight") spanning cwnd, and last byte sent.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control

initially:
  cwnd = 1 MSS
  ssthresh = 64 KB
  dupACKcount = 0
  -> "slow start"

state "slow start":
  new ACK
    cwnd = cwnd + MSS
    dupACKcount = 0
    transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  cwnd > ssthresh
    Λ (no action)
    -> "congestion avoidance"
  timeout
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment
  dupACKcount == 3
    ssthresh = cwnd/2
    cwnd = ssthresh + 3 MSS
    retransmit missing segment
    -> "fast recovery"

state "congestion avoidance":
  new ACK
    cwnd = cwnd + MSS · (MSS/cwnd)
    dupACKcount = 0
    transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  timeout
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment
    -> "slow start"
  dupACKcount == 3
    ssthresh = cwnd/2
    cwnd = ssthresh + 3 MSS
    retransmit missing segment
    -> "fast recovery"

state "fast recovery":
  duplicate ACK
    cwnd = cwnd + MSS
    transmit new segment(s), as allowed
  new ACK
    cwnd = ssthresh
    dupACKcount = 0
    -> "congestion avoidance"
  timeout
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment
    -> "slow start"
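The three-state summary can be turned into a compact state machine that tracks only cwnd, ssthresh and the duplicate-ACK count. This is a sketch, not a TCP implementation: it sends nothing, measures everything in bytes, and checks the cwnd > ssthresh transition right after each new ACK. The class and method names are made up for the example.

```python
class RenoCwnd:
    """cwnd bookkeeping for the three states; units are bytes."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss, self.cwnd, self.ssthresh = mss, float(mss), float(ssthresh)
        self.dup_acks, self.state = 0, "slow start"

    def new_ack(self):
        if self.state == "fast recovery":
            self.cwnd, self.state = self.ssthresh, "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                     # doubles cwnd per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # about 1 MSS per RTT
        self.dup_acks = 0

    def dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def timeout(self):
        self.ssthresh, self.cwnd = self.cwnd / 2, float(self.mss)
        self.dup_acks, self.state = 0, "slow start"


c = RenoCwnd(mss=1000, ssthresh=4000)
for _ in range(4):
    c.new_ack()
print(c.state, c.cwnd)
```

Calling dup_ack() three times from this point moves the object into fast recovery with ssthresh at half the previous cwnd; a timeout in any state returns it to slow start with cwnd at 1 MSS.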
rdt22 a NAK-free protocol
bull same functionality as rdt21 using ACKs onlybull instead of NAK receiver sends ACK for last pkt
received OKndash receiver must explicitly include seq of pkt being ACKed
bull duplicate ACK at sender results in same action as NAK retransmit current pkt
35
rdt22 sender receiver fragments
36
Wait for call 0 from above
sndpkt = make_pkt(0 data checksum)udt_send(sndpkt)
rdt_send(data)
udt_send(sndpkt)
rdt_rcv(rcvpkt) ampamp ( corrupt(rcvpkt) ||isACK(rcvpkt1) )
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp isACK(rcvpkt0)
Wait for ACK 0
sender FSMfragment
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp has_seq1(rcvpkt)
extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(ACK1 chksum)udt_send(sndpkt)
Wait for 0 from below
rdt_rcv(rcvpkt) ampamp (corrupt(rcvpkt) ||has_seq1(rcvpkt))
udt_send(sndpkt)receiver FSMfragment
L
rdt30 channels with errors and loss
new assumptionunderlying channel can also lose packets (data ACKs)ndash checksum seq ACKs
retransmissions will be of help hellip but not enough
approach sender waits ldquoreasonablerdquo amount of time for ACK
bull retransmits if no ACK received in this time
bull if pkt (or ACK) just delayed (not lost)ndash retransmission will be
duplicate but seq rsquos already handles this
ndash receiver must specify seq of pkt being ACKed
bull requires countdown timer
37
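rdt3.0 sender
• state "Wait for call 0 from above"
  - event: rdt_send(data)
    action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  - event: rdt_rcv(rcvpkt)
    action: Λ (ignore)
• state "Wait for ACK0"
  - event: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )
    action: Λ (wait for timeout)
  - event: timeout
    action: udt_send(sndpkt); start_timer
  - event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    action: stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above"
  - event: rdt_send(data)
    action: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  - event: rdt_rcv(rcvpkt)
    action: Λ (ignore)
• state "Wait for ACK1"
  - event: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) )
    action: Λ (wait for timeout)
  - event: timeout
    action: udt_send(sndpkt); start_timer
  - event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
    action: stop_timer; go to "Wait for call 0 from above"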
rdt3.0 in action
[Figure: four sender/receiver timelines]
(a) no loss: sender sends pkt0; receiver rcv pkt0, send ack0; sender rcv ack0, send pkt1; receiver rcv pkt1, send ack1; sender rcv ack1, send pkt0; receiver rcv pkt0, send ack0
(b) packet loss: pkt1 is lost (X); sender times out and resends pkt1; receiver rcv pkt1, send ack1; sender rcv ack1, send pkt0
(c) ACK loss: ack1 is lost (X); sender times out and resends pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender rcv ack1, send pkt0
(d) premature timeout / delayed ACK: ack1 is delayed; sender times out and resends pkt1; the first ack1 arrives and sender sends pkt0; receiver rcv pkt1 (detect duplicate), send ack1; sender receives the second ack1 and does nothing; receiver rcv pkt0, send ack0; sender rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
• $U_{sender}$: utilization, the fraction of time sender is busy sending:
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
• if RTT = 30 msec: one 1 KB pkt every 30 msec, i.e. 33 kB/sec throughput over a 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives; last packet bit arrives, receiver sends ACK
• t = RTT + L/R: ACK arrives, sender sends next packet
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline with three packets in flight]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet
3-packet pipelining increases utilization by a factor of 3:
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
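The two utilization figures can be reproduced with a few lines of Python. This is a minimal sketch under the slide's numbers (8000-bit packets, 1 Gbps link, 30 ms RTT); the function name `sender_utilization` and its `window` parameter are illustrative, not part of the slides.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets are sent per RTT."""
    d_trans = packet_bits / link_bps            # L/R, transmission delay in seconds
    # Utilization cannot exceed 1 once the pipe is full.
    return min(1.0, window * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)          # rdt3.0
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)    # 3 packets in flight
print(f"stop-and-wait: {stop_and_wait:.5f}, 3-packet pipeline: {pipelined:.5f}")
```

Raising `window` until the result reaches 1.0 shows roughly how many packets must be in flight to keep this link busy.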
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initial action: base=1; nextseqnum=1
single state: Wait
• event: rdt_send(data)
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
• event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
• event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
• event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
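The sender FSM above can be sketched as a small Python class. This is an illustrative sketch, not a full implementation: the class name `GBNSender`, the `udt_send` callback, and the boolean `timer_running` (standing in for a real countdown timer) are assumptions, and the `max()` guard against stale ACKs is a defensive addition that the FSM does not contain.

```python
class GBNSender:
    """Go-Back-N sender state, following the extended FSM in the slides."""

    def __init__(self, window_size, udt_send):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq -> packet, kept until cumulatively ACKed
        self.timer_running = False   # stand-in for a real countdown timer
        self.udt_send = udt_send     # hypothetical "unreliable send" callback

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False             # refuse_data: window is full
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.udt_send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True    # first unacked packet starts the timer
        self.nextseqnum += 1
        return True

    def rdt_rcv_ack(self, acknum):
        # Cumulative ACK; max() keeps a stale ACK from moving base backwards.
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        self.timer_running = self.base != self.nextseqnum

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])   # go back N: resend every unacked pkt


sent = []
sender = GBNSender(window_size=4, udt_send=sent.append)
accepted = [sender.rdt_send(f"data{i}") for i in range(6)]  # last two are refused
sender.rdt_rcv_ack(2)            # cumulative ACK frees seq 1 and 2
sender.rdt_send("more")          # window slid, so this is accepted as seq 5
sent.clear()
sender.timeout()
resent_seqs = [seq for seq, _ in sent]
```

After the timeout, every packet from `base` up to `nextseqnum-1` is retransmitted, which is the behavior that gives the protocol its name.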
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
initial action: expectedseqnum=1; sndpkt = make_pkt(0,ACK,chksum)
single state: Wait
• event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum)
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum,ACK,chksum)
    udt_send(sndpkt)
    expectedseqnum++
• event: default
    udt_send(sndpkt)
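The receiver side is even smaller, since it only remembers expectedseqnum. The sketch below is illustrative: the class name `GBNReceiver`, the `corrupt` flag, and returning the ACK number instead of sending a packet are simplifying assumptions.

```python
class GBNReceiver:
    """Go-Back-N receiver: deliver in-order packets, re-ACK everything else."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0        # matches the initial sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []      # data handed to the upper layer, in order

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one arriving packet and return the ACK number sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = seq
            self.expectedseqnum += 1
        # Default case: out-of-order or corrupt packets are discarded and
        # the most recent in-order ACK is sent again.
        return self.last_ack


receiver = GBNReceiver()
acks = [
    receiver.rdt_rcv(1, "a"),
    receiver.rdt_rcv(3, "c"),    # gap: packet 2 is missing, so re-ACK 1
    receiver.rdt_rcv(2, "b"),
    receiver.rdt_rcv(3, "c"),    # retransmitted packet 3 is now in order
]
```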
GBN in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8; pkt2 is lost (X)]
sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #s
  - limits seq #s of sent, unACKed packets
[Figure: Selective repeat: sender, receiver windows]
Selective repeat
sender:
• data from above: if next available seq # in window, send pkt
• timeout(n): resend pkt n, restart timer
• ACK(n) in [sendbase, sendbase+N-1]:
  - mark pkt n as received
  - if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
• pkt n in [rcvbase, rcvbase+N-1]:
  - send ACK(n)
  - out-of-order: buffer
  - in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
• pkt n in [rcvbase-N, rcvbase-1]: ACK(n)
• otherwise: ignore
Selective repeat in action
[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8; pkt2 is lost (X)]
sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3
[Figure: sender window and receiver window (after receipt) over seq #s 0 1 2 3 0 1 2, in two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2; all arrive and are ACKed, so both windows advance; pkt3 is lost (X); sender then sends a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2; all arrive, but the three ACKs are lost (X X X); sender times out and retransmits the old pkt0; receiver will accept packet with seq number 0
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
Q: what relationship between seq # size and window size to avoid problem in (b)?
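One way to explore the question is to check whether the receiver's advanced window can contain a sequence number that the sender might still retransmit from its old window. The sketch below assumes the worst case from scenario (b): the receiver has advanced by a full window while the sender has not advanced at all. The function name `sr_window_is_safe` is illustrative. It reproduces the standard result that the window size must be at most half the sequence-number space.

```python
def sr_window_is_safe(seq_space, window):
    """True if an old retransmission can never be mistaken for new data."""
    # Sequence numbers the sender may still retransmit (its un-advanced window).
    old_window = {i % seq_space for i in range(window)}
    # Sequence numbers the receiver accepts as new after advancing a full window.
    new_window = {i % seq_space for i in range(window, 2 * window)}
    return old_window.isdisjoint(new_window)


slide_example = sr_window_is_safe(seq_space=4, window=3)   # the dilemma above
smaller_window = sr_window_is_safe(seq_space=4, window=2)
print(slide_example, smaller_window)
```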
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: segment layout, each row 32 bits wide]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)
field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the sender sequence number space is divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable; window size N]
Byte stream in TCP
[Figure: an HTTP GET message (K bytes) laid out as a byte stream; the window covers N bytes; a segment of M bytes starting at the 100th byte carries a TCP header with seq no = 100; bytes beyond the window cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario:
• Host A to Host B: Seq=42, ACK=79, data = 'C' (user types 'C')
• Host B to Host A: Seq=79, ACK=43, data = 'C' (host ACKs receipt of 'C', echoes back 'C')
• Host A to Host B: Seq=43, ACK=80 (host ACKs receipt of echoed 'C')
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
$EstimatedRTT = (1-\alpha)\cdot EstimatedRTT + \alpha\cdot SampleRTT$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: SampleRTT and EstimatedRTT for gaia.cs.umass.edu to fantasia.eurecom.fr; y-axis RTT (milliseconds), 100 to 350; x-axis time (seconds), 1 to 106]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $DevRTT = (1-\beta)\cdot DevRTT + \beta\cdot|SampleRTT - EstimatedRTT|$ (typically, $\beta = 0.25$)
  $TimeoutInterval = EstimatedRTT + 4\cdot DevRTT$
  (estimated RTT plus "safety margin")
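The three formulas combine into a small estimator. The sketch below is illustrative: the class name `RttEstimator`, starting DevRTT at zero, and updating DevRTT before EstimatedRTT (so the deviation is measured against the previous estimate) are assumptions, not something the slides specify.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout, using the formulas above."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0       # assumption: no deviation known before the second sample

    def update(self, sample_rtt):
        # Deviation is computed against the previous estimate (an assumed ordering).
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * abs(
            sample_rtt - self.estimated_rtt
        )
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


estimator = RttEstimator(first_sample=100.0)   # milliseconds
estimator.update(180.0)                        # one unusually slow sample
print(estimator.estimated_rtt, estimator.dev_rtt, estimator.timeout_interval)
```

A single slow sample moves the estimate only slightly but widens the safety margin considerably, which is the intended effect of the DevRTT term.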
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initial action: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
single state: wait for event
• event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
• event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
• event: ACK received, with ACK field value y
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
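The simplified sender can be sketched in Python and driven through the premature-timeout scenario (Seq=92 with 8 bytes, then Seq=100 with 20 bytes). This is an illustrative sketch: the class name `SimplifiedTcpSender`, the `wire` list that records transmissions, and the boolean timer are assumptions.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: cumulative ACKs, one timer, no duplicate-ACK handling."""

    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = []            # (seq, data) in send order
        self.timer_running = False   # stand-in for a real retransmission timer
        self.wire = []               # hypothetical record of segments passed to IP

    def send(self, data):
        segment = (self.next_seq_num, data)
        self.wire.append(segment)
        self.unacked.append(segment)
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        if self.unacked:
            self.wire.append(self.unacked[0])   # smallest not-yet-acked seq #
            self.timer_running = True

    def ack(self, y):
        if y > self.send_base:
            self.send_base = y
            # Keep only segments whose last byte is not yet covered by the ACK.
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


tcp = SimplifiedTcpSender(initial_seq=92)
tcp.send(b"A" * 8)       # Seq=92, 8 bytes of data
tcp.send(b"B" * 20)      # Seq=100, 20 bytes of data
tcp.timeout()            # premature timeout: resend Seq=92
tcp.ack(100)
base_after_first_ack = tcp.send_base
tcp.ack(120)
tcp.ack(120)             # ACK for the duplicate segment changes nothing
```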
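TCP: retransmission scenarios
[Figure: three Host A / Host B timelines]
• lost ACK scenario: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); A times out and resends Seq=92, 8 bytes of data; B replies ACK=100 again
• premature timeout: A sends Seq=92, 8 bytes of data and Seq=100, 20 bytes of data; B replies ACK=100 and ACK=120; A times out before the ACKs arrive and resends Seq=92, 8 bytes of data; SendBase moves from 92 to 100, then to 120; B answers the duplicate with ACK=120 and SendBase stays at 120
• cumulative ACK: A sends Seq=92, 8 bytes of data and Seq=100, 20 bytes of data; ACK=100 is lost (X) but ACK=120 arrives before the timeout, so A continues with Seq=120, 15 bytes of data, without retransmitting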
TCP ACK generation [RFC 5681]
• event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK
• event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
  TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments
• event at receiver: arrival of out-of-order segment, higher-than-expected seq #; gap detected
  TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte
• event at receiver: arrival of segment that partially or completely fills gap
  TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 four times (the original plus 3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack: segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data leaves to the application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server stacks (application, network), each holding: connection state: ESTAB; connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]
client: Socket clientSocket = newSocket("hostname","port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
[Figure: client state / server state timeline]
• start: client state CLOSED, server state LISTEN
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state ESTAB
• server: received ACK(y) indicates client is live; server state ESTAB
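TCP 3-way handshake: FSM
client side:
• closed -> SYN sent: Socket clientSocket = newSocket("hostname","port number"); send SYN(seq=x)
• SYN sent -> ESTAB: receive SYNACK(seq=y,ACKnum=x+1); send ACK(ACKnum=y+1)
server side:
• closed -> listen: Socket connectionSocket = welcomeSocket.accept(); action Λ
• listen -> SYN rcvd: receive SYN(x); send SYNACK(seq=y,ACKnum=x+1); create new socket for communication back to client
• SYN rcvd -> ESTAB: receive ACK(ACKnum=y+1); action Λ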
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: client state / server state timeline]
• both sides start in ESTAB
• client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
• server: replies ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
• server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
• client: replies ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
• server: enters CLOSED on receiving the ACK
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
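Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate $\lambda_{in}$ approaches capacity
[Figure: Host A and Host B send original data at rate $\lambda_{in}$ into unlimited shared output link buffers; throughput is $\lambda_{out}$. Plot 1: $\lambda_{out}$ vs. $\lambda_{in}$, levelling off at R/2. Plot 2: delay vs. $\lambda_{in}$, growing without bound as $\lambda_{in}$ approaches R/2]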
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$
[Figure: Host A offers $\lambda_{in}$ (original data) and $\lambda'_{in}$ (original data plus retransmitted data) to finite shared output link buffers; Host B receives $\lambda_{out}$]
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: $\lambda_{out}$ vs. $\lambda_{in}$, rising linearly up to R/2]
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
[Figure: $\lambda_{out}$ vs. $\lambda'_{in}$, both axes up to R/2]
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[Figure: $\lambda_{out}$ vs. $\lambda'_{in}$, both axes up to R/2]
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput goes to 0
[Figure: Hosts A, B, C, D connected over multihop paths through routers with finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: throughput]
another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted!
[Figure: $\lambda_{out}$ (throughput by blue traffic) vs. $\lambda'_{in}$ (offered load by Host A), both axes up to R/2; the drop at high load shows bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior, probing for bandwidth; y-axis cwnd: TCP sender congestion window size; x-axis time; additively increase window size … until loss occurs (then cut window in half)]
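The saw tooth can be reproduced with a short simulation. This is a minimal sketch in units of MSS; the function name `aimd_trace` and the idea of passing in the RTT rounds at which a loss is detected are assumptions made for illustration.

```python
def aimd_trace(rounds, loss_rounds, cwnd=1.0, mss=1.0):
    """Return cwnd at the start of each RTT under AIMD."""
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease after loss
        else:
            cwnd += mss                 # additive increase: 1 MSS per RTT
    return trace


saw_tooth = aimd_trace(rounds=6, loss_rounds={3}, cwnd=4.0)
print(saw_tooth)
```

Here the window climbs by one MSS per round, is halved after the loss detected in round 3, and then resumes its linear climb.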
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked <= cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $rate \approx \frac{cwnd}{RTT}$ bytes/sec
[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; the in-flight span is bounded by cwnd]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A / Host B timeline: one segment in the first RTT, two segments in the second, four segments in the third]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initial action: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
• new ACK: cwnd = cwnd+MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
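The three-state summary can be sketched as a Python class that tracks only cwnd, ssthresh, and the duplicate-ACK counter. This is an illustrative sketch: the class name `RenoSender`, measuring everything in units of MSS (so "ssthresh + 3" is read as three MSS), and passing the initial ssthresh as a parameter are assumptions; retransmission and transmission of segments are left out.

```python
class RenoSender:
    """Window bookkeeping for the slow start / congestion avoidance / fast recovery FSM."""

    def __init__(self, ssthresh, mss=1.0):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_ack_count = 0
        self.state = "slow start"

    def new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh                 # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                     # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)   # about 1 MSS per RTT
        self.dup_ack_count = 0

    def duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_ack_count = 0
        self.state = "slow start"


reno = RenoSender(ssthresh=4.0)
history = []
for _ in range(4):
    reno.new_ack()
history.append((reno.state, reno.cwnd))     # left slow start
for _ in range(3):
    reno.duplicate_ack()
history.append((reno.state, reno.cwnd))     # triple duplicate ACK
reno.new_ack()
history.append((reno.state, reno.cwnd))     # fast recovery ends
reno.timeout()
history.append((reno.state, reno.cwnd))     # back to slow start
```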
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4}\cdot\frac{W}{RTT}$ bytes/sec
[Figure: saw tooth of window size oscillating between W/2 and W, averaging 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22\cdot MSS}{RTT\sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2\cdot10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
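Both numbers on this slide follow from short calculations, sketched below. The function names are illustrative; the loss rate comes from solving the throughput formula above for L.

```python
def window_in_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the target throughput."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


w = window_in_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```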
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes up to R; the full bandwidth utilization line is the set of points (X1, Y1) where X1+Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2; the trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", converging towards the equal bandwidth share line]
Fairness (more)
Fairness and UDP:
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2
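The parallel-connection example is plain arithmetic, assuming the bottleneck is shared equally per connection. The helper name `share_of_link` is illustrative.

```python
from fractions import Fraction


def share_of_link(new_connections, existing_connections):
    """Fraction of link rate R obtained by an app, with equal per-connection shares."""
    return Fraction(new_connections, new_connections + existing_connections)


one_connection = share_of_link(1, 9)       # 1 of 10 connections
eleven_connections = share_of_link(11, 9)  # 11 of 20 connections
print(one_connection, eleven_connections)
```

With 11 of the 20 connections the new app gets 11/20 of R, slightly more than the R/2 quoted on the slide.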
rdt22 sender receiver fragments
36
Wait for call 0 from above
sndpkt = make_pkt(0 data checksum)udt_send(sndpkt)
rdt_send(data)
udt_send(sndpkt)
rdt_rcv(rcvpkt) ampamp ( corrupt(rcvpkt) ||isACK(rcvpkt1) )
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp isACK(rcvpkt0)
Wait for ACK 0
sender FSMfragment
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp has_seq1(rcvpkt)
extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(ACK1 chksum)udt_send(sndpkt)
Wait for 0 from below
rdt_rcv(rcvpkt) ampamp (corrupt(rcvpkt) ||has_seq1(rcvpkt))
udt_send(sndpkt)receiver FSMfragment
L
rdt30 channels with errors and loss
new assumptionunderlying channel can also lose packets (data ACKs)ndash checksum seq ACKs
retransmissions will be of help hellip but not enough
approach sender waits ldquoreasonablerdquo amount of time for ACK
bull retransmits if no ACK received in this time
bull if pkt (or ACK) just delayed (not lost)ndash retransmission will be
duplicate but seq rsquos already handles this
ndash receiver must specify seq of pkt being ACKed
bull requires countdown timer
37
rdt30 sender
38
sndpkt = make_pkt(0 data checksum)udt_send(sndpkt)start_timer
rdt_send(data)
Wait for ACK0
rdt_rcv(rcvpkt) ampamp ( corrupt(rcvpkt) ||isACK(rcvpkt1) )
Wait for call 1 from above
sndpkt = make_pkt(1 data checksum)udt_send(sndpkt)start_timer
rdt_send(data)
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp isACK(rcvpkt0)
rdt_rcv(rcvpkt) ampamp ( corrupt(rcvpkt) ||isACK(rcvpkt0) )
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt) ampamp isACK(rcvpkt1)
stop_timerstop_timer
udt_send(sndpkt)start_timer
timeout
udt_send(sndpkt)start_timer
timeout
rdt_rcv(rcvpkt)
Wait for call 0 from above
Wait for ACK1
Lrdt_rcv(rcvpkt)
LL
L
sender receiver
rcv pkt1
rcv pkt0
send ack0
send ack1
send ack0
rcv ack0
send pkt0
send pkt1
rcv ack1
send pkt0rcv pkt0
pkt0
pkt0
pkt1
ack1
ack0
ack0
(a) no loss
sender receiver
rcv pkt1
rcv pkt0
send ack0
send ack1
send ack0
rcv ack0
send pkt0
send pkt1
rcv ack1
send pkt0rcv pkt0
pkt0
pkt0
ack1
ack0
ack0
(b) packet loss
pkt1X
loss
pkt1timeout
resend pkt1
rdt30 in action
39
rdt30 in action
40
rcv pkt1send ack1
(detect duplicate)
pkt1
sender receiver
rcv pkt1
rcv pkt0
send ack0
send ack1
send ack0
rcv ack0
send pkt0
send pkt1
rcv ack1
send pkt0rcv pkt0
pkt0
pkt0
ack1
ack0
ack0
(c) ACK loss
ack1X
loss
pkt1timeout
resend pkt1
rcv pkt1send ack1
(detect duplicate)
pkt1
sender receiver
rcv pkt1
send ack0rcv ack0
send pkt1
send pkt0rcv pkt0
pkt0
ack0
(d) premature timeout delayed ACK
pkt1timeout
resend pkt1
ack1
ack1 rcv pkt0send ack0
send ack1
do nothingrcv ack1send pkt0rcv ack1 pkt0
rcv ack0
ack0
send pkt1pkt1
Performance of rdt30
bull rdt30 is correct but performance far from idealbull eg 1 Gbps link 15 ms prop delay 8000 bit packet
41
sect U sender utilization ndash fraction of time sender busy sending
U sender =
008 30008
= 000027 L R RTT + L R
=
sect if RTT=30 msec 1KB pkt every 30 msec 33kBsec throughput over 1 Gbps link
v network protocol limits use of physical resources
Dtrans = LR
8000 bits109 bitssec= = 8 microsecs
rdt30 stop-and-wait operation
42
first packet bit transmitted t = 0sender receiver
RTT
last packet bit transmitted t = L R
first packet bit arriveslast packet bit arrives send ACK
ACK arrives send next packet t = RTT + L R
U sender =
008 30008
= 000027 L R RTT + L R
=
Pipelined protocols
pipelining sender allows multiple ldquoin-flightrdquo yet-to-be-acknowledged pktsndash range of sequence numbers must be increasedndash buffering at sender andor receiver
43
bull two generic forms of pipelined protocols Go-Back-N Selective Repeat
Pipelining: increased utilization
[Timing diagram, sender and receiver]
  t = 0: first packet bit transmitted
  t = L/R: last bit transmitted
  first packet bit arrives; last packet bit arrives, send ACK
  last bit of 2nd packet arrives, send ACK
  last bit of 3rd packet arrives, send ACK
  t = RTT + L/R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
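The same formula generalizes to N packets in flight. The sketch below uses exact fractions to avoid rounding surprises; the helper names are assumptions made for illustration, and the model ignores ACK transmission time, as the slide does.

```python
from fractions import Fraction
import math

def pipelined_utilization(n_packets, packet_bits, link_bps, rtt_s):
    """Utilization with n_packets in flight: n*(L/R) / (RTT + L/R), capped at 1."""
    d_trans = Fraction(packet_bits, link_bps)
    return min(Fraction(1), n_packets * d_trans / (Fraction(rtt_s) + d_trans))

def packets_to_fill_pipe(packet_bits, link_bps, rtt_s):
    """Smallest window that keeps the sender busy 100% of the time."""
    d_trans = Fraction(packet_bits, link_bps)
    return math.ceil((Fraction(rtt_s) + d_trans) / d_trans)

RTT = Fraction(30, 1000)                       # 30 ms
u3 = pipelined_utilization(3, 8000, 10**9, RTT)
n_full = packets_to_fill_pipe(8000, 10**9, RTT)

print(float(u3))   # three packets in flight
print(n_full)      # window needed for full utilization
```

With the slide's parameters, three in-flight packets still leave the link almost idle; the window would have to be in the thousands of packets to fill the pipe.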
Pipelined protocols: overview

Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM

initial transition (Λ):
  base = 1
  nextseqnum = 1

state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
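The sender FSM can be sketched as a small Python class. This is an illustration under simplifying assumptions, not the slides' implementation: the channel is replaced by a log of transmitted sequence numbers, the timer is a boolean flag, and the class and method names are hypothetical.

```python
class GBNSender:
    """Go-Back-N sender: one timer, window of N, cumulative ACKs."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1                # oldest unacked seq #
        self.nextseqnum = 1          # next seq # to use
        self.sndpkt = {}             # seq # -> data, kept until ACKed
        self.sent = []               # log of seq #s handed to the channel
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False             # window full: refuse_data(data)
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self):
        # Restart the timer and resend every packet in [base, nextseqnum-1].
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)

    def rdt_rcv_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is done.
        # max() is an added guard so a stale ACK never moves base backwards.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum


s = GBNSender(window_size=4)
accepted = [s.rdt_send(f"d{i}") for i in range(5)]   # fifth call is refused
s.rdt_rcv_ack(2)                                     # ACKs pkts 1 and 2
s.timeout()                                          # resends pkts 3 and 4
```

After the cumulative ACK the window slides forward, so a later `rdt_send` call is accepted again.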
GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initial transition (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)
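A matching receiver sketch, under the same caveats: names are hypothetical, and the return value stands in for the ACK that would be sent back over the channel.

```python
class GBNReceiver:
    """Go-Back-N receiver: no buffering, only expectedseqnum is remembered."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0            # mirrors sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []          # data passed up to the application

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default action: (re)send the ACK for the highest in-order packet
        return self.last_ack


r = GBNReceiver()
# pkt3 is lost at first, so 4 and 5 arrive out of order and are discarded
acks = [r.rdt_rcv(seq, f"d{seq}") for seq in (1, 2, 4, 5, 3, 4, 5)]
```

The duplicate ACKs for packet 2 are exactly what the sender sees while the gap is open.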
GBN in action
sender window (N=4); pkt2 is lost in the channel
  sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows
Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n is smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
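The receiver rules above map onto a short Python sketch. It assumes unbounded sequence numbers (no wrap-around) to keep the logic visible, and the class and method names are illustrative only.

```python
class SRReceiver:
    """Selective-repeat receiver: buffers out-of-order pkts, ACKs each one."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}             # seq # -> data waiting for in-order delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the seq # to ACK, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer.setdefault(n, data)
            # Deliver the in-order run starting at rcvbase and slide the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                 # already delivered: re-ACK so the sender can advance
        return None                  # otherwise: ignore


rx = SRReceiver(window_size=4)
# pkt2 is delayed: 3, 4, 5 are buffered until the retransmitted pkt2 arrives
acks = [rx.on_packet(n, f"d{n}") for n in (0, 1, 3, 4, 5, 2)]
```

Re-ACKing packets just below the window matters: if an earlier ACK was lost, the sender's window cannot move until it hears that ACK again.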
Selective repeat in action
sender window (N=4); pkt2 is lost in the channel
  sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?
Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
  sender sends pkt0, pkt1, pkt2; receiver gets all three and its window (after receipt) moves to 3, 0, 1
  ACKs arrive; sender window (after receipt) advances; sender sends pkt3 (lost) and a new pkt0
  receiver will accept packet with seq number 0

(b) oops!
  sender sends pkt0, pkt1, pkt2; receiver gets all three and its window (after receipt) moves to 3, 0, 1
  all three ACKs are lost; sender times out and retransmits pkt0
  receiver will accept packet with seq number 0

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
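One way to explore the question is to compare the two windows directly. The sketch below is an illustration with hypothetical names: it models the worst case of scenario (b), where the sender may still retransmit its oldest full window while the receiver has already advanced by a full window.

```python
def sr_is_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for a new packet."""
    # Seq #s the sender may still retransmit (all ACKs were lost).
    retransmittable = {s % seq_space for s in range(0, window)}
    # Seq #s the receiver now accepts as NEW after delivering that window.
    acceptable_as_new = {s % seq_space for s in range(window, 2 * window)}
    return bool(retransmittable & acceptable_as_new)


print(sr_is_ambiguous(seq_space=4, window=3))   # the slide's example
print(sr_is_ambiguous(seq_space=4, window=2))
```

Under this model the overlap disappears exactly when the window is at most half the sequence number space, which is the usual textbook answer to the question.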
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
  source port #, dest port #
  sequence number
  acknowledgement number
  head len, not used, flag bits U A P R S F, receive window
  checksum, Urg data pointer
  options (variable length)
  application data (variable length)

annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer]
sender sequence number space (window size N):
  sent, ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable
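Byte stream in TCP
[Figure: an HTTP GET message (K bytes) placed on the TCP byte stream. A segment whose data starts at the 100th byte carries a TCP header with seq no = 100 and M bytes of data; the window is N bytes, and bytes beyond it cannot be transmitted now]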
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
  A: user types 'C'
  A -> B: Seq=42, ACK=79, data = 'C'
  B: host ACKs receipt of 'C', echoes back 'C'
  B -> A: Seq=79, ACK=43, data = 'C'
  A: host ACKs receipt of echoed 'C'
  A -> B: Seq=43, ACK=80
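The numbers in the telnet exchange follow from one rule: the ACK field is the sequence number of the next byte expected. A minimal sketch, with segments modelled as plain dictionaries (an assumption for illustration, not a real TCP API):

```python
def next_ack(seq, payload):
    """ACK number acknowledging `payload`, whose first byte has number `seq`."""
    return seq + len(payload)

a_seq, b_seq = 42, 79                      # starting values from the slide

# A -> B: user types 'C'
seg1 = {"seq": a_seq, "ack": b_seq, "data": b"C"}
# B -> A: ACKs the 'C' and echoes it back
seg2 = {"seq": b_seq, "ack": next_ack(seg1["seq"], seg1["data"]), "data": b"C"}
# A -> B: ACKs the echoed 'C', carries no data
seg3 = {"seq": seg2["ack"], "ack": next_ack(seg2["seq"], seg2["data"]), "data": b""}

for seg in (seg1, seg2, seg3):
    print(seg)
```

Each side's sequence number advances only by the bytes it sends itself, so the data-free third segment does not consume a sequence number.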
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout

$$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds, 100 to 350) versus time (seconds), measured from gaia.cs.umass.edu to fantasia.eurecom.fr]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$$

(typically, $\beta = 0.25$)

$$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$$

(the first term is the estimated RTT; $4 \cdot DevRTT$ is the "safety margin")
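The three formulas combine into a small estimator. This is a sketch with two labelled assumptions that the slides do not specify: the initial DevRTT is set to half the first sample, and DevRTT is updated before EstimatedRTT (both follow the convention of RFC 6298). The class name is hypothetical.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumption: RFC 6298 style start

    def update(self, sample_rtt):
        # Assumption: deviation is measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)       # milliseconds
est.update(200.0)                            # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A single outlier moves EstimatedRTT only slightly, but it widens the safety margin noticeably, which is the intended behavior.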
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)

initial transition (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data (SendBase=92)
  B -> A: ACK=100 (lost)
  A: timeout
  A -> B: Seq=92, 8 bytes of data
  B -> A: ACK=100 (SendBase=100)

premature timeout (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data (SendBase=92)
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100
  B -> A: ACK=120
  A: timeout fires before the ACKs arrive
  A -> B: Seq=92, 8 bytes of data
  A: ACKs arrive (SendBase=100, then SendBase=120)
  B -> A: ACK=120 (SendBase=120)

cumulative ACK (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data
  B -> A: ACK=100 (lost)
  B -> A: ACK=120 (arrives before the timeout)
  A -> B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
  A -> B: Seq=92, 8 bytes of data
  A -> B: Seq=100, 20 bytes of data (lost)
  B -> A: ACK=100
  B -> A: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
  A -> B: Seq=100, 20 bytes of data (resent before the timeout expires)
TCP flow control
[Figure: receiver protocol stack: application process (application), then TCP socket receiver buffers, TCP code and IP code (OS); segments arrive from sender]
• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
• flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process]
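The bookkeeping behind rwnd can be sketched as follows. The variable names LastByteRcvd and LastByteRead are not on the slide; they are borrowed from the common textbook presentation and should be read as assumptions, as should the function names.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free space the receiver advertises: RcvBuffer minus buffered data."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered

def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """Sender keeps in-flight data (after sending nbytes more) within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)
print(sender_may_send(last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd, nbytes=1096))
```

If the application stops reading, the buffered amount grows, rwnd shrinks toward zero, and the sender is throttled without any packet being dropped.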
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
  client: choose init seq num, x; send TCP SYN msg
  client -> server: SYNbit=1, Seq=x (client state: SYNSENT)
  server: choose init seq num, y; send TCP SYNACK msg, acking SYN
  server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1 (server state: SYN RCVD)
  client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
  client -> server: ACKbit=1, ACKnum=y+1 (client state: ESTAB)
  server: received ACK(y) indicates client is live (server state: ESTAB)
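TCP 3-way handshake: FSM
states: closed, listen, SYN rcvd, SYN sent, ESTAB

server side:
  closed -> listen: Socket connectionSocket = welcomeSocket.accept(); (Λ)
  listen -> SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB: on ACK(ACKnum=y+1) (Λ)

client side:
  closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
  SYN sent -> ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)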
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB
  client: clientSocket.close()
  client -> server: FINbit=1, seq=x (client state: FIN_WAIT_1; can no longer send but can receive data)
  server -> client: ACKbit=1, ACKnum=x+1 (server state: CLOSE_WAIT; can still send data)
  client state: FIN_WAIT_2 (wait for server close)
  server -> client: FINbit=1, seq=y (server state: LAST_ACK; can no longer send data)
  client -> server: ACKbit=1, ACKnum=y+1 (client state: TIMED_WAIT; timed wait for 2*max segment lifetime)
  server state: CLOSED (on receiving the ACK)
  client state: CLOSED (after the timed wait)
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B send original data at rate λ_in into unlimited shared output link buffers; throughput is λ_out. Plots: λ_out versus λ_in (both up to R/2) and delay versus λ_in (up to R/2)]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A (keeping a copy) and Host B send into finite shared output link buffers, shown with free buffer space and with no buffer space; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Plot: λ_out versus λ_in, both up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

[Figure: Host A (keeping a copy) and Host B, free buffer space at the router; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Plot: λ_out versus λ'_in, up to R/2]

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[Figure: Host A times out and resends its copy while buffer space is free. Plot: λ_out versus λ'_in, up to R/2]

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C and D with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
Causes/costs of congestion: scenario 3
[Figure: λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), both up to R/2, illustrating bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth: cwnd (TCP sender congestion window size) versus time; additively increase window size ... until loss occurs (then cut window in half)]
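The sawtooth can be reproduced with a toy model. The sketch assumes, purely for illustration, that a loss occurs whenever cwnd reaches a fixed threshold; real losses depend on the network, and the function name is hypothetical. cwnd is counted in MSS, one step per RTT.

```python
def aimd_trace(initial_cwnd, loss_threshold, rounds):
    """cwnd (in MSS) per RTT: +1 MSS each RTT, halved after a loss."""
    cwnd = initial_cwnd
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd >= loss_threshold:       # assumption: loss at a fixed window size
            cwnd = max(1, cwnd // 2)     # multiplicative decrease
        else:
            cwnd += 1                    # additive increase
    return trace


trace = aimd_trace(initial_cwnd=4, loss_threshold=8, rounds=12)
print(trace)
```

The window oscillates between half the loss point and the loss point, which is the sawtooth shown in the figure.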
TCP Congestion Control: details
• sender limits transmission:

$$LastByteSent - LastByteAcked \le cwnd$$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$rate \approx \frac{cwnd}{RTT} \text{ bytes/sec}$$

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is bounded by cwnd]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends Host B one segment, then two segments, then four segments in successive RTTs]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control

initial transition (Λ):
  cwnd = 1 MSS
  ssthresh = 64 KB
  dupACKcount = 0
  -> slow start

state: slow start
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: Λ -> congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery

state: congestion avoidance
  new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery

state: fast recovery
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
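The three-state machine can be exercised with a compact Python sketch. It is an illustration only: cwnd and ssthresh are counted in units of one MSS (so the initial ssthresh of 64 is an assumption standing in for the slide's 64 KB), retransmission itself is not modelled, and the class name is hypothetical.

```python
MSS = 1.0   # all window values below are in units of one MSS (assumption)

class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64.0         # assumption: 64 MSS instead of 64 KB
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd) # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"


tcp = RenoSender()
for _ in range(7):
    tcp.on_new_ack()             # slow start: cwnd grows from 1 to 8
for _ in range(3):
    tcp.on_dup_ack()             # third duplicate triggers fast recovery
print(tcp.state, tcp.cwnd, tcp.ssthresh)
```

Feeding the object further events (a new ACK, then a timeout) walks it through the remaining transitions of the summary.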
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT} \text{ bytes/sec}$$

[Figure: window size sawtooth between W/2 and W over time]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
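Both figures on the slide can be recomputed from the formula. A minimal sketch follows; the function names are illustrative, and MSS is converted to bits so that throughput comes out in bits per second.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_prob):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))

def loss_for_target(mss_bytes, rtt_s, target_bps):
    """Loss probability L that yields target_bps (formula solved for L)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

def window_segments(mss_bytes, rtt_s, target_bps):
    """In-flight segments needed to sustain target_bps over one RTT."""
    return target_bps * rtt_s / (mss_bytes * 8)


W = window_segments(1500, 0.100, 10e9)
L = loss_for_target(1500, 0.100, 10e9)
print(f"W = {W:.0f} segments, L = {L:.2e}")
```

A loss rate of that order means roughly one lost segment in several billion, which is why the slide points to newer TCP versions for high-speed paths.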
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, each axis from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]
Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
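The parallel-connection arithmetic is simple enough to state in code. The sketch assumes the idealized model from the fairness slides, in which every TCP connection on the bottleneck gets an equal share; the function name is hypothetical.

```python
from fractions import Fraction

def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))     # one connection among ten
print(app_share(9, 11))    # eleven connections among twenty
```

Per-connection fairness therefore says nothing about per-application fairness: the share grows with the number of connections an application opens.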
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the TCP ACK segment returns with ECE=1]
rdt3.0: channels with errors and loss
new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
rdt3.0 sender

state: Wait for call 0 from above
  rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK0
  rdt_rcv(rcvpkt):
    Λ

state: Wait for ACK0
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ):
    Λ
  timeout:
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0):
    stop_timer
    -> Wait for call 1 from above

state: Wait for call 1 from above
  rdt_send(data):
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK1
  rdt_rcv(rcvpkt):
    Λ

state: Wait for ACK1
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) ):
    Λ
  timeout:
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1):
    stop_timer
    -> Wait for call 0 from above
rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
sender receiver
rcv pkt1
rcv pkt0
send ack0
send ack1
send ack0
rcv ack0
send pkt0
send pkt1
rcv ack1
send pkt0rcv pkt0
pkt0
pkt0
ack1
ack0
ack0
(b) packet loss
pkt1X
loss
pkt1timeout
resend pkt1
rdt30 in action
39
rdt30 in action
40
rcv pkt1send ack1
(detect duplicate)
pkt1
sender receiver
rcv pkt1
rcv pkt0
send ack0
send ack1
send ack0
rcv ack0
send pkt0
send pkt1
rcv ack1
send pkt0rcv pkt0
pkt0
pkt0
ack1
ack0
ack0
(c) ACK loss
ack1X
loss
pkt1timeout
resend pkt1
rcv pkt1send ack1
(detect duplicate)
pkt1
sender receiver
rcv pkt1
send ack0rcv ack0
send pkt1
send pkt0rcv pkt0
pkt0
ack0
(d) premature timeout delayed ACK
pkt1timeout
resend pkt1
ack1
ack1 rcv pkt0send ack0
send ack1
do nothingrcv ack1send pkt0rcv ack1 pkt0
rcv ack0
ack0
send pkt1pkt1
Performance of rdt30
bull rdt30 is correct but performance far from idealbull eg 1 Gbps link 15 ms prop delay 8000 bit packet
41
sect U sender utilization ndash fraction of time sender busy sending
U sender =
008 30008
= 000027 L R RTT + L R
=
sect if RTT=30 msec 1KB pkt every 30 msec 33kBsec throughput over 1 Gbps link
v network protocol limits use of physical resources
Dtrans = LR
8000 bits109 bitssec= = 8 microsecs
rdt30 stop-and-wait operation
42
first packet bit transmitted t = 0sender receiver
RTT
last packet bit transmitted t = L R
first packet bit arriveslast packet bit arrives send ACK
ACK arrives send next packet t = RTT + L R
U sender =
008 30008
= 000027 L R RTT + L R
=
Pipelined protocols
pipelining sender allows multiple ldquoin-flightrdquo yet-to-be-acknowledged pktsndash range of sequence numbers must be increasedndash buffering at sender andor receiver
43
bull two generic forms of pipelined protocols Go-Back-N Selective Repeat
Pipelining increased utilization
44
first packet bit transmitted t = 0sender receiver
RTT
last bit transmitted t = L R
first packet bit arriveslast packet bit arrives send ACK
ACK arrives send next packet t = RTT + L R
last bit of 2nd packet arrives send ACKlast bit of 3rd packet arrives send ACK
3-packet pipelining increasesutilization by a factor of 3
U sender =
0024 30008
= 000081 3L R RTT + L R
=
Pipelined protocols overview
Go-back-Nbull sender can have up to
N unacked packets in pipeline
bull receiver only sends cumulative ackndash Doesnrsquot ack packet if
therersquos a gapbull sender has timer for
oldest unacked packetndash when timer expires
retransmit all unackedpackets
Selective Repeatbull sender can have up to
N unacked packets in pipeline
bull rcvr sends individual ackfor each packet
bull sender maintains timer for each unacked packetndash when timer expires
retransmit only that unacked packet
45
Go-Back-N sender
bull k-bit seq in pkt headerbull ldquowindowrdquo of up to N consecutive unacked pkts allowed
46
v ACK(n) ACKs all pkts up to including seq n - ldquocumulative ACKrdquosect may receive duplicate ACKs (see receiver)
v timer for oldest in-flight pktv timeout(n) retransmit packet n and all higher seq pkts in
window
GBN sender extended FSM
47
Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])
timeout
rdt_send(data)
if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)
start_timernextseqnum++
else
refuse_data(data)
base = getacknum(rcvpkt)+1If (base == nextseqnum)
stop_timerelse
start_timer
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)
base=1nextseqnum=1
rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)
L
GBN sender extended FSM
48
Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])
timeout
rdt_send(data)
if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)
start_timernextseqnum++
else
refuse_data(data)
base = getacknum(rcvpkt)+1If (base == nextseqnum)
stop_timerelse
start_timer
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)
base=1nextseqnum=1
rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)
L
GBN receiver extended FSM
ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum
bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq
49
Wait
udt_send(sndpkt)default
rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)
extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++
expectedseqnum=1sndpkt = make_pkt(0ACKchksum)
L
GBN in action (sender window N=4; pkt2 is lost)

sender:   send pkt0
sender:   send pkt1
sender:   send pkt2 (lost, X)
sender:   send pkt3
sender:   (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, discard, (re)send ack1
sender:   rcv ack0, send pkt4
sender:   rcv ack1, send pkt5
sender:   ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1
receiver: receive pkt5, discard, (re)send ack1
sender:   pkt2 timeout
sender:   send pkt2
sender:   send pkt3
sender:   send pkt4
sender:   send pkt5
receiver: rcv pkt2, deliver, send ack2
receiver: rcv pkt3, deliver, send ack3
receiver: rcv pkt4, deliver, send ack4
receiver: rcv pkt5, deliver, send ack5

[Figure: sender window (N=4) sliding over sequence numbers 0 to 8]
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #s
  – limits seq #s of sent, unACKed packets
[Figure: Selective repeat: sender and receiver windows]
Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n is smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
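The receiver rules above can be sketched in Python. This is an illustrative sketch under simplifying assumptions: sequence numbers are unbounded integers (no wrap-around), and the names `SrReceiver`, `receive` and `delivered` are invented for the example.

```python
class SrReceiver:
    """Selective-repeat receiver with an N-packet window (illustrative sketch)."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets waiting for a gap to fill
        self.delivered = []    # data handed to the upper layer, in order

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase and advance the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore


rx = SrReceiver(window_size=4)
acks = [rx.receive(n, f"pkt{n}") for n in (0, 1, 3, 4, 5)]  # pkt2 is missing
held_back = list(rx.delivered)                               # only pkt0, pkt1 so far
acks.append(rx.receive(2, "pkt2"))                           # gap filled
```

Once pkt2 arrives, the buffered pkt3, pkt4 and pkt5 are delivered with it and the window base moves to 6.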
Selective repeat in action (sender window N=4; pkt2 is lost)

sender:   send pkt0
sender:   send pkt1
sender:   send pkt2 (lost, X)
sender:   send pkt3
sender:   (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender:   rcv ack0, send pkt4
sender:   rcv ack1, send pkt5
sender:   record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender:   pkt2 timeout
sender:   send pkt2
sender:   record ack4 arrived
sender:   record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

[Figure: sender window (N=4) sliding over sequence numbers 0 to 8]
Selective repeat: dilemma

example:
• seq #s: 0, 1, 2, 3
• window size = 3

(a) no problem
sender:   send pkt0, pkt1, pkt2
receiver: receive pkt0, pkt1, pkt2 and ACK each; receiver window (after receipt) moves to 3, 0, 1
sender:   all ACKs arrive; sender window (after receipt) moves to 3, 0, 1
sender:   send pkt3 (lost, X), then send new pkt0
receiver: will accept packet with seq number 0

(b) oops!
sender:   send pkt0, pkt1, pkt2
receiver: receive pkt0, pkt1, pkt2 and ACK each; receiver window (after receipt) moves to 3, 0, 1
          all three ACKs are lost (X X X); sender window stays at 0, 1, 2
sender:   timeout, retransmit pkt0
receiver: will accept packet with seq number 0

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
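One way to explore the question is to check whether the receiver's next window can overlap the window the sender may still be retransmitting. The helper below is a hypothetical sketch (the name `sr_is_ambiguous` is not from the slides); it models exactly the situation in scenario (b), where a full window has been received but none of its ACKs reached the sender.

```python
def sr_is_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for a new packet.

    Assumes the receiver has just accepted packets 0..window-1, so its new
    window covers the next `window` sequence numbers (modulo seq_space),
    while the sender, having seen no ACKs, may still resend 0..window-1.
    """
    old_window = {i % seq_space for i in range(0, window)}
    new_window = {i % seq_space for i in range(window, 2 * window)}
    return bool(old_window & new_window)


slide_example = sr_is_ambiguous(seq_space=4, window=3)   # the example on the slide
smaller_window = sr_is_ambiguous(seq_space=4, window=2)
```

In this model the overlap disappears once the window size is at most half the sequence-number space, which is the usual answer given for selective repeat.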
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure (each row is 32 bits wide)

  source port #  |  dest port #
  sequence number
  acknowledgement number
  head len  |  not used  |  U A P R S F  |  receive window
  checksum  |  Urg data pointer
  options (variable length)
  application data (variable length)

• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer). Sender sequence number space: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable; window size N]
Byte stream in TCP

[Figure: an HTTP GET message (K bytes) placed in the byte stream starting at the 100th byte, sent with a TCP header (seq no = 100); labels: window N bytes, M bytes, "cannot be transmitted now"]
TCP seq. numbers, ACKs: simple telnet scenario

Host A → Host B: Seq=42, ACK=79, data = 'C'   (user types 'C')
Host B → Host A: Seq=79, ACK=43, data = 'C'   (host ACKs receipt of 'C', echoes back 'C')
Host A → Host B: Seq=43, ACK=80               (host ACKs receipt of echoed 'C')
TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
TCP round trip time, timeout

EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) versus time in seconds]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 * DevRTT
(estimated RTT plus "safety margin")
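The two moving averages and the timeout formula can be combined into a small estimator. This is a sketch under stated assumptions: the class name `RttEstimator` is invented, the initial values (EstimatedRTT = first sample, DevRTT = half the first sample) follow the convention in RFC 6298 rather than anything on the slide, and DevRTT is updated before EstimatedRTT so that the deviation uses the previous estimate.

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimation (illustrative sketch)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2     # assumed initial value

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
timeout_ms = est.update(120.0)
```

A single 120 ms sample after a 100 ms start moves EstimatedRTT only to 102.5 ms, showing how the small α damps individual measurements.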
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control
TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments
TCP sender (simplified)

state: wait for event

initial state: NextSeqNum = InitialSeqNum, SendBase = InitialSeqNum

event: data received from application above
action:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
action:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
action:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
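The three events of the simplified sender can be sketched in Python. The names `SimpleTcpSender`, `unacked` and `sent` are assumptions for the example, the timer is reduced to a boolean, and duplicate ACKs, flow control and congestion control are ignored as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one retransmission timer."""

    def __init__(self, initial_seq_num=0):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}            # seq # of first byte -> payload
        self.timer_running = False
        self.sent = []               # log of (seq, payload) passed to IP

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.sent.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


tcp = SimpleTcpSender(initial_seq_num=92)
tcp.on_app_data(b"A" * 8)    # Seq=92, 8 bytes
tcp.on_app_data(b"B" * 20)   # Seq=100, 20 bytes
tcp.on_ack(120)              # ACK=100 was lost, cumulative ACK=120 arrives
```

Because ACKs are cumulative, the single ACK=120 covers both segments and the timer is stopped without any retransmission.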
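TCP: retransmission scenarios

lost ACK scenario
Host A → Host B: Seq=92, 8 bytes of data
Host B → Host A: ACK=100 (lost, X)
Host A: timeout
Host A → Host B: Seq=92, 8 bytes of data (retransmission)
Host B → Host A: ACK=100

premature timeout
Host A: SendBase=92
Host A → Host B: Seq=92, 8 bytes of data
Host A → Host B: Seq=100, 20 bytes of data
Host B → Host A: ACK=100
Host B → Host A: ACK=120
Host A: timeout before the ACKs arrive
Host A → Host B: Seq=92, 8 bytes of data (retransmission)
Host A: ACK=100 arrives, SendBase=100
Host A: ACK=120 arrives, SendBase=120
Host B → Host A: ACK=120
Host A: SendBase=120

cumulative ACK
Host A → Host B: Seq=92, 8 bytes of data
Host A → Host B: Seq=100, 20 bytes of data
Host B → Host A: ACK=100 (lost, X)
Host B → Host A: ACK=120 (arrives before timeout)
Host A → Host B: Seq=120, 15 bytes of data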
TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   → immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   → immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK
Host A → Host B: Seq=92, 8 bytes of data
Host A → Host B: Seq=100, 20 bytes of data (lost, X)
Host B → Host A: ACK=100
Host B → Host A: ACK=100
Host B → Host A: ACK=100
Host B → Host A: ACK=100 (3 DUP ACKs)
Host A → Host B: Seq=100, 20 bytes of data (retransmitted before the timeout expires)
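Counting duplicate ACKs takes only a few lines. The function below is a hypothetical sketch (its name and the list-of-ACK-numbers input are assumptions); it reports the ACK values at which a fast retransmit would be triggered.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ACK numbers for which `threshold` duplicates were seen."""
    fired = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:   # third duplicate: resend now
                fired.append(ack)
        else:
            last_ack = ack               # new ACK resets the counter
            dup_count = 0
    return fired


trace = fast_retransmit_points([100, 100, 100, 100])   # original ACK + 3 duplicates
```

For the trace in the figure the result is `[100]`: the sender resends the segment starting at byte 100 without waiting for the timer.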
TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack: application process (application); TCP socket receiver buffers, TCP code, IP code (OS); segments arriving from sender]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to application process]
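The bookkeeping behind rwnd can be sketched as two small functions. The variable names `last_byte_rcvd` and `last_byte_read` do not appear on the slide; they follow the usual textbook presentation and are assumptions here, as are the function names.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free space in the receive buffer, advertised to the sender as rwnd."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the app
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
ok_small = sender_may_send(5000, 4000, rwnd, nbytes=1000)
ok_large = sender_may_send(5000, 4000, rwnd, nbytes=1200)
```

With 2000 bytes waiting in a 4096-byte buffer, the receiver advertises 2096 bytes, so a sender with 1000 bytes already in flight may add 1000 more but not 1200.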
Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

state held at each side (application above, network below):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
TCP 3-way handshake

initial states: client CLOSED, server LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state → SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state → SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state → ESTAB
4. server: received ACK(y) indicates client is live; server state → ESTAB
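The sequence and acknowledgement numbers exchanged in the handshake can be traced with a short sketch. The `Segment` tuple and the function `three_way_handshake` are hypothetical names for illustration, and only the fields shown on the slide are modelled.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    sender: str
    syn: bool
    ack: bool
    seq: Optional[int]      # None where the slide does not show a Seq value
    acknum: Optional[int]


def three_way_handshake(x, y):
    """Return the three segments for client ISN x and server ISN y."""
    syn = Segment("client", syn=True, ack=False, seq=x, acknum=None)
    # the SYN consumes one sequence number, so the server ACKs x+1
    synack = Segment("server", syn=True, ack=True, seq=y, acknum=syn.seq + 1)
    # likewise the client ACKs y+1
    final_ack = Segment("client", syn=False, ack=True, seq=None,
                        acknum=synack.seq + 1)
    return [syn, synack, final_ack]


segments = three_way_handshake(x=1000, y=5000)
```

Each side acknowledges the other's initial sequence number plus one, which is how both learn that their SYN was received.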
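TCP 3-way handshake: FSM

client side
closed → SYN sent
  event: Socket clientSocket = new Socket("hostname", "port number");
  action: SYN(seq=x)
SYN sent → ESTAB
  event: SYNACK(seq=y, ACKnum=x+1)
  action: ACK(ACKnum=y+1)

server side
closed → listen
  event: Socket connectionSocket = welcomeSocket.accept();
  action: Λ
listen → SYN rcvd
  event: SYN(x)
  action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYN rcvd → ESTAB
  event: ACK(ACKnum=y+1)
  action: Λ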
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

initial states: client ESTAB, server ESTAB

1. client: clientSocket.close(); send FINbit=1, seq=x; client state → FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state → CLOSE_WAIT (can still send data); client state → FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y; server state → LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1; client state → TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
5. server: on receiving the ACK, server state → CLOSED
The "Two Army Problem"
Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A and Host B send original data at rate λin through one router with unlimited shared output link buffers; throughput λout. Plots: λout versus λin (both axes up to R/2) and delay versus λin (up to R/2)]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A and Host B, finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: sender keeps a copy of each packet and sends when there is free buffer space, holding back when there is no buffer space; plot of λout versus λin, both up to R/2]
Causes/costs of congestion: scenario 2 (continued)

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

[Figure: Host A keeps a copy of each packet and retransmits on timeout through a router with limited free buffer space to Host B; plots of λout versus λ'in, both up to R/2]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout. Plot: throughput by blue traffic (λout, up to R/2) versus offered load by Host A (λ'in, up to R/2), annotated "bandwidth wastage for packets dropped at the 2nd router"]
Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior, probing for bandwidth: cwnd (TCP sender congestion window size) versus time; additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
    LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"), spanning cwnd; last byte sent]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control

initial state: Λ; cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0 → slow start

state: slow start
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: Λ → congestion avoidance
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment → fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

state: congestion avoidance
  new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment → fast recovery
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

state: fast recovery
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
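The three-state machine can be sketched as a small Python class. This is an illustration only: the class name `RenoSender`, the state strings and the MSS value of 1460 bytes are assumptions, and actual segment transmission and retransmission are left out so that only cwnd, ssthresh and the state transitions are modelled.

```python
MSS = 1460  # bytes; assumed value for the example


class RenoSender:
    """cwnd/ssthresh bookkeeping for the slow start / CA / fast recovery FSM."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # doubles cwnd every RTT
            self.dup_acks = 0
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd    # about 1 MSS per RTT
            self.dup_acks = 0
        else:                                     # fast recovery
            self.cwnd = self.ssthresh
            self.dup_acks = 0
            self.state = "congestion_avoidance"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"


reno = RenoSender()
for _ in range(3):
    reno.on_new_ack()          # cwnd grows from 1 MSS to 4 MSS
for _ in range(3):
    reno.on_dup_ack()          # third duplicate enters fast recovery
cwnd_in_recovery = reno.cwnd
reno.on_new_ack()              # recovery ends, cwnd deflates to ssthresh
```

After the triple duplicate ACK at cwnd = 4 MSS, ssthresh becomes 2 MSS and cwnd is inflated to 5 MSS; the next new ACK sets cwnd back to 2 MSS in congestion avoidance.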
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec}$$

[Figure: window size oscillating between W/2 and W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$$

  → to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
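The two numbers in this example can be checked with a few lines of Python. The function names are invented for the sketch; the second function simply solves the throughput formula above for L.

```python
def required_window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with rtt_s."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS in bits)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


window = required_window_segments(rate_bps=10e9, rtt_s=0.1, mss_bytes=1500)
loss = required_loss_rate(rate_bps=10e9, rtt_s=0.1, mss_bytes=1500)
```

The window comes out at roughly 83,333 segments and the tolerable loss probability at roughly 2·10^-10, matching the figures quoted on the slide.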
TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, each axis from 0 to R; equal bandwidth share line; full bandwidth utilization line; the trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"; (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]
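The convergence argument can be illustrated with a toy simulation. It is a sketch under strong assumptions that are not on the slide: both sessions see every loss at the same time, have the same RTT, and increase by the same fixed step each round; the function name `aimd_two_flows` is invented.

```python
def aimd_two_flows(x, y, capacity, rounds=500, step=1.0):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2          # loss: both halve (multiplicative decrease)
        else:
            x, y = x + step, y + step    # no loss: both add one step
    return x, y


x_final, y_final = aimd_two_flows(x=80.0, y=5.0, capacity=100.0)
```

Additive increase leaves the gap between the two rates unchanged, while each halving cuts the gap in half, so the rates drift toward the equal-share line even from a very unequal start.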
Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the TCP ACK segment returns with ECE=1]
rdt3.0 sender

state: Wait for call 0 from above
  event: rdt_send(data)
  action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK0
  event: rdt_rcv(rcvpkt)
  action: Λ

state: Wait for ACK0
  event: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )
  action: Λ
  event: timeout
  action: udt_send(sndpkt); start_timer
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
  action: stop_timer → Wait for call 1 from above

state: Wait for call 1 from above
  event: rdt_send(data)
  action: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK1
  event: rdt_rcv(rcvpkt)
  action: Λ

state: Wait for ACK1
  event: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) )
  action: Λ
  event: timeout
  action: udt_send(sndpkt); start_timer
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
  action: stop_timer → Wait for call 0 from above
rdt3.0 in action

(a) no loss
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1
receiver: rcv pkt1, send ack1
sender:   rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(b) packet loss
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1 (lost, X)
sender:   timeout, resend pkt1
receiver: rcv pkt1, send ack1
sender:   rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(c) ACK loss
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (lost, X)
sender:   timeout, resend pkt1
receiver: rcv pkt1 (detect duplicate), send ack1
sender:   rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
sender:   send pkt0
receiver: rcv pkt0, send ack0
sender:   rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (delayed)
sender:   timeout, resend pkt1
sender:   rcv ack1, send pkt0
receiver: rcv pkt1 (detect duplicate), send ack1
receiver: rcv pkt0, send ack0
sender:   rcv ack1, do nothing
sender:   rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$$

• U_sender: utilization, fraction of time sender busy sending

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
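The utilization figure is easy to reproduce. The function below is a small sketch with invented names; `window_packets` generalizes the formula to a sender that keeps several packets in flight, which is an assumption added for the example.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window_packets=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / link_bps              # L/R, transmission delay
    return window_packets * d_trans / (rtt_s + d_trans)


# 1 Gbps link, 15 ms one-way propagation delay (RTT = 30 ms), 8000-bit packets
u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_three_in_flight = sender_utilization(8000, 1e9, 0.030, window_packets=3)
```

Stop-and-wait keeps the sender busy for about 0.027% of the time; allowing three packets in flight triples that to about 0.08%.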
rdt3.0: stop-and-wait operation

sender:   first packet bit transmitted, t = 0
sender:   last packet bit transmitted, t = L/R
receiver: first packet bit arrives
receiver: last packet bit arrives, send ACK
sender:   ACK arrives, send next packet, t = RTT + L/R

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$
Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization

sender:   first packet bit transmitted, t = 0
sender:   last bit transmitted, t = L/R
receiver: first packet bit arrives
receiver: last packet bit arrives, send ACK
receiver: last bit of 2nd packet arrives, send ACK
receiver: last bit of 3rd packet arrives, send ACK
sender:   ACK arrives, send next packet, t = RTT + L/R

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN sender extended FSM

state: Wait

initial state: base=1, nextseqnum=1

event: rdt_send(data)
action:
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

event: timeout
action:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
action:
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
action: Λ (no action)
GBN sender extended FSM
48
Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])
timeout
rdt_send(data)
if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)
start_timernextseqnum++
else
refuse_data(data)
base = getacknum(rcvpkt)+1If (base == nextseqnum)
stop_timerelse
start_timer
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)
base=1nextseqnum=1
rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)
L
GBN receiver extended FSM
ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum
bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq
49
Wait
udt_send(sndpkt)default
rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)
extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++
expectedseqnum=1sndpkt = make_pkt(0ACKchksum)
L
GBN receiver extended FSM
ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum
bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq
50
Wait
udt_send(sndpkt)default
rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)
extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++
expectedseqnum=1sndpkt = make_pkt(0ACKchksum)
L
GBN in action
51
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 discard (re)send ack1rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5
Xloss
receive pkt4 discard (re)send ack1
receive pkt5 discard (re)send ack1
rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5
ignore duplicate ACK
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
GBN in action
52
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 discard (re)send ack1rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5
Xloss
receive pkt4 discard (re)send ack1
receive pkt5 discard (re)send ack1
rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5
ignore duplicate ACK
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
Selective repeat
bull receiver individually acknowledges all correctly received packetsndash buffers packets as needed for eventual in-order delivery to
upper layer
bull sender only resends packets for which ACK not receivedndash sender timer for each unACKed packet
bull sender windowndash N consecutive seq rsquosndash limits seq s of sent unACKed packets
53
Selective repeat sender receiver windows
54
Selective repeat
data from abovebull if next available seq in
window send pkt
timeout(n)bull resend pkt n restart timer
ACK(n) in [sendbase sendbase+N-1]
bull mark pkt n as receivedbull if n smallest unACKed pkt
advance window base to next unACKed seq
55
senderpkt n in [rcvbase rcvbase+N-1]
v send ACK(n)v out-of-order bufferv in-order deliver (also
deliver buffered in-order pkts) advance window to next not-yet-received pkt
pkt n in [rcvbase-N rcvbase-1]
v ACK(n)otherwisev ignore
receiver
Selective repeat in action
56
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 buffer send ack3rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2
Xloss
receive pkt4 buffer send ack4
receive pkt5 buffer send ack5
rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2
record ack3 arrived
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
record ack4 arrivedrecord ack5 arrived
Q what happens when ack2 arrives
Selective repeat in action
57
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 buffer send ack3rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2
Xloss
receive pkt4 buffer send ack4
receive pkt5 buffer send ack5
rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2
record ack3 arrived
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
record ack4 arrivedrecord ack5 arrived
Q what happens when ack2 arrives
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3
(a) no problem
  sender: send pkt0, pkt1, pkt2; all three ACKs arrive, sender window advances
  receiver window (after receipt) advances; receiver will accept packet with seq number 0
  sender: send pkt3 (X lost), then send new pkt0
  receiver: accepts pkt0 as new data, which is correct
(b) oops!
  sender: send pkt0, pkt1, pkt2; all three ACKs are lost (X X X)
  receiver window (after receipt) advances; receiver will accept packet with seq number 0
  sender: timeout, retransmit pkt0
  receiver: accepts the retransmitted pkt0 as new data
receiver can't see sender side: receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
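The standard answer to the question above is that the window size must be at most half the size of the sequence-number space. A small Python check makes the overlap visible. It is a sketch under one assumption: the worst case is the one in (b), where the receiver has advanced by a full window while the sender has not moved. The function name is hypothetical.

```python
def ambiguous_seq_numbers(seq_space_size, window_size):
    """Seq #s that could be either a retransmitted old packet or a new packet.

    Worst case: every ACK was lost, so the sender window is still
    [0, N-1] while the receiver window is already [N, 2N-1], both taken
    modulo the size of the sequence-number space.
    """
    old = {n % seq_space_size for n in range(0, window_size)}
    new = {n % seq_space_size for n in range(window_size, 2 * window_size)}
    return sorted(old & new)


slide_example = ambiguous_seq_numbers(seq_space_size=4, window_size=3)
safe_example = ambiguous_seq_numbers(seq_space_size=4, window_size=2)
```

With 4 sequence numbers and a window of 3, seq #s 0 and 1 are ambiguous; shrinking the window to 2 removes the overlap.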
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
header layout (each row is 32 bits):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)
field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
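To make the layout concrete, the fixed 20-byte part of the header can be unpacked with Python's struct module. This is a minimal sketch: the function name and the dictionary keys are hypothetical, options are skipped rather than parsed, and the checksum is not verified.

```python
import struct


def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the fixed 20-byte TCP header; options are skipped, not parsed."""
    (src_port, dst_port, seq, ack,
     offset_flags, rwnd, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (offset_flags >> 12) * 4   # head len field counts 32-bit words
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "URG": bool(offset_flags & 0x20),
        "ACK": bool(offset_flags & 0x10),
        "PSH": bool(offset_flags & 0x08),
        "RST": bool(offset_flags & 0x04),
        "SYN": bool(offset_flags & 0x02),
        "FIN": bool(offset_flags & 0x01),
        "rwnd": rwnd,
        "checksum": checksum,
        "urg_ptr": urg_ptr,
        "payload": segment[header_len:],
    }


# Hand-built example segment: no options (head len = 5 words), ACK and PSH set.
example = struct.pack("!HHIIHHHH", 12345, 80, 42, 79, (5 << 12) | 0x18, 4096, 0, 0) + b"C"
fields = parse_tcp_header(example)
```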
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how does receiver handle out-of-order segments?
  - A: TCP spec doesn't say; up to implementor
[figure: sender sequence number space with window size N, divided into: sent, ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable. The outgoing segment from sender carries the sequence number; the incoming segment to sender carries the acknowledgement number A and rwnd.]
Byte stream in TCP
[figure: an HTTP GET message (K bytes) placed in the byte stream; a window of N bytes; a segment of M bytes whose TCP header has seq no = 100 starts at the 100th byte; bytes beyond the window cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario
  Host A → Host B: Seq=42, ACK=79, data = 'C' (user types 'C')
  Host B → Host A: Seq=79, ACK=43, data = 'C' (host ACKs receipt of 'C', echoes back 'C')
  Host A → Host B: Seq=43, ACK=80 (host ACKs receipt of echoed 'C')
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT plotted as RTT (milliseconds, 100 to 350) against time (seconds)]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
(EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
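The three update rules can be combined into a small estimator. The sketch below is illustrative only. Two choices are assumptions not stated on the slides: DevRTT starts at half the first sample, and DevRTT is updated using the EstimatedRTT value from before the current sample (the order RFC 6298 uses). The class name is hypothetical.

```python
class RttEstimator:
    """EWMA-based timeout estimator (sketch; initial values are assumptions)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2    # assumption: not given on the slides

    def update(self, sample_rtt):
        # Assumption: the deviation uses the estimate from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


estimator = RttEstimator(first_sample=100.0)   # milliseconds
estimator.update(200.0)                        # one unusually slow sample
```

After the single 200 ms sample the estimate moves only to 112.5 ms, while the safety margin grows, so the timeout reacts to the variation more than the average does.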
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
state: wait for event
initially (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
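The three events can be expressed as a small Python class. This is a sketch, not a real TCP: the class and attribute names are hypothetical, the timer is reduced to a boolean flag, sequence-number wrap-around is ignored, and "sending" just appends to a log.

```python
class SimpleTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, flow or congestion control."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # of first byte -> segment data
        self.timer_running = False
        self.sent = []               # log of (seq, data) passed to IP

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.sent.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1 is the last cumulatively ACKed byte
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in fully_acked:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


# Two segments sent back to back, the timer fires early, then the ACKs arrive.
sender = SimpleTcpSender(initial_seq=92)
sender.on_app_data(b"a" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"b" * 20)    # Seq=100, 20 bytes
sender.on_timeout()              # resends only the segment starting at 92
sender.on_ack(100)
sender.on_ack(120)
sender.on_ack(120)               # duplicate: y is not above SendBase, so no change
```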
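TCP: retransmission scenarios
lost ACK scenario
  Host A → Host B: Seq=92, 8 bytes of data
  Host B → Host A: ACK=100 (X, lost)
  Host A: timeout
  Host A → Host B: Seq=92, 8 bytes of data
  Host B → Host A: ACK=100
premature timeout (SendBase=92 at start)
  Host A → Host B: Seq=92, 8 bytes of data
  Host A → Host B: Seq=100, 20 bytes of data
  Host B → Host A: ACK=100, then ACK=120 (neither has arrived when the timer expires)
  Host A: timeout
  Host A → Host B: Seq=92, 8 bytes of data
  Host A: ACK=100 arrives, SendBase=100
  Host A: ACK=120 arrives, SendBase=120
  Host B → Host A: ACK=120 (for the retransmitted segment); SendBase=120
cumulative ACK
  Host A → Host B: Seq=92, 8 bytes of data
  Host A → Host B: Seq=100, 20 bytes of data
  Host B → Host A: ACK=100 (X, lost)
  Host B → Host A: ACK=120 (arrives before the timeout)
  Host A → Host B: Seq=120, 15 bytes of data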
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK
  Host A → Host B: Seq=92, 8 bytes of data
  Host A → Host B: Seq=100, 20 bytes of data (X, lost)
  Host B → Host A: ACK=100
  Host B → Host A: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
  Host A → Host B: Seq=100, 20 bytes of data (resent before the timeout expires)
TCP flow control
[figure: receiver protocol stack; segments arrive from sender and pass through IP code and TCP code (in the OS) into the TCP socket receiver buffers, which the application process reads]
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process]
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
state kept on each side (application above, network below):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state → SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state → SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state → ESTAB
4. server: received ACK(y) indicates client is live; server state → ESTAB
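The numbering in the handshake is easy to get off by one, so a short sketch may help. It builds the three segments as plain dictionaries; the function name and the dictionary keys are hypothetical. The modulo reflects that TCP sequence numbers are 32-bit values, so x+1 and y+1 wrap around.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32-bit values


def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYNbit": 1, "Seq": x}
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1,
              "ACKnum": (syn["Seq"] + 1) % SEQ_SPACE}
    ack = {"ACKbit": 1, "ACKnum": (synack["Seq"] + 1) % SEQ_SPACE}
    return syn, synack, ack


syn, synack, ack = three_way_handshake(x=1000, y=5000)
```

The SYN carries no data but still consumes one sequence number, which is why each side acknowledges the other's initial sequence number plus one.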
TCP 3-way handshake: FSM
client side:
  closed → SYN sent: on Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
  SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
server side:
  closed → listen: on Socket connectionSocket = welcomeSocket.accept(); Λ
  listen → SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd → ESTAB: on ACK(ACKnum=y+1); Λ
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
1. client: clientSocket.close(); send FINbit=1, seq=x; client state → FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state → CLOSE_WAIT (can still send data)
3. client: state → FIN_WAIT_2 (wait for server close)
4. server: send FINbit=1, seq=y; server state → LAST_ACK (can no longer send data)
5. client: send ACKbit=1, ACKnum=y+1; client state → TIMED_WAIT (timed wait for 2 · max segment lifetime), then CLOSED
6. server: on receiving the ACK, state → CLOSED
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[figure: Host A sends original data at rate λ_in into unlimited shared output link buffers; Host B receives throughput λ_out. Plot 1: λ_out vs. λ_in, levelling off at R/2. Plot 2: delay vs. λ_in, growing without bound as λ_in approaches R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[figure: Host A offers λ_in (original data) and λ'_in (original data, plus retransmitted data) to finite shared output link buffers; Host B receives λ_out]
idealization: perfect knowledge
• sender sends only when router buffers available
[figure: a copy is sent only when there is free buffer space; plot of λ_out vs. λ_in, rising to R/2]
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
[figure: a copy is dropped when there is no buffer space and resent when there is free buffer space; plot of λ_out vs. λ'_in, both axes up to R/2]
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[figure: timeout at Host A produces a second copy while there is free buffer space; plot of λ_out vs. λ'_in, both axes up to R/2]
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[figure: throughput by blue traffic (λ_out, up to R/2) vs. offered load by Host A (λ'_in, up to R/2); annotation: bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[figure: cwnd (TCP sender congestion window size) vs. time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
TCP Congestion Control: details
• sender limits transmission:
    LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
    rate ≈ cwnd / RTT bytes/sec
[figure: sender sequence number space; cwnd spans from last byte ACKed to last byte sent, covering the sent, not-yet ACKed ("in-flight") bytes]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
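The state machine summarized above can be sketched as a Python class that tracks only cwnd, ssthresh and the duplicate-ACK count. It is an illustration under stated assumptions: "ssthresh + 3" is read as ssthresh plus 3 MSS, the slow-start exit test is applied right after each new ACK, and retransmission itself is left out. The class name RenoSender is hypothetical.

```python
class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = float(mss)
        self.ssthresh = float(ssthresh)
        self.dup_ack_count = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)
        self.dup_ack_count = 0

    def on_duplicate_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss   # assumption: "+ 3" means 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_ack_count = 0
        self.state = "slow_start"


reno = RenoSender(mss=1000, ssthresh=4000)
for _ in range(4):
    reno.on_new_ack()            # cwnd: 2000, 3000, 4000, 5000
for _ in range(3):
    reno.on_duplicate_ack()      # third duplicate triggers fast recovery
```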
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) · W / RTT bytes/sec
[figure: saw tooth of window size over time, oscillating between W/2 and W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
    TCP throughput = 1.22 · MSS / (RTT · √L)
  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
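Both numbers on this slide follow from short arithmetic, reproduced here as a Python sketch. The function names are hypothetical; the loss rate is obtained by solving the throughput formula for L, with MSS converted to bits so that the units match the throughput in bits/sec.

```python
def required_window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to fill the pipe: throughput * RTT / segment size."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


window = required_window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
```

The window comes out at about 83,333 segments and the loss rate at roughly 2.1·10^-10, in line with the rounded values quoted above.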
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line; the trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"; (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1]
rdt3.0 in action
(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (X, loss)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (X, loss)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
    D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs
  - U_sender: utilization, fraction of time sender busy sending
    U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
  sender: first packet bit transmitted, t = 0
  sender: last packet bit transmitted, t = L / R
  receiver: first packet bit arrives
  receiver: last packet bit arrives, send ACK
  sender: ACK arrives, send next packet, t = RTT + L / R
U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
  sender: first packet bit transmitted, t = 0
  sender: last bit transmitted, t = L / R
  receiver: first packet bit arrives
  receiver: last packet bit arrives, send ACK
  receiver: last bit of 2nd packet arrives, send ACK
  receiver: last bit of 3rd packet arrives, send ACK
  sender: ACK arrives, send next packet, t = RTT + L / R
3-packet pipelining increases utilization by a factor of 3!
U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
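The utilization figures for stop-and-wait and for 3-packet pipelining can be reproduced with a few lines of Python. This is only a worked calculation; the function name and its parameters are hypothetical, and the pipelined case assumes the window is small enough that the sender still idles while waiting for the first ACK.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, packets_in_flight=1):
    """Fraction of time the sender is busy: (n * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / link_bps          # transmission delay L/R, in seconds
    return packets_in_flight * d_trans / (rtt_s + d_trans)


# 1 Gbps link, 8000-bit packets, RTT = 30 ms (2 x 15 ms propagation delay).
u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_pipelined = sender_utilization(8000, 1e9, 0.030, packets_in_flight=3)
```

Stop-and-wait gives about 0.00027 and three packets in flight gives about 0.0008, three times as much but still far from using the link.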
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
state: Wait
initially (Λ):
    base = 1
    nextseqnum = 1
event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
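The sender FSM translates almost line for line into Python. The sketch below makes some simplifications that are assumptions rather than part of the FSM: the timer is a boolean flag, packets are (seq, data) tuples with no checksum, udt_send is a callback supplied by the caller, and ACKs below base are ignored so that base never moves backwards.

```python
class GBNSender:
    """Go-Back-N sender sketch; `udt_send` is a caller-supplied callback."""

    def __init__(self, window_size, udt_send):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}               # seq # -> packet, kept until ACKed
        self.timer_running = False
        self.udt_send = udt_send

    def rdt_send(self, data):
        """Send data if the window allows it; return False to refuse it."""
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.udt_send(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True
            self.nextseqnum += 1
            return True
        return False

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])   # go back N: resend every unACKed pkt

    def on_ack(self, acknum):
        if acknum < self.base:
            return                     # assumption: ignore old or duplicate ACKs
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1         # cumulative ACK
        self.timer_running = self.base != self.nextseqnum


wire = []                              # stands in for the unreliable channel
gbn = GBNSender(window_size=4, udt_send=wire.append)
accepted = [gbn.rdt_send(f"d{i}") for i in range(1, 6)]   # fifth call is refused
gbn.on_ack(2)                          # cumulative: pkts 1 and 2 are ACKed
gbn.rdt_send("d5")
gbn.on_timeout()                       # resends pkts 3, 4 and 5
```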
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
state: Wait
initially (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
event: default
    udt_send(sndpkt)
GBN in action
sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8
sender: send pkt0, send pkt1, send pkt2 (X loss), send pkt3, (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4
sender: rcv ack1, send pkt5
sender: ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1
receiver: receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2
receiver: rcv pkt3, deliver, send ack3
receiver: rcv pkt4, deliver, send ack4
receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #s
  - limits seq #s of sent, unACKed packets
Selective repeat sender receiver windows
54
Selective repeat
data from abovebull if next available seq in
window send pkt
timeout(n)bull resend pkt n restart timer
ACK(n) in [sendbase sendbase+N-1]
bull mark pkt n as receivedbull if n smallest unACKed pkt
advance window base to next unACKed seq
55
senderpkt n in [rcvbase rcvbase+N-1]
v send ACK(n)v out-of-order bufferv in-order deliver (also
deliver buffered in-order pkts) advance window to next not-yet-received pkt
pkt n in [rcvbase-N rcvbase-1]
v ACK(n)otherwisev ignore
receiver
Selective repeat in action
56
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 buffer send ack3rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2
Xloss
receive pkt4 buffer send ack4
receive pkt5 buffer send ack5
rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2
record ack3 arrived
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
record ack4 arrivedrecord ack5 arrived
Q what happens when ack2 arrives
Selective repeat in action
57
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 buffer send ack3rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2
Xloss
receive pkt4 buffer send ack4
receive pkt5 buffer send ack5
rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2
record ack3 arrived
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
record ack4 arrivedrecord ack5 arrived
Q what happens when ack2 arrives
Selective repeatdilemma
example bull seq rsquos 0 1 2 3bull window size=3
receiver window(after receipt)
sender window(after receipt)
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2
pkt0pkt1pkt2
0 1 2 3 0 1 2 pkt0
timeoutretransmit pkt0
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2XXX
will accept packetwith seq number 0(b) oops
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2
pkt0pkt1pkt2
0 1 2 3 0 1 2pkt0
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2
Xwill accept packetwith seq number 0
0 1 2 3 0 1 2 pkt3
(a) no problem
receiver canrsquot see sender sidereceiver behavior identical in both casessomethingrsquos (very) wrong
v receiver sees no difference in two scenarios
v duplicate data accepted as new in (b)
Q what relationship between seq size and window size to avoid problem in (b)
58
TCP Overview RFCs 79311221323 2018 2581
bull point-to-pointndash one sender one receiver
bull reliable in-order byte streamndash no ldquomessage boundariesrdquo
bull pipelinedndash TCP congestion and flow
control set window size
bull full duplex datandash bi-directional data flow in
same connectionndash MSS maximum segment
size
bull connection-orientedndash handshaking (exchange of
control msgs) inits sender receiver state before data exchange
bull flow controlledndash sender will not overwhelm
receiver
59
TCP segment structure
60
source port dest port
32 bits
applicationdata (variable length)
sequence numberacknowledgement number
receive windowUrg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP seq numbers ACKs
sequence numbersndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgementsndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segmentsndashA TCP spec doesnrsquot say ndashup to implementor
61
source port dest port
sequence numberacknowledgement number
checksum
rwndurg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window sizeN
sender sequence number space
source port dest port
sequence numberacknowledgement number
checksum
rwndurg pointer
outgoing segment from sender
Byte stream in TCP
62
Window N bytes
HTTP Get Message (K bytes)
100th byte
TCP header(seq no = 100)
M bytes
HTTP Get Message (K bytes)
Cannot be transmitted now
TCP seq numbers ACKs
63
UsertypeslsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP round trip time timeout
Q how to set TCP timeout value
bull longer than RTTndash but RTT varies
bull too short premature timeout unnecessary retransmissions
bull too long slow reaction to segment loss
Q how to estimate RTTbull SampleRTT measured
time from segment transmission until ACK receiptndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquondash average several recent
measurements not just current SampleRTT
64
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106time (seconnds)
RTT
(mill
iseco
nds)
SampleRTT Estimated RTT
EstimatedRTT = (1- a)EstimatedRTT + aSampleRTT
v exponential weighted moving averagev influence of past sample decreases exponentially fastv typical value a = 0125
TCP round trip time timeout
65
RTT
(milli
seco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTTEstimatedRTT
time (seconds)
TCP round trip time timeout
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT egrave larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT
66
DevRTT = (1-b)DevRTT +b|SampleRTT-EstimatedRTT|
(typically b = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initial (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum; then wait for event

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
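The three events of the simplified sender map directly onto three methods. The sketch below is an illustration only: the class name, the `sent` log and the boolean `timer_running` flag (standing in for a real timer) are assumptions, and duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: one timer, cumulative ACKs (hypothetical sketch)."""

    def __init__(self, initial_seq_num: int):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.timer_running = False     # stands in for a real retransmission timer
        self.unacked = {}              # seq # -> data of not-yet-acked segments
        self.sent = []                 # log of (seq #, length) passed to IP

    def on_app_data(self, data: bytes) -> None:
        """Event: data received from application above."""
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, len(data)))      # "send"
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self) -> None:
        """Event: timeout. Retransmit the unacked segment with smallest seq #."""
        if self.unacked:
            seq = min(self.unacked)
            self.sent.append((seq, len(self.unacked[seq])))
            self.timer_running = True

    def on_ack(self, y: int) -> None:
        """Event: ACK received with ACK field value y (next byte expected)."""
        if y > self.send_base:
            self.send_base = y
            # Drop every segment fully covered by the cumulative ACK.
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


sender = SimplifiedTcpSender(initial_seq_num=92)
sender.on_app_data(b"a" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"b" * 20)    # Seq=100, 20 bytes
sender.on_ack(120)               # cumulative ACK covers both segments
print(sender.send_base, sender.timer_running)
```

The usage lines replay the sequence numbers used in the retransmission scenarios: a single ACK=120 acknowledges both segments, so the timer stops.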
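TCP retransmission scenarios
lost ACK scenario (Host A, Host B)
• Host A sends Seq=92, 8 bytes of data
• Host B replies ACK=100, which is lost (X)
• Host A times out and resends Seq=92, 8 bytes of data
• Host B replies ACK=100 again; SendBase=100
premature timeout (Host A, Host B)
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; SendBase=92
• Host B replies ACK=100 and ACK=120, but Host A times out before ACK=100 arrives
• Host A resends Seq=92, 8 bytes of data
• ACK=100 arrives (SendBase=100), then ACK=120 (SendBase=120)
• Host B answers the retransmission with ACK=120; SendBase stays at 120
cumulative ACK (Host A, Host B)
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• Host B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout and covers both segments, so nothing is retransmitted
• Host A continues with Seq=120, 15 bytes of data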
TCP ACK generation [RFC 5681]
event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X), followed by further segments
• Host B replies ACK=100 to the first segment and repeats ACK=100 for each later segment (3 DUP ACKs)
• Host A resends Seq=100, 20 bytes of data, before its timeout expires
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers ...
• ... slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack; segments from sender pass through IP code and TCP code (OS) into TCP socket receiver buffers, which the application process reads]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
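A small sketch can make the rwnd bookkeeping concrete. It assumes rwnd is simply RcvBuffer minus the bytes currently buffered, and that the sender may have at most rwnd unacked bytes outstanding; the names `ReceiveBuffer` and `sendable_bytes` are invented for this illustration.

```python
class ReceiveBuffer:
    """Receiver-side buffer that advertises its free space as rwnd (sketch)."""

    def __init__(self, rcv_buffer: int = 4096):
        self.rcv_buffer = rcv_buffer   # total buffer size in bytes
        self.buffered = 0              # bytes received but not yet read by the app

    @property
    def rwnd(self) -> int:
        # Free buffer space, advertised to the sender in the TCP header.
        return self.rcv_buffer - self.buffered

    def on_segment(self, nbytes: int) -> int:
        """Buffer an arriving payload; returns how many bytes actually fit."""
        accepted = min(nbytes, self.rwnd)
        self.buffered += accepted
        return accepted

    def app_read(self, nbytes: int) -> int:
        """Application removes data, which frees buffer space."""
        taken = min(nbytes, self.buffered)
        self.buffered -= taken
        return taken


def sendable_bytes(rwnd: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """Bytes the sender may still transmit without exceeding rwnd in flight."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


buf = ReceiveBuffer(4096)
buf.on_segment(3000)
print(buf.rwnd)                               # free space after buffering 3000 bytes
print(sendable_bytes(buf.rwnd, 3500, 3000))   # 500 bytes already in flight
```

When the application reads slowly, `buffered` stays high, rwnd shrinks, and the sender's allowance drops toward zero, which is the intended back-pressure.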
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
each side (application over network) holds:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
• server: received ACK(y) indicates client is live; server enters ESTAB
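The sequence and acknowledgement numbers exchanged in the handshake follow mechanically from the two initial sequence numbers x and y. The sketch below only builds the three header tuples; the `Segment` type and the function name are assumptions made for illustration, and no networking is involved. The third segment's own sequence number, x+1, is not shown on the slide; it follows from the SYN consuming one sequence number.

```python
from typing import List, NamedTuple, Optional

SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32 bits wide


class Segment(NamedTuple):
    syn: bool
    ack: bool
    seq: int
    acknum: Optional[int]   # None when the ACK bit is not set


def three_way_handshake(x: int, y: int) -> List[Segment]:
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = Segment(syn=True, ack=False, seq=x, acknum=None)
    synack = Segment(syn=True, ack=True, seq=y, acknum=(x + 1) % SEQ_SPACE)
    ack = Segment(syn=False, ack=True,
                  seq=(x + 1) % SEQ_SPACE, acknum=(y + 1) % SEQ_SPACE)
    return [syn, synack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```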
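TCP 3-way handshake: FSM
• closed -> listen (server): Socket connectionSocket = welcomeSocket.accept();
• closed -> SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• listen -> SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent -> ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
• SYN rcvd -> ESTAB: receive ACK(ACKnum=y+1); no action (Λ)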
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
client state: ESTAB; server state: ESTAB
• client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
• server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
• client: enter FIN_WAIT_2 (wait for server close)
• server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
• client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
• server: on receiving the ACK, enter CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A and Host B send original data at rate λ_in into unlimited shared output link buffers; throughput is λ_out. Plots: λ_out vs. λ_in (levels off at R/2) and delay vs. λ_in (grows sharply as λ_in nears R/2)]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A and Host B share finite output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: sender keeps a copy of each packet and transmits only when there is free buffer space, not when there is no buffer space; plot of λ_out vs. λ_in, both up to R/2]
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
[Figure: for each case, a plot of λ_out against the offered load, both axes up to R/2]
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Figure: Hosts A, B, C and D send over multihop paths through finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data. Plot: throughput of blue traffic (λ_out, up to R/2) vs. offered load by Host A (λ'_in, up to R/2), annotated "bandwidth wastage for packets dropped at the 2nd router"]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth; cwnd (TCP sender congestion window size) vs. time; additively increase window size ... until loss occurs (then cut window in half)]
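The sawtooth can be reproduced with a few lines of Python. This is a toy model under stated assumptions: cwnd is counted in MSS units, time advances one RTT per step, and the RTTs in which loss is detected are supplied by the caller rather than derived from a network model. The function name `aimd_trace` is made up for this sketch.

```python
from typing import List, Set


def aimd_trace(rounds: int, loss_rounds: Set[int], cwnd: float = 1.0) -> List[float]:
    """Return cwnd (in MSS) at the start of each RTT under AIMD.

    loss_rounds holds the indices of RTTs in which a loss is detected.
    """
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(1.0, cwnd / 2)   # multiplicative decrease: halve the window
        else:
            cwnd += 1.0                 # additive increase: +1 MSS per RTT
    return trace


# Loss detected every 8th RTT (an arbitrary choice) produces the sawtooth.
print(aimd_trace(rounds=24, loss_rounds={7, 15, 23}, cwnd=10.0))
```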
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
rate ≈ cwnd/RTT bytes/sec
[Figure: sender sequence number space; last byte ACKed, bytes sent but not yet ACKed ("in-flight", at most cwnd), last byte sent]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments; time runs downward]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start
• new ACK: cwnd = cwnd+MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start
• cwnd > ssthresh: no action (Λ); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery

congestion avoidance
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
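The three-state machine (slow start, congestion avoidance, fast recovery) can be written as a small Python class. This is a sketch of the window arithmetic only, under stated assumptions: MSS = 1460 bytes is an arbitrary choice, retransmission and segment transmission are omitted, and the class and state names are invented here.

```python
MSS = 1460   # bytes; assumed value for illustration


class RenoCongestionControl:
    """Window bookkeeping for the slow start / CA / fast recovery FSM (sketch)."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = float(MSS)
        self.ssthresh = 64.0 * 1024
        self.dup_ack_count = 0

    def on_new_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # doubles cwnd roughly every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)   # about +1 MSS per RTT
        self.dup_ack_count = 0

    def on_duplicate_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            # Triple duplicate ACK: halve, inflate by 3 MSS, enter fast recovery.
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(MSS)
        self.dup_ack_count = 0
        self.state = "slow_start"


cc = RenoCongestionControl()
for _ in range(5):
    cc.on_new_ack()
for _ in range(3):
    cc.on_duplicate_ack()
print(cc.state, cc.cwnd, cc.ssthresh)
```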
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}$ bytes/sec
[Figure: window sawtooth oscillating between W/2 and W, averaging 3/4 W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$
  - to achieve 10 Gbps throughput, need a loss rate of $L = 2\cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
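The numbers in this example can be checked with a short calculation. The function names below are made up for this sketch; the only inputs are the values quoted on the slide (1500-byte segments, 100 ms RTT, 10 Gbps target) and the throughput formula with the 1.22 constant.

```python
import math


def window_in_segments(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def loss_limited_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), converted to bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss_rate(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Invert the formula: the loss probability L that still allows rate_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = window_in_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

Because throughput scales with 1/sqrt(L), raising the target rate tenfold requires a loss rate one hundred times smaller.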
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal-share line]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
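The parallel-connection example is simple arithmetic once one assumes that every TCP connection on the bottleneck gets an equal share of R. The helper below encodes that idealized assumption; the function name is invented for this sketch.

```python
from fractions import Fraction


def app_share(new_connections: int, existing_connections: int) -> Fraction:
    """Fraction of link rate R obtained by an app opening new_connections,

    assuming every connection on the link receives an equal share.
    """
    total = new_connections + existing_connections
    return Fraction(new_connections, total)


print(app_share(1, 9))    # one connection among ten
print(app_share(11, 9))   # eleven connections among twenty, slightly over R/2
```

With 11 of 20 connections the app gets 11/20 of R, which the slide rounds to R/2.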
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and arrives with ECN=11; the TCP ACK segment returns with ECE=1]
rdt3.0 in action
(c) ACK loss
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
(d) premature timeout / delayed ACK
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (ack1 is delayed)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack1 (the second copy), do nothing
• sender: rcv ack0, send pkt1
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$
  - U_sender: utilization, fraction of time sender busy sending
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  - if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
rdt3.0: stop-and-wait operation
• t = 0: sender transmits first packet bit
• t = L/R: sender transmits last packet bit
• receiver: first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, sender sends next packet
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
• t = 0: sender transmits first packet bit
• t = L/R: sender transmits last bit of first packet
• receiver: first packet bit arrives; last packet bit arrives, send ACK
• receiver: last bit of 2nd packet arrives, send ACK
• receiver: last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, sender sends next packet
3-packet pipelining increases utilization by a factor of 3!
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
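The utilization figures for stop-and-wait and for 3-packet pipelining come from the same formula, so a single helper covers both. The function name and the `window` parameter are assumptions made for this sketch; the inputs are the slide's example values (8000-bit packet, 1 Gbps link, 30 ms RTT).

```python
def sender_utilization(packet_bits: float, rate_bps: float, rtt_s: float,
                       window: int = 1) -> float:
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R).

    window=1 models stop-and-wait; window=N models N packets in flight.
    The result is capped at 1.0, since a sender cannot be more than fully busy.
    """
    d_trans = packet_bits / rate_bps            # L/R, transmission delay
    return min(1.0, window * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"stop-and-wait: {stop_and_wait:.5f}, 3-packet pipeline: {pipelined:.5f}")
```

Filling the pipe completely in this example would take a window of roughly (RTT + L/R) / (L/R), about 3751 packets.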
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initial (Λ): base=1; nextseqnum=1; then Wait

event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ (no action)
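The sender FSM translates almost line for line into Python. The sketch below is illustrative only: the channel is replaced by a `wire` list that records the sequence numbers handed to udt_send, the timer is a boolean flag, and corrupt packets are not modelled. One deliberate deviation is labelled in the code: base is never moved backwards, which guards against a stale ACK arriving late.

```python
class GbnSender:
    """Go-Back-N sender with window size N (hypothetical sketch)."""

    def __init__(self, window_size: int):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq # -> data kept for retransmission
        self.timer_running = False   # stands in for start_timer / stop_timer
        self.wire = []               # seq #s handed to udt_send, for inspection

    def rdt_send(self, data) -> bool:
        """Returns False when the window is full (refuse_data)."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.wire.append(self.nextseqnum)        # udt_send
            if self.base == self.nextseqnum:
                self.timer_running = True            # start_timer
            self.nextseqnum += 1
            return True
        return False

    def timeout(self) -> None:
        """Retransmit every packet from base up to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(seq)

    def rdt_rcv_ack(self, acknum: int) -> None:
        """Cumulative ACK for all packets up to and including acknum."""
        # Assumption beyond the slide: ignore ACKs that would move base back.
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        self.timer_running = self.base != self.nextseqnum


sender = GbnSender(window_size=4)
accepted = [sender.rdt_send(f"data{i}") for i in range(5)]
print(accepted)          # the fifth call is refused: window full
sender.rdt_rcv_ack(2)    # packets 1 and 2 acknowledged
sender.timeout()         # packets 3 and 4 are resent
print(sender.wire)
```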
GBN: receiver extended FSM
initial (Λ): expectedseqnum=1; sndpkt = make_pkt(0,ACK,chksum); then Wait

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum)
  extract(rcvpkt,data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum,ACK,chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
GBN in action
sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8
• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
Selective repeat in action
sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8
• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size=3
(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver ACKs all three and advances its window to seq #s 3, 0, 1
• ACKs arrive; sender advances its window and sends pkt3 (lost, X), then a new pkt0
• receiver will accept packet with seq number 0
(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver ACKs all three and advances its window to seq #s 3, 0, 1
• all three ACKs are lost (X X X); sender times out and retransmits pkt0
• receiver will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
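Scenario (b) can be checked mechanically for any sequence-number space and window size. The sketch below models only that one scenario: the receiver has accepted a full first window starting at seq # 0, every ACK was lost, and the sender retransmits seq # 0. The function names are invented here. The standard result for selective repeat, which the check reproduces, is that the window may be at most half the sequence-number space.

```python
def stale_pkt0_accepted(seq_space: int, window: int) -> bool:
    """True if a retransmitted pkt0 lands inside the receiver's advanced window.

    After receiving seq #s 0..window-1, the receiver's window covers
    window, window+1, ..., 2*window-1 (all modulo seq_space).
    """
    receiver_window = {(window + i) % seq_space for i in range(window)}
    return 0 in receiver_window


def max_safe_window(seq_space: int) -> int:
    """Largest window for which the ambiguity in scenario (b) cannot occur."""
    return seq_space // 2


print(stale_pkt0_accepted(seq_space=4, window=3))   # the slide's example
print(stale_pkt0_accepted(seq_space=4, window=2))   # window = half the space
print(max_safe_window(4))
```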
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each with fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]
Byte stream in TCP
[Figure: an HTTP Get Message (K bytes) in the byte stream; window of N bytes; the segment starting at the 100th byte carries a TCP header (seq. no. = 100) and M bytes; the remainder of the message cannot be transmitted now]
TCP seq numbers ACKs
63
UsertypeslsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP round trip time timeout
Q how to set TCP timeout value
bull longer than RTTndash but RTT varies
bull too short premature timeout unnecessary retransmissions
bull too long slow reaction to segment loss
Q how to estimate RTTbull SampleRTT measured
time from segment transmission until ACK receiptndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquondash average several recent
measurements not just current SampleRTT
64
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106time (seconnds)
RTT
(mill
iseco
nds)
SampleRTT Estimated RTT
EstimatedRTT = (1- a)EstimatedRTT + aSampleRTT
v exponential weighted moving averagev influence of past sample decreases exponentially fastv typical value a = 0125
TCP round trip time timeout
65
RTT
(milli
seco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTTEstimatedRTT
time (seconds)
TCP round trip time timeout
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT egrave larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT
66
DevRTT = (1-b)DevRTT +b|SampleRTT-EstimatedRTT|
(typically b = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP reliable data transfer
bull TCP creates rdt service on top of IPrsquos unreliable servicendash pipelined segmentsndash cumulative acksndash single retransmission timer
bull retransmissions triggered byndash timeout eventsndash duplicate acks
67
letrsquos initially consider simplified TCP senderndash ignore duplicate acksndash ignore flow control
congestion control
TCP sender events
data rcvd from appbull create segment with seq bull seq is byte-stream
number of first data byte in segment
bull start timer if not already running ndash think of timer as for oldest
unacked segmentndash expiration interval TimeOutInterval
timeoutbull retransmit segment that
caused timeoutbull restart timerack rcvdbull if ack acknowledges
previously unackedsegmentsndash update what is known to
be ACKedndash start timer if there are still
unacked segments
68
TCP sender (simplified)
69
waitfor event
NextSeqNum = InitialSeqNumSendBase = InitialSeqNum
L
create segment seq NextSeqNumpass segment to IP (ie ldquosendrdquo)NextSeqNum = NextSeqNum + length(data) if (timer currently not running)
start timer
data received from application above
retransmit not-yet-acked segment with smallest seq
start timer
timeout
if (y gt SendBase) SendBase = y SendBasendash1 last cumulatively ACKed byte if (there are currently not-yet-acked segments)
start timerelse stop timer
ACK received with ACK field value y
TCP retransmission scenarios
70
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtimeo
ut
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
timeo
ut
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP retransmission scenarios
71
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
timeo
ut
Seq=100 20 bytes of data
ACK=120
TCP ACK generation [RFC 5861]
72
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACKindicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP fast retransmit
bull time-out period often relatively longndash long delay before resending
lost packet
bull detect lost segments via duplicate ACKsndash sender often sends many
segments back-to-backndash if segment is lost there will
likely be many duplicate ACKs
73
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo)resend unackedsegment with smallest seq sect likely that unacked
segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
timeo
ut ACK=100
ACK=100
ACK=100
TCP fast retransmit
74
Seq=100 20 bytes of data
Seq=100 20 bytes of data
3 DUP ACKs
TCP flow control
75
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
applicationOS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP flow control
bull receiver ldquoadvertisesrdquo free buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via socket
options (typical default is 4096 bytes)ndash many operating systems autoadjustRcvBuffer
bull sender limits amount of unacked(ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
76
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application process
receiver-side buffering
Connection Management
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing to
establish connection)bull agree on connection parameters
77
connection state ESTABconnection variables
seq client-to-serverserver-to-client
rcvBuffer sizeat serverclient
application
network
connection state ESTABconnection Variables
seq client-to-serverserver-to-client
rcvBuffer sizeat serverclient
application
network
Socket clientSocket = newSocket(hostnameport number)
Socket connectionSocket = welcomeSocketaccept()
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
• server: received ACK(y) indicates client is live; server state: ESTAB
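The numbering rule in the handshake (each side acknowledges the other's initial sequence number plus one) can be checked with a small sketch. It only models the three segments as dictionaries; the function name `three_way_handshake` and the dictionary keys are illustrative assumptions, and no real sockets are involved.

```python
import random

SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(x=None, y=None):
    """Return the SYN, SYNACK and ACK segments plus a consistency flag."""
    x = random.getrandbits(32) if x is None else x   # client initial seq num
    y = random.getrandbits(32) if y is None else y   # server initial seq num

    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}

    # Client checks that the server acknowledged its SYN;
    # server checks that the client acknowledged the SYNACK.
    client_ok = synack["acknum"] == (x + 1) % SEQ_SPACE
    server_ok = ack["acknum"] == (y + 1) % SEQ_SPACE
    return [syn, synack, ack], client_ok and server_ok


segments, established = three_way_handshake(x=100, y=300)
for seg in segments:
    print(seg)
print("both sides ESTAB:", established)
```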
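TCP 3-way handshake: FSM
(Λ = no event / no action)
client side:
• closed → SYN sent: on Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
server side:
• closed → listen: on Socket connectionSocket = welcomeSocket.accept(); action Λ
• listen → SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1); action Λ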
TCP: closing a connection
• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection (client state and server state both start at ESTAB)
• client: clientSocket.close(); send FINbit=1, seq=x; client state FIN_WAIT_1 (can no longer send, but can receive data)
• server: reply ACKbit=1, ACKnum=x+1; server state CLOSE_WAIT (can still send data)
• client: on receiving the ACK, state FIN_WAIT_2 (wait for server close)
• server: send FINbit=1, seq=y; server state LAST_ACK (can no longer send data)
• client: reply ACKbit=1, ACKnum=y+1; client state TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
• server: on receiving the ACK, state CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem
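Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate $\lambda_{in}$ approaches capacity
[Figure: Host A and Host B each send original data at rate $\lambda_{in}$ into unlimited shared output link buffers; throughput is $\lambda_{out}$. Plot 1: $\lambda_{out}$ versus $\lambda_{in}$, both axes marked at R/2. Plot 2: delay versus $\lambda_{in}$, growing sharply as $\lambda_{in}$ approaches R/2.]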
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  – transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$
[Figure: Host A and Host B share finite output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$: throughput.]
idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes marked at R/2.]
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered
"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$: throughput.]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted
[Plot: throughput of blue traffic ($\lambda_{out}$) versus offered load by Host A ($\lambda'_{in}$), axes marked at R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size … until loss occurs (then cut window in half).]
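The sawtooth can be reproduced with a toy loop. This is a sketch under a strong assumption: a loss is declared whenever cwnd exceeds a fixed `capacity_mss`, which stands in for real loss detection. The function name `aimd_trace` and its parameters are illustrative.

```python
def aimd_trace(capacity_mss, rounds, cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT under AIMD."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:      # hypothetical loss signal
            cwnd = max(1, cwnd // 2)  # multiplicative decrease
        else:
            cwnd += 1                # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(capacity_mss=8, rounds=12))
```

With a capacity of 8 MSS the window climbs by one per round, overshoots, is halved, and starts climbing again, which is the sawtooth in the figure.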
TCP Congestion Control: details
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]
• sender limits transmission:
  $\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $\text{rate} \approx \text{cwnd} / \text{RTT}$ bytes/sec
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments; time runs downward.]
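The link between "increment cwnd for every ACK" and "double cwnd every RTT" is easy to miss, so here is a small sketch. It assumes no loss, one ACK per segment, and that every segment sent in a round is ACKed within that round; `slow_start_rounds` is an illustrative name.

```python
def slow_start_rounds(rounds, mss=1):
    """Return cwnd at the start of each RTT during loss-free slow start."""
    cwnd = mss
    sizes = []
    for _ in range(rounds):
        sizes.append(cwnd)
        acks_this_rtt = cwnd // mss      # one ACK per segment sent
        for _ in range(acks_this_rtt):
            cwnd += mss                  # per-ACK increment of 1 MSS
    return sizes


print(slow_start_rounds(4))   # one, two, four, eight segments
```

Each of the cwnd/MSS ACKs in a round adds one MSS, so the window doubles per RTT even though each individual step is additive.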
TCP: detecting, reacting to loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
(Λ = no event / no action)
initial transition into slow start:
• Λ: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
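The three-state FSM summarized on this slide can be written as a small event-driven class. This is a sketch of the window arithmetic only: it does not send or retransmit anything, `MSS = 1460` is an assumed value, and the names `RenoSender`, `on_new_ack`, `on_dup_ack` and `on_timeout` are hypothetical.

```python
from dataclasses import dataclass

MSS = 1460  # bytes; assumed segment size for the example


@dataclass
class RenoSender:
    cwnd: float = MSS
    ssthresh: float = 64 * 1024
    dup_acks: int = 0
    state: str = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                       # each dup ACK means a segment left the network
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender()
for _ in range(3):
    s.on_new_ack()
for _ in range(3):
    s.on_dup_ack()
print(s.state, s.cwnd / MSS, s.ssthresh / MSS)
```

After three new ACKs and three duplicate ACKs the sender sits in fast recovery with ssthresh at half the old window, matching the dupACKcount == 3 transition above.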
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT
  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{\text{RTT}}$ bytes/sec
[Figure: sawtooth of window size oscillating between W/2 and W, with average 3/4 W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  $\text{TCP throughput} = \frac{1.22 \cdot \text{MSS}}{\text{RTT} \sqrt{L}}$
• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
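Both numbers on this slide follow from short arithmetic, sketched below. The function names are illustrative; the only inputs are the slide's example values (1500-byte segments, 100 ms RTT, 10 Gbps) and the throughput formula above solved for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to fill the pipe: rate * RTT / segment size."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The loss rate comes out near 2.1e-10, which the slide rounds to 2 · 10^-10.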
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes running to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal bandwidth share line.]
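The convergence argument in the figure can be replayed numerically. The sketch below makes idealized assumptions: both flows have the same RTT, both see a loss in the same round whenever their combined rate exceeds the capacity, and rates are in arbitrary units. `aimd_two_flows` is an illustrative name.

```python
def aimd_two_flows(x, y, capacity, steps):
    """Evolve two AIMD flows that share one bottleneck link."""
    for _ in range(steps):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both halve, so the gap halves
        else:
            x, y = x + 1, y + 1      # additive increase: the gap is unchanged
    return x, y


x, y = aimd_two_flows(x=1.0, y=50.0, capacity=60.0, steps=500)
print(round(x, 3), round(y, 3))
```

Additive increase moves the operating point along a 45-degree line and leaves the difference between the flows unchanged, while every halving cuts that difference in two, so the two rates end up practically equal however unequal they started.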
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets roughly R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 near the source and ECN=11 near the destination; the TCP ACK segment sent back carries ECE=1.]
Performance of rdt3.0
• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$
  – $U_{sender}$: utilization, fraction of time sender busy sending
  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
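The utilization figure can be recomputed directly. In this sketch `sender_utilization` is an illustrative name, and the `window` parameter is an added generalization (number of packets sent back-to-back per round trip) so the same function also covers sending more than one packet before waiting.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / link_bps                  # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)
print(f"stop-and-wait: U = {u1:.5f}")
print(f"window of 3:   U = {sender_utilization(8000, 1e9, 0.030, window=3):.5f}")
```

Stop-and-wait keeps a 1 Gbps link busy for about 0.027% of the time; the limit is the protocol, not the link.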
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R
3-packet pipelining increases utilization by a factor of 3
$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unACKed packets in pipeline
• receiver only sends cumulative ACK
  – doesn't ACK packet if there's a gap
• sender has timer for oldest unACKed packet
  – when timer expires, retransmit all unACKed packets
Selective Repeat:
• sender can have up to N unACKed packets in pipeline
• rcvr sends individual ACK for each packet
• sender maintains timer for each unACKed packet
  – when timer expires, retransmit only that unACKed packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unACKed pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
(single state: Wait; Λ = no event / no action)
initial transition, Λ:
    base = 1
    nextseqnum = 1
event rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
event timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
event rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering
  – re-ACK pkt with highest in-order seq #
(single state: Wait; Λ = no event / no action)
initial transition, Λ:
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
event default:
    udt_send(sndpkt)
GBN in action
[Figure: sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8; pkt2 is lost (X)]
sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5
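A rough simulation shows the go-back behavior with pkt2 lost. It is deliberately simplified and does not reproduce the exact interleaving of the trace: the sender transmits a whole window per round, ACKs are never lost, and a round in which the window does not fully advance stands in for the timeout. The function name `gbn_transfer` and the `lose_first_tx` parameter (packets whose first transmission is dropped) are illustrative.

```python
def gbn_transfer(num_pkts, window, lose_first_tx):
    """Round-based Go-Back-N; returns (delivered packets, transmissions)."""
    base = 0                 # oldest unACKed packet at the sender
    expected = 0             # receiver's expectedseqnum
    delivered = []           # packets handed to the upper layer, in order
    transmissions = 0
    already_lost = set()
    while base < num_pkts:
        last_ack = base - 1
        for seq in range(base, min(base + window, num_pkts)):
            transmissions += 1
            if seq in lose_first_tx and seq not in already_lost:
                already_lost.add(seq)
                continue                 # dropped by the channel
            if seq == expected:
                delivered.append(seq)    # in order: deliver
                expected += 1
            # out of order: discarded; either way the receiver ACKs
            # the highest in-order packet
            last_ack = expected - 1
        base = last_ack + 1              # cumulative ACK slides the window
    return delivered, transmissions


print(gbn_transfer(num_pkts=6, window=4, lose_first_tx={2}))
```

Losing pkt2 forces pkt3 to be sent twice even though it arrived intact the first time, which is the cost of having no receiver buffering.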
Selective repeat
• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #s
  – limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
Selective repeat in action
[Figure: sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8; pkt2 is lost (X)]
sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived
receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3
[Figure (a), no problem: sender sends pkt0, pkt1, pkt2; all arrive and are ACKed, so the receiver window moves on and the receiver will accept a packet with seq number 0. The sender then sends pkt3, which is lost (X), and a new pkt0, which the receiver accepts.]
[Figure (b), oops: sender sends pkt0, pkt1, pkt2; all arrive, but all three ACKs are lost (X X X). The receiver window has moved on and the receiver will accept a packet with seq number 0. The sender times out and retransmits the old pkt0, which the receiver accepts as new.]
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)
• receiver can't see sender side; receiver behavior identical in both cases: something's (very) wrong
Q: what relationship between seq # size and window size to avoid problem in (b)?
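One way to explore the question is to check when the sequence numbers the sender might still retransmit can collide with the sequence numbers the receiver is newly willing to accept. The sketch below models the worst case from scenario (b), where the receiver has advanced by a full window while the sender has not moved; `sr_ambiguous` is an illustrative name. The usual textbook answer it illustrates is that the window size must be at most half the sequence number space.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission could be mistaken for a new packet."""
    # Sender window has not moved: it may still resend these seq #s.
    old = {s % seq_space for s in range(0, window)}
    # Receiver window has advanced by a full window: it now accepts these.
    new = {s % seq_space for s in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(seq_space=4, window=3))   # the slide's example
print(sr_ambiguous(seq_space=4, window=2))
```

With 4 sequence numbers, a window of 3 is ambiguous and a window of 2 is not.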
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver
TCP segment structure
(each row is 32 bits wide)
• source port #, dest port #
• sequence number: counting by bytes of data (not segments)
• acknowledgement number: counting by bytes of data (not segments)
• head len, not used, flag bits U A P R S F, receive window
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
  – receive window: # bytes rcvr willing to accept
• checksum, Urg data pointer
  – checksum: Internet checksum (as in UDP)
• options (variable length)
• application data (variable length)
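The field layout can be made concrete by decoding the fixed 20-byte header with Python's standard `struct` module. This is a sketch: `parse_tcp_header` is an illustrative name, options and checksum verification are ignored, and the sample segment is a hand-built, hypothetical SYNACK rather than captured traffic.

```python
import struct


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (options ignored)."""
    (src_port, dst_port, seq, ack,
     offset_byte, flag_byte, rwnd, checksum, urg_ptr) = struct.unpack(
        "!HHIIBBHHH", segment[:20])       # "!" = network (big-endian) order
    flag_names = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]  # bit 0 upward
    flags = {name: bool((flag_byte >> bit) & 1)
             for bit, name in enumerate(flag_names)}
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_byte >> 4) * 4,   # head len is in 32-bit words
        "flags": flags,
        "rwnd": rwnd,
        "checksum": checksum,
        "urg_ptr": urg_ptr,
    }


# Hypothetical SYNACK: seq=300, acknum=101, SYN and ACK bits set, rwnd=4096.
sample = struct.pack("!HHIIBBHHH", 80, 12345, 300, 101, 5 << 4, 0x12, 4096, 0, 0)
header = parse_tcp_header(sample)
print(header["flags"], header["header_len"], header["rwnd"])
```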
TCP seq. numbers, ACKs
sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each showing header fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
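Byte stream in TCP
[Figure: an HTTP GET message (K bytes) in the byte stream, with a window of N bytes. A segment carrying M bytes starts at the 100th byte, so its TCP header has seq no = 100; the part of the message outside the window cannot be transmitted now.]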
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
• user types 'C': A sends Seq=42, ACK=79, data = 'C'
• host B ACKs receipt of 'C', echoes back 'C': B sends Seq=79, ACK=43, data = 'C'
• host A ACKs receipt of echoed 'C': A sends Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT
$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) versus time in seconds (axis 1 to 106).]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  $\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
  (typically, $\beta = 0.25$)
$\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
(EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
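The two moving averages and the timeout formula fit in a short class. Two details are assumptions not stated on the slides and follow RFC 6298: DevRTT starts at half the first sample, and DevRTT is updated before EstimatedRTT so that it uses the previous estimate. `RttEstimator` is an illustrative name and the samples are made-up values in milliseconds.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initial value

    def update(self, sample_rtt):
        # Deviation first, measured against the previous EstimatedRTT.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
for sample in (200.0, 110.0, 105.0):         # hypothetical SampleRTT values
    est.update(sample)
    print(round(est.estimated_rtt, 2), round(est.timeout_interval(), 2))
```

A single outlier sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the intended behavior.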
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative ACKs
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate ACKs
let's initially consider simplified TCP sender:
  – ignore duplicate ACKs
  – ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unACKed segment
  – expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ACK rcvd:
• if ACK acknowledges previously unACKed segments
  – update what is known to be ACKed
  – start timer if there are still unACKed segments
TCP sender (simplified)
(single state: wait for event; Λ = no event / no action)
initial transition, Λ:
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
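TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B replies ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100
premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100, then ACK=120
• A's timer expires before the ACKs arrive; A resends Seq=92, 8 bytes of data
• the ACKs arrive: SendBase=100, then SendBase=120
• B replies ACK=120 to the retransmitted segment; SendBase stays at 120
cumulative ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so A does not retransmit and next sends Seq=120, 15 bytes of data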
TCP ACK generation [RFC 5681]
event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unACKed segment with smallest seq #
  – likely that unACKed segment lost, so don't wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
timeo
ut ACK=100
ACK=100
ACK=100
TCP fast retransmit
74
Seq=100 20 bytes of data
Seq=100 20 bytes of data
3 DUP ACKs
TCP flow control
75
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
applicationOS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP flow control
bull receiver ldquoadvertisesrdquo free buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via socket
options (typical default is 4096 bytes)ndash many operating systems autoadjustRcvBuffer
bull sender limits amount of unacked(ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
76
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application process
receiver-side buffering
Connection Management
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing to
establish connection)bull agree on connection parameters
77
connection state ESTABconnection variables
seq client-to-serverserver-to-client
rcvBuffer sizeat serverclient
application
network
connection state ESTABconnection Variables
seq client-to-serverserver-to-client
rcvBuffer sizeat serverclient
application
network
Socket clientSocket = newSocket(hostnameport number)
Socket connectionSocket = welcomeSocketaccept()
TCP 3-way handshake
80
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data received ACK(y)
indicates client is live
SYNSENT
ESTAB
SYN RCVD
client stateCLOSED
server stateLISTEN
TCP 3-way handshake FSM
81
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP closing a connection
bull client server each close their side of connectionndash send TCP segment with FIN bit = 1
bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN
bull simultaneous FIN exchanges can be handled
82
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
TCP closing a connection
83
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but canreceive data
clientSocketclose()
client state server stateESTABESTAB
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestionbull informally ldquotoo many sources sending too much data
too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
bull a top-10 problem
85
Causescosts of congestion scenario 1
bull two senders two receivers
bull one router infinite buffers
bull output link capacity Rbull no retransmission
bull maximum per-connection throughput R2
86
unlimited shared output link buffers
Host A
original data lin
Host B
throughput lout
R2
R2
l out
lin R2
dela
ylin
v large delays as arrival rate lin approaches capacity
Causescosts of congestion scenario 2
bull one router finite buffers bull sender retransmission of timed-out packet
ndash application-layer input = application-layer output lin = lout
ndash transport-layer input includes retransmissions lrsquoin lin
87
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causescosts of congestion scenario 2
idealization perfect knowledgebull sender sends only when router
buffers available
88
finite shared output link buffers
lin original dataloutlin original data plus
retransmitted datacopy
free buffer space
R2
R2
l out
lin
Host B
A
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causescosts of congestion scenario 2
90
R2
R2lin
l out
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
A
lin loutlincopy
free buffer space
timeout
R2
R2lin
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
Host B
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 2
91
R2
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt
sect decreasing goodput
R2lin
Causescosts of congestion scenario 2
92
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 3
bull four sendersbull multihop pathsbull timeoutretransmit
93
Q what happens as lin and linrsquo
increase
finite shared output link buffers
Host A lout Host B
Host CHost D
lin original datalin original data plus
retransmitted data
A as red linrsquo increases all arriving
blue pkts at upper queue are dropped blue throughput g 0
another ldquocostrdquo of congestionv when packet dropped any ldquoupstream
transmission capacity used for that packet was wasted
Causescosts of congestion scenario 3
94
R2
R2
l out
linrsquo
Bandwidth wastage for packets dropped at the 2nd router
Offered load by Host A
Thro
ughp
ut b
y bl
ue tr
affic
Approaches towards congestion control
95
two broad approaches towards congestion control
end-end congestion control
bull no explicit feedback from network
bull congestion inferred from end-system observed loss delay
bull approach taken by TCP
network-assisted congestion control
bull routers provide feedback to end systemsndashsingle bit indicating
congestion (SNA DECbit TCPIP ECN ATM)
ndashexplicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
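The sawtooth described above can be reproduced with a few lines of Python. This is a minimal sketch, not TCP itself: the function name `aimd_trace`, the starting window, and the idea of naming the RTT rounds in which a loss is detected are all assumptions made for illustration, and cwnd is counted in whole MSS units.

```python
def aimd_trace(rounds, loss_rounds, start=10):
    """Return cwnd (in MSS units) observed at the start of each RTT round.

    loss_rounds is a hypothetical set of round indices in which a loss
    is detected; real TCP learns this from timeouts or duplicate ACKs.
    """
    cwnd = start
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd += 1                 # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(rounds=8, loss_rounds={3, 6}))
# -> [10, 11, 12, 13, 6, 7, 8, 4]
```

Plotting the returned list against the round index gives the sawtooth shape sketched on the slide.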
TCP Congestion Control: details
• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
$\text{rate} \approx \frac{\text{cwnd}}{RTT}$ bytes/sec
[Figure: sender sequence number space, showing last byte ACKed; bytes sent, not-yet ACKed ("in-flight"); last byte sent. cwnd spans the in-flight bytes.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT.]
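Why does incrementing cwnd per ACK double it per RTT? A small sketch makes the arithmetic visible. It assumes, for illustration only, that every segment sent in a round is ACKed within that round and that no loss occurs; the function name `slow_start_rounds` is hypothetical.

```python
def slow_start_rounds(rounds, mss=1):
    """Return cwnd (in units of mss) at the start of each RTT round."""
    cwnd = mss
    sizes = []
    for _ in range(rounds):
        sizes.append(cwnd)
        acks = cwnd // mss        # one ACK comes back per segment sent
        for _ in range(acks):
            cwnd += mss           # slow start: +1 MSS for every ACK
    return sizes


print(slow_start_rounds(5))  # -> [1, 2, 4, 8, 16]
```

The first three values match the one, two, and four segments shown in the figure.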
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initial state (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start
state: slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd ≥ ssthresh: Λ; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
state: congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
state: fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
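The three-state machine summarized above maps naturally onto a small class. The sketch below is an illustration under stated assumptions, not a TCP implementation: the class name `RenoSender` and its method names are hypothetical, cwnd and ssthresh are tracked in bytes, and the "transmit" and "retransmit" actions are omitted so that only the window bookkeeping remains.

```python
class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss=1000, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = mss              # cwnd = 1 MSS
        self.ssthresh = ssthresh     # 64 KB by default, as on the slide
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh                 # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                     # exponential growth
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd  # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                        # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"
```

For example, with `RenoSender(mss=1000, ssthresh=4000)`, three new ACKs bring cwnd to 4000 bytes and switch to congestion avoidance; three duplicate ACKs then set ssthresh to 2000 and cwnd to 5000 in fast recovery.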
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: sawtooth of window size oscillating between W/2 and W, with the average at 3/4 W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}}$
  - to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
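The two numbers in this example can be checked directly. The sketch below assumes MSS equals the 1500-byte segment size and converts it to bits so that the units match the 10 Gbps target; the function names are hypothetical.

```python
def window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain throughput_bps over one RTT."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(int(w), f"{loss:.1e}")
```

The window comes out at roughly 83,333 segments and the loss rate at roughly 2 in 10 billion segments, matching the slide.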
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2). Marked points: (X1, Y1) where X1 + Y1 = R, and (X2, Y2) where X2 = Y2.]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
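The parallel-connection arithmetic can be made explicit. The sketch assumes the idealized fairness goal holds, so that every connection on the bottleneck gets an equal share; `app_share` is a hypothetical helper name.

```python
from fractions import Fraction


def app_share(existing, opened):
    """Fraction of link rate R obtained by an app that opens `opened`
    connections on a link already carrying `existing` connections."""
    return Fraction(opened, existing + opened)


print(app_share(9, 1))    # 1/10 of R
print(app_share(9, 11))   # 11/20 of R, slightly more than R/2
```

With 11 connections the new app holds 11 of 20 equal shares, which the slide rounds to R/2.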
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 near the source and ECN=11 near the destination; the TCP ACK segment returning to the source carries ECE=1.]
rdt3.0: stop-and-wait operation
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
$U_{\text{sender}} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$
Pipelined protocols
pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat
Pipelining: increased utilization
[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]
3-packet pipelining increases utilization by a factor of 3!
$U_{\text{sender}} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.00081$
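Both utilization figures follow from one formula, which a short sketch can evaluate. The function name and parameter names are hypothetical; the inputs 0.008 and 30 are the L/R and RTT values implied by the fractions on the slides, in whatever common time unit the slides use.

```python
def sender_utilization(l_over_r, rtt, packets_in_flight=1):
    """Fraction of time the sender spends transmitting.

    packets_in_flight=1 is stop-and-wait; larger values model pipelining,
    assuming the whole batch fits inside one RTT + L/R cycle.
    """
    return packets_in_flight * l_over_r / (rtt + l_over_r)


u_stop_and_wait = sender_utilization(0.008, 30.0)
u_pipelined = sender_utilization(0.008, 30.0, packets_in_flight=3)
print(f"{u_stop_and_wait:.5f} {u_pipelined:.5f}")
```

Evaluated exactly, the 3-packet case is about 0.00080; the value 0.00081 on the slide appears to come from tripling the already rounded 0.00027.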
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
initial state (Λ):
    base = 1
    nextseqnum = 1
state: Wait
event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
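The sender FSM above translates almost line for line into Python. This is a sketch under assumptions rather than a reference implementation: the class name `GBNSender`, the `send` callback standing in for `udt_send`, and the boolean `timer_running` standing in for a real timer are all hypothetical, and checksums and corruption are left out.

```python
class GBNSender:
    def __init__(self, window, send):
        self.N = window
        self.send = send            # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> packet kept for retransmission
        self.timer_running = False

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False            # refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True       # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is done."""
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        # stop the timer if nothing is outstanding, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet from base to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])
```

As in the FSM, `on_ack` trusts the ACK number and assumes ACKs are not reordered in the channel.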
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
initial state (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)
state: Wait
event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
event: default
    udt_send(sndpkt)
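The receiver side is even smaller, since the only state is expectedseqnum and the last ACK sent. The sketch below uses hypothetical names (`GBNReceiver`, `rdt_rcv` returning the ACK number instead of sending a packet) and models corruption with a simple flag.

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0        # mirrors sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []      # data handed to the upper layer, in order

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number that is sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)      # deliver_data(data)
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt, so re-send the last ACK
        return self.last_ack
```

Feeding it packets 1, 3, 2, 3 yields ACKs 1, 1, 2, 3: the early copy of packet 3 is discarded and triggers a duplicate ACK for packet 1.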
GBN in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
sender: send pkt0
sender: send pkt1
sender: send pkt2 (X loss)
sender: send pkt3
sender: (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4
sender: rcv ack1, send pkt5
sender: ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1
receiver: receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2
receiver: rcv pkt3, deliver, send ack3
receiver: rcv pkt4, deliver, send ack4
receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
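The three receiver cases listed above can be sketched as a single method. The code is illustrative only: `SRReceiver` and `on_pkt` are hypothetical names, sequence numbers are treated as unbounded integers (no wraparound), and the method returns the ACK number it would send, or None when the packet is ignored.

```python
class SRReceiver:
    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}         # out-of-order packets awaiting delivery
        self.delivered = []      # data handed to the upper layer, in order

    def on_pkt(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n             # send ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n             # already delivered: ACK(n) again
        return None              # otherwise: ignore
```

With a window of 4 and arrivals 0, 1, 3, 4, 5, 2, packets 3 to 5 are buffered and all of 2 to 5 are delivered at once when packet 2 finally arrives.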
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8
sender: send pkt0
sender: send pkt1
sender: send pkt2 (X loss)
sender: send pkt3
sender: (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4
sender: rcv ack1, send pkt5
sender: record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2
sender: record ack4 arrived
sender: record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: two scenarios showing the sender window (after receipt) and receiver window (after receipt) over the sequence numbers 0 1 2 3 0 1 2.
(a) no problem: pkt0, pkt1, pkt2 are sent; the receiver will accept a packet with seq number 0; pkt3 is lost (X) and a new pkt0 is sent.
(b) oops!: pkt0, pkt1, pkt2 are sent and three losses occur (X X X); the sender times out and retransmits pkt0; the receiver will accept the packet with seq number 0.]
• receiver sees no difference in two scenarios!
• receiver can't see sender side; receiver behavior identical in both cases: something's (very) wrong!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
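One way to explore the question is to check, for a given sequence-number space and window size, whether an old packet that the sender may still retransmit can land inside the receiver's next window. The helper below is a hypothetical sketch of that check; it considers the worst case in which the receiver has accepted a full window while none of the ACKs reached the sender.

```python
def sr_window_is_safe(seq_space, window):
    """True if no retransmitted old seq # can be mistaken for a new packet."""
    # seq #s the sender may still retransmit (its current window)
    old = {i % seq_space for i in range(0, window)}
    # seq #s the receiver accepts as new after advancing a full window
    new = {i % seq_space for i in range(window, 2 * window)}
    return not (old & new)


print(sr_window_is_safe(4, 3))  # the slide's example: ambiguous
print(sr_window_is_safe(4, 2))  # window is half the seq # space: safe
```

The check passes exactly when the window is at most half the sequence-number space, which is the well-known condition usually given as the answer.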
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide, top to bottom:
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)]
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor
[Figure: outgoing segment from sender and incoming segment to sender, each with fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
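Byte stream in TCP
[Figure: an HTTP GET message (K bytes) placed in the byte stream, with a window of N bytes. A segment of M bytes starts at the 100th byte and carries a TCP header with seq. no. = 100; the part of the message outside the window cannot be transmitted now.]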
TCP seq. numbers, ACKs
simple telnet scenario
Host A: user types 'C'
Host A -> Host B: Seq=42, ACK=79, data = 'C'
Host B: host ACKs receipt of 'C', echoes back 'C'
Host B -> Host A: Seq=79, ACK=43, data = 'C'
Host A: host ACKs receipt of echoed 'C'
Host A -> Host B: Seq=43, ACK=80
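The numbers in the telnet exchange follow from two rules: the ACK field is the sequence number of the next byte expected, and a host's own sequence number advances by the bytes it has sent. A minimal sketch, with the hypothetical helper `reply` and segments modeled as (seq, ack, data) tuples:

```python
def reply(seg_seq, seg_len, my_seq, data=b""):
    """Build the (seq, ack, data) tuple answering a segment that carried
    seg_len bytes starting at sequence number seg_seq."""
    return (my_seq, seg_seq + seg_len, data)


first = (42, 79, b"C")                                   # A -> B
echo = reply(first[0], len(first[2]), my_seq=first[1], data=b"C")   # B -> A
final = reply(echo[0], len(echo[2]), my_seq=echo[1])                # A -> B
print(echo, final)
```

Here `my_seq` for each reply is taken from the ACK field of the segment being answered, since that field names the next byte the peer expects.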
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT (milliseconds, 100 to 350) versus time (seconds, 1 to 106), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT.]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
$\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
(typically, β = 0.25)
$\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
(EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
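The estimator formulas combine into a small stateful object. The sketch below is one possible arrangement, with assumptions labeled: the class name `RttEstimator` is hypothetical, the first sample initializes EstimatedRTT directly and DevRTT to half of it (the initialization given in RFC 6298), and DevRTT is updated before EstimatedRTT, which the slides do not specify.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumption: RFC 6298 style init

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # assumption: deviation uses the EstimatedRTT from before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
print(est.update(120.0))
```

Starting from a 100 ms sample, a 120 ms sample moves EstimatedRTT to 102.5 ms and DevRTT to 42.5 ms, giving a timeout of 272.5 ms.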
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
initial state (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum
state: wait for event
event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
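The simplified sender's three events can be sketched in Python as follows. This is an illustration with hypothetical names (`SimpleTcpSender`, a `send` callback standing in for passing a segment to IP, a boolean in place of a real timer), and it assumes that ACKs always fall on segment boundaries.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq, send):
        self.send = send               # hypothetical "pass segment to IP"
        self.next_seq = initial_seq    # NextSeqNum
        self.send_base = initial_seq   # SendBase
        self.unacked = {}              # seq # -> data not yet ACKed
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send((self.next_seq, data))
        self.next_seq += len(data)     # seq #s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)        # not-yet-acked segment, smallest seq #
        self.send((seq, self.unacked[seq]))
        self.timer_running = True      # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y         # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s in self.unacked if s < y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 and then receiving only ACK=120 moves SendBase straight to 120, so a lost ACK=100 causes no retransmission.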
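TCP: retransmission scenarios
lost ACK scenario
  Host A -> Host B: Seq=92, 8 bytes of data
  Host B -> Host A: ACK=100 (X lost)
  Host A: timeout
  Host A -> Host B: Seq=92, 8 bytes of data
  Host B -> Host A: ACK=100
premature timeout
  Host A: SendBase=92
  Host A -> Host B: Seq=92, 8 bytes of data
  Host A -> Host B: Seq=100, 20 bytes of data
  Host B -> Host A: ACK=100
  Host B -> Host A: ACK=120
  Host A: timeout (before the ACKs arrive)
  Host A -> Host B: Seq=92, 8 bytes of data
  Host A: SendBase=100, then SendBase=120
  Host B -> Host A: ACK=120
  Host A: SendBase=120
cumulative ACK
  Host A -> Host B: Seq=92, 8 bytes of data
  Host A -> Host B: Seq=100, 20 bytes of data
  Host B -> Host A: ACK=100 (X lost)
  Host B -> Host A: ACK=120 (arrives before timeout)
  Host A -> Host B: Seq=120, 15 bytes of data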
TCP ACK generation [RFC 5681]
event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
TCP receiver action: delayed ACK; wait up to 500ms for next segment; if no next segment, send ACK
event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments
event at receiver: arrival of out-of-order segment, higher-than-expected seq #; gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte
event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK.
  Host A -> Host B: Seq=92, 8 bytes of data
  Host A -> Host B: Seq=100, 20 bytes of data (X lost)
  Host B -> Host A: ACK=100
  Host B -> Host A: ACK=100
  Host B -> Host A: ACK=100
  Host B -> Host A: ACK=100
  Host A -> Host B: Seq=100, 20 bytes of data (resent after 3 DUP ACKs, before the timeout)]
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); buffered data is passed to the application process.]
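The bookkeeping behind rwnd can be sketched with a byte counter. The names `ReceiveBuffer` and `sendable` are hypothetical, and the model tracks only how many bytes are buffered, not their contents; rwnd is taken to be the free space in RcvBuffer, as in the figure.

```python
class ReceiveBuffer:
    def __init__(self, size=4096):
        self.size = size          # RcvBuffer
        self.buffered = 0         # bytes received but not yet read by the app

    @property
    def rwnd(self):
        """Free buffer space advertised to the sender."""
        return self.size - self.buffered

    def on_segment(self, nbytes):
        if nbytes > self.rwnd:
            raise ValueError("sender exceeded the advertised rwnd")
        self.buffered += nbytes

    def app_read(self, nbytes):
        n = min(nbytes, self.buffered)
        self.buffered -= n
        return n


def sendable(rwnd, in_flight):
    """Bytes the sender may still put in flight without overflowing rwnd."""
    return max(0, rwnd - in_flight)
```

After 3000 bytes arrive in a 4096-byte buffer, rwnd drops to 1096; once the application reads 2000 bytes, it rises to 3096.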
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server each hold, between application and network:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]
client: Socket clientSocket = newSocket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED -> SYNSENT -> ESTAB
server state: LISTEN -> SYN RCVD -> ESTAB
client: choose init seq num, x; send TCP SYN msg
client -> server: SYNbit=1, Seq=x
server: choose init seq num, y; send TCP SYNACK msg, acking SYN
server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
client -> server: ACKbit=1, ACKnum=y+1
server: received ACK(y) indicates client is live
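The relationship between the three segments can be written down as data. The sketch below is purely illustrative: the function name and the dictionary keys are hypothetical, and only the fields named on the slide are modeled.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK, and ACK segments for initial seq nums x and y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    return [syn, synack, ack]


for segment in three_way_handshake(x=100, y=300):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is what lets both ends confirm that the peer is live.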
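TCP 3-way handshake: FSM
client: closed -> SYN sent -> ESTAB
• closed -> SYN sent: Socket clientSocket = newSocket("hostname", "port number"); send SYN(seq=x)
• SYN sent -> ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
server: closed -> listen -> SYN rcvd -> ESTAB
• closed -> listen: Socket connectionSocket = welcomeSocket.accept(); Λ (no action)
• listen -> SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd -> ESTAB: receive ACK(ACKnum=y+1); Λ (no action)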
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Figure: closing a connection; client state and server state both start in ESTAB.
  client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
  server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
  client: enters FIN_WAIT_2 (wait for server close)
  server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
  client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
  server: enters CLOSED]
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity
[Figure: Host A and Host B send original data at rate λin into a router with unlimited shared output link buffers; throughput is λout. Plots: λout versus λin (both up to R/2) and delay versus λin (up to R/2).]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin
[Figure: Host A and Host B; finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: Host A keeps a copy of each packet; finite shared output link buffers, shown once with free buffer space and once with no buffer space; λin: original data; λ'in: original data, plus retransmitted data; λout. Plot: λout versus λin, both up to R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: Host A keeps a copy of each packet and sends to Host B through a router with limited free buffer space; λin: original data; λ'in: original data, plus retransmitted data; λout. Plot: λout versus λin, both up to R/2. When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?).]
Pipelined protocols
pipelining sender allows multiple ldquoin-flightrdquo yet-to-be-acknowledged pktsndash range of sequence numbers must be increasedndash buffering at sender andor receiver
43
bull two generic forms of pipelined protocols Go-Back-N Selective Repeat
Pipelining increased utilization
44
first packet bit transmitted t = 0sender receiver
RTT
last bit transmitted t = L R
first packet bit arriveslast packet bit arrives send ACK
ACK arrives send next packet t = RTT + L R
last bit of 2nd packet arrives send ACKlast bit of 3rd packet arrives send ACK
3-packet pipelining increasesutilization by a factor of 3
U sender =
0024 30008
= 000081 3L R RTT + L R
=
Pipelined protocols: overview
Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets
Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
single state: Wait

initial transition (Λ):
    base = 1
    nextseqnum = 1

event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
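The GBN sender FSM can be sketched as a small class. This is a minimal illustration, not the course's reference implementation: the class name, the `sent_log` list standing in for udt_send, and the boolean `timer_running` flag standing in for a real timer are all assumptions. Unlike the FSM, the sketch leaves the timer untouched on a duplicate ACK.

```python
class GBNSender:
    """Go-Back-N sender state, using the FSM's base/nextseqnum variables."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1              # oldest unacknowledged sequence number
        self.nextseqnum = 1        # next sequence number to assign
        self.sndpkt = {}           # buffered copies of unacked packets
        self.timer_running = False
        self.sent_log = []         # every sequence number handed to the channel

    def rdt_send(self, data):
        """Return True if data was sent, False if the window is full."""
        if self.nextseqnum >= self.base + self.N:
            return False                        # refuse_data(data)
        self.sndpkt[self.nextseqnum] = data
        self.sent_log.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True           # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: all packets up to and including acknum are done."""
        if acknum < self.base:
            return                              # duplicate ACK, nothing new
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet from base to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent_log.append(seq)
```

A single timer and a single retransmit loop are all the sender needs, which is the main appeal of Go-Back-N over Selective Repeat.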
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

single state: Wait

initial transition (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

event: default
    udt_send(sndpkt)
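The receiver side is simpler still. The sketch below assumes packets are identified by an integer sequence number and that corruption is signalled by a flag; the class and attribute names are hypothetical.

```python
class GBNReceiver:
    """Go-Back-N receiver: delivers only in-order packets, never buffers."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0          # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []        # data handed to the upper layer, in order

    def on_packet(self, seqnum, data, corrupt=False):
        """Process one arriving packet and return the ACK number to send."""
        if not corrupt and seqnum == self.expectedseqnum:
            self.delivered.append(data)         # deliver_data(data)
            self.last_ack = seqnum
            self.expectedseqnum += 1
        # default case: re-send the ACK for the highest in-order packet
        return self.last_ack
```

Feeding it packets 1, 3, 2, 3 yields ACKs 1, 1, 2, 3: the out-of-order packet 3 is discarded and triggers a duplicate ACK.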
GBN in action
sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost (X) on its first transmission
sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4
sender: rcv ack1, send pkt5
sender: ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1
receiver: receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2
receiver: rcv pkt3, deliver, send ack3
receiver: rcv pkt4, deliver, send ack4
receiver: rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets
Selective repeat: sender, receiver windows
Selective repeat
sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
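The three receiver cases above can be sketched as follows. To keep the example short it assumes sequence numbers that never wrap around (plain integers starting at 0); the class name and the `delivered` list are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver with window size N (no seq # wraparound)."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}           # out-of-order packets waiting for a gap to fill
        self.delivered = []        # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the sequence number to ACK, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n               # already delivered: re-ACK so the sender can advance
        return None                # otherwise: ignore
```

The second branch matters: if an ACK was lost, the sender retransmits a packet the receiver has already delivered, and only a repeated ACK lets the sender's window move on.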
Selective repeat in action
sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost (X) on its first transmission
sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4
sender: rcv ack1, send pkt5
sender: record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2
sender: record ack4 arrived
sender: record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2; the ACKs arrive and the sender window advances; sender sends pkt3 (lost, X) and a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
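One way to explore the question is to check when the receiver's next window reuses sequence numbers from the window it has just finished. The sketch below is an illustration with a hypothetical function name. It is consistent with the standard textbook answer that the window size must be at most half the sequence number space.

```python
def sr_window_is_safe(seq_space, window):
    """True if, after the receiver advances by a full window, none of the
    new window's sequence numbers collide with the previous window's
    (which the sender may still retransmit if all ACKs were lost)."""
    old_window = {i % seq_space for i in range(0, window)}
    new_window = {i % seq_space for i in range(window, 2 * window)}
    return old_window.isdisjoint(new_window)

print(sr_window_is_safe(4, 3))   # the slide's example: collides on 0 and 1
print(sr_window_is_safe(4, 2))   # window of 2 with 4 sequence numbers
```

For the slide's example (4 sequence numbers, window 3) the check fails, matching scenario (b), while a window of 2 passes.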
TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)
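The fixed 20-byte part of this layout can be decoded with Python's standard `struct` module. This is a sketch for illustration: the function name and dictionary keys are hypothetical, options are skipped, and only the six flag bits named on the slide are extracted.

```python
import struct

def parse_tcp_header(segment):
    """Decode the fixed 20-byte TCP header at the start of `segment`."""
    (src_port, dst_port, seq, ack,
     offset_flags, rwnd, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,                                  # counts bytes, not segments
        "ack": ack,
        "header_len": (offset_flags >> 12) * 4,      # head len is in 32-bit words
        "URG": bool(offset_flags & 0x20),
        "ACK": bool(offset_flags & 0x10),
        "PSH": bool(offset_flags & 0x08),
        "RST": bool(offset_flags & 0x04),
        "SYN": bool(offset_flags & 0x02),
        "FIN": bool(offset_flags & 0x01),
        "rwnd": rwnd,                                # bytes receiver will accept
        "checksum": checksum,
        "urg_ptr": urg_ptr,
    }

# Build a SYN+ACK header by hand and decode it again.
raw = struct.pack("!HHIIHHHH", 12345, 80, 42, 79, (5 << 12) | 0x12, 4096, 0, 0)
print(parse_tcp_header(raw))
```

The `!` prefix selects network (big-endian) byte order, which is how the fields appear on the wire.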
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor
[Figure: sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable. The outgoing segment from sender carries the sequence number; the incoming segment to sender carries the acknowledgement number (A) and rwnd.]
Byte stream in TCP
[Figure labels: window of N bytes; HTTP Get Message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; part of the HTTP Get Message that cannot be transmitted now]
TCP seq. numbers, ACKs
simple telnet scenario
Host A: user types 'C'
  A → B: Seq=42, ACK=79, data = 'C'
Host B: host ACKs receipt of 'C', echoes back 'C'
  B → A: Seq=79, ACK=43, data = 'C'
Host A: host ACKs receipt of echoed 'C'
  A → B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) versus time in seconds]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4·DevRTT
(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
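The EstimatedRTT, DevRTT and TimeoutInterval formulas can be combined into a small estimator. Two details in this sketch are assumptions not stated on the slides: the initial values (EstimatedRTT set to the first sample, DevRTT to half of it) and the update order (DevRTT is updated first, using the previous EstimatedRTT). Both follow the convention of RFC 6298; the class name is hypothetical.

```python
class RttEstimator:
    """Exponentially weighted moving averages for RTT and its deviation."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initialisation

    def update(self, sample_rtt):
        """Fold one SampleRTT (not from a retransmitted segment) into the estimates."""
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        """EstimatedRTT plus a safety margin of four deviations."""
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(100.0)        # milliseconds
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

Because α is small, a single outlier sample moves EstimatedRTT only slightly, while DevRTT widens the timeout when samples fluctuate.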
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
single state: wait for event

initial transition (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
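The simplified sender can be sketched in a few lines. As in the slide, duplicate ACKs, flow control and congestion control are ignored. The class name, the `wire` list standing in for "pass segment to IP", and the boolean timer flag are assumptions made for this illustration.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and one retransmission timer."""

    def __init__(self, initial_seq_num):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = []            # (seq, data) pairs, oldest first
        self.timer_running = False
        self.wire = []               # every segment passed to IP, in order

    def send(self, data):
        segment = (self.next_seq_num, data)
        self.unacked.append(segment)
        self.wire.append(segment)
        self.next_seq_num += len(data)   # seq #s count bytes
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            self.wire.append(self.unacked[0])   # smallest seq # only
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y                  # bytes up to y-1 are ACKed
            self.unacked = [(s, d) for (s, d) in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)
```

A single ACK with value 120 acknowledges both the 8-byte segment at 92 and the 20-byte segment at 100, even if the ACK for 100 never arrived.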
TCP: retransmission scenarios
lost ACK scenario:
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100 (lost, X)
  A: timeout
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100
premature timeout:
  A: SendBase=92
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100
  B → A: ACK=120
  A: timeout (before the ACKs arrive)
  A → B: Seq=92, 8 bytes of data
  A: SendBase=100, then SendBase=120
  B → A: ACK=120
  A: SendBase=120
cumulative ACK:
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100 (lost, X)
  B → A: ACK=120
  A → B: Seq=120, 15 bytes of data
TCP ACK generation [RFC 5681]
event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  TCP receiver action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK
event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
  TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments
event at receiver: arrival of out-of-order segment, higher-than-expected seq #; gap detected
  TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte
event at receiver: arrival of segment that partially or completely fills gap
  TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
[Figure: fast retransmit after sender receipt of triple duplicate ACK]
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data (lost, X)
  B → A: ACK=100
  B → A: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
  A → B: Seq=100, 20 bytes of data (resent before the timeout expires)
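Counting duplicate ACKs takes only a little state. The sketch below is illustrative; the class name and the choice to signal the retransmit through a boolean return value are assumptions.

```python
class FastRetransmitDetector:
    """Signals a fast retransmit on the third duplicate of the same ACK."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack):
        """Return True exactly when this ACK is the threshold-th duplicate."""
        if ack == self.last_ack:
            self.dup_count += 1
            return self.dup_count == self.threshold
        # a new ACK value resets the duplicate counter
        self.last_ack = ack
        self.dup_count = 0
        return False

detector = FastRetransmitDetector()
for ack in (100, 100, 100, 100):
    if detector.on_ack(ack):
        print("fast retransmit segment starting at byte", ack)
```

The first ACK=100 is the original and the next three are duplicates, so the retransmit fires on the fourth ACK=100, as in the figure.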
TCP flow control
[Figure: receiver protocol stack: segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads]
application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process]
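The bookkeeping behind rwnd can be sketched as two small functions. The variable names LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked follow the usual textbook presentation and are assumptions here, since this slide does not spell them out; the function names are hypothetical.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free space in the receive buffer: what the receiver advertises as rwnd."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered

def bytes_sender_may_send(last_byte_sent, last_byte_acked, rwnd):
    """How many more bytes the sender may put in flight without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

rwnd = advertised_window(4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                     # free buffer space in bytes
print(bytes_sender_may_send(5000, 4000, rwnd))  # remaining allowance
```

As long as the sender keeps in-flight data within rwnd, the data already buffered plus the data in flight can never exceed RcvBuffer.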
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with an application above the network, holding connection state: ESTAB and connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client]
client: Socket clientSocket = new Socket(hostname, portNumber);
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
client: choose init seq num, x; send TCP SYN msg
  client → server: SYNbit=1, Seq=x
  client state: SYNSENT
server: choose init seq num, y; send TCP SYNACK msg, acking SYN
  server → client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
  server state: SYN RCVD
client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
  client → server: ACKbit=1, ACKnum=y+1
  client state: ESTAB
server: received ACK(y) indicates client is live
  server state: ESTAB
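The sequence and acknowledgement numbers exchanged in the handshake can be written out directly. The sketch models each segment as a dictionary; the function name and the dictionary keys are hypothetical. The modulo reflects the 32-bit width of TCP sequence numbers.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32 bits wide

def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq nums x and y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}      # acks the client's SYN
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}  # acks the server's SYN
    return syn, synack, ack

for segment in three_way_handshake(1000, 5000):
    print(segment)
```

Each SYN consumes one sequence number even though it carries no data, which is why both acknowledgement numbers are the peer's initial value plus one.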
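TCP 3-way handshake: FSM
client: closed → SYN sent
    trigger: Socket clientSocket = new Socket(hostname, portNumber);
    action: send SYN(seq=x)
client: SYN sent → ESTAB
    trigger: receive SYNACK(seq=y, ACKnum=x+1)
    action: send ACK(ACKnum=y+1)
server: closed → listen
    trigger: Socket connectionSocket = welcomeSocket.accept();
server: listen → SYN rcvd
    trigger: receive SYN(x)
    action: send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
server: SYN rcvd → ESTAB
    trigger: receive ACK(ACKnum=y+1)
    action: Λ (none)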
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
TCP: closing a connection
client state: ESTAB; server state: ESTAB
client: clientSocket.close()
  client → server: FINbit=1, seq=x
  client state: FIN_WAIT_1 (can no longer send but can receive data)
  server → client: ACKbit=1, ACKnum=x+1
  server state: CLOSE_WAIT (can still send data)
  client state: FIN_WAIT_2 (wait for server close)
  server → client: FINbit=1, seq=y
  server state: LAST_ACK (can no longer send data)
  client → server: ACKbit=1, ACKnum=y+1
  client state: TIMED_WAIT (timed wait for 2·max segment lifetime)
  server state: CLOSED
  client state: CLOSED
The "Two Army Problem"
Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity
[Figure: Host A sends original data at rate λ_in; Host B receives throughput λ_out; unlimited shared output link buffers. Plots: λ_out versus λ_in, both up to R/2; delay versus λ_in, up to R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in
[Figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) into finite shared output link buffers; Host B receives λ_out]
idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: λ_out versus λ_in, both up to R/2]
idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)
realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0
[Figure: Hosts A, B, C and D with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]
another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
[Plot: throughput by blue traffic (λ_out, up to R/2) versus offered load by Host A (λ'_in, up to R/2); bandwidth wastage for packets dropped at the 2nd router]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior, probing for bandwidth: cwnd (TCP sender congestion window size) versus time; additively increase window size ... until loss occurs (then cut window in half)]
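The saw tooth can be reproduced with a few lines. The sketch measures cwnd in whole MSS units and takes the set of RTT rounds in which a loss is detected as an input, since when losses occur depends on the network. The function name and parameters are hypothetical.

```python
def aimd_trace(start_cwnd, rounds, loss_rounds):
    """Return cwnd (in MSS) at the start of each RTT round.

    In a round listed in `loss_rounds` the window is halved afterwards;
    in every other round it grows by 1 MSS."""
    cwnd = start_cwnd
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(1, cwnd // 2)     # multiplicative decrease
        else:
            cwnd += 1                    # additive increase
    return trace

print(aimd_trace(10, 6, {3}))
```

Linear growth followed by an abrupt halving gives the characteristic saw tooth when the trace is plotted against time.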
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd/RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight"), spanning cwnd | last byte sent]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
initial transition (Λ) → slow start:
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0

state: slow start
    new ACK:
        cwnd = cwnd+MSS
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    cwnd > ssthresh → congestion avoidance:
        Λ
    dupACKcount == 3 → fast recovery:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
    timeout → slow start:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment

state: congestion avoidance
    new ACK:
        cwnd = cwnd + MSS·(MSS/cwnd)
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    dupACKcount == 3 → fast recovery:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
    timeout → slow start:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment

state: fast recovery
    duplicate ACK:
        cwnd = cwnd + MSS
        transmit new segment(s), as allowed
    new ACK → congestion avoidance:
        cwnd = ssthresh
        dupACKcount = 0
    timeout → slow start:
        ssthresh = cwnd/2
        cwnd = 1
        dupACKcount = 0
        retransmit missing segment
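The three-state machine can be sketched as a class that tracks only cwnd, ssthresh and the duplicate-ACK counter. This is a simplified illustration and not a full TCP: the class name is hypothetical, cwnd and ssthresh are measured in units of MSS (so the slide's "ssthresh + 3" becomes three segments), and the default ssthresh of 64 segments is an assumption in place of the slide's 64 KB. Retransmission and transmission of new segments are left out.

```python
class RenoCongestionControl:
    """cwnd bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh                 # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                     # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)   # about 1 MSS per RTT
        self.dup_acks = 0

    def on_duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow start"
```

A timeout always returns the sender to slow start with cwnd = 1 MSS, while three duplicate ACKs only halve the window, reflecting that the network is still delivering some segments.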
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4)·W/RTT bytes/sec
[Figure: window size oscillating between W/2 and W]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22·MSS / (RTT·√L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10 (a very small loss rate!)
• new versions of TCP for high-speed
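Both numbers in this example can be recomputed from the formula. The sketch takes MSS in bytes and rates in bits per second; the function names are hypothetical.

```python
import math

def required_window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with rtt_s."""
    return rate_bps * rtt_s / (mss_bytes * 8)

def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput 1.22 * MSS / (RTT * sqrt(L)), converted to bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))

def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Loss probability L at which the formula yields rate_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2

print(required_window_segments(10e9, 0.1, 1500))   # about 83,333 segments
print(required_loss_rate(10e9, 0.1, 1500))         # about 2e-10
```

Because throughput scales with 1/√L, each tenfold increase in the target rate requires a hundredfold lower loss rate, which is what motivates the high-speed TCP variants.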
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, each axis up to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2). (X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2.]
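The convergence argument can be checked numerically. The sketch assumes an idealized model: both connections see the same RTT, both detect loss in the same round whenever their combined rate exceeds the capacity R, and rates change by one unit per round. The function name is hypothetical.

```python
def aimd_two_flows(x1, x2, capacity, steps):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(steps):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2      # both halve: the gap between them halves
        else:
            x1, x2 = x1 + 1, x2 + 1      # both add 1: the gap is unchanged
    return x1, x2

a, b = aimd_two_flows(10.0, 80.0, capacity=100.0, steps=300)
print(a, b, abs(a - b))
```

Additive increase moves the operating point parallel to the equal-share line, and every multiplicative decrease halves the distance to it, so the two rates approach each other regardless of where they start.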
Fairness (more)
Fairness and UDPbull multimedia apps often
do not use TCPndash do not want rate
throttled by congestion control
bull instead use UDPndash send audiovideo at
constant rate tolerate packet loss
Fairness parallel TCP connections
bull application can open multiple parallel connections between two hosts
bull web browsers do this bull eg link of rate R with
9 existing connectionsndash new app asks for 1 TCP gets
rate R10ndash new app asks for 11 TCPs
gets R2
106
network-assisted congestion controlsect two bits in IP header (ToS field) marked by network router to indicate
congestionsect congestion indication carried to receiving hostsect receiver (seeing congestion indication in IP datagram) ) sets ECE bit
on receiver-to-sender ACK segment to notify sender of congestion
Explicit Congestion Notification (ECN)
sourceapplicationtransportnetworklinkphysical
destinationapplicationtransportnetworklinkphysical
ECN=00 ECN=11
ECE=1
IP datagram
TCP ACK segment
Pipelining increased utilization
44
first packet bit transmitted t = 0sender receiver
RTT
last bit transmitted t = L R
first packet bit arriveslast packet bit arrives send ACK
ACK arrives send next packet t = RTT + L R
last bit of 2nd packet arrives send ACKlast bit of 3rd packet arrives send ACK
3-packet pipelining increasesutilization by a factor of 3
U sender =
0024 30008
= 000081 3L R RTT + L R
=
Pipelined protocols overview
Go-back-Nbull sender can have up to
N unacked packets in pipeline
bull receiver only sends cumulative ackndash Doesnrsquot ack packet if
therersquos a gapbull sender has timer for
oldest unacked packetndash when timer expires
retransmit all unackedpackets
Selective Repeatbull sender can have up to
N unacked packets in pipeline
bull rcvr sends individual ackfor each packet
bull sender maintains timer for each unacked packetndash when timer expires
retransmit only that unacked packet
45
Go-Back-N sender
bull k-bit seq in pkt headerbull ldquowindowrdquo of up to N consecutive unacked pkts allowed
46
v ACK(n) ACKs all pkts up to including seq n - ldquocumulative ACKrdquosect may receive duplicate ACKs (see receiver)
v timer for oldest in-flight pktv timeout(n) retransmit packet n and all higher seq pkts in
window
GBN sender extended FSM
47
Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])
timeout
rdt_send(data)
if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)
start_timernextseqnum++
else
refuse_data(data)
base = getacknum(rcvpkt)+1If (base == nextseqnum)
stop_timerelse
start_timer
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)
base=1nextseqnum=1
rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)
L
GBN sender extended FSM
48
Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])
timeout
rdt_send(data)
if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)
start_timernextseqnum++
else
refuse_data(data)
base = getacknum(rcvpkt)+1If (base == nextseqnum)
stop_timerelse
start_timer
rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)
base=1nextseqnum=1
rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)
L
GBN receiver extended FSM
ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum
bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq
49
Wait
udt_send(sndpkt)default
rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)
extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++
expectedseqnum=1sndpkt = make_pkt(0ACKchksum)
L
GBN receiver extended FSM
ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum
bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq
50
Wait
udt_send(sndpkt)default
rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)
extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++
expectedseqnum=1sndpkt = make_pkt(0ACKchksum)
L
GBN in action
51
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 discard (re)send ack1rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5
Xloss
receive pkt4 discard (re)send ack1
receive pkt5 discard (re)send ack1
rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5
ignore duplicate ACK
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
GBN in action
52
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 discard (re)send ack1rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5
Xloss
receive pkt4 discard (re)send ack1
receive pkt5 discard (re)send ack1
rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5
ignore duplicate ACK
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
Selective repeat
bull receiver individually acknowledges all correctly received packetsndash buffers packets as needed for eventual in-order delivery to
upper layer
bull sender only resends packets for which ACK not receivedndash sender timer for each unACKed packet
bull sender windowndash N consecutive seq rsquosndash limits seq s of sent unACKed packets
53
Selective repeat sender receiver windows
54
Selective repeat
data from abovebull if next available seq in
window send pkt
timeout(n)bull resend pkt n restart timer
ACK(n) in [sendbase sendbase+N-1]
bull mark pkt n as receivedbull if n smallest unACKed pkt
advance window base to next unACKed seq
55
senderpkt n in [rcvbase rcvbase+N-1]
v send ACK(n)v out-of-order bufferv in-order deliver (also
deliver buffered in-order pkts) advance window to next not-yet-received pkt
pkt n in [rcvbase-N rcvbase-1]
v ACK(n)otherwisev ignore
receiver
Selective repeat in action
56
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 buffer send ack3rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2
Xloss
receive pkt4 buffer send ack4
receive pkt5 buffer send ack5
rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2
record ack3 arrived
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
record ack4 arrivedrecord ack5 arrived
Q what happens when ack2 arrives
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3

(a) no problem: sender sends pkt0, pkt1, pkt2; all three are received and ACKed; sender window advances, pkt3 is sent but lost (X), then a new pkt0 is sent. Receiver will accept packet with seq number 0.
(b) oops!: sender sends pkt0, pkt1, pkt2; all three are received but all three ACKs are lost (X X X); sender times out and retransmits the old pkt0. Receiver will accept packet with seq number 0.

• receiver can't see sender side: receiver behavior identical in both cases; something's (very) wrong
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
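The standard answer to the question above is that the window size must be at most half the size of the sequence number space. The sketch below checks this by brute force; the function name `sr_is_ambiguous` and the worst-case model (every ACK lost after a full window was received) are illustrative assumptions, not taken from the slides.

```python
def sr_is_ambiguous(seq_space, window):
    """True if an old retransmission can land inside the receiver's new window.

    Worst case: the receiver got packets 0..window-1 and advanced its window,
    but every ACK was lost, so the sender may still retransmit any of them.
    """
    old = {s % seq_space for s in range(0, window)}           # may be resent
    new = {s % seq_space for s in range(window, 2 * window)}  # receiver expects
    return bool(old & new)


print(sr_is_ambiguous(seq_space=4, window=3))  # True: the situation in (b)
print(sr_is_ambiguous(seq_space=4, window=2))  # False
```

With 4 sequence numbers, a window of 3 lets the retransmitted pkt0 fall inside the receiver's new window, while a window of 2 keeps the old and new ranges disjoint.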
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
(each header row is 32 bits wide)
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how does receiver handle out-of-order segments?
  - A: TCP spec doesn't say; up to implementor

[Figure: sender sequence number space with window size N, divided into: sent, ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable. The outgoing segment from sender carries the sequence number; the incoming segment to sender carries the acknowledgement number (A) and rwnd.]
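Byte stream in TCP
[Figure: an HTTP GET message (K bytes) viewed as a byte stream with a window of N bytes. A segment of M bytes that starts at the 100th byte carries a TCP header with seq. no. = 100; the part of the message outside the window cannot be transmitted now.]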
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B)
1. User types 'C': Host A sends Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80
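The ACK numbers in the telnet scenario follow from one rule: the ACK field names the next byte expected, which is the sequence number of the received segment plus its payload length. A minimal sketch (the helper name `next_ack` is hypothetical, and 32-bit wraparound is ignored):

```python
def next_ack(seq, payload_len):
    """ACK number acknowledging a segment: the next byte expected."""
    return seq + payload_len


# Host A sends 'C' (1 byte) with Seq=42, so Host B replies with ACK=43.
print(next_ack(42, 1))  # 43
# Host B echoes 'C' (1 byte) with Seq=79, so Host A replies with ACK=80.
print(next_ack(79, 1))  # 80
```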
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) versus time in seconds, measured from gaia.cs.umass.edu to fantasia.eurecom.fr]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 * DevRTT
(EstimatedRTT is the estimated RTT; 4 * DevRTT is the "safety margin")
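The EstimatedRTT, DevRTT and TimeoutInterval formulas can be combined into a small estimator. The sketch below is illustrative only: the class name `RttEstimator`, the initial values (EstimatedRTT set to the first sample, DevRTT to half of it) and the choice to update DevRTT before EstimatedRTT are assumptions, not something the slides specify.

```python
class RttEstimator:
    """EWMA-based RTT estimator (sketch; times in milliseconds)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumed starting value

    def update(self, sample_rtt):
        # DevRTT uses the EstimatedRTT from before this sample (an assumption).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
est.update(200.0)
print(est.estimated_rtt)     # 112.5
print(est.dev_rtt)           # 62.5
print(est.timeout_interval)  # 362.5
```

A single 200 ms sample moves the estimate only an eighth of the way from 100 ms, but the deviation term widens the timeout considerably, which is the intended "safety margin" effect.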
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
single state: wait for event

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
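The three events of the simplified sender translate fairly directly into code. The following is a sketch under stated assumptions: the class `SimpleTcpSender`, the `sent` log standing in for "pass segment to IP", and the boolean `timer_running` standing in for a real timer are all hypothetical names introduced for illustration.

```python
class SimpleTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, flow or congestion control."""

    def __init__(self, initial_seq=0):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}           # seq # -> payload of not-yet-acked segments
        self.timer_running = False
        self.sent = []              # log of (seq, payload) handed to "IP"

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.sent.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1 is the last ACKed byte
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


s = SimpleTcpSender(initial_seq=92)
s.on_app_data(b"x" * 8)    # Seq=92, 8 bytes
s.on_app_data(b"y" * 20)   # Seq=100, 20 bytes
s.on_timeout()             # resends Seq=92 only
s.on_ack(100)
s.on_ack(120)
print(s.send_base, s.timer_running)  # 120 False
```

Only the oldest unacknowledged segment is resent on timeout, and a cumulative ACK of 120 clears both outstanding segments at once.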
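TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
1. A sends Seq=92, 8 bytes of data
2. B sends ACK=100, which is lost (X)
3. A times out and resends Seq=92, 8 bytes of data
4. B sends ACK=100 again

premature timeout (Host A, Host B)
1. A sends Seq=92, 8 bytes of data (SendBase=92)
2. A sends Seq=100, 20 bytes of data
3. B sends ACK=100, then ACK=120
4. A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
5. ACK=100 arrives (SendBase=100), then ACK=120 arrives (SendBase=120)
6. B answers the duplicate segment with ACK=120 (SendBase=120)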
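TCP: retransmission scenarios

cumulative ACK (Host A, Host B)
1. A sends Seq=92, 8 bytes of data
2. A sends Seq=100, 20 bytes of data
3. B sends ACK=100, which is lost (X)
4. B sends ACK=120, which arrives before A's timeout
5. A sends Seq=120, 15 bytes of data; nothing is retransmitted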
TCP ACK generation [RFC 5681]

1. event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK
2. event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
   TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments
3. event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected
   TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte
4. event at receiver: arrival of segment that partially or completely fills gap
   TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
1. A sends Seq=92, 8 bytes of data
2. A sends Seq=100, 20 bytes of data, which is lost (X)
3. B sends ACK=100 for the first segment, then three more ACK=100 (3 DUP ACKs)
4. A resends Seq=100, 20 bytes of data, before its timeout expires
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers ...
• ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code in the OS into the TCP socket receiver buffers, which the application process reads from.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
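The receiver-side bookkeeping behind rwnd can be sketched as follows. The slides only say that rwnd is the free buffer space; the counters `last_byte_rcvd` and `last_byte_read`, and both function names, are assumptions introduced here to make that concrete.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """rwnd: free space left in the receive buffer, in bytes."""
    buffered = last_byte_rcvd - last_byte_read   # data the app has not read yet
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                     # 2096
print(sender_may_send(5000, 4000, rwnd, 1000))  # True
print(sender_may_send(5000, 4000, rwnd, 1097))  # False
```

As the application reads more slowly, `buffered` grows, rwnd shrinks, and the sender is throttled accordingly.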
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

state held at each side (application, above the network):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED; server state: LISTEN
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
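The three segments of the handshake can be written out as plain data to make the x+1 and y+1 relationships explicit. This is a sketch only: the function name and dictionary field names are hypothetical, and the modulo reflects the fact that TCP sequence numbers are 32-bit values.

```python
SEQ_MODULUS = 2 ** 32   # TCP sequence numbers are 32 bits wide


def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq nums x and y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_MODULUS}
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_MODULUS}
    return [syn, synack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```

The SYN consumes one sequence number even though it carries no data, which is why each side acknowledges the other's initial sequence number plus one.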
TCP 3-way handshake: FSM
• closed → listen (server): Socket connectionSocket = welcomeSocket.accept();
• closed → SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); action: send SYN(seq=x)
• listen → SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1), no action (Λ)
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection
client state: ESTAB; server state: ESTAB
1. client calls clientSocket.close() and sends FINbit=1, seq=x; client state FIN_WAIT_1: can no longer send but can receive data
2. server sends ACKbit=1, ACKnum=x+1; server state CLOSE_WAIT: can still send data; client state FIN_WAIT_2: wait for server close
3. server sends FINbit=1, seq=y; server state LAST_ACK: can no longer send data
4. client sends ACKbit=1, ACKnum=y+1; client state TIMED_WAIT: timed wait for 2 * max segment lifetime, then CLOSED
5. server receives the ACK; server state CLOSED
The "Two Army Problem"

Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A sends original data at rate λin and Host B receives throughput λout through unlimited shared output link buffers. Plots: λout versus λin, levelling off at R/2; delay versus λin, growing sharply as λin approaches R/2.]
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A and Host B share finite output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: throughput.]
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends only when there is free buffer space, never when there is no buffer space. Plot: λout versus λin, rising to R/2.]
Causes/costs of congestion: scenario 2
Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered

[Figure: Host A keeps a copy of each packet; a timeout fires while there is free buffer space, so both copies reach Host B. Plots: λout versus λin, both axes running to R/2.]
Causes/costs of congestion: scenario 2
"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: throughput.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted

[Figure: throughput by blue traffic (λout) versus offered load by Host A (λ'in), both axes running to R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half).]
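The sawtooth can be reproduced with a few lines of simulation. This sketch counts cwnd in whole MSS units and takes the set of RTT rounds in which a loss is detected as an input; the function name `aimd` and the `loss_rounds` parameter are assumptions for illustration.

```python
def aimd(rounds, loss_rounds, cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT.

    Each RTT adds 1 MSS, except rounds listed in loss_rounds, which halve cwnd.
    """
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease
        else:
            cwnd += 1                  # additive increase
    return history


print(aimd(rounds=10, loss_rounds={5}))  # [1, 2, 3, 4, 5, 6, 3, 4, 5, 6]
```

The window climbs linearly to 6 MSS, is halved to 3 MSS by the loss in round 5, and then resumes its linear climb.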
TCP Congestion Control: details
• sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is bounded by cwnd.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: over successive RTTs, Host A sends one segment, then two segments, then four segments to Host B.]
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: no action (Λ); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
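The three-state machine summarized above can be sketched as a small class. This is an illustrative sketch, not a faithful TCP implementation: cwnd and ssthresh are counted in MSS units (so "+ MSS" becomes "+ 1" and "MSS * (MSS/cwnd)" becomes "1/cwnd"), the class and method names are hypothetical, and retransmission itself is left out.

```python
class RenoCwnd:
    """Congestion window state machine; cwnd and ssthresh in MSS units."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow_start"
        self.cwnd = 1.0
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1                      # one MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1 / self.cwnd          # about one MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                  # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"


r = RenoCwnd(ssthresh=4.0)
for _ in range(4):
    r.on_new_ack()
print(r.state, r.cwnd)   # congestion_avoidance 5.0
```

Starting from 1 MSS with a threshold of 4 MSS, four new ACKs take the window past the threshold and into congestion avoidance.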
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4) * W / RTT bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W.]
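The 3/4 factor is the average of a window that ramps linearly from W/2 to W. The sketch below checks that average numerically and then applies the formula; the function name and the sample values (a 100-unit window, a 100,000-byte window, a 0.1 s RTT) are illustrative assumptions.

```python
def avg_tcp_throughput(w_bytes, rtt_s):
    """Average throughput in bytes/sec for a sawtooth between W/2 and W."""
    return 0.75 * w_bytes / rtt_s


# Average of a linear ramp from W/2 to W, sampled once per step, for W = 100.
W = 100
ramp = range(W // 2, W + 1)
print(sum(ramp) / len(ramp))             # 75.0, i.e. 3/4 of W

print(avg_tcp_throughput(100_000, 0.1))  # 750000.0
```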
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = (1.22 * MSS) / (RTT * sqrt(L))

• to achieve 10 Gbps throughput, need a loss rate of L = 2 * 10^-10, a very small loss rate
• new versions of TCP for high-speed
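The numbers in this example can be checked directly from the formula. The helper names below are hypothetical; throughput is expressed in bits per second, so the MSS is multiplied by 8.

```python
import math


def window_segments(target_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain target_bps over one RTT."""
    return target_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput from the 1.22 * MSS / (RTT * sqrt(L)) relation, in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss(mss_bytes, rtt_s, target_bps):
    """Loss probability L that yields target_bps (the relation solved for L)."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


print(window_segments(10e9, 0.1, 1500))  # about 83333 segments
print(required_loss(1500, 0.1, 10e9))    # about 2e-10
```

Both results agree with the figures quoted above: roughly 83,333 segments in flight, and a tolerable loss rate of about 2 in 10 billion segments.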
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes running to R. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", bouncing off the full bandwidth utilization line ((X1, Y1) where X1 + Y1 = R) and converging toward the equal bandwidth share line ((X2, Y2) where X2 = Y2).]
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
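The parallel-connection example is simple arithmetic once each TCP connection is assumed to receive an equal share of the link. A short sketch (the function name `app_share` is hypothetical, and equal per-connection sharing is the idealized assumption behind the example):

```python
from fractions import Fraction


def app_share(existing, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    assuming every TCP connection on the link gets an equal share."""
    return Fraction(new_connections, existing + new_connections)


print(app_share(9, 1))    # 1/10
print(app_share(9, 11))   # 11/20, slightly more than R/2
```

With 11 parallel connections the new application holds 11 of the 20 connections on the link, which is where the "gets R/2" figure comes from.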
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the returning TCP ACK segment carries ECE=1.]
Pipelined protocols: overview
Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM
single state: Wait

initially (Λ):
  base = 1
  nextseqnum = 1

event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  no action (Λ)
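The sender FSM maps onto a small class with one method per event. The sketch below is illustrative: the class `GbnSender`, the `wire` list that records what would be passed to udt_send, and the boolean timer are assumptions. It also guards against base moving backwards on an old ACK, which the FSM as written does not do.

```python
class GbnSender:
    """Go-Back-N sender (sketch; sequence numbers start at 1 and never wrap)."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> data, kept for retransmission
        self.timer_running = False
        self.wire = []              # seq #s handed to the channel, in order

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.wire.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(seq)   # resend every unacked packet

    def rdt_rcv_ack(self, acknum):
        # max() is an added guard so a stale ACK cannot move base backwards.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum


s = GbnSender(window_size=4)
accepted = [s.rdt_send(d) for d in "abcde"]
print(accepted)   # [True, True, True, True, False]
s.rdt_rcv_ack(1)
s.rdt_send("e")
s.timeout()
print(s.wire)     # [1, 2, 3, 4, 5, 2, 3, 4, 5]
```

After the ACK for packet 1 the window slides to [2, 5], and the timeout resends everything in it.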
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering
  - re-ACK pkt with highest in-order seq #

single state: Wait

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)
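The receiver side is even smaller, since it keeps only expectedseqnum and the last ACK it sent. A sketch under the same assumptions as the FSM (sequence numbers start at 1); the class name `GbnReceiver` and the `corrupt` flag standing in for a checksum test are hypothetical.

```python
class GbnReceiver:
    """Go-Back-N receiver: no buffering, cumulative ACKs only."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0           # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number that is sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = seq
            self.expectedseqnum += 1
        # Default case: out-of-order or corrupt packets re-ACK the last in-order one.
        return self.last_ack


rx = GbnReceiver()
acks = [rx.rdt_rcv(s, f"d{s}") for s in (1, 2, 4, 5, 3, 4, 5)]
print(acks)          # [1, 2, 2, 2, 3, 4, 5]
print(rx.delivered)  # ['d1', 'd2', 'd3', 'd4', 'd5']
```

Packets 4 and 5 are discarded the first time because packet 3 is missing, producing duplicate ACKs for 2; they are only delivered when they arrive again after packet 3.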
GBN in action
(sender window N=4; pkt2 is lost)

sender:
• send pkt0, send pkt1, send pkt2 (lost), send pkt3, then (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5
Selective repeat
bull receiver individually acknowledges all correctly received packetsndash buffers packets as needed for eventual in-order delivery to
upper layer
bull sender only resends packets for which ACK not receivedndash sender timer for each unACKed packet
bull sender windowndash N consecutive seq rsquosndash limits seq s of sent unACKed packets
53
Selective repeat sender receiver windows
54
Selective repeat
data from abovebull if next available seq in
window send pkt
timeout(n)bull resend pkt n restart timer
ACK(n) in [sendbase sendbase+N-1]
bull mark pkt n as receivedbull if n smallest unACKed pkt
advance window base to next unACKed seq
55
senderpkt n in [rcvbase rcvbase+N-1]
v send ACK(n)v out-of-order bufferv in-order deliver (also
deliver buffered in-order pkts) advance window to next not-yet-received pkt
pkt n in [rcvbase-N rcvbase-1]
v ACK(n)otherwisev ignore
receiver
Selective repeat in action
56
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 buffer send ack3rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2
Xloss
receive pkt4 buffer send ack4
receive pkt5 buffer send ack5
rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2
record ack3 arrived
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
record ack4 arrivedrecord ack5 arrived
Q what happens when ack2 arrives
Selective repeat in action
57
send pkt0send pkt1send pkt2send pkt3
(wait)
sender receiver
receive pkt0 send ack0receive pkt1 send ack1
receive pkt3 buffer send ack3rcv ack0 send pkt4
rcv ack1 send pkt5
pkt 2 timeoutsend pkt2
Xloss
receive pkt4 buffer send ack4
receive pkt5 buffer send ack5
rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2
record ack3 arrived
0 1 2 3 4 5 6 7 8
sender window (N=4)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
record ack4 arrivedrecord ack5 arrived
Q what happens when ack2 arrives
Selective repeatdilemma
example bull seq rsquos 0 1 2 3bull window size=3
receiver window(after receipt)
sender window(after receipt)
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2
pkt0pkt1pkt2
0 1 2 3 0 1 2 pkt0
timeoutretransmit pkt0
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2XXX
will accept packetwith seq number 0(b) oops
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2
pkt0pkt1pkt2
0 1 2 3 0 1 2pkt0
0 1 2 3 0 1 2
0 1 2 3 0 1 2
0 1 2 3 0 1 2
Xwill accept packetwith seq number 0
0 1 2 3 0 1 2 pkt3
(a) no problem
receiver canrsquot see sender sidereceiver behavior identical in both casessomethingrsquos (very) wrong
v receiver sees no difference in two scenarios
v duplicate data accepted as new in (b)
Q what relationship between seq size and window size to avoid problem in (b)
58
TCP Overview RFCs 79311221323 2018 2581
bull point-to-pointndash one sender one receiver
bull reliable in-order byte streamndash no ldquomessage boundariesrdquo
bull pipelinedndash TCP congestion and flow
control set window size
bull full duplex datandash bi-directional data flow in
same connectionndash MSS maximum segment
size
bull connection-orientedndash handshaking (exchange of
control msgs) inits sender receiver state before data exchange
bull flow controlledndash sender will not overwhelm
receiver
59
TCP segment structure
60
source port dest port
32 bits
applicationdata (variable length)
sequence numberacknowledgement number
receive windowUrg data pointerchecksum
FSRPAUheadlen
notused
options (variable length)
URG urgent data (generally not used)
ACK ACK valid
PSH push data now
RST SYN FINconnection estab(setup teardown
commands)
bytes rcvr willingto accept
countingby bytes of data(not segments)
Internetchecksum
(as in UDP)
TCP seq numbers ACKs
sequence numbersndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data
acknowledgementsndashseq of next byte expected from other side
ndashcumulative ACKQ how receiver handles out-of-order segmentsndashA TCP spec doesnrsquot say ndashup to implementor
61
source port dest port
sequence numberacknowledgement number
checksum
rwndurg pointer
incoming segment to sender
A
sent ACKed
sent not-yet ACKed(ldquoin-flightrdquo)
usablebut not yet sent
not usable
window sizeN
sender sequence number space
source port dest port
sequence numberacknowledgement number
checksum
rwndurg pointer
outgoing segment from sender
Byte stream in TCP
62
Window N bytes
HTTP Get Message (K bytes)
100th byte
TCP header(seq no = 100)
M bytes
HTTP Get Message (K bytes)
Cannot be transmitted now
TCP seq numbers ACKs
63
UsertypeslsquoCrsquo
host ACKsreceipt
of echoedlsquoCrsquo
host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo
simple telnet scenario
Host BHost A
Seq=42 ACK=79 data = lsquoCrsquo
Seq=79 ACK=43 data = lsquoCrsquo
Seq=43 ACK=80
TCP round trip time timeout
Q how to set TCP timeout value
bull longer than RTTndash but RTT varies
bull too short premature timeout unnecessary retransmissions
bull too long slow reaction to segment loss
Q how to estimate RTTbull SampleRTT measured
time from segment transmission until ACK receiptndash ignore retransmissions
bull SampleRTT will vary want estimated RTT ldquosmootherrdquondash average several recent
measurements not just current SampleRTT
64
RTT gaiacsumassedu to fantasiaeurecomfr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106time (seconnds)
RTT
(mill
iseco
nds)
SampleRTT Estimated RTT
EstimatedRTT = (1- a)EstimatedRTT + aSampleRTT
v exponential weighted moving averagev influence of past sample decreases exponentially fastv typical value a = 0125
TCP round trip time timeout
65
RTT
(milli
seco
nds)
RTT gaiacsumassedu to fantasiaeurecomfr
sampleRTTEstimatedRTT
time (seconds)
TCP round trip time timeout
bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT egrave larger safety margin
bull estimate SampleRTT deviation from EstimatedRTT
66
DevRTT = (1-b)DevRTT +b|SampleRTT-EstimatedRTT|
(typically b = 025)
TimeoutInterval = EstimatedRTT + 4DevRTT
estimated RTT ldquosafety marginrdquo
TCP reliable data transfer
bull TCP creates rdt service on top of IPrsquos unreliable servicendash pipelined segmentsndash cumulative acksndash single retransmission timer
bull retransmissions triggered byndash timeout eventsndash duplicate acks
67
letrsquos initially consider simplified TCP senderndash ignore duplicate acksndash ignore flow control
congestion control
TCP sender events
data rcvd from appbull create segment with seq bull seq is byte-stream
number of first data byte in segment
bull start timer if not already running ndash think of timer as for oldest
unacked segmentndash expiration interval TimeOutInterval
timeoutbull retransmit segment that
caused timeoutbull restart timerack rcvdbull if ack acknowledges
previously unackedsegmentsndash update what is known to
be ACKedndash start timer if there are still
unacked segments
68
TCP sender (simplified)
69
waitfor event
NextSeqNum = InitialSeqNumSendBase = InitialSeqNum
L
create segment seq NextSeqNumpass segment to IP (ie ldquosendrdquo)NextSeqNum = NextSeqNum + length(data) if (timer currently not running)
start timer
data received from application above
retransmit not-yet-acked segment with smallest seq
start timer
timeout
if (y gt SendBase) SendBase = y SendBasendash1 last cumulatively ACKed byte if (there are currently not-yet-acked segments)
start timerelse stop timer
ACK received with ACK field value y
TCP retransmission scenarios
70
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtimeo
ut
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
timeo
ut
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP retransmission scenarios
71
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
timeo
ut
Seq=100 20 bytes of data
ACK=120
TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action

1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   -> immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   -> immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   -> immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 duplicate ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment is lost, so don't wait for timeout
TCP fast retransmit

Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B). A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). B replies ACK=100 and repeats ACK=100 three more times (3 DUP ACKs). A then resends Seq=100, 20 bytes of data before its timeout expires.
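The duplicate-ACK counting behind fast retransmit can be sketched as follows. The function name and the list-of-ACK-numbers input are assumptions made for this illustration; a real sender keeps this counter per connection alongside its timer.

```python
def fast_retransmit_trigger(acks, threshold=3):
    """Return the indices in `acks` at which a fast retransmit would fire."""
    fired = []
    last_ack = None
    dup_count = 0
    for i, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1                 # same ACK number again: a duplicate
            if dup_count == threshold:
                fired.append(i)            # third duplicate: resend now
        else:
            last_ack = ack                 # new ACK number resets the count
            dup_count = 0
    return fired


# ACK=100 arrives once, then three duplicates, as in the figure.
print(fast_retransmit_trigger([100, 100, 100, 100, 120]))  # [3]
```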
TCP flow control

Figure: receiver protocol stack. Segments from the sender pass through the IP code and TCP code (in the OS) into the TCP socket receiver buffers, from which the application process reads.

The application may remove data from the TCP socket buffers more slowly than the TCP receiver is delivering it (that is, than the sender is sending).

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is divided into buffered data and free buffer space (rwnd); buffered data is passed on to the application process.
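A minimal sketch of the flow-control arithmetic. The slide only says that rwnd is the free buffer space; the variable names `last_byte_rcvd` and `last_byte_read` are assumptions taken from the usual textbook presentation, not from the slide.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """rwnd = free space left in the receive buffer, in bytes."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the app
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                       # 2096
print(sender_may_send(5000, 4000, rwnd, 1000))    # True
print(sender_may_send(5000, 4000, rwnd, 1200))    # False
```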
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

Each side (application above, network below) then holds:
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client

client:  Socket clientSocket = new Socket(hostname, portNumber);
server:  Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
TCP 3-way handshake: FSM

closed -> SYN sent:   Socket clientSocket = new Socket(hostname, portNumber); send SYN(seq=x)
SYN sent -> ESTAB:    receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
closed -> listen:     Socket connectionSocket = welcomeSocket.accept();
listen -> SYN rcvd:   receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYN rcvd -> ESTAB:    receive ACK(ACKnum=y+1); no action (Λ)
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection (figure)

client state: ESTAB; server state: ESTAB

1. clientSocket.close(): client sends FINbit=1, seq=x and enters FIN_WAIT_1 (can no longer send but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
5. server receives the ACK and enters CLOSED
The "Two Army Problem"

Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

Figure: Host A sends original data at rate $\lambda_{in}$ into unlimited shared output link buffers; Host B receives at throughput $\lambda_{out}$. Plots: $\lambda_{out}$ versus $\lambda_{in}$ (both axes up to R/2) and delay versus $\lambda_{in}$.
• large delays as arrival rate, $\lambda_{in}$, approaches capacity
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$

Figure: Host A offers $\lambda_{in}$ (original data) and $\lambda'_{in}$ (original data, plus retransmitted data) to finite shared output link buffers; Host B receives $\lambda_{out}$.
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available

Figure: same topology ($\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$). The sender keeps a copy of each packet and sends when there is free buffer space, not when there is no buffer space. Plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes up to R/2.

Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

Figure: same topology; a packet arriving when there is no free buffer space is dropped, and the sender retransmits its copy.
Causes/costs of congestion: scenario 2
Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

Plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes up to R/2.
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2
Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

Plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes up to R/2.
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit

Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers ($\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$).

another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted!

Plot: throughput of blue traffic ($\lambda_{out}$, axis up to R/2) versus offered load by Host A ($\lambda'_{in}$, axis up to R/2). Annotation: bandwidth wastage for packets dropped at the 2nd router.
Approaches towards congestion control
two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

Figure: AIMD sawtooth behavior (probing for bandwidth). Axes: cwnd, the TCP sender congestion window size, versus time. The window is increased additively until loss occurs, then cut in half.
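The sawtooth can be reproduced with a few lines of Python. This is a sketch under simplifying assumptions: cwnd is counted in whole MSS units, one update happens per RTT, and the RTTs at which loss is detected are given as an input set (`loss_at`, a hypothetical parameter).

```python
def aimd_trace(rtts, loss_at, start_mss=1):
    """Return cwnd (in MSS) at the start of each RTT."""
    cwnd = start_mss
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: halve
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(8, {4}, start_mss=4))  # [4, 5, 6, 7, 8, 4, 5, 6]
```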
TCP Congestion Control: details
• sender limits transmission:

    $LastByteSent - LastByteAcked \le cwnd$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    $rate \approx \dfrac{cwnd}{RTT}$ bytes/sec

Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

Figure (Host A, Host B, time axis in RTTs): A sends one segment in the first RTT, two segments in the second, four segments in the third.
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

state: slow start
    new ACK:            cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK:      dupACKcount++
    cwnd > ssthresh:    (Λ) go to congestion avoidance
    dupACKcount == 3:   ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
    timeout:            ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

state: congestion avoidance
    new ACK:            cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK:      dupACKcount++
    dupACKcount == 3:   ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
    timeout:            ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

state: fast recovery
    duplicate ACK:      cwnd = cwnd + MSS; transmit new segment(s), as allowed
    new ACK:            cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
    timeout:            ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
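The three-state summary can be exercised with a small Python model. It is a sketch, not a TCP implementation: cwnd and ssthresh are counted in MSS units (so the default ssthresh of 64 stands in for the slide's 64 KB), retransmission itself is omitted, and the class name is hypothetical.

```python
MSS = 1  # assumption of this sketch: windows are measured in MSS units


class RenoSketch:
    def __init__(self, ssthresh=64):
        self.state = "slow_start"
        self.cwnd = 1.0 * MSS
        self.ssthresh = float(ssthresh)
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # doubles cwnd once per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        else:                                     # leaving fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # inflate window per dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Driving it with a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls and printing `state` and `cwnd` after each call traces a path through the FSM.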
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

    $\text{avg TCP throughput} = \dfrac{3}{4} \cdot \dfrac{W}{RTT}$ bytes/sec

Figure: sawtooth of the window size oscillating between W/2 and W, with average 3/4 W.
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    $\text{TCP throughput} = \dfrac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  - to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
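The two numbers in this example can be checked directly. The helper names below are made up for the illustration; the only inputs are the slide's values (1500-byte segments, 100 ms RTT, 10 Gbps) and the throughput formula solved for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to fill the pipe: rate * RTT / segment size."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def mathis_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


print(round(window_segments(10e9, 0.100, 1500)))   # 83333
print(mathis_loss_rate(10e9, 0.100, 1500))         # roughly 2.1e-10
```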
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.

Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1 + Y1 = R, and (X2, Y2) where X2 = Y2.
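The convergence argument can be checked numerically. The sketch below assumes an idealized model that is not spelled out on the slide: both flows have the same RTT, both see a loss in the same round whenever their combined rate exceeds the capacity, and rates change by one unit per round.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Rates of two AIMD flows after `rounds` rounds with synchronized losses."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # multiplicative decrease halves the gap
        else:
            x, y = x + 1, y + 1      # additive increase keeps the gap unchanged
    return x, y


x, y = two_flow_aimd(10, 80, capacity=100, rounds=400)
print(abs(x - y) < 0.1)   # True: the two rates have nearly equalized
```

Each loss halves the difference between the two rates while additive increase leaves it unchanged, which is why the trajectory moves toward the equal bandwidth share line.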
Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the TCP ACK segment returns to the source with ECE=1.
Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window
GBN: sender extended FSM

initially (Λ):
    base = 1
    nextseqnum = 1

state: Wait

event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
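The sender FSM maps onto a small Python class. This is a sketch: the `sent` list stands in for `udt_send` over a hypothetical channel, the timer is a boolean flag, checksums are left out, and the `max` guard against stale ACKs is an addition of this sketch rather than part of the FSM.

```python
class GbnSender:
    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq # -> data, kept for retransmission
        self.timer_running = False
        self.sent = []               # log of transmitted seq #s (hypothetical channel)

    def rdt_send(self, data):
        """Return True if the packet was sent, False if the window is full."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.sent.append(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True        # start_timer for oldest pkt
            self.nextseqnum += 1
            return True
        return False                             # refuse_data

    def timeout(self):
        self.timer_running = True                # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)                # resend every unacked pkt

    def rdt_rcv_ack(self, acknum):
        # cumulative ACK: everything up to acknum is acknowledged
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum
```

With a window of 4, a fifth `rdt_send` is refused until an ACK slides the window forward.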
GBN: receiver extended FSM
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initially (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

state: Wait

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

event: default
    udt_send(sndpkt)
GBN in action (sender window N=4)

sender                              receiver
send pkt0
send pkt1
send pkt2  (X: loss)
send pkt3
(wait)
                                    receive pkt0, send ack0
                                    receive pkt1, send ack1
                                    receive pkt3, discard, (re)send ack1
rcv ack0, send pkt4
rcv ack1, send pkt5
ignore duplicate ACK
                                    receive pkt4, discard, (re)send ack1
                                    receive pkt5, discard, (re)send ack1
pkt 2 timeout
send pkt2
send pkt3
send pkt4
send pkt5
                                    rcv pkt2, deliver, send ack2
                                    rcv pkt3, deliver, send ack3
                                    rcv pkt4, deliver, send ack4
                                    rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows (figure)
Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
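The receiver rules can be sketched in Python. To keep the example short it assumes unbounded sequence numbers (no wraparound), starts rcvbase at 0, and returns the ACK number instead of sending a packet; the class and attribute names are hypothetical.

```python
class SrReceiver:
    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets waiting for delivery
        self.delivered = []    # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so the sender can move on
        return None            # otherwise: ignore
```

Receiving pkt0, pkt1, pkt3, pkt4, pkt5 and finally pkt2 with N=4 buffers the three out-of-order packets and delivers pkt2 through pkt5 together once pkt2 arrives.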
Selective repeat in action (sender window N=4)

sender                              receiver
send pkt0
send pkt1
send pkt2  (X: loss)
send pkt3
(wait)
                                    receive pkt0, send ack0
                                    receive pkt1, send ack1
                                    receive pkt3, buffer, send ack3
rcv ack0, send pkt4
rcv ack1, send pkt5
record ack3 arrived
                                    receive pkt4, buffer, send ack4
                                    receive pkt5, buffer, send ack5
pkt 2 timeout
send pkt2
record ack4 arrived
record ack5 arrived
                                    rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #s: 0, 1, 2, 3
• window size = 3

(a) no problem: the sender sends pkt0, pkt1, pkt2; all three are received and the receiver window advances to 3, 0, 1. The ACKs arrive, the sender window advances too, and the sender sends pkt3, which is lost (X), followed by a new pkt0. The receiver will accept packet with seq number 0.

(b) oops!: the sender sends pkt0, pkt1, pkt2; all three are received and the receiver window advances to 3, 0, 1, but all three ACKs are lost (X X X). The sender times out and retransmits pkt0. The receiver will accept packet with seq number 0.

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
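One way to explore the question is to compute which sequence numbers the receiver is willing to accept after an in-order run. The sketch below uses hypothetical helper names and encodes the commonly stated condition for selective repeat, that the window size be at most half the sequence-number space; treat that condition as this sketch's answer, not as text from the slide.

```python
def receiver_window_after(n_received, seq_space_size, window_size):
    """Seq #s the receiver accepts after n_received in-order packets."""
    return [(n_received + i) % seq_space_size for i in range(window_size)]


def sr_window_is_safe(seq_space_size, window_size):
    """True if an old packet can never fall inside the receiver's new window."""
    return window_size <= seq_space_size // 2


# The slide's example: seq #s 0..3, window size 3.
print(receiver_window_after(3, 4, 3))   # [3, 0, 1]: old pkt0 looks new
print(sr_window_is_safe(4, 3))          # False
print(receiver_window_after(2, 4, 2))   # [2, 3]: old pkt0, pkt1 are rejected
print(sr_window_is_safe(4, 2))          # True
```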
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure (each row is 32 bits wide)

    source port #                  | dest port #
    sequence number
    acknowledgement number
    head len | not used | U A P R S F | receive window
    checksum                       | Urg data pointer
    options (variable length)
    application data (variable length)

Annotations:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer). Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.
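Byte stream in TCP

Figure: an HTTP GET message (K bytes) viewed as a byte stream. Labels: window (N bytes); a segment of M bytes starting at the 100th byte, carried with a TCP header (seq. no. = 100); the remainder of the message is marked "cannot be transmitted now".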
TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B)

1. User types 'C'. Host A -> Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B -> Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A -> Host B: Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

    $EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) versus time in seconds.
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

    $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$    (typically, $\beta = 0.25$)

    $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$

  (EstimatedRTT is the estimated RTT; $4 \cdot DevRTT$ is the "safety margin")
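The estimator can be written out in a few lines of Python. Two details are assumptions borrowed from RFC 6298 rather than taken from the slides: the first sample initializes EstimatedRTT to the sample and DevRTT to half of it, and DevRTT is updated before EstimatedRTT on each new sample. The class name is hypothetical.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # initialization as in RFC 6298

    def update(self, sample_rtt):
        # deviation first, using the previous EstimatedRTT
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        # exponential weighted moving average of the samples
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())  # 102.5 42.5 272.5
```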
TCP reliable data transfer
bull TCP creates rdt service on top of IPrsquos unreliable servicendash pipelined segmentsndash cumulative acksndash single retransmission timer
bull retransmissions triggered byndash timeout eventsndash duplicate acks
67
letrsquos initially consider simplified TCP senderndash ignore duplicate acksndash ignore flow control
congestion control
TCP sender events
data rcvd from appbull create segment with seq bull seq is byte-stream
number of first data byte in segment
bull start timer if not already running ndash think of timer as for oldest
unacked segmentndash expiration interval TimeOutInterval
timeoutbull retransmit segment that
caused timeoutbull restart timerack rcvdbull if ack acknowledges
previously unackedsegmentsndash update what is known to
be ACKedndash start timer if there are still
unacked segments
68
TCP sender (simplified)
69
waitfor event
NextSeqNum = InitialSeqNumSendBase = InitialSeqNum
L
create segment seq NextSeqNumpass segment to IP (ie ldquosendrdquo)NextSeqNum = NextSeqNum + length(data) if (timer currently not running)
start timer
data received from application above
retransmit not-yet-acked segment with smallest seq
start timer
timeout
if (y gt SendBase) SendBase = y SendBasendash1 last cumulatively ACKed byte if (there are currently not-yet-acked segments)
start timerelse stop timer
ACK received with ACK field value y
TCP retransmission scenarios
70
lost ACK scenario
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8 bytes of data
Xtimeo
ut
ACK=100
premature timeout
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=92 8bytes of data
timeo
ut
ACK=120
Seq=100 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
TCP retransmission scenarios
71
X
cumulative ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
Seq=120 15 bytes of data
timeo
ut
Seq=100 20 bytes of data
ACK=120
TCP ACK generation [RFC 5861]
72
event at receiver
arrival of in-order segment withexpected seq All data up toexpected seq already ACKed
arrival of in-order segment withexpected seq One other segment has ACK pending
arrival of out-of-order segmenthigher-than-expect seq Gap detected
arrival of segment that partially or completely fills gap
TCP receiver action
delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK
immediately send single cumulative ACK ACKing both in-order segments
immediately send duplicate ACKindicating seq of next expected byte
immediate send ACK provided thatsegment starts at lower end of gap
TCP fast retransmit
bull time-out period often relatively longndash long delay before resending
lost packet
bull detect lost segments via duplicate ACKsndash sender often sends many
segments back-to-backndash if segment is lost there will
likely be many duplicate ACKs
73
if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo)resend unackedsegment with smallest seq sect likely that unacked
segment lost so donrsquot wait for timeout
TCP fast retransmit
(ldquotriple duplicate ACKsrdquo)
X
fast retransmit after sender receipt of triple duplicate ACK
Host BHost A
Seq=92 8 bytes of data
ACK=100
timeo
ut ACK=100
ACK=100
ACK=100
TCP fast retransmit
74
Seq=100 20 bytes of data
Seq=100 20 bytes of data
3 DUP ACKs
TCP flow control
75
applicationprocess
TCP socketreceiver buffers
TCPcode
IPcode
applicationOS
receiver protocol stack
application may remove data from
TCP socket buffers hellip
hellip slower than TCP receiver is delivering(sender is sending)
from sender
receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast
flow control
TCP flow control
bull receiver ldquoadvertisesrdquo free buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via socket
options (typical default is 4096 bytes)ndash many operating systems autoadjustRcvBuffer
bull sender limits amount of unacked(ldquoin-flightrdquo) data to receiverrsquos rwnd value
bull guarantees receive buffer will not overflow
76
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application process
receiver-side buffering
Connection Management
before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing to
establish connection)bull agree on connection parameters
77
connection state ESTABconnection variables
seq client-to-serverserver-to-client
rcvBuffer sizeat serverclient
application
network
connection state ESTABconnection Variables
seq client-to-serverserver-to-client
rcvBuffer sizeat serverclient
application
network
Socket clientSocket = newSocket(hostnameport number)
Socket connectionSocket = welcomeSocketaccept()
TCP 3-way handshake
80
SYNbit=1 Seq=x
choose init seq num xsend TCP SYN msg
ESTAB
SYNbit=1 Seq=yACKbit=1 ACKnum=x+1
choose init seq num ysend TCP SYNACKmsg acking SYN
ACKbit=1 ACKnum=y+1
received SYNACK(x) indicates server is livesend ACK for SYNACK
this segment may contain client-to-server data received ACK(y)
indicates client is live
SYNSENT
ESTAB
SYN RCVD
client stateCLOSED
server stateLISTEN
TCP 3-way handshake FSM
81
closed
L
listen
SYNrcvd
SYNsent
ESTAB
Socket clientSocket = newSocket(hostnameport number)
SYN(seq=x)
Socket connectionSocket = welcomeSocketaccept()
SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client
SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)
L
TCP closing a connection
bull client server each close their side of connectionndash send TCP segment with FIN bit = 1
bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN
bull simultaneous FIN exchanges can be handled
82
FIN_WAIT_2
CLOSE_WAIT
FINbit=1 seq=y
ACKbit=1 ACKnum=y+1
ACKbit=1 ACKnum=x+1wait for server
close
can stillsend data
can no longersend data
LAST_ACK
CLOSED
TIMED_WAIT
timed wait for 2max
segment lifetime
CLOSED
TCP closing a connection
83
FIN_WAIT_1 FINbit=1 seq=xcan no longersend but canreceive data
clientSocketclose()
client state server stateESTABESTAB
The ldquoTwo Army Problemrdquo
84
Principles of congestion control
congestionbull informally ldquotoo many sources sending too much data
too fast for network to handlerdquobull different from flow controlbull manifestations
ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)
bull a top-10 problem
85
Causescosts of congestion scenario 1
bull two senders two receivers
bull one router infinite buffers
bull output link capacity Rbull no retransmission
bull maximum per-connection throughput R2
86
unlimited shared output link buffers
Host A
original data lin
Host B
throughput lout
R2
R2
l out
lin R2
dela
ylin
v large delays as arrival rate lin approaches capacity
Causescosts of congestion scenario 2
bull one router finite buffers bull sender retransmission of timed-out packet
ndash application-layer input = application-layer output lin = lout
ndash transport-layer input includes retransmissions lrsquoin lin
87
finite shared output link buffers
Host A
lin original data
Host B
loutlin original data plusretransmitted data
Causescosts of congestion scenario 2
idealization perfect knowledgebull sender sends only when router
buffers available
88
finite shared output link buffers
lin original dataloutlin original data plus
retransmitted datacopy
free buffer space
R2
R2
l out
lin
Host B
A
lin original dataloutlin original data plus
retransmitted datacopy
no buffer space
Causescosts of congestion scenario 2
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
89
A
Host B
lin original dataloutlin original data plus
retransmitted data
free buffer space
Causescosts of congestion scenario 2
90
R2
R2lin
l out
when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)
A
Host B
Idealization known losspackets can be lost dropped at router due to full buffers
bull sender only resends if packet known to be lost
A
lin loutlincopy
free buffer space
timeout
R2
R2lin
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
Host B
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causescosts of congestion scenario 2
91
R2
l out
when sending at R2 some packets are retransmissions including duplicated that are delivered
ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt
sect decreasing goodput
R2lin
Causescosts of congestion scenario 2
92
Realistic duplicatesv packets can be lost dropped
at router due to full buffersv sender times out prematurely
sending two copies both of which are delivered
Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit
Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue packets at the upper queue are dropped; blue throughput → 0
[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. $\lambda_{in}$ is original data, $\lambda'_{in}$ is original plus retransmitted data, $\lambda_{out}$ is delivered throughput.]

Causes/costs of congestion: scenario 3
another "cost" of congestion:
• when a packet is dropped, any upstream transmission capacity used for that packet was wasted
[Figure: throughput of blue traffic ($\lambda_{out}$, up to $R/2$) versus offered load by Host A ($\lambda'_{in}$, up to $R/2$). Bandwidth is wasted for packets dropped at the 2nd router.]
Approaches towards congestion control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at
TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size until loss occurs, then cut window in half.]
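The sawtooth described above can be reproduced with a few lines of Python. This is a minimal sketch, not TCP itself: `aimd_trace` and its fixed `capacity_mss` bottleneck are assumptions made for illustration, and a loss is assumed to be detected in any RTT where cwnd exceeds that capacity.

```python
def aimd_trace(capacity_mss, rounds, start_mss=1):
    """Return cwnd (in MSS) at the start of each RTT under idealized AIMD.

    Assumption: a loss is detected in any RTT where cwnd exceeds
    capacity_mss (a hypothetical fixed bottleneck, not from the slides).
    """
    cwnd = start_mss
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd += 1                 # additive increase: +1 MSS per RTT
    return trace


if __name__ == "__main__":
    print(aimd_trace(capacity_mss=8, rounds=12, start_mss=6))
```

With a capacity of 8 MSS the window climbs by one each RTT, overshoots at 9, drops to 4, and climbs again, which is the sawtooth in the figure.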
TCP Congestion Control: details
• sender limits transmission:
    $\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$
• cwnd is dynamic, a function of perceived network congestion
TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
    $\text{rate} \approx \text{cwnd} / \text{RTT}$ bytes/sec
[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent, spanning cwnd.]
TCP Slow Start
• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments.]
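Why does "increment cwnd for every ACK" amount to doubling every RTT? A small sketch makes the arithmetic visible. It assumes one ACK per segment (no delayed ACKs) and no loss; the function name `slow_start_rounds` is made up for this example.

```python
def slow_start_rounds(rounds, mss=1):
    """Return cwnd at the start of each RTT during loss-free slow start.

    Assumption: every segment sent in an RTT is ACKed individually,
    so a window of k segments produces k ACKs in that RTT.
    """
    cwnd = mss
    sizes = []
    for _ in range(rounds):
        sizes.append(cwnd)
        acks_this_rtt = cwnd // mss
        for _ in range(acks_this_rtt):
            cwnd += mss  # +1 MSS per ACK received
    return sizes


if __name__ == "__main__":
    print(slow_start_rounds(4))
```

Each of the k segments in flight returns one ACK, each ACK adds one MSS, so the window goes from k to 2k in one RTT: one, two, four segments, as in the figure.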
TCP: detecting, reacting to loss
• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)
TCP: switching from slow start to congestion avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event
Summary: TCP Congestion Control
(Λ denotes "no event" or "no action")

state: slow start
    Λ (initialization): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
    new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    cwnd > ssthresh: Λ; go to congestion avoidance
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery

state: congestion avoidance
    new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery

state: fast recovery
    duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
    new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
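The three-state machine summarized above translates fairly directly into code. The following is a toy sketch under stated assumptions: the class name `RenoSender`, its method names, and the byte-valued `cwnd` are choices made for this example, the "+ 3" in the fast-recovery entry is read as 3 MSS, and actual segment transmission and retransmission are left out.

```python
class RenoSender:
    """Toy TCP Reno congestion-control state machine (cwnd in bytes).

    Only window bookkeeping is modeled; sending and retransmitting
    segments is omitted. Names are illustrative, not a real API.
    """

    def __init__(self, mss=1000, ssthresh=64_000):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate window, leave recovery
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                  # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss                  # inflate window per extra dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"


if __name__ == "__main__":
    s = RenoSender(mss=1000, ssthresh=4000)
    for _ in range(4):
        s.on_new_ack()
    print(s.state, s.cwnd)
```

Keeping each event in its own method mirrors the FSM: the event selects the method, and the current state selects the branch.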
TCP throughput
• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT
    $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec
[Figure: sawtooth window oscillating between W/2 and W, with average 3/4 W.]
TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
    $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
  - to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
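The two numbers in this example can be checked directly. The sketch below assumes the throughput formula is applied with MSS expressed in bits so that the result is in bits per second; the function names are made up for illustration.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with RTT rtt_s."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L (MSS taken in bits)."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


if __name__ == "__main__":
    print(window_segments(10e9, 0.1, 1500))   # roughly 83,333 segments
    print(mathis_loss_rate(10e9, 0.1, 1500))  # roughly 2e-10
```

A loss probability near 2 in 10 billion segments is far below what ordinary links deliver, which is the motivation for the high-speed TCP variants mentioned on the slide.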
TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]
Why is TCP fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput versus Connection 1 throughput, both axes running to R. The full bandwidth utilization line holds the points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line holds the points (X2, Y2) where X2 = Y2. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2), moving toward the equal-share line.]
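The convergence argument can be checked numerically. In this sketch both flows see a loss in the same RTT whenever their combined rate exceeds the link capacity, which is a simplifying assumption (synchronized losses, equal RTTs); `aimd_two_flows` is a name invented for the example.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck and return their final rates.

    Assumption: losses are synchronized, so both flows halve together
    whenever x + y exceeds capacity; otherwise both add 1 per round.
    """
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2   # multiplicative decrease halves the gap |x - y|
        else:
            x, y = x + 1, y + 1   # additive increase leaves the gap unchanged
    return x, y


if __name__ == "__main__":
    print(aimd_two_flows(1, 50, capacity=100, rounds=500))
```

Additive increase moves the operating point along a 45-degree line, so the difference between the flows is preserved; each halving cuts that difference in two, so repeated cycles pull the point onto the equal-share line.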
Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets roughly R/2 (11 of the 20 connections)
Explicit Congestion Notification (ECN)
network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]
GBN: sender extended FSM
(single state: Wait; Λ denotes "no event" or "no action")

Λ (initialization)
    base = 1
    nextseqnum = 1

event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
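The sender FSM above can be sketched as a small Python class. This is an illustration under assumptions, not a network implementation: `GBNSender`, the `send` callback standing in for udt_send, and the boolean `timer_running` flag are all invented here, and sequence numbers are unbounded integers. A `max` guard keeps `base` from moving backwards on a stale ACK, which the FSM itself does not spell out.

```python
class GBNSender:
    """Go-Back-N sender sketch; `send` is a hypothetical stand-in for udt_send."""

    def __init__(self, window, send):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq -> data for sent, unACKed packets
        self.send = send
        self.timer_running = False

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False             # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True    # timer covers the oldest unACKed packet
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Handle an uncorrupted cumulative ACK for everything up to acknum."""
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)   # guard against stale ACKs (assumption)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet that is still unACKed."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])


if __name__ == "__main__":
    sent = []
    s = GBNSender(4, lambda seq, data: sent.append(seq))
    for i in range(5):
        s.rdt_send(f"d{i}")
    s.on_ack(2)
    s.on_timeout()
    print(sent)
```

The single timer and the resend-everything loop in `on_timeout` are what distinguish Go-Back-N from selective repeat.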
GBN: receiver extended FSM
(single state: Wait; Λ denotes "no event" or "no action")

Λ (initialization)
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

event: default
    udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
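The receiver side is simpler, since it only needs to remember expectedseqnum. A minimal sketch, assuming unbounded sequence numbers; the class `GBNReceiver` and the `corrupt` flag are illustrative stand-ins for the checksum test.

```python
class GBNReceiver:
    """Go-Back-N receiver sketch: deliver in order, re-ACK otherwise."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0        # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number that is sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)      # in-order: deliver to upper layer
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt, discard and re-ACK
        return self.last_ack


if __name__ == "__main__":
    r = GBNReceiver()
    print([r.rdt_rcv(s, f"p{s}") for s in (1, 2, 4, 5, 3)])
```

Packets 4 and 5 arrive while 3 is still missing, so they are dropped and ACK 2 is repeated, which is where the duplicate ACKs come from.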
GBN in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender                                  receiver
send pkt0
send pkt1
send pkt2 (lost, X)
send pkt3
(wait)
                                        receive pkt0, send ack0
                                        receive pkt1, send ack1
                                        receive pkt3, discard, (re)send ack1
rcv ack0, send pkt4
rcv ack1, send pkt5
ignore duplicate ACK
                                        receive pkt4, discard, (re)send ack1
                                        receive pkt5, discard, (re)send ack1
pkt 2 timeout
send pkt2
send pkt3
send pkt4
send pkt5
                                        rcv pkt2, deliver, send ack2
                                        rcv pkt3, deliver, send ack3
                                        rcv pkt4, deliver, send ack4
                                        rcv pkt5, deliver, send ack5
Selective repeat
• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #'s of sent, unACKed packets

Selective repeat: sender, receiver windows
Selective repeat
sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
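The receiver rules above can be sketched as follows. This assumes unbounded sequence numbers (no wraparound) and uses an invented class name, `SRReceiver`; returning `None` stands for "ignore".

```python
class SRReceiver:
    """Selective-repeat receiver sketch with an N-packet window."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}         # seq -> data for out-of-order packets
        self.delivered = []

    def on_pkt(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n             # already delivered: re-ACK so the sender can advance
        return None


if __name__ == "__main__":
    r = SRReceiver(4)
    for n in (0, 1, 3, 4, 5, 2):
        r.on_pkt(n, f"p{n}")
    print(r.delivered)
```

Re-ACKing packets below the window matters: if the original ACK was lost, the sender would otherwise never move its own window forward.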
Selective repeat in action
sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender                                  receiver
send pkt0
send pkt1
send pkt2 (lost, X)
send pkt3
(wait)
                                        receive pkt0, send ack0
                                        receive pkt1, send ack1
                                        receive pkt3, buffer, send ack3
rcv ack0, send pkt4
rcv ack1, send pkt5
record ack3 arrived
                                        receive pkt4, buffer, send ack4
                                        receive pkt5, buffer, send ack5
pkt 2 timeout
send pkt2
record ack4 arrived
record ack5 arrived
                                        rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
Q: what happens when ack2 arrives?
Selective repeat: dilemma
example:
• seq #'s: 0, 1, 2, 3
• window size = 3
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
Q: what relationship between seq # size and window size to avoid problem in (b)?
[Figure (a), no problem: sender sends pkt0, pkt1, pkt2 and receives all three ACKs, then sends pkt3 (lost, X) and a new pkt0. The receiver window has advanced and will accept packet with seq number 0.]
[Figure (b), oops!: sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X); sender times out and retransmits pkt0. The receiver will accept packet with seq number 0.]
receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!
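One way to explore the question is to compare the sequence numbers a sender might still retransmit with the numbers the receiver is ready to accept as new. The sketch below models the worst case from scenario (b): the receiver has accepted a full window while every ACK was lost. The helper name `sr_window_is_safe` is invented; the known result it illustrates is that the window size must be at most half the sequence number space.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmitted old packet can never look like a new one.

    Worst case: the receiver accepted packets 0..window-1 and advanced,
    but all ACKs were lost, so the sender may resend any of them.
    """
    old = {n % seq_space for n in range(0, window)}            # may be retransmitted
    new = {n % seq_space for n in range(window, 2 * window)}   # receiver accepts as new
    return old.isdisjoint(new)


if __name__ == "__main__":
    print(sr_window_is_safe(4, 3))  # the slide's example
    print(sr_window_is_safe(4, 2))
```

With 4 sequence numbers and a window of 3, the sets overlap on 0 and 1, reproducing the failure in (b); shrinking the window to 2 removes the overlap.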
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver
TCP segment structure
(each row is 32 bits wide)
    source port #, dest port #
    sequence number
    acknowledgement number
    head len, not used, flags U A P R S F, receive window
    checksum, Urg data pointer
    options (variable length)
    application data (variable length)
field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
TCP seq. numbers, ACKs
sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor
[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), alongside the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]
Byte stream in TCP
[Figure: an HTTP GET message of K bytes sits in the byte stream. A TCP header with seq no = 100 precedes a segment that starts at the 100th byte and carries M bytes. The window is N bytes; data beyond the window cannot be transmitted now.]
TCP seq. numbers, ACKs
simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A sends: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80
TCP round trip time, timeout
Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT
TCP round trip time, timeout
    $\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (100 to 350) versus time in seconds (1 to 106).]
TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
    $\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$   (typically, $\beta = 0.25$)
    $\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
    (estimated RTT plus "safety margin")
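The two moving averages and the timeout formula fit in a small class. This is a sketch under assumptions: the class name `RttEstimator` is invented, the initial values (EstimatedRTT set to the first sample, DevRTT to half of it) and the update order (DevRTT before EstimatedRTT) follow RFC 6298 rather than anything stated on the slides, and units are whatever the samples use.

```python
class RttEstimator:
    """EWMA-based RTT estimator producing a retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # initialization is an assumption (RFC 6298)

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # DevRTT is updated first, using the previous EstimatedRTT (assumed order).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


if __name__ == "__main__":
    est = RttEstimator(100.0)
    print(est.update(200.0))
```

A single outlier sample moves EstimatedRTT only slightly (100 to 112.5 ms here) but widens the safety margin noticeably, which is the intended behavior when RTT becomes erratic.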
TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks
let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments
TCP sender (simplified)
(single state: wait for event; Λ denotes "no event")

Λ (initialization)
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
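The simplified sender can be sketched in Python as below. The class `SimpleTcpSender`, the `send` callback standing in for "pass segment to IP", and the boolean timer flag are assumptions made for illustration; duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq of first byte -> segment data
        self.send = send             # hypothetical stand-in for handing a segment to IP
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send(self.next_seq, data)
        self.next_seq += len(data)   # sequence numbers count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)      # only the oldest unACKed segment is resent
        self.send(seq, self.unacked[seq])
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s in self.unacked if s < y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


if __name__ == "__main__":
    sent = []
    s = SimpleTcpSender(92, lambda seq, data: sent.append((seq, len(data))))
    s.on_app_data(b"x" * 8)
    s.on_app_data(b"y" * 20)
    s.on_timeout()
    print(sent)
```

The usage at the bottom replays a premature timeout: segments at Seq=92 and Seq=100 are sent, the timer fires, and only Seq=92 is retransmitted.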
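TCP: retransmission scenarios
lost ACK scenario (Host A, Host B):
1. A sends Seq=92, 8 bytes of data
2. B replies ACK=100, which is lost (X)
3. A times out and resends Seq=92, 8 bytes of data
4. B replies ACK=100
premature timeout (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
2. B replies ACK=100, then ACK=120
3. A times out before ACK=100 arrives and resends Seq=92, 8 bytes of data
4. ACK=100 arrives (SendBase=100), then ACK=120 (SendBase=120)
5. B answers the retransmission with ACK=120 (SendBase=120)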
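TCP: retransmission scenarios
cumulative ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. B replies ACK=100, which is lost (X), then ACK=120
3. ACK=120 arrives before the timeout, so A does not retransmit and sends Seq=120, 15 bytes of data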
TCP ACK generation [RFC 5681]
event at receiver, followed by TCP receiver action:
1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   action: immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment, higher-than-expected seq #; gap detected
   action: immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   action: immediately send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
TCP fast retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B returns ACK=100 four times (the original plus 3 DUP ACKs), and Host A resends Seq=100, 20 bytes of data before the timeout expires.]
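The trigger condition is just a counter over the incoming ACK stream. A minimal sketch, with an invented helper name and the threshold of three duplicates taken from the slide:

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ACK numbers at which a fast retransmit would fire.

    A duplicate is an ACK equal to the previous one; the retransmit
    fires when the duplicate count reaches the threshold.
    """
    fired = []
    last_ack, dup = None, 0
    for ack in acks:
        if ack == last_ack:
            dup += 1
            if dup == threshold:
                fired.append(ack)    # resend the segment starting at byte `ack`
        else:
            last_ack, dup = ack, 0
    return fired


if __name__ == "__main__":
    print(fast_retransmit_points([100, 100, 100, 100, 120]))
```

For the stream in the figure (ACK=100 four times), the first ACK is the original and the next three are duplicates, so the segment at Seq=100 is resent on the fourth arrival.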
TCP flow control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers …
• … slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack. Segments from the sender pass through IP code and TCP code into the TCP socket receiver buffers inside the OS; the application process reads from those buffers.]
TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data is passed to the application process.]
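The bookkeeping on both sides is simple arithmetic. The sketch below uses the textbook relation rwnd = RcvBuffer - (LastByteRcvd - LastByteRead), which is not written on the slide; the function names are invented for the example.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space the receiver advertises as rwnd."""
    buffered = last_byte_rcvd - last_byte_read   # received but not yet read by the app
    return rcv_buffer - buffered


def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """How many more bytes the sender may put in flight without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


if __name__ == "__main__":
    rwnd = advertised_window(4096, last_byte_rcvd=3000, last_byte_read=1000)
    print(rwnd, sendable_bytes(rwnd, last_byte_sent=5000, last_byte_acked=4000))
```

With a 4096-byte buffer holding 2000 unread bytes, the receiver advertises 2096; a sender with 1000 bytes already in flight may send 1096 more.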
Connection Management
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
[Figure: client and server, each with an application above the network, holding connection state: ESTAB and connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]
client: Socket clientSocket = newSocket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state: CLOSED → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
1. client (SYNSENT): choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x
2. server (SYN RCVD): choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. client (ESTAB): received SYNACK(x) indicates server is live; send ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data
4. server (ESTAB): received ACK(y) indicates client is live
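TCP 3-way handshake: FSM
(Λ denotes "no event" or "no action")
closed → listen: on Socket connectionSocket = welcomeSocket.accept(); action: Λ
closed → SYN sent: on Socket clientSocket = newSocket("hostname", "port number"); action: SYN(seq=x)
listen → SYN rcvd: on SYN(x); action: SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1); action: ACK(ACKnum=y+1)
SYN rcvd → ESTAB: on ACK(ACKnum=y+1); action: Λ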
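The handshake can be traced with plain dictionaries standing in for segments. This is an illustrative sketch of a loss-free exchange only; the function name, the dictionary keys and the state strings are made up here and are not a socket API.

```python
def three_way_handshake(x, y):
    """Trace a loss-free handshake; return (segments, client_state, server_state)."""
    segments = []
    client, server = "CLOSED", "LISTEN"

    # 1. client picks initial seq x and sends SYN
    syn = {"SYN": 1, "seq": x}
    segments.append(syn)
    client = "SYN_SENT"

    # 2. server picks initial seq y and answers with SYNACK, acking x
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    segments.append(synack)
    server = "SYN_RCVD"

    # 3. client checks the SYNACK acknowledges its SYN, then ACKs y
    if synack["acknum"] == x + 1:
        ack = {"ACK": 1, "acknum": synack["seq"] + 1}
        segments.append(ack)
        client = "ESTAB"
        # 4. server checks the ACK acknowledges its SYNACK
        if ack["acknum"] == y + 1:
            server = "ESTAB"
    return segments, client, server


if __name__ == "__main__":
    for seg in three_way_handshake(1000, 5000)[0]:
        print(seg)
```

Each side learns the other is live only when its own initial sequence number comes back incremented by one, which is why three segments are needed rather than two.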
TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection
client state and server state both start in ESTAB:
1. client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
5. server receives the ACK and enters CLOSED
The "Two Army Problem"

Principles of congestion control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!
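Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: $R/2$
• large delays as arrival rate, $\lambda_{in}$, approaches capacity
[Figure: Host A sends original data at rate $\lambda_{in}$ through a router with unlimited shared output link buffers; Host B receives throughput $\lambda_{out}$. Plots: $\lambda_{out}$ versus $\lambda_{in}$ (both up to $R/2$), and delay versus $\lambda_{in}$ (growing as $\lambda_{in}$ approaches $R/2$).]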
Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$
[Figure: Host A sends original data at rate $\lambda_{in}$, and original plus retransmitted data at rate $\lambda'_{in}$, through a router with finite shared output link buffers to Host B, which receives $\lambda_{out}$.]

Causes/costs of congestion: scenario 2
idealization: perfect knowledge
• sender sends only when router buffers available