QCon London 2015 Protocols of Interaction

Protocols of Interaction Best Current Practices

Todd L. Montgomery @toddlmontgomery

What is a Protocol?

Why should I care!?

pro·to·col noun \ˈprō-tə-ˌkȯl, -ˌkōl, -ˌkäl, -kəl\

...

3 b : a set of conventions governing the treatment and especially the formatting of data in an electronic communications system <network protocols>

...

3 a : a code prescribing strict adherence to correct etiquette and precedence (as in diplomatic exchange and in the military services) <a breach of protocol>

Protocols of Interaction Matter

In an emerging era of micro-services, protocols of interaction matter

Protocols are a rich source of solutions to micro-service problems

Algorithms, Performance, Concurrency

Security

Multi-Disciplinary

Number Theory, Statistics

Graph Theory, Biology ?!

Networks, and especially the Internet, are Hostile Environments

Data can be lost, duplicated, and re-ordered!!

TCP connections can…

• be closed unexpectedly
• end in an unknown state
• be intercepted by idiots, er Proxies

Which means data over TCP* might be…

• Lost
• Duplicated
• Re-Ordered

* When connections are re-established

Case Study 1

Loose Ordering = The New Normal

(De)multiplexing

Sync Requests & Responses

[Diagram: three Request/Response exchanges, one after another]

Throughput limited by Round-Trip Time (RTT)!

Async Requests & Responses

[Diagram: three Requests sent back to back, followed by three Responses]

Throughput less limited by Round-Trip Time!

Async Requests & Responses

Correlation!

Request 0, Request 1, Request 2

Response 0, Response 1, Response 2

Aside…

Ordering is an Illusion!!

Compiler can re-order

Runtime can re-order

CPU can re-order

Ordering has to be imposed!

Correlation!

Request 0, Request 1, Request 2

Response 0, Response 1, Response 2

Ordering

Correlation!

Request 0, Request 1, Request 2

Response 0, Response 1, Response 2

(Valid) Re-Ordering

Handling the Unexpected

Request 0

Response 1

Invalid, Drop. We only know of 0. 1 is unknown!

SCTP, HTTP/2 (SPDY)

…most OSI Layer 4 protocols
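The correlation rule from this case study can be sketched as a small table of outstanding request IDs: a response is accepted if its ID is known, in whatever order it arrives, and dropped if it is not. This is only a minimal illustration in C. The names `pending_table`, `on_request_sent` and `on_response`, and the fixed table size, are assumptions made for the example and are not taken from SCTP, HTTP/2 or any other protocol named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64  /* assumed bound on outstanding requests */

typedef struct {
    uint32_t ids[MAX_PENDING];
    bool     in_use[MAX_PENDING];
} pending_table;

/* Record a request as outstanding. Returns false if the table is full. */
static bool on_request_sent(pending_table *t, uint32_t id)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!t->in_use[i]) {
            t->ids[i] = id;
            t->in_use[i] = true;
            return true;
        }
    }
    return false;
}

/* Match a response to its request. Returns true if the ID was outstanding
 * (valid, whatever order it arrived in) and false if the ID is unknown,
 * in which case the caller should drop the response. */
static bool on_response(pending_table *t, uint32_t id)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (t->in_use[i] && t->ids[i] == id) {
            t->in_use[i] = false;
            return true;
        }
    }
    return false;
}
```

With only Request 0 outstanding, `on_response(&t, 1)` returns false, which is the "Invalid, Drop" case. Responses 2, 0, 1 to Requests 0, 1, 2 are all accepted, which is the "(Valid) Re-Ordering" case.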

Case Study 2

Can you hear me now?

Timeouts & Retries

[Diagram: Request, Processing, ACK]

Handling the unexpected

[Diagram: Request, a loss (X), and the sender's Timeout Interval]

[Diagram: Request, Processing, ACK, a loss (X), and the Timeout Interval]

Retransmit at end of interval

Spurious Retransmits

[Diagram: Original, Processing…, ACK, Retransmit, and the Timeout Interval]

Interval = N x “typical” RTT

Account for processing delay

[Diagram: Timeout Interval with a loss (X)]

“Average”

Measure! But very “noisy”?

[Plot: RTT Measurement]

Variances in processing, transmission, etc.

TCP Retransmit Timeout (RTO)

Err = M - A
A <- A + g * Err
D <- D + h * (|Err| - D)
RTO = A + 4D

M = measurement, A = smoothed average, D = smoothed mean deviation, g and h = gain constants (0 to 1)

Do you measure on a Retransmit? NO!
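The update rules above translate almost line for line into code. The sketch below is a minimal C version under stated assumptions: the struct and function names are invented for illustration, times are plain doubles in whatever unit the caller uses, and the caller is assumed to know whether a sample came from a retransmitted request. Samples from retransmits are skipped, as the slide says.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double a;  /* A: smoothed average RTT */
    double d;  /* D: smoothed mean deviation */
    double g;  /* gain for the average, between 0 and 1 */
    double h;  /* gain for the deviation, between 0 and 1 */
} rto_estimator;

/* Feed one RTT measurement M. Samples taken from a retransmitted request
 * are ignored, because the ACK cannot be tied to one specific send. */
static void rto_on_sample(rto_estimator *e, double m, bool was_retransmitted)
{
    assert(e != NULL);
    if (was_retransmitted) {
        return;
    }
    double err = m - e->a;                    /* Err = M - A */
    double abs_err = err < 0.0 ? -err : err;  /* |Err| */
    e->a += e->g * err;                       /* A <- A + g * Err */
    e->d += e->h * (abs_err - e->d);          /* D <- D + h * (|Err| - D) */
}

/* RTO = A + 4D */
static double rto_timeout(const rto_estimator *e)
{
    assert(e != NULL);
    return e->a + 4.0 * e->d;
}
```

As a worked example, start from A = 100, D = 0 with g = 1/8 and h = 1/4 (gain values commonly cited for TCP). One sample of M = 180 gives Err = 80, A = 110, D = 20, and so RTO = 190.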

Does processing twice hurt?

[Diagram: Original, ACK, a loss (X), Retrans after the Timeout Interval; Process Once vs. Process Twice]

Are Original & Retransmit treated the same?

[Diagram: Original, ACK, a loss (X), Retrans after the Timeout Interval; Process Once vs. Process Twice]

TCP, SCTP, Aeron

…anything with reliability
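One way to make "process once" hold when a retransmit arrives is for the receiver to remember what it has already handled and to answer a duplicate with another ACK but no second round of work. The C sketch below assumes a single stream whose sequence numbers only increase and arrive in order. That assumption, and all the names, are illustrative and not drawn from TCP, SCTP or Aeron.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { PROCESSED_AND_ACKED, DUPLICATE_ACKED_ONLY } outcome;

typedef struct {
    bool     seen_any;        /* false until the first request arrives */
    uint64_t last_processed;  /* highest sequence number already handled */
    unsigned process_count;   /* how many times the work actually ran */
} receiver;

/* Handle a request that may be an original or a retransmit.
 * Both are ACKed; only the first copy triggers processing. */
static outcome on_request(receiver *r, uint64_t seq)
{
    assert(r != NULL);
    if (r->seen_any && seq <= r->last_processed) {
        return DUPLICATE_ACKED_ONLY;  /* retransmit: ACK again, skip the work */
    }
    r->seen_any = true;
    r->last_processed = seq;
    r->process_count++;               /* stand-in for the real processing */
    return PROCESSED_AND_ACKED;
}
```

If processing twice is harmless (the operation is idempotent), this bookkeeping can be dropped. If it is not, something like it has to exist on the receiving side.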

Case Study 3

What I Need! When I Need It!

“Lifetime” Management

“Managing” Application Working Set

Caching Algorithms

LRU, MRU, PLRU, RR, SLRU, LFU, …

“Liveness” is essential

[Diagram: Service A and Service B exchange a Request and an ACK]

Service A is Alive! Service B is Alive!

Consequence of Processing

[Diagram: Service A and Service B exchange Keepalives]

Service A is Alive! Service B is Alive!

Absence of Processing

RIP Route Deletion

Step 0 - route info broadcast @ 30 seconds
Step 1 (3 min) - Set Distance to Infinity (16)
Step 2 (+1 min) - Delete Route

Aside… RIP… aptly named

Aeron Driver Keepalive

Time of Last Activity = Shared Variable

Doesn’t need to be a message…
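The "time of last activity as a shared variable" idea can be sketched with a single atomic timestamp: the monitored side stores the current time whenever it does anything, and the monitoring side compares that timestamp with its own clock. The C11 sketch below illustrates the idea only and is not Aeron's code. The names, the millisecond unit, and passing the clock value in as a parameter are all assumptions made to keep the example small and deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    _Atomic int64_t last_activity_ms;  /* written by one side, read by the other */
} liveness;

/* Called by the monitored side whenever it does any work. */
static void liveness_touch(liveness *l, int64_t now_ms)
{
    assert(l != NULL);
    atomic_store_explicit(&l->last_activity_ms, now_ms, memory_order_release);
}

/* Called by the monitoring side: alive if activity was seen recently enough. */
static bool liveness_is_alive(liveness *l, int64_t now_ms, int64_t timeout_ms)
{
    assert(l != NULL);
    int64_t last = atomic_load_explicit(&l->last_activity_ms, memory_order_acquire);
    return (now_ms - last) <= timeout_ms;
}
```

No message is exchanged here. Liveness is inferred from the age of the timestamp, so it does not depend on the state of any transport connection.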

[Diagram: Service A and Service B exchange Bye messages]

Service A is gone! Service B is gone!

Optimization, but insufficient with arbitrary failures

Liveness often exists across transient connectivity

So… Don’t conflate transport state with liveness!

Like TCP connection state

BGP, OSPF

Transports…

almost every protocol

Case Study 4

Elasti-What?

Self-Similar Behavior

[Diagram: three copies of Request X arrive at the responder as Request X, X, X]

Multiple same/similar requests at the same time

Response X, X, X

Similar Problem…

Reliable Multicast

[Diagram: a sender multicasts 1, 2, 3 to three receivers; each receiver has a loss (X)]

Non-correlated loss

NAK 2, NAK 1, NAK 3 arrive at the sender as NAK 1, 2, 3

Request individual lost data

Retransmit 1, 2, 3

[Diagram: a sender multicasts 1, 2, 3 to three receivers; each receiver has a loss (X)]

Temporally/Spatially Correlated Loss

NAK 2, NAK 2, NAK 2 arrive at the sender as NAK 2, 2, 2

Multiple requests for same data

Retransmit 2, 2, 2

[Diagram: three copies of Request 2 arrive at the responder as Request 2, 2, 2]

It’s a generic problem!

Overloading Responder & Network

Don’t Immediately Request, Listen first

[Diagram: after a Timeout!, Request 2 is sent; a requester that hears it chooses Suppress Request]

How long to wait & listen for?

Statistics to the Rescue!

SRM Backoff

RandomBackoff = [C1, C1+C2] * 1-way delay

Random is more than good enough
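The listen-first rule and the SRM backoff formula can be put together in a few lines. In the C sketch below a requester that detects a loss schedules its request at a uniformly random time in [C1, C1+C2] times the one-way delay, and cancels it if it overhears the same request from someone else first. The function and struct names are assumptions, `rand()` stands in for a better random source, and cancelling outright simplifies what a full protocol would do after hearing another request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Pick a delay uniformly from [C1 * d, (C1 + C2) * d], where d is the
 * estimated one-way delay to the responder. */
static double srm_backoff(double c1, double c2, double one_way_delay)
{
    assert(c1 >= 0.0 && c2 >= 0.0 && one_way_delay >= 0.0);
    double u = (double)rand() / (double)RAND_MAX;  /* uniform in [0, 1] */
    return (c1 + c2 * u) * one_way_delay;
}

typedef struct {
    bool   armed;    /* a request is scheduled but not yet sent */
    double fire_at;  /* time at which the request will be sent */
} request_timer;

/* Loss detected: listen first by scheduling the request in the future. */
static void on_loss_detected(request_timer *t, double now,
                             double c1, double c2, double one_way_delay)
{
    assert(t != NULL);
    t->armed = true;
    t->fire_at = now + srm_backoff(c1, c2, one_way_delay);
}

/* Someone else asked for the same data before our timer fired: suppress. */
static void on_request_overheard(request_timer *t)
{
    assert(t != NULL);
    t->armed = false;
}

/* Returns true exactly once, when our own request is due to be sent. */
static bool should_send(request_timer *t, double now)
{
    assert(t != NULL);
    if (t->armed && now >= t->fire_at) {
        t->armed = false;
        return true;
    }
    return false;
}
```

Because each requester draws its own delay, the earliest timer usually fires alone and the others are suppressed, so the responder sees one request instead of many.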

Optimal Multicast Feedback

    double RandomBackoff(double T_maxBackoff, double groupSize)
    {
        /* UniformRand(max) is assumed to return a uniform value in [0, max] */
        double lambda = log(groupSize) + 1;
        double x = UniformRand(lambda / T_maxBackoff) + lambda / (T_maxBackoff * (exp(lambda) - 1));

        return ((T_maxBackoff / lambda) * log(x * (exp(lambda) - 1) * (T_maxBackoff / lambda)));
    }

Truncated Exponential Distribution

[Diagram: Request 2 and Request 2 arrive at the responder as Request 2, 2; X marks what is shed]

Must also shed duplicates on the responder

Response 2, 2

Shed second “Request 2” if too soon
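Shedding on the responder can be as simple as remembering when each item was last answered and ignoring a repeat request that arrives inside a hold-off window. The C sketch below assumes small integer item IDs, millisecond timestamps supplied by the caller, and a fixed hold-off. None of these names or choices comes from SRM, PGM or Aeron.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITEMS 1024  /* assumed bound on distinct item IDs */

typedef struct {
    bool    answered[MAX_ITEMS];        /* has this item ever been answered? */
    int64_t last_answer_ms[MAX_ITEMS];  /* when it was last answered */
    int64_t holdoff_ms;                 /* repeats inside this window are shed */
} responder;

/* Returns true if the responder should answer, false if the request is shed
 * because the same item was answered too recently. */
static bool should_respond(responder *r, uint32_t item, int64_t now_ms)
{
    assert(r != NULL && item < MAX_ITEMS);
    if (r->answered[item] && now_ms - r->last_answer_ms[item] < r->holdoff_ms) {
        return false;
    }
    r->answered[item] = true;
    r->last_answer_ms[item] = now_ms;
    return true;
}
```

With a 50 ms hold-off, a second Request 2 that arrives 10 ms after the first is shed, while one that arrives 50 ms later is answered again.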

SRM, PGM, Aeron

http://en.wikipedia.org/wiki/Scalable_Reliable_Multicast
http://www.eurecom.fr/en/publication/107/detail/optimal-multicast-feedback

Case Study 5

Hey, Slow Down!

Flow (& Congestion) Control

Stop-And-Wait Flow Control

[Diagram: each Data is followed by its ACK, one exchange per RTT]

Throughput = Data Length / RTT

[Diagram: Bandwidth and Delay, with the BDP as the (Buffer) between them]

BDP = (Bytes / sec) * sec = Bytes

[Diagram: N Data “Blobs” and one ACK within an RTT]

Throughput = N * Data Length / RTT
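The two throughput formulas and the BDP can be checked with a little arithmetic. The C sketch below assumes that the delay used for the BDP is the round-trip time and that every blob has the same length. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput = N * Data Length / RTT, in bytes per second.
 * N = 1 is the stop-and-wait case. */
static double throughput_bytes_per_sec(uint64_t n, uint64_t data_length_bytes,
                                       double rtt_sec)
{
    assert(rtt_sec > 0.0);
    return (double)(n * data_length_bytes) / rtt_sec;
}

/* BDP = (bytes / sec) * sec = bytes */
static double bdp_bytes(double bandwidth_bytes_per_sec, double delay_sec)
{
    assert(bandwidth_bytes_per_sec >= 0.0 && delay_sec >= 0.0);
    return bandwidth_bytes_per_sec * delay_sec;
}

/* Smallest N such that N blobs cover the BDP. */
static uint64_t blobs_to_fill_pipe(double bdp, uint64_t data_length_bytes)
{
    assert(bdp >= 0.0 && data_length_bytes > 0);
    uint64_t whole = (uint64_t)(bdp / (double)data_length_bytes);
    return ((double)(whole * data_length_bytes) < bdp) ? whole + 1 : whole;
}
```

For a path of 1,000,000 bytes/sec with an RTT of 0.125 sec, the BDP is 125,000 bytes. Stop-and-wait with 1,000-byte blobs reaches only 8,000 bytes/sec, and it takes N = 125 blobs in flight to reach the full 1,000,000 bytes/sec.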

So… How big is N?

This is surprisingly hard to answer

It depends…

Big… but

Don’t overflow receiver

Don’t overflow “network”

TCP Flow Control

Receiver advertises N

TCP Congestion Control

Sender probes for network N

TCP Sender

min(Receiver N, Network N)

Only go as fast as Network & Receiver
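The min(Receiver N, Network N) rule can be written down directly. The C sketch below computes how many bytes a sender may still put on the wire, and adds a deliberately simplified probe for the network's N: grow by one segment per loss-free round trip, halve on loss. The probe illustrates the idea of probing and is not TCP's actual congestion-control algorithm. All names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes the sender may send right now:
 * min(Receiver N, Network N) minus what is already in flight. */
static uint32_t sendable_bytes(uint32_t receiver_window,
                               uint32_t network_window,
                               uint32_t bytes_in_flight)
{
    uint32_t limit = receiver_window < network_window ? receiver_window
                                                      : network_window;
    return bytes_in_flight >= limit ? 0 : limit - bytes_in_flight;
}

/* Simplified probe for the network's N, applied once per round trip:
 * add one segment if no loss was seen, halve (but keep at least one
 * segment) if loss was seen. */
static uint32_t probe_network_window(uint32_t network_window,
                                     uint32_t segment_size, bool loss_seen)
{
    assert(segment_size > 0);
    if (loss_seen) {
        uint32_t halved = network_window / 2;
        return halved < segment_size ? segment_size : halved;
    }
    return network_window + segment_size;
}
```

The receiver's N arrives as an advertisement, while the network's N has to be discovered by the sender. That difference is why the two are tracked separately and combined only at send time.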

TCP, Aeron

http://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm

One more thing…

Queue Management

Perhaps the single most useful thing!

Effective management of queues cannot be overlooked

Unbounded Queues are bad, m’kay

Bounding implies

Back pressure and/or Dropping
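Both consequences of bounding can be shown on one small ring buffer. In the C sketch below `queue_offer` refuses a new item when the queue is full, which pushes back on the producer, and `queue_offer_drop_oldest` makes room by discarding the oldest item instead. The capacity, the element type and the function names are illustrative assumptions, and the queue is single-threaded for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4  /* small on purpose, for illustration */

typedef struct {
    int    items[QUEUE_CAPACITY];
    size_t head;     /* index of the oldest item */
    size_t count;    /* number of items currently queued */
    size_t dropped;  /* items discarded by the drop policy */
} bounded_queue;

/* Back pressure: refuse the item when full so the producer must slow down. */
static bool queue_offer(bounded_queue *q, int item)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAPACITY) {
        return false;
    }
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = item;
    q->count++;
    return true;
}

/* Dropping: when full, discard the oldest item to make room for the new one. */
static void queue_offer_drop_oldest(bounded_queue *q, int item)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAPACITY) {
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        q->dropped++;
    }
    (void)queue_offer(q, item);
}

/* Remove the oldest item. Returns false if the queue is empty. */
static bool queue_poll(bounded_queue *q, int *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0) {
        return false;
    }
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return true;
}
```

Which policy fits depends on the data: back pressure suits work that must not be lost, while dropping suits data that goes stale.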

CoDel

locally minimize delay in queue

combat bufferbloat

http://en.wikipedia.org/wiki/CoDel

Just a taste…

Takeaways!

Protocols are a rich source of solutions to complicated problems

Protocols of interaction matter & can be tremendously impactful for better or worse…

Questions?

• IETF http://www.ietf.org/
• Aeron https://github.com/real-logic/Aeron
• SlideShare http://www.slideshare.com/toddleemontgomery
• Twitter @toddlmontgomery

Thank You!