Avishai Ish Shalom @nukemberg Resilient Design 101 … Ish-Shalom... · Queues are everywhere!...

Resilient design 101 Avishai Ish-Shalom github.com/avishai-ish-shalom @nukemberg [email protected]

Resilient design 101

Avishai Ish-Shalom

github.com/avishai-ish-shalom @nukemberg [email protected]

Wix in numbers

~ 600 engineers

~ 2000 employees

~ 100M users

~ 500 micro services

Wix Engineering Locations

▪ Israel: Tel-Aviv, Be’er Sheva

▪ Lithuania: Vilnius

▪ Ukraine: Kyiv, Dnipro

Queues

01

Queues are everywhere!

▪ Futures/Executors

▪ Sockets

▪ Locks (DB Connection pools)

▪ Callbacks in node.js/Netty

Anything async?!

Queues

▪ Incoming load (arrival rate)

▪ Service from the queue (service rate)

▪ Service discipline (FIFO/LIFO/Priority)

▪ Latency = Wait time + Service time

▪ Service time independent of queue

It varies

▪ Arrival rate fluctuates

▪ Service time fluctuates

▪ Delays accumulate

▪ Idle time wasted

Queues are almost always full or near-empty!

Capacity & Latency

▪ Latency (and queue size) rises to infinity as utilization approaches 1

▪ For QoS ρ << 0.75

▪ Decent latency -> over capacity

ρ = arrival rate / service rate (utilization)

Implications

Infinite queues:

▪ Memory pressure / OOM

▪ High latency

▪ Stale work

Always limit queue size!

Work item TTL*
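
A minimal sketch of the two rules above (cap the queue, give work items a TTL), assuming a single-process Python service. The class name `BoundedTTLQueue` and its `offer`/`poll` methods are hypothetical names chosen for illustration, not an API from the talk.

```python
import queue
import time


class BoundedTTLQueue:
    """FIFO queue with a hard size cap and a per-item time-to-live."""

    def __init__(self, maxsize, ttl_seconds, clock=time.monotonic):
        self._q = queue.Queue(maxsize=maxsize)
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the TTL logic can be tested

    def offer(self, item):
        """Enqueue without blocking; return False when the queue is full."""
        try:
            self._q.put_nowait((self._clock(), item))
            return True
        except queue.Full:
            return False  # caller decides: reject, retry later, or fall back

    def poll(self):
        """Return the next fresh item, dropping expired ones; None if empty."""
        while True:
            try:
                enqueued_at, item = self._q.get_nowait()
            except queue.Empty:
                return None
            if self._clock() - enqueued_at <= self._ttl:
                return item
            # Item waited longer than its TTL: it is stale work, skip it.
```

A full queue surfaces as `offer(...) == False` at the producer, which is where the rejection or fallback decision belongs.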

Latency & Service time

λ = wait time

σ = service time

ρ = utilization

Utilization fluctuates!

▪ 10% fluctuation at ρ = 0.5 will hardly affect latency (~ 1.1x)

▪ 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency)

▪ Be careful when overloading resources

▪ During peak load we must be extra careful

▪ Highly varied load must be capped
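
The ~1.1x and ~10x figures can be reproduced under one assumption that the transcript does not state: an M/M/1 queue, where latency = σ / (1 − ρ). The function names below are hypothetical, and the model is an assumption for illustration.

```python
def mm1_latency(service_time, utilization):
    """Mean latency (wait + service) of an M/M/1 queue: sigma / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def amplification(rho, relative_increase=0.10):
    """How much latency grows when utilization rises by relative_increase."""
    bumped = rho * (1 + relative_increase)
    return mm1_latency(1.0, bumped) / mm1_latency(1.0, rho)


for rho in (0.5, 0.9):
    print(f"rho={rho}: latency x{amplification(rho):.1f}")
```

At ρ = 0.5 a 10% bump moves utilization to 0.55 and latency by about 1.1x; at ρ = 0.9 the same bump moves utilization to 0.99 and latency by about 10x.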

Practical advice

▪ Use chokepoints (throttling/load shedding)

▪ Plan for low utilization of slow resources

Example

Resource | Latency | Planned utilization
RPC thread pool | 1ms | 0.75
DB connection pool | 10ms | 0.5

Backpressure

▪ Internal queues fill up and cause latency

▪ Front layer will continue sending traffic

▪ We need to inform the client that we’re out of capacity

▪ E.g.: Blocking client, HTTP 503, finite queues for threadpools

Backpressure

▪ Blocking code has backpressure by default

▪ Executors, remote calls and async code need explicit backpressure

▪ E.g. producer/consumer through Kafka
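
A sketch of explicit backpressure for an executor, assuming Python's `concurrent.futures.ThreadPoolExecutor`, which accepts submissions without a bound on queued work. The wrapper class `BoundedExecutor` and its `try_submit` method are hypothetical names; the semaphore counts running plus queued tasks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """ThreadPoolExecutor wrapper that caps running plus queued tasks."""

    def __init__(self, max_workers, max_pending):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_workers + max_pending)

    def try_submit(self, fn, *args):
        """Submit fn, or return None immediately when there is no capacity."""
        if not self._slots.acquire(blocking=False):
            return None  # caller pushes back, e.g. answers HTTP 503
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot once the task finishes, fails, or is cancelled.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Switching `acquire(blocking=False)` to a blocking acquire would turn rejection into a blocking client, the other backpressure style the slides mention.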

Load shedding

▪ A tradeoff between latency and error rate

▪ Cap the queue size / throttle arrival rate

▪ Reject excess work or send to fallback service

Example: Facebook uses LIFO queue and rejects stale work

http://queue.acm.org/detail.cfm?id=2839461
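
A toy illustration of the LIFO-plus-staleness idea, not the implementation described in the linked article. The class name, the fixed `max_age_seconds` threshold and the `shed` counter are assumptions made for this sketch, and the class is not thread-safe.

```python
import time
from collections import deque


class SheddingLifoQueue:
    """Serve the newest work first; shed the oldest on overflow and anything stale."""

    def __init__(self, capacity, max_age_seconds, clock=time.monotonic):
        self._items = deque()
        self._capacity = capacity
        self._max_age = max_age_seconds
        self._clock = clock
        self.shed = 0  # number of work items rejected so far

    def push(self, item):
        if len(self._items) >= self._capacity:
            self._items.popleft()  # queue is capped: drop the oldest item
            self.shed += 1
        self._items.append((self._clock(), item))

    def pop(self):
        """Return the newest item, or None if the queue is empty or all stale."""
        if not self._items:
            return None
        enqueued_at, item = self._items.pop()
        if self._clock() - enqueued_at > self._max_age:
            # The newest item is stale, so every older item is stale too.
            self.shed += 1 + len(self._items)
            self._items.clear()
            return None
        return item
```

The tradeoff named above is visible in the counter: every increment of `shed` is an error returned to some client in exchange for lower latency for the rest.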

Thread Pools

02

Jetty architecture

(Diagram: socket, acceptor thread, thread pool (QTP))

Too many threads

▪ O/S also has a queue

▪ Threads take memory, FDs, etc.

▪ What about shared resources?

Bad QoS, GC storms, ungraceful degradation

Not enough threads

wrong

▪ Work will queue up

▪ Not enough RUNNING threads

High latency, low resource utilization

Capacity/Latency tradeoffs

When optimizing for latency:

For low latency, resources must be available when needed

Keep the queue empty

▪ Block or apply backpressure

▪ Keep the queue small

▪ Overprovision

Capacity/Latency tradeoffs

When optimizing for capacity:

For max capacity, resources must always have work waiting

Keep the queue full

▪ We use a large queue to buffer work

▪ Queueing increases latency

▪ Queue size >> concurrency

How many threads?

▪ Assuming CPU is the limiting resource

▪ Compute by maximal load (opt. latency)

▪ With a Grid: How many cores???

Java Concurrency in Practice (http://jcip.net/)

How many threads?

How to compute?

▪ Transaction time = W + C

▪ C ~ Total CPU time / throughput

▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC - and 0.75 utilization target)

▪ Memory and other resource limits
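
The sizing formula itself is missing from this transcript. The sketch below assumes it is the one from Java Concurrency in Practice, the book the previous slide cites: threads = cores × U × (1 + W/C). The function name and the sample numbers are hypothetical.

```python
import math


def pool_size(cpu_count, target_utilization, wait_time, compute_time):
    """Thread count from cores * U * (1 + W/C), rounded up, at least 1.

    wait_time (W) and compute_time (C) must use the same unit.
    """
    if compute_time <= 0:
        raise ValueError("compute_time must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    size = cpu_count * target_utilization * (1 + wait_time / compute_time)
    return max(1, math.ceil(size))


# Hypothetical service: 8 cores, U = 0.5, each transaction waits 90ms
# on I/O (W) and burns 10ms of CPU (C).
print(pool_size(cpu_count=8, target_utilization=0.5, wait_time=90, compute_time=10))
```

With these numbers the result is 40 threads. Memory and other resource limits, as the last bullet says, can still force a lower number.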

What about async servers?

Async servers architecture

(Diagram: sockets, O/S (epoll, syscalls), event loop, callbacks)

Async systems

▪ Event loop callback/handler queue

▪ The callback queue is unbounded (!!!)

▪ Event loop can block (ouch)

▪ No inherent concurrency limit

▪ No backpressure (*)

Async systems - overload

▪ No preemption -> no QoS

▪ No backpressure -> overload

▪ Hard to tune

▪ Hard to limit concurrency/queue size

▪ Hard to debug

So what’s the point?

▪ High concurrency

▪ More control (timeouts)

▪ I/O heavy servers

Still evolving... let’s revisit in a few years?

Little’s Law

03

Little’s law

▪ Holds for all distributions

▪ For “stable” systems

▪ Holds for systems and their subsystems

▪ “Throughput” is either Arrival rate or Service rate depending on the context.

Be careful!

L = λ⋅W

L = Avg clients in the system

λ = Avg throughput

W = Avg latency

Using Little’s law

▪ How many requests queued inside the system?

▪ Verifying load tests / benchmarks

▪ Calculating latency when no direct measurement is possible

Go watch Gil Tene’s "How NOT to Measure Latency"

Read Benchmarking Blunders and Things That Go Bump in the Night

Using Little’s law

(Diagram: a load balancer (LB) using least connections in front of two backends)

▪ Backend 1: W1 = 0.1, λ1 = 100

▪ Backend 2: W2 = 0.001, λ2 = 10,000
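
A worked check of the numbers above with L = λ⋅W, assuming λ is in requests per second and W in seconds (the transcript gives no units). The function name is hypothetical.

```python
def clients_in_system(throughput, latency):
    """Little's law: L = lambda * W."""
    return throughput * latency


backends = {
    "backend 1": (100, 0.1),       # (lambda1, W1)
    "backend 2": (10_000, 0.001),  # (lambda2, W2)
}

for name, (lam, w) in backends.items():
    print(f"{name}: L = {clients_in_system(lam, w):.1f} requests in flight")
```

Both backends come out at about 10 requests in flight. One reading of the slide is that a least-connections balancer equalizes L, so the backend that answers 100x faster ends up with 100x the throughput.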

Timeouts

04

How not to timeout

People use arbitrary timeout values

▪ DB timeout > Overall transaction timeout

▪ Cache timeout > DB latency

▪ Huge unrealistic timeouts

▪ Refusing to return errors

P.S: connection timeout, read timeout & transaction timeout are not the same thing

Deciding on timeouts

Use the distribution, Luke!

▪ Resources/Errors tradeoff

▪ Cumulative distribution chart

▪ Watch out for multiple modes

▪ Context, context, context
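
A small sketch of reading a timeout off the cumulative distribution. The latency samples are made up for illustration (a fast mode, a slow mode and one outlier), and the helper names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(samples, timeout):
    """Fraction of requests that a given timeout would turn into errors."""
    return sum(1 for s in samples if s > timeout) / len(samples)


# Hypothetical measurements in ms: two modes (5ms and 80ms) plus one outlier.
latencies_ms = [5] * 90 + [80] * 9 + [900]

for p in (50, 95, 99):
    t = percentile(latencies_ms, p)
    print(f"p{p} = {t}ms -> error rate {error_rate(latencies_ms, t):.2%}")
```

With two modes, the median says almost nothing about the slow path: a timeout set at p50 here would fail every request in the second mode. That is the resources/errors tradeoff in numbers.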

Timeouts should be derived from real world constraints!

UX numbers every developer needs to know

▪ Smooth motion perception threshold: ~ 20ms

▪ Immediate reaction threshold: ~ 100ms

▪ Delay perception threshold: ~ 300ms

▪ Focus threshold: ~ 1sec

▪ Frustration threshold: ~ 10sec

Google's RAIL model

UX powers of 10

Hardware latency numbers every developer needs to know

▪ SSD disk seek: ~ 0.15ms

▪ Magnetic disk seek: ~ 10ms

▪ Round trip within same datacenter: ~ 0.5ms

▪ Packet roundtrip US->EU->US: ~ 150ms

▪ Send 1M over typical user WAN: ~ 1sec

Latency numbers every developer needs to know (updated)

Timeout Budgets

▪ Decide on global timeouts

▪ Pass context object

▪ Each stage decrements budget

▪ Local timeouts according to budget

▪ If budget too low, terminate preemptively

Think microservices

Example

Global: 500ms

Stage | Used | Remaining budget | Timeout
Authorization | 6ms | 494ms | 100ms
Data fetch (DB) | 123ms | 371ms | 200ms
Processing | 47ms | 324ms | 371ms
Rendering | 89ms | 235ms | 324ms
Audit | 2ms | - | -
Filter | 10ms | 223ms | 233ms
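
A minimal sketch of the budget as a context object, assuming it is tracked as a deadline (equivalent to each stage decrementing a counter). `Deadline`, `BudgetExhausted` and `timeout_for` are hypothetical names; time units are whatever the injected clock returns.

```python
import time


class BudgetExhausted(Exception):
    """Raised when too little of the global budget is left to start a stage."""


class Deadline:
    """Context object passed from stage to stage; carries the global timeout."""

    def __init__(self, budget, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget

    def remaining(self):
        return max(0, self._expires_at - self._clock())

    def timeout_for(self, local_timeout, minimum=0):
        """Timeout for the next stage: its own limit, capped by what is left."""
        left = self.remaining()
        if left <= minimum:
            # Budget too low: terminate preemptively instead of starting work.
            raise BudgetExhausted(f"only {left} left of the global budget")
        return min(local_timeout, left)
```

With a 500ms budget, a stage asking for 100ms at the start gets 100ms, and a stage with no tighter local limit that starts after 129ms have been used gets the remaining 371ms, matching the Authorization and Processing rows of the example.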

The debt buyer

▪ Transactions may return eventually after timeout

▪ Does the client really have to wait?

▪ Timeout and return error/default response to client (50ms)

▪ Keep waiting asynchronously (1 sec)

Can’t be used when client is expecting data back
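
One way to sketch the pattern with Python's `concurrent.futures`: answer the client after a short timeout and let the work finish in the background. The function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_debt_buyer(pool, fn, client_timeout, default, on_late_result):
    """Run fn on pool; give the client `default` if it takes too long.

    The work is not cancelled on timeout. When it eventually completes,
    on_late_result is called with the finished future.
    """
    future = pool.submit(fn)
    try:
        return future.result(timeout=client_timeout)
    except FutureTimeout:
        # The client stops waiting here; the transaction keeps running.
        future.add_done_callback(on_late_result)
        return default
```

This sketch does not bound the background wait. The longer second timeout from the slide (1 sec) would have to be enforced inside `fn` itself, for example as the read timeout of the remote call it makes.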

Questions?

github.com/avishai-ish-shalom @nukemberg [email protected]

Thank You

github.com/avishai-ish-shalom @nukemberg [email protected]