Resilient design 101 (BuildStuff LT 2017)

Resilient design 101

Avishai Ish-Shalom

github.com/[email protected]@wix.com

Wix in numbers

~ 600 Engineers

~ 2000 employees

~ 100M users

~ 500 micro services

Wix Engineering Locations

Lithuania: Vilnius

Ukraine: Kyiv, Dnipro

Israel: Tel-Aviv, Be’er Sheva

Queues

01

Queues are everywhere!

▪ Futures/Executors

▪ Sockets

▪ Locks (DB Connection pools)

▪ Callbacks in node.js/Netty

Anything async?!

Queues

▪ Incoming load (arrival rate)

▪ Service from the queue (service rate)

▪ Service discipline (FIFO/LIFO/Priority)

▪ Latency = Wait time + Service time

▪ Service time independent of queue

It varies

▪ Arrival rate fluctuates

▪ Service time fluctuates

▪ Delays accumulate

▪ Idle time wasted

Queues are almost always full or near-empty!

Capacity & Latency

▪ Latency (and queue size) rises to infinity as utilization approaches 1

▪ For QoS ρ << 0.75

▪ Decent latency -> over capacity

ρ = arrival rate / service rate (utilization)

Implications

Infinite queues:

▪ Memory pressure / OOM

▪ High latency

▪ Stale work

Always limit queue size!

Work item TTL*
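
As a minimal sketch of the advice above (cap the queue, give each work item a TTL), the snippet below wraps Python's standard `queue.Queue`. The class `BoundedTtlQueue`, its method names, and the injectable `clock` parameter are hypothetical names chosen for illustration; they are not from the talk.

```python
import queue
import time


class BoundedTtlQueue:
    """Hypothetical helper: a FIFO queue with a size cap and a per-item TTL."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._queue = queue.Queue(maxsize=max_size)
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the demo below is deterministic

    def offer(self, item):
        """Enqueue without blocking; return False (reject) when the queue is full."""
        try:
            self._queue.put_nowait((self._clock(), item))
            return True
        except queue.Full:
            return False

    def poll(self):
        """Return the next item that is still fresh, or None if nothing is left."""
        while True:
            try:
                enqueued_at, item = self._queue.get_nowait()
            except queue.Empty:
                return None
            if self._clock() - enqueued_at <= self._ttl:
                return item
            # Otherwise the item is stale: drop it and look at the next one.


# Demo with a fake clock (assumed values, for illustration only).
fake_now = [0.0]
work = BoundedTtlQueue(max_size=2, ttl_seconds=1.0, clock=lambda: fake_now[0])
accepted = [work.offer("a"), work.offer("b"), work.offer("c")]  # third is rejected
fake_now[0] = 0.5
first = work.poll()   # "a" waited 0.5s and is still fresh
fake_now[0] = 2.0
second = work.poll()  # "b" waited 2.0s, so it is dropped as stale
```

Rejecting at `offer` keeps memory bounded, and dropping at `poll` avoids spending service time on work nobody is waiting for anymore.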

Latency & Service time

λ = wait time

σ = service time

ρ = utilization

Utilization fluctuates!

▪ 10% fluctuation at ρ = 0.5 will hardly affect latency (~ 1.1x)

▪ 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency)

▪ Be careful when overloading resources

▪ During peak load we must be extra careful

▪ Highly varied load must be capped
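
The numbers above can be reproduced with a small calculation. The transcript does not preserve the slide's formula, so this sketch assumes the textbook M/M/1 queue, where mean latency (wait + service) is σ / (1 − ρ). The function names are hypothetical.

```python
def mm1_latency(service_time, utilization):
    """Mean time in system (wait + service) for an M/M/1 queue (assumed model)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def slowdown(utilization, fluctuation=0.10):
    """Latency ratio after utilization rises by a relative `fluctuation`."""
    bumped = utilization * (1.0 + fluctuation)
    return mm1_latency(1.0, bumped) / mm1_latency(1.0, utilization)


low = slowdown(0.5)   # utilization 0.5 -> 0.55
high = slowdown(0.9)  # utilization 0.9 -> 0.99
print(round(low, 2), round(high, 2))
```

Under this assumption the same 10% bump costs about 1.1x at ρ = 0.5 and about 10x at ρ = 0.9, which matches the figures on the slide.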

Practical advice

▪ Use chokepoints (throttling/load shedding)

▪ Plan for low utilization of slow resources

Example

Resource              Latency   Planned utilization
RPC thread pool       1ms       0.75
DB connection pool    10ms      0.5

Backpressure

▪ Internal queues fill up and cause latency

▪ Front layer will continue sending traffic

▪ We need to inform the client that we’re out of capacity

▪ E.g.: Blocking client, HTTP 503, finite queues for threadpools

Backpressure

▪ Blocking code has backpressure by default

▪ Executors, remote calls and async code need explicit backpressure

▪ E.g. producer/consumer through Kafka
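
A minimal sketch of explicit backpressure between a producer and a consumer, using a bounded in-process queue rather than Kafka. The function `run_pipeline` and its parameters are hypothetical; the point is only that a blocking `put` on a finite queue slows the producer down to the consumer's pace.

```python
import queue
import threading


def run_pipeline(n_items, queue_size):
    """Push n_items through a bounded queue; return what was consumed and the peak backlog."""
    buffer = queue.Queue(maxsize=queue_size)
    consumed = []
    max_backlog = 0

    def consumer():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: the producer is done
                return
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_items):
        buffer.put(i)  # blocks while the queue is full: this is the backpressure
        max_backlog = max(max_backlog, buffer.qsize())
    buffer.put(None)
    worker.join()
    return consumed, max_backlog


items, backlog = run_pipeline(n_items=100, queue_size=4)
```

However fast the producer runs, the backlog never exceeds `queue_size`, so latency inside the pipeline stays bounded.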

Load shedding

▪ A tradeoff between latency and error rate

▪ Cap the queue size / throttle arrival rate

▪ Reject excess work or send to fallback service

Example: Facebook uses LIFO queue and rejects stale work

http://queue.acm.org/detail.cfm?id=2839461
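
To illustrate the idea of serving the newest work first and shedding the oldest, here is a simplified sketch. It is not Facebook's implementation (the linked article describes the real approach); the class `SheddingLifo` is a hypothetical name, and eviction on overflow stands in for a proper staleness check.

```python
from collections import deque


class SheddingLifo:
    """Hypothetical bounded LIFO: serves the newest item, evicts the oldest."""

    def __init__(self, max_size):
        # A deque with maxlen discards from the opposite end when it is full,
        # so appending on the right evicts the oldest item on the left.
        self._items = deque(maxlen=max_size)
        self.shed = 0

    def push(self, item):
        if len(self._items) == self._items.maxlen:
            self.shed += 1  # the oldest item is about to be evicted
        self._items.append(item)

    def pop(self):
        """Return the most recently added item, or None if empty."""
        return self._items.pop() if self._items else None


stack = SheddingLifo(max_size=3)
for request_id in range(1, 6):  # five requests arrive, capacity is three
    stack.push(request_id)
served = [stack.pop(), stack.pop(), stack.pop(), stack.pop()]
```

Under overload the requests most likely to still have a waiting client are served first, and the ones that have waited longest are the ones rejected.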


Thread Pools

02

Jetty architecture

[Diagram: Socket, Acceptor thread, Thread pool (QTP)]

Too many threads

▪ O/S also has a queue

▪ Threads take memory, FDs, etc

▪ What about shared resources?

Bad QoS, GC storms, ungraceful degradation

Not enough threads

wrong

▪ Work will queue up

▪ Not enough RUNNING threads

High latency, low resource utilization

Capacity/Latency tradeoffs

When optimizing for Latency:

For low latency, resources must be available when needed

Keep the queue empty

▪ Block or apply backpressure

▪ Keep the queue small

▪ Overprovision

Capacity/Latency tradeoffs

When optimizing for Capacity:

For max capacity, resources must always have work waiting

Keep the queue full

▪ We use a large queue to buffer work

▪ Queueing increases latency

▪ Queue size >> concurrency

How many threads?

▪ Assuming CPU is the limiting resource

▪ Compute by maximal load (opt. latency)

▪ With a Grid: How many cores???

Java Concurrency in Practice (http://jcip.net/)


How many threads?

How to compute?

▪ Transaction time = W + C

▪ C ~ Total CPU time / throughput

▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC - and 0.75 utilization target)

▪ Memory and other resource limits
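
The quantities above plug into the sizing formula from Java Concurrency in Practice: threads = cores × U × (1 + W/C), with W the time a transaction spends waiting and C the time it spends computing. The sketch below is a direct transcription; the function name and the example numbers are assumptions for illustration, and the result is a starting point that still has to respect memory and other resource limits.

```python
def pool_size(cpu_count, target_utilization, wait_time, compute_time):
    """Thread count from the JCIP sizing formula: N_cpu * U * (1 + W/C)."""
    if compute_time <= 0:
        raise ValueError("compute_time must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    size = cpu_count * target_utilization * (1 + wait_time / compute_time)
    return max(1, round(size))


# Assumed example: 8 cores, U = 0.5, 90ms waiting and 10ms of CPU per transaction.
threads = pool_size(cpu_count=8, target_utilization=0.5, wait_time=90, compute_time=10)
```

With these assumed inputs the formula gives 8 × 0.5 × (1 + 9) = 40 threads. A purely CPU-bound workload (W = 0) collapses to cores × U.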

What about async servers?

Async servers architecture

[Diagram: Sockets, O/S (epoll, syscalls), Event loop, Callbacks]

Async systems

▪ Event loop callback/handler queue

▪ The callback queue is unbounded (!!!)

▪ Event loop can block (ouch)

▪ No inherent concurrency limit

▪ No backpressure (*)

Async systems - overload

▪ No preemption -> no QoS

▪ No backpressure -> overload

▪ Hard to tune

▪ Hard to limit concurrency/queue size

▪ Hard to debug
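
One common way to put an explicit concurrency limit on async handlers is a semaphore. The sketch below uses Python's `asyncio` as a stand-in for any event-loop runtime; `run_limited` and `handler` are hypothetical names. Note that tasks waiting on the semaphore still queue without bound, so this limits concurrency but is not full backpressure by itself.

```python
import asyncio


async def run_limited(n_tasks, limit):
    """Run n_tasks handlers with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def handler(i):
        nonlocal in_flight, peak
        async with semaphore:  # wait here when `limit` handlers are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stand-in for an I/O call
            in_flight -= 1
        return i

    results = await asyncio.gather(*(handler(i) for i in range(n_tasks)))
    return results, peak


results, peak = asyncio.run(run_limited(n_tasks=20, limit=4))
```

The observed peak never exceeds the limit, even though all 20 handlers are scheduled at once.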

So what’s the point?

▪ High concurrency

▪ More control (timeouts)

▪ I/O heavy servers

Still evolving… let’s revisit in a few years?

Little’s Law

03

Little’s law

▪ Holds for all distributions

▪ For “stable” systems

▪ Holds for systems and their subsystems

▪ “Throughput” is either Arrival rate or Service rate depending on the context.

Be careful!

L = λ⋅W

L = Avg clients in the system

λ = Avg Throughput

W = Avg Latency

Using Little’s law

▪ How many requests queued inside the system?

▪ Verifying load tests / benchmarks

▪ Calculating latency when no direct measurement is possible
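
A short sketch of the first two uses listed above: estimating how many requests are inside the system, and sanity-checking a load test report. The function names, the 10% tolerance, and the example figures are assumptions for illustration.

```python
def concurrent_requests(throughput_per_sec, avg_latency_sec):
    """Little's law: L = throughput * latency."""
    return throughput_per_sec * avg_latency_sec


def is_consistent(throughput_per_sec, avg_latency_sec, concurrency, tolerance=0.1):
    """Check that a benchmark's reported numbers agree with its client concurrency."""
    implied = concurrent_requests(throughput_per_sec, avg_latency_sec)
    return abs(implied - concurrency) <= tolerance * concurrency


# 2000 req/s at 50ms average latency implies about 100 requests in flight.
in_flight = concurrent_requests(2000, 0.050)

# A closed-loop test with 50 clients cannot deliver 10,000 req/s at 20ms each:
# that would require about 200 requests in flight.
suspicious = is_consistent(10_000, 0.020, concurrency=50)
plausible = is_consistent(2_500, 0.020, concurrency=50)
```

When the check fails, the usual suspects are a latency figure measured in the wrong place or a load generator that is itself the bottleneck.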

Go watch Gil Tene’s "How NOT to Measure Latency": https://www.youtube.com/watch?v=lJ8ydIuPFeU

Read "Benchmarking Blunders and Things That Go Bump in the Night": https://arxiv.org/pdf/cs/0404043.pdf

Using Little’s law

[Diagram: a load balancer (LB) using least connections in front of two backends]

Backend 1: W1 = 0.1, λ1 = 100

Backend 2: W2 = 0.001, λ2 = 10,000
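
One reading of these numbers, offered as an interpretation rather than the speaker's stated derivation: a least-connections balancer keeps roughly the same number of in-flight requests L on every backend, so by Little's law each backend's throughput is L / W. The sketch below checks that the slide's figures fit that reading; the function name is hypothetical.

```python
def backend_throughput(connections, avg_latency_sec):
    """Little's law rearranged: throughput = L / W."""
    return connections / avg_latency_sec


# Both backends hold the same L. From the slide, L = 100 * 0.1 = 10,000 * 0.001 = 10.
shared_connections = 10
slow_backend = backend_throughput(shared_connections, 0.1)    # W1 = 0.1
fast_backend = backend_throughput(shared_connections, 0.001)  # W2 = 0.001
```

Under this reading the backend that answers 100 times faster receives about 100 times the traffic, which is a problem if it is fast only because it is failing.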

Timeouts

04

How not to timeout

People use arbitrary timeout values

▪ DB timeout > Overall transaction timeout

▪ Cache timeout > DB latency

▪ Huge unrealistic timeouts

▪ Refusing to return errors

P.S: connection timeout, read timeout & transaction timeout are not the same thing

Deciding on timeouts

Use the distribution, Luke!

▪ Resources/Errors tradeoff

▪ Cumulative distribution chart

▪ Watch out for multiple modes

▪ Context, context, context

Timeouts should be derived from real world constraints!
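
A minimal sketch of reading a timeout off the measured latency distribution and seeing the error rate it implies. The helper names and the synthetic bimodal sample (95 fast calls, 5 slow ones) are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(samples, timeout):
    """Fraction of calls that would have been cut off by this timeout."""
    return sum(1 for s in samples if s > timeout) / len(samples)


# Synthetic bimodal latencies in ms: a fast mode and a slow mode.
latencies_ms = [10] * 95 + [500] * 5

p90 = percentile(latencies_ms, 90)  # falls in the fast mode
p99 = percentile(latencies_ms, 99)  # falls in the slow mode
errors_at_p90 = error_rate(latencies_ms, p90)
```

With two modes the choice is stark: a timeout near the fast mode fails the 5% of slow calls, while a timeout that covers them ties up resources 50 times longer. Which is right depends on the context of the call.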

UX numbers every developer needs to know

▪ Smooth motion perception threshold: ~ 20ms

▪ Immediate reaction threshold: ~ 100ms

▪ Delay perception threshold: ~ 300ms

▪ Focus threshold: ~ 1sec

▪ Frustration threshold: ~ 10sec

Google's RAIL model: https://developers.google.com/web/fundamentals/performance/rail

UX powers of 10: https://www.nngroup.com/articles/powers-of-10-time-scales-in-ux/

Hardware latency numbers every developer needs to know

▪ SSD Disk seek: 0.15ms

▪ Magnetic disk seek: ~ 10ms

▪ Round trip within same datacenter: ~ 0.5ms

▪ Packet roundtrip US->EU->US: ~ 150ms

▪ Send 1M over typical user WAN: ~ 1sec

Latency numbers every developer needs to know (updated)

https://gist.github.com/hellerbarde/2843375

Timeout Budgets

▪ Decide on global timeouts

▪ Pass context object

▪ Each stage decrements budget

▪ Local timeouts according to budget

▪ If budget too low, terminate preemptively

Think microservices

Example

Global: 500ms

Stage              Used     Budget   Timeout
Authorization      6ms      494ms    100ms
Data fetch (DB)    123ms    371ms    200ms
Processing         47ms     324ms    371ms
Rendering          89ms     235ms    324ms
Audit              2ms      -        -
Filter             10ms     223ms    233ms
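
The table can be replayed with a small context object. This is a sketch under assumptions: the class `Budget` and its methods are hypothetical, and the local caps of 100ms and 200ms are inferred from the Timeout column (the other stages simply get whatever budget remains). The table leaves Audit's timeout blank; its 2ms is still charged to the budget here.

```python
class Budget:
    """Hypothetical context object carrying the remaining time budget (ms)."""

    def __init__(self, total_ms):
        self.remaining_ms = total_ms

    def timeout_for(self, local_cap_ms=None):
        """Timeout for the next stage: the local cap, but never above the budget."""
        if self.remaining_ms <= 0:
            raise TimeoutError("budget exhausted, terminate preemptively")
        if local_cap_ms is None:
            return self.remaining_ms
        return min(local_cap_ms, self.remaining_ms)

    def spend(self, used_ms):
        """Each stage decrements the budget by the time it actually used."""
        self.remaining_ms -= used_ms


# (stage, time used in ms, local cap in ms or None), values taken from the table.
stages = [
    ("Authorization", 6, 100),
    ("Data fetch (DB)", 123, 200),
    ("Processing", 47, None),
    ("Rendering", 89, None),
    ("Audit", 2, None),
    ("Filter", 10, None),
]

budget = Budget(total_ms=500)
rows = []
for name, used_ms, cap_ms in stages:
    timeout_ms = budget.timeout_for(cap_ms)
    budget.spend(used_ms)
    rows.append((name, timeout_ms, budget.remaining_ms))
```

In a microservices setting the remaining budget would travel with the request (for example in a header or context object), so a downstream service can refuse work it has no time left to finish.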

The debt buyer

▪ Transactions may return eventually after timeout

▪ Does the client really have to wait?

▪ Timeout and return error/default response to client (50ms)

▪ Keep waiting asynchronously (1 sec)

Can’t be used when client is expecting data back
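
A minimal sketch of the pattern, using Python's `concurrent.futures`: answer the client after a short timeout, but let the operation keep running and collect its result later. The function `respond_with_deadline` and its parameters are hypothetical, the timings are scaled for the demo, and the longer background deadline from the slide is not enforced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def respond_with_deadline(executor, operation, client_timeout, default, on_late_result):
    """Return the result if it arrives in time, else a default; keep waiting in the background."""
    future = executor.submit(operation)
    try:
        return future.result(timeout=client_timeout)
    except FutureTimeout:
        # Do not cancel: let the operation finish and hand over its result later.
        # If the operation raises, the callback's call to result() raises too;
        # a real implementation would handle that explicitly.
        future.add_done_callback(lambda f: on_late_result(f.result()))
        return default


def slow_operation():
    time.sleep(0.3)  # stand-in for a slow write or remote call
    return "real answer"


late_results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    response = respond_with_deadline(
        pool, slow_operation, client_timeout=0.05,
        default="default", on_late_result=late_results.append,
    )
```

The client gets its answer after about 50ms, while the late result is still recorded once the operation completes. As the slide notes, this only works when the client does not need the actual data back.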

Questions?


Thank You

