Resilient design 101 (BuildStuff LT 2017)
avishai-ish-shalom
![Page 2: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/2.jpg)
Wix in numbers
~ 600 Engineers
~ 2000 employees
~ 100M users
~ 500 micro services
Wix Engineering Locations
▪ Lithuania: Vilnius
▪ Ukraine: Kyiv, Dnipro
▪ Israel: Tel-Aviv, Be’er Sheva
![Page 3: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/3.jpg)
![Page 4: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/4.jpg)
Queues
01
![Page 5: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/5.jpg)
Queues are everywhere!
▪ Futures/Executors
▪ Sockets
▪ Locks (DB Connection pools)
▪ Callbacks in node.js/Netty
Anything async?!
![Page 6: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/6.jpg)
Queues
▪ Incoming load (arrival rate)
▪ Service from the queue (service rate)
▪ Service discipline (FIFO/LIFO/Priority)
▪ Latency = Wait time + Service time
▪ Service time independent of queue
![Page 7: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/7.jpg)
It varies
▪ Arrival rate fluctuates
▪ Service time fluctuates
▪ Delays accumulate
▪ Idle time wasted
Queues are almost always full or near-empty!
![Page 8: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/8.jpg)
Capacity & Latency
▪ Latency (and queue size) rises to infinity as utilization approaches 1
▪ For QoS ρ << 0.75
▪ Decent latency -> overcapacity
ρ = arrival rate / service rate (utilization)
![Page 9: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/9.jpg)
Implications
Infinite queues:
▪ Memory pressure / OOM
▪ High latency
▪ Stale work
Always limit queue size!
Work item TTL*
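The two rules above (cap the queue, give work items a TTL) can be sketched in a few lines. This is a minimal illustration, not a production queue: `BoundedWorkQueue`, `WorkItem`, `offer` and `poll` are hypothetical names, and the injectable `clock` parameter exists only to make the behavior easy to test.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WorkItem:
    payload: str
    deadline: float  # clock value after which the item counts as stale


class BoundedWorkQueue:
    """FIFO queue with a hard size limit and a per-item TTL (hypothetical helper)."""

    def __init__(self, max_size: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._queue = queue.Queue(maxsize=max_size)
        self._ttl = ttl_seconds
        self._clock = clock

    def offer(self, payload: str) -> bool:
        """Enqueue without blocking; return False when the queue is full."""
        item = WorkItem(payload, self._clock() + self._ttl)
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False  # caller should reject the request or fall back

    def poll(self) -> Optional[str]:
        """Return the next fresh payload, skipping stale items; None if empty."""
        while True:
            try:
                item = self._queue.get_nowait()
            except queue.Empty:
                return None
            if self._clock() <= item.deadline:
                return item.payload
            # Stale: the caller has most likely given up already, so drop it.
```

A full queue turns into an immediate rejection instead of memory pressure, and stale items are discarded at dequeue time instead of consuming service capacity.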
![Page 10: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/10.jpg)
Latency & Service time
λ = wait time
σ = service time
ρ = utilization
![Page 11: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/11.jpg)
Utilization fluctuates!
▪ 10% fluctuation at ρ = 0.5 will hardly affect latency (~ 1.1x)
▪ 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency)
▪ Be careful when overloading resources
▪ During peak load we must be extra careful
▪ Highly varied load must be capped
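The 1.1x and 10x figures can be reproduced if we assume the textbook M/M/1 result, where mean latency (wait plus service) is σ / (1 − ρ). That model is an assumption made here for illustration; the function names below are hypothetical.

```python
def mm1_latency(service_time: float, utilization: float) -> float:
    """Mean latency (wait + service) of an M/M/1 queue: sigma / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def latency_growth(utilization: float, fluctuation: float = 0.10) -> float:
    """Factor by which latency grows when utilization rises by a relative `fluctuation`."""
    base = mm1_latency(1.0, utilization)
    bumped = mm1_latency(1.0, utilization * (1.0 + fluctuation))
    return bumped / base


if __name__ == "__main__":
    for rho in (0.5, 0.75, 0.9):
        print(f"rho={rho}: latency grows {latency_growth(rho):.2f}x on a 10% bump")
```

At ρ = 0.5 a 10% bump moves utilization to 0.55, while at ρ = 0.9 it moves it to 0.99, which is where the order-of-magnitude jump comes from.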
![Page 12: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/12.jpg)
Practical advice
▪ Use chokepoints (throttling/load shedding)
▪ Plan for low utilization of slow resources
Example
Resource             Latency   Planned utilization
RPC thread pool      1ms       0.75
DB connection pool   10ms      0.5
![Page 13: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/13.jpg)
Backpressure
▪ Internal queues fill up and cause latency
▪ Front layer will continue sending traffic
▪ We need to inform the client that we’re out of capacity
▪ E.g.: Blocking client, HTTP 503, finite queues for threadpools
![Page 14: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/14.jpg)
Backpressure
▪ Blocking code has backpressure by default
▪ Executors, remote calls and async code need explicit backpressure
▪ E.g. producer/consumer through Kafka
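As a sketch of explicit backpressure for an executor: the wrapper below caps the number of running plus queued tasks with a semaphore, so producers block (or are rejected after a timeout) instead of piling work into an unbounded queue. `BoundedExecutor` is a hypothetical name; it only relies on the standard `ThreadPoolExecutor` and `BoundedSemaphore`.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BoundedExecutor:
    """ThreadPoolExecutor wrapper that caps running + queued tasks (hypothetical helper)."""

    def __init__(self, max_workers: int, max_queued: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_workers + max_queued)

    def submit(self, fn, *args, timeout=None) -> Future:
        # Block the producer until a slot frees up; give up after `timeout` seconds.
        if not self._slots.acquire(timeout=timeout):
            raise RuntimeError("executor saturated, rejecting work")
        try:
            future = self._pool.submit(fn, *args)
        except Exception:
            self._slots.release()
            raise
        # Free the slot once the task finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

With `timeout=None` the caller blocks, which mimics the default backpressure of blocking code; with a finite timeout the rejection can be translated into something like an HTTP 503.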
![Page 15: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/15.jpg)
Load shedding
▪ A tradeoff between latency and error rate
▪ Cap the queue size / throttle arrival rate
▪ Reject excess work or send to fallback service
Example: Facebook uses LIFO queue and rejects stale work
http://queue.acm.org/detail.cfm?id=2839461
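A loose sketch of the LIFO idea, under stated assumptions: serve the newest request first, shed the oldest one when the cap is hit, and discard anything older than a maximum age. This is not Facebook's implementation; `SheddingLifoQueue` and its parameters are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Optional


class SheddingLifoQueue:
    """LIFO queue with a size cap and a maximum item age (hypothetical helper)."""

    def __init__(self, max_size: int, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._items = deque()  # (enqueue_time, request), oldest on the left
        self._max_size = max_size
        self._max_age = max_age_seconds
        self._clock = clock

    def push(self, request: str) -> Optional[str]:
        """Add a request; return the shed (oldest) request if the cap was hit."""
        shed = None
        if len(self._items) >= self._max_size:
            _, shed = self._items.popleft()
        self._items.append((self._clock(), request))
        return shed

    def pop(self) -> Optional[str]:
        """Return the newest fresh request, or None if nothing fresh remains."""
        if not self._items:
            return None
        enqueued_at, request = self._items.pop()
        if self._clock() - enqueued_at > self._max_age:
            # The newest item is stale, so every older item is stale as well.
            self._items.clear()
            return None
        return request
```

The shed request returned by `push` is the one to reject or route to a fallback service; this trades a higher error rate for lower latency on the requests that are served.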
![Page 16: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/16.jpg)
Thread Pools
02
![Page 17: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/17.jpg)
Jetty architecture
[Diagram: Socket, Acceptor thread, Thread pool (QTP)]
![Page 18: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/18.jpg)
Too many threads
▪ O/S also has a queue
▪ Threads take memory, FDs, etc.
▪ What about shared resources?
Bad QoS, GC storms, ungraceful degradation
Not enough threads
▪ Work will queue up
▪ Not enough RUNNING threads
High latency, low resource utilization
![Page 19: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/19.jpg)
Capacity/Latency tradeoffs
When optimizing for Latency:
For low latency, resources must be available when needed
Keep the queue empty
▪ Block or apply backpressure
▪ Keep the queue small
▪ Overprovision
![Page 20: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/20.jpg)
Capacity/Latency tradeoffs
When optimizing for Capacity:
For max capacity, resources must always have work waiting
Keep the queue full
▪ We use a large queue to buffer work
▪ Queueing increases latency
▪ Queue size >> concurrency
![Page 21: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/21.jpg)
How many threads?
▪ Assuming CPU is the limiting resource
▪ Compute by maximal load (opt. latency)
▪ With a Grid: How many cores???
Java Concurrency in Practice (http://jcip.net/)
![Page 22: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/22.jpg)
How many threads?
How to compute?
▪ Transaction time = W + C
▪ C ~ Total CPU time / throughput
▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC - and 0.75 utilization target)
▪ Memory and other resource limits
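The sizing rule from Java Concurrency in Practice is N_threads = N_cpu × U × (1 + W/C), where W is the time a transaction spends waiting and C the time it spends computing. A small sketch of that calculation follows; the function name and defaults are hypothetical, and the result still has to be checked against memory and other resource limits.

```python
import math
import os
from typing import Optional


def pool_size(wait_time: float, compute_time: float,
              target_utilization: float = 0.6,
              cpu_count: Optional[int] = None) -> int:
    """Thread count per the JCIP rule: N_cpu * U * (1 + W / C)."""
    if compute_time <= 0:
        raise ValueError("compute_time must be positive")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    threads = cpu_count * target_utilization * (1.0 + wait_time / compute_time)
    return max(1, math.ceil(threads))
```

For example, with 8 cores, U = 0.5 and transactions that wait 90ms for every 10ms of CPU, `pool_size(90, 10, 0.5, 8)` gives 40 threads.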
![Page 23: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/23.jpg)
What about async servers?
![Page 24: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/24.jpg)
Async servers architecture
[Diagram: Sockets, O/S (epoll, syscalls), Event loop, Callbacks]
![Page 25: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/25.jpg)
Async systems
▪ Event loop callback/handler queue
▪ The callback queue is unbounded (!!!)
▪ Event loop can block (ouch)
▪ No inherent concurrency limit
▪ No backpressure (*)
![Page 26: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/26.jpg)
Async systems - overload
▪ No preemption -> no QoS
▪ No backpressure -> overload
▪ Hard to tune
▪ Hard to limit concurrency/queue size
▪ Hard to debug
![Page 27: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/27.jpg)
So what’s the point?
▪ High concurrency
▪ More control (timeouts)
▪ I/O heavy servers
Still evolving… let’s revisit in a few years?
![Page 28: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/28.jpg)
Little’s Law
03
![Page 29: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/29.jpg)
Little’s law
▪ Holds for all distributions
▪ For “stable” systems
▪ Holds for systems and their subsystems
▪ “Throughput” is either Arrival rate or Service rate depending on the context.
Be careful!
L = λ⋅W
L = Avg clients in the system
λ = Avg Throughput
W = Avg Latency
![Page 30: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/30.jpg)
Using Little’s law
▪ How many requests queued inside the system?
▪ Verifying load tests / benchmarks
▪ Calculating latency when no direct measurement is possible
Go watch Gil Tene’s "How NOT to Measure Latency"
Read Benchmarking Blunders and Things That Go Bump in the Night
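A sketch of the uses listed above, assuming a closed-loop load test with a fixed number of concurrent clients and no think time, so that concurrency plays the role of L. Function names and the 20% tolerance are hypothetical choices.

```python
def clients_in_system(throughput_per_sec: float, latency_sec: float) -> float:
    """Little's law: L = lambda * W."""
    return throughput_per_sec * latency_sec


def implied_latency(clients: float, throughput_per_sec: float) -> float:
    """Little's law rearranged: W = L / lambda."""
    if throughput_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return clients / throughput_per_sec


def load_test_is_consistent(concurrency: float, throughput_per_sec: float,
                            reported_latency_sec: float,
                            tolerance: float = 0.2) -> bool:
    """Check that a reported average latency matches what Little's law implies."""
    expected = implied_latency(concurrency, throughput_per_sec)
    return abs(reported_latency_sec - expected) <= tolerance * expected
```

A test that ran 50 concurrent clients at 1,000 requests/sec implies an average latency of about 50ms; a report claiming 5ms for the same run deserves a second look at how latency was measured.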
![Page 31: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/31.jpg)
Using Little’s law
[Diagram: a load balancer (LB) using least connections in front of two servers]
Server 1: λ1 = 100, W1 = 0.1
Server 2: λ2 = 10,000, W2 = 0.001
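One reading of this diagram: a least-connections balancer keeps the number of open connections (L) roughly equal on every server, so by Little's law each server's arrival rate is λ = L / W, and the server that answers fastest attracts almost all the traffic. The sketch below works through the numbers, assuming L = 10 on both servers (100 × 0.1 = 10,000 × 0.001 = 10); function names are hypothetical.

```python
from typing import List, Sequence


def arrival_rate(connections: float, latency_sec: float) -> float:
    """Little's law rearranged: lambda = L / W."""
    if latency_sec <= 0:
        raise ValueError("latency must be positive")
    return connections / latency_sec


def traffic_shares(latencies_sec: Sequence[float],
                   connections_per_server: float = 10.0) -> List[float]:
    """Fraction of traffic each server receives when connections are equalized."""
    rates = [arrival_rate(connections_per_server, w) for w in latencies_sec]
    total = sum(rates)
    return [rate / total for rate in rates]


if __name__ == "__main__":
    for share in traffic_shares([0.1, 0.001]):
        print(f"{share:.1%}")
```

If the 1ms server is fast only because it is failing requests immediately, least connections will send it roughly 99% of the traffic.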
![Page 32: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/32.jpg)
Timeouts
04
![Page 33: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/33.jpg)
How not to timeout
People use arbitrary timeout values
▪ DB timeout > Overall transaction timeout
▪ Cache timeout > DB latency
▪ Huge unrealistic timeouts
▪ Refusing to return errors
P.S: connection timeout, read timeout & transaction timeout are not the same thing
![Page 34: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/34.jpg)
Deciding on timeouts
Use the distribution, Luke!
▪ Resources/Errors tradeoff
▪ Cumulative distribution chart
▪ Watch out for multiple modes
▪ Context, context, context
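A minimal sketch of deriving a timeout from measured latencies instead of picking an arbitrary value: take a high percentile of the distribution and check which error rate a candidate timeout would cause. The nearest-rank percentile and the function names are choices made for this illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], percent: float) -> float:
    """Nearest-rank percentile of the samples (percent in (0, 100])."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


def timeout_error_rate(samples: Sequence[float], timeout: float) -> float:
    """Fraction of the observed requests that would have timed out."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s > timeout) / len(samples)
```

With a bimodal distribution (say cache hits around 5ms and misses around 200ms) a single percentile can land between the modes, so it is worth looking at the cumulative chart before settling on a number.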
![Page 35: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/35.jpg)
Timeouts should be derived from real world constraints!
![Page 36: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/36.jpg)
UX numbers every developer needs to know
▪ Smooth motion perception threshold: ~ 20ms
▪ Immediate reaction threshold: ~ 100ms
▪ Delay perception threshold: ~ 300ms
▪ Focus threshold: ~ 1sec
▪ Frustration threshold: ~ 10sec
Google's RAIL model
UX powers of 10
![Page 37: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/37.jpg)
Hardware latency numbers every developer needs to know
▪ SSD Disk seek: 0.15ms
▪ Magnetic disk seek: ~ 10ms
▪ Round trip within same datacenter: ~ 0.5ms
▪ Packet roundtrip US->EU->US: ~ 150ms
▪ Send 1MB over typical user WAN: ~ 1sec
Latency numbers every developer needs to know (updated)
![Page 38: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/38.jpg)
Timeout Budgets
▪ Decide on global timeouts
▪ Pass context object
▪ Each stage decrements budget
▪ Local timeouts according to budget
▪ If budget too low, terminate preemptively
Think microservices
Example
Global: 500ms
Stage             Used    Budget   Timeout
Authorization     6ms     494ms    100ms
Data fetch (DB)   123ms   371ms    200ms
Processing        47ms    324ms    371ms
Rendering         89ms    235ms    324ms
Audit             2ms     -        -
Filter            10ms    223ms    233ms
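The table can be read as: each stage gets min(its local cap, whatever is left of the global budget). A sketch of a context object that implements this is shown below; `TimeoutBudget`, `BudgetExhausted` and the millisecond clock are hypothetical, and the clock is injectable so the example can be replayed deterministically.

```python
import time
from typing import Callable, Optional


class BudgetExhausted(Exception):
    """Raised when too little of the global budget is left to start a stage."""


def _monotonic_ms() -> float:
    return time.monotonic() * 1000.0


class TimeoutBudget:
    """Context object carrying a global deadline across stages (hypothetical helper)."""

    def __init__(self, total_ms: float,
                 clock_ms: Callable[[], float] = _monotonic_ms) -> None:
        self._clock_ms = clock_ms
        self._deadline_ms = clock_ms() + total_ms

    def remaining_ms(self) -> float:
        return max(0, self._deadline_ms - self._clock_ms())

    def timeout_for(self, local_cap_ms: Optional[float] = None,
                    minimum_ms: float = 0) -> float:
        """Timeout for the next stage: min(local cap, remaining budget)."""
        left = self.remaining_ms()
        if left <= minimum_ms:
            # Terminate preemptively: the stage could not finish in time anyway.
            raise BudgetExhausted(f"only {left}ms left")
        return left if local_cap_ms is None else min(local_cap_ms, left)
```

Replaying the example: with a 500ms budget, authorization gets `timeout_for(local_cap_ms=100)` = 100ms; after it uses 6ms, the DB fetch gets min(200, 494) = 200ms; after 123ms more, processing has no local cap and gets the remaining 371ms.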
![Page 39: Resilient design 101 (BuildStuff LT 2017)](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a64a4a67f8b9a2c568b692f/html5/thumbnails/39.jpg)
The debt buyer
▪ Transactions may return eventually after timeout
▪ Does the client really have to wait?
▪ Timeout and return error/default response to client (50ms)
▪ Keep waiting asynchronously (1 sec)
Can’t be used when client is expecting data back
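A sketch of the debt-buyer pattern with asyncio, under stated assumptions: the client gets a default response after a short timeout (50ms here) while the operation keeps running in the background for up to a longer total timeout (1 sec). `call_with_debt_buyer` and `on_late_result` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

_background = set()  # keeps references so pending waiters are not garbage collected


async def call_with_debt_buyer(operation: Callable[[], Awaitable[Any]],
                               default: Any,
                               client_timeout: float = 0.05,
                               total_timeout: float = 1.0,
                               on_late_result: Optional[Callable[[Any], None]] = None) -> Any:
    task = asyncio.ensure_future(operation())
    try:
        # shield() keeps the operation alive when the client-facing wait times out.
        return await asyncio.wait_for(asyncio.shield(task), client_timeout)
    except asyncio.TimeoutError:
        pass

    async def settle() -> None:
        try:
            # Keep waiting in the background; wait_for cancels the task at the end.
            result = await asyncio.wait_for(task, total_timeout - client_timeout)
        except Exception:
            return  # late failure or final timeout: nothing left to deliver
        if on_late_result is not None:
            on_late_result(result)

    waiter = asyncio.ensure_future(settle())
    _background.add(waiter)
    waiter.add_done_callback(_background.discard)
    return default
```

This fits fire-and-forget style operations such as audit writes; as noted above, it does not help when the client actually needs the data in the response.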