(ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014
Transcript of (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014
November 12, 2014 | Las Vegas, Nevada
Daniel Jacobson, Netflix
Ben Schmaus, Netflix
Daniel Jacobson
@daniel_jacobson
danieljacobson/linkedin
danieljacobson.com/slideshare
Ben Schmaus
@schmaus
schma.us/in
schma.us/slides
Edge Engineering
What does Edge Engineering do?
• Broker data between services and devices
• Control playback flow
• Ensure resiliency
• Scale our systems
• Enable high velocity product innovation
• Provide detailed, real-time health insights
“The Edge... the only people who really know
where it is are the ones who have gone over.”
-- Hunter S. Thompson
APP-310: Scheduling using Apache Mesos in the Cloud
9:00 on Friday
[Diagram: request path from DEVICES through ROUTING to the ORIGIN API servers (RxJava, Hystrix, Scripting) and on to the dependent SERVICES, including Playback, Website, and Logging]
Routing Traffic
“There is no Dana, only Zuul!”
Zuul
Gatekeeper for the Netflix Streaming Application
Zuul
• Multi-Region Resiliency
• Dynamic Routing
• Squeeze Testing
• Insights
• Load Shedding
• Security
• Authentication
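The dynamic routing and squeeze testing bullets can be illustrated with a small sketch. This is not Zuul's filter API; `pick_cluster` and its `squeeze_percent` parameter are hypothetical names, and hashing the request ID is just one assumed way to send a stable fraction of traffic to a separate cluster. The cluster labels PROD and SQUEEZE are taken from the diagrams in this deck.

```python
import hashlib


def pick_cluster(request_id: str, squeeze_percent: float) -> str:
    """Route roughly `squeeze_percent` percent of requests to the SQUEEZE cluster.

    Hypothetical helper: the same request_id always maps to the same cluster,
    so raising the percentage only moves additional traffic over.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets so percentages resolve to 0.01%.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return "SQUEEZE" if bucket < squeeze_percent * 100 else "PROD"


if __name__ == "__main__":
    sample = [f"req-{i}" for i in range(10_000)]
    squeezed = sum(pick_cluster(r, 10.0) == "SQUEEZE" for r in sample)
    print(f"{squeezed} of {len(sample)} requests routed to SQUEEZE")
```

Raising the percentage step by step while watching the target cluster's health is one way to find how much traffic a single cluster can absorb.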
[Diagram: DEVICES, ROUTING, multiple ORIGIN API clusters (RxJava, Hystrix, Scripting), and dependent SERVICES]
[Diagram: ROUTING in front of a PROD origin cluster and a DEBUG origin cluster]
[Diagram: ROUTING in front of a PROD origin cluster and a SQUEEZE origin cluster]
Systems are healthy.
Traffic from the east goes to US-EAST.
Traffic from the west goes to US-WEST.
Systems failure in US-EAST.
US-EAST Zuul routes traffic to US-WEST Zuul (until DNS gets resolved).
DNS gets resolved.
Requests from east go to US-WEST.
Systems recover in US-EAST.
DNS set to return to normal.
DNS gets resolved.
Both regions return to normal.
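The failover sequence above can be modeled in a few lines. This is an illustrative sketch only, not Zuul code: `Region`, `other`, and `route` are hypothetical names, and the model assumes exactly two regions and ignores the case where both are unhealthy.

```python
from enum import Enum


class Region(Enum):
    US_EAST = "US-EAST"
    US_WEST = "US-WEST"


def other(region: Region) -> Region:
    """Return the peer region (this sketch assumes exactly two regions)."""
    return Region.US_WEST if region is Region.US_EAST else Region.US_EAST


def route(dns_target: Region, healthy: dict) -> Region:
    """Return the region that ends up serving a request.

    dns_target -- region that DNS currently resolves to for this client
    healthy    -- mapping of Region -> bool health flag (assumed input)
    """
    # The request lands on the routing tier of whichever region DNS points at.
    # If that region's systems are failing, its routing tier forwards the
    # request to the peer region until DNS is repointed.
    if not healthy[dns_target]:
        return other(dns_target)
    return dns_target


if __name__ == "__main__":
    all_up = {Region.US_EAST: True, Region.US_WEST: True}
    east_down = {Region.US_EAST: False, Region.US_WEST: True}
    # Walk an east-coast client through the four states on the slides.
    print(route(Region.US_EAST, all_up).value)     # healthy
    print(route(Region.US_EAST, east_down).value)  # failure, DNS not yet changed
    print(route(Region.US_WEST, east_down).value)  # DNS repointed to the west
    print(route(Region.US_EAST, all_up).value)     # recovered, DNS back to normal
```

The cross-region forward in the second state is what covers the gap while the DNS change propagates.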
Resiliency in Distributed Systems
Preventing cascading failures
“A distributed system is one in
which the failure of a computer
you didn’t even know existed
can render your own computer
unusable”
-- Leslie Lamport
Dependency Relationships
Netflix API: 5,000,000,000 Incoming Requests Per Day
Netflix API: 30 Dependent Services
Netflix API: ~600 Dependency Jars
Netflix API: 40,000,000,000 Outbound Calls Per Day to Dependent Services
1 Thing is common across all dependencies…
0 Dependent Services have a 100% SLA
99.99%^30 = 99.7%
0.3% of 5B = 15M failures per day
2+ Hours of Downtime Per Month
99.9%^30 = 97%
3% of 5B = 150M failures per day
20+ Hours of Downtime Per Month
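The arithmetic on these slides can be reproduced with a short calculation. This sketch assumes that every request depends on all 30 services and that their failures are independent, and it takes a month as 30 days; the function and constant names are illustrative.

```python
REQUESTS_PER_DAY = 5_000_000_000  # incoming requests per day, from the slides
HOURS_PER_MONTH = 30 * 24         # assumption: a 30-day month


def compound_availability(per_dependency: float, dependencies: int) -> float:
    """Availability when every one of `dependencies` services must succeed."""
    return per_dependency ** dependencies


def summarize(per_dependency: float, dependencies: int = 30) -> dict:
    """Translate per-dependency availability into failures and downtime."""
    availability = compound_availability(per_dependency, dependencies)
    failure_fraction = 1.0 - availability
    return {
        "availability": availability,
        "failures_per_day": failure_fraction * REQUESTS_PER_DAY,
        "downtime_hours_per_month": failure_fraction * HOURS_PER_MONTH,
    }


if __name__ == "__main__":
    for sla in (0.9999, 0.999):
        s = summarize(sla)
        print(
            f"{sla:.2%} per dependency -> {s['availability']:.1%} overall, "
            f"{s['failures_per_day'] / 1e6:.0f}M failures/day, "
            f"{s['downtime_hours_per_month']:.1f} h downtime/month"
        )
```

The slides round the failure fraction to 0.3% and 3% before multiplying, so the computed figures land slightly below 15M and 150M.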
[Diagram: DEVICES, ROUTING, multiple ORIGIN API clusters (RxJava, Hystrix, Scripting), and dependent SERVICES]
Call Volume and Health / Last 10 Seconds
Call Volume / Last 2 Minutes
Successful Requests
Short-Circuited Requests, Delivering Fallbacks
Timeouts, Delivering Fallbacks
Full Queues, Delivering Fallbacks
Exceptions, Delivering Fallbacks
Error Rate
(Short-Circuited + Timeouts + Full Queues + Exceptions) / (Successful + Short-Circuited + Timeouts + Full Queues + Exceptions) = Error Rate
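The error rate formula on the dashboard legend can be written out as a small calculation. The `WindowCounts` class and its field names are hypothetical; they simply mirror the five counters named on the slide and are not part of the Hystrix API.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Counters for one measurement window (names mirror the slide legend)."""
    successes: int
    short_circuited: int
    timeouts: int
    full_queues: int
    exceptions: int

    def error_rate(self) -> float:
        """Failed requests divided by all requests in the window."""
        failures = (
            self.short_circuited
            + self.timeouts
            + self.full_queues
            + self.exceptions
        )
        total = self.successes + failures
        # An empty window has no errors to report.
        return failures / total if total else 0.0


if __name__ == "__main__":
    window = WindowCounts(
        successes=900, short_circuited=40, timeouts=30,
        full_queues=20, exceptions=10,
    )
    print(f"error rate: {window.error_rate():.1%}")
```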
[Diagram: DEVICES, ROUTING, multiple ORIGIN API clusters (RxJava, Hystrix, Scripting), and dependent SERVICES]
Fallback
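A minimal sketch of the fallback idea: wrap a dependency call so that failures, or an open circuit, return a fallback value instead of propagating the error. This is not Hystrix code. Hystrix trips its circuit on a rolling-window error percentage; this sketch substitutes a simpler consecutive-failure count, and the class name, parameters, and injectable `clock` are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive failures, then cools down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, let one trial call through.
        return self.clock() - self.opened_at < self.reset_after_seconds

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._is_open():
            # Short-circuit: skip the dependency and deliver the fallback.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=1.0)

    def broken_service() -> str:
        raise RuntimeError("dependency unavailable")

    for _ in range(3):
        print(breaker.call(broken_service, lambda: "cached default"))
```

The caller always gets an answer, and once the circuit is open the failing dependency is no longer called until the cool-down expires.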
Demo
May the demo gods be with us…
Scaling Systems
Preventing failures due to capacity issues
“The possibilities are
numerous once we decide to
act and not react”
-- George Bernard Shaw
Reactive Auto Scaling
• Reacts to real-time conditions
• Responds to spikes/dips in metrics
– Load average
– Requests per second
• Excellent for many scaling scenarios
– Much better than static cluster sizing
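A reactive policy of the kind described above can be sketched as a function from a current metric to a desired cluster size. This is a simplified illustration, not an AWS Auto Scaling API call; `reactive_desired_capacity` and the per-instance throughput figure are assumptions.

```python
import math


def reactive_desired_capacity(
    current_rps: float,
    rps_per_instance: float,
    min_instances: int,
    max_instances: int,
) -> int:
    """Size the cluster from the request rate observed right now."""
    needed = math.ceil(current_rps / rps_per_instance)
    # Clamp to the configured bounds of the scaling group.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    for rps in (100, 9_500, 50_000):
        size = reactive_desired_capacity(rps, 1_000, min_instances=2, max_instances=20)
        print(f"{rps} RPS -> {size} instances")
```

Because the result depends only on what is observed at the moment, any sudden drop in measured traffic pushes the result toward `min_instances`, whatever the cause of the drop.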
Reactive Auto Scaling - Challenges
• Policies can be inefficient w
•
• Outages can trigger scale down events
• Excess capacity at peak and trough
Scryer : Predictive Auto Scaling
Not yet…
Typical Traffic Patterns Over Five Days
Predicted RPS Compared to Actual RPS
Scaling Plan for Predicted Workload
What is Scryer Doing?
• Evaluates needs based on historical data
– Week over week, month over month metrics
• Adjusts instance minimums based on algorithms
– Constant feedback loops
– Evaluated routinely through squeeze tests
• Relies on Auto Scaling for unpredicted spikes in traffic
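The bullets above can be sketched as a three-step calculation: predict load from history, turn the prediction into instance minimums, and let reactive scaling raise capacity above that floor. The slides do not describe Scryer's prediction algorithms, so this sketch swaps in a plain average of the same time slot across past weeks; all function names, the headroom factor, and the per-instance throughput are assumptions.

```python
import math
from statistics import mean
from typing import List, Sequence


def predict_rps(history_by_week: Sequence[Sequence[float]]) -> List[float]:
    """Average each time slot across past weeks (stand-in for a real model)."""
    return [mean(slot) for slot in zip(*history_by_week)]


def scaling_plan(
    predicted_rps: Sequence[float],
    rps_per_instance: float,
    headroom: float = 1.2,
) -> List[int]:
    """Instance minimum per time slot, with a safety margin on the prediction."""
    return [math.ceil(rps * headroom / rps_per_instance) for rps in predicted_rps]


def effective_capacity(planned_minimum: int, reactive_desired: int) -> int:
    """Reactive scaling may exceed the planned floor but never goes below it."""
    return max(planned_minimum, reactive_desired)


if __name__ == "__main__":
    # Two past weeks of RPS samples at the same three time slots (made-up data).
    history = [[1_000, 4_000, 9_000], [1_200, 4_400, 9_400]]
    plan = scaling_plan(predict_rps(history), rps_per_instance=1_000)
    print("planned minimums:", plan)
    print("slot 1 during an unpredicted spike:", effective_capacity(plan[1], 9))
```

Setting minimums ahead of time means capacity is already in place when predicted load arrives, while the `max` keeps reactive scaling available for spikes the prediction missed.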
Results
Results : Load Average
Reactive
Predictive
Results : Response Latencies
Reactive
Predictive
Results : Outage Recovery
Results : AWS Costs
Key Takeaways
https://www.github.com/Netflix
Netflix talks at re:Invent
Talk     Time               Title
PFC-305  Wednesday, 1:15pm  Embracing Failure: Fault Injection and Service Reliability
BDT-403  Wednesday, 2:15pm  Next Generation Big Data Platform at Netflix
PFC-306  Wednesday, 2:15pm  Performance Tuning EC2
DEV-309  Wednesday, 3:30pm  From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317  Wednesday, 4:30pm  Maintaining a Resilient Front-Door at Massive Scale
PFC-304  Wednesday, 4:30pm  Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209  Wednesday, 4:30pm  Cloud Migration, Dev-Ops and Distributed Systems
APP-310  Friday, 9:00am     Scheduling using Apache Mesos in the Cloud