(ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

November 12, 2014 | Las Vegas, Nevada | Daniel Jacobson, Netflix | Ben Schmaus, Netflix

description

The Netflix service supports more than 50 million subscribers in over 40 countries around the world. These subscribers use more than 1,000 different device types to connect to Netflix, resulting in massive amounts of traffic to the service. In our distributed environment, the gateway service that receives this customer traffic needs to be able to scale in a variety of ways while simultaneously protecting our subscribers from failures elsewhere in the architecture. This talk will detail how the Netflix front door operates, leveraging systems like Hystrix, Zuul, and Scryer to maximize the AWS infrastructure and to create a great streaming experience.

Transcript of (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Page 1: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

November 12, 2014 | Las Vegas, Nevada

Daniel Jacobson, Netflix

Ben Schmaus, Netflix

Page 2: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Daniel Jacobson

@daniel_jacobson

danieljacobson/linkedin

danieljacobson.com/slideshare

Ben Schmaus

@schmaus

schma.us/in

schma.us/slides

Edge Engineering

Page 3: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014
Page 4: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

What does Edge Engineering do?

• Broker data between services and devices

• Control playback flow

• Ensure resiliency

• Scale our systems

• Enable high velocity product innovation

• Provide detailed, real-time health insights

“The Edge... the only people who really know where it is are the ones who have gone over.”

-- Hunter S. Thompson

Page 6: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Related talk: APP-310: Scheduling using Apache Mesos in the Cloud, Friday at 9:00

Page 7: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: four tiers labeled DEVICES, ROUTING, ORIGIN, and SERVICES. ORIGIN holds six API boxes, each labeled RxJava, Hystrix, and Scripting; SERVICES holds boxes S1 through S13.]

Page 8: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN shows Playback, Website, and Logging clusters next to the API (RxJava, Hystrix, Scripting); SERVICES holds boxes S1 through S13.]

Page 10: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Routing Traffic

“There is no Dana, only Zuul!”

Page 11: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Zuul

Gatekeeper for the Netflix Streaming Application

Page 12: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Zuul

• Multi-Region Resiliency

• Dynamic Routing

• Squeeze Testing

• Insights

• Load Shedding

• Security

• Authentication

Page 13: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN holds six API boxes, each labeled RxJava, Hystrix, and Scripting; SERVICES holds boxes S1 through S13.]

Page 14: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN is split into a PROD cluster and a DEBUG cluster, each labeled RxJava, Hystrix, and Scripting; SERVICES holds boxes S1 through S13.]

Page 15: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN is split into a PROD cluster and a SQUEEZE cluster, each labeled RxJava, Hystrix, and Scripting; SERVICES holds boxes S1 through S13.]

Page 16: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Systems are healthy.

Traffic from the east goes to US-EAST.

Traffic from the west goes to US-WEST.

Page 17: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Systems failure in US-EAST.

Page 18: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

US-EAST Zuul routes traffic to US-WEST Zuul (until DNS gets resolved)

Page 19: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

DNS gets resolved.

Requests from east go to US-WEST

Page 20: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Systems recover in US-EAST.

DNS set to return to normal

Page 21: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

DNS gets resolved.

Both regions return to normal.
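
The failover sequence on the preceding slides can be traced with a small model. This is a minimal sketch, not Zuul's implementation: the `serving_region` function, the `dns` and `healthy` dictionaries, and the `PEER` table are hypothetical names, and it assumes the Zuul tier in the failed region stays reachable, as the slides imply.

```python
# Hypothetical model of the slides' failover steps; not Zuul code.
HOME_REGION = {"east": "US-EAST", "west": "US-WEST"}
PEER = {"US-EAST": "US-WEST", "US-WEST": "US-EAST"}


def serving_region(client_geo, dns, healthy):
    """Return the region that ends up serving a request.

    dns maps a client geography to the region DNS currently returns.
    healthy maps a region name to True or False.
    """
    entry = dns[client_geo]      # region whose Zuul receives the request
    if healthy[entry]:
        return entry             # normal case: serve in the entry region
    return PEER[entry]           # entry Zuul forwards to the peer region


def walk_through():
    """Replay the five states shown on the slides for eastern clients."""
    dns = dict(HOME_REGION)
    healthy = {"US-EAST": True, "US-WEST": True}
    served = [serving_region("east", dns, healthy)]    # systems are healthy

    healthy["US-EAST"] = False                         # failure in US-EAST
    served.append(serving_region("east", dns, healthy))  # Zuul forwards west

    dns["east"] = "US-WEST"                            # DNS now points west
    served.append(serving_region("east", dns, healthy))

    healthy["US-EAST"] = True                          # US-EAST recovers
    dns["east"] = "US-EAST"                            # DNS returns to normal
    served.append(serving_region("east", dns, healthy))
    return served


print(walk_through())
```

The point of the model is that eastern clients keep being served during the window between the failure and the DNS change, because the entry Zuul forwards cross-region in the meantime.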

Page 22: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Resiliency in Distributed Systems

Preventing cascading failures

Page 23: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

-- Leslie Lamport

Page 24: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Dependency Relationships

Page 25: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

5,000,000,000 Incoming Requests Per Day

Netflix API

Page 26: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

30 Dependent Services

Netflix API

Page 27: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

~600 Dependency Jars

Netflix API

Page 28: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

40,000,000,000 Outbound Calls Per Day to Dependent Services

Netflix API

Page 29: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

1 Thing is common across all dependencies…

Page 30: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

0 Dependent Services have a 100% SLA

Page 31: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

$99.99\%^{30} = 99.7\%$

0.3% of 5B = 15M failures per day

2+ Hours of Downtime Per Month

Page 33: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

$99.9\%^{30} = 97\%$

3% of 5B = 150M failures per day

20+ Hours of Downtime Per Month
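
The arithmetic behind these two slides can be reproduced in a few lines. This is a sketch under stated assumptions: every request touches all 30 dependencies, the dependencies fail independently, and a month has 30 days. The function names are illustrative and do not come from the talk.

```python
REQUESTS_PER_DAY = 5_000_000_000   # incoming requests per day, from the slides
DEPENDENCIES = 30                  # dependent services, from the slides
HOURS_PER_MONTH = 30 * 24          # assumption: a 30-day month


def compound_availability(per_dependency, dependencies=DEPENDENCIES):
    """Availability when every dependency must succeed (independence assumed)."""
    return per_dependency ** dependencies


def failures_per_day(availability, requests=REQUESTS_PER_DAY):
    """Expected number of failed requests per day at the given availability."""
    return (1 - availability) * requests


def downtime_hours_per_month(availability, hours=HOURS_PER_MONTH):
    """Equivalent hours of downtime per month at the given availability."""
    return (1 - availability) * hours


for sla in (0.9999, 0.999):
    total = compound_availability(sla)
    print(f"{sla:.2%} per dependency -> {total:.2%} overall, "
          f"{failures_per_day(total) / 1e6:.0f}M failures/day, "
          f"{downtime_hours_per_month(total):.1f} h/month")
```

The slides round the compound figures to 99.7% and 97%, which is why they quote 15M and 150M failures per day; the unrounded results are slightly lower.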

Page 34: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN holds six API boxes, each labeled RxJava, Hystrix, and Scripting; SERVICES holds boxes S1 through S13.]

Page 37: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN holds six API boxes, each with RxJava, Hystrix, and Scripting layers; SERVICES holds boxes S1 through S13.]

Page 38: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014
Page 39: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014
Page 40: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014
Page 41: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Call Volume and Health / Last 10 Seconds

Call Volume / Last 2 Minutes

Page 42: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Successful Requests

Short-Circuited Requests, Delivering Fallbacks

Timeouts, Delivering Fallbacks

Full Queues, Delivering Fallbacks

Exceptions, Delivering Fallbacks
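
The five outcomes above can be illustrated with a toy circuit breaker that always returns a fallback when the primary call does not succeed. This is a minimal sketch and not the Hystrix API: `Breaker`, `QueueFull`, and the trip rule (open after a fixed number of consecutive failures, never re-close) are assumptions chosen to keep the example short.

```python
from collections import Counter


class QueueFull(Exception):
    """Hypothetical signal that the dependency's work queue is full."""


class Breaker:
    """Toy circuit breaker: trips after N consecutive failures (assumed rule)."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.counts = Counter()   # one counter per outcome category

    def call(self, primary, fallback):
        # Circuit open: skip the dependency entirely and serve the fallback.
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.counts["short_circuited"] += 1
            return fallback()
        try:
            result = primary()
        except TimeoutError:
            self.counts["timeout"] += 1
        except QueueFull:
            self.counts["queue_full"] += 1
        except Exception:
            self.counts["exception"] += 1
        else:
            self.counts["success"] += 1
            self.consecutive_failures = 0
            return result
        # Any failure path lands here and still returns a usable response.
        self.consecutive_failures += 1
        return fallback()


def slow_dependency():
    raise TimeoutError("dependency did not answer in time")


breaker = Breaker()
results = [breaker.call(slow_dependency, lambda: "cached response")
           for _ in range(5)]
print(results, dict(breaker.counts))
```

After three timeouts the breaker stops calling the dependency, so the last two requests are short-circuited, yet every caller still receives the fallback value.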

Page 43: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Error Rate

Error Rate = (short-circuited + timeouts + full queues + exceptions) / (successful + short-circuited + timeouts + full queues + exceptions)
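
As a worked example of the error-rate calculation, the sketch below divides the four fallback-producing counters by the total of all five. The function name and the sample counts are illustrative assumptions, not values from the talk.

```python
def error_rate(successful, short_circuited, timeouts, full_queues, exceptions):
    """Share of requests that did not succeed, as a fraction between 0 and 1."""
    failed = short_circuited + timeouts + full_queues + exceptions
    total = successful + failed
    return failed / total if total else 0.0


# Hypothetical counts for one reporting window.
rate = error_rate(successful=950, short_circuited=20, timeouts=15,
                  full_queues=5, exceptions=10)
print(f"{rate:.1%}")
```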

Page 44: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN holds six API boxes, each labeled RxJava, Hystrix, and Scripting; SERVICES holds boxes S1 through S13.]

Page 46: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

[Diagram: the DEVICES, ROUTING, ORIGIN, and SERVICES tiers. ORIGIN holds six API boxes, each labeled RxJava, Hystrix, and Scripting; SERVICES holds boxes S1 through S13. The slide adds a "Fallback" label.]

Page 47: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Demo

May the demo gods be with us…

Page 48: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Scaling Systems

Preventing failures due to capacity issues

Page 49: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

“The possibilities are numerous once we decide to act and not react”

-- George Bernard Shaw

Page 50: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Reactive Auto Scaling

• Reacts to real-time conditions

• Responds to spikes/dips in metrics
  – Load average
  – Requests per second

• Excellent for many scaling scenarios
  – Much better than static cluster sizing

Page 51: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Reactive Auto Scaling - Challenges

• Policies can be inefficient w

• Outages can trigger scale down events

• Excess capacity at peak and trough

Page 52: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Scryer : Predictive Auto Scaling

Not yet…

Page 53: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Typical Traffic Patterns Over Five Days

Page 54: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Predicted RPS Compared to Actual RPS

Page 55: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Scaling Plan for Predicted Workload

Page 56: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

What is Scryer Doing?

• Evaluates needs based on historical data
  – Week over week, month over month metrics

• Adjusts instance minimums based on algorithms
  – Constant feedback loops
  – Evaluated routinely through squeeze tests

• Relies on Auto Scaling for unpredicted spikes in traffic
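
The idea of turning historical traffic into instance minimums can be sketched as follows. This is not Scryer's algorithm: the simple week-over-week average, the 20% headroom, the per-instance capacity of 900 RPS, and all function names are assumptions made for illustration.

```python
import math


def predict_rps(same_slot_history):
    """Predict RPS for a time slot as the mean of that slot in prior weeks."""
    return sum(same_slot_history) / len(same_slot_history)


def instance_minimum(predicted_rps, rps_per_instance, headroom_percent=20):
    """Instances needed for the prediction plus headroom, rounded up."""
    needed = predicted_rps * (100 + headroom_percent) / (100 * rps_per_instance)
    return math.ceil(needed)


def scaling_plan(weekly_history, rps_per_instance):
    """Map each time slot to the minimum instance count to set ahead of time."""
    return {slot: instance_minimum(predict_rps(samples), rps_per_instance)
            for slot, samples in weekly_history.items()}


# Hypothetical RPS for two time slots over the previous three weeks.
history = {
    "03:00": [9_000, 10_000, 11_000],
    "20:00": [48_000, 50_000, 52_000],
}
# Assumed per-instance capacity; the slides suggest squeeze tests supply this.
plan = scaling_plan(history, rps_per_instance=900)
print(plan)
```

The plan only sets the floor for each time slot. Reactive Auto Scaling would still add capacity above that floor when traffic exceeds the prediction.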

Page 57: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Results

Page 58: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Results : Load Average

Reactive

Predictive

Page 60: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Results : Response Latencies

Reactive

Predictive

Page 62: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Results : Outage Recovery

Page 63: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Results : AWS Costs

Page 64: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Key Takeaways

Page 65: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

https://www.github.com/Netflix

Page 66: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

Netflix talks at re:Invent

Talk     Time               Title
PFC-305  Wednesday, 1:15pm  Embracing Failure: Fault Injection and Service Reliability
BDT-403  Wednesday, 2:15pm  Next Generation Big Data Platform at Netflix
PFC-306  Wednesday, 2:15pm  Performance Tuning EC2
DEV-309  Wednesday, 3:30pm  From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317  Wednesday, 4:30pm  Maintaining a Resilient Front-Door at Massive Scale
PFC-304  Wednesday, 4:30pm  Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209  Wednesday, 4:30pm  Cloud Migration, Dev-Ops and Distributed Systems
APP-310  Friday, 9:00am     Scheduling using Apache Mesos in the Cloud

Page 67: (ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

http://bit.ly/awsevals

http://schma.us/in

http://schma.us/slides