Netflix API - FirstMark Capital, June 2015

The API: And Things I’ve Learned from 4 Years as a Manager at Netflix
Ben Schmaus, FirstMark Capital, June 2015 [ @schmaus ]



A Brief Netflix Overview

Then & Now

Eng & Ops

A Few Thoughts on Team & Process

Global Internet TV Network

More and Better Devices

Growing Internet Traffic Share

Since I joined in 2011...

20 million subscribers

1 AWS region

Functional but fragile cloud platform

Datacenter to Cloud Migration

Early Cloud Days

Outages and conference bridges almost daily

International Expansion

Scale 2011 to Now

20 to 60 million subscribers

2 to 50 countries

1 to 3 AWS regions

Backend services: Recommendations, User Info, Title Metadata, Title Ratings, Similars, My List, A/B Test Allocations

API

Fundamental Mission Unchanged*

Support product innovation

Insulate devices from failure

* modulo pivot from public API

[Diagram: ELB, Gateway, API, and Backend Services; a second build adds API Debug]

[Diagram: API server stack]

Stack: App, Tomcat, JVM, Ubuntu

Service Layer

Endpoints: /tv, /ios, /web, /android, ...

Services: recs, account, search, sims, ...

Components: Service JARs, Endpoint Scripts, Java API

Things Will Break

Tens of engineers changing production systems every day

Engineers are feature producers and failure defenders

(From How Complex Systems Fail)

Fallbacks
○ JVM cache
○ Default value
○ Stubbed object
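The three fallback types on this slide can be sketched as an ordered chain: try the live call, then a cached value, then a default, or hand back a stubbed object. This is a minimal illustration in Python, not the Netflix implementation. The names `get_recs`, `get_profile`, `DEFAULT_RECS`, and the module-level dict (standing in for the in-process JVM cache) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical in-process cache of the last good response per user
# (stands in for the "JVM cache" on the slide).
_last_good: Dict[str, List[str]] = {}

# Hypothetical static default served when nothing is cached.
DEFAULT_RECS: List[str] = ["popular-1", "popular-2"]


def get_recs(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Call the backend; on any failure, degrade through the fallbacks."""
    try:
        recs = fetch(user_id)
        _last_good[user_id] = recs  # remember for future fallbacks
        return recs
    except Exception:
        cached = _last_good.get(user_id)
        if cached is not None:
            return cached  # fallback: cached value
        return list(DEFAULT_RECS)  # fallback: default value


def get_profile(user_id: str, fetch: Callable[[str], dict]) -> dict:
    """Call the backend; on failure, return a stubbed object."""
    try:
        return fetch(user_id)
    except Exception:
        # fallback: stubbed object with safe, empty fields
        return {"user_id": user_id, "name": "", "stub": True}
```

The caller never sees the exception; it gets a degraded but usable response, which is the point of insulating devices from backend failure.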

Bulkheads
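A bulkhead limits how much of the server one dependency can consume, so a slow backend cannot exhaust capacity needed by the others. The sketch below uses a semaphore to cap concurrent calls and rejects immediately when the compartment is full. The `Bulkhead` class and its parameters are hypothetical and written in Python for brevity; this is not the Hystrix API.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Non-blocking acquire: if the compartment is full, shed load now
        # rather than letting callers pile up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()
        except Exception:
            return fallback()
        finally:
            self._slots.release()
```

One instance per dependency (for example one for recs and one for search) keeps a problem in one service from starving calls to the other.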

[Diagram: the same API server stack (endpoints, services, Service JARs, Endpoint Scripts, Java API, Service Layer), now with Hystrix added]

[Diagram: ELB, Gateway, API, and Backend Services]

Humans best at design time

Automatically adapt, degrade gracefully

Clearly report system behavior

Don’t Tweak Knobs

Your Failure Handling Will Fail

...unless you test it regularly

Do your fallbacks work?

Do they trigger before your servers overload?
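One way to answer both questions is to inject a fault on purpose and assert that the fallback is served within the time budget. The sketch below is a minimal Python illustration with hypothetical helper names (`call_with_timeout`, `inject_latency`); it is not the tooling used at Netflix.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn: Callable[[], T], timeout_s: float,
                      fallback: Callable[[], T]) -> T:
    """Run fn on a worker thread; serve the fallback on error or timeout."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # includes the timeout raised by future.result
        return fallback()


def inject_latency(fn: Callable[[], T], delay_s: float) -> Callable[[], T]:
    """Wrap fn so that it stalls for delay_s seconds before answering."""
    def slowed() -> T:
        time.sleep(delay_s)
        return fn()
    return slowed


# Failure test: the dependency stalls for 0.5 s, the budget is 0.05 s.
start = time.monotonic()
result = call_with_timeout(inject_latency(lambda: "live", 0.5), 0.05,
                           lambda: "fallback")
elapsed = time.monotonic() - start
```

Here `result` should be the fallback and `elapsed` should stay well under the injected delay. Note that the stalled worker thread keeps running after the caller has given up, so a timeout alone does not free server capacity.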

Retries Seem Simple

Effects compound

Pretty easy to DDoS yourself

Wasted server work on timed-out clients

[Diagram, built up over several slides: services A, B, C, D, and later E, with "3x" retry multipliers on two hops. The worst case is labeled 1:9, then 1:18, then 1:27.]
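The compounding effect is multiplication: if each hop in a call chain makes up to a fixed number of attempts, the worst-case load on the last service is the product of those counts. The function below is a small sketch of that arithmetic under the assumption of a simple linear chain; `worst_case_calls` is a hypothetical name, and the sketch does not try to reproduce the exact topology on the slides.

```python
from math import prod
from typing import Sequence


def worst_case_calls(attempts_per_hop: Sequence[int]) -> int:
    """Worst-case calls reaching the last service for one original request.

    attempts_per_hop[i] is the maximum number of attempts (first try plus
    retries) that hop i makes against the next service in the chain.
    """
    return prod(attempts_per_hop)


two_hops = worst_case_calls([3, 3])       # 3 x 3 = 9
three_hops = worst_case_calls([3, 3, 3])  # 3 x 3 x 3 = 27
```

Two hops at 3 attempts each match the 1:9 label, and one more hop at 3 attempts gives 27, which is why adding "just one more retry layer" is rarely harmless.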

Execute Failure Handling and Verify Assumptions

Testing

Not just a discrete development phase

Continuously analyze app behavior

On-call

Team stays connected to prod needs

Avoid burnout

Primary / Secondary Model


Two-week rotation = one week as secondary, then one week as primary
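Read as one week per role, this rotation means each week's secondary becomes the next week's primary. A minimal Python sketch of such a schedule follows; the function name `rotation` and the team names in the usage line are placeholders, and the one-week reading of the slide is an assumption.

```python
from typing import List, Tuple


def rotation(team: List[str], weeks: int) -> List[Tuple[str, str]]:
    """Return (primary, secondary) for each week.

    Each person serves one week as secondary and then the following
    week as primary, after which the next person in the list takes over.
    """
    n = len(team)
    return [(team[w % n], team[(w + 1) % n]) for w in range(weeks)]


schedule = rotation(["ana", "bo", "cy"], 3)
```

With three people, `schedule` pairs ana with bo, then bo with cy, then cy with ana, so everyone gets a week of warm-up before carrying the pager as primary.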

AllOps

Too much “keep the lights on” work?

Spend more on tools and automation

If you want people to be creative, let them create.

Team Shapes the Roadmap

Listen to the team; they know where work is needed

Priorities
1. Production healthy
2. Company (launch new market)
3. Biz As Usual (A/B tests, chaos testing)
4. Elective (FIT)

“...judgement is the solution for almost every ambiguous problem. Not process.”

- John Ciancutti (former Netflix eng)