Mastering Chaos - A Netflix Guide to Microservices

106
Josh Evans – Engineering Leader November 8, 2016 Mastering Chaos A Netflix Guide to Microservices

Transcript of Mastering Chaos - A Netflix Guide to Microservices

PowerPoint Presentation

Josh Evans Engineering Leader

November 8, 2016Mastering ChaosA Netflix Guide to Microservices

Illness in the Family

Myelin SheatheAutoimmune disorderExternally triggerTreatableGuillain-Barr Syndrome

Is anyone familiar with this condition?

Breathing is a miraculous act of bravery

Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.Pause so youre probably wondering why Im talking about biology and disease in a talk about microservices?

ELBand so is taking traffic

And just as we human beings thrive in a world filled with threats so can your microservice architectureAnd just as my step mother Barbaras own body attacked itself in response to some unknown pathogen our own services can do the same thing. Poorly tuned timouts, retries, and fallbacks can reek havoc and take your entire customer-facing service downThere are big challenges but every challenge has a solutionBut, just as for all of us, it requires discipline to stay fit. You must embrace the chaos and that it is impossible for any one individual to fully understand the whole distributed system.This is why were here today to talk about the Netflix microservice journey. How we walk the razors edge between discipline ad chaos. And how you can benefit from the lessons weve learned over the last 7 years.

IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today

IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today

1999 2009Engineer & Engineering ManagerEcommerce (DVD Streaming)

2009 2013Director of Engineering - Playback Services

2013 2016Director of Operations EngineeringJosh Evans

Taking time offSpending time with familyThinking about whats nextToday

Leader in subscription internet tv serviceHollywood, indy, localGrowing slate of original content

86 million members~190 countries, 10s of languages1000s of device types

Microservices on AWS

IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today

Netflix DVD Data Center - 2000

Linux HostWhat microservices are notApacheTomcatJavawebSTORELoad BalancerBILLINGHTTPJDBCDB LinkHTTP/SMonolithic code baseMonolithic databaseTightly coupled architecture

What is a microservice?

the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.

- Martin Fowler

Separation of concernsModularity, encapsulation

ScalabilityHorizontally scalingWorkload partitioning

Virtualization & elasticityAutomated operationsOn demand provisioningAn Evolutionary Response

Organ SystemsEach organ has a purposeOrgans form systemsSystems form an organism

Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.

Edge

ELBZuulNCCPAPIMiddle Tier & PlatformProductBucket testingSubscriberRecommendations

PlatformRoutingConfigurationCrypto

PersistenceCacheDatabase

Client ApplicationClient LibraryEVCache ClientService Client

SSSS. . .DBDBDBDB. . .

. . .

Microservices are an abstraction. . .

Microservice

Read from cacheOn cache miss call serviceService calls DB & respondsService updates cache

IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today

DependencyScaleVarianceChange

Challenges & Solutions

DependencyScaleVarianceChange

Challenges & Solutions

Intra-service requestsClient librariesData PersistenceInfrastructureUse Cases

Intra-service Requests

Crossing the Chasm

External trigger, internal response

Linux HostLinux HostLinux HostLinux HostCrossing the ChasmLinux HostApacheTomcatLinux HostApacheTomcatNetwork latency, congestion, failureLogical or scaling failureService AService B

As soon as you go out of process and/or off box you have a distributed systemCombinatorial math on nines of availability Adrian Cockcroft suggested Netflix in a box as a thought experiment early on to address connectivity concerns

Cascading Failure

* If you do not defend against failure at each level then you have what is essentially a distributed monolith if any microservice fails then they all fail* Calls start failing, retries make it worse, thread pools become saturated, lack of isolation leads to full cascading failure

How do you know if it works?

Inoculation

Device

Service B

Service CInternet

EdgeZuulService A ELBFITSynthetic transactionsOverride by device or account% of live traffic up to 100%Fault Injection Testing (FIT)

Device

Service B

Service CInternet

EdgeZuulService A ELB

FITFault Injection Testing (FIT)Enforced throughout the call path

ELBAPI

How do we constrain testing scope?

API GatewayApp 1App 2App 4App 5App 6App 3App 7App 899.9999.9999.9999.9999.9999.9999.9999.99Proxy99.9999.99Combinatorial Math99.9910 = 99.9

Critical Microservices

Client Libraries

Many clientsCommon business logicCommon access patterns

Return of the Monolith

Heap consumptionLogical defectsTransitive dependenciesParasitic Infestation

This nasty looking creature comes right out of your favorite horror movieThe good news is that its a very tiny creature not something that would destroy TokyoThe bad news is that its a vampire a hookworm that attaches itself to the wall of the intestine, puncturing blood vessels and feeding on blood.This can lead to severe anemia, effecting the health of the whole organismAnd just like the hookworm, client libraries can consume resources of your microservice application

Client ApplicationClient LibraryEVCache ClientService Client

SSSS. . .DBDBDBDB. . .

. . .

Simple Logic, Common Patterns. . .

cache, service, backfillRequest level caching

Persistence

In the presence of a network partition, you must choose between consistency and availabilityCAP TheoremDBDBDB

Network BNetwork CNetwork DService

Network AX

Zone AZone BZone CZone BZone CClientZone ALocal Quorum(Typical)100ms

Eventual Consistency

Client writes to any nodeCoordinator replicates to nodes Nodes ack to coordinatorCoordinator acks to clientWrite to commit logHinted handoff to offline nodes

Infrastructure

December 24th, 2012

On Christmas eve, 2012 Netflix experiences a region-wide outage due an accidental ELB configuration change

Many engineers were on call, missing time with their familiesThey spent much of the night and into the morning trying to mitigate the impact of the outage on our customers but to no availWe ultimately had to wait for Amazon to address the root cause

And our members, many of them new to Netflix were unable to streamTheir responses varied in intensity from

US-East-1CanadaNo place to goUSLatin America

US-East-1US-West-2

EU-West-1

#NetflixEverywhere Global ArchitectureQCon London, 2016https://www.infoq.com/presentations/netflix-failure-multiple-regions

DependencyScaleVarianceChange

Challenges & Solutions

Stateless services Stateful servicesHybrid services

Use Cases

Stateless Services

Not a cache or a databaseFrequently accessed metadataNo instance affinityLoss a node is a non-event

What is a stateless service?

Minimum size

Desired capacity

Maximum size

Scale out as needed

S3AMI retrieved on demandCompute efficiencyNode failureTraffic spikesPerformance bugs

Auto Scaling Groups

Cluster ACluster D

Edge ClusterCluster BCluster CSurviving Instance Failure

Stateful Services

Databases & cachesCustom apps which hold large amounts of dataLoss of a node is a notable event

What is a stateful service?

Dedicated Shards An AntipatternSquid 1Squid 2Squid 3Client ApplicationSubscriber Client LibraryCache ClientService Client

SSSS. . .DBDBDBDB. . .Squid nHA ProxySet 1Set 2Set 3Set nX

Early on we had two competing approaches to caching. The Subscriber service team leaned on Squid caches, applying a dedicated shard modelThis model proved problematic involving long outages for members when a shard went down. In addition lack of proper thread pool isolation meant that the entire Netflix service might be come unavailable when one shard became unavailableI was on a conference call several years ago where it took four hours to recover from such an outage

Redundancy is fundamental

Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.Pause so youre probably wondering why Im talking about biology and disease in a talk about microservices?

Zone AZone BZone C

. . .

. . .

. . .

EVCache WritesClient ApplicationClient LibraryEVCache ClientClient ApplicationClient LibraryEVCache ClientClient ApplicationClient LibraryEVCache Client

Zone AClient ApplicationClient LibraryEVCache ClientZone BClient ApplicationClient LibraryEVCache ClientZone CClient ApplicationClient LibraryEVCache Client

. . .

. . .

. . .

EVCache Reads

Hybrid Services

Client ApplicationClient LibraryEVCache ClientService Client

SSSS. . .DBDBDBDB. . .

. . .

Hybrid Microservice. . .

Now lets look at scale from the perspective of a complex microservice architectureOne in which there is a caching tier fronting the microservice tier

Its easy to take EVCache for granted30 million requests/sec2 trillion requests per day globally

Hundreds of billions of objectsTens of thousands of memcached instances

Milliseconds of latency per request

Batch

SSSS. . .DBDBDBDB. . .

. . .

. . .Member PathMember PathMember PathBatchBatchCalled by many servicesOnline & offline clientsCalled many times / request800k 1M RPS

Fallback to service/dbExcessive Load

In this case the subscriber team heavily relied on the caching tierTaking traffic north of 800k rps

Batch

SSSS. . .DBDBDBDB. . .

. . .

. . .Member PathMember PathMember PathBatchBatchExcessive LoadXX

In this case the subscriber team heavily relied on the caching tierTaking traffic north of 800k rps

Batch

SSSS. . .DBDBDBDB. . .

. . .

. . .Member PathMember PathMember PathBatchBatchWorkload partitioningRequest-level cachingSecure token fallbackChaos under loadSolutions

OnlineOffline

There are several solutions that address this anti-pattern

DependencyScaleVarianceChange

Challenges & Solutions

Operational driftPolyglot & containersUse Cases

Operational Drift(Unintentional Variance)

Over timeAlert thresholdsTimeouts, retries, fallbacksThroughput (RPS)

Across microservicesReliability best practicesOperational Drift

Autonomic NervousSystemYou dont have to think about digestion or breathing

Continuous Learning & Automation

AlertsApache & TomcatAutomated canary analysisAutoscalingChaosConsistent namingELB configHealthcheckImmutable machine imagesSqueeze testingStaged, red/black deploymentsTimeouts, retries, fallbacksProduction Ready

Polyglot & Containers(Intentional Variance)

The Paved RoadStashNebula/GradleBaseAMI/UbuntuJenkinsSpinnakerRuntime Platform

In the Critical Path

In the Critical Path

Productivity toolingInsight & triage capabilitiesBase image fragmentationNode management Library/platform duplicationLearning curve - production expertiseCost of Variance

Raise awareness of costsConstrain centralized supportPrioritize by impactSeek reusable solutionsStrategic Stance

DependencyScaleVarianceChange

Challenges & Solutions

How do we achieve velocity with confidence?

Global Cloud Management & Delivery

Integrated, Automated Practices

Conformity checksRed/black pipelinesAutomated canariesStaged deploymentsSqueeze tests

Story bricked test environment for 6 hours global configuration changeStaging necessary for deployments & configuration changes

AlertsApache & TomcatAutomated canary analysisAutoscalingChaosConsistent namingELB configHealthcheckImmutable machine imagesSqueeze testingStaged, red/black deploymentsTimeouts, retries, fallbacksProduction Ready

https://www.youtube.com/watch?v=IkPb15FfuQU

IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today

Architecture first, organizational structure secondBlameless incident reviewsCommitment to continuous improvement

Customer DeviceNetflix Data Center - 2009

NCCPElectronic Delivery - NRDP 1.xLoad BalancerNetflix AppSecurityActivationPlaybackPlatform (NRDP)UICollaborative designXML payloads Custom responsesVersioned firmware releasesLong cyclesSimple UI Queue ReaderED

Netflix API - let a 1000 flowers bloom!

Netflix Data Center - 2009

APINetflix API from public to privateLoad BalancerGeneral REST APIJSON schemaHTTP response codesOauth security modelContent MetadataContent MetadataApplication

Customer DeviceNetflix Data Center 2010

APIHybrid ArchitectureLBNetflix AppSecurityActivationPlaybackPlatform (NRDP)UIContent Metadata

NCCPEDLBREST, JSON, OAuthRPC, XML, NTBADistinctServicesProtocolsSchemasSecurity

Different end points, protocols, security made life difficult for client teamsEspecially when we wanted to integrated UI and playback functionality

Josh: what is the right long term architecture?

Peter: do you care about the organizational implications?

Conways LawOrganizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.Any piece of software reflects the organizational structure that produced it.

Conways LawIf you have four teams working on a compiler you will end up with a four pass compiler

NCCPAPI

Blade Runner

OutcomesProductivity & new capabilitiesRefactored organization

LessonsSolutions first, team secondReconfigure teams to best support your architecture

Outcomes & Lessons

IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureRecapOur Talk Today

Architecture first, organizational structure secondBlameless incident reviewsCommitment to continuous improvement

Microservice architectures are complex and organic

Health depends on discipline and chaos

Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.Pause so youre probably wondering why Im talking about biology and disease in a talk about microservices?

DependencyCircuit breakers, fallbacks, chaosSimple clientsEventual consistencyMulti-region failover

ScaleAuto-scalingRedundancy avoid SPoFPartitioned workloadsFailure-driven designChaos under loadVarianceEngineered operations Understood cost of variancePrioritized support by impactChangeAutomated deliveryIntegrated practicesOrganization & ArchitectureSolutions first, team second

netflix.github.io

techblog.netflix.com

Questions?