TIBCO BWCE and Netflix' Hystrix Circuit Breaker for Cloud Native Middleware Microservices
Mastering Chaos - A Netflix Guide to Microservices
-
Upload
josh-evans -
Category
Internet
-
view
1.396 -
download
10
Transcript of Mastering Chaos - A Netflix Guide to Microservices
PowerPoint Presentation
Josh Evans Engineering Leader
November 8, 2016Mastering ChaosA Netflix Guide to Microservices
Illness in the Family
Myelin SheatheAutoimmune disorderExternally triggerTreatableGuillain-Barr Syndrome
Is anyone familiar with this condition?
Breathing is a miraculous act of bravery
Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.Pause so youre probably wondering why Im talking about biology and disease in a talk about microservices?
ELBand so is taking traffic
And just as we human beings thrive in a world filled with threats so can your microservice architectureAnd just as my step mother Barbaras own body attacked itself in response to some unknown pathogen our own services can do the same thing. Poorly tuned timouts, retries, and fallbacks can reek havoc and take your entire customer-facing service downThere are big challenges but every challenge has a solutionBut, just as for all of us, it requires discipline to stay fit. You must embrace the chaos and that it is impossible for any one individual to fully understand the whole distributed system.This is why were here today to talk about the Netflix microservice journey. How we walk the razors edge between discipline ad chaos. And how you can benefit from the lessons weve learned over the last 7 years.
IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today
IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today
1999 2009Engineer & Engineering ManagerEcommerce (DVD Streaming)
2009 2013Director of Engineering - Playback Services
2013 2016Director of Operations EngineeringJosh Evans
Taking time offSpending time with familyThinking about whats nextToday
Leader in subscription internet tv serviceHollywood, indy, localGrowing slate of original content
86 million members~190 countries, 10s of languages1000s of device types
Microservices on AWS
IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today
Netflix DVD Data Center - 2000
Linux HostWhat microservices are notApacheTomcatJavawebSTORELoad BalancerBILLINGHTTPJDBCDB LinkHTTP/SMonolithic code baseMonolithic databaseTightly coupled architecture
What is a microservice?
the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.
- Martin Fowler
Separation of concernsModularity, encapsulation
ScalabilityHorizontally scalingWorkload partitioning
Virtualization & elasticityAutomated operationsOn demand provisioningAn Evolutionary Response
Organ SystemsEach organ has a purposeOrgans form systemsSystems form an organism
Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.
Edge
ELBZuulNCCPAPIMiddle Tier & PlatformProductBucket testingSubscriberRecommendations
PlatformRoutingConfigurationCrypto
PersistenceCacheDatabase
Client ApplicationClient LibraryEVCache ClientService Client
SSSS. . .DBDBDBDB. . .
. . .
Microservices are an abstraction. . .
Microservice
Read from cacheOn cache miss call serviceService calls DB & respondsService updates cache
IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today
DependencyScaleVarianceChange
Challenges & Solutions
DependencyScaleVarianceChange
Challenges & Solutions
Intra-service requestsClient librariesData PersistenceInfrastructureUse Cases
Intra-service Requests
Crossing the Chasm
External trigger, internal response
Linux HostLinux HostLinux HostLinux HostCrossing the ChasmLinux HostApacheTomcatLinux HostApacheTomcatNetwork latency, congestion, failureLogical or scaling failureService AService B
As soon as you go out of process and/or off box you have a distributed systemCombinatorial math on nines of availability Adrian Cockcroft suggested Netflix in a box as a thought experiment early on to address connectivity concerns
Cascading Failure
* If you do not defend against failure at each level then you have what is essentially a distributed monolith if any microservice fails then they all fail* Calls start failing, retries make it worse, thread pools become saturated, lack of isolation leads to full cascading failure
How do you know if it works?
Inoculation
Device
Service B
Service CInternet
EdgeZuulService A ELBFITSynthetic transactionsOverride by device or account% of live traffic up to 100%Fault Injection Testing (FIT)
Device
Service B
Service CInternet
EdgeZuulService A ELB
FITFault Injection Testing (FIT)Enforced throughout the call path
ELBAPI
How do we constrain testing scope?
API GatewayApp 1App 2App 4App 5App 6App 3App 7App 899.9999.9999.9999.9999.9999.9999.9999.99Proxy99.9999.99Combinatorial Math99.9910 = 99.9
Critical Microservices
Client Libraries
Many clientsCommon business logicCommon access patterns
Return of the Monolith
Heap consumptionLogical defectsTransitive dependenciesParasitic Infestation
This nasty looking creature comes right out of your favorite horror movieThe good news is that its a very tiny creature not something that would destroy TokyoThe bad news is that its a vampire a hookworm that attaches itself to the wall of the intestine, puncturing blood vessels and feeding on blood.This can lead to severe anemia, effecting the health of the whole organismAnd just like the hookworm, client libraries can consume resources of your microservice application
Client ApplicationClient LibraryEVCache ClientService Client
SSSS. . .DBDBDBDB. . .
. . .
Simple Logic, Common Patterns. . .
cache, service, backfillRequest level caching
Persistence
In the presence of a network partition, you must choose between consistency and availabilityCAP TheoremDBDBDB
Network BNetwork CNetwork DService
Network AX
Zone AZone BZone CZone BZone CClientZone ALocal Quorum(Typical)100ms
Eventual Consistency
Client writes to any nodeCoordinator replicates to nodes Nodes ack to coordinatorCoordinator acks to clientWrite to commit logHinted handoff to offline nodes
Infrastructure
December 24th, 2012
On Christmas eve, 2012 Netflix experiences a region-wide outage due an accidental ELB configuration change
Many engineers were on call, missing time with their familiesThey spent much of the night and into the morning trying to mitigate the impact of the outage on our customers but to no availWe ultimately had to wait for Amazon to address the root cause
And our members, many of them new to Netflix were unable to streamTheir responses varied in intensity from
US-East-1CanadaNo place to goUSLatin America
US-East-1US-West-2
EU-West-1
#NetflixEverywhere Global ArchitectureQCon London, 2016https://www.infoq.com/presentations/netflix-failure-multiple-regions
DependencyScaleVarianceChange
Challenges & Solutions
Stateless services Stateful servicesHybrid services
Use Cases
Stateless Services
Not a cache or a databaseFrequently accessed metadataNo instance affinityLoss a node is a non-event
What is a stateless service?
Minimum size
Desired capacity
Maximum size
Scale out as needed
S3AMI retrieved on demandCompute efficiencyNode failureTraffic spikesPerformance bugs
Auto Scaling Groups
Cluster ACluster D
Edge ClusterCluster BCluster CSurviving Instance Failure
Stateful Services
Databases & cachesCustom apps which hold large amounts of dataLoss of a node is a notable event
What is a stateful service?
Dedicated Shards An AntipatternSquid 1Squid 2Squid 3Client ApplicationSubscriber Client LibraryCache ClientService Client
SSSS. . .DBDBDBDB. . .Squid nHA ProxySet 1Set 2Set 3Set nX
Early on we had two competing approaches to caching. The Subscriber service team leaned on Squid caches, applying a dedicated shard modelThis model proved problematic involving long outages for members when a shard went down. In addition lack of proper thread pool isolation meant that the entire Netflix service might be come unavailable when one shard became unavailableI was on a conference call several years ago where it took four hours to recover from such an outage
Redundancy is fundamental
Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.Pause so youre probably wondering why Im talking about biology and disease in a talk about microservices?
Zone AZone BZone C
. . .
. . .
. . .
EVCache WritesClient ApplicationClient LibraryEVCache ClientClient ApplicationClient LibraryEVCache ClientClient ApplicationClient LibraryEVCache Client
Zone AClient ApplicationClient LibraryEVCache ClientZone BClient ApplicationClient LibraryEVCache ClientZone CClient ApplicationClient LibraryEVCache Client
. . .
. . .
. . .
EVCache Reads
Hybrid Services
Client ApplicationClient LibraryEVCache ClientService Client
SSSS. . .DBDBDBDB. . .
. . .
Hybrid Microservice. . .
Now lets look at scale from the perspective of a complex microservice architectureOne in which there is a caching tier fronting the microservice tier
Its easy to take EVCache for granted30 million requests/sec2 trillion requests per day globally
Hundreds of billions of objectsTens of thousands of memcached instances
Milliseconds of latency per request
Batch
SSSS. . .DBDBDBDB. . .
. . .
. . .Member PathMember PathMember PathBatchBatchCalled by many servicesOnline & offline clientsCalled many times / request800k 1M RPS
Fallback to service/dbExcessive Load
In this case the subscriber team heavily relied on the caching tierTaking traffic north of 800k rps
Batch
SSSS. . .DBDBDBDB. . .
. . .
. . .Member PathMember PathMember PathBatchBatchExcessive LoadXX
In this case the subscriber team heavily relied on the caching tierTaking traffic north of 800k rps
Batch
SSSS. . .DBDBDBDB. . .
. . .
. . .Member PathMember PathMember PathBatchBatchWorkload partitioningRequest-level cachingSecure token fallbackChaos under loadSolutions
OnlineOffline
There are several solutions that address this anti-pattern
DependencyScaleVarianceChange
Challenges & Solutions
Operational driftPolyglot & containersUse Cases
Operational Drift(Unintentional Variance)
Over timeAlert thresholdsTimeouts, retries, fallbacksThroughput (RPS)
Across microservicesReliability best practicesOperational Drift
Autonomic NervousSystemYou dont have to think about digestion or breathing
Continuous Learning & Automation
AlertsApache & TomcatAutomated canary analysisAutoscalingChaosConsistent namingELB configHealthcheckImmutable machine imagesSqueeze testingStaged, red/black deploymentsTimeouts, retries, fallbacksProduction Ready
Polyglot & Containers(Intentional Variance)
The Paved RoadStashNebula/GradleBaseAMI/UbuntuJenkinsSpinnakerRuntime Platform
In the Critical Path
In the Critical Path
Productivity toolingInsight & triage capabilitiesBase image fragmentationNode management Library/platform duplicationLearning curve - production expertiseCost of Variance
Raise awareness of costsConstrain centralized supportPrioritize by impactSeek reusable solutionsStrategic Stance
DependencyScaleVarianceChange
Challenges & Solutions
How do we achieve velocity with confidence?
Global Cloud Management & Delivery
Integrated, Automated Practices
Conformity checksRed/black pipelinesAutomated canariesStaged deploymentsSqueeze tests
Story bricked test environment for 6 hours global configuration changeStaging necessary for deployments & configuration changes
AlertsApache & TomcatAutomated canary analysisAutoscalingChaosConsistent namingELB configHealthcheckImmutable machine imagesSqueeze testingStaged, red/black deploymentsTimeouts, retries, fallbacksProduction Ready
https://www.youtube.com/watch?v=IkPb15FfuQU
IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureOur Talk Today
Architecture first, organizational structure secondBlameless incident reviewsCommitment to continuous improvement
Customer DeviceNetflix Data Center - 2009
NCCPElectronic Delivery - NRDP 1.xLoad BalancerNetflix AppSecurityActivationPlaybackPlatform (NRDP)UICollaborative designXML payloads Custom responsesVersioned firmware releasesLong cyclesSimple UI Queue ReaderED
Netflix API - let a 1000 flowers bloom!
Netflix Data Center - 2009
APINetflix API from public to privateLoad BalancerGeneral REST APIJSON schemaHTTP response codesOauth security modelContent MetadataContent MetadataApplication
Customer DeviceNetflix Data Center 2010
APIHybrid ArchitectureLBNetflix AppSecurityActivationPlaybackPlatform (NRDP)UIContent Metadata
NCCPEDLBREST, JSON, OAuthRPC, XML, NTBADistinctServicesProtocolsSchemasSecurity
Different end points, protocols, security made life difficult for client teamsEspecially when we wanted to integrated UI and playback functionality
Josh: what is the right long term architecture?
Peter: do you care about the organizational implications?
Conways LawOrganizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.Any piece of software reflects the organizational structure that produced it.
Conways LawIf you have four teams working on a compiler you will end up with a four pass compiler
NCCPAPI
Blade Runner
OutcomesProductivity & new capabilitiesRefactored organization
LessonsSolutions first, team secondReconfigure teams to best support your architecture
Outcomes & Lessons
IntroductionsMicroservice BasicsChallenges & SolutionsOrganization & ArchitectureRecapOur Talk Today
Architecture first, organizational structure secondBlameless incident reviewsCommitment to continuous improvement
Microservice architectures are complex and organic
Health depends on discipline and chaos
Even the simple act of breathing is a complex act requiring many systems to cooperate and posing the potential to inhale dangerous gases or pathogens.Pause so youre probably wondering why Im talking about biology and disease in a talk about microservices?
DependencyCircuit breakers, fallbacks, chaosSimple clientsEventual consistencyMulti-region failover
ScaleAuto-scalingRedundancy avoid SPoFPartitioned workloadsFailure-driven designChaos under loadVarianceEngineered operations Understood cost of variancePrioritized support by impactChangeAutomated deliveryIntegrated practicesOrganization & ArchitectureSolutions first, team second
netflix.github.io
techblog.netflix.com
Questions?