Mini-Training: Netflix Simian Army
Uploaded by betclic-everest-group-tech-team
Netflix
Founded in 1997
World's leading internet subscription service for enjoying movies and TV programs
More than 48 million members in more than 40 countries enjoying more than one billion hours of TV shows and movies per month
Watch it anytime anywhere on all your connected devices
From 2 to 60 billion requests a day to its API in 2 years
12 billion outbound requests a day to API dependencies
A complex distributed system
Amazon Web Services (AWS)
• Officially launched in 2006
• Offers a broad set of global compute, storage, database, analytics, application, and deployment services
• Accessible by HTTP via REST or SOAP
• Data centers located in 8 regions around the world
• Customers include NASA, Netflix and the CIA (on a private replica of AWS)
What it provides
• Elastic Compute Cloud (EC2), resizable compute capacity in the cloud
• Elastic Block Store (EBS), block level storage volumes used by EC2 instances
• Elastic Load Balancing, automatic incoming application traffic distribution across multiple EC2 instances
Architecture picture
[Diagram: Region A spanning Availability Zone 1, Availability Zone 2 and Availability Zone 3; each zone contains ASG 1 and ASG 2, with boxes labeled A, B and C in each ASG]
Transition to AWS: Dorothy, you’re not in Kansas anymore
• Prepare to unlearn a lot of what you know
• Be much more structured about “over the wire” interactions
Co-tenancy is hard
• Build your system to expect and accommodate failure at any level
The best way to avoid failure is to fail constantly
• Design each distributed system to expect and tolerate failure from other systems on which it depends
• Constantly test your ability to succeed despite failure
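Tolerating the failure of a dependency often comes down to having a fallback ready. The sketch below is only an illustration of that idea, not Netflix code: `call_with_fallback`, `personalized_titles` and `popular_titles` are hypothetical names invented for the example.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary(); if it raises, return fallback() instead."""
    try:
        return primary()
    except Exception:
        # The dependency failed: degrade gracefully instead of propagating.
        return fallback()


def personalized_titles() -> List[str]:
    # Hypothetical remote dependency that is currently down.
    raise ConnectionError("recommendation service unavailable")


def popular_titles() -> List[str]:
    # Hypothetical static default that needs no remote call.
    return ["Title A", "Title B"]


titles = call_with_fallback(personalized_titles, popular_titles)
```

The caller still gets a usable answer when the dependency is down, which is the behaviour that constant failure testing is meant to verify.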
Learn with real scale, not toy models
• Try doing it at full scale with real data
• Validate your design choices, with real scale comes trouble
Commit yourself
• It is hard to start
• Learn from your mistakes
The Simian Army
Availability and Resiliency as a Service
A set of tools (scheduled agents) that deliberately shut down services, degrade performance, check conformity, and so on, to test the system's ability to survive such events
• Chaos Monkey * (Chaos Gorilla and Chaos Kong)
• Latency Monkey
• Conformity Monkey *
• Security Monkey
• Doctor Monkey
• Janitor Monkey *
• Howler Monkey
• More to come, …
The Chaos Monkey
How
• Service running on AWS that seeks out Auto Scaling groups and terminates instances in each group
• Designed to be flexible enough to be extended to other cloud providers or instance groupings
• Has a configurable schedule; by default it runs on non-holiday weekdays between 9am and 3pm
• Chaos Gorilla simulates the outage of an entire Availability Zone
• Chaos Kong simulates the outage of an entire Region
Why
• Prepare for failure to ensure you can tolerate the loss of an instance
• Learn from new, unpredicted issues that may occur
• Check that services rebalance automatically with no visible impact for users
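The selection logic described above can be sketched roughly as follows. This is not the SimianArmy implementation: `in_schedule` and `pick_victims` are hypothetical helpers, the groups are plain dictionaries rather than AWS API results, and choosing exactly one instance per group is an assumption made for the example.

```python
import random
from datetime import date, datetime
from typing import Dict, FrozenSet, List


def in_schedule(now: datetime, holidays: FrozenSet[date] = frozenset()) -> bool:
    """True on non-holiday weekdays between 9am and 3pm."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and now.date() not in holidays and 9 <= now.hour < 15


def pick_victims(
    groups: Dict[str, List[str]],
    now: datetime,
    rng: random.Random,
    holidays: FrozenSet[date] = frozenset(),
) -> Dict[str, str]:
    """Choose one random instance per non-empty group, only inside the schedule."""
    if not in_schedule(now, holidays):
        return {}
    return {name: rng.choice(ids) for name, ids in groups.items() if ids}


groups = {"ASG 1": ["i-a", "i-b"], "ASG 2": ["i-c"], "empty": []}
victims = pick_victims(groups, datetime(2014, 3, 4, 10, 0), random.Random(0))
```

A real agent would then ask the cloud provider to terminate each selected instance; that call is left out here on purpose.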
The Latency Monkey
How
• Service running on AWS that induces artificial delays in the RESTful communication layer and measures how upstream services respond
• With very large delays, the downtime of a node or even an entire service can be simulated without physically bringing the instances down
Why
• Simulate service degradation
• Test that services respond appropriately
• Test the ability to survive an entire service downtime
• Test the fault tolerance of a new service by simulating the failure of its dependencies without affecting the rest of the system
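As an illustration of the idea, delay injection can be mimicked in-process with a wrapper, and the caller's reaction checked with a timeout. This is a minimal sketch under assumptions: `with_latency` and `call_with_timeout` are hypothetical helpers, and the real Latency Monkey acts on the network communication layer rather than on Python functions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap func so every call is preceded by an artificial delay."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def call_with_timeout(func, timeout_seconds, default):
    """Run func in a worker thread; return default if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return default
    finally:
        # Do not block the caller while the slow call finishes.
        pool.shutdown(wait=False)


def fetch():
    # Hypothetical dependency call.
    return "fresh"


healthy = call_with_timeout(fetch, 1.0, "cached")
degraded = call_with_timeout(with_latency(fetch, 0.3), 0.05, "cached")
```

With a delay larger than the caller's timeout, the dependency looks down to the caller even though it is still running, which matches the simulated downtime described above.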
The Conformity Monkey
How
• Service running on AWS that finds instances that don’t comply with predefined best practices
• Marks non-compliant instances, shuts them down and notifies their owners
• Check is performed every hour by default
• Notification is sent only once a day, at noon
Why
• Non-compliant instances, such as those not belonging to an Auto Scaling group, are trouble waiting to happen
• Anticipate and give the owners a chance to relaunch them properly
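A conformity check of this kind can be pictured as a set of named rules applied to every instance. The sketch below is an assumption-laden illustration: the `Instance` record, the `RULES` table and `find_violations` are hypothetical, and only the "belongs to an Auto Scaling group" rule comes from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Instance:
    instance_id: str
    owner: str
    asg: Optional[str]  # None when the instance is in no Auto Scaling group


# Each rule returns True when the instance is compliant.
Rule = Callable[[Instance], bool]

RULES: Dict[str, Rule] = {
    "belongs to an Auto Scaling group": lambda inst: inst.asg is not None,
}


def find_violations(
    instances: Iterable[Instance], rules: Dict[str, Rule]
) -> Dict[str, List[str]]:
    """Map each non-compliant instance id to the names of the rules it breaks."""
    violations: Dict[str, List[str]] = {}
    for inst in instances:
        failed = [name for name, rule in rules.items() if not rule(inst)]
        if failed:
            violations[inst.instance_id] = failed
    return violations


fleet = [
    Instance("i-1", "team-a@example.com", "ASG 1"),
    Instance("i-2", "team-b@example.com", None),
]
report = find_violations(fleet, RULES)
```

The resulting report is what a notification step would group by owner before sending the daily message.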
The Security Monkey
How
• Service running on AWS, built as an extension of the Conformity Monkey
• Finds security violations or vulnerabilities
Why
• Track improperly configured security groups to terminate offending instances
• Ensure that all their SSL and DRM certificates are valid and are not coming up for renewal
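The certificate check can be sketched as a comparison of expiry dates against today's date. Everything here is an assumption for illustration: `certificate_findings` is a hypothetical helper, expiry dates are supplied as a plain dictionary instead of being read from real certificates, and the 30-day renewal window is an arbitrary default.

```python
from datetime import date, timedelta
from typing import Dict


def certificate_findings(
    expiry_dates: Dict[str, date],
    today: date,
    renewal_window_days: int = 30,  # assumed default, not from the slides
) -> Dict[str, str]:
    """Flag certificates that have expired or will expire within the window."""
    window_end = today + timedelta(days=renewal_window_days)
    findings: Dict[str, str] = {}
    for name, expires_on in expiry_dates.items():
        if expires_on < today:
            findings[name] = "expired"
        elif expires_on <= window_end:
            findings[name] = "due for renewal"
    return findings


certs = {
    "api": date(2014, 2, 1),
    "drm": date(2014, 3, 15),
    "www": date(2015, 1, 1),
}
findings = certificate_findings(certs, today=date(2014, 3, 1))
```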
The Doctor Monkey
How
• Service running on AWS that taps into the health checks running on each instance
• Monitors other external signs of health, such as CPU load or memory usage
• Removes instances detected as unhealthy from service
Why
• Give service owners time to find the root cause of the problem
• Eventually terminate the detected instances
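One way to picture this triage is to combine the health check result with the external signs and split the fleet into serving and quarantined instances. The sketch is illustrative only: `HealthReport`, `is_unhealthy`, `triage` and the 0.95 thresholds are assumptions, not values from the slides.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass
class HealthReport:
    instance_id: str
    check_passed: bool   # result of the instance's own health check
    cpu_load: float      # fraction of CPU in use, 0.0 to 1.0
    memory_used: float   # fraction of memory in use, 0.0 to 1.0


def is_unhealthy(r: HealthReport, max_cpu: float = 0.95, max_memory: float = 0.95) -> bool:
    """Unhealthy if the health check failed or an external sign exceeds its threshold."""
    return (not r.check_passed) or r.cpu_load > max_cpu or r.memory_used > max_memory


def triage(in_service: Set[str], reports: Iterable[HealthReport]) -> Tuple[Set[str], Set[str]]:
    """Return (still serving, quarantined); quarantined instances are not terminated yet."""
    quarantined = {
        r.instance_id for r in reports
        if r.instance_id in in_service and is_unhealthy(r)
    }
    return in_service - quarantined, quarantined


reports = [
    HealthReport("i-1", True, 0.20, 0.30),
    HealthReport("i-2", False, 0.10, 0.10),
    HealthReport("i-3", True, 0.99, 0.50),
]
serving, quarantined = triage({"i-1", "i-2", "i-3"}, reports)
```

Keeping quarantined instances alive, rather than terminating them immediately, is what gives owners the time to investigate.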
The Janitor Monkey
How (mark, notify, delete)
• Service running on AWS that searches for unused resources
• Marks resources as cleanup candidates
• Schedules a disposal time for each resource
  • The cleanup deadline is defined in the rule that marked the resource
• Notifies the owners of the marked resources
  • By default, the notification is sent 2 business days before the cleanup deadline
  • During this period the owner can decide to clean up or retain the resource
• Disposes of resources once the deadline is reached
Why
• Ensure that the cloud environment is running free of clutter and waste.
• Save costs on operations (The more and the longer you use, the more you pay)
• Free up engineering time, no need to manage unused resources anymore
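The mark, notify, delete cycle can be sketched as a small decision function driven by the cleanup deadline. This is a simplified illustration, not the Janitor Monkey code: `subtract_business_days` and `janitor_action` are hypothetical helpers, business days are taken to be Monday to Friday with no holiday calendar, and the function reports "notify" on every day of the notice period rather than sending a single message.

```python
from datetime import date, timedelta


def subtract_business_days(day: date, n: int) -> date:
    """Step back n business days (Monday to Friday; holidays are ignored)."""
    while n > 0:
        day -= timedelta(days=1)
        if day.weekday() < 5:
            n -= 1
    return day


def janitor_action(deadline: date, today: date, retained: bool, notice_days: int = 2) -> str:
    """Decide what to do with a marked resource on a given day."""
    if retained:
        return "keep"      # the owner opted to retain the resource
    if today >= deadline:
        return "delete"    # deadline reached: dispose of the resource
    if today >= subtract_business_days(deadline, notice_days):
        return "notify"    # inside the notice period before the deadline
    return "wait"


deadline = date(2014, 3, 10)  # a Monday
notice_start = subtract_business_days(deadline, 2)
```

With a Monday deadline, stepping back 2 business days skips the weekend, so the notice period starts on the preceding Thursday.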
The Howler Monkey
How
• Service running on AWS that monitors whether a workload is reaching AWS limits and reports it
Why
• Maintain healthy operations by ensuring that AWS limits are respected
• Save costs on operations
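A limit check of this kind boils down to comparing current usage with each limit and reporting what is close to the ceiling. The sketch below is hypothetical throughout: `limits_near_exhaustion`, the 80% threshold, and the usage and limit figures are made up for the example and are not real AWS quotas.

```python
from typing import Dict


def limits_near_exhaustion(
    usage: Dict[str, int],
    limits: Dict[str, int],
    threshold: float = 0.8,  # assumed alert threshold
) -> Dict[str, float]:
    """Return the usage ratio of every resource at or above the threshold."""
    alerts: Dict[str, float] = {}
    for resource, limit in limits.items():
        used = usage.get(resource, 0)
        if limit > 0 and used / limit >= threshold:
            alerts[resource] = used / limit
    return alerts


# Made-up figures for illustration only.
usage = {"instances": 18, "volumes": 10}
limits = {"instances": 20, "volumes": 100}
alerts = limits_near_exhaustion(usage, limits)
```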
Netflix Open Source Software: Build your own robust and highly available platform
Release PaaS components git by git to encourage others
• Sources at github.com/netflix
• Intros and techniques at http://techblog.netflix.com
• Blog post or new code every few weeks
Motivations
• Give back to the Apache-licensed OSS community
• Motivate, retain, hire top engineers
• "Peer pressure" code cleanup, external contributions
Users and contributors
• IBM
• Waze
• Yahoo
• Eucalyptus (Scalable cloud software)
• Yammer (Private social network), …
References
http://ir.netflix.com/
http://aws.amazon.com/
https://github.com/Netflix/SimianArmy/wiki
http://www.zdnet.com/netflix-how-we-got-a-grip-on-awss-cloud-3040095277/
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
http://gigaom.com/2013/07/21/ibm-high-fives-netflix-open-source-tools/
http://sssslide.com/www.slideshare.net/adrianco/netflix-and-open-source
We want our Sports betting, Poker, Horseracing and Casino & Games brands to be easy to use for every gamer around the world. Code with us to make that happen.
About Us
• Betclic Everest Group, one of the world leaders in online gaming, has a unique portfolio comprising various complementary international brands: Betclic, Everest, Bet-at-home.com, Expekt, Monte-Carlo Casino…
• Through our brands, Betclic Everest Group places expertise, technological know-how and security at the heart of our strategy to deliver an on-line gaming offer attuned to the passion of our players. We want our brands to be easy to use for every gamer around the world. We’re building our company to make that happen.
• Active in 100 countries with more than 12 million customers worldwide, the Group is committed to promoting secure and responsible gaming and is a member of several international professional associations including the EGBA (European Gaming and Betting Association) and the ESSA (European Sports Security Association).