Mini-Training: Netflix Simian Army

A short presentation on Netflix robots which allow them to ensure reliability and resilience of their massive distributed system.

Transcript of Mini-Training: Netflix Simian Army

Page 1: Mini-Training: Netflix Simian Army
Page 2: Mini-Training: Netflix Simian Army

Netflix

Founded in 1997

World's leading internet subscription service for enjoying movies and TV programs

More than 48 million members in more than 40 countries enjoying more than one billion hours of TV shows and movies per month

Watch it anytime anywhere on all your connected devices

From 2 to 60 billion requests a day to their API in 2 years

12 billion outbound requests to API dependencies

A complex distributed system

Page 3: Mini-Training: Netflix Simian Army

Amazon Web Services (AWS)

• Officially launched in 2006

• Offers a broad set of global compute, storage, database, analytics, application, and deployment services

• Accessible by HTTP via REST or SOAP

• Data centers located in 8 different world regions

• Has NASA, Netflix and the CIA (AWS private replica) as customers

What it provides

• Elastic Compute Cloud (EC2), resizable compute capacity in the cloud

• Elastic Block Store (EBS), block level storage volumes used by EC2 instances

• Elastic Load Balancing, automatic incoming application traffic distribution across multiple EC2 instances

Page 4: Mini-Training: Netflix Simian Army

Architecture picture: Region A with Availability Zones 1, 2 and 3; each zone contains ASG 1 and ASG 2, and each ASG holds boxes labeled A, B and C.

Page 5: Mini-Training: Netflix Simian Army

Transition to AWS: Dorothy, you’re not in Kansas anymore

• Prepare to unlearn a lot of what you know

• Be much more structured about “over the wire” interactions

Co-tenancy is hard

• Build your system to expect and accommodate failure at any level

The best way to avoid failure is to fail constantly

• Design each distributed system to expect and tolerate failure from other systems on which it depends

• Constantly test your ability to succeed despite failure

Learn with real scale, not toy models

• Try doing it at full scale with real data

• Validate your design choices, with real scale comes trouble

Commit yourself

• It is hard to start

• Learn from your mistakes

Page 6: Mini-Training: Netflix Simian Army

The Simian Army

Availability and Resiliency as a Service

A set of tools (scheduled agents) that deliberately shut down services, slow down performance, check conformity, … and test the ability to survive them

• Chaos Monkey * (Chaos Gorilla and Chaos Kong)

• Latency Monkey

• Conformity Monkey *

• Security Monkey

• Doctor Monkey

• Janitor Monkey *

• Howler Monkey

• More to come, …

Page 7: Mini-Training: Netflix Simian Army

The Chaos Monkey

How

• Service running on AWS that seeks out Auto Scaling groups and terminates instances per group

• Design is flexible enough to be extended to other cloud providers or instance groupings

• Has a configurable schedule, running by default on non-holiday weekdays between 9am and 3pm

• Chaos Gorilla simulates the outage of an entire Availability Zone

• Chaos Kong simulates the outage of an entire Region

Why

• Prepare to fail to ensure you can tolerate instance failure

• Learn from new unpredicted issues that may occur

• Check that services rebalance automatically with no user-visible impact
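
The schedule and the per-group termination described on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not the Simian Army code: the function names, the `groups` dictionary and the `holidays` set are all hypothetical, and removing an id from a list stands in for a real terminate call.

```python
import random
from datetime import datetime


def in_chaos_window(now, holidays=frozenset()):
    """True on non-holiday weekdays between 09:00 and 15:00 (default from the slide)."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and now.date() not in holidays and 9 <= now.hour < 15


def run_chaos(groups, now, rng=None, holidays=frozenset()):
    """Pick one random instance per non-empty group and remove it.

    `groups` maps a (hypothetical) Auto Scaling group name to a list of instance ids.
    Returns the chosen victims as {group name: instance id}.
    """
    if rng is None:
        rng = random.Random()
    if not in_chaos_window(now, holidays):
        return {}
    victims = {}
    for name, instances in groups.items():
        if instances:
            victim = rng.choice(instances)
            instances.remove(victim)  # stand-in for a real terminate call
            victims[name] = victim
    return victims
```

Calling `run_chaos` outside the window returns an empty dictionary and leaves every group untouched, which is what keeps the failures inside working hours when people are around to react.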

Page 8: Mini-Training: Netflix Simian Army

The Latency Monkey

How

• Service running on AWS and inducing artificial delay on the RESTful communication layer and measuring upstream services response

• With large delays, a node or even an entire service downtime can be simulated without physically bringing the instances down

Why

• Simulate service degradation

• Test that services respond appropriately

• Test the ability to survive an entire service downtime

• Test the fault tolerance of a new service by simulating the failure of its dependencies without affecting the rest of the system
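
The idea of simulating downtime with a large delay can be illustrated with a small wrapper. This is a sketch under assumptions: `with_injected_latency` and `fetch_with_fallback` are hypothetical names, the caller's timeout is modeled directly rather than through a real HTTP client, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def with_injected_latency(call, delay_s, timeout_s, sleep=time.sleep):
    """Wrap `call` so each invocation is delayed by `delay_s` seconds.

    If the injected delay reaches the caller's `timeout_s` budget, the
    dependency looks down: the wrapper waits out the budget and raises
    TimeoutError without ever running `call`.
    """
    def wrapped(*args, **kwargs):
        if delay_s >= timeout_s:
            sleep(timeout_s)
            raise TimeoutError("dependency did not answer in time")
        sleep(delay_s)
        return call(*args, **kwargs)
    return wrapped


def fetch_with_fallback(fetch, fallback):
    """Return the dependency's answer, or `fallback` when it times out."""
    try:
        return fetch()
    except TimeoutError:
        return fallback
```

A service that degrades gracefully returns the fallback value when the delay exceeds its timeout, which is the behavior this kind of test is meant to confirm.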

Page 9: Mini-Training: Netflix Simian Army

The Conformity Monkey

How

• Service running on AWS and finding instances that don’t comply with predefined best practices

• Marks non-compliant instances, shuts them down and notifies corresponding owners

• Check is performed every hour by default

• Notification is sent only once per day at noon time

Why

• Non-compliant instances, such as those not belonging to an Auto Scaling Group, are trouble waiting to happen

• Anticipate and give the owners a chance to relaunch them properly
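
A conformity check of this kind boils down to running a set of rules over every instance and grouping the failures by owner. The sketch below assumes a made-up `Instance` record and a single rule taken from the slide (membership in an Auto Scaling Group); the real tool's rule set and data model are not shown in this deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    instance_id: str
    owner: str
    asg: Optional[str] = None  # name of the Auto Scaling Group, if any


def in_auto_scaling_group(instance):
    return instance.asg is not None


# Hypothetical rule table: description -> predicate that returns True when compliant.
RULES = {"must belong to an Auto Scaling Group": in_auto_scaling_group}


def find_violations(instances, rules=RULES):
    """Return {owner: [(instance_id, [failed rule descriptions]), ...]}."""
    report = {}
    for instance in instances:
        failed = [name for name, rule in rules.items() if not rule(instance)]
        if failed:
            report.setdefault(instance.owner, []).append((instance.instance_id, failed))
    return report
```

Grouping by owner means each team can receive a single daily notification that lists all of its non-compliant instances.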

Page 10: Mini-Training: Netflix Simian Army

The Security Monkey

How

• Service running on AWS and an extension of the Conformity Monkey

• Finds security violations or vulnerabilities

Why

• Track improperly configured security groups to terminate offending instances

• Ensure that all their SSL and DRM certificates are valid and are not coming up for renewal
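
The certificate check can be reduced to date arithmetic once the expiry dates are known. In this sketch the expiry dates are passed in as plain `datetime` values and the 30-day renewal window is an assumed default; how the real tool collects certificates and what threshold it uses are not stated in this deck.

```python
from datetime import datetime, timedelta


def certificates_needing_attention(expiry_by_name, now, renew_within=timedelta(days=30)):
    """Flag certificates that are expired or that expire within `renew_within`.

    `expiry_by_name` maps a certificate name to its expiry datetime.
    Returns {name: "expired" | "renew soon"} for flagged certificates only.
    """
    flagged = {}
    for name, not_after in expiry_by_name.items():
        if not_after <= now:
            flagged[name] = "expired"
        elif not_after - now <= renew_within:
            flagged[name] = "renew soon"
    return flagged
```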

Page 11: Mini-Training: Netflix Simian Army

The Doctor Monkey

How

• Service running on AWS and tapping into health checks running on each instance

• Monitors other external health signs like CPU load or memory usage

• Removes instances detected as unhealthy from service

Why

• Give service owners time to find the root cause of the problem

• Eventually terminate the detected instances
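
The triage step can be sketched as a split of the fleet into instances that stay in service and instances that are pulled out for investigation. The `Health` record and the 95% CPU and memory limits below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Health:
    instance_id: str
    health_check_ok: bool  # result of the instance's own health check
    cpu_load: float        # fraction of capacity, 0.0 to 1.0
    memory_usage: float    # fraction of capacity, 0.0 to 1.0


def is_healthy(report, cpu_limit=0.95, memory_limit=0.95):
    """Healthy means the health check passes and both signals are under their limits."""
    return (
        report.health_check_ok
        and report.cpu_load < cpu_limit
        and report.memory_usage < memory_limit
    )


def triage(reports, **limits):
    """Split instance ids into (in_service, quarantined)."""
    in_service = [r.instance_id for r in reports if is_healthy(r, **limits)]
    quarantined = [r.instance_id for r in reports if not is_healthy(r, **limits)]
    return in_service, quarantined
```

Quarantined instances are only taken out of rotation here; terminating them later, after owners have had time to investigate, would be a separate step.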

Page 12: Mini-Training: Netflix Simian Army

The Janitor Monkey

How (mark, notify, delete)

• Service running on AWS and searching for unused resources

• Mark resources as cleanup candidates

• Schedule the disposal time of the resources
    • Cleanup deadline is defined in the rule that marked the resource

• Notify the owners of the marked resources
    • Notification time is 2 business days before the cleanup deadline by default

• During this period the owner can decide to clean up or retain the resource

• Dispose of resources once the deadline is reached

Why

• Ensure that the cloud environment is running free of clutter and waste.

• Save costs on operations (The more and the longer you use, the more you pay)

• Free up engineering time, no need to manage unused resources anymore
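
The mark, notify, delete timeline can be worked out with simple date arithmetic. In this sketch `retention_days` is a hypothetical per-rule setting standing in for "the cleanup deadline defined in the rule", and business days are taken to be Monday to Friday with no holiday calendar.

```python
from datetime import date, timedelta


def subtract_business_days(day, count):
    """Step back `count` business days (Monday to Friday) from `day`."""
    while count > 0:
        day -= timedelta(days=1)
        if day.weekday() < 5:  # skip Saturday (5) and Sunday (6)
            count -= 1
    return day


def schedule_cleanup(marked_on, retention_days, notice_business_days=2):
    """Return the cleanup deadline and the date on which the owner is notified."""
    deadline = marked_on + timedelta(days=retention_days)
    notify_on = subtract_business_days(deadline, notice_business_days)
    return {"deadline": deadline, "notify_on": notify_on}
```

For a resource marked on Monday 3 March 2014 with a 7-day retention, the deadline falls on Monday 10 March and the owner is notified on Thursday 6 March, since the weekend does not count toward the two business days.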

Page 13: Mini-Training: Netflix Simian Army

The Howler Monkey

How

• Service running on AWS that monitors whether a workload is approaching AWS limits and reports it

Why

• Maintain healthy operations by ensuring that AWS limitations are respected

• Save costs on operations
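
A limit check like this compares current usage with the known ceiling for each resource type and reports the ones that are getting close. The resource names and the 80% warning ratio below are assumptions for illustration; the deck does not say which limits are tracked or at what threshold the tool reports.

```python
def limits_at_risk(usage, limits, warn_ratio=0.8):
    """Return the sorted names of resources whose usage has reached
    `warn_ratio` of the corresponding limit.

    `usage` and `limits` map a resource name to a count; resources
    without a known limit are ignored.
    """
    return sorted(
        name
        for name, used in usage.items()
        if name in limits and used >= warn_ratio * limits[name]
    )
```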

Page 14: Mini-Training: Netflix Simian Army

Netflix Open Source Software: Build your own robust and highly available platform

Release PaaS components git by git to incentivize others

• Sources at github.com/netflix

• Intros and techniques at http://techblog.netflix.com

• Blog post or new code every few weeks

Motivations

• Give back to Apache licensed OSS community

• Motivate, retain, hire top engineers

• "Peer pressure" code cleanup, external contributions

Users and contributors

• IBM

• Waze

• Yahoo

• Eucalyptus (Scalable cloud software)

• Yammer (Private social network), …

Page 15: Mini-Training: Netflix Simian Army

References

http://ir.netflix.com/

http://aws.amazon.com/

https://github.com/Netflix/SimianArmy/wiki

http://www.zdnet.com/netflix-how-we-got-a-grip-on-awss-cloud-3040095277/

http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

http://gigaom.com/2013/07/21/ibm-high-fives-netflix-open-source-tools/

http://sssslide.com/www.slideshare.net/adrianco/netflix-and-open-source

Page 16: Mini-Training: Netflix Simian Army

Find out more

• On https://techblog.betclicgroup.com/

Page 17: Mini-Training: Netflix Simian Army

We want our Sports betting, Poker, Horseracing and Casino & Games brands to be easy to use for every gamer around the world. Code with us to make that happen.

We are hiring !

Page 18: Mini-Training: Netflix Simian Army

About Us

• Betclic Everest Group, one of the world leaders in online gaming, has a unique portfolio comprising various complementary international brands: Betclic, Everest, Bet-at-home.com, Expekt, Monte-Carlo Casino…

• Through our brands, Betclic Everest Group places expertise, technological know-how and security at the heart of our strategy to deliver an on-line gaming offer attuned to the passion of our players. We want our brands to be easy to use for every gamer around the world. We’re building our company to make that happen.

• Active in 100 countries with more than 12 million customers worldwide, the Group is committed to promoting secure and responsible gaming and is a member of several international professional associations including the EGBA (European Gaming and Betting Association) and the ESSA (European Sports Security Association).