Transcript
Page 1: Architecture for the cloud deployment case study future

Architecting for the Cloud

Len and Matt Bass

Deployment

Deployment pipeline

• Developers commit code

• Code is compiled

• Binary is processed by a build and unit test tool which builds the service

• Integration tests are run followed by performance tests.

• Result is a machine image (assuming virtualization)

• The service (its image) is deployed to production.

Deployment Overview

Multiple instances of a service are executing

• Red is the service being replaced with a new version

• Blue are clients

• Green are dependent services

[Figure: instances running version A (VA) and version B (VB), with UAT / staging / performance tests]

Additional considerations

• Recall release plan.

• Each release requires coordination among stakeholders and development teams.

– Can be explicitly performed

– Can be implicit in the architecture (we saw an example of this)

• You as an architect need to know how frequently new releases will be deployed.

– Some organizations deploy dozens of times a day

– Other organizations deploy once a week

– Still others once a month or longer

Monthly deployment

• There is time for coordination among development teams to ensure that components are consistent

• There is time for a “release manager” to ensure that a release is correct before moving it to production

Weekly deployment

• Limited time for coordination among development teams

• Time for “release manager” to ensure that a release is correct before moving it to production

Daily deployment

• No time for

– Coordination among development teams

– Release manager to validate release

• These items must be implicit in the architecture.

Deployment goal and constraints

• Goal of a deployment is to move from current state (N instances of version A of a service) to a new state (N instances of version B of a service)

• Constraints associated with Continuous Deployment:

– Any development team can deploy their service at any time, i.e. a new version of a service can be deployed either before or after a new version of a client (no synchronization among development teams).

– It takes time to replace one instance of version A with an instance of version B (order of minutes)

– Service to clients must be maintained while the new version is being deployed.

Deployment strategies

• Two basic all-or-nothing strategies

– Red/Black – leave N instances with version A as they are, allocate and provision N instances with version B, then switch to version B and release the instances with version A.

– Rolling Upgrade – allocate one instance, provision it with version B, release one version A instance. Repeat N times.

• Other deployment topics

– Partial strategies (canary testing, A/B testing). We will discuss them later. For now we are discussing all-or-nothing deployment.

– Rollback

– Packaging services into machine images
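The two all-or-nothing strategies can be compared with a small simulation. The following is only a sketch: instances are modeled as version labels in a list, and the function names `red_black` and `rolling_upgrade` are illustrative, not part of any deployment tool.

```python
def red_black(n):
    """Simulate Red/Black: return (final instances, peak instance count)."""
    instances = ["A"] * n
    instances += ["B"] * n              # allocate and provision N instances of B
    peak = len(instances)               # both environments exist before the switch
    instances = [v for v in instances if v == "B"]  # switch, then release all A
    return instances, peak


def rolling_upgrade(n):
    """Simulate Rolling Upgrade: return (final instances, peak instance count)."""
    instances = ["A"] * n
    peak = len(instances)
    for _ in range(n):
        instances.append("B")           # allocate one instance, provision with B
        peak = max(peak, len(instances))
        instances.remove("A")           # release one version A instance
    return instances, peak
```

Under these assumptions the peak instance count is 2N for Red/Black and N+1 for Rolling Upgrade, and only the rolling variant passes through states where both versions serve requests.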

Trade offs – Red/Black and Rolling Upgrade

• Red/Black

– Only one version available to the client at any particular time.

– Requires 2N instances (additional costs)

– Accomplished by moving environments

• Rolling Upgrade

– Multiple versions are available for service at the same time

– Requires N+1 instances.

• Both are heavily used. Choice depends on

– Cost

– Managing complications from using rolling upgrade

Rolling Upgrade in EC2

[Figure: flowchart with the steps Update Auto Scaling Group; Sort Instances; Remove & Deregister Old Instance from ELB; Confirm Upgrade Spec; Terminate Old Instance; Wait for ASG to Start New Instance; Register New Instance with ELB]

Types of failures during rolling upgrade

Rolling Upgrade Failure

• Provisioning: research topic

• Logical failure: inconsistencies to be discussed

• Instance failure: handled by Auto Scaling Group in EC2

What are the problems with Rolling Upgrade?

• Recall that any development team can deploy their service at any time.

• Three concerns

– Maintaining consistency between different versions of the same service when performing a rolling upgrade

– Maintaining consistency among different services

– Maintaining consistency between a service and persistent data

Maintaining consistency between different versions of the same service

• Key idea – differentiate between installing a new version and activating a new version

• Involves “feature toggles” (described momentarily)

• Sequence

– Develop version B with new code under control of feature toggle

– Install each instance of version B with the new code toggled off.

– When all of the instances of version A have been replaced with instances of version B, activate new code through toggling the feature.

Issues

• Do you remember feature toggles?

• How do I manage features that extend across multiple services?

• How do I activate all relevant instances at once?

Feature toggle

• Place feature dependent new code inside of an “if” statement where the code is executed if an external variable is true. Removed code would be the “else” portion.

• Used to allow developers to check in uncompleted code. Uncompleted code is toggled off.

• During deployment, until new code is activated, it will not be executed.

• Removing feature toggles when a new feature has been committed is important.
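A feature toggle as described above can be sketched in a few lines. The toggle name `new_pricing`, the in-process `TOGGLES` dictionary, and the pricing rule are all hypothetical; in a real system the value would come from an external source rather than a module-level variable.

```python
# Hypothetical toggle store standing in for the "external variable".
TOGGLES = {"new_pricing": False}


def is_enabled(name):
    """Unknown toggles default to off, so unfinished code stays dormant."""
    return TOGGLES.get(name, False)


def price_cents(amount_cents):
    if is_enabled("new_pricing"):
        # New, feature-dependent code: a 10% discount (illustrative only).
        return amount_cents * 90 // 100
    else:
        # Old code path, to be removed once the feature is committed.
        return amount_cents
```

With the toggle off, `price_cents(1000)` follows the old path; setting `TOGGLES["new_pricing"] = True` activates the new code without reinstalling anything.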

Multi service features

• Most features will involve multiple services.

• Each service has some code under control of a feature toggle.

• Activate feature when all instances of all services involved in a feature have been installed.

– Maintain a catalog with feature vs service version number.

– A feature toggle manager determines when all old instances of each version have been replaced. This could be done using registry/load balancer.

– The feature manager activates the feature.
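The catalog and the activation check can be sketched as follows. This is a minimal illustration, assuming the registry or load balancer can report the version of every running instance; the class name, the catalog layout, and the service names in the usage note are invented for the example.

```python
class FeatureToggleManager:
    def __init__(self, catalog):
        # catalog: feature -> {service: first version containing the feature code}
        self.catalog = catalog
        self.active = set()

    def ready(self, feature, running):
        # running: service -> list of versions of the currently registered
        # instances (assumed to come from the registry/load balancer).
        required = self.catalog[feature]
        return all(
            running.get(service) and min(running[service]) >= version
            for service, version in required.items()
        )

    def maybe_activate(self, feature, running):
        """Activate the feature only when no old instance remains."""
        if self.ready(feature, running):
            self.active.add(feature)
        return feature in self.active
```

For example, with a catalog of `{"recs_v2": {"catalog-svc": 3, "ui-svc": 7}}`, the feature stays off while any `ui-svc` instance still runs version 6 and is activated once every instance reports version 7 or later.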

Activating feature

• The feature toggle manager changes the value of the feature toggle. Two possible techniques to get new value to instances.

– Push. Broadcasting the new value will instruct each instance to use new code. If a lag of several seconds between the first service to be toggled and the last can be tolerated, there is no problem. Otherwise synchronizing value across network must be done.

– Pull. Querying the manager by each instance to get latest value may cause performance problems.

• A coordination mechanism such as Zookeeper will overcome both problems.

Maintaining consistency across versions (summary)

• Install all instances before activating any new code

• Use feature toggles to activate new code

• Use feature toggle manager to determine when to activate new code

• Use Zookeeper to coordinate activation with low overhead

Maintaining consistency among different services

• Use case:

– Wish to deploy new version of service A without coordinating with development team for clients of service A.

• I.e. new version of service A should be backward compatible in terms of its interfaces.

• May also require forward compatibility in certain circumstances, e.g. rollback

Achieving Backwards Compatibility

• APIs can be extended but must always be backward compatible.

• Leads to a translation layer

[Figure: clients call the external APIs (unchanging but with ability to extend or add new ones); a translation layer maps them to the internal APIs (changes require changes to translation layer but do not propagate further)]
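The translation layer can be pictured as a thin adapter. In the sketch below the inventory domain, the class names, and the method names are all hypothetical; the point is only that existing external calls keep their signatures while the internal API is free to change.

```python
class InternalInventory:
    """Internal API: may change between versions of the service."""

    def lookup(self, sku, warehouse="default"):
        # Placeholder result; a real service would query its data store.
        return {"sku": sku, "warehouse": warehouse, "count": 5}


class ExternalInventoryAPI:
    """External API: existing calls never change, new ones may be added."""

    def __init__(self, internal):
        self._internal = internal

    def get_stock(self, sku):
        # Original external call, kept stable for existing clients.
        return self._internal.lookup(sku)["count"]

    def get_stock_at(self, sku, warehouse):
        # Extension added later without touching get_stock.
        return self._internal.lookup(sku, warehouse)["count"]
```

If `InternalInventory.lookup` is later renamed or restructured, only the two adapter methods need to change; clients calling `get_stock` are unaffected.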

What about dependent services?

• Dependent services that are within your control should maintain backward compatibility

• Dependent services not within your control (third party software) cannot be forced to maintain backward compatibility.

– Minimize impact of changes by localizing interactions with third party software within a single module.

– Keeping services independent and packaging as much as possible into a virtual machine means that only third party software accessed through message passing will cause problems.

Forward Compatibility

• Gracefully handle unknown calls and database schema information

– Suppose your service receives a method call it does not recognize. It could be intended for a later version where this method is supported.

– Suppose your service retrieves a database table with an unknown field. It could have been added to support a later version.

• Forward compatibility allows a version of a service to be upgraded or rolled back independently from its clients. It involves both

– The service handling unrecognized information

– The client handling returns that indicate unrecognized information.
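Both halves of forward compatibility can be illustrated with a small sketch. The field names, the `do_` method prefix, and the `"unsupported"` status value are assumptions made for this example, not a prescribed protocol.

```python
KNOWN_FIELDS = {"id", "title"}


def read_row(row):
    """Keep the fields this version knows; ignore ones added for later versions."""
    return {k: v for k, v in row.items() if k in KNOWN_FIELDS}


class Service:
    def handle(self, method, *args):
        handler = getattr(self, "do_" + method, None)
        if handler is None:
            # Unrecognized call: possibly meant for a later version.
            return {"status": "unsupported", "method": method}
        return {"status": "ok", "result": handler(*args)}

    def do_echo(self, value):
        return value


def client_call(service, method, *args, fallback=None):
    """Client side: treat 'unsupported' as a normal, recoverable return."""
    reply = service.handle(method, *args)
    if reply["status"] == "unsupported":
        return fallback
    return reply["result"]
```

Here a client that calls a method the (rolled back) service does not know receives its fallback value instead of failing, and a row with an extra column is read without error.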

Maintaining consistency between a service and persistent data

• Assume new version is correct – we will discuss the situation where it is incorrect in a moment.

• Inconsistency in persistent data can come about because data schema or semantics change.

• Effect can be minimized by the following practices (if possible).

– Only extend schema – do not change semantics of existing fields. This preserves backwards compatibility.

– Treat schema modifications as features to be toggled. This maintains consistency among various services that access data.

I really must change the schema

• In this case, apply pattern for backward compatibility of interfaces to schemas.

• Use features of database system (I am assuming a relational DBMS) to restructure data while maintaining access to not yet restructured data.

Summary of consistency discussion so far.

• Feature toggles are used to maintain consistency within instances of a service

• Backward compatibility pattern is used to maintain consistency between a service and its clients.

• Discouraging modification of schema will maintain consistency between services and persistent data.

– If schema must be modified, then synchronize modifications with feature toggles.

Canary testing

• Canaries are a small number of instances of a new version placed in production in order to perform live testing in a production environment.

• Canaries are observed closely to determine whether the new version introduces any logical or performance problems. If not, roll out new version globally. If so, roll back canaries.

• Named after canaries in coal mines.

Implementation of canaries

• Designate a collection of instances as canaries. They do not need to be aware of their designation.

• Designate a collection of customers as testing the canaries. Can be, for example

– Organizationally based

– Geographically based

• Then

– Activate feature or version to be tested for canaries. Can be done through feature activation synchronization mechanism

– Route messages from canary customers to canaries. Can be done through making registry/load balancer canary aware.
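A canary-aware registry/load balancer could look roughly like the sketch below. It assumes round-robin selection inside each pool and identifies canary customers by an ID set; the class name and instance IDs are illustrative.

```python
import itertools


class CanaryAwareBalancer:
    def __init__(self, regular, canaries, canary_customers):
        # The instances themselves do not know whether they are canaries;
        # only the balancer keeps track of the designation.
        self._regular = itertools.cycle(regular)
        self._canaries = itertools.cycle(canaries)
        self._canary_customers = set(canary_customers)

    def route(self, customer_id):
        """Send designated customers to canaries, everyone else to the rest."""
        if customer_id in self._canary_customers:
            return next(self._canaries)
        return next(self._regular)
```

Rolling back the canaries then amounts to emptying the canary customer set (or replacing the canary instances), with no change visible to other customers.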

A/B testing

• Suppose you wish to test user response to a system variant. E.g. UI difference or marketing effort. A is one variant and B is the other.

• You simultaneously make available both variants to different audiences and compare the responses.

• Implementation is the same as canary testing.

Rollback

• New versions of a service may be unacceptable either for logical or performance reasons.

• Two options in this case

– Roll back (undo deployment)

– Roll forward (discontinue current deployment and create a new release without the problem).

• Decision to roll back or roll forward is almost never automated because there are multiple factors to consider.

– Forward or backward recovery

– Consequences and severity of problem

– Importance of upgrade

States of upgrade.

• An upgrade can be in one of two states when an error is detected.

– Installed (fully or partially) but new features not activated

– Installed and new features activated.

Possibilities

• For this slide assume persistent data is correct.

• Installed but new features not activated

– Error must be in backward compatibility

– Halt deployment

– Roll back by reinstalling old version

– Roll forward by creating new version and installing that

• Installed with new features activated

– Turn off new features

– If that is insufficient, we are at prior case.

Persistent data may be incorrect

• Keep log of user requests (each with their own identification)

• Identification of incorrect persistent data

– Tag each data item with metadata that provides the service and version that wrote that data and the user request that caused the data to be written

• Correction of incorrect persistent data (simplistic version)

– Remove data written by incorrect version of a service

– Install correct version

– Replay user requests that caused incorrect data to be written
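The simplistic correction procedure can be sketched with an in-memory dictionary standing in for the persistent store. Everything here is an assumption for illustration: the metadata layout, the `(request_id, payload)` log format, and the `handler` callable that represents the corrected version of the service.

```python
def write(store, key, value, service, version, request_id):
    """Write a data item tagged with who wrote it and which request caused it."""
    store[key] = {"value": value, "service": service,
                  "version": version, "request_id": request_id}


def repair(store, request_log, service, bad_version, handler):
    # 1. Remove data written by the incorrect version, remembering the requests.
    bad = {key: item["request_id"] for key, item in store.items()
           if item["service"] == service and item["version"] == bad_version}
    for key in bad:
        del store[key]
    # 2. Installing the correct version is represented by `handler`.
    # 3. Replay the affected requests in their original log order.
    affected = set(bad.values())
    for request_id, payload in request_log:
        if request_id in affected:
            handler(store, request_id, payload)
```

The sketch ignores the complications discussed next, such as data derived from the incorrect items or data that has already left the system.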

Persistent data correction problems

I will not present good solutions to these problems.

1. Replaying user requests may involve requesting features that are not in the current version.

– Requests can be queued until they can be correctly re-executed

– User can be informed of error (after the fact)

2. There may be domino effects from incorrect data, i.e. other calculations may be affected.

– Keep pedigree for data items that allows determining which additional data items are incorrect. Remove them and regenerate them when requests replayed.

– Data that escaped the system, e.g. sent to other system or shown to a user, cannot be retrieved.

Summary of rollback options

• Can roll back or roll forward

• Rolling back without consideration of persistent data is relatively straightforward.

• Managing erroneous persistent data is complicated and will likely require manual processing.

Packaging of services

• The last portion of the deployment pipeline is packaging services into machine images for installation.

• Two dimensions

– Flat vs deep service hierarchy

– One service per virtual machine vs many services per virtual machine

Flat vs Deep Service Hierarchy

• Trading off independence of teams and possibilities for reuse.

• Flat Service Hierarchy

– Limited dependence among services & limited coordination needed among teams

– Difficult to reuse services

• Deep Service Hierarchy

– Provides possibility for reusing services

– Requires coordination among teams to discover reuse possibilities. This can be done during architecture definition.

Services per VM Image

[Figure: two packaging options. One service per VM: a service is developed and embedded into its own VM image. Multiple services per VM: Service1 and Service2 are each developed and embedded into the same VM image.]

One Possible Race Condition with Multiple Services per VM

Initial State: VM image with Version N of Service 1 and Version N of Service 2

Over time:

• Developer 1: build new image with VN+1|VN, then begin provisioning process with new image

• Developer 2: build new image with VN|VN+1, then begin provisioning process with new image without new version of Service 1

Results in Version N+1 of Service 1 not being updated until next build of VM image.

Could be prevented by VM image build tool.
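One way a VM image build tool might prevent this race is to refuse builds that start from a stale base image. The sketch below assumes a single shared manifest with a revision counter; the `ImageRegistry` class and its behavior are hypothetical and do not describe any particular build tool.

```python
class ImageRegistry:
    def __init__(self, manifest):
        self.manifest = dict(manifest)  # service -> version in the latest image
        self.revision = 0               # bumped on every successful build

    def build(self, base_revision, service, version):
        """Build a new image only if the developer started from the latest one."""
        if base_revision != self.revision:
            raise RuntimeError("stale base image; rebuild from the latest manifest")
        self.manifest[service] = version
        self.revision += 1
        return self.revision
```

If both developers start from revision 0, the first build succeeds and the second is rejected, so Developer 2 has to rebuild on top of the image that already contains the new version of Service 1.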

Another Possible Race Condition with Multiple Services per VM

Initial State: VM image with Version N of Service 1 and Version N of Service 2

Over time:

• Developer 1: build new image with VN+1|VN, then begin provisioning process with new image (overwrites image created by Developer 2)

• Developer 2: build new image with VN+1|VN+1, then begin provisioning process with new image

Results in Version N+1 of Service 2 not being updated until next build of VM image.

Could be prevented by provisioning tool.

Trade offs

• One service per VM

– Message from one service to another must go through inter VM communication mechanism – adds latency

– No possibility of race condition

• Multiple Services per VM

– Inter VM communication requirements reduced – reduces latency

– Adds possibility of race condition caused by simultaneous deployment

Summary of Deployment

• Rolling upgrade is a common deployment strategy

• Introduces requirements for consistency among

– Different versions of the same service

– Different services

– Services and persistent data

• Other deployment considerations include

– Canary deployment

– A/B testing

– Rollback

– Business continuity

Architecting for the Cloud

Case Study

Overview

• What is Netflix

• Migrating to the cloud – Data

– Build process

– Testing process

• Withstanding Amazon outage of April 21, 2011.

Netflix Corporation

Launched in 1998 after founder was irritated at having to pay late fees on a DVD rental.

DVD Model

• Pay monthly membership fee that includes rentals, shipping and no late fees

• Maintain online queue of desired rentals

• When you return your last rental (depending on service plan), the next item in your queue is mailed to you together with a return envelope.

• Customers rate movies and Netflix recommends based on your preferences

Unusual culture

• Unlimited vacation time for salaried workers

• Workers can take any percentage of their salary in stock options (up to 100%).

Streaming video - 1

Customers can watch Netflix streaming video on a wide variety of devices many of which feed into a TV

– Roku set top box

– Blu-ray disk players

– Xbox 360

– TV directly

– PlayStation 3

– Wii

– DVRs

– Nintendo

Customers can stop and restart video at will. Netflix calls these locations in the films “bookmarks”.

Streaming video - 2

In 2007 Netflix began to offer streaming video to its subscribers.

Initially, one hour of streaming video was available to customers for every dollar they spent on their plan

In Jan, 2008, every customer was entitled to unlimited streaming video.

In May 2011, Netflix streaming video accounted for 22% of all internet traffic, and 30% of traffic during peak usage hours.

Three bandwidth tiers

• Continuous bandwidth to the client of 5 Mbit/s. HDTV, surround sound

• Continuous bandwidth to the client of 3 Mbit/s – better than DVD

• Continuous bandwidth to the client of 1.5 Mbit/s – DVD quality

Netflix’s Growth

In late 2008, Netflix had a single data center with Oracle as the main database system.

With the growth of subscriptions and streaming video, it was clear that they would soon outgrow the data center.

Not Just More Requests

• Netflix was expanding internationally

• They had unpredictable usage spikes with product releases

– E.g. Wii, and Xbox

– Need infrastructure to handle usage spikes

• Datacenters are expensive

– Netflix was running Oracle on IBM hardware (not commodity hardware)

– Could switch to commodity hardware but would have to have in-house expertise

• Can’t hire enough system and database administrators to grow fast enough

Netflix Moves to Cloud

Four reasons cited by Netflix for moving to the cloud

1. Every layer of the software stack needed to scale horizontally, be more reliable, redundant, and fault tolerant. This leads to reason #2

2. Outsourcing data center infrastructure to Amazon allowed Netflix engineers to focus on building and improving their business.

3. Netflix is not very good at predicting customer growth or device engagement. They underestimated their growth rate. The cloud supports rapid scaling.

4. Cloud computing is the future. This will help Netflix with recruiting engineers who are interested in honing their skills, and will help scale the business. It will also ensure competition among cloud providers helping to keep costs down.

Why Amazon and EC2? In 2008, Amazon was the leading supplier. Netflix wanted an IaaS so they could focus on their core competencies.

Netflix applications

What applications does Netflix have?

• Video ratings, reviews, and recommendations

• Video streaming

• User registration, log-in

• Video queues

• Billing

• DVD disc management – inventory and shipping

• Video metadata management – movie cast information

Strategy was to move in a phased manner.

• A subset of applications would move in a particular phase

• Means that data may have to be replicated between cloud and data center during the move.

Amazon’s S3 and SimpleDB

SimpleDB is Amazon’s non-relational data store

– Optimized for data access speed

– Indexes data for fast retrieval

– Physical drives are less dense to support this

Amazon’s Simple Storage Service (S3)

– Stores raw data

– Optimized for storing larger data sets inexpensively

Choosing Cloud-Based Data Store

SimpleDB and S3 both provide

• Disaster recovery

• Managed fail over and fail back

• Distribution across availability zones

• Persistence

• Eventual consistency

• SimpleDB has a query language

Cloud-Based Data Store II

SimpleDB and S3 do not provide:

• Transactions

• Locks

• Sequences

• Triggers

• Clocks

• Constraints

• Joins

• Schemas and associated integrity checking

• Native data types. All data in SimpleDB is string data

Migrating Data from Oracle to SimpleDB

Whole data sets were moved. E.g. instant watch bookmarks were lifted wholesale into SimpleDB

Incremental data changes were kept in synch (bi-directional) between Oracle and SimpleDB. Device support moved from Oracle to the cloud. During that transition, a user may begin watching a movie on a device supported in the cloud and continue watching on a device supported in Oracle. Instant watch bookmarks are kept in synch for these types of possibilities.

Once the need for the Oracle version is removed, the synchronization is discontinued and the data no longer resides in the Oracle database.

Using SimpleDB and S3

Move normal DB functionality up the stack to the application.

Specifically:

• Do “Group By” and “Join” in application

• De-normalize tables to reduce necessity for joins

• Do without triggers

• Manage concurrency using time stamps and conditional Put and Delete

• Do without clock operations

• Application checks constraints on read and repairs data as a side effect
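Managing concurrency with time stamps and a conditional put can be sketched against an in-memory stand-in. `TinyStore` below is hypothetical and is not the SimpleDB client API; it only mimics the idea that a write succeeds solely when the stored time stamp still matches the one the writer last read.

```python
class ConditionFailed(Exception):
    """Raised when the item changed since the caller last read it."""


class TinyStore:
    def __init__(self):
        self._items = {}

    def get(self, key):
        return dict(self._items.get(key, {}))

    def conditional_put(self, key, attrs, expected_stamp):
        # Succeed only if the stored time stamp equals the expected one
        # (None means "the item must not exist yet").
        current = self._items.get(key, {}).get("stamp")
        if current != expected_stamp:
            raise ConditionFailed(key)
        self._items[key] = dict(attrs)
```

Two writers that read the same time stamp cannot both win: the second `conditional_put` raises `ConditionFailed`, and the application re-reads the item and retries, which replaces the locks and transactions a relational database would have provided.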

Exploiting SimpleDB

All datasets with a significant write load are partitioned across multiple SimpleDB domains

Since all data is stored as strings:

• Store all date-times in ISO 8601 format

• Zero pad any numeric columns used in sorts or WHERE clause inequalities

Atomic writes requiring both a delete of an attribute and an update of another attribute in the same item are accomplished through deleting and rewriting the entire item.

SimpleDB is case sensitive to attribute names. Netflix implemented a common data access layer that normalizes all attribute names (e.g. TO_UPPER)

Misspelled attribute names can fail silently so Netflix has a schema validator in the common data access layer

Deal with eventual consistency by avoiding reads immediately after writes. If this is not possible, use ConsistentRead.
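The string-encoding rules above matter because strings compare character by character, so "9" sorts after "10" unless numbers are zero padded. The helpers below are a sketch of what a common data access layer might do; the function names and the padding width are assumptions, not Netflix's code.

```python
from datetime import datetime, timezone


def encode_int(value, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("negative values need a separate offset scheme")
    return str(value).zfill(width)


def encode_time(dt):
    """Render a timezone-aware datetime as a sortable ISO 8601 UTC string."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def normalize_attrs(attrs):
    """Upper-case attribute names so lookups do not depend on letter case."""
    return {name.upper(): value for name, value in attrs.items()}
```

With these helpers `encode_int(9) < encode_int(10)` holds as a string comparison, and ISO 8601 time stamps in a single time zone sort chronologically.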

Netflix Build Process - Ivy

Ant is a tool for automating the software build process (like Make).

Ivy is an extension to Ant for managing (recording, tracking, resolving and reporting) project dependencies.

Netflix Build Process - Artifactory

Artifactory is a repository manager.

• Artifactory acts as a proxy between your build tool (Maven, Ant, Ivy, Gradle etc.) and the outside world.

• It caches remote artifacts so that you don’t have to download them over and over again.

• It blocks unwanted (and sometimes security-sensitive) external requests for internal artifacts and controls how and where artifacts are deployed, and by whom.

Netflix Build Process - Jenkins

Jenkins is an open source continuous integration tool.

Jenkins provides a so-called continuous integration system, making it easier for developers to integrate changes to the project, and making it easier for users to obtain a fresh build.

Jenkins monitors executions of externally-run jobs, such as cron jobs and procmail jobs, even those that are run on a remote machine. For example, with cron, Jenkins keeps e-mail outputs in a repository.

Migration Lessons

• The change from reliable to commodity hardware impacts many things

– E.g. in the old world memory based session state was fine

• Hardware failures were rare

– Not appropriate on the cloud

• Co-tenancy is hard

– In the cloud all resources are shared e.g.

• Hardware, network, storage, …

– This introduces variable performance depending on load

• At all levels of the stack

– Have to be willing to abandon sub-task or manage resources to avoid co-tenancy

• The best way to avoid failure is to fail constantly

– Netflix has designed their systems to be tolerant to failure

– They test this capability regularly

Unleash the Monkey

• As a result of disruption of services Netflix adopted a process of testing the system

• They learned that the best way to test the system was at scale

• In order to learn how their systems handle failure they introduced faults

– Along came the “Chaos Monkey”

• The Chaos Monkey randomly disables production instances

• This is done in a tightly controlled environment with engineers standing by

– But it is done in the production environment

• The engineers then learn how their systems handle failures and can improve the solution accordingly
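The idea of randomly disabling instances and checking whether the service survives can be illustrated with a toy simulation. This is not Netflix's Chaos Monkey; the function names, the kill fraction, and the availability rule are assumptions made purely for illustration.

```python
import random


def chaos_round(instances, rng, kill_fraction=0.1):
    """Disable a random subset of instances; return (survivors, victims)."""
    count = max(1, int(len(instances) * kill_fraction))
    victims = set(rng.sample(sorted(instances), count))
    survivors = [i for i in instances if i not in victims]
    return survivors, victims


def service_available(survivors, min_healthy):
    """Toy health rule: the service is up if enough instances remain."""
    return len(survivors) >= min_healthy
```

Running repeated rounds against a pool and asserting `service_available` after each one is a small-scale analogue of learning how a system behaves when instances disappear.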

Increase the Chaos

• The Chaos Monkey was so successful Netflix created a “Simian Army” to test the system

• The “Army” includes:

Netflix Test Suite - 1

– Chaos monkey. Randomly kill a process and monitor the effect.

– Latency monkey. Randomly introduce latency and monitor the effect.

– Doctor monkey. The Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances.

– Janitor Monkey. The Janitor Monkey ensures that the Netflix cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.

Netflix Test Suite - 2

• Conformity Monkey. The Conformity Monkey finds instances that don’t adhere to best-practices and shuts them down. For example, if an instance does not belong to an auto-scaling group, that is a potential problem.

• Security Monkey. The Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all SSL and DRM certificates are valid and are not coming up for renewal.

• 10-18 Monkey. The 10-18 Monkey (Localization-Internationalization) detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets. The name 10-18 comes from L10n and I18n, where the numbers count the letters omitted between the first and last letters of the words localization and internationalization.

• Chaos Gorilla: The Gorilla is similar to the Chaos Monkey but simulates the outage of an entire Amazon availability zone

April 21, 2011 Outage – Background

An EBS cluster is comprised of a set of EBS nodes.

When Node A loses connectivity to a node (Node B) to which it is replicating data, Node A assumes Node B has failed.

In this case, it must find another node to which data is replicated. This is called re-mirroring.

When data is being re-mirrored, all nodes that have copies of that data retain that data and block all external access to that data. This is called the node being “stuck”.

In the normal case, a node is stuck for only a few milliseconds.

April 21, 2011 Outage - Events

At 12:47 AM PDT, a configuration change was made to upgrade the capacity of the primary EBS network within one availability zone in the Eastern Region.

• Standard practice is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen

• Traffic was shifted incorrectly to the lower capacity backup network.

Consequently, there was no functioning primary or secondary network in the affected portion of the availability zone since traffic was shifted off of the primary and the secondary became quickly overloaded.

When the error was detected and connectivity restored, a large number of nodes that had been isolated began searching for other nodes to which they could re-mirror their data.

This caused a re-mirroring storm where all of the available space in the affected cluster was used up and the attempts to gain access to another availability zone overwhelmed the primary communication network.

This caused the whole region to be unavailable.

Page 69: Architecture for the cloud deployment case study future

April 21, 2011 Outage – Consequences

At 12:04 PM PDT, the outage was contained to the one affected availability zone. 13% of the nodes in the availability zone remained “stuck.”

Additional capacity was added to the availability zone to allow those nodes to become unstuck.

In the meantime, some of the stuck volumes experienced hardware failures. Hardware failures are a normal occurrence within a cluster.

Ultimately, the data on 0.07% of the nodes in the affected availability zone could not be recovered and this data was permanently lost.

Page 70: Architecture for the cloud deployment case study future

Netflix and the Amazon Outage of April, 2011

On April 21, 2011, one of the availability zones in Amazon’s Eastern Region failed.

– Many organizations were unavailable for the duration of the outage (24 hours)

– Some organizations permanently lost data

Netflix was largely unaffected

– Customers experienced no disruption of service

– There were some increased latencies

– Higher than normal error rate

– No interruption in customers’ ability to find and watch movies

Page 71: Architecture for the cloud deployment case study future

How Did Netflix Accomplish This?

Stateless services

• Services are largely stateless

• If a server fails the request can be re-routed

• Remember the discussion of modular redundancy?

– The lack of state means failover times are negligible

Data stored across availability zones

• In some cases it was not practical to re-architect the system to be stateless

• In these cases there are multiple hot copies of the data across availability zones

• In the event of a failure, the failover time is again negligible

– What is the cost of doing this?
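The re-routing of requests between stateless servers can be sketched as follows. This is a minimal illustration, assuming each request carries everything needed to serve it, so that any server can answer; the server names and the `fake_call` function are hypothetical stand-ins for real network calls, not Netflix's code.

```python
from typing import Callable, Sequence


def send_with_failover(
    request: str,
    servers: Sequence[str],
    call_server: Callable[[str, str], str],
) -> str:
    """Try each server in turn; with no session state, any one can answer."""
    for server in servers:
        try:
            return call_server(server, request)
        except ConnectionError:
            continue  # this server is down: re-route to the next one
    raise RuntimeError("no server available")


DOWN = {"zone-a-1"}  # hypothetical failed server


def fake_call(server: str, request: str) -> str:
    """Stand-in for a network call; fails for servers listed in DOWN."""
    if server in DOWN:
        raise ConnectionError(server)
    return f"{server} handled {request}"


print(send_with_failover("GET /titles", ["zone-a-1", "zone-b-1"], fake_call))
```

Because nothing has to be copied or restored before the second server answers, the failover cost is one failed attempt. Stateful data needs the hot copies described above instead.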

Page 72: Architecture for the cloud deployment case study future

Design Decisions II

Graceful degradation: The Netflix systems are designed for failure. A lot of thought went into what to do when they fail.

• Fail fast – aggressive time outs so that failing components do not slow the whole system down

• Fallbacks – every feature has the ability to fall back to a lower quality representation. E.g. if Netflix cannot generate personalized recommendations, they will fall back to non-personalized results

• Feature removal. If a feature is non-critical and is slow, then it may be removed from any given page.
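The fail-fast and fallback ideas above can be combined in a few lines. This is a sketch under stated assumptions, not Netflix's implementation: `personalized`, `POPULAR`, and the timeout values are hypothetical, and the slow service is simulated with `time.sleep`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import List

_pool = ThreadPoolExecutor(max_workers=4)

# Non-personalized results used as the lower-quality fallback (hypothetical data).
POPULAR: List[str] = ["top-1", "top-2"]


def personalized(user_id: str, delay_s: float) -> List[str]:
    """Stand-in for a recommendation service; delay_s simulates its latency."""
    time.sleep(delay_s)
    return [f"pick-for-{user_id}"]


def recommendations(user_id: str, delay_s: float, timeout_s: float = 0.2) -> List[str]:
    """Fail fast: wait at most timeout_s, then fall back to POPULAR."""
    future = _pool.submit(personalized, user_id, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return POPULAR


print(recommendations("u1", delay_s=0.0))  # service answers in time
print(recommendations("u1", delay_s=1.0))  # service too slow: fallback is used
```

The aggressive timeout bounds how long a failing component can hold up the page. The abandoned call still runs to completion in the background in this sketch; a production system would also cancel or shed that work.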

N+1 redundancy

• The system is architected with more capacity than it needs at any time

• This allows them to cope with large spikes in load caused by users directly or the ripple effect of transient failures

Page 73: Architecture for the cloud deployment case study future

What Issues Were Experienced?

Netflix did have some issues, however, including:

• As the availability zone started to fail, Netflix decided to pull out altogether

• This required manual configuration changes to AWS

• Each service team was involved in moving their service out of the zone

• This was a time-consuming and error-prone process

Load balancers had to be manually adjusted to keep traffic out of the affected availability zone.

• Netflix uses Amazon’s Elastic Load Balancing (ELB)

• This first balances load to availability zones and then instances

• If many instances go down, the others in that zone have to pick up the slack

• If you can’t bring up more nodes, you will experience a cascading effect until all nodes go down
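The cascading effect can be illustrated with a toy model. The assumptions are mine, not from the slides: the load sent to a zone stays fixed (because load is balanced to zones first), it is split evenly across healthy instances, and an instance fails as soon as its share exceeds its capacity. All numbers are illustrative.

```python
def survivors_after_cascade(
    zone_load: float, healthy: int, capacity_per_instance: float
) -> int:
    """Return how many instances remain once the zone's load has settled."""
    while healthy > 0 and zone_load / healthy > capacity_per_instance:
        # An overloaded instance fails and its share moves to the others,
        # which raises the load on each remaining instance.
        healthy -= 1
    return healthy


# 900 requests/s into one zone, each instance can serve 100 requests/s.
print(survivors_after_cascade(900, healthy=10, capacity_per_instance=100))
print(survivors_after_cascade(900, healthy=8, capacity_per_instance=100))
```

In this model, once the per-instance share crosses the capacity limit there is no stable point: every failure increases the share on the remaining instances, so the zone drains to zero unless traffic is moved away or new nodes are brought up.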

Page 74: Architecture for the cloud deployment case study future

Amazon Outage of Aug 8, 2011

On Aug 8, 2011, the Eastern Region of AWS went down for about half an hour due to connectivity issues that affected all availability zones …

Netflix was down as well…

Page 75: Architecture for the cloud deployment case study future

Summary

Netflix moved to the cloud because

• They doubted their ability to correctly predict the load

• They did not want to invest in data centers directly

Netflix chose SimpleDB as their original target database system and their migration strategy involved

• Running Oracle and SimpleDB concurrently for a period

• Moving some features provided by Oracle up the application stack

Netflix has a number of test programs that inject faults or look for specific types of violations.

Netflix re-architected their build process to support sophisticated deployment practices and to allow for continuous integration.

Netflix did a number of things to promote availability and these enabled Netflix to continue customer service through an extensive AWS outage.

Page 76: Architecture for the cloud deployment case study future

Reference

http://techblog.netflix.com/

Page 77: Architecture for the cloud deployment case study future

Questions??

Page 78: Architecture for the cloud deployment case study future

Architecting for the Cloud

Future Trends

Page 79: Architecture for the cloud deployment case study future

Topics

• Containers

• Augmented reality

• Cloudlets

• Internet of things – cars, appliances, etc.

Page 80: Architecture for the cloud deployment case study future

Containers

• Loading full VM at deployment time

– Takes time for large machine images

– Results in many slightly different VM images for different versions

– Depends on particular processor even if virtualized.

• Containers are a proposed solution

– Machine image consists of OS + libraries

– Application portion runs inside of this machine image and is loaded by the machine image once it is instantiated.

Page 81: Architecture for the cloud deployment case study future

What is a Container

Page 82: Architecture for the cloud deployment case study future

Containers vs VMs

Page 83: Architecture for the cloud deployment case study future

Virtues of Containers

Page 84: Architecture for the cloud deployment case study future

Adoption of Containers

• Containers are an old concept

• Docker is a software system that packages the creation of containers

• Cloud providers are adopting containers since they control the entire stack and can choose OSs.

Page 85: Architecture for the cloud deployment case study future

Topics

• Containers

• Augmented reality

• Cloudlets

• Internet of things – cars, appliances, etc.

Page 86: Architecture for the cloud deployment case study future

Augmented Reality

• Superimpose computer generated images over visual images

Page 87: Architecture for the cloud deployment case study future

Simple text has been available for a long time

• CMU wearable computer group early 1990s

• Google Glass modern display device

Page 88: Architecture for the cloud deployment case study future

Computer power needed for more sophisticated images

• Route planning – showing what is ahead if you take a particular route

• Visualizing furniture in a room

• These applications require

– Location determination

– Orientation determination

• This computer power is not available on mobile devices.

• This is a lead in to “cloudlets”

Page 89: Architecture for the cloud deployment case study future

Cloudlets

• Mobile devices do not have the CPU power necessary for some applications.

• Consequently, they act as data sources for apps in the cloud.

– This introduces latency in responding to a user request

– This latency does not matter for some purposes – e.g. airline reservations, music streaming

– But – it does matter for others, e.g. voice recognition, translation, augmented reality.

Page 90: Architecture for the cloud deployment case study future

How does one match computation power and data with low latency?

• Move computation power closer to the data.

• But, but, but

– Data is inherently mobile (with some latency)

– Computation is tied to a server. Real computation power is not so mobile.

– I am in an automobile and my mobile is constantly moving and getting further away from any fixed computation source

Page 91: Architecture for the cloud deployment case study future

Anyway, won’t my mobile have sufficient computational power?

• Mobiles will always inherently be less powerful relative to static client and server hardware

• Mobiles have to trade off size, weight, battery life, and storage with computational power

• Computational power is always going to be of lower priority than these other qualities.

Page 92: Architecture for the cloud deployment case study future

Cloudlet infrastructure

• A cloudlet infrastructure is

– A collection of servers with more computational power than my mobile

– Connected to the internet

– Connected to a power source

– One data hop away from my mobile

• Where would these servers live?

– Doctor’s offices

– Coffee shops

– Think wireless access points

• A new client would instantiate a VM with the desired app in a local cloudlet server.

Page 93: Architecture for the cloud deployment case study future

How would it work?

• Three models

1. Download application VM from cloud into cloudlet server on demand.

– Time consuming

– Allows for bare cloudlet server

– Allows arbitrary applications

2. Preload most popular application VMs

– Fast

– Limited in terms of number of applications available

3. Use containers to preload most popular platforms and download remaining software of an app.

– Intermediate in time required to initiate

– Intermediate in flexibility of apps.

Page 94: Architecture for the cloud deployment case study future

Revisit AR application

• Consider what is involved to show what is ahead based on head location.

– Location

– Orientation

• Cloudlets have potential to supply necessary infrastructure.

Page 95: Architecture for the cloud deployment case study future

Summary of cloudlets

• Locally available servers with higher computational power than mobiles

• Enables real time application of some technologies such as augmented reality or real time translation

• Based on creating VM in local server with necessary app

• Feasible with current technology

Page 96: Architecture for the cloud deployment case study future

Topics

• Containers

• Augmented reality

• Cloudlets

• Internet of things – cars, appliances, etc.

Page 97: Architecture for the cloud deployment case study future

Internet of Things

• “Everything” is connected to the internet

– Toasters

– Refrigerators

– Automobiles

– …

• Every device has both data generation and control aspects.

• From a cloud perspective, data generation is important.

Page 98: Architecture for the cloud deployment case study future

Data available from a device

• Current state

• Current activity

• Current location

• Current power usage

• Information about the environment of the device

• …

Page 99: Architecture for the cloud deployment case study future

How frequently is the data sampled?

• Suppose each device is sampled once a minute

• Suppose each sample contains 1MB

• The number of devices being sampled may be in the 100,000s or 1,000,000s (think automobiles or engines, or tires, or …)

• Petabytes of data streaming into a data center.

• “Big Data” is the buzz term used for this level of data.
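The arithmetic behind these figures can be worked through directly. The sketch below uses the slide's numbers (one 1 MB sample per device per minute) and assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes); the function name is illustrative.

```python
MB = 10**6   # bytes, decimal units assumed
TB = 10**12
PB = 10**15


def daily_volume_bytes(devices: int, sample_bytes: int = MB,
                       samples_per_minute: int = 1) -> int:
    """Bytes arriving at the data center per day for a fleet of devices."""
    minutes_per_day = 60 * 24
    return devices * sample_bytes * samples_per_minute * minutes_per_day


for devices in (100_000, 1_000_000):
    volume = daily_volume_bytes(devices)
    print(f"{devices:>9,} devices: {volume / TB:,.0f} TB/day = {volume / PB:.2f} PB/day")
```

Under these assumptions, 100,000 devices produce 144 TB per day and 1,000,000 devices produce 1.44 PB per day, so the petabyte range is reached within a day at the upper end and within about a week at the lower end.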

Page 100: Architecture for the cloud deployment case study future

Leads to streaming DBMSs

• SQLstream

• STREAM [1]

• AURORA [2], StreamBase Systems, Inc.

• TelegraphCQ [3]

• NiagaraCQ [4]

• QStream

• PIPES, webMethods Business Events

• StreamGlobe

• Odysseus

• StreamInsight

• InfoSphere Streams

• Kinesis

Page 101: Architecture for the cloud deployment case study future

Requirements for streaming DBMS

Database management system (DBMS) | Data stream management system (DSMS)
Persistent data (relations) | Volatile data streams
Random access | Sequential access
One-time queries | Continuous queries
(Theoretically) unlimited secondary storage | Limited main memory
Only the current state is relevant | Consideration of the order of the input
Relatively low update rate | Potentially extremely high update rate
Little or no time requirements | Real-time requirements
Assumes exact data | Assumes outdated/inaccurate data
Plannable query processing | Variable data arrival and data characteristics
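The difference between one-time and continuous queries can be made concrete with a small sketch. It is illustrative only: the function names and the threshold filter are assumptions, and a Python generator stands in for a real stream engine.

```python
from typing import Iterable, Iterator, List


def one_time_query(stored: List[float], threshold: float) -> List[float]:
    """DBMS style: run once over persistent data, return the full answer."""
    return [reading for reading in stored if reading > threshold]


def continuous_query(stream: Iterable[float], threshold: float) -> Iterator[float]:
    """DSMS style: stay registered and emit a result as each element arrives."""
    for reading in stream:  # sequential access; elements are seen once, in order
        if reading > threshold:
            yield reading


readings = [1.0, 5.0, 3.0, 9.0]
print(one_time_query(readings, 4.0))
for hit in continuous_query(iter(readings), 4.0):
    print("alert:", hit)
```

The continuous version never holds the whole input in memory and produces output before the stream ends, which matches the limited-main-memory and real-time rows in the comparison.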

Page 102: Architecture for the cloud deployment case study future

Managing the data

• Synopses

– Select sample of raw data to support a particular analysis.

– Moving average of data

– May be inaccurate when performing analysis

• Windows

– Only look at a portion of the data

– Last 10 elements

– Last 10 seconds

• Choice of approach for managing the data is driven by the types of analyses that the data is used for.
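The two window types listed above (last 10 elements, last 10 seconds) and a moving-average synopsis can be sketched as follows. This is a minimal illustration with hypothetical class names; it assumes elements arrive in timestamp order.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class CountWindow:
    """Keeps only the last `size` elements of the stream."""

    def __init__(self, size: int) -> None:
        self._items: Deque[float] = deque(maxlen=size)

    def add(self, value: float) -> None:
        self._items.append(value)  # the oldest element is dropped automatically

    def moving_average(self) -> Optional[float]:
        """A simple synopsis: the average over the current window."""
        if not self._items:
            return None
        return sum(self._items) / len(self._items)


class TimeWindow:
    """Keeps only elements whose timestamp lies in the last `span_s` seconds."""

    def __init__(self, span_s: float) -> None:
        self._span_s = span_s
        self._items: Deque[Tuple[float, float]] = deque()

    def add(self, timestamp: float, value: float) -> None:
        self._items.append((timestamp, value))
        cutoff = timestamp - self._span_s
        # Evict everything at or before the cutoff (assumes ordered arrival).
        while self._items and self._items[0][0] <= cutoff:
            self._items.popleft()

    def values(self) -> List[float]:
        return [value for _, value in self._items]


last_ten = CountWindow(10)
for v in range(1, 16):          # stream of 1..15; the window holds 6..15
    last_ten.add(float(v))
print(last_ten.moving_average())

last_ten_seconds = TimeWindow(10.0)
for ts, v in [(0.0, 1.0), (5.0, 2.0), (12.0, 3.0)]:
    last_ten_seconds.add(ts, v)
print(last_ten_seconds.values())
```

Both structures use memory bounded by the window, not by the stream, at the price of answering only questions about recent data.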

Page 103: Architecture for the cloud deployment case study future

Demand is for ever faster processing

• Businesses want to use the data for real time reaction

• E.g. Amazon has “customers who look at this also looked at”. Suppose they use this information and your browsing history to offer you a deal on two items rather than have you decide individually to buy the two items.

• Just one example out of hundreds.

Page 104: Architecture for the cloud deployment case study future

Issues

• Managing the volume of data

• Security – how do you know data is correctly identified?

• Privacy – what kind of information sharing will consumers tolerate?

Page 105: Architecture for the cloud deployment case study future

Summary for Internet of Things

• If every device is connected to the internet, the data generated will be massive

• Specialized data management systems

• Specialized analysis systems.

• Problems include security and privacy