Amazon Dynamo


Transcript of Amazon Dynamo

Page 1: Amazon Dynamo

Amazon Dynamo

Ali S. Bilal (bilal@ut.ac.ir)

University of Tehran

Fall 2014

Page 2: Amazon Dynamo

What is this presentation all about?

We present the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.

2

Page 3: Amazon Dynamo

Motivation

Amazon runs a world-wide e-commerce platform that serves tens of millions of customers at peak times using tens of thousands of servers located in many data centers around the world.

There are strict operational requirements on Amazon’s platform in terms of performance, reliability and efficiency, and to support continuous growth the platform needs to be highly scalable.

3

Page 4: Amazon Dynamo

Motivation

Reliability is one of the most important requirements because even the slightest outage has significant financial consequences and impacts customer trust.

Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services.

4

Page 5: Amazon Dynamo

Motivation

In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers.

5

Page 6: Amazon Dynamo

Motivation

Dealing with failures in an infrastructure comprised of millions of components is our standard mode of operation; there are always a small but significant number of server and network components that are failing at any given time. As such, Amazon’s software systems need to be constructed in a manner that treats failure handling as the normal case without impacting availability or performance.

6

Page 7: Amazon Dynamo

Motivation

There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability.

7

Page 8: Amazon Dynamo

ASSUMPTION

Dynamo targets apps with weaker consistency requirements

non-hostile environment

commodity hardware

simple read/write operations on data items uniquely identified by a key

8

Page 9: Amazon Dynamo

ASSUMPTION

always-on key-value storage system

write: always succeeds (even in case of failures or network partitions)

read: returns whatever can be found (and may let apps deal with inconsistency)

eventual consistency. Why? +performance +availability

9

Page 10: Amazon Dynamo

ASSUMPTION

who resolves inconsistency?

data store: can apply a simple policy such as "last write wins"

app: may have a better understanding of how to resolve it

incremental scalability

symmetry: no node takes a special role

heterogeneity

10

Page 11: Amazon Dynamo

ASSUMPTION

always writable

security is not a concern (operated within a single administrative domain)

flat namespace (key-value), no relational

schema

ultimate goal is fast response (for both

read/write)

11

Page 12: Amazon Dynamo

Service Level Agreements (SLA)

An SLA is a negotiated contract where a client and a service agree on several system-related characteristics, which most prominently include the client’s expected request rate distribution for a particular API and the expected service latency under those conditions.

An example of a simple SLA is a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.

12

Page 13: Amazon Dynamo

Service Oriented Architecture

13

Page 14: Amazon Dynamo

System Interface

14

Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put().

The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context.

The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk.
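To make the interface concrete, here is a minimal Java sketch of the two operations and the opaque context they exchange. The names (KeyValueStore, GetResult) are illustrative placeholders, not Dynamo's actual API.

```java
import java.util.List;

// Illustrative sketch of Dynamo's two-operation interface
// (names are hypothetical, not the actual Amazon API).
public interface KeyValueStore {

    // Result of a get: one or more conflicting object versions plus an
    // opaque context (e.g. vector clock metadata) to pass back on put.
    final class GetResult {
        public final List<byte[]> versions;
        public final byte[] context;
        public GetResult(List<byte[]> versions, byte[] context) {
            this.versions = versions;
            this.context = context;
        }
    }

    // Locate the replicas for the key and return all causally unrelated
    // versions together with their context.
    GetResult get(byte[] key);

    // Decide replica placement from the key and store the object; the
    // context returned by the preceding get carries version information.
    void put(byte[] key, byte[] context, byte[] object);
}
```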

Page 15: Amazon Dynamo

System Interface

15

Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes.

It applies an MD5 hash to the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving that key.

Page 16: Amazon Dynamo

Partitioning (consistent hashing)

16

To distribute load across multiple storage hosts, each node is assigned a number that represents its position on the ring.

A data item's key is hashed to yield its position on the ring; the ring is then walked clockwise to find the first node with a position larger than the item's position (sketched below).

Each node is responsible for the region of the ring between itself and its predecessor node on the ring.
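A minimal Java sketch of this clockwise lookup, assuming an MD5-hashed ring held in a sorted map; the class and node names are illustrative, not Dynamo's implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing sketch: MD5 maps keys and nodes onto a
// 128-bit ring; the first node clockwise from the key's position owns it.
public class ConsistentHashRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    static BigInteger md5(String value) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);   // unsigned 128-bit ring position
    }

    void addNode(String node) throws Exception {
        ring.put(md5(node), node);
    }

    // Walk clockwise: the first node with a position >= the key's position
    // is responsible; wrap around to the lowest position if none is larger.
    String nodeFor(String key) throws Exception {
        BigInteger pos = md5(key);
        SortedMap<BigInteger, String> tail = ring.tailMap(pos);
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    public static void main(String[] args) throws Exception {
        ConsistentHashRing r = new ConsistentHashRing();
        r.addNode("nodeA"); r.addNode("nodeB"); r.addNode("nodeC");
        System.out.println(r.nodeFor("shopping-cart-12345"));
    }
}
```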

Page 17: Amazon Dynamo

Partitioning (consistent hashing)

17

Page 18: Amazon Dynamo

Partitioning (consistent hashing)

18

assigning each node a single random position leads to non-uniform data and load distribution

the basic algorithm is also oblivious to heterogeneity in node performance

solution: virtual nodes. Each node is assigned to multiple positions on the ring

if a physical node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes

virtual nodes also make it easy to deal with heterogeneity: the number of virtual nodes per physical node can reflect its capacity

Page 19: Amazon Dynamo

Replication

19

for high availability and durability, each data item is stored at its coordinator node and replicated at the successor nodes on the ring, so that N nodes hold each item

with virtual nodes, the first N successor positions for a particular key may be owned by fewer than N distinct physical nodes; hence, when building this preference list, positions are skipped to make sure the list contains only distinct physical nodes (see the sketch below)
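A sketch of this skipping logic, assuming the ring is a sorted map from virtual-node position to physical node id; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Sketch of building a preference list of N distinct physical nodes when
// the ring contains virtual nodes. Names and types are illustrative.
public class PreferenceListBuilder {

    // ring: virtual-node position -> physical node id
    static List<String> preferenceList(TreeMap<Long, String> ring,
                                       long keyPosition, int n) {
        Set<String> distinct = new LinkedHashSet<>();
        // Start at the first position clockwise from the key...
        for (Map.Entry<Long, String> e : ring.tailMap(keyPosition).entrySet()) {
            if (distinct.size() == n) break;
            distinct.add(e.getValue());   // duplicates are skipped automatically
        }
        // ...and wrap around the ring if needed.
        for (Map.Entry<Long, String> e : ring.entrySet()) {
            if (distinct.size() == n) break;
            distinct.add(e.getValue());
        }
        return new ArrayList<>(distinct);
    }
}
```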

Page 20: Amazon Dynamo

Data Versioning

20

for high write availability, a write can return before the update has propagated to all replicas; hence, eventual consistency

but a subsequent get() may return data that is not the latest

Dynamo treats the result of each modification as a new and immutable version of the data

multiple versions of an object may exist in the system at the same time and are resolved later

Page 21: Amazon Dynamo

Data Versioning

21

Most of the time resolution is simple: the new version subsumes the previous ones.

In some scenarios we end up with conflicting versions; in this case the client performs semantic reconciliation: it collapses multiple branches of data evolution back into one, for example by "merging" different versions of a customer's shopping cart.

Page 22: Amazon Dynamo

Resolve Inconsistency

22

Dynamo uses vector clocks (a list of <node, counter> pairs) to capture causality between versions (see the sketch below).

At the system level: if V1 <= V2, then V2 is newer and V1 can be discarded.

Otherwise, V1 and V2 are parallel branches and need to be resolved at the application level (e.g., merged) at read time.

On a read, all causally unrelated versions of the data are presented to the client for semantic resolution.
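A minimal sketch of this causality check over <node, counter> pairs; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of vector-clock causality checks: a clock is a map of
// <node, counter> pairs. Illustrative only.
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // Increment this node's counter when it coordinates a write.
    void increment(String node) {
        counters.merge(node, 1L, Long::sum);
    }

    // true if every counter in this clock is <= the counter in 'other',
    // i.e. this version is an ancestor of 'other' and can be discarded.
    boolean subsumedBy(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) {
                return false;
            }
        }
        return true;
    }

    // Neither clock subsumes the other: parallel branches that the
    // application must reconcile (e.g. merge shopping carts).
    static boolean inConflict(VectorClock a, VectorClock b) {
        return !a.subsumedBy(b) && !b.subsumedBy(a);
    }
}
```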

Page 23: Amazon Dynamo

Resolve Inconsistency

23

Page 24: Amazon Dynamo

Resolve Inconsistency

24

The size of the vector clock can grow too large.

Solution: store a timestamp with each pair, indicating the last time that node updated the data item; when the clock needs to be truncated, remove the oldest pair.

This is just a quick fix and may lose causal relationships, making reconciliation harder, but the authors claim it has not been a problem in practice.

Page 25: Amazon Dynamo

Execution of put and get: 1st way

25

Directly contact a storage node: the Dynamo library is linked into the client code.

Lower latency because a potential forwarding step is skipped, but the client needs to refresh its membership information periodically (default: every 10 seconds).

Page 26: Amazon Dynamo

Execution of put and get: 2nd way

26

Go through a load balancer, which in turn picks a node.

The client does not have to link with Dynamo code, but requests may have higher latency.

Any storage node can receive get and put operations for any key. If that node is not in the key's preference list, it forwards the request to a node that is.

Page 27: Amazon Dynamo

Sloppy Quorum Consistency Protocol

27

read/write operations are performed on first

N healthy nodes from the preference list

R: # nodes that must participate in a

successful read operation

W: # nodes that must participate in a

successful write operation

Condition: R + W > N

Page 28: Amazon Dynamo

Sloppy Quorum Consistency Protocol

28

What happens on a put()?

Upon receiving a put() request, the coordinator:

~ generates the vector clock for the new version

~ writes the new version locally

~ sends the new version (along with the vector clock) to the N highest-ranked reachable nodes

~ the write succeeds if at least W-1 of those nodes return success (see the sketch below)
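A sketch of that write path under stated assumptions: the Replica interface stands in for Dynamo's internal messaging, and vector clock generation and the local write are elided.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the write path under the sloppy quorum: the coordinator writes
// locally, forwards to the other replicas in the preference list, and
// reports success once W-1 of them acknowledge.
public class PutCoordinator {

    interface Replica {
        void write(byte[] key, byte[] value, Runnable onAck);
    }

    // replicas: the other highest-ranked reachable nodes for this key
    // (besides the coordinator itself).
    static boolean put(byte[] key, byte[] value, List<Replica> replicas,
                       int w, long timeoutMs) throws InterruptedException {
        // (vector clock generation and the coordinator's local write omitted)
        CountDownLatch acks = new CountDownLatch(Math.max(0, w - 1));
        for (Replica r : replicas) {
            r.write(key, value, acks::countDown);
        }
        // The write succeeds if at least W-1 remote replicas respond in time.
        return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```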

Page 29: Amazon Dynamo

Sloppy Quorum Consistency Protocol

29

What happens on a get()?

When the coordinator receives a get() request:

~ the coordinator requests all existing versions of the data for that key from the N highest-ranked reachable nodes in the key's preference list

~ it then waits for R responses before returning the result to the client

~ if there are multiple versions of the data, it returns all versions that it deems causally unrelated

~ the client code reconciles divergent versions, and the reconciled version is written back

Page 30: Amazon Dynamo

Handling Temporary Failure

30

Dynamo uses hinted handoff: a replica sent to a substitute node carries a hint that remembers the temporarily unavailable real destination; hence, it can deal with temporary partitions or crashes.

Later, when the real destination comes back, the hint is used to copy the replica to that destination (sketched below).
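A sketch of how a substitute node might keep hinted replicas and hand them back; the hint-store structure and names are illustrative, not Dynamo's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of hinted handoff: a substitute node keeps the replica in a local
// hint store tagged with the intended ("real") destination, and hands it
// back when that node is reachable again.
public class HintedHandoffStore {

    static final class Hint {
        final byte[] key, value;
        Hint(byte[] key, byte[] value) { this.key = key; this.value = value; }
    }

    // intended destination node -> replicas held on its behalf
    private final Map<String, List<Hint>> hints = new HashMap<>();

    void storeWithHint(String intendedNode, byte[] key, byte[] value) {
        hints.computeIfAbsent(intendedNode, n -> new ArrayList<>())
             .add(new Hint(key, value));
    }

    // Called periodically; when the intended node is back, deliver and drop
    // the hinted replicas.
    void handOff(String intendedNode, BiConsumer<byte[], byte[]> deliver) {
        List<Hint> pending = hints.remove(intendedNode);
        if (pending == null) return;
        for (Hint h : pending) deliver.accept(h.key, h.value);
    }
}
```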

Page 31: Amazon Dynamo

Handling Temporary Failure

31

Page 32: Amazon Dynamo

Handling Temporary Failure

32

How about handling data center failures?

Solution: place replicas across data centers by constructing the preference list such that its storage nodes are spread across multiple data centers.

What is bad about hinted handoff?

It is not so good in the case of permanent failures: in the example above, what if D becomes unavailable before it can return the hinted replicas to A?

Page 33: Amazon Dynamo

Handling Permanent Failure

33

hinted handoff is not effective in the case of permanent failures

solution: anti-entropy (replica synchronization) using Merkle trees

a Merkle tree helps avoid comparing a lot of data that hasn't changed and minimizes the amount of data transferred

each node maintains a Merkle tree for each key range it hosts (see the sketch below)
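A simplified sketch of the idea: leaves hash the values stored for individual keys in a range, parents hash their children, and two replicas only exchange data under subtrees whose hashes differ. The structure below is illustrative, not Dynamo's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of anti-entropy with a Merkle tree over one key range.
public class MerkleTree {

    static byte[] sha1(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (byte[] p : parts) md.update(p);
        return md.digest();
    }

    // Build one level up: hash pairs of child hashes until a single root remains.
    static byte[] root(List<byte[]> leafHashes) throws Exception {
        List<byte[]> level = leafHashes;
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                next.add(sha1(left, right));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) throws Exception {
        // Two replicas of the same key range; one holds a stale value.
        List<byte[]> replicaA = new ArrayList<>();
        List<byte[]> replicaB = new ArrayList<>();
        for (String v : new String[]{"v1", "v2", "v3", "v4"}) {
            replicaA.add(sha1(v.getBytes(StandardCharsets.UTF_8)));
            replicaB.add(sha1(v.getBytes(StandardCharsets.UTF_8)));
        }
        replicaB.set(2, sha1("v3-stale".getBytes(StandardCharsets.UTF_8)));
        // Differing roots signal that the replicas must synchronize this range.
        System.out.println(Arrays.equals(root(replicaA), root(replicaB)));
    }
}
```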

Page 34: Amazon Dynamo

Handling Permanent Failure

34

Page 35: Amazon Dynamo

Membership and Failure detection

35

an administrator makes membership-change requests explicitly

nodes use a gossip-based protocol to learn about other nodes' key ranges and membership changes

every interval, each node contacts a random peer and the two exchange membership information (sketched below)

not only membership info but also partitioning and placement info is exchanged; hence, a node can route a request directly to a responsible node
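A minimal sketch of such a gossip exchange, assuming each membership entry carries a version number; in Dynamo the exchanged state also includes partitioning and placement information, and the random peer selection is omitted here.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a gossip-style membership exchange: each node keeps a version
// number per member and, on each round, reconciles its view with a chosen
// peer by keeping the higher version for every entry. Illustrative only.
public class MembershipView {

    // member node id -> version of the latest change seen for that member
    private final Map<String, Long> members = new HashMap<>();

    void recordChange(String node, long version) {
        members.merge(node, version, Math::max);
    }

    // Merge the peer's view into ours and ours into the peer's, so both
    // sides converge on the union with the newest version of every entry.
    void exchangeWith(MembershipView peer) {
        for (Map.Entry<String, Long> e : peer.members.entrySet()) {
            recordChange(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, Long> e : this.members.entrySet()) {
            peer.recordChange(e.getKey(), e.getValue());
        }
    }
}
```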

Page 36: Amazon Dynamo

External discovery

36

An admin contacts node A and joins node A to the ring, then contacts node B and joins node B to the ring.

Nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other.

Dynamo uses seeds to prevent such logical partitions.

Seeds are nodes known to all nodes and are discovered via an external mechanism.

Page 37: Amazon Dynamo

Failure Detection: using time out

37

If B does not respond to A's messages, then from A's point of view B has failed (even if B responds to C's messages).

If clients send many requests, A can detect B's failure quickly.

If there are no client requests, there is no need for A to detect whether B is reachable at all.

Page 38: Amazon Dynamo

Adding/Removing Storage Nodes

38

When we add a new node X to the ring, some nodes no longer have to store some of their keys, and these nodes transfer those keys to X.

By adding a confirmation round between the source and the destination, it is made sure that the destination node does not receive any duplicate transfers for a given key range.

Page 39: Amazon Dynamo

Adding/Removing Storage Nodes

39

Page 40: Amazon Dynamo

Implementation

In Dynamo, each storage node has three main software components:

Request coordination

Membership and failure detection

Local persistence engine

All these components are implemented in Java.

40

Page 41: Amazon Dynamo

Implementation

Dynamo’s local persistence component allows for different storage engines to be plugged in:

Berkeley Database (BDB) Transactional Data Store

BDB Java Edition

MySQL

in-memory buffer with persistent backing store

41

Page 42: Amazon Dynamo

Implementation

The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages, similar to the SEDA architecture.

All communications are implemented using Java NIO channels

The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies and packaging the response to the client.

42

Page 43: Amazon Dynamo

Implementation

For instance, a read operation implements the following state machine: (i) send read requests to the nodes, (ii) wait for minimum number of required responses, (iii) if too few replies were received within a given time bound, fail the request, (iv) otherwise gather all the data versions and determine the ones to be returned and (v) if versioning is enabled, perform syntactic reconciliation and generate an opaque write context that contains the vector clock that subsumes all the remaining versions
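A Java sketch of steps (i)-(iii) of that state machine, assuming a hypothetical ReplicaClient interface; reconciliation (steps iv-v) is only marked by a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Sketch of the read state machine described above: fan out the read, wait
// for the minimum number of responses within a time bound, fail otherwise,
// then hand the gathered versions to reconciliation.
public class ReadStateMachine {

    interface ReplicaClient {
        byte[] read(byte[] key) throws Exception;
    }

    static List<byte[]> read(byte[] key, List<ReplicaClient> replicas,
                             int r, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            CompletionService<byte[]> cs = new ExecutorCompletionService<>(pool);
            for (ReplicaClient rc : replicas) {
                cs.submit(() -> rc.read(key));              // (i) send read requests
            }
            List<byte[]> versions = new ArrayList<>();
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (versions.size() < r) {                   // (ii) wait for R responses
                long left = deadline - System.nanoTime();
                Future<byte[]> f = cs.poll(left, TimeUnit.NANOSECONDS);
                if (f == null) {                            // (iii) too few replies in time
                    throw new TimeoutException("fewer than R replicas responded");
                }
                versions.add(f.get());
            }
            return versions;  // (iv)/(v) syntactic reconciliation would happen here
        } finally {
            pool.shutdownNow();
        }
    }
}
```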

43

Page 44: Amazon Dynamo

Implementation

After the read response has been returned to the caller, the state machine waits for a small period of time to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it.

44

Page 45: Amazon Dynamo

Implementation

Write requests are coordinated by one of the top N nodes in the preference list.

This approach has led to uneven load distribution, resulting in SLA violations, because the request load is not uniformly distributed across objects.

In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation, thereby increasing the chances of getting "read-your-writes" consistency.

45

Page 46: Amazon Dynamo

Experiences and Lessons Learned

Dynamo can be used with different configurations:

Business-logic-specific reconciliation: the client application performs its own reconciliation logic.

For example: the shopping cart service (just merge everything).

46

Page 47: Amazon Dynamo

Experiences and Lessons Learned

Timestamp-based reconciliation:

"last write wins": the object with the largest physical timestamp value is chosen as the correct version (sketched below).

For example: the service that maintains a customer's session information.
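A tiny sketch of this policy; the Version record is an illustrative placeholder, not a Dynamo type.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of timestamp-based ("last write wins") reconciliation: among the
// conflicting versions, the one with the largest physical timestamp is kept.
public class LastWriteWins {

    record Version(byte[] value, long physicalTimestampMillis) {}

    static Version reconcile(List<Version> conflicting) {
        return conflicting.stream()
                .max(Comparator.comparingLong(Version::physicalTimestampMillis))
                .orElseThrow();
    }
}
```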

47

Page 48: Amazon Dynamo

Experiences and Lessons Learned

High-performance read engine:

Some services are read-mostly (and require high read performance) and rarely write.

R = 1 and W = N.

For example: services that maintain the product catalog and promotional items.

48

Page 49: Amazon Dynamo

Experiences and Lessons Learned

Client applications can tune the values of N, R, and W to achieve their desired levels of performance, availability and durability.

The value of N determines the durability of each object.

If W is set to 1, the system will never reject a write request as long as there is at least one node in the system that can successfully process a write request.

49

Page 50: Amazon Dynamo

Balancing performance and Durability

The performance of a read or write operation is limited by the slowest of the R or W replicas.

Trading durability for performance (sketched below):

each storage node maintains an object buffer in main memory

a write request is stored in the buffer and gets periodically written to disk by a writer thread

reads check the buffer first
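A sketch of this buffering scheme under stated assumptions; the PersistentStore interface is a hypothetical placeholder for the pluggable storage engine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the durability/performance trade-off: writes land in an
// in-memory buffer and a background writer thread flushes them to the
// persistent store periodically; reads consult the buffer first.
public class BufferedStore {

    interface PersistentStore {
        void write(String key, byte[] value);
        byte[] read(String key);
    }

    private final Map<String, byte[]> buffer = new ConcurrentHashMap<>();
    private final PersistentStore disk;
    private final ScheduledExecutorService writer =
            Executors.newSingleThreadScheduledExecutor();

    BufferedStore(PersistentStore disk, long flushIntervalMs) {
        this.disk = disk;
        writer.scheduleAtFixedRate(this::flush, flushIntervalMs,
                flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    void put(String key, byte[] value) {
        buffer.put(key, value);        // fast path: no disk write on the request
    }

    byte[] get(String key) {
        byte[] v = buffer.get(key);    // check the buffer before the store
        return v != null ? v : disk.read(key);
    }

    private void flush() {
        for (Map.Entry<String, byte[]> e : buffer.entrySet()) {
            disk.write(e.getKey(), e.getValue());
            buffer.remove(e.getKey(), e.getValue());  // drop only if unchanged
        }
    }
}
```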

50

Page 51: Amazon Dynamo

Balancing performance and Durability

To reduce the durability risk, the coordinator chooses one of the N replicas to perform a "durable write".

Since the coordinator waits for only W responses, most of the time the perceived performance is not affected by the "durable write".

51

Page 52: Amazon Dynamo

Balancing performance and Durability

52

Page 53: Amazon Dynamo

Balancing performance and Durability

53

Page 54: Amazon Dynamo

Ensuring Uniform Load Distribution

under high loads, a large number of popular keys are accessed and, due to the uniform distribution of keys, the load is evenly distributed

during low loads, fewer popular keys are accessed, resulting in higher load imbalance

54

Page 55: Amazon Dynamo

Ensuring Uniform Load Distribution

55

Page 56: Amazon Dynamo

Ensuring Uniform Load Distribution

Various partitioning schemes and their implications on load distribution:

Strategy 1: T random tokens per node and partition by token value

each node is assigned T tokens (virtual nodes)

the tokens of all nodes are ordered according to their values in the hash space

every two consecutive tokens define a range

56

Page 57: Amazon Dynamo

Ensuring Uniform Load Distribution

57

Page 58: Amazon Dynamo

Ensuring Uniform Load Distribution

Scanning problem: bootstrapping takes a long time:

a new node that joins the system "steals" its key ranges from other nodes

these other nodes need to scan their local stores to find the right items

scans are expensive and resource intensive, so they have to execute in the background

key ranges of many nodes change when a node joins or leaves; hence the Merkle trees for these nodes have to be recalculated, which is expensive too

58

Page 59: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 2: T random tokens per node and equal-sized partitions

the hash space is divided into Q equally sized partitions/ranges (see the sketch below)

each node is assigned T random tokens

Q is set such that Q >> N and Q >> S*T (where S is the number of nodes)

tokens are only used to build the function that maps values in the hash space to the ordered list of nodes, not to decide the partitioning
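A small sketch of how a key could be mapped to one of Q equal-sized partitions of the 128-bit MD5 hash space; this illustrates the idea only, not Dynamo's code.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: the partition index is fixed by the hash alone (hash space split
// into Q equal ranges); node tokens only decide which nodes serve a partition.
public class EqualSizedPartitioner {

    static int partitionOf(String key, int q) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        BigInteger position = new BigInteger(1, digest);           // 0 .. 2^128 - 1
        BigInteger partitionSize = BigInteger.ONE.shiftLeft(128)
                .divide(BigInteger.valueOf(q));
        // Clamp to q-1 because the last range absorbs any rounding remainder.
        return position.divide(partitionSize)
                .min(BigInteger.valueOf(q - 1)).intValue();
    }
}
```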

59

Page 60: Amazon Dynamo

Ensuring Uniform Load Distribution

60

Page 61: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 3: Q/S tokens per node, equal-sized partitions

each node is assigned Q/S tokens

when a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved

when a node joins the system, it "steals" tokens from nodes in the system in a way that preserves these properties

61

Page 62: Amazon Dynamo

Ensuring Uniform Load Distribution

62

Page 63: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 3 seems to be the best choice because:

Faster bootstrapping/recovery:

since partition ranges are fixed, they can be stored in separate files

a partition can be relocated as a unit by simply transferring the file, avoiding the random accesses needed to locate specific items

this simplifies the process of bootstrapping and recovery

63

Page 64: Amazon Dynamo

Ensuring Uniform Load Distribution

Ease of archival:

archival is simpler because the partition files can be archived separately; by contrast, in Strategy 1 the tokens are chosen randomly, and archiving the data stored in Dynamo requires retrieving the keys from individual nodes separately, which is usually inefficient and slow

one disadvantage: changing the node membership requires coordination in order to preserve the properties required of the assignment

64

Page 65: Amazon Dynamo

Ensuring Uniform Load Distribution

The efficiency of these three strategies is evaluated for a system with S = 30 and N = 3.

The load balancing efficiency of each strategy was measured for different sizes of membership information that needs to be maintained at each node, where load balancing efficiency is defined as the ratio of the average number of requests served by each node to the maximum number of requests served by the hottest node.

65

Page 66: Amazon Dynamo

Ensuring Uniform Load Distribution

66

Page 67: Amazon Dynamo

Divergent Versions: When and How many

In our next experiment, the number of versions returned to the shopping cart service was profiled for a period of 24 hours. During this period, 99.94% of requests saw exactly one version; 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions; and 0.00009% of requests saw 4 versions. This shows that divergent versions are created rarely.

67

Page 68: Amazon Dynamo

Divergent Versions: When and How many

When?

failures: node failures, data center failures, network partitions

a large number of concurrent writers to a single data item, with writes handled by multiple nodes

How are they resolved?

by vector clocks, and if that is not possible, by the client application

68

Page 69: Amazon Dynamo

Client-driven or Server-driven Coordination

Two ways:

Directly contact a storage node:

the Dynamo library is linked into the client code

lower latency because a potential forwarding step is skipped, but the client needs to refresh its membership information periodically (default: every 10 seconds), so it may see stale membership for up to 10 seconds

69

Page 70: Amazon Dynamo

Client-driven or Server-driven Coordination

Through a load balancer, which in turn picks a node:

the client does not have to link with Dynamo code, but requests may have higher latency

70

Page 71: Amazon Dynamo

Client-driven or Server-driven Coordination

71

Page 72: Amazon Dynamo

Balancing background vs. foreground task

foreground: get/put operations

background:

data handoff (due to hinting or nodes leaving/joining)

replica synchronization (due to permanent failures)

Challenge: background activity must not affect foreground tasks.

Solution: monitor foreground operations and admit background tasks only when appropriate.

72

Page 73: Amazon Dynamo

Final Words

Many Amazon internal services have used Dynamo for the past two years and it has provided significant levels of availability to their applications.

In particular, applications have received successful responses (without timing out) for 99.9995% of their requests, and no data loss event has occurred to date.

Moreover, the primary advantage of Dynamo is that it provides the necessary knobs, in the form of the three parameters (N, R, W), so that applications can tune their instance based on their needs.

73

Page 74: Amazon Dynamo

Final Words

Unlike popular commercial data stores, Dynamo exposes data consistency and reconciliation logic issues to the developers.

Dynamo adopts a full membership model where each node is aware of the data hosted by its peers.

Exchanging the membership table (routing table) works for hundreds of nodes but may not scale to thousands of nodes.

74
