  • Dynamo: Amazon’s Highly Available Key-value Store

    Presented By: Devarsh Patel

    CS5204 – Operating Systems 1

  • Dynamo

    CS5204 – Operating Systems

    Introduction

    Amazon’s e-commerce platform requires performance, reliability, and efficiency
    To support continuous growth, the platform needs to be highly scalable

    Dynamo – A highly available and scalable distributed data store built for Amazon’s platform

    Dynamo is used to manage services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance.

    Dynamo provides a simple primary-key-only interface to meet the requirements of applications like best-seller lists, shopping carts, customer preferences, session management, etc.

    A completely decentralized system with minimal need for manual administration.

    CS5204 – Operating Systems 2

  • Dynamo

    System Assumptions and Requirements

    Simple key-value interface

    Highly available
    Efficient in resource usage
    Simple scale-out scheme to address growth in data set size or request rates
    Each service that uses Dynamo runs its own Dynamo instances
    Used only by Amazon’s internal services

    Non-hostile environment: no security requirements such as authentication or authorization

    Targets applications that operate with weaker consistency in favor of high availability

    Service level agreements (SLAs)
    Measured at the 99.9th percentile of the distribution
    Key factors: service latency at a given request rate
    Example: a response time of 300 ms for 99.9% of requests at a peak client load of 500 requests per second (see the sketch after this slide)
    State management is the main component of a service’s SLAs

    CS5204 – Operating Systems 3
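
    To make the percentile SLA concrete, here is a minimal sketch of checking a 99.9th-percentile latency target against observed request latencies. The 300 ms / 99.9% figures come from the slide; the function name, the nearest-rank method, and the sample data are illustrative assumptions, not from Dynamo itself.

    ```python
    # Minimal sketch: checking a 99.9th-percentile latency SLA.
    def meets_sla(latencies_ms, slo_ms=300.0, percentile=99.9):
        """True if `percentile`% of requests completed within slo_ms."""
        ranked = sorted(latencies_ms)
        # Nearest-rank method: the observation below which `percentile`% fall.
        idx = max(0, int(len(ranked) * percentile / 100.0) - 1)
        return ranked[idx] <= slo_ms

    # 10,000 requests: most are fast, but 20 land in a slow tail.
    sample = [12.0] * 9980 + [450.0] * 20
    print(meets_sla(sample))  # False: the 99.9th-percentile latency is 450 ms
    ```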

  • Dynamo

    Design Considerations

    Designed to be an eventually consistent, “always writeable” data store
    Consistency vs. availability:

    To achieve a given level of consistency, replication algorithms are forced to trade off the availability of the data under certain failure scenarios.

    To improve availability, Dynamo uses a weaker form of consistency (eventual consistency), which allows optimistic replication techniques

    This can lead to conflicting changes, which must be detected and resolved

    The data store or the application performs conflict resolution at read time

    Other key principles:

    Incremental scalability – scale out one storage node at a time
    Symmetry – every node has the same set of responsibilities
    Decentralization – favor decentralized peer-to-peer techniques
    Heterogeneity – work distribution must be proportional to the capabilities of individual servers

    CS5204 – Operating Systems 4

  • Dynamo

    System Architecture

    Core distributed system techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling, and scaling

    CS5204 – Operating Systems 5

  • Dynamo

    System Interface

    Two operations: get() and put()
    get(key) – locates the object replicas associated with the key in the storage system and returns a single object, or a list of objects with conflicting versions, along with a context

    put(key, context, object) – determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk

    context – encodes system metadata about the object
    An MD5 hash of the key generates a 128-bit identifier used to determine the storage nodes responsible for the key (see the sketch after this slide)

    CS5204 – Operating Systems 6
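
    A minimal sketch of the get()/put() interface and MD5 key hashing described above; the class and helper names are illustrative assumptions, since Dynamo’s actual implementation is not public.

    ```python
    import hashlib

    def key_to_id(key: str) -> int:
        """MD5-hash the key into a 128-bit identifier that places it on the ring."""
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    class DynamoLikeStore:
        def __init__(self):
            self._data = {}  # 128-bit id -> list of (context, object) versions

        def get(self, key):
            """Return all stored versions (possibly conflicting) with contexts."""
            return self._data.get(key_to_id(key), [])

        def put(self, key, context, obj):
            """Store a new version; real Dynamo also places N replicas by key hash."""
            self._data.setdefault(key_to_id(key), []).append((context, obj))

    store = DynamoLikeStore()
    store.put("cart:alice", context={}, obj=["book"])
    print(store.get("cart:alice"))  # [({}, ['book'])]
    ```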

  • Dynamo

    Partitioning Algorithm

    Consistent hashing: the output range of the hash function is treated as a fixed circular space or “ring”
    Advantage: departure or arrival of a node affects only its immediate neighbors
    Issue: non-uniform data and load distribution

    Dynamo uses a variant of consistent hashing based on the concept of “virtual nodes” (sketched after this slide)

    CS5204 – Operating Systems 7
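
    A minimal sketch of consistent hashing with virtual nodes, assuming an MD5-based 128-bit ring as above; the class, helper, and parameter names are illustrative.

    ```python
    import bisect
    import hashlib

    def ring_pos(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

    class Ring:
        def __init__(self, nodes, vnodes_per_node=8):
            # Each physical node claims several token positions ("virtual
            # nodes"), which smooths out the load distribution.
            self._tokens = sorted(
                (ring_pos(f"{node}#vn{i}"), node)
                for node in nodes
                for i in range(vnodes_per_node)
            )

        def owner(self, key: str):
            """Walk clockwise to the first token at or after the key's position."""
            idx = bisect.bisect_left(self._tokens, (ring_pos(key),))
            return self._tokens[idx % len(self._tokens)][1]

    ring = Ring(["A", "B", "C"])
    print(ring.owner("shopping-cart:42"))  # one of 'A', 'B', 'C'
    ```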

  • Dynamo

    Replication

    Replicate data on multiple hosts: each key is stored at N hosts, where N is a parameter configured “per-instance”
    Reason – to achieve high availability and durability
    Preference list – the list of nodes responsible for storing a particular key (see the sketch after this slide)

    Figure 1: Partitioning and replication of keys in the Dynamo ring.

    CS5204 – Operating Systems 8
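
    A minimal sketch of building a preference list, reusing the Ring and ring_pos from the previous sketch: walk clockwise from the key’s position and collect the first N distinct physical nodes, skipping further virtual nodes of hosts already chosen. The function name and N default are illustrative.

    ```python
    import bisect

    def preference_list(ring, key, n=3):
        start = bisect.bisect_left(ring._tokens, (ring_pos(key),))
        nodes = []
        for i in range(len(ring._tokens)):
            _, node = ring._tokens[(start + i) % len(ring._tokens)]
            if node not in nodes:
                nodes.append(node)  # the first n distinct hosts store replicas
            if len(nodes) == n:
                break
        return nodes

    print(preference_list(ring, "shopping-cart:42"))  # e.g. ['B', 'A', 'C']
    ```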

  • Dynamo

    Data Versioning

    Dynamo treats the result of each modification as a new and immutable version of the data

    Allows for multiple versions of an object to be present in the system at the same time.

    Problem: version branching due to failures combined with concurrent updates, resulting in conflicting versions of an object

    Updates in the presence of network partitions and node failures can result in an object having distinct version sub-histories

    CS5204 – Operating Systems 9

  • Dynamo

    Data Versioning

    Uses vector clocks – A list of (node, counter) pairs

    Determines whether two versions of an object are on parallel branches or have a causal ordering

    A conflict requires reconciliation
    Conflicting versions are passed to the application as the output of a get operation
    The application resolves the conflicts and puts back a new (reconciled) version (see the sketch after this slide)

    CS5204 – Operating Systems 10
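
    A minimal sketch of vector-clock comparison as described above: a clock is a mapping of node to update counter. The function names and the node labels are illustrative.

    ```python
    def descends(a: dict, b: dict) -> bool:
        """True if clock `a` is causally equal to or later than clock `b`."""
        return all(a.get(node, 0) >= count for node, count in b.items())

    def compare(a: dict, b: dict) -> str:
        if descends(a, b) and descends(b, a):
            return "equal"
        if descends(a, b):
            return "a supersedes b"   # causal ordering: keep a, drop b
        if descends(b, a):
            return "b supersedes a"
        return "conflict"             # parallel branches: application reconciles

    # Two writes handled by different nodes Sy and Sz after a common ancestor:
    print(compare({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1}))  # conflict
    ```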

  • Dynamo

    Data Versioning

    Figure: Version evolution of an object over time

    CS5204 – Operating Systems 11

  • Dynamo

    Execution of get/put operations

    Two strategies to select a node:
    Request through a load balancer
    Request directly to the coordinator nodes

    Coordinator – the node handling the read or write operation
    The first among the top N nodes in the preference list

    Quorum system

    Two key configurable values: R and W
    R – minimum number of nodes that must participate in a successful read operation
    W – minimum number of nodes that must participate in a successful write operation
    A quorum-like system requires R + W > N
    (N, R, W) can be chosen to achieve the desired tradeoff
    R and W are usually configured to be less than N to provide better latency
    A write is successful if W-1 nodes respond to the put() request (in addition to the coordinator’s own local write)
    A read is successful if R nodes respond to the get() request (see the sketch after this slide)

    CS5204 – Operating Systems 12
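
    To make the quorum arithmetic concrete, a minimal sketch of the R + W > N condition; quorum_ok is an illustrative helper, not part of Dynamo.

    ```python
    def quorum_ok(n: int, r: int, w: int) -> bool:
        """R + W > N makes read and write sets overlap in at least one node,
        so every read quorum contains at least one up-to-date replica."""
        return r + w > n and 0 < r <= n and 0 < w <= n

    print(quorum_ok(3, 2, 2))  # True: a common Dynamo configuration
    print(quorum_ok(3, 1, 1))  # False: a read may miss the latest write
    ```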

  • Dynamo

    Hinted Handoff

    “Sloppy quorum”: all read and write operations are performed on the first N healthy nodes in the preference list
    The coordinator is the first node in this group
    A replica sent to a substitute node carries a “hint” in its metadata indicating the original node that should have held the replica
    Hinted replicas are stored by the available node and forwarded when the original node recovers
    Ensures read and write operations do not fail due to temporary node or network failures (sketched after this slide)

    CS5204 – Operating Systems 13
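
    A minimal sketch of hinted handoff: a write destined for a failed node is stored on a substitute node with a hint naming the intended owner, and is forwarded once the owner recovers. All class and function names are illustrative assumptions.

    ```python
    class Node:
        def __init__(self, name):
            self.name = name
            self.up = True
            self.data = {}      # key -> value
            self.hinted = []    # (intended_owner, key, value)

    def sloppy_write(key, value, preference, substitute):
        for node in preference:
            if node.up:
                node.data[key] = value
            else:
                # Keep the replica on a substitute, tagged with the real owner.
                substitute.hinted.append((node, key, value))

    def handoff(substitute):
        """Periodic retry: deliver hinted replicas to owners that recovered."""
        still_pending = []
        for owner, key, value in substitute.hinted:
            if owner.up:
                owner.data[key] = value
            else:
                still_pending.append((owner, key, value))
        substitute.hinted = still_pending

    a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D")
    c.up = False
    sloppy_write("k1", "v1", preference=[a, b, c], substitute=d)
    c.up = True
    handoff(d)
    print(c.data)  # {'k1': 'v1'}: the replica reached its intended owner
    ```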

  • Dynamo

    Replica synchronization

    Merkle trees are used to detect inconsistencies between replicas quickly and to minimize the amount of data transferred

    A separate tree is maintained by each node for each key range it hosts

    Advantage: each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set

    Disadvantage: adds overhead to recompute the Merkle trees when a node joins or leaves the system (see the sketch after this slide)

    CS5204 – Operating Systems 14
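
    A minimal sketch of Merkle-tree comparison for replica synchronization. Real Dynamo hashes key ranges; here a small sorted key space stands in for a range, and the structure and names are illustrative.

    ```python
    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.md5(data).digest()

    def build_tree(leaves):
        """Bottom-up list of levels; levels[0] is the leaf hashes."""
        levels = [[h(leaf) for leaf in leaves]]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            levels.append(
                [h(prev[i] + prev[i + 1]) for i in range(0, len(prev) - 1, 2)]
                + ([prev[-1]] if len(prev) % 2 else [])
            )
        return levels

    def diverged_leaves(t1, t2):
        """If the roots match, the replicas agree and no data moves;
        otherwise locate the leaves (key ranges) whose hashes differ."""
        if t1[-1] == t2[-1]:
            return []
        return [i for i, (x, y) in enumerate(zip(t1[0], t2[0])) if x != y]

    a = build_tree([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
    b = build_tree([b"k1=v1", b"k2=XX", b"k3=v3", b"k4=v4"])
    print(diverged_leaves(a, b))  # [1]: only that key range needs transfer
    ```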

  • Dynamo

    Membership and Failure Detection

    Ring Membership
    An explicit mechanism adds or removes nodes from the ring
    Done by an administrator using a command-line tool or a browser
    A gossip-based protocol propagates membership, partitioning, and placement information via periodic exchanges (sketched after this slide)
    Each node eventually knows the key ranges of its peers and can forward requests to them

    External Discovery
    To prevent logical partitions, some nodes play the role of seeds
    “Seed” nodes are discovered via an external mechanism and are known to all nodes

    Failure Detection
    Node failures are detected by a lack of responsiveness; recovery is detected by periodic retry

    CS5204 – Operating Systems 15
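
    A minimal sketch of a gossip-style membership exchange: each node’s view maps member to a (version, status) entry, and a round reconciles a node’s view with a random peer’s, keeping the higher-version entry. This illustrates the idea only; it is not Dynamo’s actual wire protocol, and all names are assumptions.

    ```python
    import random

    def gossip_round(views):
        """views: list of dicts, one membership view per node."""
        for view in views:
            peer = random.choice(views)
            for member in set(view) | set(peer):
                a = view.get(member, (0, None))
                b = peer.get(member, (0, None))
                # Keep whichever side saw the fresher (higher-version) entry.
                fresher = max(a, b, key=lambda entry: entry[0])
                view[member] = peer[member] = fresher

    # Node 0 learns that "C" joined; repeated rounds spread it to everyone.
    views = [{"C": (1, "joined")}, {}, {}]
    for _ in range(8):
        gossip_round(views)
    print(all("C" in v for v in views))  # very likely True after a few rounds
    ```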

  • Dynamo

    Experiences & Lessons Learned

    Main patterns in which Dynamo is used:
    Business-logic-specific reconciliation
    Timestamp-based reconciliation
    High-performance read engine

    Client applications can tune the values of N, R, and W
    A common (N, R, W) configuration used by several instances of Dynamo is (3, 2, 2)

    CS5204 – Operating Systems 16

  • Dynamo

    Experiences & Lessons Learned

    Balancing performance and Durability

    CS5204 – Operating Systems 17

  • Dynamo

    Experiences & Lessons Learned

    Ensuring Uniform Load Distribution

    CS5204 – Operating Systems 18

  • Dynamo

    Partitioning & Placement Strategies

    Figure: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A, B, and C form the preference list. Dark arrows indicate the token locations for various nodes.

    CS5204 – Operating Systems 19

  • Dynamo

    Partitioning & Placement Strategies

    Strategy 1: T random tokens per node and partition by token value
    A new node needs to “steal” its key ranges from other nodes
    Bootstrapping of a new node is lengthy
    Other nodes process the scanning and transmission of key ranges for the new node as background activities

    Disadvantages:
    Numerous nodes have to adjust their Merkle trees when a new node joins or leaves the system
    Archiving the entire key space is highly inefficient

    CS5204 – Operating Systems 20

  • Dynamo

    Partitioning & Placement Strategies

    Strategy 2: T random tokens per node and equal-sized partitions
    The hash space is divided into Q equally sized partitions
    Q >> N and Q >> S*T, where S is the number of nodes in the system
    Advantages:
    Decouples partitioning from partition placement
    Allows changing the placement scheme at run time

    Strategy 3: Q/S tokens per node and equal-sized partitions
    Also decouples partitioning from placement (sketched after this slide)
    Advantages:
    Faster bootstrapping and recovery
    Ease of archival

    CS5204 – Operating Systems 21
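
    A minimal sketch of Strategy 3: the hash space is pre-divided into Q equal-sized partitions and each of the S nodes owns Q/S of them, so joins and leaves move whole partitions. The helper names and round-robin placement are illustrative assumptions.

    ```python
    def assign_partitions(q, nodes):
        """Round-robin the Q partitions over the nodes: each gets about Q/S."""
        return {p: nodes[p % len(nodes)] for p in range(q)}

    def partition_of(key_id, q, space=2**128):
        """Map a 128-bit ring position to one of Q equal-sized partitions."""
        return key_id * q // space

    placement = assign_partitions(q=12, nodes=["A", "B", "C"])  # 4 partitions each
    print(placement[partition_of(key_id=2**127, q=12)])  # owner of mid-ring keys
    ```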

  • Dynamo

    Partitioning & Placement Strategies

    The strategies have different tuning parameters
    A fair way to compare them is to evaluate the skew in their load distributions for a fixed amount of space used to maintain membership information
    Strategy 3 achieves the best load-balancing efficiency

    CS5204 – Operating Systems 22

  • Dynamo

    Client-driven or Server-driven Coordination

    Any node can coordinate read requests; write requests are handled by the coordinator
    The state machine for coordination can be located in a load-balancing server or incorporated into the client
    Client-driven coordination has lower latency because it avoids the extra network hop (redirection)

    CS5204 – Operating Systems 23

  • Dynamo

    Thank You

    CS5204 – Operating Systems 24