Amazon Dynamo DB- a term project presentation
CSc 244 | CSU SacramentoSpring 2014
Instructor : Dr. Ying Jin 8th May, 2014
Team 1 :
Chetan Nagarkar
Saransh
Agenda
• NoSQL background
• CAP theorem
• AWS offerings
• Dynamo DB
• Dynamo DB Architecture
• How to use
• Pricing
• Advantages
• Limitations
• References
• Questions 2
3
NoSQL background• Traditional databases have a concrete schema.
4
NoSQL systems are designed to be schema-less.
• Entry 1
• {“name”:“emp1”}
• Entry 2
• {“name”:“emp2”,“e_id”:“1”,“e_addr”:“Cupertino”}
• Entry 3
• {“name”:“emp3”,“e_id”:“3”}
• Entry 4
• {“name”:“emp4”,“e_id”:“6”, “dob”:“03-Sep-1964”}
5
Types of NoSQL systems
1. Key-Value Pair : • Every item is a set of key-value pair
Voldemort(LinkedIn), DynamoDB(Amazon), Redis(VMWare), etc.
2. Document Oriented :
• Each key is paired with a complex data structure called ‘document’. Documents may contain multiple key-value pairs
CouchDB, MongoDB, etc.
3. Column Family Stores :
• Key value is mapped to set of columns optimized for query over large datasets.Cassandra(Facebook), HyperTable, Hbase, BigTable(Google) etc.
4. Graph Databases :
• Contains information about network, social connectionsNeo4j, FlockDB(Twitter).
6
ACID properties in traditional RDBMs
• Atomicity
• Requires each transaction to be all or nothing
• Consistency
• Brings databases from one valid state to another
• Isolation
• Takes care of concurrent execution and avoid overwritting
• Durability
• Commited transactions will remain so in the event of power loss, crashes etc
7
CAP theorem
• It states : A distributed system possibly cannot guarantee all of the following simultaneously –
• Consistency: All nodes see the same data at the same time
• Availability: The system is ‘always on’, i.e., every request receives a response, whether it serves the request or not.
• Partition Tolerance: System continues to operate in case of system failure or message loss.
8
9
Amazon Web Services(AWS) offerings• Launched in 2006, AWS is used by thousands of companies
today for cloud based computing services.
• Includes an array of services: Remote Computing(EC2), Identity(IAM), Database services(SimpleDB, RDS, ElastiCache, DynamoDB, etc.)
• 8 Regions and Availability zones :Code Name
ap-northeast-1 Asia Pacific (Tokyo) Region
ap-southeast-1 Asia Pacific (Singapore) Region
ap-southeast-2 Asia Pacific (Sydney) Region
eu-west-1 EU (Ireland) Region
sa-east-1 South America (Sao Paulo) Region
us-east-1 US East (Northern Virginia) Region
us-west-1 US West (Northern California) Region
us-west-2 US West (Oregon) Region
10
Dynamo DB• Dynamo DB is a managed and eventually consistent NoSQL
database service to support highly available data access.
• A flexible data model with key/attribute pairs. No schema required.
• Optimized for availability to maximize :
• Data consistency
• Durability
• Performance
ALL THIS WITHOUT THE OPERATIONAL BURDEN
11
What is ‘Fully Managed’?
Never worry about:
• Hardware provisioning
• Cross-availability zone replication
• Hardware and Software updates
• Monitoring and handling of hardware failures
• Replicas automatically generated.
DynamoDB Architecture
• True distributed architecture
• Data is spread across hundreds of servers called storage nodes
• Hundreds of servers form a cluster in the form of a “ring”
• Client application can connect using one of the two approaches
• Routing using a load balancer
• Client-library that reflects Dynamo’s partitioning scheme and can determine the storage host to connect
• DynamoDB is designed to be “always writable” storage solution
• Allows multiple versions of data on multiple storage nodes
• Conflict resolution happens while reads and NOT during writes
• Syntactic conflict resolution
• Symantec conflict resolution
• Put(key,context object)
• Determines where replicas of the object should be placed based on key,& write replicas to disk
• Context Info is stored along with object so as to verify validity of context object
• If at least W-1 nodes respond, write is successful
• Get(key)-
• Operation locates all object replicas associated with key in storage system
• Returns a single or list of object with conflict version along with context
• Client reconcile divergent versions and supersede current version with based on context
• Node handling read and write operation is called “coordinator”
• For reads and writes dynamo uses consistency protocol
• Consistency protocol has 3 variables
• N – Number of replicas of data to be read or written
• W – Number of nodes that must participate in successful write operation
• R – Number of machines contacted in read operation
• R+W > N Latency determined by slowest of R or W replicas. Thus R&W less than N
DynamoDB System Interface – get() and put()
DynamoDB Architecture - Partitioning• Data is partitioned over multiple hosts called storage nodes (ring)
• Uses consistent hashing to dynamically partition data across storage hosts
• Two problems associated with Consistent hashing
• Hashing of storage hosts can cause imbalance of data and load
• Consistent hashing treats every storage host as same capacity
A [3,4]
B [1]
C[2]
1
2
3
4
3
4
A[4]
B[1]
1
2D[2,3]
Initial Situation
Situation after C left and D joined
DynamoDB Architecture - Partitioning• Solution – Virtual hosts mapped to physical hosts (tokens)
• Number of virtual hosts for a physical host depends on capacity of physical host
• Virtual nodes are function-mapped to physical nodes
AB
C
D
E
F
G
H
I
J
K
L
C
B Physical HostVirtual Host
DynamoDB Architecture – Scaling• Gossip protocol used for node membership
• Every second, storage node randomly contact a peer to bilaterally reconcile persisted membership history
• Doing so, membership changes are spread and eventually consistent membership view is formed
• When a new members joins, adjacent nodes adjust their object and replica ownership
• When a member leaves adjacent nodes adjust their object and replica and distributes keys owned by leaving member
A
B
C
D
E
F
G
H
Replication factor = 3
DynamoDB Architecture – Data Versioning• Eventual Consistency – data propagates asynchronously
• Its possible to have multiple version on different storage nodes
• If latest data is not available on storage node, client will update the old version of data
• Conflict resolution is done using “vector clock”
• Vector clock is a metadata information added to data when using get() or put()
• Vector clock effectively a list of (node, counter) pairs
• One vector clock is associated with every version of every object
• One can determine two version has causal ordering or parallel branches by examining vector clocks
• Client specify vector clock information while reading or writing data in the form of “context”
• Dynamo tries to resolve conflict among multiple version using syntactic reconciliation
• If syntactic reconciliation does not work, client has to resolve conflict using semantic reconciliation
18
Strongly consistent vs Eventually Consistent
• Multiple copies of same item to ensure durability
• Takes time for update to propagate multiple copies
• Eventually consistent
• After write happens, immediate read may not give latest value
• Strongly consistent
• Requires additional read capacity unit
• Gets most up-to-date version of item value
• Eventually consistent read consumes half the read capacity unit as strongly consistent read.
20
How to use
• AWS provides SDKs to interact with DynamoDB for the following platforms/languages:
• Android
• iOS
• Java
• .Net
• In Java, there are two means to interact with your DynamoDB table:
• AWS Management Console
• AWS Toolkit for Eclipse
• Python
• PHP
• Node.js
• Ruby
21
Management Console
22
Console – create table
Inside Amazon DynamoDB
• Primary key (mandatory for every table)
• Hash or Hash + Range
• Data model in the form of tables
• Data stored in the form of items (name – value attributes)
• Secondary Indexes for improved performance
• Local secondary index
• Global secondary index
• Scalar data type (number, string etc) or multi-valued data type (sets)
key=value key=value key=value key=value Table
Item (64KB max)
Attributes
24
Create Instance in Java
25
Using AWS SDK: AWS Toolkit for Eclipse
26
Pricing
• In the free tier, AWS offers 100 MB of free data storage per month.
• Read capacity: 1 read = 4 KB data
• Write capacity: 1 write = 1 KB data
• With eventually consistant reads, you get twice reads/sec
• In free tier you get 5 writes/sec & 10 reads/sec
• Or 432,000 writes and 864,000 writes per day or 40 million data operations per month.
27
Advantages
• Easy Administration: No h/w or s/w provisioning, setup or configuration
• Flexible: Secondary Indexes: query on whichever attribute
• Fast, predictable performance: SSDs, single digit latency
• Built-in fault tolerance
• Stong consistency and atomic counters
• Automatic Data Replication: Data stored across 3 availability zones
• Security: AWS Identity Access Management(IAM)
• CloudWatch Alarms: Alarms are raised if you are near to crossing provisioned throughput or storage.
• Provisioned throughput/Cost effective: Pay how you use
28
Limitations
• 64KB limit on item size (row size)
• 1 MB limit on fetching data
• Pay more if you want strongly consistent data
• Size is multiple of 4KB (provisioning throughput wastage)
• Cannot join tables
• Indexes should be created during table creation only
• No triggers or server side scripts
• Limited comparison capability (no not_null, contains etc)
29
References:
• http://aws.amazon.com/dynamodb/
• http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
• http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
• http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
30
Questions ?
Thank You !
Top Related