
Page 1: NoSQL DBMS: Concepts, Techniques and Patterns

NoSQL DBMS: Concepts, Techniques and Patterns

Iztok Savnik, FAMNIT, 2017-19

Page 2: NoSQL DBMS: Concepts, Techniques and Patterns

Contents

• Requirements
• Data model (+)
• Consistency model
• Partitioning
• Replication (+)
• Storage Layout (+)
• Query Models (+)

Page 3: NoSQL DBMS: Concepts, Techniques and Patterns

Requirements

•Requirements of the new web applications•ACID vs. BASE•CAP theorem

Page 4: NoSQL DBMS: Concepts, Techniques and Patterns

ACID vs. BASE

• ACID: Atomicity, Consistency, Isolation, Durability
• Characteristics
  Strong consistency
  Focus on "commit"
  Nested transactions
  Availability?
  Conservative (pessimistic)
  Difficult evolution

• BASE: Basically Available, Soft-state, Eventual consistency
• Characteristics
  Weak consistency
  Availability first
  Approximate answers OK
  Aggressive (optimistic)
  Simpler! Faster
  Easier evolution

Page 5: NoSQL DBMS: Concepts, Techniques and Patterns

CAP theorem

• Eric Brewer, “Towards Robust Distributed Systems”, keynote at ACM’s PODC 2000
• CAP theorem

Consistency
  The system is in a consistent state after the execution of an operation
  After an update by some writer, all readers see the update in the shared data source

Availability
  The system is designed and implemented in a way that allows it to continue operation even if, e.g., nodes in a cluster crash or some hardware or software parts are down due to upgrades

Partition tolerance
  The ability of the system to continue operation in the presence of network partitions

• Theorem: you can have at most two of these properties for any shared-data system
• Widely adopted today by large web companies

Page 6: NoSQL DBMS: Concepts, Techniques and Patterns

CAP theorem

Page 7: NoSQL DBMS: Concepts, Techniques and Patterns

Forfeit Partitions

• Consistency and availability are possible if the network always works
• Examples
  Single-site databases
  Cluster databases
  LDAP
  xFS file system
• Traits
  2-phase commit
  cache validation protocols

Page 8: NoSQL DBMS: Concepts, Techniques and Patterns

Forfeit Availability

• In the presence of partitions, consistency means no availability (in some cases)
• Examples
  Distributed databases
  Distributed locking
  Majority protocols
• Traits
  Pessimistic locking
  Make minority partitions unavailable

Page 9: NoSQL DBMS: Concepts, Techniques and Patterns

Forfeit Consistency

• In the presence of partitions, availability means no consistency (in some cases)
• Examples
  Coda
  Web caching
  DNS
• Traits
  expirations/leases
  conflict resolution
  optimistic

Page 10: NoSQL DBMS: Concepts, Techniques and Patterns

Contents

• Requirements
• Consistency model
• Partitioning

Page 11: NoSQL DBMS: Concepts, Techniques and Patterns

Consistency Models

• A consistency model determines rules for visibility and apparent order of updates
  The system is in a consistent state after the execution of an operation
• Example:
  Row X is replicated on nodes M and N
  Client A writes row X to node N
  Some period of time t elapses
  Client B reads row X from node M
  Does client B see the write from client A?
• Consistency is a continuum with tradeoffs
• Consistency models
  Timestamps
  Optimistic Locking
  Vector Clocks
  Multiversion Storage

Page 12: NoSQL DBMS: Concepts, Techniques and Patterns

Strict Consistency

• All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to
• Implies either:
  All operations for a given row go to the same node (replication for availability), or
  nodes employ some kind of distributed transaction protocol (e.g. 2-Phase Commit or Paxos)
• CAP Theorem:
  Strict consistency can’t be achieved at the same time as availability and partition tolerance

Page 13: NoSQL DBMS: Concepts, Techniques and Patterns

Eventual Consistency

• As t → ∞, readers will see writes
• In a steady state, the system is guaranteed to eventually return the last written value
• For example: DNS, or MySQL slave replication (log shipping)
• Special cases of eventual consistency:
  Read-your-own-writes consistency (“sent mail” box)
  Causal consistency (if you write Y after reading X, anyone who reads Y sees X)
  Gmail has RYOW but not causal!

Page 14: NoSQL DBMS: Concepts, Techniques and Patterns

Timestamps and Vector Clocks

• Determining the history of a row
• Eventual consistency relies on deciding what value a row will eventually converge to
• In the case of two writers writing at “the same” time, this is difficult
• Timestamps are one solution, but they rely on synchronized clocks and don’t capture causality
• Vector clocks are an alternative method of capturing order in a distributed system

Page 15: NoSQL DBMS: Concepts, Techniques and Patterns

Lamport Timestamps

• Ordering events in a distributed system
  ● Trying to sync clocks is one approach
• Timestamps that are not absolute time
• Such timestamps should obey causality
  ● If an event A happens causally before an event B, then timestamp(A) < timestamp(B)
• Humans use causality often
  ● “I entered the house only after I unlocked the door”
  ● “You receive the letter only after I send it”

Page 16: NoSQL DBMS: Concepts, Techniques and Patterns

Lamport Timestamps

• Define a logical relationship Happens-Before among pairs of events
  ● Happens-Before is denoted →
• Three rules
  ● On the same computer: A → B if time(A) < time(B) (local clock)
  ● Site s1 sends a message m to site s2: send(m) → receive(m)
  ● Transitivity: A → B and B → C implies A → C
• The relation → is a partial order
  ● Not all events are related by →

Page 17: NoSQL DBMS: Concepts, Techniques and Patterns

Lamport Timestamps

Page 18: NoSQL DBMS: Concepts, Techniques and Patterns

Lamport Timestamps

• Task: assign logical timestamps to each event
• Timestamps obey causality
• Rules (see the sketch below)
  ● Each site uses a local counter (integer); the initial value is 0
  ● A site increments its counter by one when
    ● it sends a message to some other site, or
    ● an operation happens locally
  ● A sent message carries the sender’s timestamp
  ● On receiving a message, the timestamp is updated to
    ● max(local clock, message timestamp) + 1
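A minimal Python sketch of these rules (illustrative; the class and method names are my own):

    class LamportClock:
        def __init__(self):
            self.time = 0                          # local counter, initially 0

        def tick(self):
            # Local operation: increment the counter by one.
            self.time += 1
            return self.time

        def send(self):
            # Sending: increment, then attach the timestamp to the message.
            self.time += 1
            return self.time

        def receive(self, msg_time):
            # Receiving: update to max(local clock, message timestamp) + 1.
            self.time = max(self.time, msg_time) + 1
            return self.time

    # Usage: site s1 sends a message m to site s2.
    s1, s2 = LamportClock(), LamportClock()
    s1.tick()                 # event on s1 -> time 1
    t = s1.send()             # send(m)     -> time 2, carried by the message
    s2.receive(t)             # receive(m)  -> max(0, 2) + 1 = 3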

Page 19: NoSQL DBMS: Concepts, Techniques and Patterns

Lamport Timestamps

• A pair of concurrent events has no causal path from one event to the other
• For concurrent events, Lamport timestamps are not meaningfully ordered!
  ● This is OK, since there is no relationship between the events
• Therefore:
  ● E1 → E2 implies timestamp(E1) < timestamp(E2)
  ● timestamp(E1) < timestamp(E2) implies that either E1 and E2 are concurrent events, or E1 → E2

Page 20: NoSQL DBMS: Concepts, Techniques and Patterns

Vector Clock - Details

• A vector clock is defined as a tuple V = (V[0], V[1], ..., V[n]) of clock values, one from each node
• In a distributed scenario, node i maintains such a tuple Vi of clock values, which represents its own state and the state of the other (replica) nodes as it is aware of them at a given time:
  Vi[0] for the clock value of the first node,
  Vi[1] for the clock value of the second node, ...,
  Vi[i] for itself, ...,
  Vi[n] for the clock value of the last node
• Clock values may be real timestamps derived from a node’s local clock, version/revision numbers, or some other ordinal values

Page 21: NoSQL DBMS: Concepts, Techniques and Patterns

Vector Clocks

• Vector clocks are updated in a way defined by the following rules (see the sketch below)
  ● If an internal operation happens at node i, this node increments its clock Vi[i]
    ● This means that internal updates are seen immediately by the executing node
  ● If node i sends a message to node k, it first advances its own clock value Vi[i] and attaches the vector clock Vi to the message to node k. Thereby it tells the receiving node about its internal state and its view of the other nodes at the time the message is sent
  ● If node i receives a message from node j, it first advances its vector clock Vi[i] and then merges its own vector clock with the vector clock Vmessage attached to the message from node j, so that Vi = max(Vi, Vmessage)
  ● To compare two vector clocks Vi and Vj in order to derive a partial ordering, the following rule is applied: Vi > Vj if ∀k Vi[k] >= Vj[k] and ∃l Vi[l] > Vj[l]
  ● If neither Vi > Vj nor Vi < Vj applies, a conflict caused by concurrent updates has occurred and needs to be resolved by, e.g., a client application
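A minimal Python sketch of these update and comparison rules (illustrative names, not from any particular system):

    def new_clock(n):
        return [0] * n                             # V[0..n-1], all zeros

    def internal_event(V, i):
        V[i] += 1                                  # internal operation at node i

    def send(V, i):
        V[i] += 1                                  # advance own entry ...
        return list(V)                             # ... and attach a copy to the message

    def receive(V, i, V_msg):
        V[i] += 1                                  # advance own entry, then merge:
        for k in range(len(V)):                    # Vi = max(Vi, Vmessage)
            V[k] = max(V[k], V_msg[k])

    def compare(Va, Vb):
        # Partial order: Va > Vb if all entries are >= and at least one is >.
        ge = all(a >= b for a, b in zip(Va, Vb))
        le = all(a <= b for a, b in zip(Va, Vb))
        if ge and not le: return ">"
        if le and not ge: return "<"
        if ge and le:     return "="               # identical clocks
        return "conflict"                          # concurrent updates: resolve at client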

Page 22: NoSQL DBMS: Concepts, Techniques and Patterns

Vector Clocks

Page 23: NoSQL DBMS: Concepts, Techniques and Patterns

Vector clocks – example

Page 24: NoSQL DBMS: Concepts, Techniques and Patterns

Vector Clocks

• Causal reasoning between updates
• Clients participate in the vector clock scenario
  They keep the vector clock of the last replica node they have talked to and use this vector clock depending on the client consistency model that is required
  For monotonic read consistency, a client attaches the last vector clock it received to its requests, and the contacted replica node makes sure that the vector clock of its response is greater than the vector clock the client submitted
  Clients can thus be sure to see only newer versions of some piece of data
• Advantages of vector clocks
  No dependence on synchronized clocks
  No total ordering of revision numbers required for causal reasoning
  No need to store and maintain multiple revisions of a piece of data on all nodes

Page 25: NoSQL DBMS: Concepts, Techniques and Patterns

Optimistic Concurrency Control

• Locking is a conservative approach that avoids conflicts
  Additional code for locking
  Checking for deadlock
  Lock saturation for objects that are frequently used
• If conflicts are rare, we can run concurrent transactions without locking
  Instead we check for conflicts before committing a transaction
  Optimistic control is faster than locking if there are few conflicts (worse when there are many conflicts)
• OCC is used in
  Couchbase
  MongoDB

Page 26: NoSQL DBMS: Concepts, Techniques and Patterns

Kung-Robinson model

A transaction has 3 phases
  READ: the transaction reads from the database but makes changes to a local copy
  VALIDATE: validate the changes
  WRITE: make the local changes public

(Figure: ROOT pointer, old and new versions of the changed objects)

Page 27: NoSQL DBMS: Concepts, Techniques and Patterns

Validation

• Check conditions that are sufficient (see the sketch below)
  Every transaction gets a numerical timestamp
  Transaction IDs are assigned just before validation
  ● ReadSet(Ti): set of objects read by transaction Ti
  ● WriteSet(Ti): set of objects changed by Ti
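A simplified sketch of backward validation in the spirit of Kung-Robinson (assuming transactions are validated one at a time; names are illustrative):

    from dataclasses import dataclass, field

    @dataclass
    class Txn:
        read_set: set = field(default_factory=set)     # ReadSet(Ti)
        write_set: set = field(default_factory=set)    # WriteSet(Ti)

    def validate(ti, committed_overlapping):
        # Ti passes if no transaction that committed during Ti's READ phase
        # wrote an object that Ti read: WriteSet(Tj) ∩ ReadSet(Ti) = ∅.
        return all(not (tj.write_set & ti.read_set) for tj in committed_overlapping)

    # Usage: T1 committed a write to x while T2 was reading x, so T2 must restart.
    t1 = Txn(read_set={"x"}, write_set={"x"})
    t2 = Txn(read_set={"x", "y"}, write_set={"y"})
    assert validate(t2, [t1]) is False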

Page 28: NoSQL DBMS: Concepts, Techniques and Patterns

Optimistic Locking

• Couchbase optimistic locking
  We optimistically speculate that there will be no conflict
  Before commit we check whether there are conflicts
  Commit is issued if there are no conflicts
  Handling one read/write operation is easier than a complex transaction
• Example:
  Let’s assume we are building an online wikipedia-like application using Couchbase Server: users can update articles and add new articles. Alice uses this application to edit an article on ‘bicycles’ to correct some information. Alice opens the article and makes her changes, but before she hits save, she gets distracted and walks away from her desk. In the meantime, Joe notices the same error in the bicycle article and wants to correct the mistake.

  If optimistic locking is used in the application, Joe can edit the article and save his changes. When Alice returns and wants to save her changes, either Alice or the application will have to handle the latest updates before allowing Alice’s action to change the document.

  Optimistic locking takes the “optimistic” view that data conflicts due to concurrent edits occur rarely, so it’s more important to allow concurrent edits.
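Couchbase exposes this through a CAS (compare-and-swap) value attached to each document; the following in-memory sketch illustrates the idea only and is not the Couchbase SDK API:

    class Store:
        def __init__(self):
            self.data = {}                             # key -> (value, cas)

        def get(self, key):
            return self.data[key]                      # value together with its CAS token

        def replace(self, key, value, cas):
            old_value, old_cas = self.data[key]
            if cas != old_cas:                         # someone else saved in the meantime
                raise RuntimeError("CAS mismatch: reload and retry")
            self.data[key] = (value, old_cas + 1)      # new version, new CAS

    store = Store()
    store.data["bicycles"] = ("original text", 1)
    _, cas_alice = store.get("bicycles")               # Alice opens the article
    _, cas_joe = store.get("bicycles")                 # Joe opens it too
    store.replace("bicycles", "Joe's fix", cas_joe)    # Joe saves first: succeeds
    try:
        store.replace("bicycles", "Alice's fix", cas_alice)
    except RuntimeError as e:
        print(e)                                       # Alice must reload Joe's version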

Page 29: NoSQL DBMS: Concepts, Techniques and Patterns

Pessimistic Locking

• Couchbase pessimistic locking
  The application simply locks an object and unlocks it after commit
• Example:
  Now let’s assume that your business process requires exclusive access to one or more documents or a graph of documents.

  Referring to our previous example, when Alice is editing the document she does not want any other user to edit the same document. If Joe tries to open the page, he will have to wait until Alice has released the lock.

  With pessimistic locking, the application needs to explicitly take a lock on the document to guarantee exclusive user access. When the user is done accessing the document, the lock can be removed either manually or using a timeout.
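A sketch of the pattern with an explicit lock that expires on a timeout (hypothetical helper class, not the Couchbase SDK):

    import time

    class LockedStore:
        def __init__(self):
            self.data, self.locks = {}, {}             # key -> value, key -> lock expiry

        def get_and_lock(self, key, timeout=15):
            if self.locks.get(key, 0) > time.time():   # lock still held by someone else
                raise RuntimeError("document locked, try again later")
            self.locks[key] = time.time() + timeout    # lock expires automatically
            return self.data[key]

        def unlock_and_save(self, key, value):
            self.data[key] = value
            self.locks.pop(key, None)                  # explicit release

    store = LockedStore()
    store.data["bicycles"] = "original text"
    doc = store.get_and_lock("bicycles")               # Alice takes the lock;
                                                       # Joe's get_and_lock would now fail
    store.unlock_and_save("bicycles", "Alice's fix")   # Alice saves and releases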

Page 30: NoSQL DBMS: Concepts, Techniques and Patterns

Multiversion Concurrency Control

• MVCC is a concurrency control method commonly used by database management systems to provide concurrent access to the database
• Isolation is the property that provides guarantees for concurrent accesses to data
  Isolation is implemented by means of a concurrency control protocol
• MVCC aims at solving the problem by keeping multiple copies of each data item
  On update, MVCC creates a newer version of the data item
  The isolation level commonly implemented with MVCC is snapshot isolation
    All reads made in a transaction see a consistent snapshot of the database
    A transaction reads the last committed values that existed at the time it started
  How to remove versions that become obsolete? PostgreSQL adopts this approach with its VACUUM process

Page 31: NoSQL DBMS: Concepts, Techniques and Patterns

Multiversion Concurrency Control

• Each table cell has a timestamp
• Timestamps don’t necessarily need to correspond to real time
• Multiple versions can exist concurrently for a given row
• Reads may return “most recent”, “most recent before T”, etc. (free snapshots)
• The system may provide optimistic concurrency control with compare-and-swap on timestamps
  SQL Anywhere, InterBase, Firebird, Oracle, PostgreSQL, MongoDB and Microsoft SQL Server (2005 and later)
  Not the same as serializability
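A toy multiversion cell illustrating snapshot reads and version cleanup (illustrative; vacuum here only mimics the idea behind PostgreSQL’s VACUUM):

    class MVCCCell:
        def __init__(self):
            self.versions = []                         # list of (ts, value), ts increasing

        def write(self, ts, value):
            self.versions.append((ts, value))          # every update creates a new version

        def read(self, ts):
            # Snapshot read: the last committed value that existed at time ts.
            older = [(t, v) for t, v in self.versions if t <= ts]
            return older[-1][1] if older else None

        def vacuum(self, horizon):
            # Drop obsolete versions, keeping the newest one at or before the horizon.
            old = [tv for tv in self.versions if tv[0] <= horizon]
            new = [tv for tv in self.versions if tv[0] > horizon]
            self.versions = old[-1:] + new

    cell = MVCCCell()
    cell.write(10, "v1")
    cell.write(20, "v2")
    print(cell.read(15))                               # snapshot at T=15 sees "v1"
    print(cell.read(25))                               # snapshot at T=25 sees "v2"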

Page 32: NoSQL DBMS: Concepts, Techniques and Patterns

Gossip

• Gossip is one method to propagate a view of cluster status
• Every t seconds, on each node:
  The node selects some other node to chat with
  The node reconciles its view of the cluster with its gossip buddy
  Each node maintains a “timestamp” for itself and for the most recent information it has from every other node
• Information about cluster state spreads in O(log n) rounds (eventual consistency)
• Scalable and no SPOF, but state is only eventually consistent
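A toy simulation of such a round (illustrative; each node keeps a heartbeat counter per node and reconciles with one random buddy):

    import random

    def gossip_round(views):
        # views[i][j] = newest heartbeat node i has heard about node j.
        n = len(views)
        for i in range(n):
            views[i][i] += 1                           # bump own heartbeat
            j = random.choice([k for k in range(n) if k != i])   # pick a gossip buddy
            for node in range(n):                      # reconcile: keep the newest info
                newest = max(views[i][node], views[j][node])
                views[i][node] = views[j][node] = newest

    views = [[0] * 8 for _ in range(8)]                # 8 nodes, no information yet
    for _ in range(4):                                 # O(log n) rounds spread the state
        gossip_round(views)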

Page 33: NoSQL DBMS: Concepts, Techniques and Patterns

Gossip

4 rounds of the protocol

Page 34: NoSQL DBMS: Concepts, Techniques and Patterns

Propagate State via Gossip

• The gossip protocol operates in either a state or an operation transfer model to handle read and update operations as well as replication among database nodes
• State Transfer Model
  Data or deltas of data are exchanged between clients and servers as well as among servers
  Database server nodes maintain vector clocks for their data and also state version trees for conflicting versions, i.e. versions whose vector clocks cannot be brought into a VA < VB or VA > VB relation
  Clients maintain vector clocks for pieces of data they have already requested or updated
  Vector clocks are exchanged and processed in the following manner
    Query Processing
      When a client queries for data, it sends its vector clock of the requested data along with the request
      The server node responds with the part of its state tree for the piece of data that precedes the vector clock attached to the client request, and with the server’s vector clock
      The client has to resolve potential version conflicts
    Update Processing
    Internode Gossiping

Page 35: NoSQL DBMS: Concepts, Techniques and Patterns

Propagate State via Gossip

• Operation Transfer Model
  Operations applicable to locally maintained data are communicated among nodes in the operation transfer mode
    Less bandwidth is consumed to interchange operations than actual data or deltas of data
    It is important to apply the operations in correct order on each node
      A replica node first has to determine the causal relationship between operations (by vector clock comparison) before applying them to its data
  In this model a replica node maintains the following vector clocks
    Vstate: the vector clock corresponding to the last updated state of its data
    Vi: a vector clock for itself where, compared to Vstate, merges with received vector clocks may already have happened
    Vj: the vector clock received with the last gossip message of replica node j (one for each of those replica nodes)
  Vector clocks are exchanged and processed in read-, update- and internode-messages
  Complex protocols handle the queue of (deferred) operations
    correct (causal) order is reasoned from the vector clocks attached to these operations

Page 36: NoSQL DBMS: Concepts, Techniques and Patterns

Contents

• Requirements
• Consistency model
• Partitioning

Page 37: NoSQL DBMS: Concepts, Techniques and Patterns

Partitioning

• Data in large scale systems exceeds the capacity of a single machine and should also be replicated to ensure reliability and to allow scaling measures such as load-balancing
• Ways of partitioning the data of such a system therefore have to be thought about
  Depending on the size of the system and other factors like dynamism (e.g. how often and how dynamically storage nodes may join and leave)
• There are different approaches to this issue:
  Memory Caches
  Clustering
  Master(s)/slaves model
  Sharding
  Consistent Hashing

Page 38: NoSQL DBMS: Concepts, Techniques and Patterns

Memory Caches

• memcached
  ● Simple and lightweight key-value interface where all key-value tuples are stored in and served from DRAM
  ● Originally developed for LiveJournal
  ● YouTube, Reddit, Facebook, Twitter, Wikipedia, ...
  ● Transactions per second: 1–5 M! (the improved MemC3)
  ● Interface
    ● SET/ADD/REPLACE(key, value): add a (key, value) object to the cache
    ● GET(key): retrieve the value associated with a key
    ● DELETE(key): delete a key
  ● Internally, memcached uses a hash table to index the key-value entries
    ● Entries are also kept in a linked list sorted by their most recent access time
    ● The LRU entry is evicted and replaced by a newly inserted entry when the cache is full
    ● Hash collisions are resolved by chaining
    ● Slab-based memory allocation (1 MB pages, sub-divided into fixed-length chunks)
    ● Chunk size of slab class: class 1 is 72 bytes ... class 43 is 1 MB
    ● Trade-off: uniformity against the speed of processing
    ● Cache policy: each slab class maintains its own objects in an LRU order
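The core of this design, a hash table index plus LRU eviction, can be sketched as follows (illustrative; slab allocation omitted):

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()               # hash table + recency order

        def set(self, key, value):
            self.entries.pop(key, None)
            self.entries[key] = value                  # most recently used at the end
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)       # evict the LRU entry

        def get(self, key):
            if key not in self.entries:
                return None                            # cache miss
            self.entries.move_to_end(key)              # mark as most recently used
            return self.entries[key]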

Page 39: NoSQL DBMS: Concepts, Techniques and Patterns

Memory Caches

• memcached (memcached.org/)
  partitioned, though transient, in-memory databases
  replicate the most frequently requested parts of a database to main memory
  rapidly deliver data to clients and disburden database servers significantly
  consists of an array of processes with an assigned amount of memory, launched on several machines in a network and made known to an application via configuration
• memcached protocol
  implementations in different programming languages
  client applications are provided a simple key-/value-store API
  It stores objects placed under a key into the cache by hashing that key against the configured memcached instances
  If a memcached process does not respond, most API implementations ignore the non-answering node and use the responding nodes instead
  This leads to an implicit rehashing of cache objects, as part of them gets hashed to a different memcached server after a cache miss
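A sketch of this classic client-side partitioning (the is_alive check is a hypothetical stand-in; real client libraries differ in detail):

    import hashlib

    servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

    def pick_server(key, alive):
        # Hash the key against the servers that currently respond; when the
        # candidate set changes, part of the keys implicitly rehash elsewhere.
        candidates = [s for s in servers if alive(s)]
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return candidates[h % len(candidates)]

    print(pick_server("user:42", alive=lambda s: True))
    print(pick_server("user:42", alive=lambda s: s != "10.0.0.2:11211"))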

Page 40: NoSQL DBMS: Concepts, Techniques and Patterns

Memory Caches

• When a formerly non-answering node joins the memcached server array again
  keys for part of the data get hashed to it again after a cache miss
  objects now dedicated to that node implicitly leave the memory of some other memcached server they were hashed to while the node was down
  memcached applies an LRU strategy for cache cleaning and also allows specifying timeouts for cache objects
• Logic half in client, half in server
  Clients understand how to choose which server to read from or write an item to, and what to do when they cannot contact a server
  The servers understand how to store and fetch items. They also manage when to evict or reuse memory
• O(1)
  All commands are implemented to be as fast and lock-friendly as possible. This allows near-deterministic query speeds for all use cases
  Queries on slow machines should run in well under 1 ms. High-end servers can serve millions of keys per second in throughput
• Other implementations of memory caches are available, e.g. for application servers like JBoss

Page 41: NoSQL DBMS: Concepts, Techniques and Patterns

Database clusters

• Cost-effective alternative to supercomputers or tightly-coupled multiprocessors
  ● Major difference with a parallel DBMS: “black-box” DBMS at each node
  ● Fast network interconnect; low disk and network latency are essential
  ● Clustering uses multiple servers to work with a shared database system
  ● Each server within a cluster is called a node
• Database cluster architecture
  ● Can have a shared-disk or shared-nothing (preferred) architecture
  ● Parallel data management must be implemented via middleware

Page 42: NoSQL DBMS: Concepts, Techniques and Patterns

Database clusters

• Database cluster architecture
  ● To improve performance and availability, data can be replicated at different nodes using the local DBMS
  ● Application servers interact with the middleware in a classical way to submit ad-hoc queries, transactions, or calls to stored procedures
  ● Specialized nodes with directories
  ● The middleware has several layers
    ● Load balancer, replication manager, query processor and fault-tolerance manager
    ● The load balancer runs a transaction at the node with the lightest transaction load
    ● The replication manager manages access to replicated data and assures strong consistency (same serial order at each node)
    ● The query processor exploits both inter- and intra-query parallelism
    ● The fault-tolerance manager provides on-line recovery and failover
  ● Partitioning
    ● Physical partitioning defines horizontal fragments and allocates them at cluster nodes (resembles fragmentation and allocation design in distributed databases)
    ● Virtual partitioning avoids the problems of static physical partitioning using a dynamic approach and full replication
    ● Virtual partitions are dynamically produced for each query

Page 43: NoSQL DBMS: Concepts, Techniques and Patterns

Database clusters

• High Availability
  Components of a system provide an uninterrupted service end-to-end
  High availability can be achieved through redundancy
• MySQL InnoDB
  InnoDB is a general-purpose storage engine that balances high reliability and high performance
  Nice things: ACID, recovery, primary keys, foreign keys, ...
  InnoDB cluster is a collection of products that work together
    A group of MySQL servers can be configured in a cluster using MySQL Shell
    The cluster has a single master (the primary) = read-write master
    Multiple secondary servers are replicas of the master
    A minimum of three servers is required to create a high-availability cluster
    A client application is connected to the primary via MySQL Router
• MySQL distributed key-value store
  ● Shared-nothing clustering
  ● Auto-sharding
  ● Designed to provide high availability and high throughput with low latency
  ● Allowing for near-linear scalability

Page 44: NoSQL DBMS: Concepts, Techniques and Patterns

Database clusters

• Pros and Cons
  ● If the primary node fails, then the virtual database service will become active on one of the secondary nodes
  ● Some clustering systems operate at the operating-system level, not the database level, e.g. MS SQL
  ● Strives for transparency towards clients, who should not notice talking to a cluster of database servers instead of a single server
  ● Scales the persistence layer of a system to a certain degree
  ● Many criticize that clustering features have only been added on top of DBMSs that were not originally designed for distribution

Page 45: NoSQL DBMS: Concepts, Techniques and Patterns

Master(s)/Slaves models

• Each data partition has a single master and multiple slaves
  Write operations for all or parts of the data are routed to the master(s)
  A number of replica servers (slaves) satisfy read requests
• Read/write protocol
  If the master replicates to its slaves asynchronously, there are no write lags, but if the master crashes before completing replication to at least one slave, the write operation is lost
  Alternatively, the master replicates writes synchronously to at least one slave
  Read requests can go to any replica if the client can tolerate some degree of data staleness
• “Mastership” happens at the virtual node level
  There is NO single physical node that plays the role of the master; therefore the write workload is also distributed across different physical nodes
  When a physical node crashes
    The masters of certain partitions will be lost
    The most up-to-date slave is nominated to become the new master

Page 46: NoSQL DBMS: Concepts, Techniques and Patterns

Master(s)/Slaves models

• The master/slave model works very well in general when the application has a high read/write ratio
  It also works very well when updates happen evenly across the key range
  So it is the predominant model of data replication
• There are 2 ways the master can propagate updates to the slaves
  State transfer and operation transfer
  The state transfer model is more robust against message loss
  Deltas are often used (hash trees of the objects)
• The multi-master model allows updates to happen at any replica
  A single master/slaves model may be unable to spread the workload evenly
  Problems to retain consistency

Page 47: NoSQL DBMS: Concepts, Techniques and Patterns

Master(s)/Slaves models

• Consistency of master(s) models (see the sketch below)
  Strict consistency can be achieved by 2PC
    Bring all N replicas to the same state at every update
    A “prepare” phase, where the coordinator asks every replica to confirm whether it is ready to perform the update
    After gathering positive responses from all N replicas → a “commit” phase, which asks every replica to commit
    If any one of the replicas crashes, the update is unsuccessful
  Quorum-based 2PC
    The coordinator only needs to update W < N replicas
    Write to all N replicas but wait for positive acknowledgment from any W of them
    On read we need to access R replicas (with W + R > N) and return the latest value (by timestamp)
• If the client can tolerate a more relaxed consistency model
  We don’t need to use the 2PC or quorum-based protocol

Page 48: NoSQL DBMS: Concepts, Techniques and Patterns

Sharding

• Partition the data in such a way that data typically requested and updated together resides on the same node
  There are different interpretations of sharding
  Horizontal partitioning of the data store (not of a single table!)
    Each shard may have the same schema
    Usually some subset of attributes is used for the definition of the partitioning
    These attributes form the shard key (sometimes referred to as the partition key)
• Sharding physically organizes the data
  Load and storage volume are evenly distributed among the servers
  Data shards may also be replicated for reasons of reliability and load-balancing
  It may be either allowed to write to a dedicated replica only, or to all replicas maintaining a partition of the data

Page 49: NoSQL DBMS: Concepts, Techniques and Patterns

Sharding

• When an application stores and retrieves data
  Sharding logic directs the application to the appropriate shard
  This sharding logic can be implemented as part of the data access code in the application, or it could be implemented by the data storage system
• Benefits of sharding
  You can scale the system out by adding further shards running on additional storage nodes
  A system can use off-the-shelf hardware rather than specialized and expensive computers for each storage node
  You can reduce contention and improve performance by balancing the workload across shards
  In the cloud, shards can be located physically close to the users that'll access the data
• Sharding strategies (MS SQL)
  Three strategies are commonly used when selecting the shard key and deciding how to distribute data across shards (a sketch of the hash strategy follows the three strategies below)

Page 50: NoSQL DBMS: Concepts, Techniques and Patterns

Sharding

• Lookup strategy
  The sharding logic implements a map that routes a request for data to the shard that contains that data, using the shard key
  There can be a mapping between a virtual shard and the physical partitions

Page 51: NoSQL DBMS: Concepts, Techniques and Patterns

Sharding

• Range strategy
  This strategy groups related items together in the same shard and orders them by shard key; the shard keys are sequential
  Data is usually held in row key order in the shard
  Items that are subject to range queries and need to be grouped together can use a shard key that has the same value for the partition key but a unique value for the row key

Page 52: NoSQL DBMS: Concepts, Techniques and Patterns

Sharding

• Hash strategy
  The sharding logic computes the shard to store an item in based on a hash of one or more attributes of the data
  The purpose of this strategy is to reduce the chance of hotspots
  It achieves a balance between the size of each shard and the average load that each shard will encounter
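A sketch of hash-strategy routing (the tenant-style keys are a made-up example):

    import hashlib

    NUM_SHARDS = 8

    def shard_for(shard_key):
        # Hash one or more attributes of the item to pick its shard.
        digest = hashlib.sha1(str(shard_key).encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    # Sequential keys scatter across shards, reducing the chance of hotspots
    # (under the range strategy they would land on the same shard).
    print([shard_for(f"tenant-{i}") for i in range(5)])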

Page 53: NoSQL DBMS: Concepts, Techniques and Patterns

Sharding

• Advantages and considerations
  Lookup
    This offers more control over the way that shards are configured and used
    Using virtual shards reduces the impact when rebalancing data
      New physical partitions can be added to even out the workload
      The mapping between a virtual shard and the physical partitions can be modified without affecting application code
  Range
    This is easy to implement and works well with range queries
      They can often fetch multiple data items from a single shard in a single operation
    This strategy offers easier data management
      For example, users in the same region are in the same shard
    This strategy doesn't provide optimal balancing between shards
      Rebalancing shards is difficult and might not resolve the problem of uneven load
  Hash
    This strategy offers a better chance of more even data and load distribution
    Request routing can be accomplished directly by using the hash function. There's no need to maintain a map
    Rebalancing shards is difficult

Page 54: NoSQL DBMS: Concepts, Techniques and Patterns

Consistent Hashing

• The topic is “modern”
  It is motivated by issues in present-day systems
    Web caching, peer-to-peer systems, database partitioning
  It’s general and flexible enough to be useful for other problems
• Web caching was the original motivation for consistent hashing, 1997
  Store requested pages in the local cache
  Next time there is no need to retrieve the page from the remote source
  Obvious benefits: less Internet traffic, faster access
  The amount of data used to store pages cached for 100 machines can be substantial
  Where to store them? One machine? A cluster?
• Use a hash function to map URLs to caches
  First intuition of the computer scientist?
  Nice properties of hash functions
    Behaves like a random function, spreading data out evenly and without noticeable correlation across the possible buckets
  Designing good hash functions is not easy

Page 55: NoSQL DBMS: Concepts, Techniques and Patterns

Consistent Hashing

• Say there are n caches, named {0, 1, 2, ..., n−1}
  We store the Web page with URL x at the cache server h(x) mod n
    h(x) is way bigger than n
  The solution works well if you do not change n
    All elements would have to be re-hashed if n changes
    n changes if a cache is added or a cache fails
• Consistent hashing
  What do we want?
    Hash-table-type functionality, where changing n does not affect hashing and almost all objects stay assigned to the same cache when n changes
  Leading idea: hash server names and URLs with the same hash function
    This determines which objects are assigned to which caches

Page 56: NoSQL DBMS: Concepts, Techniques and Patterns

Consistent Hashing

• Inserting a new object x into the database
  Given an object x that hashes to the bucket h(x), and imagining we have already hashed all the cache server names, we scan the buckets to the right of h(x) until we find a bucket h(s) to which the name of some cache s hashes
• The expected load on each of the n cache servers is exactly a 1/n fraction of the objects
• Suppose we add a new cache server s: which objects have to move?
  Only the objects now stored at s
  Adding the nth cache causes only a 1/n fraction of the objects to relocate
• This approach to consistent hashing can also be visualized on a circle

Page 57: NoSQL DBMS: Concepts, Techniques and Patterns

Consistent Hashing

• Standard hash table operations Lookup and Insert?
  We need an efficient implementation of the rightward/clockwise scan for the cache server s that minimizes h(s) subject to h(s) ≥ h(x)
  We want a data structure for storing the cache names, with the corresponding hash values as keys, that supports a fast Successor operation
  Which data structure? Given a key we need the next key and its value
    Not a hash table; a heap? How about a binary search tree? Complexity is O(log n)
• Reducing the variance (see the sketch below)
  The expected load of each cache server is a 1/n fraction of the objects, but the realized load of each cache will vary
  An easy way to decrease this variance is to make k “virtual copies” of each cache s
    hashing with k different hash functions to get h1(s), ..., hk(s)
    Each server now has k copies, all of them labeled s
    Objects are assigned as before: from h(x), we scan clockwise until we encounter one of the hash values of some cache s
  Choosing k ≈ log2 n is large enough to obtain reasonably balanced loads

Page 58: NoSQL DBMS: Concepts, Techniques and Patterns

Consistent Hashing

Virtual copies are also useful for dealing with heterogeneous caches that have different capacities. Use the number of virtual copies of a cache server proportional to the server’s capacity;

Page 59: NoSQL DBMS: Concepts, Techniques and Patterns

Literature

• Christof Strauch, NoSQL Databases. www.christof-strauch.de/nosqldbs.pdf
• Ho, Ricky: NOSQL Patterns. November 2009. horicky.blogspot.com/2009/11/nosql-patterns.html
• Lipcon, Todd: Design Patterns for Distributed Non-Relational Databases, Presentation, 2009. static.last.fm/johan/nosql-20090611/intro_nosql.pdf
• Karger, David, et al., Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, ACM Symposium on Theory of Computing, 1997
• Microsoft Azure, Sharding pattern, Jun 2017. docs.microsoft.com/en-us/azure/architecture/patterns/sharding
• Tim Roughgarden, Gregory Valiant, CS168: The Modern Algorithmic Toolbox, Lecture #1: Introduction and Consistent Hashing, Stanford 2017

Page 60: NoSQL DBMS: Concepts, Techniques and Patterns

Literature

• Couchbase, Optimistic or pessimistic locking – Which one should you pick? blog.couchbase.com/optimistic-or-pessimistic-locking-which-one-should-you-pick/