NoSQL

Or Peles

What is NoSQL

• A collection of various technologies meant to work around RDBMS limitations (mostly performance)

• Not much of a definition...

RDBMS Limitations

• Hard to scale horizontally (for updates)
  – Distributed ACID requires two-phase commit.

• Schema can be a pain
  – Hard to change.
  – Data normalization can slow down queries.
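
To make the two-phase commit point concrete, here is a minimal sketch of the protocol. The `Participant` class and `two_phase_commit` function are hypothetical names made up for illustration, not any database's API. The coordinator must hear from every participant before anyone can commit, which is why it scales poorly.

```python
class Participant:
    """Hypothetical resource manager, e.g. one database node."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote on whether the local transaction can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all rolled back."""
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: unanimous yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # Phase 2: a single "no" vote rolls the transaction back everywhere.
    for p in participants:
        p.rollback()
    return False
```

A real implementation also has to log decisions and handle coordinator failure, which this sketch leaves out.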

Web Scale

• Some numbers:
  – YouTube serves over 100 million videos a day.
  – eBay adds over 10 TB of storage every week.
  – Facebook holds over 80 billion photos, and serves hundreds of thousands of requests/second.

Ideal System

• Available: can always read and write.

• Consistent: reads always pick up the latest write.

• Partition tolerant: the system can be split across multiple machines and datacenters, and keeps working when the links between them fail.

Starbucks doesn’t use two phase commit

• A great example, presented at the link below.

• Asynchronous execution

• Correlation

• Exception handling:
  – Write off
  – Retry
  – Compensation

• 2 phase commit would create a choke point.

http://eaipatterns.com/ramblings/18_starbucks.html
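
The ideas on this slide can be sketched in a few lines. This is an illustration under assumptions, not code from the article: the function names, the `(order_id, drink, paid)` tuple layout, and the failure counts are all made up. Orders sit in a queue, the order id correlates a finished drink with its order, and a failure is retried, then either compensated (refund) or written off.

```python
from collections import deque


def process_orders(orders, make_drink, max_retries=1):
    """Work through queued orders.

    Each order is (order_id, drink, paid). Returns completed {order_id: result},
    refunded [order_id] and written_off [order_id].
    """
    queue = deque(orders)  # the cashier has already queued these orders
    attempts = {}
    completed, refunded, written_off = {}, [], []
    while queue:
        order_id, drink, paid = queue.popleft()
        try:
            # Correlation: the result is filed under the order id.
            completed[order_id] = make_drink(drink)
        except RuntimeError:
            attempts[order_id] = attempts.get(order_id, 0) + 1
            if attempts[order_id] <= max_retries:
                queue.append((order_id, drink, paid))  # retry later
            elif paid:
                refunded.append(order_id)  # compensation: undo the payment
            else:
                written_off.append(order_id)  # write off: nothing to undo
    return completed, refunded, written_off


def flaky_barista(failures):
    """Hypothetical drink maker that fails a set number of times per drink."""
    remaining = dict(failures)

    def make_drink(drink):
        if remaining.get(drink, 0) > 0:
            remaining[drink] -= 1
            raise RuntimeError("machine jammed")
        return drink + " ready"

    return make_drink


orders = [(1, "espresso", True), (2, "latte", True),
          (3, "mocha", True), (4, "mocha", False)]
completed, refunded, written_off = process_orders(
    orders, flaky_barista({"latte": 1, "mocha": 5}))
```

No step blocks waiting for a global agreement; each failure is handled locally after the fact.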

CAP Theorem

• CAP (Eric Brewer, 2000): simply put, of the following 3 properties:
  – Consistency
  – Availability
  – Partition tolerance

Only two can hold in any system at the same time.

CAP in practice

• CA: Two-phase commits, works best at a single data center. Scaling issues.

• CP: Sharding. Data may become unavailable if a shard fails.

• AP: May return inaccurate data. DNS is a prime example.

Consistency Types

• Strict

• Eventual
  – Causal
  – Read your writes
  – Session
  – Monotonic read
  – Monotonic write
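
A toy model may help tell two of these apart. The sketch below assumes one primary and one lagging replica; `ReplicatedStore` and `Session` are hypothetical names, and real systems are far more involved. Plain replica reads are only eventually consistent, while a session that remembers its own writes gets read-your-writes.

```python
class ReplicatedStore:
    """Toy primary/replica pair; replication happens only when asked."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Apply queued writes in order; afterwards the replica has caught up.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read_replica(self, key):
        # Eventual consistency: may return stale data until replicate() runs.
        return self.replica.get(key)


class Session:
    """Read-your-writes: keys this session wrote are read from the primary."""

    def __init__(self, store):
        self.store = store
        self.own_writes = set()

    def write(self, key, value):
        self.store.write(key, value)
        self.own_writes.add(key)

    def read(self, key):
        if key in self.own_writes:
            return self.store.primary.get(key)
        return self.store.read_replica(key)
```

The writing session sees its update at once; any other session sees it only after replication.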

Concepts

• In memory vs. disk based.

• Shared everything vs. shared nothing.

• Master slave vs. server symmetry.

• Elastic scalability.

• MapReduce.

Sharding

• Split data across machines (database instances).
  – Feature based sharding.
  – Key based sharding.
  – Lookup table.
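
The three approaches differ only in how a request is routed to an instance. A minimal sketch follows, assuming three instances; the instance names, feature names and table entries are invented for illustration.

```python
import hashlib

SHARDS = ["db0", "db1", "db2"]  # hypothetical database instances


def key_based_shard(key, shards=SHARDS):
    """Key based: hash the key and take it modulo the number of shards."""
    # A stable hash (not Python's per-process hash()) keeps routing consistent.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# Feature based: each feature (kind of data) lives on its own instance.
FEATURE_SHARDS = {"users": "db0", "photos": "db1", "messages": "db2"}


def feature_based_shard(feature):
    return FEATURE_SHARDS[feature]


# Lookup table: an explicit key-to-shard directory that can be edited freely.
LOOKUP_TABLE = {"alice": "db0", "bob": "db2"}


def lookup_shard(key, default="db0"):
    return LOOKUP_TABLE.get(key, default)
```

Key based routing needs no directory but reshuffles keys when the shard count changes. A lookup table allows moving single keys at the cost of an extra lookup.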

NoSQL Categories

• Key-Value stores

• Document store

• Tabular

Lean & Mean: The Key-Value In-Memory DBs

• In memory DBs are simpler and faster than their on-disk counterparts.

• Key value stores offer a simple interface with no schema.

• Major limitation – data size is limited to RAM size.

• Often used as caches for on-disk DB systems.
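
The caching use can be sketched as the common cache-aside pattern. This is an assumption about how such a cache is typically wired, not something the slides spell out. Plain dicts stand in for the in-memory store and the on-disk database, and `CacheAside` is a hypothetical name.

```python
class CacheAside:
    """Read through an in-memory cache, falling back to the slower database."""

    def __init__(self, database):
        self.database = database  # stand-in for an on-disk DB
        self.cache = {}           # stand-in for an in-memory key-value store
        self.db_reads = 0         # counts how often the slow path was taken

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no database access
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for the next reader
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)  # invalidate so the next read is fresh
```

Repeated reads of a hot key touch the database once; writes go to the database and drop the cached copy.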

Open Source In-Memory DBs

• Memcached/MemcacheDB

• Redis
  – Both are key-value stores that rely on hash partitioning.
  – Memcached is an LRU based cache.
  – Redis is more of a data structure server.
  – Both use a Shared-Nothing architecture.

Memcached

• Really a giant, distributed hash table.

• Advantages:
  – Relatively simple
  – Practically no server to server talk.
  – Linear scalability

• Disadvantages:
  – Doesn’t understand data: no server side operations. The key and value are always strings.
  – It’s really meant to only be a cache, no more, no less.
  – No recovery, limited elasticity.
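
The LRU behaviour can be shown with a small sketch. This is not Memcached's actual implementation, only the eviction idea: when the cache is full, the least recently used entry is dropped. `LRUCache` is a hypothetical name.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # a read counts as a use
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used
```

With a capacity of two, setting `a` and `b`, reading `a`, then setting `c` evicts `b`, because `a` was touched more recently.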

Redis

• Like Memcached, it’s a distributed hash in memory.

• Offers support for lists and sets, as well as strings.

• Offers limited server side operations.

• Supports master-slave architecture and data replicas for scalability and high availability.

• Also supports a persistent mode that writes to disk.
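
What "data structure server" means can be sketched with a toy in-process stand-in. This is not Redis or a Redis client. The method names only mirror the Redis commands LPUSH, SADD and SINTER, and the class is hypothetical. The point is that the server understands lists and sets, so an operation like set intersection runs on the server instead of in the client.

```python
class ToyDataStructureServer:
    """In-process stand-in for a server that stores lists and sets by key."""

    def __init__(self):
        self.data = {}

    def lpush(self, key, value):
        """Prepend value to the list at key; return the new list length."""
        items = self.data.setdefault(key, [])
        items.insert(0, value)
        return len(items)

    def sadd(self, key, member):
        """Add member to the set at key; return 1 if it was new, else 0."""
        members = self.data.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def sinter(self, *keys):
        """Intersect the sets at the given keys on the 'server' side."""
        sets = [self.data.get(key, set()) for key in keys]
        return set.intersection(*sets) if sets else set()
```

With a plain string cache, the client would have to fetch both values and intersect them itself.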

Document Stores

• As the name implies, these databases store documents.

• Usually schema-free. The same database can store documents with different structures.

• Allow indexing based on document content.

• Prominent examples: CouchDB, MongoDB.

Documents

• A document is just a collection of values, usually serialized in JSON.

• Many implementations offer nesting of documents

• Example:

  {
    "username" : "bob",
    "address" : {
      "street" : "123 Main Street",
      "city" : "Springfield",
      "state" : "NY"
    }
  }

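As a small illustration of working with such a document, the sketch below parses the example with Python's standard `json` module, reads a nested value, and adds a field without any schema change. The `email` field and its value are made up for the example.

```python
import json

raw = """
{
  "username": "bob",
  "address": {
    "street": "123 Main Street",
    "city": "Springfield",
    "state": "NY"
  }
}
"""

doc = json.loads(raw)

# Nested documents are reached by walking the keys.
city = doc["address"]["city"]

# Schema-free: a new field can be added to this one document only.
doc["email"] = "bob@example.com"  # hypothetical field, not in the original

serialized = json.dumps(doc)
```
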
CouchDB

• Written in Erlang.

• Offers ACID guarantees based on multi-version concurrency control (MVCC).

• Supports replication, but isn’t a real distributed database.
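
The multi-version idea can be sketched as a revision check on every update. This is a toy model, not CouchDB's API; `VersionedStore` and `ConflictError` are hypothetical names, and the integer revisions are a simplification. A writer must present the revision it last read, and a stale revision is rejected instead of silently overwriting someone else's change.

```python
class ConflictError(Exception):
    """Raised when an update is based on an outdated revision."""


class VersionedStore:
    def __init__(self):
        self.docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        return self.docs[doc_id]

    def put(self, doc_id, body, rev=None):
        """Store body if rev matches the current revision; return the new one."""
        current = self.docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(
                "expected revision %r, got %r" % (current_rev, rev))
        new_rev = (current_rev or 0) + 1
        self.docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

Readers are never blocked: they keep seeing the revision they fetched, and only the conflicting writer has to re-read and retry.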

MongoDB

• Written in C++.

• Atomic operations on single documents only.

• Excellent scalability based on sharding.

• Support for server side JavaScript and MapReduce.

Tabular stores

• The original: Google’s BigTable
  – Proprietary, not open source.

• The open source elephant alternative – Hadoop with HBase.

• A top level Apache Project.

• Large number of users.

• Contains a distributed file system, MapReduce, a database server (HBase), and more.

• Rack aware.

Hadoop components

Hadoop basic components

• At its core, Hadoop is a framework for running MapReduce operations on large data sets.

• The data sets are placed as text files on the distributed file system.

Hadoop MapReduce Flow
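
The flow can be sketched in plain Python with the classic word count. This runs in a single process and does not use Hadoop's API; the function names are made up. It only shows the three steps: map each line to key/value pairs, shuffle the pairs into groups by key, then reduce each group.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = map_reduce(["the quick fox", "the lazy dog the end"])
```

In Hadoop the map and reduce calls run on many machines, close to the file blocks they read; the structure of the computation is the same.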

HBase

• A database engine built on top of the Hadoop Distributed File System (HDFS).

• Scales up to billions of rows with millions of columns.

• Has a Java interface for queries.

The Tradeoff – SQL vs. NoSQL

• RDBMS:
  – Mature.
  – Standard SQL (but not for DDL, extensions).
  – Robust tools.

• NoSQL:
  – Scale
  – Schemaless

References

• Eventually Consistent (Werner Vogels, CTO, Amazon): http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

• Starbucks Doesn’t Use Two-Phase Commit: http://eaipatterns.com/ramblings/18_starbucks.html

• Hadoop: The Definitive Guide (O’Reilly)

• MongoDB: The Definitive Guide (O’Reilly)

• Many wiki pages

Questions
