NoSQL
Or Peles
What is NoSQL
• A collection of various technologies meant to work around RDBMS limitations (mostly performance)
• Not much of a definition...
RDBMS Limitations
• Hard to scale horizontally (for updates)
  – Distributed ACID requires two-phase commit.
• Schemas can be painful
  – Hard to change.
  – Data normalization can slow down queries.
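To make the two-phase commit point concrete, here is a minimal single-process sketch. The `Participant` class and the `two_phase_commit` function are hypothetical names invented for illustration, not any database's API; a real implementation would also need durable logging and timeout handling.

```python
class Participant:
    """Hypothetical resource manager holding one piece of a distributed update."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: the vote is fixed up front
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote on whether the pending change can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all rolled back."""
    # Phase 1: the coordinator waits for a vote from every participant.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: everyone voted yes, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # A single "no" vote aborts the whole transaction.
    for p in participants:
        p.rollback()
    return False
```

Every update has to wait for all participants twice, which hints at why this pattern is hard to scale across many machines.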
Web Scale
• Some numbers:
  – YouTube serves over 100 million videos a day.
  – eBay adds over 10 TB of storage every week.
  – Facebook holds over 80 billion photos, and serves hundreds of thousands of requests/second.
Ideal System
• Available – can always read and write.
• Consistent – reads always pick up the latest write.
• Partition tolerant – the system can be split across multiple machines and datacenters.
Starbucks doesn’t use two phase commit
• A great example presented here.
• Asynchronous execution
• Correlation
• Exception handling:
  – Write off
  – Retry
  – Compensation
• Two-phase commit would create a choke point.
CAP Theorem
• CAP (Eric Brewer, 2000): simply put, of the following three properties:
  • Consistency
  • Availability
  • Partition tolerance
Only two can hold in any one system at the same time.
CAP in practice
• CA
  Two-phase commit; works best in a single datacenter. Scaling issues.
• CP
  Sharding. Data may become unavailable if a shard fails.
• AP
  May return inaccurate data. DNS is a prime example.
Consistency Types
• Strict
• Eventual
  – Causal
  – Read your writes
  – Session
  – Monotonic read
  – Monotonic write
Concepts
• In memory vs. disk based.
• Shared everything vs. shared nothing.
• Master slave vs. server symmetry.
• Elastic scalability.
• MapReduce.
Sharding
• Split data across machines (database instances).
  – Feature based sharding.
  – Key based sharding.
  – Lookup table.
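As a rough illustration of two of these strategies, the sketch below routes a key to a shard by hashing it (key based sharding) and lets an explicit table override that choice (lookup table). The shard names, the `LOOKUP_TABLE` entry, and both function names are assumptions made up for this example.

```python
import hashlib

SHARDS = ["db0", "db1", "db2"]  # hypothetical database instances

# Lookup-table sharding: explicit key -> shard assignments (hypothetical entry).
LOOKUP_TABLE = {"user:vip": "db2"}


def key_based_shard(key, shards=SHARDS):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def shard_for(key):
    """Consult the lookup table first, then fall back to key based sharding."""
    if key in LOOKUP_TABLE:
        return LOOKUP_TABLE[key]
    return key_based_shard(key)
```

Note that plain modulo hashing remaps most keys when the number of shards changes, which is one reason a lookup table can be attractive despite the extra indirection.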
NoSQL Categories
• Key-Value stores
• Document store
• Tabular
Lean & Mean: The Key-Value In-Memory DBs
• In memory DBs are simpler and faster than their on-disk counterparts.
• Key value stores offer a simple interface with no schema.
• Major limitation – data size is limited to RAM size.
• Often used as caches for on-disk DB systems.
Open Source In-Memory DBs
• Memcached/MemcacheDB
• Redis
  – Both are key-value stores that rely on hash partitioning.
  – Memcached is an LRU based cache.
  – Redis is more of a data structure server.
  – Both use a shared-nothing architecture.
Memcached
• Really a giant, distributed hash table.
• Advantages:
  – Relatively simple
  – Practically no server to server talk.
  – Linear scalability
• Disadvantages:
  – Doesn’t understand data – no server side operations. The key and value are always strings.
  – It’s really meant to only be a cache – no more, no less.
  – No recovery, limited elasticity.
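The LRU behaviour behind a cache of this kind can be sketched in a few lines. This is not Memcached's implementation or client API; `LRUCache` is a hypothetical single-process class that only shows how the least recently used entry is evicted once a fixed capacity is exceeded.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None  # cache miss: the caller falls back to the on-disk DB
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used entry
```

Because an entry can vanish at any time, callers have to treat every read as a possible miss, which matches the "only a cache" caveat above.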
Redis
• Like Memcached, it’s a distributed hash in memory.
• Offers support for lists and sets, as well as strings.
• Offers limited server side operations.
• Supports master-slave architecture and data replicas for scalability and high availability.
• Also supports a persistent mode that writes to disk.
Document Stores
• As the name implies, these databases store documents.
• Usually schema-free. The same database can store documents with different structures.
• Allow indexing based on document content.
• Prominent examples: CouchDB, MongoDB.
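Indexing on document content can be pictured with a small in-memory sketch. The helper names `get_path` and `build_index`, the dotted-path convention, and the sample documents are all assumptions for illustration; real document stores define their own index syntax.

```python
def get_path(document, path):
    """Follow a dotted path such as "address.city" into nested documents."""
    value = document
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None  # schema-free: a document may simply lack the field
        value = value[part]
    return value


def build_index(documents, path):
    """Map each value found at `path` to the positions of documents holding it."""
    index = {}
    for position, document in enumerate(documents):
        value = get_path(document, path)
        if value is not None:
            index.setdefault(value, []).append(position)
    return index


# Hypothetical documents with different shapes stored side by side.
documents = [
    {"username": "bob", "address": {"city": "Springfield"}},
    {"username": "alice"},
    {"username": "carol", "address": {"city": "Springfield"}},
]
city_index = build_index(documents, "address.city")
```

Documents that lack the indexed field are skipped rather than rejected, which is the practical meaning of schema-free here.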
Documents
• A document is just a collection of values, usually serialized in JSON.
• Many implementations offer nesting of documents.
• Example:
  {
    "username" : "bob",
    "address" : {
      "street" : "123 Main Street",
      "city" : "Springfield",
      "state" : "NY"
    }
  }
CouchDB
• Written in Erlang.
• Offers ACID guarantees based on multi-version concurrency control (MVCC).
• Supports replication, but isn’t a real distributed database.
MongoDB
• Written in C++.
• Atomic operations on single documents only.
• Excellent scalability based on sharding.
• Support for server side JavaScript and MapReduce.
Tabular stores
• The original: Google’s BigTable
  – Proprietary, not open source.
• The open source elephant alternative – Hadoop with HBase.
• A top level Apache Project.
• Large number of users.
• Contains a distributed file system, MapReduce, a database server (HBase), and more.
• Rack aware.
Hadoop components
Hadoop basic components
• At its core, Hadoop is a framework for running MapReduce operations on large data sets.
• The data sets are placed as text files on the distributed file system.
Hadoop MapReduce Flow
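The map, shuffle, and reduce steps can be sketched as a word count over lines of text. This runs in a single Python process and does not use Hadoop's API; the function names are hypothetical, and a real job would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a framework run them in parallel on separate nodes.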
HBase
• A database engine built on top of the Hadoop Distributed File System (HDFS).
• Scales up to billions of rows with millions of columns.
• Has a Java interface for queries.
The Tradeoff – SQL vs. NoSQL
• RDBMS:
  – Mature.
  – Standard SQL (but not for DDL or vendor extensions).
  – Robust tools.
• NoSQL:
  – Scale
  – Schemaless
References
• Eventual Consistency (Werner Vogels, CTO, Amazon)
• Starbucks doesn’t use two phase commit.
• Hadoop: The Definitive Guide (O’Reilly)
• MongoDB: The Definitive Guide (O’Reilly)
• Many wiki pages
Questions