Cloud Scale Distributed Data Storage · Cassandra •Different implementation of column family than...
Cloud Scale Distributed Data Storage
Jürmo Mehine
2013
Outline
• Background
– Relational model
– Database scaling
– Keys, values and aggregates
– The NoSQL landscape
• Non-relational data models
– Key-value
– Document-oriented
– Column family
– Graph
• Five databases
– Riak
– MongoDB
– CouchDB
– HBase
– Cassandra
The Relational Model
• Data in tables (relations)
• Connections between data are expressed as foreign key references between rows
• The expected format of the data is specified with a schema
• Data is accessed using SQL
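The points above can be sketched with Python's built-in sqlite3 module. The table and column names (customer, customer_order, item) are hypothetical and chosen only for illustration; the sketch shows a schema, a foreign key reference between rows, and SQL access through a JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# The schema specifies the expected format of the data.
conn.executescript("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        item        TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO customer_order (id, customer_id, item) VALUES (?, ?, ?)",
    [(10, 1, "book"), (11, 1, "lamp")],
)

# The connection between a customer and its orders is followed with a JOIN.
rows = conn.execute("""
    SELECT c.name, o.item
    FROM customer AS c
    JOIN customer_order AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()
print(rows)
```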
Database Scaling
• Vertical scaling – on one machine
• Horizontal scaling – across a cluster of machines
• Because of the way data is structured in relational databases, the relational model does not scale well horizontally
Keys, Values and Aggregates
• Non-relational data models are based on a simpler key-value structure
• An aggregate is a collection of data that is treated as a unit
– E.g. a customer and all of his orders
• In normalized relational databases aggregates are computed using JOIN operations
• Key-value, document-oriented and column family are said to be aggregate-oriented data models because they store aggregated data together
• Keyed aggregates make for a natural unit of data sharding
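As a rough sketch of why keyed aggregates shard naturally: each aggregate is stored whole under one key, and a stable hash of that key picks the shard. The key format ("customer:1"), the shard count and the helper names are assumptions made for this illustration, not part of any particular database.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key, num_shards=NUM_SHARDS):
    """Map an aggregate key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each aggregate is a customer together with all of that customer's orders.
aggregates = {
    "customer:1": {"name": "Alice", "orders": [{"id": 10, "item": "book"}]},
    "customer:2": {"name": "Bob", "orders": []},
    "customer:3": {"name": "Carol", "orders": [{"id": 11, "item": "lamp"}]},
}

# Place every aggregate, as a unit, on the shard chosen by its key.
shards = {i: {} for i in range(NUM_SHARDS)}
for key, aggregate in aggregates.items():
    shards[shard_for(key)][key] = aggregate


def lookup(key):
    """Only one shard has to be contacted to read a whole aggregate."""
    return shards[shard_for(key)].get(key)


print(lookup("customer:1"))
```

Plain modulo placement like this remaps most keys when the number of shards changes; the consistent hashing used by Dynamo-style stores addresses that.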
Benefits of the Key-value Model
• Horizontal scalability (suitable for cloud computing)
• Higher access speeds
• Flexible schemaless models suitable for unstructured data
• More suitable for Object Oriented languages
The NoSQL Movement
• NoSQL is a broad term with no clear boundaries
• Emergence of persistence solutions using non-relational data models
• Schemaless data models
• Driven partly by the rise of cloud computing
• Simpler key-value based data models scale better horizontally than the relational model
• Tradeoff between data consistency and high availability
The NoSQL Landscape
• Distributed key-value based data stores
• Embedded non-relational data stores (LevelDB, Tokyo Cabinet)
• In-memory data stores (Redis, Memcached)
• Hosted services (Amazon S3, DynamoDB)
• Data analysis and processing platforms/tools (Hadoop/HDFS, Pig, Hive)
• Object databases (ZODB, GemStone)
• NewSQL (Drizzle, SQLFire)
Four Non-relational Data Models
The Key-value Model
• Data stored as key-value pairs
• The value is an opaque blob to the database
• Examples: Dynamo, Riak
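A minimal in-memory sketch of the idea, assuming a hypothetical KeyValueStore class: the store only ever sees bytes, so any structure inside the value (JSON here) is the client's concern and cannot be queried by the store itself.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """Toy key-value store: values are opaque bytes and never inspected."""

    def __init__(self):
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()

# The client decides how to serialize; the store just keeps the blob.
store.put("customer:1", json.dumps({"name": "Alice"}).encode("utf-8"))

# Only the client knows that the blob is JSON and how to decode it.
blob = store.get("customer:1")
customer = json.loads(blob.decode("utf-8"))
print(customer["name"])
```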
The Document-oriented Model
• Data stored as key-value pairs
• Values have further structure in the form of fields and values
• Examples: CouchDB, MongoDB
Example
• Aggregates described in JSON using map and array data structures
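The original figure for this slide is not part of the transcript. As a stand-in, the following hypothetical customer aggregate (field names invented for illustration) shows how maps and arrays nest in JSON, and how the structure inside the value can be navigated once it is parsed.

```python
import json

# A hypothetical aggregate: one customer with all of their orders.
document = """
{
  "id": 1,
  "name": "Alice",
  "orders": [
    {"id": 10, "items": [{"product": "book", "quantity": 2}]},
    {"id": 11, "items": [{"product": "lamp", "quantity": 1}]}
  ]
}
"""

customer = json.loads(document)

# JSON objects become dicts (maps) and JSON arrays become lists.
products = [
    item["product"]
    for order in customer["orders"]
    for item in order["items"]
]
print(products)
```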
The Column Family Model
• Data stored in large tabular structures (sparse matrix)
• Columns grouped into column families
• A record or row can be thought of as a two-level map
• Columns can be added at any time
• Examples: BigTable, Cassandra, HBase
http://www.datastax.com/docs/1.1/ddl/column_family
Example
• Aggregates as rows in a column family store
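The slide's figure is missing from the transcript, so here is a small sketch of the two-level map view of a row. The class ColumnFamilyTable and the family and column names are hypothetical; the point is that families are fixed up front, while columns inside a family can be added per row at any time, which leaves the table sparse.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table: row key -> column family -> column name -> value."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns need no prior declaration; each row has its own set.
        self.rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        """Return one family of one row as a plain dict (empty if absent)."""
        return dict(self.rows.get(row_key, {}).get(family, {}))


table = ColumnFamilyTable(families=["profile", "orders"])
table.put("customer:1", "profile", "name", "Alice")
table.put("customer:1", "orders", "order:10", "book")
table.put("customer:1", "orders", "order:11", "lamp")
table.put("customer:2", "profile", "email", "bob@example.com")

print(table.get_family("customer:1", "orders"))
```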
Graph Databases
• Data stored as nodes and edges of a graph
• The nodes and edges can have fields and values
• Aimed at graph traversal queries in connected data
• Examples: Neo4J, FlockDB
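A minimal sketch of a traversal query over nodes and edges that both carry fields, assuming a tiny hypothetical "follows" graph held in plain Python structures rather than any real graph database.

```python
from collections import deque

# Nodes and edges can both carry fields and values.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "carol": {"kind": "person"},
    "dave": {"kind": "person"},
}
edges = [
    ("alice", "bob", {"type": "follows"}),
    ("bob", "carol", {"type": "follows"}),
    ("carol", "dave", {"type": "follows"}),
    ("dave", "alice", {"type": "follows"}),
]


def neighbours(node, edge_type):
    """Nodes reachable from `node` over one outgoing edge of the given type."""
    return [dst for src, dst, props in edges
            if src == node and props["type"] == edge_type]


def reachable(start, edge_type, max_depth):
    """Breadth-first traversal: nodes within `max_depth` hops of `start`."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in neighbours(node, edge_type):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found


print(reachable("alice", "follows", max_depth=2))
```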
Non-relational Data Models
• In non-relational data stores data is denormalized
• It is common to also store JSON documents in key-value stores
• The key-value, document-oriented and column family models are aggregate-oriented models
• The other models are based on the key-value model
• We will not talk about graph databases, because they are not essential to cloud storage
• In reality the classification of databases into different models is not as straightforward
• Multiparadigm databases also exist (e.g. ArangoDB)
Five Databases
Riak
• Key-value
• Based on Amazon's Dynamo specification
• Distributed decentralized persistent hash table
• Consistent hashing
http://basho.com/products/riak-overview/
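Consistent hashing can be sketched generically as a ring of hash points. This is not Riak's actual ring implementation; the HashRing class, the node names and the number of virtual points per node are assumptions made for illustration. Each key is owned by the first node point at or after the key's hash, wrapping around the ring.

```python
import bisect
import hashlib


def ring_hash(value):
    """Stable 64-bit position on the ring for a string."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, points_per_node=64):
        # Every node contributes several points so load spreads more evenly.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        """Walk clockwise from the key's position to the next node point."""
        idx = bisect.bisect_right(self._points, ring_hash(key))
        return self._ring[idx % len(self._ring)][1]


keys = [f"key{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys that change owner when node-d joins.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved")
```

The useful property is that adding a node only takes keys over to the new node; keys never shuffle between the existing nodes.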
Riak
• Gossip protocol
• MVCC using vector clocks
• RESTful query interface
• Secondary indexes as item metadata
• Links and link walking
• Riak search
• MapReduce in Erlang and JavaScript
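The vector clocks mentioned above can be sketched as per-actor counters. This is a generic illustration, not Riak's internal representation; the function names and actor ids are hypothetical. Comparing two clocks tells whether one version descends from the other or whether the two were written concurrently and need to be reconciled.

```python
def increment(clock, actor):
    """Return a copy of the clock with the actor's counter advanced."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"


def merge(a, b):
    """Clock for a value that reconciles both versions."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


base = increment({}, "client-x")   # first write
v1 = increment(base, "client-y")   # client-y updates the base version
v2 = increment(base, "client-z")   # client-z updates it independently

print(compare(v1, base))  # v1 descends from base
print(compare(v1, v2))    # neither descends from the other
print(merge(v1, v2))
```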
MongoDB
• Document oriented (BSON)
• Query language based on JavaScript
• GridFS for BLOB file storage
• Master-slave architecture
• Linking between documents
• Supports MapReduce in JavaScript
http://www.mongodb.org/
CouchDB
• Document oriented (JSON)
• RESTful query interface
• Built-in web server
• Web-based query front end (Futon)
http://docs.couchdb.org/en/latest/
CouchDB
• MapReduce is the primary query method (JavaScript and Erlang)
• Materialized views as results of incremental MapReduce jobs
• CouchApps – JavaScript-heavy web applications built entirely on top of CouchDB without a separate web server for the logic layer
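CouchDB view functions are written in JavaScript or Erlang; the sketch below only mimics the shape of a map/reduce view in Python so that the data flow is visible. The documents, field names and helper names are hypothetical, and the view is rebuilt from scratch rather than updated incrementally as CouchDB does.

```python
from collections import defaultdict

docs = [
    {"_id": "order:10", "customer": "alice", "total": 30},
    {"_id": "order:11", "customer": "bob", "total": 15},
    {"_id": "order:12", "customer": "alice", "total": 20},
    {"_id": "note:1", "text": "not an order"},
]


def map_doc(doc):
    """Emit (key, value) pairs for one document, like a view's map step."""
    if "customer" in doc and "total" in doc:
        yield doc["customer"], doc["total"]


def reduce_values(values):
    """Combine all values emitted under one key."""
    return sum(values)


def build_view(documents):
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    # The stored result is what later queries read, keyed and sorted.
    return {key: reduce_values(values)
            for key, values in sorted(grouped.items())}


view = build_view(docs)
print(view)
```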
HBase
• Column family
• Master-slave architecture
• Built on top of Hadoop
• Data stored in HDFS
• A single table with several column families
• Supports Java MapReduce
http://hbase.apache.org/
Cassandra
• Column family
• Data model from BigTable, distribution model from Dynamo (decentralized)
http://cassandra.apache.org/
Cassandra
• Different implementation of column family than HBase
• The concept of super-columns
• Static and dynamic column families
• Dynamic column families as materialized views on data
• CQL for querying data
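One way to picture the difference between static and dynamic column families, using plain Python dictionaries instead of Cassandra itself: the static family has the same known columns in every row, while the dynamic family uses application-supplied column names so that one row holds a precomputed answer to one query. The families users and users_by_city and the helper functions are hypothetical, and keeping the second family up to date is assumed here to be the application's job.

```python
# Static column family: every row has the same, known columns.
users = {}          # row key (user id) -> {"name": ..., "city": ...}

# Dynamic column family: column names are data (user ids), one row per city.
users_by_city = {}  # row key (city) -> {user id: name}


def save_user(user_id, name, city):
    """Write the user row and keep the per-city view row in step."""
    previous = users.get(user_id)
    if previous is not None and previous["city"] != city:
        users_by_city[previous["city"]].pop(user_id, None)
    users[user_id] = {"name": name, "city": city}
    users_by_city.setdefault(city, {})[user_id] = name


def users_in(city):
    """Answer the query by reading a single row, columns sorted by name."""
    return sorted(users_by_city.get(city, {}).items())


save_user("u1", "Alice", "Tartu")
save_user("u2", "Bob", "Tartu")
save_user("u3", "Carol", "Tallinn")
save_user("u2", "Bob", "Tallinn")  # Bob moves; the view row is updated

print(users_in("Tartu"))
print(users_in("Tallinn"))
```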
Polyglot Persistence
http://martinfowler.com/bliki/PolyglotPersistence.html
Conclusions
• In recent years there has been a rise in non-relational (NoSQL) data stores
• This is related to the rise of cloud computing – key-value based models offer better horizontal scalability across a cluster of computers
• The NoSQL landscape is extremely varied
• The four basic non-relational data models are:
– The key-value model
– The document-oriented model
– The column family model
– Graph databases