Finding your Way in the Midst of the NoSQL Haze - Abdelmonaim Remani

Post on 16-Jul-2015


Email polymathiccoder@gmail.com

Twitter @PolymathicCoder

Finding Your Way In the Midst of the NoSQL Haze

Abdelmonaim Remani

JAX London 2014 London, UK

October 14, 2014

The Database

Any system that primarily stores data in a certain way, with the main purpose of reading it back at a later time

The Story of Data

The Invention of Paper

Gutenberg’s Printing Press

Turing Machine

The Flat File Model

The Invention of Paper

The Flat File Model

A single binary or text file that contains data organized in a tabular format.

Each line corresponds to a record represented as a series of values formatted as fixed-length or variable-length fields. The latter requires one or more extra characters to function as the separator or delimiter.

A header record specifies the field names associated with the values, given their positions.

The most primitive form of databases

Data storage and retrieval

The Flat File Model

Comma-Separated Values (CSV)
(One of the most popular flat file formats)

• Nothing describes a record beyond what is implied by the table structure of the one file (a header record is the closest thing to metadata)
• No way to model associations between records or fields

We needed a data schema & a way to represent more structured data

eXtensible Markup Language (XML)

• Data schema can be expressed in DTD or XML Schema
• Associations are modeled by nesting tags or embedding URIs of other files

The cost is XML parsing/building overhead
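To make the flat file model concrete, here is a minimal sketch of reading a CSV file with a header record, using Python's standard csv module. The file contents and the field names (id, author, text) are made up for illustration.

```python
import csv
import io

# Hypothetical flat file: a header record followed by one record per line.
FLAT_FILE = """id,author,text
1,alice,Hello
2,bob,Hi there
"""

def read_records(text):
    """Parse comma-delimited text, using the header record as field names."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_records(FLAT_FILE)

# Every value comes back as a string: the file carries no type information,
# and nothing in it links one record to another.
print(records[1]["author"])
```

The header record is the only metadata available, which is why the parser has nothing better than strings to offer for every field.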

A Double-Edged Sword

Simple
• Easy to share across the network (using ubiquitous protocols like FTP)
• Data access is deferred to the underlying OS
• Straightforward to manipulate (a file I/O API is part of the core library of almost every programming language)
• Easy to secure (file system permissions of the host OS)

But…
• Not always available: when a file is open for reading or writing it is locked, so there is no concurrent access
• Sequential: operations involve loading the entire dataset into memory (sorting, filtering, aggregations, etc…)

Relational Databases

Gutenberg’s Printing Press

A system that manages data beyond basic storage and retrieval

A system that is always available, providing concurrent access to data at all times, while preserving data integrity and consistency

A solution that is generic enough to be applicable to any business case, yet accommodating enough to fulfill its very specific needs

The De Facto Standard
• Wide adoption and familiarity
• A plethora of tools and mature RDBMS solutions

The Relational Model

A new data model

First introduced in 1969 by Edgar F. Codd in “Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks”

Data is stored according to a schema of two-dimensional tables (relations), each a collection of rows (tuples) of values assigned to columns (attributes). Data integrity and referential integrity are guaranteed within the schema by a set of explicitly defined constraints

Relational Concepts

Normalization, SQL, and ACID Transactions
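As an illustration of explicitly defined constraints, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table and column names (conversation, message, topic, body) are hypothetical and not from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE conversation (id INTEGER PRIMARY KEY, topic TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE message (
        id INTEGER PRIMARY KEY,
        conversation_id INTEGER NOT NULL REFERENCES conversation(id),
        body TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO conversation (id, topic) VALUES (1, 'NoSQL')")
conn.execute("INSERT INTO message (id, conversation_id, body) VALUES (1, 1, 'Hello')")

def insert_orphan():
    """Try to store a message that points at a conversation that does not exist."""
    try:
        conn.execute(
            "INSERT INTO message (id, conversation_id, body) VALUES (2, 99, 'Lost')"
        )
        return True
    except sqlite3.IntegrityError:
        # The schema, not the application, rejected the row.
        return False

orphan_accepted = insert_orphan()
```

The point of the sketch is that referential integrity lives in the schema: the application never has to check that conversation 99 exists.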

The Relational Model

Normalization

Adhering to a set of recommendations while designing or evolving a database schema, with the goal of guaranteeing consistency by preventing insertion, deletion, and modification anomalies

Normal Form

A schema is said to be normalized, or in normal form, when it satisfies all the recommendations

There are many normal forms. The most notable is Boyce-Codd Normal Form (BCNF or 3.5NF)

Most normal forms favor smaller tables with well-defined constraints and discourage redundancy
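A small sketch of the kind of modification anomaly that normalization is meant to prevent, written with plain Python dictionaries rather than a real database. The records and field names are invented for illustration.

```python
# Denormalized: the conversation topic is repeated in every message row.
denormalized = [
    {"message_id": 1, "topic": "NoSQL", "body": "Hello"},
    {"message_id": 2, "topic": "NoSQL", "body": "Hi"},
]
# Updating only one copy leaves the data contradicting itself.
denormalized[0]["topic"] = "NewSQL"
topics = {row["topic"] for row in denormalized}

# Normalized: the topic is stored once and referenced by key.
conversations = {1: {"topic": "NoSQL"}}
messages = [
    {"message_id": 1, "conversation_id": 1, "body": "Hello"},
    {"message_id": 2, "conversation_id": 1, "body": "Hi"},
]
conversations[1]["topic"] = "NewSQL"
normalized_topics = {
    conversations[m["conversation_id"]]["topic"] for m in messages
}
```

After the update, the redundant layout holds two different topics for the same conversation, while the normalized layout holds exactly one.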

The Relational Model

SQL (Structured Query Language)

A very flexible and somewhat standardized language for managing data in relational databases

DDL (Data Definition Language)
• To define the database schema

DML (Data Manipulation Language)
• Basic CRUD on one table
• Operations across multiple tables (JOIN, etc…)
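The split between DDL and DML can be sketched with sqlite3 as follows. The schema is hypothetical, and SQLite's dialect stands in for SQL in general.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema.
conn.executescript(
    """
    CREATE TABLE conversation (id INTEGER PRIMARY KEY, topic TEXT);
    CREATE TABLE message (id INTEGER PRIMARY KEY, conversation_id INTEGER, body TEXT);
    """
)

# DML: basic create, update, and delete on one table at a time.
conn.execute("INSERT INTO conversation VALUES (1, 'NoSQL')")
conn.executemany(
    "INSERT INTO message VALUES (?, ?, ?)",
    [(1, 1, "Hello"), (2, 1, "Hi"), (3, 1, "Oops")],
)
conn.execute("UPDATE message SET body = 'Hi there' WHERE id = 2")
conn.execute("DELETE FROM message WHERE id = 3")

# DML: a read that spans two tables with a JOIN.
rows = conn.execute(
    """
    SELECT c.topic, m.body
    FROM conversation AS c
    JOIN message AS m ON m.conversation_id = c.id
    ORDER BY m.id
    """
).fetchall()
```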

The Relational Model

ACID Transactions

A transaction is a logical unit of work that consists of multiple operations

When processing a transaction, the relational model guarantees all ACID properties

• Atomicity: All occur or none occurs
• Consistency: Will always transition from one valid state to another valid state
• Isolation: Concurrent operations will never force the database into an invalid state
• Durability: The data is permanently persisted once the transaction is committed
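Atomicity is the easiest of the four properties to demonstrate in a few lines. The sketch below assumes a hypothetical message/media schema and relies on the sqlite3 connection acting as a context manager that commits on success and rolls back when an exception escapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE message (id TEXT PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE media (id TEXT PRIMARY KEY, message_id TEXT NOT NULL, url TEXT NOT NULL);
    """
)

def post_message(message_id, body, media_id, url):
    """Insert a message and its media as one unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO message VALUES (?, ?)", (message_id, body))
            conn.execute(
                "INSERT INTO media VALUES (?, ?, ?)", (media_id, message_id, url)
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = post_message("MG1", "Hello", "MD1", "http://example.com/a.png")
# The media row violates NOT NULL, so the message insert is undone as well.
second = post_message("MG2", "Broken", "MD2", None)
message_ids = [row[0] for row in conn.execute("SELECT id FROM message ORDER BY id")]
```

The second call fails on its second statement, and the already executed first statement disappears with it: all occur or none occurs.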

The Relational Model

Designed under an important assumption

The user will directly interact with the RDBMS through a terminal console

Implication 1

Usability as a design goal (it must provide a reasonable and human-friendly interface)
• The system enforces data integrity and guarantees ACIDity
• SQL is English-like

Implication 2

The data access patterns are unknown and virtually infinite
• Data is structured in a way that is not biased towards any particular access pattern
• SQL is flexible enough to express virtually any query

Assuming the Logical Data Model below

Conversations 1 - * Messages 1 - * Medias

For a “Conversation n”: 🔑 CVn Unique Identifier / ID, 📜 CVn Associated Data
For a “Message n”: 🔑 MGn Unique Identifier / ID, 📜 MGn Associated Data
For a “Media n”: 🔑 MDn Unique Identifier / ID, 📜 MDn Associated Data

With the Following Access Patterns

Conversations
• Write 📜 CVn
• Read 📜 CVn by 🔑 CVn
• Read (Direction A): none
• Read (Direction B): 📜 CVm by 🔑 MGn, 📜 CVm by 🔑 MDn

Messages
• Write 📜 MGn
• Read 📜 MGn by 🔑 MGn
• Read (Direction A): 📜 MG* by 🔑 CVn
• Read (Direction B): 📜 MGm by 🔑 MDn

Medias
• Write 📜 MDn
• Read 📜 MDn by 🔑 MDn
• Read (Direction A): 📜 MD* by 🔑 MGn, 📜 MD* by 🔑 CVn
• Read (Direction B): none

The Relational Physical Data Model Would Be

Conversations: 🔑 CV1 / 📜 CV1, 🔑 CV2 / 📜 CV2

Messages: 🔑 MG1 / 📜 MG1, 🔑 MG2 / 📜 MG2, 🔑 MG3 / 📜 MG3

Medias: 🔑 MD1 / 📜 MD1, 🔑 MD2 / 📜 MD2, 🔑 MD3 / 📜 MD3, 🔑 MD4 / 📜 MD4

Conversation-Message associations: 🔑 CV1 - 🔑 MG1, 🔑 CV1 - 🔑 MG2, 🔑 CV2 - 🔑 MG3

Message-Media associations: 🔑 MG1 - 🔑 MD1, 🔑 MG1 - 🔑 MD2, 🔑 MG2 - 🔑 MD3, 🔑 MG3 - 🔑 MD4
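The relational layout of this example can be sketched in sqlite3 as three entity tables plus two association tables. The table names (cv, mg, md, cv_mg, mg_md) and the placeholder data strings are assumptions made for this sketch; the keys and associations are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cv (id TEXT PRIMARY KEY, data TEXT);
    CREATE TABLE mg (id TEXT PRIMARY KEY, data TEXT);
    CREATE TABLE md (id TEXT PRIMARY KEY, data TEXT);
    CREATE TABLE cv_mg (cv_id TEXT REFERENCES cv(id), mg_id TEXT REFERENCES mg(id));
    CREATE TABLE mg_md (mg_id TEXT REFERENCES mg(id), md_id TEXT REFERENCES md(id));
    """
)
# Each record is stored exactly once, in its own table.
for table, count in (("cv", 2), ("mg", 3), ("md", 4)):
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(f"{table.upper()}{i}", f"data of {table.upper()}{i}") for i in range(1, count + 1)],
    )
conn.executemany(
    "INSERT INTO cv_mg VALUES (?, ?)",
    [("CV1", "MG1"), ("CV1", "MG2"), ("CV2", "MG3")],
)
conn.executemany(
    "INSERT INTO mg_md VALUES (?, ?)",
    [("MG1", "MD1"), ("MG1", "MD2"), ("MG2", "MD3"), ("MG3", "MD4")],
)

def medias_of_conversation(cv_id):
    """Direction A: MD* by CVn, answered with a JOIN across the associations."""
    rows = conn.execute(
        """
        SELECT mg_md.md_id
        FROM cv_mg JOIN mg_md ON mg_md.mg_id = cv_mg.mg_id
        WHERE cv_mg.cv_id = ?
        ORDER BY mg_md.md_id
        """,
        (cv_id,),
    )
    return [row[0] for row in rows]

def conversation_of_media(md_id):
    """Direction B: CVm by MDn, answered from the same tables."""
    row = conn.execute(
        """
        SELECT cv_mg.cv_id
        FROM mg_md JOIN cv_mg ON cv_mg.mg_id = mg_md.mg_id
        WHERE mg_md.md_id = ?
        """,
        (md_id,),
    ).fetchone()
    return row[0] if row else None
```

Both read directions are served by the same unbiased structure, with no record duplicated anywhere.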

The Relational Model

The gist of it!

Redundancy is the root of all evil!

Database Darwinism

Survival of the fittest

RDBMS vs. The World

Fight!

“Let's get ready to rumble…” - Michael Buffer

Software evolved beyond data management to data processing

Users demand more elaborate user interfaces

Relational databases evolved

More than a datastore
• Embedding business logic through stored procedures and triggers
• Supporting user management and security
• Etc…

Some morphed into full-blown application platforms, embedding runtimes and providing generic extensions of SQL
(Oracle introduced PL/SQL and shipped with an implementation of the JVM)

RDBMS vs. The World 1 - 0

Software got more complex

As the complexity of data processing increased and GUIs (Graphical User Interfaces) became the norm, relational databases struggled to deliver adequate performance

The database is designed to manage data, not to process it

No Silver Bullet

Deploying code on the database
• Performance issues worsened
• Brittle deployment
• Data corruption risk increased
• Security concerns

Fueling Developer-DBA Wars

An architecture where the UI and business logic are off-loaded to external applications built on top of the database

RDBMS vs. The World 1 - 1

Software got even more complex

OOP (Object-Oriented Programming): interaction of hierarchical object structures, each encapsulating its own data and behavior

O/R Impedance Mismatch

Mismatch! Persisting data in relational databases became difficult
• A single object does not necessarily map to a single row, and vice versa
• OOP concepts like polymorphism and inheritance simply do not exist in the relational model

We dealt with it
• “The Third Manifesto” by Christopher J. Date
• Design patterns: Active Record
• The development of ORM frameworks like Hibernate, etc…

RDBMS vs. The World 1 - 1

Hoarding data & asking questions…

Higher-throughput applications with bigger datasets and more complex queries

The increasing data volume adversely impacted the performance of relational databases

Database Tuning and Optimization

At the level of the Schema
• Secondary indexes
• Optimizing queries

At the level of the RDBMS as a whole
• Tuning vendor-specific parameters
• Scaling Up/Vertically: buy the beefiest machine you can afford (expensive and certainly not sustainable)

RDBMS vs. The World 2 - 1

Hoarding more data & asking more questions…

More intrusive measures

At the level of the Schema
• Periodically refreshed materialized views
• De-normalizing the schema

At the level of the RDBMS as a whole
• Scaling Out/Horizontally: running on a cluster (never designed to be that way)
• Master/Slave Model
• Data Sharding

The Master/Slave Model

Writes are handled by a single node (the master)

Reads are handled by the rest of the nodes (the slaves)

The master propagates updates to the slaves

• Improves reads only
• The dataset must fit in one machine
• Risks dirty reads from out-of-sync nodes
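The read risk listed above can be seen in a toy, in-process simulation of the model. The class and method names here are invented for the sketch, and propagation is triggered by hand to make the lag visible; a real RDBMS replicates over the network on its own schedule.

```python
class Node:
    """One database node, reduced to a dictionary."""

    def __init__(self):
        self.data = {}


class MasterSlaveCluster:
    def __init__(self, slave_count):
        self.master = Node()
        self.slaves = [Node() for _ in range(slave_count)]
        self.pending = []  # updates accepted by the master but not yet propagated
        self._next_slave = 0

    def write(self, key, value):
        """All writes go to the single master."""
        self.master.data[key] = value
        self.pending.append((key, value))

    def read(self, key):
        """Reads are spread round-robin over the slaves."""
        slave = self.slaves[self._next_slave % len(self.slaves)]
        self._next_slave += 1
        return slave.data.get(key)

    def replicate(self):
        """The master propagates its pending updates to every slave."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


cluster = MasterSlaveCluster(slave_count=2)
cluster.write("status", "single")
stale = cluster.read("status")  # the slaves have not caught up yet
cluster.replicate()
fresh = cluster.read("status")
```

Adding slaves adds read capacity, but every write still funnels through one node, and any read that lands before propagation sees old data.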

Data Sharding

Dividing up the dataset into subsets, called partitions, according to some criterion, called the shard key. A partition of data must be small enough to fit in a single node, and a good shard key is one that distributes data evenly across all partitions.

• Improves both reads and writes
• The dataset does not have to fit in one machine
• Applications must be aware of the sharding strategy
• Re-sharding sucks (possibly having to re-shuffle data)
• You can’t JOIN across partitions or enforce referential integrity
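A minimal sketch of hash-based sharding, which is only one possible strategy (range-based and directory-based schemes also exist). The function and class names are hypothetical, and each shard is an in-memory dictionary standing in for a node.

```python
import hashlib

def shard_for(shard_key, shard_count):
    """Map a shard key to a partition number using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # The application (or a routing layer) must know the sharding strategy.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.put(f"MG{i}", f"message {i}")

sizes = [len(shard) for shard in store.shards]

# Re-sharding: count how many keys would land on a different partition
# if a fifth shard were added under the same modulo scheme.
moved = sum(shard_for(f"MG{i}", 4) != shard_for(f"MG{i}", 5) for i in range(1000))
```

With this naive modulo scheme most keys change partition when the shard count changes, which is the re-shuffling cost mentioned above.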

Hoarding more data & asking more questions…

RDBMS vs. The World

What happened to “Redundancy is the root of all evil” !??

Time-out!

RDBMS vs. The World Forfeit

The World Wins!

The CAP Theorem

Eric Brewer on Distributed Systems

Pick two out of the three: Consistency, Availability, and Partition Tolerance

Think “The Iron Triangle” (The Project Management Triangle)

There is no good, cheap, and fast service

The CAP Theorem

The Relational Model?

It is a CA system

It favors Consistency and Availability

It can never be Partition Tolerant

It makes sense… it was designed to run on one machine!

It is not like we have a choice… Are there any successful distributed systems that are partition tolerant?

An AP System

DNS (Domain Name System)

Not all the nodes have the most up-to-date record set

You register a domain name and wait for some time for the rest of the DNS servers on the internet to be synced up eventually

Did we give up consistency altogether?

We settled for a lesser degree of consistency

Eventual Consistency as opposed to Immediate Consistency

BASE: Basically Available, Soft State, Eventual Consistency

An AP Datastore

Is that even possible?

• Mohammed in Morocco changed his relationship status to single on a nearby edge node
• His cousin in Spain saw the status change immediately because they happen to get their data from the same node
• His secret admirer Sara, who lives across the Atlantic in the United States, could not see it until an hour later
• His brother in Japan got the update the next day

They all got it eventually!

Welcome to the curious case of NoSQL datastores!

NoSQL

Turing Machine

A wide range of specialized datastores with the goal of addressing the challenges of the relational model

“The whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for” - Eric Evans

NoSQL

NoSQL does not mean anti-SQL or anti-relational. It is simply any datastore that is not relational

Since we are willing to drop consistency, why not…

It’s a slippery slope…

• Logical schema? Well-defined and rigid in relational. Why not go commando?
• Physical schema? B-Trees in relational. Why not use another data structure?
• Integrity constraints? Who cares!
• A query language? That can wait!
• Security & user management? Forget it!

SQL vs. NoSQL

SQL                                    NoSQL
Designed to run on a single machine    Designed to run on a cluster
CA                                     AP/CA/CP
Scales vertically                      Scales horizontally
Full indexes                           Indexes on keys mostly
Rigid schema                           Flexible or no schema
Any queries                            Pre-defined queries

It’s the wild west… There are many outliers and hybrid datastores!

The NoSQL Zoo

A wide variety!

• Key-Value Datastores
• Columnar Datastores
• Document Datastores
• Graph Datastores


Document Datastores

Documents of nested structures of hashes and their values

Biggest Concern
• Complex JOINs across documents
• Querying against the root of aggregates

Biggest Advantage
• Very flexible schema
• Good queryability
• No impedance mismatch
• Very good leverage of Map/Reduce
• Great JSON support

Most Popular Solutions
• MongoDB
• CouchDB

The Document Physical Data Model Would Be

🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1
        🔑 MD1 / 📜 MD1
        🔑 MD2 / 📜 MD2
    🔑 MG2 / 📜 MG2
        🔑 MD3 / 📜 MD3

🔑 CV2 / 📜 CV2
    🔑 MG3 / 📜 MG3
        🔑 MD4 / 📜 MD4

The Document Physical Data Model Would Be

One document per entity, each embedding its related records:

🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1
        🔑 MD1 / 📜 MD1
        🔑 MD2 / 📜 MD2
    🔑 MG2 / 📜 MG2
        🔑 MD3 / 📜 MD3

🔑 CV2 / 📜 CV2
    🔑 MG3 / 📜 MG3
        🔑 MD4 / 📜 MD4

🔑 MG1 / 📜 MG1
    🔑 CV1 / 📜 CV1
    🔑 MD1 / 📜 MD1
    🔑 MD2 / 📜 MD2

🔑 MG2 / 📜 MG2
    🔑 CV1 / 📜 CV1
    🔑 MD3 / 📜 MD3

🔑 MG3 / 📜 MG3
    🔑 CV2 / 📜 CV2
    🔑 MD4 / 📜 MD4

🔑 MD1 / 📜 MD1
    🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1

🔑 MD2 / 📜 MD2
    🔑 CV1 / 📜 CV1
    🔑 MG1 / 📜 MG1

🔑 MD3 / 📜 MD3
    🔑 CV1 / 📜 CV1
    🔑 MG2 / 📜 MG2

🔑 MD4 / 📜 MD4
    🔑 CV2 / 📜 CV2
    🔑 MG3 / 📜 MG3
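In code, the document layout might look like the sketch below, where plain Python dictionaries stand in for a document datastore. The field names (_id, data, messages, medias, message, conversation) are assumptions for illustration and do not refer to any particular product's API.

```python
import json

# One aggregate rooted at the conversation: serves the Direction A reads.
conversation_doc = {
    "_id": "CV1",
    "data": "data of CV1",
    "messages": [
        {
            "_id": "MG1",
            "data": "data of MG1",
            "medias": [
                {"_id": "MD1", "data": "data of MD1"},
                {"_id": "MD2", "data": "data of MD2"},
            ],
        },
        {
            "_id": "MG2",
            "data": "data of MG2",
            "medias": [{"_id": "MD3", "data": "data of MD3"}],
        },
    ],
}

# A second document rooted at a media, repeating its parents so that the
# Direction B reads need no JOIN. This is deliberate redundancy.
media_doc = {
    "_id": "MD3",
    "data": "data of MD3",
    "message": {"_id": "MG2", "data": "data of MG2"},
    "conversation": {"_id": "CV1", "data": "data of CV1"},
}

collection = {doc["_id"]: doc for doc in (conversation_doc, media_doc)}

def medias_of_conversation(cv_id):
    """Direction A: MD* by CVn, read from a single document."""
    doc = collection[cv_id]
    return [md["_id"] for mg in doc["messages"] for md in mg["medias"]]

def conversation_of_media(md_id):
    """Direction B: CVm by MDn, read from the media-rooted document."""
    return collection[md_id]["conversation"]["_id"]

serialized = json.dumps(conversation_doc)  # documents map naturally to JSON
```

Each access pattern is answered by fetching one document by its key, at the cost of storing the same record in several places.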

Key-Value Datastores

A big distributed hash map or associative array

Biggest Concern
• Querying by anything other than the key (no secondary indexes mostly)

Biggest Advantage
• A simple data model
• Very fast reads and writes
• Highly scalable

Most Popular Solutions
• Amazon DynamoDB
• Riak
• Redis

The Key-Value Physical Data Model Would Be

🔑 CV1 📜 CV1
🔑 CV2 📜 CV2
🔑 MG1 📜 MG1
🔑 MG2 📜 MG2
🔑 MG3 📜 MG3
🔑 MD1 📜 MD1
🔑 MD2 📜 MD2
🔑 MD3 📜 MD3
🔑 MD4 📜 MD4

The Key-Value Physical Data Model Would Be

🔑 CV1 📜 CV1
🔑 CV2 📜 CV2
🔑 MG1 📜 MG1, 🔑 CV1#MG1 📜 MG1
🔑 MG2 📜 MG2, 🔑 CV1#MG2 📜 MG2
🔑 MG3 📜 MG3, 🔑 CV2#MG3 📜 MG3
🔑 MD1 📜 MD1, 🔑 MG1#MD1 📜 MD1, 🔑 CV1#MD1 📜 MD1
🔑 MD2 📜 MD2, 🔑 MG1#MD2 📜 MD2, 🔑 CV1#MD2 📜 MD2
🔑 MD3 📜 MD3, 🔑 MG2#MD3 📜 MD3, 🔑 CV1#MD3 📜 MD3
🔑 MD4 📜 MD4, 🔑 MG3#MD4 📜 MD4, 🔑 CV2#MD4 📜 MD4
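The composite keys above can be sketched with a plain dictionary standing in for the datastore. One assumption is made loudly here: the store can list keys by prefix. Not every key-value store offers that; where it is missing, an extra index entry per parent would be needed instead. The function names are invented for this sketch.

```python
store = {}

def put_message(cv_id, mg_id, data):
    """Write the message under its own key and under a composite key."""
    store[mg_id] = data
    store[f"{cv_id}#{mg_id}"] = data  # redundant copy, keyed for Direction A

def get(key):
    """The basic operation: read a value by its exact key."""
    return store.get(key)

def messages_of_conversation(cv_id):
    """Direction A: MG* by CVn, assuming the store supports a key-prefix scan."""
    prefix = f"{cv_id}#MG"
    return [store[key] for key in sorted(store) if key.startswith(prefix)]

put_message("CV1", "MG1", "data of MG1")
put_message("CV1", "MG2", "data of MG2")
put_message("CV2", "MG3", "data of MG3")

by_key = get("MG2")
by_conversation = messages_of_conversation("CV1")
```

The access pattern is baked into the key: a question the key design did not anticipate cannot be asked without scanning everything.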

Columnar Datastores

A table where data of the same column is stored together

Biggest Concern
• Key design is not trivial (you need to know your access patterns beforehand)

Biggest Advantage
• Great for sparse data
• Very fast column operations (e.g. aggregation)
• Support versioning and data compression

Most Popular Solutions
• Google BigTable
• HBase
• Cassandra


The Columnar Physical Data Model Would Be

🔑 CV1 📜 CV1
🔑 CV2 📜 CV2
🔑 MG1 📜 MG1, 🔑 CV1#MG1 📜 MG1
🔑 MG2 📜 MG2, 🔑 CV1#MG2 📜 MG2
🔑 MG3 📜 MG3, 🔑 CV2#MG3 📜 MG3
🔑 MD1 📜 MD1, 🔑 MG1#MD1 📜 MD1, 🔑 CV1#MD1 📜 MD1
🔑 MD2 📜 MD2, 🔑 MG1#MD2 📜 MD2, 🔑 CV1#MD2 📜 MD2
🔑 MD3 📜 MD3, 🔑 MG2#MD3 📜 MD3, 🔑 CV1#MD3 📜 MD3
🔑 MD4 📜 MD4, 🔑 MG3#MD4 📜 MD4, 🔑 CV2#MD4 📜 MD4
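What "data of the same column is stored together" means can be sketched with a toy in-memory structure. The ColumnStore class, the column names (size_kb, caption), and the values are all invented for illustration; real columnar datastores add column families, versioning, and compression on top of this idea.

```python
from collections import defaultdict

class ColumnStore:
    """Toy column-oriented table: one mapping of row key to value per column."""

    def __init__(self):
        self.columns = defaultdict(dict)  # column name -> {row key -> value}

    def put(self, row_key, column, value):
        self.columns[column][row_key] = value

    def get_row(self, row_key):
        """Reassemble a row; absent cells simply take no space (sparse data)."""
        return {
            name: cells[row_key]
            for name, cells in self.columns.items()
            if row_key in cells
        }

    def column_sum(self, column):
        """A column operation touches only that column's values."""
        return sum(self.columns[column].values())


table = ColumnStore()
table.put("CV1#MD1", "size_kb", 120)
table.put("CV1#MD2", "size_kb", 300)
table.put("CV1#MD2", "caption", "sunset")
table.put("CV2#MD4", "size_kb", 80)

total_size = table.column_sum("size_kb")
sparse_row = table.get_row("CV1#MD1")
full_row = table.get_row("CV1#MD2")
```

The aggregation never reads the caption column, and the row without a caption stores nothing for it.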

Graph Datastores

A graph data structure

Biggest Concern
• Does NOT scale horizontally

Biggest Advantage
• Perfect for interconnected data
• Allows for modeling explicit relationships
• Fine-grained graph traversal
• Supports ACID transactions

Most Popular Solutions
• Neo4J

The Graph Physical Data Model Would Be

Nodes: 🔑 CV1 / 📜 CV1, 🔑 CV2 / 📜 CV2, 🔑 MG1 / 📜 MG1, 🔑 MG2 / 📜 MG2, 🔑 MG3 / 📜 MG3, 🔑 MD1 / 📜 MD1, 🔑 MD2 / 📜 MD2, 🔑 MD3 / 📜 MD3, 🔑 MD4 / 📜 MD4

The Graph Physical Data Model Would Be

Nodes: Conversation 1, Conversation 2, Message 1, Message 2, Message 3, Media 1, Media 2, Media 3, Media 4

Relationships (as labeled on the slide): Has (7), Reply To (1), Belongs In (2)
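A minimal sketch of the graph model as an adjacency list with a breadth-first traversal. Only the Has relationships are encoded, following the associations of the running example; the endpoints of the Reply To and Belongs In relationships are not recoverable from the transcript, so they are left out. The function names are hypothetical.

```python
from collections import deque

# Has relationships, one entry per node that owns something.
HAS = {
    "CV1": ["MG1", "MG2"],
    "CV2": ["MG3"],
    "MG1": ["MD1", "MD2"],
    "MG2": ["MD3"],
    "MG3": ["MD4"],
}

# Reverse index so the same relationships can be walked in the other direction.
OWNER = {child: parent for parent, children in HAS.items() for child in children}

def reachable(start):
    """Direction A: breadth-first traversal along Has relationships."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for neighbor in HAS.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order

def root_of(node):
    """Direction B: follow Has relationships backwards to the conversation."""
    while node in OWNER:
        node = OWNER[node]
    return node

below_cv1 = reachable("CV1")
root_of_md4 = root_of("MD4")
```

Relationships are first-class here: both directions are answered by walking edges, with no JOIN and no duplicated records.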


Implications in Code…

Concurrent Joins & Distributed Transactions
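The transcript does not include the code shown at this point in the talk. One plausible reading of "concurrent joins" is that, without a JOIN in the datastore, the application fetches related records itself and can issue those lookups in parallel. The sketch below illustrates that reading only; the dictionaries stand in for remote datastore calls and every name in it is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for two remote lookups by key.
MESSAGES_BY_CONVERSATION = {"CV1": ["MG1", "MG2"], "CV2": ["MG3"]}
MEDIAS_BY_MESSAGE = {"MG1": ["MD1", "MD2"], "MG2": ["MD3"], "MG3": ["MD4"]}

def fetch_medias(mg_id):
    """Pretend network call: read MD* by MGn."""
    return mg_id, MEDIAS_BY_MESSAGE.get(mg_id, [])

def join_conversation(cv_id):
    """Application-side join: one lookup for the messages, then parallel lookups."""
    mg_ids = MESSAGES_BY_CONVERSATION.get(cv_id, [])
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves the order of mg_ids in its results.
        return dict(pool.map(fetch_medias, mg_ids))

joined = join_conversation("CV1")
```

The work a relational database would do inside one JOIN moves into application code, along with its failure handling, and the same shift applies to transactions that span several nodes.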


How to choose?

Use a relational database unless you are expecting a lot of data

Know your data: schema, density, the three Vs (Volume, Velocity, and Variety), etc…

Know your access patterns: read/write ratio, frequency, the likelihood of them changing, etc…

Know the associated development effort and cost

Know your administration effort and cost

Consider using Big Data technologies if needed (HDFS, Hadoop, Pig, Hive, etc…)

Polyglot Persistence?

Leveraging multiple datastores based on the specific way the data is structured and the way it is accessed

Go Polyglot

The idea that one data model will perfectly fit the complexity of the data and accommodate the variety of all the access patterns of a sufficiently elaborate application is absurd

Don’t go overboard!
• The learning curve can be very steep
• The dev effort can be significant

To be fair…

You can’t expect a flathead screwdriver to work on a Phillips screw as well as one with the matching Phillips blade

Relational databases are so awesome they deserve the title of “The Honey Badger of Datastores”

Thank You!

Email polymathiccoder@gmail.com

Twitter @PolymathicCoder