Relational databases vs Non-relational databases

James Serra, Big Data Evangelist

(RDBMS vs NoSQL vs Hadoop)

About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”

Agenda
• Definition and differences
• ACID vs BASE
• Four categories of NoSQL
• Use cases
• CAP theorem
• On-prem vs cloud
• Product categories
• Polyglot persistence
• Architecture samples

Goal
My goal is to give you a high-level overview of all the technologies so you know where to start, and to put you on the right path to be a hero!

Relational and non-relational defined
Relational databases
• Also called relational database management systems (RDBMS) or SQL databases
• Most popular are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2
• Mostly used in large enterprise scenarios (the exception is MySQL, which is mostly used to store data for web applications, typically as part of the popular LAMP stack)
• Analytical RDBMS (OLAP, MPP) solutions include Analytics Platform System, Teradata, and Netezza
Non-relational databases
• Also called NoSQL databases
• Most popular are MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j
• Usually grouped into four categories: key-value stores, wide-column stores, document stores, and graph stores

Hadoop: Made up of Hadoop Distributed File System (HDFS), YARN and MapReduce

Origins
Using SQL Server, I need to index a few thousand documents and search them.
No problem. I can use Full-Text Search.

I’m a healthcare company and I need to store and analyze millions of medical claims per day.
Problem. Enter Hadoop.

Using SQL Server, my internal company app needs to handle a few thousand transactions per second.
No problem. I can handle that with a good-sized server.

Now I have Pokémon Go, where users can enter millions of transactions per second.
Problem. Enter NoSQL.

But most enterprise data just needs an RDBMS (89% market share – Gartner).

Main differences (Relational)
Pros
• Works with structured data
• Supports strict ACID transactional consistency
• Supports joins
• Built-in data integrity
• Large eco-system
• Relationships via constraints
• Limitless indexing
• Strong SQL
• OLTP and OLAP
• Most off-the-shelf applications run on RDBMS

Main differences (Relational)
Cons
• Does not scale out horizontally (concurrency and data size), only vertically, unless you use sharding
• Data is normalized, meaning lots of joins, affecting speed
• Difficulty in working with semi-structured data
• Schema-on-write
• Cost
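
The sharding mentioned in the cons above can be sketched roughly as follows. This is a minimal illustration of hash-based routing, not any vendor's implementation; the shard names and the `shard_for` helper are hypothetical.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate database server.
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]

def shard_for(customer_id: str) -> str:
    """Pick a shard by hashing the key, so the same key always lands on the same shard."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for cid in ["cust-1001", "cust-1002", "cust-1003"]:
    print(cid, "->", shard_for(cid))
```

The application (or a routing layer) has to do this lookup before every query, which is part of why sharding an RDBMS adds complexity.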

Main differences (Non-relational/NoSQL)
Pros
• Works with semi-structured data (JSON, XML)
• Scales out (horizontal scaling: parallel query performance, replication)
• High concurrency, high volume random reads and writes
• Massive data stores
• Schema-free, schema-on-read
• Supports documents with different fields
• High availability
• Cost
• Simplicity of design: no “impedance mismatch”
• Finer control over availability
• Speed, due to not having to join tables

Main differences (Non-relational/NoSQL)
Cons
• Weaker or eventual consistency (BASE) instead of ACID
• Limited support for joins, does not support star schema
• Data is denormalized, requiring mass updates (e.g. a product name change)
• Does not have built-in data integrity (must do in code)
• No relationship enforcement
• Limited indexing
• Weak SQL
• Limited transaction support
• Slow mass updates
• Uses 10-50x more space (replication, denormalized, documents)
• Difficulty tracking schema changes over time
• Most NoSQL databases are still too immature for reliable enterprise operational applications
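
To make the "denormalized data requires mass updates" point concrete, here is a small sketch. The order documents and the `rename_product` helper are hypothetical; a plain Python list stands in for a document collection.

```python
# Hypothetical order documents; each embeds its own copy of the product name.
orders = [
    {"order_id": 1, "product_id": "p1", "product_name": "Widget", "qty": 2},
    {"order_id": 2, "product_id": "p2", "product_name": "Gadget", "qty": 1},
    {"order_id": 3, "product_id": "p1", "product_name": "Widget", "qty": 5},
]

def rename_product(docs, product_id, new_name):
    """Rewrite every document that embeds the product; return how many were touched."""
    touched = 0
    for doc in docs:
        if doc["product_id"] == product_id:
            doc["product_name"] = new_name
            touched += 1
    return touched

updated = rename_product(orders, "p1", "Widget Pro")
print(updated, "documents rewritten for one logical change")
```

In a normalized relational design the same rename would be a single-row update in a products table.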

Main differences (Hadoop)
Pros
• Not a type of database, but rather an open-source software ecosystem that allows for massively parallel computing
• No inherent structure (no conversion to relational or JSON needed)
• Good for batch processing, large files, volume writes, parallel scans, sequential access
• Great for large, distributed data processing tasks where time isn’t a constraint (e.g. end-of-day reports, scanning months of historical data)
• Tradeoff: In order to make deep connections between many data points, the technology sacrifices speed
• Some NoSQL databases such as HBase are built on top of HDFS

Main differences (Hadoop)
Cons
• File system, not a database
• Not good for millions of users, random access, fast individual record lookups or updates (OLTP)
• Not so great for real-time analytics
• Lacks: indexing, metadata layer, query optimizer, memory management
• Same cons as non-relational: no ACID support, no data integrity, limited indexing, weak SQL, etc.
• Security limitations
• More complex debugging
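
For readers new to the MapReduce model that Hadoop is built around, the batch-style processing described above can be sketched in plain Python. This is a single-process toy, not Hadoop's API; the function names and sample lines are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

lines = ["claims data claims", "data lake"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a real cluster the map and reduce calls run in parallel across many machines, each scanning its own slice of the files, which is why the model suits large sequential scans better than individual record lookups.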

Hadoop adoption has considerably slowed
• Too much hype
• Companies adopt it without understanding use cases (i.e. real big data)
• Difficulty in finding skillset
• Pace of change too fast
• Too many products involved in a solution
• Other technologies (RDBMS, NoSQL) improving and expanding use cases
• Higher learning curve

ACID (RDBMS) vs BASE (NoSQL)

ATOMICITY: All data and commands in a transaction succeed, or all fail and roll back
CONSISTENCY: All committed data must be consistent with all data rules including constraints, triggers, cascades, atomicity, isolation, and durability
ISOLATION: Other operations cannot access data that has been modified during a transaction that has not yet completed
DURABILITY: Once a transaction is committed, data will survive system failures, and can be reliably recovered after an unwanted deletion
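
Atomicity is easiest to see in a failed transfer. The sketch below uses Python's built-in `sqlite3` module as a stand-in for any RDBMS; the `accounts` table, the `transfer` helper, and the balances are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

# The debit violates the CHECK constraint, so the earlier credit is rolled back too.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The constraint is declared once in the schema and enforced by the database, which is the "robust database, simpler code" side of the trade-off.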

Basically Available: Guaranteed availability
Soft-state: The state of the system may change, even without a query (because of node updates)
Eventually Consistent: The system will become consistent over time

ACID | BASE
Strong Consistency | Weak Consistency (stale data OK)
Isolation | Last Write Wins
Transaction | Programmer Managed
Available/Consistent | Available/Partition Tolerant
Robust Database/Simpler Code | Simpler Database, Harder Code
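
"Last write wins" and "eventually consistent" can be illustrated with two toy replicas. This is a simplified sketch under assumed names (`Replica`, `sync`) and integer timestamps; real systems use more careful clocks and conflict handling.

```python
class Replica:
    """Toy replica that keeps, per key, the value with the newest timestamp."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def sync(a, b):
    """Exchange entries both ways; the newer timestamp wins on each side."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)

node_a, node_b = Replica(), Replica()
node_a.write("cart:42", ["book"], ts=1)
node_b.write("cart:42", ["book", "pen"], ts=2)

stale = node_a.read("cart:42")  # node_a has not seen the newer write yet
sync(node_a, node_b)
converged = node_a.read("cart:42") == node_b.read("cart:42")
print(stale, converged)
```

Between the write and the sync a reader on `node_a` gets stale data, which is the window the application code has to tolerate.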

Relational stores
• Data stored in tables.
• Tables contain some number of columns, each of a type.
• A schema describes the columns each table can have.
• Every table’s data is stored in one or more rows.
• Each row contains a value for every column in that table.
• Rows aren’t kept in any particular order.

Thanks to: Harri Kauhanan, http://www.slideshare.net/harrikauhanen/nosql-3376398
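
A minimal sketch of these ideas, using Python's built-in `sqlite3` module as a stand-in for any relational store. The `customers` and `orders` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema: typed columns per table, plus a relationship between the two tables.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ian'), (2, 'Alan');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Rows have no inherent order, so ORDER BY is needed for a predictable result.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```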


Key-value stores
Key-value stores offer very high speed via the least complicated data model: anything can be stored as a value, as long as each value is associated with a key or name.

Example: Key “dog_12”: value_name “Stella”, value_mood “Happy”, etc.
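
The same idea as a few lines of Python, with a dictionary standing in for the key-value store. The `put` and `get` helpers are hypothetical; the value is serialized to show that the store treats it as an opaque blob.

```python
import json

store = {}  # stand-in for a key-value store: opaque values looked up by key

def put(key, value):
    store[key] = json.dumps(value)  # the store does not interpret the value

def get(key):
    raw = store.get(key)
    return json.loads(raw) if raw is not None else None

put("dog_12", {"name": "Stella", "mood": "Happy"})
print(get("dog_12")["name"])
```

Lookups go through the key only; finding "all happy dogs" would mean scanning every value, which is the trade-off for the simple model.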

Wide-column stores
Wide-column stores are fast and can be nearly as simple as key-value stores. They include a primary key, an optional secondary key, and anything stored as a value.

[Figure: wide-column store layout showing a primary key, a secondary key, and values. Keys and values can be sparse or numerous.]
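
A rough in-memory picture of that layout, assuming a sensor-readings example. The `put` and `get` helpers, key names, and column names are all hypothetical.

```python
from collections import defaultdict

# primary key -> secondary key -> columns; rows may carry different columns (sparse)
table = defaultdict(dict)

def put(primary, secondary, **columns):
    table[primary].setdefault(secondary, {}).update(columns)

def get(primary, secondary):
    return table.get(primary, {}).get(secondary, {})

put("sensor-7", "2016-07-01T10:00", temp_c=21.5)
put("sensor-7", "2016-07-01T10:05", temp_c=21.7, humidity=40)
put("sensor-9", "2016-07-01T10:00", battery=0.93)

print(get("sensor-7", "2016-07-01T10:05"))
```

Each row stores only the columns it actually has, so adding `humidity` to one reading does not affect any other row.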

Document stores
Document stores contain data objects that are inherently hierarchical, tree-like structures (most notably JSON or XML). Not Word documents!
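
A small sketch of what "hierarchical documents with different fields" looks like in practice. The product documents and the `find` helper are hypothetical, and a Python list stands in for a collection.

```python
# Hypothetical collection: documents are nested and need not share the same fields.
products = [
    {"sku": "b-1", "title": "Road Atlas", "author": {"name": "A. Writer"}, "tags": ["maps"]},
    {"sku": "t-9", "title": "Desk Lamp", "specs": {"watts": 40, "color": "black"}},
]

def find(docs, path, expected):
    """Return documents whose dotted path (e.g. 'specs.watts') equals expected."""
    matches = []
    for doc in docs:
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node is not None and node == expected:
            matches.append(doc)
    return matches

lamps = find(products, "specs.watts", 40)
print([doc["sku"] for doc in lamps])
```

Documents that lack the `specs` field are simply skipped; nothing had to be declared up front, which is the schema-on-read behavior.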


Graph store
[Figure: graph of person nodes (Name: Ian, Name: Alan) and book nodes (Title: Forgotten Bridges, Title: Mythical Bridges), connected by purchase relationships labeled PurchasedDate: 03-02-2011, PurchasedDate: 09-09-2011, and PurchasedDate: 05-07-2011.]
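
A graph store keeps nodes and the relationships between them as first-class data. The sketch below reuses the names, titles, and dates from the slide, but which person bought which title on which date is not recoverable from the transcript, so the pairings here are assumptions. The helper functions are hypothetical.

```python
# Edges as (source, relationship, target, properties); pairings are assumed for illustration.
edges = [
    ("Ian", "PURCHASED", "Forgotten Bridges", {"date": "03-02-2011"}),
    ("Ian", "PURCHASED", "Mythical Bridges", {"date": "09-09-2011"}),
    ("Alan", "PURCHASED", "Mythical Bridges", {"date": "05-07-2011"}),
]

def purchased_by(person):
    return {dst for src, rel, dst, _ in edges if src == person and rel == "PURCHASED"}

def also_bought(person):
    """Titles bought by people who share at least one purchase with `person`."""
    mine = purchased_by(person)
    others = {src for src, rel, dst, _ in edges if dst in mine and src != person}
    return set().union(*(purchased_by(o) for o in others)) - mine

recs = also_bought("Alan")
print(recs)
```

Traversals like "people who bought this also bought" follow relationships directly instead of joining tables, which is the kind of query graph stores are built for.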

Use cases for NoSQL categories
• Key-value stores: [Redis] For cache, queues, fit in memory, rapidly changing data, store blob data. Examples: shopping cart, session data, leaderboards, stock prices. Fastest performance
• Wide-column stores: [Cassandra] Real-time querying of random (non-sequential) data, huge number of writes, sensors. Examples: Web analytics, time series analytics, real-time data analysis, banking industry. Internet scale
• Document stores: [MongoDB] Flexible schemas, dynamic queries, defined indexes, good performance on big DB. Examples: order data, customer data, log data, product catalog, user generated content (chat sessions, tweets, blog posts, ratings, comments). Fastest development
• Graph databases: [Neo4j] Graph-style data, social network, master data management, network and IT operations. Examples: social relations, real-time recommendations, fraud detection, identity and access management, graph-based search, web browsing, portfolio analytics, gene sequencing, class curriculum

Note: Many NoSQL solutions are now multi-model

Velocity
Volume Per Day | Real-world Transactions Per Day | Real-world Transactions Per Second | Relational DB | Document DB | Key Value or Wide Column
8 GB | 8.64B | 100,000 | As Is | - | -
86 GB | 86.4B | 1M | Tuned* | As Is | -
432 GB | 432B | 5M | Appliance | Tuned* | As Is
864 GB | 864B | 10M | Clustered Appliance | Clustered Servers | Tuned*
8,640 GB | 8.64T | 100M | - | Many Clustered Servers | Clustered Servers
43,200 GB | 43.2T | 500M | - | - | Many Clustered Servers

* Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)
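
The per-day transaction figures above follow from the per-second rates at 86,400 seconds per day. A quick check (the `per_day` helper is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_day(transactions_per_second: int) -> int:
    return transactions_per_second * SECONDS_PER_DAY

for tps in (100_000, 1_000_000, 5_000_000, 10_000_000, 100_000_000, 500_000_000):
    print(f"{tps:>11,} tx/s -> {per_day(tps):,} tx/day")
```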

Focus of different data models

…you may not have the data volume for NoSQL (yet), but there are other reasons to use NoSQL (semi-structured data, schemaless, high availability, etc)

NewSQL
Relational NewSQL stores are designed for web-scale applications, but still require up-front schemas, joins, and table management that can be labor intensive.

Blend RDBMS with NoSQL: provide the same scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional relational database system.

Use cases for different database technologies
• Traditional OLTP business systems (e.g. ERP, CRM, in-house app): relational database (RDBMS)
• Data warehouses (OLAP): relational database (SMP or MPP)
• Web and mobile global OLTP applications: non-relational database (NoSQL)
• Data lake: Hadoop
• Relational and scalable OLTP: NewSQL

CAP Theorem
Impossible for any shared data system to guarantee simultaneously all of the following three properties:
• Consistency: Once data is written, all future read requests will return that data. “Is the data I’m looking at now the same if I look at it somewhere else?”
• Availability: The database is always available and responsive. “What happens if my database goes down?”
• Partition tolerance: If part of the database is unavailable, other parts are unaffected. “What if my data is on a different node?”
Relational: CA (e.g. SQL Server with no replication)
Non-relational: AP (Cassandra, CouchDB, Riak); CP (HBase, DocumentDB, MongoDB, Redis)

A distributed NoSQL store can’t be both consistent and available when a node is unreachable. Take two nodes, A and B, where B goes down: if A keeps taking requests, it is available but no longer consistent with B; if A stops taking requests, it stays consistent with B but is not available. An RDBMS is consistent and available because it has only one node/partition (so no partition tolerance).
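
The two-node scenario can be simulated in a few lines. This is a toy model, not how any particular database implements replication; the `Node` and `Cluster` classes and the "AP"/"CP" mode switch are assumptions made for the example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.up = True

class Cluster:
    """Two-node toy cluster: 'AP' keeps serving when B is down, 'CP' refuses writes."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Node("A"), Node("B")

    def write(self, value):
        if self.b.up:
            self.a.value = self.b.value = value  # replicate to both nodes
            return "ok"
        if self.mode == "AP":
            self.a.value = value  # available, but A and B now disagree
            return "ok, B is stale"
        return "rejected"  # CP: stay consistent by giving up availability

results = {}
for mode in ("AP", "CP"):
    cluster = Cluster(mode)
    cluster.write("v1")
    cluster.b.up = False  # B goes down
    outcome = cluster.write("v2")
    results[mode] = (outcome, cluster.a.value, cluster.b.value)
print(results)
```

In AP mode the write succeeds and the nodes diverge; in CP mode the nodes still agree but the write was refused.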

Microsoft data platform solutions
Product | Category | Description | More Info
SQL Server 2016 | RDBMS | Earned top spot in Gartner’s Operational Database magic quadrant. JSON support | https://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/
SQL Database | RDBMS/DBaaS | Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support | https://azure.microsoft.com/en-us/services/sql-database/
SQL Data Warehouse | MPP RDBMS/DBaaS | Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost | https://azure.microsoft.com/en-us/services/sql-data-warehouse/
Analytics Platform System (APS) | MPP RDBMS | Big data analytics appliance for high performance and seamless integration of all your data | https://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
Azure Data Lake Store | Hadoop storage | Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics | https://azure.microsoft.com/en-us/services/data-lake-store/
Azure Data Lake Analytics | On-demand analytics job service/Big Data-as-a-service | Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language | https://azure.microsoft.com/en-us/services/data-lake-analytics/
HDInsight | PaaS Hadoop compute | A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service made easy | https://azure.microsoft.com/en-us/services/hdinsight/
DocumentDB | PaaS NoSQL: Document Store | Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax | https://azure.microsoft.com/en-us/services/documentdb/
Azure Table Storage | PaaS NoSQL: Key-value Store | Store large amounts of semi-structured data in the cloud | https://azure.microsoft.com/en-us/services/storage/tables/

PolyBase
Query relational and non-relational data with T-SQL

Capability
• T-SQL for querying relational and non-relational data across SQL Server (APS, SQL Server 2016, SQL DW) and Hadoop (HDP, Cloudera, HDInsight)

Benefits
• New business insights across your data lake
• Leverage existing skillsets and BI tools
• Faster time to insights and simplified ETL process

DocumentDB consistency options
• Strong, which is the slowest of the four, but is guaranteed to always return correct data
• Bounded staleness, which ensures that an application will see changes in the order in which they were made. This option does allow an application to see out-of-date data, but only within a specified window, e.g., 500 milliseconds
• Session, which ensures that an application always sees its own writes correctly, but allows access to potentially out-of-date or out-of-order data written by other applications
• Eventual, which provides the fastest access, but also has the highest chance of returning out-of-date data
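
One way to picture the four levels is as a rule for which copy of the data may answer a read. The sketch below is a toy model only: it is not the DocumentDB API and does not describe how the service works internally. The `Copy` class, the `read` function, and its parameters are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    version: int  # number of writes this copy has applied
    value: str

def read(level, primary, replica, *, lag_ms=0, window_ms=500, session_version=0):
    """Choose which copy may answer a read under each consistency level (toy model)."""
    if level == "strong":
        return primary.value  # always the latest write
    if level == "bounded":
        # a lagging replica may answer only while it is within the staleness window
        return replica.value if lag_ms <= window_ms else primary.value
    if level == "session":
        # the replica may answer once it has caught up with this session's own writes
        return replica.value if replica.version >= session_version else primary.value
    if level == "eventual":
        return replica.value  # fastest, may be out of date
    raise ValueError(f"unknown level: {level}")

primary, replica = Copy(3, "v3"), Copy(2, "v2")
print(read("strong", primary, replica))
print(read("bounded", primary, replica, lag_ms=800))
print(read("session", primary, replica, session_version=2))
print(read("eventual", primary, replica))
```

The further down the list, the more often the nearby replica is allowed to answer, which is where the speed comes from.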

On-prem vs Cloud
• On-prem: SQL Server, APS, MongoDB, Oracle, Cassandra, Neo4j
• IaaS Cloud: SQL Server in Azure VM, Oracle in Azure VM
• DBaaS/PaaS Cloud: SQL Database, SQL Data Warehouse, DocumentDB, Redshift, RDS, MongoLab

Product Categories
[Figure: product category map. Products named in the slide's annotations: DocumentDB, Couchbase; APS, SQL DW; SQL Database, SQLite; PostgreSQL; Redis; OrientDB.]

Product Categories
Rankings from db-engines.com

Azure Product Categories
[Figure: Azure product category map. Labels recoverable from the slide: SQL DW; ADLS, ADLA; (PaaS); (IaaS).]

db-engines.com/en/ranking

Method of calculation:
• Number of mentions of the system on websites
• General interest in the system
• Frequency of technical discussions about the system
• Number of job offers in which the system is mentioned
• Number of profiles in professional networks in which the system is mentioned
• Relevance in social networks

db-engines.com/en/ranking_definition

db-engines.com/en/ranking_categories

NoSQL = 14%

Polyglot Persistence
• Sometimes a relational store is the right choice, sometimes a NoSQL store is the right choice
• Sometimes you need more than one store: Using the right tool for the right job
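
A minimal sketch of "more than one store" in a single application, assuming a shopping scenario: a dictionary stands in for a key-value store holding session carts, and `sqlite3` stands in for the relational store holding committed orders. All names and prices are hypothetical.

```python
import sqlite3

# Relational store for transactional data that must be durable and consistent.
orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL, total REAL NOT NULL)"
)

# Stand-in for a key-value store (for example a cache) holding short-lived session carts.
session_cache = {}

def add_to_cart(session_id, item):
    session_cache.setdefault(session_id, []).append(item)

def checkout(session_id, customer, prices):
    """Move the cart out of the cache and record the order in the relational store."""
    items = session_cache.pop(session_id, [])
    total = sum(prices[item] for item in items)
    with orders_db:  # commit on success, roll back on error
        cursor = orders_db.execute(
            "INSERT INTO orders (customer, total) VALUES (?, ?)", (customer, total)
        )
    return cursor.lastrowid, total

add_to_cart("s1", "book")
add_to_cart("s1", "pen")
order_id, total = checkout("s1", "Ian", {"book": 20.0, "pen": 2.5})
print(order_id, total)
```

Fast-changing, disposable data lives in the cache; only the final order touches the store that enforces schema and transactions.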

Cloud Big Data Solution

Summary
Choose NoSQL when…
• You are bringing in new data with a lot of volume and/or variety
• Your data is non-relational/semi-structured
• Your team will be trained in these new technologies (NoSQL)
• You have enough information to correctly select the type and product of NoSQL for your situation
• You can relax transactional consistency when scalability or performance is more important
• You can service a large number of user requests vs rigorously enforcing business rules

Relational databases are created for strong consistency, but at the cost of speed and scale. NoSQL slightly sacrifices consistency across nodes for both speed and scalability.

NoSQL and Hadoop are viable technologies for a subset of specialized needs and use cases.

Lines are getting blurred – do your homework!

Bottom line!
• RDBMS for enterprise OLTP and ACID compliance, or databases under 1 TB
• NoSQL for scaled OLTP and JSON documents
• Hadoop for big data analytics (OLAP)

Resources
• Relational database vs Non-relational databases: http://bit.ly/1HXn2Rt
• Types of NoSQL databases: http://bit.ly/1HXn8Zl
• What is Polyglot Persistence? http://bit.ly/1HXnhMm
• Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
• Hadoop and Microsoft: http://bit.ly/20Cg2hA


Q & A ?
James Serra, Big Data Evangelist
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)
