Relational databases vs Non-relational databases

  • date post

    06-Jan-2017
  • Category

    Technology



Relational databases vs Non-relational databases
(RDBMS vs NoSQL vs Hadoop)
James Serra
Big Data [email protected]

There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are and how they compare to Hadoop, and then discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution, it is important to understand all your options so you take the right path to success.

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

2/15/2017

About Me
  • Microsoft, Big Data Evangelist
  • In IT for 30 years, worked on many BI and DW projects
  • Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
  • Been perm employee, contractor, consultant, business owner
  • Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
  • Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
  • Blog at JamesSerra.com
  • Former SQL Server MVP
  • Author of book "Reporting with Microsoft SQL Server 2012"

Fluff, but the point is that I bring real work experience to the session.


Agenda
  • Definition and differences
  • ACID vs BASE
  • Four categories of NoSQL
  • Use cases
  • CAP theorem
  • On-prem vs cloud
  • Product categories
  • Polyglot persistence
  • Architecture samples

My goal is to give you a high-level overview of all the technologies so you know where to start. Make you a hero.

Goal
My goal is to give you a high-level overview of all the technologies so you know where to start and put you on the right path to be a hero!


Relational and non-relational defined

Relational databases
  • Also called relational database management systems (RDBMS) or SQL databases
  • Most popular are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2
  • Mostly used in large enterprise scenarios (the exception is MySQL, which is mostly used to store data for web applications, typically as part of the popular LAMP stack)
  • Analytical RDBMS (OLAP, MPP) solutions are Analytics Platform System, Teradata, Netezza

Non-relational databases
  • Also called NoSQL databases
  • Most popular being MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j
  • Usually grouped into four categories: key-value stores, wide-column stores, document stores, and graph stores
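
As a rough illustration of the four NoSQL categories named above, the following sketch models one made-up customer record in each shape using plain Python data structures. Nothing here is a real database API; the record, the field names, and the `FOLLOWS` relationship are assumptions chosen only to show how the data models differ.

```python
import json

# Hypothetical customer record, used only for illustration.
customer = {"id": "c42", "name": "Ada", "city": "London", "follows": ["c7", "c9"]}

# Key-value store: an opaque value fetched by a single key.
key_value = {customer["id"]: json.dumps(customer)}

# Document store: the record stays a nested document whose fields can be queried.
documents = [customer]
in_london = [doc["name"] for doc in documents if doc["city"] == "London"]

# Wide-column store: row key -> column family -> column -> value.
wide_column = {customer["id"]: {"profile": {"name": "Ada", "city": "London"}}}

# Graph store: nodes plus explicit, typed edges between them.
edges = [(customer["id"], "FOLLOWS", target) for target in customer["follows"]]
```

The key-value form can only be read back whole by its key, while the document form allows filtering on a field such as `city`; that difference is the main point of the sketch.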

Hadoop: Made up of Hadoop Distributed File System (HDFS), YARN and MapReduce

LAMP stack (Linux, Apache, MySQL, PHP/Python/Perl)

Origins
Using SQL Server, I need to index a few thousand documents and search them. No problem. I can use Full-Text Search. I'm a healthcare company and I need to store and analyze millions of medical claims per day. Problem. Enter Hadoop.

Using SQL Server, my internal company app needs to handle a few thousand transactions per second. No problem. I can handle that with a nicely sized server. Now I have Pokémon Go, where users can enter millions of transactions per second. Problem. Enter NoSQL.

But most enterprise data just needs an RDBMS (89% market share, per Gartner).

Hadoop started in 2006. NoSQL started in 2009.

DocumentDB has done 5 million TPS per region across 4 regions, so 20 million TPS in total. DocumentDB uses local storage.

Kevin Cox: What is the highest performance (transactions per second) you have seen out of SQL Server? Over 500k/sec. Very dependent on using flash-type storage for the transaction log, i.e. FusionIO or similar. Also short transactions (stock trades).

Matt Goswell: Please see attached. SDX offers 171,800 TPS however this is using SQL 2014. We are waiting on updated numbers for SQL 2016.

Arvind Shyamsundar: The question is fairly open-ended and the answer is dependent on the workload pattern. On the in-memory OLTP front, we achieved 1.2 million batch requests/second on a Fujitsu Primergy server (4 sockets, 72 cores, 144 logical procs) last October. The Superdome X can go up to 16 sockets and hundreds of cores, but with the form factor beyond 4 sockets comes increased NUMA memory latency. So more sockets does not necessarily translate to more throughput. The recent 10TB TPC-H numbers we released were all on 8-socket Lenovo boxes, and the workload involved is predominantly a read workload.
https://blogs.msdn.microsoft.com/sqlcat/2016/10/26/how-bwin-is-using-sql-server-2016-in-memory-oltp-to-achieve-unprecedented-performance-and-scale/

SQL Server: 1.2 million batch requests/sec (30-40 SQL statements per batch)

Batch requests/second is the nearest equivalent to transactions/second for comparison. Statements/second is not an accurate comparison. Transactions/second is too overloaded and ambiguous because it could mean any of:
  • Business transactions/second (one business transaction could mean multiple SQL batches)
  • Batch requests/second (assuming one business transaction == one SQL batch)
  • Some other number involving interplay between SQL commands and external web services, etc.
So from a pure OLTP perspective we prefer to quote batch requests/second in this benchmark. Proper benchmarks like the TPC benchmarks have their own clearly defined unit of measurement (http://www.tpc.org/tpcc/detail.asp).
Arvind Shyamsundar
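
To see why the choice of unit matters, the arithmetic below converts the quoted 1.2 million batch requests/second into SQL statements/second using the 30-40 statements-per-batch range from these notes. It is only a back-of-the-envelope sketch; the variable names are made up for illustration.

```python
# Figures quoted in the notes above.
batch_requests_per_sec = 1_200_000
statements_per_batch = (30, 40)  # low and high ends of the quoted range

# The same workload expressed in statements/second instead of batches/second.
low, high = (batch_requests_per_sec * n for n in statements_per_batch)

print(f"{low:,} to {high:,} SQL statements/second")
```

The same workload can therefore be reported as 1.2 million or as tens of millions "per second", depending on which unit is counted.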

Main differences (Relational)
Pros
  • Works with structured data
  • Supports strict ACID transactional consistency
  • Supports joins
  • Built-in data integrity
  • Large eco-system
  • Relationships via constraints
  • Limitless indexing
  • Strong SQL
  • OLTP and OLAP
  • Most off-the-shelf applications run on RDBMS
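
A minimal sketch of three of the points above (ACID transactions, relationships via constraints, and joins), using Python's built-in `sqlite3` module as a stand-in for any RDBMS. The `customers` and `orders` tables and their contents are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.commit()

# Atomicity plus integrity: the second insert violates the foreign key,
# so the whole transaction is rolled back, including the valid first insert.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders VALUES (1, 1, 25.0)")
        conn.execute("INSERT INTO orders VALUES (2, 99, 10.0)")  # no customer 99
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# A valid transaction, followed by a join across the relationship.
with conn:
    conn.execute("INSERT INTO orders VALUES (1, 1, 25.0)")

rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c JOIN orders o ON o.customer_id = c.id"
).fetchall()
```

After the failed transaction `order_count` is 0: the database refused the orphaned order and discarded the valid insert that shared its transaction.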

OLTP DBMS now called Operational DBMS: http://www.gartner.com/technology/reprints.do?id=1-2RIVJYE&ct=151104&st=sb

Hadoop is a kind of file system on which several ecosystems can work. It's not a DB. NoSQL is a kind of DB with specific properties.

The difference between a file system and a database is subtle. Databases store all their data in files or in RAM anyway. We also have "object storages" (like S3), "key-value data stores" (like Riak), and "data structure stores" (like Redis), and we can treat them as databases. Hadoop is a file system and a technology stack that includes NoSQL solutions (HBase, for example). NoSQL is a set of methods or ways of handling data.

Hadoop (HDFS + YARN) is a file system on steroids, i.e. it is neither a relational DBMS nor a non-relational (NoSQL) DBMS. It is optimized for string processing (large strings in large amounts of data). Hadoop allows users to interact with the data via SQL (multiple options of SQL dialects) and NoSQL (multiple options of procedural languages), unfortunately with sub-optimal performance and restricted functionality for all non-string-related processing. That's the reason all vendors and gurus are so emphatic about Hadoop costs.

For any real-time processing or analytics, NoSQL would be a better fit than Hadoop. However, there are several factors to keep in mind. NoSQL is better suited for simple data structures (key-value, document, etc.), whereas Hadoop has no inherent structure. Hadoop is better for volume writes and parallel scans, but NoSQL is better for high-volume random reads (indexed access) and writes. Finally, it is important to look at what type of analytics you want to do (statistical with R, visualization, etc.) to pick the right store. Sometimes that means having both Hadoop and NoSQL.

On NoSQL, you may not need to define a schema, but you still need to convert the data to key/value or JSON before you can store it.
Hadoop is good for batch processing, and you don't want to expose it to millions of users.

Historically the Hadoop ecosystem (HDFS, MapReduce, YARN, etc.) targeted OLAP use cases, while NoSQL databases (Cassandra, Couchbase, etc.) leaned more towards OLTP workloads. However, the lines are getting blurred. You gave a good example of MapReduce on Couchbase. Another is HBase on the Hadoop ecosystem targeting real-time use cases.

HDFS (Hadoop Distributed File System) has been built for large files and is very efficient at batch processing. It supports sequential access to data only, so there is no support for random access or fast individual record lookups, and data updates are not efficient either. NoSQL databases address all of these challenges.
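
The contrast between sequential access and indexed lookup can be sketched in a few lines of Python. This is an in-memory analogy only, not HDFS or any NoSQL API; the file contents, the `scan_for` helper, and the record format are all made up for illustration.

```python
import io

# Hypothetical dataset: one "id,value" record per line, as in a large flat file.
data = io.StringIO("".join(f"{i},value-{i}\n" for i in range(100_000)))

def scan_for(file, wanted_id):
    """Sequential access: read records in order until the match is found."""
    file.seek(0)
    lines_read = 0
    for line in file:
        lines_read += 1
        record_id, value = line.rstrip("\n").split(",", 1)
        if record_id == wanted_id:
            return value, lines_read
    return None, lines_read

# Indexed access: build a key -> value index once, then look records up directly.
data.seek(0)
index = dict(line.rstrip("\n").split(",", 1) for line in data)

scanned_value, lines_read = scan_for(data, "99999")
indexed_value = index["99999"]
```

Both approaches return the same value, but the scan has to read all 100,000 lines to reach the last record, while the index answers with a single hash lookup. The scan is the access pattern that suits batch jobs over whole files; the index is the one that suits individual record lookups.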

To reiterate in short, Hadoop is a computation platform, while NoSQL is an unstructured database.

Hadoop, at its most basic, is a distributed file system (HDFS) built to store large volumes of string data in parallel with redundancy. But the file system by itself is of little use without the rest of the ecosystem, such as YARN, HBase, Hive, etc. (and now Spark for more real-time usage), which makes it more user friendly. HBase also falls under the NoSQL category. NoSQL databases come in different flavors based on their inherent architecture and the use cases they support.


Main differences (Relational)
Cons
  • Does not scale out horizontally (concurrency and data size) only vertically, unless use shardi