Big Data: MapReduce vs. RDBMS
Arjen P. de Vries ([email protected])
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
Context
Business 'best practices': decisions based on data and hard facts rather than on instinct and theory.
MapReduce, though originally designed for text processing, is more and more "ab"-used for structured data, at tremendous scale: Hadoop is used to manage Facebook's 2.5-petabyte data warehouse.
Shared-nothing Architecture
A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected by a high-speed network. Possible trade-off: a large number of low-end servers instead of a small number of high-end ones.
@CWI – 2011
Programming Model
Parallel DBMS:
- Claimed best at ad-hoc analytical queries
- Substantially faster once data is loaded, but loading the data takes considerably longer
- Who wants to program parallel joins etc.?!
MapReduce:
- Very well suited for extract-transform-load tasks
- Ease of use for complex analytics tasks
Hybrid:
- Best of both worlds?
Parallel DBMS
Horizontal partitioning of relational tables, with partitioned execution of the SQL operators: select, aggregate, join, project, and update. A new shuffle operator dynamically repartitions rows of intermediate results (usually by hash). Partitioning techniques: hash, range, round-robin.
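The three partitioning techniques mentioned above can be sketched in a few lines of Python. This is my illustration, not from the slides; the node count and key names are made up, and real systems of course partition pages of tuples, not Python dicts.

```python
# Sketch of the three row-partitioning techniques a parallel DBMS uses
# to spread a table over N nodes (illustrative only).
from itertools import count

N = 4  # number of nodes (assumed)

def hash_partition(row, key):
    """Assign a row to a node by hashing its partitioning key."""
    return hash(row[key]) % N

def range_partition(row, key, boundaries):
    """Assign a row by comparing its key against sorted range boundaries."""
    for node, upper in enumerate(boundaries):
        if row[key] < upper:
            return node
    return len(boundaries)  # last node catches the remainder

_rr = count()
def round_robin_partition(row):
    """Assign rows to nodes in turn, ignoring their content."""
    return next(_rr) % N

row = {"cust_id": 17, "revenue": 99.0}
print(range_partition(row, "cust_id", [10, 20, 30]))  # -> 1
print(round_robin_partition(row))                     # -> 0 (first call)
```

Hash partitioning gives the best load balance for equality predicates and joins; range partitioning supports range scans; round-robin balances load perfectly but supports neither.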
Parallel DBMS
The DBMS automatically manages the various alternative partitioning strategies for the tables involved in a query, transparently to the user and the application program.
Parallel DBMS
Many Map and Reduce operations can be expressed as plain SQL; the reshuffle between Map and Reduce is equivalent to a GROUP BY in SQL. Map operations not easily expressed in SQL can be implemented as UDFs (not trivial); Reduce operations not easily expressed in SQL can be implemented as user-defined aggregates (even less trivial).
Comparison (on a 100-node cluster)

Task      Hadoop   DBMS-X   Vertica   Hadoop/DBMS-X   Hadoop/Vertica
Grep      284s     194s     108s      1.5             2.6
Web Log   >1Ks     740s     268s      1.6             4.3
Join      >1Ks     32s      55s       36.3            21
http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
Details of the Comparison Study
Avoid repetitive record parsing: the default HDFS setting stores data in the textual format in which it was generated, and even when using SequenceFiles, user code is necessary to parse out multiple attributes.
Don't write intermediate results to disk: push vs. pull model. Note: DBMSs now offer 'restart operators' to improve fault tolerance; a trade-off between runtime penalty and the amount of work lost on failure.
Details of the Comparison Study
Column-oriented storage and compression: column stores have focused on cheap (or even no) decompression, with some operations executed directly in the compressed domain. Column-by-column compression may achieve better compression ratios.
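A minimal sketch (my illustration, not from the study) of why column stores compress well and can operate "in the compressed domain": run-length encode one sorted column, then answer a count query without ever decompressing it.

```python
# Run-length encode a column, then answer COUNT(*) WHERE col == v
# directly on the compressed representation. Data is made up.

def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def count_equal(runs, value):
    # operates in the compressed domain: no decompression needed
    return sum(length for v, length in runs if v == value)

country = ["NL", "NL", "NL", "UK", "UK", "US"]
runs = rle_encode(country)
print(runs)                     # [['NL', 3], ['UK', 2], ['US', 1]]
print(count_equal(runs, "NL"))  # 3
```

Per-column compression works so well because all values in a column share one type and, when sorted or clustered, long runs of identical or close values.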
Parallel DBMS
Bad "out-of-the-box" experience: a reported difference in execution time from days to minutes, after tuning by the vendor. However, tuning Hadoop for maximum performance is also an arduous task!
Big difference in the cost of loading data: performance gains from faster queries offset the upfront costs.
Most DBMSs don't work on "in situ" data.
Only expensive, commercial offerings; no low-cost open-source alternative.
Ease of Use
Push programmers to a higher level of abstraction: Pig, Hive, ...
SQL code was substantially easier to write than MR code in their study. Just for them? Does the benchmark model real-life tasks correctly?
Ease of Use
Getting a MapReduce program up and running generally takes less effort than the parallel DBMS alternative: no schema definition, no UDF registration. Modifying a MapReduce program, though...
Parallel DBMS
Not used on 100s or 1000s of nodes; assumes a homogeneous array of machines; designed under the assumption that failures are rare events. Can we combine MapReduce's proven scalability with the parallel DBMS's proven efficiency?
Hybrid Solution?
HadoopDB: Hadoop as a communication layer above multiple nodes running single-node DBMS instances. A fully open-source solution: PostgreSQL as the DB layer, Hadoop as the communication layer, Hive as the translation layer... and the rest is "HadoopDB". A shared-nothing version of PostgreSQL as a side effect.
Desiderata
Performance; fault tolerance; the ability to run in a heterogeneous environment (the slowest compute node should not determine completion time); a flexible query interface (ODBC/JDBC, UDFs).
HadoopDB
RDBMS: careful layout of data, indexing, sorting, shared I/O, buffer management, compression, query optimization.
Hadoop: job scheduling, task coordination, parallelization.
HadoopDB: database connection (JDBC, by extending Hadoop's InputFormat); catalog (an XML file in HDFS).
Data Loader
Globally repartitions data on a given partitioning key upon loading; breaks apart single-node data into multiple smaller partitions (chunks); bulk-loads the chunks into the single-node databases. Chunk size was ~1 GB in the experiments.
Planner (SMS)
SQL → MapReduce → SQL; extends Hive. All operators in the DAG, bottom-up, until the first repartition operator whose partitioning key differs from the database's key, are converted into one or more SQL queries for that database. This takes advantage of the relative quality of the RDBMS query optimizer compared to the 'normal' Hive query optimizer. Example:

SELECT YEAR(saleDate), SUM(revenue)
FROM SALES
GROUP BY YEAR(saleDate)
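The observation that the reshuffle between Map and Reduce is a GROUP BY can be made concrete by writing this example query as a map/shuffle/reduce pipeline. A sketch in plain Python, with made-up table contents:

```python
# SELECT YEAR(saleDate), SUM(revenue) FROM SALES GROUP BY YEAR(saleDate)
# expressed as map -> shuffle (the GROUP BY) -> reduce.
from collections import defaultdict

SALES = [
    ("2009-03-01", 10.0),
    ("2009-07-15", 5.0),
    ("2010-01-02", 7.5),
]

def map_phase(records):
    for sale_date, revenue in records:
        yield int(sale_date[:4]), revenue        # key = YEAR(saleDate)

def shuffle(pairs):                              # the implicit GROUP BY
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):                        # SUM per group
    return {year: sum(vals) for year, vals in groups.items()}

print(reduce_phase(shuffle(map_phase(SALES))))   # {2009: 15.0, 2010: 7.5}
```

In HadoopDB, the SMS planner would push this entire query into each node's DBMS, since no repartitioning on a different key is required.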
Planner (SMS)
Join queries: Hive assumes tables are never collocated; SMS pushes the entire join into the database layer where possible (i.e., whenever the join key matches the database partitioning key).
Comparison
HadoopDB is up to an order of magnitude faster than Hadoop and Hive. But... it also has a 10x longer load time (for the join benchmark query, amortized in one query). It is outperformed by Vertica, even in the fault-tolerance tests, in spite of Vertica's larger slow-down. The main performance difference is attributed to the efficiency of the column store and HadoopDB's lack of compression.
Hadoop / Hive
Shortcomings in the data storage layer: no use of hash partitioning on join keys for co-location of related tables; no statistics about the data in the catalog, hence no cost-based optimization; lack of native indexing, so most jobs are heavy on I/O. BTW: Hive is catching up on some of these!
Hadapt
Two heuristics guide the optimizations:
1. Maximize single-node DBMS use: the DBMS processes data at a faster rate than Hadoop.
2. Minimize the number of jobs per SQL query: each MapReduce job involves much I/O, both to disk and over the network.
Two orders of magnitude
Three key database ideas at the basis: column-store relational back-ends; referential partitioning, to maximize the number of single-node joins; semi-joins integrated into the Hadoop Map phase.
Dutch Database History!!!
Vectorwise = MonetDB/X100 – Peter Boncz and Martin Kersten (CWI/UvA).
Semi-joins in distributed relational query processing – Peter Apers (UT): Peter M. G. Apers, Alan R. Hevner, S. Bing Yao, Optimization Algorithms for Distributed Queries, IEEE Trans. Software Eng. 9(1): 57-68 (1983).
Vectorwise
Vectorized operations on in-cache data directly attack the memory wall. Efficient I/O: the PFor and PForDelta lightweight compression algorithms achieve extremely high decompression rates, designed for modern CPUs (including tricks like predication).
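As a rough sketch of the idea only (simplified well beyond the real algorithm, and without the bit-packing and predication that make it fast), PForDelta stores the gaps between sorted integers in a small fixed bit width and keeps the rare large gaps aside as exceptions that are patched in afterwards:

```python
# Heavily simplified PForDelta-style scheme: gaps between sorted ids are
# stored if they fit in `bits` bits; larger gaps become exceptions.

def pfordelta_encode(sorted_ids, bits=3):
    limit = (1 << bits) - 1
    packed, exceptions = [], {}
    prev = 0
    for i, x in enumerate(sorted_ids):
        gap = x - prev
        if gap <= limit:
            packed.append(gap)
        else:
            packed.append(0)      # placeholder slot
            exceptions[i] = gap   # patched in during decode
        prev = x
    return packed, exceptions

def pfordelta_decode(packed, exceptions):
    out, acc = [], 0
    for i, gap in enumerate(packed):   # tight loop over small integers
        acc += exceptions.get(i, gap)
        out.append(acc)
    return out

ids = [3, 5, 6, 40, 41]
assert pfordelta_decode(*pfordelta_encode(ids)) == ids
```

The real implementation packs the small gaps into contiguous machine words and resolves exceptions in a separate patch pass, which is what makes decompression so cheap on modern CPUs.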
Improved Query Plans
Join plans include data re-distribution before computing. An extended Database Connector gives access to multiple database tables in the Map phase of a single job. After repartitioning on the join key, related records are sent to the Reduce phase for the actual join computation.
Improved Query Plans
Referential partitioning: HadoopDB/Hadapt performs 'aggressive' hash-partitioning on foreign-key attributes (~ Jimmy's secondary (value) sort trick). During data load, this involves an extra step of joining to the parent table to enable the partitioning.
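A sketch of that extra load-time step (my illustration; the tables, keys, and data are made up): child rows are joined to their parent during load, so that both land on the node chosen by the parent's partitioning key and the join later becomes purely local.

```python
# Referential partitioning: partition LINEITEM not by its own key but by
# the customer key of its parent ORDERS row, so order-lineitem joins are
# single-node. Names and data are illustrative.
N = 4  # number of nodes (assumed)

orders = {101: {"cust_id": 7}, 102: {"cust_id": 9}}   # order_id -> row
lineitems = [{"order_id": 101, "qty": 2},
             {"order_id": 102, "qty": 1}]

def node_for(key):
    return hash(key) % N

def load_lineitem(item):
    # the extra join to the parent table during load:
    cust_id = orders[item["order_id"]]["cust_id"]
    return node_for(cust_id)          # co-located with its parent order

for item in lineitems:
    parent_node = node_for(orders[item["order_id"]]["cust_id"])
    assert load_lineitem(item) == parent_node   # join is now local
```

The cost is paid once at load time (the 6h42m reported later in the results slide); every subsequent parent-child join avoids repartitioning entirely.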
Join in Hadoop
Outline algorithm, if the tables are not already co-partitioned: mappers read the table partitions and output them under the join attributes, which re-partitions the tables; the reducer then processes the tuples with the same join key, i.e., performs the join on that partition. BTW... the symmetric hash-join is a UT invention! Wilschut & Apers, Dataflow query execution in a parallel main-memory environment, PDIS 1991.
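The outline algorithm above can be sketched as a repartition ("reduce-side") join; tagging each tuple with its source table, as below, is the standard way to tell the two sides apart in a reduce group (my illustration, with made-up data):

```python
# Repartition join: mappers tag each tuple with its table and emit it
# under the join key; each reduce group then joins the two sides.
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]                  # (uid, name)
clicks = [(1, "/home"), (1, "/faq"), (2, "/home")]   # (uid, url)

def map_phase():
    for uid, name in users:
        yield uid, ("U", name)
    for uid, url in clicks:
        yield uid, ("C", url)

groups = defaultdict(list)
for key, tagged in map_phase():       # the shuffle/repartition step
    groups[key].append(tagged)

joined = []
for uid, values in groups.items():    # one reducer call per join key
    names = [v for tag, v in values if tag == "U"]
    urls  = [v for tag, v in values if tag == "C"]
    joined += [(uid, n, u) for n in names for u in urls]

print(sorted(joined))
# [(1, 'alice', '/faq'), (1, 'alice', '/home'), (2, 'bob', '/home')]
```

Note that every byte of both tables crosses the network in the shuffle, which is exactly what the directed and broadcast joins below try to avoid.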
Improved Query Plans
Alternatives to the fully partitioned hash-join:
Directed join: re-partition only one table, when the other argument is already partitioned on the join key.
Broadcast join: ship the entire smaller table to all nodes holding the larger table.
Broadcast & Directed Joins
Non-trivial in Hadoop: HDFS does not guarantee to maintain co-partitioning between jobs, so datasets using the same hash may end up on different nodes. This requires the join in the Map phase, which is hard to do well when multiple passes are required (unless both tables are already sorted by the join key).
Broadcast Join
The mapper reads the smaller table from HDFS into an in-memory hash table, followed by a sequential scan of the larger table: a map-side join (~ Jimmy's in-mapper combiner). Provided there is low-cost database support for temporary tables, the join can in HadoopDB be pushed into the DBMS for (usually) more efficient execution.
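A minimal map-side (broadcast) join sketch, assuming the smaller table fits in each mapper's memory; the tables and names are illustrative, not from the slides:

```python
# Broadcast join: every mapper first loads the small table into an
# in-memory hash table, then streams the large table past it.
small = [("NL", "Netherlands"), ("UK", "United Kingdom")]   # fits in RAM
large = [("NL", 10.0), ("UK", 5.0), ("NL", 2.5), ("US", 1.0)]

lookup = dict(small)          # built once per mapper, before the scan

def mapper(stream):
    for code, revenue in stream:   # sequential scan of the larger table
        name = lookup.get(code)
        if name is not None:       # inner join: drop non-matching rows
            yield name, revenue

print(list(mapper(large)))
# [('Netherlands', 10.0), ('United Kingdom', 5.0), ('Netherlands', 2.5)]
```

No shuffle and no Reduce phase are needed; the price is shipping the small table to every node.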
Directed Join
The OutputFormat feature of Hadoop writes the output of a repartitioning mapper (which reads catalog data for the other table) directly into the DBMSs, circumventing HDFS.
Semi-join
Hadoop: a mapper performs the selection and the projection of the join attribute on the first table; the resulting column is replicated as a "map-side join".
HadoopDB: if the projected column is small (e.g., a list of countries, ...), transform it into SELECT ... WHERE foreignKey IN (list-of-values) – this skips the temporary-table costs completely (~ Jimmy's stripes).
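The HadoopDB rewrite can be illustrated as follows (my sketch; table and column names are hypothetical, and SQLite merely stands in for the node-local DBMS): project the small join column first, then push it into each node's SQL as an IN-list instead of loading a temporary table.

```python
# Semi-join sketch: project the (small) join column from one table, then
# push it into the per-node DBMS as a WHERE ... IN (...) filter.
import sqlite3

con = sqlite3.connect(":memory:")   # stand-in for the node-local DBMS
con.execute("CREATE TABLE sales (country TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("NL", 10.0), ("UK", 5.0), ("US", 1.0)])

# phase 1 (Map): projection of the join attribute from the other table
selected_countries = ["NL", "UK"]   # assumed small, e.g. a country list

# phase 2: rewrite the join as an IN-list, skipping the temporary table
placeholders = ",".join("?" * len(selected_countries))
rows = con.execute(
    f"SELECT country, revenue FROM sales WHERE country IN ({placeholders})",
    selected_countries).fetchall()
print(rows)   # [('NL', 10.0), ('UK', 5.0)]
```

The rewrite is only safe when the projected value list is small enough to inline into the query text sent to every node.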
Results
TPC-H 3 TB on a 45-node cluster. Loading time:
DBMS-X: 33h3m
Hive and Hadoop: 49m
HadoopDB: 11h4m (of which 6h42m for referential partitioning)
VectorWise: 3h47m (includes clustering index creation)
Results
DBMS-X >> Hive, due to Hive's lack of partitioning and indexing. Switching HadoopDB from PostgreSQL to Vectorwise results in a factor-of-7 improvement on average. Generally, the map-side join optimization improves efficiency by a factor of 2 to 3 when using the column store. The semi-join improves by a factor of 2 over the map-side join and a factor of 3.6 over the reduce-side join.
Conclusion
Hybrid is good: MapReduce takes care of the "rack to cluster" scale, the RDBMS takes care of the within-rack part. Not sure how good it is for text-analytical tasks; RDBMSs often have problems with data skew. The Hadapt whitepaper suggests they handle unstructured data with MapReduce and structured data with HadoopDB.
Conclusion
Never a free lunch... If your problem involves non-text data types, consider working with a hybrid solution. If your problem involves primarily textual data, it is still an open question whether a hybrid will actually be of any help.
Information Science
"Search for the fundamental knowledge which will allow us to postulate and utilize the most efficient combination of [human and machine] resources."
M.E. Senko. Information systems: records, relations, sets, entities, and things. Information Systems, 1(1):3-13, 1975.
References
Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes?, CACM 53(1), Jan 2010, 64-70
Bajda-Pawlikowski et al., Efficient Processing of Data Warehousing Queries in a Split Execution Environment, SIGMOD 2011
Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
Thusoo et al., Data Warehousing and Analytics Infrastructure at Facebook, SIGMOD 2010
Wilschut, Flokstra, Apers, Parallelism in a Main-Memory DBMS: The performance of PRISMA/DB, VLDB 1992
Wilschut, Apers & Flokstra, Parallel Query Execution in PRISMA/DB, LNCS 503 (1990)
Daniel Abadi's blog, http://dbmsmusings.blogspot.com/