Big Data: MapReduce vs. RDBMS
Arjen P. de Vries ([email protected])
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
Context
Business 'best practices': decisions based on data and hard facts rather than on instinct and theory.
MapReduce, though originally designed for text processing, is more and more "ab"-used for structured data, at tremendous scale: Hadoop is used to manage Facebook's 2.5-petabyte data warehouse.
Shared-nothing Architecture
A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected by a high-speed network. Possible trade-off: a large number of low-end servers instead of a small number of high-end ones.
@CWI – 2011
Programming Model
Parallel DBMS:
- Claimed best at ad-hoc analytical queries
- Substantially faster once data is loaded, but loading the data takes considerably longer
- Who wants to program parallel joins etc.?!
MapReduce:
- Very well suited for extract-transform-load tasks
- Ease of use for complex analytics tasks
Hybrid:
- Best of both worlds?
Parallel DBMS
Horizontal partitioning of relational tables, with partitioned execution of the SQL operators: select, aggregate, join, project, and update. A new shuffle operator dynamically repartitions rows of intermediate results (usually by hash). Partitioning techniques: hash, range, round-robin.
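The three partitioning techniques mentioned above can be sketched in a few lines of Python. This is my illustration, not from the slides; the node count and key names are made up, and real systems of course partition pages of tuples, not Python dicts.

```python
# Sketch of the three row-partitioning techniques a parallel DBMS uses
# to spread a table over N nodes (illustrative only).
from itertools import count

N = 4  # number of nodes (assumed)

def hash_partition(row, key):
    """Assign a row to a node by hashing its partitioning key."""
    return hash(row[key]) % N

def range_partition(row, key, boundaries):
    """Assign a row by comparing its key against sorted range boundaries."""
    for node, upper in enumerate(boundaries):
        if row[key] < upper:
            return node
    return len(boundaries)  # last node catches the remainder

_rr = count()
def round_robin_partition(row):
    """Assign rows to nodes in turn, ignoring their content."""
    return next(_rr) % N

row = {"cust_id": 17, "revenue": 99.0}
print(range_partition(row, "cust_id", [10, 20, 30]))  # -> 1
print(round_robin_partition(row))                     # -> 0 (first call)
```

Hash partitioning gives the best load balance for equality predicates and joins; range partitioning supports range scans; round-robin balances load perfectly but supports neither.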
Parallel DBMS
The DBMS automatically manages the various alternative partitioning strategies for the tables involved in a query, transparently to the user and the application program.
Parallel DBMS
Many Map and Reduce operations can be expressed as plain SQL; the reshuffle between Map and Reduce is equivalent to a GROUP BY in SQL. Map operations not easily expressed in SQL can be implemented as UDFs (not trivial); Reduce operations not easily expressed in SQL can be implemented as user-defined aggregates (even less trivial).
Comparison (on a 100-node cluster)

Task      Hadoop   DBMS-X   Vertica   Hadoop/DBMS-X   Hadoop/Vertica
Grep      284s     194s     108s      1.5             2.6
Web Log   >1Ks     740s     268s      1.6             4.3
Join      >1Ks     32s      55s       36.3            21
http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
Details of the Comparison Study
Avoid repetitive record parsing: the default HDFS setting stores data in the textual format in which it was generated, and even when using SequenceFiles, user code is necessary to parse out multiple attributes.
Don't write intermediate results to disk: push vs. pull model. Note: DBMSs now offer 'restart operators' to improve fault tolerance; a trade-off between runtime penalty and the amount of work lost on failure.
Details of the Comparison Study
Column-oriented storage and compression: column stores have focused on cheap (or even no) decompression, with some operations executed directly in the compressed domain. Column-by-column compression may achieve better compression ratios.
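A minimal sketch (my illustration, not from the study) of why column stores compress well and can operate "in the compressed domain": run-length encode one sorted column, then answer a count query without ever decompressing it.

```python
# Run-length encode a column, then answer COUNT(*) WHERE col == v
# directly on the compressed representation. Data is made up.

def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def count_equal(runs, value):
    # operates in the compressed domain: no decompression needed
    return sum(length for v, length in runs if v == value)

country = ["NL", "NL", "NL", "UK", "UK", "US"]
runs = rle_encode(country)
print(runs)                     # [['NL', 3], ['UK', 2], ['US', 1]]
print(count_equal(runs, "NL"))  # 3
```

Per-column compression works so well because all values in a column share one type and, when sorted or clustered, long runs of identical or close values.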
Parallel DBMS
Bad "out-of-the-box" experience: a reported difference in execution time from days to minutes, after tuning by the vendor. However, tuning Hadoop for maximum performance is also an arduous task!
Big difference in the cost of loading data: performance gains from faster queries offset the upfront costs.
Most DBMSs don't work on "in situ" data.
Only expensive, commercial offerings; no low-cost open-source alternative.
Ease of Use
Push programmers to a higher level of abstraction: Pig, Hive, ...
SQL code was substantially easier to write than MR code in their study. Just for them? Does the benchmark model real-life tasks correctly?
Ease of Use
Getting a MapReduce program up and running generally takes less effort than the parallel DBMS alternative: no schema definition, no UDF registration. Modifying a MapReduce program, though...
Parallel DBMS
Not used on 100s or 1000s of nodes; assumes a homogeneous array of machines; designed under the assumption that failures are rare events. Can we combine MapReduce's proven scalability with the parallel DBMS's proven efficiency?
Hybrid Solution?
HadoopDB: Hadoop as a communication layer above multiple nodes running single-node DBMS instances. A fully open-source solution: PostgreSQL as the DB layer, Hadoop as the communication layer, Hive as the translation layer... and the rest is "HadoopDB". A shared-nothing version of PostgreSQL as a side effect.
Desiderata
Performance; fault tolerance; the ability to run in a heterogeneous environment (the slowest compute node should not determine completion time); a flexible query interface (ODBC/JDBC, UDFs).
HadoopDB
RDBMS: careful layout of data, indexing, sorting, shared I/O, buffer management, compression, query optimization.
Hadoop: job scheduling, task coordination, parallelization.
HadoopDB: database connection (JDBC, by extending Hadoop's InputFormat); catalog (an XML file in HDFS).
Data Loader
Globally repartitions data on a given partitioning key upon loading; breaks apart single-node data into multiple smaller partitions (chunks); bulk-loads the chunks into the single-node databases. Chunk size was ~1 GB in the experiments.
Planner (SMS)
SQL → MapReduce → SQL; extends Hive. All operators in the DAG, bottom-up, until the first repartition operator whose partitioning key differs from the database's key, are converted into one or more SQL queries for that database. This takes advantage of the relative quality of the RDBMS query optimizer compared to the 'normal' Hive query optimizer. Example:

SELECT YEAR(saleDate), SUM(revenue)
FROM SALES
GROUP BY YEAR(saleDate)
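The observation that the reshuffle between Map and Reduce is a GROUP BY can be made concrete by writing this example query as a map/shuffle/reduce pipeline. A sketch in plain Python, with made-up table contents:

```python
# SELECT YEAR(saleDate), SUM(revenue) FROM SALES GROUP BY YEAR(saleDate)
# expressed as map -> shuffle (the GROUP BY) -> reduce.
from collections import defaultdict

SALES = [
    ("2009-03-01", 10.0),
    ("2009-07-15", 5.0),
    ("2010-01-02", 7.5),
]

def map_phase(records):
    for sale_date, revenue in records:
        yield int(sale_date[:4]), revenue        # key = YEAR(saleDate)

def shuffle(pairs):                              # the implicit GROUP BY
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):                        # SUM per group
    return {year: sum(vals) for year, vals in groups.items()}

print(reduce_phase(shuffle(map_phase(SALES))))   # {2009: 15.0, 2010: 7.5}
```

In HadoopDB, the SMS planner would push this entire query into each node's DBMS, since no repartitioning on a different key is required.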
Planner (SMS)
Join queries: Hive assumes tables are never collocated; SMS pushes the entire join into the database layer where possible (i.e., whenever the join key matches the database partitioning key).
Comparison
HadoopDB is up to an order of magnitude faster than Hadoop and Hive. But... it also has a 10x longer load time (for the join benchmark query, amortized in one query). It is outperformed by Vertica, even in the fault-tolerance tests, in spite of Vertica's larger slow-down. The main performance difference is attributed to the efficiency of the column store and HadoopDB's lack of compression.
Hadoop / Hive
Shortcomings in the data storage layer: no use of hash partitioning on join keys for co-location of related tables; no statistics about the data in the catalog, hence no cost-based optimization; lack of native indexing, so most jobs are heavy on I/O. BTW: Hive is catching up on some of these!
Hadapt
Two heuristics guide the optimizations:
1. Maximize single-node DBMS use: the DBMS processes data at a faster rate than Hadoop.
2. Minimize the number of jobs per SQL query: each MapReduce job involves much I/O, both to disk and over the network.
Two orders of magnitude
Three key database ideas at the basis: column-store relational back-ends; referential partitioning, to maximize the number of single-node joins; semi-joins integrated into the Hadoop Map phase.
Dutch Database History!!!
Vectorwise = MonetDB/X100 – Peter Boncz and Martin Kersten (CWI/UvA).
Semi-joins in distributed relational query processing – Peter Apers (UT): Peter M. G. Apers, Alan R. Hevner, S. Bing Yao, Optimization Algorithms for Distributed Queries, IEEE Trans. Software Eng. 9(1): 57-68 (1983).
Vectorwise
Vectorized operations on in-cache data directly attack the memory wall. Efficient I/O: the PFor and PForDelta lightweight compression algorithms achieve extremely high decompression rates, designed for modern CPUs (including tricks like predication).
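As a rough sketch of the idea only (simplified well beyond the real algorithm, and without the bit-packing and predication that make it fast), PForDelta stores the gaps between sorted integers in a small fixed bit width and keeps the rare large gaps aside as exceptions that are patched in afterwards:

```python
# Heavily simplified PForDelta-style scheme: gaps between sorted ids are
# stored if they fit in `bits` bits; larger gaps become exceptions.

def pfordelta_encode(sorted_ids, bits=3):
    limit = (1 << bits) - 1
    packed, exceptions = [], {}
    prev = 0
    for i, x in enumerate(sorted_ids):
        gap = x - prev
        if gap <= limit:
            packed.append(gap)
        else:
            packed.append(0)      # placeholder slot
            exceptions[i] = gap   # patched in during decode
        prev = x
    return packed, exceptions

def pfordelta_decode(packed, exceptions):
    out, acc = [], 0
    for i, gap in enumerate(packed):   # tight loop over small integers
        acc += exceptions.get(i, gap)
        out.append(acc)
    return out

ids = [3, 5, 6, 40, 41]
assert pfordelta_decode(*pfordelta_encode(ids)) == ids
```

The real implementation packs the small gaps into contiguous machine words and resolves exceptions in a separate patch pass, which is what makes decompression so cheap on modern CPUs.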
Improved Query Plans
Join plans include data re-distribution before computing. An extended Database Connector gives access to multiple database tables in the Map phase of a single job. After repartitioning on the join key, related records are sent to the Reduce phase for the actual join computation.
Improved Query Plans
Referential partitioning: HadoopDB/Hadapt performs 'aggressive' hash-partitioning on foreign-key attributes (~ Jimmy's secondary (value) sort trick). During data load, this involves an extra step of joining to the parent table to enable the partitioning.
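A sketch of that extra load-time step (my illustration; the tables, keys, and data are made up): child rows are joined to their parent during load, so that both land on the node chosen by the parent's partitioning key and the join later becomes purely local.

```python
# Referential partitioning: partition LINEITEM not by its own key but by
# the customer key of its parent ORDERS row, so order-lineitem joins are
# single-node. Names and data are illustrative.
N = 4  # number of nodes (assumed)

orders = {101: {"cust_id": 7}, 102: {"cust_id": 9}}   # order_id -> row
lineitems = [{"order_id": 101, "qty": 2},
             {"order_id": 102, "qty": 1}]

def node_for(key):
    return hash(key) % N

def load_lineitem(item):
    # the extra join to the parent table during load:
    cust_id = orders[item["order_id"]]["cust_id"]
    return node_for(cust_id)          # co-located with its parent order

for item in lineitems:
    parent_node = node_for(orders[item["order_id"]]["cust_id"])
    assert load_lineitem(item) == parent_node   # join is now local
```

The cost is paid once at load time (the 6h42m reported later in the results slide); every subsequent parent-child join avoids repartitioning entirely.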
Join in Hadoop
Outline algorithm, if the tables are not already co-partitioned: mappers read the table partitions and output them under the join attributes, which re-partitions the tables; the reducer then processes the tuples with the same join key, i.e., performs the join on that partition. BTW... the symmetric hash-join is a UT invention! Wilschut & Apers, Dataflow query execution in a parallel main-memory environment, PDIS 1991.
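The outline algorithm above can be sketched as a repartition ("reduce-side") join; tagging each tuple with its source table, as below, is the standard way to tell the two sides apart in a reduce group (my illustration, with made-up data):

```python
# Repartition join: mappers tag each tuple with its table and emit it
# under the join key; each reduce group then joins the two sides.
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]                  # (uid, name)
clicks = [(1, "/home"), (1, "/faq"), (2, "/home")]   # (uid, url)

def map_phase():
    for uid, name in users:
        yield uid, ("U", name)
    for uid, url in clicks:
        yield uid, ("C", url)

groups = defaultdict(list)
for key, tagged in map_phase():       # the shuffle/repartition step
    groups[key].append(tagged)

joined = []
for uid, values in groups.items():    # one reducer call per join key
    names = [v for tag, v in values if tag == "U"]
    urls  = [v for tag, v in values if tag == "C"]
    joined += [(uid, n, u) for n in names for u in urls]

print(sorted(joined))
# [(1, 'alice', '/faq'), (1, 'alice', '/home'), (2, 'bob', '/home')]
```

Note that every byte of both tables crosses the network in the shuffle, which is exactly what the directed and broadcast joins below try to avoid.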
Improved Query Plans
Alternatives to the fully partitioned hash-join:
Directed join: re-partition only one table, when the other argument is already partitioned on the join key.
Broadcast join: ship the entire smaller table to all nodes holding the larger table.
Broadcast & Directed Joins
Non-trivial in Hadoop: HDFS does not guarantee to maintain co-partitioning between jobs, so datasets using the same hash may end up on different nodes. This requires the join in the Map phase, which is hard to do well when multiple passes are required (unless both tables are already sorted by the join key).
Broadcast Join
The mapper reads the smaller table from HDFS into an in-memory hash table, followed by a sequential scan of the larger table: a map-side join (~ Jimmy's in-mapper combiner). Provided there is low-cost database support for temporary tables, the join can in HadoopDB be pushed into the DBMS for (usually) more efficient execution.
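A minimal map-side (broadcast) join sketch, assuming the smaller table fits in each mapper's memory; the tables and names are illustrative, not from the slides:

```python
# Broadcast join: every mapper first loads the small table into an
# in-memory hash table, then streams the large table past it.
small = [("NL", "Netherlands"), ("UK", "United Kingdom")]   # fits in RAM
large = [("NL", 10.0), ("UK", 5.0), ("NL", 2.5), ("US", 1.0)]

lookup = dict(small)          # built once per mapper, before the scan

def mapper(stream):
    for code, revenue in stream:   # sequential scan of the larger table
        name = lookup.get(code)
        if name is not None:       # inner join: drop non-matching rows
            yield name, revenue

print(list(mapper(large)))
# [('Netherlands', 10.0), ('United Kingdom', 5.0), ('Netherlands', 2.5)]
```

No shuffle and no Reduce phase are needed; the price is shipping the small table to every node.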
Directed Join
The OutputFormat feature of Hadoop writes the output of a repartitioning mapper (which reads catalog data for the other table) directly into the DBMSs, circumventing HDFS.
Semi-join
Hadoop: a mapper performs the selection and the projection of the join attribute on the first table; the resulting column is replicated as a "map-side join".
HadoopDB: if the projected column is small (e.g., a list of countries, ...), transform it into SELECT ... WHERE foreignKey IN (list-of-values) – this skips the temporary-table costs completely (~ Jimmy's stripes).
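The HadoopDB rewrite can be illustrated as follows (my sketch; table and column names are hypothetical, and SQLite merely stands in for the node-local DBMS): project the small join column first, then push it into each node's SQL as an IN-list instead of loading a temporary table.

```python
# Semi-join sketch: project the (small) join column from one table, then
# push it into the per-node DBMS as a WHERE ... IN (...) filter.
import sqlite3

con = sqlite3.connect(":memory:")   # stand-in for the node-local DBMS
con.execute("CREATE TABLE sales (country TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("NL", 10.0), ("UK", 5.0), ("US", 1.0)])

# phase 1 (Map): projection of the join attribute from the other table
selected_countries = ["NL", "UK"]   # assumed small, e.g. a country list

# phase 2: rewrite the join as an IN-list, skipping the temporary table
placeholders = ",".join("?" * len(selected_countries))
rows = con.execute(
    f"SELECT country, revenue FROM sales WHERE country IN ({placeholders})",
    selected_countries).fetchall()
print(rows)   # [('NL', 10.0), ('UK', 5.0)]
```

The rewrite is only safe when the projected value list is small enough to inline into the query text sent to every node.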
Results
TPC-H 3 TB on a 45-node cluster. Loading time:
DBMS-X: 33h3m
Hive and Hadoop: 49m
HadoopDB: 11h4m (of which 6h42m for referential partitioning)
VectorWise: 3h47m (includes clustering index creation)
Results
DBMS-X >> Hive, due to Hive's lack of partitioning and indexing. Switching HadoopDB from PostgreSQL to Vectorwise results in a factor-of-7 improvement on average. Generally, the map-side join optimization improves efficiency by a factor of 2 to 3 when using the column store. The semi-join improves by a factor of 2 over the map-side join and a factor of 3.6 over the reduce-side join.
Conclusion
Hybrid is good: MapReduce takes care of the "rack to cluster" scale, the RDBMS takes care of the within-rack part. Not sure how good it is for text-analytical tasks; RDBMSs often have problems with data skew. The Hadapt whitepaper suggests they handle unstructured data with MapReduce and structured data with HadoopDB.
Conclusion
Never a free lunch... If your problem involves non-text data types, consider working with a hybrid solution. If your problem involves primarily textual data, it is still an open question whether a hybrid will actually be of any help.
Information Science
"Search for the fundamental knowledge which will allow us to postulate and utilize the most efficient combination of [human and machine] resources."
M.E. Senko. Information systems: records, relations, sets, entities, and things. Information Systems, 1(1):3-13, 1975.
References
Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes?, CACM 53(1), Jan 2010, 64-70
Bajda-Pawlikowski et al., Efficient Processing of Data Warehousing Queries in a Split Execution Environment, SIGMOD 2011
Abouzeid et al., HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
Thusoo et al., Data Warehousing and Analytics Infrastructure at Facebook, SIGMOD 2010
Wilschut, Flokstra, Apers, Parallelism in a Main-Memory DBMS: The performance of PRISMA/DB, VLDB 1992
Wilschut, Apers & Flokstra, Parallel Query Execution in PRISMA/DB, LNCS 503 (1990)
Daniel Abadi's blog, http://dbmsmusings.blogspot.com/