MapReduce With a SQL-MapReduce focus by Curt A. Monash, Ph.D. President, Monash Research Editor,...
MapReduce
With a SQL-MapReduce focus
by Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact @monash.com
http://www.monash.com
http://www.DBMS2.com
Curt Monash
- Analyst since 1981
  - Covered DBMS since the pre-relational days
  - Also analytics, search, etc.
- Publicly available research
  - Blogs, including DBMS2 (http://www.DBMS2.com)
  - Feed at http://www.monash.com/blogs.html
- User and vendor consulting
Agenda
- Introduction and truisms
- MapReduce overview
- MapReduce specifics
- SQL and MapReduce together
Monash’s First Law of Commercial Semantics
Bad jargon drives out good
For example: “Relational”, “Parallel”, “MapReduce”
Where to measure database technology
- Language interpretation and execution capabilities
  - Functionality
  - Speed
- Administrative capabilities
- How well it all works
  - Fit and finish
  - Reliability
- How much it all – really – costs
You can do anything in 0s and 1s … but how much effort will it actually take?
What’s hard about parallelization*
- Getting the right data …
- … to the right nodes …
- … at the right time …
- … while dealing with errors …
- … and without overloading the network
Otherwise, programming a grid is a lot like programming a single node.
*in general -- not just for “database” technology
MPP DBMS are good at parallelization …
… under three assumptions, namely:
- You can express the job nicely in SQL …
  - … or whatever other automatically-parallel languages the DBMS offers
- You don’t really need query fault-tolerance …
  - … which is usually the case unless you have 1000s of nodes
- There’s enough benefit to storing the data in tables to justify the overhead
SQL commonly gets frustrating …
… when you’re dealing with sequences of events or relationships, because:
- Self-joins are expensive
- Programming is hard when you’re not sure how long the sequence is
For example:
- Clickstreams
- Financial data time series
- Social network graph analysis
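To illustrate the "not sure how long the sequence is" point: in SQL, matching a run of N consecutive events typically takes one self-join per step, so N has to be known up front. A procedural scan handles any length in one pass. The sketch below is only an illustration; the function name and the sample prices are made up, not taken from the slides.

```python
def longest_rising_run(prices):
    """Length of the longest run of strictly increasing consecutive prices."""
    best = current = 1 if prices else 0
    # Compare each price with its predecessor; no need to know the run length in advance.
    for prev, cur in zip(prices, prices[1:]):
        current = current + 1 if cur > prev else 1
        best = max(best, current)
    return best

# Hypothetical tick series: the longest rising run is 11, 12, 13, 14.
ticks = [10, 11, 12, 11, 12, 13, 14, 9]
run_length = longest_rising_run(ticks)
```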
The pure MapReduce alternative
- Lightweight approach to parallelization
  - The only absolute requirement is a certain simple programming model …
  - … so simple that parallelization is “automatic” …
  - … and very friendly to procedural languages
- It doesn’t require a DBMS on the back end
  - No SQL required!
- Non-DBMS implementations commonly have query fault-tolerance
- But you have to take care of optimizing data redistribution yourself
MapReduce evolution
- Used under-the-covers for quite a while
- Named and popularized by Google
- Open-sourced in Hadoop
- Widely adopted by big web companies
- Integrated (at various levels) into MPP RDBMS
- Adopted for social network analysis
- Explored/investigated for data mining applications
- ???
M/R use cases -- large-scale ETL
- Text indexing
  - This is how Google introduced the MapReduce concept
- Time series disaggregation
  - Clickstream sessionization and analytics
  - Stock trade pattern identification
- Relationship graph traversal
M/R use cases – hardcore arithmetic
- Statistical routines
- Data “cooking”
The essence of MapReduce
- “Map” steps
- Data redistribution
- “Reduce” steps
- In strict alternation …
- … or not-so-strict
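The three steps can be sketched on a single machine as follows. This is a minimal illustration, not any particular framework's API: `run_mapreduce`, `map_words` and `reduce_counts` are hypothetical names, and a real system would spread each step across nodes.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run one Map -> redistribute -> Reduce round in a single process."""
    # Map: each input record yields zero or more (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))

    # Data redistribution: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: each key's group is processed independently of the others.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def map_words(line):
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, ones):
    return [(word, sum(ones))]

lines = ["to be or not to be", "To map or to reduce"]
counts = run_mapreduce(lines, map_words, reduce_counts)

# A second round fed by the first round's Reduce output: group words by count.
by_count = run_mapreduce(
    counts,
    lambda pair: [(pair[1], pair[0])],
    lambda n, words: [(n, sorted(words))],
)
```

The second call shows the alternation: the output of one Reduce step becomes the input of the next Map step.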
“Map” step basics (reality)
- Input = anything
  - Set of data
  - Output of previous Reduce step
- Output = anything
  - There’s an obvious key
Map step basics (formality)
- Input = {<key, value> pairs}
- Output = {<key, value> pairs}
- Input and output key types don’t have to be the same
- “Embarrassingly parallel” based on key
Map step examples
- Word count
  - Input format = document/text string
  - Output format = <WordName, 1>
- Text indexing
  - Input format = document/text string
  - Output format = <WordName, (DocumentID, Offset)>
- Log parsing
  - Input format = log file
  - Output format = <Key, formatted event>
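The three Map examples above might look roughly like this in Python. The function names are illustrative, and the `timestamp user_id action` log layout is an assumption made for the sketch; the slides do not specify a log format.

```python
import re

def map_word_count(text):
    """Word count: text string -> <WordName, 1> pairs."""
    return [(word, 1) for word in re.findall(r"\w+", text.lower())]

def map_text_index(doc_id, text):
    """Text indexing: text string -> <WordName, (DocumentID, Offset)> pairs."""
    return [
        (match.group().lower(), (doc_id, match.start()))
        for match in re.finditer(r"\w+", text)
    ]

def map_log_line(line):
    """Log parsing: raw line -> <Key, formatted event>.

    Assumes a hypothetical 'timestamp user_id action...' layout.
    """
    timestamp, user_id, action = line.split(maxsplit=2)
    return [(user_id, {"ts": timestamp, "action": action})]
```

Each function looks only at its own input record, which is what makes the Map step easy to run in parallel.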
Reduce step basics
Input = {<key, value> pairs}, where all the keys are equal
Output = {<key, value> pairs}, where the set commonly has cardinality = 1
Input and output key types don’t have to be the same
Just like Map, “embarrassingly parallel” based on key
Reduce step examples
- Word count
  - Input format = <WordName, 1>
  - Output format = <WordName, count>
- Text indexing
  - Input format = <WordName, (DocumentID, Offset)>
  - Output format = <WordName, index file>
- Log parsing
  - E.g., input format = <UserID or EventID, event record>
  - E.g., output format = <Same, reformatted event record>
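A matching sketch of the three Reduce examples, again with illustrative names. Each function receives one key plus all the values redistributed to it. A sorted posting list stands in for the "index file", and the event record fields (`ts`, `action`) are assumptions.

```python
def reduce_word_count(word, ones):
    """<WordName, 1> pairs -> a single <WordName, count> pair."""
    return [(word, sum(ones))]

def reduce_text_index(word, postings):
    """<WordName, (DocumentID, Offset)> pairs -> <WordName, index>.

    A sorted posting list stands in for the index file.
    """
    return [(word, sorted(postings))]

def reduce_log_events(user_id, events):
    """<UserID, event record> pairs -> <UserID, reformatted event record>.

    Orders one user's events by timestamp and numbers them.
    """
    ordered = sorted(events, key=lambda event: event["ts"])
    return [(user_id, {"seq": i, **event}) for i, event in enumerate(ordered, start=1)]
```

The log-parsing Reduce emits one output per input event, so its output set does not have cardinality 1.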
More honoured in the breach than in the observance!
Sometimes the Reduce step is trivial
MapReduce for data mining
- Partition on some key
- Calculate a single vector* for each whole partition
- Aggregate the vectors
- Hooray!
*Algorithm-dependent
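One simple instance of the partition / vector / aggregate pattern, assuming the "vector" is a tuple of sufficient statistics (count, sum, sum of squares). That choice is only an example; as the footnote says, the real vector depends on the algorithm. The partition names and values are hypothetical.

```python
def partition_vector(values):
    """Summarize one whole partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def aggregate(vectors):
    """Combine per-partition vectors into a global mean and population variance."""
    n = sum(v[0] for v in vectors)
    total = sum(v[1] for v in vectors)
    total_sq = sum(v[2] for v in vectors)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance

# Hypothetical data already partitioned across two nodes.
partitions = {"node_a": [1.0, 2.0, 3.0], "node_b": [4.0, 5.0]}
vectors = [partition_vector(values) for values in partitions.values()]
mean, variance = aggregate(vectors)
```

The aggregation is just element-wise addition, which is why the Reduce step here is trivial: only one small vector per partition crosses the network.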
Sometimes Reduce doesn’t reduce
- Tick stream data “cooking” can increase its size by one to two orders of magnitude
- Sessionization might just add a column – SessionID – to records
  - Or is that a Map step masquerading as a Reduce?
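A sketch of sessionization as a per-user Reduce that only appends a SessionID. The 30-minute inactivity timeout, the record layout and the ID format are all assumptions for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides

def sessionize(clicks):
    """clicks: iterable of (user_id, timestamp_seconds, url).

    Returns the same records with a session ID appended.
    """
    # Redistribute: group clicks by user, the Reduce key.
    by_user = defaultdict(list)
    for user_id, ts, url in clicks:
        by_user[user_id].append((ts, url))

    rows = []
    for user_id in sorted(by_user):
        session = 0
        previous_ts = None
        # Walk one user's clicks in time order; a long gap starts a new session.
        for ts, url in sorted(by_user[user_id]):
            if previous_ts is None or ts - previous_ts > SESSION_GAP_SECONDS:
                session += 1
            rows.append((user_id, ts, url, f"{user_id}-{session}"))
            previous_ts = ts
    return rows

clicks = [("u1", 0, "/home"), ("u2", 50, "/home"), ("u1", 600, "/cart"), ("u1", 4000, "/home")]
rows = sessionize(clicks)
```

The output has exactly as many rows as the input, each one column wider, so nothing is reduced.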
Some reasons to integrate SQL and MapReduce
- JOINs were invented for a reason
- So was SQL 2003
- It’s kind of traditional to keep data in an RDBMS
Some ways to integrate SQL and MapReduce
- A SQL layer built on a MapReduce engine
  - E.g., Facebook’s Hive over Hadoop
  - But building a DBMS-equivalent is hard
- MapReduce invoking SQL
- SQL invoking MapReduce
  - Aster’s SQL M/R
To materialize or not to materialize?
- DBMS avoidance of intermediate materialization => much better performance
- Classic MapReduce intermediate materialization => query fault-tolerance
- How much does query fault-tolerance matter?
  - (Query duration) x (Node count) vs. Node MTTF
- DBMS-style materialization strategies usually win
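The (Query duration) x (Node count) vs. Node MTTF comparison can be made concrete with a little arithmetic. The sketch below assumes independent, exponentially distributed node failures, and every number in it is hypothetical rather than a measurement.

```python
import math

def expected_failures(query_hours, node_count, node_mttf_hours):
    """Expected number of node failures while one query runs."""
    return query_hours * node_count / node_mttf_hours

def p_any_failure(query_hours, node_count, node_mttf_hours):
    """Probability that at least one node fails mid-query.

    Assumes independent, exponentially distributed failures.
    """
    return 1.0 - math.exp(-expected_failures(query_hours, node_count, node_mttf_hours))

# Hypothetical cases: a 6-minute query on 20 nodes vs. a 10-hour job on 2,000 nodes.
small = p_any_failure(query_hours=0.1, node_count=20, node_mttf_hours=10_000)
large = p_any_failure(query_hours=10, node_count=2_000, node_mttf_hours=10_000)
```

Under these assumptions the small case fails roughly 0.02% of the time, so simply rerunning the query is cheap. The large case expects about two failures per run and hits at least one about 86% of the time, which is where materializing intermediate results starts to pay off.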
Other reasons to put your data in a real database
- Query response time
- General performance
- Backup
- Security
- General administration
- SQL syntax
- General programmability and connectivity
Aspects of Aster’s approach to MapReduce
- Data stored in a database
- MapReduce execution managed by a DBMS
- Flexible MapReduce syntax
- MapReduce invoked via SQL
Further information
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact @monash.com
http://www.monash.com
http://www.DBMS2.com