MapReduce and DBMS Hybrids
Transcript of MapReduce and DBMS Hybrids
12: MapReduce and DBMS Hybrids
Zubair Nabi
May 26, 2013
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 1 / 37
Outline
1 Hive
2 HadoopDB
3 nCluster
4 Summary
Introduction
- Data warehousing solution built atop Hadoop by Facebook
- Now an Apache open source project
- Queries are expressed in SQL-like HiveQL, which is compiled into map-reduce jobs
- Also contains a type system for describing RDBMS-like tables
- A system catalog, the Hive Metastore, which contains schemas and statistics, is used for data exploration and query optimization
- Stores 2 PB of uncompressed data at Facebook and is heavily used for simple summarization, business intelligence, and machine learning, among many other applications [1]
- Also used by Digg, Grooveshark, hi5, Last.fm, Scribd, etc.

[1] https://www.facebook.com/note.php?note_id=89508453919
Data Model
Tables:I Similar to RDBMS tables
I Each table has a corresponding HDFS directoryI The contents of the table are serialized and stored in files within that
directoryI Serialization can be both system provided or user definedI Serialization information of each table is also stored in the
Hive-Metastore for query optimizationI Tables can also be defined for data stored in external sources such as
HDFS, NFS, and local FS
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 5 / 37
Data Model (2)
- Partitions:
  - Determine the distribution of data within sub-directories of the main table directory
  - For instance, for a table T stored in /wh/T and partitioned on columns ds and ctry:
    - Data with ds value 20090101 and ctry value US
    - Will be stored in files within /wh/T/ds=20090101/ctry=US
- Buckets:
  - Data within partitions is divided into buckets
  - Buckets are calculated based on the hash of a column within the partition
  - Each bucket is stored within a file in the partition directory
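The partition directory layout and bucket assignment described above can be sketched in Python. This is illustrative only: the column names and bucket count are hypothetical, and CRC32 merely stands in for Hive's actual (different) hash function.

```python
import zlib

def partition_path(warehouse, table, partition_cols):
    """Build the HDFS-style directory for a row's partition values."""
    parts = "/".join(f"{col}={val}" for col, val in partition_cols)
    return f"{warehouse}/{table}/{parts}"

def bucket_file(value, num_buckets):
    """Assign a row to a bucket file by hashing the bucketing column.
    CRC32 is a deterministic stand-in for Hive's hash."""
    h = zlib.crc32(str(value).encode())
    return f"bucket_{h % num_buckets:05d}"

# A row with ds=20090101 and ctry=US for table T in warehouse /wh:
path = partition_path("/wh", "T", [("ds", "20090101"), ("ctry", "US")])
print(path)  # -> /wh/T/ds=20090101/ctry=US
print(bucket_file("user123", 32))  # one of 32 bucket files in that directory
```

The same userid always hashes to the same bucket file, which is what lets Hive prune both partitions and buckets when answering a query.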
Column Data Types
- Primitive types: integers, floats, strings, dates, and booleans
- Nestable collection types: arrays and maps
- Custom types: user-defined
HiveQL
- Supports select, project, join, aggregate, union all, and sub-queries
- Tables are created using data definition statements with specific serialization formats, partitioning, and bucketing
- Data is loaded from external sources and inserted into tables
- Support for multi-table insert: multiple queries over the same input data using a single HiveQL statement
- User-defined column transformation and aggregation functions in Java
- Custom map-reduce scripts written in any language can be embedded
Example: Facebook Status
- Status updates are stored in flat files in an NFS directory /logs/status_updates
- This data is loaded on a daily basis into a Hive table: status_updates(userid int, status string, ds string)
- Using:

  LOAD DATA LOCAL INPATH '/logs/status_updates'
  INTO TABLE status_updates PARTITION (ds='2013-05-26')

- Detailed profile information, such as gender and academic institution, is present in the table: profiles(userid int, school string, gender int)
Example: Facebook Status (2)
- Query to work out the frequency of status updates by gender and academic institution:

  FROM (SELECT a.status, b.school, b.gender
        FROM status_updates a JOIN profiles b
        ON (a.userid = b.userid and
            a.ds='2013-05-26')
       ) subq1
  INSERT OVERWRITE TABLE gender_summary
    PARTITION(ds='2013-05-26')
    SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
  INSERT OVERWRITE TABLE school_summary
    PARTITION(ds='2013-05-26')
    SELECT subq1.school, COUNT(1) GROUP BY subq1.school
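The point of the multi-table insert above is that one scan of the joined data feeds two separate aggregations. Its semantics can be simulated in plain Python with toy in-memory rows (the data values are made up; the names follow the query):

```python
from collections import Counter

# Toy stand-ins for the status_updates and profiles tables.
status_updates = [
    {"userid": 1, "status": "hello", "ds": "2013-05-26"},
    {"userid": 2, "status": "hi",    "ds": "2013-05-26"},
    {"userid": 1, "status": "yo",    "ds": "2013-05-25"},  # filtered out by ds
]
profiles = {  # userid -> profile row
    1: {"school": "MIT", "gender": 0},
    2: {"school": "CMU", "gender": 1},
}

# The FROM (... JOIN ...) subq1: join on userid, filter on ds.
subq1 = [
    {"status": s["status"], **profiles[s["userid"]]}
    for s in status_updates
    if s["ds"] == "2013-05-26" and s["userid"] in profiles
]

# Two INSERTs over the same scanned data: gender_summary and school_summary.
gender_summary = Counter(row["gender"] for row in subq1)
school_summary = Counter(row["school"] for row in subq1)

print(dict(gender_summary))  # {0: 1, 1: 1}
print(dict(school_summary))  # {'MIT': 1, 'CMU': 1}
```

In Hive this compiles to a map-reduce plan where the join output is materialized once and both GROUP BYs consume it, rather than scanning the input twice.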
Metastore
- Similar to the metastore maintained by traditional warehousing solutions such as Oracle and IBM DB2 (this distinguishes Hive from Pig or Cascading, which have no such store)
- Stored either in a traditional DB such as MySQL or in an FS such as NFS
- Contains the following objects:
  - Database: namespace for tables
  - Table: metadata for a table, including columns and their types, owner, storage, and serialization information
  - Partition: metadata for a partition; similar to the information for a table
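The three object kinds can be pictured as simple records. This is a hypothetical, minimal rendering; the real Metastore schema has many more fields than shown here:

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    columns: dict          # column name -> type
    owner: str
    storage_location: str  # e.g. an HDFS directory
    serde: str             # serialization/deserialization info

@dataclass
class Partition:
    table: str
    values: dict           # partition column -> value
    storage_location: str
    serde: str

@dataclass
class Database:
    name: str                               # namespace for tables
    tables: dict = field(default_factory=dict)

# Register the example table from the previous slides.
db = Database("default")
db.tables["status_updates"] = Table(
    name="status_updates",
    columns={"userid": "int", "status": "string", "ds": "string"},
    owner="facebook",
    storage_location="/wh/status_updates",
    serde="LazySimpleSerDe")
print(db.tables["status_updates"].owner)  # facebook
```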
Outline
1 Hive
2 HadoopDB
3 nCluster
4 Summary
Hybrid
- Combine the scalability and non-existent monetary cost of MapReduce with the performance of parallel DBs
- HadoopDB is such a hybrid
  - Unlike Hive, Pig, Greenplum, Aster, etc., which are language- and interface-level hybrids, HadoopDB is a systems-level hybrid
- Uses MapReduce as the communication layer atop a cluster of nodes running single-node DBMS instances
- PostgreSQL as the database layer, Hadoop as the communication layer, and Hive as the translation layer
- Commercialized through the startup Hadapt [2]

[2] http://hadapt.com/
HadoopDB
Consists of four components:
1. Database Connector: interface between per-node database systems and Hadoop TaskTrackers
2. Catalog: meta-information about per-node databases
3. Data Loader: data partitioning across single-node databases
4. SQL to MapReduce to SQL (SMS) Planner: translation between SQL and MapReduce
HadoopDB Architecture
[architecture diagram not reproduced in the transcript]
Database Connector
- Uses the Java Database Connectivity (JDBC)-compliant Hadoop InputFormat
- The connector is served the SQL query and other information by the MapReduce job
- The connector connects to the DB, executes the SQL query, and returns results in the form of key/value pairs
- Hadoop in essence sees the DB as just another data source
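A toy version of the connector's contract — take a pushed-down SQL query, run it against a node-local database, and emit key/value pairs — can be sketched with Python's built-in sqlite3 standing in for PostgreSQL and for the real InputFormat machinery:

```python
import sqlite3

def db_records(conn, sql):
    """Run the pushed-down SQL and yield (key, value) pairs, the shape
    Hadoop expects from an InputFormat's record reader."""
    cur = conn.execute(sql)
    for i, row in enumerate(cur):
        yield i, row  # key = record offset, value = the result tuple

# Stand-in for a node-local database populated by the Data Loader.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (userid INTEGER, school TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "MIT"), (2, "CMU")])

for key, value in db_records(conn, "SELECT school FROM t ORDER BY userid"):
    print(key, value)  # 0 ('MIT',) then 1 ('CMU',)
```

From Hadoop's point of view, the map tasks consuming these pairs cannot tell whether they came from an HDFS file or from a database, which is exactly the "just another data source" property above.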
Catalog
Contains information, such as:
1 Connection parameters, such as DB location, format, and any credentials
2 Metadata about the datasets, replica locations, and partitioning scheme
Stored as an XML file on the HDFS
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 18 / 37
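The deck does not show the catalog's actual schema; a hypothetical XML file of roughly this shape, parsed with Python's standard library, illustrates the kind of information it holds (element and attribute names are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog entry; HadoopDB's real XML schema may differ.
catalog_xml = """
<catalog>
  <node id="worker-1">
    <connection url="jdbc:postgresql://worker-1/sales"
                user="hadoopdb" driver="org.postgresql.Driver"/>
    <dataset name="sales" partition_key="region">
      <chunk id="0" replica="worker-2"/>
      <chunk id="1" replica="worker-3"/>
    </dataset>
  </node>
</catalog>
"""

root = ET.fromstring(catalog_xml)
node = root.find("node")
conn = node.find("connection")          # connection parameters
chunks = node.findall(".//chunk")       # replica locations

print(node.get("id"))                        # worker-1
print(conn.get("url"))                       # jdbc:postgresql://worker-1/sales
print([c.get("replica") for c in chunks])    # ['worker-2', 'worker-3']
```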
Data Loader
Consists of two key components:
1 Global Hasher: Executes a custom Hadoop job to repartition raw data files from the HDFS into n parts, where n is the number of nodes in the cluster
2 Local Hasher: Copies a partition from the HDFS to the node-local DB of each node and further partitions it into smaller-sized chunks
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 19 / 37
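The two-level partitioning can be sketched in a few lines of Python; the hash function and chunk count here are illustrative (HadoopDB uses its own hash functions and chunk-size policy):

```python
from collections import defaultdict

def global_hash(records, key, n_nodes):
    """Global Hasher: repartition raw records into one part per node."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec[key]) % n_nodes].append(rec)
    return parts

def local_hash(partition, key, n_chunks):
    """Local Hasher: split one node's partition into smaller chunks."""
    chunks = defaultdict(list)
    for rec in partition:
        chunks[hash(rec[key]) % n_chunks].append(rec)
    return chunks

records = [{"userid": u, "clicks": c} for u, c in
           [(1, 5), (2, 3), (3, 7), (4, 1), (5, 2), (6, 9)]]

node_parts = global_hash(records, "userid", n_nodes=3)
chunked = {node: local_hash(part, "userid", n_chunks=2)
           for node, part in node_parts.items()}

# Every record lands on exactly one node, then in exactly one chunk.
total = sum(len(c) for chunks in chunked.values() for c in chunks.values())
print(total)  # 6
```

Because both levels hash on the same key, all rows sharing a partitioning key end up in the same node-local database, which is what lets the SMS Planner push work down later.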
SQL to MapReduce to SQL (SMS) Planner
Extends HiveQL in two key ways:
1 Before query execution, the Hive Metastore is updated with references to HadoopDB tables, table schemas, formats, and serialization information
2 All operators with partitioning keys similar to the node-local database are converted into SQL queries and pushed to the database layer
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 20 / 37
Outline
1 Hive
2 HadoopDB
3 nCluster
4 Summary
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 21 / 37
Introduction
The declarative nature of SQL is too limiting for describing most big data computation
The underlying subsystems are also suboptimal as they do not consider domain-specific optimizations
nCluster makes use of SQL/MR, a framework that inserts user-defined functions in any programming language into SQL queries
By itself, nCluster is a shared-nothing parallel database geared towards analytic workloads
Originally designed by Aster Data Systems and later acquired by Teradata
Used by Barnes and Noble, LinkedIn, SAS, etc.
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 22 / 37
SQL/MR Functions
Dynamically polymorphic: input and output schemas are decided at runtime
Parallelizable across cores and machines
Composable because their input and output behaviour is identical to SQL subqueries
Amenable to static and dynamic optimizations just like SQL subqueries or a relation
Can be implemented in a number of languages including Java, C#, C++, Python, etc. and can thus make use of third-party libraries
Executed within separate processes to provide sandboxing and resource allocation
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 23 / 37
Syntax

SELECT ...
FROM functionname(
    ON table-or-query
    [PARTITION BY expr, ...]
    [ORDER BY expr, ...]
    [clausename(arg, ...) ...]
)
...

SQL/MR function appears in the FROM clause
ON is the only required clause; it specifies the input to the function
PARTITION BY partitions the input to the function on one or more attributes from the schema
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 24 / 37
Syntax (2)

SELECT ...
FROM functionname(
    ON table-or-query
    [PARTITION BY expr, ...]
    [ORDER BY expr, ...]
    [clausename(arg, ...) ...]
)
...

ORDER BY sorts the input to the function and can only be used after a PARTITION BY clause
Any number of custom clauses can also be defined whose names and arguments are passed as a key/value map to the function
Implemented as relations so easily nestable
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 25 / 37
Execution Model
Functions are equivalent to either map (row function) or reduce (partition function) functions
Identical to MapReduce, these functions are executed across many nodes and machines
Contracts identical to MapReduce functions:
I Only one row function operates over a row from the input table
I Only one partition function operates over a group of rows defined by the PARTITION BY clause, in the order specified by the ORDER BY clause
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 26 / 37
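The row/partition contract can be mimicked in plain Python, where a sort plus groupby plays the role of PARTITION BY ... ORDER BY; the function names here are illustrative, not SQL/MR's API:

```python
from itertools import groupby
from operator import itemgetter

def apply_row_function(rows, fn):
    """Row function: invoked once per input row (map-like)."""
    out = []
    for row in rows:
        out.extend(fn(row))
    return out

def apply_partition_function(rows, partition_key, order_key, fn):
    """Partition function: invoked once per PARTITION BY group, with
    rows delivered in ORDER BY order (reduce-like)."""
    rows = sorted(rows, key=itemgetter(partition_key, order_key))
    out = []
    for key, group in groupby(rows, key=itemgetter(partition_key)):
        out.extend(fn(key, list(group)))
    return out

clicks = [
    {"userid": 7656, "ts": 2}, {"userid": 238909, "ts": 1},
    {"userid": 238909, "ts": 3}, {"userid": 7656, "ts": 1},
]

# Row function: project a single column, one output row per input row.
users = apply_row_function(clicks, lambda r: [r["userid"]])

# Partition function: count clicks per user.
counts = apply_partition_function(
    clicks, "userid", "ts",
    lambda user, rows: [(user, len(rows))])

print(sorted(set(users)))  # [7656, 238909]
print(counts)              # [(7656, 2), (238909, 2)]
```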
Programming Interface
A Runtime Contract is passed by the query planner to the function, which contains the names and types of the input columns and the names and values of the argument clauses
The function then completes this contract by filling in the output schema and making a call to complete()
Row and partition functions are implemented through the operateOnSomeRows and operateOnPartition methods, respectively
I These methods are passed an iterator over their input rows and an emitter object for returning output rows to the database
operateOnPartition can also optionally implement the combiner interface
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 27 / 37
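A toy Python mock of the contract negotiation; the class and method names echo the deck's Java interface, but the mechanics here are invented purely for illustration:

```python
class RuntimeContract:
    """Illustrative stand-in for SQL/MR's runtime contract object."""

    def __init__(self, input_columns, clause_args):
        self.input_columns = input_columns  # e.g. [("ts", "int"), ...]
        self.clause_args = clause_args      # custom clause key/value map
        self.output_columns = None
        self.completed = False

    def set_output_columns(self, columns):
        self.output_columns = columns

    def complete(self):
        # After this call the planner knows the function's output schema.
        assert self.output_columns is not None
        self.completed = True

class Sessionize:
    def __init__(self, contract):
        # Read custom clause arguments, then fill in the output schema:
        # the input columns plus a computed session column.
        self.timeout = int(contract.clause_args["TIMEOUT"])
        contract.set_output_columns(
            contract.input_columns + [("session", "int")])
        contract.complete()

contract = RuntimeContract(
    input_columns=[("ts", "int"), ("userid", "int")],
    clause_args={"TIMECOLUMN": "ts", "TIMEOUT": "60"})
fn = Sessionize(contract)
print(contract.completed)           # True
print(contract.output_columns[-1])  # ('session', 'int')
```

This is what makes the functions dynamically polymorphic: the output schema is computed from whatever input schema and clause arguments arrive at plan time, not fixed at compile time.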
Installation
Functions need to be installed before they can be used
Can be supplied as a .zip along with third-party libraries
Install-time examination also enables static analysis of properties, such as row function vs. partition function, support for combining, etc.
Any arbitrary file, such as configuration files or binaries, can be installed; installed files are replicated to all workers
Each function is provided with a temporary directory which is garbage collected after execution
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 28 / 37
Architecture
One or more Queen nodes process queries and hash partition them across Worker nodes
The query planner honours the Runtime Contract with the function and invokes its initializer (constructor in the case of Java)
Functions are executed within the Worker databases as separate processes for isolation, security, resource allocation, forced termination, etc.
The worker database implements a "bridge" which manages its communication with the SQL/MR function
The SQL/MR function process contains a "runner" which manages its communication with the worker database
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 29 / 37
Architecture (2)
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 30 / 37
Example: Wordcount

SELECT token, COUNT(*)
FROM tokenizer(
    ON input-table
    DELIMITER(' ')
)
GROUP BY token;
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 31 / 37
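The same computation stripped of SQL/MR, in plain Python: the tokenizer plays the row function, and a counter plays GROUP BY token + COUNT(*):

```python
from collections import Counter

def tokenizer(rows, delimiter=" "):
    """Row function: split each input row into one output row per token."""
    for row in rows:
        for token in row.split(delimiter):
            if token:
                yield token

input_table = ["the quick brown fox", "the lazy dog", "the fox"]

# GROUP BY token + COUNT(*)
counts = Counter(tokenizer(input_table))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```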
Example: Clickstream Sessionization
Divide a user's clicks on a website into sessions
A session includes the user's clicks within a specified time period

Input:
Timestamp   User ID
10:00:00    238909
00:58:24    7656
10:00:24    238909
02:30:33    7656
10:01:23    238909
10:02:40    238909

Output:
Timestamp   User ID   Session ID
10:00:00    238909    0
10:00:24    238909    0
10:01:23    238909    0
10:02:40    238909    1
00:58:24    7656      0
02:30:33    7656      1
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 32 / 37
Example: Clickstream Sessionization (2)

SELECT ts, userid, session
FROM sessionize (
    ON clicks
    PARTITION BY userid
    ORDER BY ts
    TIMECOLUMN ('ts')
    TIMEOUT (60)
);
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 33 / 37
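A minimal Python sketch of the sessionize logic that reproduces the example table, assuming timestamps are seconds-of-day and TIMEOUT is in seconds (the real SQL/MR function is a Java partition function):

```python
from itertools import groupby
from operator import itemgetter

def to_seconds(ts):
    h, m, s = map(int, ts.split(":"))
    return h * 3600 + m * 60 + s

def sessionize(clicks, timeout=60):
    """Partition by userid, order by ts, and start a new session
    whenever the gap between consecutive clicks exceeds `timeout`."""
    out = []
    clicks = sorted(clicks, key=lambda c: (c[1], to_seconds(c[0])))
    for userid, rows in groupby(clicks, key=itemgetter(1)):
        session, prev = 0, None
        for ts, _ in rows:
            t = to_seconds(ts)
            if prev is not None and t - prev > timeout:
                session += 1
            out.append((ts, userid, session))
            prev = t
    return out

clicks = [("10:00:00", 238909), ("00:58:24", 7656),
          ("10:00:24", 238909), ("02:30:33", 7656),
          ("10:01:23", 238909), ("10:02:40", 238909)]

result = sessionize(clicks)
for row in result:
    print(row)
```

With a 60-second timeout, 238909's first three clicks (gaps of 24 s and 59 s) share session 0, while the 77-second gap before 10:02:40 opens session 1, matching the table on the previous slide.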
Example: Clickstream Sessionization (3)

public class Sessionize implements PartitionFunction {

  private int timeColumnIndex;
  private int timeout;

  public Sessionize(RuntimeContract contract) {
    // Get time column and timeout from contract
    // Define output schema
    contract.complete();
  }

  public void operateOnPartition(
      PartitionDefinition partition,
      RowIterator inputIterator,
      RowEmitter outputEmitter) {
    // Implement the partition function logic
    // Emit output rows
  }

}
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 34 / 37
Outline
1 Hive
2 HadoopDB
3 nCluster
4 Summary
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 35 / 37
Summary
Hive, HadoopDB, and nCluster explore three different points in the design space
1 Hive uses MapReduce to give DBMS-like functionality
2 HadoopDB uses MapReduce and DBMS side-by-side
3 nCluster implements MapReduce within a DBMS
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 36 / 37
References
1 Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 2 (August 2009), 1626-1629.
2 Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. 2009. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. 2, 1 (August 2009), 922-933.
3 Eric Friedman, Peter Pawlowski, and John Cieslewicz. 2009. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2 (August 2009), 1402-1413.
Zubair Nabi 12: MapReduce and DBMS Hybrids May 26, 2013 37 / 37