Shark: Hive on Spark
Borrowed slide 1
Optional Reading (additional material)
Prajakta Kalmegh, Duke University
What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x
Borrowed slide 2
Motivation
Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?
Borrowed slide 3
Hive Architecture (diagram): a Client (CLI, JDBC) submits queries to the Driver, whose pipeline runs SQL Parser → Query Optimizer → Physical Plan → Execution on MapReduce, backed by the Metastore and HDFS
Borrowed slide 4
Shark Architecture (diagram): the same structure as Hive, with a Client (CLI, JDBC), a Driver running SQL Parser → Query Optimizer → Physical Plan → Execution, and the Metastore and HDFS, but with MapReduce replaced by Spark and a Cache Manager added to the Driver
Borrowed slide 5
Shark Engine: Extensions to Hive
• PDE (Partial DAG Execution)
  • Supports dynamic query optimization
  • Allows dynamic alteration of query plans based on data statistics collected at run-time
  • Shark uses PDE to optimize the global structure of the plan at stage boundaries
• Skew Handling and Degree of Parallelism
  • The degree of parallelism matters differently for mappers vs. reducers (too few reduce partitions can overload reducers)
  • Skew mitigation: fine-grained partitions are assigned to coalesced partitions using a greedy bin-packing heuristic
• Distributed Data Loading
  • Loading tasks use the data schema to extract individual fields from rows
  • Each task marshals its partition of data into a columnar representation
  • The resulting columns are stored in memory
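The greedy bin-packing heuristic for skew mitigation can be sketched as follows. This is an illustrative Python sketch, not Shark's actual (Scala) implementation: fine-grained map-output partitions are coalesced into a fixed number of reduce partitions by repeatedly placing the next-largest partition into the currently lightest bin.

```python
# Hypothetical sketch of greedy bin-packing for skew mitigation:
# assign fine-grained partitions (known sizes) to a fixed number of
# coalesced partitions, balancing total size across them.
import heapq

def coalesce_partitions(partition_sizes, num_bins):
    """Greedily assign each fine-grained partition to the coalesced
    partition (bin) with the smallest current load."""
    # Min-heap of (current_load, bin_index): the lightest bin is on top.
    bins = [(0, i) for i in range(num_bins)]
    heapq.heapify(bins)
    assignment = [[] for _ in range(num_bins)]
    # Placing the largest partitions first makes the greedy
    # heuristic produce much more balanced bins.
    order = sorted(range(len(partition_sizes)),
                   key=lambda p: partition_sizes[p], reverse=True)
    for p in order:
        load, i = heapq.heappop(bins)
        assignment[i].append(p)
        heapq.heappush(bins, (load + partition_sizes[p], i))
    return assignment

# One heavily skewed partition (size 90) plus nine small ones,
# coalesced into 3 reduce partitions:
sizes = [90, 10, 10, 10, 10, 10, 10, 10, 10, 10]
result = coalesce_partitions(sizes, 3)
```

With this input the skewed partition ends up alone in one bin while the small partitions are spread across the other two, keeping the reduce-side load roughly balanced.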
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage            Column Storage
1  john   4.1          1     2     3
2  mike   3.5          john  mike  sally
3  sally  6.4          4.1   3.5   6.4
Borrowed slide 8
Benefit: similarly compact size to serialized data, but >5x faster to access
Borrowed slide 9
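The row-vs-column layout on the slide can be made concrete with a short sketch. This is illustrative Python (Shark itself does this in Scala on the JVM), showing how storing each column in a packed primitive array avoids the per-object overhead of caching every record as a separate object:

```python
# Illustrative sketch of column-oriented storage with primitive arrays,
# mirroring the table on the slide (not Shark's actual code).
from array import array

# Row storage: one object per record, each field boxed individually.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one packed primitive array (or string list) per column.
ids    = array("i", [r[0] for r in rows])   # packed 32-bit ints
names  = [r[1] for r in rows]               # strings remain object-typed
scores = array("d", [r[2] for r in rows])   # packed 64-bit floats

# Reassembling row 1 from the columns recovers the original record.
row1 = (ids[1], names[1], scores[1])
```

The packed arrays hold raw primitive values contiguously, which is what gives the compact, serialization-like footprint while still allowing direct indexed access.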
References:
[1] Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, and Xiaodong Zhang. 2011. YSmart: Yet Another SQL-to-MapReduce Translator. In Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS '11). IEEE Computer Society, Washington, DC, USA, 25-36.
[2] Harold Lim, Herodotos Herodotou, and Shivnath Babu. 2012. Stubby: a transformation-based optimizer for MapReduce workflows. Proc. VLDB Endow. 5, 11 (July 2012), 1196-1207.
[3] PTF: http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive
[4] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2010. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 1-2 (September 2010), 285-296.
[5] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10). ACM, New York, NY, USA, 810-818.
[6] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12). USENIX Association, Berkeley, CA, USA, 2-2.
[7] Spark and Shark: http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data?related=1
[8] Spark SQL: http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=2