Shark: Hive on Spark
Borrowed slide 1
Optional Reading (additional material)
Prajakta Kalmegh, Duke University
What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x
Borrowed slide 2
Motivation
Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?
Borrowed slide 3
Hive Architecture (diagram): a Client (CLI, JDBC) submits queries to the Driver, whose pipeline runs SQL Parser → Query Optimizer → Physical Plan → Execution on MapReduce, backed by the Metastore and HDFS
Borrowed slide 4
Shark Architecture (diagram): the same structure as Hive, with a Client (CLI, JDBC), a Driver running SQL Parser → Query Optimizer → Physical Plan → Execution, and the Metastore and HDFS, but with MapReduce replaced by Spark and a Cache Manager added to the Driver
Borrowed slide 5
Shark Engine: Extensions to Hive
• PDE (Partial DAG Execution)
  • Supports dynamic query optimization
  • Allows dynamic alteration of query plans based on data statistics collected at run-time
  • Shark uses PDE to optimize the global structure of the plan at stage boundaries
• Skew Handling and Degree of Parallelism
  • The degree of parallelism matters differently for mappers vs. reducers (too few reduce partitions can overload reducers)
  • Skew mitigation: fine-grained partitions are assigned to coalesced partitions using a greedy bin-packing heuristic
• Distributed Data Loading
  • Loading tasks use the data schema to extract individual fields from rows
  • Each task marshals its partition of data into a columnar representation
  • The resulting columns are stored in memory
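The greedy bin-packing heuristic for skew mitigation can be sketched as follows. This is an illustrative Python sketch, not Shark's actual (Scala) implementation: fine-grained map-output partitions are coalesced into a fixed number of reduce partitions by repeatedly placing the next-largest partition into the currently lightest bin.

```python
# Hypothetical sketch of greedy bin-packing for skew mitigation:
# assign fine-grained partitions (known sizes) to a fixed number of
# coalesced partitions, balancing total size across them.
import heapq

def coalesce_partitions(partition_sizes, num_bins):
    """Greedily assign each fine-grained partition to the coalesced
    partition (bin) with the smallest current load."""
    # Min-heap of (current_load, bin_index): the lightest bin is on top.
    bins = [(0, i) for i in range(num_bins)]
    heapq.heapify(bins)
    assignment = [[] for _ in range(num_bins)]
    # Placing the largest partitions first makes the greedy
    # heuristic produce much more balanced bins.
    order = sorted(range(len(partition_sizes)),
                   key=lambda p: partition_sizes[p], reverse=True)
    for p in order:
        load, i = heapq.heappop(bins)
        assignment[i].append(p)
        heapq.heappush(bins, (load + partition_sizes[p], i))
    return assignment

# One heavily skewed partition (size 90) plus nine small ones,
# coalesced into 3 reduce partitions:
sizes = [90, 10, 10, 10, 10, 10, 10, 10, 10, 10]
result = coalesce_partitions(sizes, 3)
```

With this input the skewed partition ends up alone in one bin while the small partitions are spread across the other two, keeping the reduce-side load roughly balanced.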
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage            Column Storage
1  john   4.1          1     2     3
2  mike   3.5          john  mike  sally
3  sally  6.4          4.1   3.5   6.4
Borrowed slide 8
Benefit: similarly compact size to serialized data, but >5x faster to access
Borrowed slide 9
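The row-vs-column layout on the slide can be made concrete with a short sketch. This is illustrative Python (Shark itself does this in Scala on the JVM), showing how storing each column in a packed primitive array avoids the per-object overhead of caching every record as a separate object:

```python
# Illustrative sketch of column-oriented storage with primitive arrays,
# mirroring the table on the slide (not Shark's actual code).
from array import array

# Row storage: one object per record, each field boxed individually.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one packed primitive array (or string list) per column.
ids    = array("i", [r[0] for r in rows])   # packed 32-bit ints
names  = [r[1] for r in rows]               # strings remain object-typed
scores = array("d", [r[2] for r in rows])   # packed 64-bit floats

# Reassembling row 1 from the columns recovers the original record.
row1 = (ids[1], names[1], scores[1])
```

The packed arrays hold raw primitive values contiguously, which is what gives the compact, serialization-like footprint while still allowing direct indexed access.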
References:
[1] Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, and Xiaodong Zhang. 2011. YSmart: Yet Another SQL-to-MapReduce Translator. In Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS '11). IEEE Computer Society, Washington, DC, USA, 25-36.
[2] Harold Lim, Herodotos Herodotou, and Shivnath Babu. 2012. Stubby: a transformation-based optimizer for MapReduce workflows. Proc. VLDB Endow. 5, 11 (July 2012), 1196-1207.
[3] PTF: http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive
[4] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2010. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 1-2 (September 2010), 285-296.
[5] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10). ACM, New York, NY, USA, 810-818.
[6] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12). USENIX Association, Berkeley, CA, USA, 2-2.
[7] Spark and Shark: http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data?related=1
[8] Spark SQL: http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=2