Cloudera Impala, updated for v1.0
![Page 1: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/1.jpg)
Cloudera's Impala
Scott Leberknight
7/9/2013
![Page 2: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/2.jpg)
History lesson...
![Page 3: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/3.jpg)
Google Map/Reduce paper (2004)
Cutting & Cafarella create Hadoop (2005)
![Page 4: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/4.jpg)
Google Dremel paper (2010)
Facebook creates Hive (2007)*
![Page 5: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/5.jpg)
Cloudera announces Impala (October 2012)
HortonWorks' Stinger (February 2013)
Apache Drill proposal (August 2012)
![Page 6: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/6.jpg)
* Hive => "SQL on Hadoop"
Write SQL queries
Translate into Map/Reduce job(s)
Convenient & easy
High-latency (batch processing)
![Page 7: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/7.jpg)
What is Impala?
In-memory, distributed SQL query engine (no Map/Reduce)
Native code (C++)
Distributed (on HDFS data nodes)
![Page 8: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/8.jpg)
Why Impala?
Interactive data analysis
Low-latency response (roughly 4-100x faster than Hive)
Deploy on existing Hadoop clusters
![Page 9: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/9.jpg)
Why Impala? (cont'd)
Data stored in HDFS avoids...
...duplicate storage
...data transformation
...moving data
![Page 10: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/10.jpg)
Why Impala? (cont'd)
SPEED!
![Page 11: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/11.jpg)
Overview
impalad daemon runs on HDFS nodes
statestored (for cluster metadata) & Hive metastore (for database metadata)
Queries run on "relevant" nodes
Supports common HDFS file formats
![Page 12: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/12.jpg)
Overview (cont'd)
Does not use Map/Reduce
Not fault tolerant! (a query fails if any part of it fails on any node)
Submit queries via Hue/Beeswax Thrift API, CLI, ODBC, JDBC
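The "not fault tolerant" point can be illustrated with a toy coordinator. This is a minimal sketch, not Impala's actual code: `run_fragment`, `run_query`, `FragmentFailed`, and the node names are all hypothetical. It only shows the behavior described on the slide: no retries, so one failed piece fails the whole query.

```python
from concurrent.futures import ThreadPoolExecutor


class FragmentFailed(Exception):
    """Raised by a (hypothetical) node when its part of the query fails."""


def run_fragment(node, fail_on):
    # Stand-in for executing one piece of the query plan on one node.
    if node in fail_on:
        raise FragmentFailed(f"fragment failed on {node}")
    return [f"row-from-{node}"]


def run_query(nodes, fail_on=frozenset()):
    # No retries and no re-scheduling: the first fragment error
    # propagates to the caller and the whole query fails.
    rows = []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for part in pool.map(run_fragment, nodes, [fail_on] * len(nodes)):
            rows.extend(part)
    return rows
```

Calling `run_query(["n1", "n2", "n3"], fail_on={"n2"})` raises `FragmentFailed` even though two of the three nodes succeeded. A Map/Reduce job would instead re-run the failed task.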
![Page 13: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/13.jpg)
SQL Support (subset of Hive QL)
SELECT
Projection
UNION
INSERT OVERWRITE
INSERT INTO
ORDER BY (w/ LIMIT)
Aggregation
Subqueries (uncorrelated)
JOIN (equi-join only, subject to memory limitations)
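One plausible reason ORDER BY is paired with LIMIT (an assumption, not something the slides state) is that a bounded top-N can be kept in memory without sorting the full result. A minimal sketch of that idea, with a hypothetical `order_by_limit` helper:

```python
import heapq


def order_by_limit(rows, key, limit):
    # heapq.nsmallest keeps at most `limit` candidates while scanning,
    # so memory use depends on the limit rather than on the row count.
    return heapq.nsmallest(limit, rows, key=key)


rows = [("smith", 3), ("adams", 1), ("baker", 2)]
top_two = order_by_limit(rows, key=lambda r: r[1], limit=2)
```

Here `top_two` holds the two rows with the smallest second field, in ascending order.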
![Page 14: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/14.jpg)
HBase Queries
Maps HBase tables via Hive metastore mapping
HBase scan translations:
Row key predicates => start/stop row
Non-row key predicates => SingleColumnValueFilter
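The scan translation can be sketched as a small function. This is an illustration only: the `(column, op, value)` predicate tuples, the `translate_predicates` name, and the returned dict layout are made up here, and only three row-key operators are handled. It relies on the HBase convention that a scan's start row is inclusive and its stop row is exclusive.

```python
def translate_predicates(predicates, row_key="key"):
    """Split WHERE predicates into an HBase-style scan description."""
    scan = {"start_row": None, "stop_row": None, "filters": []}
    for column, op, value in predicates:
        if column != row_key:
            # Non-row-key predicate: becomes a column value filter.
            scan["filters"].append(("SingleColumnValueFilter", column, op, value))
        elif op == ">=":
            scan["start_row"] = value          # start row is inclusive
        elif op == "<":
            scan["stop_row"] = value           # stop row is exclusive
        elif op == "=":
            scan["start_row"] = value
            scan["stop_row"] = value + "\x00"  # smallest key after `value`
        else:
            raise ValueError(f"unsupported row-key operator: {op}")
    return scan


scan = translate_predicates([
    ("key", ">=", "a100"),
    ("key", "<", "a200"),
    ("cf:city", "=", "Reston"),
])
```

The row-key range narrows which rows are scanned at all, while the filter is applied to each scanned row, so row-key predicates are the cheaper of the two.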
![Page 15: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/15.jpg)
(Very) Unscientific Benchmarks
![Page 16: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/16.jpg)
9 queries, run in CDH Quickstart VM
CDH 4.2, Impala 1.0
Pseudo-cluster + Impala daemons
No other load on system during queries
Hardware: MacBook Pro Retina (mid 2012), Intel i7 2.6GHz quad-core processor, 16GB RAM, 4GB for VM (VMware 5)
![Page 17: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/17.jpg)
Benchmarks (cont'd)
Impala vs. Hive performance
"TPC-DS" sample dataset (http://www.tpc.org/tpcds/)
(from simple projection queries to multiple joins, aggregation, multiple predicates, and order by)
![Page 18: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/18.jpg)
Query "A"
select c.c_first_name, c.c_last_name
from customer c
limit 50;
![Page 19: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/19.jpg)
Query "B"
select c.c_first_name, c.c_last_name, ca.ca_city, ca.ca_county, ca.ca_state
from customer c
  join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk
limit 50;
![Page 20: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/20.jpg)
Query "C"
select c.c_first_name, c.c_last_name, ca.ca_city, ca.ca_county, ca.ca_state
from customer c
  join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk
where lower(c.c_last_name) like 'smi%'
limit 50;
![Page 21: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/21.jpg)
Query "D"
select distinct cd_credit_rating
from customer_demographics;
![Page 22: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/22.jpg)
Query "E"
select cd_credit_rating, count(*)
from customer_demographics
group by cd_credit_rating;
![Page 23: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/23.jpg)
Query "F"
select c.c_first_name, c.c_last_name, ca.ca_city, ca.ca_county, ca.ca_state, cd.cd_marital_status, cd.cd_education_status
from customer c
  join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk
  join customer_demographics cd on c.c_current_cdemo_sk = cd.cd_demo_sk
where lower(c.c_last_name) like 'smi%'
  and cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 50;
![Page 24: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/24.jpg)
Query "G"
select count(c.c_customer_sk)
from customer c
  join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk
  join customer_demographics cd on c.c_current_cdemo_sk = cd.cd_demo_sk
where ca.ca_zip in ('20191', '20194')
  and cd.cd_credit_rating in ('Unknown', 'High Risk');
![Page 25: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/25.jpg)
Query "H"
select c.c_first_name, c.c_last_name, ca.ca_city, ca.ca_county, ca.ca_state, cd.cd_marital_status, cd.cd_education_status
from customer c
  join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk
  join customer_demographics cd on c.c_current_cdemo_sk = cd.cd_demo_sk
where ca.ca_zip in ('20191', '20194')
  and cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 100;
![Page 26: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/26.jpg)
Query "TPC-DS"
select i_item_id, s_state,
  avg(ss_quantity) agg1, avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4
from store_sales
  join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
  join item on (store_sales.ss_item_sk = item.i_item_sk)
  join customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
  join store on (store_sales.ss_store_sk = store.s_store_sk)
where cd_gender = 'M'
  and cd_marital_status = 'S'
  and cd_education_status = 'College'
  and d_year = 2002
  and s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by i_item_id, s_state
order by i_item_id, s_state
limit 100;
![Page 27: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/27.jpg)
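| Query | Hive (sec) | # M/R jobs | Impala (sec) | x Hive perf. |
|---|---|---|---|---|
| A | 13.8 | 1 | 0.25 | 54 |
| B | 30.0 | 1 | 0.41 | 73 |
| C | 33.3 | 1 | 0.42 | 79 |
| D | 23.2 | 1 | 0.64 | 36 |
| E | 21.6 | 1 | 0.62 | 35 |
| F | 59.1 | 2 | 1.96 | 30 |
| G | 78.5 | 3 | 1.56 | 50 |
| H | 59.6 | 2 | 1.89 | 32 |
| TPC-DS | 204.5 | 6 | 3.23 | 63 |

(remember, unscientific...)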
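The "x Hive perf." column is the Hive time divided by the Impala time. A short sketch that recomputes it from the rounded timings in the table; the `timings` dict and `speedup` helper are just names chosen here. Because the inputs are already rounded, a recomputed ratio can differ by about one from the slide's figure (query A, for instance).

```python
# query: (Hive seconds, Impala seconds), copied from the table above
timings = {
    "A": (13.8, 0.25),
    "B": (30.0, 0.41),
    "C": (33.3, 0.42),
    "D": (23.2, 0.64),
    "E": (21.6, 0.62),
    "F": (59.1, 1.96),
    "G": (78.5, 1.56),
    "H": (59.6, 1.89),
    "TPC-DS": (204.5, 3.23),
}


def speedup(hive_sec, impala_sec):
    # How many times faster Impala finished than Hive.
    return hive_sec / impala_sec


speedups = {q: speedup(h, i) for q, (h, i) in timings.items()}
slowest_gain = min(speedups, key=speedups.get)
biggest_gain = max(speedups, key=speedups.get)
```

On these numbers the smallest gain is query F (about 30x) and the largest is query C (about 79x).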
![Page 28: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/28.jpg)
![Page 29: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/29.jpg)
Architecture
![Page 30: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/30.jpg)
Two daemons: impalad, statestored
impalad on each HDFS data node
statestored - cluster metadata
Thrift APIs, ODBC, JDBC
![Page 31: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/31.jpg)
impalad
Query execution
Query coordination
Query planning
![Page 32: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/32.jpg)
impalad
Query Coordinator
Query Planner
Query Executor
HDFS DataNode
HBase RegionServer
![Page 33: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/33.jpg)
Queries performed in-memory
Intermediate data never hits disk!
Data streamed to clients
Execution engine: C++, runtime code generation, intrinsics for optimization
![Page 34: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/34.jpg)
statestored
Cluster membership
Acts as a cluster monitor
Not a SPOF (single point of failure)
![Page 35: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/35.jpg)
Metadata
Impala uses Hive metastore
Daemons cache metadata
REFRESH when table definition/data change
Create tables in Hive or Impala
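Why a REFRESH is needed can be shown with a toy cache. This is a minimal sketch under stated assumptions: `Metastore` and `DaemonCache` are hypothetical stand-ins, not Impala or Hive classes. The point is only that a cached snapshot goes stale when the source changes.

```python
class Metastore:
    """Stand-in for the Hive metastore: the source of truth for schemas."""

    def __init__(self):
        self.tables = {}


class DaemonCache:
    """Stand-in for the metadata cached by one daemon."""

    def __init__(self, metastore):
        self.metastore = metastore
        self.cached = {}
        self.refresh()

    def refresh(self):
        # Reload a snapshot (a copy) of every table definition.
        self.cached = {
            name: list(cols) for name, cols in self.metastore.tables.items()
        }

    def columns(self, table):
        # Served from the cache; the metastore is not consulted here.
        return self.cached[table]


metastore = Metastore()
metastore.tables["customer"] = ["c_first_name"]
cache = DaemonCache(metastore)

metastore.tables["customer"].append("c_last_name")  # change made elsewhere
stale = cache.columns("customer")   # still the old definition
cache.refresh()
fresh = cache.columns("customer")   # now includes the new column
```

Serving lookups from the cache keeps the metastore off the query path, at the price of an explicit refresh step after changes.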
![Page 36: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/36.jpg)
Next up - how queries work...
![Page 37: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/37.jpg)
[Diagram: a client submits a SQL query to one impalad. Each impalad (Query Planner, Query Coordinator, Query Executor) runs alongside an HDFS DataNode and HBase RegionServer. The Hive Metastore provides table/database metadata; the Statestore provides cluster monitoring.]
![Page 38: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/38.jpg)
Read directly from disk
Short-circuit reads
Bypass HDFS DataNode (avoids overhead of HDFS API)
![Page 39: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/39.jpg)
[Diagram: impalad (Query Planner, Query Coordinator, Query Executor) reads directly from disk on the local filesystem, bypassing the HDFS DataNode; an HBase RegionServer runs on the same node.]
![Page 40: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/40.jpg)
![Page 41: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/41.jpg)
Current Limitations (as of version 1.0.1)
No join order optimization
No custom file formats, SerDes or UDFs
Limit required when using ORDER BY
Joins limited by aggregate memory of cluster
("put larger table on left")
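The join limitations are easier to see with a simple in-memory hash join. A minimal sketch, assuming the engine builds its hash table from the right-hand table and streams the left-hand one; that assumption fits the "put larger table on left" advice but is not stated on the slide. The `hash_join` function and the sample rows are illustrative only.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    # Build phase: the whole right-hand table is held in memory.
    table = defaultdict(list)
    for right in right_rows:
        table[right[right_key]].append(right)
    # Probe phase: the left-hand table is streamed one row at a time,
    # matching on equality only (an equi-join).
    for left in left_rows:
        for right in table.get(left[left_key], []):
            yield {**left, **right}


customers = [
    {"c_last_name": "Smith", "c_current_addr_sk": 1},
    {"c_last_name": "Jones", "c_current_addr_sk": 2},
    {"c_last_name": "Brown", "c_current_addr_sk": 9},  # no matching address
]
addresses = [
    {"ca_address_sk": 1, "ca_city": "Reston"},
    {"ca_address_sk": 2, "ca_city": "Herndon"},
]
joined = list(hash_join(customers, addresses, "c_current_addr_sk", "ca_address_sk"))
```

Memory use grows with the right-hand table, so without join order optimization the query author has to put the smaller table there by hand.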
![Page 42: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/42.jpg)
Current Limitations (as of version 1.0.1)
No advanced data structures (arrays, maps, json, etc.)
Only basic DDL (otherwise do in Hive)
Limited file formats and compression (though probably fine for most people)
![Page 43: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/43.jpg)
Future...
Structure types (structs, arrays, maps, json, etc.)
DDL support
Additional file formats & compression support
"Performance"
Join optimization (e.g. cost-based)
UDFs (???)
YARN integration
Fault-tolerance (???)
![Page 44: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/44.jpg)
![Page 45: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/45.jpg)
Comparing Impala to Dremel
"Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google."
- http://research.google.com/pubs/pub36632.html
![Page 46: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/46.jpg)
Comparing Impala to Dremel
Impala = Dremel features circa 2010 + join support, assuming columnar data format
(but, Google doesn't stand still...)
Dremel is production, mature
Basis for Google's BigQuery
![Page 47: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/47.jpg)
Comparing Impala to Hive
Hive uses Map/Reduce -> high latency
Impala is in-memory, low-latency query engine
Impala sacrifices fault tolerance for performance
![Page 48: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/48.jpg)
Comparing Impala to Drill
Apache Drill
Based on Dremel
In early stages...
![Page 49: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/49.jpg)
Comparing Impala to Drill
"Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an IaaS service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Currently, Drill is incubating at Apache."
- http://incubator.apache.org/drill/drill_overview.html
![Page 50: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/50.jpg)
Comparing Impala to Stinger
"The Stinger Initiative is a collection of development threads in the Hive community that will deliver 100X performance improvements as well as SQL compatibility."
- http://hortonworks.com/stinger/
![Page 51: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/51.jpg)
Comparing Impala to Stinger
Stinger
Improve Hive performance (e.g. optimize execution plan)
Support for analytics (e.g. OVER clause, window functions)
TEZ framework to optimize execution
Columnar file format
http://hortonworks.com/stinger/
![Page 52: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/52.jpg)
Stinger Phase 1 performance...
(Stinger phase 1 is really just Hive 0.11)
![Page 53: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/53.jpg)
remember, these numbers are non-scientific micro-benchmarks!
![Page 54: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/54.jpg)
Same 9 queries (as w/ Impala), run in HortonWorks Sandbox VM
HortonWorks Data Platform (HDP) 1.3
Running pseudo-cluster
No other load on system during queries
Hardware (same as w/ Impala): MacBook Pro Retina (mid 2012), Intel i7 2.6GHz quad-core processor, 16GB RAM, 4GB for VM (VMware 5)
![Page 55: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/55.jpg)
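| Query | Hive (sec) | # M/R jobs | Stinger Phase 1 (sec) | # M/R jobs | x Hive perf. |
|---|---|---|---|---|---|
| A | 13.8 | 1 | 10.0 | 1 | 1.4 |
| B | 30.0 | 1 | 15.8 | 1 | 1.9 |
| C | 33.3 | 1 | 14.1 | 1 | 2.4 |
| D | 23.2 | 1 | 18.7 | 1 | 1.2 |
| E | 21.6 | 1 | 19.7 | 1 | 1.1 |
| F | 59.1 | 2 | 34.3 | 1 | 1.7 |
| G | 78.5 | 3 | 35.2 | 1 | 2.2 |
| H | 59.6 | 2 | 31.5 | 1 | 1.9 |
| TPC-DS | 204.5 | 6 | 37.2 | 1 | 5.5 |

(remember, unscientific...)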
![Page 56: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/56.jpg)
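| Query | Stinger Phase 1 (sec) | Impala (sec) | x Stinger perf. |
|---|---|---|---|
| A | 10.0 | 0.25 | 39 |
| B | 15.8 | 0.41 | 38 |
| C | 14.1 | 0.42 | 33 |
| D | 18.7 | 0.64 | 29 |
| E | 19.7 | 0.62 | 32 |
| F | 34.3 | 1.96 | 18 |
| G | 35.2 | 1.56 | 23 |
| H | 31.5 | 1.89 | 17 |
| TPC-DS | 37.2 | 3.23 | 12 |

(remember, unscientific...)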
![Page 57: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/57.jpg)
Impala Review
In-memory, distributed SQL query engine
Integrates into existing HDFS
Not Map/Reduce
Focus on performance (native code)
Interactive data analysis
Competition...
![Page 58: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/58.jpg)
References
Google Dremel - http://research.google.com/pubs/pub36632.html
Apache Drill - http://incubator.apache.org/drill/
TPC-DS dataset - http://www.tpc.org/tpcds/
Stinger Initiative - http://hortonworks.com/blog/100x-faster-hive/ and http://hortonworks.com/stinger/
Cloudera Impala resources - http://www.cloudera.com/content/support/en/documentation/cloudera-impala/cloudera-impala-documentation-v1-latest.html
Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real - http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
![Page 59: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/59.jpg)
Photo Attributions
Impala - http://www.flickr.com/photos/gerardstolk/5897570970/
Measuring tape - http://www.morguefile.com/archive/display/24850
Bridge frame - http://www.morguefile.com/archive/display/9699
Balance - http://www.morguefile.com/archive/display/93433
* All others are iStockPhoto (I paid for them...)
![Page 60: Cloudera Impala, updated for v1.0](https://reader034.fdocuments.in/reader034/viewer/2022042607/55875a1ed8b42adb788b4579/html5/thumbnails/60.jpg)
My Info
twitter.com/sleberknight www.sleberknight.com/blog
scott dot leberknight at gmail dot com