Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Transcript of Cloudera Impala - San Diego Big Data Meetup August 13th 2014
![Page 1: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/1.jpg)
1
Cloudera Impala SD Big Data Monthly Meetup #2 August 13th 2014 Maxime Dumas Systems Engineer
![Page 2: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/2.jpg)
Thirty Seconds About Max
• Systems Engineer • aka Sales Engineer • SoCal, AZ, NV
• former coder of PHP • teaches meditation + yoga • from Montreal, Canada
2
![Page 3: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/3.jpg)
What Does Cloudera Do?
• product • distribution of Hadoop components, Apache licensed • enterprise tooling
• support • training • services (aka consulting) • community
3
![Page 4: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/4.jpg)
What This Talk Isn’t About
• deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning • depends heavily on data and workload
• coding • unless you count XML or CSV or SQL
• algorithms
4
![Page 5: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/5.jpg)
Public Domain IFCAR
![Page 6: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/6.jpg)
What is Cloudera Impala?
6
![Page 7: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/7.jpg)
cloud·e·ra im·pal·a
7
/kloudˈi(ə)rə imˈpalə/ noun
a modern, open source, MPP SQL query engine for Apache Hadoop. “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
![Page 8: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/8.jpg)
8
Quick and dirty, for context.
The Apache Hadoop Ecosystem
![Page 9: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/9.jpg)
Why “Ecosystem?”
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
9
![Page 10: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/10.jpg)
HDFS
• Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System
• http://research.google.com/archive/gfs.html
10
![Page 11: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/11.jpg)
Lots of Commodity Machines
11
Image: Yahoo! Hadoop cluster [OSCON ’07]
![Page 12: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/12.jpg)
MapReduce (MR)
• Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
12
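As a rough illustration of the paradigm, here is a minimal single-process word count split into map, shuffle, and reduce steps. This is only a sketch: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example and are not the Hadoop API, and a real job would run the mappers and reducers on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling `word_count(["big data", "Big cluster"])` counts "big" twice because the mapper lowercases each word before emitting it.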
![Page 13: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/13.jpg)
Apache Hive
• Abstraction of Hadoop’s Java API • HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
13
![Page 14: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/14.jpg)
Apache Hive Metastore
• Maps HDFS files to DB-like resources • Databases • Tables • Column/field names, data types • Roles/users • InputFormat/OutputFormat
14
![Page 15: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/15.jpg)
Sqoop
©2011 Cloudera, Inc. All Rights Reserved. 15
• SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfer data between Hadoop and external databases or EDW
• High performance connectors for some RDBMS
• Oracle, Teradata, Netezza
• Developed at Cloudera
![Page 16: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/16.jpg)
16
![Page 17: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/17.jpg)
17
Familiar interface, but more powerful.
Cloudera Impala
![Page 18: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/18.jpg)
Cloudera Impala
18
Interactive SQL for Hadoop § Responses in seconds § Nearly ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine § Purpose-built for low-latency queries § Separate runtime from MapReduce § Designed as part of the Hadoop ecosystem
Open Source § Apache-licensed
![Page 19: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/19.jpg)
Benefits of Impala
19
More & Faster Value from “Big Data” § Interactive BI/Analytics experience via SQL § No delays from data migration
Flexibility § Query across existing data § Select best-fit file formats (Parquet, Avro, etc.) § Run multiple frameworks on the same data at the same time
Cost Efficiency § Reduce movement, duplicate storage & compute § 1% to 10% of the cost of an analytic DBMS
Full Fidelity Analysis § No loss from aggregations or fixed schemas
![Page 20: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/20.jpg)
Impala Use Cases
20
Cost-effective, ad hoc query environment that offloads the data warehouse for:
Interactive BI/analytics on more data
Asking new questions – exploration, ML
Data processing with tight SLAs
Query-able archive w/full fidelity
![Page 21: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/21.jpg)
Our Design Strategy
21
One pool of (open) data
One metadata model
One security framework
One set of system resources
An Integrated Part of the Hadoop System
[Diagram: the Hadoop stack]
Engines: Batch Processing (MapReduce, Hive & Pig) · Interactive SQL (Cloudera Impala) · Interactive Search (Cloudera Search) · Machine Learning (Mahout, ClouderaML, Oryx) · Math & Statistics (SAS, R) · In-Memory Processing & Streaming (Spark) · …
Shared layers: Resource Management · Metadata · Security · Integration
Storage: HDFS (Text, RCFile, Parquet, Avro, etc.) · HBase (records)
![Page 22: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/22.jpg)
Impala Key Features
22
Fast § In-memory data transfers § Partitioned joins § Fully distributed aggregations
Flexible § Query data in HDFS & HBase § Supports multiple file formats & compression algorithms § Java & Native UDFs, UDAFs
Secure § Integrated with Hadoop security § Kerberos authentication § Authorization (Sentry)
Easy to Implement § Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax § Open source
Easy to Use § Interact with data via SQL § Certified with leading BI tools
Simple to Manage § Deploy, configure & monitor with Cloudera Manager § Integrated with Hadoop resource management
![Page 23: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/23.jpg)
What’s Coming?*
23
In the Near Term:
SQL 2003-Compliant Analytic Window Functions
Additional Authentication Mechanisms
User Defined Table Functions
Intra-node Parallelized Aggregations & Joins
Nested Data
Enhanced YARN-Integrated Resource Manager
Dynamic Partition Pruning
*On the roadmap… no guarantees
![Page 24: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/24.jpg)
Impala Plays Well with Others
24
BI Partners: Building on the
Enterprise Standard POWERED BY
IMPALA
![Page 25: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/25.jpg)
Not All SQL On Hadoop Is Created Equal
25
Batch MapReduce: Make MapReduce faster. Slow, still batch.
Remote Query: Pull data from HDFS over the network to the DW compute layer. Slow, expensive.
Siloed DBMS: Load data into a proprietary database file. Rigid, siloed data, slow ETL.
Impala: Native MPP query engine that’s integrated into Hadoop. Fast, flexible, cost-effective.
![Page 26: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/26.jpg)
More Detail On Alternative Approaches
26
Batch MapReduce § Batch-oriented § High latency
Remote Query § Network bottleneck § 2x the hardware § Duplicate metadata, security, SQL, etc.
Siloed DBMS § RDBMS rigidity § Query subset of data § Duplicate storage, metadata, security, SQL, etc.
[Diagram: in the remote-query approach, Hadoop supplies HDFS storage while Hadoop and the DBMS each run their own compute layer; in the siloed approach, a proprietary DBMS with its own storage, metadata, and security sits beside the standard, shared Hadoop stack (engines: MapReduce, Hive, Pig, Impala, etc.; resource management; Hadoop metadata; integration; security; HDFS storage)]
![Page 27: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/27.jpg)
Other Sexy New Big Data MPP Tools
27
Presto: Purpose-Built MPP Engine; Similar Architecture to Impala; Few Performance Comparisons, but Impala Anecdotally 5x-10x Faster
Shark: Hive-Compatible Data Warehouse for Spark; Great Performance until Required to go to Disk, at Which Point Impala Better; With HDFS Caching Impala will Perform on Par from a Memory Perspective
Drill: Open Source version of Dremel; Another MPP Engine; Multiple Data Formats and Sources
Phoenix – Sort Of: SQL Skin over HBase (and Only HBase); Subset of SQL Standard
![Page 28: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/28.jpg)
What About an EDW/RDBMS?
“Right Tool for the Right Job” EDW/RDBMS Great For:
• OLTP’s complex transactions • Highly planned and optimized known workloads • Operational reports and repeated known queries
Impala Great For:
• Exploratory analytics with previously-unknown queries • Queries on big and growing data sets
EDW/RDBMS Can’t: • Dump in raw data then later define schema and query what you want • Evolve schemas without an expensive schema upgrade planning process • Simply scale just by adding industry-standard servers • Store at < $1k/TB instead of $10-150k/TB
28
![Page 29: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/29.jpg)
29
Impala Technical Details
![Page 30: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/30.jpg)
The Impala Advantage
30
Where does the Performance Come From?
No MapReduce; No JVM; All Native
In-Memory Data Transfers
Saturate Disks on Reads
Optimized File Format (i.e. Parquet)
In-Memory HDFS Caching
Cost-Based Join Order Optimization – Frees User from Having to Guess the Correct Join Order
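To make the idea of cost-based join ordering concrete, here is a toy sketch that tries every left-deep join order and keeps the cheapest one. It is not Impala's optimizer: the cost model (sum of intermediate row counts), the single fixed `selectivity`, and the table names and row counts are all assumptions chosen for illustration.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Toy cost: total rows produced by each join in a left-deep plan."""
    current = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Assumed estimate: join output = left rows * right rows * selectivity.
        current = current * rows[table] * selectivity
        cost += current
    return cost


def best_join_order(rows, selectivity=0.001):
    """Enumerate every join order and return the one with the lowest toy cost."""
    return min(
        permutations(rows),
        key=lambda order: plan_cost(order, rows, selectivity),
    )


# Hypothetical table statistics (estimated row counts).
TABLE_ROWS = {"sales": 1_000_000, "stores": 100, "dates": 365}
```

With these made-up statistics, `best_join_order(TABLE_ROWS)` joins the two small tables first and brings in the large `sales` table last, which keeps the intermediate result small. An optimizer with table statistics can make that choice so the query author does not have to.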
![Page 31: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/31.jpg)
Impala and Hive
31
Shares Everything Client-Facing § Metadata (table definitions) § ODBC/JDBC drivers § SQL syntax (Hive SQL) § Flexible file formats § Machine pool § Hue GUI
But Built for Different Purposes § Hive: runs on MapReduce and ideal for batch processing
§ Impala: native MPP query engine ideal for interactive SQL
[Diagram: Hive (SQL syntax on the MapReduce compute framework, batch processing) and Impala (SQL syntax plus its own compute framework, interactive SQL) sit side by side on shared resource management, metadata, integration, and storage: HDFS (Text, RCFile, Parquet, Avro, etc.) and HBase (records)]
![Page 32: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/32.jpg)
Impala Query Execution
32
[Diagram: a SQL app connects via ODBC; the cluster-wide services are the Hive Metastore, HDFS NameNode, and Statestore; each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase]
1) Request arrives via ODBC/JDBC/Hue/Shell
![Page 33: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/33.jpg)
Impala Query Execution
33
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
![Page 34: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/34.jpg)
Impala Query Execution
34
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
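The execution steps on these slides can be sketched as a toy, single-process simulation of a distributed `SUM ... GROUP BY`. Everything here is hypothetical: the node names, the sample rows, and the `plan`, `execute_fragment`, and `coordinate` functions are invented for illustration and do not reflect Impala's internal interfaces.

```python
from collections import Counter

# Hypothetical cluster: each node holds the (state, amount) rows stored locally.
NODE_DATA = {
    "node1": [("ca", 10), ("nv", 5)],
    "node2": [("ca", 7), ("az", 3)],
    "node3": [("az", 4), ("nv", 1)],
}


def plan(nodes):
    """Planner: one partial-aggregation fragment per node that holds data."""
    return [{"node": node, "op": "partial_sum"} for node in nodes]


def execute_fragment(fragment, node_data):
    """Executor: aggregate only the rows local to this fragment's node."""
    partial = Counter()
    for state, amount in node_data[fragment["node"]]:
        partial[state] += amount
    return partial


def coordinate(node_data):
    """Coordinator: run the fragments, merge partial results, return the total."""
    total = Counter()
    for fragment in plan(sorted(node_data)):
        # Counter.update adds the partial sums into the running totals.
        total.update(execute_fragment(fragment, node_data))
    return dict(total)
```

Each node reduces its own rows before anything crosses the network, so only small partial results travel to the coordinator rather than the raw data.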
![Page 35: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/35.jpg)
Parquet File Format
35
Open source, columnar Hadoop file format developed by Cloudera & Twitter
Limits the IO to only the data that is needed
Supports storing each column in a separate file
Saves space: columnar layout compresses better
Enables better scans: load only the columns that are needed
Supports index pages for fast lookup
Extensible value encodings
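A small sketch of why a columnar layout helps, using plain Python lists rather than the Parquet format itself. The sample rows and the helper names (`to_columns`, `run_length_encode`, `scan`) are assumptions made for this example, and run-length encoding stands in here as one simple example of a value encoding.

```python
from itertools import groupby

# Hypothetical row-oriented records.
ROWS = [
    {"state": "CA", "year": 2014, "amount": 10},
    {"state": "CA", "year": 2014, "amount": 7},
    {"state": "NV", "year": 2014, "amount": 5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Encode runs of repeated values as [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


def scan(columns, wanted):
    """Read only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


columns = to_columns(ROWS)
```

Values within one column tend to repeat, so `run_length_encode(columns["year"])` collapses three values into a single pair, and a query that only sums `amount` can call `scan(columns, ["amount"])` without reading `state` or `year` at all.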
![Page 36: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/36.jpg)
36
Impala Performance Results
![Page 37: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/37.jpg)
Impala Performance Results
• Impala’s Milestone in Jan 2014: • Comparable commercial MPP DBMS speed • Natively on Hadoop
• Three Result Sets: • Impala vs Hive 0.12 (Impala 6-70x faster) • Impala vs “DBMS-Y” (Impala average of 2x faster) • Impala scalability (Impala achieves linear scale)
• Background • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language) • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB) • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks) • Methodical testing (multiple runs, reviewed fairness for competition, etc.)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
37
![Page 38: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/38.jpg)
Enough slides… DEMO TIME!
38
![Page 39: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/39.jpg)
So What is Cloudera Impala?
39
![Page 40: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/40.jpg)
What’s Next?
• Download Hadoop! • CDH available at www.cloudera.com • Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm
• Ride Impala! • http://impala.io/
40
![Page 41: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/41.jpg)
41
SAN DIEGO BIG DATA
Special thanks:
![Page 42: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/42.jpg)
42
Preferably related to the talk… or not.
Questions?
![Page 44: Cloudera Impala - San Diego Big Data Meetup August 13th 2014](https://reader034.fdocuments.in/reader034/viewer/2022051611/54b4ebd54a79591b688b458e/html5/thumbnails/44.jpg)
44