SQL Engines for Hadoop - The case for Impala

45
SQL Engines for Hadoop – The case for Impala Budapest Data Forum, June 4 th , 2015 tiny.cloudera.com/mark-sql-budapest Mark Grover | @mark_grover

Transcript of SQL Engines for Hadoop - The case for Impala

Page 1: SQL Engines for Hadoop - The case for Impala

SQL Engines for Hadoop – The case for Impala Budapest Data Forum, June 4th, 2015 tiny.cloudera.com/mark-sql-budapest

Mark Grover | @mark_grover

Page 2: SQL Engines for Hadoop - The case for Impala

SQL Engines for Hadoop – The case for Impala Budapest Data Forum, June 4th, 2015 tiny.cloudera.com/mark-sql-budapest

Mark Grover | @mark_grover Grover Mark

Page 3: SQL Engines for Hadoop - The case for Impala

SQL Engines for Hadoop – The case for Impala Budapest Data Forum, June 4th, 2015 tiny.cloudera.com/mark-sql-budapest

Mark Grover | @mark_grover Grover Mark

József

Page 4: SQL Engines for Hadoop - The case for Impala

4

Agenda

Page 5: SQL Engines for Hadoop - The case for Impala

5

Agenda •  Word Count

©2015 Cloudera, Inc. All Rights Reserved.

Page 6: SQL Engines for Hadoop - The case for Impala

6

Agenda

©2015 Cloudera, Inc. All Rights Reserved.

Page 7: SQL Engines for Hadoop - The case for Impala

7

Past

©2015 Cloudera, Inc. All Rights Reserved.

Page 8: SQL Engines for Hadoop - The case for Impala

8

Future

©2015 Cloudera, Inc. All Rights Reserved.

Page 9: SQL Engines for Hadoop - The case for Impala

9

What is Apache Hadoop?

9

Has the Flexibility to Store and Mine Any Type of Data

§  Ask questions across structured and

unstructured data that were previously impossible to ask or solve

§  Not bound by a single schema

Excels at Processing Complex Data

§  Scale-out architecture divides

workloads across multiple nodes

§  Flexible file system eliminates ETL bottlenecks

Scales Economically

§  Can be deployed on commodity

hardware

§  Open source platform guards against vendor lock

Hadoop Distributed File System (HDFS)

Self-Healing, High

Bandwidth Clustered Storage

MapReduce

Distributed Computing Framework

Apache Hadoop is an open source platform for data storage and processing that is…

ü  Scalable ü  Fault tolerant ü  Distributed

CORE HADOOP SYSTEM COMPONENTS

©2015 Cloudera, Inc. All Rights Reserved.

Page 10: SQL Engines for Hadoop - The case for Impala

10

What is Apache Hadoop?

Has the Flexibility to Store and Mine Any Type of Data

§  Ask questions across structured and

unstructured data that were previously impossible to ask or solve

§  Not bound by a single schema

Excels at Processing Complex Data

§  Scale-out architecture divides

workloads across multiple nodes

§  Flexible file system eliminates ETL bottlenecks

Scales Economically

§  Can be deployed on commodity

hardware

§  Open source platform guards against vendor lock

Hadoop Distributed File System (HDFS)

Self-Healing, High

Bandwidth Clustered Storage

MapReduce

Distributed Computing Framework

Apache Hadoop is an open source platform for data storage and processing that is…

ü  Scalable ü  Fault tolerant ü  Distributed

CORE HADOOP SYSTEM COMPONENTS

©2015 Cloudera, Inc. All Rights Reserved.

Page 11: SQL Engines for Hadoop - The case for Impala

11

What? No schema? •  How do you use SQL? •  Do you lose Schema-on-read when using SQL-on-Hadoop?

©2015 Cloudera, Inc. All Rights Reserved.

Page 12: SQL Engines for Hadoop - The case for Impala

12

SQL engines

Page 13: SQL Engines for Hadoop - The case for Impala

13

Hive

©2015 Cloudera, Inc. All Rights Reserved.

•  First SQL engine •  Converts SQL to MapReduce •  New execution engine additions

–  Hive-on-Spark –  Hive-on-Tez

Page 14: SQL Engines for Hadoop - The case for Impala

14

Confusion #1 •  If there’s Hive, why we need more execution engines?

–  Just a matter of speed?

©2015 Cloudera, Inc. All Rights Reserved.

Page 15: SQL Engines for Hadoop - The case for Impala

15

Early architecture using Hive

©2015 Cloudera, Inc. All Rights Reserved.

Hive

Hadoop

RDBMS

Web servers

Different behaviour based on aggregations

Page 16: SQL Engines for Hadoop - The case for Impala

16

Early architecture using Hive

©2015 Cloudera, Inc. All Rights Reserved.

Hive

Hadoop

RDBMS

Web servers

Different behaviour based on aggregations

Why do we need this?

Page 17: SQL Engines for Hadoop - The case for Impala

17

Why not?

©2015 Cloudera, Inc. All Rights Reserved.

Hive

Hadoop

Web servers

Aggregations/Recommendations

Different behaviour based on aggregations

Page 18: SQL Engines for Hadoop - The case for Impala

18

Aha #1 •  Performance

–  Speed of execution –  Concurrency

•  Need an MPP like solution

©2015 Cloudera, Inc. All Rights Reserved.

Page 19: SQL Engines for Hadoop - The case for Impala

19

Aha #1 •  Impala •  Drill •  Presto •  Hive

–  Hive-on-Tez –  Hive-on-Spark

©2015 Cloudera, Inc. All Rights Reserved.

Page 20: SQL Engines for Hadoop - The case for Impala

20

Impala

Page 21: SQL Engines for Hadoop - The case for Impala

21

Impala - Goals •  General-purpose SQL query engine:

•  Works for both for analytical and transactional/single-row workloads

•  Runs directly within Hadoop: –  reads widely used Hadoop file formats –  talks to widely used Hadoop storage managers –  runs on same nodes that run Hadoop processes

•  High performance –  Execution times –  Concurrency

•  Open source

©2015 Cloudera, Inc. All Rights Reserved.

Page 22: SQL Engines for Hadoop - The case for Impala

22

User view of Impala •  There is no ‘Impala format’! •  Supported file formats:

–  uncompressed/lzo-compressed text files –  sequence files with snappy/gzip compression –  RCFile with snappy/gzip compression –  Avro data files –  Parquet (columnar format) –  HBase –  And, more…

©2015 Cloudera, Inc. All Rights Reserved.

Page 23: SQL Engines for Hadoop - The case for Impala

23

Impala Use Cases

Interactive BI/analytics on more data

Asking new questions – exploration, ML

Data processing with tight SLAs

Query-able archive w/full fidelity

Cost-effective, ad hoc query environment that offloads/replaces the data warehouse for:

Page 24: SQL Engines for Hadoop - The case for Impala

24

Global Financial Services Company

Saved 90% on incremental EDW spend & improved performance by 5x

Offload data warehouse for query-able archive

Store decades of data cost-effectively

Process & analyze on the same system

Improved capabilities through interactive query on more data

Page 25: SQL Engines for Hadoop - The case for Impala

25

Digital Media Company

20x performance improvement for exploration & data discovery

Easily identify new data sets for modeling

Interact with raw data directly to test hypotheses

Avoid expensive DW schema changes

Accelerate ‘time to answer’

Page 26: SQL Engines for Hadoop - The case for Impala

26

Architecture of Impala

Page 27: SQL Engines for Hadoop - The case for Impala

27

Impala Architecture •  Three binaries: impalad, statestored, catalogd •  Impala daemon (impalad) – N instances

–  handles client requests and all internal requests related to query execution

•  State store daemon (statestored) – 1 instance –  Provides name service and metadata distribution

•  Catalog daemon (catalogd) – 1 instance –  Relays metadata changes to all impalad’s

Page 28: SQL Engines for Hadoop - The case for Impala

28

Impala Architecture: Query Execution •  Request arrives via odbc/jdbc

Query  Planner  

Query  Executor  

HDFS  DN   HBase  

SQL  App  

ODBC  

Query  Planner  Query  Coordinator  

Query  Executor  

HDFS  DN   HBase  

Query  Planner  

Query  Executor  

HDFS  DN   HBase  

SQL  request  

Query  Coordinator   Query  Coordinator  

HiveMetastore   HDFS  NN   statestored   catalogd  

Page 29: SQL Engines for Hadoop - The case for Impala

29

Impala Architecture: Query Execution •  Planner turns request into collections of plan fragments •  Coordinator initiates execution on remote impalad's

Query  Planner  Query  Coordinator  

Query  Executor  

HDFS  DN   HBase  

SQL  App  

ODBC  

Query  Planner  Query  Coordinator  

Query  Executor  

HDFS  DN   HBase  

Query  Planner  Query  Coordinator  

Query  Executor  

HDFS  DN   HBase  

HiveMetastore   HDFS  NN   statestored   catalogd  

Page 30: SQL Engines for Hadoop - The case for Impala

30

Impala Architecture: Query Execution •  Intermediate results are streamed between impalad's Query results

are streamed back to client

Query  Planner  Query  Coordinator  

Query  Executor  

HDFS  DN   HBase  

SQL  App  

ODBC  

Query  Planner  Query  Coordinator  

Query  Executor  

HDFS  DN   HBase  

Query  Planner  Query  Coordinator  

Query  Executor  

HDFS  DN   HBase  

query  results  

HiveMetastore   HDFS  NN   statestored   catalogd  

Page 31: SQL Engines for Hadoop - The case for Impala

31

Query Planning: Overview •  2-phase planning process:

–  single-node plan –  plan partitioning: partition single-node plan to maximize scan locality, minimize

data movement

•  Parallelization of operators: –  All query operators are fully distributed

Page 32: SQL Engines for Hadoop - The case for Impala

32

Single-Node Plan: Example Query  

SELECT t1.custid, SUM(t2.revenue) AS revenue FROM LargeHdfsTable t1 JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id) JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id) WHERE t3.category = 'Online' GROUP BY t1.custid ORDER BY revenue DESC LIMIT 10;

Page 33: SQL Engines for Hadoop - The case for Impala

33

Query Planning: Single-Node Plan •  Single-node plan for example:

HashJoin

Scan: t1

Scan: t3

Scan: t2

HashJoin

TopN

Agg

Page 34: SQL Engines for Hadoop - The case for Impala

34

Single-node plan •  SQL query as a left-deep tree of plan operators •  Scan, HashJoin, HashAggregation, Union, TopN, Exchange

Page 35: SQL Engines for Hadoop - The case for Impala

35

Plan Partitioning •  Partition single-node plan

–  Maximize scan locality –  Minimize data movement

•  Parallelization of operators: –  All query operators are fully distributed

Page 36: SQL Engines for Hadoop - The case for Impala

36

Query  Planning:  Distributed  Plans  

HashJoin Scan: t1

Scan: t3

Scan: t2

HashJoin

TopN

Pre-Agg

MergeAgg

TopN

Broadcast

Broadcast

hash t2.id hash t1.id1

hash t1.custid

at HDFS DN

at HBase RS

at coordinator

Page 37: SQL Engines for Hadoop - The case for Impala

37

Impala Execution Engine •  Written in C++ for minimal execution overhead •  Internal in-memory tuple format puts fixed-width data at fixed offsets •  Uses intrinsics/special cpu instructions for text parsing, crc32

computation, etc. •  Runtime code generation for “big loops”

Page 38: SQL Engines for Hadoop - The case for Impala

38

Runtime code generation •  example of "big loop": insert batch of rows into hash table •  known at query compile time: # of tuples in a batch, tuple layout,

column types, etc. •  generate at compile time: unrolled loop that inlines all function calls,

contains no dead code, minimizes branches •  code generated using LLVM

Page 39: SQL Engines for Hadoop - The case for Impala

39

Comparing Impala to Dremel •  What is Dremel?

–  columnar storage for data with nested structures –  distributed scalable aggregation on top of that

•  Columnar storage in Hadoop: Parquet –  stores data in appropriate native/binary types –  can also store nested structures similar to Dremel's ColumnIO –  Parquet is open source: github.com/parquet

•  Distributed aggregation: Impala •  Impala plus Parquet: a superset of the published version of Dremel

(which didn't support joins)

Page 40: SQL Engines for Hadoop - The case for Impala

40

But, what makes Impala fast? •  No MapReduce •  Use of memory •  LLVM •  C++ •  Vectorization •  Tight integration with Parquet

Page 41: SQL Engines for Hadoop - The case for Impala

41

Confusion #2 – What do I use? •  Hive on Mapreduce? •  Hive on Tez? •  Hive on Spark? •  Impala? •  Spark SQL? •  Drill?

©2015 Cloudera, Inc. All Rights Reserved.

Page 42: SQL Engines for Hadoop - The case for Impala

42

Processing Frameworks in Hadoop

©2015 Cloudera, Inc. All Rights Reserved.

Hadoop Storage managers (HDFS/HBase/Solr)

Hiv

e P

ig

Tez

Cas

cadi

ng

MapReduce

Hiv

e

Cas

cadi

ng

Cru

nch

Pig

Mah

out

Gira

ph

Dat

o (G

raph

lab)

Spark

Hiv

e (in

bet

a)

Cas

cadi

ng

Cru

nch

Pig

MLl

ib

Gra

phX

SQL engines

General purpose execution engines

Storage managers

Impa

la

Dril

l

H20

O

ryx

Graph processing engines

Machine Learning engines

Abstraction engines

Sto

rm/T

riden

t

Spa

rk S

tream

ing

Real-time frameworks

Spa

rk S

QL

Pre

sto

Page 43: SQL Engines for Hadoop - The case for Impala

43

Free books! •  @hadooparchbook •  hadooparchitecturebook.com •  github.com/hadooparchitecturebook •  slideshare.com/hadooparchbook

•  Later today at 3:10 PM

©2015 Cloudera, Inc. All Rights Reserved.

Page 44: SQL Engines for Hadoop - The case for Impala

44

Impala •  SQL-on-Hadoop engine •  Near real time SQL •  Reads YOUR file formats •  Allows to write custom UDFs •  Open source •  Commonly used for Data Warehouse offloading

©2015 Cloudera, Inc. All Rights Reserved.

Page 45: SQL Engines for Hadoop - The case for Impala

45

I’d love to talk more! •  @mark_grover •  Find me at Cloudera booth!

©2015 Cloudera, Inc. All Rights Reserved.