Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Ian Fyfe, Director Product Marketing
August 2nd, 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

• Hadoop architecture vs. BI tool requirements
• Hive on MapReduce
• The Data Movement Work-Around
• Hive on Tez and LLAP
• In-Hadoop Databases: Apache HAWQ, Apache Impala
• In-Memory: AtScale, Zoomdata
• Conclusion and summary


Hortonworks

The company behind Apache Hadoop
– Hortonworks Data Platform (HDP)
– Hortonworks Data Flow (HDF)

Strong partnership with Pivotal
– Pivotal is converting Pivotal Hadoop Distribution customers to HDP
– Hortonworks is reselling Pivotal HDB subscriptions


Hadoop “Classic” Core Components

HDFS
– A distributed file system allowing massive storage across a cluster of commodity servers

MapReduce
– Framework for distributed computation; common use cases include aggregating, sorting, and filtering big data sets
– Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
– Massively scalable
– High latency
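The fragment-by-fragment model above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the record layout, the function names, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical (region, amount) records; illustrative data only.
RECORDS = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]


def split(records, n_fragments):
    """Break the input into fragments that can each be processed in isolation."""
    return [records[i::n_fragments] for i in range(n_fragments)]


def map_fragment(fragment):
    """Map phase: emit (key, value) pairs for one fragment."""
    return [(region, amount) for region, amount in fragment]


def shuffle(mapped_fragments):
    """Group every emitted value by key, across all fragments."""
    groups = defaultdict(list)
    for pairs in mapped_fragments:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce phase: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records, n_fragments=2):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = [map_fragment(f) for f in split(records, n_fragments)]
    return reduce_groups(shuffle(mapped))


totals = run_job(RECORDS)
```

Because each fragment is mapped independently, the result does not depend on how many fragments the input is split into, which is what lets a real cluster recompute a failed fragment on any node.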


Related Projects

Hive – a data warehouse infrastructure on top of Hadoop
– Implements a SQL-like query language, including a JDBC driver

HBase – the Hadoop database – AH HA!
– NoSQL database problematic for traditional BI
  • Apache Phoenix provides SQL interface
– Best at storing large amounts of unstructured data!
– Not optimized for aggregate (BI style) queries


Unfortunately Hadoop wasn't originally designed for most BI requirements …

• Low-latency queries at petabyte scale
• Full ANSI SQL compliance


… but the situation is rapidly improving!


Hive on MapReduce (Hive 1.0)


Apache Hive 1.0

Facilitates querying and managing large datasets. Data analysts use Hive to explore, structure and analyze that data using a SQL-like language called HiveQL.

Hive on MapReduce

• Massively scalable to PB range
• Batch reporting & ETL

• High latency – queries in minutes or hours
• Limited ANSI SQL compliance


The Result of Hive on MR… The Data Movement Work-Around


The Data Movement Work-Around

ETL the data into traditional data marts or data warehouses (Oracle, Vertica, Netezza, MySQL, etc.) using an ETL tool or code

• Low-latency queries
• SQL compliance

• Currency
• Cost
• Complexity


Hive on Tez (Hive 2.0)


Hive on Tez

Apache Tez
– Alternate query framework to MapReduce which “allows for a complex directed-acyclic-graph (DAG) of tasks for processing data”
– Significantly faster than MapReduce

• Many times faster than Hive on MR
• PB scale

• Still quite high query latency for BI tools
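To make the DAG-of-tasks idea concrete, here is a minimal sketch that orders a small task graph into waves of tasks that could run together. It uses Python's standard `graphlib` module (Python 3.9+), not the Tez API; the plan and its task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_sales": set(),
    "scan_stores": set(),
    "join": {"scan_sales", "scan_stores"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def execution_waves(plan):
    """Return lists of tasks; every task in a wave has all its inputs ready."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises CycleError if the plan is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(PLAN)
# waves == [["scan_sales", "scan_stores"], ["join"], ["aggregate"], ["sort"]]
```

The whole plan is expressed as one graph, so a scheduler can see that both scans are independent and start them together, rather than chaining separate map and reduce jobs one after another.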


Hive on LLAP (Hive 2.1)


HDP 2.5 is a Major Milestone for Hive

At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.

Major Improvements:
– Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
– Operations: LLAP integration in Ambari, new Grafana dashboards.

[Chart: Hive changes by source: 1391 from Hortonworks, 642 from Community]

Hive 2 Highlights

• Interactive Query with Hive LLAP
• SQL ACID Fully Supported
• 2x Faster ETL


Hive 2 with LLAP: Architecture Overview

[Figure: LLAP architecture. ODBC / JDBC SQL queries enter through HiveServer2. Query coordinators (App Masters) and LLAP daemons, each with executors and an in-memory cache, run in a YARN cluster. Deep storage: HDFS, S3 + other HDFS-compatible filesystems.]
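The benefit of a persistent daemon with an in-memory cache can be illustrated with a small sketch. This is not LLAP code: the class name, the dict standing in for deep storage, and the least-recently-used eviction policy are all assumptions chosen for the example, and the slides do not say which policy LLAP uses.

```python
from collections import OrderedDict


class CachingDaemon:
    """Long-lived reader that keeps recently used chunks in memory."""

    def __init__(self, storage, capacity):
        self.storage = storage        # dict standing in for deep storage (assumption)
        self.capacity = capacity      # maximum number of cached chunks
        self.cache = OrderedDict()    # oldest entry first
        self.storage_reads = 0        # counts the slow reads

    def read(self, chunk_id):
        if chunk_id in self.cache:
            # Cache hit: mark the chunk as most recently used.
            self.cache.move_to_end(chunk_id)
            return self.cache[chunk_id]
        # Cache miss: fall back to the slow read and remember the result.
        self.storage_reads += 1
        data = self.storage[chunk_id]
        self.cache[chunk_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used chunk
        return data


daemon = CachingDaemon({"a": 1, "b": 2, "c": 3}, capacity=2)
daemon.read("a")
daemon.read("a")  # served from memory, no second storage read
```

Because the daemon outlives any single query, repeated queries over the same data skip the storage read entirely, which is where the interactive latency comes from in this simplified picture.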


Hive 2 with LLAP: Preliminary Numbers

[Chart: Hive 2.0 and LLAP: TPC-DS at 10 TB Scale, 18 Nodes. Series: Hive 2.0-Tez, LLAP. Queries: q3, q7, q12, q13, q19, q21, q26, q27, q42, q43, q45, q52, q55, q60, q73, q84, q89, q91, q98. Vertical axis: 0 to 80.]

Min query time: Query 55: 2.38s


Hive 2 with LLAP: Linear Scaling at 1TB: 8 nodes versus 16 nodes.

[Chart: Average Query Time by Concurrency. Series: Average Time: 8 Node, Average Time: 16 Node. Horizontal axis: Concurrent Queries (8, 16, 32). Vertical axis: Time (s).]

[Chart: Maximum Query Time by Concurrency. Series: Max Time: 8 Node, Max Time: 16 Node. Horizontal axis: Concurrent Queries (8, 16, 32). Vertical axis: Time (s).]


Hive 2 with LLAP Enables Interactive Query In Seconds

• Faster interactive query
• Faster ETL
• Expanded SQL compliance for BI tools (nearing SQL:2011)
• Enterprise Readiness: granular row & column level security
• Simplified Operations: LLAP integration with Ambari with automated dashboards

• TB scale datasets, not PB scale


Caution: Not All Hives Are Created Equal

Apache Hive in Hortonworks HDP
• Supports LLAP (with HDP 2.5)
• Supports Tez
• Supports ORC, Atlas, Ranger
• Supports Vectorization
• Supports In-Memory Computation

Apache Hive in Cloudera CDH
• Lacks LLAP Support
• Lacks Tez Support
• Lacks ORC Support
• Lacks Vectorization Support
• Lacks In-Memory Support

Note: I’ll talk about Impala later


In-Hadoop Databases

Apache HAWQ


Hadoop-native SQL query engine and advanced analytics MPP database that offers high-performance interactive ANSI SQL query execution and machine learning for Data Analysts & Data Scientists who want to find insights from large/complex datasets.

Hortonworks HDB, powered by Apache HAWQ

Apache HAWQ (Pivotal/Hortonworks HDB)

• Created by Pivotal – based on Greenplum core
• Resold as Hortonworks HDB


HAWQ Architecture


Apache Hive & HAWQ/HDB

Right Tool for the Job: Choose the right SQL engine based on your application’s needs.

Apache Hive
● Multiple subject areas
● Holds very detailed information
● Scale – Multiple Petabytes
● Integrates all data sources
● ETL, Reporting & BI
● Low-Mid Query Latency

Apache HAWQ / HDB
● Single Subject Mart
● Summarized information
● Scale – 100s TB
● Ad-hoc Analytics & Visualization
● Machine Learning
● Low Query Latency


In-Hadoop Databases

Apache Impala (incubating)


Apache Impala (Incubating)

Brings scalable massively parallel processing (MPP) database technology to Hadoop
– Circumvents MapReduce
– Directly accesses the data through a specialized distributed query engine

MapReduce data processing and interactive queries can be done on the same system using the same data and metadata

Uses metadata, ODBC driver, and SQL syntax from Apache Hive


5-User Testing Results: Industry Standard TPC-DS Queries

[Chart: query times in seconds.]

* Queries that did not complete are omitted from results on both platforms

• HAWQ 30% faster
• Impala failed to complete 47% of the queries

[Figure: Impala Test Query Fails. Grid of TPC-DS queries 1 to 99, marked by failure reason: Unsupported SQL, Long running killed, Memory Limit Exceeded.]


Apache Impala

• Fast interactive query
• Re-uses Hive metadata and JDBC driver

• Incomplete ANSI SQL compliance
• User concurrency stability issues
• TB scale, not PB scale
• Vendor-specific security model (Cloudera)


Apache HAWQ vs. Apache Impala

Apache HAWQ / HDB
• Deep YARN Integration
• Best In Class Optimizer
• Full ANSI SQL Compliance
• Integrated Predictive Modeling
• Performance Advantage 30%-600% over Impala

Apache Impala
• No YARN Integration – Poor Cluster Utilization
• Incomplete SQL Support
• No Built-in support for Predictive Modeling
• Poor concurrency


In-Memory Approach

AtScale


AtScale


AtScale

• Any BI Tool
• No Data Movement
• Single Semantic Layer

Turn Your Hadoop Cluster into Scale-Out OLAP Server

Resold by Hortonworks


AtScale Architecture – Leverages

Auto-builds & maintains aggregates in Spark
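The idea of answering BI queries from maintained aggregates instead of raw rows can be sketched as follows. This is a plain-Python illustration under stated assumptions, not AtScale's implementation or API: the `AggregateLayer` class, the fact rows, and the routing rule are hypothetical, and no Spark is involved.

```python
from collections import defaultdict

# Hypothetical fact rows; illustrative data only.
FACTS = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 4},
    {"region": "west", "product": "a", "amount": 6},
    {"region": "west", "product": "a", "amount": 1},
]


def build_aggregate(facts, dims, measure):
    """Sum `measure` grouped by the given dimension columns (full scan)."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


class AggregateLayer:
    """Routes a query to a prebuilt aggregate when one matches, else scans."""

    def __init__(self, facts):
        self.facts = facts
        self.aggregates = {}

    def maintain(self, dims, measure):
        """Build (or rebuild) the aggregate for this grouping ahead of time."""
        self.aggregates[(tuple(dims), measure)] = build_aggregate(
            self.facts, dims, measure
        )

    def query(self, dims, measure):
        key = (tuple(dims), measure)
        if key in self.aggregates:
            return self.aggregates[key], "aggregate"
        return build_aggregate(self.facts, dims, measure), "scan"


layer = AggregateLayer(FACTS)
layer.maintain(["region"], "amount")
by_region, source = layer.query(["region"], "amount")
```

The BI tool issues the same query either way; only the layer decides whether a small precomputed table or the full fact data answers it.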


AtScale

• Fast interactive query
• Full ANSI SQL Compliance
• Any BI tool
• No data movement
• Good user concurrency

• It is a “middleware” layer running on an edge node that you need to maintain


In-Memory + Micro-Queries Approach

Zoomdata


Zoomdata – Micro-Queries + Spark

Patented technique for delivering fast visualization of large volumes of data
– Immediately displays a partial or approximate rendering which then becomes more accurate over time

Single logical query turned into a set of micro-queries executed in parallel
– Results from the first micro-query immediately displayed
– As the rest of the micro-queries complete, Zoomdata’s streaming architecture updates the visualization with new data until the full result set comes into focus
– Interact with the data while data is still being processed

Leverages Spark as internal in-memory database

Data Sharpening
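A rough sketch of the sharpening behaviour: one logical query (here, an average) is split into micro-queries, and a running estimate is emitted after each one completes. This is an illustration only and not Zoomdata's patented technique; the function names and data are hypothetical, and the micro-queries run one after another here rather than in parallel.

```python
def micro_queries(rows, n_parts):
    """Split one logical query's input into roughly equal slices."""
    if not rows:
        return []
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def sharpen_average(rows, n_parts):
    """Yield a running average after each micro-query completes."""
    total = 0
    count = 0
    for part in micro_queries(rows, n_parts):
        total += sum(part)
        count += len(part)
        yield total / count  # partial result a chart could display right away


# Hypothetical measurements; illustrative data only.
ROWS = [4, 8, 6, 2, 10, 6]
estimates = list(sharpen_average(ROWS, 3))
# estimates == [6.0, 5.0, 6.0]; the last value is the exact average
```

Early estimates can be off, as the second value shows, but the final one always equals the exact answer because by then every slice has been counted.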


Zoomdata

• Fast interactive query
• Full ANSI SQL Compliance
• Built-in visualization
• No data movement
• Good user concurrency

• Runs on an edge node that you need to maintain
• Not designed to work with other BI tools


Conclusion


The Holy Grail of sub-second queries and full SQL compliance against PB-scale datasets in Hadoop is not easy.


But through a combination of innovation at the core and vendor innovation … we are getting closer.


Summary Scorecard

Technology | Scale | Speed | SQL Compliance

Apache Hive on MapReduce

Apache Hive on Tez

Apache Hive on LLAP

Hive on Tez + LLAP (HDP 2.5)

Data Movement Work-Around

Apache HAWQ

Apache Impala (incubating)

AtScale

Zoomdata


Thank You


http://hortonworks.com


