Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Post on 17-Feb-2017


Ian Fyfe
Director Product Marketing
August 2nd, 2016

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

Hadoop architecture vs. BI tool requirements
Hive on MapReduce
The Data Movement Work-Around
Hive on Tez and LLAP
In-Hadoop Databases: Apache HAWQ, Apache Impala
In-Memory: AtScale, Zoomdata
Conclusion and summary

Hortonworks

The company behind Apache Hadoop
– Hortonworks Data Platform (HDP)
– Hortonworks Data Flow (HDF)

Strong partnership with Pivotal
– Pivotal is converting Pivotal Hadoop Distribution customers to HDP
– Hortonworks is reselling Pivotal HDB subscriptions

Hadoop “Classic” Core Components

HDFS
– A distributed file system allowing massive storage across a cluster of commodity servers

MapReduce
– Framework for distributed computation; common use cases include aggregating, sorting, and filtering big data sets
– Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
– Massively scalable
– High latency
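The fragment-by-fragment model described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the sample records, the three-fragment split, and the function names are all hypothetical, chosen only to show the map, shuffle, and reduce steps of a simple aggregation.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: (region, sale_amount) records split across three "nodes".
FRAGMENTS = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 8), ("north", 1), ("east", 2)],
]


def map_fragment(fragment):
    """Map step: emit (key, value) pairs for one fragment, independent of all others."""
    return [(region, amount) for region, amount in fragment]


def shuffle(mapped_outputs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: aggregate all values that share one key."""
    return key, sum(values)


def run_job(fragments):
    # Each map call touches only its own fragment, so it could run (or be
    # re-run after a failure) on any node without coordinating with the rest.
    mapped = [map_fragment(f) for f in fragments]
    grouped = shuffle(mapped)
    return dict(reduce_group(k, v) for k, v in grouped.items())


totals = run_job(FRAGMENTS)
print(sorted(totals.items()))
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network, which is one source of the high latency noted above.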

Related Projects

Hive – a data warehouse infrastructure on top of Hadoop
– Implements a SQL-like query language, including a JDBC driver

HBase – the Hadoop database – AH HA!
– NoSQL database problematic for traditional BI
  • Apache Phoenix provides SQL interface
– Best at storing large amounts of unstructured data!
– Not optimized for aggregate (BI style) queries

Unfortunately, Hadoop wasn't originally designed for most BI requirements …

Low-latency queries at petabyte scale
Full ANSI SQL Compliance

… but the situation is rapidly improving!

Hive on MapReduce (Hive 1.0)

Apache Hive 1.0

Facilitates querying and managing large datasets.
Data analysts use Hive to explore, structure and analyze that data using a SQL-like language called HiveQL

Hive on MapReduce

Massively scalable to PB range
Batch reporting & ETL

High latency – queries in minutes or hours
Limited ANSI SQL compliance

The Result of Hive on MR… The Data Movement Work-Around

The Data Movement Work-Around

ETL the data into traditional data marts or data warehouses
– Oracle, Vertica, Netezza, MySQL, etc.
– Using an ETL tool or code

Low latency queries
SQL compliance

Currency
Cost
Complexity

Hive on Tez (Hive 2.0)

Hive on Tez

Apache Tez
– Alternate query framework to MapReduce which allows for a complex directed-acyclic-graph (DAG) of tasks for processing data
– Significantly faster than MapReduce

Many times faster than Hive on MR
PB scale

Still quite high query latency for BI tools
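To make the DAG idea concrete, here is a small sketch that runs a hypothetical query plan in dependency order using Python's standard `graphlib` module. The task names and the plan are invented for illustration, and the loop only models scheduling order. It says nothing about how Tez itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_sales": set(),
    "scan_stores": set(),
    "join": {"scan_sales", "scan_stores"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def run_dag(plan):
    """Run tasks in dependency order, grouping tasks that could run in parallel."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose inputs are done
        stages.append(ready)
        for task in ready:
            sorter.done(task)  # a real engine would execute the task here
    return stages


stages = run_dag(PLAN)
for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

The two scans land in the same stage because neither depends on the other. The whole plan is expressed as one graph rather than as a chain of separate map and reduce jobs.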

Hive on LLAP (Hive 2.1)

HDP 2.5 is a Major Milestone for Hive

At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.

Major Improvements:
– Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
– Operations: LLAP integration in Ambari, new Grafana dashboards.

[Chart: Hive changes since HDP 2.4 by source: 1391 from Hortonworks, 642 from community]

Hive 2 Highlights
– Interactive Query with Hive LLAP
– SQL ACID Fully Supported
– 2x Faster ETL

Hive 2 with LLAP: Architecture Overview

[Architecture diagram. Components shown: ODBC/JDBC SQL queries; HiveServer2; Query Coordinators (App Masters); a YARN cluster of LLAP Daemons, each with Executors and an In-Memory Cache; Deep Storage (HDFS, S3 + other HDFS-compatible filesystems)]
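The combination of persistent daemons and in-memory caching can be sketched as a long-lived object that keeps recently read data in memory between queries. This is an illustration only: the class name, the chunk identifiers, and the least-recently-used eviction policy are assumptions. The slides do not state which caching policy LLAP uses.

```python
from collections import OrderedDict


class LlapStyleDaemon:
    """Long-lived worker that keeps recently read data chunks in memory (LRU)."""

    def __init__(self, storage, cache_capacity):
        self.storage = storage            # stand-in for deep storage (HDFS, S3)
        self.cache = OrderedDict()        # chunk id -> data, oldest first
        self.cache_capacity = cache_capacity
        self.storage_reads = 0

    def read_chunk(self, chunk_id):
        if chunk_id in self.cache:
            self.cache.move_to_end(chunk_id)   # mark as most recently used
            return self.cache[chunk_id]
        self.storage_reads += 1                # slow path: go to deep storage
        data = self.storage[chunk_id]
        self.cache[chunk_id] = data
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return data

    def run_query(self, chunk_ids):
        """Executor work: sum the values of every requested chunk."""
        return sum(sum(self.read_chunk(c)) for c in chunk_ids)


STORAGE = {"c1": [1, 2, 3], "c2": [4, 5], "c3": [6]}
daemon = LlapStyleDaemon(STORAGE, cache_capacity=2)

first = daemon.run_query(["c1", "c2"])    # cold: both chunks come from storage
second = daemon.run_query(["c1", "c2"])   # warm: served from the in-memory cache
print(first, second, daemon.storage_reads)
```

Because the daemon outlives any single query, the second query pays no storage cost. A process started fresh for each query could not keep its cache warm in this way.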

Hive 2 with LLAP: Preliminary Numbers

[Bar chart: Hive 2.0 and LLAP: TPC-DS at 10 TB Scale, 18 Nodes. Series: Hive 2.0 on Tez vs. LLAP. Queries shown: q3, q7, q12, q13, q19, q21, q26, q27, q42, q43, q45, q52, q55, q60, q73, q84, q89, q91, q98. Y-axis range: 0 to 80]

Min query time: Query 55: 2.38s

Hive 2 with LLAP: Linear Scaling at 1TB: 8 nodes versus 16 nodes.

[Two charts: Average Query Time by Concurrency and Maximum Query Time by Concurrency. X-axis: Concurrent Queries (8, 16, 32). Y-axis: Time (s). Series: 8 nodes vs. 16 nodes]

Hive 2 with LLAP Enables Interactive Query In Seconds

Faster interactive query
Faster ETL
Expanded SQL compliance for BI tools (nearing SQL:2011)
Enterprise Readiness: granular row & column level security
Simplified Operations: LLAP integration with Ambari with automated dashboards

TB scale datasets, not PB scale

Caution: Not All Hives Are Created Equal

Apache Hive in Hortonworks HDP
• Supports LLAP (with HDP 2.5)
• Supports Tez
• Supports ORC, Atlas, Ranger
• Supports Vectorization
• Supports In-Memory Computation

Apache Hive in Cloudera CDH
• Lacks LLAP Support
• Lacks Tez Support
• Lacks ORC Support
• Lacks Vectorization Support
• Lacks In-Memory Support

Note: I’ll talk about Impala later

In-Hadoop Databases

Apache HAWQ

Apache HAWQ (Pivotal/Hortonworks HDB)

Hadoop-native SQL query engine and advanced analytics MPP database that offers high-performance interactive ANSI SQL query execution and machine learning for Data Analysts & Data Scientists who want to find insights from large/complex datasets.

Hortonworks HDB, powered by Apache HAWQ

Created by Pivotal – based on Greenplum core
Resold as Hortonworks HDB

HAWQ Architecture

Apache Hive & HAWQ/HDB

Apache Hive
● Multiple subject areas
● Holds very detailed information
● Scale – Multiple Petabytes
● Integrates all data sources
● ETL, Reporting & BI
● Low-Mid Query Latency

Apache HAWQ / HDB
● Single Subject Mart
● Summarized information
● Scale – 100s TB
● Ad-hoc Analytics & Visualization
● Machine Learning
● Low Query Latency

Right Tool for the Job: Choose the right SQL engine based on your application’s needs.

In-Hadoop Databases

Apache Impala (incubating)

Apache Impala (Incubating)

Brings scalable massively parallel processing (MPP) database technology to Hadoop
– Circumvents MapReduce
– Directly accesses the data through a specialized distributed query engine

MapReduce data processing and interactive queries can be done on the same system using the same data and metadata

Uses metadata, ODBC driver, and SQL syntax from Apache Hive

5-User Testing Results: Industry Standard TPC-DS Queries

• HAWQ 30% faster
• Impala failed to complete 47% of the queries

* Queries that did not complete are omitted from results on both platforms

[Chart: query times in seconds on both platforms]

[Figure: Impala Test Query Fails. Grid of TPC-DS queries 1–99 marked by failure type: Unsupported SQL, Long running killed, Memory Limit Exceeded]

Apache Impala

Fast interactive query
Re-uses Hive metadata and JDBC driver

Incomplete ANSI SQL compliance
User concurrency stability issues
TB scale, not PB scale
Vendor-specific security model (Cloudera)

Apache HAWQ vs. Apache Impala

Apache HAWQ / HDB
• Deep YARN Integration
• Best In Class Optimizer
• Full ANSI SQL Compliance
• Integrated Predictive Modeling
• Performance Advantage 30%-600% over Impala

Apache Impala
• No YARN Integration – Poor Cluster Utilization
• Incomplete SQL Support
• No Built-in support for Predictive Modeling
• Poor concurrency

In-Memory Approach

AtScale

AtScale

Any BI Tool
No Data Movement
Single Semantic Layer

Turn Your Hadoop Cluster into Scale-Out OLAP Server

Resold by Hortonworks

AtScale Architecture – Leverages

Auto-builds & maintains aggregates in Spark
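The general idea of answering queries from pre-built aggregates can be sketched in plain Python. This is a generic illustration and not AtScale's algorithm or Spark code. The fact rows, the dimension names, and the rule of picking the smallest covering aggregate are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, revenue).
FACTS = [
    ("east", "widget", 10.0),
    ("east", "gadget", 4.0),
    ("west", "widget", 6.0),
    ("east", "widget", 5.0),
]
DIMENSIONS = ("region", "product")


def build_aggregate(facts, dims):
    """Pre-sum revenue for one combination of dimensions."""
    index = [DIMENSIONS.index(d) for d in dims]
    table = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in index)
        table[key] += row[-1]
    return dict(table)


# Aggregates built ahead of time, keyed by the dimensions they retain.
AGGREGATES = {
    ("region",): build_aggregate(FACTS, ("region",)),
    ("region", "product"): build_aggregate(FACTS, ("region", "product")),
}


def revenue_by(dims):
    """Answer from the smallest aggregate that covers the requested dimensions."""
    candidates = [a for a in AGGREGATES if set(dims) <= set(a)]
    if not candidates:
        return build_aggregate(FACTS, dims)  # fall back to scanning raw facts
    best = min(candidates, key=lambda a: len(AGGREGATES[a]))
    if best == tuple(dims):
        return AGGREGATES[best]
    # Roll the chosen aggregate up to the requested, coarser grain.
    index = [best.index(d) for d in dims]
    rolled = defaultdict(float)
    for key, value in AGGREGATES[best].items():
        rolled[tuple(key[i] for i in index)] += value
    return dict(rolled)


print(revenue_by(("region",)))
print(revenue_by(("product",)))
```

The first query is served directly from a stored aggregate. The second is rolled up from the finer region-and-product aggregate, so neither one rescans the raw facts.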

AtScale

Fast interactive query
Full ANSI SQL Compliance
Any BI tool
No data movement
Good user concurrency

It is a “middleware” layer running on an edge node that you need to maintain

In-Memory + Micro-Queries Approach
Zoomdata

Zoomdata – Micro-Queries + Spark

Patented technique for delivering fast visualization of large volumes of data
– Immediately displays a partial or approximate rendering which then becomes more accurate over time

Single logical query turned into a set of micro-queries executed in parallel
– Results from the first micro-query immediately displayed
– As the rest of the micro-queries complete, Zoomdata’s streaming architecture updates the visualization with new data until the full result set comes into focus
– Interact with the data while data is still being processed

Leverages Spark as internal in-memory database

Data Sharpening
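The micro-query pattern described above can be approximated with a generic progressive-aggregation sketch. Zoomdata's patented technique is not detailed in these slides, so this is only an assumption-laden illustration. The partitions, the `micro_query` and `sharpen` names, and the use of threads to stand in for parallel queries are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical data set, split into partitions that each back one micro-query.
PARTITIONS = [
    [12, 7, 9],
    [4, 15],
    [8, 8, 6, 2],
    [11],
]


def micro_query(partition):
    """One micro-query: a partial (sum, count) over a single partition."""
    return sum(partition), len(partition)


def sharpen(partitions):
    """Yield a running average that is refined as each micro-query completes."""
    total, count = 0, 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(micro_query, p) for p in partitions]
        for done, future in enumerate(as_completed(futures), start=1):
            part_sum, part_count = future.result()
            total += part_sum
            count += part_count
            # Early estimates are approximate; the last one is exact.
            yield done / len(partitions), total / count


estimates = list(sharpen(PARTITIONS))
for progress, average in estimates:
    print(f"{progress:.0%} of micro-queries done, average so far: {average:.2f}")
```

The order of the intermediate estimates depends on which micro-query finishes first. The final estimate is always the exact average over all partitions.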

Zoomdata

Fast interactive query
Full ANSI SQL Compliance
Built-in visualization
No data movement
Good user concurrency

Runs on an edge node that you need to maintain
Not designed to work with other BI tools

Conclusion

The Holy Grail of sub-second queries and full SQL compliance against PB-scale datasets in Hadoop is not easy.

But through a combination of innovation at the core and vendor innovation … we are getting closer.

Summary Scorecard

Columns: Technology, Scale, Speed, SQL Compliance

Apache Hive on MapReduce

Apache Hive on Tez

Apache Hive on LLAP

Hive on Tez + LLAP (HDP 2.5)

Data Movement Work-Around

Apache HAWQ

Apache Impala (incubating)

AtScale

Zoomdata

Thank You

ifyfe@hortonworks.com

http://hortonworks.com
