Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop. Ian Fyfe, Director Product Marketing. August 2nd, 2016

Transcript of Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Page 1: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop
Ian Fyfe
Director Product Marketing
August 2nd, 2016

Page 2: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

• Hadoop architecture vs. BI tool requirements
• Hive on MapReduce
• The Data Movement Work-Around
• Hive on Tez and LLAP
• In-Hadoop Databases: Apache HAWQ, Apache Impala
• In-Memory: AtScale, Zoomdata
• Conclusion and summary

Page 3: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hortonworks

The company behind Apache Hadoop
– Hortonworks Data Platform (HDP)
– Hortonworks Data Flow (HDF)

Strong partnership with Pivotal
– Pivotal is converting Pivotal Hadoop Distribution customers to HDP
– Hortonworks is reselling Pivotal HDB subscriptions

Page 4: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hadoop “Classic” Core Components

HDFS
– A distributed file system allowing massive storage across a cluster of commodity servers

MapReduce
– Framework for distributed computation; common use cases include aggregating, sorting, and filtering big data sets
– Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
– Massively scalable
– High latency
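The map, shuffle, and reduce phases described on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop's implementation: the `map_reduce` helper, the sample `sales` records, and the mapper and reducer names are all hypothetical.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Single-process illustration of the MapReduce programming model."""
    # Map phase: every record is processed in isolation and emits (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical sample data: (region, sale amount).
sales = [("east", 100), ("west", 250), ("east", 50), ("north", 75)]

def sales_mapper(record):
    region, amount = record
    yield region, amount

def sum_reducer(key, values):
    return sum(values)

totals = map_reduce(sales, sales_mapper, sum_reducer)
print(totals)
```

Because each mapper call and each per-key reduction depends only on its own input, a real cluster can run (or re-run) them on any node, which is where the scalability comes from.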

Page 5: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Related Projects

Hive – a data warehouse infrastructure on top of Hadoop
– Implements a SQL-like query language, including a JDBC driver

HBase – the Hadoop database – AH HA!
– NoSQL database problematic for traditional BI
  • Apache Phoenix provides SQL interface
– Best at storing large amounts of unstructured data!
– Not optimized for aggregate (BI style) queries

Page 6: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Unfortunately, Hadoop wasn't originally designed for most BI requirements …

Page 7: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

• Low-latency queries at petabyte scale
• Full ANSI SQL compliance

Page 8: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

… but the situation is rapidly improving!

Page 9: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive on MapReduce (Hive 1.0)

Page 10: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Apache Hive 1.0

Facilitates querying and managing large datasets. Data analysts use Hive to explore, structure and analyze that data using a SQL-like language called HiveQL.

Hive on MapReduce

• Massively scalable to PB range
• Batch reporting & ETL

• High latency – queries in minutes or hours
• Limited ANSI SQL compliance

Page 11: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

The Result of Hive on MR… The Data Movement Work-Around

Page 12: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

The Data Movement Work-Around

ETL the data into traditional data marts or data warehouses, using an ETL tool or code:
Oracle, Vertica, Netezza, MySQL, etc.

• Low-latency queries
• SQL compliance

• Currency
• Cost
• Complexity
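A minimal sketch of the work-around, assuming a small in-memory SQLite database stands in for the data mart and a Python list stands in for detail data held in Hadoop. The table name `sales_by_region`, the sample rows, and both helper functions are hypothetical. It also shows the currency issue: the mart only reflects the data as of the last load.

```python
import sqlite3

# Hypothetical detail rows standing in for data extracted from Hadoop:
# (day, region, amount).
hadoop_rows = [
    ("2016-08-01", "east", 100.0),
    ("2016-08-01", "west", 250.0),
    ("2016-08-02", "east", 50.0),
]

def load_mart(conn, rows):
    """Transform detail rows into a summary and (re)load the data mart table."""
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute("CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)")
    totals = {}
    for _day, region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    conn.executemany("INSERT INTO sales_by_region VALUES (?, ?)", sorted(totals.items()))
    conn.commit()

def query_mart(conn):
    """Low-latency SQL query against the mart."""
    return conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load_mart(conn, hadoop_rows)
before = query_mart(conn)

# New data lands in Hadoop after the load; the mart does not see it yet.
hadoop_rows.append(("2016-08-03", "west", 25.0))
stale = query_mart(conn)

# Only the next ETL run brings the mart up to date.
load_mart(conn, hadoop_rows)
fresh = query_mart(conn)
print(before, stale, fresh)
```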

Page 13: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive on Tez (Hive 2.0)

Page 14: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive on Tez

Apache Tez
– Alternate query framework to MapReduce which allows for a complex directed-acyclic-graph (DAG) of tasks for processing data
– Significantly faster than MapReduce

• Many times faster than Hive on MR
• PB scale

• Still quite high query latency for BI tools
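To make the DAG idea concrete, here is a small sketch that runs a graph of tasks in dependency order using Python's standard `graphlib` module (Python 3.9+). It is not the Tez API: the `run_dag` helper, the task names, and the sample data are assumptions chosen to resemble a scan, join, aggregate query plan.

```python
from graphlib import TopologicalSorter

def run_dag(tasks, dependencies):
    """Run each task once all of its upstream tasks have produced results.

    tasks:        name -> callable taking a dict of upstream results
    dependencies: name -> set of upstream task names
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    results, order = {}, []
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](inputs)
        order.append(name)
    return results, order

def aggregate(inputs):
    totals = {}
    for region_name, amount in inputs["join"]:
        totals[region_name] = totals.get(region_name, 0) + amount
    return totals

# Hypothetical query plan: two scans feed a join, which feeds an aggregation.
tasks = {
    "scan_orders": lambda _: [("east", 100), ("west", 250), ("east", 50)],
    "scan_regions": lambda _: {"east": "Eastern", "west": "Western"},
    "join": lambda inp: [
        (inp["scan_regions"][region], amount) for region, amount in inp["scan_orders"]
    ],
    "aggregate": aggregate,
}
dependencies = {
    "join": {"scan_orders", "scan_regions"},
    "aggregate": {"join"},
}

results, order = run_dag(tasks, dependencies)
print(order, results["aggregate"])
```

The whole plan is expressed as one graph, so intermediate results flow directly from one task to the next rather than being chained through separate map and reduce jobs.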

Page 15: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive on LLAP (Hive 2.1)

Page 16: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

HDP 2.5 is a Major Milestone for Hive

At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.

Major Improvements:
– Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
– Operations: LLAP integration in Ambari, new Grafana dashboards.

[Chart: 1391 from Hortonworks, 642 from community]

Hive 2 Highlights
• Interactive Query with Hive LLAP
• SQL ACID Fully Supported
• 2x Faster ETL

Page 17: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive 2 with LLAP: Architecture Overview

[Architecture diagram: SQL queries arrive over ODBC/JDBC at HiveServer2. Query coordinators (App Masters) and LLAP daemons, each daemon with executors and an in-memory cache, run in a YARN cluster. Deep storage: HDFS, S3, and other HDFS-compatible filesystems.]
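The benefit of a long-lived daemon with an in-memory cache can be illustrated with a read-through cache placed in front of slow storage. This is a toy sketch, not LLAP's implementation: the `CachingDaemon` class is hypothetical, a dict stands in for deep storage, and least-recently-used eviction is an assumption (the slides do not describe the caching policy).

```python
from collections import OrderedDict

class CachingDaemon:
    """Toy long-lived process that keeps recently read chunks in memory."""

    def __init__(self, storage, capacity):
        self.storage = storage        # dict standing in for deep storage (HDFS, S3, ...)
        self.capacity = capacity      # maximum number of chunks held in memory
        self.cache = OrderedDict()    # chunk_id -> data, least recently used first
        self.storage_reads = 0        # counts the slow reads that reach deep storage

    def read(self, chunk_id):
        if chunk_id in self.cache:
            # Cache hit: served from memory, mark as most recently used.
            self.cache.move_to_end(chunk_id)
            return self.cache[chunk_id]
        # Cache miss: fetch from deep storage and remember the result.
        self.storage_reads += 1
        data = self.storage[chunk_id]
        self.cache[chunk_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used chunk
        return data

storage = {"a": [1, 2], "b": [3], "c": [4]}
daemon = CachingDaemon(storage, capacity=2)
for chunk_id in ["a", "b", "a", "c", "b"]:
    daemon.read(chunk_id)
print(daemon.storage_reads, list(daemon.cache))
```

Because the daemon persists between queries, later queries that touch the same chunks avoid the trip to deep storage entirely.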

Page 18: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive 2 with LLAP: Preliminary Numbers

Hive 2.0 and LLAP: TPC-DS at 10 TB Scale, 18 Nodes

[Bar chart: query time (axis 0 to 80) for TPC-DS queries q3, q7, q12, q13, q19, q21, q26, q27, q42, q43, q45, q52, q55, q60, q73, q84, q89, q91, q98. Series: Hive 2.0 on Tez, LLAP.]

Min query time: Query 55: 2.38s

Page 19: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive 2 with LLAP: Linear Scaling at 1TB: 8 nodes versus 16 nodes.

[Chart: Average Query Time by Concurrency. X axis: concurrent queries (8, 16, 32). Y axis: time (s), 0 to 120. Series: Average Time: 8 Node, Average Time: 16 Node.]

[Chart: Maximum Query Time by Concurrency. X axis: concurrent queries (8, 16, 32). Y axis: time (s), 0 to 300. Series: Max Time: 8 Node, Max Time: 16 Node.]

Page 20: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Hive 2 with LLAP Enables Interactive Query In Seconds

• Faster interactive query
• Faster ETL
• Expanded SQL compliance for BI tools (nearing SQL:2011)
• Enterprise Readiness: granular row & column level security
• Simplified Operations: LLAP integration with Ambari with automated dashboards

• TB scale datasets, not PB scale

Page 21: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Caution: Not All Hives Are Created Equal

Apache Hive in Hortonworks HDP
• Supports LLAP (with HDP 2.5)
• Supports Tez
• Supports ORC, Atlas, Ranger
• Supports Vectorization
• Supports In-Memory Computation

Apache Hive in Cloudera CDH
• Lacks LLAP Support
• Lacks Tez Support
• Lacks ORC Support
• Lacks Vectorization Support
• Lacks In-Memory Support

Note: I’ll talk about Impala later

Page 22: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

In-Hadoop Databases

Apache HAWQ

Page 23: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Apache HAWQ (Pivotal/Hortonworks HDB)

Hadoop-native SQL query engine and advanced analytics MPP database that offers high-performance interactive ANSI SQL query execution and machine learning for Data Analysts & Data Scientists who want to find insights from large/complex datasets.

Hortonworks HDB, powered by Apache HAWQ

• Created by Pivotal – based on Greenplum core
• Resold as Hortonworks HDB

Page 24: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

HAWQ Architecture

Page 25: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Apache Hive & HAWQ/HDB

Apache Hive
● Multiple subject areas
● Holds very detailed information
● Scale – Multiple Petabytes
● Integrates all data sources
● ETL, Reporting & BI
● Low-Mid Query Latency

Apache HAWQ / HDB
● Single Subject Mart
● Summarized information
● Scale – 100s TB
● Ad-hoc Analytics & Visualization
● Machine Learning
● Low Query Latency

Right Tool for the Job: Choose the right SQL engine based on your application’s needs.

Page 26: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

In-Hadoop Databases

Apache Impala (incubating)

Page 27: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Apache Impala (Incubating)

Brings scalable massively parallel processing (MPP) database technology to Hadoop
– Circumvents MapReduce
– Directly accesses the data through a specialized distributed query engine

MapReduce data processing and interactive queries can be done on the same system using the same data and metadata

Uses metadata, ODBC driver, and SQL syntax from Apache Hive

Page 28: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

5-User Testing Results: Industry Standard TPC-DS Queries

[Chart: query times in seconds.]

* Queries that did not complete are omitted from results on both platforms

• HAWQ 30% faster
• Impala failed to complete 47% of the queries

Impala Test Query Fails

[Grid of TPC-DS queries 1 to 99 marking the failed queries by cause. Legend: Unsupported SQL, Long running killed, Memory Limit Exceeded.]

Page 29: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Apache Impala

• Fast interactive query
• Re-uses Hive metadata and JDBC driver

• Incomplete ANSI SQL compliance
• User concurrency stability issues
• TB scale, not PB scale
• Vendor-specific security model (Cloudera)

Page 30: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Apache HAWQ vs. Apache Impala

Apache HAWQ / HDB
• Deep YARN Integration
• Best In Class Optimizer
• Full ANSI SQL Compliance
• Integrated Predictive Modeling
• Performance Advantage 30%-600% over Impala

Apache Impala
• No YARN Integration – Poor Cluster Utilization
• Incomplete SQL Support
• No Built-in Support for Predictive Modeling
• Poor concurrency

Page 31: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

In-Memory Approach

AtScale

Page 32: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

AtScale

Page 33: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

AtScale

• Any BI Tool
• No Data Movement
• Single Semantic Layer

Turn Your Hadoop Cluster into Scale-Out OLAP Server

Resold by Hortonworks

Page 34: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

AtScale Architecture – Leverages

Auto-builds & maintains aggregates in Spark
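As a rough illustration of why pre-built aggregates speed up BI queries, the sketch below summarizes detail rows once and then answers a coarser question from the much smaller summary. It is plain Python rather than Spark and says nothing about how AtScale itself builds or selects aggregates; the sample `fact_rows` and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, amount).
fact_rows = [
    ("east", "widget", 100),
    ("east", "gadget", 40),
    ("west", "widget", 250),
    ("east", "widget", 10),
]

def build_aggregate(rows):
    """Pre-compute totals at (region, product) grain, once, ahead of query time."""
    agg = defaultdict(int)
    for region, product, amount in rows:
        agg[(region, product)] += amount
    return dict(agg)

def total_by_region(aggregate):
    """Answer a coarser query by rolling up the aggregate, not the raw rows."""
    totals = defaultdict(int)
    for (region, _product), amount in aggregate.items():
        totals[region] += amount
    return dict(totals)

aggregate = build_aggregate(fact_rows)
region_totals = total_by_region(aggregate)
print(aggregate, region_totals)
```

The roll-up touches one entry per (region, product) pair instead of every fact row, and the gap widens as the detail data grows.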

Page 35: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

AtScale

• Fast interactive query
• Full ANSI SQL Compliance
• Any BI tool
• No data movement
• Good user concurrency

It is a “middleware” layer running on an edge node that you need to maintain

Page 36: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

In-Memory + Micro-Queries Approach

Zoomdata

Page 37: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Zoomdata – Micro-Queries + Spark

Data Sharpening

Patented technique for delivering fast visualization of large volumes of data
– Immediately displays a partial or approximate rendering which then becomes more accurate over time

Single logical query turned into a set of micro-queries executed in parallel
– Results from the first micro-query immediately displayed
– As the rest of the micro-queries complete, Zoomdata’s streaming architecture updates the visualization with new data until the full result set comes into focus
– Interact with the data while data is still being processed

Leverages Spark as internal in-memory database
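The micro-query idea can be sketched as follows: one logical aggregate is split into per-partition queries that run in parallel, and a running estimate is emitted each time one completes. This is a generic illustration using Python's `concurrent.futures`, not Zoomdata's patented technique; the function names and the sample partitions are hypothetical, and every partition is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def micro_query(partition):
    """One micro-query: partial sum and row count for a single partition."""
    return sum(partition), len(partition)

def progressive_average(partitions):
    """Yield a refined estimate of the overall average as each micro-query finishes."""
    total, count = 0, 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(micro_query, p) for p in partitions]
        for future in as_completed(futures):
            part_sum, part_count = future.result()
            total += part_sum
            count += part_count
            # A visualization could be redrawn here with the sharper estimate.
            yield total / count

# Hypothetical data, already split into (non-empty) partitions.
partitions = [[10, 20, 30], [40, 50], [60], [70, 80, 90, 100]]
estimates = list(progressive_average(partitions))
print(estimates)
```

The early estimates depend on which partitions finish first, so they vary from run to run; only the last one is exact, matching the result a single full query would return.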

Page 38: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Zoomdata

• Fast interactive query
• Full ANSI SQL Compliance
• Built-in visualization
• No data movement
• Good user concurrency

• Runs on an edge node that you need to maintain
• Not designed to work with other BI tools

Page 39: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Conclusion

Page 40: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

The Holy Grail of sub-second queries and full SQL compliance against PB-scale datasets in Hadoop is not easy.

Page 41: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

But through a combination of innovation at the core and vendor innovation … we are getting closer.

Page 42: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Summary Scorecard

Columns: Technology, Scale, Speed, SQL Compliance

Apache Hive on MapReduce

Apache Hive on Tez

Apache Hive on LLAP

Hive on Tez + LLAP (HDP 2.5)

Data Movement Work-Around

Apache HAWQ

Apache Impala (incubating)

AtScale

Zoomdata

Page 43: Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Thank You

[email protected]

http://hortonworks.com
