Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

Ian Fyfe, Director Product Marketing
August 2nd, 2016

Agenda

– Hadoop architecture vs. BI tool requirements
– Hive on MapReduce
– The Data Movement Work-Around
– Hive on Tez and LLAP
– In-Hadoop Databases: Apache HAWQ, Apache Impala
– In-Memory: AtScale, Zoomdata
– Conclusion and summary

Hortonworks

The company behind Apache Hadoop
– Hortonworks Data Platform (HDP)
– Hortonworks Data Flow (HDF)

Strong partnership with Pivotal
– Pivotal is converting Pivotal Hadoop Distribution customers to HDP
– Hortonworks is reselling Pivotal HDB subscriptions

Hadoop “Classic” Core Components

HDFS
– A distributed file system allowing massive storage across a cluster of commodity servers

MapReduce
– Framework for distributed computation; common use cases include aggregating, sorting, and filtering big data sets
– Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
– Massively scalable
– High latency
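The "small fragments computed in isolation" idea can be illustrated with a minimal sketch in plain Python. It is not Hadoop's API: the `fragments` data, the function names, and the in-process "shuffle" are all assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(fragment):
    """Map: emit (key, value) pairs for one fragment, independently of all others."""
    return [(region, amount) for region, amount in fragment]

def shuffle(mapped_outputs):
    """Group values by key across all map outputs (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical sales records, split into fragments as if stored on different nodes.
fragments = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 1)],
]

# Each map_phase call depends only on its own fragment, so it could run
# (or be re-run after a failure) on any node.
mapped = [map_phase(fragment) for fragment in fragments]
totals = reduce_phase(shuffle(mapped))
print(totals)
```

Because no map call shares state with another, a failed fragment can simply be recomputed, which is what makes the model scale; the cost is the per-job startup and disk-based shuffle that produce the high latency noted above.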

Related Projects

Hive – a data warehouse infrastructure on top of Hadoop
– Implements a SQL-like query language, including a JDBC driver

HBase – the Hadoop database – AH HA!
– NoSQL database problematic for traditional BI
  • Apache Phoenix provides SQL interface
– Best at storing large amounts of unstructured data!
– Not optimized for aggregate (BI style) queries

Unfortunately, Hadoop wasn't originally designed for most BI requirements …

Low-latency queries at petabyte scale
Full ANSI SQL compliance

… but the situation is rapidly improving!

Hive on MapReduce (Hive 1.0)

Apache Hive 1.0

Facilitates querying and managing large datasets. Data analysts use Hive to explore, structure, and analyze that data using a SQL-like language called HiveQL.

Hive on MapReduce

Massively scalable to PB range
Batch reporting & ETL

High latency: queries take minutes or hours
Limited ANSI SQL compliance

The Result of Hive on MR… The Data Movement Work-Around

The Data Movement Work-Around

ETL the data into traditional data marts or data warehouses

Oracle, Vertica, Netezza, MySQL

ETL Tool or

Low latency queries
SQL compliance

Currency
Cost
Complexity

Hive on Tez (Hive 2.0)

Hive on Tez

Apache Tez
– Alternate query framework to MapReduce which allows for a complex directed-acyclic-graph (DAG) of tasks for processing data
– Significantly faster than MapReduce

Many times faster than Hive on MR
PB scale

Still quite high query latency for BI tools
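The DAG-of-tasks model mentioned for Tez can be sketched with Python's standard `graphlib` module. This is only an illustration of dependency-ordered execution, not Tez's API; the `plan` dictionary and its task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each task maps to the set of tasks it depends on.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}

def run_dag(plan):
    """Return the tasks grouped into waves; tasks in one wave have no
    unmet dependencies and could run in parallel."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = run_dag(plan)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Expressing the whole query as one graph lets the engine pass results directly from one task to the next, rather than chaining separate MapReduce jobs that each write intermediate output before the next job starts.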

Hive on LLAP (Hive 2.1)

HDP 2.5 is a Major Milestone for Hive

At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.

Major Improvements:
– Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
– Operations: LLAP integration in Ambari, new Grafana dashboards.

[Chart legend: From Hortonworks, From Community]

Hive 2 Highlights

Interactive Query with Hive LLAP
SQL ACID Fully Supported
2x Faster ETL

Hive 2 with LLAP: Architecture Overview

[Architecture diagram. Components shown: ODBC/JDBC SQL queries; HiveServer2; query coordinators (App Master); a YARN cluster running four LLAP daemons, each with executors and an in-memory cache; storage on HDFS, S3, and other HDFS-compatible filesystems.]
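Why a persistent daemon with an in-memory cache helps repeated queries can be shown with a toy sketch. It assumes a simple dictionary cache with no eviction; the class name, the `storage` data, and the chunk identifiers are hypothetical and do not reflect LLAP's actual caching policy or interfaces.

```python
class CachingDaemon:
    """Toy stand-in for a long-lived query server that keeps data in memory."""

    def __init__(self, storage):
        self.storage = storage      # stands in for HDFS or another filesystem
        self.cache = {}             # survives between queries
        self.storage_reads = 0      # counts trips to the slow storage layer

    def read(self, chunk_id):
        """Return a data chunk, reading from storage only on a cache miss."""
        if chunk_id not in self.cache:
            self.storage_reads += 1
            self.cache[chunk_id] = self.storage[chunk_id]
        return self.cache[chunk_id]

    def run_query(self, chunk_ids):
        """A trivial 'query': sum every value in the requested chunks."""
        return sum(sum(self.read(chunk_id)) for chunk_id in chunk_ids)

storage = {"c1": [1, 2, 3], "c2": [4, 5]}
daemon = CachingDaemon(storage)
first = daemon.run_query(["c1", "c2"])   # cold: reads both chunks from storage
second = daemon.run_query(["c1", "c2"])  # warm: served entirely from memory
print(first, second, daemon.storage_reads)
```

The second query returns the same answer without touching storage, which is the effect a long-running process can offer and a freshly launched container per query cannot.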

Hive 2 with LLAP: Preliminary Numbers

[Chart: Hive 2.0 on Tez vs. LLAP, TPC-DS at 10 TB scale, 18 nodes. Queries shown: q3, q7, q12, q13, q19, q21, q26, q27, q42, q43, q45, q52, q55, q60, q73, q84, q89, q91, q98.]

Min query time: Query 55: 2.38s

Hive 2 with LLAP: Linear Scaling at 1TB: 8 nodes versus 16 nodes.

[Chart: Average Query Time by Concurrency. Series: 8 node, 16 node. X-axis: concurrent queries (8, 16, 32).]

[Chart: Maximum Query Time by Concurrency. Series: 8 node, 16 node. X-axis: concurrent queries (8, 16, 32).]

Hive 2 with LLAP Enables Interactive Query In Seconds

Faster interactive query
Faster ETL
Expanded SQL compliance for BI tools (nearing SQL:2011)
Enterprise readiness: granular row & column level security
Simplified operations: LLAP integration with Ambari with automated dashboards

TB scale datasets, not PB scale

Caution: Not All Hives Are Created Equal

Apache Hive in Hortonworks HDP
• Supports LLAP (with HDP 2.5)
• Supports Tez
• Supports ORC, Atlas, Ranger
• Supports Vectorization
• Supports In-Memory Computation

Apache Hive in Cloudera CDH
• Lacks LLAP Support
• Lacks Tez Support
• Lacks ORC Support
• Lacks Vectorization Support
• Lacks In-Memory Support

Note: I’ll talk about Impala later

In-Hadoop Databases

Apache HAWQ

Hadoop-native SQL query engine and advanced analytics MPP database that offers high-performance interactive ANSI SQL query execution and machine learning for data analysts and data scientists who want to find insights from large or complex datasets.

Hortonworks HDB, powered by Apache HAWQ

Apache HAWQ (Pivotal/Hortonworks HDB)

Created by Pivotal – based on Greenplum core
Resold as Hortonworks HDB

HAWQ Architecture

Apache Hive & HAWQ/HDB

Apache Hive
● Multiple subject areas
● Holds very detailed information
● Scale – Multiple Petabytes
● Integrates all data sources
● ETL, Reporting & BI
● Low-Mid Query Latency

Apache HAWQ / HDB
● Single Subject Mart
● Summarized information
● Scale – 100s TB
● Ad-hoc Analytics & Visualization
● Machine Learning
● Low Query Latency

Right Tool for the Job: Choose the right SQL engine based on your application’s needs.

In-Hadoop Databases

Apache Impala (incubating)

Apache Impala (Incubating)

Brings scalable massively parallel processing (MPP) database technology to Hadoop
– Circumvents MapReduce
– Directly accesses the data through a specialized distributed query engine

MapReduce data processing and interactive queries can be done on the same system using the same data and metadata

Uses metadata, ODBC driver, and SQL syntax from Apache Hive

5-User Testing Results
Industry Standard TPC-DS Queries

* Queries that did not complete are omitted from results on both platforms

• HAWQ 30% faster
• Impala failed to complete 47% of the queries

Impala Test Query Fails

[Grid of TPC-DS queries 1 to 99, marked by failure type. Legend: Unsupported SQL; Long running, killed; Memory Limit Exceeded.]

Apache Impala

Fast interactive query
Re-uses Hive metadata and JDBC driver

Incomplete ANSI SQL compliance
User concurrency stability issues
TB scale, not PB scale
Vendor-specific security model (Cloudera)

Apache HAWQ vs. Apache Impala

Apache HAWQ / HDB
• Deep YARN Integration
• Best In Class Optimizer
• Full ANSI SQL Compliance
• Integrated Predictive Modeling
• Performance Advantage 30%-600% over Impala

Apache Impala
• No YARN Integration – Poor Cluster Utilization
• Incomplete SQL Support
• No Built-in Support for Predictive Modeling
• Poor concurrency

In-Memory Approach

AtScale

Any BI Tool
No Data Movement
Single Semantic Layer

Turn Your Hadoop Cluster into Scale-Out OLAP Server

Resold by Hortonworks

AtScale Architecture – Leverages

Auto-builds & maintains aggregates in Spark
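The benefit of pre-built aggregates can be sketched as follows: summarize the detailed rows once, then answer BI-style questions from the much smaller summary. This is plain Python rather than Spark, and it is not AtScale's implementation; the `facts` rows and function names are hypothetical.

```python
from collections import defaultdict

def build_aggregate(rows, dimension, measure):
    """Roll detailed rows up to one total per value of the chosen dimension."""
    aggregate = defaultdict(float)
    for row in rows:
        aggregate[row[dimension]] += row[measure]
    return dict(aggregate)

# Hypothetical detailed fact rows.
facts = [
    {"region": "east", "product": "a", "sales": 10.0},
    {"region": "east", "product": "b", "sales": 5.0},
    {"region": "west", "product": "a", "sales": 7.0},
]

# Built once (and rebuilt when the facts change), then reused by many queries.
sales_by_region = build_aggregate(facts, "region", "sales")

def total_sales(region):
    """Answer 'total sales for a region' from the aggregate, not the raw rows."""
    return sales_by_region.get(region, 0.0)

print(total_sales("east"))
```

Each lookup touches one entry instead of scanning every fact row, which is the trade being made: extra work and storage up front in exchange for fast repeated queries.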

AtScale

Fast interactive query
Full ANSI SQL compliance
Any BI tool
No data movement
Good user concurrency

It is a “middleware” layer running on an edge node that you need to maintain

In-Memory + Micro-Queries Approach

Zoomdata

Zoomdata – Micro-Queries + Spark

Patented technique for delivering fast visualization of large volumes of data
– Immediately displays a partial or approximate rendering which then becomes more accurate over time

Single logical query turned into a set of micro-queries executed in parallel
– Results from the first micro-query immediately displayed
– As the rest of the micro-queries complete, Zoomdata’s streaming architecture updates the visualization with new data until the full result set comes into focus
– Interact with the data while data is still being processed
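The general idea of refining an answer as micro-queries complete can be sketched with a Python generator. This is an illustration only, not Zoomdata's patented technique: the chunking scheme, the function names, and the sample `values` are assumptions, and the micro-queries here run one after another rather than in parallel.

```python
def micro_queries(rows, chunk_size):
    """Split one logical query's input into chunks, one per micro-query."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

def progressive_average(rows, chunk_size):
    """Yield a running estimate of the average after each micro-query completes."""
    total, count = 0.0, 0
    for chunk in micro_queries(rows, chunk_size):
        total += sum(chunk)
        count += len(chunk)
        yield total / count  # partial answer, available immediately

values = [4, 8, 6, 2, 10, 6]
estimates = list(progressive_average(values, chunk_size=2))
print(estimates)
```

The first estimate is available after a third of the data has been read, and the last one equals the exact average, which mirrors the "partial rendering that sharpens over time" behavior described above.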

Leverages Spark as internal in-memory database

Data Sharpening

Zoomdata

Fast interactive query
Full ANSI SQL compliance
Built-in visualization
No data movement
Good user concurrency

Runs on an edge node that you need to maintain
Not designed to work with other BI tools

Conclusion

The Holy Grail of sub-second queries and full SQL compliance against PB-scale datasets in Hadoop is not easy.

But through a combination of innovation at the core and vendor innovation … we are getting closer.

Summary Scorecard

Technology, Scale, Speed, SQL Compliance

Apache Hive on MapReduce

Apache Hive on Tez

Apache Hive on LLAP

Hive on Tez + LLAP (HDP 2.5)

Data Movement Work-Around

Apache HAWQ

Apache Impala (incubating)

AtScale

Zoomdata

Thank You

ifyfe@hortonworks.com

http://hortonworks.com

