Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop
Ian Fyfe, Director Product Marketing
August 2nd, 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Hadoop architecture vs. BI tool requirements
Hive on MapReduce
The Data Movement Work-Around
Hive on Tez and LLAP
In-Hadoop Databases: Apache HAWQ, Apache Impala
In-Memory: AtScale, Zoomdata
Conclusion and summary
Hortonworks
The company behind Apache Hadoop
– Hortonworks Data Platform (HDP)
– Hortonworks Data Flow (HDF)
Strong partnership with Pivotal
– Pivotal is converting Pivotal Hadoop Distribution customers to HDP
– Hortonworks is reselling Pivotal HDB subscriptions
Hadoop “Classic” Core Components
HDFS
– A distributed file system allowing massive storage across a cluster of commodity servers
MapReduce
– Framework for distributed computation; common use cases include aggregating, sorting, and filtering big data sets
– Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
– Massively scalable
– High latency
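The MapReduce model described above can be illustrated with a minimal single-process sketch. It assumes a hypothetical set of (region, amount) sales records already split into fragments. The names `map_fragment`, `shuffle`, `reduce_group`, and `run_job` are illustrative only and are not Hadoop APIs.

```python
from collections import defaultdict

# Hypothetical input: each fragment could live on a different node.
FRAGMENTS = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 1), ("north", 4)],
]

def map_fragment(fragment):
    """Map step: emit (key, value) pairs; needs no other fragment."""
    return [(region, amount) for region, amount in fragment]

def shuffle(mapped_fragments):
    """Group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for pairs in mapped_fragments:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce step: aggregate the values collected for one key."""
    return key, sum(values)

def run_job(fragments):
    # Each map call is independent, so it could run (or be re-run) on any node.
    mapped = [map_fragment(f) for f in fragments]
    grouped = shuffle(mapped)
    return dict(reduce_group(k, v) for k, v in grouped.items())

totals = run_job(FRAGMENTS)
```

Because each fragment is mapped in isolation, a failed fragment can simply be recomputed. In a real cluster the latency comes largely from job start-up and from materializing data between the phases, which this in-memory sketch does not model.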
Related Projects
Hive – a data warehouse infrastructure on top of Hadoop
– Implements a SQL-like query language, including a JDBC driver
HBase – the Hadoop database – AH HA!
– NoSQL database problematic for traditional BI
• Apache Phoenix provides SQL interface
– Best at storing large amounts of unstructured data!
– Not optimized for aggregate (BI style) queries
Unfortunately Hadoop wasn't originally designed for most BI requirements …
Low-latency queries at petabyte scale
Full ANSI SQL Compliance
… but the situation is rapidly improving!
Hive on MapReduce (Hive 1.0)
Apache Hive 1.0
Facilitates querying and managing large datasets. Data analysts use Hive to explore, structure and analyze that data using a SQL-like language called HiveQL
Hive on MapReduce
Pros
– Massively scalable to PB range
– Batch reporting & ETL
Cons
– High latency: queries in minutes or hours
– Limited ANSI SQL compliance
The Result of Hive on MR… The Data Movement Work-Around
The Data Movement Work-Around
ETL the data into traditional data marts or data warehouses, using an ETL tool or code
– Oracle, Vertica, Netezza, MySQL, etc.
Pros
– Low latency queries
– SQL compliance
Cons
– Currency (data freshness)
– Cost
– Complexity
Hive on Tez (Hive 2.0)
Hive on Tez
Apache Tez
– Alternate query framework to MapReduce which allows for a complex directed-acyclic-graph (DAG) of tasks for processing data
– Significantly faster than MapReduce
Pros
– Many times faster than Hive on MR
– PB scale
Cons
– Still quite high query latency for BI tools
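The DAG-of-tasks idea behind Tez can be sketched in a few lines. This is a minimal illustration, not Tez code: the query plan `PLAN` and the function `run_plan` are hypothetical, and the scheduling uses Python's standard `graphlib` module (Python 3.9+) rather than any Tez API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_sales": set(),
    "scan_stores": set(),
    "join": {"scan_sales", "scan_stores"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}

def run_plan(plan):
    """Walk the DAG in dependency order and return the execution waves."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs are all complete
        waves.append(ready)                 # tasks in one wave could run in parallel
        sorter.done(*ready)
    return waves

waves = run_plan(PLAN)
```

The point of the sketch is that the whole multi-stage plan is expressed as one graph, so independent tasks (here the two scans) can be scheduled together and each later stage starts as soon as its inputs are ready.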
Hive on LLAP (Hive 2.1)
HDP 2.5 is a Major Milestone for Hive
At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.
Major Improvements:
– Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
– Operations: LLAP integration in Ambari, new Grafana dashboards.
[Chart: Hive changes since HDP 2.4. From Hortonworks: 1391; from community: 642.]
Hive 2 Highlights
– Interactive Query with Hive LLAP
– SQL ACID Fully Supported
– 2x Faster ETL
Hive 2 with LLAP: Architecture Overview
[Diagram: ODBC/JDBC SQL queries arrive at HiveServer2; query coordinators (App Masters) hand work to LLAP daemons, each containing executors and an in-memory cache, all running in a YARN cluster on top of deep storage (HDFS, S3, and other HDFS-compatible filesystems).]
Hive 2 with LLAP: Preliminary Numbers
[Chart: Hive 2.0 and LLAP: TPC-DS at 10 TB scale, 18 nodes. Query time for TPC-DS queries q3, q7, q12, q13, q19, q21, q26, q27, q42, q43, q45, q52, q55, q60, q73, q84, q89, q91 and q98, comparing Hive 2.0 on Tez with LLAP. Minimum query time: query 55 at 2.38 s.]
Hive 2 with LLAP: Linear Scaling at 1TB: 8 nodes versus 16 nodes.
[Chart: Average Query Time by Concurrency. Time (s) against 8, 16 and 32 concurrent queries, 8-node versus 16-node cluster.]
[Chart: Maximum Query Time by Concurrency. Time (s) against 8, 16 and 32 concurrent queries, 8-node versus 16-node cluster.]
Hive 2 with LLAP Enables Interactive Query In Seconds
Pros
– Faster interactive query
– Faster ETL
– Expanded SQL compliance for BI tools (nearing SQL:2011)
– Enterprise Readiness: granular row & column level security
– Simplified Operations: LLAP integration with Ambari with automated dashboards
Cons
– TB scale datasets, not PB scale
Caution: Not All Hives Are Created Equal
Apache Hive in Hortonworks HDP
• Supports LLAP (with HDP 2.5)
• Supports Tez
• Supports ORC, Atlas, Ranger
• Supports Vectorization
• Supports In-Memory Computation
Apache Hive in Cloudera CDH
• Lacks LLAP Support
• Lacks Tez Support
• Lacks ORC Support
• Lacks Vectorization Support
• Lacks In-Memory Support
Note: I’ll talk about Impala later
In-Hadoop Databases
Apache HAWQ
Apache HAWQ (Pivotal/Hortonworks HDB)
Hadoop-native SQL query engine and advanced analytics MPP database that offers high-performance interactive ANSI SQL query execution and machine learning for Data Analysts & Data Scientists who want to find insights from large/complex datasets.
Hortonworks HDB, powered by Apache HAWQ
– Created by Pivotal, based on Greenplum core
– Resold as Hortonworks HDB
HAWQ Architecture
Apache Hive & HAWQ/HDB
Right Tool for the Job: Choose the right SQL engine based on your application’s needs.
Apache Hive
● Multiple subject areas
● Holds very detailed information
● Scale – Multiple Petabytes
● Integrates all data sources
● ETL, Reporting & BI
● Low-Mid Query Latency
Apache HAWQ / HDB
● Single Subject Mart
● Summarized information
● Scale – 100s TB
● Ad-hoc Analytics & Visualization
● Machine Learning
● Low Query Latency
In-Hadoop Databases
Apache Impala (incubating)
Apache Impala (Incubating)
Brings scalable massively parallel processing (MPP) database technology to Hadoop
– Circumvents MapReduce
– Directly accesses the data through a specialized distributed query engine
MapReduce data processing and interactive queries can be done on the same system using the same data and metadata
Uses metadata, ODBC driver, and SQL syntax from Apache Hive
5-User Testing Results: Industry Standard TPC-DS Queries
• HAWQ 30% faster
• Impala failed to complete 47% of the queries
* Queries that did not complete are omitted from results on both platforms
[Chart: query times in seconds for the 5-user test.]
[Chart: Impala Test Query Fails. Grid of TPC-DS queries 1 to 99, marked by failure reason: Unsupported SQL, Long running killed, Memory Limit Exceeded.]
Apache Impala
Pros
– Fast interactive query
– Re-uses Hive metadata and JDBC driver
Cons
– Incomplete ANSI SQL compliance
– User concurrency stability issues
– TB scale, not PB scale
– Vendor-specific security model (Cloudera)
Apache HAWQ vs. Apache Impala
Apache HAWQ / HDB
• Deep YARN Integration
• Best In Class Optimizer
• Full ANSI SQL Compliance
• Integrated Predictive Modeling
• Performance Advantage 30%-600% over Impala
Apache Impala
• No YARN Integration – Poor Cluster Utilization
• Incomplete SQL Support
• No Built-in Support for Predictive Modeling
• Poor concurrency
In-Memory Approach
AtScale
AtScale
Any BI Tool
No Data Movement
Single Semantic Layer
Turn Your Hadoop Cluster into Scale-Out OLAP Server
Resold by Hortonworks
AtScale Architecture – Leverages
Auto-builds & maintains aggregates in Spark
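The general idea of answering queries from maintained aggregates can be sketched as follows. This is a generic OLAP pre-aggregation illustration in plain Python, not AtScale's implementation and not Spark code; the fact rows and the functions `build_aggregate` and `revenue_by_region` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, month, revenue).
FACTS = [
    ("east", "2016-01", 100),
    ("east", "2016-02", 150),
    ("west", "2016-01", 80),
    ("west", "2016-02", 120),
    ("east", "2016-01", 50),
]

def build_aggregate(facts):
    """Pre-compute revenue per (region, month) once, ahead of query time."""
    agg = defaultdict(int)
    for region, month, revenue in facts:
        agg[(region, month)] += revenue
    return dict(agg)

def revenue_by_region(aggregate):
    """Answer a coarser query from the aggregate instead of rescanning facts."""
    totals = defaultdict(int)
    for (region, _month), revenue in aggregate.items():
        totals[region] += revenue
    return dict(totals)

AGGREGATE = build_aggregate(FACTS)
by_region = revenue_by_region(AGGREGATE)
```

The aggregate is smaller than the fact data (four rows instead of five here, and far smaller in practice), so a BI query that only needs region totals touches much less data. The cost is that the aggregate has to be rebuilt or updated when the facts change.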
AtScale
Pros
– Fast interactive query
– Full ANSI SQL Compliance
– Any BI tool
– No data movement
– Good user concurrency
Cons
– It is a “middleware” layer running on an edge node that you need to maintain
In-Memory + Micro-Queries Approach
Zoomdata
Zoomdata – Micro-Queries + Spark
Patented technique for delivering fast visualization of large volumes of data
– Immediately displays a partial or approximate rendering which then becomes more accurate over time
Single logical query turned into a set of micro-queries executed in parallel
– Results from the first micro-query immediately displayed
– As the rest of the micro-queries complete, Zoomdata’s streaming architecture updates the visualization with new data until the full result set comes into focus
– Interact with the data while data is still being processed
Leverages Spark as internal in-memory database
Data Sharpening
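The micro-query approach described above can be approximated with a small sketch. It is not Zoomdata's patented technique or its API: the partitioned data set and the functions `micro_query` and `progressive_average` are hypothetical, and a thread pool stands in for the parallel query engine.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical data set, pre-split into partitions (one micro-query each).
PARTITIONS = [
    [4, 8, 15],
    [16, 23],
    [42, 1, 2, 3],
]

def micro_query(partition):
    """Return a partial (sum, count) for one slice of the data."""
    return sum(partition), len(partition)

def progressive_average(partitions):
    """Yield a running average that sharpens as micro-queries complete."""
    total, count = 0, 0
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(micro_query, p) for p in partitions]
        for future in as_completed(futures):
            part_sum, part_count = future.result()
            total += part_sum
            count += part_count
            yield total / count  # current estimate; a UI could redraw here

estimates = list(progressive_average(PARTITIONS))
```

Each yielded value is an approximate answer based on the partitions finished so far. The intermediate estimates depend on completion order, but the last one always equals the exact average over the full data set.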
Zoomdata
Pros
– Fast interactive query
– Full ANSI SQL Compliance
– Built-in visualization
– No data movement
– Good user concurrency
Cons
– Runs on an edge node that you need to maintain
– Not designed to work with other BI tools
Conclusion
The Holy Grail of sub-second queries and full SQL compliance against PB-scale datasets in Hadoop is not easy.
But through a combination of innovation at the core and vendor innovation … we are getting closer.
Summary Scorecard
Columns: Technology, Scale, Speed, SQL Compliance
Apache Hive on MapReduce
Apache Hive on Tez
Apache Hive on LLAP
Hive on Tez + LLAP (HDP 2.5)
Data Movement Work-Around
Apache HAWQ
Apache Impala (incubating)
AtScale
Zoomdata
Thank You
http://hortonworks.com