Scalable Data Warehousing on Hadoop
Zsolt Fekete
Budapest Dataforum, 2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Scalable Data Warehousing on Hadoop
Hadoop Ecosystem
Hive
Solution Architecture
Apache Hadoop History
Google papers
– 2003, GFS distributed filesystem
– 2004, MapReduce computation model/system
– Key idea: distributed computation on commodity hardware
2006, Yahoo! implements Hadoop, makes it open source
– Hadoop = HDFS + MapReduce
Big Data hype starts
Apache Hive History – The Beginning
2007, Facebook started developing Hive: petabyte scale SQL over Hadoop
2008, Hive became open source
Designed for batch processing
Tables cannot be modified, no update, no delete
SQL compiled to MapReduce
Performance limitations of MapReduce
What do we expect from an EDW solution?
Scalable storage: HDFS
Fast, scalable SQL engine: Apache Hive
Security
– Authentication, Authorization: Kerberos, Apache Ranger, Apache Knox
– Encrypted storage, encrypted communication: HDFS TDE, wire encryption
– Data governance: Apache Atlas
BI, cubes, data science: Apache Spark, Apache Zeppelin, Druid
Monitoring, configuration, deployment: Apache Ambari
Data ingestion: Apache Sqoop, Apache Storm
Data Lifecycle management: Apache Falcon
Data Access
Example: Ranger, Per-User Row Filtering by Region in Hive
User ID | Region | Total Spend
1 | East | 5,131
2 | East | 27,828
3 | West | 55,493
4 | West | 7,193
5 | East | 18,193
Original Query:
SELECT * FROM CUSTOMERS
WHERE total_spend > 10000
Query Rewrites based on Dynamic Ranger Policies
User 2 (East Region), Dynamic Rewrite:
SELECT * FROM CUSTOMERS
WHERE total_spend > 10000
AND region = 'east'
User 1 (West Region), Dynamic Rewrite:
SELECT * FROM CUSTOMERS
WHERE total_spend > 10000
AND region = 'west'
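The per-user rewrite on this slide can be imitated in a few lines of Python. This is only a sketch of the idea, not how Ranger or Hive implement it: the policy dictionary `ROW_FILTER_BY_USER`, the user names `user1` and `user2`, and the helper functions are hypothetical, and SQLite stands in for Hive so the example runs anywhere.

```python
import sqlite3

# Hypothetical policy: the region each user is allowed to see.
# In Ranger this would be a row-filter policy, not application code.
ROW_FILTER_BY_USER = {"user1": "west", "user2": "east"}

# The query as the user wrote it (columns listed explicitly for the demo).
BASE_QUERY = (
    "SELECT user_id, region, total_spend FROM customers WHERE total_spend > ?"
)


def run_filtered(conn, user, min_spend):
    """Append the user's region predicate to the base query and run it."""
    region = ROW_FILTER_BY_USER[user]
    sql = BASE_QUERY + " AND region = ? ORDER BY user_id"
    return conn.execute(sql, (min_spend, region)).fetchall()


def make_demo_db():
    """Build an in-memory table holding the five rows from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (user_id INTEGER, region TEXT, total_spend INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "east", 5131),
            (2, "east", 27828),
            (3, "west", 55493),
            (4, "west", 7193),
            (5, "east", 18193),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(run_filtered(conn, "user2", 10000))  # east rows only
    print(run_filtered(conn, "user1", 10000))  # west rows only
```

Both users submit the same query text; only the appended predicate differs, which is the point of the slide.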
Hortonworks Data Platform (HDP)
Enterprise ready integration of 100% Open Source projects
HDFS, Hive, Ranger, Atlas, Knox, Sqoop, Ambari, Spark, Zeppelin, Druid, etc…
Why Apache Software Foundation?
Cloud solutions:
– Azure HDInsight
– Cloudbreak (for AWS, Azure, Google Cloud)
– Hortonworks Data Cloud (HDC) on AWS marketplace
Agenda
Scalable Data Warehousing on Hadoop
Hadoop Ecosystem
Hive
Solution Architecture
Apache Hive Story
May, 2013: release 0.11.0
– ORC format, Facebook 300+ PB
April, 2014: release 0.13.0
– Apache Tez, vectorization, up to 100x performance improvement
November, 2014: release 0.14.0
– ACID: insert, update, delete
June, 2016: release 2.1.0
– LLAP: Low Latency Analytical Processing
Hive ACID Production-Ready with HDP 2.5
Notable Improvements
Tested at multi-TB scale using TPC-H benchmark.
– Reliably ingest 400GB+ per day within a partition.
– 10TB+ raw data in a single partition.
– Simultaneous ingest, delete and query.
70+ stabilization improvements.
Supported:
– SQL INSERT, UPDATE, DELETE.
– Streaming API.
HDP-2.6: SQL MERGE under development (HIVE-10924).
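As a rough illustration of what a SQL MERGE expresses (update or delete rows that match on a key, insert rows that do not), here is a small Python sketch. It says nothing about how Hive executes MERGE; the dictionary-based "table", the `(key, value, deleted)` row shape, and the function name `merge` are assumptions made for the example.

```python
def merge(target, source_rows):
    """Apply MERGE-like semantics to a dict acting as a keyed table.

    target:      dict mapping key -> value (hypothetical target table)
    source_rows: iterable of (key, value, deleted) change records
    """
    for key, value, deleted in source_rows:
        if key in target:
            if deleted:
                # WHEN MATCHED and flagged as deleted: remove the row
                del target[key]
            else:
                # WHEN MATCHED: update the row
                target[key] = value
        elif not deleted:
            # WHEN NOT MATCHED: insert the row
            target[key] = value
    return target


if __name__ == "__main__":
    table = {1: "a", 2: "b"}
    changes = [(1, "a2", False), (2, None, True), (3, "c", False)]
    print(merge(table, changes))
```

A single pass over the change records covers all three outcomes, which is why MERGE suits continuous ingestion from an operational database.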
[Figure: Query Time versus Data Size. Series: Runtime for All Queries (s), Total Compressed Data. Axes: Time (s); data size from 0 MB to 5 TB; dates 5.24.16 to 6.1.16.]
[Figure: Times for Inserts and Deletes. Series: time_insert_lineitem, time_insert_orders, time_delete_lineitem, time_delete_orders. Axes: Time (s); dates 5.23.16 to 6.1.16.]
Apache Hive: Journey to SQL:2011 Analytics
Data Types | SQL Features | File Formats | HDP 2.6
Numeric | Core SQL Features | Columnar | ACID MERGE
FLOAT, DOUBLE | Date, Time and Arithmetical Functions | ORCFile | Multi Subquery
DECIMAL | INNER, OUTER, CROSS and SEMI Joins | Parquet | Scalar Subqueries
INT, TINYINT, SMALLINT, BIGINT | Derived Table Subqueries | Text | Non-Equijoins
BOOLEAN | Correlated + Uncorrelated Subqueries | CSV | INTERSECT / EXCEPT
String | UNION ALL | Logfile |
CHAR, VARCHAR | UDFs, UDAFs, UDTFs | Nested / Complex | Recursive CTEs
BLOB (BINARY), CLOB (String) | Common Table Expressions | Avro | NOT NULL Constraints
Date, Time | UNION DISTINCT | JSON | Default Values
DATE, TIMESTAMP, Interval Types | Advanced Analytics | XML | Multi Table Transactions
Complex Types | OLAP and Windowing Functions | Custom Formats |
ARRAY / MAP / STRUCT / UNION | OLAP: Partition, Order by UDAF | Other Features |
Nested Data Analytics | CUBE and Grouping Sets | XPath Analytics |
Nested Data Traversal | ACID Transactions | |
Lateral Views | INSERT / UPDATE / DELETE | |
Procedural Extensions | Constraints | |
HPL/SQL | Primary / Foreign Key (Non Validated) | |
Legend: HDP 2.5; HDP 2.6; Projected: HDP 3.0
Track Hive SQL:2011 Complete: HIVE-13554
Hive 2 with LLAP: Architecture Overview
[Diagram components]
ODBC / JDBC: SQL Queries
HiveServer2 (Query Endpoint)
Query Coordinators: three Coordinators
YARN Cluster: four LLAP Daemons, each with Query Executors
In-Memory Cache (Shared Across All Users)
Deep Storage: HDFS and Compatible (S3, WASB, Isilon)
Hive 2 with LLAP: 7x Performance Boost at 10 TB Scale
[Figure: HDP 2.5 with LLAP: 7x Performance Improvement Across All Query Types (10 TB, 10x d2.8xlarge EC2 Nodes, TPC-DS Queries). Series: Runtime (s), Improvement versus HDP 2.4. Axes: Query Runtime (s); Improvement Vs. HDP 2.5 (Ratio); one bar group per TPC-DS query.]
Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
Hive 2 with LLAP averages 26x faster than Hive 1
[Figure: Series: Hive 1 / Tez Time (s), Hive 2 / LLAP Time (s), Speedup (x Factor). Axes: Query Time (s) (Lower is Better); Speedup (x Factor).]
Apache Hive vs. Apache Impala at 10TB
Highlights
10TB scale on 10 identical AWS nodes.
Hive and Impala showed similar times on most smaller queries.
Hive scaled better, with many queries completing in under 2 minutes where Impala ran to timeout (3000s).
Apache Hive vs. Presto on a partitioned 1TB dataset.
Highlights
Presto lacks basic performance optimizations like dynamic partition pruning.
On a real dataset / workload Presto performs poorly without full re-writes.
Example: Query 55 without re-writes = 185.17s, with re-writes = 16s. LLAP = 1.37s.
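Dynamic partition pruning, named on this slide, can be pictured with a toy model: evaluate the filter on the small dimension table first, then read only the fact-table partitions whose keys survive. The sketch below is conceptual and does not reflect Hive's or Presto's internals; the tables `SALES_PARTITIONS` and `DATE_DIM` and both function names are made up for the example.

```python
# Hypothetical fact table, stored as partitions keyed by date_key.
SALES_PARTITIONS = {
    20160101: [("widget", 10), ("gadget", 5)],
    20160102: [("widget", 7)],
    20160201: [("gadget", 3)],
    20160202: [("widget", 1), ("gadget", 2)],
}

# Hypothetical dimension table: date_key -> month name.
DATE_DIM = {20160101: "jan", 20160102: "jan", 20160201: "feb", 20160202: "feb"}


def total_without_pruning(month):
    """Visit every partition and keep only those whose date matches the month."""
    scanned, total = 0, 0
    for date_key, rows in SALES_PARTITIONS.items():
        scanned += 1
        if DATE_DIM[date_key] == month:
            total += sum(qty for _, qty in rows)
    return total, scanned


def total_with_pruning(month):
    """Filter the dimension first, then read only the matching partitions."""
    wanted = {key for key, m in DATE_DIM.items() if m == month}
    scanned, total = 0, 0
    for date_key in wanted:
        rows = SALES_PARTITIONS.get(date_key, [])
        scanned += 1
        total += sum(qty for _, qty in rows)
    return total, scanned


if __name__ == "__main__":
    print(total_without_pruning("feb"))  # (total quantity, partitions visited)
    print(total_with_pruning("feb"))
```

Both functions return the same total; the second visits fewer partitions, which is where the savings on a partitioned dataset come from.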
Apache Hive: Fast Facts
Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
Largest Cluster: 4,500+ Nodes (Yahoo)
Roadmap At A Glance
Scalable DW in Hadoop
Fast BI
HDP 2.5 (GA): LLAP Technical Preview (25x performance improvements); Primary Key / Foreign Key
HDP 2.6 (GA): LLAP GA; Vectorized Decimal; SSD Cache; Cache Text Data in LLAP
Beyond HDP 2.6: Materialized Views; Druid tables as Hive Indexes for fast drill-down; Fine-Grained Resource Management
SQL / EDW
HDP 2.5 (GA): OLAP Improvements: Multi partition and ordering keys, order by aggregations
HDP 2.6 (GA): ACID MERGE; SQL: Cross Product, Multi Subquery, TPC-DS Complete
Beyond HDP 2.6: Column NOT NULL / Defaults; Surrogate Key Generation; Multi-Statement Transactions; Improved HPL/SQL; Better Unicode support
Cloud
• Cloud Templates for ETL and Presentation Layers
• LLAP Template for Hortonworks Data Cloud
• Full ACID support for S3 / WASB
• Replication / DR
Operations
• Grafana Dashboards
• Hive View: DBA Tooling
• Tez UI: Hive-Oriented Search
• Activity Monitoring.
• Schema Recommendations.
Current
What Is Apache Hive Now?
Apache Hive is a SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud.
Features:
• Extensive SQL:2011 Support
• ACID Transactions
• In-Memory Caching
• Cost-Based Optimizer
• User-Based Dynamic Security
• Replication and Disaster Recovery
• JDBC and ODBC Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale
Agenda
Scalable Data Warehousing on Hadoop
Hadoop Ecosystem
Hive
Solution Architecture
Typical Legacy EDW Implementations
Before Connected Data Platforms
Typical Legacy EDW Implementations
End state post EDW Optimization
Scalable Data Warehousing on Hadoop
Capabilities: Batch SQL; OLAP / Cube; Interactive SQL; Sub-Second SQL; ACID / MERGE
Applications
– Batch SQL: ETL; Reporting; Data Mining; Deep Analytics
– OLAP / Cube: Multidimensional Analytics; MDX Tools; Excel
– Interactive SQL: Reporting; BI Tools: Tableau, Microstrategy, Cognos
– Sub-Second SQL: Ad-Hoc; Drill-Down; BI Tools: Tableau, Excel
– ACID / MERGE: Continuous Ingestion from Operational DBMS; Slowly Changing Dimensions
Legend: Existing; Development; Emerging
Core
– Platform: Scale-Out Storage; Petabyte Scale Processing
– Core SQL Engine: Apache Tez: Scalable Distributed Processing; Advanced Cost-Based Optimizer
– Connectivity: JDBC / ODBC; MDX
– Advanced Security
– Comprehensive SQL:2011 Coverage
New: Hortonworks EDW Optimization Solution
Hortonworks EDW Optimization Solution makes analytics on Hadoop easier than ever
Source Data Systems
Syncsort: High-Performance Data Movement
– High performance data import from all major EDW platforms
Hadoop: Scalable Storage and Compute
Hive LLAP: High Performance SQL Data Mart
– Fast, scalable SQL analytics
– Intelligent in-memory caching
AtScale Intelligence Platform: OLAP Cubes for Higher Performance
– Define OLAP cubes for 10x faster queries
– Unified semantic layer for all BI tools
Pre-aggregated data ... Or, full-fidelity data
Accelerate Analytics with AtScale
• Analyze data directly in HDP.
• Use any BI Tool.
• Unified Semantic Layer.
• Support directly from Hortonworks.
Overview: Syncsort DMX-h
• Simple drag-and-drop ETL pipelines.
• Connects to all major data sources in addition to Hadoop.
• Integrated with Ranger; Atlas integration in development.
Adopting Hadoop EDW solution
New technology is needed for data processing that fits better in the Hadoop ecosystem
– Unstructured data
– Computing inverted indexes
– Etc…
Archiving data
– HDFS is a low cost storage solution
– On par with tape backup solutions
Keep more data
– Longer time window
– No need to reduce data
Move cold data from EDW to Hadoop
Adopting Hadoop EDW solution
Offload ETL jobs from the current EDW
– Save CPU in the existing EDW deployment, focus it on the truly critical tasks
After adopting Hadoop storage
– Possible to add new data sources
Analysis is still possible
– Hive LLAP
– AtScale
– Integration with BI tools
– Druid
Thanks for the Attention!
Questions?