The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem
Alan Gates
Co-Founder
Hortonworks
@alanfgates
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Our Hadoop Journey Begins…
[Diagram: cluster nodes 1 through N]
HDFS
MapReduce
Batch apps
2006
Our Hadoop Journey: Ecosystem Innovation Accelerates
2006 2011 Today
6 Years of Apache Hive and Beyond
• Apache Hive becomes a Top-Level Project
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching
2010 2011 2012 2013 2014 2015 2016
A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale
Hive 2 with LLAP: Architecture Overview
[Architecture diagram components]
• ODBC / JDBC SQL Queries
• HiveServer2 (Query Endpoint)
• Query Coordinators
• YARN Cluster running LLAP Daemons, each with Query Executors and an In-Memory Cache
• Deep Storage: HDFS, S3 + Other HDFS Compatible Filesystems
Hive 2 with LLAP: 25+x Performance Boost
Hive 2 with LLAP averages 26x faster than Hive 1
[Chart: Hive 1 / Tez Time (s), Hive 2 / LLAP Time (s), and Speedup (x Factor) for query43.sql, query73.sql, query63.sql, query3.sql, query7.sql, query89.sql, query34.sql, query42.sql, query27.sql, query52.sql, query55.sql, query13.sql, query79.sql, query98.sql, query19.sql. Left axis: Query Time (s) (Lower is Better), 0 to 250. Right axis: Speedup (x Factor), 0 to 50.]
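The speedup factor plotted on the chart is a ratio of query times. The sketch below shows two common ways to summarize such ratios; the slide does not say which one produced the 26x figure. The timings and query names here are hypothetical placeholders, not the measurements from the chart.

```python
from statistics import mean

# Hypothetical timings in seconds: (Hive 1 / Tez, Hive 2 / LLAP).
# These are illustrative values only, NOT the numbers behind the chart.
timings = {
    "query_a": (120.0, 4.0),
    "query_b": (60.0, 3.0),
    "query_c": (200.0, 10.0),
}


def speedups(timings):
    """Per-query speedup factor: old time divided by new time."""
    return {name: old / new for name, (old, new) in timings.items()}


def mean_speedup(timings):
    """Arithmetic mean of the per-query speedup factors."""
    return mean(speedups(timings).values())


def aggregate_speedup(timings):
    """Ratio of total old runtime to total new runtime."""
    total_old = sum(old for old, _ in timings.values())
    total_new = sum(new for _, new in timings.values())
    return total_old / total_new
```

The two summaries generally differ: the mean of ratios weights every query equally, while the ratio of totals is dominated by the longest-running queries.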
What’s new in Spark 2.0?
API Improvements
– SparkSession – new entry point
– Unified DataFrame & Dataset API
– Structured Streaming/Continuous Application
Performance Improvements
– Tungsten Phase 2 – Whole-stage code generation
ML
– ML model persistence
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
SparkSQL
– SQL 2003 support (new ANSI SQL parser, subquery support)
How to Secure and Govern Access to Your Data?
Policies: Classification, Prohibition, Time, Location
?
Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
Secure and Govern Your Data with Tag-Based Access Policies
Policies: Classification, Prohibition, Time, Location
Ranger: Manage Access Policies and Audit Logs
– PDP
– Resource Cache
– Atlas Client: Subscribes to Topic, Gets Metadata Updates
Atlas: Track Metadata and Lineage
– Metastore: Tags, Assets, Entities
Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
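To make the idea of tag-based policies concrete, here is a minimal sketch of how a decision point might combine the four policy types on the slide (classification, prohibition, time, location). It does not use the Ranger or Atlas APIs. Every name in it (TagPolicy, AccessRequest, is_allowed, the tags and groups) is hypothetical, and the deny-by-default semantics are an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TagPolicy:
    """A hypothetical policy attached to a tag rather than to a resource."""
    tag: str                                       # classification, e.g. "PII"
    allowed_groups: frozenset                      # groups granted access
    denied_groups: frozenset = frozenset()         # prohibition
    expires_at: Optional[datetime] = None          # time: policy ends here
    allowed_locations: Optional[frozenset] = None  # location; None means any


@dataclass
class AccessRequest:
    user_groups: set
    location: str
    when: datetime


def evaluate(policy, request):
    """Return 'deny', 'allow', or 'not_applicable' for a single policy."""
    if policy.expires_at is not None and request.when >= policy.expires_at:
        return "not_applicable"  # an expired policy has no effect
    if request.user_groups & policy.denied_groups:
        return "deny"
    if (policy.allowed_locations is not None
            and request.location not in policy.allowed_locations):
        return "not_applicable"
    if request.user_groups & policy.allowed_groups:
        return "allow"
    return "not_applicable"


def is_allowed(entity_tags, policies, request):
    """Deny by default: need at least one allow and no deny."""
    decisions = [evaluate(p, request) for p in policies if p.tag in entity_tags]
    return "deny" not in decisions and "allow" in decisions


# Hypothetical tag catalog: entity name -> tags (the role Atlas plays above).
tags_by_entity = {
    "hive:sales.customers": {"PII"},
    "hdfs:/logs/web": set(),
}

policies = [
    TagPolicy(
        tag="PII",
        allowed_groups=frozenset({"compliance"}),
        denied_groups=frozenset({"contractors"}),
        expires_at=datetime(2030, 1, 1),
        allowed_locations=frozenset({"US"}),
    )
]

request = AccessRequest({"compliance"}, "US", datetime(2025, 1, 1))
allowed = is_allowed(tags_by_entity["hive:sales.customers"], policies, request)
```

Because the policy is keyed on the tag, tagging a new table or file as "PII" in the catalog is enough for the same rule to apply to it, with no per-resource policy to write.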
Data In Motion
Constrained, High-latency, Localized context
Hybrid – cloud/on-premises, Low-latency, Global context
SOURCES, REGIONAL INFRASTRUCTURE, CORE INFRASTRUCTURE
Our Hadoop Journey: From the Data Center to the Cloud!
2006 Today
Why Hadoop in the Cloud?
Unlimited Elastic Scale
Ephemeral & Long-Running
IT & Business Agility
No Upfront HW Costs ($0)
Key Architectural Considerations for Hadoop in the Cloud
Shared Data & Storage
On-Demand Ephemeral Workloads
Elastic Resource Management
Shared Metadata, Security & Governance
Shared Data and Storage
Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR
Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage
Enhance Performance via Caching
Tabular Data: LLAP Read + Write-thru Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & Write-through cache
• Security: Column-level and row-level
HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache
[Diagram: Workloads, Cache (LLAP R/W Tables, HDFS Files), Cloud Storage]
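The caching behavior described above (read-through, write-through, column-level granularity, spill to SSD when memory is full) can be sketched in a few lines. This is an illustration of the general technique, not LLAP's implementation. The class name, the least-recently-used eviction choice, and the dictionaries standing in for SSD and cloud storage are all assumptions.

```python
from collections import OrderedDict


class WriteThroughCache:
    """Toy two-tier cache keyed by (table, column), so only needed columns load."""

    def __init__(self, backing_store, memory_capacity):
        self.backing_store = backing_store  # dict standing in for cloud storage
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()         # least recently used entry first
        self.ssd = {}                       # dict standing in for local SSD

    def _put_in_memory(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        self.ssd.pop(key, None)             # drop any stale spilled copy
        while len(self.memory) > self.memory_capacity:
            # Memory is full: spill the least recently used entry to SSD.
            old_key, old_value = self.memory.popitem(last=False)
            self.ssd[old_key] = old_value

    def read(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.ssd:
            value = self.ssd.pop(key)       # promote from SSD back to memory
        else:
            value = self.backing_store[key]  # cache miss: fetch from storage
        self._put_in_memory(key, value)
        return value

    def write(self, key, value):
        # Write-through: storage is updated before the cached copy.
        self.backing_store[key] = value
        self._put_in_memory(key, value)


# Hypothetical usage with column-level keys.
store = {("sales", "amount"): [10, 20]}
cache = WriteThroughCache(store, memory_capacity=2)
cache.write(("sales", "region"), ["US", "DE"])
cache.write(("sales", "day"), [1, 2])
amount = cache.read(("sales", "amount"))    # miss; spills ("sales", "region")
```

Writing to storage first keeps the shared data lake authoritative, which matters when several ephemeral clusters read the same tables.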
Prescriptive On-Demand Ephemeral Workloads
On-Demand Ephemeral Workloads
• Data Science: R/W Tables, Compute Fabric
• ETL: R/W Tables, Compute Fabric
• Warehouse: R/W Tables, Compute Fabric
• Search: R/W Tables, Compute Fabric
Shared Data Requires Shared Metadata, Security, and Governance
Shared Metadata Across All Workloads
• Metadata considerations
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data
• Access / tag-based policies and audit logs
• Centrally stored to facilitate use across clusters
– Ex. backed by Cloud RDS (or shared DB)
[Diagram: Policies (Classification, Prohibition, Time, Location) applied through Shared Metadata to Streams, Pipelines, Feeds, Tables, Files, Objects]
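The "add upon ingest, update as processing modifies data" lifecycle can be illustrated with a small in-memory registry. This is a sketch only: a shared deployment would back it with a central database, as the slide suggests. The class and method names are hypothetical, and carrying tags from inputs to outputs is an assumption made for the example, not something the slide states.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    name: str
    tags: set = field(default_factory=set)
    lineage: list = field(default_factory=list)  # entries: (job, inputs)


class SharedMetadataStore:
    """Hypothetical in-memory stand-in for a centrally stored metadata service."""

    def __init__(self):
        self.records = {}

    def register_on_ingest(self, name, source, tags=()):
        """Add metadata at ingest time, recording where the data came from."""
        record = DatasetRecord(name, set(tags), [("ingest", (source,))])
        self.records[name] = record
        return record

    def record_processing(self, job, inputs, output):
        """Update metadata when a job derives `output` from `inputs`."""
        record = self.records.setdefault(output, DatasetRecord(output))
        for input_name in inputs:
            # Assumption: tags such as "PII" carry over to derived data.
            record.tags |= self.records[input_name].tags
        record.lineage.append((job, tuple(inputs)))
        return record

    def upstream(self, name):
        """Everything `name` was derived from, directly or transitively."""
        seen = set()
        stack = [name]
        while stack:
            record = self.records.get(stack.pop())
            if record is None:
                continue  # external source with no record of its own
            for _job, inputs in record.lineage:
                for input_name in inputs:
                    if input_name not in seen:
                        seen.add(input_name)
                        stack.append(input_name)
        return seen


# Hypothetical usage.
metadata = SharedMetadataStore()
metadata.register_on_ingest("raw_events", "edge-feed", tags={"PII"})
metadata.record_processing("etl_clean", ["raw_events"], "clean_events")
```

Because every cluster reads and writes the same store, a tag applied at ingest remains visible to policies no matter which ephemeral workload later touches the derived data.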
Elastic Resource Management in Context of Workload
Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data-access (ex., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
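As a rough illustration of sizing resources to a workload SLA while staying pricing-aware, the sketch below picks the cheapest instance type that can finish a job in time. The instance names, prices, and throughput figures are hypothetical. The model assumes work divides evenly across nodes and ignores data and bandwidth costs, which a real planner would need to account for.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceOption:
    name: str
    hourly_price: float
    work_units_per_hour: float
    interruptible: bool  # e.g. spot capacity that can be reclaimed


def nodes_needed(work_units, sla_hours, units_per_node_hour):
    """Smallest node count that completes the work within the SLA."""
    return math.ceil(work_units / (sla_hours * units_per_node_hour))


def plan(work_units, sla_hours, options, allow_interruptible):
    """Return (instance name, node count, cost) for the cheapest valid option."""
    candidates = [o for o in options if allow_interruptible or not o.interruptible]
    if not candidates:
        raise ValueError("no instance option satisfies the constraints")
    best = None
    for option in candidates:
        count = nodes_needed(work_units, sla_hours, option.work_units_per_hour)
        cost = count * option.hourly_price * sla_hours
        if best is None or cost < best[2]:
            best = (option.name, count, cost)
    return best


def scaling_delta(current_nodes, target_nodes):
    """Positive means add nodes, negative means remove them."""
    return target_nodes - current_nodes


# Hypothetical catalog and workload.
options = [
    InstanceOption("spot-worker", 0.10, 50.0, interruptible=True),
    InstanceOption("reserved-worker", 0.25, 50.0, interruptible=False),
]
etl_plan = plan(1000, 2, options, allow_interruptible=True)
warehouse_plan = plan(1000, 2, options, allow_interruptible=False)
```

Setting allow_interruptible per workload reflects the slide's distinction between workload management and cluster management: a restartable ETL job can take cheaper spot capacity, while a latency-sensitive warehouse workload may not tolerate interruption.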
Transformational Applications Require Connected Data
[Diagram labels: DATA CENTER, CLOUD, Data at Rest, Data in Motion, Edge Data, Deep Historical Analysis, Stream Analytics, Machine Learning, Edge Analytics]
Thank You