Big Data Spain Keynote, November 2016

The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem
Alan Gates, Co-Founder, Hortonworks
@alanfgates

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Our Hadoop Journey Begins…

Diagram: a cluster of nodes 1 through N running HDFS and MapReduce for batch apps (2006)


Our Hadoop Journey: Ecosystem Innovation Accelerates

Timeline: 2006, 2011, Today


6 Years of Apache Hive and Beyond

• Apache Hive becomes a Top-Level Project
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching

2010 2011 2012 2013 2014 2015 2016

A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud

• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale


Hive 2 with LLAP: Architecture Overview

Diagram components:
• HiveServer2 (Query Endpoint), receiving ODBC / JDBC SQL queries
• Query Coordinators
• YARN Cluster of LLAP Daemons, each with Query Executors and an In-Memory Cache
• Deep Storage: HDFS, S3 + other HDFS-compatible filesystems


Hive 2 with LLAP: 25+x Performance Boost

Chart: query time in seconds (lower is better) for Hive 1 / Tez and Hive 2 / LLAP, with the speedup (x factor) per query, for query43.sql, query73.sql, query63.sql, query3.sql, query7.sql, query89.sql, query34.sql, query42.sql, query27.sql, query52.sql, query55.sql, query13.sql, query79.sql, query98.sql and query19.sql.

Hive 2 with LLAP averages 26x faster than Hive 1


What’s new in Spark 2.0?

API Improvements
– SparkSession – new entry point
– Unified DataFrame & Dataset API
– Structured Streaming / Continuous Applications

Performance Improvements
– Tungsten Phase 2 – Whole-stage code generation

ML
– ML model persistence
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)

SparkSQL
– SQL 2003 support (new ANSI SQL parser, subquery support)


How to Secure and Govern Access to Your Data?

Policies: Classification, Prohibition, Time, Location

Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables


Secure and Govern Your Data with Tag-Based Access Policies

Policies: Classification, Prohibition, Time, Location

Ranger: Manage Access Policies and Audit Logs (PDP, Resource Cache)

Atlas: Track Metadata and Lineage (Metastore: Tags, Assets, Entities)

Atlas Client: Subscribes to Topic, Gets Metadata Updates

Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
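The idea of attaching policies to tags rather than to individual tables or files can be illustrated with a small sketch. This is not the Ranger or Atlas API; `TagPolicy`, `ENTITY_TAGS`, `POLICIES` and `is_allowed` are hypothetical names, and the entities, groups and dates are made up. The sketch assumes a request is denied when any tag on the entity carries a policy that the request fails, and it leaves out resource-based policies and audit logging entirely.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy attached to a classification tag."""
    tag: str
    allowed_groups: frozenset                 # classification / prohibition
    valid_until: Optional[date] = None        # time-based rule
    denied_countries: frozenset = frozenset() # location-based rule


# Hypothetical tag catalog: entity name -> tags (the role Atlas plays).
ENTITY_TAGS = {
    "hive:sales.customers": {"PII"},
    "hdfs:/data/clickstream": set(),
}

# Hypothetical policy store: tag -> policy (the role Ranger plays).
POLICIES = {
    "PII": TagPolicy(
        tag="PII",
        allowed_groups=frozenset({"compliance"}),
        valid_until=date(2016, 12, 31),
        denied_countries=frozenset({"XX"}),
    ),
}


def is_allowed(entity, group, country, today, policies, entity_tags):
    """Return False if any tag policy on the entity rejects the request."""
    for tag in entity_tags.get(entity, set()):
        policy = policies.get(tag)
        if policy is None:
            continue  # tag has no policy attached in this sketch
        if group not in policy.allowed_groups:
            return False
        if policy.valid_until is not None and today > policy.valid_until:
            return False
        if country in policy.denied_countries:
            return False
    return True


print(is_allowed("hive:sales.customers", "compliance", "ES",
                 date(2016, 11, 17), POLICIES, ENTITY_TAGS))
```

Because the policy is keyed by tag, tagging a new Hive table or HDFS file as `PII` in the catalog is enough for the same rule to apply to it, with no per-resource policy edit.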


Data In Motion

Sources: constrained, high-latency, localized context

Regional infrastructure

Core infrastructure: hybrid (cloud/on-premises), low-latency, global context


Our Hadoop Journey: From the Data Center to the Cloud!

Timeline: 2006, Today


Why Hadoop in the Cloud?

Unlimited Elastic Scale

Ephemeral & Long-Running

IT & Business Agility

No Upfront HW Costs ($0)


Key Architectural Considerations for Hadoop in the Cloud

Shared Data & Storage

On-Demand Ephemeral Workloads


Elastic Resource Management

Shared Metadata, Security & Governance


Shared Data and Storage

Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR

Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage


Enhance Performance via Caching

Tabular Data: LLAP Read + Write-through Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Security: column-level and row-level

HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache

Diagram: workloads access LLAP R/W tables and HDFS files through a cache that sits in front of cloud storage.
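The caching behavior listed on this slide (cache only the columns that are read, spill to SSD when memory is full, write through to storage) can be sketched in a few lines. This is an illustrative toy, not LLAP's implementation: `ColumnCache` is a hypothetical class, plain dictionaries stand in for cloud storage and the local SSD, and least-recently-used eviction is an assumption, since the slide does not name an eviction policy.

```python
from collections import OrderedDict


class ColumnCache:
    """Toy column-level read/write-through cache with an SSD spill tier."""

    def __init__(self, store, memory_slots):
        self.store = store            # dict standing in for cloud storage
        self.memory = OrderedDict()   # in-memory tier, least recently used first
        self.ssd = {}                 # dict standing in for local SSD
        self.memory_slots = memory_slots

    def _put_memory(self, key, values):
        self.memory[key] = values
        self.memory.move_to_end(key)
        # When memory is full, spill the least recently used column to SSD.
        while len(self.memory) > self.memory_slots:
            old_key, old_values = self.memory.popitem(last=False)
            self.ssd[old_key] = old_values

    def read(self, table, column):
        key = (table, column)
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.ssd:
            values = self.ssd.pop(key)   # promote from SSD back to memory
        else:
            values = self.store[key]     # fetch only the requested column
        self._put_memory(key, values)
        return values

    def write(self, table, column, values):
        key = (table, column)
        self.store[key] = values         # write-through: storage is updated first
        self.ssd.pop(key, None)          # drop any stale spilled copy
        self._put_memory(key, values)


store = {("t", "a"): [1], ("t", "b"): [2], ("t", "c"): [3]}
cache = ColumnCache(store, memory_slots=2)
for col in ("a", "b", "c"):
    cache.read("t", col)
print(list(cache.memory), list(cache.ssd))
```

Keying the cache by (table, column) rather than by file is what lets a query that touches two columns of a wide table avoid pulling the rest from cloud storage.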


Prescriptive On-Demand Ephemeral Workloads

On-Demand Ephemeral Workloads
• Data Science: R/W Tables, Compute Fabric
• ETL: R/W Tables, Compute Fabric
• Warehouse: R/W Tables, Compute Fabric
• Search: R/W Tables, Compute Fabric


Shared Data Requires Shared Metadata, Security, and Governance

Shared Metadata Across All Workloads

Metadata considerations
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data

Access / tag-based policies and audit logs

Centrally stored to facilitate use across clusters
– Ex. backed by Cloud RDS (or shared DB)

Diagram: Policies (Classification, Prohibition, Time, Location) and Shared Metadata cover Streams, Pipelines, Feeds, Tables, Files and Objects.


Elastic Resource Management in Context of Workload

Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data access (e.g., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
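As a rough illustration of sizing resources to a workload SLA in a pricing-aware way, the sketch below estimates how many nodes a workload needs and which instance types to draw them from. It is a simplification under stated assumptions: `nodes_needed` and `plan_purchase` are hypothetical helpers, work is assumed to divide evenly across nodes, reserved instances are treated as already paid for, and the on-demand instance type and all prices are illustrative additions that do not appear on the slide.

```python
import math


def nodes_needed(pending_work_units, units_per_node_hour, sla_hours):
    """Smallest node count that finishes the pending work within the SLA."""
    return math.ceil(pending_work_units / (units_per_node_hour * sla_hours))


def plan_purchase(needed, reserved_available, spot_price, on_demand_price):
    """Use reserved instances first, then the cheaper of spot and on-demand."""
    from_reserved = min(needed, reserved_available)
    remaining = needed - from_reserved
    kind = "spot" if spot_price < on_demand_price else "on_demand"
    return {"reserved": from_reserved, kind: remaining}


# Hypothetical ETL workload: 1000 work units, 50 units per node-hour, 2-hour SLA.
needed = nodes_needed(1000, 50, 2)
plan = plan_purchase(needed, reserved_available=4,
                     spot_price=0.10, on_demand_price=0.30)
print(needed, plan)
```

A real scheduler would also account for data and bandwidth charges and for the risk of spot instances being reclaimed, both of which this sketch ignores.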


Transformational Applications Require Connected Data

Diagram labels: Data Center, Cloud, Data in Motion, Data at Rest, Edge Data, Deep Historical Analysis, Stream Analytics, Machine Learning, Edge Analytics

Thank You