Big data spain keynote nov 2016

The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem Alan Gates Co-Founder Hortonworks @alanfgates

Transcript of Big data spain keynote nov 2016

Page 1: Big data spain keynote nov 2016

The Enterprise and Connected Data: Trends in the Apache Hadoop Ecosystem

Alan Gates, Co-Founder, Hortonworks

@alanfgates

Page 2: Big data spain keynote nov 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Our Hadoop Journey Begins…

[Diagram: cluster nodes 1 through N]

HDFS

MapReduce

Batch apps

2006

Page 3: Big data spain keynote nov 2016

Our Hadoop Journey: Ecosystem Innovation Accelerates

2006 2011 Today

Page 4: Big data spain keynote nov 2016

6 Years of Apache Hive and Beyond

• Apache Hive becomes a Top-Level Project

• HiveServer2 adds ODBC/JDBC

• SQL breadth expands with windowing and more

• Apache Tez enters incubation

• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support

• Standard SQL authorization, integration with Apache Ranger

• ACID transactions introduced

• Governance added with Apache Atlas integration

• Hive 2 introduces LLAP and intelligent in-memory caching

2010 2011 2012 2013 2014 2015 2016

A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud

• Extensive SQL:2011 Support

• Compatible with every major BI Tool

• Proven at 300+ PB Scale

Page 5: Big data spain keynote nov 2016

Hive 2 with LLAP: Architecture Overview

[Diagram: ODBC/JDBC SQL queries enter through HiveServer2 (query endpoint) and are handled by query coordinators, which run work on LLAP daemons in a YARN cluster. Each LLAP daemon has query executors and an in-memory cache. Deep storage: HDFS, S3, and other HDFS-compatible filesystems.]

Page 6: Big data spain keynote nov 2016

Hive 2 with LLAP: 25+x Performance Boost

Hive 2 with LLAP averages 26x faster than Hive 1

[Chart: Hive 1 / Tez time (s) and Hive 2 / LLAP time (s), lower is better, with speedup (x factor), for query43, query73, query63, query3, query7, query89, query34, query42, query27, query52, query55, query13, query79, query98, and query19.]

Page 7: Big data spain keynote nov 2016

What’s new in Spark 2.0?

API Improvements
– SparkSession – new entry point
– Unified DataFrame & Dataset API
– Structured Streaming / Continuous Applications

Performance Improvements
– Tungsten Phase 2 – whole-stage code generation

ML
– ML model persistence
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)

SparkSQL
– SQL 2003 support (new ANSI SQL parser, subquery support)

Page 8: Big data spain keynote nov 2016

Page 9: Big data spain keynote nov 2016

How to Secure and Govern Access to Your Data?

Policies: Classification, Prohibition, Time, Location

?

Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables

Page 10: Big data spain keynote nov 2016

Secure and Govern Your Data with Tag-Based Access Policies

Policies: Classification, Prohibition, Time, Location

Ranger: Manage Access Policies and Audit Logs (PDP, Resource Cache)

Atlas Client: Subscribes to Topic, Gets Metadata Updates

Atlas: Track Metadata and Lineage

Metastore: Tags, Assets, Entities

Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
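The idea of attaching policies to tags rather than to individual tables or files can be sketched in a few lines of Python. This is only an illustration of the concept, not Ranger's or Atlas's API: the `TagPolicy` class, its fields, and the `check_access` function are all hypothetical names, and the rule that a prohibition overrides any grant is an assumption made for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy attached to a tag rather than to a resource."""
    tag: str                                        # classification, e.g. "PII"
    allowed_groups: frozenset = frozenset()         # groups granted access
    denied_groups: frozenset = frozenset()          # prohibition (assumed to win)
    expires_at: Optional[datetime] = None           # time-based condition
    allowed_locations: Optional[frozenset] = None   # location-based condition


def is_allowed(policy, group, location, now):
    """Evaluate one policy for one request; a prohibition overrides a grant."""
    if group in policy.denied_groups:
        return False
    if policy.expires_at is not None and now >= policy.expires_at:
        return False
    if policy.allowed_locations is not None and location not in policy.allowed_locations:
        return False
    return group in policy.allowed_groups


def check_access(entity_tags, policies, group, location, now):
    """Allow only if every tag on the entity has a policy that permits the request.

    An entity with no tags is allowed in this sketch; a real deployment
    would fall back to other policies instead.
    """
    by_tag = {p.tag: p for p in policies}
    for tag in entity_tags:
        policy = by_tag.get(tag)
        if policy is None or not is_allowed(policy, group, location, now):
            return False
    return True


# Illustrative values only.
pii_policy = TagPolicy(
    tag="PII",
    allowed_groups=frozenset({"compliance"}),
    denied_groups=frozenset({"contractors"}),
    expires_at=datetime(2017, 1, 1),
    allowed_locations=frozenset({"EU"}),
)
now = datetime(2016, 11, 17)
print(check_access({"PII"}, [pii_policy], "compliance", "EU", now))  # True
print(check_access({"PII"}, [pii_policy], "compliance", "US", now))  # False
```

Because the policy is keyed on the tag, tagging a new Hive table, HDFS file, or HBase table as "PII" brings it under the same rule without writing a new policy for that resource.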

Page 11: Big data spain keynote nov 2016

Data In Motion

• Constrained, High-latency, Localized context

• Hybrid (cloud/on-premises), Low-latency, Global context

Sources, Regional Infrastructure, Core Infrastructure

Page 12: Big data spain keynote nov 2016

Our Hadoop Journey: From the Data Center to the Cloud!

2006 Today

Page 13: Big data spain keynote nov 2016

Why Hadoop in the Cloud?

Unlimited Elastic Scale

Ephemeral & Long-Running

IT & Business Agility

No Upfront HW Costs: $0

Page 14: Big data spain keynote nov 2016

Key Architectural Considerations for Hadoop in the Cloud

Shared Data & Storage

On-Demand Ephemeral Workloads

Elastic Resource Management

Shared Metadata, Security & Governance

Page 15: Big data spain keynote nov 2016

Shared Data and Storage

Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR

Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage

Page 16: Big data spain keynote nov 2016

Enhance Performance via Caching

Tabular Data: LLAP Read + Write-thru Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & Write-through cache
• Security: Column-level and row-level

HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache

[Diagram: Workloads; Cache (LLAP for R/W Tables, HDFS for Files); Cloud Storage]
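The read and write-through behavior described on this slide can be illustrated with a small Python sketch. It is not LLAP's or HDFS's implementation: the `WriteThroughCache` class is hypothetical, a plain dict stands in for cloud storage, keys are `(table, column)` pairs to mimic caching only the needed columns, and a full cache simply evicts the least recently used entry instead of spilling to SSD.

```python
from collections import OrderedDict


class WriteThroughCache:
    """Minimal read + write-through cache in front of a slow backing store.

    `store` stands in for cloud storage; any dict-like object works.
    """

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.cache = OrderedDict()   # least recently used entry first
        self.hits = 0
        self.misses = 0

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)
            return self.cache[key]
        self.misses += 1
        value = self.store[key]      # go to the backing store only on a miss
        self._remember(key, value)
        return value

    def write(self, key, value):
        self.store[key] = value      # write-through: update the store first
        self._remember(key, value)   # then the cached copy, so reads stay fresh


# Illustrative data only: keys are (table, column) pairs.
cloud = {("sales", "amount"): [10, 20], ("sales", "region"): ["EU", "US"]}
cache = WriteThroughCache(cloud, capacity=1)
cache.read(("sales", "amount"))                   # miss: fetched from the store
cache.read(("sales", "amount"))                   # hit: served from memory
cache.write(("sales", "region"), ["EU", "EU"])    # store and cache both updated
```

With write-through, the backing store always holds the latest value, so a cache entry can be evicted at any time without losing data.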

Page 17: Big data spain keynote nov 2016

Prescriptive On-Demand Ephemeral Workloads

• Data Science: R/W Tables, Compute Fabric
• ETL: R/W Tables, Compute Fabric
• Warehouse: R/W Tables, Compute Fabric
• Search: R/W Tables, Compute Fabric

Page 18: Big data spain keynote nov 2016

Shared Data Requires Shared Metadata, Security, and Governance

Shared Metadata Across All Workloads
• Metadata considerations
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data
• Access / tag-based policies and audit logs
• Centrally stored to facilitate use across clusters
– Ex. backed by Cloud RDS (or shared DB)

Policies: Classification, Prohibition, Time, Location

Shared Metadata: Streams, Pipelines, Feeds, Tables, Files, Objects

Page 19: Big data spain keynote nov 2016

Elastic Resource Management in Context of Workload

Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data access (e.g., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
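One way to picture SLA-driven, pricing-aware scaling is the short Python sketch below. Everything in it is an assumption made for illustration: the function names, the notion of "work units", the rule that reserved instances are used first, and the on-demand fallback for workloads that cannot tolerate interruption (the slide itself names only spot and reserved instances).

```python
import math


def nodes_needed(pending_work_units, units_per_node_per_hour, sla_hours):
    """Smallest node count that clears the pending work within the SLA window."""
    capacity_per_node = units_per_node_per_hour * sla_hours
    return max(1, math.ceil(pending_work_units / capacity_per_node))


def plan_scaling(current_nodes, required_nodes, reserved_nodes, tolerates_interruption):
    """Decide how many nodes to add or remove and which instance types to use.

    Reserved instances are used first; the remainder goes to spot instances
    when the workload tolerates interruption, otherwise to on-demand.
    """
    extra = max(0, required_nodes - reserved_nodes)
    return {
        "delta": required_nodes - current_nodes,   # positive: add, negative: remove
        "reserved": min(required_nodes, reserved_nodes),
        "spot": extra if tolerates_interruption else 0,
        "on_demand": 0 if tolerates_interruption else extra,
    }


# Illustrative numbers only.
required = nodes_needed(pending_work_units=1000, units_per_node_per_hour=50, sla_hours=2)
plan = plan_scaling(current_nodes=4, required_nodes=required,
                    reserved_nodes=6, tolerates_interruption=True)
print(required, plan)
```

The point of the sketch is that the scaling decision starts from the workload's SLA rather than from cluster utilization, and only then picks instance types by price.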

Page 20: Big data spain keynote nov 2016

Transformational Applications Require Connected Data

[Diagram: Data in Motion and Data at Rest in both the DATA CENTER and the CLOUD, with Edge Data, Edge Analytics, Stream Analytics, Deep Historical Analysis, and Machine Learning.]

Page 21: Big data spain keynote nov 2016

Thank You