Munich HUG 21.11.2013

Transcript of Munich HUG 21.11.2013

Page 1: Munich HUG 21.11.2013

© Hortonworks Inc. 2013 - Confidential

Hortonworks: We Do Hadoop.

Our mission is to enable your Modern Data Architecture by delivering One Enterprise Hadoop

November 2013

Page 1

Page 2: Munich HUG 21.11.2013

Agenda

• Hortonworks Overview of Tez – Quick and painless

• A driver for Tez: The Stinger Initiative

• Tez Deep Dive

• Demo

Page 3: Munich HUG 21.11.2013

A Brief History of Apache Hadoop

[Timeline figure, 2004 to 2013. Milestone labels: Apache Project Established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform]

• 2005 (Focus on INNOVATION): Hadoop created at Yahoo!

• 2008 (Focus on OPERATIONS): Yahoo team extends focus to operations to support multiple projects & growing clusters

• 2011 (STABILITY): Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo

Page 4: Munich HUG 21.11.2013

Our Mission: Enable your Modern Data Architecture by delivering One Enterprise Hadoop

Our Commitment

• Innovate in the Open: We employ the core architects and operators of Hadoop and drive innovation through open source Apache Foundation projects to avoid vendor lock-in

• Certify for the Enterprise: We engineer, test and certify the Hortonworks Data Platform for enterprise usage and deliver the highest quality of support

• Interoperate with the Ecosystem: We work with partners to deeply integrate Hadoop with key technologies so you can leverage existing skills and investments

Headquarters: Palo Alto, CA
Employees: 240+ and growing
Customers: 120+ and growing
Investors: Benchmark, Index, Yahoo, Dragoneer, Tenaya

Trusted Partners with:

Page 5: Munich HUG 21.11.2013

Goal: Interoperable and Familiar

[Architecture diagram. Layers: APPLICATIONS, DATA SYSTEMS, SOURCES. Labels: RDBMS, EDW, MPP; HANA; BusinessObjects BI; OPERATIONAL TOOLS; DEV & DATA TOOLS; INFRASTRUCTURE; Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)]

Page 6: Munich HUG 21.11.2013

Betting on Hortonworks…

Teradata Portfolio for Hadoop

• Seamless data access between Teradata and Hadoop (SQL-H)

• Simple management & monitoring with Viewpoint integration

• Flexible deployment options

HDInsight & HDP for Windows

• Only Hadoop Distribution for Windows Azure & Windows Server

• Native integration with SQL Server, Excel, and System Center

• Extends Hadoop to .NET community

Complete Portfolio for Hadoop

Appliances

Instant Access + Infinite Scale

• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP

• Enables analytics apps (BOBJ) to interact with Hadoop

Page 7: Munich HUG 21.11.2013

Hortonworks Approach to Enterprise Hadoop

Identify and introduce enterprise requirements into the public domain

Work with the community to advance and incubate open source projects

Apply Enterprise Rigor to provide the most stable and reliable distribution

Community Driven Enterprise Apache Hadoop

Page 8: Munich HUG 21.11.2013

Driving Hadoop Innovation

[Chart: Total Net Lines Contributed to Apache Hadoop. Values shown: 147,933 lines; 614,041 lines; 449,768 lines. Only the label “End Users” survives in the transcript]

[Chart: Total Number of Committers to Apache Hadoop, 63 total. Yahoo: 10; Cloudera: 7; IBM: 3; Facebook: 5; LinkedIn: 3; 10 Others; one segment of 21 with no surviving label]

Hortonworks engineers focus on making Apache Hadoop an enterprise-viable platform that powers modern data architectures and deeply integrates with existing data center technologies

Page 9: Munich HUG 21.11.2013

HDP: Enterprise Hadoop Platform

Hortonworks Data Platform (HDP)

• The ONLY 100% open source and complete platform

• Integrates full range of enterprise-ready services

• Certified and tested at scale

• Engineered for deep ecosystem interoperability

[Diagram: HORTONWORKS DATA PLATFORM (HDP), running on OS/VM, Cloud or Appliance. Layers: HADOOP CORE, PLATFORM SERVICES, DATA SERVICES, OPERATIONAL SERVICES. Components shown: HDFS, YARN, MAPREDUCE, TEZ, HIVE & HCATALOG, PIG, HBASE, LOAD & EXTRACT, SQOOP, FLUME, NFS, WebHDFS, KNOX*, OOZIE, AMBARI, FALCON*]

Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots

Page 10: Munich HUG 21.11.2013

Hortonworks: The Value of “Open” for You

• Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community

• Avoid Vendor Lock: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in

• The partners you rely on, rely on Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments

• Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use

• Support from the experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience

Page 11: Munich HUG 21.11.2013

SQL-in-Hadoop with Apache Hive

• Apache Hive is the standard for SQL interaction with Hadoop
  – Enterprises make the final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%)
  – Most applications claim Hive compatibility TODAY*

• Stinger Initiative: Simple Focus
  – Performance
  – SQL Compatibility

* Claims publicly made by: Teradata, Microsoft, Oracle, Microstrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho

[Diagram labels: Business Analytics, Custom Apps, SQL, Hive, Tez, MapReduce, YARN, HDFS, Hadoop]

Improves existing tools & preserves investments

Page 12: Munich HUG 21.11.2013

Stinger Initiative Goals

• Enables Hive to support interactive workloads

• Improves existing tools & preserves investments

[Diagram: Query Planner (Hive) + Execution Engine (Tez) + File Format (ORC file) = 100X+; Data Types + Windowing & Subqueries = SQL Compatible]

Page 13: Munich HUG 21.11.2013

Stinger: Hive For All Analytics

[Diagram: workloads Enterprise Reports, Dashboard / Scorecard, Parameterized Reports, Visualization, Data Mining; axis labels Interactive and Batch]

100X Faster + SQL Compatible

Page 14: Munich HUG 21.11.2013

Stinger Roadmap

• Subqueries for IN, NOT IN, HAVING

• Datatypes: CHAR, VARCHAR, DATETIME

• Improvements to DECIMAL datatype

• Integration with Tez and Tez Service

• Vectorization Preview

• Intelligent Optimizer

• Column Statistics

• Authentication and Authorization Enhancements

• Full vector query

• Join optimizations

• ORCFile

• SQL:2003 windowing functions

Page 15: Munich HUG 21.11.2013

Stinger: Some early Results

• Query Engine Work ONLY

• Uses TPC “style” benchmark

• Just a few weeks of work

• OTHER work coming

Page 16: Munich HUG 21.11.2013

Apache Tez: Accelerating Hadoop Query Processing

Page 17: Munich HUG 21.11.2013

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Based on expressing a computation as a dataflow graph.

• Built on top of YARN – the resource management framework for Hadoop.

• Open source Apache incubator project and Apache licensed.

Page 18: Munich HUG 21.11.2013

Old School Hadoop: MapReduce

Page 19: Munich HUG 21.11.2013

Fundamentals of YARN

• The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
  – a global ResourceManager
  – a per-application ApplicationMaster
  – a per-node slave NodeManager
  – a per-application Container running on a NodeManager

Page 20: Munich HUG 21.11.2013

New School Hadoop with YARN

Page 21: Munich HUG 21.11.2013

Tez – Design Themes

• Empowering End Users

• Execution Performance

Page 22: Munich HUG 21.11.2013

Tez – Empowering End Users

• Expressive dataflow definition APIs

• Flexible Input-Processor-Output runtime model

• Data type agnostic

• Simplifying deployment

Page 23: Munich HUG 21.11.2013

Tez – Empowering End Users

• Expressive dataflow definition APIs
  – Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
  – Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.

[Diagram: dataflow graph of tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2]

Page 24: Munich HUG 21.11.2013

Tez – Empowering End Users

• Expressive dataflow definition APIs

[Diagram: Distributed Sort. Stages: Preprocessor Stage (Task-1, Task-2), Partition Stage (Task-1, Task-2), Aggregate Stage (Task-1, Task-2). A Sampler receives Samples and produces Ranges]

Page 25: Munich HUG 21.11.2013

Tez – Empowering End Users

• Flexible Input-Processor-Output runtime model
  – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
  – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.

[Diagram: three example task compositions]

• IntermediateReduce: ShuffleInput -> ReduceProcessor -> FileSortedOutput

• FinalReduce: ShuffleInput -> ReduceProcessor -> HDFSOutput

• PairwiseJoin: Input1, Input2 -> JoinProcessor -> FileSortedOutput
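The Input-Processor-Output idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `ComposedTask`, `IpoDemo`, `sum` and `runSumTask` are hypothetical names invented here, and none of them belong to the Tez API. The sketch shows one processor being wired to an input and an output that the caller chooses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical types for illustration only; this is NOT the Tez API.
// A task is just an input, a processor and an output wired together.
final class ComposedTask<I, O> {
    private final Supplier<List<I>> input;
    private final Function<List<I>, List<O>> processor;
    private final Consumer<List<O>> output;

    ComposedTask(Supplier<List<I>> input,
                 Function<List<I>, List<O>> processor,
                 Consumer<List<O>> output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }

    // Pull all records from the input, process them, push results to the output.
    void run() {
        output.accept(processor.apply(input.get()));
    }
}

final class IpoDemo {
    // A reduce-style processor: sums its records into a single value.
    static List<Integer> sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return List.of(total);
    }

    // Wires an in-memory input and output around the sum processor and
    // returns whatever the output collected.
    static List<Integer> runSumTask(List<Integer> records) {
        List<Integer> sink = new ArrayList<>();
        ComposedTask<Integer, Integer> task =
                new ComposedTask<Integer, Integer>(() -> records, IpoDemo::sum, sink::addAll);
        task.run();
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(runSumTask(List.of(1, 2, 3)));
    }
}
```

Swapping the in-memory sink for a different `Consumer` changes where the results go without touching the processor, which is the kind of reuse the slide describes.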

Page 26: Munich HUG 21.11.2013

Tez – Empowering End Users

• Data type agnostic
  – Tez is only concerned with the movement of data. Files and streams of bytes.
  – Does not impose any data format on the user application. MR application can use Key-Value pairs on top of Tez. Hive and Pig can use tuple oriented formats that are natural and native to them.

[Diagram: a Tez Task moves Bytes in and out as File or Stream; User Code sees Key Value pairs or Tuples]

Page 27: Munich HUG 21.11.2013

Tez – Empowering End Users

• Simplifying deployment
  – Tez is a completely client side application.
  – No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
  – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
  – Leverages YARN local resources.

[Diagram: two Client Machines, each with a TezClient; Tez Lib 1 and Tez Lib 2 stored on HDFS; TezTasks running on NodeManagers]

Page 28: Munich HUG 21.11.2013

Tez – Empowering End Users

• Expressive dataflow definition APIs

• Flexible Input-Processor-Output runtime model

• Data type agnostic

• Simplifying usage

With great power APIs come great responsibilities

Tez is a framework on which end user applications can be built

Page 29: Munich HUG 21.11.2013

Tez – Execution Performance

• Performance gains over Map Reduce

• Optimal resource management

• Plan reconfiguration at runtime

• Dynamic physical data flow decisions

Page 30: Munich HUG 21.11.2013

Tez – Execution Performance

• Performance gains over Map Reduce
  – Eliminate replicated write barrier between successive computations.
  – Eliminate job launch overhead of workflow jobs.
  – Eliminate extra stage of map reads in every workflow job.
  – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.

[Diagram: Pig/Hive on MR compared with Pig/Hive on Tez]

Page 31: Munich HUG 21.11.2013

Tez – Execution Performance

• Optimal resource management
  – Reuse YARN containers to launch new tasks.
  – Reuse YARN containers to enable shared objects across tasks.

[Diagram: the Tez Application Master runs in one YARN Container and exchanges Start Task / Task Done messages with a TezTask Host in another YARN Container; the host runs TezTask1 and TezTask2, which use Shared Objects]
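A minimal sketch of container reuse with shared objects, assuming a simplified in-process model: `ReusableContainer`, `ContainerReuseDemo` and the `lookupTable` key are hypothetical names made up for illustration and are not Tez classes. It shows two tasks running in the same long-lived host while an expensive object is built only once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch, not Tez code: a long-lived container that runs
// several tasks one after another and lets them share objects.
final class ReusableContainer {
    private final Map<String, Object> sharedObjects = new HashMap<>();
    private int objectsBuilt = 0;
    private int tasksRun = 0;

    // Returns the shared object for this key, building it only on first use.
    Object getShared(String key, Supplier<Object> builder) {
        if (!sharedObjects.containsKey(key)) {
            sharedObjects.put(key, builder.get());
            objectsBuilt++;
        }
        return sharedObjects.get(key);
    }

    // Runs a task inside this container instead of starting a new one.
    void runTask(Consumer<ReusableContainer> task) {
        task.accept(this);
        tasksRun++;
    }

    int objectsBuilt() { return objectsBuilt; }

    int tasksRun() { return tasksRun; }
}

final class ContainerReuseDemo {
    // Runs two tasks that need the same lookup table in one container.
    // Returns {tasks run, times the table was built}.
    static int[] runTwoTasks() {
        ReusableContainer container = new ReusableContainer();
        Consumer<ReusableContainer> task =
                c -> c.getShared("lookupTable", () -> new int[] {1, 2, 3});
        container.runTask(task);   // first task builds the table
        container.runTask(task);   // second task reuses it
        return new int[] {container.tasksRun(), container.objectsBuilt()};
    }

    public static void main(String[] args) {
        int[] result = runTwoTasks();
        System.out.println("tasks run: " + result[0] + ", table built: " + result[1]);
    }
}
```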

Page 32: Munich HUG 21.11.2013

Tez – Execution Performance

• Plan reconfiguration at runtime
  – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
  – Advanced changes in dataflow graph structure.
  – Progressive graph construction in concert with user optimizer.

[Diagram: inputs HDFS Blocks and YARN Resources. Initial plan: Stage 1 with 50 maps and 100 partitions, Stage 2 with 100 reducers. Reconfigured plan: Stage 1 with 50 maps and 100 partitions, Stage 2 changed from 100 to 10 reducers because there is only 10 GB of data]

Page 33: Munich HUG 21.11.2013

Tez – Execution Performance

• Dynamic physical data flow decisions
  – Decide the type of physical byte movement and storage on the fly.
  – Store intermediate data on distributed store, local store or in-memory.
  – Transfer bytes via blocking files or streaming and the spectrum in between.

[Diagram: decided At Runtime. A Producer with small output passes data In-Memory to the Consumer; otherwise the Producer passes data through a Local File to the Consumer]

Page 34: Munich HUG 21.11.2013

Tez – Deep Dive – API

Simple DAG definition API

// Create the DAG and one vertex per processing stage.
DAG dag = new DAG();
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
// ...

// Edge arguments: producer, consumer, data movement, data source,
// scheduling, output class, input class.
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
// ...

// Register the vertices and edges with the DAG.
dag.addVertex(map1).addVertex(map2)
   .addVertex(reduce1).addVertex(reduce2)
   .addVertex(join1)
   .addEdge(edge1).addEdge(edge2)
   .addEdge(edge3).addEdge(edge4);

[Diagram: map1 -> reduce1 and map2 -> reduce2, then reduce1 and reduce2 -> join1. Edge labels: Scatter_Gather, Bipartite, Sequential]

Page 35: Munich HUG 21.11.2013

Tez – Deep Dive – API

Edge properties define the connection between producer and consumer vertices in the DAG

• Data movement – Defines routing of data between tasks
  – One-To-One: Data from the ith producer task routes to the ith consumer task.
  – Broadcast: Data from a producer task routes to all consumer tasks.
  – Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.

• Scheduling – Defines when a consumer task is scheduled
  – Sequential: Consumer task may be scheduled after a producer task completes.
  – Concurrent: Consumer task must be co-scheduled with a producer task.

• Data source – Defines the lifetime/reliability of a task output
  – Persisted: Output will be available after the task exits. Output may be lost later on.
  – Persisted-Reliable: Output is reliably stored and will always be available.
  – Ephemeral: Output is available only while the producer task is running.
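The three data movement types can be made concrete with a small routing sketch. Everything here is an assumption for illustration: `EdgeRouting`, `consumersOf` and `shardFor` are hypothetical names rather than Tez API, and hashing the key to pick a shard is just one plausible way to scatter records, since the slides do not say how shards are chosen.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Tez API: which consumer tasks receive data
// from a given producer task under each data movement type.
final class EdgeRouting {
    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Returns the indexes of the consumer tasks fed by producer task `producer`.
    static List<Integer> consumersOf(DataMovement type, int producer, int numConsumers) {
        List<Integer> targets = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // The ith producer feeds only the ith consumer.
                targets.add(producer);
                break;
            case BROADCAST:
            case SCATTER_GATHER:
                // Broadcast sends the full output to every consumer.
                // Scatter-gather also reaches every consumer, but each one
                // gets only its own shard (shard i goes to consumer i).
                for (int c = 0; c < numConsumers; c++) {
                    targets.add(c);
                }
                break;
        }
        return targets;
    }

    // Scatter step (assumed hash partitioning): picks the shard, and so the
    // consumer task, for a record key.
    static int shardFor(String key, int numConsumers) {
        return Math.floorMod(key.hashCode(), numConsumers);
    }

    public static void main(String[] args) {
        System.out.println(consumersOf(DataMovement.ONE_TO_ONE, 2, 4));
        System.out.println(consumersOf(DataMovement.BROADCAST, 0, 3));
        System.out.println(shardFor("a", 4));
    }
}
```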

Page 36: Munich HUG 21.11.2013

Tez – Deep Dive – Scheduling

[Diagram: vertices map1 and reduce1. Each Start vertex goes to the Vertex Scheduler, which issues Start tasks; the DAG Scheduler answers Get Priority; the Task Scheduler answers Get container]

• Vertex Scheduler: Determines when tasks in a vertex can start

• DAG Scheduler: Determines priority of task

• Task Scheduler: Allocates containers from YARN and assigns them to tasks

Page 37: Munich HUG 21.11.2013

Tez – Deep Dive – Task Execution

[Diagram: the Task Attempt (logical in AM) starts a container with env, cmd line and resources. The Task Attempt (real on machine) runs in the Task JVM, issues Get Task, and hosts Input, Processor and Output. Data Information and Data Events flow between the two]

• Start task shell with user specified env, resources etc.

• Fetch and instantiate Input, Processor, Output objects

• Receive (incremental) input information and process the input

• Provide output information

Page 38: Munich HUG 21.11.2013

Tez – Sessions

• The amount of work programmed into a script/query may not be doable within a single Tez DAG.

Page 39: Munich HUG 21.11.2013

Tez – Sessions

• Even better performance gains may be achieved through caching within the session, either in the AM or in a container

Page 40: Munich HUG 21.11.2013

Tez – Automatic Reduce Parallelism

[Diagram: inside the App Master, the Map Vertex sends Data Size Statistics to the Vertex Manager of the Reduce Vertex; the Vertex State Machine then performs Set Parallelism, Cancel Task and Re-Route]

• Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.

• Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
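One way such pluggable logic could turn data statistics into a parallelism decision is sketched below. This is a hedged illustration, not Tez's actual vertex manager: `ReduceParallelismAdvisor`, its methods and the "target size per reducer" rule are assumptions introduced here. Sizes are in megabytes to keep the arithmetic readable.

```java
// Hypothetical sketch, not Tez code: accumulates map output sizes and
// advises how many reducers are actually needed.
final class ReduceParallelismAdvisor {
    private final int configuredReducers;   // parallelism from the original plan
    private final long targetPerReducer;    // assumed desired input per reducer
    private long observedSize = 0;

    ReduceParallelismAdvisor(int configuredReducers, long targetPerReducer) {
        if (configuredReducers < 1 || targetPerReducer < 1) {
            throw new IllegalArgumentException("both arguments must be >= 1");
        }
        this.configuredReducers = configuredReducers;
        this.targetPerReducer = targetPerReducer;
    }

    // Called once per data-size statistics event sent by a map task.
    void onMapOutputSize(long size) {
        observedSize += size;
    }

    // Ceiling of observed size over the target, clamped to [1, configured].
    int adviseParallelism() {
        long needed = (observedSize + targetPerReducer - 1) / targetPerReducer;
        return (int) Math.max(1, Math.min(configuredReducers, needed));
    }
}

final class ReduceParallelismDemo {
    // Feeds `maps` events of `sizePerMap` each and returns the advice.
    static int advise(int configured, long targetPerReducer, int maps, long sizePerMap) {
        ReduceParallelismAdvisor advisor =
                new ReduceParallelismAdvisor(configured, targetPerReducer);
        for (int i = 0; i < maps; i++) {
            advisor.onMapOutputSize(sizePerMap);
        }
        return advisor.adviseParallelism();
    }

    public static void main(String[] args) {
        // 50 maps of 200 MB each, 100 configured reducers, 1024 MB per reducer.
        System.out.println(advise(100, 1024, 50, 200));
    }
}
```

With 50 maps producing roughly 10 GB in total and a 1 GB target, the sketch advises 10 reducers instead of the configured 100, in the spirit of the plan reconfiguration example earlier in the deck.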

Page 41: Munich HUG 21.11.2013

Tez – Reduce Slow Start/Pre-launch

[Diagram: inside the App Master, the Map Vertex sends Task Completed events to the Vertex Manager of the Reduce Vertex; the Vertex State Machine then performs Start Tasks / Start]

• Event Model: Map completion events sent to the Reduce Vertex Manager.

• Vertex Manager: Pluggable user logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed so that shuffle can start.
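The slow-start decision can be sketched as a small event handler. The threshold rule used here (start reducers once a fixed fraction of maps has completed) is an assumption chosen for simplicity; the slide only says the logic understands the data size. `SlowStartManager` and `SlowStartDemo` are hypothetical names, not Tez classes.

```java
// Hypothetical sketch, not Tez code: decides when reducers may be launched
// based on map completion events.
final class SlowStartManager {
    private final int totalMaps;
    private final int mapsNeeded;        // completions required before starting
    private int completedMaps = 0;
    private boolean reducersStarted = false;

    SlowStartManager(int totalMaps, double startFraction) {
        if (totalMaps < 1 || startFraction < 0.0 || startFraction > 1.0) {
            throw new IllegalArgumentException("invalid slow-start settings");
        }
        this.totalMaps = totalMaps;
        this.mapsNeeded = (int) Math.ceil(startFraction * totalMaps);
    }

    // Handles one map completion event. Returns true exactly once: on the
    // event that crosses the threshold and should trigger reducer launch.
    boolean onMapCompleted() {
        if (completedMaps < totalMaps) {
            completedMaps++;
        }
        if (!reducersStarted && completedMaps >= mapsNeeded) {
            reducersStarted = true;
            return true;
        }
        return false;
    }
}

final class SlowStartDemo {
    // Returns the number of completion events seen when reducers start,
    // or -1 if they never start.
    static int eventsUntilStart(int totalMaps, double startFraction) {
        SlowStartManager manager = new SlowStartManager(totalMaps, startFraction);
        for (int event = 1; event <= totalMaps; event++) {
            if (manager.onMapCompleted()) {
                return event;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(eventsUntilStart(10, 0.5));
    }
}
```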

Page 42: Munich HUG 21.11.2013

Tez – Current status

• Apache Incubator Project
  – Rapid development. Over 330 JIRAs opened. Over 220 resolved.
  – Growing community.

• Focus on stability
  – Testing and quality are highest priority.
  – Working on Tez+YARN to fix basic performance overheads.
  – Code ready and deployed on multi-node environments.

• DAG of MR processing is working
  – Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
  – Working Hive prototype that can target Tez for execution of queries (HIVE-4660).
  – Work started on prototype of Pig that can target Tez.

Page 43: Munich HUG 21.11.2013

Tez – Current status

[Diagram: Typical pattern in a TPC-DS query. A Fact Table is joined with Dimension Table 1 to give Result Table 1, then with Dimension Table 2 to give Result Table 2, then with Dimension Table 3 to give Result Table 3]

[Diagram: Optimization for small data sets. A Fact Table with three dimension tables, each labeled “Dimension Table 1” on the slide]

Both can now run as a single Tez job

Page 44: Munich HUG 21.11.2013

Tez – MRR Performance

TPC-DS Query 12 with Hive on Tez: Elapsed Time (seconds)

                        Traditional Map-Reduce    Tez Map-Reduce-Reduce
RC File, Scale 200               55                        35
ORC File, Scale 200              54                        34
RC File, Scale 1000              75                        55
ORC File, Scale 1000             65                        46

Page 45: Munich HUG 21.11.2013

Tez – Roadmap

• Full DAG support
  – Multi-way input and output.
  – Other graph connection patterns.

• Performance optimizations
  – Container reuse
  – Cross task shared resources
  – Using HDFS data caching

• Runtime plan optimizations
  – Automatic input (map) parallelism
  – Automatic aggregation (reduce) parallelism

• Usability
  – Stability and testability
  – Recovery and history

Page 46: Munich HUG 21.11.2013

Tez – Community

• Early adopters and contributors welcome
  – Adopters to drive more scenarios. Contributors to make them happen.
  – Hive and Pig communities are on-board and making great progress: HIVE-4660 and PIG-3446

• Stay tuned for Tez meetups with deep dives on Tez architecture and using Tez
  – http://www.meetup.com/Apache-Tez-User-Group

• Useful links
  – Work tracking: https://issues.apache.org/jira/browse/TEZ
  – Code: https://github.com/apache/incubator-tez
  – Developer list: [email protected]
  – User list: [email protected]
  – Issues list: [email protected]

Page 47: Munich HUG 21.11.2013

Tez – Takeaways

• Distributed execution framework that works on computations represented as dataflow graphs

• Naturally maps to execution plans produced by query optimizers

• Execution architecture designed to enable dynamic performance optimizations at runtime

• Open source Apache project – your use-cases and code are welcome

• It works and is already being used by Hive

Page 48: Munich HUG 21.11.2013

Tez

Tez: https://github.com/apache/tez.git

Demo: https://github.com/t3rmin4t0r/tez-autobuild

Thanks for your time and attention!

Questions?