Hadoop in the Enterprise - README

Post on 12-Feb-2022

© Hortonworks Inc. 2013

Hadoop in the Enterprise

Jeff Markham, Technical Director, APAC, Hortonworks

Modern Architecture with Hadoop 2

Hadoop Wave ONE: Web-scale Batch Apps

[Figure: technology adoption lifecycle curve. Horizontal axis: time. Vertical axis: relative % of customers. Source: Geoffrey Moore, Crossing the Chasm]

Adopter segments, left to right:
•  Innovators, technology enthusiasts
•  Early adopters, visionaries
•  The CHASM
•  Early majority, pragmatists
•  Late majority, conservatives
•  Laggards, skeptics

Before the chasm, customers want technology & performance. After it, customers want solutions & convenience.

2006 to 2012: Web-Scale Batch Applications

Hadoop Wave TWO: Broad Enterprise Apps

[Figure: technology adoption lifecycle curve. Horizontal axis: time. Vertical axis: relative % of customers. Source: Geoffrey Moore, Crossing the Chasm]

Adopter segments, left to right:
•  Innovators, technology enthusiasts
•  Early adopters, visionaries
•  The CHASM
•  Early majority, pragmatists
•  Late majority, conservatives
•  Laggards, skeptics

Before the chasm, customers want technology & performance. After it, customers want solutions & convenience.

2013 & Beyond: Batch, Interactive, Online, Streaming, etc.

2.0 Architected for the Broad Enterprise

Hadoop 2.0 Key Highlights

Single Cluster, Many Workloads: BATCH, INTERACTIVE, ONLINE, STREAMING

HDP 2.0 Features        Enterprise Requirements
Rolling Upgrades        ZERO downtime
Disaster Recovery       Multi Data Center
Snapshots               Point in time Recovery
Full Stack HA           Reliability
Hive on Tez             Interactive Query
YARN                    Mixed workloads

The 1st Generation of Hadoop: Batch

HADOOP 1.0 Built for Web-Scale Batch Apps

[Diagram: separate stacks, each labeled "Single App" on its own HDFS, for BATCH, INTERACTIVE, and ONLINE workloads]

•  All other usage patterns must leverage that same infrastructure
•  Forces the creation of silos for managing mixed workloads

A Transition From Hadoop 1 to 2

HADOOP 1.0
•  MapReduce (cluster resource management & data processing)
•  HDFS (redundant, reliable storage)

HADOOP 2.0
•  MapReduce (data processing) and Others (data processing)
•  YARN (cluster resource management)
•  HDFS (redundant, reliable storage)

The Enterprise Requirement: Beyond Batch

For Hadoop to become an enterprise-viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service.

[Diagram: BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, and OTHER workloads, all on top of HDFS (Redundant, Reliable Storage)]

YARN: Taking Hadoop Beyond Batch

•  Created to manage resource needs across all uses
•  Ensures predictable performance & QoS for all apps
•  Enables apps to run “IN” Hadoop rather than “ON”
   –  Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.
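One way to picture "predictable performance & QoS" is a scheduler that guarantees each workload a fixed share of the cluster. The Python sketch below is a toy model, not YARN code. The queue names, the percentages, the cluster size, and the `guaranteed_memory_mb` helper are all assumptions made for illustration, loosely following the per-queue capacity percentages used by capacity-style schedulers.

```python
def guaranteed_memory_mb(cluster_memory_mb, queue_percentages):
    """Split total cluster memory into per-queue guarantees.

    queue_percentages maps a queue name to its share of the cluster (0-100).
    The shares are expected to add up to 100.
    """
    total = sum(queue_percentages.values())
    if total != 100:
        raise ValueError(f"queue shares add up to {total}, expected 100")
    return {
        queue: cluster_memory_mb * percent // 100
        for queue, percent in queue_percentages.items()
    }


# Hypothetical cluster: 10 nodes with 64 GB each, shared by four workloads.
shares = {"batch": 40, "interactive": 30, "online": 20, "streaming": 10}
guarantees = guaranteed_memory_mb(10 * 64 * 1024, shares)
for queue, memory_mb in guarantees.items():
    print(f"{queue:12s} {memory_mb:7d} MB guaranteed")
```

Because every workload draws from one pool with a known floor, a heavy batch job cannot starve the interactive or online queues in this model.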

Applications Run Natively IN Hadoop

•  BATCH (MapReduce)
•  INTERACTIVE (Tez)
•  STREAMING (Storm, S4, …)
•  GRAPH (Giraph)
•  IN-MEMORY (Spark)
•  HPC MPI (OpenMPI)
•  ONLINE (HBase)
•  OTHER (Search) (Weave…)

All of these run on YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage).

Old School Hadoop: MapReduce

New School Hadoop with YARN

[Diagram: Clients send Job Submissions to the ResourceManager. NodeManagers host Containers and per-application App Mstrs and report Node Status to the ResourceManager. Each App Mstr sends Resource Requests to the ResourceManager and receives MapReduce Status from its Containers.]
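The request flow in the YARN diagram can be sketched as a small simulation. This is a toy model in Python, not the YARN API: the class names mirror the diagram labels, while the methods (`launch`, `allocate`, `submit`), the "most free containers" placement rule, and the node sizes are assumptions chosen to keep the example short. In a real cluster the App Master issues the resource requests itself; here `submit` plays both roles.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_containers: int
    running: list = field(default_factory=list)

    def launch(self, label):
        """Start one container on this node and record what runs in it."""
        if self.free_containers == 0:
            raise RuntimeError(f"{self.name} has no free containers")
        self.free_containers -= 1
        self.running.append(label)


class ResourceManager:
    """Tracks node capacity and hands out containers; runs no application logic."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label):
        # Pick the node with the most free containers (a stand-in for real scheduling).
        node = max(self.nodes, key=lambda n: n.free_containers)
        node.launch(label)
        return node.name

    def submit(self, app_name, task_count):
        # Job submission: the first container hosts the per-application App Master.
        self.allocate(f"{app_name}:AppMaster")
        # Resource requests: one container is requested per task.
        return [self.allocate(f"{app_name}:task{i}") for i in range(task_count)]


nodes = [NodeManager("node1", 3), NodeManager("node2", 3), NodeManager("node3", 3)]
rm = ResourceManager(nodes)
placements = rm.submit("wordcount", task_count=4)
print(placements)
for node in nodes:
    print(node.name, node.running)
```

The point of the split is visible in the class boundaries: the ResourceManager only knows about capacity, so any framework that can ask for containers can share the cluster.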

5 Key Benefits of YARN

1.  Scale
2.  Compatibility with MapReduce
3.  Improved cluster utilization
4.  New programming models
5.  Agility

Apache Tez

• An alternate data processing framework to MapReduce

•  Improves performance of low-latency applications
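The general idea behind a DAG-style framework such as Tez is that a job is expressed as a graph of processing steps, and each step hands its output straight to the next one instead of running as a chain of separate MapReduce jobs. The Python sketch below illustrates that idea only. It does not use the Tez API, and the vertex names, the sample data, and the `run_vertex` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each vertex lists the upstream vertices whose output it consumes.
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}

orders = [("c1", 30), ("c2", 20), ("c1", 50)]
customers = {"c1": "APAC", "c2": "EMEA"}


def run_vertex(name, inputs):
    """Run one hypothetical processing step on its in-memory inputs."""
    if name == "scan_orders":
        return orders
    if name == "scan_customers":
        return customers
    if name == "join":
        regions = inputs["scan_customers"]
        return [(regions[cust], amount) for cust, amount in inputs["scan_orders"]]
    if name == "aggregate":
        totals = {}
        for region, amount in inputs["join"]:
            totals[region] = totals.get(region, 0) + amount
        return totals
    raise KeyError(name)


outputs = {}
for vertex in TopologicalSorter(dag).static_order():
    # Upstream results are handed over directly, with no write to shared storage.
    outputs[vertex] = run_vertex(vertex, {dep: outputs[dep] for dep in dag[vertex]})

print(outputs["aggregate"])
```

Expressed as chained MapReduce jobs, the join and the aggregation would typically be separate jobs with the intermediate result persisted between them, which is the overhead a DAG execution model aims to avoid.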

SQL-IN-Hadoop with Apache Hive

•  Apache Hive: first application to use YARN
•  Hive on Tez optimizes resource use for Hive queries to improve performance
   –  Apache Hive is the standard for SQL interaction in Hadoop (most applications claim Hive compatibility today)
   –  Apache Tez: a general-purpose processing framework, optimized for YARN, for existing Hadoop applications

Stinger Initiative

Simple Focus
1.  Improve existing tools & preserve investments
2.  Enable Hive to support interactive workloads

[Diagram: Business Analytics and Custom Apps issue SQL to HIVE, which runs on MAP REDUCE and TEZ over YARN and HDFS2 in Hadoop]

Stinger Phase 1
•  Base Optimizations
•  SQL Analytics
•  ORCFile Format

Stinger Phase 2
•  YARN Resource Mgmnt
•  Hive on Apache Tez
•  Query Service (always on)

Stinger Phase 3
•  Vector Query
•  Buffer Cache
•  Query Planner

Increased SQL Compatibility

100x Performance Improvement
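Two of the Stinger items, the ORCFile format (a columnar file layout) and Vector Query (processing rows in batches rather than one at a time), rest on ideas that a short sketch can show. The Python below is an illustration under stated assumptions, not Hive code: the sample table, the function names, and the batch size of 2 are hypothetical, and plain Python lists stand in for real columnar storage.

```python
# Row-oriented layout: one record per row.
rows = [
    {"id": 1, "price": 5.0, "qty": 3},
    {"id": 2, "price": 2.5, "qty": 10},
    {"id": 3, "price": 8.0, "qty": 1},
    {"id": 4, "price": 1.0, "qty": 7},
]

# Column-oriented layout: one list per column, as a columnar format stores data.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def revenue_row_at_a_time(rows, min_qty):
    """Evaluate the filter and the expression once per row."""
    total = 0.0
    for row in rows:
        if row["qty"] >= min_qty:
            total += row["price"] * row["qty"]
    return total


def revenue_in_batches(columns, min_qty, batch_size=2):
    """Evaluate the same query over batches of column values."""
    total = 0.0
    for start in range(0, len(columns["qty"]), batch_size):
        qty = columns["qty"][start:start + batch_size]
        price = columns["price"][start:start + batch_size]  # "id" is never read
        selected = [i for i, q in enumerate(qty) if q >= min_qty]
        total += sum(price[i] * qty[i] for i in selected)
    return total


print(revenue_row_at_a_time(rows, min_qty=3))
print(revenue_in_batches(columns, min_qty=3))
```

Both functions return the same total. The batched version touches only the two columns the query needs and works on contiguous runs of values, which is the property that columnar storage and vectorized execution exploit.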

SQL Compliance Highlights

Hive: More SQL & 100X Faster

Stinger Phase 3 •  Vector Query •  Buffer Cache •  Query Planner

Stinger Phase 2 •  YARN Resource Mgmnt •  Hive on Apache Tez •  Query Service

Stinger Phase 1 •  Base Optimizations •  SQL Analytics •  ORCFile Format

We Are Here

Done in Hive 0.11

CHAR

VARCHAR

DATE

DECIMAL

Sub-queries for IN/NOT IN, HAVING

EXISTS / NOT EXISTS

INTERSECT, EXCEPT

UNION DISTINCT and UNION outside of subquery

ROLLUP and CUBE

Windowing functions (OVER, RANK, etc.)

Work Started
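To make two of the listed SQL features concrete, the Python sketch below mimics what a windowing function (RANK with OVER) and ROLLUP compute. It models the SQL semantics only and is not Hive code. The `sales` table, its column names, and the helpers `rank_over` and `rollup_sum` are hypothetical.

```python
from collections import defaultdict

sales = [
    {"region": "east", "rep": "ann", "amount": 300},
    {"region": "east", "rep": "bob", "amount": 300},
    {"region": "east", "rep": "cat", "amount": 100},
    {"region": "west", "rep": "dan", "amount": 200},
]


def rank_over(rows, partition_by, order_by):
    """Mimic RANK() OVER (PARTITION BY partition_by ORDER BY order_by DESC)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)
    ranked = []
    for key in sorted(partitions):
        ordered = sorted(partitions[key], key=lambda r: r[order_by], reverse=True)
        prev_value, prev_rank = None, 0
        for position, row in enumerate(ordered, start=1):
            # Ties share a rank; the next distinct value skips ahead to its position.
            rank = prev_rank if row[order_by] == prev_value else position
            ranked.append({**row, "rank": rank})
            prev_value, prev_rank = row[order_by], rank
    return ranked


def rollup_sum(rows, keys, measure):
    """Mimic SUM(measure) ... GROUP BY keys WITH ROLLUP; None marks a rolled-up key."""
    totals = defaultdict(int)
    for row in rows:
        for depth in range(len(keys), -1, -1):
            group = tuple(row[k] for k in keys[:depth]) + (None,) * (len(keys) - depth)
            totals[group] += row[measure]
    return dict(totals)


ranks = [(r["rep"], r["rank"]) for r in rank_over(sales, "region", "amount")]
subtotals = rollup_sum(sales, ["region", "rep"], "amount")
print(ranks)
print(subtotals[("east", None)], subtotals[(None, None)])
```

The window function keeps every input row and adds a per-partition rank, while ROLLUP collapses rows and adds subtotal rows per region plus a grand total.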

Hive’s Performance Trajectory

http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/

Making Hadoop Enterprise Ready

Thank You!

http://hortonworks.com/sandbox