Huhadoop - v1.1

04/10/2023

Prepared for:

Big Data Expedition Roadshow

Presented by: “Big Data Joe” Rossi

Huhadoop?

What Makes Up Hadoop 1.x?

Hadoop 1.0 – HDFS + MapReduce

[Diagram: NameNode, SecondaryNameNode / JobTracker, DataNode / TaskTracker nodes, and a Client with items labeled 1-1, 1-2, 1-3]

Hadoop 1.0 – HDFS + MapReduce

[Diagram: NameNode, SecondaryNameNode / JobTracker, a Client, Map and Reduce tasks, and items labeled 1-1 through 4-3 spread across the nodes]

MapReduce v1 Limitations

Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000

Availability: JobTracker failure kills all queued and running jobs

Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization

No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
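
The low utilization caused by hard partitioning can be illustrated with a toy model. This is only a sketch: the function names, the slot counts, and the assumption that one generic container is the same size as one slot are all made up for illustration and do not come from Hadoop itself.

```python
# Toy model only: names and numbers are illustrative assumptions, not Hadoop APIs.

def slots_in_use(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1-style: map tasks may only fill map slots, reduce tasks only reduce slots."""
    return min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)


def containers_in_use(containers, map_tasks, reduce_tasks):
    """Pooled style: any pending task may take any free container."""
    return min(containers, map_tasks + reduce_tasks)


# Hypothetical node with 8 map slots and 4 reduce slots during a map-heavy phase.
map_slots, reduce_slots = 8, 4
pending_maps, pending_reduces = 20, 0
total = map_slots + reduce_slots

hard = slots_in_use(map_slots, reduce_slots, pending_maps, pending_reduces)
pooled = containers_in_use(total, pending_maps, pending_reduces)

print(f"hard partitioning:  {hard}/{total} busy")
print(f"generic containers: {pooled}/{total} busy")
```

In this made-up scenario the reduce slots sit idle while map tasks queue, which is the kind of waste the slide refers to.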

HADOOP 1.0

Single Use System: Batch Apps

Apache Hadoop 1.0: Single Use System

HDFS (redundant, reliable storage)

MapReduce (cluster resource management and data processing)

Pig Hive

What’s New In Hadoop 2.x?

YARN Replaces MapReduce

Yet Another Resource Negotiator

YARN

YARN will be the de-facto distributed operating system for Big Data

Store DATA in one place

YARN: Taking Hadoop Beyond Batch

Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

Applications Run Natively IN Hadoop

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

BATCH (MapReduce)

INTERACTIVE (Tez, Spark)

ONLINE (HBase)

STREAMING (DataTorrent)

GRAPH (Giraph)

2010

2011

2012

2013

2014

Today

YARN: Moving Quickly

Conceived at Yahoo!

Alpha Releases – 2.0

Beta Releases – 2.1

GA Released – 2.2

100,000+ nodes, 400,000+ jobs daily

10 million+ hours of compute daily

Version 2.3

YARN: Dr. Evil Approved

Graph Processing

Running all on the same Hadoop cluster to give applications access to all the same source data!

YARN: Applications

MapReduce v2

Real-Time Streaming Analytics

Master-Worker

Online

YARN: What Has Changed?

YARN                        MRv1
RM  ResourceManager         JT  JobTracker
AM  ApplicationMaster
Scheduler                   Scheduler
NM  NodeManager             TT  TaskTracker
Container                   Map
                            Reduce

[Diagram: YARN side shows a ResourceManager with a Scheduler, and NodeManagers with an ApplicationMaster and Containers; MRv1 side shows a JobTracker with a Scheduler, and TaskTrackers with Map and Reduce tasks]

7 Benefits of YARN

Scale

New programming models and services

Improved cluster utilization

Agility

Backwards compatible with MapReduce v1

Mixed workloads on the same source of data

Enables running apps in memory within the cluster

The Future of Hadoop: Projects and Roadmap

Stinger: Interactive Query for Hive

Speed: Deliver interactive query through 100x performance increases as compared to Hive 0.10.

SQL: Support the broadest array of SQL semantics for analytic applications running against Hadoop.

Scale: The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.

Stinger: Speed – Apache Tez



Tez (execution layer)

MR Pig Hive

Stinger: Speed – Apache Tez

HOYA: HBase on YARN

Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.

Easier Deployment: APIs to create, start, stop and delete HBase clusters.

Availability: Recover from Region Server loss with a new container.

Microsoft REEF

Retainable Evaluator Execution Framework

Machine Learning: Framework well suited for building machine learning jobs.

Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.

Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.

Heterogeneous Storages in HDFS

NameNode

Storage

NameNode

SATA SSD Fusion IO

Hadoop Roadmap

Early Q2 2014

Apache Hadoop 2.4: ResourceManager HA / Auto Failover, HDFS Rolling Upgrades

Mid Q2 2014

Apache Hadoop 2.5: NodeManager Restart w/o disruption, Dynamic Resource Configuration

Questions? No such thing as a stupid question.

Huhadoop?

Thank You!

Huhadoop?

Big Data Joe Rossi: http://about.me/[email protected]. 858.761.2918

Supporting Slides: Slides with information that may be asked about

YARN: How It Works

[Diagram: Client, ResourceManager with Scheduler, four NodeManagers, one ApplicationMaster, three Containers]
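
The flow between the components named on this slide can be sketched as a toy simulation. This is a rough illustration under stated assumptions, not YARN's actual API: the class names only mirror the slide labels, the first-fit placement policy, the memory sizes, and the `submit_application` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for a worker node that hosts containers."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, label, mb):
        self.free_mb -= mb
        self.containers.append(label)


class ResourceManager:
    """Toy scheduler (assumed policy): first NodeManager with enough free memory wins."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label, mb):
        for node in self.nodes:
            if node.free_mb >= mb:
                node.launch(label, mb)
                return node.name
        return None  # no capacity anywhere; a real scheduler would queue the request


def submit_application(rm, app, am_mb, task_mb, tasks):
    # Step 1: the client asks the ResourceManager for a container to host the ApplicationMaster.
    placements = {f"{app}-AM": rm.allocate(f"{app}-AM", am_mb)}
    # Step 2: the ApplicationMaster negotiates one container per task.
    for i in range(tasks):
        label = f"{app}-task{i}"
        placements[label] = rm.allocate(label, task_mb)
    return placements


nodes = [NodeManager("nm1", 4096), NodeManager("nm2", 4096)]
rm = ResourceManager(nodes)
placements = submit_application(rm, "app1", am_mb=1024, task_mb=2048, tasks=3)
for label, node in placements.items():
    print(label, "->", node)
```

The point of the sketch is the division of labor: the ResourceManager only hands out containers, while per-application logic lives in the ApplicationMaster.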

YARN: Example App Deployment

[Diagram: HOYA Client, ResourceManager with Scheduler, four NodeManagers, one HOYA / HBase Master, three Region Servers]

Storm Vs. DataTorrent

Solution Matrix DataTorrent Apache Storm

Atomic Micro-batch 1 3

Events per Second Billions Thousands

Automated Parallelism 3

Dynamic Runtime Changes 3

Linear Scalability 3

State Checkpointing 3

Apache Spark + Shark



Apache Spark

Shark

Hive (sql)

Hadoop 2.x – YARN + HDFS

[Diagram: NameNode, Standby NameNode / ResourceManager, four DataNode / NodeManager nodes, eight Containers]

YARN: Key Take-Aways

Backwards Compatible: YARN is Backwards Compatible for your existing MapReduce applications. You can get value from it right away.

Resource Management: YARN enables Fine Grained Resource Management for better cluster utilization.

One Source of Data: YARN allows you to interact with One Source of Data in multiple ways while maintaining Predictable Performance and Quality of Service.

Enabling Smart People: YARN is a flexible framework that is enabling smart people and companies to do amazing things with data.

YARN will be the de-facto distributed operating system for Big Data

Storm Vs. DataTorrent - Detailed

Solution Matrix                     DataTorrent          Apache Storm
Proprietary / Open Source           O                    O
Support for Hadoop 1.x              1                    1
Support for Hadoop 2.x              1                    1
Native YARN                         1                    3
Dashboard                           1                    3
Extensible via Modules              1                    1
Technical Support                   1                    1
Atomic Micro-batch                  1                    3
Events per Second                   Billions             Thousands
Automated Parallelism               1                    3
Dynamic Runtime Changes             1                    3
High Availability                   1                    2
Prog. Languages Supported           Java, Python, etc.   Java, Python, etc.
Log Analysis                        1                    3
Site Operations                     1                    3
MapReduce Diagnostics               1                    3
Open Source Operators Library       1                    2
Open Source Application Templates   1                    3
Complex Computations (DAG)          1                    3
Linear Scalability                  1                    3
Security                            1                    3
CLI and Macros                      1                    3
Configuration Based Specification   1                    3
State Checkpointing                 1                    3

Users forced to create data system silos for managing mixed workloads

Developers forced to abuse very specific MapReduce to fit their use cases

The 1st Generation Of Hadoop

Hadoop

HBase

Stinger: HiveQL – SQL Support

Hive SQL Datatypes

Hive SQL Semantics

Apache Spark



Apache Spark

Shark

Hive (sql)

Spark Streaming

MLlib (machine learning)

[Chart: Project Mgt Committee Members by company (Hortonworks, Others, Cloudera, Yahoo!, Facebook); bar values 7, 6, 3, 15, 11]

[Chart: Project Committers by company (Hortonworks, Others, Cloudera, Yahoo!, Facebook); bar values 24, 24, 11, 11, 5]

YARN: Why The De-Facto Distributed OS

Technology Adoption: 100,000+ nodes - 400,000 jobs - 10m compute hours daily

Enables Innovation: Smart people and companies to do amazing things to data

Financial Backing: 568m+ invested in Hadoop contributing companies, nearly 400m in 2013 alone

Apache Storm Topology

[Diagram: two Spouts, each reading a Stream (Data Source), with Bolt (Filter), Bolt (Calculation), Bolt (RDBMS Writes) and Bolt (HDFS Writes); outputs are an RDBMS and HDFS]
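
The spout and bolt roles on this slide can be sketched with plain Python generators. This is a single-process illustration, not Storm's API: the wiring between bolts, the event fields (`id`, `value`), the filter rule, and the squared calculation are all assumptions chosen for the example, and the two lists merely stand in for the RDBMS and HDFS sinks.

```python
def spout(events):
    """Spout: emits a stream of tuples from a data source."""
    for event in events:
        yield event


def filter_bolt(stream):
    """Bolt (Filter): drops tuples not wanted downstream (assumed rule: negative values)."""
    for event in stream:
        if event["value"] >= 0:
            yield event


def calculation_bolt(stream):
    """Bolt (Calculation): derives a new field from each tuple (assumed: the square)."""
    for event in stream:
        yield {**event, "squared": event["value"] ** 2}


def run_topology(events):
    rdbms_rows, hdfs_lines = [], []  # stand-ins for the RDBMS and HDFS sinks
    for event in calculation_bolt(filter_bolt(spout(events))):
        rdbms_rows.append((event["id"], event["squared"]))       # Bolt (RDBMS Writes)
        hdfs_lines.append(f'{event["id"]},{event["squared"]}')   # Bolt (HDFS Writes)
    return rdbms_rows, hdfs_lines


sample = [{"id": 1, "value": 3}, {"id": 2, "value": -1}, {"id": 3, "value": 2}]
rows, lines = run_topology(sample)
print(rows)
print(lines)
```

A real topology runs each spout and bolt in parallel across the cluster; the generator chain only shows how tuples flow from stage to stage.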

Hadoop 1.0 – MR + HDFS

[Diagram: NameNode, SecondaryNameNode / JobTracker, and four nodes each running Map and Reduce tasks]

Hadoop 1.0 – MapReduce

[Diagram: JobTracker and four TaskTrackers, each running Map and Reduce tasks]

YARN: Uncharted Territory

[Chart: Technology vs. Value, with a "You Are Here" marker]
