Huhadoop - v1.1

06/07/2022 Prepared for: Big Data Expedition Roadshow Presented by: “Big Data Joe” Rossi Huhadoop?


Transcript of Huhadoop - v1.1

Page 1: Huhadoop - v1.1

04/10/2023

Prepared for:

Big Data Expedition Roadshow

Presented by: “Big Data Joe” Rossi

Huhadoop?

Page 2: Huhadoop - v1.1

What Makes Up Hadoop 1.x?

Page 3: Huhadoop - v1.1

Hadoop 1.0 – HDFS + MapReduce

NameNode

DataNode / TaskTracker DataNode / TaskTracker

DataNode / TaskTracker DataNode / TaskTracker

SecondaryNameNode /

JobTracker

Client

1-1 1-2 1-3

Page 4: Huhadoop - v1.1

Hadoop 1.0 – HDFS + MapReduce

NameNode

DataNode / TaskTracker DataNode / TaskTracker

DataNode / TaskTracker DataNode / TaskTracker

SecondaryNameNode /

JobTracker

Client

1-1 1-2 1-3

Map Reduce

2-1 3-2 3-3 4-1

2-3 4-2 2-2 3-1 4-3

Map Reduce
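
The Map and Reduce boxes on this slide can be illustrated with a minimal word-count sketch in plain Python. It is not Hadoop code: the function names, the sample text, and the idea that each string stands in for one block (1-1, 1-2, 1-3) are assumptions made for illustration only.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(blocks):
    """Run map over every block, then shuffle, then reduce each key."""
    emitted = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Hypothetical contents for three blocks of one file (1-1, 1-2, 1-3).
blocks = ["big data big", "data on hadoop", "hadoop big"]
counts = run_job(blocks)
print(counts)
```

In a real cluster the map calls would run on the TaskTrackers that hold each block, and the shuffle would move data across the network; here everything runs in one process.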

Page 5: Huhadoop - v1.1

MapReduce v1 Limitations

Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000

Availability: JobTracker failure kills all queued and running jobs

Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization

No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
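
The utilization point about hard partitioning of Map and Reduce slots can be made concrete with a little arithmetic. The sketch below is an illustration only: the slot counts and the pending workload are hypothetical numbers, and the "pooled" model is a simplification of how a shared pool of capacity behaves.

```python
def mrv1_running(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """MRv1: map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)

def pooled_running(map_tasks, reduce_tasks, total_slots):
    """Pooled model: any task may use any free unit of capacity."""
    return min(map_tasks + reduce_tasks, total_slots)

# Hypothetical node configuration and a map-heavy moment in a job.
map_slots, reduce_slots = 8, 4
pending_maps, pending_reduces = 12, 0

total = map_slots + reduce_slots
mrv1_util = mrv1_running(pending_maps, pending_reduces, map_slots, reduce_slots) / total
pooled_util = pooled_running(pending_maps, pending_reduces, total) / total

print(f"MRv1 utilization:   {mrv1_util:.0%}")
print(f"Pooled utilization: {pooled_util:.0%}")
```

With these assumed numbers the four reduce slots sit idle under MRv1 while map tasks wait, which is the low-utilization effect the slide describes.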

Page 6: Huhadoop - v1.1

HADOOP 1.0

Single Use System: Batch Apps

Apache Hadoop 1.0: Single Use System

HDFS (redundant, reliable storage)

MapReduce (cluster resource management and data processing)

Pig Hive

Page 7: Huhadoop - v1.1

What’s New In Hadoop 2.x?

Page 8: Huhadoop - v1.1

YARN Replaces MapReduce

Yet Another Resource Negotiator

YARN

YARN will be the de-facto distributed operating system for Big Data

Page 9: Huhadoop - v1.1

Store DATA in one place

YARN: Taking Hadoop Beyond Batch

Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

Applications Run Natively IN Hadoop

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

BATCH (MapReduce)

INTERACTIVE (Tez, Spark)

ONLINE (HBase)

STREAMING (DataTorrent)

GRAPH (Giraph)

Page 10: Huhadoop - v1.1

2010

2011

2012

2013

2014

Today

YARN: Moving Quickly

Conceived at Yahoo!

Alpha Releases – 2.0

Beta Releases – 2.1

GA Released – 2.2

100,000+ nodes, 400,000+ jobs daily

10 million+ hours of compute daily

Version 2.3

Page 11: Huhadoop - v1.1

YARN: Dr. Evil Approved

Page 12: Huhadoop - v1.1

Graph Processing

Running all on the same Hadoop cluster to give applications access to all the same source data!

YARN: Applications

MapReduce v2

Real-Time Streaming Analytics

Master-Worker

Online

Page 13: Huhadoop - v1.1

YARN: What Has Changed?

YARN                                           MRv1
RM (ResourceManager) + AM (ApplicationMaster)  JT (JobTracker)
Scheduler                                      Scheduler
NM (NodeManager)                               TT (TaskTracker)
Container                                      Map / Reduce

ResourceManager

Scheduler

JobTracker

Scheduler

NodeManager

ApplicationMaster

TaskTracker

Map Reduce

NodeManager

Container Container

TaskTracker

Map Reduce

Page 14: Huhadoop - v1.1

Scale
New programming models and services
Improved cluster utilization
Agility
Backwards compatible with MapReduce v1
Mixed workloads on the same source of data
Enables running apps in memory within the cluster

7 Benefits of YARN


Page 15: Huhadoop - v1.1

The Future of Hadoop: Projects and Roadmap

Page 16: Huhadoop - v1.1

Speed: Deliver interactive query through 100x performance increases as compared to Hive 0.10.

Stinger: Interactive Query for Hive

SQL: Support the broadest array of SQL semantics for analytic applications running against Hadoop.

Scale: The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.

Page 17: Huhadoop - v1.1

Stinger: Speed – Apache Tez

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

Tez (execution layer)

MR Pig Hive

Page 18: Huhadoop - v1.1

Stinger: Speed – Apache Tez

Page 19: Huhadoop - v1.1

Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.

HOYA: HBase on YARN

Easier Deployment: APIs to create, start, stop and delete HBase clusters.

Availability: Recover from Region Server loss with a new container.

Page 20: Huhadoop - v1.1

Machine Learning: Framework well suited for building machine learning jobs.

Microsoft REEF

Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.

Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.

Retainable Evaluator Execution Framework

Page 21: Huhadoop - v1.1

Heterogeneous Storages in HDFS

NameNode

Storage

NameNode

SATA SSD Fusion IO

Page 22: Huhadoop - v1.1

Hadoop Roadmap

Apache Hadoop 2.4 (early Q2 2014): ResourceManager HA / Auto Failover; HDFS Rolling Upgrades

Apache Hadoop 2.5 (mid Q2 2014): NodeManager Restart w/o disruption; Dynamic Resource Configuration

Page 23: Huhadoop - v1.1

Questions? No such thing as a stupid question.

Huhadoop?

Page 24: Huhadoop - v1.1

Thank You!

Huhadoop?

Big Data Joe Rossi:http://about.me/[email protected]. 858.761.2918

Page 25: Huhadoop - v1.1

Supporting Slides: Slides with information that may be asked about

Page 26: Huhadoop - v1.1

YARN: How It Works

ResourceManager

NodeManager

ApplicationMaster

NodeManager

NodeManager NodeManager

Scheduler

Container

Container Container

Client
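
The flow in this diagram (a Client submits to the ResourceManager, whose Scheduler places an ApplicationMaster and then task Containers on NodeManagers) can be sketched as a toy simulation. This is not the YARN API: the class names mirror the slide, but the first-fit placement rule, the memory sizes, and the `submit_application` helper are all assumptions for illustration. Real YARN schedulers use more elaborate policies.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

class ResourceManager:
    """Toy scheduler: first-fit placement of container requests on nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, role, mem_mb):
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                node.free_mb -= mem_mb
                node.containers.append((app_id, role))
                return node.name
        return None  # no node has enough free memory

def submit_application(rm, app_id, am_mb, task_mb, num_tasks):
    """Client submits; the RM starts the ApplicationMaster; the AM asks for task containers."""
    placements = {"ApplicationMaster": rm.allocate(app_id, "ApplicationMaster", am_mb)}
    if placements["ApplicationMaster"] is None:
        return placements
    placements["tasks"] = [rm.allocate(app_id, "task", task_mb) for _ in range(num_tasks)]
    return placements

# Four NodeManagers and three task containers, as drawn on the slide.
nodes = [NodeManager(f"nm{i}", free_mb=4096) for i in range(1, 5)]
rm = ResourceManager(nodes)
result = submit_application(rm, "app-1", am_mb=1024, task_mb=2048, num_tasks=3)
print(result)
```

The point of the sketch is the division of labor: the ResourceManager only hands out capacity, while the per-application ApplicationMaster decides what to run in it.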

Page 27: Huhadoop - v1.1

YARN: Example App Deployment

ResourceManager

NodeManager

HOYA / HBase Master

NodeManager

NodeManager NodeManager

Scheduler

Region Server

Region Server Region Server

HOYA Client

Page 28: Huhadoop - v1.1

Storm Vs. DataTorrent

Solution Matrix DataTorrent Apache Storm

Atomic Micro-batch 1 3

Events per Second Billions Thousands

Automated Parallelism 3

Dynamic Runtime Changes 3

Linear Scalability 3

State Checkpointing 3

Page 29: Huhadoop - v1.1

Apache Spark + Shark

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

Apache Spark

Shark

Hive (SQL)

Page 30: Huhadoop - v1.1

Hadoop 2.x – YARN + HDFS

NameNode

DataNode / NodeManager DataNode / NodeManager

DataNode / NodeManager DataNode / NodeManager

StandbyNameNode /

ResourceManager

Container Container

Container Container

Container Container

Container Container

Page 31: Huhadoop - v1.1

Backwards Compatible: YARN is Backwards Compatible for your existing MapReduce applications. You can get value from it right away.

YARN: Key Take-Aways

Resource Management: YARN enables Fine Grained Resource Management for better cluster utilization.

One Source of Data: YARN allows you to interact with One Source of Data in multiple ways while maintaining Predictable Performance and Quality of Service.

Enabling Smart People: YARN is a flexible framework that enables smart people and companies to do amazing things with data.

YARN will be the de-facto distributed operating system for Big Data

Page 32: Huhadoop - v1.1

Storm Vs. DataTorrent - Detailed

Solution Matrix                     DataTorrent         Apache Storm
Proprietary / Open Source           O                   O
Support for Hadoop 1.x              1                   1
Support for Hadoop 2.x              1                   1
Native YARN                         1                   3
Dashboard                           1                   3
Extensible via Modules              1                   1
Technical Support                   1                   1
Atomic Micro-batch                  1                   3
Events per Second                   Billions            Thousands
Automated Parallelism               1                   3
Dynamic Runtime Changes             1                   3
High Availability                   1                   2
Prog. Languages Supported           Java, Python, etc.  Java, Python, etc.
Log Analysis                        1                   3
Site Operations                     1                   3
MapReduce Diagnostics               1                   3
Open Source Operators Library       1                   2
Open Source Application Templates   1                   3
Complex Computations (DAG)          1                   3
Linear Scalability                  1                   3
Security                            1                   3
CLI and Macros                      1                   3
Configuration Based Specification   1                   3
State Checkpointing                 1                   3

Page 33: Huhadoop - v1.1

Users forced to create data system silos for managing mixed workloads

Developers forced to abuse the very specific MapReduce model to fit their use cases

The 1st Generation Of Hadoop

Hadoop

HBase

Page 34: Huhadoop - v1.1

Stinger: HiveQL – SQL Support

Hive SQL Datatypes

Hive SQL Semantics

Page 35: Huhadoop - v1.1

Apache Spark

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

Apache Spark

Shark

Hive (SQL)

Spark Streaming

MLlib (machine learning)

Page 36: Huhadoop - v1.1

Project Mgt Committee Members

[Bar chart: members by organization (Hortonworks, Others, Cloudera, Yahoo!, Facebook); axis from 0 to 16; bar values as extracted: 7, 6, 3, 15, 11]

Page 37: Huhadoop - v1.1

Project Committers

[Bar chart: committers by organization (Hortonworks, Others, Cloudera, Yahoo!, Facebook); axis from 0 to 30; bar values as extracted: 24, 24, 11, 11, 5]

Page 38: Huhadoop - v1.1

YARN: Why The De-Facto Distributed OS

Technology Adoption: 100,000+ nodes, 400,000 jobs, 10m compute hours daily

Enables Innovation: Smart people and companies can do amazing things with data

Financial Backing: 568m+ invested in Hadoop contributing companies, nearly 400m in 2013 alone

Page 39: Huhadoop - v1.1

Apache Storm Topology

Bolt (Filter)

Spout

Stream (Data Source)

Spout

Stream (Data Source)

Bolt (RDBMS Writes)

Bolt (Calculation)

Bolt (HDFS Writes)

RDBMS

HDFS
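
The spout and bolt roles named on this slide can be sketched with plain Python generators. This is not the Storm API, and it runs in a single process rather than across a cluster. The wiring (spout to filter to calculation, then out to both writers), the sample events, and every function name are assumptions, since the transcript does not preserve how the diagram connects the boxes.

```python
def spout(events):
    """Spout: turn a data source into a stream of tuples."""
    for event in events:
        yield event

def filter_bolt(stream):
    """Bolt (Filter): drop tuples that have no usable value."""
    for tup in stream:
        if tup.get("value") is not None:
            yield tup

def calculation_bolt(stream):
    """Bolt (Calculation): attach a running total to each tuple."""
    total = 0
    for tup in stream:
        total += tup["value"]
        yield {**tup, "running_total": total}

def run_topology(events, rdbms, hdfs):
    """Wire spout -> filter -> calculation, then fan out to both writer bolts."""
    for tup in calculation_bolt(filter_bolt(spout(events))):
        rdbms.append(tup)  # stand-in for Bolt (RDBMS Writes)
        hdfs.append(tup)   # stand-in for Bolt (HDFS Writes)

# Hypothetical input events; the lists stand in for the RDBMS and HDFS sinks.
source = [{"id": 1, "value": 5}, {"id": 2, "value": None}, {"id": 3, "value": 7}]
rdbms_rows, hdfs_rows = [], []
run_topology(source, rdbms_rows, hdfs_rows)
print(rdbms_rows)
```

Each tuple flows through the whole chain as soon as it is emitted, which is the streaming behavior that distinguishes this model from a batch MapReduce job.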

Page 40: Huhadoop - v1.1

Hadoop 1.0 – MR + HDFS

NameNode

DataNode / TaskTracker DataNode / TaskTracker

DataNode / TaskTracker DataNode / TaskTracker

SecondaryNameNode /

JobTracker

Map Reduce

Map Reduce Map Reduce

Map Reduce

Page 41: Huhadoop - v1.1

Hadoop 1.0 – MapReduce

JobTracker

TaskTracker

Map Reduce

TaskTracker

Map Reduce

TaskTracker

Map Reduce

TaskTracker

Map Reduce

Page 42: Huhadoop - v1.1

YARN: Uncharted Territory

You

Are Here

Technology

Value