Hadoop: Past, Present and Future

Posted on 26-Jan-2015


Ever wonder what Hadoop might look like in 12 months, 24 months, or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric that supports MapReduce and other application paradigms. As a result, Hadoop looks very different than it did 12 months ago. This talk will take you through some ideas for YARN itself and the myriad ways it is moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.

Transcript of Hadoop past, present and future


Hadoop: Past, Present and Future
Chris Harris
Email: charris@hortonworks.com
Twitter: cj_harris5

Past

A little history… it’s 2005


A Brief History of Apache Hadoop

[Timeline, 2004-2013: Yahoo! creates a team under E14 to work on Hadoop (2005); the Apache project is established; Yahoo! begins to operate at scale; the Hortonworks Data Platform ships; Enterprise Hadoop (2013).]


Key Hadoop Data Types

1. Sentiment: understand how your customers feel about your brand and products, right now
2. Clickstream: capture and analyze website visitors' data trails and optimize your website
3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines
4. Geographic: analyze location-based data to manage operations where they occur
5. Server Logs: research logs to diagnose process failures and prevent security breaches
6. Unstructured (text, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents


Hadoop is NOT

• An ESB
• NoSQL
• HPC
• Relational
• Real-time
• The "jack of all trades"


Hadoop 1

• Limited to 4,000 nodes per cluster
• O(# of tasks in a cluster)
• JobTracker bottleneck: resource management, job scheduling and monitoring all in one daemon
• Only one namespace for managing HDFS
• Map and Reduce slots are static
• The only job type that runs is MapReduce


Hadoop 1 - Basics

[Diagram: MapReduce (computation framework) layered on top of HDFS (storage framework), with file blocks A, B and C replicated across the DataNodes.]


Hadoop 1 - Reading Files

[Diagram: reading a file in Hadoop 1. The Hadoop client asks the NameNode (whose fsimage/edits the Secondary NameNode checkpoints) for the file; the NameNode returns the DataNodes, block IDs, etc.; the client then reads the blocks directly from the DataNode/TaskTracker (DN | TT) machines spread across Rack1..RackN, which send heartbeats and block reports back to the NameNode.]
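To make the flow concrete, here is a minimal client-side read with the HDFS Java API (a sketch; the path is illustrative, and the NameNode address is assumed to come from core-site.xml on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // The NameNode address comes from the cluster configuration on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for block locations; the returned stream
    // then reads the blocks directly from the DataNodes.
    Path file = new Path("/user/demo/input.txt"); // illustrative path
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}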


Hadoop 1 - Writing Files

[Diagram: writing a file in Hadoop 1. The Hadoop client requests the write from the NameNode, which returns the target DataNodes, etc.; the client writes blocks to the first DataNode and the blocks are replicated through a pipeline of DN | TT machines across racks; DataNodes send block reports back, and the Secondary NameNode checkpoints the fsimage/edits.]
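The write path looks symmetric from the client; a minimal sketch (illustrative path, configuration again from the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // create() asks the NameNode to allocate blocks; as the stream fills,
    // each block goes to the first DataNode and is replicated down the pipeline.
    Path file = new Path("/user/demo/output.txt"); // illustrative path
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("hello, hdfs\n".getBytes());
    }
  }
}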


Hadoop 1 - Running Jobs

[Diagram: running a job in Hadoop 1. The Hadoop client submits the job to the JobTracker, which deploys map and reduce tasks to TaskTrackers (DN | TT) across racks; map output is shuffled to the reduce tasks, which write the result (part 0) back to HDFS.]


Hadoop 1 - Security

[Diagram: Hadoop 1 security. Users outside the firewall authenticate (authN/authZ) against LDAP/AD and the KDC via a client node/spoke server, then issue service requests to the Hadoop cluster; an encryption plugin can be added. A block token is used for accessing data; a delegation token is used for running jobs.]
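On a kerberized cluster the client authenticates before making any of the requests above. A minimal sketch using Hadoop's UserGroupInformation API; the principal and keytab path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    // Tell the client libraries that the cluster expects Kerberos.
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate against the KDC (principal and keytab are illustrative);
    // later FileSystem and job calls carry the resulting credentials and
    // pick up block and delegation tokens under the hood.
    UserGroupInformation.loginUserFromKeytab(
        "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
  }
}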


Hadoop 1 - APIs

• org.apache.hadoop.mapreduce.Partitioner
• org.apache.hadoop.mapreduce.Mapper
• org.apache.hadoop.mapreduce.Reducer
• org.apache.hadoop.mapreduce.Job
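Tying these together, the canonical example is word count; a condensed sketch against the org.apache.hadoop.mapreduce API (input and output paths are taken from the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word after the shuffle.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // keys go to reducers via the default HashPartitioner
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}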

Present

Hadoop 2

• Potentially up to 10,000 nodes per cluster
• O(cluster size)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• MRv1 backward- and forward-compatible
• Any app can integrate with Hadoop
• Beyond Java


Hadoop 2 - Basics


Hadoop 2 - Reading Files (w/ NN Federation)

[Diagram: reading a file in Hadoop 2 with NameNode federation. The Hadoop client reads through one of several federated NameNodes (NN1/ns1 through NN4/ns4), each managing its own namespace; checkpointing is per NameNode, via a Secondary NameNode or a Backup NN kept in sync (fs sync). The NameNode returns the DataNodes, block IDs, etc., and the client reads the blocks directly from the DataNode/NodeManager (DN | NM) machines across racks, which register and send heartbeats/block reports. Each namespace (ns1-ns4) owns a block pool spread over the shared DataNodes, e.g. ns1 on dn1, dn2; ns2 on dn1, dn3; ns3 on dn4, dn5; ns4 on dn4, dn5.]
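From the client's point of view, federation simply means more than one NameNode URI; a small sketch (host names are illustrative, and the commented viewfs:// line assumes a client-side mount table has been configured):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Each namespace has its own NameNode, addressable by URI.
    FileSystem ns1 = FileSystem.get(URI.create("hdfs://nn1.example.com:8020"), conf);
    System.out.println(ns1.exists(new Path("/data/owned-by-ns1")));

    // Alternatively, a client-side mount table can stitch all namespaces
    // into one view, e.g.:
    // FileSystem view = FileSystem.get(URI.create("viewfs:///"), conf);
  }
}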


Hadoop 2 - Writing Files

[Diagram: writing a file in Hadoop 2. The client requests the write from the owning NameNode (NN1/ns1 ... NN4/ns4) and gets back the target DataNodes, etc.; blocks are written and replicated through a pipeline of DN | NM machines across racks, which send block reports back; checkpointing is again per NameNode, via a Secondary NameNode or a Backup NN kept in sync (fs sync).]


Hadoop 2 - Running Jobs

[Diagram: running jobs in Hadoop 2. Hadoop clients 1 and 2 create and submit app1 and app2 to the ResourceManager, which couples an ApplicationsManager (ASM) with a Scheduler that partitions cluster resources into queues. Each application gets its own ApplicationMaster (AM1, AM2) running in a container on a NodeManager; the AM negotiates further containers (C1.1-C1.4, C2.1-C2.3) from the Scheduler and sends status reports, while the NodeManagers report to the ResourceManager.]


Hadoop 2 - Security

[Diagram: Hadoop 2 security. JDBC clients, REST clients, and browsers (HUE) pass through a firewall into a DMZ hosting a Knox Gateway cluster; Knox authenticates against LDAP/AD, the KDC, or an enterprise/cloud SSO provider, and requests then cross a second firewall into the Hadoop cluster, where native Hive/HBase encryption applies.]


Hadoop 2 - APIs

• org.apache.hadoop.yarn.api.ApplicationClientProtocol
• org.apache.hadoop.yarn.api.ApplicationMasterProtocol
• org.apache.hadoop.yarn.api.ContainerManagementProtocol
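In practice most applications drive ApplicationClientProtocol through the higher-level YarnClient library. A hedged sketch of creating and submitting an application; the ApplicationMaster command line and resource sizes are illustrative:

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmit {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // Ask the ResourceManager for a new application id.
    YarnClientApplication app = yarn.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app"); // illustrative

    // Describe how to launch the ApplicationMaster container.
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        Collections.singletonList("java -Xmx256m com.example.DemoAppMaster")); // illustrative
    ctx.setAMContainerSpec(amContainer);

    // Resources the AM container itself needs.
    Resource res = Records.newRecord(Resource.class);
    res.setMemory(512);
    res.setVirtualCores(1);
    ctx.setResource(res);
    ctx.setPriority(Priority.newInstance(0));

    ApplicationId id = yarn.submitApplication(ctx);
    System.out.println("Submitted " + id);
  }
}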

Future


Apache Tez: A New Hadoop Data Processing Framework


HDP: Enterprise Hadoop Distribution


• The ONLY 100% open source and complete distribution
• Enterprise grade, proven and tested at scale
• Ecosystem endorsed to ensure interoperability

[Diagram: the HDP stack. Hadoop Core: HDFS and MapReduce, with YARN*, Tez* and other engines. Data Services: Hive & HCatalog, Pig, HBase. Load & Extract: Sqoop, Flume, NFS, WebHDFS, Knox*. Operational Services: Oozie, Ambari, Falcon*. Platform Services: enterprise readiness through high availability, disaster recovery, rolling upgrades, security and snapshots.]


Tez (“Speed”)

• What is it?
  – A data processing framework, as an alternative to MapReduce
  – A new incubation project in the ASF
• Who else is involved?
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
  – Widens the platform for Hadoop use cases
  – Crucial to improving the performance of low-latency applications
  – Core to the Stinger initiative
  – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop


Moving Hadoop Beyond MapReduce

• Low-level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• New base for MapReduce, Hive, Pig, Cascading, etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline


Tez - Core Idea

• A Task with pluggable Input, Processor and Output
• A YARN ApplicationMaster to run a DAG of Tez Tasks

Tez Task = <Input, Processor, Output>
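As a conceptual sketch only (the real interfaces live in org.apache.tez.runtime.api and differ in detail), the model looks like this:

// A conceptual sketch of Tez's task model, not the actual Tez interfaces.
interface Input {
  Iterable<byte[]> read();            // e.g. HDFS input, shuffle input
}

interface Output {
  void write(byte[] record);          // e.g. sorted output, HDFS output
}

interface Processor {
  void run(Input in, Output out);     // e.g. map logic, reduce logic
}

// A Tez task is the triple <Input, Processor, Output>; the framework
// (a YARN ApplicationMaster) chains such tasks into a DAG.
final class TezStyleTask {
  private final Input in;
  private final Processor proc;
  private final Output out;

  TezStyleTask(Input in, Processor proc, Output out) {
    this.in = in;
    this.proc = proc;
    this.out = out;
  }

  void execute() {
    proc.run(in, out);
  }
}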


Building Blocks for Tasks

• MapReduce 'Map' task: HDFS Input → Map Processor → Sorted Output
• MapReduce 'Reduce' task: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
• Special Pig/Hive 'Map' (Tez task): HDFS Input → Map Processor → Pipeline Sorter Output
• Special Pig/Hive 'Reduce' (Tez task): Shuffle Skip-merge Input → Reduce Processor → Sorted Output
• In-memory Map (Tez task): HDFS Input → Map Processor → In-memory Sorted Output


Pig/Hive on MapReduce versus Pig/Hive on Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: on MapReduce this query runs as three chained jobs (Job 1, Job 2, Job 3), with an I/O synchronization barrier, writing intermediate results to HDFS, between each job; on Tez it runs as a single job with no intermediate barriers.]


Tez on YARN: Going Beyond Batch

• Tez optimizes execution: a new runtime engine for more efficient data processing
• Always-on Tez service: low-latency processing for all Hadoop data processing


Apache Knox: Secure Access to Hadoop


Knox Initiative: Make Hadoop Security Simple

• Simplify Security: make security simple for both users and operators. Provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
• Aggregate Access: deliver unified and centralized access to the Hadoop cluster. Make Hadoop feel like a single application to users.
• Client Agility: ensure service users are abstracted from where services are located and how services are configured and scaled.


Knox: Make Hadoop Security Simple

[Diagram: clients call the Knox Gateway over {REST}; the gateway performs authentication and verification against a user store (KDC, AD, LDAP) and then forwards the requests to the Hadoop cluster.]
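For example, a REST client can list an HDFS directory through the gateway instead of talking to the NameNode directly. A sketch; the gateway host, the "default" topology and the credentials are all illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxListDir {
  public static void main(String[] args) throws Exception {
    // Knox proxies WebHDFS under /gateway/<topology>/webhdfs/v1;
    // host, port, topology and credentials are illustrative.
    URL url = new URL(
        "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();

    // The gateway authenticates the caller (e.g. HTTP Basic against LDAP/AD)
    // so clients need no Kerberos machinery of their own.
    String creds = Base64.getEncoder()
        .encodeToString("guest:guest-password".getBytes());
    conn.setRequestProperty("Authorization", "Basic " + creds);

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON FileStatuses listing
      }
    }
  }
}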


Knox: Next Generation of Hadoop Security

• All users see one end-point website
• All online systems see one end-point RESTful service
• Consistency across all interfaces and capabilities
• Firewalled cluster that no end users need to access
• More IT-friendly; enables systems admins, DB admins, security admins and network admins

[Diagram: end users, online apps and analytics tools pass through a firewall to the gateway, then through a second firewall to the Hadoop cluster.]


Apache Falcon: Data Lifecycle Management for Hadoop


Data Lifecycle on Hadoop is Challenging

Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools in play: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Problem: a patchwork of tools complicates data lifecycle management. Result: long development cycles and quality challenges.


Falcon: One-stop Shop for Data Lifecycle

Apache Falcon provides the data management functions above and orchestrates the underlying tools:

Provides: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Orchestrates: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Falcon provides a single interface to orchestrate the data lifecycle, so sophisticated DLM is easily added to Hadoop applications.


Falcon At A Glance

>  Falcon provides the key services data processing applications need.
>  Complex data processing logic is handled by Falcon instead of hard-coded in apps.
>  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.

[Diagram: data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; Falcon in turn provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]


Falcon Core Capabilities

• Core functionality: pipeline processing, replication, retention, late data handling
• Automates: scheduling and retry; recording audit, lineage and metrics
• Operations and management: monitoring, management, metering; alerts and notifications; multi-cluster federation
• CLI and REST API


Falcon Example: Multi-Cluster Failover

>  Falcon manages workflow, replication or both.
>  Enables business continuity without requiring full data reprocessing.
>  Failover clusters require less storage and CPU.

[Diagram: on the primary Hadoop cluster, data moves through Staged, Cleansed, Conformed and Presented stages; replication copies the Staged and Presented data to a failover Hadoop cluster, which serves BI and analytics.]


Falcon Example: Retention Policies

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or data re-processing.

• Staged data: retain 5 years
• Cleansed data: retain 3 years
• Conformed data: retain 3 years
• Presented data: retain last copy only


Falcon Example: Late Data Handling

>  Processing waits until all data is available.
>  Developers don't write complex data handling rules within applications.

[Diagram: online transaction data (pulled via Sqoop) and web log data (pushed via FTP) land in a staging area and are combined into one dataset; processing waits up to 4 hours for the FTP data to arrive.]


Multi Cluster Management with Prism

>  Prism is the part of Falcon that handles multi-cluster management.
>  Key use cases: replication and data processing that span clusters.


Hortonworks Sandbox: Go from Zero to Big Data in 15 Minutes


Sandbox: A Guided Tour of HDP


• Tutorials and videos give a guided tour of HDP and Hadoop
• Perfect for beginners or anyone learning more about Hadoop
• Installs easily on your laptop or desktop
• Browse and manage HDFS files
• Easily import data and create tables
• Easy-to-use editors for Apache Pig and Hive
• Latest tutorials pushed directly to your Sandbox


THANK YOU!

Chris Harris
charris@hortonworks.com

Download Sandbox: hortonworks.com/sandbox