Hadoop: Past, Present and Future

Posted on 26-Jan-2015


Ever wonder what Hadoop might look like in 12 months, 24 months, or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric that supports MapReduce and other application paradigms. As a result, Hadoop looks very different than it did 12 months ago. This talk will take you through some ideas for YARN itself and the myriad ways it is moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.

Transcript of Hadoop past, present and future


Hadoop: Past, Present and Future
Chris Harris
Email: charris@hortonworks.com
Twitter: cj_harris5

Past

A little history… it’s 2005


A Brief History of Apache Hadoop

[Timeline, 2004-2013: Yahoo! creates a team under E14 to work on Hadoop (2005); the Apache project is established; Yahoo! begins to operate at scale; the Hortonworks Data Platform ships; Enterprise Hadoop (2013).]


Key Hadoop Data Types

1. Sentiment: understand how your customers feel about your brand and products, right now
2. Clickstream: capture and analyze website visitors' data trails and optimize your website
3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines
4. Geographic: analyze location-based data to manage operations where they occur
5. Server Logs: research logs to diagnose process failures and prevent security breaches
6. Unstructured (text, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents


Hadoop is NOT

• An ESB
• NoSQL
• HPC
• Relational
• Real-time
• The "jack of all trades"


Hadoop 1

• Limited to 4,000 nodes per cluster
• O(# of tasks in a cluster)
• JobTracker bottleneck: resource management, job scheduling and monitoring all in one daemon
• Only one namespace for managing HDFS
• Map and Reduce slots are static
• The only job type that runs is MapReduce


Hadoop 1 - Basics

[Diagram: MapReduce (computation framework) layered on top of HDFS (storage framework), with file blocks A, B and C replicated across the DataNodes.]


Hadoop 1 - Reading Files

[Diagram: reading a file in Hadoop 1. The Hadoop client asks the NameNode (whose fsimage/edits the Secondary NameNode checkpoints) for the file; the NameNode returns the DataNodes, block IDs, etc.; the client then reads the blocks directly from the DataNode/TaskTracker (DN | TT) machines spread across Rack1..RackN, which send heartbeats and block reports back to the NameNode.]
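To make the flow concrete, here is a minimal client-side read with the HDFS Java API (a sketch; the path is illustrative, and the NameNode address is assumed to come from core-site.xml on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // The NameNode address comes from the cluster configuration on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for block locations; the returned stream
    // then reads the blocks directly from the DataNodes.
    Path file = new Path("/user/demo/input.txt"); // illustrative path
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}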


Hadoop 1 - Writing Files

[Diagram: writing a file in Hadoop 1. The Hadoop client requests the write from the NameNode, which returns the target DataNodes, etc.; the client writes blocks to the first DataNode and the blocks are replicated through a pipeline of DN | TT machines across racks; DataNodes send block reports back, and the Secondary NameNode checkpoints the fsimage/edits.]
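The write path looks symmetric from the client; a minimal sketch (illustrative path, configuration again from the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // create() asks the NameNode to allocate blocks; as the stream fills,
    // each block goes to the first DataNode and is replicated down the pipeline.
    Path file = new Path("/user/demo/output.txt"); // illustrative path
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("hello, hdfs\n".getBytes());
    }
  }
}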


Hadoop 1 - Running Jobs

[Diagram: running a job in Hadoop 1. The Hadoop client submits the job to the JobTracker, which deploys map and reduce tasks to TaskTrackers (DN | TT) across racks; map output is shuffled to the reduce tasks, which write the result (part 0) back to HDFS.]


Hadoop 1 - Security

[Diagram: Hadoop 1 security. Users outside the firewall authenticate (authN/authZ) against LDAP/AD and the KDC via a client node/spoke server, then issue service requests to the Hadoop cluster; an encryption plugin can be added. A block token is used for accessing data; a delegation token is used for running jobs.]
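On a kerberized cluster the client authenticates before making any of the requests above. A minimal sketch using Hadoop's UserGroupInformation API; the principal and keytab path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    // Tell the client libraries that the cluster expects Kerberos.
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate against the KDC (principal and keytab are illustrative);
    // later FileSystem and job calls carry the resulting credentials and
    // pick up block and delegation tokens under the hood.
    UserGroupInformation.loginUserFromKeytab(
        "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
  }
}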


Hadoop 1 - APIs

• org.apache.hadoop.mapreduce.Partitioner
• org.apache.hadoop.mapreduce.Mapper
• org.apache.hadoop.mapreduce.Reducer
• org.apache.hadoop.mapreduce.Job
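Tying these together, the canonical example is word count; a condensed sketch against the org.apache.hadoop.mapreduce API (input and output paths are taken from the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word after the shuffle.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // keys go to reducers via the default HashPartitioner
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}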

Present

Hadoop 2

• Potentially up to 10,000 nodes per cluster
• O(cluster size)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• MRv1 backward- and forward-compatible
• Any app can integrate with Hadoop
• Beyond Java


Hadoop 2 - Basics


Hadoop 2 - Reading Files (w/ NN Federation)

[Diagram: reading a file in Hadoop 2 with NameNode federation. The Hadoop client reads through one of several federated NameNodes (NN1/ns1 through NN4/ns4), each managing its own namespace; checkpointing is per NameNode, via a Secondary NameNode or a Backup NN kept in sync (fs sync). The NameNode returns the DataNodes, block IDs, etc., and the client reads the blocks directly from the DataNode/NodeManager (DN | NM) machines across racks, which register and send heartbeats/block reports. Each namespace (ns1-ns4) owns a block pool spread over the shared DataNodes, e.g. ns1 on dn1, dn2; ns2 on dn1, dn3; ns3 on dn4, dn5; ns4 on dn4, dn5.]
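From the client's point of view, federation simply means more than one NameNode URI; a small sketch (host names are illustrative, and the commented viewfs:// line assumes a client-side mount table has been configured):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Each namespace has its own NameNode, addressable by URI.
    FileSystem ns1 = FileSystem.get(URI.create("hdfs://nn1.example.com:8020"), conf);
    System.out.println(ns1.exists(new Path("/data/owned-by-ns1")));

    // Alternatively, a client-side mount table can stitch all namespaces
    // into one view, e.g.:
    // FileSystem view = FileSystem.get(URI.create("viewfs:///"), conf);
  }
}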


Hadoop 2 - Writing Files

[Diagram: writing a file in Hadoop 2. The client requests the write from the owning NameNode (NN1/ns1 ... NN4/ns4) and gets back the target DataNodes, etc.; blocks are written and replicated through a pipeline of DN | NM machines across racks, which send block reports back; checkpointing is again per NameNode, via a Secondary NameNode or a Backup NN kept in sync (fs sync).]


Hadoop 2 - Running Jobs

[Diagram: running jobs in Hadoop 2. Hadoop clients 1 and 2 create and submit app1 and app2 to the ResourceManager, which couples an ApplicationsManager (ASM) with a Scheduler that partitions cluster resources into queues. Each application gets its own ApplicationMaster (AM1, AM2) running in a container on a NodeManager; the AM negotiates further containers (C1.1-C1.4, C2.1-C2.3) from the Scheduler and sends status reports, while the NodeManagers report to the ResourceManager.]


Hadoop 2 - Security

[Diagram: Hadoop 2 security. JDBC clients, REST clients, and browsers (HUE) pass through a firewall into a DMZ hosting a Knox Gateway cluster; Knox authenticates against LDAP/AD, the KDC, or an enterprise/cloud SSO provider, and requests then cross a second firewall into the Hadoop cluster, where native Hive/HBase encryption applies.]


Hadoop 2 - APIs

• org.apache.hadoop.yarn.api.ApplicationClientProtocol
• org.apache.hadoop.yarn.api.ApplicationMasterProtocol
• org.apache.hadoop.yarn.api.ContainerManagementProtocol
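In practice most applications drive ApplicationClientProtocol through the higher-level YarnClient library. A hedged sketch of creating and submitting an application; the ApplicationMaster command line and resource sizes are illustrative:

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmit {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // Ask the ResourceManager for a new application id.
    YarnClientApplication app = yarn.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app"); // illustrative

    // Describe how to launch the ApplicationMaster container.
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        Collections.singletonList("java -Xmx256m com.example.DemoAppMaster")); // illustrative
    ctx.setAMContainerSpec(amContainer);

    // Resources the AM container itself needs.
    Resource res = Records.newRecord(Resource.class);
    res.setMemory(512);
    res.setVirtualCores(1);
    ctx.setResource(res);
    ctx.setPriority(Priority.newInstance(0));

    ApplicationId id = yarn.submitApplication(ctx);
    System.out.println("Submitted " + id);
  }
}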

Future


Apache Tez: A New Hadoop Data Processing Framework


HDP: Enterprise Hadoop Distribution


• The ONLY 100% open source and complete distribution
• Enterprise grade, proven and tested at scale
• Ecosystem endorsed to ensure interoperability

[Diagram: the HDP stack. Hadoop Core: HDFS and MapReduce, with YARN*, Tez* and other engines. Data Services: Hive & HCatalog, Pig, HBase. Load & Extract: Sqoop, Flume, NFS, WebHDFS, Knox*. Operational Services: Oozie, Ambari, Falcon*. Platform Services: enterprise readiness through high availability, disaster recovery, rolling upgrades, security and snapshots.]


Tez (“Speed”)

• What is it?
  – A data processing framework, as an alternative to MapReduce
  – A new incubation project in the ASF
• Who else is involved?
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
  – Widens the platform for Hadoop use cases
  – Crucial to improving the performance of low-latency applications
  – Core to the Stinger initiative
  – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop


Moving Hadoop Beyond MapReduce

• Low-level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• New base for MapReduce, Hive, Pig, Cascading, etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline


Tez - Core Idea

• A Task with pluggable Input, Processor and Output
• A YARN ApplicationMaster to run a DAG of Tez Tasks

Tez Task = <Input, Processor, Output>
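As a conceptual sketch only (the real interfaces live in org.apache.tez.runtime.api and differ in detail), the model looks like this:

// A conceptual sketch of Tez's task model, not the actual Tez interfaces.
interface Input {
  Iterable<byte[]> read();            // e.g. HDFS input, shuffle input
}

interface Output {
  void write(byte[] record);          // e.g. sorted output, HDFS output
}

interface Processor {
  void run(Input in, Output out);     // e.g. map logic, reduce logic
}

// A Tez task is the triple <Input, Processor, Output>; the framework
// (a YARN ApplicationMaster) chains such tasks into a DAG.
final class TezStyleTask {
  private final Input in;
  private final Processor proc;
  private final Output out;

  TezStyleTask(Input in, Processor proc, Output out) {
    this.in = in;
    this.proc = proc;
    this.out = out;
  }

  void execute() {
    proc.run(in, out);
  }
}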


Building Blocks for Tasks

• MapReduce 'Map' task: HDFS Input → Map Processor → Sorted Output
• MapReduce 'Reduce' task: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
• Special Pig/Hive 'Map' (Tez task): HDFS Input → Map Processor → Pipeline Sorter Output
• Special Pig/Hive 'Reduce' (Tez task): Shuffle Skip-merge Input → Reduce Processor → Sorted Output
• In-memory Map (Tez task): HDFS Input → Map Processor → In-memory Sorted Output


Pig/Hive on MapReduce versus Pig/Hive on Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: on MapReduce this query runs as three chained jobs (Job 1, Job 2, Job 3), with an I/O synchronization barrier, writing intermediate results to HDFS, between each job; on Tez it runs as a single job with no intermediate barriers.]


Tez on YARN: Going Beyond Batch

• Tez optimizes execution: a new runtime engine for more efficient data processing
• Always-on Tez service: low-latency processing for all Hadoop data processing


Apache Knox: Secure Access to Hadoop


Knox Initiative: Make Hadoop Security Simple

• Simplify Security: make security simple for both users and operators. Provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
• Aggregate Access: deliver unified and centralized access to the Hadoop cluster. Make Hadoop feel like a single application to users.
• Client Agility: ensure service users are abstracted from where services are located and how services are configured and scaled.


Knox: Make Hadoop Security Simple

[Diagram: clients call the Knox Gateway over {REST}; the gateway performs authentication and verification against a user store (KDC, AD, LDAP) and then forwards the requests to the Hadoop cluster.]
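For example, a REST client can list an HDFS directory through the gateway instead of talking to the NameNode directly. A sketch; the gateway host, the "default" topology and the credentials are all illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxListDir {
  public static void main(String[] args) throws Exception {
    // Knox proxies WebHDFS under /gateway/<topology>/webhdfs/v1;
    // host, port, topology and credentials are illustrative.
    URL url = new URL(
        "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();

    // The gateway authenticates the caller (e.g. HTTP Basic against LDAP/AD)
    // so clients need no Kerberos machinery of their own.
    String creds = Base64.getEncoder()
        .encodeToString("guest:guest-password".getBytes());
    conn.setRequestProperty("Authorization", "Basic " + creds);

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON FileStatuses listing
      }
    }
  }
}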


Knox: Next Generation of Hadoop Security

• All users see one end-point website
• All online systems see one end-point RESTful service
• Consistency across all interfaces and capabilities
• Firewalled cluster that no end users need to access
• More IT-friendly; enables systems admins, DB admins, security admins and network admins

[Diagram: end users, online apps and analytics tools pass through a firewall to the gateway, then through a second firewall to the Hadoop cluster.]


Apache Falcon: Data Lifecycle Management for Hadoop


Data Lifecycle on Hadoop is Challenging

Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools in play: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Problem: a patchwork of tools complicates data lifecycle management. Result: long development cycles and quality challenges.


Falcon: One-stop Shop for Data Lifecycle

Apache Falcon provides the data management functions above and orchestrates the underlying tools:

Provides: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Orchestrates: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Falcon provides a single interface to orchestrate the data lifecycle, so sophisticated DLM is easily added to Hadoop applications.


Falcon At A Glance

>  Falcon provides the key services data processing applications need.
>  Complex data processing logic is handled by Falcon instead of hard-coded in apps.
>  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.

[Diagram: data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; Falcon in turn provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]


Falcon Core Capabilities

• Core functionality: pipeline processing, replication, retention, late data handling
• Automates: scheduling and retry; recording audit, lineage and metrics
• Operations and management: monitoring, management, metering; alerts and notifications; multi-cluster federation
• CLI and REST API


Falcon Example: Multi-Cluster Failover

>  Falcon manages workflow, replication or both.
>  Enables business continuity without requiring full data reprocessing.
>  Failover clusters require less storage and CPU.

[Diagram: on the primary Hadoop cluster, data moves through Staged, Cleansed, Conformed and Presented stages; replication copies the Staged and Presented data to a failover Hadoop cluster, which serves BI and analytics.]


Falcon Example: Retention Policies

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or data re-processing.

• Staged data: retain 5 years
• Cleansed data: retain 3 years
• Conformed data: retain 3 years
• Presented data: retain last copy only


Falcon Example: Late Data Handling

>  Processing waits until all data is available.
>  Developers don't write complex data handling rules within applications.

[Diagram: online transaction data (pulled via Sqoop) and web log data (pushed via FTP) land in a staging area and are combined into one dataset; processing waits up to 4 hours for the FTP data to arrive.]


Multi Cluster Management with Prism

>  Prism is the part of Falcon that handles multi-cluster management.
>  Key use cases: replication and data processing that span clusters.


Hortonworks Sandbox: Go from Zero to Big Data in 15 Minutes


Sandbox: A Guided Tour of HDP


• Tutorials and videos give a guided tour of HDP and Hadoop
• Perfect for beginners or anyone learning more about Hadoop
• Installs easily on your laptop or desktop
• Browse and manage HDFS files
• Easily import data and create tables
• Easy-to-use editors for Apache Pig and Hive
• Latest tutorials pushed directly to your Sandbox


THANK YOU!

Chris Harris
charris@hortonworks.com

Download Sandbox: hortonworks.com/sandbox