Streaming Live Data and the Hadoop Ecosystem

© Hortonworks Inc. 2013

Streaming Live Data in the Hadoop Ecosystem

Oleg Zhurakousky (@z_oleg)


Simplistic view

[Diagram: Acquire Data, Process, Store Data]


Real view


Modern data processing concerns

• Multiple sources of data
• Geo distribution
• Multiple protocols for data transport
• New technologies/products
• New data-processing paradigms
• Security
• New types of users
• Etc.


Apache Hadoop

• Apache Hadoop
  – De facto Big Data open source platform
  – Distributed storage
  – Distributed processing
  – Running for about 6 years in production at hundreds of companies like Yahoo, eBay and Facebook


Storage


HDFS – Hadoop Distributed File System


HDFS - details

[Diagram: one Namenode and three Datanodes (Datanode_1, Datanode_2, Datanode_3) holding HDFS Blocks 1 to 4]

• URI-based addressing – hdfs://myhost:55555/foo/bar/foo.txt

• Name Nodes and Data Nodes

• Block-based storage

• Data Replication

• Replica placements

• File formats
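The addressing, block, and replication bullets above can be illustrated with a small toy model. This is only a sketch: the block size and replication factor are assumed values, the function names are hypothetical, and the round-robin placement is a stand-in for illustration (real HDFS replica placement is rack-aware). It does not use any Hadoop API.

```python
from urllib.parse import urlparse

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in a real cluster)
REPLICATION = 3                 # assumed replication factor


def parse_hdfs_uri(uri):
    """Split an hdfs:// URI into (namenode host, namenode port, file path)."""
    parts = urlparse(uri)
    if parts.scheme != "hdfs":
        raise ValueError("not an HDFS URI: " + uri)
    return parts.hostname, parts.port, parts.path


def block_count(file_size):
    """Number of fixed-size blocks needed to hold file_size bytes."""
    return -(-file_size // BLOCK_SIZE)  # ceiling division without floats


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: block index -> list of distinct datanodes.

    Hypothetical policy for illustration only; it just shows that every
    block is stored on several different nodes.
    """
    replication = min(replication, len(datanodes))
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(replication)
        ]
    return placement
```

For example, `parse_hdfs_uri("hdfs://myhost:55555/foo/bar/foo.txt")` separates the Namenode address from the file path, and `place_replicas(block_count(size), ["Datanode_1", "Datanode_2", "Datanode_3"])` shows how losing one Datanode still leaves copies of every block elsewhere.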


Processing


1st Generation Hadoop: Batch Focus

HADOOP 1.0: Built for Batch Apps

All other usage patterns MUST leverage same infrastructure

Forces Creation of Silos to Manage Mixed Workloads

[Diagram: separate single-application silos (BATCH, INTERACTIVE, ONLINE) on HDFS]


Hadoop 1 Limitations

Lacks Support for Alternate Paradigms and Services
• Forces everything to look like MapReduce
• Iterative applications in MapReduce are 10x slower

Scalability
• Max cluster size ~5,000 nodes
• Max concurrent tasks ~40,000

Availability
• Failure kills queued & running jobs

Hard partition of resources into map and reduce slots
• Non-optimal resource utilization
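The effect of a hard partition into map and reduce slots can be shown with simple arithmetic. The sketch below uses made-up slot counts and hypothetical function names, and it ignores real-world details such as differing task memory sizes; it only compares fixed slots against a single shared pool.

```python
def slot_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of slots busy when map and reduce slots are fixed and separate."""
    busy = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_slots, pending_maps, pending_reduces):
    """Fraction of capacity busy when any task may use any free slot."""
    busy = min(total_slots, pending_maps + pending_reduces)
    return busy / total_slots


# Illustrative numbers only: a node with 6 map slots and 4 reduce slots
# during a map-heavy phase (20 maps waiting, no reduces yet).
fixed = slot_utilization(6, 4, pending_maps=20, pending_reduces=0)
pooled = pooled_utilization(10, pending_maps=20, pending_reduces=0)
```

With these numbers the four reduce slots sit idle under the fixed partition even though map tasks are waiting, while the pooled model keeps all ten slots busy.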


Our Vision: Hadoop as Next-Gen Platform

HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)

HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing), Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)


Hadoop 2 - YARN Architecture

• ResourceManager (RM): Central agent - Manages and allocates cluster resources

• NodeManager (NM): Per-Node agent - Manages and enforces node resource allocations

• ApplicationMaster (AM): Per-Application - Manages application lifecycle and task scheduling
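The division of labour between the three YARN roles can be sketched as a toy simulation. The class and method names below mirror the role names on the slide but are hypothetical; this is not the YARN API, and it models only memory, with a first-fit allocation rule assumed for simplicity.

```python
class NodeManager:
    """Per-node agent: tracks and enforces the memory available on one node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def launch(self, memory_mb):
        if memory_mb > self.free_mb:
            raise RuntimeError(self.name + " cannot fit " + str(memory_mb) + " MB")
        self.free_mb -= memory_mb


class ResourceManager:
    """Central agent: hands out containers from the first node with room."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(memory_mb)
                return node.name
        return None  # no node can satisfy a request of this size


class ApplicationMaster:
    """Per-application agent: requests one container per task and tracks them."""

    def __init__(self, rm, tasks, memory_mb):
        self.rm = rm
        self.tasks = tasks
        self.memory_mb = memory_mb

    def run(self):
        scheduled, waiting = {}, []
        for task in self.tasks:
            node = self.rm.allocate(self.memory_mb)
            if node is None:
                waiting.append(task)  # would be retried once resources free up
            else:
                scheduled[task] = node
        return scheduled, waiting
```

The point of the split is that the ResourceManager knows nothing about tasks: scheduling logic lives in the per-application master, which is what lets frameworks other than MapReduce share the same cluster.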


YARN: Taking Hadoop Beyond Batch

Applications Run Natively in Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search, Weave…)

YARN (Cluster Resource Management)

HDFS2 (Redundant, Reliable Storage)

Store ALL DATA in one place…

Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service


Hadoop/YARN Eco-system

Applications
• Apache Giraph – Graph Processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – Batch
• Apache Tez – Batch/Interactive
• Apache Samza – Stream Processing
• Apache Storm – Stream Processing
• Apache Spark – Iterative applications
• Elasticsearch – Scalable Search
• Apache NiFi
• Apache Kafka
• . . .

Frameworks
• Apache Twill
• REEF by Microsoft
• Spring for Apache Hadoop
• . . .


Let’s write some code

DEMO


Streaming usage patterns

1. Capture -> Persist

2. Capture -> Process -> Persist

3. Capture -> Buffer -> Process -> Persist
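The three patterns above can be sketched in a few lines. This is an illustration under stated assumptions, not the demo code from the talk: a Python list stands in for persistent storage, `queue.Queue` stands in for a real buffering layer, and all function names are hypothetical.

```python
import queue
import threading


def capture_persist(source, sink):
    """Pattern 1: write each captured event straight to storage."""
    for event in source:
        sink.append(event)


def capture_process_persist(source, process, sink):
    """Pattern 2: transform each event in line before storing it."""
    for event in source:
        sink.append(process(event))


def capture_buffer_process_persist(source, process, sink, capacity=100):
    """Pattern 3: decouple capture from processing with a bounded buffer."""
    buffer = queue.Queue(maxsize=capacity)
    done = object()  # sentinel marking the end of the stream

    def consumer():
        while True:
            event = buffer.get()
            if event is done:
                break
            sink.append(process(event))

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in source:
        buffer.put(event)  # blocks when the buffer is full (back-pressure)
    buffer.put(done)
    worker.join()
```

The buffer in pattern 3 is what lets a fast or bursty source keep running while a slower processing step catches up; in patterns 1 and 2 the capture side is only as fast as the persist or process step behind it.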


Thank you!

http://hortonworks.com/products/hortonworks-sandbox/

Download Sandbox: Experience Apache Hadoop
Both 2.0 and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/

Questions?

