Streaming Live Data and the Hadoop Ecosystem

Oleg Zhurakousky (@z_oleg)
© Hortonworks Inc. 2013

Page 1: Streaming Live Data and the Hadoop Ecosystem

© Hortonworks Inc. 2013

Streaming Live data in Hadoop Ecosystem

Oleg Zhurakousky (@z_oleg)

Page 2: Streaming Live Data and the Hadoop Ecosystem

Simplistic view

Diagram labels: Acquire Data, Process, Store Data

Page 3: Streaming Live Data and the Hadoop Ecosystem

Real view

Page 4: Streaming Live Data and the Hadoop Ecosystem

Modern data processing concerns

• Multiple Sources of Data
• Geo Distribution
• Multiple protocols for data transport
• New technologies/products
• New data-processing paradigms
• Security
• New type of users
• Etc.

Page 5: Streaming Live Data and the Hadoop Ecosystem

Apache Hadoop

• Apache Hadoop
  – De facto Big Data open source platform
  – Distributed storage
  – Distributed processing
  – Running for about 6 years in production at hundreds of companies like Yahoo, eBay and Facebook

Page 6: Streaming Live Data and the Hadoop Ecosystem

Storage

Page 7: Streaming Live Data and the Hadoop Ecosystem

HDFS – Hadoop Distributed File System

Page 8: Streaming Live Data and the Hadoop Ecosystem

HDFS - details

Diagram: one Namenode; Datanode_1, Datanode_2, Datanode_3; HDFS Block 1, Block 2, Block 3, Block 4

• URI-based addressing – hdfs://myhost:55555/foo/bar/foo.txt

• Name Nodes and Data Nodes

• Block-based storage

• Data Replication

• Replica placements

• File formats
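
The addressing, block and replication bullets above can be made concrete with a minimal sketch. It uses only the Python standard library, not an HDFS client. The 128 MB block size and the replication factor of 3 are assumed values chosen for illustration (both are configurable in a real cluster), and `parse_hdfs_uri` and `split_into_blocks` are hypothetical helper names.

```python
from urllib.parse import urlsplit

# Assumed values for illustration only; real clusters configure these.
BLOCK_SIZE = 128 * 1024 * 1024   # bytes per block
REPLICATION = 3                  # copies kept of each block


def parse_hdfs_uri(uri):
    """Split an hdfs:// URI into (namenode host, port, file path)."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {uri}")
    return parts.hostname, parts.port, parts.path


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


host, port, path = parse_hdfs_uri("hdfs://myhost:55555/foo/bar/foo.txt")
blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
raw_bytes_stored = sum(blocks) * REPLICATION    # every block is stored REPLICATION times
```

Under these assumptions a 300 MB file occupies two full blocks plus one partial block, and the cluster stores three copies of each.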

Page 9: Streaming Live Data and the Hadoop Ecosystem

Processing

Page 10: Streaming Live Data and the Hadoop Ecosystem

1st Generation Hadoop: Batch Focus

HADOOP 1.0: Built for Batch Apps

Diagram: separate stacks, each a Single App on its own HDFS (BATCH, INTERACTIVE, ONLINE)

• All other usage patterns MUST leverage same infrastructure
• Forces Creation of Silos to Manage Mixed Workloads

Page 11: Streaming Live Data and the Hadoop Ecosystem

Hadoop 1 Limitations

• Lacks support for alternate paradigms and services
  – Forces everything to look like MapReduce
  – Iterative applications in MapReduce are 10x slower
• Scalability
  – Max cluster size ~5,000 nodes
  – Max concurrent tasks ~40,000
• Availability
  – Failure kills queued and running jobs
• Hard partition of resources into map and reduce slots
  – Non-optimal resource utilization
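
The utilization cost of a hard map/reduce slot partition can be illustrated with simple arithmetic. The sketch below is a toy model, not Hadoop code: the slot counts and task counts are made-up numbers, and both function names are hypothetical.

```python
def run_with_fixed_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Hadoop 1 style: a map task may only use a map slot, a reduce task only a reduce slot."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def run_with_shared_pool(map_tasks, reduce_tasks, total_slots):
    """Alternative: any free slot can run any kind of task."""
    return min(map_tasks + reduce_tasks, total_slots)


# Hypothetical node with 8 map slots and 4 reduce slots, during a map-heavy phase.
pending_maps, pending_reduces = 12, 0
fixed = run_with_fixed_slots(pending_maps, pending_reduces, map_slots=8, reduce_slots=4)
shared = run_with_shared_pool(pending_maps, pending_reduces, total_slots=12)
print(f"fixed slots: {fixed}/12 busy, shared pool: {shared}/12 busy")
```

With fixed slots the four reduce slots sit idle while map tasks wait; a shared pool would keep all twelve busy.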

Page 12: Streaming Live Data and the Hadoop Ecosystem

Our Vision: Hadoop as Next-Gen Platform

HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)

HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing), Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)

Page 13: Streaming Live Data and the Hadoop Ecosystem

Hadoop 2 - YARN Architecture

• ResourceManager (RM): central agent; manages and allocates cluster resources
• NodeManager (NM): per-node agent; manages and enforces node resource allocations
• ApplicationMaster (AM): per-application; manages application lifecycle and task scheduling
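
The division of labor between the three roles above can be sketched as a toy model. This is not the YARN API: the class and method names simply mirror the slide, only memory is tracked, and the first-fit placement policy is an assumption made to keep the example short.

```python
class NodeManager:
    """Per-node agent: tracks and enforces this node's memory budget."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def launch(self, memory_mb):
        if memory_mb > self.free_mb:
            raise RuntimeError(f"{self.name}: not enough free memory")
        self.free_mb -= memory_mb


class ResourceManager:
    """Central agent: decides which node serves each container request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        # First-fit placement (an assumed policy, for illustration only).
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(memory_mb)
                return node.name
        return None  # cluster is full; the request cannot be served now


class ApplicationMaster:
    """Per-application: asks the RM for one container per task."""

    def __init__(self, rm, task_memory_mb):
        self.rm = rm
        self.task_memory_mb = task_memory_mb

    def run(self, num_tasks):
        return [self.rm.allocate(self.task_memory_mb) for _ in range(num_tasks)]


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
placements = ApplicationMaster(rm, task_memory_mb=2048).run(num_tasks=4)
```

In this run the first two tasks land on `node1`, the third on `node2`, and the fourth gets `None` because no node has 2048 MB free.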

Page 14: Streaming Live Data and the Hadoop Ecosystem

YARN: Taking Hadoop Beyond Batch

Applications Run Natively in Hadoop

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search, Weave, …)

YARN (Cluster Resource Management)

HDFS2 (Redundant, Reliable Storage)

• Store ALL DATA in one place…
• Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

Page 15: Streaming Live Data and the Hadoop Ecosystem

Hadoop/YARN Eco-system

Applications
• Apache Giraph – Graph Processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – Batch
• Apache Tez – Batch/Interactive
• Apache Samza – Stream Processing
• Apache Storm – Stream Processing
• Apache Spark – Iterative applications
• Elastic Search – Scalable Search
• Apache NiFi
• Apache Kafka
• …

Frameworks
• Apache Twill
• REEF by Microsoft
• Spring for Apache Hadoop
• …

Page 16: Streaming Live Data and the Hadoop Ecosystem

Let’s write some code

DEMO

Page 17: Streaming Live Data and the Hadoop Ecosystem

Streaming usage patterns

1. Capture -> Persist

2. Capture -> Process -> Persist

3. Capture -> Buffer -> Process -> Persist
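
The third pattern above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the demo: the in-memory list stands in for a persistent store such as HDFS, the upper-casing step is a placeholder for real processing, and all names are hypothetical.

```python
import queue
import threading

buffer = queue.Queue(maxsize=100)   # bounded buffer between capture and processing
store = []                          # stand-in for a persistent store such as HDFS
DONE = object()                     # sentinel marking the end of the stream


def capture(events):
    """Capture: push incoming events into the buffer, blocking when it is full."""
    for event in events:
        buffer.put(event)
    buffer.put(DONE)


def process(event):
    """Process: a placeholder transformation."""
    return event.upper()


def consume():
    """Drain the buffer, process each event, then persist the result."""
    while True:
        event = buffer.get()
        if event is DONE:
            break
        store.append(process(event))


consumer = threading.Thread(target=consume)
consumer.start()
capture(["login", "click", "logout"])
consumer.join()
```

Removing the buffer and calling `process` directly from `capture` gives pattern 2; appending raw events to the store without `process` gives pattern 1. The buffer lets capture keep accepting events when processing is temporarily slower than arrival.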

Page 18: Streaming Live Data and the Hadoop Ecosystem

Thank you!

Download Sandbox: Experience Apache Hadoop. Both 2.0 and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/

Questions?