CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

A Publish-Subscribe Distributed Notification System on Hadoop Jyotiska Nath Khasnabish IIIT-Bangalore


Publish Subscribe based Hadoop Distributed Notification System

Transcript of CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Page 1: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

A Publish-Subscribe Distributed Notification System on Hadoop

Jyotiska Nath Khasnabish

IIIT-Bangalore

Page 2: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Hadoop

Open source distributed framework for processing “Big Data”.

Offers a distributed file system (HDFS) for storing massive amounts of data across clusters.

Uses MapReduce as a programming model for processing large amounts of data.

Adopted and used in production by 1000+ companies worldwide.

20+ popular Hadoop-based subprojects and growing.

Page 3: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Distributed Notification System

[HDFS-1742] talks about a system that could notify interested clients about major HDFS events (like file creation, deletion, etc.) and about the end of MapReduce jobs.

[HDFS-2760] talks about adding a PubSub system on HDFS for sending notification messages to clients subscribed to specific services.

[HDFS-7821] talks about an event notification system which would:

Provide periodic updates to subscribed users.

Provide the capability to let users specify 'interesting events'.

Provide a 'customizable' and 'configurable' interface such that user-defined parameters can also be 'subscribed' to by the user.

Page 4: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Publish Subscribe Model
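The model named on this slide can be illustrated with a minimal in-memory sketch. The `Broker` class and the topic names used with it are hypothetical and chosen only for illustration; they are not taken from the presentation or from any Hadoop API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Routes each published message to every subscriber of its topic."""

    def __init__(self) -> None:
        # topic name -> callbacks registered for that topic
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register a callback that is invoked for every message on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message and return how many subscribers received it."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)
```

The publisher only knows the topic, never the subscribers, which is what lets services be added or removed without changing the sender.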

Page 5: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Messaging Systems

Apache ActiveMQ

Uses JMS (Java Message Service) for sending and receiving messages.

Three components – Publisher, Broker, Subscriber.

Supports both persistent and non-persistent messaging.

Apache Kafka

Developed by LinkedIn.

Three components – Producer, Broker, Consumer.

Supports both persistent and non-persistent messaging.

Uses ZooKeeper for coordination.
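To make the idea of persistent messaging concrete, here is a rough sketch in which messages are kept in an append-only log and each consumer tracks how far it has read. This is an assumption-laden illustration only: `PersistentTopic` is a hypothetical class, the in-memory list stands in for durable storage, and none of this is the ActiveMQ or Kafka API.

```python
from typing import Dict, List


class PersistentTopic:
    """Append-only message log; each consumer has its own read offset."""

    def __init__(self) -> None:
        self._log: List[str] = []          # stand-in for durable storage
        self._offsets: Dict[str, int] = {}  # consumer name -> next index to read

    def publish(self, message: str) -> None:
        """Append the message; it stays in the log after being read."""
        self._log.append(message)

    def poll(self, consumer: str) -> List[str]:
        """Return messages this consumer has not seen, then advance its offset."""
        start = self._offsets.get(consumer, 0)
        unread = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return unread
```

Because messages are retained, a consumer that connects late still receives everything published before it arrived; with non-persistent delivery those messages would simply be lost to it.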

Page 6: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Architecture

Page 7: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Use Cases

Page 8: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

1. Message Passing

Sending status flags or progress reports of running jobs among multiple Hadoop services.

Hadoop services can take the role of either a publisher or a subscriber.

Example: TaskTrackers notify the JobTracker of their status only when there is a status change.
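The example on this slide, publishing only on a status change, could look roughly like the sketch below. `StatusReporter` and its `publish` callback are hypothetical names; the real TaskTracker and JobTracker interfaces are not shown in the presentation.

```python
from typing import Callable, Optional


class StatusReporter:
    """Publishes a service's status only when it differs from the last one sent."""

    def __init__(self, name: str, publish: Callable[[str, str], None]) -> None:
        self.name = name
        self._publish = publish             # hypothetical hook into the messaging system
        self._last: Optional[str] = None    # last status that was actually published

    def report(self, status: str) -> bool:
        """Return True if a message was published, False if it was suppressed."""
        if status == self._last:
            return False
        self._last = status
        self._publish(self.name, status)
        return True
```

Repeated reports of an unchanged status produce no traffic, so the number of messages depends on how often the status changes rather than on how often it is checked.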

Page 9: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

2. Notification for Data Availability

Chained jobs get notified about the completion of some other job on which they are dependent.

No need to poll the NameNode for data availability in the HDFS.

Multiple subscribed services or jobs can be notified when the data is available.
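A small sketch of this use case, assuming a hypothetical `DataAvailabilityNotifier` that sits between the job producing the data and the jobs waiting for it. The class, its methods, and the paths are illustrative and do not come from the presentation or from HDFS.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


class DataAvailabilityNotifier:
    """Notifies waiting subscribers once a path is announced as available."""

    def __init__(self) -> None:
        self._waiting: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)
        self._available: Set[str] = set()

    def subscribe(self, path: str, callback: Callable[[str], None]) -> None:
        """Wait for a path; fire immediately if it is already available."""
        if path in self._available:
            callback(path)
        else:
            self._waiting[path].append(callback)

    def mark_available(self, path: str) -> int:
        """Announce the path and return how many waiting subscribers were notified."""
        self._available.add(path)
        callbacks = self._waiting.pop(path, [])
        for callback in callbacks:
            callback(path)
        return len(callbacks)
```

Each subscriber receives exactly one callback per path, in contrast to a polling loop that would query repeatedly until the data shows up.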

Page 10: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

3. Event Based Job Chaining

Multiple MapReduce jobs can be chained based on events occurring in the Hadoop cluster.

Easier for workflow managers to chain jobs and trigger workflows automatically.

Automatic setting of job dependency for heavily chained MapReduce jobs in order to accomplish a complex computation.
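Event-based chaining can be sketched as a dependency map that is updated by job-completion events. `JobChain` and the job names are hypothetical, and "launching" a job is reduced to recording its name; a real workflow manager would submit the job to the cluster at that point.

```python
from typing import Dict, List, Set


class JobChain:
    """Launches a job once completion events arrive for all its dependencies."""

    def __init__(self, dependencies: Dict[str, Set[str]]) -> None:
        # Copy the sets so the caller's mapping is not mutated.
        self._pending: Dict[str, Set[str]] = {
            job: set(deps) for job, deps in dependencies.items()
        }
        self.launched: List[str] = []

    def on_job_completed(self, finished: str) -> List[str]:
        """Handle one completion event and return the jobs it unblocked."""
        ready: List[str] = []
        for job in sorted(self._pending):
            deps = self._pending[job]
            if finished in deps:
                deps.discard(finished)
                if not deps:
                    ready.append(job)
        for job in ready:
            del self._pending[job]
            self.launched.append(job)  # stand-in for submitting the job
        return ready
```

For example, with `{"join": {"extract_a", "extract_b"}, "report": {"join"}}`, the `join` job is released only after both extract jobs have reported completion, and `report` follows once `join` finishes.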

Page 11: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Cluster Configuration

Specification      Machine #1     Machine #2     Machine #3
Processing Speed   2.3 GHz        2.3 GHz        2.3 GHz
RAM                2 GB           2 GB           2 GB
Disk Space         8 GB           8 GB           8 GB
OS                 Ubuntu 12.04   Ubuntu 12.04   Ubuntu 12.04
Hadoop Version     1.1.1          1.1.1          1.1.1
ActiveMQ Version   5.8.0          5.8.0          5.8.0
Kafka Version      0.8            0.8            0.8

Page 12: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Performance Analysis: ActiveMQ vs Kafka

Page 13: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Performance Analysis: Single Node vs Multi Node

Page 14: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Performance Comparison: With and Without Notification System

Page 15: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Hadoop Cluster Load

Figure panels: Before, After

Page 16: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Network Bandwidth Consumption

Figure panels: Before, After

Page 17: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Mobile Client

Page 18: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Conclusion

Distributed notification system based on Publish Subscribe messaging model.

Can be used to pass messages between services, notify subscribed clients and chain multiple jobs.

Significantly reduces cluster load and network bandwidth consumption, resulting in better use of hardware and resources.

Can be scaled to large Hadoop clusters (more than 100 or 1,000 nodes) to handle heavily interdependent jobs.

Page 19: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK

Thank you