Interactive workflow management using Azkaban
Interactive Workflow Management using Azkaban
API driven workflow management for Spark
https://github.com/phatak-dev/interactive-azkaban
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Different kinds of applications in Spark
● Why interactive?
● Building an interactive application
● Workflow in big data
● Challenges of interactive applications
● Azkaban
● Azkaban in manual/batch mode
● Azkaban AJAX API
● Azkaban client in Scala
Big Data Applications
● Applications in big data are typically divided by their workloads
● The major divisions are
○ Batch applications
○ Streaming applications
● Most existing platforms support both kinds of application these days
● But a new category of applications is on the rise: interactive applications
Big Data Interactive Applications
● Ability to manipulate data in an interactive way
● Exploratory in nature
● Moves away from the notion that ETL and analysis have to live in silos
● Combines batch and streaming data
● For development
○ Zeppelin, Jupyter Notebook etc.
● For production
○ Datameer, Tellius, Zoomdata etc.
Spark and Interactive Applications
● Apache Spark is the only big data platform built from scratch to support interactive applications
● Spark made interactive data exploration using notebooks popular
● Caching and lazy evaluation make it a great tool for interactive systems
● As Spark combines ETL, exploration and advanced analytics in one platform, we can do all the data work in an interactive fashion
Building an Interactive Application
[Architecture diagram with components: REST API Client, REST based Spark Application, Spark Cluster, Database, HDFS]
Akka-Http
● Framework for building reactive web applications and services
● Built on top of Akka abstractions for concurrency
● Successor to the popular REST framework Spray
● As streams are the base abstraction, it works well with Spark
● Written in Scala, with APIs in Java and Scala
● We will use a local Spark session to interact with Spark
Simple API
● Below is the API we expose
○ /load - for loading the data
○ /view - for looking at sample data
○ /schedule - for scheduling operations
● These operations are simple, but they show what an API based system looks like
● We test the APIs using Postman to emulate interactive mode
● Ex: RestService.scala
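The shape of such an API can be sketched without Akka-Http or Spark. The following is a minimal stand-in using only the Python standard library, not the RestService.scala from the talk: the in-memory `STATE` dictionary replaces the Spark session, and the payload fields (`rows`, `limit`) are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory stand-in for the state a Spark session would hold (assumption).
STATE = {"rows": [], "schedules": []}


def handle(path, payload, state=STATE):
    """Route one request to /load, /view or /schedule; returns (status, body)."""
    if path == "/load":
        # The payload shape {"rows": [...]} is hypothetical.
        state["rows"] = list(payload.get("rows", []))
        return 200, {"loaded": len(state["rows"])}
    if path == "/view":
        # Return a small sample of whatever was loaded.
        limit = int(payload.get("limit", 5))
        return 200, {"sample": state["rows"][:limit]}
    if path == "/schedule":
        # Only record the request; a real service would call the scheduler here.
        state["schedules"].append(payload)
        return 200, {"scheduled": len(state["schedules"])}
    return 404, {"error": "unknown path " + path}


class Handler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper: decode the JSON body, delegate to handle(), encode the reply."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length") or 0)
        payload = json.loads(self.rfile.read(length) or b"{}")
        status, body = handle(self.path, payload)
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the server; the port is arbitrary."""
    HTTPServer(("localhost", port), Handler).serve_forever()
```

Calling `serve()` and then posting JSON to `/load`, `/view` and `/schedule` from Postman reproduces the interactive loop described above. The important property is that state from one call is visible to the next.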
Workflow management in Big Data
Need for Workflow in Big Data
● Most of the tasks we do in big data are repetitive in nature
● Once we have determined our flow, we want to run it on new data as and when it arrives
● Two parts:
○ Flow definition
○ Scheduling
● Use cases
○ ETL, updating models etc.
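The split between flow definition and scheduling can be illustrated with a small sketch. It assumes a flow is a dependency graph of named steps and that scheduling means re-running the same flow for every arriving batch. The step names and the `run_step` callback are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Flow definition: each step maps to the set of steps it depends on.
# The step names are hypothetical.
flow = {
    "extract": set(),
    "transform": {"extract"},
    "update_model": {"transform"},
}


def run_flow(flow, run_step):
    """Run every step after its dependencies; returns the execution order."""
    order = list(TopologicalSorter(flow).static_order())
    for step in order:
        run_step(step)
    return order


def schedule(flow, batches, run_step):
    """Scheduling stand-in: re-run the same flow once per arriving batch."""
    return [run_flow(flow, lambda step: run_step(step, batch)) for batch in batches]
```

A real scheduler triggers on time or data arrival rather than iterating over a list, but the separation is the same: the flow is defined once and executed many times.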
Workflow for Batch
● Most scheduling for batch applications is done using some kind of scripting
● There are many ways to define and execute a flow
● Once the code is tested, it is deployed and the scripts are scheduled
● These scripts define the flow structure and use a scheduler to run the operations
● Well known frameworks for batch scheduling are
○ Oozie
○ Airflow
Workflow for Streaming
● Streaming frameworks themselves usually handle the workflow needs of the application
● The Spark Streaming code defines the flow that needs to be run
● The Spark Streaming scheduler runs the flow as and when new data appears
● So we rarely use an external workflow framework to execute these workloads
Workflow for Interactive Applications
● Ability to define workflows on the fly rather than fixed workflows as in the case of batch
● Ability to schedule and unschedule using APIs
● Should be able to handle both batch and streaming sources of data
● Should integrate with the state built up through interactions in interactive mode
● Ability to monitor the status of running jobs in real time
Challenges of Scheduling for Interactive Applications
● Most workflow systems do not expose a REST API for defining flows and scheduling them
● Many lack a good monitoring system for querying the status of running tasks, which is critical
● Most workflow systems run on their own sandboxed execution engine, which makes them hard to integrate with the application state
● More details [2]
Azkaban
● Azkaban is a workflow job scheduler created at LinkedIn to run Hadoop jobs
● Has good support for defining dependencies through its flow mechanism and for monitoring jobs
● Allows extending the UI to track new metrics
● Supports multiple runtimes such as
○ Hadoop
○ Spark
○ Java
Azkaban Batch Mode
● Azkaban is primarily built for scheduling big data batch jobs
● It has a simple DSL to define the flows
● It allows us to define different executors for a given flow
● The abstractions
○ Project
○ Flow
● Ex: Running a Java flow using the Azkaban UI
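To give a feel for the flow DSL, the sketch below renders two `.job` property files and packs them into a zip, which is the form in which a project is uploaded. It uses `type=command` jobs rather than the Java flow shown in the talk, and the job names and commands are hypothetical. Check the property names against the documentation for your Azkaban version.

```python
import io
import zipfile


def job_file(job_type, command, dependencies=()):
    """Render one .job file as key=value properties."""
    lines = ["type=" + job_type, "command=" + command]
    if dependencies:
        # A job starts only after all listed jobs have finished.
        lines.append("dependencies=" + ",".join(dependencies))
    return "\n".join(lines) + "\n"


def project_zip(jobs):
    """Pack {job name: file text} into an in-memory zip ready for upload."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in jobs.items():
            archive.writestr(name + ".job", text)
    return buffer.getvalue()


# Hypothetical two-step flow: "report" runs after "load".
jobs = {
    "load": job_file("command", "echo loading"),
    "report": job_file("command", "echo reporting", dependencies=["load"]),
}
archive_bytes = project_zip(jobs)
```

Because the flow is just text in a zip, an application can generate it on the fly, which is what makes the batch oriented DSL usable from an interactive system.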
Azkaban for Interactive Workflows
Azkaban AJAX API
● Though Azkaban is primarily built for batch jobs, it has an AJAX API for interacting with the workflow system
● This API was primarily built for the UI to interact with the engine
● Though it is not a full-fledged REST API, it is good enough to build an interactive workflow system
● This AJAX API makes Azkaban a good fit as a workflow management system for interactive applications
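As a rough illustration of what talking to the AJAX API involves, the sketch below only builds the requests for logging in, starting a flow and querying an execution. It does not send them. The server address is hypothetical, and the action and parameter names are given as I understand them from the AJAX API documentation listed in the references, so verify them against your Azkaban version.

```python
from urllib.parse import urlencode

# Hypothetical address of an Azkaban solo server.
AZKABAN_URL = "http://localhost:8081"


def login_request(username, password):
    """POST target and form body for login; the JSON reply carries a session id."""
    body = urlencode({"action": "login", "username": username, "password": password})
    return AZKABAN_URL + "/", body


def execute_flow_url(session_id, project, flow):
    """GET URL that starts one execution of a flow."""
    query = urlencode({
        "ajax": "executeFlow",
        "session.id": session_id,
        "project": project,
        "flow": flow,
    })
    return AZKABAN_URL + "/executor?" + query


def fetch_exec_flow_url(session_id, exec_id):
    """GET URL that returns the status of a running or finished execution."""
    query = urlencode({
        "ajax": "fetchexecflow",
        "session.id": session_id,
        "execid": exec_id,
    })
    return AZKABAN_URL + "/executor?" + query
```

The session id returned by the login call has to be threaded through every later request, which is one of the rough edges a client library can hide.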
Azkaban Scala Client
● The Azkaban AJAX API has some rough edges, as it is not meant to work as a standard REST API
● Interacting with the API directly in your application will be painful
● azkaban-scala-client is a Scala client that makes interacting with Azkaban much easier
● Most of the APIs are exposed in Scala; feature requests are welcome
● https://github.com/phatak-dev/azkaban-scala-client
Schedule in REST API
● Now that we understand how to use the Azkaban API to interact with the workflow manager, we can use it in our REST API
● We will use our Scala client to interact with Azkaban
● The implementation of the flow makes a request to the REST server in order to use the state available there
● Ex: Scheduler.scala
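The callback idea can be sketched as follows. This is not the Scheduler.scala from the talk. It assumes the scheduled job is a `type=command` job whose only work is an HTTP call back to the REST server, and that the flow is scheduled through the `scheduleCronFlow` AJAX action. Both addresses and the `/run-flow` endpoint are hypothetical.

```python
from urllib.parse import urlencode

# Both addresses are hypothetical.
AZKABAN_URL = "http://localhost:8081"
REST_SERVER_URL = "http://localhost:8080"


def callback_job(endpoint):
    """Job file whose only work is calling back the REST server, which holds the state."""
    return "type=command\ncommand=curl -X POST {}{}\n".format(REST_SERVER_URL, endpoint)


def schedule_request(session_id, project, flow, cron):
    """POST target and form body for scheduling a flow with a cron expression."""
    body = urlencode({
        "ajax": "scheduleCronFlow",
        "session.id": session_id,
        "projectName": project,
        "flow": flow,
        "cronExpression": cron,
    })
    return AZKABAN_URL + "/schedule", body
```

With this arrangement Azkaban only decides when something runs. The work itself executes inside the REST application, so it sees the data and state built up during the interactive session.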
References
● [1] http://blog.madhukaraphatak.com/interactive-scheduling-using-azkaban-setting-up-solo-server/
● [2] http://blog.madhukaraphatak.com/interactive-scheduling-using-azkaban-challenges-in-scheduling-interactive-workloads/
● [3] http://azkaban.github.io/azkaban/docs/latest/#ajax-api
● [4] https://github.com/azkaban/azkaban