Interactive workflow management using Azkaban


Transcript of Interactive workflow management using Azkaban

Page 1: Interactive workflow management using Azkaban

Interactive Workflow Management using Azkaban

API driven workflow management for Spark

https://github.com/phatak-dev/interactive-azkaban

Page 2: Interactive workflow management using Azkaban

● Madhukara Phatak

● Technical Lead at Tellius

● Consultant and Trainer at datamantra.io

● Consult in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3: Interactive workflow management using Azkaban

Agenda

● Different Kinds of Applications in Spark
● Why Interactive?
● Building an Interactive Application
● Workflow in Big Data
● Challenges of Interactive Applications
● Azkaban
● Azkaban in manual/batch mode
● Azkaban AJAX API
● Azkaban client in Scala

Page 4: Interactive workflow management using Azkaban

Big Data Applications

● Applications in big data are typically divided by their workloads
● The major divisions are
  ○ Batch Applications
  ○ Streaming Applications
● Most existing platforms support both kinds of applications these days
● But a new category of applications is on the rise, known as interactive applications

Page 5: Interactive workflow management using Azkaban

Big Data Interactive Applications

● Ability to manipulate data in an interactive way
● Exploratory in nature
● Moves away from the notion that ETL and analysis have to be in silos
● Combines batch and streaming data
● For Development
  ○ Zeppelin, Jupyter Notebook etc.
● For Production
  ○ Datameer, Tellius, Zoomdata etc.

Page 6: Interactive workflow management using Azkaban

Spark and Interactive Applications

● Apache Spark is the only big data platform built from scratch to support interactive applications
● Spark made interactive data exploration using notebooks popular
● Caching and intelligent lazy evaluation make it a great tool for interactive systems
● As Spark combines ETL, exploration and advanced analytics in one platform, we can do all the data work in an interactive fashion

Page 7: Interactive workflow management using Azkaban

Building an Interactive Application

Page 8: Interactive workflow management using Azkaban

REST based Spark Application

[Diagram components: REST API Client, Spark Cluster, Database, HDFS]

Page 9: Interactive workflow management using Azkaban

Akka-Http

● Framework to build reactive web applications and services
● Built on top of Akka abstractions for concurrency
● Next version of the popular REST framework Spray
● As streams are the base abstraction, it works well with Spark
● Written in Scala, with APIs in Java and Scala
● We will use a local Spark session to interact with Spark

Page 10: Interactive workflow management using Azkaban

Simple API

● Below is the API we expose
  ○ /load - for loading the data
  ○ /view - for looking at the sample data
  ○ /schedule - for schedule operations
● All these operations are simple, but they show what an API based system looks like
● We test the APIs using Postman to emulate interactive mode
● Ex : RestService.scala
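
To make the shape of such an API concrete, here is a minimal plain-Scala model of the three endpoints. It is not the RestService.scala from the repository and it does not use Akka-Http or Spark: `SimpleApi`, `Response`, the `name` and `rows` parameters and the in-memory maps are all hypothetical stand-ins, chosen only to show how state built by one call is used by the next.

```scala
// Plain-Scala model of the three endpoints; NOT Akka-Http or Spark code.
// Every name here is hypothetical and exists only for illustration.
object SimpleApi {
  final case class Response(status: Int, body: String)

  // State built up by interactive calls: dataset name -> rows.
  private var datasets = Map.empty[String, Vector[String]]
  // Datasets for which a schedule has been requested.
  private var scheduled = Set.empty[String]

  private val endpoints = Set("/load", "/view", "/schedule")

  def handle(path: String, params: Map[String, String]): Response =
    (path, params.get("name")) match {
      case ("/load", Some(name)) =>
        // A real service would read from HDFS or a database through Spark.
        val rows = params.getOrElse("rows", "").split(",").filter(_.nonEmpty).toVector
        datasets += (name -> rows)
        Response(200, s"loaded ${rows.size} rows into $name")
      case ("/view", Some(name)) =>
        datasets.get(name) match {
          case Some(rows) => Response(200, rows.take(2).mkString(",")) // small sample
          case None       => Response(404, s"unknown dataset $name")
        }
      case ("/schedule", Some(name)) if datasets.contains(name) =>
        scheduled += name
        Response(200, s"scheduled $name")
      case ("/schedule", Some(name)) =>
        Response(404, s"unknown dataset $name")
      case (_, None) if endpoints(path) =>
        Response(400, "missing name")
      case _ =>
        Response(404, "no such endpoint")
    }
}
```

Calling `SimpleApi.handle("/load", Map("name" -> "sales", "rows" -> "a,b,c"))` and then `/view` with the same name mirrors what a Postman session against the real service would do: the second request only works because of the state left by the first.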

Page 11: Interactive workflow management using Azkaban

Workflow management in Big Data

Page 12: Interactive workflow management using Azkaban

Need of Workflow in Big Data

● Most of the tasks we do in big data are repetitive in nature
● Once we have determined our flow, we want to run it on new data as and when it arrives
● Two parts -
  ○ Flow Definition
  ○ Scheduling
● Use cases
  ○ ETL, updating models etc.
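
The flow definition half can be sketched as jobs with dependencies plus a function that derives a valid run order. This is a generic illustration, not how any particular workflow engine is implemented; `FlowSketch`, `Job` and `executionOrder` are hypothetical names.

```scala
object FlowSketch {
  // A job names the jobs that must finish before it can start.
  final case class Job(name: String, dependsOn: Set[String] = Set.empty)

  // Returns job names in an order that respects dependencies, or an error
  // message if the flow has an unknown dependency or a cycle.
  def executionOrder(jobs: Seq[Job]): Either[String, List[String]] = {
    val known = jobs.map(_.name).toSet
    val unknown = jobs.flatMap(_.dependsOn).filterNot(known).distinct
    if (unknown.nonEmpty) Left(s"unknown dependencies: ${unknown.mkString(", ")}")
    else {
      // `done` holds finished jobs, most recent first.
      @annotation.tailrec
      def loop(remaining: Seq[Job], done: List[String]): Either[String, List[String]] =
        if (remaining.isEmpty) Right(done.reverse)
        else {
          val finished = done.toSet
          val (ready, blocked) = remaining.partition(_.dependsOn.subsetOf(finished))
          if (ready.isEmpty) Left(s"cycle among: ${blocked.map(_.name).mkString(", ")}")
          else loop(blocked, ready.map(_.name).reverse.toList ::: done)
        }
      loop(jobs, Nil)
    }
  }
}
```

Scheduling is then the separate question of when to trigger this ordered run, for example on a timer or when new data arrives.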

Page 13: Interactive workflow management using Azkaban

Workflow for Batch

● Most of the scheduling for batch applications is done using some kind of scripting
● There are many ways to define and execute a flow
● Once the code is tested, it is deployed and the scripts are scheduled
● These scripts define the flow structure and use some scheduler to run the operations
● Well known frameworks for batch scheduling are
  ○ Oozie
  ○ Airflow

Page 14: Interactive workflow management using Azkaban

Workflow for Streaming

● Streaming frameworks themselves handle the workflow needs of the application most of the time
● The Spark Streaming code defines the flow that needs to be run
● The Spark Streaming scheduler runs the flow as and when new data appears
● So we rarely use an external workflow framework to execute these workloads

Page 15: Interactive workflow management using Azkaban

Workflow for Interactive Application

● Ability to define workflows on the fly rather than fixed workflows as in the case of batch
● Ability to schedule and unschedule using APIs
● Should be able to handle both batch and streaming sources of data
● Should integrate with the state built up through the interactions in the interactive mode
● Ability to monitor the status of the running jobs in real time

Page 16: Interactive workflow management using Azkaban

Challenges of Scheduling for Interactive

● Most workflow systems do not expose a REST API for defining flows and scheduling them
● Many lack a good monitoring system to query the status of running tasks, which is critical
● Most workflow systems run on their own sandboxed execution engine, which makes them hard to integrate with the application state
● More details [2]

Page 17: Interactive workflow management using Azkaban

Azkaban

● Azkaban is a workflow job scheduler created at LinkedIn to run Hadoop jobs
● Has good support for defining dependencies through its flow mechanism and for monitoring jobs
● Allows extending the UI to track new metrics
● Supports multiple runtimes like
  ○ Hadoop
  ○ Spark
  ○ Java

Page 18: Interactive workflow management using Azkaban

Azkaban Batch Mode

● Azkaban is primarily built for scheduling big data batch jobs
● It has a simple DSL to define the flows
● It allows us to define different executors for a given flow
● The abstractions
  ○ Project
  ○ Flow
● Ex : Running a Java flow using the Azkaban UI
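
As a rough illustration of the DSL, Azkaban's classic format describes each job as a `<name>.job` properties file, with a `dependencies` key linking jobs into a flow. The sketch below renders such files for `command` type jobs rather than the Java flow from the talk's demo; `JobFiles` and `CommandJob` are hypothetical helper names, and the key names should be checked against the documentation of the Azkaban version in use.

```scala
object JobFiles {
  // Hypothetical description of one shell-command job in a flow.
  final case class CommandJob(name: String, command: String, dependencies: Seq[String] = Nil)

  // Renders the contents of <name>.job in key=value form.
  def render(job: CommandJob): String = {
    val base = Seq("type=command", s"command=${job.command}")
    val deps =
      if (job.dependencies.isEmpty) Nil
      else Seq(s"dependencies=${job.dependencies.mkString(",")}")
    (base ++ deps).mkString("\n")
  }

  // File name -> file contents for every job in the flow.
  def files(jobs: Seq[CommandJob]): Map[String, String] =
    jobs.map(j => s"${j.name}.job" -> render(j)).toMap
}
```

In batch mode these files are zipped and uploaded to a project through the UI; the flow is then run or scheduled from there.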

Page 19: Interactive workflow management using Azkaban

Azkaban for Interactive Workflows

Page 20: Interactive workflow management using Azkaban

Azkaban AJAX API

● Though Azkaban is primarily built for batch jobs, it has an AJAX API to interact with the workflow system
● This API is primarily built for the UI to interact with the engine
● Though it’s not a full-fledged REST API, it’s good enough to build an interactive workflow system
● This AJAX API makes Azkaban an ideal workflow management system for interactive applications
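
To give a feel for the API, the sketch below only builds the requests and sends nothing over the network. The endpoint paths and parameter names follow the Azkaban AJAX API documentation as best I recall it, so verify them against your Azkaban version. `AzkabanAjax`, `Request` and the `baseUrl` value are assumptions made for illustration.

```scala
import java.net.URLEncoder

object AzkabanAjax {
  // Assumption: address of the Azkaban web server; adjust to your deployment.
  val baseUrl = "http://localhost:8081"

  final case class Request(method: String, url: String, formBody: Option[String] = None)

  private def encode(params: Seq[(String, String)]): String =
    params.map { case (k, v) =>
      s"${URLEncoder.encode(k, "UTF-8")}=${URLEncoder.encode(v, "UTF-8")}"
    }.mkString("&")

  // Login: the JSON response carries a "session.id" used by every later call.
  def login(user: String, password: String): Request =
    Request("POST", s"$baseUrl/",
      Some(encode(Seq("action" -> "login", "username" -> user, "password" -> password))))

  // Lists the flows of a project.
  def fetchProjectFlows(sessionId: String, project: String): Request =
    Request("GET", s"$baseUrl/manager?" +
      encode(Seq("ajax" -> "fetchprojectflows", "session.id" -> sessionId, "project" -> project)))

  // Triggers one run of a flow.
  def executeFlow(sessionId: String, project: String, flow: String): Request =
    Request("GET", s"$baseUrl/executor?" +
      encode(Seq("ajax" -> "executeFlow", "session.id" -> sessionId, "project" -> project, "flow" -> flow)))

  // Fetches the status of a running or finished execution.
  def fetchExecution(sessionId: String, execId: Int): Request =
    Request("GET", s"$baseUrl/executor?" +
      encode(Seq("ajax" -> "fetchexecflow", "session.id" -> sessionId, "execid" -> execId.toString)))
}
```

The session id threaded through every call, and the mix of `action` and `ajax` parameters, are typical of an API designed for a UI rather than for external clients.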

Page 21: Interactive workflow management using Azkaban

Azkaban Scala Client

● The Azkaban AJAX API has some rough edges, as it’s not meant to work as a standard REST API
● Interacting with the API directly will be painful in your application
● azkaban-scala-client is a Scala client which makes interacting with Azkaban much easier
● Most of the APIs are exposed in Scala; feature requests are welcome
● https://github.com/phatak-dev/azkaban-scala-client

Page 22: Interactive workflow management using Azkaban

Schedule in REST API

● Now that we understand how to use the Azkaban API to interact with the workflow manager, we can use it in our REST API
● We will use our Scala client to interact with Azkaban
● The implementation of the flow makes a request to the REST server in order to use the state available in the REST server
● Ex : Scheduler.scala
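
The callback idea can be sketched as follows. This is not the Scheduler.scala from the repository and it does not use azkaban-scala-client. The REST server address, the choice of `/view` with a `name` parameter as the callback target, and the `ScheduleSketch` helper are all hypothetical. The scheduling parameters (`scheduleCronFlow`, `projectName`, `flow`, `cronExpression`) follow the Azkaban AJAX API documentation as best I recall it and should be verified.

```scala
import java.net.URLEncoder

object ScheduleSketch {
  // Hypothetical address of our own REST service; the flow calls back into it.
  val restServer = "http://localhost:8080"

  // Contents of a one-job flow whose only step is a request to the REST
  // service, so the work runs against the state held by that service.
  def callbackJob(dataset: String): String =
    Seq("type=command", s"command=curl -s $restServer/view?name=$dataset").mkString("\n")

  // Form body for a cron scheduling call (sent as a POST to /schedule).
  def scheduleForm(sessionId: String, project: String, flow: String, cron: String): String =
    Seq(
      "ajax" -> "scheduleCronFlow",
      "session.id" -> sessionId,
      "projectName" -> project,
      "flow" -> flow,
      "cronExpression" -> cron
    ).map { case (k, v) => s"$k=${URLEncoder.encode(v, "UTF-8")}" }.mkString("&")
}
```

Keeping the flow this thin means Azkaban only decides when something runs, while the REST service, which owns the Spark session and the loaded data, decides what runs.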