Interactive workflow management using Azkaban

Post on 16-Apr-2017



Interactive Workflow Management using Azkaban

API driven workflow management for Spark

https://github.com/phatak-dev/interactive-azkaban

● Madhukara Phatak

● Technical Lead at Tellius

● Consultant and Trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Agenda

● Different kinds of applications in Spark
● Why interactive?
● Building an interactive application
● Workflow in big data
● Challenges of interactive applications
● Azkaban
● Azkaban in manual/batch mode
● Azkaban AJAX API
● Azkaban client in Scala

Big Data Applications

● Typically, big data applications are divided based on their workloads
● The major divisions are
  ○ Batch applications
  ○ Streaming applications
● Most existing platforms support both of these today
● But a new category of applications is on the rise: interactive applications

Big Data Interactive Applications

● Ability to manipulate data interactively
● Exploratory in nature
● Moves away from the notion that ETL and analysis have to be in silos
● Combines batch and streaming data
● For development
  ○ Zeppelin, Jupyter Notebook etc.
● For production
  ○ Datameer, Tellius, ZoomData etc.

Spark and Interactive Applications

● Apache Spark is the only big data platform built from scratch to support interactive applications
● Spark made interactive data exploration using notebooks popular
● Caching and the intelligent lazy evaluation mechanism make it a great tool for interactive systems
● As Spark combines ETL, exploration and advanced analytics in one platform, we can do all of the data work interactively

Building an Interactive Application

REST based Spark Application

[Architecture diagram: a REST API client talks to the Spark application, which runs on a Spark cluster backed by a database and HDFS]

Akka-Http

● Framework to build reactive web applications/services
● Built on top of Akka's abstractions for concurrency
● The next version of the popular REST framework Spray
● As streams are the base abstraction, it works well with Spark
● Written in Scala; has APIs in both Java and Scala
● We will use a local Spark session to interact with Spark

Simple API

● The API we expose:
  ○ /load - for loading the data
  ○ /view - for looking at sample data
  ○ /schedule - for schedule operations
● These operations are simple, but they show what an API based system looks like
● We test the APIs using Postman to emulate interactive mode
● Ex : RestService.scala
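RestService.scala from the linked repository is not reproduced here. The following is a framework-free sketch of the three-endpoint surface described above, with all names hypothetical: the real service uses Akka HTTP routes and a local SparkSession, both of which are stubbed out so only the API shape is visible.

```scala
// Hypothetical sketch of the /load, /view, /schedule surface.
// The real RestService.scala uses Akka HTTP and a local SparkSession;
// both are stubbed here so only the routing logic is visible.
object RestServiceSketch {
  // in-memory stand-in for the interactive state (the loaded dataset's path)
  private var loaded: Option[String] = None

  def handle(path: String, param: Option[String] = None): String = path match {
    case "/load" =>
      loaded = param // the real code would call spark.read on this path
      s"loaded ${param.getOrElse("nothing")}"
    case "/view" =>
      // the real code would return sample rows via df.take
      loaded.map(p => s"sample of $p").getOrElse("no data loaded")
    case "/schedule" =>
      "scheduled" // the real code delegates to the Azkaban client (later slides)
    case _ =>
      "unknown endpoint"
  }
}
```

The point of the sketch is that every endpoint reads or writes shared server-side state; this is exactly the state a workflow system must later integrate with.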

Workflow management in Big Data

Need of Workflow in Big Data

● Most of the tasks we do in big data are repetitive in nature
● Once we have determined our flow, we want to run it on new data as and when it arrives
● Two parts:
  ○ Flow definition
  ○ Scheduling
● Use cases
  ○ ETL, updating models etc.

Workflow for Batch

● Most scheduling for batch applications is done using some kind of scripting
● There are many ways to define and execute flows
● Once the code is tested, it is deployed and the scripts are scheduled
● These scripts define the flow structure and use a scheduler to run the operations
● Well known frameworks for batch scheduling:
  ○ Oozie
  ○ Airflow

Workflow for Streaming

● Streaming frameworks themselves usually handle the workflow needs of the application
● The Spark Streaming code defines the flow that needs to be run
● The Spark Streaming scheduler runs the flow as and when new data appears
● So we rarely use an external workflow framework for these workloads

Workflow for Interactive Applications

● Ability to define workflows on the fly rather than fixed workflows as in batch
● Ability to schedule and unschedule using APIs
● Should be able to handle both batch and streaming sources of data
● Should integrate with the state built up during interactions in interactive mode
● Ability to monitor the status of running jobs in real time

Challenges of Scheduling for Interactive

● Most workflow systems do not expose a REST API for defining flows and scheduling
● Many lack a good monitoring system to query the status of running tasks, which is critical
● Most workflow systems run on their own sandboxed execution engine, which makes them hard to integrate with the application state
● More details [2]

Azkaban

● Azkaban is a workflow job scheduler created at LinkedIn to run Hadoop jobs
● Has good support for defining dependencies through its flow mechanism, and for monitoring jobs
● Allows extending the UI to track new metrics
● Supports multiple runtimes:
  ○ Hadoop
  ○ Spark
  ○ Java

Azkaban Batch Mode

● Azkaban is primarily built for scheduling big data batch jobs
● It has a simple DSL to define flows
● It allows us to define different executors for a given flow
● The abstractions:
  ○ Project
  ○ Flow
● Ex : Running a Java flow using the Azkaban UI
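Azkaban's flow DSL is a set of `.job` property files, zipped and uploaded into a project; dependencies between jobs are declared with the `dependencies` key, and a flow takes the name of its terminal job. A minimal two-job flow might look like the following (class and file names are illustrative):

```
# hello.job -- runs a plain Java class via the javaprocess job type
type=javaprocess
java.class=example.HelloWorld
classpath=lib/*

# done.job -- runs only after hello succeeds; the flow is named "done"
type=command
command=echo "flow finished"
dependencies=hello
```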

Azkaban for Interactive Workflows

Azkaban AJAX API

● Though Azkaban is primarily built for batch jobs, it has an AJAX API to interact with the workflow system
● This API was primarily built for the UI to interact with the engine
● Though it's not a full fledged REST API, it's good enough to build an interactive workflow system with
● This AJAX API makes Azkaban an ideal workflow management system for interactive applications
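Two calls from the AJAX API are enough to see its shape: a session login, whose JSON response carries a `session.id`, and triggering a flow execution against the `/executor` endpoint. Since actually issuing them needs a running Azkaban server, the sketch below only builds the requests; the host and credentials are placeholders.

```scala
// Request builders for two Azkaban AJAX API calls (host and credentials are
// placeholders). Login is a POST to the web server root; executeFlow is a
// GET against the /executor endpoint, authenticated by session.id.
object AjaxRequests {
  val base = "http://localhost:8081"

  // form body for `action=login`; the JSON response carries "session.id"
  def loginBody(user: String, password: String): String =
    s"action=login&username=$user&password=$password"

  // URL that triggers one execution of `flow` inside `project`
  def executeFlowUrl(sessionId: String, project: String, flow: String): String =
    s"$base/executor?ajax=executeFlow&session.id=$sessionId&project=$project&flow=$flow"
}
```

Every other AJAX call follows the same pattern: one `ajax=...` action parameter plus the `session.id` obtained from login.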

Azkaban Scala Client

● The Azkaban AJAX API has some rough edges, as it was not meant to work as a standard REST API
● Interacting with the API directly would be painful in your application
● azkaban-scala-client is a Scala client which makes interacting with Azkaban much easier
● Most of the APIs are exposed in Scala; feature requests are welcome
● https://github.com/phatak-dev/azkaban-scala-client

Schedule in REST API

● Now that we understand how to use the Azkaban API to interact with the workflow manager, we can use it in our REST API
● We will use our Scala client to interact with Azkaban
● The implementation of the flow makes a request back to the REST server in order to use the state available there
● Ex : Scheduler.scala
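Scheduler.scala itself is not shown here, but the idea in the last bullet can be sketched: the job Azkaban schedules can be a plain command job whose only work is to call back into the REST server, so the scheduled run executes against the state the server already holds. The `/run` endpoint and all names below are hypothetical, invented for this sketch.

```scala
// Hypothetical sketch: generate a command-type Azkaban job whose only work
// is to call back into the REST server, so the scheduled execution reuses
// the server's in-memory state. The /run endpoint name is made up here.
object SchedulerSketch {
  def callbackJob(restHost: String, flowName: String): String =
    s"""type=command
       |command=curl $restHost/run/$flowName
       |""".stripMargin
}
```

Uploading this generated `.job` file through the AJAX API and scheduling it gives on-the-fly flows, as opposed to the fixed, pre-deployed flows of the batch workflow.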