Spark Workflow Management



Romi Kuntsman

Senior Big Data Engineer @ Totango

romi@totango.com

https://il.linkedin.com/in/romik

"Big things are happening here" Meetup, 2015-04-29

Agenda

● Totango and Customer Success

● Totango architecture overview

● Apache Spark computing framework

● Luigi workflow Engine

● Luigi in Totango

Totango and Customer Success

Your customers' success is your success

SaaS Customer Journey

[Diagram: the journey runs from START to FIRST VALUE to ONGOING VALUE; value grows by increasing users, increasing usage and expanding functionality, while decreasing value leads to CHURN at any point along the way.]

Customer Success Platform

● Analytics for SaaS companies

● Clear view of the customer journey

● Proactively prevent churn

● Increase upsell

● Track feature, module and total usage

● Health score based on usage patterns

● Improve conversion from trial to paying

Health Console

Module Statistics

Feature Adoption

About Totango

● Founded in 2010

● Size: ~50 (half R&D)

● Offices in Tel Aviv and San Mateo, CA

● 120+ customers

● ~70 million events per day

● ~1.5 billion indexed documents per month

● Hosted on Amazon Web Services

Totango Architecture Overview

From usage information to actionable analytics

Terminology

● Service – Totango's customer (e.g. Zendesk)

● Account – Service's (Zendesk's) customer

● SDR (Service Data Record) – User activity event (e.g. user Joe from account Acme did activity Login in module Application)

SDR reception

● Clients send SDRs to the gateway, where they are collected, filtered, packaged and finally stored in S3 for daily/hourly batch processing.

● Realtime processing is also notified.

Batch Workflow

Account Data Flow

1) Raw Data (SDRs)

2) Account Aging (MySQL - legacy)

3) Activity Aggregations (Hadoop – legacy)

4) Metrics (Spark)

5) Health (Spark)

6) Alerts (Spark)

7) Indexing to Elasticsearch

Data Structure

● Account documents are stored on Amazon S3

● Hierarchical directory structure per task parameter:
  e.g. /s-1234/prod/2015-04-27/account/metrics

● Documents have a predefined JSON schema;
  JSON is mapped directly to a Java document class

● Each file is an immutable collection of documents:
  one object per line – easily partitioned by lines
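As an illustration only (the field names below are hypothetical, not Totango's actual schema), a single line in such a file could look like:

{"account_id": "acme", "service_id": "s-1234", "date": "2015-04-27", "metrics": {"logins": 42}}

Because each document occupies exactly one line, Spark can split a file on line boundaries and process the documents independently and in parallel.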

Apache Spark

One tool to rule all data transformations

Resilient Distributed Datasets

● RDDs – a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way

● The initial RDD is created from stable storage

● The programmer defines a transformation from an immutable input object to a new output object

● The transformation function class can (read: should!) be built and tested separately from Spark

Transformation flow

Read: inputRows = sparkContext.textFile(inputPath)

Decode: inputDocuments = inputRows.map(new JsonToAccountDocument())

Transform: docsWithHealth = inputDocuments.map(new AugmentDocumentWithHealth(healthCalcMetadata))

… other transformations may be done, all in memory …

Encode: outputRows = docsWithHealth.map(new AccountDocumentToJson())

Write: outputRows.saveAsTextFile(outputPath)

Examples (Java)

// imports: org.apache.spark.api.java.function.Function, FlatMapFunction
class AugmentDocumentWithHealth implements Function<AccountDocument, AccountDocument> {
    public AccountDocument call(final AccountDocument document) throws Exception { … return document with health … }
}

class AccountHealthToAlerts implements FlatMapFunction<AccountDocument, EventDocument> {
    public Iterable<EventDocument> call(final AccountDocument document) throws Exception { … generate alerts … }
}

Transformation function

● Passed as a parameter to a Spark transformation: map, reduce, filter, flatMap, mapPartitions

● Can (read: should!!) be checked in unit tests

● Serializable – sent to the Spark workers in serialized form

● Function must be idempotent!

● May be passed immutable metadata

Luigi Workflow Engine

You build the tasks, it takes care of the plumbing

Why a workflow engine?

● Managing many ETL jobs

● Dependencies between jobs

● Continue pipeline from point of failure

● Separate workflow per service per date

● Overview and drill-down status Web UI

● Manual intervention

Workflow engines

● Azkaban, by LinkedIn (mostly for Hadoop)

● Oozie, by Apache (only for Hadoop)

● Amazon Simple Workflow Service (too generic)

● Amazon Data Pipeline (deeply tied to AWS)

● Luigi, by Spotify (customizable) – our choice!

What is Luigi

● Like Makefile – but in Python, and for data

● Dependencies are managed directly in code

● Generic and easily extendable

● Visualization of task status and dependency

● Command-line interface

Luigi Task Structure

● Extend luigi.Task

● Implement these methods:
  ● def requires(self) – declare the upstream task dependencies
  ● def output(self) – declare the target(s) this task produces
  ● def run(self) – do the actual work
  (def input(self) is provided by Luigi – it returns the outputs of the required tasks)

Luigi Task Example
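A minimal sketch of such a task (the task names, parameters and paths here are hypothetical, for illustration only):

import luigi

class RawData(luigi.ExternalTask):
    # hypothetical upstream data that already exists (e.g. the daily SDR dump)
    service = luigi.Parameter()
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("/data/%s/%s/sdr" % (self.service, self.date))


class Aggregations(luigi.Task):
    # hypothetical task: aggregate the raw data for one service and one date
    service = luigi.Parameter()
    date = luigi.DateParameter()

    def requires(self):
        return RawData(service=self.service, date=self.date)

    def output(self):
        return luigi.LocalTarget("/data/%s/%s/account/aggregations" % (self.service, self.date))

    def run(self):
        with self.input().open('r') as raw, self.output().open('w') as out:
            for line in raw:
                out.write(line)  # placeholder for the real aggregation logic

Luigi only runs Aggregations once the output of RawData exists, and skips it on re-runs when its own output is already present – which is what gives the "continue from point of failure" behaviour mentioned above.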

Luigi Predefined Tasks

● HadoopJobTask

● SparkSubmitTask

● CopyToIndex (ES)

● HiveQueryTask

● PigJobTask

● CopyToTable (RDBMS)

● … many others
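For example, a Spark job can be wrapped with the predefined SparkSubmitTask from luigi.contrib.spark. In this sketch the jar path, class name and options are hypothetical, and the exact attribute names may differ between Luigi versions:

from luigi.contrib.spark import SparkSubmitTask

class ComputeHealth(SparkSubmitTask):
    # hypothetical values – point these at your own jar and main class
    app = "/opt/jobs/analytics-jobs.jar"
    entry_class = "com.example.ComputeHealthJob"

    def app_options(self):
        # arguments passed to the Spark application's main()
        return ["--date", "2015-04-27"]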

Luigi Task Parameters
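Parameters are declared as class attributes on the task; together with the task name they identify a task instance, which is how the same task can run once per service per date. A small sketch (names are illustrative only):

import luigi

class IndexAccounts(luigi.Task):
    service_id = luigi.Parameter()            # e.g. "s-1234"
    date = luigi.DateParameter()              # e.g. 2015-04-27
    batch_size = luigi.IntParameter(default=1000)

Values can be supplied from the command line or when the task is instantiated in code; two instances with identical parameter values are treated as the same task.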

Luigi Command-line
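If the module ends with luigi.run(), the illustrative task above could be started from the shell with something like python tasks.py IndexAccounts --service-id s-1234 --date 2015-04-27 --local-scheduler (module and task names are the hypothetical ones from the sketches above). Underscores in parameter names become dashes on the command line, and --local-scheduler runs without the central scheduler daemon, which is handy for local testing.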

Luigi Task List

Luigi Dependency Graph


Luigi in Totango

This is how we do it

Our codebase is in Java

A Java class is called inside the task's run() method
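One straightforward way to do this (a sketch only – the class names, jar path and arguments are hypothetical, not Totango's actual wiring) is to launch the JVM from run() and let the task fail if the Java process fails:

import subprocess
import luigi

class ComputeMetrics(luigi.Task):
    service_id = luigi.Parameter()
    date = luigi.DateParameter()

    def output(self):
        # hypothetical marker file written by the Java job on success
        return luigi.LocalTarget("/data/%s/%s/account/metrics/_SUCCESS" % (self.service_id, self.date))

    def run(self):
        # run the hypothetical Java entry point; a non-zero exit code raises and fails the task
        subprocess.check_call([
            "java", "-cp", "/opt/jobs/analytics-jobs.jar",
            "com.example.ComputeMetricsJob",
            "--service-id", self.service_id,
            "--date", str(self.date),
        ])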

Jenkins for Luigi

Gameboy

● Totango-specific controller for Luigi

● Provides a high-level overview

● Enables manual re-run of specific tasks

● Monitors progress, performance, run time, queue, worker load, etc.


Summary

● Typical data flow – from raw data to insights

● We use Spark for fast in-memory transformations; all code is in Java

● Our batch processing pipeline consists of a series of tasks, which are managed in Luigi

● We don't use all of Luigi's Python abilities, and we've added some new management abilities

Questions?

The end is only the beginning