Spark Workflow Management
Romi Kuntsman
Senior Big Data Engineer @ Totango
https://il.linkedin.com/in/romik
"Big things are happening here" Meetup, 2015-04-29
Agenda
● Totango and Customer Success
● Totango architecture overview
● Apache Spark computing framework
● Luigi workflow Engine
● Luigi in Totango
Totango and Customer Success
Your customers' success is your success
SaaS Customer Journey
[Diagram: START → FIRST VALUE → ONGOING VALUE → GROW VALUE (increase users, increase usage, expand functionality); at each stage an account may instead DECREASE VALUE → CHURN]
Customer Success Platform
● Analytics for SaaS companies
● Clear view of the customer journey
● Proactively prevent churn
● Increase upsell
● Track feature, module and total usage
● Health score based on usage patterns
● Improve conversion from trial to paying
Health Console
Module Statistics
Feature Adoption
About Totango
● Founded in 2010
● Size: ~50 employees (half R&D)
● Offices in Tel Aviv and San Mateo, CA
● 120+ customers
● ~70 million events per day
● ~1.5 billion indexed documents per month
● Hosted on Amazon Web Services
Totango Architecture Overview
From usage information to actionable analytics
Terminology
● Service – Totango's customer (e.g. Zendesk)
● Account – Service's (Zendesk's) customer
● SDR (Service Data Record) – User activity event (e.g. user Joe from account Acme did activity Login in module Application)
SDR reception
● Clients send SDRs to the gateway, where they are collected, filtered, packaged and finally stored in S3 for daily/hourly batch processing.
● The realtime processing pipeline is also notified.
Batch Workflow
Account Data Flow
1) Raw Data (SDRs)
2) Account Aging (MySQL - legacy)
3) Activity Aggregations (Hadoop – legacy)
4) Metrics (Spark)
5) Health (Spark)
6) Alerts (Spark)
7) Indexing to Elasticsearch
Data Structure
● Account documents are stored on Amazon S3
● Hierarchical directory structure per task parameter, e.g. /s-1234/prod/2015-04-27/account/metrics
● Documents have a predefined JSON schema; the JSON is mapped directly to a Java document class
● Each file is an immutable collection of documents, one object per line – easily partitioned by lines
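The one-object-per-line layout can be sketched in Python; the file name and document fields here are hypothetical, not Totango's actual schema:

```python
import json

# Hypothetical account documents; the real schema is richer.
docs = [
    {"account_id": "acme", "metric": "logins", "value": 42},
    {"account_id": "initech", "metric": "logins", "value": 7},
]

# Write: one JSON object per line, so any subset of lines
# is itself a valid file of documents.
with open("metrics.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Read: each line decodes independently of all the others,
# which is what makes line-based partitioning trivial.
with open("metrics.jsonl") as f:
    restored = [json.loads(line) for line in f]
```

Because no document spans a line boundary, a framework can split the file at any newline and hand each partition to a different worker.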
Apache Spark
One tool to rule all data transformations
Resilient Distributed Datasets
● RDDs – a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way
● The initial RDD is created from stable storage
● The programmer defines a transformation from an immutable input object to a new output object
● The transformation function class can (read: should!) be built and tested separately from Spark
Transformation flow
Read: inputRows = sparkContext.textFile(inputPath)
Decode: inputDocuments = inputRows.map(new JsonToAccountDocument())
Transform: docsWithHealth = inputDocuments.map(new AugmentDocumentWithHealth(healthCalcMetadata))
… other transformations may be done, all in memory …
Encode: outputRows = docsWithHealth.map(new AccountDocumentToJson())
Write: outputRows.saveAsTextFile(outputPath)
Examples (Java)
class AugmentDocumentWithHealth implements Function<AccountDocument, AccountDocument>
AccountDocument call(final AccountDocument document) throws Exception { … return document with health … }
class AccountHealthToAlerts implements FlatMapFunction<AccountDocument, EventDocument>
Iterable<EventDocument> call(final AccountDocument document) throws Exception { … generate alerts … }
Transformation function
● Passed as a parameter to a Spark transformation: map, reduce, filter, flatMap, mapPartitions
● Can (read: should!!) be checked in unit tests
● Serializable – sent to Spark workers serialized
● Function must be idempotent!
● May be passed immutable metadata
Luigi Workflow Engine
You build the tasks, it takes care of the plumbing
Why a workflow engine?
● Managing many ETL jobs
● Dependencies between jobs
● Continue pipeline from point of failure
● Separate workflow per service per date
● Overview and drill-down status Web UI
● Manual intervention
Workflow engines
● Azkaban, by LinkedIn (mostly for Hadoop)
● Oozie, by Apache (only for Hadoop)
● Amazon Simple Workflow Service (too generic)
● Amazon Data Pipeline (deeply tied to AWS)
● Luigi, by Spotify (customizable) – our choice!
What is Luigi
● Like a Makefile – but in Python, and for data
● Dependencies are managed directly in code
● Generic and easily extendable
● Visualization of task status and dependency
● Command-line interface
Luigi Task Structure
● Extend luigi.Task
● Implement these methods:
● def requires(self) – declare upstream tasks
● def output(self) – the target this task produces
● def run(self) – the actual work
● def input(self) is a built-in helper that returns the outputs of requires()
Luigi Task Example
Luigi Predefined Tasks
● HadoopJobTask
● SparkSubmitTask
● CopyToIndex (ES)
● HiveQueryTask
● PigJobTask
● CopyToTable (RDBMS)
● … many others
Luigi Task Parameters
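Parameters are declared as class attributes; each becomes both a constructor argument and a command-line flag. The task and parameter names below are hypothetical:

```python
import datetime
import luigi

class IndexAccounts(luigi.Task):
    # Each parameter is both a constructor argument and a
    # command-line flag: --service, --date, --dry-run.
    service = luigi.Parameter()
    date = luigi.DateParameter(default=datetime.date.today())
    dry_run = luigi.BoolParameter(default=False)

    def run(self):
        print("indexing %s for %s (dry_run=%s)"
              % (self.service, self.date, self.dry_run))

if __name__ == "__main__":
    # e.g. python tasks.py IndexAccounts --service s-1234 --date 2015-04-27
    luigi.run()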
Luigi Command-line
Luigi Task List
Luigi Dependency Graph
Luigi in Totango
This is how we do it
Our codebase is in Java
Java class is called inside the task run method
Jenkins for Luigi
Gameboy
● Totango-specific controller for Luigi
● Provides a high-level overview
● Enables manual re-run of specific tasks
● Monitors progress, performance, run time, queue, worker load, etc.
Summary
● Typical data flow – from raw data to insights
● We use Spark for fast in-memory transformations; all code is in Java
● Our batch processing pipeline consists of a series of tasks, which are managed in Luigi
● We don't use all of Luigi's Python abilities, and we've added some new management abilities
Questions?
The end is only the beginning