Spark Workflow Management
Romi Kuntsman
Senior Big Data Engineer @ Totango
https://il.linkedin.com/in/romik
"Big things are happening here" Meetup, 2015-04-29
Agenda
● Totango and Customer Success
● Totango architecture overview
● Apache Spark computing framework
● Luigi workflow Engine
● Luigi in Totango
Totango and Customer Success
Your customers' success is your success
SaaS Customer Journey
[Diagram: START → FIRST VALUE → ONGOING VALUE → GROW VALUE (increase users, increase usage, expand functionality); at each stage an account may instead DECREASE VALUE → CHURN]
Customer Success Platform
● Analytics for SaaS companies
● Clear view of the customer journey
● Proactively prevent churn
● Increase upsell
● Track feature, module and total usage
● Health score based on usage patterns
● Improve conversion from trial to paying
Health Console
Module Statistics
Feature Adoption
About Totango
● Founded in 2010
● Size: ~50 employees (half R&D)
● Offices in Tel Aviv and San Mateo, CA
● 120+ customers
● ~70 million events per day
● ~1.5 billion indexed documents per month
● Hosted on Amazon Web Services
Totango Architecture Overview
From usage information to actionable analytics
Terminology
● Service – Totango's customer (e.g. Zendesk)
● Account – Service's (Zendesk's) customer
● SDR (Service Data Record) – User activity event (e.g. user Joe from account Acme did activity Login in module Application)
SDR reception
● Clients send SDRs to the gateway, where they are collected, filtered, packaged and finally stored in S3 for daily/hourly batch processing.
● The realtime processing pipeline is also notified.
Batch Workflow
Account Data Flow
1) Raw Data (SDRs)
2) Account Aging (MySQL - legacy)
3) Activity Aggregations (Hadoop – legacy)
4) Metrics (Spark)
5) Health (Spark)
6) Alerts (Spark)
7) Indexing to Elasticsearch
Data Structure
● Account documents are stored on Amazon S3
● Hierarchical directory structure per task parameter, e.g. /s-1234/prod/2015-04-27/account/metrics
● Documents have a predefined JSON schema; the JSON is mapped directly to a Java document class
● Each file is an immutable collection of documents, one object per line – easily partitioned by lines
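The one-object-per-line layout can be sketched in Python; the file name and document fields here are hypothetical, not Totango's actual schema:

```python
import json

# Hypothetical account documents; the real schema is richer.
docs = [
    {"account_id": "acme", "metric": "logins", "value": 42},
    {"account_id": "initech", "metric": "logins", "value": 7},
]

# Write: one JSON object per line, so any subset of lines
# is itself a valid file of documents.
with open("metrics.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Read: each line decodes independently of all the others,
# which is what makes line-based partitioning trivial.
with open("metrics.jsonl") as f:
    restored = [json.loads(line) for line in f]
```

Because no document spans a line boundary, a framework can split the file at any newline and hand each partition to a different worker.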
Apache Spark
One tool to rule all data transformations
Resilient Distributed Datasets
● RDDs – a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way
● The initial RDD is created from stable storage
● The programmer defines a transformation from an immutable input object to a new output object
● The transformation function class can (read: should!) be built and tested separately from Spark
Transformation flow
Read: inputRows = sparkContext.textFile(inputPath)
Decode: inputDocuments = inputRows.map(new JsonToAccountDocument())
Transform: docsWithHealth = inputDocuments.map(new AugmentDocumentWithHealth(healthCalcMetadata))
… other transformations may be done, all in memory …
Encode: outputRows = docsWithHealth.map(new AccountDocumentToJson())
Write: outputRows.saveAsTextFile(outputPath)
Examples (Java)
class AugmentDocumentWithHealth implements Function<AccountDocument, AccountDocument>
AccountDocument call(final AccountDocument document) throws Exception { … return document with health … }
class AccountHealthToAlerts implements FlatMapFunction<AccountDocument, EventDocument>
Iterable<EventDocument> call(final AccountDocument document) throws Exception { … generate alerts … }
Transformation function
● Passed as a parameter to a Spark transformation: map, reduce, filter, flatMap, mapPartitions
● Can (read: should!!) be checked in unit tests
● Serializable – sent to Spark workers serialized
● Function must be idempotent!
● May be passed immutable metadata
Luigi Workflow Engine
You build the tasks, it takes care of the plumbing
Why a workflow engine?
● Managing many ETL jobs
● Dependencies between jobs
● Continue pipeline from point of failure
● Separate workflow per service per date
● Overview and drill-down status Web UI
● Manual intervention
Workflow engines
● Azkaban, by LinkedIn (mostly for Hadoop)
● Oozie, by Apache (only for Hadoop)
● Amazon Simple Workflow Service (too generic)
● Amazon Data Pipeline (deeply tied to AWS)
● Luigi, by Spotify (customizable) – our choice!
What is Luigi
● Like a Makefile – but in Python, and for data
● Dependencies are managed directly in code
● Generic and easily extendable
● Visualization of task status and dependency
● Command-line interface
Luigi Task Structure
● Extend luigi.Task
● Implement these methods:
● def requires(self) – declare upstream tasks
● def output(self) – the target this task produces
● def run(self) – the actual work
● def input(self) is a built-in helper that returns the outputs of requires()
Luigi Task Example
Luigi Predefined Tasks
● HadoopJobTask
● SparkSubmitTask
● CopyToIndex (ES)
● HiveQueryTask
● PigJobTask
● CopyToTable (RDBMS)
● … many others
Luigi Task Parameters
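Parameters are declared as class attributes; each becomes both a constructor argument and a command-line flag. The task and parameter names below are hypothetical:

```python
import datetime
import luigi

class IndexAccounts(luigi.Task):
    # Each parameter is both a constructor argument and a
    # command-line flag: --service, --date, --dry-run.
    service = luigi.Parameter()
    date = luigi.DateParameter(default=datetime.date.today())
    dry_run = luigi.BoolParameter(default=False)

    def run(self):
        print("indexing %s for %s (dry_run=%s)"
              % (self.service, self.date, self.dry_run))

if __name__ == "__main__":
    # e.g. python tasks.py IndexAccounts --service s-1234 --date 2015-04-27
    luigi.run()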
Luigi Command-line
Luigi Task List
Luigi Dependency Graph
Luigi in Totango
This is how we do it
Our codebase is in Java
Java class is called inside the task run method
Jenkins for Luigi
Gameboy
● Totango-specific controller for Luigi
● Provides a high-level overview
● Enables manual re-run of specific tasks
● Monitors progress, performance, run time, queue, worker load, etc.
Summary
● Typical data flow – from raw data to insights
● We use Spark for fast in-memory transformations; all code is in Java
● Our batch processing pipeline consists of a series of tasks, which are managed in Luigi
● We don't use all of Luigi's Python abilities, and we've added some new management abilities
Questions?
The end is only the beginning