Erik Bernhardsson – erikbern@spotify.com
Batch data processing in Python
Btw, I’m Erik Bernhardsson
I’m at Spotify in NYC, focusing mostly on music discovery and large-scale machine learning. Previously I managed the “Analytics team” in Stockholm.
Background
We crunch a lot of data
Billions of log messages (several TBs) every day – usage and backend stats, debug information.

What we want to do:
• A/B testing
• Music recommendations
• Monthly/daily/hourly reporting
• Business metric dashboards

We experiment a lot – need quick development cycles.
Why did we build Luigi?
We like Hadoop
Our second cluster (in 2009)... long story short :) ... our fifth cluster.
Running one job is easy. But what about running 1000s of jobs every day?
• Lots of long-running processes with dependencies
• Need monitoring
• Handle failures
• Go from experimentation to production easily

Most things are Python Map/Reduce jobs, but also non-Hadoop stuff:
• Pig and Hive
• SCP files from one host to another
• Train a machine learning model
• Put data in Cassandra
In the pre-Luigi world
How not to do workflows
Example: Artist Toplist
“Streams” is a list of (username, track, artist, timestamp) tuples.
The pipeline: Streams → Artist Aggregation → Top 10 → Database
Pre-Luigi example of artist toplists – don’t do this at home.
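The script from this slide isn’t in the transcript, so here is a minimal sketch of the anti-pattern it illustrated (jar names, paths, and the top10 table are all hypothetical): steps chained blindly, outputs written in place, nothing resumable.

import subprocess

# How NOT to do it: every step reruns from scratch, a crash mid-way
# leaves half-written files, and there is no record of what completed.
subprocess.check_call(
    "hadoop jar jobs.jar AggregateArtists /logs/streams /tmp/artist_counts",
    shell=True)
subprocess.check_call(
    "hadoop fs -getmerge /tmp/artist_counts artist_counts.tsv", shell=True)
subprocess.check_call(
    "sort -nr -k2 artist_counts.tsv | head -10 > top10.tsv", shell=True)
subprocess.check_call(
    "psql -c \"\\copy top10 FROM 'top10.tsv'\"", shell=True)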
OK, so chain the tasks
Cron is nicer, yay!

Errors will occur
The second step fails, you fix it, then you want to resume – don’t run things twice.
That’s OK, but don’t leave broken data somewhere (btw, Luigi gives you atomic file operations, locally and in HDFS).
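Atomicity is what makes resuming safe; a minimal sketch of the pattern Luigi’s targets apply when you open them for writing (the helper name here is mine):

import os
import tempfile

def atomic_write(path, lines):
    # Write to a temp file in the same directory, then rename into place.
    # rename() is atomic on POSIX, so a crashed job never leaves a
    # half-written file at `path` for downstream steps to trip over.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in lines)
        os.rename(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise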
Parametrize tasks
To use data flows as command line tools.

Put tasks in loops
You want to run the dataflow for a set of similar inputs.
Plumbing sucks
Graph algorithms rock!
Plumbing sucks...
Who’s the world’s second most famous plumber?
Hint: he wears green
Introducing Luigi
A Python framework for data flow definition and execution.
Main features
• Simple dependency definitions
• Emphasis on Hadoop/HDFS integration
• Atomic file operations
• Data flow visualization
• Command line integration
Luigi is “kind of like Makefile” in Python – on steroids and PCP... with a toolbox of mainly Hadoop-related stuff.
Luigi Task
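The code from this slide is missing from the transcript; here is a minimal sketch of a task’s anatomy in the requires/output/run style the Luigi docs use (paths hypothetical):

import luigi

class SomeUpstreamTask(luigi.ExternalTask):
    # An input produced outside this dataflow (hypothetical path).
    def output(self):
        return luigi.LocalTarget("/path/to/input.tsv")

class MyTask(luigi.Task):
    def requires(self):
        # Upstream tasks this one depends on.
        return SomeUpstreamTask()

    def output(self):
        # Target(s) this task produces; if they exist, the task is complete.
        return luigi.LocalTarget("/path/to/output.tsv")

    def run(self):
        # Business logic; only runs while output() doesn't exist yet.
        with self.input().open("r") as fin, self.output().open("w") as fout:
            fout.writelines(fin)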
Luigi - Aggregate Artists
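This slide’s code isn’t in the transcript either; the sketch below follows the AggregateArtists example from the Luigi documentation tutorial, which this talk tracks closely (the per-date file layout is an assumption):

from collections import defaultdict

import luigi

class Streams(luigi.ExternalTask):
    # One tab-separated file of stream log lines per date.
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(
            self.date.strftime("data/streams_%Y-%m-%d.tsv"))

class AggregateArtists(luigi.Task):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        # One Streams dependency per day in the interval.
        return [Streams(date) for date in self.date_interval]

    def output(self):
        return luigi.LocalTarget(
            "data/artist_streams_%s.tsv" % self.date_interval)

    def run(self):
        artist_count = defaultdict(int)
        for target in self.input():
            with target.open("r") as in_file:
                for line in in_file:
                    username, track, artist, timestamp = line.strip().split("\t")
                    artist_count[artist] += 1
        # Atomic write: readers never see a partial file.
        with self.output().open("w") as out_file:
            for artist, count in artist_count.items():
                print(artist, count, file=out_file)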
Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running AggregateArtists()
INFO: [pid 74375] Done AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Completing the top list
Top 10 artists – wraps arbitrary Python code.
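The code shown here is also missing; this follows the Top10Artists task from the Luigi docs tutorial – plain Python with heapq, no Hadoop required:

from heapq import nlargest

import luigi

class Top10Artists(luigi.Task):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        return AggregateArtists(self.date_interval)

    def output(self):
        return luigi.LocalTarget(
            "data/top_artists_%s.tsv" % self.date_interval)

    def run(self):
        # Arbitrary Python: pick the 10 biggest (streams, artist) pairs.
        top_10 = nlargest(10, self._input_iterator())
        with self.output().open("w") as out_file:
            for streams, artist in top_10:
                print(self.date_interval.date_a, self.date_interval.date_b,
                      artist, streams, file=out_file)

    def _input_iterator(self):
        with self.input().open("r") as in_file:
            for line in in_file:
                artist, streams = line.strip().split()
                yield int(streams), artist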
Database support
Basic functionality for exporting to Postgres; Cassandra support is in the works.
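A sketch in the style of the Luigi docs’ CopyToTable example (connection details are placeholders; in today’s Luigi the module is luigi.contrib.postgres, at the time of this talk it was luigi.postgres):

import luigi
from luigi.contrib import postgres

class ArtistToplistToDatabase(postgres.CopyToTable):
    date_interval = luigi.DateIntervalParameter()

    # Hypothetical connection details.
    host = "localhost"
    database = "toplists"
    user = "luigi"
    password = "abc123"
    table = "top10"

    columns = [("date_from", "DATE"),
               ("date_to", "DATE"),
               ("artist", "TEXT"),
               ("streams", "INT")]

    def requires(self):
        # Rows are read from the output of Top10Artists and copied in.
        return Top10Artists(self.date_interval)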
Running it all...
DEBUG: Checking if ArtistToplistToDatabase() is complete
INFO: Scheduled ArtistToplistToDatabase()
DEBUG: Checking if Top10Artists() is complete
INFO: Scheduled Top10Artists()
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 74811] Running AggregateArtists()
INFO: [pid 74811] Done AggregateArtists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 74811] Running Top10Artists()
INFO: [pid 74811] Done Top10Artists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74811] Running ArtistToplistToDatabase()
INFO: Done writing, importing at 2013-03-13 15:41:09.407138
INFO: [pid 74811] Done ArtistToplistToDatabase()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
The results
Imagine how cool this would be with real data...
Task Parameters
Class variables with some magic: tasks have an implicit __init__, and Luigi generates a command line interface with typing and documentation.
$ python dataflow.py AggregateArtists --date 2013-03-05
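A minimal sketch of what that looks like (a single-date variant of AggregateArtists, body elided):

import luigi

class AggregateArtists(luigi.Task):
    # A class variable with some magic: this becomes both a constructor
    # argument and a typed, documented --date command line flag.
    date = luigi.DateParameter()

    def output(self):
        # Parameters make good output-path components, so every date
        # gets its own target.
        return luigi.LocalTarget("data/artist_streams_%s.tsv" % self.date)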
Task Parameters
Combined usage example:
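The combined example’s code isn’t in the transcript; presumably it combined parameters with tasks-in-loops, roughly like this sketch (reusing the Streams task from earlier):

import luigi

class AggregateArtists(luigi.Task):
    # One flag, e.g. --date_interval 2013-W08, expands into a loop of
    # per-day dependencies: parameters and tasks-in-loops combined.
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        return [Streams(date) for date in self.date_interval]

$ python dataflow.py AggregateArtists --date_interval 2013-W08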
Task templates and targets
Luigi comes with a toolbox of abstract Tasks for...
• Running Hadoop MapReduce, using Hadoop Streaming or custom jar files
• Running Hive and (soon) Pig queries
• Inserting data sets into Postgres
• ...how to run anything, really
Writing new ones is as easy as defining an interface and implementing run() – see the sketch below.
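A hypothetical template to illustrate the pattern (Luigi’s real toolbox classes differ in detail): the base class implements run() once, and subclasses fill in the interface.

import subprocess

import luigi

class ShellTask(luigi.Task):
    # The template: subclasses say *what* to run, run() knows *how*.
    def command(self):
        raise NotImplementedError

    def run(self):
        subprocess.check_call(self.command(), shell=True)

class PushReports(ShellTask):
    # scp files from one host to another - one of the non-Hadoop steps
    # mentioned earlier (host and paths hypothetical).
    def command(self):
        return "scp reports/*.tsv backup-host:/srv/reports/"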
Hadoop MapReduce
Built-in Hadoop Streaming Python framework

Features
• Tiny interface – just implement mapper and reducer
• Fetches error logs from the Hadoop cluster and displays them to the user
• Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins
• Easy to send along Python modules that might not be installed on the cluster
• Support for counters, secondary sort, combiners, distributed cache, etc.
• Runs on CPython so you can use your favorite libs (numpy, pandas, etc.)
Hadoop MapReduce
Built-in Hadoop Streaming Python framework

More features
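The slide’s code is dropped from the transcript; this follows the Hadoop version of AggregateArtists from the Luigi docs (module paths as in current Luigi – at the time of the talk they were luigi.hadoop and luigi.hdfs):

import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs

class StreamsHdfs(luigi.ExternalTask):
    date = luigi.DateParameter()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(
            self.date.strftime("data/streams_%Y_%m_%d.tsv"))

class AggregateArtistsHadoop(luigi.contrib.hadoop.JobTask):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        return [StreamsHdfs(date) for date in self.date_interval]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(
            "data/artist_streams_%s.tsv" % self.date_interval)

    def mapper(self, line):
        # The tiny interface: yield key/value pairs for each input line.
        timestamp, artist, track = line.strip().split()
        yield artist, 1

    def reducer(self, key, values):
        yield key, sum(values)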
Luigi’s “visualiser”
Dive into any task
Multiple workers
Basic multi-processing:
$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W08

Error notifications
Great for automated execution.
Process Synchronization
Prevents two identical tasks from running simultaneously: workers ask the Luigi central planner for work, and it hands out each task to only one worker.
(Diagram sequence: Luigi worker 1 with tasks A, B, C and Luigi worker 2 with tasks A, C, F, both talking to the Luigi central planner; the overlapping tasks run only once.)
Large data flows (screenshot from the web interface)
Things Luigi is not
Luigi is not trying to replace mrjob – yes, you can run Python Hadoop jobs in Luigi, but the main focus is workflow management.
Luigi does not give you scalability – you still need to figure out how each task runs.
Luigi does not help you transform the data – MapReduce/Pig/Hive etc. are wonderful tools for that, and Luigi is more than happy to delegate to them.
...but it’s sort of like Oozie (although Oozie is kind of annoying).
                Oozie   Luigi
Only Hadoop     Yes!
Horrible XML    Yes!
Easy                    Yes!
Fun & powerful          Yes!
“Oozie example”
<workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>
    <start to='getDirInfo' />

    <!-- STEP ONE -->
    <action name='getDirInfo'>
        <!-- writes 2 properties:
             dir.num-files: returns -1 if dir doesn't exist, otherwise returns # of files in dir
             dir.age: returns -1 if dir doesn't exist, otherwise returns age of dir in days -->
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.navteq.oozie.GetDirInfo</main-class>
            <arg>${inputDir}</arg>
            <capture-output />
        </java>
        <ok to="makeIngestDecision" />
        <error to="fail" />
    </action>

    <!-- STEP TWO -->
    <decision name="makeIngestDecision">
        <switch>
            <!-- empty or doesn't exist -->
            <case to="end">
                ${wf:actionData('getDirInfo')['dir.num-files'] lt 0 ||
                  (wf:actionData('getDirInfo')['dir.age'] lt 1 and
                   wf:actionData('getDirInfo')['dir.num-files'] lt 24)}
            </case>
            <!-- # of files >= 24 -->
            <case to="ingest">
                ${wf:actionData('getDirInfo')['dir.num-files'] gt 23 ||
                  wf:actionData('getDirInfo')['dir.age'] gt 6}
            </case>
            <default to="sendEmail"/>
        </switch>
    </decision>

    <!-- EMAIL -->
    <action name="sendEmail">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.navteq.oozie.StandaloneMailer</main-class>
            <arg>probedata2@navteq.com</arg>
            <arg>gregory.titievsky@navteq.com</arg>
            <arg>${inputDir}</arg>
            <arg>${wf:actionData('getDirInfo')['dir.num-files']}</arg>
            <arg>${wf:actionData('getDirInfo')['dir.age']}</arg>
        </java>
        <ok to="end" />
        <error to="fail" />
    </action>

    <!-- INGESTION -->
    <action name="ingest">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}" />
            </prepare>
            <configuration>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>300</value>
                </property>
            </configuration>
            <main-class>com.navteq.probedata.drivers.ProbeIngest</main-class>
            <arg>-conf</arg>
            <arg>action.xml</arg>
            <arg>${inputDir}</arg>
            <arg>${outputDir}</arg>
        </java>
        <ok to="archive-data" />
        <error to="ingest-fail" />
    </action>

    <!-- Archive Data -->
    <action name="archive-data">
        <fs>
            <move source='${inputDir}' target='/probe/backup/${dirName}' />
            <delete path='${inputDir}' />
        </fs>
        <ok to="end" />
        <error to="ingest-fail" />
    </action>

    <kill name="ingest-fail">
        <message>Ingestion failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <kill name="fail">
        <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
Luigi does not have 999 features
Instead, the focus is on:
• Ridiculously little boilerplate code
• A rapid experimentation cycle
• Generality, so you can build whatever on top of it
• Once things work, it’s trivial to put them in production
What we use Luigi for
• Hadoop Streaming
• Java Hadoop MapReduce
• Hive
• Pig
• Train machine learning models
• Import/export data to/from Postgres
• Insert data into Cassandra
• scp/rsync/ftp data files and reports
• Dump and load databases
Others are using it with Scala MapReduce and mrjob as well.
Luigi is open source
• Originated at Spotify
• Mainly built by me and Elias Freider
• Based on many years of experience with data processing
• Open source since September 2012
https://github.com/spotify/luigi
Be one of the cool kids!
Future plans!
• Pig
• EC2
• Scalding
• Cassandra
Thank you!
For more information, feel free to reach out at http://github.com/spotify/luigi
Oh, and we’re hiring – http://spotify.com/jobs
Erik Bernhardsson – erikbern@spotify.com