Building Scientific Workflows with Spotify's Luigi

18
Building workflows with Spotify's Workshop for e-Infrastructures for Massively Parallel Sequencing Uppsala, Jan 19-20, 2015 Samuel Lampa UU/Dept of Pharm. Biosci. & BILS [email protected] samuel.lampa.co

Transcript of Building Scientific Workflows with Spotify's Luigi

Page 1: Building Scientific Workflows with Spotify's Luigi

Building workflows with Spotify's

Workshop for e-Infrastructures for Massively Parallel SequencingUppsala, Jan 19-20, 2015

Samuel LampaUU/Dept of Pharm. Biosci. & BILS

[email protected]

Page 2: Building Scientific Workflows with Spotify's Luigi

Luigi Short Facts● Batch (No streaming)● Both Command line and Hadoop execution● Does not replace Hive, Pig, Cascading etc

- instead used to stitch many such jobs together● Powers 1000:s of jobs every day at Spotify● Communication / synchronization between tasks via “targets” (Normal file,

HDFS, database …)● Dependencies hard coded in tasks● Pull based (ask for task, and it figures out dependencies)● Main features: Scheduling, Dependency Graph,

Basic multi-node support (via central planner daemon)● github.com/spotify/luigi● Easy to install: pip install luigi [tornado]

Page 3: Building Scientific Workflows with Spotify's Luigi
Page 4: Building Scientific Workflows with Spotify's Luigi

A Basic Luigi Task

Page 5: Building Scientific Workflows with Spotify's Luigi

A Basic Luigi Task

Page 6: Building Scientific Workflows with Spotify's Luigi

Calling tasks from the commandline

Page 7: Building Scientific Workflows with Spotify's Luigi

Defining workflows: Default way

Page 8: Building Scientific Workflows with Spotify's Luigi

Dependencies: Default way

TaskA()

TaskB()

TaskBA() TaskBB()

TaskBAA() TaskBBB()TaskBAB() TaskBBA()

Page 9: Building Scientific Workflows with Spotify's Luigi

Parameters: Default way

TaskA()

TaskB()

TaskBA() TaskBB()

TaskBAA() TaskBBB()TaskBAB() TaskBBA()

DependencyParameter

Page 10: Building Scientific Workflows with Spotify's Luigi

Growing list of parameters

Page 11: Building Scientific Workflows with Spotify's Luigi

Dependencies and parameters: A better way

AWorkflowTask()

TaskA()

TaskB()

TaskBA() TaskBB()

TaskBAA() TaskBBB()TaskBAB() TaskBBA()

Page 12: Building Scientific Workflows with Spotify's Luigi

Supporting separate network definition

Page 13: Building Scientific Workflows with Spotify's Luigi

Using this functionality in tasks

Page 14: Building Scientific Workflows with Spotify's Luigi

Using this new functionality in workflows, and wrapping whole workflows in Tasks

Page 15: Building Scientific Workflows with Spotify's Luigi

Unit testing Luigi tasks is easy

Page 16: Building Scientific Workflows with Spotify's Luigi

Lessons Learned (April 2014)

● Only “pull” or “call/return” semantics, no “push” or “auto triggering by inputs”

● Not fully separate workflow● Multiple inputs / outputs somewhat error-prone● Dynamic typed language● Check out for max processes limits● No real distribution of compute or data

(only common scheduling)

Page 17: Building Scientific Workflows with Spotify's Luigi

Lessons Learned (Update Jan 2015)

● Only “pull” or “call/return” semantics, no “push” or “auto triggering by inputs”

● Not fully separate workflow def.– We found a way around this.

● Multiple inputs / outputs somewhat error-prone– We're looking into solving this too.

● Dynamic typed language– We've found that unit testing is easy.

● Check out for max processes limits● No real distribution of compute or data

(only common scheduling)– We have found ourselves running everything through SLURM anyway

Page 18: Building Scientific Workflows with Spotify's Luigi

Thank you!