FireWorks overview

Anubhav Jain FireWorks workflow software: An introduction LLNL meeting | November 2016 Energy & Environmental Technologies Berkeley Lab 1 Slides available at www.slideshare.net/anubhavster

Transcript of FireWorks overview

Page 1: FireWorks overview

Anubhav Jain

FireWorks workflow software: An introduction

LLNL meeting | November 2016

Energy & Environmental Technologies, Berkeley Lab

Slides available at www.slideshare.net/anubhavster

1

Page 2: FireWorks overview

• Built with Python + MongoDB. Open-source, pip-installable:
  ◦ http://pythonhosted.org/FireWorks/
  ◦ Very easy to install; most people can run the first tutorial within 30 minutes of starting
• At least 100 million CPU-hours used; everyday production use by 3 large DOE projects (Materials Project, JCESR, JCAP) as well as many materials science research groups
• Also used for graphics processing, machine learning, multiscale modeling, and document processing (but not by us)
• #1 Google hit for “Python workflow software”
  ◦ still behind Pegasus, Kepler, Taverna, Trident for “scientific workflow software”

2

Page 3: FireWorks overview

http://xkcd.com/927/

3

Page 4: FireWorks overview

• Partly, we had trouble learning and using other people’s workflow software
  ◦ Today, I think the situation is much better
  ◦ For example, Pegasus in 2011 gave no instructions to a general user on how to install/use/deploy it apart from a super-complicated user manual
  ◦ Today, Pegasus takes more care to show you how to use it on their web page
  ◦ Other tools like Swift (Argonne) are also providing tutorials
• Partly, the other workflow software wasn’t what we were looking for
  ◦ Other software emphasized completing a fixed workload quickly rather than fluidly adding, subtracting, reprioritizing, searching, etc. workflows over long time periods

4

Page 5: FireWorks overview

http://www3.canisius.edu/~grandem/animalshabitats/animals.jpg

5

Page 6: FireWorks overview

• Millions of small jobs, each at least a minute long
• Small amount of inter-job parallelism (“bundling”) (e.g. <1000 jobs); any amount of intra-job parallelism
• Failures are common; need persistent status
  ◦ like UPS packages, database is a necessity
• Very dynamic workflows
  ◦ i.e. workflows that can modify themselves intelligently and act like researchers that submit extra calculations as needed
• Collisions/duplicate detection
  ◦ people submitting the same workflow, or perhaps have some steps in common
• Runs on a laptop or a supercomputer
• Not “extreme” or record-breaking applications
• Can install/learn/use it by yourself without help/support, and by a normal scientist rather than a “workflow expert”
• Python-centric

6

Page 7: FireWorks overview

• Features
• Potential issues
• Conclusion
• Appendix slides
  ◦ Implementation
  ◦ Getting started
  ◦ Advanced usage

7

Page 8: FireWorks overview

[Diagram: the LaunchPad holds FW 1, FW 2, FW 3, and FW 4; the Rocket Launcher / Queue Launcher runs them in Directory 1 and Directory 2]

8

Page 9: FireWorks overview

?

You can scale without human effort

Easily customize what gets run where

9

Page 10: FireWorks overview

• PBS
• SGE
• SLURM
• IBM LoadLeveler
• NEWT (a REST-based API at NERSC)
• Cobalt (Argonne LCF, initial runs of ~2 million CPU-hours successful)

10

Page 11: FireWorks overview

11

Page 12: FireWorks overview

No job left behind!

12

Page 13: FireWorks overview

LAUNCH:
• what machine
• what time
• what directory
• what was the output
• when was it queued
• when did it start running
• when was it completed

• Both job details (scripts + parameters) and launch details are automatically stored

13

Page 14: FireWorks overview

• Soft failures, hard failures, human errors
  ◦ “lpad rerun_fws -s FIZZLED” OR
  ◦ “lpad detect_unreserved --rerun” OR
  ◦ “lpad detect_lostruns --rerun”

14

Page 15: FireWorks overview

Xiaohui can be replaced by

digital Xiaohui, programmed into FireWorks

15

Page 16: FireWorks overview

16

Page 17: FireWorks overview

[Flowchart: structure relaxation followed by AIMD simulation]

1. Generate relaxation VASP input files from initial structure
2. Run VASP calculation with Custodian
3. Insert results into database
4. Set up AIMD simulation using final relaxed structure
5. Generate AIMD VASP input files from relaxed structure
6. Run VASP calculation with Custodian with Walltime Handler
7. Insert AIMD simulation results into database
8. Convergence reached?
   ◦ No: dynamically add continuation AIMD Firework that starts from previous run
   ◦ Yes: transfer AIMD calculation output to specified final location, then done

Each box represents a FireTask, and each series of boxes with the same color represents a single Firework.
• Green: initial structure relaxation run
• Blue: AIMD simulation
• Red: insert AIMD run into db

Dynamically add multiple parallel AIMD Fireworks, e.g., different incar configs, temperatures, etc.

17

Page 18: FireWorks overview

• Submitting millions of jobs
  ◦ Easy to lose track of what was done before
• Multiple users submitting jobs
• Sub-workflow duplication

Duplicate job detection (if two workflows contain an identical step, ensure that the step is only run once and relevant information is still passed)
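A minimal sketch of the idea behind duplicate detection, not FireWorks' actual implementation: if job specifications are JSON-like dictionaries, two steps can be treated as identical when a canonical serialization of their specs hashes to the same value. The function names `spec_fingerprint` and `submit`, and the example spec keys, are hypothetical.

```python
import hashlib
import json


def spec_fingerprint(spec):
    """Hash a JSON-serializable spec; key order does not affect the result."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def submit(spec, seen):
    """Record a spec in `seen`; return True if it is new, False if duplicate."""
    key = spec_fingerprint(spec)
    if key in seen:
        return False  # an identical step was already submitted
    seen[key] = spec
    return True


seen_specs = {}
first = submit({"structure": "Si", "task": "relax"}, seen_specs)
second = submit({"task": "relax", "structure": "Si"}, seen_specs)  # same content
third = submit({"structure": "Ge", "task": "relax"}, seen_specs)
print(first, second, third)
```

A real system would also need to hand the first run's results to the duplicate step, which this sketch leaves out.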

18

Page 19: FireWorks overview

• Within workflow, or between workflows
• Completely flexible and can be modified whenever you want

19

Page 20: FireWorks overview

Now seems like a good time to bring up the last few lines of the OUTCAR of all failed jobs...

20

Page 21: FireWorks overview

• Keep queue full with jobs
• Pack jobs automatically (to a point)

21

Page 22: FireWorks overview

22

• Keep queue full with jobs
• Pack jobs automatically (to a point)

Page 23: FireWorks overview

• Lots of care put into documentation and tutorials
  ◦ Many strangers and outsiders have independently used it w/o support from us
• Built-in tasks
  ◦ run BASH/Python scripts
  ◦ file transfer (incl. remote)
  ◦ write/copy/delete files

23

Page 24: FireWorks overview

• No direct funding for FWS – certainly not a multimillion dollar project
• Mitigating longevity concerns:
  ◦ FWS is open-source so the existing code will always be there
  ◦ FWS never required explicit funding for development / enhancement
  ◦ FWS has a distributed user and developer community, shielding it from a single point of failure
  ◦ Several multimillion dollar DOE projects and many research groups including my own depend critically on FireWorks. Funding for basic improvements/bugfixes is certainly going to be there if really needed.
• Mitigating support concerns:
  ◦ No funding does mean limited support for external users
  ◦ Support mechanisms favor solving problems broadly (e.g., better code, better documentation) versus working one-on-one with potential users to solve their problems and develop single-serving “workarounds”
  ◦ BUT there is a free support list, and if you look, you will see that even specific individual concerns are handled quickly and efficiently:
    ▪ https://groups.google.com/forum/#!forum/fireworkflows
  ◦ In fact, I have yet to see proof of better user support from well-funded projects:
    ▪ Compare against: http://mailman.isi.edu/pipermail/pegasus-users/
    ▪ Compare against: https://lists.apache.org/[email protected]
    ▪ Compare against: http://swift-lang.org/support/index.php (no results in any search?)

24

Page 25: FireWorks overview

• Features
• Potential issues
• Conclusion
• Appendix slides
  ◦ Implementation
  ◦ Getting started
  ◦ Advanced usage

25

Page 26: FireWorks overview

26

[Diagram: three deployment scenarios for the LaunchPad (MongoDB) and the FireWorker (computing resource)]

1. LaunchPad and FireWorker within the same network firewall
   → Works great
2. LaunchPad and FireWorker separated by firewall, BUT login node of FireWorker is open to MongoDB connection
   → Works great if you have a MOM node type structure
   → Otherwise “offline” mode is a non-ideal but viable option
3. LaunchPad and FireWorker separated by firewall, no communication allowed
   → Doesn’t work!

Page 27: FireWorks overview

[Figure, first plot: jobs/second (axis 2 to 6) versus number of jobs (0 to 1000) for the mlaunch and rlaunch commands]

[Figure, second plot: seconds per task (axis 0.0 to 0.6) versus number of tasks (200 to 1000), in panels for 1 workflow and 5 workflows and for 1 client and 8 clients; workflow patterns: pairwise, parallel, reduce, sequence]

• Tests indicate that FireWorks can handle a throughput of about 6-7 jobs finishing per second
• Overhead is 0.1-1 sec per task
• Recent changes might enhance speed, but this has not been tested
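To put these numbers in context, here is a back-of-the-envelope calculation. The throughput and overhead ranges are the ones quoted above; the workload of one million jobs and the 60-second job length are illustrative assumptions based on the "millions of small jobs, each at least a minute long" use case.

```python
def hours_to_finish(n_jobs, jobs_per_second):
    """Wall time in hours if job-completion throughput were the only limit."""
    return n_jobs / jobs_per_second / 3600.0


def overhead_fraction(overhead_s, job_s):
    """Fraction of total task time spent on workflow overhead."""
    return overhead_s / (job_s + overhead_s)


n_jobs = 1_000_000     # assumed workload size
job_length_s = 60.0    # assumed job length in seconds

slow = hours_to_finish(n_jobs, 6.0)
fast = hours_to_finish(n_jobs, 7.0)
print("time at quoted throughput: {:.1f} to {:.1f} hours".format(fast, slow))

low = overhead_fraction(0.1, job_length_s)
high = overhead_fraction(1.0, job_length_s)
print("overhead share of a 60 s job: {:.2%} to {:.2%}".format(low, high))
```

Under these assumptions the workflow layer adds at most a couple of percent to each one-minute job, which is why the target use case specifies jobs of at least a minute.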

27

Page 28: FireWorks overview

• Computing center issues
  ◦ Almost all computing centers limit the number of “mpirun”-style commands that can be executed within a single job
  ◦ Typically, this sets a limit to the degree of job packing that can be achieved
  ◦ Currently, no good solution; may need to work on “hacking” the MPI communicator, e.g., “wraprun” is one effort at Oak Ridge.

28

Page 29: FireWorks overview

• Features
• Potential issues
• Conclusion
• Appendix slides
  ◦ Implementation
  ◦ Getting started
  ◦ Advanced usage

29

Page 30: FireWorks overview

• If you are curious, just try spending 1 hour with FireWorks
  ◦ http://pythonhosted.org/FireWorks
  ◦ If you’re not intrigued after an hour, try something else
• If you need help, contact the support list:
  ◦ https://groups.google.com/forum/#!forum/fireworkflows
• If you want to read up on FireWorks, there is a paper, but this is no substitute for trying it
  ◦ “FireWorks: a dynamic workflow system designed for high-throughput applications”. Concurr. Comput. Pract. Exp. 27, 5037–5059 (2015).
  ◦ Please cite this if you use FireWorks

30

Page 31: FireWorks overview

• Features
• Potential issues
• Conclusion
• Appendix slides
  ◦ Implementation
  ◦ Getting started
  ◦ Advanced usage

31

Page 32: FireWorks overview

[Diagram: three Fireworks, each with a Spec and its FireTasks, plus an FWAction]

• FW 1: Spec, FireTask 1, FireTask 2
• FW 2: Spec, FireTask 1
• FW 3: Spec, FireTask 1, FireTask 2, FireTask 3
• FWAction

32

Page 33: FireWorks overview

from fireworks import Firework, Workflow, LaunchPad, ScriptTask
from fireworks.core.rocket_launcher import rapidfire

# set up the LaunchPad and reset it (first time only)
launchpad = LaunchPad()
launchpad.reset('', require_password=False)

# define the individual FireWorks and Workflow
fw1 = Firework(ScriptTask.from_str('echo "To be, or not to be,"'))
fw2 = Firework(ScriptTask.from_str('echo "that is the question:"'))
wf = Workflow([fw1, fw2], {fw1: fw2})  # set of FWs and dependencies

# store workflow in LaunchPad
launchpad.add_wf(wf)

# pull all jobs and run them locally
rapidfire(launchpad)

33

Page 34: FireWorks overview

fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'To be, or not to be,'
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'that is the question:'
links:
  1:
  - 2
metadata: {}

(this is YAML, a bit prettier for humans but less pretty for computers)

The same JSON document will produce the same result on any computer (with the same Python functions).

34

Page 35: FireWorks overview

fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'To be, or not to be,'
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'that is the question:'
links:
  1:
  - 2
metadata: {}

(this is YAML, a bit prettier for humans but less pretty for computers)

Just some of your search options:

• simple matches
• match in array
• greater than/less than
• regular expressions
• match subdocument
• Javascript function
• MapReduce…

All for free, and all on the native workflow format!

35

Page 36: FireWorks overview

36

Page 37: FireWorks overview

• Theme: Worker machine pulls a job & runs it
• Variation 1:
  ◦ different workers can be configured to pull different types of jobs via config + MongoDB
• Variation 2:
  ◦ worker machines sort the jobs by a priority key and pull the matching jobs with the highest priority
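The pull model above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not FireWorks code: a list stands in for the database, `pull_next_job` and the `matches` predicate are hypothetical names, and the `kind` field is made up. The `_priority` key mirrors the priority key mentioned in Variation 2.

```python
def pull_next_job(jobs, matches=lambda job: True):
    """Return the READY job with the highest priority that the worker accepts.

    `jobs` is a plain list standing in for the database; `matches` is a
    hypothetical predicate standing in for the worker's configuration.
    """
    candidates = [j for j in jobs if j["state"] == "READY" and matches(j)]
    if not candidates:
        return None
    best = max(candidates, key=lambda j: j["spec"].get("_priority", 0))
    best["state"] = "RUNNING"  # mark as taken so no other worker pulls it
    return best


jobs = [
    {"fw_id": 1, "state": "READY", "spec": {"_priority": 1, "kind": "relax"}},
    {"fw_id": 2, "state": "READY", "spec": {"_priority": 5, "kind": "aimd"}},
    {"fw_id": 3, "state": "READY", "spec": {"_priority": 3, "kind": "relax"}},
]

any_worker_pick = pull_next_job(jobs)
relax_worker_pick = pull_next_job(jobs, lambda j: j["spec"]["kind"] == "relax")
print(any_worker_pick["fw_id"], relax_worker_pick["fw_id"])
```

In a real deployment the "find and mark as taken" step has to be atomic in the database so that two workers cannot pull the same job; the sketch ignores concurrency.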

37

Page 38: FireWorks overview

[Diagram: the queue launcher (running on login node or crontab) submits a series of generic “thruput job” entries to the queue]

• Job wakes up when PBS runs it
• Grabs the latest job description from an external DB
• Runs the job based on DB description

38

Page 39: FireWorks overview

• Multiple processes pull and run jobs simultaneously
  ◦ It is all the same thing, just sliced* different ways!

[Diagram: 1 large job split into independent processes]

Query Job -> job A -> update DB    (mpirun -> Node 1, mol a)
Query Job -> job B -> update DB    (mpirun -> Node 2, mol b)
Query Job -> job X -> update DB    (mpirun -> Node n, mol x)

*get it? wink wink

39

Page 40: FireWorks overview

because jobs are JSON, they are completely serializable!
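As a small illustration of what serializability buys you, the two-step workflow document shown earlier in YAML can be written as a plain Python dict and round-tripped through JSON text without loss. This sketch uses only the standard library rather than FireWorks itself; the `links` key is written as the string "1" because JSON object keys must be strings.

```python
import json

workflow = {
    "fws": [
        {"fw_id": 1, "spec": {"_tasks": [
            {"_fw_name": "ScriptTask", "script": "echo 'To be, or not to be,'"}]}},
        {"fw_id": 2, "spec": {"_tasks": [
            {"_fw_name": "ScriptTask", "script": "echo 'that is the question:'"}]}},
    ],
    "links": {"1": [2]},
    "metadata": {},
}

text = json.dumps(workflow, indent=2)   # plain text: store, send, or diff it
restored = json.loads(text)             # rebuild the identical document
print(restored == workflow)
```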

40

Page 41: FireWorks overview

• Features
• Potential issues
• Conclusion
• Appendix slides
  ◦ Implementation
  ◦ Getting started
  ◦ Advanced usage

41

Page 42: FireWorks overview

[Diagram: two parallel jobs pass their results (6 and 15) to a third job]

input_array: [1, 2, 3]
  1. Sum input array
  2. Write to file
  3. Pass result to next job

input_array: [4, 5, 6]
  1. Sum input array
  2. Write to file
  3. Pass result to next job

input_data: [6, 15]
  1. Sum input data
  2. Write to file
  3. Pass result to next job
  -------------------------------------
  1. Copy result to home dir

Page 43: FireWorks overview

# class name as used on the slide; newer FireWorks releases spell it FiretaskBase
from fireworks.core.firework import FireTaskBase, FWAction

class MyAdditionTask(FireTaskBase):
    _fw_name = "My Addition Task"

    def run_task(self, fw_spec):
        input_array = fw_spec['input_array']
        m_sum = sum(input_array)
        print("The sum of {} is: {}".format(input_array, m_sum))

        with open('my_sum.txt', 'a') as f:
            f.writelines(str(m_sum) + '\n')

        # store the sum; push the sum to the input array of the next sum
        return FWAction(stored_data={'sum': m_sum}, mod_spec=[{'_push': {'input_array': m_sum}}])

See also: http://pythonhosted.org/FireWorks/guide_to_writing_firetasks.html

input_array: [1, 2, 3]
  1. Sum input array
  2. Write to file
  3. Pass result to next job
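To make the `mod_spec` step concrete, here is a toy model of what a `_push` modification does to the spec of the next job. It is a simplified stand-in written for illustration, not the FireWorks implementation; `apply_push` is a hypothetical helper that handles only `_push`.

```python
def apply_push(spec, mod_spec):
    """Apply only the '_push' operations from a mod_spec list to a spec dict."""
    for modification in mod_spec:
        for key, value in modification.get("_push", {}).items():
            # append to the list stored under `key`, creating it if needed
            spec.setdefault(key, []).append(value)
    return spec


child_spec = {}  # the child job starts without an input_array
apply_push(child_spec, [{"_push": {"input_array": 6}}])    # from sum([1, 2, 3])
apply_push(child_spec, [{"_push": {"input_array": 15}}])   # from sum([4, 5, 6])
total = sum(child_spec["input_array"])
print(child_spec, total)
```

Each parent appends its own sum, so the child sees both values in its `input_array` once both parents have completed.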

Page 44: FireWorks overview

[Diagram: two parallel jobs pass their results (6 and 15) to a third job]

input_array: [1, 2, 3]
  1. Sum input array
  2. Write to file
  3. Pass result to next job

input_array: [4, 5, 6]
  1. Sum input array
  2. Write to file
  3. Pass result to next job

input_data: [6, 15]
  1. Sum input data
  2. Write to file
  3. Pass result to next job
  -------------------------------------
  1. Copy result to home dir

from fireworks import Firework, Workflow, LaunchPad, FWorker, FileTransferTask
from fireworks.core.rocket_launcher import rapidfire

# set up the LaunchPad and reset it
launchpad = LaunchPad()
launchpad.reset('', require_password=False)

# create Workflow consisting of AdditionTask FWs + file transfer
fw1 = Firework(MyAdditionTask(), {"input_array": [1, 2, 3]}, name="pt 1A")
fw2 = Firework(MyAdditionTask(), {"input_array": [4, 5, 6]}, name="pt 1B")
fw3 = Firework([MyAdditionTask(), FileTransferTask({"mode": "cp", "files": ["my_sum.txt"], "dest": "~"})], name="pt 2")
wf = Workflow([fw1, fw2, fw3], {fw1: fw3, fw2: fw3}, name="MAVRL test")
launchpad.add_wf(wf)

# launch the entire Workflow locally
rapidfire(launchpad, FWorker())

Page 45: FireWorks overview

• lpad get_wflows -d more
• lpad get_fws -i 3 -d all
• lpad webgui
• Also rerun features

See all reporting at official docs: http://pythonhosted.org/FireWorks

Page 46: FireWorks overview

• There are a ton in the documentation and tutorials, just try them!
  ◦ http://pythonhosted.org/FireWorks
• I want an example of running VASP!
  ◦ https://github.com/materialsvirtuallab/fireworks-vasp
  ◦ https://gist.github.com/computron/
    ▪ look for “fireworks-vasp_demo.py”
  ◦ Note: demo is only a single VASP run
  ◦ multiple VASP runs require passing directory names between jobs
    ▪ currently you must do this manually
    ▪ in future, perhaps build into FireWorks

Page 47: FireWorks overview

• If you can copy commands from a web page and type them into a Terminal, you possess the skills needed to complete the FireWorks tutorials
  ◦ BUT: for long-term use, highly suggested you learn some Python
• Go to:
  ◦ http://pythonhosted.org/FireWorks
  ◦ or Google “FireWorks workflow software”
• NERSC-specific instructions & notes:
  ◦ https://pythonhosted.org/FireWorks/installation_notes.html

47

Page 48: FireWorks overview

• Features
• Potential issues
• Conclusion
• Appendix slides
  ◦ Implementation
  ◦ Getting started
  ◦ Advanced usage

48

Page 49: FireWorks overview

• Say you have a FWS database with many different job types, and want to run different job types on different machines
• You have three options:
  1. Set the “_fworker” variable in the FW itself. Only the FWorker(s) with the matching name will run the job.
  2. Set the “_category” variable in the FW itself. Only the FWorker(s) with the matching categories will run the job.
  3. Set the “query” parameter in the FWorker. You can set any Mongo query on the FW to decide what jobs this FWorker will run, e.g., jobs with certain parameter ranges.
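The three options can be pictured as a single acceptance check that each worker applies to a job. The sketch below is a simplified model in plain Python, not FireWorks' own matching logic: workers are dicts, a Python predicate stands in for the Mongo query of option 3, and `worker_accepts` and the `n_atoms` parameter are hypothetical.

```python
def worker_accepts(worker, spec):
    """Decide whether a worker should run a job, following the three options."""
    # Option 1: the job names the worker it wants.
    if "_fworker" in spec and spec["_fworker"] != worker["name"]:
        return False
    # Option 2: the job names a category the worker must belong to.
    if "_category" in spec and spec["_category"] not in worker["categories"]:
        return False
    # Option 3: worker-side filter (a predicate standing in for a Mongo query).
    return worker.get("query", lambda s: True)(spec)


big_machine = {
    "name": "supercomputer",
    "categories": ["parallel"],
    "query": lambda s: s.get("n_atoms", 0) >= 100,  # hypothetical parameter range
}
laptop = {"name": "laptop", "categories": ["serial"]}

job_a = {"_category": "parallel", "n_atoms": 250}
job_b = {"_fworker": "laptop", "n_atoms": 10}

results = [
    worker_accepts(big_machine, job_a),
    worker_accepts(laptop, job_a),
    worker_accepts(big_machine, job_b),
    worker_accepts(laptop, job_b),
]
print(results)
```

Options 1 and 2 put the routing decision in the job, while option 3 puts it in the worker, so the two sides can be combined as shown.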

49

Page 50: FireWorks overview

• Both Trackers and BackgroundTasks will run a process in the background of your main FW.
• A Tracker is a quick way to monitor the first or last few lines of a file (e.g., output file) during job execution. It is also easy to set up: just set the “_trackers” variable in the FW spec with the details of what files you want to monitor.
  ◦ This allows you to track output files of all your jobs using the database.
  ◦ For example, one command will let you view the output files of all failed jobs, all without logging into any machines!
• A BackgroundTask will run any FireTask in a separate Process from the main task. There are built-in parameters to help.

50

Page 51: FireWorks overview

• Sometimes, the specific Python code that you need to execute (FireTask) depends on what machine you are running on
• A solution to this is FW_env
• Each Worker configuration can set its own “env” variable, which is accessible by the FireWork when running within the “_fw_env” key
• The same job will see different values of “_fw_env” depending on where it’s running, and use this to execute the workflow
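A minimal sketch of how a task might use this, assuming the worker's “env” settings appear under the “_fw_env” key of the spec passed to the task, as described above. The function `build_command` and the `vasp_cmd` key are hypothetical names chosen for illustration.

```python
def build_command(fw_spec):
    """Pick the executable from the worker-specific '_fw_env' entry of the spec."""
    env = fw_spec.get("_fw_env", {})
    executable = env.get("vasp_cmd", "vasp")  # 'vasp_cmd' is a hypothetical key
    return "{} > vasp.out".format(executable)


job = {"structure": "Si"}
on_cluster = dict(job, _fw_env={"vasp_cmd": "srun -n 64 vasp_std"})
on_laptop = dict(job, _fw_env={})
print(build_command(on_cluster))
print(build_command(on_laptop))
```

The job description itself stays identical on both machines; only the worker-supplied environment differs.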

51

Page 52: FireWorks overview

• Normally, a workflow stops proceeding when a FireWork fails, or “fizzles”.
  ◦ at this point, a user might change some backend code and rerun the failed job
• Sometimes, you want a child FW to run even if one or more parents have “fizzled”.
  ◦ For example, the child FW might inspect the parent, determine a cause of failure, and initiate a “recovery workflow”
• To enable a child to run, set the “_allow_fizzled_parents” key in the spec to True
  ◦ FWS also creates a “_fizzled_parents” key in that FW spec that becomes available when the parents fail, and contains details about the parent FW
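A sketch of what a recovery-minded child task could do with this information. The exact contents of “_fizzled_parents” are not shown in these slides, so the shape used here (a list of dicts with `fw_id` and `name`) is an assumption, and `plan_recovery` is a hypothetical helper.

```python
def plan_recovery(fw_spec):
    """Return a list of recovery steps based on which parents failed."""
    failed = fw_spec.get("_fizzled_parents", [])
    if not failed:
        return ["continue as normal"]
    # Each entry is assumed to be a dict describing one failed parent.
    return ["inspect and rerun parent {}".format(p["fw_id"]) for p in failed]


child_spec = {
    "_allow_fizzled_parents": True,
    "_fizzled_parents": [{"fw_id": 7, "name": "relax"}],  # assumed shape
}
print(plan_recovery(child_spec))
print(plan_recovery({"_allow_fizzled_parents": True}))
```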

52

Page 53: FireWorks overview

• You might want some statistics on FWS jobs:
  ◦ daily, weekly, monthly reports over certain periods for how many Workflows/FireWorks/etc. completed
  ◦ identify days when there were many job failures, perhaps associated with a computing center outage
  ◦ grouping FIZZLED jobs by a key in the spec, e.g. to get stats on what job types failed most often
• All this is possible with the reporting package; type “lpad report -h” for more information
• You can also introspect to find common factors in job failures; type “lpad introspect -h” for more information

53