MAVRL Workshop 2014 - pymatgen-db & custodian

25
materiaIs virtuaLab Pymatgen-db and Custodian Shyue Ping Ong November 10, 2014 MAVRL Workshop 2014

description

This presentation was part of the workshop on Materials Project Software infrastructure conducted for the Materials Virtual Lab in Nov 10 2014. It presents an introduction to the pymatgen-db database plugin for the pymatge) materials analysis library, and the custodian error recovery framework. Pymatgen-db enables the creation of Materials Project-style MongoDB databases for management of materials data. A query engine is also provided to enable the easy translation of MongoDB docs to useful pymatgen objects for analysis purposes. Custodian is a simple, robust and flexible just-in-time (JIT) job management framework written in Python. Using custodian, you can create wrappers that perform error checking, job management and error recovery. It has a simple plugin framework that allows you to develop specific job management workflows for different applications. Error recovery is an important aspect of many high-throughput projects that generate data on a large scale. The specific use case for custodian is for long running jobs, with potentially random errors. For example, there may be a script that takes several days to run on a server, with a 1% chance of some IO error causing the job to fail. Using custodian, one can develop a mechanism to gracefully recover from the error, and restart the job with modified parameters if necessary. The current version of Custodian also comes with sub-packages for error handling for Vienna Ab Initio Simulation Package (VASP) and QChem calculations.

Transcript of MAVRL Workshop 2014 - pymatgen-db & custodian

Page 1: MAVRL Workshop 2014 - pymatgen-db & custodian

materiaIsvirtuaLab

Pymatgen-db and Custodian

Shyue Ping Ong

November 10, 2014

MAVRL Workshop 2014

Page 2: MAVRL Workshop 2014 - pymatgen-db & custodian

Introduction

•  Database plugin for pymatgen •  Provides an interface between pymatgen objects

and database representations

Pymatgen-db

•  Error-correcting framework for long-running codes with unpredictable errors

•  Codifies knowledge of how to perform simulations •  Includes extensive rule sets for VASP and Qchem.

Custodian

November 10, 2014 MAVRL Workshop 2014

Page 3: MAVRL Workshop 2014 - pymatgen-db & custodian

The as_dict() and from_dict() protocol

Almost all non-trivial objects in pymatgen support the as_dict() and from_dict() serialization protocol.

November 10, 2014

Pymatgen Object, e.g., Structure

Pymatgen dict of

primitives

JSON file/response

Database, e.g.,

MongoDB

Object.as_dict() json.dump/load

pymongo Structure.from_spacegroup("Pm-3m", Lattice.cubic(2.872), ["Fe"], [[0, 0, 0]])

{u'lattice': {u'a': 2.872, u'gamma': 90.0, u'c': 2.872, u'b': 2.872, u'matrix': [[2.872, 0.0, 0.0], [0.0, 2.872, 0.0], [0.0, 0.0, 2.872]], u'volume': 23.689358847999998, u'alpha': 90.0, u'beta': 90.0}, u'sites': [{u'properties': {}, u'abc': [0.0, 0.0, 0.0], u'xyz': [0.0, 0.0, 0.0], u'species': [{u'occu': 1, u'element': u'Fe'}], u'label': u'Fe'}], u'@class': 'Structure', u'@module': 'pymatgen.core.structure'}

Object.from_dict()

MAVRL Workshop 2014

Page 4: MAVRL Workshop 2014 - pymatgen-db & custodian

What is MongoDB?

MongoDB is an open-source NoSQL document database. •  JSON-style documents with dynamic schemas offer simplicity and power. •  Full Index Support •  Replication & High Availability by mirroring across LANs and WANs for scale and peace

of mind. •  Rich, document-based queries. •  Fast In-Place Updates •  Map/Reduce for flexible aggregation and data processing.

Note: Materials Project’s development team have worked with both SQL and NoSQL solutions for materials data management. Our experience has been that document-based databases is very well suited to deal with the rapidly changing nature of scientific research, especially since there are “natural” document definitions (see next slide).

November 10, 2014 MAVRL Workshop 2014

Page 5: MAVRL Workshop 2014 - pymatgen-db & custodian

How is MongoDB used in the Materials Project?

MongoDB databases contains collections of documents.

Each logical unit of calculation or analysis or material is a document in a collection.

November 10, 2014

Collection name Document description

tasks Results of a single first principles calculation, be it a relaxation, static or bandstructure calculation

materials Summary of computed information about a material formed by aggregating all tasks of a crystal structure.

(bandstructures, phasediagrams, batteries, etc.)

Documents representing results of various types of analyses. E.g., a battery document is formed by combining several materials (e.g., FePO4 and LiFePO4) to represent an intercalation system with associated properties like voltage, capacity, etc.

MAVRL Workshop 2014

Page 6: MAVRL Workshop 2014 - pymatgen-db & custodian

Pymatgen-db (http://pythonhosted.org//pymatgen-db/ )

Database add-on for pymatgen. Enables the creation of Materials Project-style MongoDB (www.mongodb.org) databases for management of materials data. Key features: •  Query engine for easy translation of MongoDB docs to useful pymatgen

objects for analysis purposes. •  Includes a clean and intuitive web ui (the Materials Genomics UI) for

exploring Mongo collections.

Pymatgen-db is primarily used for the generation of the “tasks” collection, i.e., it parses VASP calculations, convert the results into a pymatgen dict, and inserts it into a MongoDB collection as a document. It also facilitates the retrieval of these documents as pymatgen objects.

November 10, 2014 MAVRL Workshop 2014

Page 7: MAVRL Workshop 2014 - pymatgen-db & custodian

How does it help you?

If you are only doing a few calculations, it is probably not worth your time to deal with the overhead of learning MongoDB, pymatgen-db, etc.

However, if you perform hundreds or thousands of similar calculations on different materials, and you need some proper way to manage and analyze this data (note: compiling calculations in a spreadsheet does not constitute proper data management!), it is well worth taking the time to learn MongoDB and pymatgen-db.

November 10, 2014 MAVRL Workshop 2014

Page 8: MAVRL Workshop 2014 - pymatgen-db & custodian

Basic tutorial using the command line

No hands-on for pymatgen-db as it requires MongoDB server and actual completed VASP calculations to insert.

Demo will show basic tools for working with MongoDB, and show the final results and document structure.

November 10, 2014

Hint: A quick way of playing around with pymatgen-db without going through the hassle of installing MongoDB on your own machine is to sign up for a free MongoLab (https://mongolab.com) “sandbox” account. The default 512Mb should be sufficient for you to insert a few VASP calculations and play around with some queries to decide if this is suitable for your needs. If you decide to get serious, it is highly recommended that you go through the MongoDB tutorial to learn how to use MongoDB effectively. There is a learning curve, but it rapidly pays dividends when you have to manage a lot of calculations.

MAVRL Workshop 2014

Page 9: MAVRL Workshop 2014 - pymatgen-db & custodian

The mgdb CLI

Pymatgen-db has a comprehensive command-line tool called mgdb. •  Perform many common tasks without ever looking at the pymatgen-db

source code or writing a single piece of code.

After installing pymatgen-db, you can call up the help by typing:

mgdb --help

on the command line.

November 10, 2014 MAVRL Workshop 2014

Page 10: MAVRL Workshop 2014 - pymatgen-db & custodian

Initial setup

Step 1: Set up your MongoDB server (either with MongoLab, or starting your own server)

Step 2: Run

mgdb init -c db.json

to create a database credentials file called db.json. The file stores your database connection details and authentication information, which is used by other commands to access the database.

November 10, 2014 MAVRL Workshop 2014

Page 11: MAVRL Workshop 2014 - pymatgen-db & custodian

Inserting your first VASP calculation(s)

Let’s say that you have performed VASP relaxation calculations on ten different materials, and the complete calculation output is in a directory called “my_nobel_prize_research”.

To insert all ten calculations into the MongoDB, you simply need to type:

mgdb insert -c /path/to/db.json my_nobel_prize_research

November 10, 2014 MAVRL Workshop 2014

Page 12: MAVRL Workshop 2014 - pymatgen-db & custodian

Exploring the MongoDB collection

To explore the MongoDB collection, you have several options: •  Use the mongo command line given by MongoDB itself. •  Use a GUI MongoDB manager such as MongoHub (use the actively

maintained version at https://github.com/jeromelebel/MongoHub-Mac) •  You can do simple queries using mgdb mgdb query -c db.json --crit '{"pretty_formula": "Li2O"}' --props task_id energy_per_atom •  Use the Materials Genomics UI provided with pymatgen-db. •  Write python code using the matgendb.QueryEngine class to perform

queries.

Please consult the pymatgen-db documentation for details of the last three options.

November 10, 2014 MAVRL Workshop 2014

Page 13: MAVRL Workshop 2014 - pymatgen-db & custodian

More advanced usage

mgdb CLI tool is simply a way to facilitate usage of pymatgen-db for common calculations and use cases. Pymatgen-db is capable of a lot more.

Two key classes here that most developers would need to be aware of: •  matgendb.creator.VaspToDBTaskDrone: Drone class that can be

used with pymatgen’s borg framework to assimilate all calculations within a directory structure and insert them into the database.

•  matgendb.query_engine.QueryEngine: Interface layer (utilizing pymongo) between pymatgen objects and the MongoDB database. Facilitates conversions to pymatgen objects, as well as various syntax enhancements for materials applications, e.g., standardizing formula strings for queries.

November 10, 2014 MAVRL Workshop 2014

Page 14: MAVRL Workshop 2014 - pymatgen-db & custodian

materiaIsvirtuaLab

Custodian

November 10, 2014

MAVRL Workshop 2014

Page 15: MAVRL Workshop 2014 - pymatgen-db & custodian

Custodian

Custodian is a simple, robust and flexible just-in-time (JIT) job management framework. •  Wrappers to perform error checking, job management and error

recovery. •  Error recovery is an important aspect of many high-throughput projects

that generate data on a large scale. O(100,000) jobs + 1% error rate => O(1000) errored jobs. •  Recovery for long running jobs, with potentially random errors. •  Existing sub-packages for error handling for VASP, NwChem and

QChem calculations.

November 10, 2014 MAVRL Workshop 2014

Page 16: MAVRL Workshop 2014 - pymatgen-db & custodian

How custodian works

Blue: Controlled by subclasses of Job abstract class

Red: Defined by ErrorHandlers.

Note: Custodian executes Jobs serially, but more than one error handler can be defined for each job.

November 10, 2014 MAVRL Workshop 2014

Page 17: MAVRL Workshop 2014 - pymatgen-db & custodian

How custodian works

November 10, 2014

Job.setup() method

Job.run() method

Job.postprocess() method

Handler.check() method Returns True/False.

Handler.correct() method Executed if check() returns True.

MAVRL Workshop 2014

Page 18: MAVRL Workshop 2014 - pymatgen-db & custodian

The Custodian class

c = Custodian(

handlers=[ExampleHandler1(args), ExampleHandler2(args)],

jobs=[ExampleJob1(args), ExampleJob2(args)],

validators=[ExampleValidator1(args), ExampleValidator2(args)],

max_errors=100, scratch_dir=“<location of scratch directory>”,

gzipped_output=True, checkpoint=True)

c.run() #Runs all jobs in a self-healing manner.

How to use Custodian

1.  Write subclasses of Job that defines what to do.

2.  Write subclasses of ErrorHandlers that defines the rules to detect errors and how to correct them

3.  (new) Write Validators that define what is a valid state at the end of a Job.

4.  String your Jobs, ErrorHandlers and Validators to define a complete calculation.

November 10, 2014 MAVRL Workshop 2014

Page 19: MAVRL Workshop 2014 - pymatgen-db & custodian

Concrete Example for VASP calculations

Extensive set of rules have been codified for running VASP calculations

Significantly reduces error rate of calculations (< 1%)

November 10, 2014 MAVRL Workshop 2014

Page 20: MAVRL Workshop 2014 - pymatgen-db & custodian

VaspJob class

VaspJob(vasp_cmd, output_file="vasp.out", suffix="", auto_npar=True, auto_gamma=True, …<other less frequently used parameters>...) •  Simply calls vasp_cmd, e.g., [“mpirun”, “-n”, “16”, “vasp”] with intelligent

options. •  auto_npar: automatically modifies NPAR in INCAR to a relatively optimal

number based on detected number of processors! Enhances vasp calculation efficiency by ~10-30%!!!

•  auto_gamma: If this is a gamma-only calculation and a gamma compiled version of vasp exists, use it. Another 10-20% increase in efficiency!

Even without error handling, custodian already significantly improves resource utilization of running VASP calculations!

November 10, 2014 MAVRL Workshop 2014

Page 21: MAVRL Workshop 2014 - pymatgen-db & custodian

VaspErrorHandler

Features •  Handles more than 20 common errors encountered in typical VASP

relaxation and static calculations. •  Monitors the VaspJob, i.e., running in parallel while the Job is running and

stops the job if error is detected to avoid wasting computational resources. •  Reads vasp.out to detect error messages such as "Tetrahedron method fails

for NKPT<4”, "Fatal error detecting k-mesh”, etc. •  Applies correction rules. E.g., to deal with "Tetrahedron method fails for

NKPT<4”, the Handler sets ISMEAR to 0.

This is the most important handler. If you want to know how to fix a particular error you see in your runs, this is where you start!

November 10, 2014 MAVRL Workshop 2014

Page 22: MAVRL Workshop 2014 - pymatgen-db & custodian

Other handlers

ErrorHandler Purpose

MeshSymmetryErrorHandler Fixes “Reciprocal lattice and k-lattice belong to different class of lattices”.

NonConvergingErrorHandler Checks if a job has been hitting max electronic steps for multiple ionic steps. If so, restart the job with ALGO=Normal instead of Fast.

WalltimeHandler Particularly useful handler for long running jobs. At a set period before a job hits the walltime (say 5 mins), write STOPCAR to terminate the job gracefully. This will keep the WAVECAR and CHGCAR consistent for subsequent restarts.

-- many other handlers --

November 10, 2014 MAVRL Workshop 2014

Page 23: MAVRL Workshop 2014 - pymatgen-db & custodian

Options in the Custodian class

c = Custodian(

handlers=[….], jobs=[….], max_errors=100,

scratch_dir=“<location of scratch directory>”,

gzipped_output=True, checkpoint=True) •  max_errors: define the maximum number of errors that are encountered before

Custodian forcibly terminates (to avoid wasting resources on a hopeless situation). •  scratch_dir: Points to the root location of the scratch space. Most clusters have scratch

spaces that offer significantly faster IO. Custodian will intelligently run your job on the scratch space and copy the results back to the working directory at the end.

•  gzipped_output: gzip compress the final output to save storage space. Most pymatgen IO classes support gzipped files transparently.

•  checkpoint: Whether to checkpoint after each successful Job. Allows for multi-step jobs to recover from the last successful step instead of rerunning the whole series of jobs.

November 10, 2014 MAVRL Workshop 2014

Page 24: MAVRL Workshop 2014 - pymatgen-db & custodian

run_vasp command-line tool

While you can always write your own python scripts to run self-healing VASP calculations (and this is a necessity if you want heavy customization of the various options), a run_vasp CLI caters for most relaxation and static calculations.

Example usage:

run_vasp -c “mpirun -np 16 vasp” -s /path/to/scratch/rootdir relax relax static

November 10, 2014

Specifies the command to run vasp. Differs from cluster to cluster.

Define the scratch directory root. Actual calculations to perform. This means two relaxation and a static calculation

MAVRL Workshop 2014

Page 25: MAVRL Workshop 2014 - pymatgen-db & custodian

Summary

Pymatgen-db and custodian are frameworks designed specifically for high-throughput research in mind.

Though the major use case is currently first principles calculations, but the frameworks are flexible enough to cater to any kind of HT research.

November 10, 2014 MAVRL Workshop 2014