
Case Study

“The task scheduler of Intel® Threading Building Blocks transparently handles the allocation of tasks to underlying computer hardware, thereby leaving the programmer free to concentrate, metaphorically, on managing the factory rather than managing individual workers.”

Christopher Woods, Advanced Computing Research Centre, University of Bristol

Education/Research | High-Performance Computing | Software

Accelerating Rational Drug Design

The University of Bristol is an international powerhouse of learning, discovery, and enterprise whose excellence is acknowledged locally, nationally, and globally. An Intel Parallel Compute Centre, the University is at the forefront of high-performance computing in the UK.

The University’s Advanced Computing Research Centre, founded in 2007, provides advanced computing support to researchers. Its research software engineers work with academics across a range of disciplines to help optimize research software that can be applied in industry.

To perform some of the calculations needed for drug design, the University uses the LigandSwap* program with a task-based parallel programming approach, with help from Intel® Threading Building Blocks (Intel® TBB) and its efficient task scheduling. The researchers found that parallelizing LigandSwap with Intel TBB took fewer than 100 lines of Intel TBB-specific code in a code base of more than 100,000 lines, and enabled a calculation that would take an estimated 25 days on a single core to complete in about one day.

Challenge

Computational molecular design is an essential part of developing new medicines and agrochemicals. Using computational models, scientists now routinely design new compounds at the molecular level. Pharmaceutical companies design new compounds to disrupt the action of viruses and bacteria, providing new treatments for diseases. Agrochemical companies design new compounds that disrupt biological processes within pests, supporting the sustainable growth of agriculture to satisfy the food requirements of the world’s growing population.

LigandSwap* [1], based on WaterSwap* [2, 3], is a computer program that calculates one of the key quantities needed for rational drug design: the relative binding free energy of two potential small-molecule medicinal drugs (ligands) to a target protein. The calculation runs a replica exchange [4] simulation, which generates multiple Monte Carlo trajectories along a λ-coordinate. This λ-coordinate is used to swap one ligand bound to the target protein with another. Typically, 16 λ-trajectories are used, with each involving about 100 million Monte Carlo moves. This means over a billion Monte Carlo moves are required in total.

Using Intel® Threading Building Blocks, the University of Bristol helps slash calculation time for drug development

http://www.bris.ac.uk/

Most of the computational cost for each move is the evaluation of the Coulomb and Lennard-Jones (LJ) interactions between the candidate ligands and the protein and surrounding water molecules. With each Coulomb and LJ calculation requiring approximately 30 floating-point operations (FLOPs) per atom pair, and 175,000 pairs needing to be evaluated per Monte Carlo move, a complete LigandSwap calculation requires about 5 × 10^15 floating-point operations (5 petaFLOPs) in total.

Assuming this was calculated on a single core of a dual-socket, 8-core, 2.3 GHz Intel® processor at one floating-point operation per clock cycle, 5 × 10^15 operations would take about 25 days. However, a single LigandSwap calculation takes about one day on this same processor. LigandSwap achieves this because it has been programmed to take advantage of all 16 available cores using a task-based parallel programming approach.
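The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function and constant names are hypothetical, "over a billion" moves is taken as exactly 1e9, and perfect scaling at one floating-point operation per cycle per core is assumed.

```cpp
#include <cassert>

// Figures quoted in the text above.
constexpr double kFlopsPerPair = 30.0;      // Coulomb + LJ cost per atom pair
constexpr double kPairsPerMove = 175000.0;  // atom pairs evaluated per Monte Carlo move
constexpr double kClockHz      = 2.3e9;     // 2.3 GHz, assumed one FLOP per cycle
// Assumption: "over a billion" moves is taken as exactly 1e9.
constexpr double kTotalMoves   = 1.0e9;

// Total floating-point operations for one complete calculation.
constexpr double total_flops(double flops_per_pair, double pairs_per_move, double moves) {
    return flops_per_pair * pairs_per_move * moves;
}

// Days of wall-clock time on `cores` cores, assuming perfect scaling
// and one floating-point operation per clock cycle per core.
double days_needed(double flops, double clock_hz, int cores) {
    assert(flops >= 0.0 && clock_hz > 0.0 && cores > 0);
    const double seconds = flops / (clock_hz * cores);
    return seconds / 86400.0;  // 86,400 seconds per day
}
```

Here total_flops(kFlopsPerPair, kPairsPerMove, kTotalMoves) gives 5.25 × 10^15, which the text rounds to 5 × 10^15, and days_needed(5.0e15, kClockHz, 1) comes to a little over 25 days. Under the same idealized assumptions, 16 cores would need roughly 1.6 days, the same order as the reported one day.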

One library suited to this approach is Intel TBB [5], a free, open source (Apache*) task-based parallel library for C++. It supports nesting of tasks, has an efficient task scheduler that supports task stealing, and has building blocks that support the construction of complex task hierarchies. The research team wrote the computation needed for one LigandSwap calculation as a set of tasks that were parallelized using Intel TBB.

Methodology

The overall LigandSwap calculation was divided into a four-level hierarchy of tasks (Figure 1):

1. A single calculation consists of a set of replica exchange moves applied in parallel.

2. Each replica exchange move consists of a pair of Monte Carlo trajectories generated at two different λ-windows.

3. Each λ-window consists of Monte Carlo moves evaluating the different terms of the LigandSwap energy equation.

4. Each LigandSwap energy equation consists of summing the Coulomb and LJ interaction energies needed to evaluate each term.

Dividing the program into tasks started from the bottom up, with the large number of Coulomb and LJ evaluations that had to be performed for each Monte Carlo move. This was achieved using tbb::parallel_for. Space was divided into cubic boxes, with atoms distributed across those boxes. The total Coulomb and LJ energy was evaluated by looping over all pairs of boxes and summing the Coulomb and LJ energy between all pairs of atoms in each box pair. This loop was tested on a dual-core Intel® Core™ processor m5-6Y54 at 1.2 GHz (clocks up to 2.7 GHz) and a dual-socket, 10-core Intel® Xeon® processor E5-2660 v3 at 2.60 GHz. This showed a 2.1x speedup for the Intel Core processor (81.5 ms for a single core versus 38.1 ms for two cores) and a 19.1x speedup for the Intel Xeon processor (82.3 ms for a single core versus 4.3 ms for 20 cores).
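The box-pair loop described above can be sketched as follows. This is a minimal, dependency-free illustration and not LigandSwap's code: Atom, Box, pair_energy, and energy_over_box_pairs are hypothetical names, a single sigma and epsilon are assumed for every pair, and the Coulomb constant is set to 1. Box pairs are flattened into one index range, and each sub-range returns its own partial sum. That is the kind of range a tbb::parallel_for would split across cores; the sketch simply runs it serially.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Atom {
    double x, y, z;  // position
    double charge;   // partial charge
};

using Box = std::vector<Atom>;  // atoms that fall inside one cubic box

// Coulomb plus Lennard-Jones energy of one atom pair.
// Assumptions: one sigma/epsilon for all pairs, unit Coulomb constant,
// and the two atoms never sit at exactly the same position.
double pair_energy(const Atom& a, const Atom& b, double sigma, double epsilon) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    const double r2 = dx * dx + dy * dy + dz * dz;
    const double r = std::sqrt(r2);
    const double s2 = sigma * sigma / r2;
    const double s6 = s2 * s2 * s2;                    // (sigma / r)^6
    const double coulomb = a.charge * b.charge / r;
    const double lj = 4.0 * epsilon * (s6 * s6 - s6);  // 4*eps*[(s/r)^12 - (s/r)^6]
    return coulomb + lj;
}

// Energy between every atom of box p and every atom of box q.
double box_pair_energy(const Box& p, const Box& q, double sigma, double epsilon) {
    double e = 0.0;
    for (const Atom& a : p)
        for (const Atom& b : q)
            e += pair_energy(a, b, sigma, epsilon);
    return e;
}

// Partial sum over the flattened box-pair indices [first, last), where
// index k maps to (ligand box k / env.size(), environment box k % env.size()).
// The full range is [0, ligand.size() * env.size()).
double energy_over_box_pairs(const std::vector<Box>& ligand,
                             const std::vector<Box>& env,
                             std::size_t first, std::size_t last,
                             double sigma, double epsilon) {
    double e = 0.0;
    for (std::size_t k = first; k < last; ++k) {
        const Box& p = ligand[k / env.size()];
        const Box& q = env[k % env.size()];
        e += box_pair_energy(p, q, sigma, epsilon);
    }
    return e;
}
```

Because each sub-range touches no shared mutable state, the partial sums from any split of the index range add up to the same total (up to floating-point rounding), which is what makes the loop safe to divide among cores.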

The next step was to divide the evaluation of the LigandSwap energy equation into tasks. Each Monte Carlo move in the LigandSwap calculation involves evaluating the LigandSwap energy equation. The equation comprises terms relating to the interactions between the different molecules in the LigandSwap system and how these change as a function of λ. The equation is evaluated within a custom-written computer algebra system. This works out any dependencies between terms and creates a task list of individual terms to evaluate, along with their associated scaling factors. The total energy is evaluated by looping through the list, evaluating each term, and summing the results into a total. This is a classic reduction, which was parallelized using tbb::parallel_reduce.
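The shape of that reduction can be sketched without any library. The code below is a serial stand-in, not the authors' implementation: Term and reduce_terms are hypothetical names, and each term's evaluation is represented by an arbitrary callable. It splits the task list recursively, sums scale × value over small leaf ranges, and joins the two halves with a plain addition. These are the same three ingredients (an identity value, a per-range body, and an associative join) that tbb::parallel_reduce takes, with the difference that TBB may evaluate the halves concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One entry of the task list: a term of the energy equation and its scaling factor.
struct Term {
    double scale;
    std::function<double()> evaluate;  // hypothetical stand-in for an energy term
};

// Sum scale * evaluate() over terms[first, last) by recursive splitting.
// Ranges of at most `grain` terms are summed directly; larger ranges are
// halved and the two partial sums are joined.
double reduce_terms(const std::vector<Term>& terms,
                    std::size_t first, std::size_t last, std::size_t grain) {
    assert(grain >= 1 && first <= last && last <= terms.size());
    if (last - first <= grain) {
        double sum = 0.0;  // identity element of the reduction
        for (std::size_t i = first; i < last; ++i)
            sum += terms[i].scale * terms[i].evaluate();
        return sum;
    }
    const std::size_t mid = first + (last - first) / 2;
    const double left = reduce_terms(terms, first, mid, grain);
    const double right = reduce_terms(terms, mid, last, grain);
    return left + right;  // join step: must be associative
}
```

A call such as reduce_terms(terms, 0, terms.size(), 2) returns the scaled total. Because the halves are independent, the two recursive calls are the natural places for a task scheduler to introduce parallelism.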

The next level of tasks formed the Monte Carlo trajectory. Monte Carlo moves performed during the LigandSwap calculation are grouped together into trajectories for each λ-window. This was parallelized using tbb::task, where each task involved performing Monte Carlo sampling for a single λ-window.

Finally, LigandSwap uses replica exchange [4] moves to improve sampling within the calculation. This involves pairing together Monte Carlo sampling at neighboring λ-windows and performing periodic replica exchange moves. To achieve this, a replica exchange task based on tbb::task was written. This ran two child Monte Carlo trajectory tasks to perform two blocks of Monte Carlo sampling. Next, it performed a replica exchange move between the pair of λ-windows.
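The structure of such a replica exchange task (two child sampling tasks, a wait, then the exchange step) might look like the sketch below. Several things here are assumptions rather than details from the case study: std::async stands in for tbb::task so the example needs only the standard library, BlockResult and run_exchange_task are hypothetical names, each trajectory is reduced to two energies, and the exchange uses the standard Metropolis acceptance rule for swapping configurations between two λ-windows, which the source does not spell out.

```cpp
#include <algorithm>
#include <cmath>
#include <future>

// Outcome of one block of Monte Carlo sampling at one lambda-window.
// Hypothetical: a real trajectory carries full coordinates, not two numbers.
struct BlockResult {
    double energy_own;    // final configuration evaluated at its own lambda
    double energy_other;  // same configuration evaluated at the partner's lambda
};

// Metropolis acceptance probability for swapping the two configurations:
// min(1, exp(-beta * (E_after_swap - E_before_swap))).
double exchange_probability(const BlockResult& a, const BlockResult& b, double beta) {
    const double before = a.energy_own + b.energy_own;
    const double after = a.energy_other + b.energy_other;
    return std::min(1.0, std::exp(-beta * (after - before)));
}

// Runs two child sampling tasks, waits for both, then evaluates the exchange.
// `sample(own_lambda, partner_lambda)` must return a BlockResult.
template <typename Trajectory>
double run_exchange_task(Trajectory sample, double lambda_a, double lambda_b, double beta) {
    // Child 1 runs on another thread; child 2 runs on the calling thread.
    std::future<BlockResult> child_a =
        std::async(std::launch::async, sample, lambda_a, lambda_b);
    const BlockResult b = sample(lambda_b, lambda_a);
    const BlockResult a = child_a.get();  // both children finish before the exchange
    return exchange_probability(a, b, beta);
}
```

The caller would accept the swap when a uniform random number in [0, 1) falls below the returned probability. Unlike a TBB task, std::async does not take part in work stealing, so this shows only the dependency structure, not the scheduling behavior.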

Figure 1. LigandSwap calculation hierarchy

[1] Woods, C.J., “Ligandswap, a program for relative binding free energy calculations”, http://siremol.org/pages/apps/ligandswap.html, 2016

[2] Woods, C.J., Malaisree, M., Hannongbua, S. and Mulholland, A.J., “A water-swap reaction coordinate for the calculation of absolute protein-ligand binding free energies”, J. Chem. Phys., 134, 054114, 2011

[3] Woods, C.J., Malaisree, M., Michel, J., Long, B., McIntosh-Smith, S. and Mulholland, A.J., “Rapid decomposition and visualization of protein-ligand binding free energies by residue and by water”, Faraday Discussions, 169, 477-499, 2014

[4] Woods, C.J., King, M.A. and Essex, J.W., “The development of replica-exchange-based free-energy methods”, J. Phys. Chem. B, 107, 13703-13710, 2003

[5] Intel’s Threading Building Blocks, https://www.threadingbuildingblocks.org

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at www.intel.com.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/performance.

Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.

This document and the information given are for the convenience of Intel’s customer base and are provided “AS IS” WITH NO WARRANTIES WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS. Receipt or possession of this document does not


Altogether, the use of Threading Building Blocks has allowed LigandSwap to be written as a hierarchy of thousands of separate tasks. These run efficiently on a modern multicore Intel Xeon processor, enabling a calculation that would theoretically take 25 days to complete in about one day. Most importantly, the task-based design of the code will allow LigandSwap to take advantage of massively multicore processors, such as Intel® Xeon Phi™ processors.

Conclusions

By viewing a program as a collection of tasks, it is possible to write code that runs well on today’s multicore processors and is ready for the massively multicore processors that will become the future of high-performance computing.

Task-based programming allows the programmer to adopt the perspective of a production designer or manager, and to see the computer as a factory that processes data (instead of as a series of tasks) to produce an output answer.

“Intel TBB is an excellent library that supports the writing of efficient, task-based parallel programs in C++,” explained Christopher Woods of the Advanced Computing Research Centre, University of Bristol. “The library is free and open source (Apache license). As seen with this application to LigandSwap, the library performs extremely well, and is capable of achieving near-linear scaling even when the total evaluation time for the tasks is of the order of a few milliseconds. Parallelizing LigandSwap using Intel TBB needed less than 100 lines of Intel TBB-specific code from a code base of more than 100,000 lines.

“A particular strength of Intel TBB,” Woods said, “is the ability to nest tasks, thereby allowing programmers to write programs as a hierarchy of connected tasks. The task scheduler of TBB transparently handles the allocation of tasks to underlying computer hardware, thereby leaving the programmer free to concentrate metaphorically on managing the factory rather than managing individual workers.”

In short, the University of Bristol found that task-based parallel programming, with help from Intel TBB, provides a simple abstraction that will enable research software to adapt to the massively multicore future.

Learn more about Intel® Threading Building Blocks:

https://software.intel.com/en-us/intel-tbb
