Multiresolution Analysis, Computational Chemistry, and Implications for High Productivity Parallel Programming
Aniruddha G. Shet, James Dinan, Robert J. Harrison, and P. Sadayappan
Background• Multiresolution Analysis (MRA)
• Mathematical technique of function approximation
• Representation is a hierarchy of coefficients
• Dynamically adapts to guarantee the accuracy of the approximation
• Varying degree of information granularity in the hierarchy
• Trade numerical accuracy with computation timeMRA can represent different areas in an object at different levels of detail.
What is (the) MADNESS (about)?• Multiresolution Adaptive Numerical Environment for Scientific Simulation
• Programming environment for the solution of integral and differential equations
• Built on adaptive multiresolution analysis in multiwavelet basis and low separation rank methods for scaling to higher dimensions
• Fast algorithms with guaranteed precision
• Trade precision for speed
• High-level composition of numerical codes
• Work with functions and operators
• Target applications
• Quantum chemistry, atomic and molecular physics, material science, nuclear structure
A molecular orbital of the benzene molecule with the adaptive mesh also displayed.
Implementation Issues with MADNESS• Multi-dimensional tree distribution
• Multiresolution adaptive properties produce unbalanced coefficient trees (binary tree in 1-d, quad tree in 2-d, oct tree in 3-d etc.)
• Tree structure evolves in unscheduled ways due to very flexible adaptive refinement
• Need a scheme to partition the complete tree as the entire tree, and not just the leaf nodes, is utilized in some algorithms
• Two main types of applications
• Few large trees (billions of nodes) - time evolution of wavepackets in molecular physics
• Many (thousands) smaller trees (millions of nodes) - materials and electronic structure
• Nodes range from 1KB-1MB
• Algorithmic characteristics
• Some are recursive, tree-walking algorithms that move data up/down the tree structure
• Number of levels navigated varies dynamically, and may constitute a data dependence chain
• Some move data laterally within same level of tree i.e. between neighboring nodes
• Some algorithms involve applying mathematical functions to the collection of coefficient tensors, and possibly combining individual results
• Certain algorithms operate on multiple trees having different refinement characteristics and produce a new tree
Need to express and manage huge amounts of nested hierarchical concurrency on distributed many-core petascale machines
Binary tree numerical form of a 1-d analytical function. Note that some intervals are not sub-divided due to the adaptive nature of refinement.
Sub-trees from decomposition of function space
Function interval
Coefficient tensor
Tree node
Adaptively refined tree Compress tree algorithm Reconstruct tree algorithm
X
Multiplication of differently refined trees
Dashed arrows depict the flow of data between operations on tree nodes. The input data to a node operation is indicated by an arrow pointing into the node, and the output from the operation is shown by an outgoing arrow.
A coefficient tensor that is added to the tree during a tree algorithm.
A coefficient tensor that is removed from the tree after it has been operated upon.
Parallel Programming Challenges• Shared Memory Model
• Cilk-style fork-join task parallelism with a work-stealing runtime doing dynamic load balancing is a plausible solution, but…
• it is not targeted at distributed data
• the model mandates that a parent task await the completion of children tasks, which constrains the full expression of available parallelism
• Message Passing Model
• Hard to express dynamic, irregular computations
• Two-sided communication model introduces unnecessary overhead in reading and writing distributed tree data
• Does not address the need for dynamic load balancing
• Partitioned Global Address Space (PGAS) Models
• i.e. Co-Array Fortran, UPC, Titanium
• Static SPMD model of parallelism, lack flexible threading capabilities
• Maintenance of distributed data structures without remote operations requires complex low-level remote memory references
• Does not address the need for dynamic load balancing
Chapel Programming Model• Multithreaded parallel programming
• Global view of computation, data structures
• Abstractions for data and task parallelism
• data: domain, forall, iterators
• task: begin, cobegin, coforall, sync variables, atomic
• Composition of parallelism
• Virtualization of threads
• Locality-aware programming
• locale: machine unit of storage and processing
• domains may be distributed across locales
• on keyword binds computation to locale(s)
• Object-oriented programming
• OOP can help manage program complexity
• Classes and objects are provided in Chapel, but their use is typically not required
• Advanced language features (e.g. distributions) expressed using classes
• Generic programming and type inference
• Type parameters
• Latent types
• Variables are statically-typed
Solution Building Blocksclass FTree { const tree: [LocaleSpace] SubTree; def FTree(order: int) { coforall loc in Locales do on loc do tree[loc.id]=new SubTree(order); } def this(node: Node) { const t = tree[node2loc(node).id]; return t[node]; } /* Global tree access methods */ }
class SubTree { const coeffDom: domain(1); var nodes: domain(Node); var coeff: [nodes] [coeffDom] real; /* Local tree access methods */ }
• Global-view container
• Container to store the tree, presents a global-view and one-sided access to tree algorithms
• Internally, maintains a directory of the distributed collection of sub-trees and transparently maps an indexed node to the host locale
• Sub-trees are structured as associative arrays of node-coefficients key-value pairs
• The mapping scheme could be a simple hash function or driven by a specialized partitioning strategy having better locality properties
• Concurrency
• Operations on tree nodes are created as asynchronous tasks
• Tasks are chained together in a hierarchical nested manner to express recursive parallelism in tree algorithms
• Dependencies are handled by letting tasks synchronize on the completion of spawned tasks
• Locality control permits running tasks using owner-computes policy which will launch tasks where target data resides
• Plan to support work-stealing technique for dynamic load balancing of a distributed set of tasks in the language runtime
const myFTree = new FTree(order=5);
def walkDownOp(node: Node) { /* Perform the operation on node */
for child in node.getChild() do on myFTree.node2loc(child) do begin walkDownOp(child);}
sync on myFTree.node2loc(root) do begin walkDownOp(root);
def walkUpOp(node: Node) { coforall child in node.getChild() do on myFTree.node2loc(child) do walkUpOp(child);
/* Perform the operation on node */}
sync on myFTree.node2loc(root) do begin walkUpOp(root);
• Coefficient tensor arithmetic
• Chapel provides ZPL-style “array language” to simplify working with multi-dimensional arrays
• Domains, a first-class language concept denoting an index set, define the size and shape of arrays, and support data parallel iteration in creating and slicing arrays
• A range of array operators facilitate parallel operations on whole arrays or array slices eliminating need for tedious array indexing
• Mathematical functions involving coefficient tensors are easily expressed in the array language
/* Overloaded multiplication operator to perform vector-matrix multiplication */def *(V: [] real, M: [] real) where V.rank == 1 && M.rank == 2 { const R: [i in M.domain.dim(2)] real = + reduce (V * M(..,i)); return R;}
/* Overloaded multiplication operator to perform matrix-vector multiplication */def *(M: [] real, V: [] real) where M.rank == 2 && V.rank == 1 { const R: [i in M.domain.dim(1)] real = + reduce (M(i,..) * V); return R;}
/* Norm of a vector */def normf(V) where V.rank == 1 { return sqrt(+ reduce V**2);}
• Python-like high-level programming
• Better usability in writing end user-codes using functions and operators rather than their underlying representations
• Key language ideas are object-oriented and type inferencing features
• Future work will explore writing dimension-independent programs
/* Fn_Test1 class wraps an analytic function */var f = new Fn_Test1();
/* Function class holds the numerical tree form */var F = new Function(k=5,thresh=1e-5,f=f);
/* Overloaded arithmetic operators on Function class that invoke various tree algorithms */ var H = F + F; H = F * F;
/* Print the numerical value at a given point */ writeln("Numerical value: ", format(" %0.8f", F(0.5)));
Solution Building Blocks
Future Work• Recursive task parallelism
• Language constructs to distinguish between tasks that may vs. must run in parallel
• “May” construct permits runtime management of parallelism
• “Must” construct for patterns like producer-consumer
• Application vs. compiler control over the granularity and degree of parallelism
• DAG-based dynamic scheduling of tasks inside Chapel runtime
• Provide execution guarantees for a class of DAGs
• Compiler support for parallel distribution/iteration/algebraic operations on associative array structures
Putting the Pieces Together
def refine(node = root) { const child = node.getChild();
var sc: [0..2*k-1] real; sc[0..k-1] = project(child(1)); sc[k..2*k-1] = project(child(2));
const dc = sc*hgT; const nf = normf(dc[k..2*k-1]); if (nf < thresh) { sumC[child(1)] = sc[0..k-1]; sumC[child(2)] = sc[k..2*k-1]; } else { on sumC.node2loc(child(1)) do begin refine(child(1)); on sumC.node2loc(child(2)) do begin refine(child(2)); }}
sync on sumC.node2loc(node) do begin refine(node);
def compress(node = root) { const child = node.getChild(); cobegin { on sumC.node2loc(child(1)) do if !sumC.hasCoeffs(child(1)) then compress(child(1)); on sumC.node2loc(child(2)) do if !sumC.hasCoeffs(child(2)) then compress(child(2)); }
var sc: [0..2*k-1] real; sc[0..k-1] = sumC[child(1)]; sc[k..2*k-1] = sumC[child(2)];
const dc = sc*hgT; sumC[node] = dc[0..k-1]; diffC[node] = dc[k..2*k-1];
sumC.remove(child(1)); sumC.remove(child(2)); }
sync on sumC.node2loc(root) do begin compress(root);
• Parallel compress algorithm• Parallel refine algorithm
sumC and diffC are global-view tree containers
Research sponsored in part by the Laboratory Directed Research and Development Program and Post Masters Research Participation Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U. S. Department of Energy under Contract No. DE-AC05-00OR22725, and DOE grant #DE-FC02-06ER25755 and NSF grant #0403342.
Top Related