Multiresolution Analysis, Computational Chemistry, and Implications for High Productivity Parallel Programming

Aniruddha G. Shet, James Dinan, Robert J. Harrison, and P. Sadayappan

Background

• Multiresolution Analysis (MRA)

• Mathematical technique of function approximation

• Representation is a hierarchy of coefficients

• Dynamically adapts to guarantee the accuracy of the approximation

• Varying degree of information granularity in the hierarchy

• Trade numerical accuracy for computation time

MRA can represent different areas in an object at different levels of detail.
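The adaptive behaviour described above can be illustrated with a minimal sketch, written in Python rather than Chapel so that it can be run directly. It assumes the simplest possible basis (piecewise-constant Haar functions on [0, 1]) and a midpoint-rule quadrature; the names `project`, `refine`, `samples`, and `max_level` are illustrative choices, not taken from MADNESS.

```python
import math

def project(f, level, index, samples=8):
    """Approximate the normalized Haar scaling coefficient of f on the
    interval [index * 2^-level, (index + 1) * 2^-level]."""
    width = 2.0 ** -level
    left = index * width
    # Midpoint-rule estimate of the mean of f over the interval.
    mean = sum(f(left + (j + 0.5) * width / samples) for j in range(samples)) / samples
    # integral(f) * 2^(level/2) == mean * sqrt(width)
    return mean * math.sqrt(width)

def refine(f, thresh, level=0, index=0, leaves=None, max_level=20):
    """Subdivide an interval until the difference (wavelet) coefficient of
    its two children falls below thresh. Returns {(level, index): coeff}."""
    if leaves is None:
        leaves = {}
    child_level = level + 1
    s0 = project(f, child_level, 2 * index)
    s1 = project(f, child_level, 2 * index + 1)
    d = (s0 - s1) / math.sqrt(2.0)  # detail lost if the children were merged
    if abs(d) < thresh or child_level >= max_level:
        leaves[(child_level, 2 * index)] = s0
        leaves[(child_level, 2 * index + 1)] = s1
    else:
        refine(f, thresh, child_level, 2 * index, leaves, max_level)
        refine(f, thresh, child_level, 2 * index + 1, leaves, max_level)
    return leaves

if __name__ == "__main__":
    # A narrow Gaussian: the tree should be deep near x = 0.3 and shallow elsewhere.
    bump = lambda x: math.exp(-200.0 * (x - 0.3) ** 2)
    tree = refine(bump, thresh=1e-3)
    print(len(tree), "leaves, deepest level", max(lvl for lvl, _ in tree))
```

Loosening `thresh` produces a shallower tree and less work, which is the accuracy-for-time trade mentioned above.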

What is (the) MADNESS (about)?

• Multiresolution Adaptive Numerical Environment for Scientific Simulation

• Programming environment for the solution of integral and differential equations

• Built on adaptive multiresolution analysis in multiwavelet basis and low separation rank methods for scaling to higher dimensions

• Fast algorithms with guaranteed precision

• Trade precision for speed

• High-level composition of numerical codes

• Work with functions and operators

• Target applications

• Quantum chemistry, atomic and molecular physics, material science, nuclear structure

A molecular orbital of the benzene molecule with the adaptive mesh also displayed.

Implementation Issues with MADNESS

• Multi-dimensional tree distribution

• Multiresolution adaptive properties produce unbalanced coefficient trees (binary tree in 1-d, quad tree in 2-d, oct tree in 3-d etc.)

• Tree structure evolves in unscheduled ways due to very flexible adaptive refinement

• Need a scheme to partition the complete tree, because some algorithms use the entire tree and not just the leaf nodes

• Two main types of applications

• Few large trees (billions of nodes) - time evolution of wavepackets in molecular physics

• Many (thousands) smaller trees (millions of nodes) - materials and electronic structure

• Node sizes range from 1 KB to 1 MB

• Algorithmic characteristics

• Some are recursive, tree-walking algorithms that move data up/down the tree structure

• Number of levels navigated varies dynamically, and may constitute a data dependence chain

• Some move data laterally within same level of tree i.e. between neighboring nodes

• Some algorithms involve applying mathematical functions to the collection of coefficient tensors, and possibly combining individual results

• Certain algorithms operate on multiple trees having different refinement characteristics and produce a new tree

Need to express and manage huge amounts of nested hierarchical concurrency on distributed many-core petascale machines

Binary tree numerical form of a 1-d analytical function. Note that some intervals are not sub-divided due to the adaptive nature of refinement.

Figure panels: adaptively refined tree; compress tree algorithm; reconstruct tree algorithm; multiplication (X) of differently refined trees. Figure labels: sub-trees from decomposition of function space; function interval; coefficient tensor; tree node.

Dashed arrows depict the flow of data between operations on tree nodes. The input data to a node operation is indicated by an arrow pointing into the node, and the output from the operation is shown by an outgoing arrow.

A coefficient tensor that is added to the tree during a tree algorithm.

A coefficient tensor that is removed from the tree after it has been operated upon.

Parallel Programming Challenges

• Shared Memory Model

• Cilk-style fork-join task parallelism with a work-stealing runtime doing dynamic load balancing is a plausible solution, but…

• it is not targeted at distributed data

• the model mandates that a parent task await the completion of its child tasks, which constrains the full expression of available parallelism

• Message Passing Model

• Hard to express dynamic, irregular computations

• Two-sided communication model introduces unnecessary overhead in reading and writing distributed tree data

• Does not address the need for dynamic load balancing

• Partitioned Global Address Space (PGAS) Models

• e.g. Co-Array Fortran, UPC, Titanium

• Static SPMD model of parallelism; lacks flexible threading capabilities

• Maintenance of distributed data structures without remote operations requires complex low-level remote memory references

• Does not address the need for dynamic load balancing

Chapel Programming Model

• Multithreaded parallel programming

• Global view of computation, data structures

• Abstractions for data and task parallelism

• data: domain, forall, iterators

• task: begin, cobegin, coforall, sync variables, atomic

• Composition of parallelism

• Virtualization of threads

• Locality-aware programming

• locale: machine unit of storage and processing

• domains may be distributed across locales

• on keyword binds computation to locale(s)

• Object-oriented programming

• OOP can help manage program complexity

• Classes and objects are provided in Chapel, but their use is typically not required

• Advanced language features (e.g. distributions) expressed using classes

• Generic programming and type inference

• Type parameters

• Latent types

• Variables are statically-typed

Solution Building Blocks

class FTree {
  const tree: [LocaleSpace] SubTree;

  def FTree(order: int) {
    coforall loc in Locales do
      on loc do tree[loc.id] = new SubTree(order);
  }

  def this(node: Node) {
    const t = tree[node2loc(node).id];
    return t[node];
  }

  /* Global tree access methods */
}

class SubTree {
  const coeffDom: domain(1);
  var nodes: domain(Node);
  var coeff: [nodes] [coeffDom] real;

  /* Local tree access methods */
}
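For readers unfamiliar with Chapel, the two classes above can be mirrored by a small single-process sketch, written here in Python so that it can be run directly. Locales are modelled as plain list slots, nodes as `(level, index)` tuples, and `node2loc` as a simple arithmetic hash; all of these are assumptions made for illustration, and the method names `has_coeffs` and `remove` are borrowed from the way the poster's later code uses the container.

```python
class SubTree:
    """Local piece of the tree: an associative map from node key to coefficients."""

    def __init__(self):
        self.coeff = {}


class FTree:
    """Global-view container that routes each node to the sub-tree owning it."""

    def __init__(self, num_locales):
        # One sub-tree per (simulated) locale.
        self.tree = [SubTree() for _ in range(num_locales)]

    def node2loc(self, node):
        """Hypothetical mapping: a simple hash of (level, index) to a locale id.
        A locality-aware partitioning scheme could be substituted here."""
        level, index = node
        return (31 * level + index) % len(self.tree)

    def __getitem__(self, node):
        return self.tree[self.node2loc(node)].coeff[node]

    def __setitem__(self, node, coeffs):
        self.tree[self.node2loc(node)].coeff[node] = coeffs

    def has_coeffs(self, node):
        return node in self.tree[self.node2loc(node)].coeff

    def remove(self, node):
        del self.tree[self.node2loc(node)].coeff[node]


if __name__ == "__main__":
    sumC = FTree(num_locales=4)
    sumC[(2, 3)] = [0.5, 0.25]
    print("node (2, 3) lives on locale", sumC.node2loc((2, 3)))
```

Callers index the container by node and never see which sub-tree holds the data, which is the global-view, one-sided access the container is meant to provide.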

• Global-view container

• Container to store the tree, presents a global-view and one-sided access to tree algorithms

• Internally, maintains a directory of the distributed collection of sub-trees and transparently maps an indexed node to the host locale

• Sub-trees are structured as associative arrays of node-coefficients key-value pairs

• The mapping scheme could be a simple hash function or driven by a specialized partitioning strategy having better locality properties

• Concurrency

• Operations on tree nodes are created as asynchronous tasks

• Tasks are chained together in a hierarchical nested manner to express recursive parallelism in tree algorithms

• Dependencies are handled by letting tasks synchronize on the completion of spawned tasks

• Locality control permits running tasks under an owner-computes policy, which launches each task where its target data resides

• Plan to support a work-stealing technique in the language runtime for dynamic load balancing of a distributed set of tasks

const myFTree = new FTree(order=5);

def walkDownOp(node: Node) {
  /* Perform the operation on node */
  for child in node.getChild() do
    on myFTree.node2loc(child) do begin walkDownOp(child);
}

sync on myFTree.node2loc(root) do begin walkDownOp(root);

def walkUpOp(node: Node) {
  coforall child in node.getChild() do
    on myFTree.node2loc(child) do walkUpOp(child);
  /* Perform the operation on node */
}

sync on myFTree.node2loc(root) do begin walkUpOp(root);
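The two traversal patterns above differ in when a node waits for its children. The following sketch reproduces that difference in Python with `asyncio`, as a single-process analogy rather than a distributed implementation: `asyncio.create_task` stands in for `begin`, `asyncio.gather` for `coforall`, and the draining loop in `sync_walk_down` for the enclosing `sync`. The complete binary tree and the helper names are assumptions made for illustration.

```python
import asyncio

def children(node, max_level):
    """Children of (level, index) in a complete binary tree of given depth."""
    level, index = node
    if level >= max_level:
        return []
    return [(level + 1, 2 * index), (level + 1, 2 * index + 1)]

async def walk_down(node, max_level, visited, pending):
    visited.append(node)  # operate on the node first
    for child in children(node, max_level):
        # Spawn a task for the child and do not wait for it (like `begin`).
        pending.append(asyncio.create_task(walk_down(child, max_level, visited, pending)))

async def sync_walk_down(root, max_level):
    visited, pending = [], []
    await walk_down(root, max_level, visited, pending)
    # Wait for every transitively spawned task (like the outer `sync`).
    while pending:
        await pending.pop()
    return visited

async def walk_up(node, max_level, visited):
    kids = children(node, max_level)
    if kids:
        # Run the children concurrently and wait for all of them (like `coforall`).
        await asyncio.gather(*(walk_up(k, max_level, visited) for k in kids))
    visited.append(node)  # operate on the node only after its children

async def sync_walk_up(root, max_level):
    visited = []
    await walk_up(root, max_level, visited)
    return visited

if __name__ == "__main__":
    print(asyncio.run(sync_walk_down((0, 0), 2)))
    print(asyncio.run(sync_walk_up((0, 0), 2)))
```

In the downward walk a parent finishes without waiting for its subtree, so only the single outer wait is needed; in the upward walk each parent blocks on its children, which forms the data dependence chain noted under the algorithmic characteristics.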

• Coefficient tensor arithmetic

• Chapel provides ZPL-style “array language” to simplify working with multi-dimensional arrays

• Domains, a first-class language concept denoting an index set, define the size and shape of arrays, and support data parallel iteration in creating and slicing arrays

• A range of array operators facilitates parallel operations on whole arrays or array slices, eliminating the need for tedious array indexing

• Mathematical functions involving coefficient tensors are easily expressed in the array language

/* Overloaded multiplication operator to perform vector-matrix multiplication */
def *(V: [] real, M: [] real) where V.rank == 1 && M.rank == 2 {
  const R: [i in M.domain.dim(2)] real = + reduce (V * M(..,i));
  return R;
}

/* Overloaded multiplication operator to perform matrix-vector multiplication */
def *(M: [] real, V: [] real) where M.rank == 2 && V.rank == 1 {
  const R: [i in M.domain.dim(1)] real = + reduce (M(i,..) * V);
  return R;
}

/* Norm of a vector */
def normf(V) where V.rank == 1 {
  return sqrt(+ reduce V**2);
}
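To make the reductions in the three operators above explicit, here is an equivalent sketch in plain Python (chosen so that it can be run directly), with lists of lists standing in for Chapel arrays. The function names `vec_mat` and `mat_vec` are illustrative; only `normf` appears in the poster.

```python
import math

def vec_mat(V, M):
    """Vector-matrix product: R[i] = sum over j of V[j] * M[j][i],
    i.e. a sum-reduction of V against column i of M."""
    return [sum(v * row[i] for v, row in zip(V, M)) for i in range(len(M[0]))]

def mat_vec(M, V):
    """Matrix-vector product: R[i] = sum over j of M[i][j] * V[j],
    i.e. a sum-reduction of row i of M against V."""
    return [sum(m * v for m, v in zip(row, V)) for row in M]

def normf(V):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in V))

if __name__ == "__main__":
    M = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
    print(vec_mat([1.0, 2.0], M))
    print(mat_vec(M, [1.0, 0.0, 1.0]))
    print(normf([3.0, 4.0]))
```

Each list comprehension corresponds to the array declared over one dimension of `M` in the Chapel version, and each inner `sum` to the `+ reduce` expression.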

• Python-like high-level programming

• Better usability when writing end-user codes in terms of functions and operators rather than their underlying representations

• Key enabling language features are object orientation and type inference

• Future work will explore writing dimension-independent programs

/* Fn_Test1 class wraps an analytic function */
var f = new Fn_Test1();

/* Function class holds the numerical tree form */
var F = new Function(k=5,thresh=1e-5,f=f);

/* Overloaded arithmetic operators on Function class that invoke various tree algorithms */
var H = F + F;
H = F * F;

/* Print the numerical value at a given point */
writeln("Numerical value: ", format(" %0.8f", F(0.5)));


Future Work

• Recursive task parallelism

• Language constructs to distinguish between tasks that may vs. must run in parallel

• “May” construct permits runtime management of parallelism

• “Must” construct for patterns like producer-consumer

• Application vs. compiler control over the granularity and degree of parallelism

• DAG-based dynamic scheduling of tasks inside Chapel runtime

• Provide execution guarantees for a class of DAGs

• Compiler support for parallel distribution/iteration/algebraic operations on associative array structures

Putting the Pieces Together

def refine(node = root) {
  const child = node.getChild();

  var sc: [0..2*k-1] real;
  sc[0..k-1] = project(child(1));
  sc[k..2*k-1] = project(child(2));

  const dc = sc*hgT;
  const nf = normf(dc[k..2*k-1]);
  if (nf < thresh) {
    sumC[child(1)] = sc[0..k-1];
    sumC[child(2)] = sc[k..2*k-1];
  } else {
    on sumC.node2loc(child(1)) do begin refine(child(1));
    on sumC.node2loc(child(2)) do begin refine(child(2));
  }
}

sync on sumC.node2loc(node) do begin refine(node);

def compress(node = root) {
  const child = node.getChild();
  cobegin {
    on sumC.node2loc(child(1)) do
      if !sumC.hasCoeffs(child(1)) then compress(child(1));
    on sumC.node2loc(child(2)) do
      if !sumC.hasCoeffs(child(2)) then compress(child(2));
  }

  var sc: [0..2*k-1] real;
  sc[0..k-1] = sumC[child(1)];
  sc[k..2*k-1] = sumC[child(2)];

  const dc = sc*hgT;
  sumC[node] = dc[0..k-1];
  diffC[node] = dc[k..2*k-1];

  sumC.remove(child(1));
  sumC.remove(child(2));
}

sync on sumC.node2loc(root) do begin compress(root);

• Parallel compress algorithm

• Parallel refine algorithm

sumC and diffC are global-view tree containers
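The data movement in the compress algorithm, and in the reconstruct algorithm that inverts it, can be traced with a sequential Python sketch (Python is used so that it can be run directly). It assumes the Haar case k = 1, where the two-scale filter reduces to a 2x2 orthogonal matrix, and uses plain dictionaries keyed by `(level, index)` in place of the distributed `sumC` and `diffC` containers. It also assumes the leaves tile the whole interval, so that every missing child can be built from its own descendants.

```python
import math

SQRT2 = math.sqrt(2.0)

def kids_of(node):
    level, index = node
    return (level + 1, 2 * index), (level + 1, 2 * index + 1)

def compress(sumC, diffC, node=(0, 0)):
    """Bottom-up pass: replace the children's sum coefficients with the
    parent's sum and difference coefficients."""
    left, right = kids_of(node)
    for kid in (left, right):
        if kid not in sumC:  # interior child: compress its subtree first
            compress(sumC, diffC, kid)
    s0, s1 = sumC.pop(left), sumC.pop(right)
    sumC[node] = (s0 + s1) / SQRT2   # Haar two-scale filter, sum part
    diffC[node] = (s0 - s1) / SQRT2  # Haar two-scale filter, difference part

def reconstruct(sumC, diffC, node=(0, 0)):
    """Top-down pass: invert compress, restoring sum coefficients at the leaves."""
    if node not in diffC:
        return  # leaf of the compressed tree
    s, d = sumC.pop(node), diffC.pop(node)
    left, right = kids_of(node)
    sumC[left] = (s + d) / SQRT2
    sumC[right] = (s - d) / SQRT2
    reconstruct(sumC, diffC, left)
    reconstruct(sumC, diffC, right)

if __name__ == "__main__":
    # An unbalanced tree: the right half of the interval is refined one level deeper.
    sumC = {(1, 0): 1.0, (2, 2): 2.0, (2, 3): 3.0}
    diffC = {}
    compress(sumC, diffC)
    print("compressed:", sumC, diffC)
    reconstruct(sumC, diffC)
    print("reconstructed:", sumC)
```

The loop over the two children is where the Chapel version uses `cobegin` with `on` clauses, so each subtree is compressed concurrently on the locale that owns it.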

Research sponsored in part by the Laboratory Directed Research and Development Program and Post Masters Research Participation Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, and DOE grant #DE-FC02-06ER25755 and NSF grant #0403342.