
What About Multicore?

Michael A. Heroux, Sandia National Laboratories

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Outline

• Can we use shared memory parallel (SMP) only?
• Can we use distributed memory parallel (DMP) only?
• Possibilities for using SMP within DMP.
• Performance characterization for preconditioned Krylov methods.
• Possibilities for SMP with DMP for each Krylov operation.
• Implications for multicore chips.
• About MPI.
• If Time Permits: Useful Parallel Abstractions.

SMP-only Possibilities

Question: Is it possible to develop a scalable parallel application using only shared memory parallel programming?

SMP-only Observations

Developing a scalable SMP application requires as much work as DMP:
• Still must determine ownership of work and data.
• Inability to assert placement of data on DSM architectures is a big problem, not easily fixed.
• Study after study illustrates this point.

SMP application requires SMP machine:
• Much more expensive per processor than DMP machine.
• Poorer fault-tolerance properties.

Number of processors usable by an SMP application is limited by the minimum of:

• Operating System and Programming Environment support.

• Global Address Space.

• Physical processor count.

Of these, the OS/Programming Model is the real limiting factor.

SMP-only Possibilities

Question: Is it possible to develop a scalable parallel application using only shared memory parallel programming?

Answer: No.

DMP-only Possibilities

Question: Is it possible to develop a scalable parallel application using only distributed memory parallel programming?

Answer: Don’t need to ask. Scalable DMP applications are clearly possible to O(100K-1M) processors.

Thus: DMP is required for scalable parallel applications. Question: Is there still a role for SMP within a DMP application?

SMP-Under-DMP Possibilities

Can we benefit from using SMP within DMP? Example: OpenMP within an MPI process.

Let’s look at a scattering of data points that might help.

Test Platform: Clovertown

• Intel: Clovertown, Quad-core (actually two dual-cores).
• Performance results are based on 1.86 GHz version.

LAMMPS Strong Scaling

[Figure: LAMMPS Strong Scaling Speedup. x-axis: # of MPI tasks (cores): 1, 2, 4, 8; y-axis: Speedup (0 to 8); series: strong eam, strong lj, strong rhodo.]

HPC Conjugate Gradient

[Figure: HPCCG-0.2 on Intel Clovertown (Quad-core), MFLOPS increase. x-axis: # of MPI tasks: 1, 2, 4, 8; y-axis: MFLOPS increase (0.0 to 1.8); series: Total(MF), DDOT(MF), WAXPBY(MF), SPARSEMV(MF).]

Trilinos/Epetra MPI Results: Bandwidth Usage vs. Core Usage

[Figure: Solver Kernel Performance: Clovertown, 490K Eq, 12.25M NZ per core. x-axis: Number of Cores: 1, 2, 4, 8; y-axis: MFLOP/s (0 to 2500); series: SpMV, SpMM2, SpMM4, SpMM8, NORM, DOT, AXPY.]

SpMV MPI+pthreads

Theme: Programming model doesn’t matter if algorithm is the same.

[Figure: CR4 MXV N=1e5, NZR=201. x-axis: MPI*pthreads configurations (mpi*p1*2, mpi*p1*4, mpi*p1*8, mpi*p2*1, mpi*p2*2, mpi*p2*4, mpi*p4*1, mpi*p4*2, mpi*p8*1); y-axis: time (seconds), 0.038 to 0.049.]

Double-double dot product MPI+pthreads

Same theme.

[Figure: DOUBLE-DOUBLE DOT1 N=1e7. x-axis: MPI*pthreads configurations (mpi*p1*2, mpi*p1*4, mpi*p1*8, mpi*p2*1, mpi*p2*2, mpi*p2*4, mpi*p4*1, mpi*p4*2, mpi*p8*1); y-axis: time (seconds), 0.01 to 0.05.]

Classical DFT code.
• Parts of code: Speedup is great.
• Parts: Speedup negligible.

[Figure: Tramonto Timings: Clovertown MPI, INTEL Compilers. x-axis: Number of Cores: 1, 2, 4, 6, 8; y-axis: Time (sec), 0 to 300; series: 2-k_linsolv_time, First_linsolv_time, 2-k_manager_time, First_manager_time, 2-k_fill_time, First_fill_time.]

Closer look: 4-8 cores.
• 1 core: Solver is 12.7 of 289 sec (4.4%).
• 8 cores: Solver is 7.5 of 16.8 sec (44%).

[Figure: Tramonto Timings: Clovertown MPI, INTEL Compilers. Zoom in: 4, 6, 8 cores. x-axis: Number of Cores: 4, 6, 8; y-axis: Time (sec), 0 to 30; series: 2-k_linsolv_time, First_linsolv_time, 2-k_manager_time, First_manager_time, 2-k_fill_time, First_fill_time.]

Summary So Far

MPI-only is sometimes enough:
• LAMMPS.
• Tramonto (at least parts), and threads might not help solvers.

Introducing threads into MPI:
• Not useful if using same algorithms.
• Same conclusion as 12 years ago.

Increase in bandwidth requirements:
• Decreases effective core use.
• Independent of programming model.

Use of threading might be effective if it enables:
• Change of algorithm.
• Better load balancing.

Case Study: Linear Equation Solvers

Sandia has many engineering applications.

A large fraction of newer apps are implicit in nature:
• Requires solution of many large nonlinear systems.
• Boils down to many sparse linear systems.

Linear system solves are a large fraction of total time:
• As small as 30%.
• As large as 90+%.

Iterative solvers are most commonly used.

Iterative solvers have a small handful of important kernels.

We focus on performance issues for these kernels.

Caveat: These parts do not make the whole, but are a good chunk of it…

Problem Definition

A frequent requirement for scientific and engineering computing is to solve:

Ax = b

where A is a known large (sparse) matrix,

b is a known vector,

x is an unknown vector.

Goal: Find x.

Method:
• Use Preconditioned Conjugate Gradient (PCG) method,
• Or one of many variants, e.g., Preconditioned GMRES.
• Called Krylov Solvers.

Performance Characteristics of Preconditioned Krylov Solvers

The performance of a parallel preconditioned Krylov solver on any given machine can be characterized by the performance of the following operations:
• Vector updates: $y = \alpha x + y$
• Dot products: $\delta = x^T y$
• Matrix multiplication: $y = A x$
• Preconditioner application: $y = M^{-1} x$

What can SMP within DMP do to improve performance for these operations?
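To make the four operations concrete, here is a minimal serial sketch of PCG in which each kernel is its own function. Everything in it is an assumption made for illustration: the `Csr` storage struct, the function names `axpy`, `dot`, `spmv`, `applyJacobi` and `pcgJacobi`, and the choice of a Jacobi preconditioner (M = diag(A)). It is not taken from Trilinos or from the solvers measured in this talk.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical compressed-row storage for a sparse matrix (illustration only).
struct Csr {
    std::vector<std::size_t> rowPtr;  // size = numRows + 1
    std::vector<std::size_t> col;     // column index of each stored entry
    std::vector<double> val;          // value of each stored entry
};

using Vec = std::vector<double>;

// Vector update kernel: y = alpha * x + y.
void axpy(double alpha, const Vec& x, Vec& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
}

// Dot product kernel: delta = x^T y.
double dot(const Vec& x, const Vec& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// Matrix multiplication kernel: y = A x.
void spmv(const Csr& A, const Vec& x, Vec& y) {
    for (std::size_t i = 0; i + 1 < A.rowPtr.size(); ++i) {
        double s = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}

// Preconditioner kernel: y = M^{-1} x with M = diag(A) (Jacobi, assumed here).
void applyJacobi(const Vec& diag, const Vec& x, Vec& y) {
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] / diag[i];
}

// PCG for a symmetric positive definite A; returns the iteration count used.
int pcgJacobi(const Csr& A, const Vec& diag, const Vec& b, Vec& x,
              double tol, int maxIter) {
    const std::size_t n = b.size();
    Vec r(n), z(n), p(n), Ap(n);
    spmv(A, x, Ap);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    applyJacobi(diag, r, z);
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < maxIter; ++it) {
        if (std::sqrt(dot(r, r)) <= tol) return it;
        spmv(A, p, Ap);
        const double alpha = rz / dot(p, Ap);
        axpy(alpha, p, x);    // x = x + alpha p
        axpy(-alpha, Ap, r);  // r = r - alpha A p
        applyJacobi(diag, r, z);
        const double rzNew = dot(r, z);
        const double beta = rzNew / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rzNew;
    }
    return maxIter;
}
```

Each iteration performs one matrix multiplication, one preconditioner application, a few dot products and a few vector updates, which is why the slides that follow examine those four kernels one at a time.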

Machine Block Diagram

[Figure: m nodes (Node 0, Node 1, …, Node m-1), each with its own Memory shared by processing elements PE 0 through PE n-1.]

Parallel machine with p = m * n processors:
• m = number of nodes.
• n = number of shared memory processors per node.

Consider:
• p MPI processes vs.
• m MPI processes with n threads per MPI process (nested data-parallel).

Vector Update Performance

Vector computations are not (positively) impacted using nested parallelism.
• These calculations are unaware that they are being done in parallel.
• Problems of data locality and false cache line sharing can actually degrade performance for the nested approach.
• Example: What happens if
– PE 0 must update x[j],
– PE 1 must update x[j+1], and
– x[j] and x[j+1] are in the same cache line?
Each write invalidates the other PE's copy of the line, so the line bounces between caches even though no data is actually shared.

Note: These same observations hold for FEM/FVM calculations and many other common data parallel computations.
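The x[j] / x[j+1] example can be avoided when threads own contiguous index ranges whose boundaries fall on cache-line boundaries. The sketch below shows one common way to compute such ranges. It is an illustration, not something the slides prescribe. The 64-byte line size, the name `lineAlignedRanges`, and the premise that the array starts on a cache-line boundary are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed cache-line size in bytes; real hardware should be queried.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kDoublesPerLine = kCacheLineBytes / sizeof(double);

// Split [0, n) into numThreads contiguous half-open ranges whose interior
// boundaries are multiples of kDoublesPerLine. If the array itself starts on
// a cache-line boundary, no two ranges touch the same line.
std::vector<std::pair<std::size_t, std::size_t>>
lineAlignedRanges(std::size_t n, std::size_t numThreads) {
    const std::size_t numLines = (n + kDoublesPerLine - 1) / kDoublesPerLine;
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve(numThreads);
    std::size_t lineBegin = 0;
    for (std::size_t t = 0; t < numThreads; ++t) {
        // Spread whole cache lines as evenly as possible across threads.
        const std::size_t linesHere =
            numLines / numThreads + (t < numLines % numThreads ? 1 : 0);
        const std::size_t begin = std::min(n, lineBegin * kDoublesPerLine);
        lineBegin += linesHere;
        const std::size_t end = std::min(n, lineBegin * kDoublesPerLine);
        ranges.emplace_back(begin, end);
    }
    return ranges;
}
```

With 8-byte doubles, `lineAlignedRanges(100, 4)` gives the ranges [0, 32), [32, 56), [56, 80) and [80, 100). The cost is slightly uneven work per thread, which is one more detail the MPI-only version never has to consider.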

Dot Product Performance

Global dot product performance can be improved using nested parallelism:
• Compute the partial dot product on each node before going to the binary reduction algorithm.
• O(log(m)) global synchronization steps vs. O(log(p)) for DMP-only.

However, the same can be accomplished using an “SMP-aware” message passing library like LIBSM.

Notes:
• An SMP-aware message passing library addresses many of the initial performance problems when porting an MPI code to SMP nodes.
• Reason? Not lower latency of intra-node messages, but reduced off-node network demand.
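The O(log(m)) versus O(log(p)) comparison can be checked with a small single-process simulation of the two-level reduction. This is only a counting sketch under assumed names (`treeRounds`, `twoLevelReduce`, `ReduceResult`). No real message passing takes place, and the on-node combine is modeled as a plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of rounds a binary-tree reduction needs over `count` participants,
// i.e. ceil(log2(count)); 0 rounds for a single participant.
int treeRounds(std::size_t count) {
    int rounds = 0;
    for (std::size_t span = 1; span < count; span *= 2) ++rounds;
    return rounds;
}

struct ReduceResult {
    double sum;        // the global dot product
    int globalRounds;  // rounds that cross node boundaries
};

// partial[i] is the local dot product already computed on PE i.
// PEs are grouped n per node; each node first combines its own PEs
// (through shared memory on a real machine), then the node sums go
// through a binary tree.
ReduceResult twoLevelReduce(const std::vector<double>& partial, std::size_t n) {
    std::vector<double> nodeSum;
    for (std::size_t i = 0; i < partial.size(); i += n) {
        double s = 0.0;
        for (std::size_t j = i; j < i + n && j < partial.size(); ++j)
            s += partial[j];
        nodeSum.push_back(s);
    }
    const int rounds = treeRounds(nodeSum.size());
    // Pairwise tree over node sums: in each round, entry k absorbs k + span.
    for (std::size_t span = 1; span < nodeSum.size(); span *= 2)
        for (std::size_t k = 0; k + span < nodeSum.size(); k += 2 * span)
            nodeSum[k] += nodeSum[k + span];
    return {nodeSum.empty() ? 0.0 : nodeSum[0], rounds};
}
```

For p = 128 and n = 8 (so m = 16), the tree over nodes takes 4 rounds, while a flat tree over all 128 PEs would take 7.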

Matrix Multiplication Performance

Typical distributed sparse matrix multiplication requires “boundary exchange” before computing.

Time for exchange is determined by longest latency remote access.

Using SMP within a node does not reduce this latency.

SMP matrix multiply has the same cache performance issues as vector updates.

Thus SMP within DMP for matrix multiplication is not attractive.

Batting Average So Far: 0 for 3

So far there is no compelling reason to consider SMP within a DMP application.

Problem: Nothing we have proposed so far provides an essential improvement in algorithms.

Must search for situations where SMP provides a capability that DMP cannot.

One possibility: Addressing iteration inflation in (Overlapping) Schwarz domain decomposition preconditioning.

Iteration Inflation

Overlapping Schwarz Domain Decomposition (Local ILU(0) with GMRES)

[Figure: Number of Iterations (0 to 300) vs. Number of Processors (1, 2, 4, 8, 16, 32, 64, 128, Jac(Inf)).]

Using Level Scheduling SMP

As the number of subdomains increases, iteration counts go up.

Asymptotically, (non-overlapping) Schwarz becomes diagonal scaling.

But note:
• ILU has parallelism due to sparsity of matrix.
• We can use parallelism within ILU to reduce the inflation effect.

Defining Levels

Solve $Ly = x$, where

$$
L = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
l_{21} & 1 & 0 & 0 & 0 \\
l_{31} & 0 & 1 & 0 & 0 \\
0 & 0 & l_{43} & 1 & 0 \\
l_{51} & 0 & l_{53} & 0 & 1
\end{bmatrix}
$$
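A minimal sketch of how levels can be computed from the sparsity pattern of a unit lower-triangular L such as the 5-by-5 example: a row's level is one more than the highest level among the rows it depends on. The `LowerPattern` struct and the function name `computeLevels` are assumptions for illustration, and indices are 0-based in the code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Strictly lower-triangular sparsity pattern in compressed-row form:
// for row i, lowerCols[rowPtr[i] .. rowPtr[i+1]) lists the columns j < i
// with a nonzero L(i, j). The unit diagonal is implied.
struct LowerPattern {
    std::vector<std::size_t> rowPtr;
    std::vector<std::size_t> lowerCols;
};

// level[i] = 0 if row i has no off-diagonal entries, otherwise
// 1 + max(level[j]) over the columns j that row i depends on.
// Rows are visited in increasing order, so level[j] is final when used.
std::vector<int> computeLevels(const LowerPattern& L) {
    if (L.rowPtr.empty()) return {};
    const std::size_t n = L.rowPtr.size() - 1;
    std::vector<int> level(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = L.rowPtr[i]; k < L.rowPtr[i + 1]; ++k)
            level[i] = std::max(level[i], level[L.lowerCols[k]] + 1);
    return level;
}
```

For the 5-by-5 pattern, row 1 is in level 0, rows 2 and 3 are in level 1, and rows 4 and 5 are in level 2 (1-based row numbers). Rows within a level can be solved concurrently, with a synchronization between consecutive levels.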

Some Sample Level Schedule Stats: Linear FE basis fns on 3D grid

Avg nnz/level = 5500, Avg rows/level = 173.

[Figure: Number of Nonzeros (0 to 14000) vs. Level Number (1 to 181).]

Some Sample Level Schedule Stats: Unstructured linear stability analysis problem

Linear Stability Analysis Problem: Unstructured domain, 21K eq, 923K nnz.

Avg nnz/level = 520, Avg rows/level = 23.

[Figure: Number of Nonzeros (0 to 1400) vs. Level Number (1 to 881).]

Improvement Limits

Assume number of PEs per node = n.

Assume speedup for level scheduled F/B solve matches speedup of n MPI solves on same node.

Then performance improvement is

$$
\text{ratio} = \frac{\text{Number of iterations for } p \text{ domains}}{\text{Number of iterations for } m \text{ domains}}
$$

For previous graph and n = 8, p = 128 (m = 16), ratio = 203/142 = 1.43 or 43%.

Practical Limitations

Level scheduling speedup is largely determined by the cost of synchronization on a node.
• F/B solve requires a synchronization after each level.
• On machines with a good hardware barrier, this is not a problem and excellent speedup can be expected.
• On other machines, this can be a problem.

Reducing Synchronization Restrictions

Use a flexible iterative method such as FGMRES:
• Preconditioner at each iteration need not be the same, thus no need for sync’ing after each level.
• Level updates will still be approximately obeyed.
• Computational and communication complexity is identical to DMP-only F/B solve.
• Iteration counts and cost per iteration go up.

Multi-color reordering:
• Reorder equations to increase level-set sizes.
• Severe increase in iteration counts.

Our motto: The best parallel algorithm is the best parallel implementation of the best serial algorithm.

SMP-Under-DMP Possibilities

Can we benefit from using SMP within DMP? Yes, but:

Must be able to take advantage of fine-grain shared memory data access.

In a way not feasible for MPI-alone.

Even so: Nested SMP-Under-DMP is very complex to program.

Most people answer, “It’s not worth it.”

Summary So Far

SMP alone is insufficient for scalable parallelism.

DMP alone is certainly sufficient, but can we improve by selective use of SMP within DMP?

Analysis of key preconditioned Krylov kernels gives insight into possibilities of using SMP with DMP, and results can be extended to other algorithms.

Most of the straight-forward techniques for introducing SMP into DMP will not work.

Level scheduled ILU is one possible example of effectively using SMP within DMP (not always satisfactory).

Most fruitful use of SMP within DMP seems to have a common theme of allowing multiple processes to have dynamic asynchronous access to large (read-only) data sets.

Implications for Multicore Chips

MPI-only use of multicore is a respectable option:
• May be the ultimate right answer for scalability and ease of programming.
• Assumption: MPI is multicore-aware. Not completely true right now.
• Helpful: High task affinity. Single program image per chip.

Flexible, robust multicore kernels will be complicated:
• Task parallelism is preferred if available (Media Player/Outlook).
• Similar issues as Cray SMP programming:
– How many cores can (available) or should (problem size) be used?
– Illustrates difference between heterogeneous and homogeneous multicore.
• Data placement issues similar to Origin (if # CMP>1).
• Task affinity important.
• Mitigating factor:
– On-chip data movement is at processor speeds.
– Shared cache should help?

About MPI

Uncomfortable defending MPI, but…

Can Parallel Programming be Easy?
• Memory management is key.
• Is it bad that parallel programming is hard? Isn’t programming hard?

La-Z-Boy principle* impacts MPI adoption also.

MPI is not that hard, and does not impact the majority of code.

Real problem: Serial to MPI transition is not gradual.

Can the mass market produce a new parallel language quickly? Not convinced.

Can develop MPI-based code that is portable, today!

Still hope for better.

* Don’t need to write parallel code because uni-processors are getting faster (No longer applies to next-gen processors).

Useful Parallel Abstractions


Alternate Title

“In absence of a portable parallel language and in the presence of recurring yet evolving system architectures, while in the meantime trying to get work done with the current tools and solve problems of interest, what we have developed to solve large-scale systems of equations on large-scale computers and allow for adapting in the future.”

Useful Abstractions: Petra Object Model

Comm:
• Abstract parallel machine interface.
• All access to information about, and services from, the parallel machine is done through this interface.

ElementSpace:
• Description of the layout of an object on the parallel machine.
• All data objects are associated with one or more ElementSpace.
• Often created using an ElementSpace object, easily compared.

DistObject:
• A base class for all distributed objects.
• Used to provide gather/scatter services required to redistribute existing distributed objects.

Coarse grain linear algebra objects:
• Trilinos provides many concrete linear algebra objects.
• But each concrete class inherits from one or more abstract linear algebra classes.
• All Trilinos solvers access linear algebra objects via the abstract interfaces.
• In particular, matrices and linear operators are accessed this way.

First Useful Abstraction: Comm

Epetra Communication Classes

Epetra_Comm is a pure virtual class:
• Has no executable code: Interfaces only.
• Encapsulates behavior and attributes of the parallel machine.
• Defines interfaces for basic services such as:
– Collective communications.
– Gather/scatter capabilities.

Allows multiple parallel machine implementations.

Implementation details of parallel machine confined to Comm classes.

In particular, rest of Epetra (and rest of Trilinos) has no dependence on MPI.

Comm Methods

• CreateDistributor() const=0 [pure virtual]
• CreateDirectory(const Epetra_BlockMap & map) const=0 [pure virtual]
• Barrier() const=0 [pure virtual]
• Broadcast(double *MyVals, int Count, int Root) const=0 [pure virtual]
• Broadcast(int *MyVals, int Count, int Root) const=0 [pure virtual]
• GatherAll(double *MyVals, double *AllVals, int Count) const=0 [pure virtual]
• GatherAll(int *MyVals, int *AllVals, int Count) const=0 [pure virtual]
• MaxAll(double *PartialMaxs, double *GlobalMaxs, int Count) const=0 [pure virtual]
• MaxAll(int *PartialMaxs, int *GlobalMaxs, int Count) const=0 [pure virtual]
• MinAll(double *PartialMins, double *GlobalMins, int Count) const=0 [pure virtual]
• MinAll(int *PartialMins, int *GlobalMins, int Count) const=0 [pure virtual]
• MyPID() const=0 [pure virtual]
• NumProc() const=0 [pure virtual]
• Print(ostream &os) const=0 [pure virtual]
• ScanSum(double *MyVals, double *ScanSums, int Count) const=0 [pure virtual]
• ScanSum(int *MyVals, int *ScanSums, int Count) const=0 [pure virtual]
• SumAll(double *PartialSums, double *GlobalSums, int Count) const=0 [pure virtual]
• SumAll(int *PartialSums, int *GlobalSums, int Count) const=0 [pure virtual]
• ~Epetra_Comm() [inline, virtual]

Comm Implementations

Three current implementations of Epetra_Comm:

Epetra_SerialComm:
• Allows easy simultaneous support of serial and parallel versions of user code.

Epetra_MpiComm:
• OO wrapping of C MPI interface.

Epetra_MpiSmpComm:
• Allows definition/use of shared memory multiprocessor nodes.
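The idea of confining the parallel machine to a Comm class can be sketched with a cut-down interface and a trivial serial adapter. The classes `Comm` and `SerialComm` and the function `globalDot` below are hypothetical stand-ins written for this illustration. The method names echo a few entries from the Comm Methods list, but the signatures are simplified and this is not Epetra source code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an abstract parallel-machine interface.
// Method names echo a few from the Comm Methods list; this is not Epetra code.
class Comm {
public:
    virtual ~Comm() = default;
    virtual int MyPID() const = 0;
    virtual int NumProc() const = 0;
    virtual void Barrier() const = 0;
    // Element-wise sum across all processes; returns 0 on success.
    virtual int SumAll(const double* partialSums, double* globalSums,
                       int count) const = 0;
};

// Trivial serial adapter: one process, so the "global" sum is the local one.
class SerialComm : public Comm {
public:
    int MyPID() const override { return 0; }
    int NumProc() const override { return 1; }
    void Barrier() const override {}
    int SumAll(const double* partialSums, double* globalSums,
               int count) const override {
        for (int i = 0; i < count; ++i) globalSums[i] = partialSums[i];
        return 0;
    }
};

// Solver-side code sees only the interface, never MPI itself.
double globalDot(const Comm& comm, const std::vector<double>& x,
                 const std::vector<double>& y) {
    double local = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) local += x[i] * y[i];
    double global = 0.0;
    comm.SumAll(&local, &global, 1);
    return global;
}
```

An MPI-backed adapter would implement the same four methods with the corresponding MPI collectives, and `globalDot` would not change.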

Comm, Distributor, Directory

Efficient Execution:
• Provide high-level abstract description of parallel machine.
– Flexibility: How the operations are optimally performed.
– Distributor: Static, reusable record of data-dependent communications.

Portability on Parallel Platforms:
• All classes abstract.
• Can provide multiple implementations: MPI and serial.
• No widespread dependence on MPI.
• Possible to develop adapters for other parallel libraries and languages.

Utilize Existing Software:
• Utilize MPI as the primary adapter.
• Provide trivial serial adapters.

Example: Specialized Comm Adapters

int levelsetSolve_Epetra_Operator::Apply (const Epetra_MultiVector& X, Epetra_MultiVector& Y) const {
  try {
    // Casting to a reference type: a failed cast throws std::bad_cast.
    const Epetra_MpiSmpComm & comm = dynamic_cast<const Epetra_MpiSmpComm &>(X.Map().Comm());
    …
    comm.getThread(…)
  }
  catch (…)

Fragment of code from levelset preconditioner Epetra_Operator adapter.

Allows specialized parallel machine types. Should allow specializations such as:
• CAF/UPC kernels.
• Co-processors: GPUs, Cell SPEs.

Second Useful Abstraction: ElementSpace aka Map

Map Classes

Epetra maps prescribe the layout of distributed objects across the parallel machine.

Typical map: 99 elements, 4 MPI processes could look like:
• Number of elements = 25 on PE 0 through 2, = 24 on PE 3.
• GlobalElementList = {0, 1, 2, …, 24} on PE 0, = {25, 26, …, 49} on PE 1, … etc.

Funky Map: 10 elements, 3 MPI processes could look like:
• Number of elements = 6 on PE 0, = 4 on PE 1, = 0 on PE 2.
• GlobalElementList = {22, 3, 5, 2, 99, 54} on PE 0, = {5, 10, 12, 24} on PE 1, = {} on PE 2.

Note: Global element IDs (GIDs) are only labels:
• Need not be contiguous range on a processor.
• Need not be uniquely assigned to processors.
• Funky map is not unreasonable, given auto-generated meshes, etc.
• Use of a “Directory” facilitates arbitrary GID support.
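Both example maps can be reproduced with a small sketch that stores, per process, the list of owned GIDs and a GID-to-local-index lookup. The class `SimpleMap` and the helper `makeLinearMap` are hypothetical names for this illustration and are not the Epetra map classes. In particular, the comparison shown is local only.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical map: the list of global IDs (GIDs) owned by one process,
// plus a lookup from GID to local index (LID). Not the Epetra class.
class SimpleMap {
public:
    explicit SimpleMap(std::vector<long> gids) : gids_(std::move(gids)) {
        for (std::size_t lid = 0; lid < gids_.size(); ++lid)
            lidOf_[gids_[lid]] = static_cast<int>(lid);
    }
    std::size_t NumMyElements() const { return gids_.size(); }
    long GID(std::size_t lid) const { return gids_[lid]; }
    // Returns -1 when this process does not own the GID.
    int LID(long gid) const {
        auto it = lidOf_.find(gid);
        return it == lidOf_.end() ? -1 : it->second;
    }
    // Local comparison only; a distributed test would also need every
    // process to agree, e.g. through a global reduction.
    bool SameLocalLayout(const SimpleMap& other) const {
        return gids_ == other.gids_;
    }

private:
    std::vector<long> gids_;
    std::unordered_map<long, int> lidOf_;
};

// Contiguous "typical" map: numGlobal elements spread over numProc processes,
// with the first (numGlobal % numProc) processes holding one extra element.
SimpleMap makeLinearMap(long numGlobal, int numProc, int myPid) {
    const long base = numGlobal / numProc;
    const long extra = numGlobal % numProc;
    const long count = base + (myPid < extra ? 1 : 0);
    const long first = myPid * base + (myPid < extra ? myPid : extra);
    std::vector<long> gids;
    for (long g = first; g < first + count; ++g) gids.push_back(g);
    return SimpleMap(std::move(gids));
}
```

`makeLinearMap(99, 4, pid)` yields 25 elements on PE 0 through 2 and 24 on PE 3, matching the typical map. The funky map is just a `SimpleMap` built from an arbitrary GID list such as {22, 3, 5, 2, 99, 54}.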

ElementSpace/Map

Critical classes for efficient parallel execution. Provides ability to:
• Generate families of compatible objects.
• Quickly test if two objects have compatible layout.
• Redistribute objects to make compatible layout (using Import/Export).

Example: Quick Testing of Compatible Distributions

int dft_PolyA22_Epetra_Operator::Apply (const Epetra_MultiVector& X, Epetra_MultiVector& Y) const {
  TEST_FOR_EXCEPT(!X.Map().SameAs(OperatorDomainMap()));
  TEST_FOR_EXCEPT(!Y.Map().SameAs(OperatorRangeMap()));

Fragment of code from Epetra_Operator adapter.
• Test for compatibility of maps.
• Lack of this ability is a critical flaw of most parallel languages to date.

Third Useful Abstraction: DistObject

Epetra DistObject Base Class

• Some Epetra distributed object classes:
– Vector
– MultiVector
– CrsGraph
– CrsMatrix
– VbrMatrix

• DistObject is a base class for all the above:
– Construction of DistObject requires a Map (or BlockMap or LocalMap).
– Has concrete methods for parallel data redistribution of an object.
– Has virtual Pack/Unpack method that each derived class must implement.

• DistObject advantages:
– Minimized redundant code.
– Facilitates incorporation of other distributed objects in future.

Epetra_DistObject Virtual Methods

virtual int CheckSizes (const Epetra_SrcDistObject &Source)=0
  Allows the source and target (this) objects to be compared for compatibility; returns nonzero if not compatible.

virtual int CopyAndPermute (const Epetra_SrcDistObject &Source, int NumSameIDs, int NumPermuteIDs, int *PermuteToLIDs, int *PermuteFromLIDs)=0
  Perform ID copies and permutations that are on processor.

virtual int PackAndPrepare (const Epetra_SrcDistObject &Source, int NumExportIDs, int *ExportLIDs, int Nsend, int Nrecv, int &LenExports, char *&Exports, int &LenImports, char *&Imports, int &SizeOfPacket, Epetra_Distributor &Distor)=0
  Perform any packing or preparation required for call to DoTransfer().

virtual int UnpackAndCombine (const Epetra_SrcDistObject &Source, int NumImportIDs, int *ImportLIDs, char *Imports, int &SizeOfPacket, Epetra_Distributor &Distor, Epetra_CombineMode CombineMode)=0
  Perform any unpacking and combining after call to DoTransfer().

Epetra_DistObject Import/Export Methods

int Import (const Epetra_SrcDistObject &A, const Epetra_Import &Importer, Epetra_CombineMode CombineMode)

Imports an Epetra_SrcDistObject using the Epetra_Import object.

int Import (const Epetra_SrcDistObject &A, const Epetra_Export &Exporter, Epetra_CombineMode CombineMode)

Imports an Epetra_SrcDistObject using the Epetra_Export object.

int Export (const Epetra_SrcDistObject &A, const Epetra_Import &Importer, Epetra_CombineMode CombineMode)

Exports an Epetra_SrcDistObject using the Epetra_Import object.

int Export (const Epetra_SrcDistObject &A, const Epetra_Export &Exporter, Epetra_CombineMode CombineMode)

Exports an Epetra_SrcDistObject using the Epetra_Export object.

Import, Export and DistObject

Work with any class that is a DistObject (or SrcDistObject).

Abstraction of these operations to classes simplifies:
• Optimal implementation on a variety of systems.
• Reuse of complex parallel data redistribution:
– DistObject has most of the complexity.
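The division of labor described above, where the base class holds the redistribution logic and derived classes supply only packing and unpacking, can be sketched in a single process. `DistObjectSketch` and `VectorSketch` are hypothetical classes for illustration. Their signatures are far simpler than the Epetra ones listed earlier, and the "transfer" is an in-memory hand-off rather than communication.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, single-process illustration of the DistObject idea.
class DistObjectSketch {
public:
    virtual ~DistObjectSketch() = default;

    // Concrete redistribution: target entry importLIDs[k] receives source
    // entry exportLIDs[k]. A real implementation would send `buffer`
    // between processes; here it is simply handed over in memory.
    int Import(const DistObjectSketch& source,
               const std::vector<std::size_t>& exportLIDs,
               const std::vector<std::size_t>& importLIDs) {
        if (exportLIDs.size() != importLIDs.size()) return -1;
        std::vector<double> buffer;
        source.Pack(exportLIDs, buffer);
        return Unpack(importLIDs, buffer);
    }

protected:
    virtual void Pack(const std::vector<std::size_t>& exportLIDs,
                      std::vector<double>& buffer) const = 0;
    virtual int Unpack(const std::vector<std::size_t>& importLIDs,
                       const std::vector<double>& buffer) = 0;
};

// A derived class only describes how its own entries are packed and unpacked.
class VectorSketch : public DistObjectSketch {
public:
    explicit VectorSketch(std::vector<double> values)
        : values_(std::move(values)) {}
    double operator[](std::size_t lid) const { return values_[lid]; }

protected:
    void Pack(const std::vector<std::size_t>& exportLIDs,
              std::vector<double>& buffer) const override {
        for (std::size_t lid : exportLIDs) buffer.push_back(values_[lid]);
    }
    int Unpack(const std::vector<std::size_t>& importLIDs,
               const std::vector<double>& buffer) override {
        for (std::size_t k = 0; k < importLIDs.size(); ++k)
            values_[importLIDs[k]] = buffer[k];  // overwrite ("insert") combine
        return 0;
    }

private:
    std::vector<double> values_;
};
```

Adding another distributed object type under this scheme means writing one more `Pack`/`Unpack` pair, while `Import` is reused unchanged.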

Fourth Useful Abstraction: Coarse-grain linear algebra objects

Epetra_Operator and Epetra_RowMatrix

Notes:
• All Trilinos solvers can access linear operators via:
– Epetra_Operator if no coefficients are needed.
– Epetra_RowMatrix if coefficients are needed.
• All Epetra matrix and operator classes implement these two interfaces.
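The distinction between the two interfaces can be sketched as follows: code that only needs y = Ax asks for the operator interface, while code that needs matrix entries asks for the richer one. `OperatorSketch`, `RowMatrixSketch`, `DenseMatrix`, `residual` and `jacobiSweep` are hypothetical names for this illustration. The real Epetra interfaces have many more methods and different signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical analogue of an operator interface: only y = A x is exposed.
class OperatorSketch {
public:
    virtual ~OperatorSketch() = default;
    virtual int Apply(const Vec& x, Vec& y) const = 0;
};

// Hypothetical analogue of a row-matrix interface: adds coefficient access.
class RowMatrixSketch : public OperatorSketch {
public:
    virtual std::size_t NumRows() const = 0;
    virtual double Diagonal(std::size_t row) const = 0;
};

// Small dense matrix used only to exercise the two interfaces.
class DenseMatrix : public RowMatrixSketch {
public:
    explicit DenseMatrix(std::vector<Vec> rows) : a_(std::move(rows)) {}
    int Apply(const Vec& x, Vec& y) const override {
        for (std::size_t i = 0; i < a_.size(); ++i) {
            y[i] = 0.0;
            for (std::size_t j = 0; j < x.size(); ++j) y[i] += a_[i][j] * x[j];
        }
        return 0;
    }
    std::size_t NumRows() const override { return a_.size(); }
    double Diagonal(std::size_t row) const override { return a_[row][row]; }

private:
    std::vector<Vec> a_;
};

// Needs no coefficients, so it accepts any operator (even matrix-free ones).
Vec residual(const OperatorSketch& A, const Vec& x, const Vec& b) {
    Vec r(b.size());
    A.Apply(x, r);
    for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - r[i];
    return r;
}

// Needs coefficients (the diagonal), so it asks for the richer interface.
void jacobiSweep(const RowMatrixSketch& A, const Vec& b, Vec& x) {
    const Vec r = residual(A, x, b);
    for (std::size_t i = 0; i < A.NumRows(); ++i) x[i] += r[i] / A.Diagonal(i);
}
```

Because `residual` depends only on `Apply`, an operator whose data lives elsewhere (for example on a co-processor) could be passed to it without exposing any coefficients.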

GPU Utilization: y = Ax

[Figure: m nodes (Node 0, Node 1, …, Node m-1), each containing a CPU and a GPU.]

Epetra_GpuOperator:
• Object data mostly resides on GPU.
• Remains resident through many uses.

Vectors:
• Input vector x comes in for apply operation.
• Output vector y is returned.

Final Topic: Obsession with APIs

Obsession with APIs

APIs are good. Obsession is bad:
• Symptom of poor parallel language capabilities.
• Fear of future changes.

Opinion: Beyond domain-specific APIs we have no clue how to program current and next generation architectures.

Languages like Chapel hold some promise.

Personally: Still a fan of Co-Array Fortran.

Fan of CAF (aka F--)

real a(n)[ncore][nchips][nimages]

a(i)[j][k][l] - ith value of a on jth core of kth chip of lth node.

I can do lots of efficient parallel programming this way.
• Lets me program to memory hierarchy with minimal concern for detail.
• Compatible with MPI.
• But still waiting for broad availability.

Directions

Mantevo project:
• Focus on app performance analysis, prediction and improvement.
• Delivery of multicore tools.

Trilinos:
• Tpetra: Delivery of Trilinos multicore support.

Activities:
• Continued characterization of multicore behavior.
• Focus on algorithms:
– Better compute-to-bandwidth behavior.
– Fine-grain parallel.
• APIs: Taskpool, TBB, CUDA, etc…
