IMA PPtTutorial


Transcript of IMA PPtTutorial


    Lawrence Livermore National Laboratory

    Ulrike Meier Yang

    LLNL-PRES-463131

    Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344

    A Parallel Computing Tutorial


    References

Introduction to Parallel Computing, Blaise Barney, https://computing.llnl.gov/tutorials/parallel_comp

    A Parallel Multigrid Tutorial, Jim Jones

    https://computing.llnl.gov/casc/linear_solvers/present.html

Introduction to Parallel Computing, Design and Analysis of Algorithms, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis


    Overview

Basics of Parallel Computing (see Barney)
- Concepts and Terminology
- Computer Architectures
- Programming Models
- Designing Parallel Programs

Parallel algorithms and their implementation
- basic kernels
- Krylov methods
- multigrid


    Serial computation

Traditionally, software has been written for serial computation:
- run on a single computer
- instructions are run one after another
- only one instruction executed at a time


    Von Neumann Architecture

John von Neumann first authored the general requirements for an electronic computer in 1945

Data and instructions are stored in memory

Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task

Arithmetic unit performs basic arithmetic operations

Input/Output is the interface to the human operator


    Parallel Computing

Simultaneous use of multiple compute resources to solve a single problem


    Why use parallel computing?

Save time and/or money

Enables the solution of larger problems

Provide concurrency

Use of non-local resources

Limits to serial computing


    Concepts and Terminology

Flynn's Taxonomy (1966)

SISD: Single Instruction, Single Data

SIMD: Single Instruction, Multiple Data

MISD: Multiple Instruction, Single Data

MIMD: Multiple Instruction, Multiple Data


    SISD

Serial computer

Deterministic execution

Examples: older generation mainframes, workstations, PCs


    SIMD

A type of parallel computer

All processing units execute the same instruction at any given clock cycle

Each processing unit can operate on a different data element

Two varieties: Processor Arrays and Vector Pipelines

Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.


    MISD

A single data stream is fed into multiple processing units.

Each processing unit operates on the data independently via independent instruction streams.

Few actual examples: the Carnegie-Mellon C.mmp computer (1971).


    MIMD

Currently the most common type of parallel computer

Every processor may be executing a different instruction stream

Every processor may be working with a different data stream

Execution can be synchronous or asynchronous, deterministic or non-deterministic

Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

Note: many MIMD architectures also include SIMD execution sub-components


    Parallel Computer Architectures

Shared memory: all processors can access the same memory

Uniform memory access (UMA):
- identical processors
- equal access and access times to memory


    Non-uniform memory access (NUMA)

    Not all processors have equal access to all memories

    Memory access across link is slower

Advantages:
- user-friendly programming perspective to memory
- fast and uniform data sharing due to the proximity of memory to CPUs

Disadvantages:
- lack of scalability between memory and CPUs
- programmer responsible to ensure "correct" access of global memory
- expense


    Distributed memory

Distributed memory systems require a communication network to connect inter-processor memory.

Advantages:
- Memory is scalable with the number of processors.
- No memory interference or overhead for trying to keep cache coherency.
- Cost effective

Disadvantages:
- programmer responsible for data communication between processors
- difficult to map existing data structures to this memory organization


Hybrid Distributed-Shared Memory

Generally used for the currently largest and fastest computers

Has a mixture of the previously mentioned advantages and disadvantages


    Parallel programming models

Shared memory

Threads

    Message Passing

    Data Parallel

    Hybrid

    All of these can be implemented on any architecture.


    Shared memory

Tasks share a common address space, which they read and write asynchronously.

Various mechanisms such as locks / semaphores may be used to control access to the shared memory.

Advantage: no need to explicitly communicate data between tasks -> simplified programming

Disadvantages:
- Need to take care when managing memory, avoid synchronization conflicts
- Harder to control data locality


    Threads

A thread can be considered as a subroutine in the main program

Threads communicate with each other through the global memory

Commonly associated with shared memory architectures and operating systems

POSIX Threads (pthreads)

OpenMP
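A minimal OpenMP sketch of the threads model described above (an illustration, not code from the slides): the loop iterations are divided among threads that share the arrays in global memory.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;

    /* each thread updates a chunk of the shared arrays (an axpy: y = y + a*x) */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] += a * x[i];

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}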


    Message Passing

A set of tasks that use their own local memory during computation.

Data exchange through sending and receiving messages.

Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

    MPI (released in 1994)

    MPI-2 (released in 1996)
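A minimal MPI sketch of the cooperative send/receive pairing described above (an illustration, not code from the slides):

/* compile with mpicc, run with mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* task 0 sends ... */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* ... task 1 posts the matching receive */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}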


    Data Parallel

The data parallel model demonstrates the following characteristics:
- Most of the parallel work performs operations on a data set, organized into a common structure, such as an array
- A set of tasks works collectively on the same data structure, with each task working on a different partition
- Tasks perform the same operation on their partition

On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.


    Other Models

    Hybrid

    - combines various models, e.g. MPI/OpenMP

Single Program Multiple Data (SPMD)
- A single program is executed by all tasks simultaneously

Multiple Program Multiple Data (MPMD)
- An MPMD application has multiple executables. Each task can execute the same or a different program as other tasks.


    Designing Parallel Programs

Examine the problem:
- Can the problem be parallelized?
- Are there data dependencies?
- Where is most of the work done?
- Identify bottlenecks (e.g. I/O)

Partitioning:
- How should the data be decomposed?


    Various partitionings


    How should the algorithm be distributed?


    Communications

    Types of communication:

    - point-to-point

    - collective


    Synchronization types

Barrier
- Each task performs its work until it reaches the barrier. It then stops, or "blocks".
- When the last task reaches the barrier, all tasks are synchronized.

Lock / semaphore
- The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected data or code.
- Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
- Can be blocking or non-blocking
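A minimal OpenMP sketch of both mechanisms (an illustration, not from the slides): a critical section plays the role of a lock around a shared update, and a barrier synchronizes all threads before the result is printed.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double sum = 0.0;

    #pragma omp parallel
    {
        double local = omp_get_thread_num() + 1.0;   /* this task's contribution */

        #pragma omp critical        /* lock-like: one thread at a time updates sum */
        sum += local;

        #pragma omp barrier         /* every thread blocks here until all arrive */

        #pragma omp single
        printf("sum after the barrier = %g\n", sum);
    }
    return 0;
}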


    Load balancing

Keep all tasks busy all of the time. Minimize idle time. The slowest task will determine the overall performance.


    Granularity

In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.

Fine-grain Parallelism:
- Low computation to communication ratio
- Facilitates load balancing
- Implies high communication overhead and less opportunity for performance enhancement

Coarse-grain Parallelism:
- High computation to communication ratio
- Implies more opportunity for performance increase
- Harder to load balance efficiently


    Parallel Computing Performance Metrics

Let T(n,p) be the time to solve a problem of size n using p processors

- Speedup: S(n,p) = T(n,1)/T(n,p)

- Efficiency: E(n,p) = S(n,p)/p

- Scaled Efficiency: SE(n,p) = T(n,1)/T(pn,p)

An algorithm is scalable if there exists C > 0 with SE(n,p) ≥ C for all p
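A small worked example with made-up timings (assumed values, not from the slides), only to illustrate the three definitions:

% Assume T(n,1) = 100 s, T(n,8) = 20 s, and T(8n,8) = 125 s.
\begin{align*}
  S(n,8)  &= \frac{T(n,1)}{T(n,8)}  = \frac{100}{20}  = 5,\\
  E(n,8)  &= \frac{S(n,8)}{8}       = \frac{5}{8}     = 0.625,\\
  SE(n,8) &= \frac{T(n,1)}{T(8n,8)} = \frac{100}{125} = 0.8 .
\end{align*}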


    Strong Scalability

increasing the number of processors for a fixed problem size

[Figures: Speedup (perfect speedup vs. method), run times (seconds), and efficiency, each plotted against the number of procs, 1-8]


    Weak scalability

increasing the number of processors using a fixed problem size per processor

[Figures: run times (seconds) and scaled efficiency vs. number of procs, 1-8]


Amdahl's Law

Maximal Speedup = 1/(1 - P), where P is the parallel portion of the code


Amdahl's Law

Speedup = 1/((P/N) + S), where N = number of processors and S = serial portion of the code

Speedup:

N         P = 0.50   P = 0.90   P = 0.99
10          1.82       5.26       9.17
100         1.98       9.17      50.25
1000        1.99       9.91      90.99
10000       1.99       9.91      99.02
100000      1.99       9.99      99.90
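One entry of the table worked out, to show where the numbers come from:

% P = 0.90 (parallel fraction), S = 1 - P = 0.10, N = 1000 processors:
\[
  \text{Speedup} \;=\; \frac{1}{\frac{P}{N} + S}
                 \;=\; \frac{1}{\frac{0.90}{1000} + 0.10}
                 \;=\; \frac{1}{0.1009}
                 \;\approx\; 9.91 .
\]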


    Performance Model for distributed memory

Time for n floating point operations: tcomp = f·n
(f = time per floating point operation, 1/f = Mflop rate)

Time for communicating n words: tcomm = α + β·n
(α = latency, 1/β = bandwidth)

Total time: tcomp + tcomm
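An illustration with assumed machine parameters (not values from the slides): f = 1 ns per flop, latency α = 1 μs, and β = 8 ns per 8-byte word (about 1 GB/s bandwidth), for n = 1000:

\begin{align*}
  t_{\mathrm{comp}} &= f\,n             = 10^{-9}\cdot 10^{3}                    = 1\ \mu\mathrm{s},\\
  t_{\mathrm{comm}} &= \alpha + \beta n = 10^{-6} + 8\cdot 10^{-9}\cdot 10^{3}   = 9\ \mu\mathrm{s}.
\end{align*}
% Moving the 1000 words costs roughly 9x the time of computing on them once,
% which is why the computation-to-communication ratio (granularity) matters.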


    Parallel Algorithms and Implementations

basic kernels

Krylov methods

    multigrid


    Vector Operations

Adding two vectors: z = x + y
  Proc 0: z0 = x0 + y0
  Proc 1: z1 = x1 + y1
  Proc 2: z2 = x2 + y2

Axpys: y = y + αx

Vector products: s = x^T y
  Proc 0: s0 = x0 * y0
  Proc 1: s1 = x1 * y1
  Proc 2: s2 = x2 * y2
  s = s0 + s1 + s2
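A minimal MPI sketch of the distributed inner product above (an illustration, not from the slides): each task forms its local partial sum, and MPI_Allreduce computes s = s0 + s1 + ... on every task.

#include <stdio.h>
#include <mpi.h>

#define NLOCAL 1000               /* vector entries owned by this task */

int main(int argc, char **argv)
{
    double x[NLOCAL], y[NLOCAL], s_local = 0.0, s;
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < NLOCAL; i++) { x[i] = 1.0; y[i] = 2.0; }

    for (i = 0; i < NLOCAL; i++)  /* local partial product s_k = x_k^T y_k */
        s_local += x[i] * y[i];

    /* global sum; every task receives the result */
    MPI_Allreduce(&s_local, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("global dot product = %g\n", s);
    MPI_Finalize();
    return 0;
}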


    Dense n x n Matrix Vector Multiplication

Matrix size n x n, p processors

[Figure: y = A·x with A and y partitioned into blocks of rows across Proc 0, Proc 1, Proc 2, Proc 3]

T = 2(n^2/p)·f + (p - 1)(α + β·n/p)


    Sparse Matrix Vector Multiplication

Matrix of size n^2 x n^2, p^2 processors

Consider the graph of the matrix.

Partitioning influences the run time:
T1 ≈ 10(n^2/p^2)·f + 2(α + β·n)   vs.   T2 ≈ 10(n^2/p^2)·f + 4(α + β·n/p)


    Effect of Partitioning on Performance

[Figure: speedup vs. no. of procs, 1-15, for perfect speedup, MPI opt, and MPI-noopt]


    Coordinate Format (COO)

    real Data(nnz), integer Row(nnz), integer Col(nnz)

Data, Row, Col (example arrays shown on the slide)

Matrix-Vector multiplication:

for (k = 0; k < nnz; k++)
   y[Row[k]] += Data[k] * x[Col[k]];


Matrix-Vector multiplication (worked example on the slide):

for (k = 0; k < nnz; k++)
   y[Row[k]] += Data[k] * x[Col[k]];


    Sparse Matrix Vector Multiply with CSR Data Structure

Compressed sparse row (CSR):
real Data(nnz), integer Rowptr(n+1), integer Col(nnz)

Data, Rowptr, Col (example arrays shown on the slide)

Matrix-Vector multiplication:

for (i = 0; i < n; i++)
   for (k = Rowptr[i]; k < Rowptr[i+1]; k++)
      y[i] += Data[k] * x[Col[k]];
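A self-contained C version of the CSR product above, with a small hard-coded example matrix (assumed data, 0-based indexing):

#include <stdio.h>

/* y = A*x for a 4x4 example matrix stored in CSR:
   A = [10 0 0 -2; 3 9 0 0; 0 7 8 7; 0 0 2 5] */
int main(void)
{
    double Data[]   = {10, -2, 3, 9, 7, 8, 7, 2, 5};
    int    Col[]    = { 0,  3, 0, 1, 1, 2, 3, 2, 3};
    int    Rowptr[] = { 0, 2, 4, 7, 9};
    int    n = 4;

    double x[] = {1, 1, 1, 1};
    double y[4] = {0, 0, 0, 0};

    for (int i = 0; i < n; i++)
        for (int k = Rowptr[i]; k < Rowptr[i+1]; k++)
            y[i] += Data[k] * x[Col[k]];

    for (int i = 0; i < n; i++)          /* expected row sums: 8 12 22 7 */
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}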


    Sparse Matrix Vector Multiply with CSC Data Structure

Compressed sparse column (CSC):
real Data(nnz), integer Colptr(n+1), integer Row(nnz)

Data, Colptr, Row (example arrays shown on the slide)

Matrix-Vector multiplication:

for (i = 0; i < n; i++)
   for (k = Colptr[i]; k < Colptr[i+1]; k++)
      y[Row[k]] += Data[k] * x[i];


Parallel Matvec, CSR vs. CSC

CSR (block of rows per processor): Proc i computes its piece of y from its block of rows A_i and the vector x, so it needs the pieces of x owned by the other processors.

CSC (block of columns per processor): Proc i computes a partial vector A_i x_i from its block of columns and its piece of x; the partial vectors are summed, y = y0 + y1 + y2.


    Sparse Matrix Vector Multiply with Diagonal Storage

Diagonal storage:
real Data(n,m), integer offset(m)

Data, offset (example arrays shown on the slide)

Matrix-Vector multiplication:

for (i = 0; i < n; i++)
   for (k = 0; k < m; k++) {
      j = i + offset[k];
      if (j >= 0 && j < n)
         y[i] += Data[i][k] * x[j];
   }


    Other Data Structures

Block data structures (BCSR, BCSC):
block size b x b, n divisible by b
real Data(nnb*b*b), integer Ptr(n/b+1), integer Index(nnb)

Based on space filling curves (e.g. Hilbert curves, Z-order):
integer IJ(2*nnz), real Data(nnz)

Matrix-Vector multiplication:

for (i = 0; i < nnz; i++)
   y[IJ[2*i]] += Data[i] * x[IJ[2*i+1]];


    Data Structures for Distributed Memory

    Concept for MATAIJ (PETSc) and ParCSR (Hypre)

[Figure: an 11 x 11 sparse matrix partitioned into contiguous blocks of rows across Proc 0, Proc 1, and Proc 2]



    Data Structures for Distributed Memory

    Concept for MATAIJ (PETSc) and ParCSR (Hypre)

Each processor stores: a local CSR matrix, a nonlocal CSR matrix, and a mapping from local to global indices

[Figure: the same matrix with global column indices 0-10; for one processor, the nonlocal part keeps only the columns it couples to, with a local-to-global index map such as 0 2 3 8 10]
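A sketch of the per-processor storage concept in C. The field names are hypothetical; the actual hypre ParCSR and PETSc MATAIJ structures differ in their details.

/* Hypothetical sketch only; not the real hypre/PETSc definitions. */
typedef struct {
    int     nrows;       /* rows owned by this processor               */
    int    *rowptr;      /* CSR row pointers, length nrows+1           */
    int    *col;         /* column indices (local numbering)           */
    double *data;        /* nonzero values                             */
} LocalCSR;

typedef struct {
    LocalCSR diag;       /* couplings to columns owned by this proc    */
    LocalCSR offd;       /* couplings to columns on other processors   */
    int     *col_map;    /* local-to-global map for the offd columns   */
    int      first_row;  /* global index of this processor's first row */
} DistMatrix;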


    Communication Information

Neighborhood information (number of neighbors)

Amount of data to be sent and/or received

Global partitioning for p processors:
  r0 | r1 | r2 | ... | rp
not good on very large numbers of processors!


    Assumed partition (AP) algorithm

Answering global distribution questions previously required O(P) storage & computations

On BG/L, O(P) storage may not be possible

New algorithm requires:
- O(1) storage
- O(log P) computations

Enables scaling to 100K+ processors

AP idea has general applicability beyond hypre

[Figure: actual partition of indices 1..N across 4 processors vs. the assumed partition p = (iP)/N; data owned by processor 3 falls in processor 4's assumed partition]

Actual partition info is sent to the assumed partition processors (a distributed directory)
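A tiny C sketch of the assumed-partition map p = (iP)/N used above; the exact rounding convention here is an assumption.

#include <stdio.h>

/* Owner of global index i under the assumed partition (0-based, assumed rounding). */
static int assumed_owner(long long i, long long N, int P)
{
    return (int)((i * (long long)P) / N);   /* O(1) storage, O(1) work */
}

int main(void)
{
    /* Example: N = 100 global rows on P = 4 processors. */
    printf("%d %d %d %d\n",
           assumed_owner(0, 100, 4),    /* 0 */
           assumed_owner(24, 100, 4),   /* 0 */
           assumed_owner(25, 100, 4),   /* 1 */
           assumed_owner(99, 100, 4));  /* 3 */
    return 0;
}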


Assumed partition (AP) algorithm is more challenging for structured AMR grids

AMR can produce grids with gaps

Our AP function accounts for these gaps for scalability

Demonstrated on 32K procs of BG/L

A simple, naive AP function leaves processors with empty partitions


Effect of AP on one of hypre's multigrid solvers

[Figure: run times (seconds) on up to 64K procs on BG/P, global vs. assumed partitioning]


    Machine specification Hera - LLNL

Multi-core / multi-socket cluster

864 diskless nodes interconnected by Quad Data Rate (QDR) InfiniBand

AMD Opteron quad core (2.3 GHz)

Fat tree network


Multicore cluster details (Hera):
- Individual 512 KB L2 cache for each core
- 2 MB L3 cache shared by 4 cores
- 4 sockets per node, 16 cores sharing the 32 GB main memory
- NUMA memory access


MxV Performance, 2-16 cores, OpenMP vs. MPI, 7pt stencil

[Figures: Mflops vs. matrix size (up to ~400000) for 1, 2, 4, 8, 12, 16 threads (OpenMP) and tasks (MPI)]


    MxV Speedup for various matrix sizes, 7-pt stencil

[Figures: speedup vs. no. of tasks (MPI) and no. of threads (OpenMP), 1-15, for matrix sizes 512, 8000, 21952, 32768, 64000, 110592, 512000, plus perfect speedup]


    Iterative linear solvers

Krylov solvers:
- Conjugate gradient
- GMRES
- BiCGSTAB

Generally consist of basic linear kernels


    Conjugate Gradient for solving Ax=b

k = 0, x0 = 0, r0 = b, γ0 = r0^T r0;
while (γk > ε^2 γ0)
   Solve M zk = rk;
   δk = rk^T zk;
   if (k = 0) p1 = z0 else pk+1 = zk + (δk / δk-1) pk;
   k = k + 1;
   αk = δk-1 / (pk^T A pk);
   xk = xk-1 + αk pk;
   rk = rk-1 - αk A pk;
   γk = rk^T rk;
end
x = xk;
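A runnable sequential sketch of this loop in C (assuming M = I, i.e. no preconditioner, and a small tridiagonal test matrix; an illustration, not the slide's code), showing that CG is built from the matvec, dot-product, and axpy kernels discussed earlier:

#include <stdio.h>
#include <math.h>

#define N 8

static void matvec(const double *x, double *y)   /* y = A*x, A = tridiag(-1,2,-1) */
{
    for (int i = 0; i < N; i++) {
        y[i] = 2.0 * x[i];
        if (i > 0)     y[i] -= x[i-1];
        if (i < N - 1) y[i] -= x[i+1];
    }
}

static double dot(const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i] * y[i];
    return s;
}

int main(void)
{
    double x[N] = {0}, r[N], p[N], Ap[N];
    double gamma, gamma_old, alpha, eps = 1e-10;

    for (int i = 0; i < N; i++) { r[i] = 1.0; p[i] = r[i]; }   /* x0 = 0, r0 = b */
    gamma = dot(r, r);
    double gamma0 = gamma;

    for (int k = 0; k < N && gamma > eps * eps * gamma0; k++) {
        matvec(p, Ap);
        alpha = gamma / dot(p, Ap);                        /* alpha = gamma / p^T A p */
        for (int i = 0; i < N; i++) x[i] += alpha * p[i];  /* axpy */
        for (int i = 0; i < N; i++) r[i] -= alpha * Ap[i]; /* axpy */
        gamma_old = gamma;
        gamma = dot(r, r);
        for (int i = 0; i < N; i++) p[i] = r[i] + (gamma / gamma_old) * p[i];
    }

    printf("||r|| = %e\n", sqrt(gamma));
    return 0;
}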


    Weak scalability diagonally scaled CG

[Figures: total times for 100 iterations (seconds) and number of iterations vs. no. of procs (up to ~120)]


    Achieve algorithmic scalability through multigrid

[Figure: multigrid schematic: smooth on the Finest Grid, restrict to the First Coarse Grid, interpolate back to the finer grid]


    Multigrid parallelized by domain partitioning

    Partition fine grid

    Interpolation fully parallel

    Smoothing


Algebraic Multigrid

Setup Phase:
- Select coarse grids
- Define interpolation P(m), m = 1, 2, ...
- Define restriction and coarse-grid operators: R(m) = (P(m))^T, A(m+1) = R(m) A(m) P(m)

Solve Phase (on level m):
- Relax A(m) u^m ≈ f^m
- Compute r^m = f^m - A(m) u^m
- Restrict r^(m+1) = R(m) r^m
- Solve A(m+1) e^(m+1) = r^(m+1)
- Interpolate e^m = P(m) e^(m+1)
- Correct u^m <- u^m + e^m
- Relax A(m) u^m ≈ f^m


    How about scalability of multigrid V cycle?

Time for relaxation on the finest grid, with n x n points per proc:
  T0 ≈ 4α + 4βn + 5n^2 f

Time for relaxation on level k:
  Tk ≈ 4α + 4β(n/2^k) + 5(n^2/4^k) f

Total time for a V-cycle:
  TV ≈ 2 Σk [4α + 4β(n/2^k) + 5(n^2/4^k) f]
     ≈ 8(1+1+1+...)α + 8(1 + 1/2 + 1/4 + ...)βn + 10(1 + 1/4 + 1/16 + ...)n^2 f
     ≈ 8Lα + 16βn + (40/3) n^2 f,   with L ≈ log2(pn) the number of levels

Scaled efficiency: TV(n,1)/TV(pn,p) = O(1/log2(p)) for large p


    How about scalability of multigrid W cycle?

Total time for a W-cycle:
  TW ≈ 2 Σk 2^k [4α + 4β(n/2^k) + 5(n^2/4^k) f]
     ≈ 8(1+2+4+...)α + 8(1 + 1 + 1 + ...)βn + 10(1 + 1/2 + 1/4 + ...)n^2 f
     ≈ 8(2^(L+1) - 1)α + 16 L βn + 20 n^2 f
     ≈ 8(2pn - 1)α + 16 log2(pn) βn + 20 n^2 f,   with L ≈ log2(pn)

    Smoothers

    Smoothers significantly affect multigrid convergence and runtime

    solving: Ax=b, A SPD

    smoothing: xn+1 = xn + M-1(b-Axn)

    smoothers:

    Jacobi, M=D, easy to parallelize

    Gauss-Seidel, M=D+L, highly sequential!!

Jacobi:
  x_new(i) = A(i,i)^(-1) [ b(i) - Σ_{j≠i} A(i,j) x_old(j) ],   i = 0, ..., n-1

Gauss-Seidel:
  x_new(i) = A(i,i)^(-1) [ b(i) - Σ_{j<i} A(i,j) x_new(j) - Σ_{j>i} A(i,j) x_old(j) ],   i = 0, ..., n-1


    Parallel Smoothers

    Red-Black Gauss-Seidel

    Multi-color Gauss-Seidel


    Hybrid smoothers

M = diag(M1, ..., Mp)   (one diagonal block Mk per processor or thread)

Hybrid Gauss-Seidel:            Mk = Dk + Lk
Hybrid SOR:                     Mk = (1/ω)(Dk + ω Lk)
Hybrid symmetric Gauss-Seidel:  Mk = (Dk + Lk) Dk^(-1) (Dk + Lk)^T

Can be used for distributed as well as shared parallel computing
Needs to be used with care
Might require a smoothing parameter ω (e.g. Mω = ωM)

Corresponding block schemes, Schwarz solvers, ILU, etc.


    How is the smoother threaded?

Matrices are distributed across P processors in contiguous blocks of rows:
  A = [A1; A2; ...; Ap]

The smoother is threaded such that each thread works on a contiguous subset of the rows of its block Ap (threads t1, t2, t3, t4).

Hybrid: GS within each thread, Jacobi on thread boundaries
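A minimal OpenMP sketch of this hybrid smoother (assumed example: one repeated sweep on the tridiagonal matrix A = tridiag(-1,2,-1); Gauss-Seidel inside each thread's contiguous block of rows, old values used across block boundaries). This is only an illustration, not hypre's implementation.

#include <stdio.h>
#include <string.h>
#include <omp.h>

#define N 16

int main(void)
{
    double x[N] = {0}, xold[N], b[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;

    for (int sweep = 0; sweep < 50; sweep++) {
        memcpy(xold, x, sizeof(x));          /* old iterate for thread boundaries */

        #pragma omp parallel
        {
            int nt = omp_get_num_threads(), t = omp_get_thread_num();
            int lo = t * N / nt, hi = (t + 1) * N / nt;   /* contiguous rows */

            for (int i = lo; i < hi; i++) {
                /* neighbors inside the block: Gauss-Seidel (current values);
                   neighbors across the block boundary: Jacobi (old values)  */
                double left  = (i > 0)     ? (i - 1 >= lo ? x[i-1] : xold[i-1]) : 0.0;
                double right = (i < N - 1) ? (i + 1 <  hi ? x[i+1] : xold[i+1]) : 0.0;
                x[i] = (b[i] + left + right) / 2.0;
            }
        }
    }
    printf("x[0] = %f, x[N/2] = %f\n", x[0], x[N/2]);
    return 0;
}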


    FEM application: 2D Square

AMG determines thread subdomains by splitting nodes (not elements)

The application's numbering of the elements/nodes becomes relevant! (i.e., threaded domains depend on the colors)

The application partitions the grid element-wise into nice subdomains for each processor

[Figure: 16 procs; colors indicate element numbering]


    We want to use OpenMP on multicore machines...

Hybrid-GS is not guaranteed to converge, but in practice it is effective for diffusion problems when subdomains are well-defined

Some options:
1. The application needs to order unknowns in the processor subdomain nicely
   - may be more difficult due to refinement
   - may impose a non-trivial burden on the application
2. AMG needs to reorder unknowns (rows): non-trivial, costly
3. Smoothers that don't depend on parallelism


    Coarsening algorithms

    Original coarsening algorithm

    Parallel coarsening algorithms


    Choosing the Coarse Grid (Classical AMG)

Add measures (number of strong influences of a point)


    Choosing the Coarse Grid (Classical AMG)

Pick a point with maximal measure and make it a C-point


    Choosing the Coarse Grid (Classical AMG)

Pick a point with maximal measure and make it a C-point

Make its neighbors F-points


    Choosing the Coarse Grid (Classical AMG)

Pick a point with maximal measure and make it a C-point

Make its neighbors F-points

Increase measures of neighbors of new F-points by the number of new F-neighbors


    Choosing the Coarse Grid (Classical AMG)

Pick a point with maximal measure and make it a C-point

Make its neighbors F-points

Increase measures of neighbors of new F-points by the number of new F-neighbors

Repeat until finished


    Choosing the Coarse Grid (Classical AMG)

Pick a point with maximal measure and make it a C-point

Make its neighbors F-points

Increase measures of neighbors of new F-points by the number of new F-neighbors

Repeat until finished


    Choosing the Coarse Grid (Classical AMG)

Pick a point with maximal measure and make it a C-point

Make its neighbors F-points

Increase measures of neighbors of new F-points by the number of new F-neighbors

Repeat until finished


    Parallel Coarsening Algorithms

Perform the sequential algorithm on each processor, then deal with processor boundaries


    PMIS: start

select C-pt with maximal measure

select neighbors as F-pts

    3 5 5 5 5 5 3

    5 8 8 8 8 8 5

    5 8 8 8 8 8 5

    5 8 8 8 8 8 5

    5 8 8 8 8 8 5

    5 8 8 8 8 8 5

    3 5 5 5 5 5 3


    PMIS: add random numbers

select C-pt with maximal measure

select neighbors as F-pts

    3.7 5.3 5.0 5.9 5.4 5.3 3.4

    5.2 8.0 8.5 8.2 8.6 8.2 5.1

    5.9 8.1 8.8 8.9 8.4 8.9 5.9

    5.7 8.6 8.3 8.8 8.3 8.1 5.0

    5.3 8.7 8.3 8.4 8.3 8.8 5.9

    5.0 8.8 8.5 8.6 8.7 8.9 5.3

    3.2 5.6 5.8 5.6 5.9 5.9 3.0


    PMIS: exchange neighbor information

select C-pt with maximal measure

select neighbors as F-pts

    3.7 5.3 5.0 5.9 5.4 5.3 3.4

    5.2 8.0 8.5 8.2 8.6 8.2 5.1

    5.9 8.1 8.8 8.9 8.4 8.9 5.9

    5.7 8.6 8.3 8.8 8.3 8.1 5.0

    5.3 8.7 8.3 8.4 8.3 8.8 5.9

    5.0 8.8 8.5 8.6 8.7 8.9 5.3

    3.2 5.6 5.8 5.6 5.9 5.9 3.0


    PMIS: select

    8.8

    8.9 8.9

    8.9

    3.7 5.3 5.0 5.9 5.4 5.3 3.4

    5.2 8.0 8.5 8.2 8.6 8.2 5.1

    5.9 8.1 8.8 8.4 5.9

    5.7 8.6 8.3 8.8 8.3 8.1 5.0

    5.3 8.7 8.3 8.4 8.3 8.8 5.9

    5.0 8.5 8.6 8.7 5.3

    3.2 5.6 5.8 5.6 5.9 5.9 3.0

select C-pts with maximal measure locally

make neighbors F-pts


    PMIS: update 1

    5.4 5.3 3.4

select C-pts with maximal measure locally

make neighbors F-pts (requires neighbor info)

    3.7 5.3 5.0 5.9

    5.2 8.0

    5.9 8.1

    5.7 8.6

    8.4

    8.6

    5.6


    PMIS: select

    8.6

    5.9

    8.6

    5.4 5.3 3.43.7 5.3 5.0

    5.2 8.0

    5.9 8.1

    5.7

    8.4

    5.6

select C-pts with maximal measure locally

make neighbors F-pts


    PMIS: update 2

    5.3 3.4

select C-pts with maximal measure locally

make neighbors F-pts

    3.7 5.3

    5.2 8.0


    PMIS: final grid

select C-pts with maximal measure locally

make neighbors F-pts

remove neighbor edges


    Interpolation

Nearest neighbor interpolation is straightforward to parallelize

Harder to parallelize long-range interpolation

[Figure: points i, j, k, l illustrating long-range interpolation]


    Communication patterns 128 procs Level 0

[Figures: communication patterns for Setup and Solve]


Communication patterns 128 procs, Setup: distance 1 vs. distance 2 interpolation

    Distance 1 Level 4 Distance 2 Level 4


Communication patterns 128 procs, Solve: distance 1 vs. distance 2 interpolation

    Distance 1 Level 4 Distance 2 Level 4


AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figures: Setup times and Solve times (seconds) vs. no. of cores (up to ~3000) for Hera-1, Hera-2, BGL-1, BGL-2]


Distance 1 vs. distance 2 interpolation on different computer architectures

AMG-GMRES(10), 7pt 3D Laplace problem on a unit cube, using 50x50x25 points per processor (100x100x100 per node on Hera)

[Figure: run times (seconds) vs. no. of cores (up to ~3000) for Hera-1, Hera-2, BGL-1, BGL-2]


AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figures: Setup times and Solve cycle times (seconds) vs. no. of cores (up to ~10000) for MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, MCSUP 1x16]


AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figure: times (seconds) vs. no. of cores (up to ~12000) for MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, MCSUP 1x16]


Blue Gene/P Solution Intrepid, Argonne National Laboratory

40 racks with 1024 compute nodes each

Quad-core 850 MHz PowerPC 450 processor

163,840 cores

2 GB main memory per node shared by all cores

3D torus network - isolated dedicated partitions


AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)

[Figures: Setup times and Cycle times (seconds) vs. no. of cores (up to ~100000) for smp, dual, vn, vn-omp and H1x4, H2x2, MPI, H4x1]


AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)

[Figure: run times (seconds) vs. no. of procs (up to ~128000) for H1x4, H2x2, MPI, H4x1]

No. of iterations:

No. of cores    H1x4   H2x2   MPI
128              17     17     18
1024             20     20     22
8192             25     25     25
27648            29     27     28
65536            30     33     29
128000           37     36     36


Cray XT-5 Jaguar-PF, Oak Ridge National Laboratory

18,688 nodes in 200 cabinets

Two sockets with AMD Opteron Hex-Core processor per node

16 GB main memory per node

Network: 3D torus/mesh with wrap-around links (torus-like) in two dimensions and without such links in the remaining dimension

Compute Node Linux, limited set of services, but reduced noise ratio

PGI's C compilers (v.904) with and without OpenMP; OpenMP settings: -mp=numa (preemptive thread pinning, localized memory allocations)


AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)

[Figures: Setup times and Cycle times (seconds) vs. no. of cores (up to ~200000) for MPI, H12x1, H1x12, H4x3, H2x6]


AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)

[Figure: run times (seconds) vs. no. of cores (up to ~200000) for MPI, H1x12, H4x3, H2x6]


    Thank You!

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.