HPC Technology Track: Foundations of Computational Science Lecture 2 Dr. Greg Wettstein, Ph.D....

HPC Technology Track: Foundations of Computational Science

Lecture 2

Dr. Greg Wettstein, Ph.D.

Research Support Group Leader, Division of Information Technology

Adjunct Professor, Department of Computer Science

North Dakota State University

What is High Performance Computing?

Definition:

The solution of problems involving high degrees of computational complexity or data analysis which require specialized hardware and software systems.

What is Parallel Computing?

Definition:

A strategy of decreasing the time to solution of a computational problem by carrying out multiple elements of the computation at the same time.

Does HPC imply Parallel Computing?

Typically, but not always. HPC solutions may require specialized systems due to memory and/or I/O performance issues.

Conversely, parallel computing does not necessarily imply high performance computing.

Flynn's Taxonomy: Classification Strategy for Concurrent Execution

SISD Single Instruction, Single Data

MISD Multiple Instruction, Single Data

SIMD * Single Instruction, Multiple Data

MIMD * Multiple Instruction, Multiple Data

* = Relevant to HPC

SIMD: The Origin of HPC

Architectural model at the heart of 'vector processors'.

Performance enhancement in machines at the origin of HPC: CDC STAR-100 and Cray-1.

Utility predicated on the fact that mathematical operations on vectors or vector spaces are at the heart of linear algebra.

Vector Processing Diagram

[Diagram: two rows of vector elements combined by parallel mathematical operations (+, -, *, /). Vector length = 8 'words'.]
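
The element-wise model on this slide can be sketched in a few lines. This is only an illustration of the semantics: `vector_op` is a hypothetical helper, the sample values are arbitrary, and Python evaluates the pairs one after another where a vector processor would apply the operation to all elements at once.

```python
import operator

# Hypothetical helper (not from the lecture): apply one operation to every
# pair of elements, the way a vector unit applies one instruction to all
# elements of its operands.
def vector_op(op, a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [op(x, y) for x, y in zip(a, b)]

# Two vectors of 8 'words'; the values are arbitrary sample data.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]

sums = vector_op(operator.add, a, b)      # one "add" over 8 element pairs
products = vector_op(operator.mul, a, b)  # one "multiply" over 8 element pairs
```

The point of the sketch is that no element's result depends on any other element's result, which is what makes the operation suitable for SIMD hardware.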

Current SIMD Examples

Embedded in modern x86 and x86_64 architectures.

Primarily focused on graphics/signal processing: MMX, PNI, SSE2-4, AVX.

Foundation for current trend in 'GPGPU computing': NVIDIA Tesla architecture.

Component of Larrabee architecture.

SSE Implementation

[Diagram: vector elements held in two 128 bit XMM registers and combined by parallel operations, 100+ in SSE4. The stride length is marked on the diagram.]
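
How register width and stride determine the loop count can be worked through with a small model. This is a sketch under stated assumptions: "stride" is taken here to mean the number of elements consumed per loop pass, the 32 bit element width is chosen only for illustration, and all function names are hypothetical. No real SSE instructions are issued.

```python
def lanes_per_register(register_bits, element_bits):
    """Number of elements that fit in one register (assumes an exact fit)."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits

def loop_iterations(n_elements, stride):
    """Passes needed to cover n_elements, `stride` elements per pass."""
    return -(-n_elements // stride)  # ceiling division

def chunked_and(a, b, stride):
    """AND two equal-length lists of integer words, `stride` words per pass.

    Each pass models one register-wide AND; returns (result, passes).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result, passes = [], 0
    for start in range(0, len(a), stride):
        stop = start + stride
        result.extend(x & y for x, y in zip(a[start:stop], b[start:stop]))
        passes += 1
    return result, passes

# A 128 bit register holds four 32 bit words (element width is an assumption).
stride = lanes_per_register(128, 32)

a = [0xFF] * 10
b = [0x0F, 0xF0] * 5
anded, passes = chunked_and(a, b, stride)
scalar_passes = loop_iterations(len(a), 1)  # one word per pass, no SIMD
```

With ten words and a stride of four, the loop body runs three times instead of ten; the last pass is only partly filled, which real SIMD code has to handle with padding or a scalar tail loop.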

MIMD: Multiple Instruction, Multiple Data

Characterized by multiple execution threads operating on separate data elements.

Threads may operate in shared or disjoint (distributed) memory configurations.

Implementation example: SMP (Symmetric Multi-Processing)
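
A minimal sketch of the MIMD idea in shared memory: two threads run different instruction streams on different data. The function names and sample data are hypothetical. In standard CPython builds the global interpreter lock keeps these threads from executing bytecode simultaneously, so this shows the structure of the model rather than a speedup.

```python
import threading

# Two unrelated tasks: different code, different data.
def total(values, out, key):
    out[key] = sum(values)

def longest(words, out, key):
    out[key] = max(words, key=len)

results = {}  # shared memory: both threads write here, under distinct keys
threads = [
    threading.Thread(target=total, args=([1, 2, 3, 4], results, "total")),
    threading.Thread(target=longest, args=(["cpu", "memory", "io"], results, "longest")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for both instruction streams to finish
```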

SPMD: The Basis for Modern HPC

SPMD (Single Program, Multiple Data) is defined as multiple processes or threads executing a common program, each at a different point in that program.

Different from SIMD in that execution is not in lockstep format.

Common implementations:

Shared memory: OpenMP, Pthreads

Distributed memory: MPI
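
The SPMD pattern can be sketched without any of the libraries above: every worker runs the same function and uses its own rank to select its share of the data. The names `program`, `run_spmd`, `rank` and `size` are hypothetical (the last two borrow MPI terminology), and a thread pool stands in for real processes.

```python
from concurrent.futures import ThreadPoolExecutor

def program(rank, size, data):
    """The single program every worker runs; `rank` selects its data."""
    # Cyclic distribution: worker `rank` takes elements rank, rank+size, ...
    return sum(x * x for x in data[rank::size])

def run_spmd(size, data):
    """Launch `size` workers on the same program and combine their results."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        partials = list(pool.map(lambda r: program(r, size, data), range(size)))
    return partials, sum(partials)

data = list(range(1, 9))              # sample data: 1..8
partials, total = run_spmd(4, data)   # sum of squares, split four ways
```

Unlike SIMD, nothing forces the workers to stay in lockstep: each one proceeds through the program at its own pace and the results are combined only at the end.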

Characteristics of MD Models

MIMD/SPMD requires active participation by the programmer to implement 'orthogonalization'.

SIMD requires active participation by the compiler with consideration by the programmer to support orthogonalization.

Orthogonalization defn: The isolation of a problem into discrete elements capable of being independently resolved.
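
One way to picture orthogonalization is as a partition of the index space into disjoint blocks that together cover the whole problem. The sketch below uses a hypothetical `block_ranges` helper and a simple summation as the problem; it resolves the blocks in reverse order to show that the order does not matter once the elements are independent.

```python
def block_ranges(n, parts):
    """Split range(n) into `parts` contiguous, disjoint (start, stop) blocks."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        # The first `extra` blocks take one additional element each.
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

data = list(range(10))
ranges = block_ranges(len(data), 3)

# Each block is resolved on its own; reversed() shows order is irrelevant.
partials = [sum(data[start:stop]) for start, stop in reversed(ranges)]
combined = sum(partials)
```

A computation such as a running total, where each element needs the previous result, would not split this cleanly; that dependency is what the programmer or compiler has to remove or work around.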

The Real World - A Continuum

Practical programs do not exhibit strict model partitioning.

More pragmatic model is to consider 'dimensions' of parallelism available to a program.

Currently a total of four dimensions of parallelism are exploitable.

Dimensions of Parallelism

First dimension: standard sequential programming with processor-supplied ILP (Instruction Level Parallelism). Referred to as 'free' or 'invisible' parallelism.

Second dimension: SIMD or OpenMP loop parallelism, characterized by isolation of the problem into a single system image. Primarily supported by programming language or compiler.

Dimensions of Parallelism - cont.

Third dimension: two subtypes.

Use of MPI to partition the problem into orthogonal elements. Partitioning is frequently implemented on multiple system images.

MIMD threading on a single system image. Separate threads are dispatched to handle separate tasks which can execute asynchronously. Common HPC example is to 'thread' computation and Input/Output (I/O).
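
Threading computation and I/O usually means one thread reads while another computes, with a bounded queue between them. The sketch below is an assumption-laden illustration: an in-memory `io.StringIO` stands in for a real file or device, and the function names are hypothetical.

```python
import io
import queue
import threading

def reader(stream, work):
    """I/O thread: read lines and hand them to the compute thread."""
    for line in stream:
        work.put(line)
    work.put(None)  # sentinel: no more input

def compute(work, totals):
    """Compute thread: sum the integers on each line as it arrives."""
    while True:
        line = work.get()
        if line is None:
            break
        totals.append(sum(int(tok) for tok in line.split()))

stream = io.StringIO("1 2 3\n4 5 6\n7 8 9\n")  # stand-in for a real input file
work = queue.Queue(maxsize=2)  # bounded, so the reader cannot run far ahead
totals = []

io_thread = threading.Thread(target=reader, args=(stream, work))
cpu_thread = threading.Thread(target=compute, args=(work, totals))
io_thread.start()
cpu_thread.start()
io_thread.join()
cpu_thread.join()
```

The two threads run asynchronously: the reader can fetch the next line while the previous one is still being processed, which is the benefit when the input side is slow.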

Dimensions of Parallelism - cont.

Fourth dimension: partitioning of the problem into orthogonal elements which can be dispatched to a heterogeneous instruction architecture.

Examples: GPGPU/CUDA, PowerXCell SPU, FPGA

Depth of Parallelism

Measure of the complexity of parallelism implemented.

Simplest metric is the count of programmer-implemented dimensions of parallelism on a single system image.

Example: MPI implementation with SIMD loop vectorization on each node. Parallelism depth is two.

Parallelism Analysis Example

Process based MIMD application. Depth = 1

MPI simulation with OpenMP loop vectorization. Depth = 2

MPI partitioning with CUDA PTree offload and SIMD loop vectorization. Depth = 3
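
The three depth figures on this slide can be reproduced by counting the distinct programmer-implemented dimensions each application uses. The mapping of technique to dimension below is an interpretation of the earlier slides, not something the lecture states in this form, and the names are hypothetical. In particular, counting distinct dimensions would treat MPI combined with MIMD threading as depth one, which the lecture does not address.

```python
# Assumed mapping from technique to dimension of parallelism. The first
# dimension (ILP) is processor supplied, so it never counts toward depth.
DIMENSION = {
    "SIMD": 2,
    "OpenMP": 2,
    "MPI": 3,
    "MIMD": 3,
    "CUDA": 4,
}

def parallelism_depth(techniques):
    """Count the distinct programmer-implemented dimensions in use."""
    return len({DIMENSION[name] for name in techniques})

depth_process_mimd = parallelism_depth(["MIMD"])
depth_mpi_openmp = parallelism_depth(["MPI", "OpenMP"])
depth_mpi_cuda_simd = parallelism_depth(["MPI", "CUDA", "SIMD"])
```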

Escalation of Complexity

[Chart: complexity rises from least to most as dimension goes from 1 to 4 and depth goes from 1 to N.]

Architectural decisions must be based on cost benefit analysis of performance returns.

Exercise

Verify you have the changeset which adds experimental code for SSE/SIMD-based boolean PTree operators.

Study the class methods implementing the AND and OR operators.

Review and understand how vector and stride length affect the number of times a loop needs to be executed.

goto skills_lecture1;
