Programming the IBM Power3 SP
Eric Aubanel
Advanced Computational Research Laboratory
Faculty of Computer Science, UNB
Advanced Computational Research Laboratory
• High Performance Computational Problem-Solving and Visualization Environment
• Computational Experiments in multiple disciplines: CS, Science and Eng.
• 16-Processor IBM SP3
• Member of C3.ca Association, Inc. (http://www.c3.ca)
Advanced Computational Research Laboratory
www.cs.unb.ca/acrl
• Virendra Bhavsar, Director
• Eric Aubanel, Research Associate & Scientific Computing Support
• Sean Seeley, System Administrator
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
POWER chip: 1990 to 2003
1990: POWER
– Performance Optimized With Enhanced RISC
– Reduced Instruction Set Computer (RISC)
– Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate of 2 x MHz
– Initially: 25 MHz (50 MFLOPS) and a 64 KB data cache
POWER chip: 1990 to 2003
1991: SP1
– IBM's first SP (Scalable POWERparallel)
– Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
– Parallel Environment & system software
POWER chip: 1990 to 2003
1993: POWER2
– 2 FMA units
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher-bandwidth switch for larger systems
POWER chip: 1990 to 2003
1993: PowerPC
– Support for SMP
1996: P2SC
– POWER2 Super Chip: clock speeds up to 160 MHz
POWER chip: 1990 to 2003
Feb. '99: POWER3
– Combined P2SC & PowerPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including an L2 cache of 1-16 MB
– Instruction & data prefetch
POWER3+ chip: Feb. 2000
• Winterhawk II - 375 MHz
  – 4-way SMP
  – 2 MULT/ADD - 1500 MFLOPS per processor
  – 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
  – 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
  – 1.6 GB/s memory bandwidth
  – 6 GFLOPS per node
• Nighthawk II - 375 MHz
  – 16-way SMP
  – 2 MULT/ADD - 1500 MFLOPS per processor
  – 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
  – 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
  – 14 GB/s memory bandwidth
  – 24 GFLOPS per node
The Clustered SMP
ACRL’s SP: Four 4-way SMPs
Each node has its own copy of the O/S
Processors on the same node are closer than those on different nodes
Power3 Architecture
Power4 - 32 way
• Logical UMA
• SP High Node
• L3 cache shared between all processors on node - 32 MB
• Up to 32 GB main memory
• Each processor: 1.1 GHz
• 140 Gflops total peak
[Diagram: sixteen pairs of processors (2 procs per pair, each pair with private L1 and L2 caches), connected by GX buses]
Going to NUMA
• 32-way GP High node
• Own copy of AIX
• 128+ GFLOPS per high node
• Multiple Federation adapters for scalable inter-node bandwidth
• NUMA up to 256 processors
[Diagram: two SP GP nodes, each running AIX with its own memory, processors/intra-node interconnect and Federation adapters, connected to the Federation switch by up to 16 links per node]
NUMA up to 256 processors - 1.1 Teraflops
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Uni-processor Optimization
• Compiler options: start with -O3 -qstrict, then -O3, -qarch=pwr3
• Cache re-use (see the sketch below)
• Take advantage of the superscalar architecture: give it enough operations per load/store
• Use ESSL - its optimization is already maximally exploited
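A minimal, hedged Fortran sketch of two of these ideas (array names and sizes are invented, not from the slides): keep the innermost loop on the first array index, since Fortran stores columns contiguously and stride-1 access re-uses each cache line, and unroll the reduction loop so the FMA units get several multiply-adds per pair of loads.

      program cache_demo
      implicit none
      integer n, i, j
      parameter ( n = 512 )
      real*8 a(n,n), b(n,n), s
      b = 1.0d0
! good cache re-use: the inner loop walks down a column (stride 1);
! swapping the i and j loops would stride across rows and miss in cache
      do j = 1, n
         do i = 1, n
            a(i,j) = 2.0d0*b(i,j)
         end do
      end do
! feed the superscalar FMA units: two multiply-adds per iteration
      s = 0.0d0
      do j = 1, n
         do i = 1, n, 2
            s = s + a(i,j)*b(i,j)
            s = s + a(i+1,j)*b(i+1,j)
         end do
      end do
      print *, s
      end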
Memory Access Times

          Memory to L2 or L1    L2 to L1           L1 to Registers
Width     16 bytes / 2 cycles   32 bytes/cycle     2 x 8 bytes/cycle
Rate      1.6 GB/s              6.4 GB/s           3.2 GB/s
Latency   ~35 cycles            ~6-7 cycles        1 cycle

Cache line: 128 bytes
L1 cache: 128-way set-associative, 64 KB
L2 cache: 4-way set-associative, 8 MB total (4 x 2 MB)
How to Monitor Performance?
• IBM's hardware monitor: HPMCOUNT
  – Uses hardware counters on the chip
  – Cache & TLB misses, fp ops, load-stores, ...
  – Beta version - available soon on ACRL's SP
HPMCOUNT sample output
! leading dimension of 256, a power of two - causes the heavy TLB miss rate shown below
      real*8 a(256,256),b(256,256),c(256,256)
common a,b,c
do j=1,256
do i=1,256
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores : 0.525 M
Instructions per load/store : 2.749
Cycles per instruction : 2.378
Instructions per cycle : 0.420
Total floating point operations : 0.066 M
Hardware float point rate : 2.749 Mflop/sec
HPMCOUNT sample output
! leading dimension padded to 257 - breaks the power-of-two stride and cuts TLB misses
      real*8 a(257,256),b(257,256),c(257,256)
common a,b,c
do j=1,256
do i=1,257
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores : 0.527 M
Instructions per load/store : 2.749
Cycles per instruction : 1.271
Instructions per cycle : 0.787
Total floating point operations : 0.066 M
Hardware float point rate : 3.525 Mflop/sec
ESSL
• Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers
• Fast! For a 560x560 real*8 matrix multiply:
  – Hand coding: 19 Mflops
  – dgemm: 1.2 GFlops (a call sketch follows below)
• Parallel (threaded and distributed) versions
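A minimal sketch of calling ESSL's dgemm (standard BLAS interface) from Fortran; the 560x560 size simply mirrors the timing above, and linking with -lessl is assumed.

      program matmul_essl
      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
      c = 0.0d0
! C = 1.0*A*B + 0.0*C; 'N' means neither matrix is transposed,
! and n is both the matrix order and the leading dimension
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      print *, c(1,1)
      end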
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
ACRL’s IBM SP
• 4 Winterhawk II nodes
  – 16 processors in total
• Each node has:
  – 1 GB RAM
  – 9 GB (mirrored) disk
  – a switch adapter
• High Performance Switch
• Gigabit Ethernet (1 node)
• Control workstation
• Disk: SSA tower with six 18.2 GB disks
IBM Power3 SP Switch
• Bidirectional multistage interconnection network (MIN)
• 300 MB/sec bidirectional
• 1.2 µsec latency
General Parallel File System
[Diagram: Nodes 2, 3 and 4 each run an application over a GPFS client; Node 1 runs the GPFS server; all sit on RVSD/VSD and communicate over the SP Switch]
ACRL Software
• Operating System: AIX 4.3.3
• Compilers
  – IBM XL Fortran 7.1 (HPF not yet installed)
– VisualAge C for AIX, Version 5.0.1.0
– VisualAge C++ Professional for AIX, Version 5.0.0.0
– IBM Visual Age Java - not yet installed
• Job Scheduler: Loadleveler 2.2
• Parallel Programming Tools
  – IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
• Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 )
• Visualization: OpenDX (not yet installed)
• E-Commerce software (not yet installed)
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Parallel Computing?
• Solve large problems in reasonable time
• Many algorithms are inherently parallel
  – image processing, Monte Carlo
  – simulations (e.g. CFD)
• High-performance computers have parallel architectures
  – commercial off-the-shelf (COTS) components
    • Beowulf clusters
    • SMP nodes
  – improvements in network technology
[Figure: NRL Layered Ocean Model (Naval Research Laboratory) on an IBM Winterhawk II SP]
Parallel Computational Models
• Data Parallelism
  – Parallel program looks like a serial program
    • parallelism is in the data
  – Vector processors
  – HPF
Parallel Computational Models
• Message Passing (MPI)
  – Processes have only local memory but can communicate with other processes by sending & receiving messages
  – Data transfer between processes requires operations to be performed by both processes
  – The communication network is not part of the computational model (hypercube, torus, ...)
[Diagram: one process calls Send, the other calls Receive]
Parallel Computational Models
• Shared Memory (threads)
  – P(osix)threads
  – OpenMP: a higher-level standard
[Diagram: several processes sharing one address space]
Parallel Computational Models
• Remote Memory Operations
  – "One-sided" communication
    • MPI-2, IBM's LAPI
  – One process can access the memory of another without the other's participation, but does so explicitly, not in the same way it accesses local memory
[Diagram: Put and Get operations between two processes]
Parallel Computational Models
• Combined: Message Passing & Threads
  – Driven by clusters of SMPs
  – Leads to software complexity!
[Diagram: several SMP nodes, each with processes sharing an address space, connected by a network]
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Message Passing Interface
• MPI 1.0 standard in 1994
• MPI 1.1 in 1995 - IBM support
• MPI 2.0 in 1997
  – Includes 1.1 but adds new features
    • MPI-IO
    • One-sided communication
    • Dynamic processes
Advantages of MPI
• Universality
• Expressivity
  – Well suited to formulating a parallel algorithm
• Ease of debugging
  – Memory is local
• Performance
  – Explicit association of data with process allows good use of cache
MPI Functionality
• Several modes of point-to-point message passing
  – blocking (e.g. MPI_SEND)
  – non-blocking (e.g. MPI_ISEND)
  – synchronous (e.g. MPI_SSEND)
  – buffered (e.g. MPI_BSEND)
• Collective communication and synchronization
  – e.g. MPI_REDUCE, MPI_BARRIER (a small example follows below)
• User-defined datatypes
• Logically distinct communicator spaces
• Application-level or virtual topologies
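As an illustration of collective communication (a hedged sketch, not from the slides), the program below sums one value from each rank onto rank 0 with MPI_REDUCE:

      program reduce_demo
      implicit none
      include "mpif.h"
      integer My_Id, Numb_of_Procs, Ierr
      real*8 part, total
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, Ierr )
! every rank contributes one value; rank 0 receives the global sum
      part = dble ( My_Id )
      call MPI_REDUCE ( part, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, Ierr )
      if ( My_Id .eq. 0 ) print *, ' sum of ranks = ', total
      call MPI_FINALIZE ( Ierr )
      end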
Simple MPI Example
[Diagram: two processes, My_Id 0 and 1; process 0 prints "This is from MPI process number 0", process 1 prints "This is from MPI processes other than 0"]
Simple MPI Example
      Program Trivial
implicit none
include "mpif.h" ! MPI header file
integer My_Id, Numb_of_Procs, Ierr
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
if ( My_Id .eq. 0 ) then
print *, ' This is from MPI process number ',My_Id
else
print *, ' This is from MPI processes other than 0 ', My_Id
end if
call MPI_FINALIZE ( ierr ) ! bad things happen if you forget ierr
stop
end
MPI Example with send/recv
[Diagram: processes 0 and 1 each send an array to, and receive one from, the other]
MPI Example with send/recv
      Program Simple
implicit none
Include "mpif.h"
Integer My_Id, Other_Id, Nx, Ierr
Parameter ( Nx = 100 )
Real A ( Nx ), B ( Nx )
call MPI_INIT ( Ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
Other_Id = Mod ( My_Id + 1, 2 )
A = My_Id
call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Ierr )
call MPI_FINALIZE ( Ierr )
stop
end
What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD,
status);
/* Processor 1 */
...
MPI_Send(sendbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD,
status);
MPI Message Passing Modes
The four send modes map onto underlying protocols:
  Ready        -> ready protocol (the matching receive is assumed to be already posted)
  Standard     -> eager protocol for messages <= the eager limit, rendezvous protocol above it
  Synchronous  -> rendezvous protocol
  Buffered     -> buffered protocol
Default eager limit on the SP is 4 KB (can be raised to 64 KB).
A deadlock-free rewrite of the previous send/recv example is sketched below.
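In the "What Will Happen?" fragment both ranks send first, so the program only completes while the messages fit under the eager limit; once the rendezvous protocol kicks in, both sends block and the code deadlocks. One hedged fix (a sketch, not from the slides) is MPI_SENDRECV, which lets the library pair the two transfers safely for any message size:

      Program Safe
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )
      parameter ( Nx = 100000 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
! send A and receive into B in one call - no deadlock, even above the eager limit
      call MPI_SENDRECV ( A, Nx, MPI_REAL, Other_Id, 0, B, Nx, MPI_REAL, Other_Id, 0, MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      end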
MPI Performance Visualization
• ParaGraph
  – Developed by the University of Illinois
  – Graphical display system for visualizing the behaviour and performance of MPI programs
Message Passing on SMP
[Diagram: the data to send is copied through a buffer on the sending side (Call MPI_SEND), across the memory crossbar or switch, into a buffer and then the received data on the receiving side (Call MPI_RECEIVE)]
export MP_SHARED_MEMORY=yes|no
Shared Memory MPI
MP_SHARED_MEMORY=<yes|no>

                                   Latency (µsec)   Bandwidth (MB/sec)
between 2 nodes                    24               133
same node, MP_SHARED_MEMORY=no     30               80
same node, MP_SHARED_MEMORY=yes    10               270
Message Passing off Node
MPI Across all the processors
Many more messages going through the fabric
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
OpenMP
• 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
• www.openmp.org
• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.
OpenMP
• All processors can access all the memory in the parallel system
• Parallel execution is achieved by generating threads which execute in parallel
• The overhead for SMP parallelization is large (100-200 µsec): the parallel work construct must be significant enough to overcome this overhead
OpenMP
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements are executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread
OpenMP
How is OpenMP typically used?
• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.
• Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!
OpenMP Loop Parallelization

!$OMP PARALLEL DO
do i=0,ilong
do k=1,kshort
...
end do
end do
#pragma omp parallel for
for(i=0; i <= ilong; i++)
for(k=1; k <= kshort; k++) {
...
}
Variable Scoping
• The most difficult part of shared-memory parallelization
  – What memory is shared
  – What memory is private - each thread has its own copy
• Compare MPI: all variables are private
• Variables are shared by default, except:
  – loop indices
  – scalars that are set and then used in the loop
(see the scoping sketch below)
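A hedged Fortran sketch of explicit scoping (variable names invented): the loop index is private automatically, the temporary is made private explicitly, the arrays stay shared, and the accumulated sum is handled with a REDUCTION clause rather than left as a shared scalar.

      program scoping_demo
      implicit none
      integer n, i
      parameter ( n = 1000 )
      real*8 a(n), b(n), tmp, total
      b = 1.0d0
      total = 0.0d0
! i and tmp get one copy per thread; a and b are shared;
! total is combined safely across threads by the reduction
!$OMP PARALLEL DO PRIVATE(tmp) REDUCTION(+:total)
      do i = 1, n
         tmp = 2.0d0*b(i)
         a(i) = tmp
         total = total + tmp
      end do
!$OMP END PARALLEL DO
      print *, total
      end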
How Does Sharing Work?

THREAD 1:
increment(x)
{
    x = x + 1;
}

10 LOAD  A, (x address)
20 ADD   A, 1
30 STORE A, (x address)

THREAD 2:
increment(x)
{
    x = x + 1;
}

10 LOAD  A, (x address)
20 ADD   A, 1
30 STORE A, (x address)

Shared x is initially 0. Because the two LOAD-ADD-STORE sequences can interleave, the result could be 1 or 2: synchronization is needed (see the sketch below).
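A hedged sketch of one fix in OpenMP Fortran: the !$OMP ATOMIC directive makes each read-modify-write of the shared counter indivisible (a REDUCTION clause would work equally well for this case).

      program atomic_demo
      implicit none
      integer x, i
      x = 0
!$OMP PARALLEL DO
      do i = 1, 1000
! without ATOMIC, two threads could load the same old value of x
!$OMP ATOMIC
         x = x + 1
      end do
!$OMP END PARALLEL DO
      print *, ' x = ', x
      end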
False Sharing
[Diagram: a cache line holding eight consecutive array elements (0-7), with Processor 1 and Processor 2 each holding a copy of the same block in cache (cache line, address tag, block)]
!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      end do
!$OMP END PARALLEL DO

Say A(1-5) starts on a cache line: then some of A(6-10) also sits on that first cache line, so writes by the two threads keep invalidating each other's copy of the line and it ping-pongs between the processors (one hedged way to avoid this is sketched below).
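One hedged way to avoid this (a sketch, not from the slides) is to hand each thread chunks that span whole cache lines: 16 real*8 values cover one 128-byte POWER3 line, so with a chunked static schedule the threads no longer write into the same line (assuming the array starts on a line boundary).

      program no_false_sharing
      implicit none
      integer n, i
      parameter ( n = 1024 )
      real*8 a(n)
! chunks of 16 elements = 128 bytes = one cache line per chunk
!$OMP PARALLEL DO SCHEDULE(STATIC,16)
      do i = 1, n
         a(i) = dble(i)
      end do
!$OMP END PARALLEL DO
      print *, a(1), a(n)
      end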
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Hybrid MPI-OpenMP?
• To optimize performance on "mixed-mode" hardware like the SP
• MPI is used for inter-node communication, and OpenMP is used for intra-node parallelism
  – threads have lower latency
  – threads can alleviate the network contention of a pure MPI implementation
Hybrid MPI-OpenMP?
• Unless you are forced against your will, for the hybrid model to be worthwhile:
  – There has to be obvious parallelism to exploit
  – The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
  – It has to promise to perform at least as well as the equivalent all-MPI program
• Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
  – especially true of applications with a single level of parallelism
Hybrid Scenario
• Thread the computational portions of the code that exist between MPI calls
• MPI calls are "single-threaded" and therefore use only a single CPU
• Assumes:
  – the application has two natural levels of parallelism
  – or that, in breaking up an MPI code with one level of parallelism, communication between the resulting threads is little or none
(a sketch of this pattern follows below)
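A hedged sketch of that scenario (array and variable names invented): OpenMP threads share the compute loop inside each MPI process, while the MPI calls are made by a single thread outside the parallel region.

      program hybrid_demo
      implicit none
      include "mpif.h"
      integer n, i, My_Id, Ierr
      parameter ( n = 100000 )
      real*8 a(n), part, total
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      a = dble ( My_Id + 1 )
      part = 0.0d0
! OpenMP threads split the work within this MPI process
!$OMP PARALLEL DO REDUCTION(+:part)
      do i = 1, n
         part = part + a(i)*a(i)
      end do
!$OMP END PARALLEL DO
! back on a single thread: combine per-process results across the machine
      call MPI_REDUCE ( part, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, Ierr )
      if ( My_Id .eq. 0 ) print *, ' global sum = ', total
      call MPI_FINALIZE ( Ierr )
      end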
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
MPI-IO
• Part of MPI-2
• Resulted from work at IBM Research exploring the analogy between I/O and message passing
• See "Using MPI-2", by Gropp et al. (MIT Press)
• A small write sketch follows below
[Diagram: several processes each write their piece of an array from memory into one shared file]
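A hedged sketch of the basic MPI-IO pattern (file name and sizes invented): every process opens the same file and writes its own block at a byte offset computed from its rank.

      program mpiio_demo
      implicit none
      include "mpif.h"
      integer n, My_Id, fh, Ierr
      parameter ( n = 1000 )
      real*8 a(n)
      integer(kind=MPI_OFFSET_KIND) offset
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      a = dble ( My_Id )
! all processes open the same file collectively
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'out.dat', MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, Ierr )
! each rank writes its n real*8 values at a rank-dependent byte offset
      offset = My_Id * n * 8
      call MPI_FILE_WRITE_AT ( fh, offset, a, n, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, Ierr )
      call MPI_FILE_CLOSE ( fh, Ierr )
      call MPI_FINALIZE ( Ierr )
      end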
Conclusion
• Don't forget uni-processor optimization
• If you choose one parallel programming API, choose MPI
• Mixed MPI-OpenMP may be appropriate in certain cases
  – More work is needed here
• The remote memory access model may be the answer