Programming the IBM Power3 SP
Eric Aubanel
Advanced Computational Research Laboratory
Faculty of Computer Science, UNB
Advanced Computational Research Laboratory
• High Performance Computational Problem-Solving and Visualization Environment
• Computational Experiments in multiple disciplines: CS, Science and Eng.
• 16-Processor IBM SP3
• Member of C3.ca Association, Inc. (http://www.c3.ca)
Advanced Computational Research Laboratory
www.cs.unb.ca/acrl
• Virendra Bhavsar, Director
• Eric Aubanel, Research Associate & Scientific Computing Support
• Sean Seeley, System Administrator
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
POWER chip: 1990 to 2003
1990: POWER
– Performance Optimized With Enhanced RISC
– Reduced Instruction Set Computer (RISC)
– Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate of 2 x MHz
– Initially: 25 MHz (50 MFLOPS) and a 64 KB data cache
POWER chip: 1990 to 2003
1991: SP1
– IBM's first SP (Scalable POWERparallel)
– Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
– Parallel Environment & system software
POWER chip: 1990 to 2003
1993: POWER2
– 2 FMA units
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher-bandwidth switch for larger systems
POWER chip: 1990 to 2003
1993: PowerPC
– Support for SMP
1996: P2SC
– POWER2 Super Chip: clock speeds up to 160 MHz
POWER chip: 1990 to 2003
Feb. '99: POWER3
– Combined P2SC & PowerPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including an L2 cache of 1-16 MB
– Instruction & data prefetch
POWER3+ chip: Feb. 2000
• Winterhawk II - 375 MHz
  – 4-way SMP
  – 2 MULT/ADD - 1500 MFLOPS per processor
  – 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
  – 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
  – 1.6 GB/s memory bandwidth
  – 6 GFLOPS per node
• Nighthawk II - 375 MHz
  – 16-way SMP
  – 2 MULT/ADD - 1500 MFLOPS per processor
  – 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
  – 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
  – 14 GB/s memory bandwidth
  – 24 GFLOPS per node
The Clustered SMP
ACRL’s SP: Four 4-way SMPs
Each node has its own copy of the O/S
Processors on the same node are closer than those on different nodes
Power3 Architecture
Power4 - 32 way
• Logical UMA
• SP High Node
• L3 cache shared between all processors on node - 32 MB
• Up to 32 GB main memory
• Each processor: 1.1 GHz
• 140 Gflops total peak
[Diagram: sixteen pairs of processors (2 procs per pair, each pair with private L1 and L2 caches), connected by GX buses]
Going to NUMA
• 32-way GP High node
• Own copy of AIX
• 128+ GFLOPS per high node
• Multiple Federation adapters for scalable inter-node bandwidth
• NUMA up to 256 processors
[Diagram: two SP GP nodes, each running AIX with its own memory, processors/intra-node interconnect and Federation adapters, connected to the Federation switch by up to 16 links per node]
NUMA up to 256 processors - 1.1 Teraflops
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Uni-processor Optimization
• Compiler options: start with -O3 -qstrict, then -O3, -qarch=pwr3
• Cache re-use (see the sketch below)
• Take advantage of the superscalar architecture: give it enough operations per load/store
• Use ESSL - its optimization is already maximally exploited
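A minimal, hedged Fortran sketch of two of these ideas (array names and sizes are invented, not from the slides): keep the innermost loop on the first array index, since Fortran stores columns contiguously and stride-1 access re-uses each cache line, and unroll the reduction loop so the FMA units get several multiply-adds per pair of loads.

      program cache_demo
      implicit none
      integer n, i, j
      parameter ( n = 512 )
      real*8 a(n,n), b(n,n), s
      b = 1.0d0
! good cache re-use: the inner loop walks down a column (stride 1);
! swapping the i and j loops would stride across rows and miss in cache
      do j = 1, n
         do i = 1, n
            a(i,j) = 2.0d0*b(i,j)
         end do
      end do
! feed the superscalar FMA units: two multiply-adds per iteration
      s = 0.0d0
      do j = 1, n
         do i = 1, n, 2
            s = s + a(i,j)*b(i,j)
            s = s + a(i+1,j)*b(i+1,j)
         end do
      end do
      print *, s
      end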
Memory Access Times

          Memory to L2 or L1    L2 to L1           L1 to Registers
Width     16 bytes / 2 cycles   32 bytes/cycle     2 x 8 bytes/cycle
Rate      1.6 GB/s              6.4 GB/s           3.2 GB/s
Latency   ~35 cycles            ~6-7 cycles        1 cycle

Cache line: 128 bytes
L1 cache: 128-way set-associative, 64 KB
L2 cache: 4-way set-associative, 8 MB total (4 x 2 MB)
How to Monitor Performance?
• IBM's hardware monitor: HPMCOUNT
  – Uses hardware counters on the chip
  – Cache & TLB misses, fp ops, load-stores, ...
  – Beta version - available soon on ACRL's SP
HPMCOUNT sample output
! leading dimension of 256, a power of two - causes the heavy TLB miss rate shown below
      real*8 a(256,256),b(256,256),c(256,256)
common a,b,c
do j=1,256
do i=1,256
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores : 0.525 M
Instructions per load/store : 2.749
Cycles per instruction : 2.378
Instructions per cycle : 0.420
Total floating point operations : 0.066 M
Hardware float point rate : 2.749 Mflop/sec
HPMCOUNT sample output
! leading dimension padded to 257 - breaks the power-of-two stride and cuts TLB misses
      real*8 a(257,256),b(257,256),c(257,256)
common a,b,c
do j=1,256
do i=1,257
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores : 0.527 M
Instructions per load/store : 2.749
Cycles per instruction : 1.271
Instructions per cycle : 0.787
Total floating point operations : 0.066 M
Hardware float point rate : 3.525 Mflop/sec
ESSL
• Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers
• Fast! For a 560x560 real*8 matrix multiply:
  – Hand coding: 19 Mflops
  – dgemm: 1.2 GFlops (a call sketch follows below)
• Parallel (threaded and distributed) versions
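A minimal sketch of calling ESSL's dgemm (standard BLAS interface) from Fortran; the 560x560 size simply mirrors the timing above, and linking with -lessl is assumed.

      program matmul_essl
      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
      c = 0.0d0
! C = 1.0*A*B + 0.0*C; 'N' means neither matrix is transposed,
! and n is both the matrix order and the leading dimension
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      print *, c(1,1)
      end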
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
ACRL’s IBM SP
• 4 Winterhawk II nodes
  – 16 processors in total
• Each node has:
  – 1 GB RAM
  – 9 GB (mirrored) disk
  – a switch adapter
• High Performance Switch
• Gigabit Ethernet (1 node)
• Control workstation
• Disk: SSA tower with six 18.2 GB disks
IBM Power3 SP Switch
• Bidirectional multistage interconnection network (MIN)
• 300 MB/sec bidirectional
• 1.2 µsec latency
General Parallel File System
[Diagram: Nodes 2, 3 and 4 each run an application over a GPFS client; Node 1 runs the GPFS server; all sit on RVSD/VSD and communicate over the SP Switch]
ACRL Software
• Operating System: AIX 4.3.3
• Compilers
  – IBM XL Fortran 7.1 (HPF not yet installed)
– VisualAge C for AIX, Version 5.0.1.0
– VisualAge C++ Professional for AIX, Version 5.0.0.0
– IBM Visual Age Java - not yet installed
• Job Scheduler: Loadleveler 2.2
• Parallel Programming Tools
  – IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
• Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 )
• Visualization: OpenDX (not yet installed)
• E-Commerce software (not yet installed)
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Parallel Computing?
• Solve large problems in reasonable time
• Many algorithms are inherently parallel
  – image processing, Monte Carlo
  – simulations (e.g. CFD)
• High-performance computers have parallel architectures
  – commercial off-the-shelf (COTS) components
    • Beowulf clusters
    • SMP nodes
  – improvements in network technology
[Figure: NRL Layered Ocean Model (Naval Research Laboratory) on an IBM Winterhawk II SP]
Parallel Computational Models
• Data Parallelism
  – Parallel program looks like a serial program
    • parallelism is in the data
  – Vector processors
  – HPF
Parallel Computational Models
• Message Passing (MPI)
  – Processes have only local memory but can communicate with other processes by sending & receiving messages
  – Data transfer between processes requires operations to be performed by both processes
  – The communication network is not part of the computational model (hypercube, torus, ...)
[Diagram: one process calls Send, the other calls Receive]
Parallel Computational Models
• Shared Memory (threads)
  – P(osix)threads
  – OpenMP: a higher-level standard
[Diagram: several processes sharing one address space]
Parallel Computational Models
• Remote Memory Operations
  – "One-sided" communication
    • MPI-2, IBM's LAPI
  – One process can access the memory of another without the other's participation, but does so explicitly, not in the same way it accesses local memory
[Diagram: Put and Get operations between two processes]
Parallel Computational Models
• Combined: Message Passing & Threads
  – Driven by clusters of SMPs
  – Leads to software complexity!
[Diagram: several SMP nodes, each with processes sharing an address space, connected by a network]
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Message Passing Interface
• MPI 1.0 standard in 1994
• MPI 1.1 in 1995 - IBM support
• MPI 2.0 in 1997
  – Includes 1.1 but adds new features
    • MPI-IO
    • One-sided communication
    • Dynamic processes
Advantages of MPI
• Universality
• Expressivity
  – Well suited to formulating a parallel algorithm
• Ease of debugging
  – Memory is local
• Performance
  – Explicit association of data with process allows good use of cache
MPI Functionality
• Several modes of point-to-point message passing
  – blocking (e.g. MPI_SEND)
  – non-blocking (e.g. MPI_ISEND)
  – synchronous (e.g. MPI_SSEND)
  – buffered (e.g. MPI_BSEND)
• Collective communication and synchronization
  – e.g. MPI_REDUCE, MPI_BARRIER (a small example follows below)
• User-defined datatypes
• Logically distinct communicator spaces
• Application-level or virtual topologies
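As an illustration of collective communication (a hedged sketch, not from the slides), the program below sums one value from each rank onto rank 0 with MPI_REDUCE:

      program reduce_demo
      implicit none
      include "mpif.h"
      integer My_Id, Numb_of_Procs, Ierr
      real*8 part, total
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, Ierr )
! every rank contributes one value; rank 0 receives the global sum
      part = dble ( My_Id )
      call MPI_REDUCE ( part, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, Ierr )
      if ( My_Id .eq. 0 ) print *, ' sum of ranks = ', total
      call MPI_FINALIZE ( Ierr )
      end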
Simple MPI Example
[Diagram: two processes, My_Id 0 and 1; process 0 prints "This is from MPI process number 0", process 1 prints "This is from MPI processes other than 0"]
Simple MPI Example
      Program Trivial
implicit none
include "mpif.h" ! MPI header file
integer My_Id, Numb_of_Procs, Ierr
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
if ( My_Id .eq. 0 ) then
print *, ' This is from MPI process number ',My_Id
else
print *, ' This is from MPI processes other than 0 ', My_Id
end if
call MPI_FINALIZE ( ierr ) ! bad things happen if you forget ierr
stop
end
MPI Example with send/recv
[Diagram: processes 0 and 1 each send an array to, and receive one from, the other]
MPI Example with send/recv
      Program Simple
implicit none
Include "mpif.h"
Integer My_Id, Other_Id, Nx, Ierr
Parameter ( Nx = 100 )
Real A ( Nx ), B ( Nx )
call MPI_INIT ( Ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
Other_Id = Mod ( My_Id + 1, 2 )
A = My_Id
call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Ierr )
call MPI_FINALIZE ( Ierr )
stop
end
What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD,
status);
/* Processor 1 */
...
MPI_Send(sendbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf,
bufsize,
MPI_CHAR,
partner,
tag,
MPI_COMM_WORLD,
status);
MPI Message Passing Modes
The four send modes map onto underlying protocols:
  Ready        -> ready protocol (the matching receive is assumed to be already posted)
  Standard     -> eager protocol for messages <= the eager limit, rendezvous protocol above it
  Synchronous  -> rendezvous protocol
  Buffered     -> buffered protocol
Default eager limit on the SP is 4 KB (can be raised to 64 KB).
A deadlock-free rewrite of the previous send/recv example is sketched below.
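In the "What Will Happen?" fragment both ranks send first, so the program only completes while the messages fit under the eager limit; once the rendezvous protocol kicks in, both sends block and the code deadlocks. One hedged fix (a sketch, not from the slides) is MPI_SENDRECV, which lets the library pair the two transfers safely for any message size:

      Program Safe
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )
      parameter ( Nx = 100000 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
! send A and receive into B in one call - no deadlock, even above the eager limit
      call MPI_SENDRECV ( A, Nx, MPI_REAL, Other_Id, 0, B, Nx, MPI_REAL, Other_Id, 0, MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      end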
MPI Performance Visualization
• ParaGraph
  – Developed by the University of Illinois
  – Graphical display system for visualizing the behaviour and performance of MPI programs
Message Passing on SMP
[Diagram: the data to send is copied through a buffer on the sending side (Call MPI_SEND), across the memory crossbar or switch, into a buffer and then the received data on the receiving side (Call MPI_RECEIVE)]
export MP_SHARED_MEMORY=yes|no
Shared Memory MPI
MP_SHARED_MEMORY=<yes|no>

                                   Latency (µsec)   Bandwidth (MB/sec)
between 2 nodes                    24               133
same node, MP_SHARED_MEMORY=no     30               80
same node, MP_SHARED_MEMORY=yes    10               270
Message Passing off Node
MPI Across all the processors
Many more messages going through the fabric
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
OpenMP
• 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
• www.openmp.org
• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.
OpenMP
• All processors can access all the memory in the parallel system
• Parallel execution is achieved by generating threads which execute in parallel
• The overhead for SMP parallelization is large (100-200 µsec): the parallel work construct must be significant enough to overcome this overhead
OpenMP
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements are executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread
OpenMP
How is OpenMP typically used?
• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.
• Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!
OpenMP Loop Parallelization

!$OMP PARALLEL DO
do i=0,ilong
do k=1,kshort
...
end do
end do
#pragma omp parallel for
for(i=0; i <= ilong; i++)
for(k=1; k <= kshort; k++) {
...
}
Variable Scoping
• The most difficult part of shared-memory parallelization
  – What memory is shared
  – What memory is private - each thread has its own copy
• Compare MPI: all variables are private
• Variables are shared by default, except:
  – loop indices
  – scalars that are set and then used in the loop
(see the scoping sketch below)
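A hedged Fortran sketch of explicit scoping (variable names invented): the loop index is private automatically, the temporary is made private explicitly, the arrays stay shared, and the accumulated sum is handled with a REDUCTION clause rather than left as a shared scalar.

      program scoping_demo
      implicit none
      integer n, i
      parameter ( n = 1000 )
      real*8 a(n), b(n), tmp, total
      b = 1.0d0
      total = 0.0d0
! i and tmp get one copy per thread; a and b are shared;
! total is combined safely across threads by the reduction
!$OMP PARALLEL DO PRIVATE(tmp) REDUCTION(+:total)
      do i = 1, n
         tmp = 2.0d0*b(i)
         a(i) = tmp
         total = total + tmp
      end do
!$OMP END PARALLEL DO
      print *, total
      end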
How Does Sharing Work?

THREAD 1:
increment(x)
{
    x = x + 1;
}

10 LOAD  A, (x address)
20 ADD   A, 1
30 STORE A, (x address)

THREAD 2:
increment(x)
{
    x = x + 1;
}

10 LOAD  A, (x address)
20 ADD   A, 1
30 STORE A, (x address)

Shared x is initially 0. Because the two LOAD-ADD-STORE sequences can interleave, the result could be 1 or 2: synchronization is needed (see the sketch below).
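A hedged sketch of one fix in OpenMP Fortran: the !$OMP ATOMIC directive makes each read-modify-write of the shared counter indivisible (a REDUCTION clause would work equally well for this case).

      program atomic_demo
      implicit none
      integer x, i
      x = 0
!$OMP PARALLEL DO
      do i = 1, 1000
! without ATOMIC, two threads could load the same old value of x
!$OMP ATOMIC
         x = x + 1
      end do
!$OMP END PARALLEL DO
      print *, ' x = ', x
      end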
False Sharing
[Diagram: a cache line holding eight consecutive array elements (0-7), with Processor 1 and Processor 2 each holding a copy of the same block in cache (cache line, address tag, block)]
!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      end do
!$OMP END PARALLEL DO

Say A(1-5) starts on a cache line: then some of A(6-10) also sits on that first cache line, so writes by the two threads keep invalidating each other's copy of the line and it ping-pongs between the processors (one hedged way to avoid this is sketched below).
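One hedged way to avoid this (a sketch, not from the slides) is to hand each thread chunks that span whole cache lines: 16 real*8 values cover one 128-byte POWER3 line, so with a chunked static schedule the threads no longer write into the same line (assuming the array starts on a line boundary).

      program no_false_sharing
      implicit none
      integer n, i
      parameter ( n = 1024 )
      real*8 a(n)
! chunks of 16 elements = 128 bytes = one cache line per chunk
!$OMP PARALLEL DO SCHEDULE(STATIC,16)
      do i = 1, n
         a(i) = dble(i)
      end do
!$OMP END PARALLEL DO
      print *, a(1), a(n)
      end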
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Hybrid MPI-OpenMP?
• To optimize performance on "mixed-mode" hardware like the SP
• MPI is used for inter-node communication, and OpenMP is used for intra-node parallelism
  – threads have lower latency
  – threads can alleviate the network contention of a pure MPI implementation
Hybrid MPI-OpenMP?
• Unless you are forced against your will, for the hybrid model to be worthwhile:
  – There has to be obvious parallelism to exploit
  – The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
  – It has to promise to perform at least as well as the equivalent all-MPI program
• Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
  – especially true of applications with a single level of parallelism
Hybrid Scenario
• Thread the computational portions of the code that exist between MPI calls
• MPI calls are "single-threaded" and therefore use only a single CPU
• Assumes:
  – the application has two natural levels of parallelism
  – or that, in breaking up an MPI code with one level of parallelism, communication between the resulting threads is little or none
(a sketch of this pattern follows below)
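A hedged sketch of that scenario (array and variable names invented): OpenMP threads share the compute loop inside each MPI process, while the MPI calls are made by a single thread outside the parallel region.

      program hybrid_demo
      implicit none
      include "mpif.h"
      integer n, i, My_Id, Ierr
      parameter ( n = 100000 )
      real*8 a(n), part, total
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      a = dble ( My_Id + 1 )
      part = 0.0d0
! OpenMP threads split the work within this MPI process
!$OMP PARALLEL DO REDUCTION(+:part)
      do i = 1, n
         part = part + a(i)*a(i)
      end do
!$OMP END PARALLEL DO
! back on a single thread: combine per-process results across the machine
      call MPI_REDUCE ( part, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, Ierr )
      if ( My_Id .eq. 0 ) print *, ' global sum = ', total
      call MPI_FINALIZE ( Ierr )
      end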
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
MPI-IO
• Part of MPI-2
• Resulted from work at IBM Research exploring the analogy between I/O and message passing
• See "Using MPI-2", by Gropp et al. (MIT Press)
• A small write sketch follows below
[Diagram: several processes each write their piece of an array from memory into one shared file]
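A hedged sketch of the basic MPI-IO pattern (file name and sizes invented): every process opens the same file and writes its own block at a byte offset computed from its rank.

      program mpiio_demo
      implicit none
      include "mpif.h"
      integer n, My_Id, fh, Ierr
      parameter ( n = 1000 )
      real*8 a(n)
      integer(kind=MPI_OFFSET_KIND) offset
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      a = dble ( My_Id )
! all processes open the same file collectively
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'out.dat', MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, Ierr )
! each rank writes its n real*8 values at a rank-dependent byte offset
      offset = My_Id * n * 8
      call MPI_FILE_WRITE_AT ( fh, offset, a, n, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, Ierr )
      call MPI_FILE_CLOSE ( fh, Ierr )
      call MPI_FINALIZE ( Ierr )
      end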
Conclusion
• Don't forget uni-processor optimization
• If you choose one parallel programming API, choose MPI
• Mixed MPI-OpenMP may be appropriate in certain cases
  – More work is needed here
• The remote memory access model may be the answer