Parallel Computing Explained About the IBM Regatta P690

Slides Prepared from the CI-Tutor Courses at NCSA (http://ci-tutor.ncsa.uiuc.edu/) by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University


Transcript of Parallel Computing Explained About the IBM Regatta P690

Page 1: Parallel Computing Explained About the  IBM  Regatta P690

Slides Prepared from the CI-Tutor Courses at NCSA

http://ci-tutor.ncsa.uiuc.edu/

By S. Masoud Sadjadi

School of Computing and Information Sciences

Florida International University

March 2009

Parallel Computing Explained

About the IBM Regatta P690

Page 2: Parallel Computing Explained About the  IBM  Regatta P690

Agenda

1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

Page 3: Parallel Computing Explained About the  IBM  Regatta P690

About the IBM Regatta P690

To obtain your program’s top performance, it is important to understand the architecture of the computer system on which the code runs.

This chapter describes the architecture of NCSA's IBM p690.

Technical details on the size and design of the processors, memory, cache, and the interconnect network are covered along with technical specifications for the compute rate, memory size and speed, and interconnect bandwidth.

Page 4: Parallel Computing Explained About the  IBM  Regatta P690

IBM p690 General Overview

The p690 is IBM's latest Symmetric Multi-Processor (SMP) machine with Distributed Shared Memory (DSM). This means that memory is physically distributed and logically shared. It is based on the Power4 architecture and is a successor to the Power3-II based RS/6000 SP system.

IBM p690 Scalability

The IBM p690 is a flexible, modular, and scalable architecture. It scales in these terms:

Number of processors
Memory size
I/O and memory bandwidth
Interconnect bandwidth

Page 5: Parallel Computing Explained About the  IBM  Regatta P690

Agenda

9 About the IBM Regatta P690

9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks

9.2.1 Power4 Core
9.2.2 Multi-Chip Modules
9.2.3 The Processor
9.2.4 Cache Architecture
9.2.5 Memory Subsystem

9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

Page 6: Parallel Computing Explained About the  IBM  Regatta P690

IBM p690 Building Blocks

An IBM p690 system is built from a number of fundamental building blocks. The first of these building blocks is the Power4 Core, which includes the processors and the L1 and L2 caches.

At NCSA, four of these Power4 Cores are linked to form a Multi-Chip Module, which also includes the L3 cache. Four Multi-Chip Modules are in turn linked to form a 32-processor system (see figure on the next slide).

Each of these components will be described in the following sections.

Page 7: Parallel Computing Explained About the  IBM  Regatta P690

32-processor IBM p690 configuration (Image courtesy of IBM)

Page 8: Parallel Computing Explained About the  IBM  Regatta P690

Power4 Core

The Power4 Chip contains:

Two processors
Local caches (L1), one set per processor
A secondary cache (L2), external to the processors and shared between them
I/O and Interconnect interfaces

Page 9: Parallel Computing Explained About the  IBM  Regatta P690

The POWER4 chip (Image courtesy of IBM)

Page 10: Parallel Computing Explained About the  IBM  Regatta P690

Multi-Chip Modules

Four Power4 Chips are assembled to form a Multi-Chip Module (MCM) that contains 8 processors. Each MCM also supports the L3 cache for each Power4 chip.

Multiple MCM interconnection (Image courtesy of IBM)
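The building-block hierarchy described so far reduces to simple arithmetic. A minimal Python sketch, using only the counts stated in these slides (two processors per Power4 chip, four chips per MCM, four MCMs per system); the constant and function names are illustrative, not IBM terminology:

```python
# Counts taken from the slides; all names here are illustrative only.
PROCESSORS_PER_CHIP = 2   # each Power4 chip holds two processors
CHIPS_PER_MCM = 4         # four Power4 chips form one Multi-Chip Module
MCMS_PER_SYSTEM = 4       # four MCMs form the full system at NCSA


def processors_per_mcm() -> int:
    """Processors contained in one Multi-Chip Module."""
    return PROCESSORS_PER_CHIP * CHIPS_PER_MCM


def processors_per_system() -> int:
    """Processors in the full four-MCM configuration."""
    return processors_per_mcm() * MCMS_PER_SYSTEM


if __name__ == "__main__":
    print("processors per MCM:   ", processors_per_mcm())
    print("processors per system:", processors_per_system())
```

The results agree with the 8 processors per MCM and the 32-processor configuration quoted in the slides.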

Page 11: Parallel Computing Explained About the  IBM  Regatta P690

The Processor

The processors at the heart of the Power4 Core are speculative, superscalar, out-of-order execution chips. The Power4 is a 4-way superscalar RISC architecture running instructions on its 8 pipelined execution units.

Speed of the Processor

The NCSA IBM p690 has CPUs running at 1.3 GHz.

64-Bit Processor Execution Units

There are 8 independent fully pipelined execution units:

2 load/store units for memory access
2 identical floating point execution units capable of fused multiply/add
2 fixed point execution units
1 branch execution unit
1 logic operation unit

Page 12: Parallel Computing Explained About the  IBM  Regatta P690

The Processor

Per cycle, the processor can perform 4 floating point operations, fetch 8 instructions, and complete 5 instructions.

It is capable of handling up to 200 in-flight instructions.

Performance Numbers

Peak Performance:

4 floating point operations per cycle
1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS

MIPS Rating:

5 instructions per cycle
1.3 Gcycles/sec * 5 instructions/cycle yields 6500 MIPS
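The peak figures follow directly from the clock rate. The sketch below reproduces the arithmetic in Python. It assumes that the 4 floating point operations per cycle come from the two floating point units each issuing one fused multiply/add, counted as two operations; that is the usual reading, but the slides do not spell it out. Note that 1.3 × 10^9 cycles/sec × 5 instructions/cycle is 6.5 × 10^9 instructions/sec, which is 6500 MIPS. The whole-system figure is simply the per-processor peak multiplied by the 32 processors of the NCSA configuration. All names are illustrative.

```python
CLOCK_HZ = 1.3e9                # 1.3 GHz clock rate quoted for NCSA's p690
FPUS_PER_PROCESSOR = 2          # two floating point execution units
FLOPS_PER_FMA = 2               # ASSUMPTION: one fused multiply/add = 2 operations
INSTR_COMPLETED_PER_CYCLE = 5   # instructions completed per cycle
PROCESSORS_PER_SYSTEM = 32      # 32-processor configuration


def peak_gflops_per_processor() -> float:
    """Theoretical peak floating point rate of one processor, in GFLOPS."""
    flops_per_cycle = FPUS_PER_PROCESSOR * FLOPS_PER_FMA
    return CLOCK_HZ * flops_per_cycle / 1e9


def peak_mips_per_processor() -> float:
    """Theoretical peak instruction rate of one processor, in MIPS."""
    return CLOCK_HZ * INSTR_COMPLETED_PER_CYCLE / 1e6


def peak_gflops_system() -> float:
    """Theoretical peak of the whole system, ignoring all memory effects."""
    return peak_gflops_per_processor() * PROCESSORS_PER_SYSTEM


if __name__ == "__main__":
    print(f"per processor: {peak_gflops_per_processor():.1f} GFLOPS, "
          f"{peak_mips_per_processor():.0f} MIPS")
    print(f"32 processors: {peak_gflops_system():.1f} GFLOPS")
```

These are upper bounds: a code reaches them only if every cycle issues two fused multiply/adds, which real loops rarely sustain.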

Instruction Set

The instruction set architecture (ISA) on the IBM p690 is the PowerPC AS instruction set.

Page 13: Parallel Computing Explained About the  IBM  Regatta P690

Cache Architecture

Each Power4 Core has both a primary (L1) cache associated with each processor and a secondary (L2) cache shared between the two processors. In addition, each Multi-Chip Module has an L3 cache.

Level 1 Cache

The Level 1 cache is in the processor core. It has split instruction and data caches.

L1 Instruction Cache

The properties of the Instruction Cache are:

64KB in size
direct mapped
cache line size is 128 bytes

L1 Data Cache

The properties of the L1 Data Cache are:

32KB in size
2-way set associative
FIFO replacement policy
2-way interleaved
cache line size is 128 bytes

Peak speed is achieved when the data accessed in a loop is entirely contained in the L1 data cache.
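The L1 parameters above determine how many lines and sets each cache has, and therefore which addresses compete for the same set. The following Python sketch works this out. It assumes that 1 KB means 1024 bytes and that the set is chosen by plain modulo indexing on the line address; the slides do not describe the actual indexing function, so treat `set_index` as a textbook model rather than the Power4 hardware behavior.

```python
KIB = 1024          # ASSUMPTION: "KB" on the slides means 1024 bytes
LINE_BYTES = 128    # L1 cache line size from the slides


def cache_geometry(size_bytes: int, line_bytes: int, ways: int) -> tuple:
    """Return (number of cache lines, number of sets)."""
    lines = size_bytes // line_bytes
    return lines, lines // ways


def set_index(address: int, line_bytes: int, sets: int) -> int:
    """Set selected by a byte address, ASSUMING plain modulo indexing."""
    return (address // line_bytes) % sets


# L1 data cache: 32KB, 2-way set associative.
L1D_LINES, L1D_SETS = cache_geometry(32 * KIB, LINE_BYTES, ways=2)
# L1 instruction cache: 64KB, direct mapped (1 way).
L1I_LINES, L1I_SETS = cache_geometry(64 * KIB, LINE_BYTES, ways=1)

# Number of 8-byte doubles that fit in the L1 data cache at once.
MAX_DOUBLES_IN_L1D = (32 * KIB) // 8

if __name__ == "__main__":
    print("L1 data cache: ", L1D_LINES, "lines in", L1D_SETS, "sets")
    print("L1 instr cache:", L1I_LINES, "lines in", L1I_SETS, "sets")
    print("doubles that fit in L1 data cache:", MAX_DOUBLES_IN_L1D)
    # Addresses one "way size" apart select the same set under this model.
    way_bytes = L1D_SETS * LINE_BYTES
    for base in (0, way_bytes, 2 * way_bytes):
        print(f"address {base:6d} -> set {set_index(base, LINE_BYTES, L1D_SETS)}")
```

Under this model, three data streams whose addresses differ by multiples of 16 KB all select the same set of a 2-way cache, so they can evict one another even when the loop's total working set is far smaller than 32KB.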

Page 14: Parallel Computing Explained About the  IBM  Regatta P690

Cache Architecture

Level 2 Cache on the Power4 Chip

When the processor can't find a data element in the L1 cache, it looks in the L2 cache. The properties of the L2 Cache are:

external to the processor
unified instruction and data cache
1.41MB per Power4 chip (2 processors)
8-way set associative
split between 3 controllers
cache line size is 128 bytes
pseudo LRU replacement policy for cache coherence
124.8 GB/s peak bandwidth from L2

Page 15: Parallel Computing Explained About the  IBM  Regatta P690

Cache Architecture

Level 3 Cache on the Multi-Chip Module

When the processor can't find a data element in the L2 cache, it looks in the L3 cache. The properties of the L3 Cache are:

external to the Power4 Core
unified instruction and data cache
128MB per Multi-Chip Module (8 processors)
8-way set associative
cache line size is 512 bytes
55.5 GB/s peak bandwidth from L2
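When sizing loops and arrays, it helps to know the smallest cache level that could hold a given working set. The sketch below is a rough Python classifier built from the capacities quoted in the slides. It assumes 1 MB = 2^20 bytes and deliberately ignores that the L2 is shared by 2 processors, that the L3 is shared by 8, and that both also hold instructions, so the real usable capacity per processor is smaller. The function name is hypothetical.

```python
MIB = 1024 * 1024   # ASSUMPTION: "MB" on the slides means 2**20 bytes

# (level name, capacity in bytes), smallest first; sizes from the slides.
CACHE_LEVELS = [
    ("L1", 32 * 1024),          # L1 data cache, per processor
    ("L2", int(1.41 * MIB)),    # per Power4 chip, shared by 2 processors
    ("L3", 128 * MIB),          # per Multi-Chip Module, shared by 8 processors
]


def smallest_level_that_fits(working_set_bytes: int) -> str:
    """Name the smallest cache level whose capacity covers the working set.

    Optimistic estimate: assumes the working set has the cache to itself.
    """
    for name, capacity in CACHE_LEVELS:
        if working_set_bytes <= capacity:
            return name
    return "memory"


if __name__ == "__main__":
    for n_doubles in (4_096, 100_000, 10_000_000, 100_000_000):
        size = n_doubles * 8  # 8 bytes per double precision value
        print(f"{n_doubles:>11,d} doubles ({size:>11,d} bytes): "
              f"{smallest_level_that_fits(size)}")
```

A loop over 4,096 doubles can in principle run entirely out of the L1 data cache, while an array of 100 million doubles exceeds even the 128MB L3 and must stream from main memory.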

Page 16: Parallel Computing Explained About the  IBM  Regatta P690

Memory Subsystem

The total memory is physically distributed among the Multi-Chip Modules of the p690 system (see the diagram in the next slide).

Memory Latencies

The latency penalties for each of the levels of the memory hierarchy are:

L1 Cache - 4 cycles
L2 Cache - 14 cycles
L3 Cache - 102 cycles
Main Memory - 400 cycles
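To get a feel for these latencies, they can be converted to nanoseconds at the 1.3 GHz clock rate and combined into an average cost per access. The Python sketch below does both. It treats each quoted latency as the total cost of an access satisfied at that level, which is an assumption about how the figures were measured. The access mix in the example is invented purely for illustration and is not a measurement from the p690.

```python
CLOCK_GHZ = 1.3  # processor clock rate quoted for NCSA's p690

# Latency in cycles for an access satisfied at each level (from the slides).
LATENCY_CYCLES = {"L1": 4, "L2": 14, "L3": 102, "memory": 400}


def cycles_to_ns(cycles: float) -> float:
    """Convert processor cycles to nanoseconds at CLOCK_GHZ."""
    return cycles / CLOCK_GHZ


def average_access_cycles(mix: dict) -> float:
    """Average cycles per access for a given mix.

    `mix` maps each level name to the fraction of accesses satisfied there;
    the fractions must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(LATENCY_CYCLES[level] * frac for level, frac in mix.items())


if __name__ == "__main__":
    for level, cycles in LATENCY_CYCLES.items():
        print(f"{level:>6}: {cycles:3d} cycles = {cycles_to_ns(cycles):6.1f} ns")

    # HYPOTHETICAL access mix, for illustration only.
    mix = {"L1": 0.90, "L2": 0.07, "L3": 0.02, "memory": 0.01}
    avg = average_access_cycles(mix)
    print(f"average: {avg:.2f} cycles = {cycles_to_ns(avg):.2f} ns per access")
```

Even with only 1% of accesses going to main memory in this invented mix, those accesses contribute 4 of the roughly 10.6 average cycles, which is why keeping the working set in cache matters so much.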

Page 17: Parallel Computing Explained About the  IBM  Regatta P690

Memory distribution within an MCM

Page 18: Parallel Computing Explained About the  IBM  Regatta P690

Agenda

9 About the IBM Regatta P690

9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

Page 19: Parallel Computing Explained About the  IBM  Regatta P690

Features Performed by the Hardware

The following is done completely by the hardware, transparent to the user:

Global memory addressing (makes the system memory shared)
Address resolution
Maintaining cache coherency
Automatic page migration from remote to local memory (to reduce interconnect memory transactions)

Page 20: Parallel Computing Explained About the  IBM  Regatta P690

The Operating System

The operating system is AIX. NCSA's p690 system is currently running version 5.1 of AIX. Version 5.1 is a full 64-bit file system.

Compatibility

AIX 5.1 is highly compatible with both BSD and System V Unix.

Page 21: Parallel Computing Explained About the  IBM  Regatta P690

Further Information

Computer Architecture: A Quantitative Approach, John Hennessy, et al., Morgan Kaufmann Publishers, 2nd Edition, 1996
Computer Organization and Design: The Hardware/Software Interface, David A. Patterson, et al., Morgan Kaufmann Publishers, 2nd Edition, 1997
IBM P Series [595] at the URL: http://www-03.ibm.com/systems/p/hardware/highend/590/index.html
IBM p690 Documentation at NCSA at the URL: http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/