North Carolina Supercomputing Center NCSC Introduction to the Origin2400.

North Carolina Supercomputing Center

NCSC

Introduction to the Origin2400


Course Outline

Origin2400 Architecture

Code development and optimization tools

Cache optimization

User Environment


Memory Types

[Figure: distributed memory (each CPU has its own memory) compared with shared memory (several CPUs attached to common memory)]


Origin2400 Architecture

ccNUMA: cache-coherent non-uniform memory access

Physically distributed, globally addressable memory

Hardware cache coherence

Scalable shared memory systems: easy to program, easy to scale (to a point)

Bus-based shared memory systems: easy to program, hard to scale

Massively parallel distributed memory systems: hard to program, easy to scale


Node

Two R12000 Processors (400MHz)

64MB-4GB memory (1 GB)

Hub (interface)

[Figure: one node, with two processors (P) and memory attached to a Hub]


System Scaling

[Figure: several nodes (Hub, memory, two processors each) connected through a router (R)]


System Scaling

[Figure: larger configuration in which several routers (R), each serving nodes (Hub, memory, two processors), are linked together]


System Scaling

[Figure: further scaled configuration of nodes (Hub, memory, two processors) interconnected by routers (R)]


Origin Node Board

Two R12000 processors

1 GB main memory

Additional directory memory - used for cache coherence

Sockets for extra directory memory for systems with more than 32 processors

Hub interconnect chip

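[Figure: node board block diagram: two R12000 processors, each with its own L2 cache, attached to the Hub; the Hub connects to memory, directory memory (with extra directory memory for more than 32 processors), XIO, and NUMAlink]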


Origin Module

Each router has six connections, two to nodes and four to other routers.

Systems with 32 or fewer processors will have extra router ports available and can use these for “express” links

[Figure: Origin module with Nodes 0-3, Routers 0 and 1, two XBOWs, and NUMAlink and XIO connections]


Cache Coherence

Directory maintains state information for each L2 cache line in memory.

States
  unowned - not cached
  exclusive - 1 r/w copy
  shared - 1+ r/o copies
  poisoned - migrated to another node

Directory includes a bit vector indicating processors with a copy of the cache line
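The directory idea above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual Origin coherence protocol: the class and method names are hypothetical, the poisoned state and all transient states are left out, and a read of an exclusively held line is modeled as a simple downgrade to shared.

```python
from dataclasses import dataclass
from enum import Enum


class LineState(Enum):
    UNOWNED = "unowned"      # not cached anywhere
    EXCLUSIVE = "exclusive"  # one read/write copy
    SHARED = "shared"        # one or more read-only copies


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry for a single cache line."""
    state: LineState = LineState.UNOWNED
    sharers: int = 0  # bit vector: bit i set means processor i holds a copy

    def holders(self):
        """Return the processor numbers whose bits are set."""
        return [p for p in range(self.sharers.bit_length())
                if (self.sharers >> p) & 1]

    def read(self, proc):
        """Processor `proc` obtains a read-only copy."""
        self.sharers |= 1 << proc
        self.state = LineState.SHARED

    def write(self, proc):
        """Processor `proc` obtains the single read/write copy.

        Returns the processors whose copies must be invalidated.
        """
        to_invalidate = [p for p in self.holders() if p != proc]
        self.sharers = 1 << proc
        self.state = LineState.EXCLUSIVE
        return to_invalidate


entry = DirectoryEntry()
entry.read(0)
entry.read(3)
invalidated = entry.write(3)
print(entry.state.value, bin(entry.sharers), invalidated)
```

The bit vector is what lets a write find exactly the processors that hold a copy, so invalidations go only to those processors instead of being broadcast.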

[Figure: two nodes (Hub, memory with directory, two processors with caches) connected by the interconnection network]


Cache Architecture

L1 D-cache is 32 KB, 2-way set associative, LRU, writeback, 8-word (32-byte) lines, non-blocking

L2 cache is 8 MB, 2-way set associative, LRU, writeback, 32-word (128-byte) lines, non-blocking

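[Figure: R12000 memory hierarchy: registers, 32 KB I-cache and 32 KB D-cache, L2 cache, and main memory, with data paths labeled 128, 128, and 64; an L1 miss costs ~10 cycles, an L2 miss ~60+ cycles; memory bandwidth 780 MB/s]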


Translation Lookaside Buffer

TLB is used to translate virtual addresses to physical addresses

R10000 TLB has 64 entries. Each entry can translate addresses for 2 pages (default page size is 16KB)

TLB miss costs about the same as a cache miss and causes similar performance issues.
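The amount of memory the TLB can map at once follows directly from the figures above. The sketch below works through that arithmetic; the function name is hypothetical, and the 1 MB page size is only an illustrative larger value, not a setting taken from these slides.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach_bytes(entries=64, pages_per_entry=2, page_bytes=16 * KB):
    """Memory that can be mapped without a TLB miss."""
    return entries * pages_per_entry * page_bytes


default_reach = tlb_reach_bytes()                  # 16 KB pages
larger_reach = tlb_reach_bytes(page_bytes=1 * MB)  # illustrative 1 MB pages

print(default_reach // MB, "MB with 16 KB pages")
print(larger_reach // MB, "MB with 1 MB pages")
```

With the default page size the TLB covers 2 MB, which is less than the 8 MB L2 cache, so a code can miss in the TLB on data that is still resident in cache.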


Origin2000 Bandwidths and Latencies

Bandwidths in GB/s, latencies in ns

                  Memory bandwidth                      I/O bandwidth     Latency
I/O:Nodes:CPUs    Physical  Peak payload  Peak read                       Max    Ave
1:1:2             0.78      0.68          0.59          1.56    1.25      313    313
2:2:4             1.56      1.37          1.19          3.12    2.5       497    405
2:4:8             3.12      2.73          2.38          6.24    4.99      601    528
4:8:16            6.24      5.47          4.75          12.48   9.98      703    641
8:16:32           12.48     10.94         9.5           24.96   19.97     805    710


R12000 Architecture

Superscalar
  400 MHz clock
  4 instructions/cycle

Cache
  8MB L2 cache
  dedicated cache bus
  interleaved cache access
  non-blocking

Out-of-order Execution
  3 instruction queues

Branch Prediction


R12000 Architecture

Superscalar Architecture

Fetch/decode up to 4 instructions/cycle

Execute up to 4 instructions/cycle from 5 execution units
  Load/store
  ALU1
  ALU2
  FPADD
  FPMUL

Instruction set binary compatible with R8000 and R4000

32-bit and 64-bit instructions

32 integer registers

32 floating-point registers


Instruction Latencies – O2000

Operation                                   Latency    Repeat rate
Load/Store
  load                                      2-3        1
  store                                     1          1
Integer
  ALU1: add, sub, logic, shift, branches    1          1
  ALU2: add, sub, logic                     1          1
  multiply (32-/64-bit)                     6/10       6/10
  divide (32-/64-bit)                       35/67      35/67
Floating Point
  add, compare                              2          1
  multiply                                  2          1
  multiply-add                              4          1
  divide (single/double)                    12/19      14/21
  sqrt (single/double)                      18/33      20/35
  rsqrt (single/double)                     30/52      20/35


Origin2400 Architecture References

www.sgi.com/origin/2000

techpubs.sgi.com


Code Development and Optimization Tools


Code Porting/Optimization Objectives

Get the right answers

Identify resource consuming code sections

Utilize optimized system libraries

Let the compiler do the work


Porting Issues (getting the right answers)

Application Binary Interface (ABI)
  32
  n32 (default/recommended for codes <2 GB total memory)
  64 (required for codes with >2 GB total memory)

Instruction Set Architecture (ISA)
  mips2
  mips3
  mips4 (default)

Defaults are taken from the file /etc/compiler.defaults


Profiling Tools

perfex - overall code performance

SpeedShop - procedure level performance data

dprof - memory access patterns


R12000 Hardware Performance Registers

Can select from 32 events

Two counter registers (can fully count two events per code execution)

0 - cycles

1 - issued instructions

2 - issued loads

3 - issued stores

4 - issued conditionals

5 - failed conditionals

6 - branches resolved

7 - quadwords written back from s-cache

8 - s-cache data errors (ECC)

9 - I-cache misses

10 - L2 cache miss - instruction

11 - instruction misprediction

12 - external interventions

13 - external invalidations

14 - function unit completion cycles

15 - graduated instructions


R12000 Hardware Performance Registers

Each counter can be set to count one of 16 events

counter 0 can count events 0-15

counter 1 can count events 16-31

Counter registers are 32 bit registers. Can be set to generate an interrupt on overflow.

16 - cycles

17 - graduated instructions

18 - graduated loads

19 - graduated stores

20 - graduated store conditionals

21 - graduated floating-point instructions

22 - quadwords written back from d-cache

23 - TLB misses

24 - mispredicted branches

25 - d-cache misses

26 - s-cache misses - data

27 - data misprediction

28 - external intervention s-cache hits

29 - external invalidation s-cache hits

30 - store/prefetch excl to clean block

31 - store/prefetch excl to shared block
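The counter layout described above has two practical consequences that a short Python sketch can make concrete: only one event from each group of 16 can be counted exactly in a single run, and a 32-bit counter fills quickly at processor clock rates. The function names are hypothetical, and the 400 MHz rate assumes the cycle event on the R12000 clock quoted earlier in these slides.

```python
COUNTER_BITS = 32


def counter_for_event(event):
    """Counter register (0 or 1) that counts the given event number."""
    if not 0 <= event <= 31:
        raise ValueError("event number must be in 0..31")
    return event // 16


def can_count_exactly_together(event_a, event_b):
    """Two events can be counted exactly in one run only if they use different counters."""
    return counter_for_event(event_a) != counter_for_event(event_b)


def seconds_until_overflow(events_per_second):
    """Time for a 32-bit counter to wrap at a steady event rate."""
    return 2 ** COUNTER_BITS / events_per_second


print(counter_for_event(26))               # s-cache data misses
print(can_count_exactly_together(0, 26))   # cycles with s-cache data misses
print(can_count_exactly_together(25, 26))  # d-cache with s-cache data misses
print(round(seconds_until_overflow(400e6), 2))
```

At 400 MHz a cycle counter wraps after roughly 10.7 seconds, which is why the interrupt on overflow matters for long runs.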


perfex

No special compilation needed

Can monitor two counters exactly - OR

Can monitor all counters: each event is counted 1/16th of the time, and the values are then multiplied by 16 to approximate full counts

Option to convert counts to estimated times

% perfex -a -y -o data code.x

  -a  All counters
  -y  Estimate times
  -o  Redirect output


perfex Output

Based on 250 MHz IP27 Event definitions for cpu version 3.x

Typical

Event Counter Name Counter Value Time (sec)

=========================================================================================

0 Cycles...................................................... 898600299008 3594.401196

16 Cycles...................................................... 898600299008 3594.401196

26 Secondary data cache misses................................. 7034639424 2124.461106

7 Quadwords written back from scache.......................... 18935563200 484.750418

25 Primary data cache misses................................... 7449172608 268.468181

2 Issued loads................................................ 59030982976 236.123932

14 ALU/FPU forward progress cycles............................. 48181262304 192.725049

18 Graduated loads............................................. 46436171712 185.744687

3 Issued stores............................................... 19988999248 79.955997

22 Quadwords written back from primary data cache.............. 4971802640 76.565761

19 Graduated stores............................................ 18055579056 72.222316

6 Decoded branches............................................ 5225243088 20.900972

21 Graduated floating point instructions....................... 2699848928 10.799396

24 Mispredicted branches....................................... 1033609888 5.870904

9 Primary instruction cache misses............................ 374656 0.027005

Edited for presentation


perfex Output

23 TLB misses.................................................. 1904 0.000519

10 Secondary instruction cache misses.......................... 256 0.000077

4 Issued store conditionals................................... 160 0.000001

20 Graduated store conditionals................................ 32 0.000000

30 Store/prefetch exclusive to clean block in scache........... 32 0.000000

1 Issued instructions......................................... 147707069072 0.000000

5 Failed store conditionals................................... 0 0.000000

8 Correctable scache data array ECC errors.................... 0 0.000000

11 Instruction misprediction from scache way prediction table.. 512 0.000000

12 External interventions...................................... 2525856 0.000000

13 External invalidations...................................... 7415216 0.000000

15 Graduated instructions...................................... 136445826704 0.000000

17 Graduated instructions...................................... 136469377216 0.000000

27 Data misprediction from scache way prediction table......... 804101376 0.000000

28 External intervention hits in scache........................ 1744336 0.000000

29 External invalidation hits in scache........................ 3193680 0.000000

31 Store/prefetch exclusive to shared block in scache.......... 0 0.000000


perfex Output

Statistics

=========================================================================================

Graduated instructions/cycle................................................ 0.151843

Graduated floating point instructions/cycle................................. 0.003005

Graduated loads & stores/cycle.............................................. 0.071769

Graduated loads & stores/floating point instruction......................... 23.887170

Mispredicted branches/Decoded branches...................................... 0.197811

Graduated loads/Issued loads................................................ 0.786641

Graduated stores/Issued stores.............................................. 0.903276

Data mispredict/Data scache hits............................................ 1.939776

Instruction mispredict/Instruction scache hits.............................. 0.001368

L1 Cache Line Reuse......................................................... 7.657572

L2 Cache Line Reuse......................................................... 0.058927

L1 Data Cache Hit Rate...................................................... 0.884494

L2 Data Cache Hit Rate...................................................... 0.055648

Time accessing memory/Total time............................................ 0.737507

Time not making progress (probably waiting on memory) / Total time.......... 0.946382

L1--L2 bandwidth used (MB/s, average per process)........................... 88.449327

Memory bandwidth used (MB/s, average per process)........................... 334.799259

MFLOPS (average per process)................................................ 0.751126

Not good
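Several of the statistics above can be reproduced from the raw counter values in the preceding output. The Python sketch below does that for four of them. The formulas are inferred from the numbers (they match the listing to six digits) rather than taken from perfex documentation, so treat them as an assumption.

```python
CLOCK_HZ = 250e6  # "Based on 250 MHz IP27" in the perfex header

# Counter values copied from the perfex listing, keyed by event number.
counts = {
    0: 898600299008,   # cycles
    15: 136445826704,  # graduated instructions
    18: 46436171712,   # graduated loads
    19: 18055579056,   # graduated stores
    21: 2699848928,    # graduated floating point instructions
    25: 7449172608,    # primary data cache misses
}

seconds = counts[0] / CLOCK_HZ
instructions_per_cycle = counts[15] / counts[0]
l1_hit_rate = 1 - counts[25] / (counts[18] + counts[19])
mflops = counts[21] / seconds / 1e6

print(f"run time           {seconds:.6f} s")
print(f"instructions/cycle {instructions_per_cycle:.6f}")
print(f"L1 data hit rate   {l1_hit_rate:.6f}")
print(f"MFLOPS             {mflops:.6f}")
```

Read this way, the listing shows about 0.15 instructions per cycle on a processor that can issue four, consistent with the large share of time attributed to memory access.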


SpeedShop

No special compilation needed

Provides the following types of profiling
  Program counter sampling
  Ideal time
  User time
  Hardware counter profiling
  Floating-point exception tracing
  Heap tracing


SpeedShop

PC Sampling

Provides estimate of time spent by each function in executable

Two step process:
  execute code with ssrun
  use prof to examine results

% ssrun -pcsamp prog

% prof prog.pcsamp.4324


pcsamp output

Summary of statistical PC sampling data (pcsamp)--

13060: Total samples

130.600: Accumulated time (secs.)

10.0: Time per sample (msecs.)

2: Sample bin width (bytes)

-------------------------------------------------------------------------

Function list, in descending order by time

-------------------------------------------------------------------------

[index] secs % cum.% samples function (dso: file, line)

[1] 58.230 44.6% 44.6% 5823 zaver (prog: prog.f, 69)

[2] 37.490 28.7% 73.3% 3749 yaver (prog: prog.f, 50)

[3] 34.460 26.4% 99.7% 3446 xaver (prog: prog.f, 31)

[4] 0.420 0.3% 100.0% 42 main (prog: prog.f, 1)

130.600 100.0% 100.0% 13060 TOTAL


SpeedShop

Ideal time

Estimates best possible time the code could achieve - by routine

Useful for identifying routines with cache problems

% ssrun -ideal prog

beginning libraries

/usr/lib32/libssrt.so

/usr/lib32/libftn.so

/usr/lib32/libm.so

ending libraries, beginning prog

% prof prog.ideal.3453


ideal output

Summary of ideal time data (ideal)--

23468025764: Total number of instructions executed

26959868891: Total computed cycles

107.839: Total computed execution time (secs.)

1.149: Average cycles / instruction

-------------------------------------------------------------------------

Function list, in descending order by exclusive ideal time

-------------------------------------------------------------------------

[index] excl.secs excl.% cum.% cycles instructions calls function (dso: file, line)

[1] 36.133 33.5% 33.5% 9033236300 7740175400 100 zaver (prog: prog.f, 69)

[2] 35.737 33.1% 66.6% 8934236300 7839175400 100 xaver (prog: prog.f, 31)

[3] 35.737 33.1% 99.8% 8934236300 7839175400 100 yaver (prog: prog.f, 50)

[4] 0.221 0.2% 100.0% 55184326 46134726 1 main (prog: prog.f, 1)

Hundreds more lines of library calls omitted
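One way to use the ideal-time listing is to compare it, routine by routine, with the sampled times from the pcsamp output shown earlier. The Python sketch below does this for the three averaging routines. It assumes both listings come from comparable runs of the same program, which the matching routine names and line numbers suggest but the slides do not state.

```python
# Seconds per routine, copied from the pcsamp and ideal listings.
sampled = {"zaver": 58.230, "yaver": 37.490, "xaver": 34.460}
ideal = {"zaver": 36.133, "xaver": 35.737, "yaver": 35.737}

# A ratio well above 1 means the routine runs much slower than its
# instruction schedule alone would predict.
ratios = {name: sampled[name] / ideal[name] for name in sampled}
worst = max(ratios, key=ratios.get)

for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: sampled/ideal = {ratio:.2f}")
print("largest gap:", worst)
```

Here zaver takes about 1.6 times its ideal time while the other two routines are close to ideal, which marks zaver as the routine to examine for cache problems.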


SpeedShop

Hardware Counter Profiling

prof_hwc
  Counter selected with environment variable _SPEEDSHOP_HWC_COUNTER_NUMBER

Most commonly used counters have experiment names
  gi_hwc - graduated instructions
  cy_hwc - cycles
  ic_hwc - L1 Icache miss
  isc_hwc - L2 Icache miss
  dc_hwc - L1 Dcache miss
  dsc_hwc - L2 Dcache miss
  tlb_hwc - TLB miss
  gfp_hwc - graduated FP instructions


SpeedShop

The -b or -gprof options to prof will generate a dynamic calling tree.

Each procedure is listed with the procedures that call it and the procedures it calls.


WorkShop

One of the WorkShop tools, cvperf, provides a GUI for viewing SpeedShop experiment results


WorkShop

ssusage
  SpeedShop program
  Runs executable and prints resources used
  Useful for finding out memory use
  ssusage mypgm


WorkShop

WorkShop also includes a debugger, cvd

The common UNIX debugger, dbx, is also available


WorkShop

Other WorkShop components include

cvbuild – build dependency analyzer

cvstatic – static source analyzer

WorkShop can be configured to work with a source code revision control system (see cvconfig)

cvpav – parallel analysis for MP Fortran programs


Performance Libraries

fastm
  Fast transcendental library
  Link with -lfastm
  Faster results at the cost of some accuracy
  See man libfastm

SCSL
  Scientific Computing Software Library
  See man intro_scsl and man pages referenced therein
  Signal processing including FFT, correlation, convolution
  LAPACK
  Linear solvers
  Matrix and vector routines


Compilers

MIPSpro Compilers
  CC
  cc
  f90
  f77

Optimizations
  Software pipelining (SWP)
  Inter-procedural analysis (IPA)
  Loop nest optimizations (LNO)


Compilers

-O[n]
  0 => no optimization; use only for debugging (default!)
  1 => simple optimizations
  2 => conservative optimizations, should not alter results
  3 => SWP, LNO, and other aggressive optimizations, may alter results
  fast => -O3 -IPA -OPT:roundoff=3:alias=typed

If just -O is specified, -O2 is invoked


Compilers

-OPT

IEEE_arithmetic=n - conformance with IEEE floating-point arithmetic
  1 (default) compliant
  2 inexact results may differ (not-a-number, infinity)
  3 allows arbitrary, mathematically valid transformations

roundoff=n - acceptable round off altering optimization, 0-3, where 0 is none and 3 is any

alias=n - pointer aliasing model


Compilers

-OPT:alias=<name>

ANY, COMMON_SCALAR
  ANY is default

TYPED, NO_TYPED
  Different base types point to distinct objects

UNNAMED, NO_UNNAMED
  Pointers never point to named objects

RESTRICT, NO_RESTRICT
  Distinct pointers point to distinct, non-overlapping objects

parm, no_parm
  Fortran only

Do not lie to the compiler!


Compilers

Software Pipelining

do i=1,n
   y(i) = y(i) + a*x(i)
enddo

Each loop iteration contains
  2 loads, 1 store
  1 multiply-add
  2 address increments
  Loop end test, branch

Superscalar processor slots
  1 load/store
  1 ALU1, 1 ALU2
  1 FP add
  1 FP multiply


Software Pipelining

[Figure: schedule of one unpipelined loop iteration (load x, load y, x++, madd, store y, y++, branch) on the load/store, ALU1, ALU2, FP add, and FP multiply units over clocks 0-7]

2 flop / 8 cycles achieved

16 flop / 8 cycles peak

Running at 1/8th of peak performance


Software Pipelining

Pipelined daxpy

Load/store is bottleneck

Optimize to fully utilize load/store unit

[Figure: pipelined schedule on the load/store, FP add, and FP multiply units over clocks 0-13]

8 flop / 14 cycles achieved

28 flop / 14 cycles peak

Running better than 1/4th of peak performance
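The fractions of peak quoted on these two slides, and the ceiling the load/store unit imposes, can be checked with a little arithmetic. The Python sketch below assumes peak is 2 flops per cycle (one FP add plus one FP multiply), which is what the "16 flop / 8 cycles" figure implies; the function name is hypothetical.

```python
from fractions import Fraction

PEAK_FLOPS_PER_CYCLE = 2  # one FP add and one FP multiply each cycle


def fraction_of_peak(flops, cycles):
    """Achieved flops as an exact fraction of peak over the same cycles."""
    return Fraction(flops, cycles * PEAK_FLOPS_PER_CYCLE)


unpipelined = fraction_of_peak(2, 8)   # one iteration in 8 cycles
pipelined = fraction_of_peak(8, 14)    # four iterations in 14 cycles

# Each iteration needs 2 loads + 1 store but there is a single
# load/store slot per cycle, so an iteration takes at least 3 cycles.
memory_ops_per_iteration = 3
load_store_bound = fraction_of_peak(2, memory_ops_per_iteration)

print(unpipelined, pipelined, load_store_bound)
```

The pipelined schedule reaches 2/7 of peak, and no schedule of this loop can exceed 1/3 of peak because the load/store unit is saturated first.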


Software Pipelining

Use -O3 to enable pipelining

Vectorizable loops are well suited for pipelining

SWP cannot be done if loop contains
  Function calls
  Complicated conditionals
  Branching

SWP is impeded by
  Recurrences between iterations (can use IVDEP directive)
  Very long loop (split loop)
  Register overflow (split loop)

SWP algorithms are heuristic
  Schedules are not unique
  Finding schedule may be computationally expensive


Inter-Procedural Analysis

Analyzes entire program

Precedes other optimizations

Performs optimizations across procedure boundaries

Invoke with -IPA

Compile step will finish quickly – link step will take much longer

If any procedure changes, the full program must be recompiled


Inlining

IPA provides automatic inlining with preference to
  Small procedures
  Calls in innermost loops
  Leaf routines
  Frequent calls

Manual inlining using command line option -INLINE

Routines must be in same file

Only inlines specified routines


Inlining

Benefits
  Exposes larger context for later optimization
  Eliminates call overhead

Costs
  Longer compile time
  Additional contention for registers
  Larger code size

Restrictions
  no mismatched parameter types
  no static local variables
  no recursive routines


Cache Optimization


L2 Cache Organization

2-way set associative, i.e. each memory address can be in one of 2 different cache lines

Cache line is 128 bytes, e.g. 16 x 8 bytes or 32 x 4 bytes

Least recently used (LRU) replacement strategy

Shared instruction and data cache


Cache Organization

[Figure: a memory address split into cache line and offset fields, with each memory location mapping to a line in Set 0 or Set 1 of the cache]


Cache Basics

Access data with stride one wherever possible

Group data to be used together

Avoid power-of-2 array dimensions


Standard Cache Optimization

Small stride – order loops so that innermost loop has smallest stride

Padding – pad leading dimensions of arrays to prevent overlap in cache and/or add padding between arrays in common blocks

Loop fusion – join small loops to increase cache reuse
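Why padding helps can be seen by counting how many distinct cache sets a strided walk touches. The Python sketch below models only the set index of an 8 MB, 2-way cache with 128-byte lines, as described in these slides. It assumes addresses are used directly as cache indices and that the array starts at address 0, and it is not a cache simulator; the function names are hypothetical.

```python
CACHE_BYTES = 8 * 1024 * 1024
WAYS = 2
LINE_BYTES = 128
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


def set_index(byte_address):
    """Cache set that a byte address maps to."""
    return (byte_address // LINE_BYTES) % NUM_SETS


def distinct_sets(leading_dim, n_cols, elem_bytes=8):
    """Sets touched when walking along one row of a column-major array.

    Consecutive elements of a row are `leading_dim` elements apart.
    """
    stride = leading_dim * elem_bytes
    return len({set_index(col * stride) for col in range(n_cols)})


power_of_two = distinct_sets(leading_dim=2048, n_cols=2048)
padded = distinct_sets(leading_dim=2049, n_cols=2048)
print(power_of_two, padded)
```

With a leading dimension of 2048 the 2048 row elements compete for 256 sets, which hold only 512 lines in a 2-way cache, so lines are evicted before they can be reused. Padding the leading dimension to 2049 spreads the same elements over 2048 different sets.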


Cache Blocking

DO J = 1, N
  DO I = 1, M
    DO K = 1, L
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    ENDDO
  ENDDO
ENDDO

M,N,L    sec       MFLOPS
-----    ------    ------
30       1.6e-4    333.9
200      5.7e-2    282.6
1000     25.4      78.6
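The loop nest above can be restructured so that it works on one block of each matrix at a time, which is the transformation the slide title refers to. The Python sketch below shows the restructuring on plain lists and checks that it gives the same result as the original loop order. The function names and the block size are illustrative choices, not values from these slides, and Python is used only to show the loop structure, not to measure speed.

```python
def matmul_naive(a, b, m, n, l):
    """C = A*B with the same J, I, K loop order as the Fortran above."""
    c = [[0.0] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            for k in range(l):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, m, n, l, bs):
    """Same product, computed one bs x bs block at a time."""
    c = [[0.0] * n for _ in range(m)]
    for jj in range(0, n, bs):
        for kk in range(0, l, bs):
            for ii in range(0, m, bs):
                for j in range(jj, min(jj + bs, n)):
                    for i in range(ii, min(ii + bs, m)):
                        total = c[i][j]
                        for k in range(kk, min(kk + bs, l)):
                            total += a[i][k] * b[k][j]
                        c[i][j] = total
    return c


def mflops(size, seconds):
    """Rate for a size^3 matrix multiply, counting 2 flops per inner iteration."""
    return 2 * size ** 3 / seconds / 1e6


m, n, l = 5, 7, 6
a = [[float(i + k) for k in range(l)] for i in range(m)]
b = [[float(k - j) for j in range(n)] for k in range(l)]
same = matmul_blocked(a, b, m, n, l, bs=3) == matmul_naive(a, b, m, n, l)
print(same, round(mflops(1000, 25.4), 1))
```

Counting 2 flops per inner iteration roughly reproduces the MFLOPS column from the rounded times in the table. The drop at size 1000 is the case blocking targets, since three 1000 x 1000 double precision arrays need about 24 MB and no longer fit in the 8 MB L2 cache.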


TLB Misses

Caused by too few entries for amount of data to be mapped

Increasing the page size allows fixed number of TLB entries to map larger amount of data

IRIX allows two page sizes: 16 KB (default) and one larger page size

dplace command allows selection of a larger page size (see man dplace)


Loop Nest Optimization (LNO)

Improve cache use and instruction scheduling with loop transformations

  Loop interchange
  Padding
  Loop fusion
  Cache blocking
  Prefetching
  Loop unrolling

Run by default with -O3 or -Ofast

Disable with -LNO:opt=0

Endless opportunity to tune each optimization individually with directives and flags


Loop Unrolling

Compiler option -LNO:outer_unroll=n

Directives
  Fortran: c*$* unroll(n)
  C: #pragma unroll(n)


Loop Interchange

Compiler option -LNO:interchange=off

Directives
  C*$* no interchange
  C*$* interchange(i,j,k)
  #pragma no interchange
  #pragma interchange(i,j,k)


Cache Blocking

LNO automatically blocks loop nests to fit cache

To disable
  -LNO:blocking=off
  C*$* no blocking
  #pragma no blocking

Can also provide input to blocking size and cache model (see man LNO)

Disable
  If loop nest already fits in cache (to save blocking overhead)
  If blocking is causing poor performance


Padding

LNO automatically pads locally allocated arrays

For -O3 and -Ofast, LNO automatically pads common blocks

Each routine containing common must be compiled with same option

Code must not violate FORTRAN standard

Disable common block padding

-OPT:reorg_common=off


Single Processor Tuning Summary

Use perfex and SpeedShop to analyze code

Choose best ISA and ABI (-mips4 -n32)

Use optimized libraries -lfastm -lscs

Inline small procedures or use IPA automatic inlining

Check compiler messages for time consuming loops; may be able to improve with -OPT or -LNO directives

Minimize cache and TLB misses

Use stride one memory accesses (or smallest possible)

Avoid power-of-2 array dimensions

Increase page size to reduce TLB misses


NCSC Origin2400 User Environment


System

48 400 MHz R12000 processors

8MB L2 cache/processor

24 GB memory

> 1 TB fast local disks

sonoma.ncsc.org


Storage

Home directory
  Fairly small quota (100 MB)

/tmp
  Temporary storage for executing jobs
  Not backed up
  Periodic purge

/dmf
  Mass storage system
  Local to sonoma
  dmls
  dmget
  Each user has a dmf account


Interactive Jobs

Interactive limits are imposed using software developed at NCSA

Interactive limits are
  30 CPU minutes
  512 MB memory
  4 processors
  Subject to change


Batch Jobs

Jobs too large to be run interactively must be submitted to the batch system

NQE is the current batch system

Create a batch request script using your favorite editor (emacs is a good choice, but jot and vi are also available)

Use the qsub command to submit the job to the batch queue

Request resources needed for the job:

CPU time Memory Processors


Batch Request Script

Text File

Execution will begin in your home directory

Will execute your environment files by default

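#QSUB -lT 7200
#QSUB -lM 1024mb
#QSUB -l mpp_p=8
# use 8 OpenMP threads, matching the 8 processors requested
setenv OMP_NUM_THREADS 8
setenv OMP_DYNAMIC false
# run in temporary storage
cd /tmp/user
cp ~/executable .
./executable
# save results to mass storage, then clean up
mv results /dmf/edu/user
rm *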


NQE

qsub

qsub -lT 7200 -lM 1024MB \
     -l mpp_p=8 script.q

qstat -au $user

qdel <xxxxx>

qdel -k <xxxxx>

qs

qstat -b

qstat -f <queue_name>

At the end of the job, standard output, standard error, and the NQE log are returned as files to the directory from which the qsub command was issued

Use -o and -eo to override this behavior


Here are some options I like …

#! /bin/csh -f

# name the request rather than default to script name

#QSUB -r myOpenMP_job

#QSUB -lT 0:15:00

#QSUB -lM 1GB

#QSUB -l mpp_p=4

# send mail to [email protected] when the job ends

#QSUB -me -mu [email protected]

# redirect standard error and output

#QSUB -o batch.log -eo

date

cd $QSUB_WORKDIR

#specify number of processors to run on

setenv OMP_NUM_THREADS 4

# run the job

./a.out

date


Parallel Program Issues

Multiple programming models and APIs are supported

Many are out-of-date and have been superseded by newer models

Many use environment variables for control information

The man page pe_environ gives an up-to-date list of all these environment variables


Parallel Program Issues

Number of processors for shared memory executables

OMP_NUM_THREADS

Number of processors is “dynamic” by default (based on number of idle processors)

This can have undesirable side effects and may be disabled

OMP_DYNAMIC FALSE


Running MPI jobs

Use mpirun, see man page
  mpirun -np 8 mypgm

Use -cpr flag to checkpoint batch jobs

Running with perfex in batch
  mpirun -cpr -np 8 perfex -a -y mypgm
  similarly for ssusage, ssrun


Checkpointing

Executing jobs are checkpointed by the system at regular intervals

Some jobs will not successfully checkpoint

3rd party applications using the FLEXlm license manager

QSUB option -nc will prevent checkpointing

mpirun option -cpr is required to enable checkpointing of MPI jobs
