TM Origin System Architecture Hardware and Software Environment.
Origin System Architecture
Hardware and Software Environment
Scalar Architecture
Reduced Instruction Set Computer (RISC) architecture:
• load/store instructions refer to memory
• functional units operate on items in the register file
• memory hierarchy in the scalar architecture:
– most recently used items are captured in the cache
– access to the cache is much faster than access to memory
[Diagram: processor (register file, functional units for mult/add) backed by a cache (~2 GB/s, ~10 cycles) and memory (~500 MB/s, ~100 cycles)]
Vector Architecture
• Vectors are loaded (loadv instruction) from memory
• Performance is determined by memory bandwidth
• Optimization takes the vector length (64 words) into account
loadf f2,(r3)    load scalar A(i,k)
loadv v3,(r3)    load vector B(k,1:n)
mpyvs v3,v3,f2   calculate A(i,k)*B(k,1:n)
addvv v4,v4,v3   update C(i,1:n)

+ Accumulate C(i,1:n) in a vector register
[Diagram: C = A x B; C(i,1:n) is accumulated in a vector register, with functional units (mult, add) operating on vector registers loaded from memory]

DO i=1,n
  DO k=1,n
    C(i,1:n) = C(i,1:n) + A(i,k)*B(k,1:n)
  ENDDO
ENDDO
Vector Operation
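As a sketch of what the four vector instructions above accomplish for one (i,k) pair, here is a scalar C rendering of the inner update. The function name, 0-based indexing, and C itself are our illustration, not from the slide:

```c
#include <stddef.h>

/* Scalar sketch of the vector operation: loadf/loadv fetch the scalar
 * A(i,k) and the row slice B(k,1:n), and the mpyvs/addvv pair performs
 * C(i,1:n) = C(i,1:n) + A(i,k)*B(k,1:n). On the vector machine all n
 * multiply-adds proceed from vector registers at memory-bandwidth speed. */
void vector_update(double *c, const double *b, double a_ik, size_t n) {
    for (size_t j = 0; j < n; j++)
        c[j] += a_ik * b[j];   /* one multiply-add per vector element */
}
```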
Multiprocessor Architecture
• All memory (and I/O) is shared by all processors
• Read/write conflicts between processors on the same memory location are resolved by the cache coherency unit, which intervenes if two or more processors attempt to update the same cache line
• The programming model is an extension of the single-processor programming model
[Diagram: two processors, each with a register file, functional units (mult, add), cache, and cache coherency unit, sharing one memory]
Multicomputer Architecture

• All memory and I/O paths are independent
• Data movement across the interconnect is “slow”
• The programming model is based on message passing
– processors explicitly engage in communication by sending and receiving data
[Diagram: two independent nodes, each with a processor (register file, functional units for mult/add), cache, and its own main memory; no memory is shared between them]
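The message-passing model can be sketched with ordinary POSIX processes: parent and child have separate address spaces, like multicomputer nodes, and must explicitly send and receive data over a channel. Here a pipe stands in for the interconnect; the function name is our illustration:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch of explicit message passing between two "nodes" (processes with
 * private address spaces). The sender writes the datum into the channel,
 * the receiver reads it out; nothing is shared implicitly. */
double send_and_receive(double msg) {
    int fd[2];
    double out = 0.0;
    if (pipe(fd) != 0)
        return -1.0;
    pid_t pid = fork();
    if (pid == 0) {                             /* sender "node" */
        close(fd[0]);
        write(fd[1], &msg, sizeof msg);         /* explicit send */
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);                               /* receiver "node" */
    if (read(fd[0], &out, sizeof out) != (ssize_t)sizeof out)
        out = -1.0;                             /* explicit receive failed */
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return out;
}
```

On a real multicomputer the send/receive would be library calls over the interconnect rather than a pipe, but the programming obligation is the same.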
Origin 2000 Node Board
Basic Building Block
[Diagram: node board with two R1*K processors and their caches attached to the Hub; the Hub connects to main memory with its directory (extended >32P directory for large systems), to the XIO link, and to the CrayLink]

• 2 x R12000 processors
• 64 MB to 4 GB main memory

Hub bandwidth peaks:
• 780 MB/s [625] --- CPUs
• 780 MB/s [683] --- memory
• 1.56 GB/s [1.25] --- XIO link
• 1.56 GB/s [1.25] --- CrayLink
O2000 Node Board

HUB crossbar ASIC: a single chip integrates all 4 interfaces:
– processor interface; two R1x000 processors multiplex on the same bus
– memory interface, integrating the memory controller and (directory) cache coherency
– interface to the CrayLink interconnect to other nodes in the system
– interface to the I/O devices with XIO-to-PCI bridges

Memory access characteristics:
– read bandwidth, single processor: 460 MB/s sustained
– average access latency: 315 ns to restart the processor pipeline
[Diagram: two R1x000 processors, each with a 1/4/8 MB L2 cache, on the HUB's processor interface; HUB ASIC (950K gates, 100 MHz, 64-bit, BTE, 64 counters per 4 KB page) with a memory interface to main memory (up to 4 GB/node SDRAM, 144 bits @ 50 MHz = 800 MB/s) and directory SDRAM, an I/O interface (input/output on every node: 2x800 MB/s), and a link interface to the CrayLink duplex connection to other nodes (2x23 bits @ 400 MHz, 2x800 MB/s)]
Origin 2000 Switch Technology
[Diagram: node board (two processors with caches, Hub, directory, main memory) attached to an XBOW switch with 6 ports to XIO, and through a router to other node boards; nodes (N) and routers (R) form a ccNUMA hypercube]
O2000 Scalability Principle

The distributed switch does scale:
– a network of crossbars allows for full remote bandwidth
– the switch components are distributed and modular
[Diagram: two node boards, each with two R1x000 processors and 1/4/8 MB L2 caches, a HUB (processor, memory, I/O, and link interfaces), directory SDRAM, and main memory, connected through a crossbar router network]
Origin 2000 Module
System Building Block

Module features:
• up to 8 R12000 CPUs (1-4 nodes)
• up to 16 GB physical memory
• up to 12 XIO slots
• 2 XBOW switches
• 2 router switches
• 64-bit internal PCI bus (optional)
• up to 2.5 [3.1] GB/s system bandwidth
• up to 5.0 [6.2] GB/s I/O bandwidth
Origin 2000 Module: Deskside System (SGI 2100 / 2200)
• 2-8 CPUs
• 16 GB memory
• 12 XIO slots

[Topology diagram: 4 nodes attached to 2 routers]
Origin 2000 Single Rack: Single Rack System (SGI 2400)
• 2-16 CPUs
• 32 GB memory
• 24 XIO slots

[Topology diagram: 8 nodes attached to 4 routers]
Origin 2000 Multi-Rack: Multi-Rack System
• 17-32 CPUs
• 64 GB memory
• 48 XIO slots
• 32-processor hypercube building block

[Topology diagram: 16 nodes attached to 8 routers in a hypercube]
Origin 2000 Large Systems: Large Multi-Rack Systems (SGI 2800)
• up to 512 CPUs
• up to 1 TB memory
• 384+ XIO slots

[Diagram: multiple 32-processor hypercube building blocks combined into one large system]
Scalable Node Product Concept

Modular architecture; interface and form factor standards.

[Diagram: processor subsystems, I/O subsystems, and interconnect subsystems as independent modules]

Address diverse customer requirements:
• independent scaling of CPU, I/O, and storage: tailor ratios to suit the application
• large dynamic range of product configurations
• RAS via component isolation
• independent evolution and upgrade of system components
• maximize leverage of engineering and technology development efforts
Origin 3000 Hardware Modules (Bricks)

• C-brick: CPU module
• D-brick: disk storage
• R-brick: router interconnect
• X-brick: XIO expansion
• P-brick: PCI expansion
• I-brick: base I/O module
• G-brick: graphics expansion
Origin 3000 MIPS Node

[Diagram: four R1*000 processors with L2 caches on two SysAD buses into the Bedrock ASIC, which provides the memory/directory, network, and XIO+ ports]

• Two independent SysAD interfaces, each 2x O2K bandwidth: 200 MHz, 1600 MB/s each
• NUMAlink3 network port, 2x O2K bandwidth: 800 MHz, 1600 MB/s bidirectional
• XIO+ port, 1.5x O2K bandwidth: 600 MHz, 1200 MB/s bidirectional
• Memory interface, 4x O2K bandwidth: 200 MHz, 3200 MB/s; 60% of O2K latency, 180 ns local; 8 GB/node (max) DDR SDRAM
• 128 nodes / 512 CPUs per system (max)
Origin 3000 CPU Brick (C-brick)

• 3U high x 28” deep
• Four MIPS or IA64 CPUs
• 1-4 DIMM pairs: 256 MB, 512 MB, 1024 MB (premium)
• 48V DC power input
• N+1 redundant, hot-plug cooling
• Independent power on/off
• Each CPU module can support one I/O brick
Origin 3000 BEDROCK Chip
SGI Origin 3000 Bandwidth: Theoretical vs. Measured (MB/s)

[Diagram: node bandwidths, theoretical (1600 MB/s per CPU bus, 2x1600 MB/s network link, 3200 MB/s memory) against measured values (~900 per CPU bus, ~1150 per network direction or 2x1250, ~2100 memory), with an Origin 2000-style node (CPUs, Hub, memory) shown for comparison]
STREAMS Copy Benchmark

Megabytes/sec vs. number of CPUs:

                                 1 CPU    2 CPUs   4 CPUs   8 CPUs
Origin 2000 R12KS 400 MHz        380.0    381.0    820.0    1538.0
Origin 3000 R12KS 400 MHz        623.0    777.0    1406.0   2855.0
Origin 3000 R14K  500 MHz        685.0    778.0    1401.0   2823.0
SGI Confidential
Origin 3000 Router Brick (r/R-brick)

• 2U high x 25” deep
• Replaces the system mid-plane
• Multiple implementations:
– r-brick: 6-port (up to 32 CPUs)
– R-brick: 8-port (up to 128 CPUs)
– metarouter (128 to 512 CPUs)
• 48V DC power input
• N+1 redundant, hot-plug cooling
• Independent power on/off
• Latency 50% of Origin 2000: 45 ns

NUMAlink3 router: 8 NUMAlink3 network ports, each 3.2 GB/s (2x O2K bandwidth); 45 ns roundtrip latency (50% of O2K router latency)
SGI Origin 3000 Measured Bandwidth

[Diagram: router sustaining 5000 MB/s aggregate, 2500 MB/s per attached link]
SGI NUMA 3 Scalable Architecture (16p, 1 hop)

[Diagram: four nodes, each with four R1*000 processors and a Bedrock ASIC, connected to one 8-port router; the remaining ports go to other routers]
Origin 3000 I/O Bricks

X-brick: XIO expansion
• highest-performance I/O expansion
• supports HIPPI, GSN, VME, HDTV
• 4 XIO slots per brick

P-brick: PCI expansion
• 12 industry-standard, 64-bit, 66 MHz slots
• supports almost all system peripherals
• all slots are hot-swap

I-brick: base I/O module
• base system I/O: system disk, CD-ROM, 5 PCI slots
• no need to duplicate starting I/O infrastructure

New I/O bricks (e.g., PCI-X) can be attached via the same XIO+ port.
Types of Computer Architecture
(characterised by memory access)

MIMD
• Multiprocessors (single address space, shared memory)
– UMA (central memory):
  - PVP (SGI/Cray T90)
  - SMP (Intel SHV, Sun E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
– NUMA (distributed memory):
  - COMA (KSR-1, DDM)
  - CC-NUMA (SGI Origin 2000, Origin 3000, Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
  - NCC-NUMA (Cray T3D, IBM SP3)
• Multicomputers (multiple address spaces)
– NORMA (no remote memory access):
  - Cluster (IBM SP2, DEC TruCluster, Microsoft Wolfpack, “Beowulf”, etc.): loosely coupled, multiple OS
  - “MPP” (Intel TFLOPS, TM-5): tightly coupled, single OS

MIMD = Multiple Instruction streams, Multiple Data; PVP = Parallel Vector Processor; UMA = Uniform Memory Access; SMP = Symmetric Multi-Processor; NUMA = Non-Uniform Memory Access; COMA = Cache Only Memory Architecture; NORMA = No-Remote Memory Access; CC-NUMA = Cache-Coherent NUMA; NCC-NUMA = Non-Cache-Coherent NUMA; MPP = Massively Parallel Processor
Origin DSM-ccNUMA Architecture

[Diagram: two nodes, each with four processors (with caches), a Bedrock ASIC, main memory with directory, and an XIO+ port, connected via NUMAlink3 and R-bricks: distributed shared memory]
Distributed Shared Memory Architecture (DSM)

• Local memory and an independent path to memory, as in the multicomputer architecture
• The memory of all nodes is organized as one logical “shared memory”
• Non-uniform memory access (NUMA): “local memory” access is faster than “remote memory” access
• The programming model is (almost) the same as for the shared-memory architecture; data distribution is available for optimization
• Scalability properties similar to the multicomputer architecture
[Diagram: two nodes, each with a processor (register file, functional units for mult/add), cache, cache coherency unit, and local main memory, joined by an interconnect]
Origin DSM-ccNUMA Architecture

[Same diagram as above: two four-processor nodes with Bedrock ASICs, main memory with directory, and XIO+ ports, connected via NUMAlink3 and R-bricks]

Directory-based scalable cache coherence
Origin Cache Coherency

• A memory page is divided into data blocks of 32 words or 128 bytes each (the L2 cache line size)
• Each data request transfers one data block (128 bytes)
• Each data block has associated presence and state information
• If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
• The HUB runs the cache coherency protocol, updating the state of the data block and notifying the nodes for which the presence bit is set

Each L2 cache line (128 bytes) contains four L1 data cache lines of 8 words or 32 bytes each.

[Diagram: directory entry per 128-byte data block (cache line): 64 presence bits plus an 8-bit state. States: Unowned (no copies), Shared (read-only copies), Exclusive (one read-write copy), Busy (state in transition)]
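The directory bookkeeping described above can be sketched as a small C structure. The field and enum names are ours, and the real Origin encoding is more involved, but the shape is the same: a presence vector plus a state per 128-byte block:

```c
#include <stdint.h>

/* Illustrative directory entry: for each 128-byte data block, a 64-bit
 * presence vector (one bit per node) and an 8-bit state, mirroring the
 * slide's Unowned/Shared/Exclusive/Busy states. */
enum dir_state { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE, DIR_BUSY };

struct dir_entry {
    uint64_t presence;   /* bit n set => node n holds a copy of the block */
    uint8_t  state;      /* one of dir_state */
};

/* Record a read request from a node's HUB: set its presence bit and move
 * an unowned block into the shared (read-only copies) state. */
void record_read(struct dir_entry *e, int node) {
    e->presence |= (uint64_t)1 << node;
    if (e->state == DIR_UNOWNED)
        e->state = DIR_SHARED;
}
```

A write request would instead clear the other presence bits after notifying their nodes and set the state to Exclusive; that path is omitted here.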
CC-NUMA Architecture: Programming

• All data is shared
• Additional optimization to place data close to the processor that will do most of the computation on that data
• Automatic (compiler) optimizations for single-processor and parallel performance
• Data access (data exchange) is implicit in the algorithm
• Except for the additional data-placement directives, the source is the same as for single-processor programming (the SMP principle)

[Diagram: C = A x B with each processor holding a block of columns of each matrix]

C every processor holds a column of each matrix:
C$distribute A(*,block),B(*,block),C(*,block)
C$omp parallel do
      DO i=1,n
        DO j=1,n
          DO k=1,n
            C(i,j)=C(i,j) + A(i,k)*B(k,j)
          ENDDO
        ENDDO
      ENDDO
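The same kernel in C (0-based, row-major, flat storage; our illustrative rendering) makes the SMP principle concrete: the computational source stays this simple, and the C$distribute / C$omp directives above only add data-placement and parallelization hints around it:

```c
#include <stddef.h>

/* C rendering of the Fortran loop nest above for n x n matrices stored
 * flat in row-major order: C(i,j) += A(i,k) * B(k,j). On a CC-NUMA system
 * this code is unchanged; only directives tell the system where to place
 * the columns and which loop to run in parallel. */
void matmul(size_t n, double *c, const double *a, const double *b) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}
```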
Problems of the CC-NUMA Architecture

• SMP programming style + data placement techniques (directives)
• The “SMP programming cliff”: the remote memory latency jump is ~3-5x, which requires correct data placement
• On a 64-128 processor O2000, ta(remote)/ta(local) ~ 3-5 -> correct data placement needed
• Based on a 1 GB/s SCI link: latency/hop ~ 500 ns
DSM-ccNUMA Memory

• Shared-memory systems (SMP): easy to program, hard to scale
• Massively parallel systems (MPP): easy to scale, hard to program
• Distributed shared memory systems (ccNUMA): easy to program and easy to scale
SGI 3200 (2-8p)

Short rack (17U configuration space). Router-less configurations in a deskside form factor.

Minimum (2p) system: power bay, C-brick, I-brick.
Maximum (8p) system: 2 power bays, 2 C-bricks, I-brick, plus a P-, I-, or X-brick.

[System topology: the two 4-processor C-bricks connect back-to-back over their network ports; XIO+ ports serve the I/O bricks]
SGI 3400 (4-32p)

Full-size rack (39U configuration space).

Minimum (4p) system: power bay, r-brick, C-brick, I-brick.
Maximum (32p) system: 8 C-bricks, 2 r-bricks, power bays, I-brick, plus P-, I-, or X-bricks.

[System topology: two 6-port r-brick routers, each connecting four 4-processor C-bricks, with XIO+ ports to the I/O bricks]
SGI 3800 (16-128p)

Minimum (16p) system: 4 C-bricks, R-brick (8-port router), power bays, I-brick.
Maximum (128p) system: 4 racks, each holding 8 C-bricks, 2 R-bricks, and power bays, plus racks of P-, I-, or X-bricks for I/O.

[128p system topology: racks 1-4, each rack with two groups of four C-bricks behind an R-brick, and the R-bricks interconnected]
SGI 3800 System: 128 processors

[Diagram: eight 16-processor building blocks combined into a 128-processor system]
SGI 3800 (32-512p)

512p power estimates: MIPS = 77 KW; Itanium = 150 KW; McKinley = 231 KW.
No I/O or storage is included in the power estimates. Premium memory required.

[Diagram: one quadrant of a 512p system: racks of C-bricks, R-bricks, and power bays, with P-, I-, or X-bricks for I/O]
Router-to-Router Connections for 256 Processor Systems
512 Processor Systems
R1xK Family of Processors

The MIPS R1x000 is an out-of-order, dynamic-scheduling superscalar processor with non-blocking caches.

• Supports the 64-bit MIPS IV ISA
• 4-way superscalar
• Five separate execution units
• 2 floating-point results / cycle
• 4-way deep speculative execution of branches
• Out-of-order execution (48-instruction window)
• Register renaming
• Two-way set-associative, non-blocking caches
– up to 4 outstanding memory read requests
– prefetching of data
– 1 MB to 8 MB secondary data cache
• Four user-accessible event counters
Origin 3000 MIPS Processor Roadmap (1999-2003)

Origin 2000:
• R10000: 250 MHz, 500 MFlops (4 MB L2 @ 250 MHz)
• R12000: 300 MHz, 600 MFlops (8 MB L2 @ 200 MHz)

O3K-MIPS:
• R12000A: 400 MHz, 800 MFlops (8 MB L2 @ 266 MHz)
• R14000(A): 500+ MHz, 1000+ MFlops (8 MB DDR SRAM L2 @ 250+ MHz)
• R16000: xxx MHz, xxx GFlops
• R18000: xxx MHz, xxx GFlops
R14000 Cache Interfaces
Memory Hierarchy

Speed of access vs. device capacity (size):

Level        Size           Access time
registers    64 regs        1/clock
L1 cache     32 KB          ~2-3 cycles
L2 cache     8 MB           ~10 cycles
memory       ~1-100s GB     ~100-300 cycles (NUMA)
disk         --             ~4000 cycles
Remote latency (ns) by system size:

            2p    4p    8p    16p   32p   64p   128p  256p  512p
Origin3000  175   175   235   285   335   335   435   485   585
Origin2000  343   554   759   759   836   1067  1169
Effects of Memory Hierarchy

[Plot: performance vs. working-set size, showing the 32 KB L1 cache and L2 caches of 1 MB, 2 MB, and 4 MB]
Instruction Latencies (R12K)

Integer units                           latency   repeat rate
ALU 1
  add, sub, logic ops, shift, br        1         1
ALU 2
  add, sub, logic ops                   1         1
  signed multiply (32/64 bit)           6/10      6/10
  (unsigned multiply: +1 cycle)
  divide (32/64 bit)                    35/67     35/67
Address unit
  load integer                          2         1
  load floating point                   3         1
  store                                 -         1
  atomic LL,ADD,SC sequence             6         6

Floating point units
FPU 1
  add, sub, compare, convert            2         1
FPU 2
  multiply                              2         1
  multiply-add (madd)                   4         1
FPU 3
  divide, reciprocal (32/64 bit)        12/19     14/21
  sqrt (32/64 bit)                      18/33     20/35
  rsqrt (32/64 bit)                     30/52     34/56

A repeat rate of 1 means that, after pipelining, the processor can complete 1 operation per cycle. Thus the peak rates are 2 int operations/cycle and 2 fp operations/cycle; for the R14000 @ 500 MHz: 4 x 500 MHz = 2000 MIPS, 2 x 500 MHz = 1000 Mflop/s.

The compiler has this table built in. The goal of compiler scheduling is to find instructions that can be executed in parallel to fill all slots: ILP (instruction-level parallelism).
Instruction Latencies: DAXPY Example

DO I=1,n
  Y(I) = Y(I) + A*X(I)
ENDDO

• There are 2 loads (x, y) and 1 store (y) = 3 memory ops
• There are 2 fp operations (+, *), which can be done with 1 madd
• 3 memory ops require 3 cycles minimum (the processor can do 1 memory op/cycle)
• Theoretically, in 3 cycles the processor can do 6 fp operations
• Only 2 fp operations are available in the code, so the maximum processor speed is 2fp/6fp = 1/3 of peak on this code; i.e., for the R12000 @ 300 MHz, 600/3 = 200 Mflop/s

Loop parallelism (per single loop iteration): 2 loads, 1 store, 1 multiply-add (madd), 2 address increments, 1 loop-end test, 1 branch.
Processor parallelism (per processor cycle): 1 load or store, 1 ALU1 instruction, 1 ALU2 instruction, 1 FP add, 1 FP multiply.
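For reference, the same DAXPY kernel in C (our illustrative rendering of the Fortran loop above):

```c
#include <stddef.h>

/* The DAXPY kernel analyzed above. Each iteration performs 2 loads (x[i],
 * y[i]), 1 store (y[i]), and one multiply-add, which is why the 3 memory
 * operations (at 1 per cycle), not the floating-point units, bound the
 * achievable rate at 1/3 of peak. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}
```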
DAXPY Example: Schedules

Simple schedule:

DO I=1,n
  Y(I) = Y(I) + A*X(I)
ENDDO

cycle  instructions
0      ld x    x++
1      ld y              (x load delay: 3 cycles)
2
3      madd              (madd delay: 4 cycles)
4-6
7      st y    br   y++

2fp/(8 cycles * 2fp/cy) = 1/8 of peak; R12000 @ 300 MHz ~ 75 Mflop/s

Unrolled by 2:

DO I=1,n-1,2
  Y(I+0) = Y(I+0) + A*X(I+0)
  Y(I+1) = Y(I+1) + A*X(I+1)
ENDDO

cycle  instructions
0      ld x0
1      ld x1
2      ld y0   x+=4      (x load delay: 3 cycles)
3      ld y1   madd0
4      madd1             (madd delay: 4 cycles)
5-6
7      st y0
8      st y1   y+=4  br

4fp/(9 cycles * 2fp/cy) = 2/9 of peak; ~133 Mflop/s
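The peak-fraction arithmetic used for both schedules can be captured in a small helper (a sketch; the 2 fp/cycle constant is the R12K peak from the latency table):

```c
/* Fraction of peak for a schedule retiring `fp` floating-point operations
 * every `cycles` cycles, on a processor whose peak is 2 fp ops per cycle. */
double peak_fraction(int fp, int cycles) {
    return (double)fp / ((double)cycles * 2.0);
}

/* Sustained Mflop/s given the processor's peak rate in Mflop/s
 * (600 for the R12000 @ 300 MHz). */
double sustained_mflops(int fp, int cycles, double peak_mflops) {
    return peak_fraction(fp, cycles) * peak_mflops;
}
```

peak_fraction(2, 8) gives the simple schedule's 1/8 of peak (75 Mflop/s at 600 peak); peak_fraction(4, 9) gives the unrolled schedule's 2/9 (~133 Mflop/s).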
DAXPY Example: Software Pipelining

• Software pipelining is the way to fill all processor slots, by mixing iterations
• The number of replications gives how many iterations are mixed
• The number of replications depends on the distance (in cycles) between the load and the calculation
• DAXPY 6-cycle schedule with 4 fp ops: 4fp/(6cy*2fp/cy) = 1/3 of peak

#<swp> replication 0                     #cy
ld x0   ldc1   $f0,0($1)                 #[0]
ld x1   ldc1   $f1,-8($1)                #[1]
st y2   sdc1   $f3,-8($3)                #[2]
st y3   sdc1   $f5,0($3)                 #[3]
y+=2    addiu  $3,$2,16                  #[3]
        madd.d $f5,$f2,$f0,$f4           #[4]
ld y0   ldc1   $f0,-8($2)                #[4]
        madd.d $f3,$f0,$f1,$f4           #[5]
x+=2    addiu  $1,$1,16                  #[5]
        beq    $2,$4,.BB21.daxpy         #[5]
ld y3   ldc1   $f2,0($3)                 #[5]

#<swp> replication 1                     #cy
ld x3   ldc1   $f1,0($1)                 #[0]
ld x2   ldc1   $f0,-8($1)                #[1]
st y1   sdc1   $f3,-8($2)                #[2]
st y0   sdc1   $f5,0($2)                 #[3]
y+=2    addiu  $2,$3,16                  #[3]
        madd.d $f5,$f2,$f1,$f4           #[4]
ld y3   ldc1   $f1,-8($3)                #[4]
        madd.d $f3,$f1,$f0,$f4           #[5]
x+=2    addiu  $1,$1,16                  #[5]
ld y0   ldc1   $f2,0($2)                 #[5]
DAXPY SWP: Compiler Messages

f77 -mips4 -O3 -LNO:prefetch=0 -S daxpy.f

• With the -S switch, the compiler produces the file daxpy.s with assembler instructions and comments about software-pipelining schedules:

#<swps> Pipelined loop line 6 steady state
#<swps>   50 estimated iterations before pipelining
#<swps>    2 unrolling before pipelining
#<swps>    6 cycles per 2 iterations
#<swps>    4 flops        ( 33% of peak) (madds count 2 fp)
#<swps>    2 flops        ( 16% of peak) (madds count 1 fp)
#<swps>    2 madds        ( 33% of peak)
#<swps>    6 mem refs     (100% of peak)
#<swps>    3 integer ops  ( 25% of peak)
#<swps>   11 instructions ( 45% of peak)
#<swps>    2 short trip threshold
#<swps>    7 ireg registers used
#<swps>    6 fgr registers used

• The schedule is the maximum 1/3 of peak processor performance, as expected
• Note: it is necessary to switch off prefetch to attain the maximum schedule
Multiple Outstanding Memory References

• The processor can support 4 outstanding memory requests

Timing linked-list references: while(x) x = x->p;

# outstanding refs   time per pointer fetch
1                    230 ns (480 ns)
2                    160 ns (250 ns)
4                    110 ns (240 ns)

[Diagram: a “sequential” cache miss alternates execution and waiting for data; a “parallel” cache miss overlaps the waits by executing independent instructions, shortening total time]
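The traversal being timed can be made concrete in C. Each x = x->p is a load that depends on the previous one, so a single chain can have only one miss outstanding; the 2- and 4-reference rows in the table come from walking several independent chains at once. Names here are our illustration:

```c
#include <stddef.h>

struct node { struct node *p; };

/* Link pool[0..n-1] into one chain ending in NULL; return the head.
 * (A real measurement would shuffle the order to defeat the caches.) */
struct node *make_chain(struct node *pool, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        pool[i].p = &pool[i + 1];
    pool[n - 1].p = NULL;
    return &pool[0];
}

/* The slide's loop, while(x) x = x->p;, counting the hops taken.
 * Every hop is a dependent load: the next address is unknown until the
 * previous load completes. */
size_t chase(struct node *x) {
    size_t hops = 0;
    while (x) { x = x->p; hops++; }
    return hops;
}
```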
Origin 3000 Memory Latency

          Local     NI to NI   Per router
Origin    320 ns    165 ns     105 ns       -> 485 ns + #hops*105 ns
O3K       180 ns    50 ns      45 ns        -> 230 ns + #hops*45 ns

32-CPU O3K max latency: 315 ns
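The two latency formulas above can be written directly as functions of router hop count (a worked example using the slide's constants; the function names are ours):

```c
/* Remote-access latency in ns as a function of router hops:
 * local access + NI-to-NI transfer + a per-router charge per hop. */
int origin2000_latency_ns(int hops) { return 485 + 105 * hops; }
int origin3000_latency_ns(int hops) { return 230 + 45 * hops; }
```

For example, a 2-hop remote access costs 695 ns on the Origin 2000 but 320 ns on the Origin 3000.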
Remote Memory Latency: SGI 3000 Family vs. SGI 2000 Series

[Plot: worst-case round-trip remote latency (ns) vs. node size (2p-1024p) for the Origin 2000 and the Origin 3000 series (SN hypercube)]
R1x000 Event Counters

The R1x000 processor family allows extensive performance monitoring with counters that can be triggered by 32 events:
• R10000 has 2 event counters
• R12000 has 4 event counters

The counters are incremented when a selected event happens in the processor (e.g., a cache miss); the event is chosen by the user. The first counter can be triggered by events 0-15, the second counter by events 16-31.

The R12000 has 2 additional counters that allow monitoring of conditional events (i.e., events based on previous events).

User access to the counters is through a software library or the shell-level tools provided by the IRIX OS.
Origin Address Space

• Physically, the memory is distributed and is not contiguous
• The node id is assigned at boot time
• Logically, memory is a single contiguous shared address space; the virtual address space is 44 bits (16 TB)
• The program (compiler) uses the virtual address space
• Translation from the virtual to the physical address space is done by the CPU (TLB = Translation Look-aside Buffer)

[Diagram: physical address map with nodes 0, 1, 2, 3, 4, … placed at 4 GB intervals (0, 4 GB, 8 GB, 12 GB, …), showing memory-present regions and empty slots; 1 TB max (40 bits); max for a single node: 4 GB of memory]

Physical address format: node id, 8 bits (bits 39-32); node offset, 32 bits = 4 GB (bits 31-0)

The page size is configurable as 16 KB (default), 64 KB, 256 KB, 1 MB, 4 MB, or 16 MB.

[Diagram: the TLB maps virtual pages 0, 1, 2, …, n to physical pages scattered across the nodes]
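The address split described above is plain bit arithmetic; a sketch (function names are ours):

```c
#include <stdint.h>

/* Decomposing a 40-bit Origin physical address per the layout above:
 * bits 39-32 carry the node id, bits 31-0 the 4 GB offset within the node. */
unsigned node_id(uint64_t paddr) {
    return (unsigned)((paddr >> 32) & 0xFF);
}

uint64_t node_offset(uint64_t paddr) {
    return paddr & 0xFFFFFFFFULL;
}
```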
Process Scheduling

IRIX is a symmetric multiprocessing operating system:
• Processes and processors are independent
• Parallel programs are executed as jobs with multiple processes
• The scheduler allocates processes to processors

Priorities range from 0 to 255:
0        weightless (batch)
1-40     time share, interactive (TS)
90-239   system (daemons and interrupts)
1-255    real-time processes (FIFO & RR)
System Monitoring Commands

uptime(1)      information about system usage and user load
w(1)           who is on the system and what are they doing?
sysmon         system log viewer
ps(1)          a "snapshot" of the process table
top, gr_top    process table dynamic display
osview         system usage statistics
sar            system activity reporter
gr_osview      system usage statistics in graphical form
gmemusage      graphical memory usage monitor
sysconf        system limits, options, and parameters
System Monitoring Commands

ecstats -C          R10K counter monitor
ja                  job accounting statistics
oview               Performance Co-Pilot (bundled with IRIX)
pmchart             Performance Co-Pilot (licensed software)
nstats, linkstat    CrayLink connection statistics (man refcnt(5))
bufview             system buffer statistics
par                 process activity report
numa_view, dlook    process memory placement information
limit [-h]          displays system soft [hard] limits
System Monitoring Commands

hinv        hardware inventory
topology    system interconnect description
Summary: Origin Properties

• Single machine image: it behaves like a fat workstation
– same compilers; time sharing
– all your old code will run
– the OS schedules all the hardware resources on the machine
• Processor scalability: 2-512 CPUs
• I/O scalability: 2-300 GB/s
• All memory and I/O devices are directly addressable
– no limitation on the size of a single program; it can use all the available memory
– no limitation on the location of the data; all disks can be used in a single file system
• 64-bit operating system and file system
– HPC features: checkpoint/restart, DMF, NQE/LSF, TMF, Miser, job limits, cpusets, enhanced accounting
• Machine stability