Post on 24-Jul-2020
All Rights Reserved, Copyright© FUJITSU LIMITED 2014
SPARC64™ XIfx: Fujitsu’s Next Generation Processor for HPC
August 11, 2014
Toshio Yoshida
Next Generation Technical Computing Unit Fujitsu Limited
Agenda
Fujitsu Processor Development
SPARC64™ XIfx
Design Concept and Processor Overview
Node Architecture
HPC-ACE2: ISA enhancements
Microarchitecture
Enhanced VISIMPACT and Sector Cache
Assistant Core
Performance
RAS
Summary
Fujitsu Processor Development
[Figure: Fujitsu processor roadmap, 2011 to 2014]
• Mainframe
– GS21 2600
• UNIX Server
– SPARC64 X: 16 cores, SMT / SWoC, 3GHz
– SPARC64 X+: 16 cores, SMT / SWoC+, 3.7GHz
• HPC
– SPARC64 VIIIfx (K computer, 10 Peta scale): 8 cores, HPC-ACE, DIMM, Tofu interconnect
– SPARC64 IXfx (FX10, 20 Peta scale): 16 cores, HPC-ACE, DIMM, Tofu interconnect
– SPARC64 XIfx (Post-FX10, 100 Peta scale): 32 cores + 2 assistant cores, HPC-ACE2, HMC, Tofu interconnect2
– Next: EXA scale
Design Concept of SPARC64™ XIfx
• Designed for massively parallel supercomputer systems
– High performance for wide range of real applications
– High scalability
– Low power consumption
– Groundwork for EXA scale computing
• Enhance and inherit K computer features
– Stand-alone scalar many-core architecture
– Enhanced VISIMPACT and Sector cache
– On-chip integrated Tofu interconnect 2
• Introduce new technologies toward EXA scale
– Wider SIMD enhancements: HPC-ACE2
– Leading-edge memory technology: HMC
– Cores dedicated to non-computation operations: Assistant cores
SPARC64™ XIfx Chip Overview
Architecture Features
• 32 computing cores + 2 assistant cores
• HPC-ACE2
• 24 MB L2 cache
• HMC, Tofu2, PCI Gen3
20nm CMOS
• 3,750M transistors
• 1,001 signal pins
• 2.2GHz
Performance (peak)
• 1.1TFlops
• HMC 240GB/s x 2 (in/out)
• Tofu2 125GB/s x 2 (in/out)
[Figure: chip layout showing 32 computing cores, 2 assistant cores, two L2 cache blocks, four MACs, two HMC interfaces, the Tofu2 interface and controller, and the PCI interface and controller]
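The 1.1TFlops peak figure can be cross-checked against numbers given elsewhere in this deck (32 computing cores, two 256-bit SIMD FMA pipelines per core, 2.2GHz). The Python sketch below is a back-of-the-envelope check and not Fujitsu's official derivation. It assumes that the assistant cores are excluded and that one FMA counts as two floating-point operations; all names are illustrative.

```python
# Back-of-the-envelope check of the peak double-precision rate.
# Assumptions (not stated on the slide): assistant cores are excluded,
# and one fused multiply-add (FMA) counts as 2 floating-point operations.

COMPUTING_CORES = 32        # from the chip overview
FMA_PIPES_PER_CORE = 2      # "2x 256-bit SIMD FMAs" on the core pipeline slide
SIMD_BITS = 256
DP_BITS = 64
FLOPS_PER_FMA = 2           # multiply + add (assumption)
CLOCK_GHZ = 2.2


def peak_gflops(cores: int = COMPUTING_CORES) -> float:
    """Return the theoretical double-precision peak in GFLOPS."""
    lanes = SIMD_BITS // DP_BITS  # 4 double-precision lanes per SIMD operation
    return cores * FMA_PIPES_PER_CORE * lanes * FLOPS_PER_FMA * CLOCK_GHZ


if __name__ == "__main__":
    print(f"{peak_gflops():.1f} GFLOPS")  # roughly 1.1 TFlops
```

Under these assumptions the product lands at about 1.1TFlops, which is consistent with the peak value quoted on the slide.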
Node Architecture
• Stand-alone scalar many-core with wider SIMD
– No accelerator
• Non-hierarchical and high bandwidth memory
– 8x HMCs (32GB, 240GB/s x2 (in/out) )
• Isolation of non-computation operation for jitter reduction
– 32 Computing cores
– 2 Assistant cores
• Daemon, IO, MPI asynchronous communication, etc.
• Sector cache is used for assistant core to avoid cache pollution
– Computing cores and Assistant cores keep cache coherency
• Single OS manages computing and assistant cores
– Single OS minimizes memory management overhead
HPC-ACE2: ISA enhancements
• Wider SIMD enhancements from K computer / FX10
– 256-bit wide SIMD (64-bit x 4 / 32-bit x 8)
– More integer operations
– Stride load/store
– Indirect load/store
– Compress
– Round
– Permutation
Wider SIMD Extensions
• 256-bit wide SIMD with 128 FPRs
– 64-bit (DP: Double Precision) x 4 SIMD
– 32-bit (SP: Single Precision) x 8 SIMD
• DP 3.2x, SP 6.1x faster than SPARC64™ IXfx in basic kernels
– Improved L1 cache pipelines
– Higher frequency 1.848GHz -> 2.2GHz
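One rough way to read the DP figure is to factor out the gains that follow directly from SIMD width and clock frequency and see what remains for the other improvements, such as the L1 cache pipelines. The Python sketch below does this arithmetic. It assumes that SPARC64™ IXfx (HPC-ACE) uses 128-bit SIMD, which this slide does not state, and it treats the factors as simply multiplicative, which is a simplification.

```python
# Rough decomposition of the reported per-core DP speedup.
# Assumption: SPARC64 IXfx (HPC-ACE) uses 128-bit SIMD, i.e. 2 DP lanes.
# Treating the gains as independent multiplicative factors is a simplification.

REPORTED_DP_SPEEDUP = 3.23   # average over basic kernels, from the slide
OLD_SIMD_BITS = 128          # assumption, not stated on the slide
NEW_SIMD_BITS = 256
OLD_CLOCK_GHZ = 1.848
NEW_CLOCK_GHZ = 2.2


def residual_factor() -> float:
    """Speedup left over after SIMD width and clock are factored out."""
    width_gain = NEW_SIMD_BITS / OLD_SIMD_BITS   # 2.0
    clock_gain = NEW_CLOCK_GHZ / OLD_CLOCK_GHZ   # about 1.19
    return REPORTED_DP_SPEEDUP / (width_gain * clock_gain)


if __name__ == "__main__":
    print(f"residual factor: {residual_factor():.2f}")
```

Under these assumptions a residual of roughly 1.36x is left for everything other than width and frequency.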
[Figure: Basic Kernels Performance per Core (normalized performance); DP 3.23x / SP 6.18x on average]
Built-in Functions
• Built-in functions accelerated by
– HPC-ACE2 instructions
• 256-bit wide SIMD
• Rounding / Bit manipulation / Exponential auxiliary instructions
– Microarchitectural enhancements
[Figure: Built-in Functions Performance per Core (normalized performance); 3.64x on average, with contributions labeled Rounding, Exponential, Bit manipulation, 256-bit wide SIMD and L1 cache improvement]
Stride Load/Store Instructions
• Stride access is frequently used in various HPC apps.
– Supports stride widths from 2 to 7 elements
• 3.6x faster than SPARC64™ IXfx
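E.g. Stride load @ stride width = 3
lddst,s [%l0]@stride 3, %f0
[Figure: elements i[0] to i[3], spaced three elements apart in memory starting at address %l0 (memory rows drawn at %l0+0, +32, +64), are packed into register %f0]
[Figure: Stride load Performance (normalized performance); bars labeled 3.67x and 3.63x]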
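The effect of a stride load can be modeled in a few lines of Python. This is only a behavioral sketch inferred from the slide's example, not the architectural definition of the instruction. It assumes four 64-bit elements per 256-bit register and a stride counted in elements; the function name is hypothetical.

```python
from typing import List, Sequence

SIMD_LANES = 4  # assumption: a 256-bit register holding four 64-bit elements


def stride_load(memory: Sequence[float], base: int, stride: int) -> List[float]:
    """Gather SIMD_LANES elements spaced `stride` elements apart.

    `base` and `stride` are element indices here, not byte addresses.
    """
    if not 2 <= stride <= 7:
        raise ValueError("the slide lists stride widths from 2 to 7 elements")
    return [memory[base + lane * stride] for lane in range(SIMD_LANES)]


if __name__ == "__main__":
    memory = list(range(12))           # element k simply holds the value k
    print(stride_load(memory, 0, 3))   # elements 0, 3, 6 and 9
```

A scalar loop would need four separate loads to do the same work, which is the kind of access pattern the instruction is meant to speed up.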
Indirect Load/Store Instructions
• Indirect load and store instructions for list accesses
– List accesses appear in a wide range of HPC apps.
• More than 1.6x faster than SPARC64™ IXfx
E.g. Indirect load
lddid,s [%f0], %f2
[Figure: register %f0 holds indices i[0] to i[3]; the memory elements A, B, C, D that they select are gathered into register %f2]
[Figure: Indirect Load/Store Performance (normalized performance); bars labeled 1.67x and 1.92x]
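An indirect load is a gather and an indirect store is a scatter. The Python sketch below models that behavior only; the exact instruction semantics (index units, masking, ordering) are not given on the slide, so the element-index interpretation, the index values in the example and the function names are all assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def indirect_load(memory: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Gather: result[k] = memory[indices[k]]."""
    return [memory[i] for i in indices]


def indirect_store(memory: List[T], indices: Sequence[int], values: Sequence[T]) -> None:
    """Scatter: memory[indices[k]] = values[k]."""
    for i, value in zip(indices, values):
        memory[i] = value


if __name__ == "__main__":
    memory = ["A", "D", "B", "C"]   # layout loosely following the slide's figure
    indices = [0, 2, 3, 1]          # hypothetical contents of the index register
    print(indirect_load(memory, indices))
```

With this index list the gather returns the elements in the order A, B, C, D, mirroring the figure's destination register.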
SPARC64™ XIfx Core Pipeline
[Figure: core pipeline block diagram. Stages: Fetch, Issue, Dispatch, Reg-Read, Execute, Cache and Memory, Commit. Blocks shown: L1 I$ (64KB, 4 ways), Branch Target Address, Pattern History Table, Local Pattern Table, Decode & Issue, RSE, RSA, RSF, RSBR, GUB, GPR (188 registers), EXA, EXB, EXC, EXD, EAGA, EAGB, FLA, FLB, FPR (128 x 4 registers), FUB, Fetch Port, Store Port, Write Buffer, L1 D$ (64KB, 4 ways), CSE, PC, Control Registers, L2$, MAC (HMC controller), HMC, PCI controller (PCI-GEN3), Tofu2 controller (CPU-CPU I/F); 34 cores per chip]
• 2x 256-bit SIMD FMAs + 4x ALUs (shared with 2 AGENs)
• 2x 256-bit SIMD LOADs or 1x 256-bit SIMD STORE
• Fundamental pipelines are based on SPARC64™ X+
– Superscalar, Out-of-Order, branch prediction, etc.
• No multithreading
Many-Core Architecture
• SPARC64™ XIfx has 2 CMGs (Core Memory Groups)
– Each CMG consists of 17 cores, an L2 cache and 2 memory controllers (MAC)
– The two CMGs keep cache coherency by ccNUMA with an on-chip directory
• 32GB memory capacity
• Binding a process to a single CMG is recommended
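As an illustration of what binding a process to one CMG could look like, the Python sketch below builds the set of computing-core IDs for a CMG and passes it to the Linux affinity call. The core numbering (CMG n owns IDs 17n to 17n+16, with the last one being the assistant core) is purely hypothetical, and the deck does not describe the actual OS interface, so this shows the idea rather than the real procedure.

```python
import os
from typing import Set

CORES_PER_CMG = 17            # 16 computing cores + 1 assistant core
COMPUTING_CORES_PER_CMG = 16


def cmg_computing_cores(cmg: int) -> Set[int]:
    """Return the computing-core IDs of a CMG under a HYPOTHETICAL numbering.

    Assumed layout: CMG n owns IDs 17n .. 17n+16, the last being the assistant core.
    """
    if cmg not in (0, 1):
        raise ValueError("SPARC64 XIfx has two CMGs: 0 and 1")
    first = cmg * CORES_PER_CMG
    return set(range(first, first + COMPUTING_CORES_PER_CMG))


def bind_to_cmg(cmg: int, pid: int = 0) -> None:
    """Pin a process (0 = the caller) to one CMG's computing cores.

    Uses os.sched_setaffinity, which is available on Linux only.
    """
    os.sched_setaffinity(pid, cmg_computing_cores(cmg))


if __name__ == "__main__":
    print(sorted(cmg_computing_cores(1)))
```

Keeping a process and its memory within one CMG avoids crossing the ccNUMA boundary between the two L2 caches, which is presumably the reason for the recommendation.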
[Figure: CPU block diagram. CMG0 and CMG1 each have 17 cores (computing cores CC 0 to CC 15 plus one assistant core, AC), a 12MB 24-way L2 cache, and MACs connected to HMCs. The PCI controller and Tofu2 controller sit alongside the two CMGs.]
High Bandwidth
• High bandwidth cache, memory and Tofu2
– 2x cache bandwidth per core
• Compared to SPARC64™ IXfx
– 8x HMC
• 15 Gbps
• 16 lanes
• 8 ports
– Tofu2
• 25 Gbps
• 4 lanes
• 10 ports
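The lane counts and signaling rates above multiply out to the headline bandwidth numbers. The Python sketch below shows the arithmetic. It assumes that the listed rates are per lane and per direction, that encoding overhead is ignored, and that a 256-bit access moves 32 bytes per cycle at 2.2GHz; the function names are illustrative.

```python
# Bandwidth arithmetic under simplifying assumptions:
# rates are per lane and per direction, with no encoding overhead.

BITS_PER_BYTE = 8
CLOCK_GHZ = 2.2
SIMD_BYTES = 256 // 8   # one 256-bit access moves 32 bytes


def link_bandwidth_gbytes(gbps_per_lane: float, lanes: int, ports: int) -> float:
    """Aggregate one-direction bandwidth in GB/s."""
    return gbps_per_lane * lanes * ports / BITS_PER_BYTE


def l1_bandwidth_gbytes(accesses_per_cycle: int) -> float:
    """Per-core L1 bandwidth in GB/s for 256-bit accesses at 2.2GHz."""
    return accesses_per_cycle * SIMD_BYTES * CLOCK_GHZ


if __name__ == "__main__":
    print("HMC   :", link_bandwidth_gbytes(15, 16, 8), "GB/s per direction")
    print("Tofu2 :", link_bandwidth_gbytes(25, 4, 10), "GB/s per direction")
    print("L1 load (2 per cycle) :", round(l1_bandwidth_gbytes(2), 1), "GB/s")
    print("L1 store (1 per cycle):", round(l1_bandwidth_gbytes(1), 1), "GB/s")
```

Under these assumptions the results come to 240GB/s for HMC and 125GB/s for Tofu2, matching the quoted figures, while the per-core L1 values (140.8 and 70.4GB/s) coincide with labels in the slide's bandwidth diagram.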
[Figure: High Bandwidth diagram for CMG0 and CMG1 (34 cores). Performance: 1.1TFLOPS (256-bit SIMD FMA x2 per core). L1 cache: 4.4TB/s. L2 cache: 2.2TB/s. Memory (HMCs): 240GB/s x2 (in/out), drawn as four 120GB/s links. Tofu2: 125GB/s x2 (in/out), drawn as 12.5GB/s links, x10 ports. Per-core links between the core, L1 cache and L2 cache are labeled 140.8GB/s, 70.4GB/s and 35.2GB/s.]
Enhanced VISIMPACT
• Advantages of Hybrid Parallelization
– To reduce communication cost in highly parallel programs
– To increase user memory space by reducing communication buffers
• VISIMPACT* (introduced in FX1)
– Automatic parallelization technology by Fujitsu’s compiler
– Hardware barrier for fast synchronization
• Enabling 8 sets of Hardware barriers between 32 cores
– Optimum combination of # Threads and # Processes depends on apps.
– Any combination of T (Threads) and P (Processes) is supported
• 32 T (Thread) x 1 P (Process), 16 T x 2 P, 8 T x 4 P, etc.
– The goal is heterogeneous hybrid parallelization for load imbalance and multi physics
*Virtual Single Processor by Integrated Multi-core Parallel Architecture
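The thread and process combinations listed on this slide are the factor pairs of the 32 computing cores. The small Python sketch below enumerates them; it assumes that every computing core is used and that each process gets the same number of threads, which is the homogeneous case only. The function name is illustrative.

```python
from typing import List, Tuple


def thread_process_combinations(cores: int = 32) -> List[Tuple[int, int]]:
    """List (threads, processes) pairs with threads * processes == cores.

    Assumes all cores are used and every process has the same thread count.
    """
    return [(t, cores // t) for t in range(cores, 0, -1) if cores % t == 0]


if __name__ == "__main__":
    for threads, processes in thread_process_combinations():
        print(f"{threads} T x {processes} P")
```

Which pair performs best depends on the application, as the slide notes, so in practice the choice is made by measurement.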
Effect of VISIMPACT
• Lower memory usage
– By reducing communication buffer for MPI
• Higher performance
– By reducing MPI communication cost
[Figure: Memory usage and Performance of #Threads x #Processes; axes: Normalized Memory Usage and Normalized Performance; annotations: Higher Performance, Lower Usage]
Enhanced Sector Cache
• Sector Cache (introduced in K computer)
– On a cache miss, the line to replace is chosen so that each sector keeps its specified size
• Like ‘Local Memory’
– Keeps reusable data in the cache by dividing the cache into segments
• Unlike ‘Local Memory’
– No need for a dedicated address
– No penalty to save and restore in context switch
• SPARC64™ XIfx supports 4 sectors each in the L1 cache (per core) and the L2 cache (per CMG)
– More usable than SPARC64™ IXfx, which has 2 sectors each in L1 and L2
– Each sector size can be specified separately
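The benefit of sectoring can be seen in a toy model. The Python sketch below is not the hardware's replacement algorithm: it models a small fully associative cache in which each sector is an independent LRU pool with its own line quota, and it assumes that every address is always accessed through the same sector. The class and function names are hypothetical.

```python
from collections import OrderedDict
from typing import List


class SectorCache:
    """Toy fully associative cache partitioned into sectors (simplified model)."""

    def __init__(self, sector_sizes: List[int]) -> None:
        # sector_sizes[i] is the number of lines sector i may occupy.
        self.sector_sizes = list(sector_sizes)
        self.sectors = [OrderedDict() for _ in self.sector_sizes]

    def access(self, address: int, sector: int) -> bool:
        """Return True on a hit; on a miss, fill the line and return False."""
        lines = self.sectors[sector]
        if address in lines:
            lines.move_to_end(address)      # mark as most recently used
            return True
        if len(lines) >= self.sector_sizes[sector]:
            lines.popitem(last=False)       # evict the LRU line of this sector only
        lines[address] = True
        return False


def reuse_hits(partitioned: bool) -> int:
    """Count hits on 4 reusable lines after a long streaming sweep."""
    if partitioned:
        cache = SectorCache([4, 4])   # sector 0: streaming, sector 1: reusable
        reuse_sector = 1
    else:
        cache = SectorCache([8])      # the same 8 lines without partitioning
        reuse_sector = 0
    reusable = [1000 + i for i in range(4)]
    for address in reusable:          # first touch: compulsory misses
        cache.access(address, reuse_sector)
    for address in range(100):        # streaming data that is never reused
        cache.access(address, 0)
    return sum(cache.access(address, reuse_sector) for address in reusable)


if __name__ == "__main__":
    print("hits with sectors   :", reuse_hits(True))
    print("hits without sectors:", reuse_hits(False))
```

In this model the streaming sweep evicts all reusable lines from the unpartitioned cache, while the partitioned cache keeps them, which is the "Local Memory"-like behavior described above.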
[Figure: example L2 cache divided into Sector 0 to Sector 3, holding instruction fetch / normal data / streaming data and reusable data 1 to 3]
Assistant core
• Assistant cores handle daemons, IO and MPI asynchronous communication instead of computation
– Each CMG has one assistant core, allocated as its 17th core
– The sector cache within the L2 cache allocates one sector to the assistant core to avoid cache pollution
• Minimizes performance degradation in large systems through jitter reduction
[Figure: Perf degradation ratio by jitter (model), annotated "Performance improvement"]
[Figure: CPU block diagram. CMG0 and CMG1 each have 17 cores (computing cores CC 0 to CC 15 plus one assistant core, AC), a 12MB 24-way L2 cache, and MACs connected to HMCs, alongside the PCI controller and Tofu2 controller.]
Performance
• SPARC64™ XIfx boosts performance through ISA and microarchitectural enhancements
– 97% execution efficiency for DGEMM
• Sector cache realizes the same effect as a 2.5x larger L1 cache
– 1.7x faster per core than SPARC64™ IXfx in real HPC applications such as fluid dynamics
[Figure: Real HPC Applications Performance per Core (normalized performance); annotations: 40% up by uArch, 33% up by ISA; contributing factors labeled Memory BW (HMC), Out-of-Order, Indirect access & Integer SIMD, 256-bit SIMD]
Reliability, Availability, Serviceability
• HPC systems require extensive RAS capability in the CPU and interconnect
• SPARC64™ XIfx inherits mainframe-level RAS features
– Number of checkers in the CPU increased to ~92,900
– Tofu2 buses support self-recovery and dynamic lane degradation
Units           Error Detection and Correction
Cache (Tags)    ECC, Parity & Duplicate
Cache (Data)    ECC, Parity
Registers       ECC (INT/FP), Parity (Others)
ALUs            Parity, Residue
[Figure: SPARC64™ XIfx RAS diagram. Color key: green = 1-bit error correctable, yellow = 1-bit error detectable, gray = 1-bit error harmless]
Other RAS features
• Cache dynamic degradation
• Hardware Instruction Retry
• Lane dynamic degradation for Tofu2
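The "Residue" entry for ALUs in the table above refers to residue checking, a general technique in which a small side computation modulo a fixed number cross-checks the main arithmetic result. The Python sketch below illustrates the idea for an adder. The modulus 3 and the fault-injection parameter are assumptions made for illustration; the slides do not describe how the checkers are implemented.

```python
MODULUS = 3  # assumption: the slides do not state which residue modulus is used


def checked_add(a: int, b: int, fault_mask: int = 0) -> int:
    """Add two non-negative integers and verify the sum with a residue check.

    `fault_mask` is XORed into the sum to emulate a hardware bit flip.
    """
    result = (a + b) ^ fault_mask                          # main ALU, possibly faulty
    expected = ((a % MODULUS) + (b % MODULUS)) % MODULUS   # small checker path
    if result % MODULUS != expected:
        raise ArithmeticError("residue check failed: ALU result is corrupted")
    return result


if __name__ == "__main__":
    print(checked_add(20, 22))               # fault-free addition passes the check
    try:
        checked_add(20, 22, fault_mask=4)    # flip one bit of the sum
    except ArithmeticError as error:
        print("detected:", error)
```

A single flipped bit changes the sum by a power of two, and no power of two is divisible by 3, so a mod-3 check detects every single-bit error in the result.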
Summary
• SPARC64™ XIfx is Fujitsu’s latest SPARC processor, designed for massively parallel supercomputing systems
• Enhance and inherit K computer features
– Stand-alone scalar many-core architecture
– VISIMPACT and Sector Cache
– On-chip integrated Tofu2
• Introduce new technologies to EXA scale
– HPC-ACE2
– HMC
– Assistant cores
• SPARC64™ XIfx has significantly improved the performance of real HPC applications
• As a next step, Fujitsu goes forward to EXA scale supercomputing
Abbreviations
• SPARC64™ XIfx
– RSA: Reservation Station for Address generation
– RSE: Reservation Station for Execution
– RSF: Reservation Station for Floating-point
– RSBR: Reservation Station for Branch
– GUB: General-purpose Update Buffer
– FUB: Floating-point Update Buffer
– GPR: General-Purpose Register
– FPR: Floating-Point Register
– CSE: Commit Stack Entry
– EAG: Effective Address Generator
– EX: Execution unit (Integer)
– FL: Floating-point unit
– HPC-ACE: High Performance Computing-Arithmetic Computational Extensions
– HMC: Hybrid Memory Cube
– Tofu: Torus-Fusion