Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator
Hiroaki Honda1, Farhad Mehdipour2, Hiroshi Kataoka1, Koji Inoue1, and Kazuaki J. Murakami1
1 Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
2 Center for Japan-Egypt Cooperation in Science and Technology, Kyushu University, Fukuoka, Japan
Email: [email protected]
Agenda
• Introduction
  • Single-flux quantum (SFQ) circuit
  • SFQ-reconfigurable data-path (RDP) processor
• Objective
• Implementing an Application on SFQ-RDP
  • Tool chain
  • Code modification
  • DFG extraction and mapping
• Performance Evaluation
  • Comparison with GPU and GPP results
• Conclusions
Top500 Supercomputer Ranking and Projection
• PetaFlop/s [= 10^6 GFlop/s] era from 2009: 1000 times speed-up in 10 years
• 1 ExaFlop/s [= 10^9 GFlop/s] can be attained in ~2019 and 10 ExaFlop/s in ~2022?? (within only the next ten years)
[Figure: Top500 performance projection, with 1 EFlop/s and 10 EFlop/s (~2022) marked]
Source: http://www.top500.org/
Energy Consumption Estimation for Floating Point Units (FPUs)
• Power per FPU (2 GHz) is larger than 10 mW (CMOS, ~8 nm in ~2019) 1)
• Power per 1 GFlop/s is larger than 5 mW
• Power consumption of the FPUs alone for a 10 ExaFlop/s system is larger than 5 mW × 10 × 10^9 = 50 MW (1 ExaFlop/s = 10^9 GFlop/s)
• Additional power consumption by memory, network, storage, …
It is extremely power-consuming to construct a 10 ExaFlop/s supercomputer system from CMOS circuit processors.
1) http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf, p178
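The 50 MW figure can be reproduced with a few lines of arithmetic. The sketch below uses only the two numbers quoted on the slide (more than 5 mW per GFlop/s, and 1 ExaFlop/s = 10^9 GFlop/s); the function and constant names are illustrative, not from the original.

```python
# Lower-bound FPU power estimate, using only the figures quoted on the slide.
MILLIWATTS_PER_GFLOPS = 5.0   # slide: more than 5 mW per GFlop/s
GFLOPS_PER_EXAFLOPS = 1e9     # slide: 1 ExaFlop/s = 10^9 GFlop/s


def fpu_power_megawatts(exaflops: float) -> float:
    """Return the lower bound on FPU power (in MW) for a system of the given size."""
    milliwatts = MILLIWATTS_PER_GFLOPS * exaflops * GFLOPS_PER_EXAFLOPS
    return milliwatts / 1e9   # 1 MW = 10^9 mW


if __name__ == "__main__":
    print(fpu_power_megawatts(10))   # 10 ExaFlop/s system
```

This counts the FPUs only; memory, network and storage would add to it, as the slide notes.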
Single-Flux Quantum (SFQ) Circuit
Pulse logic: bit-serial/bit-slice description for 32/64 bits
[Figure: superconductivity loop with a Josephson junction; SFQ pulse (quantized magnetic flux), ~1 mV, 2~3 ps]
Advantages
• Ultra-high-speed switching
• Ultra-low power
• No cost for latch
• Suitable for pipeline processing
• × 10~100 faster operation, × ~1/10 energy consumption
Disadvantages
• Difficult to implement feedback loops and conditional branches
• No practical SFQ memory
Single-Flux Quantum-Reconfigurable Data Path (SFQ-RDP) Computer
[Figure: system block diagram. A CMOS CPU (one chip) and main memory connect through streaming memory access controllers (SMACs) and an SB block to the SFQ-RDP chip, which operates at 4.2 K. Inside the chip, rows of PEs alternate with Operand Routing Networks (ORNs).]
System
• FPU SFQ-RDP chip: 80 GHz, 2-bit slice, 32 x 32 PEs, 2.5 GFLOPS/PE, ~2.5 TFLOPS/chip
• 10 TFLOPS @system (4 SFQ-RDP chips)
• SFQ 0.5 μm process
• Memory bandwidth per chip: 256 GB/s (max.) (= 16 GB/s × 16 channels)
• 2-port/1-port data accesses for input/output
Components
• PE: one FPU and data through units
• ORN (Operand Routing Network): network connecting PEs to PEs
• SMAC: streaming memory access controller
Features
• Large-scale two-dimensional floating-point unit array, data-path architecture
• Reconfigurable Operand Routing Network (ORN)
• No on-chip memory
• Dynamically reconfigurable PEs and ORNs
• Data flow is unidirectional
• No feedback loop
• Minimal amount of control circuits
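The peak figures on this slide follow from simple multiplication. The sketch below recomputes them from the quoted per-PE and per-channel numbers, assuming that peak throughput is just the number of PEs times the per-PE rate; the names are illustrative. Under that assumption the four-chip system comes out at roughly 10 TFLOPS.

```python
# Peak-rate arithmetic from the figures quoted on the slide.
PES_PER_CHIP = 32 * 32        # 32 x 32 PE array
GFLOPS_PER_PE = 2.5           # per-PE rate quoted on the slide
CHIPS_PER_SYSTEM = 4
CHANNELS_PER_CHIP = 16
GB_PER_S_PER_CHANNEL = 16


def chip_peak_gflops() -> float:
    """Peak GFLOPS of one chip, assuming every PE is busy every cycle."""
    return PES_PER_CHIP * GFLOPS_PER_PE


def system_peak_gflops() -> float:
    """Peak GFLOPS of the four-chip system."""
    return CHIPS_PER_SYSTEM * chip_peak_gflops()


def chip_bandwidth_gb_per_s() -> int:
    """Maximum memory bandwidth of one chip in GB/s."""
    return CHANNELS_PER_CHIP * GB_PER_S_PER_CHANNEL


if __name__ == "__main__":
    print(chip_peak_gflops(), system_peak_gflops(), chip_bandwidth_gb_per_s())
```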
CREST-JST SFQ-RDP Project (2006~): A Low-Power, High-Performance Reconfigurable Processor Based on Single-Flux Quantum Circuits
Goals:
• Discovering appropriate computation-intensive scientific applications
• Developing compiler tools
• Developing performance evaluation tools
• Designing the SFQ-LSRDP architecture
Participants:
• Kyushu Univ.: architecture, compiler, and applications
• Yokohama National Univ.: SFQ-FPU chip, cell library
• Nagoya Univ.: SFQ-RDP chip, cell library, and wiring
• Nagoya Univ.: CAD for logic design and arithmetic circuits
• Superconducting Research Lab. (SRL): SFQ process
Prototype 2x3 SFQ-RDP Processor and SFQ-MUL FPU
2x3 SFQ-RDP processor 1)
• 8-bit ALUs implementing: ADD, SUB, AND, OR, XOR
• Frequency: 25 GHz
• Process: 2 μm
• Area: 6.84 × 6.72 mm^2
• Power: 4.1 mW
SFQ floating-point multiplier 2)
• 16-bit FPUs: adder, multiplier
• MUL frequency: 32 GHz
• Performance: 2.6 GFLOPS
• Number of junctions: 11044 JJs
• Power consumption: 3.5 mW
• Circuit area: 6.22 × 3.78 mm^2
1) Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.
2) H. Hara, et al., "Design and Implementation of SFQ Half-Precision Floating-Point Multipliers," ASC08, 2008.
Objectives
• Performance evaluations by implementing practical applications, showing the possibility of efficient computations by the SFQ-RDP computer system
• Applications: 2D-diffusion, 2D-Finite-Difference Time-Domain (2D-FDTD)
• Comparisons of execution times with GPP and GPU
Points
• 2D-FPU array, data-flow architecture: Data Flow Graphs (DFGs) are extracted from applications and mapped onto the SFQ-RDP
• Compiler tools: compiler tools have to be developed
• No on-chip memory: DMA transfer of DRAM has to be fully used to avoid random accesses
• Dynamically reconfigurable PEs and ORNs: one-time reconfiguration is enough for both Diffusion & FDTD applications
Tool Chain for Implementation of an Application on SFQ-RDP
[Figure: tool-chain flow]
• Application: C/Fortran code
• Code modification using SFQ-RDP API → modified code
• Compiler developed for SFQ-RDP (input: RDP library file with function definitions & declarations) → object code (GPP)
• Data Flow Graph (DFG) extraction (semi-manual) → extracted DFG
• Placement and Routing Tool (input: RDP architecture description) → RDP configuration file (SFQ-RDP)
The tool chain is almost complete.
Implementing an Application on SFQ-RDP: 2D Diffusion
• Basic Finite Difference Method (FDM) formula
$$ f^{n+1}_{i,j} = C_0 \left( f^{n}_{i-1,j} + f^{n}_{i+1,j} \right) + C_1 \left( f^{n}_{i,j-1} + f^{n}_{i,j+1} \right) + C_2 f^{n}_{i,j} $$
[Figure: time development calculation by FDM; the grid at time step n+1 is computed from the grid at time step n. Axes: x (space, index i), y (space, index j), n (time)]
In/Out   Ops
5 / 1    7
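As a concrete reading of the FDM formula, the sketch below advances a small grid by one time step in plain Python. The function name and the choice to leave boundary points unchanged are assumptions made for illustration; the slides do not state a boundary condition.

```python
def diffusion_step(f, c0, c1, c2):
    """Return the grid at time step n+1 computed from the grid f at time step n.

    f is a list of rows indexed as f[i][j]. Boundary points are copied
    unchanged (an assumption; the slides give no boundary condition).
    """
    ni, nj = len(f), len(f[0])
    g = [row[:] for row in f]            # start from a copy of the old grid
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            # 5 inputs, 1 output, 7 operations (4 additions + 3 multiplications)
            g[i][j] = (c0 * (f[i - 1][j] + f[i + 1][j])
                       + c1 * (f[i][j - 1] + f[i][j + 1])
                       + c2 * f[i][j])
    return g


if __name__ == "__main__":
    grid = [[0.0] * 5 for _ in range(5)]
    grid[2][2] = 1.0                     # single hot point in the middle
    for row in diffusion_step(grid, 0.25, 0.125, 0.5):
        print(row)
```

Each interior update reads five values and writes one, which is where the 5 / 1 In/Out count and the 7 operations on the slide come from.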
Code Implementation and Modification for SFQ-RDP

Original Code for GPP ( n ⇒ n+1 )
  loop n
    loop i, j
      f(n+1)[i,j] = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] )
                  + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] )
                  + C2 * f(n)[i,j]
    end
  end

  In/Out   Ops   Byte/Flop
  5 / 1    7     3.4

Unrolled Loop Code for SFQ-RDP ( n ⇒ n+1 ): 9 formulas in loop body
  loop n
    loop i, j, (+3, +3)
      f(n+1)[i,j]     = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] )
                      + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] )
                      + C2 * f(n)[i,j]
      f(n+1)[i+1,j]   = C0 * ( f(n)[i,j] + f(n)[i+2,j] )
                      + C1 * ( f(n)[i+1,j-1] + f(n)[i+1,j+1] )
                      + C2 * f(n)[i+1,j]
      f(n+1)[i+2,j]   = …
      …
      f(n+1)[i+2,j+2] = …
    end
  end

  In/Out   Ops     Byte/Flop
  21 / 9   7 * 9   1.9

DFG extraction
[Figure: DFG extracted from the unrolled loop body]
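The In/Out counts for the unrolled loop can be checked by enumerating the grid points that a block of stencil updates touches. The sketch below is an illustration with a hypothetical helper name: it collects the distinct input points needed by a square block of 5-point stencils and the points the block writes.

```python
def block_io_counts(block: int):
    """Return (distinct inputs, outputs) for a block x block tile of 5-point stencils."""
    outputs = {(di, dj) for di in range(block) for dj in range(block)}
    inputs = set()
    for (i, j) in outputs:
        # each update reads its own point and its four nearest neighbours
        inputs |= {(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}
    return len(inputs), len(outputs)


if __name__ == "__main__":
    print(block_io_counts(1))   # original loop body: one formula
    print(block_io_counts(3))   # unrolled loop body: nine formulas
```

Neighbouring updates in the 3 x 3 tile share most of their inputs, so nine formulas need far fewer than 9 × 5 distinct values; this reuse is what lowers the Byte/Flop ratio after unrolling.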
Mapping Extracted DFG onto SFQ-RDP
[Figure: the extracted DFG is passed through placement and routing to produce the DFG mapping result (RDP configuration data)]
Improving Data Access Efficiency: Data Structure Conversion for DMA Transfer
• The unrolled loop includes 21 inputs and 9 outputs per calculation; reading them directly from f[i,j] causes random memory accesses.
• Data structure conversion (f[i,j] → A[i], B[i]): all two-dimensional f[i,j] values are divided and stored as two one-dimensional arrays, A[] and B[].
• 15 (A) + 15 (B) input data are accessed via two input ports; 9 output data are accessed.
• Sequential memory accesses: possible to use DMA transfer, with double buffering.
[Figure: input points and output points on the (i, j) grid, and their layout in Array A and Array B]
Performance Evaluation
Estimation of execution times
• GPP: simulation by a cycle-accurate processor simulator
• SFQ-RDP: performance evaluation modeling

System Architecture
[Figure: GPP (3.2 GHz) and SFQ-RDP (22x15 PEs, 80 GHz, 2 input / 1 output ports) connected to main memory; BW: 141.7, 157.0 GB/s]

System Configuration
  GPP   Processor type            Out-of-order
        Freq.                     3.2 GHz
        Inst. issue width         4 inst./CC
        L1 data cache             64 KB
        L2 unified cache          4 MB
        Latency of main mem.      300 CC
  RDP   Freq. (SFQ-RDP)           80 GHz
        Reconfiguration latency   30000 CC
        Main mem. bandwidth*      141.7, 157.0 GB/s
        No. PEs in a row          22
        No. PEs in a column       15
* BW numbers are based on the ones used for the GPU calculations
Results of Performance Evaluation
                 SFQ-RDP (GFlop/s)   GPU (GFlop/s)   Ratio (by GPU)   Ratio (by GPP)
  2D-Diffusion   50.6                63.0 1)         0.80             79.0
  2D-FDTD        23.4                31.4 2)         0.75             26.2
  1D-Diffusion   210.0 3)            -               -                -
  1D-Vibration   104.9 3)            -               -                -
Comparable results to GPU
• The SFQ-RDP processor, which is implemented with superconductivity circuits and a simple 2D-array architecture, can be used as an efficient accelerator
1) T. Aoki, et al., "CUDA programming primer" (in Japanese), Kougakusya, ISBN-10: 4777514773, 2009.
2) N. Takada, et al., "Speeding up of FDTD finite difference calculations by efficient use of GPU and shared memory" (in Japanese), Proceedings of Forum of Information Science and Technology, 2009.
3) H. Kataoka, et al., "Reducing Preprocessing Overhead Times in a Reconfigurable Accelerator of Finite Difference Applications," SAAHPC 10, Jul. 2010.
Why Can We Achieve Comparable Results?
Estimation of GFlop/s 2) assumes Max. BW 159.0 GB/s.

RDP Calc.
  Case                                          # of Operations   # of I/O 1)   Byte/Flop          Estimation of GFlop/s 2)
  Original formula (1 output)                   7                 5+1 = 6       6*4/7 = 3.42       ~4.7 (random access: ~16 GB/s)
  Unrolled loop formula (9 outputs formula)     7 * 9 = 63        21+9 = 30     30*4/63 = 1.90     ~8.4 (random access: ~16 GB/s)
  Data structure conversion for DMA transfer    7 * 9 = 63        30+9 = 39     39*4/63 = 2.48     64.1 (DMA: ~159.0 GB/s)
  With GPP calc., comm. and other overheads                                                        50.6 (DMA: ~159.0 GB/s)
GPU Calc.
  Aoki et al. 3)                                                                                   63.0 4)

1) Based on the utilization of HW for rearrangement of input data
2) Single precision calculation, BW 159.0 GB/s, GeForce GTX 285
3) GeForce GTX 285, 1 proc. calculation: (1024x1204 mesh)
4) T. Aoki, et al., "CUDA programming primer" (in Japanese), Kougakusya, ISBN-10: 4777514773, 2009.
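The Byte/Flop and GFlop/s columns appear to follow a simple bandwidth-bound model: bytes moved per operation, then bandwidth divided by that ratio. The sketch below reproduces the table under that assumption (4 bytes per single-precision value); the function names are illustrative and the model is inferred from the numbers, not stated on the slide.

```python
BYTES_PER_VALUE = 4   # single precision, as in footnote 2


def bytes_per_flop(n_io: int, n_ops: int) -> float:
    """Bytes transferred per floating-point operation."""
    return n_io * BYTES_PER_VALUE / n_ops


def bandwidth_bound_gflops(bw_gb_per_s: float, n_io: int, n_ops: int) -> float:
    """GFlop/s attainable if memory bandwidth is the only limit (assumed model)."""
    return bw_gb_per_s / bytes_per_flop(n_io, n_ops)


if __name__ == "__main__":
    cases = [
        ("Original formula", 6, 7, 16.0),          # random access, ~16 GB/s
        ("Unrolled loop formula", 30, 63, 16.0),   # random access, ~16 GB/s
        ("DMA transfer", 39, 63, 159.0),           # sequential access, ~159.0 GB/s
    ]
    for name, n_io, n_ops, bw in cases:
        print(name,
              round(bytes_per_flop(n_io, n_ops), 2),
              round(bandwidth_bound_gflops(bw, n_io, n_ops), 1))
```

With unrounded Byte/Flop the last case comes out near 64.2; the slide's 64.1 matches dividing 159.0 by the rounded value 2.48. The data structure conversion raises Byte/Flop slightly (39 values instead of 30), but the switch from random access to DMA bandwidth more than compensates.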
Conclusions and Future Works
Conclusions
• A Single-Flux Quantum Reconfigurable Data-Path (SFQ-RDP) processor with a two-dimensional floating-point array architecture, implemented by superconducting circuits, was introduced.
• Two-dimensional heat (2D-Heat, i.e. 2D-Diffusion) and Finite Difference Time Domain (2D-FDTD) applications were implemented on SFQ-RDP, and performance evaluations were conducted.
• For 2D-Heat and 2D-FDTD, computation 79.0 and 26.2 times faster, respectively, than on a general-purpose processor was achievable, and these performance values were comparable to reported results for the GPU.
• The SFQ-RDP accelerator can be used for practical scientific calculations, especially those based on finite difference methods.
Future Works
• Implementations and performance evaluations of other applications
Acknowledgement
This research was supported by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).
Other SFQ-RDP research members
• CAD for logic design and arithmetic circuits: Prof. N. Takagi (Leader), Prof. K. Takagi (Kyoto Univ.)
• SFQ-RDP chip, cell library, and wiring: Prof. A. Fujimaki, Prof. H. Akaike, Prof. M. Tanaka (Nagoya Univ.)
• SFQ-FPU chip, cell library: Prof. N. Yoshikawa (Yokohama National Univ.)
• SFQ process: Dr. S. Nagasawa, Dr. M. Hidaka (SLRC)