Synthesis of Platform Architectures from OpenCL Programs
Muhsen Owaida
Konstantis Daloukas
Nikolaos Bellas
Christos D. Antonopoulos
Department of Computer and Communication Engineering, University of Thessaly
Volos, Greece
05/02/23 FCCM 2011 2
Introduction
• High-Level Synthesis (HLS) has been at the research forefront in the last few years.
• A variety of programming models have been introduced: C/C++, C-like languages, MATLAB, CUDA.
• Obstacles:
– Parallelism expression.
– Extensive compiler transformations and optimizations.
Motivation
• Lack of a parallel programming language for reconfigurable platforms.
• A major shift of the computing industry toward many-core computing systems.
• Reconfigurable fabrics bear a strong resemblance to many-core systems.
Contribution
• Silicon-OpenCL ("SOpenCL").
• A tool flow to convert an unmodified OpenCL application into a SoC design with HW/SW components.
• Template-based hardware accelerator generation.
• Decoupling of data movement and computations.
[Figure: SOpenCL tool flow. Labels: Front End, Back End, OpenCL Kernel, C Function, Architectural Template (Streaming Unit, Datapath, Input data, Output data), Simulation & Verification, System-on-Chip (On-Chip CPU, On-Chip Bus, two HW Accelerators, Off-Chip Memory), Drivers & runtime.]
Example OpenCL kernel from the figure:
    __kernel void VecAdd2D(int *A, int *B, int *C) {
        /* width is not defined on the slide; assumed to be declared elsewhere */
        int i = get_local_id(0);
        int j = get_local_id(1);
        C[i*width + j] = A[i*width + j] + B[i*width + j];
    }
Outline
• High-Level Synthesis
• OpenCL Programming Model
• SOpenCL
– Front-End
– Back-End
– Run-Time
• Experimental Evaluation
• Conclusion
OpenCL Programming Language
• Open Computing Language.
• OpenCL expresses parallelism at its finest granularity.
• The computation grid is partitioned into work groups in a 3-dimensional space.
[Figure: a computation grid of size Gx × Gy divided into work groups of size Sx × Sy. Inside work group (idx, idy), local coordinates run from (x = 0, y = 0) to (x = Sx - 1, y = Sy - 1); the work item at local position (x, y) is work item (idx*Sx + x, idy*Sy + y) of the grid.]
Work-item thread:
    void chromaMotionCompensation(char* refF, char* outF, int FWidth) {
        /* DXDY, dxDY, DXdy, dxdy are not defined on the slide; assumed to be declared elsewhere */
        int Pval;  /* declaration not shown on the slide */
        int i = get_local_id(0);
        int j = get_local_id(1);
        int refX = get_group_id(0);
        int refY = get_group_id(1);
        int PixX = get_global_id(0);
        int PixY = get_global_id(1);
        Pval = ( DXDY * refF[ (refY + j) * FWidth + (refX + i) ]
               + dxDY * refF[ (refY + j) * FWidth + (refX + i + 1) ]
               + DXdy * refF[ (refY + j + 1) * FWidth + (refX + i) ]
               + dxdy * refF[ (refY + j + 1) * FWidth + (refX + i + 1) ]
               + 32 ) >> 6;
        if (Pval < 0) Pval = 0;
        else if (Pval > 255) Pval = 255;
        outF[ PixY * FWidth + PixX ] = Pval;
    }
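The index arithmetic on this slide, (idx*Sx + x, idy*Sy + y), can be checked with a small host-side C sketch. The helper names `global_id` and `num_groups` are illustrative and are not part of OpenCL or SOpenCL; the sketch assumes a zero global offset and a work-group size that divides the grid evenly.

```c
#include <assert.h>

/* Global coordinate of a work item along one dimension.
 *   group_id   : index of the work group (idx or idy on the slide)
 *   local_size : work-group size along this dimension (Sx or Sy)
 *   local_id   : position inside the work group (x or y)
 * Hypothetical helper: it mirrors what get_global_id() is expected
 * to return when the global offset is zero. */
static int global_id(int group_id, int local_size, int local_id)
{
    assert(local_size > 0);
    assert(local_id >= 0 && local_id < local_size);
    return group_id * local_size + local_id;
}

/* Number of work groups along one dimension, given the total number
 * of work items (global_size). Exact division is an assumption of
 * this sketch. */
static int num_groups(int global_size, int local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with Sx = 8 the work item at local position x = 3 in work group idx = 2 has global x coordinate 2*8 + 3 = 19.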
Data Movement
• Explicit data movement: local buffers and global buffers.
SOpenCL Front-End (I): Granularity Coarsening
• A work item represents a light computational load.
• Coarsen the granularity because of limited resources and memory bandwidth.
OpenCL kernel:
    void chromaMotionCompensation(char* refF, char* outF, int FWidth) {
        int Pval;  /* declaration not shown on the slide */
        int i = get_local_id(0);
        int j = get_local_id(1);
        int refX = get_group_id(0);
        int refY = get_group_id(1);
        int PixX = get_global_id(0);
        int PixY = get_global_id(1);
        Pval = ( DXDY * refF[ (refY + j) * FWidth + (refX + i) ]
               + dxDY * refF[ (refY + j) * FWidth + (refX + i + 1) ]
               + DXdy * refF[ (refY + j + 1) * FWidth + (refX + i) ]
               + dxdy * refF[ (refY + j + 1) * FWidth + (refX + i + 1) ]
               + 32 ) >> 6;
        if (Pval < 0) Pval = 0;
        else if (Pval > 255) Pval = 255;
        outF[ PixY * FWidth + PixX ] = Pval;
    }
C function:
    void chromaMotionCompensation(char* refF, char* outF, int* local_size, int FWidth,
                                  int refX, int refY, int PX_init, int PY_init) {
        int kernel_i, kernel_j, kernel_k, Pval, i, j, PixX, PixY;
        for (kernel_k = 0; kernel_k < local_size[2]; kernel_k++) {
            for (kernel_j = 0; kernel_j < local_size[1]; kernel_j++) {
                for (kernel_i = 0; kernel_i < local_size[0]; kernel_i++) {
                    PixX = PX_init + kernel_i;
                    PixY = PY_init + kernel_j;
                    i = kernel_i;
                    j = kernel_j;
                    Pval = ( DXDY * refF[ (refY + j) * FWidth + (refX + i) ]
                           + dxDY * refF[ (refY + j) * FWidth + (refX + i + 1) ]
                           + DXdy * refF[ (refY + j + 1) * FWidth + (refX + i) ]
                           + dxdy * refF[ (refY + j + 1) * FWidth + (refX + i + 1) ]
                           + 32 ) >> 6;
                    if (Pval < 0) Pval = 0;
                    else if (Pval > 255) Pval = 255;
                    outF[ PixY * FWidth + PixX ] = Pval;
                }
            }
        }
    }
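The same transformation can be tried on a smaller kernel. The C sketch below coarsens a 2-D vector addition in the style of the VecAdd2D example from the Contribution slide. It is a hand-written analogue, not SOpenCL output; the function name `vec_add_2d_group` and the parameters `local_size` and `width` are assumptions made for this illustration.

```c
#include <assert.h>

/* One call processes a whole work group: the implicit work-item
 * indices (get_local_id) become explicit loop counters.
 * local_size[d] is the work-group size in dimension d and width is
 * the row pitch of the arrays. All names are hypothetical. */
static void vec_add_2d_group(const int *A, const int *B, int *C,
                             const int local_size[3], int width)
{
    assert(width >= local_size[1]);
    for (int k = 0; k < local_size[2]; k++) {
        for (int j = 0; j < local_size[1]; j++) {
            for (int i = 0; i < local_size[0]; i++) {
                /* Body of the per-work-item kernel, unchanged. */
                C[i * width + j] = A[i * width + j] + B[i * width + j];
            }
        }
    }
}
```

Each call now carries the work of local_size[0] * local_size[1] * local_size[2] work items, which is the coarser unit the accelerator is built around.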
SOpenCL Front-End (II): Barrier Elimination
OpenCL code:
    Statements_block1
    barrier();
    Statements_block2
C code:
    triple_nested_loop {
        Statements_block1
    }
    // barrier();
    triple_nested_loop {
        Statements_block2
    }
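A concrete instance of this loop splitting, written by hand in C. The kernel (reversing the elements of a work group through a temporary buffer), the name `reverse_group`, and the constant `GROUP_SIZE` are hypothetical and only illustrate why the barrier becomes the boundary between two loops.

```c
#include <assert.h>

#define GROUP_SIZE 4  /* illustrative work-group size */

/* Analogue of a kernel that, per work item, executes
 *     tmp[id] = in[id];  barrier();  out[id] = tmp[GROUP_SIZE-1-id];
 * After coarsening, a single loop cannot hold both statements:
 * iteration 0 would read tmp[GROUP_SIZE-1] before it is written.
 * Splitting the loop at the barrier restores the required order. */
static void reverse_group(const int *in, int *out)
{
    int tmp[GROUP_SIZE];  /* stands in for an OpenCL local buffer */
    assert(in && out);

    /* Statements_block1, executed for every work item */
    for (int id = 0; id < GROUP_SIZE; id++)
        tmp[id] = in[id];

    /* barrier() is implied here: the first loop has completed */

    /* Statements_block2, executed for every work item */
    for (int id = 0; id < GROUP_SIZE; id++)
        out[id] = tmp[GROUP_SIZE - 1 - id];
}
```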
Hardware Generation
• Perform a series of optimizations and transformations.
– Uses the LLVM compiler infrastructure.
• Generate synthesizable Verilog.
• Generate test bench and simulation files.
[Figure: back-end flow. Labels: C code (nested loop), LLVM compiler, optimize LLVM-IR, predication, code slicing, SMS modulo scheduling, Verilog generation, synthesizable Verilog, test bench, simulation, synthesis, final bitstream. Additional inputs: accelerator template, user performance requirements.]
If-Conversion
• Predication: if-conversion is necessary for applying the modulo scheduler.
Innermost loop body (LLVM assembly), before if-conversion:
    bb0:
        r0 = cmp eq t, 0
        br r0, bb1, bb2
    bb1:
        r1 = load A
        br bb3
    bb2:
        r2 = add a, 1
        br bb3
    bb3:
        r4 = phi r1, bb1, r2, bb2
        br bb4
After if-conversion (predicates shown in parentheses):
    bb0:
        r0 = cmp eq t, 0
        p0 = xor r0, true
        (r0) r1 = load A
        (p0) r2 = add a, 1
        r4 = select r0, r1, r2
        br bb4
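The same transformation expressed in C, as a rough sketch. The function names `branchy` and `predicated` are illustrative; the variable names follow the LLVM registers on the slide. The sketch assumes the load is safe to execute even when its predicate is false, which a real predicated datapath would guard in hardware.

```c
#include <assert.h>

/* Control-flow version: exactly one of two basic blocks executes. */
static int branchy(int t, const int *A, int a)
{
    int r4;
    if (t == 0)
        r4 = *A;      /* bb1: load A   */
    else
        r4 = a + 1;   /* bb2: add a, 1 */
    return r4;
}

/* If-converted version: a single straight-line block with no branch
 * in the source. r0 and p0 are complementary predicates; both
 * operations are computed and a select keeps the live result. */
static int predicated(int t, const int *A, int a)
{
    int r0 = (t == 0);
    int p0 = !r0;            /* p0 = xor r0, true */
    int r1 = *A;             /* (r0) r1 = load A  */
    int r2 = a + 1;          /* (p0) r2 = add a, 1 */
    assert(r0 != p0);        /* exactly one predicate holds */
    return r0 ? r1 : r2;     /* r4 = select r0, r1, r2 */
}
```

With the branch gone, the loop body is one basic block, which is the form a modulo scheduler operates on.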
Code Slicing
• Decouple data movement and computations.
• Input streaming kernel (Sin).
• Output streaming kernel (Sout).
• Computational kernel.
Predicated LLVM loop (part of chroma interpolation):
    body:
        i46 = phi [true, preh], [i41, body]
        ind = phi [0, preh], [i2, body]
        i0 = add a0, ind
        i2 = add ind, 1
        i3 = add a0, i2
        gep0 = getelementptr i8* x0, i0
        gep1 = getelementptr i8* x0, i3
        i7 = load i8* gep0
        i10 = load i8* gep1
        i6 = add a2, ind
        gep4 = getelementptr i8* x1, i6
        i9 = mul i7, a3
        i12 = mul i10, a4
        i19 = add i9, 32
        i20 = add i19, i12
        i23 = ashr i22, 6
        store i23, i8* gep4
        i40 = icmp eq i2, 8
        i41 = xor i40, true
        br i40, exit, body
Sin kernel:
        ind = phi [0, preh], [i2, body]
        i0 = add a0, ind
        i2 = add ind, 1
        i3 = add a0, i2
        gep0 = getelementptr i8* x0, i0
        gep1 = getelementptr i8* x0, i3
        i7 = load i8* gep0
        i10 = load i8* gep1
Computational kernel (figure annotations: computation, termination):
        i46 = phi [true, preh], [i41, body]
        ind = phi [0, preh], [i2, body]
        i2 = add ind, 1
        i7 = pop i8* gep0
        i10 = pop i8* gep1
        i9 = mul i7, a3
        i12 = mul i10, a4
        i19 = add i9, 32
        i20 = add i19, i12
        i23 = ashr i22, 6
        push i23, i8* gep4
        i40 = icmp eq i2, 8
        i41 = xor i40, true
        br i40, exit, body
Sout kernel:
        ind = phi [0, preh], [i2, body]
        i2 = add ind, 1
        i6 = add a2, ind
        gep4 = getelementptr i8* x1, i6
        store i23, i8* gep4
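The three slices can be mimicked in C with FIFOs standing in for the pop/push streams. This is a simplified sketch under several assumptions: the `fifo_t` type and all function names are hypothetical, the arithmetic is reduced to (i7*a3 + i10*a4 + 32) >> 6, bytes are treated as unsigned, and the slices run one after another here, whereas in hardware they would operate concurrently.

```c
#include <assert.h>

#define N 8          /* loop trip count, as in "icmp eq i2, 8" */
#define FIFO_CAP 16  /* illustrative FIFO depth */

/* Minimal non-circular FIFO, sufficient for one loop execution. */
typedef struct { int buf[FIFO_CAP]; int head, tail; } fifo_t;

static void push(fifo_t *f, int v) { assert(f->tail < FIFO_CAP); f->buf[f->tail++] = v; }
static int  pop(fifo_t *f)         { assert(f->head < f->tail);  return f->buf[f->head++]; }

/* Sin kernel: address generation and loads only. */
static void sin_kernel(const unsigned char *x0, int a0, fifo_t *s0, fifo_t *s1)
{
    for (int ind = 0; ind < N; ind++) {
        push(s0, x0[a0 + ind]);      /* i7  = load gep0 */
        push(s1, x0[a0 + ind + 1]);  /* i10 = load gep1 */
    }
}

/* Computational kernel: no addresses, only pop, compute, push. */
static void compute_kernel(fifo_t *s0, fifo_t *s1, fifo_t *out, int a3, int a4)
{
    for (int ind = 0; ind < N; ind++) {
        int i7 = pop(s0);
        int i10 = pop(s1);
        push(out, (i7 * a3 + i10 * a4 + 32) >> 6);
    }
}

/* Sout kernel: address generation and stores only. */
static void sout_kernel(unsigned char *x1, int a2, fifo_t *out)
{
    for (int ind = 0; ind < N; ind++)
        x1[a2 + ind] = (unsigned char)pop(out);
}
```

Because the computational slice never forms an address, its datapath can be scheduled independently of memory latency, which the FIFOs absorb.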
Modulo Scheduling
• Software pipelining:
– II: initiation interval.
• Swing Modulo Scheduling (SMS).
• Valid bits are used to implement the prologue and epilogue.
[Figure: software pipeline. Iterations 1 to N each pass through stages A to E, with successive iterations starting II apart. Prologue: fill pipeline. Kernel: steady state. Epilogue: drain pipeline.]
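The timing of such a pipeline, and the role of the valid bits, can be sketched in C. Both helpers are hypothetical and assume the schedule length is exactly stages × II, so that each stage occupies one slot of II cycles.

```c
#include <assert.h>

/* Cycles needed to run n iterations through a software pipeline with
 * `stages` stages of II cycles each. A new iteration starts every II
 * cycles; the last one needs `stages` slots to drain. */
static long pipeline_cycles(long n, int stages, int ii)
{
    assert(n > 0 && stages > 0 && ii > 0);
    return (n + stages - 1) * (long)ii;
}

/* Valid bit of stage s (0-based) during slot t (0-based).
 * Stage s works on iteration t - s, which exists only if it lies in
 * [0, n). Slots in which some stages are invalid form the prologue
 * (pipeline filling) and the epilogue (pipeline draining). */
static int stage_valid(long t, int s, long n)
{
    long iter = t - s;
    return iter >= 0 && iter < n;
}
```

With five stages, II = 2 and 100 iterations, the sketch gives (100 + 5 - 1) * 2 = 208 cycles, against 100 * 5 * 2 = 1000 cycles if iterations did not overlap.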
Verilog Generation
[Figure: instance of the architectural template. Streaming unit: Sin AGU, Sout AGU, Sin requests generator, arbiter, cache unit, Sin align unit, Sout align unit; signals to the system interconnect: address, data_in, data_out, data_line, local request. Data path: functional units (FU), multiplexers, named registers, memory-mapped registers, tunnel. Streams Sin0, Sin1, and Sout0 and a terminate signal connect the streaming unit and the data path.]
[The figure repeats the Sin, Computational, and Sout kernels from the Code Slicing slide, with the annotations "Feed data in order" and "Write data in order".]
• Annotated parameters: FU types, bitwidths, I/O bandwidth, requests/data FIFO size.
Run-Time
• The OpenCL main program runs as the main thread on the host processor of the platform (e.g. PowerPC).
• Work tasks are created by the helper thread.
[Figure: run-time system. Labels: host main thread, host helper thread, command queue, work queue, accelerator, work thread (PowerPC). Steps numbered 1 to 5 include: enqueue OpenCL command, enqueue new work tasks, initialize accelerator, finish signal.]
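One way a helper thread could create work tasks is to split the computation grid of an enqueued kernel into one task per work group. The slides do not specify the task granularity or the queue layout, so everything in this C sketch (the `work_task_t` and `work_queue_t` types, `create_work_tasks`, and the fixed capacity) is an assumption made for illustration.

```c
#include <assert.h>

#define MAX_TASKS 64  /* illustrative work-queue capacity */

/* Assumed task granularity: one work task per work group. */
typedef struct { int gx, gy; } work_task_t;

typedef struct { work_task_t tasks[MAX_TASKS]; int count; } work_queue_t;

/* Split a Gx x Gy grid into work groups of Sx x Sy and queue one task
 * per group, row by row. Returns the number of tasks created.
 * Exact division of the grid is an assumption of this sketch. */
static int create_work_tasks(work_queue_t *q, int Gx, int Gy, int Sx, int Sy)
{
    assert(Sx > 0 && Sy > 0 && Gx % Sx == 0 && Gy % Sy == 0);
    q->count = 0;
    for (int gy = 0; gy < Gy / Sy; gy++) {
        for (int gx = 0; gx < Gx / Sx; gx++) {
            assert(q->count < MAX_TASKS);
            q->tasks[q->count].gx = gx;
            q->tasks[q->count].gy = gy;
            q->count++;
        }
    }
    return q->count;
}
```

A consumer of the queue would then start the accelerator (or a software work thread) once per task and wait for its finish signal before taking the next one.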
Experimental Evaluation
• We tested the SOpenCL methodology on six OpenCL and C applications.
• We evaluated our designs on a Xilinx Virtex-5 FX70 FPGA.
• We used the Xilinx ISE 11.4 toolset for synthesis, placement, and routing.
• Evaluation methodology:
– Three levels of resource availability {Ca, Cb, Cc}.
– Three requests/data FIFO sizes.
– Cache usage.
Results
[Figure: two charts, MatMul and VAdd. Each plots execution time (ms) and reads issued (x1000) against data/request FIFO size (2, 4, 8) for configurations Ca, Cb, and Cc.]
Results
• The cache is useful for applications with temporal locality.
[Figure: three charts, LMC, 1-D DCT, and CMC. Each plots execution time (ms) and reads issued (x1000) for datapath configurations Ca, Cb, and Cc, with cache and without cache.]
Conclusion
• SOpenCL: a tool flow to produce the hardware and software architecture of accelerator-based SoCs.
• OpenCL serves as a unified programming model for:
– Heterogeneous many-core platforms.
– Reconfigurable platforms (such as FPGAs).
• Future work:
– Support for multiple accelerators.
– Automatic selection of hardware configurations.
Questions
Thank you for your attention