Lithographic Aerial Image Simulation with FPGA-Based Hardware Acceleration
Jason Cong and Yi Zou
UCLA Computer Science Department
2
Lithography Simulation (Application)
Simulation of the optical imaging process
Computationally intensive and quite slow for full-chip simulation
3
XtremeData Inc.'s XD1000(TM) Coprocessor System (Platform)
Socket-compatible: replaces one Opteron CPU with the XD1000 coprocessor
The module connects to the CPU's HyperTransport bus and motherboard DIMMs while utilizing the existing power supply and heat-sink solution for the CPU
Dedicated DIMM for the FPGA (not shared with the CPU)
The coprocessor communicates with the CPU via the HyperTransport link and behaves much like a PCI device
4
Approach: Use of C-to-RTL Tools
Used two tools in our work:
• CoDeveloper (Impulse C) by Impulse Accelerated Technologies
• AutoPilot by AutoESL Design Technologies
Advantages:
• Maintain the design at the C level
• Shorten the development cycle
Performed several tunings and refinements at the C level:
• Loop interchange, loop unrolling, and loop pipelining
• Data distribution and memory partitioning
• Data prefetching / overlapping computation and communication
5
Imaging Equations

I(x,y) = \sum_{k=1}^{K} \Big| \sum_{n=1}^{N} \big[ \phi_k(x - x_1^{(n)}, y - y_1^{(n)}) - \phi_k(x - x_2^{(n)}, y - y_1^{(n)}) + \phi_k(x - x_2^{(n)}, y - y_2^{(n)}) - \phi_k(x - x_1^{(n)}, y - y_2^{(n)}) \big] \Big|^2

I(x,y): image intensity at (x,y)
\phi_k(x,y): kth kernel
\psi_k(x,y): kth eigenvector
(x_1^{(n)}, y_1^{(n)}), (x_2^{(n)}, y_2^{(n)}), (x_1^{(n)}, y_2^{(n)}), (x_2^{(n)}, y_1^{(n)}): corners of the nth layout rectangle
Q(x,y): mask transmittance

Pseudo code of the imaging equation:
Loop over different rectangles
  Loop over kernels
    Loop over pixels

\phi_k(x,y) = Q(x,y) \otimes \psi_k(x,y)
6
Loop Interchange

Before interchange:
Loop over pixels
  Loop over kernels
    Loop over layout corners

After interchange:
Loop over kernels
  Loop over layout corners
    Loop over pixels

Different kernels do not have much correlation, so the kernel loop is moved to the outermost position
With one specific layout corner fixed, looping over pixels gives more regular data access
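The interchange can be shown in a small C sketch (the `contrib` function and the sizes are illustrative stand-ins for the kernel lookup; both orders accumulate the same partial sums, only the access pattern changes):

```c
#include <string.h>

#define NK 2   /* kernels        */
#define NC 3   /* layout corners */
#define NP 4   /* pixels         */

double acc[NK][NP];

/* Stand-in for the shifted-kernel contribution of corner c to
   pixel p under kernel k. */
static double contrib(int k, int c, int p) {
    return (double)(k + 1) * (c - 1) + p;
}

/* Before interchange: pixels outermost. */
void before(void) {
    memset(acc, 0, sizeof acc);
    for (int p = 0; p < NP; p++)
        for (int k = 0; k < NK; k++)
            for (int c = 0; c < NC; c++)
                acc[k][p] += contrib(k, c, p);
}

/* After interchange: kernels outermost (little inter-kernel reuse),
   pixels innermost (regular streaming access for a fixed corner). */
void after(void) {
    memset(acc, 0, sizeof acc);
    for (int k = 0; k < NK; k++)
        for (int c = 0; c < NC; c++)
            for (int p = 0; p < NP; p++)
                acc[k][p] += contrib(k, c, p);
}
```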
7
Interpretation of Inner Loop after Loop Interchange

[Figure: the image (partial sum) for one object (one rectangle) is built by adding and subtracting copies of the kernel array shifted to the rectangle's layout corners]

Imaging equation, with the loop over different layout corners and pixels:

I(x,y) = \sum_{k=1}^{K} \Big| \sum_{n=1}^{N} \big[ \phi_k(x - x_1^{(n)}, y - y_1^{(n)}) - \phi_k(x - x_2^{(n)}, y - y_1^{(n)}) + \phi_k(x - x_2^{(n)}, y - y_2^{(n)}) - \phi_k(x - x_1^{(n)}, y - y_2^{(n)}) \big] \Big|^2

The partial image computed by the inner sum is a weighted sum of shifted kernels, and the amount of shift is determined by the layout corners
8
Loop Unrolling

Loop unrolling is one option to express parallelism in these tools
The improvement from loop unrolling is limited due to port conflicts:
• Accesses to the same array cannot be scheduled in the same cycle due to port conflicts
• May increase the initiation interval when loop pipelining and loop unrolling are used together
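A minimal C sketch of manual unrolling (illustrative only; `kernel`, `SIZE`, and the functions are hypothetical). The point of the slide is that, in an HLS tool, the four `kernel[]` reads in the unrolled body still contend for the same memory's ports, so they cannot all issue in one cycle unless the array is partitioned:

```c
#define SIZE 16
static double kernel[SIZE];

/* Rolled reduction loop. */
double sum_rolled(void) {
    double s = 0.0;
    for (int i = 0; i < SIZE; i++)
        s += kernel[i];
    return s;
}

/* Manually unrolled by 4 with separate accumulators. The four reads
   of kernel[] per iteration would conflict on a dual-ported on-chip
   RAM, which is why unrolling alone gives limited speedup. */
double sum_unrolled(void) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < SIZE; i += 4) {
        s0 += kernel[i];
        s1 += kernel[i + 1];
        s2 += kernel[i + 2];
        s3 += kernel[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```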
9
Further Parallelization Needs Memory Partitioning

Unrolling did not solve the problem completely due to port conflicts
Need a multi-port (on-chip) memory with a large number of ports!
Implement the multi-port memory via memory partitioning
Computing tasks can be done in parallel once we can fetch multiple data items in parallel
• Each PE is responsible for computing one partition of the image
• Each PE is composed of one partition of the kernel and one partition of the image partial sum
• Multiplexing logic fetches the data from different partitions of the kernel and provides the data to each PE
• To compute one partition of the image, a PE might also need kernel data from other partitions

[Figure: 4-PE example — each computing element holds one partition of the kernel and one partition of the image partial sum; multiplexing logic routes kernel data among the four partitions]
10
Choosing Partitioning Schemes

A less optimal partitioning design (shown here for a 2 x 2 example)
Block scheduling avoids data-access contention (at any time each PE accesses a different kernel partition)
Might face a load-balancing problem if the required kernel data lie mostly in some partitions
The computing task is partitioned into blocks/stages:

Time | PE 1                | PE 2                | PE 3                | PE 4
  1  | kernel 1 -> image 1 | kernel 2 -> image 2 | kernel 3 -> image 3 | kernel 4 -> image 4
  2  | kernel 2 -> image 1 | kernel 3 -> image 2 | kernel 4 -> image 3 | kernel 1 -> image 4
  3  | kernel 3 -> image 1 | kernel 4 -> image 2 | kernel 1 -> image 3 | kernel 2 -> image 4
  4  | kernel 4 -> image 1 | kernel 1 -> image 2 | kernel 2 -> image 3 | kernel 3 -> image 4

(each entry: using that kernel partition, compute that image partition)
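The rotation in the table above reduces to one modular expression; a minimal C sketch (names are illustrative) that can be checked for contention-freedom:

```c
#define NPART 4   /* number of partitions = number of PEs */

/* Block schedule from the table (0-indexed): at stage t, PE p reads
   kernel partition (p + t) mod NPART while updating image partition p,
   so no two PEs touch the same kernel partition in the same stage. */
int kernel_for(int pe, int stage) {
    return (pe + stage) % NPART;
}
```

Because (p + t) mod NPART is a permutation of the PEs at every stage, each kernel partition is read by exactly one PE per stage, and over NPART stages every image partition sees every kernel partition.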
11
Choosing Partitioning Schemes (Cont.)

Data partitioning for load balancing (here, different colors denote different partitions)
Memory banking using the lower address bits

[Figure: Kernel Array and Image Partial Sum Array, each interleaved across partitions 1-4]
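Banking by the lower address bits can be sketched as follows (a software model, with illustrative names and sizes): four single-ported banks together behave like a four-port memory whenever the four addresses fall in four different banks, which cyclic (low-bit) interleaving guarantees for consecutive addresses:

```c
#define NB 4                 /* number of banks      */
#define SZ 64                /* total elements       */

/* One bank per partition; addr's low bits select the bank,
   the high bits select the offset within the bank. */
static double banks[NB][SZ / NB];

static void write_elem(int addr, double v) {
    banks[addr % NB][addr / NB] = v;
}

static double read_elem(int addr) {
    return banks[addr % NB][addr / NB];
}

/* Four reads that could issue in the same cycle: the addresses
   differ in their low two bits, so each hits a distinct bank. */
static void read4(int base, double out[NB]) {
    for (int b = 0; b < NB; b++)
        out[b] = read_elem(base + b);
}
```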
12
Address Generation and Data Multiplexing

Need address-generation logic to provide the addresses for the kernel data and the image partial sums, since the memory is partitioned
Need data-multiplexing logic to deliver the data from multiple memory blocks to the correct place
Implemented as 2D ring-based shifting (better than a naive mux for larger partitionings)

Start from:
Reg_1 = array_a[..]
Reg_2 = array_b[..]
Reg_3 = array_c[..]
Reg_4 = array_d[..]

Wanted:
Reg_1 = array_c[..]
Reg_2 = array_d[..]
Reg_3 = array_a[..]
Reg_4 = array_b[..]

[Figure: the 2x2 register grid (Reg_1 Reg_2 / Reg_3 Reg_4) reaches the wanted configuration by shifting 1 step in the Y direction and 0 steps in the X direction; configurations 1-4 show the possible ring shifts]
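The 2D ring shift can be modeled with a few lines of C (a software sketch of the routing, not the RTL; grid size and element type are illustrative). Shifting {a,b; c,d} by (sx=0, sy=1) yields the wanted {c,d; a,b}:

```c
#define RX 2   /* grid columns */
#define RY 2   /* grid rows    */

/* 2D ring shift of an RY x RX register grid: each register takes the
   value of its neighbour sx steps away in X and sy steps in Y, with
   wrap-around, avoiding a full crossbar/mux between all partitions. */
void ring_shift(char grid[RY][RX], int sx, int sy) {
    char next[RY][RX];
    for (int y = 0; y < RY; y++)
        for (int x = 0; x < RX; x++)
            next[y][x] = grid[(y + sy) % RY][(x + sx) % RX];
    for (int y = 0; y < RY; y++)
        for (int x = 0; x < RX; x++)
            grid[y][x] = next[y][x];
}
```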
13
Loop Pipelining and Loop Unrolling

Loop pipelining can still be applied to the code after memory partitioning
• Can speed up the code by a factor of 10X
Loop unrolling can be used to compact the code via a multi-dimensional array
• One way to represent the memory partitioning:

Before (flat array):
kernel[size];
Loop body with unrolling pragma and pipelining pragma
{
  ... += kernel[...] ...
  // computation
}

After (partitioned as a multi-dimensional array):
kernel[4][4][size/16];
Loop body with unrolling pragma and pipelining pragma
{
  ... += kernel[i][j][...] ...
  // if some indices are constant
}
14
Overlapping Computation and Communication

Use ping-pong buffers at the input and output
Two ways of implementation:
• Function/block pipelining (AutoPilot) or inter-process communication (Impulse C)

[Figure: without overlap, each iteration serializes Reading Input Data -> Computation -> Writing Output Data; with ping-pong buffers, the DI1/DI2 transfers, the computation, and the DO2/DO1 transfers of successive iterations overlap across the SW and HW stages]

DI1: transferring input from software to SRAM
DI2: transferring input from SRAM to FPGA
DO2: transferring output from FPGA to SRAM
DO1: transferring output from SRAM to software
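The ping-pong hand-off can be sketched in C (a sequential software model with illustrative names; in hardware the fetch of chunk c+1 and the computation on chunk c run concurrently, which sequential C can only model as alternating buffer roles):

```c
#define CHUNK  4
#define NCHUNK 3

static double ping[CHUNK], pong[CHUNK];

/* Stand-ins for the DMA transfer and the FPGA computation. */
static void fetch(const double *src, double *buf) {
    for (int i = 0; i < CHUNK; i++) buf[i] = src[i];
}
static double compute(const double *buf) {
    double s = 0;
    for (int i = 0; i < CHUNK; i++) s += buf[i];
    return s;
}

/* Ping-pong schedule: while chunk c is consumed from one buffer,
   chunk c+1 is fetched into the other; the buffers swap each step. */
void run(const double *in, double *out) {
    double *cur = ping, *nxt = pong;
    fetch(in, cur);
    for (int c = 0; c < NCHUNK; c++) {
        if (c + 1 < NCHUNK)
            fetch(in + (c + 1) * CHUNK, nxt);  /* overlaps with compute */
        out[c] = compute(cur);
        double *t = cur; cur = nxt; nxt = t;   /* swap ping <-> pong */
    }
}
```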
15
Implementation Flow

Original code has a nested loop
Loop interchange (manual code refinement)
Multi-PE implementation: add memory partitioning, address generation, and data-multiplexing logic (manual code refinement)
Enable loop pipelining for the refined code by specifying pragmas
Use Impulse C and AutoPilot to compile the refined code
Use the vendor tool to compile the RTL to a bitstream
Run the program on the target system
16
Experiment Results

15X speedup using a 5 x 5 partitioning over an Opteron (2.2 GHz, 4 GB RAM)
Logic utilization around 25K ALUTs (of which 8K is used by the interface framework rather than the design)
Power utilization less than 15 W in the FPGA, compared with 86 W for the Opteron 248
Close to 100X improvement in energy efficiency (5.8 x 15, where 5.8 ≈ 86 W / 15 W is the power ratio), assuming similar performance
17
Experience with the Two Commercial Tools

Impulse C:
• Strong platform customization support
• Hardware/software co-design
• Smaller subset of C

AutoPilot:
• Support for C/C++/SystemC
• Larger synthesizable subset
• Platform customization
18
Discussions

The performance without the different optimizations:
• Roughly 2~3X worse if we do not do memory partitioning
Polygon-based versus image-based approach:
• The image-based approach uses 2D FFT
• Which one is faster depends on the actual layout
Implementation on GPU:
• The nested loop itself is already data parallel
• The G80 has very fast shared memory for thread blocks, but its size is only 16KB
• We had to put the kernel array in texture memory, with caching
19
Acknowledgments

Financial support from:
• GRC
• GSRC (FCRP)
• NSF
Industrial support and collaboration from:
• Altera-AMD-SUN-XDI consortium
• Altera, Magma, and Xilinx under the UC MICRO program
Valuable discussions and comments from:
• Alfred Wong (Magma)
• Zhiru Zhang (AutoESL)
20
Q/A