Lithographic Aerial Image Simulation with FPGA-Based Hardware Acceleration
Jason Cong and Yi Zou
UCLA Computer Science Department
2
Lithography Simulation (Application)
Simulation of the optical imaging process
Computationally intensive and quite slow for full-chip simulation
3
XtremeData Inc.'s XD1000(TM) Coprocessor System (Platform)
Socket-compatible: replaces one Opteron CPU with the XD1000 coprocessor
The module connects to the CPU's HyperTransport bus and motherboard DIMMs while utilizing the existing power supply and heat-sink solution for the CPU
Dedicated DIMM for the FPGA (not shared with the CPU)
The coprocessor communicates with the CPU via the HyperTransport link and behaves much like a PCI device
4
Approach: Use of C-to-RTL Tools
Used two tools in our work:
• CoDeveloper (Impulse C) by Impulse Accelerated Technologies
• AutoPilot by AutoESL Design Technologies
Advantages:
• Maintain the design at the C level
• Shorten the development cycle
Performed several tunings and refinements at the C level:
• Loop interchange, loop unrolling, and loop pipelining
• Data distribution and memory partitioning
• Data prefetching / overlapping computation and communication
5
Imaging Equations

I(x,y) = \sum_{k=1}^{K} \Big| \sum_{n=1}^{N} \big[ \phi_k(x - x_1^{(n)}, y - y_1^{(n)}) - \phi_k(x - x_2^{(n)}, y - y_1^{(n)}) + \phi_k(x - x_2^{(n)}, y - y_2^{(n)}) - \phi_k(x - x_1^{(n)}, y - y_2^{(n)}) \big] \Big|^2

I(x,y): image intensity at (x,y)
\phi_k(x,y): kth kernel
\psi_k(x,y): kth eigenvector
(x_1^{(n)}, y_1^{(n)}), (x_2^{(n)}, y_2^{(n)}), (x_1^{(n)}, y_2^{(n)}), (x_2^{(n)}, y_1^{(n)}): corners of the nth layout rectangle
Q(x,y): mask transmittance

Pseudo code of the imaging equation:
Loop over different rectangles
  Loop over kernels
    Loop over pixels

\phi_k(x,y) = Q(x,y) \otimes \psi_k(x,y)
6
Loop Interchange

Before interchange:
Loop over pixels
  Loop over kernels
    Loop over layout corners

After interchange:
Loop over kernels
  Loop over layout corners
    Loop over pixels

Different kernels do not have much correlation, so the kernel loop is moved to the outermost position
With one specific layout corner fixed, looping over pixels gives more regular data access
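The interchange can be shown in a small C sketch (the `contrib` function and the sizes are illustrative stand-ins for the kernel lookup; both orders accumulate the same partial sums, only the access pattern changes):

```c
#include <string.h>

#define NK 2   /* kernels        */
#define NC 3   /* layout corners */
#define NP 4   /* pixels         */

double acc[NK][NP];

/* Stand-in for the shifted-kernel contribution of corner c to
   pixel p under kernel k. */
static double contrib(int k, int c, int p) {
    return (double)(k + 1) * (c - 1) + p;
}

/* Before interchange: pixels outermost. */
void before(void) {
    memset(acc, 0, sizeof acc);
    for (int p = 0; p < NP; p++)
        for (int k = 0; k < NK; k++)
            for (int c = 0; c < NC; c++)
                acc[k][p] += contrib(k, c, p);
}

/* After interchange: kernels outermost (little inter-kernel reuse),
   pixels innermost (regular streaming access for a fixed corner). */
void after(void) {
    memset(acc, 0, sizeof acc);
    for (int k = 0; k < NK; k++)
        for (int c = 0; c < NC; c++)
            for (int p = 0; p < NP; p++)
                acc[k][p] += contrib(k, c, p);
}
```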
7
Interpretation of Inner Loop after Loop Interchange

[Figure: the image (partial sum) for one object (one rectangle) is built by adding and subtracting copies of the kernel array shifted to the rectangle's layout corners]

Imaging equation, with the loop over different layout corners and pixels:

I(x,y) = \sum_{k=1}^{K} \Big| \sum_{n=1}^{N} \big[ \phi_k(x - x_1^{(n)}, y - y_1^{(n)}) - \phi_k(x - x_2^{(n)}, y - y_1^{(n)}) + \phi_k(x - x_2^{(n)}, y - y_2^{(n)}) - \phi_k(x - x_1^{(n)}, y - y_2^{(n)}) \big] \Big|^2

The partial image computed by the inner sum is a weighted sum of shifted kernels, and the amount of shift is determined by the layout corners
8
Loop Unrolling

Loop unrolling is one option to express parallelism in these tools
The improvement from loop unrolling is limited due to port conflicts:
• Accesses to the same array cannot be scheduled in the same cycle due to port conflicts
• May increase the initiation interval when loop pipelining and loop unrolling are used together
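A minimal C sketch of manual unrolling (illustrative only; `kernel`, `SIZE`, and the functions are hypothetical). The point of the slide is that, in an HLS tool, the four `kernel[]` reads in the unrolled body still contend for the same memory's ports, so they cannot all issue in one cycle unless the array is partitioned:

```c
#define SIZE 16
static double kernel[SIZE];

/* Rolled reduction loop. */
double sum_rolled(void) {
    double s = 0.0;
    for (int i = 0; i < SIZE; i++)
        s += kernel[i];
    return s;
}

/* Manually unrolled by 4 with separate accumulators. The four reads
   of kernel[] per iteration would conflict on a dual-ported on-chip
   RAM, which is why unrolling alone gives limited speedup. */
double sum_unrolled(void) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < SIZE; i += 4) {
        s0 += kernel[i];
        s1 += kernel[i + 1];
        s2 += kernel[i + 2];
        s3 += kernel[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```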
9
Further Parallelization Needs Memory Partitioning

Unrolling did not solve the problem completely due to port conflicts
Need a multi-port (on-chip) memory with a large number of ports!
Implement the multi-port memory via memory partitioning
Computing tasks can be done in parallel once we can fetch multiple data items in parallel
• Each PE is responsible for computing one partition of the image
• Each PE is composed of one partition of the kernel and one partition of the image partial sum
• Multiplexing logic fetches the data from different partitions of the kernel and provides the data to each PE
• To compute one partition of the image, a PE might also need kernel data from other partitions

[Figure: 4-PE example — each computing element holds one partition of the kernel and one partition of the image partial sum; multiplexing logic routes kernel data among the four partitions]
10
Choosing Partitioning Schemes

A less optimal partitioning design (shown here for a 2 x 2 example)
Block scheduling avoids data-access contention (at any time each PE accesses a different kernel partition)
Might face a load-balancing problem if the required kernel data lie mostly in some partitions
The computing task is partitioned into blocks/stages:

Time | PE 1                | PE 2                | PE 3                | PE 4
  1  | kernel 1 -> image 1 | kernel 2 -> image 2 | kernel 3 -> image 3 | kernel 4 -> image 4
  2  | kernel 2 -> image 1 | kernel 3 -> image 2 | kernel 4 -> image 3 | kernel 1 -> image 4
  3  | kernel 3 -> image 1 | kernel 4 -> image 2 | kernel 1 -> image 3 | kernel 2 -> image 4
  4  | kernel 4 -> image 1 | kernel 1 -> image 2 | kernel 2 -> image 3 | kernel 3 -> image 4

(each entry: using that kernel partition, compute that image partition)
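The rotation in the table above reduces to one modular expression; a minimal C sketch (names are illustrative) that can be checked for contention-freedom:

```c
#define NPART 4   /* number of partitions = number of PEs */

/* Block schedule from the table (0-indexed): at stage t, PE p reads
   kernel partition (p + t) mod NPART while updating image partition p,
   so no two PEs touch the same kernel partition in the same stage. */
int kernel_for(int pe, int stage) {
    return (pe + stage) % NPART;
}
```

Because (p + t) mod NPART is a permutation of the PEs at every stage, each kernel partition is read by exactly one PE per stage, and over NPART stages every image partition sees every kernel partition.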
11
Choosing Partitioning Schemes (Cont.)

Data partitioning for load balancing (here, different colors denote different partitions)
Memory banking using the lower address bits

[Figure: Kernel Array and Image Partial Sum Array, each interleaved across partitions 1-4]
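Banking by the lower address bits can be sketched as follows (a software model, with illustrative names and sizes): four single-ported banks together behave like a four-port memory whenever the four addresses fall in four different banks, which cyclic (low-bit) interleaving guarantees for consecutive addresses:

```c
#define NB 4                 /* number of banks      */
#define SZ 64                /* total elements       */

/* One bank per partition; addr's low bits select the bank,
   the high bits select the offset within the bank. */
static double banks[NB][SZ / NB];

static void write_elem(int addr, double v) {
    banks[addr % NB][addr / NB] = v;
}

static double read_elem(int addr) {
    return banks[addr % NB][addr / NB];
}

/* Four reads that could issue in the same cycle: the addresses
   differ in their low two bits, so each hits a distinct bank. */
static void read4(int base, double out[NB]) {
    for (int b = 0; b < NB; b++)
        out[b] = read_elem(base + b);
}
```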
12
Address Generation and Data Multiplexing

Need address-generation logic to provide the addresses for the kernel data and the image partial sums, since the memory is partitioned
Need data-multiplexing logic to deliver the data from multiple memory blocks to the correct place
Implemented as 2D ring-based shifting (better than a naive mux for larger partitionings)

Start from:
Reg_1 = array_a[..]
Reg_2 = array_b[..]
Reg_3 = array_c[..]
Reg_4 = array_d[..]

Wanted:
Reg_1 = array_c[..]
Reg_2 = array_d[..]
Reg_3 = array_a[..]
Reg_4 = array_b[..]

[Figure: the 2x2 register grid (Reg_1 Reg_2 / Reg_3 Reg_4) reaches the wanted configuration by shifting 1 step in the Y direction and 0 steps in the X direction; configurations 1-4 show the possible ring shifts]
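The 2D ring shift can be modeled with a few lines of C (a software sketch of the routing, not the RTL; grid size and element type are illustrative). Shifting {a,b; c,d} by (sx=0, sy=1) yields the wanted {c,d; a,b}:

```c
#define RX 2   /* grid columns */
#define RY 2   /* grid rows    */

/* 2D ring shift of an RY x RX register grid: each register takes the
   value of its neighbour sx steps away in X and sy steps in Y, with
   wrap-around, avoiding a full crossbar/mux between all partitions. */
void ring_shift(char grid[RY][RX], int sx, int sy) {
    char next[RY][RX];
    for (int y = 0; y < RY; y++)
        for (int x = 0; x < RX; x++)
            next[y][x] = grid[(y + sy) % RY][(x + sx) % RX];
    for (int y = 0; y < RY; y++)
        for (int x = 0; x < RX; x++)
            grid[y][x] = next[y][x];
}
```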
13
Loop Pipelining and Loop Unrolling

Loop pipelining can still be applied to the code after memory partitioning
• Can speed up the code by a factor of 10X
Loop unrolling can be used to compact the code via a multi-dimensional array
• One way to represent the memory partitioning:

Before (flat array):
kernel[size];
Loop body with unrolling pragma and pipelining pragma
{
  ... += kernel[...] ...
  // computation
}

After (partitioned as a multi-dimensional array):
kernel[4][4][size/16];
Loop body with unrolling pragma and pipelining pragma
{
  ... += kernel[i][j][...] ...
  // if some indices are constant
}
14
Overlapping Computation and Communication

Use ping-pong buffers at the input and output
Two ways of implementation:
• Function/block pipelining (AutoPilot) or inter-process communication (Impulse C)

[Figure: without overlap, each iteration serializes Reading Input Data -> Computation -> Writing Output Data; with ping-pong buffers, the DI1/DI2 transfers, the computation, and the DO2/DO1 transfers of successive iterations overlap across the SW and HW stages]

DI1: transferring input from software to SRAM
DI2: transferring input from SRAM to FPGA
DO2: transferring output from FPGA to SRAM
DO1: transferring output from SRAM to software
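The ping-pong hand-off can be sketched in C (a sequential software model with illustrative names; in hardware the fetch of chunk c+1 and the computation on chunk c run concurrently, which sequential C can only model as alternating buffer roles):

```c
#define CHUNK  4
#define NCHUNK 3

static double ping[CHUNK], pong[CHUNK];

/* Stand-ins for the DMA transfer and the FPGA computation. */
static void fetch(const double *src, double *buf) {
    for (int i = 0; i < CHUNK; i++) buf[i] = src[i];
}
static double compute(const double *buf) {
    double s = 0;
    for (int i = 0; i < CHUNK; i++) s += buf[i];
    return s;
}

/* Ping-pong schedule: while chunk c is consumed from one buffer,
   chunk c+1 is fetched into the other; the buffers swap each step. */
void run(const double *in, double *out) {
    double *cur = ping, *nxt = pong;
    fetch(in, cur);
    for (int c = 0; c < NCHUNK; c++) {
        if (c + 1 < NCHUNK)
            fetch(in + (c + 1) * CHUNK, nxt);  /* overlaps with compute */
        out[c] = compute(cur);
        double *t = cur; cur = nxt; nxt = t;   /* swap ping <-> pong */
    }
}
```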
15
Implementation Flow

Original code has a nested loop
Loop interchange (manual code refinement)
Multi-PE implementation: add memory partitioning, address generation, and data-multiplexing logic (manual code refinement)
Enable loop pipelining for the refined code by specifying pragmas
Use Impulse C and AutoPilot to compile the refined code
Use the vendor tool to compile the RTL to a bitstream
Run the program on the target system
16
Experiment Results

15X speedup using a 5 x 5 partitioning over an Opteron (2.2 GHz, 4 GB RAM)
Logic utilization around 25K ALUTs (of which 8K is used by the interface framework rather than the design)
Power utilization less than 15 W in the FPGA, compared with 86 W for the Opteron 248
Close to 100X improvement in energy efficiency (5.8 x 15, where 5.8 ≈ 86 W / 15 W is the power ratio), assuming similar performance
17
Experience with the Two Commercial Tools

Impulse C:
• Strong platform customization support
• Hardware/software co-design
• Smaller subset of C

AutoPilot:
• Support for C/C++/SystemC
• Larger synthesizable subset
• Platform customization
18
Discussions

The performance without the different optimizations:
• Roughly 2~3X worse if we do not do memory partitioning
Polygon-based versus image-based approach:
• The image-based approach uses 2D FFT
• Which one is faster depends on the actual layout
Implementation on GPU:
• The nested loop itself is already data parallel
• The G80 has very fast shared memory for thread blocks, but its size is only 16KB
• We had to put the kernel array in texture memory, with caching
19
Acknowledgments

Financial support from:
• GRC
• GSRC (FCRP)
• NSF
Industrial support and collaboration from:
• Altera-AMD-SUN-XDI consortium
• Altera, Magma, and Xilinx under the UC MICRO program
Valuable discussions and comments from:
• Alfred Wong (Magma)
• Zhiru Zhang (AutoESL)
20
Q/A