Novel Hardware-software Architecture for Computation of DWT Using Recursive Merge Algorithm

Piyush Jamkhandi, Amar Mukherjee, Kunal Mukherjee, Robert Franceschini

School of EECS, University of Central Florida, Orlando, FL 32816

11 Feb, 2000 VLSI Systems Lab, School of EECS, UCF

Goals

Propose a new FPGA-µP based architecture for the Recursive Merge Filtering (RMF) algorithm to compute the Discrete Wavelet Transform (DWT).

Present a technique to overcome the data routing bottleneck of RMF for the DWT.

Transform the data routing problem of RMF into basic arithmetic computation on the FPGA with local memory access.


Introduction to Reconfigurable Computing

Reconfigurable Computing (EE Times, Nov. 1998)


Why Reconfigurable Computing?

• ASICs have high design turnover times
    • Rapid prototyping using FPGAs

• High design change/error costs
    • Incorrect designs in silicon incur a very high cost of modification

• Speedup achievable is far greater
    • Image correlation: 0.69 sec on FPGA vs. 38 sec using a 133 MHz Pentium processor [Kean et al., 1997]
    • 512-bit RSA decoding implementation using FPGA decodes at 200 kbits/sec vs. 19 kbits/sec for ASIC implementation [Bertin et al., 1992]

• Reusability of hardware
    • Same silicon chip used for diverse applications, unlike ASICs

• Dynamic reconfigurability
    • FPGA can be configured for a different application while some other application is already running on the chip


Recursive Merge Filtering Algorithm

Based on the principle of preserving the spatial correlation between the inputs and the wavelet coefficients obtained at any stage.

The RMF algorithm is based on recursive sub-block computation: at each iteration, the size of the image whose RMF is computed is reduced by half (1-D case).

The blocks are computed bottom-up, followed by hierarchical merging of the sub-blocks to obtain the wavelet transform.
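
As a hedged illustration of the bottom-up compute-then-merge idea, the Python sketch below computes a 1-D Haar DWT by transforming each half recursively and then merging the two results. The function names (`haar_pair`, `merge_dwt`, `haar_dwt`), the orthonormal 1/sqrt(2) scaling, and the output ordering are assumptions made for illustration. The exact RMF definition and coefficient layout used in the talk may differ. The sketch relies only on the fact that Haar filters never straddle a block boundary, so the merged result matches a conventional level-by-level Haar DWT.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_pair(a, b):
    """Haar low-pass (scaled sum) and high-pass (scaled difference) of two values."""
    return (a + b) / SQRT2, (a - b) / SQRT2

def merge_dwt(x):
    """Haar DWT of x (length a power of two, >= 2) by recursive halving and merging.

    Assumed output layout: [approximation, coarsest detail, finer details, ...].
    """
    n = len(x)
    if n == 2:
        return list(haar_pair(x[0], x[1]))
    half = n // 2
    left = merge_dwt(x[:half])
    right = merge_dwt(x[half:])
    # Filter step: only the two approximation coefficients need arithmetic.
    low, high = haar_pair(left[0], right[0])
    out = [low, high]
    # Merge step: pure data routing, joining the detail bands of both halves.
    size = 1
    while size < half:
        out.extend(left[size:2 * size])
        out.extend(right[size:2 * size])
        size *= 2
    return out

def haar_dwt(x):
    """Conventional level-by-level Haar DWT, used here only as a reference."""
    out = list(x)
    n = len(out)
    while n > 1:
        pairs = [haar_pair(out[2 * i], out[2 * i + 1]) for i in range(n // 2)]
        out[:n] = [p[0] for p in pairs] + [p[1] for p in pairs]
        n //= 2
    return out

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
merged = merge_dwt(signal)
reference = haar_dwt(signal)
assert all(math.isclose(m, r, abs_tol=1e-9) for m, r in zip(merged, reference))
```

In this sketch the merge step performs no arithmetic beyond one `haar_pair` call. Everything else is data movement, which is the part the architecture aims to take off the critical path.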


Fast Wavelet Transform Data Flow


RMF DWT Data Flow


Data Routing (DR) in RMF


RMF Algorithm

Formally the RMF technique is defined as (1-D):

where h and g are the Haar filters and • is the concatenation operator.

The DWT operation can be defined in terms of RMF as:


RMF Algorithm (3): RMF for 2-D

If k > 1, then


RMF Algorithm (4): RMF for 2-D …

If k = 1, then

Key point: the RMF algorithm has two phases, an arithmetic computation phase (filter) and a routing phase (merge).

Separating these two phases can lead to improvement.


Transformation of DR to +/-:

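[Figure: a block on x/y axes with top-left corner (TLx, TLy) and bottom-right corner (BRx, BRy), divided into Quad. 1 to Quad. 4, with sub-blocks numbered 1 to 9.]

• Data movement: block 9 and block 1 need to be swapped.

• Key: use a virtual position matrix for the data items in the quadrants instead of moving the data items.

[Figure: the same block after data movement, with the sub-blocks reordered.]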


Virtual Mapping Index

Initially:

for every position (i, j) in the input data (2-D) do
    set the virtual index position (i, j) to <i, j>, where <i, j> is the packed storage of i and j.
endfor

Initial state:
    Data Matrix: Data(i, j) = image pixel value
    Virtual Map: VirtMap(i, j) = <i, j>

New state, pixel (i, j) moved by (x, y):
    Data Matrix: Data(i, j) = image pixel value
    Virtual Map: VirtMap(i, j) = <i+x, j+y>

Position of data pixel at (i, j) = VirtMap(i, j)
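
The virtual map can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the 16-bit field width used for the packed <i, j> value, the 0-based indexing, and the helper names (`pack`, `unpack`, `init_virtual_map`, `move`, `position_of`) are all assumptions.

```python
SHIFT = 16                  # assumed field width for each packed co-ordinate
MASK = (1 << SHIFT) - 1

def pack(i, j):
    """Pack a co-ordinate pair <i, j> into a single integer."""
    return (i << SHIFT) | j

def unpack(packed):
    """Recover (i, j) from a packed value."""
    return packed >> SHIFT, packed & MASK

def init_virtual_map(rows, cols):
    """Initial state: VirtMap(i, j) = <i, j>."""
    return [[pack(i, j) for j in range(cols)] for i in range(rows)]

def move(virt_map, i, j, x, y):
    """Record that the pixel stored at (i, j) has moved by (x, y).

    Only the map entry changes; the data matrix itself is never touched.
    """
    ci, cj = unpack(virt_map[i][j])
    virt_map[i][j] = pack(ci + x, cj + y)

def position_of(virt_map, i, j):
    """Current virtual position of the pixel stored at (i, j)."""
    return unpack(virt_map[i][j])

virt_map = init_virtual_map(4, 4)
move(virt_map, 0, 0, 3, 3)      # pixel stored at (0, 0) now sits at (3, 3)
move(virt_map, 3, 3, -3, -3)    # pixel stored at (3, 3) now sits at (0, 0)
```

The two `move` calls together swap a pair of pixels using two additions on map entries, with no pixel data copied. This is the sense in which routing becomes arithmetic.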


Architecture for RMF Using FPGAs

The use of the virtual mapping index separates the routing and the computation.

The microprocessor can proceed with the arithmetic computation while the circuit loaded onto the FPGA can carry out the data routing.

The virtual index is stored in the FPGA board RAM allowing the FPGA fast access to the virtual index table.

The microprocessor has to refer to the virtual index table to determine the actual position of the values needed during the computation.


Computation using RMF Arch: Initially

[Figure: input data array (4x4, cells 1 to 16) in main memory; virtual map on FPGA RAM with entries <1,1> to <4,4>; intermediate mapping; Queue 1 and Queue 2; PBU and MPU.]


RMF Computation (2)

[Figure: input data array (4x4, cells 1 to 16) in main memory; virtual map on FPGA RAM with entries <1,1> to <4,4>; intermediate mapping; Queue 1 and Queue 2; PBU and MPU.]

• Set the global context to Q1, i.e. all input co-ordinates are read from Q1.

• Read 2x2 blocks and compute the filter operation on the CPU.

• Store the results back in main memory.

• Write the co-ordinates to Q1.

• Repeat the process until all 2x2 blocks are computed.
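
The steps above can be sketched as follows. This is a hedged illustration only: the slides do not say how Q1 is seeded, so the sketch simply enumerates the 2x2 blocks itself. The orthonormal 2-D Haar scaling, the in-place result layout, and the names `haar_2x2` and `filter_all_2x2` are assumptions.

```python
from collections import deque

def haar_2x2(a, b, c, d):
    """2-D Haar filter of the block [[a, b], [c, d]] (orthonormal scaling assumed)."""
    average = (a + b + c + d) / 2.0
    col_diff = (a - b + c - d) / 2.0    # left column minus right column
    row_diff = (a + b - c - d) / 2.0    # top row minus bottom row
    diagonal = (a - b - c + d) / 2.0
    return average, col_diff, row_diff, diagonal

def filter_all_2x2(data, q1):
    """Filter every 2x2 block in place and write its co-ordinates to q1.

    Each queue entry is ((top-left row, col), (bottom-right row, col)).
    """
    n = len(data)
    for r in range(0, n, 2):
        for c in range(0, n, 2):
            average, col_diff, row_diff, diagonal = haar_2x2(
                data[r][c], data[r][c + 1], data[r + 1][c], data[r + 1][c + 1]
            )
            # Store the results back in the same main-memory locations.
            data[r][c], data[r][c + 1] = average, col_diff
            data[r + 1][c], data[r + 1][c + 1] = row_diff, diagonal
            q1.append(((r, c), (r + 1, c + 1)))

# 4x4 input array holding the values 1 to 16, as in the figure.
data = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
q1 = deque()
filter_all_2x2(data, q1)
```

After the pass, `q1` holds one co-ordinate pair per 2x2 block, ready for the merge stage to consume.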


RMF Computation (3)

[Figure: input data array (4x4, cells 1 to 16) in main memory; virtual map on FPGA RAM with entries <1,1> to <4,4>; intermediate mapping; Queue 1 and Queue 2; PBU and MPU.]

• Read the co-ordinates of four 2x2 blocks from Q1 and merge.

• Generate new co-ordinates on the FPGA using the RAM values.

• If the block size is greater than 2x2, write to Q2 with the parameter ‘false’ to specify that the block size is not yet 2x2.

• If the block size is 2x2, the basic filter computation needs to be done on the CPU: write to Q2 with ‘true’.


RMF Computation (4)

[Figure: input data array (4x4, cells 1 to 16) in main memory; virtual map on FPGA RAM with entries <1,1> to <4,4>; intermediate mapping; Queue 1 and Queue 2; PBU and MPU.]

• When Q1 becomes empty, set the global context to Q2.

• Read co-ordinates from Q2 and repeat the merging process.

• Merge and write to Q1 with the ‘true’/‘false’ flag set.

• In parallel, the PBU checks the global context queue and the flag.

• If the filter operation is to be carried out, the PBU reads the FPGA RAM location using the co-ordinates in the queue and determines the data values.

• It computes the data values and writes them back to the same location in main memory.


RMF Computation (5)

[Figure: input data array (4x4, cells 1 to 16) in main memory; virtual map on FPGA RAM with entries <1,1> to <4,4>; intermediate mapping; Queue 1 and Queue 2; PBU and MPU.]

• The process is repeated for the PBU and MPU until one of the queues contains only a single co-ordinate pair, covering the whole input data.

• The final coefficients are generated by resetting the main memory and putting the data in their proper positions.
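
The alternation between the two queues can be sketched on co-ordinates alone. This is a simplified, assumed model: it tracks only block co-ordinates, omits the ‘true’/‘false’ flag and the filter arithmetic done by the PBU, and seeds the first queue in quadrant order so that four consecutive entries are always siblings. The names `z_order_blocks`, `merge_four` and `run_merge_passes` are hypothetical.

```python
from collections import deque

def z_order_blocks(tlx, tly, size):
    """List the 2x2 blocks of a size x size region, quadrant by quadrant.

    Each block is (TLx, TLy, BRx, BRy); size must be a power of two >= 2.
    """
    if size == 2:
        return [(tlx, tly, tlx + 1, tly + 1)]
    half = size // 2
    blocks = []
    for dy in (0, half):
        for dx in (0, half):
            blocks.extend(z_order_blocks(tlx + dx, tly + dy, half))
    return blocks

def merge_four(blocks):
    """Co-ordinates of the block obtained by merging four sibling blocks."""
    return (
        min(b[0] for b in blocks),
        min(b[1] for b in blocks),
        max(b[2] for b in blocks),
        max(b[3] for b in blocks),
    )

def run_merge_passes(n):
    """Alternate between two queues until one co-ordinate pair covers the input."""
    active = deque(z_order_blocks(0, 0, n))   # plays the role of the context queue
    other = deque()
    passes = 0
    while len(active) > 1:
        while active:
            group = [active.popleft() for _ in range(4)]
            other.append(merge_four(group))
        # The active queue is empty: switch the global context to the other queue.
        active, other = other, active
        passes += 1
    return active[0], passes

final_block, passes = run_merge_passes(8)
```

For an 8x8 input the sketch needs two passes (16 blocks, then 4, then 1) before a single entry spans the whole array.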


Filter and Merge Equations

• Given any data item in block 1 at position (x, y), it is moved to position:

where (BRx, BRy) and (TLx, TLy) are the bottom-right and top-left co-ordinates of the block.

• We further define the width and height of the block as:


Merge and Filter equations:

• The primitive Block() computes the merge process for the given top and bottom co-ordinates.

• Two more primitives are defined as part of Block():

    • Move_Data(): handles data movement.

    • Compute_1D(): computes the 1-D RMF for rows and columns.

• To compute the basic 2x2 block we define another primitive


Block primitive invokes:

• If the size of the block is greater than 2x2, the Block primitive is invoked as

•The Block primitive invokes the following primitives to perform the proper filter and merge operations:


Architecture for RMF (2)…

Hardware Software Architecture for DWT using RMF.

[Figure: block diagram with Microprocessor, FPGA with RAM, Main Memory, Primitive Block Computation SW Unit, Merge Process SW Unit, Queue Structure Q1, Queue Structure Q2, Queue 1 Exclusion Zone, and rMap Access Exclusion Zone.]


Results: Accesses to main memory

Performance comparison between normal RMF and FPGA-based RMF

[Chart: bars by image size (128, 256, 512); axis labels "Image Size" and "Execution Time".]

Image size 128: data access for normal RMF 184632; data access using FPGA 43688.
Image size 256: data access for normal RMF 889912; data access using FPGA 174760.
Image size 512: data access for normal RMF 4168248; data access using FPGA 699048.

[Images: Original Image, Reconstructed Image]


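Total Data Accesses

Performance comparison between normal RMF and FPGA-based RMF

[Chart: bars by image size (128, 256, 512); axis labels "Image Size" and "Execution Time".]

Image size 128: data access for normal RMF 184632; data access using FPGA to main memory 43688; virtual map accesses 344064; total accesses using FPGA 387752.
Image size 256: data access for normal RMF 889912; data access using FPGA to main memory 174760; virtual map accesses 1835008; total accesses using FPGA 2009768.
Image size 512: data access for normal RMF 4168248; data access using FPGA to main memory 699048; virtual map accesses 9437184; total accesses using FPGA 10136232.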
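
The chart values can be cross-checked with a short Python calculation. The pairing of bar values with series and image sizes is an inference, not something the slides state outright. It assumes that access counts grow with image size, and it is supported by the fact that virtual-map accesses plus main-memory accesses add up exactly to the plotted totals. The dictionary keys and the `summarize` helper are hypothetical names.

```python
# Access counts read off the two results charts (pairing inferred, see above).
results = {
    128: {"normal": 184632, "fpga_main": 43688, "virt_map": 344064, "fpga_total": 387752},
    256: {"normal": 889912, "fpga_main": 174760, "virt_map": 1835008, "fpga_total": 2009768},
    512: {"normal": 4168248, "fpga_main": 699048, "virt_map": 9437184, "fpga_total": 10136232},
}

def summarize(size, row):
    """Derive consistency checks and ratios for one image size."""
    return {
        # Total accesses using FPGA should equal main-memory plus virtual-map accesses.
        "total_matches": row["fpga_main"] + row["virt_map"] == row["fpga_total"],
        # How many times fewer main-memory accesses the FPGA-based scheme makes.
        "main_memory_reduction": row["normal"] / row["fpga_main"],
        # Virtual-map accesses per pixel of a size x size image.
        "virt_map_per_pixel": row["virt_map"] / (size * size),
    }

summary = {size: summarize(size, row) for size, row in results.items()}

for size, s in summary.items():
    print(size, s["total_matches"],
          round(s["main_memory_reduction"], 2), s["virt_map_per_pixel"])
```

Under this pairing, main-memory accesses fall by roughly a factor of four to six, while the virtual-map accesses, served from the FPGA board RAM, make up most of the FPGA-side total.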