
Partially Reconfigurable Matrix Multiplication for Area and Time Efficiency on FPGAs

Luo Jianwen and Jong Ching Chuen
School of Electrical and Electronic Engineering, Nanyang Technological University
Nanyang Avenue, Singapore
[email protected], [email protected]

    Abstract

This paper presents a novel architecture for matrix multiplication implemented on reconfigurable hardware with a partially reconfigurable feature. The proposed design significantly reduces the size and achieves the minimum computation cycles for n × n matrix multiplication. Compared with the linear array design [1], the area of our design is reduced by 72%-81% while the AT metric (product of area and latency) is reduced by 40%-58% for matrix sizes between 3 × 3 and 48 × 48. The versatility of our design is demonstrated in different parameterisable instantiations that cater for implementations with various time and area requirements. Partial reconfiguration allows us to reload the design contents with minimum configuration overhead. The performance of our design is even better for larger matrices.

    1. Introduction

Matrix multiplication is one of the essential operations in a wide range of applications such as graphics, video, image, robotics and signal processing. These applications need high-performance as well as cost-efficient designs. Reconfigurable systems offer a potential for computation acceleration due to the software-like programmable nature of their parallel processing units. Run-time configuration opens a novel research area for reconfigurable hardware to further increase processing speed, eliminating the configuration overhead by overlapping it with execution time. This offers a non-interrupted processing system even as the circuits change, and greatly improves logic density through time-shared logic.

Many existing schemes for matrix multiplication on FPGAs address area-time tradeoff issues to achieve the maximum processing speed. Partially reconfigurable devices offer the ability to change the design implementation without stopping the whole executing process. To the best of our knowledge, none of the existing matrix multiplication designs is run-time configurable. In this paper we present a novel matrix multiplier with a partially reconfigurable feature which greatly improves the area-latency tradeoff when compared with existing designs.

The linear array design by Jang et al. [1] implemented matrix multiplication on the Xilinx Virtex-II device. Their design adopted a systolic architecture and focused on minimizing the area-latency tradeoff, achieving great improvement over the state-of-the-art FPGA-based designs [2] and [3]. For a matrix size of 4 × 4, it had 52% and 69% less in the area/speed metrics respectively, and saved up to 46% silicon against the design in [4], while achieving a maximum frequency of 166 MHz. The Xilinx reference design [5] for 3 × 3 matrix multiplication maximized the pipelined data flow by multi-pumping the embedded multipliers to 9 times the environment frequency, up to 154 MHz. We use this design as our benchmark for matrix multiplication implemented on the Xilinx Virtex-II device.

The Xilinx core generator tool [5] has many parameterisable library cores for fast design realization. These cores have guaranteed high performance and density. We implemented a uniprocessor for matrix multiplication using this tool and compared it with our proposed design. The uniprocessor can run at 113 MHz when adopting the MAC v3.0 core [5].

The rest of this paper is organized as follows: Section 2 describes the proposed matrix multiplier architecture for AT efficiency. Section 3 presents the FPGA implementation, the comparison with existing designs, and the content partial reconfiguration used in our design. We conclude in Section 4.

    2. Design architecture

Since Virtex-II devices incorporate large amounts of 18 Kbit Block SelectRAM with versatile configuration options, we can instantiate the memory cells with the operand matrix the way we do with parameterisable registers. The proposed matrix multiplier uses two chunks of the memory area. Figure 1 shows the architecture of the proposed processing element (PE). Memory B is used

    Proceedings of the EUROMICRO Systems on Digital System Design (DSD04)0-7695-2203-3/04 $ 20.00 IEEE



to store column j of matrix B, and memory C is used to store the partial and final products of column j. Compared with previous techniques, our design significantly reduces the number of registers needed for data movement: 4n registers are required in the linear array design [1], while only n registers are used in our design. In the linear array design, n² + 2n cycles are needed for the n × n matrix multiplication. With run-time configurable parameters and parallel processors, we save n cycles in our systolic mode and 2n cycles in the parallel mode.

Figure 1. Architecture of PEj in the Proposed Design

Based on the proposed PE architecture, a number of lemmas are derived to show the performance of the proposed multiplier. Lemma 1 gives the minimum latency of n × n matrix multiplication with n MACs (multiplier-and-accumulators) and with a uniprocessor. Lemma 2 improves the linear array algorithm for matrix multiplication with respect to both the number of registers and the number of computation cycles. Lemma 2 is extended in Corollaries 1 and 2 to demonstrate the ability of the proposed design to meet the latency limit with n MACs and with one MAC respectively. Lemma 3 addresses matrix decomposition when the matrix size is larger than the number of available PEs and gives a quantitative analysis of the trade-off between area and latency.

Lemma 1: n × n matrix multiplication can never be performed in less than n² cycles with n multipliers, or n³ cycles with one multiplier.

Proof: The complexity of n × n matrix multiplication is O(n³). The equation cij = Σk aik·bkj (summing k from 1 to n) denotes the calculation of any n × n matrix product C = AB, where aik, bkj and cij represent the elements of the n × n matrices A, B and C respectively. We need n multiplications to produce each element of the product C. Thus, to compute the whole n × n matrix C, n³ multiplications are needed. If we have n multipliers working in parallel, n² cycles elapse in multiplication. Note that we have not counted the latency for addition in pipelined processing or the cycles spent on data movement. So the minimum timing requirement for an n × n matrix multiplication is n² cycles with n multipliers, and n³ cycles with one multiplier for the same reason. These are the lower-bound latencies for implementing n × n matrix multiplication.
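The n³ counting argument can be checked with a plain triple-loop product (our own illustrative sketch, not the paper's hardware design):

```python
# Count scalar multiplications in the naive n x n matrix product,
# illustrating the n^3 work bound behind Lemma 1.
def matmul_count(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    mults = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]
C, mults = matmul_count(A, B)
assert mults == n ** 3   # n multiplications per element, n^2 elements
```

With n multipliers working in parallel, the same n³ multiplications take at least n² cycles, which is exactly the floor the lemma states.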

Lemma 2: n × n matrix multiplication can be performed in n² + n cycles using n PEs, each with 1 MAC, 1 register, a Block SelectRAM of 2n words and 1 I/O port.

Proof: The PE of Figure 1 is devised to compute cij = Σk aik·bkj for all i, j, where aik, bkj and cij represent the elements of the n × n matrices A, B and C. PEj denotes the j-th PE in the whole structure. PEj computes column j of matrix C, that is c1j, c2j, ..., cnj, stored in Block SelectRAM part C. The input of PEj connects to the output of PEj-1, and the output of PEj is the input of the next array element PEj+1. In phase k, row k of matrix A (aik, 1 ≤ i ≤ n) traverses PE1, PE2, PE3, ..., PEn in order. Column j of matrix B resides in the Block SelectRAM of PEj, which can be partially configured. This scheme allows PEj to update cij = cij + aik·bkj every clock cycle, where cij here represents the intermediate value of cij, and it takes n cycles to calculate each element of matrix C. The MAC in PEj does not start until the first element of matrix A, a11, arrives. Thus, PEj starts computing j cycles after the ready signal activates, and completes on cycle j + n². So we get the result after the last element cnn in PEn is ready, which is after the (n² + n)-th cycle.
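The cycle count of Lemma 2 can be reproduced with a toy schedule model (our own sketch of the dataflow, not the authors' RTL): PEj keeps column j of B, A's elements are pipelined through the chain, and PEj runs j cycles behind the ready signal:

```python
# Toy cycle-count model of Lemma 2's systolic array: PE_j holds column j
# of B and accumulates column j of C; elements of A are pipelined through
# PE_1 .. PE_n, so PE_j sees each element j cycles after the ready signal.
def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    last_cycle = 0
    # phase k streams a_{1k}, ..., a_{nk} into PE_1, one element per cycle
    feed = ((k, i) for k in range(n) for i in range(n))
    for t, (k, i) in enumerate(feed, start=1):
        for j in range(n):                 # PE_j lags the feed by j+1 cycles
            C[i][j] += A[i][k] * B[k][j]
            last_cycle = max(last_cycle, t + j + 1)
    return C, last_cycle

n = 4
A = [[3 * i + j + 1 for j in range(n)] for i in range(n)]
B = [[i - j for j in range(n)] for i in range(n)]
C, cycles = systolic_matmul(A, B)
ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert C == ref and cycles == n * n + n   # n^2 + n, matching Lemma 2
```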

Corollary 1: n × n matrix multiplication can be performed in n² cycles using n PEs, each with 1 MAC, 1 register, 1 Block SelectRAM of 2n words and 1 I/O port.

Proof: Nothing changes from Lemma 2 except the way matrix A traverses. Instead of passing through PE1, PE2, PE3, ..., PEn, the elements of matrix A travel on the data bus and are fed into each PE simultaneously. We instantiate all the PEs with the parameters of PE1 but a different column of matrix B in Block SelectRAM part B: PEj holds the j-th column of matrix B. This method allows all the PEs to start at the same time and finish with the latency of PE1 as in Lemma 2.

Corollary 2: n × n matrix multiplication can be performed in n³ cycles using 1 PE with 1 MAC, 1 register, a Block SelectRAM of 2n² words and 1 I/O port.


Proof: This is the uniprocessor case. n × n matrix multiplication can be performed using only PE1. We parameterize Block SelectRAM part B with the whole of matrix B in column order. Matrix A is fed into PE1 n times, at the rate of 1 column per pass. So n³ cycles are needed in this case.

[Figure 2(b): the 8 × 8 matrix B partitioned into 4 × 4 sub-matrices, with its columns distributed over the Block SelectRAMs of PE1 to PE4]

Lemma 3: n × n matrix multiplication can be performed in rn² cycles using n/r PEs, each with 1 MAC, 1 register, 1 Block SelectRAM of 2n words and 1 I/O port, where n is divisible by r.

Proof: n × n matrix multiplication can be decomposed into r³ multiplications of (n/r) × (n/r) matrices. Using Corollary 1 with n replaced by n/r, the result is obtained. The matrix operands are managed as follows: matrix A is fed in with the major sequence over the rows of sub-matrices and the minor sequence over the rows within each sub-matrix; matrix B resides in the Block SelectRAMs, with the major order over the columns of sub-matrices and the minor order over the columns within each sub-matrix. For example, if we decompose an 8 × 8 matrix multiplication with a factor of r = 2, we manipulate the matrices in the arrow sequence depicted in Figure 2.

Figure 2. Decomposition of Matrix Multiplication in the Proposed Scheme

The Block SelectRAMs of the PEs are configured in the order shown in Figure 2(b). The way matrix A is fed in is illustrated in the following pseudocode:

For major-row_count = 1 to r do
  For major-row = 1 to r do
    For major-column = 1 to r do
      For minor-row = 1 to n/r do
        For minor-column = 1 to n/r do
          aik = Aij

where aik is the register of aik in Figure 1 and Aij is the current element of matrix A ready for feeding in.

[Figure 2(a): the 8 × 8 matrix A partitioned into 4 × 4 sub-matrices; arrows mark the feed order, first within a sub-matrix, then on to the next sub-matrix]

Lemma 3 caters for the area-latency trade-off. A smaller value of n/r reduces the number of PEs, resulting in less area; however, it increases the number of cycles needed to complete the matrix multiplication.
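The decomposition behind Lemma 3 can be sketched in a few lines (an illustrative software model under our own block indexing, not the hardware schedule): an n × n product splits into r³ products of (n/r) × (n/r) sub-matrices.

```python
# Block decomposition of Lemma 3: C_block(bi,bj) accumulates
# A_block(bi,bk) @ B_block(bk,bj) over bk, giving r^3 sub-products.
def block_matmul(A, B, r):
    n = len(A)
    assert n % r == 0
    m = n // r                            # sub-matrix size n/r
    C = [[0] * n for _ in range(n)]
    sub_products = 0
    for bi in range(r):                   # major row of sub-matrices
        for bj in range(r):               # major column of sub-matrices
            for bk in range(r):           # one (n/r) x (n/r) product
                for i in range(m):
                    for j in range(m):
                        C[bi*m + i][bj*m + j] += sum(
                            A[bi*m + i][bk*m + k] * B[bk*m + k][bj*m + j]
                            for k in range(m))
                sub_products += 1
    return C, sub_products

n, r = 8, 2
A = [[i + 2 * j for j in range(n)] for i in range(n)]
B = [[(i * j) % 5 for j in range(n)] for i in range(n)]
C, subs = block_matmul(A, B, r)
ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert C == ref and subs == r ** 3        # r^3 sub-matrix multiplications
```

With each sub-product taking (n/r)² cycles on the n/r-PE array (Corollary 1), the r³ sub-products give the rn² total cycles of the lemma.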


    3. FPGA implementation

    3.1 Performance Comparison

The matrix multiplier described above was implemented on a Xilinx Virtex-II device and its performance in terms of area and latency metrics was evaluated.

We define the performance equation to be Perf = n³ / (slices × latency), where n is the matrix size, and slices and latency stand for the area consumption and computing


time respectively. By using the metric slices × latency (AT) for evaluation, we are able to take into account the effect of increased numbers of processing elements and the area differences for various types of memory. This is especially relevant in an era of deep pipelines and huge caches, where small performance improvements are bought at the cost of dramatic increases in area.
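Working the metric on two sample points from Table 1 shows how it is applied (we read the table's latency figures as microseconds; the comparison below is our own arithmetic, not a figure from the paper):

```python
# Perf = n^3 / (slices x latency), evaluated on the n = 48 rows of Table 1.
def perf(n, slices, latency_us):
    return n ** 3 / (slices * latency_us)

p_proposed = perf(48, 1798, 31.797)   # proposed design, n = 48
p_linear = perf(48, 9360, 14.464)     # linear array design [1], n = 48
# the proposed design scores higher under the AT metric at n = 48
assert p_proposed > p_linear
```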

Figure 3 shows the performance evaluation of the 3 existing designs against our proposed one for various sizes of matrix multiplication. The performance equation shows a significant improvement over the existing modules under the AT metric. The comparable linear array design can run almost 2 times faster than the proposed module, but its performance deteriorates after n = 15 due to the significant slice consumption beyond that point.

Table 1 shows the different matrix multiplication modules with various area and latency tradeoffs, with the Xilinx reference design running at 154 MHz, the module generated by Core Generator at 113 MHz, the linear array at 166 MHz and the proposed module at 74 MHz. The reason our design runs at a relatively low rate is its non-pipelined architecture, and this part needs further optimization. Note that the modules by Core Generator and by Xilinx use a single multiplier, so their area is the same for all matrix sizes.

[Figure 3 plot: performance values Perf = n³/(slices × latency) against matrix size n, with curves for the Xilinx, CoreGen, LinearArray and Proposed designs]

Table 1. Comparison of 3 existing designs against the proposed design for various sizes of matrix multiplication

(a) Area Comparison

Matrix Size | AXilinx (slices) | ACoreGen (slices) | ALinearArray (slices) | AProposed (slices)
3 × 3       | 207              | 158               | 393                   | 110
6 × 6       | 207              | 158               | 786                   | 219
9 × 9       | 207              | 158               | 1179                  | 334
12 × 12     | 207              | 158               | 1572                  | 446
15 × 15     | 207              | 158               | 1965                  | 558
24 × 24     | 207              | 158               | 3912                  | 893
48 × 48     | 207              | 158               | 9360                  | 1798

    Figure 3. Performance Evaluation of Matrix

    Multiplication with Various Sizes

Lemma 2 and its corollaries show the ability of our design to be configured as different types of processor according to specific requirements. By instantiating the column parameter of every PE with the first column number, we get a cluster of vector multipliers that compute in parallel and can achieve the maximum speedup of any n-processor mode, with n² computation cycles. Each vector multiplier can also be used individually as a uniprocessor for matrix-vector multiplication. Table 2 shows the latency of the proposed module when configured as a uniprocessor, a systolic array and an optimal parallel module:

(b) Latency Comparison

Matrix Size | LXilinx (cycles/µs) | LCoreGen (cycles/µs) | LLinearArray (cycles/µs) | LProposed (cycles/µs)
3 × 3       | 45/0.292            | 45/0.398             | 16/0.096                 | 13/0.175
6 × 6       | 360/2.337           | 288/2.549            | 49/0.295                 | 43/0.581
9 × 9       | 1215/7.890          | 891/7.885            | 100/0.602                | 91/1.229
12 × 12     | 2280/14.805         | 2016/17.841          | 169/1.018                | 157/2.121
15 × 15     | 5625/36.526         | 3825/33.850          | 256/1.542                | 241/3.256
24 × 24     | 23040/149.610       | 14976/132.531        | 625/3.765                | 601/8.121
48 × 48     | 184320/1196.883     | 115200/1019.469      | 2401/14.464              | 2353/31.797

Table 2. Latency for Various Sizes of Matrix Multiplication Showing the Versatility of the Proposed Module

Proposed module configured as:

Matrix Size | Uniprocessor (cycles) | Systolic Array (cycles) | Parallel (cycles)
3 × 3       | 28                    | 13                      | 10
6 × 6       | 217                   | 43                      | 37
9 × 9       | 730                   | 91                      | 82
12 × 12     | 1729                  | 157                     | 145
15 × 15     | 3376                  | 241                     | 226
24 × 24     | 13825                 | 601                     | 577
48 × 48     | 110593                | 2353                    | 2305
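The latencies in Table 2 fit simple closed forms: n³ + 1 (uniprocessor), n² + n + 1 (systolic array) and n² + 1 (parallel), one cycle above the counts in Corollary 2, Lemma 2 and Corollary 1. This is our inference from the table, not a formula stated by the authors:

```python
# Closed-form latencies inferred from Table 2 (our reading of the data):
# uniprocessor n^3 + 1, systolic array n^2 + n + 1, parallel n^2 + 1.
def latencies(n):
    return n ** 3 + 1, n ** 2 + n + 1, n ** 2 + 1

# spot-check against three rows of Table 2
table2 = {3: (28, 13, 10), 12: (1729, 157, 145), 48: (110593, 2353, 2305)}
for size, row in table2.items():
    assert latencies(size) == row
```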


Note that the performance evaluation in Table 1 and Figure 3 was carried out by instantiating the proposed module in the systolic array mode; the parallel mode can achieve even greater improvement.


Figure 4 shows the area-latency tradeoff stated in Lemma 3 with n = 48. For matrices of other sizes, the trend remains the same. We can see from the chart that area and latency are in fact inversely proportional.

    Figure 4. Area-Latency Tradeoffs for Matrix

    Multiplication (n=48)
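The inverse proportionality follows directly from Lemma 3: with n/r PEs the latency is rn² cycles, so PE count × latency stays at n³ (a back-of-envelope check of the ideal trend; the measured slice counts of course include per-PE overheads):

```python
# Lemma 3 tradeoff for n = 48: halving the PE count doubles the latency,
# and the product n/r * r*n^2 = n^3 is invariant in r.
n = 48
for r in (1, 2, 3, 4, 6, 8, 12, 16, 24, 48):
    pes, latency = n // r, r * n * n
    assert pes * latency == n ** 3
```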

    3.2 Partial Reconfiguration

    One of the novel features in our architecture is the partial

    reconfigurability. As the target device Virtex-II FPGA

    supports partial configuration, we can partially change

    our design within independent configuration bitstream

    flows and modify only the desired parts of the silicon

    without stopping the processing or reprogramming the

    whole device. This gives us a novel space to work in

    which the cost of reconfiguration can be alleviated by

    reduced size of the bitstream.

The contents of our matrix multiplier can be changed dynamically by partially reconfiguring the memory cells of the Block RAMs embedded in the Virtex-II device. In this way, the matrix B multiplicand can be modified at run-time without re-running the whole design flow.

At this point, only the contents of the matrix multiplicand can be partially configured. We will extend partial reconfigurability to the clock templates, the data width, and the number of processing elements for different matrix sizes.

Figure 5 shows the whole implementation of 48 × 48 matrix multiplication. The rectangles designate the PEs, each containing a Block RAM for partial content reconfiguration, distributed linearly along the Block RAM columns.

Figure 5. 48 × 48 matrix product implementation on Virtex-II 4000, showing the PE distribution

4. Conclusion

A computation- and area-efficient architecture for matrix multiplication is proposed, with instantiation versatility and the feature of content partial reconfiguration. We demonstrate the improved area and latency tradeoff by comparing its performance with existing designs.

5. References

[1] Ju-Wook Jang, Seonil Choi, and Viktor K. Prasanna, Area and Time Efficient Implementation of Matrix Multiplication on FPGAs, The First IEEE International Conference on Field Programmable Technology (FPT), December 2002.
[2] A. Amira, A. Bouridane, and P. Milligan, Accelerating Matrix Product on Reconfigurable Hardware for Signal Processing, Field-Programmable Logic and Applications (FPL), pp. 101-111, 2001.
[3] O. Mencer, M. Morf, and M. Flynn, PAM-Blox: High Performance FPGA Design for Adaptive Computing, IEEE Symposium on FPGAs for Custom Computing Machines, pp. 167-174, 1998.
[4] V. K. Prasanna Kumar and Y. Tsai, On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication, IEEE Transactions on Computers, Vol. 40, no. 6, 1991.
[5] Xilinx Application Note XAPP284, Virtex-II Series, http://www.xilinx.com, 2003.