
Efficient Designs for AOP-Based Field Multiplication over GF(2^m)

Pramod Kumar Meher1, ＊ and Chiou-Yng Lee 2

1 Department of Embedded Systems,

Institute for Infocomm Research,

Singapore 138632

[email protected]

2 Department of Computer Information and Network Engineering,

Lunghwa University of Science and Technology,

Taoyuan 333, [email protected]

Received 15 July 2011; Revised 20 August 2011; Accepted 24 September 2011

Abstract. In this paper, we present an efficient recursive formulation and systolic implementation of polynomial basis finite field multiplication over GF(2^m) based on irreducible all-one-polynomials (AOP). Using the property of irreducible AOP, we have obtained a scheme of computation-free modular reduction up to degree m, where reduction of degree is achieved by cyclic-left-shift operations. In the proposed systolic architecture, the cyclic-left-shift is achieved by appropriate reordering of input lines in the processing elements (PEs). Compared with the previously existing systolic structures, the proposed one involves significantly fewer registers and requires nearly half the area-time complexity. It is shown that the proposed structure requires nearly the same number of gates as the existing bit-parallel structures. Unlike the existing bit-parallel designs, it does not require rewiring of input lines, and the critical path of the proposed structure does not increase with the field order m. For m > 17 (which is required in most practical cases), the proposed systolic design is found to have significantly less area-time complexity than the existing bit-parallel structures.

Keywords: Finite field, Galois field, finite field multiplication, elliptic curve cryptography, error-control coding, systolic array, VLSI

1 Introduction

In recent years, finite field arithmetic over GF(2^m) has gained importance with the emergence of elliptic curve cryptography (ECC) as a potential candidate for international standards to realize robust cryptosystems in resource-constrained environments [1], [2]. Along with the rapid development of VLSI technology and the growing popularity of smart cards and secure communication through portable devices, there is a strong interest in designing dedicated circuits to integrate ECC for high-volume applications. In the finite field GF(2^m), addition is the simplest of all the field operations because addition of any two bits can be performed by a logical XOR operation. Division, on the other hand, can be implemented by a look-up table arrangement or through a series of multiplications. The time involved in multiplications, consequently, plays an important role in determining the throughput and overall latency of an implementation. A number of architectures have been proposed for efficient finite field multiplication over GF(2^m) on hardware platforms [3]-[16]. Multipliers with different bases of representation, e.g., dual basis, normal basis and polynomial basis, have been realized for various applications. Multiplication in polynomial basis is relatively simpler, offers scalability for fields of higher orders, and does not require a basis conversion [17]. The polynomial basis multipliers are, therefore, more efficient and more widely used than the multipliers based on the other two bases of representation. Low-complexity algorithms and architectures have been suggested in the literature for polynomial basis finite field multiplication over GF(2^m) generated by special classes of irreducible polynomials, e.g., trinomials, pentanomials and all-one-polynomials (AOP) [6]-[16].

Koc and Sunar [11] have presented a bit-parallel canonical basis multiplier over GF(2^m) based on irreducible AOP. For efficient alternative implementation, some other bit-parallel multipliers based on irreducible AOP have also been proposed [10]-[13] in the last few years. The bit-parallel designs are found to provide a low-latency realization of the multiplication, but due to their large critical path, this class of architecture involves high average computation time, which increases rapidly with the field order m. Moreover, the design complexity of these bit-parallel architectures is also high for high-order fields. The systolic designs have many advantages over these non-systolic designs due to the regularity and modularity of design, the simplicity of their processing elements (PEs) and local interconnections [18]. A couple of systolic architectures have also been suggested for canonical basis multipliers over GF(2^m) based on irreducible AOP [14], [15], which involve a considerably large number of latches. In this paper, we suggest a new algorithm based on partially-reduced polynomials of degree m to derive a more efficient bit-level-pipelined parallel systolic design of a multiplier over GF(2^m) generated by an irreducible AOP.

＊ Correspondence author

Journal of Computers Vol. 22, No. 3, October 2011

The rest of the paper is organized as follows. The necessary mathematical formulation for derivation of the algorithms is presented in Section 2. The systolic formulation of the proposed systolic design is discussed in Section 3. The hardware and time complexities of the proposed structures are estimated and compared with the existing structures in Section 4. Conclusions are presented in Section 5.

2 Mathematical Formulation

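Let the finite field GF(2^m) be defined, in general, by an irreducible primitive polynomial of degree m, given by

$$Q(z) = z^m + \sum_{j=1}^{m-1} q_j z^j + 1 \qquad (1)$$

where q_j ∈ GF(2) for 1 ≦ j ≦ m – 1. Let the polynomial basis {1, α, α^2, …, α^(m–1)} (where α is a root of Q(z)) be used to represent the field elements. Let A and B be any two field elements in GF(2^m), represented in the form of polynomials of degree (m – 1) as

$$A = \sum_{j=0}^{m-1} a_j \alpha^j \quad \text{and} \quad B = \sum_{j=0}^{m-1} b_j \alpha^j \qquad (2)$$

where a_j, b_j ∈ {0,1} for j = 0, 1, …, m–1. The multiplication of A and B over GF(2^m) is given by

$$C = A \cdot B \bmod Q(z) \qquad (3)$$

To derive a recurrence relation for systolic implementation of the multiplication, (3) can be expanded to the form

$$C = \Big[\, b_0 \cdot A + \sum_{i=1}^{m-1} b_i \cdot P\alpha^{i-1} \Big] \bmod Q(z) \qquad (4)$$

where P = Aα is a polynomial of degree m, given by

$$P = \sum_{j=0}^{m} p_j \alpha^j \qquad (5)$$

for p_0 = 0 and p_j = a_(j-1) for j = 1, 2, …, m; and (4) can alternatively be expressed as

$$C = [\,Y \bmod Q(z)\,] + b_0 \cdot A \qquad (6a)$$

where

$$Y = \sum_{i=1}^{m-1} C_i \qquad (6b)$$

and

$$C_i = b_i \cdot \big(P\alpha^{i-1}\big) \quad \text{for } 1 \le i \le m-1 \qquad (6c)$$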

Using the polynomial expansion of (5), one can have


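$$P\alpha = \sum_{j=0}^{m} p_j \alpha^{j+1} = p_0\alpha + p_1\alpha^2 + \cdots + p_{m-1}\alpha^{m} + p_m\alpha^{m+1} \qquad (7)$$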

Since α is a root of Q(z), from (1), one can find

$$\alpha^{m} = 1 + \sum_{j=1}^{m-1} q_j \alpha^j \qquad (8)$$

When polynomial Q(z) is an AOP, (8) can be written as

$$\alpha^{m} = 1 + \alpha + \alpha^2 + \cdots + \alpha^{m-1} = \sum_{j=0}^{m-1} \alpha^j \qquad (9)$$

From (9) it can be found further that

$$\alpha^{m+1} = 1 \qquad (10)$$
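The step from (9) to (10) is short but worth spelling out. The following worked derivation uses only the symbols already defined: multiply (9) by α, then substitute (9) once more for the resulting α^m term.

$$
\begin{aligned}
\alpha^{m+1} &= \alpha \cdot \alpha^{m} = \alpha + \alpha^2 + \cdots + \alpha^{m-1} + \alpha^{m} \\
             &= \big(\alpha + \alpha^2 + \cdots + \alpha^{m-1}\big) + \big(1 + \alpha + \alpha^2 + \cdots + \alpha^{m-1}\big) \\
             &= 1
\end{aligned}
$$

Each term α^j for 1 ≦ j ≦ m – 1 appears twice and cancels, because addition in GF(2) is XOR.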

Using (10), Pα [given by (7)] can be reduced to a polynomial of degree m, to have

$$P^{1} = p_m + p_0\alpha + p_1\alpha^2 + \cdots + p_{m-1}\alpha^{m} \qquad (11)$$

From (11) we may note that a partially-reduced form P^1 [of degree m] of polynomial Pα can be obtained by a cyclic-left-shift operation of polynomial P. Similarly, a partially-reduced form P^2 [of degree m] of polynomial Pα^2 can be obtained by cyclic-left-shift of P^1 (or cyclic-left-shift of P by two bit-locations). This behaviour of AOP-based binary fields may be utilized to obtain such a partially-reduced form P^i of polynomial Pα^i by a cyclic-left-shift of P^(i-1) (or by cyclic-left-shift of P by i bit-locations) for 1 ≦ i ≦ (m-2). The partially-reduced form of polynomial Pα^i can be given by a recursive representation:

$$P^{i} = L\big(P^{i-1}\big) \quad \text{for } 1 \le i \le m-2, \quad \text{with } P^{0} = P \qquad (12)$$
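The claim behind (12), that one cyclic-left-shift is equivalent to a multiplication by α modulo the AOP, can be checked numerically. The sketch below is illustrative only: the helper names `cyclic_left_shift`, `poly_mod` and `shift_matches_alpha_multiplication` are assumptions, and polynomials are assumed to be stored as Python lists of bits with the lowest-degree coefficient first.

```python
from itertools import product

def cyclic_left_shift(p):
    """L(p): coefficient j moves to position j+1, the top coefficient wraps to position 0."""
    return [p[-1]] + p[:-1]

def poly_mod(c, q):
    """Remainder of c modulo q over GF(2); coefficient lists, lowest degree first.

    Assumes q has a leading coefficient of 1 and len(c) >= len(q).
    """
    c = list(c)
    dq = len(q) - 1
    for d in range(len(c) - 1, dq - 1, -1):
        if c[d]:
            # cancel the degree-d term by adding q(z) * z^(d - dq)
            for k in range(dq + 1):
                c[d - dq + k] ^= q[k]
    return c[:dq]

def shift_matches_alpha_multiplication(m):
    """Check P*alpha == L(P) (mod Q) for every polynomial P of degree <= m."""
    q = [1] * (m + 1)  # all-one polynomial of degree m
    return all(
        poly_mod([0] + list(p), q) == poly_mod(cyclic_left_shift(list(p)), q)
        for p in product((0, 1), repeat=m + 1)
    )
```

For example, `shift_matches_alpha_multiplication(4)` compares the two sides for all 32 polynomials of degree at most 4. The check compares fully reduced residues, whereas the recursion itself keeps the degree-m partially-reduced form and defers the final reduction.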

where L is the cyclic-left-shift operation, and (6c) may then be equivalently replaced by

$$C_i = b_i \cdot P^{i-1} \quad \text{for } 1 \le i \le m-1 \qquad (13)$$

Upon replacement of (6c) by (13), Y gets reduced to a polynomial of degree m, which can be represented in a polynomial form:

$$Y = \sum_{j=0}^{m} y_j \alpha^j \qquad (14)$$

Substituting the expansion of α^m given by (9), one can have a reduced form polynomial [Y mod Q(z)] of degree (m-1) given by

$$Y \bmod Q(z) = \sum_{j=0}^{m-1} (y_j + y_m)\,\alpha^j \qquad (15)$$
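The intermediate step between (14) and (15) can be written out as follows. This is a short worked expansion using only (9) and (14): the degree-m term of Y is separated and α^m is replaced by the all-one sum.

$$
\begin{aligned}
Y &= \sum_{j=0}^{m-1} y_j \alpha^j + y_m \alpha^{m}
   = \sum_{j=0}^{m-1} y_j \alpha^j + y_m \sum_{j=0}^{m-1} \alpha^j
   = \sum_{j=0}^{m-1} (y_j + y_m)\,\alpha^j
\end{aligned}
$$

In hardware terms, the final reduction therefore amounts to an XOR of the most significant bit y_m with each of the remaining m bits.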

Using (6a), (6b), (13) and (15), we can have a recursive formulation of desired multiplication as follows:

Algorithm for finite field multiplication over GF(2^m) generated by AOP

STEP-1: Find the polynomial P of degree m by left-shift operation of A, and find C_1 = b_1．P.

STEP-2: For i = 1 to m – 2

Obtain the partially-reduced form P^i according to (12) by cyclic-left-shift by i bit-locations.

Perform bit-level multiplication of P^i according to (13) to obtain C_(i+1).

Perform field addition of C_(i+1) with the partial result accumulated so far (C_1 when i = 1) to obtain the partial result polynomial T of degree m according to (6b). (After m – 2 iterations, i.e., for i = m – 2, the partial result T = Y.)

STEP-3: Perform modular reduction of T to reduce its degree from m to (m – 1) according to (15) to obtain [Y mod Q(z)].

STEP-4: Perform multiplication of bit b_0 with the input operand A to find b_0．A; and perform field addition of [Y mod Q(z)] with b_0．A to obtain the desired output C.

Note that all the recursive operations of the proposed algorithm are in STEP-2, while STEP-3 and STEP-4 may be considered as post-processing steps and STEP-1 may be considered as preprocessing steps.
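The four steps above can be sketched in software as follows. This is a minimal bit-level model, not the authors' hardware design. The function names `aop_multiply`, `schoolbook_multiply`, `poly_mod` and `check_all` are illustrative. Field elements are assumed to be Python lists of bits with the lowest-degree coefficient first, and m is assumed to be at least 2 with an irreducible AOP of degree m (for example m = 2, 4 or 10). The schoolbook routine is included only as an independent reference to compare against.

```python
from itertools import product

def aop_multiply(a, b):
    """Multiply a = [a_0..a_(m-1)] and b = [b_0..b_(m-1)] following STEP-1 to STEP-4."""
    m = len(a)
    # STEP-1: P = A*alpha is a plain left shift (p_0 = 0); C_1 = b_1 * P.
    p = [0] + list(a)
    t = [b[1] & pj for pj in p]
    # STEP-2: for i = 1 .. m-2, shift cyclically, multiply by b_(i+1), accumulate.
    for i in range(1, m - 1):
        p = [p[-1]] + p[:-1]                       # P^i = L(P^(i-1)), eq. (12)
        t = [tj ^ (b[i + 1] & pj) for tj, pj in zip(t, p)]
    # STEP-3: reduce degree m to m-1 by XORing the top bit into the others, eq. (15).
    y = [t[j] ^ t[m] for j in range(m)]
    # STEP-4: add b_0 * A.
    return [yj ^ (b[0] & aj) for yj, aj in zip(y, a)]

def poly_mod(c, q):
    """Remainder of c modulo q over GF(2); coefficient lists, lowest degree first."""
    c = list(c)
    dq = len(q) - 1
    for d in range(len(c) - 1, dq - 1, -1):
        if c[d]:
            for k in range(dq + 1):
                c[d - dq + k] ^= q[k]
    return c[:dq]

def schoolbook_multiply(a, b):
    """Reference: full polynomial product followed by reduction modulo the AOP."""
    m = len(a)
    prod = [0] * (2 * m - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] ^= ai & bj
    return poly_mod(prod, [1] * (m + 1))

def check_all(m):
    """Compare both multipliers on every operand pair of m-bit words."""
    elems = [list(bits) for bits in product((0, 1), repeat=m)]
    return all(aop_multiply(a, b) == schoolbook_multiply(a, b)
               for a in elems for b in elems)
```

For m = 2, multiplying 1 + α by α gives α + α^2 = 1, so `aop_multiply([1, 1], [0, 1])` returns `[1, 0]`. The loop body uses only a rotation, an AND and an XOR per bit, which is what makes the recursion of STEP-2 suitable for a linear array of identical cells.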



3 Proposed Systolic Multiplier for GF(2^m) Based on AOP Field Polynomial

The algorithm for field multiplication over GF(2^m) generated by irreducible AOP, discussed in Section 2, can be represented by the signal-flow graph (SFG) shown in Fig. 1. It consists of one left-shift node (LS), one right-shift node (RS), one reduction node (RN), (m–2) cyclic-left-shift nodes S(i) for 1 ≦ i ≦ (m – 2), m bit-multiplication nodes M(i) for 1 ≦ i ≦ m, and (m – 1) XOR nodes X(i) for 1 ≦ i ≦ (m – 1). LS performs the pre-processing of STEP-1, i.e., one left-shift operation on A to generate P.

Fig. 1. The signal-flow graph (SFG) for the finite field multiplication over GF(2^m) for AOP irreducible polynomial. The dashed lines indicate the grouping of SFG nodes for formation of PEs of a systolic array. LS: left-shift node, RS: right-shift node, RN: reduction node, S(i) for 1 ≦ i ≦ (m − 2) are cyclic-left-shift nodes, M(i) for 1 ≦ i ≦ m are bit-multiplication nodes, X(i) for 1 ≦ i ≦ (m − 1) are XOR nodes.





Fig. 2. Proposed bit-level-pipelined systolic structure for finite field multiplication over GF(2^m) based on irreducible all-one-polynomial. (a) The systolic structure. (b) Functional description of the PEs. (c) Functional description of the input pre-processing cell (IPC). (d) Functional description of the output post-processing cell (OPC). (r * U) implies logical AND operation of bit r with each of the bits of U, and U ⊕ r implies logical XOR operation of bit r with the rest of the bits of U.

The bit-multiplication node M(1) performs the multiplication of STEP-1 to generate C_1 = b_1．P. The cyclic-left-shift nodes S(i), for 1 ≦ i ≦ (m − 2), perform the partial reductions for STEP-2 according to (12), such that S(i) takes the input P and generates the output P^i. The bit-multiplication nodes M(i), for 2 ≦ i ≦ (m−1), perform the bit-multiplications for STEP-2 to generate C_i = b_i．P^(i-1) as given by (13). The XOR nodes X(i), for 1 ≦ i ≦ (m − 2), perform the field additions for STEP-2 of the algorithm: X(i) adds C_(i+1) to the partial sum received from the preceding node (C_1 in the case of X(1)), such that X(m−2) generates the partial result polynomial Y of degree m. The reduction node RN performs a modular reduction of Y according to (15) for STEP-3 to produce [Y mod Q(z)]. The bit-multiplication node M(m) and the XOR node X(m−1) perform the operations for STEP-4. Node RS performs a right-shift of P to regenerate the input operand A, while M(m) performs the multiplication of bit b_0 with all the bits of A to generate b_0．A. Node X(m − 1) performs the field addition of b_0．A with [Y mod Q(z)] to produce the output C according to (6a). The dashed lines in Fig. 1 indicate the grouping of SFG nodes for formation of PEs of a linear systolic array.

The proposed systolic array for finite field multiplication over GF(2^m) is shown in Fig. 2. It consists of (m − 2) processing elements (PEs) along with one input pre-processing cell (IPC) and one output post-processing cell (OPC). Nodes LS and M(1) of the SFG are mapped to the IPC, while the nodes RS, M(m), X(m − 1) and RN of the SFG are mapped to the OPC. The SFG nodes S(i), M(i + 1) and X(i) are mapped to PE[i] for 1 ≦ i ≦ m − 2. The functions of each PE, the IPC and the OPC are depicted, respectively, in Figs. 2(b), (c) and (d). Detailed designs of the IPC, OPC and the PEs are shown in Fig. 3. During each cycle, the IPC receives an m-bit input word A and performs a left-shift operation to generate P, which is latched out as X1out to PE[1]. It also performs the multiplication of bit b_1 with P through AND operations of b_1 with all the bits of P using m AND gates, as shown in Fig. 3(b). The outputs of all the AND gates are latched out as output X2out from the IPC to PE[1]. During each cycle, each PE is required to perform a certain number of cyclic-left-shift operations on input X1in, which can be performed, without rewiring, by an appropriate order of connection with the input data lines as shown in Fig. 3(a). In the case of PE[i] (the i-th PE), a_(m-i+j) is used as the (j + 1)-th bit of the i-bit-cyclic-left-shifted word (T) for 0 ≦ j ≦ i−1. The (i+1)-th bit of T is 0. The input bits a_j for 0 ≦ j ≦ (m − i − 1) correspond to the (i + j + 2)-th bit of the i-bit-cyclic-left-shifted word T. The i-th PE performs the multiplication of the left-shifted word T with the operand bit b_(i+1) by AND operations through m AND gates. Since the (i+1)-th bit of T is 0, no AND gate is required for this bit. The field addition of the outputs of the AND gates with X2in is performed by bit-by-bit XOR operations through XOR gates. Since the (i + 1)-th bit of T is 0, the output of the AND operation with this bit is also zero, and the result of the subsequent XOR operation is the same as the corresponding bit of X2in. Since the addition of 0 makes no change to the corresponding input bit, no XOR gate is required for this bit. The OPC performs the modular reduction of input X2in according to (15) by XOR operations of its MSB with each of its other bits using m XOR gates. Besides, it performs one right-shift to generate A from P and performs the bit-multiplication of bit b_0 with operand A by using m AND gates. Finally, it adds the outputs of the AND gates to the reduced form of X2in to produce the desired product word C.
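The input reordering described for PE[i] can be tabulated and compared against an explicit i-fold cyclic-left-shift of P. The sketch below is an illustration under stated assumptions: `pe_wiring` and `shifted_word` are hypothetical helper names, positions are 0-based (so position k is the (k+1)-th bit in the text), and `None` marks the bit that is tied to 0.

```python
def pe_wiring(m, i):
    """For PE[i], return the index j of the input bit a_j feeding each bit of T.

    The entry is None for the single bit of T that is the constant 0.
    """
    wiring = [None] * (m + 1)
    for j in range(i):              # 0 <= j <= i-1: a_(m-i+j) is the (j+1)-th bit
        wiring[j] = m - i + j
    # position i, the (i+1)-th bit, stays None: it carries p_0 = 0
    for j in range(m - i):          # 0 <= j <= m-i-1: a_j is the (i+j+2)-th bit
        wiring[i + j + 1] = j
    return wiring

def shifted_word(a, i):
    """i-fold cyclic-left-shift of P = [0, a_0, ..., a_(m-1)] done by actual shifting."""
    p = [0] + list(a)
    for _ in range(i):
        p = [p[-1]] + p[:-1]
    return p

def wiring_matches_shift(m):
    """Check the fixed wiring against explicit shifting for every PE of the array."""
    a = list(range(1, m + 1))       # distinct non-zero labels standing in for a_0..a_(m-1)
    for i in range(1, m - 1):       # PE[1] .. PE[m-2]
        wired = [0 if s is None else a[s] for s in pe_wiring(m, i)]
        if wired != shifted_word(a, i):
            return False
    return True
```

For m = 4 and i = 1, `pe_wiring(4, 1)` gives `[3, None, 0, 1, 2]`, i.e. T = (a_3, 0, a_0, a_1, a_2). Because the mapping depends only on the position i of the PE, the shift costs no gates and can be fixed in the connections of each cell.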

Fig. 3. Detailed structure of IPC, OPC and PEs. (a) Internal structure of PE[i]. (b) Structure of IPC. (c) Structure of OPC.

The critical path of the IPC is T_IPC = T_A and that of each PE is T_PE = T_X + T_A, where T_X and T_A are the delays of an XOR gate and an AND gate, respectively. Using a pipelining register between each pair of XOR gates (as shown in Fig. 3(c)), the critical path of the OPC is reduced to T_OPC = T_X + T_A. The minimum clock period of the systolic array is T_C = max{T_IPC, T_OPC, T_PE} = T_X + T_A. The proposed systolic array gives the first output of the desired product m cycles after the first pair of operands is fed to the structure, while the successive outputs are produced in each cycle thereafter.

4 Area and Time Complexities

The proposed systolic structure for finite field multiplication requires (m − 2) PEs, one IPC and one OPC. Each of the (m−2) PEs consists of m XOR gates and m AND gates. The IPC requires m AND gates, while the OPC requires 2m XOR gates and m AND gates. Apart from that, the IPC, the OPC and each of the PEs require 2m bit-latches for pipelining. The systolic multiplier thus requires m^2 AND gates, m^2 XOR gates and 2m^2 bit-registers. (Note that one can also have a semi-systolic implementation of the proposed design, where the input X1in could be broadcast to the PEs, and the number of latches can then be reduced to m^2.) After a latency of m cycles, it gives the desired output word in every cycle of duration T_C = T_X + T_A.

The gate-counts, register-counts, latency in cycles, and duration of cycle period (critical path) of the proposed structure, along with those of the previously existing systolic structures and bit-parallel non-systolic structures of multipliers over GF(2^m) based on irreducible AOP [10]-[15], are listed in Table 1. The systolic structure of [15] has nearly the same time-complexity as the proposed one, but requires nearly double the number of registers, which contributes substantially to the total area of the structure. The systolic structure of [14] has the same ACT as the proposed structure, but the proposed one has a significantly shorter critical path. Apart from that, although the structure of [14] has nearly the same number of gates, it involves more registers than the proposed one. The bit-parallel structures of [10]-[13] involve nearly the same number of AND gates and XOR gates as the proposed structure. These bit-parallel structures do not require any pipelining registers and involve fewer cycles of latency, but they do not have the advantages of systolic designs. Besides, the critical path of these bit-parallel structures grows with the field order (see the cycle times in Table 1).

The area-time complexity (product of area-complexity and average computation time (ACT)) of the proposed structure and the previously existing structures is also shown in Table 1. The size of an AND gate is assumed to be one unit of area, and those of an XOR gate and a D flip-flop are taken to be two units and four units of area, respectively. Similarly, the delay of a 2-input AND gate is taken to be one unit delay and that of an XOR gate is taken to be (3/2) unit delays to evaluate the area-time complexities. (Considering m to be large, the terms containing m and small multiples of m are ignored compared with m^2 in the existing designs.) The area-time complexities of the previously existing systolic structures of [14] and [15] are found to be nearly twice that of the proposed systolic structure. The area-time complexity of the existing bit-parallel structures [10]-[13] is significantly higher than that of the proposed structure (for m > 17 in the case of the systolic design, and for m > 9 in the case of the semi-systolic design) and increases rapidly with the field order m. The average computation time (ACT) is one cycle in all these architectures. The gate-counts refer to 2-input gates. The area-complexity and delay of a 3-input XOR gate are taken to be twice those of a 2-input XOR gate.

Table 1. Hardware and Time Complexities of the Proposed Systolic Structure and the Existing Structures for Finite Field Multiplication over GF(2^m) Generated by AOP Primitive Polynomials.

| Designs | AND | XOR | Register | Latency (Cycles) | Cycle Time | Area-Time |
|---|---|---|---|---|---|---|
| Hasan et al. [10] (parallel) | m^2 | m^2+m−2 | 2m | 1 | T_A + (m + n)T_X | ≈ 4.5(m + n)m^2 |
| Koc and Sunar [11] (parallel) | m^2 | m^2−1 | 2m | 1 | T_A + (n + 2)T_X | ≈ 4.5(n + 8/3)m^2 |
| Kim [12] (parallel) | m^2 | m^2−1 | 2m | 1 | T_A + (n + 1)T_X | ≈ 4.5(n + 5/3)m^2 |
| Reyhani [13] (parallel) | m^2 | m^2−1 | 2m | 1 | T_A + (n + 1)T_X | ≈ 4.5(n + 5/3)m^2 |
| Lee et al. [14] (systolic) | (m+1)^2 | m^2+3m+1 | 5(m^2+5m+2)/2 | m/2+2 | 2T_X + T_A | ≈ 52m^2 |
| Lee et al. [15] (systolic) | (m+1)^2 | (m+1)^2 | 4(m+1)^2 | m+2 | T_X + T_A | ≈ 47.5m^2 |
| Proposed (systolic) | m^2 | m^2 | 2m^2 | m | T_X + T_A | 27.5m^2 |
| Proposed (semi-systolic) | m^2 | m^2 | m^2 | m | T_X + T_A | 17.5m^2 |
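The area-time figures of the systolic rows in Table 1 follow from the unit costs stated in the text (AND = 1, XOR = 2 and D flip-flop = 4 area units; AND = 1 and XOR = 3/2 delay units) and an ACT of one cycle. The short sketch below reproduces that arithmetic. The function name `area_time_coefficient` and the dictionary layout are illustrative, and only the leading m^2 coefficients of the counts are used, as in the text.

```python
# Unit costs taken from Section 4: areas in AND-gate units, delays in AND-gate delays.
AREA = {"and": 1.0, "xor": 2.0, "reg": 4.0}
DELAY = {"and": 1.0, "xor": 1.5}

def area_time_coefficient(n_and, n_xor, n_reg, xors_in_path, ands_in_path=1):
    """Return k such that (area x cycle time) is approximately k * m^2.

    n_and, n_xor and n_reg are the leading (m^2) coefficients of the gate and
    register counts; terms of lower order in m are ignored.
    """
    area = n_and * AREA["and"] + n_xor * AREA["xor"] + n_reg * AREA["reg"]
    cycle = ands_in_path * DELAY["and"] + xors_in_path * DELAY["xor"]
    return area * cycle

COEFFICIENTS = {
    "Lee et al. [14] (systolic)": area_time_coefficient(1, 1, 2.5, xors_in_path=2),
    "Lee et al. [15] (systolic)": area_time_coefficient(1, 1, 4, xors_in_path=1),
    "Proposed (systolic)": area_time_coefficient(1, 1, 2, xors_in_path=1),
    "Proposed (semi-systolic)": area_time_coefficient(1, 1, 1, xors_in_path=1),
}

if __name__ == "__main__":
    for design, k in COEFFICIENTS.items():
        print(f"{design}: {k} m^2")
```

For the proposed systolic design the area is (1 + 2 + 8)m^2 = 11m^2 units and the cycle time is 1 + 1.5 = 2.5 units, which gives the 27.5m^2 listed in the table. The register term dominates the area in every systolic row, which is why halving the register count roughly halves the area-time product.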

5 Conclusions

An efficient recursive formulation is suggested for systolic implementation of polynomial basis finite field multiplication over GF(2^m) based on irreducible AOP, using a computation-free scheme of reduction of polynomials up to degree m, where the degree-reduction operation is achieved by cyclic-left-shift operations. It is shown further that the necessary cyclic-left-shifts can be achieved by appropriate reordering of input connections in the PEs, without rewiring of input lines (unlike some of the previously existing structures). Compared with the existing systolic structures, the proposed structure is found to involve significantly fewer registers and requires nearly half the area-time complexity. It is shown further that the proposed structure requires nearly the same number of gates as the bit-parallel structures, but has a much shorter critical path than the others, particularly for higher-order Galois fields. The proposed systolic design for m > 17, and the semi-systolic implementation of the proposed structure for m > 9, are found to involve significantly less area-time complexity than the bit-parallel structures. The technique suggested in [16] can be used in combination with the proposed design to obtain a more efficient implementation of finite field multiplication over GF(2^m) based on irreducible AOP.

References

[1] R. Lidl and H. Niederreiter, Eds., Introduction to Finite Fields and their Applications, Cambridge University Press, New York, 1986.

[2] [Online]. Available: http://www.csrc.nist.gov/publications

[3] L. Song and K. K. Parhi, “Efficient Finite Field Serial/Parallel Multiplication,” in Proceedings of 1996 International Conference on Application-Specific Systems, Architectures, and Processors, Chicago, IL, USA, pp. 72-82, 1996.

[4] S. K. Jain, L. Song, K. K. Parhi, “Efficient Semisystolic Architectures for Finite-field Arithmetic,” IEEE Transactions on Very Large Scale Integration Systems, Vol. 6, No. 1, pp. 101-113, 1998.

[5] P. K. Meher, “Systolic Formulation for Low-complexity Serial-parallel Implementation of Unified Finite Field Multiplication over GF(2^m),” in Proceedings of IEEE International Conference on Application-Specific Systems, Architectures and Processors, Montréal, Québec, Canada, 2007.

[6] F. Rodriguez-Henriquez and C. K. Koc, “Parallel Multipliers based on Special Irreducible Pentanomials,” IEEE Transactions on Computers, Vol. 52, No. 12, pp. 1535-1542, 2003.

[7] J. L. Imaña, J. M. Sanchez, F. Tirado, “Bit-parallel Finite Field Multipliers for Irreducible Trinomials,” IEEE Transactions on Computers, Vol. 55, No. 5, pp. 520-533, 2006.

[8] A. Reyhani-Masoleh and M. A. Hasan, “Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over GF(2^m),” IEEE Transactions on Computers, Vol. 53, No. 8, pp. 945-959, 2004.

[9] W. Tang, H. Wu, M. Ahmadi, “VLSI Implementation of Bit-parallel Word-serial Multiplier in GF(2^233),” in Proceedings of IEEE International Conference on New Circuits and Systems, Québec, Canada, pp. 399-402, 2005.

[10] M. A. Hasan, M. Z. Wang, V. K. Bhargava, “Modular Construction of Low Complexity Parallel Multipliers for a Class of Finite Fields GF(2^n),” IEEE Transactions on Computers, Vol. 41, No. 8, pp. 962-971, 1992.

[11] C. K. Koc and B. Sunar, “Low-complexity Bit-parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields,” IEEE Transactions on Computers, Vol. 47, No. 3, pp. 353-356, 1998.

[12] C. H. Kim, S. Oh, J. Lim, “A New Hardware Architecture for Operations in GF(2^n),” IEEE Transactions on Computers, Vol. 51, No. 1, pp. 90-92, 2002.

[13] A. Reyhani-Masoleh and M. A. Hasan, “A New Construction of Massey-Omura Parallel Multiplier over GF(2^n),” IEEE Transactions on Computers, Vol. 51, No. 5, pp. 511-520, 2002.

[14] C.Y. Lee, E.H. Lu, J.Y. Lee, “New Bit-parallel Systolic Multipliers for a Class of GF(2^m),” in Proceedings of IEEE International Symposium on Circuits and Systems, Sydney, Australia, pp. 578-581, 2001.

[15] C.Y. Lee, E.H. Lu, J. Yien, “High-speed Bit-parallel Systolic Multipliers for a Class of GF(2^m),” in Proceedings of International Symposium on VLSI Technology, Systems, and Applications, Hsinchu, Taiwan, pp. 291-294, 2001.

[16] K.Y. Chang, D. Hong, H.S. Cho, “Low Complexity Bit-parallel Multiplier for GF(2^m) Defined by All-one Polynomials Using Redundant Representation,” IEEE Transactions on Computers, Vol. 54, No. 12, pp. 1628-1630, 2005.

[17] I. S. Hsu, T. K. Truong, L. J. Deutsch, I. S. Reed, “A Comparison of VLSI Architecture of Finite Field Multipliers Using Dual, Normal, or Standard Bases,” IEEE Transactions on Computers, Vol. 37, No. 6, pp. 735-739, 1988.

[18] S. Y. Kung, VLSI Array Processors. Prentice Hall, Inc., Upper Saddle River, NJ, USA, 1987.
