IEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. 5, MAY …Elliptic Curve Cryptography with Efficiently...

Elliptic Curve Cryptography with Efficiently Computable Endomorphisms and Its Hardware Implementations for the Internet of Things

Zhe Liu, Johann Großschädl, Member, IEEE, Zhi Hu, Kimmo Järvinen, Husen Wang, and Ingrid Verbauwhede

Abstract: Verification of an ECDSA signature requires a double scalar multiplication on an elliptic curve. In this work, we study the computation of this operation on a twisted Edwards curve with an efficiently computable endomorphism, which allows reducing the number of point doublings by approximately 50 percent compared to a conventional implementation. In particular, we focus on a curve defined over the 207-bit prime field $\mathbb{F}_p$ with $p = 2^{207} - 5131$. We develop several optimizations to the operation and we describe two hardware architectures for computing the operation. The first architecture is a small processor implemented in 0.13 μm CMOS ASIC and is useful in resource-constrained devices for Internet of Things (IoT) applications. The second architecture is designed for fast signature verifications by using FPGA acceleration and can be used on the server side of these applications. Our designs offer various trade-offs and optimizations between performance and resource requirements and they are valuable for IoT applications.

Index Terms: VLSI designs, Internet-of-Things, signature verification, elliptic curve cryptography, multiple-precision arithmetic

1 INTRODUCTION

THE Internet of Things (IoT) is a paradigm in which objects, such as Radio Frequency IDentification (RFID) tags, sensors, mobile phones, appliances, etc., are provided with unique identifiers and the ability to communicate with each other over a network to reach common goals without requiring human interaction [1]. IoT has been a promising approach to many diverse applications (i.e., civilian types) and is playing a major role in the upcoming age of intelligent networking. With the increase in popularity of such networks, cryptographic protocols must be widely used to protect their security. Due to the resource, computing, and environmental constraints, it is a challenging task to efficiently implement cryptographic protocols for the IoT. These constraints mean that cryptographic implementations in IoT applications must be fast and compact but still provide security levels comparable to more traditional systems [2]. This has attracted many researchers' attention and the topic is an active area of fruitful research work.

Digital signatures are an indispensable component of modern security protocols such as Transport Layer Security (TLS) [3], which is used to authenticate servers and clients. The Datagram TLS (DTLS) [4], a variant of TLS optimized for connectionless datagram transport (i.e., UDP), is widely considered as the future standard protocol for securing the IoT [5]. The signature algorithms supported by the most recent version (i.e., version 1.2) of TLS are RSA [6], DSA [7], as well as ECDSA [8] through a separate RFC [9]. The elliptic curve cryptography used by the Elliptic Curve Digital Signature Algorithm (ECDSA) is usually considered to be more applicable for low-end devices than RSA, since it requires relatively small key sizes and operand lengths [10]. In the state-of-the-art implementation, a 255-bit ECDSA signature (matching the security of 128-bit AES) has a size of merely 64 bytes when it is compressed [11], i.e., less than one sixth of the RSA signature size at the same security level.

However, an inherent problem with ECDSA signatures is that, despite their small size, the verification process is relatively computation intensive. This problem is emphasized in heavily-loaded servers which may require thousands of verifications per second and, thus, benefit from hardware accelerators. The verification of an ECDSA signature requires a double scalar multiplication, an operation of the form $k \cdot G + l \cdot Q$, where $G$ is a point on an elliptic curve $E$ that generates a large group of prime order $r$, $Q$ is an (arbitrary) element of this group, and $k$ and $l$ are two integers in the range $[1, r-1]$ [8]. Normally, $k \cdot G + l \cdot Q$ is computed in a simultaneous fashion (i.e., with joint doublings) so that at most $m$ doublings need to be executed in total, where $m$ is the bitlength of $r$ [12].

• Z. Liu is with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China, and the Institute for Quantum Computing, and Department of Combinatorics and Optimization, University of Waterloo, Waterloo, ON N2L 3G1, Canada. E-mail: [email protected].
• J. Großschädl and H. Wang are with the University of Luxembourg, Esch-sur-Alzette L-4365, Luxembourg. E-mail: {johann.groszschaedl, husen.wang}@uni.lu.
• Z. Hu is with the School of Mathematics and Statistics, Central South University, Changsha, Hunan 410083, P.R. China. E-mail: [email protected].
• K. Järvinen was with ESAT/COSIC and iMinds, KU Leuven, Leuven 3000, Belgium. He is now with the Department of Computer Science, Aalto University, Aalto 00076, Finland. E-mail: kimmo.jarvinen@aalto.fi.
• I. Verbauwhede is with ESAT/COSIC and iMinds, KU Leuven, Leuven 3000, Belgium. E-mail: [email protected].

Manuscript received 19 Mar. 2016; revised 31 July 2016; accepted 24 Aug. 2016. Date of publication 31 Oct. 2016; date of current version 13 Apr. 2017. Recommended for acceptance by C.K. Koc. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2016.2623609

IEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. 5, MAY 2017, p. 773. 0018-9340 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Most previous attempts to reduce the execution time of this operation fall into one of two categories, namely, (a) approaches that aim at minimizing the cost of a single point addition or doubling and (b) techniques to reduce the number of these operations. An example of (a) is EdDSA [11], which is a signature scheme based on a twisted Edwards curve [13] that allows more efficient implementations of point arithmetic than a basic Weierstrass curve [14]. Window methods to reduce the number of point additions in a double scalar multiplication (for example, as described in [12, p. 109]) fall into category (b). Another option to cut down the number of point operations is to exploit an efficiently computable endomorphism on a special curve, such as a Gallant-Lambert-Vanstone (GLV) curve, as explained in [15] for variable-base scalar multiplication. A combination of the above approaches, namely using the twisted Edwards addition law on so-called Galbraith-Lin-Scott (GLS) [16] and GLV-GLS [17] curves (both of which are defined over $\mathbb{F}_{p^2}$ and possess endomorphisms), has also been investigated in [18] and [19].

In this paper we introduce families of twisted Edwards curves with an efficiently computable endomorphism $\phi$ and demonstrate how such an endomorphism can be used to speed up the ECDSA verification process. We focus particularly on hardware implementation of double scalar multiplication for IoT applications. We study the implementation properties of the twisted Edwards curve over a prime field defined as follows:

$$E_T/\mathbb{F}_p : -x^2 + y^2 = 1 + x^2 y^2 \tag{1}$$

(i.e., $a = -1$ and $d = 1$), which is birationally equivalent over $\mathbb{F}_p$ to a GLV curve [15] of the form $E_W : y^2 = x^3 + ax$. Gallant et al. [15] first described how an efficiently computable endomorphism $\phi$ can be used to speed up a variable-base scalar multiplication on such curves. In order to accelerate a scalar multiplication $k \cdot P$, the scalar $k$ is split into two parts $k_1$ and $k_2$ of about half the length of the original $k$ (as explained in, e.g., [15]). Then the scalar multiplication is computed as $k \cdot P = k_1 \cdot P + k_2 \cdot \phi(P)$ in a simultaneous fashion, which saves roughly 50 percent of the point doublings compared to a straightforward computation of $k \cdot P$. While most of the previous work on exploiting endomorphisms has focused primarily on variable-base scalar multiplication (such as needed in ECDH key exchange), we direct our attention to the double scalar multiplication carried out in the verification of an ECDSA signature. When taking advantage of the endomorphism $\phi$, an $m$-bit double scalar multiplication $k \cdot G + l \cdot Q$ can be performed via four simultaneous half-length (i.e., roughly $m/2$-bit) scalar multiplications of the form $k_1 \cdot G + k_2 \cdot \phi(G) + l_1 \cdot Q + l_2 \cdot \phi(Q)$, as shown by Galbraith et al. in [16].

The real-world benefit of our setting is that it supports a multitude of implementation options and trade-offs between execution time and silicon area (when thinking about hardware implementation) or memory footprint (in the context of software implementation).¹ Our curve allows a designer to fine-tune an implementation according to the requirements at hand. When resources are constrained, one can perform a double scalar multiplication in the straightforward fashion by computing two simultaneous $m$-bit scalar multiplications, which is very economical in terms of memory. But if more resources are available, our curve allows the designer to trade memory or area (depending on whether the implementation is in software or hardware) for performance by exploiting the efficiently computable endomorphism. Moreover, it is possible to achieve further speed-ups by combining the endomorphism with a window method for simultaneous scalar multiplication when plenty of memory is available.

The main contributions of our work include:

• We extensively study the computation of double-base scalar multiplication on twisted Edwards curves with an efficiently computable endomorphism that allows reducing the number of point doublings by approximately 50 percent compared to a conventional implementation. In particular, we focus on a curve defined over the 207-bit prime field $\mathbb{F}_p$ with $p = 2^{207} - 5131$, which offers a roughly 100-bit security level.

• We develop several optimizations to the operation and describe two hardware architectures for computing the operation. The first architecture is a small processor implemented in 0.13 μm CMOS ASIC, which has an overall silicon area of only 5,821 gate equivalents and is useful in resource-constrained devices for IoT applications. The second architecture is designed for fast signature verifications by using FPGA acceleration and can be used on the server side of these applications. These architectures demonstrate that our methods offer various trade-offs and optimizations between performance and resource requirements and that they are valuable for IoT applications.

The paper is organized as follows. In Section 2, we recap the background and present how to compute an endomorphism on a twisted Edwards curve. In Section 3, we describe how to generate such curves and give an example curve that is used in our implementations. Section 4 reviews several approaches for computing the double scalar multiplication and presents how to speed up the operation by using an endomorphism. We describe the architectures and give the implementation results for the small processor for signature generation and verification and for the high-speed core for verifications in Sections 5 and 6, respectively. Finally, we draw conclusions in Section 7.

1. Such options and trade-offs are particularly important for cryptographic schemes for the IoT since IoT devices come in all shapes and sizes and, therefore, have varying resource constraints. At one end of the spectrum are devices with extreme restrictions (e.g., RFID tags, sensor nodes) where every single gate and byte counts. At the other end of the spectrum are devices with plenty of resources that are equipped with powerful 32-bit or 64-bit processors or FPGAs.

2 TWISTED EDWARDS CURVES WITH ENDOMORPHISMS

2.1 Twisted Edwards Curves

Twisted Edwards curves were introduced to cryptography by Bernstein et al. [13] in 2008 and they are currently considered to be one of the most efficient models for ECC implementation. Let $\mathbb{F}_p$ be a prime field with $p > 3$. A twisted Edwards curve over $\mathbb{F}_p$ can be defined as

$$E_{a,d} : ax^2 + y^2 = 1 + dx^2 y^2, \tag{2}$$

where $a$ and $d$ satisfy $ad(a - d) \neq 0$. As specified in [13], the j-invariant of $E_{a,d}$ is

$$j(E_{a,d}) = \frac{16(a^2 + 14ad + d^2)^3}{ad(a - d)^4}.$$

There is a remarkable addition law on twisted Edwards curves, which is complete when $a$ is a square and $d$ a non-square in $\mathbb{F}_p$ [13]. Here completeness means that the addition produces a correct result for any two points on $E_{a,d}$ without exception (even if one of the points is the neutral element $O = (0, 1)$).
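The completeness property can be illustrated with a minimal Python sketch of the affine twisted Edwards addition law from [13]. The parameters below (p = 13, a = -1, d = 2) are toy values chosen only so that a is a square and d a non-square; they are an assumption for illustration and not the curve studied in this paper.

```python
# Toy parameters (assumptions for illustration only, not the paper's curve):
# P = 13, A = -1 (a square mod 13), D = 2 (a non-square mod 13).
P = 13
A = P - 1
D = 2

def on_curve(pt):
    """Check a*x^2 + y^2 = 1 + d*x^2*y^2 over F_P."""
    x, y = pt
    return (A * x * x + y * y - 1 - D * x * x * y * y) % P == 0

def inv(v):
    """Modular inverse via Fermat's little theorem (P is prime)."""
    return pow(v % P, P - 2, P)

def ed_add(p1, p2):
    """Affine twisted Edwards addition; no special cases are needed
    when the law is complete (a square, d non-square)."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2 % P
    x3 = (x1 * y2 + y1 * x2) * inv(1 + t) % P
    y3 = (y1 * y2 - A * x1 * x2) * inv(1 - t) % P
    return (x3, y3)

# All affine points of the toy curve, found by exhaustive search.
POINTS = [(x, y) for x in range(P) for y in range(P) if on_curve((x, y))]
NEUTRAL = (0, 1)
```

The same formula handles doubling, addition of the neutral element, and addition of inverses, which is what makes the law attractive for regular hardware data paths.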

In the rest of this work, we adopt the twisted Edwards model for our desired curve, which provides very efficient elliptic curve group arithmetic together with good security properties. On the one hand, the group formulae of twisted Edwards curves are usually more efficient than those of other curve models, i.e., they require fewer finite field arithmetic operations [13], [14]. On the other hand, the complete group law of twisted Edwards curves admits a more secure execution pattern, and thus an implementation of scalar multiplication on such a curve can resist certain side-channel attacks [14].

2.2 GLV Method

In 2001, Gallant, Lambert, and Vanstone [15] described a new method, now known as the GLV method, for speeding up scalar multiplication on certain classes of elliptic curves with efficiently computable endomorphisms. Let $E$ be an elliptic curve over a finite field $\mathbb{F}_p$ and let $G \in E(\mathbb{F}_p)$ have prime order $r$. Assume that there exists an efficiently computable endomorphism $\phi$ on $E$ such that $\phi(G) = \lambda \cdot G \in \langle G \rangle$. The GLV method replaces the computation $k \cdot G$ by a multiscalar multiplication of the form $k_1 \cdot G + k_2 \cdot \phi(G)$, where the sub-scalars $k_1$ and $k_2$ have lengths of approximately half of the original scalar $k$. These two scalar multiplications can be computed simultaneously by using the so-called Shamir's trick [12, p. 109], which iterates over the scalars so that corresponding bits from the two scalars are processed simultaneously. This halves the number of doublings and, hence, the GLV method potentially gives significant speedups in scalar multiplications on these elliptic curves.
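The scalar splitting behind the GLV method can be sketched as follows. This is a minimal sketch in the spirit of the extended-Euclid approach of [12, Algorithm 3.74], not the authors' implementation: the function names are hypothetical, and the rounding step divides by the basis determinant (which equals n up to sign) rather than by n, an assumption made here so that the sign is handled uniformly.

```python
def short_basis(n, lam):
    """Two short vectors (a, b) with a + b*lam = 0 (mod n), via extended Euclid."""
    rs, ts = [n, lam], [0, 1]      # invariant: s_i*n + ts[i]*lam == rs[i]
    while rs[-1] != 0:
        q = rs[-2] // rs[-1]
        rs.append(rs[-2] - q * rs[-1])
        ts.append(ts[-2] - q * ts[-1])
    # Index of the last remainder that is still >= sqrt(n).
    l = max(i for i, r in enumerate(rs) if r * r >= n)
    v1 = (rs[l + 1], -ts[l + 1])
    candidates = [(rs[l], -ts[l])]
    if l + 2 < len(rs):
        candidates.append((rs[l + 2], -ts[l + 2]))
    v2 = min(candidates, key=lambda v: v[0] ** 2 + v[1] ** 2)
    return v1, v2

def round_div(x, d):
    """Nearest integer to x/d using integer arithmetic only."""
    if d < 0:
        x, d = -x, -d
    return (2 * x + d) // (2 * d)

def decompose(k, n, lam):
    """Return (k1, k2) with k1 + k2*lam = k (mod n) and both about half length."""
    (a1, b1), (a2, b2) = short_basis(n, lam)
    det = a1 * b2 - a2 * b1        # equals +n or -n
    c1 = round_div(b2 * k, det)
    c2 = round_div(-b1 * k, det)
    return k - c1 * a1 - c2 * a2, -c1 * b1 - c2 * b2
```

For example, with the toy values n = 101 and lam = 10 (so that lam^2 + 1 = 101), `decompose(77, 101, 10)` yields a pair of small, possibly negative sub-scalars whose combination is congruent to 77.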

Gallant et al. described in [15] several families of curves featuring an efficiently computable endomorphism derived from special complex multiplication (CM). Let $\varphi$ be a complex number and $K$ be the extension field $\mathbb{Q}(\varphi)$. If such an elliptic curve admits complex multiplication by $\varphi$, then by [20, Thm 10.14] we obtain an endomorphism

$$\phi(x, y) = \left( \varphi^{-2} \frac{f(x)}{g(x)},\; y\,\varphi^{-3} \left( \frac{f(x)}{g(x)} \right)' \right)$$

and $\phi(O) = O$, where $f, g$ are polynomial functions over $\mathbb{Q}$ with $\deg f = N_{K/\mathbb{Q}}(\varphi)$ and $\deg g = N_{K/\mathbb{Q}}(\varphi) - 1$ (here $N_{K/\mathbb{Q}}(\cdot)$ is the norm function from $K$ to $\mathbb{Q}$).

2.3 Efficient Endomorphism on Twisted Edwards Curve

The twisted Edwards curve $E_{a,d} : ax^2 + y^2 = 1 + dx^2 y^2$ is birationally equivalent to a short Weierstrass curve $E_s : y^2 = x^3 + a_s x + b_s$, where the birational equivalence map can be given as

$$\psi : E_{a,d} \to E_s, \quad (x_t, y_t) \mapsto (x_s, y_s) = \left( \frac{c_1(1 + y_t)}{1 - y_t} + c_2,\; \frac{c_1(1 + y_t)}{x_t(1 - y_t)} \right),$$

$$\psi^{-1} : E_s \to E_{a,d}, \quad (x_s, y_s) \mapsto (x_t, y_t) = \left( \frac{x_s - c_2}{y_s},\; \frac{x_s - c_3}{x_s + c_4} \right), \tag{3}$$

where $c_1 = (a - d)/4$, $c_2 = (a + d)/6$, $c_3 = (5a - d)/12$, and $c_4 = (a - 5d)/12$.

The original GLV method works on some elliptic curves in the Weierstrass model with special complex multiplication (such as CM discriminant $D = -3, -4, -7, -8$, etc.). If there exists an efficient endomorphism $\phi$ on the elliptic curve $E_s$, then we can obtain an efficient endomorphism $\phi_t$ on $E_{a,d}$ as $\psi^{-1} \phi \psi$. Thus the GLV method is also applicable to twisted Edwards curves with some efficient endomorphism.

Usually the computation of endomorphisms on the short Weierstrass model is considerably simpler than on the twisted Edwards model. Here we take the most common cases of "GLV friendly" curves, with j-invariant 0 and 1,728, as examples.

2.3.1 j-Invariant 0

This class of elliptic curves has CM discriminant $D = -3$, and can be given by a Weierstrass equation of the form

$$E_{\tilde{b}} : y^2 = x^3 + \tilde{b} \tag{4}$$

over a prime field $\mathbb{F}_p$ with $p \equiv 1 \bmod 3$, which means $\mathbb{F}_p$ contains an element $\beta$ of order 3. In this case, the map $\phi : E_{\tilde{b}} \to E_{\tilde{b}}$ given by $(x, y) \mapsto (\beta x, y)$ and $O \mapsto O$ is an endomorphism defined over $\mathbb{F}_p$. If $G \in E_{\tilde{b}}(\mathbb{F}_p)$ is a point of prime order $r$, then $\phi(G) = \lambda \cdot G = (\beta x, y)$, where $\lambda$ is an integer satisfying $\lambda^2 + \lambda + 1 \equiv 0 \bmod r$. There are only six possible group orders for such curves when $p$ is fixed.

Alternatively, we can find a twisted Edwards curve birationally equivalent to the GLV curve $E_{\tilde{b}}$ with the help of the equation for the j-invariant: $j(E_{a,d}) = 0$ requires $a^2 + 14ad + d^2 = 0$, and when we fix $a$ to $-1$ then $d = -7 \pm 4\sqrt{3}$. Thus we can obtain an endomorphism on its birationally equivalent twisted Edwards curve $E_{a,d}$ as

$$\phi_t(x, y) = \left( \frac{x(c_5 y + c_6)}{y + 1},\; \frac{c_7 y + c_8}{y + c_9} \right), \tag{5}$$

where $c_5 = \frac{5d\beta - 2d + \beta + 2}{3(d + 1)}$, $c_6 = \frac{d\beta + 2d + 5\beta - 2}{3(d + 1)}$, $c_7 = \frac{5d\beta + d + \beta + 5}{(5d + 1)(\beta - 1)}$, $c_8 = \frac{5 + d}{5d + 1}$, and $c_9 = \frac{d\beta + 5d + 5\beta + 1}{(5d + 1)(\beta - 1)}$.

2.3.2 j-Invariant 1728

Elliptic curves with j-invariant 1,728 have CM discriminant $D = -4$, and can be defined by a Weierstrass equation of the form

$$E_{\tilde{a}} : y^2 = x^3 + \tilde{a}x \tag{6}$$

over a prime field $\mathbb{F}_p$ with $p \equiv 1 \bmod 4$, i.e., it is guaranteed that $\mathbb{F}_p$ contains an element $\alpha$ of order 4. In this case, the map $\phi : E_{\tilde{a}} \to E_{\tilde{a}}$ given by $(x, y) \mapsto (-x, \alpha y)$ and $O \mapsto O$ is an endomorphism defined over $\mathbb{F}_p$. When $G \in E_{\tilde{a}}(\mathbb{F}_p)$ is a point of prime order $r$, then $\phi(G) = \lambda \cdot G = (-x, \alpha y)$, where $\lambda$ is an integer satisfying $\lambda^2 + 1 \equiv 0 \bmod r$. There are only four possible group orders for such curves when $p$ is fixed.
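The relation between the map (x, y) -> (-x, alpha*y) and its eigenvalue can be checked on a toy curve. The sketch below is an illustration under assumed toy parameters (the curve y^2 = x^3 + x over F_13, which has 20 points and therefore a subgroup of prime order r = 5); none of these values come from the paper.

```python
P = 13          # toy prime with P = 1 (mod 4); an assumption for illustration
A = 1           # toy curve y^2 = x^3 + A*x over F_13 (20 points in total)
ALPHA = 5       # element of order 4 in F_13: 5^2 = 25 = -1 (mod 13)

def inv(v):
    """Modular inverse via Fermat's little theorem (P is prime)."""
    return pow(v % P, P - 2, P)

def add(p1, p2):
    """Affine Weierstrass addition; None stands for the point at infinity O."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        s = (3 * x1 * x1 + A) * inv(2 * y1) % P
    else:
        s = (y2 - y1) * inv(x2 - x1) % P
    x3 = (s * s - x1 - x2) % P
    return (x3, (s * (x1 - x3) - y1) % P)

def mul(k, pt):
    """Left-to-right double-and-add for a positive integer k."""
    acc = None
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, pt)
    return acc

def phi(pt):
    """The endomorphism (x, y) -> (-x, alpha*y)."""
    x, y = pt
    return (-x % P, ALPHA * y % P)

POINTS = [(x, y) for x in range(P) for y in range(P)
          if (y * y - x ** 3 - A * x) % P == 0]
# Clear the cofactor 4 to obtain a point G of prime order r = 5.
G = next(g for g in (mul(4, pt) for pt in POINTS) if g is not None)
# Scalars lam in [1, 4] with phi(G) = lam * G.
LAMBDAS = [lam for lam in range(1, 5) if mul(lam, G) == phi(G)]
```

On this toy curve the search finds a single eigenvalue, and it satisfies lam^2 + 1 = 0 (mod 5), mirroring the condition stated above for the real group order r.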

As before, by setting $a = -1$, $j(E_{a,d}) = 1{,}728$ requires $d = 1$ or $17 \pm 12\sqrt{2}$. Then by the above method, we obtain an endomorphism $\phi_t$ on the corresponding twisted Edwards model $E_{a,d} : -x^2 + y^2 = 1 + dx^2 y^2$ as

$$\phi_t(x, y) = \left( \frac{-x\big((7d - 1)y + (7 - d)\big)}{3\alpha(d + 1)(y + 1)},\; \frac{(2d - 2)y + 5 + d}{(5d + 1)y + (2 - 2d)} \right). \tag{7}$$

If $d = 1$, then $\phi_t$ has the simpler formula $\phi_t(x, y) = (\alpha x, 1/y)$.

Note that explicit formulae for endomorphisms on twisted Edwards curves have also been exploited in [16] and [17].

3 CURVE GENERATION

3.1 CM Method

Let $E/\mathbb{F}_p$ be our desired elliptic curve with CM discriminant $D$. The group order of $E/\mathbb{F}_p$ is $\#E(\mathbb{F}_p) = p + 1 - t$, where $t$ is the Frobenius trace. It is well known that $p$ and $t$ also satisfy the CM equation $4p = t^2 - Ds^2$, where $s \in \mathbb{Z}$. Note that the j-invariant of such a curve is also determined, and there are only 2, 4, or 6 possible group orders for a desired curve. Thus the goal of the curve generation is not to find curve parameters (since we have them already), but rather to find a prime field $\mathbb{F}_p$, and then a twisted Edwards curve defined over $\mathbb{F}_p$ (given by $a = -1$ and some fixed $d$), which contains a large cyclic subgroup and meets other security requirements. This contrasts with the "traditional" approach to curve generation, where the field $\mathbb{F}_p$ is fixed and one has to find suitable curve parameters.
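To make the CM equation concrete for D = -4, the following sketch enumerates the candidate group orders p + 1 - t from 4p = t^2 + 4s^2 and compares them with a brute-force point count. It is an illustration with a toy prime (p = 13); the exhaustive loops are only feasible for such small values, and the function names are hypothetical.

```python
import math

def cm_orders_d4(p):
    """Candidate orders p + 1 - t for D = -4, i.e. solutions of 4p = t^2 + 4s^2."""
    orders = set()
    t = 0
    while t * t <= 4 * p:
        rest = 4 * p - t * t
        if rest % 4 == 0:
            s = math.isqrt(rest // 4)
            if s * s == rest // 4:
                # Both signs of the trace t are possible.
                orders.update({p + 1 - t, p + 1 + t})
        t += 1
    return sorted(orders)

def count_points(a, p):
    """Brute-force #E(F_p) for y^2 = x^3 + a*x, including the point at infinity."""
    affine = sum(1 for x in range(p) for y in range(p)
                 if (y * y - x ** 3 - a * x) % p == 0)
    return affine + 1

print(cm_orders_d4(13))   # [8, 10, 18, 20]
```

For p = 13 = 2^2 + 3^2 the four candidate orders are 8, 10, 18 and 20, and every curve y^2 = x^3 + a*x with a != 0 over F_13 has one of these orders, which matches the statement that only four group orders occur for D = -4.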

3.2 Example Curve

We choose an elliptic curve with CM discriminant $D = -4$ as our example. If we fix $a = -1$ for efficiency reasons [13], then by the analysis in Section 2.3, the possible value of $d$ is 1 or $17 \pm 12\sqrt{2}$. We choose $d = 1$ since the endomorphism on $E_{-1,1}$ has a very simple formula in this case, as discussed before.

Our example curve is

$$E_{-1,1}/\mathbb{F}_p : -x^2 + y^2 = 1 + x^2 y^2,$$

where the prime $p = 2^{207} - 5131$. Note that $p \equiv 1 \bmod 4$, which implies that $E_{-1,1}$ is ordinary. The group order is $\#E_{-1,1}(\mathbb{F}_p) = 8 \cdot r$, where r = 0xFFFFFFFFFFFFFFFFFFFFFFFFFE090B67A2AE9D8EC7DD7009F95 is a 204-bit prime. Then under the general ECDLP algorithms (such as Pollard's rho attack with computational complexity $O(\sqrt{r})$), our desired curve is at around the 100-bit security level. Moreover, the embedding degree of $E_{-1,1}/\mathbb{F}_p$ with respect to $r$ is $r - 1$, which means that it is resistant to the FR-MOV attack.²

There is an efficient endomorphism $\phi_t$ on $E_{-1,1}$ given by

$$\phi_t(x, y) = (\alpha \cdot x, 1/y), \tag{8}$$

where α = 0x5135DD9F4EBC5D1835EFB3D377F3A4A1FCB1E2DEC2911FF2B59A satisfies $\alpha^2 + 1 \equiv 0 \bmod p$. We can check that $\phi_t(G) = \lambda \cdot G$ for $G \in E_{-1,1}(\mathbb{F}_p)$ with λ = 0xA1D776BEDB1ECFFCE5ABB8F12F8223CC0F494D461EC0F724D06, where $\lambda^2 + 1 \equiv 0 \bmod r$.
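The map in (8) can be sanity-checked on a small field. The sketch below uses the toy prime p = 13 with alpha = 5 (so that alpha^2 = -1 mod 13); these values are assumptions for illustration and stand in for the 207-bit prime and the constant α given above. Points with y = 0 are excluded because 1/y is undefined for them in affine coordinates.

```python
P = 13          # toy prime with P = 1 (mod 4); an assumption for illustration
ALPHA = 5       # 5^2 = 25 = -1 (mod 13)

def on_curve(x, y):
    """Check -x^2 + y^2 = 1 + x^2*y^2 over F_P (a = -1, d = 1)."""
    return (-x * x + y * y - 1 - x * x * y * y) % P == 0

def phi(x, y):
    """phi_t(x, y) = (alpha*x, 1/y), with 1/y computed as y^(P-2) mod P."""
    return (ALPHA * x % P, pow(y, P - 2, P))

# Affine points with y != 0, found by exhaustive search.
POINTS = [(x, y) for x in range(P) for y in range(1, P) if on_curve(x, y)]

# phi maps curve points to curve points ...
IMAGES_ON_CURVE = all(on_curve(*phi(x, y)) for x, y in POINTS)
# ... and applying it twice gives the negative (-x, y) of a point,
# consistent with lambda^2 + 1 = 0.
SQUARE_IS_NEGATION = all(phi(*phi(x, y)) == (-x % P, y) for x, y in POINTS)
```

The second check reflects the fact that negation on a twisted Edwards curve is (x, y) -> (-x, y), so the endomorphism behaves like a square root of -1 on the group.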

4 DOUBLE SCALAR MULTIPLICATION

As mentioned before, double scalar multiplication is the most time-consuming operation of ECDSA signature verification and, therefore, deserves efficient implementation and optimization. Formally, double scalar multiplication is an operation of the form $k \cdot G + l \cdot Q$ and computes the sum of two scalar products, where $G$ is fixed and $Q$ is an arbitrary point. In the following, we review several approaches for performing the double scalar multiplication and describe how to speed up this operation by exploiting an endomorphism. For convenience, we assume that both scalars $k$ and $l$ are exactly $m$ bits long.

4.1 Two Single Scalar Multiplications

The most straightforward method to perform the double scalar multiplication is to compute the two single scalar multiplications separately and then add up the results. The first scalar multiplication $k \cdot G$ takes a fixed and a-priori-known point as input and can be performed efficiently through the fixed-base comb method described in [12, Section 3.3.2]. This single scalar multiplication requires roughly $m/w$ point doublings and $\frac{m(2^w - 1)}{w \cdot 2^w}$ point additions when using $2^w - 1$ pre-computed points, where $w$ is the window size. The second scalar multiplication, $l \cdot Q$, is performed with an arbitrary base point $Q$ not known in advance. The simplest option for its computation is the binary method. In that case, the arbitrary-base scalar multiplication requires $m$ point doublings and $m/2$ point additions on average. In total, the double scalar multiplication requires $m + m/w$ point doublings and $m/2 + \frac{m(2^w - 1)}{w \cdot 2^w}$ point additions on average. Window methods (e.g., width-$w$ non-adjacent form (NAF) [12, Section 3.3.1]) allow reducing the number of point additions in arbitrary-base scalar multiplications by using precomputations, but they also require $m$ point doublings.

4.2 Interleaving Method

A method to speed up the computation of $k \cdot G + l \cdot Q$ is to perform the two scalar multiplications in a simultaneous (or interleaved) fashion by using Shamir's trick [12, p. 109]. This method first computes the sum of $G$ and $Q$, i.e., $S = G + Q$. Then, the scalars $k$ and $l$ are scanned simultaneously starting from the most significant bit. One adds $G$ if $k_i = 1$ and $l_i = 0$, $Q$ if $k_i = 0$ and $l_i = 1$, and $S$ if $k_i = l_i = 1$. This method reduces the number of point doublings so that a double scalar multiplication requires $m$ point doublings and $3m/4$ point additions on average.
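The interleaving described above can be sketched generically. In the following minimal sketch the group operation is passed in as a function, so that plain integers under addition can stand in for curve points; this stand-in, the function name, and the operation counters are assumptions made for illustration.

```python
def shamir(k, l, G, Q, add, identity):
    """Compute k*G + l*Q with joint doublings (Shamir's trick).

    Returns the result together with the number of doublings and additions.
    """
    S = add(G, Q)                              # precomputed sum G + Q
    table = {(1, 0): G, (0, 1): Q, (1, 1): S}
    R = identity
    doublings = additions = 0
    top = max(k.bit_length(), l.bit_length())
    for i in range(top - 1, -1, -1):           # most significant bit first
        R = add(R, R)
        doublings += 1
        bits = ((k >> i) & 1, (l >> i) & 1)
        if bits != (0, 0):
            R = add(R, table[bits])
            additions += 1
    return R, doublings, additions

# Usage with integers under addition as a stand-in group:
result, dbl, adds = shamir(11, 6, 3, 7, lambda a, b: a + b, 0)
```

An addition is skipped only when both scanned bits are zero, which happens with probability 1/4 for random scalars and gives the 3m/4 average stated above.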

4.3 Joint Sparse Form

Solinas [21] proposed a joint sparse form (JSF) representation for a pair of integers, which minimizes the joint Hamming weight by using signed-binary representations for $k$ and $l$. Hence, this representation leads to speedups in the double scalar multiplication $k \cdot G + l \cdot Q$ when $S = G + Q$ and $T = G - Q$ are precomputed. The method works analogously to the above interleaving method but also uses point subtractions for negative digits. A double scalar multiplication performed in an interleaved fashion with JSF requires $m$ point doublings and only $m/2$ point additions (resp. subtractions) on average [12, Section 3.3.3].

2. It should be pointed out that $E_{-1,1}/\mathbb{F}_p$ is not twist secure. However, since our implementations do not execute in the "x-coordinate only" pattern, twist security is not required.

4.4 (Sliding) Window Method

Another approach to reduce the number of point additions in a double scalar multiplication is to use a window method. Given the fixed window width $w$, a double scalar multiplication first generates a look-up table with the points $i \cdot G + j \cdot Q$ for all $i, j \in [0, 2^w - 1]$, and then scans $w$ columns of the scalars $k$ and $l$ at a time. This method requires storage of $2^{2w} - 1$ points. A double scalar multiplication can be performed with $(m/w - 1) \cdot w$ point doublings and $\frac{m(2^{2w} - 1)}{w \cdot 2^{2w}} - 1$ point additions on average. The window method can be further improved by using a "sliding" window, where only $2^{2w} - 2^{2(w-1)}$ points are needed for the look-up table, and the number of point additions is reduced to $\frac{m}{w + (1/3)}$ [12].

4.5 Double Scalar Multiplication with Endomorphism

When applying the above described approaches to the double scalar multiplication, a maximum of $m$ point doublings could be saved compared to two single scalar multiplications. Motivated by the work of Galbraith et al. [16], we present a strategy to further reduce the number of point doublings by some 50 percent using an efficiently computable endomorphism, as follows.

Algorithm 1. Double Scalar Multiplication Using an Endomorphism

Input: Two $m$-bit scalars $k$ and $l$, the fixed base point $G$ and an arbitrary point $Q$ on the curve $E(\mathbb{F}_p)$ with endomorphism $\phi$.
Output: Double scalar multiplication $k \cdot G + l \cdot Q$.
1: Use [12, Algorithm 3.74] to find $(k_1, k_2)$ of $k$ and $(l_1, l_2)$ of $l$;
2: Compute $\phi(G)$, $\phi(Q)$ using $G$ and $Q$;
3: $G = (k_1 > 0)\,?\,G : -G$; $\phi(G) = (k_2 > 0)\,?\,\phi(G) : -\phi(G)$; $Q = (l_1 > 0)\,?\,Q : -Q$; $\phi(Q) = (l_2 > 0)\,?\,\phi(Q) : -\phi(Q)$;
4: Generate look-up table $T$ with 15 points such that $T[i - 1] = [(i \gg 3)\,\&\,1] \cdot \phi(Q) + [(i \gg 2)\,\&\,1] \cdot Q + [(i \gg 1)\,\&\,1] \cdot \phi(G) + (i\,\&\,1) \cdot G$ for $1 \le i \le 15$;
5: Let $k_1 = |k_1|$, $k_2 = |k_2|$, $l_1 = |l_1|$, $l_2 = |l_2|$ and let $h$ be the bit length of $\max\{k_1, k_2, l_1, l_2\}$;
6: $R = O$;
7: for $i$ from $h - 1$ by 1 down to 0 do
8:    $R \leftarrow 2R$;
9:    $s \leftarrow 8 \cdot (l_2)_i + 4 \cdot (l_1)_i + 2 \cdot (k_2)_i + (k_1)_i$;
10:   if $s > 0$ then
11:      $R \leftarrow R + T[s - 1]$;
12:   end if
13: end for
14: return $R$.

The main idea is to compute a double scalar multiplication, i.e., $k \cdot G + l \cdot Q$, through four simultaneous scalar multiplications $k_1 \cdot G$, $k_2 \cdot \phi(G)$, $l_1 \cdot Q$ and $l_2 \cdot \phi(Q)$, where $k_1$, $k_2$, $l_1$ and $l_2$ are roughly $m/2$ bits long. Algorithm 1 shows the computation of double scalar multiplication exploiting an efficiently computable endomorphism. We first split the scalar $k$ into two parts $k_1$ and $k_2$ using [12, Algorithm 3.74], where $k_1$ and $k_2$ have roughly half of the bitlength of $k$; the second scalar $l$ can be decomposed into $l_1$ and $l_2$ in the same way. Then, we calculate the points $\phi(G)$ and $\phi(Q)$ from $G$ and $Q$ by using (8) from Section 3. These can be computed with only one inversion and a few multiplications by utilizing the so-called Montgomery's trick [12, p. 44] that relies on the fact that $1/x = (1/xy) \cdot y$ and $1/y = (1/xy) \cdot x$. After that, we generate the look-up table with 15 points (line 4). Finally, the four scalar multiplications needed for computing $k_1 \cdot G + k_2 \cdot \phi(G) + l_1 \cdot Q + l_2 \cdot \phi(Q)$ are performed simultaneously, i.e., in an interleaved fashion. A double scalar multiplication using Algorithm 1 requires approximately $m/2$ point doublings and $15m/32 + 11$ point additions including the overhead for the generation of the look-up table.
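To make the table indexing and the main loop of Algorithm 1 concrete, the following Python sketch runs the same interleaving in a toy setting. Several simplifications are assumptions of ours and not part of the design: points are integers in the additive group modulo a hypothetical `N`; the endomorphism is simulated by multiplication with `lam = 2**half`; the scalars are split by simple bit slicing instead of [12, Algorithm 3.74], so all sub-scalars are nonnegative and the sign handling of line 3 is not needed; and `h` is taken to be the largest bit length among the sub-scalars, which is our reading of line 5.

```python
def double_mul_endo(k, l, G, Q, N, half):
    """Toy version of Algorithm 1 in the additive group Z_N.

    Returns (k*G + l*Q mod N, number of doublings, number of additions in the loop).
    """
    lam = 1 << half

    def phi(P):
        return (lam * P) % N               # stand-in for the curve endomorphism

    # Stand-in decomposition: k = k1 + k2*lam and l = l1 + l2*lam.
    k1, k2 = k % lam, k >> half
    l1, l2 = l % lam, l >> half

    # Line 4: bit 0 of the index selects G, bit 1 phi(G), bit 2 Q, bit 3 phi(Q).
    base = [G, phi(G), Q, phi(Q)]
    T = [sum(base[j] for j in range(4) if (i >> j) & 1) % N for i in range(1, 16)]

    h = max(x.bit_length() for x in (k1, k2, l1, l2))
    R, doublings, additions = 0, 0, 0
    for i in range(h - 1, -1, -1):
        R = (2 * R) % N                    # line 8
        doublings += 1
        s = (8 * ((l2 >> i) & 1) + 4 * ((l1 >> i) & 1)
             + 2 * ((k2 >> i) & 1) + ((k1 >> i) & 1))      # line 9
        if s > 0:
            R = (R + T[s - 1]) % N         # line 11
            additions += 1
    return R, doublings, additions
```

With 32-bit scalars and `half = 16`, the loop performs 16 doublings instead of 32, which mirrors the roughly 50 percent saving discussed above.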

4.6 Comparison and Trade-Offs Between Performance and Memory

Table 1 reports the execution times and RAM requirements of double scalar multiplications for several different approaches outlined above, as well as a combination of endomorphism and window methods. Compared to the implementation using (a), a double scalar multiplication using a combination of (a) and (b) requires the same number of point doublings while it saves approximately 1/4 of the point additions. The number of point additions can be further reduced by using a combination of (a) and (c) with a look-up table of $2^{2w} - 1$ points. Taking the window width $w = 2$ as an example, one can save roughly 1/16 of the point additions compared to the implementation with a combination of (a) and (b). In relation to a combination of (a) and (c), the number of point doublings can be further reduced by some 50 percent using the technique of (d) with the same RAM occupation. A small number of point additions may potentially be saved by using a combination of (c) and (d). However, the look-up table will grow exponentially, and a combination of (c) and (d) is only able to save point additions when $m$ is big enough. For example, given $w = 2$, a double scalar multiplication using a combination of (c) and (d) requires a look-up table of 255 points and even requires more point doublings and point additions. Taking both performance and RAM requirements into account, the technique (d) (i.e., double scalar multiplication with endomorphisms from Section 4.5) is the best choice to speed up the double scalar multiplication on resource-constrained platforms.

TABLE 1
Comparison of Execution Time (Including the Generation of Look-Up Table) and RAM Requirements of Double Scalar Multiplication Using Different Approaches

| Method | Storage | Point Doublings | Point Additions |
|---|---|---|---|
| (a) | $3$ | $m$ | $1 + 3m/4$ |
| (a) + (b) | $4$ | $m$ | $2 + m/2$ |
| (a) + (c) | $2^{2w} - 1$ | $(2^{2(w-1)} - 2^{w-1}) + m - w$ | $(3 \cdot 2^{2(w-1)} - 2^{w-1} - 1) + \frac{m \cdot (2^{2w} - 1)}{w \cdot 2^{2w}}$ |
| (d) | $2^4 - 1$ | $m/2 - 1$ | $11 + \frac{15m}{32}$ |
| (c) + (d) | $2^{4w} - 1$ | $(2^{4(w-1)} - 2^{w-1}) + m/2 - w$ | $(15 \cdot 2^{4(w-1)} + 2^{w-1} - 5) + \frac{m \cdot (2^{4w} - 1)}{2w \cdot 2^{4w}}$ |

(a): Interleaved (Section 4.2); (b): JSF (Section 4.3); (c): Window (Section 4.4); (d): Endomorphism (Section 4.5).
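The entries of Table 1 are easy to evaluate for concrete parameters. The following Python sketch simply transcribes the five rows (the function name `costs` and the dictionary layout are our own choices) and uses exact fractions so that the comparisons made in the text can be checked without rounding.

```python
from fractions import Fraction as F


def costs(m, w=2):
    """Return {method: (storage in points, point doublings, point additions)} as in Table 1."""
    return {
        "(a)": (3, F(m), 1 + F(3 * m, 4)),
        "(a)+(b)": (4, F(m), 2 + F(m, 2)),
        "(a)+(c)": (
            2 ** (2 * w) - 1,
            (2 ** (2 * (w - 1)) - 2 ** (w - 1)) + m - w,
            (3 * 2 ** (2 * (w - 1)) - 2 ** (w - 1) - 1)
            + F(m * (2 ** (2 * w) - 1), w * 2 ** (2 * w)),
        ),
        "(d)": (2 ** 4 - 1, F(m, 2) - 1, 11 + F(15 * m, 32)),
        "(c)+(d)": (
            2 ** (4 * w) - 1,
            (2 ** (4 * (w - 1)) - 2 ** (w - 1)) + F(m, 2) - w,
            (15 * 2 ** (4 * (w - 1)) + 2 ** (w - 1) - 5)
            + F(m * (2 ** (4 * w) - 1), 2 * w * 2 ** (4 * w)),
        ),
    }
```

For example, `costs(208, 2)["(c)+(d)"]` reports a table of 255 points together with more doublings and more additions than `costs(208)["(d)"]`, and for `w = 1` the row (c) + (d) coincides with row (d).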


In the following, we demonstrate the flexibility of our scheme based on double scalar multiplications with endomorphisms by designing two hardware implementations aimed at different target applications within the IoT framework. We present two different architectures: a small architecture for signature generation and verification for ASICs in Section 5 and a high-speed verification core for FPGAs in Section 6. The former targets resource-constrained devices such as RFID tags, sensor nodes, etc. The latter is designed for the server-side where the speed of signature verifications is important and FPGAs can be used for fast parallel computations.

5 SMALL PROCESSOR ARCHITECTURE FOR SIGNATURE GENERATION AND VERIFICATION

In this section, we describe a small processor architecture for signature generation and verification. It is targeted mainly to resource-constrained devices of the IoT, where small resources (area, memory, power, and energy) are the first priority. We begin with the architecture for Fp arithmetic and other architectural design decisions in Section 5.1 and after that present the results on 130 nm CMOS in Section 5.2.

5.1 Architecture for Fp Arithmetic

The prime $p = 2^{207} - 5{,}131$ is a pseudo-Mersenne prime of the form $p = 2^n - c$, where $c$ fits into one word of the target platform since we select a 16-bit datapath. The basic idea of fast reduction using a pseudo-Mersenne prime is to apply the congruence relation $2^n \equiv c \bmod p$ repetitively during the reduction process. Suppose $z = z_H 2^n + z_L$ is a $2n$-bit integer, such as obtained as the result of a multiplication of two $n$-bit integers. We can reduce $z$ with respect to $p$ as follows:

$$z = z_H 2^n + z_L \bmod p \equiv z_H c + z_L \bmod p. \qquad (9)$$

Now $z$ is already only slightly longer than $n$ bits since $c$ is small. To complete the reduction $z \bmod p$, we reduce $z$ again using (9) and then at most one subtraction of $p$ is needed to get a result that is at most $n$ bits long.

We use the following notation:

• $n$: the operand size (i.e., $n = 207$).
• $W$: the word size of the datapath (i.e., $W = 16$).
• $m$: the bitlength of the scalars $k, l$ ($\approx$ the bitlength of the prime group order), while $m/2$ roughly denotes the bitlength of the sub-scalars $k_i, l_i$.
• $A$, $B$: two operands; $A[i : j]$ represents bits at position $i$ to $j$ of operand $A$.
• $R$: product $A \cdot B$, which is twice as long as operand $A$ or $B$.

Our implementation adopts the idea of incomplete modular reduction as described, for example, in [22], which means the arithmetic functions described in the following sections do not necessarily reduce the result to an integer in the range of $[0, p - 1]$, but only ensure that the result is smaller than $2^n$ so that it fits into $\lceil n/W \rceil = \lceil 207/16 \rceil = 13$ words. Also, all arithmetic functions accept incompletely reduced inputs of $\lceil n/W \rceil$ words.

Note that all arithmetic operations (except the Montgomery inverse) we discuss in the following can be easily implemented in a regular way without conditional statements so that their execution time is independent of the values of the operands. Such constant execution time helps to thwart certain side-channel attacks. Even though signature verification does not involve any secret values (and can, therefore, not leak any secrets), it still makes sense to implement the underlying field arithmetic in a regular way so that it can also be used for signature generation.

5.1.1 Modular Multiplication and Squaring

The modular multiplication is performed in three basic steps as shown in Algorithm 2. First, a conventional multi-precision multiplication is performed in a word-wise fashion based on the product-scanning technique [12]. Then, we multiply the most significant 209 bits of the product by $c$ and add the result to the least significant 207 bits, which yields a result of (at most) 226 bits length. Finally, we multiply the most significant 19 bits by $c$ and add the product to the least significant 207 bits; the result is now at most 208 bits and, therefore, fits into 13 words. In order to achieve constant execution time, we always execute both reduction steps, even when the result is already fully reduced after the first step.

Algorithm 2. Modular Multiplication for $p = 2^{207} - c$

Input: Two integers $A[207 : 0]$, $B[207 : 0]$, and modulus $p$
Output: $R = A \cdot B \bmod p$
1: $R = A \cdot B$
2: $R = R[415 : 207] \cdot c + R[206 : 0]$ {The 1st reduction}
3: $R = R[225 : 207] \cdot c + R[206 : 0]$ {The 2nd reduction}
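The two folding steps of Algorithm 2 can be tried out with ordinary Python integers. The sketch below is only a functional model under our own naming (`modmul_incomplete`, `N_BITS`, `C`, `P`, `MASK`); it ignores the word-wise product scanning and says nothing about timing. Because the intermediate result after the first step is shorter than 226 bits, shifting right by 207 bits selects exactly the bits $R[225 : 207]$ used in line 3.

```python
N_BITS = 207
C = 5131
P = (1 << N_BITS) - C                  # p = 2^207 - 5131
MASK = (1 << N_BITS) - 1               # selects bits [206:0]


def modmul_incomplete(a, b):
    """Multiply two 208-bit operands and return an incompletely reduced 208-bit result."""
    assert 0 <= a < (1 << 208) and 0 <= b < (1 << 208)
    r = a * b                                  # line 1: up to 416 bits
    r = (r >> N_BITS) * C + (r & MASK)         # line 2: fold bits [415:207]
    r = (r >> N_BITS) * C + (r & MASK)         # line 3: fold bits [225:207]
    return r                                   # congruent to a*b mod p, fits into 13 words
```

The returned value is congruent to $A \cdot B$ modulo $p$ but is not necessarily smaller than $p$, in line with the incomplete reduction described above.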

A modular squaring can be done more efficiently thanks to the symmetry of partial products. Thus, it is possible to save the computation of (nearly) half of the partial products.

5.1.2 Modular Inversion

Modular inversion is the most time-consuming field arithmetic operation. Traditionally, the Extended Euclidean Algorithm (EEA) [12] and the Montgomery modular inversion algorithm [23], [24] are used to compute an inverse. Our inversion is mainly based on the Montgomery modular inverse, but has been optimized for the pseudo-Mersenne prime $p = 2^n - c$.

As shown in Algorithm 3, our inversion consists of two phases: phase I and phase II. In phase I, we perform two additions, and then update the variables $\{u, v, r, s, k\}$ according to the sign flag of $x$. The trailing zero detection (DET) and the right-shift operation $x \gg tlz_x$ can be done in parallel with the addition of $u + v$. Furthermore, the left-shift operations $s \ll tlz_x$ and $r \ll tlz_x$ can be done in parallel with the addition $y = r + s$. In phase II, we perform two ordinary multiplications to get the modular inverse. The input $a$ is set to be odd, but even if initially $a$ is even, it can be easily changed to be odd via a modular subtraction $a - p$. The core idea behind our optimized inversion is to remove all trailing zeros of $(u + v)$ in every iteration, which keeps $u$ and $v$ always odd so that $(u + v)$ converges to zero quickly.

Compared to the Multibit Shifting method proposed by Savas et al. in [25], we remove all those iterations for shift operations (i.e., the iterations when $u$ or $v$ is even in [23, Algorithm MONTINVER]) and adopt the idea from [26] to avoid a complex comparison step by using the sign flag of $x$. More specifically, the number of total iterations in Phase I of [23] is in the range of $[n, 2n]$, with 50 percent for shift operations. The number of iterations of our algorithm is in the range of $[0.5n, n]$ since such shift-operation iterations are not required. Furthermore, the optimized inversion can be made even faster by keeping track of the lengths of the variables $\{u, v, r, s\}$. This saves cycles for additions because the word lengths decrease linearly with the number of iterations.

Algorithm 3. Optimized Montgomery Modular Inversion for $2^n - c$

Input: $a \in [1, 2^n)$ and is odd, $p > 2$ is an $n$-bit prime, precomputed $T = 2^{(-2n)} \bmod p$;
Output: $R \in [1, 2^n)$, where $R = a^{-1} \bmod p$.
1: //Phase I
2: $u = -p$, $v = a$, $r = 0$, $s = 1$, $k = 0$;
3: while (1) do
4:   $x = u + v$; {Both $u$ and $v$ are always odd numbers}
5:   $y = r + s$;
6:   $tlz_x = DET(x)$; {Trailing zero detection}
7:   if $x == 0$ then
8:     break;
9:   else if $x < 0$ then
10:     $u = x \gg tlz_x$; {Right-shift operation can be done in parallel with $u + v$}
11:     $r = y$;
12:     $s = s \ll tlz_x$; {Left-shift operation can be done in parallel with $r + s$}
13:   else
14:     $v = x \gg tlz_x$;
15:     $s = y$;
16:     $r = r \ll tlz_x$;
17:   end if
18:   $k = k + tlz_x$;
19: end while
20: //Phase II
21: $s = s \cdot 2^{(2n - k)} \bmod p$;
22: $s = s \cdot T \bmod p$;
23: return $s$.
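Algorithm 3 can be modelled directly with Python's arbitrary-precision integers, which is a convenient way to check the update rules. The sketch below is a functional model only: the names `tlz` and `mont_inverse` are ours, the trailing-zero count is obtained with a bit trick rather than the word-wise detector of the hardware, and the constant $T$ is computed on the fly with the built-in `pow` (negative exponents with a modulus require Python 3.8 or later) instead of being precomputed.

```python
def tlz(x):
    """Number of trailing zero bits of a nonzero integer (works for negative x too)."""
    return (x & -x).bit_length() - 1


def mont_inverse(a, p, n):
    """Return a^(-1) mod p following Algorithm 3, for odd a in [1, 2^n)."""
    assert a % 2 == 1 and 1 <= a < (1 << n)
    T = pow(2, -2 * n, p)                  # T = 2^(-2n) mod p
    # Phase I
    u, v, r, s, k = -p, a, 0, 1, 0
    while True:
        x = u + v                          # u and v are both odd, so x is even
        y = r + s
        if x == 0:
            break
        t = tlz(x)
        if x < 0:
            u, r, s = x >> t, y, s << t    # lines 10-12
        else:
            v, s, r = x >> t, y, r << t    # lines 14-16
        k += t
    # Phase II: remove the accumulated factor 2^k
    s = (s * pow(2, 2 * n - k, p)) % p     # line 21
    return (s * T) % p                     # line 22
```

Throughout Phase I the relations $a \cdot r \equiv u \cdot 2^k$ and $a \cdot s \equiv v \cdot 2^k \pmod p$ hold, so when the loop ends with $v = 1$ the value $s$ equals $a^{-1} \cdot 2^k$, and Phase II removes the factor $2^k$.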

5.1.3 Modular Addition and Subtraction

An addition modulo $p = 2^{207} - c$ can be performed in three steps. First, a conventional multi-precision addition $R = A + B$ is performed in a word-wise fashion. Then, for reduction, we reduce the 209-bit result to 208 bits by using Equation (10). To ensure constant execution time, we perform the addition step and the reduction step for all possible inputs, even if no reduction is required

$$R \equiv R[209 : 207] \cdot 2^{207} + R[206 : 0] \bmod p \equiv R[209 : 207] \cdot c + R[206 : 0]. \qquad (10)$$

For modular subtraction, a conventional multi-precision subtraction $R = A - B$ is performed through word-wise subtract-with-borrow operations. As the 208-bit input $B$ can be bigger than $2p$, the result of the subtraction may be smaller than $-2p$ and, thus, up to two addition steps will be needed. As shown in Equation (11), the first addition step will guarantee that $R > -p$. If $R$ is still negative, another addition step as shown in Equation (12) will make $R$ a positive number in the range $[0, 2^{208})$. To ensure constant execution time, we perform one subtraction and two additions for all possible inputs, but when $R$ is positive after the subtraction, the words of the operand are masked out (i.e., set to 0) so that the value of $R$ does not change

$$R \equiv R + 2p \bmod p \qquad (11)$$

$$R \equiv R + p \bmod p. \qquad (12)$$
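The addition and subtraction with incomplete reduction can be modelled in a few lines of Python. This is a functional sketch under our own naming (`modadd_incomplete`, `modsub_incomplete`); Python integers are not constant-time, so the masking is only imitated by adding either the correction term or zero.

```python
N_BITS = 207
C = 5131
P = (1 << N_BITS) - C
MASK = (1 << N_BITS) - 1


def modadd_incomplete(a, b):
    """Add two 208-bit operands and fold the top bits once, as in Equation (10)."""
    r = a + b                                   # up to 209 bits
    return (r >> N_BITS) * C + (r & MASK)       # result fits into 208 bits


def modsub_incomplete(a, b):
    """Subtract two 208-bit operands; at most two correction steps are needed."""
    r = a - b
    r += 2 * P if r < 0 else 0                  # Equation (11): afterwards r > -p
    r += P if r < 0 else 0                      # Equation (12): afterwards r >= 0
    return r
```

For 208-bit inputs both functions return a value in $[0, 2^{208})$ that is congruent to the exact sum or difference modulo $p$.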

5.1.4 Hardware Architecture

The hardware architecture, as shown in Fig. 1, consists of a micro-controller, a program ROM, an Fp-coprocessor, which we call Prime-Field Arithmetic Unit (i.e., PFAU), and two dual-port SRAMs. The program ROM holds the command sequences that execute high-level functions such as pre-computations, point addition, point doubling, etc. This section focuses on the ALU.

The architecture of the ALU and other important modules is shown in Fig. 2, where one $(16 \times 16)$-bit multiplier, one three-input adder, a trailing-zero detection module (tlz), a left-shifting module (lshifter), and a right-shifting module (rshifter) are depicted. We decided to implement a 16-bit datapath since previous research has shown that this allows one to achieve a good trade-off between performance and silicon area. The ALU supports the word-level instructions needed for modular multiplication, modular squaring, modular inversion, modular addition and modular subtraction. The critical path goes from the input registers of the multiplier to the output registers of the adder. The input from mult to the adder is 33 bits long due to the fact that we need to double some partial products when performing a modular squaring.

Fig. 1. Hardware architecture.

Fig. 2. ALU architecture.

The optimized modular inversion requires the tlz, lshifter, and rshifter modules. Using the implementation technique from [27], the tlz module can output the number of trailing zeros of a word (16 bits) in one clock cycle. To obtain the trailing zeros in a 208-bit operand, we can perform a zero detection word by word. If the number of trailing zeros exceeds one word, the detection process will take more than one cycle, but the probability is only $2^{-15}$ (because $x$ is always even in Algorithm 3). The lshifter and rshifter receive the output of tlz and perform the corresponding number of shifts on the 16-bit input. As mentioned before, the shift operation in the modular inverse can be done in parallel with the addition.
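The word-by-word detection can be pictured with the following Python sketch, which counts one "cycle" per inspected 16-bit word. It is a behavioural illustration only; the function name `tlz_wordwise` and the cycle counting are our assumptions and do not describe the circuit from [27].

```python
def tlz_wordwise(x, word_bits=16, words=13):
    """Return (number of trailing zeros, words inspected) for a nonzero operand."""
    mask = (1 << word_bits) - 1
    count = 0
    for i in range(words):
        w = (x >> (i * word_bits)) & mask
        if w != 0:
            # Trailing zeros inside the first nonzero word.
            return count + ((w & -w).bit_length() - 1), i + 1
        count += word_bits                 # the whole word is zero, look at the next one
    return count, words
```

A second word is inspected only when the lowest 16 bits are all zero. Given that bit 0 of $x$ is already known to be zero, this requires the remaining 15 bits of the first word to be zero as well, which matches the probability of $2^{-15}$ stated above for uniformly distributed bits.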

5.2 Implementation Results

We implemented the arithmetic processor in Verilog and synthesized it with Design Compiler 2013.12 using the UMC 130 nm 1P8M Low Leakage Standard Cell Library with typical values (i.e., voltage of 1.2 V and temperature of 25 °C). The area (in gate equivalents, GE) after placement and routing is calculated by dividing the overall area by the area of a single two-input NAND gate. The design has been synthesized for a clock frequency of 50 MHz, which is more than sufficient for common IoT devices such as RFID tags or sensor nodes.

5.2.1 Execution Time of Field Arithmetic

As mentioned in the previous section, we implemented the multiplication, squaring, addition and subtraction to have constant execution time. Constant execution time (and, thus, a constant pattern of operations) gives protection against simple side-channel attacks that target a single side-channel trace. Protection against more elaborate attacks relying on statistical analysis of one or more traces is left for future work. Table 2 summarizes the execution times of the five basic arithmetic operations modulo the prime $2^{207} - 5{,}131$. The modular addition takes exactly 30 cycles, which is faster than the modular subtraction. Our constant-time modular multiplication executes in exactly 192 cycles, whereas the modular squaring has an execution time of 120 clock cycles, which means the squaring requires merely 60 percent of the multiplication cycles. Thanks to the optimized Montgomery modular inversion proposed in Algorithm 3, our inversion requires 4,452 clock cycles on average, which corresponds to only 23 multiplications. The execution time of modular inversion is evaluated based on the average number of Phase I iterations with two additions per iteration and two modular multiplications in Phase II.

5.2.2 Double Scalar Multiplication: High-Speed versus Memory-Efficient

To demonstrate the trade-offs between performance and RAM requirements that our small processor architecture offers, we designed two versions of the implementation: the first one is optimized for performance and the second for low RAM footprint.

Speed-Optimized. The speed-optimized implementation requires a look-up table containing 15 points, of which 11 points (except $G$, $Q$, $\phi(G)$ and $\phi(Q)$) will be generated by a sequence of point additions. In order to take advantage of the efficient point addition formula on a twisted Edwards curve (i.e., the 7M mixed addition formulae based on [14]), we store these points using extended affine coordinates of the form $(U, V, W)$, where $U = (x + y)/2$, $V = (y - x)/2$, $W = xy$ (in our case $d = 1$). A straightforward method to get the affine form of these points would require 11 inversions. For reducing the number of inversions, we perform the 11 inversions by using Montgomery's trick [12, page 44]: With the help of three temporary variables, the 11 inversions can be computed with only one inversion and 83 multiplications. Given an affine point, the extended affine coordinates $(U, V, W)$ can be obtained by performing one addition, one subtraction and one multiplication. In the main loop, a pre-computed point given in extended affine coordinates is used as an operand in each iteration (i.e., in line 11 of Algorithm 1). As a result, our speed-optimized double scalar multiplication requires an execution time of 365,082 clock cycles with a RAM footprint of 1,612 bytes.
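Montgomery's trick, which replaces several inversions by a single one, can be sketched as follows in Python. The function `batch_inverse` is a generic textbook formulation written for illustration (it uses the built-in `pow` with exponent `-1`, available from Python 3.8); it does not reproduce the register allocation with three temporary variables or the exact multiplication count of our implementation.

```python
def batch_inverse(values, p):
    """Invert every element of `values` modulo p using a single modular inversion."""
    # prefix[i] holds the product values[0] * ... * values[i-1] mod p.
    prefix = [1]
    for v in values:
        prefix.append(prefix[-1] * v % p)
    inv = pow(prefix[-1], -1, p)           # the only inversion: 1 / (product of all values)
    out = [0] * len(values)
    for i in range(len(values) - 1, -1, -1):
        out[i] = inv * prefix[i] % p       # 1 / values[i]
        inv = inv * values[i] % p          # now 1 / (values[0] * ... * values[i-1])
    return out
```

For two elements this reduces to the identities $1/x = (1/xy) \cdot y$ and $1/y = (1/xy) \cdot x$; in general the cost is one inversion plus roughly three multiplications per element.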

Memory-Optimized. A look-up table with 15 extended affine points requires 45 field elements to be stored in RAM. Instead of generating a look-up table with extended affine points, the memory-optimized implementation generates a look-up table with standard affine coordinates $(x, y)$ and reduces the RAM requirements to only 30 field elements. In the process of look-up table generation, we adopt the point addition formula with $Z_1 = 1$ and $Z_2 = 1$ [14, Section 3.1] and directly convert the projective representation into the standard affine representation for each point. In total, the look-up table generation requires 11 point additions, 11 inversions and 22 multiplications. We still use the efficient point addition formula for twisted Edwards curves (i.e., the 7M mixed addition formulae based on [14]) in the main loop of the double scalar multiplication. Thus, we compute the extended affine representation of an affine point on-the-fly, which requires one multiplication, one addition and one subtraction for each iteration. As a consequence, our memory-optimized double scalar multiplication requires an execution time of 415,392 clock cycles with a RAM consumption of only 1,222 bytes, which corresponds to a saving of 33 percent for the look-up table (780 instead of 1,170 bytes) and 24 percent in total (i.e., 1,222 instead of 1,612 bytes), by sacrificing only roughly 12 percent in performance.

For comparison, a double scalar multiplication without exploiting the endomorphism (i.e., by using interleaving with JSF) has an execution time of 454,179 cycles. This shows that the endomorphism yields a speed-up of roughly 8.5 percent for the memory-optimized version and 19.6 percent for the speed-optimized version.
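The percentages quoted for the two versions follow directly from the reported cycle counts and RAM sizes. The short Python sketch below recomputes them; the helper `saving` and the variable names are ours, and the table sizes assume that each field element occupies 13 words of 16 bits.

```python
FIELD_BYTES = 13 * 2                       # 13 words of 16 bits per field element
table_hs = 45 * FIELD_BYTES                # 15 points in (U, V, W) form
table_me = 30 * FIELD_BYTES                # 15 points in (x, y) form

cycles = {"HS": 365_082, "ME": 415_392, "JSF": 454_179}
ram = {"HS": 1_612, "ME": 1_222}


def saving(new, old):
    """Relative reduction obtained when going from `old` to `new`."""
    return (old - new) / old


table_saving = saving(table_me, table_hs)          # look-up table: about 33 percent
ram_saving = saving(ram["ME"], ram["HS"])          # total RAM: about 24 percent
slowdown = saving(cycles["HS"], cycles["ME"])      # about 12 percent of the ME cycle count
speedup_me = saving(cycles["ME"], cycles["JSF"])   # about 8.5 percent
speedup_hs = saving(cycles["HS"], cycles["JSF"])   # about 19.6 percent
```

Note that the 12 percent performance figure is obtained relative to the cycle count of the memory-optimized version.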

TABLE 2
Execution Times of Field Arithmetic Operations (in Clock Cycles)

| Operation | Mul | Sqr | Inv | Add | Sub |
|---|---|---|---|---|---|
| Time | 198 | 120 | 4,452 | 30 | 43 |


5.2.3 Comparison with Other Implementations

Table 3 shows our implementation results and a comparison with related work over prime fields. All related implementations (except Lai et al. [29]) use a 16-bit datapath similarly to our work. There are no implementations available in the literature with exactly the same security level as our implementations. Hence, direct comparison is impossible, but we highlight the size of the prime field in Table 3 (the 'order' column) in order to make comparison as easy and fair as possible.

We perform the fixed-base scalar multiplication (needed for signature generation) on the chosen twisted Edwards curve using a constant-time comb method with $w = 4$ as described in [35], which only needs to store 8 points from precomputation.³ Note that signature generation requires a constant-time inversion, which we compute using Fermat's theorem with an addition chain that can be evaluated by computing 206 modular squarings and 14 modular multiplications. Our implementation requires an execution time of 182,653 and 365,082 clock cycles for (constant-time) scalar multiplication and speed-optimized double scalar multiplication, respectively, and consumes an area of 5,821 GEs. On the other hand, the memory-oriented implementation needs 415,392 clock cycles, while consuming only 1.2 kB of RAM.

Since most of the previous implementations only reported the execution time of signature generation, we estimate the cycle count of verification (i.e., double scalar multiplication) by simply multiplying the generation time by two. As shown in Table 3, our implementation is at least three times faster than all the previous works using the same word size. In terms of area, the implementations from [30] and [29] support both prime and binary field arithmetic and, thus, have a large area. On the other hand, the authors of [32], [33], [34] optimized their implementations with the help of a microcode-programmable structure (for field arithmetic) and, thus, their implementations require extra instruction decoding modules and have higher ROM consumption in order to save area in the control logic. Our implementation does not include the area for SRAM since it varies for different process technologies and depends significantly on whether one has a RAM generator available or not. Besides, in some applications the SRAM can be shared with other modules in the device and, in such cases, it does not incur further costs [36].

6 HIGH-SPEED VERIFICATION CORE

We also provide an architecture tailored for fast signature verifications that require double scalar multiplications $k \cdot G + l \cdot Q$. The architecture is designed primarily for FPGA devices which have embedded memory blocks and multipliers, and it can be used on the server-side for achieving very high throughputs for signature verifications by utilizing multiple parallel cores. Because these computations operate on public data, there is no need for side-channel countermeasures. We begin with a description of the architecture of Fp arithmetic in Section 6.1, discuss latencies of operations using the architecture in Section 6.2, and end with implementation results on a Xilinx Virtex-7 FPGA and discussion in Section 6.3.

6.1 Architecture for Fp Arithmetic

The core for computing double scalar multiplications is depicted in Fig. 3. The high-level diagram given in Fig. 3a shows that the core consists of two parallel ALUs and two dual-port RAMs.

The ports of the RAMs are arranged as follows. The A-port is used for both writing the output of the corresponding ALU into the RAM and reading the contents of the RAM. The B-port of the RAM is dedicated only for reading during the operation of the core, but it is used also by the external interface to write data into the RAMs. The architecture allows the ALUs to take inputs from both ports of both RAMs, but an ALU can only write to the corresponding RAM. If the core computes with only two values from different RAMs (e.g., $A + B$ in ALU 1 and $A - B$ in ALU 2 or $A \times A$ in ALU 1 and $B \times B$ in ALU 2), then reading the operands and writing the results can be done concurrently. If more operands need to be read (e.g., $A \times B$ in ALU 1 and $C \times D$ in ALU 2), then additional delays occur because reading and writing must occur in different clock cycles. The external interface allows writing and reading both RAMs.

The ALU depicted in Fig. 3b has a $W$-bit datapath that supports integer multiplication, addition, and subtraction. In our case $W = 52$ and each element of $\mathbb{F}_p$ splits into four words. However, instead of restricting values to $\mathbb{F}_p$, we allow an extended range $[0, 2^{208} - 1]$ to simplify the arithmetic. The ALU is built around a pipelined (six stages) $W \times W$-bit multiplier and a pipelined (three stages) accumulator. The multiplier is constructed by using a Xilinx IP Core so that it uses the hardwired multipliers of DSP48E1 blocks.

TABLE 3
Comparison of Execution Times, Areas and RAM Consumptions with Related Works over Prime Fields
(Most of the Works Use a 16-Bit Datapath)

| Implementation | Order | Time Sign. (cycles) | Time Ver. (cycles) | ALU (GE) | SRAM (bytes) |
|---|---|---|---|---|---|
| Chen et al. [28] | 256 | 562,000 | 1,124,000¹ | n.a. | n.a. |
| Lai et al. [29]² | 176 | 93,399 | 186,798¹ | n.a. | n.a. |
| Lai et al. [29]² | 256 | 252,067 | 504,134¹ | n.a. | n.a. |
| Satoh et al. [30] | 192 | 1,362,906 | 2,725,812¹ | 9,456 | n.a. |
| Satoh et al. [30] | 224 | 2,048,166 | 4,096,332¹ | 10,800 | n.a. |
| Fürbass et al. [31]³ | 192 | 502,000 | 1,004,000¹ | 21,769 | n.a. |
| Hutter et al. [32]³ | 192 | 859,188 | 1,718,376¹ | 2,371⁴ | 256 |
| Wenger et al. [33] | 192 | 1,377,000 | 2,645,000 | 4,354⁴ | 422 |
| Plos et al. [34] | 192 | 863,109 | 1,726,218¹ | 3,608⁴ | 256 |
| This work⁵ (HS) | 204 | 182,653 | 365,082 | 5,821 | 1,612 |
| This work⁵ (ME) | 204 | 182,653 | 415,392 | 5,821 | 1,222 |

1: Estimated results from the execution time of signature implementation.
2: Four 32-bit multipliers used.
3: 0.35 μm technology library used.
4: Microcode-based architecture used, more ROM is required.
5: Fixed-base scalar multiplication (for signature generation) and double scalar multiplication (for verification).

3. The efficient endomorphism can also be used to accelerate the computation of $k \cdot G$ via the GLV method, but it costs more time under the same RAM occupation or needs much more storage to improve performance. Due to the extreme resource constraints of IoT applications, we did not apply the GLV method to this computation.

Multiplications are computed using the product-scanning (Comba) algorithm [37] that computes all subproducts of a result word successively starting from the least-significant word. The subproducts are accumulated into two 52-bit registers reg0 and reg1 for the lower and higher word of the multiplication result and a 5-bit register reg2 for the overflowing bits. When a word of the result is ready, the accumulator is shifted to the right (reg0 ← reg1, reg1 ← reg2, reg2 ← 0). Additions and subtractions of 52-bit words are computed using the adder on the right. The carry input to the adder can be set to either zero, one, or to the carry from the previous addition. This allows efficient computation of multiprecision additions and subtractions. Two words are subtracted by inverting all bits and setting the carry to one. The ALU also supports dividing words in reg0 by two (right shifts) in a way that allows implementing multiprecision divisions by two. Division by two is performed so that $p$ is first added to the dividend if its lsb is one.
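The product-scanning order can be illustrated with a short Python model. The sketch below is a functional illustration with names of our own choosing (`to_words`, `from_words`, `comba_mul`); the single integer `acc` plays the role of the three accumulator registers together, and the pipelining of the multiplier is not modelled.

```python
W = 52                                     # word size of the datapath


def to_words(x, n=4):
    """Split x into n words of W bits, least significant word first."""
    return [(x >> (W * i)) & ((1 << W) - 1) for i in range(n)]


def from_words(words):
    """Reassemble an integer from words of W bits, least significant word first."""
    return sum(w << (W * i) for i, w in enumerate(words))


def comba_mul(a, b):
    """Product-scanning multiplication of two equally long word arrays."""
    n = len(a)
    mask = (1 << W) - 1
    acc = 0                                # models reg2:reg1:reg0
    out = []
    for col in range(2 * n - 1):
        # Accumulate all subproducts a[i] * b[j] with i + j == col.
        for i in range(max(0, col - n + 1), min(col, n - 1) + 1):
            acc += a[i] * b[col - i]
        out.append(acc & mask)             # one word of the result is ready
        acc >>= W                          # shift the accumulator to the right
    out.append(acc)                        # most significant word
    return out
```

With four words per operand, at most four subproducts of 104 bits are added per column on top of the carried-over bits, so the accumulator stays well within the 52 + 52 + 5 bits provided by reg0, reg1 and reg2.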

Both point addition and point doubling require seven multiplications and six additions/subtractions in $\mathbb{F}_p$ by using the formulae from [14], but they can be computed with a latency of four multiplications and three additions/subtractions by utilizing the two parallel ALUs. We observe that it is possible to interleave the computation of point addition and point doubling so that the combined latency becomes only seven multiplications and six additions/subtractions. Algorithm 4 gives the operation sequence. The subscripts of the variables denote the RAM in which the variable is located (e.g., $X_1$ is in RAM 1 and $Y_2$ is in RAM 2).

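Algorithm 4. Interleaved Point Addition and Point Doubling

Input: $P_1 = (X_1, Y_2, Z_2, E_1, H_2)$, $P_2 = (U_1, V_2, W_2)$
Output: $(X_1, Y_2, Z_2, E_1, H_2) = 2(P_1 + P_2)$
$A_1 \leftarrow Y_2 + X_1$; $A_2 \leftarrow Y_2 - X_1$;
$B_1 \leftarrow E_1 \times H_2$; $B_2 \leftarrow A_1 \times V_2$;
$C_1 \leftarrow B_1 \times W_2$; $Y_2 \leftarrow A_2 \times U_1$;
$E_1 \leftarrow Y_2 + B_2$; $H_2 \leftarrow Y_2 - B_2$;
$A_1 \leftarrow Z_2 + C_1$; $C_2 \leftarrow Z_2 - C_1$;
$X_1 \leftarrow E_1 \times A_1$; $Y_2 \leftarrow C_2 \times H_2$;
$B_1 \leftarrow A_1 \times C_2$; $B_2 \leftarrow X_1 \times X_1$;
$A_1 \leftarrow Y_2 \times Y_2$; $A_2 \leftarrow B_1 \times B_1$;
$X_1 \leftarrow X_1 + Y_2$; $C_2 \leftarrow B_2 - A_1$;
$B_1 \leftarrow A_2 + A_2$; $H_2 \leftarrow B_2 + A_1$;
$E_1 \leftarrow X_1 \times X_1$; $Y_2 \leftarrow C_2 \times H_2$;
$E_1 \leftarrow H_2 + E_1$; $B_2 \leftarrow C_2 - B_1$;
$X_1 \leftarrow E_1 \times B_2$; $Z_2 \leftarrow C_2 \times B_2$;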

6.2 Latencies

The ALU computes field operations with the following latencies: multiplication 61 or 63 clock cycles, addition/subtraction 7-18 clock cycles depending on the required reductions (average 11.5), and addition/addition 7-11 clock cycles (average 9). Division by two requires 12 or 17 clock cycles if the lsb is zero or one, respectively. Hence, when two divisions are performed in parallel, the average latency is 15.75 clock cycles. A Fermat-based inversion in $\mathbb{F}_p$ takes on average $(206 + 14) \cdot 62 = 13{,}640$ clock cycles. One iteration of Algorithm 4 requires $7 \cdot 62 + 5 \cdot 11.5 + 9 = 500.5$ clock cycles on average. Computing only the point addition or point doubling parts of Algorithm 4 requires $4 \cdot 62 + 3 \cdot 11.5 = 282.5$ and $4 \cdot 62 + 2 \cdot 11.5 + 9 = 280$ clock cycles on average, respectively.
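The averages above can be reproduced with a few lines of Python. The constant names are ours; the only modelling assumption is that the least-significant bits of the two dividends are independent and equally likely to be zero or one, so that two parallel divisions finish after 12 cycles only if both lsbs are zero.

```python
MUL = (61 + 63) / 2                        # average multiplication latency: 62
ADDSUB = 11.5                              # addition and subtraction in parallel
ADDADD = 9                                 # two additions in parallel

# Two parallel divisions by two take 12 cycles only if both lsbs are zero.
DIV2_PAR = 0.25 * 12 + 0.75 * 17

INV = (206 + 14) * MUL                     # Fermat-based inversion
PAD_PD = 7 * MUL + 5 * ADDSUB + ADDADD     # one full iteration of Algorithm 4
PAD = 4 * MUL + 3 * ADDSUB                 # point addition part only
PD = 4 * MUL + 2 * ADDSUB + ADDADD         # point doubling part only
```

Evaluating these expressions gives 15.75, 13,640, 500.5, 282.5 and 280 clock cycles, in agreement with the figures in the text.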

In the following we provide estimates for the latency of computing the double scalar multiplication $k \cdot G + l \cdot Q$ with the core of Fig. 3. Similarly as before, we assume that the base point $G$ is fixed and $Q$ is varying. The scalar multiplication begins by precomputing all combinations of $a_1 G + a_2 \phi(G) + a_3 Q + a_4 \phi(Q)$ with $a_i \in \{0, 1\}$. Only the combinations where $a_3 = 1$ or $a_4 = 1$ need to be computed on the fly. The points depending only on $G$ can be computed offline and written to the RAMs once during initialization. The online precomputation is given in Algorithm 5. It uses the point addition part of Algorithm 4 to compute point additions where one operand is a point in projective coordinates and the other is a point in extended affine coordinates. In the end, all precomputed points in the table $T$ are in extended affine coordinates. The projective version can be obtained from the affine version without computational cost: for $(x, y)$, the projective version is $(x, y, 1, x, y)$. The extended affine version requires computations: $((y + x)/2, (y - x)/2, x \cdot y)$.

Fig. 3. The architectural diagrams of the verification engine. (a) The high-level diagram of the computation core and (b) the architecture of the ALUs.

The cost of Algorithm 5 is as follows. Line 1 is performed by writing data into the RAMs by using the external interface; this latency depends on the host processor and is not counted in the following clock cycle counts. Line 2 converts $Q$ into extended affine coordinates with one addition/subtraction ($y + x$ and $y - x$ in parallel), divisions by two (again in parallel), and a multiplication; this costs $62 + 11.5 + 15.75 = 89.25$ clock cycles on average. Line 3 computes $\phi(Q)$ which requires one inversion and one multiplication and, hence, on average 13,702 clock cycles. Lines 4-13 compute point additions with average latencies of 282.5 clock cycles. Line 14 finds the affine coordinates of 10 points, which requires 10 inversions and 20 multiplications. The inversions are computed using Montgomery's trick which translates the problem to one inversion and 27 multiplications. Hence, Line 14 takes $13{,}640 + (27 + 20) \cdot 62 = 16{,}554$ clock cycles on average. Finally, Algorithm 5 ends in Line 15 with the computation of extended affine coordinates for the 10 points. The total cost of this is $10 \cdot 89.25 = 892.5$ clock cycles. Summing up all the latencies gives the average latency of 34,062.75 clock cycles for the precomputation.
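The per-line costs of the online precomputation add up as claimed, which the following Python sketch makes explicit. The dictionary layout and names are ours, and the latencies are the averages from Section 6.2.

```python
MUL, ADDSUB, DIV2_PAR = 62, 11.5, 15.75    # average latencies from Section 6.2
INV = (206 + 14) * MUL                     # 13,640 cycles
PAD = 4 * MUL + 3 * ADDSUB                 # 282.5 cycles per point addition

to_ext_affine = MUL + ADDSUB + DIV2_PAR    # one conversion to (U, V, W): 89.25

steps = {
    "line 2": to_ext_affine,               # convert Q
    "line 3": INV + MUL,                   # phi(Q), as counted in the text
    "lines 4-13": 10 * PAD,                # ten point additions
    "line 14": INV + (27 + 20) * MUL,      # Montgomery's trick plus affine conversion
    "line 15": 10 * to_ext_affine,         # ten conversions to extended affine
}
total = sum(steps.values())
```

Summing the five contributions gives 34,062.75 clock cycles, of which the two inversions account for 27,280.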

Algorithm 5. Precomputation (Online)

Input: Affine version of $Q$ and affine and extended affine versions of $P$, $\phi(P)$, $\phi(P) + P$

Output: Table $T$ of 16 points represented in extended affine coordinates

1: Set $T[0,0,0,0] \leftarrow \mathcal{O}$, $T[1,0,0,0] \leftarrow P$, $T[0,1,0,0] \leftarrow \phi(P)$, $T[1,1,0,0] \leftarrow \phi(P) + P$;

2: Set $T[0,0,1,0] \leftarrow Q$ and convert it to extended affine coordinates;

3: Compute $T[0,0,0,1] \leftarrow \phi(Q)$ and convert it to extended affine coordinates;

4: Compute $T[1,0,1,0] \leftarrow T[1,0,0,0] + Q$ with point addition of Algorithm 4 (the same below);

5: Compute $T[0,1,1,0] \leftarrow T[0,1,0,0] + Q$;

6: Compute $T[1,1,1,0] \leftarrow T[1,1,0,0] + Q$;

7: Compute $T[1,0,0,1] \leftarrow T[1,0,0,0] + \phi(Q)$;

8: Compute $T[0,1,0,1] \leftarrow T[0,1,0,0] + \phi(Q)$;

9: Compute $T[1,1,0,1] \leftarrow T[1,1,0,0] + \phi(Q)$;

10: Compute $T[0,0,1,1] \leftarrow T[0,0,0,1] + Q$;

11: Compute $T[1,0,1,1] \leftarrow T[0,0,1,1] + P$;

12: Compute $T[0,1,1,1] \leftarrow T[0,0,1,1] + \phi(P)$;

13: Compute $T[1,1,1,1] \leftarrow T[0,0,1,1] + (\phi(P) + P)$;

14: Convert $T[1,0,1,0], \ldots, T[1,1,1,1]$ to affine coordinates by using Montgomery's trick for inversions;

15: Convert $T[1,0,1,0], \ldots, T[1,1,1,1]$ to extended affine coordinates.
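Line 14 relies on Montgomery's trick for simultaneous inversion. The following is a minimal Python sketch of that general technique, not the authors' hardware schedule. The function name and the demonstration modulus $2^{127} - 1$ are assumptions chosen for illustration, and the modular inverse uses the three-argument `pow` with exponent -1, which requires Python 3.8 or later.

```python
def batch_invert(values, p):
    """Invert all `values` modulo the prime p using a single inversion.

    Returns (inverses, mults), where mults counts the field
    multiplications spent in addition to the one inversion.
    """
    n = len(values)
    mults = 0

    # prefix[i] = values[0] * values[1] * ... * values[i] (mod p)
    prefix = [values[0] % p]
    for v in values[1:]:
        prefix.append(prefix[-1] * v % p)
        mults += 1

    # The only inversion: the inverse of the product of all values.
    acc = pow(prefix[-1], -1, p)

    # Peel off one factor at a time, from the last value to the first.
    inverses = [0] * n
    for i in range(n - 1, 0, -1):
        inverses[i] = acc * prefix[i - 1] % p
        acc = acc * values[i] % p
        mults += 2
    inverses[0] = acc
    return inverses, mults


# Demonstration with 10 values, matching the 10 points of Line 14.
P_DEMO = 2**127 - 1            # a Mersenne prime, used only as a stand-in
demo_values = list(range(2, 12))
demo_inverses, demo_mults = batch_invert(demo_values, P_DEMO)
```

For 10 values the sketch spends 9 multiplications on the prefix products and 18 on the back-substitution, which is the "one inversion and 27 multiplications" figure used in the cost analysis.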

The scalar array is scanned from left (the msb) to right (the lsb). A point doubling is computed for each column, followed by a point addition if the column is nonzero (i.e., contains at least one one-bit). Hence, a point addition is computed on average for 15/16 of the columns. Whenever a point addition is skipped, we need to be able to compute only a point doubling. Also, the double scalar multiplication ends with a point addition 15 times out of 16. Hence, in addition to Algorithm 4, routines for separate point addition and point doubling are also needed. They can be constructed from the two halves of Algorithm 4 with simple modifications to the addressing. We assume that we have a 104-bit ($\approx 207/2$) scalar array. Then, the scalar multiplication latency is given by

$$\mathrm{PD} + 102 \cdot \left( \frac{15}{16} \cdot \mathrm{PAD} + \frac{1}{16} \cdot \mathrm{PD} \right) + \frac{15}{16} \cdot \mathrm{PA},$$

where PD, PA, and PAD denote point doubling, point addition, and interleaved point addition and point doubling, respectively. Using the above latencies for these operations gives 50,190 clock cycles. In the end, the affine coordinates of the result point are obtained by computing an inversion followed by two multiplications with an average latency of 13,764 clock cycles. Summing up all of the above latencies shows that a double scalar multiplication requires 98,017 clock cycles on average.
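The table lookup and the left-to-right column scan can be illustrated with a toy model in Python. Everything here is an assumption made for illustration: points are replaced by integers modulo a toy group order `N`, point addition by integer addition, and the endomorphism $\phi$ by multiplication with a made-up constant `LAM`. The sketch shows only the indexing and the operation counts, not curve arithmetic. Treating the top column as a plain table load is one reading of the latency formula, which counts 103 doublings for a 104-column array.

```python
N = 2**61 - 1       # toy group order (assumption, not the curve order)
LAM = 0x1234567     # toy stand-in for the eigenvalue of phi (assumption)
WIDTH = 104         # columns in the scalar array


def phi(point):
    """Toy endomorphism: multiplication by LAM in the additive group Z_N."""
    return LAM * point % N


def build_table(G, Q):
    """T[(a1, a2, a3, a4)] = a1*G + a2*phi(G) + a3*Q + a4*phi(Q)."""
    basis = (G, phi(G), Q, phi(Q))
    table = {}
    for index in range(16):
        bits = tuple((index >> j) & 1 for j in range(4))
        table[bits] = sum(b * pt for b, pt in zip(bits, basis)) % N
    return table


def column(scalars, col):
    """Bits of the four sub-scalars at position col."""
    return tuple((s >> col) & 1 for s in scalars)


def scan(scalars, table, width=WIDTH):
    """Scan columns from msb to lsb: double, then add if nonzero."""
    acc = table[column(scalars, width - 1)]   # top column: table load only
    doublings = additions = 0
    for col in range(width - 2, -1, -1):
        acc = 2 * acc % N                     # point doubling
        doublings += 1
        bits = column(scalars, col)
        if any(bits):                         # skipped for 1/16 of columns
            acc = (acc + table[bits]) % N     # point addition
            additions += 1
    return acc, doublings, additions
```

Calling `scan((k1, k2, l1, l2), build_table(G, Q))` with four sub-scalars below $2^{104}$ returns the toy equivalent of $(k_1 + k_2 \lambda) G + (l_1 + l_2 \lambda) Q$ together with the number of doublings and additions performed.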

6.3 Results and Discussion

We compiled the core depicted in Fig. 3 for Xilinx Virtex-7 XC7VX330T-1FFG1157 using Xilinx ISE 14.7. The results are collected in Table 4. They show that the core is compact and operates at a relatively high clock frequency (the critical path is in the pipelined 52-bit multiplier). If parallel instances of the core are implemented in the FPGA, then the number of DSP48E1 blocks will become the bottleneck. The numbers indicate that even 50 parallel cores could fit in one Virtex-7 XC7VX330T.

We are not aware of other published results that would be directly comparable with this architecture for double scalar multiplications for signature verifications. An architecture for verifying self-certified signatures on the Koblitz curve NIST K-163 over $\mathbb{F}_{2^{163}}$ was presented by Järvinen et al. [38] at CHES 2007. It achieved a throughput of 166,000 verifications per second with an Altera Stratix II FPGA, which appears to be slightly faster than what is achievable with our architecture. However, the comparison is not fair because the curve used in [38] offers less security (approximately 80 bits versus 100 bits) and $\mathbb{F}_{2^n}$ arithmetic is typically much more efficient in hardware (FPGA) than $\mathbb{F}_p$ arithmetic. Recently, Sasdrich and Güneysu [39] presented a highly optimized core for single scalar multiplications on Curve25519 [40]. Their single-core implementation has resource requirements comparable to those of our core (e.g., 20 DSP blocks) and it computes 2,519 scalar multiplications per second. Hence, we can estimate that their implementation is capable of computing roughly 1,260 double scalar multiplications per second. Our core achieves a throughput of 2,040 double scalar multiplications per second, which is approximately 60 percent more. However, these numbers are not directly comparable because Curve25519 offers a roughly 128-bit security level whereas our curve offers a 100-bit security level. Nevertheless, this shows that our core compares favorably to the state-of-the-art FPGA implementations of fast elliptic curve cryptography over prime fields.
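The throughput figures in this comparison follow from the latency budget by simple arithmetic, which the short Python sketch below reproduces. The variable names are hypothetical. Halving the single-scalar rate of [39] is the estimate made in the text, not a measured result.

```python
CLOCK_HZ = 200e6               # clock frequency used in Table 4

# Latency components quoted in Section 6.2 (clock cycles).
precomputation = 34062.75
scalar_multiplication = 50190
final_conversion = 13764

latency_cycles = round(precomputation + scalar_multiplication
                       + final_conversion)
time_ms = latency_cycles / CLOCK_HZ * 1e3
throughput = CLOCK_HZ / latency_cycles        # double scalar mults per second

# Estimate for the Curve25519 core of [39]: two single scalar
# multiplications per double scalar multiplication.
curve25519_single = 2519
curve25519_double_estimate = curve25519_single / 2
speed_ratio = throughput / curve25519_double_estimate
```

With these inputs the ratio is about 1.62, which is the source of the "approximately 60 percent more" statement.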

TABLE 4
Results of the Verification Core on Xilinx Virtex-7 XC7VX330T-1FFG1157

LUTs: 955 (0.5%)
Registers: 992 (0.2%)
Slices: 377 (0.7%)
RAMB36E1: 2 (0.3%)
DSP48E1: 20 (1.8%)
Max. freq. (MHz): 205.634
Latency (clock cycles): 98,017
Timing (ms @ 200 MHz): 0.490
Throughput (ops/s @ 200 MHz): 2,040

Our high-speed core was purposely designed to be as simple as possible in order to maximize the operating frequency. Adding certain features to the ALU would allow shorter latencies, but would also lead to a drop in the maximum frequency. In particular, adding support for shift operations would allow optimizing squarings (currently treated as normal multiplications) and using faster inversions based on the Extended Euclidean Algorithm. Future work includes studies on whether such modifications could lead to further speedups and improvements in the speed-area ratio.

7 CONCLUSION

In this work, we introduced a twisted Edwards curve with an efficiently computable endomorphism and described how the endomorphism can be exploited to speed up double scalar multiplications. We described two hardware implementations utilizing the endomorphism, targeting resource-constrained IoT devices and server-side FPGAs, respectively.

We presented an area-optimized processor architecture for resource-constrained applications. The processor is built around a 16-bit datapath and it has an overall silicon area of only 5,821 GE when synthesized with a 130 nm CMOS standard-cell library. In addition, we showed that the architecture and the presented methods support various trade-offs between execution time and memory requirements, which gives a designer many options to optimize double scalar multiplications for different requirements. Our processor architecture compares favorably to various counterparts from the literature.

We also provided a high-speed architecture for FPGA devices. This verification core was designed to use parallel processing with two ALUs and RAM memories. It resulted in a fast and compact FPGA implementation of the double scalar multiplication. By using parallel instances of the core inside one FPGA device, it can achieve very high throughputs for signature verifications in server-side operations related to the IoT.

These implementations show that our curve can be efficiently implemented for applications that require low resources or high speed. This is a particularly important advantage for IoT applications because such systems must be flexible in the sense that they can be efficiently implemented in environments with varying implementation constraints. Our curve offers a roughly 100-bit security level, which is a good tradeoff between security and performance. All this makes our methods, the curve, and the architectures good options for implementing cryptographic protocols in IoT applications.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions. Zhi Hu was partially supported by the Natural Science Foundation of China (Grant No. 61602526). Kimmo Järvinen was an FWO Pegasus Marie Curie Fellow.

REFERENCES

[1] L. Atzori, A. Iera, and G. Morabito, “The Internet of Things: A survey,” Comput. Netw., vol. 54, no. 15, pp. 2787–2805, Oct. 2010.

[2] R. Roman, P. Najera, and J. Lopez, “Securing the Internet of Things,” IEEE Comput., vol. 44, no. 9, pp. 51–58, Sep. 2011.

[3] T. Dierks and E. K. Rescorla, “The transport layer security (TLS) protocol version 1.2,” Internet Eng. Task Force, Netw. Working Group, RFC 5246, Aug. 2008.

[4] E. K. Rescorla and N. G. Modadugu, “Datagram transport layer security version 1.2,” Internet Eng. Task Force, Netw. Working Group, RFC 6347, Jan. 2012.

[5] S. L. Keoh, S. S. Kumar, and H. Tschofenig, “Securing the Internet of Things: A standardization perspective,” IEEE Internet Things J., vol. 1, no. 3, pp. 265–275, Jun. 2014.

[6] R. L. Rivest, A. Shamir, and L. M. Adleman, “A method for obtaining digital signatures and public key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.

[7] Digital Signature Standard (DSS), NIST, FIPS Publication 186-4, Jul. 2013. [Online]. Available: http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf

[8] D. Johnson, A. J. Menezes, and S. A. Vanstone, “The elliptic curve digital signature algorithm (ECDSA),” Int. J. Inf. Secur., vol. 1, no. 1, pp. 36–63, Jul. 2001.

[9] S. Blake-Wilson, N. Bolyard, V. Gupta, C. Hawk, and B. Möller, “Elliptic curve cryptography (ECC) cipher suites for transport layer security (TLS),” Internet Eng. Task Force, Netw. Working Group, RFC 4492, May 2006.

[10] N. P. Smart, Ed., “ECRYPT II Yearly Report on Algorithms and Keysizes (2011–2012),” Eur. Netw. Excellence Cryptology (ECRYPT II), Tech. Rep. D.SPA.20, Sep. 2012. [Online]. Available: http://www.ecrypt.eu.org/documents/D.SPA.20.pdf

[11] D. J. Bernstein, N. Duif, T. Lange, P. Schwabe, and B.-Y. Yang, “High-speed high-security signatures,” J. Cryptographic Eng., vol. 2, no. 2, pp. 77–89, Sep. 2012.

[12] D. R. Hankerson, A. J. Menezes, and S. A. Vanstone, Guide to Elliptic Curve Cryptography. Berlin, Germany: Springer-Verlag, 2004.

[13] D. J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters, “Twisted Edwards curves,” in Progress in Cryptology – AFRICACRYPT 2008, S. Vaudenay, Ed. Berlin, Germany: Springer-Verlag, 2008, pp. 389–405.

[14] H. Hisil, K. K.-H. Wong, G. Carter, and E. Dawson, “Twisted Edwards curves revisited,” in Advances in Cryptology – ASIACRYPT 2008, J. Pieprzyk, Ed. Berlin, Germany: Springer-Verlag, 2008, pp. 326–343.

[15] R. P. Gallant, R. J. Lambert, and S. A. Vanstone, “Faster point multiplication on elliptic curves with efficient endomorphism,” in Advances in Cryptology – CRYPTO 2001, J. Kilian, Ed. Berlin, Germany: Springer-Verlag, 2001, pp. 190–200.

[16] S. D. Galbraith, X. Lin, and M. Scott, “Endomorphisms for faster elliptic curve cryptography on a large class of curves,” in Advances in Cryptology – EUROCRYPT 2009, A. Joux, Ed. Berlin, Germany: Springer-Verlag, 2009, pp. 518–535.

[17] P. Longa and F. Sica, “Four-dimensional Gallant-Lambert-Vanstone scalar multiplication,” in Advances in Cryptology – ASIACRYPT 2012, X. Wang and K. Sako, Eds. Berlin, Germany: Springer-Verlag, 2012, pp. 719–739.

[18] P. Longa and C. H. Gebotys, “Efficient techniques for high-speed elliptic curve cryptography,” in Cryptographic Hardware and Embedded Systems – CHES 2010, S. Mangard and F.-X. Standaert, Eds. Berlin, Germany: Springer-Verlag, 2010, pp. 80–94.

[19] A. Faz-Hernández, P. Longa, and A. H. Sánchez, “Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves,” in Topics in Cryptology – CT-RSA 2014, J. Benaloh, Ed. Berlin, Germany: Springer-Verlag, 2014, pp. 1–27.

[20] D. A. Cox, Primes of the Form $x^2 + ny^2$. Hoboken, NJ, USA: Wiley, 1989.

[21] J. A. Solinas, “Low-weight binary representations for pairs of integers,” Centre Appl. Cryptographic Res. (CACR), Univ. Waterloo, Waterloo, Canada, Tech. Rep. CORR 2001-41, 2001.

[22] T. Yanık, E. Savas, and C. K. Koc, “Incomplete reduction in modular arithmetic,” IEE Proc. – Comput. Digit. Techn., vol. 149, no. 2, pp. 46–52, Mar. 2002.

[23] B. S. Kaliski, “The Montgomery inverse and its applications,” IEEE Trans. Comput., vol. 44, no. 8, pp. 1064–1065, Aug. 1995.

[24] E. Savas and C. Koc, “The Montgomery modular inverse-revisited,” IEEE Trans. Comput., vol. 49, no. 7, pp. 763–766, Jul. 2000.

[25] E. Savas, M. Naseer, A.-A. Gutub, and C. K. Koc, “Efficient unified Montgomery inversion with multibit shifting,” IEE Proc. – Comput. Digit. Techn., vol. 152, no. 4, pp. 489–498, Jul. 2005.

[26] R. Lórencz and J. Hlaváč, “Subtraction-free almost Montgomery inverse algorithm,” Inf. Process. Lett., vol. 94, no. 1, pp. 11–14, 2005.

[27] V. G. Oklobdzija, “An algorithmic and novel design of a leading zero detector circuit: Comparison with logic synthesis,” IEEE Trans. Very Large Scale Integr. Syst., vol. 2, no. 1, pp. 124–128, Mar. 1994.

[28] G. Chen, G. Bai, and H. Chen, “A high-performance elliptic curve cryptographic processor for general curves over GF(p) based on a systolic arithmetic unit,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 54, no. 5, pp. 412–416, May 2007.


[29] J.-Y. Lai and C.-T. Huang, “A highly efficient cipher processor for dual-field elliptic curve cryptography,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 56, no. 5, pp. 394–398, May 2009.

[30] A. Satoh and K. Takano, “A scalable dual-field elliptic curve cryptographic processor,” IEEE Trans. Comput., vol. 52, no. 4, pp. 449–460, Apr. 2003.

[31] F. Furbass and J. Wolkerstorfer, “ECC processor with low die size for RFID applications,” in Proc. IEEE Int. Symp. Circuits Syst., 2007, pp. 1835–1838.

[32] M. Hutter, M. Feldhofer, and T. Plos, “An ECDSA processor for RFID authentication,” in Radio Frequency Identification: Security and Privacy Issues. Berlin, Germany: Springer, 2010, pp. 189–202.

[33] E. Wenger, M. Feldhofer, and N. Felber, “Low-resource hardware design of an elliptic curve processor for contactless devices,” in Information Security Applications. Berlin, Germany: Springer, 2011, pp. 92–106.

[34] T. Plos, M. Hutter, M. Feldhofer, M. Stiglic, and F. Cavaliere, “Security-enabled near-field communication tag with flexible architecture supporting asymmetric cryptography,” IEEE Trans. Very Large Scale Integr. Syst., vol. 21, no. 11, pp. 1965–1974, Nov. 2013.

[35] Z. Liu, E. Wenger, and J. Großschädl, “MoTE-ECC: Energy-scalable elliptic curve cryptography for wireless sensor networks,” in Proc. 12th Int. Conf. Appl. Cryptography Netw. Secur., 2014, pp. 361–379.

[36] E. Wenger, “Hardware architectures for MSP430-based wireless sensor nodes performing elliptic curve cryptography,” in Applied Cryptography and Network Security – ACNS 2013. Berlin, Germany: Springer, 2013, pp. 290–306.

[37] P. G. Comba, “Exponentiation cryptosystems on the IBM PC,” IBM Syst. J., vol. 29, no. 4, pp. 526–538, 1990.

[38] K. Järvinen, J. Forsten, and J. Skyttä, “FPGA design of self-certified signature verification on Koblitz curves,” in Cryptographic Hardware and Embedded Systems – CHES 2007. Berlin, Germany: Springer, 2007, pp. 256–271.

[39] P. Sasdrich and T. Güneysu, “Implementing Curve25519 for side-channel-protected elliptic curve cryptography,” ACM Trans. Reconfigurable Technol. Syst., vol. 9, no. 1, Nov. 2015, Art. no. 3.

[40] D. J. Bernstein, “Curve25519: New Diffie-Hellman speed records,” in Public Key Cryptography – PKC 2006. Berlin, Germany: Springer, 2006, pp. 207–228.

Zhe Liu received the PhD degree from the Laboratory of Algorithmics, Cryptology and Security (LACS), University of Luxembourg, in 2015. During his doctoral studies, he was a visiting scholar at the City University of Hong Kong, COSIC, K.U. Leuven, and Microsoft Research (MSR), Redmond. He is a full professor in the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics. He is currently a postdoctoral research fellow in the Institute for Quantum Computing (IQC) and the Department of Combinatorics and Optimization, University of Waterloo, Canada. His research interests include different aspects of information security. He has co-authored more than 40 peer-reviewed journal and conference papers in the area of cryptographic engineering, including IEEE TIFS and IACR CHES.

Johann Großschädl is a member of the research staff in the Laboratory of Algorithmics, Cryptology and Security (LACS), University of Luxembourg. Before joining the University of Luxembourg, he was a research scientist in the Computer Science Department, University of Bristol, United Kingdom. He has published more than 60 papers in international, peer-reviewed journals and conference proceedings, such as the ACM Annual Computer Security Applications Conference (ACSAC) and Cryptographic Hardware and Embedded Systems (CHES), which are the flagship events in the field of applied cryptography. He is a member of the IEEE and the International Association for Cryptologic Research (IACR).

Zhi Hu received the BS and PhD degrees from the School of Mathematical Sciences, Peking University, China, in 2007 and 2012, respectively. He was a postdoctoral research fellow with the Beijing International Center for Mathematical Research (BICMR) from 2012 to 2014. After that, he joined the School of Mathematics and Statistics, Central South University, China, where he is currently a lecturer. His research interests include cryptography and information security, especially elliptic curve cryptography.

Kimmo Järvinen received the MSc (Tech.) and D.Sc. (Tech.) degrees in electrical engineering from Helsinki University of Technology (TKK), Espoo, Finland, in 2003 and 2008, respectively. He was in the Signal Processing Laboratory, TKK, from 2002 to 2008. From 2008 to 2014, he worked in the Department of Information and Computer Science, Aalto University, Espoo, Finland. From 2014 to 2015, he was with the COSIC group of KU Leuven ESAT, Leuven, Belgium. He is currently in the Department of Computer Science, Aalto University. His research interests include efficient and secure realization of cryptosystems, general computer arithmetic, and FPGAs.

Husen Wang received the BS degree from Beihang University, in 2009, and the MS degree in computer science from Tsinghua University, in 2012. Since 2015, he has been a researcher in Security and Trust (SnT), University of Luxembourg, Luxembourg. His research interests include information security, with a special focus on cryptographic engineering. His work has been published in refereed journals and cryptology conferences.

Ingrid Verbauwhede received the electrical engineering and PhD degrees from the KU Leuven, Heverlee, Belgium, in 1991. From 1992 to 1994, she was a postdoctoral researcher and a visiting lecturer with the University of California, Berkeley. From 1994 to 1998, she worked with TCSI and ATMEL, Berkeley, California. In 1998, she joined the faculty of the University of California, Los Angeles (UCLA). She is currently a professor with the KU Leuven and an adjunct professor with UCLA. At KU Leuven, she is a co-director of the Computer Security and Industrial Cryptography (COSIC) Laboratory. Her research interests include circuits, processor architectures, and design methodologies for real-time embedded systems for security, cryptography, digital signal processing, and wireless communications. This includes the influence of new technologies and new circuit solutions on the design of next-generation systems on chip. She was the program chair of CHES'07, CHES'12, ASAP'08, and ISLPED'02. She was also the general chair of ISLPED'03. She was a member of the executive committee of DCA'05 and DAC'06.