
Arenberg Doctoral School of Science, Engineering & Technology

Faculty of Engineering

Department of Electrical Engineering

Efficient arithmetic for embedded cryptography and cryptanalysis

Junfeng FAN

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor in Engineering

January 2012


Jury:
Prof. dr. ir. Hugo Hens, chair
Prof. dr. ir. Ingrid Verbauwhede, promotor
Prof. dr. ir. Bart Preneel, co-promotor
Prof. dr. ir. Wim Dehaene
Prof. dr. ir. Joos Vandewalle
Dr. ir. Fré Vercauteren
Dr. ir. Marc Joye (Technicolor, France)
Prof. dr. ir. Patrick Schaumont (Virginia Tech, USA)


© Katholieke Universiteit Leuven – Faculty of Engineering, Kasteelpark Arenberg 10, B-3001 Heverlee (Belgium)

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm, electronic or any other means without prior written permission from the publisher.

D/2012/7515/13
ISBN 978-94-6018-474-1

Acknowledgements

It would not have been possible to write this dissertation without the help and support of many people around me, to only some of whom it is possible to give particular mention here.

First, I would like to express my deepest gratitude to my supervisor, Prof. Ingrid Verbauwhede, for offering me the opportunity to conduct PhD research at COSIC, for granting the freedom and flexibility in my research work, and for the guidance and support in the last 5 years. I would like to thank Prof. Bart Preneel for the inspiring discussions and excellent comments on my dissertation.

I would also like to thank Prof. Wim Dehaene and Prof. Joos Vandewalle, my assessors during my PhD study, for much valuable advice. I am honored to have Prof. Hugo Hens as the chair and Prof. Patrick Schaumont and Dr. Marc Joye as members of my jury.

Special thanks go to Dr. Fré Vercauteren, who is a great colleague, co-author, and friend. I am especially grateful for his patience with my questions in mathematics, the inspiring talks we had on new ideas, and his careful review of my dissertation.

I am very grateful to our COSIC members who make up such a nice mixture of culture, wisdom and personalities. I thank them for their generosity and encouragement, and for making my life in Leuven a lot more colorful. I would especially like to thank Péla for being so nice and helpful all the time.

During the last 5 years, I was lucky enough to meet and collaborate with many talented researchers. I would like to thank Kazuo Sakiyama, Lejla Batina and Nele Mentens for guiding me after I joined COSIC. I also enjoyed the collaboration with Miroslav Knežević, Duško Karaklajić, Yong Ki Lee, Roel Maes, Vladimir Rožić, Benedikt Gierlichs, Özgül Küçük, Jens Hermans, Markus Ullrich and Elke De Mulder. I am also grateful to many people with whom I have collaborated remotely: Xu Guo, Tanja Lange, Daniel J. Bernstein, Peter Schwabe, Xiaoxu Yao and Tim Güneysu. Their passion and diligence have encouraged me to push my research forward.

These acknowledgments would certainly remain incomplete without mentioning many of my friends in Leuven. I would like to thank Elena Andreeva for sharing with me many delicious dinners, countless quick jokes and an enthusiastic attitude towards life. I am grateful to many Chinese friends in Leuven. I would especially like to thank Nina Fan, Lin Zhou, Chang Chen, Yunan Cheng, Hang Gao, Yangyin Chen, Tingyao Wu, Min Li, Li Weng, Junfeng Zhou, Feng Qi, Lianggong Wen, Hongjun Wu, Beier Li, Min Liu, Enze Chen, Yuemei Ji, Yannan Ding, Fu-Chiao Huang, Yu-Yuan Hung and Kai Zhou for their generous help and support, for interesting chats at Alma, and for celebrating with me many Chinese festivals. Thanks to them, I have never felt that home was far away.

Finally, I would like to thank my parents and my sisters for the unconditional support and love. I would like to thank Di Mo for being supportive and understanding during all these years.

Junfeng Fan

January 2012

Abstract

Public Key Cryptography (PKC) is a critical component of today's information infrastructure. The use of PKC covers a wide spectrum of devices ranging from web servers to mobile handsets, from contact smart cards to passive RFID tags. Therefore, PKC implementations tailored to different environments need specific optimizations to meet the requirements for performance, power and security against physical attacks.

This thesis focuses on arithmetic and architecture design for PKC. In the first part, we analyze the computation structures of RSA, Elliptic Curve Cryptography (ECC), Hyperelliptic Curve Cryptography (HECC), Torus-based cryptography and Pairings, and explore various representations, algorithms and architectures for different design targets. In particular, we propose a multi-core Montgomery multiplier, a low-complexity modular multiplication algorithm for pairings, and two novel architectures for low-area implementations of HECC.

In the second part, we use efficient arithmetic as the basis for hardware-based cryptanalysis. The security margin of a cryptosystem erodes continuously due to Moore's law. We study the power of FPGA clusters to break ECC using the parallelized Pollard rho method and implement this attack on an FPGA, where we try to maximize the number of Pollard rho iterations per second. We also give an estimation of the effort to break ECC2-131 and ECC2K-163 with state-of-the-art FPGAs.

In the third and final part, we provide a systematic overview of implementation attacks and countermeasures for ECC. By monitoring the timing, power consumption or electromagnetic emission of the device, or by inserting faults, adversaries can gain information about internal data or operations and extract the secret key without mathematically breaking the primitives. We provide implementers of ECC with ready-to-use recommendations of which combinations of countermeasures result in a secure implementation.


Brief Summary

Public-key cryptography (PKC) plays an essential role in today's information society. PKC is found in all kinds of applications, from web servers to mobile phones, from smart cards to passive RFID tags. These diverse applications require specific optimizations for the different environments, both in terms of performance and energy consumption and in terms of security against side-channel attacks.

This thesis deals with arithmetic and architecture design for PKC. In the first part, we analyze the algorithmic structures of RSA, Elliptic Curve Cryptography (ECC), Hyperelliptic Curve Cryptography (HECC), torus-based cryptography and pairings, exploring various representations, algorithms and architectures for different design targets. More specifically, we propose a multi-core Montgomery multiplier, a low-complexity modular multiplication algorithm for pairings, and finally two new architectures for small-area HECC implementations.

In the second part, we use efficient arithmetic as the basis for hardware-based cryptanalysis. The relentless improvement of chip technologies causes the security margin of a cryptosystem to decrease continuously. We study the use of FPGA clusters to attack ECC via the parallel Pollard rho method and implement this attack on an FPGA, where we try to maximize the number of iterations per second. In addition, we give an estimate of the practical security margin of ECC2-131 and ECC2K-163 when state-of-the-art FPGAs are used.

In the third and final part, we give a systematic overview of implementation attacks and countermeasures for ECC. By measuring the running time, energy consumption and electromagnetic radiation of an implementation, or by inducing faults, an attacker can learn information about internal data and in this way compute the secret key. Our systematic overview can be used by implementers of ECC to select combinations of countermeasures that result in a secure implementation.

Abbreviations

ADPA  Address-bit Differential Power Analysis
AES  Advanced Encryption Standard
ALU  Arithmetic Logic Unit
ASIC  Application-Specific Integrated Circuit

BN  Barreto-Naehrig

CM  Complex Multiplication

DA  Divisor Addition
DD  Divisor Doubling
DES  Data Encryption Standard
DFA  Differential Fault Analysis
DH  Diffie-Hellman
DLP  Discrete Logarithm Problem
DPA  Differential Power Analysis

EC  Elliptic Curve
ECC  Elliptic Curve Cryptography
ECDLP  Elliptic Curve Discrete Logarithm Problem
ECSM  Elliptic Curve Scalar Multiplication
EEA  Extended Euclidean Algorithm
EM  Electromagnetic

FA  Fault Analysis
FHE  Fully Homomorphic Encryption
FIFO  First-In-First-Out
FIOS  Finely Integrated Operand Scanning
FLT  Fermat's Little Theorem
FPGA  Field-Programmable Gate Array
FSM  Finite State Machine

HEC  Hyperelliptic Curve
HECC  Hyperelliptic Curve Cryptography
HMM  Hybrid Modular Multiplication
HMMB  Hybrid Modular Multiplication for BN curves
HW  Hamming Weight

ISA  Instruction Set Architecture

LE  Logic Element
LSB  Least Significant Bit

MMM  Montgomery Modular Multiplication
MPL  Montgomery Powering Ladder
MSB  Most Significant Bit

NAF  Non-Adjacent Form

PA  Point Addition
PAIA  Point-at-Infinity Attack
PBC  Pairing-Based Cryptography
PCIe  Peripheral Component Interconnect Express
PD  Point Doubling
PE  Processing Element
PKC  Public Key Cryptography
PNS  Position Number System
PV  Point Validation

RFID  Radio-Frequency Identification
RNS  Residue Number System
RPA  Refined Power Analysis

SCA  Side Channel Analysis
SPA  Simple Power Analysis

UMI  Unified Multiplier and Inverter

VLIW  Very Long Instruction Word

ZPA  Zero-value Point Attack

Contents

Abstract
Contents
List of Figures
List of Tables

1 Introduction
1.1 Summary of the Thesis

2 Public Key Cryptography: Mathematical Background
2.1 Introduction
2.2 Public Key Cryptography
2.2.1 RSA
2.2.2 Torus-based Cryptosystem
2.2.3 Elliptic Curve Cryptography
2.2.4 Hyperelliptic Curve Cryptography
2.2.5 Pairing-based Cryptography
2.2.6 PKC Break-down
2.3 F_p Arithmetic
2.3.1 Representations
2.3.2 Multiplication
2.3.3 Inversion
2.4 F_{2^m} Arithmetic
2.4.1 Representations
2.4.2 Multiplication
2.4.3 Squaring
2.4.4 Inversion
2.5 Conclusion

3 Montgomery Multiplication on A Multi-core Platform
3.1 Motivation
3.2 MMM on A Multi-core Platform
3.2.1 Target Platform
3.2.2 Dependency Analysis and Task Partitioning
3.2.3 Method-I vs. Method-II
3.2.4 Scalability Analysis
3.3 Case Study: ECC, RSA and CEILIDH
3.3.1 Software/Hardware Interface
3.3.2 Control Hierarchy
3.3.3 Results
3.4 Conclusion

4 Hybrid Modular Multiplication (HMM) and Its Application to Pairings
4.1 Motivation
4.2 Hybrid Modular Multiplication
4.2.1 Parallel Hybrid Modular Multiplication
4.2.2 Digit-serial Version
4.2.3 Faster Coefficient Reduction
4.2.4 Complexity
4.3 High Performance Pairing Processor Using HMM
4.3.1 Pairing-friendly Curves
4.3.2 Pairing Computation
4.3.3 Parameter Selection for Pairing-friendly Curves
4.3.4 Application to BN Curves
4.3.5 HMM Multiplier
4.3.6 Implementation Results
4.4 Pairing Processor Using RNS
4.5 Conclusion

5 HECC over F_{2^m} Using Unified Multiplier/Inverters
5.1 Motivation
5.2 Unified Multiplier and Inverter
5.2.1 Multiplication Algorithms
5.2.2 Inversion Algorithms
5.3 High-throughput UMI and HECC processor
5.3.1 Type-I UMI Architecture: High Throughput
5.3.2 Type-I HECC Processor
5.3.3 Results and Comparison
5.4 Lightweight UMI and HECC Processor for RFID
5.4.1 Type-II UMI Architecture: Low Footprint
5.4.2 Type-II HECC Processor
5.4.3 Results and Comparison
5.5 Conclusion

6 Breaking ECC with Configurable Hardware
6.1 Motivation
6.2 The Certicom Challenge
6.2.1 The Parallel Pollard Rho Attack
6.2.2 FPGA-based Attacks
6.3 The Ev1l Project: Design Target
6.3.1 The Iteration Function
6.4 Arithmetic and Complexity Analysis
6.4.1 Multiplication
6.4.2 Inversion
6.5 Architecture Exploration
6.5.1 Architecture I: Load-Store, Polynomial basis
6.5.2 Architecture II: Load-Store, Type-II Normal Basis
6.5.3 Architecture III: Fully Expanded, Type-II Polynomial Basis
6.6 Results and Comparison
6.6.1 Total Effort Estimation
6.6.2 Comparison
6.7 Effort Estimation for ECC2-131 and ECC2K-163
6.8 Conclusion

7 Conclusions
7.1 Conclusions
7.2 Future Work

A Secure ECC Implementation: A Survey on Attacks and Protections
A.1 Introduction
A.2 Typical Implementations
A.3 Passive Attacks
A.3.1 Simple Power Analysis
A.3.2 Template Attacks
A.3.3 Differential Power Analysis
A.3.4 Comparative Side-Channel Analysis
A.3.5 Refined Power Analysis
A.3.6 Zero-value Point Attack
A.3.7 Carry-based Attack
A.3.8 Address-bit DPA
A.4 Active Attacks
A.4.1 Safe-error Analysis
A.4.2 Weak Curve Based Analysis
A.4.3 Differential Fault Analysis
A.4.4 Point-at-Infinity Attack
A.4.5 Summary of Attacks
A.5 Countermeasures
A.5.1 SPA Countermeasures
A.5.2 DPA Countermeasures
A.5.3 FA Countermeasures
A.6 Discussions
A.6.1 On the Magic of Randomness
A.6.2 Countermeasure Selection
A.6.3 Implementation Issues

Bibliography
Curriculum
List of publications

List of Figures

1.1 Organization of the thesis.
1.2 Summary of the main ideas of Chapter 4-6 in the design space.

2.1 Symmetric-key cryptography.
2.2 Public-key cryptography.
2.3 PKC computations break-down.

3.1 Architecture of the multi-core platform.
3.2 Data dependency of FIOS Montgomery algorithm.
3.3 Instruction scheduling method-I: each iteration is performed with one core.
3.4 Instruction scheduling method-II: each iteration is performed with several cores.
3.5 Performance of 256-bit MMM on a multi-core system.
3.6 256-bit and 1024-bit MMM on a multi-core system with different configurations.
3.7 Top level block diagram of the platform.
3.8 Torus exponentiation, RSA and ECC on the same platform: program hierarchy.

4.1 Optimal ate pairing: computation hierarchy.
4.2 F_p multiplier using the HMMB algorithm.
4.3 Cox-Rower architecture for pairing computation using RNS Montgomery.

5.1 Conventional architecture for HECC: using multiple data-paths.
5.2 Proposed architecture for HECC: using UMI.
5.3 AND-XOR cell: a building block of multipliers.
5.4 LSB-first bit-serial modular multiplier.
5.5 Right-Shift bit-serial inverter.
5.6 Bit-serial Type-I UMI.
5.7 Digit-serial Type-I UMI with I/M≈4.
5.8 Block diagram of the Type-I HECC processor.
5.9 Area of the UMI and delay for DA, DD and SM.
5.10 The building block and architecture of Type-II UMI.
5.11 Block diagram of the Type-II HECC processor.

6.1 RIVYERA cluster system based on Xilinx Spartan-3 5000 FPGAs.
6.2 Interface between the host PC and each FPGA.
6.3 Dataflow graph of the iteration function.
6.4 Modular multiplier in GF(2^m) using Kwon's algorithm.
6.5 Shokrollahi multiplier.
6.6 Archi-I: ECC processor using polynomial basis.
6.7 Archi-II: ECC processor using normal basis.
6.8 Archi-III: pipelined processor using Shokrollahi multipliers.
6.9 Comparison: attacking ECC2K-130 on different platforms.
6.10 Effort estimation for ECC2K-130, ECC2-131 and ECC2K-163 on FPGAs of different generations.

A.1 A simplified model of an ECC processor.

List of Tables

3.1 Instructions supported by each core.
3.2 Number of data memory accesses caused by data transfers.
3.3 Performance comparison of modular multiplication on different platforms.
3.4 Number of clock cycles for different operations.
3.5 Performance comparison between CEILIDH, ECC and RSA on the same platform.

4.1 Complexity comparison of different modular multiplication algorithms.
4.2 Selection of z̄ for pairing friendly curves.
4.3 Multiplication complexity for each set of parameters.
4.4 Number of clock cycles required by different subroutines.
4.5 Performance comparison of software and hardware implementations of pairings.
4.6 Cycle count for one optimal pairing.

5.1 Modular operations required by divisor operations.
5.2 Previous HECC implementations on FPGA.
5.3 Unified Multiplier and Inverter: Type-I vs. Type-II.
5.4 Configurations and operations of Type-I UMI-I.
5.5 Performance comparison of FPGA-based HECC implementations in GF(2^m).
5.6 Performance comparison of HECC and ECC implementations targeting RFID tags.
5.7 Register allocation for divisor doubling.
5.8 Register allocation for divisor addition.

6.1 Certicom challenges and complexity estimation.
6.2 Size and throughput comparison of various architectures.
6.3 Technology nodes of different FPGA generations.

A.1 Physical attacks on ECC implementations.
A.2 Countermeasures and their overhead.
A.3 Attacks versus countermeasures.

Chapter 1

Introduction

In the last thirty years, the fast development of telecommunication technology and the vast expansion of the Internet have profoundly changed our daily lives. Today we have a global communication infrastructure that consists of distributed servers, Internet backbones and billions of terminal devices ranging from personal computers and smartphones to passive RFID tags. While this infrastructure offers great conveniences for communication, banking and many other services, it also creates huge security challenges, such as communication eavesdropping, bank card fraud, and exposure of privacy-sensitive information. It is thus of vital importance to deploy sufficient security mechanisms to protect this infrastructure.

Applied cryptography serves as the basis of almost all reported security mechanisms. The science of cryptology, known solely as a message encryption technique thousands of years ago, has become an independent discipline that generates mechanisms to offer confidentiality, data integrity, data origin authentication, entity authentication and non-repudiation. Modern cryptographic schemes can be broadly divided into two kinds: symmetric-key schemes and public-key schemes. A symmetric-key scheme requires a secret key shared between communicating entities, while public-key schemes have no such requirement. Public keys of the communicating parties should be distributed over an authenticated (but not necessarily private) channel. On the other hand, public-key schemes are usually orders of magnitude slower and more power hungry than symmetric-key schemes. As a result, public-key schemes and symmetric-key schemes are used together: the public-key scheme is used for key agreement, and the symmetric-key scheme is used for encrypting large amounts of data.
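The division of labor described above can be illustrated with a minimal sketch. It assumes a toy Diffie-Hellman exchange over a small prime field for the key agreement and a SHA-256 based keystream for the bulk encryption. Both choices, along with the names `P`, `G`, `dh_keypair`, `dh_shared_key` and `keystream_xor`, are illustrative assumptions that do not come from this thesis, and the parameters are far too small for real use.

```python
import hashlib
import secrets

# Toy parameters (assumptions for illustration only, NOT secure):
P = 2**127 - 1  # a Mersenne prime, far too small for real deployments
G = 3           # base element, chosen arbitrarily for this sketch


def dh_keypair():
    """Public-key part: generate a private exponent and the matching public value."""
    priv = secrets.randbelow(P - 2) + 1
    pub = pow(G, priv, P)
    return priv, pub


def dh_shared_key(priv, peer_pub):
    """Key agreement: both sides derive the same 32-byte symmetric key."""
    shared = pow(peer_pub, priv, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


def keystream_xor(key, data):
    """Symmetric part: XOR data with a hash-based keystream (toy stream cipher).

    Applying the function twice with the same key restores the input.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


# Usage: two expensive modular exponentiations per party set up the key,
# after which arbitrarily long data is handled by the cheap symmetric part.
alice_priv, alice_pub = dh_keypair()
bob_priv, bob_pub = dh_keypair()
key_alice = dh_shared_key(alice_priv, bob_pub)
key_bob = dh_shared_key(bob_priv, alice_pub)
ciphertext = keystream_xor(key_alice, b"a large amount of data " * 100)
plaintext = keystream_xor(key_bob, ciphertext)
```

The sketch omits authentication of the exchanged public values, which the text above identifies as a requirement on the distribution channel.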

The need for cryptographic implementations, hardware or software, varies from device to device. For instance, a server for on-line banking needs to handle thousands of secure connections simultaneously, and thus should be equipped with high-throughput implementations. In order to achieve high throughput, the implementations typically use large parallel processing elements. On the other hand, strong cryptography is also required on many ultra-constrained devices such as smart cards and RFID tags. In such applications, lightweight implementations are preferred. As such, the abundant variations of applications create a spectrum of implementation requirements in terms of area, power, performance and physical security.

A theoretical way to measure the quality of an algorithm is its computational complexity. In general, the complexity determines the area and delay of its implementation. However, computational complexity, which can be expressed by the number of operations required and the amount of memory needed, does not capture all properties of an algorithm. Other parameters, such as parallelism and locality of data, also reflect important properties of an algorithm. For high-speed implementations, we care more about how to parallelize the algorithm and how the performance scales when more data-paths are used. When designing PKC co-processors on constrained devices, the size of the data-path and registers (or memory blocks) has the highest priority. In many cases, arithmetic and architecture are optimized together to ensure an optimal solution.

Besides performance and cost, designers have another important criterion to meet: physical security. In 1996, Kocher [115] showed that a cryptographic system can be broken by monitoring and analyzing physical information such as timing and power consumption. In 1997, Boneh et al. [30] noticed that faults in a cryptographic implementation can also be exploited. In the last 10 years, physical attacks on crypto-systems have been extensively studied, and they have been getting increasingly powerful. As a result, designers have to take physical attacks into account from the very beginning of the design phase.

The study of cryptographic implementations thus has two major targets: higher efficiency (in terms of area and power) and less information leakage. These two targets are often contradictory since countermeasures against physical attacks lead to additional area and energy consumption. This thesis focuses on efficient arithmetic and architectures for PKC implementations in hardware. In Appendix A, we give a survey on physical attacks on ECC implementations and corresponding countermeasures.


Figure 1.1: Organization of the thesis.

1.1 Summary of the Thesis

This thesis studies efficient and secure design methods for the most widely used public-key cryptosystems. The organization of the thesis is illustrated in Figure 1.1. The main body of this thesis consists of two parts: efficient implementation (Chapters 3, 4 and 5), and security evaluation (Chapter 6). We describe the contribution of each chapter below.

• Chapter 2 We give a brief introduction to the mathematical background of Public Key Cryptography and the arithmetic in finite fields.

• Chapter 3 The focus of Chapter 3 is a parallel implementation of the Montgomery modular multiplication (MMM) in a general context. Parallelization moves the design from the upper-left corner to the lower-right corner in the design space (see the top left subfigure of Figure 1.2). Ideally, the parallelized design achieves the same delay-area product. However, this is often difficult in practice. In this chapter, we analyze the data dependencies inside the MMM and study efficient task partitioning methods. We also explore the scalability of the proposed methods. In order to show the flexibility of the platform, we implement RSA, ECC and Tori on it and compare the results.

Figure 1.2: Summary of the main ideas of Chapter 4-6 in the design space.

• Chapter 4 In order to reduce the delay-area product, we need to optimize the algorithm such that it has lower computational complexity (see the top right subfigure of Figure 1.2). In this chapter, we study a special type of moduli that are generated using a low-weight polynomial. Such moduli are widely used in pairings. We propose an adapted Montgomery algorithm that has a reduced complexity in hardware. This algorithm is used to design a fast pairing processor that achieves a 128-bit security level.

• Chapter 5 In order to put strong cryptography on constrained devices, the area of the cryptographic coprocessor should be minimized (see the bottom left subfigure of Figure 1.2). In this chapter, we focus on the implementation of HECC defined over binary fields. We reduce the area by unifying the multiplier and inverter, and we show that an HECC coprocessor that achieves 80-bit security with a reasonable performance can be made with less than 15 Kgates.

• Chapter 6 Cryptanalysis also benefits from fast implementation techniques. In this chapter, we explore the power of FPGAs in attacking 131-bit ECC. The design target is to execute as many iteration functions (see Chapter 6 for details) as possible using one Xilinx Spartan-3 FPGA (see the bottom right subfigure of Figure 1.2). This study is a part of a multi-party distributed project (breaking ECC2K-130). Various algorithms and architectures were compared to select the most efficient solution. The results also show that FPGAs are more cost-efficient than CPUs, GPUs and ASICs.
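As a concrete reference point for the Montgomery modular multiplication (MMM) that Chapter 3 parallelizes, the following minimal sketch shows the basic, non-interleaved form of the reduction on Python integers. It is not the FIOS variant or the multi-core scheduling studied in the thesis. The function names and the choice R = 2^k are assumptions made for illustration, and `pow(n, -1, r)` assumes Python 3.8 or later.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n and R = 2**k with R > n.

    Returns (R, n_prime) where n * n_prime ≡ -1 (mod R).
    """
    if n % 2 == 0 or (1 << k) <= n:
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def montgomery_multiply(a_bar, b_bar, n, k, n_prime):
    """Return a_bar * b_bar * R^(-1) mod n for 0 <= a_bar, b_bar < n."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask  # m = (t mod R) * n' mod R
    u = (t + m * n) >> k               # t + m*n is divisible by R
    return u - n if u >= n else u      # single conditional subtraction


def to_montgomery(x, n, k):
    """Map x to its Montgomery representation x * R mod n."""
    return (x << k) % n


def from_montgomery(x_bar, n, k, n_prime):
    """Map x_bar back to the ordinary representation x_bar * R^(-1) mod n."""
    return montgomery_multiply(x_bar, 1, n, k, n_prime)


# Usage with an illustrative 255-bit odd modulus and R = 2**256.
n = (1 << 255) - 19
k = 256
r, n_prime = montgomery_setup(n, k)
a, b = 1234567890123456789, 9876543210987654321
a_bar, b_bar = to_montgomery(a, n, k), to_montgomery(b, n, k)
product = from_montgomery(montgomery_multiply(a_bar, b_bar, n, k, n_prime), n, k, n_prime)
```

The division by R is a shift and the reduction modulo R is a mask, which is why the method avoids trial division by n. Hardware and multi-word software versions process the operands word by word, and it is the dependencies between those word-level steps that the chapter analyzes.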

Although the main focus of this thesis is on efficient implementations for PKC, we also have results on secure implementations. These are discussed in Appendix A and references [64, 85, 108].

• Appendix A With new tampering methods and new attacks being continuously proposed and improved, designing a secure cryptosystem becomes increasingly difficult. While the adversary only needs to succeed in one out of many attack methods, the designers have to prevent all the applicable attacks simultaneously. In Appendix A we give a comprehensive survey of known passive and active attacks on the Elliptic Curve Scalar Multiplication (ECSM).

Chapter 2

Public Key Cryptography: Mathematical Background

• 2.1 Introduction

• 2.2 Public Key Cryptography

• 2.3 F_p arithmetic

• 2.4 F_{2^m} arithmetic

• 2.5 Conclusion

2.1 Introduction

Cryptography is the technology to enable secure communication over an insecure channel. Cryptographic schemes can be broadly divided into two categories: symmetric-key schemes and public-key schemes. Figure 2.1 and Figure 2.2 show the basic communication model for symmetric-key schemes and public-key schemes, respectively. Here A (Alice) and B (Bob) are the communicating entities and E (Eve) is the adversary.

In symmetric-key schemes, the communicating entities A and B need to agree upon a shared key through a secret and authentic channel. Subsequently, they may use a symmetric-key encryption scheme such as the Advanced Encryption Standard (AES) to encrypt the plaintext. Symmetric-key schemes are typically very efficient and can be used to encrypt large amounts of data (e.g. a full disk) or real-time bitstreams (e.g. a video conference). On the other hand, they have several significant drawbacks. One primary problem is known as the key distribution problem, i.e. a secret and authenticated channel is needed to distribute the key. A second problem is known as the key management problem, i.e. in a network of N entities, each entity may have to maintain a different shared key with each of the other N-1 entities. While an on-line trusted third party that distributes keying material as required can be included to avoid the need for secure storage of multiple keys by each entity, such solutions are not possible in all scenarios.

In contrast to symmetric-key schemes, public-key schemes require the key distributing channel to be authentic but not necessarily secret. Each entity generates a single key pair: a public key and a related private key. The public key of each entity can be obtained from an authenticated channel, and the private key is kept secret. Public-key schemes can thus be used for encryption, data authentication, entity authentication and key agreement over an unsecured channel.

Figure 2.1: Symmetric-key cryptography. Figure 2.2: Public-key cryptography.

The rest of the chapter is organized as follows. Section 2.2 gives a short introduction to the mathematical background of PKC. In Section 2.3 and Section 2.4, we describe the known algorithms for computation in Fp and F2m, respectively.


2.2 Public Key Cryptography

The concept of public key cryptography was first introduced by Diffie and Hellman in 1976 [56]¹. Since then, various public-key schemes have been proposed. The most commonly used PKC schemes today are RSA [146] and ECC [113, 132]. Other schemes include torus-based cryptography [147], hyperelliptic curve cryptography [114] and, more recently, pairing-based cryptography [102, 27]. We give a brief introduction to these schemes in this chapter. For descriptions of other PKC schemes such as DH and ElGamal, we refer to the Handbook of Applied Cryptography [128].

Most of the PKC schemes are defined over finite groups or finite fields. Throughout this thesis we assume the notations below:

• $K$: a finite field ($\mathbb{F}_p$ for a prime field and $\mathbb{F}_{2^m}$ for a binary field);

• $\mathrm{char}(K)$: the characteristic of $K$;

• $\bar{K}$: an algebraic closure of $K$;

• $\mathbb{F}_{p^k}$: an extension field of $\mathbb{F}_p$;

• $E(a_1, a_2, a_3, a_4, a_6)$: an elliptic curve with coefficients $a_1, a_2, a_3, a_4, a_6$;

• $E(K)$: the group formed by the points on an elliptic curve $E$ defined over $K$;

• $\#E$: the number of points on curve $E$, i.e. the order of $E$;

• $G$: a finite group;

• $\varphi(n)$: Euler's totient function;

• $\Phi(n)$: the $n$-th cyclotomic polynomial.

2.2.1 RSA

RSA was proposed by Rivest, Shamir and Adleman in 1978. The security of RSA is based on the integer factorization problem. The public key consists of a pair of integers $(n, e)$, where $n$ is the RSA modulus and $e$ is the encryption exponent. The modulus $n$ is the product of two randomly selected, secret primes $p$ and $q$. The encryption exponent $e$ is an integer satisfying $1 < e < \varphi(n)$ and $\gcd(e, \varphi(n)) = 1$, where $\varphi(n) = (p-1)(q-1)$. The private key $d$, also called the decryption exponent, is the integer satisfying $1 < d < \varphi(n)$ and $ed \equiv 1 \pmod{\varphi(n)}$. Determining the private key $d$ from $(n, e)$ is computationally equivalent to factoring $n$ [128]. It is worth noting that computing the $e$-th root modulo $n$ (to decrypt a ciphertext) does not necessarily require knowledge of $d$.

¹ It is worth noting here that controversies on who invented public key cryptography still exist. In 1997, it was publicly disclosed that James H. Ellis, Clifford Cocks, and Malcolm Williamson at the Government Communications Headquarters (GCHQ) in the UK had secretly developed asymmetric key algorithms in 1973.

Algorithm 1 Basic RSA encryption.
Input: RSA public key $(n, e)$, plaintext $m \in [0, n-1]$.
Output: Ciphertext $c$.
1: compute $c \leftarrow m^e \bmod n$.
Return $c$.

Algorithm 2 Basic RSA decryption.
Input: RSA public key $(n, e)$, private key $d$, ciphertext $c$.
Output: Plaintext $m$.
1: compute $m \leftarrow c^d \bmod n$.
Return $m$.

RSA is widely used for data encryption and digital signatures. We give the basic version of the algorithms for encryption and decryption here. Note that this basic RSA encryption algorithm is not secure [28]. For a detailed description of RSA-based cryptographic schemes, we refer to the Handbook of Applied Cryptography [7].
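As an illustration of Algorithms 1 and 2, the following minimal Python sketch runs textbook RSA on toy parameters. The function names and the primes 61 and 53 are assumptions chosen for readability; they do not come from the thesis, and the code has none of the padding that a secure scheme requires.

```python
from math import gcd

def rsa_keygen(p: int, q: int, e: int):
    """Derive a textbook RSA key pair from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)              # phi(n) = (p - 1)(q - 1)
    if not (1 < e < phi and gcd(e, phi) == 1):
        raise ValueError("need 1 < e < phi(n) and gcd(e, phi(n)) = 1")
    d = pow(e, -1, phi)                  # e*d = 1 (mod phi(n)); Python 3.8+
    return (n, e), d

def rsa_encrypt(public_key, m: int) -> int:
    """Algorithm 1: c = m^e mod n."""
    n, e = public_key
    return pow(m, e, n)

def rsa_decrypt(public_key, d: int, c: int) -> int:
    """Algorithm 2: m = c^d mod n."""
    n, _ = public_key
    return pow(c, d, n)

# Toy parameters (hypothetical, far too small for real use).
public_key, d = rsa_keygen(61, 53, 17)
ciphertext = rsa_encrypt(public_key, 65)
recovered = rsa_decrypt(public_key, d, ciphertext)
```

Both operations reduce to a single modular exponentiation, which is why the modular multiplier discussed later in this chapter dominates the cost of RSA.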

2.2.2 Torus-based Cryptosystem

A torus-based cryptosystem is an alternative PKC. Torus-based cryptography uses an algebraic torus to construct a group in which the discrete logarithm problem is defined. This idea was first introduced by Rubin and Silverberg in 2003 [147], who proposed the name CEILIDH. CEILIDH is defined using the torus $T_6(\mathbb{F}_p)$ [147], which is a subgroup of the multiplicative group $\mathbb{F}_{p^6}^{\times}$. However, every element of $T_6$ can be represented by two elements of $\mathbb{F}_p$ using the birational maps between $T_6$ and $\mathbb{F}_p^2$. As a result, CEILIDH obtains security equivalent to that of $\mathbb{F}_{p^6}$, while the data to be transmitted are compressed by a factor of 3. Besides, the underlying arithmetic is performed in a subgroup instead of $\mathbb{F}_{p^6}$, which allows for faster computation. In embedded devices, savings in computation and data transfer enable valuable optimizations of energy consumption.


With a rational parameterization, a torus can be used in any discrete-log-based cryptosystem. Consider a torus $T_n$ defined over $\mathbb{F}_p$. Let $\rho : T_n(\mathbb{F}_p) \rightarrow \mathbb{F}_p^{\varphi(n)}$ be the map from $T_n$ to $\mathbb{F}_p^{\varphi(n)}$ and $\psi$ its inverse. The algorithm below shows the torus-based version of the Diffie-Hellman key exchange protocol [147].

Algorithm 3 Torus-based Diffie-Hellman Key Exchange Protocol [147].
Input: Public parameters: $T_n(\mathbb{F}_p)$, $\alpha \in T_n(\mathbb{F}_p)$ of order $l$.
Output: Shared key $K$.
1: Alice selects $a \in_R [1, l-1]$, and sends $P_A := \rho(\alpha^a) \in \mathbb{F}_p^{\varphi(n)}$ to Bob;
   Bob selects $b \in_R [1, l-1]$, and sends $P_B := \rho(\alpha^b) \in \mathbb{F}_p^{\varphi(n)}$ to Alice.
2: Alice computes $K = \rho(\psi(P_B)^a) \in \mathbb{F}_p^{\varphi(n)}$;
   Bob computes $K = \rho(\psi(P_A)^b) \in \mathbb{F}_p^{\varphi(n)}$.
Return $K$.

The main computation of CEILIDH is exponentiation in a subgroup of $\mathbb{F}_{p^6}$. Let $p \equiv 2 \bmod 9$ (or $p \equiv 5 \bmod 9$); then $f(x) = \frac{x^9 - 1}{x^3 - 1} = x^6 + x^3 + 1$ is an irreducible polynomial with root $z$. Then $z^6 = -z^3 - 1$ and each element of $\mathbb{F}_{p^6}$ can be represented in the basis $\{1, z, z^2, z^3, z^4, z^5\}$. Hence, an arbitrary element of this group is denoted as $A(z) = \sum_{i=0}^{5} a_i z^i$. We denote multiplications/squarings and additions/subtractions in $\mathbb{F}_p$ with M and A, respectively. An addition in $\mathbb{F}_{p^6}$ requires 6 additions in $\mathbb{F}_p$, while a multiplication in $\mathbb{F}_{p^6}$ can be performed with 18M + 60A [80].

2.2.3 Elliptic Curve Cryptography

ECC, independently invented by Koblitz [113] and Miller [132], is the main alternative PKC to RSA. ECC is typically faster than RSA for an equivalent security level [63, 148], and it is preferred in embedded devices due to its smaller operand size.

An elliptic curve E over a field K is defined by a so-called Weierstrass equation:

$$E : y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6, \tag{2.1}$$

where $a_1, a_2, a_3, a_4, a_6 \in K$ and $\Delta \neq 0$. Here $\Delta$ is the discriminant of $E$. A Weierstrass equation can be simplified by applying a change of coordinates. If $\mathrm{char}(K)$ is not equal to 2 or 3, then $E$ can be transformed to

$$y^2 = x^3 + ax + b, \tag{2.2}$$

where $a, b \in K$. If $\mathrm{char}(K) = 2$, then $E$ can be transformed to

$$y^2 + xy = x^3 + ax^2 + b \tag{2.3}$$

if $E$ is non-supersingular. For the discussion of ECC and related algorithms in this thesis, we always use $P(x, y)$ to denote a point with coordinates $(x, y)$, and we use $E(K)$ to denote the group formed by the points on an elliptic curve $E$ defined over the finite field $K$.

For cryptographic use, we are only interested in elliptic curves over a finite field. Elliptic curves defined over both prime fields and binary extension fields are used in cryptography. Given two points, $P(x_1, y_1)$ and $Q(x_2, y_2)$, the sum of $P$ and $Q$ is again a point on the same curve under the addition rule. For example, given two points $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$ on an elliptic curve $E$ defined over $\mathbb{F}_{2^m}$, one can compute $P_3(x_3, y_3) = P_1 + P_2$ as follows:

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \tag{2.4}$$

where

$$\lambda = \begin{cases} \dfrac{y_1 + y_2}{x_1 + x_2} & \text{if } P_1 \neq P_2, \\[2mm] x_1 + \dfrac{y_1}{x_1} & \text{otherwise.} \end{cases}$$

The set of points $(x, y)$ on $E$ together with the point at infinity forms an Abelian group. Given the base point $P \in E(K)$ and a scalar $k$, the computation $k \cdot P$ is called point multiplication or scalar multiplication. It is the main computation of many EC-based cryptosystems such as key agreement and signature algorithms. Algorithm 4 shows the Left-To-Right binary method for scalar multiplication.

Algorithm 4 Left-To-Right (downwards) binary method for point multiplication.
Input: $P \in E(K)$ and integer $k = \sum_{i=0}^{l-1} k_i 2^i$.
Output: $k \cdot P$.
1: $R \leftarrow O$.
2: for $i = l-1$ downto 0 do
3:   $R \leftarrow 2R$;
4:   If $k_i = 1$ then $R \leftarrow R + P$.
5: end for
Return $R$.
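The left-to-right binary method can be sketched in a few lines of Python. For brevity, this sketch assumes a toy curve of the short Weierstrass form (2.2) over a small prime field rather than a binary curve of the form (2.3); the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the base point $(5, 1)$ and all function names are illustrative assumptions, not parameters from the thesis.

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative assumption).
P_MOD, A = 17, 2
INF = None  # the point at infinity O

def ec_add(P, Q):
    """Affine point addition/doubling on y^2 = x^3 + A*x + b over F_p."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                        # P + (-P) = O
    if P == Q:                            # doubling: tangent slope
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                 # addition: chord slope
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k: int, P):
    """Algorithm 4: scan the bits of k from the most significant one down."""
    R = INF
    for i in range(k.bit_length() - 1, -1, -1):
        R = ec_add(R, R)                  # R <- 2R
        if (k >> i) & 1:
            R = ec_add(R, P)              # R <- R + P
    return R

G = (5, 1)                                # base point on the toy curve
double_G = scalar_mult(2, G)
```

Note that the addition in step 4 is executed only for non-zero key bits, so this basic method leaks information about $k$ through its operation sequence; this is one of the attack surfaces surveyed in Appendix A.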

The security of ECC is based on the hardness of the so-called Elliptic Curve Discrete Logarithm Problem (ECDLP), namely, computing $k$ for two given points $P$ and $Q$ such that $Q = k \cdot P$. The variable $k$ is called the scalar, which in most cases corresponds to the secret key.

As an example of ECC-based protocols, we give in Algorithm 5 the Elliptic Curve Digital Signature Algorithm (ECDSA). Here $H$ denotes a cryptographic hash function whose outputs are smaller than $n$. We refer to [88] for other ECC-based protocols.

Algorithm 5 ECDSA signature generation.
Input: Domain parameters $D = (q, FR, S, a, b, P, n, h)$, private key $d$, message $m$.
Output: Signature $(r, s)$.
1: Select $k \in_R [1, n-1]$.
2: Compute $kP = (x_1, y_1)$ and convert $x_1$ to an integer $\bar{x}_1$.
3: Compute $r = \bar{x}_1 \bmod n$. If $r = 0$ then go to step 1.
4: Compute $e = H(m)$.
5: Compute $s = k^{-1}(e + dr) \bmod n$. If $s = 0$ then go to step 1.
Return $(r, s)$.

2.2.4 Hyperelliptic Curve Cryptography

Hyperelliptic curves are a special class of algebraic curves; they can be viewed as a generalization of elliptic curves. Namely, a hyperelliptic curve of genus $g = 1$ is an elliptic curve, while in general, hyperelliptic curves can be of any genus $g \geq 1$. The use of hyperelliptic curves to define finite abelian groups for DLP-based cryptosystems was first proposed by Koblitz [114] in 1989. However, only genus 1 (i.e. EC) and genus 2 curves are used for cryptography.

Here we consider a hyperelliptic curve $C$ of genus $g = 2$ over $K$, which is given by an equation of the form:

$$C : y^2 + h(x)y = f(x) \quad \text{in } K[x, y], \tag{2.5}$$

where $h(x) \in K[x]$ is a polynomial of degree at most $g$ ($\deg(h) \leq g$) and $f(x)$ is a monic polynomial of degree $2g + 1$ ($\deg(f) = 2g + 1$). Also, there are no solutions $(x, y) \in \bar{K} \times \bar{K}$ which simultaneously satisfy the equation (2.5) and the equations $h(x) = 0$, $h'(x)y + f'(x) = 0$. For genus 2, the following equation is used in general: $y^2 + (h_2 x^2 + h_1 x + h_0)y = x^5 + f_4 x^4 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.

A divisor $D$ is a formal sum of points on the hyperelliptic curve $C$, i.e. $D = \sum m_P P$, where $P$ is a point on $C$, $m_P$ is an integer and $m_P = 0$ for almost all $P$. The degree of $D$ is defined as $\deg D = \sum m_P$. Let $\mathrm{Div}$ denote the group of all divisors on $C$ and $\mathrm{Div}^0$ the subgroup of $\mathrm{Div}$ of all divisors with degree zero. The Jacobian $J$ of the curve $C$ is defined as the quotient group $J = \mathrm{Div}^0/R$, where $R$ is the set of all principal divisors. A divisor $D$ is called principal if $D = \mathrm{div}(f)$ for some element $f$ of the function field of $C$ ($\mathrm{div}(f) = \sum_{P \in C} \mathrm{ord}_P(f) P$). The discrete logarithm problem in the Jacobian is the basis of security for HECC. In practice, the Mumford representation is commonly used. Each divisor is represented as a pair of polynomials $[u, v]$, where $u$ is monic and $[u, v]$ satisfy $\deg(u) \leq 2$, $\deg(v) < \deg(u)$ and $u \mid f - hv - v^2$ (so-called reduced divisors).

2.2.5 Pairing-based Cryptography

Bilinear pairings on elliptic curves were introduced in cryptography in the mid-1990s for cryptanalysis [72, 127]. In 2000, Joux introduced the first constructive use of pairings with a tripartite key exchange protocol [102]. In the last decade many pairing-based schemes such as identity-based encryption [27], identity-based signatures [41] and short signatures [29] have been proposed and studied.

A bilinear pairing is a map $G_1 \times G_2 \rightarrow G_T$, where $G_1$ and $G_2$ are typically additive groups, $G_T$ is a multiplicative group and the map is linear in each component. Many pairings used in cryptography, such as the Tate pairing [12], ate pairing [93], R-ate pairing [121] and optimal pairings [162, 92], choose $G_1$ and $G_2$ to be specific cyclic subgroups of $E(\mathbb{F}_{p^k})$, and $G_T$ to be a subgroup of $\mathbb{F}_{p^k}^{*}$.

Let $K$ be a finite field $\mathbb{F}_p$. Let $r$ be a large prime dividing the order of the curve, denoted by $\#E(K)$, and $k$ the embedding degree of $E(K)$ with respect to $r$, namely, the smallest positive integer $k$ such that $r \mid p^k - 1$. For any finite extension field $\bar{K}$ of $K$, denote with $E(\bar{K})[r]$ the $\bar{K}$-rational $r$-torsion group of the curve. For $P \in E(\bar{K})$ and an integer $s$, let $f_{s,P}$ be a $\bar{K}$-rational function with divisor

$$(f_{s,P}) = s(P) - ([s]P) - (s-1)(O),$$

where $O$ is the point at infinity. This function is also known as a Miller function [133, 134].

• Tate pairing Let $G_1 = E(\mathbb{F}_p)[r]$, $G_2 = E(\mathbb{F}_{p^k})/rE(\mathbb{F}_{p^k})$ and $G_T = \mu_r \subset \mathbb{F}_{p^k}^{*}$ (the $r$-th roots of unity); then the reduced Tate pairing [12] is a well-defined, non-degenerate, bilinear pairing. Let $P \in G_1$ and $Q \in G_2$; then the reduced Tate pairing of $P, Q$ is computed as

$$e(P, Q) = \left(f_{r,P}(Q)\right)^{(p^k - 1)/r}.$$


• Ate pairing The ate pairing [93] is similar but with different $G_1$ and $G_2$. Here we define $G_1 = E(\mathbb{F}_p)[r]$ and $G_2 = E(\mathbb{F}_{p^k})[r] \cap \mathrm{Ker}(\pi_p - [p])$, where $\pi_p$ is the $p$-th power Frobenius endomorphism, i.e. $\pi_p : E \rightarrow E : (x, y) \mapsto (x^p, y^p)$, and $\mathrm{Ker}()$ returns the kernel of the function. Let $P \in G_1$, $Q \in G_2$ and let $t := p + 1 - \#E(\mathbb{F}_p)$ be the trace of Frobenius; then the ate pairing is also a well-defined, non-degenerate bilinear pairing, and can be computed as

$$a(Q, P) = \left(f_{t-1,Q}(P)\right)^{(p^k - 1)/r}.$$

• Optimal ate pairing The R-ate pairing is a generalization of the ate pairing and can be seen as an instantiation of optimal pairings. Since the definition of the optimal ate pairing depends on the particular elliptic curve one is using, we only provide the definition in the case of Barreto-Naehrig (BN) curves [144]: using the same $G_1$ and $G_2$ as for the ate pairing, the optimal ate pairing on BN curves is defined as

$$R_a(Q, P) = \left(f \cdot \left(f \cdot l_{aQ,Q}(P)\right)^p \cdot l_{\pi(aQ+Q),aQ}(P)\right)^{(p^k - 1)/r},$$

where $a = 6\bar{z} + 2$, $f = f_{a,Q}(P)$ and $l_{A,B}$ denotes the line through points $A$ and $B$.

This function, $f_{s,P}(Q)$, is constructed in stages by using the double-and-add method [134]. It is given by Algorithm 6, where $l_{(A,B)}$ is the equation of the line arising in the addition of the points $A$ and $B$ and $v_A$ is the equation of the vertical line passing through $A$.

Algorithm 6 Miller algorithm on $E(\mathbb{F}_{p^k})$.
Input: $P, Q \in E(\mathbb{F}_{p^k})$, $r = \sum_{i=0}^{s-1} r_i 2^i$.
Output: $f_{(r,Q)}(P) \in \mathbb{F}_{p^k}$.
1: $T \leftarrow Q$, $f \leftarrow 1$.
2: for $i = s-2$ downto 0 do
3:   $f \leftarrow f^2 \cdot \dfrac{l_{(T,T)}(P)}{v_{2T}(P)}$, $T \leftarrow 2T$.
4:   if $r_i = 1$ then
5:     $f \leftarrow f \cdot \dfrac{l_{(T,Q)}(P)}{v_{T+Q}(P)}$, $T \leftarrow T + Q$.
6:   end if
7: end for
Return $f$.


As an example of pairing-based cryptography, Algorithm 7 shows the three-party one-round key agreement protocol of Joux [102].

Algorithm 7 Three-party one-round key agreement.
Input: Bilinear pairing $e$, and domain parameters $(q, FR, S, a, b, P, n, h)$.
Output: Shared key $K$.
1: Alice selects $a \in_R [1, n-1]$, broadcasts $aP$;
   Bob selects $b \in_R [1, n-1]$, broadcasts $bP$;
   Chris selects $c \in_R [1, n-1]$, broadcasts $cP$.
2: Alice computes $K = e(bP, cP)^a$;
   Bob computes $K = e(aP, cP)^b$;
   Chris computes $K = e(aP, bP)^c$.
Return $K$.

2.2.6 PKC Break-down

The PKC primitives described above use large integers or complex algebraic structures. As a result, a naive implementation can be very slow in software or very large in hardware. Especially on embedded systems, which have limited computing power, the implementation of PKC has been a big challenge. For this reason, intensive studies on how to efficiently realize PKC in software and hardware have been carried out [38, 14, 124, 120, 66, 81].

Computations in PKC can be broken down into operations in the underlying groups or fields. Figure 2.3 shows the composition of computations of different PKC.

• DH, ElGamal, RSA The main computation of DH, ElGamal and RSA is modular exponentiation of a large integer (1024 bits or above). Modular exponentiation is performed with a sequence of modular multiplications.

• ECC, HECC The main computation for ECC and HECC is scalar point multiplication and scalar divisor multiplication, respectively. The scalar point/divisor multiplication is the analogue of the exponentiation in multiplicative groups, and is performed with a sequence of point/divisor additions or doublings. The addition and doubling can be further broken down into operations in the underlying field.

• Pairing Pairing computation involves both point addition on an elliptic curve and operations in extension fields. Both of them can be broken down into operations in the underlying base field.

• Tori The computation of the torus involves an exponentiation in an extension field, which can be broken down into operations in the base field.

Figure 2.3: PKC computations break-down.

As shown in Figure 2.3, operations in the base field are the basic operations for all the aforementioned PKC; hence an efficient arithmetic unit for the base field operations is the most important building block of a high speed PKC implementation. We discuss in Section 2.3 and Section 2.4 efficient arithmetic in Fp and F2m, respectively.

2.3 Fp Arithmetic

In this section, we describe efficient arithmetic in prime fields. In particular, we describe three different representations of elements in Fp and popular modular multiplication algorithms using such representations.


2.3.1 Representations

Positional Number System

In the Positional Number System (PNS), an integer $X$ is uniquely determined by the radix, $d$, and a vector $\{x_{s-1}, x_{s-2}, \ldots, x_0\}$. The value of $X$ is defined as $X = \sum_{i=0}^{s-1} x_i d^i$. Each element of the vector, $x_i$, $i \in [0, s-1]$, is called a digit and satisfies $0 \leq x_i < d$. In the digital world, $d$ is normally chosen to be $2^w$, where $w$ is the word size.

Residue Number System

A Residue Number System (RNS) represents a large integer using a set of smaller integers. Let $\mathcal{B} = \{b_1, b_2, \ldots, b_n\}$ be a set of pairwise co-prime integers, and $M_{\mathcal{B}} = \prod_{i=1}^{n} b_i$. For any integer $X$, $0 \leq X < M_{\mathcal{B}}$, there is a unique RNS representation on $\mathcal{B}$: $\{X\}_{\mathcal{B}} = \{x_1, x_2, \ldots, x_n\}$, where $x_i = X \bmod b_i$, $1 \leq i \leq n$. Let $|a|_b$ denote $a \bmod b$. Given $\{X\}_{\mathcal{B}}$, one can recover $X$ using the Chinese Remainder Theorem (CRT):

$$X = \left| \sum_{i=1}^{n} \left| x_i \cdot B_i^{-1} \right|_{b_i} \cdot B_i \right|_{M_{\mathcal{B}}}, \quad \text{where } B_i = \frac{M_{\mathcal{B}}}{b_i}. \tag{2.6}$$

The set $\mathcal{B}$ is also known as a base, and each element $b_i$, $1 \leq i \leq n$, is called an RNS modulus or an RNS channel.

RNS representation enables efficient parallel computations. Consider two integers $X, Y$ and their RNS representations $\{X\}_{\mathcal{B}} = \{x_1, x_2, \ldots, x_n\}$ and $\{Y\}_{\mathcal{B}} = \{y_1, y_2, \ldots, y_n\}$; then we have

$$\{|X \odot Y|_{M_{\mathcal{B}}}\}_{\mathcal{B}} = \{|x_1 \odot y_1|_{b_1}, \ldots, |x_n \odot y_n|_{b_n}\}, \quad \odot \in \{+, -, \times, /\}. \tag{2.7}$$

Note that the division is available only if $Y$ is co-prime with $M_{\mathcal{B}}$. For all these operations, computations between $x_i$ and $y_i$ have no dependency on other channels, which makes RNS naturally suitable for parallel implementations.
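The following Python sketch illustrates Equations (2.6) and (2.7) on a small base: it converts two integers to RNS, multiplies them channel by channel, and recovers the product with the CRT. The base $\{13, 15, 16, 17\}$, the operands and the function names are assumptions chosen for illustration only.

```python
from math import prod

def to_rns(x: int, base):
    """Forward conversion: x_i = X mod b_i for every channel."""
    return [x % b for b in base]

def rns_mul(xs, ys, base):
    """Channel-wise multiplication (Equation (2.7) with the x operator)."""
    return [(x * y) % b for x, y, b in zip(xs, ys, base)]

def from_rns(residues, base) -> int:
    """CRT reconstruction following Equation (2.6)."""
    m_b = prod(base)
    total = 0
    for x_i, b_i in zip(residues, base):
        big_b_i = m_b // b_i                               # B_i = M_B / b_i
        total += ((x_i * pow(big_b_i, -1, b_i)) % b_i) * big_b_i
    return total % m_b

base = [13, 15, 16, 17]        # pairwise co-prime, M_B = 53040 (toy values)
X, Y = 123, 321
product = from_rns(rns_mul(to_rns(X, base), to_rns(Y, base), base), base)
```

Each channel works on operands no larger than its own modulus, so the four multiplications above could run on four independent small multipliers. The result is only correct as an integer as long as the true product stays below $M_{\mathcal{B}}$.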

Polynomial representation

An integer can also be represented as a polynomial, or more precisely, an evaluation of a polynomial. For example, given an integer $X$ and a polynomial $f(u) = au^2 - bu + 1$, it is always possible to find a 3-tuple $(a_X, b_X, u)$ such that $X = a_X u^2 - b_X u + 1$. A polynomial representation enables optimizations in modular reductions when the moduli are of low-weight polynomial form [47, 67].


2.3.2 Multiplication

A modular multiplication computes $ab \bmod p$, where $p$ is the modulus. It can be broken down into two steps: integer multiplication, $c = ab$, and integer reduction, $c \bmod p$.

Barrett reduction The Barrett reduction algorithm [13] uses a precomputed value $\mu = \lfloor d^{2n}/p \rfloor$, where $d$ is the radix and $n$ the number of digits of $p$, to help estimate the quotient $c/p$; thus integer division is avoided. Dhem [55] proposed an improved Barrett modular multiplication algorithm which has a simplified final correction.

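Algorithm 8 Modular multiplication using Barrett reduction [13].
Input: $a = (a_{n-1}, \ldots, a_0)_d$, $b = (b_{n-1}, \ldots, b_0)_d$, $p = (p_{n-1}, \ldots, p_0)_d$, $0 \leq a, b < p$, $2^{(n-1)w} \leq p < 2^{nw}$, with radix $d = 2^w$. Precompute $\mu = \lfloor d^{2n}/p \rfloor$.
Output: $c = ab \bmod p$.
1: $c \leftarrow ab$.
2: $\hat{q} \leftarrow \lfloor \lfloor c/d^{n-1} \rfloor \mu / d^{n+1} \rfloor$.
3: $r_1 \leftarrow c \bmod d^{n+1}$, $r_2 \leftarrow (\hat{q}p) \bmod d^{n+1}$ and $r \leftarrow r_1 - r_2$.
4: if $r < 0$ then
5:   $r \leftarrow r + d^{n+1}$.
6: end if
7: while $r \geq p$ do
8:   $r \leftarrow r - p$.
9: end while
Return $r$.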
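A minimal Python sketch of Algorithm 8 is given below. It assumes radix $d = 2^w$, so that divisions and reductions by powers of $d$ become shifts and masks; the function names and the test modulus $2^{127} - 1$ are illustrative assumptions.

```python
def barrett_setup(p: int, w: int):
    """Return (n, mu): n = number of radix-2^w digits of p, mu = floor(d^(2n)/p)."""
    n = -(-p.bit_length() // w)           # ceil(bitlen / w), so d^(n-1) <= p < d^n
    mu = (1 << (2 * n * w)) // p          # the only true division, done once per modulus
    return n, mu

def barrett_mulmod(a: int, b: int, p: int, w: int, n: int, mu: int) -> int:
    """Compute a*b mod p for 0 <= a, b < p following Algorithm 8."""
    c = a * b                                             # step 1
    q_hat = ((c >> (w * (n - 1))) * mu) >> (w * (n + 1))  # step 2: quotient estimate
    mask = (1 << (w * (n + 1))) - 1                       # reduction modulo d^(n+1)
    r = (c & mask) - ((q_hat * p) & mask)                 # step 3
    if r < 0:                                             # steps 4-6
        r += 1 << (w * (n + 1))
    while r >= p:                                         # steps 7-9
        r -= p
    return r

p = (1 << 127) - 1                        # example modulus (assumption)
w = 16
n, mu = barrett_setup(p, w)
result = barrett_mulmod(p - 1, p - 2, p, w, n, mu)
```

Because $\hat{q}$ underestimates the true quotient by a small bounded amount, the correction loop in steps 7-9 runs only a few times.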

Algorithm 9 Modular multiplication using Montgomery reduction [135].
Input: $a, b, p$, $0 \leq a, b < p$, $R = 2^n$, $R > p$. Precompute $p' = -p^{-1} \bmod R$.
Output: $c = abR^{-1} \bmod p$.
1: $c \leftarrow ab$.
2: $\mu \leftarrow c \bmod R$.
3: $q \leftarrow \mu p' \bmod R$.
4: $c \leftarrow (c + qp)/R$.
5: if $c \geq p$ then
6:   $c \leftarrow c - p$.
7: end if
Return $c$.

Montgomery reduction The Montgomery reduction method [135] precomputes $p' = -p^{-1} \bmod R$, where $R$ is normally a power of two. Given $c$ and $p$, it generates $q$ such that $c + qp$ is a multiple of $R$. As a result, the division of $(c + qp)$ by $R$ is exact and can be performed by a shift operation. An integer $Z$ is represented as $Z_M \leftarrow Z \cdot R \bmod M$, where $M$ is the modulus and $R = 2^r$ is a radix that is coprime to $M$. This representation is called the Montgomery residue. Let $\mathrm{Mont}(a, b)$ be the function of Algorithm 9; then we can convert $Z$ to its representation in the Montgomery domain with $Z_M \leftarrow \mathrm{Mont}(Z, R^2)$, with $R^2$ reduced modulo $M$, and convert it back with $Z \leftarrow \mathrm{Mont}(Z_M, 1)$. The conditional final subtraction can be avoided if a suitable $R$ is selected [164].
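The round trip through the Montgomery domain can be sketched in Python as follows. The sketch follows Algorithm 9 with $R = 2^n$; the function names, the modulus $2^{61} - 1$ and the choice $n = 64$ are assumptions made for illustration.

```python
def mont_setup(p: int, n: int):
    """Precompute p' = -p^(-1) mod R and R^2 mod p for R = 2^n (p must be odd)."""
    R = 1 << n
    p_prime = (-pow(p, -1, R)) % R        # Python 3.8+ modular inverse
    r2 = (R * R) % p                      # used to enter the Montgomery domain
    return p_prime, r2

def mont(a: int, b: int, p: int, n: int, p_prime: int) -> int:
    """Algorithm 9: return a*b*R^(-1) mod p for 0 <= a, b < p."""
    r_mask = (1 << n) - 1
    c = a * b                             # step 1
    mu = c & r_mask                       # step 2: c mod R
    q = (mu * p_prime) & r_mask           # step 3
    c = (c + q * p) >> n                  # step 4: exact division by R
    if c >= p:                            # steps 5-7
        c -= p
    return c

p, n = (1 << 61) - 1, 64                  # example modulus and R = 2^64 (assumptions)
p_prime, r2 = mont_setup(p, n)
x, y = 1234567890123, 9876543210987
x_m = mont(x, r2, p, n, p_prime)          # Z_M = Mont(Z, R^2)
y_m = mont(y, r2, p, n, p_prime)
xy_m = mont(x_m, y_m, p, n, p_prime)      # product stays in the Montgomery domain
xy = mont(xy_m, 1, p, n, p_prime)         # Z = Mont(Z_M, 1)
```

The conversions cost one extra multiplication each, so the representation pays off when many multiplications are chained, as in an exponentiation or a scalar multiplication.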

RNS Montgomery reduction RNS representation ensures efficient computation in $\mathbb{Z}/M_{\mathcal{B}}\mathbb{Z}$. Unfortunately, it cannot be applied directly in $\mathbb{F}_p$ since $M_{\mathcal{B}}$ is not prime. One way to utilize RNS for field multiplication is to combine RNS and Montgomery reduction [82, 111]. This is shown in Algorithm 10.

RNS Montgomery reduction requires two bases, $\mathcal{B}$ and $\mathcal{C}$, with $M_{\mathcal{C}}$ co-prime to $M_{\mathcal{B}}$. The reason for including $\mathcal{C}$ is that division by $M_{\mathcal{B}}$ is not possible in $\mathcal{B}$. Note that the sizes of $M_{\mathcal{B}}$ and $M_{\mathcal{C}}$, compared to $p$, determine the upper bound of the input $X$. Guillermin found that if $X < \alpha p^2$, $M_{\mathcal{B}} > \alpha p$ and $M_{\mathcal{C}} > 2p$, then Algorithm 10 has output $S < 2p$ [82, Proposition 1]. This is an important principle for base selection.

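Algorithm 10 RNS Montgomery reduction [11].
Input: RNS bases $\mathcal{B}$ and $\mathcal{C}$ with $M_{\mathcal{B}} > \alpha p$, $M_{\mathcal{C}} > 2p$, $p$ coprime with $M_{\mathcal{B}} M_{\mathcal{C}}$, $\{X\}_{\mathcal{B}}$ and $\{X\}_{\mathcal{C}}$ being the RNS representations of $X < \alpha p^2$.
Precomputed: $\{|-p^{-1}|_{M_{\mathcal{B}}}\}_{\mathcal{B}}$, $\{|M_{\mathcal{B}}^{-1}|_{M_{\mathcal{C}}}\}_{\mathcal{C}}$ and $\{p\}_{\mathcal{C}}$.
Output: $\{S\}_{\mathcal{B}}$, $\{S\}_{\mathcal{C}}$ such that $|S|_p = |X M_{\mathcal{B}}^{-1}|_p$ and $S < 2p$.
1: $\{Q\}_{\mathcal{B}} \leftarrow \{X\}_{\mathcal{B}} \times \{-p^{-1}\}_{\mathcal{B}}$.
2: $\{Q\}_{\mathcal{B}} \xrightarrow{\text{Base Extension}} \{Q\}_{\mathcal{C}}$.
3: $\{S\}_{\mathcal{C}} \leftarrow \left(\{X\}_{\mathcal{C}} + \{Q\}_{\mathcal{C}} \times \{p\}_{\mathcal{C}}\right) \times \{M_{\mathcal{B}}^{-1}\}_{\mathcal{C}}$.
4: $\{S\}_{\mathcal{B}} \xleftarrow{\text{Base Extension}} \{S\}_{\mathcal{C}}$.
Return $\{S\}_{\mathcal{B}}$, $\{S\}_{\mathcal{C}}$.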

Chung-Hasan reduction In [47, 46], Chung and Hasan proposed an efficient reduction method for moduli of low-weight polynomial form, $p = f(\bar{z}) = \bar{z}^n + f_{n-1}\bar{z}^{n-1} + \cdots + f_1\bar{z} + f_0$, where $|f_i| \leq 1$. The resulting modular multiplication is given in Algorithm 11. The polynomial reduction phase is rather efficient since $f(z)$ is monic, making the polynomial long division (Steps 3-5) simple. Barrett reduction is used to perform the divisions required in Phase III. According to the implementation results [47], the Chung-Hasan algorithm is more efficient than the traditional Barrett or Montgomery reduction algorithms when the moduli are large (see Figure 5 in [47] for details). In [46], this algorithm is further extended to monic polynomials with $|f_i| \leq s$, where $s \ll \bar{z}$. Note that the polynomial reduction phase is efficient only when $f(z)$ is monic.

Algorithm 11 Chung-Hasan multiplication algorithm [47].
Input: positive integers $a = \sum_{i=0}^{n-1} a_i \bar{z}^i$, $b = \sum_{i=0}^{n-1} b_i \bar{z}^i$, modulus $p = f(\bar{z}) = \bar{z}^n + f_{n-1}\bar{z}^{n-1} + \cdots + f_1\bar{z} + f_0$.
Output: polynomial representation of $c(\bar{z}) = a(\bar{z})b(\bar{z}) \bmod p$.
1: Phase I: Polynomial Multiplication
2: $c(z) \leftarrow a(z)b(z) = \sum_{i=0}^{2n-2} c_i z^i$.
Phase II: Polynomial Reduction
3: for $i = 2n-2$ down to $n$ do
4:   $c(z) \leftarrow c(z) - c_i f(z) z^{i-n}$.
5: end for
Phase III: Coefficient Reduction
6: $c_n \leftarrow \lfloor c_{n-1}/\bar{z} \rceil$, $c_{n-1} \leftarrow c_{n-1} - c_n \bar{z}$.
7: $c(z) \leftarrow c(z) - c_n f(z)$.
8: for $i = 0$ to $n-1$ do
9:   $q_i \leftarrow \lfloor c_i/\bar{z} \rceil$, $r_i \leftarrow c_i - q_i \bar{z}$.
10:   $c_{i+1} \leftarrow c_{i+1} + q_i$, $c_i \leftarrow r_i$.
11: end for
12: $c(z) \leftarrow c(z) - q_{n-1} f(z) z$.
Return $c(z)$.

2.3.3 Inversion

Given $a$ and $p$, with $p$ coprime to $a$, the computation of $b$ such that $ab \equiv 1 \bmod p$ is called modular inversion. Modular inversion is considered a costly operation compared with multiplication or addition. The most widely used inversion algorithms are based on Fermat's Little Theorem (FLT) or the Extended Euclidean Algorithm (EEA). For any integer $x \in \mathbb{F}_p^{*}$, FLT computes $x^{-1} = x^{p-2} \bmod p$. EEA computes integers $u$ and $v$ such that $xu + pv = 1$, where $u$ is indeed $x^{-1} \bmod p$.
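Both inversion methods are short enough to sketch in Python. The function names and the example values are assumptions; the EEA variant below tracks only the coefficient $u$, since $v$ is not needed to obtain the inverse.

```python
def inv_flt(x: int, p: int) -> int:
    """Inversion by Fermat's Little Theorem: x^(p-2) mod p, for prime p."""
    return pow(x, p - 2, p)

def inv_eea(x: int, p: int) -> int:
    """Inversion by the Extended Euclidean Algorithm.

    Loop invariant: r0 = u0 * x (mod p) and r1 = u1 * x (mod p).
    """
    r0, r1 = x % p, p
    u0, u1 = 1, 0
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        u0, u1 = u1, u0 - q * u1
    if r0 != 1:
        raise ValueError("x is not invertible modulo p")
    return u0 % p

inverse_of_3 = inv_eea(3, 17)             # 3 * 6 = 18 = 1 (mod 17)
```

FLT costs one exponentiation and reuses the modular multiplier, whereas EEA needs division-like steps but no multiplier; which one is preferable depends on the datapath that is already available.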

2.4 F2m Arithmetic

Binary extension fields are widely used in ECC, HECC and pairing-based cryptography. Binary extension fields are preferred in hardware since they typically require less area and achieve higher speed than prime fields.


2.4.1 Representations

Elements in a binary extension field can be represented in various bases.

• Polynomial basis An element $\alpha \in \mathbb{F}_{2^m}$ can be represented as a polynomial with coefficients in $\mathbb{F}_2$ modulo an irreducible polynomial $f(x) \in \mathbb{F}_2[x]$ of degree $m$. If $\theta$ is a root of $f(x)$ then

$$P = \{1, \theta, \theta^2, \cdots, \theta^{m-1}\}$$

is a basis of $\mathbb{F}_{2^m}$ over $\mathbb{F}_2$. We can represent $\alpha$ as a polynomial $\alpha = \sum_{i=0}^{m-1} a_i \theta^i$, $a_i \in \mathbb{F}_2$.

• Type-II normal basis If $p = 2m + 1$ is prime and either of the following two conditions holds [158]:

• 2 is a primitive root modulo $p$,

• $p \equiv 7 \pmod 8$ and the multiplicative order of 2 modulo $p$ is $m$,

then we have an optimal normal basis of type II in $\mathbb{F}_{2^m}$ based on the normal element $\beta = \zeta + \zeta^{-1}$, where $\zeta$ is a primitive $p$-th root of unity:

$$N = \{\beta, \beta^2, \beta^4, \cdots, \beta^{2^{m-1}}\}.$$

Note that $\beta^{2^i} \in N$ can also be written as $\zeta^j + \zeta^{-j}$ for some $j \in [1, m]$. As such, there exists another basis $pN$ which is just a permutation of $N$:

$$pN = \{\zeta + \zeta^{-1}, \zeta^2 + \zeta^{-2}, \zeta^3 + \zeta^{-3}, \cdots, \zeta^m + \zeta^{-m}\}.$$

$pN$ is called the permuted normal basis.

• Type II polynomial basis Shokrollahi [155] discovered an alternative polynomial basis which enables efficient conversion to and from the permuted normal basis. Bernstein and Lange [17] simplified the conversion algorithm and named the basis Type II optimal polynomial basis, denoted as $nP$ in this thesis:

$$nP = \{(\zeta + \zeta^{-1}), (\zeta + \zeta^{-1})^2, \cdots, (\zeta + \zeta^{-1})^m\}.$$


2.4.2 Multiplication

In the literature there are various algorithms for multiplication using polynomial bases, type II normal bases and type II polynomial bases. Given two elements $a = \sum_{i=0}^{m-1} a_i \theta^i$ and $b = \sum_{i=0}^{m-1} b_i \theta^i$, $c = ab$ can be computed as follows:

− Step 1: $c \leftarrow (((a b_{m-1})\theta + a b_{m-2})\theta + \cdots + a b_0)$

− Step 2: $c \leftarrow c \bmod f(x)$.

The multiplication in $\mathbb{F}_{2^m}$ involves a multiplication of two polynomials of degree $m - 1$ and a polynomial reduction. Obviously, if $f(x)$ is sparse, i.e. it has a limited number of non-zero coefficients, the reduction step will be faster. We give the multiplication algorithms in Chapter 6.
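The two steps can be sketched in Python by storing a polynomial over $\mathbb{F}_2$ as an integer whose bit $i$ holds the coefficient of $\theta^i$, so that addition becomes XOR. The function name is an assumption, and the example uses the AES field $\mathbb{F}_{2^8}$ with $f(x) = x^8 + x^4 + x^3 + x + 1$ purely because its test values are well known; it is not a field used in the thesis.

```python
def gf2m_mul(a: int, b: int, f: int, m: int) -> int:
    """Multiply a and b in F_2[x]/(f), with deg(f) = m and deg(a), deg(b) < m."""
    # Step 1: Horner evaluation c = (((a*b_{m-1})*x + a*b_{m-2})*x + ... + a*b_0).
    c = 0
    for i in range(m - 1, -1, -1):
        c <<= 1                           # multiply the running sum by x
        if (b >> i) & 1:
            c ^= a                        # add a * b_i (addition in F_2 is XOR)
    # Step 2: c mod f(x), clearing the bits of degree 2m-2 down to m.
    for i in range(2 * m - 2, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)
    return c

AES_POLY = 0x11B                          # x^8 + x^4 + x^3 + x + 1
example = gf2m_mul(0x57, 0x83, AES_POLY, 8)
```

Each set bit of $f(x)$ contributes one XOR per reduction step, which is why sparse polynomials such as trinomials and pentanomials are preferred.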

Multiplication in normal basis can be performed with the Massey-Omura algorithm [143], the Sunar-Koç algorithm [158] and the recently proposed Shokrollahi algorithm [163, 155].

2.4.3 Squaring

A squaring in $\mathbb{F}_{2^m}$ is much faster than a multiplication. Let $a = \sum_{i=0}^{m-1} a_i \theta^i$ be the polynomial representation of $a$; then $a^2 = \sum_{i=0}^{m-1} a_i \theta^{2i}$. If $f(x)$ is sparse, then a squarer can be made very small in hardware.
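In a polynomial basis, squaring therefore only spreads the coefficient bits to the even positions and then reduces. The Python sketch below assumes the same integer encoding of polynomials as above (bit $i$ is the coefficient of $\theta^i$) and again uses the AES polynomial as an illustrative, well-known modulus; the function name is hypothetical.

```python
def gf2m_square(a: int, f: int, m: int) -> int:
    """Square a in F_2[x]/(f): move a_i from x^i to x^(2i), then reduce mod f."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c |= 1 << (2 * i)             # no cross terms appear in characteristic 2
    for i in range(2 * m - 2, m - 1, -1): # same reduction as for multiplication
        if (c >> i) & 1:
            c ^= f << (i - m)
    return c

AES_POLY = 0x11B                          # x^8 + x^4 + x^3 + x + 1
squared = gf2m_square(0x10, AES_POLY, 8)  # (x^4)^2 = x^8 = x^4 + x^3 + x + 1
```

No partial products are computed, so in hardware the whole operation reduces to wiring plus the few XOR gates of the reduction.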

For hardware implementations, the squaring operation is virtually free in both normal bases. Let $a = \sum_{i=0}^{m-1} a_i \beta^{2^i}$ be an element in $\mathbb{F}_{2^m}$; then $a^2 = \left(\sum_{i=0}^{m-1} a_i \beta^{2^i}\right)^2 = \sum_{i=1}^{m-1} a_{i-1} \beta^{2^i} + a_{m-1}\beta$. Indeed, a squaring using a normal basis is simply a cyclic shift. Moreover, repeated squaring, $b = a^{2^t}$, has the same complexity as $a^2$.

2.4.4 Inversion

Like inversions in Fp, inversions in F2m are also normally performed with FLT or EEA. We give the EEA algorithm in Chapter 6.

2.5 Conclusion

In this chapter, we gave a brief introduction to a set of widely used PKC, including RSA, HECC, ECC, Tori and pairing. We also illustrated the computation hierarchy of these schemes. Clearly, an efficient modular multiplier is the most important building block for a high speed PKC implementation. In this chapter, we also gave a brief description of the representations and corresponding algorithms for both Fp and F2m arithmetic. In the following chapters, we will further analyze and optimize these finite field algorithms for better parallelizability (Chapter 3), lower computational complexity (Chapter 4), lower area (Chapter 5) and higher throughput on FPGA (Chapter 6).

Chapter 3

Montgomery Multiplication on A Multi-core Platform

• 3.1 Motivation

• 3.2 MMM on A Multi-core Platform

• 3.3 Case Study: ECC, RSA and Torus-based Cryptography

• 3.4 Conclusion

3.1 Motivation

The requirements of high speed and low power have pushed designers to adopt parallel architectures. In hardware implementations, systolic arrays [100] and super-scalar processors [149] have been proposed. While both high flexibility and high performance are desirable, they are often contradictory to each other. For instance, parallel data-paths designed for dedicated operations tend to offer higher throughputs than super-scalar processors that are designed to handle more general operations. It is thus interesting to search for parallel implementation methods that have a relatively high throughput and meanwhile allow a good flexibility.

This chapter studies the challenges of implementing one of the most widely used modular multiplication algorithms, the Montgomery algorithm, on a multi-core platform. The target is to understand the data dependencies inside the algorithm, and to come up with an efficient task partitioning and scheduling method. We propose a new scheduling method that has a reduced number of inter-core data transfers. We also analyze the scalability of this method in terms of different numbers of cores. The algorithm is mounted on a simplified homemade multi-core processor implemented on an FPGA. As a case study, we compare the speed of ECC, RSA and Torus-based cryptography on the same platform.

3.2 MMM on A Multi-core Platform

The Montgomery modular multiplication algorithm, shown in Algorithm 9, consists of mainly three long integer multiplications, namely, $ab$, $\mu p'$ and $qp$. These three multiplications can be executed sequentially or in an interleaved manner. Algorithm 12 shows a digit-serial variant of MMM.

In the Finely Integrated Operand Scanning (FIOS) variant of the Montgomery algorithm [38] (see Algorithm 12), the operands $X$, $Y$ and $M$ are represented as a vector of $s$ words, and each word has $w$ bits. In each iteration, $X_0 \cdot Y_i$ is calculated and the result is added to $Z$ in step 3. Using $Z_0$, we calculate $T$, which is then used in the computation of $M \cdot T$. The result of $M \cdot T$ is then added to $Z$ in step 4, making $Z_0 = 0$. Hence, the division by $r$ is exact and can be performed by a simple right shift.

This chapter is based on the following publication:

J. Fan, L. Batina, K. Sakiyama, and I. Verbauwhede, “FPGA design for algebraic tori-based public-key cryptography,” in Design, Automation, and Test in Europe – DATE 2008. IEEE, pp. 1292–1297, 2008.

Algorithm 12 Radix-$2^w$ Montgomery modular multiplication (FIOS) [38].
Input: integers $M = (M_{s-1}, \ldots, M_0)_r$, $X = (X_{s-1}, \ldots, X_0)_r$, $Y = (Y_{s-1}, \ldots, Y_0)_r$, where $0 \leq X, Y < M$, $r = 2^w$, $s = \lceil n/w \rceil$, $R = r^s$ with $\gcd(M, r) = 1$ and $M' = -M^{-1} \bmod r$.
Output: $X \cdot Y \cdot R^{-1} \bmod M$
1: $Z = (Z_{s-1}, \ldots, Z_0)_r \leftarrow 0$.
2: for $i = 0$ to $s-1$ do
3:   $T \leftarrow (Z_0 + X_0 \cdot Y_i) \cdot M' \bmod r$.
4:   $Z \leftarrow (Z + X \cdot Y_i + M \cdot T)/r$.
5: end for
6: if $Z \geq M$ then
7:   $Z \leftarrow Z - M$.
8: end if
9: Return $Z$

Algorithm 12 has a for loop of $s$ iterations, and each iteration includes $2s + 1$ multiplications. In total, the number of digit multiplications for one MMM is $2s^2 + s$. On a $w$-bit processor, at least $2s^2 + s$ cycles are required to execute the digit multiplications alone. Note that there are also clock cycles needed for additions and memory accesses. Koç, Acar and Kaliski give a comprehensive analysis of the number of cycles for different variants of MMM on a single-core processor [38].
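The word-level structure of Algorithm 12 can be sketched in Python as follows. For readability, the sketch keeps $X$, $Z$ and $M$ as Python big integers and extracts only the digit $Y_i$ in each iteration, so it models the data flow rather than the $2s^2 + s$ individual digit multiplications. The function name and the test parameters are assumptions, and the final comparison uses $Z \geq M$.

```python
def mont_fios(X: int, Y: int, M: int, w: int, s: int) -> int:
    """Radix-2^w Montgomery multiplication: X*Y*R^(-1) mod M with R = 2^(w*s)."""
    r = 1 << w
    mask = r - 1
    m_prime = (-pow(M, -1, r)) % r        # M' = -M^(-1) mod r (M must be odd)
    Z = 0
    for i in range(s):
        Y_i = (Y >> (w * i)) & mask                                 # i-th digit of Y
        T = (((Z & mask) + (X & mask) * Y_i) * m_prime) & mask      # step 3
        Z = (Z + X * Y_i + M * T) >> w    # step 4: low word is zero, shift is exact
    if Z >= M:                            # Z < 2M holds, one subtraction suffices
        Z -= M
    return Z

M = (1 << 127) - 1                        # example odd modulus (assumption)
w, s = 16, 8                              # 8 words of 16 bits, R = 2^128
X, Y = 0x1234567890ABCDEF1234567890ABCDEF % M, 0x0FEDCBA987654321
Z = mont_fios(X, Y, M, w, s)
```

Only $T$ depends on the lowest word of the running sum; the two long products $X \cdot Y_i$ and $M \cdot T$ are independent word-by-word, which is the property that a multi-core schedule can exploit.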

3.2.1 Target Platform

For parallel implementation of Montgomery multiplication, many architectures have been proposed, including a systolic array [100], a bipartite multiplier [106], a multiplier vector [129], and a carry-save adder array [151]. However, since the architectures used in these designs are crafted for Montgomery multiplication only, the way the algorithm is partitioned and mapped results in very poor portability.

We use a simplified Very Long Instruction Word (VLIW) processor to resemble multi-core architectures. The design of the processor follows three guidelines:

• Simple: We simplify the Instruction Set Architecture (ISA) and keep only the basic computation capabilities: multiplication, addition and subtraction.


Figure 3.1: Architecture of the multi-core platform (w = 16). [Figure: a main controller, a data memory and an instruction memory are connected through a data bus and an instruction bus to core-1, ..., core-m; each 16-bit core contains a decoder, a register file, a multiplier, an adder and the registers Rin, Rout, Ins and WB.]

• Shared memory: Like most popular multi-core systems, the platform has a two-level memory organization: a local memory block (registers) and a global memory.

• Flexible: We want to keep the programmability such that we can support different operand lengths and an arbitrary number of PEs.

The platform tries to resemble the common features of popular architectures such as ARM multi-core processors, and can be easily realized on FPGAs.

As shown in Figure 3.1, the platform consists of a main controller, a data memory, an instruction memory and several cores. The main controller fetches instructions from the instruction memory and dispatches them to all cores in parallel via the instruction bus. Each core executes arithmetic instructions in parallel, and stores the results in its register file. The data memory has only one read/write port; therefore, only a single data memory access is allowed in each cycle.


Table 3.1: Instructions supported by each core.

Opcode (4-bit) | Addr1 (4-bit) | Addr2 (4-bit) | Addr3 (4-bit) | Description
Nop | | | | No operation
Load | Ri | #Addr | | Load the data from location Addr of the data memory into register Ri
Store | Ri | #Addr | | Store the data of register Ri to location Addr of the data memory
Mul | Ri | Rj | Rk | {R(i+1), Ri} = Rj · Rk
Add | Ri | Rj | Rk | {Ca, Ri} = Rj + Rk; Ca is the carry out and is stored in the status register
Adc | Ri | Rj | Rk | {Ca, Ri} = Rj + Rk + Ca
Sub | Ri | Rj | Rk | Ri = Rj − Rk

We denote w as the operation size of a w-bit core. A 16-bit (w = 16) core is also shown in Figure 3.1. It is a highly simplified Load/Store CPU. It has an instruction decoder, a register file with 16 general 16-bit registers and one status register. The Arithmetic Logic Unit (ALU) includes one 16-bit multiplier and one 16-bit adder. It also has an output register to store the data that will be written to the data memory, and an input register to buffer the data from the data memory. Both of them are 16-bit. One 32-bit Write Back (WB) register is also used to store data from the ALU. The bit-length of both data-path and registers is doubled if it is configured as a 32-bit (w = 32) core.

The cores here support a simple ISA. As shown in Table 3.1, this simplified ISA has only 7 general instructions. Here #Addr denotes a memory address. Instructions for each core are 16 bits long. All the arithmetic operations are performed on data stored in the local register file. When data needs to be moved from one core to another, it is first stored in the data memory and then loaded by the destination core. Cores in this platform support 4-stage instruction pipelining.
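The seven instructions of Table 3.1 are simple enough to model in a few lines. The sketch below is a behavioural toy model only, written to illustrate the register-level semantics: the class name `Core`, the tuple encoding of instructions and the treatment of `Sub` (borrow discarded) are assumptions of this example, and neither the pipeline nor the 16-bit instruction encoding is modelled.

```python
class Core:
    """Toy behavioural model of one w-bit core (16 registers, one carry flag)."""

    def __init__(self, w=16):
        self.w = w
        self.mask = (1 << w) - 1
        self.reg = [0] * 16
        self.carry = 0                        # Ca in the status register

    def step(self, op, a=0, b=0, c=0, mem=None):
        r, m = self.reg, self.mask
        if op == "Nop":
            pass
        elif op == "Load":                    # Ri <- mem[Addr]
            r[a] = mem[b] & m
        elif op == "Store":                   # mem[Addr] <- Ri
            mem[b] = r[a]
        elif op == "Mul":                     # {R(i+1), Ri} = Rj * Rk
            prod = r[b] * r[c]
            r[a], r[a + 1] = prod & m, prod >> self.w
        elif op in ("Add", "Adc"):            # {Ca, Ri} = Rj + Rk (+ Ca)
            total = r[b] + r[c] + (self.carry if op == "Adc" else 0)
            r[a], self.carry = total & m, total >> self.w
        elif op == "Sub":                     # Ri = Rj - Rk (borrow dropped: assumption)
            r[a] = (r[b] - r[c]) & m
        else:
            raise ValueError(f"unknown opcode {op}")


# Two-word addition 0x0001FFFF + 0x00000001, least significant word first.
mem = [0xFFFF, 0x0001, 0x0001, 0x0000, 0, 0]
program = [
    ("Load", 0, 0), ("Load", 1, 1), ("Load", 2, 2), ("Load", 3, 3),
    ("Add", 4, 0, 2),                         # low words, sets the carry
    ("Adc", 5, 1, 3),                         # high words plus carry
    ("Store", 4, 4), ("Store", 5, 5),
]
core = Core()
for ins in program:
    core.step(*ins, mem=mem)
```

The `Add`/`Adc` pair shows why the carry flag matters for long-integer arithmetic: the carry produced by the low words stays inside the core and is consumed by the next instruction.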


3.2.2 Dependency Analysis and Task Partitioning

The main dependency in the Montgomery algorithm is due to the carries of additions. Taking Algorithm 12 as an example, in each iteration, $Z_j$ is replaced by $(Z_j + (X \cdot Y_i)_j + (M \cdot T)_j + Ca)$, where Ca is the carry input. The data dependency in one iteration is shown in Figure 3.2. Obviously, $X_j \cdot Y_i$, for any $0 \le i, j \le s - 1$, depends only on the operands X and Y. We can also calculate $M_j \cdot T$ immediately after the generation of T. The products with the same weight as $Z_j$ and the carry from $Z_{j-1}$ are accumulated to $Z_j$, generating a new $Z_j$ and a 2-bit carry. As a result, $Z_j$ can only be generated after the carry from $Z_{j-1}$ is ready.

On the designed platform (or a general purpose multi-core processor), it is very inefficient to transfer the carry from one core to another. Therefore, it is desirable to partition the algorithm so that the carry is only used in the core where it was generated.

Method-I

In [159], Tenca and Koç proposed an iteration-based scheduling method. In this method each Processing Element (PE) performs one iteration of the loop in Algorithm 12. This method is attractive because carries are only used locally.

Figure 3.2: Data dependency of FIOS Montgomery algorithm. [Figure: for each word position from LSB to MSB, the products $X_j \cdot Y_i$ and $M_j \cdot T$ are added to $Z_j$ together with the 2-bit carry from the previous position.]


Figure 3.3: Instruction scheduling method-I: each iteration is performed with one core (n = 256, w = 16, s = 16, $N_{arrow}$ = 240). [Figure: timeline for core-1 to core-4; each core computes T and the words $X_j \cdot Y_i + M_j \cdot T + Z_j$ of one iteration and passes $Z_0, \ldots, Z_{15}$ word by word to the next core; four consecutive iterations form one round.]

Note that this method was originally designed for a hardware implementation. Here we map this algorithm to general purpose multi-core systems. Figure 3.3 shows the scheduling method, denoted as method-I, for 256-bit Montgomery multiplication on a 4-core system. As n = 256 and w = 16, sixteen iterations are needed. Core-1 performs the first iteration and generates $Z_0$ to $Z_{15}$ one by one. Each word is transferred to core-2 as soon as it is generated. Next, core-2 performs the second iteration and then transfers $Z_0$ to $Z_{15}$ to core-3. After 4 iterations $Z = (Z_{15}, \ldots, Z_0)$ is transferred back to core-1 from core-4 and the 5th iteration begins. As 16 iterations are required in total, each core needs to perform 4 iterations. After a conditional subtraction, the result is obtained.

Though method-I avoids carry transfers between cores, transferring $(Z_{s-1} \ldots Z_0)$ causes a big overhead. In Figure 3.3 the transfers of $(Z_{s-1} \ldots Z_0)$ are denoted as arrows. For each iteration, $s = \lceil n/w \rceil$ arrows are required to transfer Z. One modular multiplication contains s iterations, with s − 1 hand-overs of Z between consecutive iterations, so $s(s-1)$ arrows are needed during the whole loop. Let $N_{arrow}$ be the number of arrows; then $N_{arrow} = s(s-1)$. In Figure 3.3 we have s = 16, therefore $N_{arrow}$ = 240.



Figure 3.4: Instruction scheduling method-II: each iteration is performed with several cores (n = 256, w = 16, s = 16, $N_{arrow}$ = 96). [Figure: timeline of the sixteen iterations; in every iteration T is computed first, then four cores compute the words $X_j \cdot Y_i + M_j \cdot T + Z_j$ for j = 0..3, 4..7, 8..11 and 12..15 in parallel, and the lowest word of each upper block is passed to the neighbouring core.]

Method-II

Note that in order to generate T, only $Z_0$ must be ready at the end of each iteration, while $(Z_{s-1} \ldots Z_1)$ can be generated later. Based on this observation, a new scheduling method is proposed (see Figure 3.4). In this method, each iteration in Algorithm 12 is performed by multiple cores. Figure 3.4 shows this scheduling method for a 4-core system. Here we still choose n = 256, w = 16 and $s = \lceil n/w \rceil = 16$. During the whole loop $(Z_3, \ldots, Z_0)$ is generated and stored in core-1, $(Z_7, \ldots, Z_4)$ in core-2, $(Z_{11}, \ldots, Z_8)$ in core-3 and $(Z_{15}, \ldots, Z_{12})$ in core-4. The carry is only used in the local core. At the end of each iteration, $Z_4$ is sent to core-1, $Z_8$ is sent to core-2 and $Z_{12}$ is sent to core-3. After sixteen iterations and a conditional subtraction, $Z = X \cdot Y \cdot R^{-1} \bmod M$ is generated and stored separately in four cores. Z can be written to the data memory or can be used by another modular multiplication.

Method-II needs significantly fewer data transfers between cores than method-I. T is always generated in core-1, and then distributed to the other cores. On a system with p cores, p − 1 arrows are needed to transfer T in each iteration. To shift Z to the right, p − 1 words are transferred, making another p − 1 arrows in each iteration. As a result, the number of arrows for one modular multiplication is $2(p-1)s$. When s = 16 and p = 4, the number of arrows is $N_{arrow}$ = 96.
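One way to read this partitioning is sketched below in Python. It is a functional model under stated assumptions, not the scheduled microcode: each core's block of Z is held as one Python integer that may temporarily exceed its s/p words (which is how the carry stays local), the function name `mont_method2` and the counter `transfers` are invented for this example, p is assumed to divide s, and the final reduction is done on the recombined value with `>=`.

```python
def mont_method2(X, Y, M, w=16, s=16, p=4):
    """Functional model of method-II: core k owns words k*b .. k*b+b-1.

    Returns (X*Y*R^-1 mod M, number of inter-core word transfers).
    """
    assert s % p == 0
    r = 1 << w
    b = s // p                                # words per core
    blk_mask = (1 << (w * b)) - 1
    m_prime = (-pow(M, -1, r)) % r
    Xb = [(X >> (w * b * k)) & blk_mask for k in range(p)]
    Mb = [(M >> (w * b * k)) & blk_mask for k in range(p)]
    Z = [0] * p                               # local accumulators incl. local carry
    transfers = 0
    for i in range(s):
        Yi = (Y >> (w * i)) & (r - 1)
        # core-1 derives T from its own lowest word only
        x0 = Xb[0] & (r - 1)
        T = (((Z[0] + x0 * Yi) & (r - 1)) * m_prime) & (r - 1)
        transfers += p - 1                    # T is sent to the other p-1 cores
        for k in range(p):                    # done in parallel on the platform
            Z[k] += Xb[k] * Yi + Mb[k] * T
        low = [z & (r - 1) for z in Z]        # lowest word of every block
        assert low[0] == 0                    # division by r is exact
        Z = [z >> w for z in Z]               # local right shift by one word
        for k in range(1, p):                 # e.g. Z4 -> core-1, Z8 -> core-2
            Z[k - 1] += low[k] << (w * (b - 1))
            transfers += 1
    total = sum(z << (w * b * k) for k, z in enumerate(Z))
    if total >= M:
        total -= M
    return total, transfers


M = 2**255 - 19                               # arbitrary odd test modulus
X, Y = pow(3, 160, M), pow(7, 90, M)
Z, transfers = mont_method2(X, Y, M)
```

With s = 16 and p = 4 the counter ends at 2·(4 − 1)·16 = 96, the value of $N_{arrow}$ given for Figure 3.4.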


3.2.3 Method-I vs. Method-II

Each arrow in Figure 3.3 and Figure 3.4 represents one store and one load operation. The comparison of memory accesses caused by data transfers is presented in Table 3.2. Here $N_{load-tr}$ and $N_{store-tr}$ are the numbers of load and store operations caused by data transfers, respectively, and $N_{total-tr}$ is their sum. Note that in Figure 3.4 the arrows starting from T cause only one store operation. For method-II, $N_{total-tr}$ is $3ps - 2s$, while it is $2s^2 - s$ for method-I. As p is typically much smaller than s, the number of memory accesses caused by data transfers in method-II is much smaller than that of method-I. Besides data transfers, loading operands also causes memory accesses. Method-II keeps operands and intermediate data distributed over the cores, and thus has a lower number of memory accesses caused by operand loading.
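The formulas of Table 3.2 are easy to evaluate for concrete parameters. The short Python sketch below does so; the function name `transfer_accesses` and the dictionary layout are chosen here for illustration only.

```python
def transfer_accesses(s, p):
    """Evaluate the Table 3.2 formulas for s words and p cores."""
    method_1 = {"arrows": s * (s - 1), "loads": s * s - s, "stores": s * s}
    method_2 = {"arrows": 2 * (p - 1) * s, "loads": 2 * (p - 1) * s, "stores": p * s}
    for m in (method_1, method_2):
        m["total"] = m["loads"] + m["stores"]
    return method_1, method_2


m1, m2 = transfer_accesses(s=16, p=4)
print("method-I :", m1)
print("method-II:", m2)
```

For the 256-bit, 4-core example (s = 16, p = 4) this gives 496 transfer-related memory accesses for method-I and 160 for method-II.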

3.2.4 Scalability Analysis

For parallel implementation of MMM on this platform, scalability has two dimensions: scalability in terms of larger operand size and scalability in terms of a larger number of cores.

When the number of cores reaches a specific value, the memory access becomes the bottleneck. Because our proposed architecture uses a single-ported shared memory, load and store operations cannot be performed in parallel. As a result, the number of cycles needed by one modular multiplication is no smaller than $N_{load} + N_{store}$. Let $N(s, p)$ be the number of cycles needed by one s-word multiplication on a p-core system. Then, as p increases, there is a point where $N(s, p) = N_{load} + N_{store}$. After reaching this point, increasing p does not improve the performance any more. Because method-I needs more load and store instructions than method-II, it reaches this point before method-II as p increases.

The results are presented in Figure 3.5. Clearly, method-II scales better with the number of cores than method-I. If a single core is used, 2512 cycles are needed to finish one 256-bit Montgomery

Table 3.2: Number of data memory accesses caused by data transfers (p := number of cores; s := number of digits).

Scheduling Methods | $N_{arrow}$ | $N_{load-tr}$ | $N_{store-tr}$ | $N_{total-tr}$
Method-I | $s(s-1)$ | $s^2 - s$ | $s^2$ | $2s^2 - s$
Method-II | $2(p-1)s$ | $2(p-1)s$ | $ps$ | $3ps - 2s$


Figure 3.5: Performance of 256-bit MMM on a multi-core system (n = 256, w = 16). [Figure: cycles required by one 256-bit Montgomery modular multiplication (vertical axis, 0 to 3000) versus number of cores p = 1, 2, 4, 6, 8 (horizontal axis), for Method-I and Method-II.]

multiplication. When using 2 cores, 1664 and 1342 cycles are required for method-I and method-II, respectively. If 4 cores are used, 852 cycles are required for method-I, while only 682 cycles are required for method-II. On the other hand, when employing more than 4 cores, the performance of method-I deteriorates because the number of memory accesses becomes the bottleneck. For method-II, the best performance is obtained when p = 6, as shown in Figure 3.5.
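The cycle counts quoted above translate directly into speedup and parallel efficiency figures. The Python sketch below uses only the numbers stated in the text for 1, 2 and 4 cores; the helper names `speedup` and `efficiency` are introduced here for illustration.

```python
# Cycle counts for one 256-bit MMM as quoted in the text (w = 16).
CYCLES = {
    "method-I": {1: 2512, 2: 1664, 4: 852},
    "method-II": {1: 2512, 2: 1342, 4: 682},
}


def speedup(method, p):
    """Speedup over the single-core run."""
    return CYCLES[method][1] / CYCLES[method][p]


def efficiency(method, p):
    """Speedup divided by the number of cores."""
    return speedup(method, p) / p


for method in CYCLES:
    for p in (2, 4):
        print(f"{method}, p={p}: speedup {speedup(method, p):.2f}, "
              f"efficiency {efficiency(method, p):.0%}")
```

For method-II on 4 cores the speedup is 2512/682, that is, about 3.68.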

Figure 3.6 shows the speed of 256-bit or 1024-bit MMM using method-II on platforms with different configurations.

Our implementation (method-II) has better performance than software implementations on embedded processors. Using four 32-bit cores, the 256-bit modular multiplication is almost 20 times faster than the implementation on the ARM processor [159]. Our design is also faster than the implementation on TI’s DSP (TMS320C6201) [98], which can issue eight 32-bit instructions in parallel. Compared with state-of-the-art hardware implementations [152, 129], our design is still much slower. However, hardware implementations support a fixed number of parameter sizes, while our design supports arbitrary operand sizes.


Figure 3.6: 256-bit and 1024-bit MMM on a multi-core system with different configurations (speedup is normalized to the 16-bit 2-core configuration).

3.3 Case Study: ECC, RSA and CEILIDH

Thanks to the flexibility in operand size, we can implement different cryptosystems on the same platform. In this section, we describe the implementation of ECC, RSA and torus-based cryptographic (CEILIDH) schemes.

Figure 3.7 shows the block diagram of the platform. It consists of a MicroBlaze processor and the cryptographic coprocessor, namely, the multi-core platform. MicroBlaze is a synthesizable core offered by Xilinx, and is used here as a controller.


Table 3.3: Performance comparison of modular multiplication on different platforms.

Reference | Description | Platform | Area (Slices) | Freq. (MHz) | 256-bit time (µs) | 1024-bit time (µs)
This work (method-II) | 4 cores, 4 32x32 mults | Xilinx XC2VP30 | 3173 | 93 | 2.2 | 33.0
Tenca & Koç [159] | Software | ARM processor | - | 80 | 43 | 570
Itoh et al. [98] | Software | DSP TMS320C6201 | - | 200 | 2.68‡ | -
Sakiyama et al. [152] | CSAs based, Dual-Field | Xilinx XC2VP30 | 4836 | 110.4 | 0.80 | -
Tenca & Koç [159] | 40 PEs, 40 8x8 mults | ASIC (0.5 µm) | 28 Kgates | 80 | 7.4 | 43
Mentens [129] | 130 mults (16x16) | Xilinx XC2VP30 | 7244 | 64 | 0.31 | 1.07

‡ 239-bit Montgomery modular multiplication.

3.3.1 Software/Hardware Interface

As shown in Figure 3.7, the MicroBlaze processor communicates with the coprocessor via memory-mapped registers, i.e., an instruction register (A) and two data sharing registers (B and C), and an interrupt signal. The coprocessor consists of a decoder, data memory (DataRAM), microinstruction memory (InsRom) and multiple embedded cores. The decoder fetches instructions from the instruction register (register A), and performs the corresponding microinstructions stored in InsRom. The microinstructions are dispatched to the cores in parallel via the instruction bus.

Note that InsRom1 is not always necessary, as the MicroBlaze can send the instructions to the coprocessor on the fly. However, this causes a huge interface overhead. As one $\mathbb{F}_{p^6}$ operation consists of 18M + 60A, a total of 78 accesses to register A and 78 interrupts are required. One access to register A together with one interrupt handling requires 184 clock cycles, while one 170-bit modular multiplication requires 193 clock cycles. Therefore, the communication between the MicroBlaze processor and the coprocessor becomes the bottleneck of the whole system.
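A rough back-of-the-envelope calculation makes the size of this overhead visible. The Python sketch below combines the figures from the paragraph above with the 47-cycle 170-bit modular addition reported in Table 3.4; treating all 60 additive operations as modular additions and ignoring any overlap between computation and communication are simplifying assumptions of this example.

```python
ACCESS_PLUS_INTERRUPT = 184   # cycles per register-A access plus interrupt
MM_CYCLES = 193               # one 170-bit Montgomery multiplication
MA_CYCLES = 47                # one 170-bit modular addition (Table 3.4)
N_MM, N_MA = 18, 60           # one Fp6 operation = 18M + 60A

compute = N_MM * MM_CYCLES + N_MA * MA_CYCLES          # arithmetic only
on_the_fly = (N_MM + N_MA) * ACCESS_PLUS_INTERRUPT     # one access per Fp operation
hierarchical = 1 * ACCESS_PLUS_INTERRUPT               # one access per Fp6 operation

print(f"arithmetic cycles per Fp6 operation : {compute}")
print(f"interface cycles, on-the-fly control: {on_the_fly}")
print(f"interface cycles, two-level control : {hierarchical}")
```

Under these assumptions the interface costs 14352 cycles per $\mathbb{F}_{p^6}$ operation against 6294 cycles of arithmetic, which is why the two-level control hierarchy of Section 3.3.2, with a single access per $\mathbb{F}_{p^6}$ operation, pays off.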


Figure 3.7: Top level block diagram of the platform.

3.3.2 Control Hierarchy

In order to reduce the interface overhead, we deployed a two-level program hierarchy. Torus exponentiation is broken down into a sequence of $\mathbb{F}_{p^6}$ operations, which are then performed with a sequence of $\mathbb{F}_p$ operations. ECC scalar multiplication and RSA exponentiation are programmed in a similar way. Figure 3.8 shows the control hierarchy. InsRom1 stores the level-2 sequences.

The coprocessor decodes the level-1 instructions (i.e., PA and PD in the case of ECC), and fetches the corresponding sequence of MM and MA from InsRom1. For each MM or MA, the coprocessor performs the corresponding microinstructions stored in InsRom2. This implementation requires only one access to register A and one interrupt for each $\mathbb{F}_{p^6}$ operation; therefore the performance is improved.

3.3.3 Results

We implemented this platform on a Xilinx FPGA (Virtex-II Pro). 1024-bit RSA, 160-bit ECC and 170-bit CEILIDH are selected for implementation as they offer an equivalent security level. The platform is configured to have three 32-bit cores, since the 3-core configuration achieves a lower delay of 160-bit MMM


Figure 3.8: Torus exponentiation, RSA and ECC on the same platform: program hierarchy.

than both 2-core and 4-core configurations. Table 3.4 shows the number of clock cycles for several modular operations. The result shows that one 170-bit Montgomery modular multiplication requires 193 clock cycles, while one addition needs 47 clock cycles. The reason that modular additions are relatively slow is that we only use one core to perform modular additions and subtractions. This is because the carry needs to be transferred if multiple cores are used to perform modular additions. While 160-bit modular operations are slightly faster than 170-bit operations, 1024-bit Montgomery modular multiplication is about 23 times slower than a 170-bit multiplication.

Table 3.4: Number of clock cycles for different operations.

Operations | 170-bit | 160-bit | 1024-bit
Interrupt Handling | 184
Modular Mult. | 193 | 163 | 4447
Modular Add. | 47 | 40 | -
Modular Sub. | 61 | 53 | -

The design is synthesized and implemented on a Xilinx Virtex-II Pro (XC2VP30) FPGA. A maximum frequency of 74 MHz can be achieved. The data memory and instruction memory are implemented in block RAM of the FPGA board. In total, 5419 slices are used for this design, where the coprocessor uses 3285 slices. Table 3.5 shows the performance of CEILIDH, ECC and RSA on this platform. One 170-bit $T_6$ exponentiation requires 20 ms,


Table 3.5: Performance comparison between CEILIDH, ECC and RSA on the same platform.

PKC | Area [slices] | Freq. [MHz] | Time [ms]
170-bit CEILIDH | 5419 | 74 | 20
1024-bit RSA | 5419 | 74 | 96
160-bit ECC | 5419 | 74 | 9.4

while one 1024-bit RSA exponentiation requires 96 ms. In this case, CEILIDH is about 5 times faster than RSA on the same platform. On the same platform, one 160-bit ECC scalar multiplication requires 9.4 ms, which is about two times faster than CEILIDH.

3.4 Conclusion

In this chapter, we analyzed the challenge of parallelizing the Montgomery modular multiplication on a simplified multi-core platform. We show that by postponing carry propagation in long integer additions and reducing the number of inter-core data transfers, the computation of the Montgomery multiplication algorithm can be efficiently distributed over parallel cores. On a 4-core testing platform, we obtain a speedup factor of 3.68 for 256-bit multiplications. We also proposed a hierarchical control flow that allows us to implement ECC, RSA and CEILIDH on the same platform.

Chapter 4

Hybrid Modular Multiplication (HMM) and Its Application to Pairings

• 4.1 Motivation

• 4.2 Hybrid Modular Multiplication (HMM)

• 4.3 High Performance Pairing Processor Using HMM

• 4.4 Pairing Processor Using RNS

• 4.5 Conclusion

4.1 Motivation

In Chapter 3 we discussed how to parallelize the Montgomery modular multiplication (MMM) algorithm on a multi-core hardware platform. The results show that parallelizing a long integer multiplication is not easy and that the scalability is only moderate. It is desirable to have a multiplication algorithm with the following properties:

• There is no carry propagation chain from LSB to MSB.

• The computational complexity is lower than conventional MMM.

The complexity of MMM is $2s^2 + s$ digit multiplications, where s is the number of digits in the operands. In this chapter, we focus on modular multiplications with special moduli, $p = f(z)$, where $f(z)$ is a low-weight polynomial. Such moduli arise in the finite fields used by pairing-friendly curves [71]. For instance, the family of Barreto-Naehrig (BN) curves is defined over $\mathbb{F}_p$ where $p = 36z^4 + 36z^3 + 24z^2 + 6z + 1$. The main idea is that, since the modulus is generated by a low-weight polynomial, performing the Montgomery multiplication in a polynomial ring might reduce the computational complexity. Also, since polynomial multiplication does not have carry propagation, parallelization of this algorithm is easier. While this algorithm is proposed for pairing computation, it is applicable to any moduli of such form.

This chapter also includes some recent results on the design of a pairing processor using RNS. Compared with PNS, the RNS representation has lower computational complexity in multiplication but a higher complexity in reduction. Recent studies on pairing implementation by Scott [154] and Aranha et al. [5] show that the number of reductions can be much smaller than the number of multiplications in pairing computation. As a result, RNS becomes interesting in pairing implementations.

The rest of this chapter is organized as follows. We first introduce the new multiplication algorithm, and then discuss its application to pairings. We also report some recent results of pairing implementations using RNS.


4.2 Hybrid Modular Multiplication

In this section we introduce a new reduction algorithm for polynomial form moduli that can be seen as the dual algorithm to Chung-Hasan (see Algorithm 11). Whereas Chung and Hasan require the polynomial defining the modulus to be monic, our algorithm requires the constant coefficient to be ±1. As will be shown below, this requirement is a very natural one.

Let $f(z) = f_{n-1}z^{n-1} + \cdots + f_1 z + f_0 \in \mathbb{Z}[z]$ with $f_0 \neq 0$ and assume the modulus is given by $p = f(\bar{z})$ for some $\bar{z} \in \mathbb{Z}$. In order to compute $c \bmod p$, we can use the Montgomery reduction algorithm in a polynomial ring.

1. Step 1: Represent c as a polynomial $c(z) = \sum_{i=0}^{2n-2} c_i z^i$ such that $c = c(\bar{z})$;

2. Step 2: $q(z) = c(z)g(z) \bmod z^n$, where $g(z) \equiv f(z)^{-1} \bmod z^n$;

3. Step 3: $r(z) = (c(z) - q(z)f(z))/z^n = c(z)z^{-n} \bmod f(z)$.

Step 2 generates the polynomial $q(z)$ such that $q(z) = c(z)g(z) \bmod z^n$. The division by $z^n$ is exact and can be computed by a simple right shift of the coefficients. Note that $f_0 \neq 0$ implies that $g(z) \in \mathbb{Q}[z]$ exists, but $g(z)$ does not necessarily have integer coefficients. The following lemma shows that the condition $f_0 = \pm 1$ is a natural one.

Lemma 1. Let $f(z) = f_{n-1}z^{n-1} + \cdots + f_1 z + f_0 \in \mathbb{Z}[z]$ with $f_0 \neq 0$, and define $g_k(z) = f(z)^{-1} \bmod z^k$; then a necessary and sufficient condition for the $g_k$ to have integer coefficients is $f_0 = \pm 1$.

Proof: The argument is classical and is a special case of Newton lifting over the rational function field $\mathbb{Q}(z)$. Indeed, $g_k(z) \equiv f(z)^{-1} \bmod z^k$ is the solution to the linear equation (in the indeterminate X)

$$f(z)X - 1 \equiv 0 \bmod z^k,$$

which can be solved using the Newton iteration: $g_1(z) \equiv 1/f_0 \bmod z$ and then

$$g_k(z) \equiv g_{k-1}(z) - (f(z)g_{k-1}(z) - 1)/f(z) \equiv 2g_{k-1}(z) - g_{k-1}^2(z)f(z) \bmod z^k.$$

Sections 4.2 and 4.3 are based on the following publication:

J. Fan, F. Vercauteren, and I. Verbauwhede, “Efficient hardware implementation of Fp-arithmetic for pairing-friendly curves,” Computers, IEEE Transactions on, vol. PP, no. 99, p. 1, 2011.


Hence $f_0 = \pm 1$ is clearly sufficient for $g_k(z)$ to be defined over $\mathbb{Z}$. However, it is also necessary, since $g_k(z) \equiv f_0^{-1} \bmod z$. □

Lemma 1 shows that g(z) indeed has integer coefficients if f0 = ±1. However,the output of Step 3 normally has large coefficients. In other words, the resultr(z̄) is not fully reduced. In order to have the results reduced, one coefficientreduction step is needed.
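The Newton iteration from the proof of Lemma 1 can be run directly on coefficient lists. The Python sketch below is an illustration under the assumption that polynomials are stored as lists with the constant term first; the helper names `poly_mul`, `trunc` and `inverse_mod_zk` are invented for this example, and the precision is doubled in every round, which is the usual way to apply the iteration.

```python
def poly_mul(a, b):
    """Schoolbook product of two coefficient lists (constant term first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def trunc(a, k):
    """Reduce modulo z^k: keep (or zero-pad to) exactly k coefficients."""
    return (a + [0] * k)[:k]


def inverse_mod_zk(f, k):
    """g(z) = f(z)^-1 mod z^k via g <- 2g - g^2 f, requires f0 = +1 or -1."""
    assert f[0] in (1, -1)
    g = [f[0]]                        # 1/f0 equals f0 when f0 = +-1
    prec = 1
    while prec < k:
        prec = min(2 * prec, k)       # each round doubles the precision
        g2f = trunc(poly_mul(g, trunc(poly_mul(g, f), prec)), prec)
        g = [2 * gi - hi for gi, hi in zip(trunc(g, prec), g2f)]
    return g


# BN polynomial f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1, so n = 5.
f_bn = [1, 6, 24, 36, 36]
g_bn = inverse_mod_zk(f_bn, 5)
print(g_bn)
```

For the BN polynomial this yields $g(z) = -324z^4 + 36z^3 + 12z^2 - 6z + 1$, and all intermediate coefficients stay integral because $f_0 = 1$.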

4.2.1 Parallel Hybrid Modular Multiplication

Algorithm 13 describes our modular multiplication algorithm for polynomial form moduli. The algorithm is composed of four phases, i.e., polynomial multiplication, a partial coefficient reduction, polynomial reduction and coefficient reduction. The polynomial reduction uses the Montgomery reduction, while the coefficient reduction uses division. We call this algorithm Hybrid Modular Multiplication (HMM).

Algorithm 13 Parallel hybrid modular multiplication algorithm.
Input: positive integers $a = \sum_{i=0}^{n-1} a_i \bar{z}^i$, $b = \sum_{i=0}^{n-1} b_i \bar{z}^i$, modulus $p = f(\bar{z}) = f_{n-1}\bar{z}^{n-1} + \cdots + f_1\bar{z} + f_0$ with $f_0 = \pm 1$ and $g(z) \equiv f(z)^{-1} \bmod z^n$.
Output: polynomial $r(\bar{z}) \equiv a(\bar{z})b(\bar{z})\bar{z}^{-n} \bmod p$
1: Phase I: Polynomial Multiplication
2: $c(z) = \sum_{i=0}^{2n-2} c_i z^i \leftarrow a(z)b(z)$.
3: Phase II: Coefficient Reduction
4: for $i = 0$ to $n - 1$ do
5:   $c_{i+1} \leftarrow c_{i+1} + (c_i \text{ div } \bar{z})$, $c_i \leftarrow c_i \bmod \bar{z}$.
6: end for
7: Phase III: Polynomial Reduction
8: $q(z) \leftarrow (c(z) \bmod z^n)g(z) \bmod z^n$.
9: $c(z) \leftarrow (c(z) - q(z)f(z))/z^n$.
10: Phase IV: Coefficient Reduction
11: for $i = 0$ to $n - 2$ do
12:   $c_{i+1} \leftarrow c_{i+1} + (c_i \text{ div } \bar{z})$, $c_i \leftarrow c_i \bmod \bar{z}$.
13: end for
Return $r(z) \leftarrow c(z)$.
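The four phases of Algorithm 13 can be followed step by step in a short Python model. This is a sketch under stated assumptions rather than the hardware algorithm: polynomials are coefficient lists with the constant term first, "div" and "mod" are taken to be Python's floor division and non-negative remainder, the helper names are invented here, and the value of $\bar{z}$ is merely an example value chosen for illustration. The BN polynomial and its inverse $g(z) = -324z^4 + 36z^3 + 12z^2 - 6z + 1$ modulo $z^5$ serve as test data.

```python
def poly_mul(a, b):
    """Schoolbook product of two coefficient lists (constant term first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def evaluate(c, x):
    """Value of the polynomial c at the integer x."""
    return sum(ci * x**i for i, ci in enumerate(c))


def to_digits(x, base, n):
    """The n base-`base` digits of x (requires 0 <= x < base**n)."""
    digits = []
    for _ in range(n):
        x, d = divmod(x, base)
        digits.append(d)
    return digits


def hmm_parallel(a, b, f, g, zbar):
    """Sketch of Algorithm 13; returns the n coefficients of r(z)."""
    n = len(f)
    c = poly_mul(a, b)                         # Phase I: degree 2n-2
    for i in range(n):                         # Phase II: partial carry
        c[i + 1] += c[i] // zbar
        c[i] %= zbar
    q = poly_mul(c[:n], g)[:n]                 # Phase III: q = (c mod z^n) g mod z^n
    d = [ci - ti for ci, ti in zip(c, poly_mul(q, f))]
    assert not any(d[:n])                      # division by z^n is exact
    c = d[n:] + [0]                            # n-1 coefficients plus carry slot
    for i in range(n - 1):                     # Phase IV: final carry
        c[i + 1] += c[i] // zbar
        c[i] %= zbar
    return c


f = [1, 6, 24, 36, 36]                         # BN polynomial, n = 5
g = [1, -6, 12, 36, -324]                      # f^-1 mod z^5
zbar = 2**62 + 2**55 + 1                       # example value, an assumption
p = evaluate(f, zbar)
x, y = pow(3, 300, p), pow(5, 200, p)
r = hmm_parallel(to_digits(x, zbar, 5), to_digits(y, zbar, 5), f, g, zbar)
```

The result satisfies $r(\bar{z}) \cdot \bar{z}^{n} \equiv x \cdot y \bmod p$; the top coefficient $r_{n-1}$ may be negative, which is the situation addressed by the bounds that follow.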

For Algorithm 13 to be useful in practice, we need to show that given a bounded input, the algorithm returns an output which is similarly bounded. To this end we require two short lemmata. The first lemma analyzes the coefficient reduction in Phase IV, while the second lemma gives bounds on the input, such that the output satisfies the same bounds.

Lemma 2. Let $c(z) = \sum_{i=0}^{m} c_i z^i \in \mathbb{Z}[z]$ be a polynomial of degree m. Assume that $|c_i| < B$ for $i = 0, \ldots, m-1$ and $|c_m| < D$ for bounds B and D; then the coefficient reduction

for $i = 0$ to $m$ do
  $c_{i+1} \leftarrow c_{i+1} + (c_i \text{ div } \bar{z})$, $c_i \leftarrow c_i \bmod \bar{z}$
end for

results in $0 \le c_i < \bar{z}$ for $i = 0, \ldots, m$ and

$$|c_{m+1}| < \frac{D}{\bar{z}} + \frac{B}{\bar{z}(\bar{z}-1)}.$$

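Proof: Using induction it is easy to see that in step i, the size of $c_i \text{ div } \bar{z}$ is bounded by

$$|c_i \text{ div } \bar{z}| < \sum_{k=1}^{i+1} |c_{i+1-k}|/\bar{z}^k < \sum_{k=1}^{i+1} B/\bar{z}^k < \frac{B}{\bar{z}-1}.$$

The size of $c_m$ in step m − 1 therefore becomes

$$|c_m| < D + \frac{B}{\bar{z}-1},$$

which shows that in step m, we obtain the bound

$$|c_{m+1}| < \frac{D}{\bar{z}} + \frac{B}{\bar{z}(\bar{z}-1)}. \qquad \square$$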

Using Lemma 2 we are now ready to show that given a bounded input, the result of Algorithm 13 is again bounded. Recall that for a polynomial $h = \sum_i h_i z^i \in \mathbb{Z}[z]$, the infinity norm is defined as $\|h\|_\infty = \max_i |h_i|$.

Lemma 3. Let $\tau = \lfloor \log_2(\bar{z}-1) \rfloor$, $\phi = \lceil \log_2 \|f\|_\infty \rceil$ and $\zeta = \lceil \log_2 \|g\|_\infty \rceil$; then if the input $a(z)$, $b(z)$ satisfies

$$|a_i|, |b_i| \le 2^{\tau+\kappa} \text{ for } i = 0, \ldots, n-2,$$

$$|a_{n-1}|, |b_{n-1}| \le 2^{\tau/2+\kappa},$$

for some $\kappa \ge 1$, then $r(z)$ computed by Algorithm 13 satisfies

$$|r_i| \le 2^{\tau+1} \text{ for } i = 0, \ldots, n-2,$$

$$|r_{n-1}| \le 4n^2(2^{\zeta+\phi} + 2^{2\kappa}).$$

This shows that if $\tau \ge 2(\log_2(4n^2(2^{\zeta+\phi} + 2^{2\kappa})) - \kappa)$, then $r(z)$ satisfies the same bounds as the input.


Proof: After step 2, the coefficients $c_i$ are clearly bounded by $n2^{2\tau+2\kappa}$ for $i = 0, \ldots, 2n-3$ and $|c_{2n-2}| \le 2^{\tau+2\kappa}$. Lemma 2 shows that after the first coefficient reduction we have $|c_i| \le \bar{z}$ for $i = 0, \ldots, n-1$ and $|c_n| \le n2^{2\tau+2\kappa+1}$.

The coefficients of $q(z)$ are easily seen to be bounded by $|q_i| \le n2^{\tau+\zeta+1}$, so after step 9 we obtain

$$|c_i| \le n2^{2\tau+2\kappa+1} + n^2 2^{\tau+\zeta+\phi+1} =: B,$$

$$|c_{n-2}| \le 2^{\tau+2\kappa} + n^2 2^{\tau+\zeta+\phi+1} =: D.$$

Applying Lemma 2 again with the above B and D shows that

$$|c_{n-1}| \le \frac{D}{\bar{z}} + \frac{B}{\bar{z}(\bar{z}-1)} \le n^2 2^{\zeta+\phi+2} + n^2 2^{2(\kappa+1)} \le 4n^2(2^{\zeta+\phi} + 2^{2\kappa}). \qquad \square$$

Given a polynomial $f(z)$ it is easy to compute the inverse $g(z) \equiv f(z)^{-1} \bmod z^n$ and thus to determine the exact values for $\phi$ and $\zeta$ in Lemma 3. For each polynomial, we therefore obtain a very explicit lower bound on $\bar{z}$ such that Algorithm 13 can be applied repeatedly.

4.2.2 Digit-serial Version

Algorithm 14 presents the digit-serial version of Algorithm 13, obtained by interleaving polynomial multiplication and reduction. While the parallel version uses the complete polynomial $g(z) \equiv f^{-1}(z) \bmod z^n$, the digit-serial reduction involves only $g_0 = \pm 1$. On the other hand, the parallel version could use Karatsuba’s algorithm [109] in steps 2, 8 and 9 to reduce the number of sub-word multiplications. The coefficient reduction phase is the same as in Algorithm 13.

The following lemma is the analogue of Lemma 3, but for the digit-serial version.

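Lemma 4. Let $\tau = \lfloor \log_2(\bar{z}-1) \rfloor$ and $\phi = \lceil \log_2 \|f\|_\infty \rceil$; then if $n^2 < \bar{z}$ and the input $a(\bar{z})$, $b(\bar{z})$ satisfies

$$|a_i|, |b_i| \le 2^{\tau+\kappa} \text{ for } i = 0, \ldots, n-2,$$

$$|a_{n-1}|, |b_{n-1}| \le 2^{\tau/2+\kappa},$$

for some $\kappa \ge 1$, then $r(z)$ computed by Algorithm 14 satisfies

$$|r_i| \le 2^{\tau+1} \text{ for } i = 0, \ldots, n-2,$$

$$|r_{n-1}| \le 2(n+2)(2^{2\kappa} + 2^{\phi}).$$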


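Algorithm 14 Digit-serial hybrid modular multiplication algorithm.
Input: $a(\bar{z}) = \sum_{i=0}^{n-1} a_i \bar{z}^i$, $b(\bar{z}) = \sum_{i=0}^{n-1} b_i \bar{z}^i$, and modulus $p = f(\bar{z}) = \sum_{i=0}^{n-1} f_i \bar{z}^i$ with $f_0 = \pm 1$.
Output: $r(\bar{z}) \equiv a(\bar{z})b(\bar{z})\bar{z}^{-n} \bmod f(\bar{z})$.
1: $c(z) = \sum_{i=0}^{n-1} c_i z^i \leftarrow 0$.
2: for $i = 0$ to $n - 1$ do
3:   $c(z) \leftarrow c(z) + a(z)b_i$.
4:   $\mu \leftarrow c_0 \text{ div } \bar{z}$, $\gamma \leftarrow c_0 \bmod \bar{z}$.
5:   $h(z) \leftarrow (f_{n-1}z^{n-1} + \cdots + f_1 z + f_0)(-f_0\gamma)$.
6:   $c(z) \leftarrow (c(z) + h(z))/z + \mu$.
7: end for
8: for $i = 0$ to $n - 2$ do
9:   $c_{i+1} \leftarrow c_{i+1} + (c_i \text{ div } \bar{z})$, $c_i \leftarrow c_i \bmod \bar{z}$.
10: end for
Return $r(z) \leftarrow c(z)$.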

This shows that if τ ≥ 2(log2(2(n + 2)(22κ + 2φ)) − κ), then r(z) satisfies thesame bounds as the input.

Proof: The idea of the proof is similar to the proof of Lemma 3, i.e. before the final coefficient reduction we want to show that the coefficient $c_{n-2}$ is small enough to ensure that after reduction, $c_{n-1}$ is small. Note that at the beginning of each iteration in step 2, the degree of $c(z)$ is at most $n-2$, which shows that after step 7 we have $c_{n-2} = a_{n-1}b_{n-1} - \gamma_{n-1}f_{n-1}f_0$. Here $\gamma_i$ denotes the $\gamma$ computed in step 4 in the $i$-th iteration. Since $|\gamma_i| \le |\bar{z}|$, this shows that $|c_{n-2}| \le 2^{\tau+2\kappa} + 2^{\tau+\phi+1}$.

Assume that the coefficients $c_i$ for $i = 0, \ldots, n-3$ are bounded by $B$ (to be determined below) and let $D = 2^{\tau+2\kappa} + 2^{\tau+\phi+1}$, then again we can apply Lemma 2, resulting in

\[ |c_{n-1}| \le \frac{D}{\bar{z}} + \frac{B}{\bar{z}(\bar{z}-1)} \le 2^{2\kappa} + 2^{\phi+1} + \frac{B}{\bar{z}(\bar{z}-1)} . \]

Obtaining the bound $B$ is easy too: denote with $B_i$ the bound on $\|c(z)\|_\infty$ at the start of the $i$-th iteration, then we have $B_0 = 0$ and for $i \ge 1$:

\[ B_i \le B_{i-1}\left(1 + \frac{1}{\bar{z}}\right) + 2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1} . \]

Let $\alpha = (1 + 1/\bar{z})$ and $\beta = 2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1}$, then we need to solve the recursion $B_i \le \alpha B_{i-1} + \beta$ with $B_0 = 0$. This results in the bound $B_n \le \beta(\alpha^n - 1)/(\alpha - 1)$ and since $\alpha = 1 + 1/\bar{z}$ this becomes $B_n \le \beta\bar{z}(\alpha^n - 1)$. Writing out $\alpha^n - 1$ explicitly gives

\[ \alpha^n - 1 = \sum_{k=1}^{n} \binom{n}{k} \frac{1}{\bar{z}^k} < \frac{n}{\bar{z}} + \sum_{j=2}^{\infty} \frac{n^j}{j!\,\bar{z}^j} . \]

Assuming that $n^2 \le \bar{z}$, we can easily bound the sum above by $(e-1)/\bar{z}$, so $\alpha^n - 1 \le (n + e - 1)/\bar{z}$. So we finally obtain

\[ B = B_n \le (n + e - 1)\left(2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1}\right) . \]

As a result, the bound on $c_{n-1}$ becomes

\[ |c_{n-1}| \le 2^{2\kappa} + 2^{\phi+1} + (n + e - 1)\left(2^{2\kappa+1} + 2^{\phi-\tau+1}\right) \le (n+2)\left(2^{2\kappa+1} + 2^{\phi+1}\right) . \]

□

It is important to point out that both Algorithm 13 and Algorithm 14 may involve negative numbers as intermediate or final results. As a result, a direct use of these algorithms requires the handling of negative numbers. In hardware implementations, one can avoid signed multiplication by adding a multiple of $f(\bar{z})$ to $r(\bar{z})$. We describe this method in an example in Section 4.3.4.
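The steps of Algorithm 14 can be sketched in a few lines of Python. This is a minimal illustration, not the hardware datapath: the function names `hmm_digit_serial` and `evaluate` are hypothetical, Python's arbitrary-precision signed integers stand in for the sub-word arithmetic, and `divmod` is assumed to play the role of div and mod (it floors, so $0 \le \gamma < \bar{z}$). The BN polynomial and $\bar{z} = 2^{63} + 857$ from Section 4.3 are used only as sample inputs.

```python
def evaluate(coeffs, x):
    """Evaluate a polynomial given by its coefficient list (lowest degree first) at x."""
    return sum(ci * x**i for i, ci in enumerate(coeffs))


def hmm_digit_serial(a, b, f, z_bar):
    """Sketch of Algorithm 14.

    Returns coefficients r such that
    r(z_bar) * z_bar**n == a(z_bar) * b(z_bar)  (mod f(z_bar)).
    """
    n = len(a)
    assert len(b) == n and len(f) == n and f[0] in (1, -1)
    c = [0] * n
    for i in range(n):
        # Step 3: c(z) <- c(z) + a(z) * b_i
        c = [cj + aj * b[i] for cj, aj in zip(c, a)]
        # Step 4: c_0 = mu * z_bar + gamma
        mu, gamma = divmod(c[0], z_bar)
        # Step 5: h(z) <- f(z) * (-f_0 * gamma); since f_0^2 = 1, c_0 + h_0 = mu * z_bar
        m = -f[0] * gamma
        h = [fj * m for fj in f]
        # Step 6: the constant term mu * z_bar is worth mu after dividing by z
        c = [c[j + 1] + h[j + 1] for j in range(n - 1)] + [0]
        c[0] += mu
    # Steps 8-10: final coefficient reduction (carry propagation)
    for i in range(n - 1):
        carry, c[i] = divmod(c[i], z_bar)
        c[i + 1] += carry
    return c


# Sample run with the BN polynomial f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1.
z_bar = 2**63 + 857
f = [1, 6, 24, 36, 36]
p = evaluate(f, z_bar)
a = [2**64 - 3, 12345, 2**63 + 1, 7, 2**31]
b = [99, 2**64 - 1, 5, 2**62, 2**30]
r = hmm_digit_serial(a, b, f, z_bar)
```

The check `(evaluate(r, z_bar) * z_bar**5 - evaluate(a, z_bar) * evaluate(b, z_bar)) % p == 0` confirms the Montgomery-style output relation for this input. The sketch does not enforce the coefficient bounds of Lemma 4; it only shows the data flow.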

4.2.3 Faster Coefficient Reduction

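Algorithms 13 and 14 require division by $\bar{z}$. As in Chung-Hasan's algorithm, division can be performed with Barrett's reduction algorithm [46]. However, the complexity of the division step can be reduced if $\bar{z}$ is a pseudo-Mersenne number. Algorithm 15 replaces division by $\bar{z}$ with multiplication by $s$ for $\bar{z} = 2^\tau + s$, where $s$ is small.

Algorithm 15 Division by $\bar{z} = 2^\tau + s$.

Input: $a$, $\bar{z} = 2^\tau + s$ with $0 < s < 2^{\lfloor \tau/2 \rfloor}$.
Output: $(\mu, \epsilon)$ with $a = \mu\bar{z} + \epsilon$, $|\epsilon| < \bar{z}$.

1: $\mu \leftarrow 0$, $\epsilon \leftarrow a$.
2: while $|\epsilon| \ge \bar{z}$ do
3:   $\rho \leftarrow \epsilon \text{ div } 2^\tau$, $\epsilon \leftarrow \epsilon \bmod 2^\tau$.
4:   $\mu \leftarrow \mu + \rho$, $\epsilon \leftarrow \epsilon - s\rho$.
5: end while

Return $\mu$, $\epsilon$.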
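As an illustration of Algorithm 15, the following Python sketch performs the division with shifts, masks and one small multiplication per iteration. The function name `div_pseudo_mersenne` is hypothetical, and the sketch assumes that div and mod by $2^\tau$ follow Python's conventions (`>>` floors and `& mask` returns a non-negative remainder, also for negative inputs). Each iteration preserves the invariant $a = \mu\bar{z} + \epsilon$, since $\rho 2^\tau = \rho\bar{z} - s\rho$.

```python
def div_pseudo_mersenne(a, tau, s):
    """Sketch of Algorithm 15: return (mu, eps) with a == mu * z_bar + eps
    and |eps| < z_bar, where z_bar = 2**tau + s."""
    z_bar = (1 << tau) + s
    mask = (1 << tau) - 1
    mu, eps = 0, a
    while abs(eps) >= z_bar:
        rho = eps >> tau                # eps div 2^tau
        eps = (eps & mask) - s * rho    # (eps mod 2^tau) - s * rho
        mu += rho
    return mu, eps


# Sample parameters taken from the BN setting of this chapter: z_bar = 2^63 + 857.
mu, eps = div_pseudo_mersenne(2**140 + 12345, 63, 857)
```

Note that the remainder may be negative, which matches the output condition $|\epsilon| < \bar{z}$ rather than $0 \le \epsilon < \bar{z}$.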

The following lemma gives the maximum number of iterations for Algorithm 15 to finish.

Lemma 5. Define $\pi$ such that $|a| \le 2^\pi$ and $\nu$ such that $0 < s \le 2^{\nu-1}$, then Algorithm 15 requires at most $\lceil \frac{\pi-\tau-1}{\tau-\nu} \rceil + 1$ iterations.

Proof: After one iteration we have

\[ |\epsilon| = |\epsilon - s\rho| \le |\epsilon| + s|\rho| \le (2^\tau - 1) + 2^{\pi+\nu-1-\tau} . \]

To bound the right hand side, we need to consider two cases:

• $(2\tau - \nu + 1) \le \pi$, then $|\epsilon| \le 2^{\pi-\tau+\nu}$, so in each iteration the bit length is decreased by $\tau - \nu$.

• $(2\tau - \nu + 1) > \pi$, then $|\epsilon| < 2^{\tau+1}$, so either $\epsilon$ is reduced or one last step is needed:

\[ |\epsilon| = |\epsilon - s\rho| \le (2^\tau - 1) + s < \bar{z} . \]

In conclusion: $\lceil \frac{\pi-(2\tau-\nu+1)}{\tau-\nu} \rceil$ iterations are needed to reduce to the second case and then a maximum of two iterations is needed to finish reduction in the second case. In total, the algorithm thus requires $\lceil \frac{\pi-\tau-1}{\tau-\nu} \rceil + 1$ iterations. □

When using Algorithm 15 to perform steps 4 and 9 in Algorithm 14, at most three iterations are required. This is guaranteed if $|c_i| \le 2^{3\tau-2\nu+1}$ for $0 \le i \le n-1$. Recall that $c_i$ satisfies the following bounds:

\[ |c_i| \le B = (n + e - 1)\left(2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1}\right) , \]

for $0 \le i < n-2$, and

\[ |c_{n-2}| \le D = 2^{\tau+2\kappa} + 2^{\tau+\phi+1} . \]

Since $B > D$, it is sufficient to ensure $B \le 2^{3\tau-2\nu+1}$. Since we typically have $\phi \le \tau$,

\[ \log_2 B \le \log_2(n + e - 1) + 2\tau + 2\kappa + 2 . \]

By choosing $\tau \ge 2(\kappa + \nu) + \log_2(n + e - 1) + 1$, we ensure

\[ \log_2 B \le 3\tau - 2\nu + 1 . \]

4.2.4 Complexity

We compare the complexity of the proposed algorithm with the algorithms by Barrett and Montgomery. Here $\tau$, $\phi$ are defined as in Lemma 3, namely, $\tau = \lfloor \log_2(\bar{z} - 1) \rfloor$, $\phi = \lceil \log_2 \|f\|_\infty \rceil$, $\bar{z} = 2^\tau + s$ where $0 < s \le 2^{\nu-1}$. The complexity of each step in Algorithm 14 is as follows.

• Step 3: for $i \le n-2$, computing $a(z)b_i$ involves one $(\lceil \frac{\tau}{2} \rceil + \kappa) \times (\tau + \kappa)$ and $(n-1)$ times $(\tau + \kappa) \times (\tau + \kappa)$ multiplications. For $i = n-1$, computing $a(z)b_i$ involves one $(\lceil \frac{\tau}{2} \rceil + \kappa) \times (\lceil \frac{\tau}{2} \rceil + \kappa)$ and $(n-1)$ times $(\lceil \frac{\tau}{2} \rceil + \kappa) \times (\tau + \kappa)$ multiplications.

• Step 5: computing $f(z)(-f_0\gamma)$ needs $(n-1)$ times $\tau \times \phi$ multiplications.

• Steps 4 and 9: as discussed above, three iterations are required at most. The bit length of $\rho$ is at most $\log_2(n + e - 1) + \tau + 2\kappa + 2$ in the first iteration and at most $\log_2(n + e - 1) + \nu + 2\kappa + 3$ in the second iteration. For the third iteration, $|\rho| \le 1$ is guaranteed. The cost of step 4 or 9 is one $\nu \times (\tau + \lceil \log_2(n + e - 1) \rceil + 2\kappa + 2)$ multiplication and one $\nu \times (\lceil \log_2(n + e - 1) \rceil + \nu + 2\kappa + 3)$ multiplication.

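Table 4.1 compares the complexity of the different multiplication algorithms. Note that $\phi$ and $\nu$ are chosen to be small, so multiplication with $f(z)$ and $s$ can be efficiently performed. A numerical example is given in Table 4.3.

Table 4.1: Complexity comparison of different modular multiplication algorithms ($\bar{z} = 2^\tau + s$, $0 < s \le 2^{\nu-1}$, $\phi = \lceil \log_2 \|f\|_\infty \rceil$, $\delta_1 = \tau + \lceil \log_2(n + e - 1) \rceil + 2\kappa + 2$, $\delta_2 = \lceil \log_2(n + e - 1) \rceil + \nu + 2\kappa + 3$). Each column gives the number of multiplications of the stated operand size.

| Algorithm | $(\lceil \frac{\tau}{2} \rceil + \kappa) \times (\lceil \frac{\tau}{2} \rceil + \kappa)$ | $(\lceil \frac{\tau}{2} \rceil + \kappa) \times (\tau + \kappa)$ | $(\tau + \kappa) \times (\tau + \kappa)$ | $\phi \times \tau$ | $\delta_1 \times \nu$ | $\delta_2 \times \nu$ |
| Barrett | | | $(n-1)(2n-1)$ | | | |
| Montgomery | | | $(n-1)(2n-1)$ | | | |
| HMM (Alg. 14) | 1 | $2(n-1)$ | $(n-1)^2$ | $n(n-1)$ | $2n-1$ | $2n-1$ |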
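The entries of Table 4.1 are simple functions of $n$, $\tau$, $\nu$ and $\kappa$, so they can be evaluated directly for a given parameter set. The following Python sketch does this; the function and key names are hypothetical labels chosen for illustration. For $n = 7$, $\tau = 39$, $\nu = 12$, $\kappa = 1$ it reproduces the Example 1 row of Table 4.3.

```python
import math


def hmm_counts(n):
    """Multiplication counts of HMM (Algorithm 14) per operand size, as in Table 4.1."""
    return {
        "half_x_half": 1,             # (ceil(tau/2)+kappa) x (ceil(tau/2)+kappa)
        "half_x_full": 2 * (n - 1),   # (ceil(tau/2)+kappa) x (tau+kappa)
        "full_x_full": (n - 1) ** 2,  # (tau+kappa) x (tau+kappa)
        "phi_x_tau": n * (n - 1),
        "delta1_x_nu": 2 * n - 1,
        "delta2_x_nu": 2 * n - 1,
    }


def montgomery_count(n):
    """Sub-word multiplications of Montgomery (or Barrett): 2w^2 + w with w = n - 1."""
    return (n - 1) * (2 * n - 1)


def deltas(n, tau, nu, kappa):
    """Operand widths delta_1 and delta_2 used in the coefficient reduction."""
    lg = math.ceil(math.log2(n + math.e - 1))
    return tau + lg + 2 * kappa + 2, lg + nu + 2 * kappa + 3


example1 = (hmm_counts(7), montgomery_count(7), deltas(7, 39, 12, 1))
```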

The complexity of the HMM algorithm highly depends on the chosen curve parameters since they determine $\{\tau, \nu, n, \phi, \kappa\}$. Each family of pairing-friendly curves is defined by a fixed polynomial $f(z)$, so $n$ and $\phi$ are constant for a given family. The size of $\tau$ is set by the desired security level. To achieve lower complexity, it is thus desirable to choose small $\nu$ and $\kappa$. According to Lemma 4, $\kappa$ can be as small as 1 and the size of $\nu$ is determined by $s$ where $\bar{z} = 2^\tau + s$.

Computing $ABR^{-1} \bmod M$ using conventional Montgomery multiplication requires $2w^2 + w$ sub-word multiplications (see [38] for details), where $w$ is the number of digits in $A$ and $B$. Note that $w = n - 1$ here since we need one more digit to represent $A(z)$ or $B(z)$. Barrett multiplication has approximately the same complexity as Montgomery's algorithm [46].

Compared with $\tau$, the parameters $\phi$ and $\nu$ are much smaller. This makes the polynomial reduction and coefficient reduction very efficient. In Section 4.3.1, we give some sample parameters used for pairings and give a numerical comparison of the complexities of the different multiplication algorithms.

4.3 High Performance Pairing Processor Using HMM

A bilinear pairing is a map $G_1 \times G_2 \to G_T$ where $G_1$ and $G_2$ are typically additive groups and $G_T$ is a multiplicative group, and the map is linear in each component. We refer to Chapter 2 (Section 2.2.5) for the definition of pairings. Since the arithmetic of pairings is considerably more complicated than that of popular PKC schemes such as ECC and RSA, improving the performance of pairing computation has drawn substantial research interest. The pairing computation consists of two main functions, $f_{(r,Q)}$ and $f^{(p^k-1)/\ell}$, where $r$ is the length of Miller's loop and $k$ is the embedding degree. We summarize here the main advancements reported in recent publications:

• Optimal pairing. The basic idea of optimal pairing [162] is to have a minimum loop length in Miller's algorithm. Pairings that achieve the minimum loop length are the R-ate pairing [121] and the optimal Eta pairing [4].

• Pairing-friendly curves. An elliptic curve with a small embedding degree and a large prime-order subgroup is called pairing-friendly. Freeman [71] gives a comprehensive study of the known pairing-friendly curves.

• Fast arithmetic in $\mathbb{F}_{p^k}$. Using carefully selected tower extension fields reduces the complexity of Miller's loop and the final exponentiation [54, 5].

• Lazy reduction. For expressions like $\sum A_i B_j$, where $A_i, B_j \in \mathbb{F}_p$, only one reduction is actually needed, which brings a significant complexity reduction [154, 5].

• Architecture exploration. In hardware implementations, different architectures such as the ASIP [107] and the Karatsuba multiplier [19, 20] have been proposed to achieve a high performance.

In this section, we apply the HMM algorithm to pairing computation. We consider a class of curves defined over a prime $p$ of polynomial form. One important example of this class is the family of Barreto-Naehrig (BN) curves defined over $\mathbb{F}_p$ where $p = 36\bar{z}^4 + 36\bar{z}^3 + 24\bar{z}^2 + 6\bar{z} + 1$. Existing techniques to speed up arithmetic in extension fields (see [54, 53] for fast arithmetic in $\mathbb{F}_{p^2}$, $\mathbb{F}_{p^6}$ and $\mathbb{F}_{p^{12}}$) can be used together with our construction. We also show how to choose parameters for BN curves to obtain a significant improvement on the performance in hardware of the ate and optimal ate pairing.

4.3.1 Pairing-friendly Curves

An elliptic curve $E$ over $\mathbb{F}_p$ is called pairing-friendly whenever there exists a large prime $r \mid \#E(\mathbb{F}_p)$ with $r \ge \sqrt{p}$ and the embedding degree $k$ is small enough, e.g. $k \le \log_2(r)/8$. Furthermore, if $\#E(\mathbb{F}_p) = p + 1 - t$ with $t$ the trace of Frobenius, then $t^2 - 4p$ should have a very small squarefree part (e.g. less than $10^{10}$) to be able to find the equation of the curve using the complex multiplication (CM) method [8]. These restrictions imply that pairing-friendly curves are hard to find and several specialized constructions have been proposed (see Freeman [71] for an excellent overview).

Many construction methods result in a parameterized family of elliptic curves, i.e. $r$ and $p$ are given by the evaluation of polynomials $r(z)$ and $f(z)$ at an integer value $\bar{z}$. To illustrate this type of curve, we provide several important examples of complete families of curves.

Example 1: 112-bit security level. Consider the following family with $k = 8$ given in [71, Construction 6.10]: The polynomials

\[ f(z) = \frac{1}{4}\left(81z^6 + 54z^5 + 45z^4 + 12z^3 + 13z^2 + 6z + 1\right) \]

\[ r(z) = 9z^4 + 12z^3 + 8z^2 + 4z + 1 \]

define a family of pairing-friendly curves with $k = 8$ and CM by $-1$, which admits quartic twists. Furthermore, finding the equation of the curve is straightforward due to CM by $-1$. Given a $\bar{z}$-value such that $r(\bar{z})$ is a 224-bit prime, this family provides 112-bit security.

Example 2: 128-bit security level. The Barreto-Naehrig curves [144] are by far the most important family of pairing-friendly curves for the 128-bit security level. These curves have $k = 12$ and are defined by

\[ f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1 \]

\[ r(z) = 36z^4 + 36z^3 + 18z^2 + 6z + 1 . \]

Furthermore, since these curves have CM by $-3$, the equation for such a curve is easy to find. Finally, they also admit sextic twists, which speeds up all types of pairings. Given a $\bar{z}$-value such that $r(\bar{z})$ is a 256-bit prime, this family provides 128-bit security.
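To make the parameterization concrete, the BN polynomials can be evaluated for a specific $\bar{z}$. The Python sketch below uses $\bar{z} = 2^{63} + 857$, the value this chapter selects for BN curves; the function names `bn_p` and `bn_r` are hypothetical. The sketch only evaluates the polynomials and their bit lengths and does not test primality.

```python
def bn_p(z):
    """Field characteristic p = f(z) of a BN curve."""
    return 36 * z**4 + 36 * z**3 + 24 * z**2 + 6 * z + 1


def bn_r(z):
    """Group order r(z) of a BN curve."""
    return 36 * z**4 + 36 * z**3 + 18 * z**2 + 6 * z + 1


z_bar = 2**63 + 857
p, r = bn_p(z_bar), bn_r(z_bar)
t = p + 1 - r  # trace of Frobenius, since #E(F_p) = p + 1 - t = r

print(p.bit_length(), r.bit_length())
```

The two polynomials differ only in the $z^2$ coefficient, so the trace is $t = 6\bar{z}^2 + 1$.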


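Algorithm 16 Computing the optimal ate pairing on BN curves [54].

Input: $P \in E(\mathbb{F}_p)[r]$, $Q \in E(\mathbb{F}_{p^k})[r] \cap \mathrm{Ker}(\pi_p - [p])$ and $a = 6\bar{z} + 2$.
Output: $R_a(Q, P)$.

1: $a = \sum_{i=0}^{L-1} a_i 2^i$.
2: $T \leftarrow Q$, $f \leftarrow 1$.
3: for $i = L-2$ downto 0 do
4:   $T \leftarrow 2T$.
5:   $f \leftarrow f^2 \cdot l_{T,T}(P)$.
6:   if $a_i = 1$ then
7:     $T \leftarrow T + Q$.
8:     $f \leftarrow f \cdot l_{T,Q}(P)$.
9:   end if
10: end for
11: $f \leftarrow \left(f \cdot (f \cdot l_{aQ,Q}(P))^p \cdot l_{\pi(aQ+Q),aQ}(P)\right)^{(p^k-1)/r}$.

Return $f$.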

Example 3: 256-bit security level. For very high security levels, the BN curves are somewhat ill-adapted and it is better to use the following cyclotomic family with $k = 24$ [36]. Recall that the 24-th cyclotomic polynomial is $\Phi_{24}(z) = z^8 - z^4 + 1$, then

\[ f(z) = \frac{1}{3}(z - 1)^2\Phi_{24}(z) + z , \qquad r(z) = \Phi_{24}(z) . \]

Again, this family has CM by $-3$ so sextic twists are possible. Given a $\bar{z}$-value such that $r(\bar{z})$ is a 512-bit prime, this family provides 256-bit security.

4.3.2 Pairing Computation

Algorithm 16 describes the computation of the optimal ate pairing for BN curves. The algorithms for Tate and ate pairings are similar, and can be found in [54].

Figure 4.1 shows the computation of an optimal ate pairing. In Miller's loop, the main computation is point addition/doubling ($T \leftarrow 2T$), line evaluation ($l_{T,T}(P)$ and $l_{T,Q}(P)$), and multiplication in $\mathbb{F}_{p^{12}}$. The final exponentiation can be broken down to a sequence of multiplications and one inversion in $\mathbb{F}_{p^{12}}$. Essentially, all the operations in the extension field are performed with a sequence of $\mathbb{F}_p$ operations, among which the multiplication is the most important one. Thus, having an efficient multiplier is the most important and effective way to speed up the pairing computation.


Figure 4.1: Optimal ate pairing: computation hierarchy.

4.3.3 Parameter Selection for Pairing-friendly Curves

The selection of parameters has a substantial impact on the security and performance of a pairing. For example, the underlying field, the type of curve, the order of $G_1$, $G_2$ and $G_T$ should be carefully chosen such that it offers sufficient security, but is still efficient to compute. Also, as analyzed in [87], the characteristic of the underlying field $\mathbb{F}_p$ determines not only the security level of a pairing, but also the computational complexity of Miller's loop and the final exponentiation.

In this section, we present values for $\bar{z}$ for each of the three examples given in Section 4.3.1. Note that for the first (resp. third) family, $f(z)$ does not have integral coefficients, but $4f(z)$ (resp. $3f(z)$) does, so we simply work modulo these polynomials.

Table 4.2 contains values $\bar{z}$ that lead to efficient instantiations of the HMM algorithm. Furthermore, the bit-length of $\bar{z}$ is chosen to reflect the ideal security level at which the respective families of curves should be used.

Table 4.3 compares the complexity of HMM and Montgomery's algorithm using parameter sets of Table 4.2. In hardware designs, one can customize the multipliers for specific operand sizes. HMM takes advantage of this freedom to reduce the complexity of the multiplier. For example, a $6 \times 63$-bit multiplier is much smaller and faster than a $64 \times 64$-bit multiplier. Compared with Montgomery multiplication, the HMM has a significantly lower complexity in the context of hardware implementation.

Table 4.2: Selection of $\bar{z}$ for curves ($\bar{z} = 2^\tau + s$, $0 < s \le 2^{\nu-1}$, $\phi = \lceil \log_2 \|f\|_\infty \rceil$).

| Family | $k$ | $\bar{z}$ | $\tau$ | $\phi$ | $\nu$ | $\lceil \log_2 p \rceil$ | $\lceil \log_2 r \rceil$ |
| Example 1 | 8 | $2^{39} + 1175$ | 39 | 7 | 12 | 239 | 159 |
| Example 2 | 12 | $2^{63} + 857$ | 63 | 6 | 11 | 258 | 258 |
| Example 3 | 24 | $2^{64} + 11757$ | 64 | 1 | 15 | 639 | 513 |

Note that the complexity of HMM can be reduced further by exploring the characteristics of the coefficients of $f(z)$. For example, BN curves have $f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1$, and $f(z)(-f_0\gamma)$ can be computed as follows:

• Step 1: $6\gamma = 2^2\gamma + 2\gamma$ ;

• Step 2: $24\gamma = 2^2(6\gamma)$ ;

• Step 3: $36\gamma = 24\gamma + 2(6\gamma)$ ;

Instead of performing four $63 \times 6$ multiplications in each iteration, four shifts and two additions are performed. Example 3 has $\phi = 1$, resulting in even larger savings in the polynomial reduction step, since no multiplications are required, as reflected in Table 4.3 by the $1 \times 64$ cell.
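The three steps above translate directly into shift-and-add code. The following Python sketch is an illustration only; the function name `bn_f_times_gamma` is hypothetical, and the sign factor $-f_0$ is left out so that the multiples of $\gamma$ stay visible. It uses four shifts and two additions, as counted in the text.

```python
def bn_f_times_gamma(gamma):
    """Coefficients of f(z) * gamma for the BN polynomial
    f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1, lowest degree first,
    computed with shifts and additions only."""
    g6 = (gamma << 2) + (gamma << 1)   # Step 1: 6*gamma  = 2^2*gamma + 2*gamma
    g24 = g6 << 2                      # Step 2: 24*gamma = 2^2*(6*gamma)
    g36 = g24 + (g6 << 1)              # Step 3: 36*gamma = 24*gamma + 2*(6*gamma)
    return [gamma, g6, g24, g36, g36]


coeffs = bn_f_times_gamma(12345)
```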

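Table 4.3: Multiplication complexity for each set of parameters. Each column gives the number of multiplications of the stated operand size (in bits).

Example 1: 80-bit security ($n = 7$, $\tau = 39$, $\nu = 12$, $\phi = 7$, $\kappa = 1$, $\delta_1 = 47$, $\delta_2 = 21$)

| Algorithm | 21 × 21 | 21 × 40 | 40 × 40 | 7 × 39 | 47 × 12 | 21 × 12 |
| Montgomery | | | 78 | | | |
| HMM (Alg. 14) | 1 | 12 | 36 | 42 | 13 | 13 |

Example 2: 128-bit security ($n = 5$, $\tau = 63$, $\nu = 11$, $\phi = 6$, $\kappa = 1$, $\delta_1 = 70$, $\delta_2 = 19$)

| Algorithm | 33 × 33 | 33 × 64 | 64 × 64 | 6 × 63 | 70 × 11 | 19 × 11 |
| Montgomery | | | 36 | | | |
| HMM (Alg. 14) | 1 | 8 | 16 | 20 | 9 | 9 |

Example 3: 256-bit security ($n = 11$, $\tau = 64$, $\nu = 15$, $\phi = 1$, $\kappa = 1$, $\delta_1 = 72$, $\delta_2 = 24$)

| Algorithm | 33 × 33 | 33 × 65 | 65 × 65 | 1 × 64 | 72 × 15 | 24 × 15 |
| Montgomery∗ | | | 210 | | | |
| HMM (Alg. 14) | 1 | 20 | 100 | 110 | 21 | 21 |

∗ 64 bits are used for each digit (10 × 64 = 640), thus a 64 × 64-bit multiplier is used.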


Algorithm 17 Parallel hybrid modular multiplication algorithm for BN curves.

Input: positive integers $a = \sum_{i=0}^{4} a_i \bar{z}^i$, $b = \sum_{i=0}^{4} b_i \bar{z}^i$, modulus $p = f(\bar{z}) = 36\bar{z}^4 + 36\bar{z}^3 + 24\bar{z}^2 + 6\bar{z} + 1$.
Output: polynomial $r(\bar{z}) \equiv a(\bar{z})b(\bar{z})\bar{z}^{-5} \bmod p$.

1: Phase I: Polynomial Multiplication
2: $c(z) = \sum_{i=0}^{8} c_i z^i \leftarrow a(z)b(z)$.
3: Phase II: Coefficient Reduction
4: for $i = 0$ to 4 do
5:   $c_{i+1} \leftarrow c_{i+1} + (c_i \text{ div } \bar{z})$, $c_i \leftarrow c_i \bmod \bar{z}$.
6: end for
7: Phase III: Polynomial Reduction
8: $q(z) = \sum_{i=1}^{4} q_i z^i \leftarrow (-c_4 + 6(c_3 - 2c_2 - 6(c_1 - 9c_0)))z^4$
     $+ (-c_3 + 6(c_2 - 2c_1 - 6c_0))z^3$
     $+ (-c_2 + 6(c_1 - 2c_0))z^2$
     $+ (-c_1 + 6c_0)z$.
9: $h(z) = \sum_{i=0}^{3} h_i z^i \leftarrow (36q_4)z^3$
     $+ 36(q_4 + q_3)z^2$
     $+ 12(2q_4 + 3(q_3 + q_2))z$
     $+ 6(q_4 + 4q_3 + 6(q_2 + q_1))$.
10: $v(z) \leftarrow c(z)/z^5 + h(z)$.
11: Phase IV: Coefficient Reduction
12: for $i = 0$ to 3 do
13:   $v_{i+1} \leftarrow v_{i+1} + (v_i \text{ div } \bar{z})$, $v_i \leftarrow v_i \bmod \bar{z}$.
14: end for

Return $r(z) \leftarrow v(z)$.
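The four phases of Algorithm 17 can be checked with a short Python model. This is a sketch under stated assumptions, not a description of the hardware: the names `hmm_bn` and `evaluate` are hypothetical, `divmod` is assumed for div and mod, and $c(z)/z^5$ is taken to mean the upper coefficients $c_5, \ldots, c_8$ (the lower five coefficients are cancelled by the polynomial reduction).

```python
def evaluate(coeffs, x):
    """Evaluate a coefficient list (lowest degree first) at x."""
    return sum(ci * x**i for i, ci in enumerate(coeffs))


def hmm_bn(a, b, z_bar):
    """Sketch of Algorithm 17 for five-coefficient inputs a and b."""
    # Phase I: schoolbook polynomial multiplication, c has 9 coefficients
    c = [0] * 9
    for i in range(5):
        for j in range(5):
            c[i + j] += a[i] * b[j]
    # Phase II: coefficient reduction of c_0 .. c_4
    for i in range(5):
        carry, c[i] = divmod(c[i], z_bar)
        c[i + 1] += carry
    c0, c1, c2, c3, c4 = c[:5]
    # Phase III: polynomial reduction
    q1 = -c1 + 6 * c0
    q2 = -c2 + 6 * (c1 - 2 * c0)
    q3 = -c3 + 6 * (c2 - 2 * c1 - 6 * c0)
    q4 = -c4 + 6 * (c3 - 2 * c2 - 6 * (c1 - 9 * c0))
    h = [
        6 * (q4 + 4 * q3 + 6 * (q2 + q1)),
        12 * (2 * q4 + 3 * (q3 + q2)),
        36 * (q4 + q3),
        36 * q4,
    ]
    v = [c[5 + i] + h[i] for i in range(4)] + [0]
    # Phase IV: coefficient reduction of v_0 .. v_3
    for i in range(4):
        carry, v[i] = divmod(v[i], z_bar)
        v[i + 1] += carry
    return v


z_bar = 2**63 + 857
p = evaluate([1, 6, 24, 36, 36], z_bar)
a = [2**64 - 3, 12345, 2**63 + 1, 7, 2**31]
b = [99, 2**64 - 1, 5, 2**62, 2**30]
r = hmm_bn(a, b, z_bar)
```

For these inputs, $r(\bar{z})\,\bar{z}^5 - a(\bar{z})b(\bar{z})$ is a multiple of $p$, which is the output condition of the algorithm.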

4.3.4 Application to BN Curves

In this section, we propose an architecture for HMM in hardware. Note that we described a digit-serial architecture for the HMM algorithm in [66] for BN curves, where polynomial multiplication and reduction are interleaved. In the architecture proposed here, polynomial multiplication and reduction are separated as in Algorithm 13. We show that this architecture is flexible and can achieve higher throughput than the one from [66]. We choose the same parameters as [66], namely, $\bar{z} = 2^{63} + s$ and $s = 2^9 + 2^8 + 2^6 + 2^4 + 2^3 + 1$. With these parameters, the implementation achieves 128-bit security.

Algorithm 17 shows the HMM algorithm for BN curves. Since $f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1$, we have $g(z) \equiv -f^{-1}(z) \equiv 324z^4 - 36z^3 - 12z^2 + 6z - 1 \bmod z^5$. Note that both $f(z)$ and $g(z)$ have relatively small coefficients. As a result, the polynomial reduction phase can be efficiently implemented.
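The stated inverse can be verified by multiplying $f(z)$ and $g(z)$ and truncating at $z^5$: the product should be $-1$. A minimal Python check follows; the helper name `polymul_trunc` is hypothetical.

```python
def polymul_trunc(u, v, n):
    """Product of two coefficient lists (lowest degree first) modulo z^n."""
    out = [0] * n
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            if i + j < n:
                out[i + j] += ui * vj
    return out


f = [1, 6, 24, 36, 36]        # f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1
g = [-1, 6, -12, -36, 324]    # g(z) = 324z^4 - 36z^3 - 12z^2 + 6z - 1
product = polymul_trunc(f, g, 5)
s = 2**9 + 2**8 + 2**6 + 2**4 + 2**3 + 1   # the chosen s, so that z_bar = 2^63 + s
```

The same session also confirms that the chosen $s$ equals 857, the value listed for Example 2 in Table 4.2.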


4.3.5 HMM Multiplier

Figure 4.2 shows the architecture of the multiplier. It consists of a row of multipliers to carry out polynomial multiplication, five "Mod-1" blocks to perform the first coefficient reduction, a module to accumulate the partial products, a module to perform polynomial reduction and four "Mod-2" blocks to perform the second coefficient reduction. Figure 4.2 also gives the bit-length of input/output of each block.

Phase I. Five integer multipliers, including four $65 \times 65$ and one $65 \times 32$ multipliers, are used to carry out the polynomial multiplication. Clearly, given enough area, the polynomial multiplication stage can be fully parallelized (using 13 multipliers [137]). The architecture used here is a tradeoff between area and throughput. Using the schoolbook method, 5 cycles are required to finish Phase I. Each $65 \times 65$ multiplier is implemented using a two-level Karatsuba method, and each multiplication has a delay of 7 cycles.

Phase II. The partial products ($a_i b_j$, $0 \le i, j < n$) are reduced immediately after they are generated. Figure 4.2(d) shows the structure of the "Mod-1" block. Since $s = 2^5 \cdot (2^4 + 2^3) + 2^6 + (2^4 + 2^3) + 1$, multiplication by $s$ is implemented with four additions. The output of Phase II is then accumulated and shifted in the accumulator, shown in Figure 4.2(f). Note that the output buffers of the adders, except the one on the rightmost side, should be set to zero for each $\mathbb{F}_p$ multiplication.

The "Mod-1" block only performs one round of reduction, thus it does not give fully reduced results. It is easy to see that the output of "Mod-1" is always less than $2^{77}$. We shall see this is good enough for this architecture.

Phase III. Once the partially reduced results $(c_8, \ldots, c_0)$ are ready, Phase III is applied. The polynomial reduction is performed with only addition and shift operations, e.g. $6\alpha = 2^2\alpha + 2\alpha$, $9\alpha = 2^3\alpha + \alpha$ and $36\alpha = 2^5\alpha + 2^2\alpha$.

Since the output of "Mod-1" is less than $2^{77}$, we have $|c_i| < (i+1) \cdot 2^{77}$ for $0 \le i \le 4$. In the computation of $q(z)$ in Algorithm 17, $q_4 = -c_4 + 6(c_3 - 2c_2 - 6(c_1 - 9c_0))$, resulting in $|q_4| < (5 + 6 \cdot (4 + 2 \cdot 3 + 6 \cdot (2 + 9)))2^{77} = 461 \cdot 2^{77}$, and hence $|36q_4| < 16596 \cdot 2^{77}$. Likewise, we can compute the size of $h_i$ and $v_i$ for $0 \le i \le 3$. One can verify that $v(z)$ has coefficients $|v_i| < 2^{92}$ for $0 \le i \le 3$.

Phase IV. The "Mod-2" block is implemented in the same way as "Mod-1". However, the input of "Mod-2" is only 93 bits long. Thus, the output of the proposed multiplier has the following bounds: $|r_i| < 2^{63} + 2^{41}$, $0 \le i \le 3$, and $|r_4| < 2^{30}$.


Figure 4.2: $\mathbb{F}_p$ multiplier using the HMMB algorithm.


Note that the resulting polynomial $r(z)$ may have negative coefficients. For future multiplications, negative coefficients as input are not desirable. We can ensure positive coefficients by adding to $r(z)$ the following polynomial:

\[ l(z) = (36\vartheta - 2)z^4 + (36\vartheta + 2\bar{z} - 2)z^3 + (24\vartheta + 2\bar{z} - 2)z^2 + (6\vartheta + 2\bar{z} - 2)z + (\vartheta + 2\bar{z}) , \]

where $\vartheta = 2^{25}$. One can verify that $l(\bar{z}) = 2^{25}f(\bar{z})$. Let $r'(z) = \sum_{i=0}^{4} r'_i z^i = r(z) + l(z)$, then we have $0 \le r'_4 < 2^{32}$. For $0 \le i \le 3$, $2\bar{z} < l_i$ and $|r_i| < 2\bar{z}$, thus we have $r'_i > 0$. On the other hand, $r_i + l_i < 2(2^{63} + s) + 36\vartheta - 2 + 2^{63} + 2^{41} < 2^{65}$. Thus, $r'(z)$ has only positive coefficients and satisfies the input bounds.
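The identity $l(\bar{z}) = 2^{25}f(\bar{z})$ and the coefficient bounds can be confirmed numerically. The Python sketch below does so for $\bar{z} = 2^{63} + 857$; the variable names are chosen for illustration only.

```python
def evaluate(coeffs, x):
    """Evaluate a coefficient list (lowest degree first) at x."""
    return sum(ci * x**i for i, ci in enumerate(coeffs))


z_bar = 2**63 + 857
theta = 2**25
f = [1, 6, 24, 36, 36]
# Coefficients l_0 .. l_4 of the correction polynomial l(z)
l = [
    theta + 2 * z_bar,
    6 * theta + 2 * z_bar - 2,
    24 * theta + 2 * z_bar - 2,
    36 * theta + 2 * z_bar - 2,
    36 * theta - 2,
]

is_multiple_of_p = evaluate(l, z_bar) == theta * evaluate(f, z_bar)
lower_ok = all(li > 2 * z_bar for li in l[:4])                 # 2*z_bar < l_i
upper_ok = all(li + 2**63 + 2**41 < 2**65 for li in l[:4])     # r_i + l_i < 2^65
```

Because $l(\bar{z})$ is a multiple of $p = f(\bar{z})$, adding $l(z)$ changes the coefficients of $r(z)$ but not the residue class of $r(\bar{z})$ modulo $p$.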

4.3.6 Implementation Results

We used a Xilinx FPGA as the design platform. In order to achieve a high clock frequency, a 16-stage pipeline is used in the HMM multiplier. Note that the polynomial multiplication takes five iterations. One multiplication on the proposed multiplier has a delay of 20 cycles. On the other hand, the multiplier finishes one multiplication every 5 cycles, thus has a throughput of 1/5.

Table 4.4: Number of clock cycles required by different subroutines.

| $2T$ | $T+Q$ | $l_{T,T}(P)$ | $l_{T,Q}(P)$ | $f^2$ | $f \cdot l$ | $f^{(p^k-1)/r}$ | ate | optimal ate |
| 220 | 342 | 196 | 120 | 540 | 432 | 138,302 | 336,366 | 245,430 |

Using the multiplier described above, we built a pairing processor. It consists of a multiplier, an adder, a data memory and an instruction ROM. The data memory and instruction ROM are implemented with Block RAMs on the FPGA. The data memory has one read port and one write port.

In order to keep the HMM multiplier busy, we explore the parallelism within the pairing computation. Consider $\mathbb{F}_{p^2} = \mathbb{F}_p[u]/(u^2 + 2)$ and two elements in $\mathbb{F}_{p^2}$: $(a + bu)$ and $(c + du)$, $a, b, c, d \in \mathbb{F}_p$, then $(a + bu)(c + du)$ can be computed as follows:

(a+ bu)(c+ du) = (ac− 2bd) + ((a+ b)(c+ d)− ac− bd)u .

The $\mathbb{F}_p$ multiplications can be performed one after another. We use a C++ program to schedule each high level function, e.g. sub-routines of the Miller loop and the final exponentiation. The schedule is then translated into micro-instructions stored in the instruction ROM. Table 4.4 gives the cycle counts of each function.
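The $\mathbb{F}_{p^2}$ product formula above needs three mutually independent $\mathbb{F}_p$ multiplications ($ac$, $bd$ and $(a+b)(c+d)$), which is what allows them to be issued back to back into a pipelined multiplier. A small Python sketch of the formula follows; the function names and the toy modulus 10007 are assumptions for illustration, and elements are represented as pairs `(a, b)` for $a + bu$ with $u^2 = -2$.

```python
def fp2_mul(x, y, p):
    """Multiply x = a + b*u and y = c + d*u in F_p[u]/(u^2 + 2)
    using three F_p multiplications."""
    a, b = x
    c, d = y
    ac = a * c % p                    # independent multiplication 1
    bd = b * d % p                    # independent multiplication 2
    cross = (a + b) * (c + d) % p     # independent multiplication 3
    return ((ac - 2 * bd) % p, (cross - ac - bd) % p)


def fp2_mul_schoolbook(x, y, p):
    """Reference version with four F_p multiplications."""
    a, b = x
    c, d = y
    return ((a * c - 2 * b * d) % p, (a * d + b * c) % p)


p = 10007  # toy modulus, only to exercise the identity
result = fp2_mul((1234, 5678), (4321, 8765), p)
```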


On a Xilinx Virtex-6 FPGA (XC6VLX240), the design uses 4,014 Slices, 42 DSP48E1s and 5 Block RAMs (RAMB36E1s). The design achieves a maximum frequency of 210 MHz. Table 4.5 compares the result with the state-of-the-art implementations.

Kammler et al. [107] reported a hardware implementation of cryptographic pairings using an application-specific instruction-set processor (ASIP). They chose $\bar{z}$ = 0x6000000000001F2D to generate a 256-bit BN curve. Montgomery's algorithm is used for $\mathbb{F}_p$ multiplication. In [66] we reported a design synthesized with a 130 nm standard cell library. The implementation uses the HMM algorithm with $\bar{z} = 2^{63} + 857$. It achieves a factor 5.4 speed-up compared with the one of [107]. On the other hand, the ASIP design [107] offers a higher flexibility than our implementation from [66] since our implementation was optimized for a dedicated parameter set. Ghosh et al. [76] reported the first FPGA implementation of pairings based on BN curves. They also chose $\bar{z}$ = 0x6000000000001F2D to achieve 128-bit security. The $\mathbb{F}_p$ multiplier used in [76] is based on Blakley's algorithm [25], and it does not make use of the DSP slices on the FPGA.

The design proposed here achieves a factor 2.5 speed-up compared with our previous design [66]. The speed-up comes mainly from the high throughput multiplier. The multiplier used in [66] requires 23 cycles for each multiplication, while the multiplier described here finishes one multiplication every 5 cycles. The Montgomery multiplier used in [107] requires 68 cycles for one multiplication. On the other hand, the speed-up for pairing computations is less than the speed-up of the multiplier. This is mainly due to the read-after-write (RAW) dependency that introduces pipeline bubbles.

Table 4.5 also includes the state-of-the-art software implementations [87, 140, 79, 21]. Software implementations try to make use of available features (fast multipliers, vector registers and so on) on the target processor. The speed records have been updated shortly after they were set, mainly due to new implementation techniques and the evolution of processors. The current speed record of optimal ate pairing using BN curves is achieved by Aranha et al. [5] on an AMD Phenom II processor, using $\bar{z} = -(2^{62} + 2^{55} + 1)$. This choice not only reduces the complexity of both the Miller loop and the final exponentiation, but more importantly allows extensive use of lazy reduction techniques. This implementation achieves so far the best performance.


Table 4.5: Performance comparison of software and hardware implementations of pairings.

| Design | Pairing | Security [bit] | Platform | Algorithm | Area | Freq. [MHz] | Cycle | Delay [ms] |
| This design | ate | 128 | Xilinx FPGA (Virtex-6) | HMM (Parallel) | 4,014 Slices, 42 DSP48E1s | 210 | 336,366 | 1.60 |
| | optimal ate | | | | | | 245,430 | 1.17 |
| [66] | ate | 128 | ASIC (130 nm) | HMM (Digit-serial) | 183 Kgates | 204 | 861,724 | 4.22 |
| | optimal ate | | | | | | 592,976 | 2.91 |
| [76] | Tate | 128 | Xilinx FPGA (Virtex-4) | Blakley [25] | 52k Slices | 50 | 1,730,000 | 34.6 |
| | ate | | | | | | 1,207,000 | 24.2 |
| | optimal ate | | | | | | 821,000 | 16.4 |
| [107] | Tate | 128 | ASIC (130 nm) | Montgomery | 97 Kgates | 338 | 11,627,200∗ | 34.4 |
| | ate | | | | | | 7,706,400∗ | 22.8 |
| | optimal ate | | | | | | 5,340,400∗ | 15.8 |
| [62] | Tate over $\mathbb{F}_{3^{5 \cdot 97}}$ | 128 | Xilinx FPGA (Virtex-4) | - | 4755 Slices, 7 BRAMs | 192 | 428,853 | 2.23 |
| [75] | $\eta_T$ over $\mathbb{F}_{2^{1223}}$ | 128 | Xilinx FPGA (Virtex-6) | - | 15167 Slices | 250 | 47,610 | 0.19 |
| [87] | ate | 128 | 64-bit Core2 | Montgomery | - | 2400 | 15,000,000 | 6.25 |
| | optimal ate | | | | | | 10,000,000 | 4.17 |
| [79] | ate | 128 | 64-bit Core2 | Montgomery | - | 2400 | 14,429,439 | 6.01 |
| [140] | optimal ate | 128 | Core2 Quad | Hybrid Mult. | - | 2394 | 4,470,408 | 1.86 |
| [21] | optimal ate | 126 | Core i7 | Montgomery | - | 2800 | 2,330,000 | 0.83 |
| [5] | optimal ate | 127 | Phenom II | Montgomery | - | 3000† | 1,562,000 | 0.52 |
| [6] | $\eta_T$ over $\mathbb{F}_{2^{1223}}$ | 128 | Xeon (8 cores) | - | - | 2000 | 3,020,000 | 1.51 |
| [22] | $\eta_T$ over $\mathbb{F}_{3^{509}}$ | 128 | Core i7 (8 cores) | - | - | 2900 | 5,423,000 | 1.87 |

∗ Estimated by the authors.
† Processor frequency is not mentioned in the original paper. We take 3.0 GHz (typical frequency) for delay estimation.


An interesting attempt to use a hybrid multiplication algorithm in software has been made by Naehrig et al. [140]. They adapted the hybrid multiplication algorithm and proposed a software-oriented algorithm for fast modular multiplication. An element in $\mathbb{F}_p$ is represented as a polynomial of degree 11. As such, the coefficients are short enough to fit in the vector registers and overflows in multiplication are avoided. Polynomial reduction can be efficiently performed since $p(x)$ is made monic. This implementation shows a significant speedup compared with previous software implementations [87, 79]. On the other hand, on CPUs where 64-bit multiplication is not much slower than addition, the traditional Montgomery algorithm seems to achieve a higher performance [21, 5].

Table 4.5 also includes state-of-the-art implementations of pairings over binary or ternary fields achieving 128-bit security. The results of these implementations are comparable with pairings using BN curves in software. Estibals reported the first hardware implementation of 128-bit Tate pairing over super-singular curves [62], while Ghosh et al. reported the fastest FPGA-based implementation of $\eta_T$ pairing that achieves 128-bit security [75]. Pairings over small characteristic fields seem to have lower computational complexity than pairings over ordinary curves, which results in a shorter computation time.

4.4 Pairing Processor Using RNS

The Montgomery algorithm using RNS (Algorithm 10) has been considered more costly than conventional Montgomery multiplication methods. Indeed, to perform one multiplication in $\mathbb{F}_p$, the RNS Montgomery algorithm uses $2s$ and $2s^2 + 5s$ word multiplications for $C = AB$ and $C = C \bmod p$, respectively, while conventional MMM uses $2s^2 + s$ word multiplications for one multiplication. However, when combined with lazy reduction, RNS Montgomery can significantly reduce the computational complexity. Consider the operation $AB + CD + EF + GH$ in $\mathbb{F}_p$: RNS Montgomery uses $2s^2 + 13s$ word multiplications, while conventional MMM uses $5s^2 + s$. Comparing the two expressions, RNS Montgomery is faster than MMM when $s > 4$. When $s = 8$, RNS Montgomery is 30% faster than MMM. Aranha et al. have shown that lazy reduction can significantly speed up optimal ate pairings in software [5]. Duquesne [58] analyzed the efficiency of lazy reduction combined with RNS.
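The word-multiplication counts quoted above are easy to tabulate. The Python sketch below evaluates both cost formulas for a sum of four products with a single lazy reduction. The function names are hypothetical, and the split of the MMM cost into $s^2$ per product plus $s^2 + s$ per reduction is an assumption chosen to be consistent with the totals $2s^2 + s$ and $5s^2 + s$ given in the text.

```python
def rns_cost(s, products=4):
    """RNS Montgomery with lazy reduction: 2s per product plus one
    reduction of 2s^2 + 5s word multiplications."""
    return products * 2 * s + (2 * s * s + 5 * s)


def mmm_cost(s, products=4):
    """Conventional Montgomery with lazy reduction, assuming s^2 per product
    and s^2 + s for the single reduction."""
    return products * s * s + (s * s + s)


for s in (2, 4, 8, 16):
    print(s, rns_cost(s), mmm_cost(s))

saving_at_8 = 1 - rns_cost(8) / mmm_cost(8)
```

For $s = 8$ the two counts are 232 and 328, a saving of roughly 29%, and the two formulas coincide at $s = 4$.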

Section 4.4 is based on the following publication:

R. C. C. Cheung, S. Duquesne, J. Fan, N. Guillermin, I. Verbauwhede, and G. X. Yao, "FPGA implementation of pairings using residue number system and lazy reduction," in CHES, ser. Lecture Notes in Computer Science, B. Preneel and T. Takagi, Eds., vol. 6917. Springer, pp. 421–441, 2011.

We implemented the optimal ate pairing using RNS and lazy reduction on a Xilinx Virtex-6 FPGA [44]. Figure 4.3 shows the architecture of the pairing processor. We deploy the Cox-Rower architecture model [111]. There are $s = 8$ PEs, and each PE performs the operations of one channel. RNS allows us to distribute computations over parallel processing elements, and carry propagation and expensive data exchange are completely avoided. The datapath uses an 8-stage pipeline. On a Xilinx Virtex-6 FPGA, it can achieve a maximum frequency of 250 MHz.

Figure 4.3: Cox-Rower architecture for pairing computation using RNS Montgomery.

Using the RNS multiplier, one multiplication and one reduction in $\mathbb{F}_p$, where $p$ is a 254-bit prime, use 2 and 12 cycles, respectively. Combined with lazy reduction, this accelerates the pairing computation. Table 4.6 compares the cycle counts of the sub-routines using RNS and HMM. Note that one HMM multiplication uses 6 cycles, while an RNS multiplication takes only 2 cycles. Although RNS reduction uses 12 cycles, the average cost is lower. This design computes a pairing at 126-bit security level in 0.573 ms on a Xilinx Virtex-6 FPGA. This is the first hardware design of pairings using RNS, and it is so far the fastest hardware implementation of pairings over BN curves.


Table 4.6: Cycle count for one optimal pairing.

| | Curve | $2T$ and $g_{(T,T)}(P)$ | $T+Q$ and $g_{(T,Q)}(P)$ | $f^2$ | $f \cdot g$ | Miller's Loop | Final Exp. | Total |
| HMM | BN128 | 416 | 462 | 540 | 432 | 107,128 | 138,302 | 245,430 |
| RNS | BN126 | 320 | 430 | 301 | 289 | 61,116 | 81,995 | 143,111 |

4.5 Conclusion

In this chapter, we proposed a low-complexity modular multiplication algorithm for moduli that are generated with low-weight polynomials. The new idea is to treat the modulus and all operands as polynomials, and to perform modulo reduction in the polynomial ring. Since the modulus is a low-weight polynomial, the reduction phase has a much lower complexity than conventional Montgomery reduction. The proposed algorithm is also easier to parallelize than integer multiplications since there is no carry propagation in polynomial multiplications. Using this algorithm, we implemented the optimal ate pairing on a Virtex-6 FPGA. The implementation finishes one pairing in 1.17 ms when it is running at 210 MHz.

Results in this chapter show that not only pseudo-Mersenne numbers but also moduli that can be represented with a low-weight polynomial are special moduli and are of interest in practice. Whether such characteristics can be generalized is not yet clear. Nevertheless, it is advisable to make use of such characteristics to speed up the multiplication in hardware.

Chapter 5

HECC over F_{2^m} Using Unified Multiplier/Inverters

• 5.1 Motivation

• 5.2 Multiplier and Inverter in F_{2^m}

• 5.3 High-throughput UMI and HECC Processor

• 5.4 Lightweight UMI and HECC Processor for RFID

• 5.5 Conclusion


5.1 Motivation

This chapter investigates data-path reuse as a measure for area reduction. We propose a Unified Multiplier Inverter (UMI) and use HECC as an example to demonstrate its efficiency.

HECC, like ECC, enables valuable optimizations in area and speed in constrained devices. However, the implementation of HECC is typically more complicated than ECC. Table 5.1 describes the computational complexity of divisor operations in different coordinate systems [8]. Here I, M and S denote modular inversion, multiplication and squaring, respectively. Since affine coordinates use one I in exchange for about 20M, it is desirable to have a fast inverter.

Table 5.1: Modular operations required by divisor operations.

Coordinates      Divisor      Divisor      Coordinates
                 Addition     Doubling     Conversion
Affine           I+22M+3S     I+20M+6S     -
Inversion-free   49M+4S       38M+7S       I+4M
Lange-Stevens    I+21M+3S     I+5M+6S†     -

† Note that this fast doubling formula only works for curves with deg(h)=1.
* This table is not exhaustive. State-of-the-art formulae can be found in [8, 16].

Previous HECC implementations, summarized in Table 5.2, often use multiple multipliers or inverters to speed up the scalar multiplication. Commonly, the architecture shown in Figure 5.1 is used. The use of multiple multipliers in parallel demands a high-throughput memory and a complex data bus, which results in further area increase. In this chapter, we explore the power of a Unified Multiplier and Inverter (UMI) for area reduction and performance improvement. The architecture shown in Figure 5.2 is more area-efficient if the UMI can be made efficiently. We show that the use of a UMI reduces the area of the data-path and simplifies the data-bus and memory management. Moreover, using affine coordinates further reduces the required storage.


J. Fan, L. Batina, and I. Verbauwhede, “Design and design methods for unified multiplier and inverter and its application for HECC,” Integration, the VLSI Journal, vol. 44, no. 4, pp. 280–289, 2011.


Table 5.2: Previous HECC implementations on FPGA.

Ref.       Year   Fields      Algorithm           Coordinates        Notes
[168]      2001   -           Cantor's            Affine             Architecture is only outlined
[33, 50]   2002   GF(2^113)   Cantor's            Affine             Two multipliers, one inverter, one Ring GCD, one Ring Norm
[60]       2004   GF(2^113)   Explicit formulae   Inversion-free     12 multipliers, one inverter
[167]      2004   GF(2^81)    Explicit formulae   Affine             Three multipliers, two inverters
[149]      2006   GF(2^83)    Explicit formulae   Projective         Three multipliers
[59]       2007   GF(2^113)   Explicit formulae   Projective/Mixed   12 multipliers, one inverter

Figure 5.1: Conventional architecture for HECC: using multiple data-paths.

Figure 5.2: Proposed architecture for HECC: using UMI.


5.2 Unified Multiplier and Inverter

5.2.1 Multiplication Algorithms

Bit-serial algorithms can be classified into two categories, the Most Significant Bit (MSB) first algorithms and the Least Significant Bit (LSB) first algorithms. Algorithm 18 and Algorithm 19 show an MSB-first and an LSB-first multiplication algorithm, respectively. Here we use C^i to denote the value of C after the ith iteration, and b_i the ith coefficient of B. For the sake of simplicity, we use capital letters to denote polynomials, and small letters with subscript to denote their coefficients. For example, A stands for A(x), and a_0 is the least significant bit of A.

The MSB-first multiplication scans B from the MSB side. In each iteration, b_i A is added to C, which is then shifted to the left and reduced. The LSB-first multiplication scans B from the LSB side. In each iteration, T is updated to xT, and b_i T is accumulated in C. LSB-first multipliers update T and C in parallel, thus they can achieve a shorter critical path than MSB-first multipliers [18]. On the other hand, they require an extra register to keep T.

Algorithm 18 MSB-first multiplication [18].
Input: Polynomials A, B and P.
Output: R = AB mod P.
1: C^m ← 0.
2: for i = m−1 to 0 do
3:   C^i ← x(C^{i+1} + b_i A) mod P.
4: end for
Return: R ← C^0/x.

Algorithm 19 LSB-first multiplication [18].
Input: Polynomials A, B and P.
Output: R = AB mod P.
1: C^0 ← 0, T^0 ← A.
2: for i = 0 to m−1 do
3:   C^{i+1} ← C^i + b_i T^i.
4:   T^{i+1} ← xT^i mod P.
5: end for
Return: R ← C^m.
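As a software cross-check of Algorithms 18 and 19, the following Python sketch models both bit-serial multipliers with polynomials over GF(2) stored as integers (bit i holds the coefficient of x^i). The function names and the integer encoding are illustrative assumptions. The MSB-first variant is written in the common Horner form C ← xC + b_i A, which avoids the final division by x of Algorithm 18, so it is an equivalent software model rather than a description of the hardware data-path.

```python
def reduce_once(v, p, m):
    """Reduce a polynomial of degree <= m modulo P (degree m) with one conditional XOR."""
    return v ^ p if (v >> m) & 1 else v


def mul_lsb_first(a, b, p, m):
    """LSB-first multiplication: accumulate b_i * T, then update T <- x*T mod P."""
    c, t = 0, a
    for i in range(m):
        if (b >> i) & 1:
            c ^= t                      # C <- C + b_i * T
        t = reduce_once(t << 1, p, m)   # T <- x * T mod P
    return c


def mul_msb_first(a, b, p, m):
    """MSB-first multiplication in Horner form: C <- x*C mod P, then add b_i * A."""
    c = 0
    for i in range(m - 1, -1, -1):
        c = reduce_once(c << 1, p, m)   # shift left and reduce
        if (b >> i) & 1:
            c ^= a                      # C <- C + b_i * A
    return c


# Example in GF(2^8) with P = x^8 + x^4 + x^3 + x + 1 (an assumed toy field).
print(hex(mul_lsb_first(0x57, 0x83, 0x11B, 8)), hex(mul_msb_first(0x57, 0x83, 0x11B, 8)))
```

Both functions assume that A is already reduced (degree below m). The LSB-first version keeps the extra variable t, mirroring the additional register T mentioned above.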


5.2.2 Inversion Algorithms

The most commonly used inversion algorithms are based on Fermat’s little theorem [99], Extended Euclidean Algorithm (EEA) [112] and Gaussian elimination [89]. EEA is widely used to perform inversion in practice.

The schoolbook EEA-based inversion algorithm in GF(2^m) is considered inefficient due to the long polynomial division in each iteration. This problem was partially solved by replacing the degree comparison with a counter [34]. Algorithm 20 and Algorithm 21 show two variants of EEA, namely, left-shift inversion and right-shift inversion [169], respectively. Here we use S^i to denote the value of S after the ith iteration, and s^{i−1}_m the MSB of S^{i−1}. The complement of c1 is represented as c̄1.

Algorithm 20 Left-shift EEA inversion algorithm [112].
Input: Polynomials A and P.
Output: R = A^{−1} mod P.
1: R^0 ← P, S^0 ← A, H^0 ← 0, J^0 ← x^{−m}, d^0 ← 0;
2: for i = 1 to 2m do
3:   c ← (s^{i−1}_m) & (d^{i−1} > 0).
4:   if c = 1 then
5:     {R^i, H^i} ← {S^{i−1}, J^{i−1}}.
6:   else
7:     {R^i, H^i} ← {R^{i−1}, H^{i−1}}.
8:   end if
9:   if c = 1 then
10:    S^i ← x(R^{i−1} + S^{i−1}).
       J^i ← x(H^{i−1} + J^{i−1}) mod P.
       d^i ← −d^{i−1} + 1.
11:  else
12:    S^i ← x(s^{i−1}_m R^{i−1} + S^{i−1}).
       J^i ← x(s^{i−1}_m H^{i−1} + J^{i−1}) mod P.
       d^i ← d^{i−1} + 1.
13:  end if
14: end for
Return: R ← H^{2m}.

From an implementation perspective, the right-shift inversion algorithm is preferred for a high-performance inverter. The right-shift inversion algorithm has no modular operations. As a result, a short critical path delay can be easily achieved. The counter d is realized with a ring counter [169]. A ring counter d has only one 1-bit. The value of the counter is defined as (−1)^sign · δ, where δ is the number of 0s at the right side of 1 in the register d. An n-bit ring counter can count up to n−1, thus it is larger than an equivalent counter using a ripple-carry adder. On the other hand, it has a shorter critical path delay since it only has shift operations. The left-shift inversion algorithm uses a ripple-carry adder, and it fits better in area-constrained devices.
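The ring counter described above can be modelled in a few lines of Python. This is a minimal sketch under the stated encoding (a one-hot register d and a separate sign bit); the function names and the explicit register width n are assumptions made for illustration. The step function follows the update d ← sign ? 2d : d/2 used in Algorithm 21.

```python
def ring_counter_delta(d):
    """Return delta, the number of 0s to the right of the single 1-bit in d."""
    if d <= 0 or d & (d - 1):
        raise ValueError("a ring counter holds exactly one 1-bit")
    return d.bit_length() - 1


def ring_counter_value(d, sign):
    """Decode the counter value (-1)^sign * delta."""
    return (-1) ** sign * ring_counter_delta(d)


def ring_counter_step(d, sign, n):
    """Shift the one-hot register: left if sign is 1, right if sign is 0."""
    nxt = d << 1 if sign else d >> 1
    if nxt == 0 or nxt >> n:
        raise OverflowError("the 1-bit left the n-bit register")
    return nxt


# An 8-bit ring counter starting at d = 2 (delta = 1), shifted left twice.
d = ring_counter_step(ring_counter_step(0b10, 1, 8), 1, 8)
print(bin(d), ring_counter_delta(d))
```

Only shifts are involved in an update, which is the reason given above for the shorter critical path; the price is n flip-flops to count up to n−1.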

Algorithm 21 Right-shift EEA inversion algorithm [169].
Input: Polynomials A and P.
Output: R = A^{−1} mod P.
1: R^0 ← P, S^0 ← xA, H^0 ← 0, J^0 ← x^m, d^0 ← 2, sign^0 ← 1.
2: for i = 1 to 2m−1 do
3:   c1 ← s^{i−1}_m.
     c2 ← c1 & sign^{i−1}.
     sign^i ← sign^{i−1} ? c̄1 : d^{i−1}_0.
4:   if c2 = 1 then
5:     {R^i, H^i} ← {S^{i−1}, J^{i−1}/x}.
6:   else
7:     {R^i, H^i} ← {R^{i−1}, H^{i−1}/x}.
8:   end if
9:   if c1 = 1 then
10:    S^i ← x(R^{i−1} + S^{i−1}).
       J^i ← H^{i−1} + J^{i−1}.
11:  else
12:    S^i ← xS^{i−1}.
       J^i ← J^{i−1}.
13:  end if
     d^i ← sign^i ? 2d^{i−1} : d^{i−1}/2.
14: end for
Return: R ← H^{2m−1}.
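For reference, the plain shift-and-add form of EEA inversion, which the counter-based variants refine, can be modelled in software as below. This is a textbook binary EEA that compares degrees explicitly; it is neither Algorithm 20 nor Algorithm 21, and the function name and integer encoding (bit i is the coefficient of x^i) are assumptions for illustration. It can serve as a golden model when testing an inverter.

```python
def gf2m_inverse(a, p):
    """Return a^-1 mod p over GF(2); p is assumed irreducible and 0 < a < 2^deg(p)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    u, v = a, p          # invariant: a*g1 = u (mod p) and a*g2 = v (mod p)
    g1, g2 = 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()   # deg(u) - deg(v)
        if j < 0:
            u, v = v, u
            g1, g2 = g2, g1
            j = -j
        u ^= v << j      # cancel the leading term of u
        g1 ^= g2 << j    # apply the same operation to the cofactor
    return g1


# Example in GF(2^8) with P = x^8 + x^4 + x^3 + x + 1 (an assumed toy field).
print(hex(gf2m_inverse(0x53, 0x11B)))
```

The explicit degree computation in every iteration is exactly what the counter d replaces in the hardware-oriented algorithms.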

The main observation of this study is that multiplier and inverter can be efficiently merged, which brings a significant reduction in area. For example, Step 3 in Algorithm 19 and Step 10 in Algorithm 21 can be generalized to the following operation:

T ← x(G + eQ).

Another example is Algorithm 18 and Step 12 in Algorithm 20. They can be generalized to

T ← x(G + eQ) mod P.

Indeed, a modification of the architecture of a bit-serial multiplier makes it also an inverter.

In the following sections we describe two UMI architectures. Table 5.3 summarizes the design rationale of these two types of UMI. Let I/M denote the inversion to multiplication ratio in terms of delay. Type-I UMI is optimized for low critical path delay. It realizes the LSB-first multiplication and the right-shift EEA algorithm. Here the delay of an inversion is equivalent to the delay of 2 multiplications. Type-I UMI is used to build a high-performance HECC processor. Type-II UMI targets constrained devices. It realizes the MSB-first multiplication and the left-shift EEA algorithm. Here one inversion is equivalent to 4 multiplications. Type-II UMI is used to build a low-footprint HECC processor.

Table 5.3: Unified Multiplier and Inverter: Type-I vs. Type-II.

          Optimization                Algorithm           Counter d      I/M   Target
          Priority                    Selection                                Applications
Type-I    Short critical path delay   Alg. 19 + Alg. 21   ring           2     High throughput
Type-II   Low footprint               Alg. 18 + Alg. 20   carry-ripple   4     Lightweight

5.3 High-throughput UMI and HECC processor

In this section we present the architecture of Type-I UMI and a high-performance HECC processor. We first describe the architecture of the UMI, then we discuss a method to select the I/M ratio. We also compare the performance of the design with previous implementations at the end of this section.

5.3.1 Type-I UMI Architecture: High Throughput

Figure 5.3 describes the AND-XOR cell that realizes (b_i A + C). Figure 5.4 shows the architecture of an LSB-first multiplier. Here (t_{m−1}P + (T ≪ 1)) and (b_i T + C) are performed on the left and the right cell, respectively. The critical path delay is T_AND + T_XOR, where T_AND and T_XOR denote the delay of a 2-input AND and XOR gate, respectively. B is shifted to the right by one bit in each clock cycle. Hence one multiplication in GF(2^m) takes m clock cycles on this multiplier.

Figure 5.3: AND-XOR cell: a building block of multipliers.

Figure 5.4: LSB-first bit-serial modular multiplier.

Figure 5.5 shows the data-path of a bit-serial inverter using the AND-XOR cells. It realizes Algorithm 21. The critical path, from sign^{i−1} to d^i, has a delay of 2T_MUX, where T_MUX denotes the delay of a 2-input multiplexer.


Figure 5.5: Right-Shift bit-serial inverter.

Figure 5.6 shows the data-path of the proposed unified inverter and multiplier. The data-path realizes both Algorithm 19 and Algorithm 21. Table 5.4 describes how to configure the UMI to perform inversion or multiplication.

Table 5.4: Configurations and operations of Type-I UMI-I.

Registers   Multiplication                                 Inversion
            i = 0   0 < i < m + 1                          i = 0   0 < i < 2m
d           0       -                                      2       d^{i−1} ≪ 1 if sign^i = 1
                                                                   d^{i−1} ≫ 1 if sign^i = 0
sign        0       -                                      1       ¬s^{i−1}_m if sign^{i−1} = 1
                                                                   d^{i−1}_0 if sign^{i−1} = 0
R           P       R^{i−1}                                P       S^{i−1} if (s^{i−1}_m & sign^{i−1}) = 1
                                                                   R^{i−1} if (s^{i−1}_m & sign^{i−1}) = 0
S           xA      (S^{i−1} + s^{i−1}_m R^{i−1}) ≪ 1      xA      (S^{i−1} + s^{i−1}_m R^{i−1}) ≪ 1
C           0       h_0(S^{i−1} ≫ 1) + C^{i−1}             -       -
H           B       H^{i−1} ≫ 1                            0       J^{i−1} ≫ 1 if (s^{i−1}_m & sign^{i−1}) = 1
                                                                   H^{i−1} ≫ 1 if (s^{i−1}_m & sign^{i−1}) = 0
J           -       -                                      x^m     J^{i−1} + s^{i−1}_m H^{i−1}
Return      C^m                                            H^{2m−1}

The goal of this data-path merging is to maximize the hardware sharing and at the same time to minimize the overhead on critical path delay.

• Hardware Sharing. Three registers (R, S and H) and one AND-XOR cell are shared.

• Critical Path. The critical path delay is the same as a standalone inverter, i.e., 2T_MUX.


Figure 5.6: Bit-serial Type-I UMI.


• Function selection. The selection of a working mode (i.e., multiplication or inversion) is performed on the existing registers at the first cycle. It is also shown in Table 5.4.

• Throughput. The UMI achieves a throughput of 1/(2m−1) inversions or 1/m multiplications per cycle.

The critical path delay of the UMI is longer than that of a multiplier. In other words, merging an inverter into a multiplier slows down the multiplication. However, for divisor additions in HECC, performing one inversion saves 28 multiplications (see Table 5.1). Indeed, having a fast inverter at the cost of slower multiplication may still speed up the divisor addition and doubling. This issue is discussed in the following section.

Digit-serial UMI

While the use of UMI achieves an area reduction of the ALU, it also slows down multiplications. For applications where many more multiplications than inversions are required, maximizing the throughput of an inverter at the cost of a slower multiplier is not always desirable. Therefore, we propose a flexible architecture which enables an arbitrary I/M ratio. Figure 5.7 shows a design that replaces two bit-serial UMIs with multipliers. We use w_I and w_M to denote the actual digit-size of the inverter and multiplier, respectively. The UMI in Figure 5.7 uses 2 UMIs (w_I = 2) and 2 multipliers (w_M = 4). When m = 83, one inversion takes ⌈(2m−1)/w_I⌉ = 83 clock cycles, while one multiplication takes ⌈m/w_M⌉ = 21 clock cycles. The I/M ratio is approximately 2w_M/w_I.
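The cycle counts quoted above follow directly from the two ceiling expressions. A minimal Python sketch, with hypothetical function names, reproduces them and compares the exact I/M ratio with the approximation 2w_M/w_I.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def umi_cycle_counts(m, w_i, w_m):
    """Cycles for one inversion and one multiplication on a digit-serial Type-I UMI."""
    inversion = ceil_div(2 * m - 1, w_i)    # ceil((2m - 1) / w_I)
    multiplication = ceil_div(m, w_m)       # ceil(m / w_M)
    return inversion, multiplication


inv_cycles, mul_cycles = umi_cycle_counts(83, 2, 4)
# Exact ratio next to the approximation 2 * w_M / w_I.
print(inv_cycles, mul_cycles, inv_cycles / mul_cycles, 2 * 4 / 2)
```

For m = 83, w_I = 2 and w_M = 4 this gives 83 and 21 cycles, so the exact ratio is slightly below the approximate value of 4.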

The w_I/w_M ratio should be decided by the applications and the design constraints of the circuit. The next section describes an HECC processor built with the UMI.

5.3.2 Type-I HECC Processor

The HECC coprocessor is shown in Figure 5.8. It contains an Instruction ROM, a main controller and the Type-I UMI. The Instruction ROM contains the field operation sequences of divisor addition and doubling. As only a single data-path is used, the coprocessor does not require high-bandwidth register files. Instead, a data RAM is used to keep the curve parameters, base divisor and intermediate data. On FPGAs, Block RAMs are used.

For divisor addition and doubling, we use the explicit formulae proposed by Lange and Stevens [120]. One divisor addition takes 1I + 21M + 3S, while one divisor doubling takes 1I + 5M + 6S. We give the explicit formulae in the form of register operations at the end of this chapter.

Figure 5.7: Digit-serial Type-I UMI with I/M ≈ 2w_M/w_I (w_I = 2, w_M = 4).

Figure 5.8: Block diagram of the Type-I HECC processor.

The selection of w_M and w_I is decided by the following constraints: speed and area. We choose w_M = 14 such that the area meets our constraints, i.e., the overall area should be smaller than the smallest known implementation ([149] in Table 5.2). We then adjust w_I and measure the performance of the UMI on a Xilinx XC2V4000 FPGA. The following equations are used to estimate the delay of one divisor addition (DA), one divisor doubling (DD) and one scalar multiplication (SM), respectively. Here T_I and T_M denote the delay of one inversion and multiplication, respectively. Note that squaring is also performed with the UMI, thus T_S = T_M.

T_DA = T_I + 24T_M,   T_DD = T_I + 11T_M,   T_SM = 166T_DD + 83T_DA.

Figure 5.9: Area of the UMI and delay for DA, DD and SM.

As shown in Figure 5.9, when w_I increases, T_M goes up. However, the delay of one inversion goes down. T_DA reaches its minimum when w_I = 3, while T_DD stays almost unchanged when w_I goes from 3 to 5. The delay of one scalar multiplication also reaches its minimum at w_I = 3. Note that the area increases almost linearly when w_I grows. When w_I > 3, there is no gain in speed while area goes up. Thus, we choose w_I = 3 and w_M = 14 as the best performance-area trade-off for this architecture. One multiplication and one inversion in GF(2^83) take 47.9 and 439 ns, respectively.
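The delay model can be evaluated numerically for the chosen operating point. The sketch below plugs the reported T_M = 47.9 ns and T_I = 439 ns into the three equations (21M + 3S counts as 24 T_M and 5M + 6S as 11 T_M because T_S = T_M). The function name is hypothetical, and the result is only a field-operation estimate under this model, not a measurement.

```python
def hecc_delays(t_i, t_m):
    """Evaluate the delay model for divisor addition, doubling and scalar multiplication."""
    t_da = t_i + 24 * t_m             # T_DA = T_I + 24 T_M
    t_dd = t_i + 11 * t_m             # T_DD = T_I + 11 T_M
    t_sm = 166 * t_dd + 83 * t_da     # T_SM = 166 T_DD + 83 T_DA
    return t_da, t_dd, t_sm


# Delays in nanoseconds for w_I = 3 and w_M = 14.
t_da, t_dd, t_sm = hecc_delays(439.0, 47.9)
print(t_da, t_dd, t_sm / 1000.0)      # last value in microseconds
```

The model yields roughly 292 µs per scalar multiplication. Table 5.5 reports 311 µs for the complete processor, so the remaining difference is presumably control and memory overhead that the model does not cover.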

5.3.3 Results and Comparison

We implemented the design from Figure 5.8 and verified it on a Virtex-II Pro FPGA. The coprocessor is described with the Gezel [74] language. For comparison, we also synthesized the design for a Xilinx Virtex-II (XC2V4000) FPGA. It uses 2316 slices and 6 Block RAMs. A clock frequency of 125 MHz can be reached. Table 5.5 gives a comparison of area and performance with previous FPGA-based implementations of HECC in GF(2^m).

Table 5.5: Performance comparison of FPGA-based HECC implementations in GF(2^m).

Ref.                    FPGA                             Freq.   Area       RAM      Finite       Perf.    Comments
Design                                                   [MHz]   [Slices]   [bits]   Field        [µs]
Clancy [50]             Xilinx Virtex-II                 N/A     23,000     0        GF(2^83)*    10,000   Two mult., One inv., Using NAF†
Elias et al. [59]       Xilinx Virtex-II (XC2V8000)      45.3    25,271     0        GF(2^113)    2,030    12 mult., One inv., Using NAF
Sakiyama et al. [149]   Xilinx Virtex-II Pro (XC2VP30)   100     6,586      8,064    GF(2^83)*    420      Three mult., Using NAF
                                                                 4,749      5,376    GF(2^83)*    549      Two mult., Using NAF
                                                                 2,446      2,688    GF(2^83)*    989      One mult., Using NAF
Wollinger [167]         Xilinx Virtex-II (XC2V4000)      56.7    7,785      0        GF(2^81)     415‡     Three mult., Two inv.
                                                         47.0    5,604      0        GF(2^81)     724‡     Two mult., One inv.
                                                         54.0    3,955      1,536    GF(2^81)     831‡     Two mult., One inv.
This work               Xilinx Virtex-II (XC2V4000)      125     2,316      2,016    GF(2^83)     311      Type-I UMI, Using NAF

* Designs with * support fields defined with an arbitrary polynomial P.
† Non-Adjacent Form.
‡ Using binary method for scalar divisor multiplication.

Among all the previous implementations, the designs from Sakiyama et al. and Wollinger are of special interest to compare with. They both use explicit formulae, and the designs are much smaller than other implementations. The HECC coprocessor presented in [149] uses projective coordinates and a superscalar architecture. Each configuration uses a different number of digit-serial (w = 12) multipliers. Our coprocessor, using one unified multiplier/inverter, is faster than the one of [149] using three multipliers.

The architectures proposed in [167], however, use affine coordinates of the explicit formulae. Three different architectures ranging from high speed to low hardware cost are proposed. The high speed version uses three multipliers and two inverters, and it takes 415 µs to finish one scalar multiplication. To the best of our knowledge, this is also the fastest HECC implementation on FPGA to date. The low-area version uses 3955 slices. However, it requires 831 µs for one scalar multiplication.

Compared to all the previous implementations, our HECC processor achieves a higher performance at a lower area cost. The area reduction is attributed to the use of a compact ALU and the reduction of the memory throughput. The ALU in [167] contains two multipliers and one inverter, which in total use 2427 slices. The ALU used in this study requires only 1500 slices. The performance gain is mainly due to the efficient inverter. When running at 56.7 MHz, the inverter in [167] requires 1570 ns on average for one inversion in GF(2^81), while our high-throughput UMI finishes one inversion in GF(2^83) in 439 ns. Although we use only one multiplier, which is also slower than the one in [167], the divisor addition and doubling are faster.

5.4 Lightweight UMI and HECC Processor for RFID

In this section, we describe an HECC processor targeting extremely constrained devices such as passive RFID tags. In such applications, area and power consumption are of higher priority than performance. According to [2], a passive RFID tag should have power consumption less than 15 µW to guarantee 1 meter operation range. Some ECC implementations [122, 90] can already fulfill these requirements. We show that HECC can also fulfill the requirements with a comparable performance.

5.4.1 Type-II UMI Architecture: Low Footprint

Type-II UMI realizes Algorithm 18 and Algorithm 20. In this architecture, the bit-serial multiplier is reused for the inversion. The counter d, implemented as a ring counter in Type-I UMI, is replaced with a ripple-carry adder. Figure 5.10 shows the data-path of the proposed digit-serial inverter and multiplier.

• Multiplication. The data-path performs C^{i+1} ← ((C^i + b_i A) ≪ 1) mod P. In this case, only two registers (S and H) are used, thus R and J can be used as storage.

• Inversion. In this mode, the {R,S} pair and the {H,J} pair are updated alternately. The bit-serial multiplier performs one of the following operations:

– S^i ← (R^{i−1} + S^{i−1}) ≪ 1

– S^i ← (s^{i−1}_m R^{i−1} + S^{i−1}) ≪ 1

– J^i ← ((H^{i−1} + J^{i−1}) ≪ 1) mod P

– J^i ← ((s^{i−1}_m H^{i−1} + J^{i−1}) ≪ 1) mod P.

Note that R and S are updated first, then H and J are updated accordingly in the next cycle.

(a) Type-II bit-serial building block.

(b) Type-II digit-serial UMI (w = 2).

Figure 5.10: The building block and architecture of Type-II UMI.


Assuming the digit-size is w, one multiplication in GF(2^m) takes ⌈m/w⌉ cycles, while one inversion takes ⌈4m/w⌉ cycles.

5.4.2 Type-II HECC Processor

The HECC processor is shown in Figure 5.11. It contains an Instruction ROM, a main controller, a Type-II UMI, a Register File, and an input/output interface. It differs from the Type-I HECC processor in both the UMI architecture and storage. The Type-I HECC processor uses a dual-port RAM, while the Type-II HECC processor uses a single-port register file.

Besides the multiplier and inverter, the register file contributes a big portion of the area. Reducing the area of the register file is the key step towards a compact implementation. An HECC processor using affine coordinates requires fewer registers to store intermediate results since no Z coordinates are used. Moreover, it also reduces the number of intermediate results. Lange and Mishra studied the register allocation for parallel multipliers [119]. Our investigation shows that 12 registers are sufficient for scalar multiplication with flexible base divisor D. Note that the Type-II UMI has four registers, among which two can be used for storage when it is not working as an inverter. Thus, we only need 10 registers in the register file. The complete register allocation for divisor doubling and divisor addition is given at the end of this chapter.

Figure 5.11: Block diagram of the Type-II HECC processor.


5.4.3 Results and Comparison

The design was verified on a Xilinx Virtex-II Pro FPGA. We synthesized the Type-II HECC processor with a 130 nm standard cell library. Table 5.6 summarizes the area and power of the proposed design.

Our HECC implementation, including InsRom, UMI and register files, uses 14.5 Kgates. It finishes one scalar multiplication in 136 838 clock cycles. The power consumption, estimated with Power Compiler, is 13.4 µW when running at 300 kHz. The implementation of [148], using projective coordinates, requires 266 133 clock cycles for one scalar multiplication. Note that it is defined on a smaller field and the result does not include data storage. The power and energy consumption of our design is 65% lower while it achieves the same throughput.
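The energy per scalar multiplication follows from the figures above as power multiplied by execution time. The short Python sketch below makes the arithmetic explicit; the function name is hypothetical and the formula assumes constant power over the whole operation.

```python
def energy_microjoule(power_uw, cycles, freq_hz):
    """Energy in microjoules: power [uW] times latency [s] = cycles / frequency."""
    return power_uw * cycles / freq_hz


latency_s = 136_838 / 300e3                       # about 0.456 s at 300 kHz
energy = energy_microjoule(13.4, 136_838, 300e3)  # about 6.1 uJ
print(round(latency_s * 1e3, 1), round(energy, 2))
```

The unrounded latency gives about 6.1 µJ. Table 5.6 lists 6.03 µJ, which corresponds to the rounded 450 ms latency quoted in the conclusion of this chapter.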

There are several ECC implementations proposed for RFID tags. Lee et al. [122] use digit-serial multipliers, while Hein et al. use a 16x16 GF(2) multiplier and a 32-bit accumulator. Compared with the implementation in [122], using a 16x16-bit multiplier requires less area and lower power consumption. On the other hand, it requires 296k clock cycles, twice as many as Lee’s ECC processor (and our HECC processor), for one scalar multiplication, and its energy consumption is about 6 times higher.

Our HECC processor can meet the requirements for passive RFID tags in terms of area, power and energy. However, ECC implementations are still leading in terms of energy efficiency. For each scalar bit, the ECC implementation [122] uses 7M+4S in GF(2^163), while our HECC implementation uses 16M+7S in GF(2^83). As a result, our HECC processor consumes about twice as much energy as the ECC processor [122] using the same technology node, because the HECC scalar multiplication uses twice as many cycles as ECC.

5.5 Conclusion

In this chapter, we explored the efficiency of a Unified Multiplier and Inverter data-path in HECC implementations. Two types of UMI are proposed. Type-I UMI, which realizes the LSB-first multiplication and right-shift EEA algorithms, achieves a short critical path delay. Using the Type-I UMI results in a high-performance HECC processor on FPGA. The Type-II UMI, which realizes the MSB-first multiplication and the left-shift EEA algorithms, achieves a low footprint. Using the Type-II UMI results in a lightweight HECC processor for constrained devices. Both implementations use curves defined by h(x) = x and f(x) = x^5 + f_3 x^3 + x^2 + f_0, where f_3, f_0 ∈ F_{2^83}. The high-throughput version uses 2316 slices and 2016 bits of block RAM on a Xilinx Virtex-II FPGA, and finishes one scalar multiplication in 311 µs. The lightweight HECC processor, implemented with the UMC 130 nm low-leakage library, uses only 14.5 Kgates, and one scalar divisor multiplication takes 450 ms. The power consumption, estimated with Power Compiler, is 13.4 µW when running at 300 kHz.

Table 5.6: Performance comparison of HECC and ECC implementations targeting RFID tags.

Ref.                     ASIC     Finite      Area       Perf.      Freq.   Power   Energy⋆   Comments
Design                   Tech.    Field       [Kgates]   [#cycle]   [kHz]   [µW]    [µJ]
HECC, This work          130 nm   GF(2^83)    14.5       136,838    300     13.4    6.03      Type-II UMI (d=4)
HECC, Sakiyama [148]     130 nm   GF(2^67)    7.6†       266,133    500     19†     10.0†     1 Mult. (d=8)
ECC, Lee et al. [122]    130 nm   GF(2^163)   14.1‡      144,842    590     21.55   5.29      (d=2)
                                              14.7‡      101,183    411     15.75   3.88      (d=3)
                                              15.4‡      78,544     323     12.08   2.94      (d=4)
ECC, Hein et al. [90]    180 nm   GF(2^163)   11.9       296,000    106     10.8    31.3⋄     16x16 mult.

† Modular Arithmetic Logic Unit only
‡ Including ECC core and an 8-bit controller for cryptographic protocols
⋆ Energy for one scalar multiplication
⋄ Estimated by authors

Table 5.7: Register allocation for divisor doubling.

Input: {R4,R5,R6,R7} = D1(={u10,u11,v10,v11}).

1. R3:= R4*R4; 2. R4:= R5*R5+f3; 3. R6:= R6*R6+f0;

4. R6:= 1/R6; 5. R6:= R6*R3; 6. R2:= R4*R6;

7. R0:= R2+R5; 8. R5:= R6*R6; 9. R1:= R6+R4 ;

10. R4:= R0*R0+R6; 11. R2:= R1*R2+f2; 12. R2:= R6*R5 +R2;

13. R7:= R7*R7+R2; 14. R6:= R1*R4 +R3;

Return: {R4,R5,R6,R7} = 2*D1.

Table 5.8: Register allocation for divisor addition.

Input: {R4,R5,R6,R7} = D1(={u10,u11,v10,v11}),

{R8,R9,R10,R11} = D0(={u00,u01,v00,v01}).

1. R0:=R5+R9; 2. R1:=R0*R0; 3. R1:=R1*R4;

4. R2:=R5*R0; 5. R3:=R8+R4; 6. R2:=R2+R3;

7. R3:=R3*R2+R1; 8. R6:=R6+R10; 9. R1:=R2*R6;

10. R7:=R7+R11; 11. R6:=R7+R6; 12. R0:=R5+R9;

13. R7:=R0*R7; 14. R2:=R2+R0; 15. R6:=R2*R6+R1;

16. R6:=R7*R5+R6; 17. R6:=R7+R6; 18. R7:=R4*R7+R1;

19. R2:=R3*R6; 20. R2:=1/R2; 21. R6:=R6*R6;

22. R6:=R6*R2; 23. R2:=R3*R2; 24. R3:=R3*R2;

25. R4:=R4+R3; 26. R7:=R7*R2; 27. R0:=R9+R5;

28. R5:=R7+R5; 29. R7:=R7+R0; 30. R4:=R5*R7+R4;

31. R7:=R7+R0; 32. R1:=R9*R7+R8; 33. R4:=R4+R1;

34. R5:=R3*R3; 35. R3:=R8*R7; 36. R4:=R0*R5+R4;

37. R5:=R0+R5; 38. R7:=R9+R7; 39. R7:=R7+R5;

40. R0:=R5*R7+R4; 41. R0:=R0+R1; 42. R7:=R4*R7+R3;

43. R0:=R0*R6+R11; 44. R6:=R7*R6+R10; 45. R7:=R0+1;

Return: {R4,R5,R6,R7} = D1 + D0.

Chapter 6

Breaking ECC with Configurable Hardware

• 6.1 Motivation

• 6.2 The Certicom Challenge

• 6.3 The ev1l Project: Design Target

• 6.4 Arithmetic and Complexity Analysis

• 6.5 Architecture Exploration

• 6.6 Results and Comparison

• 6.7 Security Estimation for ECC2-131 and ECC2K-163

• 6.8 Conclusion


6.1 Motivation

In order to have a secure cryptosystem, the complexity of the fastest known attack should be far beyond the resources (e.g. computing power, storage, time) of any practical adversary. However, the margin of safety provided by a cryptosystem erodes with time, as adversaries improve the attack methods and get more powerful computing tools. As a result, to use a cryptosystem in the real world, we need to know the answers to the following two questions:

• How big should the parameters (or, colloquially, the “key size”) be to avoid practical attacks? Choosing parameters too small allows computational attackers to break the schemes, while choosing parameters too large wastes time, communication, and storage. For example, ECC over F_{2^109} is no longer secure since ECDLP can be practically solved [39].

• How long can my implementation stay secure against practical attacks? The security level of a cryptosystem is related to the computing power of the adversary, and the computing power of adversaries is growing constantly due to the improvements of the silicon fabrication technology. When DES was proposed in 1976, the complexity of key searching was beyond any adversary at that time. However, recent results show that it takes less than a day to break DES using a brute-force attack [1]. The security of all cryptosystems faces the same threat and their computational complexity should be frequently evaluated with respect to state-of-the-art hardware platforms.

In this study, we investigate the efficiency of using FPGA clusters to break ECC. We explore implementation options for the core finite-field arithmetic operations as well as architectures. An in-depth comparison between polynomial basis multiplier, Type-II normal basis multiplier and Shokrollahi’s multiplier is given. Our work proves the superiority of an FPGA platform over other specialized architectures and its suitability for tasks that are computationally demanding.


J. Fan, D. V. Bailey, L. Batina, T. Güneysu, C. Paar, and I. Verbauwhede, “Breaking elliptic curve cryptosystems using reconfigurable hardware,” in Field Programmable Logic and Applications – FPL 2010. IEEE, pp. 133–138, 2010.


6.2 The Certicom Challenge

To encourage the investigation of practical attacks on ECC, researchers at Certicom Corp. published a list of ECDLP challenges in 1997 [40]. The Certicom ECC challenge is "to increase the industry’s understanding and appreciation for the difficulty of the elliptic curve discrete logarithm problem...". Each challenge in this list includes a set of curve parameters and two points, P and Q. Table 6.1 summarizes the challenges based on binary curves and their strengths estimated by Certicom. Note that the number of machine days is estimated based on a Pentium 100 processor. The complete challenge list is available in [40].

Table 6.1: Certicom Challenges (Binary Curves) and their complexity estimated by Certicom [40].

Challenge    Field size   Security   Machine     End       Machine
             (in bits)    level ‡    days *      date      days †
ECC2-79      79           ≈ 2^39     352         1997.12   116
ECC2-89      89           ≈ 2^44     11278       1998.02   1114
ECC2K-95     97           ≈ 2^42     18322       1998.05   1709
ECC2-97      97           ≈ 2^48     180448      1999.09   6118
ECC2K-108    109          ≈ 2^48     1.3x10^6    2000.04   1.7x10^5
ECC2-109     109          ≈ 2^54     2.1x10^7    2004.11   4.4x10^5
ECC2K-130    131          ≈ 2^58     2.7x10^9    -         -
ECC2-131     131          ≈ 2^65     6.6x10^10   -         -
ECC2K-163    163          ≈ 2^74     3.2x10^14   -         -
ECC2-163     163          ≈ 2^81     6.2x10^15   -         -
ECC2-191     191          ≈ 2^95     1.0x10^20   -         -
ECC2K-238    239          ≈ 2^111    9.2x10^25   -         -
ECC2-238     239          ≈ 2^119    2.1x10^27   -         -
ECC2K-358    359          ≈ 2^171    2.8x10^44   -         -
ECC2-353     359          ≈ 2^179    1.3x10^45   -         -

* Based on Pentium 100.
† Based on 500 MHz Digital Alpha workstation.
‡ Security estimation: (1/2)·sqrt(π·2^n) for ordinary curves, (1/2)·sqrt(π·2^n)/n for Koblitz curves.
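The security-level column of Table 6.1 can be reproduced from the footnote formula. The sketch below assumes the footnote reads (1/2)·sqrt(π·2^n) for ordinary curves and (1/2)·sqrt(π·2^n)/n for Koblitz curves, where n is the field size in bits; this reading is an assumption chosen because it matches the tabulated exponents, and the function name is hypothetical.

```python
from math import log2, pi


def log2_security(n, koblitz=False):
    """Base-2 logarithm of the estimated number of group operations."""
    bits = 0.5 * (log2(pi) + n) - 1    # log2 of (1/2) * sqrt(pi * 2^n)
    if koblitz:
        bits -= log2(n)                # extra division by n for Koblitz curves
    return bits


# Field sizes taken from Table 6.1: ECC2-109, ECC2K-130, ECC2-131, ECC2K-163.
for name, n, is_koblitz in [("ECC2-109", 109, False), ("ECC2K-130", 131, True),
                            ("ECC2-131", 131, False), ("ECC2K-163", 163, True)]:
    print(name, round(log2_security(n, is_koblitz)))
```

Under this reading the four examples round to 54, 58, 65 and 74 bits, which agrees with the corresponding rows of the table.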


Smaller members of the list of Certicom challenges have been solved. Escott et al. report on their successful attack on ECCp-97, an ECDLP in a group of roughly 2^97 elements [61]. A larger instance, ECC2-109, was solved by Monico et al. [39]. Bos et al. analyze the use of the PlayStation 3 to attack an ECDLP in a group of roughly 2^112 elements [31]. Thanks to more powerful processors, the solved challenges used far fewer machine days than predicted. Today those challenges would take even fewer machine days to solve. For example, breaking the ECC2K-95 challenge now takes 19 hours on ten 2.4 GHz Core 2 Quad CPUs [10]. This again reminds us of the importance of understanding the growth of computing power and its implications for the security of a cryptographic system.

By now the most difficult binary curve challenge that has been solved is ECC2-109. In this chapter we report on our effort to solve the next challenge: ECC2K-130. We use an FPGA cluster as computing platform. As part of this effort, this work explores FPGA implementation options for the core finite-field arithmetic operations as well as architectures. We compare implementations of the attack using polynomial basis, Type-II normal basis, and Type-II polynomial basis. Most notably, this is the first FPGA implementation of Shokrollahi’s multiplication algorithm.

This work is part of a multi-party distributed effort to break the largest ECDLP ever solved [10]. The same attack is implemented on Core 2 Extreme, PlayStation 3 and graphics cards. Compared to software implementations, our work obtains a much higher performance-cost ratio by using FPGA platforms. The results prove the superiority of an FPGA platform over other specialized architectures and its suitability for tasks that are computationally demanding. Results in this study are relevant both to the cryptanalytic community and to those interested in fast cryptographic implementations in normal basis.

6.2.1 The Parallel Pollard Rho Attack

The fastest known attacks against the ECDLP are generic attacks based on Pollard’s rho method [145, 35]. Further improvements, including parallelization and the use of group automorphisms, were made by Wiener et al. [166], van Oorschot et al. [160], and Gallant et al. [73]. The parallelized Pollard rho method employs a number of engines that independently search for Distinguished Points (DPs), which are points that satisfy certain criteria. Each engine starts its search from a random point and reports the resulting DP to a central server that adds it to a database. The attack is completed once two identical distinguished points are found. The core function is thus the update function, also known as the iteration function. In choosing the criterion for a point to be distinguished, we can carefully trade off time and communication, allowing us to tailor the method to FPGAs.

$$\underbrace{P_0}_{\text{Starting Point}} \to P_1 \to \cdots \to P_{s-1} \to \underbrace{P_s}_{\text{DP}} \qquad (6.1)$$
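A minimal sketch of one walk of the form (6.1) and of the server-side collision check, in Python. Everything in it is a toy stand-in chosen for illustration: the search space is the integers modulo a small `N`, and the update function `f` and the low-bits criterion in `is_distinguished` are hypothetical. None of it is the iteration function or the distinguished-point criterion used in the attack.

```python
# Toy illustration only: integers mod N replace the elliptic-curve group,
# and f is an arbitrary stand-in for the real iteration function.
N = 1_000_003


def f(x):
    """Hypothetical pseudo-random update function."""
    return (x * x + 1) % N


def is_distinguished(x, mask=0xFF):
    """Toy criterion: the low 8 bits of x are all zero."""
    return x & mask == 0


def walk(start, max_steps):
    """Follow start -> f(start) -> ... until a distinguished point is hit.

    Returns (dp, steps), or (None, max_steps) if no DP was reached.
    """
    x = start
    for steps in range(max_steps + 1):
        if is_distinguished(x):
            return x, steps
        x = f(x)
    return None, max_steps


def server(reports):
    """Store reported DPs; return (start_a, start_b, dp) for the first
    DP that is reported from two different starting points."""
    seen = {}
    for start, dp in reports:
        if dp is None:
            continue
        if dp in seen and seen[dp] != start:
            return seen[dp], start, dp
        seen[dp] = start
    return None
```

Two walks that ever meet follow the same path from then on and therefore report the same distinguished point, which is what the server detects. A stricter criterion makes walks longer and reports rarer, which is the time/communication trade-off described in the text.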

6.2.2 FPGA-based Attacks

FPGAs have been applied to the Pollard rho method in several previous works. Güneysu et al. analyze ECDLPs over fields of odd-prime characteristic [84, 83], targeting a machine with 128 low-cost FPGAs. They extrapolate that breaking an ECDLP in a group of roughly $2^{131}$ elements using this machine takes over a thousand years.

Similarly, Meurice de Dormale et al. apply FPGAs to the ECDLP [131]. They use characteristic-two finite fields, but restrict their inquiry to polynomial basis. Although conventional wisdom has held that low-weight polynomial basis is the better choice, in our application we can take advantage of the free repeated squarings ($2^n$-th powers) offered by normal basis. In addition, recent progress on normal-basis multiplication by Shokrollahi et al. [163], further improved by Bernstein et al. [17], strengthens the prospects for normal basis. Our work is the first FPGA implementation of this new approach to normal-basis multiplication and Pollard rho. Compared with [131], it uses algorithms in normal basis to achieve a time-area product that is better by more than a factor of eight.

For our attack we make use of a custom-made FPGA cluster (RIVYERA), which has been designed to combine high computing power with low cost [153]. The RIVYERA cluster system was designed as a completely modular architecture that can be dynamically reconfigured through 16 pluggable modules. Each module is populated with 8 FPGAs whose type can be chosen according to the needs of the application. For our attack, we employ the Xilinx Spartan-3 XC3S5000 FPGA. All FPGAs are tightly connected by two bi-directional ring buses that enable the transfer of data frames between each individual FPGA and a controlling host PC. The integrated PC connects to the FPGA backplane by either one or two PCIe bridges, i.e., an interface module that is plugged into a PCIe slot on the PC’s mainboard and provides a serial link directly to the FPGA backplane. Note that although several interfaces are involved, a data throughput of 550 MBit/s (reading) and 720 MBit/s (writing) can still be achieved between each FPGA and the PC in an actual application. Figure 6.1 shows the system architecture of the RIVYERA cluster.

In order to simplify the interface between the PC and the FPGAs, we use two FIFOs to handle the input data (Base Points) and the output data (Distinguished Points). The host PC pushes new base points into the input FIFO if it is not full, and reads back distinguished points from the output FIFO when it is not empty. The interface is shown in Figure 6.2.

6.3 The Ev1l Project: Design Target

6.3.1 The Iteration Function

We briefly describe the iteration function in this section. The rationale behind the design of the iteration function can be found in [10]. The iteration function is implemented on the FPGA.

Our condition for a point $P_i$ to be a distinguished point is that, in Type-II normal-basis representation, $\mathrm{HW}(x_{P_i}) \le 34$, where $\mathrm{HW}(t)$ returns the Hamming weight of $t$. Our iteration function is also defined via the normal-basis representation; it is given by

$$P_{i+1} = \sigma^j(P_i) + P_i, \qquad (6.2)$$

Figure 6.1: RIVYERA cluster system based on Xilinx Spartan-3 5000 FPGAs.


Figure 6.2: Interface between the host PC and each FPGA.

where $j = ((\mathrm{HW}(x_{P_i})/2) \bmod 8) + 3$ and $\sigma^j(x, y) = (x^{2^j}, y^{2^j})$. To solve ECC2K-130, $2^{60.9}$ iterations are required in total.
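As a small illustration, the distinguished-point test and the choice of the index j can be sketched in Python. The sketch assumes that the x-coordinate is already available as a 131-bit integer whose bits are its Type-II normal-basis coefficients (the name `x_nb` is hypothetical). It uses integer division, which agrees with HW/2 whenever the Hamming weight is even.

```python
def hamming_weight(x):
    """Number of one-bits in the integer x."""
    return bin(x).count("1")


def is_distinguished(x_nb, bound=34):
    """DP criterion from the text: HW(x) <= 34 in normal basis."""
    return hamming_weight(x_nb) <= bound


def sigma_index(x_nb):
    """j = ((HW(x) / 2) mod 8) + 3, so j always lies in 3..10."""
    return ((hamming_weight(x_nb) // 2) % 8) + 3
```

Because j depends only on the Hamming weight, the same comparator and counter hardware can serve both the distinguished-point check and the selection of the shift amount.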

An efficient implementation of the iteration function is thus the key step towards a fast attack. Given $P_i(x, y)$, the iteration function computes $P_{i+1}(x', y')$ using Eqn. (6.2). Figure 6.3 shows the data flow of the iteration function.

6.4 Arithmetic and Complexity Analysis

The iteration function consists of two multiplications, one inversion and several squarings in $GF(2^m)$. Thus, fast finite field arithmetic is essential to optimize the attack.

A vast body of literature exists on finite field arithmetic (see Sect. 2.4 and Sect. 5.2), and we are free to choose from a variety of representations and algorithms.

• Basis: polynomial basis or normal basis;

• Inversion: Extended Euclidean Algorithm (EEA) or Fermat’s little theorem.

Figure 6.3: Dataflow graph of the iteration function.

This leads to an important question: which combination ensures the most efficient implementation of the aforementioned iteration function? We try to answer this question with complexity analysis and design-space exploration.


An element of $GF(2^{131})$ can be represented in both polynomial basis P and Type-II normal basis N, where

$$P = \{1, w, w^2, \ldots, w^{130}\},$$

$$N = \{\gamma + \gamma^{-1}, \gamma^2 + \gamma^{-2}, \gamma^{2^2} + \gamma^{-2^2}, \ldots, \gamma^{2^{130}} + \gamma^{-2^{130}}\}.$$

Here $w$ is a root of an irreducible polynomial of degree 131, while $\gamma$ is a primitive 263rd root of unity. Multiplication in polynomial basis has long been considered more efficient than in normal basis. On the other hand, squaring in normal basis is simply a circular shift. Moreover, any power $\alpha^{2^n}$ can be computed by circularly shifting by $n$ positions. We implemented both options for comparison.
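The circular-shift property can be sketched in a few lines of Python. The sketch assumes that an element is stored as a 131-bit integer in which bit i is the coefficient of the i-th element of N as listed above; under that (assumed) ordering, squaring is a rotation to the left by one position. A different coefficient ordering would rotate in the other direction.

```python
M = 131                      # extension degree of GF(2^131)
MASK = (1 << M) - 1


def nb_square(x, n=1):
    """Return x^(2^n) for x given by its normal-basis coefficients.

    Assumption: bit i of x is the coefficient of the i-th basis
    element, so one squaring is a left rotation by one position.
    """
    n %= M
    return ((x << n) | (x >> (M - n))) & MASK
```

The rotation leaves the Hamming weight unchanged and costs the same for any n, which is why a $2^n$-th power is as cheap as a single squaring in this representation.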

Besides conventional multiplication algorithms in polynomial and normal basis, we also implemented a recently reported hybrid algorithm due to Shokrollahi [163]. This algorithm uses only $O(m \log m)$ operations for the basis conversion. When a multiplication is needed, the two field elements are converted to polynomial basis, a polynomial-basis multiplication is carried out, and the result is converted back to normal basis.

Polynomial-Basis Multiplier

Algorithms for multiplication in polynomial basis consist of two steps, which may be carried out separately or interleaved: polynomial multiplication and modular reduction. Given two elements $A(w) = \sum_{i=0}^{m-1} a_i w^i$ and $B(w) = \sum_{i=0}^{m-1} b_i w^i$, a bit-serial modular multiplication algorithm is shown in Algorithm 22.


Algorithm 22 Bit-serial modular multiplication in $GF(2^m)$.

Input: $A(w) = \sum_{i=0}^{m-1} a_i w^i$, $B(w) = \sum_{i=0}^{m-1} b_i w^i$ and $P(w)$.

Output: $A(w)B(w) \bmod P(w)$.

1: $C(w) \leftarrow 0$;

2: for $i = m-1$ to $0$ do

3: $C(w) \leftarrow w\,(C(w) + c_m P(w) + b_i A(w))$;

4: end for

Return: $C(w)/w$.

It is well known that one way to reduce area complexity is to use a polynomial $P(w)$ of special form, such as one with low Hamming weight. For $GF(2^{131})$ there exists an irreducible pentanomial $P(w) = w^{131} + w^{13} + w^2 + w + 1$. Thus, the complexity of step 3 in Algorithm 22 is $2m+4$ XOR and $m+4$ AND operations.
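The following Python sketch mirrors Algorithm 22 for the pentanomial quoted above, with polynomials stored as integers (bit i is the coefficient of $w^i$). It is an illustrative software model rather than the hardware datapath; $c_m$ is read as the coefficient of $w^m$ in $C(w)$, which the text does not define explicitly, and `reference_mul` is a hypothetical helper added only for cross-checking.

```python
M = 131
# P(w) = w^131 + w^13 + w^2 + w + 1, the pentanomial quoted in the text
P = (1 << 131) | (1 << 13) | (1 << 2) | (1 << 1) | 1


def bitserial_mul(a, b, m=M, p=P):
    """Algorithm 22: scan the bits of b from b_{m-1} down to b_0."""
    c = 0
    for i in range(m - 1, -1, -1):
        if (c >> m) & 1:        # c_m = 1: add P(w) to clear w^m
            c ^= p
        if (b >> i) & 1:        # add b_i * A(w)
            c ^= a
        c <<= 1                 # multiply by w
    return c >> 1               # final division by w


def reference_mul(a, b, m=M, p=P):
    """Schoolbook carry-less product followed by a separate reduction."""
    prod = 0
    i = 0
    while b >> i:
        if (b >> i) & 1:
            prod ^= a << i
        i += 1
    for k in range(prod.bit_length() - 1, m - 1, -1):
        if (prod >> k) & 1:     # cancel w^k using w^(k-m) * P(w)
            prod ^= p << (k - m)
    return prod
```

Each loop iteration performs at most one conditional XOR with P(w) and one with A(w), which matches the per-step XOR/AND structure counted in the text.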

One can also compute $C(w) = A(w)B(w) = \sum_{i=0}^{2m-2} c_i w^i$ first, and then reduce it with $P(w)$. In this case, the Karatsuba-Ofman method can be used to reduce the complexity of polynomial multiplication. The reduction phase requires $O(m)$ AND and XOR operations when a low-weight $P(w)$ exists. For example, when $P(w)$ is a pentanomial, reducing $C(w)$ requires around $4m$ AND and $4m$ XOR operations. The overall complexity of a modular multiplication is $M(m) + O(m)$, where $M(m)$ is the complexity of an $m$-bit polynomial multiplication.
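A minimal sketch of the Karatsuba-Ofman idea for binary polynomials, again with integers as bit vectors. The recursion threshold and the function names are arbitrary choices made for illustration; the sketch covers only the polynomial multiplication and leaves the reduction by P(w) as a separate step.

```python
def clmul_schoolbook(a, b):
    """Carry-less (GF(2)[w]) product by shift-and-XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba(a, b, n, threshold=16):
    """Product of two binary polynomials with fewer than n coefficients.

    Splits each operand at h = n // 2 and uses three half-size
    products instead of four.
    """
    if n <= threshold:
        return clmul_schoolbook(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba(a0, b0, h, threshold)
    hi = karatsuba(a1, b1, n - h, threshold)
    mid = karatsuba(a0 ^ a1, b0 ^ b1, n - h, threshold)
    # (a0 + a1)(b0 + b1) + a0*b0 + a1*b1 is the cross term in GF(2)[w]
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << (2 * h))
```

In characteristic two the subtractions of the classical method become XORs, so the recombination step is carry-free.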

Normal-Basis Multiplier

Elements in $\mathbb{F}_{2^{131}}$ can also be represented using normal basis. As described in Sect. 2.4, a basis element $(\gamma^{2^i} + \gamma^{-2^i})$ for $i \in [1, 131]$ can be written as $(\gamma^j + \gamma^{-j})$ for some $j \in [1, 131]$. As a result, the following basis pN is equivalent to N:

$$pN = \{\gamma + \gamma^{-1}, \gamma^2 + \gamma^{-2}, \gamma^3 + \gamma^{-3}, \ldots, \gamma^{131} + \gamma^{-131}\}.$$

pN is also known as permuted normal basis. Let $\beta_i = \gamma^i + \gamma^{-i}$; then an element $T$ in $GF(2^m)$ is represented as $T = \sum_{i=1}^{m} t_i \beta_i$. One multiplication of $A$ and $B$ represented with pN requires $m^2$ AND and $3m(m-1)/2$ two-input XOR operations.

This algorithm was then adapted by Kwon [118] to derive a systolic multiplier. Compared to the Sunar-Koc multiplier, Kwon’s architecture, shown in Figure 6.4, is highly regular and thus can be implemented in a digit-serial manner. On the other hand, it has higher complexity: $2m$ AND and $2m$ XOR gates for a bit-serial multiplier.

Figure 6.4: Modular multiplier in GF (2m) using Kwon’s algorithm.

Shokrollahi’s Multiplier

Shokrollahi discovered an efficient algorithm for basis conversion between permuted normal basis and polynomial basis. The polynomial basis used here is known as the new polynomial (nP) basis,

$$nP = \{(\gamma + \gamma^{-1}), (\gamma + \gamma^{-1})^2, \ldots, (\gamma + \gamma^{-1})^m\},$$

which leads to a hybrid normal-basis multiplication algorithm. We denote by $A_{nP}$ and $A_{pN}$ the representations of $A$ in nP and pN, respectively. A multiplication then proceeds as follows.

1. converting to polynomial basis: $A_{pN} \to A_{nP}$, $B_{pN} \to B_{nP}$,

2. polynomial multiplication: $C_{nP} \leftarrow A_{nP} B_{nP}$,

3. converting back to normal basis: $C_{nP} \to C_{pN}$,

4. reduction.

The essential observation by Shokrollahi is that basis conversion can be performed recursively. For example, converting an 8-bit vector $\{f_1, f_2, \ldots, f_8\}_{pN}$ can be divided into two 4-bit conversions:

$$\{g_1, g_2, g_3, g_4\}_{nP} \leftarrow \{f_1, f_2, f_3, f_4\}_{pN}$$

$$\{g_5, g_6, g_7, g_8\}_{nP} \leftarrow \{(f_5 + f_3), (f_6 + f_2), (f_7 + f_1), f_8\}_{pN}$$

For an in-depth discussion of this algorithm, we refer the reader to a recent publication by Bernstein and Lange [17].


Figure 6.5: Shokrollahi multiplier.

The pN → nP and nP → pN conversion in $GF(2^m)$ takes $(m/2)\log_2(m/4)$ XOR operations. Note that step 3 converts a polynomial of degree $2m$, taking $m\log_2(m/2)$ XOR operations. The reduction in step 4 takes $m/2$ XOR operations. In total, one field multiplication takes $M(m) + m\log_2 m$ operations.

Based on the analysis above, we can draw the following conclusions:

• A bit-serial multiplier using polynomial basis has a lower area complexity than one using normal basis.

• When low-weight polynomials exist, Shokrollahi’s multiplication algorithm is likely to have a higher complexity than conventional polynomial-basis multiplication, since the basis conversion step is more complex than the polynomial reduction.

Though it seems that polynomial basis should be used, normal basis offers several advantages in this specific application. First, the iteration function (see Eqn. (6.2)) requires the Hamming weight of the x-coordinate represented in normal basis. In fact, checking the HW of x in normal basis checks 131 points simultaneously, bringing a speedup of $\sqrt{131}$ [10] to the attack. Second, the iteration function includes two $x^{2^j}$ routines (known as m-squarings). In normal basis, $\sigma^j$ is essentially a circular shift by $j$ bits, and thus can be performed in one cycle. Gains in m-squaring compensate for the loss in multiplications.


6.4.2 Inversion

Inversion is the most costly of the four basic field operations. Two broad approaches are found in the literature: the Extended Euclidean Algorithm (EEA) and Fermat’s Little Theorem (FLT). In polynomial basis, the binary variant of EEA is generally faster, while the variant of FLT attributed to Itoh and Tsujii is the better choice in normal basis because squaring is free [99]. Itoh-Tsujii reduces the problem of extension-field inversion to exponentiation and inversion in the subfield. In polynomial basis, exponentiation is generally quite expensive owing to the need to explicitly compute squares, making EEA a better choice. Itoh-Tsujii raises an element to the exponent $r - 1 = 2 + 2^2 + \cdots + 2^{m-1}$, using the fact that in normal basis, squaring is free. In addition, this algorithm uses an addition chain to reduce the number of multiplications: in $GF(2^{131})$, our addition chain has length nine (1, 2, 4, 8, 16, 32, 64, 128, 130). The net result is a complexity of eight field multiplications to compute an inverse.
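The addition chain can be traced with a short Python sketch. To stay self-contained it uses polynomial-basis arithmetic modulo the pentanomial $w^{131} + w^{13} + w^2 + w + 1$ as a stand-in, so each squaring here is an explicit multiplication; in normal basis the calls to `gf_sqr_n` would be plain circular shifts. The helper names are hypothetical.

```python
M = 131
P = (1 << 131) | (1 << 13) | 0b111   # w^131 + w^13 + w^2 + w + 1


def gf_mul(a, b):
    """Multiply two reduced elements of GF(2^131) (polynomial basis)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= P
    return r


def gf_sqr_n(a, n):
    """a^(2^n) by n successive squarings."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a


def itoh_tsujii_inverse(a):
    """Return (a^-1, number of field multiplications) for a != 0.

    beta[k] = a^(2^k - 1); the chain 1, 2, 4, ..., 128, 130 gives
    beta[130], and a^-1 = beta[130]^2 = a^(2^131 - 2).
    """
    mults = 0
    beta = {1: a}
    k = 1
    while k < 128:
        beta[2 * k] = gf_mul(gf_sqr_n(beta[k], k), beta[k])
        mults += 1
        k *= 2
    beta[130] = gf_mul(gf_sqr_n(beta[128], 2), beta[2])
    mults += 1
    return gf_sqr_n(beta[130], 1), mults
```

The multiplication counter confirms the figure given in the text: seven doublings plus one final step from 128 to 130, eight multiplications in total.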

Mitigating this cost somewhat is a method for simultaneous inversion called Montgomery’s trick, which trades inversions for multiplications [136]. Algorithm 23 shows this method for inverting three inputs. Indeed, we can trade one inversion for three extra multiplications. As a result, one iteration function uses $5M + (1/n)I$, where $n$ is the batch size.

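Algorithm 23 Simultaneous inversion (Batch size = 3).

Input: $\alpha_1, \alpha_2, \alpha_3$.

Output: $\alpha_1^{-1}$, $\alpha_2^{-1}$ and $\alpha_3^{-1}$.

1: $d_1 \leftarrow \alpha_1$

2: $d_2 \leftarrow d_1 \alpha_2$

3: $d_3 \leftarrow d_2 \alpha_3$

4: $u \leftarrow d_3^{-1}$

5: $t_3 \leftarrow u d_2$, $u \leftarrow u \alpha_3$

6: $t_2 \leftarrow u d_1$, $u \leftarrow u \alpha_2$

7: $t_1 \leftarrow u$

Return: $t_1, t_2, t_3$.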
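Algorithm 23 extends directly to an arbitrary batch size. The Python sketch below is one way to write that generalization; the field is passed in as two callables, `mul` and `inv`, which are assumptions of this sketch. The closing lines use a toy prime field (p = 1009) purely for demonstration, not the field used in the attack.

```python
def batch_inverse(alphas, mul, inv):
    """Invert all elements of alphas using a single call to inv.

    d[i] holds the running product alpha_1 * ... * alpha_{i+1},
    as in steps 1-3 of Algorithm 23.
    """
    n = len(alphas)
    d = [alphas[0]]
    for a in alphas[1:]:
        d.append(mul(d[-1], a))
    u = inv(d[-1])                 # the only real inversion
    t = [None] * n
    for i in range(n - 1, 0, -1):
        t[i] = mul(u, d[i - 1])    # t_i = u * d_{i-1}
        u = mul(u, alphas[i])      # strip alpha_i from u
    t[0] = u
    return t


# Toy usage in the prime field GF(1009) (illustrative only).
p = 1009
toy = batch_inverse([3, 5, 7],
                    lambda x, y: x * y % p,
                    lambda x: pow(x, -1, p))
```

For a batch of n inputs the sketch performs 3(n - 1) multiplications and one inversion, which is where the amortized cost of (1/n)I per iteration comes from.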

6.5 Architecture Exploration

The architecture of the engine has a fundamental impact on the overall throughput. Among all the design options, the following three are of great importance.


1. Multiplier architecture

2. Memory architecture

3. Inverter architecture

As an architecture exploration, we implemented three different architecturesusing different types of multipliers.

6.5.1 Architecture I: Load-Store, Polynomial basis

As a starting point, we take a programmable elliptic-curve coprocessor as the platform. A digit-serial polynomial multiplier (see [150] for details) is used. A dedicated squarer is included for squaring. In each loop, the x-coordinate is converted to its normal-basis representation, and its Hamming weight is counted. This adds a basis conversion block and a hardware block for Hamming weight computation.

Figure 6.6: Archi-I: ECC processor using polynomial basis.

On this platform, squaring or addition takes two clock cycles, while multiplication takes $\lfloor n/d \rfloor + 1$ cycles given a digit size $d$. The design is synthesized using ISE 11.2, and the target FPGA is a Xilinx Spartan-3 XC3S5000 (4FG676). Implementation results show that $d = 22$ gives the best trade-off in terms of area-delay product.

The design consumes 3656 slices, including 1468 slices for the multiplier, 75 slices for the squarer, 1206 slices for the basis conversion, and 117 slices for the Hamming weight calculation.

One Pollard rho iteration takes 71 cycles, 35 of which are used for multiplication. The design achieves a maximum clock frequency of 101 MHz, and one iteration takes 704 ns. The m-squaring (i.e. $a^{2^m}$) is performed with $m$ successive squarings. Obviously, this architecture is not efficient. The m-squaring operations can be sped up considerably when normal basis is used.

6.5.2 Architecture II: Load-Store, Type-II Normal Basis

Archi-II uses a digit-serial normal-basis multiplier. The structure of the multiplier is shown in Figure 6.4. When $m$ is small, a full systolic architecture can be used, performing one multiplication per cycle. However, a systolic array for $m = 131$ is too large (more than 20000 slices on Spartan-3). Thus, a digit-serial architecture is used. Implementation results show that $d = 13$ gives the lowest area-delay product. The multiplier alone uses 2093 slices.

The basis-conversion component in Archi-I is no longer needed in Archi-II, saving 1468 slices. In total, the design uses 2578 slices. On this platform, one Pollard rho iteration takes 81 cycles, including 55 cycles used for multiplication. Compared to Archi-I, the m-squaring operation is greatly improved. However, the multiplier becomes much slower than that in Archi-I. The design achieves a maximum clock frequency of 125 MHz, and one iteration takes 648 ns.

Figure 6.7: Archi-II: ECC processor using normal basis.

6.5.3 Architecture III: Fully Expanded, Type-II Polynomial Basis

Archi-III unrolls the Pollard rho iteration such that a throughput of one iteration per cycle is achieved. Recall that 5 multiplications are required for each iteration; as a result, five normal basis multipliers are used. The design is fully pipelined. Since additions and squarings are embedded in the pipeline, they increase the delay of one iteration but do not affect the throughput.

Figure 6.8: Archi-III: pipelined processor using Shokrollahi multipliers.

At first glance, fully expanding the iteration function seems impossible due to the inversion in each iteration. Indeed, after dx is generated in Figure 6.3, inverting dx will take too much area to fit on one FPGA. The solution is to start the pipeline after the real inversion ($d_3^{-1}$ in Algorithm 23) is performed.


Figure 6.8 shows the architecture that supports the expanded iteration function. In total, five multipliers are used. Before the pipeline starts, x, y, dx and dy of $P_i$ are stored in RAM (x), (y), (dx) and (dy), respectively. RAM (dn) keeps the intermediate data $d_i$ of Algorithm 23, and $u$ is generated by the inverter (not shown in Figure 6.8). After the pipeline starts, the five multipliers perform the following operations.

• Mul_1: ti ← udi

• Mul_2: u← uαi

• Mul_3: λ← dy(1/dx)

• Mul_4: λ(x′ + x)

• Mul_5: d′i ← d′idx′

Mul_1, Mul_2 and Mul_5 are used by batch inversion (Algorithm 23), while Mul_3 and Mul_4 are used for point addition (Figure 6.3).

The inversion is performed by another multiplier together with a squarer. In order to keep the engine fully used, we interleave two groups of iteration functions. While the engine is executing one group, the inverter is performing the inversion ($u \leftarrow d_n^{-1}$) for the other group.

This implementation of Archi-III consumes 22,195 slices and 20 block RAMs (RAMB16s) on a Xilinx Spartan-3 XC3S5000 FPGA. One fully pipelined Shokrollahi multiplier uses 4,391 slices. The inverter itself uses 4,761 slices. In total, the design uses 26,731 slices.

The post-place-and-route results show that this design can achieve a maximum clock frequency of 111 MHz. This design, together with the interface (Figure 6.2), was verified on the RIVYERA cluster, where it runs correctly at 110 MHz.

6.6 Results and Comparison

Table 6.2 summarizes the implementation results on a Spartan-3 XC3S5000 FPGA. Based on the available resources (33,280 slices and 104 BRAMs) of each XC3S5000 FPGA, we also estimated that at most 9 clones of Archi-I or 12 clones of Archi-II can be implemented on a single FPGA. For Archi-III, one clone uses 80% of the available resources of one FPGA. The remaining area can accommodate at least two copies of Archi-II.


Table 6.2: Size and throughput comparison of various architectures.

| Architecture | Digit size | Area [#slices] | BRAM | Freq. [MHz] | Cycles per step | Delay [ns] | Throughput per FPGA [×10^6] |
| Archi-I: Poly. basis | 22 | 3,656 | 4 | 101 | 71 | 704 | 12.8 |
| Archi-II: Type-II ONB | 13 | 2,578 | 4 | 125 | 81 | 648 | 18.5 |
| Archi-III: Shokrollahi’s | - | 26,731 | 20 | 111 | 1 | 8.99 | 111 |

The throughput per engine, $T_e$, is computed as $T_e = \text{Freq.} / \text{Cycles per step}$, and the throughput per FPGA, $T_c$, is computed as $T_c = T_e \cdot l$. Here, $l$ is the number of engines on a single FPGA. Compared with Archi-I, Archi-II has smaller area usage and shorter delay. In other words, Type-II optimal normal basis has significant advantages for this application. On the other hand, according to Table 6.2, Archi-III achieves an 8.6 times higher throughput per FPGA than Archi-I and a 6 times higher throughput than Archi-II. The improvement comes from both the field arithmetic and the architecture. The use of Shokrollahi’s algorithm significantly improves the throughput of a multiplier, while the expansion of the iteration function hides the delays caused by addition and squaring in the pipeline.
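As a quick cross-check of the two formulas, the per-FPGA throughput column of Table 6.2 can be recomputed in Python from the frequency, the cycles per step, and the engine counts stated in the text (9 for Archi-I, 12 for Archi-II, 1 for Archi-III). The function names are illustrative only.

```python
def engine_throughput(freq_mhz, cycles_per_step):
    """T_e in units of 10^6 iterations per second."""
    return freq_mhz / cycles_per_step


def fpga_throughput(freq_mhz, cycles_per_step, engines):
    """T_c = T_e * l, with l engines on one FPGA."""
    return engines * engine_throughput(freq_mhz, cycles_per_step)


archi_1 = fpga_throughput(101, 71, 9)    # about 12.8
archi_2 = fpga_throughput(125, 81, 12)   # about 18.5
archi_3 = fpga_throughput(111, 1, 1)     # 111
```

The comparison makes the architectural point explicit: Archi-I and Archi-II gain throughput by replication, whereas Archi-III gains it by finishing one iteration every clock cycle in a single engine.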

6.6.1 Total Effort Estimation

The complexity of this attack is around $2^{60.9}$ iterations. Populated with 128 FPGAs, a single RIVYERA finishes $2^{58.7} \approx 128 \cdot (111 \cdot 2^{20} \cdot 3600 \cdot 24 \cdot 365)$ iterations in one year. We estimate that given five RIVYERA clusters, the ECC2K-130 challenge can be solved in one year.
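The arithmetic behind this estimate is easy to reproduce. The Python sketch below follows the expression in the text, including its convention of counting $10^6$ iterations per second as $2^{20}$; the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 3600 * 24 * 365


def log2_iterations_per_year(fpgas, mega_iterations_per_fpga):
    """log2 of the iterations one cluster completes in a year.

    Follows the text in treating 10^6 iterations/s as 2^20.
    """
    per_second = fpgas * mega_iterations_per_fpga * 2**20
    return math.log2(per_second * SECONDS_PER_YEAR)


def clusters_needed(log2_total, log2_per_cluster_year):
    """Cluster-years required for 2^log2_total iterations."""
    return 2 ** (log2_total - log2_per_cluster_year)


rivyera = log2_iterations_per_year(128, 111)   # about 58.7
needed = clusters_needed(60.9, rivyera)        # between 4 and 5
```

Rounding the result up to whole clusters gives the five RIVYERA machines quoted above.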

6.6.2 Comparison

As a part of the global distributed effort to attack ECC2K-130 [10], efforts have been made to speed up the iteration function on CPUs [10], GPUs [15] and the PlayStation 3 [32]. This allows us to compare our results with implementations of the same function (same algorithm and basis representation) on different platforms. Figure 6.9 gives the throughput (iterations per second) achieved on a Spartan-3 FPGA, a PlayStation 3, a GTX295 graphics card and a Core 2 Extreme (quad-core). The FPGA-based implementation achieves a 2 to 4 times higher throughput than the other platforms.

Note that the processors used to run the software are quite powerful. The Core 2 Quad has four cores, and each core has three vector ALUs that are capable of performing one 128-bit vector operation per cycle. The Cell processor has 8 Synergistic Processor Elements (SPEs), and each SPE can issue one 128-bit vector operation per cycle. The GTX295 card contains two GT200b chips, each holding 240 ALUs (32-bit). They are designed for massively parallel computations, and they run at much higher clock frequencies than the FPGA implementation.

Figure 6.9: Comparison: attacking ECC2K-130 on different platforms.

The throughput achieved on these platforms, as shown in Figure 6.9, is higher than that of previous implementations (see [10] for a detailed comparison). One of the techniques that help to boost the throughput is bit-slicing, which is used on the Core 2, GPU and Cell processor. Bit-slicing is a different way to store and manipulate data in registers. For example, one element of $GF(2^{131})$ is spread over 131 registers, each holding one bit of that element. With 128-bit vector registers, each register holds the same bit position of 128 different elements, so one can perform 128 multiplications in parallel. The multiplication is then broken down into bit-wise operations, which are performed with simple logic operations (AND, XOR, OR) between registers.

At the top level, Archi-I and Archi-II are multi-core systems, similar to the GPU and Cell processors. Though on FPGA we can construct dedicated multipliers for binary field multiplications, the software implementations achieve higher throughput. The reasons differ: the Core 2 and Cell processor have 128-bit vector instructions and run at an almost 30 times higher clock frequency; the GTX295 has 480 cores and runs at a 10 times higher clock frequency. Archi-III beats all the software implementations. Compared with Archi-I and Archi-II, it makes better use of the hardware since all the multipliers, adders and squarers are busy all the time. Besides, memory accesses have no impact on the throughput. The comparison shows that the expanded architecture makes better use of hardware resources than multi-core architectures.

6.7 Effort Estimation for ECC2-131 and ECC2K-163

ECC2-131 and ECC2K-163 are the next two unsolved challenges after ECC2K-130. According to the estimation of Certicom, ECC2-131 and ECC2K-163 are around 24 and $1.2 \times 10^5$ times more difficult than ECC2K-130, respectively. Based on the results reported above, we can already give an estimation of the strength of ECC2-131 and ECC2K-163 against practical attacks. We measure the effort in FPGA-years, namely, the number of years needed to solve the challenge using one FPGA. Since FPGAs of different generations differ greatly in capacity and frequency, we always estimate the effort using the largest FPGA of each generation.

In order to check the frequency difference of the same design on FPGAs of different generations, we implemented the same design on Virtex-4, Virtex-5, Virtex-6 and Virtex-7. Table 6.3 summarizes the maximum frequency we can achieve and the number of clones we can instantiate on different FPGAs. The frequency and size are post-PAR (placing and routing) results reported by Xilinx ISE.

Figure 6.10 shows the estimated FPGA-years needed for each challenge. FPGAs essentially follow the pace of silicon fabrication technology. As shown in Table 6.3, FPGA capacity has been growing continuously. From Virtex-4 to Virtex-7, the size of Xilinx FPGAs grew 10 times in 6 years. Altera FPGAs have shown a similar growth from Stratix II GX (2005), which has 132k Logic Elements (LEs), to Stratix V (2010), which has 1,087k LEs. Adversaries now need far fewer FPGAs to perform the attack than 8 years ago. Indeed, ECC2K-130 requires 169 FPGA-years on Virtex-4, compared with 6.5 on Virtex-7. In other words, we now need only 7 state-of-the-art FPGAs to solve ECC2K-130 in one year.


Table 6.3: Technology nodes of different FPGA generations.

| FPGA | Type | Since [year] | Technology [nm] | Size [Logic Cells] | Freq. [MHz] | Clones |
| Spartan-3 | XC3S5000 | 2003 | 90 | 74,800 | 111 | 1 |
| Virtex-4 | XC4VLX200 | 2004 | 90 | 200,440 | 192 | 2 |
| Virtex-5 | XC5VLX330 | 2006 | 65 | 51,840* | 240 | 4 |
| Virtex-6 | XC6VLX760 | 2009 | 40 | 758,784 | 303 | 10 |
| Virtex-7 | XC7V2000T | 2010 | 28 | 1,954,560 | 310† | 31† |

* Slices. † Post-synthesis results.

Figure 6.10: Effort estimation for ECC2K-130, ECC2-131 and ECC2K-163 on FPGAs of different generations.

Figure 6.10 also gives estimations of the effort to solve ECC2-131 and ECC2K-163. Obviously, ECC2-131 is already in the range of practical attacks. ECC2K-163 requires about $7.8 \times 10^5$ Virtex-7 FPGAs to solve in one year. The effort in terms of FPGA-years drops about 10 times from 2006 (Virtex-5) to 2010 (Virtex-7). If FPGAs grow larger and faster at a similar speed in the future, solving ECC2K-163 in one year will require fewer than $10^3$ FPGAs in 2025. Note that this is a rather pessimistic estimation, since we did not take into account possible optimizations of the attacking method and its implementations.


6.8 Conclusion

In this study we explored the power of FPGAs in a brute-force attack on ECC. Using Type-II polynomial basis and Karatsuba multiplication, we could fit the expanded architecture, which includes six 131-bit multipliers, on one Spartan-3 (XC3S5000) FPGA and achieve a throughput of one iteration per cycle. The results show that even low-cost FPGAs can deliver higher performance than the most recent CPUs, GPUs and the Cell processor.

Based on the complexity estimation of Certicom, we also gave an estimation of the effort of breaking ECC2-131 and ECC2K-163 using more advanced FPGAs. Using the most advanced FPGA of the Xilinx Virtex family, ECC2-131 can be attacked in 156 FPGA-years. ECC2K-163 still requires about $7.8 \times 10^5$ FPGA-years in 2011, but it will become practically solvable before 2025.

Chapter 7

Conclusions

7.1 Conclusions

In this thesis, we investigated efficient arithmetic in the hardware design of PKC. PKC is used on a large variety of devices including servers, mobile handsets, FPGAs, smart cards and passive RFID tags. Therefore, PKC implementations tailored to different environments need specific optimizations to meet the requirements for performance, power and security against physical attacks. This thesis analyzes the computation structure of the most widely used PKC, and provides methods for various optimization targets such as high speed, low area or higher physical security.

The first contribution of this thesis is a novel method to parallelize the Montgomery modular multiplication (MMM) algorithm. We analyze the data dependencies inside the MMM algorithm and study efficient task partitioning methods. A highly simplified multi-core platform with shared memory is designed to evaluate the parallelizability of different implementation methods. The experimental results show that carry propagation in long integer additions causes the main data dependency, and memory access quickly becomes the bottleneck when more cores are added. Therefore, a task partitioning method that postpones the carry propagation and reduces the number of inter-core data transfers is proposed. On a testing platform configured with 4 cores, we obtain a speedup factor of 3.68 for 256-bit multiplications compared to a single-core implementation.

The second contribution of this thesis is a new modular multiplication algorithm that has a lower computational complexity than the conventional MMM algorithm. We focus on modular multiplications with special moduli, $p = f(z)$, where $f(z)$ is a low-weight polynomial. Such moduli arise in the finite fields used by pairing-friendly curves (e.g. $p = 36z^4 + 36z^3 + 24z^2 + 6z + 1$ for BN curves). Since $z$ is much larger than the coefficients of $f(z)$, treating $p$ as a polynomial leads to a reduction of the computational complexity of mod $p$ multiplications. The proposed algorithm is also easier to parallelize than integer multiplications since there is no carry propagation in polynomial multiplications. Using this algorithm, we implemented the optimal ate pairing on a Virtex-6 FPGA. The implementation finishes one pairing in 1.17 ms when running at 210 MHz, and it was the fastest hardware implementation of pairings achieving a 128-bit security level when it was published.

The third contribution of this thesis consists of two low-area implementations of HECC. Previous HECC implementations typically use multiple multipliers or inverters to speed up the scalar multiplication, resulting in a large area. We analyze the atomic operations in the inversion algorithms and multiplication algorithms, and propose two Unified Multiplier and Inverter (UMI) data-paths for $\mathbb{F}_{2^m}$. We present two options for HECC using UMIs: an FPGA-based high-performance implementation (Type-I) and an ASIC-based lightweight implementation (Type-II). The use of a UMI combined with affine coordinates leads to a smaller data-path, a smaller memory and faster scalar multiplications. Both implementations use curves defined by $h(x) = x$ and $f(x) = x^5 + f_3 x^3 + x^2 + f_0$, where $f_3, f_0 \in \mathbb{F}_{2^{83}}$. The high-throughput version uses 2316 slices and 2016 bits of block RAM on a Xilinx Virtex-II FPGA, and finishes one scalar multiplication in 311 ms. The lightweight HECC processor, implemented with the UMC 130 nm low-leakage library, uses only 14.5 Kgates, and one scalar multiplication takes 450 ms. The power consumption, estimated with Power Compiler, is 13.4 µW when running at 300 kHz.

The fourth contribution of this thesis is an efficient implementation of the Pollard rho attack on an FPGA cluster. The margin of safety provided by a cryptosystem erodes with time, as adversaries improve the attack methods and get more powerful computing tools. In this study, we investigate the feasibility of using FPGA clusters to break ECC. More specifically, we aim to solve the Certicom challenge ECC2K-130. We explore the implementation options for the core finite-field arithmetic operations as well as architectures. An in-depth comparison between a polynomial basis multiplier, a Type-II normal basis multiplier and Shokrollahi’s multiplier is given. Using Type-II polynomial basis and Karatsuba multiplication, we can fit an expanded architecture (for the iteration function) that includes six 131-bit multipliers on one Spartan-3 (XC3S5000) FPGA, and achieve a throughput of one iteration per cycle. With 600 Xilinx Spartan-3 (XC3S5000) FPGAs, we estimate that we can solve the Certicom challenge ECC2K-130 in a year. The results show that even low-cost FPGAs can deliver higher performance than the most recent CPUs, GPUs and the Cell processor.

This thesis also includes a comprehensive survey of physical attacks on ECC and known countermeasures. Physical attacks, including side-channel analysis and fault analysis, are a major threat to the security of cryptographic devices. While the adversary only needs to succeed with one out of many attack methods, the designers have to prevent all the applicable attacks simultaneously. Moreover, countermeasures against one attack may surprisingly benefit another attack. Appendix A summarizes the known attacks and their requirements (e.g. chosen base point and multiple executions). It also gives a rough estimation of the overhead of the listed countermeasures. The table that summarizes the relation between attacks and countermeasures can be used as a road map by ECC implementers to select countermeasures.

7.2 Future Work

Our work described in this thesis can be continued in several directions:

• New applications, e.g. post-quantum cryptography (PQC) and fully homomorphic encryption (FHE) schemes,

• New methodologies to protect designs from physical attacks,

• New challenges, e.g. growing leakage power in nanoscale cryptographic processors,

• New algorithms with an even lower complexity.

Below we describe three topics that are worth considering in our opinion.

Implementation of FHE. In the past two years, fully homomorphic encryption algorithms have been proposed and improved. An FHE scheme enables many interesting applications such as secure voting and confidential data processing in cloud computing. This is a relatively new topic and so far it is only of theoretical interest: it is too slow and too big to be used in practice. On the other hand, no hardware implementation of FHE has been reported yet. We believe it is an important topic to look at, and implementation techniques will gradually make FHE schemes more practical.


New methodologies against physical attacks. The current methodology is attack-driven. An attack is always discovered before its countermeasures. Hence, such countermeasures are typically algorithm-specific and can only prevent known attacks. An alternative research direction is to protect implementations using the so-called private circuits, which ensure a provably secure implementation even if the adversary can learn the values of a limited number of wires during the computation. However, the known constructions of private circuits are orders of magnitude larger than implementations using standard logic. How to efficiently reduce the overhead of such a generic protection method is still an open problem. Possible solutions include using private circuits for the secure zone (the circuit unit that needs to be protected) only. However, how to separate the secure zone from the rest of the circuit of an implementation is not clear yet.

Leakage power and its impact on cryptographic implementations. Power dissipation will continue to be one of the biggest challenges in circuit design. Power (and energy) parameters should be added to the design exploration metrics, which currently only include area and performance. Parallelism is an effective approach to reduce the dynamic power dissipation using voltage and frequency scaling. However, parallelism also leads to higher static power dissipation due to the increase of the circuit size. Therefore, it would be interesting to explore the design space for the most power-efficient PKC implementation. Moreover, recent studies [95, 123] suggest that leakage power can be an exploitable source of information leakage via side channels. Therefore, adding the power parameter into the design space exploration may also help the designers to achieve a more secure implementation.

Appendix A

Secure ECC Implementation: A Survey on Attacks and Protections

• A.1 Introduction

• A.2 Typical Implementations

• A.3 Passive Attacks

• A.4 Active Attacks

• A.5 Countermeasures

• A.6 Discussion


A.1 Introduction

The advent of physical attacks [115, 116, 130, 42, 70, 78, 170, 23, 64] on cryptographic devices has created a big challenge for implementers. By monitoring the timing, power consumption or electromagnetic (EM) emission of the device, or by inserting faults, adversaries can gain information about internal data or operations and extract the key without mathematically breaking the primitives. With new tampering methods and new attacks being continuously proposed and improved, designing a secure cryptosystem becomes increasingly difficult. While the adversary only needs to succeed in one out of many attack methods, the designers have to prevent all the applicable attacks simultaneously. Moreover, a countermeasure against one attack may surprisingly benefit another attack. As a result, keeping abreast of the most recent developments in the field of implementation attacks and the corresponding countermeasures is a never-ending task.

In this chapter we provide a systematic overview of implementation attacks and countermeasures of one specific cryptographic primitive: Elliptic Curve Cryptography (ECC) [113, 132]. This survey is an updated version of a previous report [65], and has been influenced by Avanzi’s report [7] and by the books of Blake et al. [24] and Avanzi et al. [8]. In this chapter, we give a catalogue-like summary of the known attacks and countermeasures. Implementers can use this chapter as a road map. For the details of each attack or protection, we refer to the original papers.

A.2 Typical Implementations

A brief introduction of ECC is given in Section 2.2.3. To make this chapter easy to read, we repeat the important notation below.

• E(a_1, a_2, a_3, a_4, a_6): an elliptic curve with coefficients a_1, a_2, a_3, a_4, a_6;

• O: point at infinity;

• #E: the number of points on curve E, i.e. the order of E;

• weak curve: a curve whose order does not have big prime divisors;


J. Fan, X. Guo, E. D. Mulder, P. Schaumont, B. Preneel, and I. Verbauwhede, “State-of-the-art of Secure ECC Implementations: A Survey on Known Side-channel Attacks and Countermeasures,” in HOST:10, pp. 76–87, 2010.


• the order of point P : the smallest integer r such that rP = O;

• affine coordinates: a point is represented with a two-tuple of numbers (x, y);

• projective coordinates: a point (x, y) is represented as (X, Y, Z), where x = X/Z, y = Y/Z;

• Jacobian projective coordinates: a point (x, y) is represented as (X, Y, Z), where x = X/Z^2, y = Y/Z^3.

The core function of an ECC processor is to securely compute k · P. Figure A.1 illustrates a typical (yet simplified) architecture of an ECC processor. The scalar k is either stored in a secure memory or generated on-the-fly. For some protocols, the result (k · P) is not fully transferred. For instance, the ECDSA signature only contains the x-coordinate of the result point. In theory, it is computationally hard to recover k from P and k · P. In practice, however, the adversaries try to find out k from the information leaked during the computation of k · P.

Figure A.1: A simplified model of an ECC processor.

Algorithm 24 shows the Montgomery powering ladder for scalar multiplication.

A.3 Passive Attacks

In practice, execution of an Elliptic Curve Scalar Multiplication (ECSM) can leak information of k in many ways. The goal of the attacker is to retrieve the entire bit stream of k using physical attacks. Physical attacks include mainly two types of attacks: Side Channel Analysis (SCA) and Fault Analysis (FA). In this section, we briefly recap the known SCA (also known as passive attacks) on an ECC implementation.

Note that for some scenarios, the attackers only need to recover a few bits of k to break the scheme. For example, Nguyen and Shparlinski [141] have shown that a few bits of k from a couple of signatures are enough to break ECDSA [161].

Algorithm 24 Montgomery powering ladder [136].
Input: P ∈ E(F) and integer k = ∑_{i=0}^{l−1} k_i 2^i.
Output: kP.
1: R[0] ← P, R[1] ← 2P.
2: for i = l − 2 downto 0 do
3:   R[¬k_i] ← R[0] + R[1], R[k_i] ← 2R[k_i].
4: end for
Return R[0].
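As an illustration, Algorithm 24 can be sketched in Python on a toy curve. The curve y^2 = x^3 + 2x + 3 over F_97, the base point (3, 6) and all helper names are assumptions chosen for readability and do not come from the thesis; the affine `add` routine is an ordinary textbook group law and offers no side-channel protection by itself.

```python
# Toy curve y^2 = x^3 + A*x + B over F_p (illustrative parameters, not from the thesis).
P_MOD, A, B = 97, 2, 3
INF = None          # point at infinity O
G = (3, 6)          # a point on the toy curve: 6^2 = 36 = 27 + 6 + 3

def add(p1, p2):
    """Affine addition and doubling on the toy curve."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                   # P + (-P) = O
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def montgomery_ladder(k, point):
    """Algorithm 24 for k >= 1: one addition and one doubling per scalar bit."""
    bits = bin(k)[2:]                     # k_{l-1} ... k_0 with k_{l-1} = 1
    r = [point, add(point, point)]        # R[0] <- P, R[1] <- 2P
    for b in bits[1:]:
        ki = int(b)
        r[1 - ki] = add(r[0], r[1])       # R[not k_i] <- R[0] + R[1]
        r[ki] = add(r[ki], r[ki])         # R[k_i] <- 2 R[k_i]
    return r[0]
```

Every iteration executes the same two operations; only the register indices depend on the scalar bit, which is the property that the address-bit attack of Section A.3.8 targets.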

Most SCA attacks are based on power consumption leakage. Most often, electromagnetic (EM) radiation is considered as an extension of the power consumption leakage and the attacks/countermeasures are applied without change [138]. For the sake of simplicity, we will only mention power traces as the side-channel to describe the known attacks. However, it is important to point out that EM radiation can serve as a better leakage source since radiation measurements can be made locally [69].

A.3.1 Simple Power Analysis

Simple power analysis (SPA) attacks make use of distinctive key-dependent patterns shown in the power traces [116]. As shown by Coron [52], when the double-and-add algorithm is used for a point multiplication, the value of the scalar bits can be revealed if the adversary can distinguish between point doubling and point addition from a power trace.
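A minimal sketch of this observation, assuming an idealised attacker who can label every operation in the trace as a doubling ("D") or an addition ("A"). The function names are hypothetical, and a real trace would first need pattern matching to obtain these labels.

```python
def double_and_add_trace(k):
    """Operation sequence of left-to-right double-and-add for k >= 1."""
    ops = []
    for bit in bin(k)[3:]:        # skip '0b' and the leading 1 bit
        ops.append("D")           # every iteration doubles
        if bit == "1":
            ops.append("A")       # an addition follows only for a 1 bit
    return ops

def recover_scalar(ops):
    """Rebuild k from the D/A sequence alone."""
    bits = "1"
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i + 1] == "A":
            bits += "1"           # pattern "D A" encodes a 1 bit
            i += 2
        else:
            bits += "0"           # a lone "D" encodes a 0 bit
            i += 1
    return int(bits, 2)
```

For example, `double_and_add_trace(0b1011)` yields the sequence D, D, A, D, A, from which `recover_scalar` rebuilds the scalar.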

A.3.2 Template Attacks

A template attack [42] requires access to a fully controllable device, and proceeds in two phases. In the first phase, the profiling phase, the attacker constructs templates of the device. In the second phase, the templates are used for the attack. Medwed and Oswald [126] showed the feasibility of this type of attack on an implementation of the ECDSA algorithm. In [91] a template attack on a masked Montgomery ladder implementation is presented.


A.3.3 Differential Power Analysis

Differential power analysis (DPA) attacks use statistical techniques to pry the secret information out of the measurements [116]. DPA sequentially feeds the device with N input points P_i, i ∈ {1, 2, .., N}. For each point multiplication, kP_i, a measurement over time of the side-channel is recorded and stored. The attacker then chooses an intermediate value, which depends on both the input point P_i and a small part of the scalar k, and transforms it to a hypothetical leakage value with the aid of a hypothetical leakage model. The attacker then makes a guess of the small part of the scalar. For the correct guess, there will be a correlation between the measurements and the hypothetical leakages. The whole scalar can be revealed incrementally using the same method.
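The procedure can be sketched with a small simulation. Everything here is an assumption made for illustration: the curve group is replaced by the toy group of integers modulo 65521 (so "kP" is k·P mod 65521), the device is assumed to leak the Hamming weight of the accumulator plus Gaussian noise once per scalar bit, and the secret scalar is a made-up 16-bit value.

```python
import random

N_MOD = 65521                      # toy group Z_n standing in for the curve group
SECRET_K = 0b1011001110100101      # hypothetical 16-bit secret scalar

def hw(x):
    """Hamming weight, used as the hypothetical leakage model."""
    return bin(x).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def measure(k, p, rng):
    """One simulated trace: noisy Hamming weight of the accumulator per bit."""
    acc, trace = 0, []
    for bit in bin(k)[2:]:
        acc = (2 * acc + int(bit) * p) % N_MOD     # double, then conditional add
        trace.append(hw(acc) + rng.gauss(0, 1.0))
    return trace

def dpa(points, traces, nbits):
    """Recover the scalar bit by bit, most significant bit first."""
    prefix = 0
    for step in range(nbits):
        column = [t[step] for t in traces]
        # hypothetical intermediate value for each guess of the next bit
        best = max((0, 1), key=lambda g: pearson(
            [hw((2 * prefix + g) * p % N_MOD) for p in points], column))
        prefix = 2 * prefix + best
    return prefix
```

In this simulation a few hundred traces are enough, because the noise is small compared with the variation of the Hamming weight; real measurements usually need alignment and many more traces.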

A.3.4 Comparative Side-Channel Analysis

Comparative SCA [94] resides between a simple SCA and a differential SCA. Two portions of the same or different leakage traces are compared to discover the reuse of values. The first reported attack belonging to this category is the doubling attack [70]. The doubling attack is based on the assumption that even if the attacker does not know which operation is performed, he can detect when the same operation is performed twice. For example, for two point doublings, 2P and 2Q, the attacker may not know what P and Q are, but he can tell if P = Q. Comparing two power traces, one for kP and one for k(2P), it is possible to recover all the bits of k.
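The collision argument can be sketched as follows, assuming a left-to-right double-and-add-always implementation and an attacker who can only tell whether two doublings process the same operand. The integers modulo 1000003 stand in for the curve group, and all names are hypothetical. In this simplified form the comparison of doublings yields every bit except the least significant one.

```python
N_MOD = 1000003    # toy group Z_n standing in for the curve group

def doubling_operands(k, p, nbits):
    """Operands of the doublings in left-to-right double-and-add-always."""
    q, operands = 0, []
    for i in reversed(range(nbits)):
        operands.append(q)              # value that is about to be doubled
        q = (2 * q) % N_MOD
        t = (q + p) % N_MOD             # the addition is always computed
        if (k >> i) & 1:
            q = t
    return operands

def doubling_attack(ops_p, ops_2p):
    """Bits k_{l-1} ... k_1 from operand collisions between kP and k(2P)."""
    bits = []
    for t in range(len(ops_p) - 1):
        # doubling t+1 of the run on P reuses the operand of doubling t of
        # the run on 2P exactly when the bit processed in iteration t is 0
        same = ops_p[t + 1] == ops_2p[t]
        bits.append(0 if same else 1)
    return bits
```

The attacker never needs the operand values themselves, only the equality test, which in practice is a comparison of two trace segments.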

A.3.5 Refined Power Analysis

A refined power analysis (RPA) attack exploits the existence of special points: (x, 0) and (0, y). Feeding to a device a point P that leads to a special point R(0, y) (or R(x, 0)) at step i under the assumption of processed bits of the scalar will generate exploitable side-channel leakage [78]. In particular, applying randomised projective coordinates, randomised EC isomorphisms or randomised field isomorphisms does not prevent this attack since zero remains zero after randomisation.

A.3.6 Zero-value Point Attack

A zero-value point attack (ZPA) [3] is an extension of RPA. Besides the points (i.e. R[1] and R[0]) generated at step i, a ZPA also considers the value of auxiliary registers. For some special points P, some auxiliary registers will predictably have zero value at step i under the assumption of processed bits of the scalar. The attacker can then use the same procedure as RPA to incrementally reveal the whole scalar.

A.3.7 Carry-based Attack

The carry-based attack [69] is designed to attack Coron’s first countermeasure (also known as scalar randomisation). Instead of performing kP, Coron suggested to perform (k + r#E)P where r is a random number. The crucial observation here is that, when adding a random number a to a fixed number b, the probability of generating a carry bit c = 1 depends solely on the value of b (the carry-in has negligible impact [69]). If (k + r#E) is computed with a w-bit adder, where w is the digit size, the attacker can learn k digit by digit from the distribution of the carry bit.
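The dependence of the carry on b can be checked by exhaustive enumeration. The sketch below assumes a uniformly distributed w-bit operand a and ignores the carry-in; the function name is hypothetical.

```python
def carry_probability(b, w):
    """Exact probability that a + b overflows w bits for a uniform w-bit a."""
    carries = sum(1 for a in range(1 << w) if a + b >= (1 << w))
    return carries / (1 << w)
```

For a w-bit digit b the result is b / 2^w, so an estimate of the carry frequency over many randomised executions translates directly into an estimate of the secret digit.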

A.3.8 Address-bit DPA

The address-bit attack (ADPA) [130] exploits the link between the register address and the key. The first ADPA applied to ECC is by Itoh et al. [96]. For example, an implementation of Algorithm 24 performs point addition and doubling regardless of the value of the key bit, but the address of the doubled point depends solely on k_i. As a result, k_i can be recovered if the attacker can distinguish between data read from R[0] and from R[1].

A.4 Active Attacks

Besides passive side-channel analysis, adversaries can actively disturb the cryptographic devices to derive the secret. Faults on the victim device can be induced with a laser beam, glitches in the clock, a drop of the power supply and so on. Readers who are interested in these methods are referred to [117].

In this section, we give a short description of the known fault analysis on ECC. Based on the scalar recovery method, we divide fault attacks on ECC into three categories, namely, safe-error based analysis, weak-curve based analysis and differential fault analysis.


A.4.1 Safe-error Analysis

The concept of safe-error was introduced by Yen and Joye in [170, 105]. Two types of safe-error are reported: C safe-error and M safe-error.

C Safe-error

The C safe-error attack exploits dummy operations which are usually introduced to achieve SPA resistance. Taking the double-and-add-always algorithm [52, Algorithm 1] as an example, the dummy addition in step 3 makes safe-error possible. The adversary can induce temporary faults during the execution of the dummy point addition. If the scalar bit k_i = 1, then the final results will be faulty. Otherwise, the final results are not affected. The adversary can thus recover k_i by checking the correctness of the results.
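A small simulation of this attack, under the assumptions that the integers modulo 1000003 stand in for the curve group, that a fault adds 1 to the result of one chosen addition, and that the attacker only learns whether the final output changed. The function names and the fault model are hypothetical.

```python
N_MOD = 1000003    # toy group Z_n standing in for the curve group

def double_and_add_always(k, p, nbits, fault_at=None):
    """Left-to-right double-and-add-always. If fault_at is set, the addition
    of iteration fault_at (bit index, LSB = 0) returns a corrupted value."""
    q = 0
    for i in reversed(range(nbits)):
        q = (2 * q) % N_MOD
        t = (q + p) % N_MOD
        if i == fault_at:
            t = (t + 1) % N_MOD        # simulated transient fault
        if (k >> i) & 1:
            q = t                      # real addition: the result is kept
        # otherwise the addition was a dummy and t is discarded
    return q

def c_safe_error_attack(oracle, nbits):
    """oracle(i) runs the device with a fault in iteration i (None = no fault)."""
    good = oracle(None)
    return sum((oracle(i) != good) << i for i in range(nbits))
```

One faulted execution per scalar bit is needed in this model, and the faulty output values themselves are never used, only the fact that they differ from the correct result.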

M safe-error

The M safe-error attack exploits the fact that faults in some memory blocks will be cleared. The attack was first proposed by Yen and Joye [170] to attack RSA. However, it also applies to ECSM. Assuming that R[k_i] in Algorithm 24 is loaded from memory to registers and overwritten by 2R[k_i], then faults in R[1] will be cleared only if k_i = 1. By simply checking whether the result is affected or not, the adversary can reveal k_i.

A.4.2 Weak Curve Based Analysis

In 2000, Biehl et al. [23] described the first weak curve fault attack on an ECC implementation. The key observation is that a_6 in the definition of E (Eq. 2.1) is not used in the addition formulae. As a result, the addition formulae for curve E generate correct results for any curve E′ that differs from E only in a_6:

E′ : y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a′_6. (A.1)

Thus, the adversary can cheat an ECC processor with a point P′ ∈ E′(F) where E′ is a cryptographically weak curve. The adversary can then solve ECDLP on E′ and find out k.
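The role of the unused constant coefficient can be illustrated on a short Weierstrass toy curve y^2 = x^3 + Ax + B over F_97, where B plays the part of a_6. The parameters, the off-curve point (5, 7) and the helper names are assumptions made for this sketch; nothing is claimed about the strength of the resulting curve E′.

```python
P_MOD, A, B = 97, 2, 3     # toy curve E (illustrative parameters only)
INF = None

def add(p1, p2):
    """Affine group law; note that B does not appear anywhere."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def implied_b(pt):
    """Constant coefficient of the curve (with the same A) through pt."""
    x, y = pt
    return (y * y - x ** 3 - A * x) % P_MOD

def on_curve(pt, b=B):
    """Point validation against the curve with constant coefficient b."""
    return pt is INF or implied_b(pt) == b

def scalar_mult(k, pt):
    acc = INF
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, pt)
    return acc
```

With the point (5, 7), which does not satisfy the equation of E, every multiple computed by `scalar_mult` stays on the curve with constant coefficient `implied_b((5, 7))`, while `on_curve` rejects the point for E. This is the check that point validation (Section A.5.3) performs.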

The method of moving a scalar multiplication from a strong curve E to a weak curve E′ often requires fault induction. With the help of faults, the adversary makes use of invalid points [23], invalid curves [49] and twist curves [68] to hit a weak curve. These methods are described below.

Invalid Point Attacks

The invalid point attack lets the scalar multiplication start with a point P′ on the weak curve E′. If kP is performed without checking the validity of P, then no faults need to be induced. If the ECC processor does check the validity of P, the adversary will try to change the point P right after the point validation. In order to do so, the attacker should be able to induce a fault at a specific time.

Invalid Curve Attacks

Ciet and Joye [49] refined the attack in [23] by loosening the requirements on fault injection. They show that any unknown faults, including permanent faults in non-volatile memory or transient faults caused on the bus, in any curve parameters, including field representation and curve parameters a_1, a_2, a_3, a_4, may cause the scalar multiplication to be performed on a weak curve.

Twist Curve Fault Analysis

In 2008, Fouque et al. [68] noticed that many cryptographically strong curves have weak twist curves. A scalar multiplication kP not using the y-coordinate gives correct results for points on both the specified curve E and its quadratic twist, and the result of kP on weak twists can leak k. On an elliptic curve defined over a prime field F_p, a random x ∈ F_p corresponds to a point on either E or its twist with probability one half. As a result, a random fault on the x-coordinate of P has a probability of one half to hit a point on the (weak) twist curve.
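The "one half" statement can be checked numerically on a toy curve by counting the x-coordinates for which the right-hand side of the curve equation is a square. The curve y^2 = x^3 + 2x + 3 over F_97 is an assumption made for illustration only.

```python
P_MOD, A, B = 97, 2, 3     # toy curve (illustrative parameters only)

def x_is_on_curve(x):
    """True if some y in F_p satisfies y^2 = x^3 + A*x + B."""
    rhs = (x ** 3 + A * x + B) % P_MOD
    # Euler's criterion: rhs is a square iff rhs^((p-1)/2) is 0 or 1 mod p
    return pow(rhs, (P_MOD - 1) // 2, P_MOD) in (0, 1)

on_e = sum(x_is_on_curve(x) for x in range(P_MOD))
on_twist = P_MOD - on_e    # the remaining x-coordinates belong to the twist
```

By Hasse's bound, `on_e` differs from p/2 by roughly the square root of p at most, so for curves of cryptographic size the two counts are practically equal.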

A.4.3 Differential Fault Analysis

The Differential Fault Analysis (DFA) uses the difference between the correct results and the faulty results to deduce certain bits of the scalar.


Biehl-Meyer-Müller DFA

Biehl et al. [23] reported the first DFA on an ECSM. We use a right-to-left multiplication algorithm (Algorithm 25) to describe this attack. Let Q_i and R_i denote the value of Q and R at the end of the ith iteration, respectively. Let k^(i) = k div 2^i. Let Q′_i be the value of Q if faults have been induced. The attack reveals k from the Most Significant Bits (MSB) to the Least Significant Bits (LSB).

1. Run ECSM once and collect the correct result (Q_{l−1}).

2. Run the ECSM again and induce a one-bit flip on Q_i, where l − m ≤ i < l and m is small.

3. Note that Q_{l−1} = Q_i + (k^(i) 2^i)P and Q′_{l−1} = Q′_i + (k^(i) 2^i)P. The adversary then tries all possible k^(i) ∈ {0, 1, .., 2^m − 1} to generate Q_i and Q′_i. The correct value of k^(i) will result in a pair {Q_i, Q′_i} that differs in only one bit.

The attack works for the left-to-right multiplication algorithm as well. It also applies if k is encoded with any other deterministic codes such as Non-Adjacent Form (NAF) and w-NAF. It is also claimed that a fault induced at random moments during an ECSM is sufficient [23].

Sign Change FA

In 2006, Blömer et al. [26] proposed the sign change fault (SCF) attack. It attacks implementations where the scalar is encoded in Non-Adjacent Form. When using curves defined over the prime field, the sign change of a point implies only a sign change of its y-coordinate. The SCF attack does not force the elliptic curve operations to leave the original group E(F_p), thus P is always a valid point.

Algorithm 25 Right-To-Left (upwards) binary method for point multiplication.
Input: P ∈ E(F) and integer k = ∑_{i=0}^{l−1} k_i 2^i.
Output: kP.
1: R ← P, Q ← O.
2: for i = 0 to l − 1 do
3:   if k_i = 1 then Q ← Q + R.
4:   R ← 2R.
5: end for
Return Q.

Table A.1: Physical attacks on ECC implementations.
Columns: Single Execution, Multiple Executions, Chosen Base Point, Using Output Point, Incremental Scalar Recovery. Each row lists its non-blank entries from left to right.

SPA: √
DPA: √ √
Template attack †: √
Doubling attack: √ √
RPA: √ √ √
ZPA: √ √ √
Carry-based attack: √
ADPA: √ √
Safe-error attack: √ √
Weak-curve attack: √* √* √ √
Differential FA: √ √ √
PAIA: √* √* √ √‡

† Attack is reported to recover only a small number of bits of the scalar.
* It may need more than one trial to hit a weak curve.
‡ Only if the design is not able to handle O correctly.

A.4.4 Point-at-Infinity Attack

The point-at-infinity attack (PAIA) makes use of both faults and side-channel leakages [64]. The attack uses specially crafted but valid input points. The core idea is that after fault injection, these points turn into points of very low order. Using side-channel information we deduce when the point at infinity occurs during the scalar multiplication, which leaks information about the secret key. In the best case, this attack breaks a simple and differential side-channel analysis resistant implementation with input/output point validity and curve parameter checks using a single query.

A.4.5 Summary of Attacks

Physical attacks have different application conditions and complexities. For example, SPA and Template SPA require a single trace, while DPA and ADPA require multiple traces. Besides, some attacks make use of the final results while others don’t. These conditions reveal the applicability of each attack and suggest possible protections. Table A.1 summarises the attacks and their application conditions.

A.5 Countermeasures

Many protection methods have been proposed to counteract the reported attacks. However, countermeasures are normally proposed to protect an implementation against a specific attack. It has been pointed out that a countermeasure against one attack may benefit another one. In this section, we discuss the cross relationship between known attacks and countermeasures. We first give a summary of known countermeasures. The computational overhead of each countermeasure is estimated using a curve that achieves 128-bit security. The Montgomery powering ladder without y-coordinates is used as the benchmark.

Table A.2: Countermeasures and their overhead.
Cost estimation: negligible (< 10%), low (10%-50%) and high (> 50%).

Countermeasure                        Target attacks       Computation overhead
Indistinguishable Point Operation     SPA                  Low
Double-and-add-always                 SPA                  Low
Atomic block                          SPA                  Negligible
Montgomery Powering Ladder +y         SPA                  Low
Montgomery Powering Ladder −y         SPA                  -
Scalar randomisation                  DPA                  Low
Random key splitting                  DPA                  High
Base point blinding                   DPA                  Negligible
Random projective coordinates         DPA                  Negligible
Random EC isomorphism                 DPA                  Low
Random field isomorphism              DPA                  Low
Random register address               ADPA                 Low
Point Validation                      Invalid Point        Negligible
Curve Integrity Check                 Invalid Curve        Negligible
Coherence Check                       DFA                  Low †
Combined curve check                  Sign change          Low
Co-factor multiplication              Small group (RPA)    Negligible

+y: using y-coordinate; −y: not using y-coordinate.
† Depends on the number of coherence checks performed in each ECSM.

Table A.3 summarises the most important attacks and their countermeasures. The different attacks, grouped into passive attacks, active attacks and combined attacks, are listed row-wise, while each column represents one specific countermeasure. Let A_j and C_i denote the attack in the jth row and the countermeasure in the ith column, respectively. The grid (i, j), the cross of the ith column and the jth row, shows the relation between A_j and C_i.


• √: C_i is an effective countermeasure against A_j.

• ×: C_i is attacked by A_j.

• H: C_i helps A_j.

• ?: C_i might be an effective countermeasure against A_j, but the relation between C_i and A_j is unclear or unpublished.

• Blank: C_i and A_j are irrelevant (C_i is not effective against A_j).

It is important to distinguish between × and blank. Here × means C_i is attacked by A_j, whereas blank means that the use of C_i does not affect the effort or results of A_j at all. For example, scalar randomisation using a 20-bit random number can be attacked by a doubling attack, so we put a × at their cross. The Montgomery powering ladder is designed to thwart SPA, and it does not make a DPA attack harder or easier, so we leave the cell blank.

Below we discuss each countermeasure and its relation to the listed attacks.

A.5.1 SPA Countermeasures

Indistinguishable Point Operation Formulae (IPOF) [37]

IPOF try to eliminate the difference between point addition and point doubling. The usage of unified formulae for point doubling and addition [37] is a special case of IPOF. However, even when unified formulae are in use, the implementation of the underlying arithmetic, especially the operations with conditional instructions, may still reveal the type of the point operation (addition or doubling) [165, 157]. When using the double-and-add method, the Hamming weight of the secret scalar can be easily leaked.

Double-and-add-always [52]

The double-and-add-always algorithm, introduced by Coron, ensures that the sequence of operations during a scalar multiplication is independent of the scalar by inserting dummy point additions. Due to the use of dummy operations, it makes the C safe-error fault attack possible.


Table A.3: Attacks versus countermeasures.

C1: Indistinguishable Point Operation
C2: Double-and-add-always
C3: Atomic block
C4: Montgomery Powering Ladder +y
C5: Montgomery Powering Ladder −y
C6: Scalar randomization
C7: Random key splitting
C8: Base point blinding
C9: Random projective coordinates
C10: Random EC isomorphism
C11: Random field isomorphism
C12: Random register address
C13: Point Validation
C14: Curve Integrity Check
C15: Coherence Check
C16: Combined curve check
C17: Co-factor multiplication

Rows are attacks and columns are the countermeasures C1 to C17. Each row lists its non-blank entries in column order.

SPA: √ √ √ √ √
DPA: ×[142] √ ×[142] √ √ √
Template SPA: ×[126] ? ×[126] √ √ √
Doubling: ×[70] ×[171] ×[171] ×[70] ? ×[70] ? ? ?
RPA: √ √ √ ×[78] ×[78] ×[78] √*
ZPA: √ √ √ ×[3] ×[3] ×[3]
Carry-based: ×[69] ×‡
ADPA: √
C Safe-error: H[170] √ √
M Safe-error: √ √
Invalid Point: ? √
Invalid Curve: √
Twist Curve: √ H[68] √
BMM DFA: ? √ √†
Sign change: H[26] √ ? √
PAIA: √⊳ √ √⊳

† The countermeasure is effective only when the Montgomery powering ladder is used.
* The countermeasure is effective only when the attacker makes use of points of small order.
‡ C7 can be attacked if it splits k as follows: k_1 ← r, k_2 ← k − r, where r is randomly selected.
⊳ Depends on the implementation.


Atomic Block [45]

Instead of making the group operations indistinguishable, one can rewrite them as sequences of side-channel atomic blocks that are indistinguishable for SPA.

If dummy atomic blocks are added, then this countermeasure may enable C safe-error attacks. Depending on the implementation, it may also enable M safe-error attacks.

Montgomery Powering Ladder

The Montgomery ladder [136, 105] for ECC, shown as Algorithm 24, offers protection against SPA since the scalar multiplication is performed with a fixed pattern inherently unrelated to each bit of the scalar.

It avoids the usage of dummy instructions and also resists the normal doubling attack. However, it is vulnerable to the relative doubling attack proposed by Yen et al. [171]. This attack can reveal the relation between two adjacent secret scalar bits, thereby seriously decreasing the number of key candidates.

With the Montgomery powering ladder, the y-coordinate is not necessary during the scalar multiplication, which prevents sign-change attacks. However, for curves that have weak twist curves, using the Montgomery powering ladder without y-coordinate is vulnerable to twist curve attacks.

Joye and Yen pointed out that the Montgomery powering ladder may be vulnerable to M safe-error attacks (see [105] for details). They also proposed a modified method that allows faults in either R[0] or R[1] to be detected.

A.5.2 DPA Countermeasures

Scalar Randomisation [52]

This method blinds the private scalar by adding a multiple of #E. For any random number r and k′ = k + r#E, we have k′P = kP since (r#E)P = O. Coron suggested choosing r to be around 20 bits.
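The identity k′P = kP can be sketched in a toy group. Here the integers modulo 1000003 stand in for the curve group, so that `GROUP_ORDER` plays the role of #E and "kP" is k·P mod `GROUP_ORDER`; these choices and the function names are assumptions made for illustration.

```python
import secrets

GROUP_ORDER = 1000003      # stands in for #E in this toy model

def scalar_mult(k, p):
    """Toy scalar multiplication: 'kP' is k*p modulo the group order."""
    return (k * p) % GROUP_ORDER

def blinded_scalar(k, r_bits=20):
    """k' = k + r * #E with a fresh r of about 20 bits, as suggested by Coron."""
    r = secrets.randbits(r_bits)
    return k + r * GROUP_ORDER
```

Each call to `blinded_scalar` gives a different bit pattern for the scalar while `scalar_mult` returns the same result, which is what makes traces from different executions hard to combine.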

The scalar randomisation method was analyzed in [142] and judged weak if implemented as presented. Moreover, since #E for standard curves has a long run of zeros, the blinded scalar, k′, still has a lot of bits unchanged.


It makes the safe-error and sign-change attacks more difficult. On the other hand, it is shown in [69] that the randomisation process leaks the scalar under the carry-based attack. Moreover, as mentioned in [70], the 20-bit random value for blinding the scalar k is not sufficient to resist the doubling attack.

Base Point Blinding [52]

This method blinds the point P, such that kP becomes k(P + R). The known value S = kR is subtracted at the end of the computation. The masks S and R are stored secretly in the cryptographic device and updated at each iteration.

It can resist DPA/DEMA as explained in [52]. In [70], the authors conclude that this countermeasure is still vulnerable to the doubling attack since the point which blinds P is also doubled at each execution. This countermeasure makes RPA/ZPA more difficult since it breaks the assumption that the attacker can freely choose the base point (the base point is blinded).

This countermeasure might make the weak-curve based attacks more difficult since the attacker does not know the masking point R. In an attack based on an invalid point, the adversary needs to find out the faulty points P′ and Q′ = kP′. With the point blinding, it seems to be more difficult to reveal either P′ or Q′. However, in the case of an invalid curve attack, base point blinding does not make a difference.

While neither blinding the base point nor blinding the scalar is effective in preventing the doubling attack, the combined use of them seems to be effective [70].

Random Projective Coordinates [52]

This method randomizes the homogeneous projective coordinates (X, Y, Z) with a random λ ≠ 0 to (λX, λY, λZ). The random variable λ can be updated in every execution or after each doubling or addition. This countermeasure is effective against differential SCA. It fails to resist the RPA as zero is not effectively randomized.
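Both properties, the unchanged affine point and the unmasked zero coordinate, can be seen in a short sketch over the toy field F_97. The field size and the function names are assumptions chosen for illustration.

```python
import secrets

P_MOD = 97    # toy prime field (illustrative)

def randomise(point):
    """(X, Y, Z) -> (lam*X, lam*Y, lam*Z) for a random non-zero lam."""
    x, y, z = point
    lam = 1 + secrets.randbelow(P_MOD - 1)          # lam in [1, p-1]
    return (lam * x % P_MOD, lam * y % P_MOD, lam * z % P_MOD)

def to_affine(point):
    """Homogeneous projective to affine: x = X/Z, y = Y/Z."""
    x, y, z = point
    z_inv = pow(z, -1, P_MOD)
    return (x * z_inv % P_MOD, y * z_inv % P_MOD)
```

The representation changes with every call to `randomise`, but a coordinate that is zero stays zero for any λ, which is the property that RPA relies on.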

Random Key Splitting [48]

The scalar can be split in at least two different ways: k = k_1 + k_2 or k = ⌊k/r⌋r + (k mod r) for a random r.


This countermeasure can resist DPA/DEMA attacks since it has a random scalar for each execution. In [70], the authors have already analysed the effectiveness of Coron’s first countermeasure against the doubling attack. If we assume that the scalar k is randomly split into two full length scalars, the search space is extended to 2^81 for a 163-bit k (the birthday paradox applies here). This is enough to resist the doubling attack. It can also help to preclude RPA/ZPA if it is used together with base point randomisation [78, 3, 86]. However, this countermeasure is vulnerable to a carry-based attack if the key is split as follows: choose a random number r < #E, and k_1 = r, k_2 = k − r.
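The two splittings can be sketched in a toy group in which the integers modulo 1000003 stand in for the curve group, so that `GROUP_ORDER` plays the role of #E. The function names and the 10-bit size of r in the Euclidean variant are assumptions made for illustration.

```python
import secrets

GROUP_ORDER = 1000003      # stands in for #E in this toy model

def scalar_mult(k, p):
    """Toy scalar multiplication: 'kP' is k*p modulo the group order."""
    return (k * p) % GROUP_ORDER

def additive_split(k):
    """k = k1 + k2 (mod #E) with k1 = r random."""
    r = secrets.randbelow(GROUP_ORDER)
    return r, (k - r) % GROUP_ORDER

def euclidean_split(k, r_bits=10):
    """k = floor(k/r) * r + (k mod r) for a random non-zero r."""
    r = 1 + secrets.randbits(r_bits)
    return r, k // r, k % r

def mult_additive(k, p):
    k1, k2 = additive_split(k)
    return (scalar_mult(k1, p) + scalar_mult(k2, p)) % GROUP_ORDER

def mult_euclidean(k, p):
    r, quotient, remainder = euclidean_split(k)
    rp = scalar_mult(r, p)                    # first compute rP
    return (scalar_mult(quotient, rp) + scalar_mult(remainder, p)) % GROUP_ORDER
```

Both variants return kP while the scalars that are actually processed change from one execution to the next. The additive variant computes k − r with an ordinary subtraction, which is the operation the carry-based attack observes.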

Random EC isomorphism [104]

This method first applies a random isomorphism of the form ψ : (x, y) ↦ (r^2 x, r^3 y) and then proceeds by computing Q = k · ψ(P) and outputting ψ^{−1}(Q).

Random field isomorphism [104]

This method makes use of isomorphisms between fields. To compute Q = kP, it first randomly chooses a field F′ isomorphic to F through isomorphism φ, then computes Q = φ^{−1}(k(φ(P))).

Random EC isomorphism and random field isomorphism have similar strengths and weaknesses as random projective coordinates.

Random Register Address [97, 101]

This method randomises the register addresses to break the link between key bits and register addresses. In Algorithm 24, the address of the destination register for point doubling is k_i. If k is not randomised, then the attacker can recover k_i with address-bit DPA. May et al. [125] proposed Random Register Renaming (RRR) as a countermeasure on a special processor. Itoh et al. [97] proposed a way to randomise register addresses for double-and-add-always, Montgomery powering ladder and window method. Izumi et al. [101] showed that the MPL version is still vulnerable and proposed an improved version.


A.5.3 FA Countermeasures

Point Validation [23, 49]

Point Validation (PV) verifies if a point lies on the specified curve or not. PV should be performed before and after scalar multiplication. If the base point or result does not belong to the original curve, no output should be given. It is an effective countermeasure against invalid point attacks and BMM differential fault attacks. If the y-coordinate is used, it is also effective against a twist-curve attack.

Curve Integrity Check [49]

The curve integrity check detects faults on curve parameters. Before starting an ECSM, the curve parameters are read from the memory and verified using an error detecting code (e.g. a cyclic redundancy check). It is an effective method to prevent invalid curve attacks.

Coherence Check [77]

A coherence check verifies the intermediate or final results with respect to a valid pattern. If an ECSM uses the Montgomery powering ladder, we can use the fact that the difference between R[0] and R[1] is always P. This can be used to detect faults during an ECSM [57].
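A sketch of such a check on the Montgomery powering ladder, with the integers modulo 1000003 standing in for the curve group. Checking the invariant after every iteration, the fault model (adding 1 to R[0]) and all names are assumptions made for illustration; a real design would choose how often to check according to its overhead budget.

```python
N_MOD = 1000003    # toy group Z_n standing in for the curve group

class FaultDetected(Exception):
    """Raised when the ladder invariant R[1] - R[0] = P is violated."""

def checked_ladder(k, p, fault_at=None):
    """Montgomery ladder for k >= 1 with a coherence check per iteration.
    If fault_at is set, R[0] is corrupted in that iteration (simulation only)."""
    bits = bin(k)[2:]
    r = [p % N_MOD, (2 * p) % N_MOD]
    for step, b in enumerate(bits[1:]):
        ki = int(b)
        r[1 - ki] = (r[0] + r[1]) % N_MOD
        r[ki] = (2 * r[ki]) % N_MOD
        if step == fault_at:
            r[0] = (r[0] + 1) % N_MOD          # simulated transient fault
        if (r[1] - r[0]) % N_MOD != p % N_MOD:
            raise FaultDetected(step)           # suppress the output
    return r[0]
```

Because the invariant holds after every fault-free iteration, any fault that changes only one of the two registers is caught before a faulty result leaves the device.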

Combined Curve Check [26]

This method uses a reference curve to detect faults. This countermeasure makes use of two curves: a reference curve E_t := E(F_t) and a combined curve E_pt that is defined over the ring Z_pt. In order to compute kP on curve E, it first generates a combined point P_pt from P and a point P_t ∈ E_t(F_t) (with prime order). Two scalar multiplications are then performed: Q_pt = kP_pt on E_pt and Q_t = kP_t on E_t. If no error occurred, Q_t and Q_pt (mod t) will be equal. Otherwise, one of the results is faulty and the computation should be aborted. It is an effective countermeasure against the sign-change fault attack.


Co-factor Multiplication [156]

To prevent small subgroup attacks, most protocols can be reformulated using cofactor multiplication. For instance, the Diffie-Hellman protocol can be adapted as follows: a user first computes Q = h · P, where h is the cofactor, and then R = k · Q if Q ≠ O.

This method is an effective countermeasure against RPA [78] if the exploited special points are of small order. However, it does not provide protection against ZPA (since it does not necessarily use points of small order) or the combined attack.
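The two steps Q = h · P and R = k · Q can be sketched in Python as follows. The toy curve y^2 = x^3 + x over F_11 (group order 12, so h = 4 for the subgroup of order 3), the affine arithmetic and the function names are assumptions made for this example.

```python
P_MOD, A = 11, 1  # toy curve y^2 = x^3 + x over F_11, group order 12
H = 4             # cofactor: 12 = 4 * 3


def ec_add(p1, p2):
    """Affine point addition; None is the point at infinity O."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, point):
    """Left-to-right double-and-add (not side-channel protected)."""
    acc = None
    for i in reversed(range(k.bit_length())):
        acc = ec_add(acc, acc)
        if (k >> i) & 1:
            acc = ec_add(acc, point)
    return acc


def cofactor_dh(k, peer_point):
    """Compute R = k * (h * P) and abort if h * P is the point at infinity."""
    q = scalar_mult(H, peer_point)  # Q = h * P
    if q is None:
        raise ValueError("peer point lies in a small subgroup")
    return scalar_mult(k, q)        # R = k * Q
```

On this curve the point (0, 0) has order 2, so h · (0, 0) = O and the exchange is aborted, whereas a point of order 3 such as (5, 3) passes the check.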

A.6 Discussions

In this section, we discuss several issues on the selection and implementation of countermeasures.

A.6.1 On the Magic of Randomness

As shown in Table A.3, adding randomness into data, operations and addresses serves as a primary method to prevent differential power analysis (and some fault analysis). One underlying assumption of randomisation is that only a few bits of the scalar are leaked from each (randomised) execution, and these pieces of information cannot be aggregated. In other words, since DPA (or DFA) recovers the scalar incrementally, multiple (randomised) executions do not leak more bits of k than one execution. However, history has shown that randomness may not work as well as expected. A good example is the use of a Hidden Markov Model (HMM) to analyze the Oswald-Aigner randomised exponentiation [110] and random scalar splitting [139]. Another example is the horizontal analysis [51] that uses only a single trace. It is not clear whether there is an efficient and general aggregation algorithm to break randomised executions. However, randomness as a protection against DPA (and DFA) should definitely be used with caution.

A.6.2 Countermeasure Selection

While unified countermeasures to tackle both passive and active attacks are attractive, they are very likely weaker than expected. Baek and Vasyltsov extended Shamir’s trick, which was proposed for RSA-CRT, to secure ECC from DPA and FA [9]. However, Joye showed in [103] that a non-negligible portion of faults was undetected using the unified countermeasure and settings in [9].

For the selection of countermeasures, we believe three principles should be followed: Complete, Specific and Additive.

Complete: An adversary needs to succeed in only one out of many possible attack methods to win, but the implementation has to be protected from all applicable attacks.

Specific: For an ECC processor designed for a specific application, typically not all the attacks are applicable. For example, RPA and ZPA are not applicable if an ECC processor is designed solely for ECDSA, since the base point is fixed.

Additive: The combination of two perfect countermeasures may introduce new vulnerabilities. Therefore, selected countermeasures should be evaluated to make sure they are additive.

A.6.3 Implementation Issues

An obvious yet widely ignored fact is that the implementation process (coding in software or hardware) may also introduce vulnerabilities. For instance, an implementation of the Montgomery powering ladder will inevitably use registers or memory entries for intermediate results. These temporary memory entries are not visible at the algorithm level, and safe-errors may be introduced in those memory locations. In order to avoid vulnerabilities introduced during the implementation process, a systematic analysis at each representation level (from C to netlist) should be performed.

Bibliography

[1] http://www.sciengines.com/company/news-a-events/74-des-in-1-day.html. pages 86

[2] ISO/IEC 18000-1:2004, information technology – radio frequency identification for item management. Part 3: Parameters for air interface communications at 13,56 MHz. pages 79

[3] T. Akishita and T. Takagi. Zero-Value Point Attacks on Elliptic Curve Cryptosystem. In C. Boyd and W. Mao, editors, Information Security Conference – ISC 2003, volume 2851 of Lecture Notes in Computer Science, pages 218–233. Springer, 2003. pages 115, 123, 126

[4] D. F. Aranha, J.-L. Beuchat, J. Detrey, and N. Estibals. Optimal Eta Pairing on Supersingular Genus-2 Binary Hyperelliptic Curves. Cryptology ePrint Archive, Report 2010/559, 2010. http://eprint.iacr.org/. pages 51

[5] D. F. Aranha, K. Karabina, P. Longa, C. H. Gebotys, and J. López. Faster Explicit Formulas for Computing Pairings over Ordinary Curves. In K. G. Paterson, editor, Advances in Cryptology – EUROCRYPT 2011, volume 6632 of Lecture Notes in Computer Science, pages 48–68. Springer, 2011. pages 42, 51, 60, 61, 62

[6] D. F. Aranha, J. López, and D. Hankerson. High-Speed Parallel Software Implementation of the ηT Pairing. In J. Pieprzyk, editor, 2010, volume 5985 of Lecture Notes in Computer Science, pages 89–105. Springer, 2010. pages 61

[7] R. Avanzi. Side Channel Attacks on Implementations of Curve-Based Cryptographic Primitives. Cryptology ePrint Archive, Report 2005/017. Available from http://eprint.iacr.org/. pages 10, 112


[8] R. M. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen, and F. Vercauteren. Handbook of Elliptic and Hyperelliptic Curve Cryptography. CRC Press, 2005. pages 52, 66, 112

[9] Y.-J. Baek and I. Vasyltsov. How to Prevent DPA and Fault Attack in a Unified Way for ECC Scalar Multiplication - Ring Extension Method. In E. Dawson and D. S. Wong, editors, Information Security Practice and Experience – ISPEC 2007, volume 4464 of Lecture Notes in Computer Science, pages 225–237. Springer, 2007. pages 129

[10] D. V. Bailey, L. Batina, D. J. Bernstein, P. Birkner, J. W. Bos, H.-C. Chen, C.-M. Cheng, G. van Damme, G. de Meulenaer, L. J. D. Perez, J. Fan, T. Güneysu, F. Gurkaynak, T. Kleinjung, T. Lange, N. Mentens, R. Niederhagen, C. Paar, F. Regazzoni, P. Schwabe, L. Uhsadel, A. Van Herrewege, and B.-Y. Yang. Breaking ECC2K-130. Cryptology ePrint Archive, Report 2009/541, 2009. http://eprint.iacr.org/. pages 88, 90, 96, 102, 103

[11] J.-C. Bajard, L.-S. Didier, and P. Kornerup. An RNS Montgomery Modular Multiplication Algorithm. Computers, IEEE Transactions on, 47(7):766–776, 1998. pages 20

[12] P. S. L. M. Barreto, H. Y. Kim, B. Lynn, and M. Scott. Efficient Algorithms for Pairing-Based Cryptosystems. In M. Yung, editor, Advances in Cryptology – CRYPTO 2002, volume 2442 of Lecture Notes in Computer Science, pages 354–368. Springer, 2002. pages 14

[13] P. Barrett. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In A. M. Odlyzko, editor, Advances in Cryptology – CRYPTO 1986, volume 263 of Lecture Notes in Computer Science, pages 311–323. Springer, 1986. pages 19

[14] L. Batina, D. Hwang, A. Hodjat, B. Preneel, and I. Verbauwhede. Hardware/Software Co-design for Hyperelliptic Curve Cryptography (HECC) on the 8051 µP. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems – CHES 2005, volume 3659 of Lecture Notes in Computer Science, pages 106–118. Springer, 2005. pages 16

[15] D. J. Bernstein, H.-C. Chen, C.-M. Cheng, T. Lange, R. Niederhagen, P. Schwabe, and B.-Y. Yang. ECC2K-130 on NVIDIA GPUs. In G. Gong and K. C. Gupta, editors, Progress in Cryptology – INDOCRYPT 2010, volume 6498 of Lecture Notes in Computer Science, pages 328–346. Springer-Verlag, 2010. pages 102



[16] D. J. Bernstein and T. Lange. Explicit-Formulas Database. http://www.hyperelliptic.org/EFD. pages 66

[17] D. J. Bernstein and T. Lange. Type-II Optimal Polynomial Bases. In M. A. Hasan and T. Helleseth, editors, Workshop on Arithmetic of Finite Fields – WAIFI 2010, volume 6087 of Lecture Notes in Computer Science. Springer, 2010. pages 22, 89, 95

[18] T. Beth and D. Gollman. Algorithm Engineering for Public Key Algorithms. Selected Areas in Communications, IEEE Journal on, 7(4):458–466, May 1989. pages 68

[19] J. Beuchat, H. Doi, K. Fujita, A. Inomata, A. Kanaoka, M. Katouno, M. Mambo, E. Okamoto, T. Okamoto, T. Shiga, M. Shirase, R. Soga, T. Takagi, A. Vithanage, and H. Yamamoto. FPGA and ASIC Implementations of the ηT Pairing in Characteristic Three. Cryptology ePrint Archive, Report 2008/280, 2008. http://eprint.iacr.org/. pages 51

[20] J.-L. Beuchat, J. Detrey, N. Estibals, E. Okamoto, and F. Rodríguez-Henríquez. Hardware Accelerator for the Tate Pairing in Characteristic Three Based on Karatsuba-Ofman Multipliers. Cryptology ePrint Archive, Report 2009/122, 2009. Available from http://eprint.iacr.org/. pages 51

[21] J.-L. Beuchat, J. E. González-Díaz, S. Mitsunari, E. Okamoto, F. Rodríguez-Henríquez, and T. Teruya. High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves. In M. Joye, A. Miyaji, and A. Otsuka, editors, Pairing-Based Cryptography – PAIRING 2010, volume 6487 of Lecture Notes in Computer Science. Springer, 2010. pages 60, 61, 62

[22] J.-L. Beuchat, E. López-Trejo, L. Martínez-Ramos, S. Mitsunari, and F. Rodríguez-Henríquez. Multi-core Implementation of the Tate Pairing over Supersingular Elliptic Curves. In J. A. Garay, A. Miyaji, and A. Otsuka, editors, Cryptology and Network Security – CANS 2009, volume 5888 of LNCS, pages 413–432. Springer, Heidelberg, 2009. pages 61

[23] I. Biehl, B. Meyer, and V. Müller. Differential Fault Attacks on Elliptic Curve Cryptosystems. In M. Bellare, editor, Advances in Cryptology – CRYPTO 2000, volume 1880 of Lecture Notes in Computer Science, pages 131–146. Springer, 2000. pages 112, 117, 118, 119, 127


[24] I. Blake, G. Seroussi, N. Smart, and J. W. S. Cassels. Advances in Elliptic Curve Cryptography (London Mathematical Society Lecture Note Series). Cambridge University Press, New York, NY, USA, 2005. pages 112

[25] G. R. Blakley. A Computer Algorithm for Calculating the Product AB Modulo M. IEEE Trans. Comput., 32(5):497–500, 1983. pages 60, 61

[26] J. Blömer, M. Otto, and J.-P. Seifert. Sign Change Fault Attacks on Elliptic Curve Cryptosystems. In L. Breveglieri, I. Koren, D. Naccache, and J.-P. Seifert, editors, Fault Diagnosis and Tolerance in Cryptography – FDTC 2006, volume 4236 of Lecture Notes in Computer Science, pages 36–52. Springer, 2006. pages 119, 123, 127

[27] D. Boneh and M. Franklin. Identity-Based Encryption from the Weil Pairing. In J. Kilian, editor, Advances in Cryptology – CRYPTO 2001, volume 2139 of Lecture Notes in Computer Science, pages 213–229. Springer, 2001. pages 9, 14

[28] D. Boneh, A. Joux, and P. Q. Nguyen. Why Textbook ElGamal and RSA Encryption Are Insecure. In T. Okamoto, editor, Advances in Cryptology – ASIACRYPT 2000, volume 1976 of Lecture Notes in Computer Science, pages 30–43. Springer, 2000. pages 10

[29] D. Boneh, B. Lynn, and H. Shacham. Short Signatures from the Weil Pairing. Journal of Cryptology, 17(4):297–319, 2004. pages 14

[30] Dan Boneh, Richard A. DeMillo, and Richard J. Lipton. On the Importance of Checking Cryptographic Protocols for Faults (Extended Abstract). In W. Fumy, editor, Advances in Cryptology – EUROCRYPT 1997, volume 1233 of Lecture Notes in Computer Science, pages 37–51. Springer, 1997. pages 2

[31] J. W. Bos, M. E. Kaihara, T. Kleinjung, A. K. Lenstra, and P. L. Montgomery. On the security of 1024-bit RSA and 160-bit elliptic curve cryptography: version 2.1. Cryptology ePrint Archive, Report 2009/389, 2009. http://eprint.iacr.org/2009/389. pages 88

[32] J. W. Bos, T. Kleinjung, R. Niederhagen, and P. Schwabe. ECC2K-130 on Cell CPUs. In D. J. Bernstein and T. Lange, editors, Progress in Cryptology – AFRICACRYPT 2010, volume 6055 of Lecture Notes in Computer Science, pages 225–242. Springer-Verlag, 2010. pages 102

[33] N. Boston, T. Clancy, Y. Liow, and J. Webster. Genus Two Hyperelliptic Curve Coprocessor. In B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 400–414. Springer, 2003. pages 67


[34] R. P. Brent and H. T. Kung. Systolic VLSI Arrays for Polynomial GCD Computation. Computers, IEEE Transactions on, 33(8):731–736, 1984. pages 69

[35] R. P. Brent and J. M. Pollard. Factorization of the Eighth Fermat Number. Mathematics of Computation, 36:627–630, 1981. pages 88

[36] F. Brezing and A. Weng. Elliptic Curves Suitable for Pairing Based Cryptography. Designs, Codes and Cryptography, 37:133–141, 2003. pages 53

[37] E. Brier and M. Joye. Weierstraß Elliptic Curves and Side-Channel Attacks. In D. Naccache and P. Paillier, editors, Public Key Cryptography – PKC 2002, volume 2274 of Lecture Notes in Computer Science, pages 335–345. Springer, 2002. pages 122

[38] Ç. K. Koç, T. Acar, and B. S. Kaliski. Analyzing and Comparing Montgomery Multiplication Algorithms. IEEE Micro, 16:26–33, 1996. pages 16, 26, 27, 50

[39] Certicom. Certicom ECC Challenge. http://www.certicom.com/images/pdfs/cert_ecc_challenge.pdf, 1997. pages 86, 88

[40] Certicom. ECC Curves List. http://www.certicom.com/index.php/curves-list, 1997. pages 87

[41] J. C. Cha and J. H. Cheon. An Identity-Based Signature from Gap Diffie-Hellman Groups. In Y. Desmedt, editor, Public Key Cryptography – PKC 2003, volume 2567 of Lecture Notes in Computer Science, pages 18–30. Springer, 2003. pages 14

[42] S. Chari, J. R. Rao, and P. Rohatgi. Template Attacks. In B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 13–28. Springer, 2002. pages 112, 114

[43] Z. Chen and P. Schaumont. A Parallel Implementation of Montgomery Multiplication on Multicore Systems: Algorithm, Analysis, and Prototype. Computers, IEEE Transactions on, 60(12):1692–1703, 2011. pages

[44] R. C. C. Cheung, S. Duquesne, J. Fan, N. Guillermin, I. Verbauwhede, and G. X. Yao. FPGA Implementation of Pairings Using Residue Number System and Lazy Reduction. In B. Preneel and T. Takagi, editors, Cryptographic Hardware and Embedded Systems – CHES 2011, volume 6917 of Lecture Notes in Computer Science, pages 421–441. Springer, 2011. pages 63


[45] B. Chevallier-Mames, M. Ciet, and M. Joye. Low-Cost Solutions for Preventing Simple Side-Channel Analysis: Side-Channel Atomicity. Computers, IEEE Transactions on, 53(6):760–768, 2004. pages 124

[46] J. Chung and M. A. Hasan. Low-Weight Polynomial Form Integers for Efficient Modular Multiplication. Computers, IEEE Transactions on, 56(1):44–57, 2007. pages 20, 48, 50

[47] J. Chung and M. A. Hasan. Montgomery Reduction Algorithm for Modular Multiplication Using Low-Weight Polynomial Form Integers. In IEEE Symposium on Computer Arithmetic, pages 230–239. IEEE Computer Society, 2007. pages 18, 20, 21

[48] M. Ciet and M. Joye. (Virtually) Free Randomization Techniques for Elliptic Curve Cryptography. In S. Qing, D. Gollmann, and J. Zhou, editors, Information and Communications Security – ICICS 2003, volume 2836 of Lecture Notes in Computer Science, pages 348–359. Springer, 2003. pages 125

[49] M. Ciet and M. Joye. Elliptic Curve Cryptosystems in the Presence of Permanent and Transient Faults. Des. Codes Cryptography, 36(1):33–43, 2005. pages 118, 127

[50] T. Clancy. FPGA-based Hyperelliptic Curve Cryptosystems. Invited paper presented at AMS Central Section Meeting, 2003. pages 67, 78

[51] C. Clavier, B. Feix, G. Gagnerot, M. Roussellet, and V. Verneuil. Horizontal Correlation Analysis on Exponentiation. In M. Soriano, S. Qing, and J. López, editors, Information and Communications Security – ICICS 2010, volume 6476 of Lecture Notes in Computer Science, pages 46–61. Springer, 2010. pages 128

[52] J. Coron. Resistance against Differential Power Analysis for Elliptic Curve Cryptosystems. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 1999, volume 1717 of Lecture Notes in Computer Science, pages 292–302. Springer, 1999. pages 114, 117, 122, 124, 125

[53] A. Devegili, C. Ó hÉigeartaigh, M. Scott, and R. Dahab. Multiplication and Squaring on Pairing-Friendly Fields. Cryptology ePrint Archive, Report 2006/471. Available from http://eprint.iacr.org. pages 51

[54] A. Devegili, M. Scott, and R. Dahab. Implementing Cryptographic Pairings over Barreto-Naehrig Curves. In T. Takagi, T. Okamoto, E. Okamoto, and T. Okamoto, editors, Pairing-Based Cryptography – PAIRING 2007, volume 4575 of Lecture Notes in Computer Science. Springer, 2007. pages 51, 53


[55] J.-F. Dhem. Design of an Efficient Public-Key Cryptographic Library for RISC-Based Smart Cards. PhD thesis, Université catholique de Louvain, Louvain-la-Neuve, Belgium, 1998. pages 19

[56] W. Diffie and M. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644–654, 1976. pages 9

[57] A. Dominguez-Oviedo. On Fault-based Attacks and Countermeasures for Elliptic Curve Cryptosystems. PhD thesis, University of Waterloo, Canada, 2008. pages 127

[58] S. Duquesne. RNS arithmetic in F_{p^k} and application to fast pairing computation. Journal of Mathematical Cryptology, 5(1):51–88, 2011. pages 62

[59] G. Elias, A. Miri, and T. H. Yeap. On Efficient Implementation of FPGA-Based Hyperelliptic Curve Cryptosystems. Computers and Electrical Engineering, 33(5-6):349–366, 2007. pages 67, 78

[60] G. Elias, A. Miri, and T. H. Yeap. High-Performance, FPGA-Based Hyperelliptic Curve Cryptosystems. In The Proceeding of the 22nd Biennial Symposium on Communications, May 2004. pages 67

[61] A. Escott, J. Sager, A. Selkirk, and D. Tsapakidis. Attacking Elliptic Curve Cryptosystems Using the Parallel Pollard rho Method. CryptoBytes (The technical newsletter of RSA Laboratories), 4:15–19, 1998. pages 88

[62] N. Estibals. Compact Hardware for Computing the Tate Pairing over 128-Bit-Security Supersingular Curves. In M. Joye, A. Miyaji, and A. Otsuka, editors, Pairing-Based Cryptography – PAIRING 2010, volume 6487 of Lecture Notes in Computer Science. Springer, 2010. pages 61, 62

[63] J. Fan, L. Batina, K. Sakiyama, and I. Verbauwhede. FPGA design for algebraic tori-based public-key cryptography. In Design, Automation, and Test in Europe – DATE 2008, pages 1292–1297. IEEE, 2008. pages 11

[64] J. Fan, B. Gierlichs, and F. Vercauteren. To Infinity and Beyond: Combined Attack on ECC Using Points of Low Order. In B. Preneel and T. Takagi, editors, Cryptographic Hardware and Embedded Systems – CHES 2011, volume 6917 of Lecture Notes in Computer Science, pages 143–159. Springer, 2011. pages 5, 112, 120

[65] J. Fan, X. Guo, E. De Mulder, P. Schaumont, B. Preneel, and I. Verbauwhede. State-of-the-art of Secure ECC Implementations: A Survey on Known Side-channel Attacks and Countermeasures. In J. Plusquellic and K. Mai, editors, Hardware-Oriented Security and Trust – HOST 2010, pages 76–87. IEEE Computer Society, 2010. pages 112

[66] J. Fan, F. Vercauteren, and I. Verbauwhede. Faster Fp-Arithmetic for Cryptographic Pairings on Barreto-Naehrig Curves. In C. Clavier and K. Gaj, editors, Cryptographic Hardware and Embedded Systems – CHES 2009, volume 5747 of Lecture Notes in Computer Science. Springer, 2009. pages 16, 56, 60, 61

[67] J. Fan, F. Vercauteren, and I. Verbauwhede. Efficient Hardware Implementation of Fp-arithmetic for Pairing-Friendly Curves. Computers, IEEE Transactions on, PP(99):1, 2011. pages 18

[68] P. Fouque, R. Lercier, D. Réal, and F. Valette. Fault Attack on Elliptic Curve Montgomery Ladder Implementation. In L. Breveglieri, S. Gueron, I. Koren, D. Naccache, and J.-P. Seifert, editors, Fault Diagnosis and Tolerance in Cryptography – FDTC 2008, pages 92–98. IEEE Computer Society, 2008. pages 118, 123

[69] P. Fouque, D. Réal, F. Valette, and M. Drissi. The Carry Leakage on the Randomized Exponent Countermeasure. In E. Oswald and P. Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of Lecture Notes in Computer Science. Springer, 2008. pages 114, 116, 123, 125

[70] P. Fouque and F. Valette. The Doubling Attack: Why Upwards Is Better than Downwards. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2003, volume 2779 of Lecture Notes in Computer Science, pages 269–280. Springer, 2003. pages 112, 115, 123, 125, 126

[71] D. Freeman, M. Scott, and E. Teske. A Taxonomy of Pairing-Friendly Elliptic Curves. Journal of Cryptology, 23:224–280, 2010. pages 42, 51, 52

[72] G. Frey and H. G. Rück. A Remark Concerning m-divisibility and the Discrete Logarithm in the Divisor Class Group of Curves. Mathematics of Computation, 62(206):865–874, 1994. pages 14

[73] Robert P. Gallant, Robert J. Lambert, and Scott A. Vanstone. Improving the parallelized Pollard lambda search on anomalous binary curves. Mathematics of Computation, 69(232):1699–1705, 2000. pages 88

[74] GEZEL. http://rijndael.ece.vt.edu/gezel2/. pages 77


[75] S. Ghosh, D. R. Chowdhury, and A. Das. High Speed Cryptoprocessor for ηT Pairing on 128-bit Secure Supersingular Elliptic Curves over Characteristic Two Fields. In B. Preneel and T. Takagi, editors, Cryptographic Hardware and Embedded Systems – CHES 2011, volume 6917 of Lecture Notes in Computer Science, pages 442–458. Springer, 2011. pages 61, 62

[76] S. Ghosh, D. Mukhopadhyay, and D. R. Chowdhury. High Speed Flexible Pairing Cryptoprocessor on FPGA Platform. In M. Joye, A. Miyaji, and A. Otsuka, editors, Pairing-Based Cryptography – PAIRING 2010, volume 6487 of Lecture Notes in Computer Science. Springer, 2010. pages 60, 61

[77] C. Giraud. An RSA Implementation Resistant to Fault Attacks and to Simple Power Analysis. Computers, IEEE Transactions on, 55(9):1116–1120, 2006. pages 127

[78] L. Goubin. A Refined Power-Analysis Attack on Elliptic Curve Cryptosystems. In Y. Desmedt, editor, Public Key Cryptography – PKC 2003, volume 2567 of Lecture Notes in Computer Science, pages 199–210. Springer, 2003. pages 112, 115, 123, 126, 128

[79] P. Grabher, J. Großschädl, and D. Page. On Software Parallel Implementation of Cryptographic Pairings. In R. M. Avanzi, L. Keliher, and F. Sica, editors, Selected Areas in Cryptography – SAC 2008, volume 5381 of Lecture Notes in Computer Science, pages 34–49. Springer, 2008. pages 60, 61, 62

[80] R. Granger, D. Page, and M. Stam. A Comparison of CEILIDH and XTR. In D. A. Buell, editor, Algorithmic Number Theory, 6th International Symposium, ANTS-VI 2004, volume 3076 of Lecture Notes in Computer Science, pages 235–249. Springer, 2004. pages 11

[81] J. Großschädl. High-Speed RSA Hardware Based on Barret’s Modular Reduction Method. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2000, volume 1965 of Lecture Notes in Computer Science, pages 191–203. Springer, 2000. pages 16

[82] N. Guillermin. A High Speed Coprocessor for Elliptic Curve Scalar Multiplications over Fp. In S. Mangard and F.-X. Standaert, editors, Cryptographic Hardware and Embedded Systems – CHES 2010, volume 6225 of Lecture Notes in Computer Science. Springer, 2010. pages 20

[83] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp. Cryptanalysis with COPACOBANA. IEEE Transactions on Computers, 57(11):1498–1513, November 2008. pages 89


[84] T. Güneysu, C. Paar, and J. Pelzl. Attacking Elliptic Curve Cryptosystems with Special-Purpose Hardware. In A. DeHon and M. Hutton, editors, Proceedings of the ACM/SIGDA 15th International Symposium on Field Programmable Gate Arrays – FPGA 2007, page 215. ACM, 2007. pages 89

[85] X. Guo, J. Fan, P. Schaumont, and I. Verbauwhede. Programmable and Parallel ECC Coprocessor Architecture: Tradeoffs between Area, Speed and Security. In C. Clavier and K. Gaj, editors, Cryptographic Hardware and Embedded Systems – CHES 2009, volume 5747 of Lecture Notes in Computer Science, pages 289–303. Springer, 2009. pages 5

[86] J. Ha, J. Park, S. Moon, and S. Yen. Provably Secure Countermeasure Resistant to Several Types of Power Attack for ECC. In S. Kim, M. Yung, and H.-W. Lee, editors, Information Security Applications – WISA 2007, volume 4867 of Lecture Notes in Computer Science, pages 333–344. Springer, 2007. pages 126

[87] D. Hankerson, A. Menezes, and M. Scott. Software Implementation of Pairings, volume 2 of Cryptology and Information Security Series, pages 188–206. IOS Press, M. Joye and G. Neven edition, 2009. pages 54, 60, 61, 62

[88] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer, NY, 2004. pages 13

[89] M. A. Hasan and V. K. Bhargava. Bit-Serial Systolic Divider and Multiplier for Finite Fields GF(2^m). Computers, IEEE Transactions on, 41(8):972–980, Aug 1992. pages 69

[90] D. Hein, J. Wolkerstorfer, and N. Felber. ECC is Ready for RFID - A Proof in Silicon. In R. M. Avanzi, L. Keliher, and F. Sica, editors, Selected Areas in Cryptography – SAC 2008, volume 5381 of Lecture Notes in Computer Science. Springer, 2008. pages 79, 83

[91] C. Herbst and M. Medwed. Using Templates to Attack Masked Montgomery Ladder Implementations of Modular Exponentiation. In K.-I. Chung, K. Sohn, and M. Yung, editors, Information Security Applications – WISA 2008, volume 5379 of Lecture Notes in Computer Science, pages 1–13. Springer, 2008. pages 114

[92] F. Hess. Pairing Lattices. In S. D. Galbraith and K. G. Paterson, editors, Pairing-Based Cryptography – PAIRING 2008, volume 5209 of Lecture Notes in Computer Science. Springer, 2008. pages 14


[93] F. Hess, N. P. Smart, and F. Vercauteren. The Eta Pairing Revisited. Information Theory, IEEE Transactions on, 52(10):4595–4602, 2006. pages 14, 15

[94] N. Homma, A. Miyamoto, T. Aoki, A. Satoh, and A. Shamir. Collision-Based Power Analysis of Modular Exponentiation Using Chosen-Message Pairs. In E. Oswald and P. Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of Lecture Notes in Computer Science, pages 15–29. Springer, 2008. pages 115

[95] D. D. Hwang, K. Tiri, A. Hodjat, B.-C. Lai, S. Yang, P. Schaumont, and I. Verbauwhede. AES-Based Security Coprocessor IC in 0.18-µm CMOS With Resistance to Differential Power Analysis Side-Channel Attacks. Solid-State Circuits, IEEE Journal of, 41(4):781–792, 2006. pages 110

[96] K. Itoh, T. Izu, and M. Takenaka. Address-Bit Differential Power Analysis of Cryptographic Schemes OK-ECDH and OK-ECDSA. In B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 129–143. Springer, 2002. pages 116

[97] K. Itoh, T. Izu, and M. Takenaka. A Practical Countermeasure against Address-Bit Differential Power Analysis. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2003, volume 2779 of Lecture Notes in Computer Science, pages 382–396. Springer, 2003. pages 126

[98] K. Itoh, M. Takenaka, N. Torii, S. Temma, and Y. Kurihara. Fast Implementation of Public-Key Cryptography on a DSP TMS320C6201. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 1999, volume 1717 of Lecture Notes in Computer Science, pages 61–72. Springer, 1999. pages 34, 36

[99] T. Itoh and S. Tsujii. A Fast Algorithm for Computing Multiplicative Inverses in GF(2^m) Using Normal Bases. Inf. Comput., 78(3):171–177, 1988. pages 69, 97

[100] K. Iwamura, T. Matsumoto, and H. Imai. High-Speed Implementation Methods for RSA Scheme. In R. A. Rueppel, editor, Advances in Cryptology – EUROCRYPT 1992, volume 658 of Lecture Notes in Computer Science, pages 221–238. Springer, 1992. pages 26, 27

[101] M. Izumi, J. Ikegami, K. Sakiyama, and K. Ohta. Improved Countermeasure against Address-Bit DPA for ECC Scalar Multiplication. In Design, Automation, and Test in Europe – DATE 2010, pages 981–984. IEEE, 2010. pages 126


[102] A. Joux. A One Round Protocol for Tripartite Diffie–Hellman. Journal of Cryptology, 17(4):263–276, 2004. pages 9, 14, 16

[103] M. Joye. On the Security of a Unified Countermeasure. In L. Breveglieri, S. Gueron, I. Koren, D. Naccache, and J.-P. Seifert, editors, Fault Diagnosis and Tolerance in Cryptography – FDTC 2008, pages 87–91. IEEE Computer Society, 2008. pages 129

[104] M. Joye and C. Tymen. Protections against Differential Analysis for Elliptic Curve Cryptography. In Ç. K. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2001, volume 2162 of Lecture Notes in Computer Science, pages 377–390. Springer, 2001. pages 126

[105] M. Joye and S.-M. Yen. The Montgomery Powering Ladder. In B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2002, volume 2523 of Lecture Notes in Computer Science. Springer, 2002. pages 117, 124

[106] M. E. Kaihara and N. Takagi. Bipartite Modular Multiplication Method. IEEE Trans. Computers, 57(2):157–164, 2008. pages 27

[107] D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Langenberg, D. Auras, G. Ascheid, R. Leupers, R. Mathar, and H. Meyr. Designing an ASIP for Cryptographic Pairings over Barreto-Naehrig Curves. In C. Clavier and K. Gaj, editors, Cryptographic Hardware and Embedded Systems – CHES 2009, volume 5747 of Lecture Notes in Computer Science. Springer, 2009. pages 51, 60, 61

[108] D. Karaklajić, J. Fan, J. Schmidt, and I. Verbauwhede. Low-cost Fault Detection Method for ECC Using Montgomery Powering Ladder. In DATE, pages 1016–1021. IEEE, 2011. pages 5

[109] A. Karatsuba and Y. Ofman. Multiplication of Many-Digital Numbers by Automatic Computers. Translation in Physics-Doklady, 145:595–596, 1963. pages 46

[110] C. Karlof and D. Wagner. Hidden Markov Model Cryptoanalysis. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2003, volume 2779 of Lecture Notes in Computer Science, pages 17–34. Springer, 2003. pages 128

[111] S. Kawamura, M. Koike, F. Sano, and A. Shimbo. Cox-Rower Architecture for Fast Parallel Montgomery Multiplication. In B. Preneel, editor, Advances in Cryptology – EUROCRYPT 2000, volume 1807 of Lecture Notes in Computer Science, pages 523–538. Springer, 2000. pages 20, 63


[112] D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, 1981. pages 69

[113] N. Koblitz. Elliptic Curve Cryptosystems. Mathematics of computation,48(177):203–209, 1987. pages 9, 11, 112

[114] N. Koblitz. Hyperelliptic Cryptosystems. Journal of Cryptology, 1(3):129–150, 1989. pages 9, 13

[115] P. C. Kocher. Timing Attacks on Implementations of Diffie-Hellman,RSA, DSS, and Other Systems. In N. Koblitz, editor, Advances inCryptology – CRYPTO 1996, volume 1109 of Lecture Notes in ComputerScience, pages 104–113. Springer, 1996. pages 2, 112

[116] P. C. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. In M. J.Wiener, editor, Advances in Cryptology – CRYPTO 1999, volume 1666 ofLecture Notes in Computer Science, pages 388–397. Springer, 1999. pages112, 114, 115

[117] O. Kömmerling and M. G. Kuhn. Design Principles for Tamper-Resistant Smartcard Processors. In WOST’99 Proceedings of theUSENIX Workshop on Smartcard Technology on USENIX Workshop onSmartcard Technology, pages 9–20, 1999. pages 116

[118] S. Kwon. A Low Complexity and a Low Latency Bit Parallel SystolicMultiplier over GF(2m) Using an Optimal Normal Basis of Type II.In IEEE Symposium on Computer Arithmetic, pages 196–202. IEEEComputer Society, 2003. pages 94

[119] T. Lange and P. K. Mishra. SCA Resistant Parallel Explicit Formula for Addition and Doubling of Divisors in the Jacobian of Hyperelliptic Curves of Genus 2. In S. Maitra, C. E. Veni Madhavan, and R. Venkatesan, editors, International Conference on Cryptology in India – INDOCRYPT 2005, volume 3797 of Lecture Notes in Computer Science, pages 403–416. Springer, 2005. pages 81

[120] T. Lange and M. Stevens. Efficient Doubling on Genus Two Curves over Binary Fields. In H. Handschuh and M. A. Hasan, editors, Selected Areas in Cryptography – SAC 2004, volume 3357 of Lecture Notes in Computer Science, pages 170–181. Springer, 2004. pages 16, 75

[121] E. Lee, H.-S. Lee, and C.-M. Park. Efficient and Generalized Pairing Computation on Abelian Varieties. Information Theory, IEEE Transactions on, 55(4):1793–1803, 2009. pages 14, 51


[122] Y. K. Lee, K. Sakiyama, L. Batina, and I. Verbauwhede. Elliptic-Curve-Based Security Processor for RFID. IEEE Trans. Comput., 57(11):1514–1527, 2008. pages 79, 82, 83

[123] L. Lin and W. Burleson. Leakage-based Differential Power Analysis (LDPA) on Sub-90nm CMOS Cryptosystems. In International Symposium on Circuits and Systems (ISCAS 2008), 18-21 May 2008, Sheraton Seattle Hotel, Seattle, Washington, USA, pages 252–255. IEEE, 2008. pages 110

[124] J. López and R. Dahab. Fast Multiplication on Elliptic Curves over GF(2^m) without Precomputation. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 1999, volume 1717 of Lecture Notes in Computer Science, pages 316–327. Springer, 1999. pages 16

[125] D. May, H. L. Muller, and N. P. Smart. Random Register Renaming to Foil DPA. In Ç. K. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2001, volume 2162 of Lecture Notes in Computer Science, pages 28–38. Springer, 2001. pages 126

[126] M. Medwed and E. Oswald. Template Attacks on ECDSA. In K.-I. Chung, K. Sohn, and M. Yung, editors, Information Security Applications – WISA 2008, volume 5379 of Lecture Notes in Computer Science, pages 14–27. Springer, 2008. pages 114, 123

[127] A. J. Menezes, T. Okamoto, and S. A. Vanstone. Reducing Elliptic Curve Logarithms to Logarithms in a Finite Field. Information Theory, IEEE Transactions on, 39(5):1639–1646, 1993. pages 14

[128] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996. pages 9, 10

[129] N. Mentens, K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. A Side-channel Attack Resistant Programmable PKC Coprocessor for Embedded Applications. In H. Blume, G. Gaydadjiev, C. J. Glossner, and P. M. W. Knijnenburg, editors, Proceedings of the 2007 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS 2007), pages 194–200. IEEE, 2007. pages 27, 34, 36

[130] T. S. Messerges, E. A. Dabbish, and R. H. Sloan. Power Analysis Attacks of Modular Exponentiation in Smartcards. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 1999, volume 1717 of Lecture Notes in Computer Science, pages 144–157. Springer, 1999. pages 112, 116


[131] G. Meurice de Dormale, P. Bulens, and J.-J. Quisquater. Collision Search for Elliptic Curve Discrete Logarithm over GF(2^m) with FPGA. In P. Paillier and I. Verbauwhede, editors, Cryptographic Hardware and Embedded Systems – CHES 2007, volume 4727 of Lecture Notes in Computer Science, pages 378–393. Springer, 2007. pages 89

[132] V. S. Miller. Uses of Elliptic Curves in Cryptography. In H. C. Williams, editor, Advances in Cryptology – CRYPTO 1985, volume 218 of Lecture Notes in Computer Science, pages 417–426. Springer, 1985. pages 9, 11, 112

[133] V. S. Miller. Short Programs for Functions on Curves, 1986. Unpublished manuscript, available at http://crypto.stanford.edu/miller/miller.pdf. pages 14

[134] V. S. Miller. The Weil Pairing, and Its Efficient Calculation. Journal of Cryptology, 17(4):235–261, 2004. pages 14, 15

[135] P. L. Montgomery. Modular Multiplication Without Trial Division. Mathematics of Computation, 44(170):519–521, 1985. pages 19

[136] P. L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation, 48(177):243–264, 1987. pages 97, 114, 124

[137] P. L. Montgomery. Five, Six, and Seven-Term Karatsuba-Like Formulae. IEEE Trans. Comput., 54(3):362–369, 2005. pages 57

[138] E. De Mulder, S. Örs, B. Preneel, and I. Verbauwhede. Differential Power and Electromagnetic Attacks on A FPGA Implementation of Elliptic Curve Cryptosystems. Computers & Electrical Engineering, 33(5-6):367–382, 2007. pages 114

[139] F. Muller and F. Valette. High-Order Attacks Against the Exponent Splitting Protection. In M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, editors, Public Key Cryptography – PKC 2006, volume 3958 of Lecture Notes in Computer Science, pages 315–329. Springer, 2006. pages 128

[140] M. Naehrig, R. Niederhagen, and P. Schwabe. New Software Speed Records for Cryptographic Pairings. In M. Abdalla and P. S. L. M. Barreto, editors, Cryptology and Information Security in Latin America – LATINCRYPT 2010, volume 6212 of Lecture Notes in Computer Science, pages 109–123. Springer, 2010. pages 60, 61, 62

[141] P. Q. Nguyen and I. Shparlinski. The Insecurity of the Elliptic Curve Digital Signature Algorithm with Partially Known Nonces. Des. Codes Cryptography, 30(2):201–217, 2003. pages 113


[142] K. Okeya and K. Sakurai. Power Analysis Breaks Elliptic Curve Cryptosystems even Secure against the Timing Attack. In B. K. Roy and E. Okamoto, editors, International Conference on Cryptology in India – INDOCRYPT 2000, volume 1977 of Lecture Notes in Computer Science, pages 178–190. Springer, 2000. pages 123, 124

[143] J. K. Omura and J. L. Massey. Computational Method and Apparatus for Finite Field Arithmetic, 1986. pages 23

[144] P. S. L. M. Barreto and M. Naehrig. Pairing-Friendly Elliptic Curves of Prime Order. In B. Preneel and S. E. Tavares, editors, Selected Areas in Cryptography – SAC 2005, volume 3897 of Lecture Notes in Computer Science, pages 319–331. Springer, 2006. pages 15, 52

[145] J. M. Pollard. Monte Carlo Methods for Index Computation (mod p). Mathematics of Computation, 32:918–924, 1978. pages 88

[146] R. L. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21(2):120–126, 1978. pages 9

[147] K. Rubin and A. Silverberg. Torus-Based Cryptography. In D. Boneh, editor, Advances in Cryptology – CRYPTO 2003, volume 2729 of Lecture Notes in Computer Science, pages 349–365. Springer, 2003. pages 9, 10, 11

[148] K. Sakiyama. Secure Design Methodology and Implementation for Embedded Public-key Cryptosystems. PhD thesis, Katholieke Universiteit Leuven, Belgium, 2007. pages 11, 82, 83

[149] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Superscalar Coprocessor for High-Speed Curve-Based Cryptography. volume 4249, pages 415–429, 2006. pages 26, 67, 76, 78

[150] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Multicore Curve-Based Cryptoprocessor with Reconfigurable Modular Arithmetic Logic Units over GF(2^n). IEEE Trans. Computers, 56(9):1269–1282, 2007. pages 98

[151] K. Sakiyama, N. Mentens, L. Batina, B. Preneel, and I. Verbauwhede. Reconfigurable Modular Arithmetic Logic Unit for High-Performance Public-Key Cryptosystems. In K. Bertels, J. M. P. Cardoso, and S. Vassiliadis, editors, Applied Reconfigurable Computing – ARC 2006, volume 3985 of Lecture Notes in Computer Science, pages 347–357. Springer, 2006. pages 27


[152] K. Sakiyama, B. Preneel, and I. Verbauwhede. A Fast Dual-Field Modular Arithmetic Logic Unit and Its Hardware Implementation. In International Symposium on Circuits and Systems – ISCAS 2006. IEEE, 2006. pages 34, 36

[153] SciEngines GmbH. RIVYERA S3-5000. http://www.sciengines.com/. pages 89

[154] M. Scott. Implementing Cryptographic Pairings. In T. Takagi, T. Okamoto, E. Okamoto, and T. Okamoto, editors, Pairing-Based Cryptography – PAIRING 2007, volume 4575 of LNCS, pages 117–196. Springer, 2007. pages 42, 51

[155] J. Shokrollahi. Efficient Implementation of Elliptic Curve Cryptography on FPGAs. PhD thesis, University of Bonn, Germany, 2007. pages 22, 23

[156] N. P. Smart. An Analysis of Goubin’s Refined Power Analysis Attack. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2003, volume 2779 of Lecture Notes in Computer Science, pages 281–290. Springer, 2003. pages 128

[157] D. Stebila and N. Thériault. Unified Point Addition Formulæ and Side-Channel Attacks. In L. Goubin and M. Matsui, editors, Cryptographic Hardware and Embedded Systems – CHES 2006, volume 4249 of Lecture Notes in Computer Science, pages 354–368. Springer, 2006. pages 122

[158] B. Sunar and Ç. K. Koç. An Efficient Optimal Normal Basis Type II Multiplier. IEEE Transactions on Computers, 50:83–87, 2001. pages 22, 23

[159] A. F. Tenca and Ç. K. Koç. A Scalable Architecture for Modular Multiplication Based on Montgomery’s Algorithm. IEEE Trans. Computers, 52(9):1215–1221, 2003. pages 30, 34, 36

[160] P. C. van Oorschot and M. J. Wiener. Parallel Collision Search with Cryptanalytic Applications. Journal of Cryptology, 12(1):1–28, 1999. pages 88

[161] S. Vanstone. Responses to NIST’s Proposal. Communications of the ACM, 35:50–52, 1992. pages 113

[162] F. Vercauteren. Optimal Pairings. Information Theory, IEEE Transactions on, 56(1):455–461, 2010. pages 14, 51


[163] J. von zur Gathen, A. Shokrollahi, and J. Shokrollahi. Efficient Multiplication Using Type 2 Optimal Normal Bases. In C. Carlet and B. Sunar, editors, Workshop on Arithmetic of Finite Fields – WAIFI 2007, volume 4547 of Lecture Notes in Computer Science, pages 55–68. Springer, 2007. pages 23, 89, 93

[164] C. D. Walter. Montgomery Exponentiation Needs no Final Subtractions. Electronics Letters, 35(21):1831–1832, 1999. pages 20

[165] C. D. Walter. Simple Power Analysis of Unified Code for ECC Double and Add. In M. Joye and J.-J. Quisquater, editors, Cryptographic Hardware and Embedded Systems – CHES 2004, volume 3156 of Lecture Notes in Computer Science, pages 191–204. Springer, 2004. pages 122

[166] M. J. Wiener and R. J. Zuccherato. Faster Attacks on Elliptic Curve Cryptosystems. In S. E. Tavares and H. Meijer, editors, Selected Areas in Cryptography – SAC 1998, volume 1556 of Lecture Notes in Computer Science, pages 190–200. Springer, 1998. pages 88

[167] T. Wollinger. Software and Hardware Implementation of Hyperelliptic Curve Cryptosystems. PhD thesis, Ruhr-University Bochum, Germany, 2004. pages 67, 78, 79

[168] T. Wollinger. Computer Architectures for Cryptosystems Based on Hyperelliptic Curves. Master’s thesis, Worcester Polytechnic Institute, Worcester, Massachusetts, May 2001. pages 67

[169] Z. Yan, D. V. Sarwate, and Z. Liu. High-speed systolic architectures for finite field inversion. Integration, VLSI Journal, 38(3):383–398, 2005. pages 69, 70

[170] S. M. Yen and M. Joye. Checking Before Output May Not Be Enough Against Fault-Based Cryptanalysis. Computers, IEEE Transactions on, 49(9):967–970, 2000. pages 112, 117, 123

[171] S.-M. Yen, L.-C. Ko, S.-J. Moon, and J. Ha. Relative Doubling Attack Against Montgomery Ladder. In D. Won and S. Kim, editors, Information Security and Cryptology – ICISC 2005, volume 3935 of Lecture Notes in Computer Science. Springer, 2005. pages 123, 124

Curriculum

Junfeng Fan received his Bachelor and Master degrees in electrical engineering from Zhejiang University, China, in 2003 and 2006, respectively. From August 2006 to July 2007, he studied as a predoctoral student at COSIC (Computer Security and Industrial Cryptography) at the Department of Electrical Engineering (ESAT) of K.U.Leuven. He joined the group as a PhD student in August 2007. His research interests include efficient arithmetic for public key cryptography, low-power design for ubiquitous security, and physical attack resistant implementations.


List of publications

International Journals

1. J. Fan, F. Vercauteren, and I. Verbauwhede, “Efficient hardware implementation of Fp-arithmetic for pairing-friendly curves,” Computers, IEEE Transactions on, vol. PP, no. 99, p. 1, 2011.

2. K. Sakiyama, M. Knežević, J. Fan, B. Preneel, and I. Verbauwhede, “Tripartite modular multiplication,” Integration, the VLSI Journal, vol. 44, no. 4, pp. 259–269, 2011.

3. J. Fan, L. Batina, and I. Verbauwhede, “Design and design methods for unified multiplier and inverter and its application for HECC,” Integration, the VLSI Journal, vol. 44, no. 4, pp. 280–289, 2011.

4. M. Knežević, K. Kobayashi, J. Ikegami, S. Matsuo, A. Satoh, Ü. Kocabas, J. Fan, T. Katashita, T. Sugawara, K. Sakiyama, I. Verbauwhede, K. Ohta, N. Homma, and T. Aoki, “Fair and consistent hardware evaluation of fourteen round two SHA-3 candidates,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. PP, no. 99, pp. 1–13, 2011.

5. J. Fan, K. Sakiyama, and I. Verbauwhede, “Elliptic curve cryptography on embedded multicore systems,” Design Automation for Embedded Systems, vol. 12, pp. 231–242, 2008.

Book Chapters

6. M. Knežević, L. Batina, E. De Mulder, J. Fan, B. Gierlichs, Y. K. Lee, R. Maes, and I. Verbauwhede, “Signal Processing for Cryptography and Security Applications,” In Handbook of Signal Processing Systems, Springer, pp. 161–177, 2010.


Lecture Notes in Computer Science

7. R. C. C. Cheung, S. Duquesne, J. Fan, N. Guillermin, I. Verbauwhede, and G. X. Yao, “FPGA implementation of pairings using residue number system and lazy reduction,” in Cryptographic Hardware and Embedded Systems – CHES 2011, ser. Lecture Notes in Computer Science, B. Preneel and T. Takagi, Eds., vol. 6917. Springer, pp. 421–441, 2011.

8. J. Fan, B. Gierlichs, and F. Vercauteren, “To infinity and beyond: Combined attack on ECC using points of low order,” in Cryptographic Hardware and Embedded Systems – CHES 2011, ser. Lecture Notes in Computer Science, B. Preneel and T. Takagi, Eds., vol. 6917. Springer, pp. 143–159, 2011.

9. J. Fan, J. Hermans, and F. Vercauteren, “On the claimed privacy of EC-RAC III,” in Radio Frequency Identification: Security and Privacy Issues – RFIDSec 2010, ser. Lecture Notes in Computer Science, S. B. O. Yalcin, Ed., vol. 6370. Springer, pp. 66–74, 2010.

10. J. Fan, F. Vercauteren, and I. Verbauwhede, “Faster Fp-arithmetic for cryptographic pairings on Barreto-Naehrig curves,” in Cryptographic Hardware and Embedded Systems – CHES 2009, ser. Lecture Notes in Computer Science, C. Clavier and K. Gaj, Eds., vol. 5747. Springer, pp. 240–253, 2009.

11. X. Guo, J. Fan, P. Schaumont, and I. Verbauwhede, “Programmable and parallel ECC coprocessor architecture: Tradeoffs between area, speed and security,” in Cryptographic Hardware and Embedded Systems – CHES 2009, ser. Lecture Notes in Computer Science, C. Clavier and K. Gaj, Eds., vol. 5747. Springer, pp. 289–303, 2009.

12. J. Fan, L. Batina, and I. Verbauwhede, “HECC goes embedded: an area-efficient implementation of HECC,” in Selected Areas in Cryptography – SAC 2008, ser. Lecture Notes in Computer Science, R. M. Avanzi, L. Keliher, and F. Sica, Eds., vol. 5381. Springer, pp. 387–400, 2008.

13. M. Knežević, K. Sakiyama, J. Fan, and I. Verbauwhede, “Modular reduction in GF(2^n) without pre-computational phase,” in Workshop on Arithmetic of Finite Fields – WAIFI 2008, ser. Lecture Notes in Computer Science, J. von zur Gathen, J. L. Imaña, and Ç. K. Koç, Eds., vol. 5130. Springer, pp. 77–87, 2008.


IEEE conferences

14. D. Karaklajić, J. Fan, J. Schmidt, and I. Verbauwhede, “Low-cost fault detection method for ECC using Montgomery powering ladder,” in Design, Automation, and Test in Europe – DATE 2011. IEEE, pp. 1016–1021, 2011.

15. J. Fan, D. V. Bailey, L. Batina, T. Güneysu, C. Paar, and I. Verbauwhede, “Breaking elliptic curve cryptosystems using reconfigurable hardware,” in Field Programmable Logic and Applications – FPL 2010. IEEE, pp. 133–138, 2010.

16. J. Fan, X. Guo, E. D. Mulder, P. Schaumont, B. Preneel, and I. Verbauwhede, “State-of-the-art of secure ECC implementations: a survey on known side-channel attacks and countermeasures,” in Hardware-Oriented Security and Trust – HOST 2010, pp. 76–87, 2010.

17. Ü. Kocabas, J. Fan, and I. Verbauwhede, “Implementation of binary Edwards curves for very-constrained devices,” in Application-specific Systems Architectures and Processors – ASAP 2010, F. Charot, F. Hannig, J. Teich, and C. Wolinski, Eds. IEEE, pp. 185–191, 2010.

18. J. Fan and I. Verbauwhede, “A digit-serial architecture for inversion and multiplication in GF(2^m),” in Signal Processing Systems – SiPS 2008. IEEE, pp. 7–12, 2008.

19. J. Fan, L. Batina, K. Sakiyama, and I. Verbauwhede, “FPGA design for algebraic tori-based public-key cryptography,” in Design, Automation, and Test in Europe – DATE 2008. IEEE, pp. 1292–1297, 2008.

Technical Reports

20. D. Karaklajić, J. Fan, and I. Verbauwhede, “The Devil is in Details: A Safe-Error Attack on a Tiny ECC Processor,” COSIC internal report, 14 pages, 2011.

21. J. Fan, “Hardware evaluation of the hash function Hamsi,” COSIC internal report, 5 pages, 2009.

Arenberg Doctoral School of Science, Engineering & Technology

Faculty of Engineering

Department of Electrical Engineering

Computer Security and Industrial Cryptography

Kasteelpark Arenberg 10

B-3001 Heverlee
