FPGA Benchmarking of Round 2 Candidates in - NIST
Transcript of FPGA Benchmarking of Round 2 Candidates in - NIST
FPGA Benchmarking of Round 2 Candidates in the NIST Lightweight Cryptography
Standardization Process: Methodology, Metrics, Tools, and Results
Kamyar Mohajerani, Richard Haeussler, Rishub Nagpal, Farnoud Farahmand, Abubakr Abdulgadir,
Jens-Peter Kaps and Kris Gaj
George Mason University USA
GMU CERG LWC Benchmarking Team
Kamyar Richard Rishub Farnoud Bakry Mohajerani Haeussler Nagpal Farahmand Abdulgadir
2
FPGA Benchmarking Goals
• Stimulate the development of hardware implementations that can be fairly compared with each other
• Perform design space exploration of at least selected candidates
• Evaluate and rank candidates from the point of view of their performance in FPGAs
• Identify the best and worst performers in terms of major benchmarking metrics
• Develop optimized code of unprotected implementations to be used as a basis for the development and analysis of protected implementations in Round 3
3
Previous Similar Benchmarking Efforts
CAESAR – Competition for Authenticated Encryption: Security, Applicability, & Robustness
• HDL code requirement established by the CAESAR Committee in the middle of Round 2 in May 2016
• CAESAR Round 2 : 2015-2016 • 14 Hardware Design Teams • 28 out of 29 candidates implemented
• CAESAR Round 3 : 2016-2017 • 10 Hardware Design Teams • 15 out of 15 candidates implemented
4
LWC Hardware API proposed by GMU, TUM, & VT
1. Minimum Compliance Criteria 3. Communication Protocol
• Supported operations • Permitted input sizes • Decrypted plaintext release • Permitted data port widths
etc.
2. Interface clk rst
instruction = ACTKEY
instruction = ENC
seg_0_header
seg_0 = Npub
seg_1_header
seg_1 = AD
seg_2_header
seg_2 = Plaintext
clk rst 4. Timing Characteristics LWC
w w pdi_data do_data DO pdi_valid do_valid Data Output
PDI
Public Data Input pdi_ready do_ready PortsPorts
do_last sw
SDI sdi_data
Secret Data Input sdi_valid
Ports sdi_ready
Based on the CAESAR API. Stable since October 2019. 5
�
LWC Hardware Development Package by GMU, TUM, & VT
Helpful for designers following Helpful for all designers the recommended design flow
clk rst
Two Pass clk rst FIFO
LWC
w w sw sdi_datapdi_data do_data DOPDI sdi_valid
pdi_valid do_valid Data Output sdi_readyPublic Data Input
sdi_data
do_last
ccw/8
pdi_data
bdi_valid bdi_ready
fdi_
ready
fdi_
valid
fdi_
data
fdo_va
lidfd
o_data
fdo_re
ady
bdi_eot
bdi_valid_bytes
bdi_pad_loc
bdo_valid bdo_readybdi_ready
bdi_valid
Processor msg_auth msg_auth
sdi_valid
sdi_ready
do_ready
do_valid
bdi bdobdi do_data
fw fw
4 4
ccw/8
ccw
ccsw
ccw/8
ccw
pdi_valid
pdi_ready ww
ccw/8+1
LWC
CryptoCore
key key_valid key_ready
bdi_type
bdi_eoi end_of_block bdo_valid_bytes
bdo_type
key_valid key_ready
key
bdi_type bdi_eot bdi_eoi
bdi_valid_bytes
bdi_pad_loc
end_of_block bdo_valid_bytes
bdo_type bdo_ready bdo_valid bdo
Post Pre
din_valid din_ready
din Header FIFO
dout_valid dout_ready
dout
bdi_size
decrypt_in
key_update msg_auth_ready msg_auth_valid
cmd_valid cmd_ready
cmd
bdi_size
decrypt
key_update
cmd_valid cmd_ready
cmd
msg_auth_ready msg_auth_valid
hash hash_in
Processor
w do_data
pdi_ready do_ready Ports do_valid do_ready
Ports do_last
sw do_last SDI sdi_data
Secret Data Input sdi_valid
Ports sdi_ready w pdi_data pdi_valid
pdi_ready
dout_
ready
dout_
valid
dout
din
_re
ady
din
_va
liddin
• Universal testbench • VHDL code of a generic PreProcessor, (LWC_TB) PostProcessor, and Header FIFO
• Test vector generator • CryptoCore of a dummy authenticated cipher with (cryptotvgen) hash functionality 6
Choice of FPGA Platforms and Tools
1. Representing widely used low-cost, low-power FPGA families
2. Capable of holding SCA-protected designs
(possibly using up to 4 times more resources than unprotected designs)
3. Supported by free versions of state-of-the-art industry tools
7
FPGA Platforms and Tools
Xilinx: Artix-7 : xc7a12tcsg325-3 8,000 LUTs 16,000 FFs 40 18Kbit BRAMs 40 DSPs 150 I/Os
Intel: Cyclone 10 LP : 10CL016-YF484C6 15,408 LEs 15,408 FFs 56 M9K blocks 56 MULs 162 I/Os
Lattice Semiconductor: ECP5 : LFE5U-25F-6BG381C 24,000 LUTs 24,000 FFs 56 18Kbit blocks 28 MULs 197 I/Os.
Xilinx Artix-7 LUTs have 6 inputs; Cyclone 10 LP and ECP5 LUTs have 4 inputs 8
Optimization Targets
Maximum Throughput with up to 2000 LUTs, 4000 flip-flops of Artix-7 FPGA. No BRAMs & no DSP units.
Alternative targets: 1. Basic-iterative architecture 2. Architectures most natural for a given authenticated cipher
a. Folding in block-cipher-based submissions b. Generating 2d bits per clock cycle in stream-cipher-based submissions
3. Maximum Throughput for 1000 LUTs, 2000 flip-flops of Artix-7 FPGA. No BRAMs & no DSP units.
9
Benchmarking Metrics (1)
Metrics obtained from tool reports after placing and routing:
1. Resource utilization Number of LUTs (LEs for Cyclone 10LP) and flip-flops, assuming no use of embedded memories (such as BRAMs), DSP units, and embedded multipliers
2. Maximum clock frequency in MHz (used only for the calculation of maximum throughput)
10
Benchmarking Metrics (2)
Metrics calculated based on the execution time measurements (in clock cycles) obtained using functional simulation and the maximum clock frequencies (in MHz): Throughput in Mbits/s
for the following sizes of inputs a. long [with Throughput = d · Block size/(Time(N+d blocks)-Time(N blocks))] b. 1536 bytes c. 64 bytes d. 16 bytes.
All throughputs calculated separately for • authenticated encryption: AD, plaintext, AD+plaintext (sender's side) • authenticated decryption: AD, ciphertext, AD+ciphertext (receiver's side) • hashing: hash message (both sides) 11
Timeline of Round 2 FPGA Benchmarking
Phase 1: Sep. 1, 2020: 1st submission deadline Sep. 26, 2020: Publication of the living report
Phase 2: Oct. 11, 2020: 2nd submission deadline Oct. 21, 2020: Phase 2 updates to the report
Phase 3: Nov. 9, 2020: 3rd submission deadline Nov. 30, 2020: Final version of the report
12
Summary of Submissions Phases 1 & 2
27 submissions representing 22 out of 32 candidates (69%)
Candidates with two submissions from two different groups: Ascon, COMET, Gimli, TinyJAMBU, and Xoodyak
13
Summary of Submissions Phases 1 & 2
• George Mason University Cryptographic Engineering Research Group (CERG), USA (6): Elephant, PHOTON-Beetle, Pyjamask, Saturnin, TinyJAMBU, and Xoodyak
• Virginia Tech Signatures Analysis Lab, USA (5): Ascon, COMET, GIFT-COFB, SCHWAEMM & ESCH, and Spoc
• CINVESTAV-IPN, Mexico (4) COMET, ESTATE, LOCUS-AEAD/LOTUS-AEAD, and Oribatida
• Institute of Applied Information Processing and Communications, TU Graz, Austria (2) Ascon and ISAP
• Submissions from the respective candidate teams (8): Gimli, KNOT, Romulus, Spook, Subterranean 2.0, WAGE, TinyJAMBU, and Xoodyak
• Submissions from other groups and independent researchers (2): Gimli, DryGASCON 14
Design Variants
Different variants corresponds to • different algorithms of the same family described in a single
submission to the NIST LWC standardization process • different parameter sets, such as sizes of keys, nonces, tags, etc. • support for AEAD vs. AEAD+Hash • different hardware architectures, e.g., basic iterative, folded,
unrolled, pipelined, etc.
72 variants Minimum: 1, Maximum: 12, Average: 2.7
per hardware design submission 15
Benchmarking Flow – Part 1
1. Functional verification using GMU Known Answer Tests (KATs) not known to the designers in advance
2. Timing measurements a. 16 bytes, 64 bytes, 1536 bytes, N input blocks, N + d input blocks,
with N = 4 and d = 1 or 2 b. AD: AD only
PT: Plaintext/Ciphertext only AD+PT: equal-size AD and Plaintext/Ciphertext
16
Benchmarking Flow – Part 2
3. Synthesis, Implementation, and Optimization of Tool Options Xilinx:
Xilinx Vivado 2020.1 (lin64) Minerva
Intel: Intel Quartus Prime Lite Edition Design Software, ver. 20.1 ATHENa
Lattice Semiconductor: Lattice Diamond Software v3.11 SP2
Synthesis: Lattice Synthesis Engine (LSE) or Synplify Pro Xeda (new)
All results reported after placing & routing 17
Benchmarking Flow – Part 3
4. Calculation of Throughputs
5. Generation of graphs and tables
6. Analysis of results
18
Throughput vs. Area for Long Inputs Artix-7 FPGA: Auth Encryption, Plaintext only
�
�
�
� � �
���
�
�
�
�
�
� � �
����
�
�
�
�
�
� �
6XEWHUUDQHDQ�Y�
;RRG\DNB;7�Y�
$VFRQB*UD]�Y�
*LPOL�*7�Y�
.127�Y�
&20(7B97�Y�
'U\*$6&21�Y�
6SRRN�Y��Y�
7LQ\-$0%8B7-7�Y�
5RPXOXV�Y�
6DWXUQLQ�Y�
*,)7�&2)%�Y�
6&+:$(00�Y�
3+2721�%HHWOH�Y�
(OHSKDQW�Y�
,6$3�Y�
(67$7(�Y�
3\MDPDVN�Y�
2ULEDWLGD�Y�
:$*(�Y�
6SR&�Y�
/2&86�Y�
37�7KURX
JKSX
W�>0
ELWV�V@
���� ���� ���� ���� ����
$UHD�>/87V@ 19
Throughput vs. Area for Long Inputs Artix-7 FPGA: Auth Encryption, AD only
�
� �
���
�
�
�
�
�
�
� �
����
�
�
�
�
�
�
�
6XEWHUUDQHDQ�Y�
;RRG\DNB;7�Y�
7LQ\-$0%8B7-7�Y�
$VFRQB*UD]�Y�
&20(7B97�Y�
*LPOL�*7�Y�
.127�Y�
6DWXUQLQ�Y�
5RPXOXV�Y�
'U\*$6&21�Y�
(OHSKDQW�Y�
6SRRN�Y��Y�
6&+:$(00�Y�
3+2721�%HHWOH�Y�
,6$3�Y�
*,)7�&2)%�Y�
(67$7(�Y�
2ULEDWLGD�Y�
3\MDPDVN�Y�
/2&86�Y�
:$*(�Y�
6SR&�Y�
$'�7KURX
JKSX
W�>0
ELWV�V@
���� ���� ���� ���� ����
$UHD�>/87V@ 20
Throughput vs. Area for Long Inputs Artix-7 FPGA: Hashing
���
�
�
�
�
�
�
�
�
����
�
*LPOLB*7�Y�
;RRG\DNB;7�Y�
'U\*$6&21�Y�
6DWXUQLQ�Y�
$VFRQB*UD]�Y�
6XEWHUUDQHDQ�Y�
6&+:$(00�Y�
3+2721�%HHWOH�Y�
+DVK�7K
URXJ
KSXW�>0ELWV�V@
���� ���� ���� ���� ���� ���� ����
$UHD�>/87V@ 21
Design Space Exploration: Ascon Artix-7 FPGA: Auth Encryption, Plaintext only
�
�
�
�
�
�
�
�
����
�
�
�
�
�
�
6XEWHUUDQHDQ�Y�
$VFRQB*UD]�Y�
$VFRQB*UD]�Y�
$VFRQB97�Y�
$VFRQB97�Y�
7LQ\-$0%8B7-7�Y�
6&+:$(00�Y�
37�7KURX
JKSX
W�>0
ELWV�V@
Ascon-128+Ascon-Hash
Ascon-128a+Ascon-Hash
Ascon-128+Ascon-Hash
Ascon-128
Subterranean 2.0
TinyJAMBU
SCHWAEMM
���� ���� ���� ���� ����
$UHD�>/87V@ 22
23
Design Space Exploration: KNOTArtix-7 FPGA: Auth Encryption, Plaintext only
���� ���� ���� ���� ����
�
�
�
�
�
�
�
�����
�
�
�
�
�
�
6XEWHUUDQHDQ�Y�
.127�Y�
7LQ\-$0%8B7-7�Y�
6&+:$(00�Y�
.127�Y�
.127�Y�
.127�Y�
$UHD�>/87V@
37�7KURX
JKSX
W�>0
ELWV�V@
Subterranean 2.0
TinyJAMBU SCHWAEMM
v2=(128, 384, 192)
v1=(128, 256, 64)
v3=(192, 384, 96)
v4=(256, 512, 128)
KNOT (key length, state size, bit rate)bit rate
state size
24
Design Space Exploration: RomulusArtix-7 FPGA: Auth Encryption, Plaintext only
���� ���� ���� ���� ����
�
�
�
�����
���
�
�
�
�����
����
�
�
�
�����
6XEWHUUDQHDQ�Y�
7LQ\-$0%8B7-7�Y�
5RPXOXV�Y�
5RPXOXV�Y�
6&+:$(00�Y�
5RPXOXV�Y�
5RPXOXV�Y�
5RPXOXV�Y�
$UHD�>/87V@
37�7KURX
JKSX
W�>0
ELWV�V@
Round-based
Subterranean 2.0
TinyJAMBU
SCHWAEMM
Two-round Four-roundEight-round
Low-area
Dependence of Rankings on the Input SizeArtix-7 FPGA: Auth Encryption, Plaintext only
Position Long 1536 B 64 B 16 B
1 Subterranean-v1 Subterranean-v1 Subterranean-v1 COMET_VT-v1
2 Xoodyak_XT-v8 Xoodyak_XT-v8 Ascon_Graz-v2 Subterranean-v1
3 Ascon_Graz-v2 Ascon_Graz-v2 Xoodyak_XT-v8 Ascon_Graz-v2
4 Gimli_GT-v2 Gimli_GT-v2 COMET_VT-v1 DryGASCON-v1
5 KNOT-v2 KNOT-v2 DryGASCON-v1 Xoodyak_XT-v8
6 COMET_VT-v1 COMET_VT-v1 TinyJAMBU_TJT-v3 TinyJAMBU_TJT-v3
7 DryGASCON-v1 DryGASCON-v1 KNOT-v2 Romulus-v2
8 Spook-v2-v1 Spook-v2-v1 Gimli_GT-v2 PHOTON-Beetle-v1
9 TinyJAMBU_TJT-v3 TinyJAMBU_TJT-v3 Romulus-v2 KNOT-v2
10 Romulus-v2 Romulus-v2 Spook-v2-v1 Gimli_GT-v2
11 Saturnin-v2 Saturnin-v2 PHOTON-Beetle-v1 Elephant-v2
25Color code: Higher position for smaller messages Lower position for smaller messages
Dependence of Rankings on the Input SizeArtix-7 FPGA: Auth Encryption, Plaintext only
Position Long 1536 B 64 B 16 B
12 GIFT-COFB-v1 GIFT-COFB-v1 GIFT-COFB-v1 GIFT-COFB-v1
13 SCHWAEMM-v1 SCHWAEMM-v1 Elephant-v2 ESTATE-v1
14 PHOTON-Beetle-v1 PHOTON-Beetle-v1 SCHWAEMM-v1 Spook-v2-v1
15 Elephant-v2 Elephant-v2 Saturnin-v2 SCHWAEMM-v1
16 ISAP-v2 ISAP-v2 ESTATE-v1 Oribatida-v1
17 ESTATE-v1 ESTATE-v1 Oribatida-v1 Saturnin-v2
18 Pyjamask-v2 Pyjamask-v2 ISAP-v2 LOCUS-v1
19 Oribatida-v1 Oribatida-v1 Pyjamask-v2 SpoC-v1
20 WAGE-v1 WAGE-v1 SpoC-v1 ISAP-v2
21 SpoC-v1 SpoC-v1 LOCUS-v1 Pyjamask-v2
22 LOCUS-v1 LOCUS-v1 WAGE-v1 WAGE-v1
26Color code: Higher position for smaller messages Lower position for smaller messages
Tentative Conclusions
27
Highest Throughput for #LUTs < 2500
Plaintext Only
1. Subterranean 2.02. Xoodyak3. Ascon4. Gimli5. KNOT6. DryGASCON7. Spook
~6 Gbit/s
2-3 Gbit/s
1-2 Gbit/s
AD Only
1. Subterranean 2.02. Xoodyak3. Ascon4. TinyJAMBU5. Gimli6. KNOT7. Saturnin8. Romulus9. DryGASCON10. Elephant11. Spook
~6 Gbit/s3-4 Gbit/s
1-2 Gbit/s
2-3 Gbit/s
Future Work in Round 2Phase 3:Nov. 9, 2020: 3rd submission deadlineNov. 30, 2020: Final version of the report
28
• Hopefully 85%-100% of candidates!
• Improved designs• More design space explorations• A new tool supporting the derivation of formulas for the execution times• Evaluation in terms of Power consumption and Energy per bit
Living Report from Round 2
29
Cryptology ePrint Archive: Report 2020/1207
https://cryptography.gmu.edu/athena
Released on GMU website: Sep. 26, 2020Posted on ePrint: Oct. 2, 2020
Phase 2 Report available tomorrow, October 21, 2020Regular updates
Changelog at the end of the document
Q & A
~90 pages, ~25 graphs, ~60 tables