Lessons Learned from Designing a 65nm ASIC for Third Round SHA-3 …sha3/kgf_slides.pdf · 2012. 3....
Transcript of Lessons Learned from Designing a 65nm ASIC for Third Round SHA-3 …sha3/kgf_slides.pdf · 2012. 3....
-
Lessons Learned fromDesigning a 65 nm ASIC
for Third Round SHA-3 Candidates
Frank K. Gürkaynak, Kris Gaj, Beat Muheim,Ekawat Homsirikamol, Christoph Keller, Marcin Rogawski,
Hubert Kaeslin, Jens-Peter Kaps
ETH Zurich - George Mason University
22-23 March 2012
-
Motivation
Present comparative ASIC performanceresults on all SHA-3 third round candidates
Assumptions
We make no claims about the cryptographic security
Authors’ recommendations for SHA-2-256 equivalent securityhave been followed.
Microelectronics Design Center 2 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Background
Timeline
earlier GMU releases ATHENa, a database for FPGA resultsETH publishes a study on 2nd round candidates
May 2011 Quo Vadis 2011 Wokshop in WarsawStart of collaboration
Jun 2011 Start of project
Aug 2011 Common interface, all cores (ETH Zurich-GMU)compatible
Oct 2011 Tape-out
Dec 2011 Production problem with I/O transistors
Feb 2012 Measured 5 ASICs from first batch
Microelectronics Design Center 3 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Two Groups, Two Different Approaches
Development was mostly independent.Groups did not compete for performance goals.
All ASIC development by ETH Zurich
George Mason University
Academic approach
Optimized for maximum:Throughput per Area
Taken VHDL codes fromextensive architectureevaluations for FPGAs
ETH Zurich
Quasi industrial approach
Specific throughput target:2.488 Gbit/s
Selected smallest design forthe throughput
Deliberately tried to increasearchitectural diversity
Microelectronics Design Center 4 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
One ASIC, Many Cores
A common I/O interface for all cores
LFSR based input assembles random input message
FinalBlock signal tells that current message block is last
Last message block is padded (fixed padding length)
All inputs applied parallel, 1088 bits for Keccak, 512 for others
Multiplexer selects 16-bits out of 256 output bits
Microelectronics Design Center 5 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
SHABZIGER: Our ASIC with all SHA-3 Candidates
TechologyUMCLL65nm
Supply1.2V VDD
Metallization8-Metal
Package56pin QFN56
Total Size1.825mm x1.825mm
Area Unit1 GE=1.44µm2
Microelectronics Design Center 6 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Main Problem
EDA tools are designed for industry requirements
Circuit has to function to specification even in worstconditions. Constraints are defined to ensure this.
Tools are not designed to see how much better(faster/smaller) a specific circuit can be made.
In general, Academia is interested in limits
Performance numbers are needed to show how we compareagainst others.
It is not easy to get fair numbers from Industrial tools.
Constraints are mis-used to squeeze out more performance.
Microelectronics Design Center 7 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
The Design Flow
RTL Description
(VHDL)
Synthesis
(Synopsys DC)
Place and Route
(Cadence EDI)
Synthesis
(Synopsys DC)Wireload Model
Constraints
Place and Route
(Cadence EDI)
Architecture
(GMU)
Specifications
Architecture
(ETH Zurich)
ASIC
(UMC65nm)
Accuracy of ResultsHigh Low
Microelectronics Design Center 8 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
The Verification Flow
LFSR
Padding Unit
RTL/Netlist
Formatter
NIST KAT
Check Results Generate TV
HP83000
Mentor Modelsim
manufactured ASIC
Random Input Stimuli
Simulated Response
Expected Response
Test Vectors
Simulation Result Measurement Result
Select Alg/ModeControl Control
Microelectronics Design Center 9 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Reporting Performance: Area
How much silicon area is used by the circuit
Area is reported in Gate Equivalents (GE).
For the UMC65 technology and the standard cell library used
1 GE=1.44µm2
Includes overhead for clock trees, scan chains, reset circuitry.
Area in Gate Equivalents is not very accurate
The rectangular area in which the circuit could be manufacturedwill be more than the specified area. There is additional:
Overhead for power
Overhead for routability
Overhead for signal integrity
These depend on the circuit and its operating conditions .
Microelectronics Design Center 10 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Reporting Performance: Time, Speed, Throughput
Finding the correct unit
Clock period [ns]All EDA tools constrain the speed by specifying the clockperiod. The main constraint for speed in a digital circuit.
Throughput [Gbit/s]
When comparing different architectures clock period is notrepresentative.Throughput tells us how much data is processed per unit time.In this work, long message hashing performance is used.
Time per data item [ns/bit]
Throughput is related to the clock period by 1Clock Period .For AT (Area-Time) plots, one axis must be time.The [cycles/byte] commonly used for software performanceis a similar time per data item unit.
Microelectronics Design Center 11 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
The AT plot
0
A
2A
3A
4A
5A
0 T 2T 3T 4T 5T 6T 7T 8T
Inc
rea
sin
g A
rea
Increasing Critical Path /
Decreasing Operating Frequency
Mor
e Ef
ficie
nt Im
plem
enta
tion
Overconstrained
for Speed =>
Too large
Overconstrained
for Area =>
Too slow
Efficient
Implementations
Microelectronics Design Center 12 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Synthesis Results
0
20
40
60
80
100
120
140
160
180
Are
a [
kG
ate
eq
]
gmu
10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s
Throughput [Gbit/s]
Gbit/s
Time per bit [ns/bit]
0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit
1000 kbit/s/gate500 kbit/s/gate
200 kbit/s/gate
100 kbit/s/gate
50 kbit/s/gate
25 kbit/s/gate
Faster
Smaller
More
Efficient
BLAKE
Grostl
JH
Keccak
Skein
SHA-2
Synthesis Run Results
with Different
Timing Constraints
Grostl
Skein
JH
SHA-2
Keccak
Microelectronics Design Center 13 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
The Story of Wireload Models
Wireload models reflect the routing overhead of the circuit
For modern technologies parasitic effect from routing are amajor contributor to overall delay.
For synthesis, wireload models approximate this contribution
The parasitic effects of a net are modeled to be proportional tothe number of connections on a net, and the size of the circuitA look-up table is consulted to obtain the values
All standard cell libraries have a default model.
Each circuit is different, will require a different wireload.
Once there is a placement and routing solution, the specificwireload for the circuit can be extracted.
Subsequent synthesis runs will be more accurate.
Microelectronics Design Center 14 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Synthesis Results with Extracted Wireload
0
20
40
60
80
100
120
140
160
180
Are
a [
kG
ate
eq
]
gmu
10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s
Throughput [Gbit/s]
Gbit/s
Time per bit [ns/bit]
0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit
1000 kbit/s/gate500 kbit/s/gate
100 kbit/s/gate
50 kbit/s/gate
25 kbit/s/gate
Faster
Smaller
More
Efficient
BLAKE
Grostl
JH
Keccak
Skein
SHA-2
200 kbit/s/gate
Synthesis Run
with Extracted Wireload
Results for Different
Timing Constraints
Ch
an
ge
in
Pe
rfo
rma
nce
Microelectronics Design Center 15 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Obtaining Postlayout Results
Using Multi-Mode Multi-Corner constraints
All blocks are synthetized separately.
During the backend, all circuits are placed and routed at once
For each circuit, a separate mode is defined (17 in total)
At any time, only one core is activeConstraints specified individually for each coreSoC Encounter is able to optimize for all modes simultaneously
Due to parasitic effects, constraints are relaxed for P&R.
Post layout results are slower, constraints relaxedAll circuits are optimized at the same timeThere is one set of constraintsBackend affects each circuit differentlyUsed several runs to find an acceptable solution
Microelectronics Design Center 16 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Postlayout Results
0
20
40
60
80
100
120
140
160
180
Are
a [
kG
ate
eq
]
gmu
10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s
Throughput [Gbit/s]
Gbit/s
Time per bit [ns/bit]
0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit
1000 kbit/s/gate500 kbit/s/gate
100 kbit/s/gate
50 kbit/s/gate
25 kbit/s/gate
Faster
Smaller
More
Efficient
BLAKE
Grostl
JH
Keccak
Skein
SHA-2
200 kbit/s/gate
Initial
SynthesisSynthesis
with
Extracted
Wireload
Final
Postlayout
Result
Microelectronics Design Center 17 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Postlayout Results: GMU and ETH Zurich
0
20
40
60
80
100
120
140
160
180
Are
a [
kG
ate
eq
]
gmu
10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s
Throughput [Gbit/s]
Gbit/s
Time per bit [ns/bit]
0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit
1000 kbit/s/gate500 kbit/s/gate
100 kbit/s/gate
50 kbit/s/gate
25 kbit/s/gate
Faster
Smaller
More
Efficient
BLAKE
Grostl
JH
Keccak
Skein
SHA-2
2.4
88
Gb
it/s
200 kbit/s/gate
ethz
Microelectronics Design Center 18 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Reporting Performance: Power, Energy
We are interested in Energy, not Power
Energy/bit = PowerThroughput [pJ/bit]Energy per data item is a good indicator of implementationefficiency.
Measurement Results
The dummy mode is used to determine the overhead of theI/O circuitry (and static power consumption).
The dynamic power is measured for different clock frequencies.
The I/O (dummy) power is subtracted from the measurement.
Power is determined as a function of the clock rate. [mW/MHz]
Energy per bit is derived from this number.
All numbers for nominal VDD = 1.2V.
Microelectronics Design Center 19 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Normalized Energy/bit, Measurement vs Estimation
0
1
2
3
4
5
6
7
8
SkeinKeccakJHGroestlBLAKESHA-2
En
erg
y/B
it n
orm
aliz
ed
to
GM
U S
HA
-2
Algorithms
Normalized Energy/Bit for ALL SHA-3 Candidates
GMU
ETHZ
Measurement ResultsAverage of 5 ASICs
VDD=1.2VNumbers in pJ/bit
3.9
8 5.0
5
10
.16
20
.67 2
3.1
5
27
.38
11
.20
9.8
5
6.2
8
4.9
8
16
.02
28
.42
Microelectronics Design Center 20 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Throughput/Area, Measurement vs Estimation
0
0.5
1
1.5
2
2.5
3
SkeinKeccakJHGroestlBLAKESHA-2
Th
rou
gh
pu
t/A
rea
no
rma
lize
d t
o G
MU
SH
A-2
Algorithms
Normalized Throughput/Area of ALL SHA-3 Candidates
GMU
ETHZ
21
5
17
4
16
7
85
86
41
15
4
13
9
39
4
68
5
12
1
46
Measurement ResultsAverage of 5 ASICs
VDD=1.2VNumbers in kbits/GE
Microelectronics Design Center 21 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Concluding Remarks (I)
SHA-2
Very efficient in hardware
By far the smallest
Algorithm has been around longer,perhaps reason for more optimized implementations
BLAKE
Compact, easy to implement
Allows good scalability
Not the fastest
Microelectronics Design Center 22 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Concluding Remarks (II)
Grøstl
Best scalability (Speed/Area tradeoff)
Low throughput per area
Cumbersome for hardware
JH
Consistently ranks in the middle
So far, unable to find good scaling options
All modes use identical hardware
Microelectronics Design Center 23 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Concluding Remarks (III)
Keccak
Hands down fastest algorithm
Large block size, and small latency key to speed
Not very good Area/Speed trade-off
Skein
Low throughput per area
Interesting hardware trade-offs due to adder
Longer combinational delay per clock cycle,perhaps reason for better match between expectation andmeasurement.
Microelectronics Design Center 24 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Lessons Learned
Synthesis results can be far from actual performanceDifferences of more than 50% possible.
Measurement on ASIC is the proof of implementationActual design will also suffer from practical constraints
Industrial EDA tools are not for best performanceIndustrial tools are based on fulfilling constraints. Academicresearch tries to misuse these constraints to find the bestperformance. It is not a very good approach, lacks system.
Different implementations should be comparedThe complex EDA chain makes it very difficult to determineperformance with absolute certainty. Seeminglyinconsequential differences can have profound effects.
Microelectronics Design Center 25 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Thank you...
Microelectronics Design Center 26 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Additional Material
All sources and scripts:http://www.iis.ee.ethz.ch/~sha3
Microelectronics Design Center 27 / 31 Department of Information Technologyand Electrical Engineering Zurich
http://www.iis.ee.ethz.ch/~sha3
-
Post Layout Results: Speed, Typical Case
Alg. Block Size Impl. Area (FFs) Max. Clk Tput TpA[bits] [kGE] [MHz] [Gbit/s] [kbit/s·GE]
SHA-2 512ETHZ 24.30 (29%) 516.00 3.943 162.255
GMU 25.14 (35%) 870.32 6.855 272.691
BLAKE 512ETHZ 39.96 (26%) 344.12 3.091 77.347
GMU 43.02 (34%) 436.30 7.703 179.039
Grøstl 512ETHZ 69.39 (17%) 460.83 2.913 41.977
GMU 160.28 (9%) 757.58 18.470 115.239
JH 512ETHZ 46.79 (27%) 558.97 6.814 145.626
GMU 54.35 (31%) 947.87 11.286 207.655
Keccak 1088ETHZ 46.31 (25%) 786.16 35.639 769.550
GMU 80.65 (19%) 920.81 41.743 517.587
Skein 512ETHZ 71.87 (19%) 564.33 3.141 43.697
GMU 71.90 (22%) 312.11 8.411 116.977
Microelectronics Design Center 28 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Measurement Results: Speed, Average of 5 ASICs
Alg. Block Size Impl. Area (FFs) Max. Clk Tput TpA[bits] [kGE] [MHz] [Gbit/s] [kbit/s·GE]
SHA-2 512ETHZ 24.30 (29%) 552.79 4.224 173.826
GMU 25.14 (35%) 685.40 5.399 214.751
BLAKE 512ETHZ 39.96 (26%) 377.93 3.395 84.947
GMU 43.02 (34%) 405.84 7.165 166.541
Grøstl 512ETHZ 69.39 (17%) 445.63 2.817 40.593
GMU 160.28 (9%) 563.70 13.743 85.747
JH 512ETHZ 46.79 (27%) 532.48 6.491 138.725
GMU 54.35 (31%) 704.72 8.391 154.387
Keccak 1088ETHZ 46.31 (25%) 700.28 31.746 685.482
GMU 80.65 (19%) 701.75 31.813 394.456
Skein 512ETHZ 71.87 (19%) 588.24 3.274 45.548
GMU 71.90 (22%) 323.21 8.710 121.036
Microelectronics Design Center 29 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Post Layout Results: Power @2.488 Gb/s, Typical
Algorithm Block Size Imp. Latency Clk Freq. Power Energy/bit[bits] [cycles] [MHz] [mW] [pJ/bit]
SHA-2 512ETHZ 67 324 11.86 4.76
GMU 65 316 9.16 3.68
BLAKE 512ETHZ 57 276 34.80 13.99
GMU 29 140 16.47 6.62
Grøstl 512ETHZ 81 392 50.50 20.30
GMU 21 102 46.01 18.49
JH 512ETHZ 42 204 16.54 6.67
GMU 43 209 17.80 7.15
Keccak 1088ETHZ 24 54 8.16 3.28
GMU 24 54 9.98 4.01
Skein 512ETHZ 92 446 50.00 20.10
GMU 19 92 26.19 10.53
Microelectronics Design Center 30 / 31 Department of Information Technologyand Electrical Engineering Zurich
-
Measurement Results: Power @2.488 Gb/s - 1.2V
Algorithm Block Size Imp. Latency Clk Freq. Power Energy/bit[bits] [cycles] [MHz] [mW] [pJ/bit]
SHA-2 512ETHZ 67 324 12.57 5.05
GMU 65 316 9.90 3.98
BLAKE 512ETHZ 57 276 51.42 20.67
GMU 29 140 25.27 10.16
Grøstl 512ETHZ 81 392 68.12 27.38
GMU 21 102 57.59 23.15
JH 512ETHZ 42 204 24.51 9.85
GMU 43 209 27.89 11.20
Keccak 1088ETHZ 24 54 12.38 4.98
GMU 24 54 15.62 6.28
Skein 512ETHZ 92 446 70.71 28.42
GMU 19 92 39.86 16.02
Microelectronics Design Center 31 / 31 Department of Information Technologyand Electrical Engineering Zurich
IntroductionMotivationBackground
DesignI/O StructureShabzigerGoalsDesign FlowVerification Flow
PerformanceAreaTimeAT plotsSynthesis ResultsSynthesis Results with Extracted WireloadPostlayout ResultsPower/Energy
Normalized Measurement ResultsMeasurement Results: Energy per bitMeasurement Results: Throughput per Area
Concluding RemarksAdditional MaterialPost Layout Results: SpeedMeasurement Results: SpeedPost Layout Results: Power Measurement Results: Power