Lessons Learned from Designing a 65nm ASIC for Third Round SHA-3 …sha3/kgf_slides.pdf · 2012. 3....

31
Lessons Learned from Designing a 65 nm ASIC for Third Round SHA-3 Candidates Frank K. G¨ urkaynak, Kris Gaj, Beat Muheim, Ekawat Homsirikamol, Christoph Keller, Marcin Rogawski, Hubert Kaeslin, Jens-Peter Kaps ETH Zurich - George Mason University 22-23 March 2012

Transcript of Lessons Learned from Designing a 65nm ASIC for Third Round SHA-3 …sha3/kgf_slides.pdf · 2012. 3....

  • Lessons Learned fromDesigning a 65 nm ASIC

    for Third Round SHA-3 Candidates

    Frank K. Gürkaynak, Kris Gaj, Beat Muheim,Ekawat Homsirikamol, Christoph Keller, Marcin Rogawski,

    Hubert Kaeslin, Jens-Peter Kaps

    ETH Zurich - George Mason University

    22-23 March 2012

  • Motivation

    Present comparative ASIC performanceresults on all SHA-3 third round candidates

    Assumptions

    We make no claims about the cryptographic security

    Authors’ recommendations for SHA-2-256 equivalent securityhave been followed.

    Microelectronics Design Center 2 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Background

    Timeline

    earlier GMU releases ATHENa, a database for FPGA resultsETH publishes a study on 2nd round candidates

    May 2011 Quo Vadis 2011 Wokshop in WarsawStart of collaboration

    Jun 2011 Start of project

    Aug 2011 Common interface, all cores (ETH Zurich-GMU)compatible

    Oct 2011 Tape-out

    Dec 2011 Production problem with I/O transistors

    Feb 2012 Measured 5 ASICs from first batch

    Microelectronics Design Center 3 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Two Groups, Two Different Approaches

    Development was mostly independent.Groups did not compete for performance goals.

    All ASIC development by ETH Zurich

    George Mason University

    Academic approach

    Optimized for maximum:Throughput per Area

    Taken VHDL codes fromextensive architectureevaluations for FPGAs

    ETH Zurich

    Quasi industrial approach

    Specific throughput target:2.488 Gbit/s

    Selected smallest design forthe throughput

    Deliberately tried to increasearchitectural diversity

    Microelectronics Design Center 4 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • One ASIC, Many Cores

    A common I/O interface for all cores

    LFSR based input assembles random input message

    FinalBlock signal tells that current message block is last

    Last message block is padded (fixed padding length)

    All inputs applied parallel, 1088 bits for Keccak, 512 for others

    Multiplexer selects 16-bits out of 256 output bits

    Microelectronics Design Center 5 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • SHABZIGER: Our ASIC with all SHA-3 Candidates

    TechologyUMCLL65nm

    Supply1.2V VDD

    Metallization8-Metal

    Package56pin QFN56

    Total Size1.825mm x1.825mm

    Area Unit1 GE=1.44µm2

    Microelectronics Design Center 6 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Main Problem

    EDA tools are designed for industry requirements

    Circuit has to function to specification even in worstconditions. Constraints are defined to ensure this.

    Tools are not designed to see how much better(faster/smaller) a specific circuit can be made.

    In general, Academia is interested in limits

    Performance numbers are needed to show how we compareagainst others.

    It is not easy to get fair numbers from Industrial tools.

    Constraints are mis-used to squeeze out more performance.

    Microelectronics Design Center 7 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • The Design Flow

    RTL Description

    (VHDL)

    Synthesis

    (Synopsys DC)

    Place and Route

    (Cadence EDI)

    Synthesis

    (Synopsys DC)Wireload Model

    Constraints

    Place and Route

    (Cadence EDI)

    Architecture

    (GMU)

    Specifications

    Architecture

    (ETH Zurich)

    ASIC

    (UMC65nm)

    Accuracy of ResultsHigh Low

    Microelectronics Design Center 8 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • The Verification Flow

    LFSR

    Padding Unit

    RTL/Netlist

    Formatter

    NIST KAT

    Check Results Generate TV

    HP83000

    Mentor Modelsim

    manufactured ASIC

    Random Input Stimuli

    Simulated Response

    Expected Response

    Test Vectors

    Simulation Result Measurement Result

    Select Alg/ModeControl Control

    Microelectronics Design Center 9 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Reporting Performance: Area

    How much silicon area is used by the circuit

    Area is reported in Gate Equivalents (GE).

    For the UMC65 technology and the standard cell library used

    1 GE=1.44µm2

    Includes overhead for clock trees, scan chains, reset circuitry.

    Area in Gate Equivalents is not very accurate

    The rectangular area in which the circuit could be manufacturedwill be more than the specified area. There is additional:

    Overhead for power

    Overhead for routability

    Overhead for signal integrity

    These depend on the circuit and its operating conditions .

    Microelectronics Design Center 10 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Reporting Performance: Time, Speed, Throughput

    Finding the correct unit

    Clock period [ns]All EDA tools constrain the speed by specifying the clockperiod. The main constraint for speed in a digital circuit.

    Throughput [Gbit/s]

    When comparing different architectures clock period is notrepresentative.Throughput tells us how much data is processed per unit time.In this work, long message hashing performance is used.

    Time per data item [ns/bit]

    Throughput is related to the clock period by 1Clock Period .For AT (Area-Time) plots, one axis must be time.The [cycles/byte] commonly used for software performanceis a similar time per data item unit.

    Microelectronics Design Center 11 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • The AT plot

    0

    A

    2A

    3A

    4A

    5A

    0 T 2T 3T 4T 5T 6T 7T 8T

    Inc

    rea

    sin

    g A

    rea

    Increasing Critical Path /

    Decreasing Operating Frequency

    Mor

    e Ef

    ficie

    nt Im

    plem

    enta

    tion

    Overconstrained

    for Speed =>

    Too large

    Overconstrained

    for Area =>

    Too slow

    Efficient

    Implementations

    Microelectronics Design Center 12 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Synthesis Results

    0

    20

    40

    60

    80

    100

    120

    140

    160

    180

    Are

    a [

    kG

    ate

    eq

    ]

    gmu

    10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s

    Throughput [Gbit/s]

    Gbit/s

    Time per bit [ns/bit]

    0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit

    1000 kbit/s/gate500 kbit/s/gate

    200 kbit/s/gate

    100 kbit/s/gate

    50 kbit/s/gate

    25 kbit/s/gate

    Faster

    Smaller

    More

    Efficient

    BLAKE

    Grostl

    JH

    Keccak

    Skein

    SHA-2

    Synthesis Run Results

    with Different

    Timing Constraints

    Grostl

    Skein

    JH

    SHA-2

    Keccak

    Microelectronics Design Center 13 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • The Story of Wireload Models

    Wireload models reflect the routing overhead of the circuit

    For modern technologies parasitic effect from routing are amajor contributor to overall delay.

    For synthesis, wireload models approximate this contribution

    The parasitic effects of a net are modeled to be proportional tothe number of connections on a net, and the size of the circuitA look-up table is consulted to obtain the values

    All standard cell libraries have a default model.

    Each circuit is different, will require a different wireload.

    Once there is a placement and routing solution, the specificwireload for the circuit can be extracted.

    Subsequent synthesis runs will be more accurate.

    Microelectronics Design Center 14 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Synthesis Results with Extracted Wireload

    0

    20

    40

    60

    80

    100

    120

    140

    160

    180

    Are

    a [

    kG

    ate

    eq

    ]

    gmu

    10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s

    Throughput [Gbit/s]

    Gbit/s

    Time per bit [ns/bit]

    0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit

    1000 kbit/s/gate500 kbit/s/gate

    100 kbit/s/gate

    50 kbit/s/gate

    25 kbit/s/gate

    Faster

    Smaller

    More

    Efficient

    BLAKE

    Grostl

    JH

    Keccak

    Skein

    SHA-2

    200 kbit/s/gate

    Synthesis Run

    with Extracted Wireload

    Results for Different

    Timing Constraints

    Ch

    an

    ge

    in

    Pe

    rfo

    rma

    nce

    Microelectronics Design Center 15 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Obtaining Postlayout Results

    Using Multi-Mode Multi-Corner constraints

    All blocks are synthetized separately.

    During the backend, all circuits are placed and routed at once

    For each circuit, a separate mode is defined (17 in total)

    At any time, only one core is activeConstraints specified individually for each coreSoC Encounter is able to optimize for all modes simultaneously

    Due to parasitic effects, constraints are relaxed for P&R.

    Post layout results are slower, constraints relaxedAll circuits are optimized at the same timeThere is one set of constraintsBackend affects each circuit differentlyUsed several runs to find an acceptable solution

    Microelectronics Design Center 16 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Postlayout Results

    0

    20

    40

    60

    80

    100

    120

    140

    160

    180

    Are

    a [

    kG

    ate

    eq

    ]

    gmu

    10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s

    Throughput [Gbit/s]

    Gbit/s

    Time per bit [ns/bit]

    0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit

    1000 kbit/s/gate500 kbit/s/gate

    100 kbit/s/gate

    50 kbit/s/gate

    25 kbit/s/gate

    Faster

    Smaller

    More

    Efficient

    BLAKE

    Grostl

    JH

    Keccak

    Skein

    SHA-2

    200 kbit/s/gate

    Initial

    SynthesisSynthesis

    with

    Extracted

    Wireload

    Final

    Postlayout

    Result

    Microelectronics Design Center 17 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Postlayout Results: GMU and ETH Zurich

    0

    20

    40

    60

    80

    100

    120

    140

    160

    180

    Are

    a [

    kG

    ate

    eq

    ]

    gmu

    10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s

    Throughput [Gbit/s]

    Gbit/s

    Time per bit [ns/bit]

    0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit 0 ns/bit

    1000 kbit/s/gate500 kbit/s/gate

    100 kbit/s/gate

    50 kbit/s/gate

    25 kbit/s/gate

    Faster

    Smaller

    More

    Efficient

    BLAKE

    Grostl

    JH

    Keccak

    Skein

    SHA-2

    2.4

    88

    Gb

    it/s

    200 kbit/s/gate

    ethz

    Microelectronics Design Center 18 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Reporting Performance: Power, Energy

    We are interested in Energy, not Power

    Energy/bit = PowerThroughput [pJ/bit]Energy per data item is a good indicator of implementationefficiency.

    Measurement Results

    The dummy mode is used to determine the overhead of theI/O circuitry (and static power consumption).

    The dynamic power is measured for different clock frequencies.

    The I/O (dummy) power is subtracted from the measurement.

    Power is determined as a function of the clock rate. [mW/MHz]

    Energy per bit is derived from this number.

    All numbers for nominal VDD = 1.2V.

    Microelectronics Design Center 19 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Normalized Energy/bit, Measurement vs Estimation

    0

    1

    2

    3

    4

    5

    6

    7

    8

    SkeinKeccakJHGroestlBLAKESHA-2

    En

    erg

    y/B

    it n

    orm

    aliz

    ed

    to

    GM

    U S

    HA

    -2

    Algorithms

    Normalized Energy/Bit for ALL SHA-3 Candidates

    GMU

    ETHZ

    Measurement ResultsAverage of 5 ASICs

    VDD=1.2VNumbers in pJ/bit

    3.9

    8 5.0

    5

    10

    .16

    20

    .67 2

    3.1

    5

    27

    .38

    11

    .20

    9.8

    5

    6.2

    8

    4.9

    8

    16

    .02

    28

    .42

    Microelectronics Design Center 20 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Throughput/Area, Measurement vs Estimation

    0

    0.5

    1

    1.5

    2

    2.5

    3

    SkeinKeccakJHGroestlBLAKESHA-2

    Th

    rou

    gh

    pu

    t/A

    rea

    no

    rma

    lize

    d t

    o G

    MU

    SH

    A-2

    Algorithms

    Normalized Throughput/Area of ALL SHA-3 Candidates

    GMU

    ETHZ

    21

    5

    17

    4

    16

    7

    85

    86

    41

    15

    4

    13

    9

    39

    4

    68

    5

    12

    1

    46

    Measurement ResultsAverage of 5 ASICs

    VDD=1.2VNumbers in kbits/GE

    Microelectronics Design Center 21 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Concluding Remarks (I)

    SHA-2

    Very efficient in hardware

    By far the smallest

    Algorithm has been around longer,perhaps reason for more optimized implementations

    BLAKE

    Compact, easy to implement

    Allows good scalability

    Not the fastest

    Microelectronics Design Center 22 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Concluding Remarks (II)

    Grøstl

    Best scalability (Speed/Area tradeoff)

    Low throughput per area

    Cumbersome for hardware

    JH

    Consistently ranks in the middle

    So far, unable to find good scaling options

    All modes use identical hardware

    Microelectronics Design Center 23 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Concluding Remarks (III)

    Keccak

    Hands down fastest algorithm

    Large block size, and small latency key to speed

    Not very good Area/Speed trade-off

    Skein

    Low throughput per area

    Interesting hardware trade-offs due to adder

    Longer combinational delay per clock cycle,perhaps reason for better match between expectation andmeasurement.

    Microelectronics Design Center 24 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Lessons Learned

    Synthesis results can be far from actual performanceDifferences of more than 50% possible.

    Measurement on ASIC is the proof of implementationActual design will also suffer from practical constraints

    Industrial EDA tools are not for best performanceIndustrial tools are based on fulfilling constraints. Academicresearch tries to misuse these constraints to find the bestperformance. It is not a very good approach, lacks system.

    Different implementations should be comparedThe complex EDA chain makes it very difficult to determineperformance with absolute certainty. Seeminglyinconsequential differences can have profound effects.

    Microelectronics Design Center 25 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Thank you...

    Microelectronics Design Center 26 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Additional Material

    All sources and scripts:http://www.iis.ee.ethz.ch/~sha3

    Microelectronics Design Center 27 / 31 Department of Information Technologyand Electrical Engineering Zurich

    http://www.iis.ee.ethz.ch/~sha3

  • Post Layout Results: Speed, Typical Case

    Alg. Block Size Impl. Area (FFs) Max. Clk Tput TpA[bits] [kGE] [MHz] [Gbit/s] [kbit/s·GE]

    SHA-2 512ETHZ 24.30 (29%) 516.00 3.943 162.255

    GMU 25.14 (35%) 870.32 6.855 272.691

    BLAKE 512ETHZ 39.96 (26%) 344.12 3.091 77.347

    GMU 43.02 (34%) 436.30 7.703 179.039

    Grøstl 512ETHZ 69.39 (17%) 460.83 2.913 41.977

    GMU 160.28 (9%) 757.58 18.470 115.239

    JH 512ETHZ 46.79 (27%) 558.97 6.814 145.626

    GMU 54.35 (31%) 947.87 11.286 207.655

    Keccak 1088ETHZ 46.31 (25%) 786.16 35.639 769.550

    GMU 80.65 (19%) 920.81 41.743 517.587

    Skein 512ETHZ 71.87 (19%) 564.33 3.141 43.697

    GMU 71.90 (22%) 312.11 8.411 116.977

    Microelectronics Design Center 28 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Measurement Results: Speed, Average of 5 ASICs

    Alg. Block Size Impl. Area (FFs) Max. Clk Tput TpA[bits] [kGE] [MHz] [Gbit/s] [kbit/s·GE]

    SHA-2 512ETHZ 24.30 (29%) 552.79 4.224 173.826

    GMU 25.14 (35%) 685.40 5.399 214.751

    BLAKE 512ETHZ 39.96 (26%) 377.93 3.395 84.947

    GMU 43.02 (34%) 405.84 7.165 166.541

    Grøstl 512ETHZ 69.39 (17%) 445.63 2.817 40.593

    GMU 160.28 (9%) 563.70 13.743 85.747

    JH 512ETHZ 46.79 (27%) 532.48 6.491 138.725

    GMU 54.35 (31%) 704.72 8.391 154.387

    Keccak 1088ETHZ 46.31 (25%) 700.28 31.746 685.482

    GMU 80.65 (19%) 701.75 31.813 394.456

    Skein 512ETHZ 71.87 (19%) 588.24 3.274 45.548

    GMU 71.90 (22%) 323.21 8.710 121.036

    Microelectronics Design Center 29 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Post Layout Results: Power @2.488 Gb/s, Typical

    Algorithm Block Size Imp. Latency Clk Freq. Power Energy/bit[bits] [cycles] [MHz] [mW] [pJ/bit]

    SHA-2 512ETHZ 67 324 11.86 4.76

    GMU 65 316 9.16 3.68

    BLAKE 512ETHZ 57 276 34.80 13.99

    GMU 29 140 16.47 6.62

    Grøstl 512ETHZ 81 392 50.50 20.30

    GMU 21 102 46.01 18.49

    JH 512ETHZ 42 204 16.54 6.67

    GMU 43 209 17.80 7.15

    Keccak 1088ETHZ 24 54 8.16 3.28

    GMU 24 54 9.98 4.01

    Skein 512ETHZ 92 446 50.00 20.10

    GMU 19 92 26.19 10.53

    Microelectronics Design Center 30 / 31 Department of Information Technologyand Electrical Engineering Zurich

  • Measurement Results: Power @2.488 Gb/s - 1.2V

    Algorithm Block Size Imp. Latency Clk Freq. Power Energy/bit[bits] [cycles] [MHz] [mW] [pJ/bit]

    SHA-2 512ETHZ 67 324 12.57 5.05

    GMU 65 316 9.90 3.98

    BLAKE 512ETHZ 57 276 51.42 20.67

    GMU 29 140 25.27 10.16

    Grøstl 512ETHZ 81 392 68.12 27.38

    GMU 21 102 57.59 23.15

    JH 512ETHZ 42 204 24.51 9.85

    GMU 43 209 27.89 11.20

    Keccak 1088ETHZ 24 54 12.38 4.98

    GMU 24 54 15.62 6.28

    Skein 512ETHZ 92 446 70.71 28.42

    GMU 19 92 39.86 16.02

    Microelectronics Design Center 31 / 31 Department of Information Technologyand Electrical Engineering Zurich

    IntroductionMotivationBackground

    DesignI/O StructureShabzigerGoalsDesign FlowVerification Flow

    PerformanceAreaTimeAT plotsSynthesis ResultsSynthesis Results with Extracted WireloadPostlayout ResultsPower/Energy

    Normalized Measurement ResultsMeasurement Results: Energy per bitMeasurement Results: Throughput per Area

    Concluding RemarksAdditional MaterialPost Layout Results: SpeedMeasurement Results: SpeedPost Layout Results: Power Measurement Results: Power