Symmetric Cryptography in Hardware · 2013-06-17 · Tim Güneysu Hardware Security Group Horst...


Tim Güneysu

Hardware Security Group

Horst Görtz Institute for IT-Security 02/07/2013

Symmetric Cryptography in Hardware

Summer School on Design and Security of Cryptographic Functions, Algorithms and Devices

16.06.2013, Hardware Security Group (Arbeitsgruppe Sichere Hardware), Ruhr-Universität Bochum

Agenda

• Introduction

• Objectives and Design Principles

• Identifying Cryptographic Building Blocks

• Case Study: Advanced Encryption Standard

• Lessons Learned


Why Cryptography in Hardware?

• Throughput/Performance

• Energy-Efficiency

• Cost

• Physical Security


Cryptographic Hardware Examples

• SmartCards (e.g., PayTV)

• RFID/NFC tags (identification/authentication)

• Accelerators (AES-NI, HDD encryption)

• Hardware Security and Trusted Platform Modules (HSM/TPM)


Hardware Implementation

• Application Specific Integrated Circuits (ASIC)

– Static circuit with fixed routes and gates

– Dedicated application-specific layout

– Expensive development but cheap per unit

• Field Programmable Gate Arrays (FPGA)

– Fabric of programmable logic and routing

– Different fabric sizes for applications

– Simple development but higher costs per unit


Dimensions in Hardware Design

[Figure: Gajski diagram (1983)]

What about security?

Objectives for Hardware Implementation

• Hardware circuits require design goals:

• AREA, DELAY, POWER, THROUGHPUT, ENERGY

Hardware Design for Minimum Area

• Targets: RFIDs, crypto cores that run only once in a while

• Strategies:

– Serialize algorithm

– Minimize storage elements

– Reuse resources

[Figure: register and logic-block usage over time t, and the resulting area]

General Design Strategies for Power

• Targets: RFIDs, passively powered devices

• Strategies:

– Serialize algorithm

– Reduce data path

[Figure: active logic and input over time t for a 48-bit versus a 16-bit data path]

General Design Strategies for Throughput

• Targets: Accelerators, bulk data applications

• Strategies:

– Parallelize subfunctions/precomputation tables

– Minimize critical path/maximize frequency

– Unroll iterated structures

– Use pipelining

[Figure: active logic over time t with critical path tmax; data path unrolled 3x; precomputation TABLE]

General Design Strategies for Energy

• Target: Battery-powered devices, mobile sensors

• Strategies:

– Minimize clock cycles

– Unroll iterated structures

[Figure: unrolled structure (3x), 4 cycles versus 2 cycles]
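As a first-order reading of the two strategies above (a simplifying assumption, not a statement taken from the slides), the energy per processed block is the average power multiplied by the processing time:

\[
E = P_{\text{avg}} \cdot t = P_{\text{avg}} \cdot \frac{N_{\text{cycles}}}{f_{\text{clk}}}
\]

Here $P_{\text{avg}}$, $N_{\text{cycles}}$ and $f_{\text{clk}}$ are illustrative symbols for average power, clock cycles per block and clock frequency. Under this model, fewer cycles per block reduce $E$ only to the extent that the additional active logic does not raise $P_{\text{avg}}$ by the same factor.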

Impact of Hardware Objectives on Block Cipher Implementations

• Example: Low-power ASIC 65nm CMOS

• Metrics: area, delay, power, energy, throughput

• Unrolling parameter: Nr rounds per cycle

Impact on Area

• Naive expectation: 2x rounds → 2x area

• In fact: state registers determine circuit growth

Impact on Circuit Delay

• Naive expectation: 2x rounds → 2x delay

• In fact: critical path is not always in the round function

Impact on Energy

• Observation: energy is rather stable


Data Encryption Standard (DES)

DES is known as a hardware-optimal block cipher.

DES is still used today with triple encryption:

TDES(x) = enc_k1(dec_k2(enc_k3(x)))

Data Encryption Standard – Round Function

[Figure: DES f-function and round function]

Inside the Round Function: Data Expansion

In HW: WIRES

Inside the Round Function: S-Box

• Eight substitution tables.

• 6 bits of input, 4 bits of output.

• Non-linear and resistant to differential cryptanalysis.

In HW: 6-to-4 bit ROM

In HW: XOR-Gate (key addition before the S-Boxes)

Inside the Round Function: Permutation

– Bitwise permutation to introduce diffusion.

– Output bits of one S-Box affect several S-Boxes in the next round

In HW: WIRES

The Key Schedule of DES

• Split key into 28-bit halves C0 and D0.

• In rounds i = 1, 2, 9, 16, the two halves are each rotated left by one bit.

• In all other rounds, the two halves are each rotated left by two bits.

• In each round i, permuted choice PC-2 selects a permuted subset of 48 bits of Ci and Di as round key ki, i.e., each ki is a permuted selection of bits of k!

In HW: WIRES
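The rotation rule of the key schedule can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the names `SHIFTS`, `rotl28` and `half_key_states` are made up here, and PC-1 and PC-2 are left out because they are pure wiring. The sketch also shows that the rotation amounts add up to 28, so both halves return to C0 and D0 after round 16.

```python
# Rotation amount per round: one bit in rounds 1, 2, 9 and 16, two bits otherwise.
SHIFTS = [1 if i in (1, 2, 9, 16) else 2 for i in range(1, 17)]


def rotl28(value, amount):
    """Rotate a 28-bit value left by `amount` bit positions."""
    mask = (1 << 28) - 1
    return ((value << amount) | (value >> (28 - amount))) & mask


def half_key_states(c0, d0):
    """Return the list of (C_i, D_i) pairs for rounds i = 1..16."""
    states = []
    c, d = c0, d0
    for amount in SHIFTS:
        c, d = rotl28(c, amount), rotl28(d, amount)
        states.append((c, d))
    return states


# Arbitrary 28-bit example halves (hypothetical values, chosen for illustration).
C0, D0 = 0x0F0F0F0, 0x5555555
STATES = half_key_states(C0, D0)
```

Because every rotation amount is a constant, a fully unrolled hardware design needs no shifter at all: each round key is a fixed selection of key wires.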

Summary: DES Implementation in Hardware

• Implementation of a DES round consists of

– Wires for all permutations, expansion and selection

– 2-input XOR gate per bit for key addition

– 6-to-4 (i.e. 64x4=256 bit) ROM for one S-box

– 2-input XOR gate per bit to mix left and right half

[Figure: round function for 6 input bits: 6x 2-input XOR (6 bits of R_i with 6 bits of Key_i), 6-to-4 S-Box, 4x 2-input XOR with 4 bits of L_i, 4x flip-flop holding 4 bits of R_i+1]
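The 6-bit slice of the round function can be mimicked in software as a 64-entry ROM plus XORs. The sketch below uses the published DES S-box S1; the function names `build_rom` and `round_slice` are illustrative, and the permutation P is not modelled because it is only wiring (the caller is assumed to pass the four bits of the left half that P routes to this S-box).

```python
# DES S-box S1 as 4 rows x 16 columns (row = outer two input bits, column = inner four).
S1_ROWS = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def build_rom(rows):
    """Flatten the row/column table into a 64 x 4 bit ROM addressed by the 6 input bits."""
    rom = [0] * 64
    for addr in range(64):
        row = ((addr >> 5) << 1) | (addr & 1)  # first and last input bit
        col = (addr >> 1) & 0xF                # middle four input bits
        rom[addr] = rows[row][col]
    return rom


S1_ROM = build_rom(S1_ROWS)


def round_slice(expanded_r6, key6, left4):
    """Six key XORs, one ROM lookup, four XORs with bits of the left half."""
    address = (expanded_r6 ^ key6) & 0x3F
    return S1_ROM[address] ^ (left4 & 0xF)
```

For example, the input 100101 selects row 3 and column 2 of S1.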

Summary: Hardware-Friendly Cryptographic Building Blocks

• If you design a hardware-friendly cipher, choose:

– Static permutations → wires

– Static rotations and shifts → wires

– S-Box → Read-Only-Memory (ROM)

– Static key → wires tied to GND or VCC

– Dynamic key store (1-bit) → flip-flop/SRAM (≥6 transistors)

– Key Addition (1-bit) → XOR gate (≥6 transistors)

– Boolean operations (1-bit) → (N)AND/(N)OR gate (≥4 transistors)


Recall: Advanced Encryption Standard

• AES was designed to be efficient in hardware and software

• Each AES round consists of several layers, processing all 128 input bits

• The layers in each round are:

– Byte Substitution

– Diffusion Layer (ShiftRows, MixColumns)

– Key Addition

AES State Representation

• AES is a byte-oriented cipher (8-bit)

• An AES round operates on a state matrix of 16 bytes A0-A15

• Output state E0-E15 is input to the next round

Input state:

A0 A4 A8 A12

A1 A5 A9 A13

A2 A6 A10 A14

A3 A7 A11 A15

Round layers: ByteSub → ShiftRow → MixColumn → KeyAddition

Output state:

E0 E4 E8 E12

E1 E5 E9 E13

E2 E6 E10 E14

E3 E7 E11 E15

Detailed Round Structure of AES

• Round function operates on 16 state bytes A0-A15

• Note: In the last round, the MixColumn transformation is omitted

Byte Substitution Layer (S-Box)

• The S-Box is commonly realized as a lookup table

• The Byte Substitution layer consists of S-Boxes with the following properties:

• 16 identical, bijective S-Boxes

• the only nonlinear elements of AES, i.e., ByteSub(Ai) + ByteSub(Aj) ≠ ByteSub(Ai + Aj)

• Construction of S-Box in GF(2^8):

S[x] = AffineMap(x^{-1})
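The construction S[x] = AffineMap(x^{-1}) can be reproduced with a short Python sketch. It assumes the standard AES polynomial x^8 + x^4 + x^3 + x + 1 and the standard affine constant 0x63; the helper names are illustrative, and the inverse is computed naively as x^254 rather than in any hardware-oriented way.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a):
    """Return a^254, which equals a^-1 for a != 0 and maps 0 to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine_map(b):
    """AES affine map: XOR of the byte with four of its rotations and 0x63."""
    def rotl8(value, n):
        return ((value << n) | (value >> (8 - n))) & 0xFF

    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [affine_map(gf_inv(x)) for x in range(256)]
```

The resulting table is a permutation of the 256 byte values, which matches the "bijective" property listed above.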

Diffusion Layer (D-Box)

• Mixes state bytes to influence as many other state bytes as possible

• Consists of two sublayers:

– ShiftRows Sublayer: Permutation of the data on a byte level

– MixColumn Sublayer: Matrix operation which combines blocks of four bytes (based on MDS code)


ShiftRows Sublayer

Rows of the state matrix are shifted cyclically:

Input matrix:

B0 B4 B8 B12

B1 B5 B9 B13

B2 B6 B10 B14

B3 B7 B11 B15

Output matrix:

B0 B4 B8 B12 (no shift)

B5 B9 B13 B1 (← one position left shift)

B10 B14 B2 B6 (← two positions left shift)

B15 B3 B7 B11 (← three positions left shift)
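Since ShiftRows only moves bytes, it reduces to a fixed index permutation. The following sketch assumes the state is stored as a flat list of 16 bytes in the column-major order used above (index = row + 4 · column); the function name is illustrative.

```python
def shift_rows(state):
    """Rotate row r of the 4x4 column-major state left by r positions."""
    return [state[row + 4 * ((col + row) % 4)]
            for col in range(4)
            for row in range(4)]


# Using the byte indices themselves as state shows the permutation directly.
PERMUTATION = shift_rows(list(range(16)))
```

Read column by column, `PERMUTATION` lists which input byte ends up at each output position; in hardware this is nothing more than a renaming of wires.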

MixColumn Sublayer

• Linear transformation which mixes groups of 4 state bytes (32 bit)

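• Each 4-byte column is considered a vector and multiplied by a fixed 4x4 matrix, e.g.,

\[
\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\begin{pmatrix} B_0 \\ B_5 \\ B_{10} \\ B_{15} \end{pmatrix}
\]

where 01, 02 and 03 are given in hexadecimal notation

• All arithmetic is done in the Galois field GF(2^8)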
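A direct software rendering of this matrix product, as a minimal sketch: `gf_mul` is a generic shift-and-add multiplier in GF(2^8) with the AES polynomial, and `mix_column` is an illustrative name. A hardware design would not use a generic multiplier for the constants 02 and 03.

```python
MIX_MATRIX = [
    [0x02, 0x03, 0x01, 0x01],
    [0x01, 0x02, 0x03, 0x01],
    [0x01, 0x01, 0x02, 0x03],
    [0x03, 0x01, 0x01, 0x02],
]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def mix_column(column):
    """Multiply one 4-byte column by the fixed matrix; addition is XOR."""
    output = []
    for matrix_row in MIX_MATRIX:
        acc = 0
        for coefficient, byte in zip(matrix_row, column):
            acc ^= gf_mul(coefficient, byte)
        output.append(acc)
    return output


RESULT = mix_column([0xDB, 0x13, 0x53, 0x45])
```

The input column in the last line is a commonly used MixColumns check value.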

Key Addition Layer

• Input

– 16-byte state matrix C

– 16-byte subkey ki

• Output: C ⊕ ki (bit-wise XOR)

• A number of subkeys are generated in the key schedule (depending on number of rounds)

• Subkeys are added at the beginning and end of the cipher operations


3-D Depiction of AES Round

[Figure: 3-D depiction of the layers ByteSub, ShiftRow, MixColumn and KeyAddition]

Key Schedule of AES-128 (44 Subkeys W[i])

Round constants:

RC[1] = x^0 = (00000001)_2

RC[2] = x^1 = (00000010)_2

RC[3] = x^2 = (00000100)_2

...

RC[10] = x^9 = (00110110)_2
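The round constants are successive powers of x in GF(2^8), so they can be generated by repeated doubling with reduction. A minimal sketch, with `round_constants` as an illustrative name and the AES polynomial x^8 + x^4 + x^3 + x + 1 assumed:

```python
def round_constants(count=10):
    """Return RC[1..count], i.e. x^0 .. x^(count-1) reduced in GF(2^8)."""
    constants = []
    value = 0x01
    for _ in range(count):
        constants.append(value)
        value <<= 1                # multiply by x
        if value & 0x100:          # degree 8 reached: reduce by the AES polynomial
            value ^= 0x11B
    return constants


RC = round_constants()
```

The reduction first becomes visible at RC[9]: x^8 is replaced by x^4 + x^3 + x + 1.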

Efficient Implementation of AES in Hardware

Goal: Implement AES on FPGAs optimized for THROUGHPUT (1) and AREA (2)

Complexity of AES operations in hardware

1. Key addition (bit-wise XOR)

Straightforward using XOR Gates in HW or instruction in SW

2. ByteSub (S-Box)

Realized as memory table with 2^8 = 256 entries of 8 bits each (2 Kbit = 256 bytes)

3. ShiftRows

Mere re-ordering of bytes (static permutation)

Remaining operation: MixColumn?

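Efficient Implementation of AES: MixColumns

Recall: MixColumn = vector-matrix multiplication on bytes

\[
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\times
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
\]

Q: How to efficiently realize the constant multiplications 02 · b_i and 03 · b_i?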

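Efficient Implementation of AES: MixColumns

Remark: 02 = x and 03 = (x+1) in GF(2^8); arithmetic in GF(2^8) uses the irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1

For the term 02 · b_0 = x · b_0:

\[
\begin{aligned}
02 \cdot b_0 &= x \cdot b_0 \\
&= x \cdot (b_{0,7} x^7 + b_{0,6} x^6 + \dots + b_{0,1} x + b_{0,0}) \\
&= b_{0,7} x^8 + b_{0,6} x^7 + \dots + b_{0,1} x^2 + b_{0,0} x \\
&= b_{0,7} x^8 + b_{0,6} x^7 + \dots + b_{0,1} x^2 + b_{0,0} x + b_{0,7} \, m(x)
\end{aligned}
\]

The last step is valid because m(x) = 0 in GF(2^8); adding b_{0,7} m(x) cancels the x^8 term.

In HW: shift b_0 by one bit position toward the MSB (a left shift in the usual MSB-first notation) and add m(x) if the MSB was 1

For the term 03 · b_0 = (x+1) · b_0 = x · b_0 + b_0:

In HW: compute 02 · b_0 and add b_0 (XOR)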
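The two rules above translate into a shift, a conditional XOR and one further XOR. The sketch below is an illustration with made-up function names (`xtime`, `times3`); the constant 0x1B is the low byte of m(x) = 0x11B, since the x^8 term drops out of the 8-bit result.

```python
def xtime(b):
    """Multiply a byte by 02 (the polynomial x) in GF(2^8)."""
    shifted = (b << 1) & 0xFF   # shift toward the MSB, drop the x^8 coefficient
    if b & 0x80:                # MSB was set: add m(x) to cancel the x^8 term
        shifted ^= 0x1B
    return shifted


def times3(b):
    """Multiply a byte by 03 = x + 1: compute 02 * b and XOR b."""
    return xtime(b) ^ b
```

In hardware the conditional XOR costs only a handful of XOR gates per byte, because only the bit positions where m(x) has a coefficient of 1 are affected.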

Optimizing AES for Throughput (1): T-Tables

Precompute tables including all three layers (ByteSub, ShiftRow, MixColumn)

[Figure: input state A0-A15 passing through ByteSub, ShiftRow, MixColumn and KeyAddition to output state E0-E15]

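Optimizing AES for Throughput (1): T-Tables

¼ round in matrix notation:

\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\times
\begin{pmatrix} S(A_0) \\ S(A_5) \\ S(A_{10}) \\ S(A_{15}) \end{pmatrix}
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]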

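Optimizing AES for Throughput (1): T-Tables

Equation for a quarter round (32 bits; first column as example):

\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\times
\begin{pmatrix} S(A_0) \\ S(A_5) \\ S(A_{10}) \\ S(A_{15}) \end{pmatrix}
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]

Decomposition of matrix multiplication:

\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
S(A_0) \begin{pmatrix} 02 \\ 01 \\ 01 \\ 03 \end{pmatrix}
\oplus
S(A_5) \begin{pmatrix} 03 \\ 02 \\ 01 \\ 01 \end{pmatrix}
\oplus
S(A_{10}) \begin{pmatrix} 01 \\ 03 \\ 02 \\ 01 \end{pmatrix}
\oplus
S(A_{15}) \begin{pmatrix} 01 \\ 01 \\ 03 \\ 02 \end{pmatrix}
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]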

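Optimizing AES for Throughput (1): T-Tables

Each column of the decomposed matrix multiplication is merged with the S-Box into one table:

\[
T_0[a] = \begin{pmatrix} 02 \cdot S[a] \\ S[a] \\ S[a] \\ 03 \cdot S[a] \end{pmatrix}, \quad
T_1[a] = \begin{pmatrix} 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \\ S[a] \end{pmatrix}, \quad
T_2[a] = \begin{pmatrix} S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \end{pmatrix}, \quad
T_3[a] = \begin{pmatrix} S[a] \\ S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \end{pmatrix}
\]

New equation for quarter round:

\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15})
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]

each T-Box: 256 x 32 bit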

Optimizing AES for Throughput (1): T-Tables

¼ round: 4 TLU + key array + 4 XOR (32 bit wide)

1 round: 4 x 4 = 16 TLU

AES: 160 TLU / 1 block encryption

Memory: 4 T-Boxes, 1kB each (8192 bits)

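Full encryption function for ¼ round (32 bits) using T-Tables:

\[
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
=
T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15})
\oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
\]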
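The quarter-round equation can be checked against the layer-by-layer computation with a short sketch. Everything below is illustrative: the helper names are made up, each T-table entry is packed into a 32-bit word with E0 in the most significant byte (an assumed convention), and the S-box is generated naively instead of being read from a ROM.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def build_sbox():
    """S[x] = AffineMap(x^-1), with the inverse computed as x^254."""
    def rotl8(value, n):
        return ((value << n) | (value >> (8 - n))) & 0xFF

    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


def pack(b0, b1, b2, b3):
    """Pack four bytes into one 32-bit word, first byte most significant."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


SBOX = build_sbox()
# One table per matrix column: 256 entries of 32 bits each.
T0 = [pack(gf_mul(2, s), s, s, gf_mul(3, s)) for s in SBOX]
T1 = [pack(gf_mul(3, s), gf_mul(2, s), s, s) for s in SBOX]
T2 = [pack(s, gf_mul(3, s), gf_mul(2, s), s) for s in SBOX]
T3 = [pack(s, s, gf_mul(3, s), gf_mul(2, s)) for s in SBOX]


def quarter_round_ttable(a0, a5, a10, a15, key_word):
    """Four table lookups and four 32-bit XORs."""
    return T0[a0] ^ T1[a5] ^ T2[a10] ^ T3[a15] ^ key_word


def quarter_round_reference(a0, a5, a10, a15, key_word):
    """ByteSub, then MixColumn on the already shifted bytes, then key addition."""
    substituted = [SBOX[a0], SBOX[a5], SBOX[a10], SBOX[a15]]
    matrix = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
    out = [0, 0, 0, 0]
    for row in range(4):
        for col in range(4):
            out[row] ^= gf_mul(matrix[row][col], substituted[col])
    return pack(*out) ^ key_word
```

ShiftRows does not appear as an operation in either function: it is absorbed into the choice of input bytes A0, A5, A10 and A15.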

Optimizing AES for Throughput (1): T-Tables

Two T-Tables are stored in one 18 kBit dual-ported BRAM

Outputs of the BRAMs are XORed with each other and with the key

Fixed permutation (wires!) used to select correct input byte

Four instances with 4 XOR/2 BRAMs to compute columns 0-3 of AES state

[Figure: data path for columns 0 to 3: the 128-bit input passes fixed permutations π, 8-bit inputs address T0-T3 in the BRAMs, and the 32-bit table outputs are XORed with the 32-bit key words ki to ki+3]

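Optimizing AES for Area (2): Tower-Field Arithmetic for S-Boxes in GF((2^4)^2)

An S-Box realized as an 8-to-8 bit table (2 Kbit) is huge in area

Transformation T to GF((2^4)^2), inversion in the tower field, reverse transformation T^{-1} back to GF(2^8)

\[
S^{-1} = (s_h x + s_l)^{-1} = s_h \Delta x + (s_l + s_h) \Delta
\]

with

\[
\Delta = (s_h^2 \lambda + s_h s_l + s_l^2)^{-1} = (s_h^2 \lambda + s_l (s_h + s_l))^{-1}
\]

obtained from the EEA (extended Euclidean algorithm; see the AES book by Daemen/Rijmen)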
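The inversion formula can be sanity-checked numerically. The sketch below makes several assumptions that are not taken from the slides: GF(2^4) is built with the polynomial x^4 + x + 1, the extension uses x^2 + x + λ, and λ is simply the first constant for which that polynomial has no root in GF(2^4). The isomorphism T between GF(2^8) and the tower field is not modelled; all names are illustrative.

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return result


def gf16_inv(a):
    """Return a^14, which equals a^-1 for a != 0 and maps 0 to 0."""
    result = 1
    for _ in range(14):
        result = gf16_mul(result, a)
    return result


# Smallest lambda for which z^2 + z + lambda has no root, i.e. is irreducible.
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(z, z) ^ z ^ lam for z in range(16)))


def tower_mul(a, b):
    """Multiply (ah*x + al) by (bh*x + bl) modulo x^2 + x + LAMBDA."""
    (ah, al), (bh, bl) = a, b
    hh = gf16_mul(ah, bh)
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(al, bl)
    return high, low


def tower_inv(sh, sl):
    """Invert sh*x + sl using delta = (sh^2*lambda + sl*(sh + sl))^-1."""
    norm = gf16_mul(gf16_mul(sh, sh), LAMBDA) ^ gf16_mul(sl, sh ^ sl)
    delta = gf16_inv(norm)
    return gf16_mul(sh, delta), gf16_mul(sh ^ sl, delta)
```

Only one inversion in the small field GF(2^4) remains; everything else is 4-bit multiplication, squaring and XOR, which is what makes the approach attractive for area.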

Area Efficient Multipliers in GF(2^4) on FPGAs with 4-input LUTs

[Figure: multiplier in GF(2^4), composed of multipliers in GF(2) and a multiplication with the constant Phi in GF(2); mapped onto 4-input LUTs as XOR and MUL blocks with 3 levels of logic]

Mapping GF(2^4) Arithmetic on 4-input LUTs

[Figure: square unit in GF(2^4) and multiplication with lambda in GF(2^4), each mapped onto a single level of LUT-4 logic]

Implementing the S-Box in GF((2^4)^2)

Combining tower field transformation with affine map into a single matrix operation

[Figure: S-Box data path with the transformations T and T^{-1}]

Optimizing AES for Area (2): Tower-Field Arithmetic for S-Boxes in GF((2^4)^2)

[Figure: round function and final round, each with the transformations T, A^{-1}T, T^{-1} and T^{-1}A, separated by registers]

AES Results on Reconfigurable Hardware

• AES Performance on a large range of hardware (FPGA) devices

[Figure: results for the basic round, unrolled and T-table designs]


Lessons Learned

• Objectives for Hardware Implementation: Area, Power, Throughput, Energy and Security

• Objectives can be partially combined, often with non-linear scaling effects

• Optimally, ciphers use hardware-friendly building blocks such as static permutations and rotations

• (Large) S-Boxes are usually the most complex component for implementation in hardware
