Implementing NTRU on a GPU

Jens Hermans, Frederik Vercauteren, Bart Preneel

COSIC, K.U.Leuven

30 July 2009


1 Motivation
    Why NTRU?
    Why GPU?

2 NTRU Crash Course
    Polynomials
    Operations and Parameter Choices

3 Architecture
    Hardware
    Programming model

4 Implementation
    Optimization for architecture
    General structure

5 Results





Motivation

Speeding up NTRU Encryption on GPU:

Why NTRU?

Why GPUs?





Why NTRU?

NTRU Signatures (a.k.a. ’NotTrue’)

NTRU Encryption [1]

Under development: IEEE 1363.1 [2]

Security parameters increase





Why NTRU?

NTRU Encryption:

Central operation ⇒ Convolution

(Not lattice based!)

’Post-quantum’ security (?)

⇒ looking good for parallel implementation





Why GPU?





Old-style GPU programming

What GPUs are supposed to do:

3D operations

2D operations (textures / shading)

Abuse 2D operations (e.g. custom shader) ⇒ RSA implementation, 2007¹

Complicated...

¹ Moss, Page, Smart




CUDA framework

Warning: sales talk

Your own personal supercomputer for < €500.

Nvidia CUDA Framework [3]:

Run ‘general’ programs on GPU

More complex operations, data types, branching...

Recent GPU required

Theory: 1 TFlop (practice: 200 GFlop)





CUDA Usage

Usages:

Linear algebra (e.g. CUBLAS)

Simulations (physics, chemistry, engineering...)

Image/video processing

...

Cryptography!





Crypto on GPU

Current applications:

Ciphers:

RSA², ECC³, AES⁴

Cryptanalysis:

Factoring⁵

Brute force

Focus: high throughput, not latency

² Moss, Page, Smart / Szerwinski, Guneysu / Fleissner
³ Szerwinski, Guneysu
⁴ Manavski / Harrison, Waldron
⁵ Bernstein, Chen, Cheng, Lange, Yang






Polynomials

$P(N) = \mathbb{Z}[X]/(X^N - 1)$ and $P_q(N) = \mathbb{Z}_q[X]/(X^N - 1)$

$f \in P(N)$:

$$f = \sum_{i=0}^{N-1} f_i X^i$$

Multiplication $a = b \star c$ in $P(N)$, cyclic convolution:

$$a_k = (b \star c)_k = \sum_{i+j \equiv k \pmod{N}} b_i \cdot c_j$$
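The convolution formula can be checked on a tiny case. The following is a minimal Python sketch, not the GPU code from the talk; the function name `cyclic_convolve` and the optional `modulus` argument are illustrative choices.

```python
def cyclic_convolve(b, c, modulus=None):
    """Return a = b * c in Z[X]/(X^N - 1); reduce mod `modulus` if given."""
    n = len(b)
    if len(c) != n:
        raise ValueError("both polynomials need N coefficients")
    a = [0] * n
    for i, bi in enumerate(b):
        for j, cj in enumerate(c):
            # X^i * X^j = X^((i + j) mod N) because X^N = 1 in the ring
            a[(i + j) % n] += bi * cj
    if modulus is not None:
        a = [x % modulus for x in a]
    return a


# (1 + X) * (1 + X^2) = 1 + X + X^2 + X^3, and X^3 wraps around to 1 for N = 3.
print(cyclic_convolve([1, 1, 0], [1, 0, 1]))  # [2, 1, 1]
```

The double loop costs N² coefficient multiplications, and each output coefficient can be computed independently of the others.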






Encryption

Encryption:

$$e = h \star r + m \bmod q \qquad (1)$$

with [2]:

$N = 1171$

$m \in P_p(N)$ ($p = 3$)

$h, e \in P_q(N)$ ($q = 2^{11}$)

Option 1: $r \in P_p(N)$ with $\#\{r_i = 1\} = \#\{r_i = -1\} = d_r = 106$

Option 2: $r = r_1 \star r_2 + r_3$ with $r_1, r_2, r_3 \in P_p(N)$
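Equation (1) with Option 1 for r can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `h` is a random stand-in rather than a real NTRU public key, the message is a random ternary polynomial, and the helper names (`sample_ternary`, `encrypt`) are hypothetical.

```python
import random


def cyclic_convolve(b, c, q):
    """Cyclic convolution b * c in Z_q[X]/(X^N - 1)."""
    n = len(b)
    a = [0] * n
    for i, bi in enumerate(b):
        if bi == 0:
            continue  # a sparse first operand makes most terms vanish
        for j, cj in enumerate(c):
            a[(i + j) % n] += bi * cj
    return [x % q for x in a]


def sample_ternary(n, d, rng):
    """Polynomial with exactly d coefficients +1, d coefficients -1, rest 0."""
    positions = rng.sample(range(n), 2 * d)
    r = [0] * n
    for pos in positions[:d]:
        r[pos] = 1
    for pos in positions[d:]:
        r[pos] = -1
    return r


def encrypt(h, r, m, q):
    """e = h * r + m mod q, equation (1)."""
    hr = cyclic_convolve(r, h, q)
    return [(x + mi) % q for x, mi in zip(hr, m)]


N, Q, D_R = 1171, 2048, 106
rng = random.Random(1)
h = [rng.randrange(Q) for _ in range(N)]        # stand-in, NOT a real public key
m = [rng.choice((-1, 0, 1)) for _ in range(N)]  # message with coefficients mod p = 3
r = sample_ternary(N, D_R, rng)
e = encrypt(h, r, m, Q)
print(len(e), r.count(1), r.count(-1))
```

Because r has only 2·d_r = 212 nonzero coefficients, the convolution needs about 212 · 1171 additions or subtractions instead of 1171² multiplications.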






Decryption

Decryption:

$$a \equiv f \star e \bmod q \qquad (2)$$

$$m = a \star f_p^{-1} \bmod p \qquad (3)$$

with:

$f = 1 + p \star F \Rightarrow f_p^{-1} = 1$

(Same options for F as for r)

... decryption failures!?
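Why $f = 1 + p \star F$ makes step (3) trivial can be illustrated with toy numbers. The sketch below assumes the usual NTRU public key $h = p \star f_q^{-1} \star g$, which the slides do not spell out, so that $f \star e \equiv p \star g \star r + f \star m \pmod q$. To avoid a polynomial inversion it builds $a$ directly from that expanded form. The parameters (N = 7, q = 128) and all polynomials are made up for illustration.

```python
def cyclic_convolve(b, c):
    """Cyclic convolution over the integers (no modular reduction)."""
    n = len(b)
    a = [0] * n
    for i, bi in enumerate(b):
        if bi:
            for j, cj in enumerate(c):
                a[(i + j) % n] += bi * cj
    return a


def center_lift(x, modulus):
    """Representative of x mod modulus in the interval (-modulus/2, modulus/2]."""
    x %= modulus
    return x - modulus if x > modulus // 2 else x


def decrypt(a_mod_q, p, q):
    """Step (3) with f_p^{-1} = 1: center-lift a mod q, then reduce mod p."""
    return [center_lift(center_lift(ai, q), p) for ai in a_mod_q]


N, P, Q = 7, 3, 128            # toy parameters, far too small for real use
F = [0, 1, 0, -1, 0, 0, 0]
g = [1, 0, -1, 0, 1, 0, 0]
r = [0, -1, 1, 0, 0, 0, 1]
m = [1, 0, -1, 1, 0, 0, -1]

f = [P * Fi for Fi in F]
f[0] += 1                      # f = 1 + p*F, hence f = 1 mod p

gr = cyclic_convolve(g, r)
fm = cyclic_convolve(f, m)
a = [(P * x + y) % Q for x, y in zip(gr, fm)]  # equals f * e mod q under the assumption above

print(decrypt(a, P, Q) == m)   # True
```

Recovery works here because every coefficient of $p \star g \star r + f \star m$ fits into $(-q/2, q/2]$. If q were too small for that, the center-lift would wrap around and the recovered message would be wrong, which is the decryption failure the slide alludes to.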






Processor

Nvidia GTX280:

240 cores, scalar processors

30 multiprocessors (8 cores each)

1.3 GHz

1GB Global Memory

32 & 64-bit integers, FP






Programming model

Figure: Programming model. (Source: CUDA programming guide [3])






Memory types

Figure: Memory types. (Source: CUDA programming guide [3])





Points of attention

Memory access ⇔ computation

Coalesced memory access, bank conflicts

Loop structure

Caching

Efficient mod p computation (decryption)






Convolution

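Ordinary polynomials:

for i = 0 to N − 1 do
    if ($r_i = +1$) ⇒ $t_k = t_k + h_{(k-i) \bmod N}$
    if ($r_i = -1$) ⇒ $t_k = t_k - h_{(k-i) \bmod N}$
end for

Product-form polynomials ($r = r_1 \star r_2 + r_3$) [4], with $r^+_i$ and $r^-_i$ the positions of the +1 and −1 coefficients:

for i = 0 to $d_r - 1$ do
    $t_k = t_k + h_{(k - r^+_i) \bmod N}$
    $t_k = t_k - h_{(k - r^-_i) \bmod N}$
end for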
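Both loops compute the same coefficient $t_k$; the second only visits the nonzero positions of the ternary polynomial. A small Python sketch of the two variants follows, with the index lists `plus` and `minus` standing in for $r^+$ and $r^-$ (these names are illustrative, not taken from the talk).

```python
def convolve_ternary(h, r, q):
    """t = h * r mod q for ternary r, scanning all N coefficients of r."""
    n = len(h)
    t = [0] * n
    for k in range(n):
        for i in range(n):
            if r[i] == 1:
                t[k] += h[(k - i) % n]
            elif r[i] == -1:
                t[k] -= h[(k - i) % n]
    return [x % q for x in t]


def convolve_sparse(h, plus, minus, q):
    """Same result, but r is given as index lists of its +1 and -1 coefficients."""
    n = len(h)
    t = [0] * n
    for k in range(n):
        for i_plus, i_minus in zip(plus, minus):  # both lists hold d_r indices
            t[k] += h[(k - i_plus) % n]
            t[k] -= h[(k - i_minus) % n]
    return [x % q for x in t]


h = [5, 1, 4, 2, 3]
r = [1, 0, -1, 0, 0]              # r = 1 - X^2
print(convolve_ternary(h, r, 8))  # [3, 6, 7, 1, 7]
print(convolve_sparse(h, plus=[0], minus=[2], q=8))  # [3, 6, 7, 1, 7]
```

The sparse form needs 2·d_r memory reads per output coefficient and has no data-dependent branch inside the loop.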






Layout

1 block = 1 encryption

Upload $r_b$, $h_b$, $m_b$

Bit packing

1 thread = 4 × $e_i$
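The work split can be emulated on the CPU to see how many threads one block needs. The Python sketch below assumes that each thread handles four consecutive coefficients; the slides only say that one thread computes four $e_i$, so the contiguous assignment and all function names are assumptions.

```python
COEFFS_PER_THREAD = 4


def threads_per_block(n):
    """Number of threads needed when each thread computes 4 coefficients."""
    return -(-n // COEFFS_PER_THREAD)  # ceiling division


def thread_coefficients(thread_idx, n):
    """Output indices handled by one thread (assumed contiguous)."""
    start = thread_idx * COEFFS_PER_THREAD
    return range(start, min(start + COEFFS_PER_THREAD, n))


def run_block(h, plus, minus, m, q):
    """Emulate one block = one encryption: each 'thread' fills its slice of e."""
    n = len(h)
    e = [0] * n
    for tid in range(threads_per_block(n)):      # runs in parallel on the GPU
        for k in thread_coefficients(tid, n):
            acc = m[k]
            for i in plus:
                acc += h[(k - i) % n]
            for i in minus:
                acc -= h[(k - i) % n]
            e[k] = acc % q
    return e


print(threads_per_block(1171))                   # 293
print(run_block([5, 1, 4, 2, 3], [0], [2], [1, 0, -1, 0, 1], 8))  # [4, 6, 6, 1, 0]
```

For N = 1171 this gives 293 threads per block, with the last thread computing only three coefficients.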






Memory access

Figure: Memory access (threads k, k+1, ... of block b; arrays $r_b$, $h_b$, $e_b$).






Product-form polynomials

Product-form encryption

$$e = h \star r + m \bmod q \quad \text{with} \quad r = r_1 \star r_2 + r_3$$

Algorithm:

1 tmp ← $r_2 \star h$

2 tmp2 ← $r_1 \star$ tmp

3 e ← tmp2 + $r_3 \star h$ + m
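Expanding $e = h \star (r_1 \star r_2 + r_3) + m$ gives $r_1 \star (r_2 \star h) + r_3 \star h + m$, so the $r_3$ term enters as $r_3 \star h$. The Python sketch below checks that the step-by-step evaluation agrees with the direct one on random toy inputs; the parameters and helper names are illustrative only.

```python
import random


def cyclic_convolve(b, c, q):
    """Cyclic convolution b * c in Z_q[X]/(X^N - 1)."""
    n = len(b)
    a = [0] * n
    for i, bi in enumerate(b):
        if bi:
            for j, cj in enumerate(c):
                a[(i + j) % n] += bi * cj
    return [x % q for x in a]


def encrypt_product_form(h, r1, r2, r3, m, q):
    tmp = cyclic_convolve(r2, h, q)      # step 1: tmp  <- r2 * h
    tmp2 = cyclic_convolve(r1, tmp, q)   # step 2: tmp2 <- r1 * tmp
    r3h = cyclic_convolve(r3, h, q)      # contribution of r3
    return [(x + y + mi) % q for x, y, mi in zip(tmp2, r3h, m)]  # step 3


def encrypt_direct(h, r1, r2, r3, m, q):
    r12 = cyclic_convolve(r1, r2, q)
    r = [(x + y) % q for x, y in zip(r12, r3)]  # r = r1 * r2 + r3
    hr = cyclic_convolve(h, r, q)
    return [(x + mi) % q for x, mi in zip(hr, m)]


def random_ternary(n, rng):
    return [rng.choice((-1, 0, 1)) for _ in range(n)]


rng = random.Random(2009)
N, Q = 11, 2048                          # toy N; q as in the parameter set above
h = [rng.randrange(Q) for _ in range(N)]  # stand-in, not a real public key
r1, r2, r3, m = (random_ternary(N, rng) for _ in range(4))

print(encrypt_product_form(h, r1, r2, r3, m, Q) == encrypt_direct(h, r1, r2, r3, m, Q))  # True
```

The step-by-step form never builds the dense product $r_1 \star r_2$; every convolution has a ternary polynomial as one operand, so the sparse loop applies to each of them.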





Results

Platform                  (N, q, p)                       Enc/s        Dec/s
C Intel Core2 @ 3.00GHz   (1171, 2048, 3)                 95           95
CUDA GTX280 (1 op)        (1171, 2048, 3)                 571          546
CUDA GTX280 (20000 ops)   (1171, 2048, 3)                 24·10^3      24·10^3

C Intel Core2 @ 3.00GHz   (1171, 2048, 3), product form   3.22·10^3    -
CUDA GTX280 (1 op)        (1171, 2048, 3), product form   6.25·10^3    -
CUDA GTX280 (20000 ops)   (1171, 2048, 3), product form   218·10^3     -

Table: Comparison of NTRU implementations





Throughput

Figure: Operations per second for encryption with ordinary polynomials. (x-axis: number of parallel operations, 10^0 to 10^5; y-axis: operations/s, 0 to 2.5 × 10^4)





Comparison

Platform            (N, q, p)                        Enc/s        Dec/s
FPGA⁶               (251, 128, X + 2), product form  193·10^3     -
Palm⁶               (251, 128, X + 2), product form  21           11
Palm⁶               (251, 128, X + 2), product form  30           16
ARM C⁶              (251, 128, X + 2), product form  307          148

FPGA⁷               (167, 128, 3)                    18           8.4

C⁸                  (787, 587, ?)                    7.66·10^3    4.61·10^3

C                   (1171, 2048, 3)                  95           95
CUDA (1 op)         (1171, 2048, 3)                  571          546
CUDA (20000 ops)    (1171, 2048, 3)                  24·10^3      24·10^3

C                   (1171, 2048, 3), product form    3.22·10^3    -
CUDA (1 op)         (1171, 2048, 3), product form    6.25·10^3    -
CUDA (20000 ops)    (1171, 2048, 3), product form    218·10^3     -

⁶ Bailey, Coffin, Elbirt, Silverman, Woodbury
⁷ Atıcı, Batina, Fan, Verbauwhede, Yalcın
⁸ EBATS





Comparison: other algorithms

Figure: Throughput (enc/s). (Bars: NTRU PF, RSA 2048, ECC NIST-224, NTRU; y-axis: throughput, 0 to 2.5 × 10^5)






Different security levels:

NTRU (1171, 2048, 3): 256-bit

NTRU (167, 128, 3): � 80 bit

RSA 2048 bit: 112-bit

ECC NIST-224: 112-bit

Different amount of data:

NTRU (1171, 2048, 3): 1756 bit

NTRU (167, 128, 3): 250 bit

RSA: 1024/2048 bit





Main results

NTRU:

Very fast implementation

Fast compared to other ciphers

Total throughput: 218000 enc/s or 47.8 MByte/s

Well suited for GPU
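The MByte/s figure follows from the encryption rate and the 1756 bits of data per NTRU (1171, 2048, 3) encryption listed earlier. A short Python check of that arithmetic (the variable names are illustrative):

```python
ENC_PER_SECOND = 218_000       # product-form CUDA throughput from the results table
BITS_PER_ENCRYPTION = 1756     # data per encryption for (1171, 2048, 3)

bytes_per_second = ENC_PER_SECOND * BITS_PER_ENCRYPTION / 8
print(f"{bytes_per_second / 1e6:.2f} MByte/s")  # 47.85 MByte/s
```

This matches the quoted 47.8 MByte/s when 1 MByte is taken as 10^6 bytes.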





Final remark: GPUs

+                   -
Computing power     Memory access/transfer
Price               Power consumption
Throughput          Latency
Reprogrammable




Questions?





Key references

[1] J. Hoffstein, J. Pipher, and J.H. Silverman. NTRU: A Ring-Based Public Key Cryptosystem. Lecture Notes in Computer Science, pages 267–288, 1998.

[2] W. Whyte, N. Howgrave-Graham, J. Hoffstein, J. Pipher, J.H. Silverman, and P. Hirschhorn. IEEE P1363.1 Draft 10: Draft Standard for Public Key Cryptographic Techniques Based on Hard Problems over Lattices.

[3] Nvidia. Compute Unified Device Architecture Programming Guide, 2007.

[4] J. Hoffstein and J.H. Silverman. Random small Hamming weight products with applications to cryptography. Discrete Applied Mathematics, 130(1):37–49, 2003.

Other references: see paper





Comparison

Platform                                      (N, q, p)                        Enc/s        Dec/s
FPGA Xilinx Virtex 1000EFG860 @ 50 MHz        (251, 128, X + 2), product form  193·10^3     -
Palm Motorola Dragonball @ 20 MHz (C)         (251, 128, X + 2), product form  21           11
Palm Motorola Dragonball @ 20 MHz (Assembly)  (251, 128, X + 2), product form  30           16
ARM C ARM7TDMI @ 37 MHz                       (251, 128, X + 2), product form  307          148

FPGA Xilinx Virtex 1000EFG860 @ 500kHz        (167, 128, 3)                    18           8.4

C Intel Core2 Duo @ 3GHz                      (787, 587, ?)                    7669         4613

C Intel Core2 Extreme @ 3.00GHz               (1171, 2048, 3)                  95           95
CUDA GTX280 (1 op)                            (1171, 2048, 3)                  571          546
CUDA GTX280 (20000 ops)                       (1171, 2048, 3)                  24·10^3      24·10^3

C Intel Core2 Extreme @ 3.00GHz               (1171, 2048, 3), product form    3.22·10^3    -
CUDA GTX280 (1 op)                            (1171, 2048, 3), product form    6.25·10^3    -
CUDA GTX280 (20000 ops)                       (1171, 2048, 3), product form    218·10^3     -







Platform                   (N, q, p)       Enc/s        Dec/s
CUDA GTX280 (1 op)                         571          546
CUDA GTX280 (20000 ops)                    24·10^3      24·10^3

CUDA GTX280 (1 op)         Product form    6.25·10^3    6.25·10^3
CUDA GTX280 (20000 ops)    Product form    218·10^3     218·10^3

RSA comparison

CUDA⁹ Nvidia 8800GTS, 1024 bit: 813
C++¹⁰ Core2 @ 1.83GHz, 2048 bit: (6.66·10^3) 168

ECC comparison

C¹¹ Core2 @ 1.83 GHz, ECC NIST-224: 1.86·10^3

⁹ Szerwinski, Guneysu
¹⁰ Crypto++
¹¹ EBATS