Motivation | NTRU Crash Course | Architecture | Implementation | Results
Implementing NTRU on a GPU
Jens Hermans, Frederik Vercauteren, Bart Preneel
COSIC, K.U.Leuven
30 July 2009
Jens Hermans Frederik Vercauteren, Bart Preneel COSIC, K.U.Leuven Implementing NTRU on a GPU
1 Motivation: Why NTRU?, Why GPU?
2 NTRU Crash Course: Polynomials, Operations and Parameter Choices
3 Architecture: Hardware, Programming model
4 Implementation: Optimization for architecture, General structure
5 Results
Motivation
Speeding up NTRU Encryption on GPU:
Why NTRU?
Why GPUs?
Why NTRU?
NTRU Signatures (a.k.a. ’NotTrue’)
NTRU Encryption [1]
Under development: IEEE 1363.1 [2]
Security parameters increase
Why NTRU?
NTRU Encryption:
Central operation ⇒ Convolution
(Not lattice based!)
’Post-quantum’ security (?)
⇒ looking good for parallel implementation
Why GPU?
Old-style GPU programming
What GPUs are supposed to do:
3D operations
2D operations (textures / shading)
Abuse 2D operations (e.g. custom shader) ⇒ RSA implementation, 2007 ¹
Complicated...
¹ Moss, Page, Smart
CUDA framework
Warning: sales talk
Your own personal supercomputer for < €500.
Nvidia CUDA Framework [3]:
Run ‘general’ programs on GPU
More complex operations, data types, branching...
Recent GPU required
Theory: 1 TFlop/s (practice: 200 GFlop/s)
CUDA Usage
Usages:
Linear algebra (e.g. CUBLAS)
Simulations (physics, chemistry, engineering...)
Image/video processing
...
Cryptography!
Crypto on GPU
Current applications:
Ciphers:
RSA ², ECC ³, AES ⁴
Cryptanalysis:
Factoring ⁵
Brute force
Focus: high throughput, not latency
² Moss, Page, Smart / Szerwinski, Guneysu / Fleissner
³ Szerwinski, Guneysu
⁴ Manavski / Harrison, Waldron
⁵ Bernstein, Chen, Cheng, Lange, Yang
Polynomials
$P(N) = \mathbb{Z}[X]/(X^N - 1)$ and $P_q(N) = \mathbb{Z}_q[X]/(X^N - 1)$
$f \in P(N)$:
$$f = \sum_{i=0}^{N-1} f_i X^i$$
Multiplication $a = b \star c$ in $P(N)$, cyclic convolution:
$$a_k = (b \star c)_k = \sum_{i+j \equiv k \pmod{N}} b_i \cdot c_j$$
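The cyclic convolution can be spelled out in a few lines. This is a minimal reference sketch in Python (not the GPU code from the talk); the function name `cyclic_convolution` and the optional `modulus` argument are chosen here for illustration.

```python
def cyclic_convolution(b, c, modulus=None):
    """Return a = b * c in Z[X]/(X^N - 1); reduce coefficients if modulus is given."""
    n = len(b)
    if len(c) != n:
        raise ValueError("both polynomials need exactly N coefficients")
    a = [0] * n
    for i, b_i in enumerate(b):
        for j, c_j in enumerate(c):
            # X^(i+j) wraps around because X^N = 1 in the quotient ring.
            a[(i + j) % n] += b_i * c_j
    if modulus is not None:
        a = [coeff % modulus for coeff in a]
    return a


# N = 3: (1 + 2X)(1 + X^2) = 1 + 2X + X^2 + 2X^3, and X^3 = 1 gives 3 + 2X + X^2.
print(cyclic_convolution([1, 2, 0], [1, 0, 1]))  # [3, 2, 1]
```

The double loop costs N² multiplications, which is why the sparse structure of r matters later in the talk.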
Encryption
Encryption:
$$e = h \star r + m \bmod q \qquad (1)$$
with [2]:
$N = 1171$
$m \in P_p(N)$ ($p = 3$)
$h, e \in P_q(N)$ ($q = 2^{11}$)
Option 1: $r \in P_p(N)$ with $\#\{r_i = 1\} = \#\{r_i = -1\} = d_r = 106$
Option 2: $r = r_1 \star r_2 + r_3$ with $r_1, r_2, r_3 \in P_p(N)$
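Equation (1) with the Option 1 parameters can be sketched as follows. This is an illustration only: `h` is a random stand-in rather than a real public key, `random.Random` is not a cryptographic generator, and the helper names (`random_ternary`, `encrypt`) are hypothetical.

```python
import random


def cyclic_convolution(h, r, q):
    """h * r in Z_q[X]/(X^N - 1), skipping the zero coefficients of r."""
    n = len(h)
    t = [0] * n
    for i, r_i in enumerate(r):
        if r_i == 0:
            continue
        for j, h_j in enumerate(h):
            t[(i + j) % n] += r_i * h_j
    return [x % q for x in t]


def random_ternary(n, d, rng):
    """Polynomial with exactly d coefficients +1, d coefficients -1, the rest 0."""
    coeffs = [1] * d + [-1] * d + [0] * (n - 2 * d)
    rng.shuffle(coeffs)
    return coeffs


def encrypt(h, r, m, q):
    """e = h * r + m mod q, as in equation (1)."""
    hr = cyclic_convolution(h, r, q)
    return [(x + m_i) % q for x, m_i in zip(hr, m)]


N, Q, D_R = 1171, 2048, 106
rng = random.Random(0)                       # demo only, not secure
h = [rng.randrange(Q) for _ in range(N)]     # stand-in for a public key
r = random_ternary(N, D_R, rng)
m = [rng.choice((-1, 0, 1)) for _ in range(N)]
e = encrypt(h, r, m, Q)
print(len(e), min(e) >= 0, max(e) < Q)
```

Only 2·d_r = 212 of the 1171 coefficients of r are nonzero, so the inner loop runs for fewer than a fifth of the indices.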
Decryption
Decryption:
$$a \equiv f \star e \bmod q \qquad (2)$$
$$m = a \star f_p^{-1} \bmod p \qquad (3)$$
with:
$f = 1 + p \star F \Rightarrow f_p^{-1} = 1$
(Same options for $F$ as for $r$)
... decryption failures!?
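Why the two reductions recover m, and where failures come from, can be seen in a toy example. The sketch below assumes the usual NTRU key relation h = p ⋆ g ⋆ f_q^{-1} (not stated on the slide), under which f ⋆ e ≡ p ⋆ g ⋆ r + f ⋆ m (mod q). It builds that value directly instead of inverting f, and the polynomials g, F, r, m are made-up toy values with N = 7.

```python
def cyclic_convolution(b, c):
    """b * c in Z[X]/(X^N - 1), without coefficient reduction."""
    n = len(b)
    a = [0] * n
    for i, b_i in enumerate(b):
        if b_i == 0:
            continue
        for j, c_j in enumerate(c):
            a[(i + j) % n] += b_i * c_j
    return a


def center_lift(a, modulus):
    """Map each coefficient to its representative in [-modulus/2, modulus/2)."""
    half = modulus // 2
    return [((x + half) % modulus) - half for x in a]


def recover_message(a, p, q):
    """Equation (3) with f_p^{-1} = 1: center-lift a mod q, then reduce mod p."""
    return center_lift(center_lift(a, q), p)


P, Q = 3, 2048
F = [1, 0, -1, 0, 0, 0, 0]
g = [0, 1, 0, 0, -1, 0, 0]
r = [1, -1, 0, 0, 0, 0, 0]
m = [1, 0, -1, 1, 0, 0, -1]

f = [P * x for x in F]
f[0] += 1                                    # f = 1 + p * F

# Value of f * e mod q under the assumed key relation: p*g*r + f*m mod q.
pgr = [P * x for x in cyclic_convolution(g, r)]
fm = cyclic_convolution(f, m)
a = [(x + y) % Q for x, y in zip(pgr, fm)]

print(recover_message(a, P, Q) == m)         # True
```

Recovery works here because every coefficient of p ⋆ g ⋆ r + f ⋆ m lies well inside (−q/2, q/2). If a coefficient wrapped around modulo q, the center lift would return the wrong integer, which is the decryption failure the slide alludes to.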
Processor
Nvidia GTX280:
240 cores, scalar processors
30 multiprocessors (8 cores each)
1.3 GHz
1GB Global Memory
32 & 64-bit integers, FP
Programming model
(Source: CUDA programming guide)
Memory types
(Source: CUDA programming guide)
Points of attention
Memory access ⇔ computation
Coalesced memory access, bank conflicts
Loop structure
Caching
Efficient mod p computation (decryption)
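To make the bank-conflict point concrete, here is a small model of shared-memory access. It assumes 16 banks of 32-bit words serviced per half-warp of 16 threads, as on compute capability 1.x devices such as the GTX280; the function name `conflict_degree` is hypothetical and the model ignores the broadcast case where all threads read the same word.

```python
from collections import Counter

NUM_BANKS = 16   # assumption: compute capability 1.x shared memory
HALF_WARP = 16   # threads whose requests are serviced together


def conflict_degree(stride_words):
    """Largest number of threads in a half-warp that hit the same bank
    when thread t accesses the 32-bit word at index t * stride_words."""
    banks = Counter((t * stride_words) % NUM_BANKS for t in range(HALF_WARP))
    return max(banks.values())


for stride in (1, 2, 3, 16):
    print(stride, conflict_degree(stride))
```

A degree of 1 means the half-warp is served in one pass; a degree of d means the access is serialized into d passes, so odd strides are harmless in this model while a stride of 16 words is the worst case.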
Convolution
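Ordinary polynomials:
for i = 0 to N − 1 do
  if $r_i = +1$ ⇒ $t_k = t_k + h_{(k - i) \bmod N}$
  if $r_i = -1$ ⇒ $t_k = t_k - h_{(k - i) \bmod N}$
end for
Product-form polynomials ($r = r_1 \star r_2 + r_3$) [4], with $r^+_i$ and $r^-_i$ the positions of the i-th +1 and −1 coefficient:
for i = 0 to $d_r - 1$ do
  $t_k = t_k + h_{(k - r^+_i) \bmod N}$
  $t_k = t_k - h_{(k - r^-_i) \bmod N}$
end for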
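The index-list variant of the loop can be sketched as below. This is a sequential Python illustration, not the CUDA kernel: `plus_idx` and `minus_idx` are hypothetical names for the positions of the +1 and −1 coefficients of a ternary polynomial, and the outer loop over k stands in for the work that separate threads would do.

```python
def sparse_convolution(h, plus_idx, minus_idx, q):
    """t = h * r mod q, where r has +1 at plus_idx, -1 at minus_idx, 0 elsewhere."""
    n = len(h)
    t = [0] * n
    for k in range(n):                 # one output coefficient per iteration
        acc = 0
        for i in plus_idx:
            acc += h[(k - i) % n]      # contribution of a +1 coefficient
        for i in minus_idx:
            acc -= h[(k - i) % n]      # contribution of a -1 coefficient
        t[k] = acc % q
    return t


# r = 1 - X^2 with N = 5, so t_k = h_k - h_(k-2).
print(sparse_convolution([5, 1, 7, 0, 3], [0], [2], 8))  # [5, 6, 2, 7, 4]
```

No multiplications are needed, and each output coefficient only depends on h and the index lists, so the coefficients can be computed independently.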
Layout
1 block = 1 encryption
Upload $r_b$, $h_b$, $m_b$
Bit packing
1 thread = 4 × $e_i$ (four coefficients of $e$ per thread)
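The slide does not say how the bit packing is laid out. As one possible scheme, a ternary coefficient fits in 2 bits, so 16 coefficients fit in a 32-bit word; the encoding table and the names `pack_ternary` and `unpack_ternary` below are assumptions made for this sketch, not the layout used in the implementation.

```python
ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}    # assumed 2-bit code per coefficient
DECODE = {code: coeff for coeff, code in ENCODE.items()}
PER_WORD = 16                            # 16 coefficients per 32-bit word


def pack_ternary(coeffs):
    """Pack coefficients from {-1, 0, 1} into a list of 32-bit words."""
    words = []
    for start in range(0, len(coeffs), PER_WORD):
        word = 0
        for offset, c in enumerate(coeffs[start:start + PER_WORD]):
            word |= ENCODE[c] << (2 * offset)
        words.append(word)
    return words


def unpack_ternary(words, n):
    """Inverse of pack_ternary; n is the number of coefficients to return."""
    coeffs = []
    for word in words:
        for offset in range(PER_WORD):
            coeffs.append(DECODE[(word >> (2 * offset)) & 0b11])
    return coeffs[:n]


print(pack_ternary([1, -1, 0]))  # [9], i.e. binary 1001
```

With this layout a polynomial with N = 1171 coefficients takes 74 words instead of 1171, which reduces the amount of data to upload per encryption.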
Memory access
Figure: Memory access. Labels in the figure: threads k, k+1, ... within block b; arrays $r_b$, $h_b$, $e_b$.
Product-form polynomials
Product-form encryption
$e = h \star r + m \bmod q$ with $r = r_1 \star r_2 + r_3$
Algorithm:
1 tmp ← $r_2 \star h$
2 tmp2 ← $r_1 \star$ tmp
3 e ← tmp2 + $r_3 \star h$ + m
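Expanding e = h ⋆ (r1 ⋆ r2 + r3) + m gives e = r1 ⋆ (r2 ⋆ h) + r3 ⋆ h + m, so the product r1 ⋆ r2 never has to be formed. The sketch below follows that expansion in plain Python; the function names and the toy values (N = 7, q = 32) are illustrative, and it compares the result against the direct computation.

```python
def cyclic_convolution(b, c, q):
    """b * c in Z_q[X]/(X^N - 1), skipping zero coefficients of b."""
    n = len(b)
    a = [0] * n
    for i, b_i in enumerate(b):
        if b_i == 0:
            continue
        for j, c_j in enumerate(c):
            a[(i + j) % n] += b_i * c_j
    return [x % q for x in a]


def encrypt_product_form(h, r1, r2, r3, m, q):
    tmp = cyclic_convolution(r2, h, q)       # step 1: r2 * h
    tmp2 = cyclic_convolution(r1, tmp, q)    # step 2: r1 * (r2 * h)
    r3h = cyclic_convolution(r3, h, q)       # the r3 * h term of the expansion
    return [(x + y + z) % q for x, y, z in zip(tmp2, r3h, m)]


def encrypt_direct(h, r1, r2, r3, m, q):
    """Reference: build r = r1 * r2 + r3 first, then e = h * r + m."""
    r = [(x + y) % q for x, y in zip(cyclic_convolution(r1, r2, q), r3)]
    hr = cyclic_convolution(r, h, q)
    return [(x + y) % q for x, y in zip(hr, m)]


Q = 32
h = [5, 17, 30, 2, 9, 21, 14]
r1 = [1, 0, -1, 0, 0, 0, 0]
r2 = [0, 1, 0, 0, -1, 0, 0]
r3 = [0, 0, 1, -1, 0, 0, 0]
m = [1, -1, 0, 0, 1, 0, -1]
print(encrypt_product_form(h, r1, r2, r3, m, Q) == encrypt_direct(h, r1, r2, r3, m, Q))
```

Each of the three convolutions has a very sparse ternary factor, which is where the speed-up of the product form comes from [4].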
Results
Platform                        Enc/s       Dec/s
(N, q, p) = (1171, 2048, 3):
  C, Intel Core2 @ 3.00GHz      95          95
  CUDA GTX280 (1 op)            571         546
  CUDA GTX280 (20000 ops)       24·10^3     24·10^3
(N, q, p) = (1171, 2048, 3), product form:
  C, Intel Core2 @ 3.00GHz      3.22·10^3   -
  CUDA GTX280 (1 op)            6.25·10^3   -
  CUDA GTX280 (20000 ops)       218·10^3    -
Table: Comparison of NTRU implementations
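The speed-ups implied by the encryption column can be worked out directly. The snippet below only divides the Enc/s figures from the table; the dictionary keys are labels chosen for this sketch.

```python
# Enc/s figures copied from the table above.
enc_per_second = {
    "ordinary": {"cpu": 95, "gpu_1_op": 571, "gpu_20000_ops": 24e3},
    "product form": {"cpu": 3.22e3, "gpu_1_op": 6.25e3, "gpu_20000_ops": 218e3},
}


def speedup(variant, gpu_column):
    """GPU throughput divided by the CPU throughput for the same variant."""
    row = enc_per_second[variant]
    return row[gpu_column] / row["cpu"]


for variant in enc_per_second:
    single = speedup(variant, "gpu_1_op")
    batched = speedup(variant, "gpu_20000_ops")
    print(f"{variant}: {single:.1f}x (1 op), {batched:.1f}x (20000 ops)")
```

The batched case is roughly 250 times faster than the CPU for ordinary polynomials and roughly 68 times faster for the product form, while a single operation gains far less. That matches the focus on throughput rather than latency.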
Throughput
Figure: Operations per second for encryption with ordinary polynomials. Horizontal axis: number of parallel operations (logarithmic, 10^0 to 10^5); vertical axis: operations/s (0 to 2.5·10^4).
Comparison
Platform               Enc/s       Dec/s
(N, q, p) = (251, 128, X + 2), product form:
  FPGA ⁶               193·10^3    -
  Palm ⁶               21          11
  Palm ⁶               30          16
  ARM C ⁶              307         148
(N, q, p) = (167, 128, 3):
  FPGA ⁷               18          8.4
(N, q, p) = (787, 587, ?):
  C ⁸                  7.66·10^3   4.61·10^3
(N, q, p) = (1171, 2048, 3):
  C                    95          95
  CUDA (1 op)          571         546
  CUDA (20000 ops)     24·10^3     24·10^3
(N, q, p) = (1171, 2048, 3), product form:
  C                    3.22·10^3   -
  CUDA (1 op)          6.25·10^3   -
  CUDA (20000 ops)     218·10^3    -
Table: Comparison of NTRU implementations
⁶ Bailey, Coffin, Elbirt, Silverman, Woodbury
⁷ Atıcı, Batina, Fan, Verbauwhede, Yalcın
⁸ EBATS
Comparison: other algorithms
Figure: Throughput (enc/s) for NTRU PF, RSA 2048, ECC NIST-224 and NTRU. Vertical axis: throughput (0 to 2.5·10^5).
Comparison: other algorithms
Different security levels:
NTRU (1171, 2048, 3): 256-bit
NTRU (167, 128, 3): ≪ 80 bit
RSA 2048 bit: 112-bit
ECC NIST-224: 112-bit
Different amount of data:
NTRU (1171, 2048, 3): 1756 bit
NTRU (167, 128, 3): 250 bit
RSA: 1024/2048 bit
Main results
NTRU:
Very fast implementation
Fast compared to other ciphers
Total throughput: 218000 enc/s or 47.8 MByte/s
Well suited for GPU
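The two throughput figures are consistent with each other: at 1756 bits of data per NTRU (1171, 2048, 3) encryption, as given on the comparison slide, 218000 enc/s works out to about 47.85 MByte/s. A short check, taking 1 MByte as 10^6 bytes:

```python
ENC_PER_SECOND = 218_000       # product form, 20000 parallel operations
BITS_PER_MESSAGE = 1756        # data per NTRU (1171, 2048, 3) encryption

bytes_per_second = ENC_PER_SECOND * BITS_PER_MESSAGE / 8
mbyte_per_second = bytes_per_second / 1e6   # assumes 1 MByte = 10^6 bytes

print(f"{mbyte_per_second:.2f} MByte/s")    # 47.85 MByte/s
```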
Final remark: GPUs
+                    -
Computing power      Memory access/transfer
Price                Power consumption
Throughput           Latency
Reprogrammable
Questions?
Key references
[1] J. Hoffstein, J. Pipher, and J.H. Silverman. NTRU: A Ring-Based Public Key Cryptosystem. Lecture Notes in Computer Science, pages 267–288, 1998.
[2] W. Whyte, N. Howgrave-Graham, J. Hoffstein, J. Pipher, J.H. Silverman, and P. Hirschhorn. IEEE P1363.1 Draft 10: Draft Standard for Public Key Cryptographic Techniques Based on Hard Problems over Lattices.
[3] Nvidia. Compute Unified Device Architecture Programming Guide, 2007.
[4] J. Hoffstein and J.H. Silverman. Random small Hamming weight products with applications to cryptography. Discrete Applied Mathematics, 130(1):37–49, 2003.
Other references: see paper
Comparison
Platform                                          Enc/s       Dec/s
(N, q, p) = (251, 128, X + 2), product form:
  FPGA Xilinx Virtex 1000EFG860 @ 50 MHz          193·10^3    -
  Palm Motorola Dragonball @ 20 MHz (C)           21          11
  Palm Motorola Dragonball @ 20 MHz (Assembly)    30          16
  ARM C ARM7TDMI @ 37 MHz                         307         148
(N, q, p) = (167, 128, 3):
  FPGA Xilinx Virtex 1000EFG860 @ 500kHz          18          8.4
(N, q, p) = (787, 587, ?):
  C Intel Core2 Duo @ 3GHz                        7669        4613
(N, q, p) = (1171, 2048, 3):
  C Intel Core2 Extreme @ 3.00GHz                 95          95
  CUDA GTX280 (1 op)                              571         546
  CUDA GTX280 (20000 ops)                         24·10^3     24·10^3
(N, q, p) = (1171, 2048, 3), product form:
  C Intel Core2 Extreme @ 3.00GHz                 3.22·10^3   -
  CUDA GTX280 (1 op)                              6.25·10^3   -
  CUDA GTX280 (20000 ops)                         218·10^3    -
Table: Comparison of NTRU implementations
Comparison: other algorithms
Platform                      Enc/s       Dec/s
NTRU:
  CUDA GTX280 (1 op)          571         546
  CUDA GTX280 (20000 ops)     24·10^3     24·10^3
NTRU, product form:
  CUDA GTX280 (1 op)          6.25·10^3   6.25·10^3
  CUDA GTX280 (20000 ops)     218·10^3    218·10^3
RSA comparison:
  CUDA ⁹, Nvidia 8800GTS, 1024 bit: 813
  C++ ¹⁰, Core2 @ 1.83GHz, 2048 bit: (6.66·10^3) 168
ECC comparison:
  C ¹¹, Core2 @ 1.83 GHz, ECC NIST-224: 1.86·10^3
⁹ Szerwinski, Guneysu
¹⁰ Crypto++
¹¹ EBATS