extreme enumeration on GPU and in cloud ches...Extreme Enumeration on GPU and in Cl dClouds...
Transcript of extreme enumeration on GPU and in cloud ches...Extreme Enumeration on GPU and in Cl dClouds...
Extreme Enumeration G d i Cl don GPU and in Clouds
Po‐Chun Kuo1, Michael Schneider2, Özgür Dagdelen3, Jan Reichelt3,Johannes Buchmann2,3, Chen‐Mou Cheng1, and Bo‐Yin Yang4
1 National Taiwan University, Taipei, Taiwan2 Technische Universität Darmstadt, Germany
3 Center for Advanced Security Research Darmstadt (CASED), Germany4 Academia Sinica, Taipei, Taiwan
CHESNara JapanNara, Japan2011.9.30
OutlineOutline
• Introduction–Problem Description–Problem Description–Known algorithm–Application
• Algorithm• Algorithm• ImplementationImplementation• Results
2
Problem Description
nn
OFind it!
3
Problem PropertyProblem Property
• NP‐hard• Worst/average case equivalence property/ g q p p y
– at 1997, Ajtai shown the equivalence between worst‐case and average‐case lattice problemworst case and average case lattice problem [Ajtai’97]
• Many problems could be reduced to SVP• Many problems could be reduced to SVP– Knapsack, Subset Sum, factoring, approximate GCD bl tGCD problem…etc
• The first Fully Homomorphic Encryption Scheme is based on lattice problem is proposed at 2009 [Gentry’09] 4
Known AlgorithmKnown Algorithm• Approximation Algorithm
– Lattice Basis Reduction Algorithm– LLL, ratio ~ 1.02n in average case [Nguyen and Stehlé, 2006], g [ g y , ]– BKZ, ratio ~1.01n in average case
• with a parameter “block size”p– the approximate ratio are all exponential to input dimension
• Exact AlgorithmExact Algorithm– Lattice Enumeration
• Super exponential time and polynomial space• Super‐exponential time and polynomial space• But seems exponential time in practical
Si– Sieve• both exponential time and space 5
ApplicationApplication• Factoring polynomials over the integers or the rational• Factoring polynomials over the integers or the rational
numbers– E g given x2‐1 return x‐1 x+1E.g. given x 1, return x 1, x+1
• Finding the minimal polynomial of an algebraic number given to a good enough approximation. g g pp– E.g, given 1.618033, return x2‐x+1=0
• Factoring number with known some bitFactoring number with known some bit• Approximate GCD problem• Integer ProgrammingInteger Programming• Knapsack problem• Subset Sum problemSubset Sum problem
6
Our ContributionsOur Contributions• Paralleli e and implement GNR’10 t i l tti• Parallelize and implement GNR’10 extreme pruning lattice
reduction on GPU and in Cloud– Highly parallelize, about 90% parallel benefit
O GTX 480 i b 12 i f h i7– One GTX 480 is about 12 times faster than one i7 core.– Extend the implementation to multiple GPUs and run on Amazon’s EC2
cloud services.d h i id i d fl ibl b di• Extend the pruning idea to pre‐computing and flexible bounding
function– Get the same output quality and speed up more than 10x in pre‐
ticomputing• New security level measure
– Tradition: Dollar‐day (e.g. buy 10 machines and run 5 days)– New: Dollar (e.g. rent computation power from Amazon)
• Estimate the cost of solving ASVP instances of SVP Challenge in higher dimensionsg
7
AlgorithmAlgorithm
• Lattice Enumeration• Lattice Enumeration using Extreme Pruning
8
AlgorithmAlgorithm• Lattice Enumeration
– Exhaustive search all the possible solutions
Gram‐SchmidtGram‐Schmidt Process
9
• Build the search tree according to the “coefficient vector”• Guess bound = ||b1|| or Gauss prediction• Run DFS to find the shortest vector
……
…
10
bound boundnew
Guess boundcn
new
Guess bound Early abort
cn‐1
Renew bound Gauss predictionp
Extreme pruning[GNR’10]
……
pruning[GNR 10] Parallel
c2
c111
AlgorithmAlgorithm
• Lattice Enumeration• Lattice Enumeration using Extreme Pruning
12
Idea of Extreme PruningIdea of Extreme Pruning• Proposed by Nicolas Gama, Phong Q. Nguyen and Oded Regev• Goal: Maximize the reward per operation• Only search the space where is the most possible to find out
solution• If failed, randomized the basis and search again.
13
Lattice Enumeration using Extreme Pruning
• Guess very tight bound for each level
boundn cnfor each level
• If failed, do a randomize to the basis then do
boundn‐1 cn‐1to the basis, then do Enumeration again.
• E.g.E.g.time for search each tree be 1000 times faster, and
…… ……
probability of finding out the vector be 10%,=> gain ~100x faster=> gain ~100x faster
bound2 c2
bound1 c114
How to decide bound for each level ?How to decide bound for each level ?
B di t (R R R ) [0 1]n• Bounding vector (R1, R2, …, Rn) [0,1]n, with R1 ≦ R2 ≦ … ≦ Rn
• Bounding function for level i is Ri.Guess Boundg i
• Linear bounding functionR = i / n– Ri = i / n
– Theoretical analysis ‐> the success probability 1/n [GNR’10]– In Practice ‐> the success probability is much higher– But the search space is still too large for high dimension…
• Polynomial: fitting the numerically optimized in [GNR’10]• Polynomial: fitting the numerically optimized in [GNR 10]– No theoretical guarantee on the probability– But it performs well in practice– success probability is about 10% by experiment
15
Bounding functionBounding function1
0 7
0.8
0.9
Linear
0.5
0.6
0.7
0.3
0.4Polynomial
0.1
0.2
00 5 10 15 20 25 30 35 40 45
16
ll lParallel• Parallel for Enumeration
– Parallel for one search tree• Parallel for Extreme Enumeration
– Parallel formany search treesParallel for many search trees
17
Parallel for EnumerationParallel for Enumeration
Parallel DFSupper tree + DFSlower tree
DFSDFS
pre‐computing
DFSupper tree one computer pre computing pre‐computing
DFSlower tree Parallel part
…Parallel part
Output the shortest vector …DFS
Parallel computing
collecting the sub‐tree results.
18
Parallel for Extreme EnumerationInput basis
…Randomized Basis with Seedi
Pre‐ComputationLLL, BKZ
Pre‐ComputationLLL, BKZ
Pre‐ComputationLLL, BKZ
…
Enumeration Enumeration Enumeration…
Output
19
ImplementationImplementation
h l d d i• We have Cloud and GPU version implementations.
• Cloud version: C++ and Hadoop Streaming• GPU version : CUDAGPU version : CUDA
• Memory Used:– For dimension n, needs 8n2+40n+164 bytes– n=100, needs 84 KB– n=800, needs 5.1 MB,
20
Implementation overviewpInput basis
R d i d B i ith S d…
Pre‐ComputationLLL, BKZ
Pre‐ComputationLLL, BKZ
Randomized Basis with Seedi
…
CPU Cloud Computing
GPU ENUMGPU ENUMGPU ENUM
Cl dCloudCloud ENUM
Cloud ENUM
GPU ENUMGPU ENUMGPU ENUM Cloud ENUM
Cloud ENUM
ENUM
CPU Cloud ComputingGPU Cloud Computing
Output
CPU Cloud ComputingGPU Cloud Computing
– One GTX 480 is about 12 times faster than one i7 core. 21
R ltResults• We launch our CPU Cloud in IIS, Taiwan
– Machine 0‐1 Intel Xeon E5430 • 2.66 (GHz) * 2(cpus) * 4(cores) * 1(thread)
– Machine 2‐3 Intel Xeon E5520 2 27 (GH ) * 2( ) * 4( ) * 2(th d )• 2.27 (GHz) * 2(cpus) * 4(cores) * 2(threads)
– Machine 4‐8 Intel Xeon E5620 • 2 40 (GHz) * 2(cpus) * 4(cores) * 2(threads)• 2.40 (GHz) 2(cpus) 4(cores) 2(threads)
– Total: 9 d 72 h i l 128 i t l• 9 nodes, 72 physical cores, 128 virtual cores,
• 306 GHz
22
Performance compare to single‐corep g
100000
Cloud Enum Performance with pre‐Computing BKZ40P15
10000
100000
1000
Time
56
80x
100
Cost
43x56x
1
10
Single‐coreGiantCloud (120X Frequency)
S d ti i i
182 83 84 85 86 87 88 89 90 91
Dimension
• Speed‐up ratio is increase23
Prune BKZ in Pre ComputingPrune BKZ in Pre‐Computing
0.9
18.00E‐01
9.00E‐01BKZ 30 Pruning 15
0.6
0.7
0.8
obab
ility
LLL 5.00E‐01
6.00E‐01
7.00E‐01
(sec)
BKZ 30
0 2
0.3
0.4
0.5
Success P
ro BKZ 10 Pruning 15BKZ 10BKZ 20 Pruning 15BKZ 20BKZ 30 Pruning 15
2.00E‐01
3.00E‐01
4.00E‐01
time (
0
0.1
0.2
0 10 20 30 40 50Dimension
BKZ 30 Pruning 15BKZ 30BKZ 40 Pruning 15
0.00E+00
1.00E‐01
30 35 40Dimension
• Almost the same quality as none‐pruning version
Dimension
• But Cost time is much less – in dimension 80 ‐120, speed‐up 10x faster 24
Total running timeTotal running time100000
10000
3540
1000
Time (secon
d)
BKZ 30 P i 15BKZ bl k i30
100
BKZ 30 Pruning 15
BKZ 35 Pruning 15
BKZ block size
1080 85 90 95 100 105 110
Di i
BKZ 40 Pruning 15
• Seems exponential increaseT d ff th ti (b i lit ) d
Dimension
• Tradeoff on the pre‐computing (basis quality) and extreme enumeration
25
BKZ / (BKZ+ ENUM)BKZ / (BKZ+ ENUM)1
0 7
0.8
0.9
0.5
0.6
0.7
Ratio
0.3
0.4
R
BKZ 30 Pruning 15
BKZ 35 Pruning 15
0
0.1
0.2
BKZ 40 Pruning 15
• To combine with “total running time” when percentage BKZ of
80 85 90 95 100 105 110
Dimension
• To combine with total running time , when percentage BKZ of total time is about 40‐60%, it gets the min total running time
26
Estimate Cost Time
10000000
100000000
1E+09
10000
100000
1000000
100
1000
10000
ost T
ime (day)
0.1
1
10
100 105 110 115 120 125 130
Co
0.001
0.01BKZ 35 Pruning 15
BKZ 40 Pruning 15
BKZ 55 Pruning 15
• This shows the optimal pre‐computing is BKZ d
0.0001Dimension
BKZ 55 Pruning 15
40 pruning 15 in dimension 104 to130.27
http://www.latticechallenge.org/svp‐challenge/28
Our RecordOur RecordN 1 di i 120 d 0• No. 1, dimension 120, seed 0– 2300 US dollars paid to Amazon
• Rent 64 machines about 14 4 hours• Rent 64 machines about 14.4 hours– Machine type
• Cluster GPU Quadruple Extra Large InstanceCluster GPU Quadruple Extra Large Instance • 2 x Intel Xeon X5570 CPUs• 22 GB of memory• 2 x NVIDIA Tesla “Fermi” M2050 GPUs• 1690 GB of instance storage
t 2 5 US d ll h• costs 2.5 US dollars per hour• http://aws.amazon.com/ec2/instance‐types/
29
Concluding RemarksConcluding Remarks
i i l lid i f ’• Empirical validation of GNR’10 extreme pruning
• Parallelize and implement GNR’10 extreme pruning on GPU and in Cloudp g
• Extend the pruning idea to pre‐computing and flexible bounding functionand flexible bounding function
• New security level measure• Estimate the cost of solving ASVP instances of SVP Challenge in higher dimensions
30
• Thank you
31
•Q & A ?•Q & A ?
32