extreme enumeration on GPU and in cloud ches...Extreme Enumeration on GPU and in Cl dClouds...

Extreme Enumeration G d i Cl don GPU and in Clouds

Po‐Chun Kuo1, Michael Schneider2, Özgür Dagdelen3, Jan Reichelt3,Johannes Buchmann2,3, Chen‐Mou Cheng1, and Bo‐Yin Yang4

1 National Taiwan University, Taipei, Taiwan2 Technische Universität Darmstadt, Germany

3 Center for Advanced Security Research Darmstadt (CASED), Germany4 Academia Sinica, Taipei, Taiwan

CHESNara JapanNara, Japan2011.9.30

OutlineOutline

• Introduction–Problem Description–Problem Description–Known algorithm–Application

• Algorithm• Algorithm• ImplementationImplementation• Results

2

Problem Description

nn

OFind it!

3

Problem PropertyProblem Property

• NP‐hard• Worst/average case equivalence property/ g q p p y

– at 1997, Ajtai shown the equivalence between worst‐case and average‐case lattice problemworst case and average case lattice problem [Ajtai’97]

• Many problems could be reduced to SVP• Many problems could be reduced to SVP– Knapsack, Subset Sum, factoring, approximate GCD bl tGCD problem…etc

• The first Fully Homomorphic Encryption Scheme is based on lattice problem is proposed at 2009 [Gentry’09] 4

Known AlgorithmKnown Algorithm• Approximation Algorithm

– Lattice Basis Reduction Algorithm– LLL, ratio ~ 1.02n in average case [Nguyen and Stehlé, 2006], g [ g y , ]– BKZ, ratio ~1.01n in average case

• with a parameter “block size”p– the approximate ratio are all exponential to input dimension

• Exact AlgorithmExact Algorithm– Lattice Enumeration

• Super exponential time and polynomial space• Super‐exponential time and polynomial space• But seems exponential time in practical

Si– Sieve• both exponential time and space 5

ApplicationApplication• Factoring polynomials over the integers or the rational• Factoring polynomials over the integers or the rational

numbers– E g given x2‐1 return x‐1 x+1E.g. given x 1, return x 1, x+1

• Finding the minimal polynomial of an algebraic number given to a good enough approximation. g g pp– E.g, given 1.618033, return x2‐x+1=0

• Factoring number with known some bitFactoring number with known some bit• Approximate GCD problem• Integer ProgrammingInteger Programming• Knapsack problem• Subset Sum problemSubset Sum problem

6

Our ContributionsOur Contributions• Paralleli e and implement GNR’10 t i l tti• Parallelize and implement GNR’10 extreme pruning lattice

reduction on GPU and in Cloud– Highly parallelize, about 90% parallel benefit

O GTX 480 i b 12 i f h i7– One GTX 480 is about 12 times faster than one i7 core.– Extend the implementation to multiple GPUs and run on Amazon’s EC2

cloud services.d h i id i d fl ibl b di• Extend the pruning idea to pre‐computing and flexible bounding

function– Get the same output quality and speed up more than 10x in pre‐

ticomputing• New security level measure

– Tradition: Dollar‐day (e.g. buy 10 machines and run 5 days)– New: Dollar (e.g. rent computation power from Amazon)

• Estimate the cost of solving ASVP instances of SVP Challenge in higher dimensionsg

7

AlgorithmAlgorithm

• Lattice Enumeration• Lattice Enumeration using Extreme Pruning

8

AlgorithmAlgorithm• Lattice Enumeration

– Exhaustive search all the possible solutions

Gram‐SchmidtGram‐Schmidt Process

9

• Build the search tree according to the “coefficient vector”• Guess bound = ||b1|| or Gauss prediction• Run DFS to find the shortest vector

……

…

10

bound boundnew

Guess boundcn

new

Guess bound Early abort

cn‐1

Renew bound Gauss predictionp

Extreme pruning[GNR’10]

……

pruning[GNR 10] Parallel

c2

c111

AlgorithmAlgorithm

• Lattice Enumeration• Lattice Enumeration using Extreme Pruning

12

Idea of Extreme PruningIdea of Extreme Pruning• Proposed by Nicolas Gama, Phong Q. Nguyen and Oded Regev• Goal: Maximize the reward per operation• Only search the space where is the most possible to find out

solution• If failed, randomized the basis and search again.

13

Lattice Enumeration using Extreme Pruning

• Guess very tight bound for each level

boundn cnfor each level

• If failed, do a randomize to the basis then do

boundn‐1 cn‐1to the basis, then do Enumeration again.

• E.g.E.g.time for search each tree be 1000 times faster, and

…… ……

probability of finding out the vector be 10%,=> gain ~100x faster=> gain ~100x faster

bound2 c2

bound1 c114

How to decide bound for each level ?How to decide bound for each level ?

B di t (R R R ) [0 1]n• Bounding vector (R1, R2, …, Rn) [0,1]n, with R1 ≦ R2 ≦ … ≦ Rn

• Bounding function for level i is Ri．Guess Boundg i

• Linear bounding functionR = i / n– Ri = i / n

– Theoretical analysis ‐> the success probability 1/n [GNR’10]– In Practice ‐> the success probability is much higher– But the search space is still too large for high dimension…

• Polynomial: fitting the numerically optimized in [GNR’10]• Polynomial: fitting the numerically optimized in [GNR 10]– No theoretical guarantee on the probability– But it performs well in practice– success probability is about 10% by experiment

15

Bounding functionBounding function1

0 7

0.8

0.9

Linear

0.5

0.6

0.7

0.3

0.4Polynomial

0.1

0.2

00 5 10 15 20 25 30 35 40 45

16

ll lParallel• Parallel for Enumeration

– Parallel for one search tree• Parallel for Extreme Enumeration

– Parallel formany search treesParallel for many search trees

17

Parallel for EnumerationParallel for Enumeration

Parallel DFSupper tree + DFSlower tree

DFSDFS

pre‐computing

DFSupper tree one computer pre computing pre‐computing

DFSlower tree Parallel part

…Parallel part

Output the shortest vector …DFS

Parallel computing

collecting the sub‐tree results.

18

Parallel for Extreme EnumerationInput basis

…Randomized Basis with Seedi

Pre‐ComputationLLL, BKZ



…

Enumeration Enumeration Enumeration…

Output

19

ImplementationImplementation

h l d d i• We have Cloud and GPU version implementations.

• Cloud version: C++ and Hadoop Streaming• GPU version : CUDAGPU version : CUDA

• Memory Used:– For dimension n, needs 8n2+40n+164 bytes– n=100, needs 84 KB– n=800, needs 5.1 MB,

20

Implementation overviewpInput basis

R d i d B i ith S d…



Randomized Basis with Seedi

…

CPU Cloud Computing

GPU ENUMGPU ENUMGPU ENUM

Cl dCloudCloud ENUM

Cloud ENUM

GPU ENUMGPU ENUMGPU ENUM Cloud ENUM

Cloud ENUM

ENUM

CPU Cloud ComputingGPU Cloud Computing

Output

CPU Cloud ComputingGPU Cloud Computing

– One GTX 480 is about 12 times faster than one i7 core. 21

R ltResults• We launch our CPU Cloud in IIS, Taiwan

– Machine 0‐1 Intel Xeon E5430 • 2.66 (GHz) * 2(cpus) * 4(cores) * 1(thread)

– Machine 2‐3 Intel Xeon E5520 2 27 (GH ) * 2( ) * 4( ) * 2(th d )• 2.27 (GHz) * 2(cpus) * 4(cores) * 2(threads)

– Machine 4‐8 Intel Xeon E5620 • 2 40 (GHz) * 2(cpus) * 4(cores) * 2(threads)• 2.40 (GHz) 2(cpus) 4(cores) 2(threads)

– Total: 9 d 72 h i l 128 i t l• 9 nodes, 72 physical cores, 128 virtual cores,

• 306 GHz

22

Performance compare to single‐corep g

100000

Cloud Enum Performance with pre‐Computing BKZ40P15

10000

100000

1000

Time

56

80x

100

Cost

43x56x

1

10

Single‐coreGiantCloud (120X Frequency)

S d ti i i

182 83 84 85 86 87 88 89 90 91

Dimension

• Speed‐up ratio is increase23

Prune BKZ in Pre ComputingPrune BKZ in Pre‐Computing

0.9

18.00E‐01

9.00E‐01BKZ 30 Pruning 15

0.6

0.7

0.8

obab

ility

LLL 5.00E‐01

6.00E‐01

7.00E‐01

(sec)

BKZ 30

0 2

0.3

0.4

0.5

Success P

ro BKZ 10 Pruning 15BKZ 10BKZ 20 Pruning 15BKZ 20BKZ 30 Pruning 15

2.00E‐01

3.00E‐01

4.00E‐01

time (

0

0.1

0.2

0 10 20 30 40 50Dimension

BKZ 30 Pruning 15BKZ 30BKZ 40 Pruning 15

0.00E+00

1.00E‐01

30 35 40Dimension

• Almost the same quality as none‐pruning version

Dimension

• But Cost time is much less – in dimension 80 ‐120, speed‐up 10x faster 24

Total running timeTotal running time100000

10000

3540

1000

Time (secon

d)

BKZ 30 P i 15BKZ bl k i30

100

BKZ 30 Pruning 15

BKZ 35 Pruning 15

BKZ block size

1080 85 90 95 100 105 110

Di i

BKZ 40 Pruning 15

• Seems exponential increaseT d ff th ti (b i lit ) d

Dimension

• Tradeoff on the pre‐computing (basis quality) and extreme enumeration

25

BKZ / (BKZ+ ENUM)BKZ / (BKZ+ ENUM)1

0 7

0.8

0.9

0.5

0.6

0.7

Ratio

0.3

0.4

R

BKZ 30 Pruning 15

BKZ 35 Pruning 15

0

0.1

0.2

BKZ 40 Pruning 15

• To combine with “total running time” when percentage BKZ of

80 85 90 95 100 105 110

Dimension

• To combine with total running time , when percentage BKZ of total time is about 40‐60%, it gets the min total running time

26

Estimate Cost Time

10000000

100000000

1E+09

10000

100000

1000000

100

1000

10000

ost T

ime (day)

0.1

1

10

100 105 110 115 120 125 130

Co

0.001

0.01BKZ 35 Pruning 15

BKZ 40 Pruning 15

BKZ 55 Pruning 15

• This shows the optimal pre‐computing is BKZ d

0.0001Dimension

BKZ 55 Pruning 15

40 pruning 15 in dimension 104 to130.27

http://www.latticechallenge.org/svp‐challenge/28

Our RecordOur RecordN 1 di i 120 d 0• No. 1, dimension 120, seed 0– 2300 US dollars paid to Amazon

• Rent 64 machines about 14 4 hours• Rent 64 machines about 14.4 hours– Machine type

• Cluster GPU Quadruple Extra Large InstanceCluster GPU Quadruple Extra Large Instance • 2 x Intel Xeon X5570 CPUs• 22 GB of memory• 2 x NVIDIA Tesla “Fermi” M2050 GPUs• 1690 GB of instance storage

t 2 5 US d ll h• costs 2.5 US dollars per hour• http://aws.amazon.com/ec2/instance‐types/

29

Concluding RemarksConcluding Remarks

i i l lid i f ’• Empirical validation of GNR’10 extreme pruning

• Parallelize and implement GNR’10 extreme pruning on GPU and in Cloudp g

• Extend the pruning idea to pre‐computing and flexible bounding functionand flexible bounding function

• New security level measure• Estimate the cost of solving ASVP instances of SVP Challenge in higher dimensions

30

• Thank you

31

•Q & A ?•Q & A ?

32

extreme enumeration on GPU and in cloud ches...Extreme Enumeration on GPU and in Cl dClouds...

Documents

Transcript of extreme enumeration on GPU and in cloud ches...Extreme Enumeration on GPU and in Cl dClouds...