Diego de Freitas Aranha, ecc2011.loria.fr/slides/aranha.pdf

Software implementation of pairings

Diego de Freitas Aranha

September 21, 2011

Department of Computer Science, University of Brasília

Joint work with K. Karabina, P. Longa, C. Gebotys, J. López, D. Hankerson, A. Menezes, E. Knapp, F. Rodríguez-Henríquez, L. Fuentes-Castañeda, J.-L. Beuchat, J. Detrey, N. Estibals.


Introduction

Pairing-Based Cryptography (PBC) enables many elegant solutions to cryptographic problems:

Identity-based encryption

Short signatures

Non-interactive authenticated key agreement

Pairing computation is the most expensive operation in PBC.

Important: Make it faster!


Objective

Explore new ways to accelerate serial and parallel implementations of cryptographic pairings:

Maximize throughput

Minimize latency

Applications: servers, real-time services.

Contributions

Lazy reduction in extension fields

Elimination of penalty for negative parameterizations

Compressed cyclotomic squarings

Parallelization of Miller’s Algorithm

Delayed squarings and new formulations

Notes on high security levels and current state-of-the-art


Bilinear pairings

Let $G_1 = \langle P \rangle$ and $G_2 = \langle Q \rangle$ be additive groups and $G_T$ be a multiplicative group such that $|G_1| = |G_2| = |G_T| = n$, with $n$ prime.

An efficiently computable map $e : G_1 \times G_2 \to G_T$ is an admissible bilinear map if the following properties are satisfied:

1. Bilinearity: given $(V, W) \in G_1 \times G_2$ and $(a, b) \in \mathbb{Z}_q^*$:

$e(aV, bW) = e(V, W)^{ab} = e(abV, W) = e(V, abW)$.

2. Non-degeneracy: $e(P, Q) \neq 1_{G_T}$, where $1_{G_T}$ is the identity of the group $G_T$.
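The two properties can be checked numerically on a deliberately insecure toy map. The sketch below is an illustration only and is not an elliptic-curve pairing: it assumes $G_1 = G_2 = (\mathbb{Z}_{11}, +)$ with generator 1, takes $G_T$ to be the order-11 subgroup of $\mathbb{Z}_{23}^*$ generated by 2, and defines $e(x, y) = 2^{xy} \bmod 23$. All parameters and names are hypothetical choices made for the example.

```python
# Toy bilinear map (NOT secure, NOT a curve pairing; illustration only).
# Assumed parameters: n = 11, G1 = G2 = (Z_11, +), GT = <2> inside Z_23^*.
n, q, g = 11, 23, 2          # 2 has multiplicative order 11 modulo 23

def e(x, y):
    """e(x, y) = g^(x*y) mod q, bilinear in both arguments."""
    return pow(g, (x * y) % n, q)

P = Q = 1                    # generators of the additive groups
V, W, a, b = 3, 7, 4, 9

lhs = e(a * V % n, b * W % n)
print(lhs == pow(e(V, W), a * b, q))   # bilinearity
print(e(P, Q) != 1)                    # non-degeneracy
```

In this toy group discrete logarithms are trivial, so it only demonstrates the algebraic definition, not any security property.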


Bilinear pairings

If G1 = G2, the pairing is symmetric.


Barreto-Naehrig curves

Let $u$ be an integer such that $p$ and $n$ below are prime:

$p = 36u^4 + 36u^3 + 24u^2 + 6u + 1$

$n = 36u^4 + 36u^3 + 18u^2 + 6u + 1$

Then $E : y^2 = x^3 + b$, $b \in \mathbb{F}_p$, is a curve of order $n$ and embedding degree $k = 12$.

Example: $u = -(2^{62} + 2^{55} + 1)$, $b = 2$ (implementation-friendly).
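The example parameter can be plugged into the two polynomials directly. The sketch below computes $p$ and $n$ for $u = -(2^{62} + 2^{55} + 1)$ and runs a Miller-Rabin test on both; the helper `is_probable_prime` and its fixed bases are assumptions of this sketch, and the test is probabilistic at this size rather than a proof of primality.

```python
# BN parameters from the example: u = -(2^62 + 2^55 + 1).
u = -(2**62 + 2**55 + 1)
p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1

def is_probable_prime(m, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Miller-Rabin test with fixed bases (hypothetical helper for this sketch)."""
    if m < 2:
        return False
    for b in bases:
        if m % b == 0:
            return m == b
    d, s = m - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for b in bases:
        x = pow(b, d, m)
        if x in (1, m - 1):
            continue
        for _ in range(s - 1):
            x = x * x % m
            if x == m - 1:
                break
        else:
            return False       # b is a witness of compositeness
    return True

print(p.bit_length(), n.bit_length())               # sizes of p and n in bits
print(is_probable_prime(p), is_probable_prime(n))
print(p - n == 6*u*u)                               # the two polynomials differ by 6u^2
```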


Pairing computation

The pairing $e_r(P, Q)$ is defined by the evaluation of $f_{r,P}$ at a divisor related to $Q$.

[Miller 1986] constructed $f_{r,P}$ in stages, combining Miller functions evaluated at divisors.


Pairing computation

Let $l_{U,V}$ be the line equation through points $U, V \in E(\mathbb{F}_{q^k})$ and $v_U$ the shorthand for $l_{U,-U}$.

For any integers $a$ and $b$, we have:

1. $f_{a+b,P}(D) = f_{a,P}(D) \cdot f_{b,P}(D) \cdot \frac{l_{aP,bP}(D)}{v_{(a+b)P}(D)}$;

2. $f_{2a,P}(D) = f_{a,P}(D)^2 \cdot \frac{l_{aP,aP}(D)}{v_{2aP}(D)}$;

3. $f_{a+1,P}(D) = f_{a,P}(D) \cdot \frac{l_{aP,P}(D)}{v_{(a+1)P}(D)}$.

[Barreto et al. 2002] showed how to evaluate $f_{r,P}$ at $Q$ using the final exponentiation in the Tate pairing.


Pairing computation

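Algorithm 1 Miller's Algorithm.

Input: $r = \sum_{i=0}^{\log_2 r} r_i 2^i$, $P$, $Q$.

Output: $e_r(P, Q)$.

1: $T \leftarrow P$
2: $f \leftarrow 1$
3: for $i = \lfloor \log_2(r) \rfloor - 1$ downto 0 do
4:   $f \leftarrow f^2 \cdot l_{T,T}(Q)$
5:   $T \leftarrow 2T$
6:   if $r_i = 1$ then
7:     $f \leftarrow f \cdot l_{T,P}(Q)$
8:     $T \leftarrow T + P$
9:   end if
10: end for
11: return $f^{(q^k-1)/n}$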
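Algorithm 1 can be exercised end to end on a tiny curve. The following is a minimal sketch under assumptions that are not taken from the slides: the supersingular curve $y^2 = x^3 + x$ over $\mathbb{F}_{59}$, subgroup order $r = 5$, embedding degree 2, and the distortion map $\psi(x, y) = (-x, iy)$ to obtain a second argument outside $E(\mathbb{F}_p)$. Vertical lines are skipped because their values lie in $\mathbb{F}_p$ and vanish in the final exponentiation. The function names are hypothetical.

```python
# Toy reduced Tate pairing via Miller's algorithm (illustration only).
# Assumptions: E: y^2 = x^3 + x over F_59, r = 5, k = 2,
# F_{p^2} = F_p[i]/(i^2 + 1), distortion map psi(x, y) = (-x, i*y).
p, r = 59, 5

def f2_mul(a, b):
    """Multiply a0 + a1*i by b0 + b1*i in F_{p^2}."""
    return ((a[0]*b[0] - a[1]*b[1]) % p, (a[0]*b[1] + a[1]*b[0]) % p)

def f2_pow(a, e):
    """Square-and-multiply exponentiation in F_{p^2}."""
    out = (1, 0)
    while e:
        if e & 1:
            out = f2_mul(out, a)
        a, e = f2_mul(a, a), e >> 1
    return out

def step(T, P, Q):
    """Return (T + P, l_{T,P}(psi(Q))) for affine points over F_p; None is infinity."""
    (x1, y1), (x2, y2) = T, P
    if x1 == x2 and (y1 + y2) % p == 0:
        return None, (1, 0)                      # vertical line: value in F_p, skipped
    if T == P:
        lam = (3*x1*x1 + 1) * pow(2*y1 % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam*lam - x1 - x2) % p
    y3 = (lam*(x1 - x3) - y1) % p
    xq, yq = -Q[0] % p, Q[1]                     # psi(Q) = (xq, yq*i)
    line = ((-y1 - lam*(xq - x1)) % p, yq % p)   # (y - y1) - lam*(x - x1) at psi(Q)
    return (x3, y3), line

def pairing(P, Q):
    """Miller loop over the bits of r, followed by the final exponentiation."""
    T, f = P, (1, 0)
    for bit in bin(r)[3:]:                       # bits below the leading one
        T, line = step(T, T, Q)
        f = f2_mul(f2_mul(f, f), line)
        if bit == "1":
            T, line = step(T, P, Q)
            f = f2_mul(f, line)
    return f2_pow(f, (p*p - 1) // r)

def mul(k, P):
    """Naive scalar multiplication, adequate for this tiny group."""
    R = None
    for _ in range(k):
        R = P if R is None else step(R, P, P)[0]
    return R

P = (25, 30)                                     # a point of order 5 on the toy curve
e = pairing(P, P)
print(e, f2_pow(e, r))
print(pairing(mul(2, P), mul(3, P)) == f2_pow(e, 6))   # bilinearity check
```

The last line compares $e(2P, 3P)$ with $e(P, P)^6$, which is the bilinearity property specialised to this toy group.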


Asymmetric pairing

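$a_{opt} : G_2 \times G_1 \to G_T$

$(Q, P) \mapsto \left( f_{r,Q}(P) \cdot l_{rQ,\pi_p(Q)}(P) \cdot l_{rQ+\pi_p(Q),-\pi_p^2(Q)}(P) \right)^{\frac{p^{12}-1}{n}}$

with $r = 6u + 2$, $G_1 = E(\mathbb{F}_p)$, $G_2 = E'(\mathbb{F}_{p^2})[n]$.

The towering is:

$\mathbb{F}_{p^2} = \mathbb{F}_p[i]/(i^2 - \beta)$, where $\beta = -1$.

$\mathbb{F}_{p^4} = \mathbb{F}_{p^2}[s]/(s^2 - \xi)$, where $\xi = 1 + i$.

$\mathbb{F}_{p^6} = \mathbb{F}_{p^2}[v]/(v^3 - \xi)$, where $\xi = 1 + i$.

$\mathbb{F}_{p^{12}} = \mathbb{F}_{p^4}[t]/(t^3 - s)$ or $\mathbb{F}_{p^6}[w]/(w^2 - v)$.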


Generalized lazy reduction

Intuitively, it is a trade-off between addition and modular reduction:

$(a \cdot b) \bmod p + (c \cdot d) \bmod p \equiv (a \cdot b + c \cdot d) \bmod p$

Observation: Pairings use non-sparse primes for $\mathbb{F}_p$!

Previous state-of-the-art ($3M + 2R$ in $\mathbb{F}_{p^2}$):

$a \cdot b = (a_0 b_0 + a_1 b_1 \beta) + [(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1]\, i$,

For $k = 2^i 3^j$, total of $(3^i \cdot 6^j)M + (2 \cdot 3^{i-1} \cdot 6^j)R$.


Generalized lazy reduction

Idea: Suppose $\mathbb{F}_{p^2}$ is a higher extension and apply recursively!

Any component $c$ of an element in $\mathbb{F}_{p^k}$ is ultimately computed as $c = \sum \pm a_i b_j \bmod p$, requiring a single reduction.

New state-of-the-art: total of $(3^i \cdot 6^j)M + kR$.

Remark 1: Montgomery bounds should be maintained for intermediate results. Choose $|p|$ accordingly.

Remark 2: Same idea applies to arithmetic in $E'(\mathbb{F}_{p^2})$.

Example: Multiplication in $\mathbb{F}_{p^{12}}$ goes from $54M + 36R$ to $54M + 12R$. In total, 40% of reductions are saved.
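The operation count can be reproduced on the first level of the tower. The sketch below models a multiplication in $\mathbb{F}_{p^4} = \mathbb{F}_{p^2}[s]/(s^2 - \xi)$ with Karatsuba at both levels and a single reduction per $\mathbb{F}_p$ coefficient, giving $9M + 4R$ instead of $9M + 6R$. It is a counting model rather than an optimized implementation: Python integers are unbounded, the names `fp2_mul_wide` and `fp4_mul_lazy` and the counters are hypothetical, and a real implementation would also have to keep intermediate values non-negative and within the Montgomery bounds of Remark 1.

```python
# Counting model of lazy reduction in F_{p^4} = F_{p^2}[s]/(s^2 - xi),
# with xi = 1 + i and i^2 = -1. Intermediate values are left unreduced.
u = -(2**62 + 2**55 + 1)
p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
count = {"M": 0, "R": 0}

def mul(a, b):
    """Integer multiplication, counted as one M."""
    count["M"] += 1
    return a * b

def red(x):
    """Modular reduction, counted as one R."""
    count["R"] += 1
    return x % p

def fp2_mul_wide(a, b):
    """Karatsuba product in F_{p^2} without any reduction (3M + 0R)."""
    t0, t1 = mul(a[0], b[0]), mul(a[1], b[1])
    t2 = mul(a[0] + a[1], b[0] + b[1])
    return (t0 - t1, t2 - t0 - t1)

def fp4_mul_lazy(A, B):
    """Karatsuba over F_{p^2}; reduce once per F_p coefficient (9M + 4R)."""
    v0 = fp2_mul_wide(A[0], B[0])
    v1 = fp2_mul_wide(A[1], B[1])
    sa = (A[0][0] + A[1][0], A[0][1] + A[1][1])
    sb = (B[0][0] + B[1][0], B[0][1] + B[1][1])
    v2 = fp2_mul_wide(sa, sb)
    xi_v1 = (v1[0] - v1[1], v1[0] + v1[1])            # (1 + i) * v1, unreduced
    c0 = (v0[0] + xi_v1[0], v0[1] + xi_v1[1])
    c1 = (v2[0] - v0[0] - v1[0], v2[1] - v0[1] - v1[1])
    return tuple(tuple(red(x) for x in c) for c in (c0, c1))

A = ((3, 5), (7, 11))          # (3 + 5i) + (7 + 11i)s
B = ((13, 17), (19, 23))       # (13 + 17i) + (19 + 23i)s
C = fp4_mul_lazy(A, B)
print(C, count)
```

Reducing after each $\mathbb{F}_{p^2}$ product instead would cost two reductions per product, six in total, which is the count given by the earlier formula for $k = 4$.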


Removing the inversion penalty

Consider $(p^{12} - 1)/n = (p^6 - 1)(p^2 + 1)(p^4 - p^2 + 1)/n$.

The hard part is $(p^4 - p^2 + 1)/n$, which requires 3 $|u|$-th powers.
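The factorization can be confirmed with exact integer arithmetic for the BN parameter used in this talk. The variable names `easy`, `phi12` and `hard` are labels chosen for this sketch.

```python
u = -(2**62 + 2**55 + 1)
p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1

easy = (p**6 - 1) * (p**2 + 1)       # the first two factors of the exponent
phi12 = p**4 - p**2 + 1              # the factor that contains n

print(easy * phi12 == p**12 - 1)     # the factorization is exact
print(phi12 % n == 0)                # n divides p^4 - p^2 + 1
hard = phi12 // n                    # exponent of the hard part
print(hard.bit_length())
```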

If $u < 0$, from the pairing definition:

$a_{opt}(Q, P) = \left[ f_{|r|,Q}(P)^{-1} \cdot h \right]^{\frac{p^{12}-1}{n}}$.

By distributing the power $(p^{12} - 1)/n$, we can compute instead:

$a_{opt}(Q, P) = \left[ f_{|r|,Q}(P)^{p^6} \cdot h \right]^{\frac{p^{12}-1}{n}}$.
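The step from the inverse to the $p^6$-th power can be spelled out as follows. Writing $f = f_{|r|,Q}(P)$, the only fact used is $f^{p^{12}} = f$ for $f \in \mathbb{F}_{p^{12}}^*$; this derivation is added as a reading aid.

```latex
\begin{align*}
\left(f^{p^6}\right)^{p^6 - 1} &= f^{p^{12} - p^6} = f^{1 - p^6} = \left(f^{-1}\right)^{p^6 - 1},\\
\left[f^{p^6} \cdot h\right]^{\frac{p^{12}-1}{n}}
  &= \left[\left(f^{p^6}\right)^{p^6-1} \cdot h^{p^6-1}\right]^{\frac{(p^2+1)(p^4-p^2+1)}{n}}
   = \left[f^{-1} \cdot h\right]^{\frac{p^{12}-1}{n}}.
\end{align*}
```

In the representation $\mathbb{F}_{p^{12}} = \mathbb{F}_{p^6}[w]/(w^2 - v)$ the map $f \mapsto f^{p^6}$ is a conjugation, so it replaces a field inversion with a few negations.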


Revised pairing computation

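Algorithm 2 Miller's Algorithm for general $r$, even $k$.

Input: $r = \sum_{i=0}^{\log_2 r} r_i 2^i$, $P$, $Q$.

Output: $e_r(P, Q)$.

1: $T \leftarrow P$
2: $f \leftarrow 1$
3: for $i = \lfloor \log_2(r) \rfloor - 1$ downto 0 do
4:   $f \leftarrow f^2 \cdot l_{T,T}(Q)$
5:   $T \leftarrow 2T$
6:   if $r_i = 1$ then
7:     $f \leftarrow f \cdot l_{T,P}(Q)$
8:     $T \leftarrow T + P$
9:   end if
10: end for
11: if $u < 0$ then $T \leftarrow -T$, $f \leftarrow f^{q^{k/2}}$
12: return $f^{(q^k-1)/n}$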



Consider $\mathbb{F}_{p^{12}} = \mathbb{F}_{p^4}[t]/(t^3 - s)$.

Let $g = \sum_{i=0}^{2} (g_{2i} + g_{2i+1}s)t^i \in G_{\phi_6}(\mathbb{F}_{p^2})$ and $g^2 = \sum_{i=0}^{2} (h_{2i} + h_{2i+1}s)t^i$ with $g_i, h_i \in \mathbb{F}_{p^2}$.

Given $C(g) = [g_2, g_3, g_4, g_5]$, it is efficient to compute $C(g^2) = [h_2, h_3, h_4, h_5]$.

Important: Decompression map $D$ requires one inversion in $\mathbb{F}_{p^2}$.



Recall that $|u| = 2^{62} + 2^{55} + 1$.

Idea: $g^{|u|}$ can now be computed in three steps:

1. Compute $C(g^{2^i})$ for $1 \le i \le 62$ and store $C(g^{2^{55}})$ and $C(g^{2^{62}})$

2. Compute $D(C(g^{2^{55}})) = g^{2^{55}}$ and $D(C(g^{2^{62}})) = g^{2^{62}}$

3. Compute $g^{|u|} = g^{2^{62}} \cdot g^{2^{55}} \cdot g$
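The structure of the three steps, 62 squarings followed by 2 multiplications, can be sketched in a stand-in group. The code below uses integers modulo a prime instead of the cyclotomic subgroup of $\mathbb{F}_{p^{12}}$ and performs no compression, so it only illustrates the exponent chain; the modulus `q` and the name `pow_abs_u` are hypothetical.

```python
# Stand-in group: integers modulo a prime (assumption of this sketch; the
# slides work with compressed squarings in the cyclotomic subgroup).
q = 2**127 - 1                     # hypothetical modulus for the illustration
U = 2**62 + 2**55 + 1              # |u|

def pow_abs_u(g):
    """g^|u| with 62 squarings and 2 multiplications."""
    acc, saved = g, {}
    for i in range(1, 63):         # acc = g^(2^i) after iteration i
        acc = acc * acc % q
        if i in (55, 62):
            saved[i] = acc         # the two powers that would be decompressed
    return saved[62] * saved[55] % q * g % q

g = 123456789
print(pow_abs_u(g) == pow(g, U, q))
```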

Remark: Montgomery's simultaneous inversion allows simultaneous decompression.

Example: Computing a $|u|$-th power is now 30% faster.
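Montgomery's simultaneous inversion trades several inversions for one inversion plus a few multiplications. A minimal sketch follows, assuming integers modulo a prime as a stand-in for $\mathbb{F}_{p^2}$; the function name `batch_inverse` and the modulus are choices made for this example.

```python
def batch_inverse(values, p):
    """Invert every element of `values` modulo p using a single inversion."""
    prefix = [1]
    for v in values:                       # prefix[i] = v_0 * ... * v_{i-1}
        prefix.append(prefix[-1] * v % p)
    inv = pow(prefix[-1], -1, p)           # the only modular inversion
    out = [0] * len(values)
    for i in range(len(values) - 1, -1, -1):
        out[i] = inv * prefix[i] % p       # (v_0...v_i)^-1 * (v_0...v_{i-1})
        inv = inv * values[i] % p          # now the inverse of v_0...v_{i-1}
    return out

p = 2**61 - 1                              # hypothetical prime modulus
vals = [3, 1234567, 987654321]
invs = batch_inverse(vals, p)
print(all(v * w % p == 1 for v, w in zip(vals, invs)))
```

With two values to decompress, as in step 2, this replaces the second inversion with three multiplications.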


Implementation results

Table: Operation counts for different implementations of the Optimal Ate pairing at the 128-bit security level (ML: Miller loop, FE: final exponentiation).

Work                   Phase    Operations in Fp
Beuchat et al. 2010    ML       6992M + 5040R
                       FE       4647M + 4244R
                       ML+FE    11639M + 9284R
Aranha et al. 2011     ML       6504M + 2736R
                       FE       3648M + 1926R
                       ML+FE    10152M + 4662R

[Pereira et al. 2011] has a slightly lower operation count, but it produces a slower implementation on the target platform.



Table: Timings in cycles for the asymmetric setting on 64-bit processors.

Beuchat et al. 2010
Operation          Phenom II    Core i7      Opteron      Core 2 Duo
Mult in Fp2        440          435          443          590
Squaring in Fp2    353          342          355          479
Miller Loop        1,338,000    1,330,000    1,360,000    1,781,000
Final Exp.         1,020,000    1,000,000    1,040,000    1,370,000
Pairing            2,358,000    2,330,000    2,400,000    3,151,000

Aranha et al. 2011
Operation          Phenom II    Core i5      Opteron      Core 2 Duo
Mult in Fp2        368          412          390          560
Squaring in Fp2    288          328          295          451
Miller Loop        898,000      978,000      988,000      1,275,000
Final Exp.         664,000      710,000      722,000      919,000
Pairing            1,562,000    1,688,000    1,710,000    2,194,000

Improvement        34%          28%          29%          30%

Important: Latency of around 0.5 ms on a 3 GHz Phenom II X4.
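The improvement row and the latency figure follow directly from the pairing timings in the two tables. A short check, with the cycle counts copied from the tables and the dictionary names chosen for this sketch:

```python
# Pairing timings in cycles, copied from the two tables above.
beuchat_2010 = {"Phenom II": 2_358_000, "Core i7": 2_330_000,
                "Opteron": 2_400_000, "Core 2 Duo": 3_151_000}
aranha_2011 = {"Phenom II": 1_562_000, "Core i5": 1_688_000,
               "Opteron": 1_710_000, "Core 2 Duo": 2_194_000}

# Relative reduction in cycles, column by column, rounded to whole percent.
improvement = [round(100 * (old - new) / old)
               for old, new in zip(beuchat_2010.values(), aranha_2011.values())]
print(improvement)

# Latency at 3 GHz: cycles divided by cycles per second, in milliseconds.
latency_ms = aranha_2011["Phenom II"] / 3e9 * 1e3
print(round(latency_ms, 2))
```

Note that the second column compares a Core i7 with a Core i5, so that figure is not a like-for-like comparison.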

Parallelization

Property of Miller functions

$f_{a \cdot b,P}(D) = f_{b,P}(D)^a \cdot f_{a,bP}(D)$

We can write $r = 2^w r_1 + r_0$ and compute $f_{r,P}(D)$:

$f_{r,P}(D) = f_{2^w r_1 + r_0,P}(D) = f_{r_1,P}(D)^{2^w} \cdot f_{2^w,r_1P}(D) \cdot f_{r_0,P}(D) \cdot \frac{l_{(2^w r_1)P,r_0P}(D)}{v_{rP}(D)}$.
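The decomposition combines two facts stated earlier: relation 1 of the Miller functions with $a = 2^w r_1$ and $b = r_0$, and the product property with $a = 2^w$ and $b = r_1$. The intermediate step, added here as a reading aid:

```latex
\begin{align*}
f_{2^w r_1 + r_0,P}(D)
  &= f_{2^w r_1,P}(D) \cdot f_{r_0,P}(D) \cdot \frac{l_{(2^w r_1)P,\,r_0P}(D)}{v_{rP}(D)} \\
  &= f_{r_1,P}(D)^{2^w} \cdot f_{2^w,\,r_1P}(D) \cdot f_{r_0,P}(D) \cdot \frac{l_{(2^w r_1)P,\,r_0P}(D)}{v_{rP}(D)}.
\end{align*}
```

Each of the three Miller functions on the last line can then be evaluated by a separate processor, provided the point $r_1P$ is available.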

If r has low Hamming weight, w can be chosen so that r0 is small.

For many processors, we can:

Apply the formula recursively

Write $r$ as $r = 2^{w_i} r_i + \cdots + 2^{w_2} r_2 + 2^{w_1} r_1 + r_0$

If P is fixed (private key), riP can also be precomputed.


Load balancing

Problem: We must determine an optimal partition $w_i$.

Let $c_1(1)$ be the cost of a serial loop and $c_\pi(i)$ be the cost of a parallel loop for processor $1 \le i \le \pi$.

We can count the operations executed by each processor and solve the system $c_\pi(1) = c_\pi(i)$ to obtain $w_i$. The speedup is:

$s(\pi) = \frac{c_1(1) + exp}{c_\pi(1) + par + exp}$,

where $par$ is the cost of parallelization and $exp$ is the cost of the final exponentiation.
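The speedup formula can be evaluated with the measured phase costs to see how much the serial final exponentiation limits the gain. The sketch below uses the Phenom II cycle counts of Aranha et al. 2011 and an idealized model that is not taken from the slides: the Miller loop is assumed to split evenly across processors and parallelization is assumed to be free, so the results are upper bounds rather than predictions.

```python
def speedup(c_serial, c_parallel, par, exp):
    """s(pi) = (c_1(1) + exp) / (c_pi(1) + par + exp)."""
    return (c_serial + exp) / (c_parallel + par + exp)

# Cycle counts for the Miller loop and the final exponentiation on the
# Phenom II (Aranha et al. 2011 timings).
miller, final_exp = 898_000, 664_000

# Idealized assumption (not from the slides): the Miller loop splits evenly
# and parallelization is free, so c_pi(1) = miller / pi and par = 0.
for procs in (1, 2, 4, 8):
    print(procs, round(speedup(miller, miller / procs, 0, final_exp), 2))
```

Even under this optimistic model the speedup stays below (898,000 + 664,000) / 664,000, about 2.35, because the final exponentiation remains serial.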


Symmetric pairing

A pairing-friendly supersingular binary elliptic curve is the set of solutions $(x, y) \in \mathbb{F}_{2^m} \times \mathbb{F}_{2^m}$ satisfying the equation

$y^2 + y = x^3 + x + b$,

where $b \in \{0, 1\}$, and a point at infinity $\infty$.


Symmetric pairing

Choosing $T = 2^m - N$ and a prime $n$ dividing $N$, [Barreto et al. 2004] defined the reduced $\eta_T$ pairing:

$\eta_T : E(\mathbb{F}_{2^m})[n] \times E(\mathbb{F}_{2^m})[n] \to \mathbb{F}_{2^{4m}}^*$

$\eta_T(P, Q) = f_{T',P'}(\psi(Q))^{\frac{2^{4m}-1}{N}}$,

where $T' = \pm T$ and $P' = \pm P$.

The function $f$ is a Miller function and $\psi$ is the distortion map $\psi(x, y) = (x^2 + s, y + sx + t)$.



For the asymmetric setting, estimated speedup of only 10%.

For the symmetric setting:

[Figure: speedup versus number of processors (up to 60), comparing Beuchat et al. 2009 and Aranha et al. 2010.]



Figure: Timings in the symmetric setting taken on an Intel Core 2 45nm. Latency in millions of cycles by number of threads.

Number of threads      1        2        4       8
Beuchat et al. 2009    23.03    13.14    9.08    8.93
Aranha et al. 2010     17.40    9.34     5.08    3.02



New parallelization:

No significant storage costs and almost-linear scalability

Latency improvement of 28%, 44% and 66% with 2, 4 and 8 processors

Limitations in the asymmetric setting:

Serial final exponentiation

Expensive point doublings

Expensive extension field squarings


Delayed squaring

Idea: Delay the squarings until we reach the cyclotomic subgroup!

Recall the parallelization ($M = \frac{q^k - 1}{r}$):

$f_{r,P}(D)^M = \left( f_{r_1,P}(D)^M \right)^{2^w} \cdot f_{2^w,r_1P}(D)^M \cdot \left( f_{r_0,P}(D) \cdot \frac{l_{(2^w r_1)P,r_0P}(D)}{v_{rP}(D)} \right)^M$.

Remark: Delayed squarings increase speedup to 18-20%.


Parallel pairing derivations

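Hess' instantiation (α-Weil)

$\alpha(P, Q) = \frac{f_{2u+1,P}(Q)}{f_{2u+1,Q}(P)} \left( \frac{f_{u,(6u+2)P}(Q) \, f^u_{6u+2,P}(Q)}{f_{u,(6u+2)Q}(P) \, f^u_{6u+2,Q}(P)} \right)^{p^2 (p^6-1)(p^2+1)}$

Critical path: $\left( \left( f^u_{u,(6u+2)Q}(P) \right)^{p^2} \right)^{(p^6-1)(p^2+1)}$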

Parallel pairing derivations

New instantiation (β-Weil)

$\beta(P, Q) = \left( \left( \frac{f_{p,h,P}(Q)}{f_{p,h,Q}(P)} \right)^p \frac{f_{p,h,pP}(Q)}{f_{p,h,Q}(pP)} \right)^{(p^6-1)(p^2+1)}$

Critical path: $pP$, $\left( f_{p,h,Q}(pP) \right)^{(p^6-1)(p^2+1)}$

Optimization:

$pP = 2u(p^2 - 2)P + p^2P - P = 2u(\phi(P) - 2P) + \phi(P) - P$.
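The first equality of the optimization is a statement about scalars modulo the group order $n$, so it can be checked numerically for the BN parameter used in this talk. The sketch assumes that $\phi(P) = p^2P$ on $G_1$, as the two forms of the formula suggest, and verifies only the scalar congruence, not the point arithmetic.

```python
u = -(2**62 + 2**55 + 1)
p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1

# Scalars act on points of order n modulo n, so it suffices to compare
# p with 2u(p^2 - 2) + p^2 - 1 modulo n.
rhs = 2*u*(p*p - 2) + p*p - 1
print((rhs - p) % n == 0)
```

The right-hand side replaces a multiplication by the 254-bit scalar $p$ with a multiplication by the much shorter scalar $2u$ and a few point additions.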



[Figure: speedup (0 to 2) versus number of processors (1 to 8) for the optimal ate pairing, optimal ate with delayed squaring, α-Weil pairing and β-Weil pairing.]

Best results until now:

Optimal ate pairing reaches speedup of 1.45 with 4 processors

β-Weil pairing reaches speedup of 1.86 with 8 processors


Curve choice at higher security levels

Important: Pairing security is defined by the hardness of the DLP in $G_1$, $G_2$, $G_T$.

Barreto-Naehrig curves are optimal at the 128-bit level

Security usually scaled by increasing embedding degree

Kachisa-Scott-Schaefer curves with $k = 18$ have been pointed to as the best family known for the 192-bit level

What about other families?


Curve choice at higher security levels

Table: Operation counts for the Optimal Ate pairing at the 192-bit security level. M is the cost of multiplying two 512-bit integers on a 64-bit machine.

Family                      Phase    Operations in Fp
BLS (k = 24, |p| = 478)     ML       14990M
                            FE       25785M
                            ML+FE    40775M
BN (k = 12, |p| = 638)      ML       26084M
                            FE       11284M
                            ML+FE    37368M
KSS (k = 18, |p| = 512)     ML       13817M
                            FE       23022M
                            ML+FE    36839M
BW (k = 12, |p| = 638)      ML       16823M
                            FE       12647M
                            ML+FE    29470M


State-of-the-art

Table: Timings in $10^3$ cycles on an Intel Core i7 Sandy Bridge 32nm at the 128-bit security level using the fastest multipliers available.

                                   Number of threads
Asymmetric pairing                 1       2       4       8
Optimal ate                        1562    1287    1137    1107
Improved optimal ate               –       1260    1080    1056
α-Weil                             –       –       1272    936
β-Weil                             –       –       1104    840

Symmetric pairing                  1       2       4       8
Genus-1 ηT                         6455    3370    1794    1034
Genus-2 Optimal η – general        8265    –       –       –
Genus-2 Optimal η – degenerate     2358    –       –       –


Conclusions and future

New techniques for implementing pairings:

Speed records for pairing computation in software (hardware)

Dependency on architectural features

Scalable parallelization

New pairing derivations

Emphasis on implementation of protocols:

Pairing type and optimizations differ greatly

Higher security levels should be more interesting


RELIC cryptographic library: http://code.google.com/p/relic-toolkit/

Thank you for your attention! Any questions?

References

D. F. Aranha, J. López, D. Hankerson. High-speed parallel software implementation of ηT pairing. CT-RSA 2010, 89–105.

D. F. Aranha, J.-L. Beuchat, J. Detrey, N. Estibals. Optimal Eta Pairing on Supersingular Genus-2 Binary Hyperelliptic Curves. Cryptology ePrint Archive, Report 2010/559.

D. F. Aranha, K. Karabina, P. Longa, C. Gebotys, J. López. Faster Explicit Formulas for Computing Pairings over Ordinary Curves. EUROCRYPT 2011, 48–68.

D. F. Aranha, E. Knapp, A. Menezes, F. Rodríguez-Henríquez. Parallelizing the Weil and Tate Pairings. IMA-CC 2011, to appear.

