[PPT]PowerPoint Presentation - Indian Institute of...

Debdeep Mukhopadhyay Chester Rebeiro

Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur

Accelerations of Scalar Multiplication: Advanced Techniques

23-27 May 2011 Anurag Labs, DRDO

Non-Adjacency Form (NAF)

NAF(29) = (1,0,0,-1,0,1), since 29 = 32 - 4 + 1
Binary(29) = (1,1,1,0,1), since 29 = 16 + 8 + 4 + 1

Pros:
◦ NAF does not have any consecutive nonzero digits (hence called non-adjacent).
◦ Average density of nonzero digits in NAF is 1/3.
◦ It reduces the number of point additions in ECC scalar multiplication.

Cons:
◦ The NAF can be one digit longer than the binary representation.

Algorithm for NAF generation

k = 29
◦ k0 = 2 - (29 % 4) = 1, k = 29 - 1 = 28, k = 28/2 = 14
◦ k1 = 0 (note that it can never be nonzero), k = 7
◦ k2 = 2 - (7 % 4) = -1, k = 7 + 1 = 8, k = 4
◦ k3 = 0, k = 2
◦ k4 = 0, k = 1
◦ k5 = 2 - (1 % 4) = 1, k = 0 (algorithm terminates)
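The digit rule used in the worked example can be sketched in a few lines of Python. This is a minimal sketch, not the presenters' implementation; the function name `naf` and the least-significant-first output order are choices made here for illustration.

```python
def naf(k):
    """Return the NAF digits of a non-negative integer k, least significant first."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            d = 2 - (k % 4)  # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d           # k is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


print(naf(29))  # [1, 0, -1, 0, 0, 1], i.e. (1,0,0,-1,0,1) read from the MSB
```

Reading the list from right to left gives the MSB-first digit string used on the slides.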


Why Non-adjacent?

When k is odd, it can be either 4p+1 or 4p+3 (p is an integer).

Case 1: k = 4p+1
◦ ki = 1, k = (k - 1)/2 = 2p (even) => next NAF digit is 0

Case 2: k = 4p+3
◦ ki = -1, k = (k + 1)/2 = 2p+2 (even) => next NAF digit is 0


Scalar Multiplication with NAF

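Expected run time with NAF = (m/3) A + m D
Run time of the normal binary method = (m/2) A + m D
(A: point addition, D: point doubling, m: bit length of the scalar)

Note that the number of doublings is unchanged. Later we see a method that removes doubling altogether.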
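To make the operation counts concrete, here is a small sketch of left-to-right scalar multiplication driven by NAF digits. It is written against a generic group interface (`add`, `neg`, `double`, `identity` are hypothetical callables introduced for this example), and the demonstration uses plain integers under addition as a stand-in for an elliptic curve group, so that kP is simply k times P.

```python
def naf(k):
    """NAF digits of k >= 0, least significant first."""
    digits = []
    while k > 0:
        d = 0
        if k % 2 == 1:
            d = 2 - (k % 4)
            k -= d
        digits.append(d)
        k //= 2
    return digits


def scalar_mult_naf(k, P, add, neg, double, identity):
    """Left-to-right double-and-add/subtract; returns (kP, additions, doublings)."""
    Q, adds, doubles = identity, 0, 0
    for d in reversed(naf(k)):      # process from the MSB
        Q = double(Q)
        doubles += 1
        if d == 1:
            Q = add(Q, P)
            adds += 1
        elif d == -1:
            Q = add(Q, neg(P))      # subtraction is as cheap as addition on a curve
            adds += 1
    return Q, adds, doubles


# Toy group (an assumption for the demo): integers under addition.
result = scalar_mult_naf(29, 5, add=lambda a, b: a + b, neg=lambda a: -a,
                         double=lambda a: 2 * a, identity=0)
print(result)  # (145, 3, 6)
```

For k = 29 the NAF needs 3 additions, whereas the binary expansion (1,1,1,0,1) has four nonzero bits.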


Width w-NAF

k = 29, w = 3
Width-3 NAF digits = (1,0,0,0,0,-3), since 29 = 1·32 - 3

Pros:
◦ Density of nonzero digits = 1/(w+1)

Cons:
◦ Pre-computation is required, which means storage in hardware.
◦ Length is the same as that of the normal NAF.


Algorithm for width w-NAF generation

u ≡ k (mod 2^w), with -2^(w-1) ≤ u < 2^(w-1)

k = 29, w = 3
◦ k0 = -3, k = 16
◦ k1 = 0, k = 8
◦ k2 = 0, k = 4
◦ k3 = 0, k = 2
◦ k4 = 0, k = 1
◦ k5 = 1, k = 0 (algorithm terminates)
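The same loop generalises to width w by taking the signed residue of k modulo 2^w whenever k is odd. The following is a minimal sketch under that standard definition; the name `wnaf` and the least-significant-first ordering are illustrative choices.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant first."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            u = k % (1 << w)           # residue in [0, 2^w)
            if u >= (1 << (w - 1)):    # map to the signed range [-2^(w-1), 2^(w-1))
                u -= (1 << w)
            k -= u                     # k is now divisible by 2^w
        else:
            u = 0
        digits.append(u)
        k //= 2
    return digits


print(wnaf(29, 3))  # [-3, 0, 0, 0, 0, 1]
```

With w = 2 the function reproduces the ordinary NAF.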


Scalar Multiplication with width w-NAF

Pre-computation: 1 D + (2^(w-2) - 1) A
Expected run time = (m/(w+1)) A + m D
Run time of the normal binary method = (m/2) A + m D

Hence an architecture based on width-w NAF incurs an initial pre-computation phase.
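The pre-computation cost can be seen in a short sketch: the table holds the odd multiples P, 3P, ..., (2^(w-1) - 1)P, built with one doubling and 2^(w-2) - 1 additions. As before, the group interface (`add`, `neg`, `double`, `identity`) is hypothetical and the demo uses integers under addition as a toy group.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant first."""
    digits = []
    while k > 0:
        u = 0
        if k % 2 == 1:
            u = k % (1 << w)
            if u >= (1 << (w - 1)):
                u -= (1 << w)
            k -= u
        digits.append(u)
        k //= 2
    return digits


def scalar_mult_wnaf(k, P, w, add, neg, double, identity):
    """Left-to-right width-w NAF scalar multiplication with a table of odd multiples."""
    two_p = double(P)                       # 1 doubling
    table = {1: P}
    for i in range(3, 1 << (w - 1), 2):     # 2^(w-2) - 1 additions
        table[i] = add(table[i - 2], two_p)
    Q = identity
    for u in reversed(wnaf(k, w)):
        Q = double(Q)
        if u > 0:
            Q = add(Q, table[u])
        elif u < 0:
            Q = add(Q, neg(table[-u]))
    return Q


# Toy group (an assumption for the demo): integers under addition.
print(scalar_mult_wnaf(29, 5, 3, lambda a, b: a + b, lambda a: -a,
                       lambda a: 2 * a, 0))  # 145
```

For w = 3 the table holds only P and 3P, matching the cost 1 D + 1 A.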


Koblitz Curves

• Koblitz curves are a special class of elliptic curves, defined over GF(2^m) by y^2 + xy = x^3 + a x^2 + 1, where the elliptic curve parameter a is 0 or 1.
• Koblitz curves are computationally efficient compared to random curves, as the Frobenius map can be utilized to accelerate scalar multiplication.

The previous methods did not reduce the number of doubling operations. Koblitz proposed a family of curves on which scalar multiplication does not require any doubling. The curves were named after him.


Choice of the curve

Choice of the curve depends on a, which can be either 0 or 1. As we have seen, the elliptic curve is a group of points.

◦ The group should be chosen such that the ECDLP is difficult.
◦ The number of elements in the elliptic curve group is called the order of the group.
◦ For security, the order of the group should be very nearly prime (the product of a large prime and a small integer), as otherwise there are subgroups, whose orders are divisors of the group order, which make the curve cryptographically weak.
◦ The field elements belong to GF(2^m).

The subgroups are the groups of points over GF(2^d), where d | m. If m is prime, the only proper divisor is d = 1. Thus the only such subgroups are E0(GF(2)) and E1(GF(2)). It can be easily checked that:

E0(GF(2)) = {O, (0,1), (1,0), (1,1)}
E1(GF(2)) = {O, (0,1)}
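The two small groups can be checked by brute force. The sketch below assumes the standard Koblitz curve equation y^2 + xy = x^3 + a x^2 + 1 and simply tests all four (x, y) pairs over GF(2); the function name is illustrative.

```python
def points_over_gf2(a):
    """Points of y^2 + xy = x^3 + a*x^2 + 1 over GF(2); 'O' is the point at infinity."""
    pts = ["O"]
    for x in (0, 1):
        for y in (0, 1):
            if (y * y + x * y) % 2 == (x**3 + a * x * x + 1) % 2:
                pts.append((x, y))
    return pts


print(points_over_gf2(0))  # ['O', (0, 1), (1, 0), (1, 1)]
print(points_over_gf2(1))  # ['O', (0, 1)]
```

So the curve with a = 0 has 4 points over GF(2) and the curve with a = 1 has 2, which is where the small cofactors 4 and 2 of the Koblitz curve group orders come from.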


An Interesting Property

• The curve satisfies: (x^4, y^4) + 2(x, y) = µ(x^2, y^2), where µ = (-1)^(1-a)

• Define the Frobenius map as: τ(x, y) = (x^2, y^2)

• The Frobenius map follows the relation τ^2(P) + 2P = µτ(P), where µ = (-1)^(1-a).
• For a point P on the Koblitz curve, we can use this property of the Frobenius map to compute 2P = µτ(P) - τ^2(P).
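The Frobenius relation can be verified exhaustively on a toy field. The sketch below assumes the Koblitz curve y^2 + xy = x^3 + a x^2 + 1 over GF(2^5) with reduction polynomial x^5 + x^2 + 1 (both the field size and the polynomial are choices made for this example, far too small for cryptography), uses the standard affine group law for binary curves, and checks (x^4, y^4) + 2(x, y) = µ(x^2, y^2) for every point.

```python
M = 5
POLY = 0b100101  # x^5 + x^2 + 1, irreducible over GF(2)

def mul(a, b):
    """Multiply two elements of GF(2^M) stored as bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def inv(a):
    """Inverse of a != 0 as a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = mul(r, a)
    return r

O = None  # point at infinity

def neg(P):
    return O if P is O else (P[0], P[0] ^ P[1])

def add(P, Q, a):
    """Affine addition on y^2 + xy = x^3 + a*x^2 + 1."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return O                        # Q = -P, or P has order 2
        lam = x1 ^ mul(y1, inv(x1))         # doubling
        x3 = mul(lam, lam) ^ lam ^ a
        return (x3, mul(x1, x1) ^ mul(lam ^ 1, x3))
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, mul(lam, x1 ^ x3) ^ x3 ^ y1)

def frob(P):
    return O if P is O else (mul(P[0], P[0]), mul(P[1], P[1]))

def relation_holds(a):
    mu = 1 if a == 1 else -1                # mu = (-1)^(1-a)
    pts = [O] + [(x, y) for x in range(2**M) for y in range(2**M)
                 if mul(y, y) ^ mul(x, y) == mul(mul(x, x), x) ^ mul(a, mul(x, x)) ^ 1]
    for P in pts:
        lhs = add(frob(frob(P)), add(P, P, a), a)
        rhs = frob(P) if mu == 1 else neg(frob(P))
        if lhs != rhs:
            return False
    return True

print(relation_holds(0), relation_holds(1))
```

Because τ only squares coordinates, replacing 2P by µτ(P) - τ^2(P) trades a point doubling for cheap field squarings.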


τ-adic NAF

The scalar k can be represented as a polynomial k = Σ ki τ^i with digits ki in {0, 1, -1}, where τ is the indeterminate.
◦ This sum is analogous to the binary expansion.
◦ The scalar is said to belong to the ring Z[τ].
◦ It can be proved that the τ-adic NAF representation is unique.


Divisibility by τ

In order to generate this NAF, we divide the element by τ, like we divided by 2 in the binary NAF.

As it is a NAF, the remainder is generated such that the next NAF digit is zero.


Algorithm for τ-adic NAF generation

k = 29. For a = 1 (µ = 1), the τ-adic NAF is (1,0,1,0,0,-1,0,0,0,-1,0,1)
=> 29 = 1 - τ^2 - τ^6 + τ^9 + τ^11
=> 29P = P - τ^2(P) - τ^6(P) + τ^9(P) + τ^11(P)
=> 29P = (x,y) - (x^4,y^4) - (x^64,y^64) + (x^512,y^512) + (x^2048,y^2048)

Thus, the scalar multiplication avoids any doubling operation; instead it performs cheap squaring operations. It may be noted that the length is roughly twice that of the binary expansion, hence a reduction is necessary.
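A τ-adic NAF can be generated by repeated division by τ, in the same spirit as the binary NAF. The sketch below follows the commonly described digit rule (for an element r0 + r1·τ with r0 odd, the digit is 2 - ((r0 - 2·r1) mod 4)) and uses τ^2 = µτ - 2 for the division. The function names are illustrative, and `evaluate` is a helper added here only to check an expansion by folding it back into the form c0 + c1·τ.

```python
def tnaf(r0, r1, mu):
    """tau-adic NAF digits of r0 + r1*tau, least significant first (mu = +1 or -1)."""
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2 != 0:
            u = 2 - ((r0 - 2 * r1) % 4)   # +1 or -1, chosen so the next digit is 0
            r0 -= u
        else:
            u = 0
        digits.append(u)
        half = r0 // 2                     # r0 is even here
        r0, r1 = r1 + mu * half, -half     # exact division by tau
    return digits


def evaluate(digits, mu):
    """Fold digits back into (c0, c1), meaning c0 + c1*tau, using tau^2 = mu*tau - 2."""
    c0, c1 = 0, 0
    for u in reversed(digits):
        # (c0 + c1*tau)*tau = -2*c1 + (c0 + mu*c1)*tau
        c0, c1 = -2 * c1 + u, c0 + mu * c1
    return c0, c1


digits = tnaf(9, 0, 1)
print(digits)                 # [1, 0, 0, -1, 0, 1], i.e. 9 = 1 - tau^3 + tau^5 for mu = 1
print(evaluate(digits, 1))    # (9, 0)
```

The round trip through `evaluate` is a convenient way to confirm any hand-computed expansion for either value of µ.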


Reduction of the scalar

τ^m(P) = P [from Fermat’s Little Theorem]
=> (τ^m - 1)(P) = O
◦ Hence, if γ ≡ k (mod τ^m - 1), then γ(P) = k(P)


Reduction of Scalar

• Solinas presented an efficient algorithm for reduction of a scalar. The algorithm involves integer multiplication and is thus costly for hardware implementations.

• An alternative approach known as Lazy Reduction was proposed by Brumley and Jarvinen, which uses the observation that division by τ is cheap.

• The algorithm uses multiplication and division by 2 and integer additions. Implementation is simple and the area requirement is small.

• The algorithm takes m clock cycles, hence "lazy".


Scalar Multiplication with τ-adic NAF

Expected run time with τ-adic NAF = (m/3) A
Run time of the normal binary method = (m/2) A + m D


Summary

• Basic steps of scalar multiplication on Koblitz curves:
◦ Reduction of the scalar.
◦ NAF generation from the reduced scalar.
◦ Point addition for nonzero NAF digits. Point addition is performed in the Lopez-Dahab projective co-ordinate system.
◦ Point squaring for every NAF digit.
◦ Final field inversion to transform the scalar multiplication result from the projective co-ordinate system into the affine co-ordinate system.

• Our Koblitz curve scalar multiplier uses the simple NAF method for scalar multiplication.


Top Level Architecture



Architecture for Reduction of Scalar

• An arrangement of adders and shift circuits is used to perform reduction of the scalar. Here u is the LSB of c0. There are registers to store intermediate values. The control unit generates control signals for the multiplexers and write enable signals for the storage registers. Initially the storage register for c0 contains the value of the scalar.


T-NAF Generation from Reduced Scalar

T-NAF digits are generated after performing reduction of the scalar. As the algorithm does integer additions and subtractions, the adders of the reduction circuit can be used to generate T-NAF digits.

Reduced scalar: r0 = b0 + c0, r1 = b1 + c1

Each digit can be found by observing the last two bits of c0 and c1.


Architecture for Reduction & T-NAF Generation

• The left portion of the circuit is used to generate digits. The NAF generation and reduction hardware shares the adders and registers. During reduction, control signal M is set to 0. After the reduction is over, NAF generation starts and M is changed to 1.


Choice of Scalar Multiplication Algorithm

• There are two scalar multiplication algorithms in literature:
◦ Process the scalar starting from MSB (Left-to-Right).
◦ Process the scalar starting from LSB (Right-to-Left).

• The Left-to-Right algorithm first computes the entire NAF of the reduced scalar and then starts processing the NAF from the MSB.

• So, it waits for the entire NAF generation, and this takes nearly m clock cycles in GF(2^m).

• Additionally, at every iteration, Q is squared. So, when a point addition is in progress, squaring cannot be performed in parallel. But squaring is cheap in hardware, and the algorithm does not use this opportunity for parallel processing.


Fast Scalar Multiplication Algorithm

• The Right-to-Left algorithm for scalar multiplication is shown below

• The scalar multiplication does not wait for the entire NAF of the scalar. As soon as the LSB, i.e. the first NAF digit, is generated, the scalar multiplication starts.

• Additionally, point addition updates only Q, and point squaring is independent of Q.

• So, we can use the fact that point squaring is cheap in hardware and can be performed in parallel with point addition. So, we select this Right-to-Left algorithm for scalar multiplication.
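The right-to-left structure can be sketched as a loop that consumes digits as they are produced. This is an illustrative sketch, not the presenters' algorithm listing: the digit source is a Python generator, `step` stands for the per-digit update of P (the Frobenius map on a Koblitz curve), and the demo substitutes the binary NAF with doubling on a toy group of integers so that the result is easy to check.

```python
def naf_digits(k):
    """Yield NAF digits of k >= 0 one at a time, starting from the LSB."""
    while k > 0:
        d = 0
        if k % 2 == 1:
            d = 2 - (k % 4)
            k -= d
        yield d
        k //= 2


def right_to_left(digits, P, add, neg, step, identity):
    """Process digits LSB first: Q accumulates, P is advanced independently of Q."""
    Q = identity
    for d in digits:            # each digit is used as soon as it is available
        if d == 1:
            Q = add(Q, P)       # only Q is updated by the addition
        elif d == -1:
            Q = add(Q, neg(P))
        P = step(P)             # independent of Q, so it can overlap with the addition
    return Q


# Toy group (an assumption for the demo): integers, with step = doubling.
print(right_to_left(naf_digits(29), 5, lambda a, b: a + b, lambda a: -a,
                    lambda a: 2 * a, 0))  # 145
```

Since `step` never reads Q, hardware can run it on a separate datapath while the adder is busy.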


Point Addition Unit

• The point addition unit does point addition in the Lopez-Dahab co-ordinate system and takes 8 clock cycles. Initially the three registers (RA1, RB1, RC1) are initialized with the base point (Px, Py, 1). After every point addition, the result Q ← Q + P is stored in the registers (RA1, RB1, RC1). In the figure, P = (RA2, RB2). In every clock cycle a field multiplication is performed, and the multiplier is of Hybrid Karatsuba type. Control signals are used to control the multiplexers and the write enable signals for the storage registers.


Point Addition Unit


Point Squaring Unit

• During scalar multiplication, point squaring is performed in every clock cycle. The base point is updated as P(x, y) ← P(x^2, y^2). Point squarings are performed using dedicated squarer circuits, as squarers are cheap.

• From the scalar multiplication algorithm, it can be seen that point squaring is independent of point additions.

• A nonzero NAF digit is followed by several zero digits (NAF property). So, during point addition, we can continue point squaring in parallel until another nonzero NAF digit appears.


Point Squaring Unit

The NAF digits are generated from the LSB side. Let us consider a portion of the entire NAF to be <. . . . . .1 0 0 0 0 0 1 . . . . .>.

For the first 1, a point addition is required, and this point addition takes 8 clock cycles. From the algorithm, it can be seen that for a nonzero NAF digit u, a point addition takes place and uses the present value of P.

If we consider only sequential processing, then after performing the point addition for 1, the algorithm will perform 6 point squarings for the sequence <0 0 0 0 0 1>. This will require 6 clock cycles.

As P is independent of Q, we can perform the 6 point squarings in parallel with the point addition (which takes 8 clock cycles), thus saving 6 clock cycles.

When the next nonzero digit appears in <. . 1 0 0 0 0 0 1 . . >, we must stop this parallel processing of zeros, as the last updated value of P for <. . 1 0 0 0 0 0 1 . . > will be required during the next point addition.
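The saving can be estimated with a simplified cycle-count model. The sketch below is an assumption-laden approximation, not a description of the actual control unit: each point addition costs 8 cycles, each point squaring 1 cycle, and in the parallel case the squarings between two nonzero digits overlap with the preceding addition, so that segment costs the larger of the two.

```python
def cycle_counts(digits, add_cycles=8, sq_cycles=1):
    """Return (sequential, parallel) cycle estimates for NAF digits given LSB first."""
    nonzero = [i for i, d in enumerate(digits) if d != 0]
    if not nonzero:
        return 0, 0
    # Squarings needed before the first addition cannot overlap with anything.
    seq = par = nonzero[0] * sq_cycles
    for j, i in enumerate(nonzero):
        nxt = nonzero[j + 1] if j + 1 < len(nonzero) else i
        gap = (nxt - i) * sq_cycles       # squarings until the next nonzero digit
        seq += add_cycles + gap           # addition, then squarings
        par += max(add_cycles, gap)       # squarings hidden behind the addition
    return seq, par


print(cycle_counts([1, 0, 0, 0, 0, 0, 1]))  # (22, 16)
```

For the pattern <1 0 0 0 0 0 1> the model gives 22 cycles sequentially and 16 with overlap, i.e. the 6 squaring cycles are hidden behind the first addition.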


Architecture for Point Squaring Unit

• The point P(x, y) is in affine co-ordinate and two dedicated squarers are used for squaring x and y co-ordinates.

• Initially the registers are assigned with the base point. When the scalar multiplication starts, point squaring is performed for every digit and the registers are updated.

• A write enable signal en is used to protect the content of the registers from unnecessary squarings, especially for the case (another nonzero digit) mentioned in the previous slide.


Architecture for Inversion

• Scalar multiplication, when done in the Lopez-Dahab co-ordinate system, requires a final inversion after processing the entire scalar.

• For ECC, Itoh-Tsujii inversion is efficient.
• In a field GF(2^m), the inverse of an element a is a^(2^m - 2).
• Using the quad operation we can compute the inverse. Here is an example for the field GF(2^233).

• This requires multiplications and repeated quad operations. We can implement this using a multiplier and quad circuits.
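The idea behind Itoh-Tsujii inversion can be sketched in software. The block below is a simplified illustration, not the quad-based design of the slides: it uses plain squarings instead of quad operations, a binary addition chain on m - 1 = 232, and the NIST reduction polynomial x^233 + x^74 + 1 for GF(2^233). It builds β_k = a^(2^k - 1) step by step and finishes with one squaring to reach a^(2^m - 2).

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^233) stored as integers below 2^233."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_sqr_n(a, n):
    """Square a repeatedly n times, i.e. compute a^(2^n)."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a

def itoh_tsujii_inverse(a):
    """Inverse of a != 0 as a^(2^M - 2), using beta_k = a^(2^k - 1)."""
    beta, k = a, 1                                   # beta_1 = a
    for bit in bin(M - 1)[3:]:                       # bits of M-1 after the leading 1
        beta = gf_mul(gf_sqr_n(beta, k), beta)       # beta_2k = beta_k^(2^k) * beta_k
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_sqr_n(beta, 1), a)      # beta_(k+1) = beta_k^2 * a
            k += 1
    return gf_mul(beta, beta)                        # (beta_(M-1))^2 = a^(2^M - 2)

a = 0x123456789ABCDEF
print(gf_mul(a, itoh_tsujii_inverse(a)))  # 1
```

Only about ten field multiplications are needed; everything else is repeated squaring, which is why cascaded squaring (or quad) circuits pay off in hardware.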

Architecture for Inversion

• This is the basic block diagram for the inversion unit. The multiplier is actually a part of the point addition unit. This multiplier is shared between the point addition unit and the inversion unit.

• It can be seen from the previous slide that there are repeated quad operations. For example, step 7 needs 14 consecutive quad operations. If we use a single quad circuit, then this exponentiation will take 14 clock cycles. To reduce the number of clock cycles, we use a cascade of several quad circuits. This cascade of quad circuits is called a Quadblock.


Architecture for Quadblock …

• Here is an example of a Quadblock which contains 11 cascaded quad circuits. So, an element a can be raised to a maximum of a^(4^11) in one pass.

• A multiplexer is used to get intermediate results.
• To raise an element to a power which needs more quad operations than the number of cascaded quad circuits, repeated application of the Quadblock is required. So, the number of clock cycles depends on the number of quad circuits. For example, an exponentiation that needs between 12 and 22 quad operations can be done in two clock cycles.

• The number of clock cycles reduces if we increase the number of quad circuits, but the delay increases. So, there must be a balance in the design between delay and the number of quad circuits.
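The trade-off can be put into numbers with a one-line model. This is a simplification that assumes one pass through the Quadblock per clock cycle and ignores the longer critical path of a deeper cascade; the function name is illustrative.

```python
import math

def quadblock_cycles(n_quads, block_size):
    """Clock cycles to apply n_quads quad operations with block_size cascaded circuits."""
    return math.ceil(n_quads / block_size)

print(quadblock_cycles(14, 1))   # 14 cycles with a single quad circuit
print(quadblock_cycles(14, 11))  # 2 cycles with an 11-stage Quadblock
```

A longer cascade lowers the cycle count but lengthens the combinational delay, which is the balance the slide refers to.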


Experimental Performance

• Experimentation was performed on a Xilinx Virtex V FPGA for GF(2^283).
• The scalar multiplier on a random curve in the field GF(2^283) has an area of around 40,000 LUTs, a frequency of 37 MHz and a computation time of 63 microseconds.

• The Koblitz curve scalar multiplier (in first stage of implementation), which uses the simple NAF method in GF(2^283), has an area of 41,300 LUTs, a frequency of 31 MHz and an average computation time of 35 microseconds.

• It can be seen that the Koblitz curve crypto processor takes almost half the computation time of the random curve crypto processor.


Further Acceleration

• We have found a novel technique to reduce number of point additions during scalar multiplication using representation of a scalar.

• For any scalar, we have found that length of is close to half of the length of .

• However, there is an overhead of small amount of pre-computations and an increased area.

• In Virtex IV FPGA, scalar multiplication using for the field GF(2^283) saves 35% computation time compared to method.


Thank You

