
Master Thesis
Computer Science
Thesis no: MCS-2012-03
January 2012

Blum Blum Shub on the GPU
A performance comparison between a CPU bound and a GPU bound Blum Blum Shub generator

Mikael Olsson

Niklas Gullberg

School of Computing

Blekinge Institute of Technology

SE-371 79 Karlskrona

Sweden


This thesis is submitted to the School of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full time studies.

Contact Information:
Authors:
Mikael Olsson 19870522-4619
E-mail: [email protected]

Niklas Gullberg 19870917-4158
E-mail: [email protected]

University advisor:
Lecturer Andrew Moss
School of Computing

School of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona
Sweden
Internet : www.bth.se/com
Phone : +46 455 38 50 00
Fax : +46 455 38 50 57


Abstract

Context. The cryptographically secure pseudo-random number generator Blum Blum Shub (BBS) is a simple algorithm with a strong security proof; however, it requires very large numbers to be secure, which makes it computationally heavy. The Graphics Processing Unit (GPU) is a common vector processor originally dedicated to computer-game graphics, but it has since been adapted to perform general-purpose computing. The GPU has large potential for fast general-purpose parallel computing, but due to its architecture it is difficult to adapt certain algorithms to utilise its full computational power.

Objectives. The objective of this thesis was to investigate whether an implementation of the BBS pseudo-random number generator algorithm on the GPU would be faster than a CPU implementation.

Methods. In this thesis, we modelled the performance of a multi-precision number system with different data types, in order to decide which data type should be used for a multi-precision number system implementation on the GPU. The multi-precision number system design was based on a positional number system. Because multi-precision numbers were used, conventional methods for arithmetic were not efficient or practical. Therefore, addition was performed using Lazy Addition, which allows larger carry values in order to limit the amount of carry propagation required. Carry propagation was done using a technique derived from a Kogge-Stone carry look-ahead adder. Single-precision multiplication was done using Dekker splits, and multi-precision modular multiplication used Montgomery multiplication.

Results. Our results showed that the floating-point data type yields greater performance for a multi-precision number system on the GPU than the integer data type. Our GPU bound BBS implementation was about 4 times slower than a CPU version implemented with the GNU Multiple Precision Arithmetic Library (GMP).

Conclusions. The conclusion of this thesis is that our GPU bound BBS implementation is not a suitable alternative or replacement for the CPU bound implementation.

Keywords: Blum Blum Shub, Multi-precision, OpenCL, GPGPU


List of Figures

2.1 Layout of components for IEEE 754 single-precision floating-point
2.2 Processor architectures
3.1 OpenCL memory hierarchy
3.2 Results of Floating point and Integer With Group Size 64
3.3 Results of Floating point and Integer With Group Size 128
3.4 Results of Floating point and Integer With Group Size 256
4.1 Example of carry propagation with Binary Kogge-Stone
4.2 Example of carry propagation with Kogge-Stone
4.3 Nodes used in Kogge-Stone carry look-ahead adder
4.4 Parallel Complex Add
4.5 Multiplication
5.1 Results for Addition tests
5.2 Single Digit multiplication results
6.1 Time taken to execute BBS
6.2 CPU-ticks per random bit


List of Tables

2.1 Flynn's Taxonomy of Parallel Architectures
3.1 Floating point and integer test parameters
3.2 CPU-ticks per instruction set
3.3 Throughput
4.1 Truth table for Negation (NOT)
4.2 Truth table for Conjunction (AND), Disjunction (OR) and Exclusive Disjunction (XOR)
5.1 Addition test parameters
5.2 Multiplication test parameters
5.3 Single Digit multiplication results
6.1 Parameters for the GPU and CPU comparison of BBS
6.2 Latency of the BBS


List of Algorithms

2.1 Addition with ripple carry
2.2 Multiplication
3.1 Kernel addition of two arrays
3.2 Kernel base for benchmarking
3.3 Instruction set
3.4 Full kernel description
4.1 Parallel Simple Addition
4.2 Initial Binary Kogge-Stone values
4.3 Binary Kogge-Stone propagate and generate carry
4.4 Binary Kogge-Stone result
4.5 First Node
4.6 Propagation Node
4.7 Final Node
4.8 Parallel version of Kogge-Stone
4.9 Full precision multiplication
4.10 Dekker Splits
4.11 Dekker Multiplication
4.12 Transforming output from Dekker multiplication to PNS
4.13 Dekker using Absolute
4.14 Dekker using Volatile
4.15 Montgomery Reduction
4.16 Montgomery Multiplication
4.17 Parallel Montgomery Multiplication
4.18 Our Blum Blum Shub algorithm
5.1 Optimised Naive Splitting Multiplication
5.2 Multiplication test
6.1 GMP BBS


Contents

Abstract

1 Introduction
  1.1 Research Questions

2 Background
  2.1 Number handling in computers
    2.1.1 Single-precision types
    2.1.2 Multi-precision types
  2.2 Basic Arithmetic Algorithms
    2.2.1 Addition
    2.2.2 Multiplication
    2.2.3 Division
    2.2.4 Modulus
  2.3 Lazy Addition
  2.4 Blum Blum Shub generator
  2.5 GPUs and the SIMD Computational Model
  2.6 Linear regression
  2.7 Related work

3 Exploring the GPU
  3.1 OpenCL
    3.1.1 Kernel and Work-item
    3.1.2 Work-groups and Synchronisation
    3.1.3 Memory
  3.2 Floating-point or Integer
    3.2.1 Kernel description
    3.2.2 Performance analysis

4 Multi-Precision arithmetic
  4.1 Addition
    4.1.1 Simple Addition
    4.1.2 Complex Addition
    4.1.3 Binary Kogge-Stone
    4.1.4 Numerical Kogge-Stone
    4.1.5 Parallel algorithm
  4.2 Multiplication
  4.3 Montgomery
  4.4 Putting it together

5 Performance results
  5.1 Test equipment
  5.2 Addition
  5.3 Multiplication

6 Comparative results

7 Conclusion
  7.1 Future Work
  7.2 Acknowledgements

References


Chapter 1

Introduction

Deterministic systems always produce the same output when the same input conditions are provided. A great deal of effort is required to ensure that computers behave deterministically. A deterministic system by definition cannot be random, and randomness in such a system is a sign that something is wrong. Therefore, their design must be verified before they are constructed, and the result of each calculation has to be checked for errors during execution, to ensure that they are deterministic. Even though computers are deterministic systems themselves, they use randomness for a wide variety of purposes, e.g. choosing a random set of data to test, password generation, and generating encryption keys for cryptography. Because computers are deterministic systems, they are unable to generate true random numbers directly. However, true random numbers can be sampled from physical systems such as thermal noise, radioactive decay or radio noise.

Random numbers from a physical system are generated by sampling the entropy of the system. Entropy is a measure of unpredictability. A simple example to illustrate entropy is a coin flip. Under the assumption that it is fair, it provides one bit of entropy each flip. By contrast, a double-sided coin provides zero bits of entropy because it is entirely predictable. The entropy of a physical system varies with the physical phenomenon it is based on. Ideally, a physical system should be based on quantum phenomena that, in theory, are completely random. Non-quantum based physical systems can also be used for number generation; however, their entropy might be low, and data must be collected before a bit of entropy can be generated.

A consequence of having to accumulate entropy is that the speed at which physical systems generate numbers might not be sufficient for a high performance application. A way to increase the efficiency of physical systems is to take their output and use it as input, called a seed, for a pseudo random number generator (PRNG). A PRNG is an algorithm that maps the seed onto a sequence of numbers that statistically appears to be random. An important feature of PRNGs is repeatability. In situations where one wants to test something that relies on random numbers, such as in simulations, PRNGs will produce a consistent sequence of random numbers during the tests when the same seed is used.

To prove that the numbers generated by the PRNG have the appearance of


randomness, they have to pass statistical tests. One such test is to check uniformity of distribution, meaning that on average each output is equally likely. Another test of randomness checks the lack of correlation between numbers: any given number in the sequence from the PRNG does not affect the statistical probability of the other numbers. Cryptographically Secure PRNGs (CSPRNGs) are used for cryptographic purposes; because of this, they have stronger requirements on statistical properties as well as additional security requirements. Given a set of numbers from the generator, it should be hard to deduce the parameters or the internal state of the generator, such as the seed used by the generator. Additionally, more emphasis is put on the lack of correlation between the numbers. The basis of a computational proof of security is that, when given a set of numbers, one wants to minimize the probability of correctly guessing previous or subsequent numbers.

There exist several types of CSPRNG based on different ways to generate random numbers. Some examples are cryptographic hash functions, encryption algorithms, and number theoretical algorithms.

Hash functions can be used to create a CSPRNG. Hash functions work by taking an arbitrarily sized input and returning a fixed-size bit string called a hash value or fingerprint. The hash value is unique for an input, and the same input will always generate the same output. How the hash value is calculated depends on which hash function algorithm is used. To create a CSPRNG, a cryptographic hash function has to be used. There are several different cryptographic hash functions; a commonly used one is the Secure Hash Algorithm 256 (SHA-256). SHA-256 takes an arbitrarily sized input and outputs a 256-bit hash value. A desired property of cryptographic hash functions is that it should be infeasible to calculate the original value from the hash value; i.e. a hash function should act as a one-way function. The CSPRNG can be built by taking a randomly chosen value (a seed) and an incrementing counter starting at a secret value. SHA-256 can then be used to create a hash value of the sum of the seed and the current counter value. The resulting bit string can be used to create random numbers. To keep the random numbers secure, the seed value and the initial counter value must be kept secret.

Encryption algorithms are another type of algorithm that can be used to build a CSPRNG. To use an encryption algorithm for creating a CSPRNG, an encryption key and an incrementing counter are required. The encryption key, also called a seed, and the initial counter value are randomly chosen. The encryption key is used with the encryption algorithm to encrypt the counter values. The encrypted counter values can then be used to create random numbers. To keep the CSPRNG secure, the encryption key and the initial counter value have to be kept secret.


Number theoretical algorithms form a third type of CSPRNG. These algorithms are based on mathematical theories and problems that are thought to be hard to solve when there is limited knowledge about their initial states. If the problem of predicting the next number in the sequence can be reduced to the underlying number theoretical problem, then a proof of security follows.

Of hash functions, encryption algorithms, and number theoretical algorithms, hash functions have the best throughput but the weakest security, while number theoretical algorithms have the best security but the worst performance. This thesis investigates the possibility of increasing the throughput of the number theoretical CSPRNG algorithm Blum Blum Shub by using the graphics processing unit (GPU), which is a commodity vector processor.

At first, graphics computations were performed on the CPU. However, when the graphics computations became more complex and demanding, the CPU could not keep up. Therefore, the graphics computation was moved to a dedicated Graphics Processing Unit (GPU). The GPU was built with an array of pipelines that could perform graphics calculations in parallel. At this time, the graphics pipeline was mostly fixed functionality and focused on a particular type of graphics rendering intended for computer games. With an increasing demand for more complex and realistic graphics, advances were made in GPU processing speed, and programmable shaders were introduced into the graphics pipeline. Graphics shaders are sets of instructions that can manipulate different types of data; at the time, there were only two types of shaders, pixel and vertex shaders. With the introduction of graphics shaders, it became possible, given extensive knowledge about the graphics pipeline, to perform General-Purpose computation on Graphics Processing Units (GPGPU).

In 2007, the area of GPGPU took a leap forward with a major change that added additional types of shaders and unified the shaders into a single package that could perform all functions. At the same time, Nvidia released their GPGPU framework called Compute Unified Device Architecture (CUDA) [7]. CUDA enabled programmers to create small GPU programs, called kernels, without the requirement of understanding the graphics pipeline. However, the main drawback of CUDA is that it is Nvidia hardware specific. Two years later, the Khronos Group announced the Open Computing Language (OpenCL) [17], a platform-independent framework for high performance computing. It has many similarities with CUDA, such as the use of kernels, but unlike CUDA it is not hardware specific.

The main contribution of this thesis is the design, implementation and analysis of a novel approach to a BBS random generator on the GPU using OpenCL. A performance comparison is made against an equivalent CPU implementation. The highly parallel nature of the GPU shows promise for increasing the performance of the BBS. Our performance evaluation against the CPU version showed that our GPU version was slower, meaning our approach to a GPU based BBS is not


a suitable replacement or an alternative.

A second contribution is a method for measuring the execution time of small OpenCL based GPU programs. The OpenCL framework provides access to a high-precision timer. However, the precision of this timer and how much overhead it adds when used depend on the compute unit's type, model and manufacturer; e.g. Nvidia's GPU timer only has half a microsecond precision [9], while the Advanced Micro Devices (AMD) GPU timer has nanosecond precision but adds an overhead of several microseconds [21]. Therefore, to measure the run-time of the GPU programs with a consistent high-precision timer, the high-precision timer on the CPU had to be used. Using the CPU timer adds a large overhead to the measurements as well, but because it has much higher precision than the GPU timers, the overhead can be eliminated by measuring the execution time of several runs of the same GPU program and applying statistical methods. This method was vital for analysing our implementation, because to create the fastest possible BBS version, we had to measure each component to make sure the fastest implementation was used.

The remainder of this thesis is structured in the following manner. Chapter 2 describes how the BBS algorithm and the GPU architecture work, gives a short description of how computers handle numbers, and describes work related to this thesis. In Chapter 3, we explore the properties of the GPU. The algorithms we use for creating our BBS implementation are described in Chapter 4. Chapter 5 contains performance measurements of the different components that are required to implement BBS on a GPU. We compare our BBS implementation on the GPU with a CPU implementation in Chapter 6, and we draw our conclusions about our GPU implementation in Chapter 7.

1.1 Research Questions

• Is it possible to achieve a performance increase by running BBS on the GPU?

In this thesis, an implementation of the BBS generator on the GPU is shown to be possible and is explained in detail. However, our implementation of the BBS generator did not provide any performance advantage over a reference CPU implementation. Therefore, our implementation is not a viable replacement for a CPU implementation.

• How well does the GPU implementation scale compared to the CPU implementation when the work size is increased?

The GPU implementation scaled well and was consistently four times slower than the CPU implementation when 128 or more concurrent calculations were used. When fewer than 128 concurrent calculations were used,


the GPU could not mask the memory access delay with other calculations and was therefore even slower.

• What is an appropriate representation of large numbers to allow parallel processing?

Both the residue number system and the radix based multi-precision number system allow parallel processing, and related work has used both successfully. However, due to the complexity of implementing a residue number system, our implementation uses a radix based multi-precision number system.

• Assuming that the GPU is faster on large amounts of work, at which work size does the performance of the GPU outweigh the latency of transferring the work to the GPU?

Because our GPU implementation is slower than the CPU implementation, this research question was not investigated.


Chapter 2

Background

The standard way to represent numbers is called a Positional Number System (PNS). In a PNS, each position is a coefficient of a power of a predefined base or radix. The radix also identifies the possible values that can be used in each position; conventionally a radix r allows the values 0 through r−1. For example, the common decimal numeral system has a radix of ten that allows each position to have a value between 0 and 9. This system was based upon the human hands, which together have ten digits (fingers). Therefore, each position in the decimal numeral system is called a digit.

1234_{10} = (1 × 10^3) + (2 × 10^2) + (3 × 10^1) + (4 × 10^0)    (2.1)

Equation 2.1 shows a number with a radix of ten. The digit in each position is multiplied by a power of the radix, and the value represented is the sum of the partial products.

Computers use a radix of two, called binary. In the binary system, the value in each position is called a bit. Each bit can have the value 0 or 1. This system works particularly well for computers, as 0 and 1 can easily be encoded in digital circuits.

A PNS can use any radix, depending on what is most convenient for the task. Equation 2.2 is a general definition of the value represented by a PNS using an arbitrary radix r and a length of n positions. In a PNS with an arbitrary radix r, each positional value s_i is called a limb and can have a value from 0 to r − 1.

S = Σ_{i=0}^{n−1} s_i × r^i    (2.2)

When using arithmetic operations with a PNS, one or more positions in the system can end up with values that are larger than the radix can represent. In this situation, the excess of each position's value, also called a carry, has to be transferred to the next position. A common method to transfer the carry, also called carry propagation, is the ripple carry method. Ripple carry goes over each position systematically and transfers any carry values required. The carry value c is calculated with the division c = v / r, where v is the positional value and r the


radix used. The new positional value x is calculated with modulus, x = v mod r. An example of carry propagation with ripple carry is shown in Equation 2.3.

1234_{10} + 5678_{10} = (6 × 10^3) + (8 × 10^2) + (10 × 10^1) + (12 × 10^0)
                      = (6 × 10^3) + (8 × 10^2) + (11 × 10^1) + (2 × 10^0)
                      = (6 × 10^3) + (9 × 10^2) + (1 × 10^1) + (2 × 10^0)    (2.3)

2.1 Number handling in computers

There exist multiple ways of handling and representing numbers in computers. A computer natively handles and performs arithmetic operations using single-precision data types. However, there are some applications, for example cryptography, that require higher precision than the single-precision data types can provide.

2.1.1 Single-precision types

Non-negative integers, also called unsigned integers, are stored in a straightforward manner in binary form, e.g. 11_{10} = 1011_2. The maximum integer value a computer can handle is determined by the size of the computer's registers. For a computer with a register size of w bits, the register can represent non-negative integers from 0 to 2^w − 1, i.e. an 8-bit processor can represent 0 to 255. A collection of bits in a computer is called a word; a word's size is mandated by the computer's register size. A word is the largest basic unit a processor can use to perform operations.

Representation of negative numbers can be done in several ways. Some commonly used methods are offset binary, signed magnitude, Ones' Complement, and Two's Complement. The different methods affect the value range and how the calculations are performed.

Offset binary shifts the representation of zero up to a higher register value; i.e. the register value zero instead represents a negative number. Calculation with offset binary is very different from unsigned integers, e.g. the addition operation has to act as a subtraction in case the second term is actually negative, which means checking the value beforehand.

Signed magnitude is the practice of using a notation to signify a negative number. Usually this is accomplished by using the top bit as the sign. A side effect is that it is possible to represent −0, which for most purposes is the same number as 0. As with offset binary, addition has to take special regard to negative terms.

Ones' Complement inverts all the bits to represent a negative number. The top bit is reserved and used as a sign bit. Ones' Complement can re-use


the arithmetic operations for unsigned integers with slight modification, because any carry from the top bit has to be added to the bottom. Like signed magnitude, Ones' Complement also has the drawback of being able to represent −0.

Two's Complement is similar to Ones' Complement: to get the negative representation, all bits are inverted and then the value one is added to the result. This method of representation allows addition using the same instructions as unsigned integers, without regard for sign. In Two's Complement, zero only has a single representation. However, the number range is not uniform around zero, as it goes from −2^{w−1} to 2^{w−1} − 1.

Two's Complement is the most common method for representing negative integers. Offset binary and signed magnitude are also common, but for other applications.
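The invert-and-add-one rule can be checked directly; the following is a minimal C sketch (our own illustration, not part of the thesis), assuming a machine that uses 32-bit Two's Complement integers, which is the common case on current hardware.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int32_t x = 1234;
    int32_t neg = ~x + 1;              /* invert all bits, then add one        */

    printf("-x     = %d\n", -x);       /* -1234                                */
    printf("~x + 1 = %d\n", neg);      /* -1234: same bit pattern              */

    /* The asymmetric range -2^(w-1) .. 2^(w-1) - 1 for w = 32: */
    printf("range: %d .. %d\n", INT32_MIN, INT32_MAX);
    return 0;
}
```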

However, when rational numbers are required, the integer representation methods cannot be used, because they do not account for the radix point. There are two common methods to represent rational numbers, fixed-point and floating-point, named for how they handle the radix point. The fixed-point method uses integer registers, but the registers are partitioned into two parts. The first part holds the positive powered integer bits and the second part holds the negative powered bits, called fractional bits. An example of a fixed-point number with two integer bits and four fractional bits is shown in Equation 2.4. Arithmetic with fixed-point numbers is similar to integer arithmetic. Addition and subtraction can be performed as normal integer operations, without any regard for the radix point, while multiplication and division require some modification to handle the fixed-point numbers. Multiplication can multiply the two fixed-point numbers as if they were integers, without any regard for the radix point. However, the product then has to be reduced to position the radix point so that the same amount of precision is kept; e.g. when two fixed-point numbers with f fractional bits are multiplied, the result has 2f fractional bits, as shown in Equation 2.5. Therefore, the product has to be reduced, either by removing the additional fractional bits or by rounding the number to fit f fractional bits; this reduction can cause precision loss in the product. As the integer division operation cannot handle fractional numbers, i.e. the integer division 1/5 would result in 0, the fixed-point number in the dividend is multiplied by its radix r to the power of f to ensure that the result is an integer value, to which the radix point can then be applied, as shown in Equation 2.6.

2.3125_{10} = 10.0101_2 = 1 × 2^1 + 0 × 2^0 + 0 × 2^{−1} + 1 × 2^{−2} + 0 × 2^{−3} + 1 × 2^{−4}    (2.4)


12.00_{10} × 02.00_{10} = 1100.00_2 × 10.00_2 = 24.00_{10}
110000_2 × 1000_2 = 110000000_2 = 11000.0000_2 = 24.0000_{10}    (2.5)

1.00 / 5.00 ⇒ (1.00 × 10^2) / 5.00 ⇒ 100 / 5 = 20 ⇒ 0.20    (2.6)
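The rescaling that Equations 2.5 and 2.6 describe is easy to see in code. The following is a minimal C sketch (our own illustration, not from the thesis) of a hypothetical Q16.16 binary fixed-point format, i.e. f = 16 fractional bits in a 32-bit register.

```c
#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 16                       /* f fractional bits (Q16.16)       */
#define ONE       (1 << FRAC_BITS)         /* fixed-point representation of 1  */

/* Multiply two Q16.16 numbers: the raw product has 2f fractional bits, so it
 * is shifted back down by f bits (compare Equation 2.5).                      */
static int32_t fx_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}

/* Divide two Q16.16 numbers: the dividend is scaled up by 2^f first so the
 * integer division keeps f fractional bits of the quotient (Equation 2.6).    */
static int32_t fx_div(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a << FRAC_BITS) / b);
}

int main(void)
{
    int32_t twelve = 12 * ONE;
    int32_t two    = 2 * ONE;
    int32_t fifth  = fx_div(1 * ONE, 5 * ONE);                    /* ~0.2      */

    printf("12 * 2 = %.4f\n", fx_mul(twelve, two) / (double)ONE); /* 24.0000   */
    printf("1 / 5  = %.4f\n", fifth / (double)ONE);               /* 0.2000    */
    return 0;
}
```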

A trade-off has to be made when using fixed-point numbers, because a large value range will decrease the precision, while high precision will decrease the value range. Using a technique similar to the scientific notation of numbers, called floating-point, circumvents this trade-off. A floating-point number consists of a sign bit s, multiplied with a radix r raised to the power of an exponent e, then multiplied with a mantissa m, as shown in Equation 2.7. In computers the radix is two, and the exponent is an integer that sets the position of the radix point. The mantissa is a fractional value between one and two; the number of bits in the mantissa determines the precision available in the number, as shown in Equation 2.8. Because the exponent makes it possible to move the radix point, floating-point numbers have a greater range of possible values. In the commonly used IEEE 754 floating-point standard, the placement of the radix point is always after the first non-zero digit. This is called normalising the floating-point number. Because the only non-zero value in a binary system is one, the top bit in a normalised number will always be one. This makes the top bit redundant, so only the fractional bits in the mantissa have to be stored, allowing an additional bit of precision. The IEEE 754 standard defines how the components of the floating-point number are stored. Figure 2.1 shows the common layout used by the different sizes of floating-point types. The single-precision IEEE floating-point type uses 24 bits for the mantissa, where one bit is freed by the normalisation.

(−1)^s × r^e × m    (2.7)

m = 1 + m_0/2^1 + m_1/2^2 + m_2/2^3 + …    (2.8)

Figure 2.1: Layout of components for IEEE 754 single-precision floating-point: 1-bit sign (bit 31), 8-bit exponent (bits 30–23), 23-bit mantissa (bits 22–0)

However, the advantages of floating-point numbers come at a cost. For example, when adding two floating-point numbers, they are required to have the


same exponent. Increasing the exponent of the smaller term while representing the same number creates a loss of precision, as well as a loss in performance for the extra calculations necessary to increase the exponent; see Equation 2.9. While fixed-point numbers can be calculated using integer instructions, floating-point calculations cannot use the same instructions, because of their more complex nature. Therefore, floating-point arithmetic is offloaded to dedicated floating-point processing units (FPUs).

1.000 ≤ mantissa ≤ 9.999,  0 ≤ exponent ≤ 10
A = 10^3 × 1.451
B = 10^0 × 3.56
A + B = 10^3 × 1.451 + 10^0 × 3.56 = 10^3 × (1.451 + 0.004) = 10^3 × 1.455    (2.9)

2.1.2 Multi-precision types

When a calculation requires an integer value that is larger than the native register of the processor, and the precision available in a floating-point number is not sufficient, a solution is to use multi-precision numbers. A multi-precision number is created by splitting a large number into several word-sized pieces. The processor can only do calculations on the individual words, so arithmetic on multi-precision numbers has to be handled manually. The relationship between the individual pieces dictates how the arithmetic operations are carried out. There are multiple ways of constructing a multi-precision number system. A common way is to use a positional number system as a base, where word-sized pieces form the limbs of a large number. Arithmetic operations can then be based on the commonly taught techniques; how they work is described in greater detail in Section 2.2. This approach to a multi-precision number system will henceforth be referred to as a radix based multi-precision number system.

For a processor with b-bit registers, the maximum radix possible is 2^b, because the registers can only hold values between 0 and 2^b − 1; e.g. a processor with 32-bit registers can only handle values between 0 and 2^32 − 1. However, using a radix that is the same size as the register is not advisable for some computer architectures, because it can cause the registers to overflow. Overflow means that some bits of the number have been lost because they did not fit inside the register; e.g. in a 4-bit processor the largest number a register can hold is 2^4 − 1 = 15, and if the addition 15 + 15 = 30 is performed, the register cannot fully represent the result because it requires 5 bits. How architectures handle overflow varies: some architectures set the register value to 2^b − 1, while the most common way is to wrap around. When using unsigned integers, wrap around ignores the top bit of the sum and instead gives the remaining b bits as the result; this is shown with a 4-bit register in Equation 2.10. Some computer architectures have automatic overflow detection and will either handle the overflow if it is possible


or give a notice that it has occurred. However, for those architectures that lack overflow detection, it can be detected manually by using a smaller radix r and checking if the sum is larger than r − 1. If the radix used during manual detection is small enough that the overflow fits in the register, the overflow can be handled without any inaccuracies, as no part of the result is altered or lost.

  1111
+ 0001
= (1)0000    (2.10)

where the leading bit falls outside the 4-bit register and is discarded by the wrap around.
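A minimal C sketch of the manual overflow handling just described (our own illustration), assuming 32-bit unsigned registers and a radix of 2^28, the same example radix used later in Section 2.3: the excess above r − 1 stays in the spare register bits and can be extracted by comparison or shifting, so no part of the result is lost.

```c
#include <stdio.h>
#include <stdint.h>

#define RADIX_BITS 28u
#define RADIX      (1u << RADIX_BITS)          /* r = 2^28                     */
#define LIMB_MASK  (RADIX - 1u)                /* largest limb value, r - 1    */

int main(void)
{
    /* Two limbs close to r - 1: their sum does not fit in 28 bits, but the
     * four spare bits of the 32-bit register hold the excess.                 */
    uint32_t a = LIMB_MASK;                    /* r - 1                        */
    uint32_t b = LIMB_MASK;                    /* r - 1                        */

    uint32_t sum   = a + b;                    /* at most 2r - 2 < 2^32        */
    uint32_t carry = sum >> RADIX_BITS;        /* excess above r - 1           */
    uint32_t limb  = sum & LIMB_MASK;          /* value kept in this position  */

    printf("carry = %u, limb = %u\n", carry, limb);  /* carry = 1, limb = r-2  */
    return 0;
}
```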

An alternative approach to a multi-precision number system is to use a Residue Number System (RNS). In an RNS a number is represented with a basis, which is a set of co-prime numbers that are each less than 2^w in size. The RNS representation of X with a basis A is denoted ⟨X⟩_A, where ⟨X⟩_A = {|x|_{a_0}, |x|_{a_1}, …, |x|_{a_{n−1}}} and A = {a_0, a_1, …, a_{n−1}}. The components of ⟨X⟩_A are |x|_{a_i} = x mod a_i. The maximum number that an RNS with a basis A can handle is the product of all the moduli, as shown in Equation 2.11.

|A| = Π_{i=0}^{n−1} a_i    (2.11)

The result of a calculation in basis A is implicitly modulo |A|. Converting an RNS representation back to a normal representation is done with the Chinese remainder theorem, as shown in Equation 2.12.

X = ( Σ_{i=0}^{n−1} ā_i × ( |x|_{a_i} / ā_i  mod a_i ) ) mod |A| ,   ā_i = |A| / a_i    (2.12)

An advantage of using an RNS is that addition, subtraction and multiplication can be performed independently for each modulus, as demonstrated with a multiplication in Equation 2.13.

⟨X⟩_A × ⟨Y⟩_A = { |x|_{a_0} × |y|_{a_0} mod a_0, …, |x|_{a_{n−1}} × |y|_{a_{n−1}} mod a_{n−1} }    (2.13)

However, a drawback of using an RNS is that some operations, e.g. division, are complex, which makes the multi-precision number system hard to implement and work with [19].
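To make the notation concrete, here is a toy C sketch (our own illustration, not the thesis implementation) using the small hypothetical basis {3, 5, 7}, so |A| = 105. It performs the per-modulus multiplication of Equation 2.13 and a Chinese remainder reconstruction in the spirit of Equation 2.12, finding each inverse by brute force since the moduli are tiny.

```c
#include <stdio.h>

#define N 3
static const int basis[N] = {3, 5, 7};     /* co-prime basis, |A| = 105        */

/* Represent x by its residues |x|_{a_i} = x mod a_i. */
static void to_rns(int x, int r[N])
{
    for (int i = 0; i < N; i++)
        r[i] = x % basis[i];
}

/* CRT: X = sum( A_i * (r_i * A_i^{-1} mod a_i) ) mod |A|, with A_i = |A|/a_i. */
static int from_rns(const int r[N])
{
    int M = 1;
    for (int i = 0; i < N; i++) M *= basis[i];

    int x = 0;
    for (int i = 0; i < N; i++) {
        int Ai = M / basis[i];
        int inv = 1;
        while ((Ai * inv) % basis[i] != 1) inv++;     /* A_i^{-1} mod a_i      */
        x = (x + Ai * ((r[i] * inv) % basis[i])) % M;
    }
    return x;
}

int main(void)
{
    int x[N], y[N], p[N];
    to_rns(8, x);
    to_rns(9, y);

    /* Multiplication is independent per modulus (Equation 2.13). */
    for (int i = 0; i < N; i++)
        p[i] = (x[i] * y[i]) % basis[i];

    printf("8 * 9 mod 105 = %d\n", from_rns(p));      /* 72                    */
    return 0;
}
```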

2.2 Basic Arithmetic Algorithms

Arithmetic on a radix based multi-precision number system uses methods similar to the commonly taught methods for decimal arithmetic. A radix based multi-precision number system splits a large number into several limbs according


to a PNS with an arbitrary radix r. The radix is chosen such that the limbs can be used with the native instructions of the processor. Therefore, the arithmetic operations of a radix based multi-precision number system can be based upon the arithmetic operations of a PNS. For a processor that can perform calculations in parallel, most of these methods are however inefficient because they are sequential. This section explores the cost of the basic arithmetic operations, and how their expenses relate to a parallel processor. The cost of performing the basic arithmetic operations is an estimate of the amount of resources the operations require to solve problems. The estimate is called computational complexity and is expressed as a function O with an argument that signifies the resources required to solve a given problem, i.e. an operation that requires n steps to calculate its output has a computational complexity of O(n). The time it takes to solve these problems is called the operation's time complexity and is expressed as a function T with an argument that signifies the amount of time required to solve the problem. For ease of presentation, all examples in this section are in radix ten.

2.2.1 Addition

The traditional addition method, where carry propagation is performed at the same time as the addition, cannot be parallelised, because it uses the ripple carry method for carry propagation. The ripple carry method traverses each digit in a number systematically; therefore it will always have to perform n steps, where n is the number of digits in the number, giving the method a computational complexity of O(n), as shown in Algorithm 2.1. The ripple carry method cannot be parallelised because the systematic movement over the digits creates dependencies between them: a digit cannot be fully carry propagated before it has received the carry value from the digit before it. Because the method cannot be parallelised, it cannot complete a problem faster than T(n).

When adding smaller numbers, e.g. 9999 and 1, the ripple carry method requires 4 steps to fully carry propagate. This is an acceptable number of steps for completing an addition. However, when the ripple carry method is used for carry propagation in an addition between two multi-precision numbers, which can have several hundred limbs, the method requires too many steps to fully carry propagate the addition. Therefore, the traditional addition method with ripple carry is too inefficient to be used for multi-precision numbers.


Algorithm 2.1 Addition with ripple carry
Input: A : {a_0, a_1, …, a_{n−1}}, B : {b_0, b_1, …, b_{n−1}}
Output: S : {s_0, s_1, …, s_{n−1}}
  c = 0
  for i = 0 → n−1 do
    s_i = (a_i + b_i + c) mod 10
    c = (a_i + b_i + c) / 10
  end for
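Algorithm 2.1 translates directly to C. The following sketch (our own illustration, assuming radix-10 digit arrays stored least significant digit first) makes the sequential carry dependency visible: each iteration needs the carry produced by the previous one.

```c
#include <stdio.h>

/* Algorithm 2.1: add two n-digit numbers stored least significant digit
 * first, propagating the carry sequentially (ripple carry), radix 10.        */
static void add_ripple(const int a[], const int b[], int s[], int n)
{
    int c = 0;                               /* running carry                  */
    for (int i = 0; i < n; i++) {
        int t = a[i] + b[i] + c;
        s[i] = t % 10;                       /* digit kept in this position    */
        c    = t / 10;                       /* carry passed to the next digit */
    }
    /* Any final carry is dropped here; a real implementation would grow s.   */
}

int main(void)
{
    /* 1234 + 5678 = 6912, stored as {4,3,2,1} and {8,7,6,5}.                  */
    int a[4] = {4, 3, 2, 1}, b[4] = {8, 7, 6, 5}, s[4];
    add_ripple(a, b, s, 4);
    printf("%d%d%d%d\n", s[3], s[2], s[1], s[0]);    /* prints 6912            */
    return 0;
}
```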

2.2.2 Multiplication

The traditional way of performing multiplication is by taking each digit from one number and multiplying it with all the digits in a second number. The results from the multiplications are then added together to calculate the final result, as shown in Equation 2.14.

          1 2 3 4 5
        × 1 2 3 4 0
        ───────────
          0 0 0 0 0      = (12345 ×     0)
        4 9 3 8 0        = (12345 ×    40)
      3 7 0 3 5          = (12345 ×   300)
    2 4 6 9 0            = (12345 ×  2000)
+ 1 2 3 4 5              = (12345 × 10000)
───────────────────
  1 5 2 3 3 7 3 0 0                           (2.14)

When a multiplication between two n-digit numbers is performed, n^2 steps are required for the multiplication phase. This in turn results in n partial results that have to be combined with addition to calculate the final result. Each addition has a computational complexity of A(n), which means that A(n) × n steps are required to calculate the final result. Therefore, the total number of steps required to perform a full multiplication is n^2 + A(n) × n. However, because all the arithmetic operations required to perform a multiplication on a computer can be done in two loops, as shown in Algorithm 2.2, the resulting computational complexity is O(n^2).

Algorithm 2.2 Multiplication
Input: X : {x_0, x_1, …, x_{n−1}}, Y : {y_0, y_1, …, y_{n−1}}
Output: S : {s_0, s_1, …, s_{2n−1}}
  for i = 0 → n−1 do
    for j = 0 → n−1 do
      s_{i+j+1} = s_{i+j+1} + (s_{i+j} + (x_i × y_j)) / 10
      s_{i+j} = (s_{i+j} + (x_i × y_j)) mod 10
    end for
  end for
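A corresponding C sketch of Algorithm 2.2 (again our own illustration, with radix-10 digit arrays stored least significant digit first) produces the 2n-digit result discussed below.

```c
#include <stdio.h>
#include <string.h>

/* Algorithm 2.2: schoolbook multiplication of two n-digit numbers in radix
 * 10, least significant digit first, producing a 2n-digit result.            */
static void mul_schoolbook(const int x[], const int y[], int s[], int n)
{
    memset(s, 0, 2 * n * sizeof(int));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int t = s[i + j] + x[i] * y[j];
            s[i + j + 1] += t / 10;          /* carry into the next position   */
            s[i + j]      = t % 10;          /* digit kept in this position    */
        }
    }
}

int main(void)
{
    /* 12345 * 12340 = 152337300, stored least significant digit first.        */
    int x[5] = {5, 4, 3, 2, 1}, y[5] = {0, 4, 3, 2, 1}, s[10];
    mul_schoolbook(x, y, s, 5);
    for (int i = 9; i >= 0; i--) printf("%d", s[i]);  /* 0152337300            */
    printf("\n");
    return 0;
}
```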


The largest result that can be produced by multiplying two n-digit numbers is 2n digits long, as shown in Equation 2.14. This is usually not a problem for applications using single-precision numbers, because the values used are small enough to fit even after the multiplication. Multi-precision numbers, on the other hand, represent large numbers, which causes the limbs to overflow when the traditional multiplication method is used. This in turn causes the multi-precision number to lose precision. Some architectures provide instructions that give the full precision result, which is suitable for a radix based multi-precision number system; e.g. the Intel x86 architecture stores the upper w bits of the result in a second register [5]. However, this functionality is not available in most popular programming languages. The IEEE floating-point standard describes no multiplication method that gives a full precision result, only the normalised result.

There are a number of ways to parallelise the traditional multiplication method. The multiplication of the limbs can be fully parallelised over n^2 processors. The carry propagation for the n rows can be performed over n processors. The final stage, where the rows are added together, is a typical reduction problem. One processor can add two sets of numbers, which means that n sets of numbers can be added together over ⌈log2(n)⌉ steps. In the first step, n sets are added together by n/2 processors to produce n/2 sets of numbers. In the second step, n/4 processors add the new sets of numbers together, and so on. This is only one method of parallelising multi-precision multiplication. Multi-precision multiplication is flexible in the order of execution, and as such can be adapted to fit the parallel execution model of the architecture in use.

2.2.3 Division

Division cannot be parallelised, because when a division is performed the dividend and the divisor cannot be split up to allow parallel processing. An example of this is shown in Equation 2.15: the first two digits can be divided cleanly, but the division of the third and fourth digits produces a remainder that is used with the fifth digit to complete the division. Because of these dependencies, the traditional division method cannot be parallelised. The computational complexity of the traditional division method is O(n^2), where n is the number of digits in the dividend [22]. In situations where multiple divisions are to be performed with the same divisor, a single division can be used to calculate the divisor's reciprocal. The reciprocal can then be used with multiplication to perform division efficiently, because multiplication can be parallelised. As the traditional division method has a high computational complexity and is not parallelisable, it is not suitable for use in a multi-precision number system. However, by pre-calculating the reciprocal and then using multiplication to perform division, multi-precision division can be performed efficiently.


        1 0 2 8
      ─────────
 12 ) 1 2 3 4 5
      1 2 0 0 0
      ─────────
          3 4 5
          2 4 0
          ─────
          1 0 5
            9 6
            ───
              9    (2.15)

2.2.4 Modulus

Modulus is used to calculate the remainder of a division. The remainder is calculated with Equation 2.16, where n is the dividend, d the divisor, q the quotient of the integer division n/d, and r the remainder. Because modulus requires a division to calculate q, it inherits division's limited degree of parallelisation. When many modulus calculations with the same divisor are to be performed, the divisor's reciprocal can be pre-calculated and used as explained in Section 2.2.3.

n− q × d = r (2.16)

2.3 Lazy Addition

When performing addition with a positional number system, it is common to perform the carry propagation while doing the addition. As explained in Section 2.2.1, the carry propagation is commonly performed with the ripple carry method, which cannot be parallelised. Lazy addition is an addition method that performs addition without any carry propagation; instead, it uses spare space in the register to store the excess. The extra space is allocated by using a smaller radix, which limits the space for the limb value but allows carry values greater than one. Because no carry propagation is performed, there are no dependencies between the positions, and thus the lazy addition method can be performed in parallel. However, because the radix used can only represent values between 0 and r − 1, the result of a lazy addition has to be carry propagated before being used with any other PNS operations. The advantage of lazy addition becomes most apparent when adding several numbers, e.g. 1234 + 567 + 789 + 1234, as shown in Equation 2.17 (in the equation, positions whose values exceed a single decimal digit


are shown in brackets).

1234 + 567 = (1) × 10^3 + (2 + 5) × 10^2 + (3 + 6) × 10^1 + (4 + 7) × 10^0
           = (1) × 10^3 + (7) × 10^2 + (9) × 10^1 + (11) × 10^0

179[11] + 789 = (1) × 10^3 + (7 + 7) × 10^2 + (9 + 8) × 10^1 + (11 + 9) × 10^0
              = (1) × 10^3 + (14) × 10^2 + (17) × 10^1 + (20) × 10^0

1[14][17][20] + 1234 = (1 + 1) × 10^3 + (14 + 2) × 10^2 + (17 + 3) × 10^1 + (20 + 4) × 10^0
                     = (2) × 10^3 + (16) × 10^2 + (20) × 10^1 + (24) × 10^0
                     = 2[16][20][24]    (2.17)

Lazy addition can perform several additions before any carry propagation is required, because the addition of two b-bit numbers can at most result in a sum that is b+1 bits large; e.g. the addition of two 28-bit numbers produces at most a single carry bit above the 28-bit radix. How many lazy additions can be performed safely, without any risk of overflow, can be calculated by subtracting the size of the radix used from the size of the processor's register. This means that when a multi-precision number system uses a radix of 2^28, it is possible to perform 2^{32−28} = 16 additions in each limb with 28-bit numbers before the limbs of the multi-precision number have to be carry propagated.
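A minimal C sketch of the idea (our own illustration, assuming 32-bit limbs and the 2^28 radix mentioned above): several lazy additions are accumulated per limb without any carry handling, followed by a single carry propagation pass that normalises the result.

```c
#include <stdio.h>
#include <stdint.h>

#define LIMBS      4
#define RADIX_BITS 28u
#define LIMB_MASK  ((1u << RADIX_BITS) - 1u)

/* Lazy addition: add limb-wise with no carry handling at all.  With a 2^28
 * radix in 32-bit registers, 16 such additions are safe before a limb can
 * overflow the register.                                                      */
static void lazy_add(uint32_t acc[LIMBS], const uint32_t x[LIMBS])
{
    for (int i = 0; i < LIMBS; i++)
        acc[i] += x[i];                      /* may exceed r - 1, that is fine */
}

/* One sequential normalisation pass: push the excess of each limb upward.    */
static void propagate(uint32_t acc[LIMBS])
{
    uint32_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint32_t t = acc[i] + carry;
        acc[i] = t & LIMB_MASK;
        carry  = t >> RADIX_BITS;
    }
    /* The carry out of the top limb is ignored in this toy example.          */
}

int main(void)
{
    uint32_t acc[LIMBS] = {0}, x[LIMBS] = {LIMB_MASK, 1, 2, 3};

    for (int k = 0; k < 5; k++)              /* five additions, zero carries   */
        lazy_add(acc, x);
    propagate(acc);                          /* single carry pass at the end   */

    printf("limbs: %u %u %u %u\n", acc[0], acc[1], acc[2], acc[3]);
    return 0;
}
```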

2.4 Blum Blum Shub generator

The Blum Blum Shub pseudo random number generator is a number theoretical CSPRNG with a very simple algorithm, shown in Equation 2.18.

x_{n+1} = x_n^2 (mod M)    (2.18)

Here M is the product of two primes, p and q, that have the special properties shown in Equation 2.19 and Equation 2.20, where p_1 and p_2 are prime numbers.

p ≡ 3 (mod 4)    (2.19)

p = 2p_1 + 1 ,   p_1 = 2p_2 + 1    (2.20)

The initial value x_0 is a quadratic residue of M, meaning there is an integer y such that y^2 ≡ x_0 mod M. x_0 can be chosen by taking a random seed s that is less than M/2 but greater than 0 and satisfies gcd(s, M) = 1; then x_0 = s^2.

The BBS generator has a periodicity, which means that after some number of iterations it will generate x_0 again and the sequence will repeat. The periodicity can be increased by selecting primes such that gcd(p−1, q−1) is as low as possible. It is possible to calculate the periodicity using the function λ(λ(M)), but for that to be possible, Equation 2.20 must be satisfied for p and q. λ(n) is the Carmichael


function, which returns the smallest integer m such that a^m ≡ 1 (mod n) for every a coprime to n.

An interesting property of the generator is the ability to jump to any step i in the sequence using Equation 2.21.

x_i = x_0^{2^i mod λ(M)} (mod M)    (2.21)

Blum et al. [2] showed that at least one bit regarded as secure can be extracted from each iteration of the generator; they left an open question of how many secure bits can be extracted per iteration. Alexi et al. and Vazirani et al. showed that log2(log2(M)) secure bits may be extracted from each iteration [1, 33].
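A toy C sketch of the generator (our own illustration, not the GPU implementation described later), using the hypothetical small Blum primes p = 11 and q = 23, both ≡ 3 mod 4; a secure M is 1000 to 5000 bits and therefore needs the multi-precision arithmetic this thesis develops.

```c
#include <stdio.h>
#include <stdint.h>

/* Toy Blum Blum Shub: x_{n+1} = x_n^2 mod M (Equation 2.18).                 */
int main(void)
{
    const uint64_t p = 11, q = 23;           /* toy primes, both ≡ 3 (mod 4)   */
    const uint64_t M = p * q;                /* 253                            */
    const uint64_t s = 100;                  /* seed, gcd(s, M) = 1            */

    uint64_t x = (s * s) % M;                /* x_0 = s^2 mod M                */

    /* Extract one bit (the parity of x) per iteration; with a realistic M,
     * up to log2(log2(M)) low bits could be taken per squaring.               */
    printf("bits: ");
    for (int i = 0; i < 16; i++) {
        x = (x * x) % M;
        printf("%llu", (unsigned long long)(x & 1u));
    }
    printf("\n");
    return 0;
}
```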

The security proof of BBS is based on the quadratic residuosity problem [2]. Several authors have further strengthened the security proof of BBS. Fischlin et al. gave a strong proof of the difficulty of factoring a large M value in their work [14]. Sidorenko et al. showed in their work how to select an appropriate modulus size to achieve a desired level of security. Cusick found an imbalance in the distribution of output values [11]. He found that over a period of λ(λ(M)) output bits, there is on average an imbalance no larger than √λ(λ(M)), which is less than the average imbalance of a true random number generator with the same period.

Because the security of the BBS PRNG relies on M being large, and with the suggestion from Fischlin et al. that the M value should be at least 1000 to 5000 bits [14], it is required to use a multi-precision number system. A drawback of requiring an M of such magnitude, and of the fact that only log2(log2(M)) bits may be extracted from each iteration, is that BBS becomes computationally expensive. Shi et al. [29] did a performance comparison of their own PRNG and several others, including three versions of BBS, which were considerably slower than the fastest PRNG tested.

The large modulus is a problem with conventional arithmetic, because it is not known how to make division into a parallel algorithm. Fortunately, the core of the BBS algorithm is a modular multiplication, an operation that has been studied by Montgomery [25]. Montgomery discovered a way to perform the modular operation using multiplication and addition only, by transforming the numbers into a different modular domain.

2.5 GPUs and the SIMD Computational Model

According to Flynn's taxonomy, computer architectures can be organized into four different groups [16], illustrated in Table 2.1.


                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD

Table 2.1: Flynn's Taxonomy of Parallel Architectures

Figure 2.2: Processor architectures. (a) Single Instruction Single Data, (b) Multiple Instruction Single Data, (c) Multiple Instruction Multiple Data, (d) Single Instruction Multiple Data.

Single Instruction Single Data (SISD) (Figure 2.2(a)) is the simplest of all the architectures. It has a single instruction and data flow, which means that it does not have any parallelism. This makes it easy to build algorithms for the architecture. However, because of the single execution flow, some algorithms can be slow.

Multiple Instruction Single Data (MISD) (Figure 2.2(b)) is a very uncommon architecture, which has a very limited range of applications. Conventional algorithms can typically be parallelised by dividing the work among different tasks or by partitioning the data among the processors. MISD does not work with either of these parallelization techniques.

The Multiple Instruction Multiple Data (MIMD) (Figure 2.2(c)) architecture is very common in modern CPUs. The MIMD processing unit is essentially built


upon a series of SISD processors. Therefore, each processor can run their owninstruction path without disrupting each other. This make building parallelizedalgorithms on the MIMD is easy compared to other parallel architectures.

Single Instruction Multiple Data (SIMD) (Figure 2.2(d)) is an architecture where many processors run the same instruction but on different data. There are great advantages with SIMD for algorithms that can be parallelised over data, i.e. when the same calculations are executed over multiple sets of data with minimal dependencies between the data sets. When a conditional segment is reached that causes divergence, not all threads will execute the same conditional segment. Therefore, the diverging branches have to be executed one after the other, because there is only a single flow of instructions, which causes great performance degradation.

GPUs are stream processors which work in a very similar fashion to SIMD, and for most purposes can be regarded as such.
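The following OpenCL C kernel is our own illustrative sketch (not taken from the thesis implementation) of such a conditional segment; on a SIMD device, work-items in the same group that take different sides of the branch cause the two branches to be executed one after the other.

__kernel void divergent_example(__global const float *x,
                                __global float *result)
{
    size_t i = get_global_id(0);
    if (i % 2 == 0)
        result[i] = x[i] * x[i];      /* even work-items take this branch     */
    else
        result[i] = x[i] + 1.0f;      /* odd work-items wait, then run this   */
}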

2.6 Linear regression

Linear regression is a statistical tool, which is used for finding correlation in data. The method analyses the data and maps it to a first order linear function (Equation 2.22).

y = α + βx (2.22)

There are many different methods and techniques for doing linear regression. One of the most common ones is least-squares regression, normally called simple linear regression. While the other methods and techniques can handle multiple dimensions in their predictions, simple linear regression only works over two dimensions, i.e. one independent variable (x) and one dependent variable (y). Simple regression has a relatively easy equation for calculating its predictions. The data variance, covariance and sums (Equation 2.23) are required to calculate the simple linear regression coefficients and error values.

Sx = Σ_{i=0}^{n} x_i        Sy = Σ_{i=0}^{n} y_i

Sxx = Σ_{i=0}^{n} x_i²      Sxy = Σ_{i=0}^{n} x_i y_i      Syy = Σ_{i=0}^{n} y_i²        (2.23)

With the variance, covariance and sums of the data, it is possible to calculate the coefficients β (Equation 2.24) and α (Equation 2.25).

β = (n Sxy − Sx Sy) / (n Sxx − Sx²)        (2.24)


α = (1/n) Sy − β (1/n) Sx        (2.25)

With the coefficients, the error values can be calculated, where s²β (Equation 2.27) and s²α (Equation 2.28) are the error values for the coefficients and s²ε (Equation 2.26) is the estimated error for the linear regression.

s²ε = (1 / (n(n − 2))) (n Syy − Sy² − β² (n Sxx − Sx²))        (2.26)

s²β = n s²ε / (n Sxx − Sx²)        (2.27)

s²α = s²β (1/n) Sxx        (2.28)

By inserting the calculated coefficients α and β into Equation 2.22, it is possible to use the equation to predict values beyond the known data range.
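As an illustrative sketch (our own, not part of the thesis), the coefficients and error values of Equations 2.23 to 2.28 can be computed directly from the sums; the function below assumes the data points are indexed 0 to n−1 and that n > 2.

#include <stddef.h>

typedef struct { double alpha, beta, s2_eps, s2_beta, s2_alpha; } regression_t;

/* Simple linear regression following Equations 2.23-2.28. */
regression_t simple_linear_regression(const double *x, const double *y, size_t n)
{
    double Sx = 0, Sy = 0, Sxx = 0, Sxy = 0, Syy = 0;
    for (size_t i = 0; i < n; ++i) {                  /* sums, Equation 2.23 */
        Sx  += x[i];        Sy  += y[i];
        Sxx += x[i] * x[i]; Sxy += x[i] * y[i]; Syy += y[i] * y[i];
    }
    regression_t r;
    r.beta     = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx);       /* Eq. 2.24 */
    r.alpha    = (Sy - r.beta * Sx) / n;                          /* Eq. 2.25 */
    r.s2_eps   = (n * Syy - Sy * Sy
                  - r.beta * r.beta * (n * Sxx - Sx * Sx))
                 / (double)(n * (n - 2));                         /* Eq. 2.26 */
    r.s2_beta  = n * r.s2_eps / (n * Sxx - Sx * Sx);              /* Eq. 2.27 */
    r.s2_alpha = r.s2_beta * Sxx / n;                             /* Eq. 2.28 */
    return r;
}

A prediction beyond the known data range is then simply alpha + beta * x, as in Equation 2.22.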

2.7 Related work

Many areas of cryptography have already been introduced to GPGPU. A similar algorithm to BBS is the public-key cryptographic algorithm Rivest Shamir Adleman (RSA), created by Rivest et al. [28]. Public-key cryptography allows someone to create a pair of keys, one public and one private key. The public key is used to encrypt a message, and then only the private key can be used to decrypt it. In the RSA algorithm, encryption and decryption are done using modular exponentiation. The modulus is N, which is the product of two large primes p, q. The encryption key e is a chosen value 1 < e < ϕ(N) with gcd(e, ϕ(N)) = 1. ϕ is Euler's totient function and is calculated as ϕ(N) = (p − 1)(q − 1). The decryption key d is d = e⁻¹ (mod (p − 1)(q − 1)). The encryption algorithm is c = m^e (mod N), where m is the message transformed into an integer such that 0 < m < N, and c is the encrypted message. The decryption algorithm is m = c^d (mod N). A fast way to calculate these modular exponentiations is to use Montgomery exponentiation, an extension of Montgomery multiplication. RSA has already been successfully ported to run on a GPU by several authors, using different number handling systems. To measure and compare their implementations, most authors measure throughput as exponentiations per second and latency as the time in milliseconds to perform a single exponentiation.


The earliest work that investigated Montgomery exponentiation using a radix based system on a GPU was made by Fleissner [15]. Fleissner's implementation was limited because he chose to use numbers only 192 bits in size. However, his results showed a considerable speed-up in comparison to a CPU implementation. Because Fleissner's work predated CUDA, his implementation was done using OpenGL shaders.

The first published work on calculating modular exponentiation over arbitrary sized fields on the GPU using a residue number system was made by Moss et al. [26]. The use of a residue system allowed Moss et al. to divide the work of a Montgomery exponentiation across several threads. Their implementation was done using OpenGL shaders and, like Fleissner's, predated CUDA. Their results show a threefold increase in performance against a CPU library for RSA when performing many parallel Montgomery exponentiations. However, when they used few parallel Montgomery exponentiations, the performance dropped because of the high overhead.

Szerwinski et al. [31] were the first to perform modular exponentiations on the GPU using the GPGPU framework CUDA. They tested two approaches for performing modular exponentiations, mainly focusing on RNS while also testing one radix based solution. The radix based solution they tested is a simple approach to parallelisation where each thread computes one exponentiation. The CUDA framework they used introduces the concept of warps. Warps in CUDA are groups of 32 threads which always work in lock-step (see the SIMD section 2.5), which means that the cost of divergence can be avoided if the warp follows the same execution path. Therefore, to avoid thread divergence they had to use the same exponent for each thread in a warp, but could still use different base terms. They investigated several approaches to RNS based calculations, and then compared the best solution against their own radix based solution and other published results. Their results showed that the radix based solution had the highest throughput but very high latency. Compared with their RNS solution, the radix based solution had 48 times higher latency on 1024-bit exponentiations and 64 times higher on 2048-bit exponentiations.

Harrison et al. [19] investigated two approaches to calculating modular exponentiation on the GPU, one using RNS in a manner similar to Moss et al. and one using a radix based solution similar to Fleissner. Two versions of the radix based solution were used: a serial approach where each thread deals with a single exponentiation, and a parallel version which divided each exponentiation over several threads. Their parallel approach is split into two parts. In the first part the partial multiplications are performed, which can be done without any thread communication. The second part combines and reduces the results, which requires thread communication and causes considerable overhead. The results showed that the serial version required 1024 simultaneous exponentiations before it could reach its full throughput, which was considerably faster than the CPU. The parallel version was faster for a smaller number of exponentiations compared to the serial radix based version, but its peak throughput was lower than the CPU's. The RNS approach was faster than the CPU and radix based solutions when running with 32 to 512 parallel exponentiations, but had lower throughput than the serial radix based version with more parallel exponentiations.

The earliest implementation of BBS for the GPU was done by Olano [27]. He used the PRNG to generate noise for computer graphics in order to make it look more realistic. The implementation he made did not need any cryptographic strength and could therefore use a smaller M. He found a satisfactory lack of correlation between the numbers when using M = 61.

Tzeng et al. [32] did similar work to Olano, where they used cryptographic hash functions seeded with different PRNGs to generate picture noise. Among the PRNGs used there was a BBS algorithm with an M less than √(2^32), because they decided to use single-precision. They found that using BBS this way provided them with great speed but with unsatisfactory random noise.

Other types of PRNGs have been successfully implemented on the GPU. The fast but non-cryptographic PRNG Mersenne Twister has been implemented by several authors [13, 32] on the GPU, with considerable performance increases. Demchik [13] implemented and compared several types of PRNGs on AMD/ATI graphics cards. His experiments showed an increase in the performance of all PRNGs he tried compared against two types of CPUs.


Chapter 3

Exploring the GPU

At first, all calculations required to display graphics on the computer display were performed sequentially on the CPU. However, with increasing demand for more complex and realistic graphics in computer games, the CPU did not have enough computational power to provide a frame rate fast enough for a real time experience. Therefore, the Graphics Processing Unit (GPU) was created, which is a dedicated piece of hardware to offload the graphics calculations from the CPU. Because most graphics calculations require fractional numbers, the GPU was built as a floating-point processor. As the GPU was only intended to perform graphics calculations of a particular type, it was not designed in the same way as the CPU. The CPU is required to perform calculations with as low latency as possible to handle its tasks, while the GPU only needs to update the computer display, which has a low update frequency, often less than 100 times per second. This meant that the design of the GPU could focus more on maximizing throughput rather than minimizing latency, and because of this it was designed as a parallel processor, which can perform several instructions at the same time. As the GPU has to perform the same type of calculations at the same time but on different data, it was designed as a SIMD architecture.

At first, the GPU had fixed functionality; however, as the GPU's computational power increased, it was made increasingly programmable to be able to perform tasks that were more complex. In 2000, programmable vertex and pixel shaders were introduced into GPUs, making it easier for developers to manipulate the GPU's calculations. At this stage, developers with extensive knowledge about the GPU's hardware could use the shaders to perform General Purpose Computation on the GPU (GPGPU). The vertex and pixel shaders were at first processed on distinct processors on the GPU with different capabilities; with each new GPU generation these capabilities were brought closer, until eventually they were unified into a single processor type. With the unification of the shader processors, GPGPU frameworks were developed that made it easier to perform GPGPU without an extensive knowledge of the GPU hardware.

Profiling is an analysis method that is used to get an understanding of how processors and algorithms behave under different circumstances. Profiling can also be used for creating performance models, which can be used to compare algorithms and predict their performance.

With the increasing interest in GPGPU, Nvidia has been trying to improve the processing speed of instructions using the integer data type on their GPUs. In the Nvidia OpenCL programming guide [10], there is a throughput comparison between the integer and floating-point data types when using different instructions on Nvidia GPUs. The comparison shows that there is equal throughput for most of the instructions. However, the throughput figures do not match the throughput comparison in the Nvidia CUDA programming guide [8], and there is no description of the experiment that Nvidia used to acquire the measurements. Additionally, for a synthetic graphics-benchmarking program, there are claims that Nvidia's driver would detect when a specific shader was being used, and would switch to a shader optimised by Nvidia [3]. This would increase the performance by degrading the image quality, giving their GPUs an unfair advantage over other GPU manufacturers. Because of the throughput inconsistencies in the documentation, and the claims about the Nvidia driver, we cannot rely on the throughput data in the documentation. Therefore, we made an experiment to create a performance model for the Nvidia Geforce GTX 580 graphics card that we were using, in order to conclude which of the data types was most suitable for a multi-precision number system, which is required for a secure GPU bound BBS implementation. To our knowledge, there existed no working models for predicting the performance impact of using different data types on the GPU when the Open Computing Language (OpenCL) GPGPU framework is used.

In previous work, Wong et al. [35] created a performance model for the Nvidia GT200 GPU chip by using the Compute Unified Device Architecture (CUDA) GPGPU framework. The performance model included measurements for different instructions, memory access, and the behaviour of the GPU. Their measurements showed that integer addition and subtraction had equal performance to their floating-point counterparts. They also explain that their GPU had two forms of integer multiplication, one that could handle 24-bit numbers and another that could multiply 32-bit numbers. The 24-bit multiplication was mapped to a single instruction, while the 32-bit multiplication was performed with four instructions. This meant that the 24-bit multiplication had the same performance as the integer addition and subtraction, while the 32-bit multiplication had one fourth of the performance. The measurements showed that the floating-point multiplication was faster than the 24-bit integer multiplication. The poor performance of the 32-bit multiplication was due to limitations in the earlier CUDA capable GPUs. This problem was resolved later in GPUs supporting CUDA version 2.0 and above.

Wong et al. performed their performance measurements with the help of the CUDA clock function. The CUDA clock function returns the current number of elapsed GPU clock cycles. Unfortunately, this function is unavailable when using OpenCL. To our knowledge, there are no standard tools to benchmark GPU kernel execution times. Therefore, we had to come up with our own method to measure the performance of GPU programs. For OpenCL on Nvidia GPUs there is a GPU-bound timer that we can use; however, it has a precision of half a microsecond, which is not good enough for our purposes.

Therefore, we were required to find an alternative way to measure the performance of our GPU programs. After reading papers and investigating different methods, we decided to use the Time Stamp Counter (TSC), which is a 64-bit register that counts the number of CPU ticks, i.e. clock cycles, elapsed since the last reset. The TSC is present on all x86 processors since the Pentium [4]. The TSC value can be retrieved by using the RDTSC instruction. Although the TSC seems like a straightforward way to measure the start and end time of a program, there have been some aspects of the development of the modern CPU which have affected the value of the TSC register. Multi-core processors introduced a problem where the cores did not have identical values in their TSC registers. Power-saving management affected the rate of the ticks by stepping the CPU's frequency. Out-of-order execution could affect when the value was retrieved by allowing other instructions to run before it, adding extra clock cycles to the TSC value. Intel fixed the power-saving management problem by making the TSC counter tick at a constant rate, which meant that the counter would not be affected by changes in the CPU frequency [6]. To avoid the multi-core processor problem, the RDTSC instruction has to be serialised to keep it on a single core. This can be done by adding the volatile property to our inline assembler code used for retrieving the TSC value. The out-of-order execution problem can be solved by adding the CPUID instruction before running RDTSC, which will force all the preceding instructions to finish before allowing the program execution to continue [4]. After adding the fixes for these problems we could use the procedure in Listing 3.1 to retrieve accurate TSC values. This was used for measuring the start and end time of our programs running on the GPU.

#include <stdint.h>

/* Returns the TSC; CPUID serialises execution so RDTSC is not reordered. */
uint64_t getCycleCount()
{
    uint32_t lo, hi;
    asm volatile("xorl %%eax, %%eax;\n\t"
                 "cpuid\n\t"
                 "rdtsc"
                 : "=a"(lo), "=d"(hi) : : "ebx", "ecx");
    return (uint64_t)lo + ((uint64_t)hi << 32ULL);
}

Listing 3.1: Procedure to retrieve an accurate TSC value


3.1 OpenCL

Open Computing Language (OpenCL) is a programming framework which enables developers to write programs that can run on a wide range of computational units, i.e. CPUs, GPUs and other processors. However, the focus of this section will be on the GPU. OpenCL builds upon a concept where a host device is connected to one or more OpenCL devices. The host device can be any computer running a common operating system. The host is responsible for setting up and managing the programs which are executed on the OpenCL devices. The OpenCL devices do not have to be of the same type, i.e. they can be a mix of CPUs, GPUs and other processors. The computational units' architectures have to be taken into account when writing the OpenCL program, because different architectures handle instructions and conditions differently, e.g. how SIMD handles conditional segments (Section 2.5).

3.1.1 Kernel and Work-item

OpenCL kernels contain the program which will be executed on one or more OpenCL devices. When a kernel is queued up for execution, an index-space is defined and a copy of the kernel is executed for each item in the index-space. These copies are called work-items. The work-items are independent tasks with the same program but with different data. Each work-item in the index-space has a unique global id, which identifies it. The global id is accessible from inside the kernels and can be used in several different ways, e.g. for traversing an array or other objects. An example kernel could be to add all the elements of two arrays: if the arrays are n elements long, one could start n work-items and the task would be fully parallelised (Algorithm 3.1).

Algorithm 3.1 Kernel addition of two arrays

Input: X : {x0, x1, x2, . . . , xn−2, xn−1, xn}, Y : {y0, y1, y2, . . . , yn−2, yn−1, yn}
Output: R
id = get_global_id
R_id = X_id + Y_id
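For illustration, a minimal OpenCL C kernel corresponding to Algorithm 3.1 might look as follows (the kernel and argument names are our own, not taken from the thesis implementation).

__kernel void add_arrays(__global const float *X,
                         __global const float *Y,
                         __global float *R)
{
    size_t id = get_global_id(0);  /* unique global id of this work-item   */
    R[id] = X[id] + Y[id];         /* each work-item adds one element pair */
}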

In Algorithm 3.1 a one-dimensional index-space is used. However, the index-space can have up to three dimensions. A one-dimensional index-space can be illustrated as a line, where the global id is the coordinate of the work-item on that line. When additional dimensions are added to the index-space, another coordinate is added to each work-item per dimension, just like the x, y, and z coordinates in a coordinate system. Therefore, when a two-dimensional index-space is used, the index-space is no longer a line; instead it is a grid with x and y coordinates. This means that, for example, the fifth work-item on the second row would have the global id (4, 1) (the coordinates start counting from zero). When working with an image where each pixel should be changed, a two-dimensional index-space has to be used, to be able to match a work-item to each pixel.

3.1.2 Work-groups and Synchronisation

The work-items in the index-space are divided into one or more work-groups. The size of the work-groups can either be defined by the programmer or automatically by OpenCL. Each work-group has a unique work-group id, which is used to identify the work-groups. All the work-items inside a work-group have a local id, which is unique inside that work-group. The work-group id and the local id are retrievable inside the kernel during execution and can be used in the same manner as the global id. Work-groups can be either one, two, or three-dimensional, depending on the needs of the program and the data; e.g. an image would use a two-dimensional index-space and therefore a two-dimensional work-group has to be used as well, to correctly divide the work-items into work-groups.

Synchronisation, in the context of parallel computation, is the practice of communication between threads and controlling the execution flow. Most parallel algorithms require that the threads can communicate data from their internal state to other threads; how this is performed depends on the device's architecture. Execution control is an important aspect of synchronisation, because a thread might need the data from another thread's calculation to continue. In this case, the thread needs to be halted until the calculation on the other thread is done. How the execution flow is controlled depends on the device architecture.

All work-items in the same work-group are always executed on the same device. This allows work-items inside the same work-group to use shared memory to exchange data. OpenCL only allows synchronisation to occur between work-items inside the same work-group, because synchronising work-items between different devices would be very slow and impractical. A barrier is a synchronisation instruction which marks a point in the kernel that all the work-items inside a work-group have to reach before the program is allowed to continue. Because synchronisation between work-groups is not allowed, a barrier in one work-group will not affect the execution of other work-groups.
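The following OpenCL C sketch (our own example, with hypothetical names) shows the typical pattern: work-items publish data in local memory, a barrier guarantees the writes are visible to the whole work-group, and one work-item then combines the results.

__kernel void local_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];   /* publish this work-item's value */
    barrier(CLK_LOCAL_MEM_FENCE);          /* wait for the whole work-group  */
    if (lid == 0) {                        /* one work-item sums the group   */
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;
    }
}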

3.1.3 Memory

OpenCL devices have four different memory spaces that can be used by developers: private, local, constant, and global memory. The speed of the memories is in the order they are listed, where private is the fastest, then local, and so forth. Each work-item has its own private memory, which only that work-item can see and access. The private memory is the smallest but the fastest memory available on the GPU. Local memory is shared between all work-items inside a work-group. This memory is slower than the private memory but faster than the global memory. The local memory is also larger than the private memory. The constant memory is a memory space that is accessible to all the work-items on an OpenCL device. However, this memory is read only and can only be allocated by the host. This memory space, as its name implies, is good for holding constants. Private, local, and constant memory are cached memories, which means that after a read from a memory location any consecutive reads from the same location will be very fast, as long as the memory is unchanged. The global memory is the largest but the slowest memory. It is accessible to all work-items in all work-groups on the same OpenCL device. The global memory is a streaming memory, which means that it achieves its best performance when memory accesses are contiguous. Making small, scattered accesses to the global memory can have a severe negative impact on the performance of the executing kernels.

After a kernel's execution has completed, the private and local memory space allocated for that kernel is deallocated and any data in those memory spaces is lost. Therefore, any data that should be saved has to be copied to the global memory. The host can then read the data or another kernel can use it. The OpenCL memory hierarchy is illustrated in Figure 3.1.

[Figure 3.1: OpenCL memory hierarchy [20]. The host memory connects to the device's global/constant memory; each work-group has its own local memory, and each work-item within a work-group has its own private memory.]

3.2 Floating-point or Integer

To help us make an informed decision about which data type to use for our multi-precision number system implementation, we conducted an experiment that tested the performance impact the data types had on a set of instructions. This set of instructions modelled a basic arithmetic function in a multi-precision number system. As the set of instructions was small, conventional methods of measuring execution time would not be sufficient. Instead, we constructed a profiling method using the CPU's TSC and statistical analysis to get the execution time of small sets of instructions.

There are several challenges to overcome to get the time it takes to perform a small set of instructions on the GPU. Because kernels are queued before they are executed, it is uncertain how long it will take before the GPU starts executing the kernel. There is a similar problem when the kernel is done, because the time it takes for the host program to get the CPU time varies depending on whether there are other programs or services running that have higher priority. This is partly due to delays from the operating system, and partly because of other GPGPU programs or graphics software running on the GPU. The delays can be managed and reduced by turning off background services and graphical software. On average, these delays should be similar. Therefore, by measuring the runtime of the program several times, and then averaging the measurements, this effect should be minimised.

The problem with measuring a small set of instructions is that the overhead associated with running a GPU program is larger than the time it takes to run the instruction set. This was solved by running a varying amount of copies of the instruction set inside the kernel, causing the execution time to increase linearly with the amount of sets. With the linearly increasing time, it is possible to analyse the result despite the large overhead.

A side effect from using the same instruction set several times in a row is that the compiler might optimise the kernel so that only one instruction set is performed. To make sure that the compiler did not remove any instructions in the optimisation, dependencies had to be added between the sets.

The final challenge was to decide the number of work-items to use. Tests showed that when using fewer work-items than the GPU could handle simultaneously, the execution time was unaffected by the number of work-items. The tests also showed that the added scheduler overhead is most likely relatively constant, because when increasing the number of work-items beyond what the GPU could handle, the execution time increased linearly.

The execution time for a single kernel T_kernel (Equation 3.1) depends on how many repetitions i of an instruction set are performed, how long a single instruction set b takes, the scheduler cost k, and the overhead f from loading the variables. The total execution time T_total (Equation 3.2) for the GPU program depends on how many work-items w are used, the kernel execution time T_kernel, and the overhead σ from starting the GPU program. This means that the measurement T_total varies based on two independent variables, i and w.

T_kernel(i) = b × i + f + k        (3.1)

T_total(w, i) = w × T_kernel(i) + σ        (3.2)

By using increasing amounts of work-items and repetitions of the instruction set, together with the high-precision timer on the CPU, we could measure T_total accurately enough to perform statistical analysis. The statistical analysis was performed by using linear regression in two steps. In the first step, linear regression was performed for every i, which gave one first order function per value of i. These were later used for calculating T_kernel, by using linear regression on the function coefficients.

3.2.1 Kernel description

The benchmarked kernel can be separated into two parts, a base kernel and the repeated test kernel. The base kernel, Algorithm 3.2, takes two parameters which are the values used in the benchmarking of the kernel. The values are copied from the global memory into private variables, so as not to incur any additional cost from accessing the global memory during the execution of the remainder of the kernel. During execution the variables a, s1, s2 are modified such that the compiler cannot predict their outcome. The final value in a is then stored to global memory so that the compiler recognizes that there is work to be done. Without the transfer of a, the compiler correctly identifies that nothing in the kernel will be returned and removes all the instructions in the kernel.

Algorithm 3.2 Kernel base for benchmarking

Input: seed1, seed2
Output: r
s1 = seed1
s2 = seed2
a = 0
. . .
i-number of repeating instruction sets
. . .
r = a

To compare the performance impact of integer and floating-point instructions, the kernels have to use equivalent instructions in order to give a fair result. Because we used this result to decide which data type to use for our multi-precision number system implementation, the instruction set had to be representative of the instructions used in a multi-precision number system. Our multi-precision number system is used to calculate the sequences of a BBS generator, which for the most part is a sequence of large multiplications. A frequently used instruction set in a multi-precision multiplication is the inner partial product. The inner partial product multiplies two limbs, adds any carry over, and adds its result to the relevant position. Because it is used frequently, we chose that instruction set to model the performance of our multi-precision number system. The instruction set includes one multiplication and two additions. It is a simplified model of the inner partial multiplication, because in practice the limb multiplication would have to be full precision. However, a simplified model allows for a clearer comparison of the floating-point and integer instructions.

When creating a multi-precision number system, a point of interest is the computational performance of limbs when trying to maximise the available precision. The performance of a multi-precision number system is usually dependent on the number of limbs; therefore increasing the number of bits per limb can be a way to increase precision without performance loss. Choosing the floating-point data type means only 24 bits are usable per limb, while using integers gives 32 bits of precision per limb. To create a multi-precision number of equal precision, for every three integer limbs, four floating-point limbs are required.

To make sure that no iterations are lost due to optimisation, the iterations have data dependencies between them in such a way that the compiler is unable to find a method of calculation that would cause a constant execution speed, regardless of the number of iterations. The test instruction set is shown in Algorithm 3.3 and the full kernel description in Algorithm 3.4.

Algorithm 3.3 Instruction set

s1 = s1 + s2
a = a + (s1 × s2)

Algorithm 3.4 Full kernel description

Input: seed1, seed2
Output: r
s1 = seed1
s2 = seed2
a = 0
s1 = s1 + s2
a = a + (s1 × s2)
. . .
s1 = s1 + s2
a = a + (s1 × s2)
. . .
s1 = s1 + s2
a = a + (s1 × s2)
r = a
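An OpenCL C sketch of the floating-point variant of this kernel could look as follows. It is our own illustration: the names and the compile-time constant REPS (e.g. passed as a build option "-D REPS=1024") are assumptions, whereas the kernels actually benchmarked wrote out the instruction sets explicitly.

__kernel void bench_float(const float seed1, const float seed2,
                          __global float *r)
{
    float s1 = seed1;
    float s2 = seed2;
    float a  = 0.0f;
    for (int k = 0; k < REPS; ++k) {
        s1 = s1 + s2;            /* instruction set: one addition ...        */
        a  = a + (s1 * s2);      /* ... one multiplication and one addition  */
    }
    r[get_global_id(0)] = a;     /* store a so the compiler keeps the work   */
}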


3.2.2 Performance analysis

After analysing the data from a performance test, it was apparent that a single test run would not generate enough data to achieve an acceptable variance threshold in the linear regression. Therefore, an ad-hoc solution was used where each test is repeated 50 times to generate enough data to achieve an acceptable variance threshold. All the test data was saved and used unmodified with simple linear regression. The number of work-items and the kernel lengths used during the tests are outlined in Table 3.1. The tests were performed with work-group sizes 64, 128, and 256, to check if the group size had any large performance impact. The execution times are in CPU-ticks, which are counted at a rate of 3.2 GHz.

Iterations (i):   256, 512, 768, 1024, 1280, 1536, 2048, 2560, 3072, 3584, 4096
Work-Items (t):   2048, 4096, 8192, 16384, 32768, 49152, 65536, 98304, 131072, 196608, 262144

Table 3.1: Floating point and integer test parameters

[Figure 3.2: Results of Floating point and Integer With Group Size 64. Panels: (a) Float 64, (b) Integer 64. Each panel plots the measured execution time in CPU ticks against the number of work-items and the number of iterations.]

[Figure 3.3: Results of Floating point and Integer With Group Size 128. Panels: (a) Float 128, (b) Integer 128. Each panel plots the measured execution time in CPU ticks against the number of work-items and the number of iterations.]


[Figure 3.4: Results of Floating point and Integer With Group Size 256. Panels: (a) Float 256, (b) Integer 256. Each panel plots the measured execution time in CPU ticks against the number of work-items and the number of iterations.]

Figures 3.2 to 3.4 show the average execution time for the GPU programs with the different parameters. The result of applying two linear regressions according to our model is shown in Table 3.2, which translates to execution time in CPU-ticks per instruction set. A drawback of our model is that the analytical method assumes that the execution of instruction sets is performed sequentially. That means the indicated time to execute a single instruction set is divided by the number of parallel processing units. A different way of presenting the result is to show throughput. Throughput is the number of instruction sets that can be performed per CPU-tick. Throughput is calculated by dividing one by the CPU-ticks per instruction set, to get instruction sets calculated per CPU-tick. The throughput is shown in Table 3.3.

Group Size   Floating Point   Integer     Int/Float
64           0.0078125        0.0157575   2.01696
128          0.00798549       0.0160631   2.011535
256          0.00815848       0.0159807   1.958784

Table 3.2: CPU-ticks per instruction set

Group Size   Floating Point   Integer    Float/Int
64           128              63.4618    2.01696
128          125.2271         62.2544    2.011535
256          122.5719         62.5754    1.958784

Table 3.3: Throughput, instruction sets calculated per CPU-tick

The work to perform arithmetic operations on multi-precision numbers usually depends on the number of limbs, either scaling linearly O(n) or quadratically O(n²). Because floating-point requires more limbs to provide the same precision, it also means more work. The work for floating-point in a linear algorithm is 4x while integer has 3x; in a quadratically scaling algorithm the work for floating-point is (4x)² = 16x² while it is (3x)² = 9x² for integers. However, the results from our performance tests show that calculating the inner partial product with floating-point has twice the throughput of integers. Therefore, the time to complete a linearly scaling algorithm is 4x/2 = 2x, and 16x²/2 = 8x² for a quadratically scaling algorithm. Both of these are lower than the equivalent times for integers, meaning that for algorithms that scale linearly or quadratically, floating-point will provide better performance. A flaw with this analysis is that it assumes that the entire precision of the data types is used for the multi-precision number system. In a practical implementation, some bits might be reserved to make it possible to detect overflow or to use Lazy Addition. However, because this experiment was meant to be a guideline for the early design decisions of our multi-precision number system implementation, we deemed this a minor flaw.

An alternative analysis is to consider the processing speed relative to the number of bits processed. The integer data type provides 32 bits while the floating-point data type provides 24 bits. Hence, for the floating-point to have an advantage, it has to have a throughput at least 32/24 ≈ 1.33 times greater than the integer in order to process more bits in a given time period. Our results show that the floating-point has twice the throughput when calculating inner partial products compared to the integer. Thus, according to this analysis, using the floating-point data type should provide better performance for a multi-precision number system on the GPU. Because this analysis only focuses on the bit throughput, it does not capture the complexity of multi-precision algorithms. However, it is not dependent on the number of bits used in limbs and therefore does not suffer from the same flaw as the first analysis.

By using the results from both analysis methods, it is possible to make a strong conclusion that the floating-point data type provides the best performance when calculating inner partial products on the GPU. Therefore, we implemented our multi-precision number system using the floating-point data type.

Our results are limited to the performance of the integer and floating-point data types for a specific instruction set on the GPU. However, our profiling method can also be used to calculate the performance of the individual equivalent instructions, i.e. addition, multiplication, division, etc., for the two data types. Using multiple kernels with different amounts of each instruction makes it possible to analyse the kernels' runtimes by regarding them as a system of equations, which can be used to calculate the individual instructions' performance. E.g. a kernel that performs two additions and one multiplication has the equation 2A + M = Y1, and a kernel with one addition and three multiplications has the equation A + 3M = Y2, where Y1 and Y2 are the performance results of the respective kernels.
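For example, solving this particular pair of equations (a step we write out here only for illustration) gives A = (3Y1 − Y2)/5 and M = (2Y2 − Y1)/5, i.e. the estimated cost of a single addition and a single multiplication respectively.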


Chapter 4

Multi-Precision arithmetic

There are multiple methodologies for creating multi-precision number systems. Two common methods are the radix based multi-precision number system and the residue number system.

The radix based approach is simple to understand because it builds upon the commonly known positional number system. One downside with the radix based approach, however, is that there are dependencies between the limbs when performing arithmetic operations, which means that carry propagation still needs to be handled between the limbs. An RNS approach, on the other hand, does not have any dependencies between the limbs during most arithmetic operations; however, RNS is more complex to implement than the radix based approach, and some of the arithmetic operations require base changes to be able to calculate their results efficiently. Therefore, we decided to use a radix based multi-precision number system.

In a radix based multi-precision number system, a large number A is represented as an array A = (a0, a1, · · · , an−1), where a0 is the least significant limb and an−1 the most significant limb. Our experiment in Section 3.2.2 showed that the floating-point data type was faster than the integer data type when used on the GPU. Therefore, our multi-precision number system is implemented using floating-point. This means that the radix can be at most 2^24, as explained in Section 2.1.2. However, when performing arithmetic operations there is no system in place to warn or notify when precision is lost because a number is too large for a limb. This means that carry detection has to be performed manually. A consequence of using floating-point is that, in contrast to using integers, there are no bitwise instructions. Bitwise instructions allow easy carry extraction and masking because individual bits can be processed. Instead, equivalent operations have to be mathematical or take advantage of the floating-point properties to detect and extract the carry.

To extract the carry from a limb with radix r, a division by r is used, and a modulus by r can be used to remove the carry. Both division and modulus are expensive, even as native operations [35]; instead these operations can be optimised by using multiplication with reciprocals. The value 1/r can be pre-calculated once, and then re-used when needed. With the pre-calculated reciprocal, the carry c and the remainder of a_k can conveniently be calculated in two steps. First the carry is calculated with the equation c = ⌊a_k × r⁻¹⌋. The remainder can then be calculated by using the carry: a_k = a_k − r × c.
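A small OpenCL C sketch of this two-step carry extraction is shown below; the radix value 2^20, the macro names and the helper function are our own assumptions for illustration.

#define RADIX      1048576.0f        /* r, assumed to be 2^20 for illustration */
#define RADIX_INV  (1.0f / RADIX)    /* 1/r, pre-calculated reciprocal         */

/* Extracts the carry from one limb and reduces the limb below the radix. */
float extract_carry(float *limb)
{
    float c = floor(*limb * RADIX_INV);   /* c   = floor(a_k * r^-1) */
    *limb   = *limb - RADIX * c;          /* a_k = a_k - r * c       */
    return c;
}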

There are several different methods of creating a parallelised implementation of a radix based multi-precision number system for the GPU. One method is to let each work-item hold a single limb of the multi-precision number. In this method, the maximum precision available is directly set by the work-group size, because a radix based multi-precision number system requires communication between work-items, and work-groups cannot communicate with each other. Another method is to hold several limbs per work-item, which will increase the delay for each work-item to finish but decrease the amount of communication required. A third method is to hold all the limbs in a single work-item, meaning each work-item performs an independent calculation. This method has the most delay, because everything is done in the same work-item, and is the same as running the algorithm sequentially. However, no communication is required and many independent calculations can be performed in parallel. For our multi-precision number system implementation for the GPU, the first method was chosen because it builds upon a simple idea and, to our knowledge, no one had tested this form of parallelisation before.

4.1 Addition

Our multi-precision addition implementation is divided into two parts. The first part is called Simple Addition and is built upon the lazy addition method, where carry propagation is ignored. This is used since performing carry propagation on every addition degrades performance and is unnecessary, as shown in Section 2.3. The second part is the Complex Addition, which takes in a multi-precision number and resolves any carry propagations that are required.

4.1.1 Simple Addition

Simple Addition takes in two n-limb long multi-precision numbers and adds the limbs in their respective positions together. The time it takes to complete a sequential Simple Addition is T(n), with a complexity of O(n). However, since each limb in the multi-precision number can be computed independently, the Simple Addition can be parallelised so that the time is reduced to T(n/p), where p is the number of processors available. The parallelised version of Simple Addition is shown in Algorithm 4.1.


Algorithm 4.1 Parallel Simple Addition

Input: A{a0, . . . , an−1}, B{b0, . . . , bn−1}
Output: A
id = get_local_id
a_id = a_id + b_id

4.1.2 Complex Addition

Complex Addition takes in a single multi-precision number and performs any carry propagations required for the radix used. As multi-precision numbers are used to represent very large numbers, the traditional ripple-carry method is too inefficient, because it has a complexity of O(n) and cannot be parallelised, as shown in Section 2.2.1.

This problem also affects electrical engineers when they design processors with large registers, e.g. 64-bit registers, because bit-level carry propagation becomes very slow. They solve this problem by instead using carry look-ahead adders. There are several different carry look-ahead adders. However, we chose to use the Kogge-Stone carry look-ahead adder, because it is fully parallel and it propagates carries in a tree-based fashion, allowing it to complete the propagation for all limbs in ⌈log2(n)⌉ steps [24, 18].

4.1.3 Binary Kogge-Stone

To be able to understand how the binary version of Kogge-Stone works, an understanding of Boolean logic is required. Computers and other digital circuits are based upon Boolean logic, and it is used to perform just about every operation in them. Boolean logic is used to determine if a Boolean statement is true or false, which is represented digitally as 1 and 0. Boolean operations are used to form Boolean statements and to calculate their outcome. Two values are compared with Boolean operations to decide if the Boolean statement is true or false. In Boolean logic, there are three main operations.

Conjunction (∧), also called AND: this operator will deem the statement true if and only if both values are true, as shown in Table 4.2.

Disjunction (∨) is the second operation and is called OR: a statement using disjunction is true if at least one of the values is true, as shown in Table 4.2.

Negation (¬) is the last main operation and is called NOT: this operation inverts a value, i.e. a true value becomes false and vice versa, as shown in Table 4.1.


X   ¬X
T   F
F   T

Table 4.1: Truth table for Negation (NOT)

Exclusive Disjunction (⊕), also called XOR, is a commonly used operation, which is derived by combining the main operations in the following fashion: X ⊕ Y = (X ∧ ¬Y) ∨ (¬X ∧ Y). A statement using XOR is only true if the two values are different, as shown in Table 4.2.

X   Y   X ∧ Y   X ∨ Y   X ⊕ Y
T   T   T       T       F
F   T   F       T       T
T   F   F       T       T
F   F   F       F       F

Table 4.2: Truth table for Conjunction (AND), Disjunction (OR) and Exclusive Disjunction (XOR)

Kogge-Stone uses Boolean logic to perform the calculations necessary for the addition and the carry propagation of two binary numbers. Figure 4.1 shows an illustration of how the bits flow through the Kogge-Stone algorithm when 11010₂ = 26₁₀ and 1001010₂ = 74₁₀ are added, where each column or position is a bit of the number and each row is a step in the algorithm's progression. In the initial step of Kogge-Stone, two values are calculated for each position, called propagate (P) and generate (G), in the manner shown in Algorithm 4.2. The propagate value signifies that the position will propagate a carry if it receives one. The generate value signifies that the addition in the position will generate a carry. After the initial step, each position passes along its P and G values to the position 2^s positions ahead of itself, where s is the number of steps performed. Boolean logic is used to calculate whether a carry will be generated or propagated, with the algorithm shown in Algorithm 4.3; if both the generate and the propagate values are 0 for a position, then that position will not propagate any more carries. This process is repeated for ⌈log2(n)⌉ steps, where n is the number of positions in the largest number. When the necessary number of steps has been performed, the result for each position is calculated by inputting the initial propagate value of the position and the previous position's final generate value into the Boolean XOR operation, as shown in Algorithm 4.4.


[Figure 4.1: Example of carry propagation with Binary Kogge-Stone. The figure traces the addition of 11010₂ = 26₁₀ and 1001010₂ = 74₁₀, showing the propagate and generate values computed at each position and each step, and the result bits produced by the final XOR step. Carries propagate from the least significant bit (right) towards the most significant bit (left).]

Algorithm 4.2 Initial Binary Kogge-Stone values

P = A_i ⊕ B_i
G = A_i ∧ B_i

Algorithm 4.3 Binary Kogge-Stone propagate and generate carry

P_i = P_i ∧ P_{i−2^s}
G_i = (P_i ∧ G_{i−2^s}) ∨ G_i

Algorithm 4.4 Binary Kogge-Stone result

R = P_i ⊕ G_{i−1}

4.1.4 Numerical Kogge-Stone

Because Kogge-Stone is a binary adder, some modifications had to be made to make it work for numbers with a higher radix. The first modification changed the Numerical Kogge-Stone so that it only takes a single number as input instead of two, because lazy addition is used to perform the addition and Kogge-Stone is then only used to perform any necessary carry propagations. Using lazy addition creates a problem for Kogge-Stone, however, as it can create situations where a position has a carry larger than one. Initially, Kogge-Stone has no support for handling carries that are larger than one. So the initial step of Kogge-Stone was modified to perform a single-step carry propagation before calculating the propagate value and the generate value. This increases the number of steps required to complete the Kogge-Stone addition from ⌈log2(n)⌉ to ⌈log2(n)⌉ + 1.

As Boolean logic only works for integer data types and our implementation uses floating-point, the numerical Kogge-Stone has to be implemented with floating-point instructions. The propagate value and the generate value are therefore calculated with normal arithmetic and comparative operations, i.e. equal, less than, greater than, etc. The comparative operations return 1 if they are true and 0 if they are false. The modified initial step of Kogge-Stone, illustrated in Figure 4.3(a), takes a single limb of the multi-precision number as input, calculates the initial carry and passes it along to the next position before the propagate and generate values are calculated, as shown in Algorithm 4.5.

[Figure 4.2: Example of carry propagation with Kogge-Stone. The figure shows a base-10 example where each position reduces its initial value modulo the base, passes the excess to the next position, computes the propagate flag ((value + carry) == BASE − 1) and the generate flag ((value + carry) > BASE − 1), and finally produces its result digit. Carries propagate from the least significant position (right) towards the most significant position (left).]

Algorithm 4.5 First Node

A_i = X_i mod radix
C_{i+1} = ⌈X_i / radix⌉
P_i = (A_i + C_i) == radix − 1
G_i = (A_i + C_i) > radix − 1

In the algorithms, each position's value is calculated by taking the position's initial value modulo the radix, and the excess that should be sent to the next position is calculated by dividing the position's initial value by the radix. The propagate value is calculated by using the comparative operation equal (==) between radix − 1 and the sum of the position value and the received carry. This is the higher radix equivalent of what the binary version does when it uses XOR to calculate the propagate value, because if the value in the position is equal to radix − 1 it will propagate a carry if it receives a carry from the previous position. The generate value is calculated in a similar manner: the same sum is checked to see if it is larger than radix − 1, which signifies that this position will generate a carry, just like the binary version does when using the AND operation.

As the propagate value and generate value from the initial step are still either 0 or 1, the repeating propagation steps required only minor changes even though the implementation uses floats instead of integers. The Boolean operations are replaced by normal arithmetic operations that fulfil the same purpose. Multiplication acts as an AND operation, because if one of the values is 0 the outcome will be 0. In the same manner, OR is replaced with addition, which becomes 1 as long as one of the values is 1. The new algorithm is shown in Algorithm 4.6.

Algorithm 4.6 Propagation Node

P = P_o × P_i
G = (G_i × P_o) + G_o

The final result for each position is calculated by taking the position's initial value (A_i + C_i) and adding to it the previous position's generate value (G_{i−1}), as shown in Algorithm 4.7.

Algorithm 4.7 Final Node

S = A_i + C_i + G_{i−1}

The computational complexity of this Kogge-Stone implementation is O(n log2(n)), which is worse than the ripple method when executed sequentially. However, when executing the Kogge-Stone implementation on a parallel platform, it has a time complexity of T(n log2(n)/p), p ≤ n, where p is the number of processors available. Because the GPU is highly parallel and modern GPUs have hundreds of processors, it can satisfy the p = n condition so that a time complexity of T(log2(n)) can be achieved.


[Figure 4.3 depicts the three node types: (a) the First Node, which takes a limb X and an incoming carry Ci and produces C, A, the propagate and generate flags P and G, and the outgoing carry Co; (b) the Propagation Node, which combines (Pi, Gi) and (Po, Go) into (P, G); and (c) the Final Node, which combines C, A and G into the sum S.]

Figure 4.3: Nodes used in Kogge-Stone carry look-ahead adder

4.1.5 Parallel algorithm

[Figure 4.4 sketches the parallel complex add: the work-items (p) of a work-group each own one limb, and the split, propagate and generate steps are applied over time (t).]

Figure 4.4: Parallel Complex Add

The parallel version of our Kogge-Stone implementation is based around the work-groups in OpenCL, mapping each work-item in the work-group to a single limb


in the multi-precision number (fig. 4.4). Because all work-items are in the same work-group, they can use the local shared memory for carry propagation.

The parallel implementation, shown in Algorithm 4.8, takes in a single limb, performs any necessary carry propagation for that limb, and then returns the carry-propagated limb. In the initial step the excess of the limb is carry propagated as explained in the previous section 4.1.4. However, some optimisations of the algorithm were made. The division and modulus with the radix are replaced according to the optimisation detailed in the introduction to this chapter. Another optimisation concerns the variable holding the offset to the position that the current position should propagate its carry to (2^s, where s is the number of elapsed steps). Instead of being calculated with a multiplication, it is updated with a binary shift operation, which moves the 1 in the binary value one position higher each step and thus doubles the value each time.

The memory requirement for the parallel Kogge-Stone is 4n local memory and 6n private variables. Both P and G are twice as big as n, because all work-items are going to read memory for input, even if a work-item is never going to get any input, e.g. the work-item with id = 0. It is an optimisation so that no work-items will diverge. A second reason for P and G being twice as big is to prevent the risk of performance loss due to odd access patterns. Because the implementation is executed in parallel, barriers have to be used when values are propagated, to resolve any synchronisation problems where a work-item reads from a memory position before another work-item has finished writing to it, and vice versa.


Algorithm 4.8 Parallel version of Kogge-Stone

Input: A, id = local group id, P = {p[−n], . . . , p[n−1]}, G = {g[−n], . . . , g[n−1]}
Output: A

 1: C = ⌊A × r^−1⌋
 2: S = A − r × C
 3: g[id] = C
 4: Barrier(local)
 5: C = g[id−1]
 6: Barrier(local)
 7: p[id] = (S + C) == r − 1
 8: g[id] = (S + C) > r − 1
 9: Barrier(local)
10: shift = 1
11: for k = 0 to ⌈log2(n)⌉ do
12:   tp = p[id] × p[id−shift]
13:   tg = p[id] × g[id−shift] + g[id]
14:   Barrier(local)
15:   p[id] = tp
16:   g[id] = tg
17:   Barrier(local)
18:   shift = shift × 2
19: end for
20: A = (S + C + g[id−1])
21: A = A − r × ⌊A × r^−1⌋
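To make the data flow of Algorithms 4.5 to 4.8 concrete, the following C sketch simulates the numerical Kogge-Stone carry propagation sequentially on the host. It is not the OpenCL kernel used in the thesis: the per-work-item steps are replaced by plain loops over arrays, and the limb values in main are made up for illustration.

#include <math.h>
#include <stdio.h>

#define N     8
#define RADIX 262144.0               /* r = 2^18 */

/* Sequential simulation of the numerical Kogge-Stone carry propagation.
   Each array index corresponds to one work-item owning one limb. */
static void kogge_stone_propagate(double x[N]) {
    double a[N], c[N + 1], s[N], p[N], g[N], tp[N], tg[N];

    /* First node (Algorithm 4.5): split off the excess, hand it to the next
       position, and compute propagate/generate as 0/1 values. */
    c[0] = 0.0;
    for (int i = 0; i < N; i++) {
        c[i + 1] = floor(x[i] / RADIX);
        a[i]     = x[i] - RADIX * c[i + 1];
    }
    for (int i = 0; i < N; i++) {
        s[i] = a[i] + c[i];
        p[i] = (s[i] == RADIX - 1.0);   /* would pass an incoming carry on */
        g[i] = (s[i] >  RADIX - 1.0);   /* produces a carry by itself      */
    }

    /* Propagation nodes (Algorithm 4.6): log2(N) combining steps, with
       multiplication acting as AND and addition acting as OR. */
    for (int shift = 1; shift < N; shift *= 2) {
        for (int i = 0; i < N; i++) {
            double pin = (i >= shift) ? p[i - shift] : 0.0;
            double gin = (i >= shift) ? g[i - shift] : 0.0;
            tp[i] = p[i] * pin;
            tg[i] = p[i] * gin + g[i];
        }
        for (int i = 0; i < N; i++) { p[i] = tp[i]; g[i] = tg[i]; }
    }

    /* Final node (Algorithm 4.7): add the carry generated below and reduce.
       Any carry out of the top limb is dropped in this sketch. */
    for (int i = 0; i < N; i++) {
        double v = s[i] + ((i > 0) ? g[i - 1] : 0.0);
        x[i] = v - RADIX * floor(v / RADIX);
    }
}

int main(void) {
    /* Limb values after some lazy additions; several exceed the radix. */
    double x[N] = {300000, 261000, 262143, 500000, 10, 262200, 7, 0};
    kogge_stone_propagate(x);
    for (int i = 0; i < N; i++)
        printf("%.0f ", x[i]);
    printf("\n");
    return 0;
}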

4.2 Multiplication

To perform multiplication on two n-limb multi-precision numbers, n^2 single-limb full-precision multiplications are required. This means that the entire 2b-bit product from multiplying two b-bit limbs has to be kept. However, when using a radix based multi-precision number system with b-bit limbs, the result has to be divided over two b-bit limbs in order to store it and not lose any precision. This presents a problem when performing floating-point multiplication of the limbs, because floating-point multiplication automatically normalises and rounds the result.

We have found two methods for performing full-precision multiplication. One is based on the schoolbook method of multiplication; the other method was created by T. J. Dekker and takes advantage of the properties of IEEE 754 floats.

When using the schoolbook method, it is known that a multiplication of two b-bit numbers gives a result which is 2b bits large.


This means that if only 18 bits are available to store the result, then the factors of the multiplication have to be split into parts that are at most 9 bits in size. These parts can then be multiplied together as illustrated in Figure 4.5(b). Because our multi-precision number system uses floating-point variables, division and modulus have to be used to break up the factors, as shown in Algorithm 4.9.

Algorithm 4.9 Full precision multiplication

Input: A, B, √r
Output: C : c0, c1

a0 = A mod √r
a1 = A / √r
b0 = B mod √r
b1 = B / √r
t = a0 × b1 + a1 × b0
c0 = a0 × b0 + (t mod √r) × √r
c1 = a1 × b1 + (t / √r) + c0 / r
c0 = c0 mod r

[Figure 4.5 contrasts (a) a single-limb multiplication of A and B producing the two result limbs C0 and C1 with (b) the split single-limb multiplication, where the partial products a0·b0, a0·b1, a1·b0 and a1·b1 are combined into C0 and C1.]

Figure 4.5: Multiplication
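As a concrete illustration of Algorithm 4.9, the host-side C sketch below multiplies two 18-bit limbs by splitting them into 9-bit halves. It uses doubles so the cross-check against 64-bit integer arithmetic fits exactly, whereas the thesis works on 32-bit floats; the limb values are made up.

#include <math.h>
#include <stdio.h>

/* Schoolbook split multiplication of two radix 2^18 limbs: each limb is
   split into 9-bit halves so every partial product stays exact, and the
   36-bit result is returned as two limbs (c0, c1). */
static void split_mul(double a, double b, double *c0, double *c1) {
    const double r  = 262144.0;   /* 2^18 */
    const double sr = 512.0;      /* sqrt(r) = 2^9 */

    double a1 = floor(a / sr), a0 = a - a1 * sr;
    double b1 = floor(b / sr), b0 = b - b1 * sr;

    double t  = a0 * b1 + a1 * b0;          /* middle partial products        */
    double lo = a0 * b0 + fmod(t, sr) * sr; /* contribution to the low limb   */
    double hi = a1 * b1 + floor(t / sr);    /* contribution to the high limb  */

    double carry = floor(lo / r);           /* low limb may exceed the radix  */
    *c0 = lo - carry * r;
    *c1 = hi + carry;
}

int main(void) {
    double a = 212345.0, b = 177777.0;      /* two made-up 18-bit limbs */
    double c0, c1;
    split_mul(a, b, &c0, &c1);
    unsigned long long exact = 212345ULL * 177777ULL;
    printf("c1*r + c0 = %.0f, exact = %llu\n", c1 * 262144.0 + c0, exact);
    return 0;
}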

The other method was published by Dekker, who found a way to get the full precision of a multiplication by calculating the floating-point error of the floating-point multiplication [12]. The Dekker multiplication takes two factors A, B and calculates the result z + zz, where z is the answer from a normal floating-point multiplication and zz is the difference from the full precision arithmetic answer.


Algorithm 4.10 Dekker Splits [12]

Input: A, b/2 ≤ s ≤ b − 1
Output: (ah, at)

constant = 2^s + 1
p = A × constant
q = A − p
ah = p + q
at = A − ah

It works by splitting the input factors using a splitting value, constant, to offset the factors according to Algorithm 4.10, where b is the number of bits available in the mantissa and s is used to increase the input value such that s bits of precision are lost in the normalisation and rounding. Expanding the calculation of ah, as in Equation 4.1, makes it easier to see why: the addition of one A to the magnitude of A × 2^s forces the addition to be rounded and lose precision. The result is that ah contains the upper b − s bits of A and at contains the rest of A, such that A = ah + at. The splitting algorithm makes it possible to perform the full precision Dekker multiplication as shown in Algorithm 4.11. The Dekker multiplication produces z and zz, where z is the normal result of a floating-point multiplication and zz is the floating-point error.

p = A × (2^s + 1)
q = A × (1 − (2^s + 1)) = A × (−2^s)
ah = A × (2^s + 1 − 2^s) = A        (4.1)

Algorithm 4.11 Dekker Multiplication [12]

Input: A, B ≥ 0
Output: z, zz

s = 12
(ah, at) = Split(A, s)
(bh, bt) = Split(B, s)
z = A × B
zz = (ah × bh) − z
zz = zz + ah × bt
zz = zz + at × bh
zz = zz + at × bt
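The following host-side C sketch carries out the split and the multiplication on 32-bit floats with s = 12. It assumes round-to-nearest float arithmetic without excess precision or fused multiply-add contraction, and the limb values are made up; the double-precision product is only used to check that z + zz is exact.

#include <stdio.h>

/* Dekker split of a float into a high and a low part (Algorithm 4.10),
   here with s = 12 so ah keeps the upper half of the 24-bit mantissa. */
static void dekker_split(float a, float *hi, float *lo) {
    const float c = 4097.0f;       /* 2^12 + 1 */
    float p = a * c;
    float q = a - p;
    *hi = p + q;                   /* upper bits of a */
    *lo = a - *hi;                 /* remaining lower bits, a == hi + lo */
}

/* Dekker multiplication (Algorithm 4.11): z is the rounded float product,
   zz the rounding error, so that z + zz is the exact product. */
static void dekker_mul(float a, float b, float *z, float *zz) {
    float ah, at, bh, bt;
    dekker_split(a, &ah, &at);
    dekker_split(b, &bh, &bt);
    *z  = a * b;
    *zz = ((ah * bh - *z) + ah * bt + at * bh) + at * bt;
}

int main(void) {
    float a = 212345.0f, b = 177777.0f;    /* two made-up 18-bit limbs */
    float z, zz;
    dekker_mul(a, b, &z, &zz);

    /* The exact 36-bit product fits in a double, so z + zz can be checked. */
    double exact = (double)a * (double)b;
    printf("z = %.1f  zz = %.1f  z+zz = %.1f  exact = %.1f\n",
           (double)z, (double)zz, (double)z + (double)zz, exact);
    return 0;
}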

Before the result from the Dekker multiplication can be used in a radix based multi-precision number system, z and zz need to be transformed into the expected positional format, by


performing Algorithm 4.12. The large value in z can be managed using division and modulus, implemented using multiplication by the inverse of the radix, similar to complex addition. The value in zz can become negative, which means the lower positional value might become negative. When this occurs, one has to be subtracted from the upper positional value and r added to the lower positional value. This problem was solved by using a compare operation that returns 1 when the comparison is true, and 0 when false.

Algorithm 4.12 Transforming output from Dekker multiplication to PNS

Input: z, zz
Output: C : {c0, c1}, c0, c1 ≥ 0

c1 = ⌊z × r^−1⌋
c0 = (z − c1 × r) + zz
c1 = c1 − (c0 < 0)
c0 = c0 + (c0 < 0) × r
return (c0, c1)
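A small C sketch of this transformation is shown below; the comparison (c0 < 0) evaluates to 0 or 1 in C just as in OpenCL, giving the branch-free correction. The input pair (z, zz) is made up and chosen so that the correction actually triggers.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double r = 262144.0;              /* radix 2^18 */
    double z  = 37750046800.0;              /* made-up double-width value    */
    double zz = -105.0;                     /* made-up (negative) error term */

    /* Algorithm 4.12: split z into two radix limbs and fold in zz. */
    double c1 = floor(z / r);
    double c0 = (z - c1 * r) + zz;
    c1 = c1 - (c0 < 0.0);                   /* borrow from the upper limb... */
    c0 = c0 + (c0 < 0.0) * r;               /* ...and wrap the lower limb    */

    printf("c1 = %.0f, c0 = %.0f\n", c1, c0);
    return 0;
}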

When creating the Dekker multiplication for the GPU we had a problem with the compiler optimising away the calculation of the head and tail variables. The reason for this was that the calculation of ah is actually ah = p + A − p. The compiler will recognise that the calculation is unnecessary and change it to ah = A. There was also a similar bug caused by optimisation at zz = (ah × bh) − z. There is a standard way to stop the compiler from optimising certain parts, by classifying variables as volatile. The purpose of classifying a variable as volatile is to stop compiler optimisations because that variable may be changed from an unknown source at an unknown time. We knew the specific places where compiler optimisations caused problems, but volatile gave us only rough control of compiler optimisation, which is why we found another solution using the absolute value function.

Because A ≥ 0 and p ≥ A, it follows from the calculation q = A − p that q ≤ 0, so the calculation ah = p + q can be changed to ah = fabs(p) − fabs(q). fabs is the OpenCL function for calculating the absolute value of a floating-point variable. Using this modification the compiler will not remove any calculations. A similar change can be applied to the second calculation, changing it from zz = (ah × bh) − z to zz = fabs(ah × bh) − fabs(z). Algorithms 4.13 and 4.14 describe two versions of Dekker multiplication allowing full precision multiplication for a radix based multi-precision number system.


Algorithm 4.13 Dekker using Absolute

Input: A, B ≥ 0
Output: c0, c1

 1: constant = 2^12 + 1
 2: p = A × constant
 3: q = A − p
 4: ah = fabs(p) − fabs(q)
 5: at = A − ah
 6: p = B × constant
 7: q = B − p
 8: bh = fabs(p) − fabs(q)
 9: bt = B − bh
10: z = A × B
11: zz = fabs(ah × bh) − fabs(z)
12: zz = zz + ah × bt
13: zz = zz + at × bh
14: zz = zz + at × bt
15: c1 = ⌊z × r^−1⌋
16: c0 = (z − c1 × r) + zz
17: c1 = c1 − (c0 < 0)
18: c0 = c0 + (c0 < 0) × r

Algorithm 4.14 Dekker using Volatile

Input: A, B ≥ 0
Output: c0, c1

 1: p, q, z, zz are volatile
 2: constant = 2^12 + 1
 3: p = A × constant
 4: q = A − p
 5: ah = p + q
 6: at = A − ah
 7: p = B × constant
 8: q = B − p
 9: bh = p + q
10: bt = B − bh
11: z = A × B
12: zz = (ah × bh) − z
13: zz = zz + ah × bt
14: zz = zz + at × bh
15: zz = zz + at × bt
16: c1 = ⌊z × r^−1⌋
17: c0 = (z − c1 × r) + zz
18: c1 = c1 − (c0 < 0)
19: c0 = c0 + (c0 < 0) × r

4.3 Montgomery

To perform the iterations of the BBS algorithm, a multiplication and a modulus are required. Performing each of these operations would be expensive, because the multiplication would give a product that is double the size of the factors, and as discussed in Section 2.2.4 modulus needs to be performed through a division, which is sequential. What was needed was a method to do modular multiplication that does not have these drawbacks. Fortunately there is a technique called Montgomery multiplication which can do this, discovered by P. Montgomery [25].

Montgomery multiplication works by moving the factors into another modular domain, which henceforth will be called Montgomery representation. With the multi-precision numbers in the Montgomery representation, it is possible for us to do the multiplication and then reduce it by discarding limbs instead of performing a multi-precision division.

Instead of doing calculations with the factors a, b, it is done with aR (mod M), bR (mod M), where M is a modulus that fulfils M > 1 and R is a number that


has to be chosen such that R > M and gcd(R, M) = 1. This means R can be chosen in such a way that allows easy division. When R is chosen as R = r^k, where k is a number with k ≥ n and n is the number of limbs in M, division by R can be performed by discarding the k lowest positional limbs in a radix based multi-precision number system with radix r.

Montgomery multiplication multiplies aR (mod M) and bR (mod M) in such a way that instead of producing (ab) × R^2 (mod M), the result is still in Montgomery representation, i.e. (ab) × R (mod M). To do this the multiplication includes a Montgomery reduction that essentially does a division by R and performs the modular operation at the same time.

The Montgomery reduction is shown in Algorithm 4.15. To calculate the Montgomery reduction, the modular multiplicative inverse of M, i.e. M′ = 1/M (mod R), is required. M′ can be calculated using the extended Euclidean algorithm. The extended Euclidean algorithm is used to find gcd(a, b) by solving ax + by = gcd(a, b); it also finds x and y. If it is known that gcd(a, b) = 1, then x is the multiplicative inverse of a (mod b). Because M does not change, M′ can be pre-calculated to reduce the performance impact.

Algorithm 4.15 Montgomery Reduction [30]

Input: y = (ab) × R^2 (mod M)
Output: z = ab × R (mod M)

1: u = (−y × M′) mod R
2: z = (y + u × M)/R
3: if z ≥ M then
4:   z = z − M
5: end if
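The single-word C sketch below performs the reduction with R = 2^32 on ordinary integers, which is a toy version of the multi-limb case used in the thesis. The modulus, the inputs and the Hensel-lifting inverse are choices made for this illustration only; the code computes M^−1 and negates it inside the reduction, matching u = (−y × M′) mod R.

#include <stdint.h>
#include <stdio.h>

/* Inverse of an odd m modulo 2^32 by Newton/Hensel iteration;
   each step doubles the number of correct low bits. */
static uint32_t inv_mod_2pow32(uint32_t m) {
    uint32_t x = m;                    /* correct to 3 bits for odd m */
    for (int i = 0; i < 4; i++)
        x *= 2u - m * x;
    return x;                          /* m * x == 1 (mod 2^32) */
}

/* Montgomery reduction (Algorithm 4.15) with R = 2^32:
   given y < M^2, return z = y * R^-1 (mod M). */
static uint32_t redc(uint64_t y, uint32_t M, uint32_t Minv) {
    uint32_t u = 0u - ((uint32_t)y * Minv);        /* u = (-y * M') mod R */
    uint64_t z = (y + (uint64_t)u * M) >> 32;      /* exact division by R */
    if (z >= M) z -= M;
    return (uint32_t)z;
}

int main(void) {
    /* Keep M below 2^31 so y + u*M cannot overflow 64 bits in this toy. */
    uint32_t M    = 2147483629u;                   /* an odd toy modulus */
    uint32_t Minv = inv_mod_2pow32(M);
    uint32_t A = 123456789u, B = 987654321u;
    uint32_t Rm = (uint32_t)((1ull << 32) % M);    /* R mod M            */

    uint32_t Abar = (uint32_t)((uint64_t)A * Rm % M);  /* A in Montgomery form */
    uint32_t Bbar = (uint32_t)((uint64_t)B * Rm % M);
    uint64_t y = (uint64_t)Abar * Bbar;                /* double-width product */
    uint32_t z = redc(y, M, Minv);                     /* = A*B*R (mod M)      */

    uint32_t expected = (uint32_t)((uint64_t)A * B % M * Rm % M);
    printf("z = %u, expected = %u\n", z, expected);
    return 0;
}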

Consider the method for calculating modulus, y mod M = y − M × ⌊y/M⌋. In Montgomery reduction, the calculation of u is equivalent to the calculation of the floored division in the modulus operation, because M′ is the multiplicative inverse of M modulo R. The calculation of z first performs the rest of the modulus calculation and then divides by R to return the result in the expected Montgomery representation. As a final step, M has to be subtracted from z if z ≥ M [25].

Transformation to and from Montgomery representation is achieved by using Montgomery multiplication and the values R′ and 1 respectively. To transform a number a into Montgomery representation aR (mod M), a is Montgomery multiplied with R′.

R′ ≡ R^2 (mod M)        (4.2)

The Montgomery multiplication performs a division by R, leaving aR (mod M) as shown in Equation 4.3.

aR (mod M) = (a × R^2)/R (mod M) = a (mod M) × R′ × R^−1 (mod M), where R′ = R^2 (mod M)        (4.3)

To transform aR (mod M) back into the normal number representation, a Montgomery multiplication is performed between aR (mod M) and 1, which removes the R from the equation and returns a.

a = aR × 1 × R^−1 (mod M)        (4.4)

There are many different ways of accomplishing Montgomery multiplication; Koc et al. studied several different methods and their impact on performance [23]. They found that the fastest methods were Separated Operand Scanning (SOS) and Coarsely Integrated Operand Scanning (CIOS). The SOS method is the simplest one, which does the multiplication followed by a Montgomery reduction of the result. A considerable drawback of this method is the memory requirement, which is 2n + 2, because the entire result of the multiplication has to be stored.

The CIOS method instead interleaves the multiplication and reduction, allowing it to use only n + 3 limbs of memory. It uses an outer loop to control the limb of one factor, and two inner loops. The first inner loop multiplies the limb with the other factor, and the second loop then reduces the result.

Both these algorithms are sequential and Koc et al. do not offer any insight into how they could be made parallel, and their descriptions made it difficult for us to realise a parallel solution. Nigel Smart describes an alternative, Algorithm 4.16, in the book "Cryptography, An Introduction" [30]. It is essentially a Finely Integrated Operand Scanning (FIOS) method, also described by Koc et al. in [23], but Smart's description allowed us to more easily see a parallel solution.

Algorithm 4.16 Montgomery Multiplication [30]

Input: AR (mod M), BR (mod M)
Output: Z = (AB) × R (mod M)

1: Z = 0
2: for i = 0 to n − 1 do
3:   u = ((z_0 + a_i × b_0) × M′) mod r
4:   Z = (Z + a_i × B + u × M)/r
5: end for
6: if Z ≥ M then
7:   Z = Z − M
8: end if

Similar to the CIOS method, the reduction is interleaved with the multipli-cation in the FIOS method. However, the FIOS method applies the reduction


as the limbs are multiplied, unlike the CIOS method that applies the reduction after the multi-precision multiplication. Algorithm 4.16 only shows a single loop, but line 4 has an implicit loop for multi-precision multiplication. Because FIOS applies the reduction at the same time as the limb multiplication, it only has a single inner loop.

For the FIOS method to work, the reduction term u is calculated before performing the multi-precision multiplication. Because the method interleaves reduction and multiplication, the reduction is performed one limb at a time, which means the term u is only one limb in size. Therefore M′ is redefined as M′ = −1/M (mod r) for the FIOS method. It also means division and modulus are performed by r. Modulus by r is simple to perform because only the least significant limb is kept and the rest are discarded. Division by r is simple because the least significant limb is discarded, in practice done with z_i ⇐ z_(i+1) for i = 0, . . . , n − 1.
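The sequential C sketch below follows the FIOS loop of Algorithm 4.16 with a toy configuration of two 16-bit limbs, so the result can be cross-checked with plain 64-bit arithmetic. The limb layout, helper names and test values are assumptions made for this illustration (they are not the thesis' OpenCL code), and the final conditional subtraction is kept for completeness even though the thesis argues it can be dropped for BBS.

#include <stdint.h>
#include <stdio.h>

#define NLIMBS 2                  /* toy size: two 16-bit limbs, so R = 2^32 */
#define RADIX  (1u << 16)
#define MASK   (RADIX - 1u)

/* m' = -m0^{-1} (mod r), found by brute force; fine for the toy radix. */
static uint32_t neg_inv_mod_r(uint32_t m0) {
    for (uint32_t x = 1; x < RADIX; x += 2)
        if (((m0 * x) & MASK) == MASK)        /* m0 * x == -1 (mod r) */
            return x;
    return 0;                                 /* unreachable for odd m0 */
}

/* FIOS-style Montgomery multiplication (Algorithm 4.16), run sequentially:
   z = a * b * R^-1 (mod m); numbers are little-endian radix 2^16 limb arrays. */
static void mont_mul(const uint32_t *a, const uint32_t *b, const uint32_t *m,
                     uint32_t m_prime, uint32_t *z)
{
    uint32_t t[NLIMBS + 2] = {0};                      /* Z, with headroom */
    for (int i = 0; i < NLIMBS; i++) {
        /* u = ((z0 + a_i * b_0) * m') mod r */
        uint32_t u = (uint32_t)(((t[0] + (uint64_t)a[i] * b[0]) * m_prime) & MASK);
        /* Z = (Z + a_i * B + u * M) / r : accumulate, then drop the lowest limb. */
        uint64_t carry = 0;
        for (int j = 0; j < NLIMBS; j++) {
            uint64_t acc = t[j] + (uint64_t)a[i] * b[j] + (uint64_t)u * m[j] + carry;
            t[j] = (uint32_t)(acc & MASK);
            carry = acc >> 16;
        }
        uint64_t top = t[NLIMBS] + carry;
        t[NLIMBS]     = (uint32_t)(top & MASK);
        t[NLIMBS + 1] = (uint32_t)(top >> 16);
        for (int j = 0; j <= NLIMBS; j++) t[j] = t[j + 1];   /* division by r */
        t[NLIMBS + 1] = 0;
    }
    /* Conditional final subtraction Z >= M -> Z = Z - M. */
    int cmp = (t[NLIMBS] != 0) ? 1 : 0;
    for (int j = NLIMBS - 1; cmp == 0 && j >= 0; j--)
        if (t[j] != m[j]) cmp = (t[j] > m[j]) ? 1 : -1;
    uint32_t borrow = 0;
    for (int j = 0; j < NLIMBS; j++) {
        if (cmp >= 0) {
            uint32_t sub  = m[j] + borrow;
            uint32_t next = (t[j] < sub) ? 1u : 0u;
            z[j] = (t[j] - sub) & MASK;
            borrow = next;
        } else {
            z[j] = t[j];
        }
    }
}

int main(void) {
    uint32_t m[NLIMBS]  = {0xFFFB, 0xFFFF};   /* M = 2^32 - 5 (odd)        */
    uint32_t a[NLIMBS]  = {0x5678, 0x1234};   /* A = 0x12345678            */
    uint32_t b[NLIMBS]  = {0xDEF1, 0x9ABC};   /* B = 0x9ABCDEF1            */
    uint32_t r2[NLIMBS] = {25, 0};            /* R^2 mod M for this M      */
    uint32_t mp = neg_inv_mod_r(m[0]);

    uint32_t t[NLIMBS], z[NLIMBS];
    mont_mul(a, b, m, mp, t);                 /* t = A*B*R^-1 (mod M)      */
    mont_mul(t, r2, m, mp, z);                /* z = A*B (mod M)           */

    uint64_t got = ((uint64_t)z[1] << 16) | z[0];
    uint64_t expected = (0x12345678ull * 0x9ABCDEF1ull) % 0xFFFFFFFBull;
    printf("got %llu, expected %llu\n",
           (unsigned long long)got, (unsigned long long)expected);
    return 0;
}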

When realising a parallel solution of Algorithm 4.16, it is not practical to parallelise over the outer loop, because then all work-items would create a partial answer for the limbs in Z, which would then need to be combined in some way. Instead, consider the calculation of Z, which has two multiplications, both of which multiply a constant large number with a changing single-limb number. This means that each work-item can hold a single limb of B and M. It also means that the upper half of each single-limb multiplication has to be transferred to the next work-item. However, because the entire result is going to be divided by r, it makes more sense for each work-item to load the bottom half from the next work-item. Before moving the bottom halves, they have to be added together with Z, followed by a Kogge-Stone carry propagation, so that no information is lost in the division.

The calculation of u is performed by all work-items. If the calculation were instead done by only one work-item, it would cause thread divergence.

The final part of Algorithm 4.16 is the conditional subtraction, which is a considerable challenge to perform on the GPU because of the multi-precision comparison and because of how the GPU handles conditional segments (Section 2.5). However, Walter showed that as long as R and A are smaller than 2M and A is converted back to normal representation, a subtraction is not needed unless Z = M [34]. Walter noted that the only time this can happen is when A ≡ 0 (mod M), but according to BBS the initial A must be a quadratic residue of M, which is always less than M. So the only way A ≡ 0 (mod M) can be true is when A = 0, which would be meaningless since it would cause all BBS output to be 0; hence we do not need to perform the check and subtraction in the final algorithm step.

A limitation in our implementation had to be made. Because our Kogge-Stone algorithm cannot catch any carry generated by the highest positional limb, the number of limbs usable by our Montgomery multiplication has to be limited to work-group size − 1. This allows the extra work-item of the work-group to catch


the carry when such a situation occurs.

The final algorithm is shown in Algorithm 4.17. The algorithm Montgomery

multiplies A and B and returns the result in A. A is contained in local memory to allow quick access when iterating through the limbs. The algorithm only takes a single limb of B and M since they do not change during calculation for any work-item. However, each work-item receives b0 and M′ because they are needed to calculate u. T is a temporary storage space in the local memory for shifting the lower multiplication results, and allows the transfer of z0. P and G are the local memory spaces required by the Kogge-Stone algorithm.

Algorithm 4.17 Parallel Montgomery Multiplication

Input: A, b[loc_id], b0, m[loc_id], M′, T, P, G
Output: A

 1: loc_id = local work item id
 2: loc_size = work group size
 3: z0 = z[loc_id] = p[loc_id] = g[loc_id] = 0
 4: Barrier(local)
 5: for i = 0 to loc_size − 1 do
 6:   t[loc_id] = z[loc_id]
 7:   Barrier(local)
 8:   z0 = t[0]
 9:   (u, extra) = dekker_mul(a[i], b0, u)
10:   u = u + z0
11:   (u, extra) = dekker_mul(u, M′)
12:   (lu, hu) = dekker_mul(a[i], b[loc_id])
13:   (lz, hz) = dekker_mul(u, m[loc_id])
14:   lu = lu + lz + z[loc_id]
15:   hu = hu + hz
16:   u = ⌊lu × r^−1⌋
17:   t[loc_id] = lu − u × r
18:   Barrier(local)
19:   lu = t[loc_id + 1] + u
20:   hu = hu + lu
21:   z[loc_id] = kogge_stone(hu, loc_id, P, G)
22: end for
23: A[loc_id] = z[loc_id]

4.4 Putting it together

The GPU-bound BBS we have created performs a predefined number of iterations and extracts all secure bits from them. Values that can be pre-calculated for Kogge-Stone and Montgomery multiplication are calculated on the CPU. The


values pre-calculated on the CPU are defined directly in the kernels by passing the values to the OpenCL compiler, to remove the first-time access latency of the constant memory and to reduce the number of private variables used by the algorithm. The pre-calculated values for Montgomery multiplication are calculated on the CPU because of limitations in the scope of this thesis. Making parallel algorithms to calculate the various values needed, in particular the multiplicative inverse, is something we deemed a project worthy of its own study. The expected output from the BBS generator is a bit stream, which means all outputs have to be concatenated onto each other and then stored in an array. Assembling this array of bits is a sequential operation better suited for the CPU.

Our GPU-bound BBS algorithm is described in Algorithm 4.18. It takes the following inputs: X = X0 (the initial BBS value), Y = R′, M, and M′. In addition, the algorithm requires three local memory spaces, one for storing X (locX) and two for the Kogge-Stone propagation (P and G). The memory space required for X is n, where n is the number of limbs; the requirements for Kogge-Stone have already been covered. The algorithm begins by moving the input X into local memory, and storing the work-item id limb of M and R′ in private variables.

An iteration of BBS is made using a sequence of three Montgomery multiplications. The first one multiplies Xi with R′ to get the Montgomery representation of Xi. In the second one Xi is squared to get Xi+1. Because it is still in Montgomery representation, Xi+1 is Montgomery multiplied with 1. To save memory space, the value 1 is generated using a compare operation, so only the first work-item of the work-group holds the value 1 while the rest hold the value 0. The iteration ends by extracting the lowest positional bits. After the iterations are done, the algorithm returns an array R containing the bits extracted from each iteration.

When running the BBS algorithm using a radix of r = 2^18 and 127 limbs, M should satisfy (2^(18×126) + 2^17) < M < 2^(18×127). With an M at the lower bound, the maximum number of bits that can be taken from each iteration is log2(log2(2^(18×126) + 2^17)) ≈ 11 bits per iteration. To extract 11 bits from each iteration, a modulus by 2^11 is performed on the lowest positional limb. The bits extracted from the BBS iterations are stored in an array located in the global memory, so the bits can be concatenated on the CPU.


Algorithm 4.18 Our Blum Blum Shub algorithm

Input: X, Y, M, M′, P, G, locX
Output: R

 1: global_id = global work item id
 2: loc_id = local work item id
 3: group_id = work group id
 4: loc_size = work group size
 5: locX[loc_id] = X[global_id]
 6: privR_id = Y[global_id]
 7: privR0 = Y[loc_size × group_id]
 8: privM = M[global_id]
 9: for i = 0 to Iterations do
10:   locX = montgomery_mul(locX, privR_id, privR0, privM, M′, P, G)
11:   Barrier(local)
12:   locX = montgomery_mul(locX, locX[loc_id], locX[0], privM, M′, P, G)
13:   locX = montgomery_mul(locX, (loc_id < 1), 1, privM, M′, P, G)
14:   Barrier(local)
15:   R[i + (Iterations × group_id)] = locX[0] mod 2^11
16: end for
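The toy C program below mirrors the structure of one iteration, squaring modulo M and keeping the lowest ⌊log2(log2 M)⌋ bits of each new state. Its parameters are far too small to be secure and were chosen only so that the example runs as written.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Toy Blum Blum Shub run: p and q are tiny primes congruent to 3 mod 4,
       so M = p*q is not secure, but the structure matches Algorithm 4.18. */
    const uint64_t p = 499, q = 547;          /* both are 3 (mod 4)            */
    const uint64_t M = p * q;                 /* 272953                        */
    const int bits_per_iter = 4;              /* floor(log2(log2(M))) = 4      */

    uint64_t x = (123456ULL * 123456ULL) % M; /* seed x0 = s^2 mod M           */
    for (int i = 0; i < 8; i++) {
        x = (x * x) % M;                      /* one BBS iteration             */
        printf("%d ", (int)(x & ((1u << bits_per_iter) - 1)));
    }
    printf("\n");
    return 0;
}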


Chapter 5

Performance results

To be able to properly test the individual algorithms we had to create kernels that used the algorithms in a way that was both representative of normal operation and allowed us to measure the actual cost of the algorithms. All tests were run 50 times in order to generate enough data to achieve an acceptable variance threshold in the linear regression. All the time measurements were used unaltered for the linear regression, but were averaged for presentation in the graphs.
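As an illustration of how the regression separates the fixed start-up latency from the per-work-item cost, the C sketch below fits a straight line with ordinary least squares to a handful of made-up measurements; the actual coefficients in this thesis come from the real timing data.

#include <stdio.h>

/* Fit time = k * work + m, so the per-work-item cost k is separated from
   the fixed start-up latency m.  The sample data is made up. */
int main(void) {
    double work[] = {2048, 4096, 8192, 16384, 32768};
    double time[] = {1.1e6, 2.0e6, 3.9e6, 7.6e6, 15.1e6};   /* hypothetical ticks */
    int n = 5;

    double sw = 0, st = 0, sww = 0, swt = 0;
    for (int i = 0; i < n; i++) {
        sw += work[i];  st += time[i];
        sww += work[i] * work[i];  swt += work[i] * time[i];
    }
    double k = (n * swt - sw * st) / (n * sww - sw * sw);   /* slope: ticks per work-item */
    double m = (st - k * sw) / n;                           /* intercept: fixed latency   */
    printf("ticks per work-item = %.3f, fixed overhead = %.0f\n", k, m);
    return 0;
}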

5.1 Test equipment

All tests were performed on a computer with an Intel Core i7 960 3.2 GHz processor, 6 GB DDR3 RAM running at 1066 MHz and an NVIDIA GTX 580 with 1536 MB of memory. The operating system was Ubuntu 10.10 x64 with kernel 2.6.35; the NVIDIA driver version used was x86_64-280.13.

5.2 Addition

When we tested addition we wanted to see if there was any benefit to using lazy addition and to what degree. We also wanted to see how expensive our carry propagation was.

The tests consisted of the GPU performing a fixed number of lazy additions, with carry propagation at different intervals. With a radix of 2^18 and a 24-bit mantissa it is safe to perform 2^(24−18) = 64 additions without the risk of losing precision. Therefore we tested with a carry propagation at the intervals shown in Table 5.1.
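The short C program below illustrates this bound on the host: 64 lazy additions are performed on a float limb without any carry handling, after which a single propagation restores the limbs below the radix. The values are made up, and a simple ripple carry is used here instead of the Kogge-Stone propagation from Chapter 4.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* With radix 2^18 and a 24-bit float mantissa, a limb can absorb
       2^(24-18) = 64 additions before it can reach 2^24 and start
       rounding; only then is a carry propagation needed. */
    const float radix = 262144.0f;            /* 2^18 */
    float limb[2] = {250000.0f, 1.0f};        /* little-endian limbs, made up */

    for (int i = 0; i < 64; i++)              /* 64 lazy additions, no carries */
        limb[0] += 200000.0f;

    /* One ripple carry propagation restores the limbs below the radix. */
    float carry = floorf(limb[0] / radix);
    limb[0] -= carry * radix;
    limb[1] += carry;
    printf("limbs = {%.0f, %.0f}, value = %.0f\n",
           limb[0], limb[1], limb[1] * radix + limb[0]);
    return 0;
}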

The point of using Kogge-Stone is that even with large numbers the carry propagation will be quick, so to test this property the number of limbs in the big numbers has to be varied. Because we parallelised the multi-precision system with one limb per work-item, a way to observe the scaling is to use a large work-group size, with carry propagation over a varying number of limbs. It was also of interest to see how the algorithm scaled when using different work-group


sizes, because that is how the precision is set in our multi-precision number system implementation. However, a varying work-group size introduced other variables that affected the result and that are outside of our control. These are the effects of the scheduler working with different work-group sizes, memory usage, and how well the instructions can be managed to hide instruction latency. The performance was measured over different numbers of work-items in order to allow the use of linear regression over the work-items' measured run time, to remove the latency introduced at the start of the tests.

Intervals: 1, 8, 16, 32
Limbs used: 64, 128, 256, 512, 1024
Work-items: 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144

Table 5.1: Addition test parameters

[Figure 5.1 plots execution time against work size for 1, 8, 16 and 32 lazy additions per carry propagation: (a) 32 additions, (b) 128 additions, (c) 128 additions on a log scale, and (d) 128 additions with a variable work-group size on a log scale.]

Figure 5.1: Results for Addition tests

The results shown are the coefficients from linear regression over the number of work-items. With a fixed work-group size of 1024, the Kogge-Stone carry propagation scales as a logarithmic function, as shown in Figures 5.1(a) to 5.1(c). When we introduced a variable work-group size, the result took on a very different shape as shown in Figure 5.1(d), indicating that the Kogge-Stone algorithm


benefits from large work-group sizes. This can possibly be because at higher work-group sizes it can better hide latencies from memory access. Because there is a maximum number of work-groups per compute unit, a work-group size of 64 limits the number of work-items the scheduler can switch between, which limits the scheduler's ability to hide latencies and likely causes the poor performance.

5.3 Multiplication

We found three different methods for performing full-precision single-limb multiplication. In order to decide which method to use, their performance was analysed using the profiling method described in Section 3.2. The first two multiplication algorithms are based on multiplication with Dekker splits; they have already been described in Algorithms 4.13 and 4.14. The third one is an optimised version of the naive algorithm described in Algorithm 4.9. The optimisation is that instead of using c = a mod b, the algorithm uses c = a − b × ⌊a × b^−1⌋, where b^−1 has been pre-calculated on the CPU. The optimised algorithm is described in Algorithm 5.1.

Algorithm 5.1 Optimised Naive Splitting Multiplication

Input: A, B
Output: C : c0, c1

a1 = ⌊A × (1/√r)⌋
a0 = A − a1 × √r
b1 = ⌊B × (1/√r)⌋
b0 = B − b1 × √r
t = a0 × b1 + a1 × b0
split1 = ⌊t × (1/√r)⌋
split0 = t − split1 × √r
c0 = a0 × b0 + split0 × √r
carry = ⌊c0 × r^−1⌋
c1 = a1 × b1 + split1 + carry
c0 = c0 − carry × r

Because the algorithms do not require any communication with other work-items, it is possible to model their performance using the profiling method from Section 3.2. As the multiplication algorithms are quite complex, they should not be affected by the compiler's optimisations in any destructive way. However, to add an extra layer of protection, the result from the previous iteration is used as input for the next iteration to create dependencies between them, as shown in Algorithm 5.2.


Algorithm 5.2 Multiplication test

Input: seed1, seed2
Output: upper, lower

lower = seed1
upper = seed2
(lower, upper) = Mul(lower, upper)
(lower, upper) = Mul(lower, upper)
. . .

Different parameters had to be used for the multiplication benchmarks than the ones used in Section 3.2, because using the same number of iterations made the compilation time unreasonably high. However, the number of work-items used does not affect compilation time, so they remained the same. The parameters are shown in Table 5.2.

Iterations (n): 32, 64, 128, 256, 512, 1024
Work-items (t): 2048, 4096, 8192, 16384, 32768, 49152, 65536, 98304, 131072, 196608, 262144
Work-group size: 64, 128, 256

Table 5.2: Multiplication test parameters

[Figure 5.2 shows, for a work-group size of 128, the measured CPU ticks as a function of the number of work-items and iterations for the three methods: (a) absolute value, (b) volatile, and (c) naive.]

Figure 5.2: Single Digit multiplication results


Figure 5.2 shows a visual representation of the data, and the results from the two linear regressions applied to the data are shown in Table 5.3. Because the graphs had a near identical appearance between different work-group sizes, only the graphs for a work-group size of 128 are shown.

Work-group size:   64         128        256
Volatile:          0.608373   0.604442   0.614700
Absolute Value:    0.212693   0.210843   0.208930
Naive Method:      0.210488   0.211631   0.211313

Table 5.3: Single Digit multiplication results, in CPU ticks

Results showed that Dekker multiplication using volatile variables was considerably slower, while the naive and absolute value methods were almost equal. Work-group size only had a minor effect on performance. Because execution time cannot help decide between the absolute value and the naive method, another metric had to be used to decide between them, namely how many private variables they needed. Because only a finite number of private variables are available on the GPU, it is an advantage to use fewer private variables. The naive algorithm described in Algorithm 5.1 needs nine variables (the variables t and carry can share one), while the Dekker multiplication described in Algorithm 4.13 only needs eight because c0 and c1 can re-use p and q, which are no longer needed. Therefore, the Dekker splits multiplication using the absolute value method was used as our single-limb multiplication algorithm.


Chapter 6

Comparative results

To decide if our GPU bound BBS implementation was faster than a CPU implementation, it was compared against a CPU bound BBS algorithm implemented with the GNU Multiple Precision Arithmetic Library (GMP). The metrics used to compare the performance of the implementations were latency and CPU-ticks per bit.

The latency of the implementations is the time it takes to generate a single output value, measured in CPU-ticks. CPU-ticks per bit is measured by letting the implementations generate increasing numbers of bits and then dividing the number of CPU-ticks required to finish the iterations of the BBS sequence by the number of bits generated.

To be able to perform a fair comparison between the two implementations, they have to do the same amount of work. The workload for our GPU bound BBS implementation takes two inputs, the number of iterations and the number of work-groups. Each work-group on the GPU performs its own independent BBS sequence with a number of iterations defined by the input. However, because the CPU bound implementation only takes a single input, its workload is the product of the number of iterations and work-groups.

All the tests were performed with the parameters outlined in Table 6.1. The GPU tests were done with a work-group size of 128 work-items. The measured time starts before the transfer of data to the GPU and ends after the data is read back from the GPU. All the tests were run 50 times to achieve an acceptable variance, as explained in previous chapters.

No. work-groups: 1, 128, 192, 256, 512, 768, 1024, 1536, 2048
Iterations: 1, 64, 128, 256

Table 6.1: Parameters for the GPU and CPU comparison of BBS

The CPU bound GMP algorithm for BBS is described in Algorithm 6.1; M and A are provided and are the same as those used in the GPU bound BBS.


Algorithm 6.1 GMP BBS

Input: (work-groups × iterations)
Output: R

A = BBS start seed
M = prime product
for i = 0 to (iterations × work-groups) do
  A = A × A mod M
  R_i = A mod 2^11
end for
return R
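A minimal C sketch of such a GMP loop is shown below (link with -lgmp). The modulus and seed are toy values chosen only so the example runs as written, whereas the thesis uses the same large Blum integer as the GPU version; extracting 11 bits per iteration is only justified for a modulus of that size.

#include <stdio.h>
#include <gmp.h>

#define ITERATIONS 16

int main(void) {
    unsigned long out[ITERATIONS];

    mpz_t a, m;
    mpz_init_set_ui(m, 272953UL);          /* toy M = 499 * 547, both 3 (mod 4) */
    mpz_init_set_ui(a, 123456UL);          /* toy seed s                        */
    mpz_mul(a, a, a);
    mpz_mod(a, a, m);                      /* x0 = s^2 mod M                    */

    for (int i = 0; i < ITERATIONS; i++) {
        mpz_mul(a, a, a);                  /* A = A * A                         */
        mpz_mod(a, a, m);                  /* A = A mod M                       */
        out[i] = mpz_fdiv_ui(a, 1UL << 11);/* keep the lowest 11 bits           */
    }

    for (int i = 0; i < ITERATIONS; i++)
        printf("%lu ", out[i]);
    printf("\n");

    mpz_clear(a);
    mpz_clear(m);
    return 0;
}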

The results for the latency tests are shown in Table 6.2. The total BBS execution time is shown in Figure 6.1 and the CPU-ticks per bit are shown in Figure 6.2.

            GMP      GPU
Latency     12714    2361168

Table 6.2: Latency of the BBS, in CPU-ticks

The latency for the GPU bound BBS is considerably larger than for the CPU bound BBS, an expected result because of the added latency from kernel queuing and memory transfers.

[Figure 6.1 shows the total execution time in CPU ticks as a function of the number of work-groups and iterations for (a) the GMP implementation and (b) the GPU implementation.]

Figure 6.1: Time taken to execute BBS, measured in CPU-ticks

Figure 6.2(b) shows CPU-ticks per bit for the GPU; the figure has been limited in height to better show the relationship between the number of work-groups and the number of iterations. The data missing from the figure is the CPU-ticks per bit when a single group is used, because it is too large and showing it would make the other values in the figure too small to differentiate.

The results showed that the GPU bound BBS is about 4 times slower than the CPU bound BBS when generating large amounts of bits, while the GPU is even slower for smaller amounts of bits because of the added latency.


[Figure 6.2 shows the CPU ticks spent per generated bit as a function of the number of work-groups and iterations for (a) the GMP implementation and (b) the GPU implementation.]

Figure 6.2: CPU-ticks per random bit


Chapter 7

Conclusion

We have performed tests to model the performance of a multi-precision number system using different data types. To model the performance of the multi-precision number system, the performance of calculating inner partial products of a multiplication algorithm with the different data types was compared. To be able to measure the small execution time of a single inner partial product, we devised a profiling method based on a CPU bound high-precision timer and statistical analysis. The results showed that the floating-point data type had a slight advantage when calculating the inner partial product of a multiplication. This result guided our decision for which data type to use in our multi-precision implementation of BBS.

We created a radix based multi-precision number system, with arithmetic functions for addition, single-limb multiplication, and Montgomery multiplication. The addition algorithm used a carry propagation algorithm adapted from the Kogge-Stone carry look-ahead adder. To our knowledge, this is the first attempt at adapting the Kogge-Stone algorithm to work on the GPU using numbers with a radix higher than two. The single-limb multiplication was based on a technique created by Dekker, which allows full-precision multiplication of floating-point numbers. The Montgomery multiplication, which is a modular multiplication, was based on the FIOS method.

The Montgomery multiplication was used to create a GPU bound BBS algorithm. Extensive tests of our GPU BBS implementation showed that it was slower than a CPU implementation made with the GNU Multiple Precision Arithmetic Library. We suspect this is likely due to the high degree of synchronisation required in our GPU implementation.

7.1 Future Work

Our BBS implementation was constructed using the floating-point data type, because our performance comparison indicated that the floating-point data type has a slight performance advantage over integers on the GPU when calculating the inner partial product of a multiplication. However, it would still be interesting to see how an integer based GPGPU implementation of BBS would fare against


our floating-point implementation, because there could be some optimisations or operations that could make an integer implementation faster.

Trying a different variation of the Kogge-Stone adder could also yield better performance. There are many different versions of the Kogge-Stone adder; we chose to use the radix-2 version, in which each position gets its value from two inputs. A radix-4 implementation, which takes four inputs, could lead to better performance because it can complete the carry propagation in log4(n) steps instead of log2(n).

Our parallel Montgomery multiplication implementation was inspired by Nigel Smart's description in his book "Cryptography, An Introduction" [30], which essentially is the FIOS method of doing the Montgomery multiplication. It would be interesting to see how other GPGPU implementations of the Montgomery methods, i.e. CIOS, FIPS and CIHS, would fare against our implementation.

It would also be interesting to see the performance of an implementation using a different approach for a multi-precision number system. Our choice of a radix based multi-precision system has some drawbacks, because it requires a considerable amount of thread communication. Using a different approach for handling the numbers might lead to better performance because of less communication and algorithms that would be more appropriate for GPGPU, e.g. a residue number system, which allows independent calculation of limbs.

7.2 Acknowledgements

We would like to take this opportunity to thank the people who helped and supported us during our work with this thesis. First and foremost, we want to express our deepest gratitude to our supervisor Andrew Moss at Blekinge Institute of Technology, for sharing his expertise and experience in the research area, and for his time and patience during the thesis. To everyone at Wireless Independent Provider AB, we thank you for all your moral support and for providing the hardware that made this thesis possible. We would also like to thank our friends and families for their support during the thesis.


References

[1] Werner Alexi, Benny Chor, Oded Goldreich, and Claus Schnorr. RSA and Rabin functions: Certain parts are as hard as the whole. SIAM Journal on Computing, 17:194–209, 1988.

[2] Lenore Blum, Manuel Blum, and Michael Shub. A simple unpredictable pseudo-random number generator. SIAM Journal on Computing, 15(2):364–383, 1986.

[3] Futuremark Corporation. Audit report: Alleged NVIDIA driver cheating on 3DMark03. http://www.futuremark.com/pressroom/companypdfs/3dmark03_audit_report.pdf, May 2003. [Accessed January 30, 2012].

[4] Intel Corporation. Using the RDTSC Instruction for Performance Monitoring, 1997.

[5] Intel Corporation. Intel Architecture Software Developer’s Manual, 1999.

[6] Intel Corporation. System Programming Guide, Part 1 Volume 3A. In Intel64 and IA-32 Architectures Software Developers Manual. 2011.

[7] NVIDIA Corporation. What is cuda. URL: http://developer.nvidia.com/what-cuda. [Accessed January 30, 2012].

[8] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, 4.0 edition, 2011.

[9] NVIDIA Corporation. OpenCL Best Practices Guide, February 2011.

[10] NVIDIA Corporation. OpenCL Programming Guide, 4.0 edition, 2011.

[11] T.W. Cusick. Properties of the x^2 mod N pseudorandom number generator. Information Theory, IEEE Transactions on, 41(4):1155–1159, 1995.

[12] T. J. Dekker. A floating-point technique for extending the available precision.Numerische Mathematik, 18:224–242, 1971.

[13] Vadim Demchik. Pseudo-random number generators for Monte Carlo simulations on ATI graphics processing units. Computer Physics Communications, 182(3):692–705, 2011.


[14] R. Fischlin and C. Schnorr. Stronger security proofs for RSA and Rabin bits. In Walter Fumy, editor, Advances in Cryptology EUROCRYPT 97, volume 1233 of Lecture Notes in Computer Science, pages 267–279. Springer Berlin / Heidelberg, 1997.

[15] Sebastian Fleissner. GPU-accelerated Montgomery exponentiation. In Yong Shi, Geert van Albada, Jack Dongarra, and Peter Sloot, editors, Computational Science ICCS 2007, volume 4487 of Lecture Notes in Computer Science, pages 213–220. Springer Berlin / Heidelberg, July 2007.

[16] Michael J. Flynn. Some computer organizations and their effectiveness. Computers, IEEE Transactions on, C-21(9):948–960, September 1972.

[17] The Khronos Group. The OpenCL specification. URL: http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf, November 2011. [Accessed January 30, 2012].

[18] D. Harris and I. Sutherland. Logical effort of carry propagate adders. InSignals, Systems and Computers, 2003. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 1, pages 873–878. IEEE, 2003.

[19] Owen Harrison and John Waldron. Efficient acceleration of asymmetric cryptography on graphics hardware. In Bart Preneel, editor, Progress in Cryptology AFRICACRYPT 2009, volume 5580 of Lecture Notes in Computer Science, pages 350–367. Springer Berlin / Heidelberg, 2009.

[20] Advanced Micro Devices Inc. Introduction to OpenCL Programming, rev a edition, 2010.

[21] Advanced Micro Devices Inc. Programming Guide ATI Stream Computing,June 2010.

[22] Donald E. Knuth. The art of computer programming. - 2: Seminumerical algorithms, chapter 4. Addison-Wesley, third edition, 1998.

[23] C. K. Koc, Tolga Acar, and B.S. Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. Micro, IEEE, 16(3):26–33, 1996.

[24] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. Computers, IEEE Transactions on, C-22(8):786–793, August 1973.

[25] Peter L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, 1985.


[26] Andrew Moss, Daniel Page, and Nigel Smart. Toward acceleration of RSA using 3D graphics hardware. In Steven Galbraith, editor, Cryptography and Coding, volume 4887 of Lecture Notes in Computer Science, pages 364–383. Springer Berlin / Heidelberg, 2007.

[27] Marc Olano. Modified noise for evaluation on graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS '05, pages 105–110, New York, NY, USA, 2005. ACM.

[28] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21:120–126, February 1978.

[29] Hongsong Shi, Shaoquan Jiang, and Zhiguang Qin. More efficient DDH pseudorandom generators. Designs, Codes and Cryptography, 55:45–64, 2010.

[30] Nigel Smart. Cryptography, An Introduction. McGraw-Hill College, 3rd edition, 2003.

[31] Robert Szerwinski and Tim Güneysu. Exploiting the power of GPUs for asymmetric cryptography. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and Embedded Systems CHES 2008, volume 5154 of Lecture Notes in Computer Science, pages 79–99. Springer Berlin / Heidelberg, 2008.

[32] Stanley Tzeng and Li-Yi Wei. Parallel white noise generation on a GPU via cryptographic hash. In Proceedings of the 2008 symposium on Interactive 3D graphics and games, I3D '08, pages 79–87, New York, NY, USA, 2008. ACM.

[33] Umesh Vazirani and Vijay Vazirani. Efficient and secure pseudo-random number generation (extended abstract). In George Blakley and David Chaum, editors, Advances in Cryptology, volume 196 of Lecture Notes in Computer Science, pages 193–202. Springer Berlin / Heidelberg, 1985.

[34] C.D. Walter. Montgomery exponentiation needs no final subtractions. Electronics Letters, 35(21):1831–1832, October 1999.

[35] Henry Wong, M.M. Papadopoulou, M. Sadooghi-Alvandi, and Andreas Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pages 235–246. IEEE, March 2010.