
Coarse Grain Parallelization of Evolutionary Algorithms on GPGPU Cards with EASEA

Ogier Maitre, Louis Pasteur University, LSIIT, FDBT, Illkirch, France, [email protected]

Nicolas Lachiche, Louis Pasteur University, LSIIT, FDBT, Illkirch, France, [email protected]

Laurent Baumes, Instituto de Tecnologia Quimica, UPV-CSIC, Valencia, Spain, [email protected]

Avelino Corma, Instituto de Tecnologia Quimica, UPV-CSIC, Valencia, Spain, [email protected]

Pierre Collet, Louis Pasteur University, LSIIT, FDBT, Illkirch, France, [email protected]

ABSTRACT

This paper presents a straightforward implementation of a standard evolutionary algorithm that evaluates its population in parallel on a GPGPU card.

Tests done on a benchmark and a real-world problem, using an old NVidia 8800GTX card and a newer but not top-of-the-range GTX260 card, show a roughly 30x (resp. 100x) speedup for the whole algorithm compared to the same algorithm running on a standard 3.6GHz PC. Knowing that much faster hardware is already available, this opens new horizons to evolutionary computation, as search spaces can now be explored 2 or 3 orders of magnitude faster, depending on the number of GPGPU cards used.

Since these cards remain very difficult to program, the know-how has been integrated into the old EASEA language, which can now output code for GPGPU from a simple .ez source file.

    Categories and Subject Descriptors

G.1.6 [Mathematics of Computing]: Numerical Analysis - Optimization

    General Terms

    Performance

    Keywords

Parallelization, evolutionary computation, genetic algorithms, GPGPU, Graphic Processing Unit, EASEA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GECCO'09, Montreal, Canada. Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

1. INTRODUCTION

Ever since GPGPU (General Purpose Graphic Processing Unit) cards appeared on the market a few years ago, researchers have been interested in using them for evolutionary computation, due to the inherent parallelism of these algorithms. However, surprisingly enough, even though many papers have been published on the challenging implementation of Genetic Programming on these cards, after extensive search only three (different) papers [10, 9, 14] could be found that addressed the implementation of standard evolutionary algorithms (with a fixed genome size and a common evaluation function for all individuals), and they made strange and overly complex implementation decisions, with results that do not encourage following in their steps, considering the work involved.

Why the very simple and basic idea of running the evolutionary algorithm on the host CPU and evaluating all individuals in parallel on the GPGPU card has apparently not been tested is beyond understanding.

The aim of this paper is to test this most basic and straightforward form of parallelization of an evolutionary algorithm on a GPGPU card. The EASEA language [4] has been revived in order to help non-expert GPGPU programmers obtain results comparable to those presented in this paper.

The paper starts by examining the state of the art. Then comes a description of what GPGPUs are, pointing out some of their more and less desirable characteristics, followed by a section that briefly recalls the history of the EASEA language and its functionalities. Finally, results are presented on a standard benchmark and a real-world problem, followed by a discussion of the presented work.

2. STATE OF THE ART

For some reason, it seems that most efforts to efficiently use GPGPUs in the domain of evolutionary computation have been made in the field of Genetic Programming (GP), even though GP hardly satisfies the intrinsic programming constraints of GPGPUs.

As will be briefly discussed later on, GPGPU cards are in fact very powerful massively parallel computers that have (among others) one main drawback: all the elementary processors on the card are organised into larger multi-processors that must all execute the same program (SPMD model, for Single Program Multiple Data). Inside each multi-processor, all elementary processors must execute the same instruction at the same time, but possibly on different data (SIMD model, for Single Instruction Multiple Data).

The obvious parallel part in evolutionary algorithms is the evaluation of the population, but in the case of GP, the code to be executed for the evaluation of all the individuals is potentially different, because all individuals are different from each other and the evaluation of a GP individual is obtained by running the individual on some data (potentially identical across the population). So, to sum up, where GP wants to execute different programs (individuals) on identical data (the learning set), GPGPUs are designed to do exactly the opposite, i.e. execute identical programs on different data. A lot of imagination is therefore required from researchers to get these cards to work against their design, but GP being extremely CPU-greedy, it is very understandable that these courageous attempts should be made.

Conversely, standard evolutionary algorithms need to run an identical evaluation function on different individuals (which can be considered as different data), meaning that this is exactly what GPGPUs have been designed to deal with. However, for some strange reason, very few researchers have gone this obvious way, and when they did, they made strange choices, with over-complicated implementations [14, 9, 10].

The most basic idea that comes to mind when one wants to parallelize an evolutionary algorithm is to run the evolution engine sequentially on some kind of master CPU (potentially the host computer's CPU) and, when a new generation of children has been created, to evaluate all children in parallel on a massively parallel computer. This may however sound like a bad idea, because at each generation it is necessary to transfer the whole population to the parallel computer and get the results back. Maybe this feared transfer/overhead time is what stopped everyone from trying this simplistic idea, but no paper could be found that tried this simple route.

Even though this seemed such a trivial thing to do, it is exactly what has been tested in this paper, based on the principle that one should always explore the obvious to make sure it is really not good enough before spending a lot of time and energy on optimisation.

Probably in order to adopt a more refined technique, [14] implement a fine-grained algorithm with a 2D toroidal population structure stored as a set of 2D textures, with the complete algorithm running on the GPGPU (which poses a serious problem since these cards do not have a random number generator, so before going on the GPGPU, they create a matrix of random numbers that is stored in GPGPU memory for future reference). Anyway, a 10x speedup is obtained, but on gigantic populations of 512x512 individuals. [9] find that standard genetic algorithms are ill-suited to run on GPGPUs because of operators such as crossover (which would slow down execution when executed on the GPGPU) and therefore choose to implement a crossover-less Evolutionary Programming algorithm [8], here again entirely on the GPGPU card. The obtained speedup of their parallel EP ranges from 1.25 to 5.02 when the population size is large enough. [10] implement a Fine-Grained Parallel Genetic Algorithm, once again entirely on the GPGPU, to avoid massive data transfer. For a strange reason, they implement a binary GA even though GPGPUs have no bit operators, and so go to a lot of trouble to implement a single-point crossover.

So, probably for fear of being too slow or non-optimal, these teams of researchers seem to have skipped the most basic implementation, which is the one explored in this paper.

In order to guarantee replicability, the EASEA language has been revived for the occasion; it allows both replicability and ease of use for non-GPGPU-expert programmers who would like to try their algorithms on a GPGPU with minimal effort, for the price of a GPGPU card.

3. PRESENTATION OF EASEA

Back in 1998, an INRIA Cooperative Research Action called EVOLAB started between four French research laboratories with the aim of coming up with a software platform that would help non-expert programmers to try out evolutionary algorithms to optimise their applied problems. A first prototype of the EASEA [4] (EAsy Specification for Evolutionary Algorithms, pronounced [i:zi:]) language was demonstrated at the EA'99 conference in Dunkerque, and new versions regularly came out on its Sourceforge software repository until 2003 (http://sourceforge.net/projects/easea/).

In the meantime, the EASEA language (along with GUIDE [5], its dedicated Graphic User Interface) was used for many research projects and at least 4 PhDs, as well as for teaching evolutionary computation in several universities. Before 2000, the output of the EASEA compiler was a C++ source file that used either the GALib or the EO library for its implementation of the evolutionary operators.

By 2000, the DREAM [1] (Distributed Resource Evolutionary Algorithm Machine) European Project was starting, and it needed a programming language. EASEA became the programming language of the project, and it was modified so as to output Java source code for the DREAM while feeding on the same input files, for this was another nice feature of the EASEA language: the same .ez file could be compiled to create code for GALib, EO [11] or the DREAM, depending on a single option flag. This functionality allowed replicability across platforms and computers, as a published .ez program could be recompiled and executed with minimal effort on a Mac, a PC running Windows or Linux, or any other machine where one of the target GALib / EO / DREAM libraries was installed.

Development of the EASEA language stopped with the DREAM project in 2003, but its ability to generate fully functional and compilable source code from a simple C-like description of the evolutionary operators needed for a particular problem (namely the initialiser, evaluator, crossover operator and mutator) made it look like a wonderful solution to allow non-expert programmers to use GPGPUs. Due to their origins and history, these cards take a long time to understand and are quite difficult to program.

So the idea behind the revival of EASEA was that, starting from a possibly old .ez file, compiling it with the -cuda flag on the command line would have the compiler produce code that runs directly on the GPGPU card.



    Brief overview of EASEA

The idea behind EASEA was to allow virtually any basic programmer to try out an evolutionary algorithm by just typing the code that is specific to the problem to be solved. The code for the implementation of the GPGPU algorithm that tries to minimise the Weierstrass test function presented below therefore does not contain much more than the following lines:

\User classes :
GenomeClass {
  float x[N];
}

As one can infer from this class declaration section, the genome of an individual consists of an array of N floats, N (and ITER below) being in fact a macro-definition (used to store the dimension of the problem).

\GenomeClass::initialiser :
for(int i=0; i<N; i++)
  ...
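For readers unfamiliar with EASEA, the two sections above roughly amount to the following plain C, where the value of N, the helper function and the initial range are illustrative assumptions rather than the code actually generated by the compiler:

    /* Hypothetical plain-C equivalent of the genome declaration and
       initialiser above; N, random_float() and the [-1, 1] range are
       assumptions made for this sketch. */
    #include <stdlib.h>

    #define N 1000                      /* problem dimension (macro-definition) */

    typedef struct { float x[N]; } GenomeClass;

    static float random_float(float min, float max) {
      return min + (max - min) * ((float)rand() / (float)RAND_MAX);
    }

    void initialiser(GenomeClass *Genome) {
      for (int i = 0; i < N; i++)
        Genome->x[i] = random_float(-1.0f, 1.0f);
    }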


Other run parameters state whether elitism should be implemented or not, and whether the fitness function should be maximised or minimised.

The .ez file containing these sections gets compiled by typing

    $ easea weierstrass.ez

on the command line. Depending on the appended option, the easea compiler will output code for:

-cuda : any NVIDIA GPGPU card. When this option is used, the evaluation function will be sent to the GPGPU and run in parallel on the population to be evaluated. The rest of the algorithm that manages the population (selections, crossovers, mutations, reductions, ...) stays on the host CPU and executes sequentially. Speedup therefore only depends on the population size, the size of the genome and the evaluation time (see the CUDA sketch below).

no option : the host CPU. This is new for the EASEA language which, in its previous versions, always needed a companion library to provide the evolutionary operators. Compiling a .ez file without an option will now produce a full algorithm (that runs on the CPU only) without the need for a crutch.
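As mentioned in the -cuda item above, the only part that moves to the card is the evaluation of the population. The following CUDA sketch illustrates this coarse-grain scheme with one thread per individual; the kernel body (a placeholder fitness), the names and the sizes are assumptions for illustration, not the code that the EASEA compiler actually generates:

    /* Minimal CUDA sketch of coarse-grain evaluation: one thread evaluates
       one individual. N, POP_SIZE and the placeholder fitness are assumed. */
    #include <cuda_runtime.h>

    #define N        1000   /* genome size in floats      */
    #define POP_SIZE 4096   /* individuals per generation */

    __global__ void evaluate(const float *pop, float *fitness, int popSize) {
      int id = blockIdx.x * blockDim.x + threadIdx.x;
      if (id >= popSize) return;
      const float *genome = pop + (size_t)id * N;  /* flat genome of this individual */
      float score = 0.0f;
      for (int i = 0; i < N; i++)                  /* placeholder fitness (sphere)   */
        score += genome[i] * genome[i];
      fitness[id] = score;
    }

    /* Host side: copy the children to the card, launch, copy fitnesses back. */
    void evaluate_population(const float *h_pop, float *h_fitness) {
      float *d_pop, *d_fitness;
      cudaMalloc(&d_pop, (size_t)POP_SIZE * N * sizeof(float));
      cudaMalloc(&d_fitness, POP_SIZE * sizeof(float));
      cudaMemcpy(d_pop, h_pop, (size_t)POP_SIZE * N * sizeof(float),
                 cudaMemcpyHostToDevice);
      int threads = 128;
      int blocks  = (POP_SIZE + threads - 1) / threads;
      evaluate<<<blocks, threads>>>(d_pop, d_fitness, POP_SIZE);
      cudaMemcpy(h_fitness, d_fitness, POP_SIZE * sizeof(float),
                 cudaMemcpyDeviceToHost);
      cudaFree(d_pop);
      cudaFree(d_fitness);
    }

The rest of the evolutionary loop (selection, crossover, mutation, replacement) stays in ordinary sequential code on the host, exactly as described above.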

4. PRESENTATION OF GPGPU CARDS

Back in 1965, Moore predicted that the evolution of technology would allow the number of transistors per square millimeter of silicon to double every year, and later on every two years [12]. At first, most people mistook this law for the doubling of computer power every two years. This was not a big mistake because, even though being able to put twice as many transistors on the same surface did not imply that computers would run twice as fast, it so happened that clock frequency (and more generally single-core computer chips) more or less did. But what was true in the past may not hold forever, and one must observe that the clock frequency of personal computers has now seemed stuck below 4 GHz for several years (independently of the verification of Moore's law). Therefore, even if clock speed is not everything, and certainly not the best performance indicator, the speed of single-core CPUs does not increase as much as before, even though (thanks to Moore's law) the number of transistors that chips can hold still increases at the same rate. At first, this extra space was used to increase cache size, until it became possible to host several processing units on the same chip, eventually leading to multi-core CPUs.

Even though a dual-core chip does not make the computer run twice as fast, manufacturers can claim that their power is multiplied by two, which is only correct if the application being run is parallel enough to use both cores at the same time. In computer graphics, many algorithms such as vertex or fragment shaders are inherently parallel, so graphics cards have been able to fully benefit from Moore's law, by massively multiplying the cores on the card to compute in parallel the colour of many fragments. In 2005, some people realised that such cards, containing as many as 128 cores and claiming a computing power of several hundred gigaflops for less than a thousand dollars, could be used for other calculations than vertex shading.

In November 2006, NVidia launched CUDA (Compute Unified Device Architecture), which provides a Software Development Kit and Application Programming Interface that allow a programmer to use a variation of the C language to code algorithms for GeForce 8 series GPUs. Then NVidia came out with TESLA calculators, which are graphics cards without a graphic output, totally dedicated to scientific processing, that boast 4.3 TeraFlops for around $8,000.

Such enormous advertised computing power is very attractive for users of inherently parallel, CPU-greedy evolutionary algorithms, but unfortunately, even though the CUDA SDK helps a lot, GPGPUs are not easy to program: one needs to fully understand the complex memory structure, as well as how executable code can be launched on the cores, how the cores are grouped into Single Instruction Multiple Data clusters, and so on.

    Architecture of GPGPU cards

Until a couple of years ago, graphics cards were made of several programmable specialised units such as vertex shaders and pixel shaders. Depending on the application, there were either too many vertex shaders with respect to the number of pixel shaders, or the opposite. One way to get a good balance between pixel and vertex shaders was to create more general units, able to process either pixels or vertices.

By doing so, modern graphics cards are now made of many multi-purpose small processing units that are also able to compute complex and large mathematical functions. For instance, the $500 GeForce GTX 295 graphics card from NVIDIA embeds 480 (2x240) streaming processors that can now be used for other purposes than 3D rendering. However, the primary use of these cards still being computer graphics, their architecture is still designed towards this purpose.

The evolution of graphics cards being driven by the gaming industry, new models appear frequently, so NVIDIA and AMD (the two GPGPU manufacturers) have been working on hardware abstraction layers allowing developed software to be reused across their range of cards.

The work presented here was done using CUDA (an SDK developed by NVIDIA) and tested on the now discontinued 8800GTX card, which contains 128 streaming processors.

Due to their design, these processors are not meant to work independently. In fact, on the 8800GTX, they are grouped into 16 multi-processors that each contain 8 streaming processors. The 16 multi-processors work in SPMD mode (Single Program Multiple Data, meaning that they must share the same code, but can use individual program counters that may point to different instructions) and the 8 streaming processors they contain work in SIMD mode (they must all execute the same instruction at the same moment).

Each multi-processor contains 16 KB of memory that is shared among its streaming processors, and all streaming processors of all multi-processors have access to 64 KB of constant memory through a cache. Then, the graphics card hosts a global memory (768 MB on the 8800GTX) that can be directly accessed by all streaming processors in read/write, albeit without cache. This memory can also be accessed by the CPU of the host machine, which uses it both to transfer instructions and data to the multi-processors and to retrieve the results of the calculations.
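These sizes can be checked programmatically through the CUDA runtime; the following small diagnostic (not part of EASEA, just an illustration) prints what the driver reports for the installed card:

    /* Diagnostic snippet: print the memory sizes of CUDA device 0. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);
      printf("device: %s\n", prop.name);
      printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
      printf("constant memory:         %zu bytes\n", prop.totalConstMem);
      printf("global memory:           %zu bytes\n", (size_t)prop.totalGlobalMem);
      return 0;
    }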

The lack of cache for the global memory is compensated for by the fact that several thread groups (called warps) can be scheduled simultaneously on a single multi-processor. Whenever a memory access is done by one group, it is swapped for another one so that the multi-processor is not stalled on the memory access (a global memory access takes several hundred cycles, whereas a shared memory access only takes a couple, and access to a register only takes one).

[Figure 1: CPU (top) and CPU+GPGPU (bottom) evaluation times for increasing population sizes. Time in seconds vs. population size, Weierstrass with 120 iterations, on CPU and on GPU.]

Moreover, other constraints apply to the code that is executable by a GPGPU, the most visible being that no function calls are allowed. If functions are found in the code, they are inlined automatically. The CPU and GPGPU memory address spaces being different, a piece of code running on the GPGPU does not have access to the CPU memory, meaning that any global variables declared in the main program will not be accessible from the function that is executing on the GPGPU.

All in all, advertised performance can only be achieved in very particular cases that make full use of many tricks. In real life, the advertised 1.8 TFlops theoretical shader processing rate of NVIDIA's new $500 GTX295 is probably very difficult to obtain, but this is to be compared to the advertised 51.20 GFlops of the fastest 2008 quad-core $2,500 Intel QX9775 3.2GHz PC processor (which is probably as difficult to obtain in real life).

Anyway, programming for GPGPU imposes some small constraints, but they are not so overwhelming that nothing can be done. Following them gives access to the extraordinary parallel power of these cards.

5. TESTING THE GPGPU CARD

Two problems have been tested on the GPGPU: a benchmark function that contains interesting tuneable parameters allowing the behaviour of the GPGPU card to be observed, and a much more complex real-world problem, to make sure that the GPGPU processors were not only able to run fitness functions that would occupy a maximum of two lines. In fact, the 400 lines of evaluation code of the chemical problem were written by a chemist who has not the least idea of how a GPGPU card is programmed.

5.1 Tests on the Weierstrass benchmark

Weierstrass-Mandelbrot test functions, defined as

    W_{b,h}(x) = Σ_{i=1..∞} b^{-ih} sin(b^i x)   with b > 1 and 0 < h < 1,

are very nice to use as benchmarks of CPU usage in evolutionary computation because they provide two parameters that can be adjusted independently.

[Figure 2: Influence of the computational cost of the fitness function on the 8800GTX for increasing population sizes: evaluation time is virtually constant up to 2,048 individuals and linear afterwards. Time in seconds vs. population size, Weierstrass with 120, 70 and 10 iterations.]

Theory defines Weierstrass functions as an infinite sum of sines, which is not practical to compute in a finite time, even on a GPGPU. Therefore, programmers only perform a finite number of iterations (ITER in the evaluation function) to compute an approximation of the function; this number is directly proportional to the CPU time that the evaluation will take.

Then, another parameter that can be adjusted is the dimension of the problem: a 1,000-dimension Weierstrass problem takes 1,000 continuous parameters, meaning that its genome is an array of 1,000 float values, while a 10-dimension problem only takes 10 floats. The 10-dimension problem will evidently take much less time to evaluate than the 1,000-dimension problem, but as shown in the previous paragraph, evaluation time also depends on the number of iterations, so by tuning these two parameters, one can test many configurations combining genome size and evaluation time.
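As an illustration, the finite-iteration evaluation of an N-dimensional genome could look like the following plain C function. The values of b and h, and the aggregation of the coordinates by a sum of absolute values, are assumptions made for this sketch; only the inner sum follows the formula quoted above:

    /* Approximate Weierstrass evaluation of an N-dimensional genome.
       b, h and the per-coordinate aggregation are assumed values. */
    #include <math.h>

    #define N    1000   /* problem dimension          */
    #define ITER 120    /* number of terms of the sum */

    float weierstrass(const float x[N]) {
      const float b = 2.0f, h = 0.5f;      /* b > 1, 0 < h < 1 (assumed) */
      float fitness = 0.0f;
      for (int j = 0; j < N; j++) {
        float w = 0.0f;
        for (int i = 1; i <= ITER; i++)
          w += powf(b, -i * h) * sinf(powf(b, (float)i) * x[j]);
        fitness += fabsf(w);               /* sum over coordinates (assumption) */
      }
      return fitness;                      /* value to be minimised */
    }

The work per individual grows linearly with ITER, which is how the 10-, 70- and 120-iteration curves of fig. 2 differ.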

Fig. 1 shows the time taken by the evolutionary algorithm to compute 10 generations on both a 3.6GHz Pentium computer and an 8800GTX GPGPU card, for a dimension of 1,000, 120 iterations and a number of evaluations per generation growing from 16 to 4,096 individuals (number of offspring = 100% of POPSIZE). This represents the total time (including what is done serially on the CPU: population management, crossovers, mutations, selections, ...) on both architectures (CPU and CPU+GPU).

For 4,096 evaluations (10 generations), the CPU needed 2,100 seconds whereas the 8800GTX only needed 63, resulting in a 33.3x speedup when using this old NVidia card.

Since it is really difficult to see how the GPGPU curve evolves in fig. 1, fig. 2 only shows the GPGPU curve for different iteration values. On this second figure, one can see that the old 8800GTX card steadily takes in more individuals to evaluate in parallel without much difference in evaluation time until the threshold of 2,048 individuals is reached, after which it gets saturated. Beyond this value, evaluation time increases linearly with the number of individuals, which is normal since the parallel card is already working at full load. It is interesting to see that for 10 iterations, the curve before and after 2,048 has nearly the same slope, meaning that for 10 iterations, 99% of the time comes from the main algorithm on the CPU.

Since using a GPGPU card creates a necessary overhead (transfer of the individuals to the card, ...), it is interesting to determine when it becomes advantageous to add an 8800GTX to a normal 3.6GHz Pentium computer. Fig. 3 shows that on a small problem (a 10-dimension Weierstrass function with 10 iterations, which computes in virtually no time), this threshold is met between 400 and 600 individuals, depending on whether the genome uses 40 bytes or 4 kilobytes (which is quite a big genome).

[Figure 3: Influence of individual size (40 B to 4 KB) on a very short evaluation function (10 iterations): the overhead is overcome for a population between 400 and 650 individuals. Time in seconds vs. population size, for the CPU and for the GPU with 40 B, 2 KB and 4 KB genomes.]

The steady line (which represents the CPU) shows that this corresponds to a CPU evaluation time slightly shorter than 0.035 milliseconds (remember there are 10 generations), which is very short, even on a 3.6GHz computer.

The 3 GPGPU curves show that the size of the genomes does indeed have an impact when individuals are passed to the GPGPU card for evaluation. In this figure, evaluation is done on a 10-dimension Weierstrass function, corresponding to a 40-byte genome (the 8800GTX card only accepts floats).

The additional genome data is not used in the 2-kilobyte and 4-kilobyte genomes, in order to isolate the time taken to transfer large genomes to the GPGPU for the whole population.

[Figure 4: Curves of fig. 2, but with a GTX260 card. Time in seconds vs. population size, Weierstrass with 10, 70 and 120 iterations.]

The mathematics laboratory of our University was kind enough to lend us the GTX260 NVidia GPGPU card they recently bought, and fig. 4 shows the same curves as those of fig. 2, but with the newer card.

The good news is that CUDA did its job: the very same source code compiled for the new card. One can see that with this card, the total time is only 20 seconds for a population of 5,000 individuals, where our older 8800GTX took slightly above 60 and the 3.6GHz Pentium took 2,100 seconds. So where the 8800GTX vs CPU speedup was 33.3x, the GTX260 vs CPU speedup is about 105x, which is quite nice for a card that only costs around $250.

[Figure 5: Finding an atomic model matching a given diffraction pattern using traditional methodology.]

5.2 Application to a real-world problem

Trying out new algorithms on benchmarks is always ideal, since they allow exactly what is of interest to be observed. However, the drawback is that these toy problems are often so simple that it is delicate to infer anything from the fact that everything worked on the toy problem.

This is why, on top of all the tests made on the Weierstrass benchmark function, it was found important to try out the GPGPU cards on a real evolutionary problem, with all its possible quirks and peculiarities.

5.3 Description of the problem

In materials science, knowledge of the structure at an atomistic/molecular level is required for any advanced understanding of a material's performance, due to the intrinsic link between the structure of the material and its useful properties. It is therefore essential that methods to study structures be developed.

Rietveld refinement techniques [13] can be used to extract structural details from an X-Ray powder Diffraction (XRD) pattern [3, 2, 6], provided an approximate structure is known. However, if a structural model is not available, its determination from powder diffraction data is a non-trivial problem. The structural information contained in the diffracted intensities is obscured by systematic or accidental overlap of reflections in the powder pattern.


As a consequence, the application of structure determination techniques which are very successful for single crystal data (primarily direct methods) is, in general, limited to simple structures (cf. fig. 5). Here, we focus on inherently complex structures of a special type of crystalline materials whose periodic structure is a 4-connected 3-dimensional net, such as alumino-silicates, silico-alumino-phosphates (SAPO), alumino-phosphates (AlPO), etc. These are microporous materials, whose structure allows molecules to be sorted based on a size exclusion process due to the presence of channels (fig. 6-c) and cages (fig. 6-b), as shown in fig. 6 for the framework called LTA [7]. The picture only shows a finite part of the structure, as crystals are 3D periodic: the unit cell in white (fig. 6-a) is repeated by simple translation in all dimensions, fig. 6-f containing 27, i.e. 3x3x3, unit cells.

    The determination of such kinds of structures is still verymuch dominated by model building.

[Figure 6: LTA crystal framework: a) the white cube is the unit cell, b) cages are represented in green, c) 3D channels are in blue, d) split of the LTA structure into building units, e) piece of the LTA structure, and f) crystal structure with 27 unit cells.]

The fitness function used for evaluation is based on the connectivity of atoms. As such materials are characterized by networks of corner-sharing TO4 units, with T a given tetra-coordinated element such as Si, Al, P but also Ge, a potential structure must fulfill this structural constraint. The genetic algorithm is employed in order to find correct locations, i.e. from a connectivity point of view, of the T atoms. As the T-T distance for bonded atoms lies in a fixed range [dmin, dmax], the connectivity of each new configuration of T atoms can be evaluated. The fitness function corresponds to the number of defects in the structure, and Fitness = f1 + f2 is defined as follows:

1. All T atoms should be linked to 4 and only 4 neighbouring Ts, so f1 = Abs(4 - Number of Neighbours), and

2. no two T atoms should be too close, i.e. T-T distances below dmin are counted as defects by f2.
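A direct way to compute such a fitness is an all-pairs distance check over the T atoms of a candidate configuration. The following C sketch counts, for each T atom, its neighbours in [dmin, dmax] (for f1) and the pairs closer than dmin (for f2); the data layout, the exhaustive O(n^2) loop, the exact counting used for f2 and the absence of periodic boundary handling are assumptions for illustration, not the authors' 400-line evaluator:

    /* Sketch of a connectivity-based fitness: f1 penalises T atoms without
       exactly 4 neighbours in [dmin, dmax], f2 penalises T-T pairs closer
       than dmin. Layout and counting details are illustrative assumptions. */
    #include <math.h>
    #include <stdlib.h>

    typedef struct { float x, y, z; } Atom;

    float connectivity_fitness(const Atom *t, int nAtoms,
                               float dmin, float dmax) {
      int f1 = 0, f2 = 0;
      for (int i = 0; i < nAtoms; i++) {
        int neighbours = 0;
        for (int j = 0; j < nAtoms; j++) {
          if (i == j) continue;
          float dx = t[i].x - t[j].x;
          float dy = t[i].y - t[j].y;
          float dz = t[i].z - t[j].z;
          float d  = sqrtf(dx * dx + dy * dy + dz * dz);
          if (d >= dmin && d <= dmax) neighbours++;   /* bonded T-T distance */
          else if (d < dmin && i < j) f2++;           /* too-close pair      */
        }
        f1 += abs(4 - neighbours);                    /* each T wants exactly 4 bonds */
      }
      return (float)(f1 + f2);   /* number of defects, to be minimised */
    }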


[Figure 8: GTX260 total time on the real problem: constant up to 2,048 individuals and linear when the card is loaded. Time in seconds vs. population size.]

Results obtained on this real problem will be published in a materials journal.

6. DISCUSSION

Until now, the Giga- and Tera-flops advertised by GPGPU cards did not translate into anything usable in the evolutionary algorithm domain, even though EAs are supposed to be inherently parallel. In this paper, for the first time, the advertised figures translate into a direct reduction of total evaluation time by 2 orders of magnitude, and this is apparently directly linked to the power of these cards, which keeps increasing at a tremendous rate thanks to the millions of dollars injected into hardware improvement by the huge gaming industry, which keeps asking for more. Fortunately, what gamers want is compatible with our algorithms, so we are in the front row by pure luck.

At first, we had planned to show the difference in evaluation time only, showing that GPGPU computation was efficient, but it turned out that the overhead induced by the EA was so negligible (except on the 10-dimension Weierstrass function with 10 iterations) that we always plotted total time, therefore including the EA overhead. So there should be no unexpected surprises behind the presented curves.

Then, integration into the EASEA language will allow anyone who wishes to try these cards out to do so at minimal expense (just the price of the card). A couple of guidelines should be followed, of course (such as using flat genomes rather than pointers to matrices, fitness functions with less than 11,000 lines, floats rather than doubles*, ...), but the reward is such that these rather small constraints look secondary in most cases.

* Although available on all recent GPGPU cards, using double-precision variables will apparently slow down the calculations considerably on current GPGPU cards, but this has not been tested yet, as all this work has been done on an old 8800GTX card that can only manipulate floats.

Up to now, only two (rather different) problems have been tried out (a toy problem and a real problem), so it would be preposterous to generalise from these two results alone, but we have the strong feeling that the behaviour observed on these two problems will generalise well (we tried small and large evaluation times, small and large genome sizes, small and large source codes, toy and real problems).

Right now, all NVidia GPGPU cards above the 8800GTX seem to be supported without any modification, thanks to the CUDA layer that takes care of the hardware differences. A student will start trying to do the same on PlayStation 3 Cell processors, which have a quite different architecture; if everything works out well, a new version of EASEA will come out shortly with a -cell option that will, out of the same initial .ez file, allow an evolutionary algorithm to be parallelized on Cell processors (no plans have been made to try ATI/AMD GPGPU cards for the moment), and their performance will be compared with GPGPUs.

Right now, the implementation scheme is extremely basic, but it seems to be working extremely well: run the sequential evolutionary algorithm on the PC, and only dispatch the population to the GPGPU for parallel evaluation. More subtle implementations may, and certainly will, work much better, but before refining the process, it was necessary to try out the basic thing.

On Weierstrass (an ideal case) a 33x speedup was obtained with a 518 GFlops 8800GTX (of which only 345 GFlops were available, since some specialised processing units were not used in this work). A 105x speedup was obtained with a 715 GFlops GTX260 that costs only $250, so the speedup increased more than the GFlops rate, probably thanks to a larger number of registers and an improved architecture. Which speedup can be obtained with a 1,788 GFlops card is a mystery that will be solved as soon as we receive one (the chemistry lab that provided the real-world problem immediately ordered 2, which we should receive shortly).

Anyway, using these cards will probably require us to change our algorithms. A bicycle with a 40 mph top speed does not drive the same way as a 4,000 mph rocket, and this is the kind of change we are facing. At these speeds, most of our algorithms will most probably reach premature convergence within the first 10 minutes, so we will most probably need to use much larger populations, but this is exactly what these cards are good at, so a 2 orders of magnitude increase in search speed will allow us to explore much wider search spaces while at the same time exploiting nooks and crannies to a much larger extent. This will probably allow us to tackle new problems that were beyond the reach of standard PC CPUs.

7. REFERENCES

[1] M. G. Arenas, P. Collet, A. E. Eiben, M. Jelasity, J. J. Merelo, M. Preuss, and M. Schoenauer. A framework for distributed evolutionary algorithms. In Proceedings of PPSN 2002, pages 665-675. Springer, 2002.

[2] L. A. Baumes, M. Moliner, and A. Corma. Design of a full-profile matching solution for high-throughput analysis of multi-phases samples through powder x-ray diffraction. Chemistry - A European Journal, in press.

[3] L. A. Baumes, M. Moliner, N. Nicoloyannis, and A. Corma. A reliable methodology for high throughput identification of a mixture of crystallographic phases from powder x-ray diffraction data. CrystEngComm, 10:1321-1324, 2008.

[4] P. Collet, E. Lutton, M. Schoenauer, and J. Louchet. Take it EASEA. In Parallel Problem Solving from Nature VI, pages 891-901. Springer, LNCS, 2000.

[5] P. Collet and M. Schoenauer. GUIDE: Unifying evolutionary engines through a graphical user interface. In P. Liardet et al., eds, EA'03, volume 2936 of LNCS, pages 203-215, Marseilles, 2003. Springer.

[6] A. Corma, M. Moliner, J. M. Serra, P. Serna, M. J. Diaz-Cabanas, and L. A. Baumes. A new mapping/exploration approach for HT synthesis of zeolites. Chemistry of Materials, pages 3287-3296, 2006.

[7] A. Corma, F. Rey, J. Rius, M. Sabater, and S. Valencia. Supramolecular self-assembled molecules as organic directing agent for synthesis of zeolites. Nature, 431:287-290, 2004.

[8] D. B. Fogel. Evolving artificial intelligence. Technical report, 1992.

[9] K.-L. Fok, T.-T. Wong, and M.-L. Wong. Evolutionary computing on consumer graphics hardware. IEEE Intelligent Systems, 22(2):69-78, March-April 2007.

[10] J.-M. Li, X.-J. Wang, R.-S. He, and Z.-X. Chi. An efficient fine-grained parallel genetic algorithm based on gpu-accelerated. In Network and Parallel Computing Workshops, 2007 (NPC Workshops), IFIP International Conference on, pages 855-862, 2007.

[11] E. Lutton, P. Collet, and J. Louchet. EASEA comparisons on test functions: GALib versus EO. In P. Collet et al., eds, Artificial Evolution '01, pages 217-228. Springer, 2001.

[12] G. Moore. Cramming more components onto integrated circuits. Electronics Magazine, 38(8), April 19, 1965.

[13] R. A. Young. The Rietveld Method. OUP and International Union of Crystallography, 1993.

[14] Q. Yu, C. Chen, and Z. Pan. Parallel genetic algorithms on programmable graphics hardware. In Advances in Natural Computation - ICNC 2005, Proceedings, Part III, volume 3612 of LNCS, pages 1051-1059, Changsha, August 27-29, 2005. Springer.