
    Efficient Parallel Implementation of Evolutionary Algorithms on GPGPU Cards

    Ogier Maitre 1, Nicolas Lachiche 1, Philippe Clauss 1, Laurent Baumes 2, Avelino Corma 2, Pierre Collet 1

    {maitre,lachiche,clauss,collet}@lsiit.u-strasbg.fr, {baumesl,acorma}@itq.upv.es

    1 LSIIT, University of Strasbourg, France; 2 Instituto de Tecnologia Quimica UPV-CSIC, Valencia, Spain

    Abstract. A parallel solution to the implementation of evolutionary algorithms is proposed, where the most costly part of the whole evolutionary algorithm computations (the population evaluation) is deported to a GPGPU card. Experiments are presented for two benchmark examples on two models of GPGPU cards: first, a toy problem is used to illustrate some noticeable behaviour characteristics, before a real problem is tested out. Results show a speed-up of up to 100 times compared to an execution on a standard micro-processor. To our knowledge, this solution is the first showing such an efficiency with GPGPU cards. Finally, the EASEA language and its compiler are also extended to allow users to easily specify and generate efficient parallel implementations of evolutionary algorithms using GPGPU cards.

    1 Introduction

    Among the manycore architectures available nowadays, GPGPU (General Purpose Graphic Processing Units) cards offer one of the most attractive cost/performance ratios. However, programming such machines is a difficult task. This paper focuses on a specific kind of resource-consuming application: evolutionary algorithms. It is well known that such algorithms offer efficient solutions to many optimization problems, but they usually require a great number of evaluations, making processing power a limit on standard micro-processors. However, their algorithmic structure clearly exhibits resource-costly computation parts that can be naturally parallelized. But GPGPU programming constraints call for dedicated operations to achieve efficient parallel execution, one of the main performance-relevant constraints being the time needed to transfer data from the host memory to the GPGPU memory.

    This paper starts by presenting evolutionary algorithms, and by studying them to determine where parallelization could take place. Then, GPGPU cards are presented in section 3, and a proposition on how evolutionary algorithms could be parallelized on such cards is made and described in section 4. Experiments are made on two benchmarks and two NVidia cards in section 5, and some related works are described in section 7. Finally, results and future developments are discussed in the conclusion.


    2 Presentation of Evolutionary Algorithms

    In [5], Darwin suggests that species evolve through two main principles: variation in the creation of new children (that are not exactly like their parents) and survival of the fittest, as many more individuals of each species are born than can possibly survive.

    Evolutionary Algorithms (EAs) [9] get their inspiration from this paradigm to suggest a way to solve the following interesting question. Given:

    1. a difficult problem for which no computable way of finding a good solution is known, and where a solution is represented as a set of parameters,

    2. a limited record of previous trials that have all been evaluated.

    How can one use the accumulated knowledge to choose a new set of parameters to try out (and therefore do better than a random search)? EAs rely on artificial Darwinism to do just that: create new potential solutions from variations on good individuals, and keep a constant population size through selection of the best solutions. The Darwinian inspiration for this paradigm leads to borrowing some specific vocabulary from biology: given an initial set of evaluated potential solutions (called a population of individuals), parents are selected among the best to create children thanks to genetic operators (that Darwin called variation operators), such as crossover and mutation. Children (new potential solutions) are then evaluated, and from the pool of parents and children, a replacement operator selects those that will make it to the new generation, before the loop is started again.

    Fig. 1. Generic evolutionary loop.
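    To make figure 1 concrete, here is a minimal, self-contained C++ sketch of that loop on a toy one-float genome. The operator choices (binary tournament, averaging crossover, uniform mutation, truncation replacement) are illustrative assumptions, not the engine used in the paper.

        // Minimal sketch of the generic evolutionary loop of figure 1, on a toy
        // one-float genome (minimisation). Operators are illustrative choices.
        #include <algorithm>
        #include <cstdlib>
        #include <vector>

        struct Individual { float x; float fitness; };

        static float frand() { return (float)rand() / RAND_MAX; }
        static float evaluate(float x) { return x * x; }  // toy fitness function

        int main() {
            const int POP = 100, GENERATIONS = 50;
            auto better = [](const Individual &a, const Individual &b) {
                return a.fitness < b.fitness;
            };
            std::vector<Individual> pop(POP);
            for (auto &ind : pop) { ind.x = frand() * 10 - 5; ind.fitness = evaluate(ind.x); }

            for (int g = 0; g < GENERATIONS; g++) {
                std::vector<Individual> children(POP);
                for (auto &c : children) {
                    // parent selection: binary tournament (independent selectors)
                    const Individual &p1 = std::min(pop[rand() % POP], pop[rand() % POP], better);
                    const Individual &p2 = pop[rand() % POP];
                    c.x = 0.5f * (p1.x + p2.x);       // crossover
                    c.x += 0.1f * (frand() - 0.5f);   // mutation
                    c.fitness = evaluate(c.x);        // costly step: the one deported to the GPGPU
                }
                // replacement: keep the POP best entries of parents + children
                pop.insert(pop.end(), children.begin(), children.end());
                std::sort(pop.begin(), pop.end(), better);
                pop.resize(POP);
            }
            return 0;
        }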


    2.1 Parallelization of a generic evolutionary algorithm

    The algorithm presented in figure 1 contains several steps that may or may not be independent. To start with, population initialisation is inherently parallel, because all individuals are created independently (usually with random values).

    Then, all newly created individuals need to be evaluated. But since they are all evaluated independently using a fitness function, evaluation of the population can be done in parallel. It is interesting to note that in evolutionary algorithms, evaluation of individuals is usually the most CPU-consuming step of the algorithm, due to the high complexity of the fitness function.

    Once a parent population has been obtained (by evaluating all the individuals of the initial population), one needs to create a new population of children. In order to create a child, it is necessary to select some parents on which variation operators (crossover, mutation) will be applied. In evolutionary algorithms, selection of parents is also parallelizable, because one parent can be selected several times, meaning that independent selectors can select whomever they wish without any restrictions.

    Creation of a child out of the selected parents is also a totally independent step: a crossover operator needs to be called on the parents, followed by a mutation operator on the created child.

    So up to now, all steps of the evolutionary loop are inherently parallel but for the last one: replacement. In order to preserve diversity in the successive generations, the (N + 1)-th generation is created by selecting some of the best individuals of the parents + children populations of generation N. However, if an individual is allowed to appear several times in the new generation, it could rapidly become preeminent in the population, therefore inducing a loss of diversity that would reduce the exploratory power of the algorithm.

    Therefore, evolutionary algorithms impose that all individuals of the new generation are different. This is a real restriction on parallelism, since it means that the selection of N survivors cannot be made independently, otherwise the same individual could be selected several times by several independent selectors.
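    The following small sketch contrasts the two styles of survivor selection discussed above; all names are ours. Independent selectors can run in parallel but may return the same survivor twice, while a global truncation pass over the merged pool keeps each entry at most once, at the cost of a sequential sort.

        // Contrast between parallel-friendly independent selection (may duplicate)
        // and sequential, duplicate-free truncation replacement.
        #include <algorithm>
        #include <cstdlib>
        #include <vector>

        // Each selector draws independently: two calls may return the same index,
        // which is fine for parent selection but not for survivor selection.
        int independentSelector(int poolSize) { return rand() % poolSize; }

        // Global truncation: sort the merged parents + children pool once and
        // keep the n best entries, so no entry can be selected twice.
        std::vector<float> truncationReplacement(std::vector<float> pool, size_t n) {
            std::sort(pool.begin(), pool.end());  // ascending fitness = minimisation
            if (n < pool.size()) pool.resize(n);
            return pool;
        }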

    Finally, one could wonder whether several generations could evolve in parallel. The fact that generation (N + 1) is based on generation N invalidates this idea.

    3 GPGPU architecture

    GPGPU and classic CPU designs are very different. GPGPUs come from the gaming industry and are designed to do 3D rendering, and they inherit specific features from this usage. For example, they feature several hundred execution units grouped into SIMD bundles that have access to a small amount of shared memory (16KB on the NVidia 8800GTX that was used for this paper), a large memory space (several hundred megabytes), a special access mode for texture memory, and a hardware scheduling mechanism.

    The 8800GTX GPGPU card features 128 stream processors (compared to 4 general-purpose processors on the Intel Quad Core), even though both chips have a similar number of transistors (681 million for the 8800GTX vs 582 million for the Intel Quad Core). This is possible thanks to a simplified architecture that has some serious drawbacks. For instance, not all stream processors are independent: they are grouped into SIMD bundles (16 SPMD bundles of 8 SIMD units on the 8800GTX, which saves 7 fetch and dispatch units). Then, space-consuming cache memory is simply not available on GPGPUs, meaning that all memory accesses (that can be done in only a few cycles on a CPU if the data is already in the cache) cost several hundred cycles.

    Fortunately, some workarounds are provided. For instance, the hardware scheduling mechanism allows running a bundle of threads called a warp at the same time, swapping between warps as soon as a thread of the current warp is stalled on a memory access, so memory latency can be overcome with warp scheduling. But there is a limit to what can be done: it is important to have enough parallel tasks to be scheduled while waiting for the memory. A thread's state is not saved into memory: it stays on the execution unit (as in the hyper-threading mechanism), so the number of registers used by a task directly impacts the number of tasks that can be scheduled on a bundle of stream processors. Then, there is a limit on the number of schedulable warps (24 warps, i.e. 768 threads, on the 8800GTX).
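    As a hedged illustration of this register constraint, the following arithmetic uses only the figures quoted above (8,192 registers and 24 warps, i.e. 768 threads, per bundle on the 8800GTX); the helper function and the warp-granularity rounding are our assumptions, not vendor-documented behaviour.

        // Illustrative occupancy arithmetic: how many threads one bundle can keep
        // resident, given its register file and each thread's register needs.
        #include <cstdio>

        int schedulableThreads(int regsPerThread) {
            const int REGS_PER_BUNDLE = 8192;  // register file quoted for the 8800GTX
            const int MAX_THREADS     = 768;   // 24 warps of 32 threads
            const int WARP_SIZE       = 32;
            int byRegs = REGS_PER_BUNDLE / regsPerThread;
            byRegs -= byRegs % WARP_SIZE;      // assumption: whole warps only
            return byRegs < MAX_THREADS ? byRegs : MAX_THREADS;
        }

        int main() {
            printf("%d\n", schedulableThreads(10));  // 768: the warp limit dominates
            printf("%d\n", schedulableThreads(32));  // 256: registers become the limit
            return 0;
        }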

    All these quirks make it very difficult for standard programs to exploit the real power of these graphics cards.

    4 Parallel implementation on a GPGPU card

    As has been shown in section 2.1, it is possible to parallelize most of the evolutionary loop. However, whether it is worthwhile to run everything in parallel on the GPGPU card or not is another matter: in [8, 7, 11], the authors implemented complete algorithms on GPGPU cards, but clearly show that doing so is very difficult, for quite small performance gains.

    Rather than going this way, the choice made for this paper was to keep everything simple, and to start by experimenting with the obvious idea of only parallelizing children evaluation, based on the three following considerations.

    1. Implementing the complete evolutionary engine on the GPGPU card is very complex, so it seems preferable to start by parallelizing only one part of the algorithm.

    2. Usually, in evolutionary algorithms, execution of the evolutionary engine (selection of parents, creation of children, replacement step) is extremely fast compared to the evaluation of the population.

    3. Then, if the evolutionary engine is kept on the host CPU, one needs to transfer the genomes of the individuals onto the GPGPU only once per generation. If the selection and variation operators (crossover, mutation) had been implemented on the GPGPU, it would have been necessary to get the population back on the host CPU at every generation for the replacement step.


    Evaluation of the population on the GPGPU is a massively parallel process that suits an SPMD/SIMD computing model well, because standard evolutionary algorithms use the same evaluation function to evaluate all individuals³.

    Individual evaluations are grouped into structures called Blocks, that implement a group of threads which can be executed on the same bundle. Dispatching individual evaluations across this structure is very important in order to maximize the load on the whole GPGPU. Indeed, as seen in section 3, a bundle has a limited scheduling capacity, depending on the hardware scheduling device or on register limitations. The GPGPU computing unit must have enough registers to execute all the tasks of a block at the same time, which represents the scheduling limit. The 8800GTX card has a scheduling capacity of 768 threads, and 8,192 registers. So one must make sure that the number of individuals in a block is not greater than the scheduling capacity, and that there are enough individuals on a bundle in order to maximize this capacity. In this paper, the implemented algorithm spreads the population into n × k blocks, where n is the number of bundles on the GPGPU and k is the integer ceiling of popSize / (n × schedLimit). This simple algorithm yields good results in the tested cases. However, a strategy to automatically adapt block definitions to computation complexity, either by a static or a dynamic approach, needs to be investigated in future work.
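    The dispatch rule can be written down directly. This sketch (helper names are ours) computes k and the resulting launch configuration for the 8800GTX figures used in the text:

        // Sketch of the n x k block-dispatch rule; popSize and schedLimit follow
        // the text, the helper itself is ours.
        #include <cstdio>

        struct Dispatch { int blocks; int threadsPerBlock; };

        Dispatch dispatchPopulation(int popSize, int n, int schedLimit) {
            int k = (popSize + n * schedLimit - 1) / (n * schedLimit);  // integer ceiling
            int blocks = n * k;
            int threadsPerBlock = (popSize + blocks - 1) / blocks;      // integer ceiling
            return { blocks, threadsPerBlock };
        }

        int main() {
            // 4,096 individuals on the 8800GTX (n = 16 bundles, schedLimit = 768):
            // k = 1, i.e. 16 blocks of 256 threads.
            Dispatch d = dispatchPopulation(4096, 16, 768);
            printf("%d blocks of %d threads\n", d.blocks, d.threadsPerBlock);
            return 0;
        }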

    When a population of children is ready to be evaluated, it is copied onto the GPGPU memory. All the individuals are evaluated with the same evaluation function, and the results (fitnesses) are sent back to the host CPU, which contains the original individuals and manages the populations.

    On a standard host CPU EA implementation, an individual is made of a genome plus other information, such as its fitness and statistics information (whether it has recently been evaluated or not, ...). So transferring the n genomes alone onto the GPGPU would result in n transfers and in individuals scattered all over the GPGPU memory. Such a number of memory transfers would have been unacceptable. So, it was chosen to ensure spatial locality by keeping all genomes contiguous in host CPU memory, and to send the whole children population in one single transfer. Some experiments showed that in a particular case, with a large number of children, the transfer time went from 80 seconds with scattered data down to 180 ms with a buffer of contiguous data.
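    Below is a hedged sketch of this single-transfer scheme: genomes packed into one flat, contiguous host buffer, copied once per generation, evaluated one individual per thread, with only the fitnesses coming back. The genome layout, the sizes and the placeholder evaluate() are our assumptions.

        // Sketch of the one-transfer-per-generation scheme, in CUDA.
        #include <cstdio>
        #include <cuda_runtime.h>

        #define DIM 10  // genes per individual (assumption for the sketch)

        __device__ float evaluate(const float *genome) {
            float sum = 0.f;                       // placeholder fitness function
            for (int i = 0; i < DIM; i++) sum += genome[i] * genome[i];
            return sum;
        }

        __global__ void evalPopulation(const float *genomes, float *fitnesses, int popSize) {
            int id = blockIdx.x * blockDim.x + threadIdx.x;
            if (id < popSize)                      // one thread evaluates one individual
                fitnesses[id] = evaluate(genomes + id * DIM);
        }

        int main() {
            const int popSize = 1024;
            float *hGenomes = new float[popSize * DIM];  // contiguous host buffer
            float *hFit     = new float[popSize];
            for (int i = 0; i < popSize * DIM; i++) hGenomes[i] = 0.5f;

            float *dGenomes, *dFit;
            cudaMalloc((void **)&dGenomes, popSize * DIM * sizeof(float));
            cudaMalloc((void **)&dFit, popSize * sizeof(float));
            // one single transfer for the whole children population
            cudaMemcpy(dGenomes, hGenomes, popSize * DIM * sizeof(float),
                       cudaMemcpyHostToDevice);
            evalPopulation<<<(popSize + 255) / 256, 256>>>(dGenomes, dFit, popSize);
            // only the fitnesses come back to the host
            cudaMemcpy(hFit, dFit, popSize * sizeof(float), cudaMemcpyDeviceToHost);
            printf("fitness[0] = %f\n", hFit[0]);
            cudaFree(dGenomes); cudaFree(dFit);
            delete[] hGenomes; delete[] hFit;
            return 0;
        }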

    Our implementation uses only global memory since, in the general case, the evaluation function does not generate significant data reuse that would justify the use of the small 16KB shared memory or the texture cache. Indeed, with shared memory, the time saved on data accesses is generally wasted by data transfers between global memory and shared memory. Notice that shared memory is not accessible from the host CPU part of the algorithm, hence one has to first copy data into global memory, and in a second step into shared memory.

    The chosen implementation strategy exhibits the main overhead risk as being the time spent transferring the population onto the GPGPU memory. Hence the (computation time)/(transfer time) ratio needs to be large enough to effectively take advantage of the GPGPU card.

    ³ This is not the case in Genetic Programming where, on the contrary, all individuals are different functions that are tested on a common learning data set.


    Experiments on the data transfer rate show that on the 8800GTX, a 500 MB/s bandwidth was reached, which is much lower than the advertised 4 GB/s maximum bandwidth. As an order of magnitude, at 500 MB/s, transferring 4,096 genomes of 4 KB each (16 MB) costs about 32 ms per generation. However, the experiments presented in the following show that this rate is quite acceptable even for very simple evaluation functions.

    5 Experiments

    Two implementations have been tested: a toy problem that contains interesting tunable parameters allowing to observe the behaviour of the GPGPU card, and a much more complex real-world problem, to make sure that the GPGPU processors are also able to run more complex fitness functions. In fact, the 400 lines of code of the real-world evaluation function were programmed by a chemist who does not have the least idea of how to use a GPGPU card.

    5.1 The Weierstrass benchmark program

    Weierstrass-Mandelbrot test functions, defined as

        W_{b,h}(x) = \sum_{i=1}^{\infty} b^{-ih} \sin(b^i x)   with b > 1 and 0 < h < 1,

    are very interesting to use as a test case of CPU usage in evolutionary computation, since they provide two parameters that can be adjusted independently.


    Fig. 2. Left: Host CPU (top) and CPU+8800GTX (bottom) time for 10 generations of the Weierstrass problem on an increasing population size. Right: CPU+8800GTX curve only, for increasing numbers of iterations and increasing population sizes.

    Theory defines Weierstrass functions as an infinite sum of sines. Programmers perform a finite number of iterations to compute an approximation of the function. The number of iterations is closely related to the host CPU time spent in the evaluation function.
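    As an illustration, a finite-iteration evaluation could look like the following CUDA device code, one individual per thread. The parameters b, h and niter follow the definition above; aggregating the per-dimension values into one fitness by summing absolute values is our assumption, as the paper does not spell this out.

        // Hedged sketch of the finite-iteration Weierstrass approximation.
        #define DIM 1000  // 1,000 floats per genome, as in the text

        __device__ float weierstrass(const float *x, int niter, float b, float h) {
            float fitness = 0.f;
            for (int d = 0; d < DIM; d++) {
                float w = 0.f;
                for (int i = 1; i <= niter; i++)       // finite approximation of the sum
                    w += powf(b, -i * h) * sinf(powf(b, i) * x[d]);
                fitness += fabsf(w);                   // aggregation: our assumption
            }
            return fitness;
        }

        __global__ void evalWeierstrass(const float *genomes, float *fitnesses,
                                        int popSize, int niter, float b, float h) {
            int id = blockIdx.x * blockDim.x + threadIdx.x;
            if (id < popSize)                          // one individual per thread
                fitnesses[id] = weierstrass(genomes + id * DIM, niter, b, h);
        }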

    Another parameter that can also be adjusted is the dimension of the problem: a 1,000 dimension Weierstrass problem takes 1,000 continuous parameters, meaning that its genome is an array of 1,000 float values, while a 10 dimension problem only takes 10 floats. The 10 dimension problem will evidently take much less time to evaluate than the 1,000 dimension problem. But since evaluation time also depends on the number of iterations, tuning both parameters provides many configurations combining genome size and evaluation time.

    Figure 2 left shows the time taken by the evolutionary algorithm to compute 10 generations on both a 3.6GHz Pentium computer and an 8800GTX GPGPU card, for 1,000 dimensions, 120 iterations and a number of evaluations per generation growing from 16 to 4,096 individuals (number of children = 100% of the population). This represents the total time, including what is serially done on the host CPU (population management, crossovers, mutations, selections, ...), on both architectures (host CPU only and host CPU + GPU).

    For 4,096 evaluations (10 generations), the host CPU spends 2,100 seconds while the host CPU + 8800GTX spends only 63 seconds, resulting in a speedup of 33.3.

    Figure 2 right shows the same GPGPU curve, for different iteration counts.

    On this second figure, one can see that the 8800GTX card steadily takes in more individuals to evaluate in parallel without much difference in evaluation time, until the threshold number of 2,048 individuals is reached, after which it gets saturated. Beyond this value, evaluation time increases linearly with the number of individuals, which is normal since the parallel card is already working at full load. It is interesting to see that with 10 iterations, the curve before and after 2,048 has nearly the same slope, meaning that for 10 iterations, the time spent in the evaluation function is negligible, so the curve mainly shows the overhead time.


    Fig. 3. Left: determination of genome size overhead on a very short evaluation. Right: same curves as figure 2 right, but for the GTX260 card.

    Since using a GPGPU card induces a necessary overhead, it is interesting to determine when it is advantageous to use an 8800GTX card. Figure 3 left shows that on a small problem (a 10 dimension Weierstrass function with 10 iterations) running in virtually no time, this threshold is met between 400 and 600 individuals, depending on whether the genome size is 40 bytes or 4 kilobytes, which is quite a big genome.

    The steady line (representing the host CPU) shows an evaluation time slightly shorter than 0.035 milliseconds, which is very short, even on a 3.6GHz computer.

    The 3 GPGPU curves show that indeed, the size of the genomes has an impact when individuals are passed to the GPGPU card for evaluation. On this figure, evaluation is done on a 10 dimension Weierstrass function, corresponding to a 40 byte genome (the 8800GTX card only accepts floats). The additional genome data is not used in the 2 kilobyte and 4 kilobyte genomes, in order to isolate the time taken to transfer large genomes to the GPGPU for the whole population.

    On figure 3 right, the same measures are shown with a recently acquired GTX260 NVidia card. One can see that with this card, the total time is only 20 seconds for a population of 5,000 individuals, while the 8800GTX card takes 60 seconds and the 3.6GHz Pentium takes 2,100 seconds. So where the 8800GTX vs host CPU speedup was 33.3, the GTX260 vs host CPU speedup is about 105, which is quite impressive for a card that only costs around $250.00.

    5.2 Application to a real-world problem

    In materials science, knowledge of a material's structure at the atomistic/molecular level is required for any advanced understanding of its performance, due to the intrinsic link between the structure of a material and its useful properties. It is therefore essential that methods to study structures be developed.

    Rietveld refinement techniques [10] can be used to extract structural details from an X-Ray powder Diffraction (XRD) pattern [2, 1, 4], provided an approximate structure is known. However, if a structural model is not available, its determination from powder diffraction data is a non-trivial problem. The structural information contained in the diffracted intensities is obscured by systematic or accidental overlap of reflections in the powder pattern.

    As a consequence, the application of structure determination techniques which are very successful for single crystal data (primarily direct methods) is, in general, limited to simple structures. Here, we focus on inherently complex structures of a special type of crystalline materials whose periodic structure is a 4-connected 3-dimensional net, such as alumino-silicates, silico-alumino-phosphates (SAPO), alumino-phosphates (AlPO), etc.

    The genetic algorithm is employed in order to find correct locations, i.e. from a connectivity point of view, of T atoms. As the distance T-T for bonded atoms lies in a fixed range [dmin, dmax], the connectivity of each new configuration of T atoms can be evaluated. The fitness function corresponds to the number of defects in the structure, and Fitness = f1 + f2 is defined as follows (a sketch of this computation is given after the list):

    1. All T atoms should be linked to 4 and only 4 neighbouring T atoms, so: f1 = Abs(4 - NumberOfNeighbours)

    2. No two T atoms should be too close, i.e. T-T < dmin, so: f2 = NumberOfTooCloseTAtoms
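    A hedged sketch of this defect count is given below; the flat 12-float genome (4 T atoms × 3 coordinates, as in section 5.2), the distance test and the pair-counting convention are our assumptions, and the scoring against the fixed framework atoms is omitted.

        // Sketch of the Fitness = f1 + f2 defect count for one individual.
        #include <math.h>

        #define NT 4  // T atoms per genome in this sketch (12 floats in total)

        __device__ float tAtomFitness(const float *g, float dmin, float dmax) {
            int f1 = 0, f2 = 0;
            for (int a = 0; a < NT; a++) {
                int neighbours = 0;
                for (int b = 0; b < NT; b++) {
                    if (a == b) continue;
                    float dx = g[3*a]   - g[3*b];
                    float dy = g[3*a+1] - g[3*b+1];
                    float dz = g[3*a+2] - g[3*b+2];
                    float d  = sqrtf(dx*dx + dy*dy + dz*dz);
                    if (d >= dmin && d <= dmax) neighbours++;  // bonded T-T distance
                    else if (d < dmin && b > a) f2++;          // too-close pair, counted once
                }
                int diff = neighbours - 4;                     // each T wants exactly 4 bonds
                f1 += diff < 0 ? -diff : diff;
            }
            return (float)(f1 + f2);                           // number of defects
        }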


    Speedup on this chemical problem: As mentioned earlier, the source code came from our chemist co-author, who is not a programming expert (but is nevertheless capable of creating some very complex code) and knows nothing about GPGPU architecture and its use.


    Fig. 4. Left: evaluation times for increasing population sizes on host CPU (top) and host CPU + GTX260 (bottom). Right: CPU + GTX260 total time.

    First, while the 3.60GHz CPU was evaluating 20,000 individuals in only 23 seconds, which seemed really fast considering the very complex evaluation function, the GPGPU version took around 80 seconds, which was disappointing.

    When looking at the genome of the individuals, it appeared that it was coded in a strange structure, i.e. an array of 4 pointers towards 4 other arrays of 3 floats. This structure seemed much too complex to access, so it was suggested to flatten it into a unique array of 12 floats, which was easy to do; but unfortunately, the whole evaluation function was made of pointers to some parts of the previous genome structure. After some hard work, all got back to a pointer-less flat code, and the evaluation time for the 20,000 individuals instantaneously dropped from 80 seconds down to 0.13 seconds. One conclusion to draw out of this experience is that, as expected, GPGPUs do not seem very talented at allocating, copying and de-allocating memory.
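    The layout change can be illustrated as follows; the field names are invented, and only the shape of the two layouts (4 pointers to 3 floats versus one flat array of 12 floats) comes from the text.

        // Before: pointer-chasing layout, costly on a GPGPU (scattered loads,
        // per-individual allocations, no single-copy transfer).
        struct TAtomGenomePtr {
            float *atom[4];        // 4 pointers, each to 3 heap-allocated floats
        };

        // After: flat layout, transferable in one block copy and addressable
        // with plain index arithmetic inside the kernel.
        struct TAtomGenomeFlat {
            float xyz[12];         // 4 atoms x 3 coordinates, contiguous
        };

        __device__ inline float coord(const TAtomGenomeFlat *g, int atom, int c) {
            return g->xyz[3 * atom + c];   // no pointer dereference chain
        }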

    Back on the host CPU, the new function now took 7.66 s to evaluate 20,000 individuals, meaning that, all in all, the speedup offered by the GPGPU card is nearly 60 on the new GTX260 (figure 4 left). Figure 4 right shows only the GTX260 curve.

    6 EASEA: Evolutionary Algorithm specification language

    EASEA⁴ [3] is a software platform that was originally designed to help non-expert programmers to try out evolutionary algorithms to optimise their applied problems.

    ⁴ EASEA (pronounced [i:zi:]) stands for EAsy Specification for Evolutionary Algorithms.


    \User classes:
    GenomeClass {
      float x[SIZE];
    }

    \GenomeClass::mutator:
    for (int i = 0; i < SIZE; i++)
      …


    Adding the -cuda option to this compiler is very important, since it not only allows replication of the presented work, but also gives programmers who are not GPGPU experts the possibility to run their own code on these powerful parallel cards.

    7 Related Work

    Even though many papers have been written on the implementation of Genetic Programming algorithms on GPGPU cards, only three papers were found on the implementation of standard evolutionary algorithms on these cards.

    In [11], Yu et al. implement a refined fine-grained algorithm with a 2D toroidal population structure stored as a set of 2D textures, which imposes restrictions on mating individuals (that must be neighbours). Other constraints arise, such as the need to store a matrix of random numbers in GPGPU memory for future reference, since there is no random number generator on the card. Anyway, a 10 times speedup is obtained, but on a huge population of 512×512 individuals.

    In [7], Fok et al. find that standard genetic algorithms are ill-suited to run on GPGPUs, because operators such as crossover would slow down execution if run on the GPGPU, and therefore choose to implement a crossover-less Evolutionary Programming algorithm [6], here again entirely on the GPGPU card. The obtained speedup of their parallel EP ranges from 1.25 to 5.02 when the population size is large enough.

    In [8], Li et al. implement a fine-grained parallel genetic algorithm, once again entirely on the GPGPU, to avoid massive data transfers. For a strange reason, they implement a binary genetic algorithm even though GPGPUs have no bitwise operators, so they go into a lot of trouble to implement simple genetic operators. To our knowledge, no paper proposed the simple approach of only parallelizing the evaluation of the population on the GPGPU card.

    8 Conclusion and future developments

    Results show that deporting the children population onto the GPGPU for a parallel evaluation yields quite significant speedups, of up to 100 on a $250 GTX260 card, in spite of the overhead induced by the population transfer.

    Being faster by around 2 orders of magnitude is a real breakthrough in evolutionary computation, as it will allow applied scientists to find new results in their domains. Researchers on artificial evolution will then need to modify their algorithms to adapt them to such speeds, which will probably lead to premature convergence, for instance. Then, unlike many other works that are difficult (if not impossible) to replicate, the knowhow on the parallelization of evolutionary algorithms has been integrated into the EASEA language. Researchers who would like to try out these cards can simply specify their algorithm using EASEA, and the compiler will parallelize the evaluation.

    ⁵ … this has not been tested yet, as all this work has been done on an 8800GTX card that can only manipulate floats.



    Anyway, many improvements can still be expected. Load balancing could probably be improved, in order to maximize bundle throughput. Using the texture cache may be interesting for evaluation functions that repeatedly access genome data. Automatic use of shared memory could also yield good results, particularly for local variables in the evaluation function.

    Finally, an attempt to implement evolutionary algorithms on Sony/Toshiba/IBM Cell multicore chips is currently being made. Its integration into the EASEA language would allow comparing the performance of the GPGPU and Cell architectures on identical programs.

    References

    1. L. A. Baumes, M. Moliner, and A. Corma. Design of a full-profile matching solution for high-throughput analysis of multi-phase samples through powder X-ray diffraction. Chemistry - A European Journal, in press.

    2. L. A. Baumes, M. Moliner, N. Nicoloyannis, and A. Corma. A reliable methodology for high throughput identification of a mixture of crystallographic phases from powder X-ray diffraction data. CrystEngComm, 10:1321-1324, 2008.

    3. P. Collet, E. Lutton, M. Schoenauer, and J. Louchet. Take it EASEA. In Parallel Problem Solving from Nature VI, pages 891-901. Springer, LNCS, 2000.

    4. A. Corma, M. Moliner, J. M. Serra, P. Serna, M. J. Diaz-Cabanas, and L. A. Baumes. A new mapping/exploration approach for HT synthesis of zeolites. Chemistry of Materials, pages 3287-3296, 2006.

    5. C. Darwin. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. John Murray, London, 1859.

    6. D. B. Fogel. Evolving artificial intelligence. Technical report, 1992.

    7. K.-L. Fok, T.-T. Wong, and M.-L. Wong. Evolutionary computing on consumer graphics hardware. IEEE Intelligent Systems, 22(2):69-78, March-April 2007.

    8. J.-M. Li, X.-J. Wang, R.-S. He, and Z.-X. Chi. An efficient fine-grained parallel genetic algorithm based on GPU-accelerated. In Network and Parallel Computing Workshops (NPC Workshops, IFIP International Conference on), pages 855-862, 2007.

    9. K. De Jong. Evolutionary Computation: a Unified Approach. MIT Press, 2005.

    10. R. A. Young. The Rietveld Method. OUP and International Union of Crystallography, 1993.

    11. Q. Yu, C. Chen, and Z. Pan. Parallel genetic algorithms on programmable graphics hardware. In Advances in Natural Computation, ICNC 2005, Proceedings, Part III, volume 3612 of LNCS, pages 1051-1059, Changsha, August 27-29, 2005. Springer.