
Parallelization Strategies for Ant Colony Optimisation on GPUs

José M. Cecilia, José M. García

Computer Architecture Department, University of Murcia, 30100 Murcia, Spain

    Andy Nisbet, Martyn Amos

Novel Computation Group, Division of Computing and IS, Manchester Metropolitan University, Manchester M1 5GD, UK

Manuel Ujaldón

Computer Architecture Department, University of Málaga, 29071 Málaga, Spain

    Abstract

Ant Colony Optimisation (ACO) is an effective population-based meta-heuristic for the solution of a wide variety of problems. As a population-based algorithm, its computation is intrinsically massively parallel, and it is therefore theoretically well-suited for implementation on Graphics Processing Units (GPUs). The ACO algorithm comprises two main stages: Tour construction and Pheromone update. The former has been previously implemented on the GPU, using a task-based parallelism approach. However, up until now, the latter has always been implemented on the CPU. In this paper, we discuss several parallelisation strategies for both stages of the ACO algorithm on the GPU. We propose an alternative data-based parallelism scheme and selection procedure, namely I-Roulette, for the Tour construction, which fit better on the GPU architecture. We also describe novel GPU programming strategies for the Pheromone update stage. Our results show

    Corresponding author

    Preprint submitted to Journal of Parallel and Distributed Computing March 31, 2011

a total speed-up exceeding 21x for the Tour construction stage, and 20x for the Pheromone update, compared to their sequential counterparts, and suggest that ACO is a potentially fruitful area for future research in the GPU domain.

Keywords: Metaheuristics, GPU programming, Ant Colony Optimization, TSP, Performance analysis

    1. Introduction

Ant Colony Optimisation (ACO) [? ] is a population-based search method inspired by the behaviour of real ants. It may be applied to a wide range of hard problems [? ? ], many of which are graph-theoretic in nature. It was first applied to the Travelling Salesman Problem (TSP) [? ] by Dorigo and colleagues, in 1991 [? ? ].

In essence, simulated ants construct solutions to the TSP in the form of tours. The artificial ants are simple agents which construct tours in a parallel, probabilistic fashion. They are guided in this task by simulated pheromone trails and heuristic information. Pheromone trails are a fundamental component of the algorithm, since they facilitate indirect communication between agents via their environment, a process known as stigmergy [? ]. A detailed discussion of ant colony optimization and stigmergy is beyond the scope of this paper, but the reader is directed to [? ] for a comprehensive overview.

ACO algorithms are population-based, in that a collection of agents collaborates to find an optimal (or even satisfactory) solution. Such approaches are naturally suited to parallel processing, but their success strongly depends on both the nature of the particular problem and the underlying hardware available. Several parallelisation strategies have been proposed for the ACO algorithm, on both shared and distributed memory architectures [? ? ? ].

The Graphics Processing Unit (GPU) is a major current theme of interest in the field of high-performance computing. GPUs follow in the footsteps of earlier throughput-oriented processor designs, but have achieved far broader use in commodity machines, driven mainly by the needs of real-time computer graphics, which has turned GPUs into mass-market devices [? ]. For instance, since late 2006, NVIDIA has shipped almost 220 million CUDA-capable GPUs, several orders of magnitude more than historical massively parallel architectures such as the CM-2, MasPar and Goodyear MPP (Massively Parallel Processor) machines [? ].


For applications with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs, and thus offer tremendous potential performance on massively parallel problems [? ], even though sacrificing the serial performance of a single task may be required [? ]. Therefore, some massively parallel workloads may need to be redefined to suit throughput-oriented architectures rather than their latency-oriented counterparts.

Of particular interest to us are attempts to parallelise the ACO algorithm on GPUs. Until now, these approaches have focused on accelerating the tour construction step performed by each ant, taking a task-based parallelism approach, with pheromone deposition calculated on the CPU.

In this paper, we present the first fully developed ACO algorithm for the Travelling Salesman Problem (TSP) on GPUs, in which both main phases are parallelised. This is the main technical contribution of the paper. We clearly identify two main algorithmic stages: Tour construction and Pheromone update. A data-parallelism approach (which is better suited to the GPU parallelism model than task-based parallelism) is described to enhance tour construction performance. Additionally, we describe various GPU design patterns for the parallelisation of the pheromone update, which has not been previously described in the literature.

    The main contributions of this work are the following:

1. To the best of our knowledge, this is the first time that a data-parallelism approach for the tour construction stage has been introduced on GPUs. We do so using two different types of artificial ants: queen ants (associated with CUDA thread-blocks), and worker ants (associated with CUDA threads).

2. We introduce I-Roulette (Independent Roulette) as an alternative selection method to the classic Roulette Wheel, which is more suitable for enhancing the GPU simulation.

3. We also discuss the implementation of the pheromone update stage on GPUs, either using atomic operations or using alternative GPU computing techniques to avoid them.

4. We offer an in-depth analysis of both stages of the ACO algorithm for different instances of the TSP. We tune different GPU parameters, reaching speed-up factors of up to 21x for the tour construction stage and 20x for the pheromone update stage.


5. We validate the quality of the solution obtained by our GPU algorithms, comparing it to the quality of the solution obtained by the sequential code given in [? ].

A preliminary and partial version of this work was presented in [? ]. Here, we significantly extend that work with a more formal description of the algorithm design on GPUs, an extensive analysis of our proposals on NVIDIA's Fermi architecture for both algorithmic stages, and a validation of the solution quality of our proposals. Moreover, we extend the evaluation process, adding several benchmarks from the TSPLIB library.

The paper is organised as follows. We briefly introduce Ant Colony Optimisation for the TSP and NVIDIA's Compute Unified Device Architecture (CUDA) in Section ??. In Section ?? we present GPU designs for both main stages of the ACO algorithm. The experimental methodology is introduced in Section ??, before we show the performance evaluation of our algorithm in Section ??. Finally, parallelisation strategies for the ACO algorithm previously presented in the literature are described in Section ??, before concluding with a brief discussion and consideration of future work.

    2. Background

    2.1. Ant Colony Optimisation for the Traveling Salesman Problem

The Traveling Salesman Problem (TSP) [? ] involves finding the shortest (or cheapest) round-trip route that visits each of a number of cities exactly once. The symmetric TSP on $n$ cities may be represented as a complete weighted graph, $G$, with $n$ nodes, each weighted edge $e_{i,j}$ representing the inter-city distance $d_{i,j} = d_{j,i}$ between cities $i$ and $j$. The TSP is a well-known NP-hard optimisation problem, and is used as a standard benchmark for many heuristic algorithms [? ].

The TSP was the first problem solved by Ant Colony Optimisation (ACO) [? ? ]. This method uses a number of simulated ants (or agents), which perform distributed search on a graph. Each ant moves through the graph until it completes a tour, and then offers this tour as its suggested solution. In order to do this, each ant may drop pheromone on the edges contained in its proposed solution. The amount of pheromone dropped, if any, is determined by the quality of the ant's solution relative to those obtained by the other ants. The ants probabilistically choose the next city to visit, based on heuristic information obtained from inter-city distances and the net pheromone trail. Although such heuristic information drives the ants towards an optimal solution, a process of evaporation is also applied in order to prevent the process stalling in a local minimum.

The Ant System (AS) is an early variant of ACO, first proposed by Dorigo [? ]. The AS algorithm is divided into two main stages: Tour construction and Pheromone update. Tour construction is based on $m$ ants building tours in parallel. Initially, ants are randomly placed. At each construction step, each ant applies a probabilistic action choice rule, called the random proportional rule, in order to decide which city to visit next. The probability for ant $k$, placed at city $i$, of visiting city $j$ is given by Equation ??:

\[
p^{k}_{i,j} = \frac{[\tau_{i,j}]^{\alpha}\,[\eta_{i,j}]^{\beta}}{\sum_{l \in N^{k}_{i}} [\tau_{i,l}]^{\alpha}\,[\eta_{i,l}]^{\beta}}, \qquad \text{if } j \in N^{k}_{i}, \tag{1}
\]

where $\eta_{i,j} = 1/d_{i,j}$ is a heuristic value that is available a priori, $\alpha$ and $\beta$ are two parameters which determine the relative influence of the pheromone trail and the heuristic information respectively, and $N^{k}_{i}$ is the feasible neighbourhood of ant $k$ when at city $i$. This latter set represents the set of cities that ant $k$ has not yet visited; the probability of choosing a city outside $N^{k}_{i}$ is zero (this prevents an ant returning to a city, which is not allowed in the TSP). By this probabilistic rule, the probability of choosing a particular edge $(i, j)$ increases with the value of the associated pheromone trail $\tau_{i,j}$ and of the heuristic information value $\eta_{i,j}$. The numerator of Equation ?? is the same for every ant in a single iteration; computation time can therefore be saved by storing this information in an additional matrix, called the choice_info matrix, as shown in [? ]. The random proportional rule ends with a selection procedure, performed analogously to the roulette wheel selection procedure of evolutionary computation (for more detail see [? ], [? ]). Each value choice_info[current_city][j] of a city $j$ that ant $k$ has not yet visited determines a slice on a circular roulette wheel, the size of the slice being proportional to the weight of the associated choice. Next, the wheel is spun and the city to which the marker points is chosen as the next city for ant $k$. Furthermore, each ant $k$ maintains a memory, $M^{k}$, called the tabu list, which contains the cities already visited, in the order they were visited. This memory is used to define the feasible neighbourhood, and also allows an ant both to compute the length of the tour $T^{k}$ it generated, and to retrace the path to deposit pheromone.
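As a concrete illustration of this caching step, the sketch below (our own, not the authors' original code) precomputes the numerator of Equation ?? once per iteration. The arrays pheromone, dist and choice_info, flattened as n x n row-major, are illustrative assumptions:

```cuda
/* Minimal sketch: precompute the numerator of Equation (1) once per
 * iteration. pheromone, dist and choice_info are hypothetical n*n
 * row-major arrays; alpha and beta are the usual ACO parameters. */
#include <math.h>

void compute_choice_info(const float *pheromone, const float *dist,
                         float *choice_info, int n,
                         float alpha, float beta)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            /* heuristic value eta = 1/d; the diagonal is never a valid move */
            float eta = (i == j) ? 0.0f : 1.0f / dist[i * n + j];
            choice_info[i * n + j] =
                powf(pheromone[i * n + j], alpha) * powf(eta, beta);
        }
    }
}
```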

After all ants have constructed their tours, the pheromone trails are updated. This is achieved by first lowering the pheromone value on all edges by a constant factor, and then adding pheromone on edges that ants have crossed in their tours. Pheromone evaporation is implemented by

\[
\tau_{i,j} \leftarrow (1 - \rho)\,\tau_{i,j}, \qquad \forall (i,j) \in L, \tag{2}
\]

where $0 < \rho \leq 1$ is the pheromone evaporation rate. After evaporation, all ants deposit pheromone on their visited edges:

\[
\tau_{i,j} \leftarrow \tau_{i,j} + \sum_{k=1}^{m} \Delta\tau^{k}_{i,j}, \qquad \forall (i,j) \in L, \tag{3}
\]

where $\Delta\tau^{k}_{i,j}$ is the amount of pheromone ant $k$ deposits. This is defined as follows:

\[
\Delta\tau^{k}_{i,j} =
\begin{cases}
1/C^{k} & \text{if edge } (i,j) \text{ belongs to } T^{k} \\
0 & \text{otherwise}
\end{cases} \tag{4}
\]

where $C^{k}$, the length of the tour $T^{k}$ built by the $k$-th ant, is computed as the sum of the lengths of the edges belonging to $T^{k}$. According to Equation ??, the better an ant's tour, the more pheromone the edges belonging to this tour receive. In general, edges that are used by many ants (and which are part of short tours) receive more pheromone, and are therefore more likely to be chosen by ants in future iterations of the algorithm.

    2.2. CUDA Programming model

All NVIDIA GPU platforms from the G80 architecture onwards can be programmed using the Compute Unified Device Architecture (CUDA) programming model, which makes the GPU operate as a highly parallel computing device. Each GPU device is a scalable processor array consisting of a set of SIMT (Single Instruction Multiple Threads) Streaming Multiprocessors (SMs), each of them containing several Stream Processors (SPs). Different memory spaces are available in each GPU on the system. The global memory (also called device or video memory) is the only space accessible by all multiprocessors. It is the largest and the slowest memory space, and it is private to each GPU on the system. Moreover, each multiprocessor has its own private memory space called shared memory. The shared memory is smaller and also has lower access latency than global memory. Finally, there are other addressing spaces for specific purposes, such as texture and constant memory [? ].


The CUDA programming model is based on a hierarchy of abstraction layers. The thread is the basic execution unit, and is mapped to a single SP. A thread-block is a batch of threads which can cooperate, as they are assigned to the same multiprocessor and therefore share all the resources of that multiprocessor, such as the register file and shared memory. A grid is composed of several thread-blocks, which are equally distributed and scheduled among all multiprocessors. There is no particular order in which thread-blocks are executed, so they execute in a Multiple Instruction Multiple Data (MIMD) fashion. Finally, the threads of a thread-block are divided into batches of 32 threads called warps. The warp is the scheduling unit, so the threads of a given thread-block are scheduled on a given multiprocessor warp by warp. The 32 threads of a warp execute the same instruction over multiple data (SIMD). The programmer declares the number of thread-blocks, the number of threads per thread-block, and their distribution, to arrange parallelism given the program constraints (i.e., data and control dependencies).
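For readers unfamiliar with these abstractions, the fragment below is a minimal illustrative example (not taken from our ACO code) showing how each thread derives a unique global index from its block and thread IDs, and how the host chooses the grid and block dimensions:

```cuda
// Illustrative only: one thread per vector element.
__global__ void scale(float *data, float factor, int n)
{
    // Global index: block offset plus position within the block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)              // guard: the last block may be partially full
        data[idx] *= factor;
}

// Host-side launch: 256 threads per block, enough blocks to cover n.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```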

    3. Code Design and Tuning Techniques

    Algorithm 1 The sequential AS version for the TSP problem:

1: InitializeData()
2: while Convergence() do
3:   TourConstruction()
4:   PheromoneUpdate()
5: end while

In this Section, we present several different GPU designs for the Ant System (AS), as applied to the TSP. Algorithm ?? shows a single-program multiple-data (SPMD) style pseudocode for the AS. Firstly, all AS structures for the TSP, such as the distance matrix, the number of cities, etc., are initialised, before performing both main stages of the AS algorithm, i.e. Tour construction and Pheromone update. These stages are computed until the convergence criterion is reached. For the tour construction, we begin by analysing CPU baselines and traditional task-based implementations on the GPU, which motivates our approach of instead increasing the data-parallelism. For the pheromone update, we describe several GPU techniques that are potentially useful in increasing application bandwidth.


3.1. Previous tour construction proposals

    3.1.1. CPU baseline

The tour construction stage is divided into two main sub-stages: Initialization and ASDecisionRule. In the former, all data structures, such as the tabu list (visited) and the initial random city, are initialised by each ant. Algorithm ?? shows the latter stage, which is in turn divided into two sub-stages. First, each ant calculates the heuristic information for visiting city $j$ from city $i$ according to Equation ?? (lines 1-11). As previously explained, it is computationally expensive to repeatedly calculate those values at each computational step of each ant $k$; this can be avoided by using an additional data structure, namely choice_info, in which those heuristic values are stored as an adjacency matrix, and are therefore calculated only once per call [? ]. Notice that each entry of this structure can be calculated independently of the others (see Equation ??). Second, the probabilistic choice of the next city by each ant is performed using roulette wheel selection [? ? ], as shown in Algorithm ??, lines 12-18.

    3.1.2. Task-based approach on GPUs

    Figure 1: Task-based parallelism on the tour construction kernel.

The traditional task-based parallelism approach to tour construction is based on the observation that ants run in parallel looking for the best tour they can find. Therefore, any inherent parallelism exists at the level of individual ants. To implement this idea of parallelism on CUDA, each ant is identified with a CUDA thread, and threads are equally distributed among CUDA thread-blocks.

Algorithm 2 ASDecisionRule for the Tour construction stage. m is the number of ants, and n is the number of cities of the TSP instance

1: sum_probs ← 0.0
2: current_city ← ant[k].tour[step − 1]
3: for j = 1 to n do
4:   if ant[k].visited[j] then
5:     selection_prob[j] ← 0.0
6:   else
7:     current_probability ← choice_info[current_city][j]
8:     selection_prob[j] ← current_probability
9:     sum_probs ← sum_probs + current_probability
10:  end if
11: end for
{Roulette Wheel Selection Process}
12: r ← random(1..sum_probs)
13: j ← 1
14: p ← selection_prob[j]
15: while p < r do
16:   j ← j + 1
17:   p ← p + selection_prob[j]
18: end while
19: ant[k].tour[step] ← j
20: ant[k].visited[j] ← true

Each CUDA thread deals with the task assigned to its ant; i.e., maintenance of the ant's memory (tabu list, list of all visited cities, and so on) and movement (see the core of this computation in Algorithm ??). Figure ?? briefly summarises the process carried out sequentially by each ant.

To improve the kernel's bandwidth, some of the structures previously presented are placed in on-chip shared memory. Among them, the visited and selection_prob lists are good candidates for shared memory, as they are accessed many times during the computation, in an irregular access pattern. However, shared memory is a scarce resource in CUDA (see Table ??), and thus the size of these structures is limited. Moreover, in the CUDA programming model, shared memory is allocated at CUDA thread-block level. A sketch of this task-based kernel is given below.
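The skeleton below is a hedged sketch of the task-based kernel: the flattened layouts of tours, visited and choice_info, and the per-ant CURAND state, are our own illustrative assumptions, not the exact code evaluated in Section ??. It assumes an initialisation kernel has already set tour[0] to a random city and marked it as visited:

```cuda
// Task-based tour construction: one CUDA thread simulates one whole ant.
#include <curand_kernel.h>

__global__ void tour_construction_task(const float *choice_info,
                                       int *tours, char *visited,
                                       curandState *rng, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // ant id (m = n ants)
    if (k >= n) return;

    char *my_visited = visited + k * n;
    int  *my_tour    = tours   + k * (n + 1);

    for (int step = 1; step < n; step++) {
        int   cur = my_tour[step - 1];
        float sum = 0.0f;
        for (int j = 0; j < n; j++)                  // build the roulette
            if (!my_visited[j]) sum += choice_info[cur * n + j];

        float r    = curand_uniform(&rng[k]) * sum;  // spin the wheel
        float acc  = 0.0f;
        int   next = -1;
        for (int j = 0; j < n; j++) {
            if (my_visited[j]) continue;
            next = j;                    // remember last unvisited city seen
            acc += choice_info[cur * n + j];
            if (acc >= r) break;         // marker reached: city j is chosen
        }
        my_tour[step]    = next;
        my_visited[next] = 1;
    }
    my_tour[n] = my_tour[0];                         // close the tour
}
```

Note how each thread carries a large private state and branches on its own visited list; this is precisely the warp-divergent, memory-irregular behaviour discussed in the next subsection.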


3.2. Our tour construction approach based on data-parallelism

The task-based parallelism just described presents several issues for GPU implementation. Firstly, this approach requires a relatively low number of threads on the GPU, since the recommended number of ants for solving the TSP is taken to be equal to the number of cities [? ]. In addition, this version presents an unpredictable memory access pattern, because the execution is guided by a stochastic process. Finally, checking the list of visited cities produces many warp divergences (different threads in a warp taking different paths), leading to serialisation [? ].

3.2.1. The choice_info matrix calculation on the GPU

To increase the application parallelism on CUDA, the choice_info computation is performed apart from the tour construction kernel, in a separate kernel which is executed right before tour construction at each stage of the ACO algorithm. We set one CUDA thread per entry of the choice_info structure, and threads are equally grouped into CUDA thread-blocks. However, the performance of this kernel may be drastically affected by the use of costly math functions like powf() (see Equation ??). There are analogous CUDA intrinsics, such as __powf(), which map directly to the hardware, although they can provide somewhat lower accuracy [? ].
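A minimal version of such a kernel, under the same illustrative layout assumptions as before, could be written as follows; note the fast intrinsic __powf(), which trades accuracy for speed:

```cuda
// One thread per entry of choice_info. __powf() maps to the special
// function units in hardware and is cheaper, but less accurate, than powf().
__global__ void choice_info_kernel(const float *pheromone, const float *dist,
                                   float *choice_info, int n,
                                   float alpha, float beta)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flattened (i,j)
    if (idx >= n * n) return;
    int i = idx / n, j = idx % n;
    float eta = (i == j) ? 0.0f : 1.0f / dist[idx];   // heuristic 1/d
    choice_info[idx] = __powf(pheromone[idx], alpha) * __powf(eta, beta);
}
```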

    3.2.2. Data-based parallelism proposal

Figure ?? shows an alternative design, which increases the data-parallelism in the tour construction kernel, and also avoids warp divergence. In this design, a thread-block is associated with each ant (i.e. a queen ant), and each thread in a thread-block represents a city (or cities) the ant may visit (i.e. worker ants). All worker ants fully cooperate to obtain a solution, increasing the data-parallelism by a factor of 1:w, w being the number of worker ants per queen ant.

A thread loads the heuristic value associated with its assigned city (or cities), and checks whether the city has been visited. To avoid conditional statements (and thus warp divergence), the tabu list is represented in shared memory with one integer value per city: a city's value is 0 if it has been visited, and 1 otherwise. These values are multiplied and stored in a shared memory array, which is then ready for the selection process through the roulette wheel. Notice that the shared memory requirements are drastically reduced compared to the previous version: the tabu list and the probability list are now stored once per thread-block (i.e. per queen ant), instead of once per thread (i.e. per worker ant).

Figure 2: Data-based parallelism on the tour construction kernel.

The number of threads per thread-block is a hardware limiting factor in CUDA. Thus, the cities should be distributed among threads to allow for a flexible implementation. A tiling technique is proposed to deal with this issue. Cities are divided into blocks (i.e. tiles). For each tile, a city is selected stochastically from the set of unvisited cities in that tile. When this process has completed, we have a set of partial best cities. Finally, the city with the best absolute heuristic value is selected from this partial-best set.

The tabu list information can be placed in the register file (since it represents information private to each thread). However, the tabu list cannot be represented by a single integer register per thread in the tiling version, because, in that case, a thread represents more than one city. Instead, the 32-bit registers may be used on a bitwise basis for managing the list: the first city represented by each thread (i.e. on the first tile) is managed by bit 0 of the register that represents the tabu list, the second city is managed by bit 1, and so on. A sketch of this scheme is given below.
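The fragment below sketches this scheme; the kernel launch is assumed to supply n floats of dynamic shared memory, and the names and data layout are again illustrative rather than the exact implementation:

```cuda
// One thread-block per ant (queen); each thread (worker) owns the cities
// {threadIdx.x, w + threadIdx.x, 2w + threadIdx.x, ...} and keeps their tabu
// bits in a single 32-bit register (bit t = 1: city of tile t unvisited).
__global__ void tour_construction_data(const float *choice_info,
                                       int *tours, int n)
{
    extern __shared__ float slice[];       // one weight per city
    unsigned tabu = 0xFFFFFFFFu;           // all cities initially unvisited
    int w = blockDim.x;                    // worker ants per queen
    int *tour = tours + blockIdx.x * (n + 1);

    for (int step = 1; step < n; step++) {
        int cur = tour[step - 1];
        for (int t = 0; t * w + threadIdx.x < n; t++) {
            int city = t * w + threadIdx.x;
            float unvisited = (float)((tabu >> t) & 1u);  // 0/1, branch-free
            slice[city] = choice_info[cur * n + city] * unvisited;
        }
        __syncthreads();
        // ...selection over slice[] (roulette wheel or I-Roulette) picks a
        // city `next`; its owner then clears the matching bit, e.g.:
        //   if (threadIdx.x == next % w) tabu &= ~(1u << (next / w));
        __syncthreads();
    }
}
```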

    3.2.3. I-Roulette: An alternative selection method

The roulette wheel is a fully sequential stochastic selection process, and it is hard to parallelise. To implement it on CUDA, a designated thread proceeds sequentially with the selection, doing this exactly n − 1 times, n being the number of cities. Moreover, the kernel needs to generate costly pseudorandom numbers on the GPU; we use NVIDIA's CURAND library [? ].

Figure 3: An alternative method for increasing the parallelism on the selection process.

Figure ?? shows an alternative method for removing the sequential parts of the previous kernel design. We call this method I-Roulette (Independent Roulette). I-Roulette consists of generating one random number per city in the interval [0, 1] to feed the stochastic simulation. Thus, three values are multiplied and stored in the shared memory array for each city: the heuristic value associated with that city, a value indicating whether the city has been visited or not, and the random number associated with that city.
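The sketch below shows one I-Roulette construction step as we interpret Figure ??: each city receives an independent draw, and the final winner is chosen here by an argmax reduction over the weighted draws (the reduction rule is our assumption, not stated in the figure). It assumes one thread per city, a power-of-two block size, and n floats of dynamic shared memory:

```cuda
#include <curand_kernel.h>

// I-Roulette: per-city random draws replace the sequential wheel spin.
__global__ void iroulette_step(const float *choice_info, const int *visited,
                               int *next_city, curandState *rng,
                               int cur_city, int n)
{
    extern __shared__ float val[];        // weighted draw per city
    __shared__ int idx[1024];             // candidate city carried along

    int j = threadIdx.x;                  // one thread per city, blockDim.x == n
    float r = curand_uniform(&rng[blockIdx.x * n + j]);
    // Three factors multiplied per city: heuristic weight, tabu flag, draw.
    val[j] = choice_info[cur_city * n + j] * (float)(1 - visited[j]) * r;
    idx[j] = j;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // argmax reduction
        if (j < s && val[j + s] > val[j]) {
            val[j] = val[j + s];
            idx[j] = idx[j + s];
        }
        __syncthreads();
    }
    if (j == 0) next_city[blockIdx.x] = idx[0];      // one winner per ant
}
```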

    3.3. Pheromone update stage

The last stage in the ACO algorithm is the pheromone update, which comprises two main tasks: pheromone evaporation and pheromone deposit, as previously explained in Section ??. Pheromone evaporation is quite straightforward to implement on CUDA, as a single thread can independently apply Equation ?? to each entry of the pheromone matrix, thus lowering the pheromone value on all edges by a constant factor; a minimal kernel is sketched below.
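A minimal evaporation kernel, under the flattened-matrix assumption used throughout these sketches:

```cuda
// Pheromone evaporation (Equation (2)): each thread independently scales
// one entry of the n*n pheromone matrix by (1 - rho).
__global__ void evaporate(float *pheromone, float rho, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n)
        pheromone[idx] *= (1.0f - rho);
}
```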

Then, ants deposit different quantities of pheromone on the edges that they have crossed in their tours. The quantity of pheromone deposited by each ant depends on the quality of the tour found by that ant (see Equations ?? and ??). Figure ?? shows the design of the pheromone kernel; it launches n CUDA threads per CUDA thread-block (n being the number of cities), one fewer than the tour length (i.e. n + 1). Each ant generates its own private tour in parallel, and two ants may feasibly visit the same edge. This fact forces us to use atomic instructions for accessing the pheromone matrix, which diminishes the application performance.
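A sketch of this atomic variant follows; the tour layout and names are illustrative, and tours longer than the maximum block size would need the tiling described later in this section:

```cuda
// Pheromone deposit with atomics: one thread-block per ant, one thread per
// tour edge. Ants may share edges, so atomicAdd serialises the conflicts.
__global__ void deposit_atomic(float *pheromone, const int *tours,
                               const float *tour_length, int n)
{
    const int *tour = tours + blockIdx.x * (n + 1);  // this ant's closed tour
    int   e      = threadIdx.x;                      // edge index, 0..n-1
    int   from   = tour[e], to = tour[e + 1];
    float amount = 1.0f / tour_length[blockIdx.x];   // Delta tau = 1 / C^k

    atomicAdd(&pheromone[from * n + to], amount);
    atomicAdd(&pheromone[to * n + from], amount);    // symmetric TSP
}
```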

Therefore, a key objective is to avoid using atomic operations. An alternative approach that does so is shown in Figure ??, where we use a scatter-to-gather transformation [? ].

The configuration launch routine for the pheromone update kernel now sets as many threads as there are cells in the pheromone matrix ($c = n^2$) and equally distributes these threads among thread-blocks.

Figure 4: Pheromone deposit with atomic instructions.

Thus, each thread represents the coordinates of a single entry of the pheromone matrix, and is in charge of checking whether the cell it represents has been visited by any ant; i.e., each thread accesses device memory to check that information. This means that each thread performs $2n^2$ memory loads, for a total of $l = 2n^4$ accesses to device memory across the $n^2$ threads.
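The untiled version of this design might be sketched as follows (illustrative names; each thread indeed issues the $2n^2$ loads counted above):

```cuda
// Scatter-to-gather: one thread per pheromone cell scans every ant's tour
// and accumulates the deposits touching its own edge. No atomics needed.
__global__ void deposit_gather(float *pheromone, const int *tours,
                               const float *tour_length, int n)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= n * n) return;
    int row = cell / n, col = cell % n;

    float acc = 0.0f;
    for (int k = 0; k < n; k++) {                 // every ant (m = n)
        const int *tour = tours + k * (n + 1);
        for (int e = 0; e < n; e++) {             // every edge of its tour
            int from = tour[e], to = tour[e + 1]; // 2 loads -> 2n^2 per thread
            if ((from == row && to == col) || (from == col && to == row))
                acc += 1.0f / tour_length[k];
        }
    }
    pheromone[cell] += acc;
}
```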

Notice that there is a tradeoff between the pressure on device memory incurred to avoid a design based on atomic operations and the number of atomic operations themselves (the loads:atomic ratio from now on). For the scatter-to-gather design, the loads:atomic ratio is $l$:$c$. This approach allows us to perform the computation without using atomic operations, but at the cost of drastically increasing the number of accesses to device memory.

A tiling technique is proposed for increasing the application bandwidth. Now, all threads cooperate to load data from global memory to shared memory, but they still access the edges in the ants' tours. Each thread accesses global memory $2n^2/\theta$ times, where $\theta$ is the tile size; the remaining accesses are performed on shared memory. Therefore, the total number of global memory accesses is $l = 2n^4/\theta$. The loads:atomic ratio is lower, $l$:$c$, but remains of the same order of magnitude.

Figure 5: Scatter to Gather transformation for the pheromone deposit.

We note that an ant's tour length may be bigger than the maximum number of threads that each thread-block can support (see Table ??). Our algorithm prevents this situation by setting our empirically determined optimum thread-block layout, and dividing the tour into tiles of this length. This raises another issue: the case when $n + 1$ is not divisible by the tile size $\theta$. We solve this by padding the ant's tour array to avoid warp divergence (see Figure ??).

Unnecessary loads to device memory can be avoided by taking advantage of the problem's nature. We focus on the symmetric version of the TSP, so the number of threads can be halved, thus halving the number of device memory accesses. This so-called Reduction version actually reduces the overall number of accesses to either shared or device memory by having half the number of threads of the previous version. This is also combined with tiling, as previously explained. The number of accesses per thread remains the same, giving a total number of device memory accesses of $l = n^4/\theta$.


4. Experimental methodology

    4.1. Hardware features

Table 1: CUDA and hardware features for Tesla C2050.

GPU element              Feature                Tesla C2050
Streaming processors     Cores per SM           32
(GPU cores)              Number of SMs          14
                         Total SPs              448
                         Clock frequency        1147 MHz
Maximum number           Per multiprocessor     1536
of threads               Per block              1024
                         Per warp               32
SRAM memory              32-bit registers       32 K
available per            Shared memory          16/48 KB
multiprocessor           L1 cache               48/16 KB
                         (Shared + L1)          64 KB
Global (video)           Size                   3 GB
memory                   Speed                  2 x 1500 MHz
                         Width                  384 bits
                         Bandwidth              144 GB/s
                         Technology             GDDR5

The hardware evaluation platforms are: (1) a dual-socket 2.40 GHz quad-core Intel Xeon E5620 Westmere processor, and (2) an NVIDIA Tesla C2050 based on the Fermi architecture released in November 2010 (see main features in Table ??) [? ]. In addition to a larger number of CUDA processors in each Streaming Multiprocessor (SM), a fourfold increase over prior SM designs, Fermi improves the GPU capabilities with additional features: enhanced single-precision floating-point accuracy (implementing the IEEE 754-2008 floating-point standard), improved double-precision floating-point performance, new general-purpose L1 and L2 caches, faster context switching, a unified 64-bit virtual address space, a brand new instruction set, and Error Correction Code (ECC) memory support to enhance data integrity in high-performance computing. The same on-chip memory is devoted to L1 cache and shared memory, with the split configurable per kernel through the cudaFuncSetCacheConfig() CUDA call.
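For reference, selecting the split is a one-line runtime call per kernel; my_kernel below is a placeholder name:

```cuda
__global__ void my_kernel(float *data) { /* ... */ }

int main(void)
{
    // Prefer the 48 KB L1 / 16 KB shared-memory split for this kernel;
    // cudaFuncCachePreferShared selects the opposite configuration.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    // ... kernel launches follow ...
    return 0;
}
```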


Finally, we use gcc 4.3.4 with the -O3 flag to compile our CPU implementations, and CUDA compilation tools, release 3.2, for our GPU implementations.

    4.2. Our benchmark

We test our designs using a set of benchmark instances from the well-known TSPLIB library [? ? ]. All benchmark instances are defined on a complete graph, and all distances are defined to be integer numbers. Table ?? lists all targeted benchmark instances, with information on the number of cities, the type of distance, and the length of the optimal tour.

ACO parameters such as the number of ants $m$, $\alpha$, $\beta$, $\rho$, and so on are set according to the values recommended in [? ]. The most important parameter for the scope of this study is the number of ants, which is set to $m = n$ ($n$ being the number of cities), with $\alpha = 1$, $\beta = 2$, and $\rho = 0.5$.

Table 2: Description of the benchmark instances from the TSPLIB library. (EUC 2D: 2-dimensional Euclidean distances.)

Name      Cities   Type     Best tour length
d198      198      EUC 2D   15780
a280      280      EUC 2D   2579
lin318    318      EUC 2D   42029
pcb442    442      EUC 2D   50778
rat783    783      EUC 2D   8806
pr1002    1002     EUC 2D   259045
pcb1173   1173     EUC 2D   56892
d1291     1291     EUC 2D   50801
pr2392    2392     EUC 2D   378032

    5. Performance Evaluation

In this Section we analyse in depth both main stages of the ACO algorithm: Tour construction and Pheromone update. We compare our implementations with the sequential code, written in ANSI C, provided by Stützle in [? ]. The performance figures are recorded for a single iteration, and averaged over 100 iterations. In this work we focus on the computational characteristics of the Ant System and how it can be efficiently implemented on the GPU. However, to guarantee the correctness of our algorithms, a quality comparison between the sequential and GPU codes, normalised to the optimum solution, is provided for a variety of benchmarks by running all algorithms for a fixed number of iterations (1000) and averaging over 5 independent runs. The experiments are based on single precision.

    5.1. Evaluation of tour construction stage

We now evaluate the tour construction stage on the Tesla C2050 from several angles: the performance impact of using costly arithmetic instructions in the choice_info kernel, a comparison against a CPU counterpart, the degree of improvement attained through the data-based approach, the speed-up factors obtained when using the different on-chip GPU memories available on the Fermi architecture, and the performance benefits of changing the selection process. We address each of these issues separately.

    5.1.1. Choice info kernel evaluation

Table 3: Execution times (milliseconds) on the GPU system for the choice_info kernel, using both CUDA instructions: powf() and __powf(). We vary the TSPLIB benchmark instance to increase the number of cities.

TSPLIB       CUDA       CUDA
Benchmark    powf()     __powf()
d198         0.038      0.013 (2.88x)
a280         0.065      0.020 (3.18x)
lin318       0.082      0.024 (3.33x)
pcb442       0.147      0.042 (3.47x)
rat783       0.441      0.117 (3.77x)
pr1002       0.719      0.188 (3.82x)
pcb1173      0.981      0.254 (3.86x)
d1291        1.185      0.307 (3.86x)
pr2392       4.042      1.039 (3.88x)

We first evaluate the choice_info kernel, before assessing the impact of including various modifications in the tour construction. Table ?? shows the performance of the choice_info kernel, which is drastically affected by the use of costly math functions like powf(). However, there are analogous CUDA intrinsics, such as __powf(), which map directly to the hardware, although they can provide somewhat lower accuracy [? ]. After PTX inspection, this kernel presents 4 memory accesses to global memory. The empirical streaming bandwidth obtained by this kernel is up to 90 GB/s using __powf(), and up to 23 GB/s using powf(), on the Tesla C2050. These performance numbers translate into up to 67 GFLOPS and 17 GFLOPS respectively, counting both instructions, powf() and __powf(), as a single floating-point operation. We use 256-thread blocks, with one thread per entry of the choice_info data structure, to reach the best performance. These particular values minimise non-coalesced memory accesses and yield high occupancy values.

    5.1.2. Tuning our data-based approach

For the data-based approach, the number of worker ants (threads) per queen ant (thread-block) is a degree of freedom, which is analysed in Table ??. The 128-thread configuration reaches the best performance in all benchmark instances. This particular value is well suited to developing high-throughput applications on GPUs. Notice that some configurations are not available (n.a.) because either the number of worker ants is bigger than the number of cities, or the number of cities divided by the number of worker ants is bigger than 32, which is the maximum number of cities that each worker ant can manage. The tabu list of each queen ant is divided among its worker ants and placed, on a bitwise basis, in a single register.

Table 4: Execution times (milliseconds) on Tesla C2050. We vary the number of worker ants and benchmark instances (n.a. means not available due to register constraints).

Worker ants   d198    a280    lin318   pcb442   rat783   pr1002   pr2392
16            10.39   30.06   38.59    101.09   n.a.     n.a.     n.a.
32            6.85    18.68   23.89    62.77    357.24   749.04   n.a.
64            5.06    12.78   15.51    41.17    235.86   474.79   6083.96
128           4.32    11.94   15.18    38.73    207.08   391.59   5092.27
256           n.a.    15.94   20.89    41.85    245.33   412.20   5680.81
512           n.a.    n.a.    n.a.     n.a.     296.90   498.55   6680.39
1024          n.a.    n.a.    n.a.     n.a.     n.a.     n.a.     10037.1


5.1.3. I-Roulette Versus Roulette Wheel

Table 5: Execution times (milliseconds) for both selection procedures: Roulette Wheel, and our approach, Independent Roulette (I-Roulette).

TSPLIB       Roulette     I-Roulette
Benchmark    Wheel
d198         7.76         4.32 (1.79x)
a280         20.33        11.94 (1.70x)
lin318       26.73        15.18 (1.76x)
pcb442       69.86        38.73 (1.80x)
rat783       376.76       207.08 (1.81x)
pr1002       747.85       391.56 (1.90x)
pcb1173      1227.52      644.05 (1.90x)
d1291        1606.87      840.02 (1.91x)
pr2392       12017.2      5092.27 (2.36x)

Table ?? shows the improvement attained by increasing the parallelism on the GPU through our selection method, I-Roulette. This method reaches a speed-up factor of up to 2.36x compared to the classic roulette wheel, even though it generates many costly random numbers. The roulette wheel compromises the GPU parallelism; there is thus a tradeoff between improving total throughput and increasing the latency of individual tasks, with the former option favoured by a wide margin.

    5.1.4. GPU Versus CPU

Table ?? presents execution times on the high-end CPU and the Tesla C2050 GPU already introduced, for the set of simulations included in our benchmarks. It is worth mentioning that, for a fair comparison, we have selected hardware platforms with a similar cost (the investment ranges between 1,500 and 2,000 euros for each single processor). We see that the GPU obtains better performance than its CPU counterpart, reaching a speed-up factor of up to 21x.

Table ?? also shows the benefit of having many light-weight parallel threads through a data-parallelism approach, instead of the heavy-weight threads of the task-based approach on GPUs.

For the task-based parallelism versions, we use 16 CUDA threads per thread-block, with 16 ants running in parallel per thread-block, to reach the best performance for the targeted benchmarks.


Table 6: Execution times (milliseconds) on different hardware platforms (CPU vs GPU), and the effect of enabling the data-based approach on the GPU.

TSPLIB       CPU Xeon     GPU Task-based           GPU Data-based
Benchmark    E5620        Ex. time   Speed-up      Ex. time   Speed-up   Speed-up
                                     vs CPU                   vs CPU     vs Task
d198         43.01        29.37      1.46x         4.32       9.95x      6.79x
a280         151.99       70.52      2.15x         11.94      12.72x     5.9x
lin318       223.78       153.33     1.45x         15.18      14.69x     10.1x
pcb442       618.07       301.66     2.04x         38.73      15.95x     7.78x
rat783       3539.01      1375.38    2.57x         207.08     17.08x     6.64x
pr1002       7965.17      2437.34    3.26x         392.56     20.33x     6.2x
pcb1173      12839.26     3392.74    3.78x         652.05     19.69x     5.2x
d1291        17450.53     4102.3     4.25x         869.96     20.05x     4.71x
pr2392       110573.22    29792.03   3.71x         5092.27    21.71x     5.85x

This particular value produces very poor GPU resource usage per SM, and is not well suited for developing high-throughput applications in the GPU computing arena. The heavy-weight threads presented by this design need resources to execute their independent tasks while avoiding long serialisation phases. In CUDA, this is obtained by distributing those threads among SMs, which is possible by increasing the number of thread-blocks in the execution.

The task-based approach is rewarded with a maximum speed-up factor of only 4.25x, compared with the 21.71x speed-up factor reached by data-based parallelism.

    5.2. Evaluation of pheromone update kernel

This section evaluates the pheromone update kernel on the Tesla C2050 under different aspects: the performance impact of using different GPU algorithmic strategies, and a comparison versus a CPU counterpart. We now address each of these issues separately.

    5.2.1. Evaluation of different GPU algorithmic strategies

In this case, the baseline version is our best-performing kernel, which uses atomic instructions and shared memory. From there, we show the slow-downs incurred by each alternative technique.


Table 7: Execution times (in milliseconds) for various pheromone update implementations (Tesla C2050).

Code version                  d198    a280     lin318   pcb442   rat783    pr1002    pcb1173   d1291     pr2392
1. Atomic Ins. + Tiling       0.18    0.41     0.49     0.54     2.42      3.52      4.68      5.85      18.57
2. Atomic Ins.                0.26    0.45     0.60     0.9      2.49      4.45      5.33      6.01      19.04
3. Ins. & Thread Reduction    25.47   93.93    144.63   516.6    4669.58   12256.4   22651.3   33682     390301
4. Tiled Scatter to Gather    66.29   211.81   368.9    1321.3   12331.2   32343.6   58740.7   86445.2   1018150
5. Scatter to Gather          66.37   260.82   424.1    1534.2   14649.9   39299.1   73384.8   107926    1313744.4
Slow-down attained (5 vs 1)   368x    636x     865x     2841x    6053x     11164x    15680x    18448x    70745x

As previously explained, this kernel presents a tradeoff between the number of accesses to global memory needed to avoid costly atomic operations and the number of those atomic operations (the loads:atomic ratio). The scatter-to-gather computation pattern (5) presents the largest difference between these two parameters. This imbalance is reflected in the performance degradation shown in the bottom row of Table ??. The slow-down increases rapidly with the benchmark size, as expected.

The tiling technique (4) improves the application bandwidth relative to the plain scatter-to-gather approach. The Reduction technique (3) actually reduces the overall number of accesses to either shared or device memory by having half the number of threads of versions 4 and 5; it also uses tiling to alleviate the pressure on device memory. Even though the number of loads per thread remains the same, the overall number of loads in the application is reduced.

Finally, the atomic-instruction version (2) is also slightly improved through tiling (1). Moreover, the Tesla C2050 provides an improved memory hierarchy which alleviates the pressure on device memory. As previously mentioned, the on-chip memory is statically configurable, setting different partition sizes for L1 and shared memory. It is noteworthy that version (2) performs slightly better with a bigger L1 partition than with largely unused shared memory (around 0.5% improvement on average). For version (1), the best on-chip configuration depends on the benchmark size and the shared memory used by the targeted benchmark. Better performance, of the same order as before, is obtained by devoting more room to L1, as long as the shared memory requirements are below 16 KB; otherwise the nvcc compiler automatically devotes more room to shared memory in order to execute the application.

    5.2.2. GPU Vs CPU

Table 8: Execution times (milliseconds) on different hardware platforms (CPU vs GPU). We vary the TSPLIB benchmark instance to increase the number of cities.

TSPLIB       CPU Xeon        GPU NVIDIA
Benchmark    Intel E5620     Tesla C2050
d198         1.02            0.18 (5.67x)
a280         2.6             0.41 (6.34x)
lin318       4.1             0.49 (8.37x)
pcb442       6.6             0.54 (12.22x)
rat783       33.08           2.42 (13.67x)
pr1002       54.89           3.52 (15.59x)
pcb1173      77.59           4.68 (16.58x)
d1291        102.43          5.85 (17.51x)
pr2392       371.05          18.57 (19.98x)


Table ?? shows the speed-up factor for the best version of the pheromone update kernel compared to the sequential code. The pattern of computation for this kernel is based on data-parallelism, showing a speed-up that grows linearly with the problem size, reaching up to 20x on the Tesla C2050.

    5.3. Quality of the Solution

Figure 6: The quality of the solution, averaged over 1000 iterations. The 95% confidence interval is shown on top of the bars.

Finally, Figure ?? shows a quality comparison of the solutions obtained by the main algorithms targeted in this work. They are normalised with respect to the optimum solution for each benchmark, as previously reported (see Table ??). The solution quality shown is the result of running all algorithms for a fixed number of iterations (1000) and averaging over 5 independent runs. A 95% confidence interval is also provided. It is worth noticing that the tour quality obtained by the GPU codes is similar to that of the verified sequential code, and even better in some cases.

    6. Related Work

    6.1. Parallel implementations

Stützle [? ] describes the simplest case of ACO parallelisation, in which independent instances of the ACO algorithm are run on different processors. Parallel runs have no communication overhead, and the final solution is taken as the best solution over all independent executions. Improvements over non-communicating parallel runs may be obtained by exchanging information among processors. Michel and Middendorf [? ] present a solution based on this principle, whereby separate colonies exchange pheromone information. In more recent work, Chen et al. [? ] divide the ant population into equally-sized sub-colonies, each assigned to a different processor. Each sub-colony searches for an optimal local solution, and information is exchanged between processors periodically. Lin et al. [? ] propose dividing up the problem into subcomponents, with each subgraph assigned to a different processing unit. To explore a graph and find a complete solution, an ant moves from one processing unit to another, and messages are sent to update pheromone levels. The authors demonstrate that this approach reduces local complexity and memory requirements, thus improving overall efficiency.

    23

6.2. GPU implementations

In terms of GPU-specific designs for the ACO algorithm, Jiening et al. [? ] propose an implementation of the Max-Min Ant System (one of many ACO variants) for the TSP, using C++ and NVIDIA Cg. They focus their attention on the tour construction stage, and compute the shortest path on the CPU. Catala et al. [? ] propose two ACO implementations on GPUs, applying them to the Orienteering Problem, using vertex and shader processors. In [? ], You discusses a CUDA implementation of the Ant System for the TSP. The tour construction stage is identified as a CUDA kernel, launched with as many threads as there are artificial ants in the simulation. The tabu list of each ant is stored in shared memory, and the pheromone and distance matrices are stored in texture memory. The pheromone update stage is calculated on the CPU. Li et al. [? ] propose a method based on a fine-grained model for GPU-acceleration, which maps a parallel ACO algorithm to the GPU through CUDA. Ants are assigned to single processors, and they are connected by a population structure [? ].

Fu et al. [? ] design the MAX-MIN Ant System for the TSP with MATLAB and the Jacket toolbox, accelerating some parts of the algorithm on the GPU. They report the low performance obtained by the traditional roulette wheel as a selection process on GPUs, and propose an alternative selection process, called All-In-Roulette, which generates an m x n pseudorandom number matrix, m being the number of ants and n the number of cities. In his PhD thesis, Robin M. Weiss applies the ACO algorithm to the data-mining problem [? ]. He analyses several GPU ACO designs, showing the low performance of the previous designs based on task-parallelism.

Although these proposals offer a useful starting point when considering GPU-based parallelisation of ACO, they are deficient in two main regards. Firstly, they fail to offer any systematic analysis of how best to implement this particular algorithm. Secondly, they fail to consider an important component of the ACO algorithm; namely, the pheromone update.

    7. Conclusions and Future Work

Ant Colony Optimisation (ACO) belongs to the family of population-based meta-heuristics and has been successfully applied to many NP-complete problems. As a population-based algorithm, it is intrinsically parallel, and thus well-suited to implementation on parallel architectures. The ACO algorithm comprises two main stages: tour construction and pheromone update.


Previous efforts to parallelise ACO on the GPU focused on the former stage, using task-based parallelism. We have demonstrated that this approach does not fit well on the GPU architecture, and have provided an alternative approach based on data parallelism. This enhances the GPU performance by both increasing the parallelism and avoiding warp divergence. Moreover, we have proposed an alternative selection procedure that fits better with the idiosyncrasies of the GPU architecture.

In addition, we have provided the first known implementation of the pheromone update stage on the GPU. Several GPU computing techniques were discussed for avoiding atomic instructions. However, we have shown that those techniques are even more costly than applying atomic operations directly.

Possible future directions include investigating the effectiveness of GPU-based ACO algorithms on other NP-complete optimisation problems. We will also implement other ACO algorithms, such as the Ant Colony System, which can also be efficiently implemented on the GPU. The conjunction of ACO and GPUs is still at a relatively early stage; we emphasise that we have so far tested only a relatively simple variant of the algorithm. There are many other types of ACO algorithm still to explore, and as such, this is a potentially fruitful area of research. We hope that this paper stimulates further discussion and work.

    Acknowledgements

This work was partially supported by a travel grant from the EU FP7 NoE HiPEAC IST-217068, the European Network of Excellence on High Performance and Embedded Architecture and Compilation. The first two authors acknowledge the support of the project from the Spanish MEC and European Commission FEDER funds under grants Consolider Ingenio-2010 CSD2006-00046 and TIN2006-15516-C04-03, and also from the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grant 00001/CS/2007.

    References

[1] M. Dorigo, T. Stützle, Ant Colony Optimization, Bradford Company, Scituate, MA, USA, 2004.

[2] M. Dorigo, M. Birattari, T. Stützle, Ant colony optimization, IEEE Computational Intelligence Magazine 1 (4) (2006) 28-39.

[3] C. Blum, Ant colony optimization: Introduction and recent trends, Physics of Life Reviews 2 (4) (2005) 353-373.

[4] E. Lawler, J. Lenstra, A. Kan, D. Shmoys, The Traveling Salesman Problem, Wiley, New York, 1987.

[5] M. Dorigo, A. Colorni, V. Maniezzo, Positive feedback as a search strategy, Tech. Rep. 91-016, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy (1991).

[6] M. Dorigo, V. Maniezzo, A. Colorni, The ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics-Part B 26 (1996) 29-41.

[7] M. Dorigo, E. Bonabeau, G. Theraulaz, Ant algorithms and stigmergy, Future Gener. Comput. Syst. 16 (2000) 851-871.

[8] T. Stützle, Parallelization strategies for ant colony optimization, in: PPSN V: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, Springer-Verlag, London, UK, 1998, pp. 722-731.

[9] X. JunYong, H. Xiang, L. CaiYun, C. Zhong, A novel parallel ant colony optimization algorithm with dynamic transition probability, International Forum on Computer Science-Technology and Applications 2 (2009) 191-194.

[10] Y. Lin, H. Cai, J. Xiao, J. Zhang, Pseudo parallel ant colony optimization for continuous functions, International Conference on Natural Computation 4 (2007) 494-500.

[11] M. Garland, D. B. Kirk, Understanding throughput-oriented architectures, Commun. ACM 53 (2010) 58-66.

[12] J. R. Fischer, Frontiers of massively parallel scientific computation, National Aeronautics and Space Administration, Scientific and Technical Information Office, Washington, D.C., 1987.

[13] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences with CUDA, IEEE Micro 28 (2008) 13-27.

[14] J. M. Cecilia, J. M. García, M. Ujaldón, A. Nisbet, M. Amos, Parallelization strategies for ant colony optimisation on GPUs, in: NIDISC 2011: 14th International Workshop on Nature Inspired Distributed Computing, Proc. 25th International Parallel and Distributed Processing Symposium (IPDPS 2011), Anchorage (Alaska), USA, 2011.

[15] D. S. Johnson, L. A. McGeoch, The Traveling Salesman Problem: A case study in local optimization, 1997.

[16] M. Dorigo, Optimization, learning and natural algorithms, Ph.D. thesis, Politecnico di Milano, Italy (1992).

[17] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st Edition, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.

[18] NVIDIA, NVIDIA CUDA C Programming Guide 3.1.1, 2010.

[19] NVIDIA, NVIDIA CUDA C Best Practices Guide 3.2, 2010.

[20] NVIDIA, NVIDIA CUDA CURAND Library, 2010.

[21] T. Scavo, Scatter-to-gather transformation for scalability (Aug 2010).

[22] NVIDIA, Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.

[23] G. Reinelt, TSPLIB: A traveling salesman problem library, ORSA Journal on Computing 3 (4) (1991) 376-384.

[24] TSPLIB Webpage, http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/ (February 2011).

[25] R. Michel, M. Middendorf, An island model based ant system with lookahead for the shortest supersequence problem, in: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, PPSN V, Springer-Verlag, London, UK, 1998, pp. 692-701.

[26] L. Chen, H.-Y. Sun, S. Wang, Parallel implementation of ant colony optimization on MPP, in: Machine Learning and Cybernetics, 2008 International Conference on, Vol. 2, 2008, pp. 981-986.

[27] W. Jiening, D. Jiankang, Z. Chunfeng, Implementation of ant colony algorithm based on GPU, in: CGIV '09: Proceedings of the 2009 Sixth International Conference on Computer Graphics, Imaging and Visualization, IEEE Computer Society, Washington, DC, USA, 2009, pp. 50-53.

[28] A. Catala, J. Jaen, J. Modioli, Strategies for accelerating ant colony optimization algorithms on graphical processing units, in: IEEE Congress on Evolutionary Computation, 2007, pp. 492-500.

[29] Y.-S. You, Parallel ant system for traveling salesman problem on GPUs, in: GECCO 2009 - GPUs for Genetic and Evolutionary Computation, 2009, pp. 1-2.

[30] J. Li, X. Hu, Z. Pang, K. Qian, A parallel ant colony optimization algorithm based on fine-grained model with GPU-acceleration, International Journal of Innovative Computing, Information and Control 5 (2009) 3707-3716.

[31] J. Fu, L. Lei, G. Zhou, A parallel ant colony optimization algorithm with GPU-acceleration based on all-in-roulette selection, in: 2010 Third International Workshop on Advanced Computational Intelligence (IWACI), 2010, pp. 260-264.

[32] R. M. Weiss, GPU-accelerated data mining with swarm intelligence, Ph.D. thesis, Department of Computer Science, Macalester College (2010).
