
MASSIVELY SCALABLE PARALLEL APPROACHES FOR THE FLOYD-WARSHALL ALGORITHM

Milton Soto Ferrari

Department of Industrial Engineering

Western Michigan University

4601 Campus Drive

Kalamazoo, Michigan 49008

Corresponding author's e-mail: [email protected]

Abstract: Parallel computation has emerged as a sophisticated technique to improve algorithm performance and significantly reduce running time as data volume increases. This paper presents parallel versions of the Floyd-Warshall algorithm that support scalable and efficient implementation of data-intensive applications. Two different approaches were developed. A fundamental implementation was parallelized on a homogeneous cluster of multi-core CPUs using the g++ OpenMP library, considering different matrix dimensions, with an average speedup of 14 and an efficiency of 87% compared to the sequential version of the code. In addition, a high-level CUDA implementation was performed on a heterogeneous cluster of CPUs and GPUs; in this case the performance achieved a significant average speedup of 65, with no errors in the calculations completed on the GPUs.

1. INTRODUCTION

The Floyd-Warshall algorithm, also known as the all-pairs shortest paths algorithm, is widely used to compute the shortest paths between all pairs of n vertices in an edge-weighted directed graph (Floyd, 1962). The implementation is relevant to many types of applications, and its outputs are correct as long as no negative cycles exist in the input graph (Hougardy, 2010). Parallel implementations of the problem are relevant since the algorithm has a worst-case runtime of O(n³). The algorithm evaluates whether a path between two vertices that passes through an intermediate vertex (k) improves the final path between the two vertices (Floyd, 1962).

The parallel version contemplates that, once a graph model is considered, the objective is to evenly distribute the computations over p processors or threads by partitioning the vertex cost evaluations into p equally weighted sets while minimizing thread communication overhead. In order to achieve the parallelization, dependences between calculations must be taken into account; otherwise the output of the algorithm would not reach the optimum.
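For reference, the update rule applied at each intermediate phase is the standard Floyd-Warshall recurrence (stated here for clarity; the notation is ours rather than reproduced from the original code):

d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right), \qquad k = 1, \dots, n

Because every entry of d^{(k)} depends on the completed d^{(k-1)}, the k phases must run sequentially, while the (i, j) updates within a single phase are mutually independent and can be distributed across processors or threads.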

There are several metrics that characterize the performance of a parallel system, such as parallel execution time, speedup, and efficiency. It is a well-known fact that, given a parallel architecture and a problem instance, the speedup of a parallel algorithm does not continue to increase with an increasing number of processors but tends to saturate or peak at a certain value (Grama et al., 1993). For the purposes of the performance evaluation we consider the basic parallel metrics of execution (running) time, speedup, and efficiency, using different instance sizes (n) and different numbers of threads (p), since the proposed fundamental implementation was developed on a shared-memory system.

In this paper, we present a parallelized version of the Floyd-Warshall algorithm for different matrix sizes n. The main feature of the implementation is that it easily replicates the operability and functionality of the algorithm while reducing the running time required by the sequential version of the code.

The implementations were generated on the computer cluster of the Department of Computer Science research laboratory at Western Michigan University in Kalamazoo, Michigan. The cluster is a shared-memory, multicore, GPU-based system built on multi-core processors.

The fundamental parallelized version of the algorithm was developed in the OpenMP parallel environment (OpenMP, 2013). The performance and results of the implementation were evaluated for n sizes of 500, 1000, 2000, 5000, and 10000. In addition, a high-level CUDA application for NVIDIA GPUs is presented as a significant accomplishment for increasing the performance of the algorithm.

The paper is organized as follows. In Section 2, we present the fundamental parallelized version of the algorithm using the OpenMP library. The results of this application are presented in Section 3. The CUDA implementation and its results are presented in Sections 4 and 5, respectively. Conclusions and future research are presented in Sections 6 and 7.

2. OPENMP IMPLEMENTATION

OpenMP provides a shared-memory API for C compilation. Special preprocessor directives known as pragmas (#pragma) instruct the compiler to parallelize the code that follows the directive across a specified number of threads (Pacheco, 2011). For our problem, the main Floyd function was parallelized using an OpenMP parallel-for pragma with a dynamic schedule and the default chunk size for load balancing; testing revealed the dynamic schedule to be the most efficient.
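As a concrete illustration of this scheme, complementing the pseudocode in Figure 1 below, a minimal C/OpenMP sketch is given here; the function and variable names are illustrative and are not taken from the experimental code.

#include <omp.h>

/* Minimal sketch of the OpenMP parallelization described above (illustrative).
   A holds the n x n cost matrix and P the predecessor matrix, both stored
   row-major in flat arrays of length n*n. */
void floyd_openmp(double *A, int *P, int n, int threads)
{
    #pragma omp parallel num_threads(threads)
    for (int k = 0; k < n; k++) {
        /* The k phase carries a dependence and stays sequential; the i loop
           is distributed across the threads with a dynamic schedule. */
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double candidate = A[i * n + k] + A[k * n + j];
                if (candidate < A[i * n + j]) {
                    A[i * n + j] = candidate;
                    P[i * n + j] = k;
                }
            }
        }
        /* The implicit barrier at the end of the omp for construct ensures
           that phase k is complete before any thread starts phase k + 1. */
    }
}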

The omp for pragma that parallelizes the algorithm must be placed inside the loop over the k phase (i.e., on the i loop), since the k phase carries a dependence that cannot be parallelized: to reach the optimum value the algorithm must know the previously explored path values, so the k loop must remain sequential. Figure 1 shows the pseudocode of the program described, with the additions to the basic code that enable the parallelization (in bold). As their computation time was inconsequential to the overall running time, the input and output calculations were not parallelized.

Function Floyd OpenMP Parallelization

#pragma omp parallel num_threads (threads)

FOR k = 1 to n

#pragma omp for schedule(dynamic)

FOR i = 1 to n

FOR j = 1 to n

IF Aik + Akj < Aij, THEN

Aij = Aik + Akj

Pij = k

ENDIF

ENDFOR

ENDFOR

ENDFOR

Figure 1. Fundamental OpenMP Parallelization Pseudocode

2.1 Fundamental Parallel Version

The fundamental parallel version of the algorithm evaluates the paths using the given cost-matrix information (A matrix) and the precedence vertices (P matrix) for n sizes of 500, 1000, 2000, 5000, and 10000, and for 1, 2, 4, 8, and 16 threads. In this context, threads equal to 1 corresponds to the running time when the code is executed sequentially on a single core.

3. OPENMP IMPLEMENTATION RESULTS

The test environment was the computer cluster of the Department of Computer Science research laboratory at Western Michigan University in Kalamazoo, Michigan. The cluster is a shared-memory, multicore, GPU-based system with 16 cores. The program was built with g++ -fopenmp under Linux. Figure 2 shows the decrease in running time for the different n sizes as the number of threads p increases up to the number of cores available in the cluster.

Figure 2. Running Time OpenMP

For the largest instance evaluated (n = 10000), the running time decreased from T1 = 4276 s when executed sequentially to approximately T16 = 306 s with the OpenMP parallel version when the number of threads p equals the number of cores. To deepen the analysis, the speedup metric for parallel algorithms must be included. The speedup metric represents the performance of the algorithm according to the running times obtained with the different numbers of threads. Usually the best we can hope for is to divide the work equally among the cores while introducing no additional work for them. If we succeed in doing this and we run our program with p threads, then our parallel program runs p times faster than the serial program; this behavior is known as linear speedup. However, it is unlikely to achieve a speedup equal to the number of cores, due to the communication overhead necessary to coordinate the data across threads; serial programs do not incur this overhead since the information flow occurs through only one core (Pacheco, 2011). The speedup (S) is defined as:

S = \frac{T_{serial}}{T_{parallel}} \qquad (1)

The parallel implementation achieved a speedup of 14 when the number of threads equals the number of cores (T1 / T16 = 4276 s / 306 s ≈ 14). Figure 3 shows the speedup achieved for the parallel implementation.

Figure 3. Speedup OpenMP

Analyzing Figure 3, the gap between the achieved speedup and the ideal linear speedup (p) is expected to grow as the number of threads increases. Another way of saying this is that S/p gets smaller as p increases. This ratio is another metric for evaluating the performance of the algorithm; it is known as efficiency and refers to the effective use of the cores during the running time. The efficiency (E) is defined as:

E = \frac{S}{p} = \frac{T_{serial}}{p \, T_{parallel}} \qquad (2)

For the implementation, the achieved efficiency was 0.87 (87%, i.e., E = 14/16) when the number of threads equals the number of cores. This is an excellent performance for the parallel execution. Figure 4 shows the efficiency reached for the OpenMP algorithm.

Figure 4. Efficiency OpenMP

4. CUDA IMPLEMENTATION

CUDA (Compute Unified Device Architecture) is the parallel programming model and software environment provided by NVIDIA to run applications on their GPUs, programmed via simple extensions to the C programming language. The NVIDIA GPU (or device) consists of a set of streaming multiprocessors (SMs), where each SM consists of a group of scalar processor (SP) cores with a multi-threaded instruction unit (Sanders et al., 2010). CUDA follows a code-offloading model: compute-intensive portions of applications running on the host CPU are offloaded onto the GPU device for acceleration. The kernel is the portion of the program that is compiled to the instruction set of the GPU device and offloaded to the device before execution. Each kernel executes as a group of threads, which are logically organized in the hierarchical form of grids and blocks of threads. The dimensions of the blocks and the grid can be specified as parameters before each kernel launch. The kernel is mapped onto the device such that each thread block is executed on only one SM, and a single SM can execute multiple such thread blocks simultaneously, depending on the memory requirements of each block (Kirk et al., 2010).

4.1 GPU Kernel Floyd

The most computationally intensive section of the algorithm is the evaluation over the k intermediate vertices. The part of the program to launch in parallel on the GPU is coded as a CUDA kernel, and the CUDA runtime is instructed how many parallel copies, or blocks, to launch and how many threads to launch per block. In the sequential version, the algorithm has three nested loops: an outer loop over k and two inner loops covering all elements of the matrix. In the CUDA implementation, we launch the following kernel once for each value of k; the kernel serves as a loop unrolling of the nested inner loops. The kernel is a global CUDA function that can be called from the host (CPU). It takes as arguments the A and P matrices, the dimension of the matrix n, and the current index k at which the inner loops are being unrolled. Each thread processes indexes in increments of blockDim.x * gridDim.x (the block dimension times the grid dimension) until the index exceeds the total number of elements N declared on the host (CPU). The inner part of the kernel is very similar to the sequential version, where the costs are checked and updated accordingly. The key point is to obtain the indexes i and j from the current thread index; then, having k and n as parameters, the necessary computations can be done. Figure 5 describes the interactions between the host (CPU) and the device (GPU).

Figure 5. Floyd CUDA

Figure 6 shows the pseudocode of the CUDA implementation described. As their computation time was inconsequential to the overall running time, the input and output calculations were also not parallelized.

Device Function Floyd CUDA

__global__ void Floyd

tid = threadIdx.x + blockIdx.x * blockDim.x;

WHILE (tid < N)

i = tid / n;

j = tid % n;

IF Ai*n+k + Ak*n+j < Ai*n+j, THEN

Ai*n+j = Ai*n+k + Ak*n+j

Pi*n+j = k

ENDIF

tid += blockDim.x * gridDim.x;

ENDWHILE

Figure 6. Device Floyd CUDA Pseudocode
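For reference, a compilable CUDA sketch corresponding to the pseudocode in Figure 6 is given below. The kernel body follows Figure 6; the host-side launch configuration (blocks and threads per block) and the function names are illustrative assumptions rather than the exact settings used in the experiments.

#include <cuda_runtime.h>

/* Device kernel: one grid-stride pass over the n x n matrices for a fixed
   phase k. A (costs) and P (predecessors) are flat, row-major device arrays. */
__global__ void floyd_kernel(float *A, int *P, int n, int k)
{
    int N = n * n;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        int i = tid / n;
        int j = tid % n;
        float candidate = A[i * n + k] + A[k * n + j];
        if (candidate < A[i * n + j]) {
            A[i * n + j] = candidate;
            P[i * n + j] = k;
        }
        tid += blockDim.x * gridDim.x;   /* grid-stride step */
    }
}

/* Host side: the k loop stays on the CPU, with one kernel launch per phase. */
void floyd_cuda(float *dA, int *dP, int n)
{
    int threadsPerBlock = 256;                                   /* assumption */
    int blocks = (n * n + threadsPerBlock - 1) / threadsPerBlock;
    for (int k = 0; k < n; k++) {
        floyd_kernel<<<blocks, threadsPerBlock>>>(dA, dP, n, k);
        cudaDeviceSynchronize();   /* phase k must finish before phase k + 1 */
    }
}

The synchronization after each launch mirrors the dependence already noted for the OpenMP version: phase k must be completed before phase k + 1 begins.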

5. CUDA IMPLEMENTATION RESULTS

For the largest instance evaluated (n = 10000), the running time decreased from T1 = 4276 s when executed sequentially to approximately TN = 65.64 s, where N = 100,000,000 corresponds to the total number of matrix elements (n × n) covered by the launched threads. This is a considerable reduction in running time, with no errors in the calculations performed. Figure 7 shows the running times for the different sizes analyzed.

Figure 7. Running Time CUDA

As expected, the speedup for this execution is higher. This case achieved a speedup of 65 when n equals 10000 (T1 / TN = 4276 s / 65.64 s ≈ 65). This speedup is a significant improvement that makes it practical to evaluate applications with larger instances. Figure 8 shows the speedup achieved for the CUDA implementation.

Figure 8. Speedup CUDA

6. CONCLUSIONS

In this paper, we described two different parallel implementations of the Floyd-Warshall algorithm. The preliminary results show significant reductions in computational time for different matrix sizes. This is a first step in addressing the challenge of running the algorithm on a distributed-memory system and designing powerful support tools for solving problems with more massive instance sizes. For that purpose, we designed and proposed a CUDA approach that can accommodate the rapid growth of this evaluation as instance sizes increase.

This implementation can be extended to several problems, such as graph-theory and network-design applications, with the aim of solving difficult instances to an optimum result within a reliable time.

7. FUTURE RESEARCH

Future implementations will require the use of MPI (Message Passing Interface) to evaluate the results in a distributed-memory environment, with the capability to use GPUs, combining the CUDA approach with execution across a set of different nodes.

Acknowledgement. The author acknowledges the support of the Computer Science Department at Western Michigan University.

8. REFERENCES

1. Floyd, R. W. (1962). Algorithm 97: shortest path. Communications of the ACM, 5(6), 345.

2. Grama, A., Gupta, A., & Kumar, V. (1993). Isoefficiency function: A scalability metric for parallel algorithms and architectures. IEEE Parallel and Distributed Technology, Special Issue on Parallel and Distributed Systems: From Theory to Practice, 1(3), 12-21.

3. Hougardy, S. (2010). The Floyd-Warshall algorithm on graphs with negative cycles. Information Processing Letters, 110(8), 279-281.

4. Kirk, D. B., & Wen-mei, W. H. (2010). Programming massively parallel processors: a hands-on approach. Morgan Kaufmann.

5. OpenMP website. http://www.openmp.org.

6. Pacheco, P. (2011). An introduction to parallel programming. Morgan Kaufmann.

7. Sanders, J., & Kandrot, E. (2010). CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional.
