
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2014; 26:662–682
Published online 30 April 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3027

Visual exploration of data by using multidimensional scaling on multicore CPU, GPU, and MPI cluster

Piotr Pawliczek1,4, Witold Dzwinel2,*,† and David A. Yuen3

1Department of Biochemistry and Molecular Biology, University of Texas, Medical School at Houston, Houston, TX 77030, USA

2Department of Computer Science, AGH University of Science and Technology, Krakow, Poland
3Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455-0219, USA

4Department of Applied Computer Science and Modeling, AGH University of Science and Technology, Faculty of Metals Engineering and Industrial Computer Science, Krakow, Poland

ABSTRACT

Visual and interactive data exploration requires fast and reliable tools for embedding an original data space in a 3(2)-dimensional Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to at least O(M^2) memory and time complexity, MDS is computationally demanding for interactive visualization of data sets consisting of the order of 10^4 objects on computer systems ranging from PCs with multicore CPUs and graphics processing unit (GPU) boards to midrange MPI clusters. To explore data sets of that size interactively, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that our MDS algorithms implemented in the compute unified device architecture (CUDA) environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480) are considerably faster than the MPI/OpenMP parallel implementation on a modern midrange professional cluster (10 nodes, each equipped with two Intel Xeon X5670 CPUs). We also show that the hybridized two-level MPI/CUDA implementation, run on a cluster of GPU nodes, can additionally provide a linear speedup. Copyright © 2013 John Wiley & Sons, Ltd.

Received 27 August 2012; Revised 15 March 2013; Accepted 17 March 2013

KEY WORDS: data mining; interactive data visualization; multidimensional scaling; method of particles; multicore CPU; GPU–CUDA; MPI cluster

1. INTRODUCTION

Multidimensional scaling (MDS) [1–6] is a very popular data mining technique used in microbiology, genetics, chemistry, geology, geophysics, psychology, and many other disciplines. This feature extraction method consists in mapping a high-dimensional feature space Ω = ℜ^N onto a low-dimensional Euclidean space X = ℜ^n, where n << N. In this paper, we focus on the application of MDS in visualization and interactive exploration of multidimensional data, that is, n = dim X = 3 (or n = 2).

However, MDS has a more general context. It can also be used for embedding non-metric abstract data Ω = {o_i; i = 1,...,M}, such as fragments of text or sophisticated shapes, in a 'target' n-dimensional vector space X = {x_i = (x_1,...,x_n); i = 1,...,M}. Then, the dissimilarity matrix D = {D_ij}_{M×M}, where D_ij is a dissimilarity measure between objects o_i and o_j (e.g., elastic or Hausdorff distances between

*Correspondence to: Witold Dzwinel, Department of Computer Science, AGH University of Science and Technology, Krakow, Poland.
†E-mail: [email protected]


shapes [7]), is the only information available about Ω. The resulting vector representation of abstract data can be analyzed with classical machine learning tools.

To transform Ω into X, the error function V(X) (called 'the stress function'):

V(X) = Σ_{i<j} w_ij (D_ij^k − d_ij^k)^m    (1)

is minimized. The matrix d = {d_ij}_{M×M} is the Euclidean distance matrix computed in X, where d_ij = ((x_i − x_j)^T (x_i − x_j))^{1/2}.

The function V(X) is a discrepancy measure between the dissimilarities D_ij from Ω and the corresponding distances d_ij from X. By assuming in (1) that k = 1, m = 2, and the weights w_ij = 1/D_ij^k, we obtain the classical MDS version called Sammon's mapping [6,8]. To find the minimum of the criterion (1), a system of n·M nonlinear equations has to be solved. The number of solutions is infinite because the target configuration of feature vectors X is invariant with respect to all isometric transformations, including rotation and axial and planar symmetries. Moreover, the error function is a multidimensional function ℜ^{nM} → ℜ^1, which is usually multimodal. This makes the search for its global minimum extremely demanding. In general, this problem is insoluble, and the computational complexity scales exponentially with the number of local minima of V(X). It can be solved only partially by using heuristics such as simulated annealing [8] or an N-body virtual particle solver [9–16]. However, both time and memory complexity are still bounded from below by an O(M^2) term. Thus, for modern PCs and midrange computer systems, interactive visualization of the order of 10^4 data objects is computationally demanding.
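To make the criterion concrete, the sketch below evaluates the stress function (1) for a given embedding. It is a minimal illustration under our own assumptions: the function name, the flat row-major layout of D and X, and the Sammon-weight switch are not from the paper.

#include <cmath>
#include <cstddef>
#include <vector>

// Evaluate the stress function (1) for an embedding X of M points in n
// dimensions, given the M x M dissimilarity matrix D (row-major).
// With k = 1, m = 2 and sammonWeights = true this is Sammon's criterion.
double stress(const std::vector<double>& D, const std::vector<double>& X,
              std::size_t M, std::size_t n, double k, double m,
              bool sammonWeights) {
    double V = 0.0;
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = i + 1; j < M; ++j) {
            double d2 = 0.0;                        // squared distance in X
            for (std::size_t c = 0; c < n; ++c) {
                const double diff = X[i * n + c] - X[j * n + c];
                d2 += diff * diff;
            }
            const double dij = std::sqrt(d2);
            const double Dij = D[i * M + j];
            const double w = sammonWeights ? 1.0 / std::pow(Dij, k) : 1.0;
            V += w * std::pow(std::pow(Dij, k) - std::pow(dij, k), m);
        }
    }
    return V;
}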

Data exploration by using interactive visualization relies on computational steering principles [17] combined with user-driven exploration of the parameter space. However, in the case of interactive data visualization, we have to consider additional specific operations, such as changing the type of the cost function, selecting the most suitable dissimilarity measure, deleting unnecessary or unwanted data (e.g., outliers and noise), focusing the visualization on selected clusters, or penetrating highly imbalanced data. Such user-driven, online manipulation of a visualized data set consisting of thousands of objects is possible only with a very fast feature extraction procedure. We believe that MDS based on the virtual particle paradigm [9–16], empowered by modern computer architectures and novel efficient parallel algorithms, can play the role of such an interactive visualization engine.

In the following section, we present the virtual particle method employed for MDS mapping. Next, we describe novel parallel algorithms and their implementations on two different parallel platforms: multithread multicore CPUs and graphics processing unit (GPU) boards. We also compare the efficiency of MDS mapping in heterogeneous parallel environments: open multiprocessing (OpenMP), CUDA, and the message passing interface (MPI) on a single small-scale MPI cluster. Finally, we summarize the related work and discuss the conclusions.

2. MULTIDIMENSIONAL SCALING METHOD BASED ON PARTICLE SYSTEM DYNAMICS

To obtain the best minimum (i.e., the one closest to the global minimum) of the error function (1), we use the virtual particle method [9,11,13], which is based on a well-known simulation paradigm: molecular dynamics (MD). We assume that initially X consists of M randomly distributed, mutually interacting particles defined by their locations and velocities (x_i, v_i). Every particle i represents a feature vector (or an abstract data object o_i) from the source space Ω. The particles interact with each other via semi-harmonic forces f_ij = −∇V(||D_ij − d_ij||). The interaction potential V(.) between particles i and j is a function of the difference between the current particle distance d_ij in X and the respective dissimilarity D_ij in Ω. As shown in Figure 1, we assume that the total force F_i acting on a single particle i is equal to the sum of all pairwise forces f_ij, that is,

F_i = Σ_{j=1, j≠i}^{M} f_ij

We assume additionally that the kinetic energy of the particle system is dissipated by a friction force


Figure 1. The forces acting on a particle i.


proportional to the particle velocity. The particle system evolves in time according to the Newtonian laws of motion:

dx_i/dt = v_i,    m_i dv_i/dt = F_i − λ v_i    (2)

where m_i and λ can be treated as parameters. The forces F_i acting on every particle i are computed on the basis of the current particle positions x_i. The new particle positions can then be calculated by using, for example, Verlet's leap-frog scheme (e.g., [18]), in which the discrete formulation of the equation of motion (2) is as follows:

v_i^(n+1/2) = [(1 − bΔt)/(1 + bΔt)] v_i^(n−1/2) + [aΔt/(1 + bΔt)] F_i^(n)

x_i^(n+1) = x_i^(n) + v_i^(n+1/2) Δt

a = γ/m,    b = λ/(2m),    bΔt < 1    (3)

where Δt is the time-step and (n) is the time-step number. After a certain number of iterations, equilibrium is reached and the particle system freezes. Then, the sum of potentials:

V(X) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} V(||D_ij^k − d_ij^k||)    (4)

is equal to the total energy of the particle system in equilibrium. The final set of particle positions x_i is the solution of the minimization problem (4). Equation (4) is a more general form of the 'stress function' (1).

In Listing 1, we present the pseudocode of the sequential (one-thread) version of the virtual particle MDS code kernel. The values of the parameters a, b, and Δt are matched empirically. We assumed that in every simulation the number of time-steps is constant and equal to 10^4. It can also be tuned automatically to a predefined mapping error V(.).

As shown in [9,11], the main advantages of the virtual particle MDS over other MDS algorithms are as follows:

1. The MDS virtual particle method can be used with an arbitrary error criterion represented by the general formula (4). This allows for better exploration of the multidimensional data topology. For example, in Figure 2, we demonstrate the results of embedding an 8-dimensional hypercube in a 2-D target space employing three different criteria (1).


Figure 2. The results of mapping the nodes of an 8-dimensional hypercube into 2-dimensional Euclidean space (ℜ^8 → ℜ^2) by using various stress functions (Equation 1): a) k = 1, m = 2, w_ij = 1; b) k = 1, m = 2, w_ij = 1/(D_ij)^2; c) k = 1/2, m = 2, w_ij = 1.


By using the Euclidean distance (k = 1/2, Figure 2c), we can obtain a more detailed view of the local topology of the feature space, whereas by using normalized and squared distances, the coarse-grained global structure of the data set can be extracted (Figure 2a,b).

2. The method efficiently explores the multimodal and multidimensional domain of the criterion function V(.), so the probability of finding the global minimum is higher. This probability can be increased by decreasing the dissipation rate, that is, at the cost of longer simulation time. This is similar to the simulated annealing heuristic [19], in which the rate of reaching the global minimum increases as cooling decelerates [15].

3. Interactive control of the size of the time-step Δt, the value of the friction factor λ, and the type of error function criterion in the course of the minimization process allows for better penetration of the stress function domain and yields a better minimum of the stress function V(.).

4. The method allows for visual data clustering [11, 12, 14, 16] and classification [20]. The particles representing feature vectors and their clusters can be removed, grouped, and stopped. The result of classifying newly added feature vectors can be observed immediately.

Listing 1. The pseudo-code of the sequential version of the virtual particle MDS algorithm.
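The body of Listing 1 did not survive extraction; the sketch below is our hedged reconstruction of its structure from the description in the text (the FOR1/FOR2 force loops followed by the FOR3 motion loop implementing scheme (3)). The identifier names and the exact force formula are assumptions, not the paper's code.

#include <algorithm>
#include <cmath>
#include <cstddef>

// One MDS time-step: the O(M^2) force loops (FOR1/FOR2) followed by the
// O(M) motion loop (FOR3) implementing the leap-frog scheme (3).
// D is the M x M dissimilarity matrix; x, v, F are M x n arrays.
void mdsTimeStep(const float* D, float* x, float* v, float* F,
                 std::size_t M, std::size_t n, float a, float b, float dt) {
    std::fill(F, F + M * n, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {               // FOR1
        for (std::size_t j = i + 1; j < M; ++j) {       // FOR2
            float d2 = 0.0f;
            for (std::size_t c = 0; c < n; ++c) {
                const float diff = x[i * n + c] - x[j * n + c];
                d2 += diff * diff;
            }
            const float dij = std::sqrt(d2) + 1e-12f;   // avoid division by 0
            const float s = (D[i * M + j] - dij) / dij; // spring-like factor
            for (std::size_t c = 0; c < n; ++c) {
                const float f = s * (x[i * n + c] - x[j * n + c]);
                F[i * n + c] += f;                      // Newton's third law:
                F[j * n + c] -= f;                      // f_ji = -f_ij
            }
        }
    }
    const float damp = (1.0f - b * dt) / (1.0f + b * dt); // scheme (3)
    const float gain = (a * dt) / (1.0f + b * dt);
    for (std::size_t i = 0; i < M * n; ++i) {           // FOR3
        v[i] = damp * v[i] + gain * F[i];
        x[i] += v[i] * dt;
    }
}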


However, despite the many advantages of the virtual particle method over standard approaches [9,11], its computational complexity is still dominated by an O(M^2) term. Consequently (as shown in our experiments), it makes interactive visualization of data sets consisting of the order of 10^4 feature vectors impossible on today's midrange computers. In the course of this paper, however, we demonstrate that parallelization of MDS by using modern multiprocessor and multithread computer platforms, such as multicore CPUs, GPU boards, and MPI clusters, can considerably increase its efficiency and break this barrier.

As shown in Listing 1, the two nested loops FOR1 and FOR2 in the forces computation module determine the quadratic complexity of a single time-step. The remaining part, representing particle motion, has only linear complexity (loop FOR3), and its influence on the computational time is negligible for a large number of feature vectors. That is why, in the following sections, we focus our attention on parallelization of the forces calculation procedure.

3. PARALLEL MULTIDIMENSIONAL SCALING ALGORITHMS

3.1. Parallel multidimensional scaling algorithm on multicore CPUs

There are many well-known methods for parallelization of 'round robin' MD (i.e., every particle interacts with all the others) developed for shared memory multicore processors and vector processors (e.g., [21–23]). However, unlike in MD codes, the particle-particle interactions used in the MDS algorithm also depend on the distance array D, which has to be distributed among computational nodes (Figure 3a). This seemingly small difference is in fact a crucial point that deteriorates the computational efficiency of this approach. The large size of D, owing to O(M^2) memory complexity, is a serious problem for efficient use of cache memory, especially on GPU architectures [24].

According to Listing 1 and Figure 3b, the partial forces are calculated row-by-row. Each array component in Figure 3 stands for the force F_ij and the distances D_ij and d_ij in the source and target spaces, respectively. Because D is symmetric, only the lower part of the array is processed. To parallelize the code, all particle pairs are divided into sets of equal size, and each set is processed by a single thread. In Figure 3a, we show a diagram demonstrating how the computations are distributed among the threads (a sketch follows Figure 3). To avoid cache conflicts, each thread works on its own copy of the Forces array (Listing 1). These arrays are summed up before the calculation of velocities.

Figure 3. a) Distribution of computations among four threads. b) Passing the distance matrix row-by-row. c) Passing the distance matrix block-by-block.
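A minimal OpenMP sketch of this scheme, with our own identifier names: each thread accumulates pairwise forces into a private copy of the Forces array, and the copies are reduced afterwards. The paper partitions the pairs into equal-size sets; dynamic scheduling over rows is used here as a simple approximation, since row i of the lower triangle contains i pairs.

#include <omp.h>
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Compute all pairwise forces in parallel; each thread works on a private
// force buffer (cache-conflict free), and the buffers are summed at the end.
void computeForcesParallel(const float* D, const float* x, float* Forces,
                           std::size_t M, std::size_t n) {
    const int T = omp_get_max_threads();
    std::vector<std::vector<float>> local(T, std::vector<float>(M * n, 0.0f));
    #pragma omp parallel
    {
        std::vector<float>& F = local[omp_get_thread_num()];
        #pragma omp for schedule(dynamic, 16)   // balances the triangular work
        for (long long i = 1; i < (long long)M; ++i) {
            for (long long j = 0; j < i; ++j) {
                float d2 = 0.0f;
                for (std::size_t c = 0; c < n; ++c) {
                    const float diff = x[i * n + c] - x[j * n + c];
                    d2 += diff * diff;
                }
                const float dij = std::sqrt(d2) + 1e-12f;
                const float s = (D[i * M + j] - dij) / dij;
                for (std::size_t c = 0; c < n; ++c) {
                    const float f = s * (x[i * n + c] - x[j * n + c]);
                    F[i * n + c] += f;
                    F[j * n + c] -= f;
                }
            }
        }
    }
    std::fill(Forces, Forces + M * n, 0.0f);    // sum the per-thread copies
    for (int t = 0; t < T; ++t)
        for (std::size_t e = 0; e < M * n; ++e)
            Forces[e] += local[t][e];
}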

Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2014; 26:662–682DOI: 10.1002/cpe

Page 6: Visual exploration of data by using multidimensional ...mineralscloud.com/reports/allpublications/PapersSince2013/cpe302… · Visual exploration of data by using multidimensional

VISUAL EXPLORATION OF DATA BY USING MULTIDIMENSIONALSCALING ON MULTICORE CPU, GPU, AND MPI CLUSTER 667

However, for larger datasets, this algorithm is inefficient. This is because of the frequent exchange of fragments of the Positions and Forces arrays between the cache and main memory.

To minimize the main memory access time, D should be divided into square blocks of size b, as shown in Figure 3c, and traversed on a block-by-block basis. The blocks are distributed among threads in a manner analogous to the particle pairs in Figure 3a.


In Figure 4, we compare the timings for these two different implementations (i.e., for the row-by-row and block-by-block concepts, respectively). The speed of the row-by-row implementation drops significantly when the size of the processed dataset reaches M = 2 × 10^4 feature vectors. The tests were performed on a single HP SL390 server equipped with two Intel Xeon X5670 processors (12 threads). The algorithm was implemented with the use of OpenMP directives.

The pseudocode of this multithread CPU algorithm is presented in Listing 2. The variables ib and jb contain the coordinates of the processed block. The matrix of distances D is stored in the array denoted as DistancesBlocks; its elements are organized according to the order shown in Figure 3c. To make this algorithm efficient, the block size b should be matched to the type of processor, the number of threads, and the size of the data processed. If b is too small, more main memory accesses are expected; if b is too large, the cache overflows and load balancing is obstructed. The value of b is selected experimentally: the optimal block size is chosen at the beginning of a simulation by performing tens of iterations (time-steps) for b being the powers of two from 64 to 2048 (a sketch of this search follows).
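A hedged sketch of this empirical block-size search; runTimeStep is a placeholder callable that executes one blocked MDS time-step (the paper's Listing 2 kernel) with the given block size b.

#include <chrono>
#include <cstddef>
#include <functional>

// Try each candidate block size for a few dozen time-steps and keep the
// fastest one; candidate values follow the paper (powers of two, 64..2048).
std::size_t tuneBlockSize(const std::function<void(std::size_t)>& runTimeStep) {
    std::size_t best = 64;
    double bestSec = 1e300;
    for (std::size_t b = 64; b <= 2048; b *= 2) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int step = 0; step < 30; ++step)   // tens of iterations
            runTimeStep(b);
        const std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        if (dt.count() < bestSec) { bestSec = dt.count(); best = b; }
    }
    return best;
}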

3.2. Parallel multidimensional scaling algorithm on GPU

As shown in numerous publications [https://developer.nvidia.com/category/zone/cuda-zone], by using a GPU instead of a CPU, one can considerably accelerate the computations. In this section, we present the MDS algorithm developed in the NVIDIA CUDA environment and implemented on GPU boards.

It is well known that codes executed on GPU boards should minimize the number of operations involving global memory because of its long access time. However, read-write operations on the global memory do not block the multiprocessor: at the moment a warp is blocked by read-write operations, the multiprocessor can execute other warps. This allows the delay caused by global memory usage to be overlapped with the stream of calculations. The maximum number of warps assigned to a single multiprocessor depends on both the compute capability of the GPU board and the code structure.

Our MDS code consists of both CPU and GPU modules. The former represents the code framework, whereas the latter is the simulation kernel. The CPU module reads the data, such as particle positions and velocities, into its main memory and is responsible for calculating the dissimilarity array D. It also takes part in transferring data to the global memory of the GPU board. However, the most intensive calculations are executed on the GPU board.

We make the following assumptions:

1. The whole dissimilarity array D is transferred from main memory to the GPU global memory and resides there until the end of the computations. This is because the transfer from main memory to the global GPU memory is very slow.

2. Unlike in the CPU case, the whole of D is used for computations (M^2 components instead of M(M − 1)/2). This allows for considerable simplification of the algorithm and avoids summation along the matrix columns, which would be very inefficient on a GPU.

Figure 4. The average timings for one multidimensional scaling time-step for two approaches to the implementation of the simulation algorithm. The red plot corresponds to the row-by-row method (Figure 3b), whereas the black one corresponds to the block-by-block algorithm (Figure 3c).


3. The buffering arrays, which keep the positions and velocities of the particles, are divided into blocks containing 32 particles each. If a block is not full, additional artificial particles are generated. Consequently, the number of rows and columns in D is increased. All artificial distances are set to zero. This allows the algorithm to be described at the level of warp operations. Each warp executes in parallel the instructions for all 32 particles.

The developed CUDA code is contained in a single kernel, which is executed once per time-step. At the beginning of the computations, the following buffers are created in the global (GPU) memory:

1. Buffer gDistances (D), which contains the dissimilarity matrix.
2. Buffer gPositionsA (x^n), which contains the current particle positions.
3. Buffer gPositionsB (x^(n+1)), where the new particle positions are stored.
4. Buffer gVelocities (v^n), which contains the current particle velocities, subsequently replaced by the newly computed ones.

Listing 3. The pseudo-code representing the GPU version of the kernel of the MDS algorithm.


In the proposed approach, the matrix D is divided between blocks along its rows. Each block processes 32 rows. Consequently, each warp calculates 32 partial forces from a 32-element column at the same time. If there are multiple warps per block, each of them processes a separate column. The pseudocode representing the version of the MDS algorithm intended for calculations on a GPU board is presented in Listing 3. It demonstrates the series of instructions executed by one block. The variables that begin with 's' and 'g' are stored in the shared and global memories, respectively. Each block calculates the forces F_i acting on the 32 particles corresponding to the 32 processed rows of the matrix D. Then the velocities v_i (stored in the gVelocities buffer) of all particles are computed and their positions x_i (stored in the gPositionsB buffer) are updated. At the end of the iteration, the arrays gPositionsB and gPositionsA are swapped.
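Since the body of Listing 3 was also lost in extraction, the following is a deliberately simplified CUDA sketch of the kernel structure just described: one warp per block, n = 2 hard-coded, and no shared-memory staging of positions (which the real kernel uses). Only the buffer names follow the list above; everything else is our assumption.

// Launch as mdsKernel<<<(M + 31) / 32, 32>>>(...): thread t of block B
// handles particle (row of D) i = 32 * B + t. damp and gain are the
// leap-frog coefficients of (3), precomputed on the host.
__global__ void mdsKernel(const float* gDistances, const float* gPositionsA,
                          float* gPositionsB, float* gVelocities,
                          int M, float damp, float gain, float dt) {
    int i = blockIdx.x * 32 + threadIdx.x;
    if (i >= M) return;
    float xi = gPositionsA[2 * i], yi = gPositionsA[2 * i + 1];
    float Fx = 0.0f, Fy = 0.0f;
    for (int j = 0; j < M; ++j) {               // whole row of D (M^2 variant)
        float dx = xi - gPositionsA[2 * j];
        float dy = yi - gPositionsA[2 * j + 1];
        float d = sqrtf(dx * dx + dy * dy) + 1e-12f;
        float s = (gDistances[(size_t)i * M + j] - d) / d;
        Fx += s * dx;                           // contribution vanishes for
        Fy += s * dy;                           // j == i since dx = dy = 0
    }
    float vx = damp * gVelocities[2 * i] + gain * Fx;     // leap-frog (3)
    float vy = damp * gVelocities[2 * i + 1] + gain * Fy;
    gVelocities[2 * i] = vx;
    gVelocities[2 * i + 1] = vy;
    gPositionsB[2 * i] = xi + vx * dt;          // A and B swapped on the host
    gPositionsB[2 * i + 1] = yi + vy * dt;
}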

The proposed algorithm is designed to efficiently exploit miscellaneous NVIDIA GPU cards. To obtain optimal results, the following two parameters must be adjusted:

1. The number of registers per thread, RT. This parameter is set during compilation. A greater value allows for creating faster code, whereas a smaller one allows for increasing the number of warps resident on one multiprocessor.

2. The number of threads per block, TB. This parameter is set at runtime. A greater value means more warps within one block, whereas a smaller one allows for increasing the number of blocks resident on one multiprocessor.

Let us assume that RT is fixed. The only reason to decrease TB is to increase the number of blocks that can reside on one multiprocessor at the same time. The optimal value of TB depends on both the size of the dataset and the type of GPU used, and it can be found empirically. In particular, for a given number of blocks per multiprocessor and a fixed number of registers per thread RT, the parameter TB should always be set as high as possible. For GPUs with lower compute capability (1.x or 2.x), there are at most eight configurations to check, whereas for Kepler devices (3.x), at most 16 configurations should be checked. Because the kernel binaries can be reloaded without erasing the global memory of the GPU card, at the beginning of a simulation two or three CUDA binaries with different numbers of registers per thread RT can be tested. To find the optimal RT, we carried out a series of tests on miscellaneous GPU boards. The results are summarized in Table I.
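A hedged sketch of such a runtime search over TB, using CUDA events for timing; the candidate values, iteration count, and the commented-out kernel launch are placeholders rather than the paper's code.

#include <cuda_runtime.h>

// Time a few iterations for each threads-per-block candidate and keep the
// fastest; mdsKernel and its arguments are assumed to exist elsewhere.
int tuneThreadsPerBlock(int M) {
    int bestTB = 32;
    float bestMs = 1e30f;
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    for (int TB = 32; TB <= 1024; TB *= 2) {
        cudaEventRecord(beg);
        // for (int it = 0; it < 10; ++it)
        //     mdsKernel<<<(M + TB - 1) / TB, TB>>>(/* buffers, M, ... */);
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, beg, end);
        if (ms < bestMs) { bestMs = ms; bestTB = TB; }
    }
    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    return bestTB;
}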

The GPU computational time includes the following: the time taken by read/write operations on global memory, the time spent on instruction processing, and the latency time. Efficient GPU algorithms allow read/write operations on global memory to be overlapped with computations performed by other warps. They fall into two categories:

1. Algorithms bounded by the sum of the GPU times needed to read data from and write data to global memory.

2. Algorithms bounded by the time spent on instruction processing.

The execution time of poor algorithms is of the order of the sum of these two components or greater. To develop efficient algorithms, we need to know the approximate time spent on global memory operations and on calculations. The former can be estimated on the basis of the GPU parameters, hence:

theoretical time = (amount of data read + amount of data written) / (bus width × memory clock rate)

For GPU boards equipped with DDR RAM, this theoretical time should be divided by 2.
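As a rough worked illustration of this estimate (our own numbers, not from the paper; it assumes the dominant global-memory traffic of one iteration is a single pass over the float dissimilarity matrix), take the Tesla M2090 from Table III(b), with a 384-bit (48-byte) bus and a 1.85 GHz DDR memory clock, and M = 3.1 × 10^4:

theoretical time ≈ ((3.1 × 10^4)^2 × 4 B) / (48 B × 1.85 × 10^9 Hz × 2) ≈ 3.84 GB / 177.6 GB/s ≈ 22 ms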

Table I. The optimal numbers of registers per thread.

Compute capability    FastMath arithmetic    IEEE-754 arithmetic
1.0, 1.1              16, 20                 —
1.2, 1.3              16, 20, 21             —
2.0                   36, 42                 42


The estimation of the second execution time component is more problematic. The number of clock cycles depends strongly on the generation of the GPU architecture and on the type and statistics of the instructions used. Therefore, the minimal number of clock cycles required to execute a given sequence of instructions must be measured. To this end, we prepared two special versions of the kernel, denoted the 'instructions test' kernel and the 'memory test' kernel. The first consists of pure calculations and does not contain any operation on global memory; the second consists only of read/write operations on global memory. These kernels can be used to estimate the minimal time needed by pure calculations and by global memory operations on a given GPU device. This kind of performance analysis is described in detail in [25].
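For illustration, a hedged sketch of what a 'memory test' kernel can look like (our own code, not the paper's): it reproduces the global-memory access pattern of the full kernel while doing almost no arithmetic, so its timing estimates the memory component.

// Stream the same rows of the dissimilarity matrix as the full kernel,
// with the arithmetic stripped out; the write keeps the loads from being
// optimized away.
__global__ void memoryTestKernel(const float* gDistances, float* gOut, int M) {
    int i = blockIdx.x * 32 + threadIdx.x;      // same row mapping as before
    if (i >= M) return;
    float acc = 0.0f;
    for (int j = 0; j < M; ++j)
        acc += gDistances[(size_t)i * M + j];
    gOut[i] = acc;
}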

The tests of kernel efficiency were performed on a dataset consisting of 3.1 × 10^4 vectors with the use of a Tesla M2090 in the FastMath execution mode. The measured time of a single iteration for all the kernels and the theoretical time of data transfer between the multiprocessors and global memory are collected in Figure 5. The time measured for the 'memory test' kernel is 10% greater than the theoretical one because of start-up latencies, whereas the time measured for the 'instructions test' kernel is significantly greater than the time consumed by memory operations. Thus, in this case, the final execution time (i.e., that measured for the original kernel) is bounded by the time spent on instruction processing. As shown in Figure 5, it is only 10% greater than the time obtained for the instructions test kernel. This means that the majority of the latencies and the time spent on global memory operations can be overlapped with calculations.

The size of the global memory, about 1–6 GB on a standard GPU board, imposes an upper limit on the size of the data. Because the memory complexity of the MDS code is O(M^2), one of the main obstacles to applying MDS to large datasets is the size of the distance matrix D. There are several algorithms that circumvent this problem by in-place calculation of the elements of the distance matrix. However, this kind of solution is very rigid, because the algorithm used for calculating the input distances must be hard-coded in the MDS implementation. Moreover, the distance matrix must be recalculated every iteration, so the method used to compute the distances must be very fast. Consequently, for very sophisticated and time-consuming distances calculated for abstract non-metric data, such as shapes, large molecules, or text corpora, in-place distance recalculation defeats the purpose. In our approach, the distance matrix D is treated as input data. This makes the MDS code more general and allows it to be employed for visualization and feature extraction for various data types.

The problem with the size of the distance matrix can be solved by redistributing the data among several GPUs or computers by using, for example, the parallel algorithm presented in the next section.

3.3. Parallel multidimensional scaling algorithm for MPI cluster

The MDS algorithm implemented on a multiprocessor cluster uses two-level parallelism. The first level is connected with the internal architecture of the cluster nodes. A cluster node can be just a homogeneous shared memory CPU-based multiprocessor or a heterogeneous node empowered with GPU boards.

Figure 5. The comparison of timings obtained by the original kernel (read/write and arithmetic operations), the two test kernels (only arithmetic, only read/write operations), and the theoretical estimation.


In the former, calculations within a single MPI process are parallelized with the use of OpenMP, whereas in the latter, the CUDA environment is used.

The higher, coarse-grained parallelization level corresponds to the topology of the cluster nodes. The dissimilarity matrix D and the calculations can be distributed among the nodes with the use of an adapted under-triangle force block decomposition (UTFBD) algorithm [26], which is an optimized version of the Taylor algorithm [27]. These algorithms were originally developed for parallel MD simulations and were adapted by us to the requirements of our MDS method. Each MPI process corresponds to one system process executed on a dedicated cluster node. In the following, we present the details of our parallel algorithm.

We define a matrix U_{S,S} = [u_rc] of MPI processes. Only its lower triangle and diagonal are used. As shown in Figure 6a, the processes are ordered row-wise: for every element u_rc of the matrix U_{S,S} for which c ≤ r, exactly one process is assigned. This structure defines a constant number of MPI processes, S(S + 1)/2, where S is the matrix size, that can be run during the computations. The particle ensemble is divided into S subsets s = 1,...,S. The buffers with the position vectors x and velocity vectors v corresponding to particles from subset s are denoted as Positions_s and Velocities_s, respectively. The buffers Positions_s and Velocities_s are assigned to all the MPI processes denoted as u_sc and u_rs. The MPI processes u_rc from the main diagonal (r = c) maintain only one subset of particles, whereas the remaining MPI processes (r > c) have two subsets assigned. As shown in Figure 6b, in the course of the simulation, every process u_rc from the diagonal (r = c) uses and updates all distances (both in the 'target' and 'source' spaces) between particles from the assigned subset. The remaining processes u_rc (r > c) keep track of all distances between particles from subset r and particles from subset c. In every time-step, the current inter-particle distances from the d array are compared with the values from the input dissimilarity matrix D. Consequently, the total forces acting on each particle are computed. Then, the particles are moved according to the Newtonian dynamics.
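For illustration, a small self-contained sketch (our own helper, not the paper's code) of this row-wise ordering: mapping an MPI rank to its cell (r, c) in the lower triangle of U.

#include <cstdio>
#include <utility>

// Map an MPI rank to its cell (r, c) in the lower triangle of the process
// matrix U (row-wise ordering, c <= r); S subsets give S*(S+1)/2 ranks.
std::pair<int, int> rankToCell(int rank) {
    int r = 0;
    while ((r + 1) * (r + 2) / 2 <= rank) ++r;  // row containing this rank
    int c = rank - r * (r + 1) / 2;
    return {r, c};
}

int main() {
    for (int p = 0; p < 10; ++p) {              // S = 4 gives 10 processes
        auto [r, c] = rankToCell(p);
        std::printf("rank %d -> u(%d,%d)%s\n", p, r, c,
                    r == c ? " [diagonal]" : "");
    }
}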

Because the particles' positions are redistributed between processes, each process can compute only partial forces. To compute the total forces acting on the particles from a subset s, the corresponding partial forces from row s and column s must be added. They are gathered by the process u_ss (Figure 6c,d). The computation stages of a single time-step are summarized in Table II.

Figure 6. Diagrams demonstrating the distribution of computations onto cluster nodes.

To exchange data between processes, only two MPI functions are needed: MPI broadcast and MPI reduce. Unfortunately, both of them are blocking procedures, which creates a serious problem of rapidly decreasing computational efficiency. During every time-step, the off-diagonal processes of the matrix U have to execute each function twice: once to synchronize the first subset along the row, and once to synchronize the second subset along the column. However, these two operations can be executed in parallel, because they work on separate buffers. To bypass the lack of non-blocking versions of these functions in the MPI standard, we employ threads: every off-diagonal process uses two threads during data synchronization, one thread synchronizing data along a row and the other along a column (a sketch follows).
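A hedged sketch of this two-thread synchronization (our own code and names): an off-diagonal process broadcasts its row subset and its column subset concurrently. rowComm and colComm are assumed per-row and per-column communicators, MPI must be initialized with MPI_THREAD_MULTIPLE, and the force reduction would use MPI_Reduce in the same way.

#include <mpi.h>
#include <thread>
#include <vector>

// Synchronize the two position subsets of an off-diagonal process u_rc:
// one thread broadcasts along the row, the main thread along the column.
void syncPositions(std::vector<float>& rowBuf, std::vector<float>& colBuf,
                   MPI_Comm rowComm, MPI_Comm colComm,
                   int rowRoot, int colRoot) {
    std::thread rowThread([&] {
        MPI_Bcast(rowBuf.data(), (int)rowBuf.size(), MPI_FLOAT,
                  rowRoot, rowComm);
    });
    MPI_Bcast(colBuf.data(), (int)colBuf.size(), MPI_FLOAT,
              colRoot, colComm);
    rowThread.join();
}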

4. TESTS AND RESULTS

In all the tests reported below, we compare the average execution times of one time-step of the particle-based MDS parallel algorithms whose pseudocodes are shown in Listings 2 and 3. The averages were calculated over runs executing at least 200 time-steps. The tests were carried out on optimized C++ codes with MPI functions, OpenMP directives, or NVIDIA CUDA procedures, according to the computational environment used. All operations on real numbers were executed using float variables. The efficiency of the sequential code was tested on the several computer architectures described in Table III. All tests were performed using a 64-bit Linux system. The codes were compiled using the g++ compiler (version 4) with the -O3 option; only on the 'Baribal' supercomputer (Table III) was the code compiled using icpc, the Intel compiler.

4.1. Tested computer systems

In Table III, we provide a brief description of the computer systems used in our tests. The list contains an older SMP system (SGI Altix 3700), a strong computational cluster (HP SL390) empowered with GPU boards, and a mid-level laptop equipped with an Intel Core i5 processor and an NVIDIA GeForce GT 330M GPU. We also used a separate workstation to test other GPU boards.

4.2. Data test beds

For testing, we used a few data sets of very different character [28, 29]. However, we have noticed that the data topology does not noticeably influence the timings of a single iteration, albeit it has a considerable effect on the quality of the final mapping and on the number of iterations necessary to obtain the optimal value of the 'stress' function (1). We focus here on the efficiency of a single iteration and on parallel implementation issues rather than on the quality of mapping. The former depends on hardware and software issues, whereas the latter depends on a proper choice of heuristics and its parameters. Therefore, to make the results consistent, we used in the tests only one artificially generated dataset, described below.

Table II. The functions of the processes u from the matrix U_{S,S} during a single time-step.

1. Processes from the diagonal (r = c): update the vectors from Velocities_r and Positions_r according to formulas (3). Processes lying off the diagonal (r > c): do nothing.
2. Diagonal: broadcast the vectors from Positions_r along the row and column (Figure 6a). Off-diagonal: receive two buffers of vectors, Positions_r from process u_rr and Positions_c from process u_cc (Figure 6a).
3. Diagonal: compute the partial forces acting on particles from subset r (Figure 6b). Off-diagonal: compute the partial forces based on the distances between particles from subset r and particles from subset c (Figure 6b).
4. Diagonal: compute the total forces acting on particles from subset r by gathering and adding the proper partial forces from the processes lying in the same row and column (Figure 6c,d). Off-diagonal: send the partial forces acting on particles from subset r to process u_rr and those acting on particles from subset c to process u_cc (Figure 6c,d).


Page 13: Visual exploration of data by using multidimensional ...mineralscloud.com/reports/allpublications/PapersSince2013/cpe302… · Visual exploration of data by using multidimensional

Table

III.The

computersystem

sused

forperformance

analysis(a)CPU

system

sand(b)GPU

boards.

(a)

Processor

name

Com

puter

Num

berof

processors

Num

berof

cores

perprocessor

Clock

speed(G

Hz)

Cache

size

(MB)

Mem

ory

size

(GB)

IntelXeonX5670

ServerHPSL390,

onenode

of‘Zeus’

GPGPU

cluster,ACK

CyfronetAGH

26

2.93

1270

IntelCorei5M

430

LaptopSam

sung

R780

12

2.27

34

IntelItanium

2Madison

(IA-64)

SGIAltix3700,supercomputer‘Baribal’,

ACK

CyfronetAGH

256(20was

used)

11.5

6or

4512

(b)

GPU

board

Com

pute

capability

Num

berof

multip

rocessors

Num

berof

CUDA

coresper

multip

rocessor

Multip

rocessors

clockrate

(GHz)

Global

mem

ory

size

(GB)

Global

mem

orybus

width

(bits)

Global

mem

oryclock

rate

(GHz)

GeForce

8800

Ultra

1.0

168

1.51

0.75

384

1.15

GeForce

GT330M

1.2

68

1.27

1.00

128

0.79

Tesla

C1060

1.3

308

1.30

4.00

512

0.80

GeForce

GTX

480

2.0

1532

1.40

1.50

384

1.85

Tesla

M2090

2.0

1632

1.30

5.25

384

1.85

674 P. PAWLICZEK, W. DZWINEL AND D. A. YUEN

Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2014; 26:662–682DOI: 10.1002/cpe


The dataset consists of 40-dimensional vectors (i.e., N = dim(Ω) = 40) representing two classes of the same size: class 1 and class 2. Vector coordinates 1–20 were generated randomly from the interval [−3/2, 3/2]. For vectors belonging to class 1, coordinates 21–40 are random numbers from the interval [−3/2, 1/2], whereas those of class 2 were generated in the interval [−1/2, 3/2]. We use several datasets of various sizes, H1, H2, H3, and so on, consisting of from 1024 up to 40 × 1024 (~4 × 10^4) vectors. The final result of H4 data visualization (Ω → Σ, where dim(Σ) = n = 3) using our MDS algorithm is demonstrated in Figure 7.
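A sketch of how such a dataset can be generated (the helper name and seed are our own; the paper does not print its generator):

#include <random>
#include <vector>

// Generate M 40-dimensional vectors in two equal classes: coordinates
// 1-20 uniform on [-1.5, 1.5]; coordinates 21-40 uniform on [-1.5, 0.5]
// for class 1 and on [-0.5, 1.5] for class 2.
std::vector<std::vector<float>> makeDataset(std::size_t M, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> common(-1.5f, 1.5f);
    std::uniform_real_distribution<float> cls1(-1.5f, 0.5f);
    std::uniform_real_distribution<float> cls2(-0.5f, 1.5f);
    std::vector<std::vector<float>> data(M, std::vector<float>(40));
    for (std::size_t i = 0; i < M; ++i) {
        const bool firstClass = i < M / 2;       // two classes of equal size
        for (int c = 0; c < 20; ++c) data[i][c] = common(rng);
        for (int c = 20; c < 40; ++c)
            data[i][c] = firstClass ? cls1(rng) : cls2(rng);
    }
    return data;
}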

We additionally assume that both dissimilarity matrices, D and d, are Euclidean. The minimized stress function is represented by formula (1) with k = 1, m = 2, and w_ij = 1. The computational efficiency, measured as the average execution time of one time-step, is very similar for other choices of stress function type.

4.3. Results of tests—multicore CPU

The pseudo-code of the algorithm developed for the multithread implementation of MDS is presented in Listing 2. The code was tested on all the platforms from Table III(a). The maximal number of threads did not exceed the total accessible number of cores for the tested configurations. The only exception was the Intel Core i5, which, because of Hyper-Threading technology, allows for concurrent execution of two threads on a single core. The calculations on the SGI Altix 3700 (Baribal) were performed by using 20 processors, whereas the tests on the Xeon X5670 processors were carried out on one HP SL390 server, that is, with 12 threads.

In Figure 8, we display the average execution times of one time-step of the multithread MDS code for the tested platforms and the H1–H30 testing datasets. The plot in Figure 8 demonstrates that one node of the HP SL390 cluster consisting of two Intel Xeon X5670 CPUs (12 cores) remains unbeatable. It is 2–3 times faster than 20 processors of the obsolete SGI Altix 3700 cluster.

The high efficiency of the parallel implementation of our algorithm from Listing 2 is confirmed by the nearly linear speedups collected in Table IV. They were obtained for the H30 data file. This means that the efficiency achieved is higher than 90%. The only exception occurred when four threads were executed on a single Intel Core i5 processor: the Hyper-Threading technology was not able to substitute for two additional cores, so the speedup is relatively low.

4.4. Results of tests—GPU

All of the GPU boards from Table III(b) were tested assuming simplified floating-point arithmetic (denoted as the FastMath version). Additionally, the GPU boards with compute capability equal to 2.0 or greater were tested with the IEEE-754 standard of arithmetic (the IEEE version).

Figure 7. The H4 dataset visualized by multidimensional scaling employing the virtual particle method.


Figure 8. The averaged timings of the multithread version of the MDS algorithm for a single time-step.

Table IV. Speedups obtained for various computer architectures and numbers of threads.

CPU type                    2 threads   4 threads   12 threads   20 threads
Intel Itanium 2 (Baribal)   2.0         4.0         11.7         18.6
Intel Core i5 M 430         2.0         2.2         —            —
Intel Xeon X5670            2.0         3.9         11.2         —

Figure 9. Timings obtained for GPU boards versus two Intel Xeon X5670 CPUs for two types of arithmetic: a) FastMath and b) IEEE.

In Figure 9, we present the averaged timings of a single iteration of our MDS algorithm from Listing 3 for a set of GPU boards. The timings obtained on two Intel Xeon X5670 CPUs by the multithread version of the MDS algorithm described in the previous section are shown for comparison.

In Table V, we collect the speedups for the tested GPU boards for both FastMath and IEEE arithmetic. They were calculated versus both the single-thread code version and the full 12-thread HP SL390 server (two Intel Xeon X5670 CPUs). The measurements were performed for the H13 dataset.

All the tested GPU boards were faster than the cluster node when the FastMath arithmetic was used. The timings obtained by the optimal OpenMP code on the cluster node were comparable with the CUDA code executed on the weak GeForce GT 330M GPU board from a medium-class laptop. The PCs with the old (but quite strong) GPU boards GeForce 8800 Ultra and Tesla C1060 are about four and six times faster, respectively, whereas the Tesla M2090 and GeForce GTX 480 beat the cluster node performance 13 times over in FastMath arithmetic mode (Table V).


Table V. Speedups obtained for various NVIDIA GPU boards, measured against a single node with two Intel Xeon X5670 processors.

                     FastMath operations               IEEE floating-point operations
GPU type             vs. two processors  vs. single core  vs. two processors  vs. single core
                     (12 threads)        (1 thread)       (12 threads)        (1 thread)
GeForce 8800 Ultra   3.8                 42.8             —                   —
GeForce GT 330M      1.2                 13.4             —                   —
Tesla C1060          6.4                 71.2             —                   —
GeForce GTX 480      13.4                150.2            5.7                 63.8
Tesla M2090          13.3                148.9            5.5                 61.8


The advantage of the GPU processors shrinks if full floating-point arithmetic according to the IEEE 754 standard is necessary. Only the GPUs with compute capability equal to or greater than 2.0 meet this requirement, and the performance drops more than two times in that case (Figure 9b and Table V). Although the advantage of the GPU implementations over the CPU cluster node is evident, two additional factors that diminish the GPU advantage over the CPU should be taken into account:

1. Coding in CUDA is a few times slower and much more sophisticated than implementing the OpenMP standard.

2. The global memory of GPU boards is significantly smaller than the main memory of the tested CPU boards.

With respect to MDS and interactive visualization of large datasets, the second aspect is especially painful. The idea of keeping only a part of the distance array in the GPU global memory is extremely inefficient because of the small throughput between the main and global memories (16 GB/s for PCI-Express x16 2.0).

As mentioned in Section 3.2, the developed algorithm was designed to efficiently utilize miscellaneous NVIDIA GPU boards. To check the scalability of our algorithm, we compared the results obtained by the tested GPU boards for FastMath arithmetic on the H13 dataset. Since the kernel execution time is bounded by instruction processing, the overall speed of the calculations should be proportional to the computational power of the GPU used. The speed of calculations is represented as the number of iterations per second, whereas the computational power of the GPU boards is represented by the product of the overall number of CUDA cores and the clock rate of the multiprocessors. The result is presented in Figure 10. It shows that the tested implementation has very good scalability.

Figure 10. The comparison of the computational power of selected GPUs and the calculation speed obtained.


4.5. Results of tests—MPI cluster

As mentioned in the Introduction, interactive visualization by using MDS mapping of M > 10^4 feature vectors is beyond the computational ability of a single processor board. The memory shortage problem becomes more obvious if the running operating system is 32-bit, which can handle at most 4 GB of virtual memory per process. Therefore, we can utilize the distributed resources of a computer cluster to handle larger data. For large data processing, we have developed a parallel MPI version of MDS, which is described in Section 3.3.

For testing purposes, we used two MDS code versions exploiting two-level parallelism on a midrange MPI cluster (HP SL390). At the node level, we implemented the MDS algorithm written for the CPU with OpenMP directives and the GPU version of MDS with the CUDA programming interface. The second-level parallelism exploits the cluster node topology and is realized by the MPI-based algorithm described in Section 3.3. In the tests, we used the MPICH2 environment [30], which is consistent with the MPI-2 standard.

The CPU and GPU versions were tested using the H40 and H24 datasets, respectively. The different sizes of the testing datasets are due to the limitations imposed by the size of the GPU global memory. The computations were performed using float arithmetic. The speedups for these two parallel versions of MDS were computed against the timings obtained for the multithread MDS developed for a multicore CPU on one cluster node (Listing 2) and for one Tesla M2090, respectively.

As shown in Figure 11, the parallel efficiency is about 40% for both versions: on 10 nodes, the speedup is around four in both cases. However, in comparison with the CPU version, the GPU implementation gives a slightly lower speedup. This is because of the smaller size of the data processed and the faster execution time on a single GPU board. In that case, the serial component in Amdahl's law, that is, the communication between the GPU boards, is greater than for the slower CPU version.

5. RELATED WORK

Because data mining of large data sets has become one of the hottest topics in computer science, MDS has recently attracted much attention as a robust visualization tool for data exploration [31, 32]. Given MDS limitations such as quadratic memory and computational complexity, plenty of approaches have been developed concerning both methodological and implementation issues. The most recent reviews can be found in [31, 32]. Among the many approaches to MDS, the classical concept dominates [2, 5, 6, 33]. It employs the dissimilarity matrix between objects as the most reliable representation of the

Figure 11. The speedups for the HP SL390 cluster employing two-level parallelism: a) MPI interface and OpenMP on CPU nodes (2x Xeon X5670) and b) MPI interface and CUDA on GPU nodes (Tesla M2090).


multidimensional data topology. The major factors that differentiate the MDS methods based on the dissimilarity matrix are as follows:

1. The definition of dissimilarity and the metrics in the context of non-metric and metric spaces, respectively [1–6].
2. The usage of a partial dissimilarity matrix or its approximation (such as in [11, 25, 34–36]).
3. The type of minimized cost function (stress function) [2, 4–6, 33].
4. The choice of minimization procedure (e.g., in [9, 31, 32, 37–39]).
5. The implementation issues (e.g., [25, 31, 32]).

One of the first papers addressing the implementation of the MDS method on a GPU was written by Reina and Ertl in 2005 [40]. The paper presents a GPU version of FastMap [41], a quite simple and very fast visualization method. Its GPU version was implemented with the use of the OpenGL library. Execution times obtained on a GeForce 6800 GT were about 40 times lower than those obtained by a CPU implementation run on a single Pentium 4 2.4 GHz processor.

Another GPU implementation of the MDS method can be found in [42]. This paper describes an implementation of a high-throughput MDS algorithm in the CUDA environment. The high-throughput MDS algorithm is based on maximization of the Pearson correlation between the original and target distance matrices and was presented in the earlier papers [43] and [44]. The authors reported that on an NVIDIA Tesla S870 GPU rack, they gained a speedup in the interval 50–60 measured against a Matlab implementation run in multithread mode on a 16-core server equipped with 3 GHz AMD Opteron CPUs. The NVIDIA Tesla S870 GPU rack consists of four Tesla C870 processors; a single Tesla C870 processor is comparable with a GeForce 8800 Ultra graphics card.

According to the review paper [32], the most efficient implementation of MDS, allowing for visualization of 2 × 10^5 or more feature vectors, is the GPU implementation of the Glimmer algorithm described in [40]. It integrates two other approaches: Chalmers's algorithm [30] and multigrid MDS [40]. Glimmer was designed in a way that allows for its direct and efficient implementation in the GPU environment. However, as shown in [28], the final results of GLIMMER mapping are far from the global minimum of the cost function (1). It was shown in [25, 28] that the MDS method employing particle dynamics and an incomplete distance matrix can achieve GPU efficiency similar to Glimmer's with a considerably smaller error (1).

In this paper, however, we concentrate on an implementation of the MDS algorithm that uses the full dissimilarity matrix. Unlike approximate algorithms, such as GLIMMER, it ensures that the mapped structure is unambiguous. By using the full dissimilarity matrix, we avoid the systematic errors caused by the reduced number of degrees of freedom in approximate algorithms.

In [9, 10, 28, 34], we have presented a few parallel implementations of MDS based on particle dynamics. Our algorithms were inspired by well-known parallel MD codes. In [29], we reported a satisfactory linear speedup with 40%–80% efficiency using 36 nodes (144 threads) and 6 × 10^4 feature vectors. The most recent parallel MPI implementation of MDS with a full dissimilarity matrix, which uses the SMACOF (Scaling by MAjorizing a COmplicated Function) algorithm, is presented in [45]. This approach is based on the function majorization concept [39]. In general, the minimum obtained by using this type of algorithm is local and is not as good as those obtained by heuristics (e.g., [9, 10, 23]); however, unlike for heuristics, it can be achieved much faster. In [46], the authors report a very good performance of their parallel MDS algorithm on clusters of AMD Opteron 8356 (2.3 GHz) and Intel Xeon E7450 (2.4 GHz) consisting of 256 and 768 nodes, respectively. The largest data set visualized consists of 10^5 feature vectors.

Although the MPI implementations allow for visualization of datasets of larger sizes because of the scalable memory, the simulation times are still unsatisfactory for interactive visualization. Meanwhile, as shown in Figure 9a, one iteration of the simulation of the dynamics of 2 × 10^4 particles (feature vectors) using our MDS virtual particle algorithm, running on a Fermi GPU board and using FastMath arithmetic, requires only about 20 ms. For a typical number of time-steps needed to obtain a stable minimum, that is, n = 1000–5000, we obtain a total computational time of 20–100 s. This result is more than satisfactory for interactive visualization and control, bearing in mind that the system can be controlled in the course of the particle system evolution.


Figure 12. The average time of a single iteration for particle-based multidimensional scaling on selected CPU, GPU nodes, and an MPI cluster.


6. CONCLUSIONS

In Figure 12, we compare the best timings of the MDS algorithm achieved on GPU boards with those of its CPU, MPI/CPU, and MPI/CUDA versions. The timings were obtained for the same data set (the H19 dataset) using the IEEE arithmetic mode. The data set size (2×10⁴ feature vectors) is limited by the size of the global memory of the GeForce GTX 480 GPU board. Because this data set is smaller than the one used in Section 4.5, fewer nodes of the MPI server were used for the calculations (6 instead of 10); a larger number of nodes for data of this size would considerably worsen the computational efficiency because of the high communication-to-computation ratio.

As shown in Figure 12 (see also Figure 9 and Table V), the advantage of GPU over CPU systems is evident. A single GeForce GTX 480 or Tesla M2090 GPU board achieves twice the computational speed obtained with six nodes of the midrange HP SL390 cluster. From the results in Section 4.5, one can expect that even for larger data sets, a 10-node HP SL390 cluster would be slower than a single Tesla M2090 or Fermi board with adequate global memory. This means that a workstation with a strong GPU board can have computational power similar to that of a professional cluster equipped with several strong two-processor CPU nodes.

Bottlenecks such as the relatively small global memory of GPU boards, algorithmic constraints, the difficulty of CUDA and OpenCL programming, and the limited portability of CUDA code still hamper wider exploitation of the computational capabilities of GPU boards. On the other hand, for problems such as MDS and interactive visualization of large data sets, where the FastMath arithmetic mode is sufficient for obtaining satisfactory results, the advantage of GPU boards and clusters over their CPU equivalents is overwhelming (Figure 9a and Table V). The strongest Fermi GPU board is then about 13 times faster than a two-processor, 12-thread Intel Xeon X5670 server.

To sum up, the implementation of MDS employing particle dynamics in a GPU computational environment allows for interactive visualization of data sets consisting of on the order of 10⁴ objects (feature vectors) on a PC equipped with a GPU board of compute capability 2.0 or higher.

ACKNOWLEDGEMENTS

This research has been financed by the Polish Ministry of Higher Education and Science, project NN519443039, and partially by AGH grant no. 11.11.120.777. It has also been supported by the CMG program of the US National Science Foundation. We thank the NVIDIA Company for support and hardware donations. Part of the computations was performed on resources provided by the Academic Computer Centre CYFRONET AGH in Krakow (projects MNiSW/SGI3700/AGH/130/2006 and MNiSW/Zeus_GPGPU/AGH/037/2011).

REFERENCES

1. Young G, Householder AS. Discussion of a set of points in terms of their mutual distances. Psychometrika 1938; 3(1):19–22.


2. Torgerson WS. Multidimensional scaling: I. Theory and method. Psychometrika 1952; 17:401–419.
3. Torgerson WS. Theory and Methods of Scaling. John Wiley & Sons: New York, 1958.
4. Coombs CH. A Theory of Data. John Wiley & Sons: New York, Chapters 5–7:80–180, 1964.
5. Kruskal J. Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika 1964; 29:1–27.
6. Sammon JW. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers 1969; C-18(5):401–409.
7. Younes L. Computable elastic distances between shapes. SIAM Journal on Applied Mathematics 1998; 58(2):565–586.
8. Dzwinel W. How to make Sammon's mapping useful for multidimensional data structures analysis? Pattern Recognition 1994; 27(7):949–959.
9. Dzwinel W. Virtual particles and search for global minimum. Future Generation Computer Systems 1997; 12:371–389.
10. Blasiak J, Dzwinel W. Visual clustering of multidimensional and large data sets using parallel environments. Lecture Notes in Computer Science 1998; 1401:403–410.
11. Dzwinel W, Błasiak J. Method of particles in visual clustering of multi-dimensional and large data sets. Future Generation Computer Systems 1999; 15:365–379.
12. Arodź T, Boryczko K, Dzwinel W, Kurdziel M, Yuen DA. Visual exploration of multidimensional feature space of biological data. Proceedings of 16th IEEE Visualization 2005 (VIS 2005), Minneapolis, Minnesota, October 23–28, 2005.
13. Andrecut M. Molecular dynamics multidimensional scaling. Physics Letters A 2009; 373(23/24):2001–2006.
14. Yuen DA, Dzwinel W, Ben-Zion Y, Kadlec B. Visualization of earthquake clusters over multidimensional space. Encyclopedia of Complexity and System Science, Springer Verlag: New York, 2347–2371, 2009.
15. Kurdziel M, Boryczko K, Dzwinel W. Procrustes analysis of truncated least squares multidimensional scaling. Computing and Informatics 2012; 31(6+):1417–1440.
16. Nguyen D, Dzwinel W, Cios KJ. Visualization of highly-dimensional data in 3-D space. Proceedings of the 2011 11th International Conference on Intelligent Systems Design and Applications, 22–24 November 2011, Cordoba, Spain, 225–230.
17. Parker SG, Johnson CR, Beazley D. Computational steering: software systems and strategies. IEEE Computational Science & Engineering 1997; 4(4):50–59.
18. Rapaport DC. The Art of Molecular Dynamics Simulation. Cambridge University Press: New York, 1996.
19. Kirkpatrick S, Gelatt CD Jr, Vecchi MP. Optimization by simulated annealing. Science 1983; 220(4598):671–680.
20. Dzwinel W, Yuen DA, Boryczko K, Ben-Zion Y, Yoshioka S, Ito T. Nonlinear multidimensional scaling and visualization of earthquake clusters over space, time and feature space. Nonlinear Processes in Geophysics 2005; 12:117–128.
21. Ahlrichs R, Brode S. An optimized MD program for the vector computer CYBER-205. Computer Physics Communications 1986; 42(1):51–55.
22. Ahlrichs R, Brode S. A new rigid motion algorithm for MD simulations. Computer Physics Communications 1986; 42:59–64.
23. Smith W, Forester TR. Parallel macromolecular simulations and the replicated data strategy: I. The computation of atomic forces. Computer Physics Communications 1994; 79(1):52–62.
24. Bae S-H, Qiu J, Fox G. Adaptive interpolation of multidimensional scaling. Procedia Computer Science 2012; ICCS 2012, 9:393–402.
25. Pawliczek P, Dzwinel W. Interactive data mining by using multidimensional scaling. Procedia Computer Science, ICCS 2013, 1–10, 2013 (in press).
26. Shu J, Wang B, Chen M, Wang J, Zheng W. Optimization techniques for parallel force-decomposition algorithm in molecular dynamic simulations. Computer Physics Communications 2003; 154:121–130.
27. Taylor VE, Stevens RL, Arnold KE. Parallel molecular dynamics: communication requirements for massively parallel machines. Proceedings of Frontiers '95, Fifth Symposium on the Frontiers of Massively Parallel Computation, 156, IEEE Computer Society Press, 1994.
28. Pawliczek P. Improvement of multidimensional scaling efficiency in the context of interactive visualization of large data sets. PhD thesis, AGH University of Science and Technology, Department of Computer Science, Krakow, Poland, 2012.
29. Pawliczek P, Dzwinel W. Parallel implementation of multidimensional scaling algorithm based on particle dynamics. Lecture Notes in Computer Science 2010, PPAM, Wrocław, 13–16 September 2009, LNCS 6067:312–321.
30. Gropp W. MPICH2: a new start for MPI implementations. In Kranzlmüller D, Volkert J, Kacsuk P, Dongarra J (editors), Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, LNCS 2474, Springer Verlag: Heidelberg-Berlin, 2002.
31. Borg I, Groenen PJF. Metric and nonmetric MDS. In Modern Multidimensional Scaling: Theory and Applications. Springer Verlag: New York, second edition, 2005.
32. France SL, Carroll JD. Two-way multidimensional scaling: a review. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 2011; 41:644–661.
33. Niemann H. Linear and nonlinear mapping of patterns. Pattern Recognition 1980; 12(2):83–87.
34. Pawliczek P, Dzwinel W. Visual analysis of multidimensional data using fast MDS algorithm. Conference on Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2007, Wilga, Poland, May 21–27, 2007. Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE) 2007; 6937(1–2):M9372–M9372.
35. Chalmers M. A linear iteration time layout algorithm for visualizing high-dimensional data. IEEE Visualization '96 Proceedings, 127–131, 1996.
36. Ingram S, Munzner T, Olano M. Glimmer: multilevel MDS on the GPU. IEEE Transactions on Visualization and Computer Graphics 2009; 15:249–261.
37. Klock H, Buhmann JM. Data visualization by multidimensional scaling: a deterministic annealing approach. Pattern Recognition 2000; 33:651–669.


38. Varoneckas A, Zilinskas A, Zilinskas J. Multidimensional scaling using parallel genetic algorithm. Computer Aided Methods in Optimal Design and Operations 2006; 129–138.
39. De Leeuw J, Mair P. Multidimensional scaling using majorization: SMACOF in R. Journal of Statistical Software 2009; 31:1–30.
40. Reina G, Ertl T. Implementing FastMap on the GPU: considerations on general-purpose computation on graphics hardware. Theory and Practice of Computer Graphics 2005; 51–58.
41. Faloutsos C, Lin K-I. FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, 163–167, 1995.
42. Fester T, Schreiber F, Strickert M. CUDA-based multi-core implementation of MDS-based bioinformatics algorithms. Lecture Notes in Informatics Series of the German Informatics Society (GI), German Conference on Bioinformatics, 67–79, 2009.
43. Strickert M, Teichmann S, Sreenivasulu N, Seiffert U. High-throughput multi-dimensional scaling (HiT-MDS) for cDNA-array expression data. Lecture Notes in Computer Science 2005; LNCS 3696:625–634.
44. Strickert M, Sreenivasulu N, Usadel B, Seiffert U. Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue. BMC Bioinformatics 2007; 8:165.
45. Micikevicius P. Analysis-driven optimization. GPU Technology Conference 2010, http://www.nvidia.com/content/GTC-2010/pdfs/2012_GTC2010.pdf.
46. Bae S-H, Qiu J, Fox G. High performance multidimensional scaling for large high-dimensional data visualization. IEEE Transactions on Parallel and Distributed Systems, January 2012 (in press).
