
Abstract—Computer vision and sensing systems have been widely applied in industry, including robotics, an area that requires real-time applications. However, robust, high-accuracy systems are usually time-consuming, and high-performance devices can speed them up. Many applications use the GPU (Graphics Processing Unit) for faster computation. The greatest advantages of a GPU system include high memory bandwidth, flexibility, and ease of programming. However, this method requires transferring data between the host (computer) and the device (graphics card). Moreover, it only supports data parallelism and the overlapping of data copies with kernel execution. The multi-core CPU offers another approach for improving system performance. OpenMP, an API that provides strong support for multi-core CPU programming, offers flexibility and simple programming using compiler directives; in particular, the computation can be done without transferring data to a device, and full task-parallel execution is supported. However, a multi-core CPU is limited to a few cores. Combining the advantages of the GPU and the multi-core CPU optimizes system performance. This paper proposes a novel high-performance computing model using a combined GPU and multi-core CPU architecture to obtain a system with two parallel layers: a data-parallel model and a task-parallel model. The model is applied to a highly accurate 3D reconstruction application using the improved HOC structured light coding algorithm. The experimental results of the proposed model reveal a speed up to 20 times faster than the original single-CPU implementation.

Keywords: structured light coding, high-performance computing, GPU, multi-core CPU

This work was partially supported by the WCU program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology (R31-2010-000-10062-0), by PRCP through NRF of Korea funded by MEST (2011-0018397), performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs (F0005000-2010-32), and by the KORUS-Tech program (KT-2010-SW-AP-FS0-0004).

I. INTRODUCTION

GPUs are high-performance devices that can accelerate a wide range of applications in the high-performance computing (HPC) field. Many studies have evaluated the performance and advantages of the GPU technique [1]. In particular, the development of the GPGPU (General Purpose GPU) technique allows us to perform not only computer graphics applications but also more general computation and stream processing on non-graphic data [2]. The advantages of the GPU include high memory bandwidth, flexibility, and ease of programming. In November 2006, NVIDIA introduced CUDA (Compute Unified Device Architecture), a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than a CPU.

However, the CUDA architecture also has limitations. The input data needs to be transferred between host and device for computation. Therefore, the data-transfer step sometimes becomes the bottleneck of systems that handle large amounts of input data [3]. Moreover, the CUDA architecture only supports the data-parallel mechanism and the overlapping of data copies with kernel execution. This mechanism is suitable for the SIMD (Single Instruction Multiple Data) model, in which the same instruction is repeated many times. However, it is hard to obtain optimal performance for complicated applications that require both data parallelism and task parallelism using the CUDA architecture alone [4]. The multi-core CPU is another high-performance device, supported by the OpenMP API for parallel programming. A multi-core device has only a few CPU cores, far fewer than the hundreds of cores inside a GPU chip; although these cores are relatively expensive, they offer high performance. Unlike the CUDA architecture, a multi-core application using OpenMP does not need to transfer data for execution and supports both parallel layers: data parallelism for SIMD structures and high-level task parallelism [5]. These properties represent the advantages of multi-core CPU devices.

High-Performance Computing Model for 3D Camera System

Hong-Nam Ta1, Sukhan Lee1,2, Fellow, IEEE

1 School of Information and Communication Engineering, Sungkyunkwan University, Suwon, South Korea; 2 Department of Interaction Science, Sungkyunkwan University, Seoul, Korea

(Tel : +82-31-299-6465; E-mail: [email protected], [email protected])



Because of the intriguing advantages of both devices, many studies combine them to build high-performance systems. Jang et al. [6] proposed a combined CUDA and OpenMP model for a neural network implementation. However, their architecture is a simple combination: the authors separated the process into two sequential parts and performed each part on a different device, both as data-parallel executions. In that design, the use of OpenMP to reduce the input data size before transferring it to the GPU could be replaced by other techniques that optimize the data-transfer step [7]. Kim et al. [8] used a combination of three platforms (OpenMP, CUDA, and SSE) to implement a feature extraction application. Nevertheless, the combination failed to dramatically improve the system's performance. In this paper, we propose a novel high-performance computing model for a structured light system using a GPU and a multi-core CPU. The proposed model includes two overlapping parallel layers: data-parallel execution on the GPU and task-parallel execution on the multi-core CPU. The GPU program was designed and implemented using NVIDIA's CUDA architecture; the multi-core CPU program was implemented using the OpenMP API.

We implemented the improved HOC structured light coding algorithm (I-HOC), whose original version was first proposed in 2005 [9]. The improvements increase its accuracy but also its execution time. Therefore, the proposed high-performance computing model was applied to the I-HOC algorithm to obtain a high-performance, highly accurate structured light coding algorithm. This paper is organized as follows. Section 2 introduces the improved HOC coding algorithm. Section 3 describes the proposed high-performance computing model for a structured light system. The experimental results are analyzed in Section 4. Section 5 concludes the paper.

II. IMPROVED HOC CODING ALGORITHM

A. Improved HOC coding algorithm

Many studies have proposed temporal structured light coding techniques, including Binary code [10], Gray code [11], and, recently, Hierarchical Orthogonal Code (HOC). The HOC technique focuses on reducing the code length as much as possible while preserving the characteristics of the orthogonal code. Experimental results revealed HOC to be a more robust method with higher accuracy because of its direct use of separate codes for error correction and disparity computation. The improved version of HOC (I-HOC) greatly increases the accuracy of the original but also increases the computation time [12]. Table I shows an accuracy comparison of I-HOC and the traditional GC-I (Gray Coding Inverse) algorithm.

TABLE I
ACCURACY COMPARISON OF I-HOC AND GRAY CODE INVERSE

         STD     Error max   Number of points
I-HOC    0.453   3.674       27877
GC-I     0.788   20.997      27888

STD: standard deviation.

Figure 3 shows a block diagram of the I-HOC coding algorithm. The input data include 16 light patterns and 4 reference patterns with a resolution of 640x480 pixels. Figure 1 shows these patterns divided into four layers.

Fig. 1. Four layers of HOC light pattern set.

The I-HOC algorithm includes two independent processes.

The first process, the traditional HOC coding algorithm, decodes the intensity value of every pixel in each layer and then accumulates the results of the four layers. Figure 2(a) shows that the correspondence value of a pixel in layer one is determined by comparing its intensity with that of its candidate correspondence pixels in the same layer. The decode values of the four layers then form the cumulative total:

$C_{res} = \sum_{l=1}^{4} c_l$    (1)

Here $C_{res}$ and $c_l$ are the total correspondence value and the decode value of the current pixel in layer $l$, respectively; $l$ is the corresponding layer index. The second process, boundary estimation, is another approach developed to correct the results of the first. It includes algorithms to determine the exact boundaries at points A, B, C, and D. The pixels belonging to the regions delimited by these boundary points (such as AB, BC, CD, and DA) must have the same correspondence values. In this way, we can check the correctness of the correspondence value of a pixel based on its region. This makes the I-HOC method much more accurate than its predecessor.


Fig. 2. The two approaches to finding correspondence in the I-HOC algorithm: (a) correspondence search based on decoded intensity, (b) boundary estimation process.

Fig. 3. Block diagram of the improved HOC (I-HOC) coding algorithm for parallel computation.

III. PROPOSED HIGH-PERFORMANCE COMPUTING MODEL

This section describes the proposed high-performance computing model in detail. In this model, we use both the GPU and the multi-core CPU to form a system with two parallel layers, which we overlap to optimize performance. The first level, a high-level task-parallel layer, is implemented on the multi-core CPU using the OpenMP API. It divides the main process into sub-processes, which are concurrently executed by CPU threads. The performance of these CPU threads is then further enhanced by implementing the low-level data-parallel layer, which divides the data of these CPU threads into thousands of sub-problems that are concurrently executed by CUDA threads on the GPU platform.

A. Implementation of the task-parallel computation level

Task parallelism, one advantage of the multi-core CPU platform, allows many tasks with different instructions to be executed simultaneously. Task-parallel execution requires that all tasks be independent. In this case, the main program is divided into several CPU threads, and each thread handles one or more tasks. Therefore, the number of CPU threads depends on the number of independent tasks and the number of CPU cores. The total performance approximates the performance of the slowest thread.

The I-HOC algorithm (Fig. 3) has two independent processes in its structure: one finds correspondences (the original HOC algorithm) and the other estimates boundaries (the improvement). Thus, we can apply the task-parallel model to these processes by assigning the two independent processes to two parallel CPU threads. This number matches the number of CPU cores of our multi-core hardware platform, an Intel Core 2 Duo E7500. The two CPU threads are named CPU_thread_0 and CPU_thread_1, as described in Figure 3.

We chose the OpenMP API for multi-core CPU programming because of its simplicity and convenience. It allows the programmer to define parallel threads using compiler directives. From Figure 3, we set the number of CPU threads with omp_set_num_threads(num_cpu_threads), where the constant num_cpu_threads is set to two. Once the number of CPU threads is set, each thread is identified by a private identification value, which can be obtained inside each thread with the omp_get_thread_num() function. The work performed inside the #pragma omp parallel block is selected according to the corresponding thread identification:

    #pragma omp parallel shared(inputImage)
    {
        // parallel code implemented here
        int id = omp_get_thread_num();
        if (id == 0) {
            // thread 0: finding-correspondence process
        } else {
            // thread 1: boundary-estimation process
        }
    }

Because these two threads share the input data, the input image is declared as a shared variable with the shared(inputImage) clause.
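For reference, a minimal, self-contained sketch of this two-thread dispatch is shown below. It is not the authors' actual code: the functions findCorrespondence() and estimateBoundaries() are hypothetical stand-ins for the two I-HOC processes, and in the proposed model the boundary-estimation function is where the CUDA kernels of Section III.B would be launched.

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-ins for the two independent I-HOC processes.
    static void findCorrespondence(const std::vector<unsigned char>& inputImage) {
        // ... decode per-pixel correspondence values (original HOC process) ...
        std::printf("thread %d: finding correspondence over %zu bytes\n",
                    omp_get_thread_num(), inputImage.size());
    }

    static void estimateBoundaries(const std::vector<unsigned char>& inputImage) {
        // ... boundary estimation; in the proposed model this thread launches
        //     the CUDA kernels described in Section III.B ...
        std::printf("thread %d: estimating boundaries over %zu bytes\n",
                    omp_get_thread_num(), inputImage.size());
    }

    int main() {
        const int num_cpu_threads = 2;  // matches the two cores of the Core 2 Duo E7500
        // 16 light patterns + 4 reference patterns at 640x480 pixels.
        std::vector<unsigned char> inputImage(20 * 640 * 480);

        omp_set_num_threads(num_cpu_threads);

        #pragma omp parallel shared(inputImage)
        {
            if (omp_get_thread_num() == 0)
                findCorrespondence(inputImage);   // CPU_thread_0
            else
                estimateBoundaries(inputImage);   // CPU_thread_1
        }
        return 0;
    }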

B. Implementation of the data-parallel computation level

Task parallelism, although a fast and effective way to improve performance, is limited by the small number of CPU cores. Therefore, it is difficult to reduce the computation time by a factor of 10 or 20 using only the multi-core CPU. In this case, we propose applying the second, data-parallel level.

Because the CPU threads execute in parallel, their joint performance equals the performance of the slowest CPU thread, as described in Figure 4.


Therefore, the point at which the two CPU threads have the same execution time represents the best case. However, this is difficult to achieve because the two CPU threads differ in both input data and instruction sets. In the I-HOC algorithm, the finding-correspondence process involves simple computation and takes hundreds of milliseconds, whereas the boundary-estimation process in the second thread is computationally heavy and may take more than ten seconds in the worst test case. To improve the performance of the system, we propose using CUDA to apply the second, data-parallel level to the boundary-estimation process. This solution also reduces the difference between the computation times of the two CPU threads and thus yields the optimal system performance.

The boundary estimation step was implemented on the GPU following the CUDA structure. The GPU's data-parallel mechanism differs from the task-parallel one: it is suitable for large iterative processes in which the same instruction is applied to a large amount of data. This structure is called Single Instruction Multiple Data (SIMD).

First, the input data are pre-processed by normalization and Gaussian blurring to reduce noise. The normalization function is:

$I'_i = \dfrac{I_i - B_i}{W_i - B_i}$    (1)

where $I_i$ and $I'_i$ are the original intensity value and the normalized value of the considered pixel $i$, and $B_i$ and $W_i$ are the intensities of the corresponding pixel in the black reference image and the white reference image. These images are obtained by projecting full-black and full-white patterns. From Equation (1), we can see that the normalization of pixel $i$ is independent of the other pixels, so each pixel can be fed into one CUDA thread for parallel execution. With 16 light pattern images at a resolution of 640x480, 4,915,200 CUDA parallel threads are created. Experimental results show that the parallel execution reduces the execution time of the normalization process by about 90%.
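As an illustration, a minimal CUDA kernel for this per-pixel normalization might look as follows. The kernel name, parameter names, and the small-denominator guard are assumptions made for the sketch, not the authors' implementation.

    // One CUDA thread normalizes one pixel according to Equation (1).
    __global__ void normalizeKernel(const unsigned char* d_in,     // 16 pattern images, concatenated
                                    const unsigned char* d_black,  // black reference image
                                    const unsigned char* d_white,  // white reference image
                                    float* d_out,
                                    int imageSize,                 // pixels per image (640 * 480)
                                    int total)                     // pixels over all pattern images
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= total) return;
        int r = i % imageSize;  // the reference images are single frames
        float denom = (float)d_white[r] - (float)d_black[r];
        d_out[i] = (denom > 1e-3f) ? ((float)d_in[i] - (float)d_black[r]) / denom : 0.0f;
    }

    // Launch configuration: 16 patterns x 640 x 480 = 4,915,200 threads in total, e.g.
    //   int total = 16 * 640 * 480, block = 256;
    //   normalizeKernel<<<(total + block - 1) / block, block>>>(d_in, d_black, d_white,
    //                                                           d_out, 640 * 480, total);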

Next, the data are blurred by a one-dimensional Gaussian filter (2) to reduce the noise:

$G(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\dfrac{x^2}{2\sigma^2}\right)$    (2)

In the implementation, the Gaussian filter is represented by a 1x(2n+1) matrix R, and the Gaussian value of each pixel is calculated by (3):

$I''_i = \sum_{k=-n}^{n} R_k \, I'_{i+k}$    (3)

Here, the Gaussian value of pixel $i$ depends on the values of its $n$ neighbors on each side, which complicates parallel execution because those neighbors may be updated with new values by other parallel threads at the same time as pixel $i$. To solve this problem, a small buffer is created in each parallel thread to hold the old values of the neighbors of pixel $i$, ensuring that the Gaussian value of every pixel is computed correctly.
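A possible CUDA realization of this buffered one-dimensional Gaussian blur is sketched below, assuming a fixed filter radius N and coefficients stored in constant memory; the names and the border clamping are illustrative choices, not taken from the paper.

    #define N 2  // assumed filter radius, i.e. a 1 x (2N+1) = 1 x 5 filter

    __constant__ float d_R[2 * N + 1];  // Gaussian coefficients R, uploaded with cudaMemcpyToSymbol

    __global__ void gaussianBlur1D(const float* d_in, float* d_out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        // Per-thread buffer holding the old values of the neighborhood of pixel (x, y),
        // so the result cannot be affected by values written by other threads.
        float buf[2 * N + 1];
        for (int k = -N; k <= N; ++k) {
            int xx = min(max(x + k, 0), width - 1);  // clamp at image borders
            buf[k + N] = d_in[y * width + xx];
        }

        float sum = 0.0f;
        for (int k = 0; k < 2 * N + 1; ++k)
            sum += d_R[k] * buf[k];                  // the convolution of Equation (3)
        d_out[y * width + x] = sum;
    }

Since this sketch writes to a separate output buffer, the private copy is not strictly necessary in this form; it is kept to mirror the buffering strategy described above, which also covers an in-place implementation.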

Finally, the exact intersection points of the light patterns are determined by a more complicated process. Here, the intersection point in each layer is determined by considering two adjacent patterns, as described in Figure 2(b). The process is performed for each pair of lines of two adjacent patterns in each layer; therefore, 7680 parallel sub-problems are created. On the single-CPU platform, this is the bottleneck because of its high time consumption. However, by using a data-parallel implementation on the GPU, we can reduce its computation time to a level comparable to that of the original HOC algorithm.

Fig. 4. Performance improvement of multi-core CPU implementation.

C. Mapping memory to reduce the data-transfer process

One problem of the CUDA programming model for the GPU is the data-transfer process. Because the data must be transferred between the computer and the graphics card for computation, this step may take a considerable amount of time if the application uses a large amount of data. In the worst case, the data-transfer time may exceed the time of the parallel execution inside the GPU, so the transfer, rather than the computation, becomes the new bottleneck of the GPU system. For these reasons, many techniques have been developed to improve the data-transfer process in GPU parallel computation models. One effective solution is overlapping data transfer and kernel execution, a capability offered by the newer CUDA architecture: the device can perform copies between page-locked host memory and device memory concurrently with kernel execution. The use of page-locked memory not only enables overlapping data transfer and kernel execution but also allows the host memory to be mapped into the address space of the device. This eliminates the need to explicitly copy data to or from device memory; the kernel implicitly performs the data transfer as needed.

Page-locked memory is located in and shared by two memory spaces, host memory and device memory. To access this special memory we need two pointers, one for the host and one for the device, obtained by calling the cudaHostGetDevicePointer() function. We also need to synchronize access using streams or events.


Experimental results show that we can save 50% of the data transfer by using the overlapped execution model. Figure 5 shows the overlapped, chunked computation model for the normalization step of the boundary estimation process on the GPU. Here, the input data is divided into two parts, Data_0 and Data_1. The optimal chunk size depends on the hardware platform and the complexity of the kernel function.

Fig. 5. Execution timeline of normalization in I-HOC coding algorithm with arrows indicating inter-engine dependencies.
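To make the chunked overlap of Figure 5 concrete, the following host-side sketch splits a page-locked input buffer into two chunks and issues each chunk's copies and kernel in its own stream, so the copy engine and the kernel engine can work concurrently. processChunk() is a placeholder kernel and the buffer layout is an assumption; this is not the authors' code.

    #include <cuda_runtime.h>

    // Placeholder per-element kernel standing in for the normalization step.
    __global__ void processChunk(const unsigned char* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] / 255.0f;
    }

    void runOverlapped(const unsigned char* h_in, float* h_out, int totalPixels) {
        // h_in and h_out are assumed to be page-locked (allocated with cudaHostAlloc).
        const int numChunks = 2;                    // Data_0 and Data_1 as in Fig. 5
        const int chunk = totalPixels / numChunks;

        unsigned char* d_in;  float* d_out;
        cudaMalloc((void**)&d_in,  totalPixels * sizeof(unsigned char));
        cudaMalloc((void**)&d_out, totalPixels * sizeof(float));

        cudaStream_t streams[numChunks];
        for (int s = 0; s < numChunks; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < numChunks; ++s) {
            int off = s * chunk;
            // The copy of one chunk overlaps with the kernel execution of the other chunk.
            cudaMemcpyAsync(d_in + off, h_in + off, chunk, cudaMemcpyHostToDevice, streams[s]);
            processChunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, chunk);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        for (int s = 0; s < numChunks; ++s) cudaStreamSynchronize(streams[s]);

        for (int s = 0; s < numChunks; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_in);
        cudaFree(d_out);
    }

The zero-copy alternative mentioned above would instead allocate the host buffers with cudaHostAlloc(..., cudaHostAllocMapped) and obtain device-side pointers with cudaHostGetDevicePointer(), so the kernel reads host memory directly and no explicit cudaMemcpyAsync calls are needed.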

IV. EXPERIMENTAL RESULTS

The experiments were performed on an Intel Core 2 Duo E7500 CPU running at 2.94 GHz and a Tesla C2050 graphics card. In this section, we describe the detailed results of our proposed method.

A. Results of data parallel computation in GPU

Figure 6 shows the performance of the boundary estimation process implemented on two platforms, GPU and single CPU. Because the boundary estimation process fits the SIMD structure, we can separate it into independent sub-problems. When executed in CUDA, the sub-problems run in parallel, saving a great deal of execution time. In Figure 6, the red line presents the performance of the GPU implementation and the blue line presents the performance of the original implementation on a single CPU. The results of five test cases show that, in the best case, the CUDA version is up to 20 times faster than the original version of the boundary estimation process of I-HOC.

Fig. 6. Performance comparison of boundary estimation process of single CPU platform and GPU platform.

B. Performance result of proposed model

Figure 7 shows the total performance of the I-HOC coding implementation using the proposed model. We compared the performance of the proposed model with the original implementation and with the multi-core CPU version (without GPU). The results show that the proposed model obtains higher performance than the other methods.

Fig. 7. Performance comparison of I-HOC implementation in proposed model, multi-core CPU model, and original model.

In general, the proposed method is about 5 times faster than the multi-core CPU version and 15 times faster than the original implementation. However, these values are not fixed; they change depending on the complexity of the input data. The more complicated the input scene, the greater the possible performance improvement. The reason is that when the complexity of the scene increases, the boundary estimation algorithm needs more computation time. Because the original method computed the data sequentially, the additional time of each sub-problem is added to the total computation time. In the proposed method, however, the thousands of sub-problems are executed concurrently, so the additional time is divided by thousands and the increase in computation time is small. Another advantage of the proposed model is stability: the fact that the additional time is divided by thousands, as mentioned above, makes the system more stable. Most importantly, the proposed model moves the bottleneck of the system. In the original model, the bottleneck is the computation process because it is the most time-consuming. In the proposed model, however, the execution may be faster than the data transfer between the graphics device and the host, which moves the bottleneck to the data-transfer process. The data transfer does not depend on the complexity of the input scene; it depends only on the size of the input images, which does not change for a fixed system.


Therefore, the proposed model performs in a more stable fashion than the original model; in Figure 7, its curve is close to a straight line.

TABLE 2
PERFORMANCE COMPARISON BETWEEN ORIGINAL I-HOC, PROPOSED MODEL, AND GRAY CODE INVERSE

Test case   I-HOC (single CPU)   GC-I (single CPU)   I-HOC (GPU + CPUs)
1           6892                 112                 593
2           5129                 190                 586
3           8060                 133                 600
4           10346                127                 598
5           11475                164                 605
Unit: millisecond (ms)

Table 2 shows the performance of the original I-HOC implementation, the proposed model, and the Gray Code Inverse algorithm implemented on a single CPU platform. By using the high-performance devices, the proposed model obtains a time performance close to that of the Gray Code Inverse method while retaining higher accuracy. Therefore, the combination of the proposed model and the I-HOC algorithm yields a high-performance, high-accuracy structured light coding method. Figure 8 shows the results of the proposed model for the 3D reconstruction application using I-HOC.

Fig. 8. Test scenes of the 3D reconstruction application using the proposed model.

V. CONCLUSION

In this paper, we proposed a novel high-performance computing model using a GPU and a multi-core CPU. The main advantage of the proposed model is the optimal system performance obtained by implementing two parallel layers: the data-parallel layer is handled by the GPU, and the high-level task-parallel layer is handled by the multi-core CPU using the OpenMP API. Moreover, the proposed high-performance model is appropriate for 3D camera system applications because such applications handle a large amount of data and have a SIMD structure. The experimental results of implementing I-HOC using the proposed model demonstrate a high-performance, high-accuracy structured light coding method.

REFERENCES

[1]. Asano, S.; Maruyama, T.; Yamaguchi, Y.; “Performance comparison of FPGA, GPU and CPU in image processing”, International Conference on Field Programmable Logic and Applications, pp. 126 – 131. 2009.

[2]. Kothapalli, K.; Mukherjee, R.; Rehman, M.S.; Patidar, S.; Narayanan, P.J.; Srinathan, K.; “A performance prediction model for the CUDA GPGPU platform”. International Conference on High Performance Computing (HiPC) 2009, pp: 463 – 472. 2009.

[3]. Grottel, S.; Reina, G.; Ertl, T.; “Optimized data transfer for time-dependent, GPU-based glyphs”, PacificVis '09. IEEE Pacific Visualization Symposium, 2009, pp. 65 – 72. 2009.

[4]. Ngai-Man Cheung; Au, O.C.; Man-Cheung Kung; Wong, P.H.W.; Chun Hung Liu; “Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors”, IEEE Transactions on Circuits and Systems for Video Technology, pp. 1692 – 1703. 2009.

[5]. Martorell, X.; Gonzalez, M.; Duran, A.; Balart, J.; Ferrer, R.; Ayguade, E.; Labarta, J.; “Techniques supporting threadprivate in OpenMP”, 20th International Parallel and Distributed Processing Symposium, 2006.

[6]. Honghoon Jang; Anjin Park; Keechul Jung; “Neural Network Implementation Using CUDA and OpenMP”, DICTA '08, Digital Image Computing: Techniques and Applications, pp. 155–161. 2008.

[7]. NVIDIA developer zone: http://developer.nvidia.com/cuda-gpus

[8]. Junchul Kim; Eunsoo Park; Xuenan Cui; Hakil Kim; Gruver, W.A.; “A fast feature extraction in object recognition using parallel processing on CPU and GPU”, SMC 2009, IEEE International Conference on Systems, Man and Cybernetics, pp. 3842–3847. 2009.

[9]. Sukhan Lee; Jongmoo Choi; Daesik Kim; Jaekeun Na; Seungsub Oh; “Signal Separation Coding for Robust Depth Imaging Based on Structured Light”, Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pp. 4430 – 4436. 2005.

[10]. J. L. Posdamer and M. D. Altschuler; “Surface measurement by space encoded projected beam systems”, Computer Graphics and Image Processing, pp. 1-17. 1982.

[11]. J. Gühring.; “Dense 3-D surface acquisition by structured light using off-the-shelf components”, Proc. of the SPIE Photonics West, Electronic Imaging 2001, Videometrics and Optical Methods for 3D Shape Measurement VII, pp. 220-231. 2001.

[12]. Lam Quang Bui, and Sukhan Lee; “Ray-Tracing Codec for Structured Light 3D Camera”, ISRC Technical Report.
