Fast implementation of iterative reconstruction with exact ray-driven projector on GPUs

TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 05/20 pp30-35 Volume 15, Number 1, February 2010

Fast Implementation of Iterative Reconstruction with Exact Ray-Driven Projector on GPUs

Fang Xu**

Siemens Corporate Research, Princeton, NJ 08540, USA

Abstract: Iterative methods are popular in image reconstruction because they can recover object information from incomplete acquisition data. However, they make frequent use of forward and backward projections, which are computationally expensive. Past research has shown that a forward projector that produces high-quality images is crucial to achieving a good convergence rate. This paper introduces a high-performance iterative reconstruction framework that supports two of the most popular iterative algorithms: the Simultaneous Algebraic Reconstruction Technique (SART) and Ordered-Subsets Expectation Maximization (OSEM). The framework uses Siddon’s ray-driven method to generate forward-projected images. Benefiting from functionality offered by the current generation of graphics processing units (GPUs), it outperforms previous GPU implementations that use grid-interpolated methods, in addition to achieving significant speedups over CPU-based solutions.

Key words: image reconstruction; tomography; graphics processing units

Introduction

Conventional analytical image reconstruction algorithms, such as filtered backprojection (FBP)[1], are simple and easy to implement. However, they are not suitable for applications with highly incomplete projection data, e.g., digital tomosynthesis[2], whose images are acquired from a limited range of viewing angles due to geometric constraints. To address this problem, iterative methods are often adopted because of their superior capability for solving ill-posed inverse problems. Previous studies have shown that they not only produce results with higher in-depth resolution but also effectively reduce ghost artifacts[3]. Among them, algebraic algorithms such as the simultaneous algebraic reconstruction technique (SART)[4] and statistical methods based on maximum likelihood (ML), e.g., ordered-subset expectation maximization (OSEM)[5], are popular candidates. However, these iterative methods are computationally very intensive because of the massive dimension of the system matrix that models the imaging process. Moreover, when coupled with high-resolution projection images from the latest flat-panel detectors, reconstruction takes a very long time with CPU-based solutions. This has prevented iterative algorithms from being applied in time-critical clinical applications.

Recently, graphics processing units (GPUs) have emerged as a popular platform for numerous computationally intensive tasks thanks to their low-cost, commodity parallel-computing architecture. Various scientific applications, including a wide range of medical imaging modalities, have successfully used GPUs to boost their performance. Using GPUs for tomographic reconstruction is particularly attractive because the algorithms map effectively onto the hardware and the speedups are significant. Both analytical and iterative methods have been accelerated on GPUs, with performance competitive with other popular parallel platforms (Cell, FPGA, and so on)[6-10]. The most significant speedups were observed for iterative applications because of their high computational complexity. Specifically, forward projection, being the most frequently used component, has been extensively investigated. One category comprises exact algorithms such as Siddon’s method[11], in which the intersection lengths between each voxel and the ray are computed; various acceleration schemes have been proposed to make it amenable to parallel computing[12-14]. The other popular category is the grid-interpolated method, in which sample points are equally spaced along the ray[15].

Received: 2009-10-18; revised: 2009-12-29

** To whom correspondence should be addressed. E-mail: [email protected]

Despite the many successful applications of GPUs, hardware limitations and an evolving programming interface have often prevented straightforward implementations of general-purpose computations. Before render-to-3-D-texture functionality was supported on the latest hardware, the reconstructed volume usually had to be represented as stacks of 2-D textures. Hence, either approximate methods that lose image quality were used[10], or extra memory addressing and management schemes had to be introduced to implement 3-D ray-driven forward projection[16]. To cope with this issue, we designed a simplified GPU-based reconstruction framework that takes advantage of the latest functionality to achieve optimal speedups without compromising image quality.

1 Theory

The procedure of tomographic imaging can be modeled by the following equation:

$$p_i = \sum_{j=1}^{N} v_j w_{ij}, \qquad i = 1, 2, \ldots, M \tag{1}$$

Here the pixel value $p_i$ on the image plane is the line integral computed for the $i$-th ray over the voxels $v_j$ it traverses in the volume. $w_{ij}$ is the contribution factor of voxel $v_j$ to pixel $p_i$, while $M$ and $N$ are the total numbers of pixels and voxels. Iterative methods solve this system of equations by numerical approximation: projections of the current volume estimate are compared with the ground truth (scanner images) to derive correction images, which are in turn backprojected to update the current volume estimate. By repeating these steps, the difference between the estimate and the ground truth is reduced until convergence is reached. Equation (2) describes the SART algorithm adopted in this framework:

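$$v_j^{k+1} = v_j^{k} + \frac{\sum_{p_i \in P} \frac{p_i - \mathit{fp}_i}{\mathit{sow}_i}\, w_{ij}}{\mathit{sow}_j} = v_j^{k} + \frac{\sum_{p_i \in P} \left( \frac{p_i - \sum_{l=1}^{N} v_l^{k} w_{il}}{\sum_{l=1}^{N} w_{il}} \right) w_{ij}}{\sum_{p_i \in P} w_{ij}} \tag{2}$$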

Here, $\mathit{fp}_i$ is the pixel value calculated by forward projection, and $\mathit{sow}_i$ and $\mathit{sow}_j$ are the sums of weights of all voxels with respect to pixel $p_i$ and of voxel $v_j$ with respect to all pixels on the projection, respectively. The algorithm requires three major components: forward projection of the current estimated volume, computation of the correction image from the scanner image and the forward projection, and backward projection of the correction image into the estimated volume. Similarly, the OSEM algorithm can be formulated as
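The three SART components can be illustrated on a tiny dense system. The following is a minimal CPU sketch of Eq. (2) in plain Python, not the GPU implementation of this paper; the function names, the dense list-of-rows weight matrix, and the 2 x 2 toy phantom are assumptions made only for illustration.

```python
def forward_project(weights, volume):
    """fp_i = sum_j v_j * w_ij (Eq. (1)); weights is an M x N list of rows."""
    return [sum(w * v for w, v in zip(row, volume)) for row in weights]


def sart_update(weights, measured, volume):
    """One SART pass over all rays in P, following Eq. (2)."""
    n_voxels = len(volume)
    fp = forward_project(weights, volume)
    sow_i = [sum(row) for row in weights]  # weight sum per ray (pixel)
    sow_j = [sum(row[j] for row in weights) for j in range(n_voxels)]  # per voxel
    updated = []
    for j in range(n_voxels):
        # Backproject the normalized correction image onto voxel j.
        correction = sum(
            (p - f) / s * row[j]
            for row, p, f, s in zip(weights, measured, fp, sow_i)
            if s > 0
        )
        updated.append(volume[j] + (correction / sow_j[j] if sow_j[j] > 0 else 0.0))
    return updated


# Hypothetical toy system: a 2 x 2 image seen by two row rays and two column rays.
W = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
truth = [1.0, 2.0, 3.0, 4.0]
p = forward_project(W, truth)  # plays the role of the scanner images

v = [0.0] * 4
for _ in range(50):
    v = sart_update(W, p, v)
residual = max(abs(a - b) for a, b in zip(forward_project(W, v), p))
```

On this toy system the projection residual halves with every pass, so the estimate settles quickly; real system matrices are far too large to store densely, which is why the weights are computed on the fly by the projector.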

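$$v_j^{k+1} = \frac{v_j^{k}}{\mathit{sow}_j} \sum_{p_i \in P_{\mathrm{set}}} \frac{p_i}{\mathit{fp}_i}\, w_{ij} = \frac{v_j^{k}}{\sum_{p_i \in P_{\mathrm{set}}} w_{ij}} \sum_{p_i \in P_{\mathrm{set}}} \frac{p_i\, w_{ij}}{\sum_{l=1}^{N} v_l^{k} w_{il}} \tag{3}$$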

These algorithms iterate through the forward projection, correction computation, backward projection, and volume update stages until convergence is reached. Convergence is usually defined by a threshold on the difference either between successively reconstructed volumes or between the forward-projected images of the reconstructed volume and the input scanner projections.
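The iteration loop and the stopping rule can be sketched for OSEM as follows. This is a minimal plain-Python illustration of Eq. (3), assuming a dense weight matrix, a strictly positive starting volume, and a threshold on the change between successive volumes; the function names, the tolerance, and the toy data are hypothetical and are not taken from the framework.

```python
def osem_subset_update(weights, measured, volume, subset):
    """Apply the multiplicative update of Eq. (3) using the rays in `subset`."""
    fp = {i: sum(w * v for w, v in zip(weights[i], volume)) for i in subset}
    updated = []
    for j, v_j in enumerate(volume):
        sow_j = sum(weights[i][j] for i in subset)
        if sow_j == 0:
            updated.append(v_j)  # voxel not seen by this subset: keep it
            continue
        ratio_sum = sum(measured[i] / fp[i] * weights[i][j]
                        for i in subset if fp[i] > 0)
        updated.append(v_j * ratio_sum / sow_j)
    return updated


def osem(weights, measured, subsets, tol=1e-6, max_iters=500):
    """Run OSEM until successive volumes differ by less than `tol`."""
    volume = [1.0] * len(weights[0])  # strictly positive starting estimate
    for iteration in range(1, max_iters + 1):
        previous = volume
        for subset in subsets:  # one volume update per subset
            volume = osem_subset_update(weights, measured, volume, subset)
        if max(abs(a - b) for a, b in zip(volume, previous)) < tol:
            break
    return volume, iteration


# Hypothetical toy system: two row rays (subset 0) and two column rays (subset 1).
W = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
p = [3.0, 7.0, 4.0, 6.0]
volume, iterations = osem(W, p, subsets=[[0, 1], [2, 3]])
reprojected = [sum(w * v for w, v in zip(row, volume)) for row in W]
```

The other stopping criterion mentioned above, comparing `reprojected` with the measured projections, could be substituted for the volume-difference test without changing the rest of the loop.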

2 Implementation

2.1 Forward projection using Siddon’s ray-driven projector

We implemented Siddon’s forward projection method on the GPU based on the improved incremental algorithm described in Refs. [12] and [13]. A preprocessing step is first performed to generate s-direction vectors of detector rays, as well as their intersection locations (entry and exit) with the volume bounding box. The first intersected voxel is then calculated to obtain the initial parametric information of the ray. The incremental algorithm is then applied to traverse the ray until it has visited all of its intersection points with voxels. The core of the algorithm compares each ray’s current parametric components along the X, Y, and Z axes to decide in which direction the ray will encounter the next closest intersection point. Figure 1 shows the pseudocode for the GPU implementation.
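The traversal described above can be sketched on the CPU as follows. This is a minimal plain-Python illustration of an incremental Siddon-style ray march, assuming unit-sized voxels, a volume stored as nested lists `volume[x][y][z]`, and a ray given by two end points; the function name and data layout are hypothetical and do not reproduce the GLSL shader used in the framework.

```python
import math


def siddon_ray_sum(volume, src, dst):
    """Exact line integral through volume[x][y][z] along the segment src -> dst.

    Voxels are unit cubes filling [0, nx] x [0, ny] x [0, nz].
    """
    dims = (len(volume), len(volume[0]), len(volume[0][0]))
    direction = [d - s for s, d in zip(src, dst)]
    ray_length = math.sqrt(sum(c * c for c in direction))

    # Preprocessing: entry/exit parameters of the ray with the bounding box.
    a_min, a_max = 0.0, 1.0
    for s, d, n in zip(src, direction, dims):
        if d == 0:
            if not 0 <= s <= n:
                return 0.0  # parallel to this axis and outside the box
            continue
        a_near, a_far = sorted(((0 - s) / d, (n - s) / d))
        a_min, a_max = max(a_min, a_near), min(a_max, a_far)
    if a_min >= a_max:
        return 0.0

    # First intersected voxel and, per axis, the parameter of the next plane.
    index, step, next_a, delta_a = [], [], [], []
    for s, d, n in zip(src, direction, dims):
        entry = s + a_min * d
        if d < 0:
            i = min(max(math.ceil(entry) - 1, 0), n - 1)
        else:
            i = min(max(math.floor(entry), 0), n - 1)
        index.append(i)
        step.append(1 if d > 0 else -1)
        if d == 0:
            next_a.append(math.inf)  # this axis never produces a crossing
            delta_a.append(math.inf)
        else:
            next_a.append(((i + 1 if d > 0 else i) - s) / d)
            delta_a.append(1.0 / abs(d))

    # Incremental traversal: the smallest next_a tells which plane comes next.
    ray_sum, a_cur = 0.0, a_min
    while a_cur < a_max:
        axis = min(range(3), key=next_a.__getitem__)
        a_next = min(next_a[axis], a_max)
        voxel_value = volume[index[0]][index[1]][index[2]]
        ray_sum += voxel_value * (a_next - a_cur) * ray_length  # value * length
        a_cur = a_next
        index[axis] += step[axis]
        next_a[axis] += delta_a[axis]
        if not 0 <= index[axis] < dims[axis]:
            break  # the ray has left the volume
    return ray_sum


# Usage: a uniform 4^3 cube and a ramp volume whose value is x + 1.
ones = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
ramp = [[[float(x + 1)] * 4 for _ in range(4)] for x in range(4)]
diagonal = siddon_ray_sum(ones, (-1, -1, -1), (5, 5, 5))
along_x = siddon_ray_sum(ramp, (-1, 0.5, 0.5), (5, 0.5, 0.5))
```

For the uniform cube the diagonal ray sum equals the chord length through the cube, which gives a simple check of the intersection-length weighting.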

    getRayDirectionAndEntryExitInfo();
    getFirstIntersectedVoxel();
    raySum = sampleFirstVoxel();
    for (s = 0; s < intersectPointNumber; s++) {
        marchRay();
        voxelValue = sampleCurrentVoxel();
        raySum += voxelValue*voxelRayLength;
    }

Fig. 1 Pseudocode for the GPU implementation

Traditional ray-driven forward projectors on GPUs mostly use the grid-interpolated scheme, in which sample points are equally spaced along the ray, independent of the viewing angle. Interpolation is commonly performed with a nearest-neighbour or trilinear filter, and the trapezoidal rule is applied for integration. Although Siddon’s method uses a box interpolation filter, it uses the ray-voxel intersection length as the weighting factor for integration, and it has been shown to perform better than many popular forward projectors, particularly on low-frequency data[15]. To improve image quality, a grid-interpolated scheme has to reduce the sampling distance and thus increase the total number of sample points, which raises the computational cost and significantly degrades forward projection performance. Figures 2 and 3 compare forward projections of a uniform cube dataset generated with Siddon’s scheme and the grid-interpolated scheme. As the figures show, the grid-interpolated scheme produces jagged edges, and a sampling step size of 0.2 is required to generate a visually smooth curve that approximates the ground truth, which is identical to what Siddon’s method produces.

Fig. 2 Forward projections of a uniform cube dataset using Siddon’s method and grid-interpolated method

Fig. 3 Magnified view of Fig. 2

In terms of complexity, nearest-neighbour sampling has been shown to be insufficient to produce good results despite its simplicity. Trilinear interpolation improves the image quality but requires many more arithmetic operations per sampling point. In our GPU implementation, each trilinear sampling operation breaks down to approximately 15 additions/subtractions, 3 floor/ceiling operations, and 7 multiplications. In contrast, our implementation of Siddon’s forward projection uses 2 comparisons to find the next intersection point, and 3 additions/subtractions and 1 multiplication to obtain the ray-voxel intersection length. Although Siddon’s method generally collects more sampling points than a grid-interpolated method with a step size of 1 because of its uneven spacing, the reduced computational complexity still yields better performance. As shown in Table 1, forward projection using Siddon’s method always outperforms trilinear interpolation at every sampling frequency tested, with a speedup ranging from 56% to about 6-fold.

Table 1 Forward projection performance on GPU. A group of 52 projections at a resolution of $512^2$ is generated from a volume of resolution $256^3$.

Forward Proj. Method          Time (s)   Speedup
Siddon                        1.8        N/A
Trilinear (step size = 1.0)   2.8        56%
Trilinear (step size = 0.5)   5.3        194%
Trilinear (step size = 0.2)   12.6       600%
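The text does not define the Speedup column of Table 1 explicitly. The short check below assumes it is the extra time a trilinear projector needs relative to Siddon’s method, expressed in percent; under that assumption the rounded values match the table. The function name is hypothetical.

```python
def relative_speedup(siddon_time, other_time):
    """Extra time of the other projector relative to Siddon's method, in percent."""
    return (other_time / siddon_time - 1.0) * 100.0


siddon = 1.8  # seconds, Table 1
trilinear = {1.0: 2.8, 0.5: 5.3, 0.2: 12.6}  # step size -> seconds, Table 1
speedups = {step: round(relative_speedup(siddon, t)) for step, t in trilinear.items()}
```

Read this way, a speedup of 600% means that the trilinear projector with a step size of 0.2 takes seven times as long as Siddon’s method.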

2.2 Backward projector and correction

As in previous implementations, we adopt the voxel-driven approach for backward projection, which has proved to be an efficient method amenable to parallel computing[6]. In detail, each voxel of the volume finds its projected location on the projection image using the geometry matrix. The detector image is then sampled at that location using either a nearest-neighbour or a bilinear interpolation kernel. In our framework we take advantage of the render-to-3-D-texture functionality to update the volume slice by slice. In practice, two copies of the volumetric dataset are needed to avoid simultaneous reads and writes of the same texture.
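A voxel-driven backprojection step can be sketched as below. This is a minimal plain-Python illustration that assumes a 3 x 4 homogeneous geometry matrix mapping voxel coordinates to detector coordinates, bilinear sampling of the correction image, and an additive update without normalization; the function names and the orthographic test geometry are hypothetical. Returning a new volume rather than updating in place mirrors the two-copy scheme described above.

```python
def project_voxel(geometry, x, y, z):
    """Map a voxel position to detector coordinates (u, v) with a 3 x 4 matrix."""
    hom = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in geometry]
    return hom[0] / hom[2], hom[1] / hom[2]


def bilinear_sample(image, u, v):
    """Bilinearly interpolate image[row][col] at column u, row v; 0 outside."""
    rows, cols = len(image), len(image[0])
    if not (0 <= u <= cols - 1 and 0 <= v <= rows - 1):
        return 0.0
    c0, r0 = min(int(u), cols - 2), min(int(v), rows - 2)
    fu, fv = u - c0, v - r0
    top = image[r0][c0] * (1 - fu) + image[r0][c0 + 1] * fu
    bottom = image[r0 + 1][c0] * (1 - fu) + image[r0 + 1][c0 + 1] * fu
    return top * (1 - fv) + bottom * fv


def backproject(volume, correction, geometry):
    """Every voxel gathers one sample from the correction image (read-only input)."""
    return [[[value + bilinear_sample(correction, *project_voxel(geometry, x, y, z))
              for z, value in enumerate(column)]
             for y, column in enumerate(plane)]
            for x, plane in enumerate(volume)]


# Hypothetical orthographic geometry looking along z: u = x, v = y.
geometry = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
correction = [[1.0, 2.0], [3.0, 4.0]]  # indexed as correction[v][u]
volume = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
updated = backproject(volume, correction, geometry)
```

Because each output voxel only reads from the correction image and the old volume, all voxels can be processed independently, which is what makes the voxel-driven form a good fit for parallel hardware.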

When the SART algorithm is used, sum-of-weight images are required to normalize the difference projection calculated from the scanner and projected images during the correction step. We combine the computation of the weight images with the normal forward projection stage: an extra accumulation is performed during the ray traversal with a constant sample value of 1. This extra value is stored in the second channel of the projection image, so the weight image is generated on the fly. This combination does not introduce any performance penalty, since the extra accumulation effort is minimal and storing to and reading from multiple color channels are well parallelized on the GPU.
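The two-channel accumulation can be illustrated for a single ray. In this minimal plain-Python sketch, the list of (voxel value, intersection length) pairs stands in for the output of the ray traversal, and the returned pair stands in for the two channels of a projection pixel; both are assumptions made for illustration, as are the function names.

```python
def forward_project_with_weights(segments):
    """Accumulate (ray sum, weight sum) for one ray in a single traversal.

    `segments` is a list of (voxel_value, intersection_length) pairs.
    """
    ray_sum = 0.0     # channel 1: line integral through the volume
    weight_sum = 0.0  # channel 2: same traversal with a constant sample value of 1
    for voxel_value, length in segments:
        ray_sum += voxel_value * length
        weight_sum += 1.0 * length
    return ray_sum, weight_sum


def sart_correction(measured, ray_sum, weight_sum):
    """Normalized difference (p_i - fp_i) / sow_i used in the correction step."""
    return (measured - ray_sum) / weight_sum if weight_sum > 0 else 0.0


# Hypothetical ray crossing three voxels.
fp_i, sow_i = forward_project_with_weights([(2.0, 0.5), (4.0, 1.0), (1.0, 0.5)])
corr_i = sart_correction(7.5, fp_i, sow_i)
```

The second accumulation reuses the intersection lengths already computed for the first, so it adds only one extra addition per visited voxel.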

2.3 Integration of components

SART and OSEM share the components described in Sections 2.1 and 2.2. A typical SART iteration consists of one forward projection, one correction image calculation, and one backward projection. OSEM additionally needs a separate volume to store the temporary results generated from subset images during the backprojection stage, as well as an extra procedure to update the current volume being reconstructed.

3 Results

The framework was tested on a GeForce 8800 GT graphics card (NVIDIA, Santa Clara, CA) with 512 MB of video memory. The OpenGL Shading Language (GLSL) was used to program the GPU. The CPU host code was compiled with Microsoft Visual Studio 8.0 running on an Intel Xeon 1.86 GHz PC.

The first experiment ran SART and OSEM with Siddon’s forward projector on two phantom datasets. The configurations and results of the experiments are recorded in Table 2. Note that the reported times do not include the disk operation time consumed by transferring projections to GPU memory and downloading volume data to main memory after reconstruction. Reconstructed volume slices of phantom No. 2 and of a lung dataset are shown in Figs. 4 and 5, respectively.

The experiment shows that when reconstructing a $256^3$ volume from 52 projections at $512^2$ resolution, 3 to 5 iterations of SART, or of OSEM using 4 subsets, can produce good results. Therefore, a typical tomosynthesis study can be completed within 10 to 30 seconds after the projection images are acquired, depending on the algorithm used. With the introduction of top-of-the-line GPUs, such as NVIDIA’s 9 series and 200 series, we expect interactive reconstruction to become possible thanks to the scalability provided by modern graphics architectures. Both a quantitative and a visual comparison show that the GPU framework produces results with no degradation in quality compared with a CPU-based program, which was also confirmed in Ref. [7]. Table 2 also shows that OSEM is generally slower than SART because of the extra update procedure introduced by subset reconstruction, which incurs expensive write-to-3-D-texture operations. Moreover, the reconstruction time of OSEM increases with the number of subsets used.

(a) SART, 3 iterations (b) SART, 5 iterations (c) OSEM, 13 subsets, 1 iteration (d) OSEM, 4 subsets, 3 iterations

Fig. 4 Reconstructed phantoms from SART and OSEM using Siddon’s forward projector. A volume of $256^3$ resolution and 52 projections at a resolution of $512^2$ were used.

(a) Transverse plane (b) Coronal plane

Fig. 5 Reconstruction of a lung dataset using SART. Three iterations are applied on a $256^3$ volume and 200 projection images of $1024^2$.

Table 2 Reconstruction performance on two phantom datasets using SART and OSEM with Siddon’s method on a GeForce 8800 GT GPU

Dataset             Phantom 1   Phantom 2
Volume Res.         $256^3$     $256^3$
Projection Res.     $256^2$     $512^2$
Acquisition Num.    160         52
SART                4.3 s       3.5 s
OSEM (4 subsets)    11.0 s      9.8 s
OSEM (10 subsets)   19.4 s      N/A
OSEM (13 subsets)   N/A         22.9 s
OSEM (40 subsets)   61.6 s      N/A

Table 3 compares the reconstruction performance of SART and OSEM using Siddon’s method and trilinear interpolation with various sampling distances. As in Table 1, reconstructions using Siddon’s method in the forward projector perform better than those using trilinear interpolation. For SART reconstruction, Siddon’s method gains a speedup of 5% over the trilinear implementation with unit sampling distance, and a speedup of up to about 2-fold (193%) when a denser trilinear sampling scheme is used. Compared with Table 1, the speedups are less significant because the backprojection stage is typically time-consuming and occupies much of the reconstruction process. This reduction is even more pronounced for OSEM because it involves more volume update operations.

Table 3 Comparison of reconstruction performance on a phantom dataset using various forward projectors. A volume of $256^3$ resolution and 160 projections at a resolution of $256^2$ were used. Values in parentheses for the trilinear methods indicate the sampling distance. Projections are divided into 10 subsets for the OSEM experiments.

Forward Proj. Method      Time (s)   Speedup
Siddon (SART)             4.3
Trilinear (1.0) (SART)    4.5        5%
Trilinear (0.5) (SART)    6.4        49%
Trilinear (0.2) (SART)    12.6       193%
Siddon (OSEM)             19.4
Trilinear (0.5) (OSEM)    19.9       2.6%
Trilinear (0.2) (OSEM)    21.6       11.3%

4 Conclusions

In this paper we proposed a GPU-accelerated, high-performance image reconstruction framework that supports the SART and OSEM algorithms. The framework employs Siddon’s method for the forward projector and shows various degrees of speedup over previous implementations that use grid-interpolated methods. Benefiting from the 3-D texture rendering functionality offered by the latest GPU hardware, our framework can easily accommodate various ray-driven modalities and can easily be extended to incorporate and simulate advanced physical effects, such as attenuation correction and scattering.

References

[1] Feldkamp L A, Davis L C, Kress J W. Practical cone beam algorithm. Journal of the Optical Society of America A, 1984, 1: 612-619.

[2] Dobbins J T, Godfrey D J. Digital X-ray tomosynthesis: Current state of the art and clinical potential. Physics in Medicine and Biology, 2003, 48: R65-R106.

[3] Reiser I, Bian J, Nishikawa R M, et al. Comparison of reconstruction algorithms for digital breast tomosynthesis. In: Proceedings of the 9th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine. Lindau, Germany, 2007: 155-158.

[4] Andersen A, Kak A. Simultaneous algebraic reconstruction technique (SART): A superior implementation of the ART algorithm. Ultrasonic Imaging, 1984, 6: 81-94.

[5] Hudson H, Larkin R. Accelerated image reconstruction using ordered subsets of projection data. IEEE Transactions on Medical Imaging, 1994, 13: 601-609.

[6] Knaup M, Kachelrieß M. Acceleration techniques for 2D parallel and 3D perspective forward and backprojections. In: Proceedings of the 9th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine. Lindau, Germany, 2007: 45-48.

[7] Pratx G, Chinn G, Habte F, et al. Acceleration of fully 3-D list-mode OSEM for high-resolution PET using graphics processing units. In: Proceedings of the 9th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine. Lindau, Germany, 2007: 41-44.

[8] Wang Z, Han G, Li T, et al. Speedup OS-EM image reconstruction by PC graphics card technologies for quantitative SPECT with varying focal-length fan-beam collimation. IEEE Transactions on Nuclear Science, 2005, 52(5): 1274-1280.

[9] Xu F, Khamene A, Fluck O. High performance tomosynthesis enabled via a GPU-based iterative reconstruction framework. In: Proceedings of SPIE. San Diego, USA, 2009, 7258: 72585A-8.

[10] Xu F, Mueller K. Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware. IEEE Transactions on Nuclear Science, 2005, 52(3): 654-663.

[11] Siddon R L. Fast calculation of the exact radiological path for a three-dimensional CT array. Medical Physics, 1985, 12(2): 252-255.

[12] Christiaens M, Sutter B D, Bosshere K D, et al. A fast, cache-aware algorithm for the calculation of radiological paths exploiting subword parallelism. Journal of Systems Architecture, 1999, 45(10): 781-790.

[13] Jacobs F, Sundermann E, De Sutter B, et al. A fast algorithm to calculate the exact radiological path through a pixel or voxel space. Journal of Computing and Information Technology, 1998, 6(1): 89-94.

[14] Zhao H, Reader A J. Fast ray-tracing technique to calculate line integral paths in voxel arrays. In: Proceedings of IEEE Nuclear Science Symposium. Portland, USA, 2003: 211-218.

[15] Xu F, Mueller K. A comparative study of popular interpolation and integration methods for use in computed tomography. In: Proceedings of IEEE 2006 International Symposium on Biomedical Imaging. Arlington, VA, 2006: 1252-1255.

[16] Després P, Rinkel J, Hasegawa B H, et al. Stream processors: A new platform for Monte Carlo calculations. Journal of Physics: Conference Series, 2008, 102: 012007.
