Implementation of Voxel Volume Projection Operators Using CUDA
Applications to Iterative Cone-Beam CT Reconstruction
Three questions to keep in mind during the presentation
1. Name three differences between the GPU and the CPU that contribute to the observed performance improvements.
2. Scattered accumulation (a[ix] = a[ix] + b) can result in reduced performance and unexpected results in CUDA. Why? Please state two possible problems!
3. Suppose that each thread in a CUDA program is responsible for writing to one unique output element. Give one possible solution for handling ”odd sized” output data.
What are voxel volume projectors?
• Object: f
• Projection data: p (line integrals through f)
• Forward projection: P, mapping f to p
• Backprojection: P^T, mapping p to f
[Figure: X-ray source, detector, object f, and projection data p.]
The tomography problem
• Given p, find an image/volume f such that the difference between Pf and p is small in some sense, e.g. z(f) = ||Pf - p||^2 attains a minimum.
• Steepest descent approach:
  – Gradient: ∇z(f) = 2P^T(Pf - p)
  – Update step: f_{k+1} = f_k - 2P^T(Pf_k - p)
• We use this method only for illustration of why P and P^T are needed (a minimal host-side sketch follows below). In practice, faster and better regularized methods are needed.
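To make the role of P and P^T concrete, the following host-side loop sketches how the update step above could be driven, assuming forward projection and backprojection routines like the ones described in the rest of this presentation. The function names forwardProject, backProject and axpy, and the explicit step length lambda, are assumptions for illustration and not part of the original implementation.

#include <cuda_runtime.h>

// Assumed interfaces for the operators described in these slides (not part of the original code):
void forwardProject(const float* d_volume, float* d_projection);   // p = P f
void backProject(const float* d_projection, float* d_volume);      // f = P^T p
void axpy(float* d_y, const float* d_x, float a, int n);           // y = y + a*x (simple CUDA kernel)

// Minimal sketch of the steepest descent iteration f_{k+1} = f_k - 2*lambda*P^T(P f_k - p).
// The step length lambda is an assumption; the slide above writes the update without it.
void steepestDescent(float* d_f, const float* d_p,
                     int nVoxels, int nRays, int nIter, float lambda)
{
    float* d_residual;   // P f_k - p, lives in projection space
    float* d_gradient;   // P^T (P f_k - p), lives in voxel space
    cudaMalloc(&d_residual, nRays * sizeof(float));
    cudaMalloc(&d_gradient, nVoxels * sizeof(float));

    for (int k = 0; k < nIter; ++k) {
        forwardProject(d_f, d_residual);                 // residual = P f_k
        axpy(d_residual, d_p, -1.0f, nRays);             // residual = P f_k - p
        backProject(d_residual, d_gradient);             // gradient = P^T (P f_k - p)
        axpy(d_f, d_gradient, -2.0f * lambda, nVoxels);  // f_{k+1} = f_k - 2*lambda*gradient
    }
    cudaFree(d_residual);
    cudaFree(d_gradient);
}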
Why implement P and P^T on the GPU?
• Reasonable sizes of f and p are around 2 GB, implying a size of 2 GB x 2 GB for P and P^T.
• Although these matrices are sparse, storing the nonzero elements would require approximately 2000 GB, i.e., matrix-vector products involving these matrices are computationally demanding.
• The GPU offers many features that help speed up the computations:
  – Massive parallelism
  – Hardware linear interpolation
  – Very high memory bandwidth
  – Texture caches ”optimized for 2D locality”
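The hardware interpolation and the ”2D locality” texture caches listed above are used through a cudaArray-backed 3D texture; the kernel shown later reads the volume with tex3D(devT_reconVolume, ...). The snippet below is a minimal sketch of how such a texture could be set up with the legacy texture-reference API of the CUDA SDK era; the setup code itself is an assumption, and only the texture name devT_reconVolume is taken from the kernel.

#include <cuda_runtime.h>

// Texture reference read by the forward projection kernel (see the tex3D call later).
texture<float, 3, cudaReadModeElementType> devT_reconVolume;

// Minimal sketch: upload a volume of size (nx, ny, nz) into a 3D cudaArray and
// bind it with hardware trilinear interpolation enabled.
void bindVolumeTexture(const float* h_volume, int nx, int ny, int nz)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(nx, ny, nz);

    cudaArray* d_volumeArray;
    cudaMalloc3DArray(&d_volumeArray, &desc, extent);

    cudaMemcpy3DParms copyParams = {0};
    copyParams.srcPtr   = make_cudaPitchedPtr((void*)h_volume, nx * sizeof(float), nx, ny);
    copyParams.dstArray = d_volumeArray;
    copyParams.extent   = extent;
    copyParams.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copyParams);

    devT_reconVolume.normalized = false;                 // index with voxel coordinates
    devT_reconVolume.filterMode = cudaFilterModeLinear;  // hardware trilinear interpolation
    devT_reconVolume.addressMode[0] = cudaAddressModeClamp;
    devT_reconVolume.addressMode[1] = cudaAddressModeClamp;
    devT_reconVolume.addressMode[2] = cudaAddressModeClamp;
    cudaBindTextureToArray(devT_reconVolume, d_volumeArray, desc);
}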
The Joseph forward projector [1]
• Illustration in the 2D case:
• Generalization to 3D is straightforward. Instead of linear interpolation, bilinear interpolation is used.
[1] Joseph, P. M., An improved algorithm for reprojecting rays through pixel images, IEEE Transactions on Medical Imaging, 1982, MI-1, 192-196
[Figure: 2D illustration with interpolated samples x0, x1, x2, x3 along the ray and step length L.]
p_i = L(x0 + x1 + x2 + x3)
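For concreteness, a minimal CPU reference of the 2D case sketched above could look as follows; the image layout, the choice of stepping along the y direction, and the function name are assumptions made only for illustration.

#include <math.h>

// Minimal 2D Joseph projector sketch (assumptions: row-major image f of size
// nx*ny, unit pixel spacing, ray more parallel to the y axis than to the x axis).
// The ray is sampled once per image row; at each row the two neighbouring pixels
// are linearly interpolated, and the sum is scaled by the path length L per step.
float josephRay2D(const float* f, int nx, int ny,
                  float xEntry,    /* x coordinate where the ray crosses row 0 */
                  float dxPerRow)  /* change in x per row along the ray        */
{
    float L = sqrtf(1.0f + dxPerRow * dxPerRow);  /* path length per row crossing */
    float sum = 0.0f;
    float x = xEntry;
    for (int iy = 0; iy < ny; ++iy) {
        int ix = (int)floorf(x);
        float w = x - (float)ix;                  /* linear interpolation weight */
        if (ix >= 0 && ix + 1 < nx)
            sum += (1.0f - w) * f[iy * nx + ix] + w * f[iy * nx + ix + 1];
        x += dxPerRow;
    }
    return L * sum;                               /* p_i = L(x0 + x1 + x2 + ...) */
}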
Implementation sketch for P
• We choose to let one thread correspond to one ray.
• For each ray/thread:
  1. Determine the ray origin and direction.
  2. Determine the ray and voxel volume intersection.
  3. Step along the ray and accumulate values from the voxel volume, using the 3D texture cache.
  4. Multiply with L and store to the output element.
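A skeleton of how these four steps might map onto one CUDA thread is sketched below; the kernel name and signature are assumptions, but the channel and row indices chIx and rowIx correspond to those used in the code on the following slides.

// Sketch of the forward projection kernel structure (assumed name and signature).
// One thread corresponds to one ray, i.e. one detector element.
__global__ void forwardProjKernel(float* projectionData, int Nch, int Nrows)
{
    int chIx  = blockIdx.x * blockDim.x + threadIdx.x;  // detector channel index
    int rowIx = blockIdx.y * blockDim.y + threadIdx.y;  // detector row index

    // 1. Determine the ray origin and direction (see Implementation details 1).
    // 2. Determine the ray / voxel volume intersection (tnear, tfar).
    // 3. Step along the ray, accumulating values via the 3D texture (see Implementation details 2).
    // 4. Multiply by the step length and store to projectionData[rowIx*Nch + chIx].
}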
Handling of ”odd” detector sizes
• The x-ray detector is divided into 2D blocks corresponding to CUDA thread blocks, using a static block size (16x2).
• To handle the case where the detector dimensions are not divisible by the block size, conditional statements were used:
• Although this reduces efficiency due to divergent programs, the reduction is small for detectors with a reasonably large number of channels and rows.
// Return if outside detector
if (rowIx >= Nrows) return;
if (chIx >= Nch) return;
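On the host side, the grid then has to be rounded up so that the whole detector is covered even when Nch and Nrows are not multiples of the block dimensions; a common way to do this is an integer ceiling division, sketched below using the assumed kernel name from the skeleton above.

// Host-side launch sketch: cover an Nch x Nrows detector with 16x2 thread blocks,
// rounding the grid size up so that "odd" detector sizes are fully covered.
dim3 blockDim(16, 2);
dim3 gridDim((Nch   + blockDim.x - 1) / blockDim.x,
             (Nrows + blockDim.y - 1) / blockDim.y);
forwardProjKernel<<<gridDim, blockDim>>>(projectionData, Nch, Nrows);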
Implementation details 1: Calculation of source and detector positions
// Calculate focus center and displacement vectors
float alpha = theta + D_ALPHA(chIx);
float3 fc = make_float3(Rf*cos(alpha), Rf*sin(alpha), z0+D_Z(chIx));
float3 fw = make_float3(-sin(alpha), cos(alpha), 0.f);
float3 fh = make_float3(-cos(alpha)*sin(aAngle), -sin(alpha)*sin(aAngle), cos(aAngle));

// Calculate detector center and displacement vectors
float3 dc = fc + (Rf+Rd) * make_float3(-cos(theta), -sin(theta), D_SLOPE(rowIx));
float3 dw = make_float3(sin(theta), -cos(theta), 0.f);
float3 dh = make_float3(0.f, 0.f, 1.f);

// Calculate focus position in texture index coordinates
float3 f = ((fc + fOffset.x*fw + fOffset.y*fh) - origin) / spacing;

// Calculate detector position in texture index coordinates
float3 d = ((dc + dOffset.x*dw + dOffset.y*dh) - origin) / spacing;

// Create ray struct
Ray ray;
ray.o = f;
ray.d = normalize(d - f);
Code based on NVIDIA CUDA SDK ”volumeRender”
Implementation details 2: Accumulation of contributions
// Accumulate contributions along ray
float dValue = 0;
float t = tnear;
for (int i = 0; i < dimensions.x; i++) {
    // Calculate current position
    float3 pos = ray.o + ray.d*t + 0.5f;  // texture origin: (0.5f, 0.5f, 0.5f)

    // Read from 3D texture
    dValue += tex3D(devT_reconVolume, pos.x, pos.y, pos.z);

    // Increase t
    t += tstep;
    if (t >= tfar) break;
}

// Update detector data value
dValue *= length(ray.d*spacing)*tstep;
projectionData[rowIx*Nch+chIx] += dValue;
Code based on NVIDIA CUDA SDK ”volumeRender”
Experiments – forward projection (P)
• Input data dimensions: 512x512x257 floats
• Output data dimensions: 672x24x3000 floats
• Only very small differences in accuracy, without practical implications, occur between the CPU and GPU implementations.
• Calculation times
  – CPU: 2500 s
  – GPU: 47 s
• Speed-up factor: approximately 50x
• For a larger collection of problems, speedups between 20x and 50x have been observed.
Efficient implementation of the exact adjoint operator P^T is much trickier. Why?
• 2D illustration of the procedure:
[Figure: 2D illustration of the value p_i being accumulated back into the voxel volume.]
• Same interpolation coefficients as for P, but with scattered accumulation instead of reading.
• No hardware interpolation. No 2D/3D textures.
• A new parallelization setup is needed: one ray per thread leads to corrupted results (see the sketch below).
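The corruption comes from unsynchronized read-modify-write cycles: if two rays hit the same voxel, both threads may read the old value, add their contribution, and write back, so one contribution is lost. Below is a minimal sketch of the hazard and of one possible (but slow) remedy, atomicAdd; the kernel and variable names are assumptions for illustration.

// Hazard: scattered accumulation from many threads into the same voxel.
__global__ void accumulateUnsafe(float* volume, const int* voxelIx,
                                 const float* contribution, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Read-modify-write without synchronization: if two threads target the
    // same voxel, one of the additions can be lost.
    volume[voxelIx[i]] = volume[voxelIx[i]] + contribution[i];
}

// One possible remedy: atomic accumulation (correct, but serializes conflicting
// updates and can therefore be slow). Float atomicAdd in global memory requires
// compute capability 2.0 or later.
__global__ void accumulateAtomic(float* volume, const int* voxelIx,
                                 const float* contribution, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    atomicAdd(&volume[voxelIx[i]], contribution[i]);
}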
One slow but exact implementation of the transpose of P
• Let one thread represent only one accumulation from a ray, i.e.,
  – calculation of the position in the voxel volume,
  – calculation of and multiplication with the interpolation coefficients,
  – accumulation to the 4 voxels closest to the active ray / voxel volume plane intersection.
• Very short threads, and very many of them.
• One thread block represents a number of rays, separated so that conflicting updates do not occur:
[Figure: ray groups labeled Block 0, Block 1, Block 2, ..., separated so that conflicting updates do not occur.]
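A hedged sketch of what one such thread might do is given below. The mapping of threads to rays and planes, the geometry, and all names are assumptions; the essential points are only the single 4-voxel accumulation per thread and the assumption that rays handled concurrently are spaced so that their 2x2 footprints never overlap, so no atomics are needed.

// Sketch: one thread = one accumulation of a ray value into the 4 voxels around
// its intersection with one voxel-volume plane (assumed names and geometry).
__global__ void backProjectExactSketch(float* volume,
                                       const float* projectionValue, // p_i, pre-weighted
                                       const float3* rayOrigin,
                                       const float3* rayDir,         // assumed |d.z| > 0
                                       int nRays, int Nx, int Ny, int Nz)
{
    int rayIx   = blockIdx.x * blockDim.x + threadIdx.x; // rays in a block assumed non-conflicting
    int planeIx = blockIdx.y;                            // one z-plane crossing per thread
    if (rayIx >= nRays) return;

    // Where does this ray cross the plane z = planeIx?
    float3 o = rayOrigin[rayIx];
    float3 d = rayDir[rayIx];
    float t = ((float)planeIx - o.z) / d.z;
    float x = o.x + t * d.x;
    float y = o.y + t * d.y;

    int ix = (int)floorf(x);
    int iy = (int)floorf(y);
    float wx = x - (float)ix;
    float wy = y - (float)iy;
    float value = projectionValue[rayIx];

    // Scatter the value to the 4 nearest voxels with bilinear weights.
    if (ix >= 0 && ix + 1 < Nx && iy >= 0 && iy + 1 < Ny && planeIx < Nz) {
        int base = (planeIx * Ny + iy) * Nx + ix;
        volume[base]          += (1.f - wx) * (1.f - wy) * value;
        volume[base + 1]      +=        wx  * (1.f - wy) * value;
        volume[base + Nx]     += (1.f - wx) *        wy  * value;
        volume[base + Nx + 1] +=        wx  *        wy  * value;
    }
}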
Experiments – backprojection (P^T)
• Input data dimensions: 672x24x3000 floats
• Output data dimensions: 512x512x257 floats
• Calculation times
  – CPU: 2700 s
  – GPU: 700 s
• Speed-up factor: approximately 4x
• For a larger collection of problems, speedups between 4x and 8x have been observed.
Approximate implementation of the adjoint operator P^T
• Bilinear interpolation on the detector.
• 2D illustration:
• Parallelization is now accomplished by letting one pixel correspond to one thread.
• This method is approximate since the interpolation coefficients are generally different from those of P^T.
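In this gather formulation each thread reads from the detector instead of scattering into the volume, which restores texture-friendly access. The fragment below is a minimal sketch for a single projection; the 2D detector texture, the voxel-to-detector mapping projectVoxel, and all names are assumptions for illustration only.

// Assumed 2D detector texture with hardware bilinear interpolation enabled.
texture<float, 2, cudaReadModeElementType> devT_projection;

// Hypothetical helper (not defined here): maps a voxel index to continuous
// detector coordinates (channel, row) for the current source position.
__device__ float2 projectVoxel(int ix, int iy, int iz);

// Sketch of the approximate adjoint for one projection: one thread per voxel column.
__global__ void backProjectApproxSketch(float* volume, int Nx, int Ny, int Nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= Nx || iy >= Ny) return;

    for (int iz = 0; iz < Nz; ++iz) {
        // Where does this voxel project onto the detector?
        float2 det = projectVoxel(ix, iy, iz);
        // Gather with hardware bilinear interpolation on the detector.
        volume[(iz * Ny + iy) * Nx + ix] += tex2D(devT_projection, det.x + 0.5f, det.y + 0.5f);
    }
}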
Performance comparison: approximate versus exact P^T operator
• Calculation times
  – CPU P^T: 2700 s
  – GPU P^T (exact): 700 s
  – GPU P^T (approximate): 60 s
• Axial images of head phantom reconstructions (25 HU - 75 HU):
[Figure: axial reconstruction images; panels: Exact, Approximate.]
Conclusions and future research
• For operations such as P and P^T, speedups of 4x to 50x can be obtained.
• The amount of speedup is highly dependent on
  – The possibility to efficiently read memory (coalesced, or by use of texture caches / constant memory).
  – The possibility to use hardware interpolation.
  – The complexity of the kernel. Using too many registers slows down the program. Lookup tables in constant memory can help reduce the number of registers (see the sketch at the end).
• Although scattered writes are supported by CUDA, for these problems using them did not seem advantageous from a performance point of view. Note that scattered writes also prohibit the use of cudaArray textures.
• Remark: Even if the approximate P^T operator gives a bad result in the example given here, there are other situations where the exact operator is superior. It is therefore of interest to find better approximations.
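As an illustration of the point about registers and constant memory, the fragment below sketches how a per-channel lookup table (here a hypothetical fan-angle table playing the role of D_ALPHA from the earlier kernel snippet) could be placed in constant memory instead of being recomputed or held in registers; the names and sizes are assumptions.

#include <cuda_runtime.h>

#define MAX_CHANNELS 1024   // assumed upper bound on the number of detector channels

// Lookup table in constant memory: cached, broadcast efficiently to the threads
// of a warp, and it does not consume kernel registers.
__constant__ float c_channelAngle[MAX_CHANNELS];

// Host side: fill the table once before launching the projection kernels.
void uploadChannelAngles(const float* h_channelAngle, int Nch)
{
    cudaMemcpyToSymbol(c_channelAngle, h_channelAngle, Nch * sizeof(float));
}

// Device side: a macro such as D_ALPHA(chIx) could then simply read the table.
#define D_ALPHA(chIx) (c_channelAngle[(chIx)])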