
INF358 Seminar in Visualization (2009), Ivan Viola and Helwig Hauser (Editors)

Accelerated Filtering using OpenCL

J. Waage

Abstract

Filtering is useful for noise reduction and edge detection in volumes. The release of general-purpose parallel computing interfaces opens opportunities for increased performance. In this paper I present four volume filters implemented using OpenCL: a box filter, a Gaussian filter, a median filter and a central difference filter. The first two are implemented in a separable way; the latter two are non-separable. Local memory on each streaming multiprocessor is used to speed up the memory access of the compute kernels. Comparison with a CUDA implementation shows equivalent results, with the box filter showing better performance.

Figure 1: Left: central difference filter and box filter applied to the tooth data set. Right: the same central difference filter followed by a median filter. Rendering was done using Volumeshop [BG05].

1. Introduction

Lately, new parallel computing APIs have opened new opportunities in the field of visualization [RXTC07]. The most notable in this context are NVIDIA's CUDA [Nvi09] and the Khronos Group's OpenCL [Khr09]. There are other new APIs, like RapidMind [BB09] and AMD's Stream [BB09], but they do not seem to be much in use in visualization. CUDA has been out for a while and is what the visualization community has adopted, while OpenCL is so new that there is little documentation on its use.

The interest in these APIs is motivated by the computational power they enable. In medical visualization there is an increasing need for quick access to medical imagery before, after, and during procedures [RXTC07] [MOOXS09]. The same is true in other fields that require real-time updates and real-time interaction. In many cases even small delays are unacceptable and render a method useless from a practical point of view [RXTC07] [MOOXS09].

Volume filtering is a field with problems that lend themselves well to acceleration on the GPU, as shown by Viola et al. [VKG03] and Hadwiger et al. [HTHG01]. If filtering is to be used in a real-time or near-real-time interaction scenario, it must be fast.

This paper contributes fast volume filters that increase performance by using local memory on each streaming multiprocessor to store often-accessed values. Access to this local memory is possible through OpenCL or CUDA; for my implementation I use OpenCL. The choice of OpenCL is motivated by the lack of documentation on its use and by the possibility of comparison with CUDA.

The filters implemented:

• Box filter (separable).
• Gaussian filter (separable).
• Median filter.
• Central difference filter.

Apart from two of the filters being separable, each filter has properties that make different kinds of implementation efficient. The box filter can take advantage of the rolling method seen in Figure 3, while the Gaussian filter cannot, because it has to weight each contributing voxel. The median filter requires much more computation per kernel invocation than the central difference filter, since it involves sorting or selection of the median.

© The Eurographics Association 2009.


The rest of this paper is organized as follows: In Section 2 I present a survey of relevant work on filtering and the use of OpenCL, along with a short introduction to OpenCL concepts and lingo. In Section 3 I present my implementation. Section 4 contains results and a comparison of my filters against other implementations. In Section 5 conclusions are drawn and directions for future work are given.

2. Related Work

In this section I present previous work on volume filtering on the GPU, followed by work on parallel computing using OpenCL and an explanation of OpenCL concepts and lingo.

Hadwiger et al. [HTHG01] present hardware implementations of bicubic B-spline, bicubic Catmull-Rom and tricubic B-spline filters. The filters are implemented using multiple passes to accommodate the hardware.

Viola et al. [VKG03] accelerate volume filters on the GPU using Cg. They present algorithms for median, bilateral and rotated mask filters. For the median filter they use an algorithm that works on the original unsorted data. Viola et al. also consider important aspects like caching on the GPU and the negative effect of excessive use of conditionals.

Khare [Kha07] presents implementations of mean, Sobel, median and Gaussian volume filters, and compares filter running times on the GPU and the CPU.

Gohara [Goh09] presents work on calculating the electrostatic properties of molecules using OpenCL. His implementation uses coalesced loading to local memory to great effect. Gohara discusses padding both in the context of conditionals in kernels and in the context of effective coalesced loads.

OpenCL introduces names, labels and concepts that one needs to understand before using it. I present a short introduction to some of these concepts; they are covered more extensively by Gohara [Goh09] and the OpenCL specification [Khr09].

The first and most important concept is that of a compute kernel. A compute kernel is a function that is executed on the GPU in parallel. In OpenCL, kernels are written in a subset of the 1999 C language specification (C99). For my implementation, the most important restrictions in this subset are the lack of recursion and of variable-size arrays. Other restrictions include restricted pointer use and no standard C99 library functions. A complete list of restrictions can be found in the OpenCL specification [Khr09].

A unit of work to which a compute kernel is applied is called a work item. Each work item knows its own position within the local and the global work group. The local work group is a group of work items that can share data through local memory, and the global work group is the full set of work items. Work group sizes can be specified in up to three dimensions, and the local work group size has to divide the global work group size in each dimension. Work items are enqueued for execution, by work group, on a command queue.
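The divisibility requirement and the relation between local and global positions can be illustrated with a small host-side sketch in plain C. The helper names `round_up_to_multiple` and `global_id_1d` are hypothetical and are not part of OpenCL or of my implementation; the second helper assumes a one-dimensional range with no global offset.

```c
#include <stddef.h>

/* Smallest multiple of `local` that is >= `global`, so that the local
 * work group size divides the global size evenly. */
size_t round_up_to_multiple(size_t global, size_t local)
{
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

/* Global position of a work item, given the index of its work group,
 * the work group size and its position inside the group. */
size_t global_id_1d(size_t group, size_t local_size, size_t local_id)
{
    return group * local_size + local_id;
}
```

If the global size has to be rounded up in this way, the kernel needs a bounds check so that the extra work items do nothing.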

The OpenCL memory model splits memory into four categories. Their labels and physical counterparts on the GPU are listed below.

• Private memory - The registers of each streaming processor; this memory is exclusive to the work item.
• Local memory - 16 KB of very fast memory shared between the streaming processors on each streaming multiprocessor.
• Global memory - The main memory on the GPU.
• Constant memory - Global memory marked read-only.

A concept that is important in relation to local memory is the coalesced load. If OpenCL detects that it can load a contiguous chunk of memory, it will execute a coalesced load. Such a load is much faster than loading each item individually.
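Which reads are contiguous depends on how the volume is laid out in memory. The sketch below, in plain C, uses the same x-fastest layout as the kernels in the appendices (`z*xSize*ySize + y*xSize + x`); the function names are hypothetical. Neighboring voxels along x are one element apart, while neighbors along y or z are a whole line or a whole slice apart, so only reads along x form a contiguous chunk.

```c
#include <stddef.h>

/* Flat index of voxel (x, y, z) in an x-fastest layout. */
size_t voxel_index(size_t x, size_t y, size_t z, size_t xSize, size_t ySize)
{
    return z * xSize * ySize + y * xSize + x;
}

/* Distance in elements between two neighboring voxels along one axis
 * (axis 0 = x, 1 = y, anything else = z). */
size_t axis_stride(int axis, size_t xSize, size_t ySize)
{
    switch (axis) {
    case 0:  return 1;
    case 1:  return xSize;
    default: return xSize * ySize;
    }
}
```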

3. Filter implementation

In this section I present my implementation of each filter. In time complexities, m refers to the filter size and n to the size of the volume in one direction. For simplicity, all running times assume that the volume is equally large in all directions.

Figure 2: How the separable box filter is computed. The image shows a pass in the y direction, where the means for all voxels (represented by grey boxes) attached to one red cylinder are computed by one work item. Similar passes are done in the x and z directions to complete the filter.

3.1. Volume box filter

The 3D box filter works by evaluating, for each voxel, all the surrounding voxels in a chosen range and averaging their values. At the edges I only evaluate the voxels inside the volume; a different solution would be to pad with zeros and evaluate these as part of the volume. An example of a filtered volume is given in Figure 4.

Figure 3: Evaluation of one line of voxels, showing one step of the rolling method. The voxels within the transparent area are those evaluated; the green voxel is added to the computation, while the blue one is removed. The red voxel represents the position where the result is placed.

The box filter is separable, meaning that it can be split into sequential passes in each direction, reducing the running time from $O(n^3 m^3)$ in a naive parallel solution to $O(n^3 m)$. This means that the filter can be computed as seen in Figure 2, where each red cylinder represents a work item sequentially using the rolling method to compute the mean. The rolling method, shown in Figure 3, allows me to only add values in front of the filter window and remove values behind it, instead of re-evaluating all the voxels in the window each time. Using the rolling method, the time complexity is reduced to $O(n^3)$.
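The rolling method for a single line can be sketched on the CPU in plain C as follows. This is a simplified illustration of the idea rather than the kernel itself (the kernels are listed in Appendix C); the function name `rolling_box_line` is hypothetical, and `radius` plays the role of `boxSize` in the kernels.

```c
#include <stddef.h>

/*
 * Rolling mean along one line of n voxels.
 * radius: number of voxels on each side of the center.
 * Only voxels inside the line contribute, so means near the
 * edges are taken over fewer samples.
 */
void rolling_box_line(const float *in, float *out, int n, int radius)
{
    float sum = 0.0f;
    int count = 0;

    /* Initial window for position 0: voxels 0..radius. */
    for (int i = 0; i <= radius && i < n; i++) {
        sum += in[i];
        count++;
    }
    if (n > 0)
        out[0] = sum / (float)count;

    for (int i = 1; i < n; i++) {
        int leaving = i - radius - 1;  /* drops out behind the window */
        int entering = i + radius;     /* comes in at the front */
        if (leaving >= 0) { sum -= in[leaving]; count--; }
        if (entering < n) { sum += in[entering]; count++; }
        out[i] = sum / (float)count;
    }
}
```

Each step does a constant amount of work regardless of `radius`, which is why the filter size drops out of the time complexity.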

Since one kernel only filters in one direction, three kernels were created and then used in succession, each working on the output of the previous pass and writing its result to the buffer the previous pass read from, which minimizes the use of global memory.
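The three passes and the swapping of the two buffers can be sketched on the CPU in plain C. This is an illustration under simplifying assumptions, not the host code used for the measurements: the volume is cubic with side n, and the names `rolling_pass` and `box_filter_3d` are hypothetical.

```c
#include <stddef.h>

/* Rolling mean along one strided line of n voxels
 * (voxels near the edges are averaged over fewer samples). */
static void rolling_pass(const float *in, float *out, int n,
                         size_t stride, int radius)
{
    float sum = 0.0f;
    int count = 0;
    for (int i = 0; i <= radius && i < n; i++) {
        sum += in[(size_t)i * stride];
        count++;
    }
    out[0] = sum / (float)count;
    for (int i = 1; i < n; i++) {
        int leaving = i - radius - 1;
        int entering = i + radius;
        if (leaving >= 0) { sum -= in[(size_t)leaving * stride]; count--; }
        if (entering < n) { sum += in[(size_t)entering * stride]; count++; }
        out[(size_t)i * stride] = sum / (float)count;
    }
}

/*
 * Separable box filter on an n*n*n volume (n >= 1) using two buffers.
 * Pass order is x, y, z. Returns the buffer holding the final result,
 * which after three passes is b.
 */
float *box_filter_3d(float *a, float *b, int n, int radius)
{
    const size_t strides[3] = { 1, (size_t)n, (size_t)n * (size_t)n };
    float *src = a;
    float *dst = b;

    for (int axis = 0; axis < 3; axis++) {
        size_t s  = strides[axis];
        size_t s1 = strides[(axis + 1) % 3];
        size_t s2 = strides[(axis + 2) % 3];
        /* One line for every pair of coordinates on the other two axes. */
        for (int u = 0; u < n; u++) {
            for (int v = 0; v < n; v++) {
                size_t base = (size_t)u * s1 + (size_t)v * s2;
                rolling_pass(src + base, dst + base, n, s, radius);
            }
        }
        /* Swap roles: the next pass reads what this pass wrote. */
        if (axis < 2) { float *t = src; src = dst; dst = t; }
    }
    return dst;
}
```

Only two volume-sized buffers are ever allocated; the input buffer is overwritten by the second pass.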

3.2. Volume Gaussian filter

The volume Gaussian filter evaluates the surrounding voxels and weights them according to a number of discrete samples of the Gaussian function (1), with $\sigma$ as the standard deviation. As a result, values further away from the center are weighted less than those close to the center. Use of the filter can be seen in Figure 5. The values sampled from the Gaussian function are stored in an array, which is then uploaded to read-only memory on the GPU for the compute kernels to access.

$$\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}} \tag{1}$$

Like the box filter, the Gaussian filter is separable. They differ in that the Gaussian filter cannot use the rolling method, since it needs to weight values. This results in a filter that, instead of the $O(n^3)$ time complexity of the box filter, runs in $O(n^3 m)$, which is much worse for large filter sizes.

To better deal with the increased time complexity for large filter sizes, I load lines of voxels into local memory. I load lines in the direction of the pass, as seen in Figure 2. With these lines in local memory, one work item is executed for each voxel in the volume. Each work item only needs to read from the line it is part of, so all sampling of neighboring voxels is done from local memory.
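The work done on one such line can be sketched in plain C, with an ordinary array standing in for local memory and a loop over positions standing in for the work items. The name `gaussian_line` is hypothetical; samples that fall outside the line are skipped, as in the kernels in Appendix D.

```c
/*
 * Weighted sum along one line of n voxels that has already been
 * copied into `line` (the stand-in for local memory).
 * weights has 2*radius + 1 entries.
 */
void gaussian_line(const float *line, float *out, int n,
                   const float *weights, int radius)
{
    for (int pos = 0; pos < n; pos++) {          /* one work item per voxel */
        float sum = 0.0f;
        for (int i = -radius; i <= radius; i++) {
            int c = pos + i;
            if (c >= 0 && c < n)
                sum += line[c] * weights[i + radius];
        }
        out[pos] = sum;
    }
}
```

Every voxel of the line is read up to 2*radius + 1 times, which is why keeping the line in fast memory pays off more as the filter grows.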

The process above is repeated once for each direction. As in the box filter implementation, I use two memory objects, reading from one and writing to the other, and swapping them for each direction.

3.3. Volume median filter

The median filter evaluates the surrounding voxels, based on the filter size, and selects the median value from them. It reflects the underlying data better than the mean or Gaussian filter, since it selects values from the actual data. It is also more resistant to extreme values than the mean or Gaussian filter. These features make the filter useful for noise reduction while preserving details. Figure 7 gives a comparison between a filtered volume and the original volume.

Figure 6: The voxels loaded into local memory when executing the median filter of size 3 are in the blue box. The red voxels are the work items using this memory.

The median filter is not separable; for each voxel it must evaluate the surrounding voxels and find the median one. Doing this computation in parallel means sampling the same voxels over and over. To make this sampling fast, I load the data to be sampled into local memory, as seen in Figure 6.



Figure 4: Comparison between the original volume and the volume after applying the box filter (right) with filter size 3. Rendered in Volumeshop [BG05].

This means that the work items in each line of voxels all read from local memory, which is very fast. The loading of these lines is also very fast, since the loads will be registered by OpenCL as coalesced loads and done in batches rather than voxel by voxel.

The algorithm for finding the median is then applied to the correct part of the local memory for each work item. The algorithm is the one used by Viola et al. [VKG03]. It works for all input sizes, does not copy memory, and is relatively fast.
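The selection scheme can be sketched in plain C on an ordinary array. The sketch mirrors the loop structure of the kernel in Appendix A but leaves out the local memory layout and the handling of voxels outside the volume; the function name `median_no_copy` is hypothetical. For an even number of values it returns the lower of the two middle values.

```c
/*
 * Median of n values (n >= 1) without copying or reordering the input.
 * Each round guesses the midpoint of the current [lo, hi] range, counts
 * the values below, above and equal to the guess, and narrows the range
 * toward the side that holds more values.
 */
float median_no_copy(const float *v, int n)
{
    float lo = v[0];
    float hi = v[0];
    for (int i = 1; i < n; i++) {
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }

    const int half = (n + 1) / 2;
    int less, greater, equal;
    float guess, maxltguess, mingtguess;

    for (;;) {
        guess = (lo + hi) / 2.0f;
        less = greater = equal = 0;
        maxltguess = lo;   /* largest value below the guess */
        mingtguess = hi;   /* smallest value above the guess */
        for (int i = 0; i < n; i++) {
            if (v[i] < guess) {
                less++;
                if (v[i] > maxltguess) maxltguess = v[i];
            } else if (v[i] > guess) {
                greater++;
                if (v[i] < mingtguess) mingtguess = v[i];
            } else {
                equal++;
            }
        }
        if (less <= half && greater <= half)
            break;
        if (less > greater)
            hi = maxltguess;
        else
            lo = mingtguess;
    }

    if (less >= half)
        return maxltguess;
    if (less + equal >= half)
        return guess;
    return mingtguess;
}
```

The input is only ever read, which suits local memory that is shared between work items: no work item needs a private sorted copy of its neighborhood.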

3.4. Volume central difference filter

The 3D central difference filter approximates the length of the central difference in each voxel of the volume. For each voxel, the six values surrounding it are evaluated to compute the central difference.

Computation is done by storing |u| in the position of the current voxel, where u is a vector containing the central difference in the x, y and z directions. The calculation of u is done using equations (2), (3), (4) and (5), where d(x,y,z) is the current voxel under evaluation.

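$$u = a\,\mathbf{i} + b\,\mathbf{j} + c\,\mathbf{k} \tag{2}$$

$$a = \frac{d(x+1,y,z) - d(x-1,y,z)}{2} \tag{3}$$

$$b = \frac{d(x,y+1,z) - d(x,y-1,z)}{2} \tag{4}$$

$$c = \frac{d(x,y,z+1) - d(x,y,z-1)}{2} \tag{5}$$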

Values outside the edges are treated as 0. This was chosen because most volumes do not have important features around the edges, which minimizes the implications of this approximation.

This filter benefits from the same loading scheme used for the median filter, except that the number of lines needed is smaller. The needed lines are shown in Figure 8. As with the median filter, these are loaded into local memory using a coalesced load, and then work items for each voxel in the middle line are added to the command queue.

Figure 5: Comparison between the original volume and the volume after applying the Gaussian filter (right) with standard deviation 1. Rendered in Volumeshop [BG05].

Figure 7: Comparison between the original volume and the volume after applying the median filter (left) with filter size 3. Rendered in Volumeshop [BG05].

4. Performance and results

To test my filters I created float arrays of size $128^3$ and $256^3$ as test data. Filter sizes are a selection from the range of possible sizes in my implementation. Performance is given in seconds. My results contain measurements of the computation time alone, and measurements of the time it takes to load data to the GPU, do the computation and transfer the results back to main memory. The latter measurements are marked "loading included".

Figure 8: The blue cross contains the voxels loaded into local memory in the central difference filter. Red voxels are the work items using this block of local memory.

The test environment consists of a 2 GHz Intel Core 2 Duo and a GeForce 9400M GPU, which has 2 streaming multiprocessors with 8 processors each, 256 MB of global memory and 16 KB of local memory per streaming multiprocessor.

The tests were done using the Mac OS X implementation of OpenCL. Times were recorded using the UpTime() call from the CoreServices framework [App09]. To make sure that computation had finished before taking time measurements, the OpenCL call clFinish(cmd_queue) was used, causing the system to wait for the command queue to finish before continuing.

As shown in Table 1, and as expected, the running time of my box filter implementation does not depend on the size of the filter, since the time complexity is roughly the same for all filter sizes when using the rolling method.

Table 1: Box filter performance

Filter size                           3      5      41
Performance just computation (128)    0.15   0.15   0.15
Performance loading included (128)    0.4    0.4    0.4
Performance just computation (256)    0.7    0.7    0.7
Performance loading included (256)    1.2    1.2    1.2

The Gaussian filter does not use the rolling method, and should therefore see increased running time for larger filter sizes. The use of local memory might hide this, as the local memory is reused more for larger filters. As seen in Table 2, there are only slight increases in running times for larger filter sizes, suggesting that the local memory is used more efficiently for larger filter sizes.

Table 2: Gaussian filter performance

Standard deviation                    1      2      4
Performance just computation (128)    0.35   0.45   0.65
Performance loading included (128)    0.6    0.67   0.88
Performance just computation (256)    2.1    2.3    2.7
Performance loading included (256)    2.5    2.7    3.2

The central difference filter is only created for a filter size of 3, so the only meaningful comparison is between the different volume sizes. The work size for the larger volume is eight times larger than for the smaller. The computation on the larger volume runs in about half the time one would expect. This might suggest that the local memory is used more efficiently for larger volumes, but a more thorough study would be needed to be conclusive. Running times for the central difference filter can be found in Table 3.

Table 3: Central difference filter performance

Filter size                           3
Performance just computation (128)    0.17
Performance loading included (128)    0.4
Performance just computation (256)    0.6
Performance loading included (256)    1.0
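The "half the time" observation can be checked against the computation-only times in Table 3, under the assumption that running time is proportional to the number of voxels:

$$\left(\frac{256}{128}\right)^3 = 8, \qquad 8 \times 0.17\,\text{s} = 1.36\,\text{s}, \qquad \frac{0.6\,\text{s}}{1.36\,\text{s}} \approx 0.44$$

The measured 0.6 s is thus a little under half of the 1.36 s that linear scaling would predict.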

The median filter performs about as expected, with running time growing close to linearly with the amount of work. Unfortunately, the median filter of size 5 could not be tested for the large volume, as the current implementation uses too much local memory. Table 4 contains the running times.

Table 4: Median filter performance

Filter size                           3      5
Performance just computation (128)    0.27   2.3
Performance loading included (128)    0.47   2.4
Performance just computation (256)    2.13   NA
Performance loading included (256)    2.55   NA

4.1. Comparison against other implementations

The only relevant comparison I could make was against a CUDA implementation by Jeong [Jeo07]. Jeong implements mean, median and Gaussian filters. His results, given in Table 5, are from computations using the NVIDIA Tesla C870 GPU, with seven times the number of streaming processors and ten times the computing power of my GPU. His test volumes were of size $128^3$.

My comparison of our implementations assumes that his GPU processes somewhere between 7 and 10 times faster than mine, and that the transfer speed from global memory to registers or local memory is roughly the same.

Table 5: Jeong [Jeo07] Filter performances

Box filter
Filter size    3        5        7        9
Performance    0.0705   0.05     0.08     0.132

Gaussian filter
Variance       1        2        4        8
Performance    0.0279   0.0316   0.0317   0.0327

Median filter
Filter size    3        5        7        9
Performance    0.0705   0.232    0.544    1.07

• Box filter - My filter is roughly twice as fast for a filter size of 3, and increasingly faster as the filter size increases. This is probably due to my use of the rolling method.

• Gaussian filter - My filter is slightly slower than his for a variance of one, and increasingly slower for higher variances. His implementation seems to have the same running time for all variances, which suggests he is using a constant-size array to store samples.

• Median filter - My median filter is about twice as fast as his for a filter size of 3; for a filter size of 5 the results are about the same. I have no good explanation for why this is so, though I suspect the loading of local memory is part of it.
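As a worked example of how the scaling assumption enters the comparison, the median filter times from Table 5 can be multiplied by the assumed speed factor of 7 to 10 and set against the computation-only times for the $128^3$ volume in Table 4. Using the computation-only times is an assumption made for this illustration.

$$\text{size 3:}\quad 7 \times 0.0705 = 0.49, \qquad 10 \times 0.0705 = 0.71, \qquad \text{measured } 0.27$$

$$\text{size 5:}\quad 7 \times 0.232 = 1.62, \qquad 10 \times 0.232 = 2.32, \qquad \text{measured } 2.3$$

For size 3 the measured time is roughly half of the scaled estimate, and for size 5 it lies at the upper end of the scaled range.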

4.2. Local memory use

Making good use of local memory is crucial when computing using OpenCL. In my median and central difference implementations, what to load into local memory is not trivial. For simplicity I chose entire lines of voxels, as seen in Figures 6 and 8.

For smaller volumes, a single line of voxels is very limiting. If I could guarantee that the volume size is below a certain threshold, I could load, say, 32 entire lines into local memory. This would allow the middle lines of this chunk to be computed, and save many loads.

There is also the issue of only 16 KB of available local memory, which for large volumes and larger filter sizes would force the use of shorter lines. This also raises the issue of padding. In the case of shorter loads, padding might be necessary to get optimal performance.
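The 16 KB limit can be made concrete with a small budget calculation in plain C. The kernel in Appendix A loads 3 x 3 = 9 full lines for a filter size of 3; the sketch assumes that a filter of size s loads s x s full lines in the same way, which is an extrapolation. The function names are hypothetical.

```c
#include <stddef.h>

/* Bytes of local memory needed for `lines` lines of `lineLength` floats. */
size_t local_bytes_needed(size_t lines, size_t lineLength)
{
    return lines * lineLength * sizeof(float);
}

/* Nonzero if the lines for a median filter of the given size fit in
 * localBytes, assuming filterSize * filterSize full lines are loaded. */
int median_lines_fit(size_t filterSize, size_t lineLength, size_t localBytes)
{
    return local_bytes_needed(filterSize * filterSize, lineLength) <= localBytes;
}
```

With 4-byte floats and 16384 bytes of local memory, a size 5 filter needs 12800 bytes for lines of 128 voxels but 25600 bytes for lines of 256 voxels, which is consistent with the missing size 5 measurements for the large volume in Table 4.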

5. Conclusion

I have created fast volume filters using OpenCL. The median and box filters show better performance than comparable solutions in CUDA. As a developer I think OpenCL is a very solid API for doing computation on the GPU. For visualization I would use it over shading languages for all tasks not directly related to shading.

As future work on filters using OpenCL, I would like to investigate the use of padding to increase the performance of the filters. Another interesting optimization topic would be an algorithm for finding optimal local memory loads for volumes, something which would be useful in a more general implementation of the median filter. For the box and Gaussian filters I would like to experiment with using coalesced loads to add values to local memory, then putting the result of the computation in a different location so that the next pass could do coalesced loads as well.

6. Acknowledgements

I want to thank Paolo Angelelli for his guidance and help while doing my implementation and writing this paper. I also want to thank Ivan Viola, Morten Bendiksen and Eirik Vik for good constructive criticism.

References

[App09] APPLE: Apple developer center web page, http://developer.apple.com/mac/, 2009.

[BB09] BORGO R., BRODLIE K.: State of the art report on GPU visualization.

[BG05] BRUCKNER S., GRÖLLER M.: Volumeshop: An interactive system for direct volume illustration. In Proceedings of IEEE Visualization (2005), vol. 5, pp. 671–678.

[Goh09] GOHARA D.: OpenCL tutorials web page, http://www.macresearch.org/opencl, 2009.

[HTHG01] HADWIGER M., THEUSSL T., HAUSER H., GRÖLLER E.: Hardware-accelerated high-quality filtering on PC hardware. In Workshop on Vision, Modelling, and Visualization VMV'01 (2001), Citeseer, pp. 105–112.

[Jeo07] JEONG W.: Won-Ki Jeong web page, http://www.cs.utah.edu/~wkjeong/, http://www.na-mic.org/Wiki/images/f/f9/Itk-gpu-meeting-Fall2007.ppt, 2007.

[Kha07] KHARE A.: Volume analysis and visualization.

[Khr09] KHRONOS: OpenCL overview web page, http://www.khronos.org/opencl/, 2009.

[MOOXS09] MUYAN-ÖZÇELIK P., OWENS J., XIA J., SAMANT S.: Fast deformable registration on the GPU: A CUDA implementation of demons. In Computational Sciences and Its Applications, 2008. ICCSA'08. International Conference on (2009).

[Nvi09] NVIDIA: CUDA web page, http://www.nvidia.com/object/cuda_learn.html, 2009.

[RXTC07] RIABKOV D., XUE X., TUBBS D., CHERYAUKA A.: Accelerated cone-beam backprojection using GPU-CPU hardware. In Proceedings of the 9th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine (2007), pp. 68–71.

[VKG03] VIOLA I., KANITSAR A., GRÖLLER M.: Hardware-based nonlinear filtering and segmentation using high-level shading languages. In Proceedings of IEEE Visualization (2003), pp. 309–316.



Appendix A: Median Filter kernel

__kernel void medianFilter(__global const float *indat,
                           __global float *answer,
                           __global const int *info,
                           __local float *localMem)
{
    int gid = get_global_id(0);
    int lSize = get_local_size(0);
    int lid = get_local_id(0);

    int xSize = info[0];
    int ySize = info[1];
    int zSize = info[2];

    // yPos and zPos are derived from gid; lid is the x position.
    int zPos = (gid / ySize) / xSize;
    int yPos = (gid / xSize) % ySize;

    int h;

    // Load the 3x3 neighboring lines into local memory.
    // Lines outside the volume are filled with zeros.
    for (int i = -1; i <= 1; i++) {
        for (int j = -1; j <= 1; j++) {
            h = (i + 1) * 3 + (j + 1);
            if ((zPos - i) >= 0 && (zPos - i) < zSize &&
                (yPos - j) >= 0 && (yPos - j) < ySize) {
                localMem[lid + lSize * h] =
                    indat[(zPos - i) * xSize * ySize + (yPos - j) * xSize + lid];
            } else {
                localMem[lid + lSize * h] = 0;
            }
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    int a, i, less, greater, equal;
    float min, max, guess, maxltguess, mingtguess;

    // Find the smallest and largest value in the neighborhood.
    min = localMem[lid];
    max = localMem[lid];
    for (i = -1; i <= 1; i++) {
        a = lid + i;
        if (a >= 0 && a < xSize) {
            for (int j = 0; j < 9; j++) {
                min = fmin(localMem[lSize * j + a], min);
                max = fmax(localMem[lSize * j + a], max);
            }
        } else {
            // Columns outside the volume count as zeros.
            min = fmin(0.0f, min);
            max = fmax(0.0f, max);
        }
    }

    // Find the median in local memory, without copying.
    while (1) {
        guess = (min + max) / 2;
        less = 0;
        greater = 0;
        equal = 0;
        maxltguess = min;
        mingtguess = max;
        for (i = -1; i <= 1; i++) {
            a = lid + i;
            if (a < 0 || a >= xSize) {
                // Nine zero-valued voxels outside the volume.
                for (int j = 0; j < 9; j++) {
                    if (0 < guess) {
                        less++;
                        maxltguess = fmax(0.0f, maxltguess);
                    } else if (0 > guess) {
                        greater++;
                        mingtguess = fmin(mingtguess, 0.0f);
                    } else {
                        equal++;
                    }
                }
            } else {
                for (int j = 0; j < 9; j++) {
                    if (localMem[lSize * j + a] < guess) {
                        less++;
                        maxltguess = fmax(maxltguess, localMem[lSize * j + a]);
                    } else if (localMem[lSize * j + a] > guess) {
                        greater++;
                        mingtguess = fmin(mingtguess, localMem[lSize * j + a]);
                    } else {
                        equal++;
                    }
                }
            }
        }
        if (less <= (27 + 1) / 2 && greater <= (27 + 1) / 2) {
            break;
        } else if (less > greater) {
            max = maxltguess;
        } else {
            min = mingtguess;
        }
    }

    if (less >= (27 + 1) / 2) {
        answer[gid] = maxltguess;
    } else if (less + equal >= (27 + 1) / 2) {
        answer[gid] = guess;
    } else {
        answer[gid] = mingtguess;
    }
}



Appendix B: Central difference filter kernel

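__kernel void centralDifference(__global const float *indat,
                                __global float *answer,
                                __global const int *info,
                                __local float *localMem)
{
    int gid = get_global_id(0);
    int lSize = get_local_size(0);
    int lid = get_local_id(0);

    // Volume dimensions. These declarations are missing from the
    // listing as extracted; the layout of info follows Appendix A.
    int xSize = info[0];
    int ySize = info[1];
    int zSize = info[2];

    // yPos and zPos are derived from gid; lid is the x position.
    int zPos = (gid / ySize) / xSize;
    int yPos = (gid / xSize) % ySize;

    int size = info[3];

    // Center line. Neighboring work items read it at lid - size and lid + size.
    localMem[lid] = indat[zPos * xSize * ySize + yPos * xSize + lid];

    // Every work item reaches this barrier, so the center line is
    // complete before any work item reads its neighbors' entries.
    barrier(CLK_LOCAL_MEM_FENCE);

    if ((zPos - size) >= 0 && (zPos + size) < zSize &&
        (yPos - size) >= 0 && (yPos + size) < ySize &&
        lid - size >= 0 && lid + size < xSize) {
        // Lines at y - size, y + size, z - size and z + size.
        // Each work item only reads back the entries it wrote itself.
        localMem[lid + lSize * 1] = indat[zPos * xSize * ySize + (yPos - size) * xSize + lid];
        localMem[lid + lSize * 2] = indat[zPos * xSize * ySize + (yPos + size) * xSize + lid];
        localMem[lid + lSize * 3] = indat[(zPos - size) * xSize * ySize + yPos * xSize + lid];
        localMem[lid + lSize * 4] = indat[(zPos + size) * xSize * ySize + yPos * xSize + lid];

        float4 vec;

        vec.w = 0;
        vec.x = 0;
        vec.y = 0;
        vec.z = 0;

        // Note: the differences are not halved here, so the stored
        // length is twice |u| as defined by equations (2)-(5).
        vec.x = fabs(localMem[lid - size] - localMem[lid + size]);
        vec.y = fabs(localMem[lid + lSize * 1] - localMem[lid + lSize * 2]);
        vec.z = fabs(localMem[lid + lSize * 3] - localMem[lid + lSize * 4]);

        answer[gid] = length(vec);
    } else {
        answer[gid] = 0;
    }
}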



Appendix C: Box filter kernel

// Kernel for the z direction.
__kernel void boxfilterZ(__global const float *indat,
                         __global float *answer,
                         __global const int *info)
{
    // Sum of the voxels currently inside the window.
    float sum = 0;

    int xSize = get_global_size(0);
    int ySize = get_global_size(1);
    int zSize = info[2];

    // xPos and yPos are given by the global id.
    int xPos = get_global_id(0);
    int yPos = get_global_id(1) * xSize;

    int c = 0;
    int d = 0;

    // Number of voxels currently contributing to the sum.
    float values = 0;

    // Size of the box.
    int boxSize = info[3];

    // Find the sum for the first voxel.
    for (int i = 0; i <= boxSize; i++) {
        c = 0 + i;
        if (c >= 0 && c < zSize) {
            c = c * xSize * ySize;
            sum = sum + indat[xPos + yPos + c];
            values++;
        }
    }

    answer[yPos + xPos] = sum / values;

    // Find the sums for the following voxels based on the previous one.
    for (int i = 1; i < zSize; i++) {
        c = i - boxSize - 1;
        if (c >= 0 && c < zSize) {
            c = c * xSize * ySize;
            sum = sum - indat[xPos + yPos + c];
            values--;
        }

        d = i + boxSize;
        if (d >= 0 && d < zSize) {
            d = d * xSize * ySize;
            sum = sum + indat[xPos + yPos + d];
            values++;
        }

        answer[i * xSize * ySize + yPos + xPos] = sum / values;
    }
}

// Kernel for the y direction.
__kernel void boxfilterY(__global const float *indat,
                         __global float *answer,
                         __global const int *info)
{
    int xSize = get_global_size(0);
    int ySize = info[1];

    float sum = 0;

    int xPos = get_global_id(0);
    int zPos = get_global_id(1) * xSize * ySize;

    // Indices of the voxel leaving (c) and entering (d) the window.
    int c = 0;
    int d = 0;

    float values = 0;

    int boxSize = info[3];

    for (int i = 0; i <= boxSize; i++) {
        c = 0 + i;
        if (c >= 0 && c < ySize) {
            c = c * xSize;
            sum = sum + indat[xPos + zPos + c];
            values++;
        }
    }

    answer[zPos + xPos] = sum / values;

    for (int i = 1; i < ySize; i++) {
        c = i - boxSize - 1;
        if (c >= 0 && c < ySize) {
            c = c * xSize;
            sum = sum - indat[xPos + zPos + c];
            values--;
        }

        d = i + boxSize;
        if (d >= 0 && d < ySize) {
            d = d * xSize;
            sum = sum + indat[xPos + zPos + d];
            values++;
        }
        answer[zPos + i * xSize + xPos] = sum / values;
    }
}



// Kernel for the x direction.
__kernel void boxfilterX(__global const float *indat,
                         __global float *answer,
                         __global const int *info)
{
    float sum = 0;

    int xSize = info[0];
    int ySize = get_global_size(0);

    int yPos = get_global_id(0) * xSize;
    int zPos = get_global_id(1) * xSize * ySize;

    // Indices of the voxel leaving (c) and entering (d) the window.
    int c = 0;
    int d = 0;

    float values = 0;

    int boxSize = info[3];

    for (int i = 0; i <= boxSize; i++) {
        c = 0 + i;
        if (c >= 0 && c < xSize) {
            sum = sum + indat[zPos + yPos + c];
            values++;
        }
    }

    answer[zPos + yPos] = sum / values;

    for (int i = 1; i < xSize; i++) {
        c = i - boxSize - 1;
        if (c >= 0 && c < xSize) {
            sum = sum - indat[zPos + yPos + c];
            values--;
        }

        d = i + boxSize;
        if (d >= 0 && d < xSize) {
            sum = sum + indat[zPos + yPos + d];
            values++;
        }
        answer[zPos + yPos + i] = sum / values;
    }
}



Appendix D: Gaussian filter kernel

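// Kernel for the z direction.
// Assumes each work group spans one full line in z (dimension 2).
__kernel void gaussianZ(__global const float *indat,
                        __global float *answer,
                        __global const int *info,
                        __global const float *gaussian,
                        __local float *localMem)
{
    // The extracted listing reads get_local_id(0) here, which clashes
    // with xPos below; dimension 2 is assumed to be the intended one.
    int lid = get_local_id(2);

    // Declarations missing from the extracted listing; the layout of
    // info is assumed to match the box filter kernels.
    int xSize = info[0];
    int ySize = info[1];
    int zSize = info[2];
    int boxSize = info[3];
    float sum = 0;

    // xPos and yPos are given by the global id.
    int xPos = get_global_id(0);
    int yPos = get_global_id(1);
    yPos = yPos * xSize;

    localMem[lid] = indat[lid * xSize * ySize + yPos + xPos];

    // Wait until the whole line is in local memory.
    barrier(CLK_LOCAL_MEM_FENCE);

    int c = 0;

    // Weighted sum for one voxel.
    for (int i = -boxSize; i <= boxSize; i++) {
        c = lid + i;
        if (c >= 0 && c < zSize) {
            sum = sum + localMem[c] * gaussian[i + boxSize];
        }
    }

    answer[lid * xSize * ySize + yPos + xPos] = sum;
}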

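// Kernel for the y direction.
// Assumes each work group spans one full line in y (dimension 1).
__kernel void gaussianY(__global const float *indat,
                        __global float *answer,
                        __global const int *info,
                        __global const float *gaussian,
                        __local float *localMem)
{
    // The extracted listing reads get_local_id(0) here, which clashes
    // with xPos below; dimension 1 is assumed to be the intended one.
    int lid = get_local_id(1);

    int xSize = info[0];
    int ySize = info[1];
    int zSize = info[2];

    // Declarations missing from the extracted listing; info[3] is
    // assumed to hold the filter radius, as in the box filter kernels.
    int boxSize = info[3];
    float sum = 0;

    // xPos and zPos are given by the global id.
    int xPos = get_global_id(0);
    int zPos = get_global_id(2);
    zPos = zPos * xSize * ySize;

    localMem[lid] = indat[zPos + lid * xSize + xPos];

    // Wait until the whole line is in local memory.
    barrier(CLK_LOCAL_MEM_FENCE);

    int c = 0;

    // Weighted sum for one voxel.
    for (int i = -boxSize; i <= boxSize; i++) {
        c = lid + i;
        if (c >= 0 && c < ySize) {
            sum = sum + localMem[c] * gaussian[i + boxSize];
        }
    }
    answer[zPos + lid * xSize + xPos] = sum;
}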

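// Kernel for the x direction.
// Assumes each work group spans one full line in x (dimension 0).
__kernel void gaussianX(__global const float *indat,
                        __global float *answer,
                        __global const int *info,
                        __global const float *gaussian,
                        __local float *localMem)
{
    int lid = get_local_id(0);

    // Declarations missing from the extracted listing; the layout of
    // info is assumed to match the box filter kernels.
    int xSize = info[0];
    int ySize = info[1];
    int boxSize = info[3];

    float sum = 0;

    int yPos = get_global_id(1);
    int zPos = get_global_id(2);

    yPos = yPos * xSize;
    zPos = zPos * xSize * ySize;

    localMem[lid] = indat[zPos + yPos + lid];

    // Wait until the whole line is in local memory.
    barrier(CLK_LOCAL_MEM_FENCE);

    int c = 0;

    // Weighted sum for one voxel.
    for (int i = -boxSize; i <= boxSize; i++) {
        c = lid + i;
        if (c >= 0 && c < xSize) {
            sum = sum + localMem[c] * gaussian[i + boxSize];
        }
    }
    answer[zPos + yPos + lid] = sum;
}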

