GPU-Efficient Recursive Filtering and Summed-Area Tables
description
Transcript of GPU-Efficient Recursive Filtering and Summed-Area Tables
![Page 1: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/1.jpg)
GPU-Efficient Recursive Filtering and Summed-Area
TablesJeremiah van OostenReinier van Oeveren
![Page 2: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/2.jpg)
Introduction Related Works
◦ Prefix Sums and Scans◦ Recursive Filtering◦ Summed-Area Tables
Problem Definition Parallelization Strategies
◦ Baseline (Algorithm RT)◦ Block Notation◦ Inter-block Parallelism◦ Kernel Fusion (Algorithm 2)
Overlapping◦ Causal-Anticausal overlapping (Algorithm 3 & 4)◦ Row-Column Causal-Anitcausal overlapping (Algorithm 5)
Summed-Area Tables◦ Overlapped Summed-Area Tables (Algorithm SAT)
Results Conclusion
Table of Contents
![Page 3: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/3.jpg)
Introduction
![Page 4: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/4.jpg)
Linear filtering is commonly used to blur, sharpen or down-sample images.
A direct implementation evaluating a filter of support d on a h x w image has a cost of O(hwd).
Introduction
![Page 5: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/5.jpg)
The cost of the image filter can be reduced using a recursive filter in which case previous results can be used to compute the current value:
Cost can be reduced to O(hwr) where r is the number of recursive feedbacks.
Introduction
![Page 6: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/6.jpg)
At each step, the filter produces an output element by a linear combination of the input element and previously computed output elements.
Recursive Filters
0(hwr)
Continue…
![Page 7: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/7.jpg)
Applications of recursive filters◦ Low-pass filtering like Gaussian kernels◦ Inverse Convolution (◦ Summed-area tables
Recursive Filters
input blurred
recursive filters
![Page 8: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/8.jpg)
Recursive filters can be causal or anticausal (or non-causal).
Causal filters operate on previous values.
Anticausal filters operate on “future” values.
Causality
![Page 9: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/9.jpg)
Anticausal filters operate on “future” values.
Anticausal
Continue…
𝒛 ¿ 𝑅(𝒚 ,𝒆)
𝑧𝑖 ¿ 𝑦 𝑖−∑𝑘=1
𝑟
𝑎𝑘𝑧 𝑖+𝑘
![Page 10: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/10.jpg)
It is often required to perform a sequence of recursive image filters.
Filter Sequences
X
P
Y
E
ZP’ U E’V
• Independent Columns• Causal
• Anticausal
• Independent Rows• Causal
• Anticausal𝑉=𝐹 𝜏 (𝐸 ′ ,𝑈 )
𝑈=𝐹𝜏 (𝑃 ′ ,𝑍 )
𝑌=𝐹 (𝑃 , 𝑋 )
𝑍=𝐹 (𝑌 ,𝐸)
![Page 11: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/11.jpg)
The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU.
The latest GPU from NVIDIA has 2,668 shader cores. Processing even large images (2048x2048) will not make full use of all available cores.
Under utilization of the GPU cores does not allow for latency hiding. We need a way to make better utilization of the GPU without
increasing IO.
Maximizing Parallelism
![Page 12: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/12.jpg)
In the paper “GPU-Efficient Recursive Filtering and Summed-Area Tables” by Diego Nehab et. al. they introduce a new algorithmic framework to reduce memory bandwidth by overlapping computation over the full sequence of recursive filters.
Overlapping
![Page 13: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/13.jpg)
Partition the image into 2D blocks of size .
Block Partitioning
![Page 14: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/14.jpg)
Related Works
![Page 15: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/15.jpg)
A prefix sum
Simple case of a first-order recursive filter. A scan generalizes the recurrence using an arbitrary
binary associative operator. Parallel prefix-sums and scans are important building
blocks for numerous algorithms. [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et.
al. 2007] An optimized implementation comes with the CUDPP
library [2011].
Prefix Sums and Scans
![Page 16: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/16.jpg)
A generalization of the prefix sum using a weighted combination of prior outputs.
This can be implemented as a scan operation with redefined basic operators.
Ruijters and Thevenaz [2010] exploit parallelisim across the rows and columns of the input.
Recursive Filtering
![Page 17: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/17.jpg)
Sung and Mitra [1986] use block parallelism and split the computation into two parts:◦ One computation based only on the block data
assuming a zero initial conditions.◦ One computation based only on the initial
conditions and assuming zero block data.
Recursive Filtering
![Page 18: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/18.jpg)
Summed-area tables enable the averaging rectangular regions of pixel with a constant number of reads
Summed-Area Tables
+LR
-LL
-UR
+UL
width
height
𝑎𝑣𝑔=𝐿𝑅−𝐿𝐿−𝑈𝑅+𝑈𝐿
h𝑤𝑖𝑑𝑡 ∙h h𝑒𝑖𝑔 𝑡
![Page 19: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/19.jpg)
The paper titled “Fast Summed-Area Table Generation…” from Justin Hensley et. al. (2005) describes a method called recursive doubling which requires multiple passes of the input image. (A 256x256 image requires 16 passes to compute).
Summed-Area Tables
Image A
Image B
Image A
Image B
![Page 20: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/20.jpg)
In 2010, Justin Hensley extended his 2005 implementation to compute shaders taking more samples per pass and storing the result in intermediate shared memory. Now a 256x256 image only required 4 passes when reading 16 samples per pass.
Summed-Area Tables
![Page 21: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/21.jpg)
Problem Definition
![Page 22: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/22.jpg)
Casual recursive filters of order are characterized by a set of feedback coefficients in the following manner.
Given a prologue vector and an input vector of any size the filter produces the output:
Problem Definition
𝒚=𝐹 (𝒑 , 𝒙)
Such that ( has the same size as the input ).
![Page 23: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/23.jpg)
Causal recursive filters depend on a prologue vector
Problem Definition
𝑦 𝑖=𝑥 𝑖−∑𝑘=1
𝑟
𝑎𝑘 𝑦 𝑖−𝑘
Similar for the anitcausal filter. Given an input vector and an epilogue vector , the output vector is defined by:
𝑧𝑖=𝑦 𝑖−∑𝑘=1
𝑟
𝑎 ′𝑘 𝑧𝑖+𝑘
![Page 24: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/24.jpg)
For row processing, we define an extended casual filter and anticausal filter .
Problem Definition
𝒖=𝐹 𝜏 (𝒑 ′ ,𝒛 )
𝒗=𝑅𝜏 (𝒖 ,𝒆 ′ )
![Page 25: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/25.jpg)
With these definitions, we are able to formulate the problem of applying the full sequence of four recursive filters (down, up, right, left).
Problem Definition
X
P
Y
E
ZP’ U E’V
• Independent Columns• Causal
• Anticausal
• Independent Rows• Causal
• Anticausal𝑉=𝐹 𝜏 (𝐸 ′ ,𝑈 )
𝑈=𝐹𝜏 (𝑃 ′ ,𝑍 )
𝑌=𝐹 (𝑃 , 𝑋 )
𝑍=𝐹 (𝑌 ,𝐸)
![Page 26: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/26.jpg)
The goal is to implement this algorithm on the GPU to make full use of all available resources.◦ Maximize occupancy by splitting the problem up
to make use of all cores.◦ Reduce I/O to global memory.
Must break the dependency chain in order to increase task parallelism.
Primary design goal: Increase the amount of parallelism without increasing memory I/O.
Problem Definition
![Page 27: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/27.jpg)
Prior Parallelization Strategies
![Page 28: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/28.jpg)
Baseline algorithm ‘RT’Block notation Inter-block parallelismKernel fusion
Prior Parallelization strategies
![Page 29: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/29.jpg)
Algorithm Ruijters & ThévenazIndependent row and column processing Step RT1: In parallel for each column in ,
apply sequentially and store . Step RT2: In parallel for each column in ,
apply sequentially and store . Step RT1: In parallel for each row in ,
apply sequentially and store . Step RT1: In parallel for each row in ,
apply sequentially and store .
![Page 30: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/30.jpg)
Algorithm RT in diagram formin
put
outp
utst
ages
column processing row processing
![Page 31: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/31.jpg)
Completion takes 4r ) steps Bandwidth usage in total is
= streaming multiprocessors = number of cores (per processor) = width of the input image = height of the input image = order of the applied filter
Algorithm RT performance
![Page 32: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/32.jpg)
Partition input image into blocks◦ = number of threads in warp (=32)
What means what?◦ = block in matrix with index ◦ = column-prologue submatrix◦ = column-epilogue submatrix
For rows we have (similar) transposed operators: and
Block notation (1)
![Page 33: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/33.jpg)
Block notation (1 cont’d)
![Page 34: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/34.jpg)
Tail and head operators: selecting prologue- and epilogue-shaped submatrices from
Block notation (2)
![Page 35: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/35.jpg)
Result: blocked version of problem definition, , ,
Block notation (3)
![Page 36: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/36.jpg)
Superposition (based on linearity)
Effects of the input and prologue/epilogue on the output can be computed independently
Some useful key properties (1)
![Page 37: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/37.jpg)
Express as matrix productsFor any , is the identity matrix
Some useful key properties (2)
, Precomputed matrices that depend only on the feedback coefficients of filters and respectively. Details in paper.
![Page 38: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/38.jpg)
Perform block computation independently
Inter-block parellelism (1)
𝑩𝒎 (𝒚 )=𝑭 (𝑷𝒎−𝟏 (𝐲 ) ,𝑩𝒎 (𝒙 ) )output block
Prologue / tail of prev. output block superposition
𝑩𝒎 (𝒚 )=𝑭 (𝟎 ,𝑩𝒎 (𝒙 ) )+𝑭 (𝑷𝒎−𝟏 (𝒚 ) ,𝟎 )
![Page 39: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/39.jpg)
Inter-block parellelism (2)𝑩𝒎 (𝒚 )=𝑭 (𝟎 ,𝑩𝒎 (𝒙 ) )+𝑭 (𝑷𝒎−𝟏 (𝒚 ) ,𝟎 )
𝑩𝒎(𝒚 )
incomplete causal output
first term second term
𝑭 ( 𝑰 𝒓 ,𝟎 )𝑷𝒎−𝟏 (𝒚 )
𝑨𝑭𝑷 𝑷𝒎−𝟏(𝒚 )
![Page 40: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/40.jpg)
Inter-block parellelism (3) (1)
1.3 In parallel for all m, compute & store output block using (2) and the previously computed
Recall: (2)
1.1 In parallel for all m, compute and store each
1.2 Sequentially for each m, compute and store the according to (1) and using the previously computed
Algorithm 1
![Page 41: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/41.jpg)
Inter-block parellelism (4)Processing all rows and columns using causal and anti-causal filter pairs requires 4 successive applications of algorithm 1.
There are independent tasks: hides memory access latency.
However.. The memory bandwidth usage is now Significantly more than algorithm RT (
can be solved
![Page 42: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/42.jpg)
Original idea: Kirk & Hwu [2010]
Use output of one kernel as input for the next without going through global memory.
Fused kernel: code from both kernels but keep intermediate results in shared mem.
Kernel fusion (1)
![Page 43: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/43.jpg)
Use Algorithm 1 for all filters, do fusing. Fuse last stage of with first stage of Fuse last stage of and first stage of Fuse last stage of with first stage of
Kernel fusion (2)
We aimed for bandwidth reduction. Did it work? Algorithm 1: Algorithm 2: yes, it did!
![Page 44: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/44.jpg)
Kernel fusion (3), Algorithm 2
inpu
tou
tput
stag
es
fix fix fix fix
* for the full algorithm in text, please see the paper
*
![Page 45: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/45.jpg)
Further I/O reduction is still possible: by recomputing intermediary results instead of storing in memory.
More bandwidth reduction: (=good)
No. of steps: (≈bad*)
Bandwidth usage is less than Algorithm RT(!) but involves more computations *But.. future hardware may tip the balance in favor of more computations.
Kernel fusion (4)
![Page 46: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/46.jpg)
Overlapping
![Page 47: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/47.jpg)
Overlapping is introduced to reduce IO to global memory.
It is possible to work with twice-incomplete anticausal epilogues , computed directly from the incomplete causal output block .
This is called casual-anticausal overlapping.
Causal-Anticausal Overlapping
![Page 48: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/48.jpg)
Recall that we can express the filter so that the input and the prologue or epilogue can be computed independently and later added together.
Causal-Anticausal Overlapping
𝐹 (𝒑 , 𝒙) ¿ 𝐹 (𝟎 ,𝒙 )+𝐹 (𝒑 ,𝟎 ) ,𝑅(𝒚 ,𝒆) ¿ 𝑅 (𝒚 ,𝟎 )+𝐹 (𝟎 ,𝒆)
![Page 49: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/49.jpg)
Using the previous properties, we can split the dependency chains of anticausal epilogues.
Causal-Anticausal Overlapping
![Page 50: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/50.jpg)
Which can be further simplified to:
Causal-Anticausal Overlapping
![Page 51: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/51.jpg)
Where the twice-incomplete is such that
Causal-Anticausal Overlapping
Each twice-incomplete epilogue depends only on the corresponding input block and therefore they can all be computed in parallel already in the first pass. As a byproduct of that same pass, we can compute and store the that will be needed to obtain . With , we can compute all in the following pass.
![Page 52: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/52.jpg)
1. In parallel for all , compute and store and .2. Sequentially for each , compute and store
the using the previously computed .3. Sequentially for each , compute and store
using the previously computed and .4. In parallel for all , compute each causal
output block using the previously computed . Then compute and store each anticausal output block using the previously computed .
Algorithm 3 (Row & Column)
![Page 53: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/53.jpg)
Algorithm 3 computes row and columns in separate passes. Fusing these two stages, results in algorithm 4.
Algorithm 4
![Page 54: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/54.jpg)
1. In parallel for all and , compute and store the ) and .2. Sequentially for each , but in parallel for each , compute and
store the using the previously computed .3. Sequentially for each , but in parallel for each , compute and
store the using the previously computed and .4. In parallel for all and , compute using the previously
computed . Then compute and store the using the previously computed . Finally, compute and store both and .
5. Sequentially for each , but in parallel for each , compute and store the from .
6. Sequentially for each , but in parallel for each , compute and store each using the previously computed and .
7. In parallel for all and , compute using the previously computed and . Then compute and store the using the previously computed .
Algorithm 4
![Page 55: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/55.jpg)
Adds causal-anticausal overlapping◦ Eliminates reading and writing causal results
Both in column and in row processing◦ Modest increase in computation
Algorithm 4in
put
outp
utst
ages
fix bothfix both
![Page 56: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/56.jpg)
There is still one source of inefficiency in algorithm 4. We wait until the complete block is available in stage 4 before computing incomplete and twice-incomplete .
We can overlap row and column computations and work with thrice-incomplete transposed epilogues obtained directly during algorithm 4 stage 1.
Row-Column Casual-Anticausal Overlapping
![Page 57: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/57.jpg)
Below is the formula for completing the thrice-incomplete transposed prologues:
Row-Column Casual-Anticausal Overlapping
The thrice-incomplete satisfies
![Page 58: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/58.jpg)
Row-Column Casual-Anticausal Overlapping To complete the four-times-incomplete
transposed epilogues of :
![Page 59: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/59.jpg)
1. In parallel for all and , compute and store each , , , and .
2. In parallel for all , sequentially for each , compute and store the using the previously computed .
3. In parallel for all , sequentially for each , compute and store using the previously computed and .
4. In parallel for all , sequentially for each , compute and store using the previously computed , , and .
5. In parallel for all , sequentially for each , compute and store using the previously computed , , , and .
6. In parallel for all and , successively compute , , , and using the previously computed , , , and . Store .
Algorithm 5
![Page 60: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/60.jpg)
Adds row-column overlapping◦ Eliminates reading and writing column results◦ Modest increase in computation
Algorithm 5in
put
outp
utst
ages
fix all!
![Page 61: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/61.jpg)
Start from input and global borders
![Page 62: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/62.jpg)
Load blocks into shared memory
![Page 63: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/63.jpg)
Compute & store incomplete borders
![Page 64: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/64.jpg)
Compute & store incomplete borders
![Page 65: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/65.jpg)
Compute & store incomplete borders
![Page 66: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/66.jpg)
Compute & store incomplete borders
![Page 67: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/67.jpg)
Compute & store incomplete borders
![Page 68: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/68.jpg)
Compute & store incomplete borders
![Page 69: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/69.jpg)
Compute & store incomplete borders
![Page 70: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/70.jpg)
Compute & store incomplete borders
![Page 71: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/71.jpg)
All borders in global memory
![Page 72: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/72.jpg)
Fix incomplete borders
![Page 73: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/73.jpg)
Fix twice-incomplete borders
![Page 74: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/74.jpg)
Fix thrice-incomplete borders
![Page 75: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/75.jpg)
Fix four-times-incomplete borders
![Page 76: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/76.jpg)
Done fixing all borders
![Page 77: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/77.jpg)
Load blocks into shared memory
![Page 78: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/78.jpg)
Finish causal columns
![Page 79: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/79.jpg)
Finish anticausal columns
![Page 80: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/80.jpg)
Finish causal rows
![Page 81: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/81.jpg)
Finish anticausal rows
![Page 82: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/82.jpg)
Store results to global memory
![Page 83: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/83.jpg)
Done!
![Page 84: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/84.jpg)
Overlapped Summed-Area Tables
![Page 85: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/85.jpg)
A summed-area table is obtained using prefix sums over columns and rows.
The prefix-sum filter is a special case first-order causal recursive filter (with feedback coefficient ).
We can directly apply overlapping to optimize the computation of summed-area tables.
Overlapped Summed-Area Tables
![Page 86: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/86.jpg)
In blocked form, the problem is to obtain output from input where the blocks satisfy the relations:
Overlapped Summed-Area Tables
𝐵𝑚 ,𝑛 (𝑉 )=𝑆𝜏 (𝑃𝑚 ,𝑛− 1𝜏 (𝑉 ) ,𝐵𝑚 ,𝑛(𝑌 ))
𝐵𝑚 ,𝑛 (𝑌 )=𝑆 (𝑃𝑚−1 ,𝑛 (𝑌 ) ,𝐵𝑚 ,𝑛 ( 𝑋 ))
![Page 87: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/87.jpg)
Using the strategy developed for causal-anticausal overlapping, computing and using overlapping becomes easy.
In the first stage, we compute the incomplete output blocks and directly from the input.
Overlapped Summed-Area Tables
𝐵𝑚 ,𝑛 (𝑌 )=𝑆(𝟎 ,𝐵𝑚 ,𝑛 (𝑋 ))𝐵𝑚 ,𝑛 (𝑉 )=𝑆𝜏 (𝟎 ,𝐵𝑚 ,𝑛(𝑌 ))
![Page 88: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/88.jpg)
We store only the incomplete prologues and . Then we complete them using:
Overlapped Summed-Area Tables
𝑃𝑚 ,𝑛 (𝑌 )=𝑃𝑚−1 ,𝑛 (𝑌 )+𝑃𝑚 ,𝑛(𝑌 )𝑃𝑚 ,𝑛
𝜏 (𝑉 )=𝑃𝑚 ,𝑛− 1𝜏 (𝑉 )+𝑠 (𝑃𝑚− 1 ,𝑛 (𝑌 ) )+𝑃𝑚 ,𝑛
𝜏 (𝑉 )
Scalar denotes the sum of all entries in vector .
![Page 89: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/89.jpg)
1. In parallel for all and , compute and store the and .
2. Sequentially for each , but in parallel for each , compute and store the using the previously computed. Compute and store .
3. Sequentially for each , but in parallel for each , compute and store the using the previously computed , and .
4. In parallel for all and , compute then compute and store using the previously computed and .
Algorithm SAT
![Page 90: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/90.jpg)
S.1 Reads the input then computes and stores the incomplete prologues (red) and (blue).
Algorithm SAT
![Page 91: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/91.jpg)
S.2 Completes the prologues (red) and computes scalars (yellow).
Algorithm SAT
![Page 92: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/92.jpg)
S.3 Completes prologues
Algorithm SAT
![Page 93: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/93.jpg)
S.4 Reads the input and completed prologues, then computes and stores the final summed-area table.
Algorithm SAT
![Page 94: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/94.jpg)
Results
![Page 95: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/95.jpg)
First-order filter benchmarks• Alg. RT is the baseline implementation• Ruijters et al. 2010 “GPU prefilter […]”• Alg. 2 adds block parallelism & tricks• Sung et al. 1986 “Efficient […] recursive […]”• Blelloch 1990 “Prefix sums […]”• + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping• Eliminates 4hw of IO• Modest increase in computation• Alg. 5 adds row-column overlapping• Eliminates additional 2hw of IO• Modest increase in computation
Alg. Step Complexity
Max. # of Threads
MemoryBandwidth
5
4
2
RT
u64 128 256 512 1024 2048 4096
Inp t size (pixels)
2 22 2 2 22
1
2
3
4
5
6
7
Thr o
ughp
ut ( G
iP/s
)
RT245
Cubic B-Spline Interpolation (GeForce GTX 480)
![Page 96: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/96.jpg)
Summed-area table benchmarks
• Harris et al 2008, GPU Gems 3• “Parallel prefix-scan […]”• Multi-scan + transpose + multiscan• Implemented with CUDPP
• Hensley 2010, Gamefest• “High-quality depth of field”• Multi-wave method• Our improvements
+ specialized row and column kernels+ save only incomplete borders+ fuse row and column stages
• Overlapped SAT• Row-column overlapping
u64 128 256 512 1024 2048 4096
Inp t size (pixels)
2 22 2 2 22
1
2
3
4
5
6
7
8
9
Thro
ughp
ut ( G
iP/s
)
Summed-area Table(GeForce GTX 480)
Harris et al [2008]Hensley [2010]Improved Hensley [2010]Overlapped SAT
• First-order filter, unit coefficient, no anticausal component
![Page 97: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/97.jpg)
Conclusion
![Page 98: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/98.jpg)
The paper describes an efficient algorithmic framework that reduces memory bandwidth over a sequence of recursive filters.
It splits the input into blocks that are processed in parallel on the GPU.
Overlapping the causal, anticausal, row and columns filters to reduce IO to global memory which leads to substantial performance gains.
Conclusion
![Page 99: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/99.jpg)
Difficult to understand theoretically Complex implementation
Drawbacks
![Page 100: GPU-Efficient Recursive Filtering and Summed-Area Tables](https://reader036.fdocuments.in/reader036/viewer/2022062410/568163bd550346895dd4d7ea/html5/thumbnails/100.jpg)
Questions?
baseline
Alg. RT (0.5 GiP/s)
+ block parallelism
Alg. 2 (3 GiP/s)
+ causal-anticausal overlapping
Alg. 4 (5 GiP/s)
+ row-column overlapping
Alg. 5 (6 GiP/s)