Data-Parallel Algorithms and Data Structures -...

Data-Parallel Algorithms and DataData-Parallel Algorithms and DataStructuresStructures

John OwensJohn OwensUC DavisUC Davis

• Each fragment is shaded w/SIMD program

• Shading can use valuesfrom texture memory

• Image can be used astexture on future passes

• Application specifiesgeometry -> rasterized

Programming a GPU for GraphicsProgramming a GPU for Graphics

• Run a SIMD program overeach fragment

• “Gather” is permittedfrom texture memory

• Resulting buffer can betreated as texture on nextpass

• Draw a screen-sized quad

Programming a GPU for GP Programs (Old-School)Programming a GPU for GP Programs (Old-School)

• Run a SIMD program overeach element/thread

• Global memory supportsboth gather and scatter

• Resulting buffer can beused in other general-purpose kernels or astexture

• Define a computationdomain over elements

Programming a GPU for GP Programs (New School)Programming a GPU for GP Programs (New School)

One Slide Summary of TodayOne Slide Summary of Today

• GPUs are great at running many closely-coupled but independent threads in parallel– The programming model is SPMD—single program,

multiple data

• GPU computing boils down to:– Define a computation domain that generates many

parallel threads• This is the data structure

– Iterate in parallel over that computation domain,running a program over all threads• This is the algorithm

OutlineOutline

• Data Structures– GPU Memory Model– Taxonomy

• Algorithmic Building Blocks– Sample Application– Map– Gather & Scatter– Reductions– Scan (parallel prefix)– Sort, search, …

Memory HierarchyMemory Hierarchy

• CPU and GPU Memory Hierarchy

Disk

CPU Main Memory

GPU Main(Video) MemoryCPU Caches

CPU Registers GPU Caches(Texture, Shared)

GPU Temporary Registers

GPU Constant Registers

Big Picture: GPU Memory ModelBig Picture: GPU Memory Model

• GPUs are a mix of:– Historical, fixed-function capabilities– Newer, flexible, programmable capabilities

• Fixed-function:– Known access patterns, behaviors– Accelerated by special-purpose hardware

• Programmable:– Unknown access patterns– Generality good

• Memory model must account for both– Consequence: Ample special-purpose functionality– Consequence: Restricting flexibility may improve performance

» Take advantage of special-purpose functionality!

Taxonomy of Data StructuresTaxonomy of Data Structures

• Dense arrays– Address translations: n-dimension to m-dimension

• Sparse arrays– Static sparse arrays (e.g. sparse matrix)– Dynamic sparse arrays (e.g. page tables)

• Adaptive data structures– Static adaptive data structures (e.g. k-d trees)– Dynamic adaptive data structures (e.g. adaptive

multiresolution grids)

• Non-indexable structures (e.g. stack)

Details? Lefohn et al.’s Glift (TOG Jan. ‘06), GPGPU survey

GPU Memory ModelGPU Memory Model

• More restricted memory access than CPU– Allocate/free memory only before computation– Transfers to and from CPU are explicit– GPU is controlled by CPU, can’t initiate transfers,

access disk, etc.

• To generalize …– GPUs are better at accessing data structures– CPUs are better at building data structures– As CPU-GPU bandwidth improves, consider doing

data structure tasks on their “natural” processor

GPU Memory ModelGPU Memory Model

• Limited memory access during computation(kernel)– Registers (per fragment/thread)

• Read/write

– Local memory (shared among threads)• Does not exist in general• CUDA allows access to shared memory btwn threads

– Global memory (historical)• Read-only during computation• Write-only at end of computation (precomputed address)

– Global memory (new)• Allows general scatter/gather (read/write)

– No collision rules!– Exposed in AMD R520+ GPUs, NVIDIA G80+ GPUs

Data-Parallel AlgorithmsData-Parallel Algorithms

• The GPU is a data-parallel processor– Data-parallel kernels of applications can be accelerated on

the GPU

• Efficient algorithms require efficient buildingblocks

• This talk: data-parallel building blocks– Map– Gather & Scatter– Reduce– Scan

• Next talk (Naga Govindaraju): sorting and searching

Sample Motivating ApplicationSample Motivating Application

• How bumpy is a surface that werepresent as a grid of samples?

• Algorithm:– Loop over all elements– At each element, compare the value of that element

to the average of its neighbors (“difference”).Square that difference.

– Now sum up all those differences.• But we don’t want to sum all the diffs that are 0.• So only sum up the non-zero differences.

– This is a fake application—don’t take it too seriously.

Picture courtesy http://www.artifice.com


for all samples:neighbors[x,y] =

0.25 * ( value[x-1,y]+ value[x+1,y]+value[x,y+1]+value[x,y-1] ) )

diff = (value[x,y] - neighbors[x,y])^2

result = 0for all samples where diff != 0:

result += diff

return result



0.25 * ( value[x-1,y]+ value[x+1,y]+

value[x,y+1]+value[x,y-1] ) )



result += diff

return result

The Map OperationThe Map Operation

• Given:– Array or stream of data elements A– Function f(x)

• map(A, f) = applies f(x) to all ai∈ A• GPU implementation is straightforward

– Graphics: A is a texture, ai are texels– General-purpose: A is a computation domain, ai are

elements in that domain– Pixel shader/kernel implements f(x), reads ai as x– (Graphics): Draw a quad with as many pixels as

texels in A with f(x) pixel shader active– Output stored in another texture/buffer



0.25 * ( value[x-1,y]+ value[x+1,y]+




result += diff

return result

Scatter vs. GatherScatter vs. Gather

• Gather: p = a[i]– Vertex or Fragment programs

• Scatter: a[i] = p– Historically vertex programs, but exposed in

fragment programs in latest GPUs



0.25 * ( value[x-1,y]+ value[x+1,y]+




result += diff

return result

Parallel ReductionsParallel Reductions

• Given:– Binary associative operator ⊕ with identity I

– Ordered set s = [a0, a1, …, an-1] of n elements

• reduce(⊕, s) returns a0 ⊕ a1 ⊕ … ⊕ an-1

• Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25

• Reductions common in parallel algorithms– Common reduction operators are +, ×, min and max– Note floating point is only pseudo-associative

Parallel Reductions on the GPUParallel Reductions on the GPU

1D parallel reduction:add two halves of texture together repeatedly...… until we’re left with a single row of texels

++NN

NN/2/2NN/4/4…… 11

O(logO(log22N)N) steps, O( steps, O(NN) work) work

++ ++

Multiple 1D Parallel ReductionsMultiple 1D Parallel Reductions

Can run many reductions in parallelUse 2D texture and reduce one dimension

++MxNMxN

MxNMxN/2/2MxNMxN/4/4…… Mx1Mx1

++ ++

O(logO(log22N)N) steps, O( steps, O(MNMN) work) work

2D reductions2D reductions

• Like 1D reduction, only reduce in bothdirections simultaneously

– Note: can add more than 2x2 elements per pixel• Trade per-pixel work for step complexity• Best perf depends on specific GPU (cache, etc.)



0.25 * ( value[x-1,y]+ value[x+1,y]+




result += diff

return result

Stream CompactionStream Compaction

• Input: stream of 1s and 0s[1 0 1 1 0 0 1 0]

• Operation:“sum up all elements before you”• Output: scatter addresses for “1” elements

[0 1 1 2 3 3 3 4]• Note scatter addresses for red elements are

packed!

Parallel Scan (aka prefix sum)Parallel Scan (aka prefix sum)

• Given:– Binary associative operator ⊕ with identity I

– Ordered set s = [a0, a1, …, an-1] of n elements

• scan(⊕, s) returns[I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)]

• Example:scan(+, [3 1 7 0 4 1 6 3]) =

[0 3 4 11 11 15 16 22]

(From Blelloch, 1990, “Prefix Sums and Their Applications”)

Applications of ScanApplications of Scan

• Radix sort• Quicksort• String comparison• Lexical analysis• Stream compaction

• Sparse matrices• Polynomial evaluation• Solving recurrences• Tree operations• Histograms

A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm

T0 36140713

Log(n) iterations

Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”.Not necessary withnewer APIs.


T0 36140713

T1 97547843Stride 1

Log(n) iterations

Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”

For i from 1 to log(n)-1:• Specify domain from

2i to n. Element kcomputes

vout = v[k] + v[k-2i].


T0 36140713

T1 97547843Stride 1

Log(n) iterations





• Due to ping-pong,specify 2nd domainfrom 2(i-1) to 2i with asimple pass-throughshader

vout = vin.


T0 36140713

T1 97547843

T0 14111212111143

Stride 1

Stride 2

Log(n) iterations






vout = vin.


T0 36140713

T1 97547843

T0 14111212111143

Out 25221615111143

Stride 1

Stride 2

Stride 4

Log(n) iterations






vout = vin.


• Algorithm given in more detail by Horn [‘05]• Step-efficient, but not work-efficient

– O(log n) steps, but O(n log n) adds– Sequential version is O(n)– A factor of log(n) hurts: 20x for 106 elements!

• Dig into parallel algorithms literature for abetter solution– See Blelloch 1990, “Prefix Sums and Their

Applications”

Balanced TreesBalanced Trees

• Common parallel algorithms pattern– Build a balanced binary tree on the input data and

sweep it to and from the root– Tree is conceptual, not an actual data structure

• For scan:– Traverse down from leaves to root building partial

sums at internal nodes in the tree• Root holds sum of all leaves

– Traverse back up the tree building the scan from thepartial sums

– 2 log n passes, O(n) work

Balanced Tree ScanBalanced Tree Scan“Balanced tree”: O(n)

computation, space

• First build sums inplace up the tree:“reduce”

• Traverse back downand use partial sumsto generate scan:“downsweep”

See Mark Harris’s articlein GPU Gems 3

Figure courtesy Figure courtesy Shubho Sengupta Shubho Sengupta

Scan with ScatterScan with Scatter

• Scatter in pixel shaders makes algorithmsthat use scan easier to implement– NVIDIA CUDA and AMD CTM enable this– NVIDIA CUDA SDK includes example implementation

of scan primitive– http://www.gpgpu.org/scan-gpugems3

• CUDA Parallel Data Cache improvesefficiency– All steps executed in a single kernel– Threads communicate through shared memory– Drastically reduces bandwidth bottleneck!

Parallel SortingParallel Sorting

• Given an unordered list of elements,produce list ordered by key value– Kernel: compare and swap

• GPU’s constrained programmingenvironment limits viable algorithms– Bitonic merge sort [Batcher 68]– Periodic balanced sorting networks [Dowd 89]

• Recent research results impressive– Naga Govindaraju will cover the algorithms in detail

Binary SearchBinary Search

• Find a specific element in an ordered list• Implement just like CPU algorithm

– Finds the first element of a given value v• If v does not exist, find next smallest element > v

• Search is sequential, but many searches can beexecuted in parallel– Number of pixels drawn determines number of searches

executed in parallel• 1 pixel == 1 search

• For details see:– “A Toolkit for Computation on GPUs”. Ian Buck and Tim

Purcell. In GPU Gems. Randy Fernando, ed. 2004

SummarySummary

• Think parallel!– Effective programming on GPUs requires mapping

computation & data structures into parallelformulations

• Future thoughts– Definite need for:

• Efficient implementations of parallel primitives• Research on what is the right set of primitives

– How will high-bandwidth connections between CPUand GPU change algorithm and data structuredesign?

ReferencesReferences• “A Toolkit for Computation on GPUs”. Ian Buck

and Tim Purcell. In GPU Gems, Randy Fernando,ed. 2004.

• “Parallel Prefix Sum (Scan) with CUDA”. MarkHarris, Shubhabrata Sengupta, John D. Owens. InGPU Gems 3, Herbert Nguyen, ed. Aug. 2007.

• “Stream Reduction Operations for GPGPUApplications”. Daniel Horn. In GPU Gems 2, MattPharr, ed. 2005.

• “Glift: Generic, Efficient, Random-Access GPUData Structures”. Aaron Lefohn et al., ACM TOG,Jan. 2006.

Data-Parallel Algorithms and Data Structures -...

Documents

Transcript of Data-Parallel Algorithms and Data Structures -...