Data-Parallel Algorithms and Data Structures -...
Transcript of Data-Parallel Algorithms and Data Structures -...
Data-Parallel Algorithms and DataData-Parallel Algorithms and DataStructuresStructures
John OwensJohn OwensUC DavisUC Davis
• Each fragment is shaded w/SIMD program
• Shading can use valuesfrom texture memory
• Image can be used astexture on future passes
• Application specifiesgeometry -> rasterized
Programming a GPU for GraphicsProgramming a GPU for Graphics
• Run a SIMD program overeach fragment
• “Gather” is permittedfrom texture memory
• Resulting buffer can betreated as texture on nextpass
• Draw a screen-sized quad
Programming a GPU for GP Programs (Old-School)Programming a GPU for GP Programs (Old-School)
• Run a SIMD program overeach element/thread
• Global memory supportsboth gather and scatter
• Resulting buffer can beused in other general-purpose kernels or astexture
• Define a computationdomain over elements
Programming a GPU for GP Programs (New School)Programming a GPU for GP Programs (New School)
One Slide Summary of TodayOne Slide Summary of Today
• GPUs are great at running many closely-coupled but independent threads in parallel– The programming model is SPMD—single program,
multiple data
• GPU computing boils down to:– Define a computation domain that generates many
parallel threads• This is the data structure
– Iterate in parallel over that computation domain,running a program over all threads• This is the algorithm
OutlineOutline
• Data Structures– GPU Memory Model– Taxonomy
• Algorithmic Building Blocks– Sample Application– Map– Gather & Scatter– Reductions– Scan (parallel prefix)– Sort, search, …
Memory HierarchyMemory Hierarchy
• CPU and GPU Memory Hierarchy
Disk
CPU Main Memory
GPU Main(Video) MemoryCPU Caches
CPU Registers GPU Caches(Texture, Shared)
GPU Temporary Registers
GPU Constant Registers
Big Picture: GPU Memory ModelBig Picture: GPU Memory Model
• GPUs are a mix of:– Historical, fixed-function capabilities– Newer, flexible, programmable capabilities
• Fixed-function:– Known access patterns, behaviors– Accelerated by special-purpose hardware
• Programmable:– Unknown access patterns– Generality good
• Memory model must account for both– Consequence: Ample special-purpose functionality– Consequence: Restricting flexibility may improve performance
» Take advantage of special-purpose functionality!
Taxonomy of Data StructuresTaxonomy of Data Structures
• Dense arrays– Address translations: n-dimension to m-dimension
• Sparse arrays– Static sparse arrays (e.g. sparse matrix)– Dynamic sparse arrays (e.g. page tables)
• Adaptive data structures– Static adaptive data structures (e.g. k-d trees)– Dynamic adaptive data structures (e.g. adaptive
multiresolution grids)
• Non-indexable structures (e.g. stack)
Details? Lefohn et al.’s Glift (TOG Jan. ‘06), GPGPU survey
GPU Memory ModelGPU Memory Model
• More restricted memory access than CPU– Allocate/free memory only before computation– Transfers to and from CPU are explicit– GPU is controlled by CPU, can’t initiate transfers,
access disk, etc.
• To generalize …– GPUs are better at accessing data structures– CPUs are better at building data structures– As CPU-GPU bandwidth improves, consider doing
data structure tasks on their “natural” processor
GPU Memory ModelGPU Memory Model
• Limited memory access during computation(kernel)– Registers (per fragment/thread)
• Read/write
– Local memory (shared among threads)• Does not exist in general• CUDA allows access to shared memory btwn threads
– Global memory (historical)• Read-only during computation• Write-only at end of computation (precomputed address)
– Global memory (new)• Allows general scatter/gather (read/write)
– No collision rules!– Exposed in AMD R520+ GPUs, NVIDIA G80+ GPUs
Data-Parallel AlgorithmsData-Parallel Algorithms
• The GPU is a data-parallel processor– Data-parallel kernels of applications can be accelerated on
the GPU
• Efficient algorithms require efficient buildingblocks
• This talk: data-parallel building blocks– Map– Gather & Scatter– Reduce– Scan
• Next talk (Naga Govindaraju): sorting and searching
Sample Motivating ApplicationSample Motivating Application
• How bumpy is a surface that werepresent as a grid of samples?
• Algorithm:– Loop over all elements– At each element, compare the value of that element
to the average of its neighbors (“difference”).Square that difference.
– Now sum up all those differences.• But we don’t want to sum all the diffs that are 0.• So only sum up the non-zero differences.
– This is a fake application—don’t take it too seriously.
Picture courtesy http://www.artifice.com
Sample Motivating ApplicationSample Motivating Application
for all samples:neighbors[x,y] =
0.25 * ( value[x-1,y]+ value[x+1,y]+value[x,y+1]+value[x,y-1] ) )
diff = (value[x,y] - neighbors[x,y])^2
result = 0for all samples where diff != 0:
result += diff
return result
Sample Motivating ApplicationSample Motivating Application
for all samples:neighbors[x,y] =
0.25 * ( value[x-1,y]+ value[x+1,y]+
value[x,y+1]+value[x,y-1] ) )
diff = (value[x,y] - neighbors[x,y])^2
result = 0for all samples where diff != 0:
result += diff
return result
The Map OperationThe Map Operation
• Given:– Array or stream of data elements A– Function f(x)
• map(A, f) = applies f(x) to all ai∈ A• GPU implementation is straightforward
– Graphics: A is a texture, ai are texels– General-purpose: A is a computation domain, ai are
elements in that domain– Pixel shader/kernel implements f(x), reads ai as x– (Graphics): Draw a quad with as many pixels as
texels in A with f(x) pixel shader active– Output stored in another texture/buffer
Sample Motivating ApplicationSample Motivating Application
for all samples:neighbors[x,y] =
0.25 * ( value[x-1,y]+ value[x+1,y]+
value[x,y+1]+value[x,y-1] ) )
diff = (value[x,y] - neighbors[x,y])^2
result = 0for all samples where diff != 0:
result += diff
return result
Scatter vs. GatherScatter vs. Gather
• Gather: p = a[i]– Vertex or Fragment programs
• Scatter: a[i] = p– Historically vertex programs, but exposed in
fragment programs in latest GPUs
Sample Motivating ApplicationSample Motivating Application
for all samples:neighbors[x,y] =
0.25 * ( value[x-1,y]+ value[x+1,y]+
value[x,y+1]+value[x,y-1] ) )
diff = (value[x,y] - neighbors[x,y])^2
result = 0for all samples where diff != 0:
result += diff
return result
Parallel ReductionsParallel Reductions
• Given:– Binary associative operator ⊕ with identity I
– Ordered set s = [a0, a1, …, an-1] of n elements
• reduce(⊕, s) returns a0 ⊕ a1 ⊕ … ⊕ an-1
• Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25
• Reductions common in parallel algorithms– Common reduction operators are +, ×, min and max– Note floating point is only pseudo-associative
Parallel Reductions on the GPUParallel Reductions on the GPU
1D parallel reduction:add two halves of texture together repeatedly...… until we’re left with a single row of texels
++NN
NN/2/2NN/4/4…… 11
O(logO(log22N)N) steps, O( steps, O(NN) work) work
++ ++
Multiple 1D Parallel ReductionsMultiple 1D Parallel Reductions
Can run many reductions in parallelUse 2D texture and reduce one dimension
++MxNMxN
MxNMxN/2/2MxNMxN/4/4…… Mx1Mx1
++ ++
O(logO(log22N)N) steps, O( steps, O(MNMN) work) work
2D reductions2D reductions
• Like 1D reduction, only reduce in bothdirections simultaneously
– Note: can add more than 2x2 elements per pixel• Trade per-pixel work for step complexity• Best perf depends on specific GPU (cache, etc.)
Sample Motivating ApplicationSample Motivating Application
for all samples:neighbors[x,y] =
0.25 * ( value[x-1,y]+ value[x+1,y]+
value[x,y+1]+value[x,y-1] ) )
diff = (value[x,y] - neighbors[x,y])^2
result = 0for all samples where diff != 0:
result += diff
return result
Stream CompactionStream Compaction
• Input: stream of 1s and 0s[1 0 1 1 0 0 1 0]
• Operation:“sum up all elements before you”• Output: scatter addresses for “1” elements
[0 1 1 2 3 3 3 4]• Note scatter addresses for red elements are
packed!
Parallel Scan (aka prefix sum)Parallel Scan (aka prefix sum)
• Given:– Binary associative operator ⊕ with identity I
– Ordered set s = [a0, a1, …, an-1] of n elements
• scan(⊕, s) returns[I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)]
• Example:scan(+, [3 1 7 0 4 1 6 3]) =
[0 3 4 11 11 15 16 22]
(From Blelloch, 1990, “Prefix Sums and Their Applications”)
Applications of ScanApplications of Scan
• Radix sort• Quicksort• String comparison• Lexical analysis• Stream compaction
• Sparse matrices• Polynomial evaluation• Solving recurrences• Tree operations• Histograms
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
T0 36140713
Log(n) iterations
Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”.Not necessary withnewer APIs.
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
T0 36140713
T1 97547843Stride 1
Log(n) iterations
Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”
For i from 1 to log(n)-1:• Specify domain from
2i to n. Element kcomputes
vout = v[k] + v[k-2i].
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
T0 36140713
T1 97547843Stride 1
Log(n) iterations
Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”
For i from 1 to log(n)-1:• Specify domain from
2i to n. Element kcomputes
vout = v[k] + v[k-2i].
• Due to ping-pong,specify 2nd domainfrom 2(i-1) to 2i with asimple pass-throughshader
vout = vin.
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
T0 36140713
T1 97547843
T0 14111212111143
Stride 1
Stride 2
Log(n) iterations
Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”
For i from 1 to log(n)-1:• Specify domain from
2i to n. Element kcomputes
vout = v[k] + v[k-2i].
• Due to ping-pong,specify 2nd domainfrom 2(i-1) to 2i with asimple pass-throughshader
vout = vin.
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
T0 36140713
T1 97547843
T0 14111212111143
Stride 1
Stride 2
Log(n) iterations
Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”
For i from 1 to log(n)-1:• Specify domain from
2i to n. Element kcomputes
vout = v[k] + v[k-2i].
• Due to ping-pong,specify 2nd domainfrom 2(i-1) to 2i with asimple pass-throughshader
vout = vin.
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
T0 36140713
T1 97547843
T0 14111212111143
Out 25221615111143
Stride 1
Stride 2
Stride 4
Log(n) iterations
Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”
For i from 1 to log(n)-1:• Specify domain from
2i to n. Element kcomputes
vout = v[k] + v[k-2i].
• Due to ping-pong,specify 2nd domainfrom 2(i-1) to 2i with asimple pass-throughshader
vout = vin.
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
T0 36140713
T1 97547843
T0 14111212111143
Out 25221615111143
Stride 1
Stride 2
Stride 4
Log(n) iterations
Note: With graphics API,can’t read and writethe same texture, somust “ping-pong”
For i from 1 to log(n)-1:• Specify domain from
2i to n. Element kcomputes
vout = v[k] + v[k-2i].
• Due to ping-pong,specify 2nd domainfrom 2(i-1) to 2i with asimple pass-throughshader
vout = vin.
A Naive Parallel Scan AlgorithmA Naive Parallel Scan Algorithm
• Algorithm given in more detail by Horn [‘05]• Step-efficient, but not work-efficient
– O(log n) steps, but O(n log n) adds– Sequential version is O(n)– A factor of log(n) hurts: 20x for 106 elements!
• Dig into parallel algorithms literature for abetter solution– See Blelloch 1990, “Prefix Sums and Their
Applications”
Balanced TreesBalanced Trees
• Common parallel algorithms pattern– Build a balanced binary tree on the input data and
sweep it to and from the root– Tree is conceptual, not an actual data structure
• For scan:– Traverse down from leaves to root building partial
sums at internal nodes in the tree• Root holds sum of all leaves
– Traverse back up the tree building the scan from thepartial sums
– 2 log n passes, O(n) work
Balanced Tree ScanBalanced Tree Scan“Balanced tree”: O(n)
computation, space
• First build sums inplace up the tree:“reduce”
• Traverse back downand use partial sumsto generate scan:“downsweep”
See Mark Harris’s articlein GPU Gems 3
Figure courtesy Figure courtesy Shubho Sengupta Shubho Sengupta
Scan with ScatterScan with Scatter
• Scatter in pixel shaders makes algorithmsthat use scan easier to implement– NVIDIA CUDA and AMD CTM enable this– NVIDIA CUDA SDK includes example implementation
of scan primitive– http://www.gpgpu.org/scan-gpugems3
• CUDA Parallel Data Cache improvesefficiency– All steps executed in a single kernel– Threads communicate through shared memory– Drastically reduces bandwidth bottleneck!
Parallel SortingParallel Sorting
• Given an unordered list of elements,produce list ordered by key value– Kernel: compare and swap
• GPU’s constrained programmingenvironment limits viable algorithms– Bitonic merge sort [Batcher 68]– Periodic balanced sorting networks [Dowd 89]
• Recent research results impressive– Naga Govindaraju will cover the algorithms in detail
Binary SearchBinary Search
• Find a specific element in an ordered list• Implement just like CPU algorithm
– Finds the first element of a given value v• If v does not exist, find next smallest element > v
• Search is sequential, but many searches can beexecuted in parallel– Number of pixels drawn determines number of searches
executed in parallel• 1 pixel == 1 search
• For details see:– “A Toolkit for Computation on GPUs”. Ian Buck and Tim
Purcell. In GPU Gems. Randy Fernando, ed. 2004
SummarySummary
• Think parallel!– Effective programming on GPUs requires mapping
computation & data structures into parallelformulations
• Future thoughts– Definite need for:
• Efficient implementations of parallel primitives• Research on what is the right set of primitives
– How will high-bandwidth connections between CPUand GPU change algorithm and data structuredesign?
ReferencesReferences• “A Toolkit for Computation on GPUs”. Ian Buck
and Tim Purcell. In GPU Gems, Randy Fernando,ed. 2004.
• “Parallel Prefix Sum (Scan) with CUDA”. MarkHarris, Shubhabrata Sengupta, John D. Owens. InGPU Gems 3, Herbert Nguyen, ed. Aug. 2007.
• “Stream Reduction Operations for GPGPUApplications”. Daniel Horn. In GPU Gems 2, MattPharr, ed. 2005.
• “Glift: Generic, Efficient, Random-Access GPUData Structures”. Aaron Lefohn et al., ACM TOG,Jan. 2006.