CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum –...

45
CS 179: GPU Programming Lecture 8

Transcript of CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum –...

Page 1: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

CS 179: GPU ProgrammingLecture 8

Page 2: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Last time

• GPU-accelerated:– Reduction– Prefix sum– Stream compaction– Sorting (quicksort)

Page 3: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Today

• GPU-accelerated Fast Fourier Transform• cuFFT (FFT library)

Page 4: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Signals (again)

Page 5: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

“Frequency content”

• What frequencies are present in our signals?

Page 6: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Discrete Fourier Transform (DFT)

• Given signal over time, represents DFT of

– Each row of W is a complex sine wave– Each row multiplied with - inner product of wave with signal– Corresponding entries of - “content” of that sine wave!

Page 7: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Discrete Fourier Transform (DFT)

• Alternative formulation:

– - values corresponding to wave k• Periodic – calculate for 0 ≤ k ≤ N - 1

Page 8: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Discrete Fourier Transform (DFT)

• Alternative formulation:

– - values corresponding to wave k• Periodic – calculate for 0 ≤ k ≤ N - 1

– Naive runtime: O(N2)• Sum of N iterations, for N values of k

Page 9: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Discrete Fourier Transform (DFT)

• Alternative formulation:

– - values corresponding to wave k• Periodic – calculate for 0 ≤ k ≤ N - 1

– Naive runtime: O(N2)• Sum of N iterations, for N values of k

Page 10: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Roots of unity

Page 11: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Discrete Fourier Transform (DFT)

• Alternative formulation:

– - values corresponding to wave k• Periodic – calculate for 0 ≤ k ≤ N - 1

Number of distinct values: N, not N2 !

Page 12: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Proof)

• Breakdown (assuming N is power of 2):– (Let , smallest root of unity)

Page 13: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Proof)

• Breakdown (assuming N is power of 2):– (Let , smallest root of unity)

Page 14: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Proof)

• Breakdown (assuming N is power of 2):– (Let , smallest root of unity)

Page 15: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Proof)

• Breakdown (assuming N is power of 2):– (Let , smallest root of unity)

Page 16: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Proof)

• Breakdown (assuming N is power of 2):– (Let , smallest root of unity)

DFT of xn, even n! DFT of xn, odd n!

Page 17: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Divide-and-conquer algorithm)Recursive-FFT(Vector x):

if x is length 1: return x

x_even <- (x0, x2, ..., x_(n-2) )x_odd <- (x1, x3, ..., x_(n-1) )

y_even <- Recursive-FFT(x_even)y_odd <- Recursive-FFT(x_odd)

for k = 0, …, (n/2)-1: y[k] <- y_even[k] + wk * y_odd[k] y[k + n/2] <- y_even[k] - wk * y_odd[k]

return y

Page 18: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Divide-and-conquer algorithm)Recursive-FFT(Vector x):

if x is length 1: return x

x_even <- (x0, x2, ..., x_(n-2) )x_odd <- (x1, x3, ..., x_(n-1) )

y_even <- Recursive-FFT(x_even)y_odd <- Recursive-FFT(x_odd)

for k = 0, …, (n/2)-1: y[k] <- y_even[k] + wk * y_odd[k] y[k + n/2] <- y_even[k] - wk * y_odd[k]

return y

T(n/2)

T(n/2)

O(n)

Page 19: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Runtime

• Recurrence relation:– T(n) = 2T(n/2) + O(n)

O(n log n) runtime! Much better than O(n2)

• (Minor caveat: N must be power of 2)– Usually resolvable

Page 20: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Parallelizable?

• O(n2) algorithm certainly is!for k = 0,…,N-1:

for n = 0,…,N-1: ...

• Sometimes parallelization outweighs runtime!– (N-body problem, …)

Page 21: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Recursive index tree

x0, x1, x2, x3, x4, x5, x6, x7

x0, x2, x4, x6 x1, x3, x5, x7

x0, x4 x2, x6 x1, x5 x3, x7

x0 x4 x2 x6 x1 x5 x3 x7

Page 22: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Recursive index tree

x0, x1, x2, x3, x4, x5, x6, x7

x0, x2, x4, x6 x1, x3, x5, x7

x0, x4 x2, x6 x1, x5 x3, x7

x0 x4 x2 x6 x1 x5 x3 x7

Order?

Page 23: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

0 0004 1002 0106 1101 0015 1013 0117 111

Page 24: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Bit-reversal order

0 000 reverse of… 0004 100 0012 010 0106 110 0111 001 1005 101 1013 011 1107 111 111

Page 25: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Iterative approach

x0, x1, x2, x3, x4, x5, x6, x7

x0, x2, x4, x6 x1, x3, x5, x7

x0, x4 x2, x6 x1, x5 x3, x7

x0 x4 x2 x6 x1 x5 x3 x7

Stage 3

Stage 2

Stage 1

Page 26: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

(Divide-and-conquer algorithm review)

Recursive-FFT(Vector x):

if x is length 1: return x

x_even <- (x0, x2, ..., x_(n-2) )x_odd <- (x1, x3, ..., x_(n-1) )

y_even <- Recursive-FFT(x_even)y_odd <- Recursive-FFT(x_odd)

for k = 0, …, (n/2)-1: y[k] <- y_even[k] + wk * y_odd[k] y[k + n/2] <- y_even[k] - wk * y_odd[k]

return y

T(n/2)

T(n/2)

O(n)

Page 27: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Iterative approach

x0, x1, x2, x3, x4, x5, x6, x7

x0, x2, x4, x6 x1, x3, x5, x7

x0, x4 x2, x6 x1, x5 x3, x7

x0 x4 x2 x6 x1 x5 x3 x7

Stage 3

Stage 2

Stage 1

Page 28: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Iterative approachBit-reversed

access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Page 29: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Iterative approachBit-reversed

access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Stage 1 Stage 2 Stage 3

Page 30: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Iterative approachIterative-FFT(Vector x):

y <- (bit-reversed order x)N <- y.lengthfor s = 1,2,…,log(N):

m <- 2s

wn <- e2πj/m

for k: 0 ≤ k ≤ N-1, stride m: for j = 0,…,(m/2)-1:

u <- y[k + j] t <- (wn)j * y[k + j +

m/2]

y[k + j] <- u + t y[k + j + m/2] <- u - t

return y

Bit-reversed access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Stage 1 Stage 2 Stage 3

Page 31: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Iterative approachIterative-FFT(Vector x):

y <- (bit-reversed order x)N <- y.lengthfor s = 1,2,…,log(N):

m <- 2s

wn <- e2πj/m

for k: 0 ≤ k ≤ N-1, stride m: for j = 0,…,(m/2)-1:

u <- y[k + j] t <- (wn)j * y[k + j +

m/2]

y[k + j] <- u + t y[k + j + m/2] <- u - t

return y

O(log N) iterations

O(N/m) iterations

O(m) iterations

Total: O(N log N) runtime

Page 32: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

CUDA approach

Bit-reversed access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Stage 1 Stage 2 Stage 3

Page 33: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

CUDA approach

Bit-reversed access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Stage 1 Stage 2 Stage 3

__syncthreads() barriers!

Page 34: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

CUDA approach

Bit-reversed access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Stage 1 Stage 2 Stage 3

__syncthreads() barriers!

Non-coalesced memory access!!

Page 35: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

CUDA approach

Bit-reversed access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Stage 1 Stage 2 Stage 3

__syncthreads() barriers!

Load into shared memory

Non-coalesced memory access!!

Page 36: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

CUDA approach

Bit-reversed access

http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm

Stage 1 Stage 2 Stage 3

__syncthreads() barriers!

Load into shared memory

Non-coalesced memory access!!

Bank conflicts!!

Page 37: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Generalization

• Above algorithm works for array sizes of 2n (integer n)

• Generalize?

Page 38: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

“Bit”-reversal order

• N = 6: (small example)– Factors into 3,2• 1st digit: base 3• 2nd digit: base 2

000110112021

001020011121

012345

024135

Page 39: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Generalization

• Radix-3 case (followed by radix-2)

https://www.projectrhea.org/

Page 40: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

• Suppose N = N1N2 (factorization…)• Algorithm: (recursive)– Take FFT along rows– Do multiply operations (complex roots of unity)– Take FFT along columns

• “Cooley-Tukey FFT algorithm”

Page 41: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Inverse DFT/FFT

• Similarly parallelizable!– (Sign change in complex terms)

Page 42: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

cuFFT

• FFT library included with CUDA– Approximately implements previous algorithms• (Cooley-Tukey/Bluestein)• Also handles higher dimensions

Page 43: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

cuFFT 1D example

Correction: Remember to use cufftDestroy(plan) when finished with transforms

Page 44: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

cuFFT 3D example

Correction: Remember to use cufftDestroy(plan) when finished with transforms

Page 45: CS 179: GPU Programming Lecture 8. Last time GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Remarks

• As before, some parallelizable algorithms don’t easily “fit the mold”– Hardware matters more!

– Some resources:• Introduction to Algorithms (Cormen, et al), aka “CLRS”,

esp. Sec 30.5• “An Efficient Implementation of Double Precision 1-D

FFT for GPUs Using CUDA” (Liu, et al.)