A hybrid Cholesky decomposition algorithm for multicore CPUs with GPU accelerators
Gary Macindoe
Department of Statistical Science, University College London
8th February 2013
Cholesky Decomposition
Used throughout Computational Statistics and Machine Learning
Finds $L$ such that $A = LL^T$
  "Square root" of a matrix
$O(N^3)$ operations
Performance bottleneck
Applies only to symmetric, square, positive definite matrices
Operates in the upper or lower triangle
Provides fast ways of computing the inverse and determinant
Example use of Cholesky Decomposition
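Used in multivariate Normal distribution
To generate random vectors $x \sim \mathcal{N}(\mu, \Sigma)$
$$z \sim \mathcal{N}(0, 1), \qquad x = \mu + \sqrt{\Sigma}\, z$$
To calculate the probability density function
$$(2\pi)^{-\frac{n}{2}} \, |\Sigma|^{-\frac{1}{2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$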
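Both uses can be sketched in a few lines of Python. This is a minimal illustration, not the talk's code: it assumes the lower Cholesky factor plays the role of the matrix square root, and the values `MU`, `SIGMA` and the hand-computed factor `L` are hypothetical examples.

```python
import math
import random

# Hypothetical example values; L is the lower Cholesky factor of SIGMA,
# worked out by hand so that L L^T = SIGMA.
MU = [1.0, -2.0]
SIGMA = [[4.0, 2.0], [2.0, 3.0]]
L = [[2.0, 0.0], [1.0, math.sqrt(2.0)]]

def sample_mvn(mu, l, rng):
    """Draw x = mu + L z, where z holds independent standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(l[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mu)]

def log_pdf_mvn(x, mu, l):
    """Log density via the factor: |Sigma|^(1/2) is the product of the
    diagonal of L, and the quadratic form is |y|^2 with L y = x - mu."""
    n = len(mu)
    y = []
    for i in range(n):  # forward substitution
        s = sum(l[i][k] * y[k] for k in range(i))
        y.append((x[i] - mu[i] - s) / l[i][i])
    log_sqrt_det = sum(math.log(l[i][i]) for i in range(n))
    quad = sum(v * v for v in y)
    return -0.5 * n * math.log(2.0 * math.pi) - log_sqrt_det - 0.5 * quad

x = sample_mvn(MU, L, random.Random(0))
print(x, log_pdf_mvn(x, MU, L))
```

Neither function forms $\Sigma^{-1}$ or $|\Sigma|$ explicitly; both come from the triangular factor.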
Calculating the Cholesky Decomposition
Definition
Lower triangular Cholesky Decomposition $A = LL^T$
$$L_{i,j} = \begin{cases}
\sqrt{A_{j,j} - \sum_{k=1}^{j-1} L_{j,k}^2} & \text{if } i = j \\
\dfrac{1}{L_{j,j}} \left( A_{i,j} - \sum_{k=1}^{j-1} L_{i,k} L_{j,k} \right) & \text{if } i > j
\end{cases}$$
Each element $L_{i,j}$ of the lower triangular Cholesky decomposition can only be calculated after all the elements to the left on the same row, $L_{i,1 \to j-1}$, and on the diagonal row above, $L_{j,1 \to j}$. If the quantity under the square root is not positive then $A$ is not positive definite.
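The definition translates directly into an unblocked, column-by-column routine. The sketch below is illustrative only: the function name `cholesky_lower` and the 3 × 3 test matrix are assumptions, and it uses 0-based indices where the formula is 1-based.

```python
import math

def cholesky_lower(a):
    """Return lower triangular L with a = L L^T (a is a list of rows)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal element: needs everything to its left on row j.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Elements below the diagonal: need row i and row j up to column j.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
print(cholesky_lower(A))
```

The square root on the diagonal and the division below it are what make this step a better fit for the CPU later in the talk.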
Blocked Cholesky decomposition
Divide the matrix into blocks and update each block using a high performance linear algebra library (BLAS).
[Diagram: lower triangle of the matrix partitioned into block rows A B and C D, with B on the diagonal]
B starts at top left and moves to bottom right
1: $B = B - AA^T$
2: $D = D - CA^T$
3: $B = \mathrm{chol}(B)$
4: $D = D(B^{-1})^T$
Step 2 is independent of steps 1 and 3.
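The four steps can be sketched in plain Python, with loops standing in for the BLAS calls. This assumes A is the panel to the left of the diagonal block B and C is the panel to the left of D, directly below; the function name `blocked_cholesky` and the block size parameter `nb` are illustrative choices rather than the talk's implementation.

```python
import math

def blocked_cholesky(a, nb):
    """Overwrite the lower triangle of a (list of rows) with L, nb columns at a time."""
    n = len(a)
    for j in range(0, n, nb):
        e = min(j + nb, n)  # B covers rows and columns j..e-1
        # Step 1: B = B - A A^T (A is rows j..e-1, columns 0..j-1).
        for r in range(j, e):
            for c in range(j, r + 1):
                a[r][c] -= sum(a[r][k] * a[c][k] for k in range(j))
        # Step 2: D = D - C A^T (C is rows e..n-1, columns 0..j-1).
        for r in range(e, n):
            for c in range(j, e):
                a[r][c] -= sum(a[r][k] * a[c][k] for k in range(j))
        # Step 3: B = chol(B), unblocked.
        for c in range(j, e):
            d = a[c][c] - sum(a[c][k] ** 2 for k in range(j, c))
            if d <= 0.0:
                raise ValueError("matrix is not positive definite")
            a[c][c] = math.sqrt(d)
            for r in range(c + 1, e):
                s = sum(a[r][k] * a[c][k] for k in range(j, c))
                a[r][c] = (a[r][c] - s) / a[c][c]
        # Step 4: D = D (B^-1)^T, each row swept left to right.
        for r in range(e, n):
            for c in range(j, e):
                s = sum(a[r][k] * a[c][k] for k in range(j, c))
                a[r][c] = (a[r][c] - s) / a[c][c]
    for r in range(n):  # clear the unused upper triangle
        for c in range(r + 1, n):
            a[r][c] = 0.0
    return a

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
print(blocked_cholesky([row[:] for row in A], 2))
```

Step 2 touches only C, A and D, which is why it can run alongside steps 1 and 3.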
GPU Hardware
GPU dedicates more die area to data processing
CPU dedicates more die area to control flow and cache
A GPU is an array of Streaming Multiprocessors (SMs)
Each SM has its own shared memory and registers
Each SM is simpler than a CPU core
  Better at simple arithmetic (+, −, ×)
  Worse at complex arithmetic (÷, sin, cos, tan, log, exp, ...)
GPU Programming
GPUs execute kernel functions written in CUDA-C

    // Scale n elements of x, spaced incx apart, by alpha (one element per thread)
    template <typename T>
    __global__ void scale(int n, T alpha, T * x, int incx) {
      // Global index of this thread across the 1D grid
      const int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)  // threads past the end of the vector do nothing
        x[i * incx] *= alpha;
    }

The NVIDIA CUDA compiler converts CUDA-C code into GPU binary files
The CUDA runtime library provides an API to
  transfer compiled code and data onto the GPU
  launch kernel functions using a 3D grid of 3D thread blocks
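To make the index arithmetic concrete, the launch of the `scale` kernel can be emulated on the host. This is a sequential Python stand-in for a 1D launch, not CUDA: the names `scale_kernel` and `launch_scale` and the block size of 4 are assumptions, and a real GPU runs the threads in parallel rather than in a loop.

```python
def scale_kernel(block_idx, block_dim, thread_idx, n, alpha, x, incx):
    """Body of the kernel for one (block, thread) pair."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:                               # surplus threads do nothing
        x[i * incx] *= alpha

def launch_scale(n, alpha, x, incx, block_dim=4):
    """Emulate a 1D grid with enough blocks of block_dim threads to cover n."""
    grid_dim = (n + block_dim - 1) // block_dim  # round up
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            scale_kernel(block_idx, block_dim, thread_idx, n, alpha, x, incx)
    return grid_dim

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
blocks = launch_scale(5, 2.0, x, 2)  # scale every second element
print(blocks, x)
```

With 5 elements and 4 threads per block, two blocks are launched and the last three threads fail the `i < n` test, which is the situation the bounds check exists for.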
GPU Thread hierarchy
Each thread block runs on one SM
Each SM runs more than one thread block
Within each thread block threads are multitasked in groups of 32 called warps
Within a warp threads are SIMD
  Run the same instruction at the same time
Threads within a block can synchronize with each other
  Ensures all threads in the block are at the same instruction
GPU Memory hierarchy
GPUs have three types of memory to store data:
  Global memory - large, slow, accessible from all threads (and the CPU)
  Shared memory - small, fast, shared between threads in a block
  Registers - small, fastest, private to each thread
In global and shared memory the highest bandwidth is obtained when consecutive threads access consecutive elements.
GPU vs CPU
CPU better at
  complex maths (like square root in Cholesky)
  branching (if/then/else)
GPU better at
  simple maths (sums, multiplies)
  executing operations in parallel
Hybrid Blocked Cholesky decomposition
[Diagram: lower triangle of the matrix partitioned into block rows A B and C D, with B on the diagonal]
B starts at top left and moves to bottom right
1: $B = B - AA^T$
2: $D = D - CA^T$
3: $B = \mathrm{chol}(B)$
4: $D = D(B^{-1})^T$
Have the GPU perform the BLAS operations in steps 1, 2 and 4, which contain lots of simple parallel operations, while the CPU performs the smaller Cholesky decomposition in step 3, which contains more complex serial operations.
GPU Matrix Multiply
$C = \alpha AB + \beta C$
Each element of C is independent
Have a 2D grid of 2D thread blocks each running $C_{i,j} = \alpha \sum_{l=0}^{k-1} A_{i,l} B_{l,j} + \beta C_{i,j}$ in parallel
Calculating one element of C requires reading k elements from $A_{i,0 \to k-1}$ and k elements from $B_{0 \to k-1,j}$
Calculating all of C requires reading $2mnk$ elements from global memory
Problem: reading elements of A and B from global memory is slow
Blocked GPU Matrix Multiply
[Diagram: a block row of A and a block column of B combining into one block of C]
Divide A into blocks of $m_b \times k_b$, B into blocks of $k_b \times n_b$ and C into blocks of $m_b \times n_b$
Store one block of C in registers and process a $1 \times n_b$ row per thread using thread blocks of $m_b \times 1$
Read blocks of A and B into shared memory and access from all threads in the thread block
Bandwidth Reduction
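There are $\frac{m}{m_b} \times \frac{n}{n_b}$ blocks of C
Calculating all of C now requires reading
$$\frac{m}{m_b} \times \frac{n}{n_b} \times \frac{k}{k_b} \times m_b \times k_b \ \text{elements of } A$$
and
$$\frac{m}{m_b} \times \frac{n}{n_b} \times \frac{k}{k_b} \times k_b \times n_b \ \text{elements of } B$$
or
$$m \times n \times k \times \left( \frac{1}{m_b} + \frac{1}{n_b} \right)$$
elements in total
This is
$$\frac{2}{\frac{1}{m_b} + \frac{1}{n_b}}$$
times fewer than when no blocking is used ($m_b = 1$ and $n_b = 1$)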
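The counting argument can be checked by walking over the block triples and adding up the loads. The sketch below only counts reads and does no arithmetic on matrix data; the matrix sizes and the value `kb = 16` are assumptions chosen so that the blocks divide the dimensions evenly.

```python
def reads_unblocked(m, n, k):
    """Every element of C reads k elements of A and k elements of B."""
    return 2 * m * n * k

def reads_blocked(m, n, k, mb, nb, kb):
    """Each (mb x nb) block of C loads one (mb x kb) block of A and one
    (kb x nb) block of B for every step along the k dimension."""
    total = 0
    for _ in range(0, m, mb):
        for _ in range(0, n, nb):
            for _ in range(0, k, kb):
                total += mb * kb + kb * nb
    return total

m, n, k = 256, 128, 64       # hypothetical problem size
mb, nb, kb = 64, 16, 16      # kb is an assumed value
blocked = reads_blocked(m, n, k, mb, nb, kb)
formula = m * n * k * (1 / mb + 1 / nb)
reduction = reads_unblocked(m, n, k) / blocked
print(blocked, formula, reduction)
```

The count does not depend on `kb`: a deeper block means fewer, larger loads with the same total.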
Compute Bound Matrix Multiply
Theorem
If the bandwidth reduction is greater than the FLOP:word ratio then the algorithm will be compute bound.[1]
What is the FLOP:word ratio?
  ratio of floating point operations (FLOPs) performed to memory bandwidth required expressed as a number of elements (words)
Can be worked out using the GPU documentation (GTX 285)
  Throughput (FLOPs): 30 SMs × 1.476 GHz × 16 operations per clock cycle = 708.48 GFLOPs/s
  Bandwidth (words): 512 bit memory interface × 2.484 GHz / 32 bits per word = 39.744 × 10^9 words/s
  FLOP:word ratio: 708.48 / 39.744 ≈ 17.83
Choosing $m_b = 64$ and $n_b = 16$ gives a bandwidth reduction of 25.6×
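The arithmetic on this slide is easy to reproduce. The snippet below uses only the GTX 285 figures quoted above; the variable names are illustrative.

```python
# Figures quoted on the slide for the GTX 285.
sms = 30
core_clock_ghz = 1.476
ops_per_clock = 16
bus_width_bits = 512
memory_clock_ghz = 2.484
word_bits = 32            # single precision element

gflops = sms * core_clock_ghz * ops_per_clock                 # 10^9 FLOPs per second
gwords = bus_width_bits * memory_clock_ghz / word_bits        # 10^9 words per second
flop_word_ratio = gflops / gwords

mb, nb = 64, 16
bandwidth_reduction = 2 / (1 / mb + 1 / nb)

# Compute bound when the reduction exceeds the FLOP:word ratio.
compute_bound = bandwidth_reduction > flop_word_ratio
print(gflops, gwords, flop_word_ratio, bandwidth_reduction, compute_bound)
```

Swapping in another card's clock rates and bus width gives the minimum block sizes that card would need.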
GPU Symmetric Rank-K Update
$C = \alpha AA^T + \beta C$
Similar to matrix multiplication with $B = A^T$ except only the lower half of C is written to
[Diagram: A and its transpose combining into the lower triangle of C]
Can use the same code modified so that thread blocks strictly above the diagonal exit early and those on the diagonal only write to the lower half
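The two modifications can be shown with a host-side loop over blocks of C. This is a sequential sketch, with each `(bi, bj)` iteration standing in for one thread block; the function name `syrk_lower` and the block size `bs` are assumptions.

```python
def syrk_lower(alpha, a, beta, c, bs):
    """c = alpha * a * a^T + beta * c, written to the lower triangle only."""
    n = len(c)
    k = len(a[0])
    for bi in range(0, n, bs):          # block row of C
        for bj in range(0, n, bs):      # block column of C
            if bj > bi:
                continue                # strictly above the diagonal: exit early
            for i in range(bi, min(bi + bs, n)):
                for j in range(bj, min(bj + bs, n)):
                    if j > i:
                        continue        # diagonal block: lower half only
                    dot = sum(a[i][l] * a[j][l] for l in range(k))
                    c[i][j] = alpha * dot + beta * c[i][j]
    return c

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
c = [[9.0] * 3 for _ in range(3)]
print(syrk_lower(2.0, a, 1.0, c, 2))
```

Entries above the diagonal keep their original value of 9.0, showing that the upper half is never written.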
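GPU Triangular Solve
Solves $XA^T = \alpha B$ (A lower triangular) by calculating
$$B = \alpha B(A^{-1})^T$$
where X overwrites B
$$B_{i,j} = \frac{1}{A_{j,j}} \left( \alpha B_{i,j} - \sum_{k=0}^{j-1} B_{i,k} A_{j,k} \right)$$
Each row needs to be updated left-to-right
Schedule column of thread blocks and use a loop to enforce ordering (slow)
[Diagram: triangular A applied to B, with each row of B swept from left to right]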
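The left-to-right dependency is visible in a direct implementation. The sketch below assumes A is lower triangular, as the Cholesky factor is; the function name `trsm_in_place` is illustrative.

```python
def trsm_in_place(alpha, a, b):
    """Overwrite b with X, where X a^T = alpha * b and a is lower triangular."""
    n = len(a)
    for row in b:                 # rows are independent of each other
        for j in range(n):        # columns must go left to right
            # Uses row[0..j-1], which already hold the solution X.
            s = sum(row[k] * a[j][k] for k in range(j))
            row[j] = (alpha * row[j] - s) / a[j][j]
    return b

a = [[2.0, 0.0], [1.0, 3.0]]
b = [[1.0, 3.5], [3.0, 7.5]]
print(trsm_in_place(2.0, a, b))
```

Rows can be handed to different threads, but within a row column j must wait for columns 0 to j−1, so the ordering has to be enforced with a loop.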
Hybrid Cholesky Decomposition - Results
[Plot: performance in GFLOPs/s (axis 0 to 180) against matrix size n (axis 0 to 4500)]
Performance reaches 180 GFLOPs/s
Replacing Triangular Solve
Triangular Solve is slow
$B = \alpha B(A^{-1})^T$
Contains inverse (slow) and matrix multiplication (fast)
Separate into $A = A^{-1}$ and $B = \alpha BA^T$
GPU Triangular Matrix Multiply
$B = \alpha BA^T$
Implementation which updates B in place has similar dependencies to triangular solve (so similar performance)
$$B_{i,j} = \alpha \sum_{k=0}^{j} B_{i,k} A_{j,k}$$
Elements of B which have not yet been calculated are used to update the current element
In an "out of place" implementation each element is independent
$$X_{i,j} = \alpha \sum_{k=0}^{j} B_{i,k} A_{j,k}$$
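The difference between the two variants can be seen side by side. This sketch assumes A is lower triangular; the function names are illustrative, and the right-to-left sweep is one way to keep the in-place version correct rather than a description of the talk's kernel.

```python
def trmm_out_of_place(alpha, b, a):
    """Return X = alpha * b * a^T; b is only read, so every X[i][j] is independent."""
    n = len(a)
    return [[alpha * sum(row[k] * a[j][k] for k in range(j + 1))
             for j in range(n)]
            for row in b]

def trmm_in_place(alpha, b, a):
    """Overwrite b with alpha * b * a^T. Column j needs the old values in
    columns 0..j, so each row has to be swept right to left."""
    n = len(a)
    for row in b:
        for j in range(n - 1, -1, -1):
            row[j] = alpha * sum(row[k] * a[j][k] for k in range(j + 1))
    return b

a = [[2.0, 0.0], [1.0, 3.0]]
b = [[1.0, 2.0], [3.0, 4.0]]
x = trmm_out_of_place(1.0, b, a)
print(x, trmm_in_place(1.0, [row[:] for row in b], a))
```

The out-of-place version trades an extra output buffer for the freedom to compute every element in parallel.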
Calculating the Inverse
Have replaced triangular solve $B = \alpha B(A^{-1})^T$ with triangular multiply $X = \alpha BA^T$
Now need to form inverse of A
  A is diagonal block in blocked Cholesky decomposition
  Have just computed Cholesky decomposition of A using CPU
  Cholesky decomposition provides faster calculation of inverse
Calculate inverse of diagonal block A on CPU and copy into temporary block on GPU.
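Inverting a triangular factor only needs substitution, which is cheap for a small diagonal block. The sketch below assumes the quantity being inverted is the lower triangular Cholesky factor of the block; the function name `invert_lower` is illustrative.

```python
def invert_lower(l):
    """Return the inverse of lower triangular l, one column at a time."""
    n = len(l)
    inv = [[0.0] * n for _ in range(n)]
    for c in range(n):
        inv[c][c] = 1.0 / l[c][c]
        for r in range(c + 1, n):
            # Row r of l times column c of inv must be zero below the diagonal.
            s = sum(l[r][k] * inv[k][c] for k in range(c, r))
            inv[r][c] = -s / l[r][r]
    return inv

l = [[2.0, 0.0], [1.0, 4.0]]
print(invert_lower(l))
```

The inverse of a lower triangular matrix is itself lower triangular, so the result can feed a triangular multiply directly.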
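Hybrid Cholesky decomposition without triangular solve
[Diagram: lower triangle of the matrix partitioned into block rows A B and C D, with temporary blocks X and Z]
1: $B = B - AA^T$
2: $X = D - CA^T$
3: $B = \mathrm{chol}(B)$
4: $Z = B^{-1}$
5: $D = XZ^T$
Use out of place matrix multiply to populate X then triangular multiply to copy back to D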
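The five steps can be put together in a sequential Python sketch, with plain loops standing in for the GPU and CPU routines. The function name `cholesky_without_trsm`, the block size `nb` and the temporaries `x` and `z` follow the slide's letters but are otherwise assumptions, not the talk's implementation.

```python
import math

def cholesky_without_trsm(a, nb):
    """Overwrite the lower triangle of a with L using steps 1 to 5."""
    n = len(a)
    for j in range(0, n, nb):
        e = min(j + nb, n)
        w = e - j  # width of the diagonal block B
        # Step 1: B = B - A A^T.
        for r in range(j, e):
            for c in range(j, r + 1):
                a[r][c] -= sum(a[r][k] * a[c][k] for k in range(j))
        # Step 2: X = D - C A^T, written out of place.
        x = [[a[r][c] - sum(a[r][k] * a[c][k] for k in range(j))
              for c in range(j, e)] for r in range(e, n)]
        # Step 3: B = chol(B).
        for c in range(j, e):
            d = a[c][c] - sum(a[c][k] ** 2 for k in range(j, c))
            if d <= 0.0:
                raise ValueError("matrix is not positive definite")
            a[c][c] = math.sqrt(d)
            for r in range(c + 1, e):
                s = sum(a[r][k] * a[c][k] for k in range(j, c))
                a[r][c] = (a[r][c] - s) / a[c][c]
        # Step 4: Z = B^-1 (inverse of the triangular factor).
        z = [[0.0] * w for _ in range(w)]
        for c in range(w):
            z[c][c] = 1.0 / a[j + c][j + c]
            for r in range(c + 1, w):
                s = sum(a[j + r][j + k] * z[k][c] for k in range(c, r))
                z[r][c] = -s / a[j + r][j + r]
        # Step 5: D = X Z^T, a triangular multiply that writes back into D.
        for r in range(e, n):
            for c in range(w):
                a[r][j + c] = sum(x[r - e][k] * z[c][k] for k in range(c + 1))
    for r in range(n):  # clear the unused upper triangle
        for c in range(r + 1, n):
            a[r][c] = 0.0
    return a

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
print(cholesky_without_trsm([row[:] for row in A], 2))
```

Steps 2 and 5 are both out-of-place products, so no step other than the small CPU factorization and inverse carries an ordering constraint.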
Hybrid Cholesky decomposition without triangular solve
[Plot: performance in GFLOPs/s (axis 0 to 180) against matrix size n (axis 0 to 4500)]
![Page 67: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/67.jpg)
Improving diagonal block transfer
Each memory copy has overhead
$t = \frac{n}{\text{bandwidth}} + \text{overhead}$
![Page 69: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/69.jpg)
Copying matrices
Matrices are stored in memory as an array of n columns of m elements (“column major”)
Each column is padded so that the next column is aligned on a memory boundary
Submatrices share the same padding as the larger matrix
Each column must be copied separately
$t(m, n) = n \times \left(\frac{m}{\text{bandwidth}} + \text{overhead}\right)$
If the number of rows is a multiple of the memory alignment there is no padding
$t(m, n) = \frac{m \times n}{\text{bandwidth}} + \text{overhead}$
But submatrices are always padded
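The two cost formulas above can be compared directly. The sketch below is a minimal model only: the function names are hypothetical, `bandwidth` is assumed to be in elements per second and `overhead` in seconds per copy call, and the numbers in the usage lines are made up for illustration.

```python
def copy_time_by_column(m, n, bandwidth, overhead):
    """n separate copies of m elements: t = n * (m / bandwidth + overhead)."""
    return n * (m / bandwidth + overhead)

def copy_time_contiguous(m, n, bandwidth, overhead):
    """One copy of m * n unpadded elements: t = (m * n) / bandwidth + overhead."""
    return (m * n) / bandwidth + overhead

def overhead_saved(m, n, bandwidth, overhead):
    """Time saved by the single copy; algebraically this is (n - 1) * overhead."""
    return (copy_time_by_column(m, n, bandwidth, overhead)
            - copy_time_contiguous(m, n, bandwidth, overhead))

# Illustrative values only: 4 rows, 3 columns, 2 elements/s, 0.5 s per call.
by_column = copy_time_by_column(4, 3, 2.0, 0.5)
contiguous = copy_time_contiguous(4, 3, 2.0, 0.5)
```

The data moved is the same in both cases; the difference is only the per-call overhead, paid n times instead of once.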
![Page 73: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/73.jpg)
Block Column Copy
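[Diagram: matrix partitioned into blocks A and B (top row) and C and D (bottom row), with temporary blocks X and Z]
Define column around B
No padding when n is a multiple of the memory alignment
Faster to copy (larger) column when
$\frac{n \times n_b}{\text{bandwidth}} + \text{overhead} < n_b \times \left(\frac{n_b}{\text{bandwidth}} + \text{overhead}\right)$
Don’t have to worry about overwriting updated D as matrix multiply is now out of place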
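The inequality above can be turned into a simple decision rule. In this sketch the function names are hypothetical, `bandwidth` is assumed to be in elements per second and `overhead` in seconds per copy call. Rearranging the inequality gives the break-even height $n < n_b + (n_b - 1) \times \text{overhead} \times \text{bandwidth} / n_b$.

```python
def block_column_is_faster(n, nb, bandwidth, overhead):
    """True when one copy of the n-by-nb block column beats nb copies of nb elements."""
    whole_column = (n * nb) / bandwidth + overhead
    block_only = nb * (nb / bandwidth + overhead)
    return whole_column < block_only

def break_even_rows(nb, bandwidth, overhead):
    """Upper bound on n below which copying the whole block column is faster."""
    return nb + (nb - 1) * overhead * bandwidth / nb

# Illustrative values only: block size 4, 2 elements/s, 1 s per call.
limit = break_even_rows(4, 2.0, 1.0)
```

With a large per-call overhead relative to bandwidth, the bound grows, so copying the extra rows pays off for taller columns.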
![Page 78: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/78.jpg)
Block Column Copy - results
[Plot: performance in GFLOPs/s (axis 0 to 350) against matrix size n (axis 0 to 4500)]
![Page 79: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/79.jpg)
Tuning the block size
For CPU blocked algorithms the block size is chosen so that blocks fit in the CPU cache
Introduced an extra level of blocking for hybrid algorithm
Aim to choose block size so that workload is balanced between computing devices
![Page 80: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/80.jpg)
Static block size
Aim to minimise area between two curves
![Page 81: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/81.jpg)
Dynamic block size
Can change block size on each iteration and still have a correct algorithm
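A blocked factorisation only needs the diagonal blocks to tile the matrix without gaps or overlaps, so the block size can differ from one iteration to the next. The sketch below shows that partitioning step alone. The names `block_ranges` and `choose_nb` are hypothetical, and the example policy (a quarter of the remaining rows) is made up for illustration; it is not the tuning rule used in the talk.

```python
def block_ranges(n, choose_nb):
    """Yield (start, end) row ranges covering 0..n with a per-iteration block size.

    choose_nb(start, n) is a caller-supplied policy; its result is clamped so
    every block has at least one row and never runs past the end of the matrix.
    """
    start = 0
    while start < n:
        nb = max(1, min(choose_nb(start, n), n - start))
        yield start, start + nb
        start += nb

# Hypothetical policy: use a quarter of the rows that are still unfactorised.
ranges = list(block_ranges(10, lambda start, n: (n - start) // 4))
```

Each `(start, end)` pair would drive one iteration of the blocked algorithm, with `end - start` taking the place of the fixed block size.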
![Page 82: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/82.jpg)
Dynamic block size - results
[Plot: performance in GFLOPs/s (axis 0 to 350) against matrix size n (axis 0 to 4500)]
![Page 83: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/83.jpg)
GPU Cholesky decomposition
Now that the block size changes on each iteration it can get too small for the matrix multiply to sufficiently overlap the 2× copy and Cholesky decomposition on the CPU
Implement unblocked Cholesky decomposition for the GPU
Due to data dependencies only runs on one SM so that threads can share results
Uses triangular packed storage to make most efficient use of shared memory and get large number of threads
GPU is already performing matrix multiply
Possible to overlap both on the GPU?
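Triangular packed storage keeps only the n(n+1)/2 elements of one triangle, which is why it fits a larger block into shared memory. The slide does not say which packed layout is used; the sketch below assumes the column-major lower-triangle layout used by LAPACK packed routines, with zero-based indices, and the function names are hypothetical.

```python
def packed_index(i, j, n):
    """Offset of element (i, j), with j <= i, in column-major lower packed storage."""
    if not 0 <= j <= i < n:
        raise IndexError("(i, j) must lie in the lower triangle of an n-by-n matrix")
    return i + j * (2 * n - j - 1) // 2

def pack_lower(a):
    """Copy the lower triangle of a square matrix (list of rows) into packed form."""
    n = len(a)
    return [a[i][j] for j in range(n) for i in range(j, n)]

# A 4-by-4 matrix needs 10 packed elements instead of 16.
a = [[10 * i + j for j in range(4)] for i in range(4)]
packed = pack_lower(a)
```

Looking up `packed[packed_index(i, j, 4)]` returns the same value as `a[i][j]` for every element on or below the diagonal.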
![Page 88: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/88.jpg)
Multiple kernels
nVidia CUDA Programming Guide (2008): maximum performance occurs when no threads execute divergent branches
nVidia CUDA Programming Guide (2011): maximum performance occurs when no threads within the same warp execute divergent branches
Write combined matrix multiply and unblocked Cholesky decomposition kernel
First n − 1 thread blocks perform matrix multiplication
Thread block n performs Cholesky decomposition
![Page 91: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/91.jpg)
Multiple Kernels - results
[Plot: vertical axis 0 to 350, horizontal axis 0 to 4500]
![Page 92: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/92.jpg)
Final Results
[Plot: vertical axis 0 to 350, horizontal axis 0 to 4500]
![Page 93: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/93.jpg)
Conclusions
Possible to get large speed improvements for inherently sequential algorithms such as the Cholesky decomposition by carefully considering the structure of the algorithm and the type of operations performed at each step.
Having a GPU available allows processing to overlap making maximum use of the available parallelism.
Different types of parallel workloads can be sent to the most appropriate device.
![Page 96: A hybrid Cholesky decomposition algorithm for multicore CPUs with](https://reader031.fdocuments.in/reader031/viewer/2022021804/620d7a9f8d625361597ae318/html5/thumbnails/96.jpg)
References
Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC ’08, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.