In-Memory Data Compression for Sparse Matrix on GPU & CPU
Dr. Lawlor, U. Alaska 1
Dr. Orion Sky Lawlor, [email protected], U. Alaska Fairbanks
IA^3 workshop, SC13, 2013-11-17
Why Compress Memory?
“Communication dominates energy” — Keckler, Dally, Khailany, Garland, and Glasco, 2011
“Minimizing data transport energy, rather than arithmetic logic unit (ALU) energy, is the real challenge.” — Josep Torrellas, 2009
“At scale arithmetic is free, and communication is expensive.” — me, now
Modern CPU Storage Hierarchy
Disclaimer: costs are quoted to zero significant digits
Tradeoffs: Compressing Memory
- Compressed data uses less bandwidth between levels – linear speedup, e.g., save 20% storage, save 20% bandwidth
- Compressed data might fit in a higher level – nonlinear gains possible, e.g., fit in cache, get 3x speedup
- To be useful, the benefits must outweigh the encoding and decoding costs – good implementation matters!
Sparse Matrix
- The matrix is a modern lingua franca – computer scientists, application scientists, mathematicians
- A dense matrix is wildly inefficient – a mesh-derived matrix is sparse, e.g., 5 nonzeros/row, 10^6 cols!
- A “sparse matrix” skips over the zeros – only store or compute the nonzeros, e.g., a 10^6 x 10^6 matrix in 10^8 bytes
- See Intel MKL, NVIDIA CUSPARSE
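The 10^8-byte figure follows from simple arithmetic. A quick check, assuming 8-byte values and 4-byte column/start indices in a standard CSR layout (these widths are an assumption for illustration):

```cpp
#include <cstdint>

// CSR storage for an n x n matrix with nnz_per_row nonzeros per row:
// vals (8 B each) + cols (4 B each) + start (n+1 ints of 4 B each).
uint64_t csr_bytes(uint64_t n, uint64_t nnz_per_row) {
    uint64_t nnz = n * nnz_per_row;
    return nnz * 8 + nnz * 4 + (n + 1) * 4;
}
```

For n = 10^6 and 5 nonzeros/row this gives 64,000,004 bytes, on the order of 10^8 bytes as the slide claims.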
Compressed Sparse Row: CSR (Prior Work)
Prior: Compressed Sparse Row
Start with a matrix
Prior: Compressed Sparse Row
Write down the nonzeros or “values”
Prior: Compressed Sparse Row
Write down the column numbers
Prior: Compressed Sparse Row
Write down the start offset for each row
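The three arrays built up over the last few slides can be shown on a tiny matrix. A minimal sketch (the array names `vals`, `cols`, and `start` follow the slides; the helper functions and the example matrix are illustrative):

```cpp
#include <vector>

// Build CSR arrays from a small dense matrix: store only the nonzeros
// ("vals"), their column numbers ("cols"), and where each row starts
// in those arrays ("start", with one extra end-sentinel entry).
struct CSR {
    std::vector<double> vals;
    std::vector<int> cols;
    std::vector<int> start;
};

CSR dense_to_csr(const std::vector<std::vector<double>> &A) {
    CSR m;
    for (const auto &row : A) {
        m.start.push_back((int)m.vals.size()); // this row begins here
        for (int c = 0; c < (int)row.size(); ++c)
            if (row[c] != 0.0) {
                m.vals.push_back(row[c]);
                m.cols.push_back(c);
            }
    }
    m.start.push_back((int)m.vals.size()); // end sentinel
    return m;
}

// CSR matrix-vector product, mirroring the slides' csr_product()
std::vector<double> csr_spmv(const CSR &m, const std::vector<double> &src) {
    int max_row = (int)m.start.size() - 1;
    std::vector<double> dest(max_row);
    for (int r = 0; r < max_row; ++r) {
        double sum = 0.0;
        for (int i = m.start[r]; i < m.start[r+1]; ++i)
            sum += m.vals[i] * src[m.cols[i]];
        dest[r] = sum;
    }
    return dest;
}
```

For the 3x3 matrix {{1,0,2},{0,3,0},{4,0,5}} this yields vals = {1,2,3,4,5}, cols = {0,2,1,0,2}, start = {0,2,3,5}.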
Prior: CSR Matrix-Vector Product

```cpp
void csr_product(const real *src, real *dest) {
    for (int r = 0; r < max_row; ++r) {
        real sum = 0.0;
        for (int i = start[r]; i < start[r+1]; ++i)
            sum += vals[i] * src[cols[i]];
        dest[r] = sum;
    }
}
```
Prior: CSR Matrix-Vector Product
The same loop again: the gather src[cols[i]] means irregular reads!
Prior: CSR on Multicore
SIMD (SSE/AVX) doesn't help at all, due to the cost of the vector gather for src[cols[i]].
Prior: CSR on Multicore
OpenMP helps, but scales poorly: the “irregular memory read” rate is the limit.

```cpp
void csr_product(const real *src, real *dest) {
#pragma omp parallel for schedule(dynamic,32)
    for (int r = 0; r < max_row; ++r) {
        real sum = 0.0;
        for (int i = start[r]; i < start[r+1]; ++i)
            sum += vals[i] * src[cols[i]];
        dest[r] = sum;
    }
}
```
Compressed Column Index: CCI (New Work)
New: Compressed Column Index
- Compress the “cols” array only – rationale: “vals” is application data, and “start” is already small
- Store each column index as a variable bit-width jump instead of an “int” – reduce 32 bits to 2-3 bits
- Compression must be lossless
- Most matrix column indexes shrink by over 90%!
E.g., Compressed Column Index
- Initial size: 5 ints, 160 bits
- In a fixed-length binary code, 10 bits: 00 01 01 00 10
- In a variable-length code, 9 bits: 1 01 01 1 001
Actual Variable-Length Code
We encode an ascending sequence of column numbers (integers).
- A “run” is a contiguous sequence of columns – a common special case (>50%!)
- A “jump” is a gap in the sequence – handles the more general case
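The run/jump split can be sketched with a toy bit-cost model. The specific code widths below (1 bit for a run step, 3 bits for a short jump, a wide escape for rare long jumps) are an illustrative assumption, not the paper's actual code table:

```cpp
#include <vector>

// Toy cost model: count the bits to encode an ascending column
// sequence as deltas from the previous column. A delta of 1 (a run
// continuing) costs 1 bit; a short jump (delta 2..4) costs 3 bits;
// anything larger takes a 3-bit escape plus a raw 32-bit value.
// These widths are illustrative only.
int encoded_bits(const std::vector<int> &cols) {
    int bits = 0, prev = -1;
    for (int c : cols) {
        int delta = c - prev;
        prev = c;
        if (delta == 1)      bits += 1;       // run: the common case
        else if (delta <= 4) bits += 3;       // short jump
        else                 bits += 3 + 32;  // rare long jump
    }
    return bits;
}
```

Five contiguous columns cost 5 bits under this model, versus 160 bits as five 32-bit ints; a mostly-contiguous row stays close to one bit per nonzero.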
Compression...
Can you make it run fast?
Because if it's not fast, it's useless.
Running Fast: Avoid Branches
A typical compression implementation uses a sequence of branch instructions:
- Is it the 1-bit code?
- Or is it this 3-bit code?
- How about this 3-bit code?
A switch uses a branch table. This is very hard on branch predictors: “If those branches are predictable, by definition your compression is bad.” – Chris Hartman, U. Alaska
Running Fast: Use Decode Table
- Avoid branches using a decode table: always look up 3 bits, index into an 8-entry table, and add duplicate entries for the shorter codes – see [Edfors et al 1993]
- Can shrink the decode table by splitting the opcode from the data – the table stores a bit mask to extract the data portion
- Result: branch-free decoding
Running Fast: Decode Table

```cpp
int code = fetch_compressed_bits(bit_index);
const table_entry_t &T = table[code & opcode_mask];
bit_index += T.count;                  // move data pointer
column += T.mask & (code >> T.shift);  // extract column
```
Multicore Performance
General Multicore Scaling Rule
(Plot: bandwidth-limited kernels stop scaling as cores are added; arithmetic-limited kernels keep scaling.)
Sparse Matrix Multicore Scaling
CCI is bad on 1 core
CSR saturates quickly
CCI scales better
GPU Performance
Prior: CSR on GPU, in CUDA

```cpp
__global__ void csr_kernel(const real *src, real *dest,
                           const real *vals, const int *cols,
                           const int *start)
{
    int r = threadIdx.x + blockIdx.x * blockDim.x;
    if (r >= max_row) return;
    real sum = 0.0;
    for (int i = start[r]; i < start[r+1]; ++i)
        sum += vals[i] * src[cols[i]];
    dest[r] = sum;
}
```

A naive thread-per-row CUDA CSR implementation is about 2 GF – slower than a single CPU core!
CSR on GPU: The Thread Problem
(Figure: Thread 0 walks Row 0, Thread 1 walks Row 1, Thread 2 walks Row 2, Thread 3 walks Row 3, each advancing through Time 0..3, so adjacent threads touch distant memory.)
No locality!
CSR on GPU: The “Stride” Fix
(Figure: Block 0 handles Row 0, Block 1 handles Row 1, Block 2 handles Row 2, Block 3 handles Row 3; within each block, Threads 0-3 stride across the row's nonzeros.)
The ProblemCSR on GPU: The “Stride” Fix
Block 0Row 0
Block 1Row 1
Block 2Row 2
Block 3Row 3
Thread 0 Thread 1 Thread 2 Thread 3
Perfect locality!
CSR on GPU: “Stride” Code

```cpp
__global__ void csr_kernel(const real *src, real *dest,
                           const real *vals, const int *cols,
                           const int *start)
{
    int thread = threadIdx.x + blockIdx.x * blockDim.x;
    int r = thread / stride;
    if (r >= max_row) return;
    real sum = 0.0;
    for (int i = start[r] + (thread % stride); i < start[r+1]; i += stride)
        sum += vals[i] * src[cols[i]];
    shared_reduce(&sum, stride);
    dest[r] = sum;
}
```

Bell and Garland [2008] technique – in practice, stride = 8 threads/row.
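The effect of the stride mapping can be simulated in plain C++: each of `stride` lanes accumulates every stride-th nonzero of the row, and the lane partials are then summed, which is the role `shared_reduce` plays on the GPU. A CPU sketch under those assumptions:

```cpp
#include <vector>

// CPU sketch of the strided GPU mapping: lane t of a row's thread
// group reads nonzeros start[r]+t, start[r]+t+stride, ... so that
// consecutive lanes read consecutive memory (coalesced on the GPU).
double strided_row_sum(const std::vector<double> &vals,
                       const std::vector<int> &cols,
                       const std::vector<int> &start,
                       const std::vector<double> &src,
                       int r, int stride) {
    std::vector<double> partial(stride, 0.0);
    for (int t = 0; t < stride; ++t)  // one "lane" per GPU thread
        for (int i = start[r] + t; i < start[r+1]; i += stride)
            partial[t] += vals[i] * src[cols[i]];
    double sum = 0.0;                 // stand-in for shared_reduce
    for (double p : partial) sum += p;
    return sum;
}
```

Any stride gives the same row sum; the point of the mapping is purely the memory-access pattern.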
GPU: Workgroup Size vs Time
(Plot: too small a workgroup pays workgroup-creation overhead; too big leaves not enough parallelism; just right is around 192-256 threads.)
Conclusions
Performance Numbers
Matrix: AF_shell10, 1.5M rows
- CSR: 206.7 MB; CCI: 18.2 MB (saved 91%); total read bandwidth saved: 30%
- Multicore, Core i7-3770K: CCI runs at 9.6 GF, 11 ms per SpMV – 23% faster than CSR via Intel MKL 11
- GPU, NVIDIA GeForce 570: CCI runs at 18.8 GF, 5.5 ms per SpMV – 25% faster than CSR via CUSPARSE 4.2
CCI Beats CSR For Good Matrices
The Future: Memory Bandwidth
- Today: >1 TF/s of arithmetic, but only 0.1 TB/s of bandwidth
- Don't communicate, recompute – compiler + runtime decides when
- 64-bit -> 32 -> 16 -> 8 -> 4 bits ... – spend arithmetic on decompression
- Split solution + residual storage – most flops use fewer bits, in the residual
- Fight roundoff with stochastic rounding – add noise to improve precision
Conclusions
- Fight bandwidth, not arithmetic – save bandwidth, save time; saves energy, too!
- CPU and GPU performance are similar – but the GPU needs strided access, and SIMD on the CPU is explicit
- In-memory data compression is feasible today, at over 46 GB/s
http://tinyurl.com/lawlorCCI