Parallel Lossless Compression Using GPUs - NVIDIA


Page 1: Parallel Lossless Compression Using GPUs - NVIDIA · Author: Evangelia Sitaridi · Subject: Given the high cost of enterprise data storage, compression

Parallel lossless compression using GPUs

Eva Sitaridi* (Columbia University), Rene Mueller (IBM Almaden), Tim Kaldewey (IBM Almaden)

*Work done while interning at IBM Almaden, partially funded by NSF Grant IIS-1218222

Page 2

Agenda

• Introduction

• Overview of compression algorithms

• GPU implementation

– LZSS compression

– Huffman coding

• Experimental results

• Conclusions

Page 3

Why compression?

• Data volume doubles every 2 years*

– Data retained for longer periods

– Data retained for business analytics

• Make better use of available storage resources

– Increase storage capacity

– Improve backup performance

– Reduce bandwidth utilization

• Compression should be seamless

• Decompression important for Big Data workloads

*Sybase Adaptive Server Enterprise Data Compression, Business white paper 2012

Page 4

Compression trade-offs

Metrics:

• Compression ratio

• Compression speed

• Decompression speed

More important in some cases!

Trade-offs:

• Compression speed vs Compression efficiency

• Decompression speed vs Compression efficiency

• Compression speed vs Decompression speed

[Diagram labels: Input file, Initial input file]

Resources:

• Memory bandwidth

• Memory space

• CPU utilization

Page 5

Compression resource intensive

Dataset: English Wikipedia pages 1GB XML text dump

• Default compression level used

• Performance on Intel i7-3930K (6 cores, 3.2 GHz)

[Figure: scatter plot of pigz, lzma, bzip2, gzip and xz; x-axis: compression efficiency (0 to 0.4); y-axis: log scale (0.001 to 1)]

Compression efficiency = 0.5: compressed file is half the size of the original
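The efficiency metric on this slide can be reproduced in a few lines. This is a minimal sketch that assumes efficiency means compressed size divided by original size (consistent with "0.5 means half the original"). It uses Python's standard `zlib` module as a stand-in compressor; the function name `compression_efficiency` is illustrative and not from the slides.

```python
import zlib


def compression_efficiency(data: bytes, level: int = 6) -> float:
    """Return compressed size / original size (smaller means better compression)."""
    if not data:
        raise ValueError("efficiency is undefined for empty input")
    return len(zlib.compress(data, level)) / len(data)


if __name__ == "__main__":
    sample = b"ATTACTAGAATGT" * 1000  # highly repetitive, so it compresses well
    print(f"efficiency = {compression_efficiency(sample):.3f}")
```

Under this definition a value near 1 indicates almost incompressible input, and a value near 0 indicates highly redundant input.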

Page 6

Compression libraries

snappy

All use LZ-variants

XZ

Deflate format

– LZ77 compression

– Huffman coding

– Single threaded

Parallel gzip

Page 8

LZSS compression

Input characters (positions 0 1 2 3 …):

ATTACTAGAATGT TACTAATCTGAT CGGGCCGGGCCTG

• Sliding window buffer

• Unencoded lookahead characters

• Find longest match

Output tokens:

ATTACTAGAATGT(2,5)…

• Backreferences: (Position, Length)

• Literals: Unmatched characters

• Minimum match length
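The search step above can be sketched as a sequential Python model. It is an illustration under stated assumptions, not the GPU kernel from the talk: the minimum match length of 3 and the 4096-character window are assumed values, the 66-character lookahead is taken from a later slide, and the names `longest_match` and `lzss_compress` are hypothetical. Positions are absolute offsets from the start of the block, as in the slide's (2,5) token.

```python
from typing import List, Tuple, Union

Token = Union[str, Tuple[int, int]]  # literal character or (position, length)


def longest_match(data: str, i: int, window: int = 4096,
                  lookahead: int = 66) -> Tuple[int, int]:
    """Return (position, length) of the longest match for data[i:] starting in the window."""
    best_pos, best_len = 0, 0
    max_len = min(lookahead, len(data) - i)
    for start in range(max(0, i - window), i):
        length = 0
        # Extend the match while window and lookahead characters agree.
        while length < max_len and data[start + length] == data[i + length]:
            length += 1
        if length > best_len:
            best_pos, best_len = start, length
    return best_pos, best_len


def lzss_compress(data: str, min_match: int = 3) -> List[Token]:
    """Greedy LZSS: emit a backreference when the match is long enough, else a literal."""
    tokens: List[Token] = []
    i = 0
    while i < len(data):
        pos, length = longest_match(data, i)
        if length >= min_match:
            tokens.append((pos, length))
            i += length
        else:
            tokens.append(data[i])
            i += 1
    return tokens


if __name__ == "__main__":
    text = "ATTACTAGAATGT" + "TACTAATCTGAT"
    print(longest_match(text, 13))      # the single search step drawn on the slide
    print(lzss_compress("abcabcabc"))
```

The slide illustrates one search step with the first segment as the window. A greedy encoder run over the concatenated string may choose different tokens, because it is also free to match across the segment boundary.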

Page 9

LZSS decompression

Window buffer contents: W I K I P E D I A . C O

Input data block (tokens): (0,4)M(5,4)COMM…

Output data block: WIKIMEDIACOMM
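The example on this slide can be checked with a short sketch. It assumes, as the slide's numbers suggest, that each (position, length) pair indexes into the window buffer contents. The helper names `parse_tokens` and `lzss_decompress` are hypothetical, and the model does not slide the window forward as output is produced.

```python
import re
from typing import List, Tuple, Union

Token = Union[str, Tuple[int, int]]  # literal character or (position, length)


def parse_tokens(text: str) -> List[Token]:
    """Parse the slide notation, e.g. '(0,4)M(5,4)COMM', into a token list."""
    tokens: List[Token] = []
    for ref_pos, ref_len, literal in re.findall(r"\((\d+),(\d+)\)|(.)", text):
        tokens.append(literal if literal else (int(ref_pos), int(ref_len)))
    return tokens


def lzss_decompress(tokens: List[Token], window: str) -> str:
    """Expand tokens; backreferences copy `length` characters from the window buffer."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            out.append(window[pos:pos + length])
        else:
            out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    print(lzss_decompress(parse_tokens("(0,4)M(5,4)COMM"), "WIKIPEDIA.CO"))
```

Here (0,4) copies "WIKI" and (5,4) copies "EDIA" from the window, while M, C, O, M, M are literals.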

Page 10

Huffman algorithm

• Huffman tree

– Leaves: encoded symbols

– Unique prefix for each character

• Huffman coding

– Short codes for frequent characters

• Huffman decoding

A) Traverse tree to decode

B) Use look-up tables for faster decoding

[Figure: example Huffman tree with leaves ‘a’, ‘f’, ‘s’, ‘e’, ‘h’, ‘r’, node weights 13, 6, 7, 3, 4, and edges labeled 0/1]
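A minimal sketch of the tree construction and of decoding option A follows. It is a textbook Huffman construction using Python's `heapq`, not code from the talk. The function names are illustrative, and the decoder matches codewords bit by bit, which is equivalent to walking the tree from the root.

```python
import heapq
from collections import Counter
from typing import Dict


def huffman_codes(text: str) -> Dict[str, str]:
    """Build a prefix-free code; frequent symbols receive shorter codewords."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate tree: a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial codeword}).
    heap = [(w, n, {s: ""}) for n, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every codeword below them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, counter, merged))
        counter += 1
    return heap[0][2]


def encode(text: str, codes: Dict[str, str]) -> str:
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: Dict[str, str]) -> str:
    """Consume one bit at a time until a full codeword is recognized."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    codes = huffman_codes("aaaaaabbbc")
    print(codes)
    print(decode(encode("aaaaaabbbc", codes), codes))
```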

Page 11

What to accelerate?

Profile of gzip on Intel i7-3930K

Input: Compressible database column

[Pie chart: gzip run-time profile. Segments: LZSS: Longest match, LZSS: Other, Huffman: Send bits, Update crc, Huffman: Compress block, Huffman: Count tally. Values: 87.2%, 4.9%, 1.9%, 1.4%, 1.9%, 1.8%]

>85% of time spent on string matching

Accelerate LZSS first

Page 12

Why GPUs?

                              Intel i7-3930K   Tesla K20x
Memory bandwidth (spec)       51.2 GB/s        250 GB/s
Memory bandwidth (measured)   40.4 GB/s        197 GB/s
#Cores                        6                2688

• LZSS string matching is memory bandwidth intensive

– Leverage GPU bandwidth

Page 13

How to parallelize compression/decompression?

>1000 cores available!

Naïve approach: Threads process independent data/file blocks

• Split input file into independent blocks

• Thread 1 processes Data block 1, Thread 2 processes Data block 2

Page 14

Memory access pattern

Optimal GPU memory access pattern:

• Memory accesses of threads T1, T2, T3 fall in the same cache line

Actual memory access pattern (T1, T2, T3 each work on their own block: Data block 1, Data block 2, Data block 3):

• Data block size > 32K

• Many cache lines loaded

• Low memory bandwidth
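The difference between the two patterns can be made concrete by counting how many cache lines one group of threads touches in a single step. This sketch assumes 32 threads per group, 4-byte reads and 128-byte cache lines. These are typical values for NVIDIA GPUs of that generation but are not stated on the slide, and `cache_lines_touched` is a hypothetical helper.

```python
def cache_lines_touched(num_threads: int, stride_bytes: int,
                        access_bytes: int = 4, line_bytes: int = 128) -> int:
    """Count distinct cache lines hit when thread t reads access_bytes at t * stride_bytes."""
    lines = set()
    for t in range(num_threads):
        first = (t * stride_bytes) // line_bytes
        last = (t * stride_bytes + access_bytes - 1) // line_bytes
        lines.update(range(first, last + 1))
    return len(lines)


if __name__ == "__main__":
    # Optimal: neighbouring threads read neighbouring words.
    print(cache_lines_touched(32, stride_bytes=4))
    # Naive: every thread reads from its own 32K data block.
    print(cache_lines_touched(32, stride_bytes=32 * 1024))
```

With contiguous accesses the whole group is served by one line. With one 32K block per thread, every thread pulls in a separate line and uses only a few bytes of it.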

Pages 15-17

Thread utilization

SIMT Architecture: Group execution

Threads T1 to T6 each process one data block (Data block 1 to Data block 6):

i = thread id
j = 0
…
while (window[i] == lookahead[j]) { j++; …. }

Different #iterations for each thread:

• Iter. 1: 6 active threads

• Iter. 2: 4 active threads

• Iter. 3: 1 active thread

(6+4+1)/(3*6) = 11/18 = 61% thread utilization
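The utilization figure can be reproduced with a small model. It assumes that a thread group runs for as many steps as its slowest thread and that finished threads sit idle. The per-thread iteration counts below are one assignment consistent with the slide's 6, 4 and 1 active threads; the function names are illustrative.

```python
from typing import List


def active_threads_per_iteration(iterations: List[int]) -> List[int]:
    """Number of threads still looping in each lock-step iteration."""
    return [sum(1 for n in iterations if n > k) for k in range(max(iterations))]


def simt_utilization(iterations: List[int]) -> float:
    """Useful thread-iterations divided by thread-iterations occupied by the group."""
    steps = max(iterations)
    if steps == 0:
        return 1.0
    return sum(iterations) / (steps * len(iterations))


if __name__ == "__main__":
    per_thread = [3, 2, 2, 2, 1, 1]  # match-loop iterations for T1..T6 (assumed)
    print(active_threads_per_iteration(per_thread))
    print(f"{simt_utilization(per_thread):.0%}")
```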

Page 18

GPU LZSS General compression

Better approach: Each data block is processed by a thread group

• Input file is split into Data block 1, Data block 2, …, Data block n

• Thread group 1, Thread group 2, …, Thread group n each compress one data block into an intermediate output

• Compact the intermediate output into the output file

• Store list of compressed data block offsets for parallel decompression
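The compaction step and the offset list can be sketched as follows. This is a sequential model under assumptions: `zlib` stands in for the per-block GPU LZSS kernel, the 64K block size is borrowed from a later slide, and `compress_blocks` / `decompress_blocks` are hypothetical names. The prefix sum over compressed sizes is what tells each block where its output goes.

```python
import zlib
from itertools import accumulate
from typing import List, Tuple


def compress_blocks(data: bytes, block_size: int = 64 * 1024) -> Tuple[bytes, List[int]]:
    """Compress fixed-size blocks independently and compact them into one buffer."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Each block is compressed on its own (one thread group per block on the GPU).
    intermediate = [zlib.compress(block) for block in blocks]
    # Running sum of compressed sizes: offsets[k] is where block k starts.
    offsets = [0] + list(accumulate(len(c) for c in intermediate))
    return b"".join(intermediate), offsets


def decompress_blocks(payload: bytes, offsets: List[int]) -> bytes:
    """Blocks are located through the offset list, so each could be expanded in parallel."""
    return b"".join(zlib.decompress(payload[start:end])
                    for start, end in zip(offsets, offsets[1:]))


if __name__ == "__main__":
    original = bytes(range(256)) * 1000
    payload, offsets = compress_blocks(original)
    print(offsets)
    print(decompress_blocks(payload, offsets) == original)
```

Storing one extra entry (the total size) keeps the decompressor simple: block k always spans `offsets[k]` to `offsets[k + 1]`.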

Page 19

Compression efficiency vs Compression performance

[Figure: compression efficiency and compression performance plotted against window size]

GPU LZSS*: Lookahead: 66 chars, Block size: 64K chars

• Faster performance drop

• No gain in compression efficiency

* Related papers:

A. Ozsoy and M. Swany, “CULZSS: LZSS Lossless Data Compression on CUDA”

A. Balevic, “Parallel Variable-Length Encoding on GPGPUs”

Pages 20-22

GPU LZSS decompression

Compressed input (tokens): CCGA(0,2)CGG(4,3)AGTT

Uncompressed output: CCGACCCGGCCCAGTT

1) Compute total size of tokens (serialized)

2) Read tokens (parallel)

3.1) Compute uncompressed output

3.2) Write uncompressed output

Problem: Backreferences processed in parallel might be dependent!

• Use voting function __ballot to detect conflicts

Page 23

Writing LZSS tokens to output

Case A: All literals

Tokens: CCGAGATTGAGTT

1) Write literals (parallel)

Case B: Literals & non-conflicting backreferences

Tokens: CCGA(0,2)CGG(0,3)AGTT

1) Write literals (parallel)

2) Write backreferences (parallel)

Case C: Literals & conflicting backreferences

Tokens: CCGA(0,2)CGG(4,3)AGTT

1) Write literals (parallel)

2) Write non-conflicting backreferences (parallel)

3) Write remaining backreferences (serial)
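The three cases can be modelled sequentially as below. This is a sketch under assumptions, not the kernel from the talk: a backreference is treated as non-conflicting when every position it reads was already written by a literal, and the list of per-token "ready" flags plays the role that a warp vote such as `__ballot` would play on the GPU. Positions are absolute offsets into the output block, and `write_tokens` is a hypothetical name.

```python
from itertools import accumulate
from typing import List, Tuple, Union

Token = Union[str, Tuple[int, int]]  # literal character or (position, length)


def write_tokens(tokens: List[Token]) -> Tuple[str, int]:
    """Return (output, number of backreferences that had to be written serially)."""
    sizes = [t[1] if isinstance(t, tuple) else 1 for t in tokens]
    starts = [0] + list(accumulate(sizes))[:-1]  # exclusive prefix sum of token sizes
    out: List = [None] * sum(sizes)

    # Phase 1: literals never depend on other tokens.
    for tok, dst in zip(tokens, starts):
        if not isinstance(tok, tuple):
            out[dst] = tok

    # Vote: every backreference checks its source range before anything else is written.
    backrefs = [(tok, dst) for tok, dst in zip(tokens, starts) if isinstance(tok, tuple)]
    ready = [all(out[p] is not None for p in range(pos, pos + length))
             for (pos, length), _ in backrefs]

    # Phase 2: non-conflicting backreferences can all be written at once.
    for ((pos, length), dst), ok in zip(backrefs, ready):
        if ok:
            out[dst:dst + length] = out[pos:pos + length]

    # Phase 3: the remaining ones are resolved in order, one character at a time.
    serial = 0
    for ((pos, length), dst), ok in zip(backrefs, ready):
        if not ok:
            for k in range(length):
                out[dst + k] = out[pos + k]
            serial += 1
    return "".join(out), serial


if __name__ == "__main__":
    case_c = list("CCGA") + [(0, 2)] + list("CGG") + [(4, 3)] + list("AGTT")
    print(write_tokens(case_c))
```

In Case C, (4,3) reads positions 4 and 5, which are produced by (0,2), so it fails the vote and is written in the serial phase.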

Page 24

Huffman entropy coding

• Inherently sequential

• Coding challenge

– Compute destination of encoded data

• Decoding challenge

– Determine codeword boundaries

Focus on decoding for end-to-end decompression

Page 28

Parallel Huffman decoding

File block:

01100110 10111001 11010110 11100001 10111011 01110001 00000010 00001110

(Offset 1, Offset 2, Offset 3 and Offset 4 mark the start of each sub-block)

• During coding

– Split data blocks into sub-blocks

– Store sub-block offsets for parallel sub-block decoding

• During decoding

– Use look-up tables for decoding rather than Huffman trees

– Fit look-up table in shared memory

– Reduce number of codes for length and distance

Trade compression efficiency for decompression speed
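Both ideas, stored sub-block offsets and table-driven decoding, fit in a short sequential sketch. The code table `CODES`, the sub-block size of 4 symbols and all function names are hypothetical and chosen only for illustration. On a GPU each call to `decode_sub_block` would map to an independent thread, and the table would reside in shared memory.

```python
from typing import Dict, List, Tuple

# Hypothetical prefix-free code, used only for illustration.
CODES: Dict[str, str] = {"a": "0", "b": "10", "c": "110", "d": "111"}
MAX_LEN = max(len(code) for code in CODES.values())


def build_lut() -> List[Tuple[str, int]]:
    """Map every MAX_LEN-bit pattern to (symbol, codeword length)."""
    lut: List = [None] * (1 << MAX_LEN)
    for sym, code in CODES.items():
        pad = MAX_LEN - len(code)
        base = int(code, 2) << pad
        for fill in range(1 << pad):  # all patterns that start with this codeword
            lut[base + fill] = (sym, len(code))
    return lut


def encode(text: str, sub_block: int = 4) -> Tuple[str, List[int]]:
    """Return the bit string and the starting bit offset of each sub-block."""
    bits, offsets, pos = [], [], 0
    for i, ch in enumerate(text):
        if i % sub_block == 0:
            offsets.append(pos)
        bits.append(CODES[ch])
        pos += len(CODES[ch])
    return "".join(bits), offsets


def decode_sub_block(bits: str, start: int, end: int, lut) -> str:
    """Decode bits[start:end] with one table look-up per symbol."""
    out, pos = [], start
    while pos < end:
        peek = bits[pos:pos + MAX_LEN].ljust(MAX_LEN, "0")  # zero-pad at the tail
        sym, length = lut[int(peek, 2)]
        out.append(sym)
        pos += length
    return "".join(out)


def decode(bits: str, offsets: List[int], lut) -> str:
    bounds = offsets + [len(bits)]
    # Every sub-block knows its own bit range, so these calls are independent.
    return "".join(decode_sub_block(bits, s, e, lut) for s, e in zip(bounds, bounds[1:]))


if __name__ == "__main__":
    bits, offsets = encode("abacabadcd")
    print(bits, offsets)
    print(decode(bits, offsets, build_lut()))
```

The stored offsets are extra bytes in the compressed stream, which is one way the scheme trades compression efficiency for decompression speed.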

Page 29

Experimental system

                              Intel i7-3930K   Tesla K20x
Memory bandwidth (spec)       51.2 GB/s        250 GB/s
Memory bandwidth (measured)   40.4 GB/s        197 GB/s
Memory capacity               64 GB            6 GB
#Cores                        6 (12 threads)   2688
Clock frequency               3.2 GHz          0.732 GHz

Linux, kernel 3.0.74

Page 30

Datasets

Dataset             Size     Comp. efficiency*
English Wikipedia   1 GB     0.35
Database column     245 MB   0.98

*For default parameter of gzip

• Datasets already loaded in memory

• No disk I/O

Pages 31-32

Decompression performance

[Figure: decompression performance]

Data transfers slow down performance

Pages 33-35

Hide GPU to CPU transfer I/O using CUDA Streams

[Timeline figure: three schedules for batches B1, B2, B3 over time; each batch goes through Read, Decode, Decompress and Write]

• Batch processing: Read B1, Decode B1, Decompress B1, Write B1, Read B2, …

• Pipeline PCI/E transfers: batches are issued on separate streams, so the transfers of one batch overlap with the kernels of another

• Pipeline PCI/E transfers & Concurrent kernel execution: in addition, kernels of different batches run concurrently
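The benefit of overlapping the stages can be estimated with a simple schedule model. This sketch makes several assumptions: the four stages are Read, Decode, Decompress and Write, each stage handles one batch at a time, and the stage durations are made-up numbers rather than measurements from the talk. It models only the pipelined schedule, not concurrent kernel execution, and does not call any CUDA API.

```python
from typing import List

STAGES = ("read", "decode", "decompress", "write")


def batch_makespan(durations: List[float], num_batches: int) -> float:
    """Batch processing: every stage of every batch runs back to back."""
    return num_batches * sum(durations)


def pipelined_makespan(durations: List[float], num_batches: int) -> float:
    """Pipelined: a stage starts once the batch's previous stage and the
    previous batch's use of the same stage have both finished."""
    finish = [0.0] * len(durations)  # finish time of the latest batch, per stage
    for _ in range(num_batches):
        ready = 0.0
        for stage, duration in enumerate(durations):
            ready = max(ready, finish[stage]) + duration
            finish[stage] = ready
    return finish[-1]


if __name__ == "__main__":
    durations = [2.0, 1.0, 3.0, 2.0]  # hypothetical time units per stage
    print(batch_makespan(durations, 3))
    print(pipelined_makespan(durations, 3))
```

In this model the pipelined time approaches the duration of the slowest stage per batch, so transfer time is hidden whenever a kernel stage is the bottleneck.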

Pages 36-37

Decompression performance

[Figure: decompression performance]

Data transfer latency hidden

Pages 38-39

Decompression time breakdown

[Figure: pie charts of Huffman % vs LZSS % of decompression time, one for the Database column and one for English Wikipedia]

LZSS faster for incompressible datasets

Page 40

Decompression performance vs Compression efficiency (English Wikipedia)

[Figure: scatter plot of gzip, pigz, xz, bzip2, lzma and GPU Deflate w PCI/E transfer; x-axis: compression efficiency (0 to 0.5); y-axis: bandwidth (GB/s), log scale (0.01 to 10)]

Page 42

Conclusions

• Decompression

– Hide GPU-CPU latency using 4-stage pipelining

– LZSS faster for incompressible files

• Compression

– Reduce search time (using hash tables?)

Questions?