Lossless Data Compression on GPUs

71
Lossless Data Compression on GPUs Ritesh Patel Jason Mak University of California, Davis University of California, Davis 1

Transcript of Lossless Data Compression on GPUs

Page 1: Lossless Data Compression on GPUs

Lossless Data Compression on GPUs

Ritesh Patel Jason Mak University of California, Davis University of California, Davis

1

Page 2: Lossless Data Compression on GPUs

Motivation • Data storage

• Computation vs. Data Transfer o Is compress-then-send worthwhile?

2

GPU CPU GPU CPU 20 MB 100 MB

Less transfer time

More computation

Page 3: Lossless Data Compression on GPUs

PCI-E vs. GPU bandwidth

3

0

50

100

150

200

250

1995 1999 2003 2007 2011

Ban

dw

idth

(G

B/s

)

GPU Global Memory

Bandwidth

PCI-Express

Bandwdith

Page 4: Lossless Data Compression on GPUs

Topics Covered • Related Work

• Three Algorithms in the Compression Pipeline 1. Burrows-Wheeler Transform

2. Move-to-Front Transform

3. Huffman Coding

• Results

• Future Work

4

Page 5: Lossless Data Compression on GPUs

Related Work • Domain-specific compression

o Floating-point data compression

o Texture compression

• Parallel bzip2 (pbzip2) o Uses pthread library

o bzip2 not data-parallel-friendly

5

Page 6: Lossless Data Compression on GPUs

Compression Pipeline

6

Burrows-Wheeler Transform

Input Data: n characters

Move-to-Front Transform

Huffman Coding

Output Data: compressed

n characters

n characters

Many runs of repeated characters

Low entropy Lots of 0s, 1s, etc.

Page 7: Lossless Data Compression on GPUs

Burrows-Wheeler Transform • Transforms a string and gives has many

runs of repeated characters

• Same characters get grouped

7

Input Data

Output Data: compressed

Page 8: Lossless Data Compression on GPUs

Burrows-Wheeler Transform • Transforms a string and gives has many

runs of repeated characters

• Same characters get grouped

8

Input Data

Output Data: compressed

ababacabac ccbbbaaaaa

Input Output

Page 9: Lossless Data Compression on GPUs

Burrows-Wheeler Transform

a b a b a c a b a c

c a b a b a c a b a

a c a b a b a c a b

b a c a b a b a c a

a b a c a b a b a c

c a b a c a b a b a

a c a b a c a b a b

b a c a b a c a b a

a b a c a b a c a b

b a b a c a b a c a

BWT Input: ababacabac

1) Cyclical Rotations

9

Input Data

Output Data: compressed

Page 10: Lossless Data Compression on GPUs

Burrows-Wheeler Transform

a b a b a c a b a c

c a b a b a c a b a

a c a b a b a c a b

b a c a b a b a c a

a b a c a b a b a c

c a b a c a b a b a

a c a b a c a b a b

b a c a b a c a b a

a b a c a b a c a b

b a b a c a b a c a

a b a b a c a b a c

a b a c a b a b a c

a b a c a b a c a b

a c a b a b a c a b

a c a b a c a b a b

b a b a c a b a c a

b a c a b a b a c a

b a c a b a c a b a

c a b a b a c a b a

c a b a c a b a b a

BWT Input: ababacabac

1) Cyclical Rotations 2) Sort

10

Input Data

Output Data: compressed

Page 11: Lossless Data Compression on GPUs

Burrows-Wheeler Transform

a b a b a c a b a c

a b a c a b a b a c

a b a c a b a c a b

a c a b a b a c a b

a c a b a c a b a b

b a b a c a b a c a

b a c a b a b a c a

b a c a b a c a b a

c a b a b a c a b a

c a b a c a b a b a

Index: 0 BWT: ccbbbaaaaa

11

Input Data

Output Data: compressed

Sorted Cyclical Rotations

Page 12: Lossless Data Compression on GPUs

BWT with Merge Sort • Sorting n strings each n characters long

o n = 1 million per block in our implementation

• Radix sort not a good fit

• Break ties on the fly

12

Input Data

Output Data: compressed

Page 13: Lossless Data Compression on GPUs

BWT with Merge Sort • Sorting n strings each n characters long

o n = 1 million per block in our implementation

• Radix sort not a good fit

• Break ties on the fly

• String sorting on GPU o Non-uniform

13

Input Data

Output Data: compressed

a b a b a c a b a c

a b a c a b a b a c

a b a c a b a c a b

a c a b a b a c a b

a c a b a c a b a b

b a b a c a b a c a

b a c a b a b a c a

b a c a b a c a b a

c a b a b a c a b a

c a b a c a b a b a

Pair 1: 7 ties

Pair 2: 0 ties

Pair 3: 4 ties

Page 14: Lossless Data Compression on GPUs

BWT with Merge Sort • Sorting n strings each n characters long

o n = 1 million per block in our implementation

• Radix sort not a good fit

• Break ties on the fly

• String sorting on GPU o Non-uniform

o Parallelize BWT with string sorting algorithm based on merge sort

o Currently fastest string sort on GPU

14

Input Data

Output Data: compressed

Page 15: Lossless Data Compression on GPUs

Move-to-Front Transform • Exploit clumpy characters from BWT

o BWT: ababacabac ccbbbaaaaa

• Improves effectiveness of entropy

encoding

15

Input Data

Output Data: compressed

Page 16: Lossless Data Compression on GPUs

Move-to-Front Transform • Each symbol in the data is replaced by

its index in the list o Initial list: …abcd… (ASCII)

• Recently seen characters are kept at

front of the list o Long sequences of identical symbols replaced by many zeros

16

Input Data

Output Data: compressed ccbbbaaaaa 99,0,99,0,0,99,0,0,0,0

Page 17: Lossless Data Compression on GPUs

Move-to-Front Transform Input MTF List Output

ccbbbaaaaa …abc… (ASCII) [99]

17

• ‘c’ occurs at 99th index Output ‘99’

Input Data

Output Data: compressed

Page 18: Lossless Data Compression on GPUs

Move-to-Front Transform Input MTF List Output

ccbbbaaaaa c…ab… (ASCII) [99]

18

• ‘c’ occurs at 99th index Output ’99’ • Move ‘c’ to front of the MTF list • Shift all previous elements to the right

Input Data

Output Data: compressed

Page 19: Lossless Data Compression on GPUs

Move-to-Front Transform Input MTF List Output

ccbbbaaaaa …abc… (ASCII) [99]

ccbbbaaaaa c…ab… [99,0]

19

• ‘c’ occurs at 0th index Output ’0’

Input Data

Output Data: compressed

Page 20: Lossless Data Compression on GPUs

Move-to-Front Transform Input MTF List Output

ccbbbaaaaa …abc… (ASCII) [99]

ccbbbaaaaa c…ab… [99,0]

20

• ‘c’ occurs at 0th index Output ’0’ • ‘c’ already at front, no shifts

Input Data

Output Data: compressed

Page 21: Lossless Data Compression on GPUs

Move-to-Front Transform Input MTF List Output

ccbbbaaaaa …abc… (ASCII) [99]

ccbbbaaaaa c…ab… [99,0]

21

• ‘c’ occurs at 0th index Output ’0’ • ‘c’ already at front, no shifts • Repeat for all elements

Input Data

Output Data: compressed

Page 22: Lossless Data Compression on GPUs

Move-to-Front Transform Iteration MTF List Transformed String

ccbbbaaaaa …abc… (ASCII) [99]

ccbbbaaaaa c…ab… [99,0]

ccbbbaaaaa c…ab… [99,0,99]

ccbbbaaaaa bc…a… [99,0,99,0]

ccbbbaaaaa bc…a… [99,0,99,0,0]

ccbbbaaaaa bc…a… [99,0,99,0,0,99]

ccbbbaaaaa abc… [99,0,99,0,0,99,0]

ccbbbaaaaa abc… [99,0,99,0,0,99,0,0]

ccbbbaaaaa abc… [99,0,99,0,0,99,0,0,0]

ccbbbaaaaa abc… [99,0,99,0,0,99,0,0,0,0]

22

Input Data

Output Data: compressed

Page 23: Lossless Data Compression on GPUs

Parallel MTF • MTF appears to be highly serial

o Character-by-character dependency

23

Input Data

Output Data: compressed

Page 24: Lossless Data Compression on GPUs

Parallel MTF • MTF appears to be highly serial

o Character-by-character dependency

• Goal: generate multiple MTF lists at

different indices o Compute MTF output in parallel

24

Input Data

Output Data: compressed

Page 25: Lossless Data Compression on GPUs

Parallel MTF • MTF appears to be highly serial

o Character-by-character dependency

• Goal: generate multiple MTF lists at

different indices o Compute MTF output in parallel

• How to parallelize? o Two key insights

1. Generate partial MTF list of a substring s

2. Combine two adjacent partial MTF lists (using

“concatenation”)

25

Input Data

Output Data: compressed

Page 26: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A C C C B B A A A

26

Input Data

Output Data: compressed

Page 27: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A C C C B B A A A

27

Input Data

Output Data: compressed

A D B

Part 1

MTF List

End of part 1

Page 28: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A C C C B B A A A

28

Input Data

Output Data: compressed

Part 1 Part 2

A B C D Final MTF List

End of part 2

Page 29: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A

29

Input Data

Output Data: compressed

A D B A B C 2 partial MTF lists (in parallel)

List 1 List 2

C C C B B A A A Break input into 2 chunks

Page 30: Lossless Data Compression on GPUs

Example Parallel MTF

30

Input Data

Output Data: compressed

A D B A B C 2 partial MTF lists (in parallel)

Appends all unique symbols from list 1 to list 2

A B C D

B A D A D D D A C C C B B A A A

Page 31: Lossless Data Compression on GPUs

Example Parallel MTF

31

Input Data

Output Data: compressed

A D B A B C 2 partial MTF lists (in parallel)

Same MTF list as we had before! A B C D

B A D A D D D A C C C B B A A A

Page 32: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A C C C B B A A A

32

Input Data

Output Data: compressed

B A D A D D D A C C C B B A A A

Break input into 4 chunks

Page 33: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A C C C B B A A A

33

Input Data

Output Data: compressed

B A D A D D D A C C C B B A A A

Break input into 4 chunks

Generate partial MTF lists

A D B… A D… B C… A B…

Page 34: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

34

Input Data

Output Data: compressed

Operator Appends all unique symbols to current list

MTF list at index 4

Page 35: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

A D B

35

Input Data

Output Data: compressed

Operator Appends all unique symbols to current list

MTF list at index 8

Page 36: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

A D B

36

Input Data

Output Data: compressed

Operator Appends all unique symbols to current list

B C A D MTF list at index 12

Page 37: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

A D B

37

Input Data

Output Data: compressed

Operator Appends all unique symbols to current list

B C A D

A B C D MTF list at index 16

Page 38: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

A D B

38

Input Data

Output Data: compressed

Operator Appends all unique symbols to current list

B C A D

A B C D

4

8

12

16

Generated lists at indices: 4,8,12,16

Page 39: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

A D B A B C

39

Input Data

Output Data: compressed

Operator Appends all unique symbols to current list

Page 40: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

A D B A B C

A B C D

40

Input Data

Output Data: compressed

Operator Appends all unique symbols to current list

Page 41: Lossless Data Compression on GPUs

Example Parallel MTF

Operator Appends all unique symbols to current list

A D B A D B C A B

A D B A B C

A B C D

B C A D 41

Input Data

Output Data: compressed

Page 42: Lossless Data Compression on GPUs

Example Parallel MTF A D B A D B C A B

A D B A B C

A B C D

B C A D

4

8

12

16

42

Input Data

Output Data: compressed

Generated lists at indices: 4,8,12,16

Page 43: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A C C C B B A A A

B A D A D D D A C C C B B A A A

A D B… B C A D… A D B… A B C… (Initial MTF List)

43

0 4 8 12

Input Data

Output Data: compressed

MTFLists

In

Page 44: Lossless Data Compression on GPUs

Example Parallel MTF

B A D A D D D A C C C B B A A A

B A D A D D D A C C C B B A A A

A D B… B C A D… A D B… A B C… (Initial MTF List)

[1,0,0,1] [0,2,0,0] [3,0,0,3] [1,1,3,1]

44

0 4 8 12

Input Data

Output Data: compressed

MTFLists

In

MTFOut

Page 45: Lossless Data Compression on GPUs

So far…

45

Burrows-Wheeler Transform

ababacabac

Move-to-Front Transform

Huffman Coding

Output Data: compressed

ccbbbaaaaa

99,0,99,0,0,99,0,0,0,0

Grouped characters

Lots of 0s

Input

Page 46: Lossless Data Compression on GPUs

Huffman Coding • Final stage that performs actual

compression

• Replace each character with a bit code

• Characters that occur more often get

shorter codes

• Huffman Coding is more effective with

repetitive data (after MTF)

46

Input Data

Output Data: compressed

Page 47: Lossless Data Compression on GPUs

Huffman Coding

47

“HUFF” 00 01 1 1 32 bits 6 bits

Input Data

Output Data: compressed

Page 48: Lossless Data Compression on GPUs

Huffman Coding • Huffman Stages:

1. Generate 256-bin histogram

2. Build Huffman tree

3. Replace characters with codes

48

Input Data

Output Data: compressed

Page 49: Lossless Data Compression on GPUs

Huffman Histogram • Histogram

o Build a 256-entry histogram to count characters

o To do this on the GPU, divide the input data among

threads with each thread maintaining its own

histogram

o Merging histograms requires atomic operations

49

+ + =

Input Data

Output Data: compressed

Page 50: Lossless Data Compression on GPUs

Huffman Tree • Huffman Tree Algorithm

o Remove two lowest counts, combine to form composite node

o Composite node “inserted” into histogram with combined

counts

o Each step is dependent on the previous step

o Parallelize finding the two lowest counts (maximum 256-way

parallelism)

50

Input Data

Output Data: compressed

Page 51: Lossless Data Compression on GPUs

Huffman Tree Algorithm

51

Histogram H: 1 U: 1 F: 2

“HUFF”

Input Data

Output Data: compressed

Page 52: Lossless Data Compression on GPUs

Huffman Tree Algorithm

52

Histogram H: 1 U: 1 F: 2

H U

1 1

Input Data

Output Data: compressed

Page 53: Lossless Data Compression on GPUs

Huffman Tree Algorithm

53

Histogram F: 2

H U

1 1

Input Data

Output Data: compressed

Page 54: Lossless Data Compression on GPUs

Huffman Tree Algorithm

54

Histogram H+U: 2

F: 2

H U

1 1

2

Input Data

Output Data: compressed

Page 55: Lossless Data Compression on GPUs

Huffman Tree Algorithm

55

Histogram H+U: 2

F: 2

H U

F

1 1

2 2

Input Data

Output Data: compressed

Page 56: Lossless Data Compression on GPUs

Huffman Tree Algorithm

56

H U

F

1 1

2 2

Histogram (empty)

Input Data

Output Data: compressed

Page 57: Lossless Data Compression on GPUs

Huffman Tree Algorithm

57

H U

F

0 1

0 1

Root

Huffman Codes

H: 00 U: 01 F: 1

Left 0 Right 1

Input Data

Output Data: compressed

Page 58: Lossless Data Compression on GPUs

Huffman Coding • Huffman coding assigns prefix codes

• No codes share the same prefix

58

Huffman Codes H: 00 U: 01 F: 1

Input Data

Output Data: compressed

Page 59: Lossless Data Compression on GPUs

Huffman Coding

59

Huffman Codes

H: 00 U: 01 F: 1

“HUFF” 000111 32 bits 6 bits

Input Data

Output Data: compressed

Page 60: Lossless Data Compression on GPUs

Huffman Coding • Replace characters with codes

o To do in parallel, divide input among threads

o Hard to do on the GPU because codes are variable-length

o Each thread must calculate the correct offset to begin writing

its code

60

1 0010 000011 0010 000011

t0 t1 t2 t3 t4

Input Data

Output Data: compressed

Page 61: Lossless Data Compression on GPUs

Results • GPU vs. Bzip2

o Overall

• GPU 2.78x slower

o BWT

• GPU 2.9x slower (91% of runtime)

o MTF + Huffman

• GPU 1.34x slower

61

Page 62: Lossless Data Compression on GPUs

Results

62

File (Size)

Compress Rate (MB/s)

BWT Sort Rate (Mstrings/s)

MTF+Huffman Rate (MB/s)

Compress Ratio 𝑪𝒐𝒎𝒑𝒓𝒆𝒔𝒔𝒆𝒅 𝑺𝒊𝒛𝒆

𝑼𝒏𝒄𝒐𝒎𝒑𝒓𝒆𝒔𝒔𝒆𝒅 𝑺𝒊𝒛𝒆

Text (97 MB)

GPU: 7.37 bzip2: 10.26

GPU: 9.84 bzip2: 14.2

GPU: 29.4 bzip2: 33.1

GPU: 0.33 bzip2: 0.29

Source Code (203 MB)

GPU: 4.25 bzip2: 9.8

GPU: 4.71 bzip2: 12.2

GPU: 44.3 bzip2: 48.8

GPU: 0.24 bzip2: 0.18

XML (151 MB)

GPU: 1.42 bzip2: 5.3

GPU: 1.49 bzip2: 5.4

GPU: 32.6 bzip2: 69.2

GPU: 0.19 bzip2: 0.10

Benchmark results

Worst performance during string sort and overall

Best amount of compression

Page 63: Lossless Data Compression on GPUs

Results • BWT Performance

o GPU – 91% of runtime

o Bzip2 – 81% of runtime

o Tie Analysis:

• Amount of compression is

data dependent

• Better compression leads to

longer ties and poor

performance

63

Page 64: Lossless Data Compression on GPUs

Results • MTF Performance

o Majority of runtime during

character replacement

o Moving element to front, shifting

all other elements

o Thread divergence

64

Page 65: Lossless Data Compression on GPUs

Results • Huffman Performance

o Majority of runtime during

Huffman tree building

o 256-way parallelism is

inadequate for the GPU

65

Page 66: Lossless Data Compression on GPUs

Decompression • Reverse Burrows-Wheeler transform

o Much faster than forward BWT

o Requires only 1 character-sort

o Similar to linked-list traversal

o Preliminary implementation is an extension of Hillis and Steele’s parallel linked-list traversal algorithm

66

Page 67: Lossless Data Compression on GPUs

Decompression • Reverse Burrows-Wheeler transform

o Much faster than forward BWT

o Requires only 1 character-sort

o Similar to linked-list traversal

o Preliminary implementation is an extension of Hillis and Steele’s parallel linked-list traversal algorithm

• Reverse Move-to-Front transform o Parallel approach uses similar algorithm as forward-MTF

• Generate partial MTF lists

• Combine adjacent lists

67

Page 68: Lossless Data Compression on GPUs

Decompression • Reverse Burrows-Wheeler transform

o Much faster than forward BWT

o Requires only 1 character-sort

o Similar to linked-list traversal

o Preliminary implementation is an extension of Hillis and Steele’s parallel linked-list traversal algorithm

• Reverse Move-to-Front transform o Parallel approach uses similar algorithm as forward-MTF

• Generate partial MTF lists

• Combine adjacent lists

• Huffman decoding o Decode each encoded block in parallel

68

Page 69: Lossless Data Compression on GPUs

Future Work • Explore other GPU-based string sorts that can better

handle long strings with many ties

• Develop a Huffman tree building algorithm that has

more than 256-way parallelism

• Overlap GPU compression and PCI-Express data

transfer

69

Page 70: Lossless Data Compression on GPUs

Conclusion • Implemented parallel lossless data compression on

the GPU

• Parallelized BWT, MTF, and Huffman Coding o Developed a novel algorithm for MTF

o BWT string sort contributes to the majority of runtime

• Implementation is slow but may be better suited for

future GPU architectures and other parallel

environments

70

Page 71: Lossless Data Compression on GPUs

Acknowledgements • NSF grants OCI-1032859 and CCF-1017399

• HP Labs Innovation Research Program

• Discussion and feedback o Shubho Sengupta (Intel)

o Adam Herr (Intel)

o Anjul Patney (UC Davis)

o Stanley Tzeng (UC Davis)

71