LZ77 Compression Using Altera OpenCL

Mohamed Abdelfattah

LZ77 Compression in OpenCL

Goal: Demonstrate that a high-performance compression algorithm (2 GB/s) can be implemented efficiently using the OpenCL compiler.

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results

OpenCL Single-threaded Code

- Basically C code
- The OpenCL compiler extracts parallelism automatically
- Pipeline parallelism

On the FPGA:
- One or more custom kernels
- Kernels can communicate directly through "channels"

OpenCL Single-threaded Code

    kernel void simple(global int *input, int size, global int *output)
    {
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z = x + y;
            output[i] = z;
        }
    }

Pipeline on the FPGA: Load x, Load y → Store z

[Animation: iterations 1, 2, 3, ... enter the pipeline on consecutive cycles.]

- Can start a new loop iteration every cycle! Initiation interval II = 1
- No loop-carried dependencies

OpenCL Single-threaded Code

    kernel void complex(global int *input, int size, global int *output)
    {
        int loop_carried = 1;  /* initial value is not shown on the slide; 1 is assumed */
        for (int i = 1; i <= size; i++) {
            int x = input[i];
            int y = input[i + 1];
            int z;
            if (loop_carried / 2 == 1)
                z = x + y;
            else
                z = x - y;
            loop_carried *= z;
            output[i] = z;
        }
    }

Pipeline on the FPGA: Load x, Load y → loop-carried computation → Store z

- Loop-carried computation: the value of loop_carried produced by one iteration is needed by the next iteration

OpenCL Single-threaded Code: Simple vs. Complex

[Animation: iterations flowing through the Simple and Complex pipelines (Load x, Load y → Store z) side by side, cycle by cycle.]

- Simple: a new iteration enters the pipeline every cycle, II = 1
- Complex: the loop-carried computation takes 2 cycles to compute, so the loads stall and a pipeline bubble appears every other cycle, II = 2
- A new iteration of the loop starts every "II" cycles
- Simple has double the throughput of Complex
- Optimize the loop-carried computation
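The effect of the initiation interval on run time can be estimated with a simple cycle count. The sketch below is a minimal model, assuming a pipeline of fixed depth and no stalls other than the II itself; the function name and its parameters are illustrative and are not part of the Altera tools.

```c
#include <stdint.h>

/* Hypothetical helper: cycles needed to run n loop iterations through a
 * pipeline that is `depth` stages deep and accepts a new iteration every
 * `ii` cycles. The first iteration needs `depth` cycles to drain; every
 * later iteration finishes `ii` cycles after the one before it. */
uint64_t pipeline_cycles(uint64_t n, uint64_t depth, uint64_t ii)
{
    if (n == 0) {
        return 0;
    }
    return depth + (n - 1) * ii;
}
```

For a large iteration count the depth term is negligible, so under this model II = 2 takes roughly twice as many cycles as II = 1.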

LZ77 Compression in OpenCL

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results

LZ77 Compression Example

Input: This sentence is an easy sentence to compress.

1. Scan file byte by byte
2. Look for matches
   • Match length = 8
   • Match offset = 20 bytes
3. Replace with a reference to previous occurrence
   • Marker, length, offset

Output: This sentence is an easy @(8,20) to compress.

Saved 5 bytes! (The 8-byte match is replaced by a 3-byte reference.)
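To make the (length, offset) reference concrete, the sketch below expands the slide's @(8,20) token back into the original sentence. It is a minimal illustration of the decoding side only; the function names are hypothetical and the talk itself covers compression, not decompression.

```c
#include <stddef.h>
#include <string.h>

/* Append `length` bytes to out[0..pos) by copying from `offset` bytes back.
 * The copy runs byte by byte so that overlapping references
 * (offset < length) also work. Returns the new output length, or `pos`
 * unchanged if the reference points before the start of the output. */
size_t lz77_copy_match(char *out, size_t pos, size_t length, size_t offset)
{
    if (offset == 0 || offset > pos) {
        return pos;
    }
    for (size_t k = 0; k < length; k++) {
        out[pos + k] = out[pos + k - offset];
    }
    return pos + length;
}

/* Decode the slide's example: literal text, one (8,20) match, literal text.
 * `out` must have room for at least 47 bytes. */
size_t decode_example(char *out)
{
    const char *head = "This sentence is an easy ";
    const char *tail = " to compress.";
    size_t pos = 0;

    memcpy(out, head, strlen(head));
    pos += strlen(head);
    pos = lz77_copy_match(out, pos, 8, 20);   /* re-creates "sentence" */
    memcpy(out + pos, tail, strlen(tail));
    pos += strlen(tail);
    out[pos] = '\0';
    return pos;
}
```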

LZ77 Compression in OpenCL

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results

Overview

- Single-threaded OpenCL flow
- Single kernel: fully pipelined, II = 1
- Throughput estimate = 16 bytes/cycle × 200 MHz = 3.2 × 10^9 bytes/s ≈ 3051 MB/s (with 1 MB = 2^20 bytes)

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

Comparison against CPU/Verilog

• CPU: best implementation of Gzip on CPU
  - By Intel Corporation, 2013
  - On an Intel Core i5 (32 nm) processor
  - Compression speed: 338 MB/s (0.3 GB/s)
  - Compression ratio: 2.18X
• ASIC: best implementation on ASICs
  - AHA Products Group, coming up Q2 2014
  - Compression speed: 2.5 GB/s
• FPGA (Verilog): best implementation on FPGAs
  - IBM Corporation, ICCAD, Nov. 2013
  - Altera Stratix-V A7
  - Compression speed: 3 GB/s
• FPGA (OpenCL): OpenCL design example
  - Altera Stratix-V A7
  - Developed in 1 month
  - Compression speed: 2.7 GB/s

Comparison against CPU

• Same compression ratio
• 12X better performance/Watt

Comparison against Verilog

• 12% more resources
• 10% slower
• Much lower design effort and design time

Implementation Overview

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

1. Shift In New Data

The current window is filled with input from DDR memory. We use text in our example, but the data can be anything.

e.g. with VEC = 4 (4 new bytes cross the cycle boundary each cycle):

  Cycle    Current Window      Remaining input
  n        o l d _ t e x t     sample_text
  n+1      t e x t s a m p     le_text
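The shift step can be sketched in plain C as follows. This is a minimal illustration that assumes the window holds 2*VEC bytes, as in the slide's example; the function name is hypothetical.

```c
#include <string.h>

#define VEC 4  /* bytes consumed per cycle, as in the slide's example */

/* The current window holds 2*VEC bytes. Each cycle the older half is
 * dropped, the newer half moves down, and VEC new input bytes fill the
 * upper half. */
void shift_in(unsigned char window[2 * VEC], const unsigned char next[VEC])
{
    memmove(window, window + VEC, VEC);
    memcpy(window + VEC, next, VEC);
}
```

Because VEC is a compile-time constant, both copies have fixed size, which is what lets a compiler turn them into wires and registers rather than a loop.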

Implementation Overview

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

2. Dictionary Lookup/Update

Current Window: t e x t s a m p

1. Compute hash
2. Look for match in 4 dictionaries
3. Update dictionaries

Dictionaries buffer the text that we have already processed. Each of the 4 substrings of the current window is hashed and looked up in Dictionary0 to Dictionary3, which returns 4 possible matches from history (the dictionaries) per substring, e.g.:

  Substring    Possible matches
  t e x t      t a n _    t e x t    t e x l    t e e n
  e x t s      e a t e    e a r s    e e p s    e n t e
  x t s a      x a n t    x y l o    x e l y    x i r t
  t s a m      t e e n    t e a l    t a n _    t a m e

Each dictionary i has one write port (Wi) and four read ports (RDi0 to RDi3). The compiler generates exactly the number of read/write ports that we need.
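The lookup and update can be sketched in plain C as below. This is a hedged illustration, not the talk's kernel code: the array layout, the 256-entry depth, the first-byte hash and all names are assumptions chosen to match the port counts on the slide (four reads and one write per dictionary each cycle).

```c
#include <string.h>

#define VEC 4        /* substrings per cycle and number of dictionaries */
#define LEN 4        /* bytes stored per dictionary entry (assumed) */
#define DEPTH 256    /* entries per dictionary (8-bit hash assumed) */

typedef struct {
    unsigned char entry[VEC][DEPTH][LEN];
} dictionaries_t;

/* Simplest possible hash: the first byte of the substring. */
static unsigned hash_first_byte(const unsigned char *s)
{
    return s[0];
}

/* For each of the VEC substrings of the current window:
 *  - read one candidate from every dictionary (VEC reads per dictionary),
 *  - then store substring i into dictionary i (one write per dictionary). */
void lookup_and_update(dictionaries_t *d,
                       const unsigned char window[2 * VEC],
                       unsigned char candidates[VEC][VEC][LEN])
{
    unsigned h[VEC];

    for (int i = 0; i < VEC; i++) {
        h[i] = hash_first_byte(window + i);
    }
    for (int i = 0; i < VEC; i++) {          /* substring index */
        for (int j = 0; j < VEC; j++) {      /* dictionary index */
            memcpy(candidates[i][j], d->entry[j][h[i]], LEN);
        }
    }
    for (int i = 0; i < VEC; i++) {
        memcpy(d->entry[i][h[i]], window + i, LEN);
    }
}
```

Reading before writing means a substring never matches itself within the same cycle; whether the original design orders the two this way is not stated on the slides.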

Implementation Overview

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

3. Match Search & Filtering

Current windows (the substrings) and their comparison windows (a set of candidate matches for each incoming substring):

  Current Window    Comparison Windows
  t e x t           t a n _    t e x t    t e x l    t e e n
  e x t s           e a t e    e a r s    e e p s    e n t e
  x t s a           x a n t    x y l o    x e l y    x i r t
  t s a m           t e e n    t e a l    t a n _    t a m e

Compare each current window against each of its 4 comparison windows. The comparators compare each byte. For the current window t e x t:

  Comparison Window:   t a n _    t e x t    t e x l    t e e n
  Match Length:           1          4          3          2

We have another 3 sets of these comparators, one per current window.

Match reduction then selects the best length: 4.

- Typical C code
- Fixed loop bounds: the compiler can unroll the loop
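The comparison and reduction described here can be sketched as ordinary C with constant loop bounds. The code below is an illustrative reconstruction (the original slide code did not survive extraction); the function names and the LEN and NCMP constants are assumptions.

```c
#define LEN 4   /* bytes per comparison, matching the slide's 4-byte windows */
#define NCMP 4  /* comparison windows per current window */

/* Length of the common prefix of two LEN-byte windows. The loop bound is a
 * compile-time constant, so a high-level synthesis compiler can fully
 * unroll it into LEN parallel byte comparators. */
int match_length(const unsigned char *cur, const unsigned char *cmp)
{
    int length = 0;
    int still_matching = 1;

    for (int k = 0; k < LEN; k++) {
        still_matching = still_matching && (cur[k] == cmp[k]);
        length += still_matching;
    }
    return length;
}

/* Match reduction: keep the longest of the NCMP candidate lengths. */
int best_length(const unsigned char *cur, const unsigned char cmp[NCMP][LEN])
{
    int best = 0;

    for (int j = 0; j < NCMP; j++) {
        int len = match_length(cur, cmp[j]);
        if (len > best) {
            best = len;
        }
    }
    return best;
}
```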

3. Match Search & Filtering

One bestlength is associated with each current_window (t e x t, e x t s, x t s a, t s a m).

Select the best combination of matches from the set of candidate matches:
1. Remove matches that are longer when encoded than the original
2. Remove matches covered by the previous step
3. From the remaining set, select the best ones
   • last-fit (a heuristic for bin-packing)
4. Compute the "first valid position" for the next step

Example (window t e x t | s a m p, with positions 0 1 2 3 on each side of the cycle boundary):

  Best lengths:                              3  1  3  4
  After step 1 (length 1 is too short):      3  0  3  4
  After step 3 (last-fit removes overlap):   3  0  0  4, i.e. 3 -1 -1 4
  Step 4: First_valid_pos = 3 for the next cycle

Example for step 2: with first valid position = 3, the best lengths 3 4 4 2 become -1 -1 -1 2.
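One plausible reading of these four steps is sketched below in C. It is an interpretation, not the talk's code: the meaning of 0 (emit a literal) and -1 (byte already covered by a match), the min_len parameter and the function name are all assumptions, chosen so that the sketch reproduces the slide's examples. Match lengths are assumed to be at most VEC.

```c
#include <limits.h>

#define VEC 4

/* len[i] is the best match length starting at position i. On return it
 * holds the selected length, 0 for "emit a literal", or -1 for "covered by
 * a match". first_valid is the first position not covered by the previous
 * cycle's match. Returns the first valid position for the next cycle. */
int filter_matches(int len[VEC], int first_valid, int min_len)
{
    int next_first = 0;
    int limit = INT_MAX;   /* a kept match must end at or before this */
    int covered_until = first_valid;

    /* Steps 1 and 2: drop matches that are too short or that start inside
     * the previous cycle's match. */
    for (int i = 0; i < VEC; i++) {
        if (len[i] < min_len || i < first_valid) {
            len[i] = 0;
        }
    }
    /* First valid position: set by the last remaining match, which
     * last-fit always keeps. */
    for (int i = 0; i < VEC; i++) {
        if (len[i] > 0) {
            next_first = (i + len[i] > VEC) ? i + len[i] - VEC : 0;
        }
    }
    /* Last-fit: walk backwards and keep a match only if it ends before the
     * start of the match kept after it. */
    for (int i = VEC - 1; i >= 0; i--) {
        if (len[i] > 0 && i + len[i] <= limit) {
            limit = i;
        } else {
            len[i] = 0;
        }
    }
    /* Mark bytes that lie inside a kept match as covered. */
    for (int i = 0; i < VEC; i++) {
        if (i < covered_until) {
            len[i] = -1;
        } else if (len[i] > 0) {
            covered_until = i + len[i];
        }
    }
    return next_first;
}
```

With best lengths 3 1 3 4, first_valid = 0 and min_len = 3, this sketch yields 3 -1 -1 4 and a next first valid position of 3, matching the example above.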

Implementation Overview

1. Shift In New Data

2. Dictionary Lookup/Update

3. Match Search & Filtering

4. Write to output

4. Writing to Output

Marker, length, offset
- Length is limited by VEC (= 16 in our case): fits in 4 bits
- Offset is limited by 0x40000 (doesn't make sense to be more): fits in 21 bits

Use either 3 or 4 bytes for this:
- Offset < 2048: MARKER | LENGTH, OFFSET | OFFSET
- Offset = 2048 .. 262144: MARKER | LENGTH, OFFSET | OFFSET | OFFSET
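The size rule for a match reference ties back to the first filtering step (remove matches that are longer when encoded than the original). The sketch below captures only the byte counts stated on the slide; the exact bit layout inside those bytes is not given, so it is not modelled, and both function names are hypothetical.

```c
/* Bytes needed to encode one match reference: 3 bytes when the offset is
 * below 2048, 4 bytes otherwise (as stated on the slide). */
int token_bytes(unsigned offset)
{
    return (offset < 2048u) ? 3 : 4;
}

/* A match is only worth emitting if its encoded form is not longer than
 * the bytes it replaces. */
int worth_encoding(int length, unsigned offset)
{
    return length >= token_bytes(offset);
}
```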

LZ77 Compression in OpenCL

Outline:

1. OpenCL single-threaded flow

2. LZ77 overview

3. Implementation details

4. Optimizations & results
   - Area optimizations
   - Compression ratio
   - Results

Area Optimizations

- By choosing the right (hardware) architecture, you are already most of the way there
- The last ~5% (of area optimizations) requires some tinkering and advanced knowledge

Example:

Match Search & Filtering

The original code generates a long vine of logic:

  Compute length → Compute length → Compute length → Compute length → Compute length → Compute length (each stage guarded by a condition)

- Causes longer latency in the pipeline, which increases area

Balance the computation:
- A balanced tree has a shallower pipeline depth, hence less area
- Get rid of the dependency on "length"

Modified Code

- Instead of having a length variable (= 2, 3, 4), we have an array of bits (= 0011, 0111, 1111)
- The OR operator is cheaper than an adder
- The OR operator creates a balanced tree (no condition)
- 4% smaller area
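One way to read the bit-array idea is sketched below: each comparison produces a prefix mask instead of a count, and the best match is the OR of the masks. This is an illustrative reconstruction under that assumption, since the modified code itself is not in the transcript; the names and constants are hypothetical.

```c
#define LEN 4   /* bytes per comparison window */
#define NCMP 4  /* comparison windows per current window */

/* Prefix mask: bit k is set when bytes 0..k all match, so a match of
 * length 2, 3 or 4 becomes 0011, 0111 or 1111. */
unsigned prefix_mask(const unsigned char *cur, const unsigned char *cmp)
{
    unsigned mask = 0;
    unsigned still_matching = 1;

    for (int k = 0; k < LEN; k++) {
        still_matching &= (unsigned)(cur[k] == cmp[k]);
        mask |= still_matching << k;
    }
    return mask;
}

/* Prefix masks are nested, so OR-ing them yields the mask of the longest
 * match. The reduction needs no comparison and no adder. */
unsigned best_mask(const unsigned char *cur, const unsigned char cmp[NCMP][LEN])
{
    unsigned best = 0;

    for (int j = 0; j < NCMP; j++) {
        best |= prefix_mask(cur, cmp[j]);
    }
    return best;
}
```

Because OR is associative and has no data-dependent condition, a compiler is free to arrange the reduction as a balanced tree.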

Compression Ratio

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: a mix of everything
- Geomean results over all benchmarks
- Initial results: 78.3% or 1.28X

Want to improve results!

1. Bin-packing heuristic
2. Hash function

1. Bin-packing heuristic

We use the "last-fit" heuristic
- Reason: we have a loop-carried variable, "first_valid_position"

Original order of the steps:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Filter bestlength (bin-pack)
4. Compute first_valid_pos

- The dependency causes a stall in the kernel pipeline: a new iteration cannot start each cycle, II = 6 (Optimization Report in 14.0)

Last-fit bin-packing doesn't affect "first_valid_position", because we always use the last match (which determines first_valid_position). The steps can therefore be reordered:
1. Filter bestlength (length)
2. Filter bestlength (covered)
3. Compute first_valid_pos
4. Filter bestlength (bin-pack)

- Tighter computation for the loop-carried variable: a new iteration starts each cycle, II = 1
- Constraint: the bin-packing step cannot change first_valid_position

1. Bin-packing heuristic

Constraint: the match selection heuristic cannot change "first_valid_position"

But: last-fit is very inefficient

Example (window t e x t | s a m p, positions 0 1 2 3, first_valid_pos = 0):

  Best lengths:           4  3  2  0
  Last-fit:               0  0  2 -1
  With the extra step:    4 -1 -1 -1    Much better!

- Add a step to eliminate matches that have the same reach but smaller value
- Doesn't affect first_valid_position
- 8% better ratio
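The extra step can be sketched as a small pre-pass over the best lengths, run before last-fit selection. This is a hedged illustration: "reach" is taken to mean start position plus length, and the function name is hypothetical. The talk reports that the step leaves first_valid_position unchanged; the sketch does not attempt to show that for every input.

```c
#define VEC 4

/* Among matches that end at the same position (same reach = start + length),
 * keep only the longest one, i.e. the one that starts earliest. */
void drop_same_reach(int len[VEC])
{
    for (int i = 0; i < VEC; i++) {
        for (int j = i + 1; j < VEC; j++) {
            if (len[i] > 0 && len[j] > 0 && i + len[i] == j + len[j]) {
                len[j] = 0;   /* j starts later, so it is the shorter match */
            }
        }
    }
}
```

For the best lengths 4 3 2 0 all three matches reach position 4, so only the length-4 match at position 0 survives, which is the "much better" outcome on the slide.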

2. Hash Function

Original:
- Hash[i] = curr_window[i]
- E.g. Hash[text] = 't'

XOR2 (3.1% better ratio):
- Hash[i] = curr_window[i] xor curr_window[i+1]
- E.g. Hash[text] = 't' xor 'e'
- Aliasing: 't' xor 'e' = 'e' xor 't'
- Not utilizing depth efficiently (256 words, but BRAMs go up to 1024)

XOR3 (7.1% better ratio):
- Hash[i] = (curr_window[i] << 2) xor (curr_window[i+1] << 1) xor curr_window[i+2]
- Match contains information about the first 3 bytes plus a sense of their ordering
- More likely that our compare windows will have a match
- Hash (BRAM address) is 10 bits, which utilizes the full BRAM depth of 1024

Compared to Verilog, it is much easier to try and verify new algorithms: it is exactly like trying out new C code (Emulator in 13.1).
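The three hash variants translate directly into C. The sketch below follows the formulas on the slide; the function names are made up for illustration, and `w` is assumed to point at the first byte of the substring being hashed.

```c
#include <stdint.h>

/* Original: first byte only (8-bit address, 256 entries). */
uint16_t hash_orig(const uint8_t *w)
{
    return w[0];
}

/* XOR2: first two bytes, still 8 bits. XOR is symmetric, so "te" and "et"
 * alias to the same address. */
uint16_t hash_xor2(const uint8_t *w)
{
    return (uint16_t)(w[0] ^ w[1]);
}

/* XOR3: shifted XOR of three bytes. The shifts make the result depend on
 * byte order and widen it to 10 bits (0..1023). */
uint16_t hash_xor3(const uint8_t *w)
{
    return (uint16_t)((w[0] << 2) ^ (w[1] << 1) ^ w[2]);
}
```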

Compression Ratio

Evaluate compression ratio on widely-used compression benchmarks:
- Calgary, Canterbury, Large and Silesia corpora
- Text, images, binary, databases: a mix of everything
- Geomean results over all benchmarks
- Initial results: 78.3% or 1.28X
- After optimizations: 60.2% or 1.67X
- With (simple) Huffman encoding (currently on the host, work in progress): 47.8% or 2.10X

Huffman portion of Gzip

16-way parallel variable-bit-width encoding/alignment

Huffman encoding

- Huffman symbols are defined at runtime
- Variable number of bits (≤16)
- Concatenate codes to form a contiguous output stream
  - Separate offset computation from the actual assembly
- 3 compute phases
  - Compute code bit-offsets (Σ len_i) and start offset of next iteration
  - Assembly of the codes in the current iteration (shift, <<)
  - Build fixed-length segments across multiple iterations (store)

Compute offsets

- Tight dependency on offset carried across iterations
  - Be careful about the order of the additions: the compiler does not consider dependencies when it redistributes associative operations
  - The decision whether to write to memory is based on accumulating a full segment

[Figure: basepos and the running sum Σ len_i produce pos[0], pos[1], ..., pos[n].]

Bit-level shift

- Each code shifts to an arbitrary bit-offset within the entire range
- 2 shift stages
  - 16-bit barrel shifters
  - OR reduction tree for final assembly
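The first two phases (offset computation, then shift-and-OR assembly) can be sketched in C as below. This is a simplified illustration, not the talk's kernel: it packs 4 codes instead of 16 so the result fits in one 64-bit word, it assumes least-significant-bit-first packing, and it assumes basepos plus the total length stays below 64. All names are hypothetical.

```c
#include <stdint.h>

#define NCODES 4   /* the talk is 16-way; 4 keeps the total within 64 bits */

/* Phase 1: pos[i] = basepos + sum of the lengths before code i.
 * Phase 2: shift each code to its bit offset and OR everything together.
 * Returns the start offset for the next iteration.
 * Assumes len[i] <= 16 and basepos + sum(len) < 64. */
unsigned assemble_codes(const uint16_t code[NCODES], const uint8_t len[NCODES],
                        unsigned basepos, uint64_t *bits)
{
    unsigned pos[NCODES];
    unsigned next = basepos;
    uint64_t word = 0;

    for (int i = 0; i < NCODES; i++) {
        pos[i] = next;
        next += len[i];
    }
    for (int i = 0; i < NCODES; i++) {
        uint64_t mask = ((uint64_t)1 << len[i]) - 1;
        word |= ((uint64_t)code[i] & mask) << pos[i];
    }
    *bits = word;
    return next;
}
```

Separating the two loops mirrors the slide's point: only the running offset `next` is carried between iterations, while the shifts and ORs depend on already-computed positions and can be done in parallel.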

Thank You