Basics of Data Compression


Transcript of Basics of Data Compression

Page 1: Basics of Data Compression

Basics of Data Compression

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Page 2: Basics of Data Compression

Uniquely Decodable Codes

A variable length code assigns a bit string (codeword) of variable length to every symbol

e.g. a = 1, b = 01, c = 101, d = 011

What if you get the sequence 1011? It can be read as a·b·a (1·01·1), a·d (1·011), or c·a (101·1): the code is ambiguous.

With a uniquely decodable code, every encoded sequence can be decomposed into its codewords in exactly one way.
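To see the ambiguity concretely, here is a minimal Python sketch (not from the slides; the function name is made up) that enumerates every way to split a bit string into codewords of the example code:

    CODE = {'a': '1', 'b': '01', 'c': '101', 'd': '011'}

    def parses(bits, prefix=()):
        # Yield every decomposition of `bits` into codewords of CODE.
        if not bits:
            yield prefix
            return
        for sym, cw in CODE.items():
            if bits.startswith(cw):
                yield from parses(bits[len(cw):], prefix + (sym,))

    print(list(parses('1011')))
    # [('a', 'b', 'a'), ('a', 'd'), ('c', 'a')] -- three parses, so not uniquely decodable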

Page 3: Basics of Data Compression

Prefix Codes

A prefix code is a variable length code in which no codeword is a prefix of another one

e.g., a = 0, b = 100, c = 101, d = 11. Can be viewed as a binary trie:

[Figure: binary trie with edges labeled 0/1; the leaves a, b, c, d correspond to the codewords 0, 100, 101, 11]
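Because no codeword is a prefix of another, a decoder can emit a symbol as soon as its bit buffer matches a codeword. A minimal Python sketch of this (illustrative code, not from the slides):

    CODE = {'a': '0', 'b': '100', 'c': '101', 'd': '11'}
    INV = {cw: sym for sym, cw in CODE.items()}

    def prefix_decode(bits):
        out, buf = [], ''
        for bit in bits:
            buf += bit
            if buf in INV:            # prefix-freeness: the first match is the only one
                out.append(INV[buf])
                buf = ''
        return ''.join(out)

    print(prefix_decode('010011'))    # 0|100|11 -> 'abd'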

Page 4: Basics of Data Compression

Average Length

For a code C with codeword lengths L[s], the average length is defined as

$L_a(C) = \sum_{s \in S} p(s) \cdot L[s]$

e.g., p(A) = .7 (codeword 0), p(B) = p(C) = p(D) = .1 (3-bit codewords of the form 1··)

La = .7 × 1 + .3 × 3 = 1.6 bits (Huffman achieves 1.5 bits)

We say that a prefix code C is optimal if for all prefix codes C′, La(C) ≤ La(C′)
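A quick sanity check of these numbers (a throwaway Python snippet, not part of the slides):

    p = {'A': .7, 'B': .1, 'C': .1, 'D': .1}
    L = {'A': 1, 'B': 3, 'C': 3, 'D': 3}    # lengths of the code 0, 1xx
    print(sum(p[s] * L[s] for s in p))      # ~1.6 bits per symbol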

Page 5: Basics of Data Compression

Entropy (Shannon, 1948)

For a source S emitting symbols with probability p(s), the self-information of s is:

$i(s) = \log_2 \frac{1}{p(s)}$ bits

Lower probability → higher information.

Entropy is the weighted average of i(s):

$H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}$

0-th order empirical entropy of string T:

$H_0(T) = \sum_{s} \frac{occ(s)}{|T|} \log_2 \frac{|T|}{occ(s)}$

$0 \le H \le \log_2 |\Sigma|$: H → 0 for a skewed distribution, H is maximum for the uniform distribution.
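A small Python sketch of H₀ as just defined (illustrative, not from the slides):

    from collections import Counter
    from math import log2

    def H0(T):
        # 0-th order empirical entropy of T, in bits per symbol
        n, occ = len(T), Counter(T)
        return sum(c / n * log2(n / c) for c in occ.values())

    print(H0('abab'))      # 1.0: uniform over {a,b} -> log2(2)
    print(H0('aaaaaaab'))  # ~0.54: skewed distribution -> well below 1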

Page 6: Basics of Data Compression

Performance: Compression ratio

Compression ratio = #bits in output / #bits in input

Compression performance: we compare the entropy against the achieved compression ratio.

p(A) = .7, p(B) = p(C) = p(D) = .1

H ≈ 1.36 bits

Huffman ≈ 1.5 bits per symbol

Shannon: $H(S)$ vs. the average codeword length $\sum_{s} p(s) \cdot |C(s)|$

In practice (empirical H vs. compression ratio): $|T| \cdot H_0(T)$ vs. $|C(T)|$

An optimal code is surely one that…

Page 7: Basics of Data Compression

Index construction: Compression of postings

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading: 5.3 and a paper

Page 8: Basics of Data Compression

γ code for integer encoding

Defined for x > 0, with Length $= \lfloor \log_2 x \rfloor + 1$ (the number of bits of binary(x)): γ(x) = Length − 1 zeros, followed by binary(x).

e.g., 9 is represented as <000, 1001>.

The γ code for x takes $2 \lfloor \log_2 x \rfloor + 1$ bits (i.e., a factor of 2 from optimal).

Optimal for $\Pr(x) \approx 1/(2x^2)$, and i.i.d. integers
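A minimal Python sketch of the γ code just described (illustrative; function names are mine):

    def gamma_encode(x):
        assert x > 0
        b = bin(x)[2:]                   # binary(x); Length = len(b) bits
        return '0' * (len(b) - 1) + b    # Length-1 zeros, then binary(x)

    def gamma_decode(bits, i=0):
        # Decode one integer starting at bit i; return (value, next position).
        z = 0
        while bits[i + z] == '0':        # count the leading zeros = Length - 1
            z += 1
        return int(bits[i + z:i + 2 * z + 1], 2), i + 2 * z + 1

    print(gamma_encode(9))               # 0001001
    print(gamma_decode('0001001'))       # (9, 7)

Calling gamma_decode repeatedly on the bit string of the next slide reproduces the sequence 8 6 3 59 7.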

Page 9: Basics of Data Compression

It is a prefix-free encoding…

Given the following sequence of coded integers, reconstruct the original sequence:

0001000001100110000011101100111

000 1000 | 00 110 | 0 11 | 00000 111011 | 00 111 → 8 6 3 59 7

Page 10: Basics of Data Compression

δ code for integer encoding

Use γ-coding to reduce the length of the first field: δ(x) = γ(Length), followed by binary(x).

Useful for medium-sized integers

e.g., 19 is represented as <00, 101, 10011>.

δ-coding x takes about $\log_2 x + 2 \log_2(\log_2 x) + 2$ bits.

Optimal for $\Pr(x) \approx 1/(2x(\log x)^2)$, and i.i.d. integers
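A companion Python sketch of the δ code (again illustrative; this follows the slide's variant, which keeps the full binary(x) including its leading 1):

    def gamma_encode(x):
        b = bin(x)[2:]
        return '0' * (len(b) - 1) + b

    def delta_encode(x):
        assert x > 0
        b = bin(x)[2:]                   # binary(x)
        return gamma_encode(len(b)) + b  # gamma(Length), then binary(x)

    print(delta_encode(19))              # 00 101 10011 -> '0010110011'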

Page 11: Basics of Data Compression

Variable-byte codes [10.2 bits per TREC12]

Wish to get very fast (de)compression → byte-aligned codes.

Given the binary representation of an integer:
Prepend 0s to get a multiple-of-7 number of bits.
Form groups of 7 bits each.
Prepend to the last group the bit 0, and to the other groups the bit 1 (tagging).

e.g., $v = 2^{14} + 1$: binary(v) = 100000000000001 → groups 0000001 0000000 0000001 → tagged bytes 10000001 10000000 00000001

Note: we waste 1 bit per byte, and on average 4 bits in the first byte.

But it is a prefix code, and it also encodes the value 0!
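A minimal Python sketch of this byte-aligned scheme (names are mine; tag bit 1 marks "more bytes follow", tag 0 marks the last byte, as above):

    def vbyte_encode(v):
        groups = []
        while True:
            groups.append(v & 0x7F)   # next 7-bit group, least significant first
            v >>= 7
            if v == 0:
                break
        groups.reverse()              # most significant group first
        return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

    def vbyte_decode(data):
        out, v = [], 0
        for byte in data:
            v = (v << 7) | (byte & 0x7F)
            if byte & 0x80 == 0:      # tag 0: last byte of this integer
                out.append(v)
                v = 0
        return out

    enc = vbyte_encode(2**14 + 1)
    print([format(b, '08b') for b in enc])      # ['10000001', '10000000', '00000001']
    print(vbyte_decode(enc + vbyte_encode(0)))  # [16385, 0] -- 0 is encodable too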

Page 12: Basics of Data Compression

PForDelta coding

[Figure: a block of 128 small numbers (2 3 3 1 … 1 3 3 2) packed in b = 2 bits each: 128 × 2 bits = 256 bits = 32 bytes; the values 23, 13, 42 do not fit in 2 bits and become exceptions]

Use b (e.g., 2) bits to encode each of the 128 numbers, or create exceptions.

Encode exceptions: ESC symbols or pointers.

Choose b to encode 90% of the values, or trade off: larger b → fewer exceptions but more wasted bits; smaller b → fewer bits but more exceptions.

Translate the data: $[\text{base}, \text{base} + 2^b - 1] \rightarrow [0, 2^b - 1]$
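A rough Python sketch of the idea (not the actual PForDelta bit layout; exceptions are kept as (position, value) pairs on the side):

    B = 2                                     # bits per packed value

    def pfor_encode(block, b=B):
        packed, exceptions = [], []
        for i, v in enumerate(block):
            if v < (1 << b):
                packed.append(v)              # fits in b bits
            else:
                packed.append(0)              # placeholder in the b-bit stream
                exceptions.append((i, v))     # stored verbatim elsewhere
        return packed, exceptions

    def pfor_decode(packed, exceptions):
        out = list(packed)
        for i, v in exceptions:
            out[i] = v                        # patch the exceptions back in
        return out

    block = [2, 3, 3, 1, 23, 1, 3, 3, 13, 42, 2]
    packed, exc = pfor_encode(block)
    assert pfor_decode(packed, exc) == block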

Page 13: Basics of Data Compression

Index construction: Compression of documents

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79

Page 14: Basics of Data Compression

Raw docs are needed

Page 15: Basics of Data Compression

Various Approaches

Statistical coding: Huffman codes, Arithmetic codes

Dictionary coding: LZ77, LZ78, LZSS, … (gzip, zippy, snappy, …)

Text transforms: Burrows-Wheeler Transform (bzip)

Page 16: Basics of Data Compression

Document Compression

Huffman coding

Page 17: Basics of Data Compression

Huffman Codes

Invented by Huffman as a class assignment in the '50s.

Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, …

Properties:
Generates optimal prefix codes.
Cheap to encode and decode.
La(Huff) = H if the probabilities are powers of 1/2.
Otherwise, H ≤ La(Huff) < H + 1: less than +1 bit per symbol on average!

Page 18: Basics of Data Compression

Running Example

p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

a = 000, b = 001, c = 01, d = 1. There are $2^{n-1}$ “equivalent” Huffman trees.

[Figure: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1); edges labeled 0/1]

What about ties (and thus, tree depth)?

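A minimal Python sketch of the construction (illustrative; because of ties, the exact bit patterns can differ from the slide's, but the codeword lengths, and hence La, match):

    import heapq
    from itertools import count

    def huffman_codes(p):
        tick = count()                          # tie-breaker for equal weights
        heap = [(w, next(tick), sym) for sym, w in p.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)   # merge the two least-probable trees
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tick), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):         # internal node: 0 left, 1 right
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:
                codes[node] = prefix            # leaf: record the codeword
        walk(heap[0][2], '')
        return codes

    print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
    # e.g. {'d': '0', 'c': '10', 'a': '110', 'b': '111'} -- lengths 1, 2, 3, 3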

Page 19: Basics of Data Compression

Encoding and Decoding

Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.

Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.

[Figure: the Huffman tree from the previous slide: a(.1) and b(.2) under (.3); (.3) and c(.2) under (.5); (.5) and d(.5) under the root; left edges 0, right edges 1]

Encoding example: abc… → 00000101…

Decoding example: 101001… → dcb…
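The decoding loop as a Python sketch, with the slide's tree hard-coded as nested pairs (left = 0, right = 1; illustrative code, not from the slides):

    TREE = ((('a', 'b'), 'c'), 'd')      # a=000, b=001, c=01, d=1

    def decode(bits):
        out, node = [], TREE
        for bit in bits:
            node = node[int(bit)]        # take the branch for this bit
            if isinstance(node, str):    # reached a leaf:
                out.append(node)         # output its symbol...
                node = TREE              # ...and return to the root
        return ''.join(out)

    print(decode('00000101'))  # 'abc'
    print(decode('101001'))    # 'dcb'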

Page 20: Basics of Data Compression

Huffman in practice

The compressed file of n symbols consists of:

Preamble: tree encoding + symbols in the leaves. Body: compressed text of n symbols.

Preamble = $\Theta(|\Sigma| \log |\Sigma|)$ bits

Body is at least nH and at most nH+n bits

The extra +n is bad for very skewed distributions, namely ones for which H → 0.

Example: p(a) = 1/n, p(b) = (n − 1)/n

Page 21: Basics of Data Compression

There are better choices

T=aaaaaaaaab

Huffman = {a,b}-encoding + 10 bits

RLE = <a,9><b,1> = γ(9) + γ(1) + {a,b}-encoding = 0001001 1 + {a,b}-encoding

So RLE saves 2 bits over Huffman, because it is not a prefix code: unlike Huffman, it does not map each symbol to a fixed bit string; the mapping may change along the text and, moreover, it effectively uses fractions of bits.

Page 22: Basics of Data Compression

Idea on Huffman?

Goal: Reduce the impact of the +1 bit.

Solution: Divide the text into blocks of k symbols. The +1 is spread over k symbols, so the loss is 1/k per symbol.

Caution: the alphabet becomes $\Sigma^k$, so the preamble gets larger.

At the limit, the preamble is 1 block equal to the whole input text, and the compressed text is 1 bit only:

No compression!

Page 23: Basics of Data Compression

Document Compression

Arithmetic coding

Page 24: Basics of Data Compression

Introduction

It uses “fractional” parts of bits!!

Gets nH(T) + 2 bits vs. nH(T)+n of Huffman

Used in JPEG/MPEG (as option), Bzip

More time costly than Huffman, but integer implementation is not too bad.

Ideal performance; in practice, the overhead is about 0.02 · n bits.

Page 25: Basics of Data Compression

Symbol interval

Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g., p(a) = .2, p(b) = .5, p(c) = .3

cum[a] = .0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7

The interval for a particular symbol will be called the symbol interval (e.g., for b it is [.2, .7))

Page 26: Basics of Data Compression

Sequence interval

Coding the message sequence: bac

The final sequence interval is [.27,.3)

Start with [0, 1), split into a: [0, .2), b: [.2, .7), c: [.7, 1).

b → [.2, .7); split it in proportion (a: (0.7 − 0.2) × 0.2 = 0.1, b: (0.7 − 0.2) × 0.5 = 0.25, c: (0.7 − 0.2) × 0.3 = 0.15) into a: [.2, .3), b: [.3, .55), c: [.55, .7).

a → [.2, .3); split it (a: (0.3 − 0.2) × 0.2 = 0.02, b: (0.3 − 0.2) × 0.5 = 0.05, c: (0.3 − 0.2) × 0.3 = 0.03) into a: [.2, .22), b: [.22, .27), c: [.27, .3).

c → [.27, .3).

Page 27: Basics of Data Compression

The algorithm

To code a sequence of symbols T₁…Tₙ with probabilities p(Tᵢ) (i = 1..n), use the following algorithm:

$l_0 = 0, \qquad s_0 = 1$
$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$
$s_i = s_{i-1} \cdot p(T_i)$

Example (p(a) = .2, p(b) = .5, p(c) = .3, message bac): at the last step, $l_{i-1} = 0.2$ and $s_{i-1} = 0.1$, so

$s_i = 0.1 \times 0.3 = 0.03$
$l_i = 0.2 + 0.1 \times (0.2 + 0.5) = 0.27$

Page 28: Basics of Data Compression

The algorithm

Each symbol narrows the interval by a factor p(Tᵢ):

$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$
$s_i = s_{i-1} \cdot p(T_i)$

Starting from $l_0 = 0$, $s_0 = 1$, the final interval size is

$s_n = \prod_{i=1}^{n} p(T_i)$

Sequence interval: $[l_n, l_n + s_n)$. Take a number inside it.
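The recurrences, run on the example in a few lines of Python (illustrative):

    p   = {'a': 0.2, 'b': 0.5, 'c': 0.3}
    cum = {'a': 0.0, 'b': 0.2, 'c': 0.7}

    def sequence_interval(msg):
        l, s = 0.0, 1.0                  # l0 = 0, s0 = 1
        for ch in msg:
            l += s * cum[ch]             # l_i = l_{i-1} + s_{i-1} * cum[T_i]
            s *= p[ch]                   # s_i = s_{i-1} * p(T_i)
        return l, s

    l, s = sequence_interval('bac')
    print(l, l + s)                      # ~0.27 ~0.30: the interval [.27, .3)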

Page 29: Basics of Data Compression

Decoding Example

Decoding the number .49, knowing the message is of length 3:

.49 ∈ [.2, .7), the symbol interval of b → output b; split [.2, .7) into a: [.2, .3), b: [.3, .55), c: [.55, .7)

.49 ∈ [.3, .55) → output b; split [.3, .55) into a: [.3, .35), b: [.35, .475), c: [.475, .55)

.49 ∈ [.475, .55) → output c

The message is bbc.
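The same walk in Python (illustrative; the tables are repeated to keep the sketch self-contained): find the symbol interval containing the number, output the symbol, rescale, repeat:

    p   = {'a': 0.2, 'b': 0.5, 'c': 0.3}
    cum = {'a': 0.0, 'b': 0.2, 'c': 0.7}

    def arith_decode(x, n):
        out = []
        for _ in range(n):
            for ch in p:                      # which symbol interval holds x?
                lo, hi = cum[ch], cum[ch] + p[ch]
                if lo <= x < hi:
                    out.append(ch)
                    x = (x - lo) / p[ch]      # rescale x back into [0, 1)
                    break
        return ''.join(out)

    print(arith_decode(0.49, 3))              # 'bbc'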

Page 30: Basics of Data Compression

How do we encode that number?

If $x = v/2^k$ (a dyadic fraction), then the encoding is bin(v) over k digits (possibly padded with 0s in front):

7/16 = .0111
3/4 = .11
1/3 = .0101… (not dyadic: the expansion is infinite)

Page 31: Basics of Data Compression

How do we encode that number?

Binary fractional representation:

$x = .b_1 b_2 b_3 b_4 b_5 \ldots = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + b_4 2^{-4} + \cdots$

FractionalEncode(x):
1. x = 2 * x
2. if x < 1, output 0
3. else output 1 and set x = x − 1
(repeat from step 1)

Example, x = 1/3 = .0101…:
2 × (1/3) = 2/3 < 1, output 0
2 × (2/3) = 4/3 > 1, output 1; 4/3 − 1 = 1/3, and the process repeats

Incremental generation: the bits are produced one at a time, on demand.
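The same loop as a Python bit generator, using exact rationals so that 1/3 really cycles (illustrative):

    from fractions import Fraction

    def fractional_encode(x):
        while True:
            x = 2 * x              # step 1
            if x < 1:
                yield 0            # step 2
            else:
                yield 1            # step 3
                x = x - 1

    bits = fractional_encode(Fraction(1, 3))
    print([next(bits) for _ in range(6)])   # [0, 1, 0, 1, 0, 1]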

Page 32: Basics of Data Compression

Which number do we encode?

Truncate the encoding of the chosen number to its first $d = \lceil \log_2 (2/s_n) \rceil$ bits.

Truncation gets a smaller number… how much smaller?

Writing $x = .b_1 b_2 \ldots b_d b_{d+1} \ldots$, truncation replaces it by $.b_1 b_2 \ldots b_d 0^\infty$, so the loss is at most

$\sum_{i > d} b_i 2^{-i} \le \sum_{i > d} 2^{-i} = 2^{-d} \le 2^{-\log_2(2/s_n)} = s_n / 2$

Hence, encoding the middle point $x = l_n + s_n/2$ of the sequence interval and truncating it to d bits yields a number that still lies in $[l_n, l_n + s_n)$: truncation compresses without breaking decodability.

Page 33: Basics of Data Compression

Bound on code length

Theorem: For a text T of length n, the Arithmetic encoder generates at most $2 + n\,H(T)$ bits:

$\lceil \log_2 (2/s_n) \rceil < 1 + \log_2 (2/s_n) = 1 + (1 - \log_2 s_n)$
$= 2 - \log_2 \left( \prod_{i=1}^{n} p(T_i) \right)$
$= 2 - \log_2 \left( \prod_{s} p(s)^{occ(s)} \right)$
$= 2 - \sum_{s} occ(s) \cdot \log_2 p(s)$
$\approx 2 + \sum_{s} (n \cdot p(s)) \cdot \log_2 (1/p(s))$
$= 2 + n\,H(T)$ bits

Example: T = acabc → $s_n = p(a) \cdot p(c) \cdot p(a) \cdot p(b) \cdot p(c) = p(a)^2 \cdot p(b) \cdot p(c)^2$

Page 34: Basics of Data Compression

Document Compression

Dictionary-based compressors

Page 35: Basics of Data Compression

LZ77

Algorithm's step:
Output <dist, len, next-char>
Advance by len + 1

A buffer “window” has fixed length and moves over the text.

[Figure: the window slides over a a c a a c a b c a a a a a a; the dictionary consists of all substrings starting in the window; the emitted codewords shown are <6,3,a> and <3,4,c>]
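A rough Python sketch of one greedy parsing pass (illustrative: unbounded window and naive quadratic matching, whereas the slides use a fixed-size window and, later, hash tables):

    def lz77_encode(text):
        out, i = [], 0
        while i < len(text):
            dist = length = 0
            for j in range(i):                 # try every earlier start
                l = 0
                while i + l + 1 < len(text) and text[j + l] == text[i + l]:
                    l += 1                     # matches may run past i (overlap)
                if l > length:
                    dist, length = i - j, l
            out.append((dist, length, text[i + length]))
            i += length + 1                    # advance by len + 1
        return out

    print(lz77_encode('aacaacabcaaaaaa'))
    # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (6, 3, 'a'), (12, 2, 'a')]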

Page 36: Basics of Data Compression

LZ77 Decoding

The decoder keeps the same dictionary window as the encoder: it locates the referenced substring and appends a copy of it.

What if len > dist? Then the copy overlaps the text still being written. E.g., seen = abcd, and the next codeword is (2, 9, e). Simply copy byte by byte, starting at the cursor (d = dist):

for (i = 0; i < len; i++)
    out[cursor + i] = out[cursor - d + i];  /* later iterations may read bytes written by earlier ones */

The output is correct: abcdcdcdcdcdce
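The same copy in Python, including the overlapping case (illustrative):

    def lz77_decode(codewords):
        out = []
        for dist, length, nxt in codewords:
            start = len(out) - dist
            for i in range(length):
                out.append(out[start + i])  # may re-read chars appended just above
            out.append(nxt)
        return ''.join(out)

    print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'),
                       (0, 0, 'd'), (2, 9, 'e')]))  # 'abcdcdcdcdcdce'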

Page 37: Basics of Data Compression

LZ77 Optimizations used by gzip

LZSS: output one of the two formats (0, position, length) or (1, char). Typically the second format is used if length < 3.

Special greedy: possibly use a shorter match so that the next match is better.

Hash table to speed up the search for matches on triplets.

Triples are coded with Huffman codes.

Page 38: Basics of Data Compression

You find this at: www.gzip.org/zlib/

Page 39: Basics of Data Compression

Google’s solution