Basics of Data Compression

Basics of Data Compression

Paolo FerraginaDipartimento di Informatica

Università di Pisa

Uniquely Decodable Codes

A variable length code assigns a bit string (codeword) of variable length to every symbol

e.g. a = 1, b = 01, c = 101, d = 011

What if you get the sequence 1011 ?

A uniquely decodable code can always be uniquely decomposed into their codewords.

Prefix Codes

A prefix code is a variable length code in which no codeword is a prefix of another one

e.g a = 0, b = 100, c = 101, d = 11Can be viewed as a binary trie

0 1

a

b c

d

0

0 1

1

Average Length

For a code C with codeword length L[s], the average length is defined as

p(A) = .7 [0], p(B) = p(C) = p(D) = .1 [1--]

La = .7 * 1 + .3 * 3 = 1.6 bit (Huffman achieves 1.5 bit)

We say that a prefix code C is optimal if for all prefix codes C’, La(C) La(C’)

Ss

a sLspCL ][)()(

Entropy (Shannon, 1948)

For a source S emitting symbols with probability p(s), the self information of s is:

bits

Lower probability higher information

Entropy is the weighted average of i(s)

Ss sp

spSH)(

1log)()( 2

)(

1log)( 2 sp

si

s s

s

occ

T

T

occTH

||log

||)( 20

0-th order empirical entropy of string T

i(s)

0 <= H <= log ||H -> 0, skewed distributionH max, uniform distribution

Performance: Compression ratio

Compression ratio =

#bits in output / #bits in input

Compression performance: We relate entropy against compression ratio.

p(A) = .7, p(B) = p(C) = p(D) = .1

H ≈ 1.36 bits

Huffman ≈ 1.5 bits per symb

||

|)(|)(0 T

TCvsTH

s

scspSH |)(|)()(Shannon In practiceAvg cw length

Empirical H vs Compression ratio

|)(|)(|| 0 TCvsTHT

An optimal code is surely one that…

Index construction:Compression of postings


Università di Pisa

Reading 5.3 and a paper

code for integer encoding

x > 0 and Length = log2 x +1

e.g., 9 represented as <000,1001>.

code for x takes 2 log2 x +1 bits (ie. factor of 2 from optimal)

Length-1

Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…

Given the following sequence of coded integers, reconstruct the original sequence:

0001000001100110000011101100111

8 6 3 59 7

code for integer encoding

Use -coding to reduce the length of the first field

Useful for medium-sized integers

e.g., 19 represented as <00,101,10011>.

coding x takes about log2 x + 2 log2( log2 x ) + 2 bits.

(Length) x

Optimal for Pr(x) = 1/2x(log x)2, and i.i.d integers

Variable-bytecodes [10.2 bits per TREC12]

Wish to get very fast (de)compress byte-align

Given a binary representation of an integer Append 0s to front, to get a multiple-of-7 number of bits Form groups of 7-bits each Append to the last group the bit 0, and to the other

groups the bit 1 (tagging)

e.g., v=214+1 binary(v) = 10000000000000110000001 10000000 00000001

Note: We waste 1 bit per byte, and avg 4 for the first byte.

But it is a prefix code, and encodes also the value 0 !!

PForDelta coding

10 11 11 …01 01 11 11 01 42 2311 10

2 3 3 …1 1 3 3 23 13 42 2

a block of 128 numbers = 256 bits = 32 bytes

Use b (e.g. 2) bits to encode 128 numbers or create exceptions

Encode exceptions: ESC or pointers

Choose b to encode 90% values, or trade-off: b waste more bits, b more exceptions

Translate data: [base, base + 2b-1] [0,2b-1]

Index construction:Compression of documents


Università di Pisa

Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79

Raw docs are needed

Various Approaches

Statistical coding Huffman codes Arithmetic codes

Dictionary coding LZ77, LZ78, LZSS,… Gzip, zippy, snappy,…

Text transforms Burrows-Wheeler Transform bzip

Document Compression

Huffman coding

Huffman Codes

Invented by Huffman as a class assignment in ‘50.

Used in most compression algorithms gzip, bzip, jpeg (as option), fax compression,…

Properties: Generates optimal prefix codes Cheap to encode and decode La(Huff) = H if probabilities are powers of 2

Otherwise, La(Huff) < H +1 < +1 bit per symb on avg!!

Running Example

p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5a(.1) b(.2) d(.5)c(.2)

a=000, b=001, c=01, d=1There are 2n-1 “equivalent” Huffman trees

(.3)

(.5)

(1)

What about ties (and thus, tree depth) ?

0

0

0

11

1

Encoding and Decoding

Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.

Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.

a(.1) b(.2)

(.3) c(.2)

(.5) d(.5)0

0

0

1

1

1

abc... 00000101

101001... dcb

Huffman in practice

The compressed file of n symbols, consists of:

Preamble: tree encoding + symbols in leaves Body: compressed text of n symbols

Preamble = (|| log ||) bits

Body is at least nH and at most nH+n bits

Extra +n is bad for very skewed distributions, namely ones for which H -> 0

Example: p(a) = 1/n, p(b) = n-1/n

There are better choices

T=aaaaaaaaab

Huffman = {a,b}-encoding + 10 bits RLE = <a,9><b,1> = (9) + (1) + {a,b}-

encoding = 0001001 1 + {a,b}-encoding

So RLE saves 2 bits to Huffman, because it is not a prefix-code. In fact it does not map symbol -> bits uniquely, as Huffman, but the mapping may actually change and, moreover, it uses fractions of bits.

Idea on Huffman?

Goal: Reduce the impact of the +1 bit

Solution: Divide the text into blocks of k symbols The +1 is spread over k symbols So the loss is 1/k per symbol

Caution: Alphabet = k, preamble gets larger.

At the limit, preamble = 1 block equal to the input text, and compressed text is 1 bit only.

No compression!


Arithmetic coding

Introduction

It uses “fractional” parts of bits!!

Gets nH(T) + 2 bits vs. nH(T)+n of Huffman

Used in JPEG/MPEG (as option), Bzip

More time costly than Huffman, but integer implementation is not too bad.

Ideal performance. In practice, it is 0.02 * n

Symbol interval

Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.

a = .2

c = .3

b = .5

cum[c] = p(a)+p(b) = .7

cum[b] = p(a) = .2

cum[a] = .0

The interval for a particular symbol will be calledthe symbol interval (e.g for b it is [.2,.7))

Sequence interval

Coding the message sequence: bac

The final sequence interval is [.27,.3)

a = .2

c = .3

b = .5

0.0

0.2

0.7

1.0

a

c

b

0.2

0.3

0.55

0.7

a

c

b

0.2

0.22

0.27

0.3(0.7-0.2)*0.3=0.15

(0.3-0.2)*0.5 = 0.05

(0.3-0.2)*0.3=0.03

(0.3-0.2)*0.2=0.02(0.7-0.2)*0.2=0.1

(0.7-0.2)*0.5 = 0.25

The algorithm

To code a sequence of symbols with probabilities

pi (i = 1..n) use the following algorithm:

0

1

0

0

l

s iiii

iii

Tcumsll

Tpss

*11

1 *

p(a) = .2

p(c) = .3

p(b) = .5

0.27

0.2

0.3

2.0

1.0

1

1

i

i

l

s

03.03.0*1.0 is

27.0)5.02.0(*1.02.0 il

The algorithm

Each message narrows the interval by a factor p[Ti]

Final interval size is

1

0

0

0

s

l

n

iin Tps

1

iii

iiii

Tpss

Tcumsll

*1

*11

Sequence interval[ ln , ln + sn )

Take a number inside

Decoding Example

Decoding the number .49, knowing the message is of length 3:

The message is bbc.

a = .2

c = .3

b = .5

0.0

0.2

0.7

1.0

a

c

b

0.2

0.3

0.55

0.7

a

c

b

0.3

0.35

0.475

0.55

0.490.49

0.49

How do we encode that number?

If x = v/2k (dyadic fraction) then the

encoding is equal to bin(v) over k digits (possibly padded with 0s in front)

0111.16/7

11.4/3

01.3/1

How do we encode that number?

Binary fractional representation:

FractionalEncode(x)1. x = 2 * x2. If x < 1 output 03. else {output 1; x = x - 1; }

.... 54321 bbbbb

...2222 44

33

22

11 bbbb

01.3/1

2 * (1/3) = 2/3 < 1, output 0

2 * (2/3) = 4/3 > 1, output 1 4/3 – 1 = 1/3Incremental Generation

Which number do we encode?

Truncate the encoding to the first d = log2 (2/sn) bits

Truncation gets a smaller number… how much smaller?

Truncation Compression

d

i

id

i

id

i

ididb

222212

11

)(

1

)(

2222

log2

log 22nss snn

ln + sn

ln

ln + sn/2

....... 32154321 dddd bbbbbbbbbx 0∞

Bound on code length

Theorem: For a text T of length n, the Arithmetic

encoder generates at most

log2 (2/sn) < 1 + log2 2/sn = 1 + (1 - log2 sn)

= 2 - log2 (∏ i=1,n p(Ti))

= 2 - log2 (∏ [p()occ()])

= 2 - ∑ occ() * log2 p()

≈ 2 + ∑ ( n*p() ) * log2 (1/p())

= 2 + n H(T) bits

T = acabc

sn = p(a) *p(c) *p(a) *p(b) *p(c)

= p(a)2 * p(b) * p(c)2


Dictionary-based compressors

LZ77

Algorithm’s step: Output <dist, len, next-char> Advance by len + 1

A buffer “window” has fixed length and moves

a a c a a c a b c a a a a a aDictionary

(all substrings starting here)

<6,3,a>

<3,4,c>a a c a a c a b c a a a a a a c

a c

a c

LZ77 Decoding

Decoder keeps same dictionary window as encoder. Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed) E.g. seen = abcd, next codeword is (2,9,e) Simply copy starting at the cursor

for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]

Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip

LZSS: Output one of the following formats(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so that next match is better

Hash Table for speed-up searches on triplets

Triples are coded with Huffman’s code

You find this at: www.gzip.org/zlib/

Google’s solution

Basics of Data Compression

Documents

Transcript of Basics of Data Compression