Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79


Page 1:

Index construction: Compression of documents

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79

Page 2:

Raw docs are needed

Page 3:

Various Approaches

Statistical coding: Huffman codes, Arithmetic codes

Dictionary coding: LZ77, LZ78, LZSS, …; gzip, zippy, snappy, …

Text transforms: Burrows-Wheeler Transform, bzip

Page 4:

Basics of Data Compression

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Page 5:

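Uniquely Decodable Codes

A variable-length code assigns a bit string (codeword) of variable length to every symbol,

e.g. a = 1, b = 01, c = 101, d = 011

What if you get the sequence 1011? It can be parsed as a d (1|011), as c a (101|1), or as a b a (1|01|1), so this code is ambiguous.

A uniquely decodable code is one whose encoded sequences can always be decomposed into codewords in exactly one way.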
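To make the ambiguity concrete, here is a minimal sketch that enumerates every way of splitting a bit string into codewords. The function name `all_parses` is an illustrative choice, not something from the slides.

```python
def all_parses(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]
    parses = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Consume one codeword, then parse the remainder recursively.
            for rest in all_parses(bits[len(word):], code):
                parses.append([symbol] + rest)
    return parses


code = {"a": "1", "b": "01", "c": "101", "d": "011"}
for parse in all_parses("1011", code):
    print("".join(parse))
```

A code is uniquely decodable exactly when this enumeration never yields more than one parse for any encoded sequence.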

Page 6:

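Prefix Codes

A prefix code is a variable-length code in which no codeword is a prefix of another one,

e.g. a = 0, b = 100, c = 101, d = 11

It can be viewed as a binary trie: from the root, edge 0 leads to leaf a and edge 1 to an internal node; from there, edge 1 leads to leaf d and edge 0 to a second internal node, whose edges 0 and 1 lead to leaves b and c.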
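As a sketch of why the prefix property matters, the code below checks it and then decodes greedily, emitting a symbol as soon as the bits read so far form a codeword. The helper names are hypothetical, and the decoder assumes its input is a valid concatenation of codewords.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of another one."""
    words = sorted(code.values())
    # After lexicographic sorting, a prefix is always followed
    # immediately by a word that it prefixes.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


def decode_prefix(bits, code):
    """Greedy decoding; correct only for prefix codes and valid input."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


prefix_code = {"a": "0", "b": "100", "c": "101", "d": "11"}
ambiguous_code = {"a": "1", "b": "01", "c": "101", "d": "011"}
print(is_prefix_free(prefix_code), is_prefix_free(ambiguous_code))
print(decode_prefix("010011", prefix_code))
```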

Page 7:

Average Length

For a code C with codeword lengths L[s], the average length is defined as

$$L_a(C) = \sum_{s \in S} p(s) \cdot L[s]$$

Example: p(A) = .7 with codeword [0], p(B) = p(C) = p(D) = .1 with 3-bit codewords [1--]: La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits).

We say that a prefix code C is optimal if, for all prefix codes C', La(C) ≤ La(C').
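The two figures in the example can be checked with a short sketch. The concrete 3-bit codewords in `slide_code` are an assumption (the slide only shows the pattern [1--]), and `huffman_like_code` is one prefix code with lengths 1, 2, 3, 3 that reaches 1.5 bits.

```python
def average_length(probs, code):
    """L_a(C) = sum over symbols of p(s) * L[s]."""
    return sum(p * len(code[s]) for s, p in probs.items())


probs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
slide_code = {"A": "0", "B": "100", "C": "101", "D": "110"}        # assumed codewords
huffman_like_code = {"A": "0", "B": "10", "C": "110", "D": "111"}  # lengths 1, 2, 3, 3

print(average_length(probs, slide_code))
print(average_length(probs, huffman_like_code))
```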

Page 8:

Entropy (Shannon, 1948)

For a source S emitting symbols with probability p(s), the self-information of s is

$$i(s) = \log_2 \frac{1}{p(s)} \text{ bits}$$

Lower probability → higher information

Entropy is the weighted average of i(s):

$$H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}$$

0-th order empirical entropy of a string T, where occ(s) is the number of occurrences of s in T:

$$H_0(T) = \sum_{s} \frac{occ(s)}{|T|} \log_2 \frac{|T|}{occ(s)}$$

$0 \le H \le \log_2 |\Sigma|$; H → 0 for a skewed distribution, H is maximal for a uniform distribution
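A minimal sketch of the 0-th order empirical entropy, computed directly from symbol counts. The function name and the sample strings are illustrative.

```python
from collections import Counter
from math import log2


def empirical_entropy(text):
    """H_0(T) = sum over symbols of (occ/|T|) * log2(|T|/occ)."""
    n = len(text)
    return sum(occ / n * log2(n / occ) for occ in Counter(text).values())


print(empirical_entropy("AAAAAAABCD"))  # skewed: 7 A's out of 10 symbols
print(empirical_entropy("abcd"))        # uniform over 4 symbols
print(empirical_entropy("aaaa"))        # a single symbol
```

The uniform string reaches the maximum log2 of the alphabet size, while the single-symbol string gives zero.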

Page 9:

Performance: Compression ratio

Compression ratio = #bits in output / #bits in input

Compression performance: we relate entropy to compression ratio.

Shannon (average codeword length): $H(S)$ vs $\sum_{s} p(s)\,|c(s)|$

In practice (empirical H vs compression ratio): $H_0(T)$ vs $|C(T)| / |T|$, or equivalently $|T| \cdot H_0(T)$ vs $|C(T)|$

Example: p(A) = .7, p(B) = p(C) = p(D) = .1 gives H ≈ 1.36 bits, while Huffman uses ≈ 1.5 bits per symbol.

An optimal code is surely one that…

Page 10:

Document Compression

Huffman coding

Page 11:

Huffman Codes

Invented by Huffman as a class assignment in 1951 (published in 1952).

Used in most compression algorithms: gzip, bzip, jpeg (as option), fax compression, …

Properties:

Generates optimal prefix codes

Cheap to encode and decode

La(Huff) = H if probabilities are powers of 2

Otherwise, La(Huff) < H + 1, i.e. less than 1 extra bit per symbol on average!

Page 12:

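Running Example

p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Tree construction: merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into the root (1). In each merge one edge is labeled 0 and the other 1.

a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) "equivalent" Huffman trees (the 0/1 labels can be swapped at each of the n-1 internal nodes).

What about ties (and thus, tree depth)?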
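The greedy construction can be sketched with a priority queue. This is an illustrative implementation, not the slides' own: ties are broken by insertion order and the 0/1 labels are an arbitrary choice, so the codewords may differ from the ones on the slide while the codeword lengths and the average length stay optimal. Symbols are assumed to be strings.

```python
import heapq
from itertools import count


def huffman_code(probs):
    """Build a Huffman code for a {symbol: probability} dictionary."""
    tie = count()  # tie-breaker so that trees are never compared
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)  # two least probable subtrees
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))

    code = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node: (left, right)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                            # leaf: a symbol
            code[tree] = prefix or "0"

    walk(heap[0][2], "")
    return code


probs = {"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}
code = huffman_code(probs)
print(code)
print(sum(p * len(code[s]) for s, p in probs.items()))  # average length
```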

Page 13:

Encoding and Decoding

Encoding: emit the root-to-leaf path leading to the symbol to be encoded.

Decoding: start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

Tree: a(.1) and b(.2) hang under (.3); (.3) and c(.2) hang under (.5); (.5) and d(.5) hang under the root. Codewords: a = 000, b = 001, c = 01, d = 1.

abc... → 00000101...

101001... → dcb...
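Both directions can be sketched with the codewords of the running example. The trie is stored as nested dictionaries, which is an implementation choice made here for brevity.

```python
CODE = {"a": "000", "b": "001", "c": "01", "d": "1"}


def build_trie(code):
    """Nested dicts keyed by '0'/'1'; a leaf is the symbol itself."""
    root = {}
    for sym, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = sym
    return root


def encode(text, code):
    """Emit the root-to-leaf path of every symbol."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Follow one branch per bit; at a leaf, output and return to the root."""
    root = build_trie(code)
    node, out = root, []
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):
            out.append(node)
            node = root
    return "".join(out)


print(encode("abc", CODE))
print(decode("101001", CODE))
```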

Page 14:

Huffman in practice

The compressed file of n symbols consists of:

Preamble: tree encoding + symbols in leaves

Body: compressed text of n symbols

Preamble = Θ(|Σ| log |Σ|) bits

Body is at least nH bits and at most nH + n bits

The extra +n is bad for very skewed distributions, namely ones for which H → 0

Example: p(a) = 1/n, p(b) = (n-1)/n
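A quick numeric sketch of the skewed example: the entropy bound nH grows only logarithmically, while Huffman must spend at least 1 bit per symbol. The value n = 1024 is an arbitrary choice for illustration.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


n = 1024
h = entropy([1 / n, (n - 1) / n])
entropy_bits = n * h  # the nH lower bound for the body
huffman_bits = n * 1  # a two-symbol Huffman code uses 1 bit per symbol

print(round(entropy_bits, 2), huffman_bits)
```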

Page 15:

There are better choices

T = aaaaaaaaab

Huffman = {a,b}-encoding + 10 bits

RLE = <a,9><b,1> = γ(9) + γ(1) + {a,b}-encoding = 0001001 1 + {a,b}-encoding (γ denotes the gamma code of an integer)

So RLE saves 2 bits with respect to Huffman, because it is not a symbol-by-symbol prefix code. It does not map each symbol to a fixed bit string, as Huffman does: the mapping may change along the text and, moreover, a symbol may effectively cost a fraction of a bit.

Fax, bzip, … use RLE
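The 8-bit body can be reproduced with a small sketch. It assumes the run lengths are written with the Elias gamma code, which matches the bit strings 0001001 and 1 shown above; the function names are illustrative.

```python
from itertools import groupby


def gamma(x):
    """Elias gamma code of a positive integer: (length - 1) zeros, then binary."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def run_lengths(text):
    """Collapse the text into (symbol, run length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(text)]


runs = run_lengths("aaaaaaaaab")
body = "".join(gamma(length) for _, length in runs)
print(runs, body, len(body))
```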

Page 16:

Idea on Huffman?

Goal: reduce the impact of the +1 bit

Solution:

Divide the text into blocks of k symbols

The +1 is spread over k symbols

So the loss is 1/k per symbol

Caution: the alphabet size becomes |Σ|^k, so the tree gets larger, and so does the preamble. At the limit (k = n), the preamble holds one k-gram, which is the whole input text, and the compressed body is 1 bit only.

This means no compression at all!
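The effect of blocking can be sketched numerically. The source p = (.9, .1) is a hypothetical example, symbols are assumed independent, and the average Huffman codeword length is computed as the sum of the weights of all merged nodes, a standard identity; preamble cost is ignored.

```python
import heapq
from itertools import product
from math import prod


def huffman_average_length(probs):
    """Average codeword length = sum of the weights of all internal nodes."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def bits_per_symbol(p, k):
    """Huffman over blocks of k independent symbols, cost per original symbol."""
    block_probs = [prod(block) for block in product(p, repeat=k)]
    return huffman_average_length(block_probs) / k


for k in (1, 2, 3):
    print(k, round(bits_per_symbol([0.9, 0.1], k), 4))
```

The per-symbol cost drops toward the entropy as k grows, while the number of leaves (and hence the preamble) grows as 2^k for this binary source.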

Page 17:

Document Compression

Arithmetic coding

Page 18:

Introduction

It uses "fractional" parts of bits!!

Gets < nH(T) + 2 bits vs. < nH(T) + n (Huffman)

Used in JPEG/MPEG (as option), bzip

More time-costly than Huffman, but the integer implementation is not too bad.

The bound above is the ideal performance. In practice, the extra cost is about 0.02 * n.

Page 19:

Symbol interval

Assign each symbol an interval within the range from 0 (inclusive) to 1 (exclusive), e.g.

p(a) = .2, p(b) = .5, p(c) = .3

cum[a] = .0

cum[b] = p(a) = .2

cum[c] = p(a) + p(b) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))
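A minimal sketch of how the symbol intervals follow from the probabilities. Exact `Fraction` arithmetic is used to avoid rounding noise, and the symbol order a, b, c is taken from the cum[] values above.

```python
from fractions import Fraction
from itertools import accumulate


def symbol_intervals(probs):
    """Map each symbol to its half-open interval [cum[s], cum[s] + p(s))."""
    symbols = list(probs)
    ends = list(accumulate(probs[s] for s in symbols))
    starts = [Fraction(0)] + ends[:-1]
    return {s: (lo, hi) for s, lo, hi in zip(symbols, starts, ends)}


P = {"a": Fraction(2, 10), "b": Fraction(5, 10), "c": Fraction(3, 10)}
for sym, (lo, hi) in symbol_intervals(P).items():
    print(sym, float(lo), float(hi))
```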

Page 20:

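Sequence interval

Coding the message sequence: bac (with p(a) = .2, p(b) = .5, p(c) = .3)

Start from [0,1), split as a = [0,.2), b = [.2,.7), c = [.7,1).

After b, the interval is [.2,.7). Its parts have sizes (0.7-0.2)*0.2 = 0.1, (0.7-0.2)*0.5 = 0.25, (0.7-0.2)*0.3 = 0.15, giving a = [.2,.3), b = [.3,.55), c = [.55,.7).

After a, the interval is [.2,.3). Its parts have sizes (0.3-0.2)*0.2 = 0.02, (0.3-0.2)*0.5 = 0.05, (0.3-0.2)*0.3 = 0.03, giving a = [.2,.22), b = [.22,.27), c = [.27,.3).

After c, the final sequence interval is [.27,.3)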

Page 21:

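The algorithm

To code a sequence of symbols T_1 ... T_n with probabilities p(T_i) (i = 1..n) use the following algorithm:

$$l_0 = 0, \qquad s_0 = 1$$

$$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$$

$$s_i = s_{i-1} \cdot p(T_i)$$

Example with p(a) = .2, p(b) = .5, p(c) = .3: after "ba" we have $l_{i-1} = 0.2$ and $s_{i-1} = 0.1$; coding c gives

$$s_i = 0.1 \cdot 0.3 = 0.03$$

$$l_i = 0.2 + 0.1 \cdot (0.2 + 0.5) = 0.27$$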
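The recurrences translate almost line by line into code. This is a sketch with exact fractions rather than the integer implementation used in practice; the function names are illustrative.

```python
from fractions import Fraction

P = {"a": Fraction(2, 10), "b": Fraction(5, 10), "c": Fraction(3, 10)}


def cumulative(probs):
    """cum[s] = sum of the probabilities of the symbols preceding s."""
    cum, total = {}, Fraction(0)
    for sym, p in probs.items():
        cum[sym] = total
        total += p
    return cum


def sequence_interval(text, probs):
    """Return (l_n, s_n) for the whole text."""
    cum = cumulative(probs)
    low, size = Fraction(0), Fraction(1)      # l_0 = 0, s_0 = 1
    for ch in text:
        low = low + size * cum[ch]            # l_i = l_{i-1} + s_{i-1} * cum[T_i]
        size = size * probs[ch]               # s_i = s_{i-1} * p(T_i)
    return low, size


low, size = sequence_interval("bac", P)
print(float(low), float(low + size))
```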

Page 22:

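The algorithm

Each symbol narrows the interval by a factor p(T_i):

$$l_0 = 0, \qquad s_0 = 1$$

$$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$$

$$s_i = s_{i-1} \cdot p(T_i)$$

Final interval size is

$$s_n = \prod_{i=1}^{n} p(T_i)$$

Sequence interval: $[\, l_n,\; l_n + s_n )$

Take a number inside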

Take a number inside

Page 23:

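Decoding Example

Decoding the number .49, knowing the message is of length 3 (with p(a) = .2, p(b) = .5, p(c) = .3):

[0,1) splits as a = [0,.2), b = [.2,.7), c = [.7,1); .49 falls in b.

[.2,.7) splits as a = [.2,.3), b = [.3,.55), c = [.55,.7); .49 falls in b.

[.3,.55) splits as a = [.3,.35), b = [.35,.475), c = [.475,.55); .49 falls in c.

The message is bbc.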
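The decoding walk can be sketched as follows, again with exact fractions. The decoder is assumed to know the message length and the symbol probabilities, as in the example; the function name is illustrative.

```python
from fractions import Fraction

P = {"a": Fraction(2, 10), "b": Fraction(5, 10), "c": Fraction(3, 10)}


def arithmetic_decode(x, n, probs):
    """Recover n symbols from a number x inside the sequence interval."""
    low, size, out = Fraction(0), Fraction(1), []
    for _ in range(n):
        start = low
        for sym, p in probs.items():
            width = size * p
            if start <= x < start + width:    # x falls in this sub-interval
                out.append(sym)
                low, size = start, width      # zoom into it
                break
            start += width
    return "".join(out)


print(arithmetic_decode(Fraction(49, 100), 3, P))
```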

Page 24:

How do we encode that number?

If x = v/2^k (dyadic fraction) then the encoding is equal to bin(v) over k digits (possibly padded with 0s in front)

7/16 = .0111, 3/4 = .11

1/3 = .010101… (01 repeated forever)

Page 25:

How do we encode that number?

Binary fractional representation:

$$x = .b_1 b_2 b_3 b_4 b_5 \ldots = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + b_4 2^{-4} + \ldots$$

FractionalEncode(x)
1. x = 2 * x
2. If x < 1 output 0
3. else {output 1; x = x - 1; }

Incremental generation, e.g. for 1/3 = .0101…:

2 * (1/3) = 2/3 < 1, output 0

2 * (2/3) = 4/3 > 1, output 1 and continue with 4/3 - 1 = 1/3
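FractionalEncode, repeated once per output bit, can be written as the sketch below. The `count` parameter, which fixes how many bits to generate, is an addition made here so that the loop terminates on non-dyadic inputs such as 1/3.

```python
from fractions import Fraction


def fractional_bits(x, count):
    """First `count` bits of the binary expansion of x, with 0 <= x < 1."""
    bits = []
    for _ in range(count):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)


print(fractional_bits(Fraction(1, 3), 6))
print(fractional_bits(Fraction(7, 16), 4))
print(fractional_bits(Fraction(3, 4), 2))
```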

Page 26:

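Which number do we encode?

We encode x = l_n + s_n/2, the midpoint of the sequence interval [l_n, l_n + s_n).

Truncate the encoding to the first $d = \lceil \log_2 (2/s_n) \rceil$ bits

Truncation gets a smaller number… how much smaller?

$$x = .b_1 b_2 \ldots b_d\, b_{d+1} b_{d+2} \ldots \quad\to\quad .b_1 b_2 \ldots b_d\, 0 0 0 \ldots$$

$$\sum_{i=1}^{\infty} b_{d+i}\, 2^{-(d+i)} \;\le\; \sum_{i=1}^{\infty} 2^{-(d+i)} \;=\; 2^{-d}$$

$$2^{-d} = 2^{-\lceil \log_2 (2/s_n) \rceil} \;\le\; 2^{-\log_2 (2/s_n)} = \frac{s_n}{2}$$

So the truncated number is still at least l_n, i.e. it stays inside the sequence interval.

Truncation → Compression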
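Putting the midpoint rule and the truncation together gives the sketch below. It takes the sequence interval as exact fractions and finds d as the smallest integer with 2^-d ≤ s_n/2, which equals the ceiling of log2(2/s_n) without floating-point logarithms. The interval [.27, .3) of the message bac is used as input.

```python
from fractions import Fraction


def arithmetic_codeword(low, size):
    """First d bits of the midpoint of [low, low + size)."""
    d = 0
    while Fraction(1, 2 ** d) > size / 2:     # smallest d with 2^-d <= s_n / 2
        d += 1
    x = low + size / 2                        # midpoint l_n + s_n / 2
    bits = []
    for _ in range(d):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)


low, size = Fraction(27, 100), Fraction(3, 100)
word = arithmetic_codeword(low, size)
value = Fraction(int(word, 2), 2 ** len(word))
print(word, float(value))
```

The printed value is the truncated number, which can be checked against the interval bounds.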

Page 27:

Bound on code length

Theorem: For a text T of length n, the Arithmetic encoder generates at most

$$\lceil \log_2 (2/s_n) \rceil < 1 + \log_2 (2/s_n) = 1 + (1 - \log_2 s_n)$$

$$= 2 - \log_2 \prod_{i=1}^{n} p(T_i) = 2 - \log_2 \prod_{\sigma} p(\sigma)^{occ(\sigma)} = 2 - \sum_{\sigma} occ(\sigma) \log_2 p(\sigma)$$

$$\approx 2 + \sum_{\sigma} (n\, p(\sigma)) \log_2 \frac{1}{p(\sigma)} = 2 + n H(T) \text{ bits}$$

Example: T = acabc gives $s_n = p(a)\,p(c)\,p(a)\,p(b)\,p(c) = p(a)^2\, p(b)\, p(c)^2$

Page 28:

Document Compression

Dictionary-based compressors

Page 29:

LZ77

Algorithm's step:

Output <dist, len, next-char>

Advance by len + 1

A buffer "window" has fixed length and moves

[Figure: the text a a c a a c a b c a a a a a a c scanned with a sliding window; the dictionary consists of all substrings starting inside the window; the triples <6,3,a> and <3,4,c> appear in the figure.]
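A naive sketch of the encoding step: at each cursor position it searches the window for the longest match, outputs the triple, and advances by len + 1. The window size of 6 and the sample text are illustrative choices, not the slide's figure, and real implementations replace the quadratic search with hashing.

```python
def lz77_encode(text, window=6):
    """Return a list of (dist, len, next_char) triples."""
    i, triples = 0, []
    while i < len(text):
        best_dist, best_len = 0, 0
        for j in range(max(0, i - window), i):   # candidate match starts
            length = 0
            # Keep one character in reserve for next_char; the match may
            # run past the cursor (overlap).
            while (i + length < len(text) - 1
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        triples.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return triples


for triple in lz77_encode("aacaacabcabaaac"):
    print(triple)
```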

Page 30:

LZ77 Decoding

Decoder keeps the same dictionary window as the encoder.

Finds the substring and inserts a copy of it

What if l > d, i.e. len > dist? (the copy overlaps the text still to be decoded)

E.g. seen = abcd, next codeword is (2,9,e)

Simply copy starting at the cursor, one character at a time:

for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]

Output is correct: abcdcdcdcdcdce
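The copy loop above, including the overlapping case, can be sketched as follows. The `seen` parameter, which preloads already decoded text, is an addition made here to reproduce the example.

```python
def lz77_decode(triples, seen=""):
    """Rebuild the text from (dist, len, next_char) triples."""
    out = list(seen)
    for dist, length, next_char in triples:
        cursor = len(out)
        for i in range(length):
            # Copying one character at a time makes len > dist work:
            # characters written earlier in this loop are read again.
            out.append(out[cursor - dist + i])
        out.append(next_char)
    return "".join(out)


print(lz77_decode([(2, 9, "e")], seen="abcd"))
```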

Page 31:

LZ77 Optimizations used by gzip

LZSS: output one of the following formats: (0, position, length) or (1, char)

Typically uses the second format if length < 3.

Special greedy: possibly use a shorter match so that the next match is better

Hash table to speed up searches on triplets

Triples are coded with Huffman's code

Page 32:

You find this at: www.gzip.org/zlib/

Page 33:

Google’s solution