Basics of Data Compression
description
Transcript of Basics of Data Compression
Basics of Data Compression
Paolo FerraginaDipartimento di Informatica
Università di Pisa
Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
A uniquely decodable code can always be uniquely decomposed into their codewords.
Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11Can be viewed as a binary trie
0 1
a
b c
d
0
0 1
1
Average Length
For a code C with codeword length L[s], the average length is defined as
p(A) = .7 [0], p(B) = p(C) = p(D) = .1 [1--]
La = .7 * 1 + .3 * 3 = 1.6 bit (Huffman achieves 1.5 bit)
We say that a prefix code C is optimal if for all prefix codes C’, La(C) La(C’)
Ss
a sLspCL ][)()(
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self information of s is:
bits
Lower probability higher information
Entropy is the weighted average of i(s)
Ss sp
spSH)(
1log)()( 2
)(
1log)( 2 sp
si
s s
s
occ
T
T
occTH
||log
||)( 20
0-th order empirical entropy of string T
i(s)
0 <= H <= log ||H -> 0, skewed distributionH max, uniform distribution
Performance: Compression ratio
Compression ratio =
#bits in output / #bits in input
Compression performance: We relate entropy against compression ratio.
p(A) = .7, p(B) = p(C) = p(D) = .1
H ≈ 1.36 bits
Huffman ≈ 1.5 bits per symb
||
|)(|)(0 T
TCvsTH
s
scspSH |)(|)()(Shannon In practiceAvg cw length
Empirical H vs Compression ratio
|)(|)(|| 0 TCvsTHT
An optimal code is surely one that…
Index construction:Compression of postings
Paolo FerraginaDipartimento di Informatica
Università di Pisa
Reading 5.3 and a paper
code for integer encoding
x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.
code for x takes 2 log2 x +1 bits (ie. factor of 2 from optimal)
Length-1
Optimal for Pr(x) = 1/2x2, and i.i.d integers
It is a prefix-free encoding…
Given the following sequence of coded integers, reconstruct the original sequence:
0001000001100110000011101100111
8 6 3 59 7
code for integer encoding
Use -coding to reduce the length of the first field
Useful for medium-sized integers
e.g., 19 represented as <00,101,10011>.
coding x takes about log2 x + 2 log2( log2 x ) + 2 bits.
(Length) x
Optimal for Pr(x) = 1/2x(log x)2, and i.i.d integers
Variable-bytecodes [10.2 bits per TREC12]
Wish to get very fast (de)compress byte-align
Given a binary representation of an integer Append 0s to front, to get a multiple-of-7 number of bits Form groups of 7-bits each Append to the last group the bit 0, and to the other
groups the bit 1 (tagging)
e.g., v=214+1 binary(v) = 10000000000000110000001 10000000 00000001
Note: We waste 1 bit per byte, and avg 4 for the first byte.
But it is a prefix code, and encodes also the value 0 !!
PForDelta coding
10 11 11 …01 01 11 11 01 42 2311 10
2 3 3 …1 1 3 3 23 13 42 2
a block of 128 numbers = 256 bits = 32 bytes
Use b (e.g. 2) bits to encode 128 numbers or create exceptions
Encode exceptions: ESC or pointers
Choose b to encode 90% values, or trade-off: b waste more bits, b more exceptions
Translate data: [base, base + 2b-1] [0,2b-1]
Index construction:Compression of documents
Paolo FerraginaDipartimento di Informatica
Università di Pisa
Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding Huffman codes Arithmetic codes
Dictionary coding LZ77, LZ78, LZSS,… Gzip, zippy, snappy,…
Text transforms Burrows-Wheeler Transform bzip
Document Compression
Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms gzip, bzip, jpeg (as option), fax compression,…
Properties: Generates optimal prefix codes Cheap to encode and decode La(Huff) = H if probabilities are powers of 2
Otherwise, La(Huff) < H +1 < +1 bit per symb on avg!!
Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5a(.1) b(.2) d(.5)c(.2)
a=000, b=001, c=01, d=1There are 2n-1 “equivalent” Huffman trees
(.3)
(.5)
(1)
What about ties (and thus, tree depth) ?
0
0
0
11
1
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
a(.1) b(.2)
(.3) c(.2)
(.5) d(.5)0
0
0
1
1
1
abc... 00000101
101001... dcb
Huffman in practice
The compressed file of n symbols, consists of:
Preamble: tree encoding + symbols in leaves Body: compressed text of n symbols
Preamble = (|| log ||) bits
Body is at least nH and at most nH+n bits
Extra +n is bad for very skewed distributions, namely ones for which H -> 0
Example: p(a) = 1/n, p(b) = n-1/n
There are better choices
T=aaaaaaaaab
Huffman = {a,b}-encoding + 10 bits RLE = <a,9><b,1> = (9) + (1) + {a,b}-
encoding = 0001001 1 + {a,b}-encoding
So RLE saves 2 bits to Huffman, because it is not a prefix-code. In fact it does not map symbol -> bits uniquely, as Huffman, but the mapping may actually change and, moreover, it uses fractions of bits.
Idea on Huffman?
Goal: Reduce the impact of the +1 bit
Solution: Divide the text into blocks of k symbols The +1 is spread over k symbols So the loss is 1/k per symbol
Caution: Alphabet = k, preamble gets larger.
At the limit, preamble = 1 block equal to the input text, and compressed text is 1 bit only.
No compression!
Document Compression
Arithmetic coding
Introduction
It uses “fractional” parts of bits!!
Gets nH(T) + 2 bits vs. nH(T)+n of Huffman
Used in JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer implementation is not too bad.
Ideal performance. In practice, it is 0.02 * n
Symbol interval
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).
e.g.
a = .2
c = .3
b = .5
cum[c] = p(a)+p(b) = .7
cum[b] = p(a) = .2
cum[a] = .0
The interval for a particular symbol will be calledthe symbol interval (e.g for b it is [.2,.7))
Sequence interval
Coding the message sequence: bac
The final sequence interval is [.27,.3)
a = .2
c = .3
b = .5
0.0
0.2
0.7
1.0
a
c
b
0.2
0.3
0.55
0.7
a
c
b
0.2
0.22
0.27
0.3(0.7-0.2)*0.3=0.15
(0.3-0.2)*0.5 = 0.05
(0.3-0.2)*0.3=0.03
(0.3-0.2)*0.2=0.02(0.7-0.2)*0.2=0.1
(0.7-0.2)*0.5 = 0.25
The algorithm
To code a sequence of symbols with probabilities
pi (i = 1..n) use the following algorithm:
0
1
0
0
l
s iiii
iii
Tcumsll
Tpss
*11
1 *
p(a) = .2
p(c) = .3
p(b) = .5
0.27
0.2
0.3
2.0
1.0
1
1
i
i
l
s
03.03.0*1.0 is
27.0)5.02.0(*1.02.0 il
The algorithm
Each message narrows the interval by a factor p[Ti]
Final interval size is
1
0
0
0
s
l
n
iin Tps
1
iii
iiii
Tpss
Tcumsll
*1
*11
Sequence interval[ ln , ln + sn )
Take a number inside
Decoding Example
Decoding the number .49, knowing the message is of length 3:
The message is bbc.
a = .2
c = .3
b = .5
0.0
0.2
0.7
1.0
a
c
b
0.2
0.3
0.55
0.7
a
c
b
0.3
0.35
0.475
0.55
0.490.49
0.49
How do we encode that number?
If x = v/2k (dyadic fraction) then the
encoding is equal to bin(v) over k digits (possibly padded with 0s in front)
0111.16/7
11.4/3
01.3/1
How do we encode that number?
Binary fractional representation:
FractionalEncode(x)1. x = 2 * x2. If x < 1 output 03. else {output 1; x = x - 1; }
.... 54321 bbbbb
...2222 44
33
22
11 bbbb
01.3/1
2 * (1/3) = 2/3 < 1, output 0
2 * (2/3) = 4/3 > 1, output 1 4/3 – 1 = 1/3Incremental Generation
Which number do we encode?
Truncate the encoding to the first d = log2 (2/sn) bits
Truncation gets a smaller number… how much smaller?
Truncation Compression
d
i
id
i
id
i
ididb
222212
11
)(
1
)(
2222
log2
log 22nss snn
ln + sn
ln
ln + sn/2
....... 32154321 dddd bbbbbbbbbx 0∞
Bound on code length
Theorem: For a text T of length n, the Arithmetic
encoder generates at most
log2 (2/sn) < 1 + log2 2/sn = 1 + (1 - log2 sn)
= 2 - log2 (∏ i=1,n p(Ti))
= 2 - log2 (∏ [p()occ()])
= 2 - ∑ occ() * log2 p()
≈ 2 + ∑ ( n*p() ) * log2 (1/p())
= 2 + n H(T) bits
T = acabc
sn = p(a) *p(c) *p(a) *p(b) *p(c)
= p(a)2 * p(b) * p(c)2
Document Compression
Dictionary-based compressors
LZ77
Algorithm’s step: Output <dist, len, next-char> Advance by len + 1
A buffer “window” has fixed length and moves
a a c a a c a b c a a a a a aDictionary
(all substrings starting here)
<6,3,a>
<3,4,c>a a c a a c a b c a a a a a a c
a c
a c
LZ77 Decoding
Decoder keeps same dictionary window as encoder. Finds substring and inserts a copy of it
What if l > d? (overlap with text to be compressed) E.g. seen = abcd, next codeword is (2,9,e) Simply copy starting at the cursor
for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats(0, position, length) or (1,char)
Typically uses the second format if length < 3.
Special greedy: possibly use shorter match so that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
You find this at: www.gzip.org/zlib/
Google’s solution