
Index construction: Compression of documents

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79

Raw docs are needed

Various Approaches
- Statistical coding: Huffman codes, Arithmetic codes
- Dictionary coding: LZ77, LZ78, LZSS, ... (gzip, zippy, snappy, ...)
- Text transforms: Burrows-Wheeler Transform (bzip)

Basics of Data Compression


Uniquely Decodable Codes

A variable-length code assigns a bit string (codeword) of variable length to every symbol,

e.g. a = 1, b = 01, c = 101, d = 011.

What if you get the sequence 1011? It can be parsed as 1·011 = ad or as 101·1 = ca, so this code is ambiguous.

With a uniquely decodable code, every encoded sequence can be decomposed into its codewords in exactly one way.

Prefix Codes

A prefix code is a variable-length code in which no codeword is a prefix of another one,

e.g. a = 0, b = 100, c = 101, d = 11.

It can be viewed as a binary trie with edges labeled 0 and 1: a is the leaf reached by 0, d by 11, b by 100 and c by 101.

Average Length

For a code C over a source S, with codeword lengths L[s], the average length is defined as

La(C) = Σ_{s∈S} p(s) · L[s]

Example: p(A) = .7 (codeword 0), p(B) = p(C) = p(D) = .1 (3-bit codewords 1--):
La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits)

We say that a prefix code C is optimal if La(C) ≤ La(C') for all prefix codes C'.
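To check the numbers above, a minimal Python sketch (the probabilities and codeword lengths are just the example's):

    # Average codeword length: La(C) = sum over s of p(s) * L[s]
    probs   = {'A': 0.7, 'B': 0.1, 'C': 0.1, 'D': 0.1}
    lengths = {'A': 1,   'B': 3,   'C': 3,   'D': 3}   # e.g. A=0, B=100, C=101, D=110

    La = sum(probs[s] * lengths[s] for s in probs)
    print(La)   # ≈ 1.6 bits per symbol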

Entropy (Shannon, 1948)

For a source S emitting symbols with probability p(s), the self-information of s is

i(s) = log2 (1 / p(s))   bits

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log2 (1 / p(s))

The 0-th order empirical entropy of a string T uses the empirical frequencies occ(s)/|T| as probabilities:

H0(T) = Σ_{s∈Σ} (occ(s) / |T|) · log2 (|T| / occ(s))

Note that 0 ≤ H ≤ log2 |Σ|: H → 0 for a skewed distribution, H is maximal for the uniform distribution.
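A minimal Python sketch of the 0-th order empirical entropy (the input string is just an illustration):

    from collections import Counter
    from math import log2

    def H0(T):
        """0-th order empirical entropy of T, in bits per symbol."""
        n = len(T)
        return sum((c / n) * log2(n / c) for c in Counter(T).values())

    print(H0("aaaaaaabbcd"))   # skewed string: about 1.49 bits, below log2(4) = 2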

Performance: Compression Ratio

Compression ratio = #bits in output / #bits in input

Compression performance: We relate entropy against compression ratio.

p(A) = .7, p(B) = p(C) = p(D) = .1

H ≈ 1.36 bits; Huffman ≈ 1.5 bits per symbol.

Average codeword length — Shannon: H(S); in practice: Σ_s p(s) · |c(s)|.
Empirical entropy vs. compression ratio — Shannon: H0(T); in practice: |C(T)| / |T|, i.e. compare |T| · H0(T) with |C(T)|.

An optimal code is surely one that…

Document Compression

Huffman coding

Huffman Codes

Invented by Huffman as a class assignment in the early '50s.

Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, ...

Properties:
- Generates optimal prefix codes
- Cheap to encode and decode
- La(Huff) = H if the probabilities are powers of 2; otherwise La(Huff) < H + 1, i.e. less than 1 extra bit per symbol on average!

Running Example

p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Huffman repeatedly merges the two least-probable nodes: a(.1) + b(.2) → (.3), then (.3) + c(.2) → (.5), then (.5) + d(.5) → (1).

Labeling the edges with 0/1 gives a = 000, b = 001, c = 01, d = 1.

There are 2^(n-1) "equivalent" Huffman trees (the 0/1 labels of each internal node can be swapped).

What about ties (and thus, tree depth)?

Encoding and Decoding

Encoding: emit the root-to-leaf path leading to the symbol to be encoded.

Decoding: start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

Using the tree above (a = 000, b = 001, c = 01, d = 1): abc... encodes to 000 001 01...; the bits 1 01 001... decode to dcb...
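A minimal Python sketch of the whole pipeline — building the code with a heap of subtrees, then encoding and decoding. The symbol set is the running example's; tie-breaking is arbitrary, so only the codeword lengths are guaranteed to match the slide.

    import heapq

    def huffman_codes(freq):
        """Build a Huffman code by repeatedly merging the two least-frequent subtrees."""
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)                      # lightest subtree
            f2, _, c2 = heapq.heappop(heap)                      # second lightest
            merged = {s: "0" + c for s, c in c1.items()}         # prepend 0 in one subtree ...
            merged.update({s: "1" + c for s, c in c2.items()})   # ... and 1 in the other
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    codes = huffman_codes({'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5})
    encode = lambda text: "".join(codes[s] for s in text)

    def decode(bits):
        inv, out, cur = {v: k for k, v in codes.items()}, [], ""
        for b in bits:
            cur += b
            if cur in inv:          # reached a leaf: emit its symbol, return to the root
                out.append(inv[cur])
                cur = ""
        return "".join(out)

    print(codes)                    # d gets 1 bit, c gets 2 bits, a and b get 3 bits
    print(decode(encode("abcd")))   # 'abcd'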

Huffman in Practice

The compressed file of n symbols consists of:
- Preamble: tree encoding + symbols in the leaves = Θ(|Σ| log |Σ|) bits
- Body: the compressed text of n symbols, at least nH bits and at most nH + n bits

The extra +n is bad for very skewed distributions, namely ones for which H → 0.

Example: p(a) = (n-1)/n, p(b) = 1/n.

There are better choices: T = aaaaaaaaab

Huffman = {a,b}-encoding + 10 bits
RLE = <a,9> <b,1> = γ(9) + γ(1) + {a,b}-encoding = 0001001 · 1 + {a,b}-encoding, i.e. 8 bits + {a,b}-encoding

So RLE saves 2 bits over Huffman, because it is not a prefix code: unlike Huffman, it does not map each symbol to a fixed bit string; the mapping may change along the text and, moreover, it effectively uses fractions of bits.

Fax, bzip, ... use RLE.
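A tiny sketch of the idea, assuming Elias γ-codes for the run lengths (the helper names are illustrative):

    def gamma_code(x):
        """Elias gamma code of a positive integer: len(bin(x))-1 zeros, then bin(x)."""
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def rle_gamma(text):
        """Run-length encode text as gamma-coded run lengths (symbols sent separately)."""
        runs, bits, i = [], [], 0
        while i < len(text):
            j = i
            while j < len(text) and text[j] == text[i]:
                j += 1
            runs.append((text[i], j - i))
            bits.append(gamma_code(j - i))
            i = j
        return runs, "".join(bits)

    print(rle_gamma("aaaaaaaaab"))
    # ([('a', 9), ('b', 1)], '00010011') -> 8 bits, vs. 10 bits with Huffman over {a,b}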

Idea on Huffman?

Goal: reduce the impact of the +1 bit.

Solution: divide the text into blocks of k symbols; the +1 is spread over k symbols, so the loss is 1/k per symbol.

Caution: the alphabet becomes Σ^k, so the tree, and hence the preamble, gets larger. At the limit the preamble contains a single k-gram, the input text itself, and the compressed text is 1 bit only.

This means no compression at all!

Document Compression

Arithmetic coding

Introduction

It uses "fractional" parts of bits!!

Gets < nH(T) + 2 bits vs. < nH(T) + n (Huffman)

Used in JPEG/MPEG (as an option) and in bzip.

More time-costly than Huffman, but an integer implementation is not too bad.

Ideal performance in theory; in practice the overhead is about 0.02 · n bits.

Symbol Interval

Assign each symbol an interval in [0, 1) whose length equals its probability, e.g. with p(a) = .2, p(b) = .5, p(c) = .3:

cum[a] = 0
cum[b] = p(a) = .2
cum[c] = p(a) + p(b) = .7

so a ↦ [0, .2), b ↦ [.2, .7), c ↦ [.7, 1).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7)).

Sequence Interval

Coding the message sequence: bac

Start from [0, 1). Coding b narrows it to b's symbol interval [.2, .7), of width (1 - 0) · .5 = .5. Coding a then selects the a-portion of that interval, [.2, .2 + .5 · .2) = [.2, .3), of width .5 · .2 = .1. Coding c finally selects [.2 + .1 · .7, .3) = [.27, .3), of width .1 · .3 = .03.

The final sequence interval is [.27, .3).

The Algorithm

To code a sequence of symbols T_1 ... T_n with probabilities p(T_i), use the following recurrence:

s_0 = 1,  l_0 = 0
l_i = l_{i-1} + s_{i-1} · cum[T_i]
s_i = s_{i-1} · p(T_i)

Example (p(a) = .2, p(b) = .5, p(c) = .3, message bac): after coding b and a we have l_2 = 0.2 and s_2 = 0.1; coding c gives

s_3 = 0.1 · 0.3 = 0.03
l_3 = 0.2 + 0.1 · (0.2 + 0.5) = 0.27

The Algorithm

Each symbol narrows the interval by a factor p(T_i). The final interval size is

s_n = ∏_{i=1..n} p(T_i)

with s_0 = 1, l_0 = 0, l_i = l_{i-1} + s_{i-1} · cum[T_i], s_i = s_{i-1} · p(T_i).

The sequence interval is [ l_n, l_n + s_n ). Take a number inside it.
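A minimal Python sketch of this interval computation, with the example distribution (real coders use integer arithmetic and renormalization, which this sketch omits):

    p   = {'a': 0.2, 'b': 0.5, 'c': 0.3}
    cum = {'a': 0.0, 'b': 0.2, 'c': 0.7}    # cum[s] = sum of p over the symbols before s

    def sequence_interval(msg):
        """Return (l_n, l_n + s_n), the interval encoding msg."""
        l, s = 0.0, 1.0
        for t in msg:
            l = l + s * cum[t]   # move to the sub-interval of t
            s = s * p[t]         # shrink by a factor p(t)
        return l, l + s

    print(sequence_interval("bac"))   # ≈ (0.27, 0.3), as in the worked example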

Decoding Example

Decoding the number .49, knowing the message is of length 3:

.49 falls in [.2, .7) → b; within [.2, .7) it falls in the b-portion [.3, .55) → b; within [.3, .55) it falls in the c-portion [.475, .55) → c.

The message is bbc.
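A matching decoder sketch, again with the example distribution and plain floating point, for illustration only:

    p   = {'a': 0.2, 'b': 0.5, 'c': 0.3}
    cum = {'a': 0.0, 'b': 0.2, 'c': 0.7}

    def decode(x, n):
        """Recover the n-symbol message whose sequence interval contains x."""
        out, l, s = [], 0.0, 1.0
        for _ in range(n):
            for t in p:                       # find the symbol whose sub-interval contains x
                lo = l + s * cum[t]
                hi = lo + s * p[t]
                if lo <= x < hi:
                    out.append(t)
                    l, s = lo, s * p[t]       # narrow the interval, as the encoder did
                    break
        return "".join(out)

    print(decode(0.49, 3))   # 'bbc'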

How do we encode that number?

If x = v/2^k (a dyadic fraction), then the encoding is bin(v) over k digits (possibly padded with 0s in front):

3/4 = .11, 7/16 = .0111

A non-dyadic number such as 1/3 = .0101... has an infinite expansion.

How do we encode that number?

Binary fractional representation: x = .b1 b2 b3 b4 b5 ... = b1/2 + b2/2^2 + b3/2^3 + b4/2^4 + ...

FractionalEncode(x):
1. x = 2 * x
2. if x < 1, output 0
3. else output 1 and set x = x - 1
(repeat from step 1 for the next bit)

Example, 1/3 = .0101...:

2 * (1/3) = 2/3 < 1, output 0
2 * (2/3) = 4/3 ≥ 1, output 1; 4/3 - 1 = 1/3, so the generation repeats (incremental generation).
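A runnable version of FractionalEncode that emits the first k bits of the expansion (k and the use of Fraction are illustrative choices that keep the 1/3 example exact):

    from fractions import Fraction

    def fractional_encode(x, k):
        """First k bits of the binary fractional expansion of x in [0, 1)."""
        bits = []
        for _ in range(k):
            x = 2 * x
            if x < 1:
                bits.append("0")
            else:
                bits.append("1")
                x = x - 1
        return "".join(bits)

    print(fractional_encode(Fraction(1, 3), 8))   # '01010101'
    print(fractional_encode(Fraction(7, 16), 4))  # '0111'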

Which number do we encode?

Take x = l_n + s_n/2, the middle of the sequence interval, and truncate its encoding to the first d = ⌈log2 (2/s_n)⌉ bits.

Truncation gets a smaller number... how much smaller?

Truncation vs. compression: writing x = .b1 b2 ... bd b(d+1) b(d+2) ..., dropping all bits after position d decreases x by

Σ_{i>d} b_i · 2^-i ≤ Σ_{i>d} 2^-i = 2^-d ≤ 2^(-log2 (2/s_n)) = s_n / 2

so the truncated number is still ≥ l_n, i.e. it still lies inside [ l_n, l_n + s_n ).

Bound on Code Length

Theorem: for a text T of length n, the arithmetic encoder generates at most ⌈log2 (2/s_n)⌉ bits, and

⌈log2 (2/s_n)⌉ < 1 + log2 (2/s_n) = 1 + (1 - log2 s_n)
= 2 - log2 ∏_{i=1..n} p(T_i) = 2 - log2 ∏_σ p(σ)^occ(σ) = 2 - Σ_σ occ(σ) · log2 p(σ)
= 2 + Σ_σ occ(σ) · log2 (1/p(σ)) ≈ 2 + Σ_σ (n · p(σ)) · log2 (1/p(σ)) = 2 + n · H(T) bits

Example: T = acabc, so s_n = p(a) · p(c) · p(a) · p(b) · p(c) = p(a)^2 · p(b) · p(c)^2.

Document Compression

Dictionary-based compressors

LZ77

Algorithm's step: output <dist, len, next-char>, then advance by len + 1.

A buffer "window" of fixed length slides over the text; the dictionary consists of all the substrings starting inside it.

Example text: a a c a a c a b c a a a a a a; two of the emitted triples are <6,3,a> and <3,4,c>.
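A minimal LZ77-style parser sketch in Python (unbounded window and greedy longest match; real implementations bound the window and use hashing, and the exact triples depend on those choices):

    def lz77_compress(text):
        """Greedy LZ77 parse: emit (dist, len, next_char) triples."""
        out, i = [], 0
        while i < len(text):
            best_len, best_dist = 0, 0
            for j in range(i - 1, -1, -1):          # candidate match starts in the dictionary
                k = 0
                # a match may run past position i, overlapping the text still to be encoded
                while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                    k += 1
                if k > best_len:
                    best_len, best_dist = k, i - j
            out.append((best_dist, best_len, text[i + best_len]))
            i += best_len + 1
        return out

    print(lz77_compress("aacaacabcaaaaaa"))
    # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (6, 3, 'a'), (1, 2, 'a')]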

LZ77 Decoding

The decoder keeps the same dictionary window as the encoder: it finds the referenced substring and inserts a copy of it.

What if len > dist? (the copy overlaps the text still to be written) E.g. seen = abcd, next codeword is (2,9,e). Simply copy one character at a time, starting at the cursor:

for (i = 0; i < len; i++) out[cursor + i] = out[cursor - dist + i];

The output is correct: abcdcdcdcdcdce
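The same idea as a small Python decoder (triple format as in the parser sketch above):

    def lz77_decompress(triples):
        """Rebuild the text from (dist, len, next_char) triples, copying left to right."""
        out = []
        for dist, length, next_char in triples:
            cursor = len(out)
            for i in range(length):
                out.append(out[cursor - dist + i])   # works even when length > dist (overlap)
            out.append(next_char)
        return "".join(out)

    print(lz77_decompress([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (6, 3, 'a'), (1, 2, 'a')]))
    # 'aacaacabcaaaaaa'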

LZ77 Optimizations Used by gzip

LZSS: output one of the following formats: (0, position, length) or (1, char). Typically the second format is used if the length is < 3.

Special greedy: possibly use a shorter match so that the next match is better.

A hash table speeds up the search for matches (indexed on triplets of characters).

Triples are coded with Huffman codes.

You find this at: www.gzip.org/zlib/

Google’s solution