Zone indexes. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading 6.1.


Page 1

Zone indexes

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading 6.1

Page 2

Parametric and zone indexes

Thus far, a document has been treated as a sequence of terms

But documents have multiple parts: author, title, date of publication, language, format, etc.

These fields are the metadata about a document

Sec. 6.1

Page 3

Zone

A zone is a region of the document that can contain an arbitrary amount of text, e.g.:

Title, Abstract, References, …

Build inverted indexes on fields AND zones to permit querying

E.g., “find docs with merchant in the title zone and matching the query gentle rain”

Sec. 6.1

Page 4

Example zone indexes

Encode zones in dictionary vs. postings.

Sec. 6.1
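
The query on the previous slide needs per-zone information in the index. Below is a minimal Python sketch of the two encodings named on this slide (zones in the dictionary vs. zones in the postings); the docIDs, zone names, and helper function are invented for illustration, not taken from the slides.

```python
# A minimal sketch (not the slides' code) of the two ways to encode zones.

# (1) Zones in the dictionary: one postings list per (term, zone) pair.
dict_index = {
    ("merchant", "title"): [2, 4, 9],        # docIDs whose title contains "merchant"
    ("merchant", "body"):  [1, 2, 3, 5, 9],
}

# (2) Zones in the postings: one list per term, each posting carries its zones.
post_index = {
    "merchant": [(1, {"body"}), (2, {"title", "body"}), (4, {"title"}), (9, {"title", "body"})],
}

def docs_with_term_in_zone(term, zone):
    """Answer 'term in zone' with the zones-in-postings encoding."""
    return [doc for doc, zones in post_index.get(term, []) if zone in zones]

print(docs_with_term_in_zone("merchant", "title"))   # -> [2, 4, 9]
```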

Page 5

Tiered indexes

Break postings up into a hierarchy of lists: most important, …, least important

Inverted index thus broken up into tiers of decreasing importance

At query time, use the top tier unless it fails to yield K docs

If so, drop to lower tiers

Sec. 7.2.1
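
A minimal sketch of the fall-through logic just described; the tier contents, scores, and K are invented for illustration.

```python
# Hypothetical tiers for one query, ordered from most to least important.
tiers = [
    [("d7", 0.9), ("d2", 0.8)],                  # tier 1
    [("d5", 0.6), ("d9", 0.5), ("d1", 0.4)],     # tier 2
    [("d3", 0.2)],                               # tier 3
]

def top_k(tiers, K):
    """Use the top tier; drop to lower tiers only until K docs are collected."""
    results = []
    for tier in tiers:
        results.extend(tier)
        if len(results) >= K:
            break
    return results[:K]

print(top_k(tiers, 3))   # tier 1 yields only 2 docs, so tier 2 is used as well
```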

Page 6

Example tiered index

Sec. 7.2.1

Page 7

Index construction: Compression of postings

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading 5.3 and a paper

Page 8

γ code for integer encoding

x > 0 and Length = $\lfloor \log_2 x \rfloor + 1$

e.g., 9 is represented as <000,1001>.

The γ code for x takes $2\lfloor \log_2 x \rfloor + 1$ bits (i.e., a factor of 2 from optimal)

Codeword layout: (Length − 1) zeros, followed by x in binary

Optimal for $\Pr(x) = \frac{1}{2x^2}$ and i.i.d. integers
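
A short sketch of the γ encoder defined above ((Length − 1) zeros followed by x in binary); it reproduces the slide's example for 9.

```python
def gamma_encode(x: int) -> str:
    """gamma code: (Length - 1) zeros followed by x in binary, Length = floor(log2 x) + 1."""
    assert x > 0
    binary = bin(x)[2:]                     # x in binary; Length = len(binary)
    return "0" * (len(binary) - 1) + binary

print(gamma_encode(9))    # 0001001, i.e. <000,1001>
```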

Page 9

γ is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

8 6 3 59 7
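
A matching γ decoder (same convention as the encoder sketch above); it reproduces this exercise.

```python
def gamma_decode(bits: str):
    """Decode a concatenation of gamma codewords."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                        # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))    # binary part: Length bits
        i += zeros + 1
    return out

print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]
```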

Page 10

δ code for integer encoding

Use γ-coding to reduce the length of the first field

Useful for medium-sized integers

e.g., 19 is represented as <00,101,10011>.

δ coding of x takes about $\log_2 x + 2\log_2(\log_2 x) + 2$ bits.

Codeword layout: γ(Length), followed by x in binary

Optimal for $\Pr(x) = \frac{1}{2x(\log x)^2}$ and i.i.d. integers
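
A sketch of the δ encoder following the slide's layout, i.e., γ(Length) followed by the full binary representation of x; it reproduces the example for 19.

```python
def gamma_encode(x: int) -> str:
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary

def delta_encode(x: int) -> str:
    """delta code, following the slide's layout: gamma(Length) followed by x in binary."""
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary

print(delta_encode(19))   # 0010110011, i.e. <00,101,10011>
```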

Page 11

Variable-byte codes [10.2 bits per TREC12]

We wish to get very fast (de)compression ⇒ byte-align.

Given the binary representation of an integer:
• Prepend 0s to get a multiple-of-7 number of bits
• Form groups of 7 bits each
• Prepend to the last group the bit 0, and to the other groups the bit 1 (tagging)

e.g., v = 2^14 + 1, binary(v) = 100000000000001 → 10000001 10000000 00000001

Note: we waste 1 bit per byte, and on average 4 bits in the first byte.

But it is a prefix code, and it also encodes the value 0!
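
A sketch of this variable-byte scheme (7-bit groups, tag bit 1 on every byte except the last); it reproduces the example for v = 2^14 + 1.

```python
def vb_encode(v: int) -> bytes:
    """Variable-byte code: 7-bit groups, most-significant first, tag bit 1 on all but the last byte."""
    groups = []
    while True:
        groups.append(v & 0x7F)          # low 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()                     # most-significant group first
    return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])

def vb_decode(data: bytes):
    out, cur = [], 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:             # tag bit 0 marks the last byte of a value
            out.append(cur)
            cur = 0
    return out

code = vb_encode(2**14 + 1)
print([format(b, "08b") for b in code])   # ['10000001', '10000000', '00000001']
print(vb_decode(code))                    # [16385]
```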

Page 12

PForDelta coding

[Figure: a block of values, most of them encoded with b = 2 bits, with exceptions such as 23 and 42 handled separately]

A block of 128 numbers = 256 bits = 32 bytes (with b = 2)

Use b (e.g., 2) bits to encode each of the 128 numbers, or create exceptions

Encode exceptions: ESC or pointers

Choose b to encode 90% of the values, or trade off: larger b wastes more bits, smaller b creates more exceptions

Translate data: $[base,\ base + 2^b - 1] \rightarrow [0,\ 2^b - 1]$
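
A simplified PForDelta-style sketch: it picks b so that roughly 90% of the values fit in b bits and stores the rest as exceptions. Real implementations bit-pack the b-bit slots and patch exceptions via escapes or pointers; the plain Python lists, helper names, and sample values below are illustrative only (the base is assumed to be 0).

```python
def pfordelta_encode(values, coverage=0.90):
    """Pick b so that ~90% of the values fit in b bits; larger values become exceptions."""
    ordered = sorted(values)
    idx = max(0, int(coverage * len(values)) - 1)
    b = max(1, ordered[idx].bit_length())
    slots, exceptions = [], []
    for i, v in enumerate(values):
        if v < (1 << b):
            slots.append(v)               # fits in b bits
        else:
            slots.append(0)               # placeholder (real codecs store an escape or pointer)
            exceptions.append((i, v))
    return b, slots, exceptions

def pfordelta_decode(b, slots, exceptions):
    out = list(slots)
    for i, v in exceptions:
        out[i] = v
    return out

vals = [2, 3, 3, 1, 1, 3, 3, 2, 1, 23, 3, 2, 1, 3, 2, 1, 2, 3, 42, 2]   # mostly 2-bit values
b, slots, exc = pfordelta_encode(vals)
assert pfordelta_decode(b, slots, exc) == vals
print(b, exc)    # 2 [(9, 23), (18, 42)]
```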

Page 13

Index construction: Compression of documents

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79

Page 14

Uniquely Decodable Codes

A variable-length code assigns a bit string (codeword) of variable length to every symbol

e.g., a = 1, b = 01, c = 101, d = 011

What if you get the sequence 1011? (It can be parsed both as c·a and as a·d.)

A code is uniquely decodable if every bit sequence can be decomposed into codewords in only one way.

Page 15

Prefix Codes

A prefix code is a variable length code in which no codeword is a prefix of another one

e.g., a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie.

[Figure: the binary trie of this code, with edges labeled 0/1 and leaves a, b, c, d]

Page 16

Average Length

For a code C with codeword length L[s], the average length is defined as

$L_a(C) = \sum_{s \in S} p(s) \cdot L[s]$

e.g., p(A) = .7 [codeword 0], p(B) = p(C) = p(D) = .1 [3-bit codewords 1··]

La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits)

We say that a prefix code C is optimal if for all prefix codes C', $L_a(C) \le L_a(C')$

Page 17

Entropy (Shannon, 1948)

For a source S emitting symbols with probability p(s), the self-information of s is

$i(s) = \log_2 \frac{1}{p(s)}$ bits

Lower probability → higher information.

Entropy is the weighted average of i(s):

$H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}$

0-th order empirical entropy of a string T:

$H_0(T) = \sum_{s} \frac{occ(s)}{|T|} \log_2 \frac{|T|}{occ(s)}$
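
A tiny numerical check of these formulas; the probabilities reused here are the ones that appear on the next slide, and the sample string is the one used later for arithmetic coding.

```python
from math import log2
from collections import Counter

def self_information(p):             # i(s) = log2(1/p(s))
    return log2(1 / p)

def entropy(probs):                  # H(S) = sum_s p(s) * log2(1/p(s))
    return sum(p * log2(1 / p) for p in probs.values())

def empirical_entropy(T):            # H_0(T) = sum_s occ(s)/|T| * log2(|T|/occ(s))
    occ, n = Counter(T), len(T)
    return sum(c / n * log2(n / c) for c in occ.values())

print(round(entropy({"A": .7, "B": .1, "C": .1, "D": .1}), 2))   # 1.36 (cf. next slide)
print(round(empirical_entropy("acabc"), 2))                      # H_0 of the string acabc
```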

Page 18

Performance: Compression ratio

Compression ratio =

#bits in output / #bits in input

Compression performance: We relate entropy against compression ratio.

p(A) = .7, p(B) = p(C) = p(D) = .1

H ≈ 1.36 bits

Huffman ≈ 1.5 bits per symbol

Shannon (entropy) vs. practice (average codeword length): $H(S) = \sum_{s} p(s) \log_2 \frac{1}{p(s)}$ vs. $\sum_{s} p(s)\,|c(s)|$

Empirical entropy vs. compression: $|T| \cdot H_0(T)$ vs. $|C(T)|$

Page 19

Statistical Coding

How do we use probability p(s) to encode s?

Huffman codes

Arithmetic codes

Page 20

Document Compression

Huffman coding

Page 21

Huffman Codes

Invented by Huffman as a class assignment in the '50s.

Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, …

Properties:
• Generates optimal prefix codes
• Cheap to encode and decode
• La(Huff) = H if probabilities are powers of 2

Otherwise, La(Huff) < H + 1, i.e., less than 1 extra bit per symbol on average!

Page 22

Running Example

p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

a = 000, b = 001, c = 01, d = 1. There are $2^{n-1}$ "equivalent" Huffman trees.

[Figure: the Huffman tree, with leaves a(.1), b(.2), c(.2), d(.5), internal nodes of weight (.3), (.5), (1), and edges labeled 0/1]

What about ties (and thus, tree depth)?
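
A compact sketch of Huffman's construction for this running example. Tie-breaking is arbitrary, so the printed codewords may correspond to any of the "equivalent" trees; the codeword lengths, however, match the slide (3, 3, 2, 1).

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman code; returns {symbol: codeword}. Ties are broken arbitrarily."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)          # the two least-probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        counter += 1
        heapq.heappush(heap, (p1 + p2, counter, merged))
    return heap[0][2]

probs = {"a": .1, "b": .2, "c": .2, "d": .5}
codes = huffman_codes(probs)
print(codes)   # one of the equivalent trees: a and b get 3 bits, c gets 2, d gets 1
print(round(sum(p * len(codes[s]) for s, p in probs.items()), 2))   # average length 1.8 bits/symbol
```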

Page 23

Encoding and Decoding

Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.

Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.

[Figure: the Huffman tree of the running example, with leaves a(.1), b(.2), c(.2), d(.5)]

Encoding: abc… → 000 001 01… = 00000101…

Decoding: 101001… → d c b …

Page 24

Problem with Huffman Coding

Take a two-symbol alphabet Σ = {a, b}. Whatever their probabilities, Huffman uses 1 bit for each symbol, and thus takes n bits to encode a message of n symbols.

This is OK when the probabilities are almost the same, but what about p(a) = .999?

The optimal code for a is $\log_2 \frac{1}{.999} \approx .00144$ bits, so optimal coding should use about n × .0014 bits for the a's, which is much less than the n bits taken by Huffman.

Page 25

Document Compression

Arithmetic coding

Page 26

Introduction

It uses “fractional” parts of bits!!

Gets nH(T) + 2 bits vs. nH(T)+n of Huffman

Used in JPEG/MPEG (as option), Bzip

More time costly than Huffman, but integer implementation is not too bad.

Ideal performance; in practice, the overhead is about 0.02 × n.

Page 27

Symbol interval

Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g., p(a) = .2, p(b) = .5, p(c) = .3

cum[a] = .0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7

The interval for a particular symbol will be called the symbol interval (e.g., for b it is [.2, .7))

Page 28

Sequence interval

Coding the message sequence: bac

The final sequence interval is [.27,.3)

[Figure: nested subdivision of intervals. [0, 1) splits into a: [0, .2), b: [.2, .7), c: [.7, 1). Coding b selects [.2, .7), which splits into a: [.2, .3), b: [.3, .55), c: [.55, .7). Coding a selects [.2, .3), which splits into a: [.2, .22), b: [.22, .27), c: [.27, .3). Coding c selects [.27, .3).]

Page 29

The algorithm

To code a sequence of symbols with probabilities $p_i$ (i = 1..n), use the following algorithm:

$l_0 = 0, \qquad s_0 = 1$

$l_i = l_{i-1} + s_{i-1} \cdot \mathrm{cum}[T_i]$

$s_i = s_{i-1} \cdot p[T_i]$

Example with p(a) = .2, p(b) = .5, p(c) = .3 (the sequence bac of the previous slide); at the last step:

$s_i = 0.1 \times 0.3 = 0.03$

$l_i = 0.2 + 0.1 \times (0.2 + 0.5) = 0.27$
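
A sketch of the recurrence above, using exact fractions to avoid floating-point issues; the symbol intervals are the ones of the previous slides.

```python
from fractions import Fraction as F

p   = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
cum = {"a": F(0),     "b": F(2, 10), "c": F(7, 10)}   # cum[s] = sum of p over the preceding symbols

def sequence_interval(msg):
    """l_i = l_{i-1} + s_{i-1}*cum[T_i],  s_i = s_{i-1}*p[T_i],  starting from l_0 = 0, s_0 = 1."""
    l, s = F(0), F(1)
    for sym in msg:
        l, s = l + s * cum[sym], s * p[sym]
    return l, s

l, s = sequence_interval("bac")
print(float(l), float(l + s))    # 0.27 0.3, i.e. the sequence interval [.27, .3)
```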

Page 30

The algorithm

Each symbol narrows the interval by a factor p[T_i].

Final interval size:

$s_n = \prod_{i=1}^{n} p[T_i]$

obtained through the recurrences $l_i = l_{i-1} + s_{i-1} \cdot \mathrm{cum}[T_i]$ and $s_i = s_{i-1} \cdot p[T_i]$, with $l_0 = 0$, $s_0 = 1$.

Sequence interval: $[\,l_n,\ l_n + s_n)$

Take a number inside it.

Page 31

Decoding Example

Decoding the number .49, knowing the message is of length 3:

The message is bbc.

[Figure: .49 falls in b's interval [.2, .7); within that, in b's sub-interval [.3, .55); within that, in c's sub-interval [.475, .55). Hence bbc.]
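
A sketch of the corresponding decoder (same symbol intervals as above), reproducing this example.

```python
from fractions import Fraction as F

p   = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
cum = {"a": F(0),     "b": F(2, 10), "c": F(7, 10)}

def decode(x, n):
    """Repeatedly find the symbol interval containing x, then rescale x into it."""
    out = []
    for _ in range(n):
        for sym in p:                                   # locate the sub-interval of x
            if cum[sym] <= x < cum[sym] + p[sym]:
                out.append(sym)
                x = (x - cum[sym]) / p[sym]             # zoom into that interval
                break
    return "".join(out)

print(decode(F(49, 100), 3))    # bbc
```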

Page 32

How do we encode that number?

If x = v/2^k (a dyadic fraction), then the encoding is bin(v) over k digits (possibly padded with 0s in front)

7/16 = .0111

3/4 = .11

1/3 = .010101…

Page 33

How do we encode that number?

Binary fractional representation: $x = .b_1 b_2 b_3 b_4 b_5 \ldots = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + b_4 2^{-4} + \cdots$

FractionalEncode(x), repeated to emit one bit at a time (incremental generation):
1. x = 2 * x
2. if x < 1, output 0
3. else { output 1; x = x - 1 }

Example, 1/3 = .0101…:
2 * (1/3) = 2/3 < 1 → output 0
2 * (2/3) = 4/3 > 1 → output 1, and 4/3 - 1 = 1/3 (repeat)
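
A sketch of FractionalEncode emitting the first k bits; it reproduces the three expansions above.

```python
from fractions import Fraction as F

def fractional_encode(x, k):
    """Emit the first k bits of the binary expansion of x in [0,1): x = .b1 b2 b3 ..."""
    bits = []
    for _ in range(k):
        x = 2 * x
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x = x - 1
    return "".join(bits)

print(fractional_encode(F(7, 16), 4))   # 0111
print(fractional_encode(F(3, 4), 2))    # 11
print(fractional_encode(F(1, 3), 6))    # 010101  (incremental generation of .0101...)
```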

Page 34

Which number do we encode?

Truncate the encoding to the first $d = \lceil \log_2(2/s_n) \rceil$ bits

Truncation gets a smaller number… how much smaller?

Truncation ⇒ compression.

We pick the number $x = l_n + s_n/2$ inside the sequence interval and truncate its expansion $.b_1 b_2 \ldots b_d\, b_{d+1} b_{d+2} \ldots$ to $.b_1 b_2 \ldots b_d\, 000\ldots$

The loss due to truncation is at most

$\sum_{i > d} b_i\,2^{-i} \;\le\; \sum_{i > d} 2^{-i} \;=\; 2^{-d} \;=\; 2^{-\lceil \log_2(2/s_n) \rceil} \;\le\; 2^{-\log_2(2/s_n)} \;=\; \frac{s_n}{2}$

so the truncated number still lies in the sequence interval $[\,l_n,\ l_n + s_n)$.

Page 35

Bound on code length

Theorem: For a text T of length n, the Arithmetic encoder generates at most $2 + n\,H(T)$ bits:

$\lceil \log_2(2/s_n) \rceil < 1 + \log_2(2/s_n) = 1 + (1 - \log_2 s_n)$

$= 2 - \log_2\Big(\prod_{i=1}^{n} p(T_i)\Big)$

$= 2 - \log_2\Big(\prod_{\sigma} p(\sigma)^{occ(\sigma)}\Big)$

$= 2 - \sum_{\sigma} occ(\sigma) \cdot \log_2 p(\sigma)$

$\approx 2 + \sum_{\sigma} (n \cdot p(\sigma)) \cdot \log_2 \frac{1}{p(\sigma)} = 2 + n\,H(T)$ bits

Example: T = acabc

$s_n = p(a) \cdot p(c) \cdot p(a) \cdot p(b) \cdot p(c) = p(a)^2 \cdot p(b) \cdot p(c)^2$

Page 36

Document Compression

Dictionary-based compressors

Page 37

LZ77

Algorithm's step: output <dist, len, next-char>, then advance by len + 1

A buffer “window” has fixed length and moves

[Figure: sliding window over the text a a c a a c a b c a a a a a a c; the dictionary consists of all substrings starting inside the window; example output triples: <6,3,a> and <3,4,c>]

Page 38

LZ77 Decoding

The decoder keeps the same dictionary window as the encoder: it finds the substring and appends a copy of it.

What if len > dist? (the copy overlaps the part still being written) E.g., seen = abcd, next codeword is (2,9,e). Simply copy starting at the cursor:

for (i = 0; i < len; i++) out[cursor + i] = out[cursor - dist + i];   // left-to-right copy, so the overlap resolves itself

Output is correct: abcdcdcdcdcdce
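
A sketch of the decoding loop (the same left-to-right copy as the C-style line above), reproducing the overlap example; seeding with the already-seen text "abcd" is just for this example.

```python
def lz77_decode(seed, triples):
    """Decode <dist, len, next-char> triples; copying left to right handles overlaps (len > dist)."""
    out = list(seed)
    for dist, length, nxt in triples:
        start = len(out) - dist
        for i in range(length):              # like: out[cursor+i] = out[cursor-dist+i]
            out.append(out[start + i])
        out.append(nxt)
    return "".join(out)

print(lz77_decode("abcd", [(2, 9, "e")]))    # abcdcdcdcdcdce
```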

Page 39

LZ77 Optimizations used by gzip

LZSS: output one of the following formats: (0, position, length) or (1, char)

Typically the second format is used if length < 3.

Special greedy: possibly use a shorter match so that the next match is better.

Hash table to speed up searches on triplets.

Triples are coded with Huffman’s code

Page 40

You find this at: www.gzip.org/zlib/

Page 41

Dictionary search

Exact string search

Paper on Cuckoo Hashing

Page 42

Exact String Search

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them.

Hashing?

Page 43

Hashing with chaining

Page 44

Key issue: a good hash function

Basic assumption: Uniform hashing

Avg #keys per slot = n × (1/m) = n/m = α (the load factor)

Page 45

Search cost

m = Θ(n) ⇒ α = O(1), so the expected search cost is O(1)

Page 46

In practice

A trivial hash function is: take the key modulo a prime number.

Page 47

A "provably good" hash is

$h_a(K) = \Big(\sum_{i=0}^{r} a_i \cdot k_i\Big) \bmod p$, with p prime; not necessarily followed by (… mod p) mod m

where the key K is viewed as chunks $k_0\,k_1\,k_2 \ldots k_r$ of about $\log_2 m$ bits each, so $r \approx L / \log_2 m$ (L = max string length, m = table size).

Each $a_i$ is selected at random in [0, m).
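
A sketch of this hash family. Chunking by bytes, the specific prime, and the table size are illustrative choices of mine, not the slide's parameters.

```python
import random

M = 2**20                      # table size m (illustrative)
P = 2**31 - 1                  # a prime p (illustrative)
R = 64                         # max number of chunks supported by this sketch
A = [random.randrange(M) for _ in range(R)]   # each a_i chosen at random in [0, M)

def h(key: str) -> int:
    """h_a(K) = (sum_i a_i * k_i) mod p, with the key split into small chunks k_i."""
    chunks = key.encode()      # one byte per chunk here, a stand-in for ~log2(m)-bit pieces
    return sum(a * k for a, k in zip(A, chunks)) % P

print(h("merchant") % M)       # optionally fold into [0, m) with a final mod m
```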

Page 48

Cuckoo Hashing

[Figure: two hash tables, holding items A, B, C and E, D]

2 hash tables, and 2 random choices where an item can be stored

Page 49

[Figure: the two tables holding A, B, C, D, E; a new item F arrives]

A running example

Page 50

[Figure: the tables after inserting F]

A running example

Page 51

[Figure: a new item G arrives]

A running example

Page 52

[Figure: the tables after inserting G, which caused a chain of evictions]

A running example

Page 53

Cuckoo Hashing Examples

[Figure: the cells of the two tables and the keys A, …, G drawn as a graph]

Random (bipartite) graph: node = cell, edge = key
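
A sketch of cuckoo insertion with two tables and two choices per key. The hash functions are illustrative (they handle only one-character keys, enough for the A…G example), and a failed insertion would require a rehash in a real implementation.

```python
SIZE = 11
h1 = lambda k: (7 * ord(k)) % SIZE          # two illustrative hash functions
h2 = lambda k: (3 * ord(k) + 1) % SIZE      # (single-character keys only, for this example)
T1, T2 = [None] * SIZE, [None] * SIZE

def lookup(key):
    """A key can live in only two cells: T1[h1(key)] or T2[h2(key)]."""
    return T1[h1(key)] == key or T2[h2(key)] == key

def insert(key, max_kicks=32):
    table, pos = T1, h1(key)
    for _ in range(max_kicks):
        if table[pos] is None:
            table[pos] = key
            return True
        key, table[pos] = table[pos], key          # evict the resident key (the "cuckoo" step)
        if table is T1:                            # re-insert the evicted key in the other table
            table, pos = T2, h2(key)
        else:
            table, pos = T1, h1(key)
    return False                                   # too many evictions: a rehash would be needed

for k in "ABCDEFG":
    insert(k)
print(all(lookup(k) for k in "ABCDEFG"))           # True
```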

Page 54

Natural Extensions

More than 2 hashes (choices) per key:
• Very different: hypergraphs instead of graphs. Higher memory utilization:
  3 choices: 90+% in experiments; 4 choices: about 97%
• …but more insert time (and more random accesses)

2 hashes + bins of size B:
• Balanced allocation and tightly O(1)-size bins
• Insertion sees a tree of possible evict+insert paths
• …more memory, but more local

Page 55

Dictionary search

Prefix-string search

Reading 3.1 and 5.2

Page 56

Prefix-string Search

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.

Page 57

Trie: speeding-up searches

[Figure: a compacted trie over strings such as systile, syzygetic, syzygial, syzygy, szaibelyite, …; edges carry substring labels and nodes store offsets]

Pro: O(p) search time

Cons: space for the edge and node labels and for the tree structure
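
A sketch of prefix search on a plain (uncompacted) trie over some of the strings in the figure; a compacted trie would label edges with substrings instead of single characters.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})   # one edge per character (uncompacted trie)
        node["$"] = {}                       # end-of-string marker
    return root

def prefix_search(root, prefix):
    """Walk down O(|prefix|) nodes, then list all strings below."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [(node, prefix)]
    while stack:
        node, acc = stack.pop()
        for ch, child in node.items():
            if ch == "$":
                out.append(acc)
            else:
                stack.append((child, acc + ch))
    return sorted(out)

D = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite"]
trie = build_trie(D)
print(prefix_search(trie, "syzyg"))   # ['syzygetic', 'syzygial', 'syzygy']
```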

Page 58

Front-coding: squeezing strings

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...

45%

….systile syzygetic syzygial syzygy…. (front-coded with prefix lengths 2, 5, 5)

Gzip may be much better...
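
A sketch of front coding as used above: each string is stored as the length of the prefix shared with its predecessor plus the remaining suffix. It reproduces the 2, 5, 5 example.

```python
def front_encode(sorted_strings):
    """Each string -> (shared-prefix length with the previous string, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(pairs):
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
print(front_encode(words))   # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
assert front_decode(front_encode(words)) == words
```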

Page 59

… 7 0 systile  9 2 zygetic  8 5 ial  6 5 y  11 0 szaibelyite  8 2 czecin  9 2 omo … (front-coded bucket contents on disk)

[Figure: 2-level indexing. A compacted trie (CT) built on a sample of the strings (e.g., systile, szaibelyite) is kept in internal memory; the front-coded buckets reside on disk.]

2-level indexing

A disadvantage:
• Trade-off ≈ speed vs. space (because of the bucket size)

2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ front-coding over buckets