
Page 1: Łódź, 2008

Łódź, 2008

Intelligent Text Processing, lecture 3

Word distribution laws. Word-based indexing

Szymon Grabowski, [email protected]

http://szgrabowski.kis.p.lodz.pl/IPT08/

Page 2: Łódź, 2008


Zipf’s law (Zipf, 1935, 1949) [ http://en.wikipedia.org/wiki/Zipf's_law, http://ciir.cs.umass.edu/cmpsci646/Slides/ir08%20compression.pdf ]

word-rank × word-frequency ≈ constant

That is, a few most frequent words cover a relatively large part of the text, while the majority of words in the given text’s vocabulary occur only once or twice.

More formally, the frequency of any word is (approximately) inversely proportional to its rank in the frequency table. Zipf’s law is empirical!

Example from the Brown Corpus (slightly over 10^6 words): “the” is the most frequent word, with ~7% (69971) of all word occurrences.

The next word, “of”, has ~3.5% of the occurrences (36411), followed by “and” with less than 3% (28852).

Only 135 items are needed to account for half the Brown Corpus.
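As a quick sanity check of the rank × frequency relation, a minimal Python 3 sketch using just the three Brown Corpus counts quoted above (everything else is illustrative):

# rank * frequency should stay roughly constant if Zipf's law holds
brown_top = [("the", 69971), ("of", 36411), ("and", 28852)]
for rank, (word, freq) in enumerate(brown_top, start=1):
    print(word, rank * freq)   # the three products are of the same order of magnitude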

Page 3: Łódź, 2008


Does Wikipedia confirm Zipf’s law? [ http://en.wikipedia.org/wiki/Zipf's_law ]

Word freq in Wikipedia, Nov 2006, log-log plot.

x is the word rank, y is the total # of occurrences.

Zipf's law roughly corresponds to the green (1/x) line.

Page 4: Łódź, 2008


Let’s check it in Python...

distinct words: 283475
top freq words: [('the', 80103), ('and', 63971), ('to', 47413), ('of', 46825), ('a', 37182)]

Dickens collection: http://sun.aei.polsl.pl/~sdeor/stat.php?s=Corpus_dickens&u=corpus/dickens.bz2
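The code behind this output is not part of the transcript; a minimal Python 3 sketch producing this kind of statistics could look as follows (the file name dickens.txt, its encoding and the crude letters-only tokenization are assumptions):

import re
from collections import Counter

# crude tokenization: maximal runs of letters, case-folded
with open("dickens.txt", encoding="latin-1") as f:
    words = re.findall(r"[a-z]+", f.read().lower())

freq = Counter(words)
print("distinct words:", len(freq))
print("top freq words:", freq.most_common(5))

# how many words occur exactly 1, 2, 3, 4 times (cf. the next slide)
per_count = Counter(freq.values())
for k in range(1, 5):
    print("there are", per_count[k], "words with freq", k)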

Page 5: Łódź, 2008


Lots of words with only a few occurrences (Dickens example continued):

there are 9423 words with freq 1
there are 3735 words with freq 2
there are 2309 words with freq 3
there are 1518 words with freq 4

Page 6: Łódź, 2008


Brown corpus statistics [ http://ciir.cs.umass.edu/cmpsci646/Slides/ir08%20compression.pdf ]

Page 7: Łódź, 2008


Heaps’ law (Heaps, 1978) [ http://en.wikipedia.org/wiki/Heaps'_law ]

Another empirical law. It tells how the vocabulary size V grows with growing text size n (expressed in words):

V(n) = K · n^β,

where K is typically around 10..100 (for English) and β between 0.4 and 0.6.

Roughly speaking, the vocabulary (# of distinct words) grows proportionally to the square root of the text length.
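To observe this empirically, a small Python 3 sketch (tokens is assumed to be a word list obtained as on the previous slides):

import math

def vocabulary_growth(tokens, checkpoints=10):
    """Print n, V(n), and log V(n) / log n at a few prefix lengths of the text."""
    seen = set()
    step = max(1, len(tokens) // checkpoints)
    for i, w in enumerate(tokens, start=1):
        seen.add(w)
        if i % step == 0 and i > 1:
            # if V(n) = K * n**beta, the last column tends towards beta
            # (up to the influence of the constant K)
            print(i, len(seen), round(math.log(len(seen)) / math.log(i), 3))

# example: vocabulary_growth(words)   # 'words' from the previous sketch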

Page 8: Łódź, 2008


Musings on Heaps’ law

The number of ‘words’ grows without limit... How is this possible?

Because new words tend to occur in new documents: e.g. (human, geographical etc.) names. But also typos!

On the other hand, the dictionary size grows significantly more slowly than the text itself.

I.e. it doesn’t pay much to represent the dictionary succinctly (with compression) – dictionary compression ideas which slow down the access should be avoided.

Page 9: Łódź, 2008


Inverted indexes

Almost all real-world text indexes are inverted indexes (in one form or another).

An inverted index (Salton and McGill, 1983) stores words and their occurrence lists for the text database/collection.

The occurrence lists may store exact word positions (inverted list) or just a block or a document (inverted file).

Storing exact word positions (inverted list) enables faster search and facilitates some kinds of queries (compared to the inverted file), but requires (much) more storage.

Page 10: Łódź, 2008


Inverted file, example [ http://en.wikipedia.org/wiki/Inverted_index ]

We have three texts (documents):

T0 = "it is what it is" T1 = "what is it"T2 = "it is a banana"

We build the vocabulary and the associated document index lists:

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Let’s search for what, is and it. That is, we want to obtain references to all the documents which contain those three words (at least once each, and at arbitrary positions!).

The answer: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}, i.e. documents T0 and T1.
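A minimal Python 3 sketch of this example (not the original lecture code): build a document-level inverted file for T0, T1, T2 and intersect the lists for the query words.

docs = ["it is what it is", "what is it", "it is a banana"]

# inverted file: word -> set of document ids
index = {}
for doc_id, text in enumerate(docs):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

print(sorted(index.items()))   # the vocabulary with its document lists

query = ["what", "is", "it"]
result = set.intersection(*(index[w] for w in query))
print(sorted(result))          # -> [0, 1]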

Page 11: Łódź, 2008


How to build an inverted index [ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]

Step 1 – collect the documents to index (obvious).

Step 2 – split the text into tokens (roughly: words).

Step 3 – reduce the number of unique tokens (e.g. by normalizing capitalization and plural forms).

Step 4 – build the dictionary structure and occurrence lists.

Page 12: Łódź, 2008


Tokenization [ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

How to tokenize? Not as easy as it first seems... See the example sentence (shown on the slide) and possible tokenizations of some excerpts from it.

Page 13: Łódź, 2008


Tokenization, cont’d [ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]

Should we accept tokens like C++, C#, .NET, DC-9, M*A*S*H ?

And if M*A*S*H is ‘legal’ (single token), then why not 2*3*5 ?

Is a dot (.) a punctuation mark (sentence terminator)?

Usually yes (of course), but think about emails, URLs and IPs...

Page 14: Łódź, 2008


Tokenization, cont’d [ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]

Is a space always a reasonable token delimiter? Is it better to perceive New York (or San Francisco, Tomaszów Mazowiecki...) as one token or two?

Similarly with foreign phrases (cafe latte, en face).

Splitting on spaces may bring bad retrieval results: a search for York University will mainly fetch documents related to New York University.
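To make the tokenization issues concrete, a small Python 3 sketch contrasting naive whitespace splitting with a hand-made regular expression; the pattern is only an illustration, not a recommended tokenizer, and the sample sentence is made up:

import re

text = "Is C++ hard? Ask [email protected], not M*A*S*H fans in New York."

# naive: split on whitespace (punctuation sticks to the tokens)
print(text.split())

# a hand-made pattern: keep e-mail-like tokens and things like C++ or M*A*S*H,
# otherwise take plain words; real tokenizers are much more careful than this
pattern = r"[\w.]+@[\w.]+|[A-Za-z]+(?:[+#*][A-Za-z+#*]*)?|\w+"
print(re.findall(pattern, text))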

Page 15: Łódź, 2008


Stop words [ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]

Stop words are very frequent words which carry little information (e.g. pronouns, articles), for example: the, of, and, a, to, is, it...

Consider an inverted list (i.e. exact positions of all word occurrences are kept in the index). If we use a stop list (i.e. discard stop words during indexing), the index will get smaller, say by 20–30% (note this has to do with Zipf’s law).

Danger of using a stop list: some meaningful queries may consist of stop words exclusively: The Who, to be or not to be... The modern trend is not to use a stop list at all.
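A Python 3 sketch of stop-word filtering, also measuring how much of the token stream (and hence, roughly, of an inverted list) the stop list removes; the tiny stop list below is only an illustration:

stop_list = {"the", "of", "and", "a", "to", "in", "is", "it", "that"}   # illustrative only

def filter_stop_words(tokens):
    kept = [t for t in tokens if t not in stop_list]
    removed = len(tokens) - len(kept)
    print("removed", removed, "of", len(tokens), "token occurrences",
          "(%.0f%%)" % (100.0 * removed / len(tokens)))
    return kept

# example: filter_stop_words(words)   # 'words' as computed for the Dickens collection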

Page 16: Łódź, 2008


Glimpse – a classic inverted file index [ Manber & Wu, 1993 ]

Assumption: the text to index is a single unstructured file (perhaps a concatenation of documents).

It is divided into 256 blocks of approximately the same size.

Each entry in the index is a word and the list of blocks in which it occurs, in ascending order. Each block number takes 1 byte (why?).
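A rough Python 3 sketch of this idea (a simplified illustration only, not Glimpse itself): cut the tokenized text into 256 blocks of roughly equal size and record, for every word, the sorted list of block numbers it occurs in; each number fits in one byte because there are at most 256 blocks.

def build_block_index(words, num_blocks=256):
    """words: the tokenized text; returns {word: sorted list of block numbers}."""
    block_size = -(-len(words) // num_blocks)        # ceiling division
    index = {}
    for pos, w in enumerate(words):
        b = pos // block_size                        # 0..255, fits in one byte
        lst = index.setdefault(w, [])
        if not lst or lst[-1] != b:                  # keep the list sorted, no duplicates
            lst.append(b)
    return index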

Page 17: Łódź, 2008


Block-addressing (e.g. Glimpse) index, general scheme [ fig. 3 from Navarro et al., “Adding compression...”, IR, 2000 ]

Page 18: Łódź, 2008


How to search in Glimpse?

Two basic queries:

• keyword search – the user specifies one or more words and requires a list of all documents (or blocks) in which those words occur, at any positions,

• phrase search – the user specifies two or more words and requires a list of all documents (or blocks) in which those words occur as a phrase, i.e. one after another.

Why is phrase search important? Imagine a keyword search for +new +Scotland and a phrase search for “new Scotland”. What’s the difference?

Page 19: Łódź, 2008

How to search in Glimpse, cont’d (fig. from http://nlp.stanford.edu/IR-book/pdf/01bool.pdf )

The key operation is to intersect several block lists.

Imagine the query has 4 words, and their corresponding lists have lengths 10, 120, 5, 30. How do you perform the intersection?


It’s best to start with the two shortest lists, i.e. of lengths 5 and 10 in our example. The intersection output will have length at most 5 (but usually less, even 0, when we just stop!). Then we intersect the obtained list with the list of length 30, and finally with the longest list.

No matter the intersection order, the result is the same, but the speeds are (vastly) different!
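A Python 3 sketch of this strategy (a simple set-based version; merging sorted lists is a common alternative): intersect the shortest lists first and stop as soon as the running result becomes empty.

def intersect_lists(lists):
    """lists: block (or document) number lists, one per query word."""
    result = None
    for lst in sorted(lists, key=len):        # shortest lists first
        result = set(lst) if result is None else result & set(lst)
        if not result:                        # empty already: longer lists need not be touched
            return set()
    return result if result is not None else set()

# example with list lengths 5, 10, 30, 120 (contents made up):
# intersect_lists([list(range(120)), list(range(10)), [1, 3, 5, 7, 9], list(range(0, 60, 2))])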

Page 20: Łódź, 2008


How to search in Glimpse, cont’d

We have obtained the intersection of all lists; what then?

Depends on the query: for a keyword query, we’re done (we can now retrieve the found blocks / docs).

For a phrase query, we still have to scan the resulting blocks / documents and check if and where the phrase occurs in them.

To this end, we can use any fast exact string matching algorithm, e.g. Boyer–Moore–Horspool.

Conclusion: the smaller the resulting list, the faster the phrase query is handled.
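For the verification step, a compact Python 3 version of the Boyer–Moore–Horspool algorithm (a textbook formulation, given here only to illustrate the final exact scan; block_text is a hypothetical name for the text of one candidate block):

def horspool_find(text, pattern):
    """Return the position of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    # how far the window may shift when a given character is aligned with its last position
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        i += shift.get(text[i + m - 1], m)
    return -1

# example: horspool_find(block_text, "new scotland")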

Page 21: Łódź, 2008


Approximate matching with Glimpse

Imagine we want to find occurrences of a given phrase, but with up to k (Levenshtein) errors. How can we do it?

Assume for example k = 2 and the phrase grey cat. The phrase has two words, so there are the following per-word error possibilities: 0 + 0, 0 + 1, 0 + 2, 1 + 0, 1 + 1, 2 + 0.

Page 22: Łódź, 2008


Approximate matching with Glimpse, cont’d

E.g. 0 + 2 means here that the first word (grey) must be matched exactly, and the second with up to 2 errors (e.g. rats).

So the approximate query grey cat translates into many exact queries (many of them rather silly...): grey cat, gray cat, grey rat, great cat, grey colt, gore cat, grey date...

All those possibilities are obtained by traversing the vocabulary structure (e.g., a trie).

Another option is on-line approximate matching over the vocabulary represented as plain text (a concatenation of words) – see the figure on the next slide.
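A Python 3 sketch of this expansion (a brute-force illustration: it scans the whole vocabulary with a plain dynamic-programming edit distance, whereas the trie / on-line approaches from the slides avoid that):

import itertools

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def expand_phrase(phrase_words, vocabulary, k):
    """Translate an approximate phrase query into a set of exact phrase queries."""
    exact_queries = set()
    # all ways of distributing at most k errors among the words
    # (0+0, 0+1, 0+2, 1+0, 1+1, 2+0 for k = 2 and two words)
    for budget in itertools.product(range(k + 1), repeat=len(phrase_words)):
        if sum(budget) > k:
            continue
        candidates = [[v for v in vocabulary if levenshtein(w, v) <= e]
                      for w, e in zip(phrase_words, budget)]
        for combo in itertools.product(*candidates):
            exact_queries.add(" ".join(combo))
    return exact_queries

# example: expand_phrase(["grey", "cat"], vocabulary, k=2)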

Page 23: Łódź, 2008


Approx matching with Glimpse, query example with a single word x

(fig. from Baeza-Yates & Navarro, Fast Approximate String Matching in a Dictionary, SPIRE’98)

Page 24: Łódź, 2008


Block boundaries problem

If the Glimpse blocks never cross document boundaries (i.e., are natural), we don’t have this problem...

But if the block boundaries are artificial, then we may be unlucky and have one of our keywords at the very end of a block B_j and the next keyword at the beginning of B_{j+1}.

How not to miss an occurrence?

There is a simple solution: blocks may overlap a little. E.g. the last 30 words of each block are repeated at the beginning of the next block. Assuming the phrase / keyword set has no more than 30 words, we are safe.

But we may then need to scan more blocks than necessary (why?).

Page 25: Łódź, 2008


Glimpse issues and limitations

The authors claim their index takes only 2–4 % of the original text size.

But it works well only for text collections up to about 200 MB; then it starts to degenerate, i.e., the block lists tend to be long and many queries are handled not much faster than with online search (without any index).

How can we help it? The overall idea is fine, so we must take care of the details.

One major idea is to apply compression to the index...

Page 26: Łódź, 2008


(Glimpse-like) index compression

The purpose of data compression in inverted indexes is not only to save space (storage).

It is also to make queries faster! (One big reason is less I/O, e.g. one disk access where without compression we’d need two. Another reason is more cache-friendly memory access.)

Page 27: Łódź, 2008


Compression opportunities (Navarro et al., 2000)

• The text (partitioned into blocks) may be compressed on the word level (faster text search in the last stage).

• Long lists may be represented as their complements, i.e. the numbers of the blocks in which a given word does NOT occur.

• Lists store increasing numbers, so the gaps (differences) between them may be encoded (see the sketch below), e.g. 2, 15, 23, 24, 100 → 2, 13, 8, 1, 76 (smaller numbers).

• The resulting gaps may be statistically encoded (e.g. with some Elias code; next slides...).
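A small Python 3 sketch of the gap (delta) transform from the third bullet; encoding and decoding are mutual inverses, so the original block list can always be recovered:

def gaps_encode(block_list):
    """[2, 15, 23, 24, 100] -> [2, 13, 8, 1, 76] (the input is strictly increasing)."""
    return [block_list[0]] + [b - a for a, b in zip(block_list, block_list[1:])]

def gaps_decode(gap_list):
    """Inverse transform: running sums restore the original numbers."""
    out, total = [], 0
    for g in gap_list:
        total += g
        out.append(total)
    return out

print(gaps_encode([2, 15, 23, 24, 100]))   # -> [2, 13, 8, 1, 76]
print(gaps_decode([2, 13, 8, 1, 76]))      # -> [2, 15, 23, 24, 100]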

Page 28: Łódź, 2008


Compression paradigm: modeling + coding

Modeling: the way we look at the input data. It can be perceived as individual (context-free) 1-byte characters, or pairs of bytes, or triples, etc. We can look for matching sequences in the past buffer (bounded or unbounded, sliding or not), the minimum match length can be set to some value, etc.

We can apply lossy or lossless transforms (DCT in JPEG, the Burrows–Wheeler transform), etc.

Modeling is difficult. Sometimes more art than science. Often data-specific.

Coding: what we do with the data transformed in the modeling phase.

Page 29: Łódź, 2008


Intro to coding theory

A uniquely decodable code: any concatenation of its codewords can be uniquely parsed.

A prefix code: no codeword is a proper prefix of any other codeword. Also called an instantaneous code.

Trivially, any prefix code is uniquely decodable. (But not vice versa! E.g. {0, 01} is not a prefix code, yet it is uniquely decodable.)

Page 30: Łódź, 2008


Average codeword length

Let an alphabet have s symbols, with probabilities p_0, p_1, ..., p_{s–1}.

Let’s have a (uniquely decodable) code C = [c_0, c_1, ..., c_{s–1}].

The average codeword length for a given probability distribution is defined as

L(C, p_0, ..., p_{s–1}) = Σ_i p_i · |c_i|,

where |c_i| is the length of codeword c_i in bits. So this is a weighted average of the lengths of the individual codewords. More frequent symbols have a stronger influence.

Page 31: Łódź, 2008


Entropy

Entropy is the average information in a symbol. Or: the lower bound on the average number (possibly fractional) of bits needed to encode an input symbol.

Higher entropy = less compressible data.

What “the entropy of the data” is, is a vague issue. We always measure the entropy according to a given model (e.g. context-free, aka order-0, statistics).

Shannon’s entropy formula (S – the “source” emitting “messages” / symbols):

H(S) = – Σ_i p_i · log2 p_i.
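A Python 3 sketch computing the order-0 (context-free) entropy of a byte string, i.e. the entropy under the simplest model mentioned above:

import math
from collections import Counter

def order0_entropy(data):
    """Average number of bits per symbol under an order-0 model of the input."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# example: order0_entropy(open("book1", "rb").read())   # book1: a Calgary corpus file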

Page 32: Łódź, 2008


Redundancy

Simply speaking, redundancy is the excess in the representation of data.

Redundant data means: compressible data. A redundant code is a non-optimal (or: far from optimal) code.

The redundancy of a code (for a given probability distribution):

R(C, p_0, p_1, ..., p_{s–1}) = L(C, p_0, p_1, ..., p_{s–1}) – H(p_0, p_1, ..., p_{s–1}) ≥ 0.

The redundancy of a code is the avg excess (over the entropy) per symbol.

Can’t be below 0, of course.

Page 33: Łódź, 2008


Basic codes. Unary code

The unary code writes a number simply as a run of ones terminated by a single zero (0, 10, 110, 1110, ... for consecutive values), so codeword lengths grow linearly with the encoded value. It is extremely simple, though. Application: a very skew distribution (expected for a given problem).

Page 34: Łódź, 2008


Basic codes. Elias gamma code

The Elias gamma code of x ≥ 1 writes z = ⌊log2 x⌋ in unary, followed by the z lower-order bits of x in binary (2z + 1 bits altogether; cf. the Python example two slides ahead). Still simple, and usually much better than the unary code.

Page 35: Łódź, 2008


Elias gamma code in Glimpse (example)

an example occurrence list:
[2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203]

list of deltas (differences):
[2, 2, 1, 1, 3, 2, 29, 2, 1, 51, 2, 24, 13, 1, 17, 52]

list of deltas minus 1 (as zero was previously impossible, except perhaps at the 1st position, which is therefore left unchanged):
[2, 1, 0, 0, 2, 1, 28, 1, 0, 50, 1, 23, 12, 0, 16, 51]

no compression (one item – one byte): 16 * 8 = 128 bits
with Elias gamma coding: 78 bits

101 100 0 0 101 100 111101101 100 0 11111010011 100 111101000 1110101 0 111100001 11111010100

Page 36: Łódź, 2008


Python code for the previous example (Python 2.6 needed, as the function bin() is used):

import math

occ_list = [2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203]

# gaps minus 1; the first element is kept as it is
delta_list = [occ_list[0]]
for i in range(1, len(occ_list)):
    delta_list += [occ_list[i] - occ_list[i-1] - 1]
print occ_list
print delta_list

# Elias gamma code of x+1 for every item x:
# z ones and a terminating zero, then the z lower-order bits of x+1
total_len = 0
total_seq = ""
for x in delta_list:
    z = int(math.log(x + 1, 2))
    code1 = "1" * z + "0"
    code2 = bin(x + 1 - 2**z)[2:].zfill(z) if z >= 1 else ""
    total_seq += code1 + code2 + " "
    total_len += len(code1 + code2)

print "no compression:", len(occ_list)*8, "bits"
print "with Elias gamma coding:", total_len, "bits"
print total_seq
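For completeness, here is a decoder for the bit sequence produced above (a Python 3 sketch: it reads the unary run of ones, then the z following bits, and undoes the "+1" and the gap transform):

def gamma_decode(bits):
    """Decode the concatenated codes (spaces ignored) back to the occurrence list."""
    bits = bits.replace(" ", "")
    values, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "1":                 # unary part: count the ones
            z += 1
            i += 1
        i += 1                                # skip the terminating zero
        rest = bits[i:i + z]
        i += z
        values.append(2**z + (int(rest, 2) if rest else 0) - 1)   # undo the "+1"
    occ, total = [], 0
    for k, x in enumerate(values):            # undo the gap transform
        total = x if k == 0 else total + x + 1
        occ.append(total)
    return occ

# gamma_decode("101 100 0 0 101 ...") should give back [2, 4, 5, 6, 9, ...]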

Page 37: Łódź, 2008


Huffman coding (1952) – basic facts

Elias codes assume that we know (roughly) the symbol distribution before encoding.

What if we guessed it badly...?

If we know the symbol distribution, we may construct an optimal code; more precisely, optimal among the uniquely decodable codes having a codebook.

It is called Huffman coding.

Example (on the slide): a table of symbol frequencies and the final Huffman tree built from them.
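A compact Python 3 sketch of the construction itself (the standard greedy merging of the two least frequent subtrees via a heap); the frequencies in the usage lines are placeholders, not the ones from the slide:

import heapq

def huffman_code(freqs):
    """freqs: {symbol: count}; returns {symbol: codeword string}."""
    # heap items: (total frequency, tie-breaker, {symbol: partial codeword})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)          # the two least frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))   # ...are merged into one node
        tie += 1
    return heap[0][2]

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}   # illustrative counts
code = huffman_code(freqs)
avg = sum(freqs[s] * len(w) for s, w in code.items()) / sum(freqs.values())
print(code)
print("average codeword length:", avg, "bits/symbol")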

Page 38: Łódź, 2008


Huffman coding (1952) – basic facts, cont’d

Its redundancy is always less than 1 bit / symbol (but may be arbitrarily close).

In most practical applications (data not very skew) the Huffman code’s average length is only 1–3% worse than the entropy.

E.g. the order-0 entropy of book1 (English 19th century novel, plain text) from the Calgary corpus: 4.567 bpc.

Huffman avg length: 4.569 bpc.


Page 39: Łódź, 2008


Why not use only Huffman coding for encoding the occurrence lists?