Indexing
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata
Hardware parameters
Caching
– Accessing data in memory is a few times faster than from disk
– Keep as much information as possible in main memory
Parameter                                     HDD
Average seek time                             5 × 10⁻³ s (5 ms)
Average time to read a byte from disk         2 × 10⁻⁸ s (50 MB/s)
Average time to access a byte in memory       5 × 10⁻⁹ s (5 ns)
Processor's clock rate                        10⁹ per second (1 GHz)
Low-level operation (compare, swap a word)    10⁻⁸ s (0.01 μs)
Hardware parameters
Reading from disk
About 10 MB of data in one contiguous chunk on disk
– One seek (~5 ms) + read (~0.2 s) ≈ 0.2 s
About 10 MB of data in 1000 non-contiguous chunks on disk
– 1000 seeks (~5 s) + read (~0.2 s) ≈ 5.2 s
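The arithmetic above can be checked directly from the parameters in the table; a minimal sketch:

```python
SEEK = 5e-3          # average seek time: 5 x 10^-3 s (5 ms)
PER_BYTE = 2e-8      # read time per byte: 2 x 10^-8 s (50 MB/s)
DATA = 10 * 10**6    # 10 MB of data

contiguous = 1 * SEEK + DATA * PER_BYTE     # one seek + one sequential read
scattered = 1000 * SEEK + DATA * PER_BYTE   # a seek per chunk, same read volume
print(contiguous)    # ~0.2 s
print(scattered)     # ~5.2 s
```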
Hardware parameters
The OS reads and writes data in blocks
– Block sizes may be 8, 16, 32 or 64 KB
– Reading 1 byte of data takes the same time as reading the entire block
– Buffer: the part of main memory where a block is stored after reading or while writing
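A minimal sketch of block-at-a-time reading (the 64 KB block size and the generator interface are illustrative choices, not from the slide):

```python
BLOCK_SIZE = 64 * 1024   # one of the typical block sizes mentioned above

def read_in_blocks(path):
    """Yield a file one block at a time; reading a single byte
    would cost the same disk I/O as reading the whole block."""
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):   # the OS fills a buffer per block
            yield block
```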
Hardware parameters
Simultaneous reading and processing
– Disk-to-main-memory data transfer is done by the system bus
– The processor is free and can work while data is being read
– The system can read data and decompress it at the same time
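One way to picture this overlap is a two-stage pipeline: a reader thread keeps the bus busy while the main thread decompresses. A minimal sketch, assuming the file is a single zlib stream (the block size, queue depth, and zlib itself are illustrative assumptions):

```python
import queue
import threading
import zlib

def read_blocks(path, q):
    # Producer: stream raw blocks from disk while the consumer works
    with open(path, "rb") as f:
        while block := f.read(64 * 1024):
            q.put(block)
    q.put(None)  # sentinel: end of stream

def read_and_decompress(path):
    q = queue.Queue(maxsize=8)  # small buffer between the two stages
    threading.Thread(target=read_blocks, args=(path, q), daemon=True).start()
    d = zlib.decompressobj()
    while (block := q.get()) is not None:
        yield d.decompress(block)  # CPU decompresses while the reader waits on disk
    yield d.flush()
```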
Hardware parameters
The parameters change a lot for SSDs
– Seek: about 50 times faster than HDD
– Read: about 4 times faster than HDD
Parameter                                HDD                     SSD
Average seek time                        5 × 10⁻³ s (5 ms)       10⁻⁴ s (0.1 ms)
Average time to read a byte from disk    2 × 10⁻⁸ s (50 MB/s)    5 × 10⁻⁹ s (200 MB/s)
Blocked sort-based indexing
Step 1: partition the collection into segments.
Step 2: make a pass over all documents and write out the <term, docID> pairs for each segment.
Step 3: separately in each segment, sort the <term, docID> pairs by term (may require an external-memory sort).
Step 4: separately in each segment, create the posting lists for each term.
Step 5: for each term, merge the posting lists from all segments to create one single posting list (sketched below).
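The five steps map directly onto code. A minimal sketch, assuming whitespace tokenization and segments small enough to sort in memory (a real implementation would use an external-memory sort, as step 3 notes):

```python
from itertools import groupby

def bsbi(segments):
    """segments: a list of {doc_id: text} dicts, one per segment (step 1)."""
    segment_indexes = []
    for seg in segments:
        # Step 2: write out the <term, docID> pairs for this segment
        pairs = [(term, doc_id) for doc_id, text in seg.items()
                 for term in text.split()]
        # Step 3: sort the pairs by term (in memory here, external in practice)
        pairs.sort()
        # Step 4: collapse the sorted pairs into per-term posting lists
        index = {term: sorted({d for _, d in grp})
                 for term, grp in groupby(pairs, key=lambda p: p[0])}
        segment_indexes.append(index)
    # Step 5: merge the per-segment posting lists into one list per term
    final = {}
    for index in segment_indexes:
        for term, postings in index.items():
            final.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in final.items()}

print(bsbi([{1: "new home sales"}, {2: "home sales rise"}]))
```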
Single-pass in-memory indexing
Step 1: keep parsing documents from the collection, converting them to <term, docID> pairs.
Step 2: invert the <term, docID> pairs in memory to make posting lists; these are immediately available for searching as well.
Step 3: when memory is exhausted, sort the posting lists and the dictionary, and write the whole index (dictionary and posting lists) for the block to disk.
Step 4: merge the index files for the different blocks into the final index.
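A sketch of steps 1–3 in code; the token budget standing in for "memory is exhausted" and the plain-text block format are assumptions for illustration (step 4 would merge the resulting block files):

```python
import itertools

_block_counter = itertools.count(1)

def write_block(postings):
    """Step 3: sort the dictionary and flush one block's index to disk."""
    name = f"block{next(_block_counter)}.txt"
    with open(name, "w") as f:
        for term in sorted(postings):
            f.write(term + " " + " ".join(map(str, postings[term])) + "\n")
    return name

def spimi(token_stream, budget=1_000_000):
    """token_stream yields (term, docID) pairs (step 1)."""
    block_files, postings, used = [], {}, 0
    for term, doc_id in token_stream:
        plist = postings.setdefault(term, [])  # step 2: invert in memory;
        if not plist or plist[-1] != doc_id:   # lists stay sorted if the stream
            plist.append(doc_id)               # arrives in docID order
        used += 1
        if used >= budget:                     # "memory is exhausted"
            block_files.append(write_block(postings))
            postings, used = {}, 0
    if postings:
        block_files.append(write_block(postings))
    return block_files                         # step 4 merges these files
```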
Dynamic indexing
For fast access, posting lists are written in contiguous blocks. Changes to the existing index would therefore require a lot of moving of data.
Approach: create an auxiliary index for incremental changes, and keep growing that index
– Searches are performed in both the old and the auxiliary indices, and the results are merged (see the sketch after this list)
– When the auxiliary index grows significantly large, merge it with the original one (a costly operation, but not done often)
Ease of merging
– If each posting list could be its own file, merging would be easy
– But that is a bad idea! The OS cannot handle too many files (too many terms) well
– Tradeoff between the number of files and the number of posting lists per file: keep some posting lists in one file, but not all in one
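A minimal sketch of the auxiliary-index scheme (the dict-based indices and the merge threshold are illustrative assumptions):

```python
class DynamicIndex:
    AUX_LIMIT = 100_000   # assumed threshold for triggering the costly merge

    def __init__(self, main_index):
        self.main = main_index   # large index, posting lists in contiguous blocks
        self.aux = {}            # small auxiliary index for incremental changes

    def add(self, term, doc_id):
        self.aux.setdefault(term, []).append(doc_id)
        if sum(len(p) for p in self.aux.values()) > self.AUX_LIMIT:
            self._merge()        # costly operation, but not done often

    def search(self, term):
        # Search both the old and the auxiliary index, then merge the results
        return sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))

    def _merge(self):
        for term, postings in self.aux.items():
            merged = set(self.main.get(term, [])) | set(postings)
            self.main[term] = sorted(merged)
        self.aux = {}
```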
Security: search in enterprise data
Not all users may be allowed to view all documents.
Approach 1: use access control lists
– For each user / group of users, keep a posting list of the documents the user / group CAN access
– Intersect that list with the search result list
– Problem: difficult to maintain when access rights change
– Access control lists may be very large
Approach 2: check at query time
– From the retrieved results, filter out those which the user is not allowed to access
– May slow down retrieval
Both approaches are sketched below.
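A minimal sketch of the two filtering strategies (the result list, the ACL posting list, and the can_access predicate are all hypothetical inputs):

```python
def filter_with_acl(results, acl_postings):
    """Approach 1: intersect the result list with the user's ACL posting list."""
    allowed = set(acl_postings)            # doc IDs this user / group CAN access
    return [doc for doc in results if doc in allowed]

def filter_at_query_time(results, user, can_access):
    """Approach 2: apply a per-document access check to the retrieved results."""
    return [doc for doc in results if can_access(user, doc)]

# Hypothetical usage: keep only the documents the user may see
print(filter_with_acl([3, 7, 19, 42], acl_postings=[7, 42, 99]))   # [7, 42]
```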
Index Compression
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata
Data compression
Represent the information using (hopefully) fewer bits than the original representation
Lossy: may lose some “information”
– Consequently, we may not be able to get back the original
Lossless: just a different representation; we will be able to “decompress” it and get back the original
– We will consider only lossless compression here
Question:
Can a lossless compression algorithm “compress” any input data?
Index compression
Use less space to store the index
– On disk or in memory
– Saving is saving anyway
Dictionary compression
– Makes it possible to store all or most of the dictionary in memory
Posting list compression
– Assumption: the posting lists cannot be stored in memory
– Goal: transfer data from disk faster
– Typically the index can be compressed by about 4 times
Fast decompression is needed
– Without compression: transfer 4x bytes from disk
– With compression: transfer x bytes and decompress
The posting lists
For a “small” dataset, 4 bytes (or even less) are enough to store each doc ID; for a “very large” dataset, ~8 bytes may be needed.
Observation 1: the doc IDs are large integers, but the gaps between consecutive doc IDs are not so large!
– Idea: store the first doc ID, and then store only the gaps (gap encoding)
Observation 2(a): for some posting lists, the gaps are very small
– Very frequent terms
Observation 2(b): for some, the gaps can be large as well
– Very infrequent terms
Some posting lists may also have some large and some small gaps
– A fixed size for gaps would not work!
obama … 3429934 3478390 3689839 …
the … 45 46 47 …
sankhadeep … 54892834 394099834 828493832 …
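Computing the gaps and recovering the doc IDs is straightforward; a minimal sketch:

```python
def to_gaps(doc_ids):
    """Store the first doc ID, then only the differences (gap encoding)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """A running sum recovers the original doc IDs."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

print(to_gaps([824, 829, 215406]))   # [824, 5, 214577]: used on the next slide
```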
Gap encoding: variable byte encoding
Approach: store the first doc ID, then store the gaps
– Use 1-byte units
– The first bit of each byte is an indicator; the next 7 bits are the payload (the number)
– The indicator bit is 1 if this is the last byte of a number, 0 otherwise
doc IDs     824                  829        215406
ID, gaps    824                  5          214577
encoding    00000110 10111000    10000101   00001101 00001100 10110001
824 is encoded as 0000110 0111000 (dropping the indicator bit of both bytes): 512 + 256 + 32 + 16 + 8 = 824
The indicator bit of the first byte is 0 because it is not the last byte of the number
Some gaps take more bytes, some gaps take fewer
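A direct implementation of this scheme, following the slide's convention (indicator bit 1 on the last byte of each number):

```python
def vb_encode(n):
    """Encode one number as bytes: 7 payload bits per byte,
    indicator bit 1 on the last byte."""
    out = [n & 0x7F]             # lowest 7 bits form the last byte
    n >>= 7
    while n:
        out.insert(0, n & 0x7F)  # earlier bytes get indicator bit 0
        n >>= 7
    out[-1] |= 0x80              # set the indicator bit on the last byte
    return out

def vb_decode(stream):
    """Decode a byte stream back into the encoded numbers."""
    numbers, n = [], 0
    for b in stream:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:             # indicator bit set: number is complete
            numbers.append(n)
            n = 0
    return numbers

encoded = [b for gap in (824, 5, 214577) for b in vb_encode(gap)]
print(" ".join(format(b, "08b") for b in encoded))  # the slide's bit pattern
print(vb_decode(encoded))                           # [824, 5, 214577]
```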
Elias γ encoding
Unary code: n is encoded as n 1s followed by a 0
– For example, 3 → 1110
– Some conventions use n 0s followed by a 1 (equivalent)
γ encoding (due to Peter Elias)
– Consider the binary representation of n
– It starts with a 1; chop it off (we know it is there, so it is redundant)
– Example: 9 → 1001 → 001 (call this part the offset)
– Encode the length of the offset in unary
– Example: the length of the offset of 9 is 3; its unary encoding is 1110
– γ encoding of 9 → 1110001 (the length of the offset in unary, then the offset; the number itself is the leading 1 followed by the offset: 1001)
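A bit-string sketch of γ encoding and decoding (strings of '0'/'1' keep the bit-level structure visible; a real implementation would pack the bits):

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1: unary length of the offset, then the offset."""
    offset = bin(n)[3:]          # binary representation with the leading 1 chopped
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    """Decode one gamma-coded number; return (number, remaining bits)."""
    length = bits.index("0")                 # count the leading 1s
    offset = bits[length + 1 : 2 * length + 1]
    return int("1" + offset, 2), bits[2 * length + 1:]

print(gamma_encode(9))           # '1110001': the slide's example
print(gamma_decode("1110001"))   # (9, '')
```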
Elias γ encoding
Better than variable byte encoding, because γ encoding is bit-level encoding (empirically about 15% better on posting lists)
Decompression is less efficient than for variable byte encoding
Observation: the unary encoding of the length may use a lot of space
Solution: encode length + 1 using γ encoding again (Elias δ encoding)
– More effective for large numbers
– Example: 509 → 11111111011111101 (γ) → 111000111111101 (δ)
Question: why length + 1?
Omega encoding: recursively apply γ encoding on the length
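δ encoding reuses the γ encoder on the offset length, as described above; a sketch (gamma_encode repeated from the previous slide so the block is self-contained):

```python
def gamma_encode(n):             # as on the previous slide
    offset = bin(n)[3:]
    return "1" * len(offset) + "0" + offset

def delta_encode(n):
    """Elias delta: gamma-encode (offset length + 1), then append the offset."""
    offset = bin(n)[3:]
    return gamma_encode(len(offset) + 1) + offset

print(gamma_encode(509))   # '11111111011111101' (17 bits)
print(delta_encode(509))   # '111000111111101'   (15 bits): the slide's example
```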
References
Primarily: Introduction to Information Retrieval by Manning, Raghavan and Schütze: http://nlp.stanford.edu/IR-book/