Indexing
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata
Hardware parameters
Caching
– Accessing data in memory is a few times faster than from disk
– Keep as much information as possible in main memory
Parameter                                     HDD
Average seek time                             5 × 10⁻³ s (5 ms)
Average time to read a byte from disk         2 × 10⁻⁸ s (50 MB/s)
Average time to access a byte in memory       5 × 10⁻⁹ s (5 ns)
Processor's clock rate                        10⁹ per second (1 GHz)
Low-level operation (compare, swap a word)    10⁻⁸ s (0.01 μs)
Hardware parameters
Reading from disk
About 10 MB of data in one contiguous chunk on disk
– One seek (~5 ms) + read (~0.2 s) ≈ 0.2 s
About 10 MB of data in 1000 non-contiguous chunks on disk
– 1000 seeks (~5 s) + read (~0.2 s) ≈ 5.2 s
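The arithmetic above can be checked directly from the parameters in the table; a minimal sketch:

```python
SEEK = 5e-3          # average seek time: 5 x 10^-3 s (5 ms)
PER_BYTE = 2e-8      # read time per byte: 2 x 10^-8 s (50 MB/s)
DATA = 10 * 10**6    # 10 MB of data

contiguous = 1 * SEEK + DATA * PER_BYTE     # one seek + one sequential read
scattered = 1000 * SEEK + DATA * PER_BYTE   # a seek per chunk, same read volume
print(contiguous)    # ~0.2 s
print(scattered)     # ~5.2 s
```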
Hardware parameters
The OS reads and writes data in blocks
– Block sizes may be 8, 16, 32 or 64 KB
– Reading 1 byte of data takes the same time as reading the entire block
– Buffer: the part of main memory where a block is stored after reading or while writing
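A minimal sketch of block-at-a-time reading (the 64 KB block size and the generator interface are illustrative choices, not from the slide):

```python
BLOCK_SIZE = 64 * 1024   # one of the typical block sizes mentioned above

def read_in_blocks(path):
    """Yield a file one block at a time; reading a single byte
    would cost the same disk I/O as reading the whole block."""
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):   # the OS fills a buffer per block
            yield block
```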
Hardware parameters
Simultaneous reading and processing
– Disk-to-main-memory data transfer is done by the system bus
– The processor is free and can work while data is being read
– The system can read data and decompress it at the same time
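One way to picture this overlap is a two-stage pipeline: a reader thread keeps the bus busy while the main thread decompresses. A minimal sketch, assuming the file is a single zlib stream (the block size, queue depth, and zlib itself are illustrative assumptions):

```python
import queue
import threading
import zlib

def read_blocks(path, q):
    # Producer: stream raw blocks from disk while the consumer works
    with open(path, "rb") as f:
        while block := f.read(64 * 1024):
            q.put(block)
    q.put(None)  # sentinel: end of stream

def read_and_decompress(path):
    q = queue.Queue(maxsize=8)  # small buffer between the two stages
    threading.Thread(target=read_blocks, args=(path, q), daemon=True).start()
    d = zlib.decompressobj()
    while (block := q.get()) is not None:
        yield d.decompress(block)  # CPU decompresses while the reader waits on disk
    yield d.flush()
```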
Hardware parameters
The parameters change a lot for SSDs
– Seek: about 50 times faster than HDD
– Read: about 4 times faster than HDD
Parameter                                HDD                     SSD
Average seek time                        5 × 10⁻³ s (5 ms)       10⁻⁴ s (0.1 ms)
Average time to read a byte from disk    2 × 10⁻⁸ s (50 MB/s)    5 × 10⁻⁹ s (200 MB/s)
Blocked sort-based indexing
Step 1: partition the collection into segments.
Step 2: make a pass over all documents and write out the <term, docID> pairs for each segment.
Step 3: separately in each segment, sort the <term, docID> pairs by term (may require an external-memory sort).
Step 4: separately in each segment, create the posting lists for each term.
Step 5: for each term, merge the posting lists from all segments to create one single posting list (sketched below).
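The five steps map directly onto code. A minimal sketch, assuming whitespace tokenization and segments small enough to sort in memory (a real implementation would use an external-memory sort, as step 3 notes):

```python
from itertools import groupby

def bsbi(segments):
    """segments: a list of {doc_id: text} dicts, one per segment (step 1)."""
    segment_indexes = []
    for seg in segments:
        # Step 2: write out the <term, docID> pairs for this segment
        pairs = [(term, doc_id) for doc_id, text in seg.items()
                 for term in text.split()]
        # Step 3: sort the pairs by term (in memory here, external in practice)
        pairs.sort()
        # Step 4: collapse the sorted pairs into per-term posting lists
        index = {term: sorted({d for _, d in grp})
                 for term, grp in groupby(pairs, key=lambda p: p[0])}
        segment_indexes.append(index)
    # Step 5: merge the per-segment posting lists into one list per term
    final = {}
    for index in segment_indexes:
        for term, postings in index.items():
            final.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in final.items()}

print(bsbi([{1: "new home sales"}, {2: "home sales rise"}]))
```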
Single-pass in-memory indexing
Step 1: keep parsing documents from the collection, converting them to <term, docID> pairs.
Step 2: invert the <term, docID> pairs in memory to make posting lists; these are immediately available for searching as well.
Step 3: when memory is exhausted, sort the posting lists and the dictionary, and write the whole index (dictionary and posting lists) for the block to disk.
Step 4: merge the index files for the different blocks into the final index.
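A sketch of steps 1–3 in code; the token budget standing in for "memory is exhausted" and the plain-text block format are assumptions for illustration (step 4 would merge the resulting block files):

```python
import itertools

_block_counter = itertools.count(1)

def write_block(postings):
    """Step 3: sort the dictionary and flush one block's index to disk."""
    name = f"block{next(_block_counter)}.txt"
    with open(name, "w") as f:
        for term in sorted(postings):
            f.write(term + " " + " ".join(map(str, postings[term])) + "\n")
    return name

def spimi(token_stream, budget=1_000_000):
    """token_stream yields (term, docID) pairs (step 1)."""
    block_files, postings, used = [], {}, 0
    for term, doc_id in token_stream:
        plist = postings.setdefault(term, [])  # step 2: invert in memory;
        if not plist or plist[-1] != doc_id:   # lists stay sorted if the stream
            plist.append(doc_id)               # arrives in docID order
        used += 1
        if used >= budget:                     # "memory is exhausted"
            block_files.append(write_block(postings))
            postings, used = {}, 0
    if postings:
        block_files.append(write_block(postings))
    return block_files                         # step 4 merges these files
```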
Dynamic indexing
For fast access, posting lists are written in contiguous blocks. Changes to the existing index would therefore require a lot of moving of data.
Approach: create an auxiliary index for incremental changes, and keep growing that index
– Searches are performed in both the old and the auxiliary indices, and the results are merged (see the sketch after this list)
– When the auxiliary index grows significantly large, merge it with the original one (a costly operation, but not done often)
Ease of merging
– If each posting list could be its own file, merging would be easy
– But that is a bad idea! The OS cannot handle too many files (too many terms) well
– Tradeoff between the number of files and the number of posting lists per file: keep some posting lists in one file, but not all in one
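A minimal sketch of the auxiliary-index scheme (the dict-based indices and the merge threshold are illustrative assumptions):

```python
class DynamicIndex:
    AUX_LIMIT = 100_000   # assumed threshold for triggering the costly merge

    def __init__(self, main_index):
        self.main = main_index   # large index, posting lists in contiguous blocks
        self.aux = {}            # small auxiliary index for incremental changes

    def add(self, term, doc_id):
        self.aux.setdefault(term, []).append(doc_id)
        if sum(len(p) for p in self.aux.values()) > self.AUX_LIMIT:
            self._merge()        # costly operation, but not done often

    def search(self, term):
        # Search both the old and the auxiliary index, then merge the results
        return sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))

    def _merge(self):
        for term, postings in self.aux.items():
            merged = set(self.main.get(term, [])) | set(postings)
            self.main[term] = sorted(merged)
        self.aux = {}
```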
Security: search in enterprise data
Not all users may be allowed to view all documents.
Approach 1: use access control lists
– For each user / group of users, keep a posting list of the documents the user / group CAN access
– Intersect that list with the search result list
– Problem: difficult to maintain when access rights change
– Access control lists may be very large
Approach 2: check at query time
– From the retrieved results, filter out those which the user is not allowed to access
– May slow down retrieval
Both approaches are sketched below.
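A minimal sketch of the two filtering strategies (the result list, the ACL posting list, and the can_access predicate are all hypothetical inputs):

```python
def filter_with_acl(results, acl_postings):
    """Approach 1: intersect the result list with the user's ACL posting list."""
    allowed = set(acl_postings)            # doc IDs this user / group CAN access
    return [doc for doc in results if doc in allowed]

def filter_at_query_time(results, user, can_access):
    """Approach 2: apply a per-document access check to the retrieved results."""
    return [doc for doc in results if can_access(user, doc)]

# Hypothetical usage: keep only the documents the user may see
print(filter_with_acl([3, 7, 19, 42], acl_postings=[7, 42, 99]))   # [7, 42]
```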
Index Compression
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata
Data compression
Represent the information using (hopefully) fewer bits than the original representation
Lossy: may lose some “information”
– Consequently, we may not be able to get back the original
Lossless: just a different representation; we will be able to “decompress” it and get back the original
– We will consider only lossless compression here
Question:
Can a lossless compression algorithm “compress” any input data?
Index compression
Use less space to store the index
– On disk or in memory
– Saving is saving anyway
Dictionary compression
– Makes it possible to store all or most of the dictionary in memory
Posting list compression
– Assumption: the posting lists cannot be stored in memory
– Goal: transfer data from disk faster
– Typically the index can be compressed by about 4 times
Fast decompression is needed
– Without compression: transfer 4x bytes from disk
– With compression: transfer x bytes and decompress
The posting lists
For a “small” dataset, 4 bytes (or even less) are enough to store each doc ID; for a “very large” dataset, ~8 bytes may be needed.
Observation 1: the doc IDs are large integers, but the gaps between consecutive doc IDs are not so large!
– Idea: store the first doc ID, and then store only the gaps (gap encoding)
Observation 2(a): for some posting lists, the gaps are very small
– Very frequent terms
Observation 2(b): for some, the gaps can be large as well
– Very infrequent terms
Some posting lists may also have some large and some small gaps
– A fixed size for gaps would not work!
obama … 3429934 3478390 3689839 …
the … 45 46 47 …
sankhadeep … 54892834 394099834 828493832 …
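Computing the gaps and recovering the doc IDs is straightforward; a minimal sketch:

```python
def to_gaps(doc_ids):
    """Store the first doc ID, then only the differences (gap encoding)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """A running sum recovers the original doc IDs."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

print(to_gaps([824, 829, 215406]))   # [824, 5, 214577]: used on the next slide
```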
Gap encoding: variable byte encoding
Approach: store the first doc ID, then store the gaps
– Use 1-byte units
– The first bit of each byte is an indicator; the next 7 bits are the payload (the number)
– The indicator bit is 1 if this is the last byte of a number, 0 otherwise
doc IDs     824                  829        215406
ID, gaps    824                  5          214577
encoding    00000110 10111000    10000101   00001101 00001100 10110001
824 is encoded as 0000110 0111000 (dropping the indicator bit of both bytes): 512 + 256 + 32 + 16 + 8 = 824
The indicator bit of the first byte is 0 because it is not the last byte of the number
Some gaps take more bytes, some gaps take fewer
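A direct implementation of this scheme, following the slide's convention (indicator bit 1 on the last byte of each number):

```python
def vb_encode(n):
    """Encode one number as bytes: 7 payload bits per byte,
    indicator bit 1 on the last byte."""
    out = [n & 0x7F]             # lowest 7 bits form the last byte
    n >>= 7
    while n:
        out.insert(0, n & 0x7F)  # earlier bytes get indicator bit 0
        n >>= 7
    out[-1] |= 0x80              # set the indicator bit on the last byte
    return out

def vb_decode(stream):
    """Decode a byte stream back into the encoded numbers."""
    numbers, n = [], 0
    for b in stream:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:             # indicator bit set: number is complete
            numbers.append(n)
            n = 0
    return numbers

encoded = [b for gap in (824, 5, 214577) for b in vb_encode(gap)]
print(" ".join(format(b, "08b") for b in encoded))  # the slide's bit pattern
print(vb_decode(encoded))                           # [824, 5, 214577]
```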
Elias γ encoding
Unary code: n is encoded as n 1s followed by a 0
– For example, 3 → 1110
– Some conventions use n 0s followed by a 1 (equivalent)
γ encoding (due to Peter Elias)
– Consider the binary representation of n
– It starts with a 1; chop it off (we know it is there, so it is redundant)
– Example: 9 → 1001 → 001 (call this part the offset)
– Encode the length of the offset in unary
– Example: the length of the offset of 9 is 3; its unary encoding is 1110
– γ encoding of 9 → 1110001 (the length of the offset in unary, then the offset; the number itself is the leading 1 followed by the offset: 1001)
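A bit-string sketch of γ encoding and decoding (strings of '0'/'1' keep the bit-level structure visible; a real implementation would pack the bits):

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1: unary length of the offset, then the offset."""
    offset = bin(n)[3:]          # binary representation with the leading 1 chopped
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    """Decode one gamma-coded number; return (number, remaining bits)."""
    length = bits.index("0")                 # count the leading 1s
    offset = bits[length + 1 : 2 * length + 1]
    return int("1" + offset, 2), bits[2 * length + 1:]

print(gamma_encode(9))           # '1110001': the slide's example
print(gamma_decode("1110001"))   # (9, '')
```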
Elias γ encoding
Better than variable byte encoding, because γ encoding is bit-level encoding (empirically about 15% better on posting lists)
Decompression is less efficient than for variable byte encoding
Observation: the unary encoding of the length may use a lot of space
Solution: encode length + 1 using γ encoding again (Elias δ encoding)
– More effective for large numbers
– Example: 509 → 11111111011111101 (γ) → 111000111111101 (δ)
Question: why length + 1?
Omega encoding: recursively apply γ encoding on the length
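δ encoding reuses the γ encoder on the offset length, as described above; a sketch (gamma_encode repeated from the previous slide so the block is self-contained):

```python
def gamma_encode(n):             # as on the previous slide
    offset = bin(n)[3:]
    return "1" * len(offset) + "0" + offset

def delta_encode(n):
    """Elias delta: gamma-encode (offset length + 1), then append the offset."""
    offset = bin(n)[3:]
    return gamma_encode(len(offset) + 1) + offset

print(gamma_encode(509))   # '11111111011111101' (17 bits)
print(delta_encode(509))   # '111000111111101'   (15 bits): the slide's example
```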
References
Primarily: Introduction to Information Retrieval by Manning, Raghavan and Schütze: http://nlp.stanford.edu/IR-book/