Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.
-
Upload
reynard-farmer -
Category
Documents
-
view
218 -
download
2
Transcript of Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.
![Page 1: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/1.jpg)
IndicesTomasz Bartoszewski
![Page 2: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/2.jpg)
Inverted Index• Search
• Construction
• Compression
![Page 3: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/3.jpg)
Inverted Index• In its simplest form, the inverted index of a document
collection is basically a data structure that attaches each distinctive term with a list of all documents that contains the term.
![Page 4: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/4.jpg)
![Page 5: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/5.jpg)
Search Using an Inverted Index
![Page 6: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/6.jpg)
Step 1 – vocabulary searchfinds each query term in the vocabulary
If (Single term in query){
goto step3;
}
Else{
goto step2;
}
![Page 7: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/7.jpg)
Step 2 – results merging• merging of the lists is performed to find their intersection
• use the shortest list as the base
• partial match is possible
![Page 8: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/8.jpg)
Step 3 – rank score computation• based on a relevance function (e.g. okapi, cosine)
• score used in the final ranking
![Page 9: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/9.jpg)
Example
![Page 10: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/10.jpg)
Index Construction
![Page 11: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/11.jpg)
Time complexity
• O(T), where T is the number of all terms (including duplicates) in the document collection (after pre-processing)
![Page 12: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/12.jpg)
Index Compression
![Page 13: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/13.jpg)
Why?• avoid disk I/O
• the size of an inverted index can be reduced dramatically
• the original index can also be reconstructed
• all the information is represented with positive integers -> integer compression
![Page 14: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/14.jpg)
Use gaps• 4, 10, 300, and 305 -> 4, 6, 290 and 5
• Smaller numbers
• Large for rare terms – not a big problem
![Page 15: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/15.jpg)
All in one
![Page 16: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/16.jpg)
Unary• For x:
X-1 bits of 0 and one of 1
e.g.
5 -> 00001
7 -> 0000001
![Page 17: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/17.jpg)
Elias Gamma Coding• in unary (i.e., 0-bits followed by a 1-bit)
• followed by the binary representation of x without its most significant bit.
• efficient for small integers but is not suited to large integers
• is simply the number of bits of x in binary
• 9 -> 000 1001
![Page 18: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/18.jpg)
Elias Delta Coding• For small int longer than gamma codes (better for larger)
• gamma code representation of
• followed by the binary representation of x less the most significant bit
• Dla 9:
-> 00100
9 -> 00100 001
![Page 19: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/19.jpg)
Golomb Coding• values relative to a constant b
• several variations of the original Golomb
• E.g.
Remainder (b possible reminders e.g. b=3: 0,1,2)
binary representation of a remainder requires or
write the first few remainders using r
![Page 20: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/20.jpg)
Example• b=3 and x=9
• => ()
• Result 00010
![Page 21: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/21.jpg)
The coding tree for b=5
![Page 22: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/22.jpg)
Selection of b
• N – total number of documents
• – number of documents that contain term t
![Page 23: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/23.jpg)
Variable-Byte Coding• seven bits in each byte are used to code an integer
• last bit 0 – end, 1 – continue
• E.g. 135 -> 00000011 00001110
![Page 24: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/24.jpg)
Summary• Golomb coding better than Elias
• Gamma coding does not work well
• Variable-byte integers are often faster than Variable-bit (higher storage costs)
• compression technique can allow retrieval to be up to twice as fast than without compression
• space requirement averages 20% – 25% of the cost of storing uncompressed integers
![Page 25: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/25.jpg)
Latent Semantic Indexing
![Page 26: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/26.jpg)
Reason• many concepts or objects can be described in multiple ways
• find using synonyms of the words in the user query
• deal with this problem through the identification of statistical associations of terms
![Page 27: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/27.jpg)
Singular value decomposition (SVD)
• estimate latent structure, and to remove the “noise”
• hidden “concept” space, which associates syntactically different but semantically similar terms and documents
![Page 28: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/28.jpg)
LSI• LSI starts with an m*n termdocument matrix A
• row = term; column = document
• value e.g. term frequency
![Page 29: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/29.jpg)
Singular Value Decomposition• factor matrix A into three matrices:
m is the number of row in A
n is the number of columns in A
r is the rank of A,
![Page 30: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/30.jpg)
Singular Value Decomposition• U is a matrix and its columns, called left singular vectors, are
eigenvectors associated with the r non-zero eigenvalues of
• V is an matrix and its columns, called right singular vectors, are eigenvectors associated with the r non-zero eigenvalues of
• E is a diagonal matrix, E = diag(, , …, ), . , , …, , called singular values, are the non-negative square roots of r non-zero eigenvalues of they are arranged in decreasing order, i.e.,
• reduce the size of the matrices
![Page 31: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/31.jpg)
𝐴𝑘=𝑈𝑘𝐸𝑘𝑉 𝑘𝑇
![Page 32: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/32.jpg)
Query and Retrieval• q - user query (treated as a new document)
• document in the k-concept space, denoted by
![Page 33: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/33.jpg)
Example
![Page 34: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/34.jpg)
Example
![Page 35: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/35.jpg)
Example
![Page 36: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/36.jpg)
Example
![Page 37: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/37.jpg)
Example
![Page 38: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/38.jpg)
Example
![Page 39: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/39.jpg)
Exampleq - “user interface”
![Page 40: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/40.jpg)
Example
![Page 41: Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.](https://reader033.fdocuments.in/reader033/viewer/2022051516/56649eb45503460f94bbc700/html5/thumbnails/41.jpg)
Summary• The original paper of LSI suggests 50–350 dimensions.
• k needs to be determined based on the specific document collection
• association rules may be able to approximate the results of LSI