Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann
LECTURE 2 INDEXING 26.09.2012 Information Retrieval, ETHZ 2012 1
Today’s Overview
1. Introduction
2. Dictionaries
3. Index Construction
4. Distributed Indexing
5. Multiple Query Terms
6. Advanced Posting List Intersection
7. Web-scale Index Serving
Class from 9:15-10:45 (no break), 11-12: Exercise
INTRODUCTION
Basic Index: Challenge
Design a solution to a simple lookup problem:
Efficiently identify the documents containing a given term t. “Efficiently” = do this in time O(# documents returned).
Use a data structure constructed off-line (at indexing time) in order to avoid linear scanning at query time.
Trade off response time and query throughput against pre-processing costs and index space (memory, disk).
Any data structure for storing a set of records could be used. Here: focus on arrays & linked lists = posting lists.
Pre-Computer Age: Book Index
Book indexes record the pages mentioning (e.g.) keywords and names.
They go back to the age of printed books (15th century).
Posting Lists
[Figure: posting list for the term “ETHZ”, stored as an array or linked list: docID_1, docID_2, docID_3, docID_4, …]
docID_1 = docID(“swissinfo.ch/…”), docID_2 = docID(“wikipedia.org/wiki/ETH_Zurich”), docID_3 = docID(“www.systems.ethz.ch/…”), docID_4 = docID(“www.ethz.ch”)
DICTIONARIES
Basic Index: Dictionary
For each admissible (i.e. single term) query we need to find the corresponding posting list (if it exists, else NULL)
We need an efficient data structure for term look-up, i.e. a dictionary.
Preferred solution: Hash table
§ Hash function
§ Mechanism for dealing with collisions: e.g. linked list
§ O(1) access for “good” hash functions and a large enough table size n
§ Standard implementations: resize the table when the load factor exceeds 0.75
¤ Büttcher, S., Clarke Ch. L. A., and Cormack, G. V.: Information Retrieval. Implementing and Evaluating Search Engines, Section 4.2, 2010.
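The dictionary described on this slide can be sketched as a small hash table with collision lists. This is only an illustration: the class name `TermDictionary`, the initial slot count, and the doubling policy are assumptions, and Python's built-in `hash` stands in for the hash function.

```python
class TermDictionary:
    """Hash table with chaining: term -> posting list address (illustrative)."""

    def __init__(self, n_slots=8, max_load=0.75):
        self.slots = [[] for _ in range(n_slots)]  # one collision list per slot
        self.size = 0
        self.max_load = max_load

    def _slot(self, term):
        return hash(term) % len(self.slots)

    def put(self, term, address):
        chain = self.slots[self._slot(term)]
        for i, (t, _) in enumerate(chain):
            if t == term:
                chain[i] = (term, address)  # overwrite an existing entry
                return
        chain.append((term, address))
        self.size += 1
        if self.size / len(self.slots) > self.max_load:
            self._resize()

    def get(self, term):
        for t, address in self.slots[self._slot(term)]:
            if t == term:
                return address
        return None  # no posting list exists for this term

    def _resize(self):
        # Assumed policy: double the table and re-insert every entry.
        entries = [e for chain in self.slots for e in chain]
        self.slots = [[] for _ in range(2 * len(self.slots))]
        for t, address in entries:
            self.slots[self._slot(t)].append((t, address))


d = TermDictionary()
for term, addr in [("mountain", 549283471), ("ETHZ", 398437231),
                   ("class", 234443989), ("weather", 770209991)]:
    d.put(term, addr)
```

A lookup such as `d.get("ETHZ")` returns the stored address, and an unknown term yields `None`, matching the "if it exists, else NULL" behaviour described above.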
Dictionary Hash Table
[Figure: dictionary hash table. A hash function h maps terms (class, ETHZ, mountain, weather, …) to slots 0, 1, 2, …, r, r+1, …, n. Each slot points to a collision list of entries <token> <posting list address>, e.g. mountain 549283471, ETHZ 398437231, class 234443989, weather 770209991.]
INDEX CONSTRUCTION
Basic Index: Generation
Construct all posting lists in one pass over the document collection.
INDEXGENERATION(C)
  for all documents d in collection C
    for all terms t occurring in d
      if not EXISTS(posting_list(t))
        then CREATE(posting_list(t))
             ADD(posting_list(t), d)
      else if not CONTAINS(posting_list(t), d)
        then ADD(posting_list(t), d)
  return posting_list
Note: indexing terms (=vocabulary) can be identified on the fly. Dictionary construction can happen in parallel.
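A minimal Python sketch of the one-pass generation above. It assumes the collection is a dict from integer doc ids to text, and it tokenizes by lowercasing and splitting on whitespace; both are simplifications not specified on the slide.

```python
from collections import defaultdict

def index_generation(collection):
    """collection: dict doc_id -> text. Returns term -> sorted list of doc_ids."""
    posting_list = defaultdict(list)
    for doc_id in sorted(collection):          # visit documents in ascending id order
        for term in collection[doc_id].lower().split():
            postings = posting_list[term]      # created on first use
            # Docs arrive in order, so checking the last entry suffices
            # to add each document at most once.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(posting_list)

docs = {1: "ETHZ information retrieval",
        2: "information systems ETHZ",
        3: "mountain weather"}
index = index_generation(docs)
```

The vocabulary is identified on the fly: a term gets its posting list the first time it is seen.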
Index Construction
Conceptually: 3 steps
1. Make a pass through the collection and assemble all postings, i.e. pairs (term, doc-id) or (term-id, doc-id)
2. Sort the postings using the term(-id) as the primary and the doc-id as the secondary key
3. Organize doc-ids into posting lists for each term
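The three steps can be sketched in a few lines of Python for a collection that fits in memory. The function name and the whitespace tokenization are assumptions made for illustration.

```python
from itertools import groupby

def build_index_by_sorting(collection):
    """collection: dict doc_id -> text. Returns term -> sorted list of doc_ids."""
    # Step 1: assemble all (term, doc_id) postings in one pass.
    postings = [(term, doc_id)
                for doc_id, text in collection.items()
                for term in text.lower().split()]
    # Step 2: sort by term (primary key) and doc_id (secondary key);
    # duplicates of the same pair are dropped first.
    postings = sorted(set(postings))
    # Step 3: organize the doc_ids of each term into one posting list.
    return {term: [doc_id for _, doc_id in group]
            for term, group in groupby(postings, key=lambda p: p[0])}
```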
Scalable Index Construction
In-memory index construction does not scale. How can we construct an index for very large collections, taking into account the hardware constraints on memory, disk, speed, etc.?
Sort-Based Index Construction
As we build the index, we parse documents one at a time. The final postings list for any term is potentially incomplete until the end.
At 10–12 bytes per postings entry, this demands a lot of space for large collections.
For large document collections, we need to store intermediate results on disk.
Blocked Sort-Based Indexing (BSBI)
12-byte (4+4+4) postings (term-id, doc-id, frequency of the term in the document)
Must now sort many billions of postings by term-id. Define a block to consist of (say) 10M such postings. We can easily fit that many postings into memory.
Basic idea of algorithm:
Accumulate postings for each block, sort, write to disk. Then merge the blocks into one long sorted order.
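The basic idea can be sketched as follows. To keep the example self-contained, the sorted blocks are held in Python lists that stand in for block files on disk; a real implementation would write each run to disk and stream it back during the merge. The function name `bsbi` is illustrative.

```python
import heapq
from itertools import islice

def bsbi(postings, block_size):
    """postings: iterable of (term_id, doc_id) pairs. Returns one sorted list."""
    it = iter(postings)
    runs = []                                  # stand-in for sorted block files
    while True:
        block = list(islice(it, block_size))   # accumulate one block in memory
        if not block:
            break
        block.sort()                           # sort by term_id, then doc_id
        runs.append(block)                     # "write to disk"
    return list(heapq.merge(*runs))            # k-way merge of the sorted runs
```

`heapq.merge` only ever looks at the head of each run, which is what makes the merge feasible when the runs do not fit into memory together.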
BSBI Index Construction
BSBI: Merging Blocks
Problems with Sort-Based Algorithm
Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term to term-id mapping. Actually, we could work with (term, doc-id) postings instead of (term-id, doc-id) postings . . .
. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)
… term fingerprinting is an alternative, but inexact.
Single-Pass In-Memory Indexing
Abbreviation: SPIMI
Key idea #1: Generate separate dictionaries for each block – no need to maintain a term to term-id mapping across blocks.
Key idea #2: Don’t sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.
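A rough sketch of both ideas, under the assumption that blocks cover increasing doc-id ranges so that the per-term lists of consecutive blocks can simply be concatenated. The function names and the tokenization are illustrative.

```python
from collections import defaultdict

def spimi_invert(block):
    """block: list of (doc_id, text) in ascending doc_id order."""
    index = defaultdict(list)          # per-block dictionary, keyed by term string
    for doc_id, text in block:
        for term in text.lower().split():
            if not index[term] or index[term][-1] != doc_id:
                index[term].append(doc_id)   # append as postings occur, no sorting
    return index

def merge_indexes(block_indexes):
    """Assumes blocks arrive in increasing doc_id order, so lists stay sorted."""
    merged = defaultdict(list)
    for index in block_indexes:
        for term, postings in index.items():
            merged[term].extend(postings)
    return dict(merged)

blocks = [[(1, "eth zurich"), (2, "zurich lake")], [(3, "eth lake lake")]]
index = merge_indexes(spimi_invert(b) for b in blocks)
```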
DISTRIBUTED INDEXING
Distributed Index Generation
For web-scale indexing: must use a distributed computer cluster
Individual machines are fault-prone and may unpredictably slow down or fail
How do we exploit such a pool of machines?
Master Coordination
Maintain a master machine directing the indexing job – considered “safe”
Break up indexing into sets of parallel tasks
Master machine assigns each task to an idle machine from a pool.
Parallel Tasks
We will define two sets of parallel tasks and deploy two types of machines to solve them:
§ Parsers
§ Inverters
Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI)
Each split is a subset of documents.
Parsers
Master assigns a split to an idle parser machine. Parser reads a document at a time and emits (term, doc) pairs.
Parser writes pairs into j term-partitions, each covering a range of terms’ first letters,
e.g., a-f, g-p, q-z (here: j = 3)
Inverters
An inverter collects all (term, doc) pairs (= postings) for one term-partition.
Sorts and writes to postings lists
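The parser and inverter roles can be simulated on a single machine as below. This is a sketch only: there is no master, no fault handling, and no network; the partition boundaries follow the a-f, g-p, q-z example, and sending terms with other first characters to the last partition is an assumption.

```python
from collections import defaultdict

PARTITIONS = ["af", "gp", "qz"]  # first-letter ranges a-f, g-p, q-z (j = 3)

def partition_of(term):
    for j, (lo, hi) in enumerate(PARTITIONS):
        if lo <= term[0] <= hi:
            return j
    return len(PARTITIONS) - 1   # assumption: other first characters go last

def parse(split):
    """Parser: emit the (term, doc) pairs of one split into j term-partitions."""
    out = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            out[partition_of(term)].append((term, doc_id))
    return out

def invert(pairs):
    """Inverter: sort the pairs of one term-partition and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "alpine weather")], [(2, "zurich weather"), (3, "alpine hut")]]
parsed = [parse(s) for s in splits]                       # one parser per split
index = [invert([p for out in parsed for p in out[j]])    # one inverter per partition
         for j in range(len(PARTITIONS))]
```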
Data Flow
Map Reduce
The index construction algorithm we just described is an instance of Map Reduce.
Map Reduce is a robust and conceptually simple framework for distributed computing, without having to write code for the distribution part.
The open source version is called Hadoop.
Hadoop is a key tool for big data. See lecture 3 of Donald Kossmann’s class.
¤ J. Dean & S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. 6th Symposium on Operating System Design & Implementation, 2004.
MULTIPLE QUERY TERMS
Basic Index: Modified Challenge
Deal with multiple terms:
§ Efficiently identify documents containing a given set of terms t1,…,tk.
§ This is also known as Boolean retrieval with “AND”.
In which way do we need to generalize the
• index data structures
• index generation, and
• query processing?
Multi-Term Posting Lists
Challenge: #terms may be large: O(billions), but #sets of terms grows exponentially in the set size
In practice some sets (or n-grams) of terms may be used frequently (“mountain bike trails”), but most term combinations will never be observed.
Idea #1: Multi-term posting lists
§ Identify frequent k-term combinations (from documents or query logs, k=2 or k=3). Create posting lists for those.
§ Advantage: popular k-term queries can be answered as fast as one-term queries
Intersecting Posting Lists
Idea #2: Traverse multiple posting lists in parallel to compute intersection.
In order to be effective (for ~ equal length posting lists): sorted posting lists - sort entries in each list using the same total order (e.g. ascending documentID).
Basic method:
§ Always advance in posting list with smallest current element.
§ Check for documents contained in all lists
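The basic method can be sketched for k posting lists as below. The example lists reuse the docIDs visible in the lecture's example but are truncated where the slides show “…”, so they are illustrative only.

```python
def intersect_all(lists):
    """Intersect k posting lists, each sorted by ascending docID."""
    if not lists:
        return []
    pos = [0] * len(lists)                     # one cursor per posting list
    answer = []
    while all(p < len(l) for p, l in zip(pos, lists)):
        current = [l[p] for p, l in zip(pos, lists)]
        smallest = min(current)
        if smallest == max(current):           # same docID under every cursor
            answer.append(smallest)
            pos = [p + 1 for p in pos]
        else:
            pos[current.index(smallest)] += 1  # advance list with smallest element
    return answer

ethz = [370871, 391223, 623920, 789908]
systems = [177883, 300012, 391223, 391248]
information_retrieval = [391223, 800123, 990002, 991226]
result = intersect_all([ethz, systems, information_retrieval])
```

The traversal stops as soon as any one list is exhausted, since no further document can then be contained in all lists.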
Intersecting Posting Lists: Example
[Figure: three posting lists; an unsorted list (370871 927382 391223 623920 …) is first sorted by docID]
ETHZ: 370871 391223 623920 … 789908
systems: 177883 300012 391223 391248 …
information retrieval: 391223 800123 990002 991226 …
docID 391223 occurs in all three lists: add it to the result set
Intersection Algorithm
For simplicity, we focus on the case of two posting lists
Multiple terms can be handled by generalizing to k posting lists
… or by creating temp intermediate posting lists and recursion
Optimization: start with shorter posting lists
INTERSECT(p1, p2)
  answer := < >
  while (p1 != NULL) AND (p2 != NULL)
    if docID(p1) == docID(p2) then
      ADD(answer, docID(p1))
      ADVANCE(p1)
      ADVANCE(p2)
    else if docID(p1) < docID(p2)
      ADVANCE(p1)
    else
      ADVANCE(p2)
  return answer
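A direct Python transcription of INTERSECT, using array indices in place of list pointers, followed by one possible way to handle several terms through intermediate posting lists, starting with the shortest. The helper name `intersect_many` is an assumption.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by ascending docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(lists):
    """Fold INTERSECT over k non-empty-count lists, shortest first."""
    lists = sorted(lists, key=len)      # optimization: start with shorter lists
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)   # intermediate posting list
    return result
```

Starting with the shortest lists keeps every intermediate result no longer than the shortest input.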
Intersecting Posting Lists
How expensive is the parallel intersection of k posting lists?
Number of pointer advances: at most |L_1| + … + |L_k|, i.e. linear in the total length of the lists.
Reasonable efficiency, if posting lists are approximately of the same length.
Access time dominated by longest posting list. Can we also devise a method that is dominated by the shortest?
ADVANCED POSTING LIST INTERSECTION
Alternative Posting List Intersection
Naïve approach when |L_1| << |L_2|
§ Build a hash map dictionary of docIDs for L_1 and L_2
§ Lookup the elements of L_1 in the dictionary for L_2
§ O(|L_1|) time
Only works well in highly asymmetric case. Can we do better?
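A short sketch of the lookup-based approach, assuming the hash set for the long list has already been built (for example at indexing time), so that only the |L_1| lookups count at query time. The example values are made up.

```python
def intersect_by_lookup(short_list, long_list_ids):
    """long_list_ids: a hash set of the docIDs in the long posting list."""
    return [doc_id for doc_id in short_list if doc_id in long_list_ids]

l1 = [8, 41, 77]                     # short posting list
l2 = set(range(0, 100, 2)) | {41}    # hash set built once for the long list
common = intersect_by_lookup(l1, l2)
```

Each membership test is expected O(1), which gives the O(|L_1|) bound. If the two lists have similar length, the merge-based intersection does comparable work without the extra hash structure.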
Alternative Posting List Intersection: Refinement
§ Compute hashed sets h(L1) and h(L2)
§ Bucketed bit set representation of set of hash values
§ Fast intersection in bit set representation
§ Exact intersection
§ #bits in h: small enough to allow for fast intersection; large enough to make L’1 and L’2 small.
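One way to read the bullets above, sketched loosely and not taken from the cited paper: hash every docID to a few bits, intersect the two bit sets, keep only the elements whose hash value survives (assumed here to be what L’1 and L’2 denote), and then intersect those reduced lists exactly. The multiplicative hash and the default of 8 bits are arbitrary illustrative choices.

```python
def hashed_intersection(l1, l2, bits=8):
    """Exact intersection of two docID lists via a hashed bit set filter."""
    mask = (1 << bits) - 1

    def h(doc_id):
        return (doc_id * 2654435761) & mask    # assumed hash function

    def bitset(ids):
        b = 0
        for d in ids:
            b |= 1 << h(d)                     # set the bit of each hash value
        return b

    common = bitset(l1) & bitset(l2)           # fast intersection of hash values
    # Reduced lists: elements whose hash value occurs on both sides.
    r1 = [d for d in l1 if (common >> h(d)) & 1]
    r2 = {d for d in l2 if (common >> h(d)) & 1}
    return [d for d in r1 if d in r2]          # exact intersection
```

With too few bits almost every hash value survives and nothing is filtered; with too many bits the bit sets themselves become expensive to intersect, which is the trade-off named in the last bullet.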
¤ P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages 739–750, 2007.
Posting Lists with Skip Pointers
Other ways to speed up list based intersection: introduce skip pointers
Traverse skip pointers instead of next element pointer, if whole segment can be skipped.
Where to put skip pointers? Heuristic: for a posting list of length L, use about sqrt(L) evenly spaced skip pointers.
Trade-offs:
(1) space and I/O (!) requirements for skip pointers vs. not
(2) additional comparisons with skip pointers vs. skip gains
Use of Skip Pointers: Example
Suppose 8 has been reached in both lists. The next element in the top list is 41, and we advance to it. In the bottom list, we can then follow a skip pointer over the block and move past 31, skipping 4 elements.
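Intersection with skip pointers can be sketched as below. Skip pointers are not stored explicitly here; instead every index that is a multiple of the skip length is assumed to carry a pointer to the index one skip length further, with the skip length set to roughly sqrt of the list length. The example lists are hypothetical values chosen to resemble the situation described above.

```python
import math

def intersect_with_skips(p1, p2):
    """Both lists sorted ascending; implicit skip pointers with sqrt spacing."""
    s1 = max(1, math.isqrt(len(p1)))   # skip length for p1
    s2 = max(1, math.isqrt(len(p2)))   # skip length for p2
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer only if its target does not overshoot.
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

top = [2, 4, 8, 41, 48, 64, 128]
bottom = [1, 2, 3, 8, 11, 17, 21, 31]
result = intersect_with_skips(top, bottom)
```

A skip is safe because every element jumped over is smaller than the skip target, which in turn is no larger than the current element of the other list.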
WEB SCALE INDEX SERVING
Disk vs. RAM
When building a scalable (i.e. Web-scale) index, one key design question is whether to use disk or RAM (today also: SSD).
§ RAM ~200x more expensive than disk
§ Disk ~10-20x slower to access
§ Additional overhead for random access = disk seeks
Hardware economics also influence system architecture.
Distributed Index: Sharding
A related problem for large indexes is how to split up the index into pieces or shards.
Relevant performance dimensions are response time or latency (how long does it take to compute a response?) as well as throughput (how many queries/s can be answered?). In addition, fault tolerance may be an issue.
There are two basic ways of sharding: document sharding or term sharding.
Document sharding: each shard contains short posting lists (for a subset of documents).
Term sharding: each shard contains few posting lists (complete lists, for a subset of terms).
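The two schemes can be contrasted on a toy index. The assignment rules used here (docID modulo the shard count, and round-robin over sorted terms) are assumptions for illustration; real systems choose their own partitioning functions.

```python
from collections import defaultdict

def document_shards(index, n):
    """Each shard holds short posting lists for a subset of documents."""
    shards = [defaultdict(list) for _ in range(n)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[doc_id % n][term].append(doc_id)   # assumed: shard by docID mod n
    return [dict(s) for s in shards]

def term_shards(index, n):
    """Each shard holds a few complete posting lists."""
    shards = [{} for _ in range(n)]
    for k, term in enumerate(sorted(index)):
        shards[k % n][term] = list(index[term])       # assumed: round-robin by term
    return shards

index = {"eth": [1, 2, 3, 4], "lake": [2, 4]}
by_doc = document_shards(index, 2)
by_term = term_shards(index, 2)
```

In `by_doc`, a query for "eth" must visit both shards; in `by_term`, it visits only the shard that owns "eth".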
Document Sharding
Term Sharding
Document Sharding - Pros & Cons
Pros
§ each shard can process queries independently
§ easy to keep additional per-doc information
§ network traffic (requests/responses) small
Cons
§ query has to be processed by each shard
§ O(K*N) disk seeks for K word query on N shards
Term Sharding - Pros & Cons
Pros
§ K word query => handled by at most K shards
§ O(K) disk seeks for K word query
Cons
§ much higher network bandwidth needed
§ data about each term for each matching doc must be collected in one place
§ harder to have per-doc information
Document sharding is the “standard” approach in Web search.
Basic Design Principles
Document Keying
§ Documents assigned small integer ids (docids)
§ Smaller ids for higher quality/more important docs: allows for approximation/cut-offs
Index Servers
§ Given (query) return sorted list of (score, docid, ...)
§ Partitioned (“sharded”) by docid
§ Index shards are replicated for capacity
§ Cost is O(# queries * # docs in index)
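With the index partitioned by docid, a frontend sends the query to every shard and merges the sorted (score, docid) lists that come back. The sketch below shows only the merge step; the function name, the two-field result tuples, and the top-k cut-off are assumptions.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, k):
    """shard_results: per shard, a list of (score, docid) sorted by descending score."""
    # heapq.merge expects inputs sorted ascending by the key, hence -score.
    merged = heapq.merge(*shard_results, key=lambda r: -r[0])
    return list(islice(merged, k))             # global top-k

top = merge_shard_results([[(0.9, 3), (0.2, 7)],
                           [(0.8, 4), (0.5, 10)]], 3)
```

Because each shard already returns its results sorted, the frontend only needs to look at the head of each list to produce the global ranking.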
Web Search Serving System (Google @ year ~2000)
Caching
Cache servers
§ Cache both index results and doc snippets
§ Hit rates typically 30-60%
• Depends on frequency of index updates, query traffic, level of personalization, etc.
Main benefits
§ Performance! 10s of machines do work of 100(0)s
§ Reduce query latency on hits
§ Cache served queries are typically popular and often expensive
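A query result cache in front of the index servers might look like the sketch below. The slides do not name an eviction policy; least-recently-used eviction, the class name `QueryCache`, and the `backend` callable are assumptions made for illustration.

```python
from collections import OrderedDict

class QueryCache:
    """Caches query results; evicts the least recently used entry (assumed policy)."""

    def __init__(self, capacity, backend):
        self.capacity = capacity
        self.backend = backend                    # callable: query -> result
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def search(self, query):
        if query in self.entries:
            self.hits += 1
            self.entries.move_to_end(query)       # mark as most recently used
            return self.entries[query]
        self.misses += 1
        result = self.backend(query)              # expensive index lookup
        self.entries[query] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict least recently used
        return result
```

The hit rate is `hits / (hits + misses)`; on a hit the index servers are not contacted at all, which is where the latency and capacity savings come from.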
Dealing with Growth
More web pages: more shards
More queries: more replicas
From Document Sharding to In-memory Index
Must add shards to keep response time low as index size increases
... but query cost increases with # of shards
§ typically >= 1 disk seek / shard / query term
§ even for very rare terms
As # of replicas increases, total amount of memory available increases
Eventually, have enough memory to hold an entire copy of the index in memory.
Radically changes many design parameters
In Memory Index (a la Google)
Anecdote from the Life of a Search Engine, 1999
Index updates (~once per month)
§ Wait until traffic is low
§ Take some replicas offline
§ Copy new index to these replicas
§ Start new frontends pointing at updated index
Disk-optimized update scheme