Transcript of L06IndexConstruction
-
8/3/2019 L06IndexConstruction
1/47
Prasad L06IndexConstruction 1
Index Construction
Adapted from Lectures by
Prabhakar Raghavan (Yahoo and Stanford) and
Christopher Manning (Stanford)
-
Plan
- Last lectures:
  - Dictionary data structures
  - Tolerant retrieval
    - Wildcards
    - Spell correction
    - Soundex
- This time: Index construction

[Figure: B-tree over the dictionary, with ranges a-hu, hy-m, n-z, leading down to term entries such as mace, madden, abandon, amortize, among]
-
Index construction
- How do we construct an index?
- What strategies can we use with limited main memory?

Hardware basics
- Many design decisions in information retrieval are based on the characteristics of hardware.
- We begin by reviewing hardware basics.
-
Hardware basics
- Access to data in memory is much faster than access to data on disk.
- Disk seeks: no data is transferred from disk while the disk head is being positioned.
- Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
- Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
- Block sizes: 8 KB to 256 KB.
-
Hard disk geometry and terminology
-
Hardware basics
- Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
- Available disk space is several (2-3) orders of magnitude larger.
- Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.
-
Hardware assumptions
symbol  statistic                        value
s       average seek time                5 ms = 5 x 10^-3 s
b       transfer time per byte           0.02 us = 2 x 10^-8 s
        processor's clock rate           10^9 s^-1
p       low-level operation              0.01 us = 10^-8 s
        (e.g., compare & swap a word)
        size of main memory              several GB
        size of disk space               1 TB or more
        memory transfer time per byte    5 ns
-
RCV1: Our corpus for this lecture
- Shakespeare's collected works definitely aren't large enough.
- The corpus we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection (approx. 1 GB).
  - This is one year of Reuters newswire (part of 1996 and 1997).
-
A Reuters RCV1 document
-
Reuters RCV1 statistics
symbol  statistic                        value
N       documents                        800,000
L       avg. # tokens per doc            200
M       terms (= word types)             400,000
        avg. # bytes per token           6 (incl. spaces/punct.)
        avg. # bytes per token           4.5 (without spaces/punct.)
        avg. # bytes per term            7.5
T       non-positional postings          100,000,000

4.5 bytes per word token vs. 7.5 bytes per word type: why?
-
Recall IIR1 index construction
- Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar; I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2
-
Key step
- After all documents have been parsed, the inverted file is sorted by terms.
- We focus on this sort step. We have 100M items to sort.

Before sorting, the (term, doc #) pairs appear in parsing order, as on the previous slide; after sorting by term:

Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
caesar    2
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
you       2
was       1
was       2
with      2
-
Scaling index construction
- In-memory index construction does not scale.
- How can we construct an index for very large collections?
- Taking into account the hardware constraints we just learned about: memory, disk, speed, etc.
-
Sort-based index construction
- As we build the index, we parse docs one at a time.
- While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
- The final postings for any term are incomplete until the end.
- At 12 bytes per postings entry, this demands a lot of space for large collections.
  - T = 100,000,000 in the case of RCV1.
  - So we could do this in memory in 2008, but typical collections are much larger; e.g., the New York Times provides an index of >150 years of newswire.
- Thus: we need to store intermediate results on disk.
-
Use the same algorithm for disk?
- Can we use the same index construction algorithm (internal sorting algorithms) for larger collections, but using disk instead of memory?
- No: sorting T = 100,000,000 records on disk is too slow: too many disk seeks.
- We need an external sorting algorithm.
-
Bottleneck
- Parse and build postings entries one doc at a time.
- Now sort postings entries by term (then by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort T = 100M records.

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
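Plugging in the hardware assumptions from the earlier slide (N = 100M records, 5 ms average seek), the exercise can be worked as a back-of-the-envelope calculation; this sketch is illustrative and not part of the original slides:

```python
import math

N = 100_000_000      # postings records to sort
SEEK_TIME = 5e-3     # average seek time in seconds, from the hardware assumptions

comparisons = N * math.log2(N)          # ~2.7e9 comparisons
seconds = comparisons * 2 * SEEK_TIME   # 2 disk seeks per comparison
days = seconds / 86_400

print(f"{comparisons:.2e} comparisons, ~{days:.0f} days of seeking")
```

The answer, roughly 300 days, is why disk-seek-bound sorting is hopeless and an external sorting algorithm is needed.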
-
BSBI: Blocked sort-based indexing
(Sorting with fewer disk seeks)
- 12-byte (4+4+4) records (term, doc, freq).
- These are generated as we parse docs.
- Must now sort 100M such 12-byte records by term.
- Define a block as ~10M such records.
  - Can easily fit a couple into memory.
  - Will have 10 such blocks to start with.
- Basic idea of the algorithm:
  - Accumulate postings for each block, sort, write to disk.
  - Then merge the blocks into one long sorted order.
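The basic idea can be sketched in a few lines. This is a minimal single-machine sketch with hypothetical helper names; a real BSBI implementation works with packed 12-byte records and binary on-disk runs rather than pickled Python lists:

```python
import heapq
import pickle
import tempfile

def bsbi_index(record_stream, block_size=10_000_000):
    """Sort (termID, docID) records in fixed-size blocks, then merge the runs."""
    runs = []
    block = []
    for record in record_stream:
        block.append(record)
        if len(block) == block_size:      # block full: sort it and spill to disk
            runs.append(_write_sorted_run(block))
            block = []
    if block:
        runs.append(_write_sorted_run(block))
    # n-way merge of the sorted runs into one long sorted stream
    return heapq.merge(*[_read_run(path) for path in runs])

def _write_sorted_run(block):
    block.sort()                          # in-memory sort of one block
    f = tempfile.NamedTemporaryFile(delete=False)
    pickle.dump(block, f)
    f.close()
    return f.name

def _read_run(path):
    with open(path, "rb") as f:
        yield from pickle.load(f)
```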
-
Sorting 10 blocks of 10M records
- First, read each block and sort within:
  - Quicksort takes 2N ln N expected steps.
  - In our case, 2 x (10M log 10M) steps.
- Exercise: estimate the total time to read each block from disk and quicksort it.
- 10 times this estimate gives us 10 sorted runs of 10M records each.
- Done straightforwardly, we need 2 copies of the data on disk, but we can optimize this.
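One way to work the exercise, using the values from the hardware assumptions slide; treat the result as an order-of-magnitude estimate, not a benchmark:

```python
import math

# Hardware assumptions from the earlier slide
SEEK = 5e-3               # s, average seek time
TRANSFER_PER_BYTE = 2e-8  # s, disk transfer time per byte
OP_TIME = 1e-8            # s, one low-level operation (compare/swap)

RECORDS = 10_000_000
BYTES = RECORDS * 12      # 12-byte (term, doc, freq) records = 120 MB per block

read = SEEK + BYTES * TRANSFER_PER_BYTE            # one seek, then sequential read
sort = 2 * RECORDS * math.log2(RECORDS) * OP_TIME  # quicksort's ~2N log N steps
write = SEEK + BYTES * TRANSFER_PER_BYTE

per_block = read + sort + write
print(f"~{per_block:.1f} s per block, ~{10 * per_block:.0f} s for all 10 blocks")
```

Reading and writing dominate less than one might fear: each block costs a few seconds of sequential I/O plus a few seconds of in-memory sorting, so all 10 runs are produced in well under two minutes.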
-
How to merge the sorted runs?
- Can do binary merges, with a merge tree of log2 10 = 4 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.

[Figure: two sorted runs on disk, each made of blocks 1-4, being read, merged, and written back as a single merged run.]
-
How to merge the sorted runs?
- But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously.
- Provided you read decent-sized chunks of each block into memory, you're not killed by disk seeks.
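An n-way merge along these lines can be sketched with the standard library. The run file layout (one "term docID" posting per line, sorted by term) is an assumption for illustration; `heapq.merge` keeps only one line per run in memory, whereas a real system reads decent-sized chunks to amortize seeks:

```python
import heapq

def merge_runs(run_paths, out_path):
    """Stream-merge sorted run files into one sorted output file."""
    files = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as out:
            # lazily pulls the smallest next line across all open runs
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()
```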
-
Remaining problem with sort-based algorithm
- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) to map a term to a termID.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings...
- ...but then intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)
-
SPIMI: Single-pass in-memory indexing
- Key idea 1: generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.
- Key idea 2: don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.
-
SPIMI-Invert
- Merging of blocks is analogous to BSBI.
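The SPIMI-Invert pseudocode itself is not reproduced in this transcript, so here is a rough sketch of the idea under stated assumptions: a stream of (term, docID) tokens, and a posting-count limit standing in for "memory is full". Postings are never sorted; terms are sorted only once, when the block is written out:

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    """Build one in-memory block index; return sorted (term, postings) pairs."""
    dictionary = defaultdict(list)        # term -> postings list, grown on the fly
    count = 0
    for term, doc_id in token_stream:
        dictionary[term].append(doc_id)   # no termID mapping, no sort step
        count += 1
        if count >= max_postings:         # block full: stop and flush this block
            break
    # sort terms once so the blocks can later be merged like BSBI runs
    return sorted(dictionary.items())
```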
-
SPIMI: Compression
- Compression makes SPIMI even more efficient:
  - Compression of terms
  - Compression of postings
- See next lecture.
-
Distributed indexing
- For web-scale indexing (don't try this at home!): must use a distributed computing cluster.
- Individual machines are fault-prone: they can unpredictably slow down or fail.
- How do we exploit such a pool of machines?
-
Google data centers
- Google data centers mainly contain commodity machines and are distributed around the world.
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007).
- Estimate: Google installs 100,000 servers each quarter.
  - Based on expenditures of 200-250 million dollars per year.
- This would be 10% of the computing capacity of the world!?!
-
Google data centers
- If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
- Answer: about 37% (0.999^1000 ~ 0.37).
- Equivalently, 1 - 0.999^1000 ~ 0.63 is the probability that at least one node is down at any moment.
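The arithmetic can be checked directly: the system is up only when all 1000 independent nodes are up.

```python
p_node_up = 0.999
nodes = 1000

p_system_up = p_node_up ** nodes     # all nodes up simultaneously, ~0.37
p_some_node_down = 1 - p_system_up   # at least one node down, ~0.63

print(f"system uptime: {p_system_up:.2f}, P(at least one down): {p_some_node_down:.2f}")
```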
-
Distributed indexing
- Maintain a master machine directing the indexing job; it is considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.
-
Parallel tasks
- We will use two sets of parallel tasks: parsers and inverters.
- Break the input document corpus into splits.
- Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).
-
Parsers
- Master assigns a split to an idle parser machine.
- Parser reads a document at a time and emits (term, doc) pairs.
- Parser writes pairs into j partitions.
- Each partition is for a range of terms' first letters (e.g., a-f, g-p, q-z); here j = 3.
- Now to complete the index inversion...
-
Inverters
- An inverter collects all (term, doc) pairs (= postings) for one term-partition.
- Sorts and writes to postings lists.
-
Data flow

[Figure: the master assigns splits to parsers; each parser writes segment files partitioned by term range (a-f, g-p, q-z); one inverter per term range reads the matching segment files and writes postings. Parsing is the map phase; the segment files sit between the phases; inverting is the reduce phase.]
-
MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing, without having to write code for the distribution part.
- They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
-
MapReduce
- Index construction was just one phase.
- Another phase: transforming a term-partitioned index into a document-partitioned index.
  - Term-partitioned: one machine handles a subrange of terms.
  - Document-partitioned: one machine handles a subrange of documents.
- (As we discuss in the web part of the course,) most search engines use a document-partitioned index (better load balancing, etc.).
-
Schema for index construction in MapReduce
- Schema of map and reduce functions:
  - map: input -> list(k, v)
  - reduce: (k, list(v)) -> output
- Instantiation of the schema for index construction:
  - map: web collection -> list(termID, docID)
  - reduce: (<termID1, list(docID)>, <termID2, list(docID)>, ...) -> (postings list1, postings list2, ...)
- Example for index construction:
  - map: d2: "C died." d1: "C came, C c'ed." -> (<C, d2>, <died, d2>, <C, d1>, <came, d1>, <C, d1>, <c'ed, d1>)
  - reduce: (<C, (d2, d1, d1)>, <died, (d2)>, <came, (d1)>, <c'ed, (d1)>) -> (<C, (d1:2, d2:1)>, <died, (d2:1)>, <came, (d1:1)>, <c'ed, (d1:1)>)
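The schema can be mimicked in a single process. This toy sketch is illustrative only: the function names are made up, and a real MapReduce framework handles the grouping, distribution, and fault tolerance that are simply inlined here:

```python
from collections import Counter, defaultdict

def map_phase(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """reduce: grouped (term, docIDs) -> postings lists with term frequencies."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    # postings list per term: sorted (docID, frequency) entries, like (d1:2, d2:1)
    return {term: sorted(Counter(docs).items()) for term, docs in grouped.items()}

docs = {1: "c came c ced", 2: "c died"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
```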
-
Dynamic indexing
- Up to now, we have assumed that collections are static.
- They rarely are:
  - Documents come in over time and need to be inserted.
  - Documents are deleted and modified.
- This means that the dictionary and postings lists have to be modified:
  - Postings updates for terms already in the dictionary.
  - New terms added to the dictionary.
-
Simplest approach
- Maintain a big main index.
- New docs go into a small auxiliary index.
- Search across both, merge results.
- Deletions:
  - Invalidation bit-vector for deleted docs.
  - Filter docs in a search result by this invalidation bit-vector.
- Periodically, re-index into one main index.
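The query side of this scheme is a one-liner. The dict-of-sets layout below is a stand-in for real postings structures, and the set of invalidated docIDs plays the role of the invalidation bit-vector:

```python
def search(term, main_index, aux_index, invalidated):
    """Search main and auxiliary indexes, then filter out deleted docs."""
    hits = main_index.get(term, set()) | aux_index.get(term, set())
    return hits - invalidated   # apply the invalidation bit-vector
```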
-
Issues with main and auxiliary indexes
- Problem of frequent merges: you touch stuff a lot.
- Poor performance during merge.
- Actually:
  - Merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
  - The merge is then the same as a simple append.
  - But then we would need a lot of files: inefficient for the O/S.
- Assumption for the rest of the lecture: the index is one big file.
- In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).
-
Logarithmic merge
- Maintain a series of indexes, each twice as large as the previous one.
- Keep the smallest (Z0) in memory.
- Keep the larger ones (I0, I1, ...) on disk.
- If Z0 gets too big (> n), write it to disk as I0,
- or merge it with I0 (if I0 already exists) as Z1.
- Either write merge Z1 to disk as I1 (if there is no I1),
- or merge it with I1 to form Z2, and so on.
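A sketch of the cascade, with plain lists standing in for the on-disk indexes I0, I1, ... (a `None` slot means "no index at that level") and merging simplified to concatenate-and-sort:

```python
def log_merge_add(z0, levels, posting, n=4):
    """Add one posting to Z0; spill and merge up through the levels when full."""
    z0.append(posting)
    if len(z0) < n:
        return z0, levels
    merged = sorted(z0)                      # Z0 is full: flush it
    i = 0
    while i < len(levels) and levels[i] is not None:
        merged = sorted(merged + levels[i])  # merge with existing I_i -> Z_{i+1}
        levels[i] = None
        i += 1
    if i == len(levels):
        levels.append(None)
    levels[i] = merged                       # write Z_i to disk as I_i
    return [], levels
```

Each posting participates in at most O(log T) merges, which is where the O(T log T) construction bound comes from.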
-
Logarithmic merge
- Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
- Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T).
- So logarithmic merge is much more efficient for index construction.
- But query processing now requires merging O(log T) indexes, whereas it is O(1) if you just have a main and auxiliary index.
-
Further issues with multiple indexes
- Corpus-wide statistics are hard to maintain.
- E.g., when we spoke of spell correction: which of several corrected alternatives do we present to the user?
  - We said: pick the one with the most hits.
- How do we maintain the top ones with multiple indexes and invalidation bit vectors?
- One possibility: ignore everything but the main index for such ordering.
- We will see more such statistics used in results ranking.
-
Dynamic indexing at search engines
- All the large search engines now do dynamic indexing.
- Their indices have frequent incremental changes.
  - News items, new topical web pages (e.g., "Sarah Palin").
- But (sometimes/typically) they also periodically reconstruct the index from scratch.
  - Query processing is then switched to the new index, and the old index is deleted.
-
Other sorts of indexes
- Positional indexes: the same sort of sorting problem, just larger.
- Building character n-gram indexes:
  - As text is parsed, enumerate n-grams.
  - For each n-gram, we need pointers to all dictionary terms containing it: the postings.
  - Note that the same postings entry will arise repeatedly in parsing the docs; we need efficient hashing to keep track of this.
    - E.g., that the trigram "uou" occurs in the term "deciduous" will be discovered on each text occurrence of "deciduous".
    - We only need to process each term once. Why?
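The "process each term once" point can be made concrete with a small sketch. The `$` boundary markers and the seen-set (playing the role of the efficient hashing mentioned above) are illustrative choices:

```python
from collections import defaultdict

def build_kgram_index(terms, k=3):
    """Map each character k-gram (with $ as a boundary marker) to the terms containing it."""
    index = defaultdict(set)
    seen = set()
    for term in terms:
        if term in seen:            # each term contributes its k-grams only once,
            continue                # no matter how often it occurs in the text
        seen.add(term)
        padded = f"${term}$"
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index
```

A term's k-grams depend only on its spelling, not on where it occurs, so once a term is in the dictionary its k-gram postings never change; that is why each term needs processing only once.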