L06IndexConstruction


Transcript of L06IndexConstruction


Index Construction

Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)


Plan

- Last lectures:
  - Dictionary data structures
  - Tolerant retrieval
    - Wildcards
    - Spell correction
    - Soundex
- This time: Index construction

[Figure from the last lecture: dictionary data structures, including a B-tree over term ranges (a-hu, hy-m, n-z) and term lists containing terms such as abandon, among, amortize, mace, madden.]


Index construction

- How do we construct an index?
- What strategies can we use with limited main memory?

Hardware basics

- Many design decisions in information retrieval are based on the characteristics of hardware.
- We begin by reviewing hardware basics.


Hardware basics

- Access to data in memory is much faster than access to data on disk.
- Disk seeks: no data is transferred from disk while the disk head is being positioned.
- Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
- Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
- Block sizes: 8 KB to 256 KB.


    Hard disk geometry and terminology


Hardware basics

- Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
- Available disk space is several (2-3) orders of magnitude larger.
- Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.


Hardware assumptions

  symbol   statistic                        value
  s        average seek time                5 ms = 5 x 10^-3 s
  b        transfer time per byte           0.02 µs = 2 x 10^-8 s
           processor's clock rate           10^9 s^-1
  p        low-level operation              0.01 µs = 10^-8 s
           (e.g., compare & swap a word)
           size of main memory              several GB
           size of disk space               1 TB or more
           memory transfer time per byte    5 ns
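To make the seek-versus-sequential-transfer contrast concrete, here is a small Python sketch using the constants above (the 100 MB workload and the 10 KB chunk size are illustrative figures, not part of the lecture):

```python
# Hardware assumptions from the table above.
SEEK_TIME = 5e-3          # s, average disk seek
BYTE_TRANSFER = 2e-8      # s, disk transfer time per byte

def sequential_read(num_bytes):
    """One seek, then a single contiguous transfer."""
    return SEEK_TIME + num_bytes * BYTE_TRANSFER

def chunked_read(num_bytes, chunk_size):
    """One seek per chunk: models reading many small, scattered blocks."""
    chunks = num_bytes // chunk_size
    return chunks * (SEEK_TIME + chunk_size * BYTE_TRANSFER)

size = 100 * 10**6                     # 100 MB, an illustrative workload
print(sequential_read(size))           # ~2 s
print(chunked_read(size, 10**4))       # ~52 s, dominated by 10,000 seeks
```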


RCV1: Our corpus for this lecture

- Shakespeare's collected works definitely aren't large enough.
- The corpus we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection (approx. 1 GB).
  - This is one year of Reuters newswire (part of 1996 and 1997).


    A Reuters RCV1 document


Reuters RCV1 statistics

  symbol   statistic                            value
  N        documents                            800,000
  L        avg. # tokens per doc                200
  M        terms (= word types)                 400,000
           avg. # bytes per token               6
           (incl. spaces/punct.)
           avg. # bytes per token               4.5
           (without spaces/punct.)
           avg. # bytes per term                7.5
           non-positional postings              100,000,000

4.5 bytes per word token vs. 7.5 bytes per word type: why?


Recall IIR 1 index construction

- Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term-docID pairs, in order of appearance:
  (I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
  (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)


Key step

- After all documents have been parsed, the inverted file is sorted by terms.
- We focus on this sort step. We have 100M items to sort.

Before sorting (parsing order):
  (I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
  (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

After sorting by term, then by docID:
  (ambitious,2) (be,2) (brutus,1) (brutus,2) (caesar,1) (caesar,2) (caesar,2) (capitol,1) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (was,1) (was,2) (with,2) (you,2)
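A minimal Python sketch of this parse-then-sort step on the two toy documents (the tokenizer here simply lowercases and splits on non-letters, unlike the slide, which keeps "I" capitalized):

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Parse: emit one (term, docID) pair per token, in order of appearance.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        pairs.append((token, doc_id))

# Key step: sort by term, then by docID.
pairs.sort()

print(pairs[:5])   # [('ambitious', 2), ('be', 2), ('brutus', 1), ('brutus', 2), ('caesar', 1)]
```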


Scaling index construction

- In-memory index construction does not scale.
- How can we construct an index for very large collections?
- Taking into account the hardware constraints we just learned about . . .
  - Memory, disk, speed, etc.


Sort-based index construction

- As we build the index, we parse docs one at a time.
- While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
- The final postings for any term are incomplete until the end.
- At 12 bytes per postings entry, this demands a lot of space for large collections.
  - T = 100,000,000 in the case of RCV1.
  - So we can do this in memory in 2008, but typical collections are much larger: e.g., the New York Times provides an index of >150 years of newswire.
- Thus: we need to store intermediate results on disk.


Use the same algorithm for disk?

- Can we use the same index construction algorithm (internal sorting algorithms) for larger collections, but using disk instead of memory?
- No: sorting T = 100,000,000 records on disk is too slow (too many disk seeks).
- We need an external sorting algorithm.


Bottleneck

- Parse and build postings entries one doc at a time.
- Now sort postings entries by term (then by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort T = 100M records.

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
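A rough worked answer under the hardware assumptions above (an estimate, assuming each comparison really does cost two random seeks): with N = 10^8, N log2 N ≈ 10^8 × 26.6 ≈ 2.7 × 10^9 comparisons; at 2 seeks × 5 ms = 10 ms per comparison, that is about 2.7 × 10^7 seconds, i.e. roughly 300 days. Seek-bound sorting is hopeless at this scale.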


BSBI: Blocked sort-based indexing
(Sorting with fewer disk seeks)

- 12-byte (4+4+4) records (term, doc, freq).
- These are generated as we parse docs.
- Must now sort 100M such 12-byte records by term.
- Define a block as ~10M such records.
  - Can easily fit a couple into memory.
  - Will have 10 such blocks to start with.
- Basic idea of the algorithm:
  - Accumulate postings for each block, sort, write to disk.
  - Then merge the blocks into one long sorted order.
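A compact Python sketch of the block phase of BSBI, under simplifying assumptions: records are (termID, docID, freq) tuples, runs are pickled lists on disk, and the run_path naming function is hypothetical. The merge of the resulting runs is sketched after the merge slides below.

```python
import pickle

BLOCK_SIZE = 10_000_000   # ~10M records per block, as on the slide

def bsbi_invert(record_stream, run_path):
    """Accumulate (termID, docID, freq) records into blocks, sort each block
    in memory, and spill it to disk as a sorted run.  Returns the run paths."""
    runs, block = [], []

    def spill():
        block.sort()                       # sort by termID, then docID
        path = run_path(len(runs))         # e.g. lambda i: f"run_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(block, f)
        runs.append(path)
        block.clear()

    for record in record_stream:
        block.append(record)
        if len(block) >= BLOCK_SIZE:
            spill()
    if block:                              # last, partially filled block
        spill()
    return runs
```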


Sorting 10 blocks of 10M records

- First, read each block and sort within:
  - Quicksort takes 2N log N expected steps.
  - In our case, 2 x (10M log 10M) steps.
- Exercise: estimate the total time to read each block from disk and quicksort it.
- 10 times this estimate gives us 10 sorted runs of 10M records each.
- Done straightforwardly, we need 2 copies of the data on disk.
  - But we can optimize this.
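A rough estimate for the exercise, using the hardware assumptions above: one block is 10M x 12 bytes = 1.2 x 10^8 bytes, so a sequential read takes about 1.2 x 10^8 x 2 x 10^-8 s ≈ 2.4 s; quicksorting it costs a few times 10^8 low-level operations (2 x 10^7 x log2 10^7 ≈ 4.6 x 10^8), i.e. a few seconds at 10^-8 s each. So on the order of 10 seconds per block, and about a minute for all 10 blocks.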


How to merge the sorted runs?

- Can do binary merges, with a merge tree of ⌈log2 10⌉ = 4 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.

[Figure: sorted runs 1-4 sit on disk; pairs of runs are read into memory in blocks, merged, and the merged run is written back to disk.]


How to merge the sorted runs?

- But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously.
- Providing you read decent-sized chunks of each block into memory, you're not killed by disk seeks.
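A minimal sketch of the n-way merge over the sorted runs produced by the BSBI sketch above: heapq.merge reads from all runs simultaneously and keeps only one record per run buffered (a real system would stream decent-sized chunks per run, as the slide says, rather than unpickling whole runs):

```python
import heapq
import pickle

def read_run(path):
    """Stream one sorted run back from disk (a real system would read it in chunks)."""
    with open(path, "rb") as f:
        for record in pickle.load(f):
            yield record

def n_way_merge(run_paths, out_path):
    """Read from all sorted runs simultaneously and write one long sorted file."""
    streams = [read_run(p) for p in run_paths]
    with open(out_path, "w") as out:
        for record in heapq.merge(*streams):
            out.write("\t".join(map(str, record)) + "\n")
```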


Remaining problem with sort-based algorithm

- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) to map a term to a termID.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings . . .
- . . . but then intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)


SPIMI: Single-pass in-memory indexing

- Key idea 1: Generate separate dictionaries for each block; there is no need to maintain a term-termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.


SPIMI-Invert

- Merging of blocks is analogous to BSBI.
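A minimal Python sketch of the SPIMI-Invert idea for one block (the in-memory structure is a plain dict from term to docID list; the max_postings cutoff is a crude stand-in for "memory is full", an assumption of this sketch):

```python
import pickle

def spimi_invert(token_stream, out_path, max_postings=10_000_000):
    """Build a complete in-memory index for one block, then write it to disk.

    token_stream yields (term, docID) pairs; there is no global term-termID
    mapping, and postings are accumulated as they occur rather than sorted."""
    dictionary = {}          # term -> list of docIDs (the postings list)
    n_postings = 0
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)
        n_postings += 1
        if n_postings >= max_postings:       # crude "memory is full" check
            break
    # Sort the terms once at the end so the block can be merged like a BSBI run.
    with open(out_path, "wb") as f:
        pickle.dump(sorted(dictionary.items()), f)
    return out_path
```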


SPIMI: Compression

- Compression makes SPIMI even more efficient.
  - Compression of terms
  - Compression of postings
  - See next lecture


Distributed indexing

- For web-scale indexing (don't try this at home!), we must use a distributed computing cluster.
- Individual machines are fault-prone.
  - They can unpredictably slow down or fail.
- How do we exploit such a pool of machines?


Google data centers

- Google data centers mainly contain commodity machines and are distributed around the world.
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007).
- Estimate: Google installs 100,000 servers each quarter.
  - Based on expenditures of 200-250 million dollars per year.
- This would be 10% of the computing capacity of the world!?!


Google data centers

- If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
- Answer: about 37% (0.999^1000 ≈ 0.37).
  - Note: 1 - 0.999^1000 ≈ 0.63 is the probability that at least one node will fail.


Distributed indexing

- Maintain a master machine directing the indexing job; it is considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.


Parallel tasks

- We will use two sets of parallel tasks:
  - Parsers
  - Inverters
- Break the input document corpus into splits.
  - Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).


Parsers

- Master assigns a split to an idle parser machine.
- Parser reads a document at a time and emits (term, doc) pairs.
- Parser writes pairs into j partitions.
  - Each partition is for a range of terms' first letters (e.g., a-f, g-p, q-z); here j = 3.
- Now to complete the index inversion . . .


Inverters

- An inverter collects all (term, doc) pairs (= postings) for one term-partition.
- It sorts them and writes them to postings lists.


Data flow

[Figure: the master assigns splits to parsers (map phase); each parser writes its (term, doc) pairs into segment files partitioned as a-f, g-p, q-z; in the reduce phase each inverter collects one term partition (a-f, g-p, or q-z) from all the parsers' segment files and writes the postings for that partition.]


MapReduce

- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing, without having to write code for the distribution part.
- They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.


MapReduce

- Index construction was just one phase.
- Another phase: transforming a term-partitioned index into a document-partitioned index.
  - Term-partitioned: one machine handles a subrange of terms.
  - Document-partitioned: one machine handles a subrange of documents.
- (As we discuss in the web part of the course,) most search engines use a document-partitioned index (better load balancing, etc.).


Schema for index construction in MapReduce

- Schema of map and reduce functions:
  - map: input -> list(k, v)
  - reduce: (k, list(v)) -> output
- Instantiation of the schema for index construction:
  - map: web collection -> list(termID, docID)
  - reduce: (<termID1, list(docID)>, <termID2, list(docID)>, ...) -> (postings list 1, postings list 2, ...)
- Example for index construction:
  - map: d2 : "C died." d1 : "C came, C c'ed." -> (<C, d2>, <died, d2>, <C, d1>, <came, d1>, <C, d1>, <c'ed, d1>)
  - reduce: (<C, (d2, d1, d1)>, <died, (d2)>, <came, (d1)>, <c'ed, (d1)>) -> (<C, (d1:2, d2:1)>, <died, (d2:1)>, <came, (d1:1)>, <c'ed, (d1:1)>)
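A toy, single-process Python sketch of this schema (a stand-in for a real MapReduce runtime: the grouping step that the framework normally performs is done here with a dict, and the tokenization is deliberately naive):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_fn(term, doc_ids):
    """reduce: (term, list of docIDs) -> (term, postings list with counts)."""
    counts = defaultdict(int)
    for d in doc_ids:
        counts[d] += 1
    return term, sorted(counts.items())

# Driver: what the MapReduce framework would do for us.
docs = {"d1": "C came, C c'ed.", "d2": "C died."}
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_fn(doc_id, text):
        grouped[term].append(d)

index = dict(reduce_fn(t, ds) for t, ds in grouped.items())
# index["c"] == [("d1", 2), ("d2", 1)]
```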


Dynamic indexing

- Up to now, we have assumed that collections are static.
- They rarely are:
  - Documents come in over time and need to be inserted.
  - Documents are deleted and modified.
- This means that the dictionary and postings lists have to be modified:
  - Postings updates for terms already in the dictionary
  - New terms added to the dictionary


Simplest approach

- Maintain a big main index.
- New docs go into a small auxiliary index.
- Search across both, merge the results.
- Deletions:
  - Keep an invalidation bit-vector for deleted docs.
  - Filter the docs returned by a search against this invalidation bit-vector.
- Periodically, re-index into one main index.
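A minimal sketch of query processing under this scheme (the indexes are plain dicts from term to sorted docID lists, and a set of docIDs stands in for the invalidation bit-vector; all names here are illustrative):

```python
def search(term, main_index, aux_index, invalidated):
    """Look the term up in both indexes, merge, and drop deleted docs.

    main_index / aux_index: dict mapping term -> sorted list of docIDs.
    invalidated: set of docIDs whose documents have been deleted.
    """
    main_hits = main_index.get(term, [])
    aux_hits = aux_index.get(term, [])
    merged = sorted(set(main_hits) | set(aux_hits))
    return [d for d in merged if d not in invalidated]

# Usage: docs 1-3 are in the main index, doc 4 arrived later, doc 2 was deleted.
main = {"caesar": [1, 2, 3]}
aux = {"caesar": [4]}
deleted = {2}
print(search("caesar", main, aux, deleted))   # [1, 3, 4]
```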


Issues with main and auxiliary indexes

- Problem of frequent merges: you touch stuff a lot.
- Poor performance during a merge.
- Actually:
  - Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
  - The merge is then the same as a simple append.
  - But then we would need a lot of files, which is inefficient for the O/S.
- Assumption for the rest of the lecture: the index is one big file.
- In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).


Logarithmic merge

- Maintain a series of indexes, each twice as large as the previous one.
- Keep the smallest (Z0) in memory.
- Keep the larger ones (I0, I1, ...) on disk.
- If Z0 gets too big (> n), write it to disk as I0,
- or merge it with I0 (if I0 already exists) as Z1.
  - Either write Z1 to disk as I1 (if there is no I1),
  - or merge it with I1 to form Z2, etc.
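A compact Python sketch of this scheme (indexes are modeled as sorted lists of postings and the "disk" is a dict; the class name and the simple sorted-list merge are illustrative choices, not the lecture's code):

```python
class LogarithmicMerge:
    """Maintain indexes I0, I1, ... of sizes n, 2n, 4n, ... plus an in-memory Z0."""

    def __init__(self, n):
        self.n = n          # capacity of the in-memory index Z0
        self.z0 = []        # in-memory index: a list of postings
        self.disk = {}      # level i -> on-disk index Ii (modeled as a sorted list)

    def add(self, posting):
        self.z0.append(posting)
        if len(self.z0) >= self.n:
            self._flush()

    def _flush(self):
        z, level = sorted(self.z0), 0
        # Merge Z with I0, I1, ... until we find a free level.
        while level in self.disk:
            z = sorted(z + self.disk.pop(level))   # merge two equal-sized indexes
            level += 1
        self.disk[level] = z                        # "write Z to disk" as I_level
        self.z0 = []

    def indexes(self):
        """All indexes a query has to consult: O(log T) of them."""
        return [self.z0] + list(self.disk.values())
```

Each posting takes part in O(log T) merges here, which is exactly the complexity argument on the next slide.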


Logarithmic merge

- Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
- Logarithmic merge: each posting is merged O(log T) times, so the complexity is O(T log T).
- So logarithmic merge is much more efficient for index construction.
- But query processing now requires merging O(log T) indexes,
  - whereas it is O(1) if you just have a main and an auxiliary index.


Further issues with multiple indexes

- Corpus-wide statistics are hard to maintain.
- E.g., when we spoke of spell correction: which of several corrected alternatives do we present to the user?
  - We said: pick the one with the most hits.
- How do we maintain the top ones with multiple indexes and invalidation bit vectors?
  - One possibility: ignore everything but the main index for such ordering.
- We will see more such statistics used in results ranking.


Dynamic indexing at search engines

- All the large search engines now do dynamic indexing.
- Their indices have frequent incremental changes.
  - News items, new topical web pages (e.g., Sarah Palin).
- But (sometimes/typically) they also periodically reconstruct the index from scratch.
  - Query processing is then switched to the new index, and the old index is deleted.


Other sorts of indexes

- Positional indexes
  - Same sort of sorting problem, just larger.
- Building character n-gram indexes:
  - As text is parsed, enumerate n-grams.
  - For each n-gram, we need pointers to all dictionary terms containing it: the postings.
  - Note that the same postings entry will arise repeatedly while parsing the docs; we need efficient hashing to keep track of this.
    - E.g., that the trigram "uou" occurs in the term "deciduous" will be discovered on each text occurrence of "deciduous".
    - We only need to process each term once. Why?
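A small Python sketch of building a character 3-gram index over the dictionary terms themselves, in one pass over the vocabulary (the "$" boundary markers are a common convention assumed here, and a set plays the role of the hashing that deduplicates repeated postings):

```python
from collections import defaultdict

def char_ngrams(term, n=3):
    """Enumerate the character n-grams of a term, e.g. 'uou' for 'deciduous'."""
    padded = f"${term}$"                     # boundary markers, a common convention
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_ngram_index(vocabulary, n=3):
    """Map each n-gram to the set of dictionary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:                  # each term is processed exactly once
        for gram in char_ngrams(term, n):
            index[gram].add(term)
    return index

index = build_ngram_index({"deciduous", "dubious", "among"})
print(sorted(index["uou"]))                  # ['deciduous']
print(sorted(index["ous"]))                  # ['deciduous', 'dubious']
```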