L06IndexConstruction


Transcript of L06IndexConstruction


Index Construction

Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)


Plan

- Last lectures:
  - Dictionary data structures
  - Tolerant retrieval
    - Wildcards
    - Spell correction
    - Soundex
- This time: Index construction

[Figure from the last lecture: dictionary data structures, including a B-tree over term ranges (a-hu, hy-m, n-z) and term lists containing terms such as abandon, among, amortize, mace, madden.]


Index construction

- How do we construct an index?
- What strategies can we use with limited main memory?

Hardware basics

- Many design decisions in information retrieval are based on the characteristics of hardware.
- We begin by reviewing hardware basics.


Hardware basics

- Access to data in memory is much faster than access to data on disk.
- Disk seeks: no data is transferred from disk while the disk head is being positioned.
- Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
- Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks).
- Block sizes: 8 KB to 256 KB.


    Hard disk geometry and terminology


Hardware basics

- Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
- Available disk space is several (2-3) orders of magnitude larger.
- Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.


Hardware assumptions

  symbol   statistic                        value
  s        average seek time                5 ms = 5 x 10^-3 s
  b        transfer time per byte           0.02 µs = 2 x 10^-8 s
           processor's clock rate           10^9 s^-1
  p        low-level operation              0.01 µs = 10^-8 s
           (e.g., compare & swap a word)
           size of main memory              several GB
           size of disk space               1 TB or more
           memory transfer time per byte    5 ns
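To make the seek-versus-sequential-transfer contrast concrete, here is a small Python sketch using the constants above (the 100 MB workload and the 10 KB chunk size are illustrative figures, not part of the lecture):

```python
# Hardware assumptions from the table above.
SEEK_TIME = 5e-3          # s, average disk seek
BYTE_TRANSFER = 2e-8      # s, disk transfer time per byte

def sequential_read(num_bytes):
    """One seek, then a single contiguous transfer."""
    return SEEK_TIME + num_bytes * BYTE_TRANSFER

def chunked_read(num_bytes, chunk_size):
    """One seek per chunk: models reading many small, scattered blocks."""
    chunks = num_bytes // chunk_size
    return chunks * (SEEK_TIME + chunk_size * BYTE_TRANSFER)

size = 100 * 10**6                     # 100 MB, an illustrative workload
print(sequential_read(size))           # ~2 s
print(chunked_read(size, 10**4))       # ~52 s, dominated by 10,000 seeks
```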


RCV1: Our corpus for this lecture

- Shakespeare's collected works definitely aren't large enough.
- The corpus we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection (approx. 1 GB).
  - This is one year of Reuters newswire (part of 1996 and 1997).


    A Reuters RCV1 document


Reuters RCV1 statistics

  symbol   statistic                            value
  N        documents                            800,000
  L        avg. # tokens per doc                200
  M        terms (= word types)                 400,000
           avg. # bytes per token               6
           (incl. spaces/punct.)
           avg. # bytes per token               4.5
           (without spaces/punct.)
           avg. # bytes per term                7.5
           non-positional postings              100,000,000

4.5 bytes per word token vs. 7.5 bytes per word type: why?


Recall IIR 1 index construction

- Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term-docID pairs, in order of appearance:
  (I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
  (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)


Key step

- After all documents have been parsed, the inverted file is sorted by terms.
- We focus on this sort step. We have 100M items to sort.

Before sorting (parsing order):
  (I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
  (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

After sorting by term, then by docID:
  (ambitious,2) (be,2) (brutus,1) (brutus,2) (caesar,1) (caesar,2) (caesar,2) (capitol,1) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (was,1) (was,2) (with,2) (you,2)
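A minimal Python sketch of this parse-then-sort step on the two toy documents (the tokenizer here simply lowercases and splits on non-letters, unlike the slide, which keeps "I" capitalized):

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Parse: emit one (term, docID) pair per token, in order of appearance.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        pairs.append((token, doc_id))

# Key step: sort by term, then by docID.
pairs.sort()

print(pairs[:5])   # [('ambitious', 2), ('be', 2), ('brutus', 1), ('brutus', 2), ('caesar', 1)]
```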


Scaling index construction

- In-memory index construction does not scale.
- How can we construct an index for very large collections?
- Taking into account the hardware constraints we just learned about . . .
  - Memory, disk, speed, etc.


Sort-based index construction

- As we build the index, we parse docs one at a time.
- While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
- The final postings for any term are incomplete until the end.
- At 12 bytes per postings entry, this demands a lot of space for large collections.
  - T = 100,000,000 in the case of RCV1.
  - So we can do this in memory in 2008, but typical collections are much larger: e.g., the New York Times provides an index of >150 years of newswire.
- Thus: we need to store intermediate results on disk.


Use the same algorithm for disk?

- Can we use the same index construction algorithm (internal sorting algorithms) for larger collections, but using disk instead of memory?
- No: sorting T = 100,000,000 records on disk is too slow (too many disk seeks).
- We need an external sorting algorithm.


Bottleneck

- Parse and build postings entries one doc at a time.
- Now sort postings entries by term (then by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort T = 100M records.

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
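A rough worked answer under the hardware assumptions above (an estimate, assuming each comparison really does cost two random seeks): with N = 10^8, N log2 N ≈ 10^8 × 26.6 ≈ 2.7 × 10^9 comparisons; at 2 seeks × 5 ms = 10 ms per comparison, that is about 2.7 × 10^7 seconds, i.e. roughly 300 days. Seek-bound sorting is hopeless at this scale.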


BSBI: Blocked sort-based indexing
(Sorting with fewer disk seeks)

- 12-byte (4+4+4) records (term, doc, freq).
- These are generated as we parse docs.
- Must now sort 100M such 12-byte records by term.
- Define a block as ~10M such records.
  - Can easily fit a couple into memory.
  - Will have 10 such blocks to start with.
- Basic idea of the algorithm:
  - Accumulate postings for each block, sort, write to disk.
  - Then merge the blocks into one long sorted order.
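A compact Python sketch of the block phase of BSBI, under simplifying assumptions: records are (termID, docID, freq) tuples, runs are pickled lists on disk, and the run_path naming function is hypothetical. The merge of the resulting runs is sketched after the merge slides below.

```python
import pickle

BLOCK_SIZE = 10_000_000   # ~10M records per block, as on the slide

def bsbi_invert(record_stream, run_path):
    """Accumulate (termID, docID, freq) records into blocks, sort each block
    in memory, and spill it to disk as a sorted run.  Returns the run paths."""
    runs, block = [], []

    def spill():
        block.sort()                       # sort by termID, then docID
        path = run_path(len(runs))         # e.g. lambda i: f"run_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(block, f)
        runs.append(path)
        block.clear()

    for record in record_stream:
        block.append(record)
        if len(block) >= BLOCK_SIZE:
            spill()
    if block:                              # last, partially filled block
        spill()
    return runs
```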


Sorting 10 blocks of 10M records

- First, read each block and sort within:
  - Quicksort takes 2N log N expected steps.
  - In our case, 2 x (10M log 10M) steps.
- Exercise: estimate the total time to read each block from disk and quicksort it.
- 10 times this estimate gives us 10 sorted runs of 10M records each.
- Done straightforwardly, we need 2 copies of the data on disk.
  - But we can optimize this.
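A rough estimate for the exercise, using the hardware assumptions above: one block is 10M x 12 bytes = 1.2 x 10^8 bytes, so a sequential read takes about 1.2 x 10^8 x 2 x 10^-8 s ≈ 2.4 s; quicksorting it costs a few times 10^8 low-level operations (2 x 10^7 x log2 10^7 ≈ 4.6 x 10^8), i.e. a few seconds at 10^-8 s each. So on the order of 10 seconds per block, and about a minute for all 10 blocks.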


How to merge the sorted runs?

- Can do binary merges, with a merge tree of ⌈log2 10⌉ = 4 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.

[Figure: sorted runs 1-4 sit on disk; pairs of runs are read into memory in blocks, merged, and the merged run is written back to disk.]


How to merge the sorted runs?

- But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously.
- Providing you read decent-sized chunks of each block into memory, you're not killed by disk seeks.
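A minimal sketch of the n-way merge over the sorted runs produced by the BSBI sketch above: heapq.merge reads from all runs simultaneously and keeps only one record per run buffered (a real system would stream decent-sized chunks per run, as the slide says, rather than unpickling whole runs):

```python
import heapq
import pickle

def read_run(path):
    """Stream one sorted run back from disk (a real system would read it in chunks)."""
    with open(path, "rb") as f:
        for record in pickle.load(f):
            yield record

def n_way_merge(run_paths, out_path):
    """Read from all sorted runs simultaneously and write one long sorted file."""
    streams = [read_run(p) for p in run_paths]
    with open(out_path, "w") as out:
        for record in heapq.merge(*streams):
            out.write("\t".join(map(str, record)) + "\n")
```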


Remaining problem with sort-based algorithm

- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) to map a term to a termID.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings . . .
- . . . but then intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)


SPIMI: Single-pass in-memory indexing

- Key idea 1: Generate separate dictionaries for each block; there is no need to maintain a term-termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.


SPIMI-Invert

- Merging of blocks is analogous to BSBI.
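A minimal Python sketch of the SPIMI-Invert idea for one block (the in-memory structure is a plain dict from term to docID list; the max_postings cutoff is a crude stand-in for "memory is full", an assumption of this sketch):

```python
import pickle

def spimi_invert(token_stream, out_path, max_postings=10_000_000):
    """Build a complete in-memory index for one block, then write it to disk.

    token_stream yields (term, docID) pairs; there is no global term-termID
    mapping, and postings are accumulated as they occur rather than sorted."""
    dictionary = {}          # term -> list of docIDs (the postings list)
    n_postings = 0
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)
        n_postings += 1
        if n_postings >= max_postings:       # crude "memory is full" check
            break
    # Sort the terms once at the end so the block can be merged like a BSBI run.
    with open(out_path, "wb") as f:
        pickle.dump(sorted(dictionary.items()), f)
    return out_path
```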


SPIMI: Compression

- Compression makes SPIMI even more efficient.
  - Compression of terms
  - Compression of postings
  - See next lecture


Distributed indexing

- For web-scale indexing (don't try this at home!), we must use a distributed computing cluster.
- Individual machines are fault-prone.
  - They can unpredictably slow down or fail.
- How do we exploit such a pool of machines?


Google data centers

- Google data centers mainly contain commodity machines and are distributed around the world.
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007).
- Estimate: Google installs 100,000 servers each quarter.
  - Based on expenditures of 200-250 million dollars per year.
- This would be 10% of the computing capacity of the world!?!


Google data centers

- If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
- Answer: about 37% (0.999^1000 ≈ 0.37).
  - Note: 1 - 0.999^1000 ≈ 0.63 is the probability that at least one node will fail.


Distributed indexing

- Maintain a master machine directing the indexing job; it is considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.


Parallel tasks

- We will use two sets of parallel tasks:
  - Parsers
  - Inverters
- Break the input document corpus into splits.
  - Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).


Parsers

- Master assigns a split to an idle parser machine.
- Parser reads a document at a time and emits (term, doc) pairs.
- Parser writes pairs into j partitions.
  - Each partition is for a range of terms' first letters (e.g., a-f, g-p, q-z); here j = 3.
- Now to complete the index inversion . . .


Inverters

- An inverter collects all (term, doc) pairs (= postings) for one term-partition.
- It sorts them and writes them to postings lists.


Data flow

[Figure: the master assigns splits to parsers (map phase); each parser writes its (term, doc) pairs into segment files partitioned as a-f, g-p, q-z; in the reduce phase each inverter collects one term partition (a-f, g-p, or q-z) from all the parsers' segment files and writes the postings for that partition.]


MapReduce

- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing, without having to write code for the distribution part.
- They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.


MapReduce

- Index construction was just one phase.
- Another phase: transforming a term-partitioned index into a document-partitioned index.
  - Term-partitioned: one machine handles a subrange of terms.
  - Document-partitioned: one machine handles a subrange of documents.
- (As we discuss in the web part of the course,) most search engines use a document-partitioned index (better load balancing, etc.).


Schema for index construction in MapReduce

- Schema of map and reduce functions:
  - map: input -> list(k, v)
  - reduce: (k, list(v)) -> output
- Instantiation of the schema for index construction:
  - map: web collection -> list(termID, docID)
  - reduce: (<termID1, list(docID)>, <termID2, list(docID)>, ...) -> (postings list 1, postings list 2, ...)
- Example for index construction:
  - map: d2 : "C died." d1 : "C came, C c'ed." -> (<C, d2>, <died, d2>, <C, d1>, <came, d1>, <C, d1>, <c'ed, d1>)
  - reduce: (<C, (d2, d1, d1)>, <died, (d2)>, <came, (d1)>, <c'ed, (d1)>) -> (<C, (d1:2, d2:1)>, <died, (d2:1)>, <came, (d1:1)>, <c'ed, (d1:1)>)
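A toy, single-process Python sketch of this schema (a stand-in for a real MapReduce runtime: the grouping step that the framework normally performs is done here with a dict, and the tokenization is deliberately naive):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_fn(term, doc_ids):
    """reduce: (term, list of docIDs) -> (term, postings list with counts)."""
    counts = defaultdict(int)
    for d in doc_ids:
        counts[d] += 1
    return term, sorted(counts.items())

# Driver: what the MapReduce framework would do for us.
docs = {"d1": "C came, C c'ed.", "d2": "C died."}
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_fn(doc_id, text):
        grouped[term].append(d)

index = dict(reduce_fn(t, ds) for t, ds in grouped.items())
# index["c"] == [("d1", 2), ("d2", 1)]
```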


Dynamic indexing

- Up to now, we have assumed that collections are static.
- They rarely are:
  - Documents come in over time and need to be inserted.
  - Documents are deleted and modified.
- This means that the dictionary and postings lists have to be modified:
  - Postings updates for terms already in the dictionary
  - New terms added to the dictionary


Simplest approach

- Maintain a big main index.
- New docs go into a small auxiliary index.
- Search across both, merge the results.
- Deletions:
  - Keep an invalidation bit-vector for deleted docs.
  - Filter the docs returned by a search against this invalidation bit-vector.
- Periodically, re-index into one main index.
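A minimal sketch of query processing under this scheme (the indexes are plain dicts from term to sorted docID lists, and a set of docIDs stands in for the invalidation bit-vector; all names here are illustrative):

```python
def search(term, main_index, aux_index, invalidated):
    """Look the term up in both indexes, merge, and drop deleted docs.

    main_index / aux_index: dict mapping term -> sorted list of docIDs.
    invalidated: set of docIDs whose documents have been deleted.
    """
    main_hits = main_index.get(term, [])
    aux_hits = aux_index.get(term, [])
    merged = sorted(set(main_hits) | set(aux_hits))
    return [d for d in merged if d not in invalidated]

# Usage: docs 1-3 are in the main index, doc 4 arrived later, doc 2 was deleted.
main = {"caesar": [1, 2, 3]}
aux = {"caesar": [4]}
deleted = {2}
print(search("caesar", main, aux, deleted))   # [1, 3, 4]
```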


Issues with main and auxiliary indexes

- Problem of frequent merges: you touch stuff a lot.
- Poor performance during a merge.
- Actually:
  - Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
  - The merge is then the same as a simple append.
  - But then we would need a lot of files, which is inefficient for the O/S.
- Assumption for the rest of the lecture: the index is one big file.
- In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).


Logarithmic merge

- Maintain a series of indexes, each twice as large as the previous one.
- Keep the smallest (Z0) in memory.
- Keep the larger ones (I0, I1, ...) on disk.
- If Z0 gets too big (> n), write it to disk as I0,
- or merge it with I0 (if I0 already exists) as Z1.
  - Either write Z1 to disk as I1 (if there is no I1),
  - or merge it with I1 to form Z2, etc.
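A compact Python sketch of this scheme (indexes are modeled as sorted lists of postings and the "disk" is a dict; the class name and the simple sorted-list merge are illustrative choices, not the lecture's code):

```python
class LogarithmicMerge:
    """Maintain indexes I0, I1, ... of sizes n, 2n, 4n, ... plus an in-memory Z0."""

    def __init__(self, n):
        self.n = n          # capacity of the in-memory index Z0
        self.z0 = []        # in-memory index: a list of postings
        self.disk = {}      # level i -> on-disk index Ii (modeled as a sorted list)

    def add(self, posting):
        self.z0.append(posting)
        if len(self.z0) >= self.n:
            self._flush()

    def _flush(self):
        z, level = sorted(self.z0), 0
        # Merge Z with I0, I1, ... until we find a free level.
        while level in self.disk:
            z = sorted(z + self.disk.pop(level))   # merge two equal-sized indexes
            level += 1
        self.disk[level] = z                        # "write Z to disk" as I_level
        self.z0 = []

    def indexes(self):
        """All indexes a query has to consult: O(log T) of them."""
        return [self.z0] + list(self.disk.values())
```

Each posting takes part in O(log T) merges here, which is exactly the complexity argument on the next slide.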


Logarithmic merge

- Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
- Logarithmic merge: each posting is merged O(log T) times, so the complexity is O(T log T).
- So logarithmic merge is much more efficient for index construction.
- But query processing now requires merging O(log T) indexes,
  - whereas it is O(1) if you just have a main and an auxiliary index.


Further issues with multiple indexes

- Corpus-wide statistics are hard to maintain.
- E.g., when we spoke of spell correction: which of several corrected alternatives do we present to the user?
  - We said: pick the one with the most hits.
- How do we maintain the top ones with multiple indexes and invalidation bit vectors?
  - One possibility: ignore everything but the main index for such ordering.
- We will see more such statistics used in results ranking.


Dynamic indexing at search engines

- All the large search engines now do dynamic indexing.
- Their indices have frequent incremental changes.
  - News items, new topical web pages (e.g., Sarah Palin).
- But (sometimes/typically) they also periodically reconstruct the index from scratch.
  - Query processing is then switched to the new index, and the old index is deleted.


Other sorts of indexes

- Positional indexes
  - Same sort of sorting problem, just larger.
- Building character n-gram indexes:
  - As text is parsed, enumerate n-grams.
  - For each n-gram, we need pointers to all dictionary terms containing it: the postings.
  - Note that the same postings entry will arise repeatedly while parsing the docs; we need efficient hashing to keep track of this.
    - E.g., that the trigram "uou" occurs in the term "deciduous" will be discovered on each text occurrence of "deciduous".
    - We only need to process each term once. Why?
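A small Python sketch of building a character 3-gram index over the dictionary terms themselves, in one pass over the vocabulary (the "$" boundary markers are a common convention assumed here, and a set plays the role of the hashing that deduplicates repeated postings):

```python
from collections import defaultdict

def char_ngrams(term, n=3):
    """Enumerate the character n-grams of a term, e.g. 'uou' for 'deciduous'."""
    padded = f"${term}$"                     # boundary markers, a common convention
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_ngram_index(vocabulary, n=3):
    """Map each n-gram to the set of dictionary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:                  # each term is processed exactly once
        for gram in char_ngrams(term, n):
            index[gram].add(term)
    return index

index = build_ngram_index({"deciduous", "dubious", "among"})
print(sorted(index["uou"]))                  # ['deciduous']
print(sorted(index["ous"]))                  # ['deciduous', 'dubious']
```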