Lecture4- Indexing and Searching I


Indexing and Searching

The main techniques



Introduction

• There are two ways to search a text.

• First: Scan the text sequentially (online searching). This is appropriate

– when the text is small (i.e., a few megabytes),

– when the text collection is very volatile (i.e., undergoes modifications very frequently), or

– when the index space overhead cannot be afforded.



Introduction

• Second: Build data structures over the text (called indices).

– It speeds up the search.

– It is worthwhile when the text collection is large and semi-static.

– Most real databases are like this.

• E.g.: dictionaries, Web search engines, journal archives.

Semi-static collections are collections that can be updated at reasonably regular intervals.



Introduction

• Nowadays, the most successful techniques for medium-size databases (say up to 200 MB) combine online and indexed searching.



Introduction

• We cover two main indexing techniques

– Inverted files

– Suffix arrays



Introduction

• Before covering these topics you should be familiar with

– Sorted arrays

– Binary search trees

– B-trees

– Hash tables

– Tries



Introduction

• Sorted arrays

– An array whose items are kept sorted,

– so an item can be found by binary search instead of a full scan
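A minimal sketch of searching a sorted array, assuming Python's standard bisect module. The helper name contains and the sample words are illustrative only.

```python
from bisect import bisect_left

def contains(sorted_items, target):
    """Binary search: O(log n) comparisons on a sorted list."""
    i = bisect_left(sorted_items, target)   # leftmost position where target could go
    return i < len(sorted_items) and sorted_items[i] == target

# Illustrative data; the list must be sorted before searching.
words = sorted(["gosh", "big", "good", "bill", "bigger"])
print(contains(words, "good"))
print(contains(words, "bad"))
```

The array has to stay sorted for this to work, so every insertion pays for shifting elements; this is one reason sorted arrays suit semi-static data.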



Introduction

• Binary search trees

– A binary tree

– Each internal node x stores an element

– The elements stored in the left subtree of x are <= x, and the elements stored in the right subtree of x are >= x

– Both the left and right subtrees must also be binary search trees.
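The ordering property above can be sketched as follows. This is an unbalanced tree with hypothetical names (Node, bst_insert, bst_search); equal keys are sent to the right subtree, which is one of several valid choices.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None    # subtree with keys <= key
        self.right = None   # subtree with keys >= key

def bst_insert(root, key):
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    else:
        root.right = bst_insert(root.right, key)
    return root

def bst_search(root, key):
    """Walk down from the root, going left or right at each node."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [50, 30, 70, 20, 40, 60, 80]:   # illustrative keys
    root = bst_insert(root, k)
print(bst_search(root, 60), bst_search(root, 65))
```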



Binary Tree

Each node has at most 2 children



Binary Search Tree



Introduction

• B-trees

– A B-tree is a specialized multiway tree designed especially for use on disk.

– Used when part or all of the tree must be maintained in secondary storage such as a magnetic disk.

– An indexing technique most commonly used in databases and file systems



Introduction

• B-trees

– A multiway tree of order m is an ordered tree where each node has at most m children.

– The following is a multiway search tree of order 4



Introduction



Introduction

• B-trees (contd.)

– Pointers to data are placed in a balanced tree structure, so that every data item can be reached in the same number of steps from the root.

– Data in a B-tree is kept sorted

• so that searching, inserting and deleting can be done in logarithmic amortized time

– A B-tree tries to minimize the number of disk accesses.



Introduction

• B-trees Example



Introduction

• Searching a B-Tree for Key 21
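A sketch of B-tree search. The tree here is a small hypothetical one built by hand, not the tree from the lecture's figure, and BTreeNode and btree_search are illustrative names. Each loop iteration stands for one node read, which on disk would be one block access.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys stored in this node
        self.children = children or []    # empty list for a leaf

def btree_search(node, key):
    """Return (found, nodes_visited)."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)   # first key >= search key
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:             # reached a leaf without finding it
            return False, visited
        node = node.children[i]           # descend into the i-th subtree
    return False, visited

# Hypothetical two-level tree (assumption, not from the slides).
root = BTreeNode([17, 30], [
    BTreeNode([5, 11]),
    BTreeNode([21, 25]),
    BTreeNode([33, 40]),
])
print(btree_search(root, 21))
print(btree_search(root, 22))
```

Because each node holds many keys, the tree stays shallow and a search touches few nodes.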



Introduction

Inserting Key 33 into a B-Tree (w/ Split)



Introduction

Inserting Key 33 into a B-Tree (w/ Split) (contd.)



Introduction

• Hash table

– A data structure that uses a hash function to efficiently map certain identifiers or keys (e.g., person names) to associated values (e.g., their telephone numbers).

– The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

– E.g.: Division Method



Introduction

• Hash table

– Keys: 123456, 123467, 123450

– 123456 % 10 = 6 (the remainder is 6 when dividing by 10)

123467 % 10 = 7 (the remainder is 7)

123450 % 10 = 0 (the remainder is 0)
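The division method above can be sketched as follows. The table size of 10 matches the example; storing keys in per-slot lists (chaining) is an assumption added here so the sketch has somewhere to put colliding keys.

```python
def division_hash(key, table_size):
    """Division method: the slot is the remainder of key / table_size."""
    return key % table_size

table_size = 10
table = [[] for _ in range(table_size)]   # one list per slot (chaining, an assumption)

for key in (123456, 123467, 123450):
    table[division_hash(key, table_size)].append(key)

for slot, keys in enumerate(table):
    if keys:
        print(slot, keys)
```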



Introduction



Tries

• A trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings

• It can be used to do a fast search in a large text

• The term trie comes from the word "retrieval".

• Used to implement the dictionary abstract data type (ADT) where basic operations like search, insert, and delete can be performed



Tries

• They can be used for encoding and compression

• They can be used in regular expression search and approximate string matching



Non Compact and Compact Tries

• A non compact trie is one in which every edge of the underlying tree represents a symbol of the alphabet.

• Let's construct the trie from the following 5 strings: BIG, BIGGER, BILL, GOOD, GOSH.



Non Compact and Compact Tries



Non Compact Tries

• When we look for the string GOOD, we start at the root and follow the G, O, O, D edges.

• If we want to look for the string BAD, we start from the root, follow the B edge and find out that there is no A edge after it. Thus BAD is not in the text.

• The above structure is rather wasteful because each edge represents a single symbol.

• Not practical for huge texts
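A sketch of the non compact trie for the five strings above, with one edge per symbol. TrieNode, trie_insert and trie_contains are hypothetical names, and the is_word flag is an assumption used to mark where a stored string ends.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # one outgoing edge per symbol
        self.is_word = False   # True if a stored string ends at this node

def trie_insert(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.is_word = True

def trie_contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:       # no edge for this symbol
            return False
    return node.is_word

root = TrieNode()
for w in ["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]:
    trie_insert(root, w)

print(trie_contains(root, "GOOD"))   # follows the G, O, O, D edges
print(trie_contains(root, "BAD"))    # no A edge after B
```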



Compact Tries

• This type of trie resembles the one in the figure above except that chains which lead to leaves are trimmed.

• This is illustrated in the next figure



Compact Tries

The compact form of the trie is in the figure



Compact Tries

• The number of leaves is n+1 where n is the number of input strings.

• In the leaves, we may store either the strings themselves or pointers to the strings (that is, integers).



Tries called "PATRICIA"

• "PATRICIA" stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric".

• The difference is that an edge can be labeled with more than one character.

• All the unary nodes will be collapsed.
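Collapsing unary nodes can be sketched on a trie stored as nested dictionaries. The "$" terminator appended to each string is an assumption made here so that BIG and BIGGER end in distinct leaves; build_trie and collapse are illustrative names.

```python
END = "$"   # assumed end-of-string marker

def build_trie(words):
    """Non compact trie as nested dicts: one key per symbol."""
    root = {}
    for w in words:
        node = root
        for ch in w + END:
            node = node.setdefault(ch, {})
    return root

def collapse(node):
    """Merge every chain of single-child nodes into one multi-character edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:                       # unary node: absorb it
            next_label, next_child = next(iter(child.items()))
            label += next_label
            child = next_child
        out[label] = collapse(child)
    return out

compact = collapse(build_trie(["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]))
print(compact)
```

After collapsing, the only internal nodes left are those where the stored strings actually branch.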

Tries called "PATRICIA"

The very compact trie will look as follows:





• Binary PATRICIA tries use an alphabet of only 2 symbols (0 and 1), so each internal node has at most 2 outgoing edges



Suffix Tree

• The suffix tree T(x) of string x[1..n] is the compacted trie of all suffixes x[i..n] for i = 1,..,n+1, i.e. including the empty suffix

• Allows for a particularly fast implementation of many important string operations.

• The suffix tree for a string S is a tree (more specifically a trie) whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf.



Suffix Tree

• The idea behind the suffix tree is to assign to each symbol in a text an index corresponding to its position in the text.

– i.e., the first symbol has index 1 and the last symbol has index n, where n is the number of symbols in the text.

• In the tree we use indices instead of the actual object.



Suffix tree

• The advantages are:

– It requires less storage space.

– We do not have to worry how the text is represented (binary, ASCII, etc.)

– We do not have to store the same object twice (no duplicates).



Suffix trie

• We begin by giving a position to every suffix in the text. We can now build a SUFFIX Trie for all n suffixes of the text.

• E.g.

– TEXT: G O O G O L $

– POSITION: 1 2 3 4 5 6 7



Suffix trie

The resulting tree has n leaves and height n
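A sketch that builds the suffix trie for GOOGOL$ and checks the leaf count and height stated above. The trie is stored as nested dictionaries, and the "pos" key that marks a leaf with its 1-based suffix position is an assumption of this sketch.

```python
def build_suffix_trie(text):
    """Insert every suffix of text, one symbol per edge."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1        # leaf: 1-based starting position of the suffix
    return root

def count_leaves(node):
    if "pos" in node:
        return 1
    return sum(count_leaves(child) for child in node.values())

def height(node):
    children = [c for k, c in node.items() if k != "pos"]
    return 1 + max(height(c) for c in children) if children else 0

trie = build_suffix_trie("GOOGOL$")
print(count_leaves(trie), height(trie))   # n leaves and height n, with n = 7
```

The unique terminator $ guarantees that no suffix is a prefix of another, so every suffix ends in its own leaf.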




Suffix tree

• The suffix tree is created by TRIMMING (compacting + collapsing every unary node) the suffix TRIE

• The following is a picture of a compact suffix tree



Suffix tree




Suffix tree

• In a suffix tree we can store pointers rather than words in the leaves.

• Also we can replace every string by a pair of indices, (a,b), where a is the index of the beginning of the string and b the index of the end of the string.

• i.e., we write

– (3,7) for OGOL$

– (1,2) for GO

– (7,7) for $
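The index pairs can be checked with a short sketch. The helper name label is hypothetical; it converts the 1-based, inclusive pair (a,b) used above into a Python slice.

```python
text = "GOOGOL$"

def label(a, b):
    """Substring of text from position a to position b (1-based, inclusive)."""
    return text[a - 1:b]

for a, b in [(3, 7), (1, 2), (7, 7)]:
    print((a, b), label(a, b))
```

Each edge label then costs two integers regardless of how long the substring is.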




Suffix tree

• The corresponding suffix tree looks like this

Search in suffix tree




• Pseudo-code for searching for a string S in a suffix tree:

– Start at the root

– Go down the tree, taking at each step the branch (bifurcation) that matches the next symbols of S

– If S corresponds to a node then return all leaves in its subtree

– If S encounters a NIL pointer then S is not in the tree





• If S = "GO" we take the GO bifurcation and return: GOOGOL$, GOL$.

• If S = "OR" we take the O bifurcation and then we hit a NIL pointer, so "OR" is not in the tree.
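The search steps above can be sketched as follows. For brevity this uses an uncompacted suffix trie (one symbol per edge) rather than a true suffix tree, so it illustrates the search logic only, not the space savings. The function names and the "pos" leaf marker are assumptions.

```python
def build_suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1            # leaf holds the suffix's 1-based position
    return root

def leaves(node):
    """Collect the suffix positions stored in all leaves below node."""
    if "pos" in node:
        return [node["pos"]]
    found = []
    for child in node.values():
        found.extend(leaves(child))
    return found

def search(root, s):
    node = root                            # start at the root
    for ch in s:                           # go down, one branch per symbol
        node = node.get(ch)
        if node is None:                   # NIL pointer: s is not in the tree
            return []
    return sorted(leaves(node))            # all leaves in the subtree

text = "GOOGOL$"
trie = build_suffix_trie(text)
print([text[p - 1:] for p in search(trie, "GO")])
print(search(trie, "OR"))
```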




Applications of suffix tree

• Exact matching

• Common substrings, with applications

• Matching statistics

• Suffix arrays

• Genome-scale projects




Exact Matching

• Given string x and pattern y, report where y occurs in x

• Pattern ata occurs at position 2 in tatat




Exact Matching

• Given string x and pattern y, report where y occurs in x

• Pattern tatt does not occur in tatat
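A sketch of exact matching on the two examples above. Note that it swaps in a naively built suffix array (sorted suffix start positions) plus binary search instead of a suffix tree; the function names are hypothetical and the construction is not an efficient one.

```python
def suffix_array(x):
    """Start positions (0-based) of all suffixes of x, in lexicographic order."""
    return sorted(range(len(x)), key=lambda i: x[i:])

def occurrences(x, y):
    """Return the 1-based positions where pattern y occurs in string x."""
    sa = suffix_array(x)
    m = len(y)
    lo, hi = 0, len(sa)
    while lo < hi:                          # first suffix whose m-prefix is >= y
        mid = (lo + hi) // 2
        if x[sa[mid]:sa[mid] + m] < y:
            lo = mid + 1
        else:
            hi = mid
    result = []
    while lo < len(sa) and x[sa[lo]:sa[lo] + m] == y:
        result.append(sa[lo] + 1)           # matching suffixes are contiguous
        lo += 1
    return sorted(result)

print(occurrences("tatat", "ata"))
print(occurrences("tatat", "tatt"))
```

Every occurrence of y is a prefix of some suffix of x, which is why an index over suffixes answers the query without scanning the whole text.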




Assumptions in indexing and searching

• We make the following assumptions.

– We call n the size of the text database.

– Whenever a pattern is searched, we assume that it is of length

m, which is much smaller than n.

– We call M the amount of main memory available.

– The modifications which a text database undergoes are

additions, deletions, and replacements of pieces of text of size

n' < n.






