Lecture4- Indexing and Searching I
Uploaded by: priyankaprakasan
8/4/2019 Lecture4- Indexing and Searching I
http://slidepdf.com/reader/full/lecture4-indexing-and-searching-i 1/56
Indexing and Searching
The main techniques
Introduction
• There are two ways to search a text.
• First: scan the text sequentially (online searching). This is appropriate
– when the text is small (i.e., a few megabytes),
– when the text collection is very volatile (i.e., undergoes modifications very frequently), or
– when the index space overhead cannot be afforded.
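As a minimal sketch of online searching, the function below scans the text from left to right without building any index. The function name scan and the sample sentence are illustrative assumptions, not part of the lecture.

```python
from __future__ import annotations


def scan(text: str, pattern: str) -> list[int]:
    """Return the 0-based start offsets of every occurrence of pattern in text."""
    if not pattern:
        return []
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        # Restart one position later so overlapping matches are also reported.
        start = text.find(pattern, start + 1)
    return hits


print(scan("to be or not to be", "be"))
```

Every query rereads the whole text, which is acceptable for small or volatile collections but becomes costly as the text grows.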
Introduction
• Second: build data structures over the text (called indices).
– It speeds up the search.
– It is worthwhile when the text collection is large and semi-static.
– Most real databases are like this.
• E.g., dictionaries, Web search engines, journal archives.
Semi-static collections are collections that are updated at reasonably regular intervals.
Introduction
• Nowadays, the most successful techniques for medium-size databases (say up to 200Mb) combine online and indexed searching.
Introduction
• We cover two main indexing techniques
– Inverted files
– Suffix arrays
Introduction
• Before covering these topics you should be familiar with
– Sorted arrays
– Binary search trees
– B-trees
– Hash tables
– Tries
Introduction
• Sorted arrays
– An array whose items are kept in sorted order,
– so searching is faster: binary search needs O(log n) comparisons instead of a linear scan.
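A short sketch of why a sorted array speeds up lookups, using binary search from Python's standard bisect module. The word list and the helper name contains are assumptions chosen for illustration.

```python
from __future__ import annotations

from bisect import bisect_left


def contains(sorted_items: list[str], key: str) -> bool:
    """Binary search: O(log n) comparisons on an already sorted list."""
    i = bisect_left(sorted_items, key)  # leftmost position where key could be inserted
    return i < len(sorted_items) and sorted_items[i] == key


words = sorted(["good", "big", "gosh", "bill", "bigger"])
print(words)
print(contains(words, "gosh"), contains(words, "bad"))
```

The price is that insertions must keep the array sorted, which costs O(n) element moves in the worst case.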
Introduction
• Binary search trees
– A binary tree in which each internal node x stores an element.
– The elements stored in the left subtree of x are <= x, and the elements stored in the right subtree of x are >= x.
– Both the left and right subtrees must also be binary search trees.
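The ordering property above can be sketched as follows. This is an unbalanced binary search tree with illustrative names (Node, insert, search); sending duplicates to the right subtree is one possible convention, assumed here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    key: int
    left: Node | None = None
    right: Node | None = None


def insert(root: Node | None, key: int) -> Node:
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:  # keys >= root.key go to the right subtree
        root.right = insert(root.right, key)
    return root


def search(root: Node | None, key: int) -> bool:
    """Walk down from the root, going left or right by comparison."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


root = None
for k in (8, 3, 10, 1, 6, 14):
    root = insert(root, k)
print(search(root, 6), search(root, 7))
```

Without rebalancing, inserting keys in sorted order degrades the tree into a linked list, which is one motivation for balanced structures such as B-trees.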
Binary Tree
• Each node has at most 2 children.
Binary Search Tree
Introduction
• B-trees
– A B-tree is a specialized multiway tree designed especially for use on disk.
– It is used when part or all of the tree must be maintained in secondary storage such as a magnetic disk.
– It is an indexing technique most commonly used in databases and file systems.
Introduction
• B-trees
– A multiway tree of order m is an ordered tree where each node has at most m children.
– The following is a multiway search tree of order 4.
[Figure: multiway search tree of order 4]
Introduction
• B-trees (contd.)
– Pointers to data are placed in a balanced tree structure, so that every reference to data can be reached in about the same time (all leaves are at the same depth).
– Data in a B-tree is kept sorted,
• so that searching, inserting, and deleting can be done in logarithmic time.
– A B-tree tries to minimize the number of disk accesses.
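A minimal sketch of B-tree search, assuming an in-memory node class (BTreeNode) and a small hand-built tree. Both are hypothetical and do not reproduce the tree from the lecture figures. Each node visited stands for one disk access, which is the cost a B-tree tries to keep low.

```python
from __future__ import annotations

from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list[int]                                            # sorted keys in this node
    children: list[BTreeNode] = field(default_factory=list)   # empty for a leaf


def btree_search(root: BTreeNode, key: int) -> tuple[bool, int]:
    """Return (found, nodes_visited); each visited node models one disk read."""
    node, visited = root, 0
    while True:
        visited += 1
        i = bisect_left(node.keys, key)      # first key >= search key
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:                # reached a leaf without a match
            return False, visited
        node = node.children[i]              # descend between keys[i-1] and keys[i]


# Hypothetical two-level tree: one root with two keys and three leaves.
root = BTreeNode([20, 40], [BTreeNode([5, 11]), BTreeNode([21, 33]), BTreeNode([50, 60])])
print(btree_search(root, 21))
print(btree_search(root, 35))
```

Because every node holds many keys, the tree stays shallow and a lookup touches only a few nodes even for large key sets.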
Introduction
• B-trees: worked examples (figures not reproduced)
– B-tree example
– Searching a B-tree for key 21
– Inserting key 33 into a B-tree (with split)
Introduction
• Hash table
– A data structure that uses a hash function to efficiently map certain identifiers or keys (e.g., person names) to associated values (e.g., their telephone numbers).
– The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
– E.g., the division method
Introduction
• Hash table
– Keys: 123456, 123467, 123450
– Division method with divisor 10:
123456 % 10 = 6 (the remainder is 6 when dividing by 10)
123467 % 10 = 7 (the remainder is 7)
123450 % 10 = 0 (the remainder is 0)
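The division method from the example can be sketched in a few lines. The bucket list with chaining is an assumption added for illustration; the slides only show the remainder computation.

```python
def division_hash(key: int, table_size: int) -> int:
    """Division method: the slot is the remainder of key divided by the table size."""
    return key % table_size


TABLE_SIZE = 10
# Assumed collision strategy: each slot holds a list of keys (chaining).
table = [[] for _ in range(TABLE_SIZE)]
for key in (123456, 123467, 123450):
    table[division_hash(key, TABLE_SIZE)].append(key)

for slot, bucket in enumerate(table):
    if bucket:
        print(slot, bucket)
```

In practice a prime table size is usually preferred over 10, since it spreads keys that share decimal patterns more evenly.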
Tries
• A trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings.
• It can be used to do a fast search in a large text.
• The term trie comes from the word "retrieval".
• It is used to implement the dictionary abstract data type (ADT), where basic operations like search, insert, and delete can be performed.
Tries
• They can be used for encoding and compression.
• They can be used in regular expression search and approximate string matching.
Non Compact and Compact Tries
• A non-compact trie is one in which every edge of the underlying tree represents a symbol of the alphabet.
• Let's construct the trie from the following 5 strings: BIG, BIGGER, BILL, GOOD, GOSH.
[Figure: non-compact trie built from BIG, BIGGER, BILL, GOOD, GOSH]
Non Compact Tries
• When we look for the string GOOD, we start at the root and follow the G, O, O, D edges.
• If we want to look for the string BAD, we start from the root, follow the B edge, and find that there is no A edge after it. Thus BAD is not in the text.
• The above structure is rather wasteful because each edge represents a single symbol.
• It is not practical for huge texts.
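The lookups for GOOD and BAD can be reproduced with a small non-compact trie. Representing nodes as nested dictionaries and marking the end of a word with "$" are implementation assumptions, not details taken from the slides.

```python
END = "$"  # assumed end-of-word marker, so that BIG and BIGGER can coexist


def build_trie(words):
    """One edge per character: each dict key is an edge label, each value a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def contains(trie, word):
    """Follow one edge per character; fail as soon as an edge is missing."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = build_trie(["BIG", "BIGGER", "BILL", "GOOD", "GOSH"])
print(contains(trie, "GOOD"), contains(trie, "BAD"))
```

The search for BAD stops after the B edge because no A edge follows it, exactly as described in the bullets.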
Compact Tries
• This type of trie resembles the one in the figure above, except that chains which lead to leaves are trimmed.
• This is illustrated in the next figure.
[Figure: the compact form of the trie]
Compact Tries
• The number of leaves is n+1 where n is the number of input strings.
• In the leaves, we may store either the strings themselves or pointers to the strings (that is, integers).
Tries called "PATRICIA"
• "PATRICIA" stands for "practical algorithm to retrieve information coded in alphanumeric".
• The difference is that an edge can be labeled with more than one character.
• All the unary nodes are collapsed.
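Collapsing unary nodes can be sketched as a post-processing pass over a character-per-edge trie. The nested-dictionary representation, the "$" end marker, and the function names are assumptions made for this illustration. A real PATRICIA implementation typically stores skip counts or index pairs rather than concatenated labels.

```python
END = "$"  # assumed end-of-word marker


def build_trie(words):
    """Plain trie: one character per edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def compress(node):
    """Merge every chain of unary nodes into a single multi-character edge."""
    out = {}
    for label, child in node.items():
        # While the child has exactly one outgoing edge, absorb it into the label.
        while len(child) == 1:
            ((next_label, next_child),) = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


compact = compress(build_trie(["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]))
print(compact)
```

After compression only branching nodes and leaves remain, so the number of nodes is proportional to the number of stored strings rather than to their total length.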
The very compact trie will look as follows:
[Figure: PATRICIA trie]
Tries called "PATRICIA"
• A binary PATRICIA trie uses a two-symbol (binary) alphabet, so each node has at most two outgoing edges.
Suffix Tree
• The suffix tree T(x) of a string x[1..n] is the compacted trie of all suffixes x[i..n] for i = 1,..,n+1, i.e., including the empty suffix.
• It allows for a particularly fast implementation of many important string operations.
• The suffix tree for a string S is a tree (more specifically a trie) whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf.
Suffix Tree
• The idea behind the suffix tree is to assign to each symbol in a text an index corresponding to its position in the text.
– I.e., the first symbol has index 1 and the last symbol has index n = number of symbols in the text.
• In the tree we use indices instead of the actual object.
Suffix tree
• The advantages are:
– It requires less storage space.
– We do not have to worry about how the text is represented (binary, ASCII, etc.).
– We do not have to store the same object twice (no duplicates).
Suffix trie
• We begin by giving a position to every suffix in the text. We can now build a suffix trie for all n suffixes of the text.
• E.g.
– TEXT: G O O G O L $
– POSITION: 1 2 3 4 5 6 7
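For the example text, the suffixes and a naive suffix trie can be sketched as below. Positions are 1-based to match the slide. The nested-dictionary trie and the helper names are illustrative assumptions, and this quadratic construction is only suitable for short texts.

```python
def suffixes(text):
    """Pair each suffix with its 1-based starting position."""
    return [(i + 1, text[i:]) for i in range(len(text))]


def build_suffix_trie(text):
    """Insert every suffix character by character (one symbol per edge)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    return 1 if not node else sum(count_leaves(child) for child in node.values())


def height(node):
    return 0 if not node else 1 + max(height(child) for child in node.values())


TEXT = "GOOGOL$"
for position, suffix in suffixes(TEXT):
    print(position, suffix)

trie = build_suffix_trie(TEXT)
print(count_leaves(trie), height(trie))
```

The terminator $ occurs only at the end of the text, so no suffix is a prefix of another and every suffix ends in its own leaf.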
Suffix trie
The resulting tree has n leaves and height n
Suffix tree
• The suffix tree is created by TRIMMING (compacting + collapsing every unary node) the suffix TRIE.
• The following is a picture of a compact suffix tree.
[Figure: compact suffix tree]
Suffix tree
• In a suffix tree we can store pointers rather than words in the leaves.
• Also, we can replace every string by a pair of indices (a,b), where a is the index of the beginning of the string and b is the index of the end of the string.
• I.e., we write
– (3,7) for OGOL$
– (1,2) for GO
– (7,7) for $
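The index-pair encoding can be checked directly against the example text. The helper name label is hypothetical; a and b are the 1-based inclusive positions used on the slide.

```python
TEXT = "GOOGOL$"


def label(a: int, b: int) -> str:
    """Decode an edge label stored as a 1-based inclusive index pair (a, b)."""
    return TEXT[a - 1:b]


for pair in [(3, 7), (1, 2), (7, 7)]:
    print(pair, label(*pair))
```

Storing two integers per edge keeps the size of each label constant, regardless of how long the substring it denotes is.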
Suffix tree
• The corresponding suffix tree looks like this
Search in suffix tree
• Pseudo-code for searching for a string S in a suffix tree:
– Start at the root.
– Go down the tree, each time taking the bifurcation (branch) that corresponds to the next symbols of S.
– If S corresponds to a node, then return all leaves in its subtree.
– If S encounters a NIL pointer, then S is not in the tree.
Search in suffix tree
• If S = "GO" we take the GO bifurcation and return: GOOGOL$, GOL$.
• If S = "OR" we take the O bifurcation and then we hit a NIL pointer, so "OR" is not in the tree.
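The two searches can be traced with the sketch below. As a simplification it uses an uncompacted suffix trie (one symbol per edge) instead of a true suffix tree, and it stores each suffix's 1-based start position under an assumed LEAF key. The search logic is the same: descend along S, then collect all leaves below the node reached.

```python
LEAF = None  # assumed dictionary key under which a leaf stores its suffix position


def build_suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # 1-based start position of this suffix
    return root


def search(trie, text, s):
    """Return the suffixes that start with s, or [] if a missing edge (NIL) is hit."""
    node = trie
    for ch in s:
        if ch not in node:
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:  # collect every leaf in the subtree below the node reached
        current = stack.pop()
        for key, child in current.items():
            if key is LEAF:
                positions.append(child)
            else:
                stack.append(child)
    return [text[p - 1:] for p in sorted(positions)]


TEXT = "GOOGOL$"
trie = build_suffix_trie(TEXT)
print(search(trie, TEXT, "GO"))
print(search(trie, TEXT, "OR"))
```

The descent costs time proportional to the length of S, independent of the text length; collecting the leaves adds time proportional to the size of the subtree visited.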
Applications of suffix tree
• Exact matching
• Common substrings, with applications
• Matching statistics
• Suffix arrays
• Genome-scale projects
Exact Matching
• Given string x and pattern y, report where y occurs in x
• Pattern ata occurs at position 2 in tatat
• Pattern tatt does not occur in tatat
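Both exact-matching examples can be reproduced with an index over the suffixes of x. The sketch below uses a naively built suffix array (suffixes sorted as strings) plus binary search, rather than a suffix tree. The function name occurrences is an assumption, and the construction is only meant for short strings.

```python
from __future__ import annotations

from bisect import bisect_left, bisect_right


def occurrences(x: str, y: str) -> list[int]:
    """Return the 1-based positions where pattern y occurs in text x."""
    m = len(y)
    # Naive suffix array: start offsets ordered by the suffix they begin.
    suffix_array = sorted(range(len(x)), key=lambda i: x[i:])
    # Length-m prefixes of sorted suffixes are themselves in sorted order.
    prefixes = [x[i:i + m] for i in suffix_array]
    lo = bisect_left(prefixes, y)
    hi = bisect_right(prefixes, y)
    return sorted(i + 1 for i in suffix_array[lo:hi])


print(occurrences("tatat", "ata"))
print(occurrences("tatat", "tatt"))
```

All suffixes that start with y are adjacent in the sorted order, so two binary searches delimit every occurrence at once.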
Assumptions in indexing and searching
• We make the following assumptions.
– We call n the size of the text database.
– Whenever a pattern is searched, we assume that it is of length
m, which is much smaller than n.
– We call M the amount of main memory available.
– The modifications which a text database undergoes are
additions, deletions, and replacements of pieces of text of size
n' < n.