1 IS531 - Ch 8 Modern Information Retrieval Indexing and Searching Presented by Raed Ibrahim Al-Fayez Ali Sulaiman Al-Humaimidi Supervised by: Dr. Mourad Ykhlef
  • date post

    19-Dec-2015
  • Category

    Documents



IS531 - Ch 8

Modern Information Retrieval

Indexing and Searching

Presented by: Raed Ibrahim Al-Fayez

Ali Sulaiman Al-Humaimidi

Supervised by: Dr. Mourad Ykhlef


Contents

8.1 Introduction
8.2 Inverted Files
8.3 Other Indices for Text
8.4 Boolean Queries
8.5 Sequential Searching
8.6 Pattern Matching
8.7 Structural Queries
8.8 Compression
8.9 Trends and Research Issues


8.1 Introduction (1)

Options for searching basic queries:
– Sequential (on-line) text searching:
    Finds the occurrences of a pattern in a text when the text is not preprocessed.
    Good when the text is small (a few MB) and when the index overhead can't be afforded.
– Indexed searching:
    Builds data structures over the text (indices) to speed up the search.
    Good when the text is large or huge and semi-static (not often updated).


8.1 Introduction (3)

Main indexing techniques:
– Inverted files (keyword-based search): the best choice for most applications.
– Suffix arrays/trees: faster for phrase searches, but harder to build and maintain.
– Signature files: popular in the mid-1980s, but since displaced by inverted files.

For each technique, pay attention to:
– Search cost and space overhead,
– Construction cost and maintenance cost.


8.1 Introduction (4)

An index must be built and stored in a data structure before searching:
– Basic data structures: sorted arrays, binary search trees, B-trees, hash tables, tries, Patricia trees, etc.

Trie (from "retrieval"):
– A multi-way tree that stores a set of strings and can retrieve any of them in time that depends only on the string length.
– Every edge of the tree is labeled with a letter.
– Used for storing strings over an alphabet.
– Used in dictionaries (a, an, and, etc.).
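
A minimal sketch of a trie, written here in Python for illustration (the class and method names are assumptions, not taken from the slides). Each edge is labeled with one letter, and a lookup walks one edge per character of the string:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> child node (one labeled edge each)
        self.is_word = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for letter in word:          # cost depends on len(word) only
            node = node.children.get(letter)
            if node is None:
                return False
        return node.is_word


trie = Trie()
for w in ["a", "an", "and"]:
    trie.insert(w)
```

Here "a", "an" and "and" share a single path from the root, which is why a trie suits dictionary-like vocabularies.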


8.2 Inverted files (1)

Definition:
– A word-oriented mechanism for indexing a text collection in order to speed up the searching task.
– Also called an inverted index.

Composed of two elements:
– Vocabulary: the set of all different words in the text.
– Occurrences: for each word, a list of all the text positions where the word appears. The positions can refer to words or characters.


8.2 Inverted files (2)

A sample text and an inverted index built on it:

Text (the numbers are the character positions at which each word starts):

This is a text. A text has many words. Words are made from letters

1 6 9 11 17 19 24 28 33 40 46 50 55 60

Inverted index:

Vocabulary    Occurrences
letters       60…
made          50…
many          28…
text          11, 19…
words         33, 40…
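
The index above can be reproduced with a short sketch. This is an illustration only: the function name is hypothetical, words are lowercased, and every word is indexed (the slide lists just five of them, presumably leaving out the stopwords).

```python
import re

SAMPLE = "This is a text. A text has many words. Words are made from letters"


def build_inverted_index(text):
    """Map each lowercased word to the 1-based character positions where it starts."""
    index = {}
    for match in re.finditer(r"\w+", text):
        index.setdefault(match.group().lower(), []).append(match.start() + 1)
    return index


index = build_inverted_index(SAMPLE)
for word in ["letters", "made", "many", "text", "words"]:
    print(word, index[word])
```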


8.2 Inverted files (3)

Required space:
– The space required for the vocabulary is rather small.
– The occurrences demand much more space.

Block addressing:
– The text is divided into blocks, and the occurrences point to the blocks where the word appears (instead of the exact positions).
– Reduces space requirements: pointers are smaller because there are fewer blocks than positions, and several occurrences of a word in the same block collapse into one reference.
– If the exact occurrence positions are required, an online search over the qualifying blocks has to be performed.
– Note: max 256 blocks and 200 MB of text!


8.2 Inverted files (4)

The sample text split into four blocks:

This is a text. A text has many words. Words are made from letters

block 1 block 2 block 3 block 4

Inverted index:

Vocabulary    Occurrences
letters       4…
made          4…
many          2…
text          1, 2…
words         3…
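
A sketch of block addressing on the sample text. The exact block boundaries are not recoverable from the transcript, so the four blocks below are an assumption chosen to be consistent with the occurrence lists shown; the function names are hypothetical as well.

```python
import re

# Assumed block boundaries (consistent with the occurrence lists on the slide).
BLOCKS = ["This is a text.", "A text has many", "words. Words are", "made from letters"]


def build_block_index(blocks):
    """Map each lowercased word to the sorted list of block numbers containing it."""
    index = {}
    for number, block in enumerate(blocks, start=1):
        for word in re.findall(r"\w+", block.lower()):
            postings = index.setdefault(word, [])
            if not postings or postings[-1] != number:  # one entry per block
                postings.append(number)
    return index


def find_in_blocks(blocks, index, word):
    """Online search over the qualifying blocks to recover exact positions."""
    hits = []
    for number in index.get(word, []):
        block = blocks[number - 1].lower()
        for match in re.finditer(r"\b%s\b" % re.escape(word), block):
            hits.append((number, match.start()))  # offset inside the block
    return hits


block_index = build_block_index(BLOCKS)
```

Both occurrences of "words" collapse into the single entry 3, which is where the space saving comes from; the exact offsets are only recovered by scanning block 3.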


8.2 Inverted files (5)

Block addressing (continued):
– Blocks of fixed size:
    Improve efficiency at retrieval time.
    Larger blocks match queries more often, and so incur more sequential traversal of the text.
– Blocks that follow the natural division of the text collection (files, documents, web pages, etc.):
    Good for single-word queries that do not require the exact occurrence positions.


8.2.1 Searching (1)

General search steps:
– Vocabulary search: the words and patterns present in the query are isolated and searched in the vocabulary.
– Retrieval of occurrences: the lists of the occurrences of all the words found are retrieved.
– Manipulation of occurrences: the occurrences are processed to solve phrases, proximity, or Boolean operations. If block addressing is used, it may be necessary to search the text directly to find the information missing from the occurrences.


8.2.1 Searching (2)

Single-word queries (simple):
– Return the list of occurrences.

Context queries (complex):
– Each element is searched separately and a list is generated for each of them.
– The lists are traversed to find places where all the words appear in sequence (for a phrase query) or close enough to each other (for a proximity query).

With block addressing, watch the block boundaries, since they may split a match (time consuming).
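
A sketch of how the occurrence lists can be combined for context queries. It assumes an index that stores word positions (1st word, 2nd word, ...) rather than character positions; the function names are hypothetical.

```python
import re

SAMPLE = "This is a text. A text has many words. Words are made from letters"


def build_word_index(text):
    """Map each lowercased word to its 1-based word positions in the text."""
    index = {}
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        index.setdefault(match.group().lower(), []).append(position)
    return index


def phrase_query(index, phrase):
    """Positions where all the words of the phrase appear in sequence."""
    words = phrase.lower().split()
    lists = [index.get(w, []) for w in words]
    if not lists or not all(lists):
        return []
    rest = [set(positions) for positions in lists[1:]]
    return [p for p in lists[0]
            if all(p + offset in s for offset, s in enumerate(rest, start=1))]


def proximity_query(index, first, second, max_distance):
    """Pairs of positions where the two words are at most max_distance words apart."""
    return [(p, q)
            for p in index.get(first, [])
            for q in index.get(second, [])
            if abs(p - q) <= max_distance]


word_index = build_word_index(SAMPLE)
```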


8.2.2 Construction (1)

Construction:
– Building and maintaining an inverted index is a relatively low-cost task.
– All the vocabulary is stored in a data structure (a trie), storing with each word a list of its occurrences.
– Once constructed, the index is written to disk in two files:
    Posting file: the lists of occurrences are stored contiguously.
    Vocabulary file: the vocabulary is stored in lexicographical order, with a pointer for each word to its list in the posting file.
– Splitting the index into two files allows the vocabulary to be kept in memory to speed up the search.


8.2.2 Construction (2)

Construction steps:
1. Read each word of the text.
2. Search for the word in the trie. All the vocabulary known up to now is kept in a trie structure.
3. If the word is not found in the trie, it is added to the trie with its list of occurrences.
4. If the word is already in the trie, the new position is added to the end of its list of occurrences.
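
The construction steps can be sketched as follows. This is a minimal in-memory illustration, not the slides' implementation: the two "files" are modeled as Python lists, and all names are hypothetical.

```python
import re

SAMPLE = "This is a text. A text has many words. Words are made from letters"


class Node:
    def __init__(self):
        self.children = {}       # letter -> child node
        self.occurrences = None  # list of positions once a word ends here


def build_trie_index(text):
    root = Node()
    for match in re.finditer(r"\w+", text):       # step 1: read each word
        node = root
        for letter in match.group().lower():      # step 2: search it in the trie
            node = node.children.setdefault(letter, Node())
        if node.occurrences is None:              # step 3: new word
            node.occurrences = []
        node.occurrences.append(match.start() + 1)  # step 4: append the position
    return root


def write_index(root):
    """Flatten the trie into a vocabulary 'file' and a posting 'file'."""
    vocabulary, postings = [], []

    def walk(node, prefix):
        if node.occurrences is not None:
            # (word, offset into postings, number of occurrences)
            vocabulary.append((prefix, len(postings), len(node.occurrences)))
            postings.extend(node.occurrences)
        for letter in sorted(node.children):      # lexicographical order
            walk(node.children[letter], prefix + letter)

    walk(root, "")
    return vocabulary, postings


vocabulary, postings = write_index(build_trie_index(SAMPLE))
```

Each vocabulary entry carries a pointer (here an offset and a length) into the contiguous posting list, so only the small vocabulary needs to stay in memory.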


8.2.2 Construction (3)

Building an inverted index for the sample text

This is a text. A text has many words. Words are made from letters

1 6 9 11 17 19 24 28 33 40 46 50 55 60

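Vocabulary trie (edges are labeled with letters; the leaves hold the occurrence lists):

‘l’ → letters: 60
‘m’ → ‘a’ → ‘d’ → made: 50
            ‘n’ → many: 28
‘t’ → text: 11, 19
‘w’ → words: 33, 40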


8.3 Other Indices for Text

– Suffix trees and suffix arrays
– Signature files


Suffix Trees and Suffix Arrays

Suffix:
– Each position in the text is considered as a text suffix: a string that goes from that text position to the end of the text.

Both structures:
– Answer more complex queries efficiently.
– Have a costly construction process.
– Require the text to be readily available at query time.
– Do not deliver the results in text position order.


Suffix tree (1)

Index points of interest:
– Selected from the text; they point to the beginning of the text positions which will be retrievable.
– Each position is considered as a text suffix.
– Each suffix is uniquely identified by its position.

Structure:
– A trie data structure built over all the suffixes of the text.
    The pointers to the suffixes are stored at the leaf nodes.
    This trie is compacted into a Patricia tree (compressing unary paths).

Searching:
– Many basic patterns such as words, prefixes, and phrases can be searched by a simple trie search.


Suffix tree (2)

The suffix trie and suffix tree for the sample text:

This is a text. A text has many words. Words are made from letters

1 6 9 11 17 19 24 28 33 40 46 50 55 60

Suffixes (one per index point of interest):
text. A text has many words. Words are made from letters.
many words. Words are made from letters.
…
made from letters.
letters.

[Figure: the suffix trie, with edges labeled by single characters (‘l’, ‘m’, ‘a’, ‘d’, ‘n’, ‘t’, ‘e’, ‘x’, ‘ ’, ‘.’, ‘w’, ‘o’, ‘r’, ‘s’), and the compacted suffix tree (PAT). In both, the leaves point to the text positions 60, 50, 28, 19, 11, 40, 33.]


Suffix tree (Example)

Let S = abab. A suffix tree of S is a compressed trie of all the suffixes of S$ = abab$:

{ $ 5, b$ 4, ab$ 3, bab$ 2, abab$ 1 }

root
  ab
    ab$   (leaf 1)
    $     (leaf 3)
  b
    ab$   (leaf 2)
    $     (leaf 4)
  $       (leaf 5)


Trivial algorithm to build a suffix tree

Suffixes of abab$: { $ 5, b$ 4, ab$ 3, bab$ 2, abab$ 1 }

1. Put the longest suffix, abab$, in the tree: a single edge abab$ from the root.
2. Put the suffix bab$ in: it shares no prefix with abab$, so it becomes a second edge bab$ from the root.
3. Put the suffix ab$ in: it shares the prefix ab with abab$, so that edge is split into ab, followed by the two edges ab$ and $.
4. Put the suffix b$ in: it shares the prefix b with bab$, so that edge is split into b, followed by the two edges ab$ and $.
5. Put the suffix $ in: it becomes a third edge $ from the root.

END: label each leaf with the starting point of the corresponding suffix (1, 2, 3, 4, 5).
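
A sketch of this trivial (quadratic-time) construction. It assumes the string does not itself contain the terminator $, and the class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.edges = {}    # first character of the label -> (label, child)
        self.start = None  # for a leaf: 1-based start of its suffix


def insert(node, suffix, start):
    while True:
        first = suffix[0]
        if first not in node.edges:          # no shared prefix: new leaf edge
            leaf = Node()
            leaf.start = start
            node.edges[first] = (suffix, leaf)
            return
        label, child = node.edges[first]
        k = 0                                # length of the shared prefix
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k < len(label):                   # split the edge after k characters
            middle = Node()
            middle.edges[label[k]] = (label[k:], child)
            node.edges[first] = (label[:k], middle)
            child = middle
        node, suffix = child, suffix[k:]     # continue below the shared prefix


def build_suffix_tree(s):
    text = s + "$"                           # unique terminator
    root = Node()
    for i in range(len(text)):               # longest suffix first
        insert(root, text[i:], i + 1)
    return root


def leaves(node, path=""):
    """Map each suffix (path from the root) to the label of its leaf."""
    if node.start is not None:
        return {path: node.start}
    found = {}
    for label, child in node.edges.values():
        found.update(leaves(child, path + label))
    return found


tree = build_suffix_tree("abab")
print(leaves(tree))
```

Because of the terminator, no suffix is a prefix of another, so every insertion ends by creating a new leaf.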


Suffix arrays (1)

Structure:
– Suffix arrays are a space-efficient implementation of suffix trees.
– Simply an array containing all the pointers to the text suffixes, listed in lexicographical order.
– Suffix arrays are designed to allow binary searches, done by comparing the contents of each pointer.
– Supra-indices:
    If the suffix array is large, this binary search can perform poorly because of the number of random disk accesses.
    To remedy this situation, the use of supra-indices over the suffix array has been proposed.


Suffix arrays (2)

Example

This is a text. A text has many words. Words are made from letters

1 6 9 11 17 19 24 28 33 40 46 50 55 60

Suffix array:   60 50 28 19 11 40 33

Supra-index:    lett (→ 60), text (→ 19), word (→ 33)

[Figure: the suffix tree for the same text; reading its leaves from left to right (60, 50, 28, 19, 11, 40, 33) gives the suffix array.]


Suffix arrays (3)

Searching:
– Search steps:
    Originate two limiting patterns P1 and P2 from the original pattern, so that the wanted suffixes S satisfy P1 ≤ S < P2.
    Binary search both limiting patterns in the suffix array. Supra-indices are used as a first step to alleviate disk access.
    All the elements lying between both positions point to exactly those suffixes that start like the original pattern.
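
A sketch of a suffix array over the sample text and of the search with two limiting patterns. Several things here are assumptions made for illustration: the index points are the seven positions shown on the slides, comparison is case-insensitive, P1 is the pattern itself and P2 is the pattern with its last character incremented, and the suffixes are materialized in memory (a real implementation would compare against the text on the fly and use a supra-index first).

```python
from bisect import bisect_left

SAMPLE = "This is a text. A text has many words. Words are made from letters"
INDEX_POINTS = [11, 19, 28, 33, 40, 50, 60]   # 1-based, as on the slides


def build_suffix_array(text, index_points):
    folded = text.lower()
    return sorted(index_points, key=lambda p: folded[p - 1:])


def search(text, suffix_array, pattern):
    """Positions whose suffix starts with the (non-empty) pattern."""
    folded = text.lower()
    p1 = pattern.lower()
    p2 = p1[:-1] + chr(ord(p1[-1]) + 1)       # smallest string above all matches
    suffixes = [folded[p - 1:] for p in suffix_array]
    low = bisect_left(suffixes, p1)
    high = bisect_left(suffixes, p2)
    return suffix_array[low:high]


suffix_array = build_suffix_array(SAMPLE, INDEX_POINTS)
print(suffix_array)
print(search(SAMPLE, suffix_array, "text"))
```

The answers come out in lexicographical order of the suffixes, not in text position order.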


Signature files (1)

Definition:
– A word-oriented index structure based on hashing.
– Uses linear search.
– Suitable for texts that are not very large.

Structure:
– Based on a hash function that maps words to bit masks.
– The text is divided into blocks.
    The bit mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.
    A word is not in a block if some bit set in the query mask is not set in the block mask.


Signature files (2)

Example:

This is a text. A text has many words. Words are made from letters

block 1 block 2 block 3 block 4

Text signature (one bit mask per block):

000101 110101 100100 101101

Signature function:

h(text) = 000101
h(many) = 110000
h(words) = 100100
h(made) = 001100
h(letters) = 100001


Signature files (3)

False drop problem:
– All the corresponding bits are set in a block mask even though the word is not there!
– The design should ensure that the probability of a false drop is low, while keeping the signature file as short as possible.
– Enhance the hashing function to minimize the error probability.


Signature files (4)

Searching:
1. If searching for a single word, hash the word to a bit mask W.
2. If searching for phrases or reasonable proximity queries:
    1) Hash each word in the query to a bit mask.
    2) Bitwise OR all the query masks into a bit mask W.
3. Compare W to the bit masks Bi of all the text blocks. If all the bits set in W are also set in Bi, then the text block may contain the word.
4. For all candidate text blocks, an online traversal must be performed to verify whether the query is actually there.

Construction:
1. Cut the text into blocks.
2. Generate an entry of the signature file for each block. This entry is the bitwise OR of the signatures of all the words in the block.
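
A sketch of construction and searching using the signature function of the example. The block boundaries are an assumption (chosen so that the block masks come out as in the example), words without a signature are treated as stopwords, and the function names are hypothetical.

```python
import re

# Signature function from the example slide.
H = {
    "text": 0b000101,
    "many": 0b110000,
    "words": 0b100100,
    "made": 0b001100,
    "letters": 0b100001,
}

# Assumed block boundaries.
BLOCKS = ["This is a text.", "A text has many", "words. Words are", "made from letters"]


def tokens(block):
    return re.findall(r"\w+", block.lower())


def block_signature(block):
    signature = 0
    for word in tokens(block):
        signature |= H.get(word, 0)   # bitwise OR of the word signatures
    return signature


SIGNATURES = [block_signature(b) for b in BLOCKS]


def candidates(query_words):
    """Blocks (1-based) whose mask contains every bit set in the query mask W."""
    w = 0
    for word in query_words:
        w |= H[word]
    return [i + 1 for i, b in enumerate(SIGNATURES) if b & w == w]


def search(word):
    """Verify each candidate block with an online traversal."""
    return [i for i in candidates([word]) if word in tokens(BLOCKS[i - 1])]


print(candidates(["letters"]), search("letters"))
```

For the query "letters", block 2 is a candidate even though the word is not there: its mask 110101 covers 100001 because of the bits contributed by "text" and "many". That is a false drop, and only the verification step removes it.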


8.4 Boolean queries

Set manipulation algorithms:
– Used to operate on sets of results.
– Example: a OR (b AND c)

Search phases:
1. Determine which documents classify (match the query).
2. Determine the relevance of the classifying documents, so as to present them appropriately to the user.
3. Retrieve the exact positions of the matches, to highlight them in those documents that the user actually wants to see.
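
A sketch of the set manipulation behind a query such as a OR (b AND c), assuming each term has already been resolved to a sorted list of document numbers. The lists below are made-up illustration data, and the function names are hypothetical.

```python
def intersect(a, b):
    """AND: merge two sorted lists, keeping the common elements."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: merge two sorted lists, keeping every element once."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


# Hypothetical document lists for the terms a, b and c.
docs_a, docs_b, docs_c = [1, 4], [2, 3, 5], [3, 5, 7]
result = union(docs_a, intersect(docs_b, docs_c))   # a OR (b AND c)
print(result)
```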


8.5 Sequential searching

– Used for text searching when no data structure has been built on the text.
– The problem of exact string matching: given a short pattern P of length m and a long text T of length n, find all the text positions where the pattern occurs.


8.5 Sequential searching

– Brute Force
– Knuth-Morris-Pratt
– Boyer-Moore family
– Shift-Or


Brute Force

Brute Force algorithm (BF):
– The simplest possible one.

– It consists of merely trying all possible pattern positions in the text. For each such position, it verifies whether the pattern matches at that position.

– Does not need any pattern preprocessing.

– Many algorithms use a modification of this scheme.

– Left to right search.
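
A minimal sketch of the brute force scheme (the function name is hypothetical; positions are reported 1-based):

```python
def brute_force(text, pattern):
    """Try every window position, left to right, with no preprocessing."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start + 1)
    return matches


print(brute_force("abracabracadabra", "abracadabra"))
```

In the worst case every window is compared almost completely, which costs on the order of m·n character comparisons.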


Brute Force example

Each row shows the pattern characters compared at one window position:

a b r a c a b r a c a d a b r a
a b r a c a d
  a
    a
      a b
        a
          a b r a c a d a b r a


Knuth-Morris-Pratt(1)

Reuse information from previous checks:
– When the window has to be shifted, there is a prefix of the pattern that matched the text.
– The algorithm takes advantage of this information to avoid trying window positions which can be deduced not to match.
– Left-to-right scan, like the Brute Force algorithm.


Knuth-Morris-Pratt(2)

Next table:
– The next table at position j gives the length of the longest proper prefix of P1..j-1 which is also a suffix of it, such that the characters following the prefix and the suffix are different.
– j - next[j] - 1 window positions can be safely skipped if the characters up to j-1 matched and the j-th did not.


Knuth-Morris-Pratt(3)

Next table for ‘abracadabra’:

j               1 2 3 4 5 6 7 8 9 10 11 12
search pattern  a b r a c a d a b r  a
next            0 0 0 0 1 0 1 0 0 0  0  4

[next function]


Knuth-Morris-Pratt(4)

Searching ‘abracadabra’:

a b r a c a b r a c a d a b r a
a b r a c a d
          a b r a c a d a b r a

[search example]
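
A sketch of the next table and of the search that uses it. The table is computed directly from its definition (simple but slow; the usual linear-time preprocessing is not shown), with 0 used when no prefix qualifies so that the values agree with the table for ‘abracadabra’. The function names are hypothetical.

```python
def next_table(pattern):
    """next[j] for j = 1..m+1, returned as a 0-based list (table[j - 1])."""
    m = len(pattern)
    table = []
    for j in range(1, m + 2):
        matched = pattern[:j - 1]                  # P1..j-1
        best = 0
        for k in range(len(matched) - 1, -1, -1):  # proper prefixes, longest first
            is_border = matched[:k] == matched[len(matched) - k:]
            differs = j > m or pattern[k] != pattern[j - 1]
            if is_border and differs:
                best = k
                break
        table.append(best)
    return table


def kmp_search(text, pattern):
    """Report the 1-based positions where the pattern occurs."""
    table = next_table(pattern)
    n, m = len(text), len(pattern)
    matches, start, j = [], 0, 0                   # j = characters already matched
    while start + m <= n:
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start + 1)
        k = table[j]                               # next[j + 1] in 1-based terms
        if j == k:                                 # mismatch on the first character
            start += 1
            j = 0
        else:
            start += j - k                         # skip the hopeless windows
            j = k                                  # these characters are known to match
    return matches


print(next_table("abracadabra"))
print(kmp_search("abracabracadabra", "abracadabra"))
```

In the search example the first window fails on ‘d’ after six matched characters; the table value 1 lets the window move five positions at once and resume after the already-matched ‘a’, so no text character is re-read.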


Boyer-Moore Family(1)

BM algorithm:
– Based on the fact that the check inside the window can proceed backwards.
– When a match or mismatch is determined, a suffix of the pattern has been compared and found equal to the text in the window.


Boyer-Moore Family(2)

BM example:
– Searching ‘date’

p = "date"
index[d] = 0
index[a] = 1
index[t] = 2
index[e] = 3
index[anything else] = -1


Boyer-Moore Family(3)

BM example:
– Searching ‘date’

1. T = "some date"
   P = "date"          (**) m <> t and index[m] = -1, so move P so that its position -1 lies below m.
2. T = "some date"
   P =    "date"       (*) a <> e and index[a] = 1, so move P so that its character 1 lies below a.
3. T = "some date"
   P =      "date"     (****) P matches.
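
A sketch of the shifting rule used in this example (the bad-character rule only; the full Boyer-Moore algorithm combines it with a second, match-based shift that is not shown). The function name is hypothetical, and positions are 0-based to agree with the index table of the example.

```python
def bm_bad_character_search(text, pattern):
    """Compare each window backwards; shift using the mismatching text character."""
    index = {c: i for i, c in enumerate(pattern)}   # rightmost position of each character
    n, m = len(text), len(pattern)
    matches, start = [], 0
    while start + m <= n:
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1                                  # proceed backwards in the window
        if j < 0:
            matches.append(start)
            start += 1
        else:
            # Align position index[c] of the pattern below the text character c;
            # always move at least one position forward.
            start += max(1, j - index.get(text[start + j], -1))
    return matches


print(bm_bad_character_search("some date", "date"))
```

On "some date" the window moves by 3 and then by 2, so only three window positions are tried instead of six.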


Shift-Or(1)

– The basic idea of the Shift-Or (SO) algorithm is to represent the state of the search as a number, so that each search step costs a small number of arithmetic and logical operations.

– Efficient if the pattern length is no longer than the memory-word size w of the machine (w is typically 32 or 64).


Shift-Or(2)

SO example:
– Searching ‘GCAGAGAG’.
– P has been found at position 12 - 8 + 1 = 5
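
A sketch of Shift-Or. The text of the example is not in the transcript, so the text used below is an assumption, chosen so that the pattern ends at position 12 (0-based) as in the example. Python integers are unbounded, so the word-size limit does not show up here; the function name is hypothetical.

```python
def shift_or(text, pattern):
    """Bit i of the state is 0 iff pattern[0..i] matches the text ending here."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    masks = {}
    for i, c in enumerate(pattern):
        # Bit i of masks[c] is 0 iff pattern[i] == c.
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for j, c in enumerate(text):
        # One shift and one OR per text character.
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(j - m + 1)               # 0-based start: j - m + 1
    return matches


TEXT = "GCATCGCAGAGAGTATACAGTACG"   # assumed text, not from the slides
print(shift_or(TEXT, "GCAGAGAG"))
```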


Phrases and proximity

The best way to search for a phrase:
– Search first for the element which is less frequent or can be searched faster. For instance:
    Longer patterns are better than shorter ones.
    Allowing fewer errors is better than allowing more errors.

The best way to search for a proximity query:
– Similar to the best way to search for a phrase.


8.6 Pattern Matching

– String matching allowing errors
– Pattern matching using indices


String matching allowing errors(1)

– This problem is called ‘approximate string matching’.

– It can be stated as follows: given a short pattern P of length m, a long text T of length n, and a maximum allowed number of errors k, find all the text positions where the pattern occurs with at most k errors.


String matching allowing errors(1)

Dynamic programming:
– The classical solution to approximate string matching.
– A matrix C[0..m, 0..n] is filled column by column, where C[i,j] represents the minimum number of errors needed to match P1..i to a suffix of T1..j.
    m: length of the short pattern P.
    n: length of the long text T.


String matching allowing errors(2)

Dynamic programming:
– This is computed as follows:

C[0, j] = 0
C[i, 0] = i
C[i, j] = if (Pi = Tj) then C[i-1, j-1]
          else 1 + min(C[i-1, j], C[i, j-1], C[i-1, j-1])

– A match is reported at text positions j such that C[m, j] ≤ k.


String matching allowing errors(3)

Dynamic programming:
– Search for ‘survey’ in the text ‘surgery’ with two errors.

       s  u  r  g  e  r  y
    0  0  0  0  0  0  0  0
s   1  0  1  1  1  1  1  1
u   2  1  0  1  2  2  2  2
r   3  2  1  0  1  2  2  3
v   4  3  2  1  1  2  3  3
e   5  4  3  2  2  1  2  3
y   6  5  4  3  3  2  2  2
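
A sketch that fills the matrix with the recurrence for C[i, j] and reports the matching text positions (the function names are hypothetical). A real implementation needs to keep only one column at a time; the full matrix is built here so that it can be printed.

```python
def error_matrix(pattern, text):
    """C[i][j] = minimum errors to match pattern[:i] to a suffix of text[:j]."""
    m, n = len(pattern), len(text)
    c = [[0] * (n + 1) for _ in range(m + 1)]   # row 0: C[0, j] = 0
    for i in range(1, m + 1):
        c[i][0] = i                              # C[i, 0] = i
    for j in range(1, n + 1):                    # column by column
        for i in range(1, m + 1):
            if pattern[i - 1] == text[j - 1]:
                c[i][j] = c[i - 1][j - 1]
            else:
                c[i][j] = 1 + min(c[i - 1][j], c[i][j - 1], c[i - 1][j - 1])
    return c


def approximate_matches(pattern, text, k):
    """Text positions j (1-based end positions) with C[m, j] <= k."""
    last_row = error_matrix(pattern, text)[len(pattern)]
    return [j for j in range(1, len(text) + 1) if last_row[j] <= k]


for row in error_matrix("survey", "surgery"):
    print(row)
print(approximate_matches("survey", "surgery", 2))
```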


String matching allowing errors(4)

Dynamic programming:

survey

sur_e_y


String matching allowing errors(5)

Bit-parallelism:
– Has been used to parallelize the computation of the dynamic programming matrix.

Filtering:
– Filters the text, reducing the area where dynamic programming needs to be used.


Pattern matching Using indices

Inverted files:
– Are word-oriented.
– Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary.
– If block addressing is used, the search must be completed with a sequential search over the blocks.
– Are not able to efficiently find approximate matches or regular expressions that span many words.


8.7 Structural Queries(1)

Algorithms to search structured text:
– Some implementations build an ad hoc index to store the structure.
    More efficient and independent of any consideration about the text.
    Needs extra development and maintenance effort.


8.7 Structural Queries(2)

Algorithms to search structured text:
– Other techniques assume that the structure is marked in the text using ‘tags’ (the case of HTML text).
    These techniques rely on the same index used to query content (such as inverted files), using it to index and search those tags as if they were words.
    In many cases this is as efficient as an ad hoc index.
    Its integration into an existing text database is simpler.


Compressed indices(1)

Inverted files:
– Are quite amenable to compression, because the lists of occurrences are in increasing order of text position.
– An obvious choice is to represent the differences between the previous position and the current one.
– The text can be compressed independently of the index.
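
A sketch of storing differences instead of positions. The byte-level coding shown (a variable-byte code, where the high bit marks the last byte of each number) is one common choice picked here for illustration; the slides do not name a specific code, and the function names are hypothetical.

```python
def to_gaps(positions):
    """Replace each position by its difference from the previous one."""
    gaps, previous = [], 0
    for p in positions:
        gaps.append(p - previous)
        previous = p
    return gaps


def from_gaps(gaps):
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def vbyte_encode(numbers):
    """7 data bits per byte, low-order group first; high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                      # last byte of this number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


occurrences = [3, 130, 100000, 100005]
encoded = vbyte_encode(to_gaps(occurrences))
print(to_gaps(occurrences), len(encoded))
```

Because the lists are increasing, the gaps are small numbers and most of them fit in a single byte.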


Compressed indices(2)

Suffix trees and suffix arrays:
– Suffix arrays are very hard to compress further, because they represent an almost perfectly random permutation of the pointers to the text.
– Suffix arrays on compressed text:
    The main advantage is that both index construction and querying almost double their performance.
    Construction is faster because more compressed text fits in the same memory space, and therefore fewer text blocks are needed.
    Searching is faster because a large part of the search time is spent in disk seek operations over the text area to compare suffixes.


8.9 Trends and Research Issues

The main trends in indexing and searching textual databases:
– Text collections are becoming huge.

– Searching is becoming more complex.

– Compression is becoming a star in the field.


References

– Ricardo Baeza-Yates and Berthier Ribeiro-Neto, “Modern Information Retrieval”, Addison Wesley, 1999.
– K. Sparck Jones and P. Willett, “Readings in Information Retrieval”.
– Many different resources on the Internet:
    http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/
    http://www.cs.unt.edu/~rada/CSCE5200/


That’s all.

Thanks! Any questions?