Web Data Management Compression and Search...R*, where R is a regular expression and signifies...
Web Data Management
Compression and Search
Lecture 3: Search and basic indexing
What is Pattern Matching?
• Definition: given a text string T and a pattern string P, find the pattern inside the text
  – T: “the rain in spain stays mainly on the plain”
  – P: “n th”
The Brute Force Algorithm
• Check each position in the text T to see if the pattern P starts in that position
T: a n d r e w
P: r e w
T: a n d r e w
P: r e w
P moves 1 char at a time through T ....
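The position-by-position check can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `brute_force_match` is an assumption made for the example.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):      # every position where P could start in T
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters of P matched
            return start
    return -1


print(brute_force_match("andrew", "rew"))
print(brute_force_match("the rain in spain stays mainly on the plain", "n th"))
```

The outer loop is the "P moves 1 char at a time" step; the inner loop compares P against T at the current position.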
Analysis
• Brute force pattern matching runs in O(mn) time in the worst case, where n is the length of the text T and m is the length of the pattern P.
• But most searches of ordinary text take O(m+n), which is very quick.
• The brute force algorithm is fast when the alphabet of the text is large
  – e.g. A..Z, a..z, 0..9, etc.
• It is slower when the alphabet is small
  – e.g. 0, 1 (as in binary files, image files, etc.)
• Example of a worst case:
  – T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
  – P: "aaah"
• Example of a more average case:
  – T: "a string searching example is standard"
  – P: "store"
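To see the difference between the two cases, one can count character comparisons. The sketch below is illustrative only: `count_comparisons` is a made-up helper, and the worst-case text is assumed to be 25 a's followed by h rather than the exact string above.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by brute-force matching (stops at the first match)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:          # full match found, stop searching
            break
    return comparisons


worst = count_comparisons("a" * 25 + "h", "aaah")
average = count_comparisons("a string searching example is standard", "store")
print(worst, average)
```

In the worst case nearly every start position costs m comparisons, while in the ordinary-text case most positions are rejected after a single comparison.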
The KMP Algorithm
• The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
• But it shifts the pattern more intelligently than the brute force algorithm.
• If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
• Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]
Example
k     0  1  2  3  4
F(k)  0  0  1  0  1
T:
P:
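A minimal sketch of the failure function F and the KMP search loop follows. The function names are assumptions for illustration, and the pattern "abacab" is borrowed from the Boyer-Moore slides later in this lecture; its first five failure values are 0 0 1 0 1.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = length of the longest proper prefix of pattern[:k+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0                                   # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]                 # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_match(text: str, pattern: str) -> int:
    """Return the index of the first match, or -1. The text index never moves backwards."""
    if not pattern:
        return 0
    fail = failure_function(pattern)
    j = 0                                   # current position in the pattern
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]                 # shift the pattern, keep i where it is
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1


print(failure_function("abacab"))
print(kmp_match("abacaabaccabacabaabb", "abacab"))
```

On a mismatch at P[j], the loop reuses F(j-1) instead of restarting the comparison, which is what keeps the total work at O(m+n).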
KMP Advantages
• KMP runs in optimal time: O(m+n) – very fast
• The algorithm never needs to move backwards in the input text, T – this makes the algorithm good for processing very large files that are read in from external devices or through a network stream
KMP Disadvantages
• KMP doesn’t work so well as the size of the alphabet increases
  – more chance of a mismatch (more possible mismatches)
  – mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later
The Boyer-Moore Algorithm
• The Boyer-Moore pattern matching algorithm is based on two techniques.
• 1. The looking-glass technique
  – find P in T by moving backwards through P, starting at its end
• 2. The character-jump technique
  – when a mismatch occurs at T[i] == x
  – the character in pattern P[j] is not the same as T[i]
• There are 3 possible cases.
[Figure: T[i] = x is compared with P[j] = b and mismatches; the character a to the right of each has already matched.]
Case 1
• If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i].
[Figure: T[i] = x mismatches P[j], and x occurs in P to the left of j. P is shifted right until its last x lies under T[i]; then i and j move right, so j is at the end of P.]
Case 2
• If P contains x somewhere, but a shift right to the last occurrence is not possible, then shift P right by 1 character to T[i+1].
[Figure: x occurs in P only after position j, so aligning the last x with T[i] is not possible. P is shifted right by one character; then i and j move right, so j is at the end of P.]
Case 3
• If cases 1 and 2 do not apply, then shift P to align P[0] with T[i+1].
[Figure: there is no x in P, so P is shifted until P[0] lies under T[i+1]; then i and j move right, so j is at the end of P.]
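The three cases can be folded into a single shift rule. The sketch below uses the formulation i = i + m - min(j, 1 + L(x)), which is one common textbook way of writing the character jump and is assumed here rather than taken from the slides; the function name is also an assumption. Only the character-jump heuristic is implemented.

```python
def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the index of the first match, or -1 (looking-glass + character-jump only)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Last index of each character in the pattern; absent characters count as -1.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    i = j = m - 1                          # start comparing at the end of P
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i                   # matched all of P
            i -= 1                         # looking-glass: move backwards through P
            j -= 1
        else:
            lo = last.get(text[i], -1)
            # lo < j  -> case 1: align the last x in P with T[i]
            # lo > j  -> case 2: shift P right by one character
            # lo == -1 -> case 3: move P[0] to T[i+1]
            i += m - min(j, 1 + lo)
            j = m - 1
    return -1


print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))
```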
Boyer-Moore Example (1)
T:
P:
Last Occurrence Function
• Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L()
  – L() maps all the letters in A to integers
• L(x) is defined as: // x is a letter in A
  – the largest index i such that P[i] == x, or
  – -1 if no such index exists
L() Example
• A = {a, b, c, d}
• P: "abacab"

index  0  1  2  3  4  5
P      a  b  a  c  a  b

x      a  b  c  d
L(x)   4  5  3  -1

L() stores indexes into P[]
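As a quick check of the table, L() can be built with one pass over the pattern. The function name `last_occurrence` is an assumption for this sketch.

```python
def last_occurrence(pattern: str, alphabet: str) -> dict[str, int]:
    """Map every letter of the alphabet to its last index in pattern, or -1 if absent."""
    table = {ch: -1 for ch in alphabet}
    for idx, ch in enumerate(pattern):
        table[ch] = idx          # later occurrences overwrite earlier ones
    return table


print(last_occurrence("abacab", "abcd"))
```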
Boyer-Moore Example (2)
x      a  b  c  d
L(x)   4  5  3  -1
T:
P:
Analysis
• Boyer-Moore worst case running time is O(nm + |A|), where |A| is the size of the alphabet.
• But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small. – e.g. good for English text, poor for binary
• Boyer-Moore is significantly faster than brute force for searching English text.
Worst Case Example
• T: "aaaaa…a"
• P: "baaaaa"
T:
P:
Regular Expressions
• Notation to specify a language
  – Declarative
  – Sort of like a programming language
• Fundamental in some languages like Perl and applications like grep or lex
  – Capable of describing the same thing as an NFA
    • The two are actually equivalent, so RE = NFA = DFA
  – We can define an algebra for regular expressions
Definition of a Regular Expression
R is a regular expression if it is:
1. a for some a in the alphabet ∑, standing for the language {a}
2. ε, standing for the language {ε}
3. Ø, standing for the empty language
4. R1+R2, where R1 and R2 are regular expressions, and + signifies union (sometimes | is used)
5. R1R2, where R1 and R2 are regular expressions and this signifies concatenation
6. R*, where R is a regular expression and signifies closure
7. (R), where R is a regular expression, then a parenthesized R is also a regular expression
This definition may seem circular, but 1-3 form the basis.
Precedence: Parentheses have the highest precedence, followed by *, concatenation, and then union.
Using Regular Expressions
• Regular expressions are a standard programmer's tool.
• Built in to Java, Perl, Unix, Python, . . . .
RE Examples
• L(001) = {001}
• L(0+10*) = { 0, 1, 10, 100, 1000, 10000, … }
• L(0*10*) = {1, 01, 10, 010, 0010, …} i.e. {w | w has exactly a single 1}
• L((∑∑)*) = {w | w is a string of even length}
• L((0(0+1))*) = { ε, 00, 01, 0000, 0001, 0100, 0101, …}
• L((0+ε)(1+ε)) = {ε, 0, 1, 01}
• L(1Ø) = Ø ; concatenating the empty set to any set yields the empty set.
• Rε = R
• R+Ø = R
• Note that R+ε may or may not equal R (we are adding ε to the language)
• Note that RØ will only equal R if R itself is the empty set.
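Several of these examples can be checked by enumerating short strings with Python's `re` module. This is only a sketch: the helper `language` is an assumption, union is written `|` as Python requires, and ε is written as an empty alternative.

```python
import re
from itertools import product


def language(regex: str, alphabet: str = "01", max_len: int = 4) -> set[str]:
    """All strings over alphabet, up to max_len, that the whole regex matches."""
    compiled = re.compile(regex)
    return {
        "".join(chars)
        for length in range(max_len + 1)
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    }


print(sorted(language("0|10*")))          # 0+10*
print(sorted(language("(0|)(1|)")))       # (0+ε)(1+ε)
exactly_one_1 = language("0*10*")
even_length = language("((0|1)(0|1))*")   # (∑∑)* with ∑ = {0, 1}
```

Only strings up to `max_len` are listed, so infinite languages appear truncated.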
Exercise 1
• Let ∑ be a finite set of symbols
• ∑ = {10, 11}, ∑* = ?
Answer
• ∑* = {ε, 10, 11, 1010, 1011, 1110, 1111, …}
Exercise 2
• L1 = {10, 1}, L2 = {011, 11}, L1L2 = ?
Answer
• L1L2 = {10011, 1011, 111}
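The concatenation can be verified with a set comprehension; `concat` is a name chosen for this sketch. Note that 10·11 and 1·011 both give 1011, so the result has three elements rather than four.

```python
def concat(l1: set[str], l2: set[str]) -> set[str]:
    """Language concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


L1 = {"10", "1"}
L2 = {"011", "11"}
print(sorted(concat(L1, L2)))
```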
Exercise 3
• Write RE for
  – All strings of 0’s and 1’s
  – All strings of 0’s and 1’s with at least 2 consecutive 0’s
  – All strings of 0’s and 1’s beginning with 1 and not having two consecutive 0’s
Answer
• (0|1)* All strings of 0’s and 1’s
• (0|1)*00(0|1)* All strings of 0’s and 1’s with at least 2 consecutive 0’s
• (1|10)(1|10)* All strings of 0’s and 1’s beginning with 1 and not having two consecutive 0’s (the shorter (1|10)* would also accept ε, which does not begin with 1)
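Answers like these can be tested by brute force against a plain-Python description of the language. The helpers `bit_strings` and `agrees` are assumptions for this sketch. The third check uses `(1|10)+`, Python's shorthand for one or more repetitions; the star form is checked separately because it also matches the empty string.

```python
import re
from itertools import product


def bit_strings(max_len: int):
    """Yield every string over {0, 1} of length 0..max_len."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)


def agrees(regex: str, predicate, max_len: int = 10) -> bool:
    """True if regex and predicate accept exactly the same strings up to max_len."""
    return all(bool(re.fullmatch(regex, w)) == predicate(w) for w in bit_strings(max_len))


def starts_with_1_no_00(w: str) -> bool:
    return w.startswith("1") and "00" not in w


print(agrees(r"(0|1)*", lambda w: True))
print(agrees(r"(0|1)*00(0|1)*", lambda w: "00" in w))
print(agrees(r"(1|10)+", starts_with_1_no_00))
print(agrees(r"(1|10)*", starts_with_1_no_00))   # differs only on the empty string
```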
More Exercises
• 1) (0|1)*011
• 2) 0*1*2*
• 3) 00*11*22*
More Exercises (Answers)
1) (0|1)*011
• Answer: all strings of 0’s and 1’s ending in 011
2) 0*1*2*
• Answer: any number of 0’s followed by any number of 1’s followed by any number of 2’s
3) 00*11*22*
• Answer: strings in 0*1*2* with at least one of each symbol
NFA
Deterministic Finite Automata (DFA)
• Simple machine with N states.
• Begin in start state.
• Read first input symbol.
• Move to new state, depending on current state and input symbol.
• Repeat until last input symbol read.
• Accept or reject string depending on label of last state.
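The steps above translate almost directly into a loop over the input. In this sketch the transition table is a Python dict, and the two-state machine (even number of 1s) is a hypothetical example, not one of the lecture's figures.

```python
def run_dfa(transitions: dict, start: str, accepting: set, word: str) -> bool:
    """Simulate a DFA: follow one transition per input symbol, then check the last state."""
    state = start
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accepting


# Hypothetical DFA: accepts bit strings with an even number of 1s.
EVEN_ONES = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

print(run_dfa(EVEN_ONES, "even", {"even"}, "1001"))   # two 1s
print(run_dfa(EVEN_ONES, "even", {"even"}, "1011"))   # three 1s
```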
DFA
Theory of DFAs and REs
• RE. Concise way to describe a set of strings.
• DFA. Machine to recognize whether a given string is in a given set.
• Duality: for any DFA, there exists a regular expression to describe the same set of strings; for any regular expression, there exists a DFA that recognizes the same set.
Duality Example
• DFA for multiple of 3 b’s:
• RE for multiple of 3 b’s:
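The slide's DFA and RE are not preserved in this transcript. As an illustration of the duality, here is one possible pair, assuming the alphabet {a, b}: a three-state counter (number of b's modulo 3) and the expression (a*ba*ba*b)*a*. Both are assumptions for this sketch, compared on all strings up to length 8.

```python
import re
from itertools import product


def dfa_accepts(word: str) -> bool:
    """Three states 0, 1, 2 track the number of b's modulo 3; state 0 is accepting."""
    state = 0
    for symbol in word:
        if symbol == "b":
            state = (state + 1) % 3    # reading a leaves the state unchanged
    return state == 0


PATTERN = re.compile(r"(a*ba*ba*b)*a*")

same = all(
    dfa_accepts("".join(w)) == bool(PATTERN.fullmatch("".join(w)))
    for n in range(9)
    for w in product("ab", repeat=n)
)
print(same)
```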
Fundamental Questions
• Which languages CANNOT be described by any RE?
• Set of all bit strings with equal number of 0s and 1s.
• Set of all decimal strings that represent prime numbers.
• Many more. . . .
Problem 1
• Make a DFA that accepts the strings in the language denoted by regular expression ab*a
Solution
• ab*a:
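The solution diagram is not preserved here. One possible DFA for ab*a, written as a transition table, is sketched below; the state names (including the explicit "dead" state) are assumptions, and the table is cross-checked against Python's `re` on all strings up to length 7.

```python
import re
from itertools import product

# q0: start, q1: read the first a (loops on b), q2: accepting, dead: reject forever.
TRANSITIONS = {
    ("q0", "a"): "q1",     ("q0", "b"): "dead",
    ("q1", "a"): "q2",     ("q1", "b"): "q1",
    ("q2", "a"): "dead",   ("q2", "b"): "dead",
    ("dead", "a"): "dead", ("dead", "b"): "dead",
}


def accepts(word: str) -> bool:
    state = "q0"
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state == "q2"


matches_regex = all(
    accepts("".join(w)) == bool(re.fullmatch(r"ab*a", "".join(w)))
    for n in range(8)
    for w in product("ab", repeat=n)
)
print(matches_regex)
```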
Problem 2
• Write the RE for the following automaton:
Solution
• a(a|b)*a
DFA to RE: State Elimination
• Eliminates states of the automaton and replaces the edges with regular expressions that include the behavior of the eliminated states.
• Eventually we get down to the situation with just a start and final node, and this is easy to express as a RE
State Elimination
• Consider the figure below, which shows a generic state s about to be eliminated.
• The labels on all edges are regular expressions.
• To remove s, we must make labels from each qi to p1 up to pm that include the paths we could have made through s.
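One elimination step can be sketched as follows, assuming the automaton is stored as a dict from (source, target) to a regex label. The helper names and the three-state example (states A, B, C accepting strings that contain 11) are hypothetical; labels are wrapped in Python non-capturing groups `(?:...)` to keep precedence safe.

```python
import re
from itertools import product


def union(a, b):
    """Union of two edge labels, where None means 'no edge'."""
    if a is None:
        return b
    if b is None:
        return a
    return f"(?:{a}|{b})"


def eliminate(edges: dict, s: str) -> dict:
    """Remove state s, adding q -> p labels for every path q -> s -> p."""
    loop = edges.get((s, s))
    star = f"(?:{loop})*" if loop is not None else ""
    incoming = [(q, r) for (q, t), r in edges.items() if t == s and q != s]
    outgoing = [(p, r) for (q, p), r in edges.items() if q == s and p != s]
    remaining = {k: r for k, r in edges.items() if s not in k}
    for q, r_in in incoming:
        for p, r_out in outgoing:
            bypass = f"(?:{r_in}){star}(?:{r_out})"   # into s, loop on s, out of s
            remaining[(q, p)] = union(remaining.get((q, p)), bypass)
    return remaining


# Hypothetical DFA: A is the start state, C is accepting, language = strings containing "11".
edges = {
    ("A", "A"): "0", ("A", "B"): "1",
    ("B", "A"): "0", ("B", "C"): "1",
    ("C", "C"): "0|1",
}
reduced = eliminate(edges, "B")
# Two states remain and no edge leads back from C to A, so the RE is (A loop)* (A->C) (C loop)*.
regex = f"(?:{reduced[('A', 'A')]})*(?:{reduced[('A', 'C')]})(?:{reduced[('C', 'C')]})*"
print(regex)
```

After removing B, the A self-loop carries 0 or 10 and the A to C edge carries 11, which has the same shape as an expression like (0+10)*11(0+1)*.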
DFA to RE via State Elimination (1)
• Starting with intermediate states and then moving to accepting states, apply the state elimination process to produce an equivalent automaton with regular expression labels on the edges.
• The result will be a one or two state automaton with a start state and accepting state.
DFA to RE State Elimination (2)
• If the two states are different, we will have an automaton that looks like the following:
• We can describe this automaton as: (R | SU*T)*SU*
DFA to RE State Elimination (3)
• If the start state is also an accepting state, then we must also perform a state elimination from the original automaton that gets rid of every state but the start state. This leaves the following:
• We can describe this automaton as simply R*
DFA to RE State Elimination (4)
• If there are n accepting states, we must repeat the above steps for each accepting state to get n different regular expressions, R1, R2, … Rn.
• For each repeat we turn any other accepting state to non-accepting.
• The desired regular expression for the automaton is then the union of each of the n regular expressions: R1 ∪ R2 ∪ … ∪ Rn
DFA->RE Example
• Convert the following to a RE:
• First convert the edges to RE’s:
DFA -> RE Example (2)
• Eliminate State 1:
• Note edge from 3->3
• Answer: (0+10)*11(0+1)*
Second Example
• Automaton that accepts an even number of 1’s
• Eliminate state 2:
Second Example (2)
• Two accepting states, turn off state 3 first
• This is just 0*; can ignore going to state 3 since we would “die”
Second Example (3)
• Turn off state 1 second:
• This is just 0*10*1(0|10*1)*
• Combine from previous slide to get 0* | 0*10*1(0|10*1)*
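As a sanity check, the combined expression can be compared with the plain statement "the string has an even number of 1's" on all bit strings up to length 10. This is a brute-force sketch using Python's `re`, not part of the elimination procedure itself.

```python
import re
from itertools import product

COMBINED = re.compile(r"0*|0*10*1(0|10*1)*")

ok = all(
    bool(COMBINED.fullmatch(w)) == (w.count("1") % 2 == 0)
    for n in range(11)
    for w in ("".join(bits) for bits in product("01", repeat=n))
)
print(ok)
```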
Text search
• Pattern matching directly
  – Brute force
  – BM
  – KMP
• Regular expressions
• Indices for pattern matching
  – Inverted files
  – Signature files
  – Suffix trees and Suffix arrays
Inverted Index
• For each term t, we store a list of all documents that contain t.
[Figure: a dictionary of terms, each pointing to its postings list.]
Create postings lists, determine document frequency
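A minimal sketch of building postings lists and document frequencies follows. The three documents are made up for illustration, tokenization is just lowercasing and splitting on whitespace, and `build_inverted_index` is an assumed name.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Made-up documents, for illustration only.
docs = {
    1: "web data management",
    2: "data compression",
    3: "web search and compression",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, len(doc_ids), doc_ids)    # term, document frequency, postings list
```

The sorted keys play the role of the dictionary, and each value is a postings list whose length is the document frequency.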
Positional indexes
• Postings lists in a nonpositional index: each posting is just a docID
• Postings lists in a positional index: each posting is a docID and a list of positions
Positional indexes: Example
Query: “to₁ be₂ or₃ not₄ to₅ be₆”
TO, 993427:
  ‹ 1: ‹7, 18, 33, 72, 86, 231›;
    2: ‹1, 17, 74, 222, 255›;
    4: ‹8, 16, 190, 429, 433›;
    5: ‹363, 367›;
    7: ‹13, 23, 191›; . . . ›
BE, 178239:
  ‹ 1: ‹17, 25›;
    4: ‹17, 191, 291, 430, 434›;
    5: ‹14, 19, 101›; . . . ›
Document 4 is a match!
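The adjacency test behind this example can be sketched for the two-word phrase "to be", using only the postings shown above (the elided parts of the lists are left out). The function name `phrase_matches` is an assumption; a full phrase query would repeat the same check for each consecutive pair of terms.

```python
TO = {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
      4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]}
BE = {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}


def phrase_matches(first: dict, second: dict) -> dict:
    """Docs, with start positions, where the first term is immediately followed by the second."""
    hits = {}
    for doc_id in first.keys() & second.keys():   # docs containing both terms
        follow = set(second[doc_id])
        starts = [pos for pos in first[doc_id] if pos + 1 in follow]
        if starts:
            hits[doc_id] = starts
    return hits


print(phrase_matches(TO, BE))
```

Documents 1 and 5 contain both words but never in adjacent positions, so only document 4 survives.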
Signature files
• Definition
  – Word-oriented index structure based on hashing.
  – Uses linear search.
  – Suitable for not very large texts.
• Structure
  – Based on a hash function that maps words to bit masks.
  – The text is divided in blocks.
    • The bit mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.
    • A word is not in a block if any 1 bit of its query mask is missing from the block mask; if all of them are set, the block may contain the word and must be checked.
Signature files
• Example:

             block 1          block 2          block 3           block 4
Text         This is a text.  A text has many  words. Words are  made from letters.
Signature    000101           110101           100100            101101

Signature function
h(text)    = 000101
h(many)    = 110000
h(words)   = 100100
h(made)    = 001100
h(letters) = 100001
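The example can be reproduced with bitwise operations. In this sketch the block boundaries are an assumption (chosen so that the ORed masks come out as 000101, 110101, 100100, 101101), and words with no listed signature are simply ignored.

```python
SIGNATURE = {
    "text": 0b000101, "many": 0b110000, "words": 0b100100,
    "made": 0b001100, "letters": 0b100001,
}

# Assumed split of the sample text into four blocks (signature-bearing words only).
BLOCKS = [
    ["text"],               # This is a text.
    ["text", "many"],       # A text has many
    ["words", "words"],     # words. Words are
    ["made", "letters"],    # made from letters.
]


def block_mask(words: list[str]) -> int:
    """Bitwise OR of the signatures of all words in a block."""
    mask = 0
    for word in words:
        mask |= SIGNATURE[word]
    return mask


MASKS = [block_mask(block) for block in BLOCKS]


def candidate_blocks(word: str) -> list[int]:
    """1-based block numbers whose mask contains every 1 bit of the word's signature."""
    query = SIGNATURE[word]
    return [i + 1 for i, mask in enumerate(MASKS) if (mask & query) == query]


print([format(mask, "06b") for mask in MASKS])
print(candidate_blocks("text"))
```

Searching for "text" also reports block 4, which does not contain the word: the signatures of "made" and "letters" together cover its bits. Such false drops are why each candidate block still has to be scanned linearly.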
![Page 74: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/74.jpg)
Signature files
• False drop problem
– The corresponding bits are set even though the word is not there!
– The design should ensure that the probability of a false drop is low.
  • The signature file should also be as short as possible.
– Enhance the hashing function to minimize the error probability.
![Page 75: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/75.jpg)
Signature files
Searching
1. For a single word, hash the word to a bit mask W.
2. For phrases,
   1) Hash the words in the query to bit masks.
   2) Take the bitwise OR of all the query masks to obtain a bit mask W.
3. Compare W to the bit masks Bi of all the text blocks.
   • If all the bits set in W are also set in Bi, then the text block may contain the word.
4. For all candidate text blocks, an online traversal must be performed to verify whether the actual matches are there.
Construction
1. Cut the text into blocks.
2. Generate an entry of the signature file for each block.
   • This entry is the bitwise OR of the signatures of all the words in the block.
![Page 76: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/76.jpg)
Suffix trees and suffix arrays
![Page 77: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/77.jpg)
Trie
• A tree representing a set of strings.
{ aeef, ad, bbfe, bbfg, c }
[Figure: trie for this set, with one letter on each edge.]
![Page 78: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/78.jpg)
Trie (Cont)
• Assume no string is a prefix of another.
[Figure: the trie for { aeef, ad, bbfe, bbfg, c }.]
Each edge is labeled by a letter; no two edges outgoing from the same node are labeled the same.
Each string corresponds to a leaf.
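A minimal sketch of such a trie, assuming a nested-dictionary representation (the functions `build_trie` and `leaves` are illustrative names, not from the slides). Each dictionary key is the letter on an outgoing edge, so two edges leaving the same node can never carry the same letter.

```python
def build_trie(strings):
    """Nested-dict trie: every key is the single letter labeling an edge."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            # setdefault reuses the edge if it already exists.
            node = node.setdefault(ch, {})
    return root

def leaves(node, prefix=""):
    """Spell out the string of every leaf, visiting edges in sorted order."""
    if not node:
        return [prefix]
    out = []
    for ch in sorted(node):
        out.extend(leaves(node[ch], prefix + ch))
    return out

trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))   # letters on the edges leaving the root
print(leaves(trie))   # one string per leaf
```

Because no string in the set is a prefix of another, every inserted string ends at a leaf, so `leaves` returns exactly the original set.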
![Page 79: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/79.jpg)
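Compressed Trie
• Compress unary nodes, label edges by strings
[Figure: the trie for { aeef, ad, bbfe, bbfg, c } on the left and its compressed form on the right. In the compressed trie the root has edges a, bbf, and c; below a are the edges eef and d; below bbf are the edges e and g.]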
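The compression step can be sketched as follows, again assuming a nested-dictionary trie and a prefix-free set of strings. The function name `compress` is hypothetical. It follows each chain of single-child nodes and concatenates the letters into one edge label.

```python
def build_trie(strings):
    """Nested-dict trie with one letter per edge."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Merge every chain of unary nodes into a single string-labeled edge."""
    out = {}
    for label, child in node.items():
        # Keep walking down while the child has exactly one outgoing edge.
        while len(child) == 1:
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        out[label] = compress(child)
    return out

compressed = compress(build_trie(["aeef", "ad", "bbfe", "bbfg", "c"]))
print(compressed)
```

For the example set this yields the edges a, bbf, and c at the root, with eef and d below a, and e and g below bbf.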
![Page 80: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/80.jpg)
Suffix tree
Given a string s, a suffix tree of s is a compressed trie of all suffixes of s.
To make these suffixes prefix-free we add a special character, say $, at the end of s.
![Page 81: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/81.jpg)
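Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$. The root has the edges ab, b, and $; below ab are the edges ab$ and $; below b are the edges ab$ and $.]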
![Page 82: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/82.jpg)
Trivial algorithm to build a Suffix tree
Put the longest suffix in. Put the suffix bab$ in.
[Figure: the root with the single edge abab$; then the root with the two edges abab$ and bab$.]
![Page 83: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/83.jpg)
Put the suffix ab$ in
[Figure: the edge abab$ is split after ab; the new internal node has the edges ab$ and $.]
![Page 84: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/84.jpg)
Put the suffix b$ in
[Figure: the edge bab$ is split after b; the new internal node has the edges ab$ and $.]
![Page 85: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/85.jpg)
Put the suffix $ in
[Figure: a new edge labeled $ is added at the root.]
![Page 86: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/86.jpg)
We will also label each leaf with the starting point of the corresponding suffix.
[Figure: the suffix tree of abab$ with leaves labeled 1 (abab$), 2 (bab$), 3 (ab$), 4 (b$), and 5 ($).]
![Page 87: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/87.jpg)
Analysis
Takes O(n^2) time to build.
We will see how to do it in O(n) time.
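One way to see the quadratic bound, assuming that inserting a suffix costs time proportional to its length: the suffix starting at position i has n - i + 1 characters, so the total work over all n suffixes is

```latex
\sum_{i=1}^{n} (n - i + 1) = \frac{n(n+1)}{2} = O(n^2)
```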
![Page 88: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/88.jpg)
What can we do with it?
Exact string matching: Given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T.
We may also want to find all occurrences of P in T.
![Page 89: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/89.jpg)
Exact string matching
In preprocessing we just build a suffix tree in O(n) time.
Given a pattern P = ab we traverse the tree according to the pattern.
[Figure: suffix tree of abab$ with leaves labeled 1 to 5.]
![Page 91: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/91.jpg)
[Figure: suffix tree of abab$ with leaves labeled 1 to 5.]
If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time.
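The following Python sketch combines the trivial construction with the pattern traversal. It is an illustration only: it uses the naive O(n^2) insertion of suffixes rather than a linear-time construction, assumes that `$` does not occur in the text, and the names `Node`, `build_suffix_tree`, and `occurrences` are hypothetical.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}   # first letter of edge label -> (label, child)
        self.start = start   # 1-based start of the suffix (leaves only)

def build_suffix_tree(s):
    """Naive construction: insert the suffixes of s + '$' one by one."""
    s += "$"
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            if rest[0] not in node.children:
                node.children[rest[0]] = (rest, Node(start=i + 1))
                break
            label, child = node.children[rest[0]]
            # Length of the common prefix of the edge label and the rest.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):      # whole edge matched: continue below it
                node, rest = child, rest[k:]
            else:                    # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                mid.children[rest[k]] = (rest[k:], Node(start=i + 1))
                node.children[rest[0]] = (label[:k], mid)
                break
    return root

def occurrences(root, p):
    """1-based start positions of p in the text, or [] if p does not occur."""
    node, rest = root, p
    while rest:
        if rest[0] not in node.children:
            return []                # stuck at a node
        label, child = node.children[rest[0]]
        m = min(len(label), len(rest))
        if label[:m] != rest[:m]:
            return []                # stuck inside an edge
        node, rest = child, rest[m:]
    # Every leaf below the node we reached is one occurrence.
    stack, found = [node], []
    while stack:
        n = stack.pop()
        if n.start is not None:
            found.append(n.start)
        stack.extend(child for _, child in n.children.values())
    return sorted(found)

tree = build_suffix_tree("abab")
print(occurrences(tree, "ab"))
```

The final `sorted` call is only there to make the output order predictable; collecting the leaves themselves takes time proportional to the size of the subtree.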
![Page 92: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/92.jpg)
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
To make these suffixes prefix-free we add a special character, say $, at the end of s.
To associate each suffix with a unique string in S, add a different special character to each s.
![Page 93: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/93.jpg)
Generalized suffix tree (Example)
Let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2:
{ $, b$, ab$, bab$, abab$, #, b#, ab#, aab# }
[Figure: generalized suffix tree for abab$ and aab#; each leaf is labeled with the start position of its suffix in s1 or s2.]
![Page 94: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/94.jpg)
So what can we do with it ?
Matching a pattern against a database of strings
![Page 95: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/95.jpg)
Longest common substring (of two strings)
[Figure: generalized suffix tree for s1 = abab$ and s2 = aab#.]
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa. Find such a node with the largest "string depth".
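The idea can be sketched with a simplified structure. The code below is an assumption-laden illustration, not the generalized suffix tree itself: it builds an uncompressed trie of all suffixes of both strings (quadratic space) and marks each node with the set of strings whose suffixes pass through it, which plays the role of "has a leaf descendant from s1 and from s2". The function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node that is reached by suffixes of both s1 and s2."""
    root = {"kids": {}, "owners": set()}
    for owner, s in ((1, s1), (2, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)   # a suffix of this string passes here
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, kid in node["kids"].items():
            if kid["owners"] == {1, 2}:     # common to both strings
                if len(path) + 1 > len(best):
                    best = path + ch        # new largest string depth
                stack.append((kid, path + ch))
    return best

print(longest_common_substring("abab", "aab"))
```

Only nodes shared by both strings are explored further, since every ancestor of a shared node is shared as well.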
![Page 97: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/97.jpg)
Lowest common ancestor
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.
![Page 100: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/100.jpg)
Why?
The LCA of two leaves represents the longest common prefix (LCP) of these two suffixes.
[Figure: generalized suffix tree for abab$ and aab# with leaves labeled by suffix start positions.]
![Page 103: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/103.jpg)
Finding maximal palindromes
• A palindrome: caabaac, cbaabc
• Want to find all maximal palindromes in a string s
Let s = cbaaba. The maximal palindrome with center between i-1 and i is determined by the LCP of the suffix at position i of s and the suffix at position m-i+1 of s^r, the reverse of s.
![Page 104: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/104.jpg)
Maximal palindromes algorithm
Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#.
For every i, find the LCA of suffix i of s and suffix m-i+1 of s^r.
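The LCP idea can be tried out without a suffix tree. The sketch below is a simplification under stated assumptions: it computes each LCP by direct character comparison (so it runs in O(n^2) time overall, not O(n)), it handles only even-length palindromes, and it uses 0-based Python indices, so the reversed-string offset is computed in the code rather than taken from the slide's formula. The names `lcp` and `maximal_even_palindromes` are hypothetical.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def maximal_even_palindromes(s):
    """Map each center c (between s[c-1] and s[c]) to its maximal palindrome."""
    n, sr = len(s), s[::-1]
    result = {}
    for c in range(1, n):
        # sr[n - c:] is s[:c] reversed, i.e. the text to the left of the
        # center read backwards; s[c:] is the text to the right of it.
        half = lcp(s[c:], sr[n - c:])
        if half:
            result[c] = s[c - half:c + half]
    return result

print(maximal_even_palindromes("cbaaba"))
```

For s = cbaaba the only non-empty even palindrome found is baab, centered between the two a's. Replacing the `lcp` call with a constant-time LCA query on a generalized suffix tree is what brings the total down to linear time.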
![Page 105: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/105.jpg)
Let s = cbaaba$, then s^r = abaabc#.
[Figure: generalized suffix tree of s and s^r, with leaves labeled by suffix start positions 1 to 7 in each string.]
![Page 108: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/108.jpg)
Analysis
O(n) time to identify all palindromes
![Page 109: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/109.jpg)
Drawbacks
• Suffix trees consume a lot of space.
• The space is O(n), but the constant is quite big.
• Notice that if we indeed want to traverse an edge in O(1) time, then we need an array of pointers of size |Σ| in each node.
![Page 110: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/110.jpg)
Suffix array
• We lose some of the functionality but we save space.
Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab.
The suffix array gives the indices of the suffixes in sorted order:
3 1 4 2
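For small inputs the array can be computed directly from this definition. The sketch below simply sorts the suffix start positions by the suffix they point to; it is a quadratic-or-worse illustration, not an efficient construction, and the function name `suffix_array` is an assumption.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

print(suffix_array("abab"))
print(suffix_array("mississippi"))
```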
![Page 111: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/111.jpg)
How do we build it?
• Build a suffix tree.
• Traverse the tree in DFS, lexicographically picking edges outgoing from each node, and fill the suffix array.
• O(n) time
![Page 112: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/112.jpg)
How do we search for a pattern?
• If P occurs in T then all its occurrences are consecutive in the suffix array.
• Do a binary search on the suffix array.
• Takes O(m log n) time
![Page 113: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/113.jpg)
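Example
Let S = mississippi
Suffix array (start position, suffix):
11 i
8 ippi
5 issippi
2 ississippi
1 mississippi
10 pi
9 ppi
7 sippi
4 sissippi
6 ssippi
3 ssissippi
Let P = issa
[Figure: L, M, and R mark the left, middle, and right positions of the binary search.]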
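The binary search can be sketched as two searches over the first m characters of each suffix: one for the first suffix that is not smaller than P, and one for the first suffix that is larger. This is a minimal illustration; `suffix_array` and `find` are hypothetical names, and the array is built by plain sorting rather than from a suffix tree.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def find(s, sa, p):
    """1-based positions where p occurs in s, found by two binary searches."""
    m = len(p)

    def prefix(k):
        # First m characters of the k-th suffix in sorted order.
        return s[sa[k] - 1:sa[k] - 1 + m]

    lo, hi = 0, len(sa)
    while lo < hi:                 # first suffix whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                 # first suffix whose prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    # All occurrences are consecutive entries sa[first:lo].
    return sorted(sa[first:lo])

text = "mississippi"
sa = suffix_array(text)
print(find(text, sa, "issa"))
print(find(text, sa, "issi"))
```

Each comparison looks at no more than m characters and each search takes O(log n) steps, which matches the O(m log n) bound. For P = issa the range comes out empty, so the pattern does not occur.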
![Page 114: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/114.jpg)
Supra index
• Structure
– Suffix arrays are a space-efficient implementation of suffix trees.
– A suffix array is simply an array containing all the pointers to the text suffixes listed in lexicographical order.
– Supra-indices:
  • Suffix arrays are designed to allow binary searches done by comparing the contents of each pointer.
  • If the suffix array is large, this binary search can perform poorly because of the number of random disk accesses.
  • To remedy this situation, the use of supra-indices over the suffix array has been proposed.
![Page 115: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/115.jpg)
Supra index
• Example
Word start positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
Text: This is a text. A text has many words. Words are made from letters
Suffix array: 60 50 28 19 11 40 33
Supra-index: lett, text, word
![Page 116: Web Data Management Compression and Search...R*, where R is a regular expression and signifies closure 7. (R), where R is a regular expression, then a parenthesized R is also a regular](https://reader033.fdocuments.in/reader033/viewer/2022042015/5e73fcf6ac812f0b6a0f1a5f/html5/thumbnails/116.jpg)
[Figure: the same example, showing the supra-index (lett, text, word), the suffix array 60 50 28 19 11 40 33, and a suffix tree over the indexed positions.]
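A supra-index can be sketched as a small in-memory list of sampled keys that narrows the search to a short slice of the suffix array before any text is compared. The code below is only an illustration under assumptions: it samples every third suffix array entry and keeps the first four letters, which reproduces the keys lett, text, and word from the example; the text is lower-cased so that comparisons ignore case; and the functions `build_supra` and `search` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_supra(text, sa, step, plen):
    """Sample every step-th suffix array entry and keep its first plen letters."""
    return [text[sa[j] - 1:sa[j] - 1 + plen] for j in range(0, len(sa), step)]

def search(text, sa, supra, step, plen, p):
    """1-based positions of indexed suffixes that start with p."""
    q = p[:plen]                          # pattern cut to the key length
    keys = [k[:len(q)] for k in supra]    # keys cut to the same length
    # The sampled keys bound the slice of the suffix array that can match.
    lo = max(bisect_left(keys, q) - 1, 0) * step
    hi = min(bisect_right(keys, q) * step, len(sa))
    # Only sa[lo:hi] has to be fetched and compared against the text.
    return sorted(i for i in sa[lo:hi] if text.startswith(p, i - 1))

text = ("This is a text. A text has many words. "
        "Words are made from letters").lower()
sa = [60, 50, 28, 19, 11, 40, 33]         # suffix array from the example
supra = build_supra(text, sa, step=3, plen=4)
print(supra)
print(search(text, sa, supra, 3, 4, "text"))
```

The slice `sa[lo:hi]` is scanned linearly here for brevity; in a disk-based setting it would correspond to the few suffix array blocks that have to be read.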