Text redundancies (2)

Maxime Crochemore

King’s College London & Université Paris-Est

M.C. CANT 2012 1/27



Algorithms and combinatorics on words

⋆ Links between combinatorial properties of words and algorithms on strings

⋆ Examples:

– Text searching

– Text indexing and suffix arrays

– Text compression and permutations

– Locating repeats in strings

⋆ Combinatorial aspects in Applied Combinatorics on Words

[Lothaire, 2005], [Lothaire, 2002]

http://igm.univ-mlv.fr/~berstel/Lothaire/index.html

⋆ Other examples in Algorithms on Strings

[C., Hancart, Lecroq, 2007], [C., Rytter, 1994]

http://www.dcs.kcl.ac.uk/staff/mac/


Periods and borders of words

⋆ Non-empty string u, integer p, 0 < p ≤ |u|

⋆ p is a period of u if any of these conditions is satisfied:

– u[i] = u[i + p], for 1 ≤ i ≤ |u| − p

– u is a prefix/factor of some y^k, k > 0, |y| = p

– u = yw = wz, for some strings y, z, w with |y| = |z| = p

[Figure: the word borderlineborder aligned with a copy of itself shifted by a period p.]

⋆ period(u) = smallest period of u (can be |u|)

border(u) = longest proper border of u (can be empty)

⋆ Periods and borders of abaabaa

– period 3, border abaa

– period 6, border a

– period 7, border empty string
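
The correspondence between periods and borders can be checked mechanically. The sketch below is a brute-force illustration written for this transcript (the names `periods` and `borders` are not from the slides), using the first definition of a period given above.

```python
def periods(u):
    """All periods p of u with 0 < p <= |u|, smallest first (0-based indices)."""
    n = len(u)
    return [p for p in range(1, n + 1)
            if all(u[i] == u[i + p] for i in range(n - p))]


def borders(u):
    """Proper borders of u, longest first: a period p gives the border u[:|u| - p]."""
    return [u[:len(u) - p] for p in periods(u)]


print(periods("abaabaa"))  # [3, 6, 7]
print(borders("abaabaa"))  # ['abaa', 'a', '']
```

The first entry of each list gives period(u) and border(u) respectively. The quadratic running time is acceptable for an illustration only.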


Periodicity Lemma

Lemma 1 (Periodicity Lemma [Fine, Wilf, 1965])

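If p and q are periods of a word x and satisfy p + q − GCD(p, q) ≤ |x|, then GCD(p, q) is a period of x.

The bound is tight: abaababaaba has length 11 and periods 5 and 8, but GCD(5, 8) = 1 is not a period, and its extensions with period 5 and with period 8 differ at the 12th letter:

a b a a b a b a a b a b a a b · · · (period 5)

a b a a b a b a a b a

a b a a b a b a a b a a b a b · · · (period 8)

Lemma 2 (Weak Periodicity Lemma)

If p and q are periods of a word x and satisfy p + q ≤ |x|, then GCD(p, q) is a period of x.

Used in the analysis of the KMP algorithm and of many other pattern matching algorithms.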
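
As a sanity check of Lemma 1, the following sketch tests the statement on a given word by brute force. The helper names `is_period` and `check_periodicity_lemma` are illustrative choices, not part of the slides.

```python
from math import gcd


def is_period(x, p):
    """True if p is a period of x (0 < p <= |x|)."""
    return 0 < p <= len(x) and all(x[i] == x[i + p] for i in range(len(x) - p))


def check_periodicity_lemma(x):
    """Check Fine and Wilf's statement for every pair of periods of x."""
    n = len(x)
    ps = [p for p in range(1, n + 1) if is_period(x, p)]
    return all(is_period(x, gcd(p, q))
               for p in ps for q in ps
               if p + q - gcd(p, q) <= n)


x = "abaababaaba"  # length 11, periods 5 and 8
print(is_period(x, 5), is_period(x, 8), is_period(x, 1))  # True True False
print(check_periodicity_lemma(x))                          # True
```

For this word 5 + 8 − 1 = 12 exceeds the length 11, so the lemma does not force period 1, which agrees with the output.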


Proof of the weaker statement

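⋆ p and q periods of x with p + q ≤ |x| and p > q

⋆ p − q period of x: for a position i, either i + p ≤ |x| and x[i] = x[i + p] = x[i + p − q], or i − q ≥ 1 and x[i] = x[i − q] = x[i − q + p]

[Figure: three positions a, b, c of x linked by a jump of length p in one direction and a jump of length q in the other, giving a jump of length p − q; shown for both cases.]

⋆ the rest follows by induction, as in Euclid’s algorithm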


On-line String Matching (3)

[Figure: the prefix u of the pattern matches the text; it is followed by the letter c in the text and by a different letter a in the pattern. After a compatible shift, the pattern is aligned again under the occurrence of u.]

⋆ shift compatible with the match: shift = period(u)

[Morris, Pratt, 1969]

⋆ same condition, and the shift is not incompatible with c

[Knuth, Morris, Pratt, 1977]

⋆ best shift = period(uc)

[Figure: the three shifts illustrated on text y = · · a b a a b a c · · · and pattern x = a b a a b a a.]

[Simon, 1989], [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1993]
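
The three shift rules can be compared on the example of the figure. The sketch below computes them by brute force from the definition of a period; the function names and the convention that `i` is the 0-based index of the mismatching pattern letter (so u = x[:i], assumed non-empty) are choices made for this illustration, and the KMP rule is modelled as "a period p of u such that the newly aligned pattern letter differs from x[i]".

```python
def periods(u):
    """All periods of u, smallest first."""
    n = len(u)
    return [p for p in range(1, n + 1)
            if all(u[k] == u[k + p] for k in range(n - p))]


def mp_shift(x, i):
    """Morris-Pratt: shift by period(u), where u = x[:i]."""
    return periods(x[:i])[0]


def kmp_shift(x, i):
    """Knuth-Morris-Pratt: additionally, the letter brought under the
    mismatch position must differ from x[i], which has just failed."""
    for p in periods(x[:i]):
        if x[i - p] != x[i]:
            return p
    return i + 1  # no suitable period: move the pattern past the mismatch


def best_shift(x, i, c):
    """Best shift: period(uc), where c is the mismatching text letter."""
    return periods(x[:i] + c)[0]


x, i, c = "abaabaa", 6, "c"  # u = abaaba, pattern letter a against text letter c
print(mp_shift(x, i), kmp_shift(x, i), best_shift(x, i, c))  # 3 5 7
```

Real implementations precompute these shifts in tables in linear time; the quadratic search here only serves to make the three definitions concrete.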


Delay

⋆ Delay: maximal number of comparisons on a text letter

⋆ MP algorithm: delay ≤ |x|

⋆ KMP algorithm: delay ≤ log_Φ(|x| + 1), where Φ is the golden ratio

proof by Periodicity Lemma

[Figure: text y = · · a b a a b a c · · · with the pattern x aligned at successive shifts; the text letter c is compared several times.]

⋆ Simon-Hancart algorithm: delay ≤ min(1 + log_2 |x|, card A)

use of string-matching automaton


Searching with an automaton

⋆ Uses the string-matching automaton SMA(x):

smallest deterministic automaton accepting A*x

⋆ Example x = abaa, A = {a, b}

state 0 1 2 3 4

on a  1 1 3 4 1

on b  0 2 0 2 2

(the forward arcs 0 → 1 → 2 → 3 → 4 spell a b a a; state 4 is terminal)

⋆ Search for abaa in:

text      b a b b a a b a a b a a b b a · · ·

state   0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 · · ·
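
The state sequence above can be reproduced with a small sketch. It builds the transition function directly from its definition (the state reached from q by c is the length of the longest suffix of x[:q] + c that is a prefix of x), which is simple but not the linear-time construction; `build_sma` and `run` are names chosen for this example.

```python
def build_sma(x, alphabet):
    """delta[q][c] = length of the longest suffix of x[:q] + c that is a prefix of x."""
    delta = []
    for q in range(len(x) + 1):
        row = {}
        for c in alphabet:
            s = x[:q] + c
            k = min(len(x), len(s))
            while not s.endswith(x[:k]):  # k = 0 always succeeds
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def run(delta, text):
    """States visited while reading text, starting from the initial state 0."""
    states = [0]
    for c in text:
        states.append(delta[states[-1]][c])
    return states


delta = build_sma("abaa", "ab")
states = run(delta, "babbaabaabaabba")
print(states)  # [0, 0, 1, 2, 0, 1, 1, 2, 3, 4, 2, 3, 4, 2, 0, 1]
# An occurrence of x ends wherever the terminal state |x| is reached.
print([i - 4 for i, q in enumerate(states) if q == 4])  # start positions [5, 8]
```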


Construction of SMA(x)

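⋆ Unwinding arcs

⋆ From SMA(abaa) . . .

state 0 1 2 3 4

on a  1 1 3 4 1

on b  0 2 0 2 2

⋆ . . . to SMA(abaab)

state 0 1 2 3 4 5

on a  1 1 3 4 1 3

on b  0 2 0 2 5 0

(the arc labelled b from state 4 to state 2 is unwound: it now leads to the new state 5, which receives copies of the arcs leaving state 2)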
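
The unwinding step can be sketched as follows, assuming the automaton is stored as a list of dictionaries (one per state, mapping letters to target states). This representation and the names `extend` and `build_sma` are assumptions of the example, not the slides' implementation.

```python
def extend(delta, c):
    """Turn SMA(x) into SMA(xc) by unwinding the arc labelled c that leaves the last state."""
    last = len(delta) - 1
    target = delta[last][c]            # where the arc (last, c) currently leads
    delta[last][c] = last + 1          # it becomes the new forward arc
    delta.append(dict(delta[target]))  # the new state copies the arcs leaving target
    return delta


def build_sma(x, alphabet):
    """On-line construction of SMA(x), one letter at a time."""
    delta = [{a: 0 for a in alphabet}]  # SMA of the empty word
    for c in x:
        extend(delta, c)
    return delta


sma = build_sma("abaa", "ab")
print(sma[4])        # {'a': 1, 'b': 2}
extend(sma, "b")     # SMA(abaa) -> SMA(abaab)
print(sma[4], sma[5])  # {'a': 1, 'b': 5} {'a': 3, 'b': 0}
```

Note that the forward arc is redirected before the copy is made, so that the case where the old target is the last state itself (for instance on the first letter) is handled correctly. With this full-table representation each step costs O(card A); storing significant arcs only is what removes the dependence on the alphabet.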


Significant arcs

⋆ Complete SMA(ananas)

[Figure: the complete automaton SMA(ananas) with states 0 to 6; the forward arcs spell a n a n a s, and all other arcs are drawn, including those labelled n, s that lead back to state 0.]

⋆ Forward arcs: spell the pattern

⋆ Backward arcs: arcs going backwards without reaching the initial state

[Figure: SMA(ananas) restricted to its significant arcs: the six forward arcs and five backward arcs, four labelled a and one labelled n.]


Complexity

⋆ Time and space optimisation: implementation of significant arcs only

– Forward arcs: spell the pattern

– Backward arcs: arcs going backwards, but not to the initial state

Lemma 3 SMA(x) has at most |x| backward arcs.

⋆ implementation of SMA(x) in O(|x|) space

⋆ construction in O(|x|) time, independent of alphabet size

⋆ Optimal searching strategy:

delay ≤ min(1 + log_2 |x|, card A)

[Hancart, 1993]

[Breslauer, Colussi, Toniolo, 1993]
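
Lemma 3 can be observed on the ananas example with the sketch below. It builds the full transition table first (restricted, as an assumption of this example, to the letters occurring in x) and then extracts the significant arcs; a space-efficient implementation would store only those arcs from the start.

```python
def build_sma(x, alphabet):
    """Full transition table of SMA(x), built on-line by unwinding arcs."""
    delta = [{a: 0 for a in alphabet}]
    for c in x:
        last = len(delta) - 1
        target = delta[last][c]
        delta[last][c] = last + 1
        delta.append(dict(delta[target]))
    return delta


def significant_arcs(x):
    """Forward arcs (q -> q + 1) and backward arcs (q -> r with 0 < r <= q)."""
    delta = build_sma(x, sorted(set(x)))
    arcs = [(q, c, r) for q, row in enumerate(delta) for c, r in row.items()]
    forward = [(q, c, r) for q, c, r in arcs if r == q + 1]
    backward = [(q, c, r) for q, c, r in arcs if 0 < r <= q]
    return forward, backward


forward, backward = significant_arcs("ananas")
print(len(forward), len(backward))  # 6 5
print(backward)  # [(1, 'a', 1), (3, 'a', 1), (5, 'a', 1), (5, 'n', 4), (6, 'a', 1)]
```

Every arc not listed leads to the initial state, so a search can treat a missing arc as a transition to state 0.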


Local periods

⋆ Overlap

w overlap of (u, v) if w ≠ ε and:

A*u ∩ A*w ≠ ∅ and vA* ∩ wA* ≠ ∅

|w| is a local period of uv at position |u|

⋆ Local Period

localperiod(u, v) = smallest local period of (u, v)

[Figure: a word cut at four different positions, with the squares bb, aa, baba and bbababbaba centred on the cuts, illustrating local periods 1, 1, 2 and 5.]
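
The definition of a local period translates into the brute-force sketch below. Positions are counted as |u|, and the word ababbaba is borrowed from the two-way searching example later in the slides; the function name `local_period` is an illustrative choice.

```python
def local_period(x, pos):
    """Smallest p >= 1 such that x[i] == x[i + p] for every i with
    pos - p <= i < pos for which both indices fall inside x.
    This is the length of the shortest overlap w of (x[:pos], x[pos:])."""
    n = len(x)
    for p in range(1, n + 1):
        if all(x[i] == x[i + p]
               for i in range(max(0, pos - p), min(pos, n - p))):
            return p


x = "ababbaba"  # period 5
print([local_period(x, pos) for pos in range(1, len(x))])
# [2, 2, 5, 1, 5, 2, 2]
```

The maximal value 5 equals period(x) and is reached at positions 3 and 5, that is, for the cuts aba·bbaba and ababb·aba.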


Maximal Local Period

⋆ Word of period 5

a b a b b a

1 2 2 5 1 3 1

⋆ Note: localperiod(u, v) ≤ period(uv)

⋆ (u, v) is a critical factorization of uv if

localperiod(u, v) = period(uv)

that is, localperiod(u, v) is maximal among all local periods

⋆ Computation of all local periods in linear time

[Duval, Kolpakov, Kucherov, Lecroq, Lefebvre, 2003]


Critical Factorization

Theorem 1 (Critical Factorization Theorem)

Any non-empty word x can be factorized into u · v with both:

• |u| < period(x) and

• localperiod(u, v) = period(x).

[Cesari, Vincent, Duval, 1983]

[Figure: a word cut at a critical position, with the square bbababbaba centred on the cut.]

Leads to a time-space optimal string-matching algorithm: the two-way algorithm


Two-Way String Matching

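[Figure: the pattern x = uv aligned with the text y. The right part v is scanned first: if a prefix w of v matches and the next text letter c mismatches, the shift has length |wc|. If v matches entirely, u is then scanned, and the shift has length period(x).]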

⋆ Time-space optimality [C., Perrin, 1992]

– Search time: linear (≤ 2n comparisons) with constant extra space

– Preprocessing time: likewise, based on the theorem on maximal suffixes (Theorem 2)

⋆ Other solutions: [Galil, Seiferas, 1983]

[C., 1992], [C., Rytter, 1994], [Rytter, 2002]

⋆ Real-time version:

[Breslauer, Grossi, Mignosi, 2011]


Example (4)

⋆ Critical factorization of x = a b a b b a b a into u · v

period = 5

⋆ Searching

text y:    a a a b b b b b a a b b a b a b b a b a b a . .

pattern x: a b a b b a b a

[Figure: the pattern window slides along the text; each attempt uses left-to-right and right-to-left scans. When an occurrence is found, the next shift length = period = 5.]


Maximal suffixes

⋆ Orderings

≤ lexicographic ordering based on ≤ of alphabet

≼ lexicographic ordering based on ≤⁻¹ (the reverse order) of alphabet

Theorem 2 (C., Perrin, 1992) Let x ≠ ε.

Let x = uv with v = suffix of x that is maximal for ≤

Let x = u′v′ with v′ = suffix of x that is maximal for ≼

If |v| ≤ |v′| then (u, v) is a critical factorization of x, otherwise (u′, v′) is.

Moreover, |u| < period(x) and |u′| < period(x).

[Figure: the words abaabaa and ababaabbababa, each shown twice, with its maximal suffix for ≤ and its maximal suffix for ≼ marked.]
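
Theorem 2 can be tried out with a brute-force sketch. Here the two maximal suffixes are found by comparing all suffixes (quadratic work, for illustration only), the reverse alphabet order is simulated by negating character codes, and `period` and `local_period` are helper functions introduced for the check.

```python
def period(x):
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


def local_period(x, pos):
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p]
                       for i in range(max(0, pos - p), min(pos, n - p))))


def critical_factorization(x):
    """Pick the shorter of the two maximal suffixes, as in Theorem 2."""
    suffixes = [x[i:] for i in range(len(x))]
    v = max(suffixes)                                          # maximal for the alphabet order
    v_rev = max(suffixes, key=lambda s: [-ord(c) for c in s])  # maximal for the reverse order
    best = v if len(v) <= len(v_rev) else v_rev
    return x[:len(x) - len(best)], best


for x in ("abaabaa", "ababbaba"):
    u, v = critical_factorization(x)
    print(x, (u, v), local_period(x, len(u)), period(x))
# abaabaa ('ab', 'aabaa') 3 3
# ababbaba ('aba', 'bbaba') 5 5
```

In both cases the local period at the cut equals the period of the word and |u| is smaller than the period, as the theorem states.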


Proof (4)

Four cases, with x = uv and w the shortest overlap of (u, v):

⋆ w suffix of u, v prefix of w

v ≤ w < wv, impossible

⋆ w suffix of u, w prefix of v

with v = wz: z < v implies v = wz < wv, impossible

⋆ u suffix of w and v prefix of w

(u, v) is a critical factorization

⋆ u suffix of w and w prefix of v

z < v; yz ≺ yv implies z ≺ v;

z prefix of v, then border of v;

|w| period of v and of x.

[Figures: each case shows x = uv with two copies of w on either side of the cut; the last case also shows v′, y and yz.]


Computing maximal suffixes

⋆ Algorithm adapted from Lyndon factorisation [Duval, 1983]

⋆ Runs in time linear in the string length, with constant extra space


Maximal-suffix computation

⋆ v maximal suffix; |w| its period; w′ proper prefix of w: x = u v with v = w w w′

Example: in a b a c b c b a c b c b a c b c, u = aba, w = cbcba, w′ = cbc

⋆ Match (next letter b): the periodicity continues, with a new w′

⋆ Smaller letter (next letter a): new w, border-free

⋆ Greater letter (next letter c): new u and recomputation on the rest
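
The three cases above correspond to the three branches of the following sketch, a Python rendering of the usual linear-time, constant-space maximal-suffix routine. Variable names (`ms`, `j`, `k`, `p`) are choices of this example: `ms + 1` is the start of the current maximal suffix, `p` its period, and `j + k` the position of the letter being read.

```python
def maximal_suffix(x):
    """Return (start, period) of the maximal suffix of x for the usual order."""
    n = len(x)
    ms, j, k, p = -1, 0, 1, 1
    while j + k < n:
        a, b = x[j + k], x[ms + k]
        if a == b:              # match: the periodicity continues
            if k != p:
                k += 1
            else:
                j += p
                k = 1
        elif a < b:             # smaller letter: the whole suffix read so far
            j += k              # becomes the new, border-free period
            k = 1
            p = j - ms
        else:                   # greater letter: new u, restart on the rest
            ms = j
            j = ms + 1
            k = p = 1
    return ms + 1, p


print(maximal_suffix("abacbcbacbcbacbc"))  # (3, 5): v = cbcbacbcbacbc, w = cbcba
print(maximal_suffix("ababbaba"))          # (3, 5): v = bbaba
```

Running the same routine with the comparisons `<` and `>` exchanged yields the maximal suffix for the reverse alphabet order, which is the second ingredient of Theorem 2.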


Perfect factorisation

Theorem 3 Any non-empty word x can be factorised into u · v with both:

– |u| < 2 period(v) and

– v starts with at most one cube of a primitive word.

[Galil, Seiferas, 1983], [C., Rytter, 1994]

[Mignosi, Restivo, Salemi, 1995] see [Lothaire, 2002]

⋆ Example word with period = 10

a a a b a a b a a b a a a b a a b a a b a a a b a a · · ·

⋆ Leads to a time-space optimal string-matching algorithm


Square prefixes

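[Figure: three squares ww, vv and uu aligned at the same starting position, from longest to shortest.]

Lemma 4 (Three-square Lemma)

If u² is a proper prefix of v², v² is a proper prefix of w², and u is primitive, then |u| + |v| ≤ |w|.

[C., Rytter, 1995], [Lothaire, 2001]

|w| = 10: w² = a a b a a b a a a b a a b a a b a a a b

|v| = 7: v² = a a b a a b a a a b a a b a

|u| = 3: u² = a a b a a b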
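
The lemma can be checked on the example above with a short brute-force sketch. The helper names are illustrative; primitivity is tested with the classical fact that w is primitive exactly when it occurs in ww only as a prefix and as a suffix.

```python
def is_primitive(w):
    """w is primitive iff its first occurrence in ww after position 0 is at |w|."""
    return len(w) > 0 and (w + w).find(w, 1) == len(w)


def square_prefix_roots(x):
    """Lengths l such that x[:l] squared is a prefix of x, in increasing order."""
    return [l for l in range(1, len(x) // 2 + 1) if x[:l] == x[l:2 * l]]


def three_square_holds(x):
    """Check |u| + |v| <= |w| for all square prefixes u^2 < v^2 < w^2 of x, u primitive."""
    r = square_prefix_roots(x)
    return all(r[i] + r[j] <= r[k]
               for i in range(len(r))
               for j in range(i + 1, len(r))
               for k in range(j + 1, len(r))
               if is_primitive(x[:r[i]]))


x = "aabaabaaab" * 2              # w^2 from the example
print(square_prefix_roots(x))     # [1, 3, 7, 10]
print(three_square_holds(x))      # True
```

The roots 3, 7 and 10 are those of the slide and satisfy 3 + 7 = 10, so the bound of the lemma is reached; the additional root 1 comes from the prefix aa.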


Primitively-rooted squares in a word

Lemma 5 ([Fraenkel, Simpson, 1998])

No more than 2n distinct primitively rooted squares in a word of length n.

[Figure: three squares uu, vv and ww occurring in y. Can all three have their rightmost occurrence at the same position of y? Impossible!]

Direct proofs [Hickerson, 2004], [Ilie, 2005]

Best bound: 2n − Θ(log n) [Ilie, 2005]

Computation in linear time [Gusfield, Stoye, 1998]

Lemma 6 ([C., 1981], [Gusfield, Stoye, 1999])

Maximal number of occurrences of primitively rooted squares: c n log n.

Maximum reached with Fibonacci words.
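
The two quantities of Lemmas 5 and 6 (distinct squares versus occurrences) can be told apart with a brute-force count. This is a cubic-time sketch for small inputs only, not the linear-time computation cited above, and `fibonacci_word` uses the convention a, ab, aba, abaab, . . . as an assumption.

```python
def is_primitive(w):
    return len(w) > 0 and (w + w).find(w, 1) == len(w)


def primitively_rooted_squares(x):
    """Return (set of distinct primitively rooted squares, number of occurrences)."""
    distinct, occurrences = set(), 0
    n = len(x)
    for l in range(1, n // 2 + 1):
        for i in range(n - 2 * l + 1):
            root = x[i:i + l]
            if root == x[i + l:i + 2 * l] and is_primitive(root):
                distinct.add(root + root)
                occurrences += 1
    return distinct, occurrences


def fibonacci_word(k):
    """k-th Fibonacci word, k >= 1: a, ab, aba, abaab, ..."""
    a, b = "a", "ab"
    for _ in range(k - 1):
        a, b = b, b + a
    return a


distinct, occ = primitively_rooted_squares("aabaab")
print(sorted(distinct), occ)  # ['aa', 'aabaab'] 3

f = fibonacci_word(10)        # length 89
distinct, occ = primitively_rooted_squares(f)
print(len(distinct) <= 2 * len(f))  # True
```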


Runs

Run of x = maximal periodicity in x = non-extensible occurrence of a power in x

Theorem 4 ([Kolpakov, Kucherov, 1998])

Maximal number of runs in a word of length n: O(n).
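
To make the notion of a run concrete, the sketch below enumerates runs by brute force, taking a run to be a factor x[s:e] of length at least twice its smallest period p that cannot be extended to the left or right with the same period. This is an illustrative definition-level enumeration with names chosen here, not the linear-time algorithm of Kolpakov and Kucherov.

```python
def smallest_period(u):
    n = len(u)
    return next(p for p in range(1, n + 1)
                if all(u[i] == u[i + p] for i in range(n - p)))


def runs(x):
    """Runs of x as triples (start, end, period), with end exclusive."""
    n, found = len(x), []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if x[k] != x[k + p]:
                k += 1
                continue
            s = k
            while k + p < n and x[k] == x[k + p]:
                k += 1
            # x[s:k + p] is a maximal factor with period p
            if k - s >= p and smallest_period(x[s:k + p]) == p:
                found.append((s, k + p, p))
    return found


print(runs("abab"))   # [(0, 4, 2)]
print(runs("aaaa"))   # [(0, 4, 1)]
print(runs("abaab"))  # [(2, 4, 1)]
```

The check on the smallest period prevents the same periodicity from being reported again under a multiple of its period, as in aaaa with p = 2.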
