Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression...

6
Compression of Palindromes and Regularity. Kayoko Shikishima-Tsuji Center for Liberal Arts Education and Research Tenri University 1 Introduction In [1], a property of clickstream data at a view of database is discussed and it is shown that page repetitions occur for the majority as a very specific structure, namely in the form of nested palindromes. A kind of function CFP (Compress First Palindrome) required for an algorithm which extracts these structures in linear time is introduced. In this paper, we define a rewriting system R which covers CFP and consider relation between R and CFP. We adopt the name “wrinkled word” instead of “nested palindrome” and it means a word which has a non-trivial palindrome as a factor. The set of all wrinkled words is regular, though the set of all palindromes is not regular. We also give automata on some alphabets which accept all wrinkled words. 2 Preliminaries We assume the reader to be familiar with basic concepts as alphabet, word, language, regular expression and automaton (for more details see [2]). Words together with the operation of concatenation form a free monoid, which is usually denoted by Σ for a finite alphabet Σ. The length of a finite word w is the number of not necessarily distinct symbols it consists of and is written by . The empty word is denoted by and . For a word for , a factor of is , where , and the reverse of is . A word is said to be palindrome if . If a palindrome is not in , the palindrome is non-trivial, otherwise is trivial. If a word has at least one non-trivial palindrome as a factor, is said to be wrinkled. w λ λ = 0 w = a 1 a 2 !a n a 1 , a 2 ,!, a n ∈Σ w a i !a j 1 i j n w R w a n !a 2 a 1 p ∈Σ p = p R p Σ∪ {λ } p p w ∈Σ w 1

Transcript of Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression...

Page 1: Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression of Palindromes and Regularity. ... language W = {the set of all ... Peter Linz, An

Compression of Palindromes and Regularity.

Kayoko Shikishima-Tsuji

Center for Liberal Arts Education and Research

Tenri University

1 Introduction

In [1], a property of clickstream data at a view of database is discussed and it is shown that page repetitions occur for the majority as a very specific structure, namely in the form of nested palindromes. A kind of function CFP (Compress First Palindrome) required for an algorithm which extracts these structures in linear time is introduced.

In this paper, we define a rewriting system R which covers CFP and consider relation between R and CFP. We adopt the name “wrinkled word” instead of “nested palindrome” and it means a word which has a non-trivial palindrome as a factor. The set of all wrinkled words is regular, though the set of all palindromes is not regular. We also give automata on some alphabets which accept all wrinkled words.

2 Preliminaries

We assume the reader to be familiar with basic concepts as alphabet, word, language, regular expression and automaton (for more details see [2]).

Words together with the operation of concatenation form a free monoid, which is usually denoted by Σ∗ for a finite alphabet Σ. The length of a finite word w is the number of not necessarily distinct symbols it consists of and is written by � .

The empty word is denoted by � and � . For a word � for

� , a factor of � is � , where � , and the reverse � of

� is � . A word � is said to be palindrome if � . If a palindrome

� is not in � , the palindrome � is non-trivial, otherwise � is trivial. If a

word � has at least one non-trivial palindrome as a factor, � is said to be wrinkled.

w

λ λ = 0 w = a1a2!ana1,a2 ,!,an ∈Σ w ai!aj 1≤ i ≤ j ≤ n wR

w an!a2a1 p∈Σ∗ p = pR

p Σ∪{λ} p p

w∈Σ∗ w

�1

Page 2: Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression of Palindromes and Regularity. ... language W = {the set of all ... Peter Linz, An

A string rewriting system � on � is a subset of � . We define reduction relation on � that is induced by � is defined as follows: for every � , � if and only if there exists � such that for some � , �

and � . By � , we denote the reflexive transitive closure of � . If �

and there is no � such that � , then � is irreducible; otherwise, � is

reducible. The set of all irreducible words with respect to � is denoted by

� . For � , if � and � is irreducible, then � is a normal form for

� . If, for all � , � and � imply that there exists � such

that � and � , we say that � is confluent. If for all � , �

and � imply that there exists � such that � and � , we say that � is locally confluent. If there is no infinite sequence � where

� , then � is said to be noetherian.

Two string rewriting systems � and � on � are called equivalence if � implies � and � implies � and then we denote � .

Compress First Palindrome CFP : � is defined as follows (see [1]) : if � is a wrinkled word, for the left most � of � where � �

� , there are � such that � and we define � and if

� is a non-wrinkled word, we define � . Since a wrinkled word � has only finite non-trivial palindromes as factor and � , then we can

define � � where � is a large enough number such that

� ( � ). 3 Regularity of the set of all wrinkled words.

Proposition 1. Let � and � be string rewriting systems � �and . Then R and are

equivalent.

Proof) Since � , it is clear that if � for some � , then we have � .

On the other side, let � for some � , then there exist � ,� and a palindrome � such that � and � . If � ,

R Σ Σ∗ × Σ∗

Σ∗ R u,v∈Σ∗

u→R v (l,m)∈R x, y∈Σ∗ u = xly

v = xmy →R∗ →R x ∈Σ∗

y∈Σ∗ x→R y x x

→R

IRR(R) x, y∈Σ∗ x→R∗ y y y

x w, x, y∈Σ∗ w→R∗ x w→R

∗ y z ∈Σ∗

x→R∗ z y→R

∗ z R w, x, y∈Σ∗ w→R x

w→R y z ∈Σ∗ x→R∗ z y→R

∗ zR x1 →R x2 →R!

x1, x2 ,!∈Σ∗ R

R S Σw→R

∗ z w→S∗ z w→S

∗ z w→R∗ z R ≅ S

Σ∗ → Σ∗

w∈Σ∗ aba w a∈Σ, b∈Σ∪{λ},

a ≠ b u,v∈Σ∗ w = uabav CFP(w) = uv

w∈Σ∗ CFP(w) = w ww > CFP(w)

CFP∞ (w) = CFPn (w) n

CFPn (w) = CFPn+1(w) = CFP(CFPn (w))

R S R = {apa→R a | a∈Σ,p is a palindrome} S = aba→S a | a∈Σ, b∈Σ∪ λ{ }{ } S

R ⊇ S w→S z w, z ∈Σ∗

w→R zw→R z w, z ∈Σ∗ u,v∈Σ∗

a∈Σ p∈Σ∗ w = uapav z = uav p∈Σ∪{λ}

�2

Page 3: Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression of Palindromes and Regularity. ... language W = {the set of all ... Peter Linz, An

then � . If � , then � is written � by � , � and � . Since � and so on, we have � .

q.e.d.

The following lemma is well-known (see [3]).

Lemma 1. If a string rewriting system � is noetharian and locally confluent, then � is confluent.

Proposition 2. Let � and � be string rewriting systems � �and . Then R and S are

confluent.

Proof) By Proposition 1 and Lemma 1, it is enough to prove that S is locally confluent.(a) If � where � and � for some

� , then � , � and � , � .

(b) If � and � where � , � . Since � , � , � , � , we have � where � when � and � when � .

q.e.d.

Proposition 3. Let � be one of string rewriting systems of Proposition 2. If, for � and a natural number n, � , then � . On the other hand, if � for � , then there exist a natural number n and � such that � and � .

Proof) If � , it is obvious that � . If � has no non-trivial palindrome as a factor, then we have � and � . We may assume that � is wrinkled. There exists a finite sequence: � , � such that � is wrinkled and � is not wrinkled. Then we have a sequence � . Since � is not wrinkled, � has no non-trivial palindrome as a factor and � . By Proposition 2, the rewriting system � is confluent and then we have � .

The following corollary is clear by Proposition 3.

w→S z p∉Σ∪{λ} p xbcbxR c∈Σ∪{λ} b∈Σx ∈Σ∗ w = uaxbcbxRav→S uaxbx

Ravw = uaxbcbxRav→S

∗ uav = z

RR

R S R = {apa→R a | a∈Σ,p is a palindrome} S = aba→S a | a∈Σ, b∈Σ∪ λ{ }{ }

w = u1v1u2v2u3 u1,v1,u2 ,v2 ,u3 ∈Σ∗ v1 →S a,v2 →S ba, b∈Σ w→S u1au2v2u3 w→S u1v1u2bu3 u1au2v2u3 →S u1au2bu3u1v1u2bu3 →S u1au2bu3w = u1vu3 v∈{aaa, abaa, aaba, abab} u1,u3 ∈Σ∗ a,b∈Σ

aaa→S∗ a abaa→S

∗ a aaba→S∗ a abab→S

∗ ab w→S∗ u1 ′v u3

′v = a v∈{aaa, abaa, aaba} ′v = ab v = aaba

R 'w, z ∈Σ∗ CFPn (w) = z w→R '

∗ zw→R '

∗ z w, z ∈Σ∗ z '∈Σ∗

CFPn (w) = z ' w→R '∗ z '

CFP(w) = z w→R ' zw w = z

CFP(w) = w ww = w0 CFP(w) = w1,!,CFP

n (w) = wn wn−1 wn

w→S w1 →S!→S wn wn

wn wn ∈IRR(R ')R ' w→R '

∗ wn

�3

Page 4: Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression of Palindromes and Regularity. ... language W = {the set of all ... Peter Linz, An

Corollary 1. Let � be the string rewriting system � �� . Then we have � .

By Corollary 1, we have � = {the set of all non-wrinkled words}The following lemma is well-known (see [3]).

Lemma 2. A string rewriting system � is finite, then � is a regular set.

By Proposition 3 and Lemma 2, we have the following theorem.

Theorem 1. The language NW = {the set of all non-wrinkled words} and the language W = {the set of all wrinkled words} are both regular.

Proof) The language NW is regular and then W =(NW)c is regular.q.e.d.

The set of all palindromes are contest-free but not regular. W = {the set of all wrinkled words}= � is regular.

4 Examples.

For � , we give an automaton on � which accepts all wrinkled words on � .

Example 1. If � , then NW = {� | w is non-wrinkled word} is the empty set.

Example 2. If � , then NW is the set {ab, ba}

Example 3. If � , then NW is the set { � | � is a factor of (abc)* ∪

(acb)* such that | u | > 1}.

By the following automaton A = (Q, Σ, δ, i, F) (Figure 1), NW is accepted,

where � , � is the initial state and � is a set of final states.

R R = {apa→R a | a∈Σ,p is a palindrome} CFP∞ (Σ∗) = IRR(R)

CFP∞ (Σ∗)

R IRR(R)

= CFP∞ (Σ∗)

w |w has at least one non-trivial palindrome as a factor{ }

Σ ≤ 4 ΣΣ

Σ = {a} w∈Σ∗

Σ = {a,b}

Σ = {a,b,c} u ∈Σ∗ u

Q = {i, 1, 2,!, 9} i ∈ Q F =Q

�4

Page 5: Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression of Palindromes and Regularity. ... language W = {the set of all ... Peter Linz, An

Example 4. If � , then NW is the language which is accepted by the following automaton A = (Q, Σ, δ, i, F) (Figure 2), where � , � is the initial state and � is a set of final states. In Figure 2, states � and the following transition functions which start from these states are omitted for simplicity: � , � , � , � , � , � , � , � , � , � , � , � , � , � , � , � (see Figure 3).

Σ = {a,b,c,d}Q = {i, 1, 2,!,16}

i ∈ Q F =Qi, 13,14,15,16

a : i→13 b : i→14 c : i→14 d : i→15 b :13→ 2c :13→ 8 d :13→10 a :14→11 c :14→ 7 d :14→1 a :15→ 5 b :15→ 6d :15→ 9 a :16→ 3 b :16→12 a :16→ 4

�5

2

75

10

86

11

1

9

3

4

12

c

d

d

d

d

d

d

c

c

c

c

c

b

a

aa

a

a

b

b

b

bb

a

Figure 2

Figure 1

2

3

i

1

c

a

a b

cb

5

6

7

4

c

c

ab

b

8 9

b

a c

a

Page 6: Compression of Palindromes and Regularity.kyodo/kokyuroku/contents/pdf/2051-24.pdf · Compression of Palindromes and Regularity. ... language W = {the set of all ... Peter Linz, An

References.

[1] Michel Speiser, Gianluca Antonini, Abderrahim Labbi, Juliana Sutanto: On nested palindromes in clickstream data. The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012. ACM 2012: 1460-1468

[2] Peter Linz, An Introduction to Formal Language and Automata, Jones and Bartlett Publishers, Inc. , USA 2006, ISBN, 0763737984

[3] Ronald V. Book, Friedrich Otto: String-Rewriting Systems. Texts and Monographs in Computer Science, Springer 1993, ISBN 978-3-540-97965-4

Center for Liberal Arts Education and ResearchTenri University1050 Somanouchi, Tenri, Nara 632-8510, JapanE-mail address: tsuji sta.tenri-u.ac.jp

天理大学・総合教育研究センター 辻佳代子

�6

2

7 5

10

8

611

1 9

3 4

12

d

d d

d

c

c

c

c

a

a

a

b

b

b a

i

13

14 15

16

b

Figure 3