A Generalization of Ogden's Lemma

CHRISTOPHER BADER

Peace Corps, Kenya

AND

ARNALDO MOURA

University of California, Berkeley, California

ABSTRACT. "Excluded positions" are incorporated into a modified form of Ogden's lemma, and a language satisfying the latter, which the authors' generalized lemma can show fails to be context-free, is presented. It is also shown that in a fairly general sense there is no function relating the number of distinguished and excluded positions which would allow the authors' iteration lemma to characterize the context-free languages.

Categories and Subject Descriptors: F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems--grammar types; F.4.3 [Mathematical Logic and Formal Languages]: Formal Languages--classes defined by grammars or automata

General Terms: Languages, Theory

Additional Key Words and Phrases: Context-free languages, Ogden's lemma

1. Introduction

Ogden's lemma [2] is one of the most useful results in the theory of context-free languages. It enables us, for example, to give a two-line proof that the language of strings of the form $a^m b^m c^m$ is not context-free. Other applications are given in [1, pp. 192-211]. Ogden's lemma may be stated as follows.

LEMMA. For any context-free language $L$, $\exists n \in \mathbb{N}$, the set of nonnegative integers, such that $\forall z \in L$, if $d$ positions in $z$ are "distinguished," with $d > n$, then $\exists u, v, w, x, y$ such that $z = uvwxy$ and

(1) $vx$ contains at least one distinguished position;
(2) if $r$ is the number of distinguished positions in $vwx$, then $r \le n$;
(3) $\forall i \in \mathbb{N}$, $uv^i w x^i y \in L$.

PROOF. An immediate corollary of the generalized lemma given in the next section. []

In order to see that the language of strings of the form $a^m b^m c^m$ is not context-free, let $m = n$, and let every position in the string $a^n b^n c^n$ be distinguished. It is easy to see that we cannot choose $u, v, w, x, y$ in such a way as to satisfy the lemma.
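The claim that no choice of u, v, w, x, y works can be checked by brute force for a small value of the constant. The following is only a sketch under stated assumptions: the function names are hypothetical, and n = 3 is an illustrative value, since the actual constant depends on the grammar. Because every position is distinguished, condition (1) says that vx is nonempty and condition (2) says that vwx has length at most n; condition (3) is tested only for i = 0 and i = 2, which is already enough to rule out every decomposition.

```python
from itertools import combinations_with_replacement


def in_language(s: str) -> bool:
    """Membership in { a^m b^m c^m : m >= 0 }."""
    m, rem = divmod(len(s), 3)
    return rem == 0 and s == "a" * m + "b" * m + "c" * m


def surviving_decompositions(n: int):
    """Decompositions z = uvwxy of a^n b^n c^n that pass (1), (2), and (3) for i in {0, 2}."""
    z = "a" * n + "b" * n + "c" * n
    survivors = []
    # Four nondecreasing cut indices a <= b <= c <= d describe u, v, w, x, y.
    for a, b, c, d in combinations_with_replacement(range(len(z) + 1), 4):
        u, v, w, x, y = z[:a], z[a:b], z[b:c], z[c:d], z[d:]
        if not (v or x):      # condition (1): vx must contain a distinguished position
            continue
        if d - a > n:         # condition (2): r = |vwx| because all positions are distinguished
            continue
        if all(in_language(u + v * i + w + x * i + y) for i in (0, 2)):
            survivors.append((u, v, w, x, y))
    return survivors


if __name__ == "__main__":
    print(surviving_decompositions(3))
```

An empty result means that every admissible decomposition is already broken by deleting or doubling v and x.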

The need for a stronger version of Ogden's lemma arose in connection with the theory of context-free languages with interpretations. Let $L$ be a context-free language partitioned into equivalence classes of sentences having the same interpretation, or meaning. A canonical form is a subset of $L$ that contains one and only one member of each equivalence class. It can be shown, using the result we present in the next section, that the language of propositional calculus does not have a context-free canonical form. The proof of this, however, would take us too far afield.

Authors' addresses: C. Bader, 7421 Saville Court, Alexandria, VA 22306; A. Moura, Department of Computer Science, University of California, Berkeley, CA 94720. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1982 ACM 0004-5411/82/0400-0404 $00.75

Journal of the Association for Computing Machinery, Vol. 29, No. 2, April 1982, pp. 404-407.

2. The Generalized Lemma

THEOREM. For any context-free language $L$, $\exists n \in \mathbb{N}$ such that $\forall z \in L$, if $d$ positions in $z$ are "distinguished" and $e$ positions are "excluded," with $d > n^{e+1}$, then $\exists u, v, w, x, y$ such that $z = uvwxy$ and

(1) $vx$ contains at least one distinguished position and no excluded positions;
(2) if $r$ is the number of distinguished positions and $s$ is the number of excluded positions in $vwx$, then $r \le n^{s+1}$;
(3) $\forall i \in \mathbb{N}$, $uv^i w x^i y \in L$.

PROOF. Since the hypothesis cannot apply to the null string $\lambda$, we may assume that $z \ne \lambda$. $L - \{\lambda\}$ has a grammar $G$ in Chomsky normal form. Let $G$ have $k$ nonterminals, and let $n = 2^{k+1}$. Let us consider one of the trees for $z$ in $G$. If both children of a node in this tree have distinguished descendants, call that node a "branch point." Let $P$ be a path with the greatest number of branch points. Since $z$ has at least $2^{(k+1)(e+1)}$ distinguished positions, $P$ has at least $(k + 1)(e + 1)$ branch points. We divide the lowermost part of $P$ into $e + 1$ subpaths, each containing $k + 1$ branch points.

In each of the subpaths there must be two branch points with the same label, say, $A$. Thus there exist two strings of terminals, $v'$ and $x'$ (which we call a pair of "pumping substrings"), and two nonterminals, $B$ and $C$, such that $A \Rightarrow BC \Rightarrow^{*} v'Ax'$. Since the upper $A$ is a branch point, both $B$ and $C$ have distinguished descendants. They cannot both dominate the lower $A$, and so $v'x'$ must contain at least one distinguished position.

Starting at the leaf in which $P$ terminates, let us proceed through the subpaths until we find one whose pair of pumping substrings contains no excluded positions. We know that such a pair exists, since there are $e + 1$ distinct pairs but only $e$ excluded positions. We call this pair $v, x$. Thus we have proved (1) and (3).

Suppose we have had to climb to the $(g + 1)$st subpath to find $v$ and $x$. For each subpath below this one, the pair of pumping substrings contains at least one excluded position. Hence if the number of excluded positions in $vwx$ is $s$, then $s \ge g$. The string $vwx$ is dominated by a single node in the $(g + 1)$st subpath. By definition of $P$, no path down from this node contains more than $(k + 1)(g + 1)$ branch points. Hence if the number of distinguished positions in $vwx$ is $r$, then

$r \le 2^{(k+1)(g+1)} = n^{g+1} \le n^{s+1}$,

proving (2). []
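To make conditions (1)-(3) of the theorem concrete, here is a minimal Python sketch that evaluates them for one given decomposition. Everything in it is an assumption made for illustration: the function names, the representation of distinguished and excluded positions as sets of 0-based indices, the sample language, and the bound `max_i`. Condition (3) quantifies over all i, so the code can only spot-check it up to `max_i`.

```python
from typing import Callable, Set, Tuple


def pump(z: str, cuts: Tuple[int, int, int, int], i: int) -> str:
    """Return u v^i w x^i y, where the four cut indices split z into u, v, w, x, y."""
    a, b, c, d = cuts
    u, v, w, x, y = z[:a], z[a:b], z[b:c], z[c:d], z[d:]
    return u + v * i + w + x * i + y


def check_decomposition(
    z: str,
    cuts: Tuple[int, int, int, int],
    distinguished: Set[int],
    excluded: Set[int],
    n: int,
    in_language: Callable[[str], bool],
    max_i: int = 6,
) -> Tuple[bool, bool, bool]:
    """Evaluate conditions (1), (2), (3) of the generalized lemma for one decomposition."""
    a, b, c, d = cuts
    vx = set(range(a, b)) | set(range(c, d))
    vwx = set(range(a, d))
    # (1) vx has a distinguished position and no excluded position.
    cond1 = bool(vx & distinguished) and not (vx & excluded)
    # (2) r <= n^(s+1), with r and s counted inside vwx.
    r = len(vwx & distinguished)
    s = len(vwx & excluded)
    cond2 = r <= n ** (s + 1)
    # (3) spot-check of pumping for i = 0, ..., max_i only.
    cond3 = all(in_language(pump(z, cuts, i)) for i in range(max_i + 1))
    return cond1, cond2, cond3


def a_m_b_m(s: str) -> bool:
    """Sample context-free language { a^m b^m } used only for the demonstration."""
    m, rem = divmod(len(s), 2)
    return rem == 0 and s == "a" * m + "b" * m


if __name__ == "__main__":
    # z = aabb with u = a, v = a, w = empty, x = b, y = b.
    print(check_decomposition("aabb", (1, 2, 2, 3), {0, 1, 2, 3}, set(), 4, a_m_b_m))
    # Excluding the position of v makes condition (1) fail.
    print(check_decomposition("aabb", (1, 2, 2, 3), {0, 1, 2, 3}, {1}, 4, a_m_b_m))
```

The value n = 4 in the demonstration is arbitrary; the proof above takes n = 2^(k+1) for a Chomsky-normal-form grammar with k nonterminals.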

Application. Consider the language $L_1 = \{z \in \{a, b\}^* \mid (\exists q \text{ such that } z = ab^q) \Rightarrow (q \text{ is prime})\}$. We shall show that this language satisfies Ogden's lemma, but that it does not satisfy the generalized lemma.

Any string in $\{a, b\}^*$ is in $L_1$ if it is not of the form $ab^q$. Regardless of how the distinguished positions are distributed, any string not of the form $ab^q$ can be pumped so that it is still not of this form. On the other hand, for strings of this form, regardless of how the distinguished positions are distributed, it is impossible to exclude the $a$ from the pumping substrings. If a substring containing the $a$ is pumped, we get a string that is not of the form $ab^q$ and hence is in $L_1$. Thus $L_1$ satisfies Ogden's lemma.


Let $n$ be the constant of the generalized lemma, and let $q$ be the smallest prime greater than $n^2$. Clearly, $ab^q \in L_1$. Let the $a$ be excluded, and let all the $b$'s be distinguished. Thus, in the terms of the lemma, $d = q$ and $e = 1$. Since $d = q > n^2 = n^{e+1}$, the lemma tells us that $ab^q$ contains a pair of pumping substrings containing only $b$'s. Suppose the number of $b$'s in this pair is $p$. If we pump $ab^q$ $q$ times (that is, take $i = q + 1$), we get $ab^{q(p+1)}$, which is clearly not in $L_1$. Thus $L_1$ does not satisfy the conditions of the generalized lemma and hence is not context-free.
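The arithmetic of this argument can be replayed numerically. The sketch below is illustrative only: the helper names are hypothetical, and n = 2 is an assumed stand-in for the constant of the lemma, which in reality depends on the grammar. For each possible total number p of b's in a pumping pair that avoids the excluded a, it pumps with i = q + 1 and tests whether the result ab^(q(p+1)) is in L1.

```python
def is_prime(q: int) -> bool:
    """Trial-division primality test, adequate for small q."""
    if q < 2:
        return False
    f = 2
    while f * f <= q:
        if q % f == 0:
            return False
        f += 1
    return True


def in_L1(z: str) -> bool:
    """z is in L1 unless it has the form a b^q with q not prime (z over {a, b})."""
    if z[:1] == "a" and set(z[1:]) <= {"b"}:
        return is_prime(len(z) - 1)
    return True


def smallest_prime_above(m: int) -> int:
    q = m + 1
    while not is_prime(q):
        q += 1
    return q


def pumped_memberships(n: int):
    """For z = a b^q, q the smallest prime above n^2, pump every b-only pair with i = q + 1."""
    q = smallest_prime_above(n * n)
    results = []
    for p in range(1, q + 1):                # p = number of b's in the pair vx
        pumped = "a" + "b" * (q + q * p)     # q + (i - 1) * p b's with i = q + 1
        results.append(in_L1(pumped))
    return q, results


if __name__ == "__main__":
    q, results = pumped_memberships(2)
    print(q, any(results))
```

If no entry of `results` is true, then no pumping pair consisting only of b's satisfies condition (3) for this string.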

3. Is There a Necessary and Sufficient Version of the Lemma?

As we shall see, satisfaction of the generalized lemma is not sufficient to establish that a language is context-free. This may not be surprising, though, since the function $f(x) = n^{x+1}$, for $n > 1$, is very "big." It is therefore of considerably more general interest to show that there is no function $g$ that is small enough to be necessary and sufficient. To do this, we need two propositions.

Any function which is necessary must provide that there be some positions which are distinguished but not excluded, since otherwise condition (1) of the lemma cannot be met. We therefore consider only those functions such that $g(x) > x$ and $g(x) > 1$ for all $x$. These conditions are clearly met by our original function $f(x) = n^{x+1}$ for $n > 1$.

PROPOSITION 1. Let $L_2 = \{z \in \{a, b\}^* \mid (\exists q \text{ such that } z = (ab)^q) \Rightarrow (q \text{ is prime})\}$. Note that $L_2 \ne L_1$. Let $g$ be any function from $\mathbb{N}$ into $\mathbb{N}$ such that $\forall x$, $g(x) > x$ and $g(x) > 1$. $\forall z \in L_2$, if $d$ positions in $z$ are distinguished and $e$ positions are excluded, with $d > g(e)$, the conclusions of the generalized lemma follow, using $r \le g(s)$ in (2).

PROOF. Let $z \in L_2$. We consider two cases. First, suppose that if any single character is removed from $z$, the resulting string is still in $L_2$.

Since $d > g(e) > e$, there are at least two positions in $z$ which are distinguished but not excluded. Let $v$ be one of these, which we suppose, without loss of generality, to be an $a$, and let $w = x = \lambda$. (1) clearly holds. (2) holds, since $r = 1$, $s = 0$, and $g(0) > 1$. $\forall i > 1$, $uv^i w x^i y$ contains a substring of the form $aa$. Hence (3) holds for $i > 1$. For $i = 0$, $uv^i w x^i y \in L_2$ by the supposition made at the beginning of the proof. Thus (3) holds.

Otherwise, suppose that there is some character in $z$ such that, if it is removed, we obtain a string which is not in $L_2$. Again without loss of generality, we can suppose this character to be an $a$. We can see from the definition of $L_2$ that the only strings not in $L_2$ are of the form $(ab)^q$, where $q$ is not a prime. Thus $z$ must be of the form $(ab)^p a (ab)^r$, where $p \ge 0$, $r \ge 0$, and $p + r = q$.

Again, d > g(e) > e; so we have at least two positions in z which are distinguished but not excluded.

Suppose $r = 0$. Then $z$ has a final $a$. Since $z$ contains at least two positions which are distinguished but not excluded, there is some character in $z$ other than the final $a$ which is distinguished but not excluded. Let $v$ be this character, and let $w = x = \lambda$. (1) and (2) hold as before. $\forall i \ge 0$, $uv^i w x^i y$ has a final $a$ and hence is in $L_2$. Thus (3) holds.

Otherwise, $r > 0$. Then $z$ has a substring of the form $aa$. If there is some character outside this substring which is distinguished but not excluded, then the situation is similar to the case in which $r = 0$. Otherwise, both of the $a$'s in the substring must be distinguished but not excluded. Let $v$ equal the substring $aa$, and let $w = x = \lambda$. (1) clearly holds. (2) holds, since $g(0)$ is at least 2. For $i = 0$, $uv^i w x^i y$ has a substring of the form $bb$ or starts with a $b$. For $i > 0$, $uv^i w x^i y$ has a substring of the form $aa$. Either way, $uv^i w x^i y \in L_2$, fulfilling (3). []


PROPOSITION 2. $L_2$ is not context-free.

PROOF. Suppose it is. The set of context-free languages is closed under generalized-sequential-machine (gsm) mappings and under mirror image [3]. Let $M$ be the image of $L_2$ under a gsm mapping that places a # at the end of strings of the form $(ab)^n$. Let $N$ be the mirror image of $M$. Finally, let $P$ be the image of $N$ under the gsm that deletes all characters until it encounters a #, and deletes only $b$'s and #'s thereafter. By supposition $L_2$ is context-free; so $P$ is, too. But $P = \{a^n \mid n \text{ is a prime or zero}\}$, which is well known not to be context-free. So $L_2$ is not context-free either. []
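The effect of the three mappings can be illustrated on all short strings. The sketch below is an assumption-laden illustration, not a construction of the machines themselves: the gsm mappings are modeled as ordinary Python string functions rather than finite-state transducers, all function names are hypothetical, and only strings up to a chosen length are enumerated.

```python
from itertools import product


def is_prime(q: int) -> bool:
    """Trial-division primality test, adequate for small q."""
    if q < 2:
        return False
    f = 2
    while f * f <= q:
        if q % f == 0:
            return False
        f += 1
    return True


def in_L2(z: str) -> bool:
    """z is in L2 unless it has the form (ab)^q with q not prime (z over {a, b})."""
    q, rem = divmod(len(z), 2)
    if rem == 0 and z == "ab" * q:
        return is_prime(q)
    return True


def mark(z: str) -> str:
    """First mapping: append '#' to strings of the form (ab)^n, leave others unchanged."""
    q, rem = divmod(len(z), 2)
    return z + "#" if rem == 0 and z == "ab" * q else z


def mirror(z: str) -> str:
    """Mirror image (reversal)."""
    return z[::-1]


def strip_after_marker(z: str) -> str:
    """Second mapping: delete everything before the first '#', then delete b's and #'s."""
    if "#" not in z:
        return ""
    tail = z[z.index("#"):]
    return "".join(ch for ch in tail if ch == "a")


def image_P(max_len: int) -> set:
    """Images under the three mappings of all strings of L2 of length <= max_len."""
    out = set()
    for length in range(max_len + 1):
        for chars in product("ab", repeat=length):
            z = "".join(chars)
            if in_L2(z):
                out.add(strip_after_marker(mirror(mark(z))))
    return out


if __name__ == "__main__":
    print(sorted(image_P(10), key=len))
```

On this finite sample, the lengths of the resulting strings of a's are zero or prime, in line with the description of P in the proof.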

$L_2$ therefore satisfies the conditions of the lemma for every possible function $g$, with the restrictions that we have noted. But $L_2$ is not context-free. Therefore the function that would permit our lemma to characterize the context-free languages cannot exist.

ACKNOWLEDGMENTS. A preliminary version of this paper was written while the first author was a senior at Princeton University under the supervision of Professor Jeffrey D. Ullman. Without his help and encouragement, it could not have been written.

The later versions were written in highly improbable, even exotic, surroundings, producing delays both avoidable and unavoidable. The editors have shown great understanding in this matter.

REFERENCES

(Note. Reference [4] is not cited in the text.)

1. AHO, A.V., AND ULLMAN, J.D. The Theory of Parsing, Translation, and Compiling. 2 vols. Prentice-Hall, Englewood Cliffs, N.J., 1972-73.
2. OGDEN, W. A helpful result for proving inherent ambiguity. Math. Syst. Theory 2 (1968), 191-194.
3. SALOMAA, A. Formal Languages. Academic Press, New York, 1973.
4. WISE, D.S. A strong pumping lemma for context-free languages. Theor. Comput. Sci. 3 (1976), 359-369.

RECEIVED MARCH 1979; REVISED DECEMBER 1980; ACCEPTED JANUARY 1981
