
EACL 2009

Proceedings of the EACL 2009 Workshop on Computational Linguistic

Aspects of Grammatical Inference

CLAGI 2009

30 March 2009
Megaron Athens International Conference Centre

Athens, Greece


Production and Manufacturing by
TEHNOGRAFIA DIGITAL PRESS
7 Ektoros Street
152 35 Vrilissia
Athens, Greece

©2009 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
[email protected]


Preface

We are delighted to present you with this volume containing the papers accepted for presentation at the 1st workshop CLAGI 2009, held in Athens, Greece, on March 30th 2009.

We want to acknowledge the help of the PASCAL 2 network of excellence.

Thanks also to Damir Cavar for giving an invited talk and to the programme committee for their reviewing and advice.

We are indebted to the general chair of EACL 2009, Alex Lascarides, to the publication chairs, Kemal Oflazer and David Schlangen, and to the workshop chairs Stephen Clark and Miriam Butt, for all their help. We are also grateful for Thierry Murgue's assistance when putting together the proceedings.

Wishing you a very enjoyable time at CLAGI 2009!

Menno van Zaanen and Colin de la Higuera
CLAGI 2009 Programme Chairs


CLAGI 2009 Program Committee

Program Chairs:

Menno van Zaanen, Tilburg University (The Netherlands)
Colin de la Higuera, University of Saint-Etienne (France)

Program Committee Members:

Pieter Adriaans, University of Amsterdam (The Netherlands)
Srinivas Bangalore, AT&T Labs-Research (USA)
Leonor Becerra-Bonache, Yale University (USA)
Rens Bod, University of Amsterdam (The Netherlands)
Antal van den Bosch, Tilburg University (The Netherlands)
Alexander Clark, Royal Holloway, University of London (UK)
Walter Daelemans, University of Antwerp (Belgium)
Shimon Edelman, Cornell University (USA)
Jeroen Geertzen, University of Cambridge (UK)
Jeffrey Heinz, University of Delaware (USA)
Colin de la Higuera, University of Saint-Etienne (France)
Alfons Juan, Polytechnic University of Valencia (Spain)
Frantisek Mraz, Charles University (Czech Republic)
Georgios Petasis, National Centre for Scientific Research (NCSR) "Demokritos" (Greece)
Khalil Sima'an, University of Amsterdam (The Netherlands)
Richard Sproat, University of Illinois at Urbana-Champaign (USA)
Menno van Zaanen, Tilburg University (The Netherlands)
Willem Zuidema, University of Amsterdam (The Netherlands)


Table of Contents

Grammatical Inference and Computational Linguistics
Menno van Zaanen and Colin de la Higuera . . . . . . 1

On Bootstrapping of Linguistic Features for Bootstrapping Grammars
Damir Cavar . . . . . . 5

Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction
Jeroen Geertzen . . . . . . 7

Experiments Using OSTIA for a Language Production Task
Dana Angluin and Leonor Becerra-Bonache . . . . . . 16

GREAT: A Finite-State Machine Translation Toolkit Implementing a Grammatical Inference Approach for Transducer Inference (GIATI)
Jorge Gonzalez and Francisco Casacuberta . . . . . . 24

A Note on Contextual Binary Feature Grammars
Alexander Clark, Remi Eyraud and Amaury Habrard . . . . . . 33

Language Models for Contextual Error Detection and Correction
Herman Stehouwer and Menno van Zaanen . . . . . . 41

On Statistical Parsing of French with Supervised and Semi-Supervised Strategies
Marie Candito, Benoit Crabbe and Djame Seddah . . . . . . 49

Upper Bounds for Unsupervised Parsing with Unambiguous Non-Terminally Separated Grammars
Franco M. Luque and Gabriel Infante-Lopez . . . . . . 58

Comparing Learners for Boolean partitions: Implications for Morphological Paradigms
Katya Pertsova . . . . . . 66


Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 1–4, Athens, Greece, 30 March 2009. ©2009 Association for Computational Linguistics

Grammatical Inference and Computational Linguistics

Menno van Zaanen
Tilburg Centre for Creative Computing

Tilburg University
Tilburg, The Netherlands
[email protected]

Colin de la Higuera
University of Saint-Etienne

[email protected]

1 Grammatical inference and its links to natural language processing

When dealing with language, (machine) learning can take many different forms, of which the most important are those concerned with learning languages and grammars from data. Questions in this context have been at the intersection of the fields of inductive inference and computational linguistics for the past fifty years. To go back to the pioneering work, Chomsky (1955; 1957) and Solomonoff (1960; 1964) were interested, for very different reasons, in systems or programs that could deduce a language when presented with information about it.

A little later, Gold (1967; 1978) proposed a unifying paradigm called identification in the limit, and the term grammatical inference seems to have first appeared in Horning's PhD thesis (1969).

Outside the scope of linguistics, researchers and engineers dealing with pattern recognition, driven by the work of Fu (1974; 1975), invented algorithms and studied subclasses of languages and grammars from the point of view of what could or could not be learned.

Researchers in machine learning tackled related problems (the most famous being that of inferring a deterministic finite automaton, given examples and counter-examples of strings). Angluin (1978; 1980; 1981; 1982; 1987) introduced the important setting of active learning, or learning from queries, whereas Pitt and his colleagues (1988; 1989; 1993) gave several complexity-inspired results which exposed the hardness of the different learning problems.

Researchers working in more applied areas, such as computational biology, also deal with strings. A number of researchers from that field worked on learning grammars or automata from string data (Brazma and Cerans, 1994; Brazma, 1997; Brazma et al., 1998). Similarly, stemming from computational linguistics, one can point out the work relating language learning with more complex grammatical formalisms (Kanazawa, 1998), the more statistical approaches based on building language models (Goodman, 2001), or the different systems introduced to automatically build grammars from sentences (van Zaanen, 2000; Adriaans and Vervoort, 2002). Surveys of related work in specific fields can also be found (Natarajan, 1991; Kearns and Vazirani, 1994; Sakakibara, 1997; Adriaans and van Zaanen, 2004; de la Higuera, 2005; Wolf, 2006).

2 Meeting points between grammatical inference and natural language processing

Grammatical inference scientists belong to a number of larger communities: machine learning (with special emphasis on inductive inference), computational linguistics, and pattern recognition (within the structural and syntactic sub-group). There is a specific conference called ICGI (International Colloquium on Grammatical Inference) devoted to the subject. These conferences have been held at Alicante (Carrasco and Oncina, 1994), Montpellier (Miclet and de la Higuera, 1996), Ames (Honavar and Slutski, 1998), Lisbon (de Oliveira, 2000), Amsterdam (Adriaans et al., 2002), Athens (Paliouras and Sakakibara, 2004), Tokyo (Sakakibara et al., 2006) and Saint-Malo (Clark et al., 2008). In the proceedings of this event it is possible to find a number of technical papers. Within this context, there has been a growing trend towards problems of language learning in the field of computational linguistics.

The formal objects in common between the two communities are the different types of automata and grammars. Therefore, another meeting point between these communities has been the different workshops, conferences and journals that focus on grammars and automata, for instance, FSMNLP, GRAMMARS, CIAA, . . .

3 Goal for the workshop

There has been growing interest over the last few years in learning grammars from natural language text (and structured or semi-structured text). The family of techniques enabling such learning is usually called “grammatical inference” or “grammar induction”.

The field of grammatical inference is often subdivided into formal grammatical inference, where researchers aim to prove efficient learnability of classes of grammars, and empirical grammatical inference, where the aim is to learn structure from data. In the latter case the existence of an underlying grammar is just regarded as a hypothesis, and what is sought is to better describe the language through some automatically learned rules.

Both formal and empirical grammatical inference have been linked with (computational) linguistics. Formal learnability of grammars has been used in discussions on how people learn language. Some people mention proofs of (non-)learnability of certain classes of grammars as arguments in the empiricist/nativist discussion. On the more practical side, empirical systems that learn grammars have been applied to natural language. Instead of proving whether classes of grammars can be learnt, the aim here is to provide practical learning systems that automatically introduce structure in language. Example fields where initial research has been done are syntactic parsing, morphological analysis of words, and bilingual modelling (or machine translation).

This workshop, organized at EACL 2009, aimed to explore the state of the art in these topics. In particular, we aimed to bring formal and empirical grammatical inference researchers closer together with researchers in the field of computational linguistics.

The topics put forward were to cover research on all aspects of grammatical inference in relation to natural language (such as syntax, semantics, morphology, phonology, and phonetics), including, but not limited to:

• Automatic grammar engineering, including, for example,

– parser construction,

– parameter estimation,

– smoothing, . . .

• Unsupervised parsing

• Language modelling

• Transducers, for instance, for

– morphology,

– text to speech,

– automatic translation,

– transliteration,

– spelling correction, . . .

• Learning syntax with semantics,

• Unsupervised or semi-supervised learning of linguistic knowledge,

• Learning (classes of) grammars (e.g. subclasses of the Chomsky Hierarchy) from linguistic inputs,

• Comparing learning results in different frameworks (e.g. membership vs. correction queries),

• Learning linguistic structures (e.g. phonological features, lexicon) from the acoustic signal,

• Grammars and finite state machines in machine translation,

• Learning the setting of Chomskyan parameters,

• Cognitive aspects of grammar acquisition, covering, among others,

– developmental trajectories as studied by psycholinguists working with children,

– characteristics of child-directed speech as they are manifested in corpora such as CHILDES, . . .

• (Unsupervised) Computational language acquisition (experimental or observational)

4 The papers

The workshop was glad to have as invited speaker Damir Cavar, who presented a talk titled On bootstrapping of linguistic features for bootstrapping grammars.

The papers submitted to the workshop, each reviewed by at least three reviewers, covered a very wide range of problems and techniques. Arranging them into patterns was not a simple task!

There were three papers focussing on transducers:

• Jeroen Geertzen shows in his paper Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction how grammar induction can be used in dialogue act prediction.

• In their paper (Experiments Using OSTIA for a Language Production Task), Dana Angluin and Leonor Becerra-Bonache build on previous work to show that the transducer learning algorithm OSTIA is capable of translating syntax to semantics.

• In their paper titled GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI), Jorge Gonzalez and Francisco Casacuberta build on a long history of GIATI learning and try to eliminate some of the limitations of previous work. The learning concerns finite-state transducers induced from parallel corpora.

Context-free grammars of different types were used for very different tasks:

• Alexander Clark, Remi Eyraud and Amaury Habrard (A note on contextual binary feature grammars) propose a formal study of a new formalism called “CBFG”, describe the relationship of CBFG to other standard formalisms and its appropriateness for modelling natural language.

• In their work titled Language models for contextual error detection and correction, Herman Stehouwer and Menno van Zaanen look at spelling problems as a word prediction problem. The prediction needs a language model, which is learnt.

• A formal study of French treebanks is made by Marie-Helene Candito, Benoit Crabbe and Djame Seddah in their work On statistical parsing of French with supervised and semi-supervised strategies.

• Franco M. Luque and Gabriel Infante-Lopez study the learnability of NTS grammars with reference to the Penn treebank in their paper titled Upper Bounds for Unsupervised Parsing with Unambiguous Non-Terminally Separated Grammars.

One paper concentrated on morphology:

• In A comparison of several learners for Boolean partitions: implications for morphological paradigms, Katya Pertsova compares a rote learner to three morphological paradigm learners.

References

P. Adriaans and M. van Zaanen. 2004. Computational grammar induction for linguists. Grammars, 7:57–68.

P. Adriaans and M. Vervoort. 2002. The EMILE 4.1 grammar induction toolbox. In Adriaans et al. (Adriaans et al., 2002), pages 293–295.

P. Adriaans, H. Fernau, and M. van Zaanen, editors. 2002. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '02, volume 2484 of LNAI, Berlin, Heidelberg. Springer-Verlag.

D. Angluin. 1978. On the complexity of minimum inference of regular sets. Information and Control, 39:337–350.

D. Angluin. 1980. Inductive inference of formal languages from positive data. Information and Control, 45:117–135.

D. Angluin. 1981. A note on the number of queries needed to identify regular languages. Information and Control, 51:76–87.

D. Angluin. 1982. Inference of reversible languages. Journal of the Association for Computing Machinery, 29(3):741–765.

D. Angluin. 1987. Queries and concept learning. Machine Learning Journal, 2:319–342.

A. Brazma and K. Cerans. 1994. Efficient learning of regular expressions from good examples. In AII '94: Proceedings of the 4th International Workshop on Analogical and Inductive Inference, pages 76–90. Springer-Verlag.

A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. 1998. Pattern discovery in biosequences. In Honavar and Slutski (Honavar and Slutski, 1998), pages 257–270.

A. Brazma. 1997. Computational learning theory and natural learning systems, volume 4, chapter Efficient learning of regular expressions from approximate examples, pages 351–366. MIT Press.

R. C. Carrasco and J. Oncina, editors. 1994. Grammatical Inference and Applications, Proceedings of ICGI '94, number 862 in LNAI, Berlin, Heidelberg. Springer-Verlag.

N. Chomsky. 1955. The logical structure of linguistic theory. Ph.D. thesis, Massachusetts Institute of Technology.


N. Chomsky. 1957. Syntactic Structures. Mouton.

A. Clark, F. Coste, and L. Miclet, editors. 2008. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '08, volume 5278 of LNCS. Springer-Verlag.

C. de la Higuera. 2005. A bibliographical study of grammatical inference. Pattern Recognition, 38:1332–1348.

A. L. de Oliveira, editor. 2000. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '00, volume 1891 of LNAI, Berlin, Heidelberg. Springer-Verlag.

K. S. Fu and T. L. Booth. 1975. Grammatical inference: Introduction and survey. Part I and II. IEEE Transactions on Syst. Man. and Cybern., 5:59–72 and 409–423.

K. S. Fu. 1974. Syntactic Methods in Pattern Recognition. Academic Press, New York.

E. M. Gold. 1967. Language identification in the limit. Information and Control, 10(5):447–474.

E. M. Gold. 1978. Complexity of automaton identification from given data. Information and Control, 37:302–320.

J. Goodman. 2001. A bit of progress in language modeling. Technical report, Microsoft Research.

V. Honavar and G. Slutski, editors. 1998. Grammatical Inference, Proceedings of ICGI '98, number 1433 in LNAI, Berlin, Heidelberg. Springer-Verlag.

J. J. Horning. 1969. A Study of Grammatical Inference. Ph.D. thesis, Stanford University.

M. Kanazawa. 1998. Learnable Classes of Categorial Grammars. CSLI Publications, Stanford, Ca.

M. J. Kearns and U. Vazirani. 1994. An Introduction to Computational Learning Theory. MIT Press.

L. Miclet and C. de la Higuera, editors. 1996. Proceedings of ICGI '96, number 1147 in LNAI, Berlin, Heidelberg. Springer-Verlag.

B. L. Natarajan. 1991. Machine Learning: a Theoretical Approach. Morgan Kauffman Pub., San Mateo, CA.

G. Paliouras and Y. Sakakibara, editors. 2004. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '04, volume 3264 of LNAI, Berlin, Heidelberg. Springer-Verlag.

L. Pitt and M. Warmuth. 1988. Reductions among prediction problems: on the difficulty of predicting automata. In 3rd Conference on Structure in Complexity Theory, pages 60–69.

L. Pitt and M. Warmuth. 1993. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the Association for Computing Machinery, 40(1):95–142.

L. Pitt. 1989. Inductive inference, DFA's, and computational complexity. In Analogical and Inductive Inference, number 397 in LNAI, pages 18–44. Springer-Verlag, Berlin, Heidelberg.

Y. Sakakibara, S. Kobayashi, K. Sato, T. Nishino, and E. Tomita, editors. 2006. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '06, volume 4201 of LNAI, Berlin, Heidelberg. Springer-Verlag.

Y. Sakakibara. 1997. Recent advances of grammatical inference. Theoretical Computer Science, 185:15–45.

R. Solomonoff. 1960. A preliminary report on a general theory of inductive inference. Technical Report ZTB-138, Zator Company, Cambridge, Mass.

R. Solomonoff. 1964. A formal theory of inductive inference. Information and Control, 7(1):1–22 and 224–254.

M. van Zaanen. 2000. ABL: Alignment-based learning. In Proceedings of COLING 2000, pages 961–967. Morgan Kaufmann.

G. Wolf. 2006. Unifying computing and cognition. Cognition research.


Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 5–6, Athens, Greece, 30 March 2009. ©2009 Association for Computational Linguistics

On bootstrapping of linguistic features for bootstrapping grammars

Damir Cavar
University of Zadar

Zadar, Croatia
[email protected]

Abstract

We discuss a cue-based grammar induction approach based on a parallel theory of grammar. Our model is based on the hypotheses of interdependency between linguistic levels (of representation) and inducibility of specific structural properties at a particular level, with consequences for the induction of structural properties at other linguistic levels. We present the results of three different cue-learning experiments and settings, covering the induction of phonological, morphological, and syntactic properties, and discuss potential consequences for our general grammar induction model.1

1 Introduction

We assume that individual linguistic levels of natural languages differ with respect to their formal complexity. In particular, the assumption is that structural properties of linguistic levels like phonology or morphology can be characterized fully by Regular grammars, and if not, at least a large subset can. Structural properties of natural language syntax, on the other hand, might be characterized by Mildly context-free grammars (Joshi et al., 1991), where at least a large subset could be characterized by Regular and Context-free grammars.2

1 This article builds on joint work and articles with K. Elghamri, J. Herring, T. Ikuta, P. Rodrigues, G. Schrementi and colleagues at the Institute of Croatian Language and Linguistics and the University of Zadar. The research activities were partially funded by several grants over a couple of years, at Indiana University and from the Croatian Ministry of Science, Education and Sports of the Republic of Croatia.

2 We are abstracting away from concrete linguistic models and theories, and their particular complexity, as discussed e.g. in (Ristad, 1990) or (Tesar and Smolensky, 2000).

Ignoring for the time being extra-linguistic conditions and cues for linguistic properties, and independent of the complexity of specific linguistic levels for particular languages, we assume that specific properties at one particular linguistic level correlate with properties at another level. In natural languages certain phonological processes might be triggered at morphological boundaries only, e.g. (Chomsky and Halle, 1968), or prosodic properties correlate with syntactic phrase boundaries and semantic properties, e.g. (Inkelas and Zec, 1990). Similarly, lexical properties, as for example stress patterns and morphological structure, tend to be specific to certain word types (e.g. substantives, but not function words), i.e. they correlate with the lexical morpho-syntactic properties used in grammars of syntax. Other more informal correlations that are discussed in linguistics, but which rather lack a formal model or explanation, are for example the relation between morphological richness and the freedom of word order in syntax.

Thus, it seems that specific regularities and grammatical properties at one linguistic level might provide cues for structural properties at another level. We expect such correlations to be language specific, given that languages differ significantly in qualitative terms at least at the phonetic, phonological and morphological levels, and at least quantitatively at the syntactic level.

Thus in our model of grammar induction, we favor the view expressed e.g. in (Frank, 2000) that complex grammars are bootstrapped (or grow) from less complex grammars. On the other hand, the intuition that structural or inherent properties at different linguistic levels correlate, i.e. that they seem to be used as cues in processing and acquisition, might require a parallel model of language learning or grammar induction, as for example suggested in (Jackendoff, 1996) or the Competition Model (MacWhinney and Bates, 1989).

In general, we start with the observation that natural languages are learnable. In principle, the study of how this might be modeled, and what the minimal assumptions about the grammar properties and the induction algorithm could be, could start top-down, by assuming maximal knowledge of the target grammar, and subsequently eliminating elements that are obviously learnable in an unsupervised way, or fall out as side-effects. Alternatively, a bottom-up approach could start with the question about how much supervision has to be added to an unsupervised model in order to converge to a concise grammar.

Here we favor the bottom-up approach, and ask how simple properties of grammar can be learned in an unsupervised way, and how cues could be identified that allow for the induction of higher-level properties of the target grammar, or of other linguistic levels, by for example favoring some structural hypotheses over others.

In this article we will discuss in detail several experiments on morphological cue induction for lexical classification (Cavar et al., 2004a; Cavar et al., 2004b) using Vector Space Models for category induction and subsequent rule formation. Furthermore, we discuss structural cohesion measured via entropy-based statistics over distributional properties for unsupervised syntactic structure induction from raw text (Cavar et al., 2004c), and compare the results with syntactic corpora like the Penn Treebank. We expand these results with recent experiments in the domain of unsupervised induction of phonotactic regularities and phonological structure (Cavar and Cavar, 2009), providing cues for morphological structure induction and syntactic phrasing.

References

Damir Cavar and Małgorzata E. Cavar. 2009. On the induction of linguistic categories and learning grammars. Paper presented at the 10th Szklarska Poreba Workshop, March.

Damir Cavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004a. Alignment based induction of morphology grammar and its role for bootstrapping. In Gerhard Jager, Paola Monachesi, Gerald Penn, and Shuly Wintner, editors, Proceedings of Formal Grammar 2004, pages 47–62, Nancy.

Damir Cavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004b. On statistical bootstrapping. In William G. Sakas, editor, Proceedings of the First Workshop on Psycho-computational Models of Human Language Acquisition, pages 9–16.

Damir Cavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004c. Syntactic parsing using mutual information and relative entropy. Midwest Computational Linguistics Colloquium (MCLC), June.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Robert Frank. 2000. From regular to context free to mildly context sensitive tree rewriting systems: The path of child language acquisition. In A. Abeille and O. Rambow, editors, Tree Adjoining Grammars: Formalisms, Linguistic Analysis and Processing, pages 101–120. CSLI Publications.

Sharon Inkelas and Draga Zec. 1990. The Phonology-Syntax Connection. University of Chicago Press, Chicago.

Ray Jackendoff. 1996. The Architecture of the Language Faculty. Number 28 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.

Aravind Joshi, K. Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber, and Thomas Wasow, editors, Foundational Issues in Natural Language Processing, pages 31–81. MIT Press, Cambridge, MA.

Brian MacWhinney and Elizabeth Bates. 1989. The Crosslinguistic Study of Sentence Processing. Cambridge University Press, New York.

Eric S. Ristad. 1990. Computational structure of generative phonology and its relation to language comprehension. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pages 235–242. Association for Computational Linguistics.

Bruce Tesar and Paul Smolensky. 2000. Learnability in Optimality Theory. MIT Press, Cambridge, MA.


Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 7–15, Athens, Greece, 30 March 2009. ©2009 Association for Computational Linguistics

Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction

Jeroen Geertzen
Research Centre for English & Applied Linguistics

University of Cambridge, UK
[email protected]

Abstract

This paper presents a model-based approach to dialogue management that is guided by data-driven dialogue act prediction. The statistical prediction is based on stochastic context-free grammars that have been obtained by means of grammatical inference. The prediction performance of the method compares favourably to that of a heuristic baseline and to that of n-gram language models.

The act prediction is explored both for dialogue acts without realised semantic content (consisting only of communicative functions) and for dialogue acts with realised semantic content.

1 Introduction

Dialogue management is the activity of determining how to behave as an interlocutor at a specific moment of time in a conversation: which action can or should be taken at what state of the dialogue. The systematic way in which an interlocutor chooses among the options for continuing a dialogue is often called a dialogue strategy.

Coming up with suitable dialogue management strategies for dialogue systems is not an easy task. Traditional methods typically involve manually crafting and tuning frames or hand-crafted rules, requiring considerable implementation time and cost. More recently, statistical methods are being used to semi-automatically obtain models that can be trained and optimised using dialogue data.1

These methods are usually based on two assumptions. First, the training data is assumed to be representative of the communication that may be encountered in interaction. Second, it is assumed that dialogue can be modelled as a Markov Decision Process (MDP) (Levin et al., 1998), which implies that dialogue is modelled as a sequential decision task in which each contribution (action) results in a transition from one state to another.

1 See e.g. (Young, 2002) for an overview.

The latter assumption allows a reward to be assigned to action-state pairs, and the dialogue management strategy that results in the maximum expected reward to be determined by finding for each state the optimal action using reinforcement learning (cf. (Sutton and Barto, 1998)). Reinforcement learning approaches to dialogue management have proven to be successful in several task domains (see for example (Paek, 2006; Lemon et al., 2006)). In this process there is no supervision, but what is optimal usually depends on factors that require human action, such as task completion or user satisfaction.

The remainder of this paper describes and evaluates a model-based approach to dialogue management in which the decision process of taking a particular action given a dialogue state is guided by data-driven dialogue act prediction. The approach improves over n-gram language models and can be used in isolation or for user simulation, without yet providing a full alternative to reinforcement learning.

2 Using structural properties of task-oriented dialogue

One of the best known regularities observed in dialogue are the two-part structures, known as adjacency pairs (Schegloff, 1968), like QUESTION-ANSWER or GREETING-GREETING.

A simple model for predicting a plausible next dialogue act that deals with such regularities could be based on bigrams, and to include more context, higher-order n-grams could also be used. For instance, Stolcke et al. (2000) explore n-gram models based on transcribed words and prosodic information for SWBD-DAMSL dialogue acts in the Switchboard corpus (Godfrey et al., 1992). After training back-off n-gram models (Katz, 1987) of different order using frequency smoothing (Witten and Bell, 1991), it was concluded that trigrams and higher-order n-grams offer only a small gain in prediction performance with respect to bigrams.
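As an illustration of this kind of sequence model, the sketch below trains a bigram predictor over dialogue-act labels and backs off to the overall majority label for unseen contexts; the act labels and the simple backoff scheme are assumptions made for the example, not the setup used by Stolcke et al. (2000).

from collections import Counter, defaultdict

class BigramActPredictor:
    """Minimal bigram model over dialogue-act labels with majority backoff."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # previous act -> counts of next acts
        self.unigrams = Counter()

    def train(self, dialogues):
        # dialogues: list of lists of act labels, e.g. [["SET-Q", "SET-A"], ...]
        for acts in dialogues:
            self.unigrams.update(acts)
            for prev, nxt in zip(acts, acts[1:]):
                self.bigrams[prev][nxt] += 1

    def predict(self, prev_act):
        # Most likely next act given the previous one; back off to the
        # overall majority act when the context was never observed.
        if self.bigrams[prev_act]:
            return self.bigrams[prev_act].most_common(1)[0][0]
        return self.unigrams.most_common(1)[0][0]

model = BigramActPredictor()
model.train([["SET-Q", "SET-A", "SET-Q", "SET-A"], ["PRO-Q", "PRO-A", "SET-A"]])
print(model.predict("SET-Q"))   # -> SET-A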

Apart from adjacency pairs, there is a variety of more complex re-occurring interaction patterns. For instance, the following utterances with corresponding dialogue act types illustrate a clarification sub-dialogue within an information-request dialogue:

1  A: How do I do a fax?                 QUESTION

2  B: Do you want to send or print one?  QUESTION

3  A: I want to print it                 ANSWER

4  B: Just press the grey button         ANSWER

Such structures have received considerable attention and their models are often referred to as discourse/dialogue grammars (Polanyi and Scha, 1984) or conversational/dialogue games (Levin and Moore, 1988).

As also remarked by Levin (1999), predicting and recognising dialogue games using n-gram models is not really successful. There are various causes for this. The flat horizontal structure of n-grams does not allow (hierarchical) grouping of symbols. This may weaken the predictive power and reduces the power of the representation, since nested structures such as the one exemplified above cannot be represented in a straightforward way.

A better solution would be to express the structure of dialogue games by a context-free grammar (CFG) representation in which the terminals are dialogue acts and the non-terminals denote conversational games. Construction of a CFG would require explicit specification of a discourse grammar, which could be done by hand, but it would be a great advantage if CFGs could automatically be induced from the data. An additional advantage of grammar induction is the possibility to assess the frequency of typical patterns, and a stochastic context-free grammar (SCFG) may be produced which can be used for parsing the dialogue data.
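To make this representation concrete, the sketch below writes a toy dialogue game as a probabilistic CFG over dialogue-act terminals and parses an act sequence with NLTK's Viterbi parser; the grammar, its probabilities, and the act labels are invented for illustration and are not taken from this paper.

from nltk import PCFG
from nltk.parse import ViterbiParser

# Toy dialogue game: a question-answer exchange that may embed a
# clarification sub-game (probabilities are illustrative only).
grammar = PCFG.fromstring("""
    S -> 'A:SET-Q' CLAR 'B:SET-A' [0.4] | 'A:SET-Q' 'B:SET-A' [0.6]
    CLAR -> 'B:PRO-Q' 'A:PRO-A' [1.0]
""")

parser = ViterbiParser(grammar)
sequence = ['A:SET-Q', 'B:PRO-Q', 'A:PRO-A', 'B:SET-A']
for tree in parser.parse(sequence):
    print(tree)         # bracketed structure of the recognised game
    print(tree.prob())  # probability assigned by the SCFG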

3 Sequencing dialogue acts

Both n-gram language models and SCFG-based models work on sequences of symbols. Using more complex symbols increases data sparsity: encoding more information increases the number of unique symbols in the dataset and decreases the number of reoccurring patterns which could be used in the prediction.

In compiling the symbols for the prediction experiments, three aspects are important: the identification of interlocutors, the definition of dialogue acts, and multifunctionality in dialogue.

The dialogue act taxonomy that is used in the prediction experiments is that of DIT (Bunt, 2000). A dialogue act is defined as a pair consisting of a communicative function (CF) and a semantic content (SC): a = <CF, SC>. The DIT taxonomy distinguishes 11 dimensions of communicative functions, addressing information about the task domain, feedback, turn management, and other generic aspects of dialogue (Bunt, 2006). There are also functions, called the general-purpose functions, that may occur in any dimension. In quite some cases, particularly when dialogue control is addressed and dimension-specific functions are realised, the SC is empty. General-purpose functions, by contrast, are always used in combination with a realised SC. For example:

                      dialogue act
utterance             function        semantic content

What to do next?      SET-QUESTION    next-step(X)

Press the button.     SET-ANSWER      press(Y) ∧ button(Y)

The SC, if realised, describes objects, properties, and events in the domain of conversation.

In dialogue act prediction while taking multidimensionality into account, a dialogue D can be represented as a sequence of events in which an event is a set of one or more dialogue acts occurring simultaneously. The information concerning interlocutor and multifunctionality is encoded in a single symbol and denoted by means of an n-tuple. Assuming that at most three functions can occur simultaneously, a 4-tuple is needed2: (interlocutor, da1, da2, da3). An example of a bigram of 4-tuples would then look as follows:

(A, <SET-Q, "next-step(X)">, , ),
(B, <SET-A, "press(Y) ∧ button(Y)">, , )

Two symbols are considered to be identical when the same speaker is involved and when the symbols both address the same functions. To make it easy to determine whether two symbols are identical, the order of elements in a tuple is fixed: functions that occur simultaneously are first ordered on importance of dimension, and subsequently alphabetically. The task-related functions are considered the most important, followed by feedback-related functions, followed by any other remaining functions. This raises the question how recognition performance using multifunctional symbols compares against recognition performance using symbols that only encode the primary function.

2 Ignoring the half percent of occurrences with four simultaneous functions.
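A minimal sketch of this symbol encoding, assuming hypothetical DIT-style function labels and a simple dimension ranking; the ranking values, labels, and helper names are illustrative and not taken from the paper.

# Assumed order of importance: task-related functions first, then feedback,
# then everything else; ties are broken alphabetically.
DIMENSION_RANK = {"task": 0, "feedback": 1, "other": 2}

def encode_event(speaker, acts, max_functions=3):
    """Encode one multifunctional dialogue event as a fixed-order tuple.

    acts: list of (function_label, dimension) pairs, e.g.
          [("SET-Q", "task"), ("TURN-KEEP", "other")]
    """
    ordered = sorted(acts, key=lambda a: (DIMENSION_RANK.get(a[1], 2), a[0]))
    labels = [label for label, _ in ordered[:max_functions]]
    labels += [""] * (max_functions - len(labels))      # pad empty slots
    return (speaker,) + tuple(labels)

# Two events with the same speaker and the same functions map to the same
# symbol regardless of the order in which the functions were listed.
e1 = encode_event("A", [("TURN-KEEP", "other"), ("SET-Q", "task")])
e2 = encode_event("A", [("SET-Q", "task"), ("TURN-KEEP", "other")])
assert e1 == e2 == ("A", "SET-Q", "TURN-KEEP", "")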

4 N-gram language models

There exists a significant body of work on the use of language models in relation to dialogue management. Nagata and Morimoto (1994) describe a statistical model of discourse based on trigrams of utterances classified by custom speech act types. They report 39.7% prediction accuracy for the top candidate and 61.7% for the top three candidates.

In the context of the dialogue component of the speech-to-speech translation system VERBMOBIL, Reithinger and Maier (1995) use n-gram dialogue act probabilities to suggest the most likely dialogue act. In later work, Alexandersson and Reithinger (1997) describe an approach which comes close to the work reported in this paper: using grammar induction, plan operators are semi-automatically derived and combined with a statistical disambiguation component. This system is claimed to have an accuracy score of around 70% on turn management classes.

Another study is that of Poesio and Mikheev (1998), in which prediction based on the previous dialogue act is compared with prediction based on the context of dialogue games. Using the Map Task corpus annotated with 'moves' (dialogue acts) and 'transactions' (games), they showed that by using higher dialogue structures it was possible to perform significantly better than a bigram model approach. Using bigrams, 38.6% accuracy was achieved. Additionally taking game structure into account resulted in 50.6%; adding information about speaker change resulted in an accuracy of 41.8% with bigrams, and 54% with game structure.

All studies discussed so far are only concerned with sequences of communicative functions, and disregard the semantic content of dialogue acts.

5 Dialogue grammars

To automatically induce patterns from dialogue data in an unsupervised way, grammatical inference (GI) techniques can be used. GI is a branch of unsupervised machine learning that aims to find structure in symbolic sequential data. In this case, the input of the GI algorithm will be sequences of dialogue acts.

5.1 Dialogue Grammars Inducer

For the induction of structure, a GI algorithm has been implemented that will be referred to as Dialogue Grammars Inducer (DGI). This algorithm is based on distributional clustering and alignment-based learning (van Zaanen and Adriaans, 2001; van Zaanen, 2002; Geertzen and van Zaanen, 2004). Alignment-based learning (ABL) is a symbolic grammar inference framework that has successfully been applied to several unsupervised machine learning tasks in natural language processing. The framework accepts sequences of symbols, aligns them with each other, and compares them to find interchangeable subsequences that mark structure. As a result, the input sequences are augmented with the induced structure.

The DGI algorithm takes as input time series of dialogue acts, and gives as output a set of SCFGs. The algorithm has five phases:

1. SEGMENTATION: In the first phase of DGI, the time series are, if necessary, segmented into smaller sequences based on a specific time interval in which no communication takes place. This is a necessary step in task-oriented conversation, in which there is ample time to discuss (and carry out) several related tasks, and an interaction often consists of a series of short dialogues.

2. ALIGNMENT LEARNING: In the second phase a search space of possible structures, called hypotheses, is generated by comparing all input sequences with each other and by clustering sub-sequences that share similar context. To illustrate the alignment learning, consider the following input sequences:

A:SET-Q, B:PRO-Q, A:PRO-A, B:SET-A.
A:SET-Q, B:PAUSE, B:RESUME, B:SET-A.
A:SET-Q, B:SET-A.

The alignment learning compares all input sequences with each other, and produces the hypothesised structures depicted below. The induced structure is represented using bracketing.

[i A:SET-Q, [j B:PRO-Q, A:PRO-A, ]j B:SET-A. ]i
[i A:SET-Q, [j B:PAUSE, A:RESUME, ]j B:SET-A. ]i
[i A:SET-Q, [j ]j B:SET-A. ]i

The hypothesis j is generated because of the similar surrounding context (A:SET-Q preceding and B:SET-A following). The hypothesis i, the full span, is introduced by default, as it might be possible that the sequence is in itself part of a longer sequence. (A simplified sketch of this alignment step is given after this list.)

3. SELECTION LEARNING: The set of hypotheses that is generated during alignment learning contains hypotheses that are unlikely to be correct. These hypotheses are filtered out, overlapping hypotheses are eliminated to assure that it is possible to extract a context-free grammar, and the remaining hypotheses are selected and remain in the bracketed output. The decision of which hypotheses to select and which to discard is based on a Viterbi beam search (Viterbi, 1967).

4. EXTRACTION: In the fourth phase, SCFG grammars are extracted from the remaining hypotheses by means of recursive descent parsing. Ignoring the stochastic information, a CFG for the above-mentioned example looks, in terms of grammar rules, as depicted below:

S ⇒ A:SET-Q J B:SET-A
J ⇒ B:PRO-Q A:PRO-A
J ⇒ B:PAUSE A:RESUME
J ⇒ ∅

5. FILTERING: In the last phase, the SCFG grammars that have small coverage or involve many non-terminals are filtered out, and the remaining SCFG grammars are presented as the output of DGI.
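The sketch below illustrates the alignment step referred to in phase 2: it brackets the unequal middles of two act sequences that share a common prefix and suffix, mirroring the worked example above. The shared prefix/suffix heuristic is a simplified stand-in for the edit-distance alignment used in ABL, and the function is an illustration rather than the authors' implementation.

def align_pair(seq1, seq2):
    """Hypothesise interchangeable subsequences of two dialogue-act sequences
    by bracketing the unequal middles between their longest shared prefix
    and suffix (a simplified stand-in for full ABL alignment)."""
    # longest common prefix
    p = 0
    while p < min(len(seq1), len(seq2)) and seq1[p] == seq2[p]:
        p += 1
    # longest common suffix that does not overlap the prefix
    s = 0
    while (s < min(len(seq1), len(seq2)) - p
           and seq1[len(seq1) - 1 - s] == seq2[len(seq2) - 1 - s]):
        s += 1
    # the differing middles occur in the same context, so each is
    # hypothesised to be an interchangeable constituent
    return (p, len(seq1) - s), (p, len(seq2) - s)

seqs = [
    ["A:SET-Q", "B:PRO-Q", "A:PRO-A", "B:SET-A"],
    ["A:SET-Q", "B:PAUSE", "A:RESUME", "B:SET-A"],
    ["A:SET-Q", "B:SET-A"],
]
print(align_pair(seqs[0], seqs[1]))  # ((1, 3), (1, 3)): both middles bracketed
print(align_pair(seqs[0], seqs[2]))  # ((1, 3), (1, 1)): empty constituent in the short sequence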

Depending on the mode of working, the DGI algorithm can generate a single SCFG covering the complete input or can generate a set of SCFGs. In the former mode, the grammar that is generated can be used for parsing sequences of dialogue acts and by doing so suggests ways to continue the dialogue. In the latter mode, by parsing in parallel with each grammar in the set of grammars that are expected to represent dialogue games, specific dialogue games may be recognised, which can in turn be used beneficially in dialogue management.

5.2 A worked example

In testing the algorithm, DGI has been used to infer a set of SCFGs from a development set of 250 utterances of the DIAMOND corpus (see also Section 6.1). Already for this small dataset, DGI produced, using default parameters, 45 'dialogue games'. One of the largest produced structures was the following:

4  S    ⇒ A:SET-Q , NTAX , NTBT , B:SET-A
4  NTAX ⇒ B:PRO-Q , NTFJ
3  NTFJ ⇒ A:PRO-A
1  NTFJ ⇒ A:PRO-A , A:CLARIFY
2  NTBT ⇒ B:PRO-Q , A:PRO-A
2  NTBT ⇒ ∅

In this figure, each CFG rule is preceded by a number indicating how many times the rule has been used. One of the dialogue fragments that was used to induce this structure is the following excerpt:

     utterance                              dialogue act

A1   how do I do a short code?              SET-Q
B1   do you want to program one?            PRO-Q
A2   no                                     SET-A
A3   I want to enter a kie* a short code    CLARIFY
B2   you want to use a short code?          PRO-Q
A4   yes                                    PRO-A
B3   press the VK button                    SET-A

Unfortunately, many of the 45 induced structures were very small or involved generalisations based on only two input samples. To ensure that the grammars produced by DGI generalise better and are less fragmented, a post-processing step has been added which traverses the grammars and eliminates generalisations based on a low number of samples. In practice, this means that the post-processing requires the remaining grammatical structure to be present N times or more in the data.3 The algorithm without post-processing will be referred to as DGI1; the algorithm with post-processing as DGI2.

6 Act prediction experiments

To determine how to behave as an interlocutor at a specific moment of time in a conversation, the DGI algorithm can be used to infer a SCFG that models the structure of the interaction. The SCFG can then be used to suggest a next dialogue act to continue the dialogue. In this section, the performance of the proposed SCFG-based dialogue model is compared with the performance of the well-known n-gram language models, both trained on the intentional level, i.e. on sequences of sets of dialogue acts.

3 N = 2 by default, but may increase with the size of the training data.

6.1 Data

The task-oriented dialogues used in the dialogue act prediction tasks were drawn from the DIAMOND corpus (Geertzen et al., 2004), which contains human-machine and human-human Dutch dialogues that have an assistance-seeking nature. The dataset used in the experiments contains 1,214 utterances representing 1,592 functional segments from the human-human part of the corpus. In the domain of the DIAMOND data, i.e. operating a fax device, the predicates and arguments in the logical expressions of the SC of the dialogue acts refer to entities, properties, events, and tasks in the application domain. The application domain of the fax device is complex but small: the domain model consists of 70 entities with at most 10 properties, 72 higher-level actions or tasks, and 45 different settings.

Representations of semantic content are often expressed in some form of predicate-logic-type formula. Examples are Quasi Logical Forms (Alshawi, 1990), Dynamic Predicate Logic (Groenendijk and Stokhof, 1991), and Underspecified Discourse Representation Theory (Reyle, 1993). The SC in the dataset is in a simplified first order logic similar to quasi logical forms, and is suitable to support feasible reasoning, for which also theorem provers, model builders, and model checkers can be used. The following utterances and their corresponding SC characterise the dataset:

1  wat moet ik nu doen?
   (what do I have to do now?)
   λx . next-step(x)

2  druk op een toets
   (press a button)
   λx . press(x) ∧ button(x)

3  druk op de groene toets
   (press the green button)
   λx . press(x) ∧ button(x) ∧ color(x,'green')

4  wat zit er boven de starttoets?
   (what is located above the starttoets?)
   λx . loc-above(x,'button041')

Three types of predicate groups are distinguished: action predicates, element predicates, and property predicates. These types have a fixed order: action predicates appear before element predicates, which appear in turn before property predicates. This makes it possible to simplify the semantic content for the purpose of reducing data sparsity in act prediction experiments, by stripping away e.g. property predicates. For instance, if desired, the SC of utterance 3 in the example could be simplified to that of utterance 2, making the semantics less detailed but still meaningful.
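A small sketch of this simplification, assuming the SC is stored as a list of predicate terms tagged with their group; the representation and helper name are assumptions made for the example, with the formula taken from utterance 3 above.

# Each conjunct of the SC is a (predicate term, group) pair, with the group
# one of "action", "element", or "property", kept in that fixed order.
def simplify_sc(conjuncts, drop_groups=("property",)):
    """Strip the given predicate groups from a semantic content formula."""
    return [(term, group) for term, group in conjuncts if group not in drop_groups]

sc_full = [("press(x)", "action"), ("button(x)", "element"),
           ("color(x,'green')", "property")]
print(simplify_sc(sc_full))
# [('press(x)', 'action'), ('button(x)', 'element')]  -> same SC as utterance 2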

6.2 Methodology and metrics

Evaluation of overall performance in communication is problematic; there are no generally accepted criteria as to what constitutes an objective and sound way of comparative evaluation. An often-used paradigm for dialogue system evaluation is PARADISE (Walker et al., 2000), in which the performance metric is derived as a weighted combination of subjectively rated user satisfaction, task-success measures and dialogue cost. Evaluating whether the predicted dialogue acts are considered as positive contributions in such a way would require the model to be embedded in a fully working dialogue system.

To assess whether the models that are learned produce human-like behaviour without resorting to costly user interaction experiments, it is necessary to compare their output with real human responses given in the same contexts. This will be done by deriving a model from one part of a dialogue corpus and applying the model to an 'unseen' part of the corpus, comparing the suggested next dialogue act with the observed next dialogue act. To measure the performance, accuracy is used, which is defined as the proportion of suggested dialogue acts that match the observed dialogue acts.

In addition to accuracy, perplexity is used as a metric. Perplexity is widely used in relation to speech recognition and language models, and can in this context be understood as a metric that measures the number of equiprobable possible choices that a model faces at a given moment. Perplexity, being related to entropy, is defined as follows:

Entropy = − Σ_i p(w_i | h) · log2 p(w_i | h)

Perplexity = 2^Entropy

where h denotes the conditioning context, i.e. w_{i−1} in the case of bigrams and w_{i−2}, w_{i−1} in the case of trigrams, et cetera. In sum, accuracy could be described as a measure of the correctness of the hypothesis, and perplexity as a measure of how probable the correct hypothesis is.
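A minimal sketch of this computation for a single history h of a bigram model, assuming the conditional probabilities have already been estimated; the probability table is invented for the example.

import math

def entropy(next_probs):
    """Entropy of the distribution over next symbols given one history h."""
    return -sum(p * math.log2(p) for p in next_probs.values() if p > 0)

def perplexity(next_probs):
    return 2 ** entropy(next_probs)

# hypothetical bigram distribution p(w_i | h) for the history h = "SET-Q"
p_next_given_setq = {"SET-A": 0.7, "PRO-Q": 0.2, "PAUSE": 0.1}
print(perplexity(p_next_given_setq))  # ≈ 2.23 equiprobable choices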

For all n-gram language modelling tasks re-ported, good-turing smoothing was used (Katz,1987). To reduce the effect of imbalances in thedialogue data, the results were obtained using 5-fold cross-validation.

To get an idea of how the performance of both the n-gram language models and the SCFG models relates to the performance of a simple heuristic, a baseline has been computed which suggests a majority class label according to the interlocutor role in the dialogue. The information seeker has SET-Q and the information provider has SET-A as majority class label.

6.3 Results for communicative functions

The scores for communicative function prediction are presented in Table 1. For each of the three kinds of symbols, accuracy and perplexity are calculated: the first two columns are for the main CF, the second two columns are for the combination of speaker identity and main CF, and the third two columns are for the combination of speaker identity and all CFs. The scores for the latter two codings could not be calculated for the 5-gram model, as the data were too sparse.

As was expected, there is an improvement in both accuracy (increasing) and perplexity (decreasing) for increasing n-gram order. After the 4-gram language model, the scores drop again. This could well be the result of insufficient training data, as the more complex symbols could not be predicted well.

Both language models and SCFG models perform better than the baseline, for all three groups. The two SCFG models, DGI1 and DGI2, clearly outperform the n-gram language models with a substantial difference in accuracy. Also the perplexity tends to be lower. Furthermore, model DGI2 performs clearly better than model DGI1, which indicates that the 'flattening' of non-terminals described in Section 5 results in better inductions.

When comparing the three groups of sequences, it can be concluded that providing the speaker identity combined with the main communicative function results in accuracy scores that are better by 5.9% on average, despite the increase in data sparsity. A similar effect has also been reported by Stolcke et al. (2000).

Only for the 5-gram language model do the data become too sparse to learn a language model from reliably. There is again an increase in performance when also the last two positions in the 4-tuple are used and all available dialogue act assignments are available. It should be noted, however, that this increase has less impact than adding the speaker identity. The best performing n-gram language model achieved 66.4% accuracy; the best SCFG model achieved 78.9% accuracy.

6.4 Results for dialogue acts

The scores for prediction of dialogue acts, including SC, are presented in the left part of Table 2. The presentation is similar to Table 1: for each of the three kinds of symbols, accuracy and perplexity were calculated. For dialogue acts that may include semantic content, computing a useful baseline is not obvious. The same baseline as for communicative functions was used, which results in lower scores.

The table shows that the attempts to learn to predict additionally the semantic content of utterances quickly run into data sparsity problems. It turned out to be impossible to make predictions from 4-grams and 5-grams, and for 3-grams the combination of speaker and all dialogue acts could not be computed. Training the SCFGs, by contrast, resulted in fewer problems with data sparsity, as the models abstract quickly.

As with predicting communicative functions, the SCFG models show better performance than the n-gram language models, for which the 2-grams show slightly better results than the 3-grams. Where there was a notable performance difference between DGI1 and DGI2 for CF prediction, for dialogue act prediction there is only a very small difference, which is insignificant considering the relatively high standard deviation. This small difference is explained by the fact that DGI2 becomes less effective as the size of the training data decreases.

As with CF prediction, it can be concluded that providing the speaker identity with the main dialogue act results in better scores, but the difference is smaller than observed with CF prediction due to the increased data sparsity.


Table 1: Communicative function prediction scores for n-gram language models and SCFGs in accuracy (acc, in percent) and perplexity (pp). CFmain denotes the main communicative function, SPK speaker identity, and CFall all occurring communicative functions.

            CFmain                  SPK + CFmain            SPK + CFall
            acc         pp          acc         pp          acc         pp
baseline    39.1±0.23   24.2±0.19   44.6±0.92   22.0±0.25   42.9±1.33   23.7±0.41
2-gram      53.1±0.88   17.9±0.35   58.3±1.84   16.8±0.31   61.1±1.65   16.3±0.59
3-gram      58.6±0.85   17.1±0.47   63.0±1.98   14.5±0.26   65.9±1.92   14.0±0.23
4-gram      60.9±1.12   16.7±0.15   65.4±1.62   15.2±1.07   66.4±2.03   14.2±0.44
5-gram      60.3±0.43   18.6±0.21   -           -           -           -
DGI1        67.4±3.05   18.3±1.28   74.6±1.94   14.8±1.47   76.5±2.13   13.9±0.35
DGI2        71.8±2.67   16.1±1.25   78.3±2.50   14.0±2.39   78.9±1.98   13.6±0.35

Table 2: Dialogue act prediction scores for n-gram language models and SCFGs. DAmain denotes the dialogue act with the main communicative function, and DAall all occurring dialogue acts.

            DAmain                  SPK + DAmain            SPK + DAall (full SC)   SPK + DAall (simplified SC)
            acc         pp          acc         pp          acc         pp          acc         pp
baseline    18.5±2.01   31.0±1.64   19.3±1.79   27.6±0.93   18.2±1.93   31.6±1.38   18.2±1.93   31.6±1.38
2-gram      31.2±1.42   28.5±1.03   34.6±1.51   24.7±0.62   35.1±1.30   26.9±0.47   37.5±1.34   26.2±2.37
3-gram      29.0±1.14   34.7±2.82   31.9±1.21   30.5±2.06   -           -           29.1±1.28   28.0±2.59
4-gram      -           -           -           -           -           -           -           -
5-gram      -           -           -           -           -           -           -           -
DGI1        38.8±3.27   25.1±0.94   42.5±0.96   25.0±1.14   42.9±2.44   27.3±1.98   46.6±2.01   24.6±2.24
DGI2        39.2±2.45   25.0±1.28   42.7±1.03   25.3±0.99   42.4±2.19   28.0±1.57   46.4±1.94   24.7±2.55

The prediction scores of dialogue acts with full semantic content and simplified semantic content are presented in the right part of Table 2. For both cases multifunctionality is taken into account by including all occurring communicative functions in each symbol. As can be seen from the table, the simplification of the semantic content leads to improvements in the prediction performance for both types of model. The best n-gram language model improved by 2.4% accuracy, from 35.1% to 37.5%; the best SCFG-based model improved by 3.7%, from 42.9% to 46.6%.

Moreover, the simplification of the semantic content reduced the problem of data sparsity, making it also possible to predict based on 3-grams, although the accuracy is considerably lower than that of the 2-gram model.

7 Discussion

Both n-gram language models and SCFG-based models have their strengths and weaknesses. n-gram models have the advantage of being very robust and easy to train. The SCFG-based model can capture regularities that have gaps, and allows longer-distance relations to be modelled. Both algorithms work on sequences and hence are easily susceptible to data sparsity when the symbols in the sequences become more complex. The SCFG approach, though, has the advantage that symbols can be clustered in the non-terminals of the grammar, which allows more flexibility.

The multidimensional nature of the DIT++ functions can be adequately encoded in the symbols of the sequences. Keeping track of the interlocutor and including not only the main communicative function but also other functions that occur simultaneously results in better performance even though it decreases the amount of data to learn from.

The prediction experiments based on main communicative functions assume that in case of multifunctionality, a main function can clearly be identified. Moreover, it is assumed that task-related functions are more important than feedback functions or other functions. For most cases, these assumptions are justified, but in some cases they are problematic. For instance, in a heated discussion, the turn management function could be considered more important for the dialogue than a simultaneously occurring domain-specific function. In other cases, it is impossible to clearly identify a main function, as all functions occurring simultaneously are equally important to the dialogue.

In general, n-grams of a higher order have a higher predictability and therefore a lower perplexity. However, using high order n-grams is problematic due to sparsity of training data, which clearly is the case with 4-grams, and particularly with 5-grams in combination with complex symbols, as for CF prediction.

Considerably more difficult is the prediction of dialogue acts with realised semantic content, as is evidenced by the differences in accuracy and perplexity for all models. Considering that the data set, with about 1,600 functional segments, is rather small, the statistical prediction of logical expressions increases data sparsity to such a degree that of the n-gram language models, only 2-grams (and 3-grams to some extent) could be trained. The SCFG models can be trained for both CF prediction and dialogue act prediction.

As noted in Section 6.2, objective evaluation of dialogue strategies and behaviour is difficult. The evaluation approach used here compares the suggested next dialogue act with the next dialogue act as observed. This is done for each dialogue act in the test set. This evaluation approach has the advantage that the evaluation metric can easily be understood and computed. The approach, however, is also very strict: in a given dialogue context, continuations with various types of dialogue acts may all be equally appropriate. To also take other possible contributions into account, a rich dataset is required in which interlocutors act differently in similar dialogue contexts with a similar established common ground. Moreover, such a dataset should contain for each of these cases with similar dialogue context a representative set of samples.
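To make the strict metric concrete, the following sketch computes next-act accuracy exactly as described above. It is illustrative only (not the authors' code); the model.predict interface is a hypothetical stand-in for either the n-gram or the SCFG-based predictor.

```python
def next_act_accuracy(model, dialogues):
    """model.predict(history) is a hypothetical interface that returns the
    single most likely next dialogue act symbol given the history so far."""
    correct, total = 0, 0
    for acts in dialogues:                      # each dialogue: list of act symbols
        for i in range(1, len(acts)):
            predicted = model.predict(acts[:i])
            correct += int(predicted == acts[i])
            total += 1
    return correct / total if total else 0.0
```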

8 Conclusions and future work

An approach to the prediction of communicative functions and dialogue acts has been presented that makes use of grammatical inference to automatically extract structure from corpus data. The algorithm, based on alignment-based learning, has been tested against a baseline and several n-gram language models. From the results it can be concluded that the algorithm outperforms the n-gram models: on the task of predicting the communicative functions, the best performing n-gram model achieved 66.4% accuracy; the best SCFG model achieved 78.9% accuracy. Predicting the semantic content in combination with the communicative functions is difficult, as evidenced by moderate scores. Training lower-order n-gram language models is feasible, whereas higher-order models are not trainable. Prediction works better with the SCFG models, but does not result in convincing scores. As the corpus is small, it is expected that with scaling up the available training data, scores will improve for both tasks.

Future work can go in several directions. First, the grammar induction approach shows potential for learning dialogue game-like structures in an unsupervised way. The performance on this task could be tested and measured by applying the algorithm to corpus data that have been annotated with dialogue games. Second, the models could be extended to incorporate more information than dialogue acts alone. This could make comparisons with the performance obtained with reinforcement learning or with Bayesian networks interesting. Third, it would be interesting to learn and apply the same models to other kinds of conversation, such as dialogues with more than two interlocutors. Fourth, datasets could be drawn from a large corpus that covers dialogues on a small but complex domain. This makes it possible to evaluate according to the possible continuations found in the data for situations with a similar dialogue context, rather than according to a single possible continuation. Last, the rather unexplored parameter space of the DGI algorithm might be worth exploring to optimise the system's performance.

References

Jan Alexandersson and Norbert Reithinger. 1997. Learning dialogue structures from a corpus. In Proceedings of Eurospeech 1997, pages 2231–2234, Rhodes, Greece, September.

Hiyan Alshawi. 1990. Resolving quasi logical forms. Computational Linguistics, 16(3):133–144.

Harry Bunt. 2000. Dialogue pragmatics and context specification. In Harry Bunt and William Black, editors, Abduction, Belief and Context in Dialogue; Studies in Computational Pragmatics, pages 81–150. John Benjamins, Amsterdam, The Netherlands.



Harry Bunt. 2006. Dimensions in dialogue annotation. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 1444–1449, Genova, Italy, May.

Jeroen Geertzen and Menno M. van Zaanen. 2004. Grammatical inference using suffix trees. In Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI), pages 163–174, Athens, Greece, October.

Jeroen Geertzen, Yann Girard, and Roser Morante. 2004. The DIAMOND project. Poster at the 8th Workshop on the Semantics and Pragmatics of Dialogue (CATALOG 2004), Barcelona, Spain, July.

John Godfrey, Edward Holliman, and Jane McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of ICASSP-92, pages 517–520, San Francisco, USA.

Jeroen Groenendijk and Martin Stokhof. 1991. Dynamic predicate logic. Linguistics and Philosophy, 14(1):39–100.

Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401.

Oliver Lemon, Kallirroi Georgila, and James Henderson. 2006. Evaluating effectiveness and portability of reinforcement learned dialogue strategies with real users: The talk towninfo evaluation. In Spoken Language Technology Workshop, pages 178–181.

Joan A. Levin and Johanna A. Moore. 1988. Dialogue-games: metacommunication structures for natural language interaction. Distributed Artificial Intelligence, pages 385–397.

Esther Levin, Roberto Pieraccini, and Wieland Eckert. 1998. Using Markov decision process for learning dialogue strategies. In Proceedings of ICASSP'98, pages 201–204, Seattle, WA, USA.

Lori Levin, Klaus Ries, Ann Thyme-Gobbel, and Alon Lavie. 1999. Tagging of speech acts and dialogue games in Spanish Call Home. In Proceedings of the ACL-99 Workshop on Discourse Tagging, College Park, MD, USA.

Masaaki Nagata and Tsuyoshi Morimoto. 1994. First steps towards statistical modeling of dialogue to predict the speech act type of the next utterance. Speech Communication, 15(3-4):193–203.

Tim Paek. 2006. Reinforcement learning for spoken dialogue systems: Comparing strengths and weaknesses for practical deployment. In Interspeech Workshop on "Dialogue on Dialogues".

Massimo Poesio and Andrei Mikheev. 1998. The predictive power of game structure in dialogue act recognition: Experimental results using maximum entropy estimation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP-98), Sydney, Australia, December.

Livia Polanyi and Remko Scha. 1984. A syntactic approach to discourse semantics. In Proceedings of the 10th International Conference on Computational Linguistics, pages 413–419, Stanford, CA, USA.

Norbert Reithinger and Elisabeth Maier. 1995. Utilizing statistical dialogue act processing in VERBMOBIL. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 116–121, Cambridge, Massachusetts. Association for Computational Linguistics (ACL).

Uwe Reyle. 1993. Dealing with ambiguities by underspecification: Construction, representation and deduction. Journal of Semantics, 10(2):123–179.

Emanuel A. Schegloff. 1968. Sequencing in conversational openings. American Anthropologist, 70:1075–1095.

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). MIT Press, March.

Menno van Zaanen and Pieter W. Adriaans. 2001. Comparing two unsupervised grammar induction systems: Alignment-Based Learning vs. EMILE. Technical Report TR2001.05, University of Leeds, Leeds, UK, March.

Menno M. van Zaanen. 2002. Bootstrapping Structure into Language: Alignment-Based Learning. Ph.D. thesis, University of Leeds, Leeds, UK, January.

Andrew J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, April.

Marilyn Walker, Candace Kamm, and Diane Litman. 2000. Towards developing general models of usability with PARADISE. Natural Language Engineering, 6(3-4):363–377.

Ian H. Witten and Timothy C. Bell. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094.

Steve Young. 2002. The statistical approach to the design of spoken dialogue systems. Technical Report CUED/F-INFENG/TR.433, Engineering Department, Cambridge University, UK, September.




Experiments Using OSTIA for a Language Production Task

Dana Angluin and Leonor Becerra-Bonache
Department of Computer Science, Yale University

P.O. Box 208285, New Haven, CT, USA
{dana.angluin, leonor.becerra-bonache}@yale.edu

Abstract

The phenomenon of meaning-preserving corrections given by an adult to a child involves several aspects: (1) the child produces an incorrect utterance, which the adult nevertheless understands, (2) the adult produces a correct utterance with the same meaning and (3) the child recognizes the adult utterance as having the same meaning as its previous utterance, and takes that as a signal that its previous utterance is not correct according to the adult grammar. An adequate model of this phenomenon must incorporate utterances and meanings, account for how the child and adult can understand each other's meanings, and model how meaning-preserving corrections interact with the child's increasing mastery of language production. In this paper we are concerned with how a learner who has learned to comprehend utterances might go about learning to produce them.

We consider a model of language comprehension and production based on finite sequential and subsequential transducers. Utterances are modeled as finite sequences of words and meanings as finite sequences of predicates. Comprehension is interpreted as a mapping of utterances to meanings and production as a mapping of meanings to utterances. Previous work (Castellanos et al., 1993; Pieraccini et al., 1993) has applied subsequential transducers and the OSTIA algorithm to the problem of learning to comprehend language; here we apply them to the problem of learning to produce language. For ten natural languages and a limited domain of geometric shapes and their properties and relations we define sequential transducers to produce pairs consisting of an utterance in that language and its meaning. Using this data we empirically explore the properties of the OSTIA and DD-OSTIA algorithms for the tasks of learning comprehension and production in this domain, to assess whether they may provide a basis for a model of meaning-preserving corrections.

1 Introduction

The role of corrections in language learning has recently received substantial attention in Grammatical Inference. The kinds of corrections considered are mainly syntactic corrections based on proximity between strings. For example, a correction of a string may be given by using edit distance (Becerra-Bonache et al., 2007; Kinber, 2008) or based on the shortest extension of the queried string (Becerra-Bonache et al., 2006), among others. In these approaches semantic information is not used.

However, in natural situations, a child's erroneous utterances are corrected by her parents based on the meaning that the child intends to express; typically, the adult's corrections preserve the intended meaning of the child. Adults use corrections in part as a way of making sure they have understood the child's intentions, in order to keep the conversation "on track". Thus the child's utterance and the adult's correction have the same meaning, but the form is different. As Chouinard and Clark point out (2003), because children attend to contrasts in form, any change in form that does not mark a different meaning will signal to children that they may have produced something that is not acceptable in the target language. Results in (Chouinard and Clark, 2003) show that adults reformulate erroneous child utterances often enough to help learning. Moreover, these results show that children can not only detect differences between their own utterance and the adult reformulation, but that they do make use of that information.

Thus in some natural situations, corrections have a semantic component that has not been taken into account in previous Grammatical Inference studies. Some interesting questions arise: What are the effects of corrections on learning syntax? Can corrections facilitate the language learning process? One of our long-term goals is to find a formal model that gives an account of this kind of correction and in which we can address these questions. Moreover, such a model might allow us to show that semantic information can simplify the problem of learning formal languages.

A simple computational model of semantics and context for language learning was proposed in (Angluin and Becerra-Bonache, 2008). This model accommodates two different tasks: comprehension and production. That paper focused only on the comprehension task and formulated the learning problem as follows. The teacher provides to the learner several example pairs consisting of a situation and an utterance denoting something in the situation; the goal of the learner is to learn the meaning function, allowing the learner to comprehend novel utterances. The results in that paper show that under certain assumptions, a simple algorithm can learn to comprehend an adult's utterance in the sense of producing the same sequence of predicates, even without mastering the adult's grammar. For example, receiving the utterance the blue square above the circle, the learner would be able to produce the sequence of predicates (bl, sq, ab, ci).

In this paper we focus on the production task, using sequential and subsequential transducers to model both comprehension and production. Adult production can be modeled as converting a sequence of predicates into an utterance, which can be done with access to the meaning transducer for the adult's language.

However, we do not assume that the child initially has access to the meaning transducer for the adult's language; instead we assume that the child's production progresses through different stages. Initially, child production is modeled as consisting of two different tasks: finding a correct sequence of predicates, and inverting the meaning function to produce a kind of "telegraphic speech".

For example, from (gr, tr, le, sq) the child may produce green triangle left square. Our goal is to model how the learner might move from this telegraphic speech to speech that is grammatical in the adult's sense. Moreover, we would like to find a formal framework in which corrections (in the form of expansions, for example, the green triangle to the left of the square) can be given to the child during the intermediate stages (before the learner is able to produce grammatically correct utterances) to study their effect on language learning.

We thus propose to model the problem of child language production as a machine translation problem, that is, as the task of translating a sequence of predicate symbols (representing the meaning of an utterance) into a corresponding utterance in a natural language. In this paper we explore the possibility of applying existing automata-theoretic approaches to machine translation to model language production. In Section 2, we describe the use of subsequential transducers for machine translation tasks and review the OSTIA algorithm to learn them (Oncina, 1991). In Section 3, we present our model of how the learner can move from telegraphic to adult speech. In Section 4, we present the results of experiments in the model made using OSTIA. Discussion of these results is presented in Section 5 and ideas for future work are in Section 6.

2 Learning Subsequential Transducers

Subsequential transducers (SSTs) are a formal model of translation widely studied in the literature. SSTs are deterministic finite state models that allow input-output mappings between languages. Each edge of an SST has an associated input symbol and output string. When an input string is accepted, an SST produces an output string that consists of the concatenation of the output substrings associated with the sequence of edges traversed, together with the substring associated with the last state reached by the input string. Several phenomena in natural languages can be easily represented by means of SSTs, for example, the different orders of noun and adjective in Spanish and English (e.g., un cuadrado rojo - a red square). Formal and detailed definitions can be found in (Berstel, 1979).
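As an illustration of this input-output behaviour, the following sketch applies an SST stored as plain dictionaries. The layout is an assumption chosen for clarity, not the representation used by OSTIA or by the cited systems: edge outputs are concatenated along the accepted path and the output attached to the final state is appended at the end.

```python
def sst_translate(sst, input_symbols):
    """sst = {'start': q0,
              'edges': {(state, input_symbol): (next_state, output_string)},
              'final': {state: output_string}}  -- assumed layout, for illustration."""
    state, output = sst['start'], []
    for symbol in input_symbols:
        if (state, symbol) not in sst['edges']:
            return None                          # input string not accepted
        state, out = sst['edges'][(state, symbol)]
        output.append(out)                       # concatenate edge outputs
    if state not in sst['final']:
        return None
    output.append(sst['final'][state])           # append the final-state output
    return ' '.join(piece for piece in output if piece)
```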

For any SST it is always possible to find an equivalent SST that has the output strings assigned to the edges and states so that they are as close to the initial state as they can be. This is called an Onward Subsequential Transducer (OST).

It has been proved that SSTs are learnable in the limit from a positive presentation of sentence pairs by an efficient algorithm called OSTIA (Onward Subsequential Transducer Inference Algorithm) (Oncina, 1991). OSTIA takes as input a finite training set of input-output pairs of sentences, and produces as output an OST that generalizes the training pairs. The algorithm proceeds as follows (this description is based on (Oncina, 1998); a schematic sketch of the first two steps is given after the list):

• A prefix tree representation of all the input sentences of the training set is built. Empty strings are assigned as output strings to both the internal nodes and the edges of this tree, and every output sentence of the training set is assigned to the corresponding leaf of the tree. The result is called a tree subsequential transducer.

• An onward tree subsequential transducer equivalent to the tree subsequential transducer is constructed by moving the longest common prefixes of the output strings, level by level, from the leaves of the tree towards the root.

• Starting from the root, all pairs of states of the onward tree subsequential transducer are considered in order, level by level, and are merged if possible (i.e., if the resulting transducer is subsequential and does not contradict any pair in the training set).
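The sketch below, under an assumed dictionary-based representation, illustrates only the first two steps (prefix-tree construction and the onward transformation); the state-merging step, which performs the actual generalization, is omitted. It is a simplified illustration, not Oncina's implementation.

```python
def common_prefix(seqs):
    """Longest common prefix of a collection of token sequences."""
    prefix = []
    for column in zip(*seqs):
        if all(tok == column[0] for tok in column):
            prefix.append(column[0])
        else:
            break
    return prefix

def tree_transducer(pairs):
    """pairs: iterable of (input_words, output_words).  States are input prefixes
    (tuples); tree edges initially carry empty outputs; each full training input
    gets its output sentence at its leaf state."""
    edge_out, state_out = {}, {}
    for x, y in pairs:
        for i, w in enumerate(x):
            edge_out.setdefault((tuple(x[:i]), w), [])
        state_out[tuple(x)] = list(y)
    return edge_out, state_out

def make_onward(edge_out, state_out, max_depth):
    """Move longest common prefixes of the outputs, level by level,
    from the leaves of the tree towards the root."""
    for depth in range(max_depth, 0, -1):
        for (src, w), out in list(edge_out.items()):
            if len(src) + 1 != depth:
                continue                          # edge does not enter this level
            dst = src + (w,)
            below = [edge_out[e] for e in edge_out if e[0] == dst]
            if dst in state_out:
                below.append(state_out[dst])
            p = common_prefix(below) if below else []
            if not p:
                continue
            for e in edge_out:                    # strip the prefix below dst ...
                if e[0] == dst:
                    edge_out[e] = edge_out[e][len(p):]
            if dst in state_out:
                state_out[dst] = state_out[dst][len(p):]
            edge_out[(src, w)] = out + p          # ... and push it up one level

# Toy usage on production-style pairs (predicate sequence -> Spanish words):
pairs = [(['ci', 'bl'], 'el circulo azul'.split()),
         (['sq', 're'], 'el cuadrado rojo'.split())]
edges, outs = tree_transducer(pairs)
make_onward(edges, outs, max(len(x) for x, _ in pairs))
```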

SSTs and OSTIA have been successfully applied to different translation tasks: Roman numerals to their decimal representations, numbers written in English to their Spanish spelling (Oncina, 1991) and Spanish sentences describing simple visual scenes to corresponding English and German sentences (Castellanos et al., 1994). They have also been applied to language understanding tasks (Castellanos et al., 1993; Pieraccini et al., 1993).

Moreover, several extensions of OSTIA have been introduced. For example, OSTIA-DR incorporates domain (input) and range (output) models in the learning process, allowing the algorithm to learn SSTs that accept only sentences compatible with the input model and produce only sentences compatible with the output model (Oncina and Varo, 1996). Experiments with a language understanding task gave better results with OSTIA-DR than with OSTIA (Castellanos et al., 1993). Another extension is DD-OSTIA (Oncina, 1998), which instead of considering a lexicographic order to merge states, uses a heuristic order based on a measure of the equivalence of the states. Experiments in (Oncina, 1998) show that better results can be obtained by using DD-OSTIA in certain translation tasks from Spanish to English.

3 From telegraphic to adult speech

To model how the learner can move from telegraphic speech to adult speech, we reduce this problem to a translation problem, in which the learner has to learn a mapping from sequences of predicates to utterances. As we have seen in the previous section, SSTs are an interesting approach to machine translation. Therefore, we explore the possibility of modeling language production using SSTs and OSTIA, to see whether they may provide a good framework to model corrections.

As described in (Angluin and Becerra-Bonache, 2008), after learning the meaning function the learner is able to assign correct meanings to utterances, and therefore, given a situation and an utterance that denotes something in the situation, the learner is able to point correctly to the object denoted by the utterance. To simplify the task we consider, we make two assumptions about the learner at the start of the production phase: (1) the learner's lexicon represents a correct meaning function and (2) the learner can generate correct sequences of predicates.

Therefore, in the initial stage of the production phase, the learner is able to produce a kind of "telegraphic speech" by inverting the lexicon constructed during the comprehension stage. For example, if the sequence of predicates is (bl, sq, ler, ci), and in the lexicon blue is mapped to bl, square to sq, right to ler and circle to ci, then by inverting this mapping, the learner would produce blue square right circle.
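A minimal sketch of this inversion step, using a toy lexicon restricted to the four predicates of the example above:

```python
# Toy lexicon (word -> predicate), inverted to produce telegraphic speech.
lexicon = {'blue': 'bl', 'square': 'sq', 'right': 'ler', 'circle': 'ci'}
inverse_lexicon = {pred: word for word, pred in lexicon.items()}

def telegraphic(predicates):
    # one word per predicate, in the same order as the predicate sequence
    return ' '.join(inverse_lexicon[p] for p in predicates)

print(telegraphic(['bl', 'sq', 'ler', 'ci']))    # -> blue square right circle
```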

In order to explore the capability of SSTs and OSTIA to model the next stage of language production (from telegraphic to adult speech), we take the training set to be input-output pairs each of which contains as input a sequence of predicates (e.g., (bl, sq, ler, ci)) and as output the corresponding utterance in a natural language (e.g., the blue square to the right of the circle). In this example, the learner must learn to include appropriate function words. In other languages, the learner may have to learn a correct choice of words determined by gender, case or other factors. (Note that we are not yet in a position to consider corrections.)

4 Experiments

Our experiments were made for a limited domain of geometric shapes and their properties and relations. This domain is a simplification of the Miniature Language Acquisition task proposed by Feldman et al. (Feldman et al., 1990). Previous applications of OSTIA to language understanding and machine translation have also used adaptations and extensions of the Feldman task.

In our experiments, we have predicates for three different shapes (circle (ci), square (sq) and triangle (tr)), three different colors (blue (bl), green (gr) and red (re)) and three different relations (to the left of (le), to the right of (ler), and above (ab)). We consider ten different natural languages: Arabic, English, Greek, Hebrew, Hindi, Hungarian, Mandarin, Russian, Spanish and Turkish.

We created a data sequence of input-output pairs, each consisting of a predicate sequence and a natural language utterance. For example, one pair for Spanish is ((ci, re, ler, tr), el circulo rojo a la derecha del triangulo). We ran OSTIA on initial segments of the sequence of pairs, of lengths 10, 20, 30, . . ., to produce a sequence of subsequential transducers. The whole data sequence was used to test the correctness of the transducers generated during the process. An error is counted whenever, given a data pair (x, y), the subsequential transducer translates x to y' and y' ≠ y. We say that OSTIA has converged to a correct transducer if all the transducers produced afterwards have the same number of states and edges, and 0 errors on the whole data sequence.
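The experimental loop just described can be summarised as follows. The learn, translate and size_of arguments are hypothetical stand-ins for the OSTIA (or DD-OSTIA) learner, the learned transducer's translation function and a (#states, #edges) accessor; the actual programs used in the paper are Prof. Oncina's (see the Acknowledgments).

```python
def convergence_point(data, learn, translate, size_of, step=10):
    """data: full list of (input, output) pairs.  Returns the smallest
    training-prefix length after which every later transducer has the same
    number of states and edges and zero errors on the whole data sequence."""
    def errors(t):
        return sum(1 for x, y in data if translate(t, x) != y)
    runs = []
    for n in range(step, len(data) + 1, step):
        t = learn(data[:n])                            # hypothetical learner call
        runs.append((n, size_of(t), errors(t)))
    for i, (n, size, err) in enumerate(runs):
        if err == 0 and all(s == size and e == 0 for _, s, e in runs[i:]):
            return n
    return None                                        # no convergence observed
```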

To generate the sequences of input-output pairs, for each language we constructed a meaning transducer capable of producing the 444 different possible meanings involving one or two objects. We randomly generated 400 unique (non-repeated) input-output pairs for each language. This process was repeated 10 times. In addition, to investigate the effect of the order of presentation of the input-output pairs, we repeated the data generation process for each language, sorting the pairs according to a length-lex ordering of the utterances.

We give some examples to illustrate the transducers produced. Figure 1 shows an example of a transducer produced by OSTIA after just ten pairs of input-output examples for Spanish. This transducer correctly translates the ten predicate sequences used to construct it, but the data is not sufficient for OSTIA to generalize correctly in all cases, and many other correct meanings are still incorrectly translated. For example, the sequence (ci, bl) is translated as el circulo a la izquierda del circulo verde azul instead of el circulo azul.

The transducers produced after convergence by OSTIA and DD-OSTIA correctly translate all 444 possible correct meanings. Examples for Spanish are shown in Figure 2 (OSTIA) and Figure 3 (DD-OSTIA). Note that although they correctly translate all 444 correct meanings, the behavior of these two transducers on other (incorrect) predicate sequences is different, for example on (tr, tr).

Figure 1: Production task, OSTIA. A transducer produced using 10 random unique input-output pairs (predicate sequence, utterance) for Spanish.

Figure 2: Production task, OSTIA. A transducer produced (after convergence) using random unique input-output pairs (predicate sequence, utterance) for Spanish.

Different languages required very different numbers of data pairs to converge. Statistics on the number of pairs needed until convergence of OSTIA for all ten languages, for both random unique and random unique sorted data sequences, are shown in Table 1. Because better results were reported using DD-OSTIA in machine translation tasks (Oncina, 1998), we also tried using DD-OSTIA for learning to translate a sequence of predicates to an utterance. We used the same sequences of input-output pairs as in the previous experiment. The results obtained are shown in Table 2.

Figure 3: Production task, DD-OSTIA. A transducer produced (after convergence) using random unique input-output pairs (predicate sequence, utterance) for Spanish.

Table 1: Production task, OSTIA. The entries give the median number of input-output pairs until convergence in 10 runs. For Greek, Hindi and Hungarian, the median for the unsorted case is calculated using all 444 random unique pairs, instead of 400.

Language     # Pairs    # Sorted Pairs
Arabic       150        200
English      200        235
Greek        375        400
Hebrew       195        30
Hindi        380        350
Hungarian    365        395
Mandarin     45         150
Russian      270        210
Spanish      190        150
Turkish      185        80

We also report the sizes of the transducers learned by OSTIA and DD-OSTIA. Table 3 and Table 4 show the numbers of states and edges of the transducers after convergence for each language. In case of disagreements, the number reported is the mode.

To answer the question of whether production is harder than comprehension in this setting, we also considered the comprehension task, that is, to translate an utterance in a natural language into the corresponding sequence of predicates.

Table 2: Production task, DD-OSTIA. The entries give the median number of input-output pairs until convergence in 10 runs. For Greek, Hindi and Hungarian, the median for the unsorted case is calculated using all 444 random unique pairs, instead of 400.

Language     # Pairs    # Sorted Pairs
Arabic       80         140
English      85         180
Greek        350        400
Hebrew       65         80
Hindi        175        120
Hungarian    245        140
Mandarin     40         150
Russian      185        210
Spanish      80         150
Turkish      50         40

Table 3: Production task, OSTIA. Sizes of transducers at convergence.

Language     # states   # edges
Arabic       2          20
English      2          20
Greek        9          65
Hebrew       2          20
Hindi        7          58
Hungarian    3          20
Mandarin     1          10
Russian      3          30
Spanish      2          20
Turkish      4          31

The comprehension task was studied by Oncina et al. (Castellanos et al., 1993). They used English sentences, with a more complex version of the Feldman task domain and more complex semantic representations than we use. Our results are presented in Table 5. The number of states and edges of the transducers after convergence is shown in Table 6.

5 Discussion

It should be noted that because the transducers output by OSTIA and DD-OSTIA correctly reproduce all the pairs used to construct them, once either algorithm has seen all 444 possible data pairs in either the production or the comprehension task, the resulting transducers will correctly translate all correct inputs. However, state-merging in the algorithms induces compression and generalization, and the interesting questions are how much data is required to achieve correct generalization, and how that quantity scales with the complexity of the task. These are very difficult questions to approach analytically, but empirical results can offer valuable insights.

Table 4: Production task, DD-OSTIA. Sizes of transducers at convergence.

Language     # states   # edges
Arabic       2          17
English      2          16
Greek        9          45
Hebrew       2          13
Hindi        7          40
Hungarian    3          20
Mandarin     1          10
Russian      3          23
Spanish      2          13
Turkish      3          18

Table 5: Comprehension task, OSTIA and DD-OSTIA. Median number (in 10 runs) of input-output pairs until convergence using a sequence of 400 random unique pairs of (utterance, predicate sequence).

Language     OSTIA      DD-OSTIA
Arabic       65         65
English      60         20
Greek        325        60
Hebrew       90         45
Hindi        60         35
Hungarian    40         45
Mandarin     60         40
Russian      280        55
Spanish      45         30
Turkish      60         35

Considering the comprehension task (Tables 5 and 6), we see that OSTIA generalizes correctly from at most 15% of all 444 possible pairs except in the cases of Greek, Hebrew and Russian. DD-OSTIA improves the OSTIA results, in some cases dramatically, for all languages except Hungarian. DD-OSTIA achieves correct generalization from at most 15% of all possible pairs for all ten languages. Because the meaning function for all ten language transducers is independent of the state, in each case there is a 1-state sequential transducer that achieves correct translation of correct utterances into predicate sequences. OSTIA and DD-OSTIA converged to 1-state transducers for all languages except Greek, for which they converged to 2-state transducers. Examining one such transducer for Greek, we found that the requirement that the transducer be "onward" necessitated two states. These results are broadly compatible with the results obtained by Oncina et al. (Castellanos et al., 1993) on language understanding; the more complex tasks they consider also give evidence that this approach may scale well for the comprehension task.

Table 6: Comprehension task, OSTIA and DD-OSTIA. Sizes of transducers at convergence using 400 random unique input-output pairs (utterance, predicate sequence). In cases of disagreement, the number reported is the mode.

Language     # states   # edges
Arabic       1          15
English      1          13
Greek        2          25
Hebrew       1          13
Hindi        1          13
Hungarian    1          14
Mandarin     1          17
Russian      1          24
Spanish      1          14
Turkish      1          13

Turning to the production task (Tables 1, 2, 3 and 4), we see that providing the random samples with a length-lex ordering of utterances has inconsistent effects for both OSTIA and DD-OSTIA, sometimes dramatically increasing or decreasing the number of samples required. We do not further consider the sorted samples.

Comparing the production task with the comprehension task for OSTIA, the production task generally requires substantially more random unique samples than the comprehension task for the same language. The exceptions are Mandarin (production: 45 and comprehension: 60) and Russian (production: 270 and comprehension: 280). For DD-OSTIA the results are similar, with the sole exception of Mandarin (production: 40 and comprehension: 40). For the production task DD-OSTIA requires fewer (sometimes dramatically fewer) samples to converge than OSTIA. However, even with DD-OSTIA the number of samples is in several cases (Greek, Hindi, Hungarian and Russian) a rather large fraction (40% or more) of all 444 possible pairs. Further experimentation and analysis is required to determine how these results will scale.

Looking at the sizes of the transducers learned by OSTIA and DD-OSTIA in the production task, we see that the numbers of states agree for all languages except Turkish. (Recall from our discussion in Section 4 that there may be differences in the behavior of the transducers learned by OSTIA and DD-OSTIA at convergence.) For the production task, Mandarin gives the smallest transducer; for this fragment of the language, the translation of correct predicate sequences into utterances can be achieved with a 1-state transducer. In contrast, English and Spanish both require 2 states to handle articles correctly. For example, in the transducer in Figure 3, the predicate for a circle (ci) is translated as el circulo if it occurs as the first object (in state 1) and as circulo if it occurs as the second object (in state 2), because del has been supplied by the translation of the intervening binary relation (le, ler, or ab). Greek gives the largest transducer for the production task, with 9 states, and requires the largest number of samples for DD-OSTIA to achieve convergence, and one of the largest numbers of samples for OSTIA. Despite the evidence of the extremes of Mandarin and Greek, the relation between the size of the transducer for a language and the number of samples required to converge to it is not monotonic.

In our model, one reason that learning the production task may in general be more difficult than learning the comprehension task is that while the mapping of a word to a predicate does not depend on context, the mapping of a predicate to a word or words does (except in the case of our Mandarin transducer). As an example, in the comprehension task the Russian words triugolnik, triugolnika and triugonikom are each mapped to the predicate tr, but the reverse mapping must be sensitive to the context of the occurrence of tr.

These results suggest that OSTIA or DD-OSTIA may be an effective method to learn to translate sequences of predicates into natural language utterances in our domain. However, some of our objectives seem incompatible with the properties of OSTIA. In particular, it is not clear how to incorporate the learner's initial knowledge of the lexicon and ability to produce "telegraphic speech" by inverting the lexicon. Also, the intermediate results of the learning process do not seem to have the properties we expect of a learner who is progressing towards mastery of production. That is, the intermediate transducers perfectly translate the predicate sequences used to construct them, but typically produce other translations that the learner (using the lexicon) would know to be incorrect. For example, the intermediate transducer from Figure 1 translates the predicate sequence (ci) as el circulo a la izquierda del circulo verde, which the learner's lexicon indicates should be translated as (ci, le, ci, gr).

6 Future work

Further experiments and analysis are required to understand how these results will scale with larger domains and languages. In this connection, it may be interesting to try the experiments of (Castellanos et al., 1993) in the reverse (production) direction. Finding a way to incorporate the learner's initial lexicon seems important. Perhaps by incorporating the learner's knowledge of the input domain (the legal sequences of predicates) and using the domain-aware version, OSTIA-D, the intermediate results in the learning process would be more compatible with our modeling objectives. Coping with errors will be necessary; perhaps an explicitly statistical framework for machine translation should be considered.

If we can find an appropriate model of how the learner's language production process might evolve, then we will be in a position to model meaning-preserving corrections. That is, the learner chooses a sequence of predicates and maps it to a (flawed) utterance. Despite its flaws, the learner's utterance is understood by the teacher (i.e., the teacher is able to map it to the sequence of predicates chosen by the learner), who responds with a correction, that is, a correct utterance for that meaning. The learner, recognizing that the teacher's utterance has the same meaning but a different form, then uses the correct utterance (as well as the meaning and the incorrect utterance) to improve the mapping of sequences of predicates to utterances.

It is clear that in this model, corrections are not necessary to the process of learning comprehension and production; once the learner has a correct lexicon, the utterances of the teacher can be translated into sequences of predicates, and the pairs of (predicate sequence, utterance) can be used to learn (via an appropriate variant of OSTIA) a perfect production mapping. However, it seems very likely that corrections can make the process of learning a production mapping easier or faster, and finding a model in which such phenomena can be studied remains an important goal of this work.

7 Acknowledgments

The authors sincerely thank Prof. Jose Oncina for the use of his programs for OSTIA and DD-OSTIA, as well as his helpful and generous advice. The research of Leonor Becerra-Bonache was supported by a Marie Curie International Fellowship within the 6th European Community Framework Programme.

References

Dana Angluin and Leonor Becerra-Bonache. 2008. Learning Meaning Before Syntax. ICGI, 281–292.

Leonor Becerra-Bonache, Colin de la Higuera, J.C. Janodet, and Frederic Tantini. 2007. Learning Balls of Strings with Correction Queries. ECML, 18–29.

Leonor Becerra-Bonache, Adrian H. Dediu, and Cristina Tirnauca. 2006. Learning DFA from Correction and Equivalence Queries. ICGI, 281–292.

Jean Berstel. 1979. Transductions and Context-Free Languages. Teubner, Stuttgart.

Antonio Castellanos, Enrique Vidal, and Jose Oncina. 1993. Language Understanding and Subsequential Transducers. ICGI, 11/1–11/10.

Antonio Castellanos, Ismael Galiano, and Enrique Vidal. 1994. Applications of OSTIA to machine translation tasks. ICGI, 93–105.

Michelle M. Chouinard and Eve V. Clark. 2003. Adult Reformulations of Child Errors as Negative Evidence. Journal of Child Language, 30:637–669.

Jerome A. Feldman, George Lakoff, Andreas Stolcke, and Susan Hollback Weber. 1990. Miniature Language Acquisition: A touchstone for cognitive science. Technical Report TR-90-009, International Computer Science Institute, Berkeley, California, April 1990.

Efim Kinber. 2008. On Learning Regular Expressions and Patterns Via Membership and Correction Queries. ICGI, 125–138.

Jose Oncina. 1991. Aprendizaje de lenguajes regulares y transducciones subsecuenciales. PhD Thesis, Universitat Politecnica de Valencia, Valencia, Spain.

Jose Oncina. 1998. The data driven approach applied to the OSTIA algorithm. ICGI, 50–56.

Jose Oncina and Miguel Angel Varo. 1996. Using domain information during the learning of a subsequential transducer. ICGI, 301–312.

Roberto Pieraccini, Esther Levin, and Enrique Vidal. 1993. Learning how to understand language. EuroSpeech'93, 448–458.




GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI)

Jorge González and Francisco Casacuberta
Departamento de Sistemas Informáticos y Computación

Instituto Tecnológico de Informática
Universidad Politécnica de Valencia

{jgonzalez,fcn}@dsic.upv.es

Abstract

GREAT is a finite-state toolkit which is devoted to Machine Translation and that learns structured models from bilingual data. The training procedure is based on grammatical inference techniques to obtain stochastic transducers that model both the structure of the languages and the relationship between them. The inference of grammars from natural language causes the models to become larger when a less restrictive task is involved, even more so if a bilingual modelling is being considered. GREAT successfully implements the GIATI learning methodology, addressing several scalability issues in order to deal with corpora of high data volume. This is reported with experiments on the EuroParl corpus, which is a state-of-the-art task in Statistical Machine Translation.

1 Introduction

Over the last years, grammatical inference techniques have not been widely employed in the machine translation area. Nevertheless, researchers are trying to include structured information in their models in order to capture the grammatical regularities that exist in languages, together with the relationships between them.

GIATI (Casacuberta, 2000; Casacuberta et al., 2005) is a grammatical inference methodology to infer stochastic transducers in a bilingual modelling approach for statistical machine translation.

From a statistical point of view, the translation problem can be stated as follows: given a source sentence $s = s_1 \ldots s_J$, the goal is to find a target sentence $\hat{t} = t_1 \ldots t_I$, among all possible target strings $t$, that maximises the posterior probability:

$$\hat{t} = \operatorname*{argmax}_{t} \; \Pr(t \mid s) \qquad (1)$$

The conditional probability $\Pr(t \mid s)$ can be replaced by a joint probability distribution $\Pr(s, t)$, which is modelled by a stochastic transducer being inferred through the GIATI methodology (Casacuberta et al., 2004; Casacuberta and Vidal, 2004):

$$\hat{t} = \operatorname*{argmax}_{t} \; \Pr(s, t) \qquad (2)$$

This paper describes GREAT, a software package for bilingual modelling from parallel corpora.

GREAT is a finite-state toolkit which was born to overcome the computational problems that previous implementations of GIATI (Picó, 2005) had in practice when huge amounts of data were used. Moreover, GREAT is the result of a very meticulous study of GIATI models, which improves the treatment of smoothing transitions at decoding time, and which also reduces the time required to translate an input sentence by means of an analysis that depends on the granularity of the symbols.

Experiments for a state-of-the-art, voluminous translation task, such as the EuroParl, are reported. In (González and Casacuberta, 2007), the so-called phrase-based finite-state transducers were concluded to be a better modelling option for this task than the ones that derive from a word-based approach. That is why the experiments here are exclusively related to this particular kind of GIATI-based transducers.

The structure of this work is as follows: first, section 2 is devoted to describing the training procedure, which is in turn divided into several parts; for instance, the finite-state GIATI-based models are defined and their corresponding grammatical inference methods are described, including the techniques to deal with tasks of high volume of data; then, section 3 is related to the decoding process, which includes an improved smoothing behaviour and an analysis algorithm that performs according to the granularity of the bilingual symbols in the models; to continue, section 4 deals with an exhaustive report on experiments; and finally, the conclusions are stated in the last section.

2 Finite state models

A stochastic finite-state automaton A is a tuple (Γ, Q, i, f, P), where Γ is an alphabet of symbols, Q is a finite set of states, functions i: Q → [0, 1] and f: Q → [0, 1] refer to the probability of each state being, respectively, initial and final, and the partial function P: Q × {Γ ∪ ε} × Q → [0, 1] defines a set of transitions between pairs of states in such a way that each transition is labelled with a symbol from Γ (or the empty string ε), and is assigned a probability. Moreover, the functions i, f, and P have to respect the consistency property in order to define a distribution of probabilities on the free monoid. Consistent probability distributions can be obtained by requiring a series of local constraints which are similar to the ones for stochastic regular grammars (Vidal et al., 2005):

• $\sum_{q \in Q} i(q) = 1$

• $\forall q \in Q: \quad \sum_{\gamma \in \{\Gamma \cup \varepsilon\},\, q' \in Q} P(q, \gamma, q') + f(q) = 1$
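As a small illustration, these two local constraints can be checked mechanically. The dictionary layout below is an assumption chosen for clarity and is not GREAT's internal representation.

```python
def is_consistent(initial, final, trans, tol=1e-9):
    """initial: {state: i(q)}, final: {state: f(q)},
    trans: {state: [(symbol_or_None, next_state, prob), ...]} (None = epsilon)."""
    # first constraint: the initial-state probabilities sum to 1
    if abs(sum(initial.values()) - 1.0) > tol:
        return False
    # second constraint: for every state, outgoing mass plus final mass is 1
    states = set(initial) | set(final) | set(trans)
    for q in states:
        mass = final.get(q, 0.0) + sum(p for _, _, p in trans.get(q, []))
        if abs(mass - 1.0) > tol:
            return False
    return True
```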

A stochastic finite-state transducer is defined similarly to a stochastic finite-state automaton, with the difference that transitions between states are labelled with pairs of symbols that belong to two different (input and output) alphabets, that is, (Σ ∪ ε) × (∆ ∪ ε). Then, given some input and output strings, s and t, a stochastic finite-state transducer is able to associate them a joint probability Pr(s, t). An example of a stochastic finite-state transducer can be observed in Figure 1.

Figure 1: A stochastic finite-state transducer

2.1 Inference of stochastic transducers

The GIATI methodology (Casacuberta et al., 2005) has been revealed as an interesting approach to infer stochastic finite-state transducers through the modelling of languages. Rather than learning translations, GIATI first converts every pair of parallel sentences in the training corpus into a corresponding extended-symbol string in order to, straight afterwards, infer a language model from it.

More concretely, given a parallel corpus consisting of a finite sample C of string pairs: first, each training pair (x, y) ∈ Σ⋆ × ∆⋆ is transformed into a string z ∈ Γ⋆ from an extended alphabet, yielding a string corpus S; then, a stochastic finite-state automaton A is inferred from S; finally, transition labels in A are turned back into pairs of strings of source/target symbols in Σ⋆ × ∆⋆, thus converting the automaton A into a transducer T.

The first transformation is modelled by some labelling function L: Σ⋆ × ∆⋆ → Γ⋆, while the last transformation is defined by an inverse labelling function Λ(·), such that Λ(L(C)) = C. Building a corpus of extended symbols from the original bilingual corpus allows for the use of many useful algorithms for learning stochastic finite-state automata (or equivalent models) that have been proposed in the literature on grammatical inference.
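The overall pipeline can therefore be sketched in three steps. In the sketch below the labelling function L, the automaton learner and the inverse labelling Λ are passed in as parameters, since GIATI leaves their concrete choice open (for instance, the phrase-based n-gram instantiation of Section 2.2); the function names are illustrative, not GREAT's API.

```python
def giati(parallel_corpus, labelling, infer_automaton, inverse_labelling):
    # 1) L: turn every training pair (x, y) into an extended-symbol string z
    string_corpus = [labelling(x, y) for x, y in parallel_corpus]
    # 2) infer a stochastic finite-state automaton (e.g. an n-gram model) from S
    automaton = infer_automaton(string_corpus)
    # 3) Lambda: turn the extended symbols on the transitions back into
    #    (source string, target string) pairs, yielding a transducer T
    return inverse_labelling(automaton)
```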

2.2 Phrase-based n-gram transducers

Phrase-based n-gram transducers represent an interesting application of the GIATI methodology, where the extended symbols are actually bilingual phrase pairs, and n-gram models are employed as language models (González et al., 2008). Figure 2 shows a general scheme for the representation of n-grams through stochastic finite state automata.

Figure 2: A finite-state n-gram model



The states in the model refer to all the n-gram histories that have been seen in the string corpus S at training time. Consuming transitions jump from states in a given layer to the one immediately above, increasing the history level. Once the top level has been reached, n-gram transitions allow for movements inside this layer, from state to state, updating the history to the last n − 1 seen events.

Given that an n-gram event $\Gamma_{n-1}\Gamma_{n-2}\ldots\Gamma_2\Gamma_1\Gamma_0$ is statistically stated as $\Pr(\Gamma_0 \mid \Gamma_{n-1}\Gamma_{n-2}\ldots\Gamma_2\Gamma_1)$, it is appropriately represented as a finite state transition between the corresponding up-to-date histories, which are associated to some states (see Figure 3).

Figure 3: Finite-state representation of n-grams
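A minimal sketch of this correspondence, over an assumed dictionary of transitions (illustrative only): the source state holds the last n − 1 symbols, the transition consumes the extended symbol, and the destination state holds the updated history.

```python
def add_ngram(transitions, ngram, prob):
    """ngram = (G_{n-1}, ..., G_1, G_0), with prob = P(G_0 | G_{n-1} ... G_1)."""
    *history, gamma0 = ngram
    src = tuple(history)                       # state holding G_{n-1} ... G_2 G_1
    dst = tuple(history[1:]) + (gamma0,)       # updated history G_{n-2} ... G_1 G_0
    transitions[(src, gamma0)] = (dst, prob)   # consume the extended symbol G_0
    return transitions
```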

Therefore, transitions are labelled with a symbol from Γ, and every extended symbol in Γ is a translation pair coming from a phrase-based dictionary which is inferred from the parallel corpus.

Nevertheless, backoff transitions to lower history levels are taken for smoothing purposes. If the lowest level is reached and no transition has been found for the next word $s_j$, then a transition to the <unk> state is fired, thus considering $s_j$ as a non-starting word for any bilingual phrase in the model. There is only one initial state, which is denoted as <s>, and it is placed at the first history level.

The inverse labelling function is applied over the automaton transitions as in Figure 4, obtaining a single transducer (Casacuberta and Vidal, 2004).

Figure 4: Phrase-based inverse labelling function. A transition from Q to Q' labelled with the bilingual phrase pair (On demande une activité, Action is required) and probability p is expanded into a chain of word-level transitions On/ε, demande/ε, une/ε, activité/Action is required through intermediate states.

Intermediate states are artificially created since they do not belong to the original automaton model. As a result, they are non-final states, with only one incoming and one outgoing edge each.

2.3 Transducer pruning via n-gram events

GREAT implements this pruning technique, which is inspired by some other statistical machine translation decoders that usually filter their phrase-based translation dictionaries by means of the words in the test set sentences (Koehn et al., 2007).

As already seen in Figure 3, any n-gram event is represented as a transition between its corresponding historical states. In order to be able to navigate through this transition, the analysis must have reached the $\Gamma_{n-1}\ldots\Gamma_3\Gamma_2\Gamma_1$ state and the remaining input must fit the source elements of $\Gamma_0$. In other words, the full source sequence from the n-gram event $\Gamma_{n-1}\ldots\Gamma_3\Gamma_2\Gamma_1\Gamma_0$ has to be present in the test set. Otherwise, its corresponding transition will not be able to be employed during the analysis of the test set sentences. As a result, n-gram events that are not in the test set can skip their transition generation, since they will not be affected during decoding time, thus reducing the size of the model. If there is also a backoff probability associated to the same n-gram event, its corresponding transition generation can be skipped too, since its source state will never be reached, as it is the state which represents the n-gram event. Nevertheless, since trained extended-symbol n-gram events would typically include more than n source words, verifying their presence or absence inside the test set would imply hashing all the test-set word sequences, which is rather impractical. Instead, a window size is used to hash the words in the test set; then the trained n-gram events are scanned on their source sequence using this window size to check whether they may be skipped or not. It should be clear that the bigger the window size is, the more n-gram rejections there will be, and therefore the smaller the transducer will be. However, the translation results will not be affected, as these disappearing transitions are unreachable using that test set. As the window size increases, the resulting filtered transducer gets closer to the minimum transducer that reflects the test set sentences.
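A simple, conservative variant of this filter can be sketched as follows (illustrative names, not GREAT's actual interface): every window of w consecutive test-set words is hashed, and a trained n-gram event is kept only if all length-w windows of its source word sequence occur in that hash; sources shorter than the window are kept, to stay on the safe side.

```python
def test_windows(test_sentences, w):
    """Hash every window of w consecutive words that occurs in the test set."""
    seen = set()
    for sentence in test_sentences:
        for i in range(len(sentence) - w + 1):
            seen.add(tuple(sentence[i:i + w]))
    return seen

def keep_event(source_words, seen, w):
    """Keep an n-gram event only if every length-w window of its source word
    sequence also occurs in the test set (short sources are kept conservatively)."""
    if len(source_words) < w:
        return True
    return all(tuple(source_words[i:i + w]) in seen
               for i in range(len(source_words) - w + 1))
```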

3 Finite state decoding

Equation 2 expresses the MT problem in terms of a finite state model that is able to compute the expression $\Pr(s, t)$. Given that only the input sentence is known, the model has to be parsed, taking into account all possible $t$ that are compatible with $s$. The best output hypothesis $\hat{t}$ would be the one which corresponds to a path through the transduction model that, with the highest probability, accepts the input sequence as part of the input language of the transducer.

Although the navigation through the model is constrained by the input sentence, the search space can be extremely large. As a consequence, only the highest-scoring partial hypotheses are considered as candidates to become the solution. This search is carried out very efficiently by a beam-search variant of the well-known Viterbi algorithm (Jelinek, 1998), whose asymptotic time cost is Θ(J · |Q| · M), where J is the length of the input sentence, Q is the set of states, and M is the average number of incoming transitions per state.

3.1 Parsing strategies: from words to phrases

The trellis structure that is commonly employed for the analysis of an input sentence through a stochastic finite state transducer has a variable size that depends on the beam factor in a dynamic beam-search strategy. That way, only those nodes scoring within a predefined threshold of the best one in every stage will be considered for the next stage.

A word-based parsing strategy would start with the initial state <s>, looking for the best transitions that are compatible with the first word s_1. The corresponding target states are then placed into the output structure, which will be used for the analysis of the second word s_2. Iteratively, every state in the structure is scanned in order to get the input labels that match the current analysis word s_i, and then to build an output structure with the best-scored partial paths. Finally, the states that result from the last analysis step are rescored by their corresponding final probabilities.
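As an illustration, a minimal sketch of one stage of this word-based analysis (the transducer interface, the edge format and the beam handling are assumptions, not GREAT's API):

    def word_based_step(stage, word, transducer, beam):
        # stage: dict {state: best score of a partial path reaching it}.
        # transducer.edges(state) is assumed to yield (source_word, target, prob).
        nxt = {}
        for state, score in stage.items():
            for src, target, prob in transducer.edges(state):
                if src == word:                      # edge compatible with the input word
                    s = score * prob
                    if s > nxt.get(target, 0.0):
                        nxt[target] = s
        if not nxt:
            return nxt
        best = max(nxt.values())
        # dynamic beam: keep only nodes scoring within a factor `beam` of the best one
        return {q: s for q, s in nxt.items() if s * beam >= best}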

This is the standard algorithm for parsing a source sentence through a non-deterministic stochastic finite state transducer. Nevertheless, it may not be the most appropriate one when dealing with this type of phrase-based n-gram transducers.

As can be observed in Figure 4, a set of consecutive transitions represents only one phrase translation probability after a given history. In fact, the path from Q to Q′ should only be followed if the remaining input sentence, which has not been analysed yet, begins with the full input sequence On demande une activité. Otherwise, it should not be taken into account. However, as long as the words in the test sentence are compatible with the corresponding transitions, and according to the phrase score, this (word) synchronous parsing algorithm may store these intermediate states in the trellis structure, even if the full path is never completed in the end. As a consequence, these entries occupy valuable positions inside the trellis structure to no useful end. This is not only a waste of time, but also a distortion of the best score per stage, reducing the effective power of the beam parameter during decoding. Other, better analysis options may be rejected because of their a-priori lower score. Therefore, this decoding algorithm can lead the system to a worse translation result. Alternatively, the beam factor can be increased so that it is large enough to store the successful paths, but then more time will be required for the decoding of any input sentence.

On the other hand, a phrase-based analysis strategy would never include intermediate states inside a trellis structure. Instead, these artificial states are parsed through until an original state is reached, i.e. Q′ in Figure 4. Word-based and phrase-based analyses are conceptually compared in Figure 5, by means of their respective edge generation on the trellis structure.

Figure 5: Word-based and phrase-based analysis (edge generation on the trellis structure for the segment On demande une activité, from Q to Q′: word-based edges advance one word at a time through the intermediate states, whereas a single phrase-based edge goes directly from Q to Q′).

However, in order to be able to use a scrolled two-stage implementation of a Viterbi phrase-based analysis, the target states, which may be positioned several stages away from the current one, are directly advanced to the next one. Therefore, the nodes in the trellis must be stored together with the last input position that was parsed to reach them. In the same manner, states in the structure are only scanned if their position indicator is lower than the current analysis word; otherwise, they have already taken it into account, so they are directly transferred to the next stage. The algorithm remains synchronous with the words in the input sentence; however, on this particular occasion, states in the i-th step of analysis are guaranteed to have parsed at least until the i-th word, but they may have gone further. Figure 6 gives a graphical diagram of this concept.

Figure 6: A phrase-based analysis implementation (a state Qi at stage i advances directly to states Q′j, where j is the input position reached after consuming the phrase On demande une activité).
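A sketch of the corresponding phrase-based stage, where each trellis entry carries the last parsed input position and entries that are already ahead of the current word are simply transferred (again, the data structures are assumptions for illustration only):

    def phrase_based_step(stage, i, words, transducer, beam):
        # stage: dict {(state, pos): score}, where pos is the index of the next
        # input word still to be parsed by that entry.
        nxt = {}
        for (state, pos), score in stage.items():
            if pos > i:
                # this entry already consumed word i inside a longer phrase:
                # transfer it unchanged to the next stage
                if score > nxt.get((state, pos), 0.0):
                    nxt[(state, pos)] = score
                continue
            # expand with every phrase edge whose source side matches the input at i
            for src_phrase, target, prob in transducer.phrase_edges(state):
                n = len(src_phrase)
                if tuple(words[i:i + n]) == tuple(src_phrase):
                    s = score * prob
                    key = (target, i + n)
                    if s > nxt.get(key, 0.0):
                        nxt[key] = s
        if not nxt:
            return nxt
        best = max(nxt.values())
        return {k: s for k, s in nxt.items() if s * beam >= best}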

Moreover, all the states that are stored in the successive stages, that is, the ones from the original topology of the finite-state representation of the n-gram model, are also guaranteed to lead to a final state in the model: if they are not final states themselves, then there will always be a successful path towards a final state.

GREAT incorporates an analysis strategy that depends on the granularity of the bilingual symbols themselves, so that phrase-based decoding is applied when a phrase-based transducer is used.

3.2 Backoff smoothing

Two smoothing criteria have been explored in order to parse the input through the GIATI model. First, a standard backoff behaviour was implemented, where backoff transitions are taken as failure transitions. There, backoff transitions are only followed if no other successful path is compatible with the remaining input.

However, GREAT also includes another, more refined smoothing behaviour, to be applied over the same bilingual n-gram transducers, where smoothing edges are interpreted in a different way.

GREAT suggests applying the backoff criterion according to its definition in the grammatical inference method that incorporated it into the language model being learnt, which will then be represented as a stochastic finite-state automaton. In other words, from the n-gram point of view, backoff weights (or finite-state transitions) should only be employed if no transitions are found in the n-gram automaton for the current bilingual symbol. Nevertheless, input words in translation applications do not belong to those bilingual languages. Instead, input sequences have to be analysed as if they could internally represent any possible bilingual symbol from the extended vocabulary that matches their source sides. That way, bilingual symbols are considered to be a sort of input, so the backoff smoothing criterion is then applied to each compatible bilingual symbol.

For phrase-based transducers, this means that for a successful transition (x, y), there is no need to back off and find other paths consuming that bilingual symbol, but we must try backoff transitions in order to look for any other successful transition (x′, y′) that is also compatible with the current position.

Conceptually, this procedure is as if the input sentence, rather than a source string, were actually a left-to-right bilingual graph, built from the expansion of every input word into its compatible bilingual symbols, as in a category-based approach. Phrase-based bilingual symbols would be graphically represented as a sort of skip transitions inside this bilingual input graph.

This new interpretation of the backoff smoothing weights on bilingual n-gram models, which is a priori not a trivial feature to include, is easily implemented for stochastic transducers by considering backoff transitions as ε/ε transitions and keeping track of a dynamic list of forbidden states every time a backoff transition is taken.

An outline of the management of state activeness, which is integrated into the parsing algorithm, is shown below:

ALGORITHM
  for Q in {states to explore}
    (a) for Q-Q' in {transitions}
          if Q' is active
            [...]
            set Q' to inactive
        if Q is not NULL
          if Q not in the top level
            for Q' in {inactive states}
              set Q' to active
              Q'' := backoff(Q')
              set Q'' to inactive
            Q := backoff(Q)
            GoTo (a)
          else
            [...]
            for Q' in {inactive states}
              set Q' to active
        [...]
END ALGORITHM


The algorithm will try to translate several consecutive input words as a whole phrase, always allowing a backoff transition in order to cover all the compatible phrases in the model, not only the ones that have been seen after a given history, but also those seen after all its suffixes. A dynamic list of forbidden states takes care of an exploration constraint that has to be included in the parsing algorithm: a path between two states Q and Q′ has to be traced through the minimum number of backoff transitions; any other Q-Q′ or Q-Q′′ paths, where Q′′ is the destination of a Q′-Q′′ backoff path, should be ignored. This constraint ensures that only one transition per bilingual symbol is followed, namely the highest one in the hierarchy of history levels. Figure 7 shows a parsing example over a finite-state representation of a smoothed bigram model.
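The following minimal sketch illustrates the forbidden-state bookkeeping of the outline above for the expansion of a single state (the transducer interface — transitions, backoff, backoff_weight, source_side — is an assumption made for the example, not GREAT's implementation):

    def expand_state(q, position, words, model):
        # Enumerate transitions leaving q that are compatible with the input at
        # `position`, following backoff (epsilon) edges while forbidding, at each
        # lower level, the states whose bilingual symbol was already matched at a
        # higher history level.
        hypotheses = []
        forbidden = set()          # states covered one level above
        weight = 1.0               # accumulated backoff weight
        while q is not None:
            for symbol, (target, prob) in model.transitions(q).items():
                if target in forbidden:
                    continue       # same bilingual symbol was found at a higher level
                src = tuple(model.source_side(symbol))
                if tuple(words[position:position + len(src)]) == src:
                    hypotheses.append((target, symbol, weight * prob))
                    lower = model.backoff(target)
                    if lower is not None:
                        forbidden.add(lower)   # block the lower-level duplicate
            weight *= model.backoff_weight(q)
            q = model.backoff(q)               # follow the epsilon/backoff edge
        return hypotheses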

Figure 7: Compatible edges for a bigram model. Given a reaching state Q, let us assume that the transitions that correspond to certain bilingual phrase pairs p1, p2 and p3 are all compatible with the remaining input. However, the bigram (Q, p3) did not occur throughout the training corpus, therefore there is no direct transition from Q to p3. A backoff (ε) transition enables access to p3, because the bigram (Q, p3) turns into a unigram event that is actually inside the model. Unigram transitions to p1 and p2 must be ignored, because their corresponding bigram events were successfully found one level above.

4 Experiments

GREAT has been successfully employed to work with the French–English EuroParl corpus, that is, the benchmark corpus of the NAACL 2006 shared task of the Workshop on Machine Translation of the Association for Computational Linguistics. The corpus characteristics can be seen in Table 1.

Table 1: Characteristics of the Fr-En EuroParl.

                         French    English
Training   Sentences         688031
           Run. words    15.6 M     13.8 M
           Vocabulary      80348      61626
Dev-Test   Sentences           2000
           Run. words      66200      57951

The EuroParl corpus is built on the proceedings of the European Parliament, which are published on its web site and are freely available. Because of its nature, this corpus has a large variability and complexity, since the translations into the different official languages are performed by groups of human translators. The fact that not all translators agree in their translation criteria implies that a given source sentence can be translated in various different ways throughout the corpus.

Since the proceedings are not available in every language as a whole, a different subset of the corpus is extracted for every language pair, thus yielding a somewhat different corpus for each pair of languages.

4.1 System evaluation

We evaluated the performance of our methods by using the following evaluation measures:

BLEU (Bilingual Evaluation Understudy) score: This indicator computes the precision of unigrams, bigrams, trigrams, and tetragrams with respect to a set of reference translations, with a penalty for too-short sentences (Papineni et al., 2001). BLEU measures accuracy, not error rate.

WER (Word Error Rate): The WER criterion calculates the minimum number of edit operations (substitutions, insertions or deletions) that are needed to convert the system hypothesis into the sentence considered ground truth. Because of its nature, this measure is very pessimistic. A small edit-distance sketch is given after this list.

Time: The average time (in milliseconds) to translate one word from the test corpus, without considering loading times.
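The WER computation mentioned above is the standard dynamic-programming edit distance over words; a minimal sketch (our own illustration, not the evaluation script actually used):

    def wer(hypothesis, reference):
        # Minimum number of substitutions, insertions and deletions needed to
        # turn `hypothesis` into `reference`, divided by the reference length.
        hyp, ref = hypothesis.split(), reference.split()
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution / match
        return d[len(hyp)][len(ref)] / max(len(ref), 1)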


4.2 Results

A set of experimental results were obtained in order to assess the impact of the proposed techniques on the work with phrase-based n-gram transducers.

Assuming an unconstrained parsing, that is, the successive trellis structure is large enough to store all the states that are compatible with the analysis of a source sentence, the results are not very sensitive to the n-gram order, showing that bigrams are powerful enough for this corpus. Apart from this, however, Table 2 also shows a significantly better performance for the second, more refined behaviour for the backoff transitions.

Table 2: Results for the two smoothing criteria.

                          n
Backoff           1     2     3     4     5
baseline  BLEU  26.8  26.3  25.8  25.7  25.7
          WER   62.3  63.9  64.5  64.5  64.5
GREAT     BLEU  26.8  28.0  27.9  27.9  27.9
          WER   62.3  61.9  62.0  62.0  62.0

From now on, the algorithms will be tested on the phrase-based bigram transducer, built according to the GIATI method, where backoff is employed as ε/ε transitions with forbidden states.

Under these conditions, the results of a word-based and a phrase-based decoding strategy, as a function of the dynamic beam factor, can be analysed in Tables 3 and 4.

Table 3: Results for a word-based analysis.

beam   Time (ms)   BLEU   WER
1.00        0.1     0.4   94.6
1.02        0.3    12.8   81.9
1.05        5.2    20.0   74.0
1.10       30.0    24.9   68.2
1.25       99.0    27.1   64.6
1.50      147.0    27.5   62.9
2.00      173.6    27.8   62.1
3.50      252.3    28.0   61.9

From the comparison of Tables 3 and 4, it can be deduced that a word-based analysis iteratively takes into account a rather high percentage of useless states, thus needing a larger beam parameter in order to include the successful paths in the analysis. The price for considering such a long list of states in every iteration of the algorithm is paid in terms of temporal requirements.

Table 4: Results for a phrase-based analysis.

beam   Time (ms)   BLEU   WER
1.00        0.2    19.8   71.8
1.02        0.4    22.1   68.6
1.05        0.7    24.3   66.0
1.10        2.4    26.1   64.2
1.25        7.0    27.1   62.8
1.50        9.7    27.5   62.3
2.00       11.4    27.8   62.0
3.50       12.3    28.0   61.9

However, a phrase-based approach only stores those states that have been successfully reached by a full phrase compatibility with the input sentence. Therefore, it takes more time to process an individual state, but since the list of states is shorter, the search method runs faster. Another important element to point out between Tables 3 and 4 concerns the differences in quality for the same beam parameter in both tables. Word-based decoding strategies suffer the effective reduction of the beam factor that was mentioned in Section 3.1, because their best scores in every analysis stage, which determine the exploration boundaries, may refer to a dead-end path. Logically, these differences are progressively reduced as the beam parameter increases, since the search space is explored in a more exhaustive way.

Table 5: Number of trained and surviving n-grams.

                      n-grams
Window size    unigrams     bigrams
No filter     1,593,677   4,477,382
2               299,002     512,943
3               153,153     141,883
4               130,666      90,265
5               126,056      78,824
6               125,516      77,341

On the other hand, a phrase-based extended-symbol bigram model, learnt from the full training data, contains an overall set of approximately 6 million events. The application of the n-gram pruning technique, using a growing window parameter, can effectively reduce that number to only 200,000. These n-grams, when represented as transducer transitions, suppose a reduction from 20 million transitions to only the 500,000 that are affected by the test set sentences. As a result, the size of the model to be parsed decreases, and therefore the decoding time also decreases. Tables 5 and 6 show the effect of this pruning method on the size of the transducers, and then on the decoding time with a phrase-based analysis, which is the best strategy for phrase-based models.

Table 6: Decoding time for several window sizes.

Window size        Edges   Time (ms)
No filter     19,333,520       362.4
2              2,752,882        41.3
3                911,054        17.3
4                612,006        12.8
5                541,059        11.9
6                531,333        11.8

Needless to say, BLEU and WER keep their best values for all the transducer sizes, as the pruned transitions are never reachable from the test set sentences.

5 Conclusions

GIATI is a grammatical inference technique for learning stochastic transducers from bilingual data for their use in statistical machine translation. Finite-state transducers are able to model the structure of both languages and the relationship between them. GREAT is a finite-state toolkit which was born to overcome the computational problems that previous implementations of GIATI presented in practice when huge amounts of parallel data are employed.

Moreover, GREAT is the result of a very meticulous study of GIATI models, which improves the treatment of smoothing transitions at decoding time, and which also reduces the time required to translate an input sentence by means of an analysis that depends on the granularity of the symbols.

A pruning technique has been designed for n-gram approaches, which reduces the transducer size by integrating only those transitions that are really required for the translation of the test set. This has allowed us to perform experiments on a state-of-the-art, voluminous translation task, the EuroParl, whose results have been reported in depth. A better performance has been found when a phrase-based decoding strategy is selected in order to deal with these GIATI phrase-based transducers. Besides, this permits us to apply a more refined interpretation of backoff transitions for a better smoothing behaviour in translation.

Acknowledgments

This work was supported by the “Vicerrectorado de Innovación y Desarrollo de la Universidad Politécnica de Valencia”, under grant 20080033.

References

Francisco Casacuberta and Enrique Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(2):205–225.

Francisco Casacuberta, Hermann Ney, Franz Josef Och, Enrique Vidal, Juan Miguel Vilar, Sergio Barrachina, Ismael García-Varea, David Llorens, César Martínez, and Sirko Molau. 2004. Some approaches to statistical and finite-state speech-to-speech translation. Computer Speech & Language, 18(1):25–47.

Francisco Casacuberta, Enrique Vidal, and David Picó. 2005. Inference of finite-state transducers from regular languages. Pattern Recognition, 38(9):1431–1443.

F. Casacuberta. 2000. Inference of finite-state transducers by using regular grammars and morphisms. In A.L. Oliveira, editor, Grammatical Inference: Algorithms and Applications, volume 1891 of Lecture Notes in Computer Science, pages 1–14. Springer-Verlag. 5th International Colloquium on Grammatical Inference (ICGI 2000), Lisbon, Portugal.

J. González and F. Casacuberta. 2007. Phrase-based finite state models. In Proceedings of the 6th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP), Potsdam (Germany), September 14–16.

J. González, G. Sanchis, and F. Casacuberta. 2008. Learning finite state transducers using bilingual phrases. In 9th International Conference on Intelligent Text Processing and Computational Linguistics, Lecture Notes in Computer Science, Haifa, Israel, February 17–23.

Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. The MIT Press, January.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL. The Association for Computational Linguistics.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation.


David Picó. 2005. Combining Statistical and Finite-State Methods for Machine Translation. Ph.D. thesis, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia (Spain), September. Advisor: Dr. F. Casacuberta.

E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. Carrasco. 2005. Probabilistic finite-state machines – Part I. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1013–1025.


A note on contextual binary feature grammars

Alexander Clark
Department of Computer Science
Royal Holloway, University of London
[email protected]

Remi Eyraud and Amaury Habrard
Laboratoire d'Informatique Fondamentale de Marseille, CNRS,
Aix-Marseille Universite, France
remi.eyraud,[email protected]

Abstract

Contextual Binary Feature Grammars were recently proposed in (Clark et al., 2008) as a learnable representation for richly structured context-free and context-sensitive languages. In this paper we examine the representational power of the formalism, its relationship to other standard formalisms and language classes, and its appropriateness for modelling natural language.

1 Introduction

An important issue that concerns both natural language processing and machine learning is the ability to learn suitable structures of a language from a finite sample. There are two major points that have to be taken into account in order to define a learning method useful for both fields: first, the method should rely on intrinsic properties of the language itself, rather than on syntactic properties of the representation. Secondly, it must be possible to associate some semantics to the structural elements in a natural way.

Grammatical inference is clearly an important technology for NLP, as it will provide a foundation for theoretically well-founded unsupervised learning of syntax, and thus avoid the annotation bottleneck and the limitations of working with small hand-labelled treebanks.

Recent advances in context-free grammatical inference have established that there are large learnable classes of context-free languages. In this paper, we focus on the basic representation used by the recent approach proposed in (Clark et al., 2008). The authors consider a formalism called Contextual Binary Feature Grammars (CBFG), which defines a class of grammars using contexts as features instead of classical non-terminals. The use of features is interesting from an NLP point of view because we can associate some semantics to them, and because we can represent complex, structured syntactic categories. The notion of contexts is relevant from a grammatical inference standpoint since they are easily observable from a finite sample. In this paper we establish some basic language-theoretic results about the class of exact Contextual Binary Feature Grammars (defined in Section 3), in particular their relationship to the Chomsky hierarchy: exact CBFGs are those where the contextual features are associated with all the possible strings that can appear in the corresponding contexts of the language defined by the grammar.

The main results of this paper are proofs that the class of exact CBFGs:
• properly includes the regular languages (Section 5),
• does not include some context-free languages (Section 6),
• and does include some non context-free languages (Section 7).
Thus, the class of exact CBFGs is orthogonal to the classic Chomsky hierarchy but can represent a very large class of languages. Moreover, it has been shown that this class is efficiently learnable. This class is therefore an interesting candidate for modeling natural language and deserves further investigation.

2 Basic Notation

We consider a finite alphabet Σ, and Σ∗ the free monoid generated by Σ. λ is the empty string, and a language is a subset of Σ∗. We will write the concatenation of u and v as uv, and similarly for sets of strings. u ∈ Σ∗ is a substring of v ∈ Σ∗ if there are strings l, r ∈ Σ∗ such that v = lur.


A context is an element of Σ∗ × Σ∗. For a string u and a context f = (l, r) we write f ⊙ u = lur; this is the insertion or wrapping operation. We extend this to sets of strings and contexts in the natural way. A context is also known in structuralist linguistics as an environment.

The set of contexts, or distribution, of a string u in a language L is CL(u) = {(l, r) ∈ Σ∗ × Σ∗ | lur ∈ L}. We will often drop the subscript where there is no ambiguity. We define the syntactic congruence as u ≡L v iff CL(u) = CL(v). The equivalence classes under this relation are the congruence classes of the language. In general we will assume that λ is not a member of any language.
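For intuition, the observable part of this distribution can be computed directly from a finite sample; the following is our own illustrative sketch (the sample and helper are hypothetical, not part of the paper):

    def contexts(u, sample):
        # Observable contexts of u in a finite sample: all (l, r) with l + u + r
        # in the sample. This approximates C_L(u) from below.
        result = set()
        for w in sample:
            start = w.find(u)
            while start != -1:
                result.add((w[:start], w[start + len(u):]))
                start = w.find(u, start + 1)
        return result

    # In a sample of {a^n b^n}, ab and aabb share the context (λ, λ), encoded as ('', ''):
    sample = ["ab", "aabb", "aaabbb"]
    print(contexts("ab", sample))   # {('', ''), ('a', 'b'), ('aa', 'bb')}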

3 Contextual Binary Feature Grammars

Most definitions and lemmas of this section were first introduced in (Clark et al., 2008).

3.1 Definition

Before presenting the formalism, we give some results about contexts that help to give an intuition of the representation. The basic insight behind CBFGs is that there is a relation between the contexts of a string w and the contexts of its substrings. This is given by the following trivial lemma:

Lemma 1. For any language L and for any strings u, u′, v, v′, if C(u) = C(u′) and C(v) = C(v′), then C(uv) = C(u′v′).

We can also consider a slightly stronger result:

Lemma 2. For any language L and for any strings u, u′, v, v′, if C(u) ⊆ C(u′) and C(v) ⊆ C(v′), then C(uv) ⊆ C(u′v′).

C(u) ⊆ C(u′) means that we can replace any occurrence of u in a sentence with u′ without affecting grammaticality, but not necessarily vice versa. Note that none of these strings need to correspond to non-terminals: this is valid for any fragment of a sentence.

We will give a simplified example from English syntax: the pronoun it can occur everywhere that the pronoun him can, but not vice versa1. Thus, given the sentence “I gave him away”, we can substitute it for him to get the grammatical sentence I gave it away, but we cannot reverse the process. For example, given the sentence it is raining, we cannot substitute him for it, as we would get the ungrammatical sentence him is raining. Thus we observe C(him) ⊊ C(it).

1 This example does not account for a number of syntactic and semantic phenomena, particularly the distribution of reflexive anaphors.

Looking at Lemma 2 we can also say that, if we have some finite set of strings K for which we know the contexts, then:

Corollary 1.

C(w) ⊇ ⋃_{u′,v′ : u′v′=w} ⋃_{u∈K : C(u)⊆C(u′)} ⋃_{v∈K : C(v)⊆C(v′)} C(uv)

This is the basis of the representation: a word w is characterised by its set of contexts. We can compute the representation of w from the representations of its parts u′, v′ by looking at all of the other matching strings u and v for which we understand how they combine (with subset inclusion). In order to illustrate this concept, we give a simple example here.

Consider the language {a^n b^n | n > 0} and the set K = {aabb, ab, abb, aab, a, b}. Suppose we want to compute the set of contexts of aaabbb. Since C(abb) ⊆ C(aabbb), and vacuously C(a) ⊆ C(a), we know that C(aabb) ⊆ C(aaabbb). More generally, the contexts of ab can represent the strings a^n b^n, those of aab the strings a^{n+1} b^n, and those of abb the strings a^n b^{n+1}.

The key relationships are given by context-set inclusion. Contextual binary feature grammars allow a proper definition of the combination of context inclusion:

Definition 1. A Contextual Binary Feature Grammar (CBFG) G is a tuple ⟨F, P, PL, Σ⟩. F is a finite set of contexts, called features, where we write C = 2^F for the power set of F defining the categories of the grammar; P ⊆ C × C × C is a finite set of productions that we write x → yz, where x, y, z ∈ C; and PL ⊆ C × Σ is a set of lexical rules, written x → a. Normally PL contains exactly one production for each letter in the alphabet (the lexicon).

A CBFG G defines recursively a map fG from Σ∗ to C as follows:

fG(λ) = ∅                                                        (1)
fG(w) = ⋃_{(c→w)∈PL} c                            iff |w| = 1    (2)
fG(w) = ⋃_{u,v : uv=w} ⋃_{x→yz∈P : y⊆fG(u) ∧ z⊆fG(v)} x          iff |w| > 1    (3)

We now give more explanation of the map fG. It defines in fact the analysis of a string by a CBFG. A rule z → xy is applied to analyse a string w if there is a cut uv = w such that x ⊆ fG(u) and y ⊆ fG(v); recall that x and y are sets of contexts. Intuitively, the relation given by the production rule is linked to Lemma 2: z is included in the set of features of w = uv. From this relationship, for any (l, r) ∈ z we have lwr ∈ L(G).

The complete computation of fG is then justified by Corollary 1: fG(w) defines all the possible features associated by G to w over all the possible cuts uv = w (i.e. all the possible derivations).

Finally, the natural way to define the membership of a string u in L(G) is to require that the context (λ, λ) ∈ fG(u), which implies that λuλ = u ∈ L(G).

Definition 2. The language defined by a CBFG G is the set of all strings that are assigned the empty context: L(G) = {u | (λ, λ) ∈ fG(u)}.

As we saw before, we are interested in cases where there is a correspondence between the language-theoretic interpretation of a context and the occurrence of that context as a feature in the grammar. The basic definition of a CBFG does not require any specific condition on the features of the grammar, except that a feature is associated with a string if the string appears in the context defined by the feature. However, we can also require that fG defines exactly all the possible features that can be associated with a given string according to the underlying language.

Definition 3. Given a finite set of contexts F = {(l1, r1), . . . , (ln, rn)} and a language L, we can define the context feature map FL : Σ∗ → 2^F, which is just the map u ↦ {(l, r) ∈ F | lur ∈ L} = CL(u) ∩ F.

Using this definition, we now need a correspondence between the language-theoretic context feature map FL and the representation in the CBFG, fG.

Definition 4. A CBFG G is exact if for all u ∈ Σ∗, fG(u) = FL(G)(u).

Exact CBFGs are a more limited formalism than CBFGs themselves; without any limits on the interpretation of the features, we can define a class of formalisms that is equal to the class of Conjunctive Grammars (see Section 4). However, exactness is an important notion because it allows intrinsic components of a language to be associated with strings. Contexts are easily observable from a sample and, moreover, it is only when the features correspond to the contexts that distributional learning algorithms can infer the structure of the language. A basic example of such a learning algorithm is given in (Clark et al., 2008).

3.2 A Parsing Example

To clarify the relationship with CFG parsing, we will give a simple worked example. Consider the CBFG G = ⟨{(λ, λ), (λ, b), (λ, abb), (a, λ), (aab, λ)}, P, PL, {a, b}⟩ with
PL = { {(λ, b), (λ, abb)} → a, {(a, λ), (aab, λ)} → b }
and P = {
{(λ, λ)} → {(λ, b)}{(aab, λ)},
{(λ, λ)} → {(λ, abb)}{(a, λ)},
{(λ, b)} → {(λ, abb)}{(λ, λ)},
{(a, λ)} → {(λ, λ)}{(aab, λ)} }.

If we want to parse the string w = aabb, the usual way is to take a bottom-up approach. This means that we recursively compute the fG map on the substrings of w in order to check whether (λ, λ) belongs to fG(w).
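The bottom-up computation of fG can be organised as a CYK-style table over all substrings; the following sketch is our own illustration of equations (1)–(3) for the example grammar above (λ is encoded as the empty string, features as (l, r) pairs, categories as frozensets):

    # Example CBFG for {a^n b^n | n > 0}, taken from the grammar above.
    LEX = {                       # lexical rules PL, indexed by letter
        "a": frozenset({("", "b"), ("", "abb")}),
        "b": frozenset({("a", ""), ("aab", "")}),
    }
    PROD = [                      # productions P: x -> y z
        (frozenset({("", "")}),  frozenset({("", "b")}),   frozenset({("aab", "")})),
        (frozenset({("", "")}),  frozenset({("", "abb")}), frozenset({("a", "")})),
        (frozenset({("", "b")}), frozenset({("", "abb")}), frozenset({("", "")})),
        (frozenset({("a", "")}), frozenset({("", "")}),    frozenset({("aab", "")})),
    ]

    def f_G(w):
        # Bottom-up computation of f_G on every substring of w.
        n = len(w)
        table = {}                          # (i, j) -> features of w[i:j]
        for i in range(n):                  # length-1 substrings: lexical rules
            table[(i, i + 1)] = set(LEX.get(w[i], frozenset()))
        for length in range(2, n + 1):      # longer substrings: equation (3)
            for i in range(n - length + 1):
                j = i + length
                feats = set()
                for k in range(i + 1, j):   # every cut u v = w[i:j]
                    for x, y, z in PROD:
                        if y <= table[(i, k)] and z <= table[(k, j)]:
                            feats |= x
                table[(i, j)] = feats
        return table[(0, n)]

    print(("", "") in f_G("aabb"))   # True:  aabb is in L(G)
    print(("", "") in f_G("aab"))    # False: aab is not in L(G)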

Figure 1 graphically gives the main steps of the computation of fG(aabb). Basically, there are two ways to split aabb that allow the derivation of the empty context: aab|b and a|abb. The first one corresponds to the top part of the figure, while the second one is drawn at the bottom. We can see, for instance, that the empty context belongs to fG(ab) thanks to the rule {(λ, λ)} → {(λ, abb)}{(a, λ)}: {(λ, abb)} ⊆ fG(a) and {(a, λ)} ⊆ fG(b). But for symmetrical reasons the result can also be obtained using the rule {(λ, λ)} → {(λ, b)}{(aab, λ)}.

As we trivially have fG(aa) = fG(bb) = ∅, since no right-hand side contains the concatenation of the same two features, an induction proof can be written to show that (λ, λ) ∈ fG(w) ⇔ w ∈ {a^n b^n : n > 0}.

Figure 1: The two derivations that obtain (λ, λ) in fG(aabb) in the grammar G. (Top: fG(ab) ⊇ {(λ, λ)} via {(λ, λ)} → {(λ, b)}{(aab, λ)}, then fG(abb) ⊇ {(a, λ)} via {(a, λ)} → {(λ, λ)}{(aab, λ)}, and finally fG(aabb) ⊇ {(λ, λ)} via {(λ, λ)} → {(λ, abb)}{(a, λ)}. Bottom: fG(ab) ⊇ {(λ, λ)} via {(λ, λ)} → {(λ, abb)}{(a, λ)}, then fG(aab) ⊇ {(λ, b)} via {(λ, b)} → {(λ, abb)}{(λ, λ)}, and finally fG(aabb) ⊇ {(λ, λ)} via {(λ, λ)} → {(λ, b)}{(aab, λ)}.)

This is a simple example that illustrates the parsing of a string given a CBFG. It does not characterise the full power of CBFGs, since no right-hand side is composed of more than one context. A more interesting example, with a context-sensitive language, will be presented in Section 7.

4 Non exact CBFGs

The aim here is to study the expressive power of CBFGs compared to other recently introduced formalisms. Though the inference can be done only for exact CBFGs, where features are directly linked with observable contexts, it is still worth looking at the more general characteristics of CBFGs. For instance, it is interesting to note that several formalisms introduced with the aim of representing natural languages share strong links with CBFGs.

Range Concatenation Grammars

Range Concatenation Grammars (RCGs) are a very powerful formalism (Boullier, 2000) that is a current area of research in NLP.

Lemma 3. For every CBFG G, there is a non-erasing positive range concatenation grammar of arity one, in 2-var form, that defines the same language.

Proof. Suppose G = ⟨F, P, PL, Σ⟩. Define an RCG with a set of predicates equal to F, the two variables U, V, and the following clauses. For each production x → yz in P, for each f ∈ x, where y = {g1, . . . , gi} and z = {h1, . . . , hj}, add the clause f(UV) → g1(U), . . . , gi(U), h1(V), . . . , hj(V). For each lexical production {f1 . . . fk} → a, add the clauses fi(a) → ε. It is straightforward to verify that f(w) ⊢ ε iff f ∈ fG(w).

Conjunctive Grammar

A more exact correspondence is to the class of Conjunctive Grammars (Okhotin, 2001), invented independently of RCGs. For every language L generated by a conjunctive grammar, there is a CBFG representing L# (where the special character # is not included in the original alphabet).

Suppose we have a conjunctive grammar G = ⟨Σ, N, P, S⟩ in binary normal form (as defined in (Okhotin, 2003)). We construct the equivalent CBFG G′ = ⟨F, P′, PL, Σ⟩ as follows:

• For every letter a we add a context (la, ra) to F such that la a ra ∈ L;
• For every rule X → a in P, we create a rule {(la, ra)} → a in PL;
• For every non-terminal X ∈ N and every rule X → P1Q1 & . . . & PnQn, we add distinct contexts {(lPiQi, rPiQi)} to F such that for all i there exists ui with lPiQi ui rPiQi ∈ L and PiQi ⇒*G ui;
• Let FX,j = {(lPiQi, rPiQi) : ∀i} be the set of contexts corresponding to the j-th rule applicable to X. For all (lPiQi, rPiQi) ∈ FX,j, we add to P′ the rules (lPiQi, rPiQi) → FPi,k FQi,l (∀k, l);
• We add a new context (w, λ) to F such that S ⇒*G w, and (w, λ) → # to PL;
• For all j, we add to P′ the rule (λ, λ) → FS,j {(w, λ)}.

It can be shown that this construction gives an equivalent CBFG.

5 Regular Languages

Any regular language can be defined by an exact CBFG. In order to show this, we will propose an approach defining a canonical form for representing any regular language.

Suppose we have a regular language L; we consider the left and right residual languages:

u⁻¹L = {w | uw ∈ L}    (4)
Lu⁻¹ = {w | wu ∈ L}    (5)

They define two congruences: if l, l′ ∈ u⁻¹L (resp. r, r′ ∈ Lu⁻¹), then for all w ∈ Σ∗, lw ∈ L iff l′w ∈ L (resp. wr ∈ L iff wr′ ∈ L).

For any u ∈ Σ∗, let lmin(u) be the lexicographically shortest element such that lmin(u)⁻¹L = u⁻¹L. The number of such lmin is finite by the Myhill–Nerode theorem; we denote this set by Lmin, i.e. {lmin(u) | u ∈ Σ∗}. We define Rmin symmetrically for the right residuals (L rmin(u)⁻¹ = L u⁻¹). We define the set of contexts as:

F(L) = Lmin × Rmin.    (6)

F(L) is clearly finite by construction. If we consider the regular language defined by the deterministic finite automaton of Figure 2, we obtain Lmin = {λ, a, b} and Rmin = {λ, b, ab}, and thus F(L) = {(λ, λ), (a, λ), (b, λ), (λ, b), (a, b), (b, b), (λ, ab), (a, ab), (b, ab)}.

By considering this set of features, we can prove (using arguments about congruence classes) that for any strings u, v such that FL(u) ⊃ FL(v), we have CL(u) ⊃ CL(v). This means the set of features F is sufficient to represent context inclusion; we call this property fiduciality.

Figure 2: Example of a DFA. The left residuals are defined by λ⁻¹L, a⁻¹L, b⁻¹L and the right ones by Lλ⁻¹, Lb⁻¹, Lab⁻¹ (note here that La⁻¹ = Lλ⁻¹).

Note that the number of congruence classes of a regular language is finite. Each congruence class is represented by a set of contexts FL(u). Let KL be the finite set of strings formed by taking the lexicographically shortest string from each congruence class. The final grammar can be obtained by combining elements of KL. For every pair of strings u, v ∈ KL, we define a rule

FL(uv) → FL(u), FL(v) (7)

and we add lexical productions of the form FL(a) → a, a ∈ Σ.

Lemma 4. For all w ∈ Σ∗, fG(w) = FL(w).

Proof. (Sketch) The proof is in two steps: ∀w ∈ Σ∗, FL(w) ⊆ fG(w) and fG(w) ⊆ FL(w). Each step is done by induction on the length of w and uses the rules created to build the grammar, the derivation process of a CBFG, and, for the second step, fiduciality. The key point relies on the fact that when a string w is parsed by a CBFG G, there exists a cut of w into uv = w (u, v ∈ Σ∗) and a rule z → xy in G such that x ⊆ fG(u) and y ⊆ fG(v). The rule z → xy is also obtained from a substring of the set used to build the grammar, using the FL map. By the inductive hypothesis we obtain the inclusion between fG and FL on u and v.

For the language of Figure 2, the following set is sufficient to build an exact CBFG: {a, b, aa, ab, ba, aab, bb, bba} (this corresponds to all the substrings of aab and bba). We have:
FL(a) = F(L)\{(λ, λ), (a, λ)} → a
FL(b) = F(L) → b
FL(aa) = FL(a) → FL(a), FL(a)
FL(ab) = F(L) → FL(a), FL(b) = FL(a), F(L)
FL(ba) = F(L) → FL(b), FL(a) = F(L), FL(a)
FL(bb) = F(L) → FL(b), FL(b) = F(L), F(L)


FL(aab) = FL(bba) = FL(ab) = FL(ba)

The approach presented here gives a canonical form for representing a regular language by an exact CBFG. Moreover, it is complete in the sense that every context of every substring will be represented by some element of F(L): this CBFG will completely model the relation between contexts and substrings.

6 Context-Free Languages

We now consider the relationship between CFGs and CBFGs.

Definition 5. A context-free grammar (CFG) is a quadruple G = (Σ, V, P, S). Σ is a finite alphabet, V is a set of non-terminals (Σ ∩ V = ∅), P ⊆ V × (V ∪ Σ)⁺ is a finite set of productions, and S ∈ V is the start symbol.

In the following, we will suppose that a CFG is represented in Chomsky Normal Form, i.e. every production is of the form N → UW with N, U, W ∈ V, or N → a with a ∈ Σ. We will write uNv ⇒G uαv if there is a production N → α ∈ P. ⇒*G is the reflexive transitive closure of ⇒G. The language defined by a CFG G is L(G) = {w ∈ Σ∗ | S ⇒*G w}.

6.1 A Simple Characterization

A simple approach to representing a CFG by a CBFG is to define a bijection between the set of non-terminals and the set of context features. Informally, we define each non-terminal by a single context and rewrite the productions of the grammar in CBFG form.

To build the set of contexts F, it is sufficient to choose |V| contexts such that a bijection bC can be defined between V and F, where bC(N) = (l, r) implies that S ⇒* lNr. Note that we fix bC(S) = (λ, λ).

Then, we can define a CBFG ⟨F, P′, P′L, Σ⟩, where P′ = {bC(N) → bC(U)bC(W) | N → UW ∈ P} and P′L = {bC(N) → a | N → a ∈ P, a ∈ Σ}. A proof showing that this construction produces an equivalent CBFG can be found in (Clark et al., 2008).

Although this approach allows a simple syntactic conversion of a CFG into a CBFG, it is not relevant from an NLP point of view. Though we associate a non-terminal with a context, this may not correspond to an intrinsic property of the underlying language. A context could be associated with many non-terminals, and we choose only one. For example, the context (He is, λ) allows both noun phrases and adjective phrases. In formal terms, the resulting CBFG is not exact. Thus, with the bijection introduced before, we are not able to characterise the non-terminals by the contexts in which they could appear. This is clearly not what we want here, and we are more interested in the relationship with exact CBFGs.

6.2 Not all CFLs have an exact CBFG

We will show here that the class of context-free languages is not included in the class of languages definable by exact CBFGs. First, the grammar defined in Section 3.2 is an exact CBFG for the context-free and non-regular language {a^n b^n | n > 0}, showing that the class of exact CBFGs has some elements in the class of CFGs.

We now give a context-free language L that cannot be defined by an exact CBFG:

L = {a^n b | n > 0} ∪ {a^m c^n | n > m > 0}.

Suppose that there exists an exact CBFG that recognises it, and let N be the length of the longest feature (i.e. the longest left part of a feature). For any sufficiently large k > N, the sequences c^k and c^{k+1} share the same features: FL(c^k) = FL(c^{k+1}). Since the CBFG is exact, we have FL(b) ⊆ FL(c^k). Thus any derivation of a^{k+1} b could be a derivation of a^{k+1} c^k, which does not belong to the language.

However, this restriction does not mean that the class of exact CBFGs is too restrictive for modelling natural languages. Indeed, the example we have given is highly unnatural, and such phenomena appear not to occur in attested natural languages.

7 Context-Sensitive Languages

We now show that there are some exact CBFGs that are not context-free. In particular, we define a language closely related to the MIX language (consisting of strings with an equal number of a's, b's and c's in any order), which is known to be non-context-free and is indeed conjectured to be outside the class of indexed grammars (Boullier, 2003).


Let M = {a, b, c}∗; we consider the language L = Labc ∪ Lab ∪ Lac ∪ {a′a, b′b, c′c, dd′, ee′, ff′}, where
Lab = {wd | w ∈ M, |w|a = |w|b},
Lac = {we | w ∈ M, |w|a = |w|c},
Labc = {wf | w ∈ M, |w|a = |w|b = |w|c}.
In order to define a CBFG recognising L, we have to select features (contexts) that can represent exactly the intrinsic components of the languages composing L. We propose the following set of features for each sublanguage:

• For Lab: (λ, d) and (λ, ad), (λ, bd).

• For Lac: (λ, e) and (λ, ae), (λ, ce).

• For Labc: (λ, f).

• For the letters a′, b′, c′, a, b, c we add: (λ, a), (λ, b), (λ, c), (a′, λ), (b′, λ), (c′, λ).

• For the letters d, e, f, d′, e′, f′ we add: (λ, d′), (λ, e′), (λ, f′), (d, λ), (e, λ), (f, λ).

Here, Lab will be represented by (λ, d), but we will use (λ, ad) and (λ, bd) to define the internal derivations of elements of Lab. The same idea holds for Lac with (λ, e) and (λ, ae), (λ, ce).

For the lexical rules, and in order to have an exact CBFG, note the special case for a, b, c:
{(λ, bd), (λ, ce), (a′, λ)} → a
{(λ, ad), (b′, λ)} → b
{(λ, ad), (λ, ae), (c′, λ)} → c
For the nine other letters, each one is defined with only one context, like {(λ, d′)} → d.

For the production rules, the most important one is: (λ, λ) → {(λ, d), (λ, e)}, {(λ, f′)}.

Indeed, this rule, with the presence of two contexts in one of the categories, means that an element of the language has to be derived so that it has a prefix u such that fG(u) ⊇ {(λ, d), (λ, e)}. This means u is both an element of Lab and of Lac. This rule represents the language Labc, since {(λ, f′)} can only represent the letter f.

The other parts of the language will be defined by the following rules:
(λ, λ) → {(λ, d)}, {(λ, d′)},
(λ, λ) → {(λ, e)}, {(λ, e′)},
(λ, λ) → {(λ, a)}, {(λ, bd), (λ, ce), (a′, λ)},
(λ, λ) → {(λ, b)}, {(λ, ad), (b′, λ)},
(λ, λ) → {(λ, c)}, {(λ, ad), (λ, ae), (c′, λ)},
(λ, λ) → {(λ, d′)}, {(d, λ)},
(λ, λ) → {(λ, e′)}, {(e, λ)},
(λ, λ) → {(λ, f′)}, {(f, λ)}.

This set of rules is incomplete: to represent Lab, the grammar must also contain the rules ensuring that there are the same number of a's and b's, and similarly for Lac. To lighten the presentation here, the complete grammar is given in the Annex.

We claim this is an exact CBFG for a context-sensitive language. L is not context-free, since if we intersect L with the regular language {Σ∗d}, we get an instance of the non-context-free MIX language (with d appended). The exactness comes from the fact that we chose the contexts in order to ensure that strings belonging to one sublanguage cannot belong to another, and that the derivation of a substring will provide all the possible correct features, by taking the union over all the possible derivations.

Note that the MIX language on its own is probably not definable by an exact CBFG: it is only when other parts of the language can distributionally define the appropriate partial structures that we can get context-sensitive languages. Far from being a limitation of this formalism (a bug), we argue this is a feature: it is only in rather exceptional circumstances that we will get properly context-sensitive languages. This formalism thus potentially accounts not just for the existence of non-context-free natural languages but also for their rarity.

8 Conclusion

The chart in Figure 3 summarises the different relationships shown in this paper. The substitutable languages (Clark and Eyraud, 2007) and the very simple ones (Yokomori, 2003) form two different learnable classes of languages. There is an interesting relationship with Marcus External Contextual Grammars (Mitrana, 2005): if we defined the language of a CBFG to be the set {fG(u) ⊙ u : u ∈ Σ∗}, we would be taking some steps towards contextual grammars.

In this paper we have discussed the weak generative power of Exact Contextual Binary Feature Grammars; we conjecture that the class of natural language stringsets lies in this class. ECBFGs are efficiently learnable (see (Clark et al., 2008) for details), which is a compelling technical advantage of this formalism over other, more traditional formalisms such as CFGs or TAGs.

Figure 3: The relationship between CBFG and other classes of languages (the diagram relates the regular, context-free and context-sensitive languages, the very simple and substitutable classes, Range Concatenation grammars, Conjunctive grammars = CBFG, and Exact CBFG).

References

Pierre Boullier. 2000. A cubic time extension of context-free grammars. Grammars, 3:111–131.

Pierre Boullier. 2003. Counting with range concatenation grammars. Theoretical Computer Science, 293(2):391–416.

Alexander Clark and Remi Eyraud. 2007. Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research, 8:1725–1745, Aug.

Alexander Clark, Remi Eyraud, and Amaury Habrard. 2008. A polynomial algorithm for the inference of context free languages. In Proceedings of the International Colloquium on Grammatical Inference, pages 29–42. Springer, September.

V. Mitrana. 2005. Marcus external contextual grammars: From one to many dimensions. Fundamenta Informaticae, 54:307–316.

Alexander Okhotin. 2001. Conjunctive grammars. J. Autom. Lang. Comb., 6(4):519–535.

Alexander Okhotin. 2003. An overview of conjunctive grammars. Formal Language Theory Column, bulletin of the EATCS, 79:145–163.

Takashi Yokomori. 2003. Polynomial-time identification of very simple grammars from positive data. Theoretical Computer Science, 298(1):179–206.

Annex

(λ, λ) → {(λ, d), (λ, e)}, {(λ, f′)}
(λ, λ) → {(λ, d)}, {(λ, d′)}
(λ, λ) → {(λ, e)}, {(λ, e′)}
(λ, λ) → {(λ, a)}, {(λ, bd), (λ, ce), (a′, λ)}
(λ, λ) → {(λ, b)}, {(λ, ad), (b′, λ)}
(λ, λ) → {(λ, c)}, {(λ, ad), (λ, ae), (c′, λ)}
(λ, λ) → {(λ, d′)}, {(d, λ)}
(λ, λ) → {(λ, e′)}, {(e, λ)}
(λ, λ) → {(λ, f′)}, {(f, λ)}

(λ, d) → {(λ, d)}, {(λ, d)}
(λ, d) → {(λ, ad)}, {(λ, bd)}
(λ, d) → {(λ, bd)}, {(λ, ad)}
(λ, d) → {(λ, d)}, {(λ, ad), (λ, ae), (c′, λ)}
(λ, d) → {(λ, ad), (λ, ae), (c′, λ)}, {(λ, d)}

(λ, ad) → {(λ, ad), (λ, ae), (c′, λ)}, {(λ, ad)}
(λ, ad) → {(λ, ad)}, {(λ, ad), (λ, ae), (c′, λ)}
(λ, ad) → {(λ, ad), (b′, λ)}, {(λ, d)}
(λ, ad) → {(λ, d)}, {(λ, ad), (b′, λ)}

(λ, bd) → {(λ, ad), (λ, ae), (c′, λ)}, {(λ, bd)}
(λ, bd) → {(λ, bd)}, {(λ, ad), (λ, ae), (c′, λ)}
(λ, bd) → {(λ, bd), (λ, ce), (a′, λ)}, {(λ, d)}
(λ, bd) → {(λ, d)}, {(λ, bd), (λ, ce), (a′, λ)}

(λ, e) → {(λ, e)}, {(λ, e)}
(λ, e) → {(λ, ae)}, {(λ, ce)}
(λ, e) → {(λ, ce)}, {(λ, ae)}
(λ, e) → {(λ, e)}, {(λ, ad), (b′, λ)}
(λ, e) → {(λ, ad), (b′, λ)}, {(λ, e)}

(λ, ae) → {(λ, ad), (b′, λ)}, {(λ, ae)}
(λ, ae) → {(λ, ae)}, {(λ, ad), (b′, λ)}
(λ, ae) → {(λ, ad), (λ, ae), (c′, λ)}, {(λ, e)}
(λ, ae) → {(λ, e)}, {(λ, ad), (λ, ae), (c′, λ)}

(λ, ce) → {(λ, ad), (b′, λ)}, {(λ, ce)}
(λ, ce) → {(λ, ce)}, {(λ, ad), (b′, λ)}
(λ, ce) → {(λ, bd), (λ, ce), (a′, λ)}, {(λ, e)}
(λ, ce) → {(λ, e)}, {(λ, bd), (λ, ce), (a′, λ)}

{(λ, bd), (λ, ce), (a′, λ)} → a
{(λ, ad), (b′, λ)} → b
{(λ, ad), (λ, ae), (c′, λ)} → c
{(λ, d′)} → d
{(λ, e′)} → e
{(λ, f′)} → f
{(λ, a)} → a′
{(λ, b)} → b′
{(λ, c)} → c′
{(d, λ)} → d′
{(e, λ)} → e′
{(f, λ)} → f′


Language models for contextual error detection and correction

Herman Stehouwer
Tilburg Centre for Creative Computing
Tilburg University
Tilburg, The Netherlands
[email protected]

Menno van Zaanen
Tilburg Centre for Creative Computing
Tilburg University
Tilburg, The Netherlands
[email protected]

Abstract

The problem of identifying and correcting confusibles, i.e. context-sensitive spelling errors, in text is typically tackled using specifically trained machine learning classifiers. For each different set of confusibles, a specific classifier is trained and tuned.

In this research, we investigate a more generic approach to context-sensitive confusible correction. Instead of using specific classifiers, we use one generic classifier based on a language model. This measures the likelihood of sentences with the different possible solutions of a confusible in place. The advantage of this approach is that all confusible sets are handled by a single model. Preliminary results show that the performance of the generic classifier approach is only slightly worse than that of the specific classifier approach.

1 Introduction

When writing texts, people often use spelling checkers to reduce the number of spelling mistakes in their texts. Many spelling checkers concentrate on non-word errors. These errors can be easily identified in texts because they consist of character sequences that are not part of the language. For example, in English, woord is not part of the language, hence a non-word error. A possible correction would be word.

Even when a text does not contain any non-word errors, there is no guarantee that the text is error-free. There are several types of spelling errors where the words themselves are part of the language, but are used incorrectly in their context. Note that these kinds of errors are much harder to recognise, as information from the context in which they occur is required to recognise and correct them. In contrast, non-word errors can be recognised without context.

One class of such errors, called confusibles, consists of words that belong to the language, but are used incorrectly with respect to their local, sentential context. For example, She owns to cars contains the confusible to. Note that this word is a valid token and part of the language, but it is used incorrectly in the context. Considering the context, a correct and very likely alternative would be the word two. Confusibles are grouped together in confusible sets, which are sets of words that are similar and often used incorrectly in context. Too is the third alternative in this particular confusible set.

The research presented here is part of a larger project, which focusses on context-sensitive spelling mistakes in general. Within this project all classes of context-sensitive spelling errors are tackled. For example, in addition to confusibles, a class of pragmatically incorrect words (where words are incorrectly used within the document-wide context) is considered as well. In this article we concentrate on the problem of confusibles, where the context is only as large as a sentence.

2 Approach

A typical approach to the problem of confusibles is to train a machine learning classifier for a specific confusible set. Most of the work in this area has concentrated on confusibles due to homophony (to, too, two) or similar spelling (desert, dessert). However, some research has also touched upon inflectional or derivational confusibles such as I versus me (Golding and Roth, 1999). For instance, when word forms are homophonic, they tend to get confused often in writing (cf. the situation with to, too, and two, affect and effect, or there, their, and they're in English) (Sandra et al., 2001; Van den Bosch and Daelemans, 2007).


Most work on confusible disambiguation using machine learning concentrates on hand-selected sets of notorious confusibles. The confusible sets are typically very small (two or three elements) and the machine learner will only see training examples of the members of the confusible set. This approach is similar to approaches used in accent restoration (Yarowsky, 1994; Golding, 1995; Mangu and Brill, 1997; Wu et al., 1999; Even-Zohar and Roth, 2000; Banko and Brill, 2001; Huang and Powers, 2001; Van den Bosch, 2006).

The task of the machine learner is to decide, using features describing information from the context, which word taken from the confusible set really belongs in the position of the confusible. Using the example above, the classifier has to decide which word belongs in the position of the X in She owns X cars, where the possible answers for X are to, too, or two. We call X, the confusible that is under consideration, the focus word.

Another way of looking at the problem of confusible disambiguation is to see it as a very specialised case of word prediction. The problem is then to predict which word belongs at a specific position. Using the similarities between these cases, we can use techniques from the field of language modeling to solve the problem of selecting the best alternative from confusible sets. We will investigate this approach in this article.

Language models assign probabilities to sequences of words. Using this information, it is possible to predict the most likely word in a certain context. If a language model gives us the probability for a sequence of n words PLM(w1, . . . , wn), we can use this to predict the most likely word w following a sequence of n − 1 words: arg max_w PLM(w1, . . . , wn−1, w). Obviously, a similar approach can be taken with w in the middle of the sequence.
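As an illustration (our own sketch; lm_score stands for any language model that returns the probability, or log-probability, of a word sequence), selecting the best member of a confusible set then amounts to an argmax over the candidates inserted at the focus position:

    def disambiguate(tokens, focus_index, confusible_set, lm_score):
        # Replace the focus word by each candidate and keep the highest-scoring one.
        best_word, best_score = None, float("-inf")
        for candidate in confusible_set:
            variant = tokens[:focus_index] + [candidate] + tokens[focus_index + 1:]
            score = lm_score(variant)
            if score > best_score:
                best_word, best_score = candidate, score
        return best_word

    # e.g. disambiguate(["She", "owns", "to", "cars"], 2, {"to", "too", "two"}, lm_score)
    # should return "two" under a reasonable language model.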

Here, we will use a language model as a classifier to predict the correct word in a context. Since a language model models the entire language, it is different from a regular machine learning classifier trained on a specific set of confusibles. The advantage of this approach to confusible disambiguation is that the language model can handle all potential confusibles without any further training and tuning. With the language model it is possible to take the words from any confusible set and compute the probabilities of those words in the context. The element from the confusible set that has the highest probability according to the language model is then selected. Since the language model assigns probabilities to all sequences of words, it is possible to define new confusible sets on the fly and let the language model disambiguate them without any further training. Obviously, this is not possible for a specialized machine learning classifier approach, where a classifier is fine-tuned to the features and classes of a specific confusible set.
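Concretely, the selection procedure is just an argmax over the confusible set: substitute each candidate at the focus position, score the resulting sequence with the language model, and keep the best-scoring candidate. The following Python sketch illustrates this idea; it is not the authors' implementation, and lm_score stands in for any of the scoring functions discussed below.

from typing import Callable, Sequence

def disambiguate(tokens: Sequence[str], focus: int,
                 confusible_set: Sequence[str],
                 lm_score: Callable[[Sequence[str]], float]) -> str:
    # Return the member of the confusible set that the language model
    # prefers at position `focus` of the tokenized sentence.
    best_word, best_score = None, float("-inf")
    for candidate in confusible_set:
        variant = list(tokens)
        variant[focus] = candidate      # substitute the candidate in context
        score = lm_score(variant)       # (log-)probability of the sequence
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word

# Any confusible set can be passed in without retraining:
# disambiguate("She owns X cars .".split(), 2, ["to", "too", "two"], my_lm_score)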

The expected disadvantage of the generic (language model) classifier approach is that the accuracy is expected to be less than that of the specific (specialized machine learning classifier) approach. Since the specific classifiers are tuned to each specific confusible set, the weights for each of the features may be different for each set. For instance, there may be confusibles for which the correct word is easily identified by words in a specific position. If a determiner, like the, occurs in the position directly before the confusible, to or too are very probably not the correct answers. The specific approach can take this into account by assigning specific weights to part-of-speech and position combinations, whereas the generic approach cannot do this explicitly for specific cases; the weights follow automatically from the training corpus.

In this article, we will investigate whether it is possible to build a confusible disambiguation system that is generic for all sets of confusibles, using language models as generic classifiers, and investigate how far this approach is useful for solving the confusible problem. We will compare these generic classifiers against specific classifiers that are trained for each confusible set independently.

3 Results

To measure the effectiveness of the generic classifier approach to confusible disambiguation, and to compare it against a specific classifier approach, we have implemented several classification systems. The first of these is a majority class baseline system, which selects the word from the confusible set that occurs most often in the training data.¹

We have also implemented several generic classifiers based on different language models. We compare these against two machine learning classifiers. The machine learning classifiers are trained separately for each different experiment, whereas the parameters and the training material of the language model are kept fixed throughout all the experiments.

¹ This baseline system corresponds to the simplest language model classifier. In this case, it only uses n-grams with n = 1.

3.1 System description

There are many different approaches that can be taken to develop language models. A well-known approach is to use n-grams, or Markov models. These models take into account the probability that a word occurs in the context of the previous n-1 words. The probabilities can be extracted from the occurrences of words in a corpus. Probabilities are computed by taking the relative occurrence count of the n words in sequence.

In the experiments described below, we will use a tri-gram-based language model and, where required, this model will be extended with bi-gram and uni-gram language models. The probability of a sequence is computed as the combination of the probabilities of the tri-grams that are found in the sequence.
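As a concrete illustration, relative-frequency estimates for such n-gram probabilities can be collected with simple counting. The sketch below only makes the estimation step explicit and is not the implementation used in the experiments; in particular, the text does not spell out whether joint or conditional probabilities are used, so the joint relative frequency shown here is one plausible reading.

from collections import Counter
from typing import Iterable, List

def train_ngram_counts(sentences: Iterable[List[str]], max_n: int = 3):
    # Count all n-grams up to max_n (here: uni-, bi- and tri-grams).
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def trigram_prob(counts, trigram):
    # Relative frequency of the trigram among all observed trigrams.
    total = sum(counts[3].values())
    return counts[3][trigram] / total if total else 0.0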

Especially when n-grams with large n are used, data sparseness becomes an issue. The training data may not contain any occurrences of the particular sequence of n symbols, even though the sequence is correct. In that case, the probability extracted from the training data will be zero, even though the correct probability should be non-zero (albeit small). To reduce this problem we can either use back-off or smoothing when the probability of an n-gram is zero. In the case of back-off, the probabilities of lower-order n-grams are taken into account when needed. Alternatively, smoothing techniques (Chen and Goodman, 1996) redistribute the probabilities, taking into account previously unseen word sequences.

Even though the language models provide us with probabilities of entire sequences, we are only interested in the n-grams directly around the confusible when using the language models in the context of confusible disambiguation. The probabilities of the rest of the sequence will remain the same whichever alternative confusible is inserted in the focus word position. Figure 1 illustrates that the probability of, for example, P(analysts had expected) is irrelevant for the decision between then and than, because it occurs in both sequences.

The different language models we will consider here are essentially the same. The differences lie in how they handle sequences that have zero probability. Since the probabilities of the n-grams are multiplied, having an n-gram probability of zero results in a zero probability for the entire sequence. There may be two reasons for an n-gram to have probability zero: there is not enough training data, so this sequence has not been seen yet, or this sequence is not valid in the language.

When it is known that a sequence is not valid in the language, this information can be used to decide which word from the confusible set should be selected. However, when the sequence simply has not been seen in the training data yet, we cannot rely on this information. To resolve the sequences with zero probability, we can use smoothing. However, this assumes that the sequence is valid, but has not been seen during training. The other solution, back-off, tries not to make this assumption. It checks whether subsequences of the sequence are valid, i.e. have non-zero probabilities. Because of this, we will not use smoothing to reach non-zero probabilities in the current experiments, although this may be investigated further in the future.

The first language model that we will investigate here is a linear combination of the different n-grams. The probability of a sequence is computed by a linear combination of weighted n-gram probabilities. We will report on two different weight settings: one system using uniform weighting, called uniform linear, and one where uni-grams receive weight 1, bi-grams weight 138, and tri-grams weight 437.² These weights are normalized to yield a final probability for the sequence, resulting in the second system, called weighted linear.
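The weighted linear combination can be written out in a few lines. The sketch below assumes the hypothetical counting helpers from the previous fragment and simply interpolates normalized uni-, bi- and tri-gram probabilities with the reported weights (1, 138, 437); it illustrates the idea and is not the exact system used in the experiments.

WEIGHTS = {1: 1.0, 2: 138.0, 3: 437.0}   # weights reported in the text
NORM = sum(WEIGHTS.values())

def ngram_prob(counts, ngram):
    # Relative frequency of an n-gram among all n-grams of the same order.
    n = len(ngram)
    total = sum(counts[n].values())
    return counts[n][ngram] / total if total else 0.0

def weighted_linear_prob(counts, w1, w2, w3):
    # Weighted, normalized combination of the uni-, bi- and tri-gram
    # probabilities at one position (the "weighted linear" model).
    return (WEIGHTS[1] * ngram_prob(counts, (w3,))
            + WEIGHTS[2] * ngram_prob(counts, (w2, w3))
            + WEIGHTS[3] * ngram_prob(counts, (w1, w2, w3))) / NORM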

The third system uses the probabilities of the different n-grams separately, instead of using the probabilities of all n-grams at the same time as is done in the linear systems. The continuous back-off method uses only one of the probabilities at each position, preferring the higher-level probabilities. This model provides a step-wise back-off. The probability of a sequence is that of the tri-grams contained in that sequence. However, if the probability of a tri-gram is zero, a back-off to the probabilities of the two bi-grams of the sequence is used. If that is still zero, the uni-gram probability at that position is used. Note that this uni-gram probability is exactly what the baseline system uses.

² These weights are selected by computing the accuracy of all combinations of weights on a held-out set.



Figure 1: Computation of probabilities using the language model. For the sentence ". . . much stronger ___ most analysts had expected .", the candidate than is scored as P(much stronger than) × P(stronger than most) × P(than most analysts), and the candidate then as P(much stronger then) × P(stronger then most) × P(then most analysts).

With this approach it may be the case that the probability for one word in the confusible set is computed based on tri-grams, whereas the probability of another word in the set of confusibles is based on bi-grams or even the uni-gram probability. Effectively, this means that different kinds of probabilities are compared. The same weights as in the weighted linear system are used.

To resolve the problem of unbalanced probabilities, a fourth language model, called synchronous back-off, is proposed. Whereas in the case of the continuous back-off model, two words from the confusible set may be scored using probabilities of different-level n-grams, the synchronous back-off model uses probabilities of the same level of n-grams for all words in the confusible set, with n being the highest value for which at least one of the words has a non-zero probability. For instance, when word a has a tri-gram probability of zero and word b has a non-zero tri-gram probability, b is selected. When both have a zero tri-gram probability, a back-off to bi-grams is performed for both words. This is in line with the idea that if a probability is zero, the training data is sufficient, hence the sequence is not in the language.
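A compact way to state the difference is that synchronous back-off lowers n for all candidates at once. The sketch below is an illustration only: it reuses the hypothetical ngram_prob helper from above and, for brevity, scores a single n-gram position (the left context of the focus word) rather than all n-grams around it.

def synchronous_backoff_scores(counts, left2, left1, candidates):
    # Score every candidate with the highest-order n-gram level at which
    # at least one candidate has a non-zero probability; all candidates
    # are scored at that same level (hence "synchronous").
    for ngrams in (
        [(left2, left1, c) for c in candidates],   # tri-grams
        [(left1, c) for c in candidates],          # bi-grams
        [(c,) for c in candidates],                # uni-grams
    ):
        scores = [ngram_prob(counts, g) for g in ngrams]
        if any(s > 0.0 for s in scores):
            return dict(zip(candidates, scores))
    return {c: 0.0 for c in candidates}

# Continuous back-off, by contrast, would back off per candidate, possibly
# comparing a tri-gram score for one word with a bi-gram score for another.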

To implement the specific classifiers, we used the TiMBL implementation of a k-NN classifier (Daelemans et al., 2007). This implementation of the k-NN algorithm is called IB1. We have tuned the different parameter settings for the k-NN classifier using Paramsearch (Van den Bosch, 2004), which resulted in a k of 35.³ To describe the instances, we try to model the data as similarly as possible to the data used by the generic classifier approach. Since the language model approaches use n-grams with n = 3 as the largest n, the features for the specific classifier approach use the words one and two positions to the left and right of the focus word. The focus word becomes the class that needs to be predicted. We show an example of both training and testing in Figure 2. Note that the features for the machine learning classifiers could be expanded with, for instance, part-of-speech tags, but in the current experiments only the word forms are used as features.

³ We note that k is handled slightly differently in TiMBL than usual: k denotes the number of closest distances considered, so if there are multiple instances that have the same (closest) distance, they are all considered.
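The instance extraction itself is a simple windowing operation. The sketch below shows one way to generate (features, class) pairs for a given confusible set, with two words of context on each side, mirroring Figure 2; it is illustrative only and not the preprocessing used for the experiments.

def extract_instances(tokens, confusible_set, width=2):
    # Yield (features, class) pairs: the features are the `width` words to
    # the left and right of an occurrence of a confusible; the class is the
    # confusible itself (the focus word).
    padded = ["_"] * width + list(tokens) + ["_"] * width
    for i, word in enumerate(tokens):
        if word in confusible_set:
            j = i + width                      # position in the padded list
            left = padded[j - width:j]
            right = padded[j + 1:j + 1 + width]
            yield tuple(left + right), word

# Example (cf. Figure 2):
# list(extract_instances("much stronger than most analysts had expected".split(),
#                        {"then", "than"}))
# -> [(("much", "stronger", "most", "analysts"), "than")]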

In addition to the k-NN classifier, we also run the experiments using the IGTree classifier (denoted IGTree in the rest of the article), which is also contained in the TiMBL distribution. IGTree is a fast, trie-based approximation of k-nearest neighbor classification (Knuth, 1973; Daelemans et al., 1997). IGTree allows for fast training and testing even with millions of examples. IGTree compresses a set of labeled examples into a decision tree structure similar to the classic C4.5 algorithm (Quinlan, 1993), except that throughout one level in the IGTree decision tree, the same feature is tested. Classification in IGTree is a simple procedure in which the decision tree is traversed from the root node down, and one path is followed that matches the actual values of the new example to be classified. If a leaf is found, the outcome stored at the leaf of the IGTree is returned as the classification. If the last node is not a leaf node, but there are no outgoing arcs that match a feature-value combination of the instance, the most likely outcome stored at that node is produced as the resulting classification. This outcome is computed by collating the outcomes of all leaf nodes that can be reached from the node.

IGTree is typically able to compress a large example set into a lean decision tree with high compression factors. This is done in reasonably short time, comparable to other compression algorithms. More importantly, IGTree's classification time depends only on the number of features (O(f)). Indeed, in our experiments we observe high compression rates. One of the unique characteristics of IGTree compared to basic k-NN is its resemblance to smoothing of a basic language model (Zavrel and Daelemans, 1997), while still being a generic classifier that supports any number and type of features. For these reasons, IGTree is also included in the experiments.

Training:  . . . much stronger than most analysts had expected .
           〈much, stronger, most, analysts〉 ⇒ than

Testing:   . . . much stronger ___ most analysts had expected .
           〈much, stronger, most, analysts〉 ⇒ ?

Figure 2: During training, a classified instance (in this case for the confusible pair {then, than}) is generated from a sentence. During testing, a similar instance is generated. The classifier decides what the corresponding class is, and hence which word should be the focus word.

3.2 Experimental settings

The probabilities used in the language models of the generic classifiers are computed by looking at occurrences of n-grams. These occurrences are extracted from a corpus. The training instances used in the specific machine learning classifiers are also extracted from the same data set. For training purposes, we used the Reuters news corpus RCV1 (Lewis et al., 2004). The Reuters corpus contains about 810,000 categorized newswire stories as published by Reuters in 1996 and 1997. This corpus contains around 130 million tokens.

For testing purposes, we used the Wall Street Journal part of the Penn Treebank corpus (Marcus et al., 1993). This well-known corpus contains articles from the Wall Street Journal from 1987 to 1989. We extract our test instances from this corpus in the same way as we extract our training data from the Reuters corpus. There are minor tokenization differences between the corpora; the data is corrected for these differences.

Both corpora are in the domain of English-language news texts, so we expect them to have similar properties. However, they are different corpora and hence are slightly different, which means that there are also differences between the training and testing set. We have selected this division to create a more realistic setting. This should allow for a comparison closer to real-world use than when both training and testing instances are extracted from the same corpus.

For the specific experiments, we selected a number of well-known confusible sets to test the different approaches. In particular, we look at {then, than}, {its, it's}, {your, you're}, and {their, there, they're}. To compare the difficulty of these problems, we also selected two words at random and used them as a confusible set.

The random category consists of two words that were randomly selected from all words in the Reuters corpus that occurred more than a thousand times. The words that were chosen, and used for all experiments here, are refugees and effect. They occur around 27 thousand times in the Reuters corpus.

3.3 Empirical results

Table 1 sums up the results we obtained with the different systems. The baseline scores are generally very high, which tells us that the distribution of classes in a single confusible set is severely skewed, up to a ten-to-one ratio. This also makes the task hard: there are many examples for one word in the set, but only very few training instances for the other(s). However, it is especially important to recognize the important aspects of the minority class.

The results clearly show that the specific classifier approaches outperform the other systems. For instance, on the first task ({then, than}) the classifier achieves an accuracy slightly over 98%, whereas the language model systems only yield around 96%. This is as expected: the classifier is trained on just one confusible task and is therefore able to specialize on that task.

Comparing the two specific classifiers, we see that the accuracy achieved by IB1 and IGTree is quite similar. In general, IGTree performs a bit worse than IB1 on all confusible sets, which is as expected. However, in general it is possible for IGTree to outperform IB1 on certain tasks. In our experience this mainly happens on tasks where the usage of IGTree, allowing for more compact internal representations, allows one to use much more training data. IGTree also leads to improved performance in cases where the features have a strong, absolute ordering of importance with respect to the classification problem at hand.

                       {then, than}   {its, it's}   {your, you're}   {their, there, they're}   random
Baseline                   82.63          92.42          78.55               68.36              93.16
IB1                        98.01          98.67          96.36               97.12              97.89
IGTree                     97.07          96.75          96.00               93.02              95.79
Uniform linear             68.27          50.70          31.64               32.72              38.95
Weighted linear            94.43          92.88          93.09               93.25              88.42
Continuous back-off        81.49          83.22          74.18               86.01              63.68
Synchronous back-off       96.42          94.10          92.36               93.06              87.37
Number of cases            2,458          4,830            275               3,053                190

Table 1: This table shows the performance achieved by the different systems, shown in accuracy (%). The Number of cases row denotes the number of instances in the test set.

The generic language model approaches perform reasonably well. However, there are clear differences between the approaches. For instance, the weighted linear and synchronous back-off approaches work well, but uniform linear and continuous back-off perform much worse. Especially the synchronous back-off approach achieves decent results, regardless of the confusible problem.

It is not very surprising to see that the continuous back-off method performs worse than the synchronous back-off method. Remember that the continuous back-off method always uses lower-level n-grams when zero probabilities are found. This is done independently of the probabilities of the other words in the confusible set. The continuous back-off method prefers n-grams with larger n, but it does not penalize backing off to an n-gram with smaller n. Combine this with the fact that n-gram probabilities with large n are comparatively lower than those for n-grams with smaller n, and it becomes likely that a bi-gram contributes more to the erroneous option than the correct tri-gram does to the correct option. Tri-grams are more sparse than bi-grams, given the same data.

The weighted linear approach outperforms the uniform linear approach by a large margin on all confusible sets. It is likely that, in the uniform linear method, the probabilities of the n-grams with smaller n overrule the contribution from the n-grams with larger n. This causes a bias towards the more frequent words, compounded by the fact that bi-grams, and uni-grams even more so, are less sparse and therefore contribute more to the total probability.

We see that both the generic and specific classifier approaches perform consistently across the different confusible sets. The synchronous back-off approach is the best-performing generic classifier approach we tested. It consistently outperforms the baseline, and overall performs better than the weighted linear approach.

The experiments show that generic classifiers based on language models can be used in the context of confusible disambiguation. However, the n in the different n-grams is of major importance. Exactly which n-grams should be used to compute the probability of a sequence requires more research. The experiments also show that approaches that concentrate on n-grams with larger n yield more encouraging results.

4 Conclusion and future work

Confusibles are spelling errors that can only be detected within their sentential context. This kind of error requires a completely different approach compared to non-word errors (errors that can be identified out of context, i.e. sequences of characters that do not belong to the language). In practice, most confusible disambiguation systems are based on machine learning classification techniques, where for each type of confusible a new classifier is trained and tuned.

In this article, we investigate the use of language models in the context of confusible disambiguation. This approach works by selecting the word in the set of confusibles that has the highest probability in the sentential context according to the language model. Any kind of language model can be used in this approach.

The main advantage of using language models as generic classifiers is that it is easy to add new sets of confusibles without retraining or adding additional classifiers. The entire language is modeled, which means that all the information on words in their context is inherently present.

The experiments show that using generic classifiers based on simple n-gram language models yields slightly worse results compared to the specific classifier approach, where each classifier is specifically trained on one confusible set. However, the advantage of the generic classifier approach is that only one system has to be trained, compared to a different system for each confusible set in the specific classifier case. Also, the exact computation of the probabilities using the n-grams, in particular the means of backing off, has a large impact on the results.

As future work, we would like to investigate the accuracy of more complex language models used as classifiers. The n-gram language models described here are relatively simple, but more complex language models could improve performance. In particular, instead of back-off, smoothing techniques could be investigated to reduce the impact of zero-probability problems (Chen and Goodman, 1996). This assumes that the training data we are currently working with is not enough to properly describe the language.

Additionally, language models that concentrate on more structural descriptions of the language, for instance using grammatical inference techniques (de la Higuera, 2005), or models that explicitly take long-distance dependencies into account (Griffiths et al., 2005), can be investigated. This leads to much richer language models that could, for example, check whether there is already a verb in the sentence (which helps in cases such as {its, it's}).

A different route which we would also like to investigate is the usage of a specific classifier, such as TiMBL's IGTree, as a language model. If a classifier is trained to predict the next word in the sentence, or to predict the word at a given position with both left and right context as features, it can be used to estimate the probability of the words in a confusible set, just like the language models we have looked at so far. Another type of classifier might estimate the perplexity at a position, or provide some other measure of "surprisedness". Effectively, these approaches all take a model of the entire language (as described in the training data) into account.

References

Banko, M. and Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 26–33. Association for Computational Linguistics.

Chen, S. and Goodman, J. (1996). An empirical study of smoothing techniques for language modelling. In Proceedings of the 34th Annual Meeting of the ACL, pages 310–318. ACL.

Daelemans, W., Van den Bosch, A., and Weijters, A. (1997). IGTree: using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, 11:407–423.

Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2007). TiMBL: Tilburg Memory Based Learner, version 6.1, reference guide. Technical Report ILK 07-07, ILK Research Group, Tilburg University.

de la Higuera, C. (2005). A bibliographical study of grammatical inference. Pattern Recognition, 38(9):1332–1348.

Even-Zohar, Y. and Roth, D. (2000). A classification approach to word prediction. In Proceedings of the First North American Conference on Computational Linguistics, pages 124–131, New Brunswick, NJ. ACL.

Golding, A. and Roth, D. (1999). A Winnow-based approach to context-sensitive spelling correction. Machine Learning, 34(1–3):107–130.

Golding, A. R. (1995). A Bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the 3rd Workshop on Very Large Corpora, ACL-95.

Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. (2005). Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, pages 537–544. MIT Press.

Huang, J. H. and Powers, D. W. (2001). Large scale experiments on correction of confused words. In Australasian Computer Science Conference Proceedings, pages 77–82, Queensland AU. Bond University.

Knuth, D. E. (1973). The art of computer programming, volume 3: Sorting and searching. Addison-Wesley, Reading, MA.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.

Mangu, L. and Brill, E. (1997). Automatic rule acquisition for spelling correction. In Proceedings of the International Conference on Machine Learning, pages 187–194.



Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.

Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Sandra, D., Daems, F., and Frisson, S. (2001). Zo helder en toch zoveel fouten! Wat leren we uit psycholinguïstisch onderzoek naar werkwoordfouten bij ervaren spellers? Tijdschrift van de Vereniging voor het Onderwijs in het Nederlands, 30(3):3–20.

Van den Bosch, A. (2004). Wrapped progressive sampling search for optimizing learning algorithm parameters. In Verbrugge, R., Taatgen, N., and Schomaker, L., editors, Proceedings of the Sixteenth Belgian-Dutch Conference on Artificial Intelligence, pages 219–226, Groningen, The Netherlands.

Van den Bosch, A. (2006). Scalable classification-based word prediction and confusible correction. Traitement Automatique des Langues, 46(2):39–63.

Van den Bosch, A. and Daelemans, W. (2007). Dat gebeurd mei niet: Computationele modellen voor verwarbare homofonen. In Tussen Taal, Spelling en Onderwijs, pages 199–210. Academia Press.

Wu, D., Sui, Z., and Zhao, J. (1999). An information-based method for selecting feature types for word prediction. In Proceedings of the Sixth European Conference on Speech Communication and Technology, EUROSPEECH'99, Budapest.

Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. In Proceedings of the Annual Meeting of the ACL, pages 88–95.

Zavrel, J. and Daelemans, W. (1997). Memory-based learning: Using similarity for smoothing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 436–443.



Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 49–57, Athens, Greece, 30 March 2009. ©2009 Association for Computational Linguistics

On statistical parsing of French with supervised and semi-supervised strategies

Marie Candito*, Benoît Crabbé* and Djamé Seddah⋄

* Université Paris 7, UFRL et INRIA (Alpage), 30 rue du Château des Rentiers, F-75013 Paris, France

⋄ Université Paris 4, LALIC et INRIA (Alpage), 28 rue Serpente, F-75006 Paris, France

Abstract

This paper reports results on grammatical induction for French. We investigate how to best train a parser on the French Treebank (Abeillé et al., 2003), viewing the task as a trade-off between generalizability and interpretability. We compare, for French, a supervised lexicalized parsing algorithm with a semi-supervised unlexicalized algorithm (Petrov et al., 2006) along the lines of (Crabbé and Candito, 2008). We report the best results known to us on French statistical parsing, which we obtained with the semi-supervised learning algorithm. The reported experiments can give insights for the task of grammatical learning for a morphologically rich language, with a relatively limited amount of training data, annotated with a rather flat structure.

1 Natural language parsing

Despite the availability of annotated data, there have been relatively few works on French statistical parsing. Together with a treebank, the availability of several supervised or semi-supervised grammatical learning algorithms, primarily set up on English data, allows us to figure out how they behave on French.

Before that, it is important to describe the characteristics of the parsing task. In the case of statistical parsing, two different aspects of syntactic structures are to be considered: their capacity to capture regularities and their interpretability for further processing.

Generalizability Learning for statistical parsing requires structures that best capture the underlying regularities of the language, in order to apply these patterns to unseen data.

Since capturing underlying linguistic rules is also an objective for linguists, it makes sense to use supervised learning from linguistically-defined generalizations. One generalization is typically the use of phrases, and phrase-structure rules that govern the way words are grouped together. It has to be stressed that these syntactic rules exist at least in part independently of semantic interpretation.

Interpretability But the main reason to use supervised learning for parsing is that we want structures that are as interpretable as possible, in order to extract some knowledge from the analysis (such as deriving a semantic analysis from a parse). Typically, we need a syntactic analysis to reflect how words relate to each other. This is our main motivation to use supervised learning: the learnt parser will output structures as defined by linguist annotators, and thus interpretable within the linguistic theory underlying the annotation scheme of the treebank. It is important to stress that this is more than capturing syntactic regularities: it has to do with the meaning of the words.

It is not certain though that both requirements (generalizability / interpretability) are best met in the same structures. In the case of supervised learning, this leads to investigating different instantiations of the training trees, to help the learning, while keeping the maximum interpretability of the trees. As we will see with some of our experiments, it may be necessary to find a trade-off between generalizability and interpretability.

Further, it is not guaranteed that syntactic rules inferred from a manually annotated treebank produce the best language model. This leads to methods that use semi-supervised techniques on a treebank-inferred grammar backbone, such as (Matsuzaki et al., 2005; Petrov et al., 2006).

The plan of the paper is as follows: in the next section, we describe the available treebank for French, and how its structures can be interpreted. In section 3, we describe the typical problems encountered when parsing with a plain probabilistic context-free grammar, and existing algorithmic solutions that try to circumvent these problems. Next we describe experiments and results when training parsers on the French data. Finally, we discuss related work and conclude.

2 Interpreting the French trees

The French Treebank (Abeillé et al., 2003) is a publicly available sample from the newspaper Le Monde, syntactically annotated and manually corrected for French.

<SENT>
  <NP fct="SUJ">
    <w cat="D" lemma="le" mph="ms" subcat="def">le</w>
    <w cat="N" lemma="bilan" mph="ms" subcat="C">bilan</w>
  </NP>
  <VN>
    <w cat="ADV" lemma="ne" subcat="neg">n’</w>
    <w cat="V" lemma="être" mph="P3s" subcat="">est</w>
  </VN>
  <AdP fct="MOD">
    <w compound="yes" cat="ADV" lemma="peut-être">
      <w catint="V">peut</w>
      <w catint="PONCT">-</w>
      <w catint="V">être</w>
    </w>
    <w cat="ADV" lemma="pas" subcat="neg">pas</w>
  </AdP>
  <AP fct="ATS">
    <w cat="ADV" lemma="aussi">aussi</w>
    <w cat="A" lemma="sombre" mph="ms" subcat="qual">sombre</w>
  </AP>
  <w cat="PONCT" lemma="." subcat="S">.</w>
</SENT>

Figure 1: Simplified example of the FTB
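For readers who want to manipulate this format, the annotations above can be read with any XML library. The sketch below is a simplified reader, not the authors' preprocessing code; it assumes only the element and attribute names visible in Figure 1, and extracts the token/category sequence of one sentence while merging compound <w> elements into single tokens, as is done for the experiments in section 4.

import xml.etree.ElementTree as ET

def sentence_tokens(sent_elem, merge_compounds=True):
    # Collect (form, cat) pairs for every word element below a <SENT> node.
    tokens = []
    for w in sent_elem.iter("w"):
        if w.get("catint") is not None:
            continue                       # parts of a compound are handled below
        if merge_compounds and w.get("compound") == "yes":
            # Concatenate the sub-tokens of the compound into one token.
            form = "".join((part.text or "") for part in w.findall("w"))
        else:
            form = w.text or ""
        tokens.append((form.strip(), w.get("cat")))
    return tokens

# Example, with `xml_string` holding the <SENT> element of Figure 1:
# sent = ET.fromstring(xml_string)
# sentence_tokens(sent)
# -> [('le', 'D'), ('bilan', 'N'), ('n’', 'ADV'), ('est', 'V'),
#     ('peut-être', 'ADV'), ('pas', 'ADV'), ('aussi', 'ADV'),
#     ('sombre', 'A'), ('.', 'PONCT')]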

To encode syntactic information, the FTB uses a combination of labeled constituents, morphological annotations and functional annotation for verbal dependents, as illustrated in Figure 1. This constituent and functional annotation was performed in two successive steps: though the original release (Abeillé et al., 2000) consists of 20,648 sentences (hereafter FTB-V0), the functional annotation was performed later on a subset of 12,351 sentences (hereafter FTB). This subset has also been revised, and is known to be more consistently annotated. This is the release we use in our experiments. Its key properties, compared with the Penn Treebank (hereafter PTB), are the following:

Size: The FTB is made of 385,458 tokens and 12,351 sentences, that is a third of the PTB. The average length of a sentence is 31 tokens in the FTB, versus 24 tokens in the PTB.

Inflection: French morphology is richer than English and leads to increased data sparseness for statistical parsing. There are 24,098 types in the FTB, entailing an average of 16 tokens occurring for each type (versus 12 for the PTB).

Flat structure: The annotation scheme is flatter in the FTB than in the PTB. For instance, there are no VPs for finite verbs, and only one sentential level for sentences whether introduced by a complementizer or not. We can measure the corpus flatness using the ratio between tokens and non-terminal symbols, excluding preterminals. We obtain 0.69 NT symbols per token for the FTB and 1.01 for the PTB.

Compounds: Compounds are explicitly annotated (see the compound peut-être in Figure 1) and very frequent: 14.52% of tokens are part of a compound. They include digital numbers (written with spaces in French: 10 000), very frozen compounds such as pomme de terre (potato), but also named entities or sequences whose meaning is compositional but where insertion is rare or difficult (garde d'enfant (child care)).

Now let us focus on what is expressed in the French annotation scheme, and why syntactic information is split between constituency and functional annotation.

Syntactic categories and constituents capture distributional generalizations. A syntactic category groups forms that share distributional properties. Nonterminal symbols that label the constituents are a further generalization over sequences of categories or constituents. For instance, about anywhere it is grammatical to have a given NP, it is implicitly assumed that it will also be grammatical, though maybe nonsensical, to have instead any other NP. Of course this is known to be false in many cases: for instance NPs with or without determiners have very different distributions in French (that may justify a different label), but they also share a lot. Moreover, if words are taken into account, and not just sequences of categories, then constituent labels are a very coarse generalization.

Constituents also encode dependencies: for instance the different PP attachments for the sentences I ate a cake with cream / with a fork reflect that with cream depends on cake, whereas with a fork depends on ate. More precisely, a syntagmatic tree can be interpreted as a dependency structure using the following conventions.



For each constituent, given the dominating symbol and the internal sequence of symbols, (i) a head symbol can be isolated and (ii) the siblings of that head can be interpreted as containing dependents of that head. Given these constraints, the syntagmatic structure may exhibit various degrees of flatness for internal structures.
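This head-based reading of the trees is also what underlies the unlabeled dependency evaluation used later (section 4.1). The sketch below is a minimal illustration of the conversion; the head table is a toy example, not the (Dybro-Johansen, 2004) rules actually used in the paper.

# Toy head table: for each nonterminal, which child categories are preferred
# as head, searched left to right (purely illustrative, not the real rules).
HEAD_TABLE = {"SENT": ["VN", "NP"], "NP": ["N"], "VN": ["V"], "PP": ["P"], "AP": ["A"]}

def find_head(label, children):
    # children: list of (child_label, subtree) pairs.
    for preferred in HEAD_TABLE.get(label, []):
        for i, (child_label, _) in enumerate(children):
            if child_label == preferred:
                return i
    return 0   # default: leftmost child

def dependencies(tree):
    # tree: (label, [subtrees]) for internal nodes, (tag, token) for leaves.
    # Returns (head_token, list of (dependent, governor) pairs).
    label, children = tree
    if isinstance(children, str):            # preterminal: (tag, token)
        return children, []
    analysed = [dependencies(c) for c in children]
    heads = [h for h, _ in analysed]
    deps = [d for _, ds in analysed for d in ds]
    h = find_head(label, [(c[0], c) for c in children])
    for i, head_word in enumerate(heads):
        if i != h:
            deps.append((head_word, heads[h]))   # sibling heads depend on the head child
    return heads[h], deps

# Example:
# tree = ("SENT", [("NP", [("D", "le"), ("N", "bilan")]),
#                  ("VN", [("V", "est")])])
# dependencies(tree) -> ('est', [('le', 'bilan'), ('bilan', 'est')])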

Functional annotation Dependencies are encoded in constituents. While X-bar inspired constituents are supposed to contain all the syntactic information, in the FTB the shape of the constituents does not necessarily express unambiguously the type of dependency existing between a head and a dependent appearing in the same constituent. Yet this is crucial, for example, to extract the underlying predicate-argument structures. This has led to a "flat" annotation scheme, completed with functional annotations that inform on the type of dependency existing between a verb and its dependents. This was chosen for French to reflect, for instance, the possibility to mix post-verbal modifiers and complements (Figure 2), or to mix post-verbal subjects and post-verbal indirect complements: a post-verbal NP in the FTB can correspond to a temporal modifier, (most often) a direct object, or an inverted subject, and in all three cases other subcategorized complements may appear.

(SENT (NP-SUJ (D une) (N lettre))
      (VN (V avait) (V été) (V envoyée))
      (NP-MOD (D la) (N semaine) (A dernière))
      (PP-AOBJ (P aux) (NP (N salariés))))

(SENT (NP-SUJ (D Le) (N Conseil))
      (VN (V a) (V notifié))
      (NP-OBJ (D sa) (N décision))
      (PP-AOBJ (P à) (NP (D la) (N banque))))

Figure 2: Two examples of post-verbal NPs: a direct object and a temporal modifier

3 Algorithms for probabilistic grammar learning

We propose here to investigate how to apply statistical parsing techniques, mainly tested on English, to another language, French. In this section we briefly introduce the algorithms investigated.

Though Probabilistic Context-Free Grammars (PCFGs) are a baseline formalism for probabilistic parsing, they suffer from a fundamental problem for the purpose of natural language parsing: the independence assumptions made by the model are too strong. In other words, all decisions are local to a grammar rule.

However, as clearly pointed out by (Johnson, 1998), decisions have to take into account non-local grammatical properties: for instance, a noun phrase realized in subject position is more likely to be realized by a pronoun than a noun phrase realized in object position. Solving this first methodological issue has led to solutions dubbed hereafter unlexicalized statistical parsing (Johnson, 1998; Klein and Manning, 2003a; Matsuzaki et al., 2005; Petrov et al., 2006).

A second class of non-local decisions to be taken into account while parsing natural languages is related to handling lexical constraints. As shown above, the subcategorization properties of a predicative word may have an impact on the decisions concerning the tree structures to be associated with a given sentence. Solving this second methodological issue has led to solutions dubbed hereafter lexicalized parsing (Charniak, 2000; Collins, 1999).

In a supervised setting, a third and practical problem turns out to be critical: that of data sparseness, since available treebanks are generally too small to get reasonable probability estimates. Three classes of solutions are possible to reduce data sparseness: (1) enlarging the data manually or automatically (e.g. (McClosky et al., 2006) uses self-training to perform this step), (2) smoothing, usually performed using a markovization procedure (Collins, 1999; Klein and Manning, 2003a), and (3) making the data more coarse (i.e. clustering).

3.1 Lexicalized algorithm

The first algorithm we use is the lexicalized parser of (Collins, 1999). It is called lexicalized, as it annotates non-terminal nodes with an additional latent symbol: the head word of the subtree. This additional information attached to the categories aims at capturing bilexical dependencies in order to perform informed attachment choices.

The addition of these numerous latent symbols to non-terminals naturally entails an over-specialization of the resulting models. To ensure generalization, it therefore requires additional simplifying assumptions, formulated as a variant of the usual naïve Bayesian-style simplifying assumptions: the probability of emitting a non-head node is assumed to depend on the head and the mother node only, and not on other sibling nodes.¹

Since Collins demonstrated his models to significantly improve parsing accuracy over bare PCFGs, lexicalization has been thought of as a major feature for probabilistic parsing. However, two problems are worth stressing here: (1) the reason why these models improve over bare PCFGs is not guaranteed to be tied to the fact that they capture bilexical dependencies, and (2) there is no guarantee that capturing non-local lexical constraints yields an optimal language model.

Concerning (1), (Gildea, 2001) showed that full lexicalization has in fact a small impact on results: he reimplemented an emulation of Collins' Model 1 and found that removing all references to bilexical dependencies in the statistical model² resulted in a very small parsing performance decrease (PARSEVAL recall on the WSJ decreased from 86.1 to 85.6). Further studies conducted by (Bikel, 2004a) showed that bilexical information is indeed used by the most probable parses. The idea is that most bilexical parameters are very similar to their back-off distribution and therefore have a minor impact. In the case of French, this effect can only be stronger, with one third of the training data compared to English, and with a much richer inflection that worsens lexical data sparseness.

Concerning (2), the addition of head word annotations is tied to the use of manually defined heuristics that are highly dependent on the annotation scheme of the PTB. For instance, Collins' models integrate a treatment of coordination that is not adequate for the FTB-like coordination annotation.

3.2 Unlexicalized algorithms

Another class of algorithms, arising from (Johnson, 1998; Klein and Manning, 2003a), attempts to attach additional latent symbols to treebank categories without focusing exclusively on lexical head words. For instance, the additional annotations will try to capture non-local preferences like the fact that an NP in subject position is more likely realized as a pronoun.

¹ This short description cannot do justice to the (Collins, 1999) proposal, which indeed includes more fine-grained information and a back-off model. We only keep here the key aspects of his work relevant for the current discussion.

² Let us consider a dependent constituent C with head word Chw and head tag Cht, and let C be governed by a constituent H, with head word Hhw and head tag Hht. Gildea compares Collins' model, where the emission of Chw is conditioned on Hhw, and a "mono-lexical" model, where the emission of Chw is not conditioned on Hhw.

The first unlexicalized algorithms set up in this trend (Johnson, 1998; Klein and Manning, 2003a) also use language-dependent and manually defined heuristics to add the latent annotations. The specialization induced by this additional annotation is counterbalanced by simplifying assumptions, dubbed markovization (Klein and Manning, 2003a).

Using hand-defined heuristics remains problematic, since we have no guarantee that the latent annotations added in this way will allow us to extract an optimal language model.

A further development was first introduced by (Matsuzaki et al., 2005), who recast the problem of adding latent annotations as an unsupervised learning problem: given an observed PCFG induced from the treebank, the latent grammar is generated by combining every non-terminal of the observed grammar with a predefined set H of latent symbols. The parameters of the latent grammar are estimated from the observed trees using a specific instantiation of EM.

This first procedure however entails a combinatorial explosion in the size of the latent grammar as |H| increases. (Petrov et al., 2006) (hereafter BKY) overcomes this problem by using the following algorithm: given a PCFG G0 induced from the treebank, iteratively create n grammars G1 ... Gn (with n = 5 in practice), where each iterative step is as follows:

• SPLIT Create a new grammar Gi from Gi−1 by splitting every non-terminal of Gi−1 in two new symbols. Estimate Gi's parameters on the observed treebank using a variant of inside-outside. This step adds the latent annotation to the grammar.

• MERGE For each pair of symbols obtained by a previous split, try to merge them back. If the likelihood of the treebank does not get significantly lower (fixed threshold), then keep the symbols merged, otherwise keep the split.

• SMOOTH This step consists in smoothing the probabilities of the grammar rules sharing the same left-hand side.

This algorithm yields state-of-the-art results on English.³ Its key interest is that it directly aims at finding an optimal language model without (1) making additional assumptions on the annotation scheme and (2) relying on hand-defined heuristics. This may be viewed as a case of a semi-supervised learning algorithm, since the initial supervised learning step is augmented with a second step of unsupervised learning dedicated to assigning the latent symbols.
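The split/merge/smooth cycle can be summarized as a small driver loop. The sketch below is only a schematic rendering of the procedure described above; the helper callables (split_grammar, estimate_em, merge_back, smooth) are hypothetical stand-ins for the corresponding steps of the Berkeley implementation, not its actual API.

def split_merge_training(g0, treebank, cycles=5,
                         split_grammar=None, estimate_em=None,
                         merge_back=None, smooth=None):
    # Schematic driver for the BKY latent-annotation learning loop:
    # each cycle splits every non-terminal in two, re-estimates the
    # parameters with inside-outside (EM), merges back splits that do
    # not significantly improve treebank likelihood, and smooths rules
    # sharing the same left-hand side.
    grammar = g0
    for _ in range(cycles):
        grammar = split_grammar(grammar)           # SPLIT: double the symbols
        grammar = estimate_em(grammar, treebank)   # fit parameters on observed trees
        grammar = merge_back(grammar, treebank)    # MERGE: undo useless splits
        grammar = smooth(grammar)                  # SMOOTH: within same LHS
    return grammar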

4 Experiments and Results

We investigate how some treebank features impact learning. We first describe the experimental protocol; next we compare results of lexicalized and unlexicalized parsers trained on various "instantiations" of the XML source files of the FTB, and the impact of training set size for both algorithms. Then we focus on studying how words impact the results of the BKY algorithm.

4.1 Protocol

Treebank setting For all experiments, the treebank is divided into 3 sections: training (80%), development (10%) and test (10%), made of respectively 9,881, 1,235 and 1,235 sentences. We systematically report the results with the compounds merged. Namely, we preprocess the treebank in order to turn each compound into a single token, both for training and test.

Software and adaptation to French For the Collins algorithm, we use Bikel's implementation (Bikel, 2004b) (hereafter BIKEL), and we report results using Collins model 1 and model 2, with internal tagging. Adapting model 1 to French requires designing French-specific head propagation rules. To this end, we adapted those described by (Dybro-Johansen, 2004) for extracting a Stochastic Tree Adjoining Grammar parser for French. And to adapt model 2, we have further designed French-specific argument/adjunct identification rules.

For the BKY approach, we use the Berkeley implementation, with a horizontal markovization h=0, and 5 split/merge cycles. All the required knowledge is contained in the treebank used for training, except for the treatment of unknown or rare words. The implementation clusters unknown words using typographical and morphological information. We adapted these clues to French, following (Arun and Keller, 2005).

³ (Petrov et al., 2006) obtain an F-score of 90.1 for sentences of less than 40 words.

Finally, we use as a baseline a standard PCFG algorithm, coupled with a trigram tagger (we refer to this setup as the TNT/LNCKY algorithm⁴).

Metrics For evaluation, we use the standard PARSEVAL metric of labeled precision/recall, along with unlabeled dependency evaluation, which is known as a more annotation-neutral metric. Unlabeled dependencies are computed using the (Lin, 1995) algorithm, and the Dybro-Johansen head propagation rules cited above.⁵ The unlabeled dependency F-score gives the percentage of input words (excluding punctuation) that receive the correct head. As usual for probabilistic parsing results, the results are given for sentences of the test set of less than 40 words (which is true for 992 sentences of the test set), and punctuation is ignored for F-score computation with both metrics.
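For reference, labeled PARSEVAL precision and recall reduce to comparing multisets of labeled spans between the gold and predicted trees. The sketch below is a generic illustration of the metric (punctuation handling omitted), not the evaluation script used for the reported numbers.

from collections import Counter

def labeled_spans(tree, start=0):
    # tree: (label, children) with children either a token (str) for a
    # preterminal or a list of subtrees. Returns (end, Counter of spans).
    label, children = tree
    if isinstance(children, str):            # preterminals are not counted
        return start + 1, Counter()
    spans = Counter()
    pos = start
    for child in children:
        pos, child_spans = labeled_spans(child, pos)
        spans += child_spans
    spans[(label, start, pos)] += 1
    return pos, spans

def parseval(gold_tree, pred_tree):
    _, gold = labeled_spans(gold_tree)
    _, pred = labeled_spans(pred_tree)
    correct = sum((gold & pred).values())
    precision = correct / sum(pred.values()) if pred else 0.0
    recall = correct / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1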

4.2 Comparison using minimal tagsets

We first derive from the FTB a minimally-informed treebank, TREEBANKMIN, instantiated from the XML source by using only the major syntactic categories and no other feature. In each experiment (Table 1) we observe that the BKY algorithm significantly outperforms the Collins models, for both metrics.

                     BKY      BIKEL M1   BIKEL M2   TNT/LNCKY
PARSEVAL LP         85.25      78.86      80.68       68.74
PARSEVAL LR         84.46      78.84      80.58       67.93
PARSEVAL F1         84.85      78.85      80.63       68.33
Unlab. dep. Prec.   90.23      85.74      87.60       79.50
Unlab. dep. Rec.    89.95      85.72      86.90       79.37
Unlab. dep. F1      90.09      85.73      87.25       79.44

Table 1: Results for parsers trained on the FTB with the minimal tagset

⁴ The tagger is TNT (Brants, 2000), and the parser is LNCKY, which is distributed by Mark Johnson (http://www.cog.brown.edu/~mj/Software.htm). Formally, because of the tagger, this is not a strict PCFG setup. Rather, it gives a practical trade-off, in which the tagger includes the lexical smoothing for unknown and rare words.

⁵ For this evaluation, the gold constituent trees are converted into pseudo-gold dependency trees (that may contain errors). Then parsed constituent trees are converted into parsed dependency trees, which are matched against the pseudo-gold trees.



4.3 Impact of training data size

How do the unlexicalized and lexicalized approaches perform with respect to size? We compare in Figure 3 the parsing performance of BKY and COLLINSM1 on increasingly large subsets of the FTB, in perfect tagging mode⁶ and using a more detailed tagset (the CC tagset, described in the next experiment). The same 1,235-sentence test set is used for all subsets, and the development set's size varies along with the training set's size. BKY outperforms the lexicalized model even with small amounts of data (around 3,000 training sentences). Further, the parsing improvement that would result from more training data seems higher for BKY than for Bikel.

[Figure 3 plots F-score against the number of training sentences (2,000 to 10,000) for the Bikel and Berkeley parsers.]

Figure 3: Parsing learning curve on the FTB with the CC tagset, in perfect tagging

This potential increase in BKY results if we had more French annotated data is somewhat confirmed by the higher results reported for BKY training on the Penn Treebank (Petrov et al., 2006): F1=90.2. We can show though that the 4-point increase when training on English data is not only due to size: we extracted from the Penn Treebank a subset comparable to the FTB with respect to number of tokens and average length of sentences. We obtain F1=88.61 with BKY training.

4.4 Symbol refinements

It is well known that certain treebank transformations involving symbol refinements improve PCFGs (see for instance the parent transformation of (Johnson, 1998), or various symbol refinements in (Klein and Manning, 2003b)). Lexicalization itself can be seen as symbol refinement (with back-off though). For BKY, though the key point is to automatize symbol splits, it is interesting to study whether manual splits still help.

We have thus experimented with BKY training using various tagsets. The FTB contains rich morphological information that can be used to split preterminal symbols: main coarse category (there are 13), subcategory (the subcat feature refining the main cat), and inflectional information (the mph feature).

We report in Table 2 results for the four tagsets, where terminals are made of: MIN: main cat; SUBCAT: main cat + subcat feature; MAX: cat + subcat + all inflectional information; CC: cat + verbal mood + wh feature.

⁶ For BKY, we simulate perfect tagging by changing words into word+tag in training, dev and test sets. We obtain around 99.8% tagging accuracy; errors are due to unknown words.

Tagset    Nb of tags   Parseval F1   Unlab. dep. F1   Tagging Acc
MIN           13          84.85           90.09           97.35
SUBCAT        34          85.74             –             96.63
MAX          250          84.13             –             92.20
CC            28          86.41           90.99           96.83

Table 2: Tagset impact on learning with BKY (own tagging)

The corpus instantiation with the CC tagset is our best trade-off between tagset informativeness and obtained parsing performance.⁷ It is also the best result obtained for French probabilistic parsing. This demonstrates though that the BKY learning is not optimal, since manual a priori symbol refinements significantly impact the results.

We also tried to learn structures with the functional annotation attached to the labels: we obtain PARSEVAL F1=78.73 with tags from the CC tagset + grammatical function. This degradation, due to data sparseness and/or non-local constraints badly captured by the model, currently constrains us to use a language model without functional information. As stressed in the introduction, this limits the interpretability of the parses and it is a trade-off between generalization and interpretability.

4.5 Lexicon and Inflection impact

French has a rich morphology that allows some degree of word order variation with respect to English. For probabilistic parsing, this can have contradictory effects: (i) on the one hand, this induces more data sparseness: the occurrences of a French regular verb are potentially split into more than 60 forms, versus 5 for an English verb; (ii) on the other hand, inflection encodes agreements, which can serve as clues for syntactic attachments.

⁷ The differences are statistically significant: using a standard t-test, we obtain p-value=0.015 between MIN and SUBCAT, and p-value=0.002 between CC and SUBCAT.

Experiment In order to measure the impact of inflection, we have tested clustering word forms on a morphological basis, namely partly cancelling inflection. Using lemmas as word form classes seems too coarse: it would not allow us to distinguish, for instance, between a finite verb and a participle, though they exhibit different distributional properties. Instead we use as word form classes the couple lemma + syntactic category. For example, for verbs, given the CC tagset, this amounts to keeping 6 different forms (for the 6 moods). To test this grouping, we derive a treebank where words are replaced by the concatenation of lemma + category, for training and testing the parser. Since it entails a perfect tagging, it has to be compared to results in perfect tagging mode; more precisely, we simulate perfect tagging by replacing word forms by the concatenation form+tag. Moreover, it is tempting to study the impact of a more drastic clustering of word forms: that of using the sole syntactic category to group word forms (we replace each word by its tag). This amounts to testing a pure unlexicalized learning.
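The three conditions compared in Figure 4 amount to a simple substitution on the terminal symbols of the training and test trees. The sketch below illustrates the preprocessing on (form, lemma, tag) triples; it is an illustration of the idea, not the authors' scripts.

def cluster_terminal(form, lemma, tag, mode):
    # mode selects the terminal representation used to train/test the parser:
    #   "form+tag"  : simulated perfect tagging (word forms kept)
    #   "lemma+tag" : inflection partly cancelled (lemma + category)
    #   "tag"       : purely unlexicalized (category only)
    if mode == "form+tag":
        return f"{form}+{tag}"
    if mode == "lemma+tag":
        return f"{lemma}+{tag}"
    if mode == "tag":
        return tag
    raise ValueError(f"unknown mode: {mode}")

# Example:
# cluster_terminal("envoyée", "envoyer", "V", "lemma+tag") -> "envoyer+V"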

Discussion Results are shown in Figure 4. We make three observations. First, comparing the terminal=tag curve with the other two, it appears that the parser does take advantage of lexical information to rank parses, even for this "unlexicalized" algorithm. Yet the relatively small increase clearly shows that lexical information remains underused, probably because of lexical data sparseness. Further, comparing the terminal=lemma+tag and terminal=form+tag curves, we observe that grouping words into lemmas helps to reduce this sparseness. And third, the lexicon impact evolution (i.e. the increment between the terminal=tag and terminal=form+tag curves) is stable once the training size is superior to approximately 3,000 sentences.⁸ This suggests that only very frequent words matter; otherwise, words' impact should be more and more important as training material augments.

[Figure 4 plots Parseval F-score against the number of training sentences (0 to 10,000) for three settings: Bky terminal=form+tag, Bky terminal=lemma+tag, and Bky terminal=tag.]

Figure 4: Impact of clustering word forms (training on the FTB with the CC tagset, in perfect tagging)

5 Related Work

Previous works on French probabilistic parsing arethose of (Arun and Keller, 2005), (Schluter andvan Genabith, 2007), (Schluter and van Genabith,2008). One major difficulty for comparison is thatall three works use a different version of the train-ing corpus. Arun reports results on probabilisticparsing, using an older version of the FTB and us-ing lexicalized models (Collins M1 and M2 mod-els, and the bigram model). It is difficult to com-pare our results with Arun’s work, since the tree-bank he has used is obsolete (FTB-V0). He obtainsfor Model 1 : LR=80.35 / LP=79.99, and for thebigram model : LR=81.15 / LP=80.84, with min-imal tagset and internal tagging. The results withFTB (revised subset of FTB-V0) with minimal

8 This is true for all points in the curves, except for the last step, i.e. when the full training set is used. We performed a 10-fold cross-validation to limit sampling effects. For the BKY training with CC tagset and own tagging, we obtain an average F-score of 85.44 (with a rather high standard deviation σ=1.14). For the word form clustering experiment, using the full training set, we obtain 86.64 for terminal=form+tag (σ=1.15), 87.33 for terminal=lemma+tag (σ=0.43), and 85.72 for terminal=tag (σ=0.43). Hence our conclusions (words help even with the unlexicalized algorithm, and further grouping words into lemmas helps) hold independently of sampling.



tagset (Table 1) are comparable for COLLINS M1, and nearly 5 points higher for BKY.

It is also interesting to review the conclusion of (Arun and Keller, 2005), built on a comparison with the German situation: at that time lexicalization was thought (Dubey and Keller, 2003) to bring no sizable improvement to German parsing, trained on the Negra treebank, which uses flat structures. So (Arun and Keller, 2005) conclude that since lexicalization helps much more for parsing French, with a flat annotation, then word-order flexibility is the key factor that makes lexicalization useful (if word order is fixed, cf. French and English) or useless (if word order is flexible, cf. German). This conclusion does not hold today. First, it can be noted that as far as word-order flexibility is concerned, French stands in between English and German. Second, it has been proven that lexicalization helps German probabilistic parsing (Kübler et al., 2006). Finally, these authors show that markovization of the unlexicalized Stanford parser gives almost the same increase in performance as lexicalization, both for the Negra treebank and the Tüba-D/Z treebank. This conclusion is reinforced by the results we have obtained: the unlexicalized, markovized PCFG-LA algorithm outperforms Collins' lexicalized model.

(Schluter and van Genabith, 2007) aim at learning LFG structures for French. To do so, and in order to first learn a Collins parser, N. Schluter created a modified treebank, the MFT, in order (i) to fit her underlying theoretical requirements, (ii) to increase the treebank coherence by error mining and (iii) to improve the performance of the learnt parser. The MFT contains 4739 sentences taken from the FTB, with semi-automatic transformations. These include increased rule stratification, symbol refinements (for information propagation), coordination raising with some manual re-annotation, and the addition of functional tags. MFT has also undergone a phase of error mining, using the (Dickinson and Meurers, 2005) software, followed by manual correction. She reports a 79.95% F-score on a 400-sentence test set, which compares almost equally with Arun's results on the original 20000-sentence treebank. So she attributes her results to the increased coherence of her smaller treebank. Indeed, we ran the BKY training on the MFT, and we get F-score=84.31. While this is lower in absolute terms than the BKY results obtained with FTB (cf. results in

table 2), it is indeed very high if training data size is taken into account (cf. the BKY learning curve in figure 3). This good result raises the open question of identifying which modifications in the MFT (error mining and correction, tree transformation, symbol refinements) have the major impact.

6 Conclusion

This paper reports results in statistical parsing for French with both unlexicalized (Petrov et al., 2006) and lexicalized parsers. To our knowledge, both results are state of the art on French for each paradigm.

Both algorithms try to overcome PCFG's simplifying assumptions by some specialization of the grammatical labels. For the lexicalized approach, the annotation of symbols with the lexical head is known to be rarely fully used in practice (Gildea, 2001); what is really used is the category of the lexical head.

We observe that the second approach (BKY) consistently outperforms the lexicalist strategy à la (Collins, 1999). We observe however that (Petrov et al., 2006)'s semi-supervised learning procedure is not fully optimal, since a manual refinement of the treebank labelling turns out to improve the parsing results.

Finally we observe that the semi-supervised BKY algorithm does take advantage of lexical information: removing words degrades results. The preterminal symbol splits percolate lexical distinctions. Further, grouping words into lemmas helps for a morphologically rich language such as French. So an intermediate clustering, standing between syntactic category and lemma, is expected to yield better results in the future.

7 Acknowledgments

We thank N. Schluter and J. van Genabith for kindly letting us run BKY on the MFT, and A. Arun for answering our questions. We also thank the reviewers for valuable comments and references. The work of the second author was partly funded by the "Prix Diderot Innovation 2007", from University Paris 7.



References

Anne Abeillé, Lionel Clément, and Alexandra Kinyon. 2000. Building a treebank for French. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC'00).

Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a treebank for French. Kluwer, Dordrecht.

Abhishek Arun and Frank Keller. 2005. Lexicalization in crosslinguistic probabilistic parsing: The case of French. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 306–313, Ann Arbor, MI.

Daniel M. Bikel. 2004a. A distributional analysis of a lexicalized statistical parsing model. In Proc. of Empirical Methods in Natural Language Processing (EMNLP 2004), volume 4, pages 182–189, Barcelona, Spain.

Daniel M. Bikel. 2004b. Intricacies of Collins' Parsing Model. Computational Linguistics, 30(4):479–511.

Thorsten Brants. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of the 6th Applied NLP Conference (ANLP), Seattle, WA.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the Annual Meeting of the North American Association for Computational Linguistics (NAACL-00), pages 132–139, Seattle, Washington.

Michael Collins. 1999. Head driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia.

Benoit Crabbé and Marie Candito. 2008. Expériences d'analyse syntaxique statistique du français. In Actes de la 15ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN'08), pages 45–54, Avignon.

Markus Dickinson and W. Detmar Meurers. 2005. Prune diseased branches to get healthy trees! How to find erroneous local trees in a treebank and why it matters. In Proceedings of the 4th Workshop on Treebanks and Linguistic Theories (TLT 2005), Barcelona, Spain.

Amit Dubey and Frank Keller. 2003. Probabilistic parsing for German using sister-head dependencies. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 96–103.

Ane Dybro-Johansen. 2004. Extraction automatique de grammaires à partir d'un corpus français. Master's thesis, Université Paris 7.

Daniel Gildea. 2001. Corpus variation and parser performance. In Proceedings of the First Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 167–202.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.

Dan Klein and Christopher D. Manning. 2003a. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics, Morristown, NJ, USA.

Dan Klein and Christopher D. Manning. 2003b. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics.

Sandra Kübler, Erhard W. Hinrichs, and Wolfgang Maier. 2006. Is it really that difficult to parse German? In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 111–119, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1995. A dependency-based method for evaluating broad-coverage parsers. In International Joint Conference on Artificial Intelligence, pages 1420–1425, Montreal.

Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 75–82.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159, New York City, USA, June. Association for Computational Linguistics.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, July. Association for Computational Linguistics.

Natalie Schluter and Josef van Genabith. 2007. Preparing, restructuring, and augmenting a French treebank: Lexicalised parsers or coherent treebanks? In Proceedings of PACLING 07.

Natalie Schluter and Josef van Genabith. 2008. Treebank-based acquisition of LFG parsing resources for French. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May.




Upper Bounds for Unsupervised Parsing with Unambiguous Non-Terminally Separated Grammars

Franco M. Luque and Gabriel Infante-Lopez
Grupo de Procesamiento de Lenguaje Natural

Universidad Nacional de Córdoba & CONICET, Argentina

{francolq|gabriel}@famaf.unc.edu.ar

Abstract

Unambiguous Non-Terminally Separated (UNTS) grammars have properties that make them attractive for grammatical inference. However, these properties do not state the maximal performance they can achieve when they are evaluated against a gold treebank that is not produced by an UNTS grammar. In this paper we investigate such an upper bound. We develop a method to find an upper bound for the unlabeled F1 performance that any UNTS grammar can achieve over a given treebank. Our strategy is to characterize all possible versions of the gold treebank that UNTS grammars can produce and to find the one that optimizes a metric we define. We show a way to translate this score into an upper bound for the F1. In particular, we show that the F1 parsing score of any UNTS grammar cannot exceed 82.2% when the gold treebank is the WSJ10 corpus.

1 Introduction

Unsupervised learning of natural language has received a lot of attention in recent years, e.g., Klein and Manning (2004), Bod (2006a) and Seginer (2007). Most of these approaches use sentences from a treebank for training and trees from the same treebank for evaluation. As such, the best model for unsupervised parsing is the one that reports the best performance.

Unambiguous Non-Terminally Separated (UNTS) grammars have properties that make them attractive for grammatical inference. These grammars have been shown to be PAC-learnable in polynomial time (Clark, 2006), meaning that under certain circumstances, the underlying grammar can be learned from a sample of the underlying language. Moreover, UNTS grammars have been successfully used to induce grammars from unannotated corpora in competitions of learnability of formal languages (Clark, 2007).

UNTS grammars can be used for modeling natural language. They can be induced using any training material, the induced models can be evaluated using trees from a treebank, and their performance can be compared against state-of-the-art unsupervised models. Different learning algorithms might produce different grammars and, consequently, different scores. The fact that the class of UNTS grammars is PAC-learnable does not convey any information on the possible scores that different UNTS grammars might produce. From a performance-oriented perspective it might be possible to have an upper bound over the set of possible scores of UNTS grammars. Knowing an upper bound is complementary to knowing that the class of UNTS grammars is PAC-learnable.

Such an upper bound has to be defined specifically for UNTS grammars and has to take into account the treebank used as test set. The key question is how to compute it. Suppose that we want to evaluate the performance of a given UNTS grammar using a treebank. The candidate grammar produces a tree for each sentence and those trees are compared to the original treebank. We can think that the candidate grammar has produced a new version of the treebank, and that the score of the grammar is a measure of the closeness of the new treebank to the original treebank. Finding the best upper bound is equivalent to finding the UNTS version of the treebank that is closest to the original one.

Such bounds are difficult to find for most classes of languages because the search space is the set of all possible versions of the treebank that might have been produced by any grammar in the class under study. In order to make the problem tractable, we need the formalism to have an easy way to characterize all the versions of a treebank it might produce. UNTS grammars have a special characterization that makes the search space easy to define but whose exploration is NP-hard.

In this paper we present a way to characterize UNTS grammars and a metric function to measure the closeness between two different versions of a treebank. We show that the problem of finding the closest UNTS version of the treebank can be described as a Maximum Weight Independent Set (MWIS) problem, a well-known NP-hard problem (Karp, 1972). The exploration algorithm returns a version of the treebank that is the closest to the gold standard in terms of our own metric.

We show that the F1 measure is related to our measure and that it is possible to find an upper bound on the F1 performance for all UNTS grammars. Moreover, we compute this upper bound for the WSJ10, a subset of the Penn Treebank (Marcus et al., 1994), using POS tags as the alphabet. The upper bound we found is 82.2% for the F1 measure. Our result suggests that UNTS grammars are a formalism that has the potential to achieve state-of-the-art unsupervised parsing performance, but it does not guarantee that there exists a grammar that can actually achieve the 82.2%.

To the best of our knowledge, there is no previous research on finding upper bounds for performance over a concrete class of grammars. In Klein and Manning (2004), the authors compute an upper bound for parsing a non-binary gold treebank with binary trees. This upper bound, which is 88.1% for the WSJ10, holds for any parser that returns binary trees, including the concrete models developed in the same work. But their upper bound does not use any specific information about the concrete models that may help them to find better ones.

The rest of the paper is organized as follows. Section 2 presents our characterization of UNTS grammars. Section 3 introduces the metric we optimize and explains how the closest version of the treebank is found. Section 4 explains how the upper bound for our metric is translated to an upper bound of the F1 score. Section 5 presents our bound for UNTS grammars using the WSJ10 and finally Section 6 concludes the paper.

2 UNTS Grammars and Languages

Formally, a context-free grammar G = (Σ, N, S, P) is said to be Non-Terminally Separated (NTS) if, for all X, Y ∈ N and α, β, γ ∈ (Σ ∪ N)* such that X ⇒* αβγ and Y ⇒* β, we have that X ⇒* αYγ (Clark, 2007). Unambiguous NTS (UNTS) grammars are those NTS grammars that parse every instance of the language unambiguously.

Given any grammar G, a substring s of r ∈ L(G) is called a constituent of r if and only if there is an X in N such that S ⇒* uXv ⇒* usv = r. In contrast, a string s is called a non-constituent or distituent of r ∈ L(G) if s is not a constituent of r. We say that s is a constituent of a language L(G) if for every r that contains s, s is a constituent of r. In contrast, s is a distituent of L(G) if for every r where s occurs, s is a distituent of r.

An interesting characterization of finite UNTS grammars is that every substring that appears in some string of the language is always a constituent or always a distituent. In other words, if there is a string r in L(G) for which s is a constituent, then s is a constituent of L(G). By means of this property, if we ignore the non-terminal labels, a finite UNTS language is fully determined by its set of constituents C. We can show this property for finite UNTS languages. We believe that it can also be shown for non-finite cases, but for our purposes the finite case suffices, because we use grammars to parse finite sets of sentences, specifically, the sentences of test treebanks. We know that for every finite subset of an infinite language produced by a UNTS grammar G, there is a UNTS grammar G′ whose language is finite and that parses the finite subset as G does. If we look for the upper bound among the grammars that produce a finite language, this upper bound is also an upper bound for the class of infinite UNTS grammars.

The UNTS characterization plays a very important role in the way we look for the upper bound. Our method focuses on how to determine which of the constituents that appear in the gold are actually the constituents that produce the upper bound. Suppose that a given gold treebank contains two strings α and β such that they occur overlapped. That is, there exist non-empty strings α′, γ, β′ such that α = α′γ and β = γβ′ and α′γβ′ occurs in the treebank. If C is the set of constituents of a UNTS grammar it cannot have both α and β. It might have one or the other, but if both belong to C the resulting language cannot be UNTS. In order to find the closest UNTS grammar we design a procedure that looks for the subset of all substrings that occur in the sentences of the gold treebank that can be the constituent set C of a grammar. We do not explicitly build a UNTS grammar, but find the set C that produces the best score.

We say that two strings α and β are compatible in a language L if they do not occur overlapped in L, and hence they both can be members of C. If we think of L as a subset of an infinite language, it is not possible to check that two overlapping strings do not appear overlapped in the "real" language and hence that they are actually compatible. Nevertheless, we can guarantee compatibility between two strings α, β by requiring that they do not overlap at all, that is, that there are no non-empty strings α′, γ, β′ such that α = α′γ and β = γβ′. We call this type of compatibility strong compatibility. Strong compatibility ensures that two strings can belong to C regardless of L. In our experiments we focus on finding the best set C of compatible strings.

Any set of compatible strings C extracted from the gold treebank can be used to produce a new version of the treebank. For example, Figure 1 shows two trees from the WSJ Penn Treebank. The string "in the dark" occurs as a constituent in (a) and as a distituent in (b). If C contains "in the dark", it cannot contain "the dark clouds" given that they overlap in the yield of (b). As a consequence, the new treebank correctly contains the subtree in (a) but not the one in (b). Instead, the yield of (b) is described as in (c) in the new treebank.

C defines a new version of the treebank that satisfies the UNTS property. Our goal is to obtain a treebank T′ such that (a) T′ and T are treebanks over the same set of sentences, (b) T′ is UNTS, and (c) T′ is the closest treebank to T in terms of performance. Together, these imply that any other UNTS grammar is not as similar to T as the one we found.

3 Finding the Best UNTS Grammar

As our goal is to find the closest grammar in terms of performance, we need first to define a weight for each possible grammar and second, an algorithm that searches for the grammar with the best weight. Ideally, the weight of a candidate grammar should be in terms of F1, but we can show that optimization of this particular metric is computationally hard. Instead of defining F1 as the score, we introduce a new metric that is easier to optimize, we find the best grammar for this metric, and we show that the possible values of F1 can be bounded by a function that takes this score as argument. In this section we present our metric and the technique we use to find a grammar that reports the best value for our metric.

If the original treebank T is not produced by any UNTS grammar, then there are strings in T that are constituents in some sentences and distituents in some other sentences. For each one of them we need a procedure to decide whether they are members of C or not. If a string α appears a significantly greater number of times as a constituent than as a distituent, the procedure may choose to include it in C at the price of being wrong a few times. That is, the new version of T has all occurrences of α either as constituents or as distituents. The treebank that has all of its occurrences as constituents differs from the original in that there are some occurrences of α that were originally distituents and are marked as constituents. Similarly, if α is marked as a distituent in the new treebank, it has occurrences of α that were constituents in T.

The decision procedure becomes harder when all the substrings that appear in the treebank are considered. The increase in complexity is a consequence of the number of decisions the procedure needs to take and the way these decisions interfere with one another. We show that the problem of determining the set C is naturally embedded in an NP-hard graph problem. We define a way to look for the optimal grammars by translating our problem to a well-known graph problem. Let L be the set of sentences in a treebank, and let S(L) be all the possible non-empty proper substrings of L. We build a weighted undirected graph G in terms of the treebank as follows. Nodes in G correspond to strings in S(L). The weight of a node is a function w(s) that models our interest in having s selected as a constituent; w(s) is defined in terms of some information derived from the gold treebank T and we discuss it later in this section. Finally, two nodes a and b are connected by an edge if their two corresponding strings conflict in a sentence of T (i.e., they are not compatible in L).
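The graph construction just described can be sketched as follows (an illustrative sketch, not the authors' implementation; sentences are assumed to be given as lists of POS tags):

from itertools import combinations

def proper_substrings(sent, min_len=2):
    # All proper substrings of a sentence, as (tuple, start, end) triples.
    n = len(sent)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            if j - i < n:                        # exclude the full sentence
                yield tuple(sent[i:j]), i, j

def build_conflict_graph(sentences):
    # Nodes: substrings in S(L); edges: pairs of strings that occur
    # overlapped (i.e., crossing) in some sentence of the treebank.
    nodes, edges = set(), set()
    for sent in sentences:
        occs = list(proper_substrings(sent))
        for s, i, j in occs:
            nodes.add(s)
        for (s1, i1, j1), (s2, i2, j2) in combinations(occs, 2):
            crossing = (i1 < i2 < j1 < j2) or (i2 < i1 < j2 < j1)
            if s1 != s2 and crossing:
                edges.add(frozenset((s1, s2)))
    return nodes, edges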

Not all elements of L are in S(L). We did not include L in S(L) for two practical reasons. The first one is that to require L in S(L) is too restrictive. It states that all strings in L are in fact constituents. If two strings ab and bc of L occur overlapped in a third string abc, then there is no UNTS grammar capable of having the three of



[Figure 1 content: (a) a subtree for "we 're in the dark" (PRP VBP IN DT JJ) in which "in the dark" is a constituent; (b) a subtree for "in the dark clouds" (IN DT JJ NNS) in which "in the dark" is a distituent; (c) the yield of (b) reparsed after choosing "in the dark" as a constituent.]

Figure 1: (a) and (b) are two subtrees that show "in the dark" as a constituent and as a distituent respectively. (c) shows the result of choosing "in the dark" as a constituent.

them as constituents. The second one is that including them produces graphs that are too sparse. If they are included in the graph, we know that any solution should contain them; consequently, all their neighbors do not belong to any solution and they can be removed from the graph. Our experiments show that the graphs that result from removing nodes related to nodes representing strings in L are too small to produce any interesting result.

By means of representing the treebank as a graph, selecting a set of constituents C ⊆ S(L) is equivalent to selecting an independent set of nodes in the graph. An independent set is a subset of the set of nodes that does not have any pair of nodes connected by an edge. Clearly, there are exponentially many possible ways to select an independent set, and each of these sets represents a set of constituents. But, since we are interested in the best set of constituents, we associate to each independent set C the weight W(C) defined as ∑_{s∈C} w(s). Our aim is then to find a set Cmax that maximizes this weight. This problem is a well-known problem of graph theory, known in the literature as the Maximum Weight Independent Set (MWIS) problem. This problem is also known to be NP-hard (Karp, 1972).

We still have to choose a definition for w(s). We want to find the grammar that maximizes F1. Unfortunately, F1 cannot be expressed in terms of a sum of weights. Maximization of F1 is beyond the expressiveness of our model, but our strategy is to define a measure that correlates with F1 and that can be expressed as a sum of weights.

In order to introduce our measure, we first define c(s) and d(s) as the number of times a string s appears in the gold treebank T as a constituent and as a distituent respectively. Observe that if we choose to include s as a constituent of C, the resulting treebank T′ contains all the c(s) + d(s) occurrences of s as constituents: c(s) of the occurrences of s in T′ are constituents as they are in T, and d(s) of the occurrences are constituents in T′ but are in fact distituents in T. We want to maximize c(s) and minimize d(s) at the same time. This can be done by defining the contribution of a string s to the overall score as

w(s) = c(s) − d(s).

With this definition of w, the weight W(C) = ∑_{s∈C} w(s) becomes the number of constituents of T′ that are in T minus the number that are not. If we define the number of hits to be H(C) = ∑_{s∈C} c(s) and the number of misses to be M(C) = ∑_{s∈C} d(s), we have that

W(C) = H(C) − M(C).   (1)

As we confirm in Section 5, graphs tend to be very big. In order to reduce the size of the graphs, if a string s has w(s) ≤ 0, we do not include its corresponding node in the graph. An independent set that does not include s has an equal or higher W than the same set including s.
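The weight function can be computed directly from the gold treebank; a minimal sketch, assuming gold trees are given as sets of constituent spans (an encoding chosen for illustration, not the authors' data format):

from collections import Counter

def substring_weights(gold, min_len=2):
    # gold: list of (sentence, spans) pairs; spans is the set of (i, j)
    # constituent spans of the gold tree for that sentence.
    c, d = Counter(), Counter()
    for sent, spans in gold:
        n = len(sent)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                if j - i == n:
                    continue                     # skip the full span
                s = tuple(sent[i:j])
                if (i, j) in spans:
                    c[s] += 1                    # this occurrence is a constituent
                else:
                    d[s] += 1                    # this occurrence is a distituent
    return {s: c[s] - d[s] for s in set(c) | set(d)}

# Only strings with positive weight need to enter the graph:
# nodes = {s for s, w in substring_weights(gold).items() if w > 0}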

For example, let T be the treebank in Figure 2 (a). The set of substrings s such that w(s) > 0 is {da, cd, bc, cda, ab, bch}. The graph that corresponds to this set of strings is given in Figure 3. Nodes corresponding to the strings {dabch, bcda, abe, abf, abg, bci, daj} are not shown in the figure because these strings do not belong to S(L). The figure also shows the weights associated to the substrings according to their counts in Figure 2 (a). The shadowed nodes correspond to the independent set that maximizes W. The trees in Figure 2 (b) are the sentences of the treebank parsed according to the optimal independent set.

4 An Upper Bound for F1

Even though finding the independent set that maximizes W is an NP-hard problem, there are instances where it can be effectively computed, as we show in the next section. The set Cmax maximizes W for the WSJ10 and we know that all other sets C produce a lower value of W. In other words, the set Cmax produces a treebank Tmax that



[Figure 2 content: (a) the gold treebank, consisting of the bracketed sentences (da)((bc)h), b((cd)a), (ab)e, (ab)f, (ab)g, (bc)i, (da)j; (b) the generated treebank, consisting of d(ab)ch, b((cd)a), (ab)e, (ab)f, (ab)g, bci, daj.]

Figure 2: (a) A gold treebank. (b) The treebank generated by the grammar C = L ∪ {cd, ab, cda}.

Figure 3: Graph for the treebank of Figure 2.

is the closest UNTS version to the WSJ10 in terms of W. We can compute the precision, recall and F1 for Cmax, but there is no guarantee that the F1 score is the best for all the UNTS grammars. This is the case because F1 and W do not define the same ordering over the family of candidate constituent sets C: there are gold treebanks T (used for computing the metrics) and sets C1, C2 such that F1(C1) < F1(C2) and W(C1) > W(C2). For example, consider the gold treebank T in Figure 4 (a). The table in Figure 4 (b) displays two sets C1 and C2, the treebanks they produce, and their values of F1 and W. Note that C2 is the result of adding the string ef to C1; also note that c(ef) = 1 and d(ef) = 2. This improves the F1 score but produces a lower W.

The F1 measure we work with is the one defined in the recent literature on unsupervised parsing (Klein and Manning, 2004). F1 is defined in terms of Precision and Recall as usual, and the last two measures are micro-averaged measures that include full-span brackets, and that ignore both unary branches and brackets of span one. For simplicity, the previous example does not count the full-span brackets.

As the example shows, the upper bound for W might not be an upper bound of F1, but it is possible to define an upper bound of F1 using the upper bound of W. In this section we define a function f with the following property. Let X and Y be the sets of W-weights and F1-weights for all possible UNTS grammars respectively. Then, if w is an upper bound of X, then f(w) is an upper bound of Y. The function f is defined as follows:

f(w) = F1(1/(2 − w/K), 1)   (2)

where F1(p, r) = 2pr/(p + r), and K = ∑_{s∈S_T} c(s) is the total number of constituents in the gold treebank T. From it, we can also derive values for precision and recall: a precision of 1/(2 − w/K) and a recall of 1.

A recall of 1 is clearly an upper bound for all the possible values of recall, but the value given for precision is not necessarily an upper bound for all the possible values of precision. There might exist a grammar with a higher value of precision, but its F1 still has to be below our upper bound.

The rest of the section shows that f(w) is an upper bound for F1; the reader not interested in the technicalities can skip it.

The key insight for the proof is that both metrics F1 and W can be written in terms of precision and recall. Let T be the treebank that is used to compute all the metrics, and let T′ be the treebank produced by a given constituent set C. If a string s belongs to C, then its c(s) + d(s) occurrences in T′ are marked as constituents. Moreover, s is correctly tagged c(s) times while it is incorrectly tagged d(s) times. Using this, P, R and F1 can be computed for C as follows:

P(C) = ∑_{s∈C} c(s) / ∑_{s∈C} (c(s) + d(s)) = H(C)/(H(C) + M(C))   (3)

R(C) = ∑_{s∈C} c(s) / K = H(C)/K   (4)

F1(C) = 2P(C)R(C)/(P(C) + R(C)) = 2H(C)/(K + H(C) + M(C))



[Figure 4 (a) content: a gold treebank consisting of the bracketed sentences (ab)c, a(bd), (ef)g, efh, efi.]

(b)
C                                        T′                                   P    R    F1   W
C1 = {abc, abd, efg, efh, efi, ab}       {(ab)c, (ab)d, efg, efh, efi}        50%  33%  40%  1 − 1 = 0
C2 = {abc, abd, efg, efh, efi, ab, ef}   {(ab)c, (ab)d, (ef)g, (ef)h, (ef)i}  40%  67%  50%  2 − 3 = −1

Figure 4: (a) A gold treebank. (b) Two grammars, the treebanks they generate, and their scores.

W can also be written in terms of P and R as

W(C) = (2 − 1/P(C)) R(C) K   (5)

This formula is proved to be equivalent to Equation (1) by replacing P(C) and R(C) with equations (3) and (4) respectively. Using the last two equations, we can rewrite F1 and W taking p and r, representing values of precision and recall, as parameters:

F1(p, r) = 2pr/(p + r)

W(p, r) = (2 − 1/p) r K   (6)
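Spelling out the substitution that establishes the equivalence of (5) and (1), using (3), (4) and (1):

$$\Bigl(2-\tfrac{1}{P(C)}\Bigr)R(C)\,K \;=\; \Bigl(2-\tfrac{H(C)+M(C)}{H(C)}\Bigr)\tfrac{H(C)}{K}\,K \;=\; 2H(C)-\bigl(H(C)+M(C)\bigr) \;=\; H(C)-M(C) \;=\; W(C).$$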

Using these equations, we can prove that f correctly translates upper bounds of W to upper bounds of F1 using calculus. In contrast to F1, W does not necessarily take values between 0 and 1. Instead, it takes values between K and −∞. Moreover, it is negative when p < 1/2, and goes to −∞ when p goes to 0. Let C be an arbitrary UNTS grammar, and let pC, rC and wC be its precision, recall and W-weight respectively. Let w be our upper bound, so that wC ≤ w. If f1C is defined as F1(pC, rC), we need to show that f1C ≤ f(w). We bound f1C in two steps. First, we show that

f1C ≤ f(wC)

and second, we show that

f(wC) ≤ f(w).

The first inequality is proved by observing that f1C and f(wC) are the values of the function

f1(r) = F1(1/(2 − wC/(Kr)), r)

at the points r = rC and r = 1 respectively. This function corresponds to the line defined by the F1 values of all possible models that have a fixed weight W = wC. The function is monotonically increasing in r, so we can apply it to both sides of the inequality rC ≤ 1, which is trivially true. As a result, we get f1C ≤ f(wC) as required. The second inequality is proved by observing that f(w) is monotonically increasing in w, and by applying it to both sides of the hypothesis wC ≤ w.

5 UNTS Bounds for the WSJ10 Treebank

In this section we focus on finding actual upper bounds by building the graph for a particular treebank T. We find the best independent set, we build the UNTS version Tmax of T and we compute the upper bound for F1. The treebank we use for experiments is the WSJ10, which consists of the sentences of the WSJ Penn Treebank whose length is at most 10 words after removing punctuation marks (Klein and Manning, 2004). We also removed lexical entries, transforming POS tags into our terminal symbols as is usually done (Klein and Manning, 2004; Bod, 2006a).

We start by finding the best independent set. To solve the problem in practice, we convert it into an Integer Linear Programming (ILP) problem. ILP is also NP-hard (Karp, 1972), but there is software that implements efficient strategies for solving some of its instances (Achterberg, 2004).

ILP problems are defined by three parameters. First, there is a set of variables that can take values from a finite set. Second, there is an objective function that has to be maximized, and third, there is a set of constraints that must be satisfied. In our case, we define a binary variable x_s ∈ {0, 1} for every node s in the graph. Its value, 1 or 0, respectively determines the presence or absence of s in the set Cmax. The objective function is

∑_{s∈S(L)} x_s w(s)

The constraints are defined using the edges of the graph. For every edge (s1, s2) in the graph, we add the following constraint to the problem:

x_{s1} + x_{s2} ≤ 1
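The authors solved the ILP with SCIP (Achterberg, 2004); the same formulation can be sketched with the open-source PuLP modeler (assumed installed), taking the nodes, weights and edges computed earlier:

import pulp

def solve_mwis(nodes, weights, edges):
    # Maximum Weight Independent Set formulated as a 0-1 ILP.
    nodes = list(nodes)
    prob = pulp.LpProblem("MWIS", pulp.LpMaximize)
    x = {s: pulp.LpVariable("x_%d" % i, cat="Binary") for i, s in enumerate(nodes)}
    # Objective: sum of x_s * w(s) over all nodes.
    prob += pulp.lpSum(weights[s] * x[s] for s in nodes)
    # One constraint per edge: the two endpoints cannot both be selected.
    for edge in edges:
        s1, s2 = tuple(edge)
        prob += x[s1] + x[s2] <= 1
    prob.solve()
    return {s for s in nodes if x[s].value() == 1}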

The 7422 trees of the WSJ10 treebank have a total of 181476 substrings of length ≥ 2, which form the set S(L) of 68803 different substrings. The number of substrings in S(L) does not grow too much with respect to the number of strings in L because substrings are sequences of POS tags, meaning that each substring is very frequent in the corpus. If substrings were made out of words instead of POS tags, the number of substrings would grow much faster, making the problem harder to solve. Moreover, removing the strings s such that w(s) ≤ 0 leaves a total of only 7029 substrings. Since there is a node for each substring, the resulting graph contains 7029 nodes. Recall that there is an edge between two strings if they occur overlapped. Our graph contains 1204 edges. The ILP version has 7029 variables, 1204 constraints, and the objective function sums over 7029 variables. These numbers are summarized in Table 1.

The solution of the ILP problem is a set of 6583 variables that are set to one. This set corresponds to a set Cmax of nodes in our graph with the same number of elements. Using Cmax we build a new version Tmax of the WSJ10, and compute its weight W, precision, recall and F1. Their values are displayed in Table 2. Since the elements of L were not introduced in S(L), elements of L are not necessarily in Cmax, but in order to compute precision and recall, we add them by hand. Strictly speaking, the set of constituents that we use for building Tmax is Cmax plus the full-span brackets.

We can, using equation (2), compute the upper bound of F1 for all the possible scores of all UNTS grammars that use POS tags as alphabet:

f(wmax) = F1(1/(2 − wmax/K), 1) = 82.2%

The precision for this upper bound is

P(wmax) = 1/(2 − wmax/K) = 69.8%

while its recall is R = 100%. Note from the previous section that P(wmax) is not an upper bound for precision but just the precision associated to the upper bound f(wmax).

Gold constituents K    35302
Strings |S(L)|         68803
Nodes                   7029
Edges                   1204

Table 1: Figures for the WSJ10 and its graph.

Hits H         22169
Misses M        2127
Weight W       20042
Precision P    91.2%
Recall R       62.8%
F1             74.4%

Table 2: Summary of the scores for Cmax.
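The reported figures can be checked with a few lines of arithmetic, using the values K = 35302, H = 22169 and M = 2127 from Tables 1 and 2:

K, H, M = 35302, 22169, 2127

W = H - M                          # 20042, the weight in Table 2
P = H / (H + M)                    # about 0.912 (91.2%)
R = H / K                          # about 0.628 (62.8%)
F1 = 2 * P * R / (P + R)           # about 0.744 (74.4%)

p_bound = 1 / (2 - W / K)          # about 0.698, precision at the bound
f1_bound = 2 * p_bound / (p_bound + 1)   # about 0.822, the 82.2% upper bound
print(W, round(P, 3), round(R, 3), round(F1, 3), round(p_bound, 3), round(f1_bound, 3))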

Table 3 shows results that allow us to compare the upper bounds with state-of-the-art parsing scores. BestW corresponds to the scores of Tmax and UBoundF1 is the result of our translation function f. From the table we can see that an unsupervised parser based on UNTS grammars may reach state-of-the-art performance over the WSJ10. RBranch is a WSJ10 version where all trees are binary and right-branching. DMV, CCM and DMV+CCM are the results reported in Klein and Manning (2004). U-DOP and UML-DOP are the results reported in Bod (2006b) and Bod (2006a) respectively. Incremental refers to the results reported in Seginer (2007).

We believe that our upper bound is a generous one and that it might be difficult to achieve, for two reasons. First, since the WSJ10 corpus is a rather flat treebank, of the 68803 substrings only 10% are such that c(s) > d(s). Our procedure has to decide among this 10% which of the strings are constituents. An unsupervised method has to choose the set of constituents from the set of all 68803 possible substrings. Second, we are supposing a recall of 100%, which is clearly too optimistic. We believe that we can find a tighter upper bound by finding an upper bound for recall, and by rewriting f in equation (2) in terms of the upper bound for recall.

The scope of the upper bound we found must be made clear. First, note that it has been computed over the WSJ10 treebank using the POS tags as the alphabet. Any other alphabet we use, for example words, or pairs of words and POS tags, changes the relation of compatibility among the substrings, making a completely different universe



Model             UP    UR     F1
RBranch           55.1  70.0   61.7
DMV               46.6  59.2   52.1
CCM               64.2  81.6   71.9
DMV+CCM           69.3  88.0   77.6
U-DOP             70.8  88.2   78.5
UML-DOP                        82.9
Incremental       75.6  76.2   75.9
BestW (UNTS)      91.2  62.8   74.4
UBoundF1 (UNTS)   69.8  100.0  82.2

Table 3: Performance on the WSJ10 of the most recent unsupervised parsers, and our upper bounds on UNTS.

of UNTS grammars. Second, our computation of the upper bound was not made for supersets of the WSJ10. Supersets such as the entire Penn Treebank produce bigger graphs because they contain longer sentences and a wider variety of substring sequences. As the maximization of W is an NP-hard problem, the computational cost of solving bigger instances grows exponentially. A third limitation concerns the models affected by the bound. The upper bound, and in general the method, is only applicable to the class of formal UNTS grammars, with only the very slight variants mentioned in the previous sections. Just moving to probabilistic or weighted UNTS grammars invalidates all the presented results.

6 Conclusions

We present a method for assessing the potential of UNTS grammars as a formalism for unsupervised parsing of natural language. We assess their potential by finding an upper bound of their performance when they are evaluated using the WSJ10 treebank. We show that any UNTS grammar can achieve at most 82.2% F1, a value comparable to most state-of-the-art models. In order to compute this upper bound we introduced a measure that does not define the same ordering among UNTS grammars as F1, but that has the advantage of being computationally easier to optimize. Our measure can be used, by means of a translation function, to find an upper bound for F1. We also showed that the optimization procedure for our metric maps into an NP-hard problem, but despite this fact we present experimental results that compute the upper bound for the WSJ10 when POS tags are treated as the grammar alphabet.

From a more abstract perspective, we introduced a different approach to assess the usefulness of a grammatical formalism. Usually, formalisms are proved to have interesting learnability properties such as PAC-learnability or convergence of a probabilistic distribution. We present an approach that, even though it does not provide an effective way of computing the best grammar in an unsupervised fashion, states an upper bound of performance for the whole class of UNTS grammars.

Acknowledgments

This work was supported in part by grant PICT 2006-00969, ANPCyT, Argentina. We would like to thank Pablo Rey (UDP, Chile) for his help with ILP, and Demetrio Martín Vilela (UNC, Argentina) for his detailed review.

References

Tobias Achterberg. 2004. SCIP - a framework to integrate Constraint and Mixed Integer Programming. Technical report.

Rens Bod. 2006a. An all-subtrees approach to unsupervised parsing. In Proceedings of COLING-ACL 2006.

Rens Bod. 2006b. Unsupervised parsing with U-DOP. In Proceedings of CoNLL-X.

Alexander Clark. 2006. PAC-learning unambiguous NTS languages. In Proceedings of ICGI-2006.

Alexander Clark. 2007. Learning deterministic context free grammars: The Omphalos competition. Machine Learning, 66(1):93–110.

Richard M. Karp. 1972. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of ACL 42.

Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

Yoav Seginer. 2007. Fast unsupervised incremental parsing. In Proceedings of ACL 45.




Comparing learners for Boolean partitions: implications for morphological paradigms∗

Katya Pertsova
University of North Carolina,

Chapel Hill
[email protected]

Abstract

In this paper, I show that a problem of learning a morphological paradigm is similar to a problem of learning a partition of the space of Boolean functions. I describe several learners that solve this problem in different ways, and compare their basic properties.

1 Introduction

Lately, there has been a lot of work on acquiring paradigms as part of the word-segmentation problem (Zeman, 2007; Goldsmith, 2001; Snover et al., 2002). However, the problem of learning the distribution of affixes within paradigms as a function of their semantic (or syntactic) features is much less explored, to my knowledge. This problem can be described as follows: suppose that the segmentation has already been established. Can we now predict what affixes should appear in what contexts, where by a 'context' I mean something quite general: some specification of semantic (and/or syntactic) features of the utterance. For example, one might say that the nominal suffix -z in English (as in apple-z) occurs in contexts that involve plural or possessive nouns whose stems end in a voiced segment.

In this paper, I show that the problem of learning the distribution of morphemes in contexts specified over some finite number of features is roughly equivalent to the problem of learning Boolean partitions of DNF formulas. Given this insight, one can easily extend standard DNF learners to morphological paradigm learners. I show how this can be done on an example of the classical k-DNF learner (Valiant, 1984). This insight also allows us to bridge the paradigm-learning problem with other similar problems in

∗ This paper owes a great deal to the input from Ed Stabler. As usual, all the errors and shortcomings are entirely mine.

the domain of cognitive science for which DNFs have been used, e.g., concept learning. I also describe two other learners proposed specifically for learning morphological paradigms. The first of these learners, proposed by me, was designed to capture certain empirical facts about syncretism and free variation in typological data (Pertsova, 2007). The second learner, proposed by David Adger, was designed as a possible explanation of another empirical fact, uneven frequencies of free variants in paradigms (Adger, 2006).

In the last section, I compare the learners on some simple examples and comment on their merits and the key differences among the algorithms. I also draw connections to other work, and discuss directions for further empirical tests of these proposals.

2 The problem

Consider a problem of learning the distribution of inflectional morphemes as a function of some set of features. Using featural representations, we can represent morpheme distributions in terms of a formula. DNF formulas are commonly used for such algebraic representation. For instance, given the nominal suffix -z mentioned in the introduction, we can assign to it the following representation: [(noun; +voiced]stem; +plural) ∨ (noun; +voiced]stem; +possessive)]. Presumably, features like [plural] or [+voiced] or ]stem (end of the stem) are accessible to the learner's cognitive system, and can be exploited during the learning process for the purpose of "grounding" the distribution of morphemes.1 This way of looking at things is similar to how some researchers conceive of concept-learning or word-

1 Assuming an a priori given universal feature set, the problem of feature discovery is a subproblem of learning morpheme distributions. This is because learning what features condition the distribution is the same as learning what features (from the universal set) are relevant and should be paid attention to.



learning (Siskind, 1996; Feldman, 2000; Nosofsky et al., 1994).

However, one prominent distinction that sets inflectional morphemes apart from words is that they occur in paradigms, semantic spaces defining a relatively small set of possible distinctions. In the absence of free variation, one can say that the affixes define a partition of this semantic space into disjoint blocks, in which each block is associated with a unique form. Consider for instance the present tense paradigm of the verb "to be" in standard English, represented below as a partition of the set of environments over the following features: class (with values masc, fem, both (masc & fem), inanim), number (with values +sg and −sg), and person (with values 1st, 2nd, 3rd).2

am   1st. person; fem; +sg.
     1st. person; masc; +sg.

are  2nd. person; fem; +sg.
     2nd. person; masc; +sg.
     2nd. person; fem; −sg.
     2nd. person; masc; −sg.
     2nd. person; both; −sg.
     1st. person; fem; −sg.
     1st. person; masc; −sg.
     1st. person; both; −sg.
     3rd. person; masc; −sg.
     3rd. person; fem; −sg.
     3rd. person; both; −sg.
     3rd. person; inanim; −sg.

is   3rd. person; masc; +sg.
     3rd. person; fem; +sg.
     3rd. person; inanim; +sg.

Each block in the above partition can be represented as a mapping between the phonological form of the morpheme (a morph) and a DNF formula. A single morph will typically be mapped to a DNF containing a single conjunction of features (called a monomial). When a morph is mapped to a disjunction of monomials (as the morph [-z] discussed above), we think of such a morph as a homonym (having more than one "meaning"). Thus, one way of defining the learning problem is in terms of learning a partition of a set of DNFs.

2 These particular features and their values are chosen just for illustration. There might be a much better way to represent the distinctions encoded by the pronouns. Also notice that the feature values are not fully independent: some combinations are logically ruled out (e.g. speakers and listeners are usually animate entities).

Alternatively, we could say that the learner has to learn a partition of Boolean functions associated with each morph (a Boolean function for a morph m maps the contexts in which m occurs to true, and all other contexts to false).

However, when paradigms contain free variation, the divisions created by the morphs no longer define a partition, since a single context may be associated with more than one morph. (Free variation is attested in the world's languages, although it is rather marginal (Kroch, 1994).) In case a paradigm contains free variation, it is still possible to represent it as a partition by doing the following:

(1) Take a singleton partition of morph-meaning pairs (m, r) and merge any cells that have the same meaning r. Then merge those blocks that are associated with the same set of morphs.

Below is an example of how we can use this trick to partition a paradigm with free variation. The data comes from the past tense forms of "to be" in Buckie English.

was        1st. person; fem; +sg.
           1st. person; masc; +sg.
           3rd. person; masc; +sg.
           3rd. person; fem; +sg.
           3rd. person; inanim; +sg.

was/were   2nd. person; fem; +sg.
           2nd. person; masc; +sg.
           2nd. person; fem; −sg.
           2nd. person; masc; −sg.
           2nd. person; both; −sg.
           1st. person; fem; −sg.
           1st. person; masc; −sg.
           1st. person; both; −sg.

were       3rd. person; masc; −sg.
           3rd. person; fem; −sg.
           3rd. person; both; −sg.
           3rd. person; inanim; −sg.
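The merging recipe in (1) is straightforward to implement; a minimal sketch (environments are encoded as frozensets of feature values, an assumption chosen for illustration):

from collections import defaultdict

def boolean_partition(pairs):
    # pairs: list of (morph, environment) observations.
    # Step 1: merge cells that share the same environment.
    morphs_per_env = defaultdict(set)
    for morph, env in pairs:
        morphs_per_env[env].add(morph)
    # Step 2: merge environments associated with the same set of morphs.
    envs_per_morphset = defaultdict(set)
    for env, morphs in morphs_per_env.items():
        envs_per_morphset[frozenset(morphs)].add(env)
    return [(set(ms), envs) for ms, envs in envs_per_morphset.items()]

# Hypothetical fragment of the Buckie "to be" data:
data = [("was", frozenset({"1st", "fem", "+sg"})),
        ("was", frozenset({"2nd", "fem", "+sg"})),
        ("were", frozenset({"2nd", "fem", "+sg"})),
        ("were", frozenset({"3rd", "fem", "-sg"}))]
for block in boolean_partition(data):
    print(block)
# One block for {was}, one for {was, were}, one for {were}.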

In general, then, the problem of learning the distribution of morphs within a single inflectional paradigm is equivalent to learning a Boolean partition.

In what follows, I consider and compare several learners for learning Boolean partitions. Some of these learners are extensions of learners proposed in the literature for learning DNFs. Other learners were explicitly proposed for learning morphological paradigms.

We should keep in mind that all these learners are idealizations and are not realistic, if only because they are batch learners. However, because they are relatively simple to state and to understand, they allow a deeper understanding of what properties of the data drive generalization.

2.1 Some definitions

Assume a finite set of morphs, Σ, and a finite set of features F. It would be convenient to think of morphs as chunks of phonological material corresponding to the pronounced morphemes.3 Every feature f ∈ F is associated with some set of values Vf that includes a value [∗], unspecified. Let S be the space of all possible complete assignments over F (an assignment is a set {fi → Vfi | ∀fi ∈ F}). We will call those assignments that do not include any unspecified features environments. Let the set S′ ⊆ S correspond to the set of environments.

It should be easy to see that the set S forms a Boolean lattice with the following relation among the assignments, ≤R: for any two assignments a1 and a2, a1 ≤R a2 iff the value of every feature fi in a1 is identical to the value of fi in a2, unless fi is unspecified in a2. The top element of the lattice is an assignment in which all features are unspecified, and the bottom is the contradiction. Every element of the lattice is a monomial corresponding to the conjunction of the specified feature values. An example lattice for two binary features is given in Figure 1.

Figure 1: A lattice for 2 binary features

A language L consists of pairs from Σ × S′. That is, the learner is exposed to morphs in different environments.

3 However, we could also conceive of morphs as functions specifying what transformations apply to the stem, without much change to the formalism.

One way of stating the learning problem is to say that the learner has to learn a grammar for the target language L (we would then have to specify what this grammar should look like). Another way is to say that the learner has to learn the language mapping itself. We can do the latter by using Boolean functions to represent the mapping of each morph to a set of environments. Depending on how we state the learning problem, we might get different results. For instance, it is known that some subsets of DNFs are not learnable, while the Boolean functions corresponding to them are learnable (Valiant, 1984). Since I will use Boolean functions for some of the learners below, I introduce the following notation. Let B be the set of Boolean functions mapping elements of S′ to true or false. For convenience, we say that bm corresponds to a Boolean function that maps a set of environments to true when they are associated with m in L, and to false otherwise.

3 Learning Algorithms

3.1 Learner 1: an extension of the Valiant k-DNF learner

The observation that a morphological paradigm can be represented as a partition of environments in which each block corresponds to a mapping between a morph and a DNF allows us to easily convert standard DNF learning algorithms that rely on positive and negative examples into paradigm-learning algorithms that rely on positive examples only. We can do that by iteratively applying any DNF learning algorithm, treating instances of input pairs like (m, e) as positive examples for m and as negative examples for all other morphs.

Below, I show how this can be done by extending the k-DNF4 learner of (Valiant, 1984) to a paradigm learner. To handle cases of free variation we need to keep track of what morphs occur in exactly the same environments. We can do this by defining the partition Π on the input following the recipe in (1) (substituting environments for the variable r).

The original learner learns from negative examples alone. It initializes the hypothesis to the disjunction of all possible conjunctions of length at most k, and subtracts from this hypothesis monomials that are consistent with the negative examples. We will do the same thing for each

4 A k-DNF formula is a formula with at most k feature values in each conjunct.



morph using positive examples only (as described above), and forgoing subtraction in cases of free variation. The modified learner is given below. The following additional notation is used: Lex is the lexicon or a hypothesis. The formula D is a disjunction of all possible conjunctions of length at most k. We say that two assignments are consistent with each other if they agree on all specified features. Following standard notation, we assume that the learner is exposed to some text T that consists of an infinite sequence of (possibly) repeating elements from L. tj is a finite subsequence of the first j elements from T. L(tj) is the set of elements in tj.

Learner 1 (input: tj)

1. set Lex := {〈m, D〉 | ∃〈m, e〉 ∈ L(tj)}

2. For each 〈m, e〉 ∈ L(tj), for each m′ s.t. ¬∃ block bl ∈ Π of L(tj) with 〈m, e〉 ∈ bl and 〈m′, e〉 ∈ bl: replace 〈m′, f〉 in Lex by 〈m′, f′〉 where f′ is the result of removing every monomial consistent with e.

This learner initially assumes that every morph can be used everywhere. Then, when it hears one morph in a given environment, it assumes that no other morph can be heard in exactly that environment, unless it already knows that this environment permits free variation (this is established in the partition Π).
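A batch sketch of Learner 1 in executable form (the feature encoding, the helper names and the choice of k are illustrative assumptions, not part of the original formulation):

from itertools import combinations, product

def monomials(features, k):
    # All conjunctions (partial assignments) specifying at most k features;
    # features maps a feature name to its list of possible values.
    result = [frozenset()]                       # the empty conjunction
    names = sorted(features)
    for r in range(1, k + 1):
        for subset in combinations(names, r):
            for values in product(*(features[f] for f in subset)):
                result.append(frozenset(zip(subset, values)))
    return result

def consistent(monomial, env):
    # env is a full assignment (a dict); agree on every specified feature.
    return all(env[f] == v for f, v in monomial)

def learner1(text, features, k):
    # text: a batch of (morph, environment) pairs observed so far.
    key = lambda env: frozenset(env.items())
    heard = {}                                   # environment -> morphs heard there
    for m, e in text:
        heard.setdefault(key(e), set()).add(m)
    # Step 1: every morph starts with all monomials of length <= k.
    lex = {m: set(monomials(features, k)) for m, _ in text}
    # Step 2: hearing (m, e) removes, from every morph not heard in e,
    # all monomials consistent with e (free variation is left untouched).
    for m, e in text:
        for m2 in lex:
            if m2 not in heard[key(e)]:
                lex[m2] = {mono for mono in lex[m2] if not consistent(mono, e)}
    return lex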

4 Learner 2:

The next learner is an elaboration on the previous learner. It differs from it in only one respect: instead of initializing the lexical representation of every morph to be a disjunction of all possible monomials of length at most k, we initialize it to be the disjunction of all and only those monomials that are consistent with some environment paired with the morph in the language. This learner is similar to the DNF learners that do something on both positive and negative examples (see (Kushilevitz and Roth, 1996; Blum, 1992)).

So, for every morph m used in the language, we define a disjunction of monomials Dm that can be derived as follows: (i) let Em be the enumeration of all environments in which m occurs in L; (ii) let Mi correspond to the set of all subsets of feature values in ei, ei ∈ Em; (iii) let Dm be ∨M, where a set s ∈ M iff s ∈ Mi for some i.

Learner 2 can now be stated as a learner that is identical to Learner 1 except for the initial setting of Lex. Now, Lex will be set to Lex := {〈m, Dm〉 | ∃〈m, e〉 ∈ L(tj)}.

Because this learner does not require enumeration of all possible monomials, but just of those that are consistent with the positive data, it can handle the "polynomially explainable" subclass of DNFs (for more on this see (Kushilevitz and Roth, 1996)).
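Under the encoding sketched earlier, Dm can be built directly from the observed environments of m: for a total assignment e, the monomials consistent with e are exactly the subsets of e's feature values. The sketch below (my formulation) replaces Learner 1's initialization with this restricted one and keeps the elimination step unchanged.

from itertools import combinations

def powerset(e):
    """All subsets of a set of feature values, i.e. all monomials
    consistent with the total assignment e."""
    return {frozenset(c) for r in range(len(e) + 1) for c in combinations(e, r)}

def learner2(text):
    """Sketch of Learner 2: like Learner 1, but each morph starts from D_m,
    the monomials consistent with some environment observed with m."""
    data = set(text)
    envs_of = {}
    for m, e in data:
        envs_of.setdefault(m, set()).add(e)

    lex = {m: set().union(*(powerset(e) for e in envs))           # D_m
           for m, envs in envs_of.items()}
    for m, e in data:                                             # same elimination step
        for m2 in lex:
            if m2 == m or envs_of[m2] == envs_of[m]:
                continue
            lex[m2] = {mono for mono in lex[m2] if not consistent(mono, e)}
    return lex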

5 Learner 3: a learner biased towards monomial and elsewhere distributions

Next, I present a batch version of a learner I proposed based on certain typological observations and linguists' insights about blocking. The typological observations come from a sample of verbal agreement paradigms (Pertsova, 2007) and personal pronoun paradigms (Cysouw, 2003) showing that the majority of paradigms have either a "monomial" or an "elsewhere" distribution (defined below).

Roughly speaking, a morph has a monomial distribution if it can be described with a single monomial. A morph has an elsewhere distribution if this distribution can be viewed as a complement of the distributions of other monomial or elsewhere morphs. To define these terms more precisely I need to introduce some additional notation. Let ⋂ex be the intersection of all environments in which morph x occurs (i.e., these are the invariant features of x). This set corresponds to a least upper bound of the environments associated with x in the lattice 〈S, ≤R〉; call it lubx. Then, let the minimal monomial function for a morph x, denoted mmx, be a Boolean function that maps an environment to true if it is consistent with lubx and to false otherwise. As usual, the extension of a Boolean function, ext(b), is the set of all assignments that b maps to true.
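Under the encoding used in the earlier sketches, lubx is just the set intersection of x's environments and mmx is a consistency test against it. A small sketch (the helper names are mine):

def invariant_features(envs):
    """lub_x: the intersection of all environments in which x occurs."""
    envs = list(envs)
    lub = set(envs[0])
    for e in envs[1:]:
        lub &= e
    return frozenset(lub)

def minimal_monomial(envs):
    """mm_x as a predicate: true of an environment iff that environment is
    consistent with the invariant features of x."""
    lub = invariant_features(envs)
    return lambda e: consistent(lub, e)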

(2) Monomial distribution
A morph x has a monomial distribution iff bx ≡ mmx.

The above definition states that a morph has a monomial distribution if its invariant features pick out just those environments that are associated with this morph in the language. More concretely, if a monomial morph always co-occurs with the feature +singular, it will appear in all singular environments in the language.

(3) Elsewhere distribution
A morph x has an elsewhere distribution iff bx ≡ mmx − (mmx1 ∨ mmx2 ∨ . . . ∨ mmxn), for all xi ≠ x in Σ.

The definition above amounts to saying that a morph has an elsewhere distribution if the environments in which it occurs are in the extension of its minimal monomial function minus the minimal monomial functions of all other morphs. An example of a lexical item with an elsewhere distribution is the present tense form are of the verb "to be", shown below.

Table 1: The present tense of "to be" in English

         sg.    pl.
1p.      am     are
2p.      are    are
3p.      is     are

Elsewhere morphemes are often described in linguistic accounts by appealing to the notion of blocking. For instance, the lexical representation of are is said to be unspecified for both person and number, and is said to be "blocked" by two other forms: am and is. My hypothesis is that the reason why such non-monotonic analyses appear so natural to linguists is the same reason why monomial and elsewhere distributions are typologically common: namely, the learners (and, apparently, the analysts) are prone to generalize the distribution of morphs to minimal monomials first, and later correct any overgeneralizations that might arise by using default reasoning, i.e. by positing exceptions that override the general rule. Of course, the above strategy alone is not sufficient to capture distributions that are neither monomial nor elsewhere (I call such distributions "overlapping"; cf. the suffixes -en and -t in the German paradigm in Table 2), which might also explain why such paradigms are typologically rare.

Table 2: Present tense of some regular verbs in German

         sg.    pl.
1p.      -e     -en
2p.      -st    -t
3p.      -t     -en

The original learner I proposed is an incremental learner that calculates grammars similar to those proposed by linguists, namely grammars consisting of a lexicon and a filtering "blocking" component. The version presented here is a simpler batch learner that learns a partition of Boolean functions instead.5 Nevertheless, the main properties of the original learner are preserved: specifically, a bias towards monomial and elsewhere distributions.

To determine what kind of distribution a morph has, I define a relation C. A morph m stands in the relation C to another morph m′ if ∃〈m, e〉 ∈ L such that lubm′ is consistent with e. In other words, mCm′ if m occurs in any environment consistent with the invariant features of m′. Let C+ be the transitive closure of C.
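The relation C and its transitive closure can be computed directly from the observed pairs. In the sketch below (again my formulation), the reflexive case m = m′ is left out, since otherwise every morph would trivially stand in C to itself via its own lub; this reading keeps the later least-element step well defined.

def blocking_relation(data):
    """C+: transitive closure of 'm C m2 iff m occurs in some environment
    consistent with the invariant features (lub) of m2'."""
    envs_of = {}
    for m, e in data:
        envs_of.setdefault(m, set()).add(e)
    lub = {m: invariant_features(envs) for m, envs in envs_of.items()}

    closure = {(m, m2) for m in envs_of for m2 in envs_of
               if m != m2 and any(consistent(lub[m2], e) for e in envs_of[m])}
    changed = True
    while changed:                       # naive transitive closure
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure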

Learner 3 (input: tj)

1. Let S(tj) be the set of pairs in tj containing monomial- or elsewhere-distribution morphs. That is, 〈m, e〉 ∈ S(tj) iff ¬∃m′ such that mC+m′ and m′C+m.

2. Let O(tj) = tj − S(tj) (the set of all other pairs).

3. A pair 〈m, e〉 ∈ S is a least element of S iff ¬∃〈m′, e′〉 ∈ (S − {〈m, e〉}) such that m′C+m.

4. Given a hypothesis Lex, and for any expression 〈m, e〉 ∈ Lex: let rem((m, e), Lex) = (m, (mmm − {b | 〈m′, b〉 ∈ Lex}))6

1. set S := S(tj) and Lex := ∅

2. While S ≠ ∅: remove a least x from S and set Lex := Lex ∪ rem(x, Lex)

3. Set Lex := Lex ∪ O(tj).

This learner initially assumes that the lexicon is empty. Then it proceeds to add Boolean functions corresponding to minimal monomials for morphs that are in the set S(tj) (i.e., morphs that have either monomial or elsewhere distributions).

5 I thank Ed Stabler for relating this batch learner to me (p.c.).

6 For any two Boolean functions b, b′: b − b′ is the function that maps e to 1 iff e ∈ ext(b) and e ∉ ext(b′). Similarly, b + b′ is the function that maps e to 1 iff e ∈ ext(b) or e ∈ ext(b′).



This is done in a particular order, namely in the order in which the morphs can be said to block each other. The remaining text is learned by rote memorization. Although this learner is more complex than the previous two learners, it generalizes fast when applied to paradigms with monomial and elsewhere distributions.
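The following Python sketch is my reading of this batch learner, reusing consistent(), invariant_features() and blocking_relation() from the earlier snippets. Boolean functions are stored extensionally (as sets of environments over a known feature set); as one reading of the least-element clause, a pending morph counts as least when no other pending morph stands in C+ to it, and each morph is processed once.

from itertools import product

def all_environments(features):
    """Every total assignment over the given features."""
    return {frozenset(vals)
            for vals in product(*[("+" + f, "-" + f) for f in features])}

def learner3(text, features):
    """Sketch of the batch version of Learner 3."""
    data = set(text)
    space = all_environments(features)
    envs_of = {}
    for m, e in data:
        envs_of.setdefault(m, set()).add(e)
    lub = {m: invariant_features(envs) for m, envs in envs_of.items()}
    mm_ext = {m: {e for e in space if consistent(lub[m], e)} for m in envs_of}

    cplus = blocking_relation(data)
    cyclic = {m for m in envs_of
              if any((m, m2) in cplus and (m2, m) in cplus for m2 in envs_of)}
    S = {p for p in data if p[0] not in cyclic}     # monomial/elsewhere morphs
    O = data - S                                     # everything else

    lex = {}                                         # morph -> extension
    pending = {m for m, _ in S}
    while pending:
        m = next(m for m in pending                  # a least morph
                 if not any(m2 != m and (m2, m) in cplus for m2 in pending))
        covered = set().union(*lex.values()) if lex else set()
        lex[m] = mm_ext[m] - covered                 # rem(x, Lex)
        pending.remove(m)
    for m, e in O:                                   # rote-memorize the rest
        lex.setdefault(m, set()).add(e)
    return lex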

5.1 Learner 4: a learner biased towards shorter formulas

Next, I discuss a learner for morphological paradigms proposed by another linguist, David Adger. Adger describes his learner informally, showing how it would work on a few examples. Below, I formalize his proposal in terms of learning Boolean partitions. The general strategy of this learner is to consider the simplest monomials first (those with the fewest specified features) and see how much data they can unambiguously and non-redundantly account for. If a monomial is consistent with several morphs in the text, it is discarded unless the morphs in question are in free variation. This simple strategy is reiterated for the next set of most simple monomials, and so on.

Learner 4 (input: tj)

1. Let Mi be the set of all monomials over F with i specified features.

2. Let Bi be the set of Boolean functions from environments to truth values corresponding to Mi in the following way: for each monomial mn ∈ Mi, the corresponding Boolean function b is such that b(e) = 1 if e is an environment consistent with mn; otherwise b(e) = 0.

3. Uniqueness check: for a Boolean function b, morph m, and text tj, let unique(b, m, tj) = 1 iff ext(bm) ⊆ ext(b) and ¬∃〈m′, e〉 ∈ L(tj) s.t. e ∈ ext(b) and e ∉ ext(bm).

1. set Lex := Σ × ∅ and i := 0;

2. while Lex does not correspond to L(tj) AND i ≤ |F| do: for each b ∈ Bi, for each m s.t. ∃〈m, e〉 ∈ L(tj):

• if unique(b, m, tj) = 1 then replace 〈m, f〉 with 〈m, f + b〉 in Lex

i ← i + 1

This learner considers all monomials in the order of their simplicity (determined by the number of specified features), and if the monomial in question is consistent with environments associated with a unique morph, then these environments are added to the extension of the Boolean function for that morph. As a result, this learner will converge faster on paradigms in which morphs can be described with disjunctions of shorter monomials, since such monomials are considered first.
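A sketch of Learner 4 under the same encoding. Rather than transcribing the formal unique() check verbatim, this version follows the prose: a monomial with i specified features is credited to a morph only if, in the observed text, it is consistent solely with environments of that single morph (free variation is ignored here for brevity), and the loop stops once every observed pair is covered or all feature counts have been tried. The helper names are mine.

from itertools import combinations, product

def monomials_of_size(features, i):
    """All monomials over F with exactly i specified features."""
    out = set()
    for chosen in combinations(features, i):
        for signs in product("+-", repeat=i):
            out.add(frozenset(s + f for s, f in zip(signs, chosen)))
    return out

def learner4(text, features):
    """Sketch of Learner 4: consider monomials from simplest to most
    specific, crediting each unambiguous one to the unique morph it picks
    out in the observed data."""
    data = set(text)
    lex = {m: set() for m, _ in data}        # morph -> accepted monomials

    def covers_all_data():
        return all(any(consistent(mono, e) for mono in lex[m]) for m, e in data)

    i = 0
    while not covers_all_data() and i <= len(features):
        for mono in monomials_of_size(features, i):
            matching = {m for m, e in data if consistent(mono, e)}
            if len(matching) == 1:           # unambiguous: add it to that morph
                lex[next(iter(matching))].add(mono)
        i += 1
    return lex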

6 Comparison

6.1 Basic properties

First, consider some of the basic properties of the learners presented here. For this purpose, we will assume that we can apply these learners in an iterative fashion to larger and larger batches of data. We say that a learner is consistent if and only if, given a text tj, it always converges on the grammar generating all the data seen in tj (Osherson et al., 1986). A learner is monotonic if and only if for every text t and every point j < k, the hypothesis the learner converges on at tj is a subset of the hypothesis at tk (or, for learners that learn by elimination, the hypothesis at tj is a superset of the hypothesis at tk). And, finally, a learner is generalizing if and only if for some tj it converges on a hypothesis that makes a prediction beyond the elements of tj.

The table below classifies the four learners according to the above properties.

Learner      consist.   monoton.   generalizing
Learner 1    yes        yes        yes
Learner 2    yes        yes        yes
Learner 3    yes        no         yes
Learner 4    yes        yes        yes

All learners considered here are generalizing and consistent, but they differ with respect to monotonicity. Learner 3 is non-monotonic while the remaining learners are monotonic. While monotonicity is a nice computational property, some aspects of human language acquisition are suggestive of a non-monotonic learning strategy, e.g. the presence of overgeneralization errors and their subsequent corrections by children (Marcus et al., 1992). Thus, the fact that Learner 3 is non-monotonic might speak in its favor.



6.2 Illustration

To demonstrate how the learners work, consider this simple example. Suppose we are learning the following distribution of morphs A and B over two binary features.

(4) Example 1

         +f1    −f1
+f2       A      B
−f2       B      B

Suppose further that the text t3 is:

A    +f1;+f2
B    −f1;+f2
B    +f1;−f2

Learner 1 generalizes right away by assuming that every morph can appear in every environment, which leads to massive overgeneralizations. These overgeneralizations are eventually eliminated as more data is discovered. For instance, after processing the first pair in the text above, the learner "learns" that B does not occur in any environment consistent with (+f1;+f2), since it has just seen A in that environment. After processing t3, Learner 1 has the following hypothesis:

A    (+f1;+f2) ∨ (−f1;−f2)
B    (−f1) ∨ (−f2)

That is, after seeing t3, Learner 1 correctly predicts the distribution of morphs in the environments that it has seen, but it still predicts that both A and B should occur in the not-yet-observed environment, (−f1;−f2). This learner can sometimes converge before seeing all data points, especially if the input includes a lot of free variation. In fact, if in the above example A and B were in free variation in all environments, Learner 1 would have converged right away on its initial setting of the lexicon. However, in paradigms with no free variation convergence is typically slow, since the learner follows a very conservative strategy of learning by elimination.

Unlike Learner 1, Learner 2 will converge after seeing t3. This is because this learner's initial hypothesis is more restricted: the initial hypothesis for A includes a disjunction of only those monomials that are consistent with (+f1;+f2). Hence, A is never overgeneralized to (−f1;−f2). Like Learner 1, Learner 2 also learns by elimination; however, on top of that it also restricts its initial hypothesis, which leads to faster convergence.
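For what it is worth, a hypothetical run of the earlier sketches on this text reproduces both predictions (the data encoding, again, is mine):

# Example 1's text t3 in the encoding used by the sketches above.
t3 = [("A", frozenset({"+f1", "+f2"})),
      ("B", frozenset({"-f1", "+f2"})),
      ("B", frozenset({"+f1", "-f2"}))]
features = ["f1", "f2"]
unseen = frozenset({"-f1", "-f2"})           # the not-yet-observed environment

lex1 = learner1(t3, features, k=2)
lex2 = learner2(t3)
print(evaluates(lex1["A"], unseen), evaluates(lex1["B"], unseen))  # True True: both A and B predicted
print(evaluates(lex2["A"], unseen), evaluates(lex2["B"], unseen))  # False True: Learner 2 has converged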

Let's now consider the behavior of Learner 3 on Example 1. Recall that this learner first computes the minimal monomials of all morphs and checks whether they have monomial or elsewhere distributions (this is done via the relation C+). In this case, A has a monomial distribution and B has an elsewhere distribution. Therefore, the learner first computes the Boolean function for A, whose extension is simply (+f1;+f2), and then the Boolean function for B, whose extension includes the environments consistent with (∗;∗) minus those consistent with (+f1;+f2), which yields the following hypothesis:

ext(bA)    [+f1;+f2]
ext(bB)    [−f1;+f2]  [+f1;−f2]  [−f1;−f2]

That is, Learner 3 generalizes and converges on the right language after seeing text t3.

Learner 4 also converges at this point. This learner first considers how much data can be unambiguously accounted for with the most minimal monomial, (∗;∗). Since both A and B occur in environments consistent with this monomial, nothing is added to the lexicon. On the next round, it considers all monomials with one specified feature. Two such monomials, (−f1) and (−f2), are consistent only with B, and so we predict B to appear in the not-yet-seen environment (−f1;−f2). Thus, the hypothesis that Learner 4 arrives at is the same as the hypothesis Learner 3 arrives at after seeing t3.
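Continuing the hypothetical run from the previous snippet, the sketches of Learners 3 and 4 also assign the unseen environment to B (Learner 3 returns extensions, Learner 4 sets of monomials):

lex3 = learner3(t3, features)
lex4 = learner4(t3, features)
print(unseen in lex3["B"], unseen in lex3["A"])               # True False
print(any(consistent(mono, unseen) for mono in lex4["B"]))    # True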

6.3 Differences

While the last three learners perform similarly on the simple example above, there are significant differences between them. These differences become apparent when we consider larger paradigms with homonymy and free variation.

First, let's look at an example that involves a more elaborate homonymy than Example 1. Consider, for instance, the following text.

(5) Example 2

A    [+f1;+f2;+f3]
A    [+f1;−f2;−f3]
A    [+f1;+f2;−f3]
A    [−f1;+f2;+f3]
B    [−f1;−f2;−f3]



Given this text, all three learners will differ in their predictions with respect to the environment (−f1;+f2;−f3). Learner 2 will predict both A and B to occur in this environment, since not enough monomials will be removed from the representations of A or B to rule out either morph from occurring in (−f1;+f2;−f3). Learner 3 will predict A to appear in all environments that haven't been seen yet, including (−f1;+f2;−f3). This is because in the current text the minimal monomial for A is (∗;∗;∗) and A has an elsewhere distribution. On the other hand, Learner 4 predicts B to occur in (−f1;+f2;−f3). This is because the extension of the Boolean function for B includes any environment consistent with (−f1;−f3) or (−f1;−f2), since these are the simplest monomials that uniquely pick out B.

Thus, the three learners follow very different generalization routes. Overall, Learner 2 is more cautious and slower to generalize. It predicts free variation in all environments for which not enough data has been seen to converge on a single morph. Learner 3 is unique in preferring monomial and elsewhere distributions. For instance, in the above example it treats A as a 'default' morph. Learner 4 is unique in its preference for morphs describable with disjunctions of simpler monomials. Because of this preference, it will sometimes generalize even after seeing just one instance of a morph (since several simple monomials can be consistent with this instance alone).

One way to test what human learners do in a situation like the one above is to use artificial grammar learning experiments. Such experiments have been used for learning individual concepts over features like shape, color, texture, etc. Some work on concept learning suggests that it is subjectively easier to learn concepts describable with shorter formulas (Feldman, 2000; Feldman, 2004). Other recent work challenges this idea (Lafond et al., 2007), showing that people don't always converge on the most minimal representation, but instead go for a more simple and general representation and learn exceptions to it (this approach is more in line with Learner 3).

Some initial results from my pilot experiments on learning partitions of concept spaces (using abstract shapes rather than language stimuli) also suggest that people find paradigms with elsewhere distributions easier to learn than ones with overlapping distributions (like the German paradigm in Table 2). However, I also found a bias toward paradigms with fewer relevant features. This bias is consistent with Learner 4, since this learner tries to assume the smallest number of relevant features possible. Thus, both learners have their merits.

Another area in which the considered learners make somewhat different predictions has to do with free variation. While I can't discuss this at length due to space constraints, let me comment that any batch learner can easily detect free variation before generalizing, which is exactly what most of the above learners do (except Learner 3, but it can also be changed to do the same thing). However, since free variation is rather marginal in morphological paradigms, it is possible that it would be rather problematic. In fact, free variation is more problematic if we switch from the batch learners to incremental learners.

7 Directions for further research

There are of course many other learners one could consider for learning paradigms, including approaches quite different in spirit from the ones considered here. In particular, some recently popular approaches conceive of learning as matching probabilities of the observed data (e.g., Bayesian learning). Comparing such approaches with the algorithmic ones is difficult since the criteria for success are defined so differently, but it would still be interesting to see whether the kinds of prior assumptions needed for a Bayesian model to match human performance would have something in common with properties that the learners considered here relied on. These properties include the disjoint nature of paradigm cells, the prevalence of monomial and elsewhere morphs, and the economy considerations. Other empirical work that might help to differentiate Boolean partition learners (besides typological and experimental work already mentioned) includes finding relevant language acquisition data, and examining (or modeling) language change (assuming that learning biases influence language change).

References

David Adger. 2006. Combinatorial variation. Journal of Linguistics, 42:503–530.



Avrim Blum. 1992. Learning Boolean functions in an infinite attribute space. Machine Learning, 9:373–386.

Michael Cysouw. 2003. The Paradigmatic Structure of Person Marking. Oxford University Press, NY.

Jacob Feldman. 2000. Minimization of complexity in human concept learning. Nature, 407:630–633.

Jacob Feldman. 2004. How surprising is a simple pattern? Quantifying 'Eureka!'. Cognition, 93:199–224.

John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27:153–198.

Anthony Kroch. 1994. Morphosyntactic variation. In Katharine Beals et al., editor, Papers from the 30th Regional Meeting of the Chicago Linguistics Society: Parasession on Variation and Linguistic Theory. Chicago Linguistics Society, Chicago.

Eyal Kushilevitz and Dan Roth. 1996. On learning visual concepts and DNF formulae. Machine Learning, 24:65–85.

Daniel Lafond, Yves Lacouture, and Guy Mineau. 2007. Complexity minimization in rule-based category learning: revising the catalog of Boolean concepts and evidence for non-minimal rules. Journal of Mathematical Psychology, 51:57–74.

Gary Marcus, Steven Pinker, Michael Ullman, Michelle Hollander, T. John Rosen, and Fei Xu. 1992. Overregularization in language acquisition. Monographs of the Society for Research in Child Development, 57(4). Includes commentary by Harold Clahsen.

Robert M. Nosofsky, Thomas J. Palmeri, and S. C. McKinley. 1994. Rule-plus-exception model of classification learning. Psychological Review, 101:53–79.

Daniel Osherson, Scott Weinstein, and Michael Stob. 1986. Systems that Learn. MIT Press, Cambridge, Massachusetts.

Katya Pertsova. 2007. Learning Form-Meaning Mappings in the Presence of Homonymy. Ph.D. thesis, University of California, Los Angeles.

Jeffrey Mark Siskind. 1996. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):1–38, Oct-Nov.

Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent. 2002. Unsupervised learning of morphology using a novel directed search algorithm: taking the first step. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 11–20, Morristown, NJ, USA. Association for Computational Linguistics.

Leslie G. Valiant. 1984. A theory of the learnable. CACM, 27(11):1134–1142.

Daniel Zeman. 2007. Unsupervised acquiring of morphological paradigms from tokenized text. In Working Notes for the Cross Language Evaluation Forum Workshop, Budapest, Hungary.



Author Index

Angluin, Dana, 16

Becerra-Bonache, Leonor, 16

Candito, Marie, 49
Casacuberta, Francisco, 24
Cavar, Damir, 5
Clark, Alexander, 33
Crabbe, Benoit, 49

de la Higuera, Colin, 1

Eyraud, Remi, 33

Geertzen, Jeroen, 7
Gonzalez, Jorge, 24

Habrard, Amaury, 33

Infante-Lopez, Gabriel, 58

Luque, Franco M., 58

Pertsova, Katya, 66

Seddah, Djame, 49
Stehouwer, Herman, 41

van Zaanen, Menno, 1, 41
