On the Design of a Tag Set for Dravidian Languages



7/24/2019 On the Design of a Tag Set for Dravidian Languages

http://slidepdf.com/reader/full/on-the-design-of-a-tag-set-for-dravidian-languages 1/22

On the Design of a Tag Set for Dravidian Languages

Kavi Narayana Murthy and Badugu Srinivasu

Department of Computer and Information Sciences

University of Hyderabad

[email protected], [email protected]

Abstract

Tagging is the process of assigning short labels to words in a text for the purpose of indicating lexical, morphological, syntactic, semantic or other pieces of information associated with those words. When the focus is mainly on syntactic categories (and sub-categories), this is also known as part-of-speech or POS tagging. It may be noted that the term tagging is broader than the term POS tagging. Lexical, morphological and syntactic levels are well recognized in linguistics, and linguistic theories normally do not posit tagging or chunking levels at all. In computational implementations within the field of Natural Language Processing (NLP), however, it has generally been found that words are highly ambiguous and ambiguities multiply at an exponential rate, making syntactic parsing much more challenging. Tagging is a level that has been introduced before syntactic parsing with the main intention of reducing these ambiguities. It may be noted that the lexicon and morphological analyzer typically look at words in isolation and consider all possible meanings, structures, or tags. A tagger, on the other hand, looks at the sentential context and, using this knowledge, attempts to reduce the possible tags for a given word in the context in which it appears. It is not absolutely necessary that all ambiguities be removed before we go to syntactic analysis; it is important only to reduce the degree of ambiguity. Syntactic parsers are in any case capable of dealing with ambiguities. If the degree of tag ambiguity is very low, tagging may not even be necessary. Tagging is useful to the extent that it simplifies syntactic parsing. There does not seem to be any evidence that the human mind carries out tagging or chunking as separate processes before it embarks upon syntactic analysis.

Critical issues connected with tagging are the design of the tag set, the approach to tagging and the exact methodology used. These issues are no doubt inseparably connected with the overall purpose and the design of the lexicon, morphology and syntactic modules. The beaten path is to develop a manually tagged database of sentences and then use this for training a machine learning algorithm. The machine learning algorithm is expected to generalize from these training examples so that it can then tag any new sentence. Manual tagging is difficult, time consuming and prone to human errors. Consistency is difficult to achieve, especially if the tag set is elaborate and fine grained. Also, given the limited amount of training data that is practically possible to develop, a large and detailed tag set


will lead to sparsity of training data, and machine learning algorithms will fail to learn effectively. From these considerations, NLP researchers tend to restrict themselves to small, shallow or flat tag sets which are least confusing to human annotators and easy for machine learning algorithms to model. When this idea is taken to the extreme, useful sub-categorizations can be lost. In this paper we propose an alternative view and a novel approach to tagging. We focus on Dravidian languages here and demonstrate our system for the Telugu and Kannada languages. We believe the ideas presented in this paper are applicable to all languages.

Before we get into designing a tagger, we must ask where we get the information required to select a particular tag out of all the possible tags for a given word in a given sentence. The general assumption in most tagging work has been that this information comes from the other words in the sentence. This is not always true. In the case of Dravidian, for example, we find that in most cases the information required for disambiguating tags comes from word-internal structure, not from the other words in the sentential context. Morphology therefore does the major part of tagging, and there is no need for Markov models and the like. In fact, we believe that the same approach can be effectively applied to all languages, provided we change our view of what constitutes a word.
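As a minimal illustration of this word-internal view, the sketch below assigns a tag purely from a word's own ending. The suffixes, tags and example forms are invented for illustration and do not come from an actual Telugu analyzer.

```python
# A minimal sketch of word-internal tagging: the suffix table and example
# forms below are illustrative stand-ins, not a real Telugu morphology.

# Each (suffix, tag) pair encodes the claim that the inflection itself
# reveals the grammatical category of the whole word.
SUFFIX_TAGS = [
    ("taadu", "V.FIN.FUT.3SG.M"),   # hypothetical finite-verb ending
    ("lanu",  "N.PL.ACC"),          # hypothetical plural-accusative ending
    ("ki",    "N.SG.DAT"),          # hypothetical dative ending
]

def tag_by_morphology(word):
    """Return the tag determined by the word's own ending, if any."""
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return None  # ambiguity survives morphology; defer to syntax

print(tag_by_morphology("vastaadu"))      # V.FIN.FUT.3SG.M
print(tag_by_morphology("pustakaalanu"))  # N.PL.ACC
```

In this scheme the sentential context is consulted only for the residue of words whose endings decide nothing.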

The lexicon deals with words and their meanings, morphology is all about the internal structure of words, and tags are assigned to words. Given all this, what exactly a word is becomes a fundamental question. We believe that a lot of avoidable confusion arises both in NLP and in linguistics because of a heavy emphasis on the written form of a word. A sequence of characters separated by spaces is considered to be a word. We instead define words as meaningful sequences of phonemes, and we try not to be influenced by the written form. Whether and where spaces appear is irrelevant to us. When viewed from this standpoint, we find that the degree of lexical ambiguity is far less than normally seen in other approaches. A vast majority of words are not ambiguous at all. Most of the remaining cases of ambiguity at the root-word level get resolved automatically once we consider the morphology of the inflected forms in which these words usually occur in running texts. Therefore, we believe that the main task of tagging is not one of tag assignment but only of tag disambiguation. Tags can be assigned by the dictionary and morphology components quite effectively. This way, we can develop large-scale annotated corpora with high-quality tagging, without any need for manual tagging or any machine learning algorithm. We demonstrate our approach for Telugu and Kannada and argue for the merits of our approach compared to other competing approaches.
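The division of labour just described, tag assignment by the lexicon and tag disambiguation by morphology, can be sketched as follows. The roots, suffixes and tags here are invented examples, not entries from an actual lexicon.

```python
# Sketch of "tag assignment vs tag disambiguation": the lexicon proposes
# all candidate tags for a root, and the inflection observed in running
# text prunes them. Roots, suffixes and tags are invented for illustration.

LEXICON = {"aaTa": {"N"}, "aaDu": {"V", "N"}}   # 'aaDu' ambiguous in isolation

# Inflections compatible with each major category (illustrative only).
NOMINAL_SUFFIXES = {"lu", "ni"}
VERBAL_SUFFIXES = {"taaDu", "indi"}

def disambiguate(root, suffix):
    """Intersect the lexicon's candidate tags with what the suffix allows."""
    candidates = set(LEXICON.get(root, set()))
    if suffix in NOMINAL_SUFFIXES:
        candidates &= {"N"}
    elif suffix in VERBAL_SUFFIXES:
        candidates &= {"V"}
    return candidates

print(disambiguate("aaDu", "taaDu"))   # {'V'} - the inflection resolves it
```

No training data enters anywhere in this pipeline; the ambiguity that survives is handed to syntax.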

Here we do not need to do any tagging manually. There is no need for any training data and there is no need for any machine learning algorithm. A major part of the tagging work is done by the lexicon and morphology, automatically. Only a small part of the data will remain ambiguous after morphology. Most of these cases are easily resolved by syntax, as they are all fully rule governed. It is thus possible to develop large-scale, high-quality tagged data automatically once we have a proper morphology component.


Since the work is done automatically and there is no need to worry about the effectiveness of machine learning, we can afford to have a fairly large, elaborate, fine-grained, hierarchical tag set, which captures as much lexical, morphological, syntactic and semantic information as is necessary or useful. Here we shall present such a fine-grained hierarchical tag set, designed especially for Dravidian languages but believed to be more universal than that.
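One simple way to realize such a hierarchical tag set is to write each tag as a dot-separated path from the major category down to finer sub-categories, so that coarse and fine queries work on the same labels. The particular tag names below are invented for illustration and are not the tag set proposed in this paper.

```python
# Sketch of a hierarchical tag: a dot-separated path from coarse category
# to fine sub-categories. Tag names are invented for illustration.

def matches(tag, query):
    """A tag matches any prefix of its own hierarchy path."""
    return tag == query or tag.startswith(query + ".")

tag = "N.COMMON.PL.NOM"
print(matches(tag, "N"))          # True  - coarse POS query
print(matches(tag, "N.COMMON"))   # True  - sub-category query
print(matches(tag, "N.PROPER"))   # False
```

A hierarchy of this kind lets tools that want only coarse categories simply truncate the tags, while the full detail remains available to those that need it.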

One aspect that is often not given sufficient importance is the precise definition of each tag. When things are left to intuition and subjective interpretation, there will naturally be confusions and inconsistencies even in manually prepared or manually checked tagged data. Our aim shall be to define the tags as precisely as possible so that such confusions can be minimised.

We shall conclude the paper with tagging experiments and quantitative results.

1 Background and Introduction

1.1 Language, Grammar and Computation

We human beings are capable of producing a number of different kinds of sounds. We are capable of making interesting patterns by stringing together these basic sound units, called phonemes, into larger structures. And, most importantly, we are capable of systematically associating meanings with these patterns of sounds. Further, we are capable of learning such associations and of effectively communicating these association rules to others. We are also capable of communicating our ideas, thoughts, feelings and emotions to others by expressing them in patterns of sounds according to these mutually agreed upon mapping rules. This faculty of the human mind is called language. Language is the capacity to map sounds to meanings and use this for speech and thought. Language is, by far, a unique gift of nature to mankind.

Computers are capable of storing and manipulating symbol structures. They are not capable of understanding meanings. Computers do not understand the meaning of a single word in any human language. How then can we put computers to good use in the meaningful processing of language? The answer to this question comes from observing that there is structure in language and that there is a systematic relationship between structure and meaning. This relationship between structure and meaning can be observed, learned, taught and used. Therefore, if only we take care to store only meaningful structures and to allow only meaningful manipulations of such structures, we can ensure that everything remains meaningful (to us) throughout, although the machine itself does not understand a word. This systematic relationship between structure and meaning is what we shall call grammar. Grammar is thus the key to language processing in machines.

Grammar is the key to language processing even in humans. We do not simply store all possible linguistic units and their corresponding meanings. The number of possible sentences, for example, is infinite, and we are capable of understanding the meaning of sentences we have never heard before in our lives. This is possible only because we have a grammar in our head, and we use this grammar to construct new sentences or to understand sentences spoken by others. This is true not only of sentences but also of all levels of linguistic analysis.

To summarize, we must develop appropriate representations of basic linguistic units, appropriate representations of their structures and appropriate formalisms for manipulating these structures at all levels of linguistic analysis. What is appropriate and what is not is dictated mainly by meaning. The written form has no role at all in this. This is a brief summary of


the theory we have been working on. See [1] for more details.

1.2 Speech and Text

Language is a powerful means of communication. Of course, we can also communicate certain ideas and feelings through body language, gestures etc. Simply getting up or walking out can also convey some message to others. We can even communicate at times through silence. Nonetheless, by and large the most effective and most widely used means of communication among humans is speech. We have therefore defined language as a mental faculty of human beings that enables us to systematically map sound patterns to meanings. Language is speech; it has nothing to do with writing. We believe that an enormous amount of confusion has been created both within NLP and in linguistics by giving too much importance to the written form. We all learned our first language only by listening and speaking. Reading and writing are learned later, and only upon being taught. Literacy is not as important as people think today; one can be a highly knowledgeable scholar without being literate. Many languages of the world do not have a script of their own; the need for writing was never felt all through the history of many cultures. Writing is a technology that has been invented by us as an after-thought, while language is a natural gift of nature to mankind. Do not confuse language with writing or script. Language can exist without a script but not vice versa. It is unfortunate, therefore, that we have started defining everything based on the written form. Words are not sequences of letters or characters separated by spaces. It does not matter if there are zero, one or more spaces within or between words. In fact, inserting spaces is also a newly developed idea - stone inscriptions do not have spaces between words, for example. There are no spaces between words in speech. A word cannot be defined in terms of written symbols separated by white space. This is just not right.

A dictionary stores words and their meanings. Morphology deals with the internal structure of words. A tagger attaches tags to words. Sentences are built up from words. Words form very important and fundamental units of language. What exactly is a word, then?

Look at the sentence: 'Arjuna went on dancing'. Here 'Arjuna' is the subject, the person about whom we are saying something. The predicate, that is, what we are saying about the subject, is that he went on dancing. The predicate here indicates an action. A word that indicates action is called a verb. How many actions is Arjuna performing? Only one. There is only one action, and therefore there is only one verb here. We classify words into word classes like nouns and verbs. Here there is one verb and hence only one word indicating the action. 'Went on dancing' is thus one word, not three. This is the crux of our approach. We need to think in terms of meanings, not in terms of the written form.

This view is not new; linguists have always known very well that the written form is not to be taken too seriously. Despite this, we find that today, even within linguistics, the written form has somehow come to have an unreasonably strong influence on our thinking and working.

This meaning-based definition of a word has far-reaching implications. The so-called auxiliary verbs do not indicate auxiliary activities. They do not stand for sub-actions or constituents of an action or supporting actions or incidental or related actions. Walking may imply lifting the legs one by one, moving forward, placing them back on the ground, changing the balance of weight on the two legs and so on, but this is not what we mean when we talk of auxiliary verbs. We would therefore be compelled to reject the very idea of auxiliary verbs.

Taking 'has been coming' (English) or 'jaa rahaa tha' (Hindi) as single words has many merits. Features such as tense and aspect


apply to verbs; they cannot stand alone in isolation. What is the tense of 'jaa', what is the tense of 'rahaa' and what is the tense of 'tha'? Can there be several tenses for one verb? Items that indicate grammatical features are not words; they are morphemes, and they form part of one word. We need to take instances such as 'has been coming' and 'jaa rahaa tha' as individual words, deal with their internal structure through morphology, assign tags to them as atomic units, and look at sentences as sequences of such words. A drastic change in the way we think and work is called for. Once this happens, the differences we see between languages will melt away to a large extent and we will start seeing the underlying universals across human languages.
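As a rough sketch of what this change involves computationally, the fragment below merges an auxiliary sequence with its following verb into a single word unit before tagging. The auxiliary list is a small illustrative subset for English, and the merging heuristic is deliberately naive; a real implementation would use morphological analysis rather than a fixed word list.

```python
# Sketch: treat an auxiliary sequence like 'has been coming' as one word,
# so tense and aspect features attach to the whole unit. The auxiliary
# list and the greedy merging rule are illustrative, not a full analysis.

AUXILIARIES = {"has", "have", "had", "been", "be", "is", "was", "will"}

def group_verb_units(tokens):
    """Merge each run of auxiliaries with the following token into one unit."""
    out, run = [], []
    for tok in tokens:
        if tok in AUXILIARIES:
            run.append(tok)        # keep collecting the auxiliary run
        else:
            out.append(" ".join(run + [tok]) if run else tok)
            run = []
    if run:                        # trailing auxiliaries, e.g. elliptical 'has'
        out.append(" ".join(run))
    return out

print(group_verb_units(["she", "has", "been", "coming", "here"]))
# ['she', 'has been coming', 'here']
```

The tagger then sees 'has been coming' as one atomic unit carrying a single tense-aspect bundle, exactly as argued above.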

How far should one go along this line? 'A big tree' stands for a single object. A word that indicates an object, a thing, is a noun. Should we say 'a big tree' is therefore a single word? Of course we can. A word that is used in place of a noun is generally termed a pronoun. A pronoun stands not in place of a noun but, loosely speaking, in place of a whole noun phrase. That is, if we wish to use the pronoun 'it' in this example, this 'it' would stand for 'a big tree', not just for 'tree'. Otherwise, we should be able to say 'a big it', right? Since pronouns stand in place of nouns, and since pronouns stand in place of items like 'a big tree', such items should be considered single words. Are we trying to say that what we used to call a phrase is now renamed a word? What about the so-called function words? Since their primary role is to indicate grammatical function rather than lexical content, should we say they are not words at all? What about items like 'in' or 'on'? Do they have any meaning at all? Don't they have some meaning? Where should we draw the dividing line for defining words and other linguistic units? How should we deal with the non-word items?

Semantics is a bit nebulous by nature, and we must therefore be very careful. Since the broad goal is to discover the universal grammar that underlies all human languages, we must go by what is common across human languages. The notion of word classes leads us to good answers to all the questions we have raised here.

1.3 Words, Word Classes and Tagging

Words are minimal sequences of phonemes with meaning. Words that indicate things are called nouns. Words that indicate action or state of existence are called verbs. Nouns and verbs are examples of word classes. Nouns and verbs are found in almost all human languages; they are universal. Word classes provide a broad, semantically motivated, universal classification of words, and lead to further categorization at a grammatical level.

What about adjectives? Words that describe things are called adjectives. In 'a red shirt', shirt is a thing, so it is a noun; the item 'red' is describing this thing, giving its colour, hence 'red' is an adjective. There is one view which says we can never mentally visualize redness without some thing that is red. Attributes have no independent existence; they always need a substratum, some object on which they can be superimposed. As such, adjectives have no independent status at all. In contrast to this view, it is possible to argue that although properties of some object are unthinkable without that object, at a conceptual level we can take adjectives as a legitimate word class in its own right. A word that indicates action is called a verb, but the very word 'action' is a noun. Likewise, 'redness' is a noun, a conceptual object, but with the help of this noun we can posit the adjective 'red'. Without getting too deep into these alternative views, let us observe that adjectives are widely recognized in human languages and it would not be wrong to consider adjectives a class by themselves. Likewise, manner adverbs - items that indicate how an action is performed - can also be considered a universal word class. Further, pronouns, which are used to avoid the boredom of repeating nouns, can also be considered a universal word class. By and large, noun, pronoun, adjective, adverb and verb are the only universal word


classes. They are semantically motivated and can be applied to all human languages.

Items that correspond to one of these word classes shall be called words, and not the others. Conjunctions, articles, prepositions etc. are not universal word classes; they do not have a sufficiently clear meaning of their own. The so-called function words are not words at all.

A sentence is not merely a sequence of words; a sentence may contain words, feature bundles and connectives. Items indicating feature bundles can and should be connected up with the relevant words or other larger linguistic structures. Thus 'the' is not a word in English; it is an item which indicates a discourse-level feature called definiteness of the noun phrase to which it is attached. It should be clear by now that our theory has far-reaching implications at all levels of linguistics and NLP. Interested readers may see [1] for more details.

Some may argue that defining words in terms of meanings is not practicable, since computers do not understand meaning. If the right thing is difficult to do, that is not a valid excuse for doing the wrong thing. Do pre-processing, do post-processing, have manual intervention, do whatever you wish, but do not stray away from what is right. We believe that it is practically possible to work with meaning-defined words in all languages of the world. It may prove difficult, but it should not be impossible.

Word classes such as noun, verb and adjective are also called 'Parts of Speech' (POS) by tradition. For the sake of convenience, we may use short labels, called tags, for these. For example, nouns may be indicated by N and verbs by V. POS tagging is the process of attaching such short labels to words to indicate their parts of speech.

How should we deal with non-words in a sentence? Connectives and other such non-words are part of the grammar. They should ideally have no place in the lexicon or morphology, but they have a definite role in syntax. We may assign tags to these non-words also if that simplifies the design of the syntactic module, but we must never confuse non-words for words. Syntax is all about identifying the relations between words in a sentence; the non-words only facilitate this process.

Tags need not indicate purely syntactic categories. There is a need for sub-categorization in syntax, and a tagging scheme may include not only the major grammatical categories but also sub-categories. For example, one may talk of common nouns and proper nouns, of intransitive verbs and transitive verbs. One can actually go beyond purely syntactic properties and include lexical, morphological or even semantic information in the tags. It all depends upon what we need and what we can do. In this paper we use the terms tag and tagging in this broader sense, not restricting ourselves to POS tags or POS tagging.

Tagging is only for convenience. Tagging is usually intended to reduce, if not eliminate, ambiguities at the word level. It is well known that syntactic parsing is at least cubic in computational complexity, and having to consider several alternative interpretations for each word can exponentially increase parsing complexity. Tagging has been invented in NLP as an independent layer of analysis, sitting between morphology and syntax, mainly to help the syntactic parser do better in terms of speed and accuracy. However, if we take the definition of word we have given here, we will find that sentences are not as long as they appear to be (in terms of number of words), and words are not as ambiguous as they appear to be either. Therefore, syntactic parsing is actually orders of magnitude simpler than we usually think it is. To this extent, the importance of tagging is reduced. It is worth reiterating that linguistic theories never posit tagging or chunking as separate layers of analysis sitting between morphology and syntax.
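The complexity claim can be made concrete with a small worked example: with n words and k candidate tags per word, a parser faces up to k^n tag sequences, while chart parsing of one fixed sequence is roughly cubic in n. The numbers below are purely illustrative.

```python
# A worked instance of the complexity claim: n words, k candidate tags
# each, gives k**n possible tag sequences before disambiguation.

n, k = 10, 3                 # illustrative sentence length and ambiguity rate

sequences_before = k ** n    # tag sequences a parser might have to consider
parse_cost_one = n ** 3      # rough cubic chart-parsing cost for one sequence

print(sequences_before)      # 59049
print(parse_cost_one)        # 1000
```

Reducing k toward 1, which is what morphology-driven tagging aims at, collapses the exponential factor while leaving only the cubic parsing cost.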


1.4 Grammar

Words are finite, so the mapping from words (that is, phoneme sequences) to meanings can be stored in our brain. This is the mental lexicon. But there are infinitely many possible sentences and we can understand all of them. Our mental capacity is finite, and so we must necessarily be using a finite device to handle the infinitely many sentences. This mental device that enables us to construct and analyze infinitely many valid sentences using our finite vocabulary is called grammar.

Consider the sentences 'Rama saw the running deer' and 'Rama saw the deer running'. Sentences having the same set of words can thus vary in meaning, and the difference can only be accounted for by the structure. Grammar is the finite device that maps an infinite variety of structures to their corresponding meanings. Grammar is the central core of the human language faculty - without grammar, language is impossible.

The lexicon maps phoneme sequences to meanings. A sequence is also one kind of structure, although a simple one. Therefore, a lexicon is also a grammar. The lexicon is not an adjunct to grammar or an independent module; it is itself very much a part of grammar.

Words of a language are finite and hence listable. If all words of a language could simply be listed in the lexicon, there would be no need for morphology. However, there is structure inside words too; there are systematic relationships between these structures and meanings, and these systematic relations can be observed, learned, taught and used by the human mind. The human mind has a natural tendency to observe systematic relationships and make generalizations. Thus morphology, which deals with the internal structure of words in relation to their meanings, also becomes a component of grammar. The lexicon, the morphology and the syntax constitute the three main components of grammar up to the sentence level. Grammar

is not arbitrary string manipulation. Grammar must help us understand meanings in terms of structures.

1.5 Computational Grammar

A complete grammar of any given language must include the complete lexicon, the complete morphology and the complete syntax. Samples will not do; we need to be comprehensive and exhaustive. Only a complete grammar can define the language fully. The lexicon, morphology and syntax should be mutually exclusive and complementary. For example, what is handled by morphology need not be, in fact should not be, listed in the lexicon. Because of such interactions, it is not possible to have a dictionary without morphology and syntax, nor can we have a syntactic grammar without a dictionary. The dictionary, morphology and syntactic grammar always go together and must always be viewed as a whole.

We cannot open up the human brain and see what kind of grammar is sitting there, but a good grammar needs to be psychologically plausible as well as simple, neat and elegant. It must capture generalizations adequately and satisfactorily. It must have predictive and explanatory power. It must be universal. A child born into any language community anywhere in the world picks up its mother tongue with equal ease and in more or less the same amount of time. Hence there must be universal principles underlying and governing all human languages. The main goal of modern linguistics is to discover such a universal grammar.

The main goal of computational linguistics is also to discover such a comprehensive yet simple, neat, elegant and universal grammar. The computer is only a powerful tool in our hands in this grand project. A computational grammar is not a different kind of grammar; it is only a comprehensive, universal, yet simple and elegant grammar. The only difference is that it has been implemented, tested and validated on real data by actually building computational


models.

A complete grammar must be capable of handling each and every valid structure and mapping it to appropriate meanings. How can we be sure? The only way to test and ascertain this is to test on large-scale, real-life data. A computational grammar is simply a grammar that has actually been implemented as a computer program and subjected to extensive testing and validation on real-life linguistic data. Even to build such an exhaustive grammar, we invariably need large-scale data. A computational grammar is designed, developed, tested and validated on large-scale real data. In order to do this, the grammar itself needs to be defined at a very minute level and in a very precise way. A whole lot of the definitions and treatments you will find in grammar books are very superficial, cursory and merely illustrative, not sufficiently detailed and precise. These ideas have perhaps never been tested and validated.

There is no evidence to show that the human mind uses different grammars for analysing utterances heard and for generating utterances. That would be wasteful and very unintelligent. Therefore, grammars must be designed in such a way that they can be used both for analysis and for generation with equal ease.

There is also another common confusion about generation. Generation does not mean generating each and every possible form. That would be a completely unnatural process. Human beings never generate all possible forms of a word or all possible sentences in their lives. They never even try. Asking a computer to generate all forms is therefore not right, except purely as a means of testing and validating a computational grammar. What linguists really mean is that a grammar should be capable of generating all valid forms; this is only an abstract specification, not a practical requirement. Even here, we must realize that human beings are also capable of generating invalid forms, if asked to. Therefore, if you ask a computer to generate an invalid or meaningless form from a given grammar and it does, there is no fault. Computational grammars must be designed to correctly generate and analyze all valid forms, but it is not essential to ensure that such a system will never generate invalid forms even if asked to. Over-generation is perfectly fine, both theoretically and practically.

We will either need to generate particular forms based purely on given grammatical properties, or we will need to generate linguistic utterances to convey an intended meaning. Computers do not understand meanings, and so the second case never really arises. However, we may be working with some symbolic representation of meanings, in which case a computer program may be expected to generate valid sentences to convey the intended meanings. Generating sentences as part of automatic translation is a much more common requirement.

Linguists often fail to understand the importance of the size of data used for building and testing grammatical systems. They think there are only a few types of structures and there is no point in looking at a thousand examples of each kind. There is more to it than meets the eye. There are a large number of significant linguistic phenomena and not all of them occur equally frequently. Those who are used to looking at only a few important phenomena will not understand the importance of large scale data. Rare phenomena are more likely to occur in a large corpus than in a small corpus. Therefore, when the aim is to develop wide coverage, comprehensive, if not exhaustive grammars, the importance of a large and representative corpus cannot be overstated.

We believe that we can go much closer to the dream of a universal grammar if we take the definition of word we have given here seriously. Meaning is central to language and linguistics and any process that causes loss or distortion of meaning is simply not acceptable.

7/24/2019 On the Design of a Tag Set for Dravidian Languages

http://slidepdf.com/reader/full/on-the-design-of-a-tag-set-for-dravidian-languages 9/22

2 A Novel Approach to Tagging

There is only one critical question that we need to ask when it comes to tagging: where can we find the crucial bits of information required to assign the correct tag to a given word in a given sentence? Statistical approaches assume that the necessary information comes from the other words in the sentence. In many cases, only the words that come before the current word are taken into direct consideration. We believe, in sharp contrast, that the crucial information required for assigning the correct tag comes from within the word. It is the internal structure of a word that determines its grammatical category as also sub-categorization and other features. True, there will be instances where the internal structure alone is not sufficient. Firstly, we find that such cases are not as frequent as you may think. A vast majority of the words can be tagged correctly by looking at the internal structure of the word. The crux of tagging lies in morphology. This is clearly true in the case of so called morphologically rich languages but this is actually true of all human languages if only we define words in terms of meanings rather than in terms of the written form and spacing considerations. Secondly, in those cases where morphology assigns more than one possible tag, the information required for disambiguation comes mainly from syntax. Syntax implies complex inter-relationships between words and this cannot be reduced to mere statistics of sequences. In Kannada, for example, the plural/honorific imperative form of a verb and the past conjunctive verbal participle form are the same. Hence morphology cannot resolve this ambiguity. This ambiguity can only be resolved by looking at the sentence structure. If this word is functioning as a finite verb, it must be the imperative. If it is followed by another finite verb later in the sentence, this could be a conjunctive participle. Statistical techniques are perhaps not the best means to capture and utilize such complex functional dependencies. Instead, syntactic parsing will automatically remove most of the tag ambiguities. Given this observation, we use a simple pipe-line architecture as depicted in the figure below.

We keep going forward and we do not need to come back again and again to preceding modules. We carry with us all the necessary/useful information in the form of tags, each module adding or refining the information as we move on. The lexicon assigns tags to words that appear without any overt morphological inflection. Morphology handles all the derived and inflected words, including many forms of sandhi. The bridge module combines the tags given by the dictionary and the additional information given by the morph, making suitable changes to reflect the correct structure and meaning. The chunker takes these tag sequences to produce chunks. A chunker may or may not be required; it is included in the architecture for generality. In fact, for our work on Dravidian, we do not need a chunker at all. The parser analyzes these chunk sequences and produces a dependency structure. The overall tag structure remains the same throughout, making it so much simpler and easier to build, test and use.

Input Text → Pre-Processor → Dictionary-Morphology → Bridge → Chunker → Parser → Structural Description

FIG 1 The System Architecture
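The flow of Fig. 1 can be sketched in a few lines of Python. This is purely illustrative: the module internals are stubbed, and the miniature lexicon, the words and the combined tags such as N-PRP-PER and V-IN are our own assumptions, not the actual implementation (the chunker, not needed for Dravidian, is omitted).

```python
# Illustrative sketch of the pipe-line architecture in Fig. 1.
# Each module takes the analysis so far and adds or refines tag
# information; control only flows forward, never back.

def pre_process(text):
    """Split raw input into tokens (simplified)."""
    return text.split()

def dict_morph(tokens, lexicon):
    """The lexicon handles uninflected forms; morphology the rest.
    Both are stubbed here with a single dictionary lookup."""
    return [(t, lexicon.get(t, ["UNKNOWN"])) for t in tokens]

def bridge(analyses):
    """Combine dictionary tags with morphological information
    (identity in this sketch)."""
    return analyses

def parse(analyses):
    """Stand-in for the dependency parser: syntax sees only the
    tag sequence, not the words themselves."""
    return [tags for _, tags in analyses]

# Hypothetical miniature lexicon with hierarchical tags
lexicon = {"raama": ["N-PRP-PER"], "baMda": ["V-IN"]}

result = parse(bridge(dict_morph(pre_process("raama baMda"), lexicon)))
print(result)   # [['N-PRP-PER'], ['V-IN']]
```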

In this paper we shall focus on the design of a tag set given the above theory and background. Details of the lexicon, morphology, syntax etc. are being published elsewhere. We shall include here transcripts from the actual implemented system to give a feel for the readers.

2.1 Designing a Tag Set

Tagging approaches based on machine learning require manually tagged training data. Manual tagging becomes difficult and error prone as the tag set becomes large and elaborate and so there is a strong tendency to go for small, flat tag sets. Such tag sets may not capture all the required and/or useful bits of information for carrying out various tasks in NLP. Flat tag sets are also rigid and resist changes. Hierarchical tag sets are more flexible. In our case, we do not depend upon statistical or machine learning techniques and we do not need any training data. No manual tagging work is involved and so we can afford to have a large, fine-grained, elaborate, hierarchical tag set that carries as many of the lexical, morphological, syntactic and semantic fields as we wish.

One of the biggest difficulties that researchers face while designing tag sets, while performing manual tagging and while building and evaluating tagging systems is the very definition of tags. Tags involve lexical, morphological, syntactic and semantic considerations and often there are conflicts. One cannot go purely by intuitive definitions such as 'nouns are things, pronouns stand in place of nouns and adjectives modify nouns'. We will need to give precise definitions and criteria to decide which tag label should be given to which word. We shall illustrate the kind of reasoning we need to go through with just a few examples below. More will come later.

Let us take a clear case. Since both nouns and pronouns can take nominal inflections and both can function as subjects and objects of sentences, why should we even make a distinction between them? After all, there is no real difference in morphology either. How do we answer such a question? Upon careful examination, we find that pronouns are neither modified by adjectives nor do they function like adjectives, modifying other nouns. Nouns can. A noun can be modified by a demonstrative adjective, a pronoun cannot be. Nouns can be modified by quantifying adjectives, pronouns cannot be. As such, there are significant differences at the level of syntax and that is why we must distinguish between these two categories. Similarly, common nouns differ from proper nouns in significant ways. Demonstrative adjectives, quantifying adjectives, ordinals, and descriptive adjectives need to be treated differently since they have different roles in chunking. Not making such distinctions will lead to unnecessary explosion of possible parses, making the parsing process slow and less accurate. Even within proper names, names of persons, places etc. vary in their grammatical properties. Place names appear more in nominative, dative and locative cases, and not so much in the accusative case. A place name can modify a person name but not vice versa. Place names can modify common nouns, person names usually will not. Grammar includes hard constraints as also softer restrictions. We need to design a tag set keeping all the relevant linguistic phenomena in mind. It is not possible to have a general purpose tag set suitable for all applications. Here our goal is syntactic parsing and this goal guides us at every step.

A tag set is a classification system. The moment we introduce sub-categories, it becomes a hierarchical classification system. There are certain general principles that must be followed in the design of any hierarchical classification system. If set X is divided into sets A, B and C: a) all the properties of X must be valid for A, B and C; b) A, B and C must be mutually exclusive; c) X must not include anything other than A, B and C. Further, we believe that the major categories must correspond one-to-one with the universal word classes. We find that such basic requirements are often violated in the tag sets being proposed by other groups and we shall demonstrate this after we define our own tag set.
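Requirements b) and c) can be checked mechanically for any proposed subcategorization; a small sketch follows. Requirement a), property inheritance, is a linguistic matter and cannot be checked from the label sets alone. The tag names in the example are illustrative.

```python
# Check two of the partition requirements for dividing class X
# into subclasses: (b) mutual exclusivity and (c) exhaustiveness.

def is_valid_partition(X, subclasses):
    union = set()
    for s in subclasses:
        if union & s:            # (b) subclasses must not overlap
            return False
        union |= s
    return union == X            # (c) nothing outside the subclasses

nouns = {"NCOM", "NPRP", "NLOC", "NCARD"}
assert is_valid_partition(nouns, [{"NCOM"}, {"NPRP"}, {"NLOC", "NCARD"}])
assert not is_valid_partition(nouns, [{"NCOM"}, {"NPRP"}])  # not exhaustive
```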

Keeping these scientific considerations in mind, we have developed a fairly elaborate hierarchical tag set starting from Kannada and Telugu data. Tagging is an intermediate level and the appropriateness of the tag set and tag assignments can only be verified by placing it between morphology and syntax and building and testing actual systems. We are in the process of developing an end-to-end system and this gives us a strong basis to argue why our design is superior and why exactly other possible alternatives are not good enough.

2.2 The Tag Set

We have seen that noun, pronoun, verb, adjective and adverb are the major and universal word classes or categories. We shall denote these with the widely used and highly readable mnemonics N, PRO, V, ADJ and ADV respectively. We shall try to define these major categories and their sub-categories. Sub-categorization will naturally be a bit more specific to Dravidian languages. In each case, we shall start with a broad semantic description and then look at the specific morphological and syntactic properties.

2.2.1 Noun

Words that are used to name a person, animal, place or a thing, including abstract concepts or ideas, are called nouns. This is the layman's definition found in elementary grammar books; it is not precise enough. We need to look at the morphological and syntactic properties of nouns.

Morphologically, nouns can take number and case. Most commonly, number can be singular or plural. In Sanskrit, there is also a dual number. Usually, the singular form is the default. Not all nouns show number distinction. Only countable nouns take plural forms. Uncountable nouns, including mass nouns (water, for example) and abstract nouns (beauty, for example), usually do not take plural. Some nouns may not appear in plural form simply because situations where we can put several of them will be rare. We can talk of eyes or ears but rarely do we need to talk of noses. Some nouns are actually plural but often construed as singular and vice versa. Some appear only in plural form. Pronouns also take number but other categories do not. Given all this, if a word takes plural form, we can say it is a noun or a pronoun but we cannot say a word is not a noun just because it does not show the number feature. The number feature is often helpful in clearing confusion between nouns and adjectives. In many languages, verbs also show singular-plural distinction. This is not number; a plural marker on a verb never indicates plurality of actions. Verbs may be marked for number purely for showing agreement, say, with the subject of the sentence.

At a broad level, we may look at case as simply direct or oblique. At another level, we may consider cases as indicators of the thematic roles played by various constituents in a sentence in relation to the verb. These indicators may manifest in various forms. In some languages, they appear as overt affixes called case markers. In other languages, these case indicators may take the form of prepositions or post-positions. In some languages, case distinctions may be expressed indirectly by the position of the word in the sentence. In any case, it is important to realize that prepositions and postpositions are not words, they are only features of other words. Sometimes it is argued that prepositions also show relations between two nouns. For example, in the phrase 'the book on the table', the preposition 'on' is said to relate the two nouns 'book' and 'table'. This analysis is not entirely right. The phrase 'the book on the table' actually means 'the book which is/was on the table' and the preposition 'on' indicates the place where the book exists, existence itself being indicated by the verb 'is/was'. Upon sufficiently deep and detailed analysis, we can assure ourselves that prepositions and post-positions are very much like case markers, with the same primary purpose. If a word shows case distinctions, it can only be a noun or a pronoun, not any other category.

From a morphological point of view, words which are usually considered adverbs of place or time may have to be considered as nouns if they take number and/or case distinctions. Words like 'here' and 'there' in English may be treated as adverbs but their counterparts in Dravidian ('illi', 'alli' for example in Kannada) will show case distinctions. Only a few case distinctions are used simply because these are the only meaningful possibilities. Thus, 'illi' (here) is already a place indicator and we cannot think of adding a locative case marker to this word. We can think of 'from here', or 'to here', but not 'in here'. Given this, one alternative is to consider them as nouns and depend upon the standard noun morphology to take care of all the case distinctions that occur, without worrying about over-generation. An alternative view is to consider each of these few forms as separate words in their own right, call them adverbs and take them out of the realm of morphology. Generalizations are lost in this latter view and so we take the first view here.

In terms of syntax, nouns can take up major thematic roles including subject and object roles in a sentence. Pronouns can also do so but adjectives and adverbs cannot. At a more local level, nouns can be modified by various kinds of nominal modifiers including demonstratives, quantifiers and other kinds of adjectives, including participles. Nouns can also function as nominal modifiers and modify other nouns. This is a general feature of human languages. Whether the resulting noun-noun structure is a compound or a phrase, permitting nouns to be used as adjectives in syntax eliminates the need to distinguish between the nominal and adjectival use. There will no longer be any need to distinguish between lexical and syntactic categories as far as this particular case is concerned; tagging will become so much easier and less confusing and less ambiguous.

It must be noted that possessive forms of nouns and pronouns always function as adjectives. From a purely morphological point of view, we usually treat them under noun or pronoun categories. Once this point is clearly understood, there should not be any more confusion. Whenever we talk of nouns or pronouns at a syntactic level, we exclude possessive forms. A possessive noun or pronoun can modify other nouns but can never take thematic roles such as subject or object.

In fact there are many situations where the morphological and syntactic view points differ. We need to have a specified policy to deal with all such situations. In our work, we always favour the morphological view point in deciding tags. The flow of information in our architecture is from the lexical and morphological levels to the syntactic level and tag assignments are first done at the morphological level. If and where needed, suitable refinements can be done at later levels of analysis. On the other hand, if we get stuck right at the initial levels, we cannot progress at all.

At a discourse level, nouns can be referred to and they can also refer. Pronouns can refer to nouns or noun phrases. We can say 'the big tree' and then refer to it using the pronoun 'it'. Definite noun phrases can also refer to other nouns or noun phrases. 'The cap' can refer to a part of 'the pen'. Here by reference we mean non-syntactic relations at inter- and intra-sentential levels, not exactly what linguists do in binding theory. Here the important questions are what refers to what and what is the nature of the semantic relation between the two. Reference sets nouns and pronouns apart from all other categories.

Subcategories of Noun

Nouns are subclassified into common nouns, proper nouns, locative nouns and cardinals, denoted by the tags NCOM, NPRP, NLOC and NCARD. These distinctions are required mainly because these subcategories occur in differing positions inside noun phrases and restrict the occurrence of other constituents within noun phrases. Proper nouns, locative nouns and cardinals have special properties and all other nouns are grouped under the default subcategory called common noun.
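One way to represent such a hierarchy is as a tree of tag elements. A sketch follows; note that the text writes the noun subcategory tags as NCOM, NPRP etc. but also uses the element-wise form N-PRP-PER elsewhere, and we adopt the element-wise form here as an assumption.

```python
# Noun branch of the hierarchical tag set described in the text.
# A tag such as N-PRP-PER is valid if each successive element is
# a child of the previous one in this tree.
TAG_TREE = {
    "N": {
        "COM": {},
        "PRP": {"PER": {}, "LOC": {}, "ORG": {}, "OTH": {}},
        "LOC": {},
        "CARD": {},
    },
}

def is_valid_tag(tag, tree=TAG_TREE):
    head, _, rest = tag.partition("-")
    if head not in tree:
        return False
    return is_valid_tag(rest, tree[head]) if rest else True

assert is_valid_tag("N-PRP-PER")
assert not is_valid_tag("N-VERB")
```

Validating tags against such a tree enforces the partition principles stated earlier: a tag can only descend along one branch at a time.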

Proper nouns usually do not take plural forms, have a somewhat different statistical distribution of case marker occurrences, are not modified by various kinds of modifiers (including demonstratives, quantifying and qualifying adjectives, possessive nouns and pronouns, relative participles etc.), and they act as modifiers under restricted situations.


Proper nouns are also not found in the lexicon unless the words have some other common noun meanings also. Proper nouns can be further subclassified into person, location, organization and others, denoted by PER, LOC, ORG and OTH respectively. There are significant differences amongst these. Agreement features vary. For example, N-PRP-PER words have implicit masculine or feminine gender, which agrees with the verb. Locations can be part of other proper names including person names and names of organizations. Person names can be part of an organization name but they rarely modify location names. Location names occur more frequently in nominative, dative, genitive and locative cases and relatively less in, say, the accusative case. Proper nouns cannot be relativized, common nouns can be: 'the boy I know' is OK but 'the Rama I know' is odd. Proper nouns usually signify specific, single objects and so annotators tend to get confused whenever they come across fairly specific objects which may not occur in plural form frequently. Criteria such as the above should help to resolve the common noun - proper noun confusions.

Nouns indicating space or time, also called spatio-temporal nouns, are subcategorized under nouns because they show nominal morphology, although they are usually adverbial in function. They can also function like nouns and become subjects or objects, although rarely. Here again we find restrictions on case, modification, etc.

Cardinals behave like nouns in terms of morphology and are therefore grouped under nouns. Their morphology is, however, a bit irregular. They usually act as nominal modifiers but can also act as nouns and take on subject and object roles. This happens when we are talking about these numbers themselves, as in mathematics. Singularity and plurality are implicit but overt case marking is seen.

Proto-Nouns

There are certain words whose classification requires a more careful look. Consider the Kannada word 'obba' meaning one person. By default the gender is masculine. One may argue that 'obba' is a pronoun because it stands for some one person. Somebody else may argue that it should be treated as an adjective since it indicates number (apart from indicating that the following noun should be human), as in 'obba huDuga' (cf. 'oMdu mara'). A third person may counter this by saying it cannot be an adjective since it can take nominal inflections such as number and case. The word 'obba' takes nominal inflections and can be the subject of a sentence and so it can only be a noun or a pronoun. But pronouns cannot modify nouns, they can only stand in place of a noun, and so this word can only be taken as a noun. But 'obba' indicates some man in general and not any specific person. How do we resolve this?

We need to look at related word forms such as 'obbanu' (one man), 'obbaLu' (one woman) and 'obbaru' (one man/woman, plural indicating honorificity). Effecting a change in number is very much a standard aspect of morphology but a change of gender cannot be considered a grammatical process. How can grammar change gender, that too in languages where the grammatical and biological genders are closely related? Therefore, although 'obbaru' can be obtained from 'obbanu' or 'obbaLu', 'obbanu' cannot be obtained from 'obbaLu' or vice versa. We will have to consider 'obbanu' and 'obbaLu' as separate words, related in meaning though they are. More importantly, we need to realize that 'obbanu', 'obbaLu' and 'obbaru' are clearly pronouns. They have all the properties of pronouns; they can be subjects of sentences, for example. In contrast, 'obba' is adjectival in nature; it is not a noun or a pronoun. The source of the confusion is the fact that nominative suffixes can be optionally dropped in Kannada and so we confuse 'obba' to be the same as 'obbanu'. 'obba baMda' is the same as 'obbanu baMdanu' (one man came) but in 'obba huDuga' we cannot replace 'obba' with 'obbanu'. Therefore, 'obba' should be considered as an adjective indicating the modified noun to be one person, whereas the other forms are clearly pronouns.

In fact a large number of adjectives can become pronominals by a similar change in morphological structure. Thus the adjective 'praamaaNika' can become 'praamaaNikanu', 'praamaaNikaLu', 'praamaaNikaru' and then take other cases thereafter. Here the addition of suitable nominative suffixes is deriving pronouns from adjectives by systematic and highly productive processes. Instead of storing all of 'praamaaNika', 'praamaaNikanu' and 'praamaaNikaLu' directly in the lexicon, we can store only 'praamaaNika' and mark it as an adjective that can become a pronoun by the addition of suitable nominative case suffixes. Thus words like 'praamaaNika' behave like proto-nouns: they are actually adjectives but ready to become nouns/pronouns by the mere addition of gender information. We have chosen to call these words proto-nouns and indicate this with the PTN label.

Words such as 'kelavaru', 'mattobbaru' and 'innobbaru' are best treated as pronouns; they have all the properties of pronouns. They indicate an indefinite number of persons. This is better than treating these words as cardinal nouns subclassified as human, treating other non-human cardinals as a separate subclass of cardinal nouns. They can replace noun phrases, they indicate generic groups and are thus pronouns, not nouns.

2.2.2 Pronoun

Pronouns are abbreviated forms of nouns or noun phrases, abbreviated in the sense of the information they carry. The proper noun 'Ram' stands for a particular person and the pronoun 'he' stands for any one masculine person, other than you and I. It has most of the information that the word 'Ram' carries but not quite every bit. We use pronouns to avoid repeating the same nouns time and again and the partial information that the pronouns carry is sufficient for us to understand the discourse properly.

Pronouns can act as subjects and objects. They also take number and case like nouns. However, pronoun morphology is somewhat irregular in many languages. Also, pronouns show overt marking for gender and person, which are usually implicit in nouns. Pronouns may also show other finer distinctions such as proximate and distal, inclusive or exclusive, and so on. Pronouns are usually not modified nor do they act as modifiers. Pronouns are usually not relativized although rare usages such as 'he who would climb the ladder must begin at the bottom' are found. Reference is a very important property of pronouns.

Pronouns are subcategorized into personal pronouns, interrogative pronouns, reflexive pronouns and indefinite pronouns, labelled PER, INTG, REF and INDF respectively. Pronouns other than personal pronouns are usually in third person only. These subcategories are motivated not by morphology but by the various kinds of syntactic constructions in which these pronouns participate.

2.2.3 Verb

Verbs are words that indicate actions or events. Existence or state of existence is considered to be a very special kind of event. Existence is the most fundamental event, without which we cannot even talk of other kinds of events or actions. Therefore, words that indicate existence or state of existence are also called verbs.

Morphologically, verbs may show tense, aspect and mood. They may also show a variety of agreement markers. Verb morphology can be extremely rich and complex. Indeed it is in Dravidian. Some verbs are defective: they show little or no morphology, and are depicted as such in the tag set.

Syntactically, a verb has expectations about thematic roles such as subject and object which are to be filled by noun phrases or clauses as appropriate. A verb, along with its complements and optional adjuncts, constitutes a clause. There can be one or more clauses in a sentence and if there are two or more clauses, the verbs in the various clauses are inter-related. Syntax is all about finding the inter-relationships between words in a sentence and the verb plays a central role in this whole process.

Verbs have been classified along various dimensions but from the point of view of syntax, the sub-categorization frames form the most important characteristic. Sub-categorization frames specify what roles are essential, optional and prohibited and what syntactic kinds of constituents can fill up these roles. As such, these are extensions of the traditional notion of transitivity. Here we classify verbs into transitive, intransitive and bitransitive (denoted TR, IN and BI) and mark further restrictions by using numerical indices as in TR1 or TR12 or TR13.
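A sketch of how such sub-categorization frames might be encoded and used to check what a verb still expects. The role inventories attached to each frame are illustrative assumptions, and the numerical indices (TR1, TR12 etc.) are not modelled here.

```python
# Broad verb sub-categorization frames: which thematic roles a
# verb expects. Role names are illustrative assumptions.
FRAMES = {
    "IN": {"required": {"subject"}},
    "TR": {"required": {"subject", "object"}},
    "BI": {"required": {"subject", "object", "indirect_object"}},
}

def missing_roles(verb_class, filled_roles):
    """Roles the verb still expects, given those already filled.
    A parser can use this to decide whether a clause is complete."""
    return FRAMES[verb_class]["required"] - set(filled_roles)

assert missing_roles("TR", ["subject"]) == {"object"}
assert missing_roles("IN", ["subject"]) == set()
```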

2.2.4 Adjective

Adjectives are words that modify nouns by specifying their attributes or properties. Adjectives can be modified only by intensifiers. As a general rule, adjectives do not show any inflectional morphology. Inflections, if any, are only agreement markers. Thus, if any word takes a plural form or case markers, it is not an adjective. Adjectives used attributively are part of noun phrases, occurring in modifier positions. When used predicatively, we may consider adjectives as heads of adjectival phrases. Possessive nouns and pronouns, relative participles, cardinals etc. are all adjectival in function although they may be classified elsewhere from the point of view of morphology.

Adjectives (other than those that have already been considered elsewhere) are subcategorized into demonstratives, ordinals, quantifiers, question words, and all others by default as absolute, labelled DEM, ORD, QNTF, QW and ABS. Demonstrative adjectives occur zero or one times. Question words occur zero or one times and question the noun being modified. Ordinals and quantifiers are mutually exclusive. Absolute adjectives may occur zero, one or more times. There are also positional restrictions. In head final languages, demonstratives or question words occur first, then come ordinals or quantifiers, after which appear absolute adjectives. Subcategorization of adjectives is justified by these observations.
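These co-occurrence and ordering restrictions can be stated as a single pattern over the subtags; a minimal sketch:

```python
import re

# Positional restrictions on adjective subtags inside a noun
# phrase in head-final languages, as stated in the text:
# at most one DEM or QW, then at most one ORD or QNTF
# (mutually exclusive), then any number of ABS.
ADJ_ORDER = re.compile(r"^(DEM|QW)?(ORD|QNTF)?(ABS)*$")

def valid_adjective_sequence(subtags):
    return bool(ADJ_ORDER.match("".join(subtags)))

assert valid_adjective_sequence(["DEM", "QNTF", "ABS", "ABS"])
assert not valid_adjective_sequence(["ABS", "DEM"])  # DEM must come first
```

Encoding the restrictions this way lets a chunker or parser reject implausible noun-phrase analyses before they multiply.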

2.2.5 Adverb

Words that modify a verb are called adverbs. However, we find a number of other kinds of words which modify or add more information to adjectives, other adverbs, or a whole sentence, and all these words are traditionally called adverbs. They usually answer questions such as where, when, how or how much. Such words may provide more information in terms of time, frequency, manner etc. Adverbs appear in various positions in a sentence and add quite a bit of complexity to syntactic and semantic analysis. Adverbs usually have no morphology. Adverbs form the default category: if a word does not fit anywhere else, there is a tendency to push it into the adverb basket.

In keeping with the tradition, we include all these various types of words under the adverb category and subcategorize adverbs into adverbs of manner (MAN), place (PLA), time (TIM), question words (QW), intensifiers (INTF), negation (NEG), conjunctions (CONJ), post-nominal modifiers (POSN), and by default, all the others into absolute (ABS). Intensifiers modify adjectives. Adverbs of negation add a negative aspect to verbs. Post-nominal modifiers add clitic-like information to nouns. Certain conjunctions occur in the sentence initial position, indicating conjunction with preceding sentences at discourse level, and have no role within the sentence as far as syntax is concerned. Although they are conjunctions, they do not join any two items within the sentence. We have chosen to treat them under adverbs keeping syntax in mind. Note that time and place indicators are classified as locative nouns if they show any nominal inflections. When no inflections occur, they have been grouped under adverbs.


2.2.6 Other Tags

Tag sets proposed by various groups include tags for interjections, post-positions, particles, punctuation marks, symbols, foreign words, echo formations, reduplication, etc. and perhaps one unknown tag to take care of situations when no tag seems to fit. Since these do not correspond to universal word classes and since these tags do not map on to words which are defined based on a specific theory of meaning, we shall avoid getting into all these here. How about function words like conjunctions? Such items do have a very definite role in syntax. They cannot be ignored or wished away. At the same time, we should not group them with words of the language. These items have less of a lexical role and more of a grammatical role and so they should be excluded from the lexicon, morphology and tagging and included within the grammar at the syntactic level. Having said this much, it may still be felt practically convenient to assign tags to them. This way, syntax would see a sequence of tags as input, not a mixture of tags and other meta symbols. Sometimes such items may even be listed within the lexicon but we must understand that this is purely for the sake of practical convenience.

2.3 Our Tag Set

Firstly, we believe the same overall structure can be and should be retained at all levels of processing, starting from the lexicon, through morphology, tagging and syntactic parsing. Each module may add or refine the relevant parts but the overall tag structure should not change.

Secondly, although every word must ideally be passed through morphology, we can avoid some work and save time by not passing word forms that are directly found in the lexicon through the morphological analyzer. In that case, the lexicon should give the same tag that the morphology would have given. Defaults such as singular number and nominative case for nouns should therefore be shown in the lexicon itself. The tags assigned by the dictionary and morphology should be directly suited for syntactic analysis, without going back to any other module. Note that syntax works only with tags, not with the words themselves. The whole idea of defining word classes and the tagging scheme is to reduce the set of word forms into a set of tags. As far as syntax is concerned, a sentence is simply a sequence of tags. All the lexical, morphological, syntactic and semantic pieces of information necessary or useful for syntactic parsing should therefore be directly reflected in the tag structure.

The lexicon is truly a list of exceptions, and all idiosyncratic word forms must be stored directly in the lexicon. Thus idiosyncratic word forms, including plurals, various case forms, clitics, etc., will be listed in the dictionary with appropriate tags.
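The lexicon-first strategy described above can be sketched as a simple lookup with a morphology fallback. The entries and the `morph_analyze` stub below are hypothetical stand-ins, not the actual Kannada/Telugu modules:

```python
# A minimal sketch of the lexicon-first lookup strategy: idiosyncratic forms
# are stored with full tags (defaults such as NOM included), and only words
# not found there are sent to the morphological analyzer.

# Hypothetical sample entries; real lexica would be far larger.
LEXICON = {
    "nanna": "PRO-PER-P1.MFN.SL-ABS-GEN",   # irregular pronoun form
    "mane":  "N-COM-COU-N.SL-NOM",          # defaults: singular, nominative
}

def morph_analyze(word):
    """Placeholder for the rule-based morphological analyzer (not shown)."""
    raise NotImplementedError(word)

def assign_tag(word):
    # Words found directly in the lexicon bypass morphology entirely;
    # the lexicon therefore carries the same tags morphology would emit.
    if word in LEXICON:
        return LEXICON[word]
    return morph_analyze(word)

print(assign_tag("mane"))   # N-COM-COU-N.SL-NOM
```
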

Certain categories such as pronouns and cardinals show partly irregular morphology. One extreme solution would be to store all forms of these words directly in the lexicon, avoiding morphology altogether. Another extreme would be to somehow try and handle all the irregularities within morphology. A good intermediate solution would be to store only the irregular forms in the lexicon and let morphology handle all the regular forms. In Kannada, for example, the nominative, dative and genitive forms of pronouns are stored in the lexicon and other forms are derived by the morphology starting from the genitive form. For example, 'nannalli' can be obtained from 'nanna'. This would require suitable adjustments in tags after morphology.

The list of tags used in the Kannada lexicon is given below. All the major categories and sub-categories are included, but only samples of feature combinations are shown here. For pronouns the exhaustive set is included to help the reader get a feel for the complete tag set.

Each item here is a tag. It has a hierarchical structure shown via dashes. The parts separated by dashes are called tag elements. Tag elements may in turn be made up of tag atoms. Thus, 'PRO-REF-P23.MFN.PL-NOM' is one single tag; PRO, REF, etc. are the tag elements, and P23.MFN.PL is a single tag element containing the person, number and gender information. P23 is a shorthand for second or third person and MFN is a shorthand for masculine, feminine or neuter. Shorthands such as these encapsulate tag ambiguities at the level of features and reduce the overt ambiguity in tagging. The labels used here are generally quite well known and need no explanation.

1. ADJ-ABS

2. ADJ-DEM

3. ADJ-ORD

4. ADJ-QNTF

5. ADJ-QW

6. ADV-ABS

7. ADV-CONJ

8. ADV-INTF

9. ADV-MAN

10. ADV-PLA

11. ADV-POSN

12. ADV-QW

13. ADV-TIM

14. N-CARD-N.SL-NOM

15. N-COM-COU-M.SL-NOM

16. N-COM-UNC-N.SL-NOM

17. N-LOC-TIM-COU-ABS-N.SL-NOM

18. N-LOC-TIM-UNC-ABS-N.SL-NOM

19. N-LOC-TIM-UNC-DIST-N.SL-NOM

20. N-LOC-TIM-UNC-PROX-N.SL-NOM

21. N-LOC-TIM-UNC-QW-N.SL-NOM

22. N-LOC-PLA-COU-ABS-N.SL-NOM

23. N-PRP-LOC

24. N-PRP-ORG

25. N-PRP-OTH

26. N-PRP-PER-M.SL-NOM

27. PRO-INTG-P3.F.SL-NOM

28. PRO-INTG-P3.MF.PL-NOM

29. PRO-INTG-P3.M.SL-NOM

30. PRO-INTG-P3.N.PL-NOM

31. PRO-INTG-P3.N.SL-COMP

32. PRO-INTG-P3.N.SL-DAT

33. PRO-INTG-P3.N.SL-GEN

34. PRO-INTG-P3.N.SL-NOM

35. PRO-INTG-P3.N.SL-PURP1

36. PRO-INTG-P3.N.SL-PURP2

37. PRO-PER-P1.MFN.PL-ABS-COMP

38. PRO-PER-P1.MFN.PL-ABS-DAT

39. PRO-PER-P1.MFN.PL-ABS-GEN

40. PRO-PER-P1.MFN.PL-ABS-NOM

41. PRO-PER-P1.MFN.PL-ABS-PURP1

42. PRO-PER-P1.MFN.PL-ABS-PURP2

43. PRO-PER-P1.MFN.SL-ABS-COMP

44. PRO-PER-P1.MFN.SL-ABS-DAT

45. PRO-PER-P1.MFN.SL-ABS-GEN

46. PRO-PER-P1.MFN.SL-ABS-NOM

47. PRO-PER-P1.MFN.SL-ABS-PURP1

48. PRO-PER-P1.MFN.SL-ABS-PURP2

49. PRO-PER-P2.MFN.PL-ABS-COMP

50. PRO-PER-P2.MFN.PL-ABS-DAT

51. PRO-PER-P2.MFN.PL-ABS-GEN


52. PRO-PER-P2.MFN.PL-ABS-NOM

53. PRO-PER-P2.MFN.PL-ABS-PURP1

54. PRO-PER-P2.MFN.PL-ABS-PURP2

55. PRO-PER-P2.MFN.SL-ABS-COMP

56. PRO-PER-P2.MFN.SL-ABS-DAT

57. PRO-PER-P2.MFN.SL-ABS-GEN

58. PRO-PER-P2.MFN.SL-ABS-NOM

59. PRO-PER-P2.MFN.SL-ABS-PURP1

60. PRO-PER-P2.MFN.SL-ABS-PURP2

61. PRO-PER-P3.F.SL-DIST-NOM

62. PRO-PER-P3.F.SL-PROX-NOM

63. PRO-PER-P3.MF.PL-DIST-NOM

64. PRO-PER-P3.MF.PL-PROX-NOM

65. PRO-PER-P3.M.SL-DIST-NOM

66. PRO-PER-P3.M.SL-PROX-NOM

67. PRO-PER-P3.N.PL-DIST-COMP

68. PRO-PER-P3.N.PL-DIST-DAT

69. PRO-PER-P3.N.PL-DIST-NOM

70. PRO-PER-P3.N.PL-DIST-PURP1

71. PRO-PER-P3.N.PL-DIST-PURP2

72. PRO-PER-P3.N.PL-PROX-COMP

73. PRO-PER-P3.N.PL-PROX-DAT

74. PRO-PER-P3.N.PL-PROX-NOM

75. PRO-PER-P3.N.PL-PROX-PURP1

76. PRO-PER-P3.N.PL-PROX-PURP2

77. PRO-PER-P3.N.SL-DIST-COMP

78. PRO-PER-P3.N.SL-DIST-DAT

79. PRO-PER-P3.N.SL-DIST-GEN

80. PRO-PER-P3.N.SL-DIST-NOM

81. PRO-PER-P3.N.SL-DIST-PURP1

82. PRO-PER-P3.N.SL-DIST-PURP2

83. PRO-PER-P3.N.SL-PROX-COMP

84. PRO-PER-P3.N.SL-PROX-DAT

85. PRO-PER-P3.N.SL-PROX-GEN

86. PRO-PER-P3.N.SL-PROX-NOM

87. PRO-PER-P3.N.SL-PROX-PURP1

88. PRO-PER-P3.N.SL-PROX-PURP2

89. PRO-INDF-P1.MF.PL-NOM

90. PRO-INDF-P3.F.SL-NOM

91. PRO-INDF-P3.MF.PL-NOM

92. PRO-INDF-P3.M.SL-NOM

93. PRO-INDF-P3.N.PL-NOM

94. PRO-INDF-P3.N.SL-DAT

95. PRO-INDF-P3.N.SL-NOM

96. PRO-REF-P23.MFN.PL-COMP

97. PRO-REF-P23.MFN.PL-DAT

98. PRO-REF-P23.MFN.PL-GEN

99. PRO-REF-P23.MFN.PL-NOM

100. PRO-REF-P23.MFN.PL-PURP1

101. PRO-REF-P23.MFN.PL-PURP2

102. PRO-REF-P3.MFN.PL-GEN

103. PRO-REF-P3.MFN.SL-COMP

104. PRO-REF-P3.MFN.SL-DAT

105. PRO-REF-P3.MFN.SL-GEN

106. PRO-REF-P3.MFN.SL-NOM

107. PRO-REF-P3.MFN.SL-PURP1

108. PRO-REF-P3.MFN.SL-PURP2

109. PTN


110. V-BI1

111. V-DEFE

112. V-IN

113. V-TR1
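The hierarchical tag structure described above (dashes separating tag elements, dots separating tag atoms) lends itself to straightforward mechanical parsing. The following sketch, with a shorthand table of our own making, shows how a tag can be decomposed and how shorthand atoms such as P23 or MFN expand into the feature values they encapsulate:

```python
# Sketch: parsing a hierarchical tag into tag elements and tag atoms, and
# expanding shorthand atoms. The expansion table below is an illustrative
# assumption based on the shorthands explained in the text.

SHORTHAND = {
    "P23": ["P2", "P3"],       # second or third person
    "MFN": ["M", "F", "N"],    # masculine, feminine or neuter
    "MF":  ["M", "F"],
}

def parse_tag(tag):
    """Split a tag into elements (at dashes) and each element into atoms (at dots)."""
    return [element.split(".") for element in tag.split("-")]

elements = parse_tag("PRO-REF-P23.MFN.PL-NOM")
print(elements)
# [['PRO'], ['REF'], ['P23', 'MFN', 'PL'], ['NOM']]

# The shorthand atoms compactly encode feature-level ambiguity:
person, gender, number = elements[2]
print(SHORTHAND[person])   # ['P2', 'P3']
print(SHORTHAND[gender])   # ['M', 'F', 'N']
```
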

Tags shown in the lexicon get enriched with the information coming from morphology. Look at the examples of morph output below:

annavannu = anna (N-COM-UNC-N.SL-NOM) + epsilon (Singular) + annu (Accusative)

koTTanu = koDu (V-BI1) + epsilon + id (Past Tense) + anu (Third Person.Masculine.Singular)

The theory and implementation details of the morphology component are being published elsewhere. The tagger combines the elements obtained from the lexicon and morph and assigns the following tags:

annavannu||anna||N-COM-UNC-N.SL-ACC

koTTanu||koDu||V-BI1-ABS-PAST-P3.M.SL
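The combination step above can be sketched as follows. The segment representation and the rule that the outermost case suffix replaces the lexicon's default nominative are our illustrative assumptions, not the published implementation:

```python
# Sketch: folding the morph segmentation into a single final tag, as in the
# 'annavannu' example. The lexicon tag ends in the default case (NOM); a case
# suffix found by morphology overrides it.

def tag_from_segments(surface, root, root_tag, suffixes):
    """suffixes: list of (morpheme, feature) pairs from the analyzer."""
    elements = root_tag.split("-")
    for _, feature in suffixes:
        if feature in ("ACC", "DAT", "GEN"):   # case suffixes override NOM
            elements[-1] = feature
    return f"{surface}||{root}||{'-'.join(elements)}"

# anna is listed as N-COM-UNC-N.SL-NOM; morphology finds the accusative -annu.
print(tag_from_segments("annavannu", "anna", "N-COM-UNC-N.SL-NOM",
                        [("epsilon", "SL"), ("annu", "ACC")]))
# annavannu||anna||N-COM-UNC-N.SL-ACC
```
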

More than 10,000 word forms can be generated and analyzed for a given nominal stem, including all the number, case, clitic and vocative combinations. A mind-boggling number of word forms can be analyzed and generated for verbs. The complete set of tags is therefore extremely large. This does not cause any problems since in our approach tagging is done automatically and there is neither any need for manual tagging nor for any machine learning algorithm. The current system is more or less complete with respect to inflection, derivation and, to some extent, external sandhi.

2.3.1 Derivation

Derivational morphology can introduce changes in grammatical category. A gerund, for example, is a noun derived from a verb. Derived words can retain some properties of their initial category and obtain new properties in their new category. Both are important, and classifying them either under the initial category alone or under the final category alone would be incorrect. A gerund, for example, may retain its transitivity property and ability to take an object, while at the same time taking nominal suffixes and behaving like a subject or object of another verb. Look at this example:

heccuvudariMda||heccu||V-TR1->v-ABS-FUT-GRND->n-ABL

In our tagging system, we retain the complete history and all the relevant properties at each stage. This is essential for syntax.
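The derivation history in a tag like the gerund example above can be read off mechanically: the `>v` and `>n` markers record category changes, and the last marker gives the category that syntax should treat the word as. The parsing below is our reading of the example, not a published specification:

```python
# Sketch: extracting the category history from a derivation-bearing tag.
# Elements beginning with '>' are assumed to mark category changes.

def category_history(tag):
    """Return the sequence of categories a derived word has passed through."""
    parts = tag.split("-")
    history = [parts[0]]                                 # initial lexical category
    history += [p[1:].upper() for p in parts if p.startswith(">")]
    return history

tag = "V-TR1->v-ABS-FUT-GRND->n-ABL"
print(category_history(tag))       # ['V', 'V', 'N']
print(category_history(tag)[-1])   # 'N': behaves as a noun for syntax
```
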

2.3.2 Comparison with Other Tag Sets

Instead of making an exhaustive comparison with other tag sets, we shall only highlight a few important aspects of divergence. There have been several initial efforts to develop tag sets by various groups in India over the last few years. Most of these have been motivated by, if not based on, the tag sets and tagging approaches followed by others in the world, especially for English. These tag sets tend to be coarse, and quite shallow if not completely flat, intended for manual tagging and machine learning, under the assumption that tag assignment and disambiguation are largely a matter of the arrangement of tokens in a given sentence. Morphology does not have a central role. Most importantly, all these tagging schemes are orthography oriented and consider words to be sequences of written symbols separated by white spaces. Neither meanings nor pronunciations are taken as the basis for defining words. The specific purpose for which these tag sets have been designed is not clear. No attempt seems to have been made to precisely define each of the tags and publish the same. Presumably, all the efforts so far have been driven by the needs of machine translation. Over time, all these various proposed schemes are converging into a proposed draft BIS standard tag set. We shall therefore take up some issues with this proposed draft BIS tag set for comparison.

The proposed draft BIS tag sets show a division of verbs into main and auxiliary verbs, which we have already questioned. Also, there are attempts to divide verbs into finite and non-finite. What is the basis for finiteness? Is this division based on tense? Is it based on agreement marking? What exactly does this mean to syntax? Precise definitions have not been provided, the infinitive is sometimes listed separately, and the fact that it is possible to derive finite verb forms from non-finite forms through legitimate processes of morphology seems to have been overlooked. Gerunds and the so-called verbal nouns add to the confusion relating to verbs. It is not clear why demonstratives and quantifiers have been considered as top level categories. The notion of a post-position as a grammatical category is questionable.

2.4 The Bridge Module and Tag Disambiguation

Ideally, the input to a grammatical system should be a normalized sentence, wherein orthographic tokens have already been pre-processed into meaningful words. Ideally, the morphological analyzer should depict the complete and correct structure of the given words, helping the readers to obtain the correct meanings. Due to a variety of design and implementation strategies followed while building a computational system, there could be some deviations. The purpose of the bridge module is to iron out all such deviations so that the subsequent syntactic module gets to see a completely normalized and tagged sentence.

Here in the bridge module, the bits of information obtained from the dictionary and morphology are combined to generate final tags. For example:

 manege||mane||N-COM-COU-N.SL-DAT

 maaDuttaane||maaDu||V-TR1-PRES-P3.M.SL

 maaDidare||maaDu||V-TR1-PAST-COND

 maaDabeekaagibaMdaaga||maaDu||V-TR1-INF-CMPL-AUX.aagu-CJP.PAST-AUX.baru-PAST-RP-adj-CLIT.aaga

 maaDibiTTaraMtaa||maaDu||V-TR1-CJP.PAST-AUX.biDu-PAST-P3.MF.PL-CLIT.aMte-CLIT.INTG

It may be noted that noun-verb ambiguities have been resolved by morphology. Note that Dravidian morphology is very rich and standard linguistic terms may not always be readily available. We have chosen to tag such cases with the morphemes themselves. Further, certain morphemes have several possible interpretations and, since the dictionary and morph work with words in isolation, there is no way to disambiguate. In such cases, it is best to retain the ambiguity in implicit form, rather than provide several different tags. The clitic 'aMte' is an example of this.

Since we have decided to give complete grammatical information for all entries in the dictionary, and since we have decided to store only the irregular forms of pronouns in the dictionary, we would store, for example, 'nanna' in Kannada as a personal pronoun in first person, singular, genitive case. We would derive 'nannannu' through rules of nominal morphology. This is the accusative form, but the dictionary would be showing 'nanna' as the root in genitive case. One pronoun cannot have two cases; here the final case should be accusative, not genitive. The bridge module clears this up and gives the correct analysis as the final output.
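The repair just described can be sketched in a few lines. Since 'nannannu' is derived from the genitive entry 'nanna', the analysis initially carries GEN from the dictionary plus an accusative suffix; the bridge module keeps only the final case. The function below is an illustrative assumption, not the actual bridge code:

```python
# Sketch of the bridge-module case repair: one pronoun cannot carry two
# cases, so the case of the outermost suffix replaces the case inherited
# from the dictionary root.

def fix_case(tag, suffix_case):
    """Overwrite the root's case (the last tag element) with the suffix's case."""
    elements = tag.split("-")
    elements[-1] = suffix_case
    return "-".join(elements)

# Dictionary: nanna = PRO-PER-P1.MFN.SL-ABS-GEN; the suffix -annu marks ACC.
print(fix_case("PRO-PER-P1.MFN.SL-ABS-GEN", "ACC"))
# PRO-PER-P1.MFN.SL-ABS-ACC
```
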

Look at the Kannada word 'maaDalilla'. Structurally, this is 'maaDu (verb, do) + alu (infinitive) + illa (existential negative)'. This is how grammar books generally analyze this word. If you look at the meaning, however, there is no sense of the infinitive; this is simply the past tense form. The infinitive suffix is used as a glue in Kannada and Telugu, so that a variety of other suffixes can be added; there is really no infinitive sense. More interestingly, here it actually signifies past tense. Grammar is not all simple, neat and round, as school text books tend to suggest. The mapping from structure to meaning is not so straightforward in all situations. Our strategy here is to work with structure first and correct aberrations, if any, in the bridge module. This makes sense since computers can only work with structure, not with meanings directly.

The bridge module can handle ambiguities, unknown words, proper names or named entities, spelling errors, colloquial forms, etc. All this cleaning up would make the job of the subsequent modules so much simpler.

3 Experiments and Results

Here we shall give details of some experiments we have carried out for Telugu. The status of Kannada is similar.

We have performed POS tagging experiments on various corpora. F1 is a randomly selected file from the TDIL corpus. F2 is a set of sentences extracted from the TDIL corpus containing about 15,000 of the most frequent words from this corpus. These most frequent words have a great significance, not just because they account for more than 60% of the whole corpus but also because they include the most confusing items from the point of view of lexicon, morphology and tag assignment. The rest of the words, forming the bulk of the corpus, are the simplest as far as tagging is concerned. In order to be sure that there is no over-fitting for any particular data set, we have next attempted tagging files F3 and F4 from the Eenadu Telugu daily newspaper. The table below shows the performance of the system as on date. Here D indicates the number of words directly found in the dictionary, M indicates the number of words analyzed by the morph, D-AMB is the number of words found in the dictionary and having more than one tag, M-AMB is the number of words analyzed by the morph and having more than one tag, and UNK indicates the number of words that remain untagged.

It may be observed that about 40% of the word forms are directly found in the dictionary and 50-60% of the words are analyzed by the morph. Unless the texts contain a large percentage of proper nouns (e.g. F4), only some 5-10% of the words remain untagged. Most of the unknown words are loan words, proper nouns, or words involving external sandhi or compounds. Further work on the dictionary and morphology can reduce the number of untagged words in future. Of the tagged words, only about 10% have more than one tag assigned. The maximum number of tags that can get assigned is 4, but this is a very rare case. Even getting 3 analyses is a rare phenomenon. Only a few typical kinds of ambiguities occur most of the time, and preliminary studies have shown that a majority of them will automatically get resolved through chunking and parsing. We may not need a statistical approach to resolve these, but if one wishes, one can easily tag the whole corpus using our system and create training data from that. One can also try a variety of heuristic rules to resolve the remaining ambiguities.
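The D / M / D-AMB / M-AMB / UNK counts discussed here can be computed directly from tagger output. The record format below (each analysis carrying its source and its list of tags) is a hypothetical assumption for illustration:

```python
# Sketch: computing the evaluation counts from tagged output.
# Each analysis is (source, tags), where source is 'dict', 'morph', or None.

from collections import Counter

def corpus_stats(analyses):
    """Count dictionary hits, morph analyses, ambiguous cases, and unknowns."""
    stats = Counter()
    for source, tags in analyses:
        if source is None or not tags:
            stats["UNK"] += 1                # untagged word
        elif source == "dict":
            stats["D"] += 1                  # found directly in the dictionary
            if len(tags) > 1:
                stats["D-AMB"] += 1          # ambiguous dictionary entry
        else:
            stats["M"] += 1                  # analyzed by morphology
            if len(tags) > 1:
                stats["M-AMB"] += 1          # ambiguous morph analysis
    return stats

sample = [("dict",  ["N-COM-COU-N.SL-NOM"]),
          ("morph", ["V-TR1-PAST-COND", "N-COM-UNC-N.SL-DAT"]),
          (None,    [])]
print(dict(corpus_stats(sample)))
# {'D': 1, 'M': 1, 'M-AMB': 1, 'UNK': 1}
```
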

Upon careful observation of the tagged samples, we find that most words are tagged correctly. In order to be doubly sure, we again tagged 252 sentences selected from various grammar books, including a wide variety of sentence structures. These sentences include 1278 tokens. All of these get tagged. Only two words were tagged incorrectly. Only 140 words are assigned more than one tag.

4 Conclusions

In this paper we have presented a new approach to tagging based on our theory of language, grammar and computation. We have described in detail a large, hierarchical tag set being designed by us for Dravidian languages. We have demonstrated the viability and merits of our ideas through actually developed systems for Kannada and Telugu. More work is in progress.
