PART OF SPEECH TAGGING FOR MALAYALAM

A PROJECT REPORT

submitted by

ANISH A
CB206CN002

in partial fulfillment for the award of the degree of

MASTER OF TECHNOLOGY
IN
COMPUTATIONAL ENGINEERING AND NETWORKING

AMRITA SCHOOL OF ENGINEERING, COIMBATORE
AMRITA VISHWA VIDYAPEETHAM
COIMBATORE – 641 105

JULY 2008
BONAFIDE CERTIFICATE
This is to certify that the project report entitled “PART OF SPEECH
TAGGING FOR MALAYALAM” submitted by ANISH A (Reg. No.
CB206CN002) in partial fulfillment of the requirements for the award of the
Degree of Master of Technology in COMPUTATIONAL ENGINEERING
AND NETWORKING is a bonafide record of the work carried out under our
guidance and supervision at Amrita School of Engineering, Coimbatore.

Dr. K.P. SOMAN                   Mr. R. RAJEEV                    Dr. K.P. SOMAN
Project Guide                    Project Guide                    Head of the Department
Professor and Head, CEN          Sr. Computer Scientist
                                 Tamil University, Tanjore

This project report was evaluated by us on ………………………

INTERNAL EXAMINER                                         EXTERNAL EXAMINER
AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING, COIMBATORE CENTRE FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING
AND NETWORKING
DECLARATION
I, ANISH A (Reg. No. CB206CN002), hereby declare that this project report
entitled “PART OF SPEECH TAGGING FOR MALAYALAM” is a record
of the original work done by me under the guidance of Dr. K.P. Soman, Head
of the Department of Computational Engineering and Networking,
Amrita School of Engineering, Coimbatore, and Mr. R. Rajeev, Sr. Computer
Scientist, Tamil University, and that this work has not formed the basis for the award
of any degree / diploma / associateship / fellowship or a similar award to any
candidate in any University, to the best of my knowledge.

Place:                                        Signature of the student
Date:
COUNTERSIGNED
Dr.K.P.SOMAN
ACKNOWLEDGEMENT
I express my heartfelt gratitude and indebtedness to my guide
Dr. K.P. Soman, Head of the Department of Computational Engineering and
Networking, for his expert guidance, valuable suggestions, constant encouragement
and, above all, his understanding and wholehearted cooperation throughout the entire
period.

I am extremely grateful to Mr. Rajeev R.R, Sr. Computer Scientist, Tamil University,
for providing me with the necessary help and suggestions throughout the entire project
period.

I extend my cordial thanks to all the teaching and non-teaching staff of the
Department of Computational Engineering and Networking, especially
Mrs. Dhanalakshmi V, Mr. Loganathan and Mr. Ajith V.P, for providing the
necessary software and for the help rendered at various phases of the project
work.

I am short of words to express my sincere gratitude and profound thanks to all my
dear friends, without whose help this project could not have been completed.
ABSTRACT

The process of assigning a part of speech to every word in a given sentence according
to its context is called part-of-speech tagging. It is one of the fundamental tasks in
Natural Language Processing (NLP) and plays an important role in speech and
language applications such as speech recognition, speech synthesis, information
retrieval, word sense disambiguation and machine translation.

Many algorithms have been developed for part-of-speech (POS) tagging: Support
Vector Machines (SVM), Hidden Markov Models (HMM), Maximum Entropy Markov
Models (MEMM), neural networks, decision trees, and rule-based and
transformation-based techniques, to name a few.

This project performs part-of-speech tagging for the Indian language Malayalam
using SVMTool, which is implemented with support vector machines, and the TnT
tagger, which is built on a Hidden Markov Model. SVMTool achieves an accuracy of
87.5%, and the TnT tagger an accuracy of about 75%.

This thesis gives an overview of the algorithms used for part-of-speech tagging,
followed by the SVM and TnT taggers, presenting each algorithm from the ground up
with suitable examples. It also introduces a tagset for Malayalam that can be used
for tagging Malayalam text, along with a tool that assists in tagging. After tagging,
the outputs were compared to check the accuracy.
TABLE OF CONTENTS
1. INTRODUCTION ………………………………………………..…..……………….1
1.1 Word classes and part-of-speech tagging ................................................................ 1
1.2 English word classes ................................................................................................ 1
1.3 Tagsets for English .................................................................................................. 4
1.4 Part of speech tagging ............................................................................................. 7
1.5 Tagging approaches .................................................................................................. 8
1.5.2 Stochastic part of speech tagging …………………………………………...11
2. LITERATURE SURVEY…………………………………………………………….15
3.2 The theory of support vector machines ................................................................... 23
3.3 Problem setting ........................................................................................................ 26
3.3.2 Feature codification ......................................................................................... 26
3.4.1 SVMTlearn ...................................................................................................... 29
3.4.1.5 Test ....................................................................................................... 41
3.4.1.6 Models .............................................................................................. …41
3.4.2 SVMTagger .................................................................................................. ..44
3.4.2.1 Options ................................................................................................ 47
3.4.2.2 Strategies .............................................................................................. 51
3.4.3 SVMTeval ....................................................................................................... 52
3.4.3.1 Reports ................................................................................................. 53
4.1 Markov chains ......................................................................................................... 59
4.3 Computing likelihood: The forward algorithm ...................................................... 67
4.4 Decoding: The viterbi algorithm ............................................................................. 73
4.5 Training HMMs: The forward-backward algorithm ................................................ 79
4.6 An Example ............................................................................................................. 85
4.7 TNT Tool…………………………………………………………………….….….87
4.7.1 File formats……………………………………………………………………..87
4.7.1.2 Format of the n-gram file…………………………………………………..89
4.7.1.3 Format of map files………………………………………………………...90
4.7.2 Usage……………………………………………………………………….…...90
4.7.2.4 Counting tokens and types: tnt-wc..………………………………………..93
5. MALAYALAM PART- OF- SPEECH TAGGING………….……………………...94
5.1 Tagset for Malayalam ………………………….……………………….…..…......101
5.2 Transliteration for Malayalam………………………………………….………….106
5.3 Part of Speech Tagging for Malayalam using TNT and SVM………........….........107
5.3.1 Results………………………………………………………………….…….107
6. CONCLUSION………………………………………….……………….………… 110
REFERENCES….…………………………………………………………………..…111
LIST OF FIGURES

4.1 Simple Markov Chain for Weather Events…………………………………….......60
4.2 Alternate Representation of Markov Chain..………………………………………62
4.3 Sample Hidden Markov Model…………………………………………………….66
4.4 Graphical representation of computation of forward probability.............................69
4.5 Graphical representation of finding Hidden State Sequence....................................70
4.6 The Forward Algorithm………..……………………..……………………….. .…72
4.7 Example of Viterbi Algorithm.….……………..………………...………...............74
4.8 Viterbi Algorithm…………….….……………………………………….….....…..76
4.10 Computation of Transition Probability…………………….……………..…........82
4.11 Computation of Emission Probability……………………………….……….…...85
4.12 Part of Lexicon File.................................................................................................89
LIST OF TABLES

1.1 Penn Treebank Tagset..………………………………………………………………..5
1.2 The number of word types in Brown corpus by degree of ambiguity……..…….........8
3.1 Rich Feature Pattern Set………………………………………….…….………….…28
3.2 SVMTlearn config-file mandatory arguments…………………..…..….………........34
3.3 SVMTlearn config-file action arguments………………………..……......…..…......34
3.4 SVMTlearn config-file optional arguments…………..………….….…..……….......36
3.5 SVMTool feature language…………………………………………………………..37
3.6 Model 0. Example of suitable POS Features……………….………………….….….43
3.7 Model 1. Example of suitable POS Features….………….….……..…...….….….….43
3.8 Model 2. Example of suitable POS Features.…………….….…..…….……….….…44
ORGANIZATION OF THE PROJECT

This thesis deals with part-of-speech tagging for Malayalam. The work is
divided into five chapters.

The first chapter gives an introduction to part-of-speech tagging. It discusses
word classes and tag sets, and gives a short description of the various approaches
used for POS tagging.

The second chapter gives a literature review of part-of-speech tagging.

The third chapter is about part-of-speech tagging based on Support Vector Machines.
It discusses the software SVMTool and how it is used for training and testing.

The fourth chapter discusses the fundamental concepts of Hidden Markov Models
and algorithms such as the forward, backward, Viterbi and Baum-Welch algorithms,
and finally how they are applied to POS tagging.

The fifth chapter discusses part-of-speech tagging for Malayalam using SVM and
HMM. It first gives an idea of the parts of speech in Malayalam; then the tag sets
used for tagging are discussed; finally, it describes how part-of-speech tagging is
performed for Malayalam using SVMTool and TnT.
1.1 WORD CLASSES AND PART-OF-SPEECH TAGGING
Words are divided into different classes called parts of speech (POS; Latin pars
orationis), word classes, morphological classes, or lexical tags. In traditional
grammars, there are few parts of speech (noun, verb, adjective, preposition, adverb,
conjunction, etc.). Many of the recent models have much larger numbers of word
classes (45 for the Penn Treebank, 87 for the Brown corpus, and 146 for the C7
tagset).
Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is
the process of marking up the words in a text as corresponding to a particular part of
speech, based on both its definition, as well as its context —i.e., relationship with
adjacent and related words in a phrase, sentence, or paragraph.
1.2 ENGLISH WORD CLASSES
The definition of parts of speech has been based on morphological and syntactic
function; words that function similarly with respect to the affixes they take (their
morphological properties) or with respect to what can occur nearby (their
'distributional properties') are grouped into classes. While word classes do have
tendencies toward semantic coherence (nouns often describe 'people, places or
things', and adjectives often describe properties), this is not necessarily the case, and
semantic coherence is not used as a definitional criterion for parts of speech.
Parts of speech can be divided into two broad super categories: closed class types and
open class types. Closed classes are those that have relatively fixed membership. For
example, prepositions are a closed class because there is a fixed set of them in
English; new prepositions are rarely added. By contrast nouns and verbs are open
classes because new nouns and verbs are continually added or borrowed from other
languages. It is likely that any given speaker or corpus will have different open class
words, but all speakers of a language, and corpora that are large enough, will likely
share the set of closed class words. Closed class words are generally also function
words; function words are grammatical words like of, it, and, or you, which tend to be
very short, occur frequently, and play an important role in grammar.
There are four major open classes that occur in the languages of the world: nouns,
verbs, adjectives, and adverbs. It turns out that English has all four of these, although
not every language does. Many languages have no adjectives. In the Native American
language Lakhota, for example, and also possibly in Chinese, the words
corresponding to English adjectives act as a subclass of verbs.
Every known human language has at least the two categories noun and verb (although
in some languages, for example Nootka, the distinction is subtle). Noun is the name
given to the lexical class in which the words for most people, places, or things occur.
But since lexical classes like noun are defined functionally (morphologically and
syntactically) rather than semantically, some words for people, places, and things may
not be nouns, and conversely some nouns may not be words for people, places, or
things. Thus nouns include concrete terms like ship and chair, abstractions like
bandwidth and relationship. What defines a noun in English, then, are things like its
ability to occur with determiners (a goat, its bandwidth, Plato's Republic), to take
possessives (IBM's annual revenue), and for most but not all nouns, to occur in the
plural form (goats, abaci).
Nouns are traditionally grouped into proper nouns and common nouns. Proper nouns,
like Regina, Colorado, and IBM, are names of specific persons or entities. In English,
they generally aren't preceded by articles (e.g. the book is upstairs, but Regina is
upstairs). In written English, proper nouns are usually capitalized.
In many languages, including English, common nouns are divided into count nouns
and mass nouns. Count nouns are those that allow grammatical enumeration; that is,
they can occur in both the singular and plural (goat/goats, relationship/relationships)
and they can be counted (one goat, two goats). Mass nouns are used when something
is conceptualized as a homogeneous group. So words like snow, salt, and communism
are not counted (i.e. *two snows or *two communisms). Mass nouns can also appear
without articles where singular count nouns cannot (Snow is white but not *Goat is
white).
The verb class includes most of the words referring to actions and processes,
including main verbs like draw, provide, differ, and go. English verbs have a number
of morphological forms (non-3rd-person-sg eat, 3rd-person-sg eats, progressive
eating, past participle eaten).
The third open class English form is adjectives; semantically this class includes many
terms that describe properties or qualities. Most languages have adjectives for the
concepts of color (white, black), age (old, young), and value (good, bad), but there are
languages without adjectives.
The closed classes differ more from language to language than do the open classes.
Here's a quick overview of some of the more important closed classes in English,
with a few examples of each:
1. prepositions: on, under, over, near, by, at, from, to, with
2. determiners: a, an, the
3. pronouns: she, who, I, others
4. conjunctions: and, but, or, as, if, when
5. auxiliary verbs: can, may, should, are
6. particles: up, down, on, off, in, out, at, by
7. numerals: one, two, three, first, second, third
1.3 TAGSETS FOR ENGLISH
This section describes the actual tagsets used in part-of-speech tagging, in
preparation for the various tagging algorithms to be described later. There are a small
number of popular tagsets for English, many of which evolved from the 87-tag tagset
used for the Brown corpus (Francis, 1979; Francis and Kucera, 1982). Three of the
most commonly used are the small 45-tag Penn Treebank tagset (Marcus et al., 1993),
the medium-sized 61 tag C5 tagset used by the Lancaster UCREL project's CLAWS
(the Constituent Likelihood Automatic Word-tagging System) tagger to tag the
British National Corpus (BNC) (Garside et al., 1997), and the larger 146-tag C7
tagset (Leech et al., 1994). This section will present the smallest of them, the Penn
Treebank set, and then discuss specific additional tags from some of the other tagsets
that might be useful to incorporate for specific projects.
The Penn Treebank tagset, shown in Table 1.1, has been applied to the Brown corpus
and a number of other corpora. Here is an example of a tagged sentence from the
Penn Treebank version of the Brown corpus (in a flat ASCII file, tags are often
represented after each word, following a slash, but tags can also be represented in
various other ways):
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
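Reading this flat word/tag format is straightforward. Below is a minimal Python sketch; the splitting convention (everything before the last slash is the word) is an assumption chosen so that tokens whose surface form itself contains a slash are handled correctly:

```python
def parse_tagged(line):
    """Split a slash-tagged sentence into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition: only the LAST slash separates word from tag,
        # so a token like "1/2/CD" keeps its internal slash
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

sentence = "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./."
print(parse_tagged(sentence)[:3])  # [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN')]
```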
The Penn Treebank tagset was chosen from the original 87-tag tagset for the Brown
corpus. This reduced set leaves out information that can be recovered from the
identity of the lexical item. For example the original Brown tagset and other large
tagsets like C5 include a separate tag for each of the different forms of the verbs do
(e.g. C5 tag 'VDD' for did and 'VDG' for doing), be, and have. These were omitted
from the Penn set.
Tagging is the task of classifying words in a natural language text with respect to a
specific criterion. Different types of tagging can be distinguished based on the specific
criterion adopted.
 1  CC    Coordinating conjunction
 2  CD    Cardinal number
 3  DT    Determiner
 4  EX    Existential there
 5  FW    Foreign word
 6  IN    Preposition/subord. conjunction
 7  JJ    Adjective
 8  JJR   Adjective, comparative
 9  JJS   Adjective, superlative
10  LS    List item marker
11  MD    Modal
12  NN    Noun, singular or mass
13  NNS   Noun, plural
14  NNP   Proper noun, singular
15  NNPS  Proper noun, plural
16  PDT   Predeterminer
17  POS   Possessive ending
18  PRP   Personal pronoun
19  PRP$  Possessive pronoun
20  RB    Adverb
21  RBR   Adverb, comparative
22  RBS   Adverb, superlative
23  RP    Particle
24  SYM   Symbol (mathematical or scientific)
25  TO    to
26  UH    Interjection
27  VB    Verb, base form
28  VBD   Verb, past tense
29  VBG   Verb, gerund/present participle
30  VBN   Verb, past participle
31  VBP   Verb, non-3rd ps. sing. present
32  VBZ   Verb, 3rd ps. sing. present
33  WDT   wh-determiner
34  WP    wh-pronoun
35  WP$   Possessive wh-pronoun
36  WRB   wh-adverb
37  #     Pound sign
38  $     Dollar sign
39  .     Sentence-final punctuation
40  ,     Comma
41  :     Colon, semi-colon
42  (     Left bracket character
43  )     Right bracket character
44  "     Straight double quote
45  `     Left open single quote
46  ``    Left open double quote
47  '     Right close single quote
48  ''    Right close double quote
Table 1.1 Penn TreeBank Tagset
Syntactic word class: Identify the syntactic category, i.e., the part-of-speech (POS),
of a word in the context of a sentence. This helps subsequent stages of processing,
e.g., parsing, because the ambiguity is reduced from the beginning. Usually, the
process only takes into account the immediate neighborhood.
The horse raced [/VBD? VBN?] past the barn fell.
Word sense: Identifying the intended meaning of word in a given context. A
successful disambiguation requires considering more global aspects of the utterance,
e.g., what the topic of the discourse is about.
Plants [/factory? vegetable?] are known to be dangerous.
Attachment: Identify the site in a sentence where a phrase attaches to. This usually
requires syntactic as well semantic information. Often (at least for the purpose of an
evaluation) the problem is reduced to a binary decision.
He read the article in [/<=noun? <=verb?] the newspaper/train.
(Underspecified) syntactic parsing: This combines the task of POS tag resolution
and attachment disambiguation. The exact position of attachment is left
underspecified, i.e., only the direction of attachment is given, or a complete
disambiguation is achieved.
End of sentence: Distinguish between punctuation marks that end a sentence and
those that do not.
He saw Mr.[/EOS?] Jones surrounded by the crowd. [/EOS?]
Tagging based on syntactic word class is the most frequently encountered form of
tagging in the field of natural language processing
1.4 PART OF SPEECH TAGGING
Part-of-speech tagging (or just tagging for short) is the process of assigning a part-of-
speech or other lexical class marker to each word in a corpus. Tags are also usually
applied to punctuation markers; thus tagging for natural language is the same process
as tokenization for computer languages, although tags for natural languages are much
more ambiguous. Taggers play an increasingly important role in speech recognition,
natural language parsing and information retrieval.
The input to a tagging algorithm is a string of words and a specified tagset of the kind
described in the previous section. The output is a single best tag for each word. For
example, here are some sample sentences from the ATIS corpus of dialogues about
air-travel reservations. For each, a potential tagged output using the Penn Treebank
tagset is shown.
VB DT NN
Book that flight.
VBZ DT NN VB NN .
Does that flight serve dinner?
Even in these simple examples, automatically assigning a tag to each word is not
trivial. For example, book is ambiguous. That is, it has more than one possible usage
and part of speech. It can be a verb (as in book that flight or to book the suspect) or a
noun (as in hand me that book, or a book of matches). Similarly that can be a
determiner (as in Does that flight serve dinner), or a complementizer (as in I thought
that your flight was earlier). The problem of POS-tagging is to resolve these
ambiguities, choosing the proper tag for the context. Part-of-speech tagging is thus
one of the many disambiguation tasks in natural language processing.
Most words in English are unambiguous; i.e. they have only a single tag. But many of
the most common words of English are ambiguous (for example 'can' can be an
auxiliary ('to be able'), a noun ('a metal container'), or a verb ('to put something in
such a metal container')). Although only 11.5% of English word types in the Brown
corpus are ambiguous, over 40% of Brown tokens are ambiguous.
The following table 1.2 shows the degree of ambiguity.
Unambiguous (1-tag) 35,340
2 tags 3,760
3 tags 264
4 tags 61
5 tags 12
6 tags 2
7 tags 1
Table 1.2: The number of word types in Brown corpus by degree of ambiguity
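The counts in Table 1.2 are obtained by recording, for each word type, the set of distinct tags it receives anywhere in the tagged corpus. A minimal Python sketch of that computation follows; the toy corpus here is an illustrative stand-in for Brown, not real data:

```python
from collections import defaultdict, Counter

# toy tagged corpus of (word, tag) pairs; a real count would run over Brown
corpus = [("book", "NN"), ("book", "VB"), ("that", "DT"),
          ("that", "IN"), ("flight", "NN"), ("the", "DT"),
          ("race", "NN"), ("race", "VB"), ("can", "MD"),
          ("can", "NN"), ("can", "VB")]

# collect the set of tags seen for each word type
tags_per_word = defaultdict(set)
for word, tag in corpus:
    tags_per_word[word.lower()].add(tag)

# histogram: how many word types have 1 tag, 2 tags, 3 tags, ...
ambiguity = Counter(len(tags) for tags in tags_per_word.values())
for degree in sorted(ambiguity):
    print(f"{degree} tag(s): {ambiguity[degree]} word types")
```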
Luckily, many of the 40% ambiguous tokens are easy to disambiguate. This is
because the various tags associated with a word are not equally likely. For example, a
can be a determiner, or the letter a (perhaps as part of an acronym or an initial). But
the determiner sense of a is much more likely.
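This observation underlies the simplest statistical baseline: estimate P(tag | word) from a tagged corpus and always emit the most frequent tag for each word. A minimal sketch, with illustrative training pairs and an assumed fallback tag for unseen words:

```python
from collections import Counter, defaultdict

# illustrative training pairs, not a real corpus
train = [("a", "DT"), ("a", "DT"), ("a", "DT"), ("a", "NN"),
         ("book", "NN"), ("book", "NN"), ("book", "VB")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_likely_tag(word, default="NN"):
    """Unigram baseline: return the tag most often seen with this word."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default  # unseen words fall back to the default tag

print(most_likely_tag("a"))     # DT: the determiner sense dominates
print(most_likely_tag("book"))  # NN
```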
1.5 TAGGING APPROACHES
Most tagging algorithms fall into one of two classes: rule-based taggers and stochastic
taggers. Rule-based taggers generally involve a large database of hand-written
disambiguation rules which specify, for example, that an ambiguous word is a noun
rather than a verb if it follows a determiner.
Stochastic taggers generally resolve tagging ambiguities by using a training corpus to
compute the probability of a given word having a given tag in a given context.
The Brill tagger shares features of both tagging architectures. Like the rule-based
tagger, it is based on rules which determine when an ambiguous word should have a
given tag. Like the stochastic taggers, it has a machine-learning component: the rules
are automatically induced from a previously-tagged training corpus.
1.5.1 RULE-BASED PART-OF-SPEECH TAGGING
The earliest algorithms for automatically assigning part-of-speech were based on a
two-stage architecture. The first stage used a dictionary to assign each word a list of
potential parts of speech. The second stage used large lists of hand-written
disambiguation rules to bring down this list to a single part-of-speech for each word.
The ENGTWOL tagger is based on the same two-stage architecture, although both
the lexicon and the disambiguation rules are much more sophisticated than the early
algorithms. The ENGTWOL lexicon is based on the two-level morphology and has
about 56,000 entries for English word stems, counting a word with multiple parts of
speech (e.g. nominal and verbal senses of hit) as separate entries, and of course not
counting inflected and many derived forms. Each entry is annotated with a set of
morphological and syntactic features.
In the first stage of the tagger, each word is run through the two-level lexicon
transducer and the entries for all possible parts of speech are returned.
For example the phrase “Pavlov had shown that salivation...”
would return the following list (one line per possible tag, with the correct tag shown
in boldface):
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
A set of about 1,100 constraints are then applied to the input sentence to rule out
incorrect parts of speech; the boldfaced entries in the table above show the desired
result, in which the preterite (not participle) tag is applied to had, and the
complementizer (CS) tag is applied to that. The constraints are used in a negative
way, to eliminate tags that are inconsistent with the context. For example one
constraint eliminates all readings of that except the ADV (adverbial intensifier) sense
(this is the sense in the sentence it isn't that odd). Here's a simplified version of the
constraint:
if
    (+1 A/ADV/QUANT);   /* if next word is adj, adverb, or quantifier */
    (+2 SENT-LIM);      /* and following which is a sentence boundary, */
    (NOT -1 SVOC/A);    /* and the previous word is not a verb like */
                        /* 'consider' which allows adjs as object complements */
then eliminate non-ADV tags
else eliminate ADV tag
The first two clauses of this rule check to see that the ‘that’ directly precedes a
sentence-final adjective, adverb, or quantifier. In all other cases the adverb reading is
eliminated. The last clause eliminates cases preceded by verbs like consider or
believe which can take a noun and an adjective; this is to avoid tagging the following
instance of that as an adverb:
I consider that odd.
Another rule is used to express the constraint that the complementizer sense of that is
most likely to be used if the previous word is a verb which expects a complement
(like believe, think, or show), and if the ‘that’ is followed by the beginning of a noun
phrase, and a finite verb.
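The negative-constraint mechanism described in this section can be sketched procedurally. The token representation below (a dict holding a set of candidate tags) and the simplified tag names (ADJ, ADV, QUANT, SENT-LIM, SVOC/A) follow the rule text but are illustrative assumptions, not the actual ENGTWOL implementation:

```python
def apply_adverbial_that_rule(tokens, i):
    """Simplified sketch of the constraint for 'that' at position i.

    Each token is a dict with 'word' and a set of candidate 'tags'.
    A position past the end of the sentence counts as a sentence boundary.
    """
    nxt = tokens[i + 1]["tags"] if i + 1 < len(tokens) else set()
    nxt2 = tokens[i + 2]["tags"] if i + 2 < len(tokens) else {"SENT-LIM"}
    prev = tokens[i - 1]["tags"] if i > 0 else set()

    if (nxt & {"ADJ", "ADV", "QUANT"}     # next word is adj, adverb, or quantifier
            and "SENT-LIM" in nxt2        # and the one after is a sentence boundary
            and "SVOC/A" not in prev):    # and previous word is not 'consider'-like
        tokens[i]["tags"] = {"ADV"}       # keep only the adverbial reading
    else:
        tokens[i]["tags"].discard("ADV")  # otherwise eliminate the ADV reading
    return tokens[i]["tags"]
```

On "it isn't that odd" the rule keeps only ADV for that; after a verb like consider, the ADV reading is eliminated instead.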
1.5.2 STOCHASTIC PART-OF-SPEECH TAGGING
The use of probabilities in tags is quite old; probabilities in tagging were first used in
1965, a complete probabilistic tagger with Viterbi decoding was sketched by Bahl
and Mercer (1976), and various stochastic taggers were built in the 1980's (Marshall,
1983; Garside, 1987; Church, 1988; DeRose, 1988). This section describes a
particular stochastic tagging algorithm generally known as the Hidden Markov Model
or HMM tagger. The idea behind all stochastic taggers is a simple generalization of
the 'pick the most likely tag for this word' approach.
For a given sentence or word sequence, HMM taggers choose the tag sequence that
maximizes the following formula:
    P(word | tag) × P(tag | previous n tags)                            (1.1)
The rest of this section will explain and motivate this particular equation. HMM
taggers generally choose a tag sequence for a whole sentence rather than for a single
word, but for pedagogical purposes, let's first see how an HMM tagger assigns a tag
to an individual word. We first give the basic equation, then work through an
example, and, finally, give the motivation for the equation.
A bigram HMM tagger of this kind chooses the tag t_i for word w_i that is most
probable given the previous tag t_{i-1} and the current word w_i:

    t_i = argmax_j P(t_j | t_{i-1}, w_i)                                (1.2)

We can restate Equation 1.2 to give the basic HMM equation for a single tag by
using some Markov assumptions, as follows:

    t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)                        (1.3)
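The step between these two equations can be made explicit. Expanding the conditional probability by Bayes' rule and dropping the denominator, which is the same for every candidate tag:

```latex
t_i = \operatorname*{argmax}_j P(t_j \mid t_{i-1}, w_i)
    = \operatorname*{argmax}_j \frac{P(w_i \mid t_j, t_{i-1})\, P(t_j \mid t_{i-1})}{P(w_i \mid t_{i-1})}
    \approx \operatorname*{argmax}_j P(w_i \mid t_j)\, P(t_j \mid t_{i-1})
```

The last step is the output-independence (Markov) assumption that a word depends only on its own tag, so P(w_i | t_j, t_{i-1}) is approximated by P(w_i | t_j); this gives Equation 1.3.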
An Example
Using an HMM tagger to assign the proper tag to the single word race in the
following examples (both shortened slightly from the Brown corpus):
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN (1.4)
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT
race/NN for/IN outer/JJ space/NN (1.5)
In the first example race is a verb (VB), in the second a noun (NN). For the purposes
of this example, we will assume that some other mechanism has already done the best
tagging job possible on the surrounding words, leaving only the word race untagged.
A bigram version of the HMM tagger makes the simplifying assumption that the
tagging problem can be solved by looking at nearby words and tags. Consider the
problem of assigning a tag to race given just these subsequences:
to/TO race/???
the/DT race/???
Equation 1.3 says that if we are trying to choose between NN and VB for the
sequence to race, we choose the tag that has the greater of these two probabilities:
P(VB|TO) P(race|VB) (1.6)
and
P(NN|TO) P(race|NN) (1.7)
Equation 1.3 and its instantiations, Equations 1.6 and 1.7, each have two probabilities:
a tag sequence probability P(t_i | t_{i-1}) and a word likelihood P(w_i | t_j). For race, the
tag sequence probabilities P(NN|TO) and P(VB|TO) answer the question
"how likely are we to expect a verb (or noun) given the previous tag?". They can be
computed from a corpus by counting and normalizing. We can expect that a verb is
more likely to follow TO than a noun is, since infinitives (to race, to run, to eat) are
common in English. While it is possible for a noun to follow TO (walk to school,
related to hunting), it is less common.
Suppose the combined Brown and Switchboard corpora give us the following
probabilities, showing that a verb is roughly fifteen times as likely as a noun after TO:

P(NN|TO) = .021
P(VB|TO) = .34
The second part of Equation 1.3 and its instantiations, Equations 1.6 and 1.7, is the
lexical likelihood: the likelihood of the word race given each tag, P(race|VB) and
P(race|NN). This likelihood term is not 'which is the most likely tag for this word';
that is, it is not P(VB|race). Instead we are computing P(race|VB), which answers
the question "if we were expecting a verb, how likely is it that this verb would be
race?"
Here are the lexical likelihoods from the combined Brown and Switchboard corpora:
P(race|NN) = .00041
P(race|VB) = .00003
If we multiply the lexical likelihoods by the tag sequence probabilities, we see that
even this simple bigram version of the HMM tagger correctly tags race as a VB,
despite the fact that VB is the less likely sense of race:

P(VB|TO) P(race|VB) = .00001
P(NN|TO) P(race|NN) = .000007
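The comparison can be reproduced in a few lines of Python, using the probabilities quoted above; note that in a real tagger these tiny products are normally summed as log probabilities to avoid numerical underflow:

```python
# combined Brown and Switchboard estimates quoted in the text
p_vb_given_to = 0.34       # P(VB|TO)
p_nn_given_to = 0.021      # P(NN|TO)
p_race_given_vb = 0.00003  # P(race|VB)
p_race_given_nn = 0.00041  # P(race|NN)

# Equation 1.3: tag sequence probability x lexical likelihood
score_vb = p_vb_given_to * p_race_given_vb
score_nn = p_nn_given_to * p_race_given_nn

best = "VB" if score_vb > score_nn else "NN"
print(best)  # VB
```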
CHAPTER 2
LITERATURE SURVEY
Part-of-speech tagging is the act of assigning each word in a sentence a tag that
describes how that word is used in the sentence. Typically, these tags indicate
syntactic categories, such as noun or verb, and occasionally include additional feature
information, such as number (singular or plural) and verb tense. The Penn Treebank
[20] documentation defines a commonly used set of tags.
A large number of current language processing systems use a part-of-speech tagger
for pre-processing. The tagger assigns a part-of-speech tag to each token in the input
and passes its output to the next processing level, usually a parser. For both
applications, a tagger with the highest possible accuracy is required. Recent
comparisons of approaches that can be trained on corpora have shown that in most
cases statistical approaches yield better results than finite-state, rule-based, or
memory-based taggers. The authors in [1] describe the models and techniques used
by TnT together with its implementation. The result of the tagger comparison seems
to support the maxim "the simplest is the best".
Part-of-speech tagging is also a very practical application, with uses in many areas,
including speech recognition and generation, machine translation, parsing,
information retrieval and lexicography. Tagging can be seen as a prototypical
problem in lexical ambiguity; advances in part-of-speech tagging could readily
translate to progress in other areas of lexical, and perhaps structural, ambiguity, such
as word sense disambiguation and prepositional phrase attachment disambiguation.
Also, it is possible to cast a number of other useful problems as part-of-speech
tagging problems, such as letter-to-sound translation and building pronunciation
networks for speech recognition.
When automated part-of-speech tagging was initially explored [2], people manually
engineered rules for tagging, sometimes with the aid of a corpus. As large corpora
became available, it became clear that simple Markov-model based stochastic taggers
that were automatically trained could achieve high rates of tagging accuracy. Markov-
model based taggers assign to a sentence the tag sequence that maximizes P (word |
tag) * P (tag | previous n tags). These probabilities can be estimated directly from a
manually tagged corpus. These stochastic taggers have a number of advantages over
manually built taggers, including eliminating the need for laborious manual rule
construction, and they can capture useful information that may not have been noticed
by the human engineer. However, stochastic taggers have the disadvantage that linguistic
information is captured only indirectly, in large tables of statistics. All recent work in
developing automatically trained part-of-speech taggers has been on further exploring
Markov model based tagging.
Statistical methods have also been used (e.g., [5]). These provide the capability of
resolving ambiguity on the basis of the most likely interpretation. A widely used
form of Markov model assumes that a word depends probabilistically on just its
part-of-speech category, which in turn depends solely on the categories of the
preceding two words (in the case of a trigram model).
Two types of training (i.e., parameter estimation) have been used with this model.
The first makes use of a tagged training corpus. Derouault and Merialdo use a
bootstrap method for training [6]. At first, a relatively small amount of text is
manually tagged and used to train a partially accurate model. The model is then used
to tag more text, and the tags are manually corrected and then used to retrain the
model. Church uses the tagged Brown corpus for training [7]. These models involve
probabilities for each word in the lexicon, so large tagged corpora are required for
reliable estimation.
The second method of training does not require a tagged training corpus. In this
situation the Baum-Welch algorithm (also known as the forward-backward algorithm)
can be used [8]. Under this system the model is called a hidden Markov model
(HMM), as state transitions (i.e., part-of-speech categories) are assumed to be
unobservable. Jelinek has used this method for training a text tagger [9]. Parameter
smoothing can be conveniently achieved using the method of ‘deleted interpolation’
in which weighted estimates are taken from second and first-order models and a
uniform probability distribution. Kupiec used word equivalence classes (referred to
here as ambiguity classes) based on parts of speech to pool data from individual
words. The most common words are still represented individually, as sufficient data
exist for robust estimation.
All other words are represented according to the set of possible categories they can
assume. In this manner, the vocabulary of 50,000 words in the Brown corpus can be
reduced to approximately 400 distinct ambiguity classes [10]. To further reduce the
number of parameters, a first-order model can be employed (this assumes that a
word's category depends only on the immediately preceding word's category). In [11],
networks are used to selectively augment the context in a basic first order model,
rather than using uniformly second-order dependencies.
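The idea of pooling words into ambiguity classes can be sketched as follows; the tiny tagged corpus is hypothetical:

```python
from collections import defaultdict

# Hypothetical tagged corpus: (word, tag) pairs.
tagged = [("the", "DT"), ("run", "NN"), ("run", "VB"), ("walk", "NN"),
          ("walk", "VB"), ("dog", "NN"), ("the", "DT")]

# Collect the set of possible tags for every word.
possible = defaultdict(set)
for word, tag in tagged:
    possible[word].add(tag)

# An ambiguity class is the sorted set of possible tags; "run" and "walk"
# fall into the same class NN-VB and can therefore share parameter estimates.
ambiguity_class = {w: "-".join(sorted(tags)) for w, tags in possible.items()}
print(ambiguity_class["run"], ambiguity_class["walk"])  # NN-VB NN-VB
```

Because many words share the same set of possible tags, the full vocabulary collapses into a much smaller number of such classes, as the text notes for the Brown corpus.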
Part-of-speech tagging itself is a useful tool in lexical disambiguation; for example,
knowing that "dig" is being used as a noun rather than as a verb indicates the word's
appropriate meaning. But many words have multiple meanings even while occupying
the same part of speech. To this end, the tagger has been used in the implementation
of an experimental noun homograph disambiguation algorithm [12].
The different approaches to part-of-speech tagging can be classified into two main
classes, depending on the tendencies followed for establishing the Language Model
(LM): the linguistic approach, based on hand-coded linguistic rules, and the learning
approach, based on models induced from corpora. Approaches that use hybrid
methods have also been proposed.
In the linguistic approach, an expert linguist is needed to formalize the restrictions of
the language. This implies a very high cost and it is very dependent on each particular
language. We can find an important contribution that uses the Constraint Grammar
formalism. Supervised learning methods have also been proposed to learn a set of
transformation rules that repair the errors committed by a probabilistic tagger. The
main advantage of the linguistic approach is that the model is constructed from a
linguistic point of view and contains many complex kinds of knowledge.
In the learning approach, the most extended formalism is based on n-grams. In this
case, the language model can be estimated from a labeled corpus (supervised
methods) [13] or from a non-labeled corpus (unsupervised methods) [14]. In the first
case, the model is trained from the relative observed frequencies. In the second one,
the model is learned using the Baum-Welch algorithm from an initial model which is
estimated using labeled corpora [15]. The advantages of the unsupervised approach
are the facility to build language models, the flexibility of choice of categories and
the ease of application to other languages.
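In the supervised case, estimating transition probabilities from relative observed frequencies can be sketched as follows (the toy tag sequence is hypothetical):

```python
from collections import Counter

# Hypothetical tag sequence taken from a labeled corpus.
tags = ["DT", "NN", "VB", "DT", "NN", "NN", "VB", "DT", "JJ", "NN"]

bigrams = Counter(zip(tags, tags[1:]))   # counts of (previous tag, tag)
unigrams = Counter(tags[:-1])            # counts of the previous tag

def p_transition(t_prev, t):
    """Relative observed frequency: count(t_prev, t) / count(t_prev)."""
    return bigrams[(t_prev, t)] / unigrams[t_prev]

print(p_transition("DT", "NN"))  # 2/3: two of the three DT's are followed by NN
```

In the unsupervised case these counts are not observable, and the Baum-Welch algorithm iteratively re-estimates expected counts instead.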
Part-of-speech tagging is an important research topic in Natural Language Processing
(NLP). Taggers are often preprocessors in NLP systems, making accurate
performance especially important. Much research has been done to improve tagging
accuracy using several different models and methods.
Most NLP applications demand shallow linguistic information at their initial stages
(e.g., part-of-speech tagging, base phrase chunking, named entity recognition). This
information may be predicted fully automatically (at the cost of some errors) by
means of sequential tagging over unannotated raw text. Generally, tagging is
required to be as accurate as possible, and as efficient as possible. But, certainly, there
is a trade-off between these two desirable properties. This is so because obtaining a
higher accuracy relies on processing more and more information, digging deeper and
deeper into it. However, sometimes, depending on the kind of application, a loss in
efficiency may be acceptable in order to obtain more precise results. Or the other way
around, a slight loss in accuracy may be tolerated in favour of tagging speed.
Some languages have a richer morphology than others, requiring the tagger to take
into account a bigger set of feature patterns. Also, the tagset size and ambiguity rate
may vary from language to language and from problem to problem. Besides, if few
data are available for training, the proportion of unknown words may be huge.
Sometimes, morphological analyzers could be utilized to reduce the degree of
ambiguity when facing unknown words. Thus, a sequential tagger should be flexible
with respect to the amount of information utilized and context shape. Another very
interesting property for sequential taggers is their portability. Multilingual
information is a key ingredient in NLP tasks such as Machine Translation,
Information Retrieval, Information Extraction, Question Answering and Word Sense
Disambiguation, just to name a few. Therefore, having a tagger that works equally
well for several languages is crucial for the system robustness. Besides, quite often
for some languages, but also in general, lexical resources are hard to obtain.
Therefore, ideally, a tagger should be capable of learning from fewer (or even no)
annotated data. The svmtool [21] is intended to comply with all the requirements of
modern NLP technology, by combining simplicity, flexibility, robustness, portability
and efficiency with state–of–the–art accuracy. This is achieved by working in the
Support Vector Machines (SVM) learning framework, and by offering NLP
researchers a highly customizable sequential tagger generator.
In the recent literature, several approaches to PoS tagging based on statistical and
machine learning techniques have been applied, including, among many others,
Hidden Markov Models [13] and Support Vector Machines [16]. Most of the previous
taggers have been evaluated on the English WSJ corpus, using the Penn Treebank set
of PoS categories and a lexicon constructed directly from the annotated corpus.
Although the evaluations were performed with slight variations, there was a wide
consensus in the late 90’s that the state–of-the–art accuracy for English PoS tagging
was between 96.4% and 96.7%. TnT is an example of a really practical tagger for
NLP applications. It is available to anybody, simple and easy to use, considerably
accurate, and extremely efficient, allowing training from 1 million word corpora in
just a few seconds and tagging thousands of words per second.
Many natural language tasks require the accurate assignment of Part-Of-Speech
(POS) tags to previously unseen text. Due to the availability of large corpora which
have been manually annotated with POS information, many taggers use annotated
text to "learn" either probability distributions or rules and use them to automatically
assign POS tags to unseen text.
CHAPTER 3
SVM
This chapter presents the svmtool, a simple, flexible, and effective generator of sequential
taggers based on Support Vector Machines, and how it is applied to the problem
of part-of-speech tagging. This SVM-based tagger is robust and flexible for feature
modeling (including lexicalization), trains efficiently with almost no parameters to
tune, and is able to tag thousands of words per second, which makes it really practical
for real NLP applications. Regarding accuracy, the SVM-based tagger significantly
outperforms the TnT tagger under exactly the same conditions, and achieves a very
competitive accuracy of 97.2% for English on the Wall Street Journal corpus.
3.1 PROPERTIES OF THE SVMTOOL
The following are the properties of the svmtool:
Simplicity: The svmtool is easy to configure and to train. The learning is controlled
by means of a very simple configuration file. There are very few parameters to tune.
And the tagger itself is very easy to use, accepting standard input and output
pipelining. Embedded usage is also supplied by means of the svmtool API.
Flexibility: The size and shape of the feature context can be adjusted. Also, rich
features can be defined, including word and PoS (tag) n-grams as well as ambiguity
classes and “may be’s”, apart from lexicalized features for unknown words and
sentence general information. The behavior at tagging time is also very flexible,
allowing different strategies.
Robustness: The overfitting problem is well addressed by tuning the C parameter in
the soft margin version of the SVM learning algorithm. Also, a sentence-level
analysis may be performed in order to maximize the sentence score. And, so that
unknown words do not degrade the system effectiveness so severely, several
strategies have been implemented and tested.
Portability: The svmtool is language independent. It has been successfully applied to
English and Spanish without a priori knowledge other than a supervised corpus.
Moreover, thinking of languages for which labeled data is a scarce resource, the
svmtool also may learn from unsupervised data based on the role of non-ambiguous
words with the only additional help of a morpho-syntactic dictionary.
Accuracy: Compared to state–of–the–art PoS taggers reported up to date, it exhibits a
very competitive accuracy (97.2% for English on the WSJ corpus). Clearly, rich sets
of features allow modeling very precisely most of the information involved. Also, the
learning paradigm, SVM, is very suitable for working accurately and efficiently with
high dimensionality feature spaces.
Efficiency: Performance at tagging time depends on the feature set size and the
tagging scheme selected. For the default (one-pass left-to-right greedy) tagging
scheme, it exhibits a tagging speed of 1,500 words/second whereas the C++ version
achieves a tagging speed of over 10,000 words/second. This has been achieved by
working in the primal formulation of SVM. The use of linear kernels causes the
tagger to perform more efficiently both at tagging and learning time, but forces the
user to define a richer feature space. However, the learning time remains linear with
respect to the number of training examples.
3.2 THE THEORY OF SUPPORT VECTOR MACHINES
SVM is a machine learning algorithm for binary classification, which has been
successfully applied to a number of practical problems, including NLP. Let {(x1, y1), . .
., (xN, yN)} be the set of N training examples, where each instance xi is a vector in
R^n and yi ∈ {−1,+1} is the class label. In its basic form, an SVM learns a linear
hyperplane that separates the set of positive examples from the set of negative
examples with maximal margin (the margin is defined as the distance of the
hyperplane to the nearest of the positive and negative examples). This learning bias
has proved to have good properties in terms of generalization bounds for the induced
classifiers.
The linear separator is defined by two elements: a weight vector w (with one
component for each feature), and a bias b which stands for the distance of the
hyperplane to the origin. The classification rule of a SVM is:
sign(f(x, w, b)) (3.1)
f(x, w, b) = ⟨w · x⟩ + b (3.2)
where x is the example to be classified. In the linearly separable case, learning the
maximal margin hyperplane (w, b) can be stated as a convex quadratic optimization
problem with a unique solution: minimize ||w||, subject to the constraints (one for
each training example):
yi (⟨w · xi⟩ + b) ≥ 1, for 1 ≤ i ≤ N (3.3)
See an example of a 2-dimensional SVM in Figure 3.1.
Figure 3.1: A two-dimensional SVM. The hyperplane separates the data belonging to
class +1 from the data belonging to class −1.
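The classification rule of the linear separator reduces to a dot product plus a bias; the weight vector, bias, and example points below are hypothetical:

```python
# Linear SVM decision: sign(<w, x> + b), cf. equations (3.1)-(3.2).
def svm_classify(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical separator and points in a 2-dimensional space.
w, b = [1.0, -1.0], 0.5
print(svm_classify(w, b, [2.0, 0.0]))   # falls on the positive side: class +1
print(svm_classify(w, b, [0.0, 3.0]))   # falls on the negative side: class -1
```

Training determines w and b; classification itself is just this inexpensive linear evaluation, which is why linear SVM taggers can be fast at tagging time.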
The SVM model has an equivalent dual formulation, characterized by a weight vector
α and a bias b. In this case, α contains one weight for each training vector,
indicating the importance of this vector in the solution. Vectors with non null weights
are called support vectors.
f(x, α, b) = Σi=1..N yi αi ⟨xi · x⟩ + b (3.4)
The α vector can also be calculated as a quadratic optimization problem. Given the
optimal α* vector of the dual quadratic optimization problem, the weight vector w*
that realizes the maximal margin hyperplane is calculated as:
w* = Σi=1..N yi αi* xi (3.5)
The bias b* also has a simple expression in terms of w* and the training examples
{(xi, yi)}i=1..N.
The advantage of the dual formulation is that it permits efficient learning of non-
linear SVM separators, by introducing kernel functions. Mathematically, a kernel
function calculates a dot product between two vectors that have been (non linearly)
mapped into a high dimensional feature space. Since there is no need to perform this
mapping explicitly, the training is still feasible although the dimension of the real
feature space can be very high or even infinite.
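The kernel idea can be illustrated with a degree-2 polynomial kernel, which computes the dot product of two vectors in an implicitly mapped feature space without ever building that space (the vectors are arbitrary examples):

```python
import math

def poly2_kernel(x, z):
    """K(x, z) = (<x, z>)^2 -- a degree-2 homogeneous polynomial kernel."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def explicit_map(x):
    """The corresponding feature map for 2-d input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return [x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1]]

x, z = [1.0, 2.0], [3.0, 4.0]
lhs = poly2_kernel(x, z)
rhs = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))
print(lhs, rhs)  # identical values: the kernel avoids computing the mapping
```

For higher degrees or infinite-dimensional maps (e.g., Gaussian kernels), only the kernel side of this identity remains computable, which is exactly what makes the dual formulation attractive.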
In the presence of outliers and wrongly classified training examples it may be useful
to allow some training errors in order to avoid overfitting. This is achieved by a
variant of the optimization problem, referred to as soft margin, in which the
contribution to the objective function of margin maximization and training errors can
be balanced through the use of a parameter called C as shown in Figure 1 (rightmost
representation).
This section describes the collection of training examples and the feature
codification.
3.3.1 Binarizing the Classification Problem
Tagging a word in context is a multi-class classification problem. Since SVMs are
binary classifiers, a binarization of the problem must be performed before applying
them. Here a simple one-per-class binarization is applied, i.e., an SVM is trained for
every PoS tag in order to distinguish between examples of this class and all the rest.
When tagging a word, the most confident tag according to the predictions of all
binary SVMs is selected.
However, not all training examples have been considered for all classes. Instead, a
dictionary is extracted from the training corpus with all possible tags for each word,
and when considering the occurrence of a training word w tagged as ti, this example
is used as a positive example for class ti and a negative example for all other tj classes
appearing as possible tags for w in the dictionary. In this way, the generation of
excessive (and irrelevant) negative examples can be avoided, and the training step
can be made faster.
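The example-selection scheme just described can be sketched as follows; the toy corpus and the resulting dictionary are hypothetical:

```python
from collections import defaultdict

# Hypothetical training occurrences: (word, gold tag).
corpus = [("run", "VB"), ("run", "NN"), ("the", "DT"), ("run", "VB")]

# Dictionary of possible tags per word, extracted from the training corpus.
dictionary = defaultdict(set)
for word, tag in corpus:
    dictionary[word].add(tag)

# One binary problem per PoS tag: an occurrence of word w tagged t is a
# positive example for t and a negative example only for the OTHER tags
# listed for w in the dictionary -- not for every tag in the tagset.
examples = defaultdict(list)  # tag -> list of (word, label)
for word, tag in corpus:
    for t in dictionary[word]:
        examples[t].append((word, +1 if t == tag else -1))

print(examples["DT"])  # [('the', 1)]: occurrences of "run" generate no DT negatives
```

Restricting negatives to the word's own ambiguity class is what keeps the number of training examples per binary problem manageable.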
3.3.2 Feature Codification
Each example (event) has been represented using the local context of the word for
which the system will determine a tag (output decision). This local context and local
information like capitalization and affixes of the current token will help the system
make a decision even if the token has not been encountered during training. A
centered window of seven tokens is considered, in which some basic and n-gram
patterns are evaluated to form binary features such as: “previous word is the”, “two
preceding tags are DT NN”, etc. Table 3.1 contains the list of all patterns considered.
As can be seen, the tagger is lexicalized and all word forms appearing in the window
are taken into account. Since a very simple left-to-right tagging scheme will be used,
the tags of the following words are not known at tagging time. Following the
approach of the memory-based tagger, the more general ambiguity-class tag is used
for the right-context words; this is a label composed of the concatenation of all
possible tags for the word (e.g., IN-RB, JJ-NN, etc.). Each of the individual tags of an
ambiguity class is also taken as a binary feature of the form “following word may be
a VBZ”. Therefore, with ambiguity classes and “maybe’s”, a two-pass solution is
avoided, in which an initial first-pass tagging would be performed in order to have
right contexts disambiguated for the second pass. Explicit n-gram features are not
necessary in the SVM approach, because polynomial kernels account for the
combination of features. However, since we are interested in working with a linear
kernel, those are included in the feature set.
Additional features have been used to deal with the problem of unknown words.
Features appearing a number of times under a certain count cut-off might be ignored
for the sake of robustness.
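A minimal sketch of extracting binary features from a centered window is given below. The window size, padding token, and feature names are illustrative simplifications; the actual tool uses a seven-token window and the full pattern set of Table 3.1:

```python
def window_features(words, i, size=5):
    """Binary features from a window of `size` tokens centered at position i."""
    core = size // 2
    feats = []
    for off in range(-core, core + 1):
        j = i + off
        w = words[j] if 0 <= j < len(words) else "<PAD>"
        feats.append(f"w({off})={w}")              # word unigram at relative position
    prev = words[i - 1] if i > 0 else "<PAD>"
    feats.append(f"w(-1,0)={prev}~{words[i]}")     # a word bigram pattern
    feats.append(f"suffix3={words[i][-3:]}")       # a lexicalized feature
    return feats

print(window_features(["the", "old", "man", "the", "boats"], 2))
```

Each emitted string names one binary dimension of the feature space; an example's vector simply has a 1 in every dimension whose pattern fires.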
Word features: w−3, w−2, w−1, w0, w+1, w+2, w+3
POS features: p−3, p−2, p−1, p0, p+1, p+2, p+3
Ambiguity classes: a0, a1, a2, a3
May_be’s: m0, m1, m2, m3
Word bigrams: (w−2, w−1), (w−1, w+1), (w−1, w0), (w0, w+1), (w+1, w+2)
POS bigrams: (p−2, p−1), (p−1, a+1), (a+1, a+2)
Word trigrams: (w−2, w−1, w0), (w−2, w−1, w+1), (w−1, w0, w+1), (w−1, w+1, w+2), (w0, w+1, w+2)
POS trigrams: (p−2, p−1, a0), (p−2, p−1, a+1), (p−1, a0, a+1), (p−1, a+1, a+2)
Sentence_info: punctuation (‘.’, ‘?’, ‘!’)
Prefixes: s1, s1s2, s1s2s3, s1s2s3s4
Suffixes: sn, sn−1sn, sn−2sn−1sn, sn−3sn−2sn−1sn
Binary word features: initial upper case, all upper case, no initial capital letter(s), all lower case, contains a period / number / hyphen
Word length: integer
Table 3.1: List of feature patterns considered.
3.4 THE SVMTOOL
The svmtool software package consists of three main components, namely the model
learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval).
Prior to tagging, SVM models (weight vectors and biases) are learned from a
training corpus using the SVMTlearn component. Different models are learned for the
different strategies. Then, at tagging time, using the SVMTagger component, one
may choose the tagging strategy that is most suitable for the purpose of the tagging.
Finally, given a correctly annotated corpus, and the corresponding svmtool predicted
annotation, the svmteval component displays tagging results.
3.4.1 SVMTLEARN
Given a training set of examples (either annotated or unannotated), it is responsible
for the training of a set of SVM classifiers. To do so, it makes use of SVMlight, an
implementation of Vapnik’s Support Vector Machines in C, developed by Thorsten
Joachims.
3.4.1.1 Training Data Format
Training data must be in column format, i.e. a token-per-line corpus in a sentence-by-
sentence fashion. The column separator is the blank space. The token is expected to
be the first column of the line, and the tag to predict the second column. The rest of
the line may contain additional information. See an example:
Pierre NNP B-PERSON
Vinken NNP I-PERSON
. . O
No special ‘<EOS>’ mark is employed for sentence separation. Sentence punctuation
is used instead, i.e. [.!?] symbols are taken as unambiguous sentence separators.
3.4.1.2 Options
Usage: SVMTlearn [options] <config-file>
The verbosity level can be selected (1: low verbose [default]).
These are the currently available config-file options:
Sliding window: The size of the sliding window for feature extraction can be
adjusted. Also, the core position in which the word to disambiguate is to be located
may be selected. By default, window size is 5 and the core position is 2, starting at 0.
Feature set: Three different kinds of feature types can be collected from the sliding
window:
– Word features: Word form n-grams. Usually unigrams, bigrams and trigrams
suffice. Also, the sentence last word, which corresponds to a punctuation mark (‘.’,
‘?’, ‘!’), is important.
– PoS features: Annotated parts–of–speech and ambiguity classes n-grams, and
“may be’s”. As for words, considering unigrams, bigrams and trigrams is enough.
The ambiguity class for a certain word determines which PoS are possible. A “may
be” states, for a certain word, that certain PoS may be possible, i.e. it belongs to the
word ambiguity class.
– Lexicalized features: information related to a word form, such as prefixes and
suffixes, capitalization, hyphenization, and word length.
Default feature sets for every model are defined.
Feature filtering: The feature space can be kept in a convenient size. Smaller models
allow for a higher efficiency. By default, no more than 100,000 dimensions are used.
Also, features appearing less than n times can be discarded, which indeed causes the
system both to fight against overfitting and to exhibit a higher accuracy. By default,
features appearing just once are ignored.
SVM model compression: Weight vector components lower than a given threshold
in the resulting SVM models can be filtered out, thus enhancing efficiency by
decreasing the model size while still preserving the accuracy level. That is an
interesting behavior of SVM models currently under study. In fact, when up to 70%
of the weight components are discarded, accuracy remains stable, and it is not until
95% of the components are discarded that accuracy falls below the current
state-of-the-art (97.0% – 97.2%).
C parameter tuning: In order to deal with noise and outliers in training data, the soft
margin version of the SVM learning algorithm allows the misclassification of certain
training examples when maximizing the margin. This balance can be automatically
adjusted by optimizing the value of the C parameter of the SVM. A local maximum is
found by exploring accuracy on a validation set for different C values at increasingly
shorter intervals.
Dictionary repairing: The lexicon extracted from the training corpus can be
automatically repaired, either based on frequency heuristics or on a list of corrections
supplied by the user. This makes the tagger robust to corpus errors. Also, a heuristic
threshold may be specified in order to consider as tagging errors those (word_x, tag_y)
pairs occurring less than a certain proportion of times with respect to the number of
occurrences of word_x. For example, a threshold of 0.001 would consider (run, DT)
an error if the word run had been seen at least 1000 times and only once tagged as
‘DT’. This kind of heuristic dictionary repairing does not harm tagger performance;
on the contrary, it may help a lot.
The repairing list must comply with the SVMTool dictionary format, i.e.:
<word> <N occurrences> <N possible tags> {<tag(i)> <N occurrences(i)>}, for i = 1 .. N possible tags
For example:
...
a 23673 4 DT 23647 FW 8 LS 2 SYM 11
an 3819 1 DT 3819
and 19762 1 CC 19762
can 1133 2 MD 1128 NN 5
did 743 1 VBD 743
do 1156 2 VB 402 VBP 754
does 601 1 VBZ 601
for 9890 2 IN 9884 RP 6
he 3181 1 PRP 3181
if 994 1 IN 994
in 18857 3 IN 18573 RB 65 RP 219
is 8499 1 VBZ 8499
it 5768 1 PRP 5768
...
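The repairing heuristic can be sketched over entries in the dictionary format shown above, using the 0.001 threshold from the text (the parsing helper is hypothetical):

```python
def repair_entry(line, threshold=0.001):
    """Drop (word, tag) pairs seen in less than `threshold` of the word's
    occurrences. Line format: <word> <N occ> <N tags> <tag1> <occ1> ..."""
    parts = line.split()
    word, total = parts[0], int(parts[1])
    pairs = [(parts[i], int(parts[i + 1])) for i in range(3, len(parts), 2)]
    kept = [(tag, n) for tag, n in pairs if n / total >= threshold]
    return word, kept

# FW, LS and SYM each account for less than 0.1% of the 23673 occurrences
# of "a", so they are treated as tagging errors and dropped.
word, kept = repair_entry("a 23673 4 DT 23647 FW 8 LS 2 SYM 11")
print(word, kept)  # a [('DT', 23647)]
```

A real implementation would also rewrite the occurrence and tag counts of the repaired entry; this sketch only shows the filtering decision.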
Ambiguous classes: The list of PoS presenting ambiguity is, by default,
automatically extracted from the corpus but, if available, this knowledge can be made
explicit. This acts in favor of the system robustness.
Open classes: The list of PoS tags an unknown word may be labeled as is also, by
default, automatically determined.
Backup lexicon: A morphological lexicon containing words that are not present in
the training corpus may be provided. It can be also provided at tagging time. This file
must comply with the svmtool dictionary format.
3.4.1.3 Configuration File
Several arguments are mandatory (as shown in table 3.2). The rest are optional (as
shown in table 3.4). Lists of features are defined in the SVMTool feature language
(svmtfl) as shown in table 3.5. Lines beginning with ‘#’ are ignored. The list of action
items for the learner must be declared (shown in table 3.3):
NAME => name of the model to create (a log of the experiment is generated onto the
file “NAME.EXP”)
SVMDIR => location of the Joachim’s SVMlight software
Table 3.2: SVMTlearn config-file mandatory arguments.
Syntax: do <MODEL> <DIRECTION> [<CK>] [<CU>] [<T>]
Where:
MODEL = [M0|M1|M2|M3|M4] model to learn
DIRECTION = [LR|RL|LRL] model direction
CK = [<begin>:<end>:<iterations>:<segments>:<log>|<nolog>] | <CK-value> C parameter for known words (optional)
CU C parameter for unknown words, same format as CK (optional)
T test options (optional)
Table 3.3: SVMTlearn config-file action items.
# -------------------------------------------------------------
NAME = WSJTP
TRAINSET = /home/me/SVMT/corpora/WSJTP/WSJTP.TRAIN
SVMDIR = /home/me/SVMT/soft/
#action items
SET location of the whole set
VALSET location of the validation set
TESTSET location of the test set
TRAINP proportion of sentences belonging to the provided whole SET which
will be used for training
VALP proportion of sentences belonging to the provided whole SET which
will be used for validation
TESTP proportion of sentences belonging to the provided whole SET which
will be used for test
REMOVE FILES remove intermediate files?
REMAKE FOLDERS remake cross-validation folders?
Kfilter Weight filtering for known word models
Ufilter Weight filtering for unknown word models
R dictionary repairing list (heuristically repaired by default)
D dictionary repairing heuristic threshold (0.001 by default)
BLEX backup lexicon
W window definition (size, core position)
F feature filtering (count cut off, max mapping size)
CK C parameter for known words -all models-(0 by default)
CU C parameter for unknown words -all models-(0 by default)
X percentage of unknown words expected (3 by default)
AP list of parts-of-speech presenting ambiguity (automatically created by
default)
Table 3.4: SVMTlearn config-file optional arguments.
WORD n-grams w(n1, ...ni...nm) (equivalent to C(0; n1, ...ni...nm))
TAG n-grams p(n1, ...ni...nm) (equivalent to C(1; n1, ...ni...nm))
AMBIGUITY CLASSES k (n)
MAYBE’s m (n) where ni is the relative position with respect to the element to
disambiguate
CHARACTER A (i) ca (i) where i is the relative position of the character with
respect to the beginning of the word
CHARACTER Z (i) cz (i) where i is the relative position of the character with
respect to the end of the word
PREFIXES a(i) = s1s2...si
SUFFIXES z(i) = sn−i...sn−1sn
sa does the word start with lower case?
SA does the word start with upper case?
CA does the word contain any capital letter?
CAA does the word contain several capital letters?
aa are all letters in the word in lower case?
AA are all letters in the word in upper case?
SN does the word start with a number?
CP does the word contain a period?
CN does the word contain a number?
CC does the word contain a comma?
MW does the word contain a hyphen?
L word length
An extended version of the previous config-file:
# -------------------------------------------------------------
# -------------------------------------------------------------
NAME = WSJTP
TRAINSET = /home/me/SVMT/corpora/WSJTP/WSJTP.TRAIN
SVMDIR = /home/me/SVMT/soft/
X = 10
#list of parts-of-speech presenting ambiguity
AP = ’’ CC CD DT EX FW IN JJ JJR JJS LS MD NN NNS NNP NNPS PDT
POS PRP PRP$ RB RBR RBS RP SYM UH VB VBD VBG VBN VBP VBZ
WDT WP WRB
#list of open-classes
UP = FW JJ JJR JJS NN NNS NNP NNPS RB RBR RBS VB VBD VBG VBN
VBP VBZ
#ambiguous-right [default]
w(-1,1) w(1,2) w(-2,-1,0) w(-2,-1,1) w(-1,0,1)
w(-1,1,2) w(0,1,2) p(-2) p(-1) p(-2,-1) p(-1,1) p(1,2)
p(-2,-1,1) p(-1,1,2) a(0) a(1) a(2) m(0) m(1) m(2)
A0u = w(-2) w(-1) w(0) w(1) w(2) w(-2,-1) w(-1,0) w(0,1)
w(-1,1) w(1,2) w(-2,-1,0) w(-2,-1,1) w(-1,0,1)
w(-1,1,2) w(0,1,2) p(-2) p(-1) p(-2,-1) p(-1,1) p(1,2)
p(-2,-1,1) p(-1,1,2) a(0) a(1) a(2) m(0) m(1) m(2)
a(2) a(3) a(4) z(2) z(3) z(4) ca(1) cz(1)
L SA AA SN CA CAA CP CC CN MW
In this case model NAME is ‘WSJTP’, so model files will begin with this prefix.
Only the training set is specified. A list of dictionary repairings is provided. A
window of 5 elements, being the core in the third position, is defined for feature
extraction. The expected proportion of unknown words is 10%. Intermediate files will
be removed. The list of parts-of-speech presenting ambiguity is supplied. Also, a list
of open-classes is provided.
This config-file is designed to learn Model 0 on the Wall Street Journal English
Corpus. That would allow for the use of tagging strategies 0 and 5, only left-to-right
though. Instead of using default feature sets, two feature sets are defined for Model 0
(for the two distinct problems of known word and unknown word tag guessing).
3.4.1.4 C Parameter Tuning
C parameter tuning is optional. Either we can specify no C parameter (C = 0 by
default), or we can specify a fixed value (e.g., CK: 0.1 CU: 0.01), or we can perform
an automatic tuning by greedy exploration. If we decide to do so then we must
provide a validation set and specify the interval to be explored and how to do it (i.e.
number of iterations and number of segments per iteration). Moreover, the first
iteration can take place in a logarithmic fashion or not. For example, CK:
0.01:10:3:10: log would try these values for C: 0.01, 0.1, 1, 10 at the first iteration.
For the next iteration, the algorithm explores on both sides of the point where the
maximal accuracy was obtained, half to the next and previous points. For example,
suppose the maximal accuracy was obtained for C = 1, then it would explore the
range form 0.1 / 2 = 0.05 to 0.1 / 2 = 0.5. The segmentation ratio would be 0.045 so
the algorithm would go for values 0.05, 0.095, 0.14, 0.185, 0.23, 0.275, 0.32, 0.365,
0.41, 0.455, and 0.5. And so on for the following iteration.
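The greedy exploration can be sketched as follows. The refinement rule used here (halfway toward the neighbouring candidates) and the accuracy function, which stands in for evaluating the tagger on a validation set, are simplifying assumptions:

```python
def tune_c(accuracy, candidates, iterations=3, segments=10):
    """Greedy C search: evaluate the candidates, then refine the grid
    around the best one for a fixed number of iterations."""
    for _ in range(iterations):
        best = max(candidates, key=accuracy)
        i = candidates.index(best)
        # Explore on both sides of the best point, halfway to its neighbours.
        lo = (candidates[i - 1] + best) / 2 if i > 0 else best / 2
        hi = (best + candidates[i + 1]) / 2 if i < len(candidates) - 1 else best * 2
        step = (hi - lo) / segments
        candidates = [lo + k * step for k in range(segments + 1)]
    return max(candidates, key=accuracy)

# Hypothetical validation accuracy, peaking near C = 0.3.
acc = lambda c: -(c - 0.3) ** 2
best_c = tune_c(acc, [0.01, 0.1, 1, 10])
print(round(best_c, 3))
```

Each iteration narrows the explored interval, so the search converges on a local maximum of validation accuracy rather than scanning the whole range densely.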
3.4.1.5 Test
After training a model it can be evaluated against a test set. To indicate so the T
option must be activated in the corresponding do-action, e.g., “do M0 LR CK:
0.01:10:3:10: log CU: 0.07 T”. By default it expects a test set definition in the config
file. But training/test can also be performed through a cross-validation. The number
of folders must be provided, e.g., “do M0 LR CK: 0.01:10:3:10: log CU: 0.07 T: 10”.
10 is a good number.
Furthermore, if training/test goes in cross-validation then the C Parameter tuning goes
too, even if a validation set has been provided.
3.4.1.6 Models
Five different kinds of models have been implemented in this tool. Models 0, 1, and 2
differ only in the features they consider. Model 3 and Model 4 are just like Model 0
with respect to feature extraction, but examples are selected in a different manner.
Model 3 is for unsupervised learning so, given an unlabeled corpus and a dictionary,
at learning time it can only count on knowing the ambiguity class, and the PoS
information only for unambiguous words. Model 4 achieves robustness by simulating
unknown words in the learning context at training time.
Model 0: This is the default model. The unseen context remains ambiguous. It was
designed with the one-pass on-line tagging scheme in mind, i.e., the tagger goes
either left-to-right or right-to-left making decisions, so past decisions feed future
ones in the form of PoS features. At tagging time, only the parts of speech of already
disambiguated tokens are considered; for the unseen context, ambiguity classes are
considered instead. Features are shown in Table 3.6.
Model 1: This model assumes the unseen context was already disambiguated in a
previous step, so it is intended for a second pass, revisiting and correcting already
tagged text. Features are shown in Table 3.7.
Model 2: This model does not consider PoS features at all for the unseen context. It
is designed to work at a first pass, requiring Model 1 to review the tagging results at
a second pass. Features are shown in Table 3.8.
Model 3: Training is based on the role of unambiguous words. Linear classifiers are
trained with examples of unambiguous words extracted from an unannotated corpus,
so less PoS information is available. The only additional resource required is a
morpho-syntactic dictionary.
Model 4: Errors caused by unknown words at tagging time severely punish the
system. To reduce this problem, during learning some words are artificially marked
as unknown, so that a more realistic model is learned. The process is very simple:
the corpus is divided into a number of folds, and before examples are extracted from
each fold, a dictionary is generated from the remaining folds. Words appearing in a
fold but not in the others are thus unknown words to the learner.
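The fold-based unknown-word simulation described for Model 4 can be sketched as follows (a simplified illustration, not SVMTool's actual Perl code; the round-robin split is an assumption):

```python
# Split the corpus into folds; for each fold, build a dictionary from the
# other folds, so words unique to the current fold are treated as unknown
# when extracting training examples from it.

def simulate_unknown_words(tokens, n_folds=10):
    folds = [tokens[i::n_folds] for i in range(n_folds)]
    for i, fold in enumerate(folds):
        seen_elsewhere = set()
        for j, other in enumerate(folds):
            if j != i:
                seen_elsewhere.update(other)
        for word in fold:
            # True marks a simulated unknown word for the learner
            yield word, word not in seen_elsewhere
```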
Table 3.6 Features for Model 0
POS features: p-2, p-1
POS bigrams: (p-2, p-1), (p-1, a+1), (a+1, a+2)
POS trigrams: (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')
Ambiguity classes: a0, a+1, a+2
May_be's: m0, m+1, m+2
Table 3.7 Features for Model 1
POS features: p-2, p-1, p+1, p+2
POS bigrams: (p-2, p-1), (p-1, p+1), (p+1, p+2)
POS trigrams: (p-2, p-1, a0), (p-2, p-1, p+1), (p-1, a0, p+1), (p-1, p+1, p+2)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')
Table 3.8 Features for Model 2
POS trigrams: (p-2, p-1, a0)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')
3.4.2 SVMTAGGER
Given a text corpus (one token per line) and the path to a previously learned SVM
model (including the automatically generated dictionary), SVMTagger performs the
PoS tagging of a sequence of words. Tagging proceeds on-line, based on a sliding
window which gives a view of the feature context to be considered at every decision.
In any case, there are two important concepts we must consider:
(1) Example generation
(2) Feature extraction
(1) Example generation: This step defines what an example is, according to the
concept to be learned. For instance, in PoS tagging the machine should correctly
classify words according to their PoS. Thus, every PoS is a class, and, typically,
every occurrence of a word generates a positive example for its class and a negative
example for each of the other classes. Therefore, every sentence may generate a
large number of examples.
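The one-vs-rest example generation just described can be sketched as follows (a minimal illustration; the tuple representation is an assumption, not SVMTool's actual data format):

```python
# Each tagged occurrence yields one positive example for its own tag and
# a negative example for every competing tag in the tagset.

def generate_examples(tagged_tokens, tagset):
    examples = []
    for word, gold_tag in tagged_tokens:
        for tag in tagset:
            label = +1 if tag == gold_tag else -1
            examples.append((word, tag, label))
    return examples
```

With a tagset of several dozen tags, every tagged token yields that many examples, which is why even a modest corpus produces a large training set.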
(2) Feature extraction: The set of features to be used by the algorithm has to be
defined. For instance, PoS tags could be guessed according to the preceding and
following words. Every example is then represented as a set of active features; these
representations are the input to the SVM classifiers. To inspect how SVMTool
learning works internally, run SVMTlearn (Perl version) with the REMOVE_FILES
option set to 0 in the config file; the intermediate files are then kept and can be
inspected.
Feature extraction is performed by the sliding-window object and proceeds as
follows. A sliding window works on a very local context (as defined in the CONFIG
file), usually a 5-word context [-2, -1, 0, +1, +2], with the current word under
analysis at the core position. Taking this context into account, a number of features
may be extracted. The feature set depends on how the tagger will proceed later (i.e.,
on the context and information that will be available at tagging time). Commonly,
all words are known before tagging, but PoS is only available for some words (those
already tagged).
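A minimal sketch of the sliding-window view (the feature-string format and the `<pad>` marker are illustrative assumptions):

```python
# Extract word features from the usual 5-token context [-2, -1, 0, +1, +2];
# positions falling outside the sentence are padded.

def window_features(tokens, i, offsets=(-2, -1, 0, 1, 2)):
    feats = []
    for off in offsets:
        j = i + off
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats.append("w%+d:%s" % (off, word))
    return feats

# window_features(["cat", "chased", "rat"], 1)
# -> ['w-2:<pad>', 'w-1:cat', 'w+0:chased', 'w+1:rat', 'w+2:<pad>']
```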
At the tagging stage, if the input word is known and ambiguous, the word is tagged
(i.e., classified), and the predicted tag feeds forward into subsequent decisions. This
is done in the subroutine classify_sample_merged() in the file SVMTAGGER.
In order to speed up SVM classification, the feature mapping and the SVM weights
and biases are merged into a single file. Therefore, when a new example is to be
tagged, we just access the merged model and, for every active feature, retrieve the
associated weight. Then, for every possible tag, the bias is retrieved as well. Finally,
we apply the SVM classification rule (i.e., scalar product plus bias).
AN EXAMPLE
Suppose the sentence "cat chased rat" is to be tagged and the word under analysis is
"chased", with active features w-1 and w+1, i.e., the PoS tag is predicted based only
on the preceding and following words.
The active features are then "w-1: cat" and "w+1: rat".
Therefore, when applying the SVM classification rule for a given PoS, we go to the
merged model and retrieve the weights for these features, together with the bias
corresponding to the given PoS (first line after the header, beginning with
"BIASES").
BIASES $: 0.7331143 '': -0.0890327 ,: 1.0053873 CC: 0.85252864 CD: 0.63454985
DT: -0.33000354 EX: 0.0015703252 FW: 0.39300653 IN: -0.10980182 JJ: 0.66161774
JJR: 0.18644595 JJS: 0.13703539 LS: 1.4687678 MD: 0.13016217 NN: 0.78446686
NNP: 0.48506291 NNPS: 0.60520796 NNS: -0.072942298 PDT: 1.7344161 POS:
0.54289449 PRP: -0.43581492 PRP$: -0.47878762 RB: 0.42125766 RBR: 1.0719682
W-1: cat NNS: -0.0580028488691131 RB: -0.11975 SYM: 0.0785734202231166
VBD: -0.11975 VBN: 0.11975 VBZ: 0.0353085505337163
W+1: rat CC: -0.00190667685545396 CD: 0.0623807416364278 DT:
0.0407755247730493 JJ: 0.148842327779351 NN: -0.0739079222302325 NNP:
0.04153802630344 NNS: -0.075347569791277 PRP: -0.0403647815284746 RB:
-0.00791220904616591 VBD: -0.04987337944059 VBG: 0.2395
The SVM score for "chased" being a VBD is:
weight("w-1: cat", VBD) + weight("w+1: rat", VBD) - bias(VBD) = -0.11975 +
(-0.04987337944059) - 0.28732588 = -0.4569492594
weight("w-1: cat", NN) + weight("w+1: rat", NN) - bias(NN) = 0.025704 +
(-0.073907) - 0.784466 = -0.832669
Since the SVM score for VBD is higher than that for NN, the tag VBD is assigned
to the word "chased".
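The computation above can be reproduced with a small sketch of the merged-model classification rule (the dictionary layout is an assumption; the NN weight 0.025704 for "w-1: cat" is taken from the worked arithmetic, as it does not appear in the excerpt):

```python
# Merged-model SVM classification rule: for each candidate tag, sum the
# weights of the active features, subtract the tag's bias, and pick the
# highest-scoring tag.

def svm_classify(active_features, weights, biases):
    scores = {}
    for tag in biases:
        dot = sum(weights.get((feat, tag), 0.0) for feat in active_features)
        scores[tag] = dot - biases[tag]
    return max(scores, key=scores.get), scores

weights = {("w-1: cat", "VBD"): -0.11975,
           ("w+1: rat", "VBD"): -0.04987337944059,
           ("w-1: cat", "NN"): 0.025704,   # assumed from the worked example
           ("w+1: rat", "NN"): -0.073907}
biases = {"VBD": 0.28732588, "NN": 0.78446686}
tag, scores = svm_classify(["w-1: cat", "w+1: rat"], weights, biases)
# tag is "VBD", matching the worked example above
```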
Predicted part-of-speech tags feed directly forward into subsequent tagging decisions
as context features. The SVMTagger component works on standard input/output. It
processes a token-per-line corpus in a sentence-by-sentence fashion. The token is
expected to be the first column of the line; the predicted tag will take the second
column in the output, and the rest of the line remains unchanged. Lines beginning
with ‘## ’ are ignored by the tagger. See an example of input to the SVMTagger:
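For instance, a minimal input (hypothetical tokens; one token per line, the comment line ignored) could look like:

```
## sentence 1
cat
chased
rat
```

In the output, each token would carry its predicted tag in the second column, e.g., "chased VBD".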
3.4.2.1 Options
SVMTagger is very flexible and adapts well to the needs of the user. The following
options are currently available:
Tagging scheme: Two different tagging schemes may be used.
– Greedy: Each tagging decision is made based on a reduced context. Decisions are
not reconsidered later, except in the case of tagging in two steps or tagging in two
directions.
– Sentence-level: By means of dynamic programming techniques (Viterbi
algorithm), the global sentence sum of SVM tagging scores is the function to
maximize as shown in equation 3.6.
Given a sentence S = w1 ... wn as a word sequence, let T(S) = { (t1, ..., t|S|) :
ti ∈ ambiguity_class(wi) } be the set of all possible sequences of PoS tags associated
to S. The tagger selects

T* = argmax_{T ∈ T(S)} Σ_{i=1..|S|} score(ti)    (3.6)
A softmax function is used by default so as to transform this sum of scores into a
product of probabilities. Because sentence-level tagging is expensive, two pruning
methods are provided. First, the maximum number of beams may be defined.
Alternatively, a threshold may be specified so that solutions scoring under a certain
value (with respect to the best solution at that point) are discarded. Both pruning
techniques have proved effective and efficient in our experiments.
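The objective of equation 3.6 can be illustrated with a brute-force sketch (a real implementation uses Viterbi dynamic programming with the pruning just described; `score` is a hypothetical per-position SVM score):

```python
from itertools import product

# Enumerate every tag sequence allowed by the per-word ambiguity classes
# and keep the one maximizing the sum of per-position scores.

def best_tag_sequence(ambiguity_classes, score):
    best, best_sum = None, float("-inf")
    for seq in product(*ambiguity_classes):
        total = sum(score(i, tag) for i, tag in enumerate(seq))
        if total > best_sum:
            best, best_sum = seq, total
    return best, best_sum
```

The exhaustive enumeration makes the cost of sentence-level tagging obvious: the number of sequences grows multiplicatively with each ambiguous word, which is exactly why beam-count and score-threshold pruning are provided.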
Tagging direction: The tagging direction can be either left-to-right, right-to-left, or
a combination of both. The tagging direction affects the results, and combining both
directions yields a significant improvement. For every token, each direction assigns
a tag with a certain score, and the highest-scoring tag is selected.
This makes the tagger very robust. In the case of sentence level tagging there is an
additional way to combine left-to-right and right-to-left. “GLRL” direction makes a
global decision, i.e. considering the sentence as a whole. For every sentence, every
direction assigns a sequence of tags with an associated score that corresponds to the
sum of scores (or the product of probabilities when using a softmax function, the
default option). The highest scoring sequence of tags is selected.
One pass / two passes: Another way of achieving robustness is by tagging in two
passes. At the first pass only PoS features related to already disambiguated words are
considered. At a second pass disambiguated PoS features are available for every word
in the feature context, so when revisiting a word tagging errors may be alleviated.
SVM Model Compression: Just as for the learning, weight vector components lower
than a certain threshold can be ignored.
All scores: Sometimes not only the winning tag is relevant, but also its score and the
scores of all competing tags, as a measure of confidence; this information can be
output as well.
Backup lexicon: Again, a morphological lexicon containing new words that were not
available in the training corpus may be provided.
Lemmae lexicon: Given a lemmae lexicon containing <word form, tag, lemma>
entries, output may be lemmatized. A lexicon is shown below:
playing JJ playing
playing NN playing
playing VBG play
<EOS> tag: The ‘<s>’ tag may be employed for sentence separation. Otherwise,
sentence punctuation is used instead, i.e. [.!?] symbols are taken as unambiguous
sentence separators.
- T <strategy> tagging strategy (default is 0):
0: one-pass (default; requires Model 0)
1: two-passes (revisiting results and relabelling; requires Model 2 and Model 1)
2: one-pass (robust against unknown words; requires Model 0 and Model 2)
4: one-pass (very robust against unknown words; requires Model 4)
5: one-pass (sentence-level likelihood; requires Model 0)
6: one-pass (robust sentence-level likelihood; requires Model 4)
- S <direction> tagging direction (global assignment, only applicable under a
sentence-level tagging strategy)
- K <n> weight filtering threshold for known words (default is 0)
- U <n> weight filtering threshold for unknown words (default is 0)
- Z <n> number of beams in beam search, only applicable under sentence-level
strategies (default is disabled)
- R <n> dynamic beam-search ratio, only applicable under sentence-level strategies
(default is disabled)
- F <n> softmax function to transform SVM scores into probabilities (default is 1)
- B <backup_lexicon>
- L <lemmae_lexicon>
- EOS enable usage of the end-of-sentence ‘<s>’ string (disabled by default; [!.?]
used instead)
- V <n> verbose level:
1: low verbose
2: medium verbose
3: high verbose
- <model> model name (as declared in the config file under NAME)
Example: SVMTagger -V 2 -S LRL -T 0 WSJTP < wsj.test > wsj.test.out
3.4.2.2 Strategies
Different tagging strategies have been implemented so far:
Strategy 0: It is the default one. It makes use of Model 0 in a greedy on-line fashion,
one-pass.
Strategy 1: As a first attempt to achieve robustness against error propagation, it
works in two passes, in an on-line greedy way, using Model 2 in the first pass and
Model 1 in the second. In other words, in the first pass the unseen morpho-syntactic
context remains ambiguous, while in the second pass the tag predicted in the first pass
is available also for unse