IEEE Proceedings of 4th International Conference on Intelligent Human Computer Interaction, Kharagpur, India, December 27-29, 2012
Identification of Nominal Multiword Expressions in Bengali Using CRF
Tanmoy Chakraborty
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
India
Email: its [email protected]
Abstract—One of the key issues in both natural language understanding and generation is the appropriate processing of Multiword Expressions (MWEs). MWEs pose a serious problem for precise language processing due to their idiosyncratic nature and their diversity in lexical, syntactic and semantic properties. The semantics of an MWE can be expressed transparently or opaquely by combining the semantics of its constituents. This paper deals with the identification of Nominal Multiword Expressions in Bengali text using the Conditional Random Field (CRF) machine learning technique. Bengali is a highly agglutinative and morphologically rich language; thus the selection of features such as surrounding words, POS tags, prefix, suffix and length proves very effective for running the CRF tool to identify Nominal MWEs. Compared to the statistical system built for compound noun MWE identification in Bengali, our proposed system shows higher accuracy in terms of precision, recall and F-score. We also conclude that identifying Reduplicated MWEs (RMWEs) and using them as a feature yields a reasonable improvement over the earlier system.
Index Terms—Multiword Expressions, Bengali, CRF, Reduplications
I. INTRODUCTION
Over the past two decades or so, Multiword Expressions
(MWEs) have been identified with an increasing amount of
interest in the field of Computational Linguistics and Natural
Language processing. The term MWE is used to refer to the
various types of linguistic units and expressions including
idioms (kick the bucket, “to die”), noun compounds (village community), phrasal verbs (find out, “search”), other habitual collocations like conjunctions (as well as), institutionalized phrases (many thanks), etc. However, while there is no universally agreed definition of MWE as yet, most researchers use the term to refer to those frequently occurring phrasal units which are subject to a certain level of semantic opaqueness, or non-compositionality. Sag et al. (2002) [1] defined them as “idiosyncratic interpretations that cross word boundaries (or spaces)”.
The identification of MWEs in several languages has started with a concentration on compound nouns, noun-verb combinations and, in some cases, idioms and phrases, but not much work exists on combined MWEs. The reason may be that the combined identification of MWEs is difficult in any language. MWEs are treated as a special issue of semantics, where the individual components of an expression often fail to keep their meanings intact within the actual meaning of that expression. This opaqueness in meaning may be partial or total depending on the degree of compositionality of the whole expression [2].
MWEs span a continuum from complete compositionality (aka “institutionalized phrases”, e.g., many thanks, which decompose into simplex senses and generally display high syntactic variability) to partial compositionality (e.g., light house, where partial meaning is identified from the components), then to idiosyncratic compositionality (e.g., spill the beans, “to reveal”, which are decomposable but coerce their parts into taking semantics unavailable outside the MWE and undergo a certain degree of syntactic variation), and finally complete non-compositionality (e.g., hot dog, where no decomposition analysis is possible and the MWE is semantically impenetrable). A number of research activities regarding MWEs have been carried out in various languages like English, German and many other European languages. Various statistical co-occurrence measures like Mutual Information [3], Log-Likelihood [4] and Salience [5] have been suggested for the identification of MWEs.
For Indian languages, considerable work has been done on compound noun MWE extraction [6], complex predicate extraction [7], a clustering based approach [8] and a classification based approach for Noun-Verb collocations [9]. In Bengali, works on automated extraction of MWEs are limited in number. One method for automatic extraction of Noun-Verb MWEs in Bengali [10] uses morphological evidence and a significance function; the authors classified Bengali MWEs based on their morpho-syntactic flexibilities and proposed a statistical approach for extracting verbal compounds from a medium-sized corpus. Chakraborty and Bandyopadhyay (2010) [11] attempted to extract noun-noun bigram MWEs from a Bengali corpus using a statistical approach.
In this experiment, we have tried to build up a standard lexicon of Bengali Nominal MWEs, which can help to develop proper training samples for a machine learning approach as well as a gold standard for evaluating our system. For the first time in the Bengali language, we introduce CRF to tag MWEs using the information of morphological and phraseological markers and the dependencies between a candidate phrase and its contextual tokens. Besides this, we incorporate the information of reduplicated MWEs into the feature set and draw the conclusion that it improves the performance of the CRF model significantly.
978-1-4673-4369-5/12/$31.00 © 2012 IEEE
Finally, we add a post-processing step based on the heuristic that the constituents of an MWE always belong to a single chunk. Though this reasonably improves the precision value, the recall value drops because of the inefficiency of the Bengali shallow parser in tagging the chunks of raw text.
Section II describes the classification of Nominal MWEs in Bengali, Section III gives a very brief idea of the Conditional Random Field model, Section IV gives a detailed description of the experimental methodology, Section V illustrates the evaluation, Section VI shows the improvement obtained using RMWEs, and the conclusion is drawn in Section VII.
II. NOMINAL MULTIWORD EXPRESSIONS IN BENGALI
A compound noun or nominal compound consists of more than one free morpheme; when it acts as an MWE, the components sometimes lose their individual literal meanings, and the whole looks like a single semantic unit. Compound noun MWEs can occur in open, closed or hyphenated forms and satisfy semantic non-compositionality, statistical co-occurrence or literal phenomena [6], etc. Agarwal et al. (2004) [10] classified Bengali MWEs into three main classes consisting of twelve different fine-grained subclasses. However, we have classified Bengali Nominal MWEs into eight different subclasses based on their morpho-syntactic flexibilities. The classes are as follows:
Named-Entities (NE): Names of people (Rabindranath Thakur, “Rabindranath Tagore”), locations (Bharat-barsa, “India”), organizations (Paschim Banga Siksha Samsad, “West Bengal Board of Education”) etc., where inflection can be added to the last word only.
Idiomatic Compound Nouns: These are unproductive and idiomatic in nature, and inflection can be added only to the last word. This type is formed by a hidden conjunction between the components or by the loss of inflection from the first component (maa-baba, “mother and father”).
Idioms: They are also compound nouns with idiosyncratic meaning, but the first noun is generally in possessive form (taser ghar, “fragile”). Sometimes the individual components may not carry any significant meaning and may not be part of the dictionary (gadai laskari chal, “indolent habit”). For them, no inflection is allowed, not even on the last word.
Numbers: They are highly productive, impenetrable and
allow slight syntactic variations like inflections. Inflections can
be added only to the last component (soya sat ghanta, “seven
hours and fifteen minutes”).
Relational Noun Compounds: They are mainly kin terms and bigram in nature. Inflection can be added to the last word (pistuto bhai, “maternal cousin”).
Conventionalized Phrases: Sometimes they are called “institutionalized phrases”. They are not idiomatic; a particular word combination simply comes to be used to refer to a given object. They are productive yet of unexpectedly low frequency, and in doing so they contrastively highlight the statistical idiomaticity of the target expression (bibaha barshiki, “marriage anniversary”).
Simile Terms: They are analogy terms in Bengali and sometimes similar to idioms, except for the fact that they are semi-productive (hater panch, “remaining resource”).
Reduplicated Terms: Reduplications are non-productive and tagged as noun phrases. They are further classified as onomatopoeic expressions (khat khat, “knocking”), complete reduplication (bara-bara, “big big”), partial reduplication (thakur-thukur, “God”), semantic reduplication (matha-mundu, “head”) and correlative reduplication (maramari, “fighting”) [12].
Identification of reduplication has already been carried out using clues from Bengali morphological patterns [12]. A number of research activities in Bengali Named Entity (NE) detection have been carried out [13], but the lack of a standard tool to detect NEs prevents incorporating one within the existing system. In this experiment, we mainly focus on the extraction of the above-mentioned Nominal MWEs in Bengali.
III. CONDITIONAL RANDOM FIELD (CRF)
Conditional Random Field (CRF) is a probabilistic model for segmenting and labeling sequence data [14]. A CRF is an undirected graphical model that encodes a conditional probability distribution with a given set of features. For a given observation sequence X = (x_1, x_2, ..., x_n) and its corresponding label sequence Y = (y_1, y_2, ..., y_n), a linear-chain CRF defines the conditional probability as follows:
P(Y|X) = (1/Z_X) exp( Σ_i Σ_j λ_j f_j(y_{i-1}, y_i, X, i) )   (1)
where Z_X is a normalization factor that makes the probabilities of all state sequences sum to 1, f_j is a feature function, and λ_j is a learned weight associated with f_j. A maximum entropy learning algorithm can be used to train the CRF. For a given observation sequence, the most probable label sequence is determined by
Y* = argmax_Y P(Y|X)   (2)
where Y* can be determined efficiently using the Viterbi algorithm. An N-best list of label sequences can also be obtained using a modified Viterbi algorithm and A* search. The main advantage of CRFs comes from the fact that they relax the assumption of conditional independence of the observed data often made in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.
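As a concrete illustration of Eq. (1), the following sketch computes P(Y|X) for a toy linear-chain CRF by brute-force enumeration of the partition function Z_X (real toolkits such as CRF++ use dynamic programming instead; the labels, tokens, feature functions and weights here are invented purely for illustration):

```python
import itertools
import math

def crf_score(y, x, weights, feature_fns):
    """Unnormalized log-score of label sequence y for observation x (Eq. 1)."""
    s = 0.0
    for i in range(1, len(y)):
        for lam, f in zip(weights, feature_fns):
            s += lam * f(y[i - 1], y[i], x, i)
    return s

def crf_probability(y, x, labels, weights, feature_fns):
    """P(y | x): the score of y divided by the partition function Z_X,
    computed here by brute-force enumeration of all label sequences."""
    num = math.exp(crf_score(y, x, weights, feature_fns))
    z = sum(math.exp(crf_score(yy, x, weights, feature_fns))
            for yy in itertools.product(labels, repeat=len(x)))
    return num / z

# Toy task: tag each token as part of an MWE ("M") or not ("O").
labels = ["O", "M"]
x = ["hater", "panch"]
feature_fns = [
    lambda yp, yc, x, i: 1.0 if yc == "M" else 0.0,        # bias toward "M"
    lambda yp, yc, x, i: 1.0 if yp == yc == "M" else 0.0,  # adjacent "M" pair
]
weights = [0.5, 1.0]
p = crf_probability(("M", "M"), x, labels, weights, feature_fns)
```

Because Z_X normalizes over every possible label sequence, the probabilities of all 2^n sequences sum to 1, which is what makes Eq. (2) a well-defined argmax.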
IV. EXPERIMENTAL METHODOLOGY
The system architecture of the proposed model is shown
in Figure 1. The process begins with the preprocessing of the crawled corpus, which is very scattered and unformatted. The cleaned corpus is then fed into the CRF model for the training and testing phases; before that, we annotate the MWEs in the cleaned corpus. The CRF labels candidate phrases as MWEs or not using the statistics learned from the training dataset. Finally, we use a post-processing step to filter some false positive terms from the output of the CRF model. We report results both before and after the post-processing step in the evaluation phase.
Fig. 1. Proposed system architecture
A. Corpus Acquisition and Candidate Extraction
Resource acquisition is one of the challenging obstacles to working with electronically resource-constrained languages like Bengali. However, our system uses a large number of Bengali articles written by the noted Indian Nobel laureate Rabindranath Tagore1, along with 150 articles each from Sarat Chandra Chottopadhyay and a group of other Bengali authors2. The statistics of the entire dataset are tabulated in Table I. As the order of the documents within the sequence is not of major importance, we merged all the articles into a raw corpus.
The actual motivation for choosing the literature domain for the present task was to develop useful statistics for further work on stylometry analysis. Moreover, in literature the use of MWEs is large compared to other domains like tourism or scientific documents, because the semantic versatility of MWEs often helps the writer express his viewpoint appropriately. Especially in Bengali literature, idiomatic expressions and relational terms are used quite frequently by writers. Our crawled corpus was so scattered and unformatted that we applied basic semi-automatic pre-processing techniques to make it suitable for parsing. Parsing with the Bengali shallow parser3 has been done to identify the POS, chunk, root, inflection and other morphological information of each token. Some tokens are misspelled due to typographic or phonetic errors; thus the shallow parser is unable to detect their actual root and inflection properly. The shallow parser is also somewhat confused by some of the nominal tags like common noun (NN) and proper noun (NNP) because of the continuous need for coinage of new terms to describe new concepts. For identifying all Nominal MWEs present in the documents, we have taken both of these tags.
1 http://www.rabindra-rachanabali.nltr.org
2 http://banglalibrary.evergreenbangla.com/
3 http://ltrc.iiit.ac.in/analyzer/bengali
TABLE I
STATISTICS OF THE USED DATASET

Authors                     | # documents | # tokens  | # unique tokens
Rabindranath Tagore         | 150         | 6,862,580 | 4,978,672
Sarat Chandra Chottopadhyay | 150         | 4,083,417 | 2,987,450
Others                      | 150         | 3,818,216 | 2,657,813
B. Annotation Agreement
Three annotators, identified as A1, A2 and A3 (linguistic experts working with our project), were engaged to carry out the annotation. They were asked to divide all extracted phrases into three classes; the definition of each class was also provided with examples:
Class 1: Valid Nominal MWEs (M): phrases which show total non-compositionality and whose meanings are hard to predict from their constituents (e.g., hater panch, “remaining resource”).
Class 2: Valid N-N semantic collocations but not MWEs (S): phrases which are partially or totally compositional, sometimes act as institutionalized phrases and show statistical idiomaticity (e.g., bibaha barsiki, “marriage anniversary”).
Class 3: Invalid candidates (E): phrases enlisted due to errors in parsing, such as POS, chunk or inflection errors (e.g., granthagar tayri, “build library”).
The candidates in Class 3 are filtered out initially; their share amounts to 53.90%. The remaining 46.10% (5628 phrases) of the total candidates are annotated, labeled as “M” (MWEs) or “S” (semantically collocated phrases), and fed into the evaluation phase.
The annotation agreement is measured using the standard Cohen's kappa coefficient (κ) [15]. It is a statistical measure of inter-rater agreement for qualitative (categorical) items: it measures the agreement between two raters who separately classify items into mutually exclusive categories. MWEs are words or strings of words selected by the annotators, and agreement is computed between the sets of text spans selected by the two annotators for each of the expressions. In addition to kappa (κ), we employed another strategy to calculate the agreement between annotators: the measure of agreement on set-valued items (MASI) [16], which is used for measuring agreement in semantic and pragmatic annotations. MASI is a distance between sets whose value is 1 for identical sets and 0 for disjoint sets. For sets A and B, it is defined as MASI = J * M, where the Jaccard metric (J) is:
J = |A ∩ B| / |A ∪ B|   (3)
Monotonicity (M) is defined as follows:
M = 1,   if A = B
M = 2/3, if A ⊂ B or B ⊂ A
M = 1/3, if A ∩ B ≠ φ, A − B ≠ φ and B − A ≠ φ
M = 0,   if A ∩ B = φ   (4)
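Eqs. (3) and (4) can be combined into a minimal MASI implementation (the annotation sets in the example are hypothetical):

```python
def masi(a, b):
    """MASI agreement between two annotation sets (Passonneau, 2006):
    Jaccard similarity (Eq. 3) weighted by a monotonicity factor (Eq. 4)."""
    a, b = set(a), set(b)
    if not a and not b:              # two empty sets are identical
        return 1.0
    j = len(a & b) / len(a | b)      # Jaccard metric
    if a == b:
        m = 1.0
    elif a <= b or b <= a:           # one set strictly contains the other
        m = 2 / 3
    elif a & b:                      # overlap, but neither contains the other
        m = 1 / 3
    else:                            # disjoint sets
        m = 0.0
    return j * m

# Hypothetical MWE spans selected by two annotators for the same sentence:
a1 = {"hater panch", "maa-baba", "taser ghar"}
a2 = {"hater panch", "maa-baba"}
score = masi(a1, a2)   # a2 is a subset of a1: J = 2/3, M = 2/3
```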
Table II illustrates the agreement statistics under the two measures. Among the full-agreement MWEs, 50% are used for training, 25% for the development dataset, and the rest of the candidates are taken for the testing phase.
TABLE IIINTER-ANNOTATION AGREEMENT
MWEs Pair-wise agreement (%) between annotators(#5628) A1-A2 A1-A3 A2-A3 Avg.KAPPA 87.23 86.14 88.78 87.38MESI 87.17 87.02 89.02 87.73
C. MWE Extraction Using CRF
The process of MWE extraction using CRF requires feature selection; preprocessing, which includes the arrangement of tokens or words into sentences with other notations; creation of a model file after training; and testing on another corpus. For the current work, the C++-based CRF++ 0.53 package4, which is readily available as open source for segmenting or labeling sequential data, is used. The following subsections explain the overall process in detail:
1) Feature Selection: Feature selection is important in CRF. The various features used in the system are:

F = {W_{i-m}, ..., W_{i-1}, W_i, W_{i+1}, ..., W_{i+n}, |prefix| <= n, |suffix| <= n, surrounding POS tags, word length, word frequency, acceptable prefix, acceptable suffix}
Surrounding words as feature: Preceding and following
words of a particular word can be used as features since the
preceding and following words influence the present word.
Word suffixes and prefixes as features: Suffixes and prefixes play an important role in Bengali POS tagging. A maximum of n characters of every word is considered for the suffix and prefix; for words of length less than n, NIL is substituted in the respective field for the corresponding suffix or prefix. These prefix or suffix characters are considered regardless of whether they are meaningful.
Surrounding POS tag: MWEs can be a combination of
noun-noun, verb-noun, adjective-noun POS patterns, so the
POS of the surrounding words are considered.
Length of the word: The length feature is set to 1 if the word is longer than 3 characters; otherwise it is set to 0. Very short words are rarely proper nouns.
Word frequency: A range of frequencies is set: words with frequency < 100 occurrences get the value 0, words occurring >= 100 but fewer than 400 times get 1, and so on. Word frequency is used as a feature since MWEs are rare in occurrence.
Acceptable prefixes: Eight prefixes have been manually identified in Bengali, and this list of prefixes is used as one feature. A binary notation is used: ‘1’ is set if the word contains one of the acceptable prefixes, otherwise ‘0’.
4 http://crfpp.sourceforge.net
Acceptable suffixes: Twenty suffixes have been manually identified in Bengali, and this list of suffixes is used as one feature. A binary notation is used: ‘1’ is set if the word contains one of the acceptable suffixes, otherwise ‘0’.
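The per-token feature set described above can be sketched as a single extraction function. This is an illustrative reconstruction, not the exact CRF++ template: the affix lists, frequency table, bin thresholds and window size are parameters the reader would supply.

```python
def token_features(tokens, pos_tags, i, freqs, prefixes, suffixes, n=4, window=3):
    """Feature dictionary for token i, mirroring the paper's feature set:
    surrounding words and POS tags, prefix/suffix of up to n characters,
    a length bit, a binned frequency, and acceptable-affix flags."""
    w = tokens[i]
    feats = {}
    # Surrounding words and POS tags in a [-window, +window] context
    for d in range(-window, window + 1):
        j = i + d
        if 0 <= j < len(tokens):
            feats[f"w[{d}]"] = tokens[j]
            feats[f"pos[{d}]"] = pos_tags[j]
    # Prefix/suffix of up to n characters; NIL when the word is too short
    feats["prefix"] = w[:n] if len(w) >= n else "NIL"
    feats["suffix"] = w[-n:] if len(w) >= n else "NIL"
    # Length bit: 1 if the word is longer than 3 characters
    feats["length"] = 1 if len(w) > 3 else 0
    # Frequency bin: 0 for < 100 occurrences, 1 for 100-399, 2 otherwise
    f = freqs.get(w, 0)
    feats["freq_bin"] = 0 if f < 100 else (1 if f < 400 else 2)
    # Binary flags against the manually built acceptable-affix lists
    feats["acc_prefix"] = int(any(w.startswith(p) for p in prefixes))
    feats["acc_suffix"] = int(any(w.endswith(s) for s in suffixes))
    return feats
```

In CRF++ these combinations are expressed declaratively in the template file; the function above just makes the per-column content explicit.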
2) Feature Extraction: The input file is a preprocessed Bengali text document. The training and test files must consist of multiple tokens, and each token consists of multiple (but a fixed number of) columns, which are used by a template file. Each token must be represented on one line, with the columns separated by white space (spaces or tabs). A sequence of tokens becomes a sentence. The template file gives the complete specification of the feature selection. Before training and testing in the CRF, the input document is converted into a multi-token file with fixed columns, and the template file controls feature combination and selection. Two standard files of multiple tokens with fixed columns are created: one for training and one for testing. In the training file, the last column is tagged with the identified MWEs, marking “B-MWE” for the beginning of an MWE, “I-MWE” for the rest of the MWE, and “O” for tokens which are not part of an MWE; in the test file we can either use the same tagging for comparison or just “O” for all tokens regardless of whether they are MWEs.
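The B-MWE/I-MWE/O column format described above can be sketched as follows. This is a minimal rendering of one sentence into CRF++-style token lines; the tokens, POS tags and span are invented for illustration.

```python
def to_crfpp_lines(tokens, pos_tags, mwe_spans):
    """Render one sentence as CRF++ token lines (word, POS, and a
    B-MWE/I-MWE/O tag in the last column); a blank line ends the sentence."""
    tags = ["O"] * len(tokens)
    for start, end in mwe_spans:          # inclusive token spans of gold MWEs
        tags[start] = "B-MWE"
        for k in range(start + 1, end + 1):
            tags[k] = "I-MWE"
    lines = [f"{w}\t{p}\t{t}" for w, p, t in zip(tokens, pos_tags, tags)]
    return lines + [""]                   # blank separator between sentences

# Hypothetical sentence where tokens 1-2 form a gold MWE:
lines = to_crfpp_lines(["ei", "hater", "panch"], ["DEM", "NN", "NN"], [(1, 2)])
```

A training file is just the concatenation of such blocks; for the test file, the same function with an empty span list produces all-“O” last columns.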
TABLE III
NOTATION USED IN THE RESULT SECTION

Notation   | Meaning
W[-i,+j]   | Words spanning from the ith left position to the jth right position
POS[-i,+j] | POS tags of the words spanning from the ith left position to the jth right position
Pre        | Prefix of the word
Suf        | Suffix of the word
3) Post-processing: The phrases tagged by the CRF as MWEs are fed into a further post-processing step. In order to find correct Nominal MWEs, our first intuition was that all the terms present in an MWE should together make a single nominal chunk. After verifying the output on the development set, we observed that the CRF tagged a few phrases as MWEs whose constituent terms belong to multiple chunks in the parsed corpus. In the post-processing step, we prune all such tagged MWEs whose constituents belong to multiple chunks. Note that we fully trust the chunking information that the Bengali shallow parser produces. In the evaluation phase, we will see that due to wrong chunking by the parser, the recall value drops significantly after post-processing, though a considerable amount of precision is gained from this pruning.
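The single-chunk heuristic can be sketched as a small filter over the CRF output (the chunk ids and spans below are invented for illustration):

```python
def prune_multi_chunk(mwe_spans, chunk_ids):
    """Keep only the tagged MWEs whose tokens all fall inside one chunk,
    per the paper's heuristic; chunk_ids[i] is the shallow parser's
    chunk id for token i, and spans are inclusive (start, end) pairs."""
    kept = []
    for start, end in mwe_spans:
        if len({chunk_ids[k] for k in range(start, end + 1)}) == 1:
            kept.append((start, end))
    return kept

# Tokens 1-2 share chunk 1, so (1, 2) survives; (3, 4) crosses chunks 2 and 3.
kept = prune_multi_chunk([(1, 2), (3, 4)], [0, 1, 1, 2, 3])
```

The recall loss discussed above corresponds exactly to true MWEs that the parser wrongly splits across chunks, which this filter then discards.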
V. EVALUATION
A. Evaluation Metrics
In order to evaluate our system, we use standard Information Retrieval metrics: Precision, Recall and F-score. For our present task, they are defined below.
TABLE IV
FEW RESULTS OF THE FEATURE TUNING EXPERIMENT OVER THE DEVELOPMENT SET

Features | Precision | Recall | F-score
W [−3,+3], POS[−3,+3], |Pre| <= 4, |Suf | <= 4, Length, word frequency, acceptable prefix, acceptable suffix 60.28 84.58 70.39
W [−4,+4], POS[−4,+4], |Pre| <= 5, |Suf | <= 5, Length, word frequency, acceptable prefix, acceptable suffix 58.29 78.59 66.93
W [−4,+3], POS[−4,+3], |Pre| <= 4, |Suf | <= 4, Length, word frequency, acceptable prefix, acceptable suffix 57.89 75.65 65.59
W [−3,+4], POS[−5,+4], |Pre| <= 4, |Suf | <= 4, Length, word frequency, acceptable prefix, acceptable suffix 51.30 72.26 60.00
W [−2,+2], POS[−2,+2], |Pre| <= 3, |Suf | <= 3, Length, word frequency, acceptable prefix, acceptable suffix 45.62 68.98 54.92
W [−4,+4], POS[−4,+4], |Pre| <= 4, |Suf | <= 4, Length, word frequency, acceptable prefix, acceptable suffix 52.32 77.62 62.51
W [−5,+4], POS[−4,+3], |Pre| <= 5, |Suf | <= 5, Length, word frequency, acceptable prefix, acceptable suffix 38.69 49.63 43.48
W [−5,+5], POS[−5,+5], |Pre| <= 6, |Suf | <= 6, Length, word frequency, acceptable prefix, acceptable suffix 37.78 48.23 42.37
W [−2,+3], POS[−2,+3], |Pre| <= 4, |Suf | <= 4, Length, word frequency, acceptable prefix, acceptable suffix 47.60 65.30 55.06
W [−3,+2], POS[−3,+2], |Pre| <= 4, |Suf | <= 4, Length, word frequency, acceptable prefix, acceptable suffix 48.77 62.23 54.68
Precision of a system is defined as the number of correctly tagged MWEs as a ratio of the total number of MWEs tagged by the system.

Precision (P) = (# of correct taggings) / (total # of MWEs tagged by the system)   (5)
Recall of a system is defined as the accuracy of the system in terms of all correct MWEs in a given document.

Recall (R) = (# of correct taggings) / (total # of actual MWEs in the document)   (6)
F-score (F1) is a trade-off between Precision and Recall, defined as their harmonic mean (when we give equal weight to both Precision and Recall).

F-score = (2 × Precision × Recall) / (Precision + Recall)   (7)
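Eqs. (5)-(7) amount to the following computation over the sets of tagged and gold MWE spans (the spans in the example are made up):

```python
def precision_recall_f1(tagged, gold):
    """Precision, Recall and F-score (Eqs. 5-7) over sets of
    system-tagged and gold MWE spans."""
    correct = len(set(tagged) & set(gold))
    p = correct / len(tagged) if tagged else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# 2 of 3 tagged spans are correct; the gold set contains 4 spans.
p, r, f1 = precision_recall_f1([(0, 1), (4, 5), (8, 9)],
                               [(0, 1), (4, 5), (6, 7), (10, 11)])
```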
B. Best Feature Selection
In order to select the best features, we performed many variations of the feature parameters while changing the combinations of the suggested features. We started the combination from four words preceding and four words following a given word, the POS tags of the previous four and the following four words, prefixes and suffixes of up to four characters, word length, word frequency, and acceptable prefixes and suffixes. With the experiments performed above, we are able to find the best feature selection for the CRF. Table III explains the notation used below. A few of the feature tuning runs with their corresponding performances are reported in Table IV. The best combination over the development dataset is:

F = {W_{i-3}, W_{i-2}, W_{i-1}, W_i, W_{i+1}, W_{i+2}, W_{i+3}, POS tags of the current word and the 3 preceding and following words, |prefix| <= 4, |suffix| <= 4, length of the word, word frequency, acceptable prefix, acceptable suffix}
C. Performance Analysis
We have compared our results with the results obtained from the statistical measures of Chakraborty and Bandyopadhyay [11]. The parameter of the statistical method is set to 0.5 (candidates having a combined statistical score >= 0.5 are labeled as MWEs). It is worth noting that the statistical method was reported for Noun-Noun bigram MWEs; we extend it to n-grams to make it equivalent to the proposed algorithm. Table V shows that precision, recall and F-score all improve significantly in the CRF based approach. The reason is quite obvious: the statistical method cannot exploit any signature of the contextual words, which actually plays an important role in any token labeling task. Moreover, the hidden interdependencies between the candidate phrase and the surrounding words act as a significant clue to whether it occurs naturally in that context or merely by chance. This information is very tactfully captured in the CRF based model. Furthermore, we evaluate our system before and after the post-processing step. As shown in Table V, a small percentage of recall is lost after post-filtering while precision increases reasonably, which in turn increases the F-score of the system. While searching for the probable cause of the decreased recall, we noticed that some of the true MWEs present in the document are parsed wrongly by the shallow parser. Because of the morphological peculiarity of the Bengali language, the shallow parser separates some valid chunks into distinct chunks. We point out that 1.5% of recall is lost due to this reason.
TABLE V
COMPARISON OF OUR RESULTS (CRF+PP: WITH POST-PROCESSING, CRF-PP: WITHOUT POST-PROCESSING) WITH THE STATISTICAL SIMILARITY MEASURE (STAT) ON THE TEST SET

System | Precision (%) | Recall (%) | F-score (%)
STAT   | 46.23         | 78.56      | 58.21
CRF-PP | 60.88         | 80.69      | 69.39
CRF+PP | 65.72         | 78.90      | 71.70
VI. PERFORMANCE IMPROVEMENT USING REDUPLICATED MWES
Nongmeikapam and Bandyopadhyay (2010) [17] mentioned that prior tagging of reduplicated phrases can improve the processing of MWE tagging by CRF; they showed a significant improvement in accuracy for the Manipuri language. We have tried to use the same concept to tag Reduplicated MWEs (RMWEs) in Bengali and use them as a feature of the CRF model. Chakraborty and Bandyopadhyay (2010) [12] have already extracted all types of reduplicated MWEs from a Bengali corpus using clues from the morphological patterns. All types of reduplication, namely onomatopoeic, complete, partial and correlative reduplications, can be identified by this system; but for semantic reduplication identification, we need a standard dictionary to tackle the synonymous and antonymous patterns between the constituents. We have reproduced exactly the experimental setup discussed in their paper [12] and extracted the RMWEs. The outputs of this phase are marked with “B-RMWE” for the beginning and “I-RMWE” for the rest of the RMWE, and “O” for non-RMWEs. This output is placed as a new column in the multi-token file for both the training and testing phases of the CRF. The CRF toolkit is run again and compared with the previous output. The results are shown in Table VI; they signify the improvement in performance compared to the previous model.
TABLE VI
RESULTS ON THE TEST SET USING REDUPLICATED MWES AS A FEATURE OF CRF

System | Precision (%) | Recall (%) | F-score (%)
CRF-PP | 62.40         | 82.95      | 71.22
CRF+PP | 67.98         | 80.01      | 73.01
VII. CONCLUSION
In this experiment, we have incorporated CRF to identify Nominal MWEs in a Bengali corpus. We used various morphological features and tuned our system to obtain a better feature set. We included reduplicated MWEs as a feature of the CRF, and this showed a reasonable improvement in terms of Precision and F-score. We have also seen that the lack of performance of a basic morphological tool (e.g., the shallow parser) can cause a significant deterioration in the overall performance of the system. Besides experimenting with MWEs, we are trying to develop a preliminary version of a Bengali dependency parser. We plan to include clausal dependency information from the dependency parser along with the existing feature set. We also plan to handle other types of MWEs, like Verbal MWEs and Prepositional MWEs, in Bengali. Moreover, we will try to use a hybrid approach in the post-processing step to prune false positive candidates and make the final list rich with all relevant MWEs.
REFERENCES
[1] I. Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger, “Multiword Expressions: A Pain in the Neck for NLP”, In Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLING), pp. 1-15, 2002.
[2] T. Chakraborty, S. Pal, T. Mondal, T. Saikh, and S. Bandyopadhyay, “Shared task system description: Measuring the Compositionality of Bigrams using Statistical Methodologies”, In Proceedings of the Workshop on Distributional Semantics and Compositionality (DiSCo), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA, pp. 38-42, June 24, 2011.
[3] K. W. Church and P. Hanks, “Word Association Norms, Mutual Information and Lexicography”, Computational Linguistics, 16(1), pp. 22-29, 1990.
[4] T. Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence”, Computational Linguistics, 19(1), pp. 61-74, 1993.
[5] A. Kilgarriff and J. Rosenzweig, “Framework and results for English SENSEVAL”, Computers and the Humanities, Senseval Special Issue, 34(1-2), pp. 15-48, 2000.
[6] F. A. Kunchukuttan and O. P. Damani, “A System for Compound Noun Multiword Expression Extraction for Hindi”, In Proceedings of the 6th International Conference on Natural Language Processing (ICON), pp. 20-29, 2008.
[7] D. Das, S. Pal, T. Mondal, T. Chakraborty, and S. Bandyopadhyay, “Automatic Extraction of Complex Predicates in Bengali”, In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), The 23rd International Conference on Computational Linguistics (COLING), Beijing, China, pp. 37-45, August 28, 2010.
[8] T. Chakraborty, D. Das, and S. Bandyopadhyay, “Semantic Clustering: an Attempt to Extract Multiword Expressions in Bengali”, In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011), ACL-HLT 2011, Portland, Oregon, USA, pp. 8-11, June 23, 2011.
[9] S. Venkatapathy and A. Joshi, “Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features”, In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), Association for Computational Linguistics, pp. 899-906, 2005.
[10] A. Agarwal, B. Ray, M. Choudhury, S. Sarkar, and A. Basu, “Automatic Extraction of Multiword Expressions in Bengali: An Approach for Miserly Resource Scenario”, In Proceedings of the International Conference on Natural Language Processing (ICON), pp. 165-174, 2004.
[11] T. Chakraborty and S. Bandyopadhyay, “Identification of Noun-noun (N-N) collocations as multiword expressions in Bengali corpus”, In Proceedings of the 8th International Conference on Natural Language Processing (ICON), India, 2010.
[12] T. Chakraborty and S. Bandyopadhyay, “Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule Based Approach”, In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), COLING 2010, Beijing, China, pp. 72-75, 2010.
[13] A. Ekbal, R. Haque, and S. Bandyopadhyay, “Maximum Entropy Based Bengali Part of Speech Tagging”, In Advances in Natural Language Processing and Applications, Research in Computing Science, pp. 67-78, 2008.
[14] C. Zhang, H. Wang, Y. Liu, D. Wu, Y. Liao, and B. Wang, “Automatic Keyword Extraction from Documents Using Conditional Random Fields”, Journal of Computational Information Systems, 4(3), pp. 1169-1180, 2008.
[15] J. Cohen, “A coefficient of agreement for nominal scales”, Educational and Psychological Measurement, vol. 20, pp. 37-46, 1960.
[16] R. J. Passonneau, “Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation”, In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), 2006.
[17] K. Nongmeikapam and S. Bandyopadhyay, “Identification of MWEs Using CRF in Manipuri and Improvement Using Reduplicated MWEs”, In Proceedings of the 8th International Conference on Natural Language Processing (ICON), India, 2010.