ORIGINAL ARTICLE
On active annotation for named entity recognition
Asif Ekbal • Sriparna Saha • Utpal Kumar Sikdar
Received: 15 October 2013 / Accepted: 5 June 2014
© Springer-Verlag Berlin Heidelberg 2014
Abstract A major constraint on machine learning techniques for solving several information extraction problems is the availability of a sufficient amount of training examples, which involve huge costs and efforts to prepare. Active learning techniques select informative instances from the unlabeled data and add them to the training set in such a way that the overall classification performance improves. In a random sampling approach, unlabeled data is selected for annotation at random, and thus cannot yield the desired results. In contrast, active learning selects the useful data from a huge pool of unlabeled documents. The strategies used often assign instances to incorrect classes; a classifier is confused between two classes when the test instance is located near the margin. We propose two methods for active learning and show that these techniques lead to increased performance. The first approach is based on the support vector machine (SVM), whereas the second is based on ensemble learning, which utilizes the classification capabilities of two well-known classifiers, namely SVM and conditional random field (CRF). The motivation for using these classifiers is that they are orthogonal in nature, and a combination of them can therefore produce better results. In order to show the efficacy of the proposed approach we choose a crucial problem, named entity recognition (NER), in three languages: Bengali, Hindi and English. The approach is also evaluated for NER in the biomedical domain. Evaluation results reveal that the proposed techniques indeed yield considerable performance improvements.
Keywords Named entity recognition (NER) · Active learning · Conditional random field (CRF) · Support vector machine · Classifier ensemble · Biomedical domain
1 Introduction
One of the greatest difficulties with machine learning techniques is the need for a large amount of training data. Creating these labeled examples is both very costly and time-consuming [1–3]. Active learning (AL) [4] is nowadays a popular research area due to its manifold potential benefits. By using active learning techniques we can reduce the amount of manual annotation necessary for creating a large training corpus. The strength of active learning lies in the fact that it selects only a subset of tokens which are useful for a given classifier.
Active learning [4–7] optimizes the control of model growth and greatly reduces the time and costs involved in preparing the data as well as the model. Without AL, knowledge-base models grow with the size of the already built data set. Active learning selects the most informative training examples instead of the entire body of data, thus restricting the amount of learning done by the learning algorithm. In most cases, the predictive accuracies obtained from the resulting models are comparable to those of a standard (exact) learning model. In [5] the authors discuss the use of active annotation for annotating corpora of articles about archaeology in the Portale della Ricerca Umanistica Trentina in the domains of humanities (and in
A. Ekbal · S. Saha · U. K. Sikdar
Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna 800 013, India
e-mail: [email protected]; [email protected]
S. Saha
e-mail: [email protected]; [email protected]
U. K. Sikdar
e-mail: [email protected]; [email protected]
Int. J. Mach. Learn. & Cyber.
DOI 10.1007/s13042-014-0275-8
other scholarly domains) where there is a great need for
making the best possible use of the annotators, as the
entities mentioned in collections of these scholarly articles
belong to different types from those familiar from news
corpora, hence new resources need to be annotated to
create supervised taggers for tasks such as named entity
extraction. The thesis of Settles [8] focused on active
learning with structured instances and potentially varied
annotation costs. Active learning with support vector
machines (SVMs) and Bayesian networks can be found in
Tong [9]. Some theoretical aspects of active learning for
classification are described in Monteleoni [10]. For named
entity recognition (NER), active learning techniques have
also been used in the past [11]. A good survey of active
learning with its applications to natural language process-
ing (NLP) can be found in [12]. In an interesting study, Schein and Ungar [13] illustrated that active learning can sometimes require more labeled data than passive learning while using the same model class (here, logistic regression). Baldridge and Palmer [14] found that the performance of active learning depends on the proficiency of the annotator (especially whether the annotator is a domain expert). Tomanek and Olsson [15] reported a survey in which 91 % of the researchers who had used active annotation for solving their problems had their expectations fully or partially met. Dasgupta [16]
determined a variety of theoretical upper and lower bounds for active learning when a huge collection of unlabeled data is available. Balcan et al. [17] have proved that, in the limit, some active learning strategies should perform better than supervised learning. Settles and Craven
[18] have developed a large number of active learning
algorithms for sequence labeling tasks using probabilistic
sequence models like conditional random field (CRF).
Reichart et al. [19] developed a two-task active learning
technique for natural language parsing and NER. Two
methods are proposed for actively learning both the tasks.
The first approach is termed alternating selection: the parser is used to query sentences in one iteration, and then the NER system is used to query instances in the next iteration. In the second strategy, named rank combination, both learners rank the query candidates in the pool independently, and the candidates with the highest ranks are selected for active expert annotation.
Some unsupervised learning paradigms are developed in
[20–22]. In [20] a mutual bootstrapping technique is used to learn from a set of unannotated training texts and a handful of "seed" words for the semantic category of interest. First, some extraction patterns are learnt from the seed words, and then these learned patterns are utilized to detect more words belonging to the same semantic category. In the second phase the authors devised a second level of bootstrapping, where the most reliable lexicon entries generated by the first stage are kept and the process is restarted with the enhanced semantic lexicon. The reported results show that this two-tiered bootstrapping process is less sensitive to noise than a single level of bootstrapping and generates high-quality dictionaries. In [21] an unsupervised NER system is developed using syntactic and semantic contextual evidence. Here a corpus-driven statistical technique was
developed which uses a learning corpus to acquire con-
textual classification clues and then utilizes these results for
classifying unrecognized proper nouns (PN) in an unla-
beled corpus. In order to generate the training examples of
proper nouns they used both rule-based as well as machine
learning based recognizers. However, the contextual model of PN categories can be learnt without using any supervised information. In [22] the authors developed a system
named KNOWITALL which is used to automate the
complex task of extracting large collections of facts (e.g.,
names of scientists or politicians) from the Web in an
unsupervised, domain-independent, and scalable manner.
After the first execution, KNOWITALL was able to extract 50,000 class instances. The challenge here is to increase recall without sacrificing precision, and three distinct methods are devised to address it. A pattern-learning technique is used to learn domain-specific extraction rules; this also enables learning additional extractions. To increase recall, subclass extraction is used to automatically determine subclasses (e.g., chemist and biologist are identified as subclasses of scientist). The third method, named "list extraction", generates lists of class instances, learns a wrapper for each list, and finally extracts the elements of each list. The authors evaluated all the methods in connection with KNOWITALL; applying these three techniques helps to increase its recall.
The active learning for NER reported in [23] focuses on reducing class imbalance. The main goal here is to generate more balanced data sets using the annotation procedures of AL. Results show that the resulting approaches can indeed minimize class imbalance and increase the performance of classifiers on minority classes while maintaining good overall performance in terms of macro F-score. In [24]
the authors developed active learning techniques to bootstrap a NER system for a new domain of radio astronomical abstracts. Several committee-based metrics are evaluated for quantifying the disagreement between classifiers built using multiple views. Results show that an appropriate value of the metric can be determined using simulation experiments with existing annotated data collected from different domains. The final evaluation reveals that active learning performs much better than a randomly sampled baseline. In [25] a CRF-based active
learning technique is developed which utilizes the concepts
of information density for selecting uncertain samples. This
technique is then applied to NER in Chinese. Work on stopping criteria for active-learning-based techniques is reported in [26], where the authors proposed three different stopping criteria for active learning. Results reveal that among these three, gradient-based stopping is the best way to stop active learning, achieving near-optimal NER performance. In [27] the authors developed a multi-criteria active learning approach and applied it to NER on two standard corpora. In order to maximize the contribution of the examples chosen by the sample selection technique, multiple criteria (informativeness, representativeness and diversity) are considered, and measures are proposed to quantify them. Two sample selection strategies were developed, which result in less labeling cost than a single-criterion-based method.
NER is an important task in the field of NLP. It is an
important module in many applications including infor-
mation extraction, information retrieval, machine transla-
tion, question answering and automatic summarization etc.
The main task of NER can be viewed as a combination of two steps: in the first phase every word/term in the text has to be identified, and in the second phase these are categorized into groups like person name, location name, organization name, miscellaneous name (date, time, percentage and monetary expressions etc.) and "none-of-the-above". Existing work on Indian-language NER is still limited. Some of the reasons are: lack of capitalization information, free word order, greater diversity, and the resource-constrained nature of these languages. Among the Indian languages, there is some existing work covering a few languages like Bengali [28, 29], Hindi [30] and Telugu [31]. As mentioned before, the performance of any supervised system greatly depends on the amount of available annotated data, and this is not easy to
achieve. In order to tackle this issue active learning can be
an effective solution, providing us with a way to automatically increase the amount of training data. The work reported in this paper differs from the works reported in [20–22] in the sense that all of those were based on unsupervised machine learning, whereas the current work deals with an active annotation technique where, in each iteration, some tokens are selected using some novel techniques for which active expert opinion is sought. These tokens, along with the corresponding sentences, are added to the training data, and the system is retrained and evaluated on development/unlabeled data in an iterative fashion.
The explosion of information in the biomedical domain
leads to growing demand for automated biomedical
information extraction techniques [32]. Named entity (NE) extraction (here, by extraction we mean both recognition and classification) is a fundamental task of biomedical text
mining. Recognizing NEs like mentions of proteins,
DNA, RNA etc. is one of the most important factors in
biomedical knowledge discovery. But the inherently
complex structures of biomedical NEs pose a big chal-
lenge for their identification and classification in bio-
medical information extraction. The literature on biomedical NE extraction is vast, but there is still a wide gap in performance between the systems developed for the traditional news-wire domains and the systems targeting biomedical domains. The major challenges and/or difficulties associated with the identification and classification of biomedical NEs are as follows: (1) building a complete dictionary of all types of biomedical NEs is infeasible due to the generative nature of NEs; (2) NEs are made of very long compounded words (i.e., contain nested entities) or abbreviations, and hence it is difficult to classify them properly; (3) names do not follow any nomenclature; (4) names include different symbols, common words, punctuation symbols, conjunctions, prepositions etc. that make NE boundary identification more difficult and challenging; and (5) the same word or phrase can refer to different NEs based on its context.
In this paper we propose two methods for active
learning. The first one is based on SVM [33]. We eval-
uate our proposed technique for NER in three languages,
namely Bengali, Hindi and English. Bengali and Hindi are two widely spoken languages, ranking fifth and second, respectively, in terms of native speakers worldwide. Evaluation results show that our proposed approach in general performs well on three different datasets. Thereafter this approach is evaluated for NER in the biomedical domain. We identify and implement a variety of
features that are based on orthography, local contextual
information and global contexts. Thereafter we propose
the active learning technique based on the concept of
classifier ensemble, where SVM [33] and CRF [34] are
used as the underlying classification techniques. Based on
the distance from the hyperplane of SVM and the con-
ditional probabilities assigned to each token by CRF, we
select most uncertain samples from the unlabeled data to
be added to the initial training data. The proposed
approach is again evaluated for NER in Bengali, Hindi,
English and biomedical texts. Results show that the ensemble performs considerably better than the individual classifiers.
Some unsupervised models for NER are developed in
[20, 35]. Collins and Singer [35] developed two tech-
niques which can build a NER system by utilizing a small
amount of labeled data and a huge collection of unlabeled
documents for solving the NER. The first approach
describes how to generate rules for NER from the sig-
nificantly large amount of unlabelled documents. It first
starts with some seed set of rules that are expanded while
maintaining a high level of agreement between spelling
and contextual decision lists. The second approach, named CoBoost, is a generalization of the boosting technique [35] applied to the problem of NER. It utilizes both labeled and unlabeled data and builds two classifiers in parallel. AdaBoost determines a weighted combination of simple (weak) classifiers, where the weights are calculated by minimizing a function which bounds the classification error on a set of training examples. The second algorithm devised in this paper performs a similar kind of search, but instead of minimizing only the classification error on the training data it also minimizes the disagreement between the classifiers in predicting class labels of unlabeled examples. The proposed algorithm also develops two classifiers iteratively, in each iteration trying to minimize a continuously differentiable function which bounds the number of examples on which the two classifiers disagree. Thus the CoBoost algorithm relies on certain samples (samples on which the two classifiers agree).
In the current paper we have selected tokens for which both classifiers are confused in determining the class label. For each classifier and for each token we determine the difference between the confidence values of the two most probable classes. If this difference is less than a predefined threshold (i.e., if the confidence values are similar to each other) then that particular instance is a probable candidate about which the classifier is most uncertain. For each of the classifiers, we determine two different lists of potential candidates, and then combine them in a unique way. Finally the sentences containing the ten most confusing instances are selected for active expert opinion and thereafter added to the training set. Thus our algorithm differs from the work proposed in [35], which focused on implicitly minimizing an objective function that bounds the number of examples on which the two classifiers disagree; that technique is rather a way of building classifiers using a few labeled and a huge number of unlabeled documents. This is in contrast to the concept of active learning, where in each iteration we select informative tokens that are assigned the correct class labels by some domain experts.
The rest of the paper is organized as follows. Section
2 describes very briefly the base classifiers that we have
used for building our active learning systems. In Sect. 3
we present our algorithms for active annotation. Section
4 describes the set of features that we have used for
training and/or testing our machine learning algorithms.
Section 5 elaborately reports on the datasets used,
experimental results, detailed analysis and necessary
comparisons with the existing works. Finally, we con-
clude in Sect. 6.
2 Base classifiers
In our work we use two different classifiers, namely CRF
[34] and SVM [33].
2.1 Conditional random field
CRFs [34] are undirected graphical models, widely used for sequence learning tasks. A special case of this classification technique corresponds to conditionally trained probabilistic finite-state automata. As CRFs are conditionally trained, they can easily incorporate a large number of arbitrary, non-independent features, while still admitting efficient procedures for non-greedy finite-state inference and training.
Given an observation sequence we have to determine the best state sequence. A feature function f_k(s_{t-1}, s_t, o, t) has a value of 0 in most cases and is set to 1 only when s_{t-1} and s_t are certain states and the observation has certain properties. We use the C++ based CRF++ package (http://crfpp.sourceforge.net), a simple, customizable, and open source implementation of CRF for segmenting or labeling sequential data.
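For illustration, a minimal template file of the kind CRF++ expects might look like the following; the exact feature columns chosen here are an assumption, not taken from the paper. In CRF++ syntax, %x[row,col] denotes the feature in column col of the token at relative position row.

```
# Unigram features: current word and +/-2 context words (feature column 0)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# Combination of the previous and current word
U05:%x[-1,0]/%x[0,0]

# Bigram template: combines the previous and current output labels
B
```

The single "B" line realizes the bigram feature template mentioned above, which pairs each feature with the previous and current output labels.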
2.2 Support vector machine
In the field of NLP, SVMs [33] have been widely applied to text categorization, and are reported to achieve high accuracy without over-fitting, even with a large number of words taken as features [36]. We develop our system using SVM [33, 36], which performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. We have used the YamCha toolkit (http://chasen.org/~taku/software/yamcha/), an SVM-based tool for detecting classes in documents, and formulate the NER task as a sequential labeling problem. Here, the pairwise multi-class decision method and the polynomial kernel function are used. We use the TinySVM-0.07 classifier (http://cl.aist-nara.ac.jp/taku-ku/software/TinySVM).
3 Proposed active learning techniques
In this section we describe our proposed active learning
techniques. Our first approach is based on the classification
technique, namely SVM. The second method is based on
an ensemble approach, where two supervised classifiers,
namely SVM and CRF are used. Effective uncertain sam-
ples are selected based on the measurements of distance
from the hyperplane of SVM and conditional probabilities
of CRF.
3.1 Active annotation
Active annotation, a term introduced by [5, 37] to refer to the application of active learning [1–4] to corpus creation, is becoming a popular annotation technique because it can lead to drastic reductions in the amount of annotation needed to construct a training set for developing highly accurate classifiers. In the traditional, random sampling approach, unlabeled data is selected for annotation at random.
In contrast, in active learning, the most useful data for
the classifier are carefully selected. Generally, a given
classifier is trained using a small sample of the data (usu-
ally selected randomly) which are also termed as the seed
examples. The classifier is subsequently applied to a pool
of unlabeled data with the purpose of selecting additional
examples that the classifier views as informative. The
selected data are manually annotated and the steps are
repeated so that the classifier can determine the optimal
decision boundary between the classes. The key question in
this approach is how to determine the samples that will be
most useful to the classifier.
3.2 Active annotation with SVM
A feature vector consisting of the features described in the
following section is extracted for each word in the NE
tagged corpus. Now, we have a training data in the form
ðWi; TiÞ, where, Wi is the ith word and its feature vector and
Ti is its output tag.
The SVM is trained with the available feature set and evaluated on the gold standard test set. Based on some selection criterion, sentences are chosen from the development set and added to the initial training set in such a way that the performance on the test set improves.
Our selection criterion is based on the confidence values of an SVM model. For each token of the development set, the SVM classifier produces the distance from the different separating hyperplanes. We first normalize these distance values to the range [0, 1]; the normalized value is treated as the confidence value for a particular class. Our selection criterion is based on the difference between the confidence values of the two most probable classes for a token, the hypothesis being that items for which this difference is smaller are those of which the classifier is less certain. A threshold on this confidence interval is defined, and at each iteration of the algorithm we select the effective sentences from the development set and add them to the training set. In each iteration, we add the ten most informative sentences to the training set. We stop iterating when the performance in two consecutive iterations is equal.
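This selection criterion can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are ours, and the min-max normalization is an assumption, since the paper does not specify how the distances are normalized to [0, 1].

```python
def normalize(scores):
    """Min-max normalize raw per-class SVM distances into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against identical scores
    return {label: (s - lo) / span for label, s in scores.items()}

def confidence_interval(scores):
    """Difference between the two most probable classes for one token."""
    top2 = sorted(normalize(scores).values(), reverse=True)[:2]
    return top2[0] - top2[1]

def select_uncertain_sentences(dev_tokens, threshold=0.1, k=10):
    """dev_tokens: list of (sentence_id, {label: raw_distance}) pairs.
    Returns the ids of the k sentences whose most confusing token falls
    below the threshold, sorted by increasing confidence interval."""
    best = {}  # sentence_id -> smallest CI seen among its tokens
    for sent_id, scores in dev_tokens:
        ci = confidence_interval(scores)
        if ci < threshold:
            best[sent_id] = min(ci, best.get(sent_id, 1.0))
    return sorted(best, key=best.get)[:k]
```

A token whose two top classes are nearly tied yields a small confidence interval, so its sentence sorts to the front and is queried first.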
The main steps of the active annotation approach we
followed in this work are shown in Fig. 1.
3.3 Ensemble approach for active annotation
The method is based on ensemble learning. As the base classifiers we use CRF and SVM. For each of the base classifiers, a feature vector consisting of the features described in the following section is extracted for each token. We consider the feature vector consisting of all the features of the current token, and vary the contexts within w_{i-2}^{i+2} = w_{i-2} ... w_{i+2} (i.e., the preceding two and succeeding two tokens). For CRF we use a bigram feature template that computes the combinations of all the available features for the current and previous tokens. For SVM, we include the dynamic output labels of the previous two tokens. Based on
some selection criterion, sentences are chosen from the
development set and added to the initial training set in such
a way that the performance on the test set improves. Our
technique is based on the combined decisions of both SVM
and CRF.
Step 1: Evaluate the system on the gold standard test data.
Step 2: Test on the development data and calculate the confidence values of each of the output classes.
Step 3: Compute the confidence interval (CI) between the two most probable classes for each token.
Step 4: If CI is below the threshold value (set to 0.1 and 0.2) then
  Step 4.1: Add the NE token along with its sentence identifier and CI in a list of effective sentences, selected for active annotation (named as EA).
Step 5: Sort EA in ascending order of CI.
Step 6: Select the topmost 10 sentences.
Step 7: Remove the 10 sentences from the development set.
Step 8: Add the sentences to the training set.
Step 9: Retrain the SVM classifier and evaluate with the test set.
Step 10: Repeat steps 2-9 until the performance in two consecutive iterations is the same.
Fig. 1 Main steps of the proposed active learning technique
As the two algorithms produce two different kinds of probabilistic scores, we first normalize all the confidence values to the range [0, 1] and treat these as the actual confidence scores of the outputs. The selection criterion is again based on the difference between the confidence values of the two most probable classes for a token. A threshold on the confidence interval is defined, and for each base classifier we generate a set of uncertain samples. These sets contain the selected sentence identifiers along with the confidence intervals for which they were included. Thereafter we combine the decisions of SVM and CRF, and generate a new set of uncertain samples by taking the union of these two sets. The union is taken in such a way that a common sentence is assumed to have a confidence interval equal to the minimum of the two values assigned to that particular sentence in the two sets. Finally, we select the ten most uncertain sentences from the development data. Thus, in each iteration of the algorithm, we add the ten most informative sentences to the training set. We run the algorithm for a maximum of ten iterations. In some cases, the performance starts to decrease even at an earlier step of the algorithm. In order to account for this, we stop the algorithm's iteration and retain the previous iteration's training data as the final one.
The main steps of the proposed active annotation are
shown in Fig. 2.
4 Named entity features
Performance of any classification technique greatly
depends on the features used in the model. In our work we
implement the following set of features for our task. These
features are easy to derive and don’t require deep domain
knowledge and/or external resources for their generation.
Thus, these features are general in nature and can be easily
extracted for other domains.
1. Context words: These are the preceding and following
words surrounding the current token. This is based on
the observation that surrounding words carry effective
information for NE identification.
2. Word suffix and prefix: Fixed length (say, n) word
suffixes and prefixes are very effective to identify
NEs and work well particularly for the highly
inflective Indian languages. Actually, these are the
fixed length character sequences stripped either from
the rightmost (for suffix) or from the leftmost (for
prefix) positions of the words.
3. First word: This is a binary valued feature that checks
whether the current token is the first word of the
sentence or not.
4. Length of the word: This binary valued feature checks
whether the number of characters in a token is less
than a predetermined threshold value (here, set to 5).
This feature is defined with the observation that very
short words are most probably not the NEs.
5. Infrequent word: This is a binary valued feature that
checks whether the current word appears in the
training set very frequently or not. We include this
feature as the frequently occurring words are most
likely not the NEs.
6. Last word of sentence: This feature checks whether
the current word is the last word of a sentence or not.
In Indian languages, verbs generally appear in the last
position of the sentence. Indian languages follow
subject–object–verb structure. This feature distin-
guishes NEs from the verbs.
Step 1: Train the base classifiers with the initial training data and evaluate with the gold standard test data.
Step 2: Train the base classifiers with the initial training data and evaluate with the development data.
Step 3: Calculate the confidence value of each token for each output class.
Step 4: Normalize the confidence scores within the range of [0, 1].
Step 5: Compute the confidence interval (CI) between the two most probable classes for each token of the development data. This is computed on the outputs of both SVM and CRF.
Step 6: From each dev output, perform the following operations:
  Step 6.1: If CI is below the threshold value (set to 0.2) then add the NE token along with its sentence identifier and CI in a set of effective sentences, selected for active annotation.
  Step 6.2: Create two different sets (Set_SVM and Set_CRF) for the two classifiers.
Step 7: Combine the two sets into one, named EA, in such a way that if the sentence identifiers are the same, then for that sentence CI_new = min(CI_SVM, CI_CRF). All the dissimilar sentences are added as they are.
Step 8: Sort EA in ascending order of CI_new.
Step 9: Select the topmost 10 sentences, and remove these from the development data.
Step 10: Add the sentences to the training set. This generates a new training set. Retrain the SVM and CRF classifiers and evaluate with the test set.
Step 11: Repeat steps 3-10 for some iterations (10 in our case).
Fig. 2 Steps of the proposed ensemble technique for active annotation
7. Digit features: Several orthographic features are
defined depending upon the presence and/or the
number of digits and/or symbols in a token. These
features are digitComma (token contains digit and
comma), digitPercentage (token contains digit and
percentage), digitPeriod (token contains digit and
period), digitSlash (token contains digit and slash),
digitHyphen (token contains digit and hyphen) and
digitFour (token consists of four digits only).
8. Dynamic feature: Dynamic feature denotes the output tags t_{i-3}t_{i-2}t_{i-1}, t_{i-2}t_{i-1}, t_{i-1} of the words w_{i-3}w_{i-2}w_{i-1}, w_{i-2}w_{i-1}, w_{i-1} preceding w_i in the sequence w_1^n. For CRF, we consider the bigram template that considers the combination of the current and previous output labels.
9. Content words in surrounding contexts: We consider all unigrams in the contexts w_{i-3}^{i+3} = w_{i-3} ... w_{i+3} of w_i (crossing sentence boundaries) for the entire training data. We convert tokens to lower case and remove stopwords, numbers and punctuation symbols. We define a feature vector of length 10 using the 10 most frequent content words. Given a classification instance, the feature corresponding to token t is set to 1 if and only if the context w_{i-3}^{i+3} of w_i contains t.
10. Part of speech (PoS) information: PoS information is
a critical feature for NE identification. We use PoS
information of the current and/or the surrounding
token(s) as the features. For Bengali and Hindi we
use our in-house PoS tagger to extract the PoS
information. The PoS information for English was
provided with the datasets. For biomedical texts, we
use the GENIA tagger V2.0.2 (see footnote 5) to extract this information.
In addition to the above we make use of the following
additional features for NER, particularly for biomedical
texts.
1. Chunk information: We use GENIA tagger V2.0.2 to
get the chunk information. Chunk information provides useful evidence about the boundaries of biomedical
NEs. In the current work, we use chunk information of
the current and/or the surrounding token(s). This
information was provided for the English datasets.
2. Unknown token feature: This is a binary valued feature
that checks whether the current token was seen or not
in the training corpus. In the training phase, this feature
is set randomly.
3. Word normalization: We define the feature for word
normalization. The first type of feature attempts to
reduce a word to its stem or root form. This helps to
handle the words containing plural forms, verb
inflections, hyphen, and alphanumeric letters. The
second type of feature indicates how a target word is
orthographically constructed. Word shapes refer to the
mapping of each word to their equivalence classes.
Here each capitalized character of the word is replaced
by ‘A’, small characters are replaced by ‘a’ and all
consecutive digits are replaced by ‘0’. For example,
‘IL’ is normalized to ‘AA’, ‘IL-2’ is normalized to
‘AA-0’ and ‘IL-88’ is also normalized to ‘AA-0’.
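The word-shape mapping described here amounts to three regular-expression substitutions:

```python
import re


def word_shape(token):
    """Map a token to its equivalence class: capitals -> 'A', lower-case
    letters -> 'a', and each run of consecutive digits -> a single '0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"\d+", "0", shape)
```

Note that the digit substitution runs last and collapses whole runs, which is why ‘IL-2’ and ‘IL-88’ both normalize to ‘AA-0’.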
4. Head nouns: The head noun is the major noun or noun phrase of an NE that describes its function or property. For example, transcription factor is the head noun of the NE NF-kappa B transcription factor. In comparison to the other words in an NE, head nouns are more important as they play a key role in the correct classification of the NE class.
5. Verb trigger: These are special types of verbs (e.g., binds, participates) that occur preceding NEs and provide useful information about the NE class. These trigger words are extracted automatically from the training corpus based on their frequencies of occurrence. A feature is then defined that fires iff the current word appears in the list of trigger words.
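A sketch of the frequency-based trigger extraction, assuming sentences of (token, PoS tag, NE tag) triples in the usual B-I-O scheme; the threshold and tag conventions are illustrative assumptions:

```python
from collections import Counter


def extract_verb_triggers(sentences, min_freq=3):
    """Collect verbs that immediately precede the start of an NE and
    occur at least min_freq times in that position."""
    counts = Counter()
    for sent in sentences:
        for (tok, pos, _), (_, _, next_ne) in zip(sent, sent[1:]):
            if pos.startswith("VB") and next_ne.startswith("B-"):
                counts[tok.lower()] += 1
    return {verb for verb, c in counts.items() if c >= min_freq}
```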
6. Informative words: In general, biomedical NEs are quite long and often contain many common words that are not themselves NEs. For example, function words such as of and and, and nominals such as active and normal, appear frequently in the training data but do not help to recognize NEs. In order to select the most effective words, we first list all the words that occur inside multiword NEs. Thereafter digits, numbers and various symbols are removed from this list. Each word (w_i) of this list is assigned a weight that measures how well the word identifies and/or classifies NEs. This feature is defined in line with the one defined in [38].
7. Orthographic features: We define a number of ortho-
graphic features depending upon the contents of the
wordforms. Several binary features are defined which
use capitalization and digit information. These features
are: initial capital, all capital, capital in inner, initial
capital then mix, only digit, digit with special charac-
ter, initial digit then alphabetic, digit in inner.
The presence of some special characters such as ‘,’, ‘-’, ‘.’, ‘)’ and ‘(’ is very helpful for detecting NEs, especially in the biomedical domain. For example,
many biomedical NEs have ‘-’ (hyphen) in their
construction. Some of these special characters are also
important to detect boundaries of NEs. We also use the
features that check the presence of ATGC sequence
and stop words. The complete list of orthographic
features is shown in Table 1.

5 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger
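The orthographic classes of Table 1 can be approximated with an ordered list of regular expressions where the first full match wins. The patterns below are a sketch, not the authors' exact definitions:

```python
import re

# Ordered (class, pattern) pairs; the first full match wins.
# ATGCSeq must precede AllCaps, since an ATGC run is also all-capitals.
ORTHO_PATTERNS = [
    ("ATGCSeq",       r"[ATGC]{4,}"),
    ("AllCaps",       r"[A-Z]{2,}"),
    ("InitCap",       r"[A-Z][a-z]+"),
    ("DigitOnly",     r"\d+"),
    ("AlphaDigit",    r"[a-z]+\d+"),
    ("CapsAndDigits", r"[A-Z\d]+[a-z]?\d*"),
    ("CapMixAlpha",   r"[A-Za-z]*[A-Z][a-z]+[A-Z][A-Za-z]*"),
]


def ortho_class(token):
    """Return the first orthographic class whose pattern fully matches."""
    for name, pattern in ORTHO_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return "Other"
```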
5 Datasets, experiments and discussions
In this section we describe the datasets used for the
experiments, report the evaluation results and present the
necessary discussions.
5.1 Datasets and experimental setup
Indian languages are resource-constrained in nature. For
NER, we use a Bengali news corpus [39], developed from the
archive of a leading Bengali newspaper available on the web.
A portion of this dataset containing 250 K wordforms was
manually annotated with the NE tagset of four tags namely,
Person name (PER), Location name (LOC), Organization
name (ORG) and Miscellaneous name (MISC). The Miscel-
laneous name includes date, time, number, percentages,
monetary expressions and measurement expressions. The
data is collected mostly from the National, States, Sports
domains and the various sub-domains of District of the
newspaper. This annotation was carried out by one of the
authors and verified by an expert [39]. We also use the
IJCNLP-08 NER on South and South East Asian Languages
(NERSSEAL; see footnote 6) Shared Task data of around 100 K wordforms
that were originally annotated with a fine-grained tagset of
twelve tags. This data is mostly from the agriculture and
scientific domains. For Hindi, we use a dataset of 502,913 tokens obtained from the NERSSEAL shared task.
In the present work, we consider only the tags that
denote person names (NEP), location names (NEL), orga-
nization names (NEO), number expressions (NEN), time
expressions (NETI) and measurement expressions (NEM).
The NEN, NETI and NEM tags are mapped to the Mis-
cellaneous name tag that denotes miscellaneous entities.
Other tags of the shared task have been mapped to the
‘other-than-NE’ category denoted by ‘O’.
For the development sets, we partition the available
training sets in such a way that 10 % instances of each class
belong to the respective development set. Some statistics of
training, development and test data are presented in Table 2.
For English NER, we use the CoNLL-2003 shared task [40]
data. The training set contains a total of 203,621 tokens, of which 23,499 are NEs. The test data contains 51,578 tokens, of which 5,648 are of NE types.
For biomedical texts, we use the JNLPBA 2004 shared
task datasets (see footnote 7). The datasets were extracted from the
GENIA Version 3.02 corpus of the GENIA project. This
was constructed by a controlled search on Medline using
MeSH terms such as human, blood cells and transcription
factors. From this search, 2,000 abstracts of about 500 K
wordforms were selected and manually annotated according
to a small taxonomy of 48 classes based on a chemical
classification. Out of these classes, 36 classes were used to
annotate the GENIA corpus. In the shared task, the data sets
were further simplified to be annotated with only five NE
classes, namely Protein, DNA, RNA, Cell_line and Cell_-
type [41]. The test set was a relatively new collection of Medline abstracts from the GENIA project. The test set contains 404 abstracts of around 100 K words. One half of the test data was from the same domain as the training data, and the other half was from the super domain of blood cells and transcription factors.
For the active annotation experiment we take out 10 %
examples of each class from the training data and create
development set. Initially the system is trained on the
training set and evaluated on the development and test
datasets. Our algorithm selects most uncertain samples
from the development set and adds to the training set.
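The experimental loop just described can be written generically. Here fit, confidence and evaluate stand in for the actual training, scoring and testing routines, and are assumptions for illustration:

```python
def active_annotation(train, dev, test, fit, confidence, evaluate,
                      batch=10, iterations=10):
    """Generic sketch of the annotation loop: train a model, score the
    development pool, and move the `batch` least-confident sentences
    into the training set before the next round."""
    history = []
    for _ in range(iterations):
        model = fit(train)
        history.append(evaluate(model, test))
        # Ascending confidence: the most uncertain sentences come first.
        scored = sorted(dev, key=lambda s: confidence(model, s))
        picked, dev = scored[:batch], scored[batch:]
        train = train + picked
    return history
```

With toy callables (a "model" that is just the training-set size, sentences scored by their own value), the loop behaves as expected.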
We use the standard metrics of recall, precision and
F-measure to evaluate the performance of our system.
These metrics are defined below:
Recall is the ratio of the number of correctly tagged
entities and the total number of entities.
Recall = number of correctly tagged entities / total number of entities
Precision is the ratio of the number of correctly tagged
entities and the total number of tagged entities.
Table 2 Statistics of the datasets used for the Indian languages

Language | # Tokens in training | # Tokens in dev | # Tokens in test
Bengali  | 277,611              | 35,336          | 37,053
Hindi    | 455,961              | 47,218          | 32,796
Table 1 Orthographic features

Feature       | Example         | Feature         | Example
InitCap       | Src             | AllCaps         | EBNA, LMP
InCap         | mAb             | CapMixAlpha     | NFkappaB, EpoR
DigitOnly     | 1, 123          | DigitSpecial    | 12-3
DigitAlpha    | 2xNFkappaB, 2A  | AlphaDigitAlpha | IL23R, EIA
Hyphen        | -               | CapLowAlpha     | Src, Ras, Epo
CapsAndDigits | 32Dc13          | RomanNumeral    | I, II
StopWord      | at, in          | ATGCSeq         | CCGCCC, ATAGAT
AlphaDigit    | p50, p65        | DigitCommaDigit | 1,28
GreekLetter   | alpha, beta     | LowMixAlpha     | mRNA, mAb

6 http://ltrc.iiit.ac.in/ner-ssea-08
7 http://research.nii.ac.jp/collier/workshops/JNLPBA04st.htm
Precision = number of correctly tagged entities / total number of tagged entities

F-measure is the harmonic mean of recall and precision.

F-measure = (2 × recall × precision) / (recall + precision)
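These three metrics reduce to a few lines of arithmetic; a straightforward sketch with illustrative argument names:

```python
def ner_metrics(correct, tagged, total):
    """correct: entities tagged with the right type; tagged: all entities
    the system emitted; total: gold-standard entities."""
    recall = correct / total
    precision = correct / tagged
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```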
For the biomedical domain, we executed the JNLPBA 2004 shared task evaluation script (see footnote 8). The script outputs three sets of
F-measures according to exact, right and left boundary
matches.
5.2 Results of SVM based active annotation technique for Indian languages
We trained an SVM model with the feature set mentioned in Sect. 4. We consider various combinations from the set of features given by F1 = {w_{i-m}, ..., w_{i-1}, w_i, w_{i+1}, ..., w_{i+n}; feature vector consisting of root word, prefix and suffix, first word, infrequent word, digit, content words, and dynamic NE information}. We observed the best performance with the context of w_{i-1}, w_i, w_{i+1}, and thus only report its results.
Results for Bengali: We conducted active learning
experiments with the thresholds of both 0.1 and 0.2. But
we report the results only with 0.2 threshold value in
Table 3 as it yielded better performance. Here, in each
iteration of the algorithm ten most effective sentences are
added to the training set after removing from the devel-
opment set.
The highest performance obtained with this method has
the recall, precision and F-measure values of 86.80, 87.84
and 87.317 %, respectively. This highest performance is
obtained at the seventh iteration, and it does not improve
further in the subsequent iterations. This is actually a
marginal improvement over the first iteration. However, it nevertheless demonstrates the effectiveness of our proposed approach.
We also develop a baseline model, where in each iter-
ation ten sentences are randomly chosen from the devel-
opment set and added to the training set. Results of this
baseline show the recall, precision and F-measure values of
86.77, 87.80 and 87.28 %, respectively.
Results on Hindi: The proposed technique is evaluated
on the Hindi language, and its results are shown in Table 4.
The Hindi dataset is highly unbalanced, and we sample it
by removing the sentences that don’t contain any NEs. The
system attains the highest performance of recall, precision
and F-measure values of 87.12, 88.54 and 87.82 %,
respectively, in the seventh and eighth iteration. This is
actually an improvement of 4.21 % F-measure points over
the first iteration. The baseline model, where in each
iteration ten sentences were selected randomly showed the
recall, precision and F-measure values of 86.23, 87.77 and
86.99 %, respectively. These results again show the effi-
cacy of the proposed technique.
Results on English: For English, we use the CoNLL-
2003 benchmark datasets [40]. We trained with CoNLL-
2003 training data and with the same set of features as we
used for the Indian languages except the Last word of
sentence feature. But for English, we use two additional
features, first one checks capitalization information, and
the second one denotes the chunk information. We use
a context window within the previous four and next four words, i.e. w_{i-4}^{i+4} = w_{i-4}...w_{i+4} of w_i, and word suffixes and prefixes of up to four characters (4 + 4 different features). Experimental results for this dataset are presented in Table 5. It shows the overall recall, precision and
F-measure values of 87.16, 88.50 and 87.82 %, respec-
tively. This is actually an improvement of 2.06 F-measure
points over the first iteration. The baseline model, where in
each iteration ten sentences were selected randomly
Table 3 Evaluation results of SVM based AL on Bengali with
threshold 0.2 (in terms of percentage)
Iteration Recall Precision F-measure
0 (initial) 86.75 87.77 87.258
1 86.754 87.81 87.279
2 86.75 87.82 87.280
3 86.75 87.82 87.281
4 86.74 87.83 87.283
5 86.76 87.83 87.290
6 86.76 87.84 87.297
7 86.76 87.84 87.298
8 86.76 87.84 87.309
9 86.78 87.84 87.317
10 86.80 87.84 87.317
Table 4 Evaluation results of SVM based AL on Hindi with
threshold 0.2 (in terms of percentage)
Iteration Recall Precision F-measure
0 82.98 84.25 83.61
1 83.10 84.34 83.715
2 83.51 84.71 84.11
3 83.96 85.01 84.48
4 84.01 85.27 84.64
5 85.05 86.28 85.66
6 86.08 87.49 86.78
7 87.12 88.54 87.82
8 87.12 88.54 87.82
8 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html.
showed the recall, precision and F-measure values of
86.13, 87.56 and 86.84 %, respectively. These results again
show the efficacy of the proposed technique.
Results on biomedical texts: We train a SVM model with
the feature set mentioned in Sect. 4. We observe the best
performance with the context of previous three and next
three tokens, and thus only report its results. Results are
reported with the following feature combinations:
(1) Contexts within the previous and next three tokens, i.e. w_{i-3}^{i+3} = w_{i-3}...w_{i+3}; (2) word suffixes and prefixes of up to three characters (3 + 3 different features); (3) PoS information of the current token; (4) chunk information of the current token; (5) dynamic NE information; (6) word normalization; (7) word length; (8) infrequent word; (9) unknown tokens; (10) head nouns (unigram and bigram); (11) verb trigger; (12) word class; (13) informative NE information; (14) orthographic features; and (15) all features within the context of w_{i-1} to w_{i+1}.
We experiment with the selection criterion that only adds
the current sentence to the training set. We report the
results with threshold value of 0.2 in Table 6 as it yields
better results compared to the threshold of 0.1. It shows the
highest recall, precision and F-measure values of 76.2, 70.1
and 73.02 %, respectively. This shows improvements of
1.43, 1.39 and 1.41 % in recall, precision and F-measure,
respectively over the first iteration.
The results of the baseline model as defined previously
show the recall, precision and F-measure values of 75.37,
69.18 and 72.14 %, respectively. This is lower in com-
parison to our proposed approach by 0.83, 0.92 and 0.88 %
of recall, precision and F-measure values, respectively.
Hence the proposed method works effectively even for the
biomedical domain.
5.3 Results on ensemble based active annotation
In this section we present the results of the proposed
ensemble based active learning.
Results on Indian language NER: We train both CRF
and SVM with the feature set mentioned in Sect. 4. At first
we evaluate the active learning technique proposed in [5]
for CRF based classifier on the Bengali data. Its results are
shown in Table 7.
The system attains the highest performance with the
recall, precision and F-measure values of 87.76, 89.34 and
88.55 %, respectively. This is actually an improvement of
around 0.913 % F-measure points over the first iteration.
The baseline model showed the recall, precision and
F-measure values of 87.22, 88.96 and 88.08 %, respec-
tively. We already have the results for SVM as shown in
Table 3.
The learning curve comparing the SVM based supervised model (i.e. the baseline model, where at each
Table 5 Evaluation results of SVM based AL on English with
threshold 0.2
Iteration Recall Precision F-measure
0 85.39 86.12 85.76
1 85.90 86.65 86.27
2 85.98 86.90 86.44
3 86.02 87.01 86.51
4 86.15 87.14 86.64
5 86.32 87.55 86.93
6 86.54 87.81 87.17
7 86.90 88.15 87.52
8 86.99 88.32 87.64
9 87.16 88.50 87.82
10 87.16 88.50 87.82
Table 6 Evaluation results of SVM based AL for biomedical texts
with threshold = 0.2
Iteration number Recall Precision F-measure
1 74.77 68.71 71.61
2 74.90 69.20 71.94
3 75.01 69.50 72.15
4 75.10 69.81 72.36
5 75.5 69.95 72.62
6 75.82 70.02 72.80
7 76.2 70.10 73.02
8 76.2 70.10 73.02
9 76.2 70.10 73.02
10 76.2 70.10 73.02
Table 7 Evaluation results of CRF based AL for Bengali with
threshold value of 0.2
Iteration Recall Precision F-measure
0 87.02 88.26 87.63
1 87.51 89.07 88.28
2 87.56 89.05 88.30
3 87.63 89.10 88.36
4 87.63 89.20 88.41
5 87.67 89.21 88.43
6 87.72 89.26 88.48
7 87.76 89.26 88.51
8 87.76 89.32 88.54
9 87.76 89.34 88.55
10 87.76 89.34 88.55
iteration ten sentences are selected randomly) and SVM
based active learning approach with the same amount of
training data is shown in Fig. 3. Similarly, the learning curve between the CRF based baseline model and the CRF based active learning model is shown in Fig. 3. The characteristics of the learning curves show that with the same annotation
effort (in each case we are selecting ten sentences to be
added in the training set) we gain more performance
improvements with active learning.
Thereafter we evaluate the proposed ensemble based
technique which combines the outputs of both CRF and
SVM based classifiers for the Bengali data. Results are
shown in Table 8. The system attains the highest
performance of recall, precision and F-measure values of
87.99, 89.99 and 88.97 %, respectively. This shows an
improvement of around 0.657 % F-measure points. The
baseline model shows the recall, precision and F-measure
values of 87.72, 89.26 and 88.48 %, respectively.
Table 9 compares the results between the (1) ensemble
and SVM based classifier, and (2) ensemble and CRF based
classifier for Bengali. In each of the cases, the system attains the highest performance in the ninth iteration, with F-measure values of 87.32, 88.55 and 88.97 %, respectively. These are improvements of 1.65 and 0.42 % over the active learning techniques based on SVM and CRF classifiers, respectively.
At first we evaluated the active learning technique using the CRF based classifier [5] on the Hindi data, with the same set of features as for Bengali. Results on Hindi for CRF are shown in Table 10 (see footnote 9). The system attains
the highest recall, precision and F-measure values of 87.26,
88.55 and 87.89 %, respectively. These were obtained in
Fig. 3 Learning curves for Bengali data comparing a SVM based baseline and active learning approaches, b CRF based baseline and active learning approaches
Table 8 Evaluation results of ensemble based AL on Bengali with
threshold 0.2
Iteration Recall Precision F-measure
0 87.58 89.11 88.32
1 87.58 89.16 88.36
2 87.58 89.18 88.37
3 87.63 89.20 88.41
4 87.65 89.19 88.41
5 87.72 89.26 88.48
6 87.76 89.26 88.51
7 87.83 89.35 88.58
8 87.83 89.39 88.60
9 87.99 89.99 88.97
Table 9 Comparison results for Bengali using ensemble approach v/s
single classifiers with threshold 0.2
Iteration SVM CRF Ensemble
0 87.26 87.63 88.32
1 87.28 88.28 88.36
2 87.28 88.30 88.37
3 87.28 88.36 88.41
4 87.28 88.41 88.41
5 87.29 88.43 88.48
6 87.29 88.48 88.51
7 87.29 88.54 88.58
8 87.31 88.55 88.60
9 87.32 88.55 88.97
9 We iterate the algorithm for more than 10 iterations as we observed
performance improvement even in the 10th iteration.
the tenth and eleventh iterations. This is actually an
improvement of around 1.91 % F-measure points over the
first iteration. The baseline model showed the recall, pre-
cision and F-measure values of 85.72, 87.06 and 86.39 %,
respectively. Thus the CRF based active learning technique
attains an improvement of 1.51 % over the baseline. These
results prove that CRF based active learning technique is
clearly superior to the individual baseline models.
The learning curve comparing the SVM based baseline model and the active learning model with the same amount of training data is shown in Fig. 4. Similarly, the learning curve for the CRF based baseline and CRF based active learning is shown in Fig. 4. These curves show that with the same annotation effort we gain several additional points in the F-measure value.
Thereafter we evaluate the proposed ensemble based
active learning technique for the Hindi data. Results are
shown in Table 11. The system attains the highest recall,
precision and F-measure values of 88.01, 88.99 and
88.50 %, respectively. This is obtained in the tenth itera-
tion and remains unaltered in the next iteration. This is an
improvement of around 2.58 % F-measure points over the
first iteration. The baseline model, where in each iteration
ten sentences were selected randomly show the recall,
precision and F-measure values of 85.46, 86.79 and
86.17 %, respectively. This proves that the proposed
ensemble based active learning technique is superior to the baseline approach.
Table 12 compares the results of ensemble classifier
with the results of SVM based active learner and CRF
based active learner for Hindi. In each of the cases, the
system attains the highest performance of 87.82, 87.89 and
88.50 % in the 7th, 10th and 10th iteration, respectively.
For Hindi, we observe an improvement with the ensemble approach: it shows increments of 0.61 and 0.68 %, respectively, over the CRF and SVM based active learning
techniques.
Results on English: Here at first we evaluate the active
learning technique proposed in [5] for CRF based classifier
on the English data. Results are shown in Table 13. The
system attains the highest recall, precision and F-measure
values of 87.58, 88.98 and 88.27 %, respectively. In base-
line model we observed the recall, precision and F-measure
values of 86.90, 88.01 and 87.45 %, respectively.
Thereafter we evaluate the proposed ensemble based
technique which combines the outputs of both CRF and
SVM based classifiers. Results are shown in Table 14. The
system attains the highest recall, precision and F-measure
values of 88.52, 88.86 and 88.69 %, respectively.
Fig. 4 Learning curves for Hindi data set comparing a SVM based baseline and active learning approaches, b CRF based baseline and active learning approaches
Table 10 Evaluation results of CRF alone on Hindi with threshold
0.2
Iteration Recall Precision F-measure
0 85.32 86.65 85.98
1 85.72 87.06 86.39
2 85.99 87.33 86.65
3 86.16 87.53 86.84
4 86.26 87.66 86.95
5 86.42 87.79 87.10
6 86.66 88.03 87.34
7 86.79 88.12 87.44
8 86.96 88.28 87.61
9 86.99 88.28 87.63
10 87.26 88.55 87.89
11 87.26 88.55 87.89
In Table 15 we present the comparisons of the ensemble
approach with individual CRF and SVM based approaches
for English data. In each of the cases, the system attains the
highest performing F-measure values of 87.82, 88.27 and
88.69 %, respectively. The table shows that the system
achieves this performance in the 9th iteration. Like other
languages, the ensemble obtains better accuracy over the
individual models. Comparisons with the other systems for
English are shown in Table 16.
For the benchmark English dataset, our proposed system
achieves performance comparable to the best performing system [44] of the CoNLL-2003 shared task. The best
system [44] at CoNLL-2003 shared task demonstrated the
recall, precision and F-measure values of 88.54, 88.99 and
88.76 %, respectively. They used an ensemble learner with
many domain dependent resources and/or tools. In contrast,
our proposed algorithm (1) makes use of the features that
Table 11 Evaluation results of AL using ensemble approach on
Hindi data with threshold 0.2 (RPF values; we report percentage
results)
Iteration Recall Precision F-measure
0 85.32 86.65 85.98
1 86.06 87.28 86.66
2 86.09 87.39 86.74
3 86.26 87.59 86.92
4 86.42 87.77 87.09
5 86.59 87.94 87.26
6 86.69 87.98 87.33
7 86.96 88.21 87.59
8 87.24 88.56 87.87
9 87.57 88.91 88.23
10 88.01 88.99 88.50
11 88.01 88.99 88.50
Table 12 Comparisons for Hindi using ensemble approach v/s single
classifier with threshold 0.2
Iteration SVM CRF Ensemble
0 83.61 85.98 85.98
1 83.72 86.39 86.66
2 84.11 86.65 86.74
3 84.48 86.84 86.92
4 84.64 86.95 87.09
5 85.66 87.10 87.26
6 86.78 87.34 87.33
7 87.82 87.44 87.56
8 87.82 87.61 87.87
9 87.82 87.63 88.23
10 87.82 87.89 88.50
Table 13 Evaluation results of CRF based AL on English with
threshold 0.2
Iteration Recall Precision F-measure
0 85.74 86.03 85.89
1 85.91 86.65 86.28
2 85.99 86.91 86.45
3 86.11 87.03 86.57
4 86.41 87.23 86.82
5 86.51 87.41 86.96
6 86.67 87.78 87.22
7 86.91 87.92 87.41
8 87.21 88.14 87.67
9 87.58 88.98 88.27
10 87.58 88.98 88.27
Table 14 Evaluation results of ensemble based AL on English data
with threshold 0.2
Iteration Recall Precision F-measure
0 86.01 86.63 86.32
1 86.41 86.91 86.66
2 86.72 87.05 86.88
3 86.91 87.34 87.12
4 87.04 87.61 87.32
5 87.51 87.80 87.65
6 87.67 87.98 87.82
7 87.88 88.05 87.96
8 88.01 88.25 88.13
9 88.52 88.86 88.69
10 88.52 88.86 88.69
Table 15 Comparison results of AL on English data using ensemble
approach v/s single classifiers with threshold 0.2
Iteration SVM CRF Ensemble
0 85.76 85.89 86.32
1 86.27 86.28 86.66
2 86.44 86.45 86.88
3 86.51 86.57 87.12
4 86.64 86.82 87.32
5 86.93 86.96 87.65
6 87.17 87.22 87.82
7 87.52 87.41 87.96
8 87.64 87.67 88.13
9 87.82 88.27 88.69
10 87.82 88.27 88.69
can be derived for any language with little effort, (2) does not make use of any domain dependent resources such as gazetteers, and (3) does not make use of any additional NE taggers, but still achieves state-of-the-art performance, only 0.07 F-measure points below the best system of CoNLL-2003.
Until now, the best reported results for CoNLL-2003
shared task data are those of Lin and Wu [42], who proposed a semi-supervised approach for NER. They obtained the
F-measure value of 90.90 %, which is 2.21 points higher
than our proposed system. In addition to the above men-
tioned two systems [42, 44], we also present the compar-
isons with some other well-known existing techniques in
Table 16. Suzuki and Isozaki [43] run a baseline discrim-
inative classifier on unlabeled data to generate pseudo
examples, which are then used to train a different type of
classifier for the same problem. Later on, they used the
automatically labeled corpus to train hidden Markov model
(HMMs). Chieu and Ng [45] showed how the use of global
information, in addition to the local ones, can improve the
model performance. It is to be noted that our system
achieves 6.00 points higher F-measure value in comparison
to the stacked, voted model, proposed by Wu et al. [47] in
the CoNLL-2003 shared task.
Results on biomedical texts: Here at first we execute the
CRF based active learning technique (described in [5]) on
the biomedical data. Results are shown in Table 17. We
trained a CRF model with the feature set mentioned in
Sect. 4.
The system attains the highest recall, precision and
F-measure values of 76.50, 73.00 and 74.80 %, respec-
tively. This was obtained in the ninth iteration, and accu-
racy does not change thereafter. This is actually an
improvement of around 1.57 % F-measure points over the
first iteration. The baseline model, where in each iteration
ten sentences were selected randomly showed the recall,
precision and F-measure values of 74.57, 72.30 and
73.42 %, respectively.
Figure 5 shows the learning curves demonstrating the
comparisons between SVM based baseline vs. SVM based
AL and CRF based baseline vs. CRF based AL. This
illustrates the effectiveness of active learning based tech-
niques where with the same annotation efforts we achieve
better accuracies.
Thereafter we evaluate the proposed ensemble based
technique for the biomedical data. Results are shown in
Table 18. The system attains the highest recall, precision
and F-measure values of 76.80, 74.95 and 75.86 %,
respectively. This is better compared to the accuracies
obtained in the first iteration and the baseline model. Table
19 compares the results of ensemble approach and CRF
based active learner and SVM based active learner. The
proposed ensemble based active learning technique attains
1.06 and 2.84 % F-measure improvements over the CRF and SVM based active learning techniques, respectively.
The results clearly indicate that ensemble indeed achieves
better performance.
Comparison with existing biomedical NER systems: We
compare with the systems reported in the JNLPBA 2004
shared task as well as with those that were developed at the
later stages but made use of the same datasets. We present
the comparative evaluation results in Table 20 not only
with the domain-independent systems but also with the
systems that incorporate domain knowledge and/or exter-
nal resources.
GuoDong and Jian [48] developed the best system in the
JNLPBA 2004 shared task. This system provides the
highest F-measure value of 72.55 % using several sources of deep domain knowledge. Song et al. [49] used both CRF and SVM, and
obtained the F-measure of 66.28 % with virtual samples.
The HMM-based system reported by Ponomareva et al.
[50] achieved an F-measure value of 65.7 % with PoS and
phrase-level domain dependent knowledge. A maximum
entropy (ME)-based system was reported in [51] where
recognition of terms and their classification were per-
formed in two steps. They achieved an F-measure value of
66.91 % with several lexical knowledge sources such as
Table 16 Comparisons with some existing systems for English NER

System                  | F-measure (in %)
Lin and Wu [42]         | 90.90
Suzuki and Isozaki [43] | 89.92
Florian et al. [44]     | 88.76
Our proposed system     | 88.69
Chieu and Ng [45]       | 88.31
Klein et al. [46]       | 86.31
Wu et al. [47]          | 82.69
Table 17 Evaluation results of CRF based active learning technique
with threshold = 0.2 for biomedical data
Iteration number Recall Precision F-measure
1 74.5 72.0 73.23
2 74.6 72.1 73.33
3 74.9 72.1 73.47
4 75.1 72.1 73.57
5 75.5 72.1 73.76
6 76.0 72.5 74.21
7 76.2 72.8 74.46
8 76.5 73.0 74.7
9 76.5 73.0 74.7
10 76.5 73.0 74.8
salient words obtained through corpus comparison between
domain-specific and WSJ corpora, morphological patterns
and collocations extracted from the Medline corpus. To the best of our knowledge, one of the very recent works
proposed in [38] obtained the F-measure value of 67.41 %
with PoS and phrase information as the only domain
knowledge. This is the highest performance achieved by
any system that did not use any deep domain knowledge.
A CRF-based NER system has been reported in [52] that
obtained the F-measure value of 70 % with orthographic
features, semantic knowledge in the form of 17 lexicons
generated from the public databases and Google sets.
Finkel et al. [53] reported a CRF-based system that showed
the F-measure value of 70.06 % with the use of a number
of external resources, including gazetteers, web-querying,
surrounding abstracts, abbreviation handling method, and
frequency counts from the BNC corpus. A two-phase
model based on ME and CRF was proposed by Kim et al.
[54] that achieved an F-measure value of 71.19 % by post-
processing the outputs of machine learning models with a
rule-based component.
Our proposed ensemble based active learning technique
attains the average recall, precision and F-measure values
of 76.80, 74.95 and 75.86 %, respectively. This is on par with existing state-of-the-art systems.
We also compare the performance of our proposed
ensemble based active learning approach with the state-of-the-art biomedical NER system BANNER [55], which was
implemented using CRFs. BANNER exploits a range of
orthographic, morphological and shallow syntax features,
such as part-of-speech tags, capitalisation, letter/digit
combinations, prefixes, suffixes and Greek letters. Com-
parisons between the several existing NER systems are
provided in [56]. For BANNER, Kabiljo et al. [56] reported
the F-measure values of 77.50 and 61.00 % under the
sloppy matching and strict matching criterion, respectively
with the JNLPBA shared task datasets.
[Fig. 5 Learning curves for biomedical data comparing (a) SVM-based baseline and active learning approaches, (b) CRF-based baseline and active learning approaches. Axes: iterations 1–10 vs. F-measure; plots omitted.]
Table 18 Evaluation results of AL using ensemble approach on biomedical data with threshold 0.2

Iteration  Recall  Precision  F-measure
1          75.30   73.20      74.23
2          75.60   73.51      74.54
3          75.72   73.80      74.75
4          75.81   73.93      74.86
5          75.95   73.99      74.96
6          76.02   74.05      75.02
7          76.50   74.80      75.64
8          76.62   74.91      75.76
9          76.80   74.95      75.86
10         76.80   74.95      75.86
Table 19 Comparison results of AL on biomedical data using ensemble approach vs. single classifier with threshold 0.2 (F-measure)

Iteration  SVM    CRF    Ensemble
1          71.61  73.23  74.23
2          71.94  73.33  74.54
3          72.15  73.47  74.75
4          72.36  73.57  74.86
5          72.62  73.76  74.96
6          72.80  74.21  75.02
7          73.02  74.46  75.64
8          73.02  74.70  75.76
9          73.02  74.70  75.86
10         73.02  74.80  75.86
Note that the proposed active learning technique achieves better results than the other existing systems just after the first iteration. This is because of the use of a diverse set of features. It is to be noted, however, that in our system we identify and implement features without using any domain knowledge and/or resources. Note that initially we trained the systems using training data containing 450 K word forms.
Comparison with existing active annotation techniques: We compare the results obtained using the proposed active learning techniques with some existing active learning based techniques. In [27] the authors proposed a multi-criteria-based active learning technique for NER and evaluated it on biomedical and English data sets, achieving average F-measures of 83.3 % for the English data and 63.3 % for the biomedical data. Our proposed approach attains F-measures of 88.69 % for the English data and 75.86 % for the biomedical data, i.e. improvements of 5.39 and 12.56 points over the approach proposed in [27], respectively. One possible explanation for this considerable performance improvement is the use of rich features (described in Sect. 4).
We also execute the CRF-based active annotation algorithm proposed in [25] on the Bengali and Hindi data sets. It attains F-measure values of 87.95 and 86.51 % for Bengali and Hindi, respectively, whereas our proposed ensemble technique attains final F-measure values of 88.97 and 88.50 %.
6 Conclusion and future work
In this paper we have progressively proposed two methods for active annotation that could be helpful for many applications where labeled data are scarce and their creation involves considerable time and expense. We have proposed two algorithms, one based on SVM and the other based on ensemble learning. Based on SVM, we devised a method to select the uncertain examples to be added to the initial training set; the uncertain examples were selected based on their distance from the separating hyperplane between the two classes. The ensemble approach combines both SVM and CRF. For CRF, the uncertain samples are selected based on the marginal probabilities. The ensemble utilizes both concepts, viz. the distance from the separating hyperplane and the marginal probability. The proposed system is evaluated on the problem of NER, an important pipelined module in many NLP application areas. Experiments were conducted on two resource-poor Indian languages, namely Bengali and Hindi. In addition, the systems have also been evaluated on English and biomedical texts. We obtain good accuracies for all the domains, and the ensemble method clearly outperforms the SVM-based method.
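As a rough illustration of how the two uncertainty views can be combined, the sketch below pairs an SVM's distance-to-hyperplane criterion with a marginal-probability criterion. It is a minimal sketch under our own assumptions: the toy 2-D data, the logistic regression standing in for the CRF's marginals, the AND-combination rule and the batch size of 10 are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy two-class data standing in for NE vs. non-NE instances:
# class 0 centred at (-1, -1), class 1 at (+1, +1).
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Small labeled seed set (both classes) and a large unlabeled pool.
lab = np.r_[0:10, 100:110]
pool = np.setdiff1d(np.arange(200), lab)
X_lab, y_lab, X_pool = X[lab], y[lab], X[pool]

# SVM view: uncertainty = closeness to the separating hyperplane.
svm = SVC(kernel="linear").fit(X_lab, y_lab)
margin_dist = np.abs(svm.decision_function(X_pool))

# CRF view (stand-in): confidence of the most likely label. A real CRF
# would supply token-level marginal probabilities over a tag sequence;
# here a logistic regression plays that role for illustration only.
lr = LogisticRegression().fit(X_lab, y_lab)
marginal_conf = lr.predict_proba(X_pool).max(axis=1)

# Ensemble criterion: query instances that BOTH views find uncertain,
# i.e. near the hyperplane AND with a low best-label marginal.
threshold = 0.2  # same threshold value as used in Tables 18-19
uncertain = (margin_dist < threshold) & (marginal_conf < 0.5 + threshold)
batch = np.where(uncertain)[0][:10]  # query at most 10 labels per round
```

The selected indices in `batch` would then be annotated and moved from the pool into the training set before the next iteration.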
This is a highly accurate and scalable technique which can easily be used in other information extraction problems. The high accuracy is due to the maximum-margin nature of SVMs and to the capability of CRFs to model correlations between neighboring output tags. The system is scalable for training the SVM and CRF classifiers, and is easy to use because it utilizes existing software in a straightforward way.
Table 20 Comparison with the existing approaches (FM = F-measure, %)

System                  | Approach                                      | Domain knowledge/resources                                                               | FM
Our proposed system     | Ensemble-based active learning (CRF and SVM)  | POS, phrase                                                                              | 75.86
Zhou and Su [48] final  | HMM, SVM                                      | Name alias, cascaded NE dictionary, POS, phrase                                          | 72.55
Zhou and Su [48]        | HMM, SVM                                      | POS, phrase                                                                              | 64.10
Kim et al. [54]         | Two-phase model with ME and CRF               | POS, phrase, rule-based component                                                        | 71.19
Finkel et al. [53]      | CRF                                           | Gazetteers, web-querying, surrounding abstracts, POS, abbreviation handling, BNC corpus  | 70.06
Settles [52]            | CRF                                           | POS, semantic knowledge sources of 17 lexicons                                           | 70.00
Saha et al. [38]        | ME                                            | POS, phrase                                                                              | 67.41
Park et al. [51]        | ME                                            | POS, phrase, domain-salient words using WSJ, morphological patterns, collocations from Medline | 66.91
Song et al. [49] final  | SVM, CRF                                      | POS, phrase, virtual sample                                                              | 66.28
Song et al. [49] base   | SVM                                           | POS, phrase                                                                              | 63.85
Ponomareva et al. [50]  | HMM                                           | POS                                                                                      | 65.70
More work can be done in this area using more than two classifiers. Apart from this, genetic algorithm and multi-objective optimization based feature selection techniques will be employed for determining an appropriate set of features. The proposed approach will also be applied to other information extraction problems in natural language and biomedicine.
References
1. Dligach D, Palmer M (2011) Good seed makes a good crop:
accelerating active learning using language modeling. In: Proceedings of the 49th annual meeting of the Association for Computational Linguistics: short papers, Portland, Oregon. Association for Computational Linguistics, pp 6–10
2. Dligach D, Palmer M (2009) Using language modeling to select
useful annotation data. In: Proceedings of human language
technologies, Portland, Oregon. Association for Computational
Linguistics, pp 25–30
3. Laws F, Heimerl F, Schütze H (2012) Active learning for coreference resolution. In: 2012 conference of the North American
chapter of the association for computational linguistics: human
language technologies, Montreal, Canada. Association for Com-
putational Linguistics, pp 508–512
4. Settles B (2009) Active learning literature survey. In: Computer
sciences technical report 1648
5. Ekbal A, Bonin F, Saha S, Stemle E, Barbu E, Cavulli F, Girardi
C, Nardelli F, Poesio M (2012) Rapid adaptation of ne resolvers
for humanities domains using active annotation. J Lang Technol
Comput Linguist (JLCL) 26(2):26–38
6. Small K, Roth D (2010) Margin-based active learning for struc-
tured predictions. Int J Mach Learn Cybern 1(1–4):3–25
7. Wang XZ, Dong LC, Yan JH (2012) Maximum ambiguity-based
sample selection in fuzzy decision tree induction. IEEE Trans
Knowl Data Eng 24(8):1491–1505
8. Settles B (2008) Curious machines: active learning with struc-
tured instances. PhD thesis, University of Wisconsin-Madison
9. Tong S (2001) Active learning: theory and applications. PhD
thesis, Stanford University
10. Monteleoni C (2006) Learning with online constraints: shifting
concepts and active learning. PhD thesis, Massachusetts Institute
of Technology
11. Olsson F (2008) Bootstrapping named entity recognition by
means of active machine learning. PhD thesis, University of
Gothenburg
12. Olsson F (2009) A literature survey of active machine learning in
the context of natural language processing. In: Technical report
t2009:06, Swedish Institute of Computer Science
13. Schein AI, Ungar LH (2007) Active learning for logistic
regression: an evaluation. Mach Learn 68(3):235–265
14. Baldridge J, Palmer A (2009) How well does active learning
actually work? Time-based evaluation of cost-reduction strategies
for language documentation. In: Proceedings of the 2009 con-
ference on empirical methods in natural language processing
(EMNLP ’09) vol 1, Stroudsburg. Association for Computational
Linguistics, pp 296–305
15. Tomanek K, Olsson F (2009) A web survey on the use of active
learning to support annotation of text data. In: Proceedings of the
NAACL HLT 2009 workshop on active learning for natural
language processing, HLT ’09, Stroudsburg. Association for
Computational Linguistics, pp 45–48
16. Dasgupta S (2004) Analysis of a greedy active learning strategy.
In: Advances in neural information processing systems. MIT
Press, USA, pp 337–344
17. Balcan MF, Hanneke S, Vaughan J (2010) The true sample
complexity of active learning. Mach Learn 80(2–3):111–139
18. Settles B, Craven M (2008) An analysis of active learning
strategies for sequence labeling tasks. In: Proceedings of the
conference on empirical methods in natural language processing
(EMNLP’08), Stroudsburg. Association for Computational Lin-
guistics, pp 1070–1079
19. Reichart R, Tomanek K, Hahn U, Rappoport A (2008) Multi-task
active learning for linguistic annotations. In: Proceedings of
ACL-08: HLT, Columbus, Ohio. Association for Computational
Linguistics, pp 861–869
20. Riloff E, Jones R (1999) Learning dictionaries for information
extraction by multi-level bootstrapping. In: Proceedings of the
sixteenth national conference on artificial intelligence and the
eleventh innovative applications of artificial intelligence confer-
ence innovative applications of artificial intelligence (AAAI’99/
IAAI ’99), Menlo Park. American Association for Artificial
Intelligence, pp 474–479
21. Cucchiarelli A, Velardi P (2001) Unsupervised named
entity recognition using syntactic and semantic contextual evi-
dence. Comput Linguist 27(1):123–131
22. Etzioni O, Cafarella M, Downey D, Popescu AM, Shaked T,
Soderland S, Weld DS, Yates A (2005) Unsupervised
named-entity extraction from the web: an experimental study.
Artif Intell 165(1):91–134
23. Tomanek K, Hahn U (2009) Reducing class imbalance during
active learning for named entity annotation. In: Proceedings of
the fifth international conference on knowledge capture (K-
CAP’09), New York. ACM, pp 105–112
24. Becker M, Hachey B, Alex B, Grover C (2005) Optimising selec-
tive sampling for bootstrapping named entity recognition. In: Pro-
ceedings of the ICML workshop on learning with multiple views,
pp 5–11
25. Yao L, Sun C, Li S, Wang X, Wang X (2009) CRF-based active learning for Chinese named entity recognition. In: SMC, IEEE,
pp 1557–1561
26. Laws F, Schütze H (2008) Stopping criteria for active learning of
named entity recognition. In: Proceedings of the 22nd interna-
tional conference on computational linguistics (COLING’08), vol
1, Stroudsburg. Association for Computational Linguistics,
pp 465–472
27. Shen D, Zhang J, Su J, Zhou G, Tan CL (2004) Multi-criteria-
based active learning for named entity recognition. In: Proceed-
ings of the 42nd annual meeting on association for computational
linguistics (ACL’04), Stroudsburg. Association for Computa-
tional Linguistics
28. Ekbal A, Naskar S, Bandyopadhyay S (2007) Named entity
recognition and transliteration in Bengali. Named Entities: Recognition, Classification and Use, special issue of Lingvisticae Investigationes 30(1):95–114
29. Ekbal A, Bandyopadhyay S (2009) A conditional random field
approach for named entity recognition in Bengali and Hindi.
Linguist Issues Lang Technol (LiLT) 2(1):1–44
30. Li W, McCallum A (2004) Rapid development of Hindi named
entity recognition using conditional random fields and feature
induction. ACM Trans Asian Lang Inf Process 2(3):290–294
31. Srikanth P, Murthy KN (2008) Named entity recognition for
Telugu. In: Proceedings of the IJCNLP-08 workshop on NER for
South and South East Asian languages, pp 41–50
32. Yao L, Sun C, Wu Y, Wang X, Wang X (2011) Biomedical
named entity recognition using generalized expectation criteria.
Int J Mach Learn Cybern 2(4):235–243
33. Vapnik VN (1995) The nature of statistical learning theory.
Springer-Verlag New York Inc., New York
34. Lafferty JD, McCallum A, Pereira FCN (2001) Conditional ran-
dom fields: probabilistic models for segmenting and labeling
sequence data. In: ICML, pp 282–289
35. Collins M, Singer Y (1999) Unsupervised models for named
entity classification. In: Proceedings of the joint SIGDAT con-
ference on empirical methods in natural language processing and
very large corpora
36. Joachims T (1999) Making large scale SVM learning practical.
MIT Press, Cambridge
37. Vlachos A (2006) Active annotation. In: Proceedings of EACL
2006 workshop on adaptive text extraction and mining, Trento
38. Saha SK, Sarkar S, Mitra P (2009) Feature selection techniques
for maximum entropy based biomedical named entity recogni-
tion. J Biomed Inform 42(5):905–911
39. Ekbal A, Bandyopadhyay S (2008) A web-based Bengali news
corpus for named entity recognition. Lang Resour Eval J
42(2):173–182
40. Tjong Kim Sang EF, De Meulder F (2003) Introduction to the
Conll-2003 shared task: language independent named entity
recognition. In: Proceedings of the seventh conference on natural
language learning at HLT-NAACL, pp 142–147
41. Kim J-D, Ohta T, Tsuruoka Y, Tateisi Y (2004) Introduction to
the bio-entity recognition task at JNLPBA. In: Proceedings of the
international joint workshop on natural language processing in
biomedicine and its applications (JNLPBA’04). Association for
Computational Linguistics, pp 70–75
42. Lin D, Wu X (2009) Phrase clustering for discriminative learning.
In: Proceedings of 47th annual meeting of the ACL and the 4th
IJCNLP of the AFNLP, pp 1030–1038
43. Suzuki J, Isozaki H (2008) Semi-supervised sequential labeling
and segmentation using Gigaword scale unlabeled data. In: Pro-
ceedings of ACL/HLT-08, pp 665–673
44. Florian R, Ittycheriah A, Jing H, Zhang T (2003) Named
entity recognition through classifier combination. In: Proceedings
of the seventh conference on natural language learning at HLT-
NAACL
45. Chieu HL, Ng HT (2003) Named entity recognition with a
maximum entropy approach. In: Proceedings of CoNLL-2003,
HLT-NAACL, pp 160–163
46. Klein D, Smarr J, Nguyen H, Manning CD (2003) Named entity
recognition with character-level models. In: Proceedings of
CoNLL-2003, HLT-NAACL, pp 188–191
47. Wu D, Ngai G, Carpuat M (2003) A stacked, voted, stacked model
for named entity recognition. In: Proceedings of the CoNLL-
2003, HLT-NAACL, pp 200–203
48. Zhou G, Su J (2004) Exploring deep knowledge resources in
biomedical name recognition. In: Proceedings of the international
joint workshop on natural language processing in biomedicine
and its applications (JNLPBA ’04), pp 96–99
49. Song Y, Kim E, Lee GG, Yi B (2004) POSBIOTM-NER in the shared task of BioNLP/NLPBA 2004. In: Proceedings of the joint workshop
on natural language processing in biomedicine and its applica-
tions (JNLPBA-2004)
50. Ponomareva N, Pla F, Molina A, Rosso P (2007) Biomedical
named entity recognition: a poor knowledge HMM-based approach. In: NLDB, pp 382–387
51. Park KM, Kim SH, Rim HC, Hwang YS (2004) ME-based biomedical named entity recognition using lexical knowledge. ACM
Trans Asian Lang Inf Process 5:4–21
52. Settles B (2004) Biomedical named entity recognition using
conditional random fields and rich feature sets. In: Proceedings of
the international joint workshop on natural language processing
in biomedicine and its applications (JNLPBA’04). Association
for Computational Linguistics, pp 104–107
53. Finkel J, Dingare S, Nguyen H, Nissim M, Sinclair G, Manning C
(2004) Exploiting context for biomedical entity recognition: from
syntax to the web. In: Proceedings of the joint workshop on
natural language processing in biomedicine and its applications
(JNLPBA-2004), pp 88–91
54. Kim S, Yoon J, Park KM, Rim HC (2005) Two-phase biomedical
named entity recognition using a hybrid method. In: IJCNLP,
pp 646–657
55. Leaman R, Gonzalez G (2008) BANNER: an executable survey
of advances in biomedical named entity recognition. In: Pro-
ceedings of the pacific symposium on biocomputing, pp 652–663
56. Kabiljo R, Clegg AB, Shepherd AJ (2009) A realistic assessment
of methods for extracting gene/protein interactions from free text.
BMC Bioinform 10:233. doi:10.1186/1471-2105-10-233