


Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 58–63, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


AlpacaTag: An Active Learning-based Crowd Annotation Framework for Sequence Tagging

Bill Yuchen Lin†∗  Dong-Ho Lee†∗  Frank F. Xu‡  Ouyu Lan†  and Xiang Ren†
{yuchen.lin,dongho.lee,olan,xiangren}@usc.edu, [email protected]

†Computer Science Department, University of Southern California
‡Language Technologies Institute, Carnegie Mellon University

Abstract

We introduce an open-source web-based data annotation framework (AlpacaTag) for sequence tagging tasks such as named-entity recognition (NER). The distinctive advantages of AlpacaTag are three-fold. 1) Active intelligent recommendation: dynamically suggesting annotations and sampling the most informative unlabeled instances with a back-end actively learned model; 2) Automatic crowd consolidation: enhancing real-time inter-annotator agreement by merging inconsistent labels from multiple annotators; 3) Real-time model deployment: users can deploy their models in downstream systems while new annotations are still being made. AlpacaTag is a comprehensive solution for sequence labeling tasks, ranging from rapid tagging with recommendations powered by active learning and auto-consolidation of crowd annotations to real-time model deployment.

1 Introduction

Sequence tagging is a major type of task in natural language processing (NLP), including named-entity recognition (detecting and typing entity names), keyword extraction (e.g., extracting aspect terms in reviews or essential terms in queries), chunking (extracting phrases), and word segmentation (identifying word boundaries in languages like Chinese). State-of-the-art supervised approaches to sequence tagging depend heavily on large numbers of annotations. New annotations are usually necessary for a new domain or task, even though transfer learning techniques (Lin and Lu, 2018) can reduce the amount needed by reusing data from other related tasks. However, manually annotating sequences can be time-consuming, expensive, and thus hard to scale.

∗Both authors contributed equally.

Figure 1: Overview of the AlpacaTag framework.

Therefore, it remains an important research question how we can develop a better annotation framework that greatly reduces human effort.

Existing open-source sequence annotation tools (Stenetorp et al., 2012; Druskat et al., 2014a; de Castilho et al., 2016; Yang et al., 2018a) mainly focus on enhancing the friendliness of user interfaces (UI), such as data management, fast tagging with shortcut keys, support for more platforms, and multi-annotator analysis. We argue that there are three important yet underexplored directions for improvement: 1) active intelligent recommendation, 2) automatic crowd consolidation, and 3) real-time model deployment. Therefore, we propose a novel web-based annotation tool named AlpacaTag1 to address these three problems.

Active intelligent recommendation (§3) aims to reduce human effort at both the instance level and the corpus level by learning a back-end sequence tagging model incrementally from incoming human annotations. At the instance level, AlpacaTag applies the model's predictions on current unlabeled sentences as tagging suggestions.

1The source code is publicly available at http://inklab.usc.edu/AlpacaTag/.



Apart from that, we also greedily match frequent noun phrases and a dictionary of already annotated spans as recommendations. Annotators can easily confirm or decline such recommendations with our specialized UI. At the corpus level, we use active learning algorithms (Settles, 2009) to select the next batches of instances to annotate based on the model. The goal is to select the most informative instances for humans to tag and thus use human effort more cost-effectively.

Automatic crowd consolidation (§4) of the annotations from multiple annotators is an underexplored topic in the development of crowdsourcing annotation frameworks. As a crowdsourcing framework, AlpacaTag collects annotations from multiple (non-expert) contributors at lower cost and higher speed. However, annotators usually have different confidence levels, preferences, and biases when annotating, which can lead to high inter-annotator disagreement. Training models with such noisy crowd annotations has been shown to be very challenging (Nguyen et al., 2017; Yang et al., 2018b). We argue that consolidating crowd labels during annotation can lead annotators to reach consensus in real time, decreasing annotation disagreement instead of requiring exhausting post-processing.

Real-time model deployment (§5) is also a desirable feature for users. We sometimes need to deploy a state-of-the-art sequence tagging model while crowdsourcing is still ongoing, so that users can advance the development of their tagging-dependent systems through our APIs.

To the best of our knowledge, no existing annotation framework offers all three of these features. AlpacaTag is the first unified framework to address these problems while inheriting the advantages of existing tools. It thus provides a more comprehensive solution to crowdsourcing annotations for sequence tagging tasks.

In this paper, we first present the high-level structure and design of the proposed AlpacaTag framework. Three key features are then introduced in detail: active intelligent recommendation (§3), automatic crowd consolidation (§4), and real-time model deployment (§5). Experiments (§6) demonstrate the effectiveness of the three proposed features. Comparisons with related work are discussed in §7, and §8 presents conclusions and future directions.

2 Overview of AlpacaTag

As shown in Figure 1, AlpacaTag has an actively learned back-end model (top-left) in addition to a front-end web UI (bottom). Thus, we run two separate servers: a back-end model server and a front-end annotation server. The model server is built with PyTorch and supports a set of APIs for communication between the two servers and for model deployment. The annotation server is built on Django in Python and interacts with administrators and annotators.

To start annotating for a domain of interest, admins log in and create a new project. They then import their raw corpus (in CSV or JSON format) and define the tag space with associated colors and shortcut keys. Admins can further set the batch size used to sample instances and to update the back-end models. Annotators can log in and annotate (actively sampled) unlabeled instances with tagging suggestions. We present our three key features in the next sections.
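As a rough illustration of this setup, the snippet below prepares a JSON-lines corpus and a tag-space definition in Python. The field names and file layout are assumptions made for illustration only; the paper does not specify AlpacaTag's exact import schema.

# Hypothetical preparation of an import corpus and tag space for a new project.
# The exact schema AlpacaTag expects is not specified in the paper, so treat
# these field names as illustrative only.
import json

sentences = [
    {"text": "Donald Trump had a dinner with Hillary Clinton in the white house."},
    {"text": "AlpacaTag is a web-based annotation framework for sequence tagging."},
]

with open("raw_corpus.jsonl", "w") as f:      # one JSON object per line
    for s in sentences:
        f.write(json.dumps(s) + "\n")

# Illustrative tag space: each tag gets a display color and a shortcut key.
tag_space = {
    "PER": {"color": "#ff6b6b", "shortcut": "p"},
    "ORG": {"color": "#4dabf7", "shortcut": "o"},
    "LOC": {"color": "#51cf66", "shortcut": "l"},
}
print(json.dumps(tag_space, indent=2))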

3 Active Intelligent Recommendation

This section first introduces the back-end model (§3.1) and then presents how we use the back-end model for both instance-level recommendations (tagging suggestions, §3.2) and corpus-level recommendations (active sampling, §3.3).

3.1 Back-end Model: BLSTM-CRF

The core component of the proposed AlpacaTag framework is the back-end sequence tagging model, which is learned with an incremental active learning scheme. We use a state-of-the-art sequence tagging model as our back-end model (Lample et al., 2016; Lin et al., 2017; Liu et al., 2018), which is based on bidirectional LSTM networks with a CRF layer (BLSTM-CRF). It captures character-level patterns, encodes token sequences with pre-trained word embeddings, and uses a CRF layer to capture structural dependencies in tagging. In this section, we assume the model is fixed while we describe how to use it to infer recommendations; how to update it by consolidating crowd annotations is described in §4.
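For concreteness, here is a minimal PyTorch sketch of the word-level part of such a tagger. It is not AlpacaTag's actual implementation: the character-level encoder and the CRF decoding layer of the full BLSTM-CRF model are omitted, and the linear layer below only produces per-token emission scores over the tag set.

import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Bidirectional LSTM encoder with a linear emission layer (CRF omitted)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags, pretrained=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        if pretrained is not None:                      # optionally load pre-trained word vectors
            self.embed.weight.data.copy_(pretrained)
        self.blstm = nn.LSTM(embed_dim, hidden_dim // 2,
                             batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        h, _ = self.blstm(self.embed(token_ids))        # (batch, seq_len, hidden_dim)
        return self.emissions(h)                        # (batch, seq_len, num_tags)

# Example: emission scores for a batch of 2 sentences of length 12.
scores = BLSTMTagger(20000, 100, 200, 9)(torch.randint(0, 20000, (2, 12)))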

3.2 Tagging Suggestions (Instance-level Rec.)

Apart from the inference results of the back-end model on the sentences, we also include the two other kinds of tagging suggestions shown in Fig. 1: (frequent) noun phrases and a dictionary of already tagged spans.



Figure 2: The workflow for annotators to confirm a given tagging suggestion ("Hillary Clinton" as PER). (a) The sentence and its annotations appear in the upper section; tagging suggestions are shown as underlined spans in the lower section. (b) After clicking a suggested span, a floating window appears nearby for confirming the type (the suggested type is outlined in red). (c) After clicking a suggested type or pressing a shortcut key (e.g., 'p'), the confirmed annotation appears in the upper annotation section.

Specifically, after admins upload raw corpora, AlpacaTag runs a phrase mining algorithm (Shang et al., 2018) to gather a list of frequent noun phrases, which are more likely to be entities. Admins can also optionally enable the framework to consider all noun phrases found by chunking as span candidates. These suggestions are not typed. Additionally, we maintain a dictionary mapping already tagged spans in previous sentences to their majority tagged types. Frequent noun phrases and dictionary entries are matched by a greedy forward maximum matching algorithm. The final tagging suggestions are therefore merged from three sources with the following priority (ordered by their precision): dictionary matches > back-end model inferences > (frequent) noun phrases, while the coverage of the suggestions is in the opposite order.
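A minimal sketch of this suggestion pipeline is shown below: greedy forward maximum matching against a span dictionary, and merging the three suggestion sources by the precision order just described. Function names and the span representation are illustrative, not AlpacaTag's actual API.

def forward_max_match(tokens, span_dict, max_len=6):
    """Greedily match the longest known span starting at each token position."""
    suggestions, i = [], 0
    while i < len(tokens):
        matched = None
        for j in range(min(len(tokens), i + max_len), i, -1):   # longest span first
            span = " ".join(tokens[i:j])
            if span in span_dict:
                matched = (i, j, span_dict[span])                # (start, end, suggested type)
                break
        if matched:
            suggestions.append(matched)
            i = matched[1]
        else:
            i += 1
    return suggestions

def merge_suggestions(dict_spans, model_spans, np_spans):
    """Keep higher-precision sources on overlap: dictionary > model > noun phrases."""
    final, covered = [], set()
    for source in (dict_spans, model_spans, np_spans):
        for start, end, tag in source:
            if not covered.intersection(range(start, end)):
                final.append((start, end, tag))
                covered.update(range(start, end))
    return sorted(final)

tokens = "Donald Trump had a dinner with Hillary Clinton".split()
print(forward_max_match(tokens, {"Hillary Clinton": "PER"}))     # [(6, 8, 'PER')]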

Figure 2 illustrates how AlpacaTag presents tagging suggestions for the sentence "Donald Trump had a dinner with Hillary Clinton in the white house.". The two suggestions are shown in the bottom section as underlined spans "Donald Trump" and "Hillary Clinton" (Fig. 2a). When annotators want to confirm "Hillary Clinton" as a true annotation, they first click the span and then click the right type in the floating window that appears (Fig. 2b). They can also press the customized shortcut key for fast tagging. Note that the PER button is underlined in Fig. 2b, meaning that it is the recommended type for annotators to choose. Fig. 2c shows that after confirmation, the span is added to the final annotations. We want to emphasize that our UI design handles confirming and declining suggestions well: the floating windows greatly reduce mouse movement, and annotators can easily confirm or change types by clicking or pressing shortcuts to correct suggested types.

What if annotators want to tag spans that are not recommended? Normal annotation in AlpacaTag is as friendly as in other tools: annotators simply select the spans they want to tag in the upper section and click or press the associated type (Fig. 3). We implement this UI design based on an open-source Django framework named doccano2.

Figure 3: Annotating a span without recommendations (1. select a span; 2. click the type button or press its shortcut key; 3. get a tag).

3.3 Active Sampling (Corpus-level Rec.)

Tagging suggestions work at the instance level, but which instances we ask annotators to label is also very important for saving human effort. Thus, the research question here is how to actively sample a set of the most informative instances from the pool of unlabeled sentences, which is known as active learning.

2http://doccano.herokuapp.com/



The measure of informativeness is usually specific to an existing model, i.e., which instances (if labeled correctly) can improve the model's performance. A typical example of active learning is Least Confidence (Culotta and McCallum, 2005), which applies the model to all unlabeled sentences and treats the inference confidence as the negative of informativeness: $1 - \max_{y_1,\ldots,y_n} P\left[y_1, \ldots, y_n \mid \{x_{ij}\}\right]$. For AlpacaTag, we apply an improved version named Maximum Normalized Log-Probability (MNLP) (Shen et al., 2018), which eliminates the effect of sequence length by averaging:

$$\max_{y_1,\ldots,y_n} \frac{1}{n}\sum_{i=1}^{n} \log P\left[y_i \mid y_1, \ldots, y_{i-1}, \{x_{ij}\}\right]$$

Simply put, we utilize the current back-end model to measure the informativeness of unlabeled sentences and then sample the optimal next batch for annotators to label.
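The sketch below scores each unlabeled sentence with MNLP and selects the least confident batch. It assumes the back-end model exposes a method returning the log-probability of its most likely tag sequence; that interface is hypothetical.

import heapq

def mnlp_score(best_path_logprob, length):
    # Length-normalized log-probability of the most likely tag sequence;
    # lower values mean lower confidence, i.e. higher informativeness.
    return best_path_logprob / max(length, 1)

def sample_next_batch(unlabeled, model, batch_size=100):
    """Pick the batch_size sentences the current model is least confident about."""
    scored = []
    for sent in unlabeled:                               # sent: list of tokens
        logprob = model.best_sequence_logprob(sent)      # hypothetical model method
        scored.append((mnlp_score(logprob, len(sent)), sent))
    return [sent for _, sent in heapq.nsmallest(batch_size, scored, key=lambda x: x[0])]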

4 Automatic Crowd Consolidation

As a crowdsourcing annotation framework, AlpacaTag aims to reduce inter-annotator disagreement while keeping annotators' individual strengths in labeling specific kinds of instances. We achieve this by maintaining annotator-specific back-end models ("personal models") and consolidating them into a shared back-end model called the "consensus model". Personal models are used for personalized active sampling, so that each annotator labels the regions of the data space that are most important for them to label.

Assume there are $K$ annotators, referred to as $\{A_1, \ldots, A_K\}$. For each annotator $A_i$, we denote the corresponding sentences and annotations they process as $x_i = \{x_{i,j}\}_{j=1}^{l_i}$ and $y_i^{A_i} \in \mathcal{Y}^{l_i}$. The target of consolidation is to generate predictions for the test data $x$, with the resulting labels referred to as $y$. Note that annotators may annotate different portions of the data, and sentences can have inconsistent crowd labels.

As shown in Fig. 4, we model the behavior of every annotator by constructing annotator representation matrices $C_k$ from the network parameters of the personal back-end models. The personal models share the same BLSTM layers for encoding sentences, which avoids over-parameterization. They are incrementally updated every batch, which usually consists of 50 to 100 sentences sampled actively by the algorithms described in §3.3.

Figure 4: Reducing inter-annotator disagreement by training personal models for personalized active sampling and constructing consensus models.

We further merge the crowd representations $C_k$ into a single representation matrix $C$, and thus construct a consensus model by using $C$ to generate tag scores in the BLSTM-CRF architecture, improving on the approach proposed by Nguyen et al. (2017).
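A minimal sketch of this consolidation step is given below, assuming each personal model contributes a matrix $C_k$ that maps shared BLSTM features to tag scores. The paper does not spell out the exact merging function, so simple averaging is used here purely as an assumption.

import torch

def build_consensus_matrix(personal_matrices):
    # personal_matrices: list of K tensors, each of shape (hidden_dim, num_tags)
    return torch.stack(personal_matrices).mean(dim=0)

hidden_dim, num_tags, K = 200, 9, 3
C_k = [torch.randn(hidden_dim, num_tags) for _ in range(K)]   # from the personal models
C = build_consensus_matrix(C_k)

features = torch.randn(1, 12, hidden_dim)    # output of the shared BLSTM encoder
tag_scores = features @ C                    # (1, 12, num_tags), then fed to the CRF layer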

In summary, AlpacaTag periodically updates the back-end consensus model by consolidating crowd annotations of annotated sentences through merging the personal models. The updated back-end model then continues to offer suggestions on future instances for all annotators, so the consensus can be further solidified through recommendation confirmations.

5 Real-time Model Deployment

Waiting for a model to be trained on finished human annotations can be a huge bottleneck when developing complicated pipeline systems for information extraction. Instead, users may want a sequence tagging model that is constantly updated, even while the annotation process is still ongoing. AlpacaTag naturally supports this feature because we maintain a back-end tagging model with incremental active learning techniques.

Thus, we provide a suite of APIs for interacting with the back-end consensus model server. They serve both communication with the annotation server and real-time deployment of the back-end model. Users can obtain inference results for the complicated systems they are developing by querying these APIs from downstream applications.
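A hypothetical Python client for such a deployment API might look like the following. The endpoint path and response format are assumptions for illustration; the paper does not document the concrete interface.

import requests

def tag_sentence(sentence, host="http://localhost:8000"):
    # Hypothetical endpoint: POST the raw text, receive span/type predictions
    # from the current consensus model.
    resp = requests.post(f"{host}/api/predict", json={"text": sentence}, timeout=5)
    resp.raise_for_status()
    return resp.json()

print(tag_sentence("Donald Trump had a dinner with Hillary Clinton."))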

6 Experiments

To investigate the performance of our implemented back-end model with incremental active learning and consolidation, we conduct a preliminary experiment on the CoNLL-2003 dataset.


Figure 5: Effectiveness of incremental updating with and without active sampling (x-axis: number of instances; y-axis: F1 score, %).

We set the iteration step size to 100 with the default order for single-user annotation and incrementally update the back-end model for 10 iterations. We then apply our implementation of active sampling to dynamically select the optimal batches of sentences to annotate. All results are tested on the official test set.

As shown in Fig. 5, active sampling improves performance over vanilla incremental training by a large margin. We also find that, with active sampling, using annotations for about 35% of the training data achieves the performance of using 100%, confirming the results reported in previous studies (Siddhant and Lipton, 2018).
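The evaluation protocol behind Figure 5 can be sketched roughly as follows: label batches of 100 sentences, either in the default order or chosen by active sampling, update the model incrementally, and track test F1. The model methods used here are hypothetical placeholders rather than AlpacaTag's actual training code.

def simulate(model, train_pool, test_set, use_active_sampling, batch_size=100, iterations=10):
    f1_curve = []
    for _ in range(iterations):
        if use_active_sampling:
            batch = sample_next_batch(train_pool, model, batch_size)  # MNLP sketch from §3.3
        else:
            batch = train_pool[:batch_size]                           # default corpus order
        train_pool = [s for s in train_pool if s not in batch]
        model.incremental_update(batch)       # hypothetical: train on the newly labeled batch
        f1_curve.append(model.evaluate_f1(test_set))                  # hypothetical evaluation
    return f1_curve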

For consolidating multiple annotations, we test our implementation on the real-world crowdsourced data collected by Rodrigues et al. (2013) using Amazon Mechanical Turk (AMT). Our method outperforms existing approaches including MVT-SLM (Wang et al., 2018) and Crowd-X (Nguyen et al., 2017) (see Tab. 1).

7 Related Works

Many sequence annotation tools have been developed (Ogren, 2006; Bontcheva et al., 2013; Chen and Styler, 2013; Druskat et al., 2014b) for basic annotation features including data management, shortcut-key annotation, and multi-annotation presentation and analysis. However,

Methods     Precision (%)   Recall (%)      F1 (%)
MVT-SLM     88.88 (±0.25)   65.04 (±0.80)   75.10 (±0.44)
Crowd-Add   89.74 (±0.10)   64.50 (±1.48)   75.03 (±1.02)
Crowd-Cat   89.72 (±0.47)   63.55 (±1.20)   74.39 (±0.98)
Ours        88.77 (±0.25)   72.79 (±0.04)   79.99 (±0.08)
Gold        92.12 (±0.31)   91.73 (±0.09)   91.92 (±0.21)

Table 1: Comparisons of consolidation methods.

few of them offer the three features of AlpacaTag introduced above. It is true that a few existing tools also support tagging suggestions (instance-level recommendations):
• BRAT (Stenetorp et al., 2012) and GATE (Bontcheva et al., 2013) can offer suggestions from a fixed tagging model such as CoreNLP (Manning et al., 2014). However, this is hardly helpful when users need to annotate sentences with a customized label set specific to their domains of interest, which is the most common motivation for using an annotation framework.
• YEDDA (Yang et al., 2018a) simply generates suggestions by exact matching against a continuously updated lexicon of already annotated spans. This is a subset of AlpacaTag's recommendations; note that YEDDA cannot suggest any unseen spans.
• WebAnno (Yimam et al., 2013) integrates a learning component for suggestions, based on hand-crafted features and a generic online learning framework (MIRA). Nonetheless, its back-end model is far from the state of the art for sequence tagging and is hard to customize for consolidation and real-time deployment.

AlpacaTag’s tagging suggestions are morecomprehensive since they are from state-of-the-art BLSTM-CRF architecture, annotation dictio-nary, and frequent noun phrase mining. An uniqueadvantage is that it also supports active learn-ing to sample instances that are most worth la-beling to reduce human efforts. In addition,AlpacaTag further attempts to reduce inter-annotator disagreement during annotating by au-tomatically consolidating personal models to con-sensus models, which pushes further than just pre-senting inter-annotator agreement without doinganything helpful.

While recent papers have shown the power of deep active learning in simulations (Siddhant and Lipton, 2018), a practical annotation framework with active learning has been missing. AlpacaTag is not only an annotation framework but also a practical environment for evaluating different active learning methods.

8 Conclusion and Future Directions

To sum up, we propose AlpacaTag, an open-source web-based annotation framework that provides users with advanced features for reducing human effort in annotation while keeping existing basic annotation functions such as shortcut keys. Incremental active learning of back-end models with crowd consolidation facilitates intelligent recommendations and reduces disagreement.

Future directions include supporting the annotation of other tasks, such as relation extraction and event extraction. We also plan to design a real-time console, based on TensorBoard, for showing the status of the back-end models, so that admins can better track the quality of deployed models and stop annotation early to save human effort.

References

Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin Tablan, Niraj Aswani, and Genevieve Gorrell. 2013. GATE Teamware: a web-based, collaborative text annotation framework. In Proc. of LREC.

Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Christian Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In Proc. of LT4DH@COLING.

Wei-Te Chen and Will Styler. 2013. Anafora: a web-based general purpose annotation tool. In Proc. of NAACL-HLT.

Aron Culotta and Andrew McCallum. 2005. Confidence estimation for information extraction. In Proc. of AAAI.

Stephan Druskat, Lennart Bierkandt, Volker Gast, Christoph Rzymski, and Florian Zipser. 2014a. Atomic: an open-source software platform for multi-level corpus annotation. In KONVENS.

Stephan Druskat, Lennart Bierkandt, Volker Gast, Christoph Rzymski, and Florian Zipser. 2014b. Atomic: An open-source software platform for multi-level corpus annotation.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of NAACL-HLT.

Bill Y. Lin, Frank F. Xu, Zhiyi Luo, and Kenny Q. Zhu. 2017. Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In NUT@EMNLP.

Bill Yuchen Lin and Wei Lu. 2018. Neural adaptation layers for cross-domain named entity recognition. In Proc. of EMNLP.

Liyuan Liu, Jingbo Shang, Frank F. Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Proc. of AAAI.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proc. of ACL.

An Thanh Nguyen, Byron C. Wallace, Junyi Jessy Li, Ani Nenkova, and Matthew Lease. 2017. Aggregating and predicting sequence labels from crowd annotations. In Proc. of ACL.

Philip V. Ogren. 2006. Knowtator: a Protégé plug-in for annotated corpus construction. In Proc. of NAACL-HLT.

Filipe Rodrigues, Francisco C. Pereira, and Bernardete Ribeiro. 2013. Sequence labeling with multiple annotators. Machine Learning.

Burr Settles. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. TKDE.

Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. 2018. Deep active learning for named entity recognition. In Proc. of ICLR.

Aditya Siddhant and Zachary Chase Lipton. 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. In Proc. of EMNLP.

Pontus Stenetorp, Sampo Pyysalo, Goran Topic, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: a web-based tool for NLP-assisted text annotation. In Proc. of EACL.

Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis P. Langlotz, and Jiawei Han. 2018. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics.

Jie Yang, Yue Zhang, Linwei Li, and Xingxuan Li. 2018a. YEDDA: A lightweight collaborative text span annotation tool. In Proc. of ACL.

YaoSheng Yang, Meishan Zhang, Wenliang Chen, Wei Zhang, Haofen Wang, and Min Zhang. 2018b. Adversarial learning for Chinese NER from crowd annotations. In Proc. of AAAI.

Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Christian Biemann. 2013. WebAnno: A flexible, web-based and visually supported system for distributed annotations. In Proc. of ACL.