Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 37–42, Beijing, China, July 26-31, 2015. ©2015 ACL and AFNLP

Visual Error Analysis for Entity Linking

Benjamin Heinzerling
Research Training Group AIPHES
Heidelberg Institute for Theoretical Studies gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg, Germany
[email protected]

Michael Strube
Heidelberg Institute for Theoretical Studies gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg, Germany
[email protected]

Abstract

We present the Visual Entity Explorer (VEX), an interactive tool for visually exploring and analyzing the output of entity linking systems. VEX is designed to aid developers in improving their systems by visualizing system results, gold annotations, and various mention detection and entity linking error types in a clear, concise, and customizable manner.

1 Introduction

Entity linking (EL) is the task of automatically linking mentions of entities (e.g. persons, locations, organizations) in a text to their corresponding entry in a given knowledge base (KB), such as Wikipedia or Freebase. Depending on the setting, the task may also require detection of entity mentions¹, as well as identifying and clustering Not-In-Lexicon (NIL) entities.

In recent years, the increasing interest in EL, reflected in the emergence of shared tasks such as the TAC Entity Linking track (Ji et al., 2014), ERD 2014 (Carmel et al., 2014), and NEEL (Cano et al., 2014), has fostered research on evaluation metrics for EL systems, leading to the development of a dedicated scorer that covers different aspects of EL system results using multiple metrics (Hachey et al., 2014).

Based on the observation that representations in entity linking (mentions linked to the same KB entry) are very similar to those encountered in coreference resolution (mentions linked by coreference relations to the same entity), these metrics include ones originally proposed for evaluation of coreference resolution systems, such as the MUC score (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), and CEAF (Luo, 2005) and variants thereof (Cai and Strube, 2010).

¹ This setting is called Entity Discovery and Linking (EDL) in the TAC 2014/15 entity linking tracks, and Entity Recognition and Disambiguation (ERD) in the Microsoft ERD 2014 challenge.

While such metrics, which express system performance in numeric terms of precision, recall, and F1 scores, are well-suited for comparing systems, they are of limited use to EL system developers trying to identify problem areas and components whose improvement will likely result in the largest performance increase.

To address this problem, we present the Visual Entity Explorer (VEX), an interactive tool for visually exploring the results produced by an EL system. To our knowledge, there exist no other dedicated tools for visualizing the output of EL systems or similar representations.

VEX is available as free, open-source software for download at http://github.com/noutenki/vex and as a web service at http://cosyne.h-its.org/vex.

In the remainder of this paper, we first give an overview of VEX (Section 2), proceed to present several usage examples and discuss some of the insights gained from performing a visual error analysis (Section 3), then describe its implementation (Section 4), before concluding and discussing future work (Section 5).

2 The Visual Entity Explorer

After loading system results and gold standard annotations in TAC 2014 or JSON format, as well as the original document text files, VEX displays gold annotations, correct results, and errors as shown in Figure 1. The document to be analyzed can be selected via the clickable list of document IDs on the left. Located bottom right, the entity selectors for gold, true positive, and false positive entities (defined below) can be used to toggle the display of individual entities². The selected entities are visualized in the top-right main area.

Figure 1: Screenshot of VEX’s main display, consisting of document list (left), entity selectors (bottom right), and the annotated document text (top right).

Similarly to the usage in coreference resolution, where a cluster of mentions linked by coreference relations is referred to as an entity, we define entity to mean a cluster of mentions clustered either implicitly by being linked to the same KB entry (in case of non-NIL mentions) or clustered explicitly by performing NIL clustering (in case of NIL mentions).

² For space reasons, the entity selectors are shown only partially.

2.1 Visualizing Entity Linking Errors

Errors committed by an EL system can be broadly categorized into mention detection errors and linking/clustering errors. Mention detection errors, in turn, can be divided into partial errors and full errors.

2.1.1 Partial Mention Detection Errors

A partial mention detection error is a system mention span that overlaps but is not identical to any gold mention span. In VEX, partial mention detection errors are displayed using red square brackets, either inside or outside the gold mention spans signified by golden-bordered rectangles (cf. the first and last mention in Figure 2).

Figure 2: Visualization of various mention detection and entity linking error types (see Section 2 for a detailed description).

2.1.2 Full Mention Detection Errors

A full mention detection error is either (a) a system mention span that has no overlapping gold mention span at all, corresponding to a false positive (FP) detection, i.e. a precision error, or (b) a gold mention span that has no overlap with any system mention span, corresponding to a false negative (FN) detection, i.e. a recall error. In VEX, FP mention detections are marked by a dashed red border and struck-out red text (cf. the second mention in Figure 2), and FN mention detections by a dashed gold-colored border and black text (cf. the third mention in Figure 2). For further emphasis, both gold and system mentions are displayed in bold font.
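The partial and full mention detection error types defined above can be illustrated with a small sketch. It is not taken from VEX itself: the function names, the result keys, and the convention that spans are half-open `(start, end)` character offsets are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Assumption: a span is a half-open (start, end) pair of character offsets.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def classify_mentions(gold: List[Span], system: List[Span]) -> Dict[str, List[Span]]:
    """Sort spans into exact matches, partial errors, full FPs and full FNs."""
    gold_set = set(gold)
    result: Dict[str, List[Span]] = {
        "correct": [], "partial": [], "full_fp": [], "full_fn": []
    }
    for s in system:
        if s in gold_set:
            result["correct"].append(s)      # identical to a gold span
        elif any(overlaps(s, g) for g in gold):
            result["partial"].append(s)      # overlaps, but not identical
        else:
            result["full_fp"].append(s)      # no overlapping gold span at all
    for g in gold:
        if not any(overlaps(g, s) for s in system):
            result["full_fn"].append(g)      # gold span missed entirely
    return result
```

For instance, with gold spans `[(0, 10), (20, 30), (40, 50)]` and system spans `[(0, 11), (20, 30), (60, 70)]`, the span `(0, 11)` is a partial error, `(60, 70)` a full FP, and `(40, 50)` a full FN.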

2.1.3 Linking/Clustering Errors

Entities identified by the system are categorized – and possibly split up – into True Positive (TP) and False Positive (FP) entities. The mentions of system entities are connected using dashed green lines for TP entities and dashed red lines for FP entities, while gold entity mentions are connected by solid gold-colored lines. This choice of line styles prevents loss of information through occlusion in case of two lines connecting the same pair of mentions, as is the case with the first and last mention in Figure 2.

Additionally, the text of system mentions linked to the correct KB entry or identified correctly as NIL is colored green and any text associated with erroneous system entity links red.

3 Usage examples

In this section we show how VEX can be used to perform a visual error analysis, gaining insights that arguably cannot be attained by relying only on evaluation metrics.

3.1 Example 1

Figure 2 shows mentions of VULCAN INC.³ as identified by an EL system (marked red and green) and the corresponding gold annotation⁴ (marked in gold color). Of the three gold mentions, two were detected and linked correctly by the system and are thus colored green and connected with a green dashed line. One gold mention is surrounded with a gold-colored dashed box to indicate an FN mention not detected by the system at all. The dashed red box signifies an FP entity, resulting from the system having detected a mention that is not listed in the gold standard. However, rather than a system error, this is arguably an annotation mistake.

³ In this paper, SMALL CAPS denote KB entries.

Figure 3: Visualization showing a mention detection error and an annotation error (see Section 3 for a description).

Inspection of other entities and other documents reveals that spurious FPs caused by gold annotation errors appear to be a common occurrence (see Figure 3 for another example). Since the supervised machine learning algorithms commonly used for named entity recognition, such as Conditional Random Fields (Sutton and McCallum, 2007), require consistent training data, such inconsistencies hamper performance.

3.2 Example 2

From Figure 2 we can also tell that two mention detection errors are caused by the inclusion of sentence-final punctuation that doubles as an abbreviation marker. The occurrence of similar cases in other documents, e.g. inconsistent annotation of “U.S.” and “U.S” as mentions of UNITED STATES, shows the need for consistently applied annotation guidelines.

3.3 Example 3

Another type of mention detection error is shown in Figure 3: Here the system fails to detect “washington” as a mention of WASHINGTON, D.C., likely due to the non-standard lower-case spelling.

⁴ The gold annotations are taken from the TAC 2014 EDL Evaluation Queries and Links (V1.1).

3.4 Example 4

The visualization of the gold mentions of PAUL ALLEN in Figure 1 shows that the EL system simplistically partitioned and linked the mentions according to string match, resulting in three system entities, of which only the first, consisting of the two “Paul Allen” mentions, is a TP. Even though the four “Allen” mentions in Figure 1 align correctly with gold mentions, they are categorized as an FP entity, since the system erroneously linked them to the KB entry for the city of Allen, Texas, resulting in a system entity that does not intersect with any gold entity. The system commits a similar mistake for the mention “Paul”.

3.5 Insights

This analysis of only a few examples has already revealed several categories of errors, either committed by the EL system or resulting from gold annotation mistakes:

• mention detection errors due to non-standard letter case, which suggest incorporating truecasing (Lita et al., 2003) and/or a caseless named entity recognition model (Manning et al., 2014) into the mention detection process could improve performance;

• mention detection errors due to off-by-one errors involving punctuation, which suggest the need for clear and consistently applied annotation guidelines, enabling developers to add hard-coded, task-specific post-processing rules for dealing with such cases;

• mention detection errors due to missing gold standard annotations, which suggest performing a simple string match against already annotated mentions to find cases of unannotated mentions could significantly improve the gold standard at little cost;

• linking/clustering errors, likely due to the overly strong influence of features based on string match with Wikipedia article titles, which in some cases appears to outweigh features designed to encourage clustering of mentions if there exists a substring match between them, hence leading to an erroneous partitioning of the gold entity by its various surface forms.
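The third suggestion in the list above, a simple string match against already annotated mentions, might look roughly like the following sketch. The function name and the half-open `(start, end)` offset convention are assumptions for illustration; the candidates it returns would still need to be checked by an annotator, since an identical string need not refer to the same entity.

```python
import re
from typing import List, Tuple

# Assumption: a span is a half-open (start, end) pair of character offsets.
Span = Tuple[int, int]


def find_unannotated(text: str, gold: List[Span]) -> List[Span]:
    """Find occurrences of annotated surface forms that lack a gold annotation."""
    surface_forms = {text[s:e] for s, e in gold}
    candidates = set()
    for form in surface_forms:
        for match in re.finditer(re.escape(form), text):
            start, end = match.start(), match.end()
            # Skip occurrences that overlap any existing gold mention.
            if not any(s < end and start < e for s, e in gold):
                candidates.add((start, end))
    return sorted(candidates)
```

Given a text in which “Vulcan Inc” occurs twice but is annotated only once, the function returns the offsets of the second occurrence as a candidate for annotation.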

4 Implementation

In this section we describe VEX’s implementation and some of the design decisions made to achieve an entity visualization suited for convenient error analysis.

VEX consists of three main components. The input component, implemented in Java 8, reads gold and system annotations files, as well as the original documents. Currently, the annotation format read by the official TAC 2014 scorer⁵, as well as a simple JSON input format are supported. All system and gold character offset ranges contained in the input files are converted into HTML spans and inserted into the document text. Since HTML elements are required to conform to a tree structure, any overlap or nesting of spans is handled by breaking up such spans into non-overlapping subspans.
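The step of breaking overlapping spans into non-overlapping subspans can be sketched as follows. This is an illustration in Python rather than the Java code used by VEX, and the function name, return shape, and half-open offset convention are assumptions. The idea is to cut the text at every span boundary and record which original spans cover each resulting piece.

```python
from typing import List, Tuple

# Assumption: a span is a half-open (start, end) pair of character offsets.
Span = Tuple[int, int]


def split_into_subspans(spans: List[Span]) -> List[Tuple[int, int, List[int]]]:
    """Cut at every span boundary.

    Returns (start, end, owners) triples, where owners lists the indices of
    the input spans that fully cover the piece. Uncovered gaps are dropped.
    """
    boundaries = sorted({offset for span in spans for offset in span})
    pieces = []
    for start, end in zip(boundaries, boundaries[1:]):
        owners = [i for i, (s, e) in enumerate(spans) if s <= start and end <= e]
        if owners:
            pieces.append((start, end, owners))
    return pieces
```

For a gold span `(0, 10)` and a system span `(0, 11)`, this yields the piece `(0, 10)` owned by both spans and the piece `(10, 11)` owned by the system span only. Each piece could then be emitted as one flat HTML element carrying the markers of all its owners.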

At this point, gold NIL clusters and system NIL clusters are aligned by employing the Kuhn-Munkres algorithm⁶ (Kuhn, 1955; Munkres, 1957), as is done in calculation of the CEAF metric (Luo, 2005). The input component then stores all inserted, non-overlapping spans in an in-memory database.
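To make the alignment problem concrete, the sketch below finds a one-to-one mapping between gold and system clusters that maximizes the total number of shared mentions. Note that it does not implement the Kuhn-Munkres algorithm: it substitutes a brute-force search over all permutations, which gives the same optimum but is only feasible for a handful of clusters. The similarity measure (count of shared mentions), the function name, and the representation of clusters as sets of mention identifiers are assumptions for this example.

```python
from itertools import permutations
from typing import Dict, List, Set


def align_clusters(gold: List[Set[int]], system: List[Set[int]]) -> Dict[int, int]:
    """Map gold cluster index -> system cluster index, maximizing shared mentions.

    Brute-force stand-in for Kuhn-Munkres; exponential in the number of clusters.
    """
    n = max(len(gold), len(system))
    # Pad the shorter side with empty clusters so every mapping is a permutation.
    gold_padded = gold + [set() for _ in range(n - len(gold))]
    system_padded = system + [set() for _ in range(n - len(system))]

    best_score, best_perm = -1, ()
    for perm in permutations(range(n)):
        score = sum(len(gold_padded[i] & system_padded[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm

    # Keep only pairs of real clusters that actually share a mention.
    return {
        i: j
        for i, j in enumerate(best_perm)
        if i < len(gold) and j < len(system) and gold[i] & system[j]
    }
```

With gold clusters `[{1, 2, 3}, {4, 5}]` and system clusters `[{4, 5, 6}, {1, 2}, {3}]`, the best alignment pairs gold cluster 0 with system cluster 1 and gold cluster 1 with system cluster 0, leaving system cluster 2 unaligned.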

The processing component queries gold and system entity data for each document and catalogs all errors of interest. All data collected by this component is added to the respective HTML spans in the form of CSS classes, enabling simple customization of the visualization via a plain-text stylesheet.

The output component employs a template engine⁷ to convert the data collected by the processing component into HTML and JavaScript for handling display and user interaction in the web browser.

4.1 Design Decisions

One of VEX’s main design goals is enabling the user to quickly identify entity linking and clustering errors. Because a naive approach to entity visualization by drawing edges between all possible pairings of mention spans quickly leads to a cluttered graph (Figure 4a), we instead visualize entities using Euclidean minimum spanning trees, inspired by Martschat and Strube’s (2014) use of spanning trees in error analysis for coreference resolution.

Figure 4: Cluttered visualization of an entity via its complete graph, drawing all pairwise connections between mentions (a), and a more concise visualization of the same entity using a Euclidean minimum spanning tree, connecting all mentions while minimizing total edge length (b).

⁵ http://github.com/wikilinks/neleval

⁶ Also known as Hungarian algorithm.

⁷ https://github.com/jknack/handlebars.java

A Euclidean minimum spanning tree is a minimum spanning tree (MST) of a graph whose vertices represent points in a metric space and whose edge weights are the spatial distances between points⁸, i.e., it spans all graph vertices while minimizing total edge length. This allows for a much more concise visualization (Figure 4b).
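As an illustration of this definition, the following sketch computes a Euclidean minimum spanning tree over a few 2D points using Kruskal's algorithm with a union-find structure. It is written in Python for exposition and is not the client-side code used by VEX; the function name and the use of plain `(x, y)` tuples as points are assumptions.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


def euclidean_mst(points: List[Point]) -> List[Tuple[int, int]]:
    """Return MST edges as pairs of point indices (Kruskal's algorithm)."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Find the set representative, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges of the complete graph, shortest first.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    tree = []
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b))
    return tree
```

For four mentions at the corners of a 100 by 10 rectangle, the tree contains three edges (the two short sides and one long side) instead of the six edges of the complete graph.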

Since the actual positions of mention span elements on the user’s screen depend on various user environment factors such as font size and browser window dimensions, the MSTs of displayed entities are computed using a client-side JavaScript library⁹ and are automatically redrawn if the browser window is resized. Drawing of edges is performed via jsPlumb¹⁰, a highly customizable library for line drawing in HTML documents.

⁸ In our case, the metric space is the DOM document being rendered by the web browser, a point is the top-left corner of a text span element, and the distance metric is the pixel distance between the top-left corners of text span elements.

⁹ https://github.com/abetusk/euclideanmst.js. This library employs Kruskal’s algorithm (Kruskal, 1956) for finding MSTs.

¹⁰ http://www.jsplumb.org

In order not to overemphasize mention detection errors when displaying entities, VEX assumes a system mention span to be correct if it has a non-zero overlap with a gold mention span. For example, consider the first gold mention “Vulcan Inc” in Figure 2, which has not been detected correctly by the system; it detected “Vulcan Inc.” instead. While a strict evaluation requiring perfect mention spans will give no credit at all for this partially correct result, seeing that this mention detection error is already visually signified (by the red square bracket), VEX treats the mention as detected correctly for the purpose of visualizing the entity graph, and counts it as a true positive instance if it has been linked correctly.
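This lenient matching rule can be stated compactly in code. The sketch below is an interpretation of the rule as described in the text, not VEX's actual implementation; the `Mention` type, the function name, the half-open offsets, and the KB identifiers in the usage note are hypothetical.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive (an assumption)
    kb_id: str   # KB entry or NIL cluster identifier


def is_true_positive(system: Mention, gold: List[Mention]) -> bool:
    """Lenient check: any non-zero overlap with a gold mention plus the same link."""
    return any(
        system.start < g.end and g.start < system.end and system.kb_id == g.kb_id
        for g in gold
    )
```

With a gold mention `Mention(0, 10, "Vulcan_Inc.")`, the system mention `Mention(0, 11, "Vulcan_Inc.")` counts as a true positive despite the extra trailing character, whereas the same span linked to a different entry does not.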

While VEX provides sane defaults, the visualization style can be easily customized via CSS, e.g., in order to achieve a finer-grained categorization of error types such as off-by-one mention detection errors, or classification of non-NILs as NILs and vice versa.

5 Conclusions and Future Work

We presented the Visual Entity Explorer (VEX), a tool for visual error analysis of entity linking (EL) systems. We have shown how VEX can be used for quickly identifying the components of an EL system that appear to have a high potential for improvement, as well as for finding errors in the gold standard annotations. Since visual error analysis of our own EL system revealed several issues and possible improvements, we believe performing such an analysis will prove useful for other developers of EL systems, as well.

In future work, we plan to extend VEX with functionality for visualizing additional error types, and for exploring entities not only in a single document, but across documents. Given the structural similarities between entities in coreference resolution and entities in entity linking, we will also add methods for visualizing entities found by coreference resolution systems.

Acknowledgements

This work has been supported by the German Research Foundation as part of the Research Training Group “Adaptive Preparation of Information from Heterogeneous Sources” (AIPHES) under grant No. GRK 1994/1, and partially funded by the Klaus Tschira Foundation, Heidelberg, Germany. We would like to thank our colleague Sebastian Martschat who commented on earlier drafts of this paper.

References

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain, 28–30 May 1998, pages 563–566.

Jie Cai and Michael Strube. 2010. Evaluation metrics for end-to-end coreference resolution systems. In Proceedings of the SIGdial 2010 Conference: The 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Tokyo, Japan, 24–25 September 2010, pages 28–36.

Amparo E. Cano, Giuseppe Rizzo, Andrea Varga, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. 2014. Making sense of microposts named entity extraction & linking challenge. In Proceedings of the 4th Workshop on Making Sense of Microposts, Seoul, Korea, 7 April 2014, pages 54–60.

David Carmel, Ming-Wei Chang, Evgeniy Gabrilovich, Bo-June Paul Hsu, and Kuansan Wang. 2014. ERD’14: Entity recognition and disambiguation challenge. In ACM SIGIR Forum, volume 48, pages 63–77. ACM.

Ben Hachey, Joel Nothman, and Will Radford. 2014. Cheap and easy entity evaluation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Md., 22–27 June 2014, pages 464–469.

Heng Ji, Joel Nothman, and Ben Hachey. 2014. Overview of TAC-KBP2014 entity discovery and linking tasks. In Proceedings of the Text Analysis Conference, National Institute of Standards and Technology, Gaithersburg, Maryland, USA, 17–18 November 2014.

Joseph B. Kruskal. 1956. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50.

Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Lucian Vlad Lita, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla. 2003. Truecasing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003, pages 152–159. Association for Computational Linguistics.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of the Human Language Technology Conference and the 2005 Conference on Empirical Methods in Natural Language Processing, Vancouver, B.C., Canada, 6–8 October 2005, pages 25–32.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Md., 22–27 June 2014, pages 55–60. Association for Computational Linguistics.

Sebastian Martschat and Michael Strube. 2014. Recall error analysis for coreference resolution. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014, pages 2070–2081.

James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial & Applied Mathematics, 5(1):32–38.

Charles Sutton and Andrew McCallum. 2007. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning, pages 93–128. MIT Press, Cambridge, Mass.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC-6), pages 45–52, San Mateo, Cal. Morgan Kaufmann.
