Richard Eckart de Castilho, Jan-Christoph Klie, Ute ...

1
Assisted Annotation INCEpTION Platform BEYOND WEBANNO: THE INCEPTION TEXT ANNOTATION PLATFORM Richard Eckart de Castilho, Jan-Christoph Klie, Ute Winchenbach, Iryna Gurevych Ubiquitous Knowledge Processing Lab (UKP-TUDA) Dept. of Computer Science Technische Universität Darmstadt This work has received funding from the German Research Foundation under the grants No. EC 503/1-1 and GU 798/21-1 (INCEpTION) and as part of the RTG Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1. WebAnno originated in CLARIN-D and is deployed in several national CLARIN infrastructures, used by various research groups as well as individual researchers INCEpTION takes the technology to a new level and provides a human-in-the-loop annotation platform combining machine learning and human expertise for rapid domain adaptation Upgrade today! Migration from WebAnno to INCEpTION is straightforward! Project Workflow Open Development Download INCEpTION at https://inception-project.github.io Project Setup Annotation Define data sources Define annotation scheme Define knowledge scheme Assign permissions KB Population Subcorporation Monitoring Assign workload Monitor progress Evaluate quality Identify relevant annotation units Annotate extracted units With categories from KB Linking to instances in KB Define schema (classes/properties) Edit instances Add qualifiers to statements ! Manager " Curator Curation Manual comparison Automatic aggregation # # # Sources # # # Export / Freeze $ Expert % Trained & “Crowd” Recommenders Continually learn from the users actions Asynchronous training does not slow down user interface Automatic evaluation to avoid inaccurate predictions (configurable quality threshold) Built-in recommenders: Dictionary-based, OpenNLP-based sequence classifier (part- of-speech, named entities, …) Corpus Creation Search Statistics Annotation Mentions Relations Syntax Knowledge Management Concepts Entities Events Automatic Suggestions Completion Deduplication Diversity Relevance Feedback Cross-Document Relations Entity Linking Event Linking Semantic Search Distant Supervision Active Learning INCEpTION INCEpTION aims to support three functionalities which are commonly required for text annotation projects but typically not available in a single tool: corpus creation; text annotation; knowledge management. The platform additionally provides assistive features such as machine learning recommenders to help users working on these tasks more efficiently. Integrating these functionalities into a single comprehensive platform permits addressing tasks typically not found in generic annotation platforms, such as entity linking, knowledge base population, cross-document coreference annotation, etc. Interoperability and Integration Custom KB Wikipedia DBPedia CLARIN TCF Legacy CoNLL formats UIMA CAS Binary+XMI Knowledge Resources RDF-based knowledge resources are supported Knowledge bases are accessed via SPARQL Configurable schema mapping allows supporting many different knowledge resources (RDFS, OWL, SKOS, …) Support for entity linking if knowledge base has full-text search capabilities Annotation Schemata Flexible type system configuration. Use UIMA type system descriptor to auto- configure an annotation project. DKPro Core types pre-configured (token, sentence, POS, lemma, morphological features, dependencies, coreference chains…) Data Formats Supports a range of different formats for importing and exporting annotated text corpora Support of the TCF format enables interoperability with the CLARIN-D WebLicht annotation services Full compatibility with CLARIN’s WebAnno including the import of entire annotation projects UIMA CAS data format enables interoperability with NLP pipelines like DKPro Core Annotation Services Connect external NLP services and machine learning tools to support the user during the annotation process Programming language independent HTTP- based protocol Protocol supports re-training the external tools UIMA CAS XMI data format; supported in Python via DKPro PyCAS; in Java via Apache UIMA Support for LAPPS Grid services Infrastructures & Platforms Supports the OpenMinTeD AERO protocol Remote configuration of annotation projects Import of documents for annotation Export of annotated documents Notifications on key events (e.g. document or project finished) via web hooks Support for external authentication mechanisms (e.g. SAML2/Shibboleth) Corpus Repositories Supports searching external corpus repositories and selectively importing data for annotation Supports ElasticSearch repositories Supports PubAnnotation service Modular approach makes it easy to add new types of repositories ... Text CoNLL-U WebAnno TSV DKPro TC Python- based machine learning Within the NLP and text mining landscape, an annotation platform like INCEpTION only covers a part of the overall text analysis needs. Therefore, it is important that the platform is open and interoperable with external services and resources. ... INCEpTION Integration goes beyond interoperability. E.g. when an external text mining platform wants to delegate annotation to the INCEpTION platform, it needs to be able to automatically set up annotation projects, import data, monitor the ongoing annotations and finally retrieve the annotated data for further use. System-level integration Data-level interoperability Active Learning Aims at reducing the time to learn by asking specific feedback from the user Using uncertainty-sampling strategy Compatible with any recommender that provides a confidence score User can freely switch between active learning and normal annotation $ Human in the loop Extensibility Knowledge Bases Recommenders Data Formats Annotation Editors Annotation Editor Sidebars Annotation Layers Project Importers/Exporters Annotation Feature types Event Logging Annotation Editor Extensions INCEpTION is a modular architecture providing many extension points where new functionality can be added and existing functionality can be changed - a selection of these is shown below: While most annotation tools are built in annotation projects, INCEpTION is an infrastructure software project and is not associated with any single annotation project. Acquiring early adopters and aligning with their use-cases is a key part of our mission. This motivates our open development philosophy: All code is open and publicly available on GitHub under the liberal Apache License 2.0 All development-related tasks and issues are publicly managed and discussed via GitHub ● Internal and private communication is kept at a minimum The architecture modular architecture is realized using the Spring framework. Dependency injection and events are used to achieve a loose coupling between the modules. Data Model RDF-based data model: classes, properties, instances, literals Support for class hierarchies Classes can have instances Support for typed properties including domain and range restrictions Editors for different value types (string, numeric, boolean, KB resources) Entity and Fact Linking Linking of annotations to KB classes and instances via an auto-complete field Optional contextual ranking of candidates Linked annotations shown on the KB page when a class/instance is selected Special support for linking factual statements in text documents to subject/predicate/object triples in the KB INCEpTION We thank CLARIN-D and the BMBF for sponsoring the presentation of the DFG project INCEpTION at the CLARIN Annual Conference 2019.

Transcript of Richard Eckart de Castilho, Jan-Christoph Klie, Ute ...

Assisted Annotation

INCEpTION Platform

BEYOND WEBANNO:

THE INCEPTION TEXT ANNOTATION PLATFORM

Richard Eckart de Castilho, Jan-Christoph Klie, Ute Winchenbach, Iryna Gurevych

Ubiquitous Knowledge Processing Lab (UKP-TUDA)Dept. of Computer Science

Technische Universität Darmstadt

This work has received funding from the German ResearchFoundation under the grants No. EC 503/1-1 and GU 798/21-1(INCEpTION) and as part of the RTG Adaptive Preparation ofInformation from Heterogeneous Sources (AIPHES) under grantNo. GRK 1994/1.

• WebAnno originated in CLARIN-D and is deployed in several national CLARIN infrastructures, used by various research groups as well as individual researchers

• INCEpTION takes the technology to a new level and provides a human-in-the-loop annotation platform combining machine learning and human expertise for rapid domain adaptation

Upgrade today!Migration from WebAnno to INCEpTION is straightforward!

Project Workflow

Open Development

Download INCEpTION at

https://inception-project.github.io

Project Setup AnnotationDefine data sourcesDefine annotation schemeDefine knowledge schemeAssign permissions

KB Population

Subcorporation

MonitoringAssign workload Monitor progress

Evaluate quality

Identify relevant annotation units

Annotate extracted unitsWith categories from KBLinking to instances in KB

Define schema (classes/properties)Edit instancesAdd qualifiers to statements

!Manager

"Curator

Curation

Manual comparisonAutomatic aggregation

###Sources

###Export / Freeze

$Expert

%Trained

&“Crowd”

Recommenders

● Continually learn from the users actions● Asynchronous training does not slow

down user interface● Automatic evaluation to avoid inaccurate

predictions (configurable quality threshold)● Built-in recommenders: Dictionary-based,

OpenNLP-based sequence classifier (part-of-speech, named entities, …)

CorpusCreation

SearchStatistics

Annotation

MentionsRelationsSyntax

KnowledgeManagement

ConceptsEntitiesEvents

AutomaticSuggestions

CompletionDeduplication

DiversityRelevance Feedback

Cross-DocumentRelations

Entity LinkingEvent Linking

Semantic SearchDistant SupervisionActive Learning

INCEpTION

INCEpTION aims to support three functionalities which arecommonly required for text annotation projects but typicallynot available in a single tool: corpus creation; textannotation; knowledge management.

The platform additionally provides assistive features such asmachine learning recommenders to help users working onthese tasks more efficiently.

Integrating these functionalities into a single comprehensiveplatform permits addressing tasks typically not found ingeneric annotation platforms, such as entity linking,knowledge base population, cross-document coreferenceannotation, etc.

Interoperability and Integration

Custom KB

Wikipedia

DBPedia

CLARIN TCFLegacy CoNLL formats

UIMA CAS Binary+XMI

Knowledge Resources● RDF-based knowledge resources are

supported● Knowledge bases are accessed via SPARQL● Configurable schema mapping allows

supporting many different knowledge resources (RDFS, OWL, SKOS, …)

● Support for entity linking if knowledge base has full-text search capabilities

Annotation Schemata● Flexible type system configuration.● Use UIMA type system descriptor to auto-

configure an annotation project.● DKPro Core types pre-configured (token,

sentence, POS, lemma, morphological features, dependencies, coreference chains…)

Data Formats● Supports a range of different formats for

importing and exporting annotated text corpora● Support of the TCF format enables

interoperability with the CLARIN-D WebLicht annotation services

● Full compatibility with CLARIN’s WebAnno including the import of entire annotation projects

● UIMA CAS data format enables interoperability with NLP pipelines like DKPro Core

Annotation Services● Connect external NLP services and machine

learning tools to support the user during the annotation process

● Programming language independent HTTP-based protocol

● Protocol supports re-training the external tools● UIMA CAS XMI data format; supported in Python

via DKPro PyCAS; in Java via Apache UIMA● Support for LAPPS Grid services

Infrastructures & Platforms● Supports the OpenMinTeD AERO protocol● Remote configuration of annotation projects● Import of documents for annotation● Export of annotated documents● Notifications on key events (e.g. document or

project finished) via web hooks● Support for external authentication

mechanisms (e.g. SAML2/Shibboleth)

Corpus Repositories● Supports searching external corpus

repositories and selectively importing data for annotation

● Supports ElasticSearch repositories● Supports PubAnnotation service● Modular approach makes it easy to add new

types of repositories

...

Text

CoNLL-UWebAnno TSV

DKPro TC

Python-based

machine learning

Within the NLP and text mining landscape,an annotation platform like INCEpTION onlycovers a part of the overall text analysisneeds. Therefore, it is important that theplatform is open and interoperable withexternal services and resources.

...

INCEpTION

Integration goes beyond interoperability. E.g. when an external text miningplatform wants to delegate annotation to the INCEpTION platform, it needs to beable to automatically set up annotation projects, import data, monitor the ongoingannotations and finally retrieve the annotated data for further use.

System-level integration Data-level interoperability

Active Learning

● Aims at reducing the time to learn by asking specific feedback from the user

● Using uncertainty-sampling strategy● Compatible with any recommender that

provides a confidence score● User can freely switch between active

learning and normal annotation

$Human in the loop

ExtensibilityKnowledge Bases

Recommenders

Data Formats

AnnotationEditors

Annotation Editor Sidebars

AnnotationLayers

ProjectImporters/Exporters

AnnotationFeature types

Event LoggingAnnotation EditorExtensions

INCEpTION is a modular architecture providing manyextension points where new functionality can be added andexisting functionality can be changed - a selection of theseis shown below:

While most annotation tools are built in annotation projects,INCEpTION is an infrastructure software project and is notassociated with any single annotation project. Acquiringearly adopters and aligning with their use-cases is a key partof our mission.This motivates our open development philosophy:

● All code is open and publicly available on GitHub under the liberal Apache License 2.0

● All development-related tasks and issues are publicly managed and discussed via GitHub

● Internal and private communication is kept at a minimum

The architecture modular architecture is realized using theSpring framework. Dependency injection and events areused to achieve a loose coupling between the modules.

Data Model

● RDF-based data model: classes, properties, instances, literals

● Support for class hierarchies● Classes can have instances● Support for typed properties including

domain and range restrictions● Editors for different value types (string,

numeric, boolean, KB resources)

Entity and Fact Linking

● Linking of annotations to KB classes and instances via an auto-complete field

● Optional contextual ranking of candidates● Linked annotations shown on the KB page

when a class/instance is selected● Special support for linking factual

statements in text documents to subject/predicate/object triples in the KB

INCEpTION

We thank CLARIN-D and the BMBF for sponsoring thepresentation of the DFG project INCEpTION at the

CLARIN Annual Conference 2019.