Integrating NLP using Linked Data

Page 1: Integrating NLP using Linked Data

ISWC – 2013/10/23 – Page 1 http://lod2.eu

Creating Knowledge out of Interlinked Data

AKSW, Universität Leipzig

Integrating NLP using Linked Data

Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer

http://nlp2rdf.org

http://lod2.eu

http://slideshare.net/kurzum

Page 2: Integrating NLP using Linked Data

Introduction

Page 3: Integrating NLP using Linked Data

Introduction

Core problems in integrating NLP:

1. Too much heterogeneity

2. Almost no open standards available

3. Lack of open collaboration

4. Difficult and large domain

Page 4: Integrating NLP using Linked Data

Problem analysis

Hardly any reusability in NLP

• Free software (as in free beer), but no open licenses

• Few standards and few mappings

• Integration is hard-wired (you have to write software)

– for each tool, for each framework

Main benefits of using RDF, OWL and Linked Data are:

• lower entry barrier (as a client / user)

• easy data integration (linking, mapping)

• reusability of tools and conceptualisations (ontologies)

• off-the-shelf solutions for common tasks

Page 5: Integrating NLP using Linked Data

The Semantic Gap

Page 6: Integrating NLP using Linked Data

Page 7: Integrating NLP using Linked Data

NLP2RDF project

NLP2RDF (http://nlp2rdf.org)

- community project bootstrapped by LOD2

- develops NLP Interchange Format (NIF)

- umbrella project to combine (and consolidate) existing work

Page 8: Integrating NLP using Linked Data

NIF Overview

The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

→ to create an ecosystem of interoperable web services

Page 9: Integrating NLP using Linked Data

NIF Overview

• Reuse of existing standards such as RDF, OWL2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147

• Standardize access parameters, annotations (e.g. tokenization), validation and log messages

• Reuse of existing ontologies:

Page 10: Integrating NLP using Linked Data

Example NIF Workflow

A NIF workflow, however, obviously cannot provide better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components.
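What a NIF workflow offers instead is cheap integration: when every tool attaches its annotations to the same string URIs, combining outputs reduces to a set union of triples. The sketch below illustrates this with plain Python tuples. The document URI, the two stand-in tool functions, the `penn:NNP` tag and the property names `nif:anchorOf`, `nif:oliaLink` and `itsrdf:taIdentRef` are illustrative assumptions, not output captured from real NIF services.

```python
# Hypothetical string URI shared by both tools (RFC 5147-style offsets).
PORTMAN = "http://example.org/doc1#char=24,39"


def pos_tagger_output():
    """Stand-in for a POS tagger emitting NIF-style triples (assumed shape)."""
    return {
        (PORTMAN, "nif:anchorOf", "Natalie Portman"),
        (PORTMAN, "nif:oliaLink", "penn:NNP"),
    }


def entity_linker_output():
    """Stand-in for an entity linker emitting NIF-style triples (assumed shape)."""
    return {
        (PORTMAN, "nif:anchorOf", "Natalie Portman"),
        (PORTMAN, "itsrdf:taIdentRef", "http://dbpedia.org/resource/Natalie_Portman"),
    }


def merge(*graphs):
    """Merge graphs by set union; triples stated by several tools collapse."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


combined = merge(pos_tagger_output(), entity_linker_output())
for triple in sorted(combined):
    print(triple)
```

Both tools state the `nif:anchorOf` triple, so the merged set holds three triples rather than four; no tool-specific adapter code is involved.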

Page 11: Integrating NLP using Linked Data

Use Cases

• Internationalisation Tagset 2.0

• Part of Speech Tagging

• Wikifier API access via RDFaCE (Entity Linking)

Page 12: Integrating NLP using Linked Data

UC1 - Internationalisation Tagset 2.0

• NIF will be the recommended RDF conversion of the W3C Internationalisation Tagset 2.0 (ITS 2.0) - http://www.w3.org/TR/its20/

• NIF turns out to have a unique selling proposition regarding NLP and RDF

• No suitable alternative RDF vocabulary was available for this conversion.

Page 13: Integrating NLP using Linked Data

RDFa parsers lose all provenance information:

<http://examples.com/books/wikinomics> dc:title "Wikinomics" .

Source: https://en.wikipedia.org/wiki/RDFa

ITS 2.0

Source: http://www.w3.org/TR/its20/#EX-HTML-whitespace-normalization

Page 14: Integrating NLP using Linked Data

UC1 - Internationalisation Tagset 2.0

Page 15: Integrating NLP using Linked Data

UC1 - Internationalisation Tagset 2.0

String offsets are based on:

- Unicode NFC, code points

- ISO 24612

- RFC 5147
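The offset scheme can be sketched in Python as follows. This is a minimal illustration, not the NIF reference implementation: the helper names `nif_uri` and `anchor_of`, the base URI and the sample sentence are assumptions. Only the `#char=begin,end` fragment form of RFC 5147 and the counting in NFC code points are taken from the slide.

```python
import unicodedata


def nif_uri(base: str, text: str, begin: int, end: int) -> str:
    """Build an RFC 5147-style URI for text[begin:end], counted in NFC code points."""
    normalized = unicodedata.normalize("NFC", text)
    if not 0 <= begin <= end <= len(normalized):
        raise ValueError("offsets fall outside the normalized text")
    return f"{base}#char={begin},{end}"


def anchor_of(text: str, begin: int, end: int) -> str:
    """Return the substring that the offsets address (Python str indexes code points)."""
    return unicodedata.normalize("NFC", text)[begin:end]


text = "My favourite actress is Natalie Portman."
mention = "Natalie Portman"
begin = text.index(mention)
end = begin + len(mention)

uri = nif_uri("http://example.org/doc1", text, begin, end)
print(uri)                          # http://example.org/doc1#char=24,39
print(anchor_of(text, begin, end))  # Natalie Portman
```

Offsets are zero-based and the end offset is exclusive, so `end - begin` equals the length of the addressed string.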

Page 16: Integrating NLP using Linked Data

Please see the paper:

UC2 – Part of Speech Tagging

http://purl.org/olia

Page 17: Integrating NLP using Linked Data

UC3 – Wikifier API access via RDFaCE

https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki

Page 18: Integrating NLP using Linked Data

UC3 - Wikifier API access via RDFaCE

http://rdface.aksw.org/

Page 19: Integrating NLP using Linked Data

UC3 - Wikifier API access via RDFaCE

http://rdface.aksw.org/

Page 20: Integrating NLP using Linked Data

Evaluation

Please see the paper!

1) Quantitative Analysis with Google Wikilinks Corpus as NIF RDF

• Crawl of 3 million web sites, 40 million Wikipedia links

• ~ 477 million triples in NIF

2) Questionnaire and Developers Study for NIF 1.0

• NIF 1.0 was released in September 2009

• Over 30 known implementations (22 not from authors)

• 14 developers participated in the study

• Minimal NIF implementation requires less than 500 LoC

3) Qualitative Comparison with other Frameworks and Formats

Page 21: Integrating NLP using Linked Data

State of NIF 2.0

Corpora as Linked Data

• Wikilinks corpus - http://wiki-link.nlp2rdf.org

• KORE 50 - http://www.yovisto.com/labs/ner-benchmarks/

• DBpedia Spotlight dataset

Tools

• entityclassifier.eu – http://entityclassifier.eu

• Spotlight - http://spotlight.dbpedia.org

• Open NLP

• Stanford CoreNLP - https://github.com/NLP2RDF/software

• Validator - https://github.com/NLP2RDF/software

Page 22: Integrating NLP using Linked Data

State of NIF 2.0

• Rollout is in progress

• Distributed implementation at different speeds and levels of quality

• Software lifecycle:

– Implementation

– Testing/Validation

– Integration in the main software

– Deployment as a web service

• Hosted web services are often not up to date even when the code base is

Page 23: Integrating NLP using Linked Data

How to join - http://nlp2rdf.org

Page 24: Integrating NLP using Linked Data

For ontology creators

NLP2RDF provides infrastructure for your NLP ontologies

• Redundant, persistent hosting

• Maven packages

• Code and documentation generation

• Continuous Integration (planned)

• Indexing

• Validation of instance data

Please write to me or the mailing list: [email protected]

Page 25: Integrating NLP using Linked Data

Take home message

• Early industrial uptake

– OpenLink, Vistatech.ie, Zemanta, Tenforce, Unister

• The ITS 2.0 W3C standard was driven by the localization industry

• NIF is open and free (CC0 planned)

• NIF is designed to be a cost-saver

Not primarily aimed at increasing features or performance (F-Measure)

Page 26: Integrating NLP using Linked Data

Open Community – All feedback is welcome!

http://slideshare.net/kurzum

Websites:

http://nlp2rdf.org

http://lod2.eu

Thanks for your attention

Page 27: Integrating NLP using Linked Data

Annotations

Page 28: Integrating NLP using Linked Data

NIF

Page 29: Integrating NLP using Linked Data

Scalability - Salzburg Research KMT

https://bitbucket.org/srfgkmt/stanbol-nlp

Page 30: Integrating NLP using Linked Data

Unicode Normal Form C

• Recommendation for RDF Literals

• http://unicode.org/reports/tr15/#Norm_Forms
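Why the normal form matters for character offsets can be seen in a small Python sketch. The sample string and the helper `offsets` are illustrative assumptions; `unicodedata.normalize` is the standard-library call.

```python
import unicodedata

# "u" followed by a combining diaeresis (U+0308): a decomposed spelling of "ü".
decomposed = "Bru\u0308mmer"
# NFC composes the pair into the single code point U+00FC.
composed = unicodedata.normalize("NFC", decomposed)


def offsets(text, needle):
    """Return (begin, end) code point offsets of the first match of needle."""
    begin = text.index(needle)
    return begin, begin + len(needle)


print(len(decomposed), len(composed))  # 8 7
print(offsets(decomposed, "mmer"))     # (4, 8)
print(offsets(composed, "mmer"))       # (3, 7)
```

The two spellings render identically, yet every offset after the accented letter differs by one. Agreeing on one normal form before counting keeps offsets from different tools comparable.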

Page 31: Integrating NLP using Linked Data

Tokenization

Christian Chiarcos, Julia Ritz, Manfred Stede: By all these lovely tokens... Merging conflicting tokenizations. Language Resources and Evaluation 46(1): 53-74 (2012)
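One simple way to reconcile two tokenizations of the same text is to keep every boundary that either tokenizer proposes. The Python sketch below does this for offset spans. It is an illustration under that assumption and is not claimed to be the method of the cited paper; the function name and the sample spans are hypothetical.

```python
def merge_tokenizations(first, second):
    """Return the finest segmentation that respects every boundary of both inputs.

    Each tokenization is a list of (begin, end) offsets into the same text.
    """
    spans = first + second
    boundaries = sorted({offset for span in spans for offset in span})

    def covered(begin, end):
        # Keep a segment only if some original token contains it (skips whitespace gaps).
        return any(b <= begin and end <= e for b, e in spans)

    return [(b, e) for b, e in zip(boundaries, boundaries[1:]) if covered(b, e)]


text = "don't stop"
tokens_a = [(0, 5), (6, 10)]          # "don't", "stop"
tokens_b = [(0, 2), (2, 5), (6, 10)]  # "do", "n't", "stop"

merged = merge_tokenizations(tokens_a, tokens_b)
print([text[b:e] for b, e in merged])  # ['do', "n't", 'stop']
```

Because tokens are addressed by offsets into one shared text, both original tokenizations remain expressible as unions of the merged segments.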

Page 32: Integrating NLP using Linked Data

Validation over specification

• SPARQL queries produce (find) errors

• http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl

• RLOG – An RDF Logging Ontology

• ./validate.jar -i nif-erroneous-model.ttl -t file

• Demo → character count

• Demo → all errors

ALL DEMOS ARE AVAILABLE AT:

http://nlp2rdf.org/leipzig-24-9-2013
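The kind of consistency check that such a validator performs can be sketched in plain Python. The actual validator runs SPARQL queries from the test suite and reports through RLOG, so this is only a stand-in: the function `validate`, the dictionary layout and the sample data are assumptions, with keys named after what are assumed to be the NIF properties `nif:beginIndex`, `nif:endIndex` and `nif:anchorOf`.

```python
def validate(context, annotations):
    """Return a list of error messages; an empty list means the data passed."""
    errors = []
    for ann in annotations:
        begin, end = ann["beginIndex"], ann["endIndex"]
        if not 0 <= begin <= end <= len(context):
            # Character count check: offsets must lie inside the context string.
            errors.append(f"{ann['id']}: offsets {begin},{end} fall outside the context")
        elif context[begin:end] != ann["anchorOf"]:
            # The anchored string must equal the addressed substring.
            errors.append(
                f"{ann['id']}: anchorOf {ann['anchorOf']!r} does not match "
                f"{context[begin:end]!r}"
            )
    return errors


context = "My favourite actress is Natalie Portman."
annotations = [
    {"id": "#char=24,39", "beginIndex": 24, "endIndex": 39, "anchorOf": "Natalie Portman"},
    {"id": "#char=24,31", "beginIndex": 24, "endIndex": 31, "anchorOf": "Portman"},  # wrong anchor
    {"id": "#char=35,50", "beginIndex": 35, "endIndex": 50, "anchorOf": "rtman."},   # out of range
]

errors = validate(context, annotations)
for message in errors:
    print(message)
```

Expressing each rule as a query over the data, rather than as prose in a specification, means the same checks can be rerun against any implementation's output.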

Page 33: Integrating NLP using Linked Data

NIF

Demo: http://nlp2rdf.lod2.eu/demo.php

Page 34: Integrating NLP using Linked Data

OLiA

http://purl.org/olia

Page 35: Integrating NLP using Linked Data

NIF

Page 36: Integrating NLP using Linked Data

NIF