NLP2RDF Wortschatz and Linguistic LOD draft

NLP2RDF: Integration of Data, Tools and Applications with RDF/OWL in the Areas of Textmining and Linguistics. PhD Thesis, Sebastian Hellmann. http://lod2.eu

Page 1: NLP2RDF Wortschatz and Linguistic LOD draft

NLP2RDF – http://aksw.org/Projects/NLP2RDF . Page 1 http://lod2.eu

NLP2RDF: Integration of Data, Tools and Applications with RDF/OWL in the Areas of Textmining and Linguistics

PhD Thesis, Sebastian Hellmann

Page 2: NLP2RDF Wortschatz and Linguistic LOD draft

Extensive Topic – What is the core?

Features for Machine Learning

Which features do I need for a certain Textmining task?

An introductory example:

Resources:
• Face recognition tool that detects eye color (brown, green, blue) and type of haircut (Vo-ku-hi-la, Mullet, GI Joe)
• Database with age and occupation

Goal: predict the income of persons
• Young students earn less than old CEOs.

=> Eye color and haircut are probably irrelevant!
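The point of the example can be checked numerically. The following is a minimal sketch, assuming a handful of made-up person records (all values are hypothetical, not data from the thesis). It scores each feature by the share of income variance explained when persons are grouped by that feature (the correlation ratio); other relevance measures would work just as well.

```python
from collections import defaultdict

# Hypothetical toy records, invented purely for illustration.
PEOPLE = [
    {"age": 21, "eyes": "brown", "haircut": "mullet", "income": 9000},
    {"age": 23, "eyes": "blue", "haircut": "gi_joe", "income": 11000},
    {"age": 25, "eyes": "green", "haircut": "mullet", "income": 15000},
    {"age": 45, "eyes": "brown", "haircut": "gi_joe", "income": 60000},
    {"age": 52, "eyes": "blue", "haircut": "mullet", "income": 90000},
    {"age": 58, "eyes": "green", "haircut": "gi_joe", "income": 120000},
]


def explained_variance(records, key):
    """Share of income variance explained by grouping records with `key`."""
    incomes = [r["income"] for r in records]
    mean = sum(incomes) / len(incomes)
    total = sum((x - mean) ** 2 for x in incomes)
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r["income"])
    between = sum(
        len(v) * (sum(v) / len(v) - mean) ** 2 for v in groups.values()
    )
    return between / total


# Candidate features; the age threshold of 35 is an arbitrary assumption.
FEATURES = {
    "age group": lambda r: "young" if r["age"] < 35 else "old",
    "eye color": lambda r: r["eyes"],
    "haircut": lambda r: r["haircut"],
}

scores = {name: explained_variance(PEOPLE, key) for name, key in FEATURES.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

On this toy data the age group explains most of the income variance, while eye color and haircut explain very little, which is the intuition of the slide.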

Page 3: NLP2RDF Wortschatz and Linguistic LOD draft

Basic idea: a benchmarking framework

Input:
• Task specification
• Text
• Training/test data

Output:
• Tools and data required to solve the task

Example question: Do I need POS tags to classify tourism documents?

Prerequisites:
• Tools and applications need a standardized interface
• Data needs a standardized format
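A rough sketch of what such a benchmark loop could look like, assuming the simplest possible standardized interface: every tool is a function from text to a set of string features. The two tools, the toy documents and the overlap classifier below are all hypothetical stand-ins and not part of NLP2RDF.

```python
from collections import Counter, defaultdict
from itertools import combinations


# Assumed interface: a tool maps a text to a set of string features.
def bag_of_words(text):
    return {"word=" + w.lower() for w in text.split()}


def suffix_tags(text):
    """Crude stand-in for a linguistic tool such as a POS tagger."""
    tags = set()
    for w in text.lower().split():
        if w.endswith("ing"):
            tags.add("tag=ING")
        elif w.endswith("s"):
            tags.add("tag=S")
    return tags


TOOLS = {"bag_of_words": bag_of_words, "suffix_tags": suffix_tags}

# Hypothetical training and test data for a tourism/finance task.
TRAIN = [
    ("Book a hotel near the beach", "tourism"),
    ("Cheap flights and sightseeing tours", "tourism"),
    ("Quarterly earnings beat forecasts", "finance"),
    ("The bank raised interest rates", "finance"),
]
TEST = [
    ("Beach hotel with sightseeing", "tourism"),
    ("Interest rates and earnings", "finance"),
]


def features(text, tool_names):
    result = set()
    for name in tool_names:
        result |= TOOLS[name](text)
    return result


def accuracy(tool_names):
    """Train a feature-overlap classifier and score it on TEST."""
    counts = defaultdict(Counter)
    for text, label in TRAIN:
        counts[label].update(features(text, tool_names))
    correct = 0
    for text, label in TEST:
        feats = features(text, tool_names)
        guess = max(counts, key=lambda l: sum(counts[l][f] for f in feats))
        correct += guess == label
    return correct / len(TEST)


# Benchmark every non-empty combination of tools.
results = {
    combo: accuracy(combo)
    for size in range(1, len(TOOLS) + 1)
    for combo in combinations(sorted(TOOLS), size)
}
for combo, score in results.items():
    print(combo, score)
```

Comparing the scores of tool combinations with and without a given tool is one way to answer questions such as whether POS tags are needed for a task.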

Page 4: NLP2RDF Wortschatz and Linguistic LOD draft

Basic idea: a benchmarking framework

NLP2RDF stack

Page 5: NLP2RDF Wortschatz and Linguistic LOD draft

Basic idea: a benchmarking framework

• Google Code project was created
• Stanford parser was integrated
• Ontologies were found and integrated
• Pipeline implemented
• Plugin system implemented
• Some results were achieved

But…
• Architecture not flexible enough (pipeline)
• Integration bound to Java
• Data sources were not sufficient
• Wikipedia/DBpedia too coarse-grained
• Speed of integration too slow

Page 6: NLP2RDF Wortschatz and Linguistic LOD draft

Prerequisites

One step back:

1. Creation of data sets in RDF
2. Data integration and linking of data sets
3. Licences
4. Standardized format for tool integration
5. Acquisition of additional knowledge

Page 7: NLP2RDF Wortschatz and Linguistic LOD draft

Why RDF and OWL ?

1. RDF makes data integration easy: URIref, LinkedData
2. OWL is based on Description Logics (Guarded Fragment)
3. Availability of open data sets (access and licence)
4. Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
5. Scalable tool support (Databases, Reasoning)
6. If the only tool you have is a hammer, everything looks like a nail.

Page 8: NLP2RDF Wortschatz and Linguistic LOD draft

LOD Cloud - over 26 Billion Facts

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

DBpedia is central:
• Cross-domain
• Crystallization point (early bird)

Page 9: NLP2RDF Wortschatz and Linguistic LOD draft

Simplified:
• Circles are database tables
• Links are HTTP foreign keys

Page 10: NLP2RDF Wortschatz and Linguistic LOD draft

LinkedData

http://www4.wiwiss.fu-berlin.de/rdf_browser/?browse_uri=http%3A%2F%2Fdata.nytimes.com%2FN12930380387917339601

Resembles a database table
Key-value pairs
Values can be:
• Datatypes (Strings, Integers)
• URIs pointing to subjects in the same table
• URIs pointing to subjects in any other table
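The table analogy can be made concrete with a small sketch. It assumes two made-up data sets whose URIs, keys and values are purely illustrative (they do not come from the New York Times data or DBpedia). Each data set maps a subject URI to key-value pairs, and a URI value acts as an HTTP foreign key into the same or another data set.

```python
# Two hypothetical "tables"; every URI and value is invented.
NEWS = {
    "http://news.example.org/person/1": {
        "name": "Example Person",
        "articleCount": 12,
        "sameAs": "http://encyclopedia.example.org/resource/Example_Person",
    },
}
ENCYCLOPEDIA = {
    "http://encyclopedia.example.org/resource/Example_Person": {
        "birthPlace": "http://encyclopedia.example.org/resource/Example_City",
    },
    "http://encyclopedia.example.org/resource/Example_City": {
        "population": 500000,
    },
}

# On the real Web each URI would be fetched over HTTP;
# here a merged dictionary plays that role.
WEB = {**NEWS, **ENCYCLOPEDIA}


def dereference(uri):
    """Return the key-value pairs stored for a subject URI, or None."""
    return WEB.get(uri)


def follow(uri, *keys):
    """Follow a chain of keys, treating URI values as foreign keys."""
    value = uri
    for key in keys:
        row = dereference(value)
        if row is None:
            return None
        value = row.get(key)
    return value


person = "http://news.example.org/person/1"
print(follow(person, "name"))
print(follow(person, "sameAs", "birthPlace", "population"))
```

The second call crosses from the news data set into the encyclopedia data set without any schema mapping, because the link is simply a URI.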

Page 11: NLP2RDF Wortschatz and Linguistic LOD draft

SPARQL – optimizations for table joins

All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants

http://tinyurl.com/2uhuow9
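To show why such a question is a chain of table joins, here is a minimal in-memory sketch of triple pattern matching. It is not the query behind the link above and does not use the DBpedia vocabulary: the `ex:` names and all values are hypothetical, and a real SPARQL engine would optimize the join order rather than loop naively.

```python
# Hypothetical toy triples (subject, predicate, object).
TRIPLES = [
    ("ex:alice", "ex:position", "ex:Goalkeeper"),
    ("ex:alice", "ex:team", "ex:clubA"),
    ("ex:alice", "ex:birthCountry", "ex:bigland"),
    ("ex:bob", "ex:position", "ex:Goalkeeper"),
    ("ex:bob", "ex:team", "ex:clubB"),
    ("ex:bob", "ex:birthCountry", "ex:bigland"),
    ("ex:carol", "ex:position", "ex:Goalkeeper"),
    ("ex:carol", "ex:team", "ex:clubA"),
    ("ex:carol", "ex:birthCountry", "ex:smallland"),
    ("ex:clubA", "ex:ground", "ex:stadiumA"),
    ("ex:clubB", "ex:ground", "ex:stadiumB"),
    ("ex:stadiumA", "ex:seats", 50000),
    ("ex:stadiumB", "ex:seats", 20000),
    ("ex:bigland", "ex:population", 80000000),
    ("ex:smallland", "ex:population", 5000000),
]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def extend(binding, pattern, triple):
    """Try to match one pattern against one triple under a binding."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None  # constant does not match
    return new


def select(patterns, triples):
    """Join all patterns; each shared variable acts as a join key."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                new = extend(binding, pattern, triple)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings


PATTERNS = [
    ("?player", "ex:position", "ex:Goalkeeper"),
    ("?player", "ex:team", "?club"),
    ("?club", "ex:ground", "?stadium"),
    ("?stadium", "ex:seats", "?seats"),
    ("?player", "ex:birthCountry", "?country"),
    ("?country", "ex:population", "?population"),
]

rows = [
    b
    for b in select(PATTERNS, TRIPLES)
    if b["?seats"] > 40000 and b["?population"] > 10_000_000
]
players = sorted({b["?player"] for b in rows})
print(players)
```

Each pattern that reuses a variable from an earlier pattern corresponds to one join, so this question needs five joins plus two filters.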

Page 12: NLP2RDF Wortschatz and Linguistic LOD draft

SPARQL – optimizations for table joins

Page 13: NLP2RDF Wortschatz and Linguistic LOD draft

Creation of data sets: Wiktionary2RDF

Page 14: NLP2RDF Wortschatz and Linguistic LOD draft

Creation of data sets: Wiktionary2RDF

http://en.wiktionary.org/wiki/house
• Covers 170 languages
• Total of 10 million pages
• 900,000 users
• RDF dump will increase number of editors
• Same properties as Wikipedia (stable identifiers)

• Hundreds of Wiktionary parsers (especially for English)
• Information is trapped in the Wiki
• Structure changes make software obsolete

Why try it again?
• DBpedia Extraction Framework is very mature (5 years, 15 developers)
• Configuration over code: templates will allow Wiktionarians to update parsers
• Early contact with the community

Page 15: NLP2RDF Wortschatz and Linguistic LOD draft

Creation of data sets: Wortschatz

Converted in 2009:

Matthias Quasthoff, Sebastian Hellmann and Konrad Höffner: Standardized Multilingual Language Resources for the Web of Data: http://corpora.uni-leipzig.de/rdf
3rd prize at the LOD Triplification Challenge, Graz, 2009

What was missing?
• Research questions
• Use cases
• Other data sets to link to!
• Wikipedia is not suited as a linking partner
• No servers

Page 16: NLP2RDF Wortschatz and Linguistic LOD draft

Wiktionary, Wortschatz and OLiA can become the crystallization point for a Linguistic Linked Data Web

Four major types:
• Lexical Semantic Resources
• Dictionaries
• Corpora
• Schemas/Ontologies

Page 17: NLP2RDF Wortschatz and Linguistic LOD draft

Interlinking Wortschatz: Research and Use Case

Iterated co-occurrences can be computed with SPARQL
Wiktionary and Wortschatz can be loaded into the same database

Interesting questions:
• What is the overlap and coverage?
• Which Wiktionary relation can be linked to which statistical relation?
• Can we build tools that help Wiktionary editors (suggestions)?
• Wiktionary links words across languages. Are there any similar patterns?
• Can we validate the Wiktionary RDF dump with Wortschatz?
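As an illustration of iterated co-occurrences, here is a small in-memory sketch. It follows one common reading of the term, in which each word's co-occurrence set is treated as a new "sentence" and co-occurrences are computed again. The sentences are made up, significance measures are left out, and the slide's suggestion is to run this with SPARQL over RDF rather than in Python.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(units):
    """Map each item to all items that share at least one unit with it."""
    partners = defaultdict(set)
    for unit in units:
        for a, b in combinations(sorted(set(unit)), 2):
            partners[a].add(b)
            partners[b].add(a)
    return dict(partners)


# Hypothetical tokenized sentences.
SENTENCES = [
    ["doctor", "hospital"],
    ["nurse", "hospital"],
    ["doctor", "patient"],
    ["nurse", "patient"],
]

# First order: words that appear in the same sentence.
first = cooccurrences(SENTENCES)

# Second order: words that appear in the same first-order co-occurrence set.
second = cooccurrences(first.values())

print(sorted(first["doctor"]))
print(sorted(second["doctor"]))
```

In this toy corpus "doctor" and "nurse" never share a sentence, yet they are second-order co-occurrences because they share the same neighbours.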

Page 18: NLP2RDF Wortschatz and Linguistic LOD draft

Open Licences – Focus of LOD2 and OKFN

http://ckan.net/

CKAN is an open registry of data and content packages. Harnessing the CKAN software, this site makes it easy to find, share and reuse content and data, especially in ways that are machine automatable.

Working Group on Open Data in Linguistics
http://wiki.okfn.org/wg/linguistics

• Founded in Nov 2010
• 6-7 members
• Membership open, please join

Page 19: NLP2RDF Wortschatz and Linguistic LOD draft

Standardized Formats: Part 1 – Corpora

http://www.sfb632.uni-potsdam.de/~d1/paula/doc/

PAULA XML is the Potsdamer Austauschformat für linguistische Annotation ("Potsdam Interchange Format for Linguistic Annotation"). It is an XML-based standoff representation format, which has been designed to represent data with heterogeneous annotation layers produced by different tools. For visualization and querying of PAULA XML data, the database ANNIS can be used.

Christian Chiarcos at work: PAULA will become POWLA and will be used for representation of corpora annotations.

Page 20: NLP2RDF Wortschatz and Linguistic LOD draft

Standardized Formats: Part 2 – the Web

Bottom layer of the NLP2RDF stack can be reused:
An ontology to represent Strings (formerly the SSO).

In his latest book, Wikinomics, Don Tapscott explains deep changes in technology, demographics and business.

• URIs to represent Strings e.g. http://nlp2rdf.org/example/Don_Tapscott

• Relation between Strings: previous, next, sub, super
• http://nlp2rdf.org/example/Don is a subString of the above
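A minimal sketch of how such String URIs and relations could be generated. The URI prefix is taken from the slide; the minting rule (spaces become underscores) and the predicate names `subStringOf`, `superStringOf`, `next` and `previous` are placeholders chosen for this example and are not the actual vocabulary of the ontology.

```python
from urllib.parse import quote

PREFIX = "http://nlp2rdf.org/example/"


def string_uri(text):
    """Mint a URI for a string (assumed rule: spaces become underscores)."""
    return PREFIX + quote(text.replace(" ", "_"))


def string_triples(text):
    """Relate a string to its whitespace-separated tokens."""
    whole = string_uri(text)
    token_uris = [string_uri(token) for token in text.split()]
    triples = []
    for uri in token_uris:
        triples.append((uri, "subStringOf", whole))
        triples.append((whole, "superStringOf", uri))
    for left, right in zip(token_uris, token_uris[1:]):
        triples.append((left, "next", right))
        triples.append((right, "previous", left))
    return triples


triples = string_triples("Don Tapscott")
for triple in triples:
    print(triple)
```

With this naive rule, two occurrences of the same word in one text receive the same URI, so a real scheme needs some way to tell occurrences apart, for example by position or context.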

Page 21: NLP2RDF Wortschatz and Linguistic LOD draft

Standardized Formats: Part 2 – the Web

• RDFa allows for flexible in-line annotations
• Multiple services can be integrated ad hoc
• Multiple layers of annotation can be used

• Full compatibility with POWLA
• Trade-off between flexibility and speed

Page 22: NLP2RDF Wortschatz and Linguistic LOD draft

Knowledge Acquisition

Tiger Corpus Navigator

Page 23: NLP2RDF Wortschatz and Linguistic LOD draft

Ontology Learning

Johanna Völker – Learning Expressive Ontologies (LExO)

# Example:
# A fish is any aquatic vertebrate animal that is covered with scales,
# and equipped with two sets of paired fins and several unpaired fins.
#
# [fish] subClassOf [any aquatic vertebrate animal that is covered …]

#Construct {?sub rdfs:subClassOf ?super} {
Construct {?sub owl:equivalentClass ?super} {
  ?is a penn:BePresentTense .
  ?is nlp:superToken ?is_any_aquatic_ .
  ?is_any_aquatic_ a olia:VerbPhrase .
  ?is_any_aquatic_ nlp:syntacticSubToken [ nlp:normUri ?super ] .
  ?animal nlp:cop ?is .
  ?animal nlp:nsubj ?fish .
  ?fish nlp:superToken [ nlp:normUri ?sub ] .
}

Page 24: NLP2RDF Wortschatz and Linguistic LOD draft

Standing on the shoulders of giants

Markus Strohmaier, TU Graz

Johanna Völker, Uni Mannheim

Christian Chiarcos, SFB632 - Uni Potsdam

Sören Auer, Uni Leipzig

Jens Lehmann, Uni Leipzig

Thank you for your attention