Language resources, standardization and modern trends in NLP
Simon Krek
Jožef Stefan Institute, Artificial Intelligence Laboratory, Slovenia
COST Action
Working Groups / Objectives
• WG1: Integrated interface to European dictionary content
• WG2: Retro-digitized dictionaries
• WG3: Innovative e-dictionaries
• WG4: Lexicography and lexicology from a pan-European perspective
Innovative e-dictionaries
• The third working group will focus on the development of digitally born dictionaries, with particular attention to the latest developments in e-lexicography and the interface between lexicography and computational linguistics.
• Work will be carried out on:
  • the analysis of the possible impact of automatic acquisition of lexical data
  • the analysis of the interface between dictionaries and computational lexica (cf. wordnets) and syntactically and semantically annotated corpora (cf. FrameNet, SemCor, Senseval)
  • the investigation of the possible use of dictionary content for computational linguistic applications
Electronic lexicography in the 21st century
• The first eLex conference: New challenges, new applications, Louvain-la-Neuve (Belgium), 22 to 24 October 2009
• The second eLex conference: New applications for new users, Bled (Slovenia), 10 to 12 November 2011
• The third eLex conference: Thinking outside the paper, Tallinn (Estonia), 17 to 19 October 2013
• The fourth eLex conference: Linking Lexical Data in the digital age, Herstmonceux Castle (UK), 11 to 13 August 2015
eLex 2011
Language data for digital natives: old wine in a new bottle or...?
Text mining is a challenge
Content is a problem
Presentation is a bigger problem
What is in the middle?
(Web, Mobile) Design
Lexicography
Natural Language Processing
?
Sinclair: Floating dictionary (2001)
• »A few years ago I felt that the time was ripe to plan a new kind of dictionary, one that would never exist on paper, but would be automatic or almost automatic in its selfupdating.
• It would, so to speak, float on top of a corpus, rather like a jellyfish, its tendrils constantly sensing the state of the language.
• As well as reporting on the settled usage and meanings of the words and phrases of a language, like a normal dictionary does, the floating dictionary, when interrogated, dips into the corpus and checks this information, offering instances that match its criteria for the senses; also it explores further to see if there are any instances that conflict with the criteria, and may signify a development of a sense or the emergence of a new usage altogether.
• Within the limits of its powers, it organises this evidence as a comment on the existing dictionary entry.«
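The behaviour Sinclair describes can be illustrated with a toy sketch. It is not Sinclair's design: the `Sense` and `Entry` classes, the `cue_words` "sense criteria" and the `interrogate` function are all hypothetical names introduced here, and matching on shared context words is a deliberately crude stand-in for real sense criteria. The sketch only shows the two outputs the quotation mentions: instances that match a sense, and instances that match none and may signal a new usage.

```python
from dataclasses import dataclass


@dataclass
class Sense:
    label: str
    cue_words: set  # hypothetical "criteria": context words expected near the headword


@dataclass
class Entry:
    headword: str
    senses: list


def interrogate(entry, corpus):
    """Dip into the corpus: group sentences by matching sense, collect the rest."""
    matches = {sense.label: [] for sense in entry.senses}
    unexplained = []  # candidates for a sense development or a new usage
    for sentence in corpus:
        tokens = set(sentence.lower().split())
        if entry.headword not in tokens:
            continue  # sentence does not contain the headword at all
        hit = False
        for sense in entry.senses:
            if tokens & sense.cue_words:
                matches[sense.label].append(sentence)
                hit = True
        if not hit:
            unexplained.append(sentence)
    return matches, unexplained


entry = Entry("mouse", [
    Sense("animal", {"cat", "cheese", "tail"}),
    Sense("device", {"click", "keyboard", "cursor"}),
])
corpus = [
    "The cat chased the mouse",
    "Click the left mouse button",
    "She was a quiet mouse at meetings",
]
matches, unexplained = interrogate(entry, corpus)
print(matches)
print(unexplained)
```

In this toy run the third sentence ends up in `unexplained`, which is the kind of evidence the floating dictionary would attach as a comment on the existing entry.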
Does dictionary content know itself?
• The language technology (LT) community now has a basic idea of how to store various types of information
  • so does the Semantic Web (SW) community: RDF, RDFa, RDFS, OWL, SKOS, and more
• standardization in human-oriented dictionary encoding was never really successful (XML, TEI?)
• the question is: if different types of lexicographic information intended for human users have to know each other, will the format be dictated by LT standards? (Probably yes.)
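To make the Semantic Web option concrete, here is a minimal sketch of one dictionary entry written out as RDF triples in N-Triples syntax. The SKOS and RDF property URIs are the standard ones. The `http://example.org/dict/` namespace, the function names, and the choice to model an entry as a `skos:Concept` are assumptions made for illustration, not a recommended encoding.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/dict/"  # hypothetical namespace for this sketch


def uri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_ntriples(entry_id, headword, definition, lang):
    """Serialize a headword and its definition as three N-Triples lines."""
    subject = uri(BASE + entry_id)
    triples = [
        (subject, uri(RDF_TYPE), uri(SKOS + "Concept")),
        (subject, uri(SKOS + "prefLabel"), literal(headword, lang)),
        (subject, uri(SKOS + "definition"), literal(definition, lang)),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


ntriples = entry_to_ntriples(
    "jellyfish", "jellyfish", "a free-swimming marine animal with tendrils", "en"
)
print(ntriples)
```

Once entries from different dictionaries are expressed this way, they can refer to each other by URI, which is one reading of lexicographic information "knowing each other".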
Similar domain, different task
• EU projects: http://www.xlike.org/, http://xlime.eu/
• The goal of the XLike project is to develop technology to monitor and aggregate knowledge that is currently spread across mainstream and social media, and to enable cross-lingual services for publishers, media monitoring and business intelligence.
• xLiMe proposes to extract knowledge from different media channels and languages and relate it to cross-lingual, cross-media knowledge bases. By doing this in near real-time we will provide a continuously updated and comprehensive view on knowledge diffusion across media.
Services
• Newsfeed
  • a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world
  • http://newsfeed.ijs.si/visual_demo/
  • http://enrycher.ijs.si/
• EventRegistry
  • a system that can analyze news articles and identify world events
  • can identify groups of articles in different languages that describe the same event
  • http://eventregistry.org/
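How articles in different languages might be grouped into one event can be sketched as follows. This is not EventRegistry's actual method. It assumes that semantic enrichment has already mapped each article to a set of language-independent concept identifiers, and it uses a simple greedy clustering on Jaccard overlap. The function names, the threshold and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two concept sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_articles(articles, threshold=0.5):
    """Greedily assign each article to the most similar existing cluster."""
    clusters = []  # each cluster: {"concepts": set, "ids": list}
    for article_id, concepts in articles:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(concepts, cluster["concepts"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # nothing similar enough: this article starts a new event
            clusters.append({"concepts": set(concepts), "ids": [article_id]})
        else:
            best["ids"].append(article_id)
            best["concepts"] |= concepts
    return [cluster["ids"] for cluster in clusters]


# Hypothetical enriched articles: (id, language-independent concept set)
articles = [
    ("en-1", {"Ljubljana", "election", "parliament"}),
    ("sl-1", {"Ljubljana", "election", "parliament", "vote"}),
    ("de-1", {"Berlin", "football", "final"}),
    ("fr-1", {"election", "parliament", "Ljubljana"}),
]
events = group_articles(articles)
print(events)
```

The point for lexicography is the shared representation: once text in any language is reduced to the same concept identifiers, cross-lingual grouping becomes a set comparison.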
EventRegistry system architecture
ENeL perspective
• Complex story about events = complex story about words/languages
Slovene Estonian English German French Hungarian Croatian Basque Swedish …
Cross-lingual horizontal axis
Diachronic vertical axis
2015 1950 1900 1850 1800 …
Cross-lingual synchronic horizontal axis
• "Never without data"
  • Existing lexical resources (dictionaries, BabelNet, AnyNet, Linked Data, etc.)
  • Corpora, the Web and NLP
• Definition extraction (and generation)
  • RANLP 2009, International workshop on definition extraction
  • Language Technology for eLearning (http://www.lt4el.eu/)
• Extraction of grammatical or lexical information
  • Kookkurrenzdatenbank (http://corpora.ids-mannheim.de/ccdb/)
  • Sketch Engine (http://www.sketchengine.co.uk/)
• Extraction of good (dictionary) examples
  • ENeL Vienna workshop
• Extraction of translation equivalents
  • Linguee etc.
• Extraction of Multi-word Expressions (Parseme)
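As one small illustration of extracting lexical information from a corpus, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), a generic association measure. It is not the scoring used by any of the tools listed above. The function name, the `min_count` cut-off and the toy sentences are assumptions for the example.

```python
import math
from collections import Counter


def pmi_collocations(sentences, min_count=2):
    """Rank adjacent word pairs by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


toy_corpus = [
    "strong tea is nice",
    "she drinks strong tea",
    "strong tea helps",
    "the tea is hot",
    "coffee is cold",
]
ranked = pmi_collocations(toy_corpus)
for pair, score in ranked:
    print(pair, round(score, 2))
```

On real corpora the same idea is applied per grammatical relation (for example a noun with its modifiers, or a verb with its objects) rather than to raw adjacent pairs.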
Automatically Constructed Dictionary Content
Complex multimodal information extraction
Explain, combine, exemplify
Definitions
Found
Generated
Combinations
Collocations
as subject
as object
Multi-word expressions
Knowledge-Rich Contexts
Real-time data
Streaming
News Feeds
Sounds, graphics and visuals
Sounds
Speech Synthesis
Recorded / Speech Recognition
Graphics
Images
Videos
Multi-lingual, cross-lingual
(Hidden) parallel corpora
hub language
ENeL
• WG1: Integrated interface to European dictionary content
• WG2: Retro-digitized dictionaries
• WG3: Innovative e-dictionaries
• WG4: Lexicography and lexicology from a pan-European perspective
Retro-digitization
• Digital Agenda for Europe (Europe 2020 Strategy – one of the pillars)
• Commission’s Recommendation on the digitization and online accessibility of cultural material and digital preservation
  • Put in place solid plans for their investments in digitization and foster public-private partnerships to share the gigantic cost of digitization (recently estimated at € 100 billion).
  • Make 30 million objects available through Europeana by 2015, including all Europe's masterpieces which are no longer protected by copyright, and all material digitized with public funding.
Retro-digitized dictionaries
• encode and enrich dictionary data (standards and tools)
  • (the question is: if different types of lexicographic information intended for human users will have to know each other – will the format be dictated by LT standards?)
  • definitions
  • examples
  • etymology
  • other types of information
• linking dictionary data with historical corpora
  • http://nl.ijs.si/imp/
Lexical Cloud
Integrated interface to European (dictionary / lexical) content
Any dictionary
Anypedia
AnyNet
Any corpus
Any base
Conclusion
• any word/concept in any language on any device offers a story about its current life and its history
  • what is a "concept" (in the sense of "event")? X-Nets? Wikipedia?
  • what is the central format?
• what is the appropriate context?
  • EU projects? ICT? Cultural Heritage?
  • Infrastructure (e.g. Clarin)?