
Computer-aided lexicography

Creation, publication, and use of dictionaries: our experience at the Ixa NLP Group

Xabier Artola Zubillaga

Faculty of Computer Science, Donostia

IULA - InfoLex (UPF), 2010-11-26

Using the dictionary is not always fun

How many legs has a fly?

This looks like a past participle of some verb!!!: shrunk

There must be a word for...

to remove the hair from the skin of goats and sheep

I need a verb now!:

the fire ...s

Is there any relationship between these words? Which one?:

to burn, to blacken

Which one is correct?: a quick shower or a fast shower


Using the dictionary is not always fun

Translating buy for into Spanish:

The company bought stock for investment purposes

They kept buying for several months

They bought stock for €3,000,000

The defendant said he bought it for his brother

look after: what does it mean?


Outline of the presentation

Creation

Computer-aided lexicography: text corpora and language databases

Dictionary editing environments

Knowledge representation issues

Publication

Print

Electronic (on-line or whatever)

From the editing application to the final product

Use

Use cases, users, and dictionary software functionality

Do we get from electronic dictionaries what we could expect from them?


Creation: dictionary making

Still in the 20th century: piles of index cards within shoeboxes

Word usage was compiled largely on paper slips or index cards, as the basis for the creation of dictionary entries

Computer technology

text corpora (concordances, KWIC) to:

acquire real language use examples

discover and ascertain word senses, extract definitions

find and verify collocations

find neologisms

identify multiword lexical units

databases (wide sense) to store dictionary contents
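To make the KWIC (keyword in context) idea concrete, here is a minimal sketch in Python. The function name `kwic`, the `width` parameter and the sample sentences are assumptions made for this illustration; they do not come from any tool mentioned in the talk.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    lines = []
    # Whole-word, case-insensitive match of the keyword.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Toy corpus (invented sentences, for illustration only).
corpus = (
    "She took a quick shower before work. "
    "A quick glance was enough. "
    "He prefers a quick shower to a long bath."
)

for line in kwic(corpus, "quick", width=20):
    print(line)
```

Lining up the occurrences this way is what lets a lexicographer spot recurring neighbours such as "quick shower", which is the kind of evidence used to find and verify collocations.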


Creation: dictionary making

Today's electronic dictionaries: where do we get dictionary content from?

print dictionaries (legacy): scanning

OCR

parsing of typographic features

importing it from glossaries, entry lists, other electronic dictionaries...

from scratch: editing (lexicographer)

word processors

databases

XML editors

publishers' custom applications

dictionary editing software: TshwaneLex...


Creation: dictionary making

Building electronic dictionaries from legacy dictionaries: [scanning + OCR +] parsing of typographic features

Goal: to obtain a structural representation of the dictionary content (often in XML)

from text to a lexicographic database

Two real cases (Ixa NLP Group):

eEH: from RTF to TEI SGML / XML (Arregi et al., 2003, 2007)

DBE: from RTF to TEI XML (Alegria et al., 2006a, 2006b)
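The step from typographic features to a structural representation can be sketched as follows. This is only an illustration under strong assumptions: the input is taken to be a line where bold marks the headword, italics mark the part of speech, and senses are numbered "1.", "2.", and so on. The `<b>`/`<i>` input markup, the regular expressions and the function name `parse_entry` are hypothetical; the output element names (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`) follow TEI dictionary conventions. The real eEH and DBE conversions used a Prolog DCG and Word macros plus Ferret, respectively, not this code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input convention: <b>headword</b> <i>pos</i> numbered senses.
ENTRY_RE = re.compile(
    r"^<b>(?P<orth>[^<]+)</b>\s*<i>(?P<pos>[^<]+)</i>\s*(?P<body>.*)$"
)
# Sense numbers such as "1. " or "2. " (naive: any "digits, dot, space").
SENSE_RE = re.compile(r"(\d+)\.\s+")


def parse_entry(line):
    """Turn one typographically marked line into a TEI-like <entry>."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError("line does not follow the assumed layout")

    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = m.group("orth").strip()
    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = m.group("pos").strip()

    # re.split with a capturing group yields: [prefix, n1, text1, n2, text2...]
    parts = SENSE_RE.split(m.group("body"))
    if len(parts) == 1:
        numbered = [("1", parts[0])]  # unnumbered body: a single sense
    else:
        numbered = list(zip(parts[1::2], parts[2::2]))

    for number, text in numbered:
        sense = ET.SubElement(entry, "sense", n=number)
        ET.SubElement(sense, "def").text = text.strip()
    return entry


sample = "<b>shrink</b> <i>v.</i> 1. to become smaller 2. to draw back"
print(ET.tostring(parse_entry(sample), encoding="unicode"))
```

A pattern this naive misfires on real data (for example, a year followed by a full stop inside a definition), which is one reason manual correction of the automatically obtained output is part of both real cases.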


eEH: from RTF to TEI SGML / XML

Sarasola I. Euskal Hiztegia. Kutxa Fundazioa: Donostia, 1996.

Basque monolingual dictionary, reference for the standard Basque dictionary (Hiztegi Batua, Academy of the Basque Language)

33,111 entries, 41,699 senses

Typical examples illustrating the use of words, drawn from corpora

From RTF to TEI SGML (later to TEI XML): definite clause grammar (DCG) written in Prolog

TEI DTD: select / customize / enhance

Manual correction of the automatically obtained output


eEH: from RTF to TEI SGML / XML

eEH: electronic Euskal Hiztegia (electronic dictionary prototype)

Sophisticated indexing system (no databases are used)

definition and example texts fully lemmatized

Users:

ordinary

advanced (philologists, lexicographers, translators...)

Functionality

full hypertext utility (from definitions and examples to corresponding entries)

basic query

advanced query

• specially designed query language

• dictionary search as in a corpus

Problem: lack of editing environment
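The hypertext facility depends on lemmatization: a word form in a definition can only be linked to its entry once it has been mapped to a headword. The sketch below shows the idea in Python. The headword set, the tiny `LEMMAS` table (standing in for a real lemmatizer) and the function names are all hypothetical, and English is used only to keep the example readable.

```python
import re

# Toy data, invented for this illustration.
HEADWORDS = {"burn", "fire", "blacken", "black"}
# Hypothetical lookup table standing in for a real morphological lemmatizer.
LEMMAS = {
    "burns": "burn",
    "burned": "burn",
    "fires": "fire",
    "blackened": "blacken",
}


def lemmatize(token):
    """Map a word form to its lemma (falls back to the lowercased form)."""
    lower = token.lower()
    return LEMMAS.get(lower, lower)


def hyperlink(definition):
    """Wrap every word whose lemma is a headword in a link to that entry."""
    def link(match):
        word = match.group(0)
        lemma = lemmatize(word)
        if lemma in HEADWORDS:
            return f'<a href="#{lemma}">{word}</a>'
        return word

    return re.sub(r"\w+", link, definition)


print(hyperlink("to make black by fire"))
print(hyperlink("it burns"))
```

For a highly inflected language such as Basque, the lookup table would have to be replaced by a proper morphological analyser, but the linking step itself stays the same.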

eEH: electronic dictionary prototype (screenshots: query interface, query language)

DBE: from RTF to TEI XML

Miyares Bermúdez E. (dir.) Diccionario Básico Escolar. Centro de Lingüística Aplicada, Santiago de Cuba. 2003.

School dictionary, monolingual

7,473 entries, 14,013 word senses (1st ed.)

From RTF to TEI P4 XML:

Word macros

Ferret (semi-automatic learning software)

TEI DTD: select / customize / enhance

Manual correction of the automatically obtained output

leXkit: dictionary editing environment

Three on-line versions, two CDs, three print editions


DBE: CD and on-line (3rd version)

Screenshot callouts: entry look-up, index look-up, orthographic help, letter indexes, image, request, response, cross-references, other functionality


Dictionary editing environments

Essential if databases or markup languages are chosen for dictionary knowledge representation

Wish list

all kinds of editing facilities: XML-transparent, navigation facilities, cross-reference building, wizards...

integrity constraint checking and consistency

multimedia integration

import facilities

collaborative editing

Wiktionary

discussion forums

• Ultralingua (online discussion forum)

• Leo collaborative bilingual dictionaries
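One item on the wish list, integrity constraint checking, can be illustrated with a small Python sketch that looks for cross-references pointing to entries that do not exist. The XML layout is an assumption made for the example: `xr` and `ref` with a `target` attribute echo TEI usage, while the plain `id` attribute and the function name `dangling_references` are simplifications invented here.

```python
import xml.etree.ElementTree as ET

# Toy dictionary fragment (invented): "scorch" is referenced but not defined.
SAMPLE = """
<dictionary>
  <entry id="burn"><xr><ref target="blacken"/></xr></entry>
  <entry id="blacken"><xr><ref target="scorch"/></xr></entry>
</dictionary>
"""


def dangling_references(xml_text):
    """Return (source entry id, missing target id) for every broken link."""
    root = ET.fromstring(xml_text)
    known = {entry.get("id") for entry in root.iter("entry")}
    problems = []
    for entry in root.iter("entry"):
        for ref in entry.iter("ref"):
            target = ref.get("target")
            if target not in known:
                problems.append((entry.get("id"), target))
    return problems


print(dangling_references(SAMPLE))
```

An editing environment could run a check of this kind whenever an entry is saved, so that broken cross-references are caught before publication.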


Dictionary editing environments

Wish list (cont'd)

customized output: dictionary publication

different dictionary products:

• unabridged dictionary

• student's dictionary

• ...

export formats:

• electronic versions: XML, HTML, other formats...

• print: PDF, desktop publishing software...


A real case: leXkit (Ixa NLP Group)

leXkit: a dictionary content management system (Alegria et al., 2006c)

Dictionary editing and maintenance

XML-based: Berkeley DB XML (native XML database) for storage

Client-server architecture: SOAP-based communication

Suitable for different kinds of dictionaries

Main features:

Allows adding, deleting and modifying entries in a friendly fashion: XML details are transparent to the lexicographer

Provides lexicographers with all the features of a full-fledged DBMS: full search capabilities, safe storage, concurrent access, etc.


leXkit

Main features (cont'd):

Maintains entry states (version control and tracking)

Automatically generates the files and components needed by a running application such as the current electronic DBE.

Tailored output is feasible: the data required for print editions, diversified electronic versions, etc. can be exported easily.

Architecture

Client

The component used by the lexicographer

Tool integration (corpora, other dictionaries...)

Server: database, concurrency, configuration files (dictionary schema definitions, wizards, etc.), import/export utilities, backups...


leXkit

Screenshot callouts:

Index: dictionary entries, search results

Editor: edition tree, predefined tasks

Viewer: entry preview (WYSIWYG), integrated tools

edition textbox

dictionary tabs


leXkit

Screenshot callouts:

Viewer: XML tab, entry info, session control, ...

views and info tabs


leXkit: system architecture


leXkit

Communication (client / server)

SOAP web services (RPC model + cookies)

Intermediate declarative layer (XML)

Dictionary specifications

Operations (context-dependent tasks)

Wizards (common edition operations, predefined searches...)

Other technical aspects

XSLT is widely used in the application

XSLTi: declarative language that adds interactivity to XSLT scripts

XML processing: Xerces + Xalan

Graphical interface: wxWidgets

HTML rendering: Mozilla (wxMozilla)


leXkit: wizards for the DBE


leXkit: conclusions

leXkit has been used at the CLA (Centro de Lingüística Aplicada) for editing the DBE's 2nd and 3rd editions: from 7,473 entries / 14,013 senses in the 1st edition to 10,557 entries / 19,374 senses in the 3rd one.

Building leXkit was vital to the qualitative leap achieved in this work.

Dictionary editing applications are a must, especially if dictionaries are stored in databases or XML-encoded.

leXkit can be used by other lexicographical teams to create and update dictionaries. It is available as free software (open source) at http://sourceforge.net/projects/lexkit/.


Dictionary representation

Representation is the key factor for dictionary functionality

we won't get what is not stored and adequately represented in the dictionary

the representation we choose conditions what we will later be able to get from the dictionary

Physical level

text (no access facilities, deficient structuring)

plain or somehow structured (CSV, tabular...)

rich text: typography, word processors

even the entry concept is diluted sometimes

risk: vicious circle (to be avoided)


Dictionary representation

Physical level (cont'd)

database: relational (structure, indexing, query and update facilities)

one database = one dictionary

• is each pertinent information unit correctly represented in a field or column?

integrated dictionary system (publishers)

• publisher's general dictionary database

marked text

HTML: mark-up language, presentation-oriented

SGML / XML: mark-up metalanguage, content-oriented


Dictionary representation

content-oriented marked text constitutes a better data model for the representation of dictionary content and structure than the relational model

lexical information is inherently complex

apparently similar information is represented in dictionaries in structurally different ways

intra-entry hierarchical structure is not adequately represented using the relational model

the information must be split in several tables: redundancy, factorization problems

construction of user-friendly graphical user interfaces is not always easy

query languages are often complex and non-intuitive


Dictionary representation

content-oriented marked text... (cont'd)

content-oriented marked texts (SGML, XML...)

descriptive markup (structure, content)

more flexible data representation model

better reflects the lexicographic data model used in dictionaries

drawback: manageability and efficiency

XML native databases: indexing, query and update facilities

TEI (Text Encoding Initiative): a whole chapter full of recommendations on marking up human-oriented dictionaries
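A short Python sketch may help show why content-oriented markup fits the intra-entry hierarchy. The entry below is invented for illustration and follows TEI dictionary element names only loosely (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`, `cit`, `quote`); the helper `senses_with_examples` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented sample entry: senses nest definitions and optional examples.
ENTRY = """
<entry>
  <form><orth>shower</orth></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n="1">
    <def>a brief fall of rain</def>
  </sense>
  <sense n="2">
    <def>a wash under a spray of water</def>
    <cit type="example"><quote>a quick shower</quote></cit>
  </sense>
</entry>
"""

entry = ET.fromstring(ENTRY)


def senses_with_examples(entry):
    """List (sense number, definition, examples) following the XML nesting."""
    result = []
    for sense in entry.findall("sense"):
        examples = [
            quote.text
            for quote in sense.findall("cit[@type='example']/quote")
        ]
        result.append((sense.get("n"), sense.findtext("def"), examples))
    return result


for number, definition, examples in senses_with_examples(entry):
    print(number, definition, examples)
```

The examples stay attached to their sense simply because they are nested inside it. A relational design would typically need separate entry, sense and example tables joined by keys to express the same thing.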


Dictionary representation

Physical level (cont'd)

dictionary knowledge bases: reasoning, artificial intelligence techniques, knowledge representation languages

the only way to extract implicit knowledge from dictionary structures


Dictionary representation

Conceptual level

what information/knowledge is represented?

orthography, pronunciation, grammar (mostly POS), register, definition...

morphology? irregular inflection paradigms?

• important in learner's dictionaries, highly inflected languages... but not only

• two real cases (Ixa NLP Group)

• Elhuyar eu-es (MS Word plugin): eu and es lemmatization

• UZEI synonyms (MS Word plugin): eu lemmatization


Dictionary representation

Conceptual level (cont'd)

dictionary typology

monolingual / bi- or multilingual

language dictionary / encyclopedic

general use / specific (terminology)

...

implicit knowledge: in definitions, examples, lexical semantics

WordNet, thesauri...

association lists, semantic networks

inference, reasoning


Publication: presentation, output

print

how to obtain the "file" to submit to the publisher?

electronic

typology

on-line

• on-line dictionaries (free, subscriptions...)

• dictionary directories: OneLook Dictionary Search

• multi-dictionary access tools: Euskalbar, a Firefox plugin that integrates ~30 dictionaries and corpora

• the web (corpus) as a dictionary

• translation memories, parallel corpora


Publication: presentation, output

typology (cont'd)

desktop dictionary software

• standalone applications: personal computers, small handheld devices, mobile phones...

• integrated dictionaries, plugins: in word processors, web browsers...

• Elhuyar eu-es (MS Word plugin)

• UZEI synonyms (MS Word plugin)

• multi-dictionary tools: Babylon

machine-readable dictionaries: PDF...

formats: HTML, XML, PDF, PS, electronic book formats, application proprietary formats...


Publication: presentation, output

Which is the way leading from the editing environment or database to the print or to the electronic version?

DBE

CD and on-line: XML to HTML (dynamic transformation, XSLT)

print: XML to PDF (XSLT, XSL-FO)

Hiztegi Batua (Euskaltzaindia, Basque Language Academy):

on-line: XML to HTML (XSLT)

publishing: HTML to Quark (manually)

download: Quark to PDF
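The XML-to-HTML step can be pictured with the following Python sketch. Note the substitution: the real pipelines above use XSLT stylesheets, whereas here a plain Python function stands in for the stylesheet, because the Python standard library has no XSLT processor. The entry layout, the HTML structure and the function name `entry_to_html` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def entry_to_html(entry_xml):
    """Render one TEI-like <entry> as an HTML fragment.

    Plays the role an XSLT template would play in the real pipeline.
    """
    entry = ET.fromstring(entry_xml)
    orth = entry.findtext("form/orth", default="")
    pos = entry.findtext("gramGrp/pos", default="")

    parts = [
        f'<div class="entry"><b>{escape(orth)}</b> <i>{escape(pos)}</i>',
        "<ol>",
    ]
    # One list item per sense, in document order.
    for sense in entry.findall("sense"):
        definition = sense.findtext("def", default="")
        parts.append(f"<li>{escape(definition)}</li>")
    parts.append("</ol></div>")
    return "".join(parts)


# Invented sample entry.
sample = (
    "<entry><form><orth>fly</orth></form>"
    "<gramGrp><pos>n</pos></gramGrp>"
    "<sense><def>an insect with two wings and six legs</def></sense></entry>"
)
print(entry_to_html(sample))
```

Because the presentation is derived from the content markup, the same source entry could feed a second function (or stylesheet) for print, which is the point of keeping the path from the editing environment to each product short.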


Publication: presentation, output

Which is the way... (cont'd)

other solutions:

[general dictionary]

• Oracle to HTML (web)

• Oracle to Quark (print)

[terminological dictionary]

• 4D to Quark (print)

• 4D to XML (TBX) to XHTML (web)

customized output: proprietary formats (mainly in desktop dictionary software)

The longer the way... the easier it is to get lost!

updates will be more costly


Use: functionality

Use cases

language input: typical lookup (definitions, multiword expressions...)

language output: is the dictionary well oriented to be used in language production situations?

much more information is needed when we want to actually use a word in speech or in writing than when we only want to understand a word in a passage.

translation tasks: language input and output

special information is needed: faux amis...

language learning activities: more information is needed about context of use, connotations of a word, collocations, etc.


Use: functionality

Users (models, profiles)

native speakers

language learners

translators

students, children...

specialists: scientists, technicians...

Functionality

do we get from electronic dictionaries what we could expect from them?

are they something more than their print counterparts?


Dictionaries of the future: http://www.oxforddictionaries.com/page/84

Print dictionaries have been joined by dictionaries in electronic form: these are often enriched with many additional features, such as sound recordings or sophisticated links to other related material.

...

It seems likely that by the middle of this century, if not before, all dictionaries will be in electronic form. This means that limitations of space, which have always been a serious issue for lexicographers and dictionary publishers, will be much less important. Dictionaries will be able to include more material: more words and definitions, interactive features, and multimedia content such as images, sound, and video. They will also be updated much more rapidly than ever before. But the general idea of a dictionary - a resource that provides explanations of words and how they are used - will probably remain the same.


Use: functionality

Functionality (cont'd)

what we get

search facilities: from basic lookup to advanced queries

speed, storage facilities

orthographic help (closeness)

integration: word processors, reading applications...

new features: multimedia (recorded sounds, images, videos), hyperlinks

interactivity?

wish list

definition and examples: corpus queries

navigation: fully hyperlinked (lemmatization of definitions, examples...)

morphology, grammar, derivation...
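Orthographic help based on closeness can be approximated with the standard library, as in this Python sketch. The headword list and the function name `suggest` are invented for the example; `difflib.get_close_matches` is a general-purpose similarity helper and is not claimed to be what any of the dictionaries mentioned here actually use.

```python
import difflib

# Toy headword list (invented for illustration).
HEADWORDS = ["shrink", "shrunk", "shower", "blacken", "burn"]


def suggest(query, headwords=HEADWORDS, limit=3):
    """Return up to `limit` headwords that are spelled similarly to `query`."""
    return difflib.get_close_matches(
        query.lower(), headwords, n=limit, cutoff=0.6
    )


print(suggest("shrunck"))   # a misspelling of "shrunk"
print(suggest("blaken"))    # a misspelling of "blacken"
```

The `cutoff` value trades recall for precision: lowering it returns more, but less similar, candidates.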


Use: functionality

wish list (cont'd)

use of words, lexical combinatorics, collocations

• dictionary and corpus integration?

find a word from its definition, explore related concepts...: OneLook Reverse Dictionary (statistical language processing)

intelligent dictionary? why not integrate different kinds of information and tools (WordNet, thesauri, multimedia, collocations, thematic...) in powerful language help systems, and provide them with inference and reasoning capabilities?

• Hiztsua / SIAD (Artola, 1993; Agirre et al. 1994a, 1994b, 1997)

• AnHitz (Arregi, 1995; Agirre et al. 1996, 2000)

Have we investigated enough the ways users use dictionaries?
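Finding a word from its definition can be sketched, very roughly, as ranking entries by how many content words their definitions share with the user's description. The Python below does exactly that and nothing more; it is a plain word-overlap heuristic, not the statistical method used by OneLook, and the definitions, stopword list and function names are invented for the example.

```python
import re

# Toy definitions (invented for illustration).
DEFINITIONS = {
    "shear": "to remove the wool or hair from sheep or goats by cutting",
    "burn": "to be on fire or to destroy by fire",
    "blacken": "to make or become black",
}
STOPWORDS = {"to", "the", "of", "or", "a", "by", "from", "and", "be", "on"}


def tokens(text):
    """Lowercased content words of `text`, as a set."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word for word in words if word not in STOPWORDS}


def reverse_lookup(description, definitions=DEFINITIONS):
    """Rank headwords by the number of content words shared with `description`."""
    wanted = tokens(description)
    scored = []
    for headword, definition in definitions.items():
        overlap = len(wanted & tokens(definition))
        if overlap:
            scored.append((overlap, headword))
    # Highest overlap first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [headword for _, headword in scored]


print(reverse_lookup("cut the wool from sheep"))
print(reverse_lookup("make something black"))
```

Lemmatizing both the description and the definitions, as eEH does for its texts, would let "cut" match "cutting" and make the ranking considerably more robust.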


Hiztsua / SIAD: Intelligent Dictionary Help System

Built from a small French dictionary: Le Plus Petit Larousse (Librairie Larousse. Paris, 1980).

Definitions parsed using NLP techniques: morphology, syntax, definition patterns, lexico-semantic relationships

Building procedure:

LPPL (typed directly into a DB GUI)

Dictionary Database (DDB, relational)

Dictionary Knowledge Base (DKB)

DKB: interrelated network of concepts (semantic network):

hypernymy/hyponymy

synonymy, antonymy

meronymy

semantic roles


Hiztsua / SIAD: Intelligent Dictionary Help System

Frame-based system, allowing inheritance, inference, composition of lexical relationships

Prototype conceived and designed for human users

from the study of questions that human users would like to have answered when consulting a dictionary

Functionality that allows implicit knowledge hidden in the dictionary structures to be extracted and inferred

definition queries, searches of alternative definitions

differences, relations and analogies between concepts

thesaurus-like word search

verification of concept properties and of interconceptual relationships

...
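How a frame-based representation with inheritance can answer a question that no single entry states explicitly (for instance, how many legs a fly has) is sketched below in Python. The `Frame` class, its method names and the three sample concepts are assumptions made for this illustration; they are not taken from the Hiztsua / SIAD implementation.

```python
class Frame:
    """A concept with local properties and an optional hypernym (parent)."""

    def __init__(self, name, hypernym=None, **properties):
        self.name = name
        self.hypernym = hypernym
        self.properties = properties

    def get(self, prop):
        """Look up a property locally, then along the hypernymy chain."""
        frame = self
        while frame is not None:
            if prop in frame.properties:
                return frame.properties[prop]
            frame = frame.hypernym
        return None

    def is_a(self, other):
        """True if `other` is this frame or one of its hypernyms."""
        frame = self
        while frame is not None:
            if frame is other:
                return True
            frame = frame.hypernym
        return False


# Toy knowledge base (invented): "legs" is stated only for "insect".
animal = Frame("animal")
insect = Frame("insect", hypernym=animal, legs=6)
fly = Frame("fly", hypernym=insect, wings=2)

print(fly.get("legs"))    # inherited from "insect"
print(fly.is_a(animal))   # follows two hypernymy links
```

The number of legs is never stored on "fly"; it is inferred through the hypernymy link, which is the kind of implicit knowledge a plain entry lookup cannot deliver.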


Anhitz: A translator-oriented Dictionary System

Intelligent help system for human translators

the dictionary is conceived as an "active" tool that observes the activity of the user while he or she is working, providing him or her with "intelligent help"

Prototype based on two monolingual dictionaries (French and Basque):

two monolingual knowledge bases

one bilingual DKB establishes equivalence links between concepts from the monolingual dictionaries

diverse types of equivalence relationships: more general, more specific...


Anhitz: A translator-oriented Dictionary System

Functionality:

empirical observation and study, using protocols and questionnaires, of the activity of professional and non-professional translators

to model the translator-dictionary interaction when translating lexical units from the source language into the target language

user's goals and intentions, dictionary queries made, observations, etc. have been recorded

monolingual and bilingual, locution, synonym... dictionaries

real use cases

functions classified basically according to three main activities:

source text understanding

target text generation

search for translation equivalents


Anhitz: A translator-oriented Dictionary System

[Diagram labels: trans-lex; TRANSLATING THE SOURCE WORD; GETTING THE CONTEXT; SOURCE WORD UNDERSTANDING; SEARCHING FOR THE EQUIVALENT; TARGET WORD GENERATION; ACQUIRING THE MODEL OF THE TRANSLATOR]


Anhitz: A translator-oriented Dictionary System

[Diagram labels: TRANSLATING THE SOURCE WORD; TARGET WORD GENERATION; FINDING GENERATION HYPOTHESES; DISCRIMINATING GENERATION HYPOTHESES; GENERATION HYPOTHESIS VERIFICATION; FROM THE DICTIONARY ENTRY TO THE LEXICAL UNIT. Function codes: exam, rths, comp-sem, sint-pat, prod, colloc, ver-reg, dis-pro, dpro]


Anhitz: A translator-oriented Dictionary System

Primitive functions:

morphological analysis of a word form

choice of a dictionary entry / word sense in a given context

list of the possible senses that could be suitable for a word in a given context

definition request

reformulation of a definition

request of the properties of a concept

choice of a definition in a given context

request of differences or relationships between two concepts

verification of relationships between two concepts

definition verification


Anhitz: A translator-oriented Dictionary System

Primitive functions (cont'd):

verification of the properties of a concept

thesaurus-like search of concepts

request of examples

direct lexical translation of a word form

verification of translation equivalents

semantic compatibility between two word senses according to a given relationship

search for syntactic constructions corresponding to a given pattern

search of lexical collocations

request of the verb regime

search for potential translation equivalents


To finish...

Dictionary editing: provide the lexicographer with advanced tools

Stress the importance of dictionary knowledge representation: we will get what we keep, and we will get it if we represent it adequately for the purpose required

We should investigate how users actually use dictionaries, in order to build more "intelligent" systems, capable of anticipating users' needs and helping them better

The dictionary of the future should be a "different" thing, not merely a "faster" print dictionary

integration of different kinds of information and tools in powerful language help systems

rich and heterogeneous functionality and access ways to the lexicon


Bibliography

Miyares Bermúdez E., Ruiz Miyares L., Álamo Suárez C., Pérez Marqués C., Artola Zubillaga X., Alegria Loinaz I., Arregi Iparragirre X. 2010a. Las últimas ediciones del Diccionario Básico Escolar de Cuba. IV Congreso Internacional de Lexicografía Hispánica. Tarragona.

Miyares Bermúdez E., Ruiz Miyares L., Álamo Suárez C., Pérez Marqués C., Artola Zubillaga X., Alegria Loinaz I., Arregi Iparragirre X. 2010b. La segunda y tercera ediciones del Diccionario Básico Escolar. Euralex2010. Leeuwarden (The Netherlands).

Arregi X., Arriola J.M., Artola X., Díaz de Ilarraza A., Garcia E., Lascurain V., Soroa A., Uria L. 2007. Semiautomatic Construction of the Electronic Euskal Hiztegia Basque Dictionary (eEHBD). The XVIth biennial conference of the Dictionary Society of North America, Chicago.

Alegria I., Arregi X., Artola X., Astiz M., Ruiz Miyares L. 2006a. Different issues in the design and development of the electronic Cuban Basic School Dictionary. E. Miyares, L. Ruiz eds., Linguistics in the Twenty First Century, 273-288. Cambridge Scholars Press, UK. ISBN: 1904303862.

Alegria I., Arregi X., Artola X., Astiz M., Ruiz Miyares L. 2006b. Building an Electronic Version of the Cuban Basic School Dictionary. Proceedings EURALEX 2006 I, 243-250 (Turin, Italy). (ISBN 88-7694-918-6).


Alegria I., Arregi X., Artola X., Astiz M., Ruiz Miyares L. 2006c. A Dictionary Content Management System. Proceedings EURALEX 2006 I, 105-109 (Turin, Italy). (ISBN 88-7694-918-6).

Soroa, A. Izaera heterogeneoko baliabide lexikalen integraziorako arkitektura baten proposamena. Datu-integrazioaren ikuspegitik egindako ekarpena. PhD Thesis. Informatika Fakultatea, UPV-EHU. 2004.

Arregi X., Arriola J., Artola X., Díaz de Ilarraza A., García E., Laskurain B., Sarasola K., Soroa A., Uria L. 2003. Semiautomatic conversion of the Euskal Hiztegia Basque Dictionary to a queryable electronic form. T.A.L. journal, vol. 44, no. 2, pp. 107-124. ISSN: 1248-9433.

Arriola J., Artola X., Soroa A. 2003. Automatic Extraction of verb patterns from Hauta-Lanerako Euskal Hiztegia. B. Oyharçabal ed., Inquiries into the lexicon-syntax relations in Basque. Supplements of ASJU no. XLVI (ISBN: 84-8373-580-6), 127-146. UPV/EHU, Bilbo.

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Evrard F., Sarasola K., Soroa A. 2003. An Intelligent Dictionary Help System. Encyclopedia of Library and Information Science, 2nd Edition (ISSN/ISBN: 0-8247-2075-X [print]; 0-8247-4259-1 [web]), 1390-1401. Allen Kent (Marcel Dekker, Inc.), New York.


Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Soroa A. 2001. MLDS: A Translator-Oriented MultiLingual Dictionary System. Natural Language Engineering, 5 (4), 325-353. ISSN: 1351-3249. Cambridge University Press.

Agirre E., Ansa O., Arregi X., Artola X., Díaz de Ilarraza A., Lersundi M., Martinez D., Sarasola K., Urizar R. 2000. Extraction Of Semantic Relations From A Basque Monolingual Dictionary Using Constraint Grammar. Proceedings of Euralex, Stuttgart (Germany). 2000. ISBN 3-00-006574-1.

Arriola, J.M. Euskal Hiztegia-ren azterketa eta egituratzea ezagutza lexikalaren eskuratze automatikoari begira. Aditz-adibideen analisia murriztapen-gramatika baliatuz, azpikategorizazioaren bidean. PhD Thesis. Filologia eta Historia-Geografia Fakultatea, UPV-EHU, 2000.

Patrick J., Zhang J., Artola X. 2000. An Architecture and Query Language for a Federation of Heterogeneous Dictionary Databases. Computers and the Humanities (ISSN: 0010-4817).

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Soroa A. 2000. A Methodology For Building Translator-Oriented Dictionary Systems. Machine Translation Journal. ISSN: 0922-6567. Kluwer Academic Publishers. V. 15 nº 4. pp. 295-310. 2000.


Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Soroa A.. 1999. Un Diccionarioactivo vasco-castellano en un entorno de escritura. VI Simposio Internacional de ComunicaciónSocial. Santiago de Cuba, 25-28 de Enero de 1999.

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Soroa A.. 1997. Constructing an intelligent dictionary help system. Natural Language Engineering 2(3): 229-252. ISSN: 1351-3249. Cambridge University Press. Cambridge. 1997.

Arriola J., Artola X., Soroa A.. 1996. Hauta-lanerako Euskal Hiztegiaren analisi erdiautomatikoa. ASJU, Anuario del Seminario de Filología Vasca.

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Ezeiza N., Sarasola K., Soroa A., A. Agirre, Patel H..1996. Design of a translator-oriented dictionary: Enhancement of a dictionary knowledge base by task modelling. Le traitement automatique du langage et les applications industrielles/Natural Language Processing and Industrial Applications. (NLP + IA96), Volume I, pp 1-6. Moncton, Canada. 1996.

Arriola J., Artola X., Soroa A.. 1996. Automatic extraction of lexical information from an ordinary dictionary. EURALEX'96, Göteborg (Sweden).

Patrick J., Zhang J., Artola X.. 1996. An Architecture for a Federation of Heterogeneous Lexical and Dictionary Databases. Joint International Conference ALLC/ACH'96, 221-225. Bergen (Norway).

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Evrard F.. 1995. IDHS, MLDS: Towards Dictionary Help Systems for Human Users. Semantics And Pragmatics Of Natural Language: Logical And Computational Aspects. K. Korta & J. M. Larrazabal (Eds.), ILCLI Series, n. 1. Donostia.

Arriola J., Artola X., Soroa A.. 1995. Análisis automático del diccionario Hauta-Lanerako Euskal Hiztegia. Procesamiento del lenguaje natural (SEPLN), Revista no. 17, 173-181. Bilbo.

Arregi, X.. ANHITZ: Itzulpenean laguntzeko hiztegi-sistema eleanitza. PhD Thesis. Informatika Fakultatea, UPV-EHU, 1995.

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K.. 1994a. Lexical Knowledge Representation in an Intelligent Dictionary Help System. Proceedings of COLING'94, vol. 1, 544-550. Kyoto (Japan).

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Evrard F.. 1994b. Intelligent dictionary help systems. Applications and Implications of current LSP Research. Eds. Brekke, M.; Andersen, I.; Dahl, T. & Myking, J., v. 1., 174-183. Fagbokforlaget (Norway).

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K.. 1994c. Analysing world-level translation activity to design a computerised dictionary. Proceedings of Euralex'94. Amsterdam.

Agirre E., Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K.. 1994d. A methodology for the extraction of semantic knowledge from dictionaries using phrasal patterns. Proceedings of IBERAMIA'94. IV Congreso Iberoamericano de Inteligencia Artificial. McGraw-Hill, 263-270. Caracas (Venezuela).

Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Evrard F.. 1993. Sistema Diccionarial Multilingüe: aproximación funcional. Revista de la Asociación Española para el Procesamiento del Lenguaje Natural. Vol: 14, pp: 313-335. ISSN: 1135-5948.

Artola, X.. HIZTSUA: Hiztegi-sistema urgazle adimendunaren sorkuntza eta eraikuntza. Hiztegi-ezagumenduaren errepresentazioa eta arrazonamenduaren ezarpena. / Conception et construction d'un système intelligent d'aide dictionnariale (SIAD). Acquisition et représentation des connaissances dictionnariales, établissement de mécanismes de déduction et spécification des fonctionnalités de base. PhD Thesis. Informatika Fakultatea, UPV-EHU, 1993.

Artola X., Evrard F.. 1992. Dictionnaire intelligent d'aide à la compréhension. Actas IV Congreso International EURALEX'90 (Benalmádena), 45-57. Barcelona.

Arregi X., Artola X., Díaz de Ilarraza A., Sarasola K., Evrard F.. 1991. Aproximación funcional a DIAC: diccionario inteligente de ayuda a la comprensión. Revista de la Asociación Española para el Procesamiento del Lenguaje Natural. Vol: 11, pp: 127-138. ISSN: 1135-5948.

RTF: Rich Text Format (MS Word)

{\rtf1\ansi\ansicpg1252\uc1 \deff0\deflang1033\deflangfe1033{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}
…
{\b\f69\fs16 aberastasun}{\fs14 .}{\b\i\fs14 }{\i\fs14 iz. }{\fs14 (1617; }{\i\fs14 abrastasun}{\fs14 1571).}{\b\fs14 1}{\i\fs14 . }{\fs14 Ondasun edo gauza baliotsuen ugaritasuna}{\i\fs14
. Aberastasunak ematen du aginpidea. }{\fs14 Ik. }{\b\fs14 diru}{\b\i\fs14 . }{\i\fs14 Aberastasunez betea. Ez ohorerik ez aberastasunik. Garai hartan Espainia guztian omen zen baso-oihanetan aberastasun handia. Basoetako aberastasuna. Zein zitezkeen gereziketa eta fruitu aberastasun horren iturburuak. }{\f69\fs12 II}{\fs14 }{\i\fs14 Pl. }{\fs14 Norbaitek dituen ondasun eta gauza baliozkoak}{
\i\fs14 . Herri baten aberastasunak eta baliabideak. Aberastasun galkorren ondoan ibiltzea. Euskarak bere baitan dituen aberastasunak. Aberastasunen banaketa zuzena. Aberastasunak hondatu. }{\b\fs14 2}{\fs14 .}{\i\fs14 }{\fs14
Aberatsa denaren nolakotasuna. Ant}{\i\fs14 . }{\b\fs14 pobretasun}{\fs14 ;}{\b\fs14 behartasun}{\b\i\fs14 . }{\i\fs14 Aberastasunean bizi. Pobretasunetik aberastasunera. Aberastasunaren arriskuak.
\par }\pard \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 {
\par }}
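The typographic groups in an RTF dump like this one are what make the entry structure recoverable: bold, italic and plain runs roughly correspond to headwords, categories or examples, and definitions. The following is a minimal sketch, not the tool used by the authors, that splits a fragment into styled runs. The function name `styled_runs` and the style labels are assumptions made for illustration, and the regular expression only handles flat groups such as `{\b\fs14 text}`, not nested ones.

```python
import re

# One flat RTF group: a run of control words, an optional space, then text.
# Assumption: groups are not nested, which holds for the entry body shown above.
GROUP = re.compile(r"\{((?:\\[a-z]+-?\d*)+) ?([^{}]*)\}")

def styled_runs(rtf):
    """Return (style, text) pairs for every non-empty flat group in rtf."""
    runs = []
    for controls, text in GROUP.findall(rtf):
        words = set(re.findall(r"\\([a-z]+)-?\d*", controls))
        if {"b", "i"} <= words:
            style = "bold-italic"
        elif "b" in words:
            style = "bold"
        elif "i" in words:
            style = "italic"
        else:
            style = "plain"
        text = text.strip()
        if text:  # skip groups that only switch the font
            runs.append((style, text))
    return runs

sample = (r"{\b\f69\fs16 aberastasun}{\fs14 .}{\b\i\fs14 }{\i\fs14 iz. }"
          r"{\fs14 (1617; }{\i\fs14 abrastasun}{\fs14 1571).}{\b\fs14 1}")

for style, text in styled_runs(sample):
    print(f"{style:12} {text}")
```

On the sample, the bold run carries the headword, the italic runs the category and the older variant, and the plain runs the dates, which is the kind of signal a grammar over begin/end style markers can exploit.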

Simplified DCG grammar to parse EH entries

Entry => Hdw [Relations] Category [date] [DefExamples].
Hdw => [Homograph] [NonStdHdw | StdHdw].
Homograph => bh number eh.
NonStdHdw => cross bb hdw eb.
StdHdw => bb hdw eb.
Category => [subc] Category.
Category => bi cat ei.
DefExamples => Def [Examples] DefExamples | ε.
Def => [SenseNumber][SenseGroup] def [Relations].
SenseNumber => bs number es.
SenseGroup => bsg grouptag esg.
Relations => [SynRel | AntRel] Relations [Examples] | ε.
SynRel => bsy synonyms esy.
AntRel => ba antonyms ea.
Examples => bi examples ei.
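To make the grammar concrete, here is a hedged sketch of a recursive-descent parser in Python for a reduced version of it. It is not the DCG implementation used for EH. It assumes that tokens arrive as hypothetical `(kind, value)` pairs whose kinds are the grammar's terminals (`bb`/`eb`, `bi`/`ei`, `bs`/`es` and so on, presumably begin/end markers for typographic changes). It also leaves out `Relations` and `SenseGroup`, and requires a headword even though the grammar marks it as optional.

```python
class EntryParser:
    """Recursive-descent parser for a reduced version of the grammar above."""

    def __init__(self, tokens):
        self.tokens = list(tokens)  # (kind, value) pairs; a hypothetical layout
        self.pos = 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.tokens[i][0] if i < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(
                f"expected {kind} at token {self.pos}, found {self.peek()}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def entry(self):
        # Entry => Hdw Category [date] [DefExamples].  (Relations omitted)
        result = self.hdw()
        result["category"] = self.category()
        if self.peek() == "date":
            result["date"] = self.expect("date")
        result["senses"] = self.def_examples()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected {self.peek()} at token {self.pos}")
        return result

    def hdw(self):
        # Hdw => [Homograph] (NonStdHdw | StdHdw); NonStdHdw starts with 'cross'.
        out = {"standard": True}
        if self.peek() == "bh":
            self.expect("bh")
            out["homograph"] = self.expect("number")
            self.expect("eh")
        if self.peek() == "cross":
            self.expect("cross")
            out["standard"] = False
        self.expect("bb")
        out["headword"] = self.expect("hdw")
        self.expect("eb")
        return out

    def category(self):
        # Category => subc* bi cat ei  (the recursive rule unrolled into a loop)
        subcats = []
        while self.peek() == "subc":
            subcats.append(self.expect("subc"))
        self.expect("bi")
        cat = self.expect("cat")
        self.expect("ei")
        return {"cat": cat, "subcats": subcats}

    def def_examples(self):
        # DefExamples => (Def [Examples])*, with Def => [SenseNumber] def
        senses = []
        while self.peek() in ("bs", "def"):
            sense = {}
            if self.peek() == "bs":
                self.expect("bs")
                sense["number"] = self.expect("number")
                self.expect("es")
            sense["def"] = self.expect("def")
            if self.peek() == "bi" and self.peek(1) == "examples":
                self.expect("bi")
                sense["examples"] = self.expect("examples")
                self.expect("ei")
            senses.append(sense)
        return senses


tokens = [
    ("bb", None), ("hdw", "aberastasun"), ("eb", None),
    ("bi", None), ("cat", "iz."), ("ei", None),
    ("date", "1617"),
    ("bs", None), ("number", "1"), ("es", None),
    ("def", "Ondasun edo gauza baliotsuen ugaritasuna"),
    ("bi", None), ("examples", "Aberastasunak ematen du aginpidea."), ("ei", None),
]
parsed = EntryParser(tokens).entry()
print(parsed["headword"], parsed["category"]["cat"], len(parsed["senses"]))
```

Because `Category` and `Examples` both open with `bi`, the sketch looks one token further ahead to tell them apart. A DCG in Prolog would get the same effect through backtracking.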

TEI XML encoding (DBE entry)

<entry id="d_d0e1701">
  <form>
    <orth>decaer</orth>
    <syll>de|ca|er</syll>
  </form>
  <gramGrp>
    <pos>vintr.</pos>
    <itype>(33)</itype>
  </gramGrp>
  <sense n="1">
    <def>Ir a menos, perder una persona o cosa parte de las propiedades que le daban su fuerza o valor.</def>
    <eg><q>Con el paso del tiempo, su interés <oVar type="?">decayó</oVar>.</q></eg>
    <xr>
      <lbl>Sin.</lbl>
      <ref>debilitar</ref>
      <ref>disminuir</ref>
      <ref>flaquear</ref>
      <ref>desfallecer</ref>
    </xr>
  </sense>
  <form type="infl">
    <orth>decaído</orth>
    <gram>(p.p.)</gram>
  </form>
</entry>
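Once an entry is encoded this way, its parts can be reached by element name instead of by typography. The sketch below uses Python's standard `xml.etree.ElementTree` to pull out the headword, part of speech, senses and inflected forms. It assumes the entry is well-formed XML, with the attribute written as `<oVar type="?">`. The function name `summarize` and the shape of the returned dictionary are illustrative choices, not part of the DBE tooling.

```python
import xml.etree.ElementTree as ET

TEI_ENTRY = """<entry id="d_d0e1701">
  <form><orth>decaer</orth><syll>de|ca|er</syll></form>
  <gramGrp><pos>vintr.</pos><itype>(33)</itype></gramGrp>
  <sense n="1">
    <def>Ir a menos, perder una persona o cosa parte de las propiedades que le daban su fuerza o valor.</def>
    <eg><q>Con el paso del tiempo, su interés <oVar type="?">decayó</oVar>.</q></eg>
    <xr><lbl>Sin.</lbl><ref>debilitar</ref><ref>disminuir</ref><ref>flaquear</ref><ref>desfallecer</ref></xr>
  </sense>
  <form type="infl"><orth>decaído</orth><gram>(p.p.)</gram></form>
</entry>"""

def summarize(xml_text):
    """Collect the main fields of one TEI-encoded dictionary entry."""
    entry = ET.fromstring(xml_text)
    senses = []
    for sense in entry.findall("sense"):
        senses.append({
            "n": sense.get("n"),
            "def": sense.findtext("def"),
            # itertext() flattens inline markup such as <oVar> inside the quote
            "examples": ["".join(q.itertext()) for q in sense.findall("eg/q")],
            "synonyms": [ref.text for ref in sense.findall("xr/ref")],
        })
    return {
        "id": entry.get("id"),
        "headword": entry.findtext("form/orth"),  # first <form> holds the lemma
        "pos": entry.findtext("gramGrp/pos"),
        "senses": senses,
        "inflected": [f.findtext("orth")
                      for f in entry.findall("form[@type='infl']")],
    }

info = summarize(TEI_ENTRY)
print(info["headword"], info["pos"], info["senses"][0]["synonyms"])
```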

DBE: print version (3rd ed.) page markers

figure refs.