A Unified Structure for Dutch Dialect Dictionary Data

Folkert de Vriend¹, Lou Boves¹,², Henk van den Heuvel¹, Roeland van Hout², Joep Kruijsen², Jos Swanenberg²

¹ Centre for Language and Speech Technology (CLST)
² Department of Linguistics

Radboud University Nijmegen, The Netherlands

Introduction

The dialect vocabulary of the Netherlands and Flanders is recorded and researched in several Dutch and Belgian research institutes and universities. Most dictionary creation and research projects collaborate in the “Permanent Overlegorgaan Regionale Woordenboeken” (ReWo).

In the project Digital databases and digital tools for WBD and WLD (D-Square), the dialect data published by two of these dictionary projects, the Woordenboek van de Brabantse Dialecten (WBD) and the Woordenboek van de Limburgse Dialecten (WLD), is being digitised. In addition, the D-Square project aims to develop an infrastructure for electronic access to all dialect dictionaries collaborating in the ReWo. Eventually, this infrastructure will enable unified access to dialect geographic data for the complete Dutch language area through one interface and one set of research tools, as if it were one homogeneous data collection.

The dialect data reconsidered 1

All dialect dictionary projects in the ReWo use the same core data types, viz. form, sense and location.

The most striking difference between the projects is the organisation of their data, which is either form-based or sense-based.

[Figure: form-based versus sense-based organisation]

The dialect data reconsidered 2

However, the nature of the data does not have an intrinsic “sense over form” or a “form over sense” hierarchy. Instead, the relation between the core data types is heterarchical:

[Figure: heterarchical relation between form, sense and location]
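A minimal sketch of what a heterarchical relation could look like in practice: each attestation is stored as a flat (form, sense, location) record, and a form-based, sense-based or location-based view is derived on demand. The `Record` type, the `group_by` helper and all sample values are hypothetical and are not taken from the D-Square project.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    form: str      # dialect word form
    sense: str     # concept the form expresses
    location: str  # place where the form was recorded


# Hypothetical sample data, invented for illustration only.
RECORDS = [
    Record("form_a", "sense_1", "place_x"),
    Record("form_b", "sense_1", "place_y"),
    Record("form_a", "sense_2", "place_y"),
]


def group_by(records, key):
    """Group flat records under any one of the three core data types."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, key)].append(record)
    return dict(groups)


# No data type is privileged: the same records yield all three views.
by_form = group_by(RECORDS, "form")
by_sense = group_by(RECORDS, "sense")
by_location = group_by(RECORDS, "location")
```

Because no hierarchy is stored, a form-based and a sense-based dictionary can both be loaded into the same record list without restructuring either one.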

Classifications

The core data types can be further classified in higher-order structures.

Advantages:

• Optimal flexibility in working with the data.

• Possibility of treating all data from the various dictionaries as one huge data set.

• Differences in the more precise nature of each of the data types can be specified by the classifications.

Implementation issues

Encoding of the data

• For the core data types, a relational database will be used.

• For the classifications, XML will be used.
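The split between the two encodings can be sketched as follows, assuming SQLite as a stand-in for the relational database and a small invented XML layout for a classification. The table names, the XML element names and all values are hypothetical; the poster does not specify a schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Core data types in relational tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE form (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE sense (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE location (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE attestation (
    form_id INTEGER NOT NULL REFERENCES form(id),
    sense_id INTEGER NOT NULL REFERENCES sense(id),
    location_id INTEGER NOT NULL REFERENCES location(id)
);
INSERT INTO form VALUES (1, 'form_a');
INSERT INTO sense VALUES (1, 'sense_1');
INSERT INTO location VALUES (1, 'place_x');
INSERT INTO attestation VALUES (1, 1, 1);
""")

# A classification kept outside the database as XML (hypothetical layout).
CLASSIFICATION_XML = """
<classification type="sense">
  <category label="category_1">
    <member ref="1"/>
  </category>
</classification>
"""

# Map each sense id to the category that lists it as a member.
root = ET.fromstring(CLASSIFICATION_XML)
sense_to_category = {
    int(member.get("ref")): category.get("label")
    for category in root.iter("category")
    for member in category.iter("member")
}

rows = conn.execute("""
SELECT form.label, sense.id, location.label
FROM attestation
JOIN form ON form.id = attestation.form_id
JOIN sense ON sense.id = attestation.sense_id
JOIN location ON location.id = attestation.location_id
""").fetchall()

# Attach the XML classification to the relational rows at query time.
classified = [
    (form, sense_to_category[sense_id], place)
    for form, sense_id, place in rows
]
```

Keeping the classification in a separate document means it can be swapped for another one without touching the core tables.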

Standardisation

With the use of LEXUS we will adhere to:

• The Data Category Registry

• Lexical Markup Framework

One interface for unified data: Google Earth
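The poster does not say how the data reaches Google Earth. One plausible route, shown here purely as an assumption, is to export attestations as KML placemarks, since KML is the file format Google Earth reads. The `to_kml` helper and the sample point are hypothetical.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def to_kml(points):
    """Serialise (label, latitude, longitude) tuples as a KML document."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for label, latitude, longitude in points:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = label
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML writes coordinates in longitude,latitude order.
        coordinates = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coordinates.text = f"{longitude},{latitude}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical attestation: a form label with made-up coordinates.
kml_text = to_kml([("form_a", 51.0, 5.0)])
```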

Concluding remarks

The D-Square project lasts until the summer of 2007. The unified structure as described will then have been implemented for WBD and WLD.

The project is partly funded by the Netherlands Organisation for Scientific Research (NWO). The project website is www.ru.nl/dialect/d2

Unifying the different classifications: Difficulties and solutions

Sense

Difficulty: Dictionaries use different taxonomies or no taxonomy at all.

Solution: Use the taxonomy already present in WBD and WLD. Senses from other dictionaries can be mapped onto this taxonomy, and new senses can be added. Senseless forms (words with only a grammatical function) will be mapped to a separate branch of the taxonomy.
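The mapping step might be sketched as below, assuming the shared taxonomy is stored as child-to-parent links and each dictionary supplies a lookup table from its own sense labels to taxonomy nodes. All node names, dictionary names and helper functions are hypothetical.

```python
# Shared taxonomy as child -> parent links (hypothetical node names).
TAXONOMY = {
    "root": None,
    "lexical": "root",
    "grammatical": "root",  # separate branch for forms without a lexical sense
    "sense_1": "lexical",
}

# Per-dictionary mapping from local sense labels to taxonomy nodes.
MAPPINGS = {
    "dictionary_a": {"local_sense_1": "sense_1"},
}


def add_sense(label, parent):
    """Add a new sense under an existing node of the shared taxonomy."""
    if parent not in TAXONOMY:
        raise KeyError(f"unknown parent node: {parent}")
    TAXONOMY.setdefault(label, parent)


def map_sense(dictionary, local_label, grammatical_only=False):
    """Return the taxonomy node for a local sense, or None if unmapped."""
    if grammatical_only:
        return "grammatical"
    return MAPPINGS.get(dictionary, {}).get(local_label)


def path_to_root(node):
    """List the nodes from a sense up to the taxonomy root."""
    path = []
    while node is not None:
        path.append(node)
        node = TAXONOMY[node]
    return path
```

A sense that returns `None` from `map_sense` is one that still needs either a mapping or a new node added with `add_sense`.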

Form

Difficulty: Form classifications are based on a number of different linguistic criteria. Which criteria prevailed used to be left to the intuition of the editor. As a result, the same form may be classified differently in different dictionaries.

Solution: Expert users should be able to choose any of the possible classification mergers or no merger at all. The general public is presented with one kind of merger by default.
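A sketch of how the choice between mergers could be exposed. The poster does not describe how a merger is constructed, so this example assumes the simplest possible one: a priority order that decides which dictionary's classification prevails. The dictionary names, headword labels and merger names are hypothetical.

```python
# Each dictionary's own classification of forms (hypothetical labels).
CLASSIFICATIONS = {
    "dictionary_a": {"form_a": "headword_1", "form_b": "headword_1"},
    "dictionary_b": {"form_a": "headword_2", "form_b": "headword_3"},
}

# Assumption: a merger is a priority order over the dictionaries.
MERGERS = {
    "prefer_a": ["dictionary_a", "dictionary_b"],
    "prefer_b": ["dictionary_b", "dictionary_a"],
}

# The merger shown to the general public unless an expert picks another.
DEFAULT_MERGER = "prefer_a"


def classify(form, source, merger=DEFAULT_MERGER):
    """Return the class of a form.

    merger=None means "no merger": the classification of the source
    dictionary is returned unchanged.
    """
    if merger is None:
        return CLASSIFICATIONS[source].get(form)
    for dictionary in MERGERS[merger]:
        if form in CLASSIFICATIONS[dictionary]:
            return CLASSIFICATIONS[dictionary][form]
    return None
```

With this layout, a default user calls `classify(form, source)`, while an expert passes `merger="prefer_b"` or `merger=None`.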

Location

Difficulty: Place name ambiguity is introduced when merging location classifications.

Solution: Either a geopolitical taxonomy covering all locations is introduced, or all locations are converted to a geocoding system that can uniquely encode geographical locations worldwide: longitude and latitude.
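The second option can be sketched as follows: a place name is only interpreted together with the dictionary it comes from, and locations are compared by coordinates rather than by name. The gazetteer entries and coordinates are made up for illustration, and the helper names are hypothetical.

```python
# (dictionary, place name) -> (latitude, longitude); made-up coordinates.
GAZETTEER = {
    ("dictionary_a", "place_x"): (51.0, 5.0),
    ("dictionary_b", "place_x"): (51.5, 5.5),  # same name, different place
    ("dictionary_b", "place_y"): (51.0, 5.0),  # different name, same place
}


def geocode(dictionary, name):
    """Resolve a dictionary-specific place name to coordinates."""
    return GAZETTEER[(dictionary, name)]


def same_place(a, b, tolerance=1e-6):
    """Compare two locations by coordinates instead of by name."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


a_x = geocode("dictionary_a", "place_x")
b_x = geocode("dictionary_b", "place_x")
b_y = geocode("dictionary_b", "place_y")
```

Here `same_place(a_x, b_x)` is false although the names match, and `same_place(a_x, b_y)` is true although the names differ, which is the ambiguity the coordinate encoding removes.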