The E-MELD Project

Nov 21, 2005 University of Texas at Austin The E-MELD Project Helen Aristar Dry & Anthony Aristar The LINGUIST List Eastern Michigan U & Wayne State U

Transcript of The E-MELD Project

Page 1: The E-MELD Project

Nov 21, 2005 University of Texas at Austin

The E-MELD Project

Helen Aristar Dry & Anthony Aristar

The LINGUIST List

Eastern Michigan U & Wayne State U

Page 2: The E-MELD Project

E-MELD: Electronic Metastructure for Endangered Languages Documentation

• 5-year NSF project, 2001-6
• Linguist List, ELF, LDC
• Goal: To aid in
  • the preservation of endangered languages data
  • the development of infrastructure for electronic archives

Page 3: The E-MELD Project

Summary of the problem (2001):

EL resources were/are
• Difficult to find
• Difficult to use
• Difficult to preserve

Needed:
• More uniformity in naming, cataloguing, annotating, i.e., interoperable standards
• More knowledge of how to create digital resources that last

Page 4: The E-MELD Project

Problems with EL resources

• Difficult to find
  • At distributed sites
  • Language names ambiguous
  • No central catalog of resources or cataloging information (metadata)
  • Lack of interoperability among archives
• Difficult to display accurately
  • Idiosyncratic character encoding
  • Specific fonts needed
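
A minimal Python sketch of the display problem, assuming a hypothetical legacy font that drew phonetic symbols in place of ordinary ASCII characters. The mapping table and the sample string are invented for illustration and do not describe any real font. Once the text is converted to Unicode, it no longer depends on that font being installed.

```python
import unicodedata

# Hypothetical mapping: a legacy font that displayed phonetic glyphs in the
# slots of ordinary ASCII characters. These pairings are invented.
LEGACY_FONT_TO_UNICODE = {
    "N": "\u014B",  # LATIN SMALL LETTER ENG
    "S": "\u0283",  # LATIN SMALL LETTER ESH
    "?": "\u0294",  # LATIN LETTER GLOTTAL STOP
}


def to_unicode(legacy_text: str) -> str:
    """Replace font-dependent characters with Unicode, then normalize to NFC."""
    converted = "".join(LEGACY_FONT_TO_UNICODE.get(ch, ch) for ch in legacy_text)
    return unicodedata.normalize("NFC", converted)


# Without the legacy font, "Sa?aN" is meaningless; after conversion each
# character carries its own standard identity.
word = to_unicode("Sa?aN")
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```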

Page 5: The E-MELD Project

Problems with EL resources, 2

• Difficult to compare
  • Non-standard terminology
  • Idiosyncratic markup & annotation schemes
• Difficult to manipulate or reuse
  • Specific software needed (incl. specific software version), e.g. MSWord 1.0
  • Meaning represented via formatting, which was not documented
    • bold represents “headword”
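
A minimal Python sketch of the alternative: stating the meaning in named markup instead of leaving it implicit in formatting. It assumes a hypothetical legacy export in which bold is written as `<b>...</b>` and always means "headword". The element names (`entry`, `headword`, `gloss`) and the sample line are chosen for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each legacy line looks like "<b>word</b> gloss", where bold
# silently means "headword". The pattern is a stand-in for a real export.
LEGACY_LINE = re.compile(r"<b>(?P<headword>[^<]+)</b>\s*(?P<rest>.*)")


def legacy_to_entry(line: str) -> ET.Element:
    """Turn '<b>word</b> gloss' into an entry with explicit, named fields."""
    match = LEGACY_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized legacy line: {line!r}")
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = match.group("headword")
    ET.SubElement(entry, "gloss").text = match.group("rest")
    return entry


entry = legacy_to_entry("<b>kitabu</b> book")
print(ET.tostring(entry, encoding="unicode"))
```

The resulting record no longer needs the undocumented convention "bold = headword" to be understood.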

Page 6: The E-MELD Project

Problems with EL resources, 3

• Impermanent, vulnerable to:
  • Deterioration of the physical media
  • Hardware obsolescence
  • Software obsolescence

Page 7: The E-MELD Project

PHONOGRAMMARCHIV - AUSTRIAN ACADEMY OF SCIENCE

slide from Dietrich Schüller, Director

Page 8: The E-MELD Project

Toward a Solution: E-MELD Components

• Involve linguistics community in developing standards
• Promote consensus about:
  • Language identification
  • Metadata
  • Annotation and markup
• Teach and facilitate implementation of “best practices” in the creation of digital language documentation

Page 9: The E-MELD Project

Promoting consensus: annual workshops

2001, Santa Barbara, CA: The Need for Standards

E-MELD 2002, Ann Arbor, MI: Digitizing Lexical Information

E-MELD 2003, Lansing, MI: Digitizing Texts

E-MELD 2004, Detroit, MI: Databases and Best Practice

E-MELD 2005, Cambridge, MA: Linguistic Ontologies & Terminology

Page 10: The E-MELD Project

2006 E-MELD Workshop on Digital Language Documentation

• Michigan State University
• June 20-22, 2006
• In conjunction with the 2006 Summer Meeting of the Linguistic Society of America
• Topic: Electronic Archiving and Digital Tools: Current State & Future Directions

Please come!

Page 11: The E-MELD Project

Finding resources: metadata

• OLAC metadata standards (subcommunity of OAI)
• OLAC search engine on LL site: http://linguistlist.org/olac
• OLAC metadata editor on LL site: http://linguistlist.org/olac/ore
• XSL stylesheets for transformation / presentation of OLAC metadata
• Ethnologue/LL language codes proposed as ISO standard
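
A rough Python sketch of what a catalog record with a coded language identifier might look like, built with the standard library. The record contents (title, creator) are hypothetical. The namespace URIs and the `xsi:type` / `olac:code` attribute pattern are assumptions based on the published Dublin Core and OLAC schemas and should be checked against the OLAC documentation before use; `nav` is used here as the three-letter code for Navajo.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify against the OLAC and Dublin Core schemas.
DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
for prefix, uri in (("dc", DC), ("olac", OLAC), ("xsi", XSI)):
    ET.register_namespace(prefix, uri)


def make_record(title: str, creator: str, language_code: str) -> ET.Element:
    """Build a small metadata record whose subject language is a code, not a name."""
    record = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    # The code disambiguates the language even when its names are ambiguous.
    ET.SubElement(
        record,
        f"{{{DC}}}subject",
        {f"{{{XSI}}}type": "olac:language", f"{{{OLAC}}}code": language_code},
    )
    return record


# Hypothetical resource description.
record = make_record("Example word list", "A. Linguist", "nav")
print(ET.tostring(record, encoding="unicode"))
```

Because the language is identified by code, a search service can match this record regardless of which name or spelling a user types.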

Page 12: The E-MELD Project

Using resources: comparing and finding annotation

• Ontologies developed (as interlanguage between markups and as search aids)
  • GOLD: General Ontology for Linguistic Description (morphosyntax)
  • OPF: Ontology of Phonetic Features (based on Ladefoged & Maddieson)
• ODIN Project: mining interlinear glossed text on the web (Will Lewis et al.)
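
A minimal Python sketch of the "interlanguage" idea: two projects keep their own gloss tags, and each tag is linked to a shared concept so that tags can be translated and searched across projects. The tag inventories are hypothetical, and the concept labels are illustrative stand-ins rather than names quoted from GOLD.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical tag inventories, each mapped to a shared concept label.
PROJECT_A: Dict[str, str] = {"PST": "PastTense", "PL": "PluralNumber"}
PROJECT_B: Dict[str, str] = {
    "past": "PastTense",
    "plur": "PluralNumber",
    "dl": "DualNumber",
}


def translate(tag: str, source: Dict[str, str], target: Dict[str, str]) -> Optional[str]:
    """Map a tag from one scheme to another by way of the shared concept."""
    concept = source.get(tag)
    if concept is None:
        return None
    for target_tag, target_concept in target.items():
        if target_concept == concept:
            return target_tag
    return None  # the target scheme has no tag for this concept


def search(concept: str, schemes: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Find every (project, tag) pair that expresses the given concept."""
    return [
        (name, tag)
        for name, mapping in schemes.items()
        for tag, tag_concept in mapping.items()
        if tag_concept == concept
    ]


print(translate("PST", PROJECT_A, PROJECT_B))
print(search("PluralNumber", {"A": PROJECT_A, "B": PROJECT_B}))
```

Neither project has to give up its own terminology; only the mapping to the shared concepts needs to be agreed on.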

Page 13: The E-MELD Project

Using resources: Tools

• Tools to encourage use of the ontology:
  • OntoElan: text annotation (modification of MPI’s Elan)
  • OntoGloss: stand-off annotation tool
  • FIELD: lexical input
• Tool to encourage use of Unicode
  • CharWrite: input of Unicode characters
• Facility to encourage use of OLAC metadata
  • Stylesheet library
  • ORE
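
A minimal Python sketch of stand-off annotation in general, not of how OntoGloss itself is implemented. The primary text is stored once and left untouched; annotations live separately and point into it by character offset. The sample sentence, the labels, and the `Annotation` class are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

TEXT = "the dogs barked"  # primary text: stored once, never modified


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first annotated character
    end: int    # offset one past the last annotated character
    label: str  # illustrative concept label


# Annotations are kept apart from the text and may overlap freely.
ANNOTATIONS = [
    Annotation(4, 8, "Noun"),
    Annotation(7, 8, "PluralNumber"),
    Annotation(9, 15, "Verb"),
]


def spans_with(label: str) -> List[str]:
    """Return the stretches of TEXT that carry the given label."""
    return [TEXT[a.start:a.end] for a in ANNOTATIONS if a.label == label]


print(spans_with("Noun"))
print(spans_with("PluralNumber"))
```

Because nothing is inserted into the text, several annotators can add competing or overlapping layers without corrupting the source.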

Page 14: The E-MELD Project

Facilitating ‘Best Practices’ in resource creation

• Creation of reference website
  • School of Best Practices in Digital Language Documentation
  • http://emeld.org/school/
  • Addressed to the individual linguist who creates language documentation

Page 15: The E-MELD Project

What should the linguist do?

To ensure that digital data endure long into the future:

1. Create an archival copy: Put the materials into an enduring file format.

2. Deposit the materials with an archive that will make a practice of periodically migrating them to new storage media as needed.
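
A minimal Python sketch of one check an archive might run when migrating a file to new storage: compare a checksum before and after the copy. This technique is an assumption added for illustration and is not described in the slides; the file names and contents are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, destination: Path) -> str:
    """Copy a file and confirm the copy is bit-for-bit identical."""
    expected = sha256_of(source)
    shutil.copy2(source, destination)
    if sha256_of(destination) != expected:
        raise OSError(f"checksum mismatch after copying {source} to {destination}")
    return expected


# Demonstration with a throwaway directory standing in for old and new media.
with tempfile.TemporaryDirectory() as workdir:
    old = Path(workdir) / "lexicon.xml"
    old.write_text("<lexicon/>", encoding="utf-8")
    new = Path(workdir) / "lexicon_migrated.xml"
    checksum = migrate(old, new)
    copies_match = new.read_bytes() == old.read_bytes()

print(checksum, copies_match)
```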

Page 16: The E-MELD Project

Organization of the School

• Entrance Hall: orientation
• Classroom: lessons & tutorials
• Reading Room: bibliography
• Work Room: online work
• Tool Room: links to tools
• Help (incl. Ask an Expert)
• Case Studies: documentation of 10 ELs digitized according to best practices

Page 17: The E-MELD Project

Currently School has:

Documentation from 12 ELs:

• Mocovi
• Kayardild
• Monguor
• Potawatomi
• Tofa
• Ega
• Saliba
• Navajo
• Biao Mien
• W. Sissala
• (Chorote)
• (Nivacle)

Page 18: The E-MELD Project

Current Initiatives

Identify and record metadata for legacy documentation

Improve the ontology (GOLD) – incorporate suggestions from 2005 E-MELD workshop

Finish prototyped software

Page 19: The E-MELD Project

Future: finish prototyped software

OntoElan: ontology-aware modification of MPI’s Elan annotation tool

OntoGloss: ontology-aware stand-off annotation tool

CharWrite: downloadable tool for web-input of Unicode characters

FIELD: Field Input Environment for Linguistic Data

All but OntoGloss available through the School of Best Practices website

Page 20: The E-MELD Project

Current Initiatives: School of BP

• Make the School even more practical
  • Distinguish between good, better, best practice
  • Emphasize explicit ‘how-to’ pages
  • Different paths for different user types
  • Advice from experts, e.g. “equipment on a budget” page, Ask-An-Expert

Page 21: The E-MELD Project

Practices in resource creation

• Good practice: ensure preservation
• Better practice: ensure long-term intelligibility
  • “We don’t want to create another Rosetta Stone” - Whalen, 2003
• Best practice: promote interoperability

Page 22: The E-MELD Project

School of Best Practices in Digital Language Documentation

http://emeld.org/school/

Page 23: The E-MELD Project

Future Directions

• MultiTree
• LL-MAP

Page 24: The E-MELD Project

What is MultiTree?

• 3-year grant
• Database of all hypothesized language relations
• Ultimately linked to GIS database
• Interface to allow linguists to input updates
• Panel of experts to assess input
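
A minimal Python sketch of how competing hypotheses about language relations could be stored side by side. It is an illustration under stated assumptions, not MultiTree's actual design: each hypothesis is kept as its own child-to-parent mapping together with its source, so that no single tree is privileged. All language, group, and source names are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Hypothesis:
    source: str  # who proposed the classification (placeholder text)
    parent_of: Dict[str, str] = field(default_factory=dict)  # child -> parent

    def lineage(self, language: str) -> List[str]:
        """Walk from a language up to the top node of this hypothesis."""
        chain = [language]
        while chain[-1] in self.parent_of:
            chain.append(self.parent_of[chain[-1]])
        return chain


# Two placeholder hypotheses that classify the same language differently.
hypotheses = [
    Hypothesis(
        "Scholar 1 (placeholder)",
        {"Language X": "Group 1", "Group 1": "Family A"},
    ),
    Hypothesis("Scholar 2 (placeholder)", {"Language X": "Family B"}),
]

for hypothesis in hypotheses:
    print(hypothesis.source, "->", " < ".join(hypothesis.lineage("Language X")))
```

Keeping the source with each tree is what would let a panel of experts assess a submitted update against what is already recorded.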

Page 25: The E-MELD Project

LL-MAP

• Collect geographically linked linguistic data
• Build this into a GIS system, allowing layers of information to be built into a single map

Then…

• Build tools for querying, annotating and discussing this data
• Build tools which allow new language data from linguists and anthropologists to be incorporated into this system