Language Archiving at the MPI
Peter Wittenburg
MPI for Psycholinguistics
DOBES Archive (DOkumentation BEdrohter Sprachen, Documentation of Endangered Languages)
(funded by the Volkswagen Foundation)
Nijmegen, NL
Still a large variety of languages
• currently 6500 languages world-wide
• Distribution
  - Africa 1995
  - S/SE Asia 1400
  - New Guinea 1109
  - South America 419
  - North Asia 380
  - Central America 300
  - Pacific area 250
  - Australia 250
  - North America 209
  - Europe 209
Language endangerment
• 97% of the people use 4% of the languages
• 96% of the languages are spoken by 3% of the people
• approx. 6000 of the languages are spoken by about 200 million people
  - on average: 30,000 speakers per language
  - for 50% fewer than 10,000, for 25% fewer than 1000
• for 50% the number of speakers is decreasing dramatically
• pessimistic view (according to Crystal):
  - 90% of the languages will be extinct around 2100!
  - i.e. every second week a language becomes extinct!
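The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: the function names are illustrative, and the 100-year horizon is an assumption, since the slide only says "around 2100".

```python
DAYS_PER_YEAR = 365.25


def average_speakers(total_speakers: float, languages: int) -> float:
    """Mean number of speakers per language."""
    return total_speakers / languages


def days_between_extinctions(languages: int, fraction_lost: float, years: float) -> float:
    """Average gap in days between extinctions if `fraction_lost` of
    `languages` disappears evenly over `years` years."""
    return years * DAYS_PER_YEAR / (languages * fraction_lost)


def fraction_lost_at_interval(languages: int, interval_days: float, years: float) -> float:
    """Share of `languages` lost over `years` years at a rate of one
    extinction every `interval_days` days."""
    return years * DAYS_PER_YEAR / interval_days / languages


# Figures taken from the slides; the 100-year horizon is an assumption.
print(round(average_speakers(200e6, 6000)))
print(round(days_between_extinctions(6500, 0.9, 100), 1))
print(round(fraction_lost_at_interval(6500, 14, 100), 2))
```

Under these assumptions the average comes out near 33,000 speakers per language, close to the 30,000 quoted. A 90% loss works out to roughly one language every six days, whereas one language every second week corresponds to losing about 40% of the 6500 languages.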
What can we do?
Documentation + Revitalization
• 2000: DOBES Programme of the Volkswagen Foundation
• many other initiatives and institutions, all meant to be complementary
• the Volkswagen Foundation is devoted primarily to supporting research
• teams get funds for documentation (in general 3 years or more)
• there was a very intensive pilot phase full of useful discussions
• it was obvious that all teams (including the archiving team) felt the need to help the language communities
How to do a language documentation?
• based on N. Himmelmann, “Documentary and Descriptive Linguistics”
• Documentation: primary focus is on collection, transcription and translation of primary data (observations, elicitations, ...)
• Description: primary focus is on linguistic analysis and special phenomena
• the methods and the results are different

 | Collection | Analysis
Result | Corpus of utterances, notes on observations, comments of involved persons | Descriptive statements illustrated by a few examples
Procedures | Observation, elicitation, recording, transcription, translation | Phonetic, phonological, morphosyntactic, semantic analysis
Methodological issues | Sampling, reliability, naturalness | Definition of terms and levels, justification, adequacy of analysis
How to do a language documentation?
• there is an overlap between the two poles, documentation and description: no interlinear description without a morphological analysis
• Documentation has to
  - deliver a comprehensive representation of the “linguistic practices and traditions”
  - document spoken language in its communicative and cultural background
  - cover observed linguistic practices and meta-knowledge
  - take a holistic view of language
  - be interesting for other disciplines, in particular the primary data
  - help the language community
• therefore a natural focus on audio and video recordings
DOBES language documentation
• language in its cultural background
• “theory-neutral” representation
• lots of multimedia (audio, video) recordings as the basis
• where possible, base everything on primary data
• linguistic goals
  - annotations (orthographic transcription, translation, ...)
  - morphological/syntactic analysis for only a small part
  - sketch grammar, limited topic-oriented lexicon
• ethnologists, musicologists and ethnobiologists are also involved
• in total about 3 years
• idea: later generations should be able to reproduce the language
• material could be extended later
DOBES Map
Aweti
Kuikuro
Trumai
Salar/Monguor
Teop
Wichita
Lacandon
Chipaya
Chaco languages
Waima’a
Svan/Udi/Tsova-Tush
Tofa
Hocank
Marquesan
Chintang/Puma
!Xoo
Akhoe Hai//om
Mawe/Bakairi/Katxuyana
Tsafiki
Iwaidja
Chontal
Bora/Ocaina
Saliba
Beaver
Sri Lanka Malay
Sami
Nenets
Jaminjung
Semang
Totoli
Archive
• 30 documentation teams (at MPI also 30 expeditions per year)
• 1 Archiving Team
Waima’a (East Timor)
Mauricio Belo, Caisido village John Bowden, Australian National University John Hajek, University of Melbourne Nikolaus Himmelmann, Ruhr-Universität Bochum
la enen iat
before PTL
“Once upon a time”

bu taha k’omu ruo bu wai-dura loo ligasaun ini
HON mud ball and HON cricket make closeness RCP
“A ball of mud and a cricket were friends”

sire ruo laka khuu rahmhutu busa
3p two go clean together garden
“The two of them went to clean the gardens together”
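 | Labial | (Post-)alveolar | Velar | Glottal
Stops, voiceless unaspirated | (p) | t | k | '
Stops, voiceless aspirated | (ph) | th | kh |
Stops, voiceless ejective | p' | t' | k' |
Stops, voiced | b | d | g |
Fricatives, plain | (f) | s | | h
Fricatives, ??? | | s' | |
Nasals, plain | m | n | |
Nasals, ??? | mh | nh | |
Nasals, ??? | m' | n' | |
Laterals, plain | | l | |
Laterals, ??? | | lh | |
Laterals, ??? | | l' | |
Tap / trill | | r | |
 | | rr | |
Glides, plain | w | | |
Glides, ??? | wh | | |
Glides, ??? | w' | | |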
Trumai (Amazonas)
Stephen Levinson, MPI Nijmegen Raquel Guirardello-Damian, Museu Paraense Emílio Goeldi
• about 100 people
• about 51 speaking Trumai
Salar/Monguor (China)
Salar villages along the Yellow River
Salar children above Dashyinix village
Shaman in Huzhu Mongghul county
Drummers in the Nadun festival, Minhe county
Painting the faces of possessed Wutu, Niandehu township
Tofa, Tozhu, Tsengel Tuva, Tuha (Siberia)
• David Harrison (Yale)
• Brian Donahoe (Manchester)
• Sven Grawunder (Halle)
• Language: its structure and sounds.
• Oral folklore: texts, narratives and personal stories, belief systems, naming systems.
• Music: singing and sound mimesis.
• Traditional ecology: nomadism, pastoralism, hunting and reindeer herding.
Shaman Ceremony
Language documentation for whom?
• for interested researchers
• for students and schools
• for journalists
• for the interested public
• for the language communities
• for future generations
For language communities
• language maintenance or even revitalization
• maintenance of the language, identity, self-confidence
• creation of school and other educational material
• support local/regional centers (create and dl complete copies)
• improve access to archives
• in communities there is great interest in recordings, in particular video
For future generations
• in a future world of monocultures it will become important to know about earlier diversity
• as now, it will be important to know one's own roots
• it may be relevant to point to the different types of languages
• let’s be honest: we don’t know what future generations will do with the material
Why archives?
• many reasons
• Dietrich Schüller: 80% of our recordings about culture and languages are endangered! Storage is inadequate (media, formats, PCs, ...)
• selection of suitable technologies requires expert knowledge
• creation of redundant storage and migration is important; it requires discipline and must not depend on individual persons
• migration to new technologies can be very expensive
• only centres can do this
• AND: requires explicitness; in the end, a viewable corpus
• international trend: DOBES, AILLA, ELAR, PARADISEC, LACITO, ...
What is a “modern” digital archive?
• traditional archives
  - focus on preserving the physical content
  - access not permitted
• digital archives
  - physical object is almost irrelevant (tape, CD-ROM)
  - content has to be preserved
  - why this revolutionary change? Copies can be made lossless (let’s be careful with compression) and at low cost
• modern digital archive
  - long-term preservation of the content (migration, distribution)
  - access to the content
  - enrichment without affecting the content
  - sensitive management of access
• DOBES has to be a living archive (interactive, expandable)
Long-term preservation
• can we guarantee survival of bit-streams? NO
• can we increase the chances of survival? YES
• our storage media are not adequate
• how to do it
  - continuous migration (copies to new generation)
  - world-wide distribution (now within Germany/NL)
  - problem of interpretability not solvable
  - have to take care of ethical/legal aspects
  - crucial for survival are maintenance costs
• all MPI material is available in 7 copies at different locations
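Keeping several copies only helps if damaged copies can be detected before they are migrated further. The following is a minimal sketch of such a check, assuming that a checksum comparison across copies is acceptable. The slides do not describe how the MPI archive verifies its copies, so the function names and the majority rule are illustrative only.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Checksum used as a fingerprint of a bit-stream."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: dict[str, bytes]) -> list[str]:
    """Return the locations whose content differs from the majority checksum.

    `copies` maps a (hypothetical) location name to the bytes stored there.
    If there is no clear majority, the result depends on tie order.
    """
    digests = {location: sha256_hex(data) for location, data in copies.items()}
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(loc for loc, digest in digests.items() if digest != majority_digest)


# Three copies of the same recording, one of them with a flipped byte.
copies = {
    "site-a": b"recording-bytes",
    "site-b": b"recording-bytes",
    "site-c": b"recording-bytEs",
}
print(find_damaged_copies(copies))  # ['site-c']
```

A damaged copy found this way can be replaced from any intact location during the next migration cycle.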
[Figure: time scale (0, 250, 500, 1000, 2000 years) comparing clay tablets with various e-media]
Pillars of Digital Archives I
• strict separation of physical and logical access layers
• physical domain is for system managers and archive managers, and it changes
• logical domain (created by linguists) remains and is stable
• metadata is the glue and has to be maintained
[Figure: domain of physical resources and conceptual domain of resources; roles shown: corpus manager, user, creator, system manager]
Pillars of Digital Archives II
[Figure: archive organization with a layer of language and a layer of sessions; resource types shown: lexicon, intro films, notes, song book, video recording, sound recording, annotations]
Pillars of Digital Archives III
• separation between object and instance
• need Unique Resource IDs (URIDs)
• and a robust “resolving” mechanism
[Figure: URID resolver with mappings to the MPI repository and the GWDG repository; MPI portal and XYZ portal, each with metadata]
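The idea of separating an object from its instances can be sketched as a small lookup table: metadata refers only to a stable identifier, and a resolver maps that identifier to wherever copies currently live. This is a minimal sketch, not the archive's actual resolver; the class, the identifier syntax and the location strings are all hypothetical.

```python
class UridResolver:
    """Maps a stable resource ID to the current locations of its copies."""

    def __init__(self) -> None:
        self._locations: dict[str, list[str]] = {}

    def register(self, urid: str, location: str) -> None:
        """Record that a copy of `urid` is stored at `location`."""
        known = self._locations.setdefault(urid, [])
        if location not in known:
            known.append(location)

    def move(self, urid: str, old: str, new: str) -> None:
        """Update one location after a physical migration; the ID is unchanged."""
        known = self._locations[urid]
        known[known.index(old)] = new

    def resolve(self, urid: str) -> list[str]:
        """Return all known locations; raises KeyError for an unknown ID."""
        return list(self._locations[urid])


resolver = UridResolver()
resolver.register("urid:example:0001", "mpi://disk-3/rec.wav")
resolver.register("urid:example:0001", "gwdg://tape-9/rec.wav")
resolver.move("urid:example:0001", "mpi://disk-3/rec.wav", "mpi://disk-7/rec.wav")
print(resolver.resolve("urid:example:0001"))
```

Because portals and metadata only ever store the identifier, a move inside a repository touches the resolver table and nothing else.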
Pillar of Digital Archives IV
• need Versioning
• nothing may be deleted, but annotations will be changed!
• the research world is dynamic: we want enrichment/extension
[Figure: URID resolver pointing to versions with access rights such as userx=read, usery=read and userx=write, usery=read]
Principles V – Authentication&Authorization
• authentication and authorization have to be separated
• URIDs are the central link to authorization information
• need to have space for policies, procedures, declarations etc.
• but administrative effort has to be minimized!
[Figure: URID resolver with per-resource access rights, e.g. userx=read, usery=read; userx=write, usery=read]
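The separation can be sketched as two independent lookups: one answers "who is this?", the other answers "what may this user do with this URID?". The users, passwords and URIDs below are hypothetical, and the bare SHA-256 password hashing is for illustration only (a real system would use a salted, slow password hash).

```python
import hashlib
import hmac

# Authentication data: who is this? (hypothetical users)
_PASSWORD_HASHES = {
    "userx": hashlib.sha256(b"secret-x").hexdigest(),
    "usery": hashlib.sha256(b"secret-y").hexdigest(),
}

# Authorization data: what may they do? Keyed by URID, as on the slide.
_ACCESS_RIGHTS = {
    "urid:example:0001": {"userx": {"read"}, "usery": {"read"}},
    "urid:example:0002": {"userx": {"read", "write"}, "usery": {"read"}},
}


def authenticate(user: str, password: str) -> bool:
    """Check identity only; says nothing about permissions."""
    expected = _PASSWORD_HASHES.get(user)
    if expected is None:
        return False
    given = hashlib.sha256(password.encode("utf-8")).hexdigest()
    return hmac.compare_digest(given, expected)


def is_authorized(user: str, urid: str, action: str) -> bool:
    """Check permissions only; assumes the user is already authenticated."""
    return action in _ACCESS_RIGHTS.get(urid, {}).get(user, set())


if authenticate("userx", "secret-x"):
    print(is_authorized("userx", "urid:example:0002", "write"))  # True
    print(is_authorized("userx", "urid:example:0001", "write"))  # False
```

Keeping the two tables apart means the login mechanism can be replaced without touching the rights attached to each resource.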
Principles VI – Formats
• only open, well-documented and widely used formats (encoding standards) should be used in the archive
• where possible, generic schemas should be the basis
• in DOBES, strong recommendations for a few archival formats
  - JPEG/TIFF/PNG, MPEG-2, linear PCM, Unicode, XML
  - plain text, HTML, (PDF) possible
• at MPI less restrictive (therefore great danger with some types)
  - for presentation purposes also MPEG-1/4, MP3, HTML
  - as import formats a large variety (Shoebox, CHAT, Word, ...)
  - conversion as far as possible towards generic files (LMF, EAF, ?)
• archived objects have to be stored in a neutral way and be accessible as individual objects
• no encapsulation for primary objects
• nevertheless: the MPI archive takes almost all data (even 16mm films)
• but conversion can be very costly
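A format policy like this can be expressed as a simple classification step at ingest. The sketch below is a rough illustration: the mapping from file extensions to the formats named on the slide is an assumption, and an extension is only a weak proxy (for example, ".mpg" does not say whether a file is MPEG-1 or MPEG-2).

```python
from pathlib import PurePosixPath

# Assumed extensions for the archival formats recommended on the slide.
ARCHIVAL = {
    ".jpg", ".jpeg", ".tif", ".tiff", ".png",  # JPEG/TIFF/PNG
    ".mpg", ".mpeg",                           # MPEG-2 (extension is ambiguous)
    ".wav",                                    # linear PCM
    ".xml", ".txt", ".html", ".htm", ".pdf",   # XML, plain text, HTML, (PDF)
}
# Assumed extensions for formats accepted for presentation purposes only.
PRESENTATION_ONLY = {".mp3", ".mp4"}


def classify(filename: str) -> str:
    """Return 'archival', 'presentation' or 'convert' for a file name."""
    extension = PurePosixPath(filename).suffix.lower()
    if extension in ARCHIVAL:
        return "archival"
    if extension in PRESENTATION_ONLY:
        return "presentation"
    return "convert"


for name in ("story.WAV", "clip.mp3", "lexicon.doc"):
    print(name, classify(name))
```

Anything classified as "convert" corresponds to the import formats on the slide: accepted, but to be converted to a generic format before it counts as archived.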
MPI Archive – state
• more than 150,000 objects (in the online archive, ~1/3 of the data)
• in total more than 15 TB
• about 4 TB added per year
• several sub-archives (EL, SL, ESF, CGN, ...)
• MPI archive ingest is open to other people!
• completely structured by open XML files based on the IMDI schema
• a complete machinery is available
• URIDs and versioning are being worked on at this moment
MPI Archive – Access
[Figure: archive architecture. The Archive: domain of registered primary and secondary resources (primary resources: texts, images, sound, movies) and domain of descriptive metadata. Archive utility layer: metadata tools, data ingestion and management, user authentication, access rights. Archive access: annotation exploitation, lexicon exploitation, text exploitation, ontological knowledge. Archive enrichment: media annotation, lexical encoding, web commentary.]
MPI Archive – Metadata and Simple Access
• metadata is open!
• what is minimal metadata? This is an ongoing discussion
• IMDI Editor
• BatchModifier (to change lots of IMDI files)
• IMDI XML Browser (operates in the distributed XML domain)
• IMDI HTML browsing (on-the-fly transformation of XML)
• structured search in the XML and HTML domains
• unstructured search in the XML and HTML domains
• searchable via Google
• geographic browsing via Google Earth (work in progress)
• DC/OLAC bridge via OAI port (all IMDI metadata can be harvested)
• manuals and training courses
• direct access to simple objects via plug-ins
• complete sub-tree download
MPI Archive – Upload Access
• two options
  - manual integration: exceptions are easy, but too many teams (~60)
  - LAMUS-controlled integration: exceptions are difficult, but users do it themselves (?)
• LAMUS features
  - web-based operation
  - request of a work space
  - specification of an accepted upload node (archive anchor)
  - extend and manipulate the corpus structure
  - upload metadata descriptions
  - upload any type of resources (configurable format control)
  - create a linked sub-archive in the workspace and integrate this into the archive
  - checks to guarantee consistency and format compliance
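One kind of consistency check before a workspace is integrated is to compare what the metadata descriptions link to with what was actually uploaded. The sketch below illustrates that idea only; it does not reproduce how LAMUS implements its checks, and the data layout and names are hypothetical.

```python
def check_workspace(links: dict[str, list[str]], uploaded: set[str]) -> dict[str, list[str]]:
    """Compare resources linked from metadata with uploaded resources.

    `links` maps a metadata description name to the resource names it
    references; `uploaded` is the set of resource names in the workspace.
    """
    referenced = {name for refs in links.values() for name in refs}
    return {
        "missing": sorted(referenced - uploaded),   # linked but not uploaded
        "unlinked": sorted(uploaded - referenced),  # uploaded but never linked
    }


report = check_workspace(
    links={"session1.imdi": ["rec1.wav", "rec1.eaf"], "session2.imdi": ["rec2.wav"]},
    uploaded={"rec1.wav", "rec1.eaf", "notes.txt"},
)
print(report)  # {'missing': ['rec2.wav'], 'unlinked': ['notes.txt']}
```

A workspace would only be merged into the archive when both lists are empty.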
MPI Archive – Utilization Access
Problem
• different structures and formats
• different terminologies
tools are ANNEX/LEXUS
MPI Archive – State of Access
• at this moment almost everything from DOBES is closed
• lots of requests by journalists
• the first 15 teams have to finish in these months
• working hard
• changing a lot until the last minute, of course
• expect some material to become open
• but much will have to be handled on request
End
Mark Abley (Canadian)
Each time we lose a language
the ghosts who made use of it
cast a new bell.
The voices magnify. Soon,
listen, they’ll outpeal the tongues of earth.
Thanks for your attention.
Lots of differences
Differences at all linguistic layers
• Phonemics
• Prosody
• Phonology
• Morphology
• Syntax
• Semantics
• Pragmatics
Reduced Languages
• Whistling of Gomera fishermen
• Sign language of the Plains Indians
• “Computer” languages
• ...
Sound Systems
[Figures: spectra and formants (F1 to F5); vowel distribution (28 languages); formants over time]
• Rotokas (Papua New Guinea)
  - vowels a/e/i/o/u
  - 6 consonants p/t/k/v/r/g
• !Xoo (southern Africa)
  - 141 sounds incl. click sounds
Tone Systems
• modulation of segmental information by prosody
• stretches across phrases and sentences
• Tones: meaning of words
• Swedish: 2 tones (anden – ándén)
• German: aufbäumen – aufBäumen
• Mandarin Chinese: 4 tones
• Cantonese: 9 tones
• Vietnamese: 8 tones
• some South-East Asian languages: up to 15 tones
[Figure: intonation; the 4 tones of Mandarin Chinese on the syllable “i”, glossed as stuff, to suspect, chair/armchair, meaning]
Morphosyntax
• Rules for the generation of words and grammatical structures
• strictly isolating languages: one morpheme, one word
  - Chinese is an isolating language
• the other extreme are the polysynthetic languages
• example from the Yup’ik Inuit:
  tuntussurqatarniksaitengqiggtuq
  tuntu ssur qatar ni ksaite ngqiggt uq
  reindeer hunt FUT say NEG again 3SG:IND
  “he had not yet said that he wanted to hunt reindeer”
• basic principle: the stem is inflected by many affixes
• unusual for us: isolated core morphemes cannot be interpreted; “ssur” uttered in isolation does not make sense
[Figure label: verb stem]
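The morpheme-by-morpheme reading of the Yup’ik example can be made explicit by pairing each segment with its gloss. The sketch below is only an illustration of interlinear alignment: the function name is hypothetical and the segmentation and glosses are copied from the slide, not independently checked.

```python
def align_gloss(morphemes: str, glosses: str) -> list[tuple[str, str]]:
    """Pair whitespace-separated morphemes with their glosses, one to one."""
    morpheme_list = morphemes.split()
    gloss_list = glosses.split()
    if len(morpheme_list) != len(gloss_list):
        raise ValueError("morpheme line and gloss line differ in length")
    return list(zip(morpheme_list, gloss_list))


word = "tuntussurqatarniksaitengqiggtuq"
pairs = align_gloss(
    "tuntu ssur qatar ni ksaite ngqiggt uq",
    "reindeer hunt FUT say NEG again 3SG:IND",
)
for morpheme, gloss in pairs:
    print(f"{morpheme:10} {gloss}")

# The segments concatenate back to the single orthographic word.
print("".join(m for m, _ in pairs) == word)  # True
```

The seven segments together form one word, which is what makes the language polysynthetic rather than isolating.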
Dialog style
• norms for expressing things/activities differ
• example from Kilivila (Trobriand Islands, New Guinea)

Person: Ambeya
  “Where are you going?”

Gunter (wants to say: I will wash myself): Bala bakakaya
  “I will go, I will take a bath”

Host: Bila bikakaya bike’ita bisisu bipaisewa
  3.Fut-go 3.Fut-bathe 3.Fut-come.back 3.Fut-be 3.Fut-work
  “He will go, he will take a bath, he will come back, he will stay, he will work.”
  i.e. “He will take a bath, come back again and work with us”
Pronouns
• in Kilivila
  - the inclusive and exclusive dual: “we two” versus “myself and the others except you”
• in Paamese (Vanuatu archipelago)
  - in addition the paucal: “a few”
Spatial orientation
[Figure: egocentric system (above, below, behind, right) versus absolute system (north, south, west, east)]
• Herberger would use the egocentric system to describe the scene
• Aborigines would choose the absolute system, which is hardly possible for us: “the ball lies east of the player”
Awareness
• since 1866 efforts to preserve diversity in nature
• 1991 problem in the focus of the Linguistic Society of America
• 1992 discussion at the International Congress of Linguists
• 1992 working group on endangered languages in the German Linguistic Society
• 1993 UNESCO project to create the red list
• 2000 DOBES programme of the Volkswagen Foundation
• within 2 decades broad awareness amongst linguists
• David Crystal, amongst first-semester students:
  - 75% don’t know anything about the problem
  - most don’t see a problem
• how come there is attention for tigers etc. but not for languages?
Factors are known
• external factors
  - military suppression
  - religious conversion
  - economic dominance
  - cultural dominance
  - educational suppression
• internal factors
  - negative attitude towards own language
  - avoidance of discrimination
  - hope to earn (more) money
  - improvement of mobility
  - youngsters are trend followers
  - ...
MPI Archive – Content Overview
 | MD sessions | video files | audio files | photos | other media | textual files | sub-types
MPI | 18524 | 14085 | 5131 | 7774 | 1315 | 13979 | 365 EAF, 2377 CHAT, 5580 MediaTagger, 3568 PlainText/Shoebox, 1589 others
DOBES | 1396 | 1043 | 1250 | 63 | 20 | 205 | 46 EAF, 85 Shoebox, 72 others
Dutch Spoken Corpus | 12767 | | 12767 | | | 41832 | to be converted to EAF
Dutch Bilingual Database | 874 | | 191 | | | 714 | CHAT, EAF
ECHO Sign Language | 168 | 296 | | | | 181 | EAF
ESF corpus | 994 | 546 | | | | 1775 | in CHAT
Total | 34723 | 15970 | 19339 | 7837 | | 58686 | 136,555 objects
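The totals in this overview can be cross-checked by summing the columns. In the sketch below, empty cells are treated as zero, and the placement of the sparse rows' numbers in the video, audio and textual columns is an assumption chosen so that the column sums match the stated totals. The "other media" column is left out because the total row gives no figure for it.

```python
COLUMNS = ("sessions", "video", "audio", "photos", "textual")

# Per-corpus counts as read from the overview (0 = empty cell).
ROWS = {
    "MPI": (18524, 14085, 5131, 7774, 13979),
    "DOBES": (1396, 1043, 1250, 63, 205),
    "Dutch Spoken Corpus": (12767, 0, 12767, 0, 41832),
    "Dutch Bilingual Database": (874, 0, 191, 0, 714),
    "ECHO Sign Language": (168, 296, 0, 0, 181),
    "ESF corpus": (994, 546, 0, 0, 1775),
}

STATED_TOTALS = (34723, 15970, 19339, 7837, 58686)
STATED_OBJECTS = 136555


def column_sums(rows: dict[str, tuple[int, ...]]) -> tuple[int, ...]:
    """Sum each column across all corpora."""
    return tuple(sum(column) for column in zip(*rows.values()))


sums = column_sums(ROWS)
for name, computed, stated in zip(COLUMNS, sums, STATED_TOTALS):
    print(f"{name:9} computed={computed:6} stated={stated:6}")
print(sum(sums) == STATED_OBJECTS)  # True
```

With this reading, every column sum matches the total row, and the five totals add up to the 136,555 objects quoted.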