eWika: Digitalization of Philippine Languages
![Page 1: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/1.jpg)
Isalin / Translate
eWika: Digitalization of Philippine Languages
Charibeth K. Cheng
March 19, 2008
![Page 2: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/2.jpg)
Machine Translation
• Automate translation
• A study under Natural Language Processing
Sentence in SOURCE LANGUAGE → MT System → Sentence in TARGET LANGUAGE
![Page 3: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/3.jpg)
ENG-FIL MT System Project
• 3-year project
• started last year
• funded by DOST-PCASTRD
• composition:
  – 6 faculty members of the College of Computer Studies
  – 15 computer science majors
  – assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M
![Page 4: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/4.jpg)
Agenda
• Architecture of the MT System
• Linguistic resources
• Demo of the Translation Engine
• Results for English to Filipino translation
![Page 5: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/5.jpg)
Architectural Design of the Program
Language Resources:
• Lexicon (electronic dictionary)
• Morphological Analyzer & Generator
• Part-of-Speech tagger
• Grammar
• Corpus (Tagged)
MT: Example-based
MT: Rule-based
User Interface
Output Modeller
Source Text Target Text
Translator Engine
![Page 6: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/6.jpg)
Challenge!
• Language resources
  – Quality of translation is dependent on them.
  – Built from almost non-existent digital forms
  – Manual vs. automatic construction
![Page 7: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/7.jpg)
Lexicon Builder
• Used IsaWika! database as initial lexicon
• Created a lexicon extraction program to automatically determine candidate translation pairs from corpora
• Currently contains about 23,000 entries
• Co-occurring words are likely translations
• Challenge: Lexical resources
  – parallel corpora
  – part-of-speech tagger
Database
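The co-occurrence idea behind the lexicon extraction program can be sketched as follows. This is a minimal illustration, not the project's actual extractor: the Dice coefficient, the function name `candidate_pairs`, the `min_score` threshold, and the toy sentence pairs are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def candidate_pairs(parallel, min_score=0.5):
    """Score source/target word pairs by sentence-level co-occurrence.

    `parallel` is a list of (source_sentence, target_sentence) tuples.
    The Dice coefficient is an assumed scoring choice for this sketch.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in parallel:
        src_words = set(src_sent.lower().split())
        tgt_words = set(tgt_sent.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word of the pair.
        pair_count.update(product(src_words, tgt_words))

    scored = []
    for (s, t), both in pair_count.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            scored.append((s, t, dice))
    # Highest-scoring candidates first.
    return sorted(scored, key=lambda item: -item[2])


# Hypothetical toy corpus, not taken from the project's data.
toy_corpus = [
    ("the man ate", "kumain ang lalaki"),
    ("the man slept", "natulog ang lalaki"),
    ("the dog ate", "kumain ang aso"),
]
scores = {(s, t): d for s, t, d in candidate_pairs(toy_corpus)}
print(scores[("man", "lalaki")])  # 1.0
print(scores[("man", "ang")])     # 0.8
```

With so little data, frequent function words still score high against many words, which is one reason a part-of-speech tagger and larger parallel corpora are listed as challenges.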
![Page 8: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/8.jpg)
Morphological Analyzer
• Initially collected morphological rules from grammar books
• Developed an example-based morphological phenomenon learner
  – learns from <inflected word, root-word> pairs
  – example: <kumakain, kain>
• Challenge: Lexical resources
  – lexicon
  – part-of-speech tagger
  – morphological rules
Generator
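Learning from an <inflected word, root-word> pair can be sketched roughly as below. This is an assumed toy scheme, not the project's learner: it only handles cases where the root appears intact at the end of the inflected form, and the names `learn_rule`, `analyze`, and the `C`/`V` placeholders are hypothetical.

```python
VOWELS = set("aeiou")


def first_cv(root):
    """Return the root's initial consonant (or None) and its first vowel."""
    c1 = root[0] if root[0] not in VOWELS else None
    v1 = next((ch for ch in root if ch in VOWELS), None)
    return c1, v1


def learn_rule(inflected, root):
    """Abstract one <inflected, root> pair into a prefix template.

    Letters of the leftover prefix that equal the root's initial consonant
    or first vowel become the placeholders "C" and "V"; other letters stay
    literal. Returns None when the root is not a proper suffix.
    """
    if inflected == root or not inflected.endswith(root):
        return None
    prefix = inflected[: len(inflected) - len(root)]
    c1, v1 = first_cv(root)
    return tuple(
        "C" if ch == c1 else "V" if ch == v1 else ch for ch in prefix
    )


def analyze(word, rules):
    """Return candidate roots of `word` explained by a learned template."""
    roots = []
    for template in rules:
        n = len(template)
        prefix, root = word[:n], word[n:]
        if len(prefix) < n or not root:
            continue
        c1, v1 = first_cv(root)
        matches = all(
            (slot == "C" and ch == c1)
            or (slot == "V" and ch == v1)
            or slot == ch
            for slot, ch in zip(template, prefix)
        )
        if matches:
            roots.append(root)
    return roots


rules = {learn_rule("kumakain", "kain")}
print(rules)                        # {('C', 'u', 'm', 'V')}
print(analyze("bumabasa", rules))   # ['basa']
print(analyze("sumusulat", rules))  # ['sulat']
```

The single example generalizes to other verbs with the same pattern, but without a lexicon to check candidate roots against, such a learner cannot tell valid analyses from accidental matches.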
![Page 9: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/9.jpg)
Part-Of-Speech Tagger
• automatic association of parts-of-speech to words in a document
• existing Filipino tagger achieves < 80% accuracy
• Challenge: Lexical resources
  – tagged parallel corpora
  – lexicon
  – morphological analyzer
  – grammar
![Page 10: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/10.jpg)
Grammar
• Derived manually
• Challenge: Free word order in sentence formation.
The man bought an umbrella from the store.
• Bumili ang lalaki ng payong sa tindahan.
• Bumili sa tindahan ng payong ang lalaki.
• Ang lalaki ay bumili ng payong sa tindahan.
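One way to see why free word order is manageable but awkward for position-based rules: in these three sentences the roles are signalled by the markers ang, ng, and sa rather than by position. The sketch below is only an illustration under strong assumptions (single-word noun phrases, one verb, the hypothetical function name `roles`); it is not the project's grammar.

```python
MARKERS = {"ang", "ng", "sa"}


def roles(sentence):
    """Map a toy Filipino sentence to a frame keyed by marker.

    Assumes each marker is followed by a one-word noun phrase, that "ay"
    is only the inversion particle, and that any other word is the verb.
    """
    tokens = sentence.lower().rstrip(".").split()
    frame = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in MARKERS and i + 1 < len(tokens):
            frame[tok] = tokens[i + 1]  # marker plus the noun it introduces
            i += 2
        elif tok == "ay":
            i += 1                      # skip the inversion particle
        else:
            frame["verb"] = tok
            i += 1
    return frame


sentences = [
    "Bumili ang lalaki ng payong sa tindahan.",
    "Bumili sa tindahan ng payong ang lalaki.",
    "Ang lalaki ay bumili ng payong sa tindahan.",
]
frames = [roles(s) for s in sentences]
print(frames[0])
# {'verb': 'bumili', 'ang': 'lalaki', 'ng': 'payong', 'sa': 'tindahan'}
print(frames[0] == frames[1] == frames[2])  # True
```

All three orderings reduce to the same frame, so a grammar for translation has to accept every ordering while producing a single English word order.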
![Page 11: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/11.jpg)
Corpora
• used by the lexicon extractor, the part-of-speech tagger, and the example-based MT
• came from translation works of DLSU English majors, verified by linguists
• consists of 207,000 words, 5,000 of which are tagged
![Page 12: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/12.jpg)
Translation Rules
• currently learned from the corpora
• disadvantages
  – garbage-in-garbage-out
  – comprehensiveness
• need for linguist-verified rules
![Page 13: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/13.jpg)
![Page 14: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/14.jpg)
Bringing it home …
• 171 Philippine Languages (SIL)
• No Philippine Corpora
• Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc)
• “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)
![Page 15: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/15.jpg)
eWika: Digitalization of Philippine Languages
• Build the Philippine Corpus
• Build software tools to study or use the corpus
  – Across Languages
  – Across Regions
  – Across Forms and Genres
  – Across Land and Sea
![Page 16: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/16.jpg)
Across Languages
• 171 Philippine Languages (SIL List)
• Summer Institute of Linguistics http://www.ethnologue.com/
• Major languages
• Near extinction languages
• How about the languages in-between?
![Page 17: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/17.jpg)
Filipino Sign Language
• The History of Sign Language in the Philippines: Piecing Together the Puzzle (Abat & Martinez, 9th Phil Linguistics Congress, 2006)
• Deaf individuals: handicapped vs members of a linguistic minority
• Sign languages as true languages
![Page 18: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/18.jpg)
Across Boundaries
• Across Languages
• Across Regions
• Across Forms and Genres
• Across Land and Sea
![Page 19: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/19.jpg)
Across Regions
• e-Wika: Connecting the Philippine Islands through Language
• 17 Regions: Ilocos Region (Region I), Cagayan Valley (Region II), Central Luzon (Region III), CALABARZON (Region IV-A), MIMAROPA (Region IV-B), Bicol Region (Region V), Western Visayas (Region VI), Central Visayas (Region VII), Eastern Visayas (Region VIII), Zamboanga Peninsula (Region IX), Northern Mindanao (Region X), Davao Region (Region XI), SOCCSKSARGEN (Region XII), Caraga (Region XIII), Autonomous Region in Muslim Mindanao (ARMM), Cordillera Administrative Region (CAR), National Capital Region (NCR, Metro Manila)
![Page 20: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/20.jpg)
![Page 21: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/21.jpg)
Across Boundaries
• Across Time: historical, contemporary
• Across Languages
• Across Regions
• Across Forms and Genres
• Across Land and Sea
![Page 22: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/22.jpg)
Across Forms and Genres
• In various forms:
• Text
• Speech: speech to text system (ongoing project)
• Video: Filipino sign language
• In various Genres: categories of entries in the corpus
![Page 23: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/23.jpg)
Across Boundaries
• Across Time: historical, contemporary
• Across Languages
• Across Regions
• Across Forms and Genres
• Across Land and Sea
![Page 24: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/24.jpg)
Across Land and Sea
• Web-based application: c/o Solomon See (upload, download, tools)
• Contributors (Main players)
• Verifiers
• Facilitators
• Server: DLSU-M commits to host the server for the next three years.
• Terms of Use: Research purposes.
![Page 25: eWika: Digitalization of Philippine Languages](https://reader036.fdocuments.in/reader036/viewer/2022062305/56815006550346895dbdd8a5/html5/thumbnails/25.jpg)
• The dream of building Philippine language resources and tools
• Many many many major hurdles to overcome
• Language Resources, Tools, & Peopleware: Needed