SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu
Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an Internet Explorer nutzt Hardwarebeschleunigung Websites werden schneller geladen damit Sie noch reibungsloser surfen können
Nimm deine Lieblingsmusik überallhin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher
Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation:
A Case Study on our German IT Corpus
Sebastian Leidig, Tim Schlippe, Tanja Schultz
2 15-May-2014
Motivation
Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios
From Microsoft's German website www.microsoft.de:
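“Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” (English: “Show other apps next to the browser for easy multitasking.”)
“Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.” (English: “Internet Explorer uses hardware acceleration. Websites load faster so that you can surf even more smoothly.”)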
Motivation
With globalization, words from other languages enter a language without being assimilated to its phonetic system.
To build up lexical resources economically with automatic or semi-automatic methods, such words have to be detected and treated separately.
Overview
Input: word list (word1, word2, word3, word4, word5, word6)
Features: grapheme perplexity, G2P confidence, Hunspell lookup (native), Hunspell lookup (English), Wiktionary lookup, Google hit count
Combination: voting, decision tree, SVM
Output: classification
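The simplest of the combination methods named in the overview, voting, can be sketched as follows. This is a minimal illustration, assuming each single feature has already produced a binary decision per word; the function name `vote`, the per-feature decisions, and the strict-majority default are assumptions of this sketch, not settings taken from the slides.

```python
from typing import Dict, Optional


def vote(decisions: Dict[str, bool], min_votes: Optional[int] = None) -> bool:
    """Return True (English) if enough single features voted 'English'.

    min_votes defaults to a strict majority of the available features.
    """
    if min_votes is None:
        min_votes = len(decisions) // 2 + 1
    return sum(decisions.values()) >= min_votes


# Hypothetical per-feature decisions for one word (True = classified as English).
example = {
    "grapheme_perplexity": True,
    "g2p_confidence": True,
    "hunspell_native": False,
    "hunspell_english": True,
    "wiktionary": True,
    "google_hit_count": False,
}

# Four of six features vote "English", which meets the strict majority of four.
print(vote(example))
```

Lowering `min_votes` would presumably catch more anglicisms at the cost of more false alarms; a decision tree or SVM would replace this fixed rule with one learned from annotated data.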
Outline
1. Motivation and Overview
2. Test Sets
3. Single Features
4. Combinations
5. Summary and Future Work
Test Sets - Domains
German IT website (www.microsoft.de): 4.6k unique words
German general news (www.spiegel.de): 6.6k unique words
Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013): 9.4k unique words
Test Sets - Domains
Tag for “English”:
e.g. Software, Brain, …
Foreign hybrids
Compound words
e.g. Schadsoftware, …
Grammatically adapted words
e.g. downloaden, …
Decisions based on
Agreement of annotators
duden.de
Different word categories:
Abbreviations:
e.g. UV, CIA, …
Other foreign words
Compound words
e.g. Français, Niveau, …
Foreign words in different test sets
Single Features – Design Criteria
Features trained on commonly available resources: word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google
Thresholds without supervised training: comparison between English and native models
New approaches
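The idea of comparing an English model against a native model, rather than tuning a threshold on labelled data, can be illustrated with grapheme perplexity. The sketch below is a toy under stated assumptions: the class `GraphemeBigramModel`, the bigram order, the add-one smoothing, and the tiny training word lists are all chosen for brevity and are not taken from the slides.

```python
import math
from collections import Counter
from typing import Iterable, List


class GraphemeBigramModel:
    """Character bigram model with add-one smoothing (a toy stand-in)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab = set()
        for word in words:
            chars = self._pad(word)
            self.vocab.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    @staticmethod
    def _pad(word: str) -> List[str]:
        # Word-boundary symbols let the model score word beginnings and endings.
        return ["<s>"] + list(word.lower()) + ["</s>"]

    def perplexity(self, word: str) -> float:
        chars = self._pad(word)
        v = len(self.vocab) + 1  # one extra slot for unseen graphemes
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(chars) - 1))


def looks_english(word: str, english: GraphemeBigramModel,
                  native: GraphemeBigramModel) -> bool:
    """No tuned threshold: the word is English if the English model fits better."""
    return english.perplexity(word) < native.perplexity(word)


# Made-up miniature word lists; real models would be trained on large lists.
english_model = GraphemeBigramModel(
    ["download", "software", "browser", "update", "cloud", "server"])
native_model = GraphemeBigramModel(
    ["herunterladen", "rechner", "schnell", "können", "reibungslos", "geladen"])

for w in ["software", "geladen"]:
    print(w, looks_english(w, english_model, native_model))
```

Because the decision is a comparison between two models trained on word lists, no annotated anglicism data is needed to set a threshold.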
Grapheme Perplexity
Grapheme-to-Phoneme Confidence
Phonetisaurus confidence scores (costs)
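One way such costs could be turned into a feature is to compare the cost an English G2P model assigns to a word with the cost a native G2P model assigns. The sketch below assumes the costs have already been obtained and that lower means more confident. It does not call Phonetisaurus, and the function name, the `margin` parameter, and all numbers are hypothetical.

```python
def classify_by_g2p_cost(cost_english: float, cost_native: float,
                         margin: float = 0.0) -> str:
    """Label a word 'English' if the English G2P model is more confident.

    Costs are assumed to behave like negative log-probabilities, so a
    lower cost means a more confident pronunciation hypothesis.
    """
    return "English" if cost_english + margin < cost_native else "native"


# Hypothetical (English cost, native cost) pairs; not real G2P output.
costs = {"Browser": (12.4, 19.8), "Rechner": (21.3, 11.0)}
for word, (c_en, c_de) in costs.items():
    print(word, classify_by_g2p_cost(c_en, c_de))
```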
Hunspell Lookup
Lookup in English spellchecker dictionary (Hunspell-en):
word list (word1, word2, word3, word4) → Hunspell dictionary lookup, derive word forms → classification
Lookup in German spellchecker dictionary (Hunspell-de):
word list (word1, word2, word3, word4) → Hunspell dictionary lookup, derive word forms → classification
2 features performed best
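The two lookups can be sketched without the real Hunspell library. In the toy below, small stem sets stand in for the Hunspell-en and Hunspell-de dictionaries and a fixed suffix list stands in for Hunspell's affix rules that derive word forms. Every name and entry here is an assumption made for illustration.

```python
from typing import Dict, Set

# Toy stand-ins for the spellchecker dictionaries (not real Hunspell data).
ENGLISH_STEMS: Set[str] = {"download", "server", "update"}
GERMAN_STEMS: Set[str] = {"rechner", "laden", "schnell"}

# Crude stand-in for affix rules: a word form is a stem plus one of these.
SUFFIXES = ("", "s", "e", "en", "er")


def in_dictionary(word: str, stems: Set[str]) -> bool:
    """True if the word, or the word minus a known suffix, is a listed stem."""
    w = word.lower()
    return any(
        w.endswith(suffix) and w[: len(w) - len(suffix)] in stems
        for suffix in SUFFIXES
    )


def hunspell_features(word: str) -> Dict[str, bool]:
    """Return both lookup results so a later combination step can use them."""
    return {
        "in_english": in_dictionary(word, ENGLISH_STEMS),
        "in_german": in_dictionary(word, GERMAN_STEMS),
    }


print(hunspell_features("Downloads"))
print(hunspell_features("schnelle"))
```

With a real Hunspell binding, `in_dictionary` would be replaced by the library's spell-check call on the respective dictionary.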
Wiktionary Lookup
Check crowdsourced information in the Wiktionary of the matrix language
Google Hit Count
Based on Alex, B. (2008), “Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh
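A hit-count feature of this kind could compare how often a word appears on English pages versus native-language pages. The sketch below assumes the hit counts are already available and normalizes them by a rough estimate of each language's web size. The normalization scheme, the function name, and all numbers are assumptions of this sketch and are not taken from the slides.

```python
def classify_by_hit_count(hits_english: int, hits_native: int,
                          size_english: float, size_native: float) -> str:
    """Compare relative frequencies of a word in two language-restricted searches.

    size_english and size_native are rough estimates of how much text the
    search engine indexes per language, so raw counts become comparable.
    """
    rel_english = hits_english / size_english
    rel_native = hits_native / size_native
    return "English" if rel_english > rel_native else "native"


# Made-up hit counts and corpus-size estimates, for illustration only.
print(classify_by_hit_count(900_000, 40_000, size_english=10.0, size_native=1.0))
print(classify_by_hit_count(50_000, 40_000, size_english=10.0, size_native=1.0))
```

The second call shows why normalization matters in this sketch: the raw English count is higher, but relative to the much larger English web the word is more typical of the native language.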
Result: Single Features
Grapheme-to-Phoneme Confidence
Result: Single Features
On the Spiegel-de test set, a higher ratio of the words classified as English is wrong
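A higher ratio of wrong "English" decisions corresponds to lower precision. The sketch below computes precision, recall, and F-score for a set of words classified as English against a gold annotation. The example words and labels are invented, and this transcript does not state which metrics the evaluation used, so the choice is an assumption.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Score the words predicted as English against the annotated English words."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: "Hand" is a native word wrongly classified as English,
# and the annotated anglicism "Software" was missed.
predicted_english = {"Browser", "Apps", "Team", "Hand"}
gold_english = {"Browser", "Apps", "Team", "Software"}
print(precision_recall_f1(predicted_english, gold_english))
```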
Result: Combination
Challenges
Performance after filtering difficult words (oracle)
Conclusion and Future Work
Features based on available sources
New approaches:
G2P confidence
Wiktionary
Further features:
Part-of-speech (POS)
Context, trigger words
Capitalization
Translate and compare
Thank you for your attention!