Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Jerneja Žganec Gros, Alpineon d.o.o., Ljubljana, Slovenia. Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006.

Page 1: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

PLS

Considerations on using PLS for Slovenian

Pronunciation Lexicon Construction

Jerneja Žganec Gros

Alpineon d.o.o., Ljubljana, Slovenia

[email protected]

Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006

Page 2: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

ALPINEon

SI-PRON lexicon:

– word list

– lexicon format

– phonetic transcription

– morpho-syntactic descriptions

Proposed extensions to PLS, SSML

Conclusions

Page 3: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Language specifics

Slovenian language:

– Slavic language, 2 million speakers, over 70 dialects

– complex inflectional paradigm (common to Slavic languages)

• including "dual" – like ancient Greek!

– lexical stress position is not fixed (unlike in some other Slavic languages; Croatian, for example, never carries the accent on the last syllable)

– many homographs; POS information usually helps with disambiguation:

• example: "On je." (He is. / He eats.): "je" tagged as auxiliary_verb vs. indicative
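
A minimal sketch of how POS information can select between homograph pronunciations, assuming a lexicon keyed by orthography and POS tag. The `LEXICON` dictionary, the `lookup` helper, and the transcription strings are hypothetical placeholders, not SI-PRON entries; only the two tag names come from the slide.

```python
from typing import List, Optional

# Hypothetical lexicon: (orthography, POS tag) -> pronunciation.
# The pronunciation strings are placeholders, not real transcriptions.
LEXICON = {
    ("je", "auxiliary_verb"): "PRON_JE_IS",
    ("je", "indicative"): "PRON_JE_EATS",
}


def lookup(word: str, pos: Optional[str] = None) -> List[str]:
    """Return all pronunciations of `word`, narrowed by POS tag when one is given."""
    key = word.lower()
    return [pron for (w, p), pron in LEXICON.items()
            if w == key and (pos is None or p == pos)]
```

Without a POS tag the lookup stays ambiguous and returns both candidates; with a tag from an upstream tagger it returns exactly one.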

Page 4: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Pron lex

Speech technology applications:

– automatic speech recognition (ASR)

– text-to-speech synthesis (TTS)

– require consistent specification of pronunciation

– Slovenian: lexical stress position not fixed -> pron lex crucial

Pronunciation lexicons:

– general: not supposed to be covered by PLS

– application-specific

• word/phrase pronunciations

• application-specific proper nouns: personal and location names

Page 5: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Slovenian pron lex

General: – S5 (Gros et al., 1996)

– Onomastica (Derlić and Kačič, 1997)

– SImlex/SIflex (Verdonik et al., 2002)

– SI-LC-STAR (Verdonik and Rojc, 2004)

– AlpSynth (Gros et al., 2002)

– SI-BN (Žibert, 2005; Žgank, 2005)

Application-specific:

– Gopolis, SpeechDAT, etc.

Page 6: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Word-list

SI-PRON wordlist:

(a) 93,154 lemmas from SSKJ

(b) over 1,000,000 word forms derived from (a) by morphological derivation

(c) additional word list:

• corpus-based search

• 20,000 most frequent inflected word forms not covered by SSKJ lemmas

(d) collocations, multi-word expressions

SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language)

Page 7: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Phonetic transcriptions

SSKJ lemmas:

– automatic derivation, based on dynamic/tonemic accent information

– manual corrections for about 2,500 lemmas (words of foreign origin)

Word forms derived from SSKJ:

– automatic: SSKJ lemma pronunciation look-up, inflectional paradigms

Additional corpus-based word list:

– automatic lexical stress assignment

– AlpSynth grapheme-to-phoneme rule set

Page 8: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

GTP rules

193 context-dependent grapheme-to-phoneme rules:

Left context | Grapheme string | Right context | Phonetic transcr. | Example | Rule explanation
$ | er | _ | [@r] | Gaber | @ occurs before each -r not followed by a vowel (Toporisic91, p. 49)
= | m | f | [F] | Simfonija | <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis90, p. 145)
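
The rule format above (left context, grapheme string, right context, phonetic output) can be sketched as a small left-to-right rewriter. This is an illustration under stated assumptions, not the AlpSynth rule engine: the contexts are written as Python regular expressions rather than in the original notation (the meaning of "$", "_" and "=" is not given on the slide), the consonant left context of the first rule is a guess, the first matching rule wins, and letters not covered by a rule are copied unchanged.

```python
import re

VOWELS = "aeiou"

# (left context, grapheme string, right context, phonetic output).
# Contexts are regular expressions chosen for this sketch; they are
# assumptions, not a transcription of the original rule notation.
RULES = [
    (rf"[^{VOWELS}]", "er", rf"(?![{VOWELS}])", "@r"),  # -er- not followed by a vowel
    (r"", "m", r"[fv]", "F"),                           # <m> before <f> or <v>
]


def transcribe(word: str) -> str:
    """Apply the first matching rule at each position, scanning left to right."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for left, graphemes, right, phones in RULES:
            if (word.startswith(graphemes, i)
                    and re.search(f"(?:{left})$", word[:i])
                    and re.match(right, word[i + len(graphemes):])):
                out.append(phones)
                i += len(graphemes)
                break
        else:
            out.append(word[i])  # no rule matched: copy the letter unchanged
            i += 1
    return "".join(out)
```

With only these two rules, `transcribe("Gaber")` yields `gab@r` and `transcribe("Simfonija")` yields `siFfonija`; a full rule set would also rewrite the remaining letters.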

Page 9: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Transcription accuracy experiment

reference: hand-crafted pron lex, 30K lexemes, no loanwords(!)

automatic lexical stress assignment: 15% error rate

lexical stress & o/e pronunciation known in advance:

– transcription success rate 99.1% (0.6% handcrafting errors)

conclusion: for semi-automatic derivation of phonetic transcriptions with a 0.3% error rate, only lexical stress positions and e/o pronunciations need to be manually validated
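
The 0.3% figure appears to follow from the two numbers on this slide. The sketch below assumes that the 0.6% "handcrafting errors" are mismatches caused by mistakes in the hand-crafted reference rather than by the automatic transcription; the variable names are illustrative.

```python
from decimal import Decimal

success_rate = Decimal("99.1")     # % of automatic transcriptions matching the reference
reference_errors = Decimal("0.6")  # % of mismatches attributed to errors in the reference

mismatch_rate = Decimal("100") - success_rate      # all disagreements with the reference
residual_error = mismatch_rate - reference_errors  # disagreements left for the rules

print(residual_error)  # 0.3
```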

Page 10: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

SI-PRON format

LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)

Pronunciation Lexicon Specification (PLS)

– Version 1.0, W3C Last Call Working Draft 31 January 2006

• http://www.w3.org/TR/pronunciation-lexicon/

PLS:

– Ver 1.0 not designed for TTS internal lexicons

– on the other hand, we want to have a stronger link between SSML and the lexicon

– we are even thinking of introducing a POS attribute into token-like elements!

– leave these issues for PLS Ver 2.x or address them now?

Page 11: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Pronunciation variations

multiple pronunciations:

– several <phoneme> elements

– preferred pronunciation:

• indicated by the prefer attribute of the <phoneme> element

• usually the 1st pronunciation from the SSKJ

• for some words, 2 prons are equally preferred, e.g.:

- masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/

- "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"

- typically account for more fluent "iUts" or overarticulated "ilts" pronunciation
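
A minimal sketch of a PLS entry with several <phoneme> elements, one of them marked as preferred, built with Python's standard `xml.etree.ElementTree`. The helper names `q` and `build_lexeme`, the "x-sampa" alphabet value, and the transcription strings are assumptions made for illustration; the strings are placeholders built around the "iUts"/"ilts" fragments above, not validated SI-PRON transcriptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace


def q(tag: str) -> str:
    """Qualify a tag name with the PLS namespace."""
    return f"{{{PLS_NS}}}{tag}"


def build_lexeme(grapheme: str, pronunciations: List[str],
                 preferred: Optional[int] = 0) -> ET.Element:
    """Build one <lexeme>; `preferred` is the index to mark, or None for no mark."""
    lexeme = ET.Element(q("lexeme"))
    ET.SubElement(lexeme, q("grapheme")).text = grapheme
    for index, pron in enumerate(pronunciations):
        phoneme = ET.SubElement(lexeme, q("phoneme"))
        phoneme.text = pron
        if index == preferred:
            phoneme.set("prefer", "true")
    return lexeme


# Placeholder transcriptions and an assumed vendor alphabet name.
lexicon = ET.Element(q("lexicon"),
                     {"version": "1.0", "alphabet": "x-sampa", XML_LANG: "sl"})
lexicon.append(build_lexeme("borilec", ["PLACEHOLDER-iUts", "PLACEHOLDER-ilts"]))
pls_document = ET.tostring(lexicon, encoding="unicode")
```

Passing `preferred=None` leaves every variant unmarked, which is one way to represent two equally preferred pronunciations.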

Page 12: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Extensions…

proposed extension for PLS/SSML:

– a new optional attribute for the <phoneme> element:

• pron-style attribute

• values: "fluent", "overarticulated"

– pron-style also for other elements (linkage SSML-lex!):

• <voice>, <speak>, <p>, <s>

• another optional attribute for the above elements: emotion, for expressive TTS?

- could this be covered by the new role attribute?

- similar to <speaking_style>, proposed yesterday
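
How a synthesizer might consume the proposed attribute can be sketched as follows. The `pron-style` attribute is the extension proposed on this slide and is not part of PLS 1.0; the `pick_pronunciation` helper, the fallback to the first listed pronunciation, and the placeholder transcriptions are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# "pron-style" is the proposed extension, not a PLS 1.0 attribute.
# The transcription strings are placeholders.
LEXEME_XML = """
<lexeme>
  <grapheme>borilec</grapheme>
  <phoneme pron-style="fluent">PLACEHOLDER-iUts</phoneme>
  <phoneme pron-style="overarticulated">PLACEHOLDER-ilts</phoneme>
</lexeme>
""".strip()


def pick_pronunciation(lexeme: ET.Element, style: str) -> str:
    """Return the pronunciation tagged with `style`, else the first one listed."""
    phonemes = lexeme.findall("phoneme")
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    return phonemes[0].text  # fallback when no variant carries the requested style


lexeme = ET.fromstring(LEXEME_XML)
```

The requested style would come from the enclosing SSML element (<voice>, <speak>, <p> or <s>), which is the SSML-lexicon linkage the slide asks for.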

Page 13: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Extensions…

dialects:

– user-friendly apps require dialect/sociolect pronunciation variations

– another optional attribute for the following elements:

<phoneme>, <voice>, <speak>, <p>, <s>

- rfc3066-like identifiers may be used to indicate dialects
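
One possible way to resolve dialect identifiers is to fall back from the most specific tag to the bare language tag. The sketch below assumes hyphen-separated tags; the dialect subtags ("dialect1", "dialect2") and the pronunciation strings are hypothetical and are not registered identifiers.

```python
from typing import Dict, Optional


def lookup_by_dialect(variants: Dict[str, str], requested: str) -> Optional[str]:
    """Try the requested tag, then progressively shorter prefixes of it."""
    tag = requested.lower()
    while tag:
        if tag in variants:
            return variants[tag]
        tag = tag.rpartition("-")[0]  # drop the last subtag; "" ends the loop
    return None


# Hypothetical per-dialect pronunciations of a single word.
variants = {"sl": "STANDARD_PRON", "sl-dialect1": "DIALECT1_PRON"}
```

A request for `sl-dialect2` falls back to the standard `sl` entry, while a request for an unrelated language returns `None`.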

Page 14: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Extensions…

PLS…. source/creator:

– only the <metadata> element

– source of multiple pronunciations:

• useful info when merging multiple PLS documents

• some sources/creators may be more reliable than others…

- additional optional attribute pron-source for the <phoneme> element
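
A sketch of why per-pronunciation source information helps when merging lexicons: variants from more reliable sources can be listed first, and duplicates keep the label of the most reliable source. The `merge_lexicons` function, the source names, and the reliability ranks are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_lexicons(lexicons: Dict[str, Dict[str, List[str]]],
                   reliability: Dict[str, int]) -> Dict[str, List[Tuple[str, str]]]:
    """Merge {source: {word: [prons]}} into {word: [(pron, source), ...]}.

    Sources with a higher reliability rank are processed first, so their
    pronunciations come first and win when the same pronunciation recurs.
    """
    merged: Dict[str, List[Tuple[str, str]]] = {}
    ordered = sorted(lexicons, key=lambda s: reliability.get(s, 0), reverse=True)
    for source in ordered:
        for word, prons in lexicons[source].items():
            entries = merged.setdefault(word, [])
            for pron in prons:
                if all(pron != existing for existing, _ in entries):
                    entries.append((pron, source))
    return merged
```

The source label stored with each pronunciation corresponds to the value the proposed pron-source attribute would carry.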

Page 15: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Extensions…

part-of-speech tags:

– Slovenian – complex inflectional paradigm

– morphological, syntactic and semantic(?) descriptors welcome in future revisions of the PLS specification

– SSML: POS tags could be defined as an optional attribute of the <token> element

lemma, MSD attributes used in SI-PRON

MULTEXT-East MSDs (Erjavec, 2004) – Telri, Concede

Multext-East LRs, http://nl.ijs.si/ME/V3

EAGLES, TEI P4 compliant
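
MULTEXT-East MSDs are compact positional strings in which each character encodes one feature. The decoder below is a simplified sketch for noun MSDs only, assuming the position order category, type, gender, number, case; the tables are abbreviated and the MULTEXT-East specification remains the authority on positions and values.

```python
from typing import Dict

# Simplified position tables for noun MSDs (assumed order:
# category, type, gender, number, case). Not a complete tagset.
NOUN_POSITIONS = [
    ("category", {"N": "noun"}),
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> Dict[str, str]:
    """Expand a noun MSD such as 'Ncmdn' into named features."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {}
    for char, (name, values) in zip(msd, NOUN_POSITIONS):
        if char != "-":  # "-" marks a feature that does not apply
            features[name] = values.get(char, char)
    return features
```

Under these assumptions, `decode_noun_msd("Ncmdn")` describes a common masculine noun in the dual nominative, the grammatical number mentioned earlier as a Slovenian specific.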

Page 16: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

Page 17: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

Page 18: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

Page 19: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

MSDs

TTS-internal lexicon (for highly inflected languages)

– full-blown form (PLS or other)

– compact lexicons:

– exception lexicon

– derivational scheme/paradigm for providing prefix/suffix morphological rules, indications of lexical stress position shifts (hardly an issue of PLS)

Page 20: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Conclusion

SI-PRON pronunciation lexicon for Slovenian

proposed extensions to PLS, SSML

– pron-style attribute

– emotion attribute

– annotating dialects/sociolects

– source/creator attribute

– morpho-syntactic, semantic descriptors

Page 21: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

Alpineon

ZRC-SAZU • Fran Ramovš Institute of the Slovenian Language

Project Partners

L6-5405 project

– Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources

– Spoken representation of Slovenian words:

• http://bos.zrc-sazu.si/sskj.html

Page 22: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

PLS

THANK YOU FOR YOUR ATTENTION!