Candeias sti lg2p_vfinal

© 2005, IT - Instituto de Telecomunicações. All rights reserved. Arlindo Veiga (1,2), Sara Candeias (1), Fernando Perdigão (1,2). (1) Instituto de Telecomunicações, Polo de Coimbra, Portugal; (2) Universidade de Coimbra, DEEC, Portugal. STIL 2011, 8th Symposium in Information and Human Language Technology, Oct. 24-26, 2011, Cuiabá, Brazil. GENERATING A PRONUNCIATION DICTIONARY FOR EUROPEAN PORTUGUESE USING A JOINT-SEQUENCE MODEL WITH EMBEDDED STRESS ASSIGNMENT

Transcript of Candeias sti lg2p_vfinal

Page 1: Candeias sti lg2p_vfinal

© 2005, IT - Instituto de Telecomunicações. All rights reserved.

Arlindo Veiga (1,2)

Sara Candeias (1)

Fernando Perdigão (1,2)

(1) Instituto de Telecomunicações, Polo de Coimbra, Portugal

(2) Universidade de Coimbra, DEEC, Portugal

STIL 2011, 8th Symposium in Information and Human Language Technology

Oct. 24-26, 2011, Cuiabá, Brazil

GENERATING A PRONUNCIATION DICTIONARY

FOR EUROPEAN PORTUGUESE

USING A JOINT-SEQUENCE MODEL

WITH EMBEDDED STRESS ASSIGNMENT

Page 2: Candeias sti lg2p_vfinal


SUMMARY

• Goal

• Problem Statement

• G2P System

• Joint-Sequence Model

• Stressed Vowel Assignment

• Results

• Conclusions

STIL 2011 - Cuiabá, Brazil - Oct.24-26 2011

Page 3: Candeias sti lg2p_vfinal

GOAL

• To generate a pronunciation dictionary for European Portuguese (EP)

• To develop a grapheme-to-phoneme (G2P) system for EP

Page 4: Candeias sti lg2p_vfinal

PROBLEM STATEMENT

How? By implementing an automatic system for G2P conversion.

What approaches?

• linguistic rules

• Portuguese orthography is roughly phonologically based, so rules provide good coverage of the grapheme-phoneme association

• No natural human language fully satisfies this assumption: the association between graphemes and phonemes is not quite one-to-one, so a list of exceptions is needed

• Very complex, hard and tiresome

Page 5: Candeias sti lg2p_vfinal

PROBLEM STATEMENT

How? By implementing an automatic system for G2P conversion.

What approaches?

• linguistic rules

• statistics

• Using pronunciation examples, it may be possible to predict the pronunciation of unseen words by analogy

• But analogy alone is not smart enough:

• vaga -> v "a g 6 vs. vagarosa -> v 6 g 6 r "O z 6

• linguistic rules

Page 6: Candeias sti lg2p_vfinal

PROBLEM STATEMENT

How? By implementing an automatic system for G2P conversion.

What approaches?

• linguistic rules

• statistics

• MIXED

Page 7: Candeias sti lg2p_vfinal

G2P SYSTEM

System based on a mixed approach founded on:

• a statistical model: joint-sequence model

• rules for stressed vowel assignment

Alignment between graphemes and phonemes: “one-to-one”

Page 8: Candeias sti lg2p_vfinal

JOINT-SEQUENCE MODEL

Alignment between graphemes and phonemes: “one-to-one”

< B r a s i l >

/ b r 6 z i l /

Page 9: Candeias sti lg2p_vfinal

JOINT-SEQUENCE MODEL

Alignment between graphemes and phonemes: “one-to-one”

< B r a s i l >

/ b r 6 z i l /

The grapheme and phoneme counts do not always match:

< c h a m o u >

/ S 6 m o /

< t ê m >

/ t 6~ i~ 6~ i~ /

Page 10: Candeias sti lg2p_vfinal

JOINT-SEQUENCE MODEL

Alignment between graphemes and phonemes: “one-to-one”

< c h a m o u >

/ S 6 m o /

< t ê m >

/ t 6~ i~ 6~ i~ /

Page 11: Candeias sti lg2p_vfinal

JOINT-SEQUENCE MODEL

• Implementing the Levenshtein algorithm (“1-01”)

• Defining alternative symbols

• Graphemes: DIGRAPHS

< c h a m o u >

< S a m º >

/ S 6 m o /
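The two steps on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the digraph table contains only the two substitutions visible in the <chamou> example, "1-01" is read here as "each grapheme aligns to zero or one phoneme", and the function names and cost values (0 for identical symbols, 0.5 for a substitution, 1 for a grapheme left without a phoneme) are hypothetical.

```python
# Hypothetical subset of the digraph table, taken from the <chamou> example.
DIGRAPHS = {"ch": "S", "ou": "º"}


def replace_digraphs(word):
    """Rewrite known digraphs as single alternative symbols."""
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def align_1_01(graphemes, phonemes):
    """Levenshtein-style alignment: each grapheme gets one phoneme or None."""
    n, m = len(graphemes), len(phonemes)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i graphemes
    # with the first j phonemes.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            # Option 1: grapheme i-1 has no phoneme (assumed cost 1).
            c = cost[i - 1][j] + 1.0
            if c < cost[i][j]:
                cost[i][j], back[i][j] = c, "skip"
            # Option 2: grapheme i-1 pairs with phoneme j-1.
            if j > 0:
                same = graphemes[i - 1] == phonemes[j - 1]
                c = cost[i - 1][j - 1] + (0.0 if same else 0.5)
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "pair"
    if cost[n][m] == inf:
        raise ValueError("more phonemes than graphemes: no 1-01 alignment")
    # Trace back from the end to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0:
        if back[i][j] == "skip":
            pairs.append((graphemes[i - 1], None))
            i -= 1
        else:
            pairs.append((graphemes[i - 1], phonemes[j - 1]))
            i, j = i - 1, j - 1
    return pairs[::-1]
```

With the digraphs replaced, <chamou> becomes four symbols and aligns one-to-one with / S 6 m o /. A word with a silent grapheme would instead receive one `None` entry.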

Page 12: Candeias sti lg2p_vfinal

JOINT-SEQUENCE MODEL

• Implementing the Levenshtein algorithm (“1-01”)

• Defining alternative symbols

• Graphemes: DIGRAPHS

• Phonemes: SAMPA to UniChar

< c h a m o u >

< S a m º >

/ S 6 m o /

< t ê m >

/ t 6~ i~ 6~ i~ /

/ t i i /

/ t Æ i /

Page 13: Candeias sti lg2p_vfinal

JOINT-SEQUENCE MODEL

• Implementing the Levenshtein algorithm (“1-01”)

• Defining alternative symbols

• Graphemes: DIGRAPHS

• Phonemes: SAMPA to UniChar

< c h a m o u >

< S a m º >

/ S 6 m o /

< t ê m >

/ t Æ i /


Page 15: Candeias sti lg2p_vfinal

JOINT-SEQUENCE MODEL

• Implementing the Levenshtein algorithm (“1-01”)

• Defining alternative symbols

• Graphemes: DIGRAPHS

• Phonemes: SAMPA to UniChar

< c h a m o u >

< S a m º >

/ S 6 m o /

< t ê m >

/ t Æ i /

Graphonemes (joint grapheme-phoneme units)

GOAL: to compute the most probable pronunciation of a word given the word's graphoneme form

TECHNIQUE: using n-grams
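As a rough illustration of the n-gram idea, the sketch below trains a bigram model over graphonemes and picks the most probable phoneme sequence with a Viterbi search. It is a toy under stated assumptions, not the system presented here: it fixes n = 2, uses add-one smoothing, assumes the training words are already aligned one-to-one, and assumes every grapheme of the input word was seen in training. All names are hypothetical.

```python
import math
from collections import defaultdict

BOS = ("<s>", "<s>")  # hypothetical start-of-word graphoneme


def train_bigram(aligned_words):
    """Count graphoneme bigrams; each word is a list of (grapheme, phoneme)."""
    bigram = defaultdict(lambda: defaultdict(int))
    candidates = defaultdict(set)  # grapheme -> phonemes seen with it
    for word in aligned_words:
        prev = BOS
        for unit in word:
            bigram[prev][unit] += 1
            candidates[unit[0]].add(unit[1])
            prev = unit
    return bigram, candidates


def log_prob(bigram, prev, unit, vocab_size):
    """Add-one smoothed log P(unit | prev); smoothing choice is an assumption."""
    following = bigram.get(prev, {})
    total = sum(following.values())
    return math.log((following.get(unit, 0) + 1) / (total + vocab_size))


def decode(graphemes, bigram, candidates):
    """Viterbi search for the most probable phoneme sequence."""
    vocab_size = sum(len(p) for p in candidates.values())
    # best[unit] = (log-probability, phoneme path) of the best
    # graphoneme sequence that ends in `unit`.
    best = {BOS: (0.0, [])}
    for g in graphemes:
        new_best = {}
        for ph in sorted(candidates.get(g, ())):
            unit = (g, ph)
            score, path = max(
                ((s + log_prob(bigram, prev, unit, vocab_size), p)
                 for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[unit] = (score, path + [ph])
        if not new_best:
            raise KeyError(f"grapheme {g!r} not seen in training")
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
```

Trained on two aligned toy words in which a word-final <a> maps to /6/, the decoder reproduces that pattern for an unseen word built from the same graphemes.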

Page 16: Candeias sti lg2p_vfinal

G2P SYSTEM

System based on a mixed approach founded on:

• a statistical model: joint-sequence model

• rules for stressed vowel assignment

• Several errors were due to incorrect stress assignment: solidamente, incansavelmente

Page 17: Candeias sti lg2p_vfinal

G2P SYSTEM

System based on a mixed approach founded on:

• a statistical model: joint-sequence model

• rules for stressed vowel assignment

Marking the stressed vowel (Vstressed) improved the statistical model by expressing graphoneme classes unequivocally

6 rules

Page 18: Candeias sti lg2p_vfinal

STRESSED VOWEL ASSIGNMENT

For adverbs ending in <mente> (<rápido> → <rapidamente>, fast → quickly):

• An algorithm divides the word into two parts, <ROOT> and <mente>.

• The <ROOT> part is processed by a specific module (a list of graphematic patterns in which the Vstressed is identified).

To generate an unambiguous graphoneme, we assigned special symbols to the Vstressed
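The split into <ROOT> and <mente> could look roughly like the sketch below. The pattern table, the function names and the use of an uppercase letter as the "special symbol" are all hypothetical stand-ins; the slides do not list the actual patterns or symbols.

```python
SUFFIX = "mente"

# Hypothetical pattern table: root -> index of its stressed vowel.
# The real system uses a list of graphematic patterns not shown in the slides.
ROOT_STRESS = {"rapida": 1, "solida": 1}


def split_mente(word):
    """Split an adverb into (ROOT, 'mente'); other words are returned whole."""
    if word.endswith(SUFFIX) and len(word) > len(SUFFIX):
        return word[:-len(SUFFIX)], SUFFIX
    return word, ""


def mark_stress(word):
    """Mark the stressed vowel of the root (uppercase is an assumed marker)."""
    root, suffix = split_mente(word)
    idx = ROOT_STRESS.get(root)
    if idx is None:
        return word  # no pattern matched: leave the word unmarked
    return root[:idx] + root[idx].upper() + root[idx + 1:] + suffix
```

Under these assumptions, `mark_stress("rapidamente")` yields `rApidamente`, so the marked vowel becomes a distinct grapheme symbol for the statistical model.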

Page 19: Candeias sti lg2p_vfinal

VOCABULARY

To estimate the graphoneme model:

• SpeechDat pronunciation dictionary

• 15k entries

• Deletion of foreign words

• Change of some transcriptions

• Standardization of the pronunciation

Applied to the CETEMPúblico vocabulary: 40k words, 40k pronunciations

Page 20: Candeias sti lg2p_vfinal

DICTIONARY

CETEMPúblico 40k pronunciations:

• Iterative procedure:

• Long manual verification

• Correction of the transcriptions

• Comparison to the pronunciations of LOQUENDO

• The majority of the transcriptions agreed.

• The transcriptions from our dictionary were the right ones most of the time.

This dictionary was used for the training and test procedure.

Page 21: Candeias sti lg2p_vfinal

EXPERIMENTS

All experiments were based on the dictionary of 40k pronunciations, in two versions:

• with stress marking

• without stress marking

To train and test the model, each of these two dictionaries was partitioned into five folds for a cross-validation procedure.

Final results were obtained by averaging the five partial results.
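The five-fold procedure can be sketched as below. This is an illustrative outline only: the fold assignment (every k-th entry) and the `evaluate` callback are assumptions, since the slides do not say how the folds were drawn.

```python
def k_fold_splits(entries, k=5):
    """Yield (train, test) pairs; each fold serves once as the test set."""
    # Assumed fold assignment: entry i goes to fold i mod k.
    folds = [entries[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test


def cross_validate(entries, evaluate, k=5):
    """Average the k partial results returned by evaluate(train, test)."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(entries, k)]
    return sum(scores) / len(scores)
```

Here `evaluate` stands for whatever trains a model on `train` and returns an error rate on `test`; running it once per dictionary (with and without stress marking) gives the two sets of averaged results.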

Page 22: Candeias sti lg2p_vfinal

RESULTS

The performance of the G2P conversion system was expressed as two average error rates: the phoneme error rate (PER) and the word error rate (WER).
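The slides do not spell out how the two rates are computed. A common definition, assumed here, takes PER as the total phoneme edit distance divided by the number of reference phonemes, and WER as the fraction of words with at least one phoneme error. A minimal sketch with hypothetical function names:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rates(references, hypotheses):
    """Return (PER, WER) for parallel lists of phoneme sequences."""
    pairs = list(zip(references, hypotheses))
    phone_errors = sum(edit_distance(r, h) for r, h in pairs)
    phone_total = sum(len(r) for r in references)
    word_errors = sum(1 for r, h in pairs if r != h)
    return phone_errors / phone_total, word_errors / len(references)
```

For instance, one wrong phoneme in a two-word test set with ten reference phonemes gives a PER of 0.1 and a WER of 0.5, which shows why WER is always the harsher measure.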

Page 23: Candeias sti lg2p_vfinal

RESULTS

The figures on this slide (not reproduced in this transcript) summarize the results obtained using n-grams with n between 2 and 8

Page 24: Candeias sti lg2p_vfinal

RESULTS

The use of n-grams with large contexts (n greater than 5) did not improve the system. In fact, there was a slight increase in the error rates, because there are too few samples to estimate large contexts.

Page 25: Candeias sti lg2p_vfinal

RESULTS

The marking of the stressed vowel contributed to a significant improvement in the system performance

Page 26: Candeias sti lg2p_vfinal

CONCLUSIONS

The joint-sequence model with embedded stress assignment gave good results.

By inspecting the test errors, we observed that most of them resulted from uncommon grapheme patterns or compound words without graphic stress marks.

The most frequent errors resulted from the pronunciation of stressed <e> and <o>, since they can be pronounced as /E/ vs. /e/ (<selo>: verb vs. noun) and /O/ vs. /o/ (<ovos> (pl.) vs. <ovo> (sing.)) without any systematic rule.

Our system is freely available at http://www.co.it.pt/~labfala/g2p/ and includes models, dictionaries and the G2P converter.

Thank you


Page 28: Candeias sti lg2p_vfinal

INTRODUCTION

Generate a Pronunciation Dictionary for EP

• Grapheme-to-Phoneme conversion (G2P)

Bom dia → b'o~ d'i6 (en. Good morning)

• Applications: component of ASR and TTS systems, e.g. in language learning, machine translation, …

• For correct pronunciation we need:

• G2P, stress assignment

• Contributions of this paper:

• Show phonological constraints (stressed vowel)

• Evaluate a mixed approach for a G2P system

• Make the dictionary (and the model and the converter) publicly available