LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic...

22
LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle- Density Languages Lynette Melnar & Chen Liu Application and Software Research Center, Motorola Labs Motorola, Schaumburg, IL 60196, USA Lynette.Melnar, [email protected]

Transcript of LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic...

Page 1: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and

Middle-Density Languages

Lynette Melnar & Chen Liu

Application and Software Research Center, Motorola Labs

Motorola, Schaumburg, IL 60196, USA

Lynette.Melnar, [email protected]

Page 2: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

The onlyway to go is by the lake on the bay

xxxxx

xxxxx

SsssssssSsssssssSsssssssSsssssssSssssssss

SsssssssSsssssssSsssssssSsssssssSssssssss

Dravidian?

Niger-Congo? Austro-Asiatic?

LLanguageanguage R Resourcesesources

Current language coverage:

Broader, faster language coverage critical!

Afro-Asiatic: Semitic

IE: GermanicIE: ItalicIE: Slavic

7

6

2

IE: Indic1

1

IE: Iranian?

Altaic: Korean

Sinitic: Mandarin

1

1

Altaic: Turkic

Altaic: Japonic1

1

Sinitic: CantoneseSinitic: Wu

1

1

Sinitic: Min?

Sinitic: Hakka?

Sinitic: Xiang?IE: Caribbean Spanish?

IE: Venezuelan Spanish?

IE: Rioplatense Spanish?

Page 3: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

• High-density languages:

– Majority languages associated with large, economically advantaged speaker populations

• significant proportions of which regularly use computers

– Bear official status or non-official predominant use in one or more countries

– Recognized as important by foreign governments

– Supported by writing traditions and have been studied and well documented in various types of language resources

• In particular, a high-density language is here associated with several commercially available speech resources of various types (and quality)

– Examples of such high-density languages are major dialects of Arabic, English, Mandarin, and Spanish

LLanguageanguage D Densityensity

Page 4: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

• Low-density languages:

– Speaker populations may be very small and economically disadvantaged

• and have little or no computer experience

– Have minority status in the countries where spoken

– Not judged to be very significant by foreign governments

– Not supported by writing systems or only supported by non-standardized writing systems and not well studied or documented in language resources

• Middle-density languages:

– A balance of those extremes exhibited by high- and low-density languages

• Many Chinese, Indian, and African languages in the emerging market

LLanguageanguage D Densityensity

Page 5: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

• Goal - extend the value of existing language resources by developing a tool that uses in-house source language data to automatically predict target-language ASR performance.

– Speech DBs are expensive (time/money).

– Use source DBs to quickly create (monophone) acoustic models for target languages lacking speech data.

• Approach – base the approach on a series of linguistic metrics characterizing the articulatory phonetic and phonological information of phonemes from both the target and source languages.

• Benefit - objectively measure phonological similarity among languages to:

– Help direct future multilingual exploration

– Guide a cost-effective DB acquisition strategy

EExtendxtend E Existingxisting L Languageanguage R Resourcesesources

Page 6: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

A combined phonetic-phonological crosslingual distance metric (CPP-CD) automatically selects phoneme models from existing source languages as surrogate models for a target language lacking existing speech data.

Lexica from all

languages

Featurematrix

Feature weights

Phoneticdistance

(X 2)

Monophoneme distribution distance

Biphoneme distribution distance

CPP-CD

Phonemes from all languages

CCrosslingual rosslingual PPhonemehoneme D Distanceistance (cf. Liu & Melnar Interspeech 2005)

Page 7: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

The pre-existing acoustic models for the top two least distant source-language phonemes for each target-language phoneme are selected for the crosslingual target-language model set.

Target Language: Latin American SpanishTarget phoneme: /i/

Source language phonemes:

Italian /i/ : CPP-CD = 0.445999111424272 {PhD(2)= 0} + MD(1)= 0.0931453475842772 + BD(1)= 0.352853763839995

Brazilian Portuguese /i/ : CPP-CD = 0.970328191371562 {PhD(2)= 0} + MD(1)= 0.420797537723557 + BD(1)= 0.549530653648005

CPP-CD CPP-CD (cf. Liu & Melnar Interspeech 2005)

Page 8: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

1.Embed the CPP-CD identified HMMs corresponding to the target language into the model training of the donating source languages;

2.Attach all donor language phoneme labels that have been selected as target-language surrogates with a target-language ID tag (LT)

− E.g., Donor language 1 phoneme inventory : a_LT, e_LT, u_LT, b_LT,…

3.Attach all donor language phoneme labels that have NOT been selected as target-language surrogates with the donor-language ID tag (L1, L2, L3, …)

− E.g., Donor language 1 phoneme inventory : i_L1, o_L1, d_L1, f_L1,…

4.Build a mega phoneme inventory, lexicon, and phoneme transcription set. The ID-tagged mega phoneme inventory includes all the phonemes from the donor languages and is used to retranscribe all the pronunciation lexica and phoneme transcriptions

MModelodel T Training raining (cf. Liu & Melnar ISCA MULTILING 2006)

Page 9: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

MModelodel T Training raining (cf. Liu & Melnar ISCA MULTILING 2006)

…dtarbu...vasr

Megainventory

…d_L1

t_L1

a_LTr_L1

b_LTu_L1

...v_LMs_LMr_LM…

Candidate

Candidate

Candidate

P1

PM

P1, …, PM : individual donor inventories

L1, …, LM , LT : language ID

…dtarbu...vasr

Megainventory

…d_L1

t_L1

a_LTr_L1

b_LTu_L1

...v_LMs_LMr_LM…

Candidate

Candidate

Candidate

P1

PM

P1, …, PM : individual donor inventories

L1, …, LM , LT : language ID

Page 10: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

To predict the performance of the crosslingual model set:

1.Assign each target phoneme an importance weight− Based on the phoneme's lexical frequency

2.Measure the contribution and interference effect of all the donor phonemes to each target phoneme

– For each target phoneme, define a matching set that consists of the two least distant source-language donor phonemes

– For each target phoneme, define a confusing set that consists of all the other donor-language phonemes

CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP) (CPP-CP)

Page 11: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP (CPP-CP)

Target Phoneme : Lexical importance weightMatching set : Lesser CPP-CD distance is betterConfusing set : Greater CPP-CD distance is better

The combined contribution and interference effect of donor phonemes to all the target phonemes is derived from the same component distance measures used in the derivation of CPP-CD (phonetic, monophoneme, biphoneme).

Page 12: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP (CPP-CP)

(For targetPhoneme #N)

(For targetPhoneme #2)

(For targetPhoneme #1)

CPP-CD

CPP-CP

Matchingset

Donorphonemes

Contrastingset

Contributioneffect

Interferenceeffect

Discriminativecontribution

effect

Discriminativecontribution

effect

Discriminativecontribution

effect

.

.

.

. . .

. . .

.

.

.

Page 13: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP (CPP-CP)

• Because the prediction score is based on distance, the smaller the CPP-CP score, the higher the predicted performance of the crosslingual models

• For a set of languages, we evaluate CPP-CP scores relative to recognition phoneme error rates:– recognition results with the crosslingual models on the

native speech data

Where the prediction scores match the general trend of the recognition phoneme error rates, we consider the relative prediction scores reliable

Page 14: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

• Inconsistency in data quality and task complexity across languages

• Sub-optimal native model quality for some languages

• Non-unary mapping between the acoustic and non-acoustic phonological domains

• Biphoneme inventory size

• Target-source language proximity

CPP-CP: KCPP-CP: Knownnown P Performanceerformance F Factorsactors

Page 15: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

1.Compare CPP-CD approach with an acoustic distance approach (Bhattacharyya metric) and native monolingual modeling approach in ASR experiments

2.Test five target languages:

− Latin American Spanish, Italian, Japanese, Danish, and European Portuguese (all high-density - speech data for testing required)

3.Use twenty source languages from six major language groups:

– (i) Afro-Asiatic; (ii) Altaic; (iii) IE Germanic; (iv) IE Italic; (v) IE Slavic; and (vi) Sinitic

• For each crosslingual experiment, the target language is left out of the source language pool for model selection

CPP-CD: RCPP-CD: Recognitionecognition E Experimentxperiment

Page 16: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

Model Performance Comparison (word accuracy %)

CPP-CD: RCPP-CD: Recognitionecognition R Resultsesults

Target Language

Native Baseline

Genetic Relation

Biphoneme Inv.

Acoustic Distance

CPP-CD

Italian 98.42 Italic (5) 613 98.27 98.52

Spanish 94.49 Italic (5) 520 88.61 93.06

Japanese 95.36Japonic

(0)643 76.72 78.76

Portuguese

96.31 Italic (5) 776 77.91 72.74

Danish 94.36 Germ. (5) 980 72.95 70.15

Overall the CPP-CD and Bhattacharyya acoustic approaches perform similarly. The average recognition result using the Bhattacharyya-derived models is 82.89% while the average CPP-CD result is 82.65%.

Page 17: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

CPP-CP CCPP-CP Confirmationonfirmation • The trend of the CPP-CP scores matches that of the phoneme error rates

• Results suggest that CPP-CP scores are indicative of actual recognition performance

Results suggest that target-language crosslingual model sets that have a CPP-CP score of less than 1 are likely to achieve acceptable recognition performance levels

– Spanish, Italian, and Japanese CPP-CP models average a word accuracy of 90.11%

Comparison of phoneme error rates and CPP-CP scores (1)

Prediction score is based on linguistic distance and corresponds to phoneme error

rate

40

45

50

55

60

65

70

Spanish Italian Japanese Portuguese Danish

Ph

Err

or

(%)

0.5

0.7

0.9

1.1

1.3

1.5

1.7

1.9

Pre

dic

tio

n S

co

re

Ph Error (%) Prediction ScoreLinear (Ph Error (%)) Linear (Prediction Score)

Practical use threshold

Page 18: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

1.Test with lower-density target languages: Brazilian Portuguese, Cantonese, and Shanghainese

• Brazilian Portuguese, Cantonese, and Shanghainese may be considered middle-density languages in the sense that they are relatively underrepresented in terms of available speech resources.

− We consulted the speech resource catalogues of the European Language Resources Association (ELRA), Linguistic Data Consortium (LDC), and Appen Ltd. to informally evaluate speech database availability. For both Shanghainese and Cantonese only one speech database is available, while Brazilian Portuguese is associated with three publicly available speech databases.

2.Use same twenty source languages selected for the first experiment and likewise compare their crosslingual phoneme error rates and CPP-CP scores

CPP-CP: ECPP-CP: Experimentxperiment

Page 19: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

CPP-CP: ECPP-CP: Experimentxperiment

Target Language

Biphoneme Inventory

Related Languages

Average 665.75 4.125

Cantonese 320 2

Shanghainese 428 2

Spanish 520 5

Italian 613 5

Japanese 643 0

Eur. Portuguese 776 5

Danish 980 5

Br. Portuguese 1046 5

Comparison of performance factors among the eight test languages

520 5613 5

Spanish

Italian

Page 20: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

• The trends of the CPP-CP scores and phone error percentages correspond

• Among the recently added languages, only Cantonese has a prediction score below 1

– Crosslingual model set is expected to achieve a practical use recognition level

• Small Cantonese biphoneme inventory size contributes to low prediction score

CPP-CP: ECPP-CP: Experimentxperiment

Comparison of phoneme error rates and CPP-CP scores (2)

40

45

50

55

60

65

70

Spa

nish

Ital

ian

Can

tone

se

Japa

nese

Br.

Por

tugu

ese

Sha

ngha

ines

e

Eur

.P

ortu

gues

e

Dan

ish

Ph

Err

or

(%)

0.5

0.7

0.9

1.1

1.3

1.5

1.7

1.9

Pre

dic

tio

n S

core

Ph Error (%) Prediction Score

Linear (Ph Error (%)) Linear (Prediction Score)

Page 21: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

Completely automatic approach for predicting crosslingual phoneme distance and speech recognition performance

Approach requires:

− Source-language speech data and pronunciation lexica

− Target-language pronunciation lexicon

− NO target-language speech data

− NO acoustic measurements

Approach is especially useful for creating and validating crosslingual model sets for low- and middle-density languages

Tool can be of great assistance in entering an emerging market quickly and cost-effectively

CConclusiononclusion

Page 22: LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages.

LREC 2008May 28-30, Marrakech, Morocco

شكرا كلش هذا(Voilà, merci!)