LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic...
-
Upload
julianna-hancock -
Category
Documents
-
view
215 -
download
1
Transcript of LREC 2008 May 28-30, Marrakech, Morocco Borrowing Language Resources for Development of Automatic...
LREC 2008May 28-30, Marrakech, Morocco
Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and
Middle-Density Languages
Lynette Melnar & Chen Liu
Application and Software Research Center, Motorola Labs
Motorola, Schaumburg, IL 60196, USA
Lynette.Melnar, [email protected]
LREC 2008May 28-30, Marrakech, Morocco
The onlyway to go is by the lake on the bay
xxxxx
xxxxx
SsssssssSsssssssSsssssssSsssssssSssssssss
SsssssssSsssssssSsssssssSsssssssSssssssss
Dravidian?
Niger-Congo? Austro-Asiatic?
LLanguageanguage R Resourcesesources
Current language coverage:
Broader, faster language coverage critical!
Afro-Asiatic: Semitic
IE: GermanicIE: ItalicIE: Slavic
7
6
2
IE: Indic1
1
IE: Iranian?
Altaic: Korean
Sinitic: Mandarin
1
1
Altaic: Turkic
Altaic: Japonic1
1
Sinitic: CantoneseSinitic: Wu
1
1
Sinitic: Min?
Sinitic: Hakka?
Sinitic: Xiang?IE: Caribbean Spanish?
IE: Venezuelan Spanish?
IE: Rioplatense Spanish?
LREC 2008May 28-30, Marrakech, Morocco
• High-density languages:
– Majority languages associated with large, economically advantaged speaker populations
• significant proportions of which regularly use computers
– Bear official status or non-official predominant use in one or more countries
– Recognized as important by foreign governments
– Supported by writing traditions and have been studied and well documented in various types of language resources
• In particular, a high-density language is here associated with several commercially available speech resources of various types (and quality)
– Examples of such high-density languages are major dialects of Arabic, English, Mandarin, and Spanish
LLanguageanguage D Densityensity
LREC 2008May 28-30, Marrakech, Morocco
• Low-density languages:
– Speaker populations may be very small and economically disadvantaged
• and have little or no computer experience
– Have minority status in the countries where spoken
– Not judged to be very significant by foreign governments
– Not supported by writing systems or only supported by non-standardized writing systems and not well studied or documented in language resources
• Middle-density languages:
– A balance of those extremes exhibited by high- and low-density languages
• Many Chinese, Indian, and African languages in the emerging market
LLanguageanguage D Densityensity
LREC 2008May 28-30, Marrakech, Morocco
• Goal - extend the value of existing language resources by developing a tool that uses in-house source language data to automatically predict target-language ASR performance.
– Speech DBs are expensive (time/money).
– Use source DBs to quickly create (monophone) acoustic models for target languages lacking speech data.
• Approach – base the approach on a series of linguistic metrics characterizing the articulatory phonetic and phonological information of phonemes from both the target and source languages.
• Benefit - objectively measure phonological similarity among languages to:
– Help direct future multilingual exploration
– Guide a cost-effective DB acquisition strategy
EExtendxtend E Existingxisting L Languageanguage R Resourcesesources
LREC 2008May 28-30, Marrakech, Morocco
A combined phonetic-phonological crosslingual distance metric (CPP-CD) automatically selects phoneme models from existing source languages as surrogate models for a target language lacking existing speech data.
Lexica from all
languages
Featurematrix
Feature weights
Phoneticdistance
(X 2)
Monophoneme distribution distance
Biphoneme distribution distance
CPP-CD
Phonemes from all languages
CCrosslingual rosslingual PPhonemehoneme D Distanceistance (cf. Liu & Melnar Interspeech 2005)
LREC 2008May 28-30, Marrakech, Morocco
The pre-existing acoustic models for the top two least distant source-language phonemes for each target-language phoneme are selected for the crosslingual target-language model set.
Target Language: Latin American SpanishTarget phoneme: /i/
Source language phonemes:
Italian /i/ : CPP-CD = 0.445999111424272 {PhD(2)= 0} + MD(1)= 0.0931453475842772 + BD(1)= 0.352853763839995
Brazilian Portuguese /i/ : CPP-CD = 0.970328191371562 {PhD(2)= 0} + MD(1)= 0.420797537723557 + BD(1)= 0.549530653648005
…
CPP-CD CPP-CD (cf. Liu & Melnar Interspeech 2005)
LREC 2008May 28-30, Marrakech, Morocco
1.Embed the CPP-CD identified HMMs corresponding to the target language into the model training of the donating source languages;
2.Attach all donor language phoneme labels that have been selected as target-language surrogates with a target-language ID tag (LT)
− E.g., Donor language 1 phoneme inventory : a_LT, e_LT, u_LT, b_LT,…
3.Attach all donor language phoneme labels that have NOT been selected as target-language surrogates with the donor-language ID tag (L1, L2, L3, …)
− E.g., Donor language 1 phoneme inventory : i_L1, o_L1, d_L1, f_L1,…
4.Build a mega phoneme inventory, lexicon, and phoneme transcription set. The ID-tagged mega phoneme inventory includes all the phonemes from the donor languages and is used to retranscribe all the pronunciation lexica and phoneme transcriptions
MModelodel T Training raining (cf. Liu & Melnar ISCA MULTILING 2006)
LREC 2008May 28-30, Marrakech, Morocco
MModelodel T Training raining (cf. Liu & Melnar ISCA MULTILING 2006)
…dtarbu...vasr
…
Megainventory
…d_L1
t_L1
a_LTr_L1
b_LTu_L1
...v_LMs_LMr_LM…
Candidate
Candidate
Candidate
P1
PM
…
P1, …, PM : individual donor inventories
L1, …, LM , LT : language ID
…dtarbu...vasr
…
Megainventory
…d_L1
t_L1
a_LTr_L1
b_LTu_L1
...v_LMs_LMr_LM…
Candidate
Candidate
Candidate
P1
PM
…
P1, …, PM : individual donor inventories
L1, …, LM , LT : language ID
LREC 2008May 28-30, Marrakech, Morocco
To predict the performance of the crosslingual model set:
1.Assign each target phoneme an importance weight− Based on the phoneme's lexical frequency
2.Measure the contribution and interference effect of all the donor phonemes to each target phoneme
– For each target phoneme, define a matching set that consists of the two least distant source-language donor phonemes
– For each target phoneme, define a confusing set that consists of all the other donor-language phonemes
CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP) (CPP-CP)
LREC 2008May 28-30, Marrakech, Morocco
CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP (CPP-CP)
Target Phoneme : Lexical importance weightMatching set : Lesser CPP-CD distance is betterConfusing set : Greater CPP-CD distance is better
The combined contribution and interference effect of donor phonemes to all the target phonemes is derived from the same component distance measures used in the derivation of CPP-CD (phonetic, monophoneme, biphoneme).
LREC 2008May 28-30, Marrakech, Morocco
CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP (CPP-CP)
(For targetPhoneme #N)
(For targetPhoneme #2)
(For targetPhoneme #1)
CPP-CD
CPP-CP
Matchingset
Donorphonemes
Contrastingset
Contributioneffect
Interferenceeffect
Discriminativecontribution
effect
Discriminativecontribution
effect
Discriminativecontribution
effect
.
.
.
. . .
. . .
.
.
.
LREC 2008May 28-30, Marrakech, Morocco
CPP CCPP Crosslingualrosslingual P Predictionrediction (CPP-CP (CPP-CP)
• Because the prediction score is based on distance, the smaller the CPP-CP score, the higher the predicted performance of the crosslingual models
• For a set of languages, we evaluate CPP-CP scores relative to recognition phoneme error rates:– recognition results with the crosslingual models on the
native speech data
Where the prediction scores match the general trend of the recognition phoneme error rates, we consider the relative prediction scores reliable
LREC 2008May 28-30, Marrakech, Morocco
• Inconsistency in data quality and task complexity across languages
• Sub-optimal native model quality for some languages
• Non-unary mapping between the acoustic and non-acoustic phonological domains
• Biphoneme inventory size
• Target-source language proximity
CPP-CP: KCPP-CP: Knownnown P Performanceerformance F Factorsactors
LREC 2008May 28-30, Marrakech, Morocco
1.Compare CPP-CD approach with an acoustic distance approach (Bhattacharyya metric) and native monolingual modeling approach in ASR experiments
2.Test five target languages:
− Latin American Spanish, Italian, Japanese, Danish, and European Portuguese (all high-density - speech data for testing required)
3.Use twenty source languages from six major language groups:
– (i) Afro-Asiatic; (ii) Altaic; (iii) IE Germanic; (iv) IE Italic; (v) IE Slavic; and (vi) Sinitic
• For each crosslingual experiment, the target language is left out of the source language pool for model selection
CPP-CD: RCPP-CD: Recognitionecognition E Experimentxperiment
LREC 2008May 28-30, Marrakech, Morocco
Model Performance Comparison (word accuracy %)
CPP-CD: RCPP-CD: Recognitionecognition R Resultsesults
Target Language
Native Baseline
Genetic Relation
Biphoneme Inv.
Acoustic Distance
CPP-CD
Italian 98.42 Italic (5) 613 98.27 98.52
Spanish 94.49 Italic (5) 520 88.61 93.06
Japanese 95.36Japonic
(0)643 76.72 78.76
Portuguese
96.31 Italic (5) 776 77.91 72.74
Danish 94.36 Germ. (5) 980 72.95 70.15
Overall the CPP-CD and Bhattacharyya acoustic approaches perform similarly. The average recognition result using the Bhattacharyya-derived models is 82.89% while the average CPP-CD result is 82.65%.
LREC 2008May 28-30, Marrakech, Morocco
CPP-CP CCPP-CP Confirmationonfirmation • The trend of the CPP-CP scores matches that of the phoneme error rates
• Results suggest that CPP-CP scores are indicative of actual recognition performance
Results suggest that target-language crosslingual model sets that have a CPP-CP score of less than 1 are likely to achieve acceptable recognition performance levels
– Spanish, Italian, and Japanese CPP-CP models average a word accuracy of 90.11%
Comparison of phoneme error rates and CPP-CP scores (1)
Prediction score is based on linguistic distance and corresponds to phoneme error
rate
40
45
50
55
60
65
70
Spanish Italian Japanese Portuguese Danish
Ph
Err
or
(%)
0.5
0.7
0.9
1.1
1.3
1.5
1.7
1.9
Pre
dic
tio
n S
co
re
Ph Error (%) Prediction ScoreLinear (Ph Error (%)) Linear (Prediction Score)
Practical use threshold
LREC 2008May 28-30, Marrakech, Morocco
1.Test with lower-density target languages: Brazilian Portuguese, Cantonese, and Shanghainese
• Brazilian Portuguese, Cantonese, and Shanghainese may be considered middle-density languages in the sense that they are relatively underrepresented in terms of available speech resources.
− We consulted the speech resource catalogues of the European Language Resources Association (ELRA), Linguistic Data Consortium (LDC), and Appen Ltd. to informally evaluate speech database availability. For both Shanghainese and Cantonese only one speech database is available, while Brazilian Portuguese is associated with three publicly available speech databases.
2.Use same twenty source languages selected for the first experiment and likewise compare their crosslingual phoneme error rates and CPP-CP scores
CPP-CP: ECPP-CP: Experimentxperiment
LREC 2008May 28-30, Marrakech, Morocco
CPP-CP: ECPP-CP: Experimentxperiment
Target Language
Biphoneme Inventory
Related Languages
Average 665.75 4.125
Cantonese 320 2
Shanghainese 428 2
Spanish 520 5
Italian 613 5
Japanese 643 0
Eur. Portuguese 776 5
Danish 980 5
Br. Portuguese 1046 5
Comparison of performance factors among the eight test languages
520 5613 5
Spanish
Italian
LREC 2008May 28-30, Marrakech, Morocco
• The trends of the CPP-CP scores and phone error percentages correspond
• Among the recently added languages, only Cantonese has a prediction score below 1
– Crosslingual model set is expected to achieve a practical use recognition level
• Small Cantonese biphoneme inventory size contributes to low prediction score
CPP-CP: ECPP-CP: Experimentxperiment
Comparison of phoneme error rates and CPP-CP scores (2)
40
45
50
55
60
65
70
Spa
nish
Ital
ian
Can
tone
se
Japa
nese
Br.
Por
tugu
ese
Sha
ngha
ines
e
Eur
.P
ortu
gues
e
Dan
ish
Ph
Err
or
(%)
0.5
0.7
0.9
1.1
1.3
1.5
1.7
1.9
Pre
dic
tio
n S
core
Ph Error (%) Prediction Score
Linear (Ph Error (%)) Linear (Prediction Score)
LREC 2008May 28-30, Marrakech, Morocco
Completely automatic approach for predicting crosslingual phoneme distance and speech recognition performance
Approach requires:
− Source-language speech data and pronunciation lexica
− Target-language pronunciation lexicon
− NO target-language speech data
− NO acoustic measurements
Approach is especially useful for creating and validating crosslingual model sets for low- and middle-density languages
Tool can be of great assistance in entering an emerging market quickly and cost-effectively
CConclusiononclusion
LREC 2008May 28-30, Marrakech, Morocco
شكرا كلش هذا(Voilà, merci!)