Building A Highly Accurate Mandarin Speech Recognizer
![Page 1: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/1.jpg)
Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
12/12/2007
![Page 2: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/2.jpg)
2
Outline Goal: A highly accurate Mandarin ASR Background Acoustic segmentation Acoustic models and adaptation Language models and adaptation Cross adaptation System combination Error analysis Future
![Page 3: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/3.jpg)
3
Background 870 hours of acoustic training data. N-gram based (N=1) ML Chinese word segmentation. 60K-word lexicon. 1.2G words of training text. Trigrams and 4-grams.
n2 n3 n4 Dev07-IV Perplexity
LM3 58M 108M --- 325.7
qLM3 6M 3M --- 379.8
LM4 58M 316M 201M 297.8
qLM4 19M 24M 6M 383.2
![Page 4: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/4.jpg)
4
Acoustic segmentation Former segmenter caused high deletion errors. It
mis-classified some speech segments as noises.
Speech segment min duration 18*30=540ms=0.5s
Vocabulary Pronunciation
speech 18+ fg
Noise rej rej
silence bg bg
Start/null End/null
speech
silence
noise
![Page 5: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/5.jpg)
5
New Acoustic Segmenter Allow shorter speech duration Model Mandarin vs. Foreign (English) separately.
Vocabulary Pronunciation
Mandarin1 I1 F
Mandarin2 I2 F
Foreign forgn forgn
Noise rej rej
Silence bg bg
Start/null End/nullForeign
silence
Mandarin1 Mandarin2
noise
![Page 6: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/6.jpg)
6
Two Sets of Acoustic Models For cross adaptation and system combo
Different error behaviors Similar error rate performance
System-MLP System-PLPFeatures 74
(MFCC+3+32)42(PLP+3)
fMPE no yesPhones 72 81
![Page 7: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/7.jpg)
7
MLP Phoneme Posterior Features Compute Tandem features with pitch+PLP input. Compute HATs features with 19 critical bands Combine two Tandem and HATs posterior vectors
into one. Log(PCA(71) 32) MFCC + pitch + MLP = 74-dim 3500x128 Gaussians, MPE trained. Both cross-word (CW) and nonCW triphones
trained.
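The 74-dim bookkeeping can be sketched as a per-frame concatenation. This is only an illustration: the 39-dim MFCC block is an assumption implied by 74 = MFCC + 3 + 32, and the function and constant names are made up here.

```python
# Per-frame feature blocks; sizes follow the slides, names are illustrative.
MFCC_DIM = 39   # assumption: implied by 74 = MFCC + 3 + 32
PITCH_DIM = 3
MLP_DIM = 32    # phoneme posteriors after log and PCA (71 -> 32)

def concat_features(mfcc, pitch, mlp):
    """Concatenate the three blocks of one frame into a single vector."""
    if (len(mfcc), len(pitch), len(mlp)) != (MFCC_DIM, PITCH_DIM, MLP_DIM):
        raise ValueError("unexpected block size")
    return list(mfcc) + list(pitch) + list(mlp)

frame = concat_features([0.0] * 39, [0.0] * 3, [0.0] * 32)
print(len(frame))  # 74
```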
![Page 8: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/8.jpg)
8
Tandem Features [T1,T2,…,T71] Input: 9 frames of PLP+pitch (42x9)x15000x71
PLP (39x9)
Pitch (3x9)
![Page 9: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/9.jpg)
9
HATS Features [H1,H2,…,H71] 51x60x71
…
E1
E2
E19
(60*19)x8000x71
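To get a feel for the network sizes quoted for the Tandem and HATS MLPs, here is a rough weight count. It assumes each "AxBxC" reads as input x hidden x output of a fully connected net and ignores bias terms; the helper name is hypothetical.

```python
def mlp_weight_count(layer_sizes):
    """Number of weights in a fully connected MLP, ignoring bias terms."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Layer sizes as quoted on the slides.
tandem = mlp_weight_count([42 * 9, 15000, 71])
hats_band = mlp_weight_count([51, 60, 71])           # one critical band
hats_merger = mlp_weight_count([60 * 19, 8000, 71])  # takes 19 x 60 hidden activations

print(tandem, 19 * hats_band, hats_merger)
```

Under these assumptions the Tandem net and the HATS merger dominate; the 19 per-band nets are small by comparison.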
![Page 10: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/10.jpg)
10
Phone-81: Diphthongs for BC Add diphthongs (4x4=16) for fast speech and modeling
longer triphone context. Maintain unique syllabification. Syllable ending W and Y not needed anymore.
Example Phone-72 Phone-81要 /yao4/ a4 W aw4北 /bei3/ E3 Y ey3有 /you3/ o3 W ow3爱 /ai4/ a4 Y ay4
![Page 11: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/11.jpg)
11
Phone-81: Frequent Neutral Tones for BC
Neural tones more common in conversation. Neutral tones were not modeled. The 3rd tone
was used as replacement. Add 3 neutral tones for frequent chars.
Example Phone-72 Phone-81了 /e5/ e3 e5吗 /ma5/ a3 a5子 /zi5/ i3 i5
![Page 12: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/12.jpg)
12
Phone-81: Special CI Phones for BC Filled pauses (hmm,ah) common in BC. Add
two CI phones for them. Add CI /V/ for English.
Example Phone-72 Phone-81
victory w V
呃 /ah/ o3 fp_o
嗯 /hmm/ e3 N fp_en
![Page 13: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/13.jpg)
13
Phone-81: Simplification of Other Phones Now 72+14+3+3=92 phones, too many triphones to
model. Merge similar phones to reduce #triphones. I2 was
modeled by I1, now i2. 92 – (4x3–1) = 81 phones.
Example Phone-72 Phone-81安 /an1/ A1 N a1 N词 /ci2/ I1 i2池 /chi2/ IH2 i2
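The phone-count arithmetic on this slide can be checked in a few lines. The reading of "+14" as 16 diphthongs minus the two dropped glides W and Y is an assumption, not something the slides state; variable names are illustrative.

```python
base_phones = 72
diphthongs_net = 14    # slide's "+14"; assumed to be 16 diphthongs minus W and Y
neutral_tones = 3      # e5, a5, i5
special_ci = 3         # two filled pauses plus /V/

before_merge = base_phones + diphthongs_net + neutral_tones + special_ci
merged_away = 4 * 3 - 1   # the slide's (4x3-1)
phone_81 = before_merge - merged_away

print(before_merge, phone_81)  # 92 81
```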
![Page 14: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/14.jpg)
14
PLP Models with fMPE Transform PLP model with fMPE transform to compete
with MLP model. Smaller ML-trained Gaussian posterior model:
3500x32 CW+SAT
5 Neighboring frames of Gaussian posteriors. M is 42 x (3500*32*5), h is (3500*32*5)x1. Ref: Zheng ICASSP 07 paper
t t ty x Mh
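A toy version of the fMPE feature update, y_t = x_t + M h_t, is sketched below with plain lists. The sizes are tiny stand-ins (the slides use a 42-dim feature and a 3500*32*5 = 560000-dim posterior vector), the values are made up, and how M is trained is not shown.

```python
def fmpe_transform(x_t, M, h_t):
    """Return y_t = x_t + M h_t for one frame (plain lists, no NumPy)."""
    if len(M) != len(x_t) or any(len(row) != len(h_t) for row in M):
        raise ValueError("shape mismatch")
    return [x + sum(m * h for m, h in zip(row, h_t)) for x, row in zip(x_t, M)]

# Toy sizes: 2-dim features, 3-dim posterior vector (illustrative values).
x_t = [1.0, 2.0]
M = [[0.5, 0.0, 0.0],
     [0.0, 1.0, -1.0]]
h_t = [0.2, 0.7, 0.1]
y_t = fmpe_transform(x_t, M, h_t)
print(y_t)
```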
![Page 15: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/15.jpg)
15
Topic-based LM Adaptation
Latent Dirichlet Allocation Topic Model
{w | w same story (4secs) }
0
One sentence
4s window is used to make adaptation more robust against ASR errors.
{w} are weighted based on distance.
![Page 16: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/16.jpg)
16
Topic-based LM Adaptation
0 0 0 01 2 64( ... ) ' ' ' '
1 2 64( ... )
![Page 17: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/17.jpg)
17
Improved Acoustic SegmentationPruned trigram, SI nonCW-MLP MPE, on eval06
Segmenter Sub Del Ins Total
OLD 9.7 7.0 1.9 18.6
NEW 9.9 6.4 2.0 18.3
Oracle 9.5 6.8 1.8 18.1
![Page 18: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/18.jpg)
Different Phone Sets
Pruned trigram, SI nonCW-PLP ML, on dev07
Phone set   BN    BC     Avg
Phone-81    7.6   27.3   18.9
Phone-72    7.4   27.6   19.0
Indeed different error behaviors --- good for system combo.
![Page 19: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/19.jpg)
Decoding Architecture
[Figure: decoding flow with blocks MLP nonCW (qLM3); PLP CW+SAT+fMPE (MLLR, LM3); MLP CW+SAT (MLLR, LM3); qLM4 Adapt/Rescore (twice); Confusion Network Combination; Aachen]
![Page 20: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/20.jpg)
Topic-based LM Adaptation (NTU)
Training, per sentence:
  64 topics: $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$
  Topic(sentence) = $k = \arg\max \{\theta_1, \theta_2, \ldots, \theta_m\}$
  Train 64 topic-dep (TD) 4-grams
Testing, per utterance:
  {w}: N-best confidence based weighting + distance weighting
  Pick all TD 4-grams whose $\theta_i$ is above a threshold.
  Interpolate with the topic-indep. 4-gram.
  Rescore N-best list.
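The test-time steps (pick topics above a threshold, interpolate with the topic-independent model) might look roughly like this for a single n-gram probability. The name theta for the topic weights, the renormalisation over selected topics, the threshold, and the fixed interpolation weight are all assumptions for illustration; the slides do not give them.

```python
def select_topics(theta, threshold):
    """Indices of topics whose weight exceeds the threshold."""
    return [i for i, w in enumerate(theta) if w > threshold]

def adapted_prob(p_ti, p_td, theta, threshold=0.2, ti_weight=0.5):
    """Interpolate a topic-independent probability with selected topic-dependent ones.

    p_ti: probability from the topic-independent 4-gram
    p_td: list of probabilities, one per topic-dependent 4-gram
    """
    chosen = select_topics(theta, threshold)
    if not chosen:
        return p_ti  # nothing selected: fall back to the topic-independent LM
    total = sum(theta[i] for i in chosen)
    p_topics = sum(theta[i] / total * p_td[i] for i in chosen)
    return ti_weight * p_ti + (1.0 - ti_weight) * p_topics

# Toy example with 3 topics instead of 64.
p = adapted_prob(p_ti=0.01, p_td=[0.02, 0.08, 0.5], theta=[0.6, 0.3, 0.1])
print(p)
```

In this toy case only the first two topics pass the threshold, so the third topic's large probability does not influence the result.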
![Page 21: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/21.jpg)
CERs with diff LMs (internal use)
AM (adapt. hyps)   PLP(MLP)   MLP(PLP)   MLP(Aachen)   PLP(Aachen)   Rover
LM3                10.2       9.6        9.9           10.1          --
qLM4               10.2       9.7        10.0          10.1          --
LM4                10.0       9.6        9.8           10.0          9.1
Adapted qLM4       9.7        9.3        9.6           9.7           8.9
![Page 22: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/22.jpg)
Topic-based LM Adaptation (NTU)
AM (adapt. hyps)   PLP(MLP)   MLP(PLP)   MLP(Aachen)   PLP(Aachen)   CNC Rover
LM4                10.0       9.6        9.8           10.0          9.1
Adapted qLM4       9.7        9.3        9.6           9.7           8.9
“q” represents “quick” or tightly pruned.
Oracle CNC: 4.7%. Could it be a broken word sequence? Need to verify that with word perplexity and HTER.
![Page 23: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/23.jpg)
2006 ASR System vs. 2007
CER on Eval07
              SUB   DEL   INS   TOTAL
2006 system   7.2   6.5   0.4   14.1
2007 system   5.5   3.0   0.4   8.9
37% relative improvement!!
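The 37% figure follows from the two totals; a quick check, using only the numbers in the table (function and variable names are illustrative):

```python
def relative_improvement(old, new):
    """Relative error reduction, as a fraction of the old error rate."""
    return (old - new) / old

cer_2006 = {"sub": 7.2, "del": 6.5, "ins": 0.4}
cer_2007 = {"sub": 5.5, "del": 3.0, "ins": 0.4}

total_2006 = sum(cer_2006.values())   # 14.1
total_2007 = sum(cer_2007.values())   # 8.9
rel = relative_improvement(total_2006, total_2007)

print(f"{total_2006:.1f} -> {total_2007:.1f}: {100 * rel:.0f}% relative")
```

Most of the gain comes from deletions (6.5 to 3.0), consistent with the segmentation changes described earlier.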
![Page 24: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/24.jpg)
Eval07 BN ASR Error Distribution
66 BN snippets (avg CER 3.4%)
[Figure: CER (%) plotted against % snippets, SRI system]
![Page 25: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/25.jpg)
Eval07 BC ASR Error Distribution
53 BC snippets (avg CER 15.9%)
[Figure: CER (%) plotted against % snippets, SRI system]
![Page 26: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/26.jpg)
What Worked for Mandarin ASR?
MLP features
MPE
CW+SAT
fMPE
Improved acoustic segmentation, particularly for deletion errors.
CNC Rover.
![Page 27: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/27.jpg)
Small Help for ASR
Topic-dep. LM adaptation.
Outside regions for additional AM adaptation data.
A new phone set with diphthongs to offer different error behaviors.
Pitch input in tandem features.
Cross adaptation with Aachen.
Successful collaboration among 5 team members from 3 continents.
![Page 28: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/28.jpg)
Error Analysis on Extreme Cases
Snippet Dur CER HTER
a) Worst BN 87s 10.9% 47.73%
b) Worst BC 72s 24.9% 48.37%
c) Best BN 62s 0 12.67%
d) Best BC 77s 15.2% 14.20%
CER not directly related to HTER; genre matters. Better CER does ease MT.
![Page 29: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/29.jpg)
Error Analysis
(a) worst BN: OOV names
(b) worst BC: overlapped speech
(c) best BN: composite sentences
(d) best BC: simple sentences with disfluency and re-starts.
![Page 30: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/30.jpg)
Error Analysis
OOV (especially names): problematic for both ASR/MT
  Xu, Chang-Lin: 徐昌霖, 徐成民, 徐长明
  Huang, Zhu-Qin: 黄竹琴, 黄朱琴, 黄朱勤, 皇猪禽, 黄朱其
Overlapped speech: what to do?
Content word mis-reco (not all errors are equal!)
  升值 (increase in value) → 甚至 (even)
  Parsing scores?
![Page 31: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/31.jpg)
Error Analysis
MT BN high errors
  Composite syntax structure. Syntactic parsing would be useful.
MT BC high errors
  Overlapped speech
  ASR high errors due to disfluency
  Conjecture: MT on perfect BC ASR is easy, for its simple/short sentence structure
![Page 32: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/32.jpg)
Next ASR: Chinese OOV Org Names
Semi-auto abbreviation generation for long words.
  Segment a long word into a sequence of shorter words.
  Extract the 1st char of each shorter word: World Health Organization → WHO
  (Make sure they are in MT translation table, too)
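The abbreviation step can be sketched as below, assuming the long word has already been segmented into shorter words by some word segmenter (not shown); the function name is hypothetical.

```python
def abbreviate(parts):
    """Join the first character of each shorter word."""
    return "".join(p[0] for p in parts if p)

# The slide's own example, given as an already segmented sequence.
print(abbreviate(["World", "Health", "Organization"]))  # WHO
```

Because it indexes the first character rather than the first byte, the same function works on Chinese sub-words. The "semi-auto" part, deciding which candidates are real abbreviations, would still need a manual or data-driven check.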
![Page 33: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/33.jpg)
Next ASR: Chinese OOV Per. Names
Mandarin has a high rate of homophones: 408 syllables for 6000 common characters, about 14 homophone chars / syllable!!
Given a spoken Chinese OOV name, no way to be sure which characters to use. But for MT, don't care anyway as long as the syllables are correct!!
Recognizing repetition of the same name in the same snippet: CNC at syllable level
  Xu {Chang, Cheng} {Lin, Min, Ming}
  Huang Zhu {Qin, Qi}
After syllable CNC, apply the same name to all occurrences in Pinyin.
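A much simplified stand-in for syllable-level CNC is a per-position majority vote over the recognized variants of one name. It assumes the variants are already aligned and of equal length and ignores confidences and posteriors, which a real confusion network would use; names below are illustrative.

```python
def vote_syllables(hypotheses):
    """Pick, per position, the most frequent syllable; ties go to the earliest hypothesis."""
    if len({len(h) for h in hypotheses}) != 1:
        raise ValueError("hypotheses must already be aligned to equal length")
    consensus = []
    for slot in zip(*hypotheses):
        counts = {}
        for syl in slot:
            counts[syl] = counts.get(syl, 0) + 1
        # max() returns the first key with the highest count (insertion order).
        consensus.append(max(counts, key=counts.get))
    return consensus

# Three recognized variants of the same name within one snippet.
hyps = [["Xu", "Chang", "Lin"], ["Xu", "Cheng", "Min"], ["Xu", "Chang", "Ming"]]
name = vote_syllables(hyps)
print(" ".join(name))
```

The consensus Pinyin string would then replace every occurrence of the name in the snippet.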
![Page 34: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/34.jpg)
Next ASR: English OOV Names
English spelling in Lexicon, with (multiple) Mandarin pronunciations:
  Bush /bu4 shi2/ or /bu4 xi1/
  Bin Laden /ben1 la1 deng1/ or /ben1 la1 dan1/
  John /yue1 han4/
  Sadr /sa4 de2 er3/
Name mapping from MT?
Need to do name tagging on training text (Yang Liu), convert Chinese names to English spelling, re-train n-gram.
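One way to picture such lexicon entries is a mapping from English spelling to a list of Mandarin pronunciations. The entries are the ones listed on the slide; the dictionary layout and the lookup helper are assumptions for illustration, not the actual lexicon format.

```python
# English spelling -> Mandarin pronunciations (entries from the slide).
LEXICON = {
    "Bush": ["bu4 shi2", "bu4 xi1"],
    "Bin Laden": ["ben1 la1 deng1", "ben1 la1 dan1"],
    "John": ["yue1 han4"],
    "Sadr": ["sa4 de2 er3"],
}

def words_for_syllables(syllables, lexicon=LEXICON):
    """Return the English spellings that list this syllable sequence as a pronunciation."""
    target = " ".join(syllables)
    return [word for word, prons in lexicon.items() if target in prons]

print(words_for_syllables(["bu4", "xi1"]))  # ['Bush']
```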
![Page 35: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/35.jpg)
Next ASR: LM
LM adaptation with fine topics, each topic with small vocabulary size.
Spontaneous speech: n-gram backtraces to content words in search or N-best? Text paring modeling?
  我想那 (也)(也) 也是 → 我想那也是
  I think it, (too), (too), is, too. → I think it is, too.
If optimizing CER, stm needs to be designed such that disfluency is optionally deletable.
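The repetition example can be reproduced with a very crude paring rule: drop immediate repeats of the same token. This is only a sketch of the idea, handling exact adjacent repetitions at character level; real disfluencies (restarts, fillers) need more than this, and the function name is made up.

```python
from itertools import groupby

def pare_repetitions(tokens):
    """Drop immediate repetitions of the same token, keeping one copy."""
    return [tok for tok, _ in groupby(tokens)]

chars = list("我想那也也也是")
print("".join(pare_repetitions(chars)))  # 我想那也是
```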
![Page 36: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/36.jpg)
Next ASR: AM
Add explicit tone modeling (Lei07).
  Prosody info: duration and pitch contour at word level
  Various backoff schemes for infrequent words
More understanding why outside regions not helping with AM adaptation.
Add SD MLLR regression tree (Mandal06).
Improve auto speaker clustering
  Smaller clusters, better performance
![Page 37: Building A Highly Accurate Mandarin Speech Recognizer](https://reader035.fdocuments.in/reader035/viewer/2022062502/568155c6550346895dc39a82/html5/thumbnails/37.jpg)
ASR & MT Integration
Do we need to merge lexicon? ASR <= MT.
Do we need to use the same word segmenter?
Is word/char-level CNC output better for MT?
Open questions and feedback!!!