![Page 1: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/1.jpg)
How to handle pronunciation variation in ASR: By storing episodes in memory?
Helmer Strik
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands
Radboud University Nijmegen
![Page 2: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/2.jpg)
Overview
Contents:
- Variation, invariance problem
- ASR: Automatic Speech Recognition
- HSR: Human Speech Recognition
- ESR: Episodic Speech Recognition
![Page 3: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/3.jpg)
Invariance problem (1)
One of the main issues in speech recognition is the large amount of variability present in speech.
SRIV2006: ITRW on Speech Recognition and Intrinsic Variation
Invariance problem:
- Variation in stimuli, invariant percept
- Also visual, tactile, etc.
- Studied in many fields, no consensus
2 paradigms:
- Invariant
- Episodic
![Page 4: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/4.jpg)
Invariance problem (1)
Example 1: Speech
Dutch word: “natuurlijk” (naturally, ‘of course’) [natyrlk] [natylk]… [tyk]
Multiword expressions (MWEs): a lot of reduction, many variants
![Page 5: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/5.jpg)
Invariance problem (2)
Example 2: Writing (vision)
[Figure: the word "natuurlijk" repeated in a range of fonts and handwriting styles]
Familiar ‘styles’ (fonts, handwriting) are recognized better
![Page 6: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/6.jpg)
ASR - Paradigm
Invariant, symbolic approach:
utterance -> sequence of words -> sequence of phonemes -> sequence of states
Parametric description: pdf’s / ANN
![Page 7: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/7.jpg)
ASR - Paradigm
Same paradigm (HMMs) since the 70’s
- Assumptions: incorrect, questionable
- Insufficient performance
ASR vs. HSR: error rates 8-80x higher
Slow progress (ceiling effect?)
Simply using more and more data is not sufficient (Moore, 2001)
A new paradigm is needed!
However, only a few attempts
![Page 8: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/8.jpg)
HSR - Indexical information
Speech - 2 types of information :
1. Verbal info.: what, contents
2. Indexical info.: how, form
e.g. environmental and speaker-specific aspects (pitch, loudness, speech rate, voice quality)
![Page 9: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/9.jpg)
HSR - Indexical information
Traditional ASR model:
- Verbal information is used
- Indexical information:
  - Noise, disturbances
  - Preprocessing:
    o Strip off
    o Normalization (VTLN, MLLR, etc.)
And in HSR?
![Page 10: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/10.jpg)
HSR - Indexical information
HSR : Strip off indexical information?
No!
Familiar voices and accents: recognize and mimic
Indexical information is perceived and encoded
![Page 11: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/11.jpg)
HSR - Indexical information
Verbal & indexical information: processed independently?
No!
Familiar ‘voices’ are recognized better
Facilitation, also with ‘similar’ speech
![Page 12: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/12.jpg)
HSR - Indexical and detailed information
Experimental results: indexical information and fine phonetic detail (Hawkins et al.) influence perception
Difficult to explain / integrate in the traditional, invariant model
New models: episodic models, for auditory and visual perception
![Page 13: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/13.jpg)
ESR - Basic idea
A new paradigm for ASR is needed: An episodic model !!??
Training:
- Store trajectories - (representatives of) episodes
Recognition:
- Calculate the distance between X and sequences of stored trajectories (DTW)
- Take the one with minimum distance: the recognized word
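The recognition step on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation behind the talk: it matches the input against single stored trajectories rather than sequences of them, and the function names (`frame_distance`, `dtw_distance`, `recognize`), the Euclidean frame distance, the step pattern, and the toy one-dimensional data are all assumptions made for the example.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(x, y):
    """Basic DTW with steps (i-1, j), (i, j-1), (i-1, j-1); returns the total path cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of x with the first j frames of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(x, episodes):
    """episodes: list of (word, trajectory). Returns the word of the closest stored trajectory."""
    best_word, _ = min(episodes, key=lambda e: dtw_distance(x, e[1]))
    return best_word

# Hypothetical toy data: each trajectory is a list of 1-dimensional feature vectors.
episodes = [
    ("yes", [[0.0], [1.0], [2.0], [1.0]]),
    ("no", [[2.0], [2.0], [0.0], [0.0]]),
]
x = [[0.0], [1.0], [1.0], [2.0], [1.0]]  # a slower rendition of the "yes" trajectory
print(recognize(x, episodes))
```

Because DTW stretches the time axis, the five-frame input aligns with the four-frame stored episode at zero cost, which is the point of using it for comparing trajectories of different durations.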
![Page 14: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/14.jpg)
ESR - Invariant vs. episodic

| | Phone-based HMM | ESR |
|---|---|---|
| Unit | Phone | Syllable, word, … |
| Representation | States - pdf’s or ANN | Trajectories |
| Compare | Trajectory (X) & states | Trajectory (X) & trajectories |
| | Parsimonious representation | Extensive representation |
| | Complex mapping | Simple mapping |
| | ‘Variation is noise’ | Variation contains info. |
| | Normalization | Use variation |
![Page 15: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/15.jpg)
Phone ‘aj’ from ‘nine’.
X = begin
3 parts: aj(, aj|, aj)
Representation: pdf’s (Gaussians)
Much detail, dynamic information is lost
Trajectories: details
![Page 16: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/16.jpg)
Unit: phone(me)
Switchboard (Greenberg et al.):
- deletion: 25% of the phones
- substitution: 30% of the phones
- together 55%!!
Difficult for a model based on ‘sequences of phones’.
Syllables: less than 1% deleted
Phonetic transcriptions and their evaluation:
- Large differences between humans
- What is the ‘golden reference’?
- Speech - a sequence of symbols?
![Page 17: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/17.jpg)
Unit: Multiword expressions (MWEs)
MWEs (see poster):
- A lot of reduction; many phonemes deleted or substituted
- Many variants (= sequences of phonemes): more than 90 for 2 MWEs studied
- Difficult to handle in ASR systems with current methods for pronunciation variation modeling
- Reduction, e.g. for a MWE: 4 words with 7 syllables reduced to ‘1 entity’ with 2 syllables
What should be stored?
Units of various lengths?
![Page 18: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/18.jpg)
An episodic approach for ASR
Advantages:
- More information during search: dynamic, indexical, fine phonetic detail
- Continuity constraints can be used (reduces the trajectory folding problem)
- Model is simpler
Disadvantage:
- More information during search: complexity
  - Brain: a lot of storage and ‘CPU’
  - Computers: more and more powerful
![Page 19: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/19.jpg)
An episodic approach for ASR
Strik (2003) ITC-irst, Trento, Italy; ICPhS, Barcelona
De Wachter et al. (2003) Interspeech-2003
Axelrod & Maison (2004) ICASSP-2004
Maier & Moore (2005) Interspeech-2005
Aradilla, Vepa, Bourlard (2005) Interspeech-2005
Matton, De Wachter, et al. (2005) SPECOM-2005
Promising results
The computing power and memory that are needed to investigate the episodic approach to speech recognition are (becoming) available
![Page 20: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/20.jpg)
The HSR-ASR gap
HSR & ASR - 2 different communities
Different people, departments, journals, terminology, goals, methodologies
Goals, evaluation:
- HSR: simulate experimental findings
- ASR: reduce WER
![Page 21: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/21.jpg)
The HSR-ASR gap
Marr (1982) - 3 levels of modeling:
1. Computational
2. Algorithmic
3. Implementational
HSR - (larger) differences at higher levels
ASR - implementations, end-to-end models using real speech signals as input
- Thousands of exp.: WER has been gradually reduced
- However, essentially the same model
- New model: performance (WER), funding, etc.
![Page 22: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/22.jpg)
The HSR-ASR gap - bridge
Use same evaluation metric for HSR & ASR systems: reaction times (Cutler & Robinson, 1992)
Use knowledge or components from the other field (Scharenborg et al., 2003).
Use models that are suitable for HSR & ASR research
Evaluation from HSR & ASR point of view
S2S - Sound to Sense (Sarah Hawkins)
Marie Curie Research Training Network (MC-RTN)
Recently approved by the EU
![Page 23: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/23.jpg)
Episodic speech recognition
![Page 24: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/24.jpg)
![Page 25: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/25.jpg)
ESRASA model
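[Figure: ESRASA network diagram. Labels: feature vector (F1, F2, …, FN); attention weights (T1, T2, …, TN); episodes, with episode activation (EA1, EA2, …, EAE) and base activations (B1, B2, …, BE); association weights (C11, C12, C22, …, CE2, …, CEW); words, with word activation (WA1, WA2, …, WAW) and base activations (B1, B2, …, BW).]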
![Page 26: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/26.jpg)
ESRASA model
ESRASA: Episodic Speech Recognition And Structure Acquisition
The ESRASA model is inspired by several previous models, especially the model described in Johnson (1997), WRAPSA (Jusczyk, 1993), and GCM (Nosofsky, 1986)
The ESRASA model is a feedforward neural network with two sets of weights: atTention weights Tn and assoCiation weights Cew. Besides these two sets of weights, words, episodes (for speech units), and their base activation levels (Bw and Be, respectively) will be stored in memory.
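The slides name the ingredients of ESRASA (attention weights Tn, association weights Cew, base activations Be and Bw) but give no equations. The sketch below is therefore only one possible reading, loosely in the spirit of exemplar models: it assumes that an episode's activation decays exponentially with an attention-weighted distance to the input, and that a word's activation is its base level plus the association-weighted sum of episode activations. Both formulas, and all function and variable names, are assumptions made for illustration, not the model's actual definition.

```python
import math

def episode_activations(features, episodes, attention, base_e):
    """EA_e = B_e * exp(-d_e), with d_e an attention-weighted Euclidean distance (assumed form)."""
    activations = []
    for stored, b in zip(episodes, base_e):
        d = math.sqrt(sum(t * (f - s) ** 2
                          for t, f, s in zip(attention, features, stored)))
        activations.append(b * math.exp(-d))
    return activations

def word_activations(ep_acts, assoc, base_w):
    """WA_w = B_w + sum over e of C_ew * EA_e (assumed form); assoc[e][w] holds C_ew."""
    return [b + sum(assoc[e][w] * a for e, a in enumerate(ep_acts))
            for w, b in enumerate(base_w)]

# Hypothetical toy setup: two stored episodes, each fully associated with one word.
features = [1.0, 0.0]                  # input feature vector F
episodes = [[1.0, 0.0], [0.0, 1.0]]    # stored episodes
attention = [1.0, 1.0]                 # attention weights T_n
base_e = [1.0, 1.0]                    # episode base activations B_e
assoc = [[1.0, 0.0], [0.0, 1.0]]       # association weights C_ew
base_w = [0.0, 0.0]                    # word base activations B_w
words = ["word_a", "word_b"]

ea = episode_activations(features, episodes, attention, base_e)
wa = word_activations(ea, assoc, base_w)
winner = words[max(range(len(wa)), key=wa.__getitem__)]
print(winner)
```

In this reading, raising an attention weight makes mismatches on that feature dimension more costly, while the association weights decide how much each stored episode votes for each word.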
![Page 27: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/27.jpg)
ESR - Recognition
[Figure: two-stage recognition of the input X. Preselection reduces the L items in the lexicon to a subset of S items; competition reduces the S items in the subset to 1 item, the winner.]
![Page 28: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/28.jpg)
ESR - Preselection
Why preselection?
- Reduce CPU & memory
- Increase performance
- Also used in DTW-based pattern recognition applications
- Used in many HSR models
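One way to picture preselection is as a cheap nearest-neighbour filter that runs before any expensive comparison. The sketch below is an assumption-laden illustration, not the method used in the talk: it summarises the beginning of each trajectory (a fixed-length window) by its mean feature vector and keeps the S lexicon items closest to the input under that summary. The names `begin_signature` and `preselect`, the window length, and the toy lexicon are all hypothetical.

```python
import heapq
import math

def begin_signature(trajectory, window=3):
    """Mean feature vector over the first `window` frames (fewer if the item is shorter)."""
    frames = trajectory[:window]
    dims = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dims)]

def preselect(x, lexicon, s, window=3):
    """Return the s lexicon items whose beginning is closest to the beginning of x."""
    sig_x = begin_signature(x, window)

    def cheap_distance(item):
        _, trajectory = item
        return math.dist(sig_x, begin_signature(trajectory, window))

    return heapq.nsmallest(s, lexicon, key=cheap_distance)

# Hypothetical toy lexicon of (word, trajectory) pairs with 1-dimensional features.
lexicon = [
    ("zeven", [[7.0], [7.0], [6.0], [1.0]]),
    ("zes", [[7.0], [6.0], [6.0]]),
    ("een", [[1.0], [1.0], [2.0]]),
    ("acht", [[0.0], [4.0], [4.0]]),
]
x = [[7.0], [7.0], [6.0], [2.0]]
subset = preselect(x, lexicon, s=2)
print([word for word, _ in subset])
```

Only the S surviving items would then be passed on to the more expensive competition stage.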
![Page 29: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/29.jpg)
ESR - Competition
Recognize unknown word X:
- Calculate the distance between X and sequences of stored episodes (DTW)
- Take the one with minimum distance: the recognized word
Use continuity constraints (as in TTS)
![Page 30: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/30.jpg)
ESR - DTW: Dynamic Time Warping
![Page 31: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/31.jpg)
ESR - Research
Preselection?
Best method?
Compare:
- kNN - k nearest neighbor
- Lower bound distance: D_dtw ≥ D_lb > d
- Build an index for the lexicon
Is preselection needed?
Compare: with & without preselection
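The lower-bound idea can be made concrete as follows. A lower bound never exceeds the true DTW distance, so a candidate whose lower bound is already no better than the best distance found so far can be skipped without running DTW. The sketch uses a deliberately simple bound (the cost of the first and last frame pairs, which every warping path must contain); this particular bound, the scalar features, and the names `dtw`, `lower_bound`, and `nearest` are assumptions for illustration, not the bound studied in the talk.

```python
def dtw(x, y):
    """DTW distance between two sequences of numbers, with absolute difference as local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)  # row 0 of the cost matrix
    for xi in x:
        cur = [inf]
        for j, yj in enumerate(y, start=1):
            cur.append(abs(xi - yj) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]

def lower_bound(x, y):
    """Every warping path contains the first and the last frame pair, so their cost bounds DTW."""
    first = abs(x[0] - y[0])
    if len(x) == 1 and len(y) == 1:
        return first  # first and last pair coincide
    return first + abs(x[-1] - y[-1])

def nearest(x, lexicon):
    """Return (word, distance, number of full DTW computations actually performed)."""
    best_word, best = None, float("inf")
    full_computations = 0
    for word, trajectory in lexicon:
        if lower_bound(x, trajectory) >= best:
            continue  # the DTW distance cannot beat the current best
        full_computations += 1
        d = dtw(x, trajectory)
        if d < best:
            best_word, best = word, d
    return best_word, best, full_computations

# Hypothetical toy lexicon with scalar features.
lexicon = [("a", [0, 1, 2]), ("b", [5, 5, 5]), ("c", [0, 1, 3])]
x = [0, 1, 1, 2]
print(nearest(x, lexicon))
```

In this toy run only the first candidate needs a full DTW computation; the other two are pruned by the bound alone. Tighter bounds prune more, at a higher cost per candidate.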
![Page 32: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/32.jpg)
ESR - Research
Units for preselection?
Compare:
- Syllable
- Word
- Begin (window of fixed length)
![Page 33: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/33.jpg)
ESR - Research
Units for competition?
Compare:
- Syllables
- Words
- In combination with multisyllables?
Multisyllables (reduction, resyllabification):
- Ik weet het niet ("I don't know") -> kweeni
- Op een gegeven moment ("at a given moment") -> pgeefment
- Zeven-en ("seven and") -> ze-fnen
![Page 34: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/34.jpg)
ESR - Research
Exemplars?
How to select exemplars:
- DTW distances + hierarchical clustering
- VQ: LVQ & K-means
Trade-off normalization & (size) lexicon
Compare normalization techniques:
- TDNR, MVN, HN
- VTLN
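To illustrate the trade-off between normalization and lexicon size, here is a minimal sketch of one of the listed techniques, reading MVN as per-utterance mean and variance normalization. Each feature dimension is shifted to zero mean and scaled to unit variance, so two renditions that differ only by a constant offset and gain map to the same trajectory and need only one stored exemplar. The function name `mvn` and the `eps` guard are assumptions for the example.

```python
import math

def mvn(trajectory, eps=1e-8):
    """Normalize each feature dimension of a trajectory to zero mean and unit variance."""
    n = len(trajectory)
    dims = len(trajectory[0])
    means = [sum(f[d] for f in trajectory) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in trajectory) / n)
            for d in range(dims)]
    # eps avoids division by zero for a dimension that is constant over the utterance
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
            for f in trajectory]

# Hypothetical toy data: the second rendition is the first one scaled by 2 and shifted by 5.
soft = [[1.0], [3.0], [2.0]]
loud = [[7.0], [11.0], [9.0]]
print(mvn(soft))
print(mvn(loud))
```

The price is that whatever the normalization removes (here, overall level and range) is no longer available as information during search.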
![Page 35: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/35.jpg)
ESR - Research
Features?
Compare:
- Spectral features: MFCC, PLP, LPC
- Articulatory features (ANN)
- Combine spectral & articulatory feat.
Different features for preselection & competition?
![Page 36: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/36.jpg)
ESR - Research
Distance metrics?
Compare (frame-based metrics):
- Euclidean
- Mahalanobis
- Itakura (for LPC)
- Perceptually-based?
Distance metric for trajectories?
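Two of the frame-based metrics listed above can be compared directly on a pair of feature vectors. The sketch below is illustrative only: the Mahalanobis variant assumes a diagonal covariance matrix (one variance per feature dimension), which is a simplification chosen to keep the example short, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance: every feature dimension counts equally."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mahalanobis_diag(a, b, variances):
    """Mahalanobis distance under an assumed diagonal covariance:
    each squared difference is divided by that dimension's variance."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

# Hypothetical frames and per-dimension variances.
frame_a = [0.0, 0.0]
frame_b = [3.0, 4.0]
variances = [9.0, 16.0]

print(euclidean(frame_a, frame_b))
print(mahalanobis_diag(frame_a, frame_b, variances))
```

With these numbers the second dimension dominates the Euclidean distance, while the variance scaling makes both dimensions contribute equally to the Mahalanobis distance.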
![Page 37: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/37.jpg)
HMM-based ASR
Information sources
HMM-based ASR, roughly 3 ways:
1. Class-specific HMMs
2. Multistream
3. 2-pass decoding
Disadvantages:
1. Many classes
2. Synchronization & recombination
3. Pass 1: no / less knowledge
![Page 38: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/38.jpg)
ESR - Research
Information sources
ESR: compare 2 trajectories
- All details are available during search, e.g. context & dynamic information
- Compare shape + timing of feat. contours
  - F0 rise: early or final, half or complete
- Tags can be added to the lexicon + continuity constraints
![Page 39: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/39.jpg)
HSR - Foreign English Examples
Conversation about Italy.
dropped / robbed
I was robbed in Milan.
By parachute?
[ FEE 1 ]
![Page 40: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/40.jpg)
HSR - Indexical information
HSR: Strip off indexical information?
No!
Familiar voices and accents: recognize and mimic [ FEE 2 ]
Indexical information is perceived and encoded
![Page 41: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/41.jpg)
HSR - Indexical information
Verbal & indexical information: processed independently?
No!
Familiar ‘voices’ are recognized better [ FEE 3 ]
Facilitation, also with ‘similar’ speech [ FEE 4 ]
![Page 42: How to handle pronunciation variation in ASR: By storing episodes in memory? Helmer Strik Centre for Language and Speech Technology (CLST) Radboud University.](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649f4e5503460f94c6f633/html5/thumbnails/42.jpg)
ASR - Pronunciation variation
SRIV2006: ITRW on Speech Recognition and Intrinsic Variation
Pronunciation variation modeling for ASR:
- Improvements, but generally small
- Current ASR paradigm: suitable?
Phonetic transcriptions and their evaluation:
- Large differences between humans
- What is the ‘golden reference’?
- Speech - a sequence of symbols?