
Exemplar-Based Processing for Speech Recognition

presented by

Andreas Gaich

SS 2013 SPSC

Advanced Signal Processing Seminar 2


Institut für Signalverarbeitung und Sprachkommunikation

Graz 29.04.2013 Advanced Signal Processing Seminar 2

2 Andreas Gaich

Overview

Introduction

• Speech Recognition
• Frame- and segment-based models
• Global-data vs. exemplar-based models

State-of-the-art techniques

• Overview
• k-NN Classification
• Sparse Representations
• Template Matching
• Latent Perceptual Mapping

Experimental results

Conclusions


Introduction – Speech Recognition

Speech Recognition Problem

• Find a principled way of modeling the physical phenomena generating the observed data and the uncertainty in it.

• Uncertainties: e.g. dialects, vocal tract variations, corruption by noise, etc.

Approaches of modeling

• Eager (offline) learning => GLOBAL-DATA MODELING

Uses all available training data to build a model before the test sample is seen

• Lazy (memory-based) learning => EXEMPLAR-BASED MODELING

Selects a subset of exemplars from the training data to build a local model specifically for every test sample

Test sample informs the construction of the model
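The contrast between the two regimes can be sketched in a few lines; all data, dimensions, and class structure here are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 13))    # hypothetical: 100 frames of 13-dim features
y_train = rng.integers(0, 3, size=100)  # hypothetical labels for 3 classes

def eager_model(X, y):
    """Global-data modeling: fit one model (here simply a class mean)
    per class, offline, before any test sample is seen."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def lazy_model(x_test, X, y, k=5):
    """Exemplar-based modeling: select the k exemplars nearest to THIS
    test sample and build a local model from them only."""
    idx = np.argsort(np.linalg.norm(X - x_test, axis=1))[:k]
    return X[idx], y[idx]

means = eager_model(X_train, y_train)                          # built once
exemplars, labels = lazy_model(X_train[0], X_train, y_train)   # built per test sample
```

The eager model is computed once and reused; the lazy model is recomputed for every test sample, which is what lets the test sample inform the construction of the model.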


Introduction – Speech Recognition

The sequence recognition problem

• Find the sequence of words W that corresponds to a waveform

• Done by maximizing the posterior probability

W* = argmax_W P(W|X) = argmax_W Σ_U P(X|U) P(U|W) P(W)

U… sequence of subword units; X… sequence of observations

• Subword units u are representations for words w

=> P(W)… language model; P(U|W)… pronunciation model; P(X|U)… acoustic model


Introduction – Frame- and segment-based models

Frame-based models

• Observations are a temporal sequence of acoustic feature vectors computed at regular time intervals

• HMMs as a popular methodology to calculate the acoustic model

• Efficient probabilistic acoustic models by making a "first-order Markov assumption" between states and assuming "frame-by-frame independence" between observations

!!! THE FRAME-BY-FRAME INDEPENDENCE ASSUMPTION IS UNREALISTIC !!!


Introduction – Frame- and segment-based models

Segment-based models

• Introduction of an unobservable segmentation variable S

• Allows for modeling dependencies of multiple frames in each segment

• Difficulty of defining segments

Boundaries e.g. at large spectral changes in the frame-level observations X

• Calculation of the acoustic score

State trajectory modeling
Converting all observations within a segment to a new "segmental feature vector" [2]


Introduction – Global-data vs. exemplar-based models

• Virtually all speech recognition systems use global-data models

e.g. GMMs, NNs, SVMs as underlying computational engine

• Global models must simplify the process of speech production by making assumptions of independence

e.g. gender, dialect, environmental noise, etc…

• Global models seek an average representation that is reliable

• This is questionable at least for two reasons

1.) A lower-dimensional representation of speech feature vectors is possible (around 40 suggested for frame-based systems)

2.) Model parameters can be unreliable if there are not enough training samples for the specific class


Introduction – Global-data vs. exemplar-based models

• Exemplar-based techniques build local models using a few relevant training examples

=> Do not suffer from data sparsity issues


Techniques – Overview

Workflow

• Feature Extraction

Potentially apply feature transformation to reduce the influence of noisy and irrelevant features

The acoustic inventory can be composed of fixed-length feature vectors (frames) or variable-length sequences of such vectors (templates)

• Exemplar Selection

Identify the instances from the training data that are most relevant to each test instance

• Instance Modeling

• Frame- or segment-based decoding

Compute the acoustic score
Perform e.g. a Viterbi search


Techniques – Overview

Exemplar Selection


Techniques – kNN Classification


Techniques – kNN Classification

Exemplar Selection

• Obtain a set C containing the indices of the k nearest exemplars from the whole training data, collected in a matrix H of dim (D × N)

Instance Modeling

• Estimate class posteriors

g_q… rows of G corresponding to class q
G… binary matrix that associates each exemplar in H with class labels
i(C)… indicator vector representing the kNNs in H of a test instance

• Weighted kNN also possible

E.g. with weights derived from the distance between the test and training instance
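The posterior estimate above can be sketched as follows; the exemplar matrix H, the labels, and the dimensions are hypothetical, and uniform (unweighted) kNN is assumed:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, k, n_classes = 13, 50, 5, 4
H = rng.normal(size=(D, N))            # hypothetical exemplar matrix, dim (D x N)
labels = rng.integers(0, n_classes, size=N)
G = np.eye(n_classes)[:, labels]       # binary class-association matrix, dim (classes x N)

def knn_posteriors(y, H, G, k):
    """Class posteriors from the k nearest exemplars (uniform weights):
    the fraction of neighbours belonging to each class."""
    dists = np.linalg.norm(H - y[:, None], axis=0)
    C = np.argsort(dists)[:k]          # set C: indices of the k nearest exemplars
    i_C = np.zeros(H.shape[1])
    i_C[C] = 1.0                       # indicator vector i(C)
    return G @ i_C / k

y = rng.normal(size=D)                 # hypothetical test instance
p = knn_posteriors(y, H, G, k)
```

Because the k entries of i(C) sum to k, the posteriors sum to 1 by construction; a weighted kNN would replace the 1.0 entries of i(C) with distance-derived weights.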


Techniques – kNN Classification

Decoding

• Classification by majority vote

• For speech recognition, the votes need to be converted to frame-based observation likelihoods

Normalized so that it sums to 1, the class posterior is a direct measure of the observation likelihood

• kNN can also be used in "multiple-frame" and "segment-based" frameworks


Techniques – Sparse representations

Formulation

• Concatenate the training samples of class i into a matrix

x… feature vector from the training set of class i

• Then a test sample from the same class can be represented as a linear combination

• A priori the membership of y is unknown, so define a matrix with the whole training set containing all k classes

• H is an overcomplete dictionary in terms of solving y = Hβ

=> β should be sparse and only be non-zero for elements in H which belong to the same class as y


Techniques – Sparse representations

• Sparse solution is found by means of Lasso (refer to ASP SE1 WS12/13)

Minimize the L2 distance while enforcing sparsity using an L1 constraint

Exemplar Selection

• Intelligent choice of dictionary H is necessary

kNN search
Random sampling of the training data

Instance Modeling

• Estimate class posteriors (same procedure as for kNN)

• If the posteriors represent phone classes, they are referred to as "SR phone identification features"
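As a sketch of this pipeline, the Lasso problem can be solved with plain ISTA (iterative soft-thresholding) — a simple stand-in for whatever solver a real system uses; the dictionary, labels, and λ below are hypothetical, and the posterior estimate mirrors the kNN case:

```python
import numpy as np

def ista_lasso(H, y, lam=0.1, n_iter=500):
    """Minimize 0.5*||y - H b||^2 + lam*||b||_1 by iterative
    soft-thresholding (ISTA)."""
    L = np.linalg.norm(H, 2) ** 2              # Lipschitz constant of the gradient
    b = np.zeros(H.shape[1])
    for _ in range(n_iter):
        z = b - H.T @ (H @ b - y) / L          # gradient step on the L2 term
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return b

rng = np.random.default_rng(2)
D, N, n_classes = 20, 40, 3
H = rng.normal(size=(D, N))                    # overcomplete dictionary of exemplars
labels = rng.integers(0, n_classes, size=N)
y = H[:, 0] + 0.01 * rng.normal(size=D)        # test vector near exemplar 0

b = ista_lasso(H, y)
w = np.abs(b)                                  # weight mass per dictionary column
posteriors = np.array([w[labels == q].sum() for q in range(n_classes)]) / w.sum()
```

The sparse weights play the role the indicator vector i(C) plays for kNN: class posteriors are the normalized weight mass falling on each class's columns.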


Techniques – Sparse representations

Decoding

• Same procedure as in the kNN chapter
• Alternatively, construct new features to train a GMM

Noise robustness using SR

• Sparse imputation (SI)

Some features stay relatively uncorrupted under noisy speech
Define a "binary mask" m that represents the uncorrupted dimensions
Obtain the SR by solving the Lasso problem restricted to those dimensions

• Source separation

Describe x as a linear combination of speech and noise exemplars
Requires a representation where noise and speech add linearly
Estimate class posteriors
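A minimal sketch of sparse imputation under these assumptions — the mask, the corruption, and the clean-speech dictionary are all illustrative, and a feature domain where the mask makes sense is assumed:

```python
import numpy as np

def sparse_imputation(y_noisy, H, m, lam=0.1, n_iter=300):
    """Sparse imputation: fit the sparse representation using only the
    reliable (mask == True) dimensions, then reconstruct all of them."""
    Hm, ym = H[m], y_noisy[m]                  # restrict to uncorrupted rows
    L = np.linalg.norm(Hm, 2) ** 2
    b = np.zeros(H.shape[1])
    for _ in range(n_iter):                    # ISTA on the masked Lasso problem
        z = b - Hm.T @ (Hm @ b - ym) / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return H @ b                               # corrupted dims imputed from exemplars

rng = np.random.default_rng(3)
D, N = 20, 30
H = rng.normal(size=(D, N))                    # hypothetical clean-speech exemplars
y_clean = H[:, 4]
y_noisy = y_clean.copy()
y_noisy[:5] += 5.0                             # corrupt the first 5 dimensions
m = np.ones(D, dtype=bool)
m[:5] = False                                  # binary mask of reliable dimensions
y_hat = sparse_imputation(y_noisy, H, m)
```

Because the sparse weights are estimated from the reliable dimensions only, multiplying the full dictionary by them fills in plausible values for the corrupted ones.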


Techniques – Template matching

• Nonparametric method

• Compares reference templates directly with the observed features X

Make use of the entire database
Use meta-information enriched in the training data (labels, gender, etc.)

Exemplar Selection

• First generate a rough word graph to keep the number of candidate segments and corresponding class labels manageable

Done by an HMM system or bottom-up template selection
Augmented to subword segmentation

• Collect a set of k-NN templates for each word arc that match the word identity u and resemble the sequence of acoustic features X


Techniques – Template matching

Modeling

• Template matching uses variable-length units

=> Dynamic Time Warping (DTW)

Sum up local distances between template and input and minimize the overall cost to find the right trajectory

• Within-template alignment:
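The DTW alignment described above can be sketched as a minimal O(T·U) recursion; the toy 1-D sequences are hypothetical, and a real system would add path constraints and slope weights:

```python
import numpy as np

def dtw_cost(template, test):
    """Sum local distances along the best monotone alignment path
    between two variable-length sequences (no slope constraints)."""
    T, U = len(template), len(test)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            local = np.linalg.norm(template[t - 1] - test[u - 1])
            D[t, u] = local + min(D[t - 1, u], D[t, u - 1], D[t - 1, u - 1])
    return D[T, U]

# toy 1-D "feature" sequences: b is a time-stretched copy of a, c is unrelated
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0], [3.0]])
c = np.array([[3.0], [2.0], [1.0], [0.0]])
```

Here dtw_cost(a, b) is 0 because the stretched copy can be aligned exactly, while dtw_cost(a, c) stays large — which is why DTW handles variable-length units where a fixed frame-by-frame distance cannot.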


Techniques – Template matching

• Consider additional costs at boundaries

Template transition cost
Language model cost

• Template Transition Costs

Penalty for incorrect acoustic-phonetic context
Incorporate non-verbal information to find consistent paths (e.g. male vs. female speakers)


Techniques – Template matching

Decoding

• Single best Viterbi decoding

Remains sensitive to errors in the training database (incorrect annotations, bad segmentations, highly unusual pronunciations, etc)

Improvement by „Data Sharpening“

• Simpler and yet better:

Average the scores before decoding
Then run the Viterbi decoder again
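A toy illustration of why averaging helps (all scores below are made up): a single spuriously good score from a badly annotated template wins the single-best decision, but is averaged out over the k nearest templates:

```python
import numpy as np

# hypothetical DTW scores (lower = better) of the 3 nearest templates
# found for each competing word hypothesis; 0.5 comes from a badly
# annotated template in the training database
scores = {
    "yes": np.array([0.5, 6.0, 6.2]),
    "no":  np.array([2.0, 2.2, 2.5]),
}

# single-best decoding trusts the one lowest score -- the outlier wins
single_best = min(scores, key=lambda w: scores[w].min())

# averaging the k-NN scores first smooths the outlier away
averaged = min(scores, key=lambda w: scores[w].mean())
```

With these numbers, single_best picks "yes" on the strength of the outlier alone, while averaged picks the consistently well-matching "no".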


Techniques – Template matching


Techniques – Latent perceptual mapping (LPM)

• Operates with data-driven acoustic units (no prior model assumed)

• Speech segments are treated as a bag of acoustic units drawn from a limited acoustic vocabulary

• Training comprises three main steps:

Extracting relevant "units" from a given set of phoneme instances
Deriving a unit-document co-occurrence matrix
Mapping the phoneme instances to a dimensionality-reduced latent space after singular value decomposition (SVD) of the co-occurrence matrix

LPM is related to SRs and Template Matching


Techniques – Latent perceptual mapping (LPM)

Exemplar Selection

• Unsupervised clustering of feature vectors from m phoneme segments

• After vector quantization, the resulting sequence of units is broken into n-gram units (1 ≤ n ≤ 20)


Techniques – Latent perceptual mapping (LPM)

Exemplar Selection – cont‘d

• Choose the most informative units by investigating empirical measures

Indexing Power

Empirical probability


Techniques – Latent perceptual mapping (LPM)

Exemplar Selection – cont‘d

• Compute the co-occurrence matrix F by counting the number of times each unit appears in a phoneme instance

• Dimensionality reduction of F by an SVD truncated to rank R, where R approximates the rank of F

Phoneme segments from the training set are mapped to the vectors in the latent space and then used as acoustic prototypes

A test segment X is mapped onto the latent space as well and classification is done by distance calculation to the acoustic prototypes
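The co-occurrence/SVD mapping can be sketched with hypothetical counts; numpy's SVD is truncated to rank R, and the fold-in of a test segment uses the left singular vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_docs, R = 12, 8, 3

# hypothetical co-occurrence counts: F[u, d] = number of times acoustic
# unit u appears in phoneme instance ("document") d
F = rng.integers(0, 5, size=(n_units, n_docs)).astype(float)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
prototypes = (np.diag(s[:R]) @ Vt[:R]).T   # one R-dim latent vector per training instance

def map_to_latent(x_counts):
    """Fold a new segment's unit counts into the same R-dim latent space."""
    return x_counts @ U[:, :R]

x = F[:, 2] + rng.integers(0, 2, size=n_units)   # noisy copy of training instance 2
z = map_to_latent(x)
nearest = int(np.argmin(np.linalg.norm(prototypes - z, axis=1)))
```

Folding a training column back in reproduces its prototype exactly, so training and test segments live in the same latent space and classification reduces to a distance calculation against the prototypes.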


Techniques – Link between different approaches


Experimental results

• Most results are evaluated on TIMIT

A small-vocabulary phonetic corpus
Recorded and transcribed by TI and MIT
Provides standardized training, development, and test sets as well as time-aligned phonetic transcriptions

• Exemplar-based methods are compared to the state of the art implementation of the „best“ classical approach


Experimental results – kNN

Classification

• Classification error of about 21%, compared to 21.6% for a GMM classifier

Recognition

• Same performance as GMM/HMM-based approach for small vocabulary

• Improvement over GMM/HMM-based approach if less than 3 hours of training data is used in a large vocabulary task


Experimental results – Sparse representations

Classification

• Classification error less than 15%, the best result reported on TIMIT

Recognition

• Phonetic error rate (PER) of 18.6% on TIMIT; a 0.8% absolute improvement over GMM/HMM systems

• Word error rate (WER) of 18.7%; a 0.3% improvement over state-of-the-art GMM/HMM systems


Experimental results – Template matching

Classification

• Classification error of 20.7% (GMM classifier: about 21.6%)

Recognition

• Small vocabulary: outperforms an HMM-based system (3.07% versus 3.35% WER)

• Large vocabulary: 21% relative improvement, resulting in an overall absolute WER of 7.6%


Experimental results – Latent Perceptual Mapping

Classification

• Early experiments with LPM focused on dimensionality reduction rather than accuracy improvements

• By retaining 10% of the maximum dimensionality of the latent space, frame-based LPM operating on vector-quantized phone segments performs at the level of both DTW and discrete-parameter HMM systems

• Template-based LPM using short, variable-length units achieves the same level of performance at a dimensionality less than or comparable to that of the original acoustic space


Conclusions

• Exemplar-based techniques stay closer to the underlying speech process by building local models, while at the same time keeping the number of parameters parsimonious

• Exemplar-based methods complement, in a robust manner, the information captured within global-data models

• Exemplar-based processing could potentially support inference from any corpus enriched with information not immediately usable by global-data models, such as prosody

• It is critical to keep on improving the computational efficiency of exemplar-based methods due to growing data sets in speech recognition


References

[1] B. Ramabhadran, D. Nahamoo, D. Kanevsky, D. Van Compernolle, K. Demuynck, J. F. Gemmeke, J. R. Bellegarda, and S. Sundaram, "Exemplar-Based Processing for Speech Recognition", IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 98-113, Nov. 2012.
[2] J. R. Glass, "A probabilistic framework for segment-based speech recognition", Comput. Speech Lang., vol. 17, no. 2-3, pp. 137-152, Apr.-July 2003.
[3] T. N. Sainath, B. Ramabhadran, D. Nahamoo, D. Kanevsky, and A. Sethy, "Exemplar-based sparse representation features for speech recognition", in Proc. Interspeech, 2010, pp. 2254-2257.
[4] K. Demuynck, D. Seppi, H. Van hamme, and D. Van Compernolle, "Progress in example-based automatic speech recognition", in Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2011, pp. 4692-4695.
[5] M. De Wachter, K. Demuynck, D. Van Compernolle, and P. Wambacq, "Data driven example based continuous speech recognition", in Proc. European Conf. on Speech Communication and Technology, 2003, pp. 1133-1136.
[6] S. Sundaram and J. Bellegarda, "Latent perceptual mapping with data-driven variable-length acoustic units for template-based speech recognition", in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Mar. 2012, pp. 4125-4128.