Czech Technical University in Prague
Faculty of Electrical Engineering
DOCTORAL THESIS
Petr Fousek March 2007
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Circuit Theory
Extraction of Features for Automatic Recognition
of Speech Based on Spectral Dynamics
by
Petr Fousek
PhD Program: Electrical Engineering and Information Technology
Branch of Study: Electrical Engineering Theory
Supervisor: Doc. Ing. Petr Pollak, CSc.
Supervisor–Specialist: Prof. Hynek Hermansky
Abstract
This work addresses new approaches to automatic speech recognition utilizing
neural networks and Hidden Markov Models (HMM). The main interest is in developing
novel features extracted from spectral dynamics.
It is generally acknowledged that the underlying message in speech is contained in
the auditory spectrogram. However, it is not well known how the relevant information is
distributed in the spectrogram and how it should be converted into features.
The first part of the thesis investigates the distribution of information in the auditory
time-frequency plane. In particular, we study how much of the symmetric time
context surrounding a particular time instant can possibly be useful for features. We
also look at how the length of this context changes with the varying size of the feature
vector. After showing that most of the information comes from the center of the symmetric
contextual window, we propose to warp the time axis so as to devote more modeling
power to the window center at the expense of the boundaries. Finally, we focus on the modulation
properties of speech: considering that the information transmitted by speech is
encoded in temporal modulations of the auditory spectra, we explore which modulations
should be preserved in features and which of them are the most important.
In the next part we focus on the development of features explicitly encoding the temporal
evolution of energies in frequency sub-bands, called LP-TRAP (Linear Predictive TempoRAl
Patterns). LP-TRAP bypasses frame-based analysis and can preserve the fine temporal
structure of sub-band events. The method is implemented in C++ and its tunable parameters
are optimized. The idea of pre-warping the temporal axis in order to stress the central
part of sub-band energy trajectories is implemented and shown to be beneficial.
Subsequently, a novel feature extraction technique named M-RASTA is proposed, which
extends and generalizes the RelAtive SpecTrAl filtering (RASTA) method. It filters the speech
spectrogram with a bank of 2-D time-frequency filters with varying temporal resolutions.
The result is projected onto phoneme probabilities by a neural network. The technique is inspired by
the earlier findings about the information in the spectro-temporal plane, namely focusing on
the central parts of TRAPs and preserving only the important part of the modulation spectrum. It
is optimized, implemented in C++ and compared to baseline features.
The last studied topic proposes a new alternative to mainstream HMM word
recognition, in which each target word is classified against all other sounds by a separate
binary classifier. The system uses only discriminatively trained neural network classifiers.
Since the proposed framework focuses on capturing only the words of interest, it is able
to reasonably reject all out-of-vocabulary words.
The observations and conclusions of this work are based on experimental evidence
using two standard independent speech recognition tasks.
Abstrakt
This thesis is oriented towards new methods for automatic speech recognition based
on artificial neural networks and Hidden Markov Models (HMM). The emphasis is on the
development of new features derived from spectral dynamics.
It is generally accepted that the information to be conveyed by speech is contained in the
spectrogram. However, it is not precisely known how this information is distributed in the
spectrogram and how it should be converted into features for recognition.
The first part of the dissertation investigates precisely this distribution of information in the
time-frequency plane of the spectrogram. We focus on how much of the temporal context
symmetrically surrounding the examined time instant can be useful for features. We examine
whether the width of this context changes with the number of features available. We then
show that the largest part of the information comes from the center of the temporal context
and propose a method of warping the time axis which allows the center of the context to be
modeled better at the expense of its boundaries. At the end of this part we address the
modulation properties of speech. Considering that the information transmitted by speech is
encoded in temporal changes of the spectrum, i.e. modulations, we attempt to determine
which modulations should be preserved in the features and which of them are the most
important.
In the next part we focus on one feature extraction method which explicitly models the
temporal evolution of energies in frequency sub-bands of the spectrum, the LP-TRAP method
(Linear Predictive TempoRAl Patterns). The method does not rely on segmental analysis and
is therefore able to preserve the detailed structure of temporal events at the sub-band level.
The algorithm is implemented in C++ and the parameters of the method are optimized. The
above-mentioned idea of warping the time axis is implemented and its benefit is shown
experimentally.
Further, a new feature extraction method named M-RASTA is proposed, which extends
and generalizes RASTA filtering (RelAtive SpecTrAl filtering). M-RASTA filters the speech
spectrogram with a bank of two-dimensional time-frequency filters with varying temporal
resolutions. The result of the filtering is projected onto posterior probabilities of phonemes
by a neural network. The method is inspired by the conclusions of the preceding parts; in
particular, it focuses on the center of the temporal trajectory and preserves only the
necessary part of the modulation spectrum. The method is partially optimized, implemented
in C++ and compared with other feature types.
In the last part of the thesis, a new alternative approach to word recognition without
HMMs is proposed, in which each target word is distinguished from any other sounds by an
independent binary classifier. Only neural networks are used for classification. Since the
proposed system focuses only on the target words, it is able to reject out-of-vocabulary
words to a reasonable degree.
The findings and conclusions presented in this thesis are experimentally supported using
two independent recognition tasks.
Acknowledgment
Doctoral study takes quite a bit of one's lifetime to finish. Meanwhile, one manages to
meet a number of good people, have a lot of fun, learn many new things and eventually do
a piece of work. I happened to spend my PhD at two places, namely at my alma mater,
Czech Technical University in Prague, and at IDIAP Research Institute, Martigny.
In Prague I worked under the guidance of my supervisor Petr Pollak. Petr led me into
the scientific world, took me to my first conference and arranged my stay at IDIAP. He
was also a source of great theoretical support and practical experience. It was he who
introduced me to the world of Linux, for which I am endlessly grateful.
My thanks go also to all the other members of the lab for the nice working and off-work
atmosphere. In particular, to Vaclav Hanzl (our Linux and system guru) for his in-depth
technical and theoretical support; to Jan Novotny for fruitful discussions, for his digits
recognizer cook-book and also for his microphone, which I stole from him; to Jindrich Zd'ansky
for Perl and his irresistible write-only scripts; and to Hynek Boril for all the progressive
and constructive discussions, for very effective and smooth cooperation, for sharing his
hot-dog machine in the lab and for being a friend. Finally, I want to thank the head of
our department, prof. Sovka, for his perfect and enthusiastic DSP lectures, which long ago
made me decide for this field.
During my stay at IDIAP I was guided by my co-supervisor Hynek Hermansky. Hynek
let me see the cutting edge of the world of speech recognition. His encyclopedic
knowledge of the latest technology, his open mind and wild ideas make him a unique
personality with whom it is an outstanding experience to work. I greatly appreciate his
infectious optimism and his involvement in our work.
I am grateful to all the guys from the speech recognition group for their cooperation and
the nice atmosphere in the lab. First of all, to my friend Frantisek Grezl for teaching me
TRAPs, being able to answer any TRAP-related question and providing me with all his
scripts and tools. Our everyday free-hand cooking sessions with Franta no doubt gave
rise to our best ideas in ASR. I also thank Petr Motlicek who, besides the serious
work, helped me with exploring the Alps on foot and by bike. Among other lab-mates, I
enjoyed the cooperation with Marios Athineos, Petr Svojanovsky, Hamed Ketabdar, Mikko
Lehtonen and Hemant Misra. The swift and secure working environment at IDIAP is
certainly due to admins Frank Formaz, Norbert Crettol and the director, Herve Bourlard.
I want to thank all the guys from the FIT Brno Speech Processing lab for kindly providing
me with a lot of source code and support, and for domesticating that “guy from Prague”.
As PhD study is not exactly a lucrative job, I owe my thanks to my parents
for material support. Apart from that, my parents have always provided me with an
absolutely reliable and all-inclusive home, which I appreciate the most.
Finally, my deepest thanks go to my wife Petra for being with me, for her bullet-proof
patience, unshakeable tolerance and for making our life colorful.
— big thanks go to Franta, Pet’a and Pet’ka for reading the manuscript
and to my father for the picture with the poor guy in the desert.
Thank you.
Contents

1 Preface
2 Introduction
  2.1 Statistical speech recognizer
    2.1.1 Front-end
    2.1.2 Back-end
  2.2 Incorporating spectral dynamics in features
    2.2.1 Long time context
  2.3 Thesis overview
    2.3.1 Studied topics
    2.3.2 Thesis outline
    2.3.3 Main goals
3 Survey: From short term spectrum to spectral dynamics
  3.1 Obtaining auditory spectrogram
    3.1.1 Frequency resolution of speech spectrum
    3.1.2 Bank of filters in time domain
  3.2 Spectrogram filtering
    3.2.1 Temporal filtering and modulation spectrum
    3.2.2 2-D Filtering
  3.3 Parametric spectrogram representation
  3.4 Parametrizing spectrogram by probabilistic features
    3.4.1 Hybrid and TANDEM architectures
    3.4.2 TRAP architecture
    3.4.3 Further development of TRAP
    3.4.4 Multi-stream systems
4 Recognition and evaluation framework
  4.1 Software tools
  4.2 Recognizer architecture
    4.2.1 Multi Layer Perceptron as posterior probability estimator
  4.3 Evaluation criteria
  4.4 Evaluation tasks
    4.4.1 English digits recognition – Stories & Numbers95
    4.4.2 Conversational Telephone Speech – CTS
  4.5 Baseline results
5 Information in time-frequency plane
  5.1 Limits of useful temporal context for TRAP features
    5.1.1 Mean TRAPs of phonemes
    5.1.2 Truncating TRAP
    5.1.3 Fixed MLP topology – truncating TRAP-DCT
    5.1.4 Extension – combining DCTs of different lengths
  5.2 Focusing on TRAP center
    5.2.1 Warping time axis
6 Linear Predictive TempoRAl Patterns (LP-TRAP)
  6.1 Introduction to LP-TRAP
  6.2 Extracting LP-TRAP features
    6.2.1 Importance of Hilbert envelope
    6.2.2 Obtaining frequency sub-bands
    6.2.3 FDLP – Frequency-Domain Linear Prediction
    6.2.4 Free parameters in algorithm
  6.3 Experimentally optimizing LP-TRAP
    6.3.1 Sub-band envelope compression and LPC order
    6.3.2 Sampled FDLP temporal envelopes vs. FDLP cepstra as features
    6.3.3 Input window length
    6.3.4 Overlap of frequency sub-bands
    6.3.5 LP model order
    6.3.6 Evaluating optimized features on S/N and CTS tasks
  6.4 Warping time axis in LP-TRAP
    6.4.1 Temporal resolution of LP-TRAP
    6.4.2 Two ways to warp time axis
    6.4.3 Non-linear time warping and sampling theorem
    6.4.4 Warping function
    6.4.5 Experiments
  6.5 Conclusion
7 Multi-resolution RASTA filtering (M-RASTA)
  7.1 Introduction – M-RASTA from different perspectives
  7.2 M-RASTA features
    7.2.1 Temporal filters
    7.2.2 2-D – time-frequency filters
  7.3 Experiments
    7.3.1 Filtering with single filter
    7.3.2 Combining two temporal filters
    7.3.3 Tuning the system for best accuracy
    7.3.4 Robustness to channel noise
    7.3.5 Modulation frequency properties
    7.3.6 Discrepancy between MLP and HMM – phoneme posteriograms
  7.4 Combining LP-TRAP and M-RASTA
  7.5 Conclusion
8 Extensions: Towards recognition by means of keyword spotting
  8.1 Introduction
  8.2 Detecting a word in two steps
    8.2.1 From frame-based estimates to word level
  8.3 Experiments
    8.3.1 Training and testing data sets
    8.3.2 Initial experiment – checking viability
    8.3.3 Simplifying the system – omitting intermediate steps
    8.3.4 Optimizing false alarm rate – Enhanced system
    8.3.5 Keyword spotting on frame level
    8.3.6 Keyword spotting in unconstrained speech
  8.4 Discussion and conclusion
9 Summary and conclusion
  9.1 Summary of the work
  9.2 Original contribution
  9.3 Conclusion
  9.4 Acknowledgment
A Class coverage of S/N & CTS tasks
B Summary of experiments on CTS task
C On target classes for band-classifiers in TRAP
D Sub-phoneme targets for TANDEM classifier
Bibliography
List of Figures

2.1 Diagram of a typical ASR.
2.2 Example of a three-state sequential HMM.
2.3 Illustration of long-term feature extraction.
3.1 Scheme of Hybrid MLP/HMM system.
3.2 Scheme of TANDEM system combining MLP and GMM/HMM.
3.3 Scheme of TRAP system.
4.1 Block scheme of feature extractor CtuCopy.
4.2 General scheme of front-end for long-term features.
4.3 Scheme of three-layer MLP and neuron.
4.4 Structure of the corpora in Stories-Numbers95 (S/N) task.
5.1 Mean TRAPs for 41 phonemes and 6 non-phoneme classes of CTS.
5.2 Mean TRAPs and standard deviations for selected phonemes of CTS.
5.3 Influence of TRAP length on FER and WER.
5.4 Influence of TRAP-DCT length on FER and WER.
5.5 First four DCT bases weighted by Hamming window.
5.6 Chosen bases of DCT applied to 100 ms and 1000 ms TRAPs.
5.7 Illustration of combining bases of two DCTs of different sizes.
5.8 Combining DCT size 2 features with another DCT size 2 features.
5.9 Symmetric TRAP warping function.
5.10 Discrete mapping from TRAP size 101 to TRAP size 21.
5.11 Warping time axis in TRAPs.
6.1 LP-TRAP feature extraction scheme.
6.2 Illustration of forming frequency sub-bands from DCT spectrum.
6.3 Duality between time and frequency domains in LP-TRAP.
6.4 Detailed LP-TRAP feature extraction scheme.
6.5 Compression of sub-band dynamics in LP-TRAP.
6.6 Bank of filters in LP-TRAP. Influence of blap factor on filter widths.
6.7 WER and FER as a function of blap factor in LP-TRAP.
6.8 WER and FER as a function of LP order fp in LP-TRAP.
6.9 Illustration of temporal resolution of LP-TRAP vs. TRAP.
6.10 Warping temporal axis in LP-TRAP feature extraction.
6.11 Illustration of binning process in warping LP-TRAPs.
7.1 M-RASTA feature extraction scheme.
7.2 Normalized impulse responses of the first two Gaussian derivatives.
7.3 Normalized frequency responses of the first two Gaussian derivatives.
7.4 Detail of the first two Gaussian derivatives for various σ.
7.5 Example of impulse responses of 2-D RASTA filters with σ = 60 ms.
7.6 M-RASTA with only one filter: FER and WER dependencies on filter width.
7.7 Combining two temporal filters in M-RASTA: error as a function of σ.
7.8 Shrinking the modulation bandwidth in M-RASTA by limiting σ range.
7.9 Determining bandwidth of a bank with two filters.
7.10 Mapping between σ and bandwidth of associated g1,2 filters.
7.11 FER and WER as a function of cutoff frequency in modulation spectrum.
7.12 Posteriograms for utterance “nine”.
7.13 Scheme of Hilbert-M-RASTA feature extraction.
8.1 Scheme of hierarchical keyword spotting.
8.2 Example of keyword posteriogram.
8.3 Impulse responses of matched filters for eleven keywords.
8.4 Finding the keyword position from its posterior probability.
8.5 Omitting intermediate processing steps from hierarchical keyword spotting.
8.6 Alarm thresholds as a function of false alarm rate (floored at threshold 0.05).
8.7 Operation ranges of the proposed system and HMM recognizer.
8.8 Frame-level evaluation of Keyword MLPs, Test set.
C.1 TRAP MLP architecture.
D.1 Illustration of phonemes as MLP training targets for an utterance “seven”.
List of Tables

4.1 Performance of baseline features on S/N task.
4.2 Suitable grammar scale factors for CTS task.
4.3 Performance of baseline features on CTS task.
5.1 Warping factors w and compression ratios for given lengths M.
5.2 Performance of selected warped TRAPs on CTS.
6.1 Performance of sampled temporal envelopes vs. of cepstra in LP-TRAP.
6.2 Influence of LP-TRAP input segment length on FER and WER.
6.3 Performance of optimized LP-TRAPs on S/N task.
6.4 Performance of optimized LP-TRAPs on CTS task.
6.5 Performance of warped LP-TRAPs as a function of fp, ncep, warp. S/N task.
6.6 Performance of warped LP-TRAPs, CTS task.
7.1 Best reached FER and WER for three feature pair types.
7.2 Adding frequency derivatives to M-RASTA features.
7.3 Looking for a suitable number of temporal filters in M-RASTA.
7.4 Performance of M-RASTA on CTS.
7.5 Influence of channel mismatch on WER (M-RASTA and other features).
7.6 Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on S/N task.
7.7 Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on CTS task.
8.1 WER as a function of false alarms per hour on digits recognition task.
8.2 Results on a joint set of digits and unconstrained speech.
A.1 Class coverage of S/N task, MLP sets 1 & 2.
A.2 Class coverage of CTS task, training sets.
B.1 Summary of experiments on CTS task.
C.1 Emulating non-separable classes in TRAP Band-MLP targets.
D.1 Performance of phoneme states used as MLP targets, S/N task.
D.2 Performance of phoneme states used as MLP targets, CTS task.
List of Abbreviations
ANN Artificial Neural Network
AR Autoregressive (model)
AR-MA Autoregressive - Moving Average (model)
ASR Automatic Speech Recognition
CRBE CRitical Band Energy
CTS Conversational Telephone Speech
DCT Discrete Cosine Transform
DTW Dynamic Time Warping
FA False Alarm (rate)
FAcc Frame Accuracy
FDLP Frequency-Domain Linear Prediction
FER Frame Error Rate
FFT, F[·] Fast Fourier Transform
FIR Finite Impulse Response (filter)
FOM Figure Of Merit
GMM Gaussian Mixture Model
H[·] Hilbert transform
HMM Hidden Markov Model
HSR Human Speech Recognition
HTK Hidden Markov model ToolKit
KLT Karhunen-Loeve Transform
LDA Linear Discriminant Analysis
LP, LPC Linear Prediction, Linear Predictive Coding
LP-TRAP Linear Predictive TempoRAl Patterns
LVCSR Large-Vocabulary Continuous Speech Recognition
MFCC Mel-Frequency Cepstral Coefficients
MLP Multi-Layer Perceptron
M-RASTA Multi-resolution RelAtive SpecTrA (filtering)
OOV Out Of Vocabulary (word)
P(·) Probability
PCA Principal Components Analysis
PLP Perceptual Linear Prediction
TDLP Time-Domain Linear Prediction
TRAP TempoRAl Patterns
VTLN Vocal Tract Length Normalization
WER Word Error Rate
Chapter 1
Preface
Speech has been the most natural and important means of communication among humans
for tens of thousands of years. Speech production and hearing have had even more time
to evolve, as the ability to detect and classify sounds was key to survival. Recent
studies of cortical neurons of mammals indicate that the brain develops fundamental
classification abilities at an early prenatal stage, before the organs that capture external
stimuli are fully developed [69]. Later, the human fetus starts to react to its mother's
voice with movements. This confirms that speech production and recognition capabilities are
thoroughly optimized by nature and mutually adapted. Every day, human beings process
immense volumes of acoustic stimuli, subconsciously performing continual filtering, denoising,
channel normalization, speaker adaptation, classification, and complex high-level
grammatical and contextual searches that enable very reliable information
transmission.
The idea of automatic recognition of speech by machine is much younger (although
its origins arguably go back to the invention of writing, some five thousand years
ago). Attempts at automatic conversion between the written and spoken form of a message,
known as speech synthesis (Text To Speech, TTS) and Automatic Speech Recognition
(ASR), were triggered by Edison's invention of the phonograph in 1877, but seriously evolved
only after major breakthroughs in the 1960s, namely the Fast Fourier Transform [28], cepstral analysis
[82], Linear Predictive Coding [13, 57], Dynamic Time Warping [18] and Hidden Markov
Modeling [85]. Within half a century of development, boosted by the recent boom of
computers and digital technologies allowing extreme amounts of data to be acquired and
processed, there have been gradual and considerable achievements. However, the ultimate
goal, a system able to substitute humans in this task, has not yet been
reached. The reason is that the task is quite complex and current ASR is very fragile. It
is now possible to build a laboratory system tuned to deliver good transcript of recorded
utterances, but when exposed to real-life voice, the system usually fails due to the lack of
robustness [43]. Even current commercial solutions need additional in-place tuning
and adaptation after deployment. This suggests that we still do not know how to properly
process the speech data. What is it that we should extract from the data, and how should
we present it to the learning and classification systems?
Apart from the linguistic message, which is the target for ASR, speech contains
much more information. It describes the speaker himself, his mood (joy, anxiety, excitement),
even his health (hoarseness, sleepiness). Speech also differs with the context the
speaker is in. Calling a friend on the phone would likely be more casual than a
formal talk to an audience. All those aspects are called the intra-speaker variations. The
changes from one speaker to another introduce the inter-speaker variations. Furthermore,
there is information in the speech about the transmission channel and microphone,
and the acoustic image of the surroundings, called background noise (e.g. other sounds, speech,
music, noise). Part of this non-linguistic information may be important in other
domains such as speech coding, speaker recognition or health monitoring; for
ASR, however, it represents unwanted variability which should be suppressed. The ASR approach,
which will be introduced in the following section, is able to deal with this diversity to some
extent, as proved by years of use. On the one hand, this is a good reason to stick to
this knowledge and build upon it; on the other hand, there is a challenge for novel feature
extraction approaches, which is the main concern of this thesis.
The commonly used short-term features such as MFCC or PLP [74, 47] encode the
envelope of the power spectrum into cepstral features. Such a representation is motivated
by findings from speech production (LPC analysis, cepstral analysis) and, mainly, speech
perception (the Bark, ERB and Mel frequency scales, equal-loudness perception, the intensity-
loudness power law [47]). Though this knowledge makes the features reasonably speaker-
independent, they are still vulnerable to distortions. For example, in cheap AC-powered
radios, harmonics of 50 Hz often leak into the speech, which does not noticeably influence
intelligibility. However, such a distortion alters the overall spectral shape, which affects
the cepstral coefficients, and a recognizer based on short-term features may thus fail.
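As a concrete sketch of this short-term front-end (windowed frame, power spectrum, mel filterbank, logarithm, cosine transform), the following minimal example computes MFCC-style cepstra for one frame. The frame length, sampling rate, number of mel filters and number of cepstra are illustrative choices, not the exact settings used later in this thesis.

```python
# Minimal sketch of MFCC-style short-term feature extraction.
# All parameter values below are illustrative, not the thesis settings.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fs, n_filters=23, n_ceps=13):
    n_fft = len(frame)
    # Windowed power spectrum of one frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, fs) @ spectrum
    log_e = np.log(energies + 1e-10)
    # DCT-II of log filterbank energies yields the cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_e

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 8000.0)  # toy 440 Hz frame
ceps = mfcc_frame(frame, fs=8000)
```

A full front-end would slide such frames every 10 ms or so and stack the resulting cepstral vectors into the observation sequence.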
Other obstacles arise easily in real environments: changing the distance of
the speaker from the microphone, or even changing the microphone itself, causes a change
in the cepstrum which may have a severe impact on ASR performance. Current commercial
dictation systems (e.g. from Nuance) often supply a custom close-talk microphone to
alleviate this source of variability. This issue can generally be treated either with similar
hardware workarounds, which pose unpleasant constraints on the user and make the
system less portable, or by more sophisticated signal processing means. Many techniques
have been developed for attenuating such distortions. Additive noises can be reasonably
suppressed with many variants of spectral subtraction, e.g. [90, 71, 34, 73]. Channel noise
suppression techniques operate on the logarithmic spectrum or cepstrum, where convolution
transforms into addition; examples are spectral “normalization” by subtracting the long-term
average from the cepstrum [12], or RASTA filtering, which allows on-line processing [52]. However,
these techniques are knowledge-based and place further assumptions on the speech; for
example, the noise has to be stationary, or at least slowly changing, for spectral subtraction
to work. Generally, these techniques can only post-process existing, non-robust
features to make them more robust. But consider speech that is
corrupted by severe channel noise in a particular frequency sub-band, say by a complete
deletion of that sub-band. It has been shown that humans are amazingly resistant to such
phenomena: even deleting everything within the 800 Hz–4 kHz band from the speech
did not prevent human listeners from recognizing nonsense syllables at about a 90% rate [70].
Given the standard short-term cepstral features derived from such speech, it is impossible
to recover the “clean” features. Nevertheless, there exist ways to overcome similar issues
in ASR. This work shows that features derived from the spectral dynamics, rather than
from the absolute spectra, can conveniently serve this purpose.
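To make the above concrete, the following is a minimal sketch of power spectral subtraction, one of the additive-noise techniques cited above; the over-subtraction factor, flooring constant, and toy data are illustrative, not taken from any of the cited variants.

```python
import numpy as np

def spectral_subtraction(frames, noise_psd, alpha=2.0, beta=0.01):
    """Basic power spectral subtraction, a minimal sketch.

    frames    : (T, F) array of short-term power spectra
    noise_psd : (F,) noise power estimate, e.g. averaged over
                leading non-speech frames (assumed stationary)
    alpha     : over-subtraction factor
    beta      : spectral floor to avoid negative power
    """
    cleaned = frames - alpha * noise_psd
    # Flooring: keep a small fraction of the noisy power instead of
    # clipping to zero ("musical noise" is the usual artifact).
    return np.maximum(cleaned, beta * frames)

# Toy usage: constant unit noise plus a speech-like burst in one bin
noise = np.full(4, 1.0)
noisy = np.tile(noise, (3, 1))
noisy[1, 2] += 10.0                       # "speech" energy
clean = spectral_subtraction(noisy, noise, alpha=1.0)
```

Flooring against a fraction of the noisy power is one common way to limit the artifacts that plain half-wave rectification produces.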
Chapter 2
Introduction
To position the thesis within the ASR framework, a typical speech recognition system
is introduced first, with its strengths and weaknesses, followed by the motivation for the
presented study.
2.1 Statistical speech recognizer
The main question of a statistical ASR system is: “What is the most likely sequence
of words M given an acoustic observation X?” The probability of a word sequence can
be written as P (M|X,Θ), where M = {m1,m2, . . . ,mN} is the sequence of words mi,
X = {x1,x2, . . . ,xM} is a sequence of acoustic observations represented by feature vectors
xi, and Θ are parameters of the model. The most likely sequence is then
M = arg max_M P (M|X,Θ). (2.1)
As this probability cannot be evaluated directly, we apply Bayes rule:
M = arg max_M [P (X|M,Θ) P (M|Θ)] / P (X|Θ). (2.2)
We get two terms in the numerator. P (X|M,Θ) is the probability of the acoustic
observation given a sequence of words. P (M|Θ) is the probability of the word sequence M
independent of the acoustic observation. P (X|Θ) in the denominator is the a priori proba-
bility of the acoustic observation, which is constant over all word sequences M and can be
omitted due to the argmax. In principle, eq. 2.2 enables independent acoustic and language
modeling, yet every word has to be modeled with an individual model. This is inefficient
as soon as there are many words in the vocabulary and, for large-vocabulary ASR, it is
even impossible. But if there exists a set of sub-word units onto which all words can be
mapped, we can expand the numerator in eq. 2.2. Let us consider a set of units called
phones Q = {q1, q2, . . . , qK}. If a sequence of words M can be transcribed with a set of
possible phone sequences Q, then the probability of the word sequence can be evaluated
by summing over all phone sequences. Eq. 2.2 then becomes
M = arg max_M Σ_Q P (X|Q,M,Θ) P (Q,M|Θ) (2.3)
= arg max_M Σ_Q P (X|Q,Θ_AM) P (Q|M,Θ_PM) P (M|Θ_LM). (2.4)
In the step from eq. 2.3 to eq. 2.4, in the first term we assume that the acoustic
observation is conditionally independent of the word sequence given the phone sequence
(in other words, it does not matter how the words transcribe into phones). The second term
was simply factorized into two independent factors. According to eq. 2.4, the ASR problem
splits into three parts:
P (X|Q,ΘAM ) – likelihood of the acoustic observation given a fixed set of sub-word units
and given a subset of parameters known as Acoustic model (likelihood that it is
the observation sequence X that the sequence of phonemes Q has generated).
P (Q|M,ΘPM ) – likelihood of the phoneme sequence Q given the word sequence M and
the parameters subset called Pronunciation model (likelihood that the words M
transcribe as Q – qualified by pronunciation dictionary).
P (M|ΘLM ) – likelihood of the word sequence M given parameters of the Language
model (this introduces grammatical constraints).
Estimation of these probabilities constitutes three wide branches of ASR research
in their own right. Pronunciation modeling deals with variations of pronunciation, typically
using finite state automata which model transition probabilities between phones using a
priori or data-driven rules. Language modeling incorporates a knowledge about language
(grammar, semantics) into ASR typically by N-gram grammars which model probabilities
of transitions between N subsequent words. The models are trained on large text corpora.
The goal of acoustic modeling is a mapping between the acoustic input and the chosen
sub-word units. It is typically accomplished by Hidden Markov Models (HMM), which
model the temporal structure and rate variations of speech with finite state automata; a
measure of similarity between a particular feature and a sub-word unit is quantified by Gaussian
probability density functions or neural networks. The parameters of the acoustic model are
trained on large labeled speech corpora. This thesis is oriented towards acoustic modeling,
therefore it will not further discuss the language-related parts of ASR.
Fig. 2.1 shows how the three ASR branches are distributed in a typical ASR system.
The recognizer comprises speech data acquisition and preprocessing, extraction of features,
estimation of class probabilities, and decoding. It can be split into a so-called front-end and
back-end, a division originating from the notion of distributed telephone ASR: the front-end
is implemented at the telephone side and the back-end at the server side.
2.1.1 Front-end
The preprocessing block first converts the input speech into a digitized signal and then
possibly applies DSP enhancement techniques such as DC removal, normalization, equal-
ization, and noise suppression. Some noise suppression techniques can be efficiently inte-
grated in feature extraction as will be shown later. Subsequently, the front-end extracts
[Figure 2.1 here: block diagram with preprocessing and feature extraction in the front-end,
class likelihood estimation and decoding in the back-end; the back-end draws on the
pronunciation, acoustic, and language models, and its output is text.]
Figure 2.1: Diagram of a typical ASR.
features from the speech. The goal is to preserve the important linguistic and other rel-
evant information and to reduce the redundancy of the data. The information rate of
telephone-quality speech is about 64 kbps (the present European telephone standard). Typ-
ical algorithms used in low-rate speech coding need an order of magnitude less (2400 bps
for the LPC10 standard), and intelligibility can still be preserved at bitrates <500 bps
[79]. ASR differs from speech coding in that it only needs to preserve the linguistic mes-
sage, whose bitrate is around 50 bps [35], three orders of magnitude less than the original
speech. The algorithms thus aim at separating the relevant information from the ir-
relevant. Moreover, the subsequent classification and decoding algorithms are generally
computationally-intensive and it would not be efficient to process the full bandwidth sig-
nal. Development of front-end algorithms also helps to recover the underlying processes
behind speech and to understand the generation and perception processes.
Typically, the feature extraction is based on some form of short-term spectrum esti-
mated in a frame-based manner. Every 10–20 ms the signal is windowed with a 20–30 ms
long Hamming window. By taking FFT of the windowed frame we move to the frequency
domain and get about 100–200 samples of the spectrum. Such an approach assumes
stationarity within the frame. Though this assumption is not met exactly, the 20–30 ms
long frame is a reasonable approximation. The features are derived only from the mag-
nitude since the phase is believed not to contain much of the relevant information for
speech recognition1. The envelope of the magnitude (or power) spectrum is smoothed
with a bank of auditory-like filters. In the frequency domain this represents a simple binning
of the spectral samples. The filter bank contains 15–30 filters approximately equidistant
on a logarithmic frequency axis which emulates a property of the human speech percep-
tion. The dynamics of the filtered spectrum is compressed by the logarithm. The final
auditory-like spectrum with 15–30 bins is further smoothed and decorrelated either with
discrete cosine transform yielding Mel-Frequency Cepstral Coefficients (MFCC) [74, 29],
or with linear prediction with further auditory-like operations yielding Perceptual Linear
Predictive cepstrum (PLP) [47]. By these operations, the feature extraction reduces the
bandwidth from 8000 numbers per second (speech) to about 1200 numbers per second
(features). Recalling that the linguistic rate of speech is about 50 bits per second, there is
still a lot of redundancy that could be reduced. However, other relevant yet non-linguistic
factors may need to be preserved in features for a successful decoding in the back-end.
1The fact that the ear is phase deaf is referred to as Ohm’s Law of Acoustics [8].
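The chain described above (windowing, FFT magnitude, auditory-like binning, log compression, DCT) can be sketched end-to-end. This is a simplified MFCC-style illustration; the filter-bank construction and all sizes below are assumptions made for the sketch, not a standard-compliant implementation.

```python
import numpy as np

def mfcc_like(signal, fs=8000, frame_len=200, hop=80, n_filt=20, n_ceps=13):
    """Rough MFCC-style features: window -> |FFT| -> mel-like
    triangular filter bank -> log -> DCT. All details simplified."""
    win = np.hamming(frame_len)
    n_fft = 256
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # mel-spaced filter edges mapped to FFT bin indices
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filt + 2))
    bins = np.floor((n_fft // 2) * edges / (fs / 2.0)).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)
        if r > c:
            fbank[i, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)
    # DCT-II matrix: smooths and decorrelates the log filter-bank energies
    k = np.arange(n_ceps)[:, None] * (np.arange(n_filt) + 0.5)[None, :]
    dct = np.cos(np.pi * k / n_filt)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * win
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        feats.append(dct @ np.log(fbank @ power + 1e-10))
    return np.array(feats)

x = np.random.default_rng(0).standard_normal(800)  # 100 ms of "speech"
F = mfcc_like(x)   # one 13-dimensional feature vector per 10 ms hop
```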
2.1.2 Back-end
The feature vectors are passed to the back-end, where the goal is to map the speech
frames onto the chosen sub-word units and to decode the message. Since phones2 are widely
used sub-word units, we will demonstrate the idea using phones. The mapping
from feature vectors to phones is not one-to-one. One phone is an event in time with a
characteristic temporal structure, which usually spans over several frames. This structure
helps to identify the phone. Given that speech frames themselves are assumed stationary,
to preserve the phone structure, every phone has to be modeled by several phases, called
states. Typically, one phone is formed by 3–7 subsequent states. The problem of mapping
features to phonemes thus splits into two parts. The first is a static pattern matching
(comparing the feature to a template) and the second is a sequence recognition (time-
alignment).
To be able to match a state of a phone to a feature vector, there must exist some
discriminative function capable of evaluating a measure of similarity between the observed
feature and a stored statistical template (created during a training phase). For this purpose
mixtures of multidimensional Gaussian probability density functions (Gaussian Mixture
Model, GMM) are commonly used, which evaluate the similarity by means of a likelihood.
The GMMs assume the features to have a multivariate Gaussian distribution, which puts
constraints on front-end algorithms. Alternatively, the similarity can be estimated by a
trained neural network. Neural networks generally put fewer constraints on features and
have some advantages over GMMs: they allow for an arbitrary mapping between input
and output, and they can produce posterior probabilities3.
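As an illustration, the GMM emission score reduces to a log-sum-exp over per-component Gaussian log-densities. The sketch below assumes diagonal covariances, as is common in ASR acoustic models; all parameter values are toys.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance
    Gaussian mixture, a minimal sketch of a GMM emission score.

    weights   : (K,)    mixture weights, summing to 1
    means     : (K, D)  component means
    variances : (K, D)  per-dimension variances
    """
    D = x.shape[0]
    # per-component log N(x; mu_k, diag(var_k))
    logdet = np.sum(np.log(variances), axis=1)
    maha = np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = -0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    # weighted log-sum-exp over components for numerical stability
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Single standard-normal component in 1-D, evaluated at its mean:
ll = gmm_loglik(np.zeros(1), np.ones(1), np.zeros((1, 1)), np.ones((1, 1)))
```

Note that the returned value is a likelihood score with no upper bound of 1, unlike the posterior probabilities produced by a neural network.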
Approximating fluent speech by a sequence of distinct stationary segments may
be quite crude, yet it allows for using well-developed mathematical tools. The natural
variability of speech requires the similarity measures mentioned above and also some dy-
namic time-alignment between states and frames. All of this is a task for Hidden Markov
Models (HMM). HMMs are stochastic finite state automata. The HMM is a generative
model, meaning that it aims at modeling the underlying process of speech generation. It
consists of several states S = {s1, s2, . . . , sK} and a topology of connections between them.
As speech is a temporal event, a sequential (left-to-right) model is used most often, see
example in Fig.2.2. There is a set of parameters associated with an HMM:
Transition probabilities a(sl|sk) – probabilities of a transition from state sk to sl (or
to stay at sk when k = l). These are modeled by Markov models of the first order,
which means that the next state depends only on the current state and not on any
previous states. The first order assumption allows for estimating the hidden state
sequence with only a small time context.
Emission probability b(xn|si) – a measure of a match of the observed feature and the
state si’s template, modeled by the above mentioned probability density functions.
Emission probability estimates how likely it is that the state si emitted (generated)
the observation xn.
2The phone is an elementary acoustic unit while a phoneme is a similar linguistic unit. Since these
units can often be mapped unambiguously, we will sometimes interchange these terms in the text.
3A probability is a value between 0 and 1, as opposed to a likelihood, which generally has no limits. A
system producing posterior probabilities has discriminative properties by nature.
[Figure 2.2 here: a three-state left-to-right HMM with states s1, s2, s3, self-loop
transitions a(s1|s1), a(s2|s2), a(s3|s3), forward transitions a(s2|s1), a(s3|s2), and
emission probabilities b(x|s1), b(x|s2), b(x|s3).]
Figure 2.2: Example of a three-state sequential HMM, characterized by transition proba-
bilities a(sk|sl) and emission probabilities b(x|si).
The joint likelihood of passing the observations through the model can be evaluated
using these probabilities under the two following conditions. The first is the above-mentioned
1st order Markov assumption; the second is that the observations are assumed in-
dependent of the previous states and features. If the assignment between features and
states were known, the joint probability would be a simple product of probability terms.
However, the sequence of states is unknown – hidden, therefore the probability that the
sequence of acoustic vectors X = x1,x2, . . . ,xT was emitted by the model M is given by
a sum over all possible state sequences S = s(1), s(2), . . . , s(T ),
P (X|Θ_M) = Σ_S a(s(1)) Π_{t=1}^{T} b(x_t|s(t)) a(s(t+1)|s(t)), (2.5)
where a(s(1)) is an a priori probability of beginning at state s(1). There exist a number
of well-developed algorithms for efficient HMM training and decoding, see e.g. [85].
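One such algorithm is the forward recursion, which evaluates eq. 2.5 in O(TK²) instead of summing over all K^T state sequences. A sketch on toy parameters (the final exit transition is ignored for brevity):

```python
import numpy as np

def forward_likelihood(A, B, a0):
    """Forward algorithm: evaluates the HMM observation likelihood
    without enumerating all hidden state sequences.

    A  : (K, K) transition matrix, A[k, l] = a(s_l | s_k)
    B  : (T, K) emission likelihoods, B[t, k] = b(x_t | s_k)
    a0 : (K,)   initial state probabilities a(s(1))
    """
    alpha = a0 * B[0]                 # joint prob. of x_1 and first state
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]    # propagate states, then emit x_t
    return alpha.sum()                # marginalize the hidden states

# Two states, two frames; compare against explicit path enumeration
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
a0 = np.array([1.0, 0.0])
p = forward_likelihood(A, B, a0)
brute = sum(a0[i] * B[0, i] * A[i, j] * B[1, j]
            for i in range(2) for j in range(2))
```

The recursion and the brute-force sum over paths agree exactly, which is the point of the dynamic-programming reformulation.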
2.2 Incorporating spectral dynamics in features
It can be seen in Fig. 2.1 that the back-end of the ASR system is a complex multi-field
task which takes into account all the a priori lexical, phonetic, and grammatical properties
of the language that are incorporated in the pronunciation and language models. However,
the main information about the message comes through the acoustic model by means of
speech features, which arise from the front-end processing. An appropriate treatment of
speech in the front-end is therefore fundamental: if some vital speech cues are lost or
distorted by misguided manipulation during feature extraction, there is no way
to recover them at the back-end. On the other hand, the complex information-merging
processes that take place at the back-end can perform reliably only when the acoustic
input supplies highly relevant and discriminative features. Therefore, on the journey towards
more robust ASR it seems coherent to put effort into developing a robust front-end, rather
than trying to reclaim the information later at the back-end using equalization, compensation,
normalization, and other techniques. This idea is the main motivation for the front-end
orientation of this thesis.
2.2.1 Long time context
Speech is a means of communication. The communication channel consists of three
parts: the transmitter (the speaker; production), the transmission medium (the air pressure),
and the receiver (the listener; perception). Comparing HSR and ASR, both systems share
the first two blocks; there is thus an analogy between the listener and the recognizer.
Since evolution has optimized human perception’s use of the channel, why not
emulate it in ASR? This idea motivated the perception-based approach. J. B. Allen
[8] wrote: “Until the performance of automatic speech recognition hardware surpasses
human performance in accuracy and robustness, we stand to gain by understanding the
basic principles behind how humans recognize speech.” So far, ASR based on
short-term analysis remains fragile and far behind HSR, so it pays to look at
human recognition.
Inspiration by Humans
Temporal properties of speech are determined by the inertia of the production organs,
which causes fluent transitions between phones known as co-articulation. The phones are not
acoustically distinct; instead, they overlap. Significant information about a phoneme
can be found as far as 200 ms from its center [99], and most of the co-articulation happens
within a syllable-length (150–200 ms) region [44]. Perception consistently exhibits a time
constant of about 200 ms, given by the known property of temporal masking. This suggests
that to capture complete information about a phone in a signal, a segment about half a
second long needs to be selected.
Introducing Long Context in ASR
In current ASR the temporal evolution of speech is modeled poorly. Features are
derived from a mere 25 ms, and the HMMs, where the temporal properties should be modeled,
are constrained by the first-order assumption (see above). The need for longer temporal
evidence in ASR is further documented by the success of delta features [38], which describe
to some extent the local evolution of features. The suggestion that it is the change in spectrum
which carries the information was verified in perceptual experiments [39] and, for ASR,
confirmed by RASTA filtering, which eliminates the steady components of the spectrum
[52]. The study of modulations in speech supports this notion as well: most of the
modulations occur between 1–16 Hz, with a peak at 4 Hz, which again corresponds to
time constants of 150–250 ms [17, 10]. This suggests that the features extracted
“across frequency” might be better extracted “across time”, see Fig. 2.3.
Harvey Fletcher’s HSR experiments introduced the critical-band frequency channels,
and he suggested that humans use these channels independently of one another [36]. The
“across time” processing in the independent channels forms the core of the recent TRAP
features [53], which have been shown to improve the robustness and performance of ASR.
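The “across time” extraction can be sketched as cutting, for each frame, one temporal trajectory per critical band out of the auditory spectrogram; the context length and dimensions below are illustrative only.

```python
import numpy as np

def extract_traps(spectrogram, context=50):
    """Cut TRAP-like inputs out of an auditory spectrogram: for each
    frame, take each band's energy trajectory over +/- `context`
    frames ("across time"), a minimal sketch.

    spectrogram : (T, B) log energies, T frames, B critical bands
    returns     : (T - 2*context, B, 2*context + 1)
    """
    T, B = spectrogram.shape
    L = 2 * context + 1
    traps = np.empty((T - 2 * context, B, L))
    for t in range(context, T - context):
        # one temporal trajectory per band, centred on frame t
        traps[t - context] = spectrogram[t - context:t + context + 1].T
    return traps

spec = np.random.default_rng(1).standard_normal((120, 15))
tr = extract_traps(spec, context=50)   # 101-point trajectories
```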
Figure 2.3: Speech spectrograms. Left: Short-term features are extracted from one speech
frame “across frequency”. Right: TRAP-like features are extracted in every frequency
band from a long-term trajectory, “across time”.
2.3 Thesis overview
2.3.1 Studied topics
This thesis follows the line of features extracted from the temporal evolution of the
spectrum. The objective is to find new ways of converting the dynamics of the spectro-
temporal representation into features that allow for better robustness against the
non-linguistic variation in speech than the existing techniques offer. Several par-
ticular ideas will be studied in the next chapters:
• It was mentioned above that through co-articulation the information about a phoneme
can spread over an interval of up to half a second. But the density of the information along
those 500 ms may not be homogeneous, and the uniform attention paid to the segment
may not be optimal. In the human eye most of the information also comes from the
center. How can more emphasis be put on the central part and less on the boundaries?
Would it help ASR?
Answers to these questions will be sought in Chapter 5, Information in time-
frequency plane.
• Frame-based analysis assumes stationarity within the examined frame. However,
speech is fluent and co-articulated; there is no underlying framing in it. Some special
speech events such as plosives cannot be preserved in the short-term spectrum, which
smooths them out, especially when the phase is not used. The magnitude spectrum
of a sequence and that of the same sequence time-reversed are identical! Is there a way
to extract features which would preserve the temporal properties even within a frame?
And is there a need for preserving such fine details?
These questions will be discussed in Chapter 6 on Linear Predictive Temporal Pat-
terns.
• The modulation spectral domain allows for partial separation of speech from non-
linguistic artifacts, thanks to the limited modulation range of speech. The sepa-
ration can be implemented by temporal filtering of the sub-band energy trajectories.
Replacing a single filter with a bank of multi-resolution filters can represent a frequency
decomposition. Particular impulse responses derived from the Gaussian function seem
to be consistent with the evolving knowledge about the mammalian auditory cortex.
Would it be desirable to combine these thoughts in a new speech representation?
Would it have some advantage over other features?
These ideas will be explored in Chapter 7 on Multi-resolution RASTA filtering.
• Daily experience suggests that not all words in the conversation, but only a few im-
portant ones, need to be accurately recognized for satisfactory speech communication
among human beings. The important key-words are more likely to be rare-occurring
high-information-valued words. Human listeners can identify such words in the con-
versation and possibly devote extra effort to their decoding. On the other hand, in
a typical ASR, the acoustics of frequent words are likely to be better estimated in the
training phase, and the language model is also likely to substitute frequent words for rare
ones. As a consequence, important rare words are less likely to be well recognized.
Keyword spotting bypasses this problem by attempting to find and recognize only
certain words in the utterance while ignoring the rest.
An intermediate product of the typical ASR is the assignment between frames and
classes (see the previous section). In this thesis it will mostly be posterior probabil-
ities of phones estimated by a neural network. Visual inspection of a time sequence
of these phone estimates often gives the underlying word. It suggests that there
could be a way to automatically process the sequence and detect only the words of
interest. This is further discussed in Chapter 8, Extensions.
2.3.2 Thesis outline
The thesis extends the previous studies of feature extraction from spectro-temporal plane
and proposes novel approaches. Since several self-contained topics are studied, an indi-
vidual chapter is devoted to each of them. To be able to evaluate and compare the properties
of the presented front-ends, the evaluation tasks and criteria are defined before the
front-end chapters. The experiments related to the individual approaches are then
always presented within the appropriate chapter.
• Chapter 3 gives a technical background for the subsequent chapters. It surveys the
relevant state-of-the-art and describes the techniques that are used and adopted in
this thesis.
• Chapter 4 introduces the recognition and evaluation framework which is used
throughout the rest of the thesis. It first describes the tools used for feature extrac-
tion and recognition. Subsequently, the evaluation tasks and corpora are presented,
along with typical front-ends chosen as baselines to which the proposed systems will
be compared. For a rough notion of the task complexity, the baseline experimental
results are also given.
• Chapter 5 studies the properties of the adopted approaches for extracting features
from the time-frequency plane. The first goal is to limit the large input space from
where the features are extracted. Further it studies the distribution of the ASR-
relevant information in the time-frequency plane and introduces the idea of time
axis warping. Finally, a small study on target classes for the MLP classifier is given.
It is shown that a temporal context of 1000 ms is large enough to capture all the
useful information. The context can be reduced to 200–400 ms without a
significant drop in performance. Most of the ASR-relevant information seems to
come from the central frames; distant frames can be heavily sub-sampled without
loss in performance, which can reduce TRAP complexity by up to a factor of 5. Finally,
it is shown that replacing phoneme targets in the MLP classifier with phoneme-state
targets (3 states per phoneme) can significantly improve ASR.
• Chapter 6 is devoted to Linear Predictive Temporal Patterns (LP-TRAP) as a means
of presenting the information in the time-frequency plane to the neural net classifier.
LP-TRAP bypasses the common frame-based processing in the front-end and allows
for preserving fine temporal structure of the energy trajectory in frequency sub-
bands. It is done by modeling the energy trajectory by Linear Prediction.
As the novel LP-TRAP features have a number of tunable parameters, these are
optimized first. It is further shown that LP-TRAP features can outperform
conventional approaches. The idea of time-warping is implemented and shown to
improve the performance.
• Chapter 7 presents the Multi-Resolution RASTA Filtering (M-RASTA). The tech-
nique extends earlier works on delta features and RASTA filtering by processing
temporal trajectories by a bank of band-pass filters with varying resolutions. Since
the applied filters have zero-mean impulse responses, the technique is inherently
robust to linear distortions.
The M-RASTA features are shown to outperform the baseline as well as the other approaches
proposed in this thesis. Their resistance to channel noise is illustrated. The M-RASTA
features are shown to be complementary to conventional features; combining M-
RASTA with conventional features yields an additional improvement in performance.
• Chapter 8 views the ASR from the perspective of the keyword spotting. It presents
an alternative approach to the ASR in which each targeted word is classified by
a separate binary classifier against all other sounds. No time alignment is done.
The recognizer is formed by a cascade of two neural network classifiers, the first
estimating phoneme probabilities and the second estimating word probabilities.
On a small-vocabulary task the system does not yet reach the performance of
the state-of-the-art, but its simplicity, the ease of adding new target words, and its
inherent resistance to out-of-vocabulary sounds may prove a significant advantage in
many applications.
• Chapter 9 summarizes the findings from the earlier chapters, draws final conclusions,
summarizes the contribution of the thesis and sketches the future research.
2.3.3 Main goals
The main goals of the thesis can be summarized as follows:
1. To get acquainted with the state-of-the-art in feature extraction from spectral dy-
namics with particular interest in long-term features utilizing neural network classi-
fiers.
2. To study the chosen techniques and to propose improvements with respect to the
robustness against non-linguistic factors and the complexity.
3. To design a new robust speech representation from spectral dynamics using available
MLP and HMM technology. To propose an alternative to the HMM framework for
decoding words.
4. To implement the developed algorithms in standalone tools (C++), to assemble ASR
evaluation tasks, and to integrate the proposed algorithms in these tasks.
5. To verify and optimize the properties of the proposed techniques using evaluation
tasks. To study their strengths and weaknesses, to compare them to the existing
state-of-the-art features and to assess the contribution to the ASR.
Chapter 3
Survey: From short term
spectrum to spectral dynamics
This chapter surveys recent knowledge about spectral dynamics and its
relevance to state-of-the-art ASR. Of particular interest are techniques for feature extrac-
tion which consider a long temporal context or otherwise relate to the field. The ideas and
approaches adopted in this work are described in detail.
3.1 Obtaining auditory spectrogram
Most of today’s features are derived from some form of spectral representation of speech,
which was originally motivated by the human sound perception.
3.1.1 Frequency resolution of speech spectrum
Harvey Fletcher’s HSR simultaneous masking experiments revealed critical-band frequency
channels, and he suggested that humans use these channels independently of one an-
other [36]. The bandwidth of these channels was found to be roughly one octave. The
logarithmic-like behavior of human perception led to the currently popular melodic
and Bark scale frequency warpings [74, 47]. Some studies report about 20 frequency chan-
nels to be suitable for modeling the perception [8]. By applying these filter banks to
speech, the frequency resolution gets effectively reduced. However, it was shown that such
reduction does not affect speech intelligibility. Recent studies suggest that even less than
10 frequency channels can still preserve the intelligibility [44].
Similar observations were noted in ASR: Burget and Hermansky [21] inspected base
vectors of Linear Discriminant Analysis (LDA) which was applied along the bins of linear
short-term spectrum. The resulting data-driven bases exhibit critical-band-like frequency
resolution. This was also confirmed in [72]. The need for higher resolution at low frequencies,
decreasing approximately logarithmically towards higher frequencies, was also confirmed
by Umesh et al. in a search for a frequency warping that would minimize the differences
between speakers [95].
3.1.2 Bank of filters in time domain
Thanks to the Cooley–Tukey FFT algorithm, the auditory spectrum is typ-
ically obtained from the short-term FFT by applying a bank of filters in the frequency
domain. However, the sub-band energies can also be obtained by emulating analog band-
pass filters in the time domain, which can enable better time resolution and get closer to
human perception. This idea leads to the LP-TRAP features presented in Chapter 6.
Motlíček [81, 80] applies a set of auditory-like Gammatone filters directly to the speech
in the time domain. The band-pass signals are demodulated and low-pass filtered to yield the
energy envelope trajectories, which are subsequently sampled at the required frame rate.
Tyagi et al. [93] calculate their so-called fepstrum in a similar manner, using a time-domain
bank of band-pass filters that are linearly spaced and have rectangular frequency
responses. They subsequently take the logarithm of every band’s output magnitude and
then down-sample this “log-energy” trajectory. Finally, they apply a DCT along each tra-
jectory, which yields the fepstrum. Athineos [14] attempts to derive the true sub-band
energies from the Hilbert envelope of the signal. Being close to this approach, Dimitriadis
et al. [32] approximate the sub-band energies by the Teager-Kaiser operator which is less
computationally expensive.
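The common thread of these time-domain schemes, band-pass filtering followed by envelope extraction, can be sketched as follows; an ideal FFT-domain mask stands in here for a Gammatone or rectangular filter, and the Hilbert envelope is taken as the magnitude of the analytic signal.

```python
import numpy as np

def band_envelope(x, fs, f_lo, f_hi):
    """Sub-band energy envelope in the time domain: band-pass the
    signal (ideal FFT-domain mask, a stand-in for a real filter),
    then take the magnitude of the analytic signal."""
    N = len(x)
    X = np.fft.fft(x)
    f = np.fft.fftfreq(N, d=1.0 / fs)
    # keep only positive frequencies in [f_lo, f_hi]; doubling them
    # directly yields the one-sided (analytic) signal
    mask = (f >= f_lo) & (f <= f_hi)
    analytic = np.fft.ifft(2 * X * mask)
    return np.abs(analytic)

fs = 1000
t = np.arange(fs) / fs
# 100 Hz carrier, amplitude-modulated at a syllable-like 4 Hz rate
x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 100 * t)
env = band_envelope(x, fs, 80, 120)   # recovers the 4 Hz modulation
```

With integer-frequency components over exactly one second, the recovered envelope matches the modulator 1 + 0.5 sin(2π·4t) to numerical precision.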
3.2 Spectrogram filtering
The time-frequency plane can be filtered along its time or frequency dimension, or both
dimensions (2-D filtering). The filtering along frequency comprises e.g. auditory filter
banks or DCT in MFCC computation. However, more important for this work is the
temporal filtering and 2-D filtering, which forms the core of the Multi-Resolution RASTA
approach presented in Chapter 7.
3.2.1 Temporal filtering and modulation spectrum
Filtering temporal trajectories of sub-band energies (or of features in general, as cepstrum is
a linear transform of log-energies) has been widely used in ASR. Virtually any processing
of the feature sequence can be viewed as filtering. Furui’s well-known delta features [38]
apply 1st and 2nd order derivatives to the sequence of cepstral coefficients. Temporal filtering
can also be used to suppress linear distortions by removing the steady component from the
sequence, as implemented in the Cepstral Mean Subtraction technique [12].
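A minimal sketch of Cepstral Mean Subtraction shows why a stationary convolutive channel cancels: the channel adds the same constant to every frame's cepstrum, so subtracting the per-coefficient mean removes it exactly.

```python
import numpy as np

def cms(cepstra):
    """Cepstral Mean Subtraction: remove the long-term average of each
    cepstral coefficient's time trajectory. A stationary linear channel
    adds a constant to the log-spectrum (hence to the cepstrum), so
    subtracting the mean cancels it. A minimal sketch."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A constant "channel" offset added to every frame disappears:
rng = np.random.default_rng(3)
clean = rng.standard_normal((100, 13))       # 100 frames of cepstra
channel = rng.standard_normal(13)            # convolutive distortion
distorted = clean + channel
residual = cms(distorted) - cms(clean)       # ~zero everywhere
```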
Generally, the time trajectory of a feature can be seen as a time signal with certain
frequency properties. It contains slow as well as fast components, representing modula-
tions. Temporal filtering can thus be seen as an operation in the modulation-frequency
domain. Assuming that 1) the communication channel is stationary and 2) the spectral
changes are the carrier of the message in speech, it follows that in the modulation-frequency
domain the channel noise can be separated from speech. Hermansky and Morgan [52] filtered out
non-speech artifacts and linear distortions from speech by band-passing the trajectories
of compressed critical-band energies with temporal RASTA filter. Arai et al. [9, 10, 11]
explored what modulations are needed for speech and speaker recognition as performed
by humans, by band-pass filtering time-trajectories of LPC or MFC coefficients and play-
ing the re-synthesized speech to human listeners. They found that modulations between
1.5–16 Hz are important and other components can be filtered out without a loss of in-
telligibility. Kanedera et al. [63, 62] similarly filtered sub-band energy trajectories and
evaluated the performance of the ASR. Consistently with Arai, they reported the 2–8 Hz
band, with a peak at 4 Hz (corresponding to the average syllable rate), to be crucial.
Avendano et al. [17] and van Vuuren [98] apply LDA along TRAPs to derive impulse re-
sponses of RASTA filters. Another example of a successful time-filtering operation is the
DCT or PCA transform applied to TRAPs in order to decorrelate the features and reduce
their dimension [88].
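As an illustration of the modulation-frequency view, a sub-band energy trajectory can be band-passed with a zero-phase FFT-domain filter; the 1.5–16 Hz pass-band follows Arai's finding, and the 100 Hz frame rate (10 ms step) is an assumed convention:

```python
import numpy as np

def modulation_bandpass(traj, frame_rate=100.0, lo=1.5, hi=16.0):
    """Keep only modulation frequencies in [lo, hi] Hz of a feature trajectory."""
    spec = np.fft.rfft(traj)
    freqs = np.fft.rfftfreq(len(traj), d=1.0 / frame_rate)
    spec[(freqs < lo) | (freqs > hi)] = 0.0      # brick-wall band-pass
    return np.fft.irfft(spec, n=len(traj))

t = np.arange(500) / 100.0                       # 5 s of frames at 100 Hz
traj = 2.0 + np.sin(2*np.pi*4*t) + np.sin(2*np.pi*30*t)  # DC + syllable-rate + fast
filtered = modulation_bandpass(traj)
# the 4 Hz (syllable-rate) component survives; the DC offset and 30 Hz are removed
```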
3.2.2 2-D Filtering
Any combination of linear frequency and temporal filtering can be interpreted as a 2-D
filtering of the time-frequency plane, for example MFCC features (DCT across frequency)
followed by RASTA filtering (filtering across time). However, 2-D filtering does not have
only this “artificial” interpretation. Recent work revealed that ASR can benefit from
explicit 2-D filtering emulating some properties of the mammalian auditory cortex [30, 31]:
Kleinschmidt, Gelbart and Meyer [66] applied a variety of 2-D Gabor filters to the
auditory spectrogram and used a data-driven feature selection method to find optimized
feature sets. Their robustness was shown on Aurora 2 & 3 tasks. They also report
that over 40% of the automatically selected features exhibit diagonal characteristics [75].
This favors the TANDEM approach over TRAP, as TRAP is not designed to preserve inter-
band relations (refer to section 3.4 for more details). Nevertheless, 2-D filtering can equip
TRAP with such relations: Grežl et al. [45] apply 3 bands × 3 frames operators to the auditory
spectrogram and improve features for TRAP MLPs and LVCSR [46, 64]. Kajarekar et al.
[60] and Valente and Hermansky [96] use LDA to derive similar 2-D discriminants.
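A single spectro-temporal Gabor filter of the kind used by Kleinschmidt et al. can be sketched as follows; all kernel sizes and modulation frequencies are illustrative, not the optimized values from [66]:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(size_t=11, size_f=9, omega_t=0.4, omega_f=0.5, sigma_t=3.0, sigma_f=2.0):
    """Real 2-D Gabor kernel: a cosine plane wave under a Gaussian envelope.
    Nonzero omega_t AND omega_f give the diagonal (spectro-temporal)
    characteristics mentioned above."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    T, F = np.meshgrid(t, f, indexing='ij')
    envelope = np.exp(-(T**2) / (2 * sigma_t**2) - (F**2) / (2 * sigma_f**2))
    carrier = np.cos(omega_t * T + omega_f * F)
    return envelope * carrier

spectrogram = np.random.rand(200, 15)                # 200 frames x 15 critical bands
g = gabor_2d()
response = convolve2d(spectrogram, g, mode='same')   # one 2-D filtered feature map
```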
3.3 Parametric spectrogram representation
Some authors have attempted to use a parametric model for spectrogram representation.
Motlíček et al. [81, 80] derived TRAPs using temporal Gammatone filters and applied an
LP model directly to the TRAPs. The fact that this approach was not successful revealed
that the phase information in a TRAP is necessary for ASR: the phase information gets
discarded in the LP model, as it is derived from the autocorrelation1.
Athineos et al. [15] used LP modeling for another sort of TRAPs, the LP-TRAPs. How-
ever, in contrast to the previous approach, they applied the LP in the frequency domain, hence
fitting the model to the shape of the LP-TRAP itself and not to its spectrum. Since here the
phase information was preserved, the approach was successful. This approach was further
developed in this work and will be described in full detail in Chapter 6. Athineos [16]
later extended the use of LP model to 2-D spectral modeling.
1 More precisely, since LP analysis minimizes a certain measure of distance between the modeled subject
and the fit in the power spectral domain, the modulus of the sub-band modulation spectrum was preserved, yet
its phase was discarded, which in turn corrupted the temporal properties of the TRAP.
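The frequency-domain trick behind LP-TRAPs can be sketched in a few lines: run LP (Levinson-Durbin on the autocorrelation) on the DCT of the signal instead of on the signal itself, so that the all-pole model fits the temporal (Hilbert) envelope rather than the power spectrum. This is only an illustrative sketch of the FDLP idea elaborated in Chapter 6, with an arbitrary segment length and model order:

```python
import numpy as np
from scipy.fft import dct

def levinson(r, order):
    """Levinson-Durbin recursion: LP coefficients a (a[0] = 1) from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err                        # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

x = np.random.randn(800)                              # one sub-band signal segment
c = dct(x, type=2, norm='ortho')                      # project to the frequency domain
r = np.correlate(c, c, mode='full')[len(c) - 1:]      # autocorrelation of DCT coeffs
a = levinson(r, 20)                                   # all-pole model of order 20
env = 1.0 / np.abs(np.fft.rfft(a, n=2 * len(x)))**2   # modeled (squared) temporal envelope
```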
16 3 Survey: From short term spectrum to spectral dynamics
3.4 Parametrizing spectrogram by probabilistic features
Whatever the way from speech to spectrogram has been, the main question remains how
to convert the spectrogram into features useful for ASR. Probabilistic features can serve this
purpose.
Probabilistic features are those which, on their way to the final feature vector, pass
through posterior probabilities of some classes, mostly phonemes. A widely used
framework for obtaining estimates of posterior probabilities is the Multi-Layer Perceptron
(MLP) neural network, owing to its simplicity and solid mathematical background. As MLPs
are frequently used throughout the thesis, they are introduced in detail in section 4.2.1. For
now, let us picture the MLP as a box being able to learn any non-linear mapping function
between its input and output vectors from the labeled training data. The training data
are presented to the MLP in the form of input vectors associated with a particular output
class. The trained MLP provides an estimate of posterior probabilities of all classes given
an input sample. Among other advantages, MLPs can reduce the feature dimension
and can be conveniently integrated into an existing ASR framework.
3.4.1 Hybrid and TANDEM architectures
Both the Hybrid and the TANDEM architecture use an MLP to project a certain portion
of the speech spectrogram onto phoneme classes.
The earlier approach, Hybrid, was proposed by Morgan and Bourlard [77]. The con-
ventional GMM/HMM system does the acoustic modeling by means of Gaussian mixtures,
whose parameters are trained by maximizing the likelihood of the observed data given the
models (ML training) [85]. In the Hybrid approach, the GMMs are replaced by an MLP,
which is trained discriminatively and can also represent the emission probabilities needed
for HMM decoding. On the one hand, the Hybrid approach simplifies the training, as the
full expectation-maximization algorithm is not needed; on the other hand, some of the
mature, powerful techniques available in GMM/HMM systems are lost, such as tied
context-dependent triphones or
model adaptation based on Maximum Likelihood Linear Regression (MLLR) [40].
The TANDEM architecture was proposed later by Hermansky et al. [48]. TANDEM by-
passes the drawbacks of both GMMs and MLPs by combining the two techniques and benefits
from the best of both. The class posteriors are again estimated by MLPs; however, instead
of being directly used as emission probabilities, they are fed to the GMM/HMM system
as features. The purpose of log and KLT transforms is to modify the feature distribution
and to decorrelate the features, so that they can be better modeled by GMMs with di-
agonal covariance matrix. TANDEM usually outperforms Hybrid, especially under noise
conditions [48], which comes at the price of more complex training and decoding.
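The log and KLT post-processing can be sketched directly: take the logarithm of the MLP posteriors and decorrelate them with the eigenvectors of their covariance matrix (the KLT basis is estimated here from the same data; the 25-dimensional cut and the 29-phoneme set are assumed values):

```python
import numpy as np

def tandem_postprocess(post, keep=25, eps=1e-10):
    """Gaussianize (log) and decorrelate (KLT) MLP posteriors for a GMM/HMM."""
    logp = np.log(post + eps)                  # compress the skewed posterior values
    logp -= logp.mean(axis=0, keepdims=True)
    cov = np.cov(logp, rowvar=False)
    w, v = np.linalg.eigh(cov)                 # KLT basis = covariance eigenvectors
    order = np.argsort(w)[::-1]                # sort by decreasing variance
    return logp @ v[:, order[:keep]]           # keep the top 'keep' components

posteriors = np.random.dirichlet(np.ones(29), size=1000)  # 1000 frames, 29 phonemes
features = tandem_postprocess(posteriors)                 # 1000 x 25 TANDEM features
```

After the projection the feature covariance is diagonal, which is exactly what a GMM with a diagonal covariance matrix assumes.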
In both architectures the MLP makes it easy to incorporate temporal context in the fea-
tures. The MLP is trained on 9 consecutive frames of PLP coefficients and estimates
phoneme posteriors, hence using a mid-term temporal context of about 100 ms. The
architectures are pictured in Figs. 3.1 and 3.2.
3.4.2 TRAP architecture
TRAPs feature in a substantial part of this work, therefore they will be discussed in more
detail.
Figure 3.1: Scheme of Hybrid MLP/HMM system.
Figure 3.2: Scheme of TANDEM system combining MLP and GMM/HMM.
Conventional short-term systems are built upon the idea of spectral patterns as the
fundamental discriminative features for ASR. Decades have proved this approach right;
however, short-term features are very sensitive to non-linguistic variability. Inspired by
Fletcher’s pioneering work [36], researchers started to think of processing the speech in
independent sub-bands [19]. Instead of slicing the spectrogram across frequency, they cut
across time, thus replacing the frequency context by temporal context.
After some initial trials [92], the number of frequency sub-bands stabilized at approxi-
mately critical-band resolution and the temporal context has widened up to 1 second. The
primary source of information was the log-critical-band spectrogram, obtained as an inter-
mediate product in PLP calculation [47]. 1000 ms long trajectories of the CRitical-Band
log-Energies (CRBE) were examined by Hermansky and Sharma [53, 54]. They calcu-
lated CRBEs from phonetically labeled corpora and assigned a phoneme label to each
frame. Subsequently, they concatenated 50 consecutive frames before and after the cur-
rent frame, thus forming 101-frames long TempoRAl Pattern (TRAP) of log-energies for
every sub-band and every frame. They calculated so called Mean TRAPs of all phonemes
by averaging all TRAPs assigned the particular phoneme label2. The study of Mean TRAPs
suggested using these prototype patterns for sub-band phonetic classification. They used
a simple distance measure to compare TRAPs of an unknown speech to Mean TRAPs
of every phoneme, and trained an MLP to estimate phoneme probabilities given these 435
distances (15 bands × 29 phonemes). Use of these posteriors in a Hybrid system did not
outperform PLPs. Even an agglomerative clustering of the Mean TRAPs into five “Broad
TRAPs”, although matching real phonetic categories nicely, did not improve ASR [53].
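The Mean TRAP computation can be sketched as follows, assuming frame-level phoneme labels are available (the sizes follow the description above; the data here are random stand-ins):

```python
import numpy as np

def mean_traps(crbe, labels, context=50, n_classes=29):
    """Average 101-frame log-energy trajectories per band, grouped by the
    phoneme label of the centre frame (Mean TRAPs of Hermansky and Sharma)."""
    T, n_bands = crbe.shape
    sums = np.zeros((n_classes, n_bands, 2 * context + 1))
    counts = np.zeros(n_classes, dtype=int)
    for t in range(context, T - context):
        ph = labels[t]
        sums[ph] += crbe[t - context: t + context + 1].T   # bands x 101 frames
        counts[ph] += 1
    counts = np.maximum(counts, 1)                          # avoid divide-by-zero
    return sums / counts[:, None, None]

crbe = np.random.randn(2000, 15)                 # critical-band log-energies
labels = np.random.randint(0, 29, size=2000)     # frame-level phoneme ids
mt = mean_traps(crbe, labels)                    # 29 phonemes x 15 bands x 101 frames
```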
Better success was reported when a non-linear neural classifier was used even for the band-
specific classification (Neural TRAP). Discriminatively trained MLPs are better able to
distinguish among classes, especially at the class boundaries, as discussed in [48].
2This illustrative experiment will be reproduced in section 5.1.1.
Figure 3.3: Scheme of TRAP system.
The general scheme of Neural TRAP (further referred to as TRAP) is pictured in
Fig. 3.3. The band-specific (or band-conditioned) MLPs as well as the merger are trained
on phoneme targets. On a small-vocabulary task, TRAP reaches performance competitive
with PLP-TANDEM; when both probabilistic features are combined by averaging in the log
domain, the result outperforms either feature set alone [53].
3.4.3 Further development of TRAP
Since Sharma and Hermansky successfully developed the Neural TRAP, more effort has
been put into this architecture. The first line of work improved the input to TRAP, i.e.
the way of obtaining the spectrogram, its filtering and various projections, as reviewed in
sections 3.1.2 and 3.2. On top of TRAP, mean and variance normalizations were shown
to improve the performance, especially under noise conditions [54, 64].
Second, the architecture itself has evolved. Much of the improvement came from
International Computer Science Institute in Berkeley, Oregon Graduate Institute and Brno
University of Technology.
Chen and Zhu proposed two major modifications, Hidden Activation TrapS (HATS)
and Tonotopic Multi-Layered Perceptron (TMLP) [101, 24, 25, 23]. HATS trains the merger
MLP using the outputs of the hidden units in the band MLPs. Hence, the merger is trained
not on posterior probabilities, but on matched filters for the basic patterns appearing in
TRAPs. The authors report that only about 20 patterns per band need to be modeled. TMLP
replaces the hierarchy of MLPs with a single network with two hidden layers, where
the units in the first, tonotopic layer discriminate patterns in every band independently
and the outputs are merged in the second, fully connected layer. The reported
improvements on LVCSR tasks are significant [102, 103].
Schwarz et al. [87] utilized TRAPs for phoneme recognition. They deal with an insuffi-
cient amount of training data by replacing the band MLPs with pre-windowed linear oper-
ations (PCA, DCT) and by splitting the temporal context into left and right parts, which
they report to require fewer training examples [88]. Recently they proposed to split not
only the temporal context but also the frequency context, and experimented with hierarchical
structures of MLPs [89].
3.4.4 Multi-stream systems
A substantial improvement in features for ASR was reached after the complementarity
of long-term and short-term features was revealed. When tuning their systems for the
best possible accuracy on a given task, researchers experimented with many ways
of combining multiple data streams.
Probabilistic features can be easily combined. If multiple ANNs (experts) are
trained on the same data set with the same targets but with different representations
(e.g. TRAP and TANDEM), one can simply average their posteriors to get an improved
estimate [59]. However, this places certain assumptions on the experts, which are not
always met; therefore a number of alternatives have been proposed [83, 78, 56, 76].
Today’s state-of-the-art systems using probabilistic features typically append some
form of long-term features to short-term features and further tune the system by using
transformations for decorrelation and dimensionality reduction [20, 101]. As an example,
the so called Combined-augmented features used by ICSI Speech group [102] consist of
39 short-term features obtained from HLDA-transformed3 PLP+∆+∆∆+∆∆∆ features,
which are mean- and variance-normalized over a conversation side, appended with 25 KLT-
decorrelated probabilistic features obtained from an inverse-entropy combination of TRAP
and PLP-TANDEM phoneme posteriors.
3HLDA stands for Heteroscedastic Linear Discriminant Analysis.
Chapter 4
Recognition and evaluation
framework
This chapter introduces the recognition and evaluation framework which is used through-
out the rest of the thesis. Habitually, the experimental framework would be introduced
in the experimental part, after the theory has been presented. This thesis, however, deals
with multiple topics which all share the same evaluation tasks, so that the results are
directly comparable; the evaluation framework and the baselines therefore have to be
introduced before the subsequent chapters.
4.1 Software tools
First, the software tools used for the experiments will be introduced. Some of them
were implemented by the author and made publicly available on the internet; others
were obtained from other research sites.
HTK
The Hidden Markov Model Toolkit (HTK) is a widely used open-source toolkit for building
and manipulating hidden Markov models [100]. It will be used for training acoustic models
and will act as the back-end in all HMM-based recognizers.
QuickNet
QuickNet is a suite of software that facilitates the use of multi-layer perceptrons (MLPs)
in statistical pattern recognition systems. It contains a program for efficiently training
MLPs, a program for using MLPs to do pattern recognition and a library of C++ objects
that handle MLP training and perform I/O operations with various file formats. QuickNet
was developed in the Speech Group at the International Computer Science Institute by
David Johnson and it is an open-source project. Further information about QuickNet can
be found on the internet [4].
Auxiliary tools for QuickNet from SPRACHcore project
SPRACHcore is a software release of the neural network speech recognition tools developed
by ICSI plus other partners and it is an open source project. The downloadable package
contains tools for neural net training and recognition, feature calculation, and sound file
manipulation. The tools are maintained by Dan Ellis and Chuck Wooters. More on
SPRACHcore is available on the internet [5]. For the purposes of this thesis the following
utilities were used:
feacalc – A feature calculation program.
feacat – A utility for conversion and trimming of data files.
pfile utils – Specialized programs to transform data in pfile format.
CtuCopy
CtuCopy is an efficient command line tool written in C++ implementing speech enhance-
ment and feature extraction algorithms. It is similar to HCopy from HTK Toolkit and
it also supports HTK file format. It uses fftw library for fast DFT calculations [1]. It
is an open-source program developed by the author and it is available under the GNU
license on the web site of the Speech Processing Group at CTU Prague [6]. Originally,
CtuCopy was developed during the author’s master’s studies to enable efficient use of speech
enhancement techniques in cascade with feature extraction. When these two blocks are im-
plemented in one tool, there is no longer any need to reconstruct the speech prior to the feature
extraction, because the spectrum obtained from the speech enhancement techniques can
be passed directly to the filter bank of the front-end. Such an all-in-one tool offers more
efficiency and flexibility in the DSP algorithms than if several single-purpose tools were used.
Later, CtuCopy was extended with further capabilities which, to the author’s knowledge,
are not available in other open-source tools.
Basic function: CtuCopy acts as a filter with speech waveform file(s) at the input, and
either a speech waveform file(s) or feature file(s) at the output, see Fig. 4.1. Several
input formats are supported: MS Wave, raw file with PCM data, A-law data or
mu-law data in both byte orders or the on-line input.
Preprocessing: The preprocessing block segments and windows the signal and can op-
tionally apply preemphasis, add dither and remove a possible DC offset.
Speech enhancement: CtuCopy implements several speech enhancing methods based
on spectral subtraction. Extended Spectral Subtraction - exten [90] combines Wiener
filtering and spectral subtraction with no need of a voice activity detector (VAD).
Other methods are based on spectral subtraction with VAD which can be either
external (data are being read from a file) or from the internal cepstral detector. The
noise suppression methods can be applied either directly to speech spectra or to the
auditory spectra obtained with a bank of filters. The enhanced speech can either
be reconstructed and written to the output file or passed to the feature extraction
process. Channel noise can be suppressed with general RASTA filtering either with
the original filter published by Hermansky [52] or with arbitrary impulse responses
specified by the external file.
Figure 4.1: Block scheme of the universal feature extractor and speech enhancer CtuCopy.
Filter banks: The bank of filters offers a set of frequency scales (melodic, Bark, expolog,
linear) and three filter shapes (triangular, rectangular, trapezoidal) which can be
arbitrarily combined to form standard or user-defined filter banks.
Feature types, file formats: In feature extraction mode a number of common features
can be extracted from either original or enhanced speech, e.g. Linear Predictive
Coefficients (LPC) or cepstral coefficients (LPCC), Perceptual Linear Predictive
cepstral coefficients (PLP), Mel-scale Cepstral Coefficients (MFCC), and magni-
tude/logarithmic spectra. Features can be saved either in HTK format or, thanks
to a C class provided by Petr Schwarz from Brno University of Technology, also in pfile
format.
“fdlp” tool
This command line tool written in C++ implements LP-TRAP feature extraction using
Frequency Domain Linear Prediction (FDLP), as will be discussed in Chapter 6. The
program acts as a filter with speech waveform file(s) at the input and a data file with
LP-TRAP features in pfile format at the output. Conceptually it is similar to CtuCopy
and it also uses the fftw library for fast DCT and DFT calculations. It was developed by the
author. Since the documentation is incomplete, fdlp has not been made publicly available.
Trapper
A tool from Brno University of Technology which, given a data file with critical band
energies computed from the speech by feacalc or CtuCopy, forms a data file with TRAP
features that are being used by QuickNet. Various transforms and normalizations can
be applied to the TRAPs. Trapper is an internal tool kindly provided to the author by
Speech Processing Group at Brno University of Technology [7].
Miscellaneous
Other one-purpose tools implemented by the author in C++ related to this thesis:
pfile gauss – 2-dimensional filtering of the TRAP features. The filtering over time uses
Gaussian derivatives as impulse responses and the filtering over frequency approx-
imates the first and second order differences between neighboring sub-bands. The
result is the M-RASTA features discussed in Chapter 7.
pfile shuffle – Reshuffles the frames of the input feature file in a random order and stores
the result in the output file. It is used for preprocessing the neural net training data
to optimize the training.
pfile warp – Warps temporal axis in TRAP feature file using a symmetrical exponential
function by resampling the TRAP trajectory. The function can either expand the
central part of the TRAP and compress the boundaries or vice versa. The number of output
samples per TRAP determines whether the trajectory gets rather stretched or shrunk.
More will be explained in Chapter 4.
trap shapes – A tool for computing Mean TRAP as introduced by Hermansky and
Sharma [53] using multiple approaches. It produces mean TRAPs plus their vari-
ances and occurrence counts for every class. It can also compute class confusion
matrices since the inputs to the program are both the reference labeling and neural
net outputs.
hilbert gauss – A mix of pfile gauss and fdlp. Similarly to pfile gauss, this tool
filters the auditory spectrogram with 2-D filters. However, the input spectrogram is
not obtained with the short-term spectral analysis as in TRAP, but using sub-band
Hilbert envelopes as in LP-TRAP.
etrap – The Energy-TRAP feature extractor. It extracts features from the energy contour
of the speech. The energy is obtained with Hilbert envelope and is parametrized
with LP cepstrum. It is a special case of LP-TRAP cepstral features for the whole
frequency band. The features can complement conventional features.
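The warping performed by pfile warp can be sketched by resampling a TRAP at non-uniformly spaced points. A power function is used below as a stand-in for the symmetric exponential warp of the actual tool, with an illustrative exponent and output length:

```python
import numpy as np

def warp_trap(trap, n_out=51, alpha=2.0):
    """Resample a symmetric TRAP so that the centre is sampled densely and the
    boundaries sparsely (alpha > 1 expands the centre; alpha < 1 the borders)."""
    n_in = len(trap)
    u = np.linspace(-1.0, 1.0, n_out)              # uniform output grid
    warped = np.sign(u) * (np.abs(u) ** alpha)     # symmetric warp of the axis
    src = (warped + 1.0) / 2.0 * (n_in - 1)        # map to input sample indices
    return np.interp(src, np.arange(n_in), trap)   # resample the trajectory

trap = np.sin(np.linspace(0, np.pi, 101))          # a 101-frame trajectory
short = warp_trap(trap)                            # 51 samples, centre-dense
```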
4.2 Recognizer architecture
Two evaluation tasks will be introduced in this chapter. The general structure of both
systems is similar and resembles the typical recognizer from Chapter 1. There is a front-end
and a back-end. The front-end computes features which are used by the pattern-matching
and decoding mechanisms in the back-end. The back-end is a GMM/HMM system with
models of phones, either context-independent (CI) or context-dependent (CD). Let us now
look very briefly at the structure of the front-end, as it is needed for defining the evaluation
criteria.
The front-end can represent common short-term features such as PLP or MFCC, which
are obtained from the speech using the above introduced tools HCopy from HTK, feacalc
or CtuCopy. These features will be used as a baseline. Long-term features which are
studied in this thesis are derived essentially in two steps, see Fig. 4.2: First, the speech
is converted into some spectral representation specific to the method, which is typically
high-dimensional. The dimensionality then needs to be reduced; this is the purpose of the
neural network, which performs a non-linear projection and dimensionality reduction. It
also acts as the phoneme probability estimator, which is important for the Frame Error
Rate evaluation criterion defined later. Before the phoneme posteriors obtained from the
neural network can be passed to the HTK back-end, they need to be “gaussianized” and
decorrelated. This is approximated by logarithm and Karhunen-Loeve Transform (KLT)
in feacat and pfile klt tools, respectively.
Figure 4.2: General scheme of front-end for long-term features.
4.2.1 Multi Layer Perceptron as posterior probability estimator
Since the MLP neural network is a workhorse used throughout this thesis, some details about
its background are given in this section. The text does not intend to thoroughly cover the
background of artificial neural networks, which can be found e.g. in [55]. It rather serves
as an extension of section 3.4, which introduced probabilistic features.
In this work, the MLP is used as a statistical classifier estimating the probability that
a given acoustic observation corresponds to a particular class (phoneme). The MLP is
used for phoneme classification for several reasons:
1. MLP is a discriminative classifier, as opposed to generative models such as HMMs,
hence it can provide an estimate of posterior probabilities of the classes.
2. With only one hidden layer, MLP is able to learn any non-linear mapping function
between input and output from the data, provided that it has enough hidden units
and enough training data [67]. Hence, no assumptions about the distribution of the
input data are required.
3. MLP is simple and has a solid mathematical background in the sense that there
exists a robust gradient-descent training algorithm – Error Back-Propagation [86].
4. MLP can largely reduce the redundancy and dimensionality of data, since its output
is typically smaller than its input.
5. QuickNet software provides efficient algorithms for MLP training and forward-
passing.
Structure of MLP
This work utilizes only one ANN architecture, the MLP with three layers of neurons, see
Fig. 4.3. Between neighboring layers all neurons are fully connected, and the signal flows
in one direction only, i.e. there are no feedback loops. Such a structure is called a feed-forward
MLP.
Figure 4.3: Scheme of three-layer MLP and neuron.
Neurons in the first layer only distribute the input to the subsequent layer; they do not
implement any calculation. The neurons in the second and third layers implement the
general function
g(x) = φ( b + Σ_{i=1}^{m} w_i x_i ) = φ(z),    (4.1)
where x_i are the neuron inputs, w_i are the weights, b is the bias and φ(z) is a non-linear
activation function of one variable z (z is the weighted linear sum of the x_i plus the bias). In
particular, the second-layer neurons use sigmoid activation function, which maps all real
numbers to the open interval (0, 1),

φ_2(z) = 1 / (1 + e^{−z}).    (4.2)
The third-layer neurons use the softmax activation function, which ensures that the outputs
of all C neurons in the output layer act like probabilities, i.e. their values lie between 0 and
1 and they sum up to one:

φ_{3,k}(z) = prob_k = e^{z_k} / Σ_{j=1}^{C} e^{z_j},    (4.3)
where k = 1 . . . C are indices of output neurons.
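Equations (4.1)–(4.3) amount to the following forward pass. The 351-input, 29-phoneme sizes follow the setting used elsewhere in this chapter, while the hidden-layer size and the weights are random stand-ins for a trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # eq. (4.2): maps R into (0, 1)

def softmax(z):
    e = np.exp(z - z.max())                   # shift for numerical stability
    return e / e.sum()                        # eq. (4.3): outputs sum to one

def mlp_forward(x, W1, b1, W2, b2):
    """Three-layer feed-forward MLP: input -> sigmoid hidden -> softmax output."""
    h = sigmoid(W1 @ x + b1)                  # hidden layer, eq. (4.1) with phi_2
    return softmax(W2 @ h + b2)               # output layer, eq. (4.1) with phi_3

rng = np.random.default_rng(0)
x = rng.standard_normal(351)                  # e.g. 9 frames x 39 PLP features
W1, b1 = rng.standard_normal((200, 351)), rng.standard_normal(200)
W2, b2 = rng.standard_normal((29, 200)), rng.standard_normal(29)
post = mlp_forward(x, W1, b1, W2, b2)         # posterior estimates for 29 phonemes
```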
Training procedure
The weights and biases of the MLPs are subject to iterative gradient-descent training.
It is done in an on-line supervised manner, in which a large number of input–output pairs
are repeatedly presented to the network. The training minimizes a cross-entropy error
criterion between the network output and the desired output [42]. One cycle of presenting
all the data to the network is called an epoch.
During the training the MLP is learning the mapping function between its input and
output. It is desirable that the MLP captures only the gross trend and not the details
which are due to the variability in the data. If the MLP were over-trained, it would lose its
generalizing property and could not predict the output for unseen data. To prevent
this situation, the MLP is evaluated on independent data (the cross-validation set,
CV) after each training epoch. Usually 90% of the available data are used for training
and 10% for cross-validation. The performance is measured by the Frame Error Rate (FER),
which will be introduced in the next section. The early-stopping training strategy [2] is
used:
1. Start the training with a learning rate of 0.008.
Learning rate value determines the speed of training. It specifies to what extent the
weights and biases can change between epochs.
2. Repeat the training until the FER on the CV data improves over the previous
training epoch by less than 0.5%.
From then on, the learning rate is halved before each epoch, which increases the precision
at the local optimum.
3. Continue training, but halve the learning rate in every epoch.
Halving the learning rate initially boosts the FER improvements; eventually, however,
the learning rate becomes so small that the improvement is minimal. When the FER
again improves by less than 0.5%, the training is stopped.
It typically takes 7 – 10 epochs to train an MLP.
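The three steps above can be sketched as a small driver loop; train_epoch is a hypothetical callback that runs one epoch at the given learning rate and returns the CV frame error rate:

```python
def early_stopping_schedule(train_epoch, initial_lr=0.008, threshold=0.5):
    """Early-stopping schedule: fixed rate until the CV improvement drops below
    `threshold` (% absolute), then halve the rate before every epoch until the
    improvement drops below the threshold again."""
    lr, prev_fer, halving = initial_lr, None, False
    history = []
    while True:                                        # train_epoch must plateau
        fer = train_epoch(lr)                          # one pass over the data
        history.append((lr, fer))
        gain = None if prev_fer is None else prev_fer - fer
        if gain is not None and gain < threshold:
            if halving:
                break                                  # second small gain: stop
            halving = True                             # first small gain: start halving
        if halving:
            lr /= 2.0
        prev_fer = fer
    return history

# toy FER curve: improves quickly, then saturates
fers = iter([50.0, 45.0, 41.0, 40.7, 40.5, 40.45, 40.44])
hist = early_stopping_schedule(lambda lr: next(fers))
```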
Features and targets for MLP
In this work, the MLP is mostly used to project input vectors with hundreds of features to
posterior probabilities of tens of phonemes. This is done every 10 ms.
A large amount of phonetically labeled data is used for training. The labels can
either be obtained manually, which is expensive, or by automatic alignment using HMMs,
which gives slightly worse results but is unavoidable for huge corpora [37]. The training data
contain input–output pairs of feature vectors and associated phoneme targets. As there
is one output neuron for every class, the training targets are coded in the 1-out-of-C
manner, representing ideal posterior probabilities. The training targets are hard, which
means that if a frame belongs to a certain phoneme, the target assigned to that phoneme
is set to 1 and all other targets are set to 0.
Though MLPs in the TRAP and TANDEM architectures typically use phonemes as
targets, since they are easy to obtain, it is questionable whether these classes are optimal. Ap-
pendices C and D deal with this question in detail.
4.3 Evaluation criteria
As the objective of this thesis is to develop front-end algorithms for speech recognition,
the natural way of evaluating their performance is to build a recognizer, have it recognize
testing utterances and compute the word error rate. The word error rate is defined as the
number of erroneous words divided by the number of words in the reference
transcription,
WER = (D + I + S) / N · 100%,    (4.4)
where D, I, S denote deleted words, inserted words, and substituted words, respec-
tively. N is the reference number of words. The values of D, I, S are obtained from a
comparison of the recognized output and the reference transcription using the NIST align-
ment procedure as implemented in the HTK Toolkit [100] and NIST’s SCLITE [3]. The WER
criterion has the advantage of being widely used and easily implemented. The
drawback is that, since WER quantifies the quality and reliability of the whole system
by a single number, the measure can sometimes be misleading. It serves well for simple
systems such as a digit recognizer, but for LVCSR this measure may not be sufficient.
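WER can be computed with a standard Levenshtein alignment in which the edit distance is exactly D + I + S; a minimal dynamic-programming version of what the HTK and SCLITE scoring tools implement:

```python
def wer(ref, hyp):
    """Word error rate: (D + I + S) / N * 100, via edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                                   # i deletions
    for j in range(1, m + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])        # substitution/match
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # deletion/insertion
    return 100.0 * d[n][m] / n

print(wer("one two three four".split(), "one too three".split()))  # 1 S + 1 D -> 50.0
```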
An additional criterion evaluates how well the neural classifier is estimating the
phoneme posterior probabilities. It operates on a frame basis and is called Frame Er-
ror Rate (FER). FER is the fraction of misclassified frames given the true labels. More
precisely, it is the number of frames F_mis in which the maximum posterior does not match
the underlying class label, divided by the total number of frames F,

FER = F_mis / F · 100%.    (4.5)
It is defined only for front-ends with an MLP classifier and is evaluated on the cross-
validation set (which will be defined in the next section).
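FER itself is a one-liner over the posteriors and labels (random stand-in data below):

```python
import numpy as np

def frame_error_rate(posteriors, labels):
    """Percentage of frames where the maximum posterior misses the true label."""
    predicted = np.argmax(posteriors, axis=1)
    return 100.0 * np.mean(predicted != labels)

post = np.random.dirichlet(np.ones(29), size=1000)   # 1000 frames x 29 phonemes
labels = np.random.randint(0, 29, size=1000)         # frame-level reference labels
fer = frame_error_rate(post, labels)
```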
4.4 Evaluation tasks
Having defined the evaluation criteria and the structure of the recognizer, the last thing
that remains to be specified is the task itself.
Modern ASR is a complex system comprising speech data acquisition, prepro-
cessing, extraction of features, classification, and decoding. Each of these blocks is
usually developed under the assumption of independence from the other blocks, since
a global optimization is generally not feasible. However, the assumption of independence
can hardly be guaranteed, and the resulting achievements can thus be quite subjective and
only locally optimal. The simplest way to minimize the influence of the parts of the
system that are not of interest is to avoid them.
system that are not of interest is to avoid them. Therefore, since the primary objective
of this thesis is a study of feature extraction techniques, the chosen evaluation task is
a small-vocabulary speech recognition. Such a task has a number of advantages. First,
it eliminates a possible bias coming from the language modeling (LM), as a strong em-
phasis on LM in large-vocabulary ASR (LVCSR) may smear the studied particularities
of features, especially if the only criterion is the word error rate. Small-vocabulary ASR
uses only a simple LM; the experiments are thus less LM-dependent. Second, a small-
vocabulary recognition system is simpler and more transparent than LVCSR: it has fewer
degrees of freedom, which enables back-tracing and analysis and allows getting much closer
to the particular phenomena. Ultimately, one can learn more from a simpler task. The
drawback is that the results seen on small-vocabulary ASR may not hold for LVCSR. The
reason is that in LVCSR the information may come from different sources (better training
process, strong LM) or – and this is one more argument against LVCSR – state-of-the-art
LVCSR recognizer is a complex and carefully tuned equilibrium which can be easily broken
by introducing any brand new approach, no matter how promising it actually is.
There are two tasks being used in this thesis. First, there is a digits recognizer for
English, which will be used for all data-driven development, optimizations and evaluations.
The second one is a simplified recognizer of English Conversational Telephone Speech
(CTS) with a limited vocabulary.
4.4.1 English digits recognition – Stories & Numbers95
Two speech corpora were used, OGI-Stories and OGI-Numbers95 [26, 27]. Both con-
tain speech recorded over a telephone channel in the same recording conditions (8 kHz,
16 bits, shortpack compression) and both contain a training subset, which is transcribed
on phoneme level by hand (with time-stamps). OGI-Stories contains spontaneous contin-
uous speech with rather large vocabulary, OGI-Numbers95 contains strings of digits and
numbers. This task will be further abbreviated as S/N (Stories/Numbers95).
Four data sets were created from these corpora (see also Fig. 4.4):
MLP set 1 – 208 files from Stories (2.8 hrs) with frame-level phoneme labels.
MLP set 2 – 3590 files from Numbers95 (1.7 hrs) containing digits and numbers with
frame-level phoneme labels.
HMM train set – 2547 files from Numbers95 containing strings of 11 digits from zero
to nine plus oh (1.3 hrs). It is a subset of MLP train set 2.
Test set – 2169 files/12437 words from Numbers95 containing strings of 11 digits
(1.7 hrs) with word transcription.
The MLP sets are used in the following way. When there is only one MLP in the
front-end, which is the case for the TANDEM systems PLP-TANDEM and M-RASTA,
the MLP is trained on the joint set MLP set 1 + 2, see Fig. 4.4. The MLP is
actually trained on 90% of the data; the remaining 10% are used for MLP cross-validation
to determine the end of training. When the front-end contains more than one MLP
in tandem, which is the case for TRAP and LP-TRAP, the sub-band classifiers train
on MLP set 1 and the merging MLP trains on MLP set 2. The reasons for this division are
rather historical, as it was adopted from Pratibha Jain, Frantisek Grezl, and originally from
Sangita Sharma. However, their back-end was different from this work, so the results are
not directly comparable.
The OGI-Stories corpus contains phonetically rich sentences, which can be beneficial
for the MLP training: all phonemes occur with similar frequency, which prevents the
classifier from favoring some classes at the expense of others. The HMM train and Test sets contain
speech from distinct speakers with sequences of digits.
MLP Setup
The 4.5 hours of training data in MLP sets 1 and 2 provide about 1.5 million speech frames,
assuming a 25/10 ms segmentation. The training targets are 29 English phonemes – only
those contained in the ten recognized digits; their list can be found in Appendix A.1. All
MLPs have 3 layers: input, hidden, and output. The sizes of the MLPs for the
individual architectures are given as Input x Hidden x Output size:
Figure 4.4: Structure of the corpora in Stories-Numbers95 (S/N) task.
• PLP-TANDEM: 351x1800x29 units (9 frames x 39 features at the input, 1800 hidden
units, and 29 units in the output layer mapped to phoneme targets).
• M-RASTA: 448x1800x29 units.
The TRAP architectures have a set of MLPs, one for every frequency sub-band, denoted
Band-MLP, and a merging network called Merger-MLP. Their sizes are:
• Band-MLP: Nx100x29 (N features at the input, 100 hidden units, 29 phoneme outputs).
N depends on the TRAP length and, in the case of LP-TRAPs, also on the LPC order;
it typically varies between 50 and 100.
• Merger-MLP: 435x300x29 (435 inputs are formed by the outputs of 15 Band-MLPs
with 29 outputs each; 300 hidden units, 29 phoneme outputs).
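As a rough sanity check of the network sizes listed above, the number of trainable weights of such 3-layer MLPs can be counted directly from the layer sizes. The sketch below counts weights only and neglects bias terms (an assumption about how the totals are quoted); a 101-frame TRAP is assumed for the Band-MLP input size:

```python
def mlp_weights(sizes):
    """Number of weights of a fully connected MLP with the given layer
    sizes, biases neglected (e.g. [351, 1800, 29] for PLP-TANDEM)."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

tandem = mlp_weights([448, 1800, 29])   # M-RASTA TANDEM net
# 15 Band-MLPs (101-frame TRAPs assumed) plus one Merger-MLP
trap = 15 * mlp_weights([101, 100, 29]) + mlp_weights([435, 300, 29])

print(f"TANDEM:      {tandem / 1e3:.0f}k")   # on the order of 900k
print(f"TRAP-TANDEM: {trap / 1e3:.0f}k")     # on the order of 350k
```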
Such neural networks have from 350k (TRAP-TANDEM) to 900k (TANDEM) trainable
parameters. After all the neural networks have been trained using the qnstrn tool from
QuickNet, the 29 phoneme posteriors at the MLP output are passed through a logarithm
and a KLT transform. The KLT is a data-driven method which requires training data
to derive the base vectors; for this purpose the HMM train set is used. The training
process for TANDEM and TRAP-TANDEM architectures is the following:
TANDEM:
1. Train MLP on MLP sets 1+2.
2. Forward HMM train & Test sets.
3. Apply log and compute KLT bases on
HMM train set.
4. Apply KLT transform on HMM train
& Test sets.
TRAP-TANDEM:
1. Train Band-MLPs on MLP set 1.
2. Forward MLP set 2, HMM train set,
and Test set through Band-MLPs.
3. Train Merger-MLP on forwarded MLP
set 2.
4. Forward HMM train set & Test set
through Merger-MLP.
5. Apply log and compute KLT bases on
HMM train set.
6. Apply KLT transform on HMM train
& Test sets.
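The shared tail of both pipelines (logarithm followed by a KLT derived on the HMM train set) can be sketched as follows. The KLT is implemented here as PCA via an eigendecomposition of the covariance matrix, and the random Dirichlet "posteriors" merely stand in for real MLP outputs:

```python
import numpy as np

def train_klt(log_feats):
    """Derive KLT (PCA) bases from training features (frames x dims)."""
    mean = log_feats.mean(axis=0)
    cov = np.cov(log_feats, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]          # sort bases by descending variance
    return mean, eigvec[:, order]

def apply_log_klt(posteriors, mean, bases, eps=1e-10):
    """Log of MLP posteriors, then rotation onto the KLT bases."""
    logp = np.log(posteriors + eps)
    return (logp - mean) @ bases

np.random.seed(0)
train_post = np.random.dirichlet(np.ones(29), size=1000)   # stand-in MLP posteriors
mean, bases = train_klt(np.log(train_post + 1e-10))        # derived on the train set
feats = apply_log_klt(train_post, mean, bases)             # applied to train & test
print(feats.shape)   # (1000, 29)
```

By construction, the transformed training features are decorrelated (their covariance matrix is diagonal).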
HMM Setup
When all the features for the HMM train and Test sets are ready, the back-end can be
trained and evaluated. The back-end is an HTK system scripted in Perl. Its key features
are:
• GMM/HMM system modeling context-independent phones,
• 22 phoneme HMMs (only the phonemes contained in the 11 recognized digits),
• 5 emitting states, each with 32 Gaussian mixtures,
• 11 target words (“zero” to “nine” plus “oh”) in 28 pronunciation variants.
The 22 phoneme HMMs are initialized from the hand-labeled HMM train set by the HInit
tool and re-estimated using the Baum-Welch procedure [85] by the HRest tool. Subsequently,
all files that contain words other than the eleven digits are removed to eliminate unknown
phonemes. The embedded re-estimation step performed next internally carries out the
phoneme alignment, which would not be possible with unknown phonemes present. The
re-estimation is run 5 times to yield the final models.
Test utterances are recognized using the HVite tool given the features, a grammar with
the 11 digits plus silence in a loop, and a pronunciation dictionary with 28 pronunciation
variants. The performance is evaluated in terms of WER by the HResults tool, and the word
insertion penalty is tuned to yield the same number of insertions and deletions.
Statistical Significance
The word error rate is a statistical measure and should be accompanied by a confidence
interval. A simple approach adopted from [41] will help to indicate the statistical
significance for the S/N task.
Let us suppose that two recognizers are evaluated on the same task, reaching two WER
scores p1, p2. The goal is to decide if there is enough evidence to conclude that either
p1 = p2 or p1 ≠ p2. A null hypothesis H0 stating that p1 = p2 is tested at a chosen
significance level α. If H0 can be rejected, the scores are statistically different;
otherwise the difference is probably due to chance.
Our basic question can read: “By how much do we need to improve on the given WER for
the improvement to be significant?” There are 12437 words in the HMM Test set, and the
WER typically hovers around 4% (as will be shown later). Following the approach for a
significance level α = 0.05, we get a minimal difference in scores of 0.5%. This means
that if the baseline reaches 4.0% WER, the proposed algorithm needs to be better than
3.5% WER for the difference to be significant at 95%. For 99% significance the difference
is 0.7%.
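These numbers can be approximately reproduced with a simple two-proportion test under the independence assumption discussed below. Note that this is one common form of such a test; the exact formula used in [41] may differ slightly:

```python
from math import sqrt

def min_significant_diff(n_words, wer, z):
    """Minimal absolute WER difference needed for significance, using a
    two-proportion z-test and assuming independent word errors."""
    se = sqrt(2.0 * wer * (1.0 - wer) / n_words)   # std. error of p1 - p2
    return z * se

n, p = 12437, 0.04                                  # Test set size, typical WER
print(f"95%: {100 * min_significant_diff(n, p, 1.960):.2f}%")  # about 0.5%
print(f"99%: {100 * min_significant_diff(n, p, 2.576):.2f}%")  # about 0.6-0.7%
```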
Note that this measure is only an indication, as it assumes independence among the
errors in the tested words and also between the two experiments. More accurate
significance measures exist, but most of them have to be re-evaluated for every experiment,
since they depend on the recognized sequences and thus apply only to the specific run [41].
4.4.2 Conversational Telephone Speech – CTS
The main goal of developing new feature extraction and recognition techniques
is to enhance state-of-the-art ASR, which is typically an LVCSR system trained on
large amounts of data. It is not computationally feasible to develop on the full-scale
task, as the turnaround time is long; hence the experiments are usually done on simpler
tasks. However, it is often the case that results reported on a small-vocabulary task
do not generalize to the state-of-the-art system. The task introduced in this section is
a compromise between complexity and reliability on one hand and simplicity on
the other. It is based on the data used by the EARS Rich Transcription system, and it
has been shown that conclusions drawn on it do translate to the large task. A complete
description of the task can be found in [22]. Its key properties are:
• Two gender-dependent recognizers,
• telephone data (8 kHz) from various sources (cell phones, ISDN, analog lines)
• 32 hours of train data (16 hrs per gender) selected from Fisher and Switchboard
corpora,
• about half an hour of development + half an hour of test data per gender selected
from NIST RT-03 Evaluation [3],
• 1000 words in about 5k pronunciation variants, out-of-vocabulary-words (OOV) on
test set < 7.5%,
• bi-gram language model (inferred from NIST 2004 CTS evaluation).
Since the architectures of the recognizers for both genders are the same, the following
text describes either one of them.
MLP Setup
The MLP setup is similar to the S/N and Speecon tasks. All MLPs and also the KLT are
trained on the first 14.6 hours of the training data; the remaining 1.6 hours form a
cross-validation set. Training targets were obtained by automatic forced alignment using
the state-of-the-art system from SRI [22]. There are 47 classes, of which 46 are used as
MLP targets (44 English phonemes + silence + other events) and one class contains data not
suitable for training (about 0.6% of all frames); details can be found in Appendix A.2.
This was decided after preliminary experiments showed that training on all targets
only lowers the FER and does not improve the WER. There are about 5.7 million training
frames in about 16 k files. The sizes of the MLPs are N x 2000 x 46 units in the case of
TANDEM (N depends on the technique); in the case of TRAP the Band-MLPs have N x 100 x 46
units and the merger has (46x15) x 300 x 46 units.
After all the training, development, and testing data have been passed through the
trained MLPs, log, and KLT, they are converted to HTK features.
HMM Setup
Here the back-end is a little more complicated than for the S/N task, as it is closer to
an LVCSR task. It is a GMM/HMM system with decision-tree tied-state triphone HMMs, each
HMM with 3 emitting states and 32 mixtures per state. First, the CI phoneme models with
one mixture are initialized from a flat start and trained by the Expectation-Maximization
algorithm (HERest tool). Then they are converted to CD models, which are clustered
until the required number of states is reached. Subsequently, the number of mixtures is
doubled and the HMMs are re-estimated 5 times after each increase, until the final
32 mixtures are reached.
Apart from the number of states, which is a knob to tune on the development set, the
grammar scale factor can also be tuned, balancing the acoustic and the language model.
However, tuning these parameters is costly and will not always be done. Preliminary
experiments have shown that the WER is affected most by these three factors:
• Grammar-scale factor. It depends on the features (their acoustic modeling ability), but
mainly on the number of features in the vector, which affects the dynamic range of
the likelihood (tested with 39, 46 and 64 features).
• Number of triphone states. It varies for every feature kind. The general rule “the
more the better” may not apply to all features.
• Normalization of the features. It seems useful to do speaker-normalization on the
train set and utterance-normalization on the test set.
Test utterances, containing CTS with a limited vocabulary, are recognized using the
HVite tool given the features, the language model, the pronunciation dictionary, and the
set of CD phonemes. The output hypotheses are evaluated in terms of WER using the SCLITE tool.
The CTS task allows for previewing a feature’s behavior on a large task. The drawback
is a rather high computational demand: on a single-processor machine (Athlon 2200)
the training and evaluation take almost two weeks. Because of this, only one gender,
male, is used in all evaluations.
4.5 Baseline results
Four front-ends are used as a baseline:
MFCC – mel-frequency axis, 26 triangular filters from 0 – 4 kHz, no preemphasis, no
liftering, 12 cepstral coefficients + c0 + ∆ + ∆∆.
PLP – Bark-frequency axis, 15 trapezoidal filters from 0 – 4 kHz, no preemphasis, no
liftering, EQ-LD, IN-LD, 15th order LPC, 12 cepstral coefficients + c0 + ∆ + ∆∆.
PLP-TANDEM – 9 consecutive frames of the above PLP features at the MLP input
(351 features).
TRAP – critical band energies (CRBE) taken from the PLP computation, logarithm
applied. TRAP formed as 101 consecutive frames of CRBEs. Thus, each of the 15
band-MLPs have 101 inputs.
Features       FER [%]   WER [%]
MFCC              -        4.5
PLP               -        5.2
PLP–TANDEM       18        4.8
TRAP             19        4.7
Table 4.1: Performance of baseline features on S/N task.
Performance of these features on S/N task is given in Tab. 4.1.
Preliminary experiments on the CTS task with the baseline features PLP, PLP-TANDEM, and
TRAP suggested that the above-mentioned factors may affect the performance more than
the features themselves. Therefore the factors were explored first.
The suitable number of clustered triphone states depends on the type of features, hence
it should be optimized for every experiment. Reasonable starting values, meaning that
the WER should not be more than about 5% worse than the optimum, are between 1000 and
2000 states. For the male gender, generally more states are needed [22]. As the optimization is
very costly (involving HMM retraining and evaluation), in the presented experiments the
optimum is mostly chosen from about three variants.
The grammar scale factor post-multiplies the language model likelihoods from the
word lattices [100]; its default value is 1.0. The bigger the value, the more weight is given to the
language model. The WER was evaluated for some combinations of three vector sizes (39
for PLP, 46 for TANDEM, and 64 for combined features) and factors 3, 8, 14, 20, 25, 30,
40. As the recognition is costly, only several combinations were evaluated. For 46 features,
the reasonable choice of the factor was between 14 and 30 with WER ± 1%. Factor 8
yielded 5% (absolute) worse performance. For 64 features, the reasonable range (± 1%
WER) was between factor 20 and 30. Factor 40 yielded 3% (absolute) worse WER. For
39 features the factor was set at 14 [22]. Based on these observations, the default
grammar scale factors were chosen as shown in Tab. 4.2.
Number of features in the vector   39   46   64
Grammar scale factor               14   20   25
Table 4.2: Grammar scale factors for particular feature vector sizes (CTS task).
The baseline features were evaluated with the appropriate grammar scale factors and
a suitable number of tied triphone states. The word insertion penalty was fixed in all
experiments at zero. The results are given in Tab. 4.3. Note that ePLP denotes enhanced
PLP features: they were normalized with respect to the vocal tract length (VTLN)
and energy-normalized over all speech from one speaker (training data) or per utterance
(evaluation data).
                 MLP cross-validation   Devel set   Test set
Features         FER [%]                WER [%]     WER [%]
MFCC                -                    52.5        50.2
PLP                 -                    53.2        51.6
ePLP                -                    46.4        43.8
ePLP–TANDEM        35                    46.3        43.3
TRAP               40                    54.8        53.3
TRAP–DCT           39                    53.3        51.2
Table 4.3: Performance of baseline features on CTS task.
Chapter 5

Information in time-frequency plane
Current ASR knowledge supports the assumption that the auditory-like spectrum preserves
the complete linguistic information. However, speech is a temporal event, and so is its
spectrum. Due to co-articulation, the ASR-relevant information is spread across
some interval of the spectrogram. The goal of this chapter is to help understand the
distribution of the relevant information in the time-frequency space. Such knowledge
would make it possible to improve the existing techniques, to make them more discriminative
against non-linguistic artifacts, and possibly also to suggest new approaches. It will be
accomplished by a detailed study of TRAP features.
5.1 Limits of useful temporal context for TRAP features
To be able to study the spatial distribution of a phenomenon, the first step is to limit
the search space to some reasonable boundaries. Restricting our interest to probabilistic
features with intermediate phoneme classes, the main question is: “How much temporal
context is necessary for separating phonemes?” The answer will be sought by first studying
the mean temporal trajectories of phonemes. However, as the ultimate targets for ASR are
words, not phonemes, the subsequent experimental search for the context limits will be
done with respect to both phoneme and word criteria, FER and WER.
5.1.1 Mean TRAPs of phonemes
To get a first insight into the temporal patterns of the sub-band energies, the Mean TRAPs
were computed, following the work of Hermansky and Sharma [53], from the training part of
the CTS database.
Fifteen CRitical Band log-Energies (CRBE) were extracted from the speech with
short-term analysis at a 100 Hz rate using the command feacalc -plp 12 -deltaorder
0 -domain logarithmic -dither -frqaxis bark -samplerate 8000 -win 25 -step
10. TRAPs were formed as 101-sample trajectories of CRBEs (about 1000 ms).
They were assigned labels according to the frames in the middle. There are overall about
11 million frames in the CTS training part. The class coverage is given in Appendix A.2.
There are 41 English phonemes plus silence (SIL), a word-fragment interruption point
(FIP), laughter (LAU), filler phonemes (PUH, PUM), and a class for all sounds rejected
from the training (REJ). All TRAPs belonging to the same class were averaged to yield
the Mean TRAPs. For each of the 101 points in every Mean TRAP, the standard deviation
was also calculated. Fig. 5.1 shows the Mean TRAPs and standard deviations at the 5th
critical band (450 – 640 Hz) for all 47 classes.
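The Mean-TRAP computation just described can be sketched as follows. Random data stands in for the actual CRBEs and labels, and the band index, class count, and array layout are illustrative assumptions:

```python
import numpy as np

def mean_traps(crbe, labels, band, n_classes, half=50):
    """Average the (2*half+1)-frame trajectories of one critical band,
    grouped by the label of the center frame.

    crbe   : (T, 15) critical-band log-energies at a 100 Hz frame rate
    labels : (T,)    per-frame class indices
    """
    width = 2 * half + 1
    sums = np.zeros((n_classes, width))
    counts = np.zeros(n_classes)
    for t in range(half, len(labels) - half):
        c = labels[t]                                  # center-frame label
        sums[c] += crbe[t - half:t + half + 1, band]
        counts[c] += 1
    return sums / np.maximum(counts, 1)[:, None]       # avoid division by zero

# Toy run: 1000 random frames, 3 classes, 5th band (index 4)
rng = np.random.default_rng(0)
mt = mean_traps(rng.normal(size=(1000, 15)), rng.integers(0, 3, 1000), 4, 3)
print(mt.shape)   # (3, 101)
```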
Figure 5.1: Mean TRAPs (black line) for 41 phonemes and 6 non-phoneme classes of CTS,
5th band. Standard deviations σ (blue dashed line). Green area represents a measure of
confidence mean±σ.
Observation
Typically, the TRAP wavelets for phonemes span about 200–300 ms, and the lowest variance
(highest confidence) is near the center. Non-phonetic classes have somewhat wider shapes
and higher variance at the center, reflecting that these segments are typically longer and
contain artifacts with higher variability. Although not illustrated in the figure, Mean
TRAPs in other bands have different shapes but similar time constraints. This suggests
that a 1-second TRAP is long enough to capture the complete temporal structure relevant to
phoneme-based ASR, and that the minimum necessary interval is likely even shorter,
yet not shorter than about 200 ms. Hence, the base TRAP length was set to 1000 ms for
further experiments.
Mean TRAPs from MLP Posteriors
Analogously to the above experiment, similar Mean TRAPs were computed from the same
data, but with different labels. Here, the TRAPs were weighted by the phoneme posteriors
obtained from a trained MLP instead of the discrete labels. The MLP was trained on the 5th
band only.
Figure 5.2: Mean TRAPs and standard deviations for selected phonemes of CTS in 5th
band. Labels were obtained either from hand-labeling or from trained MLP.
Though the MLP frame error rate was quite high (60–70%), the resulting Mean TRAPs
were almost the same as in the above experiment, as shown in Fig. 5.2. This suggests that
1) at the sub-band level, not all phonemes can be distinguished and, 2) it is possible to
find patterns which are common to several phonemes. This idea relates to the work
of Kajarekar and Hermansky [61], who aimed at finding a set of optimal broad sub-word
classes for ASR. They reported no success when using these targets as HMM states. Later
efforts with the UTRAP system [51] extended the idea to all bands; a neural system with one
universal classifier for all bands with 9 classes reached the performance of TRAPs with
over an order of magnitude fewer trainable parameters. The existence of shared classes was
the main motivation for the experiment in Appendix C.
5.1.2 Truncating TRAP
Having settled on a maximum TRAP length of 1000 ms, a study of shortening the TRAP
was done. The initial goal was to find the best possible accuracy with the TRAP length
as the only free parameter. Subsequently, a minimal length was sought which
still preserved the quality of the system with the optimal length. The objective was not to
minimize the TRAP length to make the system smaller; the point was rather to find where
the most important information comes from. It was supposed that the critical length could
be detected at the point where the system starts to break.
Experiment setup
The experiment was based on the S/N task and the TRAP system introduced in Section
4.4.1:
• TRAP length varies from 161 frames (≈ 1600 ms) down to 1 frame.
• MLP sizes: Nx100x29 units in Band-MLPs, 435x300x29 units in Merger-MLP.
N equals TRAP length in frames.
• 2 types of TRAP: normal, and mean–normalized over TRAP length.
• Evaluation criteria are WER and FER.
For every TRAP length a new set of MLPs and HMMs was trained. Subsequently, the
FER of the Merger-MLP and the WER were evaluated and plotted in Fig. 5.3.
Figure 5.3: Influence of TRAP length on FER (left) and WER (right). MLP size depends
on the TRAP length.
Observation
FER criterion:
• The dependency of FER on the TRAP length is smooth. The only outlier, at TRAP
length 101, was caused by a slightly worse convergence of the MLP training,
which did not markedly affect the WER.
• When optimizing for frame-level phoneme recognition, the TRAP length should
be between 400–1000 ms. TRAPs longer than 1000 ms start to hurt the performance.
TRAPs shorter than 400 ms are not long enough and can severely affect the FER.
• Mean-normalized TRAPs require an about 50 ms (5 frames) larger window than
non-normalized TRAPs. A possible explanation might be that the information about
the mean, missing in mean-normalized TRAPs, can be recovered from a wider context.
WER criterion:
• The dependency of WER on the TRAP length is jittery and only a rough trend can
be interpreted.
• WER seems to be almost independent of the TRAP length over a wide range. The TRAP
can be truncated down to 300 ms (normal TRAP) or even to 100 ms (mean-normalized
TRAP) before the WER deteriorates significantly.
This surprising phenomenon can be explained if the phoneme posteriors
are viewed simply as features: although for a short TRAP the frame-level phoneme
posteriors are “noisy”, they apparently still contain the linguistic information, which
can be isolated in the HMMs thanks to their temporal smoothing mechanisms with
strong a priori constraints.
NOTE: There is an extreme case of this experiment worth noting: a TRAP length of only
one frame. Such a degenerate system is interesting, as its input is essentially just
the short-term auditory spectrum; there is no context.
The Band-MLPs then have a topology of 1 input – 100 hidden – 29 output units and have
to make a decision about the underlying phoneme given only one number, the sub-band
energy. Surprisingly, with only such subtle information, they are still able to properly
classify 25–30% of frames¹! Recall that the Band-MLPs for the standard 51-frame TRAP
properly classify 30–40% of frames.
The Merger-MLP properly classifies 50% of frames (51-TRAP system: 82%). The word
error rate reaches a respectable 18.7% (51-TRAP system: WER = 4.6%).
From a different point of view, such a system is in principle comparable to PLP
without ∆ and ∆∆, which gives 17.9% WER and confirms the consistency of the results.
5.1.3 Fixed MLP topology – truncating TRAP-DCT
The reason for the jitter in the WER curves in Fig. 5.3 is that the MLP topology differs
among the various TRAP lengths, which introduces a stochastic component into the WER.
In order to make the curves smoother, the MLP topology has to be fixed over the
experiment. This can be accomplished by TRAP-DCT. The idea of TRAP-DCT is as
follows.
1. To form TRAPs of the required length.
2. To apply Hamming window along TRAP to reduce the TRAP boundary transitions.
3. To project the windowed TRAPs to the first K DCT bases.
4. To use the projections as features for MLP.
The number of DCT bases (DCT size) K used for the projection needs to be fixed. K
must be at most equal to the TRAP length N (in frames) for the DCT to be defined,
K ≤ N. Hence, a low K allows shortening the TRAP down to K frames (K × 10 ms), but as
there are then only a few features, the performance is low. A high K allows for good
performance, but the TRAP cannot be truncated below K frames. Therefore K was left as
a free parameter.
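The four steps above can be sketched for one sub-band as follows. This is one reasonable reading of the procedure (the constant k = 0 base is excluded, and the random input merely stands in for a real energy trajectory):

```python
import numpy as np

def trap_dct_features(trap, K):
    """Steps 2-3 above: Hamming-window one sub-band TRAP and project it
    onto the first K DCT bases.

    trap : (N,) energy trajectory of one critical band
    K    : number of DCT bases, K <= N
    """
    N = len(trap)
    assert K <= N, "DCT size must not exceed the TRAP length in frames"
    n = np.arange(N)
    windowed = trap * np.hamming(N)              # step 2: Hamming window
    k = np.arange(1, K + 1)[:, None]             # bases k = 1..K
    bases = np.cos(k * np.pi * (n + 0.5) / N)    # cosine DCT bases
    return bases @ windowed                      # step 3: projections

# 51-frame (510 ms) TRAP reduced to 6 DCT features (step 4 feeds them to the MLP)
feats = trap_dct_features(np.random.randn(51), K=6)
print(feats.shape)   # (6,)
```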
¹The classification rate is expressed in terms of Frame Accuracy, which is defined as 100% − FER.
MLP architecture
TRAP-DCTs are commonly processed by the TRAP MLP architecture with 15 band
networks plus a merger network. However, preliminary experiments have shown that for
small K the dependencies of WER and FER on the TRAP length are noisy, because the
band MLPs have very few inputs and their decisions cannot be reliable. This issue can be
avoided by replacing all 15 plus 1 MLPs with one big MLP (for small K), which preserves
the trend in the results and smooths the curves. Specifically, the setup was the following.
DCT size K       MLP architecture
2, 4, 6          1 MLP, size = (K × 15) x 1800 x 29
                 (100k–200k trainable parameters)
7, 11, 25, 50    15 Band-MLPs, sizes = K x 100 x 29;
                 1 Merger-MLP, size = 435 x 300 x 29
                 (200k–260k trainable parameters)
The experiment was carried out in two loops.
• For every DCT size K out of 2, 4, 6, 7, 11, 25, and 50:
  – vary the TRAP length N from the minimum K × 10 ms up to about 1000 ms;
  – for each K and TRAP length N, calculate the features (using the above four steps),
    train the MLPs and HMMs, and evaluate FER and WER.
The resulting dependencies are plotted in Fig. 5.4.
Figure 5.4: Influence of TRAP length on FER (left) and WER (right). TRAP-DCT
features are used, so MLP size is fixed for every trajectory.
Observation
• The number of useful DCT bases saturates at K = 25, reaching the best FER of about
15% and the best WER of about 4.3%. More bases start to hurt the performance.
• For the best FER a long context (400–1000 ms) is required. This agrees with the
experiment in Section 5.1.2.
• The WER criterion seems to prefer a shorter context, as the minima of all curves lie on
the left side. The best WER was found for a context of 200–400 ms.
• The lower the feature dimension, the shorter the optimal TRAP length.
Conclusion
The experiment has shown that the relatively fast modulations in speech are the most
important for ASR. However, slower modulations are also useful, as they carry complementary
information.
In particular, when the feature size is limited, shorter TRAPs are the first choice:
300 ms for a good FER and 100 ms for a good WER. The slower modulations contained in
longer TRAPs, up to 1000 ms, can be useful when the feature size does not matter.
5.1.4 Extension – combining DCTs of different lengths
The observations from the previous experiment deserve some more discussion. Consider
the shape of the applied DCT bases shown in Fig. 5.5: they are cosinusoids weighted by
a Hamming window. In discrete time with frames n = 0 . . . N − 1, where N is the TRAP length,
DCTk[n] = cos( kπ(n + 0.5)/N ) · ( 0.54 − 0.46 cos(2πn/N) ),    (5.1)

where the first factor is the DCT cosine and the second factor is the Hamming window.
Here k = 1 . . . K is the DCT base index². The actual projection of TRAP samples x[n]
onto the DCT bases is done with

X[k] = ak · Σ_{n=0}^{N−1} x[n] · DCTk[n],    (5.2)
where ak is a scaling constant. The DCT can be seen as a bank of very narrow band-pass
filters tuned to different frequencies. The center frequencies fc are linearly spaced
from 0 Hz up to near the Nyquist frequency fs/2, where fs is the spectrum sampling
frequency (here fs = (10 ms)⁻¹ = 100 Hz). The absolute center frequency of a filter
in Hertz is determined by the ratio k/N:

fc(k, N) = (1/2π) · (kπ/N) · fs = k/(2N) · fs  [Hz].    (5.3)
Thus, the same frequency can be obtained in multiple ways. Let us take an example.
Let DCT^N_k denote the k-th DCT base applied to a TRAP of length N samples. According
to eq. 5.3, the first two DCT bases applied to a 10-frame TRAP (denoted DCT^10_{1,2})
have frequencies of 5 Hz and 10 Hz, respectively³. The DCT^100_{1,2} bases applied to a
ten times longer TRAP have frequencies of 0.5 Hz and 1 Hz, respectively, but the tenth
and twentieth bases (DCT^100_{10,20}) have again frequencies of 5 Hz and 10 Hz. What
differs between DCT^10_1 and DCT^100_10 is in principle just the length of the
“cosinusoid”, as shown in Fig. 5.6. This suggests an interesting question: which bases
perform better, shorter or longer ones? In other words, when projecting a TRAP onto
DCT bases with multiple periods, is it necessary to use all the periods?
²The DCT base for k = 0 is also defined, but without the Hamming window it is a constant
(cos 0 = 1), therefore it will not be considered here.
³The explanation does not consider Hamming windowing, for simplicity.
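The frequency relations of eq. 5.3 and the example above can be checked numerically; the 100 Hz sampling rate follows from the 10 ms frame step:

```python
def dct_center_frequency(k, N, fs=100.0):
    """Center frequency (eq. 5.3) of the k-th DCT base over an N-frame
    TRAP; fs = 100 Hz corresponds to the 10 ms frame step."""
    return k / (2.0 * N) * fs

print(dct_center_frequency(1, 10))     # 5.0  Hz (DCT^10_1)
print(dct_center_frequency(2, 10))     # 10.0 Hz (DCT^10_2)
print(dct_center_frequency(10, 100))   # 5.0  Hz (DCT^100_10), same frequency
print(dct_center_frequency(1, 100))    # 0.5  Hz (DCT^100_1)
```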
Figure 5.5: First four DCT bases weighted by Hamming window.
Recall that the low-order DCT features from the previous experiment reach relatively good
accuracies: the DCT^7_{1,2} coefficients yield 8.5% WER, the DCT^15_{1,2,3,4} coefficients
5.5% WER. This suggests combining the low-order DCT bases that gave good performance
with other low-order DCTs of a different TRAP length, instead of calculating a higher-size
DCT of a single length. For example, instead of calculating the four features
DCT^15_{1,2,3,4}, one can use two plus two features, DCT^15_{1,2} + DCT^7_{1,2}, as
illustrated in Fig. 5.7.
Experiment
A simple experiment was carried out, similar to the one in the previous section.
It combined the best-performing DCT size 2 features (DCT^7_{1,2}) with another two DCT
size 2 features (DCT^M_{1,2}). The hypothesis was: it is possible to find a combination
of four DCT bases, DCT^7_{1,2} + DCT^M_{1,2}, which outperforms the best DCT size 4,
DCT^15_{1,2,3,4}. If this were true, better features than TRAP-DCT could be found
by optimizing the projection bases. The experiment proceeded as follows.
1. The promising features from DCT size 2 were found (top blue curves from Fig. 5.4).
2. They were concatenated with other DCT size 2 features. It was repeated for all
TRAP lengths.
3. The performance was compared to the DCT size 4 features.
(The same procedure was run independently for both criteria, FER and WER.)
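The feature construction in steps 1 and 2 can be sketched in a few lines. This is a minimal numpy illustration, assuming DCT-II bases weighted by a Hamming window (as in Fig. 5.5) and centered sub-windows of a single band-energy trajectory; the helper names are illustrative, not part of the thesis tooling.

```python
import numpy as np

def dct_bases(n, ks):
    """DCT-II bases with indices ks over n points, weighted by a Hamming window."""
    t = (np.arange(n) + 0.5) / n
    return np.stack([np.hamming(n) * np.cos(np.pi * k * t) for k in ks])

def combined_dct_features(trap, n_long, n_short, ks=(1, 2)):
    """Project the centered n_long and n_short sub-windows of one TRAP
    onto low-order DCT bases and concatenate the coefficients."""
    feats = []
    for n in (n_long, n_short):
        c0 = (len(trap) - n) // 2          # centered sub-window
        seg = trap[c0:c0 + n]
        feats.extend(dct_bases(n, ks) @ seg)
    return np.asarray(feats)

trap = np.random.default_rng(1).standard_normal(101)   # one band-energy trajectory
f = combined_dct_features(trap, n_long=15, n_short=7)  # DCT^15_{1,2} + DCT^7_{1,2}
assert f.shape == (4,)
```

The combined 2+2 vector has the same size as DCT^15_{1,2,3,4}, which is what makes the comparison in step 3 fair.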
Figure 5.6: Chosen bases of DCT applied to 100 ms and 1000 ms TRAPs.
5.1 Limits of useful temporal context for TRAP features 45
Figure 5.7: Illustration of combining the bases of two DCTs of different sizes. The left part shows
the conventional bases (DCT^N_{1,2,3,4}) and the combined bases (DCT^N_{1,2} + DCT^{N/2}_{1,2}). The right part
shows the same bases after weighting with a Hamming window.
Figure 5.8: Combining DCT size 2 features with another DCT size 2 features; FER (left)
and WER (right) as a function of TRAP length. Dashed ellipse = best of DCT size 4, solid ellipse = combined 2+2
features.
Observation
Results are plotted in Fig. 5.8. The hypothesis was confirmed by both criteria.
• FER: DCT_{1,2} bases applied to a 300 ms TRAP can be profitably augmented by similar
features derived from either shorter or longer TRAPs. The best combinations appear
to be approximately 300 ms + 100 ms and 300 ms + 800 ms bases. The combined
features outperform the best of DCT size 4 (300 ms).
• WER: DCT_{1,2} bases of length 150 ms can be successfully combined with DCT_{1,2} bases
of length 30 ms, outperforming the best of DCT size 4 (150 ms).
Conclusion
TRAP-DCT uses a set of cosinusoidal projection bases of the same length and increasing
frequencies. The "high-frequency" bases have multiple periods of the cosine within the
window. It was shown that using all the periods can actually harm performance and that it is
beneficial to truncate these bases (see Fig. 5.7, left). In other words, it seems desirable
that the length of a particular cosinusoidal base be proportional to its period, i.e. inversely
proportional to its frequency.
This knowledge is one of the motivations for Multi-RASTA filtering, where the pro-
jection bases have fixed shapes and they differ only in widths. This framework will be
introduced in Chapter 7. Note that Multi-RASTA filters have remarkably similar shapes to
the bases that were successfully used in this section (see the rightmost column in Fig. 5.7),
though their origin is different.
5.2 Focusing on TRAP center
The previous sections studied the length of speech needed for ASR with respect
to the WER and FER criteria. The TRAP lengths for which the performance was not far
from the best ranged between 100 ms and 1000 ms. This section explores the density of
relevant information within the TRAP.
TRAPs treat all points in the trajectory the same way. Intuitively, however, the
most important part should be located near the TRAP center. Consider the human eye:
most of the information comes from the area the eye is focused on, and only
a minor part comes from the periphery. If there is an analogy with speech, it
could be beneficial to pay different attention to different parts of the TRAP4. But how
should this attention be defined? For example, TRAP-DCT aims at emphasizing the center of
the trajectory by weighting it with a Hamming window before applying the DCT. The following
reasoning seeks a more explicit proportioning of the attention. It assumes that if a
region with a higher density of information is parametrized by more features, the system
performance should rise.
5.2.1 Warping time axis
Let us assume that the complete linguistic information is contained in the 101 × 15 samples
of the auditory spectrogram. If the information density is not uniform, then some parts of
the spectrogram are necessarily redundant. These sparse-information regions, which require
fewer features, could be sub-sampled. A simple way to achieve this is to bin neighboring
TRAP points together and use only the resulting point. To further simplify the
problem so that it can be solved experimentally, the density along the frequency axis is
assumed uniform. The main goal can now be expressed as two questions.
• What is the mapping function between the linear time axis and a warped time axis
with constant information density?
• How redundant are the 101 samples in 1000 ms long TRAP trajectory? In other
words, how many TRAP samples are needed to preserve the information?
4It can be argued that the MLP is able to learn any nonlinear projection, so this reasoning makes no
sense. But that holds only in theory, under assumptions that are not met in practice, for
example unlimited training data. If the world were perfect, one could feed the MLP directly with speech
samples. In practice, it makes a lot of sense to "help" the MLP with proper feature extraction and selection.
Warping Function
The mapping is approximated by a center-symmetric function whose one half is given by

f(x) = 1 − (1 − x)^w,   x ∈ ⟨0, 1⟩,   (5.4)

where w is the only parameter. The symmetrization takes place around the point
[x = 1, f(x) = 1]. The complete function maps the warped output axis (the x axis) to the input
linear axis (the y axis), as shown in Fig. 5.9.
One could ask why the output is mapped to the input and not vice versa. The goal is
to design a warped axis which, when sampled uniformly, produces samples having a uniform
density of information. Hence, what matters most is the ability to project the samples of
the warped axis onto the linear axis. The mapping from the linear to the warped axis is not so
important.
Fig. 5.9 shows the influence of w on the warping function: w = 1.0 represents a 1:1
mapping, w > 1.0 prioritizes the center, and w < 1.0 prioritizes the boundaries (included for
completeness). This function was chosen for three reasons. 1) The "warping effect" is nicely
proportional to w. 2) The function defaults to a straight line for w = 1. 3) For a given
w, the warping near the center is stronger than at the boundaries (note the slopes for
w = 2.6 near points 1.0 and 0.0 on the horizontal axis), which protects the boundaries from
being sampled too sparsely when high values of w are used.
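The three properties can be verified directly. A small numpy sketch of eq. 5.4 (the function name is illustrative):

```python
import numpy as np

def warp(x, w):
    """One half of the symmetric warping function, eq. 5.4: f(x) = 1 - (1 - x)^w.
    Maps the warped output axis x in [0, 1] to the linear input axis."""
    return 1.0 - (1.0 - np.asarray(x)) ** w

x = np.linspace(0.0, 1.0, 11)
assert np.allclose(warp(x, 1.0), x)          # w = 1 is the identity mapping
assert np.all(warp(x[1:-1], 2.6) > x[1:-1])  # w > 1 stretches toward the center
assert np.all(warp(x[1:-1], 0.6) < x[1:-1])  # w < 1 prioritizes the boundaries
assert warp(0.0, 2.6) == 0.0 and warp(1.0, 2.6) == 1.0  # end points stay fixed
```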
Figure 5.9: Symmetric TRAP warping function for various w factors. Coordinates [1,1]
denote the TRAP center. Red dashed lines illustrate the projection of the uniformly sampled
warped axis onto the linear axis.
Now the function has to be discretized, i.e. sampled at equidistant points of
the horizontal axis. The input TRAP size is 101 points and the number of output points is
M (M ∈ ⟨1, 101⟩). M is odd, which means that both edge points as well as the middle
point are warping-independent. An example of the discrete mapping from 101 to 21
points is illustrated in Fig. 5.10. For w = 1.0, about 5 input points are binned (averaged)
to yield one output point, a ratio of 5:1. For w = 2 the mapping is 1:1 at the center and
10:1 at the boundaries. Obviously, it does not make sense to raise w too much, as the
central points would start to repeat at the output.
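The discretization can be sketched as follows. This is a hedged illustration assuming nearest-neighbor binning of input samples around the warped output positions (the pfile warp tool used in the experiments may bin differently); all names are illustrative.

```python
import numpy as np

def warp_trap(trap, n_out=21, w=2.0):
    """Warp a TRAP to n_out points (a sketch of the binning in Fig. 5.10):
    output positions are uniform on the warped axis, eq. 5.4 maps them to
    the linear axis, and each input sample is averaged into the nearest bin."""
    n_in = len(trap)
    half = n_out // 2                      # n_out is assumed odd
    x = np.linspace(0.0, 1.0, half + 1)    # uniform samples of one warped half
    f = 1.0 - (1.0 - x) ** w               # their positions on the linear axis
    # mirror around the center, then scale to input-sample indices
    pos = np.concatenate([f, 2.0 - f[-2::-1]]) * (n_in - 1) / 2.0
    # assign every input sample to the nearest output position and average
    owner = np.argmin(np.abs(np.arange(n_in)[:, None] - pos[None, :]), axis=1)
    return np.array([trap[owner == i].mean() for i in range(n_out)])

trap = np.arange(101, dtype=float)         # a dummy ramp-shaped trajectory
out = warp_trap(trap, n_out=21, w=2.0)
assert out.shape == (21,)
assert abs(out[10] - 50.0) < 1e-9   # the middle point stays at the TRAP center
assert abs(out[0] - 2.0) < 1e-9     # the first bin averages inputs 0..4 (5:1)
```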
Experiment Setup
A 101-point input TRAP representing about 1000 ms was mapped to M = 51, 31, 21, 15,
and 11 points. Lower counts do not make sense according to the experiment in Section
5.1.2. For each M, four warpings were tested:
Linear – linear mapping, w = 1.0, all points are compressed with the same average ratio
101:M .
High – strong warping such that mapping is 1:1 at the center.
Medium – less aggressive warping such that mapping is 1:2 at the center.
Inverse – priority to boundaries, mapping is 1:1 at the boundaries.
All the tested warping factors and output sizes are summarized in Tab. 5.1.
Figure 5.10: Discrete mapping from TRAP size 101 to TRAP size 21, for w = 1, w = 2 and w = 0.5.
The experiment was based on the same setup as the TRAP-length experiment (see Section
5.1.2), with the TRAP back-end and the S/N task. For every combination of output size M and
factor w, warped-TRAP features were calculated from mean-normalized TRAPs using the
pfile warp tool and fed to the ANN/HMM framework; FER and WER were evaluated.
Observation
The recognition results with warped TRAPs are shown in Fig. 5.11; the tested warping
factors and output sizes are listed in Tab. 5.1.

                      Linear    High              Medium                     Inverse
  Output size M       ratio     w     boundary    w     center   boundary   w     center
                                      ratio             ratio    ratio            ratio
  101                 1:1       1.00  1:1         1.00  1:1      1:1        1.00  1:1
  51                  2:1       1.37  3:1         1.18  1:1      1:3        0.51  12:1
  31                  3:1       1.67  5:1         1.33  2:1      1:4        0.31  34:1
  21                  4:1       2.00  10:1        1.63  2:1      1:6        0.21  54:1
  15                  6:1       2.40  14:1        1.96  2:1      1:12       0.15  66:1
  11                  9:1       3.00  22:1        2.30  2:1      1:18       0.11  78:1

Table 5.1: Warping factors w and compression ratios for given lengths M.

Figure 5.11: Warping the time axis in TRAPs. Left: FER, right: WER.

• Linear mapping of 101 points to M points smoothly deteriorates FER and WER as M drops.
A well-performing exception is M = 51, where every two points in the TRAP are
binned together. This has a similar effect as using 2× longer speech frames, which
still preserves modulations up to 20 Hz (2 × 10 ms frames ↔ fs = 50 Hz). Some
authors ordinarily use a 32/16 ms segmentation.
• Preserving fine resolution near the TRAP center at the expense of the boundaries indeed
makes sense. This supports the hypothesis that TRAPs are redundant and that the crucial
information lies near the center. The severe sub-sampling of the boundaries with a 10:1
ratio for M = 21, w = 2.0 achieved the best performance in both FER and WER (see
also Fig. 5.10). Going down to 15 features per TRAP maintained the performance; 11
features started to harm it.
• The inverse warping, which preserves the boundaries at the expense of the center, failed,
supporting the notion of the major importance of the center.
Extension to CTS Task
The promising warpings were also evaluated on the CTS task and compared to the standard
TRAPs with 101 samples and 21 samples; see Tab. 5.2.
In frame error rate, warped TRAPs reached performance comparable to the 101-point TRAP,
significantly outperforming the short 21-point TRAP. This confirms the validity of the idea of
non-uniform information density, which allows using fewer features. In word error rate, the
observations from the S/N task did not translate to LVCSR: on average, warped TRAP performs
1% worse than TRAP. The discrepancy between FER and WER resembles the figures from
Sec. 5.1.2. A possible explanation is that on the CTS task FER is fully determined by the features,
while WER is influenced by more inputs (HMM temporal constraints, the language model, the dictionary).

                                 Devel set           Test set
  Features                       FER[%]   WER[%]     WER[%]
  TRAP 101                       40.0     54.5       53.4
  TRAP 21                        44.3     55.2       52.8
  warpTRAP M = 21, w = 2.0       40.5     55.5       53.8
  warpTRAP M = 15, w = 2.4       40.6     56.5       54.3

Table 5.2: Performance of selected warped TRAPs on CTS.
Chapter 6
Linear Predictive TempoRAl
Patterns (LP-TRAP)
This chapter deals with Linear Predictive TempoRAl Patterns (LP-TRAP). The idea of
LP-TRAP originates from Marios Athineos, Hynek Hermansky and Daniel P. W. Ellis
[15]. During the author's stay at the IDIAP Research Institute, he cooperated with the first
two LP-TRAP inventors. The outcome was a C++ implementation of LP-TRAP feature
extraction, partial optimizations of the technique, and evaluations on the S/N and CTS tasks.
In addition, the author experimented with non-linear warping of the temporal axis of
LP-TRAP.
6.1 Introduction to LP-TRAP
Speech is by nature a continuous-time signal. Though human hearing implicitly performs
some form of spectral analysis, time remains the primary dimension.
Short-term feature extraction begins with segmenting the speech at a rate of about 100 Hz,
thus reducing the temporal resolution by two orders of magnitude. On one hand this allows
one to assume stationarity within the segment and apply the FFT; on the other hand it pushes the
time dimension aside: any temporal detail below 10 ms is lost. This may seem even more
unfortunate considering that the next processing step is a filter bank with a spectral
resolution of only about 20 samples. The uncertainty principle suggests that, given such
reduced frequency resolution, the temporal resolution could have been much finer than it
is. Why drop information which is there and could be important? What if faster
temporal events in speech (e.g. stops) improve ASR when better localized?
Among the feature extraction approaches that focus primarily on temporal structure are
TRAPs, which estimate phoneme posterior probabilities from the time evolution of sub-band
energy. However, TRAPs originate from short-term spectral samples, which do not carry
any finer detail. There should be a more straightforward way to obtain the temporal
evolution of sub-band energy. In principle, the signal could be split into sub-bands
by a simple filter bank in the time domain. Then the band-specific temporal resolution would
be determined just by the impulse response of the particular filter and would not be
influenced by frame analysis. Section 3.1.2 of the survey introduces several works towards
this approach. Note that to get the final estimates of the energy envelopes, one would have
to demodulate the band-pass signals and take the square. Anyway, the point is that such
an approach is only an approximation to the true energy envelope, called the Hilbert envelope,
which can also be evaluated.
The squared Hilbert envelope represents instantaneous energy in signal and can be
calculated directly from speech samples. It also allows for sub-band processing. Sub-
band Hilbert envelopes are energy trajectories similar to TRAPs, but with full temporal
resolution. They need to be smoothed to suppress the presence of glottal pulses, but
instead of an ad-hoc universal low-pass filter hidden in the frame-based processing, in
LP-TRAPs the smoothing is accomplished by Linear Prediction1. The involved auto-regressive
model, when applied appropriately (as will be discussed further), can smooth the
trajectory and at the same time capture fine temporal details with millisecond accuracy. Its
behavior can be adjusted by two additional mechanisms: 1) a transform applied to the spectral
values, 2) warping of the temporal axis. Finally, the LPC coefficients can be easily converted into
cepstral coefficients, which serve as a natural and elegant TRAP representation.
6.2 Extracting LP-TRAP features
LP-TRAPs are extracted in several steps, as illustrated in Fig. 6.1. Very briefly, speech is
segmented into long segments (500–1000 ms) with a shift of 10 ms (to be consistent with
TRAPs) and transformed into the frequency domain by the Discrete Cosine Transform (DCT).
The spectrum is then split into auditory-like frequency sub-bands. In each sub-band, the
fragment of the DCT is solved for the coefficients of the LP model. Note the duality between time
and frequency: if we were in the time domain, the LPC would approximate a power spectrum;
in the frequency domain the LPC approximates something close to the "power of a time signal",
more precisely the Hilbert envelope of the sub-band. For speech recognition, either
the LP coefficients or cepstral coefficients can be used. Alternatively, temporal envelopes of
each sub-band, resembling TRAPs, can be calculated.
Figure 6.1: LP-TRAP feature extraction scheme (segmentation → DCT → sub-bands → FDLP → LP coefficients a[k] or cepstra c[k]).
Now, let us look closer at some particularities of the algorithm. The scope of the
next sections is the following:
• Section 6.2.1 shows the importance of Hilbert envelope for LP-TRAP. It will also
explain why the spectrum is calculated using DCT.
• Section 6.2.2 clarifies how the DCT spectrum is split in frequency sub-bands.
1Note that here the LPC smooths the temporal envelope instead of the power spectrum; it is applied in
the frequency domain instead of the time domain, and is therefore called Frequency-Domain Linear Prediction
(FDLP) [14].
• Section 6.2.3 explains the principle of Frequency-Domain Linear Prediction (FDLP).
A trick for modeling dips and peaks equally well by AR model is presented.
6.2.1 Importance of Hilbert envelope
This section shows, very schematically yet vividly, the importance of the Hilbert envelope.
When thinking about how to estimate the energy trajectory (or envelope) of a signal, the
first idea might be to simply take the power of the signal and possibly smooth the resulting
trajectory with a low-pass filter. But is this right?
Imagine a simple stationary signal, a sinusoid. What is its envelope? Intuitively, the
amplitude of the sinusoid is constant and so should be its envelope. However, the power of
the sinusoid is not constant: it is a value which varies between zero and the squared amplitude,
depending on the phase. Why would the energy envelope depend on an incidental phase
in time? The envelope does not depend on phase, so taking the power is clearly not the
right way. But how can the signal be "made" phase-independent? A rough yet illustrative
answer is: accompany the sinusoid with a cosinusoid of the same amplitude. Obviously,
this cannot be achieved by addition, but it can by forming a complex signal with the sinusoid
as its real part and the cosinusoid as its imaginary part, i.e. a complex exponential.
Since sin²x + cos²x = 1, the absolute value of such a complex signal is constant. This is in
principle the Hilbert envelope.
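This toy argument is easy to verify numerically. The sketch below forms the analytic signal of a pure sinusoid by keeping only the positive half of its spectrum (the construction formalized as eq. 6.4); numpy only, all names illustrative.

```python
import numpy as np

n = 1024
t = np.arange(n)
x = np.sin(2 * np.pi * 8 * t / n)   # a pure sinusoid, 8 whole cycles

# Analytic signal: double the positive half of the spectrum, zero the negative
X = np.fft.fft(x)
X[1:n // 2] *= 2.0
X[n // 2 + 1:] = 0.0
x_plus = np.fft.ifft(X)

envelope = np.abs(x_plus)   # Hilbert envelope: constant, phase-independent
power = x ** 2              # instantaneous power: swings between 0 and A^2

assert np.allclose(envelope, 1.0)      # envelope equals the amplitude everywhere
assert np.allclose(x_plus.real, x)     # the real part is the original signal
assert power.max() > 0.99 and power.min() < 0.01
```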
To make things clearer and a little more formal, recall some basics from signal theory.
Let x(t) be a time signal. Then

x̂(t) = (1/(πt)) ∗ x(t)   (6.1)

is its Hilbert transform (H[·]). This convolution represents nothing but a phase change
by 90°, which can be shown using the Fourier transform (F[·]):

F[1/(πt)] = −j sgn(ω),   (6.2)

where j is the imaginary unit and the sgn function returns the sign. The above-mentioned
complex signal is called the analytic signal,

x+(t) = x(t) + jx̂(t).   (6.3)
More important for our purposes is its Fourier transform

F[x+(t)] = X+(ω) = X(ω) + j(−j sgn(ω) X(ω)) = { 2X(ω) for ω > 0;  X(0) for ω = 0;  0 for ω < 0 },   (6.4)

which is a causal (or "one-sided") spectrum. This is actually a dual form of the Kramers–Kronig
relation, which says that the real and imaginary parts of the Fourier transform of a causal
signal a(t) are related (ℜ denotes the real part of a complex number):

F[a(t)] = ℜ[A(ω)] + jH[ℜ[A(ω)]].   (6.5)

Here, the causal signal is X(ω) and the real signal is x(t), hence

F⁻¹[X(ω)] = x(t) + jH[x(t)].   (6.6)
For the purposes of modulated (band-pass) signals, let us also define the complex envelope
simply as the demodulated analytic signal

x̃(t) = x+(t) e^(−j2πf₀t).   (6.7)

Finally, the Hilbert envelope of the signal x(t) is

h(t) = |x̃(t)| = |x+(t)|.   (6.8)

Some authors also define the temporal envelope as

ε(t) = h(t)².   (6.9)

The temporal envelope of the sub-band signal is the subject of modeling in LP-TRAPs.
Discrete Cosine Transform instead of FFT
According to eq. 6.4, the envelope could in principle be calculated by taking the Fourier
transform of the signal, canceling its left part (ω < 0) and taking the absolute value of
the inversely transformed signal. But consider that the envelope is going to be modeled
by linear prediction. LP applied in the temporal domain first calculates the autocorrelation
and then solves the Yule–Walker equations. The autocorrelation can optionally be obtained
using the Wiener–Khinchin theorem [94] as

R(t) = F⁻¹[|F[x(t)]|²].   (6.10)

When LP is applied in the frequency domain (FDLP), it calculates the autocorrelation not
from the signal but from the spectrum. Therefore the spectrum needs to be purely real.
A real spectrum in turn implies a symmetric signal in time. Therefore in LP-TRAPs,
prior to taking the Fourier transform, the input signal is symmetrized, which makes all its
components even. In other words, the signal is composed of cosines only. It can be shown
that in the case of a discrete symmetrized signal xsym = {x1, x2, . . . , xn, xn−1, xn−2, . . . , x2} of
length 2(n − 1), the FFT of xsym turns into the DCT of the original signal2 x(t) of length
n. Hence, the DCT is preferred over the FFT as it is more computationally efficient.
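This equivalence is easy to check numerically. A sketch with numpy only, with the (unnormalized) DCT-I written out explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)

# Symmetrize: x_sym = {x_1, ..., x_n, x_{n-1}, ..., x_2}, length 2(n - 1)
x_sym = np.concatenate([x, x[-2:0:-1]])
assert len(x_sym) == 2 * (n - 1)

X = np.fft.fft(x_sym)
assert np.allclose(X.imag, 0.0)   # an even signal has a purely real spectrum

# The real spectrum equals the (unnormalized) DCT-I of the original signal
k = np.arange(2 * (n - 1))
dct1 = np.array([x[0] + (-1) ** ki * x[-1]
                 + 2 * sum(x[j] * np.cos(np.pi * j * ki / (n - 1))
                           for j in range(1, n - 1))
                 for ki in k])
assert np.allclose(X.real, dct1)
```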
6.2.2 Obtaining frequency sub-bands
Once the speech spectrum is available, it can be split into frequency sub-bands. The sub-band
signals can be seen as modulated band-limited signals. Before they can be modeled,
they have to be demodulated3. In contrast to the time domain, in the frequency domain this
operation is very simple.
The sub-band selection and demodulation from eq. 6.7 are carried out by selecting a
certain part of the DCT spectrum and shifting it towards zero. The demodulated sub-band
spectrum is thus formed as

X_b = {X[first_b], X[first_b + 1], . . . , X[last_b], padded zeros},   (6.11)
2To be precise, both transforms are equal only for a certain DCT definition. For example, in the case of DCT type I the two transforms differ only in scaling.
3Actually, due to the equality 6.8, the demodulation is in principle not required, but it comes for free when forming the sub-band spectra.
where X is the DCT of the input speech, b denotes the sub-band index, and first_b, last_b
are the limits of the chosen spectral samples. The process is illustrated in Fig. 6.2. The
selected samples are padded with zeros to the length 2N − 1, which is needed for the LP
model: LP calculates the autocorrelation using the Wiener–Khinchin theorem and the padded
zeros prevent it from being cyclic.
Figure 6.2: Illustration of forming frequency sub-bands from the DCT spectrum. A band b
spanning X[first_b] . . . X[last_b] is cut from the full DCT, shifted to zero and padded with zeros.
It should be noted for completeness that before the sub-bands are modeled with linear
prediction, a Gaussian window is applied to the chosen DCT samples (before
appending the zeros). It serves the same purpose as the triangular frequency filters in the
MFCC calculation or the trapezoidal filters in the PLP calculation. Its properties will be
discussed in Section 6.3.4.
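Eq. 6.11 together with the Gaussian window can be sketched as follows; the band limits, window width and names are illustrative assumptions, not the thesis settings.

```python
import numpy as np

def subband_dct(X, first, last, n_total, sigma_frac=0.25):
    """Form one demodulated sub-band spectrum per eq. 6.11: pick DCT samples
    first..last, weight them with a Gaussian window, shift them to zero and
    pad with zeros to length 2*N - 1 so the later autocorrelation computed
    via the Wiener-Khinchin theorem is not cyclic."""
    band = np.asarray(X[first:last + 1], dtype=float)
    m = len(band)
    g = np.exp(-0.5 * ((np.arange(m) - (m - 1) / 2) / (sigma_frac * m)) ** 2)
    out = np.zeros(2 * n_total - 1)
    out[:m] = band * g                 # placing at index 0 = demodulation
    return out

X = np.random.default_rng(2).standard_normal(1001)   # full-band DCT of a segment
Xb = subband_dct(X, first=200, last=300, n_total=len(X))
assert Xb.shape == (2 * 1001 - 1,)
assert np.all(Xb[101:] == 0.0)   # everything beyond the band is padded zeros
```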
6.2.3 FDLP – Frequency-Domain Linear Prediction
FDLP is the core of LP-TRAP feature extraction. Its purpose is to approximate the
sub-band temporal envelopes ε(t) by an autoregressive (AR) model. AR modeling records
the important trend of the modeled function in the coefficients of a FIR prediction filter. FDLP
is a dual form of time-domain linear prediction (TDLP).
Figure 6.3: Duality between time and frequency domains in LP-TRAP. The left column shows
a part of speech, its squared Hilbert envelope and the FDLP model fit. The right column displays
the DCT of the same signal, the power spectrum of the signal and the conventional LPC fit.
The duality between TDLP and FDLP is illustrated in Fig. 6.3. The conventional use
of LPC is to approximate the shape of the vocal tract; TDLP thus models the shape of
the power spectrum. In contrast, FDLP approximates the temporal envelope of the
speech. Typically, LP derives its coefficients by solving the Yule–Walker equations, for which
it needs autocorrelation coefficients. In TDLP the autocorrelation is calculated from the time
signal; in FDLP it is calculated from the DCT spectrum. Further details about FDLP can be
found in [14].
The LP model allows for a large dimensionality reduction. It involves a trade-off
between modeling precision and the number of parameters, which can be controlled by
the LP model order. Similarly to TDLP, in FDLP the smoothing property of the AR model
is desirable to some extent. The FDLP model order will be subject to experimental
optimization.
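The whole FDLP core can be condensed into a short full-band sketch: take the DCT of a segment, compute the autocorrelation of the zero-padded spectrum via the Wiener–Khinchin theorem (eq. 6.10), solve the Yule–Walker system, and read off the temporal envelope as the AR model's "power spectrum". This is a minimal illustration with numpy only (no compression, no sub-bands, hypothetical names), not the optimized C++ fdlp implementation.

```python
import numpy as np

def fdlp_envelope(x, order=20, n_out=101):
    """Fit an AR model to the DCT of a signal and return the modeled
    squared Hilbert envelope sampled at n_out points (illustrative sketch)."""
    n = len(x)
    # DCT via the FFT of the symmetrized signal (Section 6.2.1)
    x_sym = np.concatenate([x, x[-2:0:-1]])
    q = np.fft.fft(x_sym).real[:n]
    # Zero-pad to 2N - 1 so the autocorrelation is not cyclic, then apply
    # the Wiener-Khinchin theorem to the spectrum itself (eq. 6.10)
    qp = np.concatenate([q, np.zeros(n - 1)])
    r = np.fft.ifft(np.abs(np.fft.fft(qp)) ** 2).real / len(qp)
    # Yule-Walker equations: a Toeplitz system in the autocorrelation
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    g2 = r[0] - a @ r[1:order + 1]          # squared model gain
    # By time-frequency duality the AR "power spectrum" is a temporal envelope
    A = np.fft.fft(np.concatenate([[1.0], -a]), 2 * n_out)
    return g2 / np.abs(A[:n_out]) ** 2

# An energy burst in the middle of the segment should dominate the envelope
t = np.linspace(0.0, 1.0, 800)
sig = np.sin(2 * np.pi * 50 * t) * np.exp(-((t - 0.5) / 0.1) ** 2)
env = fdlp_envelope(sig)
assert env.shape == (101,) and np.all(env > 0)
assert env[50] > env[5] and env[50] > env[95]
```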
Trick for modeling peaks and dips in trajectory equally well
An autoregressive model captures spectral peaks well but tends to disregard dips. This
property might be desirable in standard TDLP when modeling the vocal tract, but in FDLP
the dips need to be modeled equally well as the peaks. This could be satisfied by an ARMA
model (autoregressive moving average) at the cost of an inelegant iterative procedure.
Fortunately, there is a powerful workaround which preserves the desirable analytic calculation
of the AR model and at the same time allows for an arbitrary weighting between the precision
in peaks and dips. It is an intermediate nonlinear operation, a compression of the spectrum
in TDLP or of ε(t) in the case of FDLP. It was proposed by Hermansky et al. [50].
Consider a discrete envelope of length N. Instead of minimizing an error as a
function of the analyzed envelope ε[n] and the modeled envelope ε̂[n], we minimize an
error employing transformed variables:

E = (G²/N) Σ_{n=1..N} ε[n] / ε̂[n]   →   E_T = (G_T²/N) Σ_{n=1..N} T[ε[n]] / T[ε̂[n]],   (6.12)

where G stands for the gain of the AR model and T[·] is the non-linear transform. The inverse
transform has to be applied when evaluating the power spectrum of the AR model (or the
approximated envelope in the FDLP case):

ε̂[m] = T⁻¹[ G_T² · c / |1 + Σ_{k=1..K} a_k e^(−2πjkm/M)|² ],   m = 1 . . . M,   (6.13)

where a_k are the LP coefficients of order K, M is the number of envelope samples and
c is a normalization constant. Note that the modeled trajectory can be evaluated in an
arbitrary number of points M, which does not necessarily need to match N (except for the
modeling stage, eq. 6.12).
In FDLP, the transform T[·] represents a compression, T[·] = (·)^cmpr with cmpr < 1; for
cmpr = 1 it defaults to no compression. Its influence on the envelope shape will be
illustrated in Section 6.3.1. To be able to implement the compression, the autocorrelation
needs to be calculated using eq. 6.10⁴.
4To be strict, due to the compression, R(t) (or the discrete R[n]) cannot be called an autocorrelation
anymore, but it remains referred to as such for simplicity.
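The effect of the compression pair T, T⁻¹ on the dynamic range the all-pole model has to cover can be seen on toy numbers (values illustrative):

```python
import numpy as np

cmpr = 0.1
T = lambda v: v ** cmpr               # compression applied before the AR fit
T_inv = lambda v: v ** (1.0 / cmpr)   # expansion applied to the model output

e = np.array([1e-4, 1e-2, 1.0, 1e-2, 1e-4])  # an envelope with deep dips
assert np.allclose(T_inv(T(e)), e)           # T and T^-1 form an exact pair

# Compression shrinks the peak-to-dip ratio the all-pole model must cover,
# so the dips no longer get neglected relative to the peaks
ratio = e.max() / e.min()
ratio_c = T(e).max() / T(e).min()
assert ratio_c < ratio
assert np.isclose(ratio_c, ratio ** cmpr)
```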
6.2.4 Free parameters in algorithm
Considering the above details, the LP-TRAP scheme can be redrawn as in Fig. 6.4. Red
numbers represent "bandwidths", i.e. the sizes of the data vectors. From the DCT sub-bands up
to the LPC, the sizes refer to one sub-band only. The sub-band DCT sizes include the padded zeros.
Figure 6.4: Detailed LP-TRAP feature extraction scheme. Red bracketed numbers in
italics denote data vector sizes; fixed-width magenta font denotes the parameters of the
algorithm (len, step, blap, cmpr, fp, ncep, traplen).
There are several free parameters in the system that could be optimized. Some of
them will be optimized in experiments and some of them will stay fixed. Those staying
fixed have either been optimized earlier, or their optimization is not critical.
Parameters not to be optimized:
• Speech segmentation step step. It will stay fixed at 10 ms for consistency with other
speech features.
• Compression of sub-band Hilbert envelopes cmpr distributing the LPC modeling
power between peaks and dips. This parameter has been optimized in earlier works
[15].
• Number of cepstral coefficients ncep per band. This parameter has also been opti-
mized in [15].
• The LP-TRAP sampling interval, or the LP-TRAP length traplen. The FDLP
envelopes, being the frequency responses of the LP filters, can be evaluated in an
arbitrary number of points. The traplen used defaults to the standard TRAP size of 101
points. It will not be optimized, as FDLP cepstra will be shown to outperform
FDLP envelopes.
Parameters to be optimized:
• Input segment length len. Initial choice is roughly 1 second (8200 samples of 8 kHz
speech). Shorter segments might be more suitable.
• Bank of band-pass filters applied to DCT. The filters are Gaussian-shaped and uni-
formly spaced on Bark frequency axis at 1 Bark intervals. The free parameter is
their width (overlap), controlled by blap factor.
• LPC order fp offering fp/2 poles per band for modeling.
6.3 Experimentally optimizing LP-TRAP
After the LP-TRAP feature extraction had been implemented in the standalone C++
application fdlp, the free parameters were optimized on the S/N task. The best recorded settings
were also evaluated on CTS. The experimental setup was the same as for the baseline
TRAP architecture (refer to Section 4.4 for details). For the purpose of comparison,
the conceptually similar TRAP and TRAP-DCT systems were used (Sections 4.4 and 5.1.3).
Recall their respective accuracies: 19% FER, 4.7% WER and 20% FER, 4.2% WER.
6.3.1 Sub-band envelope compression and LPC order
The influence of the envelope compression factor cmpr and the number of poles fp on the ASR
accuracy was already scanned by Athineos in [15]; the best values found (cmpr = 0.1, fp =
50) were adopted as the initial setting here. Their influence on the FDLP model is demonstrated
in Fig. 6.5.
Figure 6.5: 1000 ms LP-TRAPs. Left part: compression of the sub-band dynamics prior to
LP modeling (fp = 30); cmpr < 0 inverts the dynamics and the AR model resembles an MA model.
Right part: various LP orders for cmpr = 0.1.
6.3.2 Sampled FDLP temporal envelopes vs. FDLP cepstra as features
In the LP-TRAP technique, the sub-band temporal envelopes are parametrized by frequency-domain
linear prediction. The FDLP model is determined by the coefficients of the FDLP filters.
The sampled frequency responses of these filters represent sub-band energy trajectories (LP-TRAPs).
However, since the FDLP coefficients are available, the LP-TRAPs as such
no longer seem to be the best possible representation: it was shown previously,
and also in this work, that coefficients representing sub-band modulation spectra typically
outperform TRAPs. Conventionally, such coefficients are obtained by the DCT
applied to temporal log-energy envelopes (TRAP-DCT). In the LP-TRAP technique, these
modulation spectral coefficients, or FDLP cepstral coefficients (denoted c[k] in Fig. 6.4),
can be elegantly obtained directly from the FDLP coefficients via a recursion.
For these reasons, FDLP cepstra were presumed to work better than sampled FDLP
envelopes. To support the decision to focus only on cepstral features with some experimental
results, a comparison of two LP-TRAP feature sets with different settings of the free parameters,
using either cepstra or envelopes, is shown in Tab. 6.1. The precise settings of len, step,
blap, cmpr, fp and ncep are not important for now; hence the settings are called simply
I and II.
Features FER [%] WER [%]
I – envelope 18.3 4.6
I – cepstra 18.2 4.1
II – envelope 18.0 4.3
II – cepstra 18.3 3.9
Table 6.1: Performance of sampled LP temporal envelopes vs. LP cepstra as parametriza-
tion in LP-TRAP under two different settings of free parameters (I and II).
This experiment was actually carried out after all the optimizations described in the next sections had been completed, so it acts as a sanity check with the optimized settings in use. The “envelope” settings use 101 samples of the LP-TRAP per band as MLP inputs; the “cepstra” settings use 50 FDLP cepstral coefficients per band. Cepstral features consistently outperformed envelope features in word error rate, not only in the two presented experiments. Therefore, only cepstral features were used in further experiments.
6.3.3 Input window length
The dependence of recognition accuracy on the LP-TRAP input segment length len was evaluated. As LP-TRAPs are conceptually similar to TRAPs, for which the optimal length ranges between 300–1000 ms, only two lengths were evaluated: 1000 ms and 500 ms. To preserve the MLP topology, LP order 50 was used in both cases and the Band-MLPs were trained on 50 cepstral coefficients. Tab. 6.2 shows that the 500 ms LP-TRAP performs better than the 1000 ms one.
To further support this observation, an additional experiment was run which, instead of keeping the MLP topology constant, preserved the “pole rate” (the ratio of the LP model order to the segment length). Even though the MLP managed more parameters, this case did not outperform the 500 ms input either (the FER even got worse). The suggested segment length is thus 500 ms.
LP order fp Segment length len [ms] FER [%] WER [%]
50 1000 19.0 4.7
50 500 18.3 4.3
100 1000 20.7 4.4
Table 6.2: Influence of LP-TRAP input segment length on FER and WER.
6.3.4 Overlap of frequency sub-bands
Fifteen Gaussian filters uniformly spaced on the Bark scale were applied to the DCT of the signal. Specifically, the full frequency scale from 0 Bark to fbmax Bark (where fbmax is the Bark equivalent of fs/2 Hz) was split into 16 intervals and the filters were centered at fbcent,i = i · fbmax/16, i = 1 . . . 15. Here fb denotes frequency in Bark units; the relation to the linear Hertz axis is
f = 600 sinh(fb/6.0) [Hz, Bark]. (6.14)
The frequency dependence of the gain of the ith filter is given by
gi(fb) = exp( −(1/2) (fb − fbcent,i)^2 · 10^blap ), (6.15)
with a floor at -48 dB. The blap factor varied between -0.5 and +1.0, see Fig. 6.6. Its
effect on FER and WER is shown in Fig. 6.7.
[Figure: three panels of filter gains (0–1) over f = 0–4000 Hz, for blap = −0.5, 0.375 and 1.0.]
Figure 6.6: Bank of filters in LP-TRAP. Influence of blap factor on filter widths.
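The filter bank of Eqs. 6.14–6.15 can be sketched as follows (the helper names are ours, and reading the −48 dB floor as a lower bound on the linear gain is our interpretation):

```python
import numpy as np

def bark_from_hz(f):
    # Inverse of Eq. 6.14:  f = 600*sinh(fb/6)  =>  fb = 6*arcsinh(f/600)
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def gaussian_bark_filters(fs=8000.0, n_filters=15, blap=0.375, n_bins=129):
    """Gaussian filters uniformly spaced on the Bark axis (Eq. 6.15),
    evaluated on a linear frequency grid of n_bins points up to fs/2."""
    fb = bark_from_hz(np.linspace(0.0, fs / 2.0, n_bins))
    fb_max = bark_from_hz(fs / 2.0)
    # 16 intervals on [0, fb_max]; filters centered at the 15 inner points
    centers = [i * fb_max / (n_filters + 1) for i in range(1, n_filters + 1)]
    g = np.array([np.exp(-0.5 * (fb - fc) ** 2 * 10.0 ** blap)
                  for fc in centers])
    return np.maximum(g, 10.0 ** (-48.0 / 20.0))  # floor at -48 dB
```

With blap = 0.375 the half-gain bandwidths of neighboring filters meet approximately without overlap, as discussed below.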
Figure 6.7: WER and FER as a function of blap factor in LP-TRAP.
The results suggest that the width of the overlapping frequency sub-bands is an important factor which can largely affect recognition accuracy. Its optimal choice (yielding WER = 4.1%) seems to lie around blap = 0.375. If the bandwidth of the filters is defined by a −3 dB drop in gain (corresponding to a gain of 0.71 in Fig. 6.6), then the optimal blap = 0.375 approximately represents a bank of filters adjacent to each other without overlap.
6.3.5 LP model order
With the other values fixed (len = 500 ms, cepstral order ncep = 50, blap = 0.25 and envelope compression cmpr = 0.1; at the time this experiment was run, the optimized value of blap was not yet known), the only parameter left free was the LPC order fp,
which varied from 20 to 150. The dependencies in Fig. 6.8 suggest that though the LP order does not seem to be critical, it should not be lower than about 30. The best performing value (WER = 4.1%) was observed for fp = 70, offering 35 poles for modeling a 500 ms sub-band envelope.
Figure 6.8: WER and FER as a function of LP order fp in LP-TRAP.
6.3.6 Evaluating optimized features on S/N and CTS tasks
Sections 6.3.1 – 6.3.5 optimized the values of the free parameters of the LP-TRAP algorithm. The best settings found on the S/N task are:
• len = 500 ms (input segment length),
• fp = 70 (LP order),
• blap = 0.375 (sub-band overlapping factor),
• cmpr = 0.1 (sub-band envelope compression factor for LP),
• ncep = 50 (number of output cepstral coefficients).
Performance of these features was compared to the baseline TRAP-DCT on S/N task
(Tab. 6.3) and on CTS task (Tab. 6.4).
Features FER [%] WER [%]
LP-TRAP 18.2 4.1
TRAP-DCT 19.7 4.2
Table 6.3: Performance of optimized LP-TRAPs on S/N task.
Devel set Test set
Features FER[%] WER[%] WER[%]
LP-TRAP 39.1 53.1 50.5
TRAP-DCT 39.2 53.3 51.2
Table 6.4: Performance of optimized LP-TRAPs on CTS task.
On both tasks LP-TRAP performed about the same as TRAP-DCT. This is a very encouraging result, as the proposed novel features perform no worse than one of the best published long-term representations. On the other hand, it raises the question of why the LP-TRAPs are not even better.
6.4 Warping time axis in LP-TRAP
One of the major motivations for LP-TRAPs was the potential of very accurate temporal
modeling. This section experiments with the temporal resolution of LP-TRAP and pro-
poses further improvement by warping the temporal axis to devote more modeling power
to the central parts of the trajectory.
6.4.1 Temporal resolution of LP-TRAP
Let us inspect the temporal resolution of LP-TRAP with an artificial signal. It is formed by two pulses sampled at 8 kHz, each 10 samples long (1.3 ms), with a 10 ms pause between them (see the top pane in Fig. 6.9). The signal is processed by the LP-TRAP and TRAP algorithms in 15 critical bands. The six plots below show the energy envelope in the 5th band. The leftmost pane illustrates the sub-band Hilbert envelope from the LP-TRAP calculation. Note that it contains the information which separates the pulses.
[Figure: top pane shows the test signal; the six panes below show the Hilbert envelope (4000 samples), LP-TRAP fp=50 with 51 points, LP-TRAP fp=50 with 501 points, TRAP with 50 points, LP-TRAP fp=20 with 501 points, and LP-TRAP fp=20 with 51 points and warp=2.0.]
Figure 6.9: Illustration of temporal resolution of LP-TRAP vs. TRAP.
It can be observed that:
• TRAP cannot distinguish the pulses, as it is calculated from 25 ms long windows.
• LP-TRAP with a 70th-order FDLP fits both peaks; however, 51 samples of the envelope are not enough to separate them.
• LP-TRAP cepstrum of order 70 can separate the pulses (the envelope with 501 samples is in principle calculated from the LP-TRAP cepstrum).
• LP-TRAP with a 20th-order FDLP cannot distinguish the peaks.
The rightmost pane of the figure shows a preview of the FDLP fit on a pre-warped temporal axis. The warping acts as a magnifying glass sliding over the signal and makes it possible to capture both peaks effectively even with an FDLP of order 20. It also makes it possible to distinguish both peaks in 51 samples of the LP-TRAP envelope.
6.4.2 Two ways to warp time axis
The idea of a “magnifying glass” can be implemented in two ways.
• Considering that the LP coefficients in LP-TRAP capture more temporal detail than the final sampled envelope (as shown in the last section), one could evaluate the LP-TRAP envelope with a denser sampling (see the two panes with LP-TRAP fp=70 in Fig. 6.9) and subsequently apply the non-uniform binning introduced for warped TRAPs (section 5.2.1, p. 48). By doing so, the final envelope would scale up the resolution in the center at the expense of the boundaries, preserving the original number of samples. However, this would only optically increase the resolution in the sampled envelope, not in the LP model itself.
• To capture more detail in the LP model itself, the temporal axis has to be warped prior to LP modeling, see Fig. 6.10. The sub-band Hilbert envelope has the full available resolution and can be non-uniformly resampled to yield a warped envelope, which is subsequently modeled by LP. Thus, LP is forced to devote more modeling power to the center of the window, and both the model parameters and the final LP-TRAP envelope can capture more detail.
[Figure: block diagram of warped LP-TRAP extraction: speech → sub-band DCT (15 bands) → sub-band Hilbert envelopes via FFT/iFFT (sizes marked “?”) → non-uniform resampling (IN = linear, OUT = warped) → compression | · |^(2·cmpr) → “autocorrelation” → LPC → a[k] + gain → LP-cepstra c[k], or FFT of the zero-padded a[k] and gain → log → LP-TRAPs.]
Figure 6.10: Warping temporal axis in LP-TRAP feature extraction.
Note: One might think of warping the input signal directly instead of the envelope. This is not possible due to the sub-band processing.
6.4.3 Non-linear time warping and sampling theorem
We will adopt the second warping approach. In a more detailed view, there is an issue to be solved: the proper sampling (denoted by the red exclamation marks in Fig. 6.10). What should
be the proper FFT and iFFT sizes and how to actually do the non-linear mapping?
Non-warped case
Let Nin be the size of the FFT calculating the sub-band Hilbert envelope and Nout the size of the iFFT calculating the autocorrelation. Without warping, Nin is determined by the sub-band DCT size NDCT, which is zero-padded to the length Nin = 2NDCT − 1 to avoid aliasing in the autocorrelation (the spectral compression is not considered). For FFT purposes, the nearest higher power of 2 is used.
Nout is determined by the LP order fp,
Nout ≥ 2fp − 2, (6.16)
because LP uses only the first fp samples of the autocorrelation and because the iFFT output is symmetric, having only Nout/2 + 1 non-redundant samples. Such Nout is typically lower than Nin. A mismatch between Nin and Nout can thus safely be avoided by setting both to the greater of the two.
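This bookkeeping can be sketched in a few lines (the helper names are ours):

```python
def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def fdlp_fft_size(n_dct, fp):
    """Non-warped case: N_in follows from zero-padding the sub-band DCT
    to 2*N_DCT - 1 samples (no autocorrelation aliasing), N_out from
    the LP order via Eq. 6.16; the mismatch is resolved by using the
    greater of the two for both transforms."""
    n_in = next_pow2(2 * n_dct - 1)
    n_out = next_pow2(2 * fp - 2)
    return max(n_in, n_out)
```

For instance, a 0.5 s sub-band at 8 kHz (NDCT = 4000) with fp = 70 gives Nin = 8192 and Nout = 256, so both sizes are set to 8192.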
Warped case
With warping, the reasoning is a little more complex. The following phenomena have to be considered:
• The warping function “zooms in” on the center, so the needed resolution of the envelope increases, and so must the FFT size in order to interpolate all the needed samples. In theory, for proper resampling the required Nin would be very high (see footnote 6). An approximation is to find the maximal slope α of the sampled warping function and set Nin ≥ αNout.
• Recall that Nout is given by Eq. 6.16 and that Nin ≥ αNout. When Nin is larger than Nout, many input samples of the Hilbert envelope are mapped to only a few samples of the warped envelope. This in turn requires that the linear Hilbert envelope be low-passed prior to the projection. However, a uniform low-pass filter would discard the needed details in the center and the whole idea would be lost. There is, however, a reasonable approximation to the non-uniform low-pass: one can bin together several neighboring samples of the linear envelope. The process is clarified in the following section.
6.4.4 Warping function
The mapping is done using the same function as for TRAPs (refer to section 5.2.1, p. 47). Recall its analog version
f(x) = 1 − (1 − x)^w, x ∈ ⟨0, 1⟩, (6.17)
6 For two constant sampling frequencies the minimal up-sampling factor is given by their least common multiple. Here, one of the frequencies changes, which in theory requires calculating the factor for every possible position on the warped axis and taking the least common multiple of all the particular factors. Such resampling is clearly not feasible.
Discretization
The input Hilbert envelope of size Nin is real, positive and symmetric (due to the FFT), which means it has Min = Nin/2 + 1 meaningful points. The same applies to the output warped envelope, which has Mout = Nout/2 + 1 meaningful points. Min and Mout are always odd, which means that both boundary points, as well as the middle point of the envelope, are warping-independent. The discrete warping function (its left, symmetric half) maps every output sample of the warped axis to an input sample:
in[k] = Nin/4 − round[ (Nin/4) · ((Nout/4 − k) / (Nout/4))^w ], k = 0, . . . , Nout/4. (6.18)
Note that the factor 4 comes from 2×2: once for the symmetric spectrum and once for taking only a half of the warping function. The function has to be center-symmetrized about the shared point k = Nout/4. Hence, the final function maps Mout to Min samples.
Required FFT size
To derive the required Nin for a given Nout, the slope of the discrete warping function has to be evaluated from two points near the center:
slope = { [1 − ((N/4 − N/4)/(N/4))^w] − [1 − ((N/4 − (N/4 − 1))/(N/4))^w] } / (N/4)^(−1)
= (N/4) { [1 − 0] − [1 − (4/N)^w] } = (N/4) (4/N)^w = (N/4)^(1−w), (6.19)
where N = Nout. Thus the required input FFT size is
Nin ≈ ceil(Nout/slope) = ceil[ 4 (Nout/4)^w ]. (6.20)
Example: A 0.5 s trajectory sampled at 8 kHz is to be warped with w = 1.5 and modeled by an FDLP of order fp = 70. Using Eq. 6.16 we get Nout ≥ 138; the next power of 2 gives Nout = 256. Eq. 6.20 yields
Nin = ceil( 4 · (256/4)^1.5 ) = 2048,
so the maximum zoom in the center is 8×. For a stronger warp w = 2.0 we get Nin = 16384 and a zoom of 64×.
FFT complexity grows as n log n, which in a real implementation limits the usable warp factors to about w = 2.0.
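Eqs. 6.18 and 6.20 are easy to check numerically; a sketch reproducing the worked example above (function names are ours):

```python
import math

def warp_index(k, n_in, n_out, w):
    """Discrete warping function, Eq. 6.18 (left symmetric half):
    maps output sample k = 0..n_out/4 to an input sample index."""
    return n_in // 4 - round((n_in / 4)
                             * ((n_out / 4 - k) / (n_out / 4)) ** w)

def required_n_in(n_out, w):
    """Required input FFT size, Eq. 6.20: N_in ~ ceil(4 * (N_out/4)^w)."""
    return math.ceil(4 * (n_out / 4) ** w)
```

The endpoints are warping-independent: warp_index maps k = 0 to input sample 0 and k = Nout/4 to Nin/4 for any w.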
Properly resampling linear input to non-linear output
Eq. 6.18 projects output samples to input samples of the envelope. However, as mentioned above, the mapping does not sample from the input directly; instead, a binning is done, as illustrated in Fig. 6.11. Consider the point “2” in the figure. Instead of sampling the input only at the specific position denoted by the small arrow (and violating the sampling theorem), an average of all samples between points A and B is taken. The actual positions of these interlaced points can be obtained from Eq. 6.18 simply by substituting k with s = k ± 1/2.
Figure 6.11: Illustration of binning process in warping LP-TRAPs.
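A sketch of the binning on the left half of the envelope (the names and the exact handling of the bin edges are our assumptions; the bin boundaries follow Eq. 6.18 evaluated at s = k ± 1/2):

```python
import numpy as np

def warp_pos(s, n_in, n_out, w):
    # Continuous counterpart of Eq. 6.18, evaluated at fractional s.
    return n_in / 4 - (n_in / 4) * ((n_out / 4 - s) / (n_out / 4)) ** w

def binned_resample_half(env, n_out, w):
    """env: left half of the linear envelope (n_in/4 + 1 samples).
    Each warped output sample k is the average (bin) of the input
    samples lying between the images of s = k - 1/2 and s = k + 1/2."""
    n_in = 4 * (len(env) - 1)
    out = np.empty(n_out // 4 + 1)
    for k in range(n_out // 4 + 1):
        lo = warp_pos(max(k - 0.5, 0.0), n_in, n_out, w)
        hi = warp_pos(min(k + 0.5, n_out / 4), n_in, n_out, w)
        i0, i1 = int(np.floor(lo)), int(np.ceil(hi))
        out[k] = env[i0:max(i1, i0 + 1)].mean()  # non-empty bin average
    return out
```

A constant envelope passes through unchanged, which is a quick check that the bins tile the input without gaps.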
6.4.5 Experiments
The warping was tested first on the S/N task and chosen configurations were subsequently evaluated on the CTS task. The configuration common to all experiments was: segment length len = 500 ms with a step = 10 ms frame shift, sub-band overlap blap = 0.375, dynamic compression cmpr = 0.1. There were three free parameters:
• LP order fp; the initial value was 70.
• Cepstral feature size ncep; the initial value was 50 (chosen by analogy with the TRAP-DCT features, where this size performed reliably; refer to Fig. 5.4).
• Warping factor warp ≥ 1. Values below 1.0 deteriorated the performance in preliminary experiments, consistently with warped TRAPs (section 5.2.1).
Goals:
1. To find out whether the warping improves performance. This was tested with fixed fp = 70 and ncep = 50. Higher values should not be necessary, and if lower values were optimal for the warped case, the observation should not be biased anyway, as more features generally do no harm.
2. To learn the influence of lowering the LP order fp. Without warping, performance with low fp should be worse than with high fp. With warping, two extreme cases might occur: 1) If the performance improved and approached the linear case with high fp, it would mean that only the center of the envelope is important and the temporal resolution of LP-TRAP without any warping is good enough. 2) If the performance did not improve with warping, it would suggest that a low-order LP basically cannot sufficiently capture the envelope shape.
3. To learn the influence of lowering the cepstral order ncep on performance. Experience with TRAP-DCT on the S/N task suggests that a low cepstral order might not harm. But what happens when the temporal axis is warped?
As the warped LP-TRAPs are computationally demanding, the error surface was sampled only at certain combinations of the three parameters.
S/N task – results
The S/N task setup (MLP and HMM) was consistent with the previous experiments (refer to the baseline TRAP architecture, section 4.4). All possible combinations of fp = 20/70, ncep = 15/50 and warp = 1.0/1.5/2.0 were evaluated. The results are summarized in Tab. 6.5.
Tab. 6.5.
LP order Cepstral size Warp. factor
fp ncep warp FER[%] WER[%]
70 50 1.0 18.7 4.3
70 50 1.5 18.7 4.3
70 50 2.0 18.6 4.3
70 15 1.0 19.0 4.5
70 15 1.5 18.4 4.0
70 15 2.0 18.5 4.5
20 15 1.0 19.0 4.3
20 15 1.5 18.5 4.7
20 15 2.0 18.4 4.4
Table 6.5: Performance of warped LP-TRAPs as a function of fp, ncep, warp. S/N task.
The S/N task did not fulfill the expectations. None of the results was significantly worse or better than the default settings (the first row), including a random sampling of the error surface at interlaced positions (not displayed). This can be explained by the fact that the S/N task vocabulary does not contain speech with fast transients, except for the plosive “t” in “two”. There might thus be no need for better temporal modeling.
CTS task – results
Contrary to the S/N task, the CTS evaluation displayed a clear and consistent picture. The setup followed the baseline TRAP architecture (refer to section 4.4.2). The following observations can be made from Tab. 6.6:
LP order Cepstral size Warp. factor WER[%]
fp ncep warp Devel set
70 50 1.0 52.6
70 50 1.75 51.4
70 15 1.0 55.0
70 15 1.75 52.6
20 15 1.0 54.8
20 15 2.0 53.8
Table 6.6: Performance of warped LP-TRAPs as a function of fp, ncep, warp. CTS task.
• The warping notably improved the performance in all cases. The best WER was reached for fp=70, ncep=50 (the default values) and warp=1.75 (1.2% better than without warping). On the CTS task this is the best word error rate out of all the compared long-term representations (TRAP, TRAP-DCT, M-RASTA).
• Lowering the LP order to fp=20 deteriorated the performance. Warping was helpful, but the score of LP order 70 was not reached. This suggests that a low-order LP model is not able to capture the important details in the envelope, even when it is warped.
• Lowering the cepstral order to ncep=15 provides quite interesting answers.
With LP order 70 and linear time, using only the first 15 cepstral coefficients per band was clearly insufficient (compare rows 1 and 3). However, when warping was applied, just 15 coefficients matched the score of 50 coefficients per band (compare rows 1 and 4)! This suggests that warping compresses the important information into the lower cepstral coefficients.
The full 50 cepstral coefficients were still the better choice (compare rows 2 and 4), which means that the feature size of 50 per sub-band is not redundant and that the warping helps the LP model to better capture the detailed information from the central parts of the window.
6.5 Conclusion
The original idea of Marios Athineos of representing speech by means of sub-band modulation spectra obtained from sub-band Hilbert envelopes smoothed by linear prediction was implemented in the fdlp tool. The method does not impose any sampling constraints on the spectrum of the speech (as opposed to the 100 Hz frame rate, i.e. 10 ms step, of conventional approaches) and therefore allows for precise localization of temporal events.
Using fdlp, the method was optimized on the S/N and CTS tasks, resulting in features performing comparably to the best long-term representations (TRAP-DCT) known to the author. It can be concluded that:
• LP-TRAP cepstral features perform significantly better than sampled LP-TRAP envelopes. This is consistent with the TRAP and TRAP-DCT observations.
• LP-TRAP cepstra markedly outperform TRAP-DCT (by about 1.4%) on the CTS task. However, on the S/N task both features perform about the same. The better temporal localization of events in LP-TRAPs than in TRAPs thus seems useful in more complex tasks, which are able to utilize the detailed information.
• Pre-warping the temporal axis in order to stress the central part of the 500 ms trajectories allows the temporal envelopes to be modeled sufficiently well by fewer cepstral coefficients than in the linear case (15 instead of 50 coefficients on the CTS task), thus reducing the feature bandwidth by 70% without loss in performance. In addition, when the bandwidth is not an issue, using all 50 coefficients can markedly improve the score over the linear case (a 1.2% improvement was achieved on the CTS task).
It should be noted that the warping has not yet been thoroughly explored. More appropriate warping functions could be suggested and optimized by a more careful scanning of the error surface. There is thus a potential for further improvement.
Chapter 7
Multi-resolution RASTA filtering
(M-RASTA)
This chapter introduces a novel speech representation for ASR. The technique extends earlier works on RASTA filtering by applying a bank of two-dimensional band-pass filters as a pre-processing step in TANDEM feature extraction. The filters are applied to an auditory-like speech spectrogram and the set of resulting spectra is projected to phoneme posteriors using an MLP. Since the filters have zero-mean impulse responses, the technique is inherently robust to linear distortions.
7.1 Introduction – M-RASTA from different perspectives
Before presenting the M-RASTA feature extraction, let us summarize related works and
their ideas.
Relationship to TRAP and TANDEM
M-RASTA extracts features from the spectro-temporal plane similarly to TANDEM or TRAP and benefits from combining both architectures. It uses a temporal context of up to 1000 ms like TRAPs, yet uses only one MLP like TANDEM. The long context in principle makes it possible to preserve the complete information about a phoneme, and using one MLP with inputs from all bands makes it possible to learn virtually any inter-band relationships. Finally, there is a nice property inherited from both systems: it was shown in this work, and also by other authors, that PLP–TANDEM with 9 frames of PLP performs better than TRAP on clean speech. However, contrary to TRAPs, PLP coefficients are not robust to linear distortions, as the DFT involved in the cepstral calculation causes any distortion (even one confined to a narrow band) to affect all coefficients. In M-RASTA, the MLP has access to the spectra (as in TRAPs), though filtered with temporal filters, and therefore the robustness to linear distortions is preserved.
Modulation Spectrum and Robustness
The log-energy in a critical sub-band has its own dynamic structure, which is recorded in a raw form in the TRAP vectors. The spectrum of a TRAP (the modulation spectrum) quantifies to what extent the dynamics contain slow and fast changes. This modulation domain is exactly where strong a priori knowledge can be applied, making it possible to partially separate speech from other unwanted artifacts.
It was already mentioned that the vocal tract has an intrinsic inertia which prevents very fast movements, while too slow movements are not efficient for communication. Experiments with human and machine recognition (reviewed in section 3.2) found that the active modulation range of speech lies between 1.5–16 Hz with a maximum around 4 Hz. On the contrary, non-speech artifacts often lie outside this range: channel noise is typically stationary or relatively slow-changing, while incidental noises such as clicks and bangs can be very fast. Band-pass filtering of the modulations can thus significantly improve robustness, as proven by RASTA filtering. M-RASTA aims at preserving this property.
Temporal Decomposition of Energy Trajectory
From the point of view of temporal decomposition, M-RASTA is closely related to TRAP-DCT. There, DCT bases weighted by a Hamming window are convolved with TRAPs, which is effectively a temporal decomposition onto a set of cosine bases. M-RASTA differs from DCT in that it applies bases – or filters – derived from the Gaussian function. Experiments from section 5.1.4 on page 43 suggested that better projection bases than those of TRAP-DCT could possibly be found. It seemed desirable for the length of a particular cosine base to be proportional to its frequency; in other words, it was suggested to use a projection base of constant shape and only vary its width. M-RASTA implements this idea by using impulse responses with similar wavelets differing only in width.
Emulating Cortical Receptive Fields
One of the inspirations for the M-RASTA filters were the differently motivated, though related, efforts of Kleinschmidt and Gelbart. They used two-dimensional time-frequency Gabor filters and explicitly stated the relation of such speech processing to the known physiology of the auditory cortex. They aimed at a simplified version of Shamma’s model of cortical processing [66].
7.2 M-RASTA features
The combination of temporal filters and frequency filters applied to the auditory spectrogram can be interpreted as 2-D filtering of the spectro-temporal plane. In M-RASTA, the 2-D filtering is implemented by first processing the critical-band trajectories with temporal filters and subsequently applying frequency filters to the result; see the diagram in Fig. 7.1.
1. The critical-band auditory spectrum is obtained in the same way as in TRAP (see section 5.1.1, p. 37).
2. The temporal trajectory of energy in each sub-band is filtered with a bank of fixed-length low-pass FIR filters. Their impulse responses represent Gaussian functions of several different widths.
3. The first and second temporal differentials are computed from the smoothed trajectories, yielding a set of N modified spectra every 10 ms (labeled “t-filtered spectra” in the diagram). The same filter bank is used for all bands.
[Figure: diagram of the pipeline: auditory spectrogram (critical bands × time frames) → per-band temporal filtering with FIR banks → t-filtered spectra → frequency filtering with 3-tap FIRs {−1, 0, +1} and {−0.5, 1, −0.5} → TANDEM probability estimator.]
Figure 7.1: M-RASTA feature extraction scheme.
4. From each of the N modified spectra the first and second frequency derivatives are calculated, yielding two additional feature streams (labeled ∆ and ∆2 in the diagram).
5. The feature vector forming the input to the TANDEM MLP is then obtained by concatenating all feature streams. The MLP is trained to estimate phoneme posteriors.
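The steps above can be sketched as follows (a simplified illustration rather than the exact tool used in the thesis; the function names are ours, the temporal filters are applied directly as the Gaussian derivatives of Eqs. 7.1–7.2, and the utterance is assumed to be at least as long as the 101-tap filters):

```python
import numpy as np

def gauss_derivs(sigma_ms, n_taps=101, step_ms=10.0):
    """Impulse responses g1, g2: first and second derivatives of a
    Gaussian (Eqs. 7.1-7.2), sampled on a +-500 ms grid and
    peak-normalized (our choice of the scaling constants k1, k2)."""
    x = (np.arange(n_taps) - n_taps // 2) * step_ms
    g1 = -x * np.exp(-x ** 2 / (2.0 * sigma_ms ** 2))
    g2 = (x ** 2 - sigma_ms ** 2) * np.exp(-x ** 2 / (2.0 * sigma_ms ** 2))
    return g1 / np.abs(g1).max(), g2 / np.abs(g2).max()

def mrasta_frame_stream(spectrogram, sigmas_ms=(8, 11, 16, 32, 45, 64, 90)):
    """spectrogram: (n_frames, n_bands) log critical-band energies,
    n_frames >= 101. Returns the concatenated M-RASTA streams:
    temporally filtered spectra plus their frequency derivatives."""
    streams = []
    for sigma in sigmas_ms:
        for g in gauss_derivs(sigma):
            # zero-phase temporal filtering, band by band
            filt = np.apply_along_axis(
                lambda t: np.convolve(t, g, mode="same"), 0, spectrogram)
            d1 = np.apply_along_axis(    # frequency delta, taps {-1, 0, +1}
                lambda s: np.convolve(s, [-1, 0, 1], mode="same"), 1, filt)
            d2 = np.apply_along_axis(    # second delta, taps {-0.5, 1, -0.5}
                lambda s: np.convolve(s, [-0.5, 1, -0.5], mode="same"), 1, filt)
            streams += [filt, d1, d2]
    return np.concatenate(streams, axis=1)
```

With 15 bands, 7 widths, 2 filter shapes and 3 streams per filter, each 10 ms frame yields 630 MLP inputs.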
7.2.1 Temporal filters
Instead of filtering the sub-band trajectories with the low-pass Gaussian function and subsequently computing the differentials, the trajectories are directly filtered with the first and second derivatives of the Gaussian, which represent impulse responses of band-pass filters. The impulse responses are obtained by sampling the analytic derivatives of the Gaussian, which are given by
g1[x] = −x · exp(−x²/(2σ²)) · k1, (7.1)
g2[x] = (x² − σ²) · exp(−x²/(2σ²)) · k2, (7.2)
where x is time, x ∈ ⟨−500, 500⟩ ms with a step of 10 ms, the standard deviation σ determines the effective width of the Gaussian, and k1, k2 are scaling constants. The
derivatives will be further referred to as g1 and g2. Filters with low σ values (high-pass) have finer temporal resolution; high-σ filters (low-pass) cover a wider temporal context and yield smoother trajectories. All filters are zero-phase FIR filters, i.e. they are centered around the frame being processed. The length of all filters is fixed at 101 frames, corresponding to roughly 1000 ms of signal, thus introducing a processing delay of 500 ms.
The first and second derivatives of the Gaussian function have zero mean by definition. By using such impulse responses we gain an implicit mean normalization of the features within a temporal region proportional to the value of σ, which confers robustness to linear distortions. Impulse responses given by Eq. 7.1 are shown in the left part of Fig. 7.2; the right part shows impulse responses given by Eq. 7.2. The respective frequency responses are illustrated in Fig. 7.3.
Since the impulse responses are discretized and limited in length to 101 samples, the real Gaussian derivatives are approximated with a certain error, which increases towards both extremes of the σ range. For small σ the sampling is too sparse; for large σ there are significant discontinuities at the endpoints due to the finite truncation of the infinite
Figure 7.2: Normalized impulse responses of the first two sampled and truncated Gaussian
derivatives g1 and g2 for σ = 8 – 130 ms.
Figure 7.3: Normalized frequency responses of the first two sampled and truncated Gaus-
sian derivatives g1 and g2 for σ = 8 – 130 ms.
Gaussian function, both introducing a DC offset, see Fig. 7.4. Note that g1 has odd symmetry and thus always has zero mean, but the sampled and/or truncated g2 may have a non-zero mean.
Figure 7.4: Detail of the first two sampled and truncated Gaussian derivatives for σ = 6 ms
(left), σ = 8 ms (center), σ = 130 ms (right).
The limits for σ were found using the somewhat arbitrary criterion that the DC offset of the sampled impulse response must not exceed 10% of the maximal absolute value of the response. This preserves the normalizing property important for robustness. The resulting range is σ ∈ (6, 130) ms. The σ values used in the experiments are spaced logarithmically.
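The criterion can be reproduced numerically. A sketch in which we read the “DC offset” as the sum of the filter taps relative to the peak magnitude (our interpretation; only g2 needs to be checked, since the odd-symmetric g1 always sums to zero):

```python
import numpy as np

def dc_offset_ok(sigma_ms, n_taps=101, step_ms=10.0, tol=0.1):
    """True if the sampled, truncated g2 (Eq. 7.2) has |sum of taps|
    within tol * peak magnitude. For too small sigma the sparse
    sampling breaks the zero-mean property; for too large sigma the
    truncation at +-500 ms does."""
    x = (np.arange(n_taps) - n_taps // 2) * step_ms
    g2 = (x ** 2 - sigma_ms ** 2) * np.exp(-x ** 2 / (2.0 * sigma_ms ** 2))
    return abs(g2.sum()) <= tol * np.abs(g2).max()
```

Under this reading, widths inside the quoted (6, 130) ms range pass the 10% test, while clearly smaller or larger widths fail at the respective extremes.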
7.2.2 2-D – time-frequency filters
The first and second frequency derivatives are approximated by 3-tap FIR filters with impulse responses {−1, 0, +1} and {−0.5, 1, −0.5}, introducing a three-Bark frequency context. The combination of temporal and frequency filters yields 2-D filters. Examples of their impulse responses with 101×3 taps are shown in Fig. 7.5. The vertical axis is smoothed to illustrate the filtering effect on the original spectrum (prior to the critical-band integration).
Figure 7.5: Example of impulse responses of 2-D RASTA filters with σ = 60 ms.
7.3 Experiments
This section chronologically presents the development progress on the S/N task, from testing individual filters and combining more filters, to evaluations of M-RASTA with optimized settings on the S/N and CTS tasks.
7.3.1 Filtering with single filter
The aim of the first experiment was to learn the effect of temporal filtering of the auditory spectrogram with a single filter. It proceeded as follows:
• Auditory spectra were calculated from the speech.
• For each filter shape (g1, g2) and chosen width σ (varying logarithmically from 6 ms to 130 ms):
– the auditory spectra were processed with the filter,
– a new MLP was trained using the modified spectra,
– new HMMs were trained,
– FER and WER were evaluated.
Setup details
• 15 critical-band energies,
• 101-tap FIR filters (max. 1000 ms length), 9 different σ values,
• MLP topology of 15 × 1800 × 29 units.
If no filter were applied, the MLP would be trained on the plain auditory spectra. Such a system achieves 50.5% FER and 19.1% WER and serves as a baseline. The results of the g1 and g2 filtering are shown in Fig. 7.6.
Figure 7.6: M-RASTA with only one filter: FER and WER dependencies on filter width.
Observation
• All dependencies are quite smooth. The only outlier point is g2 for σ = 6 ms. It
is probably caused by too sparse sampling of the narrow Gaussian derivative (see
Fig. 7.4). Because of this, the used σ range was limited to 8–130 ms for all further
experiments.
• Similar to the experiments in Chapter 5, WER criterion prefers much faster changes
than FER criterion. The best FER (52%) was observed for the second derivative g2
with a wide σ = 64 ms. The Best WER (11.0%) was observed for the first derivative
g1 with a narrow σ = 10 ms.
• When used as the only MLP input, the filtered spectrum can perform markedly
better than the plain spectrum, even though the DC component has been removed
(recall 11.0% vs. 19.1%). This observation is encouraging, as it indicates considerable
potential for resistance to channel noise without any loss of accuracy.
• Minima of all dependencies fall within the chosen σ range, suggesting that neither
longer impulse responses nor finer sampling of the input spectra is needed.
7.3.2 Combining two temporal filters
Having settled the range of filter widths σ, the next concern was combining the filters.
All possible pairs of filters with seven different σ values (0.8, 1.1, 1.6, 3.2, 4.5, 6.4, 9.0)
and both shapes g1 and g2 were formed. The MLP input size doubled to 30 features. For
every σ and filter pair g1−g1, g1−g2, g2−g2, a new system was trained and evaluated.
See Fig. 7.7 for the results.
Observation
• Filter pairs of the same shape (g1−g1, g2−g2) generally performed better for
distant σ values, because the features are more complementary; see the top and
bottom rows of Fig. 7.7.
Figure 7.7: Combining two temporal filters gx–gy in M-RASTA. Performance as a function
of widths σ. Legend = width of gx [ms], horizontal axis = width of gy [ms].
• Observations from the previous experiment hold: 1) g2 filters consistently reached
better FER than g1 filters and the opposite applied to WER; 2) FER preferred wider
filters, σ ≈ 32−90 ms, while WER preferred narrower filters, σ ≈ 8−32 ms.
• Heterogeneous pairs (g1−g2) outperformed homogeneous pairs in FER and were
comparable in WER.
• For heterogeneous pairs, distant σ values were no longer preferred, suggesting that a
different filter shape brings more complementary information than a distant σ. This
is further supported by the next point.
• The best FER came from longer σ, specifically σ1 = 32 ms, σ2 = 90 ms (σn denotes
the width of the gn filter). This matches the previous experiment with only one filter,
see Fig. 7.6, left.
• The best WER came from narrow σ in the range 8–32 ms.
The best scores for each of the three filter combinations are summarized in Tab. 7.1.
Filters FER [%] σ1 σ2 WER [%] σ1 σ2
g1−g1 47 1.6 6.4 6.5 0.8 3.2
g2−g2 39 3.2 9.0 8.1 1.1 4.5
g1−g2 32 4.5 6.4 6.5 1.6 0.8
Table 7.1: Best reached FER and WER for three feature pair types with their respective
σ values.
The experiment confirmed the complementarity of different temporal filters. Their
combination improves recognition accuracy.
7.3.3 Tuning the system for best accuracy
Having sampled the properties of the temporal filters, we can proceed with optimization.
The starting knowledge is:
• Both filter shapes g1 and g2 should be employed.
• Widths should fit the range σ ∈ (8, 130) ms.
• Combining multiple filters improves accuracy.
First, a bank of 16 filters was formed, consisting of 8 first-order and 8 second-order
derivatives of the Gaussian function (g1 and g2). Each derivative had 8 widths placed
equidistantly on a logarithmic scale, σ = 8.0, 11.9, 17.7, 26.4, 39.4, 58.6, 87.3, 130.0 ms.
The bank was applied to all 15 temporal trajectories of critical-band log-energies,
yielding 16 × 15 = 240 spectral features per frame. These formed the main feature
stream, the t-stream.
The other two feature streams were formed by applying two 3-tap FIR filters across
frequencies to the outputs of each of the 16 filters, as illustrated in Fig. 7.1, which
together represent 2-D filters. Frequency derivatives for the first and last critical bands
are not defined, so we ended up with two additional feature streams, ∆f and ∆2f, each
of size 16 × 13 = 208 features.
Subsequently, three MLPs were trained with respective inputs:
• 240 features – t stream only,
• 448 features – t+∆f streams, 240 + 208 features appended,
• 656 features – t+∆f+∆2f streams, 240 + 208 + 208 features appended.
The MLP topology differed among the three setups only in the input layer size: N ×
1800 × 29 units. As a comparative baseline, TRAP-DCT (101 frames, 51 features) was
used, because it is conceptually similar (refer to section 5.1.3). Its accuracy is 19.7% FER
and 4.2% WER. The results of the M-RASTA systems are in Tab. 7.2.
Used feature streams FER [%] WER [%]
t 19.3 4.3
t+∆f 17.4 3.4
t+∆f+∆2f 17.8 3.6
Table 7.2: Adding frequency derivatives to M-RASTA features.
Observation
• M-RASTA with only temporal filters did not outperform the baseline.
• Augmenting the t-stream with frequency derivatives ∆f brought a large improvement,
resulting in a system outperforming all competitors (PLP, PLP-TANDEM, TRAP,
TRAP-DCT, LP-TRAP). Although the ∆f features do not bring in any new information,
they explicitly introduce inter-band context, which seems to be necessary.
• Adding the ∆2f features decreased the performance, so they were not used further.
NOTE: The substantial gain from the ∆f-stream invites training an MLP on that stream
alone. Doing so results in 20.8% FER and 5.0% WER. This means that the improvement
is caused by the complementarity of the streams, rather than by the performance
of the derivatives ∆f themselves. Combining both streams is thus essential.
How many filters?
So far it is not clear how many temporal filters should be used. The somewhat arbitrary
count of 2 × 8 deserves experimental verification.
Two more systems were trained with the successful t+∆f features, one containing
2 × 4 temporal filters and the other containing 2 × 16 filters. In both cases, the widths
of the filters were again distributed logarithmically from 8 ms to 130 ms; the difference
between the systems was the density of the filters. A comparison of the two alternatives
and the default setting is given in Tab. 7.3.
Number of filters Features FER [%] WER [%]
2 × 4 224 18.0 3.9
2 × 8 (default) 448 17.4 3.4
2 × 16 896 17.4 3.7
Table 7.3: Looking for a suitable number of temporal filters in M-RASTA.
It seems that 8 different widths were close to the optimum; the other counts performed
worse. Note that 2 × 16 filters require quite a large MLP. Had the overall MLP size
stayed constant, the performance would have been 18.4% FER and 3.9% WER.
Evaluation on CTS
Two successful setups of M-RASTA were evaluated on CTS:
• M-RASTA 240 features (from 2×8 temporal filters at σ = 8–130 ms) and
M-RASTA 448 features (from 2×8 temporal filters plus their frequency derivatives ∆f).
• MLP topology (240 or 448) × 2000 × 46 units.
Results of M-RASTA compared to TRAP-DCT are given in Tab. 7.4. M-RASTA and
TRAP-DCT perform about the same. The explicit frequency relations between bands (∆f)
no longer seem to be crucial, though they still improve performance. This can be
explained either by the larger size of CTS compared to S/N (about 4× more training data)
or, more likely, by the strong language model, which attenuates any particularities in
the features, as discussed in section 4.4.
Devel set Test set
Features FER[%] WER[%] WER[%]
M-RASTA 240 (t) 37.8 53.7 52.5
M-RASTA 448 (t+∆f) 37.5 53.3 51.4
TRAP-DCT 39.2 53.3 51.2
Table 7.4: Performance of M-RASTA on CTS.
Note on MLP: It could be argued that in the case of 448 features the TANDEM MLP
is quite big. For 240 features the overall MLP size is about 580,000 parameters, which is
comparable to PLP-TANDEM, but 448 features enlarge the MLP by about 70%. If the
overall size had to be preserved, the hidden layer would have 1160 units and the WER
would degrade by 0.6% (male Tune set), thus approaching M-RASTA 240. In that case,
M-RASTA 240 would probably be the first choice due to its better robustness to channel
noise (refer to the following section).
7.3.4 Robustness to channel noise
To get an idea of how robust the new speech representation is to a stationary channel
mismatch between training and testing data, a first-order preemphasis filter with α = 0.97
was applied to the test data. The distorted data were passed through the existing systems
and word error rate was evaluated. As can be seen in Tab. 7.5, short-term PLP features
are very sensitive to this distortion, and so is TRAP-DCT. However, to be fair, the
TRAPs entering TRAP-DCT could have been mean-normalized, which would boost the
robustness, though possibly at the price of accuracy in matched conditions. M-RASTA
features are quite resistant, especially when no explicit relationship among bands (∆f)
is involved.
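The channel distortion used here is a standard first-order preemphasis, y[n] = x[n] − α·x[n−1] with α = 0.97. A minimal sketch, assuming the first output sample is passed through unchanged (zero history):

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order preemphasis y[n] = x[n] - alpha * x[n-1], used to
    simulate a stationary channel mismatch on the test data.  The first
    sample is passed through unchanged (zero history assumed)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

distorted = preemphasis(np.ones(5))   # a constant signal is almost cancelled
```

Since the filter strongly attenuates low frequencies and boosts high ones, it tilts the spectrum, exactly the kind of linear distortion the zero-mean M-RASTA filters are designed to ignore.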
When the same preemphasis was applied to the CTS male Tune set, performance degraded
by 6.6% WER (absolute) for PLP features, by 0.6% for M-RASTA 240 features, and by
2.1% for M-RASTA 448 features.
Matched Mismatched Relative loss
Features WER [%] WER [%] [%]
PLP 5.2 13.5 160.0
TRAP-DCT 4.2 5.5 31
M-RASTA 240 (t) 4.3 4.4 2.1
M-RASTA 448 (t+∆f) 3.4 3.6 4.0
Table 7.5: Influence of channel mismatch on performance of M-RASTA and other features.
S/N task.
7.3.5 Modulation frequency properties
As shown in Fig. 7.3, the applied RASTA filters cover a wide range of modulation
frequencies, up to 50 Hz (the spectra are sampled at 100 Hz). To get an insight into the
relative importance of various modulations for ASR of continuous digits from the
M-RASTA point of view, the modulation range was shrunk from both the lower and the
higher ends by modifying σ. To keep the number of free parameters in the system constant,
the shrinking was accompanied by reducing the spacing between the filters' center
frequencies, so that even for the narrowest range there were still 2×8 filters, as
illustrated in Fig. 7.8.
Figure 7.8: Shrinking the modulation bandwidth in M-RASTA by limiting σ range. Blue
with pluses = gradually omitting wide filters (only fast modulations preserved), red with
circles = gradually omitting narrow filters (only slow modulations preserved).
Determining bandwidth
For each set of filters g1,2 associated with a set of σi, i = 1 . . . 8, the bandwidth was
determined as follows. Frequency responses of all filters were normalized to a maximum
gain of 0 dB, without loss of generality1. Subsequently, the most extreme filters in the
filter bank were found, and their frequencies of 3 dB attenuation were taken as the
bandwidth limits. This is illustrated in Fig. 7.9 for a simple bank with two filters.
1Consider that the MLP weights the outputs of the individual filters so as to minimize a global error
during training; it can therefore compensate for any differences in gain. Besides, the input data to the
MLP are mean and variance normalized.
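The bandwidth determination can be sketched as follows. This is a minimal NumPy sketch, assuming a 100 Hz sampling of the temporal trajectories and a simple −3 dB threshold on the union of the normalized filter responses; names are illustrative:

```python
import numpy as np

def bank_bandwidth(impulse_responses, fs=100.0, n_fft=4096):
    """-3 dB bandwidth of a filter bank sampled at fs Hz.

    Each frequency response is normalized to a 0 dB maximum; the bank's
    limits are the lowest and highest frequencies at which any filter in
    the bank is still within 3 dB of its own peak."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    within_3db = np.zeros(len(freqs), dtype=bool)
    for h in impulse_responses:
        mag = np.abs(np.fft.rfft(h, n_fft))
        gain_db = 20.0 * np.log10(mag / mag.max() + 1e-12)
        within_3db |= gain_db >= -3.0
    idx = np.where(within_3db)[0]
    return freqs[idx[0]], freqs[idx[-1]]

# a simple first-difference filter: -3 dB point near fs/4, peak at Nyquist
lo, hi = bank_bandwidth([np.array([1.0, -1.0])])
```

For a real M-RASTA bank the lower limit comes from the widest filter and the upper limit from the narrowest one.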
Figure 7.9: Determining bandwidth of a bank with two filters. Illustrated with g1,2 filters
at σ=130 ms.
The mapping between σi and the bandwidth of the associated g1,2 filters is shown in
Fig. 7.10: if the whole filter bank acts as a low-pass, its cutoff frequency is determined
by the upper limit of the narrowest σ in the bank, and vice versa.
Figure 7.10: Mapping between σ and bandwidth of associated g1,2 filters.
Results
The original modulation bandwidth was shrunk by cutting off either the high-frequency
or the low-frequency content. For every filter bank from Fig. 7.8, a new recognizer
was trained and WER and FER were evaluated. The results in Fig. 7.11 suggest that:
• Eliminating a significant part of the low modulation frequency range (up to 4 Hz) has
no noticeable effect on WER, while even a moderate cut in the high modulation
frequencies (down to 19 Hz) is detrimental (see the right pane of the figure). This
observation matches Drullman's HSR experiments [33].
• For FER, low modulation frequencies are more important, while higher modulation
frequencies can be eliminated with only a minor effect on FER (see the left pane of the
figure). A range of approximately 1.5–8 Hz appears to contain most of the relevant
information for the frame-level classification. This is consistent with the observations
of the approaches reviewed in section 3.2.
Figure 7.11: FER (left) and WER (right) as a function of the cutoff frequency of filter
banks acting as low-pass or high-pass in the modulation spectrum.
7.3.6 Discrepancy between MLP and HMM – phoneme posteriograms
Two evaluation criteria have been used in this thesis. Frame error rate is a measure of
MLP reliability in estimating phoneme (or generally class) posteriors. Optimizing features
with respect to this criterion ensures the best match between the estimated posteriors
and the underlying sequence of phonemes. Since the posteriors form the input to the HMM
decoder, which is evaluated by the standard word error rate criterion, the two criteria were
supposed to be consistent. However, various experiments and features from this thesis
suggest that this is not the case. In fact, a consistent discrepancy was observed, related
to the modulation properties of speech.
Recall Fig. 7.11 and Fig. 7.7 from M-RASTA and also Fig. 5.4 (pp. 42) from TRAP-
DCT. In all cases, the FER criterion “preferred” longer and smoother input corresponding
to slow modulations, whereas the WER criterion required rather shorter inputs with
faster modulations. It is interesting to compare posteriograms of the speech (i.e. the
evolution of the estimated posteriors in time) for two features, one optimized for FER
and the other for WER. Fig. 7.12 shows a rather extreme case: two posteriograms of the
same utterance “nine” as estimated by two M-RASTA MLPs, each using only two
temporal filters (refer to section 7.3.2 for details). The first features gave the best FER
and the second the best WER:
Optimized for σ(g1) [ms] σ(g2) [ms] FER [%] WER [%]
Best FER 45 64 32 10.1
Best WER 32 16 48 6.6
The displayed utterance comes from the labeled training data. The HMM framework
recognizes test digits at 10.1% WER using features similar to the upper plot and at 6.6%
WER using features similar to the lower plot.
We can speculate that the “Best FER” MLP smooths the posteriors so that the trend
can be better seen by the human eye, but that might not be what HMMs need.
Figure 7.12: Posteriograms for utterance “nine”. Horizontal phoneme sequence represents
truth (10 ms step). Posteriors were optimized for the best FER (upper plot) and for the
best WER (lower plot).
HMMs model phoneme transitions by themselves and might prefer to get more detailed,
though noisier, information. ASR pursues the main goal of finding the best word sequence,
and any intermediate products are disregarded.
Such thoughts could suggest withdrawing the TANDEM architecture and returning to the
idea of the hybrid approach, where MLP probabilities replace the Gaussian mixture model,
but that would be a step back (refer to pp. 16). Yet there is a different, though perhaps
slightly peculiar, idea: abandoning the HMM framework and recognizing words purely
with MLPs. It will be presented in the next chapter.
7.4 Combining LP-TRAP and M-RASTA
The previous chapter introduced LP-TRAP features with two key properties:
1. Energy envelopes in sub-bands are formed by sub-band Hilbert envelopes. This
brings high temporal resolution.
2. Sub-band energies are modeled by LP cepstra.
In this chapter, trajectories of auditory spectra were passed through a bank of M-RASTA
filters of different widths. Narrow filters were shown to be essential for accuracy, yet
how narrow they could be was limited by the spectral sampling frequency of 100 Hz.
This suggests combining the two techniques: sub-band Hilbert envelopes offer resolution
down to milliseconds, so the M-RASTA impulse responses can be sampled much more
finely and their time constants can be shorter.
Hilbert-M-RASTA features
The combination of LP-TRAP and M-RASTA is characterized by three processing steps:
1. Long segments of sub-band Hilbert envelopes are extracted from the speech
(the FDLP model is not used).
2. M-RASTA t + ∆f filters are applied to the logarithm of Hilbert envelopes.
3. Processed spectra are fed to the TANDEM MLP.
Figure 7.13: Scheme of Hilbert-M-RASTA feature extraction.
The feature extraction procedure illustrated in Fig. 7.13 was implemented in a standalone
C++ application, hilbert gauss, allowing for efficient and general experimentation.
The core blocks from the fdlp and pfile gauss tools were reused, and the executable runs
fast thanks to the FFTW library [1]. The viability of the features was tested on the S/N
and CTS tasks: fifteen sub-band Hilbert envelopes were extracted in 1000 ms segments
with a 10 ms step, the sub-bands were selected by Gaussian filters on the Bark scale
with blap = 0.395 (refer to section 6.2 for details), and the best performing M-RASTA
t + ∆f filters (refer to section 7.3.3) yielded 448 features, which were fed to the
TANDEM MLP.
Features FER [%] WER [%]
LP-TRAP 18.2 4.1
M-RASTA 448 17.4 3.4
Hilbert-M-RASTA 17.0 4.0
Table 7.6: Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on the S/N task.
Devel set Test set
Features FER[%] WER[%] WER[%]
LP-TRAP 39.1 53.1 50.5
M-RASTA 37.5 53.3 51.4
Hilbert-M-RASTA 37.2 52.3 50.8
Table 7.7: Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on the CTS task.
Tab. 7.6 and 7.7 suggest that the default settings do not bring any major improvement.
It would be more interesting to employ M-RASTA filters that are shorter in time.
Unfortunately, this raises an issue of proper temporal sampling: the time shift between
successive 1000 ms frames would have to be reduced in order not to violate the sampling
theorem. However, this would raise the frame rate and hence the complexity of the
system. This could still be dealt with (e.g. by post-processing the MLP posteriors and
decimating, or by multi-rate processing), though it has not been done within this thesis.
7.5 Conclusion
A novel feature extraction technique was developed based on multiple 2-D filtering of
time-frequency plane.
The filters are determined by their impulse responses, which are all zero-mean, implying
robustness to linear distortions of the signal and to changes in spectral tilt that could
be induced by extra-linguistic factors, thus inherently alleviating one of the major sources
of harmful variability in speech.
Experiments with small-vocabulary and mid-vocabulary ASR have shown that 2-D
multi-resolution RASTA filtering in conjunction with TANDEM feature extraction is an
efficient means of representing message-specific information in speech.
• In digit recognition (S/N task) the approach outperformed all competitive features
(PLP, PLP-TANDEM, TRAP, TRAP-DCT, and LP-TRAP) and proved robust
to linear distortions.
• In conversational telephone speech recognition (CTS task) it approaches the upper
bound of accuracy achieved by competitive long-term features. Though not shown
in this chapter (see Appendix B for more information), M-RASTA features contain
information complementary to short-term features, as they can significantly improve
accuracy on CTS when combined with PLP features.
Chapter 8
Extensions: Towards recognition
by means of keyword spotting
This chapter presents an alternative approach to ASR in which each targeted word is
classified by a separate binary classifier against all other sounds. To build a recognizer for
N words, N parallel binary classifiers are applied. The system first estimates uniformly
sampled phoneme posterior probabilities; in a second step, a rather long sliding time
window is applied to the phoneme posteriors and its content is classified by an MLP to
yield the posterior probability of the keyword. On a small-vocabulary task, the system
does not yet reach state-of-the-art performance, but its conceptual simplicity and its
inherent resistance to out-of-vocabulary sounds may prove a significant advantage in
many applications.
After presenting the principle of the approach, an alternative recognizer of digits will
be built using eleven parallel keyword spotters. This recognizer will be evaluated on a
modified S/N task and compared to the HMM baseline. Subsequently, it will be exposed
to unconstrained speech containing many out-of-vocabulary words.
8.1 Introduction
Since the early attempts at ASR, the task has been to recognize words from a closed
set. As any non-native speaker of a language may testify, human speech communication
based on this approach would be impossible. Daily experience suggests that not all words
in a conversation, but only a few important ones, need to be accurately recognized for
satisfactory speech communication among human beings. The important keywords tend
to be rarely occurring words with high information value. Human listeners can identify
such words in the conversation and possibly devote extra effort to their decoding. In a
typical ASR system, on the other hand, the acoustics of frequent words are likely to be
better estimated in the training phase, the language model is also likely to substitute
rare words with frequent ones, and this is finally reinforced by the typical judging
criterion, word error rate, which rates the most important words equally with common
phrases of the least information value. As a consequence, important rare words are less
likely to be well recognized. Keyword spotting has the potential to address this issue by
focusing only on certain words while ignoring the rest of the acoustic input.
In this chapter, the ASR is seen as a task where the main goal is to find the target
words in an acoustic stream while ignoring the rest.
8.2 Detecting a word in two steps
The proposed approach works in the following steps (illustrated in Fig. 8.1).
1. Equally-spaced posterior probabilities of phoneme classes are estimated from the
signal.
2. The probability of a target keyword is estimated from the sequence of phoneme
posteriors. The probability is smoothed by a matched filter.
Figure 8.1: Scheme of hierarchical keyword spotting.
In the first step, Multi-resolution RASTA features are calculated from the speech and
fed to an MLP (further referred to as Phoneme MLP) which has been trained to estimate
posterior probabilities of 29 phonemes every 10 ms.
The second processing step replaces the common HMM decoding framework with a
second MLP (further referred to as the Keyword MLP) with multiple inputs and two
complementary outputs. It projects a relatively large span of the phoneme posteriogram
(about 1000 ms) onto the posterior probability of the given keyword being present in the
center of the time span. Thus, the input to the Keyword MLP is a 2929-dimensional
vector (29 phoneme posteriors within a context of 101 frames). By sliding the window
frame by frame, the phoneme posteriogram is converted into a keyword posteriogram.
A typical keyword posteriogram is shown in Fig. 8.2.
Figure 8.2: Example of keyword posteriogram.
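The sliding-window construction of the Keyword MLP input described above can be sketched as follows. This is a minimal NumPy sketch, assuming edge padding at the utterance boundaries (the boundary handling is not specified in the text; names are illustrative):

```python
import numpy as np

def keyword_inputs(posteriogram, context=101):
    """Flatten a sliding window of the phoneme posteriogram into Keyword
    MLP input vectors: 29 posteriors x 101 frames = 2929 dimensions per
    output frame.  Utterance edges are padded by repetition (an assumption;
    the boundary handling is not specified in the text)."""
    n_frames, _ = posteriogram.shape
    half = context // 2
    padded = np.pad(posteriogram, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + context].ravel() for i in range(n_frames)])

post = np.random.rand(40, 29)      # dummy posteriogram: 40 frames, 29 phonemes
X = keyword_inputs(post)           # one 2929-dim vector per frame
```

Each row of X is one input to the Keyword MLP, so the keyword posteriogram retains the 100 Hz frame rate of the phoneme posteriogram.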
Training targets for keyword MLP
The Keyword MLP is trained against hard targets. They are set to “1” at all frames
spanning from the beginning to the end of the targeted word; otherwise they are “0”.
Hence, the estimated posteriors should register the keyword presence regardless of the
position within the keyword. In fact, such targets are necessary due to the discriminative
nature of MLP training: if only the frames exactly at the word centers were labeled “1”,
the fraction of positive examples among the training frames would be negligible and the
MLP training would converge to the degenerate solution of constant zero.
8.2.1 From frame-based estimates to word level
Even though to a human eye the frame-based posterior estimates usually clearly indicate
the presence of the underlying word, the step from frame-based estimates to word-level
estimates is very important. It involves a nontrivial operation of information-rate
reduction (carried out subconsciously by human visual perception while studying the
posteriogram), where the equally sampled estimates at the 100 Hz rate are to be reduced
to non-equidistant estimates of word probabilities. In a conventional (HMM-based)
system, this is accomplished by searching for an appropriate underlying sequence of
hidden states.
Matched filters
Here a more direct approach was adopted, which postulates the existence of matched
filters for the temporal trajectories of word posteriors. The impulse responses of the
filters reflect the average contour of the posteriors in a 1-second interval centered at the
keyword. For every keyword, the impulse response of its matched filter was obtained as
follows.
1. All instances of the particular keyword in the training set were found and their
centers were localized.
2. One second long trajectories of targets were formed, centered at the keyword center.
3. These trajectories were averaged.
In computing the averages, we need to deal with cases where the window contains
more than one instance of the keyword. For simplicity, these segments were not included
in the calculation. Resulting filters are shown in Fig. 8.3.
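The averaging procedure above can be sketched as follows. This is a minimal NumPy sketch, assuming windows overlapping a second keyword instance have already been discarded and that windows running off the utterance ends are skipped (names are illustrative):

```python
import numpy as np

def matched_filter(targets, centers, span=101):
    """Average 1 s (101-frame) stretches of the binary keyword targets,
    centered on each keyword occurrence.  Windows containing more than one
    occurrence are assumed to have been discarded beforehand, and windows
    running off the ends of the utterance are skipped."""
    half = span // 2
    pieces = [targets[c - half:c + half + 1]
              for c in centers
              if c - half >= 0 and c + half < len(targets)]
    return np.mean(pieces, axis=0)

targets = np.zeros(300)
targets[95:106] = 1.0                       # one keyword around frame 100
h = matched_filter(targets, centers=[100])  # 101-tap impulse response
```

Averaging over many real instances produces the smooth bump-shaped responses of Fig. 8.3, whose width reflects the typical duration of each keyword.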
Decision about keyword presence
The raw posteriors, as estimated by Keyword MLPs, were smoothed with the appropriate
matched filters. As a consequence, local maxima (peaks) of each filtered trajectory indi-
cated that the given word was aligned with the impulse response. The position of the peak
then indicated the center of the word. The value in the peak could be used as an estimate
of confidence that the keyword was present. The final decision was taken by comparing
the peak value to a threshold, which had been derived from the data during training. The
details of the decision-making were subject to the experiments and will be clarified later.
Fig. 8.4 illustrates the process.
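The peak-picking decision can be sketched as follows. This is a minimal NumPy sketch of thresholded local-maximum detection; the exact tie-breaking and the calibration of the alarm threshold are not specified here, and names are illustrative:

```python
import numpy as np

def detect_keywords(smoothed, threshold):
    """Return (frame, value) for every local maximum of the matched-filtered
    keyword posterior that exceeds the alarm threshold."""
    x = np.asarray(smoothed, dtype=float)
    interior = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])
    is_peak = np.concatenate(([False], interior, [False]))
    hits = np.where(is_peak & (x > threshold))[0]
    return [(int(i), float(x[i])) for i in hits]

trajectory = np.array([0.0, 0.2, 0.8, 0.3, 0.1, 0.6, 0.9, 0.4, 0.0])
alarms = detect_keywords(trajectory, threshold=0.5)   # peaks at frames 2 and 6
```

The frame index of each alarm gives the estimated word center and the peak value serves as a confidence score.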
Figure 8.3: Impulse responses of matched filters for eleven keywords (digits from “zero”
to “nine” plus “oh”).
Figure 8.4: Finding the keyword position from its posterior probability. Circles indicate
potential alarms.
8.3 Experiments
The aims of the experiments were:
• to test the viability of the proposed approach,
• to understand its properties and tune its parameters on development data,
• to evaluate the method on unconstrained speech.
8.3.1 Training and testing data sets
The experiment was based on OGI Stories and Numbers95. However, compared to the
S/N task setup, the distribution of the files changed. There were four disjoint data sets:
Train set 1 – 208 files from Stories (2.8 hrs) with frame-level phoneme labels (matches
with MLP set 1 from S/N task),
Train set 2 – 2547 files from Numbers95 containing strings of digits (1.3 hrs) with frame-
level phoneme labels (matches with HMM train set from S/N task),
Devel set – 1433 files from Numbers95 containing strings of 11 digits (1.0 hrs) with word
transcription (subset of Test set from S/N task),
Extraneous set – 129 files from Stories (1.7 hrs) with general speech, with word tran-
scription.
Binary training targets for the Keyword MLPs were created for both Train sets. The
Stories database provides word boundary labels, which were used for Train set 1. For
Train set 2 such labels were not available; a backward mapping from phoneme labels to
words was not possible, hence the word boundaries were obtained from an automatic
alignment using an existing ASR system and the word transcript.
Repeating keywords issue
Utterances in Test set were chosen not to contain repeating keywords. It reveals one
particular problem of the system that still needs to be addressed if the system is to be
used in certain ASR applications. When two subsequent keywords come, the technique is
not likely to detect both of them. The keyword posteriors as estimated by Keyword MLP
usually stay at high level. Even if there was a transient drop in the trajectory, it would
be smeared by the matched filter.
This would not be an issue in the envisioned applications of the system, which merely
require marking the frames containing the keyword, but when the system is treated and
evaluated as a recognizer, it represents a problem. A solution was found by Lehtonen
et al. [68], who omitted the Keyword MLP step and applied matched filters directly to
the phoneme posteriors. They localized peaks in the smoothed posteriogram, which
supposedly represented phonemes, and decoded the targeted words from the sequence of
peaks.
8.3.2 Initial experiment – checking viability
The objective of the initial experiment was to get an idea of the limits of the proposed
system when applied as a simple speech recognizer. The task was to recognize 11 digits
in utterances from the Devel set. The evaluation was done by means of WER. All systems
were required to give approximately the same number of insertions and deletions, which
could be tuned on the Devel set.
The procedure:
The hypotheses about word sequences in tested utterances were obtained as follows.
1. Speech from Train set 2 was projected onto phoneme classes using an existing
Phoneme MLP. This MLP had been trained earlier using 448 M-RASTA features
and MLP sets 1 + 2 from the S/N task (details can be found in section 7.3.3).
2. Eleven independent Keyword MLPs, each with a 1 s long trajectory of
phoneme posteriors at the input (2929 features) and two complementary outputs
(P (present), P (not present))1, were trained on Train set 2 to give frame-wise
keyword posteriors. The MLP topologies were 2929×500×2 units. Subsequently,
speech from the Devel set was passed through the Phoneme MLP and the Keyword
MLPs.
3. Eleven matched filters for the respective keywords were derived by computing the
mean trajectory patterns from Train set 2. The keyword posteriors of Devel set were
then filtered with these filters.
1As the MLP outputs represent probabilities, there must be at least two of them so that they always
sum to one. Further processing utilizes only one output.
90 8 Extensions: Towards recognition by means of keyword spotting
4. The decision about keyword presence was made by comparing the local maxima
of the filtered trajectories to a fixed threshold (the alarm threshold), which was
iteratively tuned to balance insertions and deletions. In every utterance, all peaks
valued above the alarm threshold were considered detected keywords and sorted
by time to yield the final hypothesis.
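Steps 3–4 above, matched filtering of a keyword posterior trajectory followed by thresholded peak picking, can be sketched as follows. This is a minimal illustration: the function name, the local-maximum test, and the synthetic data in the usage note are assumptions, not the thesis code (the actual filters were mean trajectory patterns estimated from Train set 2).

```python
import numpy as np

def detect_keyword(posteriors, template, alarm_threshold):
    """Smooth a frame-wise keyword posterior trajectory with a matched
    filter and return the frame indices of local maxima that exceed the
    alarm threshold (hypothetical sketch of steps 3-4)."""
    # Matched filtering = correlation with the template, i.e. convolution
    # with the time-reversed template.
    smoothed = np.convolve(posteriors, template[::-1], mode="same")
    smoothed /= template.sum()  # keep values in a posterior-like range
    hits = []
    for t in range(1, len(smoothed) - 1):
        if (smoothed[t] > alarm_threshold
                and smoothed[t] >= smoothed[t - 1]
                and smoothed[t] > smoothed[t + 1]):
            hits.append(t)
    return hits
```

With a synthetic trajectory containing one bump around frame 25 and a 5-frame box template, the single reported detection falls on the bump's center.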
When evaluated on Devel set, the system yielded an encouraging 9.9% WER (compare
with the optimized HMM recognizer with M-RASTA features, which yields 3.4% WER).
8.3.3 Simplifying the system – omitting intermediate steps
An interesting question arises: what would happen if we omitted the intermediate
processing steps, M-RASTA filtering and the Phoneme MLP, and estimated the keyword
posteriors directly from critical-band log-energies? The answer is shown in Fig. 8.5.
[Figure: the processing chain critical-band analysis (15 bands) → M-RASTA filtering
(448 features) → Phoneme MLP (29 posteriors) → Keyword MLP → yes/no decision, drawn
three times with progressively fewer intermediate steps; the resulting WERs are 9.9%,
16.2% and 31%.]
Figure 8.5: Omitting intermediate processing steps from hierarchical keyword spotting.
When the keyword posteriors were estimated directly from M-RASTA features (MLP
topology 448 × 3000 × 2 units), the performance dropped to 16.2% WER. Omitting also
the M-RASTA feature computation and training the keyword networks directly on a 1 s
trajectory of critical band energies (MLP topology (101∗15)×1000×2 units) yielded 31%
WER. This suggests that both the hierarchical processing employing high-dimensional
features and the use of intermediate phoneme classes are beneficial.
8.3.4 Optimizing false alarm rate – Enhanced system
The aim of the next experiment was to study the relationship between word error rate
and false alarm (FA) rate, and to optimize the proposed algorithm for the subsequent test
with unconstrained speech.
The task was again to recognize the 11 digits from Devel set. To balance the insertion
rates among keywords, an individual alarm threshold was found for each keyword so as
to give the required FA rate. The dependence of the alarm thresholds on the FA rate is
shown in Fig. 8.6: the lower the threshold, the more alarms produced (either correct
detections or false alarms). Once all 11 thresholds were found for the given FA rate, the
WER was evaluated (see the second column of Tab. 8.1).
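A per-keyword threshold calibration of this kind can be sketched as follows. The thesis does not specify the search procedure, so the rank-based choice below (place the threshold just above the strongest false peak that must be suppressed) is an assumption.

```python
def threshold_for_fa_rate(false_alarm_scores, hours, target_fa_per_hour):
    """Pick the alarm threshold for one keyword so that the false-alarm
    rate measured on development data does not exceed the target.
    `false_alarm_scores` are peak values of the filtered posterior
    trajectory at places where the keyword is not actually present."""
    allowed = int(target_fa_per_hour * hours)  # tolerated false alarms
    ranked = sorted(false_alarm_scores, reverse=True)
    if allowed >= len(ranked):
        return 0.0  # every false peak may pass
    # Set the threshold just above the strongest disallowed false peak.
    return ranked[allowed] + 1e-6
```

Repeating this for each of the 11 keywords yields the threshold curves of Fig. 8.6.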
It can be observed that the proposed approach can act as an ASR system at a 12% WER
level while keeping at most 30 false alarms per hour. On the other hand, the competitive
[Figure: alarm threshold (0–1) versus false alarm rate (15–50 FA/h) for each of the
eleven keywords (one, two, three, four, five, six, seven, eight, nine, zero, oh);
left panel: Initial system, right panel: Enhanced system.]
Figure 8.6: Alarm thresholds as a function of false alarm rate (floored at threshold 0.05).
system      initial     enhanced    HMM
FA/h        WER [%]     WER [%]     WER [%]
20          19          16          53
25          15          11.6        40
30          12.1        10.3        26
40          9.9         9.3         3.4
Table 8.1: WER as a function of false alarms per hour on digits recognition task.
HMM-based ASR system with its best performance of 3.4% WER yields 38 false alarms
per hour; when modified to yield 30 alarms per hour by manipulating its insertion
penalty, its performance degrades to 26% WER. Further efforts to lower the FA rate
in the HMM system degraded the performance yet further (see the fourth column in Tab. 8.1
and also Fig. 8.7). However, it should be noted that the HMM system used here is not
designed to act as a keyword-spotter.
More discriminative training – adding negative examples
When a new set (the enhanced set) of 11 Keyword MLPs was trained on Train set 2 joined
with a subset of Train set 1 in a ratio of about 1:1, the prior probabilities of all keywords
roughly halved. After re-setting the thresholds to the required FA rates, the WER
improved; see the third column of Tab. 8.1. The spheres of operation of the respective
systems are shown in Fig. 8.7. Knees of the curves could be used to identify the optimal
usage for each method.
The comparison of the enhanced and initial systems supports the claim that a sufficient
amount of negative examples is necessary in the discriminative training of classifiers.
One big Keyword MLP instead of individual MLPs
To allow for better discriminability among keywords, one big network with a topology
(29*101)×500×12 units was trained for all keywords at once. The 12 outputs were mapped
to 11 keyword classes plus one non-keyword class. As the outputs were estimates of
posterior probabilities, the 12 values always summed to one. The posteriors were
postprocessed in the same way as in the case of the individual networks.
[Figure: WER [%] (0–30) versus average FA/h (10–50) for the Initial system, the
Enhanced system and the HMM recognizer.]
Figure 8.7: Operation ranges of the proposed system and HMM recognizer.
The observed behavior of the big network was very close to that of the individual Keyword
MLPs (at 40 FA/h, where the individual MLPs reached 9.9% WER, the big MLP was
about 0.5% absolute better). This suggests that explicitly introducing discrimination among
target keywords is not worth losing the independence among target words.
8.3.5 Keyword spotting on frame level
An interesting application of the keyword spotter is to let it mark all frames belonging
to a given keyword. This can be achieved simply by passing the speech through M-RASTA
filtering, the Phoneme MLP and the Keyword MLPs, without any other postprocessing. In this
case the marked segments are not necessarily continuous – there can be gaps at the frame
level. However, the operation can be extremely fast as it only involves a sequence of
matrix multiplications. One can concatenate the marked segments and listen to the output.
Surprisingly, the concatenated speech sounds natural, which suggests that human hearing
is able to subconsciously recover the missing data.
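The frame marking and splicing can be sketched as below; the function name and the frame length of 80 samples (10 ms at 8 kHz) are illustrative assumptions.

```python
def concatenate_keyword_frames(samples, frame_posteriors,
                               frame_len=80, threshold=0.5):
    """Keep every frame whose keyword posterior exceeds the threshold
    and splice the corresponding audio samples together; gaps between
    marked frames are simply dropped (sketch of the frame-level demo)."""
    out = []
    for i, p in enumerate(frame_posteriors):
        if p > threshold:
            out.extend(samples[i * frame_len:(i + 1) * frame_len])
    return out
```

Because this is only a posterior lookup plus slicing, it runs in a single pass over the utterance, which is what makes the live demo below feasible.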
This application has been implemented in a small live demo, where the speaker records
his own voice through a PC microphone and can subsequently listen to the concatenated
segments of any of the eleven spotted keywords.
[Figure: two bar charts over the eleven digits (one, two, three, four, five, six, seven,
eight, nine, zero, oh) – top panel: Digits training data (Train set 2); bottom panel:
joint Digits + Stories training data (Train sets 1 + 2). The bars show correct rejects
(% of frames without keyword), correct spots (% of frames with keyword), false alarms
(% of all frames) and deletions (% of keyword frames).]
Figure 8.8: Frame-level evaluation of Keyword MLPs, Test set.
Frame-level performance of initial and enhanced systems can be seen in Fig. 8.8. The
figure reveals a property of the discriminative training which sometimes may cause prob-
lems: systems trained using more negative examples tend to reject more frames. However,
as was shown above, with appropriate postprocessing this property does not represent
an issue.
8.3.6 Keyword spotting in unconstrained speech
The most interesting situation for a keyword spotting system is when the test data contain
a lot of out-of-vocabulary speech. Such a task was emulated by appending 1.7 hours of
extraneous general speech (Extraneous set) to one hour of digits (Devel set). The standard
evaluation procedure was applied as in the previous ASR tasks, but the extraneous speech
from OGI Stories was labeled as no speech. For the initial MLP system, thresholds yielding
30 FA/h on Devel set were used; for the enhanced system, thresholds yielding 25 FA/h on
Devel set were used.
system            False alarms   % Hits   WER [%]   FOM
Initial system    6208           87.4     91.6      74.0
Enhanced system   1313           85.9     24.5      83.6
Table 8.2: Results on a joint set of digits and unconstrained speech. % Hits = % correctly
spotted/all keywords, FOM = Figure Of Merit (average accuracy over 1–10FA/h).
The two studied systems were exposed to the joint speech. Their performances are
compared in Table 8.2. % Hits represents the fraction of correctly spotted keywords among
all presented keywords. The figure of merit is defined by NIST as an upper-bound estimate
of the keyword-spotting accuracy averaged over 1 to 10 false alarms per hour [100].
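Under the NIST definition quoted above, the figure of merit can be approximated from measured ROC operating points as in this sketch; the piecewise-constant interpolation between points is a simplifying assumption of the sketch, not part of the definition.

```python
def figure_of_merit(roc_points):
    """Average the detection rate (% hits) over operating points of
    1 to 10 false alarms per hour.  `roc_points` maps FA/h to % hits
    measured at that operating point."""
    rates = []
    for fa in range(1, 11):
        # use the best measured operating point not exceeding `fa` FA/h
        usable = [hits for f, hits in roc_points.items() if f <= fa]
        rates.append(max(usable) if usable else 0.0)
    return sum(rates) / 10.0
```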
Observation
• The Enhanced system largely outperforms the Initial system: it reduces the number of
false alarms almost 5× while preserving the number of hits.
• When evaluated in terms of WER of digit recognition in unconstrained speech, the
Initial system failed, but the Enhanced system was still able to recognize 75% of the digits.
Negative training examples for keyword-spotting MLPs seem to be essential for the
system's ability to reject out-of-vocabulary sounds. The ability of the proposed system
to focus only on the words of interest can also be illustrated by the behavior of the
baseline HMM recognizer when exposed to the same task. If the closed-vocabulary HMM
recognizer were forced to recognize the unconstrained speech (though it is not designed to
do so), it would insert 11925 false alarms, bringing its final WER to a rather unacceptable
152%.
It is worth noting that the speech from the Extraneous set contains many items
that are acoustically similar to the keywords, such as numbers (nineteen), compound
words (someone) and other acoustically similar items (too, for). Without higher-level
semantic knowledge, these items cannot be distinguished from the targeted digits. It
is thus questionable whether these “false alarms” are indeed errors; nevertheless, they
were counted as errors in the above evaluations.
8.4 Discussion and conclusion
The compared digit-recognizing HMM system represents a typical closed-vocabulary
system which, when presented with an out-of-vocabulary word, attempts to match it with
a word from its closed vocabulary, yielding a false alarm. This problem is typically
addressed by introducing a measure that provides an estimate of confidence in the decision
about the identity of the underlying word (a difficult research problem on its own).
Here, an attempt was made to develop a simple alternative system based on a set
of parallel discriminative classifiers that is better capable of yielding no output when
presented with an unknown out-of-vocabulary word. The new approach differs from the
current ASR strategies in several aspects:
• The recognizer for N words is built as a system of N parallel discriminative binary
classifiers, each classifying one keyword against all other possible sounds.
• The classification is based on hierarchical processing: first, equally spaced posterior
probabilities of phoneme classes are derived from the signal; then the probability of the
given keyword is estimated from the sequence of phoneme posteriors.
• In contrast to most current ASR systems, no explicit time warping (DTW or Viterbi
alignment) is done. Instead, the binary classifier is trained for word-length invariance
on many examples of the keyword.
It was demonstrated that given a sufficient amount of negative examples for discrim-
inative training, the studied approach can inherently reduce the insertion error problem
on out-of-vocabulary words.
Chapter 9
Summary and conclusion
9.1 Summary of the work
This thesis is oriented towards new approaches in speech recognition utilizing neural net-
works and hidden Markov models. Though the main interest was devoted to the devel-
opment of novel features extracted from spectral dynamics, virtually all parts of the used
ASR framework were questioned:
Front-end:
• Obtaining the spectrogram: The conventional frame-based processing was questioned
using LP-TRAP features, which are able to precisely localize temporal speech events.
• How much of the spectrogram to use: Long-term features use up to 1000 ms context.
Sections on truncating TRAP and warping time axis studied how the ASR-relevant
information is distributed in such a context.
• How to parametrize the spectrogram: A new way of converting the spectrogram into
features, M-RASTA, was proposed.
Probability estimator:
• The suitability of phonemes as target classes for the MLP classifier was examined and
alternative classes were suggested (in appendices C, D).
Decoder:
• An alternative word-recognition system was proposed, based purely on MLP classifiers.
It addresses the issue of out-of-vocabulary words in small-vocabulary ASR. The
system could substitute for an HMM-based decoder in computation-critical applications.
Chapter 5 studied the distribution of information in the time-frequency plane. It was found
that context beyond 1000 ms is rather irrelevant for long-term feature extraction. In TRAP,
for word recognition the context can be reduced down to 200-400 ms without a significant
performance drop on a small vocabulary; for frame-level classification, a minimum of 400 ms
seems needed. The lower the feature dimension, the shorter the optimal TRAP length.
When only a few features are available to describe the spectral dynamics, the first choice
is fast spectral modulations. However, slow modulations still contain complementary
information useful for large feature vectors (in context up to 1000 ms). Most of the
information comes from the center of the time window. By warping time axis, distant
frames can be largely sub-sampled while preserving the resolution near the center, which
can reduce TRAP complexity up to 5 times.
Chapter 6 implemented in C++, optimized, and extended the linear predictive TRAP
features (LP-TRAP). LP-TRAP bypasses speech framing and can preserve fine temporal
structure of energy trajectories in frequency sub-bands. The most important tunable
parameters were optimized on small-vocabulary task. In word-recognition experiments,
LP-TRAP cepstra significantly outperformed LP-TRAP sampled envelopes. LP-TRAP
cepstral features markedly outperformed baseline TRAP-DCT features on conversational
telephone speech, CTS task. Better temporal localization of sub-band events in LP-TRAP
than in TRAPs seems beneficial in more complex tasks (CTS) that are able to utilize the
detailed information. The idea of pre-warping the temporal axis in order to stress the
central part of 500 ms sub-band energy trajectories was shown to help: either it can
considerably reduce the feature bandwidth (by 70% on the CTS task) without loss in
recognition performance, or with the full bandwidth it can markedly improve the
performance over the non-warped case.
Chapter 7 proposed new features for ASR named M-RASTA. The method applies a bank
of 2-D time-frequency filters with varying temporal resolutions to the speech spectrogram,
prior to its projection onto phoneme probabilities by MLP. The impulse responses of the
filters implement the earlier findings about the information in spectro-temporal plane,
namely focusing on the central parts of TRAPs and preserving only the important mod-
ulation spectrum. On small-vocabulary task, M-RASTA outperformed all competitive
features (PLP, PLP-TANDEM, TRAP, TRAP-DCT, LP-TRAP) and proved the antic-
ipated robustness to linear distortions. On the more complex CTS task it approached
the upper bound of accuracy achieved by competitive long-term systems. The M-RASTA
features were shown to be complementary to short-term features, similarly to TRAP and
LP-TRAP features. A detailed study suggested that M-RASTA features mainly use spec-
tral modulations between 4–19 Hz for word recognition and modulations between 1.5–8 Hz
for frame recognition.
Chapter 8 proposed an alternative approach to word recognition, where each word was
classified by a separate binary classifier against all other sounds. The system used only
discriminatively trained MLP classifiers, without applying HMMs or dynamic time warp-
ing. Properties of discriminative training were studied, observing that a certain balance of
positive and negative examples is required for good performance. The efficient and simple
system focuses on capturing only the words of interest and was therefore able to reasonably
reject out-of-vocabulary words. When compared to the best available HMM recognizer on
small-vocabulary ASR, it produced more errors at the word level, yet it was better able
to lower the number of false alarms per target word.
9.2 Original contribution
The main objective of this thesis was to extend the current knowledge about features for
ASR derived from the dynamics of speech spectrum.
Certain new properties of the existing TRAP-related systems were found (the importance
of modulations up to 20 Hz, the possibility to sub-sample distant parts of TRAP), which
made it possible to improve the current approaches in terms of simplicity and performance.
These findings helped to improve LP-TRAP features by warping time axis and also moti-
vated the design of the novel speech representation M-RASTA. The study of the relation-
ship between quality of MLP posteriors and word recognition accuracy was an impulse to
propose a new alternative to the mainstream words-decoding framework.
Besides the theoretical achievements, the author's efforts also gave rise to several open-
source software tools which enable flexible and effective experimentation with various
speech features. Two speech recognition and evaluation tasks used throughout this thesis
were efficiently implemented, parallelized, documented and made available to the involved
research community.
Together with the live recognition demo for hierarchical keyword spotting, the products
of this work are being actively used and further developed [65, 97, 84].
9.3 Conclusion
We decided to study acoustic features for speech recognition since no speech recognizer can
do a good job with bad features. We postulated that auditory-like spectrum completely
preserves the underlying message. We believed that posterior probabilities of phonemes
could help when used as an intermediate step between the speech and the text. We also
believed that wider temporal context could improve the local decisions about what was
pronounced.
Given these assumptions, we asked: How much of the time context can possibly be
useful for features? How is the information actually distributed within such a context? Is
it important to properly model detailed temporal structures? Which dynamic events are
most important for recognition – slow or fast modulations?
In the presented work we answered these questions to some extent and supported
the drawn conclusions by experimental evidence. As the ultimate goal was speech
recognition, we assessed the improvements in terms of recognition rates on standard tasks.
The findings are summarized above and also in individual chapters.
Based on the findings, we can finally conclude that searching for proper ways to ex-
plicitly track temporal processes in speech within acoustic features pays off by significant
improvements in recognition performance in terms of accuracy and robustness against
non-linguistic variability.
Future Research
Although a huge effort has been put into ASR development worldwide, it seems that there
are still many gaps in our knowledge, providing an open space for research. This work
attempted to fill in some of those gaps, which in turn revealed new interesting questions.
Some topics were not fully explored, such as the optimal way of warping the time axis in
LP-TRAP features, or pruning the large feature space resulting from M-RASTA filtering.
These topics as well as the discriminative keyword recognition leave a range of possibilities
for future research and development.
9.4 Acknowledgment
The work was done at Czech Technical University in Prague, Czech Republic, and at
IDIAP Research Institute, Martigny, Switzerland.
The grants supporting the part of research done at FEE CTU Prague were: GACR
102/05/0278 “New Trends in Research and Application of Voice Technology”, GACR
102/03/H085 “Biological and Speech Signals Modeling”, GACR-102/02/0124 “Voice Tech-
nologies for Support of Information Society”, and the research activity MSM 6840770014
“Research in the Area of the Prospective Information and Navigation Technologies”.
The part of research done at IDIAP was supported by DARPA grant “EARS Novel
Approaches” no. MDA972-02-1-0024. The other sources of support were DARPA GALE
program, the European Community AMI and M4 grants, and the IM2 Swiss National
Center for Competence in Research, managed by Swiss National Science Foundation on
behalf of Swiss authorities.
Appendix A
Class coverage of S/N & CTS
tasks.
                 Stories, MLP set 1   Numbers95, MLP set 2
Index   Label      Frames       %        Frames       %
  0     d            5723      0.58         367      0.06
  1     t           19706      1.99       21466      3.56
  2     k           14175      1.43        2746      0.46
  3     dcl         14753      1.49         506      0.08
  4     tcl         27514      2.78       20282      3.36
  5     kcl         17279      1.75        6779      1.12
  6     s           51456      5.20       39826      6.60
  7     z           14845      1.50        6993      1.16
  8     f           17334      1.75       28375      4.70
  9     th           5302      0.54       11692      1.94
 10     v            9892      1.00       14765      2.45
 11     m           22476      2.27          39      0.01
 12     n           42632      4.31       50208      8.32
 13     l           26176      2.65         915      0.15
 14     r           20952      2.12       29711      4.92
 15     w           14961      1.51       20465      3.39
 16     iy          34997      3.54       32289      5.35
 17     ih          38389      3.88       16014      2.65
 18     eh          22473      2.27       13959      2.31
 19     ey          20703      2.09       19102      3.17
 20     ae          27155      2.74         163      0.03
 21     ay          28911      2.92       53950      8.94
 22     ah          54829      5.54       28199      4.67
 23     ao          13537      1.37        4012      0.66
 24     ow          17026      1.72       47588      7.89
 25     uw          12399      1.25       27781      4.60
 26     er          15333      1.55        2638      0.44
 27     ax          11342      1.15         842      0.14
 28     SIL        190831     19.29       72475     12.01
 29     REJ        176417     17.83       29230      4.84
Overall            989518    100.00      603377    100.00
Table A.1: Class coverage of S/N task, MLP sets 1 & 2.
                    male                  female
Index   Label      Frames       %        Frames       %
  0     SIL       1529079     26.67     1382202     24.43
  1     aa          67581      1.18       68108      1.20
  2     ae         196264      3.42      197321      3.49
  3     ah          84119      1.47       87478      1.55
  4     ao          65186      1.14       65041      1.15
  5     aw          45686      0.80       49209      0.87
  6     ax         239126      4.17      235463      4.16
  7     ay         194186      3.39      199003      3.52
  8     b           62175      1.08       60647      1.07
  9     ch          20773      0.36       21474      0.38
 10     d           91351      1.59       95054      1.68
 11     dh          87815      1.53       84389      1.49
 12     dx          28929      0.50       27520      0.49
 13     eh          91584      1.60       93612      1.65
 14     er          94777      1.65       93853      1.66
 15     ey          94336      1.65       94917      1.68
 16     f           70752      1.23       67365      1.19
 17     FIP          9833      0.17        6155      0.11
 18     g           46039      0.80       43180      0.76
 19     hh          62289      1.09       78471      1.39
 20     ih         105977      1.85      105998      1.87
 21     iy         162879      2.84      164602      2.91
 22     jh          25965      0.45       25723      0.45
 23     k          134674      2.35      136179      2.41
 24     l          153233      2.67      155005      2.74
 25     LAU         72837      1.27      126833      2.24
 26     m          107423      1.87      108322      1.91
 27     n          221132      3.86      219663      3.88
 28     ng          48422      0.84       53268      0.94
 29     ow         145679      2.54      185196      3.27
 30     oy           5250      0.09        5587      0.10
 31     p           65262      1.14       60704      1.07
 32     PUH        142911      2.49       91609      1.62
 33     PUM         51569      0.90       69184      1.22
 34     r          141826      2.47      139225      2.46
 35     s          208325      3.63      212399      3.75
 36     sh          32093      0.56       31628      0.56
 37     t          199722      3.48      202069      3.57
 38     th          32087      0.56       33386      0.59
 39     uh          17369      0.30       20455      0.36
 40     uw          81752      1.43       81549      1.44
 41     v           50708      0.88       49531      0.88
 42     w          104537      1.82      106207      1.88
 43     y          104404      1.82       98951      1.75
 44     z           98470      1.72       94262      1.67
 45     zh           1966      0.03        1807      0.03
 46     REJ         34988      0.61       28861      0.51
Overall           5733340    100.00     5658665    100.00
Table A.2: Class coverage of CTS task, training sets. Note that only male set was used
in this work.
Appendix B
Summary of experiments on CTS
Task.
                                                       WER [%]   WER [%]
Experiment                                             Devel     Test     St., Grm.(a)
Short-term features
  MFCC (3x13, Utt. E norm.(b), lifter)                 52.5      50.2     1k, 14
  PLP (3x13)                                           53.2      51.6     1k, 14
  ePLP (enhanced PLP - 3x13, Spk/Utt VTLN(c))          46.4      43.8     1k, 14
9-frames
  PLP-MLP (9 frames PLP, 46 fea(d))                    50.9      -        2k, 20
  PLPu-MLP (9 frames PLP, Utt. norm., 46 fea)          48.7      -        2k, 20
  ePLP-MLP (9 frames ePLP, 46 fea)                     46.3      43.3     2k, 20
TRAP
  TRAP nn (101 frames, no norm., 46 fea)               54.5      53.4     2k, 20
  TRAP-DCT (101 frames, Hamming, DCT 51(e), 46 fea)    53.3      51.2     2k, 20
LP-TRAP
  LP-TRAP (fp70, len51, ncep50, no warp, 46 fea)       53.1      50.5     2k, 20
  LP-TRAPw ( -//-, warp 1.75)                          51.4      50.3     2k, 20
M-RASTA
  M-RASTA 240 (2×8 filt, 46 fea)                       53.7      52.5     2k, 20
  M-RASTA 448 (2×8 filt + ∆(f), 46 fea)                53.3      51.4     2k, 20
  M-RASTA 448 tri ( -//-, 3 phn states(g), 46 fea)     51.1      50.1     2k, 20
  Hilbert-M-RASTA (2×8 filt + ∆, 46 fea)               52.3      50.8     2k, 20
Combinations
  ePLP + ePLP-MLP (39 + 17 fea)                        43.7      -        2k, 30
  ePLP + ie[ePLP-MLP, LP-TRAPw](h) (39 + 25 fea)       44.0      40.7     2k, 25
  ePLP + LP-TRAPw (39 + 25 fea)                        43.5      41.3     2k, 25
  ePLP + M-RASTA 448 (39 + 25 fea)                     45.6      42.4     2k, 25
  ePLP + M-RASTA 448 tri (39 + 25 fea)                 44.3      42.8     2k, 25

(a) St. = number of triphone states, Grm. = grammar scale factor.
(b) Utterance-based energy normalization.
(c) Vocal Tract Length Normalization per speaker (train) and per utterance (test).
(d) fea = final feature vector size.
(e) Hamming window on TRAP, first 51 DCT coefficients used.
(f) 2×8 temporal filters plus frequency deltas.
(g) Three phoneme states (aligned by HMM) used as MLP targets.
(h) ie = inverse entropy combination.
Table B.1: Summary of experiments on CTS task.
Appendix C
On target classes for
band-classifiers in TRAP
This section is a small study on suitable target classes for Band-MLP classifiers in TRAP
architecture. It does not question the Merger-MLP classes which were always phonemes
in this thesis, see Fig. C.1.
[Figure: TRAP MLP architecture – fifteen Band-MLPs, one per critical band, produce
band-conditioned posteriors which a Merger-MLP combines into phoneme posteriors;
the question marks indicate the Band-MLP target classes under study.]
Figure C.1: TRAP MLP architecture.
The study was inspired by the work of Pratibha Jain and the continuing work of Petr
Svojanovsky, who experimented with fewer, broader classes. Their motivation was that
phonemes cannot be properly distinguished given an energy trajectory in a single sub-band.
They experimented with either broader phonetic classes (silence, plosive, nasal, glide,
low/high-vocalic energy, schwa, flap, fricative) or automatically derived data-driven classes
across bands (UTRAPs), which they assigned to sub-band TRAPs by means of the least
Euclidean distance. They relabeled every frame in every sub-band and trained the MLPs
using these targets. They reported an improvement in ASR over conventional phonetic
targets [58, 91].
Frame accuracies of the Band-MLPs in a typical TRAP system with 29 phonemes reach
only about 35% FAcc (FAcc = 100% - FER). The existence of broad classes confirms some
measure of ambiguity in phoneme targets. One may ask to what extent such ambiguity
matters. The fact that TRAP actually works proves that either the ambiguity does
not matter or that the found broad classes are not optimal and can be improved. It is
possible to find which explanation is right by emulating 100% non-separable classes.
Main idea
The Band-MLPs in the S/N task have 29 phoneme targets (1-29). By repeating every frame
in the training set exactly twice and assigning the two copies different labels, L and L+29,
we can artificially introduce 29 non-separable class pairs. In theory, the FAcc of the Band-
MLPs should halve. The question is what happens to the Merger-MLP FAcc and WER.
If they get worse, then the ambiguity indeed matters and it makes sense to put effort into
developing broad classes. If they stay the same, then the phonemic targets are satisfactory.
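The label-duplication trick can be sketched as follows (a minimal illustration; frames are shown as plain feature vectors, and the function name is an assumption):

```python
def make_ambiguous(frames, labels, n_classes=29):
    """Emulate perfectly non-separable classes: repeat every training
    frame twice and give the two copies the labels L and L + n_classes,
    yielding n_classes class pairs that no classifier can tell apart."""
    new_frames, new_labels = [], []
    for x, lab in zip(frames, labels):
        new_frames += [x, x]
        new_labels += [lab, lab + n_classes]
    return new_frames, new_labels
```

Since identical inputs carry two different targets, the best any classifier can do is split its posterior mass between the pair, which is why the Band-MLP frame accuracy should halve.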
Experiment
Two MLP systems were trained on mean-normalized TRAPs of length 51 frames and
evaluated in terms of FAcc and WER. The number of training examples per Band-MLP
target class stayed constant.
Baseline
• 15 Band-MLPs with 29 targets,
• Merger-MLP with 29 targets, input size = 29 × 15 features.
Ambiguous
• 15 Band-MLPs with 2 × 29 = 58 targets,
• Merger-MLP with 29 targets, input size = 58 × 15 features.
Results are given in the first two rows of Tab. C.1. When comparing the Ambiguous
system to the Baseline system, the frame accuracy of the Band-MLPs indeed halved. The
merger accuracy stayed constant, which supports the hypothesis that non-separability does
not matter. However, the word error rate increased. This could have been caused by the
change in Merger-MLP topology (doubling its input size). To investigate this further,
two more experiments were run.
• In the first experiment, the 29 posteriors per band from the Baseline system were
repeated twice. This eliminated any possible Band-MLP performance drop while again
forming a double-size input (58 × 15 features) to the merger.
• The second experiment realized the opposite idea. Of the 58 posteriors per band
of the Ambiguous system, only one half was fed to the merger (either posteriors
1–29 or 30–58). The merger input size of the Baseline system was thus preserved.
                       Band-MLP          Merger-MLP
Targets                Aver. FAcc [%]    FAcc [%]   WER [%]
baseline               35                82         4.8
ambiguous              17                82         5.3
2×repeated baseline    –                 81         5.1
ambiguous 1–29         –                 82         5.2
ambiguous 30–58        –                 82         5.1
Table C.1: Emulating non-separable classes in TRAP Band-MLP targets.
Results of the two additional experiments are given in rows 3–5 of Tab. C.1. Merely
replicating posteriors from the Baseline system deteriorated the word recognition (see
the third row). This confirms that the above WER mismatch was caused only by the change
in MLP topology.
On the other hand, preserving the original small topology by using only half of the
features provided by the Band-MLPs did not return the WER back to 4.8%. However, using
only half of the features could in principle have caused a loss of information. For this
reason the former experiment can be considered more reliable.
Conclusion
Phoneme classes used as targets for band-conditioned neural classifiers are not always
separable, because the classifiers have only partial information. Since the assignment
between phonemes and sub-band energy patterns may be ambiguous, the theoretically
reachable accuracy of the classifiers decreases. However, it was shown that such sub-band
ambiguity does not reduce the phoneme-classification potential of the merging neural
network, thus it can affect the word error rate only a little. Hence, if phonemes are used
as targets for Band-MLPs, the performance should not be impaired. It does not seem to
be necessary to search for data-driven broad categories.
Appendix D
Sub-phoneme targets for
TANDEM classifier
This section introduces a simple experiment with alternative phoneme classes that could
enhance the temporal resolution of the classifier.
Target classes of MLPs in combined ANN/HMM recognizers are typically phonemes.
The training targets are obtained from a phoneme transcript simply by setting the flag
for each phoneme pi from time ti to ti+1 and keeping the other phonemes unset, see Fig. D.1.
Since the MLP does not have memory, it cannot model any temporal evolution
within the phoneme. In principle, the training frames could even be randomly reshuffled
prior to training with no drawback in performance1. After the training, the “best” MLP
is usually judged to be the one which yields posteriors best matching the training targets – in
other words, the most “rectangular” posteriors with respect to their temporal evolution.
Yet these posteriors are subsequently modeled, typically by 3–5 state HMMs. Why use
three states for a rectangle? Would it not make more sense if the MLP were trained with
state targets instead of phoneme targets?
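The conversion from phoneme targets to sub-phoneme state targets can be sketched as follows. In the experiments the state boundaries came from a forced alignment by HMMs; the even three-way split below is a simplifying assumption of the sketch.

```python
def phoneme_to_state_targets(phoneme_labels, states_per_phoneme=3):
    """Expand a frame-level phoneme label sequence into state targets:
    each contiguous phoneme segment is split into `states_per_phoneme`
    equal parts, and frame targets become phoneme*states + state."""
    targets = []
    start = 0
    for i in range(1, len(phoneme_labels) + 1):
        # close the segment at the end or when the label changes
        if i == len(phoneme_labels) or phoneme_labels[i] != phoneme_labels[start]:
            seg_len = i - start
            for j in range(seg_len):
                state = min(j * states_per_phoneme // seg_len,
                            states_per_phoneme - 1)
                targets.append(phoneme_labels[start] * states_per_phoneme + state)
            start = i
    return targets
```

The number of MLP output units grows accordingly (29 phonemes become 87 state classes in the S/N task).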
Experiment on S/N Task
A combined ANN/HMM system based on the TANDEM architecture and Multi-RASTA
features [49] was trained first with phoneme targets and second with state targets (3 per
phoneme). The experiment was done on the S/N task. Phoneme targets and state targets
were both obtained from a forced alignment using an existing set of HMMs. In theory, if
the reasoning were flawed, the FER would triple. Tab. D.1 shows what actually
happened.
The FER rose only to 29.1%, not to 57.3%, which suggests that the MLP is able
to benefit from state targets. This was also confirmed by a big improvement in WER.
For a fair comparison of systems with different number of posteriors, only the first 29
1In practice, the training frames are actually being reshuffled, which ameliorates MLP convergence.
107
108 D Sub-phoneme targets for TANDEM classifier
SIL
eh
ih
n
s
v
x
MLP1
0
SIL s eh v ih n SIL
t t t tt1 2 3 4 50
t
Figure D.1: Illustration of phonemes as MLP training targets for an utterance “seven”.
Targets FER [%] WER [%]
29 phonemes 19.1 4.0
3×29 states 29.1 3.4
Table D.1: Performance of phoneme states used as MLP targets, S/N task.
features after KLT transform in TANDEM were used. Had all 87 features been fed to the
HMMs, the WER would have reached 3.2%.
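The KLT-based reduction of the TANDEM features (keeping the leading 29 of 87 decorrelated dimensions) can be sketched as below. The random matrix merely stands in for real log posteriors, and the function is an illustrative sketch, not the tool actually used.

```python
import numpy as np

def klt_reduce(feats, n_keep=29):
    """Estimate a KLT (PCA) on the data, decorrelate the features and
    keep the n_keep components with the largest variance."""
    mean = feats.mean(axis=0)
    centered = feats - mean
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigval)[::-1]            # sort by variance, descending
    basis = eigvec[:, order[:n_keep]]
    return centered @ basis

rng = np.random.default_rng(0)
logpost = rng.normal(size=(1000, 87))           # stand-in for log posteriors
reduced = klt_reduce(logpost, n_keep=29)
```

Because the basis is ordered by variance, truncating to the first 29 columns discards the least informative directions, which is what makes the comparison across different posterior counts fair.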
It could also be argued that there was a difference in overall MLP size, as the latter
MLP had more output units. However, this can hardly be compensated for: if the number
of free parameters in the systems were preserved, there would still be a difference in MLP
topology (roughly A × 3B × C units vs. A × B × C/3 units), which could significantly
affect performance. Nevertheless, for the sake of completeness, another MLP was trained
with a reduced number of hidden units so as to preserve the overall size. The FER was 31.3% and
the WER was 3.8%², still better than the phoneme-target system.
Experiment on CTS Task
Since it is often the case that S/N observations do not translate to LVCSR, the same
experiment was carried out on the CTS task. The sequence of phoneme labels commonly
used for training was forced-aligned to states (3 per phoneme) using existing HMMs.
Subsequently, to repeat the S/N experiment, two MLPs were trained:
State targets “fair” – the number of free parameters was roughly preserved (3× more outputs,
3× fewer hidden units).

State targets – the same hidden layer size as the baseline, 3× more outputs.
²When using all 87 outputs from this MLP, the WER would be 3.5%.
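The state targets above come from forced alignment with existing HMMs. As a rough illustration only, the sketch below splits each aligned phoneme segment uniformly into three sub-state segments; uniform splitting is an assumption made for the example, not the alignment procedure actually used.

```python
def split_to_states(segments, n_states=3):
    """Expand time-aligned phoneme segments (label, start, end) into
    per-state segments by uniform splitting, a crude substitute for
    HMM forced alignment to states."""
    states = []
    for label, t_start, t_end in segments:
        step = (t_end - t_start) / n_states
        for s in range(n_states):
            states.append((f"{label}_{s + 1}",
                           t_start + s * step,
                           t_start + (s + 1) * step))
    return states

# Two toy aligned phoneme segments (times in seconds, made up):
segs = [("s", 0.10, 0.22), ("eh", 0.22, 0.28)]
states = split_to_states(segs)
```

Each phoneme label thus yields three state labels (e.g. s_1, s_2, s_3), tripling the number of target classes while keeping the segment boundaries intact.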
MLP                     Topology (Input × Hidden × Output)
Phoneme targets         448 × 2000 × 47
State targets “fair”    448 × 667 × 139
State targets           448 × 2000 × 139
Note that the “reject” class was not split into states, as it is not used for training; hence
the number of classes was 46 × 3 + 1 = 139. The results are given in Tab. D.2.
MLP                     FER [%]   Devel set WER [%]   Test set WER [%]
Phoneme targets         37.5      53.3                51.4
State targets “fair”    54.4      51.1                50.1
State targets           52.3      50.8                50.9

Table D.2: Performance of phoneme states used as MLP targets, CTS task.
The FER rose by only about 15% absolute, which again supports the use of phoneme
states as MLP targets. The WER improved significantly, by more than 1% absolute.
Interestingly, the “fair” (smaller) MLP outperformed the big MLP on the test set.
Conclusion
The MLP classifier in the TANDEM architecture typically treats the phoneme as an atomic
temporal unit. This arises from the fact that during training, all frames belonging to a
certain phoneme are freely interchangeable. Here it was shown that treating the phoneme
as an entity with certain temporal dynamics (modeled by three independent sub-phoneme
states) in the MLP classifier can significantly improve ASR performance. The improvement was
observed at the frame level as well as in the word error rate. The findings support the notion
that making the system more coherent overall, by coordinating the MLP targets with the
elementary HMM modeling units (phone states), pays off in accuracy. Similar observations
were also reported in [88], though not very explicitly.
• This experiment was carried out after all other approaches presented in the thesis
had already been finalized; the idea has therefore not been included in the evaluations.
Bibliography
[1] FFTW home page, http://www.fftw.org.
[2] ICSI speech FAQ, http://www.icsi.berkeley.edu/speech/faq.
[3] NIST spoken language technology evaluation and utility web,
http://www.nist.gov/speech/index.htm.
[4] QuickNet home page, http://www.icsi.berkeley.edu/Speech/qn.html.
[5] Sprachcore web page,
http://www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html.
[6] Web pages of Speech Processing and Signal Analysis Group at FEE CTU Prague,
http://noel.feld.cvut.cz/speechlab.
[7] Web pages of Speech Processing Group at Brno University of Technology,
http://www.fit.vutbr.cz/research/groups/speech.
[8] Allen, J. B. How do humans process and recognize speech? IEEE Trans. on
Speech and Audio Proc. 2, 4 (Oct. 1994), 567–577.
[9] Arai, T., Pavel, M., Hermansky, H., and Avendano, C. Intelligibility of
speech with filtered time trajectories of spectral envelopes. In Proc. of ICSLP ’96
(Philadelphia, PA, 1996), vol. 4, pp. 2490–2493.
[10] Arai, T., Pavel, M., Hermansky, H., and Avendano, C. Syllable intelligibility
for temporally-filtered LPC cepstral trajectories. J. Acoust. Soc. Am. (1999).
[11] Arai, T., Takahashi, M., and Kanedera, N. On the important modulation
frequency bands of speech for human speaker recognition. In Proc. of ICSLP 2000
(2000), vol. 3, pp. 774–777.
[12] Atal, B. Effectiveness of linear prediction characteristics of the speech wave for
automatic speaker identification and verification. J. Acoust. Soc. Am. 55, 6 (1974),
1304–1312.
[13] Atal, B., and Schroeder, M. Predictive coding of speech signals. Proceedings of
the 1967 Conference on Communications and Processing (November 1967), 360–361.
[14] Athineos, M., and Ellis, D. Frequency-domain linear prediction for temporal
features. In Proc. of IEEE ASRU 2003 (St. Thomas, U.S. Virgin Islands, 2003),
pp. 261–266.
[15] Athineos, M., Hermansky, H., and Ellis, D. P. LP-TRAP: Linear predictive
temporal patterns. In International Conference on Spoken Language Processing
(ICSLP) (2004). IDIAP RR 04-59.
[16] Athineos, M., Hermansky, H., and Ellis, D. P. W. PLP2: Autoregressive
modeling of auditory-like 2-D spectro-temporal patterns. In Proc of. SAPA-2004
(Jeju, Korea, 2004).
[17] Avendano, C., van Vuuren, S., and Hermansky, H. Data-based RASTA-like
filter design for channel normalization in ASR. In ICSLP’96 (Philadelphia, PA,
USA, Oct. 1996), vol. 4, pp. 2087–2090.
[18] Bellman, R. E. Dynamic programming. Princeton University Press, 1957.
[19] Bourlard, H., and Dupont, S. A new ASR approach based on independent
processing and recombination of partial frequency bands. In Proc. of ICSLP ’96
(Philadelphia, PA, 1996), vol. 1, pp. 426–429.
[20] Burget, L., Dupont, S., Garudadri, H., Grezl, F., Hermansky, H., Jain,
P., Kajarekar, S., and Morgan, N. QUALCOMM-ICSI-OGI features for ASR.
In Proc. 7th International Conference on Spoken Language Processing (2002), Inter-
national Speech Communication Association.
[21] Burget, L., and Hermansky, H. Data driven design of filter bank for speech
recognition. In Proc. of TSD’01 (Zelezna Ruda, Czech Republic, September 2001).
[22] Chen, B., Cetin, O., Doddington, G., Morgan, N., Ostendorf, M., Shinozaki, T.,
and Zhu, Q. A CTS task for meaningful fast-turnaround experiments. In Proc. of
RT-04 Workshop (IBM Palisades Center, November 2004).
[23] Chen, B. Y. Learning Discriminant Narrow-Band Temporal Patterns for Automatic
Speech Recognition. PhD thesis, University of California, Berkeley, 2005.
[24] Chen, B. Y., Chang, S., and Sivadas, S. Learning discriminative temporal
patterns in speech: Development of novel TRAPS-like classifiers, 2003.
[25] Chen, B. Y., Zhu, Q., and Morgan, N. Tonotopic multi-layered perceptron: A
neural network for learning long-term temporal features for speech recognition. In
Proc. of ICASSP 2005 (Philadelphia, PA, 2005).
[26] Cole, R., Noel, M., and Lander, T. Telephone speech corpus development
at CSLU. In In Proceedings of the International Conference on Spoken Language
Processing (ICSLP’94) (Yokohama, Japan, 1994), pp. 1815–1818.
[27] Cole, R., Noel, M., Lander, T., and Durham, T. New telephone speech
corpora at CSLU. In Proceedings of the Fourth European Conference on Speech
Communication and Technology (1995), vol. 1, pp. 821–824.
[28] Cooley, J. W., and Tukey, J. W. An algorithm for the machine calculation of
complex Fourier series. Mathematics of Computation 19 (1965), 297–301.
[29] Davis, S. B., and Mermelstein, P. Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences. IEEE Transactions
on Acoustics, Speech, and Signal Processing 28, 4 (1980), 357–366.
[30] deCharms, R. C., Blake, D. T., and Merzenich, M. M. Optimizing sound
features for cortical neurons. Science 280, 5368 (May 1998), 1439–1444.
[31] Depireux, D. A., Simon, J. Z., Klein, D. J., and Shamma, S. A. Spectro-
temporal response field characterization with dynamic ripples in ferret primary au-
ditory cortex. J. Neurophysiol. 85 (2001), 1220 – 1234.
[32] Dimitriadis, D., Maragos, P., and Potamianos, A. Auditory teager energy
cepstrum coefficients for robust speech recognition. In Proc. of Interspeech’05 (Lis-
bon, Portugal, September 2005).
[33] Drullman, R., Festen, J. M., and Plomp, R. Effect of reducing slow temporal
modulations on speech recognition. J. Acoust. Soc. Am. 95, 5 (May 1994), 2670–
2679.
[34] Ephraim, Y., and Malah, D. Speech enhancement using a minimum mean square
error short time spectral amplitude estimator. IEEE Trans. on ASSP-32 6 (Decem-
ber 1984), 1109–1121.
[35] Flanagan, J. Speech Analysis Synthesis and Perception, 2 ed. Springer-Verlag,
1972.
[36] Fletcher, H. Speech and hearing in communication. In The ASA edition of Speech
and Hearing in Communication, J. B. Allen, Ed. Acoustical Society of America, New
York, 1995.
[37] Fousek, P. Does phoneme labeling of speech have to be done by hand? Tech. Rep.
R06-3, FEE CTU, Dept. of Circuit Theory, Prague, 2006.
[38] Furui, S. Cepstral analysis technique for automatic speaker verification. In IEEE
Trans. ASSP (1981), vol. 29, pp. 254–272.
[39] Furui, S. On the role of spectral transition for speech perception. Journal of the
Acoustical Society of America 80, 4 (October 1986), 1016–1025.
[40] Gales, M. Maximum likelihood linear transformations for HMM-based speech
recognition. Computer Speech and Language 12, 2 (1998), 75–98.
[41] Gillick, L., and Cox, S. J. Some statistical issues in the comparison of speech
recognition algorithms. In Proc. of ICASSP’89 (Glasgow, 1989), pp. 532–535.
[42] Gold, B., and Morgan, N. Speech and Audio Signal Processing: Processing and
Perception of Speech and Music. John Wiley & Sons, Inc., New York, NY, USA,
1999.
[43] Gong, Y. Speech recognition in noisy environments: a survey. Speech Commun.
16, 3 (1995), 261–291.
[44] Greenberg, S. Understanding speech understanding: Towards a unified theory of
speech perception. In Workshop on the Auditory Basis of Speech Perception (1996),
1–8.
[45] Grezl, F. Local time-frequency operators in TRAPs for speech recognition. In 6th
International Conference TSD 2003 (2003), vol. 2003, University of West Bohemia
in Pilsen, pp. 269–274.
[46] Grezl, F., and Hermansky, H. Local averaging and differentiating of spectral
plane for TRAP-based ASR. In Proc. EUROSPEECH 2003 (2003), Institute for
Perceptual Artificial Intelligence.
[47] Hermansky, H. Perceptual linear predictive (PLP) analysis of speech. J.
Acoust. Soc. Am. 87, 4 (1990), 1738–1752.
[48] Hermansky, H., Ellis, D., and Sharma, S. Connectionist feature extraction
for conventional HMM systems. In ICASSP’00 (Istanbul, Turkey, 2000).
[49] Hermansky, H., and Fousek, P. Multi-resolution RASTA filtering for TANDEM-
based ASR. In Proceedings of Interspeech 2005 (2005).
[50] Hermansky, H., Fujisaki, H., and Saito, Y. Analysis and synthesis of speech
based on spectral transform linear predictive method. In Proc. of ICASSP’83 (April
1983), vol. 8, pp. 777–780.
[51] Hermansky, H., and Jain, P. Band-independent speech-event categories for
TRAP based ASR. In Proc. of Eurospeech 2003 (Geneve, CH, 2003), pp. 1013–
1016.
[52] Hermansky, H., and Morgan, N. RASTA processing of speech. IEEE Transactions
on Speech and Audio Processing 2, 4 (October 1994), 578–589.
[53] Hermansky, H., and Sharma, S. TRAPs - classifiers of TempoRAl Patterns. In
Proc. of ICSLP’98 (November 1998).
[54] Hermansky, H., and Sharma, S. Temporal patterns (TRAPS) in ASR of noisy
speech. In in ICASSP’99 (Phoenix, Arizona, USA, Mar. 1999).
[55] Hertz, J., Krogh, A., and Palmer, R. G. Introduction to the Theory of
Neural Computation. Addison-Wesley, 1991.
[56] Ikbal, S., Misra, H., Sivadas, S., Hermansky, H., and Bourlard, H. En-
tropy Based Combination of Tandem Representations for Noise Robust ASR. In
Proc. of ICSLP-04 (Jeju Island, Korea, October 2004).
[57] Itakura, F., and Saito, S. A statistical method for estimation of speech spectral
density and formant frequencies. Electronics Communications of Japan 53-A, 1
(1970), 36–43.
[58] Jain, P. Temporal patterns of frequency-localized features in ASR. PhD thesis, OGI
School of Science & Engineering at OHSU, 2003.
[59] Janin, A., Ellis, D., and Morgan, N. Multi-stream speech recognition: Ready
for prime time? In Proc. of Eurospeech-99 (Budapest, 1999), pp. 591–594.
[60] Kajarekar, S., Yegnanarayana, B., and Hermansky, H. A study of two
dimensional linear discriminants for ASR. In ICASSP’01 (Salt Lake City, Utah,
USA, May 2001).
[61] Kajarekar, S. S., and Hermansky, H. Optimization of units for continuous-
digit recognition task. In Proc. of ICSLP 2000 (Beijing, China, 2000).
[62] Kanedera, N., Arai, T., Hermansky, H., and Pavel, M. On the relative
importance of various components of the modulation spectrum for automatic speech
recognition. Speech Communication 28, 1 (May 1999), 43–55.
[63] Kanedera, N., Hermansky, H., and Arai, T. On properties of modulation
spectrum for robust automatic speech recognition. Proc. of the IEEE International
Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2 (1998), 613–616.
[64] Karafiat, M., Grezl, F., and Cernocky, J. TRAP based features for LVCSR
of meeting data. In Proc. 8th International Conference on Spoken Language Processing
(2004), Sunjin Printing Co., pp. 437–440.
[65] Ketabdar, H., and Hermansky, H. Identifying unexpected words using in-
context and out-of-context phoneme posteriors. IDIAP-RR 68, IDIAP, 2006.
[66] Kleinschmidt, M., and Gelbart, D. Improving word accuracy with Gabor
feature extraction. In Proc. of ICSLP’02 (Denver, Colorado, 2002).
[67] Kurkova, V. Kolmogorov’s theorem and multilayer neural networks. Neural Com-
putation 5, 3 (1992), 501–506.
[68] Lehtonen, M., Fousek, P., and Hermansky, H. Hierarchical approach for spotting
keywords. In Proc. of 2nd Workshop on Multimodal Interaction and Related Machine
Learning Algorithms – MLMI’05 (Edinburgh, UK, July 2005).
[69] Linsker, R. Self-organization in a perceptual network. Computer 21, 3 (1988),
105–117.
[70] Lippmann, R. P. Accurate consonant perception without mid-frequency speech
energy. Speech and Audio Processing, IEEE Transactions on 4, 1 (1996), 66–69.
[71] Lockwood, P., and Boudy, J. Experiments with a non-linear spectral subtractor
(NSS), hidden Markov models and the projection for robust speech recognition in
cars. In Proc. of Eurospeech 1991 (1991).
[72] Malayath, N., and Hermansky, H. Bark resolution from speech data. Proceed-
ings of International Conference on Spoken Language Processing 2002 (September
2002).
[73] Martin, R. Noise power spectral density estimation based on optimal smoothing
and minimum statistics. IEEE Transactions on Speech and Audio Processing 9, 5
(July 2001).
[74] Mermelstein, P. Distance measures for speech recognition: Psychological and
instrumental. In Pattern Recognition and Artificial Intelligence, C. H. Chen, Ed.
Academic Press, New York, 1976, pp. 374–388.
[75] Meyer, B., and Kleinschmidt, M. Robust speech recognition based on localized
spectro-temporal features. In Proc. of ESSV’03 (Karlsruhe, 2003).
[76] Misra, H., Vepa, J., and Bourlard, H. Multi-stream ASR: An Oracle perspec-
tive. In Proc. of ICSLP’06 (Pittsburgh, U.S.A., September 2006).
[77] Morgan, N., and Bourlard, H. An introduction to hybrid HMM/connectionist
continuous speech recognition. IEEE Signal Processing Magazine (May 1995), 25–42.
[78] Morris, A., Hagen, A., and Bourlard, H. Map combination of multi-stream
hmm or hmm/ann experts. In Proc. of Eurospeech’01 (Aalborg, Denmark, Septem-
ber 3-7 2001).
[79] Motlicek, P., Hermansky, H., Garudadri, H., and Srinivasamurthy, N.
Speech coding based on spectral dynamics. In Ninth International Conference on
Text, Speech and Dialogue (TSD) (2006). IDIAP-RR 06-05.
[80] Motlicek, P. Modeling of Spectra and Temporal Trajectories in Speech Processing.
PhD thesis, Brno University of Technology, Faculty of Information Technology,
August 2003.
[81] Motlicek, P., and Cernocky, J. Time-domain based temporal processing with
application of orthogonal transformations. In Proc. EUROSPEECH 2003 (2003),
Institute for Perceptual Artificial Intelligence, pp. 821–824.
[82] Oppenheim, A. V., and Schafer, R. W. Discrete-time signal processing.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.
[83] Pavel, M., and Hermansky, H. Information fusion by human and machines. In
Proc. of The First European conference on signal analysis and prediction (Prague,
Czech Republic, 1997).
[84] Prasanna, S. H. M., and Hermansky, H. Multi-RASTA and PLP in automatic
speech recognition. Tech. Rep. RR 06-45, IDIAP Research Institute, Martigny, 2006.
[85] Rabiner, L. R. A tutorial on Hidden Markov Models and selected applications in
speech recognition. Proceedings of the IEEE 77, 2 (1989), 257–286.
[86] Rumelhart, D. E., Hintont, G., and Williams., R. J. Learning representa-
tions by back-propagating errors. Nature 4, 323 (1986), 533–536.
[87] Schwarz, P., Matejka, P., and Cernocky, J. Recognition of phoneme strings
using TRAP technique. In Proceedings of 8th International Conference Eurospeech
(2003), International Speech Communication Association.
[88] Schwarz, P., Matejka, P., and Cernocky, J. Towards lower error rates in
phoneme recognition. In Proceedings of 7th International Conference Text,Speech
and Dialoque 2004 (2004), Springer Verlag.
[89] Schwarz, P., Matejka, P., and Cernocky, J. Hierarchical structures of neural
networks for phoneme recognition. In Proceedings of ICASSP 2006 (2006), pp. 325–
328.
[90] Sovka, P., Pollak, P., and Kybic, J. Extended spectral subtraction. In Proc.
of European Signal Processing Conference (EUSIPCO–96) (Trieste, Italy, November
1996).
[91] Svojanovsky, P. Band-independent classifiers in TRAP-TANDEM ASR system.
In Proc. of SPECOM 2005 (Patras, Greece, October 2005), pp. 769–772.
[92] Tibrewala, S., and Hermansky, H. Sub-band based recognition of noisy speech.
In Proc. of ICASSP ’97 (Munich, Germany, 1997), pp. 1255–1258.
[93] Tyagi, V., and Wellekens, C. Fepstrum representation of speech signal. In
Proceedings of IEEE ASRU’05 (December 2005), pp. 44–49.
[94] Uhlir, J., and Sovka, P. Číslicové zpracování signálu [Digital Signal Processing].
CTU Publishing House, 1995.
[95] Umesh, S., Cohen, L., and Nelson, D. Frequency-warping and speaker-
normalization. In Proceedings of the 1997 IEEE International Conference on Acous-
tics, Speech, and Signal Processing (ICASSP ’97) (Washington, DC, USA, 1997),
vol. 2, IEEE Computer Society, p. 983.
[96] Valente, F., and Hermansky, H. Discriminant linear processing of time-
frequency plane. In Proc. of ICSLP’06 (2006). IDIAP-RR 06-20.
[97] Valente, F., and Hermansky, H. Combination of acoustic classifiers based on
Dempster-Shafer theory of evidence. In Proc. of ICASSP 2007 (Honolulu, Hawaii,
USA, 2007).
[98] van Vuuren, S., and Hermansky, H. Data-driven design of RASTA-like filters.
In Eurospeech’97 (Rhodes, Greece, 1997), ESCA.
[99] Yang, H. H., Sharma, S., van Vuuren, S., and Hermansky, H. Relevance
of time-frequency features for phonetic and speaker/channel classification. Speech
Communication (Aug. 2000).
[100] Young, S., Ollason, D., Valtchev, V., and Woodland, P. The HTK Book
(for HTK Version 3.2.1). Cambridge University Press, Cambridge, UK, 2002.
[101] Zhu, Q., Chen, B., Grezl, F., and Morgan, N. Improved MLP structures
for data-driven feature extraction for ASR. In Interspeech’2005 - Eurospeech - 9th
European Conference on Speech Communication and Technology (2005).
[102] Zhu, Q., Chen, B., Morgan, N., and Stolcke, A. On using MLP features in
LVCSR. In Proc. of INTERSPEECH 2004 (2004), pp. 921–924.
[103] Zhu, Q., Stolcke, A., Chen, B. Y., and Morgan, N. Using MLP features in
SRI’s conversational speech recognition system. In Proc. of INTERSPEECH 2005
(2005), pp. 2141–2144.
Other used literature
1. Psutka, J., Muller, L., Matousek, J., and Radova, V., Mluvíme s počítačem
česky [Talking to a Computer in Czech], Academia, Prague, 2005.

2. Uhlir, J., Sovka, P., and Cmejla, R., Úvod do číslicového zpracování signálu
[Introduction to Digital Signal Processing], CTU Publishing House, Prague, 2003.

3. Sovka, P., and Pollak, P., Vybrané metody číslicového zpracování signálu [Selected
Methods of Digital Signal Processing], CTU Publishing House, Prague, 2001.

4. Rybicka, J., LaTeX pro začátečníky [LaTeX for Beginners], KONVOJ, Brno,
ISBN 80-85615-74-6, 1999.

5. Satrapa, P., Perl pro zelenáče [Perl for Greenhorns], Neokortex, ISBN 80-86330-02-8.

6. Racek, S., and Kvoch, M., Třídy a objekty v C++ [Classes and Objects in C++],
Kopp, ISBN 80-7232-017-3, 1998.
Selected publications
1. Fousek, P. and Hermansky, H., Towards ASR based on hierarchical posterior-
based keyword recognition, Proc. of ICASSP ’06, Toulouse, France, 2006.
2. Boril, H. and Fousek, P., Influence of different speech representations and HMM
training strategies on ASR performance, Proc. of Poster 2006, Prague, 2006.
3. Fousek, P., Does phoneme labeling of speech have to be done by hand?, Tech. Rep.
R06-3, FEE CTU, Dept. of Circuit Theory, Prague, 2006.
4. Hermansky, H. and Fousek, P., Multi-resolution RASTA filtering for TANDEM-
based ASR, Proc. of Interspeech 2005, Lisbon, Portugal, September 2005.
5. Hermansky, H., Fousek, P. and Lehtonen, M., The role of speech in multi-
modal human-computer interaction (towards reliable rejection of non-keyword in-
put), Proc. of the 8th International Conference on Text, Speech and Dialogue - TSD
2005, Carlsbad, Czech Republic, 2005.
6. Lehtonen, M., Fousek, P. and Hermansky, H., Hierarchical approach for spot-
ting keywords, IDIAP Research Report, 2005.
7. Fousek, P., Svojanovsky, P., Grezl, F. and Hermansky, H., New nonsense
syllables database - analyses and preliminary ASR experiments, Proc. of ICSLP’04,
Jeju, Korea, 2004.
8. Fousek, P., Performance of LDA based parametrization techniques for robust
speech recognition, In Speech Processing, Prague, Academy of Sciences of the Czech
Republic, Institute of Radioengineering and Electronics, 2004, vol. 1, pp. 152–153.
9. Fousek, P. and Pollak, P., Additive noise and channel distortion-robust param-
eterization tool - performance evaluation on Aurora 2 & 3, Proc. of Eurospeech ’03,
Geneve, Switzerland, 2003.
10. Fousek, P. Computer cluster-based speech recognition system, Proc. of Poster
2003, Prague, 2003.
11. Fousek, P., Robust speech parametrization for recognition purposes, Proc. of
Poster 2002, Prague, 2002.