Czech Technical University in Prague
Faculty of Electrical Engineering
DOCTORAL THESIS
Petr Fousek March 2007
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Circuit Theory
Extraction of Features for Automatic Recognition
of Speech Based on Spectral Dynamics
by
Petr Fousek
PhD Program: Electrical Engineering and Information Technology
Branch of Study: Electrical Engineering Theory
Supervisor: Doc. Ing. Petr Pollak, CSc.
Supervisor–Specialist: Prof. Hynek Hermansky
Abstract
This work addresses new approaches to automatic speech recognition utilizing
neural networks and Hidden Markov Models (HMM). The main interest is in developing
novel features extracted from spectral dynamics.
It is generally acknowledged that the underlying message in speech is contained in
the auditory spectrogram. However, it is not well known how the relevant information is
distributed in the spectrogram and how it should be converted into features.
The first part of the thesis investigates the distribution of information in the auditory
time-frequency plane. In particular, we study how much of the symmetric time
context surrounding a particular time instant can possibly be useful for features. We
also look at how the length of this context changes with the varying size of the feature
vector. After showing that most of the information comes from the center of the symmetric
contextual window, we propose to warp the time axis so as to devote more modeling
power to the window center at the expense of the boundaries. Finally, we focus on the modulation
properties of speech: considering that the information transmitted by speech is
encoded in temporal modulations of the auditory spectra, we explore which modulations
should be preserved in features and which of them are the most important.
In the next part we focus on the development of features explicitly encoding the temporal
evolution of energies in frequency sub-bands, called LP-TRAP (Linear Predictive TempoRAl
Patterns). LP-TRAP bypasses frame-based analysis and can preserve the fine temporal
structure of sub-band events. The method is implemented in C++ and its tunable parameters
are optimized. The idea of pre-warping the temporal axis in order to stress the central
part of sub-band energy trajectories is implemented and shown to be beneficial.
Subsequently, a novel feature extraction technique named M-RASTA is proposed, which
extends and generalizes the RelAtive SpecTrAl filtering (RASTA) method. It filters the speech
spectrogram with a bank of 2-D time-frequency filters with varying temporal resolutions.
The result is projected onto phoneme probabilities by a neural network. The technique is inspired by
the earlier findings about the information in the spectro-temporal plane, namely focusing on
the central parts of TRAPs and preserving only the important part of the modulation spectrum. It
is optimized, implemented in C++ and compared to baseline features.
The last studied topic proposes a new alternative to mainstream HMM word
recognition, in which each target word is classified against all other sounds by a separate
binary classifier. The system uses only discriminatively trained neural network classifiers.
Since the proposed framework focuses on capturing only the words of interest, it is able
to reasonably reject all out-of-vocabulary words.
The observations and conclusions of this work are based on experimental evidence
using two standard independent speech recognition tasks.
Abstrakt
This thesis is oriented towards new methods for automatic speech recognition based
on artificial neural networks and Hidden Markov Models (HMM). The emphasis is on the
development of new features derived from spectral dynamics.
It is generally accepted that the information to be conveyed by speech is contained in the
spectrogram. However, it is not precisely known how this information is distributed in the
spectrogram and how it should be converted into features for recognition.
The first part of the dissertation investigates precisely this distribution of information in the
time-frequency plane of the spectrogram. We focus on how much of the temporal context
symmetrically surrounding the examined time instant can be useful for features. We examine
whether the width of this context changes with the number of features available. We then
show that the largest part of the information comes from the center of the temporal context
and propose a method of warping the time axis which allows the center of the context to be
modeled better at the expense of its boundaries. At the end of this part we address the
modulation properties of speech. Considering that the information transmitted by speech is
encoded in temporal changes of the spectrum, i.e. modulations, we attempt to determine
which modulations should be preserved in the features and which of them are the most
important.
In the next part we focus on one feature extraction method which explicitly models the
temporal evolution of energies in frequency sub-bands of the spectrum, the LP-TRAP method
(Linear Predictive TempoRAl Patterns). The method does not rely on segmental analysis and
is therefore able to preserve the detailed structure of temporal events at the sub-band level.
The algorithm is implemented in C++ and the parameters of the method are optimized. The
above-mentioned idea of warping the time axis is implemented and its benefit is shown
experimentally.
Further, a new feature extraction method named M-RASTA is proposed, which extends
and generalizes RASTA filtering (RelAtive SpecTrAl filtering). M-RASTA filters the speech
spectrogram with a bank of two-dimensional time-frequency filters with varying temporal
resolutions. The result of the filtering is projected onto posterior probabilities of phonemes
by a neural network. The method is inspired by the conclusions of the preceding parts; in
particular, it focuses on the center of the temporal trajectory and preserves only the
necessary part of the modulation spectrum. The method is partially optimized, implemented
in C++ and compared with other feature types.
In the last part of the thesis, a new alternative approach to word recognition without
HMMs is proposed, in which each target word is distinguished from any other sounds by an
independent binary classifier. Only neural networks are used for classification. Since the
proposed system focuses only on the target words, it is able to reject out-of-vocabulary
words to a reasonable degree.
The findings and conclusions presented in this thesis are experimentally supported using
two independent recognition tasks.
Acknowledgment
Doctoral study takes quite a bit of one's lifetime to finish. Meanwhile, one manages to
meet a number of good people, have a lot of fun, learn many new things and eventually do
a piece of work. I happened to spend my PhD at two places, namely at my alma mater,
Czech Technical University in Prague, and at IDIAP Research Institute, Martigny.
In Prague I worked under the guidance of my supervisor Petr Pollak. Petr led me into
the scientific world, took me to my first conference and arranged my stay at IDIAP. He
was also a source of great theoretical support and practical experience. It was he who
introduced me to the world of Linux, for which I am endlessly grateful.
My thanks go also to all the other members of the lab for the nice working and off-work
atmosphere. In particular, to Vaclav Hanzl (our Linux and system guru) for his in-depth
technical and theoretical support; to Jan Novotny for fruitful discussions, for his digits
recognizer cook-book and also for his microphone, which I stole from him; to Jindrich Zd'ansky
for Perl and his irresistible write-only scripts; and to Hynek Boril for all the progressive
and constructive discussions, for very effective and smooth cooperation, for sharing his
hot-dog machine in the lab and for being a friend. Finally, I want to thank the head of
our department, prof. Sovka, for his perfect and enthusiastic DSP lectures, which long ago
made me decide for this field.
During my stay at IDIAP I was guided by my co-supervisor Hynek Hermansky. Hynek
let me see the cutting edge of the world of speech recognition. His encyclopedic
knowledge of the latest technology, his open mind and wild ideas make him a unique
personality with whom it is an outstanding experience to work. I greatly appreciate his
infectious optimism and his involvement in our work.
I am grateful to all the guys from the speech recognition group for their cooperation and
the nice atmosphere in the lab. First of all, to my friend Frantisek Grezl for teaching me
TRAPs, being able to answer any TRAP-related question and providing me with all his
scripts and tools. Our everyday free-hand cooking sessions with Franta no doubt gave
rise to our best ideas in ASR. I also thank Petr Motlicek who, besides the serious
work, helped me with exploring the Alps on foot and by bike. Among other lab-mates, I
enjoyed the cooperation with Marios Athineos, Petr Svojanovsky, Hamed Ketabdar, Mikko
Lehtonen and Hemant Misra. The swift and secure working environment at IDIAP is
certainly due to admins Frank Formaz, Norbert Crettol and the director, Herve Bourlard.
I want to thank all the guys from the FIT Brno Speech Processing lab for kindly providing
me with a lot of source code and support, and for domesticating that “guy from Prague”.
As PhD study is not exactly a lucrative job, I owe my thanks to my parents
for material support. Apart from that, my parents have always provided me with an
absolutely reliable and all-inclusive home, which I appreciate the most.
Finally, my deepest thanks go to my wife Petra for being with me, for her bullet-proof
patience, unshakeable tolerance and for making our life colorful.
— big thanks go to Franta, Pet’a and Pet’ka for reading the manuscript
and to my father for the picture with the poor guy in the desert.
Thank you.
Contents

1 Preface
2 Introduction
  2.1 Statistical speech recognizer
    2.1.1 Front-end
    2.1.2 Back-end
  2.2 Incorporating spectral dynamics in features
    2.2.1 Long time context
  2.3 Thesis overview
    2.3.1 Studied topics
    2.3.2 Thesis outline
    2.3.3 Main goals
3 Survey: From short term spectrum to spectral dynamics
  3.1 Obtaining auditory spectrogram
    3.1.1 Frequency resolution of speech spectrum
    3.1.2 Bank of filters in time domain
  3.2 Spectrogram filtering
    3.2.1 Temporal filtering and modulation spectrum
    3.2.2 2-D Filtering
  3.3 Parametric spectrogram representation
  3.4 Parametrizing spectrogram by probabilistic features
    3.4.1 Hybrid and TANDEM architectures
    3.4.2 TRAP architecture
    3.4.3 Further development of TRAP
    3.4.4 Multi-stream systems
4 Recognition and evaluation framework
  4.1 Software tools
  4.2 Recognizer architecture
    4.2.1 Multi Layer Perceptron as posterior probability estimator
  4.3 Evaluation criteria
  4.4 Evaluation tasks
    4.4.1 English digits recognition – Stories & Numbers95
    4.4.2 Conversational Telephone Speech – CTS
  4.5 Baseline results
5 Information in time-frequency plane
  5.1 Limits of useful temporal context for TRAP features
    5.1.1 Mean TRAPs of phonemes
    5.1.2 Truncating TRAP
    5.1.3 Fixed MLP topology – truncating TRAP-DCT
    5.1.4 Extension – combining DCTs of different lengths
  5.2 Focusing on TRAP center
    5.2.1 Warping time axis
6 Linear Predictive TempoRAl Patterns (LP-TRAP)
  6.1 Introduction to LP-TRAP
  6.2 Extracting LP-TRAP features
    6.2.1 Importance of Hilbert envelope
    6.2.2 Obtaining frequency sub-bands
    6.2.3 FDLP – Frequency-Domain Linear Prediction
    6.2.4 Free parameters in algorithm
  6.3 Experimentally optimizing LP-TRAP
    6.3.1 Sub-band envelope compression and LPC order
    6.3.2 Sampled FDLP temporal envelopes vs. FDLP cepstra as features
    6.3.3 Input window length
    6.3.4 Overlap of frequency sub-bands
    6.3.5 LP model order
    6.3.6 Evaluating optimized features on S/N and CTS tasks
  6.4 Warping time axis in LP-TRAP
    6.4.1 Temporal resolution of LP-TRAP
    6.4.2 Two ways to warp time axis
    6.4.3 Non-linear time warping and sampling theorem
    6.4.4 Warping function
    6.4.5 Experiments
  6.5 Conclusion
7 Multi-resolution RASTA filtering (M-RASTA)
  7.1 Introduction – M-RASTA from different perspectives
  7.2 M-RASTA features
    7.2.1 Temporal filters
    7.2.2 2-D – time-frequency filters
  7.3 Experiments
    7.3.1 Filtering with single filter
    7.3.2 Combining two temporal filters
    7.3.3 Tuning the system for best accuracy
    7.3.4 Robustness to channel noise
    7.3.5 Modulation frequency properties
    7.3.6 Discrepancy between MLP and HMM – phoneme posteriograms
  7.4 Combining LP-TRAP and M-RASTA
  7.5 Conclusion
8 Extensions: Towards recognition by means of keyword spotting
  8.1 Introduction
  8.2 Detecting a word in two steps
    8.2.1 From frame-based estimates to word level
  8.3 Experiments
    8.3.1 Training and testing data sets
    8.3.2 Initial experiment – checking viability
    8.3.3 Simplifying the system – omitting intermediate steps
    8.3.4 Optimizing false alarm rate – Enhanced system
    8.3.5 Keyword spotting on frame level
    8.3.6 Keyword spotting in unconstrained speech
  8.4 Discussion and conclusion
9 Summary and conclusion
  9.1 Summary of the work
  9.2 Original contribution
  9.3 Conclusion
  9.4 Acknowledgment
A Class coverage of S/N & CTS tasks
B Summary of experiments on CTS task
C On target classes for band-classifiers in TRAP
D Sub-phoneme targets for TANDEM classifier
Bibliography
List of Figures

2.1 Diagram of a typical ASR.
2.2 Example of a three-state sequential HMM.
2.3 Illustration of long-term feature extraction.
3.1 Scheme of Hybrid MLP/HMM system.
3.2 Scheme of TANDEM system combining MLP and GMM/HMM.
3.3 Scheme of TRAP system.
4.1 Block scheme of feature extractor CtuCopy.
4.2 General scheme of front-end for long-term features.
4.3 Scheme of three-layer MLP and neuron.
4.4 Structure of the corpora in Stories-Numbers95 (S/N) task.
5.1 Mean TRAPs for 41 phonemes and 6 non-phoneme classes of CTS.
5.2 Mean TRAPs and standard deviations for selected phonemes of CTS.
5.3 Influence of TRAP length on FER and WER.
5.4 Influence of TRAP-DCT length on FER and WER.
5.5 First four DCT bases weighted by Hamming window.
5.6 Chosen bases of DCT applied to 100 ms and 1000 ms TRAPs.
5.7 Illustration of combining bases of two DCTs of different sizes.
5.8 Combining DCT size 2 features with another DCT size 2 features.
5.9 Symmetric TRAP warping function.
5.10 Discrete mapping from TRAP size 101 to TRAP size 21.
5.11 Warping time axis in TRAPs.
6.1 LP-TRAP feature extraction scheme.
6.2 Illustration of forming frequency sub-bands from DCT spectrum.
6.3 Duality between time and frequency domains in LP-TRAP.
6.4 Detailed LP-TRAP feature extraction scheme.
6.5 Compression of sub-band dynamics in LP-TRAP.
6.6 Bank of filters in LP-TRAP. Influence of blap factor on filter widths.
6.7 WER and FER as a function of blap factor in LP-TRAP.
6.8 WER and FER as a function of LP order fp in LP-TRAP.
6.9 Illustration of temporal resolution of LP-TRAP vs. TRAP.
6.10 Warping temporal axis in LP-TRAP feature extraction.
6.11 Illustration of binning process in warping LP-TRAPs.
7.1 M-RASTA feature extraction scheme.
7.2 Normalized impulse responses of the first two Gaussian derivatives.
7.3 Normalized frequency responses of the first two Gaussian derivatives.
7.4 Detail of the first two Gaussian derivatives for various σ.
7.5 Example of impulse responses of 2-D RASTA filters with σ = 60 ms.
7.6 M-RASTA with only one filter: FER and WER dependencies on filter width.
7.7 Combining two temporal filters in M-RASTA: error as a function of σ.
7.8 Shrinking the modulation bandwidth in M-RASTA by limiting σ range.
7.9 Determining bandwidth of a bank with two filters.
7.10 Mapping between σ and bandwidth of associated g1,2 filters.
7.11 FER and WER as a function of cutoff frequency in modulation spectrum.
7.12 Posteriograms for utterance “nine”.
7.13 Scheme of Hilbert-M-RASTA feature extraction.
8.1 Scheme of hierarchical keyword spotting.
8.2 Example of keyword posteriogram.
8.3 Impulse responses of matched filters for eleven keywords.
8.4 Finding the keyword position from its posterior probability.
8.5 Omitting intermediate processing steps from hierarchical keyword spotting.
8.6 Alarm thresholds as a function of false alarm rate (floored at threshold 0.05).
8.7 Operation ranges of the proposed system and HMM recognizer.
8.8 Frame-level evaluation of Keyword MLPs, Test set.
C.1 TRAP MLP architecture.
D.1 Illustration of phonemes as MLP training targets for an utterance “seven”.
List of Tables

4.1 Performance of baseline features on S/N task.
4.2 Suitable grammar scale factors for CTS task.
4.3 Performance of baseline features on CTS task.
5.1 Warping factors w and compression ratios for given lengths M.
5.2 Performance of selected warped TRAPs on CTS.
6.1 Performance of sampled temporal envelopes vs. of cepstra in LP-TRAP.
6.2 Influence of LP-TRAP input segment length on FER and WER.
6.3 Performance of optimized LP-TRAPs on S/N task.
6.4 Performance of optimized LP-TRAPs on CTS task.
6.5 Performance of warped LP-TRAPs as a function of fp, ncep, warp. S/N task.
6.6 Performance of warped LP-TRAPs, CTS task.
7.1 Best reached FER and WER for three feature pair types.
7.2 Adding frequency derivatives to M-RASTA features.
7.3 Looking for a suitable number of temporal filters in M-RASTA.
7.4 Performance of M-RASTA on CTS.
7.5 Influence of channel mismatch on WER (M-RASTA and other features).
7.6 Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on S/N task.
7.7 Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on CTS task.
8.1 WER as a function of false alarms per hour on digits recognition task.
8.2 Results on a joint set of digits and unconstrained speech.
A.1 Class coverage of S/N task, MLP sets 1 & 2.
A.2 Class coverage of CTS task, training sets.
B.1 Summary of experiments on CTS task.
C.1 Emulating non-separable classes in TRAP Band-MLP targets.
D.1 Performance of phoneme states used as MLP targets, S/N task.
D.2 Performance of phoneme states used as MLP targets, CTS task.
List of Abbreviations
ANN Artificial Neural Network
AR Autoregressive (model)
AR-MA Autoregressive - Moving Average (model)
ASR Automatic Speech Recognition
CRBE CRitical Band Energy
CTS Conversational Telephone Speech
DCT Discrete Cosine Transform
DTW Dynamic Time Warping
FA False Alarm (rate)
FAcc Frame Accuracy
FDLP Frequency-Domain Linear Prediction
FER Frame Error Rate
FFT, F[·] Fast Fourier Transform
FIR Finite Impulse Response (filter)
FOM Figure Of Merit
GMM Gaussian Mixture Model
H[·] Hilbert transform
HMM Hidden Markov Model
HSR Human Speech Recognition
HTK Hidden Markov model ToolKit
KLT Karhunen-Loeve Transform
LDA Linear Discriminant Analysis
LP, LPC Linear Prediction, Linear Predictive Coding
LP-TRAP Linear Predictive TempoRAl Patterns
LVCSR Large-Vocabulary Continuous Speech Recognition
MFCC Mel-Frequency Cepstral Coefficients
MLP Multi-Layer Perceptron
M-RASTA Multi-resolution RelAtive SpecTrA (filtering)
OOV Out Of Vocabulary (word)
P(·) Probability
PCA Principal Components Analysis
PLP Perceptual Linear Prediction
TDLP Time-Domain Linear Prediction
TRAP TempoRAl Patterns
VTLN Vocal Tract Length Normalization
WER Word Error Rate
Chapter 1
Preface
Speech has been the most natural and important means of communication among humans
for tens of thousands of years. Speech production and hearing have had even more time
to evolve, as the ability to detect and classify sounds was key to survival. Recent
studies of cortical neurons of mammals indicate that the brain develops fundamental
classification abilities at an early prenatal stage, before the organs that capture external
stimuli are fully developed [69]. Later, the human fetus starts to react to its mother's
voice with movements. This confirms that speech production and recognition capabilities are
thoroughly optimized by nature and mutually adapted. Every day, human beings process
immense volumes of acoustic stimuli, subconsciously performing continual filtering, denoising,
channel normalization, speaker adaptation, classification, and complex high-level
grammatical and contextual searches that enable very reliable information
transmission.
The idea of automatic recognition of speech by machine is much younger (although
its origins arguably go back to the invention of writing, some five thousand years
ago). Attempts at automatic conversion between the written and spoken form of a message,
known as speech synthesis (Text To Speech, TTS) and Automatic Speech Recognition
(ASR), were triggered by Edison's invention of the phonograph in 1877, but seriously evolved
only after major breakthroughs in the 1960s, namely the Fast Fourier Transform [28], cepstral analysis
[82], Linear Predictive Coding [13, 57], Dynamic Time Warping [18] and Hidden Markov
Modeling [85]. Within half a century of development, boosted by the recent boom of
computers and digital technologies allowing extreme amounts of data to be acquired and
processed, there have been gradual and considerable achievements. However, the ultimate
goal, a system able to substitute humans in this task, has not yet been
reached. The reason is that the task is quite complex and current ASR is very fragile. It
is now possible to build a laboratory system tuned to deliver good transcript of recorded
utterances, but when exposed to real-life voice, the system usually fails due to the lack of
robustness [43]. Even current commercial solutions need additional in-place tuning
and adaptation after deployment. This suggests that we still do not know how to properly
process the speech data. What is it that we should extract from the data, and how should
we present it to the learning and classification systems?
Apart from the linguistic message, which is the target for ASR, speech contains
much more information. It describes the speaker himself, his mood (joy, anxiety, excitement),
even his health (hoarseness, sleepiness). Speech also differs with the context the
speaker is in. Calling a friend on the phone would likely be more casual than a
formal talk to an audience. All those aspects are called the intra-speaker variations. The
changes from one speaker to another introduce the inter-speaker variations. Furthermore,
there is information in the speech about the transmission channel and microphone,
and the acoustic image of the surroundings, called background noise (e.g. other sounds, speech,
music, noise). Part of this non-linguistic information may be important in other
domains such as speech coding, speaker recognition or health monitoring; for
ASR, however, it represents unwanted variability which should be suppressed. The ASR approach,
which will be introduced in the following section, is able to deal with this diversity to some
extent, as proved by years of use. On the one hand, this is a good reason to stick to
this knowledge and build upon it; on the other hand, there is a challenge for novel feature
extraction approaches, which is the main concern of this thesis.
The commonly used short-term features such as MFCC or PLP [74, 47] encode the
envelope of the power spectrum into cepstral features. Such a representation is motivated
by findings from speech production (LPC analysis, cepstral analysis) and, mainly, speech
perception (the Bark, ERB and Mel frequency scales, equal-loudness perception, the intensity-
loudness power law [47]). Though this knowledge makes the features reasonably speaker-
independent, they are still vulnerable to distortions. For example, in cheap AC-powered
radios, harmonics of 50 Hz often leak into the speech, which does not noticeably influence
intelligibility. However, such a distortion alters the overall spectral shape, which affects
the cepstral coefficients, and a recognizer based on short-term features may thus fail.
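As a concrete sketch of this short-term front-end (windowed frame, power spectrum, mel filterbank, logarithm, cosine transform), the following minimal example computes MFCC-style cepstra for one frame. The frame length, sampling rate, number of mel filters and number of cepstra are illustrative choices, not the exact settings used later in this thesis.

```python
# Minimal sketch of MFCC-style short-term feature extraction.
# All parameter values below are illustrative, not the thesis settings.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fs, n_filters=23, n_ceps=13):
    n_fft = len(frame)
    # Windowed power spectrum of one frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, fs) @ spectrum
    log_e = np.log(energies + 1e-10)
    # DCT-II of log filterbank energies yields the cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_e

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 8000.0)  # toy 440 Hz frame
ceps = mfcc_frame(frame, fs=8000)
```

A full front-end would slide such frames every 10 ms or so and stack the resulting cepstral vectors into the observation sequence.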
Other obstacles arise easily in real environments: changing the distance of
the speaker from the microphone, or even changing the microphone itself, causes a change
in the cepstrum which may have a severe impact on ASR performance. Current commercial
dictation systems (e.g. from Nuance) often supply a custom close-talk microphone to
alleviate this source of variability. This issue can generally be treated either with similar
hardware workarounds, which pose unpleasant constraints on the user and make the
system less portable, or by more sophisticated signal processing means. Many techniques
have been developed for attenuating such distortions. Additive noises can be reasonably
suppressed with many variants of spectral subtraction, e.g. [90, 71, 34, 73]. Channel noise
suppression techniques operate on the logarithmic spectrum or cepstrum, where convolution
transforms into addition; examples are spectral “normalization” by subtracting the long-term
average from the cepstrum [12], or RASTA filtering, which allows on-line processing [52]. However,
these techniques are knowledge-based and place further assumptions on the speech; for
example, the noise has to be stationary, or at least slowly changing, for spectral subtraction
to work. Generally, these techniques can only post-process existing, non-robust
features to make them more robust. But consider speech that is
corrupted by severe channel noise in a particular frequency sub-band, say by a complete
deletion of that sub-band. It has been shown that humans are amazingly resistant to such
phenomena: even deleting everything within the 800 Hz–4 kHz band from the speech
did not prevent human listeners from recognizing nonsense syllables at about a 90% rate [70].
Given the standard short-term cepstral features derived from such speech, it is impossible
to recover the “clean” features. Nevertheless, there exist ways to overcome similar issues
in ASR. This work shows that features derived from the spectral dynamics, rather than
from the absolute spectra, can conveniently serve this purpose.
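To make the above concrete, the following is a minimal sketch of power spectral subtraction, one of the additive-noise techniques cited above; the over-subtraction factor, flooring constant, and toy data are illustrative, not taken from any of the cited variants.

```python
import numpy as np

def spectral_subtraction(frames, noise_psd, alpha=2.0, beta=0.01):
    """Basic power spectral subtraction, a minimal sketch.

    frames    : (T, F) array of short-term power spectra
    noise_psd : (F,) noise power estimate, e.g. averaged over
                leading non-speech frames (assumed stationary)
    alpha     : over-subtraction factor
    beta      : spectral floor to avoid negative power
    """
    cleaned = frames - alpha * noise_psd
    # Flooring: keep a small fraction of the noisy power instead of
    # clipping to zero ("musical noise" is the usual artifact).
    return np.maximum(cleaned, beta * frames)

# Toy usage: constant unit noise plus a speech-like burst in one bin
noise = np.full(4, 1.0)
noisy = np.tile(noise, (3, 1))
noisy[1, 2] += 10.0                       # "speech" energy
clean = spectral_subtraction(noisy, noise, alpha=1.0)
```

Flooring against a fraction of the noisy power is one common way to limit the artifacts that plain half-wave rectification produces.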
Chapter 2
Introduction
To position the thesis within the ASR framework, a typical speech recognition system
is introduced first, with its strengths and weaknesses, followed by the motivation for the
presented study.
2.1 Statistical speech recognizer
The main question of a statistical ASR system is: “What is the most likely sequence
of words M given an acoustic observation X?” The probability of a word sequence can
be written as P (M|X,Θ), where M = {m1,m2, . . . ,mN} is the sequence of words mi,
X = {x1,x2, . . . ,xM} is a sequence of acoustic observations represented by feature vectors
xi, and Θ are parameters of the model. The most likely sequence is then
M = arg max_M P (M|X,Θ). (2.1)
As this probability cannot be evaluated directly, we apply Bayes rule:
M = arg max_M [P (X|M,Θ) P (M|Θ)] / P (X|Θ). (2.2)
We get two terms in the numerator. P (X|M,Θ) is the probability of the acoustic
observation given a sequence of words. P (M|Θ) is the probability of the word sequence M
independent of the acoustic observation. P (X|Θ) in the denominator is the a priori proba-
bility of the acoustic observation, which is constant over all word sequences M and can be
omitted due to the argmax. In principle, eq. 2.2 enables independent acoustic and language
modeling, yet every word has to be modeled with an individual model. This is inefficient
as soon as there are many words in the vocabulary and, for large-vocabulary ASR, it is
even impossible. But if there exists a set of sub-word units onto which all words can be
mapped, we can expand the numerator in eq. 2.2. Let us consider a set of units called
phones Q = {q1, q2, . . . , qK}. If a sequence of words M can be transcribed with a set of
possible phone sequences Q, then the probability of the word sequence can be evaluated
by summing over all phone sequences. Eq. 2.2 then becomes
M = arg max_M Σ_Q P (X|Q,M,Θ) P (Q,M|Θ) (2.3)
= arg max_M Σ_Q P (X|Q,Θ_AM) P (Q|M,Θ_PM) P (M|Θ_LM). (2.4)
In the step from eq. 2.3 to eq. 2.4, in the first term we assume that the acoustic
observation is conditionally independent of the word sequence given the phone sequence
(in other words, it does not matter how the words transcribe into phones). The second term
was simply factorized into two independent factors. According to eq. 2.4, the ASR problem
splits into three parts:
P (X|Q,ΘAM ) – likelihood of the acoustic observation given a fixed set of sub-word units
and given a subset of parameters known as Acoustic model (likelihood that it is
the observation sequence X that the sequence of phonemes Q has generated).
P (Q|M,ΘPM ) – likelihood of the phoneme sequence Q given the word sequence M and
the parameters subset called Pronunciation model (likelihood that the words M
transcribe as Q – qualified by pronunciation dictionary).
P (M|ΘLM ) – likelihood of the word sequence M given parameters of the Language
model (this introduces grammatical constraints).
Estimation of these probabilities constitutes three wide branches of ASR research
in their own right. Pronunciation modeling deals with variations of pronunciation, typically
using finite state automata which model transition probabilities between phones using a
priori or data-driven rules. Language modeling incorporates a knowledge about language
(grammar, semantics) into ASR typically by N-gram grammars which model probabilities
of transitions between N subsequent words. The models are trained on large text corpora.
The goal of acoustic modeling is a mapping between the acoustic input and the chosen
sub-word units. It is typically accomplished by Hidden Markov Models (HMM), which
model the temporal structure and rate variations of speech with finite state automata; a
measure of similarity between a particular feature and a sub-word unit is quantified by Gaussian
probability density functions or neural networks. The parameters of the acoustic model are
trained on large labeled speech corpora. This thesis is oriented towards acoustic modeling,
therefore it will not further discuss the language-related parts of ASR.
Fig. 2.1 shows how the three ASR branches are distributed in a typical ASR system.
The recognizer comprises speech data acquisition and preprocessing, extraction of features,
estimation of class probabilities, and decoding. It can be split into a so-called front-end and
back-end, a division originating from the notion of distributed telephone ASR: the front-end
is implemented at the telephone side and the back-end at the server side.
2.1.1 Front-end
The preprocessing block first converts the input speech into a digitized signal and then
possibly applies DSP enhancement techniques such as DC removal, normalization, equal-
ization, and noise suppression. Some noise suppression techniques can be efficiently inte-
grated in feature extraction as will be shown later. Subsequently, the front-end extracts
[Figure 2.1 here: block diagram with preprocessing and feature extraction in the front-end,
class likelihood estimation and decoding in the back-end; the back-end draws on the
pronunciation, acoustic, and language models, and its output is text.]
Figure 2.1: Diagram of a typical ASR.
features from the speech. The goal is to preserve the important linguistic and other rel-
evant information and to reduce the redundancy of the data. The information rate of
telephone-quality speech is about 64 kbps (the present European telephone standard). Typ-
ical algorithms used in low-rate speech coding need an order of magnitude less (2400 bps
for the LPC10 standard), and intelligibility can still be preserved at bitrates <500 bps
[79]. ASR differs from speech coding in that it only needs to preserve the linguistic mes-
sage, whose bitrate is around 50 bps [35], three orders of magnitude less than the original
speech. The algorithms thus aim at separating the relevant information from the ir-
relevant. Moreover, the subsequent classification and decoding algorithms are generally
computationally-intensive and it would not be efficient to process the full bandwidth sig-
nal. Development of front-end algorithms also helps to recover the underlying processes
behind speech and to understand the generation and perception processes.
Typically, the feature extraction is based on some form of short-term spectrum esti-
mated in a frame-based manner. Every 10–20 ms the signal is windowed with a 20–30 ms
long Hamming window. By taking FFT of the windowed frame we move to the frequency
domain and get about 100–200 samples of the spectrum. Such an approach assumes
stationarity within the frame. Though this assumption is not met exactly, the 20–30 ms
long frame is a reasonable approximation. The features are derived only from the mag-
nitude since the phase is believed not to contain much of the relevant information for
speech recognition1. The envelope of the magnitude (or power) spectrum is smoothed
with a bank of auditory-like filters. In the frequency domain this represents a simple binning
of the spectral samples. The filter bank contains 15–30 filters approximately equidistant
on a logarithmic frequency axis which emulates a property of the human speech percep-
tion. The dynamics of the filtered spectrum is compressed by the logarithm. The final
auditory-like spectrum with 15–30 bins is further smoothed and decorrelated either with
discrete cosine transform yielding Mel-Frequency Cepstral Coefficients (MFCC) [74, 29],
or with linear prediction with further auditory-like operations yielding Perceptual Linear
Predictive cepstrum (PLP) [47]. By these operations, the feature extraction reduces the
bandwidth from 8000 numbers per second (speech) to about 1200 numbers per second
(features). Recalling that the linguistic rate of speech is about 50 bits per second, there is
still a lot of redundancy that could be reduced. However, other relevant yet non-linguistic
factors may need to be preserved in features for a successful decoding in the back-end.
1The fact that the ear is phase deaf is referred to as Ohm’s Law of Acoustics [8].
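The chain described above (windowing, FFT magnitude, auditory-like binning, log compression, DCT) can be sketched end-to-end. This is a simplified MFCC-style illustration; the filter-bank construction and all sizes below are assumptions made for the sketch, not a standard-compliant implementation.

```python
import numpy as np

def mfcc_like(signal, fs=8000, frame_len=200, hop=80, n_filt=20, n_ceps=13):
    """Rough MFCC-style features: window -> |FFT| -> mel-like
    triangular filter bank -> log -> DCT. All details simplified."""
    win = np.hamming(frame_len)
    n_fft = 256
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # mel-spaced filter edges mapped to FFT bin indices
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filt + 2))
    bins = np.floor((n_fft // 2) * edges / (fs / 2.0)).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)
        if r > c:
            fbank[i, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)
    # DCT-II matrix: smooths and decorrelates the log filter-bank energies
    k = np.arange(n_ceps)[:, None] * (np.arange(n_filt) + 0.5)[None, :]
    dct = np.cos(np.pi * k / n_filt)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * win
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        feats.append(dct @ np.log(fbank @ power + 1e-10))
    return np.array(feats)

x = np.random.default_rng(0).standard_normal(800)  # 100 ms of "speech"
F = mfcc_like(x)   # one 13-dimensional feature vector per 10 ms hop
```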
2.1.2 Back-end
The feature vectors are passed to the back-end, where the goal is to map the speech
frames onto the chosen sub-word units and to decode the message. Since phones2 are widely
used sub-word units, we will demonstrate the idea using phones. The mapping
from feature vectors to phones is not one-to-one. One phone is an event in time with a
characteristic temporal structure, which usually spans over several frames. This structure
helps to identify the phone. Given that speech frames themselves are assumed stationary,
to preserve the phone structure, every phone has to be modeled by several phases, called
states. Typically, one phone is formed by 3–7 subsequent states. The problem of mapping
features to phonemes thus splits into two parts. The first is a static pattern matching
(comparing the feature to a template) and the second is a sequence recognition (time-
alignment).
To be able to match a state of a phone to a feature vector, there must exist some
discriminative function capable of evaluating a measure of similarity between the observed
feature and a stored statistical template (created during a training phase). For this purpose
mixtures of multidimensional Gaussian probability density functions (Gaussian Mixture
Model, GMM) are commonly used, which evaluate the similarity by means of a likelihood.
The GMMs assume the features to have a multivariate Gaussian distribution, which puts
constraints on front-end algorithms. Alternatively, the similarity can be estimated by a
trained neural network. Neural networks generally put fewer constraints on features and
have some advantages over GMMs: they allow for an arbitrary mapping between input
and output, and they can produce posterior probabilities3.
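As an illustration, the GMM emission score reduces to a log-sum-exp over per-component Gaussian log-densities. The sketch below assumes diagonal covariances, as is common in ASR acoustic models; all parameter values are toys.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance
    Gaussian mixture, a minimal sketch of a GMM emission score.

    weights   : (K,)    mixture weights, summing to 1
    means     : (K, D)  component means
    variances : (K, D)  per-dimension variances
    """
    D = x.shape[0]
    # per-component log N(x; mu_k, diag(var_k))
    logdet = np.sum(np.log(variances), axis=1)
    maha = np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = -0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    # weighted log-sum-exp over components for numerical stability
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Single standard-normal component in 1-D, evaluated at its mean:
ll = gmm_loglik(np.zeros(1), np.ones(1), np.zeros((1, 1)), np.ones((1, 1)))
```

Note that the returned value is a likelihood score with no upper bound of 1, unlike the posterior probabilities produced by a neural network.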
Approximating fluent speech by a sequence of distinct stationary segments may
be quite crude, yet it allows for using well-developed mathematical tools. The natural
variability of speech requires the similarity measures mentioned above and also some dy-
namic time-alignment between states and frames. All of this is a task for Hidden Markov
Models (HMM). HMMs are stochastic finite state automata. The HMM is a generative
model, meaning that it aims at modeling the underlying process of speech generation. It
consists of several states S = {s1, s2, . . . , sK} and a topology of connections between them.
As speech is a temporal event, a sequential (left-to-right) model is used most often, see
example in Fig.2.2. There is a set of parameters associated with an HMM:
Transition probabilities a(sl|sk) – probabilities of a transition from state sk to sl (or
to stay at sk when k = l). These are modeled by Markov models of the first order,
which means that the next state depends only on the current state and not on any
previous states. The first order assumption allows for estimating the hidden state
sequence with only a small time context.
Emission probability b(xn|si) – a measure of a match of the observed feature and the
state si’s template, modeled by the above mentioned probability density functions.
Emission probability estimates how likely it is that the state si emitted (generated)
the observation xn.
2The phone is an elementary acoustic unit while a phoneme is a similar linguistic unit. Since these
units can often be mapped unambiguously, we will sometimes interchange these terms in the text.
3A probability is a value between 0 and 1, as opposed to a likelihood, which generally has no limits. A
system producing posterior probabilities has discriminative properties by nature.
[Figure 2.2 here: a three-state left-to-right HMM with states s1, s2, s3, self-loop
transitions a(s1|s1), a(s2|s2), a(s3|s3), forward transitions a(s2|s1), a(s3|s2), and
emission probabilities b(x|s1), b(x|s2), b(x|s3).]
Figure 2.2: Example of a three-state sequential HMM, characterized by transition proba-
bilities a(sk|sl) and emission probabilities b(x|si).
The joint likelihood of passing the observations through the model can be evaluated
using these probabilities under the two following conditions. The first is the above-mentioned
1st order Markov assumption; the second is that the observations are assumed in-
dependent of the previous states and features. If the assignment between features and
states were known, the joint probability would be a simple product of probability terms.
However, the sequence of states is unknown – hidden, therefore the probability that the
sequence of acoustic vectors X = x1,x2, . . . ,xT was emitted by the model M is given by
a sum over all possible state sequences S = s(1), s(2), . . . , s(T ),
P (X|Θ_M) = Σ_S a(s(1)) Π_{t=1}^{T} b(x_t|s(t)) a(s(t+1)|s(t)), (2.5)
where a(s(1)) is an a priori probability of beginning at state s(1). There exist a number
of well-developed algorithms for efficient HMM training and decoding, see e.g. [85].
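One such algorithm is the forward recursion, which evaluates eq. 2.5 in O(TK²) instead of summing over all K^T state sequences. A sketch on toy parameters (the final exit transition is ignored for brevity):

```python
import numpy as np

def forward_likelihood(A, B, a0):
    """Forward algorithm: evaluates the HMM observation likelihood
    without enumerating all hidden state sequences.

    A  : (K, K) transition matrix, A[k, l] = a(s_l | s_k)
    B  : (T, K) emission likelihoods, B[t, k] = b(x_t | s_k)
    a0 : (K,)   initial state probabilities a(s(1))
    """
    alpha = a0 * B[0]                 # joint prob. of x_1 and first state
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]    # propagate states, then emit x_t
    return alpha.sum()                # marginalize the hidden states

# Two states, two frames; compare against explicit path enumeration
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
a0 = np.array([1.0, 0.0])
p = forward_likelihood(A, B, a0)
brute = sum(a0[i] * B[0, i] * A[i, j] * B[1, j]
            for i in range(2) for j in range(2))
```

The recursion and the brute-force sum over paths agree exactly, which is the point of the dynamic-programming reformulation.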
2.2 Incorporating spectral dynamics in features
It can be seen in Fig. 2.1 that the back-end of the ASR system is a complex multi-field
task which takes into account all the a priori lexical, phonetic, and grammatical properties
of the language that are incorporated in the pronunciation and language models. However,
the main information about the message comes through the acoustic model by means of
speech features, which arise from the front-end processing. An appropriate treatment of
speech in the front-end is therefore fundamental: if some vital speech cues are lost or
distorted by misguided manipulation during feature extraction, there is no way
to recover them at the back-end. On the other hand, the complex information-merging
processes that take place at the back-end can perform reliably only when the acoustic
input supplies highly relevant and discriminative features. Therefore, on the journey towards
more robust ASR it seems coherent to put effort into developing a robust front-end, rather
than trying to reclaim the information later at the back-end using equalization, compensation,
normalization, and other techniques. This idea is the main motivation for the front-end
orientation of this thesis.
2.2.1 Long time context
Speech is a means of communication. The communication channel consists of three
parts: the transmitter (the speaker; production), the transmission medium (the air pressure),
and the receiver (the listener; perception). Comparing HSR and ASR, both systems share
the first two blocks; there is thus an analogy between the listener and the recognizer.
Since evolution has optimized human perception’s use of the channel, why not
emulate it in ASR? This idea motivated the perception-based approach. J. B. Allen
[8] wrote: “Until the performance of automatic speech recognition hardware surpasses
human performance in accuracy and robustness, we stand to gain by understanding the
basic principles behind how humans recognize speech.” So far, ASR based on
short-term analysis remains fragile and far behind HSR, so it pays to look at
human recognition.
Inspiration by Humans
Temporal properties of speech are determined by the inertia of the production organs,
which causes fluent transitions between phones known as co-articulation. The phones are not
acoustically distinct; instead, they overlap. Significant information about a phoneme
can be found as far as 200 ms from its center [99], and most of the co-articulation happens
within a syllable-length (150–200 ms) region [44]. Perception consistently exhibits a time
constant of about 200 ms, given by the known property of temporal masking. This suggests
that to capture complete information about a phone in a signal, a segment about half a
second long needs to be selected.
Introducing Long Context in ASR
In current ASR the temporal evolution of speech is modeled poorly. Features are
derived from a mere 25 ms, and the HMMs, where the temporal properties should be modeled,
are constrained by the first-order assumption (see above). The need for longer temporal
evidence in ASR is further documented by the success of delta features [38], which describe
to some extent the local evolution of features. The suggestion that it is the change in spectrum
which carries the information was verified in perceptual experiments [39] and, for ASR,
confirmed by RASTA filtering, which eliminates the steady components of the spectrum
[52]. The study of modulations in speech supports this notion as well: most of the
modulations occur between 1–16 Hz, with a peak at 4 Hz, which again corresponds to
time constants of 150–250 ms [17, 10]. This suggests that the features extracted
“across frequency” might be better extracted “across time”, see Fig. 2.3.
Harvey Fletcher’s HSR experiments introduced the critical-band frequency channels,
and he suggested that humans use these channels independently of one another [36]. The
“across time” processing in the independent channels forms the core of the recent TRAP
features [53], which have been shown to improve the robustness and performance of ASR.
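The “across time” extraction can be sketched as cutting, for each frame, one temporal trajectory per critical band out of the auditory spectrogram; the context length and dimensions below are illustrative only.

```python
import numpy as np

def extract_traps(spectrogram, context=50):
    """Cut TRAP-like inputs out of an auditory spectrogram: for each
    frame, take each band's energy trajectory over +/- `context`
    frames ("across time"), a minimal sketch.

    spectrogram : (T, B) log energies, T frames, B critical bands
    returns     : (T - 2*context, B, 2*context + 1)
    """
    T, B = spectrogram.shape
    L = 2 * context + 1
    traps = np.empty((T - 2 * context, B, L))
    for t in range(context, T - context):
        # one temporal trajectory per band, centred on frame t
        traps[t - context] = spectrogram[t - context:t + context + 1].T
    return traps

spec = np.random.default_rng(1).standard_normal((120, 15))
tr = extract_traps(spec, context=50)   # 101-point trajectories
```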
Figure 2.3: Speech spectrograms. Left: Short-term features are extracted from one speech
frame “across frequency”. Right: TRAP-like features are extracted in every frequency
band from a long-term trajectory, “across time”.
2.3 Thesis overview
2.3.1 Studied topics
This thesis follows the line of features extracted from the temporal evolution of the
spectrum. The objective is to find new ways of converting the dynamics of the spectro-
temporal representation into features that allow for better robustness against the
non-linguistic variation in speech than the existing techniques offer. Several par-
ticular ideas will be studied in the next chapters:
• It was mentioned above that through co-articulation the information about a phoneme
can spread over an interval of up to half a second. But the density of the information along
those 500 ms may not be homogeneous, and the uniform attention paid to the segment
may not be optimal. In the human eye most of the information also comes from the
center. How can more emphasis be put on the central part and less on the boundaries?
Would it help ASR?
Answers to these questions will be sought in Chapter 5, Information in time-
frequency plane.
• Frame-based analysis assumes stationarity within the examined frame. However,
speech is fluent and co-articulated; there is no underlying framing in it. Some special
speech events such as plosives cannot be preserved in the short-term spectrum, which
smooths them out, especially when the phase is not used. The magnitude spectrum
of a sequence and that of the same sequence time-reversed are identical! Is there a way
to extract features which would preserve the temporal properties even within a frame?
And is there a need for preserving such fine details?
These questions will be discussed in Chapter 6 on Linear Predictive Temporal Pat-
terns.
• The modulation spectral domain allows for partial separation of speech from non-
linguistic artifacts, thanks to the limited modulation range of speech. The sepa-
ration can be implemented by temporal filtering of the sub-band energy trajectories.
Replacing a single filter with a bank of multi-resolution filters can represent a frequency
decomposition. Particular impulse responses derived from the Gaussian function seem
to be consistent with the evolving knowledge about the mammalian auditory cortex.
Would it be desirable to combine these thoughts in a new speech representation?
Would it have some advantage over other features?
These ideas will be explored in Chapter 7 on Multi-resolution RASTA filtering.
• Daily experience suggests that not all words in the conversation, but only a few im-
portant ones, need to be accurately recognized for satisfactory speech communication
among human beings. The important key-words are more likely to be rare-occurring
high-information-valued words. Human listeners can identify such words in the con-
versation and possibly devote extra effort to their decoding. On the other hand, in
a typical ASR, the acoustics of frequent words are likely to be better estimated in the
training phase, and the language model is also likely to substitute frequent words for rare
ones. As a consequence, important rare words are less likely to be well recognized.
Keyword spotting bypasses this problem by attempting to find and recognize only
certain words in the utterance while ignoring the rest.
An intermediate product of the typical ASR is the assignment between frames and
classes (see the previous section). In this thesis it will mostly be posterior probabil-
ities of phones estimated by a neural network. Visual inspection of a time sequence
of these phone estimates often gives the underlying word. It suggests that there
could be a way to automatically process the sequence and detect only the words of
interest. This is further discussed in Chapter 8, Extensions.
2.3.2 Thesis outline
The thesis extends the previous studies of feature extraction from spectro-temporal plane
and proposes novel approaches. Since several self-contained topics are studied, an indi-
vidual chapter is devoted to each of them. To be able to evaluate and compare the properties
of the presented front-ends, the evaluation tasks and criteria are defined before the
front-end chapters. The experiments related to the individual approaches are then
always presented within the appropriate chapter.
• Chapter 3 gives a technical background for the subsequent chapters. It surveys the
relevant state-of-the-art and describes the techniques that are used and adopted in
this thesis.
• Chapter 4 introduces the recognition and evaluation framework which is used
throughout the rest of the thesis. It first describes the tools used for feature extrac-
tion and recognition. Subsequently, the evaluation tasks and corpora are presented,
along with typical front-ends chosen as baselines to which the proposed systems will
be compared. For a rough notion of the task complexity, the baseline experimental
results are also given.
• Chapter 5 studies the properties of the adopted approaches for extracting features
from the time-frequency plane. The first goal is to limit the large input space from
where the features are extracted. Further it studies the distribution of the ASR-
relevant information in the time-frequency plane and introduces the idea of time
axis warping. Finally, a small study on target classes for the MLP classifier is given.
It is shown that a temporal context of 1000 ms is large enough to capture all the
useful information. The context can be reduced to 200–400 ms without a
significant drop in performance. Most of the ASR-relevant information seems to
come from the central frames; distant frames can be heavily sub-sampled without
loss in performance, which can reduce TRAP complexity by up to a factor of 5. Finally,
it is shown that replacing phoneme targets in the MLP classifier with phoneme-state
targets (3 states per phoneme) can significantly improve ASR.
• Chapter 6 is devoted to Linear Predictive Temporal Patterns (LP-TRAP) as a means
of presenting the information in the time-frequency plane to the neural net classifier.
LP-TRAP bypasses the common frame-based processing in the front-end and allows
for preserving fine temporal structure of the energy trajectory in frequency sub-
bands. It is done by modeling the energy trajectory by Linear Prediction.
As the novel LP-TRAP features have a number of tunable parameters, these are
optimized first. It is further shown that LP-TRAP features can outperform
conventional approaches. The idea of time-warping is implemented and shown to
improve the performance.
• Chapter 7 presents the Multi-Resolution RASTA Filtering (M-RASTA). The tech-
nique extends earlier works on delta features and RASTA filtering by processing
temporal trajectories by a bank of band-pass filters with varying resolutions. Since
the applied filters have zero-mean impulse responses, the technique is inherently
robust to linear distortions.
The M-RASTA features are shown to outperform the baseline as well as the other approaches
proposed in this thesis. Their resistance to channel noise is illustrated. The M-RASTA
features are shown to be complementary to conventional features; combining M-
RASTA with conventional features yields an additional improvement in performance.
• Chapter 8 views the ASR from the perspective of the keyword spotting. It presents
an alternative approach to the ASR in which each targeted word is classified by
a separate binary classifier against all other sounds. No time alignment is done.
The recognizer is formed by a cascade of two neural network classifiers, the first
estimating phoneme probabilities and the second estimating word probabilities.
On a small-vocabulary task the system does not yet reach the performance of
the state-of-the-art, but its simplicity, the ease of adding new target words, and its
inherent resistance to out-of-vocabulary sounds may prove a significant advantage in
many applications.
• Chapter 9 summarizes the findings from the earlier chapters, draws final conclusions,
summarizes the contribution of the thesis and sketches the future research.
2.3.3 Main goals
The main goals of the thesis can be summarized as follows:
1. To get acquainted with the state-of-the-art in feature extraction from spectral dy-
namics with particular interest in long-term features utilizing neural network classi-
fiers.
2. To study the chosen techniques and to propose improvements with respect to the
robustness against non-linguistic factors and the complexity.
3. To design a new robust speech representation from spectral dynamics using available
MLP and HMM technology. To propose an alternative to the HMM framework for
decoding words.
4. To implement the developed algorithms in standalone tools (C++), to assemble ASR
evaluation tasks, and to integrate the proposed algorithms in these tasks.
5. To verify and optimize the properties of the proposed techniques using evaluation
tasks. To study their strengths and weaknesses, to compare them to the existing
state-of-the-art features and to assess the contribution to the ASR.
Chapter 3
Survey: From short term
spectrum to spectral dynamics
This chapter surveys recent knowledge about spectral dynamics and its
relevance to state-of-the-art ASR. Of particular interest are techniques for feature extrac-
tion which consider a long temporal context or otherwise relate to the field. The ideas and
approaches adopted in this work are described in detail.
3.1 Obtaining auditory spectrogram
Most of today’s features are derived from some form of spectral representation of speech,
which was originally motivated by the human sound perception.
3.1.1 Frequency resolution of speech spectrum
Harvey Fletcher’s HSR simultaneous masking experiments revealed critical-band frequency
channels, and he suggested that humans use these channels independently of one an-
other [36]. The bandwidth of these channels was found to be roughly one octave. The
logarithmic-like behavior of human perception led to the currently popular melodic
and Bark scale frequency warpings [74, 47]. Some studies report about 20 frequency chan-
nels to be suitable for modeling the perception [8]. By applying these filter banks to
speech, the frequency resolution gets effectively reduced. However, it was shown that such
reduction does not affect speech intelligibility. Recent studies suggest that even less than
10 frequency channels can still preserve the intelligibility [44].
Similar observations were noted in ASR: Burget and Hermansky [21] inspected base
vectors of Linear Discriminant Analysis (LDA) which was applied along the bins of linear
short-term spectrum. The resulting data-driven bases exhibit critical-band-like frequency
resolution. This was also confirmed in [72]. The need for higher resolution at low frequencies,
decreasing approximately logarithmically towards higher frequencies, was also confirmed
by Umesh et al. in a search for a frequency warping that would minimize the differences
between speakers [95].
3.1.2 Bank of filters in time domain
Thanks to the Cooley–Tukey FFT algorithm, the auditory spectrum is typ-
ically obtained from the short-term FFT by applying a bank of filters in the frequency
domain. However, the sub-band energies can also be obtained by emulating analog band-
pass filters in the time domain, which can enable better time resolution and get closer to
human perception. This idea leads to the LP-TRAP features presented in Chapter 6.
Motlíček [81, 80] applies a set of auditory-like Gammatone filters directly to the speech
in the time domain. The band-pass signals are demodulated and low-pass filtered to yield the
energy envelope trajectories, which are subsequently sampled at the required frame rate.
Tyagi et al. [93] calculate their so-called fepstrum in a similar manner, using a time-domain
bank of band-pass filters that are linearly spaced and have rectangular frequency
responses. They subsequently take the logarithm of every band’s output magnitude and
then down-sample this “log-energy” trajectory. Finally, they apply a DCT along each tra-
jectory, which yields the fepstrum. Athineos [14] attempts to derive the true sub-band
energies from the Hilbert envelope of the signal. Being close to this approach, Dimitriadis
et al. [32] approximate the sub-band energies by the Teager-Kaiser operator which is less
computationally expensive.
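The common thread of these time-domain schemes, band-pass filtering followed by envelope extraction, can be sketched as follows; an ideal FFT-domain mask stands in here for a Gammatone or rectangular filter, and the Hilbert envelope is taken as the magnitude of the analytic signal.

```python
import numpy as np

def band_envelope(x, fs, f_lo, f_hi):
    """Sub-band energy envelope in the time domain: band-pass the
    signal (ideal FFT-domain mask, a stand-in for a real filter),
    then take the magnitude of the analytic signal."""
    N = len(x)
    X = np.fft.fft(x)
    f = np.fft.fftfreq(N, d=1.0 / fs)
    # keep only positive frequencies in [f_lo, f_hi]; doubling them
    # directly yields the one-sided (analytic) signal
    mask = (f >= f_lo) & (f <= f_hi)
    analytic = np.fft.ifft(2 * X * mask)
    return np.abs(analytic)

fs = 1000
t = np.arange(fs) / fs
# 100 Hz carrier, amplitude-modulated at a syllable-like 4 Hz rate
x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 100 * t)
env = band_envelope(x, fs, 80, 120)   # recovers the 4 Hz modulation
```

With integer-frequency components over exactly one second, the recovered envelope matches the modulator 1 + 0.5 sin(2π·4t) to numerical precision.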
3.2 Spectrogram filtering
The time-frequency plane can be filtered along its time or frequency dimension, or both
dimensions (2-D filtering). The filtering along frequency comprises e.g. auditory filter
banks or DCT in MFCC computation. However, more important for this work is the
temporal filtering and 2-D filtering, which forms the core of the Multi-Resolution RASTA
approach presented in Chapter 7.
3.2.1 Temporal filtering and modulation spectrum
Filtering temporal trajectories of sub-band energies (or of features in general, as cepstrum is
a linear transform of log-energies) has been widely used in ASR. Virtually any processing
of the feature sequence can be viewed as filtering. Furui’s well-known delta features [38]
apply 1st and 2nd order derivatives to the sequence of cepstral coefficients. Temporal filtering
can also be used to suppress linear distortions by removing the steady component from the
sequence, as implemented in the Cepstral Mean Subtraction technique [12].
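A minimal sketch of Cepstral Mean Subtraction shows why a stationary convolutive channel cancels: the channel adds the same constant to every frame's cepstrum, so subtracting the per-coefficient mean removes it exactly.

```python
import numpy as np

def cms(cepstra):
    """Cepstral Mean Subtraction: remove the long-term average of each
    cepstral coefficient's time trajectory. A stationary linear channel
    adds a constant to the log-spectrum (hence to the cepstrum), so
    subtracting the mean cancels it. A minimal sketch."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A constant "channel" offset added to every frame disappears:
rng = np.random.default_rng(3)
clean = rng.standard_normal((100, 13))       # 100 frames of cepstra
channel = rng.standard_normal(13)            # convolutive distortion
distorted = clean + channel
residual = cms(distorted) - cms(clean)       # ~zero everywhere
```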
Generally, the time trajectory of a feature can be seen as a time signal with certain
frequency properties. It contains slow as well as fast components, representing modula-
tions. Temporal filtering can thus be seen as an operation in the modulation-frequency
domain. Assuming that 1) the communication channel is stationary and 2) the spectral
changes are the carrier of the message in speech, it follows that in the modulation-frequency
domain the channel noise can be separated from speech. Hermansky and Morgan [52] filtered out
non-speech artifacts and linear distortions from speech by band-passing the trajectories
of compressed critical-band energies with temporal RASTA filter. Arai et al. [9, 10, 11]
explored what modulations are needed for speech and speaker recognition as performed
by humans, by band-pass filtering time-trajectories of LPC or MFC coefficients and play-
ing the re-synthesized speech to human listeners. They found that modulations between
1.5–16 Hz are important and other components can be filtered out without a loss of in-
telligibility. Kanedera et al. [63, 62] similarly filtered sub-band energy trajectories and
evaluated the performance of the ASR. Consistently with Arai, they reported the 2–8 Hz
band, with a peak at 4 Hz (corresponding to the average syllable rate), to be crucial.
Avendano et al. [17] and van Vuuren [98] apply LDA along TRAPs to derive impulse re-
sponses of RASTA filters. Another example of a successful time-filtering operation is the
DCT or PCA transform applied to TRAPs in order to decorrelate the features and reduce
their dimension [88].
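As an illustration of the modulation-frequency view, a sub-band energy trajectory can be band-passed with a zero-phase FFT-domain filter; the 1.5–16 Hz pass-band follows Arai's finding, and the 100 Hz frame rate (10 ms step) is an assumed convention:

```python
import numpy as np

def modulation_bandpass(traj, frame_rate=100.0, lo=1.5, hi=16.0):
    """Keep only modulation frequencies in [lo, hi] Hz of a feature trajectory."""
    spec = np.fft.rfft(traj)
    freqs = np.fft.rfftfreq(len(traj), d=1.0 / frame_rate)
    spec[(freqs < lo) | (freqs > hi)] = 0.0      # brick-wall band-pass
    return np.fft.irfft(spec, n=len(traj))

t = np.arange(500) / 100.0                       # 5 s of frames at 100 Hz
traj = 2.0 + np.sin(2*np.pi*4*t) + np.sin(2*np.pi*30*t)  # DC + syllable-rate + fast
filtered = modulation_bandpass(traj)
# the 4 Hz (syllable-rate) component survives; the DC offset and 30 Hz are removed
```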
3.2.2 2-D Filtering
Any combination of linear frequency and temporal filtering can be interpreted as a 2-D
filtering of the time-frequency plane, for example MFCC features (DCT across frequency)
followed by RASTA filtering (filtering across time). However, 2-D filtering does not have
only this “artificial” interpretation. Recent work revealed that ASR can benefit from
explicit 2-D filtering emulating some properties of the mammalian auditory cortex [30, 31]:
Kleinschmidt, Gelbart and Meyer [66] applied a variety of 2-D Gabor filters to the
auditory spectrogram and used a data-driven feature selection method to find optimized
feature sets. Their robustness was shown on Aurora 2 & 3 tasks. They also report
that over 40% of the automatically selected features exhibit diagonal characteristics [75].
This favors the TANDEM approach over TRAP, as TRAP is not designed to preserve inter-
band relations (refer to section 3.4 for more details). Nevertheless, 2-D filtering can equip
TRAP with such relations: Grežl et al. [45] apply 3 bands × 3 frames operators to the auditory
spectrogram and improve features for TRAP MLPs and LVCSR [46, 64]. Kajarekar et al.
[60] and Valente and Hermansky [96] use LDA to derive similar 2-D discriminants.
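A single spectro-temporal Gabor filter of the kind used by Kleinschmidt et al. can be sketched as follows; all kernel sizes and modulation frequencies are illustrative, not the optimized values from [66]:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(size_t=11, size_f=9, omega_t=0.4, omega_f=0.5, sigma_t=3.0, sigma_f=2.0):
    """Real 2-D Gabor kernel: a cosine plane wave under a Gaussian envelope.
    Nonzero omega_t AND omega_f give the diagonal (spectro-temporal)
    characteristics mentioned above."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    T, F = np.meshgrid(t, f, indexing='ij')
    envelope = np.exp(-(T**2) / (2 * sigma_t**2) - (F**2) / (2 * sigma_f**2))
    carrier = np.cos(omega_t * T + omega_f * F)
    return envelope * carrier

spectrogram = np.random.rand(200, 15)                # 200 frames x 15 critical bands
g = gabor_2d()
response = convolve2d(spectrogram, g, mode='same')   # one 2-D filtered feature map
```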
3.3 Parametric spectrogram representation
Some authors have attempted to use a parametric model for spectrogram representation.
Motlíček et al. [81, 80] derived TRAPs using temporal Gammatone filters and applied an
LP model directly to the TRAPs. The fact that this approach was not successful revealed
that the phase information in a TRAP is necessary for ASR: the phase information gets
discarded in the LP model, as it is derived from the autocorrelation1.
Athineos et al. [15] used LP modeling for another sort of TRAPs, the LP-TRAPs. How-
ever, in contrast to the previous approach, they applied the LP in the frequency domain, hence
fitting the model to the shape of the LP-TRAP itself and not to its spectrum. Since here the
phase information was preserved, the approach was successful. This approach was further
developed in this work and will be described in full detail in Chapter 6. Athineos [16]
later extended the use of LP model to 2-D spectral modeling.
1 More precisely, since LP analysis minimizes a certain measure of distance between the modeled subject
and the fit in the power spectral domain, the modulus of the sub-band modulation spectrum was preserved, yet
its phase was discarded, which in turn corrupted the temporal properties of the TRAP.
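The frequency-domain trick behind LP-TRAPs can be sketched in a few lines: run LP (Levinson-Durbin on the autocorrelation) on the DCT of the signal instead of on the signal itself, so that the all-pole model fits the temporal (Hilbert) envelope rather than the power spectrum. This is only an illustrative sketch of the FDLP idea elaborated in Chapter 6, with an arbitrary segment length and model order:

```python
import numpy as np
from scipy.fft import dct

def levinson(r, order):
    """Levinson-Durbin recursion: LP coefficients a (a[0] = 1) from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err                        # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

x = np.random.randn(800)                              # one sub-band signal segment
c = dct(x, type=2, norm='ortho')                      # project to the frequency domain
r = np.correlate(c, c, mode='full')[len(c) - 1:]      # autocorrelation of DCT coeffs
a = levinson(r, 20)                                   # all-pole model of order 20
env = 1.0 / np.abs(np.fft.rfft(a, n=2 * len(x)))**2   # modeled (squared) temporal envelope
```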
16 3 Survey: From short term spectrum to spectral dynamics
3.4 Parametrizing spectrogram by probabilistic features
Whatever the way from speech to spectrogram has been, the main question remains how
to convert the spectrogram into features useful for ASR. Probabilistic features can serve this
purpose.
Probabilistic features are those which, on their way to the final feature vector, pass
through posterior probabilities of some classes, mostly phonemes. A widely used
framework for obtaining estimates of posterior probabilities is the Multi-Layer Perceptron
(MLP) neural network, owing to its simplicity and solid mathematical background. As MLPs
are frequently used throughout the thesis, they are introduced in detail in section 4.2.1. For
now, let us picture the MLP as a box being able to learn any non-linear mapping function
between its input and output vectors from the labeled training data. The training data
are presented to the MLP in the form of input vectors associated with a particular output
class. The trained MLP provides an estimate of posterior probabilities of all classes given
an input sample. Among other advantages, MLPs can reduce the feature dimension
and can be conveniently integrated into an existing ASR framework.
3.4.1 Hybrid and TANDEM architectures
Both the Hybrid and the TANDEM architecture use an MLP to project a certain portion
of the speech spectrogram onto phoneme classes.
The earlier approach, Hybrid, was proposed by Morgan and Bourlard [77]. The con-
ventional GMM/HMM system does the acoustic modeling by means of Gaussian mixtures,
whose parameters are trained by maximizing the likelihood of the observed data given the
models (ML training) [85]. In the Hybrid approach, the GMMs are replaced by an MLP,
which is trained discriminatively and can also represent the emission probabilities needed
for HMM decoding. On the one hand, the Hybrid approach simplifies the training, as the
full expectation-maximization algorithm is not needed; on the other hand, some of the
mature, powerful techniques available in GMM/HMM systems are lost, such as tied
context-dependent triphones or
model adaptation based on Maximum Likelihood Linear Regression (MLLR) [40].
The TANDEM architecture was proposed later by Hermansky et al. [48]. TANDEM by-
passes the drawbacks of both GMMs and MLPs by combining the two techniques and benefits
from the best of both. The class posteriors are again estimated by MLPs; however, instead
of being directly used as emission probabilities, they are fed to the GMM/HMM system
as features. The purpose of log and KLT transforms is to modify the feature distribution
and to decorrelate the features, so that they can be better modeled by GMMs with di-
agonal covariance matrix. TANDEM usually outperforms Hybrid, especially under noise
conditions [48], which comes at the price of more complex training and decoding.
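The log and KLT post-processing can be sketched directly: take the logarithm of the MLP posteriors and decorrelate them with the eigenvectors of their covariance matrix (the KLT basis is estimated here from the same data; the 25-dimensional cut and the 29-phoneme set are assumed values):

```python
import numpy as np

def tandem_postprocess(post, keep=25, eps=1e-10):
    """Gaussianize (log) and decorrelate (KLT) MLP posteriors for a GMM/HMM."""
    logp = np.log(post + eps)                  # compress the skewed posterior values
    logp -= logp.mean(axis=0, keepdims=True)
    cov = np.cov(logp, rowvar=False)
    w, v = np.linalg.eigh(cov)                 # KLT basis = covariance eigenvectors
    order = np.argsort(w)[::-1]                # sort by decreasing variance
    return logp @ v[:, order[:keep]]           # keep the top 'keep' components

posteriors = np.random.dirichlet(np.ones(29), size=1000)  # 1000 frames, 29 phonemes
features = tandem_postprocess(posteriors)                 # 1000 x 25 TANDEM features
```

After the projection the feature covariance is diagonal, which is exactly what a GMM with a diagonal covariance matrix assumes.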
In both architectures the MLP makes it easy to incorporate temporal context in the fea-
tures. The MLP is trained on 9 consecutive frames of PLP coefficients and estimates
phoneme posteriors, hence using a mid-term temporal context of about 100 ms. The
architectures are pictured in Figs. 3.1 and 3.2.
3.4.2 TRAP architecture
TRAPs feature in a substantial part of this work, therefore they will be discussed in more
detail.
Figure 3.1: Scheme of Hybrid MLP/HMM system.
Figure 3.2: Scheme of TANDEM system combining MLP and GMM/HMM.
Conventional short-term systems are built upon the idea of spectral patterns as the
fundamental discriminative features for ASR. Decades have proved this approach right;
however, short-term features are very sensitive to non-linguistic variability. Inspired by
Fletcher’s pioneering work [36], researchers started to think of processing the speech in
independent sub-bands [19]. Instead of slicing the spectrogram across frequency, they cut
across time, thus replacing the frequency context by temporal context.
After some initial trials [92], the number of frequency sub-bands stabilized at approxi-
mately critical-band resolution and the temporal context has widened up to 1 second. The
primary source of information was the log-critical-band spectrogram, obtained as an inter-
mediate product in PLP calculation [47]. 1000 ms long trajectories of the CRitical-Band
log-Energies (CRBE) were examined by Hermansky and Sharma [53, 54]. They calcu-
lated CRBEs from phonetically labeled corpora and assigned a phoneme label to each
frame. Subsequently, they concatenated 50 consecutive frames before and after the cur-
rent frame, thus forming 101-frames long TempoRAl Pattern (TRAP) of log-energies for
every sub-band and every frame. They calculated so called Mean TRAPs of all phonemes
by averaging all TRAPs assigned the particular phoneme label2. The study of Mean TRAPs
suggested using these prototype patterns for sub-band phonetic classification. They used
a simple distance measure to compare TRAPs of an unknown speech to Mean TRAPs
of every phoneme, and trained an MLP to estimate phoneme probabilities given these 435
distances (15 bands × 29 phonemes). Use of these posteriors in a Hybrid system did not
outperform PLPs. Even an agglomerative clustering of the Mean TRAPs into five “Broad
TRAPs”, although matching real phonetic categories nicely, did not improve ASR [53].
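The Mean TRAP computation can be sketched as follows, assuming frame-level phoneme labels are available (the sizes follow the description above; the data here are random stand-ins):

```python
import numpy as np

def mean_traps(crbe, labels, context=50, n_classes=29):
    """Average 101-frame log-energy trajectories per band, grouped by the
    phoneme label of the centre frame (Mean TRAPs of Hermansky and Sharma)."""
    T, n_bands = crbe.shape
    sums = np.zeros((n_classes, n_bands, 2 * context + 1))
    counts = np.zeros(n_classes, dtype=int)
    for t in range(context, T - context):
        ph = labels[t]
        sums[ph] += crbe[t - context: t + context + 1].T   # bands x 101 frames
        counts[ph] += 1
    counts = np.maximum(counts, 1)                          # avoid divide-by-zero
    return sums / counts[:, None, None]

crbe = np.random.randn(2000, 15)                 # critical-band log-energies
labels = np.random.randint(0, 29, size=2000)     # frame-level phoneme ids
mt = mean_traps(crbe, labels)                    # 29 phonemes x 15 bands x 101 frames
```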
Better success was reported when a non-linear neural classifier was used even for the band-
specific classification (Neural TRAP). Discriminatively trained MLPs are better able to
distinguish among classes, especially at the class boundaries, as discussed in [48].
2This illustrative experiment will be reproduced in section 5.1.1.
Figure 3.3: Scheme of TRAP system.
The general scheme of Neural TRAP (further referred to as TRAP) is pictured in
Fig. 3.3. The band-specific (or band-conditioned) MLPs as well as the merger are trained
on phoneme targets. On a small-vocabulary task, TRAP reaches performance competitive
with PLP-TANDEM; when both probabilistic features are combined by averaging in the log
domain, the result outperforms either feature set alone [53].
3.4.3 Further development of TRAP
Since Sharma and Hermansky successfully developed the Neural TRAP, more effort has
been put into this architecture. The first line of work improved the input to TRAP, i.e.
the way of obtaining the spectrogram, its filtering and various projections, as reviewed in
sections 3.1.2 and 3.2. On top of TRAP, mean and variance normalizations were shown
to improve the performance, especially under noise conditions [54, 64].
Second, the architecture itself has evolved. Much of the improvement came from
International Computer Science Institute in Berkeley, Oregon Graduate Institute and Brno
University of Technology.
Chen and Zhu proposed two major modifications, Hidden Activation TrapS (HATS)
and Tonotopic Multi-Layered Perceptron (TMLP) [101, 24, 25, 23]. HATS trains the merger
MLP using the outputs of the hidden units in the band MLPs. Hence, the merger is trained
not on posterior probabilities, but on matched filters for the basic patterns appearing in
TRAPs. The authors report that only about 20 patterns per band need to be modeled. TMLP
replaces the hierarchy of MLPs with a single network with two hidden layers, where
the units in the first, tonotopic layer discriminate patterns in every band independently
and the outputs are merged in the second, fully connected layer. The reported
improvements on LVCSR tasks are significant [102, 103].
Schwarz et al. [87] utilized TRAPs for phoneme recognition. They deal with an insuffi-
cient amount of training data by replacing the band MLPs with pre-windowed linear oper-
ations (PCA, DCT) and by splitting the temporal context into left and right parts, which
they report to require fewer training examples [88]. Recently they proposed to split not
only the temporal context but also the frequency context, and experimented with hierarchical
structures of MLPs [89].
3.4.4 Multi-stream systems
A substantial improvement in features for ASR was reached after the complementarity
of long-term and short-term features was revealed. When tuning their systems for the
best possible accuracy on a given task, researchers experimented with many ways
of combining multiple data streams.
Probabilistic features can be easily combined. If multiple ANNs (experts) are
trained on the same data set with the same targets but with different representations
(e.g. TRAP and TANDEM), one can simply average their posteriors to get an improved
estimate [59]. However, this places certain assumptions on the experts, which are not
always met; therefore a number of alternatives have been proposed [83, 78, 56, 76].
Today’s state-of-the-art systems using probabilistic features typically append some
form of long-term features to short-term features and further tune the system by using
transformations for decorrelation and dimensionality reduction [20, 101]. As an example,
the so called Combined-augmented features used by ICSI Speech group [102] consist of
39 short-term features obtained from HLDA-transformed3 PLP+∆+∆∆+∆∆∆ features,
which are mean- and variance-normalized over a conversation side, appended with 25 KLT-
decorrelated probabilistic features obtained from an inverse-entropy combination of TRAP
and PLP-TANDEM phoneme posteriors.
3HLDA stands for Heteroscedastic Linear Discriminant Analysis.
Chapter 4
Recognition and evaluation
framework
This chapter introduces the recognition and evaluation framework which is used through-
out the rest of the thesis. Habitually, the experimental framework would be introduced
in the experimental part, after the theory has been presented. This thesis, however, deals
with multiple topics which all share the same evaluation tasks, so that the results are
directly comparable; the evaluation framework and the baselines therefore have to be
introduced before the subsequent chapters.
4.1 Software tools
First, the software tools used for the experiments will be introduced. Some of them
were implemented by the author and made publicly available on the internet; others
were obtained from other research sites.
HTK
The Hidden Markov Model Toolkit (HTK) is a widely used open-source toolkit for building
and manipulating hidden Markov models [100]. It will be used for training acoustic models
and will act as the back-end in all HMM-based recognizers.
QuickNet
QuickNet is a suite of software that facilitates the use of multi-layer perceptrons (MLPs)
in statistical pattern recognition systems. It contains a program for efficiently training
MLPs, a program for using MLPs to do pattern recognition and a library of C++ objects
that handle MLP training and perform I/O operations with various file formats. QuickNet
was developed in the Speech Group at the International Computer Science Institute by
David Johnson and it is an open-source project. Further information about QuickNet can
be found on the internet [4].
Auxiliary tools for QuickNet from SPRACHcore project
SPRACHcore is a software release of the neural network speech recognition tools developed
by ICSI plus other partners and it is an open source project. The downloadable package
contains tools for neural net training and recognition, feature calculation, and sound file
manipulation. The tools are maintained by Dan Ellis and Chuck Wooters. More on
SPRACHcore is available on the internet [5]. For the purposes of this thesis the following
utilities were used:
feacalc – A feature calculation program.
feacat – A utility for conversion and trimming of data files.
pfile utils – Specialized programs to transform data in pfile format.
CtuCopy
CtuCopy is an efficient command line tool written in C++ implementing speech enhance-
ment and feature extraction algorithms. It is similar to HCopy from HTK Toolkit and
it also supports HTK file format. It uses fftw library for fast DFT calculations [1]. It
is an open-source program developed by the author and it is available under the GNU
license on the web site of the Speech Processing Group at CTU Prague [6]. Originally,
CtuCopy was developed during the author’s master’s studies to enable efficient use of speech
enhancement techniques in cascade with feature extraction. When these two blocks are im-
plemented in one tool, there is no longer any need to reconstruct the speech prior to the feature
extraction, because the spectrum obtained from the speech enhancement techniques can
be passed directly to the filter bank of the front-end. Such an all-in-one tool offers more
efficiency and flexibility in the DSP algorithms than if several single-purpose tools were used.
Later, CtuCopy was extended with further capabilities which, to the author’s knowledge,
are not available in other open-source tools.
Basic function: CtuCopy acts as a filter with speech waveform file(s) at the input, and
either a speech waveform file(s) or feature file(s) at the output, see Fig. 4.1. Several
input formats are supported: MS Wave, raw file with PCM data, A-law data or
mu-law data in both byte orders or the on-line input.
Preprocessing: The preprocessing block segments and windows the signal and can op-
tionally apply preemphasis, add dither and remove a possible DC offset.
Speech enhancement: CtuCopy implements several speech enhancing methods based
on spectral subtraction. Extended Spectral Subtraction - exten [90] combines Wiener
filtering and spectral subtraction with no need of a voice activity detector (VAD).
Other methods are based on spectral subtraction with VAD which can be either
external (data are being read from a file) or from the internal cepstral detector. The
noise suppression methods can be applied either directly to speech spectra or to the
auditory spectra obtained with a bank of filters. The enhanced speech can either
be reconstructed and written to the output file or passed to the feature extraction
process. Channel noise can be suppressed with general RASTA filtering either with
the original filter published by Hermansky [52] or with arbitrary impulse responses
specified by the external file.
Figure 4.1: Block scheme of the universal feature extractor and speech enhancer CtuCopy.
Filter banks: The bank of filters offers a set of frequency scales (melodic, Bark, expolog,
linear) and three filter shapes (triangular, rectangular, trapezoidal) which can be
arbitrarily combined to form standard or user-defined filter banks.
Feature types, file formats: In feature extraction mode a number of common features
can be extracted from either original or enhanced speech, e.g. Linear Predictive
Coefficients (LPC) or cepstral coefficients (LPCC), Perceptual Linear Predictive
cepstral coefficients (PLP), Mel-scale Cepstral Coefficients (MFCC), and magni-
tude/logarithmic spectra. Features can be saved either in HTK format or, thanks
to a C class provided by Petr Schwarz from Brno University of Technology, also in pfile
format.
“fdlp” tool
This command line tool written in C++ implements LP-TRAP feature extraction using
Frequency Domain Linear Prediction (FDLP), as will be discussed in Chapter 6. The
program acts as a filter with speech waveform file(s) at the input and a data file with
LP-TRAP features in pfile format at the output. Conceptually it is similar to CtuCopy
and it also uses the fftw library for fast DCT and DFT calculations. It was developed by the
author. Since the documentation is incomplete, fdlp has not been made publicly available.
Trapper
A tool from Brno University of Technology which, given a data file with critical band
energies computed from the speech by feacalc or CtuCopy, forms a data file with TRAP
features that are being used by QuickNet. Various transforms and normalizations can
be applied to the TRAPs. Trapper is an internal tool kindly provided to the author by
Speech Processing Group at Brno University of Technology [7].
Miscellaneous
Other one-purpose tools implemented by the author in C++ related to this thesis:
pfile gauss – 2-dimensional filtering of the TRAP features. The filtering over time uses
Gaussian derivatives as impulse responses and the filtering over frequency approx-
imates the first and second order differences between neighboring sub-bands. The
result is the M-RASTA features discussed in Chapter 7.
pfile shuffle – Reshuffles the frames of the input feature file in a random order and stores
the result in the output file. It is used for preprocessing the neural net training data
to optimize the training.
pfile warp – Warps temporal axis in TRAP feature file using a symmetrical exponential
function by resampling the TRAP trajectory. The function can either expand the
central part of the TRAP and compress the boundaries or vice versa. The number of output
samples per TRAP determines whether the trajectory gets rather stretched or shrunk.
More will be explained in Chapter 4.
trap shapes – A tool for computing Mean TRAP as introduced by Hermansky and
Sharma [53] using multiple approaches. It produces mean TRAPs plus their vari-
ances and occurrence counts for every class. It can also compute class confusion
matrices since the inputs to the program are both the reference labeling and neural
net outputs.
hilbert gauss – A mix of pfile gauss and fdlp. Similarly to pfile gauss, this tool
filters the auditory spectrogram with 2-D filters. However, the input spectrogram is
not obtained with the short-term spectral analysis as in TRAP, but using sub-band
Hilbert envelopes as in LP-TRAP.
etrap – The Energy-TRAP feature extractor. It extracts features from the energy contour
of the speech. The energy is obtained with Hilbert envelope and is parametrized
with LP cepstrum. It is a special case of LP-TRAP cepstral features for the whole
frequency band. The features can complement conventional features.
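The warping performed by pfile warp can be sketched by resampling a TRAP at non-uniformly spaced points. A power function is used below as a stand-in for the symmetric exponential warp of the actual tool, with an illustrative exponent and output length:

```python
import numpy as np

def warp_trap(trap, n_out=51, alpha=2.0):
    """Resample a symmetric TRAP so that the centre is sampled densely and the
    boundaries sparsely (alpha > 1 expands the centre; alpha < 1 the borders)."""
    n_in = len(trap)
    u = np.linspace(-1.0, 1.0, n_out)              # uniform output grid
    warped = np.sign(u) * (np.abs(u) ** alpha)     # symmetric warp of the axis
    src = (warped + 1.0) / 2.0 * (n_in - 1)        # map to input sample indices
    return np.interp(src, np.arange(n_in), trap)   # resample the trajectory

trap = np.sin(np.linspace(0, np.pi, 101))          # a 101-frame trajectory
short = warp_trap(trap)                            # 51 samples, centre-dense
```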
4.2 Recognizer architecture
Two evaluation tasks will be introduced in this chapter. The general structure of both
systems is similar and resembles the typical recognizer from Chapter 1. There is a front-end
and a back-end. The front-end computes features which are used by the pattern-matching
and decoding mechanisms in the back-end. The back-end is a GMM/HMM system with
models of phones, either context-independent (CI) or context-dependent (CD). Let us now
look very briefly at the structure of the front-end, as it is needed for defining the evaluation
criteria.
The front-end can represent common short-term features such as PLP or MFCC, which
are obtained from the speech using the above introduced tools HCopy from HTK, feacalc
or CtuCopy. These features will be used as a baseline. Long-term features which are
studied in this thesis are derived essentially in two steps, see Fig. 4.2: First, the speech
is converted into some spectral representation specific to the method, which is typically
high-dimensional. The dimensionality then needs to be reduced; this is the purpose of the
neural network, which performs a non-linear projection and dimensionality reduction. It
also acts as the phoneme probability estimator, which is important for the Frame Error
Rate evaluation criterion defined later. Before the phoneme posteriors obtained from the
neural network can be passed to the HTK back-end, they need to be “gaussianized” and
decorrelated. This is approximated by logarithm and Karhunen-Loeve Transform (KLT)
in feacat and pfile klt tools, respectively.
Figure 4.2: General scheme of front-end for long-term features.
4.2.1 Multi Layer Perceptron as posterior probability estimator
Since the MLP neural network is a workhorse used throughout this thesis, some details about
its background are given in this section. The text does not intend to thoroughly cover the
background of artificial neural networks, which can be found e.g. in [55]. It rather serves
as an extension of section 3.4, which introduced probabilistic features.
In this work, the MLP is used as a statistical classifier estimating the probability that
a given acoustic observation corresponds to a particular class (phoneme). The MLP is
used for phoneme classification for several reasons:
1. MLP is a discriminative classifier, as opposed to generative models such as HMMs,
hence it can provide an estimate of posterior probabilities of the classes.
2. With only one hidden layer, MLP is able to learn any non-linear mapping function
between input and output from the data, provided that it has enough hidden units
and enough training data [67]. Hence, no assumptions about the distribution of the
input data are required.
3. MLP is simple and has a solid mathematical background in the sense that there
exists a robust gradient-descent training algorithm – Error Back-Propagation [86].
4. MLP can largely reduce the redundancy and dimensionality of data, since its output
is typically smaller than its input.
5. QuickNet software provides efficient algorithms for MLP training and forward-
passing.
Structure of MLP
This work utilizes only one ANN architecture, the MLP with three layers of neurons, see
Fig. 4.3. Between neighboring layers all neurons are fully connected, and the signal flows
in one direction only, i.e. there are no feedback loops. Such a structure is called a feed-forward
MLP.
Figure 4.3: Scheme of three-layer MLP and neuron.
Neurons in the first layer only distribute the input to the subsequent layer; they do not
implement any calculation. The neurons in the second and third layers implement the
general function
g(x) = φ( b + Σ_{i=1}^{m} w_i x_i ) = φ(z),    (4.1)
where x_i are the neuron inputs, w_i are the weights, b is the bias and φ(z) is a non-linear
activation function of one variable z (z is the weighted linear sum of the x_i plus the bias). In
particular, the second-layer neurons use sigmoid activation function, which maps all real
numbers to the open interval (0, 1),

φ_2(z) = 1 / (1 + e^{−z}).    (4.2)
The third-layer neurons use the softmax activation function, which ensures that the outputs
of all C neurons in the output layer act like probabilities, i.e. their values lie between 0 and
1 and they sum up to one:

φ_{3,k}(z) = prob_k = e^{z_k} / Σ_{j=1}^{C} e^{z_j},    (4.3)
where k = 1 . . . C are indices of output neurons.
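Equations (4.1)–(4.3) amount to the following forward pass. The 351-input, 29-phoneme sizes follow the setting used elsewhere in this chapter, while the hidden-layer size and the weights are random stand-ins for a trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # eq. (4.2): maps R into (0, 1)

def softmax(z):
    e = np.exp(z - z.max())                   # shift for numerical stability
    return e / e.sum()                        # eq. (4.3): outputs sum to one

def mlp_forward(x, W1, b1, W2, b2):
    """Three-layer feed-forward MLP: input -> sigmoid hidden -> softmax output."""
    h = sigmoid(W1 @ x + b1)                  # hidden layer, eq. (4.1) with phi_2
    return softmax(W2 @ h + b2)               # output layer, eq. (4.1) with phi_3

rng = np.random.default_rng(0)
x = rng.standard_normal(351)                  # e.g. 9 frames x 39 PLP features
W1, b1 = rng.standard_normal((200, 351)), rng.standard_normal(200)
W2, b2 = rng.standard_normal((29, 200)), rng.standard_normal(29)
post = mlp_forward(x, W1, b1, W2, b2)         # posterior estimates for 29 phonemes
```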
Training procedure
The weights and biases of the MLPs are subject to iterative gradient-descent training.
It is done in an on-line supervised manner, in which a large number of input–output pairs
are repeatedly presented to the network. The training minimizes a cross-entropy error
criterion between the network output and the desired output [42]. One cycle of presenting
all the data to the network is called an epoch.
During the training the MLP is learning the mapping function between its input and
output. It is desirable that the MLP captures only the gross trend and not the details
which are due to the variability in the data. If the MLP were over-trained, it would lose its
generalizing property and could not predict the output for unseen data. To prevent
this situation, the MLP is evaluated on independent data (the cross-validation set,
CV) after each training epoch. Usually 90% of the available data are used for training
and 10% for cross-validation. The performance is measured by the Frame Error Rate (FER),
which will be introduced in the next section. The early-stopping training strategy [2] is
used:
1. Start the training with a learning rate of 0.008.
Learning rate value determines the speed of training. It specifies to what extent the
weights and biases can change between epochs.
2. Repeat the training until the FER on the CV data improves over the previous
training epoch by less than 0.5%.
From then on, the learning rate is halved before each epoch, which increases the precision
at the local optimum.
3. Continue training, but halve the learning rate in every epoch.
Halving the learning rate initially boosts the FER improvements; eventually, however,
the learning rate becomes so small that the improvement is minimal. When the FER
again improves by less than 0.5%, the training is stopped.
It typically takes 7 – 10 epochs to train an MLP.
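The three steps above can be sketched as a small driver loop; train_epoch is a hypothetical callback that runs one epoch at the given learning rate and returns the CV frame error rate:

```python
def early_stopping_schedule(train_epoch, initial_lr=0.008, threshold=0.5):
    """Early-stopping schedule: fixed rate until the CV improvement drops below
    `threshold` (% absolute), then halve the rate before every epoch until the
    improvement drops below the threshold again."""
    lr, prev_fer, halving = initial_lr, None, False
    history = []
    while True:                                        # train_epoch must plateau
        fer = train_epoch(lr)                          # one pass over the data
        history.append((lr, fer))
        gain = None if prev_fer is None else prev_fer - fer
        if gain is not None and gain < threshold:
            if halving:
                break                                  # second small gain: stop
            halving = True                             # first small gain: start halving
        if halving:
            lr /= 2.0
        prev_fer = fer
    return history

# toy FER curve: improves quickly, then saturates
fers = iter([50.0, 45.0, 41.0, 40.7, 40.5, 40.45, 40.44])
hist = early_stopping_schedule(lambda lr: next(fers))
```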
Features and targets for MLP
In this work, the MLP is mostly used to project input vectors with hundreds of features to
posterior probabilities of tens of phonemes. This is done every 10 ms.
A large amount of phonetically labeled data is used for training. The labels can
either be obtained manually, which is expensive, or by automatic alignment using HMMs,
which gives slightly worse results but is unavoidable for huge corpora [37]. The training data
contain input–output pairs of feature vectors and associated phoneme targets. As there
is one output neuron for every class, the training targets are coded in the 1-out-of-C
manner, representing ideal posterior probabilities. The training targets are hard, which
means that if a frame belongs to a certain phoneme, the target assigned to that phoneme
is set to 1 and all other targets are set to 0.
Though MLPs in the TRAP and TANDEM architectures typically use phonemes as
targets, since they are easy to obtain, it is questionable whether these classes are optimal. Ap-
pendices C and D deal with this question in detail.
4.3 Evaluation criteria
As the objective of this thesis is to develop front-end algorithms for speech recognition,
the natural way of evaluating their performance is to build a recognizer, have it recognize
testing utterances and compute the word error rate. The word error rate is defined as the
number of erroneous words divided by the number of words in the reference
transcription,
WER = (D + I + S) / N · 100%,    (4.4)
where D, I, S denote deleted words, inserted words, and substituted words, respec-
tively. N is the reference number of words. The values of D, I, S are obtained from a
comparison of the recognized output and the reference transcription using the NIST align-
ment procedure as implemented in the HTK Toolkit [100] and NIST’s SCLITE [3]. The WER
criterion has the advantage of being widely used and easily implemented. The
drawback is that, since WER quantifies the quality and reliability of the whole system
by a single number, the measure can sometimes be misleading. It serves well for simple
systems such as a digit recognizer, but for LVCSR this measure may not be sufficient.
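WER can be computed with a standard Levenshtein alignment in which the edit distance is exactly D + I + S; a minimal dynamic-programming version of what the HTK and SCLITE scoring tools implement:

```python
def wer(ref, hyp):
    """Word error rate: (D + I + S) / N * 100, via edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                                   # i deletions
    for j in range(1, m + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])        # substitution/match
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # deletion/insertion
    return 100.0 * d[n][m] / n

print(wer("one two three four".split(), "one too three".split()))  # 1 S + 1 D -> 50.0
```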
An additional criterion evaluates how well the neural classifier is estimating the
phoneme posterior probabilities. It operates on a frame basis and is called Frame Er-
ror Rate (FER). FER is the fraction of misclassified frames given the true labels. More
precisely, it is the number of frames F_mis in which the maximum posterior does not match
the underlying class label, divided by the total number of frames F,

FER = F_mis / F · 100%.    (4.5)
It is defined only for front-ends with an MLP classifier and is evaluated on the cross-
validation set (which will be defined in the next section).
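FER itself is a one-liner over the posteriors and labels (random stand-in data below):

```python
import numpy as np

def frame_error_rate(posteriors, labels):
    """Percentage of frames where the maximum posterior misses the true label."""
    predicted = np.argmax(posteriors, axis=1)
    return 100.0 * np.mean(predicted != labels)

post = np.random.dirichlet(np.ones(29), size=1000)   # 1000 frames x 29 phonemes
labels = np.random.randint(0, 29, size=1000)         # frame-level reference labels
fer = frame_error_rate(post, labels)
```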
4.4 Evaluation tasks
Having defined the evaluation criteria and the structure of the recognizer, the last thing
that remains to be specified is the task itself.
Modern ASR is a complex system comprising speech data acquisition, prepro-
cessing, extraction of features, classification, and decoding. Each of these blocks is
usually developed under the assumption of independence from the other blocks, since
a global optimization is generally not feasible. However, the assumption of independence
can hardly be guaranteed, and the resulting achievements can thus be quite subjective and
only locally optimal. The simplest way to minimize the influence of the parts of the
system that are not of interest is to avoid them.
system that are not of interest is to avoid them. Therefore, since the primary objective
of this thesis is a study of feature extraction techniques, the chosen evaluation task is
a small-vocabulary speech recognition. Such a task has a number of advantages. First,
it eliminates a possible bias coming from the language modeling (LM), as a strong em-
phasis on LM in large-vocabulary ASR (LVCSR) may smear the studied particularities
of features, especially if the only criterion is the word error rate. Small-vocabulary ASR
uses only a simple LM; the experiments are thus less LM-dependent. Second, a small-
vocabulary recognition system is simpler and more transparent than LVCSR: it has fewer
degrees of freedom, which enables back-tracing and analysis and allows getting much closer
to the particular phenomena. Ultimately, one can learn more from a simpler task. The
drawback is that the results seen on small-vocabulary ASR may not hold for LVCSR. The
reason is that in LVCSR the information may come from different sources (better training
process, strong LM) or – and this is one more argument against LVCSR – state-of-the-art
LVCSR recognizer is a complex and carefully tuned equilibrium which can be easily broken
by introducing any brand new approach, no matter how promising it actually is.
There are two tasks being used in this thesis. First, there is a digits recognizer for
English, which will be used for all data-driven development, optimizations and evaluations.
The second one is a simplified recognizer of English Conversational Telephone Speech
(CTS) with a limited vocabulary.
4.4.1 English digits recognition – Stories & Numbers95
Two speech corpora were used, OGI-Stories and OGI-Numbers95 [26, 27]. Both con-
tain speech recorded over a telephone channel in the same recording conditions (8 kHz,
16 bits, shortpack compression) and both contain a training subset, which is transcribed
on phoneme level by hand (with time-stamps). OGI-Stories contains spontaneous contin-
uous speech with rather large vocabulary, OGI-Numbers95 contains strings of digits and
numbers. This task will be further abbreviated as S/N (Stories/Numbers95).
Four data sets were created from these corpora (see also Fig. 4.4):
MLP set 1 – 208 files from Stories (2.8 hrs) with frame-level phoneme labels.
MLP set 2 – 3590 files from Numbers95 (1.7 hrs) containing digits and numbers with
frame-level phoneme labels.
HMM train set – 2547 files from Numbers95 containing strings of 11 digits from zero
to nine plus oh (1.3 hrs). It is a subset of MLP train set 2.
Test set – 2169 files/12437 words from Numbers95 containing strings of 11 digits
(1.7 hrs) with word transcription.
The MLP sets are used in the following way. When there is only one MLP in the
front-end, which is the case for the TANDEM systems PLP-TANDEM and M-RASTA,
the MLP is trained on the joint set MLP set 1 + 2, see Fig. 4.4. The MLP is
actually trained on 90% of the data; the remaining 10% are used for MLP cross-validation
to determine the end of training. When the front-end contains more than one MLP
in tandem, which is the case for TRAP and LP-TRAP, the sub-band classifiers train
on MLP set 1 and the merging MLP trains on MLP set 2. The reasons for this division are
rather historical, as it was adopted from Pratibha Jain, Frantisek Grezl, and originally from
Sangita Sharma. However, their back-end was different from this work, so the results are
not directly comparable.
The OGI-Stories corpus contains phonetically rich sentences, which can be beneficial
for the MLP training: all phonemes occur with similar frequency, which prevents the
classifier from favoring some classes at the expense of others. The HMM train and Test sets contain
speech from distinct speakers with sequences of digits.
MLP Setup
The 4.5 hours of training data in MLP sets 1 and 2 provide about 1.5 million speech frames,
assuming a 25/10 ms segmentation. The training targets are 29 English phonemes – only
those contained in the ten recognized digits; their list can be found in Appendix A.1. All
MLPs have 3 layers: input, hidden, and output. The sizes of the MLPs for the
individual architectures are given as Input x Hidden x Output size:
Figure 4.4: Structure of the corpora in Stories-Numbers95 (S/N) task.
• PLP-TANDEM: 351x1800x29 units (9 frames x 39 features at the input, 1800 hidden
units, and 29 units in the output layer mapped to phoneme targets).
• M-RASTA: 448x1800x29 units.
The TRAP architectures have a set of MLPs, one for every frequency sub-band, denoted
Band-MLP, and a merging network called Merger-MLP. Their sizes are:
• Band-MLP: Nx100x29 (N features at the input, 100 hidden units, 29 phoneme outputs).
N depends on the TRAP length and, in the case of LP-TRAPs, also on the LPC order;
it typically varies between 50 and 100.
• Merger-MLP: 435x300x29 (435 inputs are formed by the outputs of 15 Band-MLPs
with 29 outputs each; 300 hidden units, 29 phoneme outputs).
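As a rough sanity check of the network sizes listed above, the number of trainable weights of such 3-layer MLPs can be counted directly from the layer sizes. The sketch below counts weights only and neglects bias terms (an assumption about how the totals are quoted); a 101-frame TRAP is assumed for the Band-MLP input size:

```python
def mlp_weights(sizes):
    """Number of weights of a fully connected MLP with the given layer
    sizes, biases neglected (e.g. [351, 1800, 29] for PLP-TANDEM)."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

tandem = mlp_weights([448, 1800, 29])   # M-RASTA TANDEM net
# 15 Band-MLPs (101-frame TRAPs assumed) plus one Merger-MLP
trap = 15 * mlp_weights([101, 100, 29]) + mlp_weights([435, 300, 29])

print(f"TANDEM:      {tandem / 1e3:.0f}k")   # on the order of 900k
print(f"TRAP-TANDEM: {trap / 1e3:.0f}k")     # on the order of 350k
```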
Such neural networks have from 350k (TRAP-TANDEM) to 900k (TANDEM) trainable
parameters. After all the neural networks have been trained using the qnstrn tool from
QuickNet, the 29 phoneme posteriors at the MLP output are passed through a logarithm
and a KLT transform. The KLT is a data-driven method which requires training data
to derive the base vectors; for this purpose the HMM train set is used. The training
process for TANDEM and TRAP-TANDEM architectures is the following:
TANDEM:
1. Train MLP on MLP sets 1+2.
2. Forward HMM train & Test sets.
3. Apply log and compute KLT bases on
HMM train set.
4. Apply KLT transform on HMM train
& Test sets.
TRAP-TANDEM:
1. Train Band-MLPs on MLP set 1.
2. Forward MLP set 2, HMM train set,
and Test set through Band-MLPs.
3. Train Merger-MLP on forwarded MLP
set 2.
4. Forward HMM train set & Test set
through Merger-MLP.
5. Apply log and compute KLT bases on
HMM train set.
6. Apply KLT transform on HMM train
& Test sets.
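The shared tail of both pipelines (logarithm followed by a KLT derived on the HMM train set) can be sketched as follows. The KLT is implemented here as PCA via an eigendecomposition of the covariance matrix, and the random Dirichlet "posteriors" merely stand in for real MLP outputs:

```python
import numpy as np

def train_klt(log_feats):
    """Derive KLT (PCA) bases from training features (frames x dims)."""
    mean = log_feats.mean(axis=0)
    cov = np.cov(log_feats, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]          # sort bases by descending variance
    return mean, eigvec[:, order]

def apply_log_klt(posteriors, mean, bases, eps=1e-10):
    """Log of MLP posteriors, then rotation onto the KLT bases."""
    logp = np.log(posteriors + eps)
    return (logp - mean) @ bases

np.random.seed(0)
train_post = np.random.dirichlet(np.ones(29), size=1000)   # stand-in MLP posteriors
mean, bases = train_klt(np.log(train_post + 1e-10))        # derived on the train set
feats = apply_log_klt(train_post, mean, bases)             # applied to train & test
print(feats.shape)   # (1000, 29)
```

By construction, the transformed training features are decorrelated (their covariance matrix is diagonal).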
HMM Setup
When all the features for the HMM train and Test sets are ready, the back-end can be
trained and evaluated. The back-end is an HTK system scripted in Perl. Its key features
are:
• GMM/HMM system modeling context-independent phones,
• 22 phoneme HMMs (only the phonemes contained in the 11 recognized digits),
• 5 emitting states, each with 32 Gaussian mixtures,
• 11 target words (“zero” to “nine” plus “oh”) in 28 pronunciation variants.
The 22 phoneme HMMs are initialized from the hand-labeled HMM train set by the HInit
tool and re-estimated using the Baum-Welch procedure [85] by the HRest tool. Subsequently,
all files that contain words other than the eleven digits are removed to eliminate unknown
phonemes. The embedded re-estimation step performed next internally carries out the
phoneme alignment, which would not be possible with unknown phonemes present. The
re-estimation is run 5 times to yield the final models.
Test utterances are recognized using the HVite tool given the features, a grammar with
the 11 digits plus silence in a loop, and a pronunciation dictionary with 28 pronunciation
variants. The performance is evaluated in terms of WER by the HResults tool, and the word
insertion penalty is tuned to yield the same number of insertions and deletions.
Statistical Significance
The word error rate is a statistical measure and should be accompanied by a confidence
interval. A simple approach adopted from [41] will help to indicate the statistical
significance for the S/N task.
Let us suppose that two recognizers are evaluated on the same task, reaching two WER
scores p1, p2. The goal is to decide if there is enough evidence to conclude that either
p1 = p2 or p1 ≠ p2. A null hypothesis H0 stating that p1 = p2 is tested at a chosen
significance level α. If H0 can be rejected, the scores are statistically different;
otherwise the difference is probably due to chance.
Our basic question can read: “By how much do we need to improve on the given WER for
the improvement to be significant?” There are 12437 words in the HMM Test set, and the
WER typically hovers around 4% (as will be shown later). Following the approach for a
significance level α = 0.05, we get a minimal difference in scores of 0.5%. This means
that if the baseline reaches 4.0% WER, the proposed algorithm needs to be better than
3.5% WER for the difference to be significant at 95%. For 99% significance the difference
is 0.7%.
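These numbers can be approximately reproduced with a simple two-proportion test under the independence assumption discussed below. Note that this is one common form of such a test; the exact formula used in [41] may differ slightly:

```python
from math import sqrt

def min_significant_diff(n_words, wer, z):
    """Minimal absolute WER difference needed for significance, using a
    two-proportion z-test and assuming independent word errors."""
    se = sqrt(2.0 * wer * (1.0 - wer) / n_words)   # std. error of p1 - p2
    return z * se

n, p = 12437, 0.04                                  # Test set size, typical WER
print(f"95%: {100 * min_significant_diff(n, p, 1.960):.2f}%")  # about 0.5%
print(f"99%: {100 * min_significant_diff(n, p, 2.576):.2f}%")  # about 0.6-0.7%
```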
Note that this measure is only an indication, as it assumes independence among the
errors in the tested words and also between the two experiments. More accurate
significance measures exist, but most of them have to be re-evaluated for every experiment,
since they depend on the recognized sequences and thus apply only to the specific run [41].
4.4.2 Conversational Telephone Speech – CTS
The main goal of developing new feature extraction and recognition techniques
is to enhance state-of-the-art ASR, which is typically an LVCSR system trained on
large amounts of data. It is not computationally feasible to develop on the full-scale
task, as the turnaround time is long; hence the experiments are usually done on simpler
tasks. However, it is often the case that results reported on a small-vocabulary task
do not generalize to the state-of-the-art system. The task introduced in this section is
a compromise between complexity and reliability on one hand and simplicity on
the other. It is based on the data used by the EARS Rich Transcription system, and it
has been shown that conclusions drawn on it do translate to the large task. A complete
description of the task can be found in [22]. Its key properties are:
• Two gender-dependent recognizers,
• telephone data (8 kHz) from various sources (cell phones, ISDN, analog lines)
• 32 hours of train data (16 hrs per gender) selected from Fisher and Switchboard
corpora,
• about half an hour of development + half an hour of test data per gender selected
from NIST RT-03 Evaluation [3],
• 1000 words in about 5k pronunciation variants, out-of-vocabulary-words (OOV) on
test set < 7.5%,
• bi-gram language model (inferred from NIST 2004 CTS evaluation).
Since the architectures of the recognizers for both genders are the same, the following
text describes either one of them.
MLP Setup
The MLP setup is similar to the S/N and Speecon tasks. All MLPs and also the KLT are
trained on the first 14.6 hours of the training data; the remaining 1.6 hours form a
cross-validation set. Training targets were obtained by automatic forced alignment using
the state-of-the-art system from SRI [22]. There are 47 classes, of which 46 are used as
MLP targets (44 English phonemes + silence + other events) and one class contains data not
suitable for training (about 0.6% of all frames); details can be found in Appendix A.2.
This was decided after preliminary experiments showed that training on all targets
only lowers the FER and does not improve the WER. There are about 5.7 million training
frames in about 16 k files. The sizes of the MLPs are N x 2000 x 46 units in the case of
TANDEM (N depends on the technique); in the case of TRAP the Band-MLPs have N x 100 x 46
units and the merger has (46x15) x 300 x 46 units.
After all the training, development, and testing data have been passed through the
trained MLPs, log, and KLT, they are converted to HTK features.
HMM Setup
Here the back-end is a little more complicated than for the S/N task, as it is closer to
an LVCSR task. It is a GMM/HMM system with decision-tree tied-state triphone HMMs, each
HMM with 3 emitting states and 32 mixtures per state. First, the CI phoneme models with
one mixture are initialized from a flat start and trained by the Expectation-Maximization
algorithm (HERest tool). Then they are converted to CD models, which are clustered
until the required number of states is reached. Subsequently, the number of mixtures is
doubled and the HMMs are re-estimated 5 times after each increase, until the final
32 mixtures are reached.
Apart from the number of states, which is a knob to tune on the development set, the
grammar scale factor can also be tuned, balancing the acoustic and the language model.
However, tuning these parameters is costly and will not always be done. Preliminary
experiments have shown that the WER is affected most by these three factors:
• Grammar-scale factor. It depends on the features (their acoustic modeling ability), but
mainly on the number of features in the vector, which affects the dynamic range of
the likelihood (tested with 39, 46 and 64 features).
• Number of triphone states. It varies for every feature kind. The general rule “the
more the better” may not apply to all features.
• Normalization of the features. It seems useful to do speaker-normalization on the
train set and utterance-normalization on the test set.
Test utterances, containing CTS with a limited vocabulary, are recognized using the
HVite tool given the features, the language model, the pronunciation dictionary, and the
set of CD phonemes. The output hypotheses are evaluated in terms of WER using the SCLITE tool.
The CTS task allows for previewing a feature’s behavior on a large task. The drawback
is a rather high computational demand: on a single-processor machine (Athlon 2200)
the training and evaluation take almost two weeks. Because of this, only one gender,
male, is used in all evaluations.
4.5 Baseline results
Four front-ends are used as a baseline:
MFCC – mel-frequency axis, 26 triangular filters from 0 – 4 kHz, no preemphasis, no
liftering, 12 cepstral coefficients + c0 + ∆ + ∆∆.
PLP – Bark-frequency axis, 15 trapezoidal filters from 0 – 4 kHz, no preemphasis, no
liftering, EQ-LD, IN-LD, 15th order LPC, 12 cepstral coefficients + c0 + ∆ + ∆∆.
PLP-TANDEM – 9 consecutive frames of the above PLP features at the MLP input
(351 features).
TRAP – critical band energies (CRBE) taken from the PLP computation, logarithm
applied. TRAP formed as 101 consecutive frames of CRBEs. Thus, each of the 15
band-MLPs have 101 inputs.
Features       FER [%]   WER [%]
MFCC              -        4.5
PLP               -        5.2
PLP–TANDEM       18        4.8
TRAP             19        4.7
Table 4.1: Performance of baseline features on S/N task.
Performance of these features on S/N task is given in Tab. 4.1.
Preliminary experiments on the CTS task with the baseline features PLP, PLP-TANDEM, and
TRAP suggested that the above-mentioned factors may affect the performance more than
the features themselves. Therefore the factors were explored first.
The suitable number of clustered triphone states depends on the type of features, hence
it should be optimized for every experiment. Reasonable starting values, meaning that
the WER should not be more than about 5% worse than the optimum, are between 1000 and
2000 states. For the male gender, generally more states are needed [22]. As the optimization is
very costly (involving HMM retraining and evaluation), in the presented experiments the
optimum is mostly chosen from about three variants.
The grammar scale factor post-multiplies the language model likelihoods from the
word lattices [100]; its default value is 1.0. The bigger the value, the more weight is given to the
language model. The WER was evaluated for some combinations of three vector sizes (39
for PLP, 46 for TANDEM, and 64 for combined features) and factors 3, 8, 14, 20, 25, 30,
40. As the recognition is costly, only several combinations were evaluated. For 46 features,
the reasonable choice of the factor was between 14 and 30 with WER ± 1%. Factor 8
yielded 5% (absolute) worse performance. For 64 features, the reasonable range (± 1%
WER) was between factor 20 and 30. Factor 40 yielded 3% (absolute) worse WER. For
39 features the factor was set at 14 [22]. Based on these observations, the default
grammar scale factors were chosen as shown in Tab. 4.2.
Number of features in the vector   39   46   64
Grammar scale factor               14   20   25
Table 4.2: Grammar scale factors for particular feature vector sizes (CTS task).
The baseline features were evaluated with the appropriate grammar scale factors and
a suitable number of tied triphone states. The word insertion penalty was fixed in all
experiments at zero. The results are given in Tab. 4.3. Note that ePLP denotes enhanced
PLP features: they were normalized with respect to the vocal tract length (VTLN)
and energy-normalized over all speech from one speaker (training data) or per utterance
(evaluation data).
                 MLP cross-validation   Devel set   Test set
Features         FER [%]                WER [%]     WER [%]
MFCC                -                    52.5        50.2
PLP                 -                    53.2        51.6
ePLP                -                    46.4        43.8
ePLP–TANDEM        35                    46.3        43.3
TRAP               40                    54.8        53.3
TRAP–DCT           39                    53.3        51.2
Table 4.3: Performance of baseline features on CTS task.
Chapter 5

Information in time-frequency plane
Current ASR knowledge supports the assumption that the auditory-like spectrum preserves
the complete linguistic information. However, speech is a temporal event, and so is its
spectrum. Due to co-articulation, the ASR-relevant information is spread across
some interval of the spectrogram. The goal of this chapter is to help understand the
distribution of the relevant information in the time-frequency space. Such knowledge
would make it possible to improve the existing techniques, to make them more discriminative
against non-linguistic artifacts, and possibly also to suggest new approaches. It will be
accomplished by a detailed study of TRAP features.
5.1 Limits of useful temporal context for TRAP features
To be able to study the spatial distribution of a phenomenon, the first step is to limit
the search space to some reasonable boundaries. Restricting our interest to probabilistic
features with intermediate phoneme classes, the main question is: “How much temporal
context is necessary for separating phonemes?” The answer will be sought by first studying
the mean temporal trajectories of phonemes. However, as the ultimate targets for ASR are
words, not phonemes, the subsequent experimental search for the context limits will be
done with respect to both phoneme and word criteria, FER and WER.
5.1.1 Mean TRAPs of phonemes
To get a first insight into the temporal patterns of the sub-band energies, the Mean TRAPs
were computed, following the work of Hermansky and Sharma [53], from the training part of
the CTS database.
Fifteen CRitical Band log-Energies (CRBE) were extracted from the speech with
short-term analysis at a 100 Hz rate using the command feacalc -plp 12 -deltaorder
0 -domain logarithmic -dither -frqaxis bark -samplerate 8000 -win 25 -step
10. TRAPs were formed as 101-sample trajectories of CRBEs (about 1000 ms).
They were assigned labels according to the frames in the middle. There are overall about
11 million frames in the CTS training part. The class coverage is given in Appendix A.2.
There are 41 English phonemes plus silence (SIL), a word-fragment interruption point
(FIP), laughter (LAU), filler phonemes (PUH, PUM), and a class for all sounds rejected
from the training (REJ). All TRAPs belonging to the same class were averaged to yield
the Mean TRAPs. For each of the 101 points in every Mean TRAP, the standard deviation
was also calculated. Fig. 5.1 shows the Mean TRAPs and standard deviations at the 5th
critical band (450 – 640 Hz) for all 47 classes.
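The Mean-TRAP computation just described can be sketched as follows. Random data stands in for the actual CRBEs and labels, and the band index, class count, and array layout are illustrative assumptions:

```python
import numpy as np

def mean_traps(crbe, labels, band, n_classes, half=50):
    """Average the (2*half+1)-frame trajectories of one critical band,
    grouped by the label of the center frame.

    crbe   : (T, 15) critical-band log-energies at a 100 Hz frame rate
    labels : (T,)    per-frame class indices
    """
    width = 2 * half + 1
    sums = np.zeros((n_classes, width))
    counts = np.zeros(n_classes)
    for t in range(half, len(labels) - half):
        c = labels[t]                                  # center-frame label
        sums[c] += crbe[t - half:t + half + 1, band]
        counts[c] += 1
    return sums / np.maximum(counts, 1)[:, None]       # avoid division by zero

# Toy run: 1000 random frames, 3 classes, 5th band (index 4)
rng = np.random.default_rng(0)
mt = mean_traps(rng.normal(size=(1000, 15)), rng.integers(0, 3, 1000), 4, 3)
print(mt.shape)   # (3, 101)
```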
Figure 5.1: Mean TRAPs (black line) for 41 phonemes and 6 non-phoneme classes of CTS,
5th band. Standard deviations σ (blue dashed line). Green area represents a measure of
confidence mean±σ.
Observation
Typically, the TRAP wavelets for phonemes span about 200–300 ms, and the lowest variance
(highest confidence) is near the center. Non-phonetic classes have somewhat wider shapes
and higher variance at the center, reflecting that these segments are typically longer and
contain artifacts with higher variability. Although not illustrated in the figure, Mean
TRAPs in other bands have different shapes but similar time constraints. This suggests
that a 1-second TRAP is long enough to capture the complete temporal structure relevant to
phoneme-based ASR, and that the minimum necessary interval is likely even shorter,
yet not shorter than about 200 ms. Hence, the base TRAP length was set to 1000 ms for
further experiments.
Mean TRAPs from MLP Posteriors
Analogously to the above experiment, similar Mean TRAPs were computed from the same
data, but with different labels. Here, the TRAPs were weighted by the phoneme posteriors
obtained from a trained MLP instead of the discrete labels. The MLP was trained on the 5th
band only.
Figure 5.2: Mean TRAPs and standard deviations for selected phonemes of CTS in 5th
band. Labels were obtained either from hand-labeling or from trained MLP.
Though the MLP frame error rate was quite high (60–70%), the resulting Mean TRAPs
were almost the same as in the above experiment, as shown in Fig. 5.2. This suggests that
1) at the sub-band level, not all phonemes can be distinguished and, 2) it is possible to
find patterns which are common to several phonemes. This idea relates to the work
of Kajarekar and Hermansky [61], who aimed at finding a set of optimal broad sub-word
classes for ASR. They reported no success when using these targets as HMM states. Later
efforts with the UTRAP system [51] extended the idea to all bands; a neural system with one
universal classifier for all bands with 9 classes reached the performance of TRAPs with
over an order of magnitude fewer trainable parameters. The existence of shared classes was
the main motivation for the experiment in Appendix C.
5.1.2 Truncating TRAP
Having settled on a maximum TRAP length of 1000 ms, a study of shortening the TRAP
was done. The initial goal was to find the best possible accuracy with the TRAP length
as the only free parameter. Subsequently, a minimal length was sought which
still preserved the quality of the system with the optimal length. The objective was not to
minimize the TRAP length to make the system smaller; the point was rather to find where
the most important information comes from. It was supposed that the critical length could
be detected at the point where the system starts to break.
Experiment setup
The experiment was based on the S/N task and the TRAP system introduced in Section
4.4.1:
• TRAP length varies from 161 frames (≈ 1600 ms) down to 1 frame.
• MLP sizes: Nx100x29 units in Band-MLPs, 435x300x29 units in Merger-MLP.
N equals TRAP length in frames.
• 2 types of TRAP: normal, and mean–normalized over TRAP length.
• Evaluation criteria are WER and FER.
For every TRAP length a new set of MLPs and HMMs was trained. Subsequently, the
FER of the Merger-MLP and the WER were evaluated and plotted in Fig. 5.3.
Figure 5.3: Influence of TRAP length on FER (left) and WER (right). MLP size depends
on the TRAP length.
Observation
FER criterion:
• The dependency of FER on the TRAP length is smooth. The only outlier, at TRAP
length 101, was caused by a slightly worse convergence of the MLP training,
which did not markedly affect the WER.
• When optimizing for frame-level phoneme recognition, the TRAP length should
be between 400–1000 ms. TRAPs longer than 1000 ms start to hurt the performance.
TRAPs shorter than 400 ms are not long enough and can severely affect the FER.
• Mean-normalized TRAPs require an about 50 ms (5 frames) larger window than
non-normalized TRAPs. A possible explanation might be that the information about
the mean, missing in mean-normalized TRAPs, can be recovered from a wider context.
WER criterion:
• The dependency of WER on the TRAP length is jittery and only a rough trend can
be interpreted.
• WER seems to be almost independent of the TRAP length over a wide range. The TRAP
can be truncated down to 300 ms (normal TRAP) or even to 100 ms (mean-normalized
TRAP) before the WER deteriorates significantly.
This surprising phenomenon can be explained if the phoneme posteriors
are viewed simply as features: although for a short TRAP the frame-level phoneme
posteriors are “noisy”, they apparently still contain the linguistic information, which
can be isolated in the HMMs thanks to their temporal smoothing mechanisms with
strong a priori constraints.
NOTE: There is an extreme case of this experiment worth noting: a TRAP length of only
one frame. Such a degenerate system is interesting, as its input is essentially just
the short-term auditory spectrum; there is no context.
The Band-MLPs then have a topology of 1 input – 100 hidden – 29 output units and have
to make a decision about the underlying phoneme given only one number, the sub-band
energy. Surprisingly, with only such subtle information, they are still able to properly
classify 25–30% of frames¹! Recall that the Band-MLPs for the standard 51-frame TRAP
properly classify 30–40% of frames.
The Merger-MLP properly classifies 50% of frames (51-TRAP system: 82%). The word
error rate reaches a respectable 18.7% (51-TRAP system: WER = 4.6%).
From a different point of view, such a system is in principle comparable to PLP
without ∆ and ∆∆, which gives 17.9% WER and confirms the consistency of the results.
5.1.3 Fixed MLP topology – truncating TRAP-DCT
The reason for the jitter in the WER curves in Fig. 5.3 is that the MLP topology differs
among the various TRAP lengths, which introduces a stochastic component into the WER.
In order to make the curves smoother, the MLP topology has to be fixed over the
experiment. This can be accomplished by TRAP-DCT. The idea of TRAP-DCT is as
follows.
1. To form TRAPs of the required length.
2. To apply Hamming window along TRAP to reduce the TRAP boundary transitions.
3. To project the windowed TRAPs to the first K DCT bases.
4. To use the projections as features for MLP.
The number of DCT bases (DCT size) K used for the projection needs to be fixed. K
must be at most equal to the TRAP length N (in frames) for the DCT to be defined,
K ≤ N. Hence, a low K allows shortening the TRAP down to K frames (K × 10 ms), but as
there are then only a few features, the performance is low. A high K allows for good
performance, but the TRAP cannot be truncated below K frames. Therefore K was left as
a free parameter.
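The four steps above can be sketched for one sub-band as follows. This is one reasonable reading of the procedure (the constant k = 0 base is excluded, and the random input merely stands in for a real energy trajectory):

```python
import numpy as np

def trap_dct_features(trap, K):
    """Steps 2-3 above: Hamming-window one sub-band TRAP and project it
    onto the first K DCT bases.

    trap : (N,) energy trajectory of one critical band
    K    : number of DCT bases, K <= N
    """
    N = len(trap)
    assert K <= N, "DCT size must not exceed the TRAP length in frames"
    n = np.arange(N)
    windowed = trap * np.hamming(N)              # step 2: Hamming window
    k = np.arange(1, K + 1)[:, None]             # bases k = 1..K
    bases = np.cos(k * np.pi * (n + 0.5) / N)    # cosine DCT bases
    return bases @ windowed                      # step 3: projections

# 51-frame (510 ms) TRAP reduced to 6 DCT features (step 4 feeds them to the MLP)
feats = trap_dct_features(np.random.randn(51), K=6)
print(feats.shape)   # (6,)
```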
¹The classification rate is expressed in terms of Frame Accuracy, which is defined as 100% − FER.
MLP architecture
TRAP-DCTs are commonly processed by the TRAP MLP architecture with 15 band
networks plus a merger network. However, preliminary experiments have shown that for
small K the dependencies of WER and FER on the TRAP length are noisy, because the
band MLPs have very few inputs and their decisions cannot be reliable. This issue can be
avoided by replacing all 15 plus 1 MLPs with one big MLP (for small K), which preserves
the trend in the results and smooths the curves. Specifically, the setup was the following.
DCT size K       MLP architecture
2, 4, 6          1 MLP, size = (K × 15) x 1800 x 29
                 (100k–200k trainable parameters)
7, 11, 25, 50    15 Band-MLPs, sizes = K x 100 x 29;
                 1 Merger-MLP, size = 435 x 300 x 29
                 (200k–260k trainable parameters)
The experiment was carried out in two loops.
• For every DCT size K out of 2, 4, 6, 7, 11, 25, and 50:
  – vary the TRAP length N from the minimum K × 10 ms up to about 1000 ms;
  – for each K and TRAP length N, calculate the features (using the above four steps),
    train the MLPs and HMMs, and evaluate FER and WER.
The resulting dependencies are plotted in Fig. 5.4.
Figure 5.4: Influence of TRAP length on FER (left) and WER (right). TRAP-DCT
features are used, so MLP size is fixed for every trajectory.
Observation
• The number of useful DCT bases saturates at K = 25, reaching the best FER of about
15% and the best WER of about 4.3%. More bases start to hurt the performance.
• For the best FER a long context (400–1000 ms) is required. This agrees with the
experiment in Section 5.1.2.
• The WER criterion seems to prefer a shorter context, as the minima of all curves lie on
the left side. The best WER was found for a context of 200–400 ms.
• The lower the feature dimension, the shorter the optimal TRAP length.
Conclusion
The experiment has shown that the relatively fast modulations in speech are the most
important for ASR. However, slower modulations are also useful, as they carry complementary
information.
In particular, when the feature size is limited, shorter TRAPs are the first choice:
300 ms for a good FER and 100 ms for a good WER. The slower modulations contained in
longer TRAPs, up to 1000 ms, can be useful when the feature size does not matter.
5.1.4 Extension – combining DCTs of different lengths
The observations from the previous experiment deserve some more discussion. Consider
the shape of the applied DCT bases shown in Fig. 5.5: they are cosinusoids weighted by
a Hamming window. In discrete time with frames n = 0 . . . N − 1, where N is the TRAP length,
DCTk[n] = cos( kπ(n + 0.5)/N ) · ( 0.54 − 0.46 cos(2πn/N) ),    (5.1)

where the first factor is the DCT cosine and the second factor is the Hamming window.
Here k = 1 . . . K is the DCT base index². The actual projection of TRAP samples x[n]
onto the DCT bases is done with

X[k] = ak · Σ_{n=0}^{N−1} x[n] · DCTk[n],    (5.2)
where ak is a scaling constant. The DCT can be seen as a bank of very narrow band-pass
filters tuned to different frequencies. The center frequencies fc are linearly spaced
from 0 Hz up to near the Nyquist frequency fs/2, where fs is the spectrum sampling
frequency (here fs = (10 ms)⁻¹ = 100 Hz). The absolute center frequency of a filter
in Hertz is determined by the ratio k/N:

fc(k, N) = (1/2π) · (kπ/N) · fs = k/(2N) · fs  [Hz].    (5.3)
Thus, the same frequency can be obtained in multiple ways. Let us take an example.
Let DCT^N_k denote the k-th DCT base applied to a TRAP of length N samples. According
to eq. 5.3, the first two DCT bases applied to a 10-frame TRAP (denoted DCT^10_{1,2})
have frequencies of 5 Hz and 10 Hz, respectively³. The DCT^100_{1,2} bases applied to a
ten times longer TRAP have frequencies of 0.5 Hz and 1 Hz, respectively, but the tenth
and twentieth bases (DCT^100_{10,20}) have again frequencies of 5 Hz and 10 Hz. What
differs between DCT^10_1 and DCT^100_10 is in principle just the length of the
“cosinusoid”, as shown in Fig. 5.6. This suggests an interesting question: which bases
perform better, shorter or longer ones? In other words, when projecting a TRAP onto
DCT bases with multiple periods, is it necessary to use all the periods?
²The DCT base for k = 0 is also defined, but without the Hamming window it is a constant
(cos 0 = 1), therefore it will not be considered here.
³The explanation does not consider Hamming windowing, for simplicity.
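The frequency relations of eq. 5.3 and the example above can be checked numerically; the 100 Hz sampling rate follows from the 10 ms frame step:

```python
def dct_center_frequency(k, N, fs=100.0):
    """Center frequency (eq. 5.3) of the k-th DCT base over an N-frame
    TRAP; fs = 100 Hz corresponds to the 10 ms frame step."""
    return k / (2.0 * N) * fs

print(dct_center_frequency(1, 10))     # 5.0  Hz (DCT^10_1)
print(dct_center_frequency(2, 10))     # 10.0 Hz (DCT^10_2)
print(dct_center_frequency(10, 100))   # 5.0  Hz (DCT^100_10), same frequency
print(dct_center_frequency(1, 100))    # 0.5  Hz (DCT^100_1)
```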
Figure 5.5: First four DCT bases weighted by Hamming window.
Recall that the low-order DCT features from the previous experiment reach relatively good
accuracies: the DCT^7_{1,2} coefficients yield 8.5% WER, the DCT^15_{1,2,3,4} coefficients
5.5% WER. This suggests combining the low-order DCT bases that gave good performance
with other low-order DCTs of a different TRAP length, instead of calculating a higher-size
DCT of a single length. For example, instead of calculating the four features
DCT^15_{1,2,3,4}, one can use two plus two features, DCT^15_{1,2} + DCT^7_{1,2}, as
illustrated in Fig. 5.7.
Experiment
A simple experiment was carried out, similar to the one in the previous section.
It combined the best-performing DCT size 2 features (DCT^7_{1,2}) with another two DCT
size 2 features (DCT^M_{1,2}). The hypothesis was: it is possible to find a combination
of four DCT bases, DCT^7_{1,2} + DCT^M_{1,2}, which outperforms the best DCT size 4,
DCT^15_{1,2,3,4}. If this were true, better features than TRAP-DCT could be found
by optimizing the projection bases. The experiment proceeded as follows.
1. The promising features from DCT size 2 were found (top blue curves from Fig. 5.4).
2. They were concatenated with other DCT size 2 features. It was repeated for all
TRAP lengths.
3. The performance was compared to the DCT size 4 features.
(The same procedure was run independently for both criteria, FER and WER.)
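The feature construction in steps 1 and 2 can be sketched in a few lines. This is a minimal numpy illustration, assuming DCT-II bases weighted by a Hamming window (as in Fig. 5.5) and centered sub-windows of a single band-energy trajectory; the helper names are illustrative, not part of the thesis tooling.

```python
import numpy as np

def dct_bases(n, ks):
    """DCT-II bases with indices ks over n points, weighted by a Hamming window."""
    t = (np.arange(n) + 0.5) / n
    return np.stack([np.hamming(n) * np.cos(np.pi * k * t) for k in ks])

def combined_dct_features(trap, n_long, n_short, ks=(1, 2)):
    """Project the centered n_long and n_short sub-windows of one TRAP
    onto low-order DCT bases and concatenate the coefficients."""
    feats = []
    for n in (n_long, n_short):
        c0 = (len(trap) - n) // 2          # centered sub-window
        seg = trap[c0:c0 + n]
        feats.extend(dct_bases(n, ks) @ seg)
    return np.asarray(feats)

trap = np.random.default_rng(1).standard_normal(101)   # one band-energy trajectory
f = combined_dct_features(trap, n_long=15, n_short=7)  # DCT^15_{1,2} + DCT^7_{1,2}
assert f.shape == (4,)
```

The combined 2+2 vector has the same size as DCT^15_{1,2,3,4}, which is what makes the comparison in step 3 fair.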
Figure 5.6: Chosen bases of DCT applied to 100 ms and 1000 ms TRAPs.
5.1 Limits of useful temporal context for TRAP features 45
Figure 5.7: Illustration of combining the bases of two DCTs of different sizes. The left part shows
the conventional bases (DCT^N_{1,2,3,4}) and the combined bases (DCT^N_{1,2} + DCT^{N/2}_{1,2}). The right part
shows the same bases after weighting with a Hamming window.
Figure 5.8: Combining DCT size 2 features with another DCT size 2 features; FER (left)
and WER (right) as a function of TRAP length. Dashed ellipse = best of DCT size 4, solid ellipse = combined 2+2
features.
Observation
Results are plotted in Fig. 5.8. The hypothesis was confirmed by both criteria.
• FER: DCT_{1,2} bases applied to a 300 ms TRAP can be profitably augmented by similar
features derived from either shorter or longer TRAPs. The best combinations appear
to be approximately 300 ms + 100 ms and 300 ms + 800 ms bases. The combined
features outperform the best of DCT size 4 (300 ms).
• WER: DCT_{1,2} bases of length 150 ms can be successfully combined with DCT_{1,2} bases
of length 30 ms, outperforming the best of DCT size 4 (150 ms).
Conclusion
TRAP-DCT uses a set of cosinusoidal projection bases of the same length and increasing
frequencies. The "high-frequency" bases have multiple periods of the cosine within the
window. It was shown that using all the periods can actually harm performance and that it is
beneficial to truncate these bases (see Fig. 5.7, left). In other words, it seems desirable
that the length of a particular cosinusoidal base be proportional to its period, i.e. inversely
proportional to its frequency.
This knowledge is one of the motivations for Multi-RASTA filtering, where the pro-
jection bases have fixed shapes and they differ only in widths. This framework will be
introduced in Chapter 7. Note that Multi-RASTA filters have remarkably similar shapes to
the bases that were successfully used in this section (see the rightmost column in Fig. 5.7),
though their origin is different.
5.2 Focusing on TRAP center
The previous sections studied the length of speech needed for ASR with respect
to the WER and FER criteria. The TRAP lengths for which the performance was not far
from the best ranged between 100 ms and 1000 ms. This section explores the density of
relevant information within the TRAP.
TRAPs treat all points in the trajectory the same way. Intuitively, however, the
most important part should be located near the TRAP center. Consider the human eye:
most of the information comes from the area the eye is focused on, and only
a minor part comes from the periphery. If there is an analogy with speech, it
could be beneficial to pay different attention to different parts of the TRAP4. But how
should this attention be defined? For example, TRAP-DCT aims at emphasizing the center of
the trajectory by weighting it with a Hamming window before applying the DCT. The following
reasoning seeks a more explicit proportioning of the attention. It assumes that if a
region with a higher density of information is parametrized by more features, the system
performance should rise.
5.2.1 Warping time axis
Let us assume that the complete linguistic information is contained in the 101 × 15 samples
of the auditory spectrogram. If the information density is not uniform, then some parts of
the spectrogram are necessarily redundant. These sparse-information regions, which require
fewer features, could be sub-sampled. A simple way to achieve this is to bin neighboring
TRAP points together and use only the resulting point. To further simplify the
problem so that it can be solved experimentally, the density along the frequency axis is
assumed uniform. The main goal can now be expressed as two questions.
• What is the mapping function between the linear time axis and a warped time axis
with constant information density?
• How redundant are the 101 samples in 1000 ms long TRAP trajectory? In other
words, how many TRAP samples are needed to preserve the information?
4It can be argued that the MLP is able to learn any nonlinear projection, so this reasoning makes no
sense. But that holds only in theory, under assumptions that are not met in practice, for
example unlimited training data. If the world were perfect, one could feed the MLP directly with speech
samples. In practice, it makes a lot of sense to "help" the MLP with proper feature extraction and selection.
Warping Function
The mapping is approximated by a center-symmetric function whose one half is given by

f(x) = 1 − (1 − x)^w,   x ∈ ⟨0, 1⟩,   (5.4)

where w is the only parameter. The symmetrization takes place around the point
[x = 1, f(x) = 1]. The complete function maps the warped output axis (the x axis) to the input
linear axis (the y axis), as shown in Fig. 5.9.
One could ask why the output is mapped to the input and not vice versa. The goal is
to design a warped axis which, when sampled uniformly, produces samples having a uniform
density of information. Hence, what matters most is the ability to project the samples of
the warped axis onto the linear axis. The mapping from the linear to the warped axis is not so
important.
Fig. 5.9 shows the influence of w on the warping function: w = 1.0 represents a 1:1
mapping, w > 1.0 prioritizes the center, and w < 1.0 prioritizes the boundaries (included for
completeness). This function was chosen for three reasons. 1) The "warping effect" is nicely
proportional to w. 2) The function defaults to a straight line for w = 1. 3) For a given
w, the warping near the center is stronger than at the boundaries (note the slopes for
w = 2.6 near points 1.0 and 0.0 on the horizontal axis), which protects the boundaries from
being sampled too sparsely when high values of w are used.
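The three properties can be verified directly. A small numpy sketch of eq. 5.4 (the function name is illustrative):

```python
import numpy as np

def warp(x, w):
    """One half of the symmetric warping function, eq. 5.4: f(x) = 1 - (1 - x)^w.
    Maps the warped output axis x in [0, 1] to the linear input axis."""
    return 1.0 - (1.0 - np.asarray(x)) ** w

x = np.linspace(0.0, 1.0, 11)
assert np.allclose(warp(x, 1.0), x)          # w = 1 is the identity mapping
assert np.all(warp(x[1:-1], 2.6) > x[1:-1])  # w > 1 stretches toward the center
assert np.all(warp(x[1:-1], 0.6) < x[1:-1])  # w < 1 prioritizes the boundaries
assert warp(0.0, 2.6) == 0.0 and warp(1.0, 2.6) == 1.0  # end points stay fixed
```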
Figure 5.9: Symmetric TRAP warping function for various w factors. Coordinates [1,1]
denote the TRAP center. Red dashed lines illustrate the projection of the uniformly sampled
warped axis onto the linear axis.
Now the function has to be discretized, i.e. sampled at equidistant points of
the horizontal axis. The input TRAP size is 101 points and the number of output points is
M (M ∈ ⟨1, 101⟩). M is odd, which means that both edge points as well as the middle
point are warping-independent. An example of the discrete mapping from 101 to 21
points is illustrated in Fig. 5.10. For w = 1.0, about 5 input points are binned (averaged)
to yield one output point, a ratio of 5:1. For w = 2 the mapping is 1:1 at the center and
10:1 at the boundaries. Obviously, it does not make sense to raise w too much, as the
central points would start to repeat at the output.
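The discretization can be sketched as follows. This is a hedged illustration assuming nearest-neighbor binning of input samples around the warped output positions (the pfile warp tool used in the experiments may bin differently); all names are illustrative.

```python
import numpy as np

def warp_trap(trap, n_out=21, w=2.0):
    """Warp a TRAP to n_out points (a sketch of the binning in Fig. 5.10):
    output positions are uniform on the warped axis, eq. 5.4 maps them to
    the linear axis, and each input sample is averaged into the nearest bin."""
    n_in = len(trap)
    half = n_out // 2                      # n_out is assumed odd
    x = np.linspace(0.0, 1.0, half + 1)    # uniform samples of one warped half
    f = 1.0 - (1.0 - x) ** w               # their positions on the linear axis
    # mirror around the center, then scale to input-sample indices
    pos = np.concatenate([f, 2.0 - f[-2::-1]]) * (n_in - 1) / 2.0
    # assign every input sample to the nearest output position and average
    owner = np.argmin(np.abs(np.arange(n_in)[:, None] - pos[None, :]), axis=1)
    return np.array([trap[owner == i].mean() for i in range(n_out)])

trap = np.arange(101, dtype=float)         # a dummy ramp-shaped trajectory
out = warp_trap(trap, n_out=21, w=2.0)
assert out.shape == (21,)
assert abs(out[10] - 50.0) < 1e-9   # the middle point stays at the TRAP center
assert abs(out[0] - 2.0) < 1e-9     # the first bin averages inputs 0..4 (5:1)
```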
Experiment Setup
A 101-point input TRAP representing about 1000 ms was mapped to M = 51, 31, 21, 15,
and 11 points. Lower counts do not make sense according to the experiment in Section
5.1.2. For each M, four warpings were tested:
Linear – linear mapping, w = 1.0, all points are compressed with the same average ratio
101:M .
High – strong warping such that mapping is 1:1 at the center.
Medium – less aggressive warping such that mapping is 1:2 at the center.
Inverse – priority to boundaries, mapping is 1:1 at the boundaries.
All the tested warping factors and output sizes are summarized in Tab. 5.1.
Figure 5.10: Discrete mapping from TRAP size 101 to TRAP size 21, for w = 1, w = 2 and w = 0.5.
The experiment was based on the same setup as the TRAP-length experiment (see Section
5.1.2), with the TRAP back-end and the S/N task. For every combination of output size M and
factor w, warped-TRAP features were calculated from mean-normalized TRAPs using the
pfile warp tool and fed to the ANN/HMM framework; FER and WER were evaluated.
Observation
The recognition results with warped TRAPs are shown in Fig. 5.11; the tested warping
factors and output sizes are listed in Tab. 5.1.

                      Linear    High              Medium                     Inverse
  Output size M       ratio     w     boundary    w     center   boundary   w     center
                                      ratio             ratio    ratio            ratio
  101                 1:1       1.00  1:1         1.00  1:1      1:1        1.00  1:1
  51                  2:1       1.37  3:1         1.18  1:1      1:3        0.51  12:1
  31                  3:1       1.67  5:1         1.33  2:1      1:4        0.31  34:1
  21                  4:1       2.00  10:1        1.63  2:1      1:6        0.21  54:1
  15                  6:1       2.40  14:1        1.96  2:1      1:12       0.15  66:1
  11                  9:1       3.00  22:1        2.30  2:1      1:18       0.11  78:1

Table 5.1: Warping factors w and compression ratios for given lengths M.

Figure 5.11: Warping the time axis in TRAPs. Left: FER, right: WER.

• Linear mapping of 101 points to M points smoothly deteriorates FER and WER as M drops.
A well-performing exception is M = 51, where every two points in the TRAP are
binned together. This has a similar effect as using 2× longer speech frames, which
still preserves modulations up to 20 Hz (2 × 10 ms frames ↔ fs = 50 Hz). Some
authors ordinarily use a 32/16 ms segmentation.
• Preserving fine resolution near the TRAP center at the expense of the boundaries indeed
makes sense. This supports the hypothesis that TRAPs are redundant and that the crucial
information lies near the center. The severe sub-sampling of the boundaries with a 10:1
ratio for M = 21, w = 2.0 achieved the best performance in both FER and WER (see
also Fig. 5.10). Going down to 15 features per TRAP maintained the performance; 11
features started to harm it.
• The inverse warping, which preserves the boundaries at the expense of the center, failed,
supporting the notion of the major importance of the center.
Extension to CTS Task
The promising warpings were also evaluated on the CTS task and compared to the standard
TRAPs with 101 samples and 21 samples; see Tab. 5.2.
In frame error rate, warped TRAPs reached performance comparable to the 101-point TRAP,
significantly outperforming the short 21-point TRAP. This confirms the validity of the idea of
non-uniform information density, which allows using fewer features. In word error rate, the
observations from the S/N task did not translate to LVCSR: on average, warped TRAP performs
1% worse than TRAP. The discrepancy between FER and WER resembles the figures from
Sec. 5.1.2. A possible explanation is that on the CTS task FER is fully determined by the features,
while WER is influenced by more inputs (HMM temporal constraints, the language model, the dictionary).

                                 Devel set           Test set
  Features                       FER[%]   WER[%]     WER[%]
  TRAP 101                       40.0     54.5       53.4
  TRAP 21                        44.3     55.2       52.8
  warpTRAP M = 21, w = 2.0       40.5     55.5       53.8
  warpTRAP M = 15, w = 2.4       40.6     56.5       54.3

Table 5.2: Performance of selected warped TRAPs on CTS.
Chapter 6
Linear Predictive TempoRAl
Patterns (LP-TRAP)
This chapter deals with Linear Predictive TempoRAl Patterns (LP-TRAP). The idea of
LP-TRAP originates from Marios Athineos, Hynek Hermansky and Daniel P. W. Ellis
[15]. During the author's stay at the IDIAP Research Institute, he cooperated with the first
two LP-TRAP inventors. The outcome was a C++ implementation of LP-TRAP feature
extraction, partial optimizations of the technique, and evaluations on the S/N and CTS tasks.
In addition, the author experimented with non-linear warping of the temporal axis of
LP-TRAP.
6.1 Introduction to LP-TRAP
Speech is by nature a continuous-time signal. Though human hearing implicitly performs
some form of spectral analysis, time remains the primary dimension.
Short-term feature extraction begins with segmenting the speech at a rate of about 100 Hz,
thus reducing the temporal resolution by two orders of magnitude. On one hand this allows
one to assume stationarity within the segment and apply the FFT; on the other hand it pushes the
time dimension aside: any temporal detail below 10 ms is lost. This may seem even more
unfortunate considering that the next processing step is a filter bank with a spectral
resolution of only about 20 samples. The uncertainty principle suggests that, given such
reduced frequency resolution, the temporal resolution could have been much finer than it
is. Why drop information which is there and could be important? What if faster
temporal events in speech (e.g. stops) improve ASR when better localized?
Among the feature extraction approaches that focus primarily on temporal structure are
TRAPs, which estimate phoneme posterior probabilities from the time evolution of sub-band
energy. However, TRAPs originate from short-term spectral samples, which do not carry
any finer detail. There should be a more straightforward way to obtain the temporal
evolution of sub-band energy. In principle, the signal could be split into sub-bands
by a simple filter bank in the time domain. Then the band-specific temporal resolution would
be determined just by the impulse response of the particular filter and would not be
influenced by frame analysis. Section 3.1.2 of the survey introduces several works towards
this approach. Note that to get the final estimates of the energy envelopes, one would have
to demodulate the band-pass signals and take the square. Anyway, the point is that such
an approach is only an approximation to the true energy envelope, called the Hilbert envelope,
which can also be evaluated.
The squared Hilbert envelope represents instantaneous energy in signal and can be
calculated directly from speech samples. It also allows for sub-band processing. Sub-
band Hilbert envelopes are energy trajectories similar to TRAPs, but with full temporal
resolution. They need to be smoothed to suppress the presence of glottal pulses, but
instead of an ad-hoc universal low-pass filter hidden in the frame-based processing, in
LP-TRAPs the smoothing is accomplished by Linear Prediction1. The involved auto-regressive
model, when applied appropriately (as will be discussed further), can smooth the
trajectory and at the same time capture fine temporal details with millisecond accuracy. Its
behavior can be adjusted by two additional mechanisms: 1) a transform applied to the spectral
values, 2) warping of the temporal axis. Finally, the LPC coefficients can be easily converted into
cepstral coefficients, which serve as a natural and elegant TRAP representation.
6.2 Extracting LP-TRAP features
LP-TRAPs are extracted in several steps, as illustrated in Fig. 6.1. Very briefly, speech is
segmented into long segments (500–1000 ms) with a shift of 10 ms (to be consistent with
TRAPs) and transformed into the frequency domain by the Discrete Cosine Transform (DCT).
The spectrum is then split into auditory-like frequency sub-bands. In each sub-band, the
fragment of the DCT is solved for the coefficients of the LP model. Note the duality between time
and frequency: if we were in the time domain, the LPC would approximate a power spectrum;
in the frequency domain the LPC approximates something close to the "power of a time signal",
more precisely the Hilbert envelope of the sub-band. For speech recognition, either
the LP coefficients or cepstral coefficients can be used. Alternatively, temporal envelopes of
each sub-band, resembling TRAPs, can be calculated.
Figure 6.1: LP-TRAP feature extraction scheme (segmentation → DCT → sub-bands → FDLP → LP coefficients a[k] or cepstra c[k]).
Now, let us look closer at some particularities of the algorithm. The scope of the
next sections is the following:
• Section 6.2.1 shows the importance of Hilbert envelope for LP-TRAP. It will also
explain why the spectrum is calculated using DCT.
• Section 6.2.2 clarifies how the DCT spectrum is split in frequency sub-bands.
1Note that here the LPC smooths the temporal envelope instead of the power spectrum; it is applied in
the frequency domain instead of the time domain, and is therefore called Frequency-Domain Linear Prediction
(FDLP) [14].
• Section 6.2.3 explains the principle of Frequency-Domain Linear Prediction (FDLP).
A trick for modeling dips and peaks equally well by AR model is presented.
6.2.1 Importance of Hilbert envelope
This section shows, very schematically yet vividly, the importance of the Hilbert envelope.
When thinking about how to estimate the energy trajectory (or envelope) of a signal, the
first idea might be to simply take the power of the signal and possibly smooth the resulting
trajectory with a low-pass filter. But is this right?
Imagine a simple stationary signal, a sinusoid. What is its envelope? Intuitively, the
amplitude of the sinusoid is constant and so should be its envelope. However, the power of
the sinusoid is not constant: it is a value which varies between zero and the squared amplitude,
depending on the phase. Why would the energy envelope depend on an incidental phase
in time? The envelope does not depend on phase, so taking the power is clearly not the
right way. But how can the signal be "made" phase-independent? A rough yet illustrative
answer is: accompany the sinusoid with a cosinusoid of the same amplitude. Obviously,
this cannot be achieved by addition, but it can by forming a complex signal with the sinusoid
as its real part and the cosinusoid as its imaginary part, i.e. a complex exponential.
Since sin²x + cos²x = 1, the absolute value of such a complex signal is constant. This is in
principle the Hilbert envelope.
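This toy argument is easy to verify numerically. The sketch below forms the analytic signal of a pure sinusoid by keeping only the positive half of its spectrum (the construction formalized as eq. 6.4); numpy only, all names illustrative.

```python
import numpy as np

n = 1024
t = np.arange(n)
x = np.sin(2 * np.pi * 8 * t / n)   # a pure sinusoid, 8 whole cycles

# Analytic signal: double the positive half of the spectrum, zero the negative
X = np.fft.fft(x)
X[1:n // 2] *= 2.0
X[n // 2 + 1:] = 0.0
x_plus = np.fft.ifft(X)

envelope = np.abs(x_plus)   # Hilbert envelope: constant, phase-independent
power = x ** 2              # instantaneous power: swings between 0 and A^2

assert np.allclose(envelope, 1.0)      # envelope equals the amplitude everywhere
assert np.allclose(x_plus.real, x)     # the real part is the original signal
assert power.max() > 0.99 and power.min() < 0.01
```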
To make things clearer and a little more formal, recall some basics from signal theory.
Let x(t) be a time signal. Then

x̂(t) = (1/(πt)) ∗ x(t)   (6.1)

is its Hilbert transform (H[·]). This convolution represents nothing but a phase change
by 90°, which can be shown using the Fourier transform (F[·]):

F[1/(πt)] = −j sgn(ω),   (6.2)

where j is the imaginary unit and the sgn function returns the sign. The above-mentioned
complex signal is called the analytic signal,

x+(t) = x(t) + jx̂(t).   (6.3)
More important for our purposes is its Fourier transform

F[x+(t)] = X+(ω) = X(ω) + j(−j sgn(ω) X(ω)) = { 2X(ω) for ω > 0;  X(0) for ω = 0;  0 for ω < 0 },   (6.4)

which is a causal (or "one-sided") spectrum. This is actually a dual form of the Kramers–Kronig
relation, which says that the real and imaginary parts of the Fourier transform of a causal
signal a(t) are related (ℜ denotes the real part of a complex number):

F[a(t)] = ℜ[A(ω)] + jH[ℜ[A(ω)]].   (6.5)

Here, the causal signal is X(ω) and the real signal is x(t), hence

F⁻¹[X(ω)] = x(t) + jH[x(t)].   (6.6)
For the purposes of modulated (band-pass) signals, let us also define the complex envelope
simply as the demodulated analytic signal

x̃(t) = x+(t) e^(−j2πf₀t).   (6.7)

Finally, the Hilbert envelope of the signal x(t) is

h(t) = |x̃(t)| = |x+(t)|.   (6.8)

Some authors also define the temporal envelope as

ε(t) = h(t)².   (6.9)

The temporal envelope of the sub-band signal is the subject of modeling in LP-TRAPs.
Discrete Cosine Transform instead of FFT
According to eq. 6.4, the envelope could in principle be calculated by taking the Fourier
transform of the signal, canceling its left part (ω < 0) and taking the absolute value of
the inversely transformed signal. But consider that the envelope is going to be modeled
by linear prediction. LP applied in the temporal domain first calculates the autocorrelation
and then solves the Yule–Walker equations. The autocorrelation can optionally be obtained
using the Wiener–Khinchin theorem [94] as

R(t) = F⁻¹[|F[x(t)]|²].   (6.10)

When LP is applied in the frequency domain (FDLP), it calculates the autocorrelation not
from the signal but from the spectrum. Therefore the spectrum needs to be purely real.
A real spectrum in turn implies a symmetric signal in time. Therefore in LP-TRAPs,
prior to taking the Fourier transform, the input signal is symmetrized, which makes all its
components even. In other words, the signal is composed of cosines only. It can be shown
that in the case of a discrete symmetrized signal xsym = {x1, x2, . . . , xn, xn−1, xn−2, . . . , x2} of
length 2(n − 1), the FFT of xsym turns into the DCT of the original signal2 x(t) of length
n. Hence, the DCT is preferred over the FFT as it is more computationally efficient.
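This equivalence is easy to check numerically. A sketch with numpy only, with the (unnormalized) DCT-I written out explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)

# Symmetrize: x_sym = {x_1, ..., x_n, x_{n-1}, ..., x_2}, length 2(n - 1)
x_sym = np.concatenate([x, x[-2:0:-1]])
assert len(x_sym) == 2 * (n - 1)

X = np.fft.fft(x_sym)
assert np.allclose(X.imag, 0.0)   # an even signal has a purely real spectrum

# The real spectrum equals the (unnormalized) DCT-I of the original signal
k = np.arange(2 * (n - 1))
dct1 = np.array([x[0] + (-1) ** ki * x[-1]
                 + 2 * sum(x[j] * np.cos(np.pi * j * ki / (n - 1))
                           for j in range(1, n - 1))
                 for ki in k])
assert np.allclose(X.real, dct1)
```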
6.2.2 Obtaining frequency sub-bands
Once the speech spectrum is available, it can be split into frequency sub-bands. The sub-band
signals can be seen as modulated band-limited signals. Before they can be modeled,
they have to be demodulated3. In contrast to the time domain, in the frequency domain this
operation is very simple.
The sub-band selection and demodulation from eq. 6.7 are carried out by selecting a
certain part of the DCT spectrum and shifting it towards zero. The demodulated sub-band
spectrum is thus formed as

X_b = {X[first_b], X[first_b + 1], . . . , X[last_b], padded zeros},   (6.11)
2To be precise, both transforms are equal only for a certain DCT definition. For example, in the case of DCT type I the two transforms differ only in scaling.
3Actually, due to the equality 6.8, the demodulation is in principle not required, but it comes for free when forming the sub-band spectra.
where X is the DCT of the input speech, b denotes the sub-band index, and first_b, last_b
are the limits of the chosen spectral samples. The process is illustrated in Fig. 6.2. The
selected samples are padded with zeros to the length 2N − 1, which is needed for the LP
model: LP calculates the autocorrelation using the Wiener–Khinchin theorem and the padded
zeros prevent it from being cyclic.
Figure 6.2: Illustration of forming frequency sub-bands from the DCT spectrum. A band b
spanning X[first_b] . . . X[last_b] is cut from the full DCT, shifted to zero and padded with zeros.
It should be noted for completeness that before the sub-bands are modeled with linear
prediction, a Gaussian window is applied to the chosen DCT samples (before
appending the zeros). It serves the same purpose as the triangular frequency filters in the
MFCC calculation or the trapezoidal filters in the PLP calculation. Its properties will be
discussed in Section 6.3.4.
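Eq. 6.11 together with the Gaussian window can be sketched as follows; the band limits, window width and names are illustrative assumptions, not the thesis settings.

```python
import numpy as np

def subband_dct(X, first, last, n_total, sigma_frac=0.25):
    """Form one demodulated sub-band spectrum per eq. 6.11: pick DCT samples
    first..last, weight them with a Gaussian window, shift them to zero and
    pad with zeros to length 2*N - 1 so the later autocorrelation computed
    via the Wiener-Khinchin theorem is not cyclic."""
    band = np.asarray(X[first:last + 1], dtype=float)
    m = len(band)
    g = np.exp(-0.5 * ((np.arange(m) - (m - 1) / 2) / (sigma_frac * m)) ** 2)
    out = np.zeros(2 * n_total - 1)
    out[:m] = band * g                 # placing at index 0 = demodulation
    return out

X = np.random.default_rng(2).standard_normal(1001)   # full-band DCT of a segment
Xb = subband_dct(X, first=200, last=300, n_total=len(X))
assert Xb.shape == (2 * 1001 - 1,)
assert np.all(Xb[101:] == 0.0)   # everything beyond the band is padded zeros
```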
6.2.3 FDLP – Frequency-Domain Linear Prediction
FDLP is the core of LP-TRAP feature extraction. Its purpose is to approximate the
sub-band temporal envelopes ε(t) by an autoregressive (AR) model. AR modeling records
the important trend of the modeled function in the coefficients of a FIR prediction filter. FDLP
is a dual form of time-domain linear prediction (TDLP).
Figure 6.3: Duality between time and frequency domains in LP-TRAP. The left column shows
a part of speech, its squared Hilbert envelope and the FDLP model fit. The right column displays
the DCT of the same signal, the power spectrum of the signal and the conventional LPC fit.
The duality between TDLP and FDLP is illustrated in Fig. 6.3. The conventional use
of LPC is to approximate the shape of the vocal tract; TDLP thus models the shape of
the power spectrum. In contrast, FDLP approximates the temporal envelope of the
speech. Typically, LP derives its coefficients by solving the Yule–Walker equations, for which
it needs autocorrelation coefficients. In TDLP the autocorrelation is calculated from the time
signal; in FDLP it is calculated from the DCT spectrum. Further details about FDLP can be
found in [14].
The LP model allows for a large dimensionality reduction. It involves a trade-off
between modeling precision and the number of parameters, which can be controlled by
the LP model order. Similarly to TDLP, in FDLP the smoothing property of the AR model
is desirable to some extent. The FDLP model order will be subject to experimental
optimization.
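The whole FDLP core can be condensed into a short full-band sketch: take the DCT of a segment, compute the autocorrelation of the zero-padded spectrum via the Wiener–Khinchin theorem (eq. 6.10), solve the Yule–Walker system, and read off the temporal envelope as the AR model's "power spectrum". This is a minimal illustration with numpy only (no compression, no sub-bands, hypothetical names), not the optimized C++ fdlp implementation.

```python
import numpy as np

def fdlp_envelope(x, order=20, n_out=101):
    """Fit an AR model to the DCT of a signal and return the modeled
    squared Hilbert envelope sampled at n_out points (illustrative sketch)."""
    n = len(x)
    # DCT via the FFT of the symmetrized signal (Section 6.2.1)
    x_sym = np.concatenate([x, x[-2:0:-1]])
    q = np.fft.fft(x_sym).real[:n]
    # Zero-pad to 2N - 1 so the autocorrelation is not cyclic, then apply
    # the Wiener-Khinchin theorem to the spectrum itself (eq. 6.10)
    qp = np.concatenate([q, np.zeros(n - 1)])
    r = np.fft.ifft(np.abs(np.fft.fft(qp)) ** 2).real / len(qp)
    # Yule-Walker equations: a Toeplitz system in the autocorrelation
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    g2 = r[0] - a @ r[1:order + 1]          # squared model gain
    # By time-frequency duality the AR "power spectrum" is a temporal envelope
    A = np.fft.fft(np.concatenate([[1.0], -a]), 2 * n_out)
    return g2 / np.abs(A[:n_out]) ** 2

# An energy burst in the middle of the segment should dominate the envelope
t = np.linspace(0.0, 1.0, 800)
sig = np.sin(2 * np.pi * 50 * t) * np.exp(-((t - 0.5) / 0.1) ** 2)
env = fdlp_envelope(sig)
assert env.shape == (101,) and np.all(env > 0)
assert env[50] > env[5] and env[50] > env[95]
```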
Trick for modeling peaks and dips in trajectory equally well
An autoregressive model captures spectral peaks well but tends to disregard dips. This
property might be desirable in standard TDLP when modeling the vocal tract, but in FDLP
the dips need to be modeled equally well as the peaks. This could be satisfied by an ARMA
model (autoregressive moving average) at the cost of an inelegant iterative procedure.
Fortunately, there is a powerful workaround which preserves the desirable analytic calculation
of the AR model and at the same time allows for an arbitrary weighting between the precision
in peaks and dips. It is an intermediate nonlinear operation, a compression of the spectrum
in TDLP or of ε(t) in the case of FDLP. It was proposed by Hermansky et al. [50].
Consider a discrete envelope of length N. Instead of minimizing an error as a
function of the analyzed envelope ε[n] and the modeled envelope ε̂[n], we minimize an
error employing transformed variables:

E = (G²/N) Σ_{n=1..N} ε[n] / ε̂[n]   →   E_T = (G_T²/N) Σ_{n=1..N} T[ε[n]] / T[ε̂[n]],   (6.12)

where G stands for the gain of the AR model and T[·] is the non-linear transform. The inverse
transform has to be applied when evaluating the power spectrum of the AR model (or the
approximated envelope in the FDLP case):

ε̂[m] = T⁻¹[ G_T² · c / |1 + Σ_{k=1..K} a_k e^(−2πjkm/M)|² ],   m = 1 . . . M,   (6.13)

where a_k are the LP coefficients of order K, M is the number of envelope samples and
c is a normalization constant. Note that the modeled trajectory can be evaluated in an
arbitrary number of points M, which does not necessarily need to match N (except for the
modeling stage, eq. 6.12).
In FDLP, the transform T[·] represents a compression, T[·] = (·)^cmpr with cmpr < 1; for
cmpr = 1 it defaults to no compression. Its influence on the envelope shape will be
illustrated in Section 6.3.1. To be able to implement the compression, the autocorrelation
needs to be calculated using eq. 6.10⁴.
4To be strict, due to the compression, R(t) (or the discrete R[n]) cannot be called an autocorrelation
anymore, but it remains referred to as such for simplicity.
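The effect of the compression pair T, T⁻¹ on the dynamic range the all-pole model has to cover can be seen on toy numbers (values illustrative):

```python
import numpy as np

cmpr = 0.1
T = lambda v: v ** cmpr               # compression applied before the AR fit
T_inv = lambda v: v ** (1.0 / cmpr)   # expansion applied to the model output

e = np.array([1e-4, 1e-2, 1.0, 1e-2, 1e-4])  # an envelope with deep dips
assert np.allclose(T_inv(T(e)), e)           # T and T^-1 form an exact pair

# Compression shrinks the peak-to-dip ratio the all-pole model must cover,
# so the dips no longer get neglected relative to the peaks
ratio = e.max() / e.min()
ratio_c = T(e).max() / T(e).min()
assert ratio_c < ratio
assert np.isclose(ratio_c, ratio ** cmpr)
```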
6.2.4 Free parameters in algorithm
Considering the above details, the LP-TRAP scheme can be redrawn as in Fig. 6.4. Red
numbers represent "bandwidths", i.e. the sizes of the data vectors. From the DCT sub-bands up
to the LPC, the sizes refer to one sub-band only. The sub-band DCT sizes include the padded zeros.
Figure 6.4: Detailed LP-TRAP feature extraction scheme. Red bracketed numbers in
italics denote data vector sizes; fixed-width magenta font denotes the parameters of the
algorithm (len, step, blap, cmpr, fp, ncep, traplen).
There are several free parameters in the system that could be optimized. Some of
them will be optimized in experiments and some of them will stay fixed. Those staying
fixed have either been optimized earlier, or their optimization is not critical.
Parameters not to be optimized:
• Speech segmentation step step. It will stay fixed at 10 ms for consistency with other
speech features.
• Compression of sub-band Hilbert envelopes cmpr distributing the LPC modeling
power between peaks and dips. This parameter has been optimized in earlier works
[15].
• Number of cepstral coefficients ncep per band. This parameter has also been opti-
mized in [15].
• The LP-TRAP sampling interval, or the LP-TRAP length traplen. The FDLP
envelopes, being the frequency responses of the LP filters, can be evaluated in an
arbitrary number of points. The traplen used defaults to the standard TRAP size of 101
points. It will not be optimized, as FDLP cepstra will be shown to outperform
FDLP envelopes.
Parameters to be optimized:
• Input segment length len. Initial choice is roughly 1 second (8200 samples of 8 kHz
speech). Shorter segments might be more suitable.
• Bank of band-pass filters applied to DCT. The filters are Gaussian-shaped and uni-
formly spaced on Bark frequency axis at 1 Bark intervals. The free parameter is
their width (overlap), controlled by blap factor.
• LPC order fp offering fp/2 poles per band for modeling.
6.3 Experimentally optimizing LP-TRAP
After the LP-TRAP feature extraction had been implemented in the standalone C++
application fdlp, the free parameters were optimized on the S/N task. The best recorded settings
were also evaluated on CTS. The experimental setup was the same as for the baseline
TRAP architecture (refer to Section 4.4 for details). For the purpose of comparison,
the conceptually similar TRAP and TRAP-DCT systems were used (Sections 4.4 and 5.1.3).
Recall their respective accuracies: 19% FER, 4.7% WER and 20% FER, 4.2% WER.
6.3.1 Sub-band envelope compression and LPC order
The influence of the envelope compression factor cmpr and the number of poles fp on the ASR
accuracy was already scanned by Athineos in [15]; the best values found (cmpr = 0.1, fp =
50) were adopted as the initial setting here. Their influence on the FDLP model is demonstrated
in Fig. 6.5.
Figure 6.5: 1000 ms LP-TRAPs. Left part: compression of the sub-band dynamics prior to
LP modeling (fp = 30); cmpr < 0 inverts the dynamics and the AR model resembles an MA model.
Right part: various LP orders for cmpr = 0.1.
6.3.2 Sampled FDLP temporal envelopes vs. FDLP cepstra as features
In the LP-TRAP technique, the sub-band temporal envelopes are parametrized by frequency-domain
linear prediction. The FDLP model is determined by the coefficients of the FDLP filters.
The sampled frequency responses of these filters represent sub-band energy trajectories (LP-TRAPs).
However, since the FDLP coefficients are available, the LP-TRAPs as such
no longer seem to be the best possible representation: it was shown previously,
and also in this work, that coefficients representing sub-band modulation spectra typically
outperform TRAPs. Conventionally, such coefficients are obtained by the DCT
applied to temporal log-energy envelopes (TRAP-DCT). In the LP-TRAP technique, these
modulation spectral coefficients, or FDLP cepstral coefficients (denoted c[k] in Fig. 6.4),
can be elegantly obtained directly from the FDLP coefficients via a recursion.
For these reasons, FDLP cepstra were presumed to work better than sampled FDLP
envelopes. To support the decision to focus only on cepstral features with some experimental
results, a comparison of two LP-TRAP feature sets with different settings of the free parameters,
using either cepstra or envelopes, is shown in Tab. 6.1. The precise settings of len, step,
blap, cmpr, fp and ncep are not important for now; hence the settings are called simply
I and II.
Features FER [%] WER [%]
I – envelope 18.3 4.6
I – cepstra 18.2 4.1
II – envelope 18.0 4.3
II – cepstra 18.3 3.9
Table 6.1: Performance of sampled LP temporal envelopes vs. LP cepstra as parametriza-
tion in LP-TRAP under two different settings of free parameters (I and II).
This experiment was actually carried out after all the optimizations described in the next sections had been completed, so it acts as a sanity check with the optimized settings in use. The “envelope” settings use 101 samples of the LP-TRAP per band as MLP inputs; the “cepstra” settings use 50 FDLP cepstral coefficients per band. Cepstral features consistently outperformed envelope features in word error rate, not only in the two presented experiments. Therefore, only cepstral features were used in further experiments.
6.3.3 Input window length
The dependence of recognition accuracy on the LP-TRAP input segment length len was evaluated. As LP-TRAPs are conceptually similar to TRAPs, for which the optimal length ranges between 300–1000 ms, only two lengths were evaluated: 1000 ms and 500 ms. To preserve the MLP topology, LP order 50 was used in both cases and the Band-MLPs were trained on 50 cepstral coefficients. Tab. 6.2 shows that the 500 ms LP-TRAP performs better than the 1000 ms one.
To further support this observation, an additional experiment was run which, instead of keeping the MLP topology constant, preserved the “pole rate” (the ratio of the LP model order to the segment length). Even though the MLP managed more parameters, this case did not outperform the 500 ms input either (the FER even got worse). The suggested segment length is thus 500 ms.
LP order fp Segment length len [ms] FER [%] WER [%]
50 1000 19.0 4.7
50 500 18.3 4.3
100 1000 20.7 4.4
Table 6.2: Influence of LP-TRAP input segment length on FER and WER.
6.3.4 Overlap of frequency sub-bands
Fifteen Gaussian filters uniformly spaced on the Bark scale were applied to the DCT of the signal. Specifically, the full frequency scale from 0 Bark to fbmax Bark (where fbmax is the Bark equivalent of fs/2 Hz) was split into 16 intervals and the filters were centered at fbcent,i = i · fbmax/16, i = 1 . . . 15. Here fb denotes frequency in Bark units; the relation to the linear Hertz axis is
f = 600 sinh(fb/6.0) [Hz, Bark]. (6.14)
The frequency dependence of the gain of the ith filter is given by
gi(fb) = exp( −(1/2) (fb − fbcent,i)^2 · 10^blap ), (6.15)
with a floor at -48 dB. The blap factor varied between -0.5 and +1.0, see Fig. 6.6. Its
effect on FER and WER is shown in Fig. 6.7.
[Figure: three panels of filter gains (0–1) over f = 0–4000 Hz, for blap = −0.5, 0.375 and 1.0.]
Figure 6.6: Bank of filters in LP-TRAP. Influence of blap factor on filter widths.
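The filter bank of Eqs. 6.14–6.15 can be sketched as follows (the helper names are ours, and reading the −48 dB floor as a lower bound on the linear gain is our interpretation):

```python
import numpy as np

def bark_from_hz(f):
    # Inverse of Eq. 6.14:  f = 600*sinh(fb/6)  =>  fb = 6*arcsinh(f/600)
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def gaussian_bark_filters(fs=8000.0, n_filters=15, blap=0.375, n_bins=129):
    """Gaussian filters uniformly spaced on the Bark axis (Eq. 6.15),
    evaluated on a linear frequency grid of n_bins points up to fs/2."""
    fb = bark_from_hz(np.linspace(0.0, fs / 2.0, n_bins))
    fb_max = bark_from_hz(fs / 2.0)
    # 16 intervals on [0, fb_max]; filters centered at the 15 inner points
    centers = [i * fb_max / (n_filters + 1) for i in range(1, n_filters + 1)]
    g = np.array([np.exp(-0.5 * (fb - fc) ** 2 * 10.0 ** blap)
                  for fc in centers])
    return np.maximum(g, 10.0 ** (-48.0 / 20.0))  # floor at -48 dB
```

With blap = 0.375 the half-gain bandwidths of neighboring filters meet approximately without overlap, as discussed below.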
Figure 6.7: WER and FER as a function of blap factor in LP-TRAP.
The results suggest that the width of the overlapping frequency sub-bands is an important factor which can largely affect recognition accuracy. Its optimal choice (yielding WER = 4.1%) seems to lie around blap = 0.375. If the bandwidth of the filters is defined by a −3 dB drop in gain (corresponding to a gain of 0.71 in Fig. 6.6), then the optimal blap = 0.375 approximately represents a bank of filters adjacent to each other without overlap.
6.3.5 LP model order
With the other values fixed (len = 500 ms, cepstral order ncep = 50, blap = 0.25 and envelope compression cmpr = 0.1; at the time this experiment was run, the optimized value of blap was not yet known), the only parameter left free was the LPC order fp,
which varied from 20 to 150. The dependencies in Fig. 6.8 suggest that though the LP order does not seem to be critical, it should not be lower than about 30. The best performing value (WER = 4.1%) was observed for fp = 70, offering 35 poles for modeling a 500 ms sub-band envelope.
Figure 6.8: WER and FER as a function of LP order fp in LP-TRAP.
6.3.6 Evaluating optimized features on S/N and CTS tasks
Sections 6.3.1 – 6.3.5 optimized the values of the free parameters of the LP-TRAP algorithm. The best settings found on the S/N task are:
• len = 500 ms (input segment length),
• fp = 70 (LP order),
• blap = 0.375 (sub-band overlapping factor),
• cmpr = 0.1 (sub-band envelope compression factor for LP),
• ncep = 50 (number of output cepstral coefficients).
Performance of these features was compared to the baseline TRAP-DCT on S/N task
(Tab. 6.3) and on CTS task (Tab. 6.4).
Features FER [%] WER [%]
LP-TRAP 18.2 4.1
TRAP-DCT 19.7 4.2
Table 6.3: Performance of optimized LP-TRAPs on S/N task.
Devel set Test set
Features FER[%] WER[%] WER[%]
LP-TRAP 39.1 53.1 50.5
TRAP-DCT 39.2 53.3 51.2
Table 6.4: Performance of optimized LP-TRAPs on CTS task.
On both tasks LP-TRAP performed about the same as TRAP-DCT. This is a very encouraging result, as the proposed novel features perform no worse than one of the best published long-term representations. On the other hand, it raises the question of why the LP-TRAPs are not even better.
6.4 Warping time axis in LP-TRAP
One of the major motivations for LP-TRAPs was the potential of very accurate temporal
modeling. This section experiments with the temporal resolution of LP-TRAP and pro-
poses further improvement by warping the temporal axis to devote more modeling power
to the central parts of the trajectory.
6.4.1 Temporal resolution of LP-TRAP
Let us inspect the temporal resolution of LP-TRAP with an artificial signal. It is formed by two pulses sampled at 8 kHz, each 10 samples long (1.3 ms), with a 10 ms pause between them (see the top pane in Fig. 6.9). The signal is processed by the LP-TRAP and TRAP algorithms in 15 critical bands. The six plots below show the energy envelope in the 5th band. The leftmost pane illustrates the sub-band Hilbert envelope from the LP-TRAP calculation. Note that it contains the information which separates the pulses.
[Figure: top pane shows the test signal; the six panes below show the Hilbert envelope (4000 samples), LP-TRAP fp=50 with 51 points, LP-TRAP fp=50 with 501 points, TRAP with 50 points, LP-TRAP fp=20 with 501 points, and LP-TRAP fp=20 with 51 points and warp=2.0.]
Figure 6.9: Illustration of temporal resolution of LP-TRAP vs. TRAP.
It can be observed that:
• TRAP cannot distinguish the pulses, as it is calculated from 25 ms long windows.
• LP-TRAP with a 70th-order FDLP fits both peaks; however, 51 samples of the envelope are not enough to separate them.
• LP-TRAP cepstrum of order 70 can separate the pulses (the envelope with 501 samples is in principle calculated from the LP-TRAP cepstrum).
• LP-TRAP with a 20th-order FDLP cannot distinguish the peaks.
The rightmost pane of the figure shows a preview of the FDLP fit on a pre-warped temporal axis. The warping acts as a magnifying glass sliding over the signal and makes it possible to capture both peaks effectively even with an FDLP of order 20. It also makes it possible to distinguish both peaks in 51 samples of the LP-TRAP envelope.
6.4.2 Two ways to warp time axis
The idea of a “magnifying glass” can be implemented in two ways.
• Considering that the LP coefficients in LP-TRAP capture more temporal detail than the final sampled envelope (as shown in the last section), one could evaluate the LP-TRAP envelope with a denser sampling (see the two panes with LP-TRAP fp=70 in Fig. 6.9) and subsequently apply the non-uniform binning introduced for warped TRAPs (section 5.2.1, p. 48). By doing so, the final envelope would scale up the resolution in the center at the expense of the boundaries, preserving the original number of samples. However, this would only optically increase the resolution in the sampled envelope, not in the LP model itself.
• To capture more detail in the LP model itself, the temporal axis has to be warped prior to LP modeling, see Fig. 6.10. The sub-band Hilbert envelope has the full available resolution and can be non-uniformly resampled to yield a warped envelope, which is subsequently modeled by LP. Thus, LP is forced to devote more modeling power to the center of the window, and both the model parameters and the final LP-TRAP envelope can capture more detail.
[Figure: block diagram of warped LP-TRAP extraction: speech → sub-band DCT (15 bands) → sub-band Hilbert envelopes via FFT/iFFT (sizes marked “?”) → non-uniform resampling (IN = linear, OUT = warped) → compression | · |^(2·cmpr) → “autocorrelation” → LPC → a[k] + gain → LP-cepstra c[k], or FFT of the zero-padded a[k] and gain → log → LP-TRAPs.]
Figure 6.10: Warping temporal axis in LP-TRAP feature extraction.
Note: One might think of warping the input signal directly instead of the envelope. This is not possible due to the sub-band processing.
6.4.3 Non-linear time warping and sampling theorem
We will adopt the second warping approach. In a more detailed view, there is an issue to be solved: the proper sampling (denoted by the red exclamation marks in Fig. 6.10). What should
be the proper FFT and iFFT sizes and how to actually do the non-linear mapping?
Non-warped case
Let Nin be the size of the FFT calculating the sub-band Hilbert envelope and Nout the size of the iFFT calculating the autocorrelation. Without warping, Nin is determined by the sub-band DCT size NDCT, which is zero-padded to the length Nin = 2NDCT − 1 to avoid aliasing in the autocorrelation (the spectral compression is not considered). For FFT purposes, the nearest higher power of 2 is used.
Nout is determined by the LP order fp,
Nout ≥ 2fp − 2, (6.16)
because LP uses only the first fp samples of the autocorrelation and because the iFFT output is symmetric, having only Nout/2 + 1 non-redundant samples. Such Nout is typically lower than Nin. A mismatch between Nin and Nout can thus safely be avoided by setting both to the greater of the two.
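This bookkeeping can be sketched in a few lines (the helper names are ours):

```python
def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def fdlp_fft_size(n_dct, fp):
    """Non-warped case: N_in follows from zero-padding the sub-band DCT
    to 2*N_DCT - 1 samples (no autocorrelation aliasing), N_out from
    the LP order via Eq. 6.16; the mismatch is resolved by using the
    greater of the two for both transforms."""
    n_in = next_pow2(2 * n_dct - 1)
    n_out = next_pow2(2 * fp - 2)
    return max(n_in, n_out)
```

For instance, a 0.5 s sub-band at 8 kHz (NDCT = 4000) with fp = 70 gives Nin = 8192 and Nout = 256, so both sizes are set to 8192.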
Warped case
With warping, the reasoning is a little more complex. The following phenomena have to be considered:
• The warping function “zooms in” on the center, so the needed resolution of the envelope increases, and so must the FFT size in order to interpolate all the needed samples. In theory, for proper resampling the required Nin would be very high (see footnote 6). An approximation is to find the maximal slope α of the sampled warping function and set Nin ≥ αNout.
• Recall that Nout is given by Eq. 6.16 and that Nin ≥ αNout. When Nin is larger than Nout, many input samples of the Hilbert envelope are mapped to only a few samples of the warped envelope. This in turn requires that the linear Hilbert envelope be low-passed prior to the projection. However, a uniform low-pass filter would discard the needed details in the center and the whole idea would be lost. There is, however, a reasonable approximation to the non-uniform low-pass: one can bin together several neighboring samples of the linear envelope. The process is clarified in the following section.
6.4.4 Warping function
The mapping is done using the same function as for TRAPs (refer to section 5.2.1, p. 47). Recall its analog version
f(x) = 1 − (1 − x)^w, x ∈ ⟨0, 1⟩, (6.17)
6 For two constant sampling frequencies the minimal up-sampling factor is given by their least common multiple. Here, one of the frequencies changes, which in theory requires calculating the factor for every possible position on the warped axis and taking the least common multiple of all the particular factors. Such resampling is clearly not feasible.
Discretization
The input Hilbert envelope of size Nin is real, positive and symmetric (due to the FFT), which means it has Min = Nin/2 + 1 meaningful points. The same applies to the output warped envelope, which has Mout = Nout/2 + 1 meaningful points. Min and Mout are always odd, which means that both boundary points, as well as the middle point of the envelope, are warping-independent. The discrete warping function (its left, symmetric half) maps every output sample of the warped axis to an input sample:
in[k] = Nin/4 − round[ (Nin/4) · ((Nout/4 − k) / (Nout/4))^w ], k = 0, . . . , Nout/4. (6.18)
Note that the factor 4 comes from 2×2: once for the symmetric spectrum and once for taking only a half of the warping function. The function has to be center-symmetrized about the shared point k = Nout/4. Hence, the final function maps Mout to Min samples.
Required FFT size
To derive the required Nin for a given Nout, the slope of the discrete warping function has to be evaluated from two points near the center:
slope = { [1 − ((N/4 − N/4)/(N/4))^w] − [1 − ((N/4 − (N/4 − 1))/(N/4))^w] } / (N/4)^(−1)
= (N/4) { [1 − 0] − [1 − (4/N)^w] } = (N/4) (4/N)^w = (N/4)^(1−w), (6.19)
where N = Nout. Thus the required input FFT size is
Nin ≈ ceil(Nout/slope) = ceil[ 4 (Nout/4)^w ]. (6.20)
Example: A 0.5 s trajectory sampled at 8 kHz is to be warped with w = 1.5 and modeled by an FDLP of order fp = 70. Using Eq. 6.16 we get Nout ≥ 138; the next power of 2 gives Nout = 256. Eq. 6.20 yields
Nin = ceil( 4 · (256/4)^1.5 ) = 2048,
so the maximum zoom in the center is 8×. For a stronger warp w = 2.0 we get Nin = 16384 and a zoom of 64×.
FFT complexity grows as n log n, which in a real implementation limits the usable warp factors to about w = 2.0.
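Eqs. 6.18 and 6.20 are easy to check numerically; a sketch reproducing the worked example above (function names are ours):

```python
import math

def warp_index(k, n_in, n_out, w):
    """Discrete warping function, Eq. 6.18 (left symmetric half):
    maps output sample k = 0..n_out/4 to an input sample index."""
    return n_in // 4 - round((n_in / 4)
                             * ((n_out / 4 - k) / (n_out / 4)) ** w)

def required_n_in(n_out, w):
    """Required input FFT size, Eq. 6.20: N_in ~ ceil(4 * (N_out/4)^w)."""
    return math.ceil(4 * (n_out / 4) ** w)
```

The endpoints are warping-independent: warp_index maps k = 0 to input sample 0 and k = Nout/4 to Nin/4 for any w.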
Properly resampling linear input to non-linear output
Eq. 6.18 projects output samples to input samples of the envelope. However, as mentioned above, the mapping does not sample from the input directly; instead, a binning is done, as illustrated in Fig. 6.11. Consider the point “2” in the figure. Instead of sampling the input only at the specific position denoted by the small arrow (and violating the sampling theorem), an average of all samples between points A and B is taken. The actual positions of these interlaced points can be obtained from Eq. 6.18 simply by substituting k with s = k ± 1/2.
Figure 6.11: Illustration of binning process in warping LP-TRAPs.
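A sketch of the binning on the left half of the envelope (the names and the exact handling of the bin edges are our assumptions; the bin boundaries follow Eq. 6.18 evaluated at s = k ± 1/2):

```python
import numpy as np

def warp_pos(s, n_in, n_out, w):
    # Continuous counterpart of Eq. 6.18, evaluated at fractional s.
    return n_in / 4 - (n_in / 4) * ((n_out / 4 - s) / (n_out / 4)) ** w

def binned_resample_half(env, n_out, w):
    """env: left half of the linear envelope (n_in/4 + 1 samples).
    Each warped output sample k is the average (bin) of the input
    samples lying between the images of s = k - 1/2 and s = k + 1/2."""
    n_in = 4 * (len(env) - 1)
    out = np.empty(n_out // 4 + 1)
    for k in range(n_out // 4 + 1):
        lo = warp_pos(max(k - 0.5, 0.0), n_in, n_out, w)
        hi = warp_pos(min(k + 0.5, n_out / 4), n_in, n_out, w)
        i0, i1 = int(np.floor(lo)), int(np.ceil(hi))
        out[k] = env[i0:max(i1, i0 + 1)].mean()  # non-empty bin average
    return out
```

A constant envelope passes through unchanged, which is a quick check that the bins tile the input without gaps.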
6.4.5 Experiments
The warping was tested first on the S/N task and chosen configurations were subsequently evaluated on the CTS task. The configuration common to all experiments was: segment length len = 500 ms with a step = 10 ms frame shift, sub-band overlap blap = 0.375, dynamic compression cmpr = 0.1. There were three free parameters:
• LP order fp; the initial value was 70.
• Cepstral feature size ncep; the initial value was 50 (chosen by analogy with the TRAP-DCT features, where this size performed reliably; refer to Fig. 5.4).
• Warping factor warp ≥ 1. Values below 1.0 deteriorated the performance in preliminary experiments, consistently with warped TRAPs (section 5.2.1).
Goals:
1. To find out whether the warping improves performance. This was tested with fixed fp = 70 and ncep = 50. Higher values should not be necessary, and if lower values were optimal for the warped case, the observation should not be biased anyway, as more features generally do no harm.
2. To learn the influence of lowering the LP order fp. Without warping, performance with low fp should be worse than with high fp. With warping, two extreme cases might occur: 1) If the performance improved and approached the linear case with high fp, it would mean that only the center of the envelope is important and the temporal resolution of LP-TRAP without any warping is good enough. 2) If the performance did not improve with warping, it would suggest that a low-order LP basically cannot sufficiently capture the envelope shape.
3. To learn the influence of lowering the cepstral order ncep on performance. Experience with TRAP-DCT on the S/N task suggests that a low cepstral order might not harm. But what happens when the temporal axis is warped?
As the warped LP-TRAPs are computationally demanding, the error surface was sampled only at certain combinations of the three parameters.
S/N task – results
The S/N task setup (MLP and HMM) was consistent with the previous experiments (refer to the baseline TRAP architecture, section 4.4). All possible combinations of fp = 20/70, ncep = 15/50 and warp = 1.0/1.5/2.0 were evaluated. The results are summarized in Tab. 6.5.
Tab. 6.5.
LP order Cepstral size Warp. factor
fp ncep warp FER[%] WER[%]
70 50 1.0 18.7 4.3
70 50 1.5 18.7 4.3
70 50 2.0 18.6 4.3
70 15 1.0 19.0 4.5
70 15 1.5 18.4 4.0
70 15 2.0 18.5 4.5
20 15 1.0 19.0 4.3
20 15 1.5 18.5 4.7
20 15 2.0 18.4 4.4
Table 6.5: Performance of warped LP-TRAPs as a function of fp, ncep, warp. S/N task.
The S/N task did not fulfill the expectations. None of the results was significantly worse or better than the default settings (the first row), including a random sampling of the error surface at interlaced positions (not displayed). This can be explained by the fact that the S/N task vocabulary does not contain speech with fast transients, except for the plosive “t” in “two”. There might thus be no need for better temporal modeling.
CTS task – results
Contrary to the S/N task, the CTS evaluation displayed a clear and consistent picture. The setup followed the baseline TRAP architecture (refer to section 4.4.2). The following observations can be made from Tab. 6.6:
LP order Cepstral size Warp. factor WER[%]
fp ncep warp Devel set
70 50 1.0 52.6
70 50 1.75 51.4
70 15 1.0 55.0
70 15 1.75 52.6
20 15 1.0 54.8
20 15 2.0 53.8
Table 6.6: Performance of warped LP-TRAPs as a function of fp, ncep, warp. CTS task.
• The warping notably improved the performance in all cases. The best WER was reached for fp=70, ncep=50 (the default values) and warp=1.75 (1.2% better than without warping). On the CTS task this is the best word error rate out of all the compared long-term representations (TRAP, TRAP-DCT, M-RASTA).
• Lowering the LP order to fp=20 deteriorated the performance. Warping was helpful, but the score of LP order 70 was not reached. This suggests that a low-order LP model is not able to capture the important details in the envelope, even when it is warped.
• Lowering the cepstral order to ncep=15 provides quite interesting answers.
With LP order 70 and linear time, using only the first 15 cepstral coefficients per band was clearly insufficient (compare rows 1 and 3). However, when warping was applied, just 15 coefficients matched the score of 50 coefficients per band (compare rows 1 and 4)! This suggests that warping compresses the important information into the lower cepstral coefficients.
The full 50 cepstral coefficients were still the better choice (compare rows 2 and 4), which means that the feature size of 50 per sub-band is not redundant and that the warping helps the LP model to better capture the detailed information from the central parts of the window.
6.5 Conclusion
The original idea of Marios Athineos of representing speech by means of sub-band modulation spectra obtained from sub-band Hilbert envelopes smoothed by linear prediction was implemented in the fdlp tool. The method does not impose any sampling constraints on the spectrum of the speech (as opposed to the 100 Hz frame rate, i.e. 10 ms step, of conventional approaches) and therefore allows for precise localization of temporal events.
Using fdlp, the method was optimized on the S/N and CTS tasks, resulting in features performing comparably to the best long-term representations (TRAP-DCT) known to the author. It can be concluded that:
• LP-TRAP cepstral features perform significantly better than sampled LP-TRAP envelopes. This is consistent with the TRAP and TRAP-DCT observations.
• LP-TRAP cepstra markedly outperform TRAP-DCT (by about 1.4%) on the CTS task. However, on the S/N task both features perform about the same. The better temporal localization of events in LP-TRAPs than in TRAPs thus seems useful in more complex tasks, which are able to utilize the detailed information.
• Pre-warping the temporal axis in order to stress the central part of the 500 ms trajectories allows the temporal envelopes to be modeled sufficiently well by fewer cepstral coefficients than in the linear case (15 instead of 50 coefficients on the CTS task), thus reducing the feature bandwidth by 70% without loss in performance. In addition, when the bandwidth is not an issue, using all 50 coefficients can markedly improve the score over the linear case (a 1.2% improvement was achieved on the CTS task).
It should be noted that the warping has not yet been thoroughly explored. More appropriate warping functions could be suggested and optimized by a more careful scanning of the error surface. There is thus a potential for further improvement.
Chapter 7
Multi-resolution RASTA filtering
(M-RASTA)
This chapter introduces a novel speech representation for ASR. The technique extends earlier works on RASTA filtering by applying a bank of two-dimensional band-pass filters as a pre-processing step in TANDEM feature extraction. The filters are applied to an auditory-like speech spectrogram and the set of resulting spectra is projected to phoneme posteriors using an MLP. Since the filters have zero-mean impulse responses, the technique is inherently robust to linear distortions.
7.1 Introduction – M-RASTA from different perspectives
Before presenting the M-RASTA feature extraction, let us summarize related works and
their ideas.
Relationship to TRAP and TANDEM
M-RASTA extracts features from the spectro-temporal plane similarly to TANDEM or TRAP and benefits from combining both architectures. It uses a temporal context of up to 1000 ms like TRAPs, yet uses only one MLP like TANDEM. The long context in principle makes it possible to preserve the complete information about a phoneme, and using one MLP with inputs from all bands makes it possible to learn virtually any inter-band relationships. Finally, there is a nice property inherited from both systems: it was shown in this work, and also by other authors, that PLP–TANDEM with 9 frames of PLP performs better than TRAP on clean speech. However, contrary to TRAPs, PLP coefficients are not robust to linear distortions, as the DFT involved in the cepstral calculation causes any distortion (even one confined to a narrow band) to affect all coefficients. In M-RASTA, the MLP has access to the spectra (as in TRAPs), though filtered with temporal filters, and therefore the robustness to linear distortions is preserved.
Modulation Spectrum and Robustness
The log-energy in a critical sub-band has its own dynamic structure, which is recorded in a raw form in the TRAP vectors. The spectrum of a TRAP (the modulation spectrum) quantifies to what extent the dynamics contain slow and fast changes. This modulation domain is exactly where strong a priori knowledge can be applied, making it possible to partially separate speech from other unwanted artifacts.
It was already mentioned that the vocal tract has an intrinsic inertia which prevents very fast movements, while too slow movements are not efficient for communication. Experiments with human and machine recognition (reviewed in section 3.2) found that the active modulation range of speech lies between 1.5–16 Hz with a maximum around 4 Hz. On the contrary, non-speech artifacts often lie outside this range: channel noise is typically stationary or relatively slow-changing, while incidental noises such as clicks and bangs can be very fast. Band-pass filtering of the modulations can thus significantly improve robustness, as proven by RASTA filtering. M-RASTA aims at preserving this property.
Temporal Decomposition of Energy Trajectory
From the point of view of temporal decomposition, M-RASTA is closely related to TRAP-DCT. There, DCT bases weighted by a Hamming window are convolved with TRAPs, which is effectively a temporal decomposition onto a set of cosine bases. M-RASTA differs from DCT in that it applies bases – or filters – derived from the Gaussian function. Experiments from section 5.1.4 on page 43 suggested that better projection bases than those of TRAP-DCT could possibly be found. It seemed desirable for the length of a particular cosine base to be proportional to its frequency; in other words, it was suggested to use a projection base of constant shape and only vary its width. M-RASTA implements this idea by using impulse responses with similar wavelets differing only in width.
Emulating Cortical Receptive Fields
One of the inspirations for the M-RASTA filters were the differently motivated, though related, efforts of Kleinschmidt and Gelbart. They used two-dimensional time-frequency Gabor filters and explicitly stated the relation of such speech processing to the known physiology of the auditory cortex. They aimed at a simplified version of Shamma’s model of cortical processing [66].
7.2 M-RASTA features
The combination of temporal filters and frequency filters applied to the auditory spectrogram can be interpreted as 2-D filtering of the spectro-temporal plane. In M-RASTA, the 2-D filtering is implemented by first processing the critical-band trajectories with temporal filters and subsequently applying frequency filters to the result; see the diagram in Fig. 7.1.
1. The critical-band auditory spectrum is obtained in the same way as in TRAP (see section 5.1.1, p. 37).
2. The temporal trajectory of energy in each sub-band is filtered with a bank of fixed-length low-pass FIR filters. Their impulse responses represent Gaussian functions of several different widths.
3. The first and second temporal differentials are computed from the smoothed trajectories, yielding a set of N modified spectra every 10 ms (labeled “t-filtered spectra” in the diagram). The same filter bank is used for all bands.
[Figure: diagram of the pipeline: auditory spectrogram (critical bands × time frames) → per-band temporal filtering with FIR banks → t-filtered spectra → frequency filtering with 3-tap FIRs {−1, 0, +1} and {−0.5, 1, −0.5} → TANDEM probability estimator.]
Figure 7.1: M-RASTA feature extraction scheme.
4. From each of the N modified spectra the first and second frequency derivatives are calculated, yielding two additional feature streams (labeled ∆ and ∆2 in the diagram).
5. The feature vector forming the input to the TANDEM MLP is then obtained by concatenating all feature streams. The MLP is trained to estimate phoneme posteriors.
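The steps above can be sketched as follows (a simplified illustration rather than the exact tool used in the thesis; the function names are ours, the temporal filters are applied directly as the Gaussian derivatives of Eqs. 7.1–7.2, and the utterance is assumed to be at least as long as the 101-tap filters):

```python
import numpy as np

def gauss_derivs(sigma_ms, n_taps=101, step_ms=10.0):
    """Impulse responses g1, g2: first and second derivatives of a
    Gaussian (Eqs. 7.1-7.2), sampled on a +-500 ms grid and
    peak-normalized (our choice of the scaling constants k1, k2)."""
    x = (np.arange(n_taps) - n_taps // 2) * step_ms
    g1 = -x * np.exp(-x ** 2 / (2.0 * sigma_ms ** 2))
    g2 = (x ** 2 - sigma_ms ** 2) * np.exp(-x ** 2 / (2.0 * sigma_ms ** 2))
    return g1 / np.abs(g1).max(), g2 / np.abs(g2).max()

def mrasta_frame_stream(spectrogram, sigmas_ms=(8, 11, 16, 32, 45, 64, 90)):
    """spectrogram: (n_frames, n_bands) log critical-band energies,
    n_frames >= 101. Returns the concatenated M-RASTA streams:
    temporally filtered spectra plus their frequency derivatives."""
    streams = []
    for sigma in sigmas_ms:
        for g in gauss_derivs(sigma):
            # zero-phase temporal filtering, band by band
            filt = np.apply_along_axis(
                lambda t: np.convolve(t, g, mode="same"), 0, spectrogram)
            d1 = np.apply_along_axis(    # frequency delta, taps {-1, 0, +1}
                lambda s: np.convolve(s, [-1, 0, 1], mode="same"), 1, filt)
            d2 = np.apply_along_axis(    # second delta, taps {-0.5, 1, -0.5}
                lambda s: np.convolve(s, [-0.5, 1, -0.5], mode="same"), 1, filt)
            streams += [filt, d1, d2]
    return np.concatenate(streams, axis=1)
```

With 15 bands, 7 widths, 2 filter shapes and 3 streams per filter, each 10 ms frame yields 630 MLP inputs.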
7.2.1 Temporal filters
Instead of filtering the sub-band trajectories with the low-pass Gaussian function and subsequently computing the differentials, the trajectories are directly filtered with the first and second derivatives of the Gaussian, which represent impulse responses of band-pass filters. The impulse responses are obtained by sampling the analytic derivatives of the Gaussian, which are given by
g1[x] = −x · exp(−x²/(2σ²)) · k1, (7.1)
g2[x] = (x² − σ²) · exp(−x²/(2σ²)) · k2, (7.2)
where x is time, x ∈ ⟨−500, 500⟩ ms with a step of 10 ms, the standard deviation σ determines the effective width of the Gaussian, and k1, k2 are scaling constants. The
derivatives will be further referred to as g1 and g2. Filters with low σ values (high-pass) have finer temporal resolution; high-σ filters (low-pass) cover a wider temporal context and yield smoother trajectories. All filters are zero-phase FIR filters, i.e. they are centered around the frame being processed. The length of all filters is fixed at 101 frames, corresponding to roughly 1000 ms of signal, thus introducing a processing delay of 500 ms.
The first and second derivatives of the Gaussian function have zero mean by definition. By using such impulse responses we gain an implicit mean normalization of the features within a temporal region proportional to the value of σ, which confers robustness to linear distortions. Impulse responses given by Eq. 7.1 are shown in the left part of Fig. 7.2; the right part shows impulse responses given by Eq. 7.2. The respective frequency responses are illustrated in Fig. 7.3.
Since the impulse responses are discretized and limited in length to 101 samples, the real Gaussian derivatives are approximated with a certain error, which increases towards both extremes of the σ range. For small σ the sampling is too sparse; for large σ there are significant discontinuities at the endpoints due to the finite truncation of the infinite
Figure 7.2: Normalized impulse responses of the first two sampled and truncated Gaussian
derivatives g1 and g2 for σ = 8 – 130 ms.
Figure 7.3: Normalized frequency responses of the first two sampled and truncated Gaus-
sian derivatives g1 and g2 for σ = 8 – 130 ms.
Gaussian function, both introducing a DC offset, see Fig. 7.4. Note that g1 has odd symmetry and thus always has zero mean, but the sampled and/or truncated g2 may have a non-zero mean.
Figure 7.4: Detail of the first two sampled and truncated Gaussian derivatives for σ = 6 ms
(left), σ = 8 ms (center), σ = 130 ms (right).
The limits for σ were found using the somewhat arbitrary criterion that the DC offset of the sampled impulse response must not exceed 10% of the maximal absolute value of the response. This preserves the normalizing property important for robustness. The resulting range is σ ∈ (6, 130) ms. The σ values used in the experiments are spaced logarithmically.
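The criterion can be reproduced numerically. A sketch in which we read the “DC offset” as the sum of the filter taps relative to the peak magnitude (our interpretation; only g2 needs to be checked, since the odd-symmetric g1 always sums to zero):

```python
import numpy as np

def dc_offset_ok(sigma_ms, n_taps=101, step_ms=10.0, tol=0.1):
    """True if the sampled, truncated g2 (Eq. 7.2) has |sum of taps|
    within tol * peak magnitude. For too small sigma the sparse
    sampling breaks the zero-mean property; for too large sigma the
    truncation at +-500 ms does."""
    x = (np.arange(n_taps) - n_taps // 2) * step_ms
    g2 = (x ** 2 - sigma_ms ** 2) * np.exp(-x ** 2 / (2.0 * sigma_ms ** 2))
    return abs(g2.sum()) <= tol * np.abs(g2).max()
```

Under this reading, widths inside the quoted (6, 130) ms range pass the 10% test, while clearly smaller or larger widths fail at the respective extremes.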
7.2.2 2-D – time-frequency filters
The first and second frequency derivatives are approximated by 3-tap FIR filters with impulse responses {−1, 0, +1} and {−0.5, 1, −0.5}, introducing a three-Bark frequency context. The combination of temporal and frequency filters yields 2-D filters. Examples of their impulse responses with 101×3 taps are shown in Fig. 7.5. The vertical axis is smoothed to illustrate the filtering effect on the original spectrum (prior to the critical-band integration).
Figure 7.5: Example of impulse responses of 2-D RASTA filters with σ = 60 ms.
7.3 Experiments
This section chronologically presents the development progress on the S/N task, from testing individual filters and combining more filters, to evaluations of M-RASTA with optimized settings on the S/N and CTS tasks.
7.3.1 Filtering with single filter
The aim of the first experiment was to learn the effect of temporal filtering of the auditory spectrogram with a single filter. It proceeded as follows:
• Auditory spectra were calculated from the speech.
• For each filter shape (g1, g2) and chosen width σ (varying logarithmically from 6 ms to 130 ms):
– the auditory spectra were processed with the filter,
– a new MLP was trained using the modified spectra,
– new HMMs were trained,
– FER and WER were evaluated.
Setup details
• 15 critical-band energies,
• 101-tap FIR filters (max. 1000 ms length), 9 different σ values,
• MLP topology of 15 × 1800 × 29 units.
If no filter were applied, the MLP would be trained on the plain auditory spectra. Such a system achieves 50.5% FER and 19.1% WER and serves as a baseline. The results of the g1 and g2 filtering are shown in Fig. 7.6.
Figure 7.6: M-RASTA with only one filter: FER and WER dependencies on filter width.
Observation
• All dependencies are quite smooth. The only outlier point is g2 for σ = 6 ms. It
is probably caused by too sparse sampling of the narrow Gaussian derivative (see
Fig. 7.4). Because of this, the used σ range was limited to 8–130 ms for all further
experiments.
• Similar to the experiments in Chapter 5, WER criterion prefers much faster changes
than FER criterion. The best FER (52%) was observed for the second derivative g2
with a wide σ = 64 ms. The Best WER (11.0%) was observed for the first derivative
g1 with a narrow σ = 10 ms.
• When used as the only MLP input, the filtered spectrum can perform markedly
better than the plain spectrum, even though the DC component has been removed
(recall 11.0% vs. 19.1%). This observation is encouraging, as it indicates considerable
potential for resistance to channel noise without any loss of accuracy.
• Minima of all dependencies fall within the chosen σ range, suggesting that neither
longer impulse responses nor finer sampling of the input spectra is needed.
7.3.2 Combining two temporal filters
Having settled the range of filter widths σ, the next concern was combining the filters.
All possible pairs of filters with seven different σ values (0.8, 1.1, 1.6, 3.2, 4.5, 6.4, 9.0)
and both shapes g1 and g2 were formed. The MLP input size doubled to 30 features. For
every σ and filter pair g1−g1, g1−g2, g2−g2, a new system was trained and evaluated.
See Fig. 7.7 for the results.
Observation
• Filter pairs of the same shape (g1−g1, g2−g2) generally performed better for
distant σ values, because the features are more complementary; see the top and
bottom rows of Fig. 7.7.
Figure 7.7: Combining two temporal filters gx–gy in M-RASTA. Performance as a function
of widths σ. Legend = width of gx [ms], horizontal axis = width of gy [ms].
• Observations from the previous experiment hold: 1) g2 filters consistently reached
better FER than g1 filters and the opposite applied to WER; 2) FER preferred wider
filters, σ ≈ 32−90 ms, while WER preferred narrower filters, σ ≈ 8−32 ms.
• Heterogeneous pairs (g1−g2) outperformed homogeneous pairs in FER and were
comparable in WER.
• For heterogeneous pairs, distant σ values were no longer preferred, suggesting that a
different filter shape brings more complementary information than a distant σ. This
is further supported by the next point.
• The best FER came from longer σ, specifically σ1 = 32 ms, σ2 = 90 ms (σn denotes
the width of the gn filter). This matches the previous experiment with only one filter,
see Fig. 7.6, left.
• The best WER came from narrow σ in the range 8–32 ms.
The best scores for each of the three filter combinations are summarized in Tab. 7.1.
Filters FER [%] σ1 σ2 WER [%] σ1 σ2
g1−g1 47 1.6 6.4 6.5 0.8 3.2
g2−g2 39 3.2 9.0 8.1 1.1 4.5
g1−g2 32 4.5 6.4 6.5 1.6 0.8
Table 7.1: Best reached FER and WER for three feature pair types with their respective
σ values.
The experiment confirmed the complementarity of different temporal filters. Their
combination improves recognition accuracy.
7.3.3 Tuning the system for best accuracy
Having sampled the properties of the temporal filters, we can proceed with optimization.
The starting knowledge is:
• Both filter shapes g1 and g2 should be employed.
• Widths should fit the range σ ∈ (8, 130) ms.
• Combining multiple filters improves accuracy.
First, a bank of 16 filters was formed, consisting of 8 first-order and 8 second-order
derivatives of the Gaussian function (g1 and g2). Each derivative had 8 widths placed
equidistantly on a logarithmic scale, σ = 8.0, 11.9, 17.7, 26.4, 39.4, 58.6, 87.3, 130.0 ms.
The bank was applied to all 15 temporal trajectories of critical-band log-energies,
yielding 16 × 15 = 240 spectral features per frame. These formed the main feature
stream, the t-stream.
The other two feature streams were formed by applying two 3-tap FIR filters across
frequencies to the outputs of each of the 16 filters, as illustrated in Fig. 7.1, which
together represent 2-D filters. Frequency derivatives for the first and last critical bands
are not defined, so we ended up with two additional feature streams, ∆f and ∆2f, each
of size 16 × 13 = 208 features.
Subsequently, three MLPs were trained with respective inputs:
• 240 features – t stream only,
• 448 features – t+∆f streams, 240 + 208 features appended,
• 656 features – t+∆f+∆2f streams, 240 + 208 + 208 features appended.
The MLP topology differed among the three setups only in the input layer size: N ×
1800 × 29 units. As a comparative baseline, TRAP-DCT (101 frames, 51 features) was
used, because it is conceptually similar (refer to section 5.1.3). Its accuracy is 19.7% FER
and 4.2% WER. The results of the M-RASTA systems are in Tab. 7.2.
Used feature streams FER [%] WER [%]
t 19.3 4.3
t+∆f 17.4 3.4
t+∆f+∆2f 17.8 3.6
Table 7.2: Adding frequency derivatives to M-RASTA features.
Observation
• M-RASTA with only temporal filters did not outperform the baseline.
• Augmenting the t-stream with frequency derivatives ∆f brought a large improvement,
resulting in a system outperforming all competitors (PLP, PLP-TANDEM, TRAP,
TRAP-DCT, LP-TRAP). Although the ∆f features do not bring in any new information,
they explicitly introduce inter-band context, which seems to be necessary.
• Adding the ∆2f features decreased the performance, so they were not used further.
NOTE: The substantial gain from the ∆f-stream invites training an MLP on that stream
alone. Doing so results in 20.8% FER and 5.0% WER. This means that the improvement
is caused by the complementarity of the streams, rather than by the performance
of the derivatives ∆f themselves. Combining both streams is thus essential.
How many filters?
So far it is not clear how many temporal filters should be used. The somewhat arbitrary
count of 2 × 8 deserves experimental verification.
Two more systems were trained with the successful t+∆f features, one containing
2 × 4 temporal filters and the other containing 2 × 16 filters. In both cases, the widths
of the filters were again distributed logarithmically from 8 ms to 130 ms; the difference
between the systems was the density of the filters. A comparison of the two alternatives
and the default setting is given in Tab. 7.3.
Number of filters Features FER [%] WER [%]
2 × 4 224 18.0 3.9
2 × 8 (default) 448 17.4 3.4
2 × 16 896 17.4 3.7
Table 7.3: Looking for a suitable number of temporal filters in M-RASTA.
It seems that 8 different widths were close to the optimum; the other counts performed
worse. Note that 2 × 16 filters require quite a large MLP. Had the overall MLP size
stayed constant, the performance would have been 18.4% FER and 3.9% WER.
Evaluation on CTS
Two successful setups of M-RASTA were evaluated on CTS:
• M-RASTA 240 features (from 2×8 temporal filters at σ = 8–130 ms) and
M-RASTA 448 features (from 2×8 temporal filters plus their frequency derivatives ∆f).
• MLP topology (240 or 448) × 2000 × 46 units.
Results of M-RASTA compared to TRAP-DCT are given in Tab. 7.4. M-RASTA and
TRAP-DCT perform about the same. The explicit frequency relations between bands (∆f)
no longer seem to be crucial, though they still improve performance. This can be
explained either by the larger size of CTS compared to S/N (about 4× more training data)
or, more likely, by the strong language model, which attenuates any particularities in
the features, as discussed in section 4.4.
Devel set Test set
Features FER[%] WER[%] WER[%]
M-RASTA 240 (t) 37.8 53.7 52.5
M-RASTA 448 (t+∆f) 37.5 53.3 51.4
TRAP-DCT 39.2 53.3 51.2
Table 7.4: Performance of M-RASTA on CTS.
Note on MLP: It could be argued that in the case of 448 features the TANDEM MLP
is quite big. For 240 features the overall MLP size is about 580,000 parameters, which is
comparable to PLP-TANDEM, but 448 features enlarge the MLP by about 70%. If the
overall size had to be preserved, the hidden layer would have 1160 units and the WER
would degrade by 0.6% (male Tune set), thus approaching M-RASTA 240. In that case,
M-RASTA 240 would probably be the first choice due to its better robustness to channel
noise (refer to the following section).
7.3.4 Robustness to channel noise
To get an idea of how robust the new speech representation is to a stationary channel
mismatch between training and testing data, a first-order preemphasis filter with α = 0.97
was applied to the test data. The distorted data were passed through the existing systems
and word error rate was evaluated. As can be seen in Tab. 7.5, short-term PLP features
are very sensitive to this distortion, and so is TRAP-DCT. However, to be fair, the
TRAPs entering TRAP-DCT could have been mean-normalized, which would boost the
robustness, though possibly at the price of accuracy in matched conditions. M-RASTA
features are quite resistant, especially when no explicit relationship among bands (∆f)
is involved.
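The channel distortion used here is a standard first-order preemphasis, y[n] = x[n] − α·x[n−1] with α = 0.97. A minimal sketch, assuming the first output sample is passed through unchanged (zero history):

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order preemphasis y[n] = x[n] - alpha * x[n-1], used to
    simulate a stationary channel mismatch on the test data.  The first
    sample is passed through unchanged (zero history assumed)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

distorted = preemphasis(np.ones(5))   # a constant signal is almost cancelled
```

Since the filter strongly attenuates low frequencies and boosts high ones, it tilts the spectrum, exactly the kind of linear distortion the zero-mean M-RASTA filters are designed to ignore.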
When the same preemphasis was applied to the CTS male Tune set, performance degraded
by 6.6% WER (absolute) for PLP features, by 0.6% for M-RASTA 240 features, and by
2.1% for M-RASTA 448 features.
Matched Mismatched Relative loss
Features WER [%] WER [%] [%]
PLP 5.2 13.5 160.0
TRAP-DCT 4.2 5.5 31
M-RASTA 240 (t) 4.3 4.4 2.1
M-RASTA 448 (t+∆f) 3.4 3.6 4.0
Table 7.5: Influence of channel mismatch on performance of M-RASTA and other features.
S/N task.
7.3.5 Modulation frequency properties
As shown in Fig. 7.3, the applied RASTA filters cover a wide range of modulation
frequencies, up to 50 Hz (the spectra are sampled at 100 Hz). To get an insight into the
relative importance of various modulations for ASR of continuous digits from the
M-RASTA point of view, the modulation range was shrunk from both the lower and the
higher ends by modifying σ. To keep the number of free parameters in the system constant,
the shrinking was accompanied by reducing the spacing between the filters' center
frequencies, so that even for the narrowest range there were still 2×8 filters, as
illustrated in Fig. 7.8.
Figure 7.8: Shrinking the modulation bandwidth in M-RASTA by limiting σ range. Blue
with pluses = gradually omitting wide filters (only fast modulations preserved), red with
circles = gradually omitting narrow filters (only slow modulations preserved).
Determining bandwidth
For each set of filters g1,2 associated with a set of σi, i = 1 . . . 8, the bandwidth was
determined as follows. Frequency responses of all filters were normalized to a maximum
gain of 0 dB, without loss of generality1. Subsequently, the most extreme filters in the
filter bank were found, and their frequencies of 3 dB attenuation were taken as the
bandwidth limits. This is illustrated in Fig. 7.9 for a simple bank with two filters.
1Consider that the MLP weights the outputs of the individual filters so as to minimize a global error
during training; it can therefore compensate for any differences in gain. Besides, the input data to the
MLP are mean and variance normalized.
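The bandwidth determination can be sketched as follows. This is a minimal NumPy sketch, assuming a 100 Hz sampling of the temporal trajectories and a simple −3 dB threshold on the union of the normalized filter responses; names are illustrative:

```python
import numpy as np

def bank_bandwidth(impulse_responses, fs=100.0, n_fft=4096):
    """-3 dB bandwidth of a filter bank sampled at fs Hz.

    Each frequency response is normalized to a 0 dB maximum; the bank's
    limits are the lowest and highest frequencies at which any filter in
    the bank is still within 3 dB of its own peak."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    within_3db = np.zeros(len(freqs), dtype=bool)
    for h in impulse_responses:
        mag = np.abs(np.fft.rfft(h, n_fft))
        gain_db = 20.0 * np.log10(mag / mag.max() + 1e-12)
        within_3db |= gain_db >= -3.0
    idx = np.where(within_3db)[0]
    return freqs[idx[0]], freqs[idx[-1]]

# a simple first-difference filter: -3 dB point near fs/4, peak at Nyquist
lo, hi = bank_bandwidth([np.array([1.0, -1.0])])
```

For a real M-RASTA bank the lower limit comes from the widest filter and the upper limit from the narrowest one.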
Figure 7.9: Determining bandwidth of a bank with two filters. Illustrated with g1,2 filters
at σ=130 ms.
The mapping between σi and the bandwidth of the associated g1,2 filters is shown in
Fig. 7.10: if the whole filter bank acts as a low-pass, its cutoff frequency is determined
by the upper limit of the narrowest σ in the bank, and vice versa.
Figure 7.10: Mapping between σ and bandwidth of associated g1,2 filters.
Results
The original modulation bandwidth was shrunk by cutting off either the high-frequency
or the low-frequency content. For every filter bank from Fig. 7.8, a new recognizer
was trained and WER and FER were evaluated. The results in Fig. 7.11 suggest that:
• Eliminating a significant part of the low modulation frequency range (up to 4 Hz) has
no noticeable effect on WER, while even a moderate cut in the high modulation
frequencies (down to 19 Hz) is detrimental (see the right pane of the figure). This
observation matches Drullman's HSR experiments [33].
• For FER, low modulation frequencies are more important, while higher modulation
frequencies can be eliminated with only a minor effect on FER (see the left pane of the
figure). A range of approximately 1.5–8 Hz appears to contain most of the relevant
information for the frame-level classification. This is consistent with the observations
of the approaches reviewed in section 3.2.
Figure 7.11: FER (left) and WER (right) as a function of the cutoff frequency of filter
banks acting as low-pass or high-pass in the modulation spectrum.
7.3.6 Discrepancy between MLP and HMM – phoneme posteriograms
Two evaluation criteria have been used in this thesis. Frame error rate is a measure of
MLP reliability in estimating phoneme (or generally class) posteriors. Optimizing features
with respect to this criterion ensures the best match between the estimated posteriors
and the underlying sequence of phonemes. Since the posteriors form the input to the HMM
decoder, which is evaluated by the standard word error rate criterion, the two criteria were
supposed to be consistent. However, various experiments and features from this thesis
suggest that this is not the case. In fact, a consistent discrepancy was observed, related
to the modulation properties of speech.
Recall Fig. 7.11 and Fig. 7.7 from M-RASTA and also Fig. 5.4 (pp. 42) from TRAP-
DCT. In all cases, the FER criterion “preferred” longer and smoother input corresponding
to slow modulations, whereas the WER criterion required rather shorter inputs with
faster modulations. It is interesting to compare posteriograms of the speech (i.e. the
evolution of the estimated posteriors in time) for two features, one optimized for FER
and the other for WER. Fig. 7.12 shows a rather extreme case: two posteriograms of the
same utterance “nine” as estimated by two M-RASTA MLPs, each using only two
temporal filters (refer to section 7.3.2 for details). The first features gave the best FER
and the second the best WER:
Optimized for σ(g1) [ms] σ(g2) [ms] FER [%] WER [%]
Best FER 45 64 32 10.1
Best WER 32 16 48 6.6
The displayed utterance comes from the labeled training data. The HMM framework
recognizes test digits at 10.1% WER using features similar to the upper plot and at 6.6%
WER using features similar to the lower plot.
We can speculate that the “Best FER” MLP smooths the posteriors so that the trend
can be better seen by the human eye, but that might not be what HMMs need.
Figure 7.12: Posteriograms for utterance “nine”. Horizontal phoneme sequence represents
truth (10 ms step). Posteriors were optimized for the best FER (upper plot) and for the
best WER (lower plot).
HMMs model phoneme transitions by themselves and might prefer to get more detailed,
though noisier, information. ASR pursues the main goal of finding the best word sequence,
and any intermediate products are disregarded.
Such thoughts could suggest withdrawing the TANDEM architecture and returning to the
idea of the hybrid approach, where MLP probabilities replace the Gaussian mixture model,
but that would be a step back (refer to pp. 16). Yet there is a different, though perhaps
slightly peculiar, idea: abandoning the HMM framework and recognizing words purely
with MLPs. It will be presented in the next chapter.
7.4 Combining LP-TRAP and M-RASTA
The previous chapter introduced LP-TRAP features with two key properties:
1. Energy envelopes in sub-bands are formed by sub-band Hilbert envelopes. This
brings high temporal resolution.
2. Sub-band energies are modeled by LP cepstra.
In this chapter, trajectories of auditory spectra were passed through a bank of M-RASTA
filters of different widths. Narrow filters were shown to be essential for accuracy, yet
how narrow they could be was limited by the spectral sampling frequency of 100 Hz.
This suggests combining the two techniques: sub-band Hilbert envelopes offer resolution
down to milliseconds, so the M-RASTA impulse responses can be sampled much more
finely and their time constants can be shorter.
Hilbert-M-RASTA features
The combination of LP-TRAP and M-RASTA is characterized by three processing steps:
1. Long segments of sub-band Hilbert envelopes are extracted from the speech
(the FDLP model is not used).
2. M-RASTA t + ∆f filters are applied to the logarithm of Hilbert envelopes.
3. Processed spectra are fed to the TANDEM MLP.
Figure 7.13: Scheme of Hilbert-M-RASTA feature extraction.
The feature extraction procedure illustrated in Fig. 7.13 was implemented in a standalone
C++ application, hilbert gauss, allowing for efficient and general experimentation.
The core blocks from the fdlp and pfile gauss tools were reused, and the executable runs
fast thanks to the FFTW library [1]. The viability of the features was tested on the S/N
and CTS tasks: fifteen sub-band Hilbert envelopes were extracted in 1000 ms segments
with a 10 ms step, the sub-bands were selected by Gaussian filters on the Bark scale
with blap = 0.395 (refer to section 6.2 for details), and the best performing M-RASTA
t + ∆f filters (refer to section 7.3.3) yielded 448 features, which were fed to the
TANDEM MLP.
Features FER [%] WER [%]
LP-TRAP 18.2 4.1
M-RASTA 448 17.4 3.4
Hilbert-M-RASTA 17.0 4.0
Table 7.6: Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on the S/N task.
Devel set Test set
Features FER[%] WER[%] WER[%]
LP-TRAP 39.1 53.1 50.5
M-RASTA 37.5 53.3 51.4
Hilbert-M-RASTA 37.2 52.3 50.8
Table 7.7: Comparing LP-TRAP, M-RASTA and Hilbert-M-RASTA on the CTS task.
Tab. 7.6 and 7.7 suggest that the default settings do not bring any major improvement.
It would be more interesting to employ M-RASTA filters that are shorter in time.
Unfortunately, this raises an issue of proper temporal sampling: the time shift between
successive 1000 ms frames would have to be reduced in order not to violate the sampling
theorem. However, this would raise the frame rate and hence the complexity of the
system. This could still be dealt with (e.g. by post-processing the MLP posteriors and
decimating, or by multi-rate processing), though it has not been done within this thesis.
7.5 Conclusion
A novel feature extraction technique was developed based on multiple 2-D filtering of
time-frequency plane.
The filters are determined by their impulse responses, which are all zero-mean, implying
robustness to linear distortions of the signal and to changes in spectral tilt that could
be induced by extra-linguistic factors, thus inherently alleviating one of the major sources
of harmful variability in speech.
Experiments with small-vocabulary and mid-vocabulary ASR have shown that 2-D
multi-resolution RASTA filtering in conjunction with TANDEM feature extraction is an
efficient means of representing message-specific information in speech.
• In digit recognition (S/N task) the approach outperformed all competitive features
(PLP, PLP-TANDEM, TRAP, TRAP-DCT, and LP-TRAP) and proved robust
to linear distortions.
• In conversational telephone speech recognition (CTS task) it approaches the upper
bound of accuracy achieved by competitive long-term features. Though not shown
in this chapter (see Appendix B for more information), M-RASTA features contain
information complementary to short-term features, as they can significantly improve
accuracy on CTS when combined with PLP features.
Chapter 8
Extensions: Towards recognition
by means of keyword spotting
This chapter presents an alternative approach to ASR in which each targeted word is
classified by a separate binary classifier against all other sounds. To build a recognizer for
N words, N parallel binary classifiers are applied. The system first estimates uniformly
sampled phoneme posterior probabilities; in a second step, a rather long sliding time
window is applied to the phoneme posteriors and its content is classified by an MLP to
yield the posterior probability of the keyword. On a small-vocabulary task, the system
does not yet reach state-of-the-art performance, but its conceptual simplicity and its
inherent resistance to out-of-vocabulary sounds may prove a significant advantage in
many applications.
After presenting the principle of the approach, an alternative recognizer of digits will
be built using eleven parallel keyword spotters. This recognizer will be evaluated on a
modified S/N task and compared to the HMM baseline. Subsequently, it will be exposed
to unconstrained speech containing many out-of-vocabulary words.
8.1 Introduction
Since the early attempts at ASR, the task has been to recognize words from a closed
set. As any non-native speaker of a language may testify, human speech communication
based on this approach would be impossible. Daily experience suggests that not all words
in a conversation, but only a few important ones, need to be accurately recognized for
satisfactory speech communication among human beings. The important keywords tend
to be rarely occurring words with high information value. Human listeners can identify
such words in the conversation and possibly devote extra effort to their decoding. In a
typical ASR system, on the other hand, the acoustics of frequent words are likely to be
better estimated in the training phase, the language model is also likely to substitute
rare words with frequent ones, and this is finally reinforced by the typical judging
criterion, word error rate, which rates the most important words equally with common
phrases of the least information value. As a consequence, important rare words are less
likely to be well recognized. Keyword spotting has the potential to address this issue by
focusing only on certain words while ignoring the rest of the acoustic input.
In this chapter, the ASR is seen as a task where the main goal is to find the target
words in an acoustic stream while ignoring the rest.
8.2 Detecting a word in two steps
The proposed approach works in the following steps (illustrated in Fig. 8.1).
1. Equally-spaced posterior probabilities of phoneme classes are estimated from the
signal.
2. The probability of a target keyword is estimated from the sequence of phoneme
posteriors. The probability is smoothed by a matched filter.
Figure 8.1: Scheme of hierarchical keyword spotting.
In the first step, Multi-resolution RASTA features are calculated from the speech and
fed to an MLP (further referred to as Phoneme MLP) which has been trained to estimate
posterior probabilities of 29 phonemes every 10 ms.
The second processing step replaces the common HMM decoding framework with a
second MLP (further referred to as the Keyword MLP) with multiple inputs and two
complementary outputs. It projects a relatively large span of the phoneme posteriogram
(about 1000 ms) onto the posterior probability of the given keyword being present in the
center of the time span. Thus, the input to the Keyword MLP is a 2929-dimensional
vector (29 phoneme posteriors within a context of 101 frames). By sliding the window
frame by frame, the phoneme posteriogram is converted into a keyword posteriogram.
A typical keyword posteriogram is shown in Fig. 8.2.
Figure 8.2: Example of keyword posteriogram.
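The sliding-window construction of the Keyword MLP input described above can be sketched as follows. This is a minimal NumPy sketch, assuming edge padding at the utterance boundaries (the boundary handling is not specified in the text; names are illustrative):

```python
import numpy as np

def keyword_inputs(posteriogram, context=101):
    """Flatten a sliding window of the phoneme posteriogram into Keyword
    MLP input vectors: 29 posteriors x 101 frames = 2929 dimensions per
    output frame.  Utterance edges are padded by repetition (an assumption;
    the boundary handling is not specified in the text)."""
    n_frames, _ = posteriogram.shape
    half = context // 2
    padded = np.pad(posteriogram, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + context].ravel() for i in range(n_frames)])

post = np.random.rand(40, 29)      # dummy posteriogram: 40 frames, 29 phonemes
X = keyword_inputs(post)           # one 2929-dim vector per frame
```

Each row of X is one input to the Keyword MLP, so the keyword posteriogram retains the 100 Hz frame rate of the phoneme posteriogram.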
Training targets for keyword MLP
The Keyword MLP is trained against hard targets. They are set to “1” at all frames
spanning from the beginning to the end of the targeted word; otherwise they are “0”.
Hence, the estimated posteriors should register the keyword presence regardless of the
position within the keyword. In fact, such targets are necessary due to the discriminative
nature of MLP training: if only the frames exactly at the word centers were labeled “1”,
the fraction of positive examples among the training frames would be negligible and the
MLP training would converge to the degenerate solution of constant zero.
8.2.1 From frame-based estimates to word level
Even though to a human eye the frame-based posterior estimates usually clearly indicate
the presence of the underlying word, the step from frame-based estimates to word-level
estimates is very important. It involves a nontrivial operation of information-rate
reduction (carried out subconsciously by human visual perception while studying the
posteriogram), where the equally sampled estimates at the 100 Hz rate are to be reduced
to non-equidistant estimates of word probabilities. In a conventional (HMM-based)
system, this is accomplished by searching for an appropriate underlying sequence of
hidden states.
Matched filters
Here a more direct approach was adopted, which postulates the existence of matched
filters for the temporal trajectories of word posteriors. The impulse responses of the
filters reflect the average contour of the posteriors in a 1-second interval centered at the
keyword. For every keyword, the impulse response of its matched filter was obtained as
follows.
1. All instances of the particular keyword in the training set were found and their
centers were localized.
2. One second long trajectories of targets were formed, centered at the keyword center.
3. These trajectories were averaged.
In computing the averages, we need to deal with cases where the window contains
more than one instance of the keyword. For simplicity, these segments were not included
in the calculation. Resulting filters are shown in Fig. 8.3.
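The averaging procedure above can be sketched as follows. This is a minimal NumPy sketch, assuming windows overlapping a second keyword instance have already been discarded and that windows running off the utterance ends are skipped (names are illustrative):

```python
import numpy as np

def matched_filter(targets, centers, span=101):
    """Average 1 s (101-frame) stretches of the binary keyword targets,
    centered on each keyword occurrence.  Windows containing more than one
    occurrence are assumed to have been discarded beforehand, and windows
    running off the ends of the utterance are skipped."""
    half = span // 2
    pieces = [targets[c - half:c + half + 1]
              for c in centers
              if c - half >= 0 and c + half < len(targets)]
    return np.mean(pieces, axis=0)

targets = np.zeros(300)
targets[95:106] = 1.0                       # one keyword around frame 100
h = matched_filter(targets, centers=[100])  # 101-tap impulse response
```

Averaging over many real instances produces the smooth bump-shaped responses of Fig. 8.3, whose width reflects the typical duration of each keyword.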
Decision about keyword presence
The raw posteriors, as estimated by Keyword MLPs, were smoothed with the appropriate
matched filters. As a consequence, local maxima (peaks) of each filtered trajectory indi-
cated that the given word was aligned with the impulse response. The position of the peak
then indicated the center of the word. The value in the peak could be used as an estimate
of confidence that the keyword was present. The final decision was taken by comparing
the peak value to a threshold, which had been derived from the data during training. The
details of the decision-making were subject to the experiments and will be clarified later.
Fig. 8.4 illustrates the process.
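The peak-picking decision can be sketched as follows. This is a minimal NumPy sketch of thresholded local-maximum detection; the exact tie-breaking and the calibration of the alarm threshold are not specified here, and names are illustrative:

```python
import numpy as np

def detect_keywords(smoothed, threshold):
    """Return (frame, value) for every local maximum of the matched-filtered
    keyword posterior that exceeds the alarm threshold."""
    x = np.asarray(smoothed, dtype=float)
    interior = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])
    is_peak = np.concatenate(([False], interior, [False]))
    hits = np.where(is_peak & (x > threshold))[0]
    return [(int(i), float(x[i])) for i in hits]

trajectory = np.array([0.0, 0.2, 0.8, 0.3, 0.1, 0.6, 0.9, 0.4, 0.0])
alarms = detect_keywords(trajectory, threshold=0.5)   # peaks at frames 2 and 6
```

The frame index of each alarm gives the estimated word center and the peak value serves as a confidence score.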
Figure 8.3: Impulse responses of matched filters for eleven keywords (digits from “zero”
to “nine” plus “oh”).
Figure 8.4: Finding the keyword position from its posterior probability. Circles indicate
potential alarms.
8.3 Experiments
The aims of the experiments were:
• to test the viability of the proposed approach,
• to understand its properties and tune its parameters on development data,
• to evaluate the method on unconstrained speech.
8.3.1 Training and testing data sets
The experiment was based on OGI Stories and Numbers95. However, compared to the
S/N task setup, the distribution of the files changed. There were four disjoint data sets:
Train set 1 – 208 files from Stories (2.8 hrs) with frame-level phoneme labels (matches
with MLP set 1 from S/N task),
Train set 2 – 2547 files from Numbers95 containing strings of digits (1.3 hrs) with frame-
level phoneme labels (matches with HMM train set from S/N task),
Devel set – 1433 files from Numbers95 containing strings of 11 digits (1.0 hrs) with word
transcription (subset of Test set from S/N task),
Extraneous set – 129 files from Stories (1.7 hrs) with general speech, with word tran-
scription.
Binary training targets for the Keyword MLPs were created for both Train sets. The
Stories database provides word boundary labels, which were used for Train set 1. For
Train set 2 such labels were not available; a backward mapping from phoneme labels to
words was not possible, hence the word boundaries were obtained from an automatic
alignment using an existing ASR system and the word transcript.
Repeating keywords issue
Utterances in Test set were chosen not to contain repeating keywords. It reveals one
particular problem of the system that still needs to be addressed if the system is to be
used in certain ASR applications. When two subsequent keywords come, the technique is
not likely to detect both of them. The keyword posteriors as estimated by Keyword MLP
usually stay at high level. Even if there was a transient drop in the trajectory, it would
be smeared by the matched filter.
This would not be an issue in the envisioned applications of the system, which merely
require marking the frames containing the keyword, but when the system is treated and
evaluated as a recognizer, it represents a problem. A solution was found by Lehtonen
et al. [68], who omitted the Keyword MLP step and applied matched filters directly to
the phoneme posteriors. They localized peaks in the smoothed posteriogram, which
supposedly represented phonemes, and decoded the targeted words from the sequence of
peaks.
8.3.2 Initial experiment – checking viability
The objective of the initial experiment was to get an idea of the limits of the proposed
system when applied as a simple speech recognizer. The task was to recognize 11 digits
in utterances from the Devel set. The evaluation was done by means of WER. All systems
were required to give approximately the same number of insertions and deletions, which
could be tuned on the Devel set.
The procedure:
The hypotheses about word sequences in tested utterances were obtained as follows.
1. Speech from Train set 2 was projected onto phoneme classes using an existing
Phoneme MLP. This MLP had been trained earlier using 448 M-RASTA features
and MLP sets 1 + 2 from the S/N task (details can be found in section 7.3.3).
2. Eleven independent Keyword MLPs, each with a 1 s long trajectory of
phoneme posteriors at the input (2929 features) and two complementary outputs
(P (present), P (not present))1, were trained on Train set 2 to give frame-wise
keyword posteriors. The MLP topologies were 2929×500×2 units. Subsequently,
speech from the Devel set was passed through the Phoneme MLP and the Keyword
MLPs.
3. Eleven matched filters for the respective keywords were derived by computing the
mean trajectory patterns from Train set 2. The keyword posteriors of Devel set were
then filtered with these filters.
1As the MLP outputs represent probabilities, there must be at least two of them so that they always
sum to one. Further processing utilizes only one output.
90 8 Extensions: Towards recognition by means of keyword spotting
4. The decision about keyword presence was made by comparing the local maxima
of the filtered trajectories to a fixed threshold (the alarm threshold), which was
iteratively tuned to balance insertions and deletions. In every utterance, all peaks
valued above the alarm threshold were considered detected keywords and sorted
by time to yield the final hypothesis.
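Steps 3–4 above, matched filtering of a keyword posterior trajectory followed by thresholded peak picking, can be sketched as follows. This is a minimal illustration: the function name, the local-maximum test, and the synthetic data in the usage note are assumptions, not the thesis code (the actual filters were mean trajectory patterns estimated from Train set 2).

```python
import numpy as np

def detect_keyword(posteriors, template, alarm_threshold):
    """Smooth a frame-wise keyword posterior trajectory with a matched
    filter and return the frame indices of local maxima that exceed the
    alarm threshold (hypothetical sketch of steps 3-4)."""
    # Matched filtering = correlation with the template, i.e. convolution
    # with the time-reversed template.
    smoothed = np.convolve(posteriors, template[::-1], mode="same")
    smoothed /= template.sum()  # keep values in a posterior-like range
    hits = []
    for t in range(1, len(smoothed) - 1):
        if (smoothed[t] > alarm_threshold
                and smoothed[t] >= smoothed[t - 1]
                and smoothed[t] > smoothed[t + 1]):
            hits.append(t)
    return hits
```

With a synthetic trajectory containing one bump around frame 25 and a 5-frame box template, the single reported detection falls on the bump's center.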
When evaluated on Devel set, the system yielded an encouraging 9.9% WER (compare
with the optimized HMM recognizer with M-RASTA features, which yields 3.4% WER).
8.3.3 Simplifying the system – omitting intermediate steps
An interesting question arises: what would happen if we omitted the intermediate
processing steps, M-RASTA filtering and the Phoneme MLP, and estimated the keyword
posteriors directly from critical-band log-energies? The answer is shown in Fig. 8.5.
[Figure: the processing chain critical-band analysis (15 bands) → M-RASTA filtering
(448 features) → Phoneme MLP (29 posteriors) → Keyword MLP → yes/no decision, drawn
three times with progressively fewer intermediate steps; the resulting WERs are 9.9%,
16.2% and 31%.]
Figure 8.5: Omitting intermediate processing steps from hierarchical keyword spotting.
When the keyword posteriors were estimated directly from M-RASTA features (MLP
topology 448 × 3000 × 2 units), the performance dropped to 16.2% WER. Omitting also
the M-RASTA feature computation and training the keyword networks directly on a 1 s
trajectory of critical band energies (MLP topology (101∗15)×1000×2 units) yielded 31%
WER. This suggests that both the hierarchical processing employing high-dimensional
features and the use of intermediate phoneme classes are beneficial.
8.3.4 Optimizing false alarm rate – Enhanced system
The aim of the next experiment was to study the relationship between word error rate
and false alarm (FA) rate, and to optimize the proposed algorithm for the subsequent test
with unconstrained speech.
The task was again to recognize the 11 digits from Devel set. To balance the insertion
rates among keywords, an individual alarm threshold was found for each keyword so as
to give the required FA rate. The dependence of the alarm thresholds on the FA rate is
shown in Fig. 8.6: the lower the threshold, the more alarms produced (either correct
detections or false alarms). Once all 11 thresholds were found for the given FA rate, the
WER was evaluated (see the second column of Tab. 8.1).
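A per-keyword threshold calibration of this kind can be sketched as follows. The thesis does not specify the search procedure, so the rank-based choice below (place the threshold just above the strongest false peak that must be suppressed) is an assumption.

```python
def threshold_for_fa_rate(false_alarm_scores, hours, target_fa_per_hour):
    """Pick the alarm threshold for one keyword so that the false-alarm
    rate measured on development data does not exceed the target.
    `false_alarm_scores` are peak values of the filtered posterior
    trajectory at places where the keyword is not actually present."""
    allowed = int(target_fa_per_hour * hours)  # tolerated false alarms
    ranked = sorted(false_alarm_scores, reverse=True)
    if allowed >= len(ranked):
        return 0.0  # every false peak may pass
    # Set the threshold just above the strongest disallowed false peak.
    return ranked[allowed] + 1e-6
```

Repeating this for each of the 11 keywords yields the threshold curves of Fig. 8.6.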
It can be observed that the proposed approach can act as an ASR system at a 12% WER
level while keeping at most 30 false alarms per hour. On the other hand, the competitive
[Figure: alarm threshold (0–1) versus false alarm rate (15–50 FA/h) for each of the
eleven keywords (one, two, three, four, five, six, seven, eight, nine, zero, oh);
left panel: Initial system, right panel: Enhanced system.]
Figure 8.6: Alarm thresholds as a function of false alarm rate (floored at threshold 0.05).
system      initial     enhanced    HMM
FA/h        WER [%]     WER [%]     WER [%]
20          19          16          53
25          15          11.6        40
30          12.1        10.3        26
40          9.9         9.3         3.4
Table 8.1: WER as a function of false alarms per hour on digits recognition task.
HMM-based ASR system with its best performance of 3.4% WER yields 38 false alarms
per hour; when modified to yield 30 alarms per hour by manipulating its insertion
penalty, its performance degrades to 26% WER. Further efforts to lower the FA rate
in the HMM system degraded the performance yet further (see the fourth column in Tab. 8.1
and also Fig. 8.7). However, it should be noted that the HMM system used here is not
designed to act as a keyword-spotter.
More discriminative training – adding negative examples
When a new set (the enhanced set) of 11 Keyword MLPs was trained on Train set 2 joined
with a subset of Train set 1 in a ratio of about 1:1, the prior probabilities of all keywords
roughly halved. After re-setting the thresholds to the required FA rates, the WER
improved; see the third column of Tab. 8.1. The spheres of operation of the respective
systems are shown in Fig. 8.7. Knees of the curves could be used to identify the optimal
usage for each method.
The comparison of the enhanced and initial systems supports the claim that a sufficient
amount of negative examples is necessary in the discriminative training of classifiers.
One big Keyword MLP instead of individual MLPs
To allow for better discriminability among keywords, one big network with a topology
(29*101)×500×12 units was trained for all keywords at once. The 12 outputs were mapped
to 11 keyword classes plus one non-keyword class. As the outputs were estimates of
posterior probabilities, the 12 values always summed to one. The posteriors were
postprocessed in the same way as in the case of the individual networks.
[Figure: WER [%] (0–30) versus average FA/h (10–50) for the Initial system, the
Enhanced system and the HMM recognizer.]
Figure 8.7: Operation ranges of the proposed system and HMM recognizer.
The observed behavior of the big network was very close to that of the individual Keyword
MLPs (at 40 FA/h, where the individual MLPs reached 9.9% WER, the big MLP was
about 0.5% absolute better). This suggests that explicitly introducing discrimination among
target keywords is not worth losing the independence among target words.
8.3.5 Keyword spotting on frame level
An interesting application of the keyword spotter is to let it mark all frames belonging
to a given keyword. This can be achieved simply by passing the speech through M-RASTA
filtering, the Phoneme MLP and the Keyword MLPs, without any other postprocessing. In this
case the marked segments are not necessarily continuous – there can be gaps at the frame
level. However, the operation can be extremely fast as it only involves a sequence of
matrix multiplications. One can concatenate the marked segments and listen to the output.
Surprisingly, the concatenated speech sounds natural, which suggests that human hearing
is able to subconsciously recover the missing data.
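The frame marking and splicing can be sketched as below; the function name and the frame length of 80 samples (10 ms at 8 kHz) are illustrative assumptions.

```python
def concatenate_keyword_frames(samples, frame_posteriors,
                               frame_len=80, threshold=0.5):
    """Keep every frame whose keyword posterior exceeds the threshold
    and splice the corresponding audio samples together; gaps between
    marked frames are simply dropped (sketch of the frame-level demo)."""
    out = []
    for i, p in enumerate(frame_posteriors):
        if p > threshold:
            out.extend(samples[i * frame_len:(i + 1) * frame_len])
    return out
```

Because this is only a posterior lookup plus slicing, it runs in a single pass over the utterance, which is what makes the live demo below feasible.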
This application has been implemented in a small live demo, where the speaker records
his own voice through a PC microphone and can subsequently listen to the concatenated
segments of any of the eleven spotted keywords.
[Figure: two bar charts over the eleven digits (one, two, three, four, five, six, seven,
eight, nine, zero, oh) – top panel: Digits training data (Train set 2); bottom panel:
joint Digits + Stories training data (Train sets 1 + 2). The bars show correct rejects
(% of frames without keyword), correct spots (% of frames with keyword), false alarms
(% of all frames) and deletions (% of keyword frames).]
Figure 8.8: Frame-level evaluation of Keyword MLPs, Test set.
Frame-level performance of initial and enhanced systems can be seen in Fig. 8.8. The
figure reveals a property of the discriminative training which sometimes may cause prob-
lems: systems trained using more negative examples tend to reject more frames. However,
as was shown above, with appropriate postprocessing this property does not represent
an issue.
8.3.6 Keyword spotting in unconstrained speech
The most interesting situation for a keyword spotting system is when the test data contain
a lot of out-of-vocabulary speech. Such a task was emulated by appending 1.7 hours of
extraneous general speech (Extraneous set) to one hour of digits (Devel set). The standard
evaluation procedure was applied as in the previous ASR tasks, but the extraneous speech
from OGI Stories was labeled as no speech. For the initial MLP system, thresholds yielding
30 FA/h on Devel set were used; for the enhanced system, thresholds yielding 25 FA/h on
Devel set were used.
system            False alarms   % Hits   WER [%]   FOM
Initial system    6208           87.4     91.6      74.0
Enhanced system   1313           85.9     24.5      83.6
Table 8.2: Results on a joint set of digits and unconstrained speech. % Hits = % correctly
spotted/all keywords, FOM = Figure Of Merit (average accuracy over 1–10FA/h).
The two studied systems were exposed to the joint speech. Their performances are
compared in Table 8.2. % Hits represents the fraction of correctly spotted keywords among
all presented keywords. The figure of merit is defined by NIST as an upper-bound estimate
of the keyword-spotting accuracy averaged over 1 to 10 false alarms per hour [100].
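Under the NIST definition quoted above, the figure of merit can be approximated from measured ROC operating points as in this sketch; the piecewise-constant interpolation between points is a simplifying assumption of the sketch, not part of the definition.

```python
def figure_of_merit(roc_points):
    """Average the detection rate (% hits) over operating points of
    1 to 10 false alarms per hour.  `roc_points` maps FA/h to % hits
    measured at that operating point."""
    rates = []
    for fa in range(1, 11):
        # use the best measured operating point not exceeding `fa` FA/h
        usable = [hits for f, hits in roc_points.items() if f <= fa]
        rates.append(max(usable) if usable else 0.0)
    return sum(rates) / 10.0
```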
Observation
• The Enhanced system largely outperforms the Initial system: it reduces the number of
false alarms almost 5× while preserving the number of hits.
• When evaluated in terms of WER of digit recognition in unconstrained speech, the
Initial system failed, but the Enhanced system was still able to recognize 75% of the digits.
Negative training examples for keyword-spotting MLPs seem to be essential for the
system's ability to reject out-of-vocabulary sounds. The ability of the proposed system
to focus only on the words of interest can also be illustrated by the behavior of the
baseline HMM recognizer when exposed to the same task. If the closed-vocabulary HMM
recognizer were forced to recognize the unconstrained speech (though it is not designed to
do so), it would insert 11925 false alarms, bringing its final WER to a rather unacceptable
152%.
It is worth noting that the speech from the Extraneous set contains many items
that are acoustically similar to the keywords, such as numbers (nineteen), compound
words (someone) and other acoustically similar items (too, for). Without higher-level
semantic knowledge, these items cannot be distinguished from the targeted digits. It
is thus questionable whether these “false alarms” are indeed errors; nevertheless, they
were counted as errors in the above evaluations.
8.4 Discussion and conclusion
The compared digit-recognizing HMM system represents a typical closed-vocabulary
system which, when presented with an out-of-vocabulary word, attempts to match it with
a word from its closed vocabulary, yielding a false alarm. This problem is typically
addressed by introducing a measure that provides an estimate of confidence in the decision
about the identity of the underlying word (a difficult research problem on its own).
Here, an attempt was made to develop a simple alternative system based on a set
of parallel discriminative classifiers that is better capable of yielding no output when
presented with an unknown out-of-vocabulary word. The new approach differs from the
current ASR strategies in several aspects:
• The recognizer for N words is built as a system of N parallel discriminative binary
classifiers, each classifying one keyword against all other possible sounds.
• The classification is based on hierarchical processing: first, equally spaced posterior
probabilities of phoneme classes are derived from the signal; then the probability of the
given keyword is estimated from the sequence of phoneme posteriors.
• In contrast to most current ASR systems, no explicit time warping (DTW or Viterbi
alignment) is done. Instead, the binary classifier is trained for word-length invariance
on many examples of the keyword.
It was demonstrated that given a sufficient amount of negative examples for discrim-
inative training, the studied approach can inherently reduce the insertion error problem
on out-of-vocabulary words.
Chapter 9
Summary and conclusion
9.1 Summary of the work
This thesis is oriented towards new approaches in speech recognition utilizing neural net-
works and hidden Markov models. Though the main interest was devoted to the devel-
opment of novel features extracted from spectral dynamics, virtually all parts of the used
ASR framework were questioned:
Front-end:
• Obtaining the spectrogram: The conventional frame-based processing was questioned
using LP-TRAP features, which are able to precisely localize temporal speech events.
• How much of the spectrogram to use: Long-term features use up to 1000 ms context.
Sections on truncating TRAP and warping time axis studied how the ASR-relevant
information is distributed in such a context.
• How to parametrize the spectrogram: A new way of converting the spectrogram into
features, M-RASTA, was proposed.
Probability estimator:
• The suitability of phonemes as target classes for the MLP classifier was examined and
alternative classes were suggested (in appendices C, D).
Decoder:
• An alternative word-recognition system was proposed, based purely on MLP classifiers.
It addresses the issue of out-of-vocabulary words in small-vocabulary ASR. The
system could substitute for an HMM-based decoder in computation-critical applications.
Chapter 5 studied the distribution of information in the time-frequency plane. It was found
that context beyond 1000 ms is rather irrelevant for long-term feature extraction. In TRAP,
for word recognition the context can be reduced down to 200-400 ms without a significant
performance drop on a small vocabulary; for frame-level classification, a minimum of 400 ms
seems needed. The lower the feature dimension, the shorter the optimal TRAP length.
When only a few features are available to describe the spectral dynamics, the first choice
is fast spectral modulations. However, slow modulations still contain complementary
information useful for large feature vectors (in context up to 1000 ms). Most of the
information comes from the center of the time window. By warping time axis, distant
frames can be largely sub-sampled while preserving the resolution near the center, which
can reduce TRAP complexity up to 5 times.
Chapter 6 implemented in C++, optimized, and extended the linear predictive TRAP
features (LP-TRAP). LP-TRAP bypasses speech framing and can preserve fine temporal
structure of energy trajectories in frequency sub-bands. The most important tunable
parameters were optimized on small-vocabulary task. In word-recognition experiments,
LP-TRAP cepstra significantly outperformed LP-TRAP sampled envelopes. LP-TRAP
cepstral features markedly outperformed baseline TRAP-DCT features on conversational
telephone speech, CTS task. Better temporal localization of sub-band events in LP-TRAP
than in TRAPs seems beneficial in more complex tasks (CTS) that are able to utilize the
detailed information. The idea of pre-warping the temporal axis in order to stress the
central part of 500 ms sub-band energy trajectories was shown to help: either it can
considerably reduce the feature bandwidth (by 70% on the CTS task) without loss in
recognition performance, or with the full bandwidth it can markedly improve the
performance over the non-warped case.
Chapter 7 proposed new features for ASR named M-RASTA. The method applies a bank
of 2-D time-frequency filters with varying temporal resolutions to the speech spectrogram,
prior to its projection onto phoneme probabilities by MLP. The impulse responses of the
filters implement the earlier findings about the information in spectro-temporal plane,
namely focusing on the central parts of TRAPs and preserving only the important mod-
ulation spectrum. On small-vocabulary task, M-RASTA outperformed all competitive
features (PLP, PLP-TANDEM, TRAP, TRAP-DCT, LP-TRAP) and proved the antic-
ipated robustness to linear distortions. On the more complex CTS task it approached
the upper bound of accuracy achieved by competitive long-term systems. The M-RASTA
features were shown to be complementary to short-term features, similarly to TRAP and
LP-TRAP features. A detailed study suggested that M-RASTA features mainly use spec-
tral modulations between 4–19 Hz for word recognition and modulations between 1.5–8 Hz
for frame recognition.
Chapter 8 proposed an alternative approach to word recognition, where each word was
classified by a separate binary classifier against all other sounds. The system used only
discriminatively trained MLP classifiers, without applying HMMs or dynamic time warp-
ing. Properties of discriminative training were studied, observing that a certain balance of
positive and negative examples is required for good performance. The efficient and simple
system focuses on capturing only the words of interest and was therefore able to reasonably
reject out-of-vocabulary words. When compared to the best available HMM recognizer on
small-vocabulary ASR, it produced more errors at the word level, yet it was better able
to lower the number of false alarms per target word.
9.2 Original contribution
The main objective of this thesis was to extend the current knowledge about features for
ASR derived from the dynamics of speech spectrum.
Certain new properties of the existing TRAP-related systems were found (the importance
of modulations up to 20 Hz, the possibility to sub-sample distant parts of TRAP), which
made it possible to improve the current approaches in terms of simplicity and performance.
These findings helped to improve LP-TRAP features by warping time axis and also moti-
vated the design of the novel speech representation M-RASTA. The study of the relation-
ship between quality of MLP posteriors and word recognition accuracy was an impulse to
propose a new alternative to the mainstream words-decoding framework.
Besides the theoretical achievements, the author's efforts also gave rise to several open-
source software tools which enable flexible and effective experimentation with various
speech features. Two speech recognition and evaluation tasks used throughout this thesis
were efficiently implemented, parallelized, documented and made available to the involved
research community.
Together with the live recognition demo for hierarchical keyword spotting, the products
of this work are being actively used and further developed [65, 97, 84].
9.3 Conclusion
We decided to study acoustic features for speech recognition since no speech recognizer can
do a good job with bad features. We postulated that auditory-like spectrum completely
preserves the underlying message. We believed that posterior probabilities of phonemes
could help when used as an intermediate step between the speech and the text. We also
believed that wider temporal context could improve the local decisions about what was
pronounced.
Given these assumptions, we asked: How much of the time context can possibly be
useful for features? How is the information actually distributed within such a context? Is
it important to properly model detailed temporal structures? Which dynamic events are
most important for recognition – slow or fast modulations?
In the presented work we answered these questions to some extent and supported
the drawn conclusions by experimental evidence. As the ultimate goal was speech
recognition, we assessed the improvements in terms of recognition rates on standard tasks.
The findings are summarized above and also in individual chapters.
Based on the findings, we can finally conclude that searching for proper ways to ex-
plicitly track temporal processes in speech within acoustic features pays off by significant
improvements in recognition performance in terms of accuracy and robustness against
non-linguistic variability.
Future Research
Although a huge effort has been put into ASR development worldwide, it seems that there
are still many gaps in our knowledge, providing an open space for research. This work
attempted to fill in some of those gaps, which in turn revealed new interesting questions.
Some topics were not fully explored, such as the optimal way of warping the time axis in
LP-TRAP features, or pruning the large feature space resulting from M-RASTA filtering.
These topics as well as the discriminative keyword recognition leave a range of possibilities
for future research and development.
9.4 Acknowledgment
The work was done at Czech Technical University in Prague, Czech Republic, and at
IDIAP Research Institute, Martigny, Switzerland.
The grants supporting the part of research done at FEE CTU Prague were: GACR
102/05/0278 “New Trends in Research and Application of Voice Technology”, GACR
102/03/H085 “Biological and Speech Signals Modeling”, GACR-102/02/0124 “Voice Tech-
nologies for Support of Information Society”, and the research activity MSM 6840770014
“Research in the Area of the Prospective Information and Navigation Technologies”.
The part of research done at IDIAP was supported by DARPA grant “EARS Novel
Approaches” no. MDA972-02-1-0024. The other sources of support were DARPA GALE
program, the European Community AMI and M4 grants, and the IM2 Swiss National
Center for Competence in Research, managed by Swiss National Science Foundation on
behalf of Swiss authorities.
Appendix A
Class coverage of S/N & CTS
tasks.
                 Stories, MLP set 1   Numbers95, MLP set 2
Index   Label      Frames       %        Frames       %
  0     d            5723      0.58         367      0.06
  1     t           19706      1.99       21466      3.56
  2     k           14175      1.43        2746      0.46
  3     dcl         14753      1.49         506      0.08
  4     tcl         27514      2.78       20282      3.36
  5     kcl         17279      1.75        6779      1.12
  6     s           51456      5.20       39826      6.60
  7     z           14845      1.50        6993      1.16
  8     f           17334      1.75       28375      4.70
  9     th           5302      0.54       11692      1.94
 10     v            9892      1.00       14765      2.45
 11     m           22476      2.27          39      0.01
 12     n           42632      4.31       50208      8.32
 13     l           26176      2.65         915      0.15
 14     r           20952      2.12       29711      4.92
 15     w           14961      1.51       20465      3.39
 16     iy          34997      3.54       32289      5.35
 17     ih          38389      3.88       16014      2.65
 18     eh          22473      2.27       13959      2.31
 19     ey          20703      2.09       19102      3.17
 20     ae          27155      2.74         163      0.03
 21     ay          28911      2.92       53950      8.94
 22     ah          54829      5.54       28199      4.67
 23     ao          13537      1.37        4012      0.66
 24     ow          17026      1.72       47588      7.89
 25     uw          12399      1.25       27781      4.60
 26     er          15333      1.55        2638      0.44
 27     ax          11342      1.15         842      0.14
 28     SIL        190831     19.29       72475     12.01
 29     REJ        176417     17.83       29230      4.84
Overall            989518    100.00      603377    100.00
Table A.1: Class coverage of S/N task, MLP sets 1 & 2.
                    male                  female
Index   Label      Frames       %        Frames       %
  0     SIL       1529079     26.67     1382202     24.43
  1     aa          67581      1.18       68108      1.20
  2     ae         196264      3.42      197321      3.49
  3     ah          84119      1.47       87478      1.55
  4     ao          65186      1.14       65041      1.15
  5     aw          45686      0.80       49209      0.87
  6     ax         239126      4.17      235463      4.16
  7     ay         194186      3.39      199003      3.52
  8     b           62175      1.08       60647      1.07
  9     ch          20773      0.36       21474      0.38
 10     d           91351      1.59       95054      1.68
 11     dh          87815      1.53       84389      1.49
 12     dx          28929      0.50       27520      0.49
 13     eh          91584      1.60       93612      1.65
 14     er          94777      1.65       93853      1.66
 15     ey          94336      1.65       94917      1.68
 16     f           70752      1.23       67365      1.19
 17     FIP          9833      0.17        6155      0.11
 18     g           46039      0.80       43180      0.76
 19     hh          62289      1.09       78471      1.39
 20     ih         105977      1.85      105998      1.87
 21     iy         162879      2.84      164602      2.91
 22     jh          25965      0.45       25723      0.45
 23     k          134674      2.35      136179      2.41
 24     l          153233      2.67      155005      2.74
 25     LAU         72837      1.27      126833      2.24
 26     m          107423      1.87      108322      1.91
 27     n          221132      3.86      219663      3.88
 28     ng          48422      0.84       53268      0.94
 29     ow         145679      2.54      185196      3.27
 30     oy           5250      0.09        5587      0.10
 31     p           65262      1.14       60704      1.07
 32     PUH        142911      2.49       91609      1.62
 33     PUM         51569      0.90       69184      1.22
 34     r          141826      2.47      139225      2.46
 35     s          208325      3.63      212399      3.75
 36     sh          32093      0.56       31628      0.56
 37     t          199722      3.48      202069      3.57
 38     th          32087      0.56       33386      0.59
 39     uh          17369      0.30       20455      0.36
 40     uw          81752      1.43       81549      1.44
 41     v           50708      0.88       49531      0.88
 42     w          104537      1.82      106207      1.88
 43     y          104404      1.82       98951      1.75
 44     z           98470      1.72       94262      1.67
 45     zh           1966      0.03        1807      0.03
 46     REJ         34988      0.61       28861      0.51
Overall           5733340    100.00     5658665    100.00
Table A.2: Class coverage of CTS task, training sets. Note that only male set was used
in this work.
Appendix B
Summary of experiments on CTS
Task.
                                                       WER [%]   WER [%]
Experiment                                             Devel     Test     St., Grm.(a)
Short-term features
  MFCC (3x13, Utt. E norm.(b), lifter)                 52.5      50.2     1k, 14
  PLP (3x13)                                           53.2      51.6     1k, 14
  ePLP (enhanced PLP - 3x13, Spk/Utt VTLN(c))          46.4      43.8     1k, 14
9-frames
  PLP-MLP (9 frames PLP, 46 fea(d))                    50.9      -        2k, 20
  PLPu-MLP (9 frames PLP, Utt. norm., 46 fea)          48.7      -        2k, 20
  ePLP-MLP (9 frames ePLP, 46 fea)                     46.3      43.3     2k, 20
TRAP
  TRAP nn (101 frames, no norm., 46 fea)               54.5      53.4     2k, 20
  TRAP-DCT (101 frames, Hamming, DCT 51(e), 46 fea)    53.3      51.2     2k, 20
LP-TRAP
  LP-TRAP (fp70, len51, ncep50, no warp, 46 fea)       53.1      50.5     2k, 20
  LP-TRAPw ( -//-, warp 1.75)                          51.4      50.3     2k, 20
M-RASTA
  M-RASTA 240 (2×8 filt, 46 fea)                       53.7      52.5     2k, 20
  M-RASTA 448 (2×8 filt + ∆(f), 46 fea)                53.3      51.4     2k, 20
  M-RASTA 448 tri ( -//-, 3 phn states(g), 46 fea)     51.1      50.1     2k, 20
  Hilbert-M-RASTA (2×8 filt + ∆, 46 fea)               52.3      50.8     2k, 20
Combinations
  ePLP + ePLP-MLP (39 + 17 fea)                        43.7      -        2k, 30
  ePLP + ie[ePLP-MLP, LP-TRAPw](h) (39 + 25 fea)       44.0      40.7     2k, 25
  ePLP + LP-TRAPw (39 + 25 fea)                        43.5      41.3     2k, 25
  ePLP + M-RASTA 448 (39 + 25 fea)                     45.6      42.4     2k, 25
  ePLP + M-RASTA 448 tri (39 + 25 fea)                 44.3      42.8     2k, 25

(a) St. = number of triphone states, Grm. = grammar scale factor.
(b) Utterance-based energy normalization.
(c) Vocal Tract Length Normalization per speaker (train) and per utterance (test).
(d) fea = final feature vector size.
(e) Hamming window on TRAP, first 51 DCT coefficients used.
(f) 2×8 temporal filters plus frequency deltas.
(g) Three phoneme states (aligned by HMM) used as MLP targets.
(h) ie = inverse entropy combination.
Table B.1: Summary of experiments on CTS task.
Appendix C
On target classes for
band-classifiers in TRAP
This section is a small study on suitable target classes for Band-MLP classifiers in TRAP
architecture. It does not question the Merger-MLP classes which were always phonemes
in this thesis, see Fig. C.1.
[Figure: TRAP MLP architecture – fifteen Band-MLPs, one per critical band, produce
band-conditioned posteriors which a Merger-MLP combines into phoneme posteriors;
the question marks indicate the Band-MLP target classes under study.]
Figure C.1: TRAP MLP architecture.
The study was inspired by the work of Pratibha Jain and the continuing work of Petr
Svojanovsky, who experimented with fewer, broader classes. Their motivation was that
phonemes cannot be properly distinguished given an energy trajectory in a single sub-band.
They experimented with either broader phonetic classes (silence, plosive, nasal, glide,
low/high-vocalic energy, schwa, flap, fricative) or automatically derived data-driven classes
across bands (UTRAPs), which they assigned to sub-band TRAPs by means of the least
Euclidean distance. They relabeled every frame in every sub-band and trained the MLPs
using these targets. They reported an improvement in ASR over conventional phonetic
targets [58, 91].
Frame accuracies of the Band-MLPs in a typical TRAP system with 29 phonemes reach
only about 35% FAcc (FAcc = 100% - FER). The existence of broad classes confirms some
measure of ambiguity in phoneme targets. One may ask to what extent such ambiguity
matters. The fact that TRAP actually works proves that either the ambiguity does
not matter or that the found broad classes are not optimal and can be improved. It is
possible to find which explanation is right by emulating 100% non-separable classes.
Main idea
The Band-MLPs in the S/N task have 29 phoneme targets (1-29). By repeating every frame
in the training set exactly twice and assigning the two copies different labels, L and L+29,
we can artificially introduce 29 non-separable class pairs. In theory, the FAcc of the Band-
MLPs should halve. The question is what happens to the Merger-MLP FAcc and WER.
If they get worse, then the ambiguity indeed matters and it makes sense to put effort into
developing broad classes. If they stay the same, then the phonemic targets are satisfactory.
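The label-duplication trick can be sketched as follows (a minimal illustration; frames are shown as plain feature vectors, and the function name is an assumption):

```python
def make_ambiguous(frames, labels, n_classes=29):
    """Emulate perfectly non-separable classes: repeat every training
    frame twice and give the two copies the labels L and L + n_classes,
    yielding n_classes class pairs that no classifier can tell apart."""
    new_frames, new_labels = [], []
    for x, lab in zip(frames, labels):
        new_frames += [x, x]
        new_labels += [lab, lab + n_classes]
    return new_frames, new_labels
```

Since identical inputs carry two different targets, the best any classifier can do is split its posterior mass between the pair, which is why the Band-MLP frame accuracy should halve.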
Experiment
Two MLP systems were trained on mean-normalized TRAPs of length 51 frames and
evaluated in terms of FAcc and WER. The number of training examples per Band-MLP
target class stayed constant.
Baseline
• 15 Band-MLPs with 29 targets,
• Merger-MLP with 29 targets, input size = 29 × 15 features.
Ambiguous
• 15 Band-MLPs with 2 × 29 = 58 targets,
• Merger-MLP with 29 targets, input size = 58 × 15 features.
Results are given in the first two rows of Tab. C.1. When comparing the Ambiguous
system to the Baseline system, the frame accuracy of the Band-MLPs indeed halved. The
merger accuracy stayed constant, which supports the hypothesis that non-separability does
not matter. However, the word error rate increased. This could have been caused by the
change in Merger-MLP topology (doubling its input size). To investigate this further,
two more experiments were run.
• In the first experiment, the 29 posteriors per band from the Baseline system were
repeated twice. This eliminated any possible Band-MLP performance drop while again
forming a double-size input (58 × 15 features) to the merger.
• The second experiment realized the opposite idea. Of the 58 posteriors per band
of the Ambiguous system, only one half was fed to the merger (either posteriors
1–29 or 30–58). The merger input size of the Baseline system was thus preserved.
                       Band-MLP          Merger-MLP
Targets                Aver. FAcc [%]    FAcc [%]   WER [%]
baseline               35                82         4.8
ambiguous              17                82         5.3
2×repeated baseline    –                 81         5.1
ambiguous 1–29         –                 82         5.2
ambiguous 30–58        –                 82         5.1
Table C.1: Emulating non-separable classes in TRAP Band-MLP targets.
Results of the two additional experiments are given in rows 3–5 of Tab. C.1. Merely
replicating posteriors from the Baseline system deteriorated the word recognition (see
the third row). This confirms that the above WER mismatch was caused only by the change
in MLP topology.
On the other hand, preserving the original small topology by using only half of the
features provided by the Band-MLPs did not return the WER back to 4.8%. However, using
only half of the features could in principle have caused a loss of information. For this
reason the former experiment can be considered more reliable.
Conclusion
Phoneme classes used as targets for band-conditioned neural classifiers are not always
separable, because the classifiers have only partial information. Since the assignment
between phonemes and sub-band energy patterns may be ambiguous, the theoretically
reachable accuracy of the classifiers decreases. However, it was shown that such sub-band
ambiguity does not reduce the phoneme-classification potential of the merging neural
network, thus it can affect the word error rate only a little. Hence, if phonemes are used
as targets for Band-MLPs, the performance should not be impaired. It does not seem to
be necessary to search for data-driven broad categories.
Appendix D
Sub-phoneme targets for
TANDEM classifier
This section introduces a simple experiment with alternative phoneme classes that could
enhance the temporal resolution of the classifier.
Target classes of MLPs in combined ANN/HMM recognizers are typically phonemes.
The training targets are obtained from a phoneme transcript simply by setting the flag
for each phoneme pi from time ti to ti+1 and keeping the other phonemes unset, see Fig. D.1.
Since the MLP does not have memory, it cannot model any temporal evolution
within the phoneme. In principle, the training frames could even be randomly reshuffled
prior to training with no drawback in performance1. After the training, the “best” MLP
is usually judged to be the one which yields posteriors best matching the training targets – in
other words, the most “rectangular” posteriors with respect to their temporal evolution.
Yet these posteriors are subsequently modeled, typically by 3–5 state HMMs. Why use
three states for a rectangle? Would it not make more sense if the MLP were trained with
state targets instead of phoneme targets?
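The conversion from phoneme targets to sub-phoneme state targets can be sketched as follows. In the experiments the state boundaries came from a forced alignment by HMMs; the even three-way split below is a simplifying assumption of the sketch.

```python
def phoneme_to_state_targets(phoneme_labels, states_per_phoneme=3):
    """Expand a frame-level phoneme label sequence into state targets:
    each contiguous phoneme segment is split into `states_per_phoneme`
    equal parts, and frame targets become phoneme*states + state."""
    targets = []
    start = 0
    for i in range(1, len(phoneme_labels) + 1):
        # close the segment at the end or when the label changes
        if i == len(phoneme_labels) or phoneme_labels[i] != phoneme_labels[start]:
            seg_len = i - start
            for j in range(seg_len):
                state = min(j * states_per_phoneme // seg_len,
                            states_per_phoneme - 1)
                targets.append(phoneme_labels[start] * states_per_phoneme + state)
            start = i
    return targets
```

The number of MLP output units grows accordingly (29 phonemes become 87 state classes in the S/N task).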
Experiment on S/N Task
A combined ANN/HMM system based on the TANDEM architecture and Multi-RASTA
features [49] was trained first with phoneme targets and second with state targets (3 per
phoneme). The experiment was done on the S/N task. Phoneme targets and state targets
were both obtained from a forced alignment using an existing set of HMMs. In theory, if
the reasoning were flawed, the FER would triple. Tab. D.1 shows what actually
happened.
The FER rose only to 29.1%, not to 57.3%, which suggests that the MLP is able
to benefit from state targets. This was also confirmed by a big improvement in WER.
For a fair comparison of systems with different number of posteriors, only the first 29
1In practice, the training frames are actually being reshuffled, which ameliorates MLP convergence.
107
108 D Sub-phoneme targets for TANDEM classifier
SIL
eh
ih
n
s
v
x
MLP1
0
SIL s eh v ih n SIL
t t t tt1 2 3 4 50
t
Figure D.1: Illustration of phonemes as MLP training targets for an utterance “seven”.
Targets FER [%] WER [%]
29 phonemes 19.1 4.0
3×29 states 29.1 3.4
Table D.1: Performance of phoneme states used as MLP targets, S/N task.
features after KLT transform in TANDEM were used. Had all 87 features been fed to the
HMMs, the WER would have reached 3.2%.
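The KLT-based reduction of the TANDEM features (keeping the leading 29 of 87 decorrelated dimensions) can be sketched as below. The random matrix merely stands in for real log posteriors, and the function is an illustrative sketch, not the tool actually used.

```python
import numpy as np

def klt_reduce(feats, n_keep=29):
    """Estimate a KLT (PCA) on the data, decorrelate the features and
    keep the n_keep components with the largest variance."""
    mean = feats.mean(axis=0)
    centered = feats - mean
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigval)[::-1]            # sort by variance, descending
    basis = eigvec[:, order[:n_keep]]
    return centered @ basis

rng = np.random.default_rng(0)
logpost = rng.normal(size=(1000, 87))           # stand-in for log posteriors
reduced = klt_reduce(logpost, n_keep=29)
```

Because the basis is ordered by variance, truncating to the first 29 columns discards the least informative directions, which is what makes the comparison across different posterior counts fair.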
It could also be argued that there was a difference in overall MLP size, as the latter
MLP had more output units. However, this can hardly be compensated for: if the number
of free parameters in the systems were preserved, there would still be a difference in MLP
topology (roughly A × 3B × C units vs. A × B × C/3 units), which could significantly
affect performance. Nevertheless, for the sake of completeness, another MLP was trained
with a reduced number of hidden units so as to preserve the overall size. The FER was 31.3% and
the WER was 3.8%², still better than the phoneme-target system.
Experiment on CTS Task
Since it is often the case that S/N observations do not translate to LVCSR, the same
experiment was carried out on the CTS task. The sequence of phoneme labels commonly
used for training was forced-aligned to states (3 per phoneme) using existing HMMs.
Subsequently, to repeat the S/N experiment, two MLPs were trained:
State targets “fair” – the number of free parameters was roughly preserved (3× more outputs,
3× fewer hidden units).

State targets – the same hidden layer size as the baseline, 3× more outputs.
²When using all 87 outputs from this MLP, the WER would be 3.5%.
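The state targets above come from forced alignment with existing HMMs. As a rough illustration only, the sketch below splits each aligned phoneme segment uniformly into three sub-state segments; uniform splitting is an assumption made for the example, not the alignment procedure actually used.

```python
def split_to_states(segments, n_states=3):
    """Expand time-aligned phoneme segments (label, start, end) into
    per-state segments by uniform splitting, a crude substitute for
    HMM forced alignment to states."""
    states = []
    for label, t_start, t_end in segments:
        step = (t_end - t_start) / n_states
        for s in range(n_states):
            states.append((f"{label}_{s + 1}",
                           t_start + s * step,
                           t_start + (s + 1) * step))
    return states

# Two toy aligned phoneme segments (times in seconds, made up):
segs = [("s", 0.10, 0.22), ("eh", 0.22, 0.28)]
states = split_to_states(segs)
```

Each phoneme label thus yields three state labels (e.g. s_1, s_2, s_3), tripling the number of target classes while keeping the segment boundaries intact.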
MLP                     Topology (Input × Hidden × Output)
Phoneme targets         448 × 2000 × 47
State targets “fair”    448 × 667 × 139
State targets           448 × 2000 × 139
Note that the “reject” class was not split into states, as it is not used for training; hence
the number of classes was 46 × 3 + 1 = 139. The results are given in Tab. D.2.
MLP                     FER [%]   Devel set WER [%]   Test set WER [%]
Phoneme targets         37.5      53.3                51.4
State targets “fair”    54.4      51.1                50.1
State targets           52.3      50.8                50.9

Table D.2: Performance of phoneme states used as MLP targets, CTS task.
The FER rose by only about 15% absolute, which again supports the use of phoneme
states as MLP targets. The WER improved significantly, by more than 1% absolute.
Interestingly, the “fair” (smaller) MLP outperformed the big MLP on the test set.
Conclusion
The MLP classifier in the TANDEM architecture typically treats the phoneme as an atomic
temporal unit. This arises from the fact that during training, all frames belonging to a
certain phoneme are freely interchangeable. Here it was shown that treating the phoneme
as an entity with certain temporal dynamics (modeled by three independent sub-phoneme
states) in the MLP classifier can significantly improve ASR performance. The improvement was
observed at the frame level as well as in the word error rate. The findings support the notion
that making the system more coherent overall, by coordinating the MLP targets with the
elementary HMM modeling units (phone states), pays off in accuracy. Similar observations
were also reported in [88], though not very explicitly.
• This experiment was carried out after all other approaches presented in the thesis
had already been finalized; the idea has therefore not been included in the evaluations.
Bibliography
[1] FFTW home page, http://www.fftw.org.
[2] ICSI speech FAQ, http://www.icsi.berkeley.edu/speech/faq.
[3] NIST spoken language technology evaluation and utility web,
http://www.nist.gov/speech/index.htm.
[4] QuickNet home page, http://www.icsi.berkeley.edu/Speech/qn.html.
[5] Sprachcore web page,
http://www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html.
[6] Web pages of Speech Processing and Signal Analysis Group at FEE CTU Prague,
http://noel.feld.cvut.cz/speechlab.
[7] Web pages of Speech Processing Group at Brno University of Technology,
http://www.fit.vutbr.cz/research/groups/speech.
[8] Allen, J. B. How do humans process and recognize speech? IEEE Trans. on
Speech and Audio Proc. 2, 4 (Oct. 1994), 567–577.
[9] Arai, T., Pavel, M., Hermansky, H., and Avendano, C. Intelligibility of
speech with filtered time trajectories of spectral envelopes. In Proc. of ICSLP ’96
(Philadelphia, PA, 1996), vol. 4, pp. 2490–2493.
[10] Arai, T., Pavel, M., Hermansky, H., and Avendano, C. Syllable intelligibility
for temporally-filtered LPC cepstral trajectories. J. Acoust. Soc. Am. (1999).
[11] Arai, T., Takahashi, M., and Kanedera, N. On the important modulation
frequency bands of speech for human speaker recognition. In Proc. of ICSLP 2000
(2000), vol. 3, pp. 774–777.
[12] Atal, B. Effectiveness of linear prediction characteristics of the speech wave for
automatic speaker identification and verification. J. Acoust. Soc. Am. 55, 6 (1974),
1304–1312.
[13] Atal, B., and Schroeder, M. Predictive coding of speech signals. Proceedings of
the 1967 Conference on Communications and Processing (November 1967), 360–361.
[14] Athineos, M., and Ellis, D. Frequency-domain linear prediction for temporal
features. In Proc. of IEEE ASRU 2003 (St. Thomas, U.S. Virgin Islands, 2003),
pp. 261–266.
[15] Athineos, M., Hermansky, H., and Ellis, D. P. LP-TRAP: Linear predictive
temporal patterns. In International Conference on Spoken Language Processing
(ICSLP) (2004). IDIAP RR 04-59.
[16] Athineos, M., Hermansky, H., and Ellis, D. P. W. PLP2: Autoregressive
modeling of auditory-like 2-D spectro-temporal patterns. In Proc of. SAPA-2004
(Jeju, Korea, 2004).
[17] Avendano, C., van Vuuren, S., and Hermansky, H. Data-based RASTA-like
filter design for channel normalization in ASR. In ICSLP’96 (Philadelphia, PA,
USA, Oct. 1996), vol. 4, pp. 2087–2090.
[18] Bellman, R. E. Dynamic programming. Princeton University Press, 1957.
[19] Bourlard, H., and Dupont, S. A new ASR approach based on independent
processing and recombination of partial frequency bands. In Proc. of ICSLP ’96
(Philadelphia, PA, 1996), vol. 1, pp. 426–429.
[20] Burget, L., Dupont, S., Garudadri, H., Grezl, F., Hermansky, H., Jain,
P., Kajarekar, S., and Morgan, N. QUALCOMM-ICSI-OGI features for ASR.
In Proc. 7th International Conference on Spoken Language Processing (2002), Inter-
national Speech Communication Association.
[21] Burget, L., and Hermansky, H. Data driven design of filter bank for speech
recognition. In Proc. of TSD’01 (Zelezna Ruda, Czech Republic, September 2001).
[22] Chen, B., Cetin, O., Doddington, G., Morgan, N., Ostendorf, M., Shinozaki, T.,
and Zhu, Q. A CTS task for meaningful fast-turnaround experiments. In Proc. of
RT-04 Workshop (IBM Palisades Center, November 2004).
[23] Chen, B. Y. Learning Discriminant Narrow-Band Temporal Patterns for Automatic
Speech Recognition. PhD thesis, University of California, Berkeley, 2005.
[24] Chen, B. Y., Chang, S., and Sivadas, S. Learning discriminative temporal
patterns in speech: Development of novel TRAPS-like classifiers, 2003.
[25] Chen, B. Y., Zhu, Q., and Morgan, N. Tonotopic multi-layered perceptron: A
neural network for learning long-term temporal features for speech recognition. In
Proc. of ICASSP 2005 (Philadelphia, PA, 2005).
[26] Cole, R., Noel, M., and Lander, T. Telephone speech corpus development
at CSLU. In In Proceedings of the International Conference on Spoken Language
Processing (ICSLP’94) (Yokohama, Japan, 1994), pp. 1815–1818.
[27] Cole, R., Noel, M., Lander, T., and Durham, T. New telephone speech
corpora at CSLU. In Proceedings of the Fourth European Conference on Speech
Communication and Technology (1995), vol. 1, pp. 821–824.
[28] Cooley, J. W., and Tukey, J. W. An algorithm for the machine calculation of
complex Fourier series. Mathematics of Computation 19 (1965), 297–301.
[29] Davis, S. B., and Mermelstein, P. Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences. IEEE Transactions
on Acoustics, Speech, and Signal Processing 28, 4 (1980), 357–366.
[30] deCharms, R. C., Blake, D. T., and Merzenich, M. M. Optimizing sound
features for cortical neurons. Science 280, 5368 (May 1998), 1439–1444.
[31] Depireux, D. A., Simon, J. Z., Klein, D. J., and Shamma, S. A. Spectro-
temporal response field characterization with dynamic ripples in ferret primary au-
ditory cortex. J. Neurophysiol. 85 (2001), 1220 – 1234.
[32] Dimitriadis, D., Maragos, P., and Potamianos, A. Auditory teager energy
cepstrum coefficients for robust speech recognition. In Proc. of Interspeech’05 (Lis-
bon, Portugal, September 2005).
[33] Drullman, R., Festen, J. M., and Plomp, R. Effect of reducing slow temporal
modulations on speech recognition. J. Acoust. Soc. Am. 95, 5 (May 1994), 2670–
2679.
[34] Ephraim, Y., and Malah, D. Speech enhancement using a minimum mean square
error short time spectral amplitude estimator. IEEE Trans. on ASSP-32 6 (Decem-
ber 1984), 1109–1121.
[35] Flanagan, J. Speech Analysis Synthesis and Perception, 2 ed. Springer-Verlag,
1972.
[36] Fletcher, H. Speech and hearing in communication. In The ASA edition of Speech
and Hearing in Communication, J. B. Allen, Ed. Acoustical Society of America, New
York, 1995.
[37] Fousek, P. Does phoneme labeling of speech have to be done by hand? Tech. Rep.
R06-3, FEE CTU, Dept. of Circuit Theory, Prague, 2006.
[38] Furui, S. Cepstral analysis technique for automatic speaker verification. In IEEE
Trans. ASSP (1981), vol. 29, pp. 254–272.
[39] Furui, S. On the role of spectral transition for speech perception. Journal of the
Acoustical Society of America 80, 4 (October 1986), 1016–1025.
[40] Gales, M. Maximum likelihood linear transformations for HMM-based speech
recognition. Computer Speech and Language 12, 2 (1998), 75–98.
[41] Gillick, L., and Cox, S. J. Some statistical issues in the comparison of speech
recognition algorithms. In Proc. of ICASSP’89 (Glasgow, 1989), pp. 532–535.
[42] Gold, B., and Morgan, N. Speech and Audio Signal Processing: Processing and
Perception of Speech and Music. John Wiley & Sons, Inc., New York, NY, USA,
1999.
[43] Gong, Y. Speech recognition in noisy environments: a survey. Speech Commun.
16, 3 (1995), 261–291.
[44] Greenberg, S. Understanding speech understanding: Towards a unified theory of
speech perception. In Workshop on the Auditory Basis of Speech Perception (1996),
1–8.
[45] Grezl, F. Local time-frequency operators in TRAPs for speech recognition. In 6th
International Conference TSD 2003 (2003), vol. 2003, University of West Bohemia
in Pilsen, pp. 269–274.
[46] Grezl, F., and Hermansky, H. Local averaging and differentiating of spectral
plane for TRAP-based ASR. In Proc. EUROSPEECH 2003 (2003), Institute for
Perceptual Artificial Intelligence.
[47] Hermansky, H. Perceptual linear predictive (PLP) analysis of speech. J.
Acoust. Soc. Am. 87, 4 (1990), 1738–1752.
[48] Hermansky, H., Ellis, D., and Sharma, S. Connectionist feature extraction
for conventional HMM systems. In ICASSP’00 (Istanbul, Turkey, 2000).
[49] Hermansky, H., and Fousek, P. Multi-resolution RASTA filtering for TANDEM-
based ASR. In Proceedings of Interspeech 2005 (2005).
[50] Hermansky, H., Fujisaki, H., and Saito, Y. Analysis and synthesis of speech
based on spectral transform linear predictive method. In Proc. of ICASSP’83 (April
1983), vol. 8, pp. 777–780.
[51] Hermansky, H., and Jain, P. Band-independent speech-event categories for
TRAP based ASR. In Proc. of Eurospeech 2003 (Geneve, CH, 2003), pp. 1013–
1016.
[52] Hermansky, H., and Morgan, N. RASTA processing of speech. IEEE Transactions
on Speech and Audio Processing 2, 4 (October 1994), 578–589.
[53] Hermansky, H., and Sharma, S. TRAPs - classifiers of TempoRAl Patterns. In
Proc. of ICSLP’98 (November 1998).
[54] Hermansky, H., and Sharma, S. Temporal patterns (TRAPS) in ASR of noisy
speech. In in ICASSP’99 (Phoenix, Arizona, USA, Mar. 1999).
[55] Hertz, J., Krogh, A., and Palmer, R. G. Introduction to the Theory of
Neural Computation. Addison-Wesley, 1991.
[56] Ikbal, S., Misra, H., Sivadas, S., Hermansky, H., and Bourlard, H. En-
tropy Based Combination of Tandem Representations for Noise Robust ASR. In
Proc. of ICSLP-04 (Jeju Island, Korea, October 2004).
[57] Itakura, F., and Saito, S. A statistical method for estimation of speech spectral
density and formant frequencies. Electronics Communications of Japan 53-A, 1
(1970), 36–43.
[58] Jain, P. Temporal patterns of frequency-localized features in ASR. PhD thesis, OGI
School of Science & Engineering at OHSU, 2003.
[59] Janin, A., Ellis, D., and Morgan, N. Multi-stream speech recognition: Ready
for prime time? In Proc. of Eurospeech-99 (Budapest, 1999), pp. 591–594.
[60] Kajarekar, S., Yegnanarayana, B., and Hermansky, H. A study of two
dimensional linear discriminants for ASR. In ICASSP’01 (Salt Lake City, Utah,
USA, May 2001).
[61] Kajarekar, S. S., and Hermansky, H. Optimization of units for continuous-
digit recognition task. In Proc. of ICSLP 2000 (Beijing, China, 2000).
[62] Kanedera, N., Arai, T., Hermansky, H., and Pavel, M. On the relative
importance of various components of the modulation spectrum for automatic speech
recognition. Speech Communication 28, 1 (May 1999), 43–55.
[63] Kanedera, N., Hermansky, H., and Arai, T. On properties of modulation
spectrum for robust automatic speech recognition. Proc. of the IEEE International
Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2 (1998), 613–616.
[64] Karafiat, M., Grezl, F., and Cernocky, J. TRAP based features for LVCSR
of meeting data. In Proc. 8th International Conference on Spoken Language Processing
(2004), Sunjin Printing Co., pp. 437–440.
[65] Ketabdar, H., and Hermansky, H. Identifying unexpected words using in-
context and out-of-context phoneme posteriors. IDIAP-RR 68, IDIAP, 2006.
[66] Kleinschmidt, M., and Gelbart, D. Improving word accuracy with Gabor
feature extraction. In Proc. of ICSLP’02 (Denver, Colorado, 2002).
[67] Kurkova, V. Kolmogorov’s theorem and multilayer neural networks. Neural Com-
putation 5, 3 (1992), 501–506.
[68] Lehtonen, M., Fousek, P., and Hermansky, H. Hierarchical approach for spotting
keywords. In Proc. of 2nd Workshop on Multimodal Interaction and Related Machine
Learning Algorithms – MLMI’05 (Edinburgh, UK, July 2005).
[69] Linsker, R. Self-organization in a perceptual network. Computer 21, 3 (1988),
105–117.
[70] Lippmann, R. P. Accurate consonant perception without mid-frequency speech
energy. Speech and Audio Processing, IEEE Transactions on 4, 1 (1996), 66–69.
[71] Lockwood, P., and Boudy, J. Experiments with a non-linear spectral subtractor
(NSS), hidden Markov models and the projection for robust speech recognition in
cars. In Proc. of Eurospeech 1991 (1991).
[72] Malayath, N., and Hermansky, H. Bark resolution from speech data. Proceed-
ings of International Conference on Spoken Language Processing 2002 (September
2002).
[73] Martin, R. Noise power spectral density estimation based on optimal smoothing
and minimum statistics. IEEE Transactions on Speech and Audio Processing 9, 5
(July 2001).
[74] Mermelstein, P. Distance measures for speech recognition: Psychological and
instrumental. In Pattern Recognition and Artificial Intelligence, C. H. Chen, Ed.
Academic Press, New York, 1976, pp. 374–388.
[75] Meyer, B., and Kleinschmidt, M. Robust speech recognition based on localized
spectro-temporal features. In Proc. of ESSV’03 (Karlsruhe, 2003).
[76] Misra, H., Vepa, J., and Bourlard, H. Multi-stream ASR: An Oracle perspec-
tive. In Proc. of ICSLP’06 (Pittsburgh, U.S.A., September 2006).
[77] Morgan, N., and Bourlard, H. An introduction to hybrid HMM/connectionist
continuous speech recognition. IEEE Signal Processing Magazine (May 1995), 25–42.
[78] Morris, A., Hagen, A., and Bourlard, H. Map combination of multi-stream
hmm or hmm/ann experts. In Proc. of Eurospeech’01 (Aalborg, Denmark, Septem-
ber 3-7 2001).
[79] Motlicek, P., Hermansky, H., Garudadri, H., and Srinivasamurthy, N.
Speech coding based on spectral dynamics. In Ninth International Conference on
Text, Speech and Dialogue (TSD) (2006). IDIAP-RR 06-05.
[80] Motlicek, P. Modeling of Spectra and Temporal Trajectories in Speech Processing.
PhD thesis, Brno University of Technology, Faculty of Information Technology,
August 2003.
[81] Motlicek, P., and Cernocky, J. Time-domain based temporal processing with
application of orthogonal transformations. In Proc. EUROSPEECH 2003 (2003),
Institute for Perceptual Artificial Intelligence, pp. 821–824.
[82] Oppenheim, A. V., and Schafer, R. W. Discrete-time signal processing.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.
[83] Pavel, M., and Hermansky, H. Information fusion by human and machines. In
Proc. of The First European conference on signal analysis and prediction (Prague,
Czech Republic, 1997).
[84] Prasanna, S. H. M., and Hermansky, H. Multi-RASTA and PLP in automatic
speech recognition. Tech. Rep. RR 06-45, IDIAP Research Institute, Martigny, 2006.
[85] Rabiner, L. R. A tutorial on Hidden Markov Models and selected applications in
speech recognition. Proceedings of the IEEE 77, 2 (1989), 257–286.
[86] Rumelhart, D. E., Hintont, G., and Williams., R. J. Learning representa-
tions by back-propagating errors. Nature 4, 323 (1986), 533–536.
[87] Schwarz, P., Matejka, P., and Cernocky, J. Recognition of phoneme strings
using TRAP technique. In Proceedings of 8th International Conference Eurospeech
(2003), International Speech Communication Association.
[88] Schwarz, P., Matejka, P., and Cernocky, J. Towards lower error rates in
phoneme recognition. In Proceedings of 7th International Conference Text,Speech
and Dialoque 2004 (2004), Springer Verlag.
[89] Schwarz, P., Matejka, P., and Cernocky, J. Hierarchical structures of neural
networks for phoneme recognition. In Proceedings of ICASSP 2006 (2006), pp. 325–
328.
[90] Sovka, P., Pollak, P., and Kybic, J. Extended spectral subtraction. In Proc.
of European Signal Processing Conference (EUSIPCO–96) (Trieste, Italy, November
1996).
[91] Svojanovsky, P. Band-independent classifiers in TRAP-TANDEM ASR system.
In Proc. of SPECOM 2005 (Patras, Greece, October 2005), pp. 769–772.
[92] Tibrewala, S., and Hermansky, H. Sub-band based recognition of noisy speech.
In Proc. of ICASSP ’97 (Munich, Germany, 1997), pp. 1255–1258.
[93] Tyagi, V., and Wellekens, C. Fepstrum representation of speech signal. In
Proceedings of IEEE ASRU’05 (December 2005), pp. 44–49.
[94] Uhlir, J., and Sovka, P. Číslicové zpracování signálu [Digital Signal Processing].
CTU Publishing House, 1995.
[95] Umesh, S., Cohen, L., and Nelson, D. Frequency-warping and speaker-
normalization. In Proceedings of the 1997 IEEE International Conference on Acous-
tics, Speech, and Signal Processing (ICASSP ’97) (Washington, DC, USA, 1997),
vol. 2, IEEE Computer Society, p. 983.
[96] Valente, F., and Hermansky, H. Discriminant linear processing of time-
frequency plane. In Proc. of ICSLP’06 (2006). IDIAP-RR 06-20.
[97] Valente, F., and Hermansky, H. Combination of acoustic classifiers based on
Dempster-Shafer theory of evidence. In Proc. of ICASSP 2007 (Honolulu, Hawaii,
USA, 2007).
[98] van Vuuren, S., and Hermansky, H. Data-driven design of RASTA-like filters.
In Eurospeech’97 (Rhodes, Greece, 1997), ESCA.
[99] Yang, H. H., Sharma, S., van Vuuren, S., and Hermansky, H. Relevance
of time-frequency features for phonetic and speaker/channel classification. Speech
Communication (Aug. 2000).
[100] Young, S., Ollason, D., Valtchev, V., and Woodland, P. The HTK Book
(for HTK Version 3.2.1). Cambridge University Press, Cambridge, UK, 2002.
[101] Zhu, Q., Chen, B., Grezl, F., and Morgan, N. Improved MLP structures
for data-driven feature extraction for ASR. In Interspeech’2005 - Eurospeech - 9th
European Conference on Speech Communication and Technology (2005).
[102] Zhu, Q., Chen, B., Morgan, N., and Stolcke, A. On using MLP features in
LVCSR. In Proc. of INTERSPEECH 2004 (2004), pp. 921–924.
[103] Zhu, Q., Stolcke, A., Chen, B. Y., and Morgan, N. Using MLP features in
SRI’s conversational speech recognition system. In Proc. of INTERSPEECH 2005
(2005), pp. 2141–2144.
Other used literature
1. Psutka, J., Muller, L., Matousek, J., and Radova, V., Mluvíme s počítačem
česky [Talking to a Computer in Czech], Academia, Prague, 2005.

2. Uhlir, J., Sovka, P., and Cmejla, R., Úvod do číslicového zpracování signálu
[Introduction to Digital Signal Processing], CTU Publishing House, Prague, 2003.

3. Sovka, P., and Pollak, P., Vybrané metody číslicového zpracování signálu [Selected
Methods of Digital Signal Processing], CTU Publishing House, Prague, 2001.

4. Rybicka, J., LaTeX pro začátečníky [LaTeX for Beginners], KONVOJ, Brno,
ISBN 80-85615-74-6, 1999.

5. Satrapa, P., Perl pro zelenáče [Perl for Greenhorns], Neokortex, ISBN 80-86330-02-8.

6. Racek, S., and Kvoch, M., Třídy a objekty v C++ [Classes and Objects in C++],
Kopp, ISBN 80-7232-017-3, 1998.
Selected publications
1. Fousek, P. and Hermansky, H., Towards ASR based on hierarchical posterior-
based keyword recognition, Proc. of ICASSP ’06, Toulouse, France, 2006.
2. Boril, H. and Fousek, P., Influence of different speech representations and HMM
training strategies on ASR performance, Proc. of Poster 2006, Prague, 2006.
3. Fousek, P., Does phoneme labeling of speech have to be done by hand?, Tech. Rep.
R06-3, FEE CTU, Dept. of Circuit Theory, Prague, 2006.
4. Hermansky, H. and Fousek, P., Multi-resolution RASTA filtering for TANDEM-
based ASR, Proc. of Interspeech 2005, Lisbon, Portugal, September 2005.
5. Hermansky, H., Fousek, P. and Lehtonen, M., The role of speech in multi-
modal human-computer interaction (towards reliable rejection of non-keyword in-
put), Proc. of the 8th International Conference on Text, Speech and Dialogue - TSD
2005, Carlsbad, Czech Republic, 2005.
6. Lehtonen, M., Fousek, P. and Hermansky, H., Hierarchical approach for spot-
ting keywords, IDIAP Research Report, 2005.
7. Fousek, P., Svojanovsky, P., Grezl, F. and Hermansky, H., New nonsense
syllables database - analyses and preliminary ASR experiments, Proc. of ICSLP’04,
Jeju, Korea, 2004.
8. Fousek, P., Performance of LDA based parametrization techniques for robust
speech recognition, In Speech Processing, Prague, Academy of Sciences of the Czech
Republic, Institute of Radioengineering and Electronics, 2004, vol. 1, pp. 152–153.
9. Fousek, P. and Pollak, P., Additive noise and channel distortion-robust param-
eterization tool - performance evaluation on Aurora 2 & 3, Proc. of Eurospeech ’03,
Geneve, Switzerland, 2003.
10. Fousek, P. Computer cluster-based speech recognition system, Proc. of Poster
2003, Prague, 2003.
11. Fousek, P., Robust speech parametrization for recognition purposes, Proc. of
Poster 2002, Prague, 2002.