Covariation and weighting of harmonically decomposed streams for ASR
Introduction
Pitch-scaled harmonic filter
Recognition experiments
Results
Conclusion
Motivation and aims
• Most speech sounds are either voiced or unvoiced, which have very different properties:
– voiced: quasi-periodic signal from phonation
– unvoiced: aperiodic signal from turbulence noise
• Do these properties allow humans to recognize speech in noise?
– Maybe we can use this information to help ASR, by computing separate features for the two parts.
• Are their two contributions complementary?
http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/ INTRODUCTION
Voiced and unvoiced parts of a speech signal
[Figure: production of /z/. Speech waveform s(n) shown with its periodic waveform (periodic contribution) and aperiodic waveform (aperiodic contribution).]
http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/ METHOD
Pitch-scaled harmonic filter
[Block diagram: pitch extraction (f0raw), pitch optimisation (f0opt, Nopt; optimised pitch), PSHF, time shifting and re-splicing, yielding the estimates û(n) and v̂(n).]
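One way to picture a pitch-scaled harmonic split is sketched below. It is a simplified illustration on a single frame with a rectangular window, not the PSHF used in these slides. The sketch assumes the frame spans exactly four pitch periods, so that the harmonics of f0 fall on every fourth DFT bin. The function name `harmonic_split`, the four-period window and the synthetic test signal are all assumptions made for illustration.

```python
import numpy as np

def harmonic_split(frame, periods_per_frame=4):
    """Split one frame into periodic and aperiodic estimates.

    Assumes len(frame) spans exactly `periods_per_frame` pitch periods,
    so harmonics of f0 land on every `periods_per_frame`-th DFT bin.
    """
    spectrum = np.fft.rfft(frame)
    harmonic_bins = np.zeros(spectrum.shape, dtype=bool)
    harmonic_bins[::periods_per_frame] = True  # DC and harmonic bins
    # Keep only the harmonic bins for the periodic estimate.
    periodic = np.fft.irfft(np.where(harmonic_bins, spectrum, 0), n=len(frame))
    # Everything else is attributed to the aperiodic estimate.
    aperiodic = frame - periodic
    return periodic, aperiodic

# Toy signal: two harmonics of a 100 Hz pitch at 8 kHz, plus white noise.
period = 80                      # samples per pitch period (hypothetical)
n = np.arange(4 * period)        # frame of four pitch periods
rng = np.random.default_rng(0)
voiced = np.sin(2 * np.pi * n / period) + 0.5 * np.sin(4 * np.pi * n / period)
noise = 0.1 * rng.standard_normal(n.size)

periodic, aperiodic = harmonic_split(voiced + noise)
```

In this toy setting the periodic estimate repeats every pitch period by construction, and the two outputs always sum back to the input frame. Noise that happens to fall on the harmonic bins stays in the periodic estimate, which is one reason a practical filter needs more care than this sketch.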
Decomposition example (waveforms)
[Figure: waveform panels labelled Original, Periodic and Aperiodic.]
Decomposition ex. (spectrograms)
[Figure: spectrogram panels labelled Original, Periodic and Aperiodic.]
Decomposition ex. (MFCC specs.)
[Figure: MFCC panels labelled Original, Periodic and Aperiodic.]
Speech database: Aurora 2.0
• From TIdigits database of connected English digit strings (male & female speakers), filtered with G.712 at 8 kHz.
        Data type                    Signal-to-Noise Ratio (dB)
TRAIN   clean-condition
        multi-condition              20 15 10 5
TEST    set A (same noises)          20 15 10 5 0 -5
        set B (different noises)     20 15 10 5 0 -5
        set C (different channel)    20 15 10 5 0 -5
Description of the experiments
• Baseline experiment: [base]
– standard parameterisation of the original waveforms (i.e., MFCC, +Δ, +ΔΔ)
• PCA experiments: [pca26, pca78, pca13 and pca39]
– decorrelation of the feature vectors, and reduction of the number of coefficients
• Split experiments: [split, split1]
– adjustment of stream weights (periodic vs. aperiodic)
Caveat: pitch values were derived from clean speech files, for the entire database!
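The PCA experiments decorrelate the concatenated periodic and aperiodic feature vectors and can then discard low-variance components. The following is a minimal sketch of that idea on random toy data, not the authors' code. Using 13 coefficients per stream is an assumption chosen to echo the pca26 and pca13 names, and the helper names `fit_pca` and `apply_pca` are hypothetical.

```python
import numpy as np

def fit_pca(X):
    """Return the mean, principal axes (columns) and variances, largest first."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    return mean, eigvecs[:, order], eigvals[order]

def apply_pca(X, mean, components, n_keep):
    """Project centred data onto the first `n_keep` principal axes."""
    return (X - mean) @ components[:, :n_keep]

# Toy stand-ins for per-frame features of the two streams (500 frames each).
rng = np.random.default_rng(0)
periodic_feats = rng.standard_normal((500, 13))
aperiodic_feats = 0.5 * periodic_feats + rng.standard_normal((500, 13))

X = np.hstack([periodic_feats, aperiodic_feats])   # concatenation: 26 dims
mean, components, variances = fit_pca(X)

Y26 = apply_pca(X, mean, components, 26)   # decorrelated, full size
Y13 = apply_pca(X, mean, components, 13)   # decorrelated and reduced
```

After projection the covariance of `Y26` is diagonal, which suits diagonal-covariance Gaussian models. Keeping only the first 13 columns trades some variance for a smaller feature vector.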
Parameterisations
[Block diagrams. BASE: waveform, MFCC features, +Δ, +Δ2. PCA26, PCA78, PCA13 and PCA39: combinations of PSHF, MFCC, +Δ, +Δ2, cat and PCA blocks.]
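The +Δ and +Δ2 blocks denote first and second time derivatives appended to the static MFCCs. The slides do not say how they were computed. The sketch below assumes the common linear-regression formula over ±2 frames, with edge frames padded by repetition. The function names and the toy input are hypothetical.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression deltas over +/- `width` frames (edges padded by repetition)."""
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(feats, dtype=float)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + num_frames]    # frame t + k
        behind = padded[width - k : width - k + num_frames]   # frame t - k
        out += k * (ahead - behind)
    return out / denom

def add_deltas(static):
    """Stack static features with their deltas and delta-deltas."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])

# Toy "MFCC" track: 10 frames, 2 coefficients, rising by 2 per frame.
static = np.arange(20, dtype=float).reshape(10, 2)
full = add_deltas(static)
```

With 13 static coefficients this stacking would give 39-dimensional vectors. Away from the edges, the delta of a linear ramp equals its slope and the delta-delta is zero.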
[Block diagrams. SPLIT and SPLIT1: combinations of PSHF, MFCC, +Δ, +Δ2 and cat blocks.]
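The Split experiments adjust stream weights between the periodic and aperiodic features. The slides do not say how the weights enter the model. A common convention in multi-stream HMMs is to scale each stream's log output probability by its weight, and the sketch below assumes that convention with one diagonal Gaussian per stream. All names, dimensions and weight values here are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))

def weighted_stream_loglik(observations, models, weights):
    """Sum of per-stream log-likelihoods, each scaled by its stream weight."""
    total = 0.0
    for obs, (mean, var), weight in zip(observations, models, weights):
        total += weight * log_gauss_diag(obs, mean, var)
    return total

# Toy observation for one frame: one vector per stream.
rng = np.random.default_rng(0)
periodic_obs = rng.standard_normal(13)
aperiodic_obs = rng.standard_normal(13)
models = [(np.zeros(13), np.ones(13)),   # periodic-stream Gaussian
          (np.zeros(13), np.ones(13))]   # aperiodic-stream Gaussian

ll_equal = weighted_stream_loglik([periodic_obs, aperiodic_obs], models, [1.0, 1.0])
ll_periodic_only = weighted_stream_loglik([periodic_obs, aperiodic_obs], models, [1.0, 0.0])
```

Setting a weight to zero removes that stream from the score, and raising one weight relative to the other makes the recognizer lean on that stream.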
http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/ RESULTS
Full-sized PCA results
Word Error Rate (%)   clean   multi   overall
base                  47.4    21.7    34.6
pca26                 33.8    11.4    22.6
pca78                 42.7    12.8    27.7
pca13                 28.3    13.0    20.7
pca39                 30.3    14.5    22.4
Variance of Principal Components
[Figure: curves for PCA26 and PCA39; legend entries: clean, multi.]
PCA26 experiment’s results
[Figure: panels CLEAN and MULTI.]
Summary of best PCA results
Word Error Rate (%)   clean   multi   overall
base                  47.4    21.7    34.6
pca26                 29.0    11.4    20.2
pca78                 38.3    12.1    25.2
pca13                 27.6    12.6    20.1
pca39                 29.3    12.5    20.9
Split experiment’s results
Sample Split results
Word Error Rate (%)   clean   multi   overall
base                  47.4    21.7    34.6
split (=0)            62.9    44.3    53.6
split (=1)            28.5    11.7    20.1
split (=2)            22.7    11.5    17.1
Note: same value of stream weights used in training as in testing, for Split.
Split1 experiment’s results
Summary of PCA & Split results
          Word Error Rate (%)         WER reduction (%)
          clean   multi   overall     abs.    rel.
base      47.4    21.7    34.6         0.0     0.0
pca26     29.0    11.4    20.2        14.4    41.6
pca78     38.3    12.1    25.2         9.4    27.2
pca13     27.6    12.6    20.1        14.5    41.9
pca39     29.3    12.5    20.9        13.7    39.6
split     22.6    11.0    16.8        17.8    51.4
split1    21.0    10.9    16.0        18.6    53.8
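The abs. and rel. columns of the summary table are consistent with absolute and relative reductions in overall WER against the baseline. A short check of that arithmetic follows. The overall WER values are copied from the table, and the helper name `wer_reduction` is hypothetical.

```python
# Overall WER (%) per experiment, copied from the summary table.
overall_wer = {
    "base": 34.6, "pca26": 20.2, "pca78": 25.2, "pca13": 20.1,
    "pca39": 20.9, "split": 16.8, "split1": 16.0,
}

def wer_reduction(wer, baseline):
    """Absolute (percentage points) and relative (%) WER reduction."""
    absolute = baseline - wer
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

reductions = {name: wer_reduction(wer, overall_wer["base"])
              for name, wer in overall_wer.items()}

for name, (abs_red, rel_red) in reductions.items():
    print(f"{name:7s} abs {abs_red:5.1f}  rel {rel_red:5.1f}")
```

For example, split1 gives 34.6 - 16.0 = 18.6 points absolute, which is 18.6 / 34.6 = 53.8% relative.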
http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/ CONCLUSION
Conclusions
• PSHF module split Aurora’s speech waveforms into two synchronous streams (periodic and aperiodic)
– large improvements over the single-stream Baseline
• Split was better than all PCA combinations:
– PCA26/13 better than PCA78/39, and PCA13 best
– Split1 marginally better than Split
• Periodic speech segments give robustness to noise.
Further work
– Modeling: how best to combine the streams?
– LVCSR: evaluate front end on TIMIT (phone recognition).
– Robust pitch tracking
COLUMBO PROJECT: Harmonic decomposition applied to ASR
Philip J.B. Jackson 1
David M. Moreno 2
Javier Hernando 2
Martin J. Russell 3
http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/