Feature Extraction for speech applications
Transcript of Feature Extraction for speech applications
Feature Extraction for speech applications
Chapters 19-22

The course so far
• Brief introduction to speech analysis and
recognition for humans and machines
• Some basics on speech production, acoustics,
pattern classification, speech units
Where to next
• Multi-week focus on audio signal processing
• Start off with the “front end” for ASR
• Goal: generate features good for classification
• Waveform is too variable
• Current front ends make some sense in terms
of signal characteristics alone (production
model) - recall the spectral envelope
• But analogy to perceptual system is there too
• A bit of this now (much more on ASR in April)
Biological analogy
• Essentially all ASR front ends start with
a spectral analysis stage
• “Output” from the ear is frequency dependent
• Target-probe experiments going back to
Fletcher (remember him?) suggest a
“critical band”
• Other measurements also suggest similar
mechanism (linear below 1kHz, log above)
Basic Idea (Fletcher)
• Look at response to pure tone in white noise
• Set tone level to just audible
• Reduce noise BW, initially same threshold
• For noise BW below critical value, audibility
threshold goes down
• Presence or absence of tone based on SNR
within the band
Feature Extraction for ASR
Spectral (envelope) analysis → auditory model / normalizations
Deriving the envelope (or the excitation)
excitation e(n) → time-varying filter ht(n) → y(n) = e(n) * ht(n)
How can we get e(n) or h(n) from y(n)?
But first, why?
• Excitation/pitch: for vocoding; for synthesis; for signal transformation; for prosody extraction (emotion, sentence end, ASR for tonal languages …); for voicing category in ASR
• Filter (envelope): for vocoding; for synthesis; for phonetically relevant information for ASR
• Frequency dependency appears to be a key aspect of a system that works - human audition
Spectral Envelope Estimation
• Filters
• Cepstral Deconvolution
(Homomorphic filtering)
• LPC
Channel vocoder (analysis)
Input: e(n)*h(n); analysis filters are broad w.r.t. the harmonics
![Page 14: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/14.jpg)
Rectifier Low-pass filterBand-pass filterA B C
B
C
A
Bandpass power estimationBandpass power estimation
Deriving the spectral envelope with a filter bank
speech → band-pass filters (BP 1 … BP N) → rectify → low-pass filters (LP 1 … LP N) → decimate → magnitude signals
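A minimal numpy sketch of this chain, with a crude FFT brick-wall band-pass and a moving-average low-pass standing in for real analysis filters; the band edges and decimation factor are illustrative, not from the slides:

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    # Crude brick-wall band-pass via FFT-bin masking (illustration only)
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(X, len(x))

def filterbank_envelopes(x, fs, edges, decim=80):
    """One 'magnitude signal' per band:
    band-pass -> rectify -> low-pass -> decimate."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = bandpass_fft(x, fs, lo, hi)
        rect = np.abs(band)                                # rectifier
        lp = np.convolve(rect, np.ones(decim) / decim, mode='same')
        out.append(lp[::decim])                            # decimate
    return np.array(out)
```

Feeding in a 1 kHz tone, essentially all the envelope energy should appear in the band that contains 1 kHz.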
Filterbank properties
• Original Dudley Voder/Vocoder: 10 filters, 300 Hz bandwidth (based on # of fingers!)
• A decade later, Vaderson used 30 filters, 100 Hz bandwidth (better)
• Using variable frequency resolution, 16 filters give the same quality
Mel filterbank
• Warping function B(f) = 1125 ln (1 + f/700)
• Based on listening experiments with pitch
(mel is for “melody”)
Other warping functions
• Bark(f) = [26.8 / (1 + 1960/f)] − 0.53
(named after Barkhausen, who proposed the loudness scale)
Based on critical-band estimates from masking experiments
• ERB(f) = 21.4 log10(1 + 4.37 f/1000)
(Equivalent Rectangular Bandwidth)
Similarly based on masking experiments, but with better estimates of auditory filter shape
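The three warping functions can be written down directly from the formulas above (vectorized over frequency in Hz; the function names are ours):

```python
import numpy as np

def mel_warp(f):
    """Mel scale: B(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * np.log(1.0 + np.asarray(f, float) / 700.0)

def bark_warp(f):
    """Bark scale: 26.8 / (1 + 1960/f) - 0.53."""
    f = np.asarray(f, float)
    return 26.8 / (1.0 + 1960.0 / f) - 0.53

def erb_warp(f):
    """ERB-rate scale: 21.4 log10(1 + 4.37 f/1000)."""
    return 21.4 * np.log10(1.0 + 4.37 * np.asarray(f, float) / 1000.0)
```

All three are roughly linear below about 1 kHz and compressive (log-like) above, which is the shared point of the slide.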
All together now
Towards other deconvolution methods
• Filters seem biologically plausible
• Other operations could potentially separate excitation from filter
• Periodic source provides harmonics (close together in frequency)
• Filter provides broad influence (envelope) on the harmonic series
• Can we use these facts to separate them?
“Homomorphic” processing
• Linear processing is well-behaved
• Some simple nonlinearities also permit simple processing and interpretation
• The logarithm is a good example; multiplicative effects become additive
• Sometimes, in the additive domain, the parts are more separable
• Famous example: “blind” deconvolution of Caruso recordings
Oppenheim: Then all speech compression systems and many speech recognition systems are oriented toward doing this deconvolution, then processing things separately, and then going on from there. A very different application of homomorphic deconvolution was something that Tom Stockham did. He started it at Lincoln and continued it at the University of Utah. It has become very famous, actually. It involves using homomorphic deconvolution to restore old Caruso recordings.
Goldstein: I have heard about that.
Oppenheim: Yes. So you know that's become one of the well-known applications of deconvolution for speech. … Oppenheim: What happens in a recording like Caruso's is that he was singing into a horn to make the recording. The recording horn has an impulse response, and that distorts the effect of his voice, like my talking like this. [cupping his hands around his mouth]
Goldstein: Okay.
IEEE Oral History Transcripts: Oppenheim on Stockham’s Deconvolution of Caruso Recordings (1)
Oppenheim: So there is a reverberant quality to it. Now what you want to do is deconvolve that out, because what you hear when I do this [cupping his hands around his mouth] is the convolution of what I'm saying and the impulse response of this horn. Now you could say, "Well why don't you go off and measure it. Just get one of those old horns, measure its impulse response, and then you can do the deconvolution." The problem is that the characteristics of those horns changed with temperature, and they changed with the way they were turned up each time. So you've got to estimate that from the music itself. That led to a whole notion which I believe Tom launched, which is the concept of blind deconvolution. In other words, being able to estimate from the signal that you've got the convolutional piece that you want to get rid of. Tom did that using some of the techniques of homomorphic filtering. Tom and a student of his at Utah named Neil Miller did some further work. After the deconvolution, what happens is you apply some high pass filtering to the recording. That's what it ends up doing. What that does is amplify some of the noise that's on the recording. Tom and Neil knew Caruso's singing. You can use the homomorphic vocoder that I developed to analyze the singing and then resynthesize it. When you resynthesize it you can do so without the noise. They did that, and of course what happens is not only do you get rid of the noise but you get rid of the orchestra. That's actually become a very fun demo which I still play in my class. This was done twenty years ago, but it's still pretty dramatic. You hear Caruso singing with the orchestra, then you can hear the enhanced version after the blind deconvolution, and then you can also hear the result after you get rid of the orchestra. Getting rid of the orchestra is something you can't do with linear filtering. It has to be a nonlinear technique.
IEEE Oral History Transcripts (2)
Log processing
• Suppose y(n) = e(n)*h(n)
• Then Y(f) = E(f)H(f)
• And log Y(f) = log E(f) + log H(f)
• In some cases, these pieces are separable by a linear filter
• If all you want is H, processing can smooth Y(f)
Source-filter separation by cepstral analysis
windowed speech → FFT → log magnitude → FFT → time separation → spectral function (envelope), and excitation → pitch detection
Cepstral features
• Typically truncated (smooths the estimate; why?)
• Corresponds to spectral envelope estimation
• Features are also roughly orthogonal
• A common transformation for many spectral features
• Used almost universally for ASR (in some form)
• To reconstruct speech (without a minimum-phase assumption), the complex cepstrum is needed
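A sketch of cepstral deconvolution on a synthetic signal — an impulse train through a decaying filter stands in for voiced speech; the small constant added before the log and the lifter cutoff of 30 are illustrative choices, not prescribed values:

```python
import numpy as np

# Impulse train (pitch period 100 samples) through a decaying
# "vocal tract" filter -- a crude stand-in for voiced speech.
period, n = 100, 1000
train = np.zeros(n)
train[::period] = 1.0
y = np.convolve(train, 0.9 ** np.arange(50))[:n] * np.hamming(n)

# Real cepstrum: FFT -> log magnitude -> inverse FFT
logmag = np.log(np.abs(np.fft.rfft(y)) + 1e-2)   # small floor before log
c = np.fft.irfft(logmag, n)

# Low quefrencies carry the envelope: keep them with a lifter
lifter = np.zeros(n)
lifter[:30] = 1.0
lifter[-29:] = 1.0          # symmetric half of the real cepstrum
envelope_log = np.fft.rfft(c * lifter).real

# High quefrencies carry the excitation: peak near the pitch period
pitch_period = 50 + np.argmax(c[50:450])
```

Truncating (liftering) the cepstrum is exactly the smoothing step the bullet above refers to: the retained low-quefrency terms reconstruct the slowly varying log envelope, while the pitch shows up as a peak at the period.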
An alternative: incorporate production
• Assume simple excitation/vocal tract model
• Assume cascaded resonators for vocal tract
frequency response (envelope)
• Find resonator parameters for best spectral
approximation
Resonator frequency response

Pole-only (complex) resonator:
H(z) = 1 / (1 − b z^{-1} − c z^{-2})
where b = 2r cos θ, c = −r², with r = pole magnitude and θ = pole angle
Error Signal

E(z) = Y(z) − Ŷ(z) = Y(z) − Σ_{j=1}^{P} a_j z^{-j} Y(z)
     = Y(z) (1 − Σ_{j=1}^{P} a_j z^{-j})

Equivalently, Y(z) = E(z)H(z), where

H(z) = 1 / (1 − Σ_{j=1}^{P} a_j z^{-j})
Some LPC Issues
• Error criterion
• Model order
Error Criterion
LPC Peak Modeling
• Total error constrained to be (at best) the gain factor squared
• Error where the model spectrum is larger contributes less
• Model spectrum tends to “hug” the peaks
LPC spectra and error
More effects of LPC error criterion
• Globally tracks, but worse match in the log spectrum for low values
• “Attempts” to model the anti-aliasing filter and mic response
• Ill-conditioned for wide-ranging spectral values
Other LPC properties
• Behavior in noise
• Sharpness of peaks
• Speaker dependence
LPC Model Order
• Too few: can’t represent the formants
• Too many: models detail, especially harmonics
• Too many: low error, ill-conditioned matrices
LPC Speech Spectra
LPC Prediction error
Optimal Model Order
• Akaike Information Criterion (AIC)
• Cross-validation (trial and error)
Coefficient Estimation
• Minimize the squared error: set derivatives to zero
• Compute in blocks or on-line
• For blocks, use autocorrelation or covariance methods (pertains to windowing, edge effects)
Minimizing the error criterion

D = Σ_{n=0}^{N-1} e²(n) = Σ_{n=0}^{N-1} ( y(n) − Σ_{j=1}^{P} a_j y(n−j) )²

If we take partial derivatives with respect to each a_j and set them to zero ⇒

Σ_{j=1}^{P} a_j φ(i,j) = φ(i,0), for i = 1, 2, ..., P

where φ(i,j) is a correlation sum between versions of the signal delayed by i and j points
Solving the Equations
• Autocorrelation method: Levinson or Durbin recursions, O(P²) ops; uses the Toeplitz property (constant along left-right diagonals); guaranteed stable
• Covariance method: Cholesky decomposition, O(P³) ops; uses only the symmetry property; not guaranteed stable
LPC-based representations
• Predictor polynomial: a_i, 1 ≤ i ≤ p; direct computation
• Root pairs: roots of the polynomial, complex pairs
• Reflection coefficients (also called PARCOR coefficients k_i, 1 ≤ i ≤ p): recursion; interpolated values always stable
• Log area ratios = ln((1−k)/(1+k)): low spectral sensitivity
• Line spectral frequencies: frequency points around resonances; low spectral sensitivity, stable
• Cepstra: can be unstable, but useful for recognition
LPC analysis block diagram
Spectral Estimation

A comparison of filter banks, cepstral analysis, and LPC on the following properties: reduced pitch effects; excitation estimate; direct access to spectra; less resolution at HF; orthogonal outputs; peak-hugging property; reduced computation.
Feature Extraction for ASR
Chapter 22
ASR Front End
• Coarse spectral representation (envelope)
• Coarsest for high frequencies
• Limitations for each basic type (filter bank, cepstrum, LPC)
Limitations for archetypes
• Filter banks: correlated outputs, no focus on peaks
• Cepstral analysis: uniform spectral resolution, no focus on peaks
• LPC: uniform spectral resolution
• Solution: hybrid approaches
Two “Standards”
• Mel Cepstrum: Bridle (1974), Davis and Mermelstein (1980)
• Perceptual Linear Prediction (PLP): Hermansky, ~1985, 1990
Mel Cepstral Analysis and PLP Analysis

Mel cepstrum: preemphasis (single-zero FIR) → FFT → |·|² → critical bands (triangular filters) → compression (log) → IFFT → smoothing by liftering (cepstral truncation)

PLP: FFT → |·|² → critical bands (trapezoidal filters; preemphasis done in the critical-band step) → compression (cube root) → IFFT → smoothing by LPC analysis
Perceptual Linear Prediction (PLP) [Hermansky 1990]
• Auditory-like modifications of the short-term speech spectrum prior to its approximation by an all-pole autoregressive model (or cepstral truncation in the case of MFCC):
– critical-band spectral resolution
– equal-loudness sensitivity
– intensity-loudness nonlinearity
• These 3 are applied in virtually all state-of-the-art experimental ASR systems
Steps 2-4 of PLP
Dynamic Features
• Delta features: local slope in the cepstrum
• Computed by filtering / linear regression
• Higher derivatives are often used now
• Typically used in combination with “static” features
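The regression form of the delta computation might look like this; the window half-width N = 2 is a common choice, and padding by repeating the edge frames is one of several conventions:

```python
import numpy as np

def delta(C, N=2):
    """Delta (local slope) features by linear regression over +/- N
    frames: d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2).
    C has shape (frames, coefficients); edge frames are padded by
    repeating the first/last frame."""
    T = len(C)
    pad = np.concatenate([np.repeat(C[:1], N, axis=0), C,
                          np.repeat(C[-1:], N, axis=0)])
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    return sum(n * (pad[N + n:N + n + T] - pad[N - n:N - n + T])
               for n in range(1, N + 1)) / denom
```

On a linear ramp of cepstra the interior deltas recover the slope exactly, which is a quick sanity check; applying `delta` to the deltas gives the higher-order (“delta-delta”) features mentioned above.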
Speaker robustness - VTLN
• Different vocal tract lengths -> different formant positions (e.g., male vs. female)
• The expansion/compression can be estimated
• Typically use statistical modeling to optimize
• Can look at characteristics like pitch or the 3rd formant
Acoustic (environment) robustness
• Convolutional error (e.g., microphone, channel spectrum)
• Additive noise (e.g., fans, auto engine)
• Limitations for typical solutions: time-invariant or slowly varying, linear, phone-independent
Key Processing Step for ASR: Cepstral Mean Subtraction
• Imagine a fixed filter h(n), so x(n) = s(n)*h(n)
• Same arguments as before, but:
- let s vary over time
- let h be fixed over time
• Then the average cepstrum should represent the fixed component (including any fixed part of s)
• (Think about it)
Convolutional Error

X(ω,t) = S(ω,t) H(ω,t)
|X(ω,t)|² = |S(ω,t)|² |H(ω,t)|²
log |X(ω,t)|² = log |S(ω,t)|² + log |H(ω,t)|²
C_X(n,t) = C_S(n,t) + C_H(n,t)
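Since a fixed channel adds the same C_H(n) to every frame's cepstrum, cepstral mean subtraction amounts to one line; the demo below checks that adding a fixed channel cepstrum to every frame changes nothing after CMS:

```python
import numpy as np

def cepstral_mean_subtraction(C):
    """Subtract the utterance mean from each cepstral coefficient
    (rows = frames). A fixed convolutional term (channel, microphone)
    is additive in the cepstral domain, so it is removed along with
    the mean."""
    return C - C.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
C_s = rng.standard_normal((100, 13))   # time-varying source cepstra
c_h = rng.standard_normal(13)          # fixed channel cepstrum
clean = cepstral_mean_subtraction(C_s)
filtered = cepstral_mean_subtraction(C_s + c_h)   # same result
```

Note it also removes any fixed part of the speech itself, which is the caveat on the slide above.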
Convolutional error strategies
• Blind deconvolution / cepstral mean subtraction: Atal 1974
• On-line method - RelAtive SpecTral Analysis (RASTA): Hermansky and Morgan, 1991
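A sketch of RASTA's on-line alternative to CMS: band-pass filter each log critical-band (or log spectral) trajectory so that slowly varying components such as a fixed channel are suppressed. The coefficients below are the commonly cited "standard" form (numerator (2 + z^{-1} − z^{-3} − 2 z^{-4})/10 with a single pole at 0.98); the filter's group delay is ignored in this sketch:

```python
import numpy as np

def rasta_filter(logspec):
    """Apply the standard RASTA band-pass filter along time to each
    log-spectral trajectory. logspec has shape (frames, bands).
    Near-DC (slowly varying) components are suppressed."""
    num = np.array([2.0, 1.0, 0.0, -1.0, -2.0]) / 10.0
    out = np.zeros_like(logspec)
    for t in range(logspec.shape[0]):
        x = logspec[max(0, t - 4):t + 1][::-1]     # x[t], x[t-1], ...
        acc = np.dot(num[:len(x)], x)              # FIR part
        out[t] = acc + (0.98 * out[t - 1] if t > 0 else 0.0)  # pole
    return out
```

Because the numerator coefficients sum to zero, a constant input (a pure channel offset in the log domain) decays toward zero at the output.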
Some variants on CMS
• Subtract the utterance mean from each cepstral coefficient
• Compute the mean over a longer region (e.g., a conversational side)
• Compute a running mean
• Use the mean from the last utterance
• Also divide by the standard deviation
“Standard” RASTA
Some of the proposed improvements to RASTA
• Run backwards and forwards in time (gets rid of phase in the transfer function)
• Train the filter on data (discriminative RASTA)
• Use multiple filters
• Use in combination with Wiener filtering
Long-time convolution
• Reverberation has effects beyond the typical analysis frame
• Can do log spectral subtraction with long frames
• Alternatively, smear the system training data to improve the match to temporal smearing in test
• In practice, this is an unsolved problem (especially when noise is present, i.e., always)
Additive noise (stationary)
• Subtract off a noise spectral estimate
• Need a noise estimate
• Use a second microphone if you have it
Wiener filter / spectral subtraction
• Assume that X = S + N (suppressing the frequency dependence in the notation)
• If uncorrelated, |X|² = |S|² + |N|² (PSDs)
• Spectral subtraction: |S_est|² = |X|² − |N_est|², i.e., |H|² = 1 − |N|²/|X|²
• If there is no channel effect, the Wiener filter is H = |S|²/(|S|² + |N|²)
• So the Wiener filter is H = 1 − |N|²/|X|²
• Similar effect, differing only in the exponents
• In practice there are many variants - also smoothing to avoid “musical noise”
Just Suppose …
• What if, for some ω, |N_est|² > |X|²?
• Then |S_est|² = |X|² − |N_est|² is negative
• But if it is a PSD …
• So, what should we do?
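A common remedy is to floor the estimate so it remains a valid nonnegative PSD. A sketch, where flooring at a small fraction of |X|² is one choice among many (the fraction, and any smoothing across time or frequency, vary widely between implementations):

```python
import numpy as np

def spectral_subtraction(X_pow, N_pow, floor=0.002):
    """|S_est|^2 = |X|^2 - |N_est|^2, floored so the result stays a
    valid (nonnegative) power spectral density. The floor, a small
    fraction of |X|^2 here, is an illustrative choice."""
    S_pow = X_pow - N_pow
    return np.maximum(S_pow, floor * X_pow)
```

Abrupt flooring like this is one source of the “musical noise” mentioned above, which is why practical systems smooth the gain over time and frequency.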
Piano with noise
Piano with noise and Wiener filtering
ETSI standard: AFE
• Aurora competition
• AFE = “Advanced Front End”
• Noise estimation and Wiener filtering, done twice
• Emphasis on high-SNR parts of the waveform
• Other methods did well later (e.g., Ellis - Tandem [MLP+HMM], 2 streams, PLP+MSG)
Modulation-filtered SpectroGram (MSG)
Kingsbury, 1998 Berkeley PhD thesis
Noise and convolution
• Can use a different form of RASTA: “J-RASTA”
• Filters a log-like function of the spectrum: f(x) = log(1 + Jx), where J ∝ 1/(noise power)
• Many other methods (primarily statistical)
• None lower word error rates to clean-speech levels
Noise and convolution - other compensation methods
• Given “stereo” data, find the additive vector that best matches the cepstra
• Get data from multiple testing environments/microphones, find the best match
• Vector Taylor Series methods (approximate the effect of noise and convolution on the cepstra)
• SPLICE (Stereo-based Piecewise LInear Compensation for Environments) methods
• Or else, adaptation of the statistical model
Noise and convolution - what would we really want?
• For the online case, we would like to be insensitive to noise and convolutional errors
• Would like to do this without needing known noise regions
• People can do this
• So - study the auditory system?
![Page 91: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/91.jpg)
“Auditory” properties in speech “front ends”

• Nonlinear spacing/bandwidth for the filter bank
• Compression (log for MFCC, cube root for PLP)
• Preemphasis/equal loudness
• Smoothing for the envelope estimate
• Insensitivity to a constant spectral multiplier
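The two compression rules listed above (log for MFCC, cube root for PLP) can be sketched as a single function over filter-bank energies; the flooring constant is an illustrative assumption to keep log(0) out of the picture.

```python
import numpy as np

def compress(fbank_energies, kind="log", floor=1e-10):
    """Amplitude compression of filter-bank energies:
    'log' as in MFCC, 'cube_root' (x**(1/3)) as in PLP
    (approximating the intensity-loudness power law)."""
    e = np.maximum(np.asarray(fbank_energies, dtype=float), floor)
    if kind == "log":
        return np.log(e)
    if kind == "cube_root":
        return np.cbrt(e)
    raise ValueError(kind)
```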
![Page 92: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/92.jpg)
Auditory Models

• Shifting definitions
• Typically means whatever we aren’t using yet
• Example: Ensemble Interval Histogram (EIH), which looks for coherence across bands in histograms of threshold crossings
![Page 93: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/93.jpg)
![Page 94: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/94.jpg)
Seneff Auditory Model
![Page 95: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/95.jpg)
Auditory Models (cont.)

• Representation of cochlear output - e.g., the cochleagram
• Representation of temporal information - the correlogram (particularly for pitch); shows the autocorrelation function for each spectral component, i.e., frequency vs. lag
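A minimal sketch of one correlogram frame, under the definition above: for each spectral (filter-bank) channel, compute a short-time autocorrelation, yielding a channel-by-lag array (frequency vs. lag). For a periodic input, the per-channel peaks line up at the pitch period. Names and normalization are illustrative.

```python
import numpy as np

def correlogram_frame(band_signals, max_lag):
    """band_signals: array (n_channels, n_samples) of filter-bank outputs
    for one analysis frame. Returns an (n_channels, max_lag+1) array of
    normalized autocorrelations: frequency (channel) vs. lag."""
    x = np.asarray(band_signals, dtype=float)
    n_ch, n = x.shape
    out = np.zeros((n_ch, max_lag + 1))
    for c in range(n_ch):
        # full autocorrelation; index n-1 is lag 0
        r = np.correlate(x[c], x[c], mode="full")[n - 1 : n + max_lag]
        out[c] = r / (r[0] if r[0] > 0 else 1.0)  # lag 0 normalized to 1
    return out
```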
![Page 96: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/96.jpg)
Correlogram example 1
![Page 97: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/97.jpg)
Correlogram example 2
![Page 98: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/98.jpg)
Correlogram example 3
![Page 99: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/99.jpg)
Correlogram example 4
![Page 100: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/100.jpg)
Correlogram example 5
![Page 101: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/101.jpg)
Other perspectives

• Temporal information in individual bands (TRAPS/HATS)
• Spectro-Temporal Receptive Fields (models from ferret brain experiments)
• Multiple mappings for greater robustness, including more “sluggish” features
![Page 102: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/102.jpg)
ASR Systems are half-deaf

• Phonetic classification is very poor (even in low-noise conditions)
• Success is due to constraints (domain, speaker, noise-canceling mic, etc.)
• These constraints can mask the underlying weakness of the technology
![Page 103: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/103.jpg)
Pushing the envelope (aside)

• Problem: the spectral envelope is a fragile carrier of the estimate of sound identity
• Solution: probabilities from multiple time-frequency patches

[Figure: OLD - a single estimate of sound identity from a 25 ms window (stepped by 10 ms); NEW - i-th, k-th, ..., n-th estimates from time-frequency patches up to 1 s long, combined by information fusion into an estimate of sound identity]
![Page 104: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/104.jpg)
Multi-rate features

[Figure: three streams - Narrowband 500 ms (HATS) over 13 overlapping spectral slices, Broadband 100 ms (Tandem) over 9 frames of PLP cepstra, and Broadband 25 ms from 1 frame of PLP cepstra; MLPs produce posteriors that are combined, then concatenated with the cepstral features]
![Page 105: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/105.jpg)
Multiple microphones

• Array approaches for beamforming
• “Distant” second microphone for a noise estimate - use cross-correlation to derive the transfer function from the noise sensor to the signal sensor
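A hedged sketch of the second idea above, in the frequency domain: estimate the transfer function H(f) = S_xy(f)/S_yy(f) that maps the noise-reference microphone onto the noise component at the signal microphone, from the cross-spectrum and the reference auto-spectrum. This is a single-frame illustration; in practice the spectra are averaged over noise-only frames before the filtered reference is subtracted.

```python
import numpy as np

def noise_transfer_fn(signal_mic, noise_mic, eps=1e-12):
    """Estimate H(f) = S_xy(f) / S_yy(f) from one frame of the signal
    mic (x) and the noise-reference mic (y). eps avoids division by
    zero in empty frequency bins."""
    X = np.fft.rfft(signal_mic)
    Y = np.fft.rfft(noise_mic)
    return (X * np.conj(Y)) / (np.abs(Y) ** 2 + eps)
```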
![Page 106: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/106.jpg)
The Rest of the System

• Focus has been on features
• Feature choice affects the statistics
• Noise/channel robustness strategies often focus on the statistical models
• For now, we will focus on a deterministic view - later, deterministic ASR (ch. 24)
• First, pitch and general audio - chaps. 16, 31, 35, 37, 39
![Page 107: Feature Extraction for speech applications](https://reader035.fdocuments.in/reader035/viewer/2022070405/56813dfe550346895da7d896/html5/thumbnails/107.jpg)
End - feature extraction; on to DTW …