Feature Extraction from Speech Signals
Abeer Alwan
Speech Processing and Auditory Perception Laboratory (SPAPL) Department of Electrical Engineering, UCLA
http://www.ee.ucla.edu/spapl [email protected]
[Figure: waveform (samples 0–14000, amplitude −0.8 to 1) and spectrogram (0–4000 Hz, frames 20–160) of the utterance “ONE FIVE SEVEN”.]
[Figure: the same spectrogram with the words segmented and labeled ONE, FIVE, SE-VEN.]
[Figure: spectrogram of “ONE FIVE SEVEN” alongside a short-time spectral slice (0–4000 Hz).]
[Figure: the spectral slice with the harmonics labeled.]
[Figure: the spectral slice showing harmonics of the pitch, f0 = 250 Hz.]
[Figure: harmonics (pitch f0 = 250 Hz) together with the vocal tract transfer envelope.]
[Figure: harmonics, the vocal tract transfer envelope, and its peaks, the formants.]
[Figure: spectrogram with a filter bank overlaid.]
[Figure: filter-bank outputs processed by log(.), DCT, and ∆1/∆2 computation.]
[Figure: filter bank → log(.), DCT, ∆1/∆2 → Hidden Markov Models.]
[Figure: MFCC extraction pipeline for the utterance “ONE FIVE SEVEN” — waveform → Short-Time Fourier Transform → Mel Filter Bank → log(.) → DCT (cepstrum transform) → MFCCs with ∆1/∆2 features.]
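The MFCC pipeline (short-time Fourier transform → mel filter bank → log → DCT → deltas) can be sketched in NumPy as follows. This is a minimal illustration of the standard recipe, not the exact front end used in the talk; the frame size (25 ms), hop (10 ms), and filter count (26) are common defaults assumed here.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with center frequencies evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fbank

def dct2_ortho(x):
    # Orthonormal DCT-II along the last axis (the cepstrum transform).
    N = x.shape[-1]
    n = np.arange(N)
    C = np.cos(np.pi * np.outer(n, n + 0.5) / N) * np.sqrt(2.0 / N)
    C[0] /= np.sqrt(2.0)
    return x @ C.T

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_fft=256,
         n_filters=26, n_ceps=13):
    # 1. Frame the signal (25 ms frames, 10 ms hop at 8 kHz) and window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # 2. Short-time Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Mel filter bank, 4. log compression, 5. DCT -> cepstral coefficients.
    energies = np.maximum(power @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    return dct2_ortho(np.log(energies))[:, :n_ceps]

def deltas(feats):
    # Delta (Δ1) features: frame-to-frame derivative; apply twice for Δ2.
    return np.gradient(feats, axis=0)
```

The Δ1/Δ2 features append an estimate of how the cepstra change over time, which is what the HMM stage in the figure consumes alongside the static MFCCs.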
Variability
• The variability in the way humans produce speech (due, for example, to gender, accent, age, and emotion) necessitates data-driven approaches to capture significant trends and behavior in the data.
• The same variability, however, may not be modeled adequately by such systems, especially if data are limited and/or corrupted by noise.
Technological Challenges
• Robustness of automatic speech recognition (ASR) systems to background acoustic and/or channel noise
• ASR robustness to variations due to gender, age, affect, dialect, accent, and speaking rate
•Limited data
Within- and across-speaker variation in voice quality
Motivation
• Voice quality varies both between and within speakers
  – Within-speaker variability may not be negligible
  – A quantitative model for the variability is needed
Database
• Need a new database: existing databases are not adequate to study between- and within-speaker variability at the same time.
[Table: corpora (TIMIT, Switchboard, …, Japanese, UCLA) compared on the requirements: large number of speakers, multiple recording sessions, multiple speech tasks, high-quality recording.]
UCLA Database
• Data collected in collaboration with the Linguistics department and Medical school
• Intra-speaker variability
  – Day/time variability (session variability)
  – Read speech vs. conversational speech
  – Low-affect speech vs. high-affect speech
• Recordings
  – Steady-state vowel /a/ (3 repetitions)
  – Reading sentences
  – Explaining something to someone they do not know
  – Phone call to someone they know
  – Telling something unimportant/joyful/annoying
  – Speaking to pets
Preliminary Study Overview
• Objective
  – Validate the importance of studying inter- and intra-speaker variability in voice quality by acoustic analysis
• Preliminary experiments with steady-state vowels; motivation:
  – Minimum intra-speaker variability
  – Acoustic measures can be estimated with high accuracy
• Stimuli
  – Subset of the UCLA database
  – 9 female and 9 male speakers (18 speakers)
  – 3 repetitions of vowel /a/ from 3 sessions on different days (9 tokens per speaker)
  – Sampling rate: 22,100 Hz
Observations
• Intra-speaker variance of some features is comparable to, or even larger than, inter-speaker variance.
[Bar charts: F0H, NR, and total variances for females and males.]
Experimental Setup
• Classifier
  – Support Vector Machine (LIBSVM)
• Train conditions
  – Matched
    • Trained on all 3 sessions with the first 2 vowels
    • Tested on all 3 sessions with the remaining 1 vowel
  – Mismatched
    • Trained on 2 sessions with all 3 vowels
    • Tested on the remaining 1 session with 3 vowels
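The matched/mismatched partitioning over 3 sessions × 3 vowel tokens per speaker can be sketched as below. The array layout and feature dimension are assumptions for illustration only; the talk used SVM classifiers (LIBSVM) on the resulting splits.

```python
import numpy as np

# Hypothetical data layout: tokens[speaker, session, repetition] -> feature vector.
rng = np.random.default_rng(0)
n_spk, n_sess, n_rep, dim = 18, 3, 3, 10
tokens = rng.normal(size=(n_spk, n_sess, n_rep, dim))

def matched_split(tokens):
    # Train: all 3 sessions, first 2 tokens; test: all sessions, remaining token.
    train = tokens[:, :, :2].reshape(tokens.shape[0], -1, tokens.shape[-1])
    test = tokens[:, :, 2:].reshape(tokens.shape[0], -1, tokens.shape[-1])
    return train, test

def mismatched_split(tokens, held_out):
    # Train: 2 sessions with all tokens; test: the held-out session.
    keep = [s for s in range(tokens.shape[1]) if s != held_out]
    train = tokens[:, keep].reshape(tokens.shape[0], -1, tokens.shape[-1])
    test = tokens[:, held_out]
    return train, test
```

In both conditions each speaker contributes 6 training and 3 test tokens; the difference is whether the test session was seen in training, which is exactly what the session-variability comparison probes.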
Speaker Identification Result

| Accuracy | Set 1 (vocal tract) | Set 2 (voice source) | Set 3 (perceptual) | Set 4 (MFCCs) |
| --- | --- | --- | --- | --- |
| Matched | 81.5% | 96.3% | 94.4% | 98.2% |
| Mismatched | 61.1% | 79.6% | 90.7% | 83.3% |

• Observations
  – Good features for the matched condition are not necessarily good for the mismatched condition
  – Perceptual voice quality features are most robust to session variability
Height Estimation and Speaker Adaptation using Subglottal Resonances
The Subglottal System
• The acoustic system below the glottis consists of the trachea, bronchi, and lungs.
• Like the supraglottal system (the vocal tract), the subglottal system is characterized by a series of poles and zeros, referred to as subglottal resonances and anti-resonances.
Coupling of the Subglottal System
• Introduces pole-zero pairs in the vocal tract transfer function (Stevens, 1998)
• Theoretically, subglottal resonances are independent of speech sounds (Lulich, 2008b)
• The ‘acoustic length’ of the subglottal system is proportional to speaker height
Motivation for studying SGRs
• The vocal-tract shape changes continuously, but the subglottal tract has a fixed configuration.
  – SGRs (red) remain constant irrespective of spoken content and language, unlike formants (green).
• SGRs form natural boundaries between vowel categories.
[Figure: vowel space with /iy/, /uw/, and /aa/; axes indicate increasing vowel height and increasing vowel backness.]
Major goals of this research
• Collect a sizable database of time-synchronized speech and subglottal acoustics for both adults and children.
  – Use an accelerometer to record subglottal acoustics.
• Develop automatic methods to extract subglottal information using speech signals only.
• Investigate the role of subglottal features in tasks requiring speaker-specific information.
  – Exploit the stationarity of subglottal acoustics for improved performance when speech data are limited.
The WashU-UCLA corpora
• Time-synchronized recordings of speech (microphone) and subglottal acoustics (accelerometer).
• Subjects: 50 adults (18 to 25 yrs); 43 children (6 to 17 yrs).
  – Native speakers of American English.
  – Speaker age and height were recorded.
• Recordings were made in a sound-attenuated booth and were phrases of the form “I said a <CVC> again”.
  – 10 repetitions of each <CVC> for adults; 4 repetitions for children.
  – Vowel beginning, steady state, and end were labeled manually.
• Data (for adults only) will be released for free by the LDC on 4/15.
[Photo: K&K Sound HotSpot accelerometer]
[Flowchart: data (speech & subglottal acoustics) → measurement and analysis → SGR estimation algorithms and estimation of subglottal spectral features → applications: height estimation, speaker normalization, and speaker identification.]
Results: actual SGRs
• Significant differences between:
  – Adults and children
  – Adult males and adult females

| | Sg1 (Hz) | Sg2 (Hz) | Sg3 (Hz) |
| --- | --- | --- | --- |
| **Adults** | | | |
| Average male | 542 | 1327 | 2198 |
| Average female | 659 | 1511 | 2410 |
| Overall average | 601 | 1419 | 2304 |
| **Children (6 to 17 yrs)** | | | |
| Average male | 727 | 1720 | 2710 |
| Average female | 752 | 1778 | 2720 |
| Overall average | 730 | 1740 | 2710 |
Results: SGRs versus body height
• Trachea length correlates with height [Griscom & Wohl, 1985].
  – This results in negative correlations between SGRs and height.
Evaluation setup
• For adults
  – Training: 35 speakers in the WashU-UCLA corpus.
  – Evaluation: 14 speakers in the MIT Tracheal Resonance database (with known SGRs).
• For children
  – Training: 25 speakers in the WashU-UCLA corpus.
  – Evaluation: the remaining 18 speakers.
• Utterances are 2 to 3 seconds long, sampled at 8 kHz.
Results
• RMSEs are on the order of measurement standard deviations.
[Bar charts: RMSEs in Hz for Sg1, Sg2, and Sg3 at 5 dB SNR, for adults (clean, babble, pink, factory, and white noise; axis 0–100 Hz) and for children (clean, babble, pink, car, and white noise; axis 0–200 Hz).]
Height Estimation
• Based on the correlation between SGRs and body height.
[Flowchart: actual SGRs and heights from the WashU-UCLA corpus are used to model the relationship between SGRs and speaker height; the automatic SGR estimation algorithm produces estimated SGRs for an unknown speaker, which the model maps to an estimated height.]
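The SGR-to-height mapping can be illustrated with a simple least-squares line. The Sg1/height pairs below are made up, and the linear model is a stand-in for the actual model trained on the corpus; the sketch only shows the direction of the effect reported in the talk (taller speaker → longer trachea → lower Sg1).

```python
import numpy as np

# Hypothetical (Sg1, height) training pairs -- illustration only.
sg1_hz = np.array([520.0, 560.0, 600.0, 650.0, 700.0])
height_cm = np.array([185.0, 180.0, 172.0, 165.0, 158.0])

# Fit a least-squares line height ~ slope * Sg1 + intercept.
slope, intercept = np.polyfit(sg1_hz, height_cm, 1)

def estimate_height(sg1):
    """Map an automatically estimated Sg1 (Hz) to a height estimate (cm)."""
    return slope * sg1 + intercept
```

With real training data the slope comes out negative, reflecting the negative SGR-height correlation from [Griscom & Wohl, 1985].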
Height estimation: evaluation
• Training data: SGRs and heights of 50 speakers.
• Evaluation data: speech signals of 604 speakers (TIMIT).

| | Using Sg1 | Using Sg2 | Ganchev et al. |
| --- | --- | --- | --- |
| Mean abs. error | 5.3 cm | 5.4 cm | 5.3 cm |
| RMS error | 6.6 cm | 6.7 cm | 6.8 cm |

Arsikere, Leung, Lulich and Alwan (Speech Communication, 2013)
Results
• RMSEs (cm) for clean speech, comparing the three SGRs:

| Using Sg1 | Using Sg2 | Using Sg3 |
| --- | --- | --- |
| 6.8 | 6.9 | 7.1 |

• RMSEs (cm) for noisy speech (0 dB SNR) using Sg1:

| Clean | Babble | White | Pink | Factory |
| --- | --- | --- | --- | --- |
| 6.8 | 6.9 | 7.2 | 7.0 | 6.8 |

• Advantages over [Ganchev et al., 2010] (RMSE = 6.8 cm): just 1 feature (as opposed to 50); very little training data (50 speakers vs. 468); no need to retrain models for every scenario.
SGR-based normalization (SGRN)
• Motivation for using SGRs:
  – Role of Sg1 and Sg2 as vowel-feature boundaries.
  – Phonetic invariance.
  – Can be estimated fairly well from noisy speech.
[Scatter plot: F2 (Hz) vs. F1 (Hz) for a male adult (blue) and a male child (red), with each speaker’s Sg1 and Sg2 marked (Sg1m, Sg2m; Sg1c, Sg2c).]
Evaluation setup
• Task: recognition of connected digit utterances.
  – TIDIGITS database (utterances of 1 to 7 digits each).
  – ASR models trained on 112 adult speakers; tested on 50 children.
  – Training in the clean condition.
  – Testing in clean + babble, pink, white, and car noise (5 to 15 dB).
• Hidden Markov Models (HMMs) are used.
• Features: standard MFCCs extracted at 10 ms intervals.
• Performance metric: word error rate (WER).
  – Total number of substitution, insertion, and deletion errors.
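The WER metric named above is the standard word-level edit distance normalized by the reference length; a minimal implementation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, recognizing “one three” for the reference “one two three” is one deletion out of three reference words, i.e. a WER of 1/3.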
Experimental results

WERs (in %) by age group, averaged across noise types:

| | 6 to 8 yrs | 9 to 11 yrs | 12 to 15 yrs |
| --- | --- | --- | --- |
| Baseline | 39.7 | 23.2 | 16.7 |
| VTLN | 26.2 | 17.9 | 16.1 |
| SGRN | 18.9 | 14.4 | 14.1 |

WERs (in %) by utterance length, averaged across noise types:

| | 1 or 2 words | 3 to 5 words | 6 or 7 words |
| --- | --- | --- | --- |
| Baseline | 24.7 | 27.5 | 27.5 |
| VTLN | 18.2 | 18.4 | 17.4 |
| SGRN | 14.9 | 16.2 | 15.6 |
Summary (speaker normalization)
• Proposed approach: SGR-based normalization (SGRN).
  – ML corrections applied to initial SGR estimates, which are fairly robust to noise; frequency-dependent scaling.
• Experiments on children’s ASR in quiet and noise: SGRN is better than VTLN, especially for young speakers and short utterances.
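One way to picture frequency-dependent scaling with SGRs is a band-wise warp: each frequency is scaled by the ratio of an adult reference SGR to the child’s SGR for the band it falls in. The piecewise form below is an assumption for illustration, not the paper’s exact warping function; the average SGR values are taken from the corpus table earlier in the talk.

```python
# Hypothetical band-wise frequency warp from a child's SGRs toward adult
# reference SGRs. Defaults use the overall-average SGRs from the corpus.
def sgr_warp(f, sg_child=(730.0, 1740.0, 2710.0),
             sg_ref=(601.0, 1419.0, 2304.0)):
    # Find the first SGR band containing f and scale by that band's ratio.
    for child_sgr, ref_sgr in zip(sg_child, sg_ref):
        if f <= child_sgr:
            return f * (ref_sgr / child_sgr)
    # Above Sg3: reuse the Sg3 ratio.
    return f * (sg_ref[-1] / sg_child[-1])
```

Because children’s SGRs are higher than adults’, this maps a child’s formant frequencies downward toward the adult models, which is the direction normalization needs.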
Speaker Identification (SID)
• SGCCs: computed just like speech MFCCs, but from subglottal acoustics.
Conclusion and future directions
• Subglottal features are useful for: (1) height estimation, (2) speaker normalization for ASR, (3) speaker identification, and (4) cross-language adaptation.
  – Effective with limited data.
  – Robust to environmental noise.
• Future research directions:
  – Data collection and analysis in other languages.
  – Pole-zero models for SGR estimation in speech.
  – Larger databases and better models for height estimation.
Noise Robust ASR
Speech Perception in Noise
Healthy-hearing adults are remarkably adept at perceiving speech in noise. However,
• Nearly 30 million North Americans are hearing impaired, yet fewer than 6 million use hearing aids. The most common complaint of hearing-aid users is listening to speech in naturally noisy environments.
• The performance of automatic speech recognition (ASR) systems degrades significantly in the presence of noise.
The ‘Robust’ Human Auditory System
The auditory system is extremely robust to noise due to both:
• “Intelligent” high-level processing
• An inherently robust auditory representation (front-end processing)
Knowledge-Based Signal-Processing Techniques
• Adaptation (sensitivity to onsets and offsets): modeled after forward-masking experiments
• Peak isolation / spectral sharpening: physiological and perceptual evidence
• Not all ‘uniform’ segments are equally important (variable frame rate, VFR)
• The VTTF changes slowly in time (peak threading)
• Extracting the spectral envelope more precisely (harmonic demodulation)
I. Adaptation: the auditory system adapts as a function of previous input
[Figure panels (A) and (B): forward-masking data with varied masker level, probe delay, and frequency.]
Strope and Alwan (1997, 1998)
II. Peak Isolation
“These schemes will succeed only to the extent that metrics can be found that are sensitive to phonetically relevant spectral differences.” (D.H. Klatt, 1981)
[Figure: spectrograms of the digit string “9 6 1 3” over time, in quiet and at 5 dB SNR.]
III. Variable Frame Rate Analysis (VFR)
• Spectral changes are important perceptual cues for discrimination. Such changes can occur over very short time intervals.
• Computing frames every 10 ms, as is commonly done in ASR, is not sufficient to capture such dynamic changes.

Zhu and Alwan (2000, 2003); You and Alwan (2002)
An Example of VFR
Frame selection in VFR for a digit string “one two” with silence:
[Figure: speech waveform; inter-frame distance d(i) based on Euclidean distance or entropy; frames selected where d(i) is large.]
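The frame-selection step can be sketched as follows: frames are first extracted at a fixed (dense) rate, and a frame is kept only when its distance to the last kept frame exceeds a threshold, so rapidly changing regions keep more frames and steady segments fewer. Euclidean distance is one of the two criteria named above (the other is entropy); the threshold value and function names here are assumptions for illustration.

```python
import numpy as np

def select_frames(features, threshold):
    # features: (n_frames, dim) array of per-frame feature vectors,
    # extracted at a fixed high rate.
    kept = [0]                      # always keep the first frame
    last = features[0]
    for i in range(1, len(features)):
        # Keep frame i only if it has moved far enough from the last kept frame.
        if np.linalg.norm(features[i] - last) > threshold:
            kept.append(i)
            last = features[i]
    return kept
```

On a steady vowel the feature trajectory barely moves, so few frames survive; at a consonant transition the distance spikes and frames are kept densely, which is the intended behavior.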
Summary
• Achieved improved performance for rapid speaker normalization and speaker ID with limited data, using subglottal resonance information.
• Estimated speaker height using one feature.
• Achieved improved ASR noise-robustness using adaptation, peak isolation, and variable frame rate analysis.

However, we are yet to achieve human performance on these tasks (except for height estimation)!
Other Research Projects
•High-speed imaging of the vocal cords to improve modeling and synthesis
•Bird song classification using limited data
Acknowledgements
Former and Current Students: S. Wang, H. Arsikere, S. Park, B. Strope
Collaborators: S. Lulich, M. Sommers
Work supported in part by the NSF, NIH, DARPA, and Sony PlayStation