Feature Extraction from Speech Signals
Abeer Alwan
Speech Processing and Auditory Perception Laboratory (SPAPL) Department of Electrical Engineering, UCLA
http://www.ee.ucla.edu/spapl [email protected]
[Figure: waveform (samples 0–14000, amplitude −0.8 to 1) and spectrogram (0–4000 Hz, frames 20–160) of the utterance “ONE FIVE SEVEN”.]
[Figure: the same spectrogram with the words segmented and labeled ONE, FIVE, SE-VEN.]
[Figure: spectrogram of “ONE FIVE SEVEN” alongside a short-time spectral slice (0–4000 Hz).]
[Figure: the spectral slice with the harmonics labeled.]
[Figure: the spectral slice showing harmonics of the pitch, f0 = 250 Hz.]
[Figure: harmonics (pitch f0 = 250 Hz) together with the vocal tract transfer envelope.]
[Figure: harmonics, the vocal tract transfer envelope, and its peaks, the formants.]
[Figure: spectrogram with a filter bank overlaid.]
[Figure: filter-bank outputs processed by log(.), DCT, and ∆1/∆2 computation.]
[Figure: filter bank → log(.), DCT, ∆1/∆2 → Hidden Markov Models.]
[Figure: MFCC extraction pipeline for the utterance “ONE FIVE SEVEN” — waveform → Short-Time Fourier Transform → Mel Filter Bank → log(.) → DCT (cepstrum transform) → MFCCs with ∆1/∆2 features.]
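The MFCC pipeline (short-time Fourier transform → mel filter bank → log → DCT → deltas) can be sketched in NumPy as follows. This is a minimal illustration of the standard recipe, not the exact front end used in the talk; the frame size (25 ms), hop (10 ms), and filter count (26) are common defaults assumed here.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with center frequencies evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fbank

def dct2_ortho(x):
    # Orthonormal DCT-II along the last axis (the cepstrum transform).
    N = x.shape[-1]
    n = np.arange(N)
    C = np.cos(np.pi * np.outer(n, n + 0.5) / N) * np.sqrt(2.0 / N)
    C[0] /= np.sqrt(2.0)
    return x @ C.T

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_fft=256,
         n_filters=26, n_ceps=13):
    # 1. Frame the signal (25 ms frames, 10 ms hop at 8 kHz) and window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # 2. Short-time Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Mel filter bank, 4. log compression, 5. DCT -> cepstral coefficients.
    energies = np.maximum(power @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    return dct2_ortho(np.log(energies))[:, :n_ceps]

def deltas(feats):
    # Delta (Δ1) features: frame-to-frame derivative; apply twice for Δ2.
    return np.gradient(feats, axis=0)
```

The Δ1/Δ2 features append an estimate of how the cepstra change over time, which is what the HMM stage in the figure consumes alongside the static MFCCs.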
Variability
• The variability in the way humans produce speech (due, for example, to gender, accent, age, and emotion) necessitates data-driven approaches to capture significant trends and behavior in the data.
• The same variability, however, may not be modeled adequately by such systems, especially if data are limited and/or corrupted by noise.
Technological Challenges
• Robustness of automatic speech recognition (ASR) systems to background acoustic and/or channel noise
• ASR robustness to variations due to gender, age, affect, dialect, accent, and speaking rate
•Limited data
Within- and across-speaker variation in voice quality
Motivation
• Voice quality varies both between and within speakers
  – Within-speaker variability may not be negligible
  – A quantitative model for the variability is needed
Database
• Need a new database: existing databases are not adequate to study between- and within-speaker variability at the same time.
[Table: corpora (TIMIT, Switchboard, …, Japanese, UCLA) compared on the requirements: large number of speakers, multiple recording sessions, multiple speech tasks, high-quality recording.]
UCLA Database
• Data collected in collaboration with the Linguistics department and Medical school
• Intra-speaker variability
  – Day/time variability (session variability)
  – Read speech vs. conversational speech
  – Low-affect speech vs. high-affect speech
• Recordings
  – Steady-state vowel /a/ (3 repetitions)
  – Reading sentences
  – Explaining something to someone they do not know
  – Phone call to someone they know
  – Telling something unimportant/joyful/annoying
  – Speaking to pets
Preliminary Study Overview
• Objective
  – Validate the importance of studying inter- and intra-speaker variability in voice quality by acoustic analysis
• Preliminary experiments with steady-state vowels; motivation:
  – Minimum intra-speaker variability
  – Acoustic measures can be estimated with high accuracy
• Stimuli
  – Subset of the UCLA database
  – 9 female and 9 male speakers (18 speakers)
  – 3 repetitions of vowel /a/ from 3 sessions on different days (9 tokens per speaker)
  – Sampling rate: 22,100 Hz
Observations
• Intra-speaker variance of some features is comparable to, or even larger than, inter-speaker variance.
[Bar charts: F0H, NR, and total variances for females and males.]
Experimental Setup
• Classifier
  – Support Vector Machine (LIBSVM)
• Train conditions
  – Matched
    • Trained on all 3 sessions with the first 2 vowels
    • Tested on all 3 sessions with the remaining 1 vowel
  – Mismatched
    • Trained on 2 sessions with all 3 vowels
    • Tested on the remaining 1 session with 3 vowels
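The matched/mismatched partitioning over 3 sessions × 3 vowel tokens per speaker can be sketched as below. The array layout and feature dimension are assumptions for illustration only; the talk used SVM classifiers (LIBSVM) on the resulting splits.

```python
import numpy as np

# Hypothetical data layout: tokens[speaker, session, repetition] -> feature vector.
rng = np.random.default_rng(0)
n_spk, n_sess, n_rep, dim = 18, 3, 3, 10
tokens = rng.normal(size=(n_spk, n_sess, n_rep, dim))

def matched_split(tokens):
    # Train: all 3 sessions, first 2 tokens; test: all sessions, remaining token.
    train = tokens[:, :, :2].reshape(tokens.shape[0], -1, tokens.shape[-1])
    test = tokens[:, :, 2:].reshape(tokens.shape[0], -1, tokens.shape[-1])
    return train, test

def mismatched_split(tokens, held_out):
    # Train: 2 sessions with all tokens; test: the held-out session.
    keep = [s for s in range(tokens.shape[1]) if s != held_out]
    train = tokens[:, keep].reshape(tokens.shape[0], -1, tokens.shape[-1])
    test = tokens[:, held_out]
    return train, test
```

In both conditions each speaker contributes 6 training and 3 test tokens; the difference is whether the test session was seen in training, which is exactly what the session-variability comparison probes.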
Speaker Identification Result

| Accuracy | Set 1 (vocal tract) | Set 2 (voice source) | Set 3 (perceptual) | Set 4 (MFCCs) |
| --- | --- | --- | --- | --- |
| Matched | 81.5% | 96.3% | 94.4% | 98.2% |
| Mismatched | 61.1% | 79.6% | 90.7% | 83.3% |

• Observations
  – Good features for the matched condition are not necessarily good for the mismatched condition
  – Perceptual voice quality features are most robust to session variability
Height Estimation and Speaker Adaptation using Subglottal Resonances
The Subglottal System
• The acoustic system below the glottis consists of the trachea, bronchi, and lungs.
• Like the supraglottal system (the vocal tract), the subglottal system is characterized by a series of poles and zeros, referred to as subglottal resonances and anti-resonances.
Coupling of the Subglottal System
• Introduces pole-zero pairs in the vocal tract transfer function (Stevens, 1998)
• Theoretically, subglottal resonances are independent of speech sounds (Lulich, 2008b)
• The ‘acoustic length’ of the subglottal system is proportional to speaker height
Motivation for studying SGRs
• The vocal-tract shape changes continuously, but the subglottal tract has a fixed configuration.
  – SGRs (red) remain constant irrespective of spoken content and language, unlike formants (green).
• SGRs form natural boundaries between vowel categories.
[Figure: vowel space with /iy/, /uw/, and /aa/; axes indicate increasing vowel height and increasing vowel backness.]
Major goals of this research
• Collect a sizable database of time-synchronized speech and subglottal acoustics for both adults and children.
  – Use an accelerometer to record subglottal acoustics.
• Develop automatic methods to extract subglottal information using speech signals only.
• Investigate the role of subglottal features in tasks requiring speaker-specific information.
  – Exploit the stationarity of subglottal acoustics for improved performance when speech data are limited.
The WashU-UCLA corpora
• Time-synchronized recordings of speech (microphone) and subglottal acoustics (accelerometer).
• Subjects: 50 adults (18 to 25 yrs); 43 children (6 to 17 yrs).
  – Native speakers of American English.
  – Speaker age and height were recorded.
• Recordings were made in a sound-attenuated booth and were phrases of the form “I said a <CVC> again”.
  – 10 repetitions of each <CVC> for adults; 4 repetitions for children.
  – Vowel beginning, steady state, and end were labeled manually.
• Data (for adults only) will be released for free by the LDC on 4/15.
[Photo: K&K Sound HotSpot accelerometer]
[Flowchart: data (speech & subglottal acoustics) → measurement and analysis → SGR estimation algorithms and estimation of subglottal spectral features → applications: height estimation, speaker normalization, and speaker identification.]
Results: actual SGRs
• Significant differences between:
  – Adults and children
  – Adult males and adult females

| | Sg1 (Hz) | Sg2 (Hz) | Sg3 (Hz) |
| --- | --- | --- | --- |
| **Adults** | | | |
| Average male | 542 | 1327 | 2198 |
| Average female | 659 | 1511 | 2410 |
| Overall average | 601 | 1419 | 2304 |
| **Children (6 to 17 yrs)** | | | |
| Average male | 727 | 1720 | 2710 |
| Average female | 752 | 1778 | 2720 |
| Overall average | 730 | 1740 | 2710 |
Results: SGRs versus body height
• Trachea length correlates with height [Griscom & Wohl, 1985].
  – This results in negative correlations between SGRs and height.
Evaluation setup
• For adults
  – Training: 35 speakers in the WashU-UCLA corpus.
  – Evaluation: 14 speakers in the MIT Tracheal Resonance database (with known SGRs).
• For children
  – Training: 25 speakers in the WashU-UCLA corpus.
  – Evaluation: the remaining 18 speakers.
• Utterances are 2 to 3 seconds long, sampled at 8 kHz.
Results
• RMSEs are on the order of measurement standard deviations.
[Bar charts: RMSEs in Hz for Sg1, Sg2, and Sg3 at 5 dB SNR, for adults (clean, babble, pink, factory, and white noise; axis 0–100 Hz) and for children (clean, babble, pink, car, and white noise; axis 0–200 Hz).]
Height Estimation
• Based on the correlation between SGRs and body height.
[Flowchart: actual SGRs and heights from the WashU-UCLA corpus are used to model the relationship between SGRs and speaker height; the automatic SGR estimation algorithm produces estimated SGRs for an unknown speaker, which the model maps to an estimated height.]
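The SGR-to-height mapping can be illustrated with a simple least-squares line. The Sg1/height pairs below are made up, and the linear model is a stand-in for the actual model trained on the corpus; the sketch only shows the direction of the effect reported in the talk (taller speaker → longer trachea → lower Sg1).

```python
import numpy as np

# Hypothetical (Sg1, height) training pairs -- illustration only.
sg1_hz = np.array([520.0, 560.0, 600.0, 650.0, 700.0])
height_cm = np.array([185.0, 180.0, 172.0, 165.0, 158.0])

# Fit a least-squares line height ~ slope * Sg1 + intercept.
slope, intercept = np.polyfit(sg1_hz, height_cm, 1)

def estimate_height(sg1):
    """Map an automatically estimated Sg1 (Hz) to a height estimate (cm)."""
    return slope * sg1 + intercept
```

With real training data the slope comes out negative, reflecting the negative SGR-height correlation from [Griscom & Wohl, 1985].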
Height estimation: evaluation
• Training data: SGRs and heights of 50 speakers.
• Evaluation data: speech signals of 604 speakers (TIMIT).

| | Using Sg1 | Using Sg2 | Ganchev et al. |
| --- | --- | --- | --- |
| Mean abs. error | 5.3 cm | 5.4 cm | 5.3 cm |
| RMS error | 6.6 cm | 6.7 cm | 6.8 cm |

Arsikere, Leung, Lulich and Alwan (Speech Communication, 2013)
Results
• RMSEs (cm) for clean speech, comparing the three SGRs:

| Using Sg1 | Using Sg2 | Using Sg3 |
| --- | --- | --- |
| 6.8 | 6.9 | 7.1 |

• RMSEs (cm) for noisy speech (0 dB SNR) using Sg1:

| Clean | Babble | White | Pink | Factory |
| --- | --- | --- | --- | --- |
| 6.8 | 6.9 | 7.2 | 7.0 | 6.8 |

• Advantages over [Ganchev et al., 2010] (RMSE = 6.8 cm): just 1 feature (as opposed to 50); very little training data (50 speakers vs. 468); no need to retrain models for every scenario.
SGR-based normalization (SGRN)
• Motivation for using SGRs:
  – Role of Sg1 and Sg2 as vowel-feature boundaries.
  – Phonetic invariance.
  – Can be estimated fairly well from noisy speech.
[Scatter plot: F2 (Hz) vs. F1 (Hz) for a male adult (blue) and a male child (red), with each speaker’s Sg1 and Sg2 marked (Sg1m, Sg2m; Sg1c, Sg2c).]
Evaluation setup
• Task: recognition of connected digit utterances.
  – TIDIGITS database (utterances of 1 to 7 digits each).
  – ASR models trained on 112 adult speakers; tested on 50 children.
  – Training in the clean condition.
  – Testing in clean + babble, pink, white, and car noise (5 to 15 dB).
• Hidden Markov Models (HMMs) are used.
• Features: standard MFCCs extracted at 10 ms intervals.
• Performance metric: word error rate (WER).
  – Total number of substitution, insertion, and deletion errors.
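The WER metric named above is the standard word-level edit distance normalized by the reference length; a minimal implementation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, recognizing “one three” for the reference “one two three” is one deletion out of three reference words, i.e. a WER of 1/3.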
Experimental results

WERs (in %) by age group, averaged across noise types:

| | 6 to 8 yrs | 9 to 11 yrs | 12 to 15 yrs |
| --- | --- | --- | --- |
| Baseline | 39.7 | 23.2 | 16.7 |
| VTLN | 26.2 | 17.9 | 16.1 |
| SGRN | 18.9 | 14.4 | 14.1 |

WERs (in %) by utterance length, averaged across noise types:

| | 1 or 2 words | 3 to 5 words | 6 or 7 words |
| --- | --- | --- | --- |
| Baseline | 24.7 | 27.5 | 27.5 |
| VTLN | 18.2 | 18.4 | 17.4 |
| SGRN | 14.9 | 16.2 | 15.6 |
Summary (speaker normalization)
• Proposed approach: SGR-based normalization (SGRN).
  – ML corrections applied to initial SGR estimates, which are fairly robust to noise; frequency-dependent scaling.
• Experiments on children’s ASR in quiet and noise: SGRN is better than VTLN, especially for young speakers and short utterances.
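One way to picture frequency-dependent scaling with SGRs is a band-wise warp: each frequency is scaled by the ratio of an adult reference SGR to the child’s SGR for the band it falls in. The piecewise form below is an assumption for illustration, not the paper’s exact warping function; the average SGR values are taken from the corpus table earlier in the talk.

```python
# Hypothetical band-wise frequency warp from a child's SGRs toward adult
# reference SGRs. Defaults use the overall-average SGRs from the corpus.
def sgr_warp(f, sg_child=(730.0, 1740.0, 2710.0),
             sg_ref=(601.0, 1419.0, 2304.0)):
    # Find the first SGR band containing f and scale by that band's ratio.
    for child_sgr, ref_sgr in zip(sg_child, sg_ref):
        if f <= child_sgr:
            return f * (ref_sgr / child_sgr)
    # Above Sg3: reuse the Sg3 ratio.
    return f * (sg_ref[-1] / sg_child[-1])
```

Because children’s SGRs are higher than adults’, this maps a child’s formant frequencies downward toward the adult models, which is the direction normalization needs.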
Speaker Identification (SID)
• SGCCs: computed just like speech MFCCs, but from subglottal acoustics.
Conclusion and future directions
• Subglottal features are useful for: (1) height estimation, (2) speaker normalization for ASR, (3) speaker identification, and (4) cross-language adaptation.
  – Effective with limited data.
  – Robust to environmental noise.
• Future research directions:
  – Data collection and analysis in other languages.
  – Pole-zero models for SGR estimation in speech.
  – Larger databases and better models for height estimation.
Noise Robust ASR
Speech Perception in Noise
Healthy-hearing adults are remarkably adept at perceiving speech in noise. However,
• Nearly 30 million North Americans are hearing impaired, yet fewer than 6 million use hearing aids. The most common complaint of hearing-aid users is listening to speech in naturally noisy environments.
• The performance of automatic speech recognition (ASR) systems degrades significantly in the presence of noise.
The ‘Robust’ Human Auditory System
The auditory system is extremely robust to noise due to both:
• “Intelligent” high-level processing
• An inherently robust auditory representation (front-end processing)
Knowledge-Based Signal-Processing Techniques
• Adaptation (sensitivity to onsets and offsets): modeled after forward-masking experiments
• Peak isolation / spectral sharpening: physiological and perceptual evidence
• Not all ‘uniform’ segments are equally important (variable frame rate, VFR)
• The VTTF changes slowly in time (peak threading)
• Extracting the spectral envelope more precisely (harmonic demodulation)
I. Adaptation: the auditory system adapts as a function of previous input
[Figure panels (A) and (B): forward-masking data with varied masker level, probe delay, and frequency.]
Strope and Alwan (1997, 1998)
II. Peak Isolation
“These schemes will succeed only to the extent that metrics can be found that are sensitive to phonetically relevant spectral differences.” (D.H. Klatt, 1981)
[Figure: spectrograms of the digit string “9 6 1 3” over time, in quiet and at 5 dB SNR.]
III. Variable Frame Rate Analysis (VFR)
• Spectral changes are important perceptual cues for discrimination. Such changes can occur over very short time intervals.
• Computing frames every 10 ms, as is commonly done in ASR, is not sufficient to capture such dynamic changes.

Zhu and Alwan (2000, 2003); You and Alwan (2002)
An Example of VFR
Frame selection in VFR for a digit string “one two” with silence:
[Figure: speech waveform; inter-frame distance d(i) based on Euclidean distance or entropy; frames selected where d(i) is large.]
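The frame-selection step can be sketched as follows: frames are first extracted at a fixed (dense) rate, and a frame is kept only when its distance to the last kept frame exceeds a threshold, so rapidly changing regions keep more frames and steady segments fewer. Euclidean distance is one of the two criteria named above (the other is entropy); the threshold value and function names here are assumptions for illustration.

```python
import numpy as np

def select_frames(features, threshold):
    # features: (n_frames, dim) array of per-frame feature vectors,
    # extracted at a fixed high rate.
    kept = [0]                      # always keep the first frame
    last = features[0]
    for i in range(1, len(features)):
        # Keep frame i only if it has moved far enough from the last kept frame.
        if np.linalg.norm(features[i] - last) > threshold:
            kept.append(i)
            last = features[i]
    return kept
```

On a steady vowel the feature trajectory barely moves, so few frames survive; at a consonant transition the distance spikes and frames are kept densely, which is the intended behavior.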
Summary
• Achieved improved performance for rapid speaker normalization and speaker ID with limited data, using subglottal resonance information.
• Estimated speaker height using one feature.
• Achieved improved ASR noise-robustness using adaptation, peak isolation, and variable frame rate analysis.

However, we are yet to achieve human performance on these tasks (except for height estimation)!
Other Research Projects
•High-speed imaging of the vocal cords to improve modeling and synthesis
•Bird song classification using limited data
Acknowledgements
Former and Current Students: S. Wang, H. Arsikere, S. Park, B. Strope
Collaborators: S. Lulich, M. Sommers
Work supported in part by the NSF, NIH, DARPA, and Sony PlayStation