ICCS-NTUA : WP1+WP2

Prof. Petros Maragos
NTUA, School of ECE
URL: http://cvsp.cs.ntua.gr

Computer Vision, Speech Communication and Signal Processing Research Group

HIWIRE Meeting, July 2006, ICCS-NTUA

ICCS-NTUA in HIWIRE: 1st, 2nd Year Evaluation

Databases & Baseline Completed

Platform Front-end Release 1st Version

WP1

Noise Robust Features Completed

Multi-mic. array Enhancement Prelim. Results

Fusion Prelim. Results

Audio-Visual ASR Baseline + Adv. Visual Features

VAD Completed + Integration

WP2

VTLN Platform Integration Completed

Speaker Normalization Research Prelim. Results

Non-native Speech Database Completed


HIWIRE Advanced Front-end: Challenges

Points Considered during Implementation
• Modular Architecture
• Implementation in C-Code
• Incorporation of Different Ideas/Algorithms
• User-friendly interface providing additional options for dealing with on-site demands of the project


HIWIRE Advanced Front-end: Options

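[Flowchart: Speech Signals → Want VAD? (No / Yes: LTSD-VAD or MTE-VAD) → Want Denoising? (No / Yes: Wiener Denoising) → MFCC/TECC feature extraction. The VAD and denoising stages are labeled "Speech Pre-Processing (Denoising)" and the feature stage "Speech Processing (Features)".]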

Support for Input Speech Signals

Different Sampling Frequencies
• 8 kHz
• 11 kHz
• 16 kHz

Different Byte-Ordering
• Little-endian
• Big-endian

Different Input File Formats
• RAW
• NIST
• HTK

Provides Flags/Options:

Preprocessing/Smoothing of Speech Signals
• Hamming Windowing
• Pre-emphasis

Denoising/VAD Algorithms
• LTSD-VAD Algorithm (UGR)
• MTE-VAD Algorithm (ICCS-NTUA)
• Wiener Denoising Algorithm (used only with a VAD algorithm)

Output Features
• MFCC
• TECC
• C0 or logE
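The option set above can be sketched as a small configuration object. This is only an illustration in Python (the front-end itself is described as C code); the class name `FrontEndOptions`, its field names, and the stage labels are hypothetical and are not the front-end's actual flags. The one rule taken from the slide is that Wiener denoising is used only together with a VAD algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values as listed on the slide (names of the constants are hypothetical).
SAMPLE_RATES_KHZ = (8, 11, 16)
BYTE_ORDERS = ("little", "big")
FILE_FORMATS = ("RAW", "NIST", "HTK")
VAD_ALGORITHMS = ("LTSD", "MTE")
FEATURE_TYPES = ("MFCC", "TECC")
ENERGY_TERMS = ("C0", "logE")


@dataclass
class FrontEndOptions:
    sample_rate_khz: int = 8
    byte_order: str = "little"
    file_format: str = "RAW"
    vad: Optional[str] = None          # None means "no VAD"
    wiener_denoising: bool = False
    features: str = "MFCC"
    energy_term: str = "C0"

    def validate(self) -> None:
        """Raise ValueError if an option is outside the listed choices."""
        if self.sample_rate_khz not in SAMPLE_RATES_KHZ:
            raise ValueError("unsupported sampling frequency")
        if self.byte_order not in BYTE_ORDERS:
            raise ValueError("unsupported byte order")
        if self.file_format not in FILE_FORMATS:
            raise ValueError("unsupported input file format")
        if self.vad is not None and self.vad not in VAD_ALGORITHMS:
            raise ValueError("unknown VAD algorithm")
        if self.features not in FEATURE_TYPES:
            raise ValueError("unknown feature type")
        if self.energy_term not in ENERGY_TERMS:
            raise ValueError("energy term must be C0 or logE")
        # Rule from the slide: Wiener denoising needs a VAD decision.
        if self.wiener_denoising and self.vad is None:
            raise ValueError("Wiener denoising requires a VAD algorithm")

    def stages(self) -> List[str]:
        """Return the processing stages in the order of the flowchart."""
        self.validate()
        chain = []
        if self.vad is not None:
            chain.append(self.vad + "-VAD")
        if self.wiener_denoising:
            chain.append("Wiener")
        chain.append(self.features + "+" + self.energy_term)
        return chain
```

For example, `FrontEndOptions(vad="MTE", wiener_denoising=True, features="TECC").stages()` lists the VAD, denoising, and feature stages in that order, while requesting denoising without a VAD raises an error.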


HIWIRE Advanced Front-end: Things to Be Done

• Script is in Testing Phase

• Create a CVS Repository where Additional Modules can be Included

• Further Testing on Speech Databases (Evaluation in Progress)

• Fine-Tuning is Necessary

• Final Version should be Faster (Real-Time Processing)

• Incorporate it in the HIWIRE Platform


Microphone Arrays
• Multi-channel Speech Enhancement for Diffuse Noise Fields
– MVDR (Minimum Variance Distortionless Response) Beamforming
– Single-Channel Linear and Non-linear Post-Filtering
• The MSE criterion leads to the linear Wiener post-filter.
• The MSE STSA and MSE log-STSA criteria lead to non-linear post-filters.


Microphone Arrays

The overall speech enhancement system includes the following steps:
• The noisy input channels are fed into a time-alignment module (each input channel has a different propagation path).
• The time-aligned noisy observations are projected onto a single-channel output with minimum noise variance by the MVDR beamformer.
• The beamformer output is further processed by a post-filter chosen according to the speech enhancement criterion (MSE, MSE STSA, MSE log-STSA).
• Because the post-filters depend on second-order statistics of the source and noise signals, an estimation scheme for these statistics is required.
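For reference, the textbook forms of the two linear components named above can be written as follows. This is a sketch under standard assumptions, not necessarily the exact formulation used in the system: the symbols $\mathbf{w}$, $\mathbf{d}$, $\boldsymbol{\Gamma}$, $d_{ij}$, $c$, $\phi_{ss}$ and $\phi_{nn}$ are introduced here for illustration, and the sinc coherence is the usual model of a spherically diffuse noise field.

```latex
% MVDR weights per frequency bin, with steering vector d
% (all ones after time alignment) and noise coherence matrix Gamma:
\mathbf{w}(\omega) =
  \frac{\boldsymbol{\Gamma}^{-1}(\omega)\,\mathbf{d}(\omega)}
       {\mathbf{d}^{H}(\omega)\,\boldsymbol{\Gamma}^{-1}(\omega)\,\mathbf{d}(\omega)}

% Diffuse-field coherence between microphones i and j at distance d_{ij}
% (c is the speed of sound):
\Gamma_{ij}(\omega) = \frac{\sin(\omega d_{ij}/c)}{\omega d_{ij}/c}

% Linear (MSE-optimal) Wiener post-filter from the speech and noise
% power spectral densities at the beamformer output:
H(\omega) = \frac{\phi_{ss}(\omega)}{\phi_{ss}(\omega) + \phi_{nn}(\omega)}
```

The post-filter makes the dependence on second-order statistics explicit: both $\phi_{ss}$ and $\phi_{nn}$ have to be estimated from the noisy observations.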

Results on CMU Database
• 10 Speakers (13 utterances), Diffuse Noise
• SSNR Enhancement: SSNR_output - E[SSNR_input] (E[] stands for the mean value over the N input channels)
• LAR, LSD, IS, LLR: Low values signify high speech quality. These measures are found to correlate highly with human perception.
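The SSNR enhancement measure defined above can be sketched in a few lines. The function names, the frame length, and the per-frame clamping range of -10 to 35 dB are assumptions (the clamping is a common convention for segmental SNR, not something stated on the slide).

```python
import math


def segmental_snr_db(clean, processed, frame_len=256,
                     floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR (dB) of `processed` against the `clean` reference."""
    n = min(len(clean), len(processed))
    frame_snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        sig = 0.0
        err = 0.0
        for k in range(start, start + frame_len):
            sig += clean[k] ** 2
            err += (clean[k] - processed[k]) ** 2
        if sig == 0.0:
            continue  # skip silent reference frames
        snr = ceil_db if err == 0.0 else 10.0 * math.log10(sig / err)
        # Clamp each frame so outliers do not dominate the mean (assumed range).
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)


def ssnr_enhancement_db(clean, input_channels, enhanced, **kwargs):
    """SSNR_output - E[SSNR_input], with E[] the mean over the input channels."""
    mean_input = sum(segmental_snr_db(clean, ch, **kwargs)
                     for ch in input_channels) / len(input_channels)
    return segmental_snr_db(clean, enhanced, **kwargs) - mean_input
```

A positive value means the enhanced output has a higher segmental SNR than the input channels do on average.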


Results: CMU Database


Spectrograms: CMU Database


Multi-Microphone ASR Experiments

Details on Setup of ASR Tasks:
• 700 Sentences for Training and 300 for Testing
• 12-state, left-right HMM w. Gaussian mixtures
• All-pair, unweighted grammar
• MFCC+C0+D+DD (39 coefficients in total)

Correct Word Accuracies (%)

Input Signals for ASR task   Original   Noisy   McCowan   Proposed
MFCC+D+DD                    96.37      94.98   93.83     95.23
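The 39-coefficient count in the setup follows from 13 static coefficients (12 MFCC plus C0) with their first (D) and second (DD) time derivatives appended. A minimal sketch of that expansion, assuming the usual regression formula for deltas with a window of two frames and edge frames replicated; the function names are hypothetical:

```python
def deltas(frames, window=2):
    """Regression deltas: sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2)."""
    n_frames = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n_frames - 1)][d]  # replicate last frame
                prv = frames[max(t - k, 0)][d]             # replicate first frame
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append D and DD to each static frame (13 -> 39 coefficients)."""
    d = deltas(static)
    dd = deltas(d)
    return [s + a + b for s, a, b in zip(static, d, dd)]
```

With 13 static coefficients per frame, each output frame holds 13 + 13 + 13 = 39 values.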



Multi-Cue Feature Fusion

Goal: Fuse heterogeneous information streams optimally & adaptively

Our approach:
• Explicitly model uncertainty in all feature measurements (due to noise or model fitting errors)
• Adjust model training to account for uncertainty
• Dynamically compensate feature uncertainty during decoding

Feature uncertainty estimation in the AV-ASR case:
• For the Audio Stream/MFCC: speech enhancement process
• For the Visual Stream: model fitting variance

Properties:
• Adaptation at the frame level
• Explains and generalizes cue weighting through stream exponents
• Integrates with a wide range of models, e.g. GMM, HMM
• Applicable to both audio-audio and audio-visual scenarios
• Can be combined with asynchronous models, e.g. Product-HMM


Measurement Noise and Adaptive Fusion

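[Figure: graphical models. Conventional view: class C and feature X. Our view: class C, feature X, and measurement Y.]

Conventional View: Features are directly observable

Our View: We can only measure noise-corrupted features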

Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06


EM-Training with Partially Known Features

Conventional View (C hidden, X observed):

$Q(\theta, \theta') = E[\log p(X, \{C\} \mid \theta) \mid X, \theta']$

Our View (C and X hidden, Y observed): even training data can be uncertain

$Q(\theta, \theta') = E[\log p(Y, \{X, C\} \mid \theta) \mid Y, \theta']$

Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06


EM-Training: Results for GMM

[Slide equations for the E-Step and M-Step are not recoverable. Annotations on the slide: filtered feature estimate; similar to conventional update rules; uncertainty-compensated scores.]

Formulas for HMM are similar
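The slide's own formulas are lost, so the following is only a sketch of what an E-step with uncertain features can look like for a one-dimensional GMM, under the assumption that the measurement is y = x + e with Gaussian error e of known variance. Under that assumption each component is scored with its variance inflated by the error variance, and the clean feature is estimated per component by the standard Gaussian posterior mean. The function names are hypothetical, and this is not claimed to be the authors' exact update.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step_uncertain(y, noise_var, weights, means, variances):
    """Posteriors and filtered feature estimates for one noisy observation y.

    Assumes y = x + e with e ~ N(0, noise_var) and x drawn from the GMM.
    """
    # Uncertainty-compensated scores: component variance plus error variance.
    scores = [w * gaussian_pdf(y, m, v + noise_var)
              for w, m, v in zip(weights, means, variances)]
    total = sum(scores)
    posteriors = [s / total for s in scores]
    # Filtered estimate of the clean feature under each component:
    # shrink y toward the component mean in proportion to the error variance.
    filtered = [m + v / (v + noise_var) * (y - m)
                for m, v in zip(means, variances)]
    return posteriors, filtered
```

With `noise_var = 0` the filtered estimates equal the observation and the scores reduce to the conventional ones, which is consistent with the remark that the update rules resemble the conventional case.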


Decoding & Uncertain Features

Variance-Compensated (“Soft”) Scoring

Probabilistic Justification for Stream Exponents

Relative Measurement Error

Adaptation at each frame: stream/class/mixture-dependent stream weights
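The link between variance compensation and stream exponents can be illustrated in the scalar Gaussian case. This is a sketch under the assumption of an additive Gaussian measurement error with variance $\sigma_e^2$ on a feature modeled by $N(\mu, \sigma^2)$; the symbols and the weight $w$ are introduced here for illustration.

```latex
% Variance-compensated ("soft") score of the noisy measurement y:
\log p(y) = -\tfrac{1}{2}\log\!\big(2\pi(\sigma^2 + \sigma_e^2)\big)
            - \frac{(y-\mu)^2}{2(\sigma^2 + \sigma_e^2)}

% The quadratic term is the conventional one scaled by a weight w:
\frac{(y-\mu)^2}{2(\sigma^2 + \sigma_e^2)} = w\,\frac{(y-\mu)^2}{2\sigma^2},
\qquad
w = \frac{\sigma^2}{\sigma^2 + \sigma_e^2} = \frac{1}{1 + \sigma_e^2/\sigma^2}
```

The ratio $\sigma_e^2/\sigma^2$ plays the role of a relative measurement error: a reliable measurement gives $w$ close to 1, an unreliable one pushes $w$ toward 0, and because $\sigma^2$ differs per stream, class and mixture component while $\sigma_e^2$ changes per frame, the effective weight adapts at that granularity.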


Audio-visual Asynchrony Modeling

[Figure: Multi-stream HMM and Product HMM topologies]

Ref: Gravier et al., 2002
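As a rough illustration of the two model families, the sketch below combines per-stream log-likelihoods with stream weights (the multi-stream case) and enumerates composite audio/visual states (the product case). The function names and the optional `max_async` limit on the state lag between streams are assumptions made for this example, not details taken from the slide or the cited reference.

```python
from itertools import product


def multistream_log_likelihood(stream_loglikes, weights):
    """Weighted sum of per-stream log-likelihoods for one state and frame."""
    if len(stream_loglikes) != len(weights):
        raise ValueError("one weight per stream is required")
    return sum(w * ll for w, ll in zip(weights, stream_loglikes))


def product_states(n_audio_states, n_visual_states, max_async=None):
    """Composite (audio_state, visual_state) pairs of a product HMM.

    If max_async is given, pairs whose state indices differ by more than
    that amount are dropped, which bounds the modeled asynchrony.
    """
    states = []
    for a, v in product(range(n_audio_states), range(n_visual_states)):
        if max_async is None or abs(a - v) <= max_async:
            states.append((a, v))
    return states
```

In the multi-stream model both streams share one state sequence, so only the scores are combined; the product model lets the streams occupy different states within a unit, at the price of a larger state space.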


Fusion: Multi-Cue Audio-Audio

Feature Uncertainty for Audio features

Baseline Audio Features: MFCC

Enhancement using GMM of clean speech and Vector Taylor Series Approximation

Uncertainty is Gaussian with variance given by the enhancement process

Used for Audio-Visual Fusion

Fractal Audio Features: MFD

On-going research applying a similar framework (GMM, VTS)


MFD: From Noisy Speech to Feature Uncertainty

Ongoing Research: Noise Compensation for MFD

[Plot legend: Estimated Noisy, Clean, Noise, True Noisy; White Noise (0 dB)]



Showcase: Audio-Visual Speech Recognition

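[Figure: AAM synthesis of a face instance as a mean plus weighted modes of variation (weights p1, p2 in the first row; the weight symbols of the second row are lost).]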

Both shape & texture can assist lipreading

Active Appearance Models for face modeling

Shape and texture of faces “live” in low-dim manifolds

Features: AAM Fitting (nonlinear least squares problem)

Visual feature uncertainty is related to the sensitivity of the least-squares solution
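One common way to quantify the sensitivity of a nonlinear least-squares solution is the Gauss-Newton covariance approximation shown below. It is given as an illustration only; the slide does not state which estimate is used, and the symbols $\mathbf{p}$, $\mathbf{r}$, $\mathbf{J}$, $m$ and $n$ are introduced here.

```latex
% AAM fitting as nonlinear least squares over the parameter vector p,
% with residual r(p) between the image and the model instance:
\hat{\mathbf{p}} = \arg\min_{\mathbf{p}} \|\mathbf{r}(\mathbf{p})\|^2

% Approximate covariance of the fitted parameters, with J the Jacobian of r
% at the solution, m residuals and n parameters:
\operatorname{Cov}(\hat{\mathbf{p}}) \approx \hat{\sigma}^2\,(\mathbf{J}^{T}\mathbf{J})^{-1},
\qquad
\hat{\sigma}^2 = \frac{\|\mathbf{r}(\hat{\mathbf{p}})\|^2}{m - n}
```

A flat cost surface (small singular values of $\mathbf{J}$) yields a large variance, which marks the corresponding visual features as less reliable.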


Demo: AAM fitting and uncertainty estimates

The visual front-end supplies both features and their respective uncertainty.


Audio-Visual ASR: Database

Subset of CUAVE database used:

36 speakers (30 training, 6 testing)

5 sequences of 10 connected digits per speaker

Training set: 1500 digits (30x5x10)

Test set: 300 digits (6x5x10)

CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)

CUAVE was kindly provided by Clemson University


Evaluation on the CUAVE Database


Audio-Visual Speech Classification with MS-HMM

Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06


AV Digit Classification Results (Word Accuracy)

Audio: MFCC_D_Z (26 features)
Visual: 6 shape + 12 texture AAM coefficients
AV MS-HMM: Audio-Visual Multistream HMM, weights (1,1)
AV MS-HMM, Var-Comp: Audio-Visual Multistream HMM + Variance Compensation
AV P-HMM: Audio-Visual Product HMM, weights (1,1)
AV P-HMM, Var-Comp: Audio-Visual Product HMM + Variance Compensation

SNR (babble)   Audio   Visual   AV MS-HMM   AV MS-HMM, Var-Comp   AV P-HMM   AV P-HMM, Var-Comp
Clean          100%    68.7%    95.1%       97.0%                 95.4%      99.6%
10 dB          92.8%   -        88.3%       90.2%                 90.6%      92.5%
5 dB           73.9%   -        84.5%       86.8%                 87.2%      89.1%
0 dB           54.7%   -        79.6%       81.1%                 83.8%      82.6%

Ref: Pitsikalis, Katsamanis, Papandreou, and Maragos, ICSLP’06


AV-ASR: Results with Uncertain Training

Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06



VTLN on the Platform

Warping in the front-end
• Piecewise Linear Warping Function
• Warping in the filterbank domain by stretching or compressing the frequency axis

Training: HTK Implementation

Testing: Fast implementation using a GMM representing normalized speech to estimate warping factors per utterance.
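A piecewise linear warping function and a per-utterance warp-factor search can be sketched as follows. The knee position (`knee_ratio`), the warping direction convention, and the `score_fn` callable (standing in for the likelihood of the warped utterance under the normalized-speech GMM) are all assumptions of this example rather than details of the platform implementation.

```python
def piecewise_linear_warp(freq_hz, alpha, f_max_hz, knee_ratio=0.875):
    """Scale frequencies by alpha up to a knee, then map linearly to f_max.

    The second segment guarantees that f_max is mapped onto itself, so the
    warped axis covers the same band as the original one.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= f_max_hz:
        raise ValueError("frequency outside [0, f_max]")
    # Assumed knee choice: keeps alpha * knee below f_max for any alpha.
    knee_hz = knee_ratio * f_max_hz * min(1.0, 1.0 / alpha)
    if freq_hz <= knee_hz:
        return alpha * freq_hz
    slope = (f_max_hz - alpha * knee_hz) / (f_max_hz - knee_hz)
    return alpha * knee_hz + slope * (freq_hz - knee_hz)


def pick_warp_factor(score_fn, alphas):
    """Grid search: return the candidate alpha with the highest score."""
    return max(alphas, key=score_fn)
```

In use, the filterbank center frequencies would be passed through `piecewise_linear_warp` for each candidate factor, and `pick_warp_factor` would keep the factor whose warped features score best for the utterance.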


VTLN on the Platform, Results

[Bar chart: vertical axis from 87 to 89.5; bars for MFCC (HFE), MFCC (HTK), TECC, and MFCC+VTLN (HFE). Bar values are not recoverable.]


VTLN Research, TECC Features
• Teager Energy Cepstrum Coefficients are energy measurements at the output of a Gammatone filterbank, similarly to MFCC
• VTLN can be applied in a similar manner
• The Bark scale along which the filters are uniformly positioned is stretched or shrunk to achieve warping
• Evaluation is currently in progress
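The energy measurement behind TECC is based on the discrete Teager-Kaiser energy operator, Ψ[x(n)] = x(n)² - x(n-1)·x(n+1). A minimal sketch of that operator follows; the function names are hypothetical, and the Gammatone filterbank, logarithm, and cepstral transform that would surround it are omitted.

```python
import math


def teager_energy(x):
    """Teager-Kaiser energy x[n]^2 - x[n-1]*x[n+1] for the interior samples."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def mean_teager_energy(x):
    """Average Teager energy of a band signal (one value per filter output)."""
    psi = teager_energy(x)
    if not psi:
        raise ValueError("need at least three samples")
    return sum(psi) / len(psi)
```

For a pure tone A·cos(Ωn) the operator returns the constant A²·sin²(Ω), so the measure reflects both the amplitude and the frequency of the band signal, unlike the squared-amplitude energy.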


VTLN Research, using Formants

[Bar chart: vertical axis from 82 to 90; bars for MFCC (HFE), MFCC (HTK), VTLN (LPC), and VTLN (Multi Band). Bar values are not recoverable.]


Raw Formants-Dynamic Programming

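[Figure: dynamic-programming trellis with axes "time" and "node".]

$C(t,n) = C_{local}(t,n) + \min_m \left[ C_{trans}(m,n) + C(t-1,m) \right]$

[The definitions of $C_{local}(t,n)$ and $C_{trans}(m,n)$ are garbled in the source. Recoverable terms: the local cost involves the formant frequencies $F$ and a bandwidth term $\beta_i B_i$; the transition cost is a sum over formants $i = 1, \dots, N$ of a squared term in $F_i(t)$ and $F_i(t-1)$ with a maximum value in the denominator.]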
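The dynamic-programming recursion on this slide, cumulative cost = local cost + minimum over predecessors of (transition cost + previous cumulative cost), can be sketched generically. Because the exact local and transition costs are not recoverable from the slide, they are passed in as callables here; the function name `track_formants` and the representation of a node as an arbitrary Python object are assumptions of this example.

```python
def track_formants(candidates, local_cost, trans_cost):
    """Return the index of the chosen node in each frame.

    candidates[t] lists the candidate nodes of frame t; local_cost(t, node)
    and trans_cost(prev_node, node) supply the two cost terms.
    """
    if not candidates or any(len(frame) == 0 for frame in candidates):
        raise ValueError("every frame needs at least one candidate")
    cost = [[local_cost(0, node) for node in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for node in candidates[t]:
            # min over predecessors m of C_trans(m, n) + C(t-1, m)
            best_m, best = min(
                ((m, trans_cost(prev, node) + cost[t - 1][m])
                 for m, prev in enumerate(candidates[t - 1])),
                key=lambda pair: pair[1])
            row_cost.append(local_cost(t, node) + best)
            row_back.append(best_m)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest node of the last frame.
    n = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [n]
    for t in range(len(candidates) - 1, 0, -1):
        n = back[t][n]
        path.append(n)
    path.reverse()
    return path
```

As a toy usage, with scalar candidate frequencies per frame, a zero local cost, and a squared normalized frequency jump as transition cost, the tracker follows the smoothest trajectory through the candidates.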


Formant Tracking



Next...
• Fusion: Audio+Audio, Audio+Visual, Nonlinear Features+Visual
• Visual Front-end
• VAD + Nonlinear Features
