ICCS-NTUA : WP1+WP2

Transcript of ICCS-NTUA : WP1+WP2

Page 1: ICCS-NTUA :  WP1+WP2

ICCS-NTUA : WP1+WP2

Prof. Petros Maragos
NTUA, School of ECE
URL: http://cvsp.cs.ntua.gr

Computer Vision, Speech Communication and Signal Processing Research Group

HIWIRE

Page 2: ICCS-NTUA :  WP1+WP2

HIWIRE Meeting, July 2006, ICCS-NTUA

ICCS-NTUA in HIWIRE

Evaluation
• Databases & Baseline: Completed

Platform
• Front-end: Release 1st Version

WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration

WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed

Page 4: ICCS-NTUA :  WP1+WP2

HIWIRE Advanced Front-end: Challenges

Points Considered during Implementation:
• Modular Architecture
• Implementation in C-Code
• Incorporation of Different Ideas/Algorithms
• User-friendly interface providing additional options dealing with on-site demands of the project

Page 5: ICCS-NTUA :  WP1+WP2

HIWIRE Advanced Front-end: Options

[Flowchart: Speech Signals → Speech Pre-Processing (Denoising): "Want VAD?" (Yes: LTSD-VAD / MTE-VAD; No: skip) → "Want Denoising?" (Yes: Wiener Denoising; No: skip) → Speech Processing (Features): MFCC / TECC]

Support for Input Speech Signals

Different Sampling Frequencies
• 8 kHz
• 11 kHz
• 16 kHz

Different Byte-Ordering
• Little-endian
• Big-endian

Different Input File Formats
• RAW
• NIST
• HTK

Provides Flags/Options:

Preprocessing Smoothing of Speech Signals
• Hamming Windowing
• Pre-emphasis

Denoising/VAD Algorithms
• LTSD-VAD Algorithm (UGR)
• MTE-VAD Algorithm (ICCS-NTUA)
• Wiener Denoising Algorithm (used only with a VAD algorithm)

Output Features
• MFCC
• TECC
• C0 or logE
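
The pre-emphasis and Hamming-windowing options listed above can be illustrated with a minimal Python sketch. This is not the HIWIRE front-end's C code: the function names, the 0.97 pre-emphasis coefficient, and the frame/hop sizes are assumptions chosen for illustration.

```python
import math

def pre_emphasize(samples, coeff=0.97):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1].
    coeff=0.97 is a common default, assumed here (not taken from the slides)."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out

def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(samples, frame_len, hop):
    """Split the signal into overlapping frames and window each frame."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames
```

For example, at 16 kHz a 25 ms frame with a 10 ms hop would be `frame_signal(pre_emphasize(x), 400, 160)`; those durations are typical values, not settings confirmed by the slides.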

Page 6: ICCS-NTUA :  WP1+WP2

HIWIRE Advanced Front-end: Things to Be Done

• Script is in Testing Phase

• Create a CVS where Additional Modules should be included

• Tested Further in Speech Databases

Evaluation in progress

• Fine-Tuning is Necessary

• Final Version should be Faster (Real-Time Processing)

• Incorporate it in the HIWIRE Platform

Page 7: ICCS-NTUA :  WP1+WP2

ICCS-NTUA in HIWIRE: 1st, 2nd Year

Evaluation
• Databases & Baseline: Completed

Platform
• Front-end: Release 1st Version

WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration?

WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed

Page 8: ICCS-NTUA :  WP1+WP2

Microphone Arrays

• Multi-channel Speech Enhancement for Diffuse Noise Fields
  – MVDR (Minimum Variance Distortionless Response) Beamforming
  – Single-Channel Linear and Non-linear Post-Filtering
    • The MSE criterion leads to the linear Wiener post-filter.
    • The MSE STSA and MSE log-STSA criteria lead to non-linear post-filters.

Page 9: ICCS-NTUA :  WP1+WP2

Microphone Arrays

The Overall Speech Enhancement System includes the following steps:
• The noisy channel inputs are fed into a time-alignment module (the propagation path differs for every input channel).
• The time-aligned noisy observations are projected to a single-channel output with minimum noise variance, through the MVDR beamformer.
• The output of the beamformer is further processed by a post-filter according to the speech enhancement criterion used (MSE, MSE STSA, MSE log-STSA).
• Since the post-filters depend on second-order statistics of the source and noise signals, an estimation scheme has to be developed.
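
As a rough illustration of the MVDR step, the sketch below computes beamformer weights for one frequency bin under several assumptions that are not stated on the slides: the channels are already time-aligned (so the steering vector is all ones), the microphones form a linear array with known positions, the diffuse-field coherence follows the sinc model, and a small diagonal loading is added for numerical stability. All function names are hypothetical.

```python
import math

def diffuse_coherence(positions_m, freq_hz, c=343.0):
    """Coherence matrix of an ideal diffuse noise field:
    Gamma_ij = sin(x)/x with x = 2*pi*f*d_ij/c (assumed model)."""
    n = len(positions_m)
    gamma = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            x = 2.0 * math.pi * freq_hz * abs(positions_m[i] - positions_m[j]) / c
            gamma[i][j] = 1.0 if x == 0.0 else math.sin(x) / x
    return gamma

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def mvdr_weights(gamma, loading=1e-3):
    """w = Gamma^-1 1 / (1^T Gamma^-1 1) for time-aligned inputs.
    The diagonal loading value is an assumption, not from the slides."""
    n = len(gamma)
    loaded = [[gamma[i][j] + (loading if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    u = solve(loaded, [1.0] * n)
    total = sum(u)
    return [v / total for v in u]
```

The weights always sum to one, which is the distortionless constraint for a time-aligned source; for spatially uncorrelated noise (identity coherence) the sketch reduces to plain averaging of the channels.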

Results on CMU Database
• 10 Speakers (13 utterances), Diffuse Noise
• SSNR Enhancement: SSNR_output − E[SSNR_input] (E[·] stands for the mean value over the N input channels)
• LAR (Log-Area Ratio), LSD (Log-Spectral Distance), IS (Itakura-Saito), LLR (Log-Likelihood Ratio): low values signify high speech quality. These measures are found to correlate highly with human perception.
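
The SSNR enhancement figure can be sketched as follows. The slides give only the definition SSNR_output − E[SSNR_input]; the frame length, the non-overlapping framing, and the clipping of per-frame values to [−10, 35] dB are common conventions assumed here, and the function names are hypothetical.

```python
import math

def segmental_snr(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB between a clean reference and a processed signal.
    Frame values are clipped to [lo, hi] dB (assumed convention)."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        idx = range(start, start + frame_len)
        sig = sum(clean[n] ** 2 for n in idx)
        err = sum((clean[n] - processed[n]) ** 2 for n in idx)
        if sig == 0.0:
            continue  # skip silent reference frames
        snr = hi if err == 0.0 else 10.0 * math.log10(sig / err)
        values.append(min(hi, max(lo, snr)))
    if not values:
        raise ValueError("no usable frames")
    return sum(values) / len(values)

def ssnr_enhancement(clean, input_channels, output, frame_len=256):
    """SSNR of the enhanced output minus the mean SSNR over the N input channels."""
    mean_in = sum(segmental_snr(clean, ch, frame_len) for ch in input_channels)
    mean_in /= len(input_channels)
    return segmental_snr(clean, output, frame_len) - mean_in
```

A positive return value means the enhanced output has a higher segmental SNR than the input channels have on average.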

Page 10: ICCS-NTUA :  WP1+WP2

Results: CMU Database

Page 11: ICCS-NTUA :  WP1+WP2

Spectrograms: CMU Database

Page 12: ICCS-NTUA :  WP1+WP2

Multi-Microphone ASR Experiments

Details on Setup of ASR Tasks:
• 700 Sentences for Training and 300 for Testing
• 12-state, left-right HMM with Gaussian mixtures
• All-pair, unweighted grammar
• MFCC+C0+D+DD (39 coefficients in total)

Correct Word Accuracies (%)

Input Signals for ASR task   Original   Noisy   McCowan   Proposed
MFCC+D+DD                    96.37      94.98   93.83     95.23

Page 14: ICCS-NTUA :  WP1+WP2

Multi-Cue Feature Fusion

Goal:
• Fuse heterogeneous information streams optimally & adaptively

Our approach:
• Explicitly model uncertainty in all feature measurements (due to noise or model fitting errors)
• Adjust model training to accommodate uncertainty
• Dynamically compensate feature uncertainty during decoding

Feature uncertainty estimation in the AV-ASR case:
• For the Audio Stream/MFCC: speech enhancement process
• For the Visual Stream: model fitting variance

Properties:
• Adaptation at the frame level
• Explains and generalizes cue weighting through stream exponents
• Integrates with a wide range of models, e.g. GMM, HMM
• Applicable to both audio-audio and audio-visual scenarios
• Can be combined with asynchronous models, e.g. Product-HMM

Page 15: ICCS-NTUA :  WP1+WP2

Measurement Noise and Adaptive Fusion

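[Figure: graphical models. Conventional view: class C generates the feature X. Our view: class C generates the feature X, which is measured only through its noisy version Y.]

Conventional View: Features are directly observable

Our View: We can only measure noise-corrupted features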

Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06

Page 16: ICCS-NTUA :  WP1+WP2

EM-Training with Partially Known Features

Conventional View (C hidden, X observed):

$$Q(\Theta, \Theta') = E\left[\log p(X, \{C\} \mid \Theta) \mid X, \Theta'\right]$$

Our View (X and C hidden, Y observed): even training data can be uncertain

$$Q(\Theta, \Theta') = E\left[\log p(Y, \{X, C\} \mid \Theta) \mid Y, \Theta'\right]$$

Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06

Page 17: ICCS-NTUA :  WP1+WP2

EM-Training: Results for GMM

[Equations for the E-Step and M-Step, annotated on the slide with:]
• Filtered feature estimate
• Similar to conventional update rules
• Uncertainty-compensated scores

Formulas for HMM are similar
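
The "filtered feature estimate" can be illustrated for a single scalar Gaussian component. The slide formulas themselves are not reproduced here; the sketch only applies standard Gaussian conditioning under the assumed measurement model y = x + e, with x ~ N(mean, var) and independent noise e ~ N(0, noise_var). The function name is hypothetical.

```python
def filtered_estimate(y, mean, var, noise_var):
    """Posterior mean and variance of the clean feature x given the noisy
    measurement y, assuming y = x + e with Gaussian x and Gaussian e.
    Requires var + noise_var > 0."""
    gain = var / (var + noise_var)          # how much the measurement is trusted
    x_hat = mean + gain * (y - mean)        # shrinks y toward the model mean
    post_var = var * noise_var / (var + noise_var)
    return x_hat, post_var
```

With no measurement noise the estimate equals the measurement; as the noise variance grows, the estimate falls back to the component mean, so unreliable training frames contribute less new information to the updates.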

Page 18: ICCS-NTUA :  WP1+WP2

Decoding & Uncertain Features

Variance-Compensated (“Soft”) Scoring

Probabilistic Justification for Stream Exponents

Relative Measurement Error

Adaptation at each frame: stream-, class-, and mixture-dependent stream weights
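
A minimal sketch of variance-compensated ("soft") scoring for a diagonal Gaussian follows. It assumes the measurement model y = x + e with Gaussian measurement error, so the error variance is simply added to the model variance before scoring; the function names are hypothetical and the exact formulation is the one in the cited papers, not this code.

```python
import math

def log_gauss_diag(y, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
               for yi, m, v in zip(y, mean, var))

def soft_score(y, mean, var, meas_var):
    """Variance-compensated score: model variance inflated by the per-frame
    measurement-error variance (assumed additive, independent Gaussian error)."""
    inflated = [v + e for v, e in zip(var, meas_var)]
    return log_gauss_diag(y, mean, inflated)
```

Because `meas_var` can change from frame to frame and is added to each component's own variance, an unreliable stream is automatically de-emphasized in the frames where its measurement error is large relative to the model variance, which is one way to read the "relative measurement error" remark above.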

Page 19: ICCS-NTUA :  WP1+WP2

Audio-visual Asynchrony Modeling

[Figures: Multi-stream HMM; Product HMM]

Ref: Gravier et al., 2002
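
The two model families can be contrasted with a small sketch. In a multi-stream HMM the streams share one state sequence and their log-likelihoods are combined with stream exponents; in a product HMM the composite state is a pair (audio state, visual state), which lets the streams drift apart. The asynchrony limit `max_async` and both function names are assumptions made for illustration.

```python
def multistream_log_likelihood(stream_log_likes, weights):
    """State-synchronous fusion: weighted sum of per-stream log-likelihoods,
    i.e. a product of stream likelihoods raised to their exponents."""
    if len(stream_log_likes) != len(weights):
        raise ValueError("one weight per stream is required")
    return sum(w * ll for w, ll in zip(weights, stream_log_likes))

def product_states(n_audio, n_visual, max_async=1):
    """Composite states (audio_state, visual_state) of a product HMM,
    keeping only pairs within max_async states of each other."""
    return [(a, v)
            for a in range(n_audio)
            for v in range(n_visual)
            if abs(a - v) <= max_async]
```

With `max_async=0` the product state space collapses to the synchronous case, one composite state per shared state.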

Page 20: ICCS-NTUA :  WP1+WP2

Fusion: Multi-Cue Audio-Audio

Feature Uncertainty for Audio Features

Baseline Audio Features: MFCC
• Enhancement using a GMM of clean speech and Vector Taylor Series (VTS) approximation
• Uncertainty is Gaussian, with variance given by the enhancement process
• Used for Audio-Visual Fusion

Fractal Audio Features: MFD
• On-going research applying a similar framework (GMM, VTS)

Page 21: ICCS-NTUA :  WP1+WP2

MFD: From Noisy Speech to Feature Uncertainty

Ongoing Research: Noise Compensation for MFD

[Figure: MFD curves for Clean, Noise, True Noisy, and Estimated Noisy signals; White Noise (0 dB)]

Page 23: ICCS-NTUA :  WP1+WP2

Showcase: Audio-Visual Speech Recognition

[Figure: AAM synthesis examples, a mean plus modes of variation weighted by p1 and p2]

Both shape & texture can assist lipreading

Active Appearance Models for face modeling
• Shape and texture of faces “live” in low-dim manifolds

Features: AAM Fitting (nonlinear least squares problem)

Visual feature uncertainty is related to the sensitivity of the least-squares solution

Page 24: ICCS-NTUA :  WP1+WP2

Demo: AAM fitting and uncertainty estimates

The visual front-end supplies both features and their respective uncertainty.

Page 25: ICCS-NTUA :  WP1+WP2

Audio-Visual ASR: Database

Subset of CUAVE database used:

36 speakers (30 training, 6 testing)

5 sequences of 10 connected digits per speaker

Training set: 1500 digits (30x5x10)

Test set: 300 digits (6x5x10)

CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)

CUAVE was kindly provided by Clemson University

Page 26: ICCS-NTUA :  WP1+WP2

Evaluation on the CUAVE Database

Page 27: ICCS-NTUA :  WP1+WP2

Audio-Visual Speech Classification with MS-HMM

Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06

Page 28: ICCS-NTUA :  WP1+WP2

AV Digit Classification Results (Word Accuracy)

Audio: MFCC_D_Z (26 features)
Visual: 6 shape + 12 texture AAM coefficients
AV MS-HMM: Audio-Visual Multistream HMM, weights (1,1)
AV MS-HMM, Var-Comp: Audio-Visual Multistream HMM + Variance Compensation
AV P-HMM: Audio-Visual Product HMM, weights (1,1)
AV P-HMM, Var-Comp: Audio-Visual Product HMM + Variance Compensation

SNR (babble)   Audio    Visual   AV MS-HMM   AV MS-HMM Var-Comp   AV P-HMM   AV P-HMM Var-Comp
Clean          100%     68.7%    95.1%       97.0%                95.4%      99.6%
10 dB          92.8%    -        88.3%       90.2%                90.6%      92.5%
5 dB           73.9%    -        84.5%       86.8%                87.2%      89.1%
0 dB           54.7%    -        79.6%       81.1%                83.8%      82.6%

Ref: Pitsikalis, Katsamanis, Papandreou, and Maragos, ICSLP’06

Page 29: ICCS-NTUA :  WP1+WP2

AV-ASR: Results with Uncertain Training

Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06

Page 31: ICCS-NTUA :  WP1+WP2

VTLN on the Platform

Warping in the front-end
• Piecewise Linear Warping Function
• Warping in the filterbank domain by stretching or compressing the frequency axis

Training
• HTK Implementation

Testing
• Fast implementation using a GMM representing normalized speech to estimate warping factors per utterance
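
A piecewise-linear warp and a per-utterance warping-factor search might look like the sketch below. The two-segment shape, the knee at 87.5% of the maximum frequency, and the grid of candidate factors are assumptions, and `score_fn` is a hypothetical callback standing in for "GMM log-likelihood of the utterance's features extracted with warping factor alpha"; none of this is taken from the platform code.

```python
def warp_frequency(freq, alpha, f_max, knee_ratio=0.875):
    """Two-segment piecewise-linear warp of [0, f_max] onto itself:
    slope alpha below the knee, then a straight line that ends at (f_max, f_max)."""
    knee = knee_ratio * f_max
    if alpha <= 0.0 or alpha * knee >= f_max:
        raise ValueError("alpha out of range for this knee position")
    if freq <= knee:
        return alpha * freq
    slope = (f_max - alpha * knee) / (f_max - knee)
    return alpha * knee + slope * (freq - knee)

def estimate_warp_factor(score_fn, alphas):
    """Grid search: return the candidate alpha with the highest score.
    score_fn(alpha) is a hypothetical scoring callback (see lead-in)."""
    return max(alphas, key=score_fn)
```

In the filterbank domain, the warp would be applied to the filter centre and edge frequencies before the filterbank energies are computed, so a single front-end can produce features for any candidate factor.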

Page 32: ICCS-NTUA :  WP1+WP2

VTLN on the Platform, Results

[Bar chart: results (vertical axis from 87 to 89.5) for MFCC (HFE), MFCC (HTK), TECC, and MFCC+VTLN (HFE)]

Page 33: ICCS-NTUA :  WP1+WP2

VTLN Research, TECC Features

• Teager Energy Cepstrum Coefficients are actually energy measurements at the output of a Gammatone filterbank, similarly to MFCC
• VTLN can be applied in a similar manner
• The Bark scale along which the filters are uniformly positioned is properly stretched or shrunk to achieve warping
• Evaluation is currently in progress
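
The slides do not spell out the energy measurement itself. Assuming it is the standard discrete Teager-Kaiser energy operator applied to each filterbank output, a minimal sketch is:

```python
import math

def teager_energy(x):
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1), defined for interior samples."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def mean_log_teager_energy(band_signal, floor=1e-12):
    """Log of the mean absolute Teager energy of one band-passed frame.
    The floor avoiding log(0) is an assumption for illustration."""
    energies = teager_energy(band_signal)
    mean_energy = sum(abs(e) for e in energies) / len(energies)
    return math.log(max(mean_energy, floor))
```

For a pure sinusoid A·cos(Ωn + φ) the operator returns the constant A²·sin²(Ω), so it tracks both amplitude and frequency; a cepstrum would then follow from a DCT over the per-band log energies, as with MFCC.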

Page 34: ICCS-NTUA :  WP1+WP2

VTLN Research, using Formants

[Bar chart: results (vertical axis from 82 to 90) for MFCC (HFE), MFCC (HTK), VTLN (LPC), and VTLN (Multi-Band)]

Page 35: ICCS-NTUA :  WP1+WP2

Raw Formants-Dynamic Programming

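[Figure: dynamic-programming trellis of candidate nodes versus time]

Cumulative cost of candidate node n at time t:

$$C(t, n) = C_{\mathrm{local}}(t, n) + \min_{m}\left[ C_{\mathrm{trans}}(m, n) + C(t-1, m) \right]$$

Transition cost between node m at time t-1 and node n at time t, over the N formants:

$$C_{\mathrm{trans}}(m, n) = \sum_{i=1}^{N} \left( \frac{F_i(t) - F_i(t-1)}{F_i^{\max}} \right)^{2}$$

Local cost C_local(t, n): [equation garbled in the transcript; the surviving terms are a bandwidth term β_i B_i and a formant-frequency term weighted by a_i]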
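
The dynamic-programming search over raw formant candidates can be sketched generically as a minimum-cost path through a trellis: the cost of a node is its local cost plus the cheapest combination of a transition cost and the cumulative cost of a node in the previous frame. The cost functions are passed in as callbacks because the slide's exact local cost is not recoverable; the example transition cost (squared formant jumps normalized by a per-formant maximum) and all names are assumptions.

```python
def track_formants(candidates, local_cost, trans_cost):
    """candidates[t] is the list of candidate nodes (tuples of formant
    frequencies) at frame t. Returns the minimum-cost node sequence."""
    cost = [local_cost(node) for node in candidates[0]]
    backptrs = []
    for t in range(1, len(candidates)):
        prev_nodes = candidates[t - 1]
        new_cost, ptr = [], []
        for node in candidates[t]:
            # Best predecessor m minimizes transition cost + cumulative cost.
            totals = [trans_cost(prev_nodes[m], node) + cost[m]
                      for m in range(len(prev_nodes))]
            best_m = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(local_cost(node) + totals[best_m])
            ptr.append(best_m)
        cost = new_cost
        backptrs.append(ptr)
    # Backtrack from the cheapest final node.
    n = min(range(len(cost)), key=cost.__getitem__)
    path = [n]
    for ptr in reversed(backptrs):
        n = ptr[n]
        path.append(n)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]

def make_trans_cost(f_max):
    """Assumed transition cost: sum of squared normalized formant jumps."""
    def trans_cost(prev, cur):
        return sum(((c - p) / fm) ** 2 for p, c, fm in zip(prev, cur, f_max))
    return trans_cost
```

With a zero local cost the tracker simply prefers the smoothest trajectory; a real local cost would additionally penalize implausible frequencies or wide bandwidths.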

Page 36: ICCS-NTUA :  WP1+WP2

Formant Tracking

Page 37: ICCS-NTUA :  WP1+WP2

ICCS-NTUA in HIWIRE: 1st, 2nd Year

Evaluation
• Databases & Baseline: Completed

Platform
• Release 1st Version

WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration?

WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed

Page 38: ICCS-NTUA :  WP1+WP2

Next...
• Fusion: Audio+Audio, Audio+Visual, Nonlinear Features+Visual
• Visual Front-end
• VAD + Nonlinear Features