
Computational Hearing in Multisource Environments: the CHiME Challenge

Heidi Christensen, Ning Ma, Jon Barker, Phil Green (Speech and Hearing Group, University of Sheffield, UK)
Emmanuel Vincent (INRIA Rennes, France)

email: [email protected]; url: www.dcs.shef.ac.uk/spandh/chime

Example of a Grid utterance being embedded into a segment of background recording.

Post-processing

• Grid utterances reverberated with a BRIR (e.g. lounge, 200 cm, 0°).
• Mixed with background recordings at a range of SNR levels (sketched below).
• Mixing technique produces natural-sounding mixtures.
• Sounds as if the Grid talker was present in the room.
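A minimal sketch of this mixing step in Python/NumPy (not the project's own tooling); the function, its arguments and the choice to impose the SNR by rescaling the reverberated speech are illustrative assumptions:

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_at_snr(utterance, brir, background, target_snr_db):
    """Reverberate a dry utterance with one BRIR channel and add it to a
    background segment, with the speech scaled so the mixture has the
    requested SNR.  All inputs are 1-D float arrays at the same rate."""
    # Convolve the clean speech with the room impulse response.
    reverberant = fftconvolve(utterance, brir)[:len(background)]
    # Zero-pad so speech and background segments have equal length.
    reverberant = np.pad(reverberant, (0, len(background) - len(reverberant)))
    # Scale the speech so speech/background power matches target_snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(background ** 2)
    gain = np.sqrt(noise_power / speech_power * 10.0 ** (target_snr_db / 10.0))
    return gain * reverberant + background
```

For a binaural mixture, this would be applied once per ear with the corresponding BRIR channel.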

References

H. Christensen, N. Ma, J. Barker and P. Green, Interspeech, Makuhari, Japan, 2010.
N. Ma, J. Barker, H. Christensen and P. Green, SAPA, Makuhari, Japan, 2010.
M. Cooke, J. Barker, S. Cunningham and X. Shao, JASA, 120(5), 2006.
A. Farina, Proceedings of the 108th AES Convention, Paris, 2000.

Motivation

CHiME is an EPSRC-funded project that aims to build systems that understand speech in real environments using human-like hearing principles. Establishing a representative database has been pivotal to the project. Here we present the main design criteria for the CHiME database, the recording details, the data characteristics and an outline of the PASCAL CHiME challenge. Our goal was to produce material which is both natural (derived from reverberant domestic environments) and controlled (providing an enumerated range of SNRs spanning 20 dB).

Criteria

Noise background
• Complexity that is representative of everyday listening conditions.
• Acoustically cluttered data: many noise sources simultaneously active, each of which may have very different characteristics.

Noise level
• Natural and representative SNRs.
• Full control over SNRs.

Recording style
• Distant-microphone speech.
• Binaural microphone setup.

Speech material
• The Grid corpus [1] is used (e.g. BIN RED AT Q2 AGAIN).
• Small vocabulary but confusable lexicon.

Data

• 57 background sessions recorded in two rooms.
• Lounge: 15 hours from 22 individual sessions.
• Kitchen: 21 hours from 21 individual sessions.
• Recording times range from 7:30 in the morning to 20:00 in the evening.
• Sampling frequency of 96 kHz at 32-bit precision.
• Reverberation time T60 = 300 ms for both rooms.

Impulse Responses

• Careful estimation of a set of BRIRs in each room allows Grid utterances to be mixed in as if they had been spoken in the room.
• BRIR measurement positions in each room laid out on a polar grid.
• Determined using the sine-sweep method, based on recording the response to a swept sine (Farina, 2000); see the sketch below.
• Sweep played through an artificial mouth.
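The cited sweep technique (Farina, 2000) uses an exponential sweep with a matched inverse filter; the sketch below instead uses plain regularised spectral division, just to show the general deconvolution idea behind recovering an impulse response from a recorded sweep:

```python
import numpy as np

def estimate_ir(sweep, recorded, eps=1e-8):
    """Estimate an impulse response by deconvolving the recorded sweep
    response with the reference sweep.  Both inputs are 1-D arrays at
    the same sampling rate."""
    n = len(sweep) + len(recorded) - 1      # length of the linear convolution
    sweep_f = np.fft.rfft(sweep, n)
    recorded_f = np.fft.rfft(recorded, n)
    # Regularised spectral division avoids blow-up where the sweep is weak.
    ir_f = recorded_f * np.conj(sweep_f) / (np.abs(sweep_f) ** 2 + eps)
    return np.fft.irfft(ir_f, n)
```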

Plot of baseline performance: keywords correct [%] against SNR [dB].

Baseline systems

• All systems are trained on reverberated Grid training data.
• MFCC-based system: 13 cepstral coefficients, deltas and accelerations (a feature-extraction sketch follows below).
• MFCC+CMN: as above, with cepstral mean normalisation.
• Multicondition: trained on Grid training data mixed with noisy background data.
• Results: percentage of Grid keywords (i.e. the letter and the digit) that have been recognised correctly.
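For concreteness, a sketch of a comparable front end using librosa; the poster only describes the baseline features, so the library, sample rate and parameter defaults here are assumptions rather than the official configuration:

```python
import numpy as np
import librosa

def baseline_features(wav_path, apply_cmn=False):
    """13 MFCCs plus deltas and accelerations (39 dimensions per frame),
    optionally with per-utterance cepstral mean normalisation (CMN)."""
    y, sr = librosa.load(wav_path, sr=16000)           # assumed working rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),             # deltas
                       librosa.feature.delta(mfcc, order=2)])   # accelerations
    if apply_cmn:
        # CMN: subtract the per-utterance mean of each coefficient to
        # reduce stationary channel/convolutional effects.
        feats -= feats.mean(axis=1, keepdims=True)
    return feats.T                                      # frames x 39
```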

Distribution of noise variance.

The Challenge

The task is to recognise the Grid utterance keywords, i.e., the letter and number tokens in utterances of the form:

<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>

e.g. "place white at L 3 now".

To attract as wide an audience as possible, the challenge has three entry levels: i) separated signals, ii) robust features, or iii) recognition output. For i) and ii), participants can upload their signals/features to the organisers for remote training/adaptation of models and subsequent evaluation.
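Scoring simply asks whether the letter and number keywords were recognised correctly. A toy re-implementation of that metric (not the official scoring script; the helper name and input format are made up for illustration):

```python
def keyword_accuracy(pairs):
    """pairs: list of ((ref_letter, ref_number), (hyp_letter, hyp_number)).
    Returns the percentage of keywords (two per utterance) recognised
    correctly, i.e. the quantity plotted in the baseline results."""
    correct = sum((ref_l == hyp_l) + (ref_n == hyp_n)
                  for (ref_l, ref_n), (hyp_l, hyp_n) in pairs)
    return 100.0 * correct / (2 * len(pairs))

# Example: one utterance fully correct, one with the number wrong -> 75.0
print(keyword_accuracy([(("L", "3"), ("L", "3")),
                        (("Q", "2"), ("Q", "5"))]))
```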

Want to know more?
Email [email protected] to join the challenge email list for announcements and news.
Check out our website: www.dcs.shef.ac.uk/spandh/chime

Impulse response recording locations in kitchen (left-hand side) and lounge (right-hand side).

Background Recordings

• Recorded in a real house.
• Around 40 hours of background recordings.
• A set of binaural room impulse responses (BRIRs).
• Equipment: B&K head and torso simulator (HATS) and B&K mouth simulator (for the BRIRs).

Diagram of equipment for background recordings and impulse response estimation.

Examples of stationary and non-stationary background noise; auditory spectrograms of 10 seconds of data.

Baseline results for MFCC, MFCC+CMN and multicondition training.