Parole Analysis, Perception and Automatic Recognition of Speech Nancy Created in May 2001.

Composition

University staff: Jean-Paul Haton (Prof. UHP), Irina Illina (MC. Nancy 2), Joseph di Martino (MC. UHP), Odile Mella (MC. UHP), Kamel Smaïli (Prof. Nancy 2), Armelle Brun (MC. Nancy 2), David Langlois (MC. IUFM), Vincent Colotte (MC. UHP), Slim Ouni (MC. Nancy 2)

CNRS staff: Anne Bonneau (CR), Dominique Fohr (CR), Yves Laprie (CR), Christophe Cerisara (CR)

Invited scientist: Jacques Feldmar (INRIA)

Doctoral students: Emmanuel Didiot, Pavel Kral, Joseph Razik, Blaise Potard, Vincent Robert, Sebastien Demange, Ghazi Bousselmi, Mathieu Camus, Farid Feïz, Guillaume Henry, Caroline Lavecchia

Engineers: Alexandre Lafosse (CNRS), Julien Maire (CNRS), Christophe Antoine (INRIA)

Objectives

Speech analysis and perception

Which acoustic or articulatory cues are the most relevant for identifying sounds?

How can this information be exploited efficiently and automatically?

Modeling speech for automatic speech recognition

How can automatic speech recognition (ASR) be made more robust?

How can language models cope with the complexity of natural language?

Research issues Applications

Results

Speech analysis and perception

Objective: improve the human or the automatic identification of speech sounds

The key point is the development and the exploitation of acoustic cues:

Algorithms to analyze speech signals (burst segmentation, automatic formant tracking, copy synthesis…)

Design of acoustic cues and selective training of acoustic HMMs

Transformation of speech signals and phonetic strategies

Deviation of the learner’s prosody with respect to a reference

Speech signal

Exploitation of acoustic cues

Diagnosis/Correction of prosody

From a transcription:

Accent detection (syllable)

Show prosodic differences (F0 and duration).

Modify the signal by correcting the learner’s F0 and rhythm: learners hear their own voice with an F0 and a rhythm close to those of the teacher.
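The diagnosis step above (showing F0 and duration differences per syllable) can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes the learner and teacher utterances are already aligned syllable by syllable, that every syllable is voiced, and that each syllable is summarized by a duration and a mean F0. The function names `semitones` and `prosody_deviation` are hypothetical.

```python
import math

def semitones(f0_hz, ref_hz):
    """Express an F0 value in semitones relative to a reference frequency."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def prosody_deviation(learner, teacher):
    """Compare syllable-aligned (duration_s, mean_f0_hz) pairs.

    Durations are normalized by the utterance duration (speaking rate) and
    F0 by the speaker's mean F0 (register), so that only relative rhythm
    and melody are compared. Returns one (duration_ratio, f0_diff_semitones)
    pair per syllable.
    """
    if len(learner) != len(teacher):
        raise ValueError("utterances must have the same number of syllables")
    l_total = sum(d for d, _ in learner)
    t_total = sum(d for d, _ in teacher)
    l_ref = sum(f for _, f in learner) / len(learner)
    t_ref = sum(f for _, f in teacher) / len(teacher)
    report = []
    for (l_dur, l_f0), (t_dur, t_f0) in zip(learner, teacher):
        duration_ratio = (l_dur / l_total) / (t_dur / t_total)
        f0_diff = semitones(l_f0, l_ref) - semitones(t_f0, t_ref)
        report.append((duration_ratio, f0_diff))
    return report

# Toy data (invented for illustration): the teacher accents the second
# syllable, the learner produces a flat, evenly timed utterance.
teacher = [(0.20, 200.0), (0.30, 250.0), (0.20, 180.0)]
learner = [(0.30, 120.0), (0.30, 120.0), (0.30, 120.0)]
report = prosody_deviation(learner, teacher)
```

A duration ratio below 1 together with a negative F0 difference on a syllable indicates that the learner under-realized an accent that the teacher produced there.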

Articulatory modeling and inversion

Objective: recovering the temporal evolution of the vocal tract during speech production

Approach: an analysis-by-synthesis method

Articulatory codebook

Exploration of the null space of the articulatory to acoustic mapping

Recovering trajectories

Incorporation of phonetic constraints to reduce the under-determination of inversion

Incorporation of constraints on the visible articulators and labial coarticulation modeling
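The codebook and trajectory-recovery steps above can be illustrated with a toy example. Everything in it is an assumption made for the sketch: `toy_synthesizer` is a made-up linear map standing in for a real articulatory synthesizer, the three articulatory parameters and the 70 Hz tolerance are arbitrary, and the smoothness criterion (least total articulatory movement) is only one possible constraint. Because three parameters map to two formants, several shapes match each frame, which is the under-determination the constraints are meant to reduce.

```python
import math
from itertools import product

def toy_synthesizer(artic):
    """Hypothetical stand-in for a synthesizer: 3 parameters -> (F1, F2) in Hz."""
    a, b, c = artic
    return (500.0 + 200.0 * a, 1500.0 + 300.0 * b + 200.0 * c - 100.0 * a)

def build_codebook(n_per_dim=11):
    """Sample the articulatory space on a grid; store each shape with its formants."""
    grid = [-1.0 + 2.0 * i / (n_per_dim - 1) for i in range(n_per_dim)]
    return [(p, toy_synthesizer(p)) for p in product(grid, repeat=3)]

def candidates_for(frame, codebook, tol_hz):
    """All codebook shapes whose formants lie within tol_hz of the observed frame."""
    scored = [(math.dist(acoustic, frame), artic) for artic, acoustic in codebook]
    close = [artic for err, artic in scored if err <= tol_hz]
    return close or [min(scored)[1]]  # fall back to the single best match

def invert(observed, codebook, tol_hz=70.0):
    """Pick one candidate per frame so that total articulatory movement is minimal."""
    lattice = [candidates_for(f, codebook, tol_hz) for f in observed]
    cost = [0.0] * len(lattice[0])
    backptrs = []
    for prev, cur in zip(lattice, lattice[1:]):
        new_cost, ptr = [], []
        for shape in cur:
            best = min(range(len(prev)),
                       key=lambda j: cost[j] + math.dist(prev[j], shape))
            new_cost.append(cost[best] + math.dist(prev[best], shape))
            ptr.append(best)
        cost = new_cost
        backptrs.append(ptr)
    # Backtrack from the cheapest final state.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for ptr in reversed(backptrs):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [lattice[t][j] for t, j in enumerate(path)]

# Synthetic test utterance: a slow movement along two of the three parameters.
true_path = [(-0.5 + 0.1 * t, 0.3 - 0.05 * t, 0.0) for t in range(10)]
observed = [toy_synthesizer(p) for p in true_path]
codebook = build_codebook()
recovered = invert(observed, codebook)
```

The recovered shapes reproduce the observed formants within the tolerance but need not coincide with `true_path`: many shapes along the null space of the mapping are acoustically equivalent, so extra phonetic or visual constraints are what select a plausible one.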

Incorporation of phonetic constraints

1. Shapes recovered are meaningful

2. A small number of articulation places

3. Impact of phonetic constraints

Figure: articulator contours from X-ray.

Figure: one example, inversion of /a/ (constriction area in cm² against constriction position in cm), shown for poor constraint values and for high constraint values.

Stochastic models for speech recognition

Objective: improving robustness of speech recognition and designing new models

Achievements:

Signal:

denoising method based on a probabilistic matching between clean and noisy speech

stochastic matching algorithm to handle non-stationary noises

Models:

missing data recognition

adaptation to speaker and noise

Core recognition platforms: ESPERE (medium size vocabulary tasks) and ANTS (large vocabulary, used for audio stream annotation)
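As an illustration of the missing data idea listed above, the sketch below scores a feature frame against diagonal-Gaussian models using only the dimensions flagged as reliable (the marginalization variant found in the missing-data literature). It is not taken from ESPERE or ANTS; the function names, the two toy models, and the hand-set reliability mask are assumptions made for the example.

```python
import math

def masked_log_likelihood(frame, mask, mean, var):
    """Diagonal-Gaussian log-likelihood over the reliable dimensions only."""
    total = 0.0
    for x, reliable, m, v in zip(frame, mask, mean, var):
        if reliable:
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def classify(frame, mask, models):
    """Return the name of the model with the highest masked log-likelihood."""
    return max(models,
               key=lambda name: masked_log_likelihood(frame, mask, *models[name]))

# Two toy models given as (mean, variance) per dimension.
models = {
    "A": ([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]),
    "B": ([4.0, 4.0, 4.0], [1.0, 1.0, 1.0]),
}

# A frame drawn near model A whose last dimension is dominated by noise.
frame = [1.1, 0.9, 9.0]
decision_all_dims = classify(frame, [True, True, True], models)
decision_masked = classify(frame, [True, True, False], models)
```

With all dimensions used, the corrupted value pulls the decision towards model B; ignoring the unreliable dimension restores the decision for model A. In practice the mask itself has to be estimated from the signal, which this sketch does not attempt.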

Denoising for robust ASR

Principle:

Maps noisy speech onto clean speech via GMM distributions

Multistyle denoiser

Adapt to unknown noise

Results (Aurora2):

Outperforms multistyle WI008 models

Objectives:

Make it a de facto standard (like MFCC, CMN, …)

Keep it simple to implement, validate it in a wide range of conditions

Distribute it widely, open source (implement it inside Sphinx4?)
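One way to picture a mapping from noisy to clean speech through GMM distributions is a piecewise bias correction: a GMM is fitted on noisy features, a correction is learned per component from paired clean and noisy data, and each noisy frame is shifted by the posterior-weighted correction. The sketch below follows that generic scheme on scalar features. It is an assumption made for illustration, not a description of the team's denoiser; the data are synthetic and all function names are hypothetical.

```python
import math
import random

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def posteriors(x, gmm):
    """p(k | x) for a 1-D GMM given as a list of (weight, mean, var)."""
    logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
    top = max(logs)
    probs = [math.exp(value - top) for value in logs]
    total = sum(probs)
    return [p / total for p in probs]

def fit_gmm(data, n_components=2, n_iter=20):
    """Tiny EM for a 1-D GMM (needs n_components >= 2); means start spread over the data range."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / (n_components - 1)
    gmm = [(1.0 / n_components, lo + k * step, 1.0) for k in range(n_components)]
    for _ in range(n_iter):
        post = [posteriors(x, gmm) for x in data]
        updated = []
        for k in range(n_components):
            nk = sum(p[k] for p in post)
            mean = sum(p[k] * x for p, x in zip(post, data)) / nk
            var = sum(p[k] * (x - mean) ** 2 for p, x in zip(post, data)) / nk
            updated.append((nk / len(data), mean, max(var, 1e-3)))
        gmm = updated
    return gmm

def train_corrections(clean, noisy, gmm):
    """Per-component bias r_k: posterior-weighted mean of (clean - noisy)."""
    num = [0.0] * len(gmm)
    den = [0.0] * len(gmm)
    for x, y in zip(clean, noisy):
        for k, p in enumerate(posteriors(y, gmm)):
            num[k] += p * (x - y)
            den[k] += p
    return [n / d for n, d in zip(num, den)]

def denoise(y, gmm, corrections):
    """Shift a noisy value by the posterior-weighted correction."""
    return y + sum(p * r for p, r in zip(posteriors(y, gmm), corrections))

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Synthetic paired data: low-energy frames are distorted more than high-energy ones.
rng = random.Random(0)
clean = ([rng.gauss(0.0, 0.5) for _ in range(200)]
         + [rng.gauss(10.0, 0.5) for _ in range(200)])
noisy = [x + (2.0 if x < 5.0 else 0.5) + rng.gauss(0.0, 0.1) for x in clean]

gmm = fit_gmm(noisy)
corrections = train_corrections(clean, noisy, gmm)
denoised = [denoise(y, gmm, corrections) for y in noisy]
```

On this toy data the learned corrections come out close to the two biases that were injected, and `mse(denoised, clean)` is far below `mse(noisy, clean)`. Real front ends work on multi-dimensional cepstral features and are evaluated on held-out data, which this sketch leaves out.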

Figure: WI008, test C. Series: MS denoiser, MS denoiser adapted, WI008 multistyle (vertical axis from 80.00 to 87.00, horizontal axis from 1 to 8).

Language modeling

Objective: models that can cope with the complexity of natural language

Achievements:

Variable-length n-grams using phrases extracted from syntactic sequences of classes

A new method to select the best contextual language model: use a distant language model

A set of methods in topic identification

the combination of different methods outperforms the best single method by 10%

used for e-mail routing

A new architecture for language models

Diagram: words W(i-2), W(i-1), W(i), W(i+1) and classes C(i-2), C(i-1), C(i), C(i+1).

Training an architecture that embodies a deeper representation of language.

Future: dealing with the problem of gender and number agreement in a statistical language model.
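To make the interplay of words and classes concrete, here is a sketch of the classical class-based bigram, in which P(w_i | w_{i-1}) is approximated by P(c_i | c_{i-1}) · P(w_i | c_i). This is the textbook model, shown only as background; it is not the new architecture mentioned above, and the tiny corpus, the class inventory and the function names are invented for the example. Estimates are plain relative frequencies with no smoothing.

```python
from collections import Counter

def train_class_bigram(sentences, word_class):
    """Collect the counts needed for P(c_i | c_{i-1}) and P(w_i | c_i)."""
    class_bigrams = Counter()   # (previous class, class) pairs
    history_counts = Counter()  # how often each class appears as a history
    word_counts = Counter()
    class_totals = Counter()
    for sentence in sentences:
        classes = ["<s>"] + [word_class[w] for w in sentence]
        for prev, cur in zip(classes, classes[1:]):
            class_bigrams[(prev, cur)] += 1
            history_counts[prev] += 1
        for w in sentence:
            word_counts[w] += 1
            class_totals[word_class[w]] += 1
    return class_bigrams, history_counts, word_counts, class_totals

def prob(word, prev_word, model, word_class):
    """Unsmoothed estimate of P(word | prev_word) through the class sequence."""
    class_bigrams, history_counts, word_counts, class_totals = model
    cur_class = word_class[word]
    prev_class = "<s>" if prev_word is None else word_class[prev_word]
    if history_counts[prev_class] == 0:
        return 0.0
    p_class = class_bigrams[(prev_class, cur_class)] / history_counts[prev_class]
    p_word = word_counts[word] / class_totals[cur_class]
    return p_class * p_word

word_class = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
              "sleeps": "VERB", "barks": "VERB"}
sentences = [["the", "cat", "sleeps"],
             ["a", "dog", "sleeps"],
             ["the", "dog", "barks"]]
model = train_class_bigram(sentences, word_class)

# "a cat" never occurs in the corpus, yet the class sequence DET NOUN does.
p_unseen = prob("cat", "a", model, word_class)
```

The unseen word pair still receives a non-zero probability because the class sequence was observed, which is the generalization that class information brings to a word-based model.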

International collaborations

Projects fitting exactly our scientific objectives.

A strong involvement in European projects: preparation, participation, coordination.

Multitel (Mons): speech synthesis

PSL (Dominic Massaro): talking heads and their use

KTH: articulatory modeling

ICCS (Athens): automatic speech processing

ULB: acoustic-to-articulatory inversion

University of Granada, ITC-IRST (Italy), TSI (Technical University of Crete)

Objectives for the next four years

Speech analysis and perception

Directions of research:

design more robust speech analysis algorithms (copy synthesis, formant tracking)

continue the work on “strong cues” to identify speech sounds

investigate phonetic strategies to improve speech intelligibility

Applications:

open new topics in speech therapy (language acquisition, esophageal speech)

language learning

Audiovisual to articulatory inversion

Directions of research:

is inversion possible for all speech sounds?

which articulatory information can be recovered?

with what accuracy?

inversion with standard spectral data instead of resonance frequencies

modeling labial coarticulation

coupling face and vocal tract

Automatic speech recognition

The challenge is to increase the robustness of ASR.

Using external sources of information:

design a theoretical framework that can incorporate different sources of information

one promising direction is to use well-recognized sounds that contain strong cues

Applications:

extend the usability of automatic speech recognition (non-native speakers, children…)

Language models

The challenge is to deal with the complexity of natural language in automatic speech recognition:

How can relevant linguistic knowledge be retrieved automatically?

How can linguistic units be made more informative by supporting syntactic features?

How can several forms of linguistic knowledge be used in a single framework?

Application to speech-to-speech translation

Using phrases, i.e. relevant sequences of words, is a promising approach.

Multidisciplinarity

Multidisciplinarity has benefited each topic addressed in the team by:

opening new directions of research

“strong acoustic cues”: confidence islands in automatic speech recognition, obtained by using models of well-realized sounds

giving help to improve another branch of research

speech analysis to adjust parameters of Mel cepstral coefficient calculation

using probabilistic learning for inversion

providing tools from another research area addressed in the group

piloting a talking head through automatic speech recognition

speech alignment for language learning

using language modeling to analyze sentences to be synthesized

Demo: Real time piloting of a talking head

Objective: providing deaf children at school with a cued speech talking head

Piloting the talking head through real-time (500 ms delay) automatic speech recognition:

Using phonemes

Coarticulation is also developed

RIAM project LABIAO (EDF, DATHA association)

Illustration: “Bon Jou R” (bonjour).

Conclusion

A deeper comprehension and modeling of speech production, perception and comprehension to allow natural interaction through speech.

Figure: white noise filtered with WinSnoori (signal view and spectrogram view).
