Parole Analysis, Perception and Automatic Recognition of Speech Nancy Created in May 2001.
Parole
Analysis, Perception and Automatic Recognition of Speech
Nancy
Created in May 2001
Composition
University staff: Jean-Paul Haton (Prof. UHP), Irina Illina (MC. Nancy 2), Joseph di Martino (MC. UHP), Odile Mella (MC. UHP), Kamel Smaïli (Prof. Nancy 2), Armelle Brun (MC. Nancy 2), David Langlois (MC, IUFM), Vincent Colotte (MC. UHP), Slim Ouni (MC. Nancy 2)
CNRS staff: Anne Bonneau (CR), Dominique Fohr (CR), Yves Laprie (CR), Christophe Cerisara (CR)
Invited scientist: Jacques Feldmar (INRIA)
Doctoral students: Emmanuel Didiot, Pavel Kral, Joseph Razik, Blaise Potard, Vincent Robert, Sebastien Demange, Ghazi Bousselmi, Mathieu Camus, Farid Feïz, Guillaume Henry, Caroline Lavecchia
Engineers: Alexandre Lafosse (CNRS), Julien Maire (CNRS), Christophe Antoine (INRIA)
Objectives
Speech analysis and perception
Which acoustic or articulatory cues are the most relevant for identifying sounds?
How can this information be exploited efficiently and automatically?
Modeling speech for automatic speech recognition
How can automatic speech recognition (ASR) be made more robust?
How can language models cope with the complexity of natural language?
Research issues, applications, results
Speech analysis and perception
Objective: improve the human or the automatic identification of speech sounds
The key point is the development and the exploitation of acoustic cues:
Algorithms to analyze speech signals (segmentation of bursts, automatic formant tracking, copy synthesis…)
Design of acoustic cues and selective training of acoustic HMMs
Transformation of speech signals and phonetic strategies
Deviation of the learner’s prosody with respect to a reference
Speech signal
Exploitation of acoustic cues
Diagnosis/Correction of prosody
From a transcription:
Accent detection (syllable)
Show prosodic differences (F0 and duration).
Modify the signal by correcting the learner’s F0 and rhythm: learners hear themselves with an F0 and a rhythm close to those of the teacher.
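As a rough illustration of showing prosodic differences, the sketch below compares the F0 and duration of a learner's syllables with those of a reference. It assumes the two utterances are already aligned syllable by syllable. The function name, the sample values and the choice of a semitone scale for F0 are assumptions made for this example, not details taken from the slides.

```python
import math

def prosody_deviation(learner, reference):
    """Compare aligned syllables, each given as (mean F0 in Hz, duration in s).

    Returns one (F0 difference in semitones, duration ratio) pair per syllable.
    """
    if len(learner) != len(reference):
        raise ValueError("syllable sequences must be aligned one to one")
    deviations = []
    for (f0_l, dur_l), (f0_r, dur_r) in zip(learner, reference):
        semitones = 12.0 * math.log2(f0_l / f0_r)  # positive: learner is higher
        ratio = dur_l / dur_r                      # above 1: learner is slower
        deviations.append((semitones, ratio))
    return deviations

# Hypothetical values for a three-syllable utterance.
learner = [(200.0, 0.20), (220.0, 0.30), (180.0, 0.25)]
reference = [(200.0, 0.20), (110.0, 0.15), (180.0, 0.50)]
deviations = prosody_deviation(learner, reference)
for semitones, ratio in deviations:
    print(f"F0: {semitones:+.1f} st, duration: x{ratio:.2f}")
```

Per-syllable deviations of this kind could be displayed to the learner or used as targets when resynthesizing the signal; the resynthesis step itself is not sketched here.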
Articulatory modeling and inversion
Objective: recovering the temporal evolution of the vocal tract during speech production
Approach: an analysis-by-synthesis method
Articulatory codebook
Exploration of the null space of the articulatory to acoustic mapping
Recovering trajectories
Incorporation of phonetic constraints to reduce the under-determination of inversion
Incorporation of constraints on the visible articulators and labial coarticulation modeling
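The codebook-and-trajectory idea can be pictured with a toy example. The sketch below is only an illustration under strong assumptions: a one-dimensional articulatory parameter, a made-up synthesizer (`toy_synthesizer`, a cosine, chosen because two different articulatory values give the same acoustic output), and a simple dynamic-programming search that trades acoustic error against jumps between frames. None of these names or choices come from the slides.

```python
import math

def toy_synthesizer(x):
    """Hypothetical many-to-one articulatory-to-acoustic mapping (stand-in only)."""
    return math.cos(x)  # x and -x produce the same acoustic value

def build_codebook(lo, hi, steps):
    """Sample the articulatory space and store (articulatory, acoustic) pairs."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return [(x, toy_synthesizer(x)) for x in xs]

def invert(observations, codebook, n_candidates=4, smooth_weight=1.0):
    """Pick one codebook entry per frame, minimizing acoustic error plus jumps."""
    # Step 1: per frame, keep the entries whose acoustic value matches best.
    frames = [sorted(codebook, key=lambda e: (e[1] - obs) ** 2)[:n_candidates]
              for obs in observations]
    # Step 2: dynamic programming over the candidates of successive frames.
    cost = [(a - observations[0]) ** 2 for _, a in frames[0]]
    backpointers = []
    for t in range(1, len(frames)):
        new_cost, pointers = [], []
        for x, a in frames[t]:
            transitions = [cost[j] + smooth_weight * (x - x_prev) ** 2
                           for j, (x_prev, _) in enumerate(frames[t - 1])]
            j_best = min(range(len(transitions)), key=transitions.__getitem__)
            new_cost.append(transitions[j_best] + (a - observations[t]) ** 2)
            pointers.append(j_best)
        cost = new_cost
        backpointers.append(pointers)
    # Step 3: backtrack from the cheapest final candidate.
    k = min(range(len(cost)), key=cost.__getitem__)
    path = [frames[-1][k][0]]
    for t in range(len(frames) - 1, 0, -1):
        k = backpointers[t - 1][k]
        path.append(frames[t - 1][k][0])
    path.reverse()
    return path

codebook = build_codebook(-2.0, 2.0, 81)
true_trajectory = [0.5 + 0.1 * t for t in range(11)]
observations = [toy_synthesizer(x) for x in true_trajectory]
recovered = invert(observations, codebook)
print([round(x, 2) for x in recovered])
```

In this toy setting the acoustics alone cannot decide between a trajectory and its mirror image, which is a small-scale picture of the under-determination mentioned above. The smoothness term only guarantees that the recovered trajectory does not jump between the two; additional constraints would be needed to choose one.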
Incorporation of phonetic constraints
1. Shapes recovered are meaningful
2. A small number of articulation places
3. Impact of phonetic constraints
Articulator contours from X-ray
One example: inversion of /a/
[Figure: axes are constriction position (cm) and constriction area (cm²); panels are "poor constraint values" and "high constraint values"]
Stochastic models for speech recognition
Objective: improving robustness of speech recognition and designing new models
Achievements:
Signal:
denoising method based on a probabilistic matching between clean and noisy speech
stochastic matching algorithm to handle non-stationary noises
Models:
missing data recognition
adaptation to speaker and noise
Core recognition platforms: ESPERE (medium size vocabulary tasks) and ANTS (large vocabulary, used for audio stream annotation)
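Missing data recognition can take several forms, and the slides do not say which one is used. One common variant is marginalization: components of the feature vector judged unreliable are simply left out of the likelihood computation. The sketch below shows this for a single diagonal Gaussian. The function name, the mask and the numeric values are hypothetical.

```python
import math

def masked_log_likelihood(features, mask, mean, var):
    """Log-likelihood of a diagonal Gaussian using only the reliable components."""
    total = 0.0
    for x, reliable, mu, v in zip(features, mask, mean, var):
        if reliable:
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
    return total

# Two hypothetical acoustic states, each given as (means, variances).
state_a = ([1.0, 1.0, 0.0], [1.0, 1.0, 1.0])
state_b = ([0.0, 0.0, 5.0], [1.0, 1.0, 1.0])

# A frame that belongs to state A, but whose third component is masked by noise.
frame = [1.0, 1.0, 5.0]
all_reliable = [True, True, True]
mask = [True, True, False]

full_a = masked_log_likelihood(frame, all_reliable, *state_a)
full_b = masked_log_likelihood(frame, all_reliable, *state_b)
masked_a = masked_log_likelihood(frame, mask, *state_a)
masked_b = masked_log_likelihood(frame, mask, *state_b)
print("all components:", "A" if full_a > full_b else "B")
print("reliable only: ", "A" if masked_a > masked_b else "B")
```

With all components used, the corrupted value pulls the decision toward the wrong state; ignoring it restores the expected one. How the reliability mask is estimated is a separate problem that this sketch does not address.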
Denoising for robust ASR
Principle: maps noisy speech onto clean speech via GMM distributions
Multistyle denoiser: adapts to unknown noise
Results (Aurora2): outperforms multistyle WI008 models
Objectives:
Make it a de facto standard (like MFCC, CMN, …)
Keep it simple to implement, validate it in a wide range of conditions
Distribute it widely, open source (implement it inside Sphinx4?)
[Figure: WI008, test C. Series: MS denoiser, MS denoiser adapted, WI008 multistyle. Horizontal axis 1 to 8, vertical axis 80 to 87.]
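The slides do not describe the denoising algorithm beyond a GMM-based mapping from noisy to clean speech. As a stand-in, the sketch below uses a well-known scheme of the same family: a GMM models the noisy features, each component carries a correction learned from paired clean and noisy data, and a noisy value is corrected by the posterior-weighted sum of those corrections. It works on scalar features for brevity. All function names and numbers are hypothetical, and this is not claimed to be the team's method.

```python
import math

def gaussian(y, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(y, weights, means, variances):
    """Posterior probability of each GMM component given the noisy value y."""
    joint = [w * gaussian(y, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def train_corrections(noisy, clean, weights, means, variances):
    """Per-component average of (clean - noisy), weighted by the posteriors."""
    k = len(weights)
    num, den = [0.0] * k, [0.0] * k
    for y, x in zip(noisy, clean):
        post = posteriors(y, weights, means, variances)
        for i in range(k):
            num[i] += post[i] * (x - y)
            den[i] += post[i]
    return [n / d for n, d in zip(num, den)]

def denoise(y, weights, means, variances, corrections):
    """Clean estimate: noisy value plus posterior-weighted correction."""
    post = posteriors(y, weights, means, variances)
    return y + sum(p * r for p, r in zip(post, corrections))

# Hypothetical GMM of the noisy features (two components).
weights, means, variances = [0.5, 0.5], [2.0, 10.5], [1.0, 1.0]

# Hypothetical paired data: noise shifts low values more than high ones.
clean = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
noisy = [2.0, 2.2, 1.8, 10.5, 10.7, 10.3]

corrections = train_corrections(noisy, clean, weights, means, variances)
estimate_low = denoise(2.1, weights, means, variances, corrections)
estimate_high = denoise(10.4, weights, means, variances, corrections)
print(round(estimate_low, 3), round(estimate_high, 3))
```

The sketch needs paired clean and noisy training data for each noise condition; adapting to unknown noise, as the multistyle denoiser above is said to do, would require an extra step that is not shown.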
Language modeling
Objective: models that can cope with the complexity of natural language
Achievements:
Variable-length n-grams using phrases extracted from syntactic sequences of classes
A new method to select the best contextual language model, using a distant language model
A set of methods in topic identification
combining different methods outperforms the best single method by 10%
used for e-mail routing
A new architecture for language models
[Diagram: word sequence W(i-2), W(i-1), W(i), W(i+1) and class sequence C(i-2), C(i-1), C(i), C(i+1)]
Training an architecture that embodies a deeper representation of language.
Future: dealing with the problem of gender and number agreement in a statistical language model.
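The slides pair each word W with a class C but do not spell out how the two are combined. For orientation only, the sketch below implements the classical class-based bigram, in which the probability of a word given the previous word is the probability of its class given the previous class, multiplied by the probability of the word within its class. This is the textbook model, not the new architecture mentioned above. The corpus, the class labels and the function names are hypothetical.

```python
from collections import Counter

def train_class_bigram(sentences, word_class):
    """Collect the counts needed by a class-based bigram model."""
    class_bigrams, history_counts = Counter(), Counter()
    class_counts, word_counts = Counter(), Counter()
    for sentence in sentences:
        classes = [word_class[w] for w in sentence]
        word_counts.update(sentence)
        class_counts.update(classes)
        history_counts.update(classes[:-1])  # classes that have a successor
        class_bigrams.update(zip(classes, classes[1:]))
    return class_bigrams, history_counts, class_counts, word_counts

def probability(word, previous, model, word_class):
    """P(word | previous) = P(class | previous class) * P(word | class)."""
    class_bigrams, history_counts, class_counts, word_counts = model
    c, c_prev = word_class[word], word_class[previous]
    p_class = class_bigrams[(c_prev, c)] / history_counts[c_prev]
    p_word = word_counts[word] / class_counts[c]
    return p_class * p_word

word_class = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
              "sleeps": "VERB", "runs": "VERB"}
sentences = [["the", "cat", "sleeps"],
             ["a", "dog", "sleeps"],
             ["the", "dog", "runs"]]
model = train_class_bigram(sentences, word_class)

# "a cat" never occurs in the corpus, yet it receives a non-zero probability
# because the class pair DET -> NOUN has been observed.
p_a_cat = probability("cat", "a", model, word_class)
print(round(p_a_cat, 3))
```

Sharing statistics through classes is what lets the model generalize to unseen word pairs. Agreement phenomena such as gender and number would presumably require finer-grained classes than the three used here.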
International collaborations
Projects that fit our scientific objectives exactly.
A strong involvement in European projects: preparation, participation, coordination.
Multitel in Mons: speech synthesis
PSL (Dominic Massaro): talking heads and their use
KTH: articulatory modeling
ICCS (Athens): automatic speech processing
ULB: acoustic to articulatory inversion
University of Granada, ITC-IRST (Italy), TSI (Technical University of Crete)
Objectives for the next four years
Speech analysis and perception
Directions of research:
design more robust speech analysis algorithms (copy synthesis, formant tracking)
continue the work on “strong cues” to identify speech sounds
investigate phonetic strategies to improve speech intelligibility
Applications: open new topics about speech therapy (language acquisition, esophageal speech)
language learning
Audiovisual to articulatory inversion
Directions of research:
is inversion possible for all speech sounds?
which articulatory information can be recovered?
with which accuracy?
inversion with standard spectral data instead of resonance frequencies
modeling labial coarticulation
coupling face and vocal tract.
Automatic speech recognition
The challenge is to increase the robustness of ASR.
Using external sources of information:
design a theoretical framework that can incorporate different sources of information
one promising direction is to use well recognized sounds that contain strong cues
Applications: extend the usability of automatic speech recognition (non-native speakers, children…)
Language models
The challenge is to deal with the complexity of natural language in automatic speech recognition:
How can relevant linguistic knowledge be retrieved automatically?
How can linguistic units be made more informative by supporting syntactic features?
How can several forms of linguistic knowledge be used in a single framework?
Application to speech-to-speech translation
Using phrases, i.e. relevant sequences of words, is a promising approach.
Multidisciplinarity
Multidisciplinarity has benefited each topic addressed in the team by:
opening new directions of research
“strong acoustic cues”: confidence islands in automatic speech recognition, obtained by using models of well-realized sounds
giving help to improve another branch of research
speech analysis to adjust parameters of Mel cepstral coefficient calculation
using probabilistic learning for inversion
providing tools from another research area addressed in the group
piloting a talking head through automatic speech recognition
speech alignment for language learning
using language modeling to analyze sentences to be synthesized
Demo: Real time piloting of a talking head
Objective: providing deaf children at school with a cued speech talking head
Piloting the talking head through real-time (500 ms delay) automatic speech recognition:
Using phonemes
Coarticulation is also developed
RIAM project LABIAO (EDF, DATHA association)
[Figure: “Bonjour” shown as Bon / Jou / R]
Conclusion
A deeper comprehension and modeling of speech production, perception and comprehension to allow natural interaction through speech.
[Figure: white noise filtered with WinSnoori; panels: Signal View, Spectro View]