Automatic Speech Recognition
• Goal: Accurately and efficiently convert a speech
signal into a text message, independent of the
device, speaker, or environment.
• Applications: Automation of complex operator-
based tasks, e.g., customer care, dictation, form
filling applications, provisioning of new services,
customer help lines, e-commerce, etc.
Milestones in Speech and
Multimodal Technology Research
[Figure: evolution of word error rate (10%–40%) versus level of task difficulty, from digits and continuous digits through command and control, read speech, and broadcast speech to conversational speech]
Issues in Speech Recognition
• Vocabulary size and confusability:
– It is easy to discriminate among a small set of words, but error rates naturally increase as the vocabulary size grows.
– Even a small vocabulary can be hard to recognize if it contains confusable words.
– The word error rate can therefore be estimated from the size and contents of the vocabulary*.
*Doddington, G. (1989). Phonetically Sensitive Discriminants for Improved Speech Recognition. In
Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989.
Issues in Speech Recognition
• Speaker dependence vs. independence
– A speaker-dependent system is intended for use by a single speaker.
– A speaker-independent system is intended for use by any speaker.
– Error rates are typically 3 to 5 times higher for speaker-independent systems than for speaker-dependent ones*.
*Lee, K.F. (1988). Large Vocabulary Speaker-Independent Continuous Speech Recognition: The
SPHINX System. PhD Thesis, Carnegie Mellon University.
Continued…
• Background conditions:
– Environmental noise (e.g., noise in a car or a factory)
– Acoustical distortions (e.g., echoes, room acoustics)
– Type of microphone (e.g., close-talking, omnidirectional, or telephone)
– Altered speaking manner (shouting, whining, speaking quickly, etc.)
Furui, S. (1993). Towards Robust Speech Recognition Under
Adverse Conditions. In Proc. of the ESCA Workshop on
Speech Processing and Adverse Conditions, pp. 31-41.
Complexity of the Speech Signal
• Anatomical
– physical dimensions of vocal tract
• Articulator Dynamics
– sound fusion, co-articulation, emotion (joy, anger)
• Dialect
– varies by geographical area, social class, age, etc.
• Noise and Channel distortion
Speech Recognition is Difficult:
• Difficulty in the acoustic realizations of
phonemes
• Acoustic variabilities
• Within-speaker variabilities
• Across-speaker variabilities
Large Vocabulary Continuous
Speech Recognition
• More than 64,000 words
Error rates:

Task                       Vocabulary   Error rate (%)
Digits                     11           0.5
WSJ read speech            5K           3
WSJ read speech            20K          3
Broadcast news             64K+         10
Conversational telephone   64K+         20

Also depends on the corpus and the recording conditions.
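The error rates in these tables are word error rates (WER). A minimal sketch of how WER is computed, using word-level Levenshtein distance; the function name and toy sentences are illustrative:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / #reference words,
    computed with Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.333
```

Scoring tools additionally align reference and hypothesis to report substitution, insertion, and deletion counts separately.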
Human versus machine speech
recognition
Task                 Vocab.   Machine (% WER)   Human (% WER)
Continuous digits    11       0.5               0.009
WSJ 1995 clean       5K       3                 0.9
WSJ 1995 w/ noise    5K       9                 1.1
SWBD 2004            65K      20                4

• Machines are about 5 times worse than humans.
• The gap increases with noisy speech.
• These numbers are rough; take them with a grain of salt.
• Collect lots and lots of speech, and transcribe all the words.
– Train the model on the labeled speech
– Paradigm: Supervised Machine Learning + Search
• Search through space of all possible sentences.
• Pick the one that is most probable given the waveform.
• What is the most likely sentence out of all sentences in the
language L given some acoustic input O?
• Treat acoustic input O as sequence of individual
observations O = o1,o2,o3,…,ot
• Define a sentence as a sequence of words:
– W = w1,w2,w3,…,wn
• Probabilistic implication: pick the sentence W with the highest probability:
Speech recognition architecture
Architecture
• Feature extraction
• Acoustic Modeling
• HMMs, Lexicons, and Pronunciation
– Lexicon: A list of words
– Each one with a pronunciation in terms of phones
• Decoding
• Language Modeling
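As a small illustration of the language-modeling component, a maximum-likelihood bigram model can be estimated from counts; the toy corpus and function names here are invented for illustration:

```python
from collections import Counter

def bigram_lm(corpus):
    """Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1).
    <s> marks the start of each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = bigram_lm(["the cat sat", "the dog sat"])
print(p("the", "cat"))  # 0.5: "the" occurs twice, followed by "cat" once
```

Real language models add smoothing so unseen bigrams do not receive zero probability.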
Basic Elements
• Acoustic Front-End
• Acoustic Model
• Language Model
• Decoder
[Block diagram: input speech → Acoustic Front-End → Decoder, which consults the Acoustic Model and the Language Model → recognised speech]
Schematic view of phoneme
recognition
[Schematic: speech signal of phonemic duration → frame formation → wavelet transform → feature extraction → classifier (LDA) → recognised phoneme]
Feature Extraction
• Fourier Transform → Filter Bank → Cepstrum
• Linear Prediction → Perceptual Weighting → Cepstrum
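The Fourier-transform path ends in the cepstrum. A minimal sketch of the real cepstrum (inverse DFT of the log magnitude spectrum), using a naive O(N²) DFT so the example stays self-contained; real front ends use an FFT plus a mel filter bank:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    spec = dft(frame)
    log_mag = [math.log(abs(s) + 1e-12) for s in spec]  # small floor avoids log(0)
    n = len(frame)
    # log_mag is real and even, so the inverse DFT is real-valued
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
ceps = real_cepstrum(frame)  # low-order coefficients describe spectral envelope
```

In practice only the first dozen or so coefficients are kept as features.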
Basic ASR Formulation
• The basic equation of Bayes-rule-based speech recognition is:

  Ŵ = argmax_W P(W|X) = argmax_W [P(W) P(X|W)] / P(X) = argmax_W P(W) P(X|W)

• where X = X1, X2, …, XN is the acoustic observation (feature vector sequence).
• W = w1 w2 … wM is the corresponding word sequence, P(X|W) is the acoustic model, and P(W) is the language model.
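The argmax over P(W) P(X|W) can be sketched with a toy decoder; the two candidate sentences and all probability values are made-up illustrative numbers, not outputs of any real model:

```python
# Toy Bayes-rule decoder over a two-sentence "language".
lm = {"recognize speech": 0.6, "wreck a nice beach": 0.4}    # P(W), language model
am = {"recognize speech": 0.05, "wreck a nice beach": 0.02}  # P(X|W) for one fixed X

# P(X) is the same for every W, so it can be dropped from the argmax.
best = max(lm, key=lambda w: lm[w] * am[w])
print(best)  # recognize speech (0.6 * 0.05 > 0.4 * 0.02)
```

A real decoder searches this argmax over an enormous hypothesis space instead of enumerating sentences.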
Speech Recognition Process
• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
– Text training data set => word lexicon, word grammar (language model), task grammar
– Speech training data set => acoustic models
• Evaluate performance
– Speech testing data set
• Training algorithm => build models from the training set of text and speech
• Testing algorithm => evaluate performance on the testing set of speech.
Feature Extraction
• Goal: Extract robust features (information) from the speech that are relevant for ASR.
• Method: Spectral analysis through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).
• Result: Signal compression, where for each window of speech samples 30 or so cepstral features are extracted (64,000 b/s -> 5,200 b/s).
• Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cell phones), speakers (accents, dialects, styles, speaking defects), noise, and echo. Choice of feature set for recognition: cepstral features or those from a high-dimensionality space.
What Features to Use?
• Short-time Spectral Analysis:
• Acoustic features:
– cepstrum (LPC, filterbank, wavelets)
– formant frequencies, pitch, prosody
– zero-crossing rate, energy
• Acoustic-Phonetic features:
– manner of articulation (e.g., stop, nasal, voiced)
– place of articulation (labial, dental, velar)
• Articulatory features:
– tongue position, jaw, lips, velum
• Auditory features:
– ensemble interval histogram (EIH), synchrony
• Temporal Analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
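The central-difference computation can be sketched as follows; replicating the edge frames is one of several common conventions:

```python
def deltas(feats):
    """First-order central differences: d[t] = (f[t+1] - f[t-1]) / 2,
    with the first and last frames replicated at the edges."""
    padded = [feats[0]] + feats + [feats[-1]]
    return [(padded[t + 2] - padded[t]) / 2 for t in range(len(feats))]

# One cepstral dimension over five frames (toy values)
c = [1.0, 2.0, 4.0, 7.0, 11.0]
d = deltas(c)    # velocity ("delta") features
dd = deltas(d)   # acceleration ("delta-delta") features
print(d)  # [0.5, 1.5, 2.5, 3.5, 2.0]
```

In a real front end this is applied per cepstral dimension and the deltas are appended to the static features.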
16-01-09 · Computer Application: Skills and Trends, UGC Academic Staff College, AMU
Feature Extraction Process
Robustness
• Problem:
– A mismatch in the speech signal between the training
phase and testing phase can result in performance
degradation.
• Methods:
– Traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
• Perception Approach:
– Extract fundamental acoustic information in narrow
bands of speech. Robust integration of features across
time and frequency.
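Feature normalization, one of the traditional robustness techniques above, can be sketched with cepstral mean normalization (CMN); the two-frame utterance is a toy example:

```python
def cmn(frames):
    """Cepstral mean normalization: subtract the per-dimension mean over the
    utterance, which removes stationary (convolutional) channel effects that
    appear as an additive constant in the cepstral domain."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

utt = [[1.0, 10.0], [3.0, 14.0]]  # two frames, two cepstral dimensions
print(cmn(utt))  # [[-1.0, -2.0], [1.0, 2.0]]
```

Variants additionally normalize the variance (CMVN) or use a sliding window for online use.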
Methods for Robust Speech
Recognition
• A mismatch in the speech signal between the
training phase and testing phase results in
performance degradation.
Acoustical Degradation and Possible
solutions
• Acoustical degradations produced by:
– additive noise
– effects of linear filtering
– nonlinearities in transduction or transmission
– impulsive interfering sources
• Possible solutions:
– Dynamic Parameter Adaptation
• Optimal Parameter Estimation
• Feature compensation
• Cepstral High-pass Filtering
– Use of microphone arrays
– Physiologically motivated signal processing
– Signal enhancement techniques
– Use of audio-visual features
Application Areas
• Speech over telephone lines
• Low-SNR environments
• Co-channel speech interference
• Speech over mobile networks.
Audio-Visual Automatic Speech
Recognition
• Digital cameras capture still images and store them as pixels.
• Little consensus has been reached on the types of visual features.
Block diagram of an AVASR
[Block diagram: Audio Front End and Video Front End produce feature vectors → Feature Integration → Phoneme Classification (HMM) → Word Formation using the Lexicon, Word Model, and Language Model (models built in a training phase) → Recognizer → recognized speech]
Video front end for AVASR
[Schematic: Video Front End = pre-processing (Face Detection and Tracking → Mouth and Lip Tracking → Normalization) followed by Feature Extraction (Low Level / Transform Based, High Level / Shape Based, or Hybrid)]
Lip location and tracking
Face Detection
• Manual Red, Green
and Blue skin
thresholds were
trained for each
speaker
• Faces were located
by applying these
thresholds to the
video frames
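The per-speaker RGB thresholding can be sketched as below; the threshold ranges are illustrative placeholders, not the trained per-speaker values from the slides, and the frame is a nested list standing in for a decoded video frame:

```python
def skin_mask(frame, r_range=(95, 255), g_range=(40, 180), b_range=(20, 160)):
    """Binary mask marking pixels whose R, G, and B values all fall inside
    per-speaker trained ranges (ranges here are illustrative)."""
    mask = []
    for row in frame:
        mask.append([r_range[0] <= r <= r_range[1] and
                     g_range[0] <= g <= g_range[1] and
                     b_range[0] <= b <= b_range[1]
                     for (r, g, b) in row])
    return mask

# One row of two pixels: a plausible skin tone and a saturated blue-green
frame = [[(120, 80, 60), (10, 200, 240)]]
print(skin_mask(frame))  # [[True, False]]
```

The face region is then taken as the largest connected component of the mask.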
Finding and tracking eyes
• Top half of face region is searched for eyes
• A shifted version of thresholding was performed to locate possible eye regions.
• Invalid eye candidate regions are removed, and the most likely pair of candidates chosen as the eyes.
• New eye location compared to old, and ignored if too far from old.
• About 40% of sequences had to be manually eye-tracked every 50 frames.
Finding and tracking lips
• Eye locations are used to define rotation-normalised lip search region (LSR)
• Unlikely lip-candidates are removed
• Rectangular area with largest amount of lip-candidate area within is lip ROI.
Visual features
• Low level, appearance or pixel based features.
– The ROI may be only the mouth region, the lower part of the face, or the entire face.
– Various transformations of this high-dimensional data to relatively low dimensions are required.
– One of the most commonly used techniques is based on Principal Component Analysis (PCA).
– The Discrete Cosine Transform (DCT), Fourier transform, and discrete wavelet transform have also been used in the literature.
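A minimal sketch of the DCT mentioned above, applied to a single pixel row of an ROI; this is the naive O(N²) DCT-II form, and real systems keep only a few low-order coefficients as features:

```python
import math

def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_t x[t] * cos(pi * k * (2t + 1) / (2N)).
    Compacts the energy of smooth signals into low-order coefficients."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n)) for k in range(n)]

row = [10.0, 10.0, 10.0, 10.0]  # a constant pixel row (toy ROI data)
coeffs = dct2(row)
# first coefficient carries all the energy (≈ 40), the rest are ≈ 0
```

The same transform applied in two dimensions over the mouth ROI yields the DCT features used in appearance-based AVASR.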
Visual features
• High level or shape based features
– The shape of the speaker's lips
– Lip contour based features, i.e., length, width, area, and perimeter of the inner and outer lips, and combinations of these.
– Active shape models use deformable templates that iteratively adjust themselves to an object in an image.
• Hybrid features, which combine the two types above.
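The contour-based lip features (width, height, perimeter, area) can be computed directly from contour vertices; the unit square below is a stand-in for a real lip contour:

```python
import math

def contour_features(points):
    """Width, height, perimeter, and shoelace area of a closed contour
    given as (x, y) vertices in order."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    n = len(points)
    perimeter = sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))
    # Shoelace formula for the enclosed area of a simple polygon
    area = abs(sum(points[i][0] * points[(i + 1) % n][1] -
                   points[(i + 1) % n][0] * points[i][1]
                   for i in range(n))) / 2
    return width, height, perimeter, area

w, h, p, a = contour_features([(0, 0), (1, 0), (1, 1), (0, 1)])
print(w, h, p, a)  # 1 1 4.0 1.0
```

In practice these quantities are computed per frame for both inner and outer lip contours and concatenated into the visual feature vector.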
Database
Hindi Audio-Video Database VidTIMIT Database
Acoustic Model
• Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
• Hidden Markov Model (HMM): Statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone.
• Advantages: Powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
• Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM? Are there data-driven models that work better for some or all vocabularies?
Discrete-Time Markov Process
• Example: the daily movement of the Dow Jones Industrial Average (e.g., up, down, or unchanged) modeled as a discrete-time Markov chain.
Basic Problems in HMMs
• Given acoustic observation X and model Φ:
• Evaluation: compute P( X|Φ)
• Decoding: choose optimal state sequence
• Re-estimation: adjust Φ to maximize P( X|Φ)
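The evaluation problem can be sketched with the forward algorithm; the two-state discrete-output model and its probabilities are toy values:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: P(obs | model) for a discrete-output HMM.
    pi[i]: initial state probability, A[i][j]: transition probability,
    B[i][o]: probability of emitting symbol o in state i."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]  # initialization
    for o in obs[1:]:
        # induction: sum over predecessor states, then emit
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)  # termination: total probability over final states

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward([0, 1], pi, A, B))  # 0.209
```

Decoding replaces the sum with a max (Viterbi), and re-estimation uses these forward quantities together with backward probabilities (Baum-Welch).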
Design Issues
• Continuous vs. discrete HMM
• Whole-word vs. subword (phone units)
• Number of states, model parameters (e.g.,
Gaussians)
• Ergodic vs. left-right models
• Context-dependent vs. context-independent models
Context Variability in Speech
• At word/sentence level: Mr. Wright should write to Ms. Wright right away about his Ford or four-door Honda.
• At phone level: /ee/ in the words peat and wheel
• Triphones capture coarticulation and phonetic context
Other Variability in Speech
• Style: discrete (isolated words) vs. continuous
speech (connected words); read vs spontaneous;
slow vs fast talking rate
• Speaker Training: speaker independent, speaker
dependent or speaker adapted
• Environment: background acoustic noise,
telephone channel, cocktail party effect (multiple
interfering speakers).
Comparing ASR systems
• Factors include
– Speaking mode: isolated words vs continuous
speech
– Speaking style: read vs spontaneous
– “Enrollment”: speaker (in)dependent
– Vocabulary size (small <20 … large > 20,000)
– Equipment: good quality noise-cancelling mic
… telephone
– Size of training set (if appropriate) or rule set
– Recognition method
Remaining problems
• Robustness – graceful degradation, not catastrophic failure
• Portability – independence of computing platform
• Adaptability – to changing conditions (different mic, background noise, new speaker, new task domain, new language even)
• Language Modelling – is there a role for linguistics in improving the language models?
• Confidence Measures – better methods to evaluate the absolute correctness of hypotheses.
• Out-of-Vocabulary (OOV) Words – Systems must have some method of detecting OOV words, and dealing with them in a sensible way.
• Spontaneous Speech – disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions etc) remain a problem.
• Prosody – stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
• Accent, dialect and mixed language – non-native speech is a huge problem, especially where code-switching is commonplace