Automatic Speech Recognition
• Goal: Accurately and efficiently convert a speech
signal into a text message, independent of the
device, speaker, or environment.
• Applications: Automation of complex operator-
based tasks, e.g., customer care, dictation, form
filling applications, provisioning of new services,
customer help lines, e-commerce, etc.
Milestones in Speech and
Multimodal Technology Research
[Figure: evolution of word error rate (10%–40%) versus level of task difficulty, from digits and continuous digits through command and control, read speech, and broadcast speech to conversational speech]
Issues in Speech Recognition
• Vocabulary size and confusability:
– It is easy to discriminate among a small set of words, but error rates naturally increase as the vocabulary size grows.
– Even a small vocabulary can be hard to recognize if it contains confusable words.
– The word error rate can therefore be estimated from the size and contents of the vocabulary*.
*Doddington, G. (1989). Phonetically Sensitive Discriminants for Improved Speech Recognition. In
Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989.
Issues in Speech Recognition
• Speaker dependence vs. independence
– A speaker-dependent system is intended for use by a single speaker.
– A speaker-independent system is intended for use by any speaker.
– Error rates are typically 3 to 5 times higher for speaker-independent systems than for speaker-dependent ones*.
*Lee, K.F. (1988). Large Vocabulary Speaker-Independent Continuous Speech Recognition: The
SPHINX System. PhD Thesis, Carnegie Mellon University.
Continued…
• Background conditions:
– Environmental noise (e.g., noise in a car or a factory)
– Acoustical distortions (e.g., echoes, room acoustics)
– Type of microphone (e.g., close-talking, omnidirectional, or telephone)
– Altered speaking manner (shouting, whining, speaking quickly, etc.)
Furui, S. (1993). Towards Robust Speech Recognition Under
Adverse Conditions. In Proc. of the ESCA Workshop on
Speech Processing and Adverse Conditions, pp. 31-41.
Complexity of the Speech Signal
• Anatomical
– physical dimensions of vocal tract
• Articulator Dynamics
– sound fusion, co-articulation, emotion (joy, anger)
• Dialect
– varies by geographical area, social class, age, etc.
• Noise and Channel distortion
Speech Recognition is Difficult:
• Difficulty in the acoustic realizations of
phonemes
• Acoustic variabilities
• Within-speaker variabilities
• Across-speaker variabilities
Large Vocabulary Continuous
Speech Recognition
• More than 64,000 words
Error rates:

Task                       Vocabulary   Error rate (%)
Digits                     11           0.5
WSJ read speech            5K           3
WSJ read speech            20K          3
Broadcast news             64K+         10
Conversational telephone   64K+         20

Also depends on the corpus and the recording conditions.
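The error rates in these tables are word error rates (WER). A minimal sketch of how WER is computed, using word-level Levenshtein distance; the function name and toy sentences are illustrative:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / #reference words,
    computed with Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.333
```

Scoring tools additionally align reference and hypothesis to report substitution, insertion, and deletion counts separately.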
Human versus machine speech
recognition
Task                 Vocab.   Machine (% WER)   Human (% WER)
Continuous digits    11       0.5               0.009
WSJ 1995 clean       5K       3                 0.9
WSJ 1995 w/ noise    5K       9                 1.1
SWBD 2004            65K      20                4

• Machines are about 5 times worse than humans.
• The gap increases with noisy speech.
• These numbers are rough; take them with a grain of salt.
• Collect lots and lots of speech, and transcribe all the words.
– Train the model on the labeled speech
– Paradigm: Supervised Machine Learning + Search
• Search through space of all possible sentences.
• Pick the one that is most probable given the waveform.
• What is the most likely sentence out of all sentences in the
language L given some acoustic input O?
• Treat acoustic input O as sequence of individual
observations O = o1,o2,o3,…,ot
• Define a sentence as a sequence of words:
– W = w1,w2,w3,…,wn
• Probabilistic implication: pick the sentence W with the highest probability:
Speech recognition architecture
Architecture
• Feature extraction
• Acoustic Modeling
• HMMs, Lexicons, and Pronunciation
– Lexicon: A list of words
– Each one with a pronunciation in terms of phones
• Decoding
• Language Modeling
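As a small illustration of the language-modeling component, a maximum-likelihood bigram model can be estimated from counts; the toy corpus and function names here are invented for illustration:

```python
from collections import Counter

def bigram_lm(corpus):
    """Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1).
    <s> marks the start of each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = bigram_lm(["the cat sat", "the dog sat"])
print(p("the", "cat"))  # 0.5: "the" occurs twice, followed by "cat" once
```

Real language models add smoothing so unseen bigrams do not receive zero probability.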
Basic Elements
• Acoustic Front-End
• Acoustic Model
• Language Model
• Decoder
[Block diagram: input speech → Acoustic Front-End → Decoder, which consults the Acoustic Model and the Language Model → recognised speech]
Schematic view of phoneme
recognition
[Schematic: speech signal of phonemic duration → frame formation → wavelet transform → feature extraction → classifier (LDA) → recognised phoneme]
Feature Extraction
• Fourier Transform → Filter Bank → Cepstrum
• Linear Prediction → Perceptual Weighting → Cepstrum
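The Fourier-transform path ends in the cepstrum. A minimal sketch of the real cepstrum (inverse DFT of the log magnitude spectrum), using a naive O(N²) DFT so the example stays self-contained; real front ends use an FFT plus a mel filter bank:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    spec = dft(frame)
    log_mag = [math.log(abs(s) + 1e-12) for s in spec]  # small floor avoids log(0)
    n = len(frame)
    # log_mag is real and even, so the inverse DFT is real-valued
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
ceps = real_cepstrum(frame)  # low-order coefficients describe spectral envelope
```

In practice only the first dozen or so coefficients are kept as features.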
Basic ASR Formulation
• The basic equation of Bayes-rule-based speech recognition is:

  Ŵ = argmax_W P(W|X) = argmax_W [P(W) P(X|W)] / P(X) = argmax_W P(W) P(X|W)

• where X = X1, X2, …, XN is the acoustic observation (feature vector sequence).
• W = w1 w2 … wM is the corresponding word sequence, P(X|W) is the acoustic model, and P(W) is the language model.
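The argmax over P(W) P(X|W) can be sketched with a toy decoder; the two candidate sentences and all probability values are made-up illustrative numbers, not outputs of any real model:

```python
# Toy Bayes-rule decoder over a two-sentence "language".
lm = {"recognize speech": 0.6, "wreck a nice beach": 0.4}    # P(W), language model
am = {"recognize speech": 0.05, "wreck a nice beach": 0.02}  # P(X|W) for one fixed X

# P(X) is the same for every W, so it can be dropped from the argmax.
best = max(lm, key=lambda w: lm[w] * am[w])
print(best)  # recognize speech (0.6 * 0.05 > 0.4 * 0.02)
```

A real decoder searches this argmax over an enormous hypothesis space instead of enumerating sentences.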
Speech Recognition Process
• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
– Text training data set => word lexicon, word grammar (language model), task grammar
– Speech training data set => acoustic models
• Evaluate performance
– Speech testing data set
• Training algorithm => build models from the training set of text and speech
• Testing algorithm => evaluate performance on the testing set of speech.
Feature Extraction
• Goal: Extract robust features (information) from the speech that are relevant for ASR.
• Method: Spectral analysis through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).
• Result: Signal compression, where for each window of speech samples 30 or so cepstral features are extracted (64,000 b/s -> 5,200 b/s).
• Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cell phones), speakers (accents, dialects, styles, speaking defects), noise, and echo. Choice of feature set for recognition: cepstral features or those from a high-dimensionality space.
What Features to Use?
• Short-time Spectral Analysis:
• Acoustic features:
– cepstrum (LPC, filterbank, wavelets)
– formant frequencies, pitch, prosody
– zero-crossing rate, energy
• Acoustic-Phonetic features:
– manner of articulation (e.g., stop, nasal, voiced)
– place of articulation (labial, dental, velar)
• Articulatory features:
– tongue position, jaw, lips, velum
• Auditory features:
– ensemble interval histogram (EIH), synchrony
• Temporal Analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
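The central-difference computation can be sketched as follows; replicating the edge frames is one of several common conventions:

```python
def deltas(feats):
    """First-order central differences: d[t] = (f[t+1] - f[t-1]) / 2,
    with the first and last frames replicated at the edges."""
    padded = [feats[0]] + feats + [feats[-1]]
    return [(padded[t + 2] - padded[t]) / 2 for t in range(len(feats))]

# One cepstral dimension over five frames (toy values)
c = [1.0, 2.0, 4.0, 7.0, 11.0]
d = deltas(c)    # velocity ("delta") features
dd = deltas(d)   # acceleration ("delta-delta") features
print(d)  # [0.5, 1.5, 2.5, 3.5, 2.0]
```

In a real front end this is applied per cepstral dimension and the deltas are appended to the static features.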
16-01-09 · Computer Application: Skills and Trends, UGC Academic Staff College, AMU
Feature Extraction Process
Robustness
• Problem:
– A mismatch in the speech signal between the training
phase and testing phase can result in performance
degradation.
• Methods:
– Traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
• Perception Approach:
– Extract fundamental acoustic information in narrow
bands of speech. Robust integration of features across
time and frequency.
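Feature normalization, one of the traditional robustness techniques above, can be sketched with cepstral mean normalization (CMN); the two-frame utterance is a toy example:

```python
def cmn(frames):
    """Cepstral mean normalization: subtract the per-dimension mean over the
    utterance, which removes stationary (convolutional) channel effects that
    appear as an additive constant in the cepstral domain."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

utt = [[1.0, 10.0], [3.0, 14.0]]  # two frames, two cepstral dimensions
print(cmn(utt))  # [[-1.0, -2.0], [1.0, 2.0]]
```

Variants additionally normalize the variance (CMVN) or use a sliding window for online use.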
Methods for Robust Speech
Recognition
• A mismatch in the speech signal between the
training phase and testing phase results in
performance degradation.
Acoustical Degradation and Possible
solutions
• Acoustical degradations produced by:
– additive noise
– effects of linear filtering
– nonlinearities in transduction or transmission
– impulsive interfering sources
• Possible solutions:
– Dynamic Parameter Adaptation
• Optimal Parameter Estimation
• Feature compensation
• Cepstral High-pass Filtering
– Use of microphone arrays
– Physiologically motivated signal processing
– Signal enhancement techniques
– Use of audio-visual features
Application Areas
• Speech over telephone lines
• Low-SNR environments
• Co-channel speech interference
• Speech over mobile networks.
Audio-Visual Automatic Speech
Recognition
• Digital cameras capture still images and store them as pixels.
• Little consensus has been reached on the types of visual features.
Block diagram of an AVASR
[Block diagram: Audio Front End and Video Front End produce feature vectors → Feature Integration → Phoneme Classification (HMM) → Word Formation using the Lexicon, Word Model, and Language Model (models built in a training phase) → Recognizer → recognized speech]
Video front end for AVASR
[Schematic: Video Front End = pre-processing (Face Detection and Tracking → Mouth and Lip Tracking → Normalization) followed by Feature Extraction (Low Level / Transform Based, High Level / Shape Based, or Hybrid)]
Lip location and tracking
Face Detection
• Manual Red, Green
and Blue skin
thresholds were
trained for each
speaker
• Faces were located
by applying these
thresholds to the
video frames
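The per-speaker RGB thresholding can be sketched as below; the threshold ranges are illustrative placeholders, not the trained per-speaker values from the slides, and the frame is a nested list standing in for a decoded video frame:

```python
def skin_mask(frame, r_range=(95, 255), g_range=(40, 180), b_range=(20, 160)):
    """Binary mask marking pixels whose R, G, and B values all fall inside
    per-speaker trained ranges (ranges here are illustrative)."""
    mask = []
    for row in frame:
        mask.append([r_range[0] <= r <= r_range[1] and
                     g_range[0] <= g <= g_range[1] and
                     b_range[0] <= b <= b_range[1]
                     for (r, g, b) in row])
    return mask

# One row of two pixels: a plausible skin tone and a saturated blue-green
frame = [[(120, 80, 60), (10, 200, 240)]]
print(skin_mask(frame))  # [[True, False]]
```

The face region is then taken as the largest connected component of the mask.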
Finding and tracking eyes
• Top half of face region is searched for eyes
• A shifted version of thresholding was performed to locate possible eye regions.
• Invalid eye candidate regions are removed, and the most likely pair of candidates chosen as the eyes.
• New eye location compared to old, and ignored if too far from old.
• About 40% of sequences had to be manually eye-tracked every 50 frames.
Finding and tracking lips
• Eye locations are used to define rotation-normalised lip search region (LSR)
• Unlikely lip-candidates are removed
• Rectangular area with largest amount of lip-candidate area within is lip ROI.
Visual features
• Low level, appearance or pixel based features.
– The ROI may be only the mouth region, the lower part of the face, or the entire face.
– Various transformations of this high-dimensional data to relatively low dimensions are required.
– One of the most commonly used techniques is based on Principal Component Analysis (PCA).
– The Discrete Cosine Transform (DCT), Fourier transform, and discrete wavelet transform have also been used in the literature.
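A minimal sketch of the DCT mentioned above, applied to a single pixel row of an ROI; this is the naive O(N²) DCT-II form, and real systems keep only a few low-order coefficients as features:

```python
import math

def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_t x[t] * cos(pi * k * (2t + 1) / (2N)).
    Compacts the energy of smooth signals into low-order coefficients."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n)) for k in range(n)]

row = [10.0, 10.0, 10.0, 10.0]  # a constant pixel row (toy ROI data)
coeffs = dct2(row)
# first coefficient carries all the energy (≈ 40), the rest are ≈ 0
```

The same transform applied in two dimensions over the mouth ROI yields the DCT features used in appearance-based AVASR.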
Visual features
• High level or shape based features
– The shape of the speaker's lips
– Lip contour based features, i.e., length, width, area, and perimeter of the inner and outer lips, and combinations of these.
– Active shape models use deformable templates that iteratively adjust themselves to an object in an image.
• Hybrid features, which combine the two types above.
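The contour-based lip features (width, height, perimeter, area) can be computed directly from contour vertices; the unit square below is a stand-in for a real lip contour:

```python
import math

def contour_features(points):
    """Width, height, perimeter, and shoelace area of a closed contour
    given as (x, y) vertices in order."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    n = len(points)
    perimeter = sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))
    # Shoelace formula for the enclosed area of a simple polygon
    area = abs(sum(points[i][0] * points[(i + 1) % n][1] -
                   points[(i + 1) % n][0] * points[i][1]
                   for i in range(n))) / 2
    return width, height, perimeter, area

w, h, p, a = contour_features([(0, 0), (1, 0), (1, 1), (0, 1)])
print(w, h, p, a)  # 1 1 4.0 1.0
```

In practice these quantities are computed per frame for both inner and outer lip contours and concatenated into the visual feature vector.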
Database
Hindi Audio-Video Database VidTIMIT Database
Acoustic Model
• Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
• Hidden Markov Model (HMM): Statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone.
• Advantages: Powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
• Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM? Are there data-driven models that work better for some or all vocabularies?
Discrete-Time Markov Process
• Example: the daily movement of the Dow Jones Industrial Average (e.g., up, down, or unchanged) modeled as a discrete-time Markov chain.
Basic Problems in HMMs
• Given acoustic observation X and model Φ:
• Evaluation: compute P( X|Φ)
• Decoding: choose optimal state sequence
• Re-estimation: adjust Φ to maximize P( X|Φ)
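The evaluation problem can be sketched with the forward algorithm; the two-state discrete-output model and its probabilities are toy values:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: P(obs | model) for a discrete-output HMM.
    pi[i]: initial state probability, A[i][j]: transition probability,
    B[i][o]: probability of emitting symbol o in state i."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]  # initialization
    for o in obs[1:]:
        # induction: sum over predecessor states, then emit
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)  # termination: total probability over final states

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward([0, 1], pi, A, B))  # 0.209
```

Decoding replaces the sum with a max (Viterbi), and re-estimation uses these forward quantities together with backward probabilities (Baum-Welch).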
Design Issues
• Continuous vs. discrete HMM
• Whole-word vs. subword (phone units)
• Number of states, model parameters (e.g.,
Gaussians)
• Ergodic vs. left-right models
• Context-dependent vs. context-independent models
Context Variability in Speech
• At word/sentence level: Mr. Wright should write to Ms. Wright right away about his Ford or four-door Honda.
• At phone level: /ee/ in the words peat and wheel
• Triphones capture coarticulation and phonetic context
Other Variability in Speech
• Style: discrete (isolated words) vs. continuous
speech (connected words); read vs spontaneous;
slow vs fast talking rate
• Speaker Training: speaker independent, speaker
dependent or speaker adapted
• Environment: background acoustic noise,
telephone channel, cocktail party effect (multiple
interfering speakers).
Comparing ASR systems
• Factors include
– Speaking mode: isolated words vs continuous
speech
– Speaking style: read vs spontaneous
– “Enrollment”: speaker (in)dependent
– Vocabulary size (small <20 … large > 20,000)
– Equipment: good quality noise-cancelling mic
… telephone
– Size of training set (if appropriate) or rule set
– Recognition method
Remaining problems
• Robustness – graceful degradation, not catastrophic failure
• Portability – independence of computing platform
• Adaptability – to changing conditions (different mic, background noise, new speaker, new task domain, new language even)
• Language Modelling – is there a role for linguistics in improving the language models?
• Confidence Measures – better methods to evaluate the absolute correctness of hypotheses.
• Out-of-Vocabulary (OOV) Words – Systems must have some method of detecting OOV words, and dealing with them in a sensible way.
• Spontaneous Speech – disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions etc) remain a problem.
• Prosody – stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
• Accent, dialect and mixed language – non-native speech is a huge problem, especially where code-switching is commonplace