
Automatic Speech Recognition

• Goal: Accurately and efficiently convert a speech signal into a text message, independent of the device, speaker, or environment.
• Applications: Automation of complex operator-based tasks, e.g., customer care, dictation, form-filling applications, provisioning of new services, customer help lines, e-commerce, etc.


Milestones in Speech and Multimodal Technology Research


Evolution

[Figure: word error rate (10%-40%) plotted against level of difficulty, for tasks ranging from digits, continuous digits, and command-and-control through read speech and broadcast speech to conversational speech.]


Issues in Speech Recognition

• Vocabulary size and confusability:
  – It is easy to discriminate among a small set of words, but error rates naturally increase as the vocabulary size grows.
  – Even a small vocabulary can be hard to recognize if it contains confusable words.
  – Thus, word error rate performance can be estimated from the vocabulary's size and the confusability of the words it contains*.

*Doddington, G. (1989). Phonetically Sensitive Discriminants for Improved Speech Recognition. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989.


Issues in Speech Recognition

• Speaker dependence vs. independence
  – A speaker-dependent system is intended for use by a single speaker.
  – A speaker-independent system is intended for use by any speaker.
  – Error rates are typically 3 to 5 times higher for speaker-independent systems than for speaker-dependent ones*.

*Lee, K.F. (1988). Large Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System. PhD Thesis, Carnegie Mellon University.


Issues in Speech Recognition (continued)

• Background conditions:
  – environmental noise (e.g., noise in a car or a factory)
  – acoustical distortions (e.g., echoes, room acoustics)
  – type of microphone (e.g., close-talking, omnidirectional, or telephone)
  – altered speaking manner (shouting, whining, speaking quickly, etc.)

Furui, S. (1993). Towards Robust Speech Recognition Under Adverse Conditions. In Proc. of the ESCA Workshop on Speech Processing and Adverse Conditions, pp. 31-41.


Complexity of the Speech Signal

• Anatomical
  – physical dimensions of the vocal tract
• Articulator dynamics
  – sound fusion, co-articulation, emotion (joy, anger)
• Dialect
  – varies with geographical area, social class, age, etc.
• Noise and channel distortion


Speech Recognition is Difficult:

• Variability in the acoustic realization of phonemes
• Acoustic variabilities
• Within-speaker variabilities
• Across-speaker variabilities


Large Vocabulary Continuous Speech Recognition

• More than 64,000 words


Error rates

Task                       Vocabulary   Word error rate (%)
Digits                     11           0.5
WSJ read speech            5K           3
WSJ read speech            20K          3
Broadcast news             64K+         10
Conversational telephone   64K+         20

Error rate also depends on the corpus and the recording conditions.
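The error rates above are word error rates (WER): the word-level edit distance (substitutions, insertions, deletions) between the recognizer's hypothesis and a reference transcript, divided by the number of reference words. A minimal sketch in Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the bat sat"))  # 1 substitution / 3 words = 0.33
```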


Human versus machine speech recognition

Task                Vocab.   Machine WER (%)   Human WER (%)
Continuous digits   11       0.5               0.009
WSJ 1995 clean      5K       3                 0.9
WSJ 1995 w/noise    5K       9                 1.1
SWBD 2004           65K      20                4

• Machines are about 5 times worse than humans.
• The gap increases with noisy speech.
• These numbers are rough; take them with a grain of salt.


• Collect lots and lots of speech, and transcribe all the words.
  – Train the model on the labeled speech.
  – Paradigm: supervised machine learning + search.
• Search through the space of all possible sentences.
• Pick the one that is most probable given the waveform.
• What is the most likely sentence out of all sentences in the language L given some acoustic input O?
• Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, …, ot
• Define a sentence as a sequence of words: W = w1, w2, w3, …, wn


• Probabilistic implication: pick the sentence S with the highest probability: Ŝ = argmax_{S ∈ L} P(S | O)


Speech recognition architecture


Architecture


• Feature extraction
• Acoustic modeling
• HMMs, lexicons, and pronunciation
  – Lexicon: a list of words, each with a pronunciation in terms of phones (see the sketch after this list)
• Decoding
• Language modeling
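As a toy illustration of a lexicon, the dictionary below maps words to ARPAbet-style phone sequences; the entries and the helper function are illustrative, not taken from any particular dictionary or system:

```python
# A toy pronunciation lexicon (ARPAbet-style phones); entries are
# illustrative placeholders, not from a real dictionary.
lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "recognize": ["r", "eh", "k", "ax", "g", "n", "ay", "z"],
    "the": ["dh", "ah"],
}

def word_to_phones(word):
    """Look up a word's phone sequence; OOV words return None."""
    return lexicon.get(word.lower())

print(word_to_phones("speech"))  # ['s', 'p', 'iy', 'ch']
```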


Basic Elements

• Acoustic front-end
• Acoustic model
• Language model
• Decoder

[Block diagram: input speech -> acoustic front-end -> decoder -> recognised speech, with the decoder drawing on the acoustic model and the language model.]


Schematic view of phoneme recognition

[Schematic: speech signal of phonemic duration -> frame formation -> wavelet transform -> feature extraction -> classifier (LDA) -> recognised phoneme.]


Feature Extraction

• Fourier transform
• Linear prediction

[Diagram: two analysis paths; the Fourier-transform path and the linear-prediction path each pass through a filter bank and/or perceptual weighting before cepstrum computation.]
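For the cepstrum stage, a minimal sketch: the real cepstrum of a windowed frame is the inverse FFT of its log magnitude spectrum. This plain FFT-based variant stands in for the filter-bank and LPC front ends above, and the synthetic frame is only for demonstration:

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum of one windowed speech frame:
    inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # small floor avoids log(0)
    return np.fft.irfft(log_mag)

# Example: a 25 ms Hamming-windowed frame of a synthetic tone at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 200 * t) * np.hamming(t.size)
c = real_cepstrum(frame)
print(c[:13])  # low-order cepstral coefficients describe the spectral envelope
```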


Basic ASR Formulation

• The basic equation of Bayes-rule-based speech recognition is:

  Ŵ = argmax_W P(W | X)
    = argmax_W P(W) P(X | W) / P(X)
    = argmax_W P(W) P(X | W)

• where X = X1, X2, …, XN is the acoustic observation (feature vector) sequence, W = w1, w2, …, wM is the corresponding word sequence, P(X | W) is the acoustic model, and P(W) is the language model.
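A toy sketch of this argmax: the two scoring functions below are placeholders standing in for a real acoustic model and language model (which would return log-probabilities), and the candidate list stands in for a real search space:

```python
# Hypothetical log-probability scores, for illustration only: in a real
# system, acoustic_logprob comes from HMM evaluation and lm_logprob from
# an n-gram or neural language model.
def acoustic_logprob(words, observations):
    return -2.0 * len(words)          # placeholder for log P(X | W)

def lm_logprob(words):
    return -1.0 * len(words)          # placeholder for log P(W)

def decode(candidates, observations):
    """Pick W maximizing log P(W) + log P(X | W); P(X) is constant in W."""
    return max(candidates,
               key=lambda w: lm_logprob(w) + acoustic_logprob(w, observations))

candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
print(decode(candidates, observations=None))
```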


Speech Recognition Process


Speech Recognition Processes

• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics– Text training data set => word lexicon, word grammar

(language model), task grammar

– Speech training data set => acoustic models

• Evaluate performance– Speech testing data set

• Training algorithm => build models from training set of text and speech

• Testing algorithm => evaluate performance from testing set of speech.


Feature Extraction

• Goal: Extract robust features(information) from the speech that arerelevant for ASR.

• Method: Spectral analysis through eithera bank-of-filters or through LPC followedby non-linearity and normalization(cepstrum).

• Result: Signal compression where foreach window of speech samples where30 or so cepstral features are extracted(64,000 b/s -> 5,200 b/s).

• Challenges: Robustness to environment(office, airport, car), devices(speakerphones, cell phones), speakers(acents, dialect, style, speaking defects),noise and echo. Feature set forrecognition—cepstral features or thosefrom a high dimensionality space.
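As a concrete example of such a front end, the snippet below extracts 13 mel-frequency cepstral coefficients (MFCCs) per 25 ms window with the librosa library; "speech.wav" is a placeholder path and the parameter values are illustrative, not prescribed by the slides:

```python
import librosa

# Load speech at 16 kHz; "speech.wav" is a placeholder path.
y, sr = librosa.load("speech.wav", sr=16000)

# 13 MFCCs per frame: 25 ms analysis window (400 samples), 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```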


What Features to Use?

• Short-time Spectral Analysis:
• Acoustic features:
  – cepstrum (LPC, filterbank, wavelets)
  – formant frequencies, pitch, prosody
  – zero-crossing rate, energy
• Acoustic-phonetic features:
  – manner of articulation (e.g., stop, nasal, voiced)
  – place of articulation (labial, dental, velar)
• Articulatory features:
  – tongue position, jaw, lips, velum
• Auditory features:
  – ensemble interval histogram (EIH), synchrony
• Temporal analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
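A minimal sketch of these temporal features: first-order central differences along the time axis give the velocity ("delta") features, and applying the same operator twice gives the acceleration ("delta-delta") features:

```python
import numpy as np

def deltas(features: np.ndarray) -> np.ndarray:
    """First-order central difference along the time axis:
    d[t] = (f[t+1] - f[t-1]) / 2, with edge frames replicated."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# features: (frames, coefficients); stack statics, velocity, acceleration.
features = np.random.randn(100, 13)
velocity = deltas(features)
acceleration = deltas(velocity)
full = np.hstack([features, velocity, acceleration])  # (100, 39)
print(full.shape)
```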


Feature Extraction Process


Robustness

• Problem:
  – A mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.
• Methods:
  – Traditional techniques for improving system robustness are based on signal enhancement, feature normalization (see the sketch below), and/or model adaptation.
• Perception approach:
  – Extract fundamental acoustic information in narrow bands of speech; robust integration of features across time and frequency.
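One widely used feature-normalization technique is cepstral mean (and variance) normalization, which removes stationary convolutional channel effects because they are additive in the cepstral domain. A minimal sketch, assuming a (frames x coefficients) feature matrix:

```python
import numpy as np

def cepstral_mean_normalize(features: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean (and divide by the standard
    deviation) of each cepstral coefficient across all frames."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-10  # avoid division by zero
    return (features - mean) / std

features = np.random.randn(200, 13)          # (frames, cepstral coeffs)
normalized = cepstral_mean_normalize(features)
print(normalized.mean(axis=0).round(6))      # ~0 for every coefficient
```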


Methods for Robust Speech Recognition

• A mismatch in the speech signal between the training phase and the testing phase results in performance degradation.


Acoustical Degradation and Possible Solutions

• Acoustical degradations are produced by:
  – additive noise
  – effects of linear filtering
  – nonlinearities in transduction or transmission
  – impulsive interfering sources
• Possible solutions:
  – Dynamic parameter adaptation
    • Optimal parameter estimation
    • Feature compensation
    • Cepstral high-pass filtering
  – Use of microphone arrays
  – Physiologically motivated signal processing
  – Signal enhancement techniques (see the sketch after this list)
  – Use of audio-visual features
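As one example of a signal enhancement technique, below is a minimal magnitude spectral-subtraction sketch. It assumes the first few frames of the recording are speech-free noise, which is a simplification; real systems use voice-activity detection and smoother gain rules:

```python
import numpy as np

def spectral_subtraction(noisy: np.ndarray, n_fft: int = 512,
                         noise_frames: int = 10) -> np.ndarray:
    """Estimate the noise magnitude spectrum from the first few frames,
    subtract it from every frame, and resynthesize with the noisy phase."""
    hop = n_fft // 2
    window = np.hanning(n_fft)
    frames = [noisy[i:i + n_fft] * window
              for i in range(0, len(noisy) - n_fft, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
    clean_mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)  # floor at 0
    clean_spec = clean_mag * np.exp(1j * np.angle(spectra))
    # Overlap-add resynthesis.
    out = np.zeros(len(frames) * hop + n_fft)
    for k, spec in enumerate(clean_spec):
        out[k * hop:k * hop + n_fft] += np.fft.irfft(spec, n_fft)
    return out
```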


Application Areas

• Speech over telephone lines
• Low-SNR environments
• Co-channel speech interference
• Speech over mobile networks


Audio-Visual Automatic Speech Recognition

• Digital cameras capture still images and store them as pixels.
• Little consensus has been reached on the types of visual features.


Block diagram of an AVASR

[Block diagram: the audio front end and the video front end produce feature vectors; feature integration feeds phoneme classification; the recognizer combines an HMM with the lexicon, word formation / word model, and language model (built in the training phase) to output recognized speech.]


Video front end for AVASR

[Diagram: pre-processing (face detection and tracking -> mouth and lip tracking -> normalization) followed by feature extraction, which may be low level / transform based, high level / shape based, or hybrid.]


Lip location and tracking


Face Detection

• Manual red, green and blue skin thresholds were trained for each speaker.
• Faces were located by applying these thresholds to the video frames.
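A minimal sketch of this thresholding step; the threshold values below are illustrative placeholders, not the per-speaker trained values the slide refers to:

```python
import numpy as np

def skin_mask(frame: np.ndarray, lo, hi) -> np.ndarray:
    """Mark pixels of an (H, W, 3) RGB frame whose channels all lie
    within the per-speaker [lo, hi] thresholds."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    return np.all((frame >= lo) & (frame <= hi), axis=-1)

# Illustrative thresholds on a random frame; real values are trained.
frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
mask = skin_mask(frame, lo=(95, 40, 20), hi=(255, 220, 180))
print(mask.mean())  # fraction of pixels classified as skin
```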


Finding and tracking eyes

• The top half of the face region is searched for eyes.
• A shifted version of thresholding was performed to locate possible eye regions.
• Invalid eye-candidate regions are removed, and the most likely pair of candidates is chosen as the eyes.
• The new eye location is compared to the old one, and ignored if it is too far from the old location.
• About 40% of sequences had to be manually eye-tracked every 50 frames.


Finding and tracking lips

• Eye locations are used to define a rotation-normalised lip search region (LSR).
• Unlikely lip candidates are removed.
• The rectangular area containing the largest amount of lip-candidate area becomes the lip ROI.
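One plausible reading of the last step, sketched with SciPy's connected-component labelling: take the bounding box of the largest lip-candidate region. The helper and this interpretation are assumptions for illustration, not the authors' exact method:

```python
import numpy as np
from scipy import ndimage

def lip_roi(candidate_mask: np.ndarray):
    """Return the bounding-box slices of the connected lip-candidate
    region with the largest area."""
    labels, n = ndimage.label(candidate_mask)
    if n == 0:
        return None
    sizes = ndimage.sum(candidate_mask, labels, index=range(1, n + 1))
    biggest = int(np.argmax(sizes))               # zero-based label index
    return ndimage.find_objects(labels)[biggest]  # (row_slice, col_slice)

mask = np.zeros((120, 160), dtype=bool)
mask[70:90, 50:110] = True                        # fake lip-candidate blob
print(lip_roi(mask))                              # (slice(70, 90), slice(50, 110))
```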


Visual features

• Low-level, appearance- or pixel-based features:
  – The ROI may be only the mouth region, the lower part of the face, or the entire face.
  – Various transformations of this high-dimensional data to relatively low dimensions are required.
  – One of the most commonly used techniques is based on Principal Component Analysis (PCA).
  – The Discrete Cosine Transform (DCT), Fourier transform, and discrete wavelet transform have also been used in the literature.
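A minimal sketch of PCA-based dimensionality reduction over flattened ROI images ("eigenlips"), computed via the SVD of the centered data matrix; the ROI size and number of components are illustrative:

```python
import numpy as np

def pca_project(rois: np.ndarray, n_components: int = 32):
    """Project flattened ROI images onto the top principal components.
    rois: (num_frames, H*W) data matrix, one flattened image per row."""
    mean = rois.mean(axis=0)
    centered = rois - mean
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    return centered @ basis.T, basis, mean  # low-dimensional features

rois = np.random.rand(500, 32 * 32)         # 500 frames of 32x32 mouth ROIs
features, basis, mean = pca_project(rois)
print(features.shape)                       # (500, 32)
```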


Visual features

• High-level or shape-based features:
  – the shape of the speaker's lips
  – lip-contour-based features, i.e., length, width, area, and perimeter of the inner and outer lips, and combinations of these
  – active shape models, which use deformable templates that iteratively adjust themselves to an object in an image
• Hybrid features, which combine the two feature types above.


Database

• Hindi Audio-Video Database
• VidTIMIT Database


Acoustic Model

• Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
• Hidden Markov Model (HMM): A statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with each phone.
• Advantages: A powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
• Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM? Are there data-driven models that work better for some or all vocabularies?


Discrete-Time Markov Process

• The Dow Jones Industrial Average


Basic Problems in HMMs

• Given an acoustic observation X and model Φ:
  – Evaluation: compute P(X | Φ)
  – Decoding: choose the optimal state sequence
  – Re-estimation: adjust Φ to maximize P(X | Φ)
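For the evaluation problem, a minimal sketch of the forward algorithm for a discrete-observation HMM; the toy parameters below are illustrative, and real systems work in the log domain with continuous (Gaussian-mixture) emissions:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Compute P(X | model) with the forward algorithm.
    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    B: (N, M) emission probabilities over discrete symbols;
    obs: sequence of observed symbol indices."""
    alpha = pi * B[:, obs[0]]             # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction over time
    return alpha.sum()                    # termination: sum over end states

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, obs=[0, 1, 0]))
```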


Design Issues

• Continuous vs. discrete HMM
• Whole-word vs. subword (phone) units
• Number of states, model parameters (e.g., Gaussians)
• Ergodic vs. left-right models
• Context-dependent vs. context-independent models


Context Variability in Speech

• At the word/sentence level: "Mr. Wright should write to Ms. Wright right away about his Ford or four-door Honda."
• At the phone level: /ee/ in the words peat and wheel
• Triphones capture coarticulation and phonetic context in speech.


Other Variability in Speech

• Style: discrete (isolated words) vs. continuous speech (connected words); read vs. spontaneous; slow vs. fast talking rate
• Speaker training: speaker-independent, speaker-dependent, or speaker-adapted
• Environment: background acoustic noise, telephone channel, cocktail party effect (multiple interfering speakers)


Comparing ASR systems

• Factors include:
  – Speaking mode: isolated words vs. continuous speech
  – Speaking style: read vs. spontaneous
  – "Enrollment": speaker (in)dependent
  – Vocabulary size (small: < 20 words … large: > 20,000 words)
  – Equipment: good-quality noise-cancelling microphone … telephone
  – Size of the training set (if appropriate) or rule set
  – Recognition method


Remaining problems

• Robustness – graceful degradation, not catastrophic failure.
• Portability – independence of computing platform.
• Adaptability – to changing conditions (different microphone, background noise, new speaker, new task domain, even a new language).
• Language modelling – is there a role for linguistics in improving the language models?
• Confidence measures – better methods to evaluate the absolute correctness of hypotheses.
• Out-of-vocabulary (OOV) words – systems must have some method of detecting OOV words and dealing with them in a sensible way.
• Spontaneous speech – disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem.
• Prosody – stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger).
• Accent, dialect, and mixed language – non-native speech is a huge problem, especially where code-switching is commonplace.