LYU0103 Speech Recognition Techniques for Digital Video Library
Transcript of the presentation "LYU0103 Speech Recognition Techniques for Digital Video Library"
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo
Outline of Presentation
Project objectives
ViaVoice recognition experiments
Speech recognition editing tool
Audio scene change detection
Speech classification
Summary
Our Project Objectives
Audio information retrieval
Speech recognition
Last Term’s Work
Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)
Segmented the wave files into sentences by detecting their frame energy
Developed a visual training tool
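The frame-energy segmentation step can be sketched as follows. This is a minimal illustration, not the project's actual code: the frame length, energy threshold, and minimum-silence length are assumed values.

```python
import numpy as np

def segment_by_energy(samples, rate=22050, frame_ms=20,
                      threshold=0.01, min_silence_frames=15):
    """Split a mono waveform into sentence-like segments separated by
    low-energy (silence) runs.  Returns (start, end) sample indices.
    All parameter defaults are illustrative assumptions."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)           # mean squared amplitude
    voiced = energy > threshold

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i                          # a segment begins here
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:      # a long pause ends it
                segments.append((start * frame_len,
                                 (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:                          # trailing segment
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

A one-second pause between two loud passages is enough to split them into two segments with these defaults.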
Visual Training Tool
Video Window; Dictation Window; Text Editor
IBM ViaVoice Experiments
Employed 7 student helpers
Produced transcripts of 77 news video clips
Four experiments:
Baseline measurement
Trained model measurement
Slow-down measurement
Indoor news measurement
Baseline Measurement
To measure the ViaVoice recognition accuracy using TVB news video
Testing set: 10 video clips
The segmented wave files are dictated
Employ the Hidden Markov Model Toolkit (HTK) to examine the accuracy
Trained Model Measurement
To measure the accuracy of ViaVoice trained by its correctly recognized words
10 video clips are segmented and dictated
The correctly dictated words of the training set are used to train ViaVoice via the SMAPI function SmWordCorrection
Repeat the procedures of "baseline measurement" after training to get the recognition performance
Repeat the procedures using 20 video clips
Slow Down Measurement
Investigate the effect of slowing down the audio channel
Resample the segmented wave files in the testing set by the ratios 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6
Repeat the procedures of "baseline measurement"
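The slow-down resampling can be approximated with simple linear interpolation (a sketch; the actual resampling tool used in the project is not specified). Played back at the original sample rate, a ratio above 1 lengthens the audio and lowers its pitch.

```python
import numpy as np

def slow_down(samples, ratio):
    """Resample a waveform so that, at the original playback rate,
    it lasts `ratio` times longer (e.g. ratio = 1.15)."""
    n_out = int(round(len(samples) * ratio))
    # Positions in the original signal for each output sample.
    src = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(src, np.arange(len(samples)), samples)
```

Note that plain resampling changes pitch as well as tempo, which matches an experiment that resamples rather than time-stretches.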
Indoor News Measurement
Eliminate the effect of noise
Select the indoor news reporter sentences
Dictate the test set using the untrained model
Repeat the procedure using the trained model
Experimental Results
Experiment                      | Accuracy (max. performance)
Baseline                        | 25.27%
Trained model                   | 25.87% (with 20 videos trained)
Slow speech                     | 25.67% (max. at ratio = 1.15)
Indoor speech (untrained model) | 35.22%
Indoor speech (trained model)   | 36.31% (with 20 videos trained)
Overall Recognition Results (ViaVoice, TVB News)
Experimental Result Cont.
Trained video number | Untrained | 10 videos | 20 videos
Accuracy             | 25.27%    | 25.82%    | 25.87%
Ratio        | 1     | 1.05  | 1.1   | 1.15  | 1.2   | 1.3   | 1.4   | 1.5
Accuracy (%) | 25.27 | 25.46 | 25.63 | 25.67 | 25.82 | 17.18 | 12.34 | 4.04
Result of trained model with different number of training videos
Result of using different slow down ratio
Analysis of Experimental Result
Trained model: about 1% accuracy improvement
Slowing down speeches: about 1% accuracy improvement
Indoor speeches are recognized much better
Mandarin: estimated baseline accuracy is about 70% (far higher than Cantonese)
Speech Processor
Training does not increase accuracy significantly
Manual editing of the recognition result is needed
Word timing information is also important
Editing Functionality
The recognition result is organized in a basic unit called a "firm word"
Retrieve the timing information from the speech engine
Record the timing information of every firm word in an index
Highlight the corresponding firm word during video playback
Dynamic Time Index Alignment
While editing the recognition result, the firm word structure may change
The time index needs to be updated to match the new firm words
In the speech processor, the time index is aligned with the firm words whenever the user edits the text
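The alignment can be sketched with a firm-word index of (text, start, end) entries: when an edit replaces a span of firm words, the replaced span's time range is redistributed over the new words. The even split below is our simplifying assumption, not the processor's actual scheme, which could use the engine's finer-grained timings.

```python
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str
    start_ms: int
    end_ms: int

def realign(index, first, last, new_texts):
    """Replace firm words index[first:last+1] with new_texts, splitting
    the replaced span's time range evenly among the new words (an
    illustrative assumption)."""
    start = index[first].start_ms
    end = index[last].end_ms
    step = (end - start) / len(new_texts)
    new_words = [FirmWord(t, int(start + i * step), int(start + (i + 1) * step))
                 for i, t in enumerate(new_texts)]
    return index[:first] + new_words + index[last + 1:]
```

Replacing one misrecognized firm word keeps its original time span, so playback highlighting stays in sync after the edit.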
Time Index Alignment Example
Before Editing
After Editing
Motivation for Doing Speech Segmentation and Classification
Gender classification can help us to build a gender-dependent model
Detection of scene changes from video content is not accurate enough, so we need audio scene change detection as an assistant tool
Flow Diagram of Audio Information Retrieval System
[Flow diagram: the audio signal from the news' audio channel goes through MFCC feature extraction, then audio scene change segmentation. Segments with more than 30% continuous vowel contour are treated as speech; non-speech goes to music pattern matching. Speech is classified as male or female by MFCC variance, then passed to speaker identification/classification, by 256-mixture GMM and by clustering.]
Feature Extraction by MFCC
The first thing we should do on the raw audio input data
MFCC stands for "mel-frequency cepstral coefficient"
Human perception of the frequency of sound does not follow a linear scale
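The nonlinear scale in question is the mel scale; a commonly used conversion (O'Shaughnessy's formula) is mel = 2595 · log10(1 + f/700), which MFCC filterbanks use to space their centre frequencies:

```python
import math

def hz_to_mel(f_hz):
    """Hz to mels: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when placing mel filterbank centres."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Under this formula 1000 Hz maps to roughly 1000 mel, and equal mel steps above that correspond to ever-larger Hz steps, matching the perception claim above.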
Detection of Audio Scene Change by Bayesian Information Criterion (BIC)
The Bayesian information criterion (BIC) is a likelihood criterion
We maximize the likelihood function separately for each model M and obtain L(X, M)
The main principle is to penalize the system by the model complexity
Detection of a single point change using BIC
We define H0: x1, x2 … xN ~ N(μ, Σ) to be the whole sequence without a change, and H1: x1, x2 … xi ~ N(μ1, Σ1); xi+1, xi+2 … xN ~ N(μ2, Σ2) to be the hypothesis that a change occurs at time i.
The maximum likelihood ratio is defined as:
R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|
Detection of a single point change using BIC
The difference between the BIC values of the two models can be expressed as:
BIC(i) = R(i) − λP
where the penalty is P = (1/2)(d + (1/2)d(d+1)) log N and d is the dimension of the feature vectors
If the BIC value > 0, a scene change is detected
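A direct implementation of the single-change-point test, with λ = 1 and Gaussian models estimated by maximum likelihood (a sketch over d-dimensional feature vectors such as MFCC frames; the margin keeping a few samples on each side is our choice):

```python
import numpy as np

def logdet_cov(x):
    """log|Sigma| for the MLE covariance of the rows of x."""
    cov = np.cov(x, rowvar=False, bias=True)
    return np.linalg.slogdet(np.atleast_2d(cov))[1]

def bic_change_point(x, lam=1.0, margin=5):
    """Return (best_i, best_bic): the split i maximizing
    BIC(i) = R(i) - lam * P.  A change is detected if best_bic > 0."""
    n, d = x.shape
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    full = n * logdet_cov(x)                   # N log|Sigma|
    best_i, best_bic = None, -np.inf
    for i in range(margin, n - margin):        # keep samples on both sides
        r = full - i * logdet_cov(x[:i]) - (n - i) * logdet_cov(x[i:])
        bic = r - lam * penalty
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i, best_bic
```

A clear change (here a mean shift, which inflates the full-sequence covariance) gives a large positive BIC at the true boundary.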
Detection of multiple point changes by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is one changing point in interval [a, b] using BIC
c. If there is no change in [a, b]:
       let b = b + 1
   else:
       let t be the changing point detected
       assign a = t + 1; b = a + 1
d. Go to step (b) if necessary
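The growing-window procedure (steps a-d) can be sketched for 1-D features, where log|Σ| reduces to the log variance. The window growth step, minimum window size, and the conservative λ below are our demo choices, not the project's settings:

```python
import numpy as np

def logdet(x):
    """log of the MLE variance of a 1-D sample (log|Sigma| for d = 1)."""
    return np.log(np.var(x))

def one_change(x, lam=5.0, margin=10):
    """Best single change point in x by BIC, or None if no BIC(i) > 0.
    lam = 5 is deliberately conservative to suppress spurious hits."""
    n = len(x)
    penalty = 0.5 * (1 + 0.5 * 1 * 2) * np.log(n)   # P with d = 1
    full = n * logdet(x)
    best_i, best = None, 0.0
    for i in range(margin, n - margin):
        bic = full - i * logdet(x[:i]) - (n - i) * logdet(x[i:]) - lam * penalty
        if bic > best:
            best_i, best = i, bic
    return best_i

def all_changes(x, grow=20, min_win=50):
    """Grow a window until a change is found (step c), record it, then
    restart the window just after the change, as in steps a-d above."""
    changes, a, b = [], 0, min_win
    while b <= len(x):
        t = one_change(x[a:b])
        if t is None:
            b += grow                    # no change: widen the window
        else:
            changes.append(a + t)        # change at absolute index a + t
            a = a + t + 1
            b = a + min_win
    return changes
```

On a signal whose variance jumps and then returns, both boundaries are recovered in one left-to-right pass.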
Advantages of BIC approach
Robustness
Thresholding-free
Optimality
Comparison of different algorithms
Audio scene change detection
Gender Classification
The means and covariances of the male and female feature vectors are quite different
So we can model them with a Gaussian mixture model (GMM)
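A simplified sketch of this classifier: one full-covariance Gaussian per gender (a single-component stand-in for the GMM actually used), fit to labelled feature vectors and classifying by log-likelihood. The class names and single-component simplification are ours.

```python
import numpy as np

class GaussianClassifier:
    """One full-covariance Gaussian per class; classify by log-likelihood.
    (A one-component simplification of the project's GMM approach.)"""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            cov = np.cov(Xc, rowvar=False)
            self.params_[c] = (mu, np.linalg.inv(cov),
                               np.linalg.slogdet(cov)[1])
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mu, inv, ld = self.params_[c]
            diff = X - mu
            # log N(x; mu, cov), dropping constants shared by all classes
            ll = -0.5 * (np.einsum('ij,jk,ik->i', diff, inv, diff) + ld)
            scores.append(ll)
        return self.classes_[np.argmax(scores, axis=0)]
```

With well-separated male/female feature distributions, even this one-component model separates the classes almost perfectly; the GMM generalizes it to multimodal distributions.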
[Figure: male/female classification histograms, frequency count vs. feature values; panels: Male, Female]
Gender Classification
Music/Speech Classification by Pitch Tracking
Speech has a more continuous contour than music.
A speech clip always has 30%-55% continuous contour, whereas silence or music has 1%-15%.
Thus, we choose >20% as the threshold for speech.
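The decision rule can be sketched as follows. The pitch tracker itself is assumed to exist (frames with pitch 0 are unvoiced), and the definition of a "smooth run" — consecutive voiced frames with small frame-to-frame pitch jumps — is our interpretation of "continuous contour".

```python
def continuous_contour_ratio(pitch, max_jump=10.0, min_run=3):
    """Fraction of frames lying in smooth pitch runs: consecutive voiced
    frames (pitch > 0) whose pitch changes by at most max_jump Hz,
    counting only runs of at least min_run frames.  Parameter defaults
    are illustrative assumptions."""
    smooth, run = 0, 1
    for prev, cur in zip(pitch, pitch[1:]):
        if prev > 0 and cur > 0 and abs(cur - prev) <= max_jump:
            run += 1
        else:
            if run >= min_run:
                smooth += run
            run = 1
    if run >= min_run:
        smooth += run
    return smooth / len(pitch)

def is_speech(pitch, threshold=0.20):
    """Speech clips showed 30%-55% continuous contour vs. 1%-15% for
    music or silence, so >20% is taken to mean speech."""
    return continuous_contour_ratio(pitch) > threshold
```

A clip with slowly gliding pitch segments scores well above 20%, while pitch that jumps wildly from frame to frame scores near zero.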
[Figure: frequency vs. number of frames; panels: Speech, Music]
Summary
ViaVoice training experiments
Speech recognition editing tool
Dynamic time index alignment
Audio scene change detection
Speech classification
Integrated the above functions into a speech processor
Future Work
Classify the indoor news and outdoor news for further processing of the video clips
Train gender-dependent models for the ViaVoice engine; a gender-dependent model may increase the recognition accuracy