LYU0103 Speech Recognition Techniques for Digital Video Library


Page 1: LYU0103 Speech Recognition  Techniques for  Digital Video Library

LYU0103

Speech Recognition Techniques for

Digital Video Library

Supervisor: Prof. Michael R. Lyu

Students: Gao Zheng Hong, Lei Mo

Page 2

Outline of Presentation

Project objectives
ViaVoice recognition experiments
Speech recognition editing tool
Audio scene change detection
Speech classification
Summary

Page 3

Our Project Objectives

Audio information retrieval
Speech recognition

Page 4

Last Term’s Work

Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)

Segmented the wave files into sentences by detecting their frame energy

Developed a visual training tool
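The energy-based segmentation step can be sketched as follows; the frame length and energy threshold here are illustrative values, not the ones used in the project:

```python
import numpy as np

def segment_by_energy(samples, frame_len=512, threshold=0.01):
    """Return (start, end) sample indices of high-energy regions."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)   # short-time energy per frame
    voiced = energy > threshold           # True where speech is likely
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                     # a voiced run begins
        elif not v and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None                  # the run has ended
    if start is not None:                 # run reaches the end of the clip
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

Real segmenters usually smooth the energy curve and bridge short pauses so that one sentence is not split at every breath; this sketch keeps only the core idea.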

Page 5

Visual Training Tool

Video Window; Dictation Window; Text Editor

Page 6

IBM ViaVoice Experiments

Employed 7 student helpers

Produced transcripts of 77 news video clips

Four experiments:

Baseline measurement
Trained model measurement
Slow-down measurement
Indoor news measurement

Page 7

Baseline Measurement

To measure ViaVoice recognition accuracy using TVB news video

Testing set: 10 video clips

The segmented wave files are dictated

Employ the Hidden Markov Model Toolkit (HTK) to examine the accuracy
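HTK-style word accuracy, Acc = (N − S − D − I) / N, comes from a minimum-edit-distance alignment of the reference and recognized word sequences. A small re-implementation of that metric (for illustration only, not HTK's own HResults tool) might look like:

```python
def word_accuracy(ref, hyp):
    """Accuracy = 1 - edit_distance / len(ref) over word lists."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return 1.0 - d[n][m] / n
```

With unit costs, the total edit distance equals S + D + I, so this matches the HTK accuracy figure.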

Page 8

Trained Model Measurement

To measure the accuracy of ViaVoice after training it on its correctly recognized words

10 video clips are segmented and dictated

The correctly dictated words of the training set are used to train ViaVoice via the SMAPI function SmWordCorrection

Repeat the procedures of the "baseline measurement" after training to obtain the recognition performance

Repeat the procedure using 20 video clips

Page 9

Slow Down Measurement

Investigate the effect of slowing down the audio channel

Resample the segmented wave files in the testing set by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6

Repeat the procedures of the "baseline measurement"
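The slow-down resampling can be sketched with simple linear interpolation; the actual resampler used in the experiments is not specified in the slides:

```python
import numpy as np

def slow_down(samples, ratio):
    """Return a waveform ratio x longer, i.e. played 1/ratio as fast."""
    n_out = int(len(samples) * ratio)
    old_t = np.arange(len(samples))              # original sample positions
    new_t = np.linspace(0, len(samples) - 1, n_out)  # stretched positions
    return np.interp(new_t, old_t, samples)      # linear interpolation
```

Note that plain resampling also lowers the pitch; production tools often use time-stretching that preserves pitch instead.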

Page 10

Indoor News Measurement

Eliminate the effect of noise
Select the indoor news reporter sentences
Dictate the test set using the untrained model
Repeat the procedure using the trained model

Page 11

Experimental Results

Experiment                         Accuracy (max. performance)
Baseline                           25.27%
Trained model                      25.87% (with 20 videos trained)
Slow speech                        25.67% (max. at ratio = 1.15)
Indoor speech (untrained model)    35.22%
Indoor speech (trained model)      36.31% (with 20 videos trained)

Overall Recognition Results (ViaVoice, TVB News)

Page 12

Experimental Result Cont.

Trained video number    Untrained    10 videos    20 videos
Accuracy                25.27%       25.82%       25.87%

Result of trained model with different numbers of training videos

Ratio           1        1.05     1.1      1.15     1.2      1.3      1.4      1.5
Accuracy (%)    25.27    25.46    25.63    25.67    25.82    17.18    12.34    4.04

Result of using different slow-down ratios

Page 13

Analysis of Experimental Result

Trained model: about 1% accuracy improvement

Slowing down speech: about 1% accuracy improvement

Indoor speech is recognized much better

Mandarin: estimated baseline accuracy is about 70% (far higher than Cantonese)

Page 14

Speech Processor

Training does not increase accuracy significantly

The recognition result needs manual editing

Word timing information is also important

Page 15

Editing Functionality

The recognition result is organized in a basic unit called a "firm word"

Retrieve the timing information from the speech engine

Record the timing information of every firm word in an index

Highlight the corresponding firm word during video playback

Page 16

Page 17

Dynamic Time Index Alignment

While editing the recognition result, the firm word structure may change

The time index needs to be updated to match the new firm words

In the speech processor, the time index is re-aligned with the firm words whenever the user edits the text
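One possible alignment policy, shown only as an illustration (the slides do not specify how the time span is redistributed): when the user replaces one firm word's text with several words, split the original span evenly among the new words.

```python
def realign(index, pos, new_words):
    """index: list of (word, start_ms, end_ms); replace the entry at `pos`
    with `new_words`, dividing its time span evenly among them."""
    word, start, end = index[pos]
    span = (end - start) / len(new_words)
    patched = [(w, start + i * span, start + (i + 1) * span)
               for i, w in enumerate(new_words)]
    return index[:pos] + patched + index[pos + 1:]
```

A real aligner might weight the split by word length or re-query the engine, but even this naive policy keeps playback highlighting roughly in sync.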

Page 18

Time Index Alignment Example

Before Editing

After Editing

Page 19

Motivation for Doing Speech Segmentation and Classification

Gender classification can help us build a gender-dependent model

Detection of scene changes from video content is not accurate enough, so we need audio scene change detection as an assisting tool

Page 20

Flow Diagram of Audio Information Retrieval System

[Flow diagram] Audio signal from the news audio channel → MFCC feature extraction → segmentation at audio scene changes → speech/non-speech decision (speech if continuous vowels > 30%). Speech branches to male/female classification and speaker identification/classification (by MFCC variance, a 256-mixture GMM, or clustering); non-speech goes to music pattern matching.

Page 21

Feature Extraction by MFCC

The first processing step on the raw audio input data

MFCC stands for "mel-frequency cepstral coefficients"

Human perception of the frequency of sound does not follow a linear scale
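The non-linear scale referred to here is the mel scale; a commonly used formula for converting frequency in Hz to mels is:

```python
import math

def hz_to_mel(f):
    """Map frequency in Hz to the perceptual mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)
```

MFCC computation warps the spectrum with a bank of filters spaced evenly on this scale before taking the cepstral transform, so equal mel steps correspond to roughly equal perceived pitch steps; note how the scale compresses high frequencies.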

Page 22

Detection of Audio Scene Change by the Bayesian Information Criterion (BIC)

The Bayesian information criterion (BIC) is a likelihood criterion

We maximize the likelihood function separately for each model M and obtain L(X, M)

The main principle is to penalize the model by its complexity

Page 23

Detection of a single point change using BIC

We define:

H0: x1, x2, …, xN ~ N(μ, Σ)

to be the hypothesis that the whole sequence contains no change, and

H1: x1, x2, …, xi ~ N(μ1, Σ1); x(i+1), x(i+2), …, xN ~ N(μ2, Σ2)

to be the hypothesis that a change occurs at time i. The maximum likelihood ratio statistic is:

R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|

where Σ, Σ1 and Σ2 are the sample covariance matrices of the whole sequence, the first N1 = i samples, and the remaining N2 = N − i samples.

Page 24

Detection of a single point change using BIC

The difference between the BIC values of the two models can be expressed as:

BIC(i) = R(i) − λP

where the penalty term is

P = (1/2)(d + (1/2)d(d + 1)) log N

and d is the dimension of the feature space. If BIC(i) > 0, a scene change is detected.
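The single-change test above can be written out directly; this sketch assumes d-dimensional feature vectors stacked in an (N, d) matrix and uses λ = 1:

```python
import numpy as np

def delta_bic(X, i, lam=1.0):
    """BIC(i) for a candidate change point i (1 <= i < N) in X of shape (N, d)."""
    N, d = X.shape

    def logdet_cov(Z):
        # log|Sigma| of the maximum-likelihood covariance estimate
        cov = np.cov(Z, rowvar=False, bias=True)
        return np.linalg.slogdet(np.atleast_2d(cov))[1]

    # R(i) = N log|Sigma| - N1 log|Sigma1| - N2 log|Sigma2|
    R = (N * logdet_cov(X)
         - i * logdet_cov(X[:i])
         - (N - i) * logdet_cov(X[i:]))
    # penalty P = (1/2)(d + (1/2)d(d+1)) log N
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return R - lam * P   # > 0 suggests a change at i
```

Scanning i over the sequence and taking the maximum positive value gives the detected change point, with no tuned threshold beyond λ.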

Page 25

Detection of multiple point changes by BIC

a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is one change point in the interval [a, b] using BIC
c. If there is no change in [a, b], let b = b + 1; otherwise, let t be the change point detected and assign a = t + 1, b = a + 1
d. Go to step (b) if necessary
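The growing-window search above can be coded as follows; `find_change` stands in for any single-change detector (the BIC test, in the real system) that returns the change offset within the window or None, and the toy `jump` detector is purely for illustration. Indices here are 0-based with half-open windows, unlike the 1-based intervals on the slide.

```python
def multi_change(x, find_change):
    """Steps (a)-(d): grow a window until a change is found, then restart
    just after it. Returns the absolute positions of all change points."""
    changes, a, b = [], 0, 2
    while b <= len(x):
        t = find_change(x[a:b])     # step (b): test the current window
        if t is None:
            b += 1                  # step (c): no change -> grow the window
        else:
            changes.append(a + t)   # record the absolute change point
            a = a + t + 1           # restart just after the change
            b = a + 2
    return changes

def jump(seg):
    """Toy detector: last index of the first constant run, if any change."""
    for k in range(1, len(seg)):
        if seg[k] != seg[k - 1]:
            return k - 1
    return None
```

For example, `multi_change([0, 0, 0, 1, 1, 2, 2, 2], jump)` finds both regime boundaries in one left-to-right pass.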

Page 26

Advantages of BIC approach

Robustness
Threshold-free
Optimality

Page 27

Comparison of different algorithms

Page 28

Audio scene change detection

Page 29

Gender Classification

The means and covariances of male and female feature vectors are quite different

So we can model them with a Gaussian mixture model (GMM)
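A minimal sketch of this idea, using a single Gaussian per gender (i.e. a one-component GMM) rather than the full mixture a real system would fit:

```python
import numpy as np

class GaussianClass:
    """Single multivariate Gaussian fitted to one class's feature vectors."""

    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.cov = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
        self.inv = np.linalg.inv(self.cov)
        self.logdet = np.linalg.slogdet(self.cov)[1]
        return self

    def loglik(self, x):
        # log-likelihood up to a constant shared by all classes
        d = x - self.mu
        return -0.5 * (self.logdet + d @ self.inv @ d)

def classify(x, male, female):
    """Pick the gender model under which the feature vector is more likely."""
    return "male" if male.loglik(x) > female.loglik(x) else "female"
```

A full GMM replaces each single Gaussian with a weighted sum of components, which captures multi-modal feature distributions; the decision rule stays the same.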

Page 30

[Figure] Male/female classification: frequency count vs. feature values, with separate male and female histograms

Page 31

Gender Classification

Page 32

Music/Speech Classification by Pitch Tracking

Speech has a more continuous pitch contour than music

A speech clip always has 30%–55% continuous contour, whereas silence or music has 1%–15%

Thus, we choose > 20% as the threshold for speech
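The decision rule can be sketched as below, assuming a pitch tracker has already produced a per-frame pitch value (0 where no pitch was found); the 10% frame-to-frame continuity tolerance is an assumed parameter, not one stated on the slide:

```python
def is_speech(pitch, tol=0.1):
    """Classify a clip as speech if more than 20% of frame transitions
    continue the pitch contour smoothly (both frames pitched, change <= tol)."""
    cont = sum(1 for p, q in zip(pitch, pitch[1:])
               if p > 0 and q > 0 and abs(q - p) / p <= tol)
    return cont / (len(pitch) - 1) > 0.20
```

Voiced speech tends to drift smoothly between adjacent frames, while music jumps between notes and silence yields no pitch at all, which is why the continuity fraction separates the two.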

Page 33

[Figure] Frequency vs. number of frames, for speech and music

Page 34

Summary

ViaVoice training experiments
Speech recognition editing tool
Dynamic time index alignment
Audio scene change detection
Speech classification
Integrated the above functions into a speech processor

Page 35

Future Work

Classify indoor news and outdoor news for further processing of the video clips

Train gender-dependent models for the ViaVoice engine; a gender-dependent model may increase the recognition accuracy