LYU0103 Speech Recognition Techniques for Digital Video Library
Transcript of the presentation "LYU0103 Speech Recognition Techniques for Digital Video Library"
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo
Outline of Presentation
Project objectives
ViaVoice recognition experiments
Speech recognition editing tool
Audio scene change detection
Speech classification
Summary
Our Project Objectives
Audio information retrieval
Speech recognition
Last Term’s Work
Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)
Segmented the wave files into sentences by detecting their frame energy
Developed a visual training tool
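The frame-energy segmentation step can be sketched as follows. This is a minimal illustration, not the project's actual code: the frame length, energy threshold, and minimum-silence length are assumed values.

```python
import numpy as np

def segment_by_energy(samples, rate=22050, frame_ms=20,
                      threshold=0.01, min_silence_frames=15):
    """Split a mono waveform into sentence-like segments separated by
    low-energy (silence) runs.  Returns (start, end) sample indices.
    All parameter defaults are illustrative assumptions."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)           # mean squared amplitude
    voiced = energy > threshold

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i                          # a segment begins here
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:      # a long pause ends it
                segments.append((start * frame_len,
                                 (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:                          # trailing segment
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

A one-second pause between two loud passages is enough to split them into two segments with these defaults.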
Visual Training Tool
Video Window; Dictation Window; Text Editor
IBM ViaVoice Experiments
Employed 7 student helpers
Produced transcripts of 77 news video clips
Four experiments:
Baseline measurement
Trained model measurement
Slow-down measurement
Indoor news measurement
Baseline Measurement
To measure the ViaVoice recognition accuracy using TVB news video
Testing set: 10 video clips
The segmented wave files are dictated
Employ the Hidden Markov Model Toolkit (HTK) to examine the accuracy
Trained Model Measurement
To measure the accuracy of ViaVoice trained by its correctly recognized words
10 video clips are segmented and dictated
The correctly dictated words of the training set are used to train ViaVoice via the SMAPI function SmWordCorrection
Repeat the procedures of "baseline measurement" after training to get the recognition performance
Repeat the procedures using 20 video clips
Slow Down Measurement
Investigate the effect of slowing down the audio channel
Resample the segmented wave files in the testing set by the ratios 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6
Repeat the procedures of "baseline measurement"
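The slow-down resampling can be approximated with simple linear interpolation (a sketch; the actual resampling tool used in the project is not specified). Played back at the original sample rate, a ratio above 1 lengthens the audio and lowers its pitch.

```python
import numpy as np

def slow_down(samples, ratio):
    """Resample a waveform so that, at the original playback rate,
    it lasts `ratio` times longer (e.g. ratio = 1.15)."""
    n_out = int(round(len(samples) * ratio))
    # Positions in the original signal for each output sample.
    src = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(src, np.arange(len(samples)), samples)
```

Note that plain resampling changes pitch as well as tempo, which matches an experiment that resamples rather than time-stretches.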
Indoor News Measurement
Eliminate the effect of noise
Select the indoor news reporter sentences
Dictate the test set using the untrained model
Repeat the procedure using the trained model
Experimental Results
Experiment                      | Accuracy (max. performance)
Baseline                        | 25.27%
Trained model                   | 25.87% (with 20 videos trained)
Slow speech                     | 25.67% (max. at ratio = 1.15)
Indoor speech (untrained model) | 35.22%
Indoor speech (trained model)   | 36.31% (with 20 videos trained)
Overall Recognition Results (ViaVoice, TVB News)
Experimental Result Cont.
Trained video number | Untrained | 10 videos | 20 videos
Accuracy             | 25.27%    | 25.82%    | 25.87%
Ratio        | 1     | 1.05  | 1.1   | 1.15  | 1.2   | 1.3   | 1.4   | 1.5
Accuracy (%) | 25.27 | 25.46 | 25.63 | 25.67 | 25.82 | 17.18 | 12.34 | 4.04
Result of trained model with different number of training videos
Result of using different slow down ratio
Analysis of Experimental Result
Trained model: about 1% accuracy improvement
Slowing down speeches: about 1% accuracy improvement
Indoor speeches are recognized much better
Mandarin: estimated baseline accuracy is about 70% (far higher than Cantonese)
Speech Processor
Training does not increase accuracy significantly
Manual editing of the recognition result is needed
Word timing information is also important
Editing Functionality
The recognition result is organized in a basic unit called a "firm word"
Retrieve the timing information from the speech engine
Record the timing information of every firm word in an index
Highlight the corresponding firm word during video playback
Dynamic Time Index Alignment
While editing the recognition result, the firm word structure may change
The time index needs to be updated to match the new firm words
In the speech processor, the time index is aligned with the firm words whenever the user edits the text
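The alignment can be sketched with a firm-word index of (text, start, end) entries: when an edit replaces a span of firm words, the replaced span's time range is redistributed over the new words. The even split below is our simplifying assumption, not the processor's actual scheme, which could use the engine's finer-grained timings.

```python
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str
    start_ms: int
    end_ms: int

def realign(index, first, last, new_texts):
    """Replace firm words index[first:last+1] with new_texts, splitting
    the replaced span's time range evenly among the new words (an
    illustrative assumption)."""
    start = index[first].start_ms
    end = index[last].end_ms
    step = (end - start) / len(new_texts)
    new_words = [FirmWord(t, int(start + i * step), int(start + (i + 1) * step))
                 for i, t in enumerate(new_texts)]
    return index[:first] + new_words + index[last + 1:]
```

Replacing one misrecognized firm word keeps its original time span, so playback highlighting stays in sync after the edit.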
Time Index Alignment Example
Before Editing
After Editing
Motivation for Doing Speech Segmentation and Classification
Gender classification can help us to build a gender-dependent model
Detection of scene changes from video content is not accurate enough, so we need audio scene change detection as an assistant tool
Flow Diagram of Audio Information Retrieval System
[Flow diagram: the audio signal from the news' audio channel goes through MFCC feature extraction, then audio scene change segmentation. Segments with more than 30% continuous vowel contour are treated as speech; non-speech goes to music pattern matching. Speech is classified as male or female by MFCC variance, then passed to speaker identification/classification, by 256-mixture GMM and by clustering.]
Feature Extraction by MFCC
The first thing we should do on the raw audio input data
MFCC stands for "mel-frequency cepstral coefficient"
Human perception of the frequency of sound does not follow a linear scale
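The nonlinear scale in question is the mel scale; a commonly used conversion (O'Shaughnessy's formula) is mel = 2595 · log10(1 + f/700), which MFCC filterbanks use to space their centre frequencies:

```python
import math

def hz_to_mel(f_hz):
    """Hz to mels: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when placing mel filterbank centres."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Under this formula 1000 Hz maps to roughly 1000 mel, and equal mel steps above that correspond to ever-larger Hz steps, matching the perception claim above.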
Detection of Audio Scene Change by Bayesian Information Criterion (BIC)
The Bayesian information criterion (BIC) is a likelihood criterion
We maximize the likelihood function separately for each model M and obtain L(X, M)
The main principle is to penalize the system by the model complexity
Detection of a single point change using BIC
We define H0: x1, x2 … xN ~ N(μ, Σ) to be the whole sequence without a change, and H1: x1, x2 … xi ~ N(μ1, Σ1); xi+1, xi+2 … xN ~ N(μ2, Σ2) to be the hypothesis that a change occurs at time i.
The maximum likelihood ratio is defined as:
R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|
Detection of a single point change using BIC
The difference between the BIC values of the two models can be expressed as:
BIC(i) = R(i) − λP
where the penalty is P = (1/2)(d + (1/2)d(d+1)) log N and d is the dimension of the feature vectors
If the BIC value > 0, a scene change is detected
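A direct implementation of the single-change-point test, with λ = 1 and Gaussian models estimated by maximum likelihood (a sketch over d-dimensional feature vectors such as MFCC frames; the margin keeping a few samples on each side is our choice):

```python
import numpy as np

def logdet_cov(x):
    """log|Sigma| for the MLE covariance of the rows of x."""
    cov = np.cov(x, rowvar=False, bias=True)
    return np.linalg.slogdet(np.atleast_2d(cov))[1]

def bic_change_point(x, lam=1.0, margin=5):
    """Return (best_i, best_bic): the split i maximizing
    BIC(i) = R(i) - lam * P.  A change is detected if best_bic > 0."""
    n, d = x.shape
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    full = n * logdet_cov(x)                   # N log|Sigma|
    best_i, best_bic = None, -np.inf
    for i in range(margin, n - margin):        # keep samples on both sides
        r = full - i * logdet_cov(x[:i]) - (n - i) * logdet_cov(x[i:])
        bic = r - lam * penalty
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i, best_bic
```

A clear change (here a mean shift, which inflates the full-sequence covariance) gives a large positive BIC at the true boundary.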
Detection of multiple point changes by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is one changing point in interval [a, b] using BIC
c. If there is no change in [a, b]:
       let b = b + 1
   else:
       let t be the changing point detected
       assign a = t + 1; b = a + 1
d. Go to step (b) if necessary
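The growing-window procedure (steps a-d) can be sketched for 1-D features, where log|Σ| reduces to the log variance. The window growth step, minimum window size, and the conservative λ below are our demo choices, not the project's settings:

```python
import numpy as np

def logdet(x):
    """log of the MLE variance of a 1-D sample (log|Sigma| for d = 1)."""
    return np.log(np.var(x))

def one_change(x, lam=5.0, margin=10):
    """Best single change point in x by BIC, or None if no BIC(i) > 0.
    lam = 5 is deliberately conservative to suppress spurious hits."""
    n = len(x)
    penalty = 0.5 * (1 + 0.5 * 1 * 2) * np.log(n)   # P with d = 1
    full = n * logdet(x)
    best_i, best = None, 0.0
    for i in range(margin, n - margin):
        bic = full - i * logdet(x[:i]) - (n - i) * logdet(x[i:]) - lam * penalty
        if bic > best:
            best_i, best = i, bic
    return best_i

def all_changes(x, grow=20, min_win=50):
    """Grow a window until a change is found (step c), record it, then
    restart the window just after the change, as in steps a-d above."""
    changes, a, b = [], 0, min_win
    while b <= len(x):
        t = one_change(x[a:b])
        if t is None:
            b += grow                    # no change: widen the window
        else:
            changes.append(a + t)        # change at absolute index a + t
            a = a + t + 1
            b = a + min_win
    return changes
```

On a signal whose variance jumps and then returns, both boundaries are recovered in one left-to-right pass.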
Advantages of BIC approach
Robustness
Thresholding-free
Optimality
Comparison of different algorithms
Audio scene change detection
Gender Classification
The means and covariances of the male and female feature vectors are quite different
So we can model them with a Gaussian mixture model (GMM)
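A simplified sketch of this classifier: one full-covariance Gaussian per gender (a single-component stand-in for the GMM actually used), fit to labelled feature vectors and classifying by log-likelihood. The class names and single-component simplification are ours.

```python
import numpy as np

class GaussianClassifier:
    """One full-covariance Gaussian per class; classify by log-likelihood.
    (A one-component simplification of the project's GMM approach.)"""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            cov = np.cov(Xc, rowvar=False)
            self.params_[c] = (mu, np.linalg.inv(cov),
                               np.linalg.slogdet(cov)[1])
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mu, inv, ld = self.params_[c]
            diff = X - mu
            # log N(x; mu, cov), dropping constants shared by all classes
            ll = -0.5 * (np.einsum('ij,jk,ik->i', diff, inv, diff) + ld)
            scores.append(ll)
        return self.classes_[np.argmax(scores, axis=0)]
```

With well-separated male/female feature distributions, even this one-component model separates the classes almost perfectly; the GMM generalizes it to multimodal distributions.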
[Figure: male/female classification histograms, frequency count vs. feature values; panels: Male, Female]
Gender Classification
Music/Speech Classification by Pitch Tracking
Speech has a more continuous contour than music.
A speech clip always has 30%-55% continuous contour, whereas silence or music has 1%-15%.
Thus, we choose >20% as the threshold for speech.
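The decision rule can be sketched as follows. The pitch tracker itself is assumed to exist (frames with pitch 0 are unvoiced), and the definition of a "smooth run" — consecutive voiced frames with small frame-to-frame pitch jumps — is our interpretation of "continuous contour".

```python
def continuous_contour_ratio(pitch, max_jump=10.0, min_run=3):
    """Fraction of frames lying in smooth pitch runs: consecutive voiced
    frames (pitch > 0) whose pitch changes by at most max_jump Hz,
    counting only runs of at least min_run frames.  Parameter defaults
    are illustrative assumptions."""
    smooth, run = 0, 1
    for prev, cur in zip(pitch, pitch[1:]):
        if prev > 0 and cur > 0 and abs(cur - prev) <= max_jump:
            run += 1
        else:
            if run >= min_run:
                smooth += run
            run = 1
    if run >= min_run:
        smooth += run
    return smooth / len(pitch)

def is_speech(pitch, threshold=0.20):
    """Speech clips showed 30%-55% continuous contour vs. 1%-15% for
    music or silence, so >20% is taken to mean speech."""
    return continuous_contour_ratio(pitch) > threshold
```

A clip with slowly gliding pitch segments scores well above 20%, while pitch that jumps wildly from frame to frame scores near zero.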
[Figure: frequency vs. number of frames; panels: Speech, Music]
Summary
ViaVoice training experiments
Speech recognition editing tool
Dynamic time index alignment
Audio scene change detection
Speech classification
Integrated the above functions into a speech processor
Future Work
Classify the indoor news and outdoor news for further processing of the video clips
Train gender-dependent models for the ViaVoice engine; a gender-dependent model may increase the recognition accuracy