
DESIGN OF A KEYWORD SPOTTING SYSTEM USING MODIFIED CROSS-CORRELATION IN THE TIME AND MFCC DOMAIN

Presented by: Olakunle Anifowose

Thesis Advisor: Dr. Robert Yantorno

Committee Members: Dr. Joseph Picone, Dr. Dennis Silage

Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

Speech Processing Laboratory, Temple University


Outline

Introduction to keyword spotting
Motivation for this work
Experimental conditions
Common approaches to keyword spotting
Method used: time domain and MFCC domain
Conclusions
Future work

Keyword Spotting

Identify a keyword in a spoken utterance or written document.
Determine whether the keyword is present in the utterance.
Locate the keyword within the utterance.

Possible operational results: hits, false alarms, and misses.
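As a concrete illustration (not part of the original slides), here is a minimal sketch of how hits, false alarms, and misses might be tallied when detections are compared against reference keyword locations; the tolerance window and the data layout are assumptions.

```python
# Hypothetical scoring of keyword-spotting output against a reference.
# "detections" and "references" are lists of sample indices where the keyword
# was detected / actually occurs; "tolerance" is an assumed matching window.

def score_detections(detections, references, tolerance=4000):
    hits, used = 0, set()
    for d in detections:
        match = next((i for i, r in enumerate(references)
                      if i not in used and abs(d - r) <= tolerance), None)
        if match is not None:
            hits += 1
            used.add(match)
    false_alarms = len(detections) - hits
    misses = len(references) - hits
    return hits, false_alarms, misses

print(score_detections([8000, 52000], [8100]))  # -> (1, 1, 0)
```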

Keyword Spotting: Speaker Dependent vs. Speaker Independent (speech recognition)

Speaker dependent: single speaker; lacks flexibility and is not speaker adaptable; easier to develop.
Speaker independent: multiple speakers; flexible; harder to develop.

Applications

Monitoring conversations for flag words.
Automated response systems.
Security.
Automatically searching through speeches for certain words or phrases.
Voice command/dialing.


Motivation for this work

Typical Large Vocabulary Continuous Speech Recognizer (LVCSR) / Hidden Markov Model (HMM) based approaches require a garbage model to train the system on non-keyword speech data.

The better the garbage model, the better the keyword spotting performance.

Use of LVCSR techniques introduces computational load and complexity, and requires training data.

Research Objectives

Develop a simple keyword spotting system based on cross-correlation.
Maximize hits while keeping false alarms and misses low.

Speech Databases Used

Call Home database: contains more than 40 telephone conversations between male and female speakers; each conversation is about 30 minutes long.

Switchboard database: two-sided conversations collected from various speakers in the United States.

Experimental Setup

Conversations were split into single channels.

Call Home database: 60 utterances ranging from 30 seconds to 2 minutes. Keywords of interest were college, university, language, something, student, school, zero, relationship, necessarily, really, think, English, program, tomorrow, bizarre, conversation and circumstance.

Switchboard database: 30 utterances ranging from 30 seconds to 2 minutes. Keywords of interest were always, money and something.

Common Approaches: Hidden Markov Models

Statistical model with hidden states and observable outputs.
Emission probability: p(x|q1). Transition probability: p(q2|q1).
First-order Markov process: the probability of the next state depends only on the current state.
Infer the output given the underlying system.

HMM for Speech Recognition

Each word is a sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to the next state).

Estimate the model for each word in the training vocabulary.

For each test keyword, the model that maximizes the likelihood is selected as a match (Viterbi algorithm).

Keyword spotting can be built directly on an HMM-based Large Vocabulary Continuous Speech Recognizer (LVCSR).
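For illustration only (not from the slides), here is a minimal numpy sketch of Viterbi decoding over a toy HMM. All transition and emission values are invented assumptions; a real recognizer would work with per-frame acoustic likelihoods from trained word or phone models.

```python
import numpy as np

# Hypothetical toy HMM: 3 hidden states, log-domain Viterbi decoding.
log_pi = np.log([0.6, 0.3, 0.1])            # initial state probabilities
log_A  = np.log([[0.7, 0.2, 0.1],           # transition probs p(q_t | q_{t-1})
                 [0.1, 0.7, 0.2],
                 [0.1, 0.2, 0.7]])
# Log emission likelihoods log p(x_t | q) for a 5-frame observation sequence.
log_B = np.log(np.array([[0.8, 0.1, 0.1],
                         [0.6, 0.3, 0.1],
                         [0.2, 0.6, 0.2],
                         [0.1, 0.7, 0.2],
                         [0.1, 0.2, 0.7]]))

T, N = log_B.shape
delta = np.zeros((T, N))           # best log-score ending in state j at time t
psi = np.zeros((T, N), dtype=int)  # backpointers
delta[0] = log_pi + log_B[0]
for t in range(1, T):
    scores = delta[t - 1][:, None] + log_A          # candidate scores per state pair
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + log_B[t]

# Backtrace the most likely state sequence and its log-likelihood.
path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
path.reverse()
print(path, float(delta[-1].max()))
```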

Hidden Markov Models

Limitation Large amount of training data required. Training data has to be transcribed in word

level and/or phone level. Transcribed data costs time and money. Not available in all languages.

HMM

Various Keyword Spotting Systems: HMM

Context-dependent, state-of-the-art phoneme recognizer with a keyword model and a garbage model.
Evaluated on the Conversational Telephone Speech database.
Accuracy varies with the keyword: 52.6% for the keyword "because", 94.5% for the keyword "zero".
(Ketabdar et al., 2006)

Various Keyword Spotting Systems: Spoken Term Detection Using Phonetic Posteriorgrams

Trained on acoustic phonetic models and compared using dynamic time warping.
Trained on the Switchboard Cellular corpus; tested on the Fisher English development test set from NIST.
Average precision for the top 10 hits was 63.3%.
(Hazen et al., 2009)

Various Keyword Spotting Systems: S-DTW

Evaluated on the Switchboard corpus.
75% accuracy across all keywords tested.
(Jansen et al., 2010)

Contributions

We have proposed a novel approach to keyword spotting in both the time and MFCC domains using cross-correlation.
The design of a global keyword for cross-correlation in the time domain.

Cross-Correlation

A measure of similarity between two signals. Two signals are compared by:
Sliding one signal by a certain time lag.
Multiplying the overlapping regions and taking the sum.
Repeating the process and adding the products until there is no more overlap.

If both signals are exactly the same, there is a maximum peak at lag zero, and the rest of the correlation signal tapers off to zero.
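A minimal sketch (not from the slides) of the comparison just described, using numpy; the toy signals are arbitrary assumptions.

```python
import numpy as np

# Minimal sketch of the slide/multiply/sum comparison described above.
rng = np.random.default_rng(0)
keyword = rng.standard_normal(1000)          # stand-in for a keyword waveform
utterance = np.concatenate([rng.standard_normal(3000), keyword,
                            rng.standard_normal(2000)])

# Full cross-correlation: one value per relative lag between the two signals.
xc = np.correlate(utterance, keyword, mode="full")
best_lag = int(xc.argmax()) - (len(keyword) - 1)
print("keyword most likely starts near sample", best_lag)   # ~3000
```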

Research Using Cross-Correlation: Cover Song Identification

Search a music database and determine songs that are similar but performed by different artists with different instruments.
Features of choice: chroma features, a representation for music in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones.
Cross-correlation is used to determine similarity between songs based on their chroma features.

Cases Considered

Time domain: initial approach, modified approach.
MFCC domain.

Time Domain: Initial Approach

[Figure: time-domain waveforms of the keyword and of the utterance (amplitude vs. sample index), feeding into a cross-correlation (xcorr) block.]

1. Let the length of the keyword or phrase be n. The cross-correlation of the keyword and the first n samples of the utterance is computed.

2. Observe the position of the peak to see whether it is around the zero lag. Yes: keyword. No: not the keyword.

3. Shift the observed portion by a small amount and repeat the process.

If a portion is reached where the peak is close to the zero lag, that is where the keyword is. If not, the utterance does not contain the keyword.

The power around the zero lag is obtained and compared to the power in the rest of the correlation signal. This ratio is referred to as the Zero lag to Rest Ratio (ZRR).

If the ZRR is greater than a certain threshold (2.5), then that segment of the utterance contains the keyword or phrase.

The test utterance is shifted and the process is repeated.

If there is no segment with a ZRR greater than 2.5, the utterance does not contain the keyword.

ZRR: Zero lag to Rest Ratio
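A rough sketch (my own reconstruction, not the author's code) of the sliding ZRR test described above. The threshold of 2.5 follows the slides; the window shift and the width of the zero-lag neighborhood are assumptions.

```python
import numpy as np

def zrr(corr, half_width=200):
    """Zero lag to Rest Ratio: power near the zero lag vs. power elsewhere.
    The +/- half_width neighborhood around zero lag is an assumed choice."""
    center = len(corr) // 2                      # zero-lag index for 'full' xcorr
    near = corr[max(center - half_width, 0):center + half_width + 1]
    rest_power = np.sum(corr ** 2) - np.sum(near ** 2)
    return np.sum(near ** 2) / max(rest_power, 1e-12)

def find_keyword(utterance, keyword, shift=400, threshold=2.5):
    """Slide an n-sample window along the utterance and flag segments
    whose ZRR against the keyword exceeds the threshold (2.5 per the slides)."""
    n = len(keyword)
    hits = []
    for start in range(0, len(utterance) - n + 1, shift):
        segment = utterance[start:start + n]
        corr = np.correlate(segment, keyword, mode="full")
        if zrr(corr) > threshold:
            hits.append(start)
    return hits            # empty list -> keyword not detected
```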

Test Cases

• Same speaker: keyword is part of the utterance.
• Different speaker: keyword is taken from a different speaker.

Results (utterance: male 1, keyword: male 1)

[Figure: ratio (ZRR) plot vs. shift count.]

Results (utterance: male 1, keyword: male 2)

[Figure: ratio (ZRR) plot vs. shift count.]

Results (utterance: female 1, keyword: female 1)

[Figure: ratio (ZRR) plot vs. shift count.]

Results (utterance: female 1, keyword: female 2)

[Figure: ratio (ZRR) plot vs. shift count.]

Results: Speaker Dependent, Initial Time Domain Approach

Tested on 30 utterances with single instances of the following keywords, extracted from the same speaker: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow.

Hits: 86%
False Alarms: 14%
Misses: 14%

Results: Speaker Independent, Initial Time Domain Approach

Tested on 40 utterances with multiple instances of the following keywords, extracted from various speakers: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow.

Hits: 26%
False Alarms: 65%
Misses: 26%

Keyword → REALLY* 13 and 14 → same gender (female)

Challenge

Time Domain: Modified Approach

Utterance: pitch smoothing.
Global keyword: built from quantized dynamic time warping, then pitch smoothing.
Cross-correlate both signals and compute the Zero lag to Rest Ratio (ZRR) on a frame-by-frame basis.
The highest zero-lag ratio gives the location of the keyword.

Pitch

A measure of frequency level: a change in pitch results in a change in the fundamental frequency of speech.
A difference in pitch between the keyword and the utterance increases detection errors.

Pitch Normalization

Pitch is a form of speaker information; we want to limit the effect pitch has on the system.
Kawahara's STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum) algorithm is used to modify pitch.
It reduces periodic variation in time caused by excitation.

Utterance Pitch Normalization: STRAIGHT Algorithm

Elimination of periodicity interference: temporal interference around peaks can be removed by constructing a new timing window based on a cardinal B-spline basis function that is adaptive to the fundamental period.
F0 extraction: natural speech is not purely periodic.
Speech resynthesis: the extracted F0 is then used to resynthesize the speech.
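The slides use STRAIGHT; as a hedged illustration of the same analyze/modify/resynthesize idea, here is a sketch using the WORLD vocoder via pyworld as a stand-in (not the author's tool), flattening F0 to its voiced-frame mean before resynthesis. The filenames and the flattening rule are assumptions.

```python
import numpy as np
import soundfile as sf   # assumed available for WAV I/O
import pyworld as pw     # WORLD vocoder, used here as a stand-in for STRAIGHT

# Illustrative pitch-flattening sketch: extract F0, spectral envelope and
# aperiodicity, replace F0 with its voiced-frame mean, and resynthesize.
x, fs = sf.read("utterance.wav")          # hypothetical mono input file
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs)                 # F0 extraction
sp = pw.cheaptrick(x, f0, t, fs)          # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)                 # aperiodicity

voiced = f0 > 0
flat_f0 = np.where(voiced, f0[voiced].mean(), 0.0)   # constant pitch when voiced
y = pw.synthesize(flat_f0, sp, ap, fs)    # pitch-normalized resynthesis
sf.write("utterance_flat_pitch.wav", y, fs)
```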

Modeling a Global Keyword

Compute MFCC features for each keyword.
Perform Quantized Dynamic Time Warping (DTW) on several keyword templates.

MFCC

1. Take the Fourier transform of (a windowed portion of) the signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the log of the power at each of the mel frequencies.
4. Take the Discrete Cosine Transform (DCT) of the mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
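A compact numpy/scipy sketch of the recipe above for a single frame (not from the slides). The frame length, FFT size, number of filters and number of coefficients are assumptions; a production system would typically use a tested feature-extraction library.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=26, n_coeffs=13, n_fft=512):
    """MFCCs for one windowed frame, following the steps listed above.
    Sizes and filter counts are assumed, not taken from the slides."""
    # 1. Fourier transform of the windowed frame -> power spectrum.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # 2. Triangular, overlapping mel filterbank applied to the power spectrum.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energies = fbank @ spec
    # 3. Log of the power in each mel band (floor avoids log(0)).
    log_mel = np.log(np.maximum(mel_energies, 1e-10))
    # 4./5. DCT of the log mel powers; keep the first n_coeffs amplitudes.
    return dct(log_mel, type=2, norm="ortho")[:n_coeffs]

# Example: a 25 ms frame of noise at 8 kHz (toy input).
print(mfcc_frame(np.random.randn(200), fs=8000).shape)   # (13,)
```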

Dynamic Time Warping

Time stretching and contracting one signal so that it aligns with the other signal; a time-series similarity measure.
The reference and test keywords are arranged along two sides of a grid: the template keyword on the vertical axis, the test keyword on the horizontal axis.
Each block in the grid is the distance between the corresponding feature vectors.
The best match is the path through the grid that minimizes the cumulative distance.
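A minimal sketch (not the author's implementation) of the grid and cumulative-distance idea just described, computing the DTW alignment cost between two MFCC feature sequences with Euclidean frame distances (an assumed choice).

```python
import numpy as np

def dtw_cost(template, test):
    """Classic DTW between two feature sequences (frames x coefficients).
    Returns the minimum cumulative Euclidean distance through the grid."""
    n, m = len(template), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - test[j - 1])  # local distance
            D[i, j] = d + min(D[i - 1, j],      # stretch the template
                              D[i, j - 1],      # stretch the test keyword
                              D[i - 1, j - 1])  # advance both
    return D[n, m]

# Toy usage: two random 13-dimensional MFCC-like sequences.
a = np.random.randn(40, 13)
b = np.random.randn(55, 13)
print(dtw_cost(a, b))
```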

Quantized Dynamic Time Warping

The MFCC features extracted from various instances of a keyword are divided into two sets, A and B. Each reference template Ai is paired with exactly one Bi.
For each pair (Ai, Bi) the optimal path is computed using the classic DTW algorithm, and a new vector Ci = (c1, c2, ..., cNc) is generated.
The process is repeated, treating each pair (Ci, Ci+1) as a new (Ai, Bi) pair, until a single reference vector Cy remains.
This vector is inverted back into a time-domain signal.
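The slides do not spell out how each Ci is formed from an aligned pair; a common choice is to average the DTW-aligned frames, which is what this hedged sketch assumes. The pairwise merging loop mirrors the repeated pairing described above.

```python
import numpy as np

def dtw_path(a, b):
    """DTW alignment path between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = np.linalg.norm(a[i - 1] - b[j - 1]) + \
                      min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to the start of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:   i, j = i - 1, j - 1
        elif step == 1: i = i - 1
        else:           j = j - 1
    return path[::-1]

def merge_pair(a, b):
    """Assumed merging rule: average the DTW-aligned frames of a pair."""
    return np.array([(a[i] + b[j]) / 2.0 for i, j in dtw_path(a, b)])

def global_template(templates):
    """Repeatedly merge neighbouring pairs until one reference remains."""
    while len(templates) > 1:
        merged = [merge_pair(templates[k], templates[k + 1])
                  for k in range(0, len(templates) - 1, 2)]
        if len(templates) % 2:               # carry an unpaired template forward
            merged.append(templates[-1])
        templates = merged
    return templates[0]
```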

Sample Results

Sample Results

Results Using a Global Keyword and Pitch-Normalized Utterances and Keywords

Tested on 60 utterances using a global keyword, with 10 utterances associated with each keyword.
Keywords of interest: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow, computer, college, university, zero, student, school, language, program.

Hits: 41.2%
False Alarms: 37%
Misses: 42%

Results Analysis

Results differ from keyword to keyword; the best performing keyword was "bizarre", with a hit rate of 60%.
The time domain is not suitable due to the uneven statistical behavior of the signals.

MFCC Domain

Steps for cross-correlating the keyword and utterance in the MFCC domain:

Step 1: Pitch-normalize the utterances and keywords using the STRAIGHT algorithm.
Step 2: Estimate the length of the keyword (n) and compute its MFCC features.
Step 3: Compute the MFCC features of the first n samples of the utterance.
Step 4: Normalize the MFCC features of the utterance and keyword and cross-correlate them.
Step 5: Store a single value from the cross-correlation result in a matrix, shift along the utterance by a few samples, and repeat steps 3-5 until the end of the utterance.
Step 6: Identify the maximum value in the matrix as the location of the keyword.
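A hedged reconstruction of steps 2-6 (not the author's code). The `extract_mfcc(signal, fs)` helper is hypothetical and assumed to return a frames-by-coefficients matrix, e.g. built from the per-frame MFCC sketch shown earlier; "a single value from the cross-correlation" is interpreted here as the zero-shift correlation of the unit-normalized feature matrices, which is an assumption.

```python
import numpy as np

def unit_normalize(feat):
    """Step 4: divide the feature matrix by its overall Euclidean norm."""
    return feat / max(np.linalg.norm(feat), 1e-12)

def matching_scores(utterance, keyword, fs, extract_mfcc, shift=80):
    n = len(keyword)                                  # step 2: keyword length
    kw_feat = unit_normalize(extract_mfcc(keyword, fs))
    scores = []
    for start in range(0, len(utterance) - n + 1, shift):
        seg_feat = unit_normalize(extract_mfcc(utterance[start:start + n], fs))
        # Step 5 (assumed reading): keep the zero-shift correlation value,
        # i.e. the inner product of the two unit-normalized feature matrices.
        scores.append(float(np.sum(seg_feat * kw_feat)))
    return np.array(scores)

# Step 6: the segment with the maximum score is reported as the keyword location,
# e.g. location = scores.argmax() * shift.
```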

Normalizing MFCC Features

Divide the features by the square root of the sum of the squares of each vector, similar to dividing a vector by its norm to obtain a unit vector.
Reason: so the MFCC features range from zero to one.

Interpreting the Cross-Correlation of MFCC Features

This is similar to a cosine similarity measure. If two vectors are exactly the same, the angle between them is zero and its cosine is one.
The closer the cross-correlation result of two vectors is to one, the more likely they are to be a match.
Vectors that are dissimilar have a wider angle between them, and their cross-correlation results are much less than one.
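A small numpy check of this interpretation (the toy vectors are assumptions): the dot product of unit-normalized vectors equals the cosine of the angle between the originals.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.1])       # nearly parallel to a
c = np.array([-3.0, 0.5, 1.0])      # dissimilar to a

def cos_sim(u, v):
    # Normalize each vector to unit norm, then take the inner product.
    return float(np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v)))

print(cos_sim(a, b))   # close to 1 -> likely match
print(cos_sim(a, c))   # much less than 1 -> unlikely match
```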

Distance Between MFCC Features for Different Keywords (reference keyword: College)

Keyword        Distance
College        1.1 x 10^-8
University     2.1 x 10^-4
Something      8.3 x 10^-3
Conversation   1.98 x 10^-5
School         1.1 x 10^-2
Zero           0.98 x 10^-3
Program        2.1 x 10^-6
Language       1.5 x 10^-4
Bizarre        7.8 x 10^-5
Circumstance   2.6 x 10^-4
Really         3.2 x 10^-2

Speaker Dependent Test

Tests were conducted on 30 utterances; keywords were extracted from the same speaker: college, university, student, school, bizarre.

The maximum matching score corresponds to the location of the keyword.

Speaker Independent Test

Test samples: an average of 5 utterances associated with each keyword, and an average of 5 versions of each keyword.
20-25 trials, 13 keywords.

More Results

The maximum matching score corresponds to the location of the keyword.

More Results

[Figure: matching score vs. segment count for an utterance; scores range from about 0.8 to 1.]

The maximum matching score marks the location of the keyword "university"; the second-highest score falls on the word "universities".

Results Using Cross-Correlation in the MFCC Domain

Averaged over results from every keyword: 20-25 trials per keyword; 13 keywords considered in the Call Home database, 3 keywords considered in the Switchboard database.

Hits: 66%
False Alarms: 12%
Misses: 23%

Keyword Dependence

Keyword        Accuracy
College        0.83
Circumstance   0.77
Conversation   0.63
English        0.35
Computer       0.50
Always         0.85
School         0.63
College        0.63
Language       0.40
Student        0.62
Money          0.7
Program        0.62
Something      0.68

System Performance vs. Threshold

[Figure: system accuracy (%) for hits, false alarms, and misses as a function of the detection threshold (0.89 to 0.97).]

Conclusions

Cross-correlation in the time domain is not very accurate for a speaker-independent system, because of the behavior of the signals in the time domain.
Using a global keyword and pitch normalization improves performance, but not enough to deem the approach a success.
Cross-correlation of MFCC features is a very viable alternative for keyword spotting.

Future Work

Experiments using more keywords.
Use a larger dataset to optimize system performance.
Test cross-correlation in other domains.

THANK YOU!

Any Questions?