Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

Page 2:

DESIGN OF A KEYWORD SPOTTING SYSTEM USING MODIFIED CROSS-CORRELATION IN THE TIME AND MFCC DOMAIN

Presented by: Olakunle Anifowose

Thesis Advisor: Dr. Robert Yantorno

Committee Members: Dr. Joseph Picone, Dr. Dennis Silage

Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

Speech Processing Laboratory, Temple University

Page 3:

Outline

• Introduction to keyword spotting
• Motivation for this work
• Experimental conditions
• Common approach to keyword spotting
• Method used
    • Time domain
    • MFCC domain
• Conclusions
• Future work

Page 4:

Keyword Spotting

• Identify a keyword in a spoken utterance or written document
• Determine if the keyword is present in the utterance
• Location of the keyword in the utterance
• Possible operational results
    • Hits
    • False alarms
    • Misses

Page 5:

Keyword Spotting

• Speaker dependent/independent (speech recognition)
• Speaker dependent
    • Single speaker
    • Lacks flexibility and is not speaker adaptable
    • Easier to develop
• Speaker independent
    • Multi-speaker
    • Flexible
    • Harder to develop

Page 6:

Applications

• Monitor conversations for flag words.
• Automated response system.
• Security.
• Automatically search through speeches for certain words or phrases.
• Voice command/dialing.

Page 7:

Motivation for this work

• Typical Large Vocabulary Continuous Speech Recognizer (LVCSR) / Hidden Markov Model (HMM) based approaches require a garbage model
    • To train the system for non-keyword speech data.
    • The better the garbage model, the better the keyword spotting performance.
• Use of LVCSR techniques can introduce
    • Computational load and complexity.
    • Need for training data.

Page 8:

Research Objectives

• Development of a simple keyword spotting system based on cross-correlation.
• Maximize hits while keeping false alarms and misses low.

Page 9:

Speech Database Used

• Call Home database
    • Contains more than 40 telephone conversations between male and female speakers.
    • 30-minute-long conversations.
• Switchboard database
    • Two-sided conversations collected from various speakers in the United States.

Page 10:

Experimental Setup

• Conversations are split into single channels.
• Call Home database
    • 60 utterances ranging from 30 seconds to 2 minutes.
    • Keywords of interest were college, university, language, something, student, school, zero, relationship, necessarily, really, think, English, program, tomorrow, bizarre, conversation and circumstance.
• Switchboard database
    • 30 utterances ranging from 30 seconds to 2 minutes.
    • Keywords of interest were always, money and something.

Page 11:

Common Approaches

• Hidden Markov Model
    • Statistical model – hidden states / observable outputs
    • Emission probability – p(x|q1)
    • Transition probability – p(q2|q1)
    • First-order Markov process – the probability of the next state depends only on the current state.
    • Infer output given the underlying system.

Page 12:

Hidden Markov Models

• HMM for speech recognition
    • Each word – a sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to the next state).
    • Estimate the model for each word in the training vocabulary.
    • For each test keyword, the model that maximizes the likelihood is selected as a match – Viterbi algorithm.
    • KWS built directly using an HMM-based Large Vocabulary Continuous Speech Recognizer (LVCSR).
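As an illustration of the Viterbi decoding mentioned on this slide, here is a minimal Python sketch. It is not the recognizer used in the cited systems: the two-state model, the state names (q1, q2), the observation symbols (a, b) and every probability are made-up assumptions chosen only to show the recursion over emission and transition probabilities.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path of a first-order HMM (log domain)."""
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # First-order assumption: only the previous state matters.
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model (all numbers are illustrative assumptions).
states = ["q1", "q2"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "q1": {"q1": math.log(0.7), "q2": math.log(0.3)},
    "q2": {"q1": math.log(0.3), "q2": math.log(0.7)},
}
log_emit = {
    "q1": {"a": math.log(0.9), "b": math.log(0.1)},
    "q2": {"a": math.log(0.1), "b": math.log(0.9)},
}
path, log_prob = viterbi(["a", "a", "b", "b"], states, log_start, log_trans, log_emit)
print(path, math.exp(log_prob))
```

In a word-level spotter of the kind the slide describes, one such model would presumably be scored per vocabulary word and the highest-scoring model taken as the match.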

Page 13:

HMM

• Limitation
    • Large amount of training data required.
    • Training data has to be transcribed at the word level and/or phone level.
    • Transcribed data costs time and money.
    • Not available in all languages.

Page 14:

Various Keyword Spotting Systems

• HMM
    • Context-dependent state-of-the-art phoneme recognizer
    • Keyword model.
    • Garbage model.
    • Evaluated on the Conversational Telephone Speech database.
    • Accuracy varies with keyword
        • 52.6% with keyword “because”.
        • 94.5% with keyword “zero”.

Ketabdar et al., 2006

Page 15:

Various Keyword Spotting Systems

• Spoken term detection using phonetic posteriorgram
    • Trained on acoustic phonetic models.
    • Compared using dynamic time warping.
    • Trained on the Switchboard Cellular corpus.
    • Tested on the Fisher English development test from NIST.
    • Average precision for the top 10 hits was 63.3%.

Hazen et al., 2009

Page 16:

Various Keyword Spotting Systems

• S-DTW
    • Evaluated on the Switchboard corpus.
    • 75% accuracy for all keywords tested.

Jansen et al., 2010

Page 17:

Contributions

• We have proposed a novel approach to keyword spotting in both the time and MFCC domains using cross-correlation.
• The design of a global keyword for cross-correlation in the time domain.

Page 18:

Cross-correlation

• Measure of similarity between two signals.
• Two signals are compared by:
    • Sliding one signal by a certain time lag
    • Multiplying the overlapping regions and taking the sum
    • Repeating the process and adding the products until there is no more overlap
• If both signals are exactly the same, there is a maximum peak at time lag = 0, and the rest of the correlation signal tapers off to zero.
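The slide-multiply-sum procedure above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `cross_correlate`, the lag convention (positive lag means the first signal is delayed relative to the second) and the toy signals are not from the slides.

```python
def cross_correlate(x, y):
    """Cross-correlation of two equal-length sequences.

    Returns a dict mapping each lag to sum over t of x[t + lag] * y[t],
    using only the samples where the two signals overlap.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("this sketch assumes equal-length signals")
    result = {}
    for lag in range(-(n - 1), n):
        total = 0.0
        for t in range(n):
            if 0 <= t + lag < n:  # keep only the overlapping region
                total += x[t + lag] * y[t]
        result[lag] = total
    return result


# A signal correlated with itself peaks at lag 0.
signal = [0.0, 1.0, -2.0, 3.0, -1.0, 0.5]
auto = cross_correlate(signal, signal)
auto_peak = max(auto, key=auto.get)

# A copy delayed by two samples peaks at lag 2 instead.
pulse = [1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
shifted = cross_correlate(delayed, pulse)
shifted_peak = max(shifted, key=shifted.get)
print(auto_peak, shifted_peak)
```

The position of the peak is what the detection rule relies on: a peak at or near lag 0 suggests the two segments line up.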

Page 19:

Research Using Cross-correlation

• The identification of cover songs
    • Search a musical database and determine songs that are similar but performed by different artists with different instruments
• Features of choice – chroma features
    • Representation for music
    • Entire spectrum projected onto 12 bins representing 12 distinct semitones.
• Method used is cross-correlation
    • Cross-correlation is used to determine similarities between songs based on their chroma features

Page 20:

Cases Considered

• Time domain
    • Initial approach
    • Modified approach
• MFCC domain

Page 21:

Time Domain Initial Approach

[Figure: two signal plots (amplitude versus sample index, ×10^4) feeding an xcorr block]

1. Let the length of the keyword or phrase be n. The cross-correlation (xcorr) of the keyword and the first n samples of the utterance is computed.

2. Observe the position of the peak to see if it is around the zero lag.

Yes: keyword. No: not keyword.

3. Shift the observed portion by a small amount and repeat the process.

If a portion is reached where the peak is close to the zero lag, then that is where the keyword is. If not, the utterance does not contain the keyword.

Page 22:

ZRR – Zero lag to Rest Ratio

• The power around the zero lag is obtained and compared to the power in the rest of the correlation signal. This ratio is referred to as the Zero lag to Rest Ratio (ZRR).
• If the ZRR is greater than a certain threshold (2.5), then that segment of the utterance contains the keyword or phrase.
• The test utterance is shifted and the process is repeated.
• If there is no segment with a ZRR greater than 2.5, the utterance does not contain the keyword.
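A minimal sketch of the sliding ZRR test, in plain Python. Only the 2.5 threshold comes from the slides. Everything else is an assumption made for illustration: "power" is taken as the mean squared correlation value, "around the zero lag" is a window of `half_width` lags on each side, and the names `zero_lag_to_rest_ratio` and `spot_keyword`, the hop size and the toy signals are hypothetical.

```python
def cross_correlate(x, y):
    """Cross-correlation as a list; index len(x) - 1 holds the zero-lag value."""
    n = len(x)
    return [
        sum(x[t + lag] * y[t] for t in range(n) if 0 <= t + lag < n)
        for lag in range(-(n - 1), n)
    ]


def zero_lag_to_rest_ratio(corr, half_width):
    """Mean squared value near zero lag divided by that of the remaining lags."""
    center = len(corr) // 2
    if not 0 <= half_width < center:
        raise ValueError("half_width must leave some lags in the rest region")
    lo, hi = center - half_width, center + half_width + 1
    near = corr[lo:hi]
    rest = corr[:lo] + corr[hi:]
    near_power = sum(v * v for v in near) / len(near)
    rest_power = sum(v * v for v in rest) / len(rest)
    if rest_power == 0.0:
        return float("inf") if near_power > 0.0 else 0.0
    return near_power / rest_power


def spot_keyword(utterance, keyword, hop, half_width=1, threshold=2.5):
    """Return (start, zrr) for every segment whose ZRR exceeds the threshold."""
    n = len(keyword)
    hits = []
    for start in range(0, len(utterance) - n + 1, hop):
        segment = utterance[start:start + n]
        zrr = zero_lag_to_rest_ratio(cross_correlate(segment, keyword), half_width)
        if zrr > threshold:
            hits.append((start, zrr))
    return hits


# Toy data (assumption): a short +/-1 pattern embedded in silence at sample 5.
keyword = [1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0]
utterance = [0.0] * 5 + keyword + [0.0] * 5
hits = spot_keyword(utterance, keyword, hop=1, half_width=0)
print(hits)
```

An empty `hits` list corresponds to the "utterance does not contain the keyword" outcome on the slide.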

Page 23:

Test Cases

• Same Speaker
    • Keyword part of the utterance
• Different Speaker
    • Keyword from different speaker

Page 24:

Results (utterance: male1, keyword: male1)

[Figure: ratio plot (ZRR) versus shift count]

Page 25:

Result (utterance: male1, keyword: male2)

[Figure: ratio plot (ZRR) versus shift count]

Page 26:

Results (utterance: female1, keyword: female1)

[Figure: ratio plot (ZRR) versus shift count]

Page 27:

Results (utterance: female1, keyword: female2)

[Figure: ratio plot (ZRR) versus shift count]

Page 28:

Result: Speaker-Dependent Initial Time Domain Approach

• Tested on 30 utterances
• Single instances of the following keywords, extracted from the same speaker: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow

Outcome        Percentage
Hits           86%
False Alarm    14%
Miss           14%

Page 29:

Result: Speaker-Independent Initial Time Domain Approach

• Tested on 40 utterances
• Multiple instances of the following keywords, extracted from various speakers: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow

Outcome        Percentage
Hits           26%
False Alarm    65%
Miss           26%

Page 30:

Challenge

Keyword → REALLY

* 13 and 14 → same gender (female)

Page 31:

Time Domain Modified

• Inputs: the utterance and a global keyword from Quantized Dynamic Time Warping
• Pitch smoothing
• Cross-correlate both signals and compute the Zero lag to Rest Ratio (ZRR) on a frame-by-frame basis
• The highest ZRR is the location of the keyword

Page 32:

Pitch

• Measure of frequency level
• A change in pitch results in a change in the fundamental frequency of speech
• A difference in pitch between the keyword and the utterance increases detection errors.

Page 33:

Pitch Normalization

• Pitch is a form of speaker information
• Limit the effects pitch has on a speech system
• Kawahara algorithm
    • STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum) algorithm to modify pitch.
    • It reduces periodic variation in time caused by excitation

Page 34:

Utterance Pitch Normalization

• STRAIGHT algorithm
    • Elimination of periodicity interference
        • Temporal interference around peaks can be removed by constructing a new timing window based on a cardinal B-spline basis function that is adaptive to the fundamental period.
    • F0 extraction
        • Natural speech is not purely periodic
    • Speech resynthesis
        • The extracted F0 is then used to resynthesize speech

Page 35:

Modeling a Global Keyword

Compute MFCC features for each keyword

Perform Quantized Dynamic Time Warping (DTW) on several keyword templates.

Page 36:

MFCC

• Take the Fourier transform of (a windowed portion of) a signal.
• Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
• Take the log of the power at each of the mel frequencies.
• Take the Discrete Cosine Transform (DCT) of the mel log powers, as if it were a signal.
• The MFCCs are the amplitudes of the resulting spectrum.
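The five steps above can be sketched for a single frame using only the Python standard library. The slides do not give the analysis parameters, so the following are all assumptions: a Hamming window, the common 2595·log10(1 + f/700) mel formula, 20 triangular filters, 13 coefficients, an 8 kHz sampling rate and a 440 Hz test tone. A naive DFT is used for clarity, not speed.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Step 1: window the frame and take the power of its Fourier transform."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Step 2: triangular, overlapping filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * f / sample_rate) for f in edges]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):       # rising edge
            weights[k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge, peak of 1 at center
            weights[k] = (right - k) / (right - center)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Step 3: log of the power in each mel band (floored to avoid log(0)).
    log_energies = [
        math.log(max(sum(w * p for w, p in zip(weights, spectrum)), 1e-12))
        for weights in bank
    ]
    # Steps 4 and 5: DCT-II of the log energies; its amplitudes are the MFCCs.
    return [
        sum(e * math.cos(math.pi * k * (i + 0.5) / n_filters)
            for i, e in enumerate(log_energies))
        for k in range(n_coeffs)
    ]


# Hypothetical test frame: 256 samples of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs))
```

A full front end would repeat this over overlapping frames to obtain one feature vector per frame.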

Page 37:

Dynamic Time Warping

• Time stretching and contracting one signal so that it aligns with the other signal.
• Time-series similarity measure.
• Reference and test keywords are arranged along two sides of the grid.
• Template keyword – vertical axis, test keyword – horizontal axis.
• Each block in the grid – distance between corresponding feature vectors.
• Best match – path through the grid that minimizes the cumulative distance.
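The grid search described above is commonly implemented with dynamic programming. The sketch below is a generic textbook DTW, not necessarily the exact variant used in this work. The Euclidean frame distance, the three allowed step directions and the one-dimensional toy features are assumptions.

```python
import math


def dtw_distance(template, test):
    """Cumulative distance of the best warping path between two feature sequences.

    template runs along the vertical axis of the grid, test along the horizontal.
    """
    def frame_distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    rows, cols = len(template), len(test)
    inf = float("inf")
    # cost[i][j]: minimum cumulative distance to align the first i template
    # frames with the first j test frames.
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(template[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # diagonal step
                                 cost[i - 1][j],      # template frame advances
                                 cost[i][j - 1])      # test frame advances
    return cost[rows][cols]


# Toy one-dimensional "features" (assumption): the test is a time-stretched
# version of the template, so the warped distance is zero.
template = [[0.0], [1.0], [2.0], [3.0]]
stretched = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]
same_word = dtw_distance(template, stretched)
other_word = dtw_distance([[0.0], [1.0]], [[0.0], [2.0]])
print(same_word, other_word)
```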

Page 38:

Quantized Dynamic Time Warping

• The MFCC features extracted from various instances of a keyword are divided into 2 sets, A and B. Each reference template Ai is paired with only one Bi.
• For each pair Ai and Bi, the optimal path is computed (using the classic DTW algorithm).
• A new vector Ci = (c1, c2, …, cNc) is generated.
• Repeat the process, considering the pair (Ci, Ci+1) as a new Ai and Bi pair.
• The result is a single reference vector Cy.
• Invert the vector into a time domain signal.
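One way the pairwise merging could work is sketched below. The slides do not say how Ci is built from the optimal path, so the rule used here is an assumption: each frame of Ai is averaged with the mean of the Bi frames aligned to it. The function names, the handling of an odd number of templates and the toy data are likewise hypothetical, and the final inversion to a time domain signal is not shown.

```python
import math


def dtw_path(a, b):
    """Optimal DTW alignment as a list of (index in a, index in b) pairs."""
    def frame_distance(u, v):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))

    rows, cols = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end of both sequences to the start.
    i, j, path = rows, cols, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def merge_pair(a, b):
    """Assumed merge rule: average each frame of a with its aligned frames of b."""
    aligned = {i: [] for i in range(len(a))}
    for i, j in dtw_path(a, b):
        aligned[i].append(b[j])
    merged = []
    for i, frame in enumerate(a):
        group = aligned[i]
        mean_b = [sum(column) / len(group) for column in zip(*group)]
        merged.append([(x + y) / 2.0 for x, y in zip(frame, mean_b)])
    return merged


def global_template(templates):
    """Merge templates pairwise until a single reference remains."""
    level = list(templates)
    while len(level) > 1:
        merged = [merge_pair(level[k], level[k + 1])
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired template is carried to the next round
            merged.append(level[-1])
        level = merged
    return level[0]


# Toy one-dimensional templates (assumption).
a = [[0.0], [2.0]]
b = [[2.0], [4.0]]
pair = merge_pair(a, b)
reference = global_template([a, b, a, b])
print(pair, reference)
```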

Page 39:

Sample Results

Page 40:

Sample Results

Page 41:

Result Using a Global Keyword and Pitch Normalized Utterances and Keywords

• Tested on 60 utterances
• Used a global keyword
• 10 utterances associated with each keyword
• Keywords of interest: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow, computer, college, university, zero, student, school, language, program.

Outcome        Percentage
Hits           41.2%
False Alarm    37%
Miss           42%

Page 42:

Results Analysis

• Results differ from keyword to keyword
• The best performing keyword was bizarre, which had a hit rate of 60%
• The time domain is not suitable due to the uneven statistical behavior of signals.

Page 43:

MFCC Domain

Steps for cross-correlating the keyword and utterance in the MFCC domain.

Step 1: Pitch normalize the utterances and keywords using the STRAIGHT algorithm.

Step 2: Estimate the length of the keyword (n) and compute its MFCC features.

Step 3: Compute the MFCC features of the first n samples of the utterance.

Step 4: Normalize the MFCC features of the utterance and keyword and cross-correlate them.

Step 5: Store a single value from the cross-correlation result in a matrix, shift along the utterance by a couple of samples, and repeat steps 3-5 until the end of the utterance.

Step 6: Identify the maximum value in the matrix as the location of the keyword.
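Steps 3 to 6 can be sketched as a sliding match over feature vectors. To stay short, this sketch assumes the MFCC frames are already computed and slides in units of frames, whereas the slides shift by samples and recompute the features. It also assumes that the "single value" kept in step 5 is the zero-lag value of the normalized cross-correlation. Pitch normalization (step 1) is not shown, and the function names and toy frames are hypothetical.

```python
import math


def unit_normalize(vector):
    """Divide a vector by the square root of the sum of its squares."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0.0 else list(vector)


def flatten(frames):
    return [value for frame in frames for value in frame]


def match_scores(utterance_frames, keyword_frames, hop=1):
    """Score every keyword-length window of the utterance against the keyword."""
    n = len(keyword_frames)
    key = unit_normalize(flatten(keyword_frames))
    scores = []
    for start in range(0, len(utterance_frames) - n + 1, hop):
        window = unit_normalize(flatten(utterance_frames[start:start + n]))
        # Zero-lag correlation of two unit vectors (assumed "single value").
        scores.append((start, sum(a * b for a, b in zip(window, key))))
    return scores


def locate_keyword(utterance_frames, keyword_frames, hop=1):
    """Step 6: the window with the maximum score is taken as the keyword location."""
    return max(match_scores(utterance_frames, keyword_frames, hop),
               key=lambda item: item[1])


# Toy two-dimensional "MFCC" frames (assumption); the keyword sits at frame 2.
keyword_frames = [[1.0, 2.0], [3.0, -1.0]]
utterance_frames = [[0.5, 0.1], [0.2, -0.3], [1.0, 2.0], [3.0, -1.0], [0.1, 0.4]]
best_start, best_score = locate_keyword(utterance_frames, keyword_frames)
print(best_start, best_score)
```

Comparing `best_score` against a threshold, rather than always accepting the maximum, would let the system also report that no keyword is present.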

Page 44:

Normalizing MFCC Features

• Divide the features by the square root of the sum of the squares of each vector
• Similar to dividing a vector by its norm to obtain a unit vector.
• Reason: so that the magnitude of each normalized MFCC feature ranges from zero to one

Page 45:

Interpreting the cross-correlation result of MFCC features

• Similar to the cosine similarity measure.
• If two vectors are exactly the same, there is an angle of zero between them and the cosine of that angle is one.
• The closer the cross-correlation result of two vectors is to one, the more likely they are to be a match.
• Vectors that are dissimilar will have a wider angle, and their cross-correlation result will be a lot less than one.
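The link to cosine similarity can be written out explicitly. Assuming the value of interest is the zero-lag term of the cross-correlation of two unit-normalized feature vectors x and y (the symbols are introduced here for illustration), that term is the cosine of the angle θ between them:

```latex
\hat{\mathbf{x}} = \frac{\mathbf{x}}{\lVert \mathbf{x} \rVert_2}, \qquad
\hat{\mathbf{y}} = \frac{\mathbf{y}}{\lVert \mathbf{y} \rVert_2}, \qquad
\hat{\mathbf{x}} \cdot \hat{\mathbf{y}}
  = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
  = \cos\theta
```

For identical vectors θ = 0 and the score is 1. By the Cauchy-Schwarz inequality the score cannot exceed 1 for any other pair.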

Page 46:

Distance Between MFCC Features for Different Keywords

Keyword (compared with College)    Distance
College                            1.1 × 10^-8
University                         2.1 × 10^-4
Something                          8.3 × 10^-3
Conversation                       1.98 × 10^-5
School                             1.1 × 10^-2
Zero                               0.98 × 10^-3
Program                            2.1 × 10^-6
Language                           1.5 × 10^-4
Bizarre                            7.8 × 10^-5
Circumstance                       2.6 × 10^-4
Really                             3.2 × 10^-2

Page 47:

Speaker Dependent Test

• Tests were conducted on 30 utterances
• Keywords were extracted from the same speaker: college, university, student, school, bizarre
• The maximum matching score corresponds to the location of the keyword.

Page 48:

Speaker Independent Test

• Test samples
    • Average of 5 utterances associated with each keyword
    • Average of 5 versions of each keyword
    • 20-25 trials
    • 13 keywords

Page 49:

More Results

The maximum matching score corresponds to the location of the keyword.

Page 50:

More Results

[Figure: matching score (0.8 to 1) versus segment count for utterance (0 to 5000)]

• The maximum matching score is the location of the keyword “University”.
• The second highest score is the word “Universities”.

Page 51:

Result Using Cross-correlation in the MFCC Domain

• Average considering results from every keyword
• 20-25 trials per keyword
• 13 keywords considered in the Call Home database
• 3 keywords considered in the Switchboard database

Outcome        Percentage
Hits           66%
False Alarm    12%
Miss           23%

Page 52:

Keyword Dependence

Keyword         Accuracy
College         0.83
Circumstance    0.77
Conversation    0.63
English         0.35
Computer        0.50
Always          0.85
School          0.63
College         0.63
Language        0.40
Student         0.62
Money           0.7
Program         0.62
Something       0.68

Page 53:

System Performance Threshold

[Figure: system accuracy (%) for hits, false alarms and misses versus threshold (0.89 to 0.97)]

Page 54:

Conclusions

• Cross-correlation in the time domain is not very accurate for a speaker-independent system
    • Because of the behavior of the signals in the time domain.
    • Improvement with the use of a global keyword and pitch normalization
    • Not enough to deem a success
• Cross-correlation of MFCC features is a very viable alternative approach to keyword spotting

Page 55:

Future Work

• Experiments using more keywords
• Use a larger dataset to optimize system performance
• Test cross-correlation in other domains

Page 56:

THANK YOU!

Any Questions?