NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Natural Language Processing

Audio Processing, Zero Crossing Rate,

Dynamic Time Warping, Spoken Word

Recognition

Vladimir Kulyukin

www.vkedco.blogspot.com

Outline

Audio Processing

Zero Crossing Rate

Dynamic Time Warping

Spoken Word Recognition

Audio Processing

Samples

Samples are successive snapshots of a

specific signal

Audio files are samples of sound waves

Microphones convert acoustic signals into

analog electrical signals and then analog-to-

digital converter transform analog signals

into digital samples

Digital Audio Signal

pressure

Amplitude

Amplitude (in audio processing) is a

measure of sound pressure

Amplitude is measured at a specific rate

Amplitude measures result in digital

samples

Some samples have positive values

Some samples have negative values

Digital Approximation Accuracy

Any digitization of analog signals carries some

inaccuracy

Approximation accuracy depends on two

factors: 1) sampling rate and 2) resolution

In audio processing, sampling is reduction of

continuous signal to discrete signal

Sampling rate is the number of samples per unit

of time

Resolution is the size of a sample (e.g., the

number of bits)

Sampling Rate & Resolution

Sampling rate is measured in Hertz

Hertz or Hz are measured in samples per

second

For example, if the audio is sampled at a

rate of 44100 per second, then its sampling

rate is 44100Hz

Some typical resolutions are 8 bits, 16 bits,

and 32 bits

Nyquist-Shannon Sampling Theorem

This theorem states that perfect reconstruction of

a signal is possible if the sampling frequency is

greater than two times the maximum frequency of

the signal being sampled

For example, if a signal has a maximum frequency

of 50Hz, then it can, theoretically, be

reconstructed if sampled at a rate of 100Hz and

avoid aliasing (the effect of indistinguishable

sounds)

Audio File Formats

WAVE (WAV) is often associated with Windows but

are now implemented on other platforms

AIFF is common on Mac OS

AU is common on Unix/Linux

These are similar formats that vary in how they

represent data, pack samples (e.g., little-endian vs.

big-endian), etc.

Java example of how to manipulate Wav files can

be downloaded from WavFileManip.java

Zero Crossing Rate

What is Zero Crossing Rate (ZCR)?

Zero Crossing Rate (ZCR) is a measure of the

number of times, in a given sample, when

amplitude crosses the horizontal line at 0

ZCR can be used to detect silence vs. non-

silence, voice vs. unvoiced, speaker’s identity,

ZCR is essentially the count of successive

samples changing algebraic signs

ZCR Source

public class ZeroCrossingRate {

public static double computeZCR01(double[] signals, double normalizer)

long numZC = 0;

for(int i = 1; i < signals.length; i++) {

if ( (signals[i] >= 0 && signals[i-1] < 0) ||

(signals[i] < 0 && signals[i-1] >= 0) ) {

numZC++;

return numZC/normalizer;

source code is in ZeroCrossingRate.java

ZCR in Voiced vs. Unvoiced Speech

Voiced speech is produced when vowels are spoken

Voiced speech is characterized of constant

frequency tones of some duration

Unvoiced speech is produced when consonants are

spoken

Unvoiced speech is non-periodic, random-like

because air passes through a narrow constriction of

the vocal tract

ZCR in Voiced vs. Unvoiced Speech

Phonetic theory states that voiced speech

has a smooth air flow through the vocal tract

whereas unvoiced speech has a turbulent air

flow that produces noise

Thus, voiced speech should have a low ZCR

whereas unvoiced speech should have a high

Amplitude of Voiced vs. Unvoiced Speech

Amplitude of unvoiced speech is low

Amplitude of voiced speech is high

Given a digital sample, we can use average

amplitude as a measure of the sample’s

energy

This can be used to classify samples as

vowels and consonants

ZCR & Amplitude of Voiced & Unvoiced Speech

ZCR Amplitude

Voiced LOW HIGH

Unvoiced HIGH LOW

Detection of Silence & Non-Silence

silence_buffer = [];

non_silence_buffer = [];

buffer = [];

while ( there are still frames left ) {

Read a specific number of frames into buffer;

Compute ZCR and average amplitude of buffer;

if ( ZCR and average amplitude are below specific thresholds ) {

add the buffer to silence_buffer;

else {

add the buffer to non_silence_buffer;

source code is in WavFileManip.detectSilence()

Dynamic Time Warping

source code is in DTW.java

Introduction

Dynamic Time Warping (DTW) is a method to

find an optimal alignment between two time-

dependent sequences

DTW aligns (“warps”) two sequences in a non-

linear way to match each other

DTW has been successfully used in automatic

speech recognition (ASR), bioinformatics

(genetic sequence matching), and video

analysis

Basic Definitions

There are two sequences:

𝑋 = 𝑥1, … , 𝑥𝑁 and 𝑌 = 𝑦1, … , 𝑦𝑀

There is a feature space F such that:

𝑥𝑖 ∈ 𝐹 & 𝑦𝑗 ∈ 𝐹 where 1 ≤ 𝑖 ≤ 𝑁, 1 ≤ 𝑗 ≤ 𝑀

There is a local cost measure mapping 2-

tuples of features to non-negative reals:

𝑐: 𝐹 x 𝐹 → 𝑅 ≥ 0

Sample Sequences

Sample Alignment

Cost Matrix DTW(N, M)

1 2 …. i … N

𝑑𝑡𝑤 𝑖, 𝑗 is the cost of warping X[1:i] with Y[1:j]

X and Y are sequences X[1:N] and Y[1:M]

Warping Path

𝑃 = 𝑝1, … , 𝑝𝐿 , where 𝑝 = 𝑛𝑗 , 𝑚𝑗 ∈ 1, 𝑁 × [1,𝑀] and

𝑗 ∈ 1, 𝐿 is a warping path if

1) 𝑝1 = 1,1 and 𝑝𝐿 = 𝑁,𝑀 2) 𝑛1 ≤ 𝑛2 ≤ … ≤ 𝑛𝑁 and 𝑚1 ≤ 𝑚2 ≤ … ≤ 𝑚𝑀

3) 𝑝𝑙+1 − 𝑝𝑙 ∈ 1, 0 , 0, 1 , 1, 1 , 1 ≤ 𝑙 ≤ 𝐿 − 1

Valid Warping Path

1 2 3 4

𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, 𝑝6 , where

𝑝1 = 1, 1 , 𝑝2 = 1, 2 , 𝑝3 = 2, 3 , 𝑝4 = 2, 4 , 𝑝5 = 3, 5 , 𝑝6 = (4, 5)

𝑝5 𝑝6

Invalid Warping Path

1 2 3 4

𝑝1 ≠ 1, 1 so constraint 1 is not satisfied

𝑝1 𝑝2

𝑝5 𝑝6

1 2 3 4

𝑝3 = 3, 3 , 𝑝4 = 2, 4 , 3 > 2 so 2nd constraint is not satisfied

𝑝5 𝑝6

1 2 3 4

𝑝2 = 2, 2 , 𝑝3 = 3, 4 , 𝑝3 − 𝑝2 = 3,4 − 2,2 = 1, 2 ∉1, 0 , 0, 1 , 1, 1 so 3rd condition is not satisfied

𝑝4 𝑝5

Total Cost of a Warping Path

𝑃 = 𝑝1, … , 𝑝𝐿 , is a warping path between sequences X

and Y, then its total cost is

𝑐𝑝 𝑋, 𝑌 = 𝑐(𝑥𝑛𝑗 , 𝑦𝑚𝑗)

𝑗=1

Example

1 2 3 4

5 Assume that 𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, 𝑝6 , where 𝑝1 = 1, 1 , 𝑝2 = 1, 2 , 𝑝3 =2, 3 , 𝑝4 = 2, 4 , 𝑝5 = 3, 5 , 𝑝6 = 4, 5 ,

is a warping path b/w X[1:4] and Y[1:5].

Then the total cost of P is

𝑐 𝑥1, 𝑦1 + 𝑐 𝑥1, 𝑦2 + 𝑐 𝑥2, 𝑦3 +𝑐 𝑥2, 𝑦4 + 𝑐 𝑥3, 𝑦5 + 𝑐 𝑥4, 𝑦5 .

This notation 𝑐 𝑥𝑖 , 𝑦𝑗 can be simplified

to read 𝑐(𝑖, 𝑗) or 𝑐 𝑋 𝑖 , 𝑌 𝑗 .

𝑝5 𝑝6

DTW(X, Y) – Cost of an Optimal Warping Path

𝐷𝑇𝑊 𝑋, 𝑌 = min 𝑐𝑝 𝑋, 𝑌 𝑝 is a warping path}

Remarks on DTW(X, Y)

There may be several warping paths of the

same DTW(X, Y)

DTW(X, Y) is symmetric whenever the local

cost measure is symmetric

DTW(X, Y) does not necessarily satisfy the

triangle inequality (the sum of the lengths of

two sides is greater than the length of the

remaining side)

DTW Equations: Base Cases

1 2 …. i … N

Initial condition: 𝑑𝑡𝑤 1,1 = 𝑐(1,1)

1st Row: 𝑑𝑡𝑤 𝑖, 1 = 𝑑𝑡𝑤 𝑖 − 1,1 + 𝑐(𝑖, 1)

1st Column:

𝑑𝑡𝑤 1, 𝑗 = 𝑑𝑡𝑤 1, 𝑗 − 1 +𝑐(1, 𝑗)

DTW Equations: Recursion

1 2 … i … N

Inner Cell: 𝑑𝑡𝑤 𝑖, 𝑗 = min 𝑑𝑡𝑤 𝑖 − 1, 𝑗 , 𝑑𝑡𝑤 𝑖 − 1, 𝑗 − 1 , 𝑑𝑡𝑤 𝑖, 𝑗 − 1 + 𝑐(𝑖, 𝑗)

Interpretation: Cost of

warping X[1:i] with Y[1:J] is

the cost of warping X[i] with

Y[j] plus the minimum of the

following three costs: 1) the

cost of warping X[1:i-1] with

Y[1:j]; 2) the cost of warping

X[1:i-1] with Y[1:j-1]; 3) the

cost of warping X[1:i] with

Y[1:j-1]

Example

Let the sequences be:

𝑋 = 𝑎, 𝑏, 𝑔 𝑌 = 𝑎, 𝑏, 𝑏, 𝑔 𝑍 = (𝑎, 𝑔, 𝑔)

Let the feature space 𝐹 = 𝑎, 𝑏, 𝑔 .

Let the local cost measure be

defined as follows:

𝑐 𝑥, 𝑦 = 0 𝑖𝑓 𝑥 = 𝑦1 𝑖𝑓 𝑥 ≠ 𝑦

Let us compute dtw(X,Y), dtw(Y,Z), and dtw(X, Z).

Work it out on paper.

DTW(X, Y)

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 2,1 = 𝑐 2,1 + 𝑑𝑡𝑤 1,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 3,1 = 𝑐 3,1 + 𝑑𝑡𝑤 2,1= 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 1,2 = 𝑐 1,2 + 𝑑𝑡𝑤 1,1= 𝑐 𝑎, 𝑏 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 1,3 = 𝑐 1,3 + 𝑑𝑡𝑤 1,2= 𝑐 𝑎, 𝑏 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 1,4 = 𝑐 1,4 + 𝑑𝑡𝑤 1,3= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,3= 1 + 2 = 3

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 2,2= 𝑐 2,2

+min𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 1,1 ,𝑑𝑡𝑤 2,1

= 𝑐 𝑏, 𝑏 + min 1,0,1 = 0 + 0= 0 1 2

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 3,2= 𝑐 3,2

= 𝑐 𝑔, 𝑏 + min 0,1,2 = 1 + 0= 1 1 2

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 2,3= 𝑐 2,2

= 𝑐 𝑏, 𝑏 + min 2,1,0 = 0 + 0= 0 1 2

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 3,3= 𝑐 3,3

= 𝑐 𝑔, 𝑏 + min 0,0,1 = 1 + 0= 1 1 2

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 2,4= 𝑐 2,4

= 𝑐 𝑏, 𝑔 + min 3,2,0 = 1 + 0= 1 1 2

Example: DTW(X,Y)

a 𝑏 𝑔

𝑑𝑡𝑤 3,4= 𝑐 3,4

= 𝑐 𝑔, 𝑔 +min 1,0,1 = 0 + 0= 0

So DTW(X,Y) = 0

Example: DTW(X,Y)

a 𝑏 𝑔

DTW(X, Y) = 0.

Optimal Warping Path

(OWP) P can be found by

chasing pointers (red

arrows): P = ((1,1), (2, 2),

(2, 3), (3, 4)). 1 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0 Z

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 2,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 3,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 4,1= 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 3,1= 1 + 2 = 3

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 1,2= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 1,3= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 2,2= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 1,2 ,

𝑑𝑡𝑤 1,1 , 𝑑𝑡𝑤 2,1 }

= 1 + 0 = 1

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 3,2= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 2,2 ,

𝑑𝑡𝑤 2,1 , 𝑑𝑡𝑤 3,1 }

= 1 +min 1,1,2 = 1 + 1 = 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 4,2= 𝑐 𝑔, 𝑔+ min {𝑑𝑡𝑤 3,2 ,

𝑑𝑡𝑤 3,1 , 𝑑𝑡𝑤 4,1 }

= 0 +min 2,2,3 = 0 + 2 = 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 2,3= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 1,3 ,

𝑑𝑡𝑤 1,2 , 𝑑𝑡𝑤 2,2 }

= 1 +min 2,1,1 = 1 + 1 = 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 3,3= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 2,3 ,

𝑑𝑡𝑤 2,2 , 𝑑𝑡𝑤 3,2 }

= 1 +min 2,1,2 = 1 + 1 = 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

𝑑𝑡𝑤 4,3= 𝑐 𝑔, 𝑔+ min {𝑑𝑡𝑤 3,4 ,

𝑑𝑡𝑤 3,2 , 𝑑𝑡𝑤 4,2 }

= 0 +min 2,2,2 = 0 + 2 = 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

1 2 3 4

DTW(Y, Z) = 2.

Optimal Warping Path (OWP) P

can be found by chasing pointers

(red arrows): P = ((1,1), (2, 2), (3,

2), (4, 3)).

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 2,1 = 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 3,1 = 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 1,2 = 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 1,3 = 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 2,2= 𝑐 𝑏, 𝑔

+ min𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 1,1 ,𝑑𝑡𝑤 2,1

= 1 +min 1,0,1= 1 + 0 = 1

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 3,2= 𝑐 𝑔, 𝑔

= 0 +min 1,1,2= 0 + 1 = 1

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 2,3= 𝑐 𝑏, 𝑔

+ min𝑑𝑡𝑤 1,3 ,𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 2,2

= 1 +min 2,1,1= 1 + 1 = 2

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

𝑑𝑡𝑤 3,3= 𝑐 𝑔, 𝑔

= 0 +min 2,1,2= 0 + 1 = 1

DTW(X, Z)

a 𝑏 𝑔

𝑔 Z

1 0 1 2

2 1 DTW(X, Z) = 1.

Optimal Warping Path (OWP)

P can be found by chasing

pointers (red arrows): P =

((1,1), (2, 2), (3, 3)).

Possible Optimizations of DTW

The computation of DTW can be optimized so that only the

cells within a specific window are considered

Possible Optimizations of DTW

You may have realized by now that if we care

only about the total cost of warping sequence X

with sequence Y, we do not need to compute

the entire N x M cost matrix – we need only two

columns

The storage savings are huge, but the running

time remains the same – O(N x M)

We can also normalize the DTW cost by N x M

to keep it low

Spoken Word Recognition

source code is in WavAudioDictionary.java

General Outline

Given a directory of audio files with spoken words, process

each file into a table that maps specific words (or phrases)

to digital signal vectors

These signal vectors can be pre-processed to eliminate

silences

An input audio file is taken and digitized into a digital

signal vector

The input vector is compared with DTW scores b/w the

input vector and the digital vectors in the table

Optimizations

If we use DTW to compute the similarity b/w the

digital audio input vector and the vectors in the table,

it is vital to keep the vectors as short as possible w/o

sacrificing precision

Possible suggestions: decreasing the sampling rate

and merging samples into super-features (e.g., Haar

coefficients)

Parallelizing similarity computations

References

M. Muller. Information Retrieval for Music and

Motion, Ch.04. Springer, ISBN 978-3-540-74047-6

Bachu, R. G., et al. “Separation of Voiced and

Unvoiced using Zero Crossing Rate and Energy of the

Speech Signal." American Society for Engineering

Education (ASEE) Zone Conference Proceedings. 2008.

NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Technology

Transcript of NLP (Fall 2013): Audio Processing, Zero Crossing Rate, Dynamic Time Warping, Spoken Word Recognition

Warping Calculations

Winding, Warping, Sizing

Natural Language Processing: An Introductioncis.upenn.edu/~cis521/Lectures/NLP-intro.pdf“Natural language, whether spoken, written, or typed, is the ... Abdul Qadeer Khan has admitted

Image warping/morphing Image warpingImage warping · Image warping/morphing Digital Visual Effects ... Steve Seitz, Tom Funkhouser and Alexei Efros Image warpingImage warping Image

Center for PersonKommunikation P.1 CPK NLP Suite for Spoken Language Understanding ( tb/nlpsuite) Tom Brøndsted, CPK. Aalborg University.

Warping l Css

Question Answering What's Next?mandl/Neu/Rijkehildesheim2005_bw.pdf · documents, spoken documents, video, NLP, web, question answering, novelty, robust, genomics, HARD, ... collaboration

Yarn Breakage in Warping

Image Morphing Warping

Warping Torsion - Home Page | Prof Terje Haukaascivil-terje.sites.olt.ubc.ca/files/2020/03/Warping...Warping Torsion Updated March 12, 2020 Page 1 Warping Torsion In addition to shear

Image Warping

Dynamic Time Warping (DTW)

Rotor Warping

A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.

Image warping and stitching

Morphing & Warping

Image Warping (Szeliski 3.6.1)

06 Warping

Query-by-Example Spoken Term Detectionexcel.fit.vutbr.cz/submissions/2015/084/84.pdf · Keywords: Query-by-Example — Spoken Term Detection — Dynamic Time Warping — acoustic

Warping machine