Application 3 AUTOMATIC SPEECH RECOGNITION. THE MALEVOLENT HAL.


Transcript of Application 3 AUTOMATIC SPEECH RECOGNITION. THE MALEVOLENT HAL.

Page 1

Application 3: AUTOMATIC SPEECH RECOGNITION

Page 2

THE MALEVOLENT HAL

Page 3

THE TURING TEST: A PHILOSOPHICAL INTERLUDE

Page 4

THE CHINESE ROOM

Page 5

WHAT DO THE TWO THOUGHT EXPERIMENTS HAVE IN COMMON?

Page 6

TYPES OF PERFORMANCE

Page 7

THE MODEL

Do You Believe This?

Page 8

WHY ASR IS HARD

Page 9

◦ [t] in "tunafish": word initial; vocal cords don't vibrate; produces a puff of air

◦ [t] in "starfish": [t] preceded by [s]; vocal cords vibrate; no air puff

◦ [k]: vocal cords don't vibrate; produces a puff of air
◦ [g]: vocal cords vibrate; no air puff
◦ But an initial [s] changes things: now [k] vibrates
◦ Leads to the mishearing of [sk]/[sg]

the sky / this guy

◦ There’s more going on: which hearing is more probable?

A TALE OF ASPIRATION

Page 10

Acoustic-Phonetic: Let us pray / Lettuce spray
Syntactic: Meet her at home / Meter at home
Semantic: Is the baby crying / Is the bay bee crying
Discourse: It is easy to recognize speech / It is easy to wreck a nice beach
Prosody: I'm FLYING to Spokane / I'm flying to SPOKANE

AMBIGUITY EXISTS AT DIFFERENT LEVELS (JIM GLASS, 2007)

Page 11

Language is not a system of rules
◦ [t] makes a certain sound
◦ "to whom" is correct; "to who" is incorrect

Language is a collection of probabilities

WHAT TO DO?

Page 12

What is the most likely sequence of words W out of all word sequences in a language L given some acoustic input O?

Where O is a sequence of observations O = o1, o2, o3, ..., ot

each oi is a floating point value representing ~10 ms worth of energy of that slice of O

And W = w1, w2, w3, ..., wn

each wi is a word in L

GOAL OF A PROBABILISTIC NOISY CHANNEL ARCHITECTURE

Page 13

ASR AS A CONDITIONAL PROBABILITY

Ŵ = argmax_{W ∈ L} P(W | O)

Page 14

ASR Is as Old as Computing

50s: Bell Labs, RCA Research, Lincoln Labs; discoveries in acoustic phonetics applied to recognition of single digits, syllables, vowels

60s: Pattern recognition techniques used in the US, Japan, Soviet Union

Two developments in the 80s: DARPA funding for LVCSR, and the application of HMMs to speech recognition

AN HISTORICAL ASIDE

Page 15

Recall the Decoding Task: given an HMM M, a hidden state sequence Q, and an observation sequence O, determine p(Q|O)

Recall the Learning Task: given O and Q, create M, where M consists of two matrices

Priors: A = a11, ..., a1n, ..., an1, ..., ann, where aij = p(qj|qi)

Likelihoods: p(oi|qi)

But how do we get from this equation to our likelihoods and priors?

A SENTIMENTAL JOURNEY

Ŵ = argmax_{W ∈ L} P(W | O)

Page 16

PARSON BAYES TO THE RESCUE

Author of: Divine Benevolence, or an Attempt to Prove That the Principal End of the Divine Providence and Government is the Happiness of His Creatures (1731)

Page 17

BAYES RULE

Bayes' rule lets us transform:

Ŵ = argmax_{W ∈ L} P(W | O)

into:

Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)

and in fact, since P(O) is the same for every candidate W:

Ŵ = argmax_{W ∈ L} P(O | W) P(W)

p(O|W): likelihoods, referred to as the acoustic model
p(W): priors, referred to as the language model
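The decision rule above can be sketched with toy numbers. The scores below are hypothetical, not from a real recognizer; they reuse the sky / this guy confusion from the aspiration slide to show the language model outvoting a small acoustic edge.

```python
# Hypothetical acoustic scores p(O|W) for one fixed acoustic input O
acoustic = {                # acoustic model: p(O|W)
    "the sky": 0.30,
    "this guy": 0.32,       # slightly better acoustic match
}
prior = {                   # language model: p(W)
    "the sky": 0.6,
    "this guy": 0.2,
}

def decode(acoustic, prior):
    """Return argmax over W of p(O|W) * p(W)."""
    return max(acoustic, key=lambda w: acoustic[w] * prior[w])

best = decode(acoustic, prior)
print(best)   # "the sky": the prior outweighs the acoustic edge
```

Here 0.30 * 0.6 = 0.18 beats 0.32 * 0.2 = 0.064, so the more probable hearing wins even though it matched the acoustics slightly worse.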

Page 18

LVCSR

Page 19

ANOTHER VIEW

[Diagram of an LVCSR system: digital signal processing in the Feature Extractor turns the signal into a representation of the acoustic signal (the feature set); the Acoustic Model supplies p(O|W), the Language Model supplies p(W), and a Viterbi Decoder combines them to output word symbols.]

Page 20

◦ Digitize the analog signal through sampling
◦ Decide on a window size and perform an FFT
◦ Output: the amount of energy at each frequency range: the spectrum
◦ log(FFT) gives the mel scale value
◦ Take the FFT of the previous value: the cepstrum
◦ The cepstrum is a model of the vocal tract
◦ Save 13 values
◦ Compute the change in these 13 over the next window
◦ Compute the change in the 13 deltas
◦ Total: 39 features per vector

CREATING FEATURE VECTORS: DSP
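The pipeline on this slide can be sketched in numpy. This is a minimal illustration under simplifying assumptions: no pre-emphasis and no mel filterbank, and the "second FFT" is realized as a DCT of the log spectrum, as is common in MFCC front ends; all parameter values are illustrative.

```python
import numpy as np

def features_39(signal, win=400, hop=160, n_ceps=13):
    """Sketch of the 39-feature pipeline: windowed FFT -> log energies ->
    cepstrum (via DCT) -> 13 values + deltas + delta-deltas per ~10 ms frame."""
    # 1. Slice into overlapping windows (hop = 160 samples = 10 ms at 16 kHz)
    frames = [signal[i:i + win] * np.hamming(win)
              for i in range(0, len(signal) - win, hop)]
    # 2. FFT -> energy at each frequency (spectrum), then log
    logspec = [np.log(np.abs(np.fft.rfft(f)) ** 2 + 1e-10) for f in frames]
    # 3. Second transform (DCT) -> cepstrum; keep 13 values per frame
    n = len(logspec[0])
    dct = np.cos(np.pi / n * np.outer(np.arange(n_ceps), np.arange(n) + 0.5))
    ceps = np.array([dct @ s for s in logspec])
    # 4. Deltas (change over frames) and delta-deltas
    delta = np.gradient(ceps, axis=0)
    ddelta = np.gradient(delta, axis=0)
    return np.hstack([ceps, delta, ddelta])     # 39 features per frame

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of 440 Hz
F = features_39(sig)
print(F.shape)   # (number of 10 ms frames, 39)
```

The 13 cepstral values, their deltas, and their delta-deltas give the 39 features per 10 ms vector that the slide describes.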

Page 21

Computing the likelihood of feature vectors given an HMM state: p(ot|qi)
The HMM state is a partial representation of a linguistic unit

But First: What Are These Speech Sounds?

LEFT OUT

Page 22

FUNDAMENTAL THEORY OF PHONETICS

A spoken word is composed of smaller units of speech

Called phones

Def: A phone is a speech sound
Phones are represented by symbols

◦ IPA
◦ ARPABET

Page 23

ENGLISH VOWELS

Page 24

HUMAN VOCAL ORGANS

Page 25

CLOSE-UP

Page 26

TYPES OF SOUND

Glottis: the space between the vocal folds. The glottis vibrates or doesn't vibrate:

◦ Voiced consonants like [b], [d], [g], [v], [z], and all English vowels
◦ Unvoiced consonants like [p], [t], [k], [f], [s]

Sounds passing through the nose: nasals [m], [n], [ng]

Page 27

PHONE CLASSES

Consonants: produced by restricting the airflow

Vowels: unrestricted, usually voiced, and longer lasting

Semivowels: e.g. [y], voiced but shorter

Page 28

CONSONANT: PLACE OF ARTICULATION

◦ Labial: [b], [m]
◦ Labiodental: [v], [f]
◦ Dental: [th]
◦ Alveolar: [s], [z], [t], [d]
◦ Palatal: [sh], [ch], [zh] (Asian), [jh] (jar)
◦ Velar: [k] (cuckoo), [g] (goose), [ng] (kingfisher)

Page 29

CONSONANT: MANNER OF ARTICULATION

How the airflow is restricted:
◦ Stop or plosive: [b], [d], [g]
◦ Nasal: air passes into the nasal cavity: [n], [m], [ng]
◦ Fricative: air flow is not cut off completely: [f], [v], [th], [dh], [s], [z]
◦ Affricate: a stop followed by a fricative: [ch] (chicken), [jh] (giraffe)
◦ Approximant: two articulators are close together but not close enough to cause turbulent air flow: [y], [w], [r], [l]

Page 30

VOWELS

Characterized by height and backness
◦ High Front: tongue raised toward the front: [iy] (lily)
◦ High Back: tongue raised toward the back: [uw] (tulip)
◦ Low Front: [ae] (bat)
◦ Low Back: [aa] (poppy)

Page 31

ACOUSTIC PHONETICS

Based on the sine wave

f = cycles per second
A = height of the wave
T = 1/f, the amount of time it takes one cycle to complete

Page 32

SOUND WAVES

Plot the change in air pressure over time
◦ Imagine an eardrum blocking air pressure waves
◦ The graph measures the amount of compression and uncompression

[iy] in “She just had a baby”

Page 33

SHE JUST HAD A BABY

Notice the vowels, fricative [sh], and stop release [b]

Page 34

FOURIER ANALYSIS

Every complex wave form can be represented as a sum of component sine waves

Two wave forms: 10 Hz and 100 Hz

Page 35

SPECTRUM

Spectrum of a signal is a representation of each of its frequency components and their amplitudes.

Spectrum of the 10 + 100 Hz wave forms. Note the two spikes
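The two spikes can be reproduced directly: build the 10 Hz + 100 Hz wave form from the previous slide and take its FFT (the sampling rate and relative amplitudes here are illustrative choices).

```python
import numpy as np

sr = 1000                                   # samples per second
t = np.arange(0, 1, 1 / sr)                 # 1 second of signal
wave = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 100 * t)

spectrum = np.abs(np.fft.rfft(wave))        # amplitude at each frequency
freqs = np.fft.rfftfreq(len(wave), 1 / sr)  # the frequency axis in Hz

# The only large frequency components are the two we put in
peaks = freqs[spectrum > 0.1 * spectrum.max()]
print(peaks)   # peaks at 10 Hz and 100 Hz
```

Every other frequency bin is essentially zero, which is Fourier analysis in miniature: the complex wave is exactly the sum of its two component sine waves.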

Page 36

WAVE FORM FOR [AE] IN “HAD”

Note
• 10 major waves, and 4 smaller waves within the 10 larger
• The frequency of the larger is 10 cycles / .0427 s = 234 Hz
• The frequency of the smaller is about 4 times that, or ~930 Hz
Also
• Some of the 930 Hz waves have two smaller waves
• F ~ 2 * 930 = 1860 Hz

Page 37

SPECTRUM FOR [AE]

Notice one of the peaks at just under 1000 Hz
Another at just under 2000 Hz

Page 38

CONCLUSION

Spectral peaks that are visible in a spectrum are characteristic of different phones

Page 39

Computing the likelihood of feature vectors given a triphone: p(ot|qi)

Language model: p(W)

WHAT REMAINS

Page 40

SPECTROGRAM

Representation of different frequencies that make up a waveform over time (spectrum was a single point in time)

x axis: time
y axis: frequency in Hz
darkness: amplitude

[ih] [ae] [ah]

Page 41

WE NEED A SEQUENCE CLASSIFIER

Page 42

Observation Sequence in ASR: acoustic feature vectors
◦ 39 real-valued features
◦ Represent changes in energy in different frequency bands
◦ Each vector represents 10 ms

Hidden States
◦ Words, for simple tasks like digit recognition or yes/no
◦ Phones, or (usually) subphones

HMMS IN ACTION
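The decoding step for such a sequence classifier is the Viterbi algorithm. Below is a minimal sketch over a toy discrete HMM; the two states and all the numbers are hypothetical, ice-cream-style values (the kind of hand-countable example mentioned later), not an acoustic model.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for obs.
    pi: initial probs (n,), A: transition priors (n, n),
    B: likelihoods p(o|q) as an (n, num_symbols) table."""
    n = len(pi)
    V = np.zeros((len(obs), n))           # best path probability so far
    back = np.zeros((len(obs), n), int)   # backpointers
    V[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        for q in range(n):
            scores = V[t - 1] * A[:, q]   # extend every previous path to q
            back[t, q] = np.argmax(scores)
            V[t, q] = scores.max() * B[q, obs[t]]
    path = [int(np.argmax(V[-1]))]        # best final state, then trace back
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state toy: states 0 = HOT, 1 = COLD; symbols 0, 1, 2
pi = np.array([0.5, 0.5])
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.1, 0.3, 0.6],           # p(symbol | HOT)
               [0.6, 0.3, 0.1]])          # p(symbol | COLD)
print(viterbi([2, 2, 0], pi, A, B))       # [0, 0, 1]
```

In ASR the same recursion runs with subphone states and Gaussian likelihoods over 39-feature vectors instead of a small discrete table.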

Page 43

SIX

Bakis Network: Left-Right HMM

Each aij is an entry in the priors matrix

Likelihood probabilities not shown
• For each state there is a collection of likelihood observations
• Each observation (now a vector of 39 features) has a probability given the state
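The Bakis topology can be written down directly as a priors matrix: each state may only loop on itself or move forward. The sketch below uses a 3-state model plus an absorbing exit state; the probabilities are illustrative, not trained values.

```python
import numpy as np

# Left-right (Bakis) transition matrix for a 3-state model + exit state.
# Rows are "from" states, columns "to" states; illustrative numbers.
A = np.array([
    [0.6, 0.4, 0.0, 0.0],   # state 0: self-loop or advance
    [0.0, 0.7, 0.3, 0.0],   # state 1: self-loop or advance
    [0.0, 0.0, 0.5, 0.5],   # state 2: self-loop or exit
    [0.0, 0.0, 0.0, 1.0],   # exit state (absorbing)
])

assert np.allclose(A.sum(axis=1), 1.0)   # each row is a distribution
assert np.allclose(A, np.triu(A))        # no backward transitions
print("left-right topology ok")
```

The upper-triangular shape is the whole point: zeros below the diagonal forbid moving backward in time, which matches how speech unfolds.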

Page 44

BUT PHONES CHANGE OVER TIME

Page 45

NECESSARY TO MODEL SUBPHONES

As Before: Bakis Network (Left-Right HMM)

Each aij is an entry in the priors matrix

Likelihood probabilities not shown
• For each state there is a collection of likelihood observations
• Each observation (now a vector of 39 features) has a probability given the state: p(ot|qi)

Page 46

COARTICULATION

Notice the difference in the 2nd formant of [eh] in each context

Page 47

Triphone: a phone together with its left context and right context
Notation [y-eh+l]: [eh] preceded by [y] and followed by [l]

Suppose there are 50 phones in a language: 50^3 = 125,000 triphones

Not all will appear in a corpus
◦ English disallows: [ae-eh+ow] and [m-j+t]
◦ WSJ study: 55,000 triphones needed, but only 18,500 found

SOLUTION

Page 48

Lucky for us: different contexts sometimes have similar effects. Notice [w iy]/[r iy] and [m iy]/[n iy]

DATA SPARSITY

Page 49

STATE TYING

Initial subphones of triphones like [t-iy+n] share acoustic reps (and likelihood probabilities)

How: Clustering algorithm

Page 50

Problem 1: Which observation corresponds to which state? p(ot|qi): Likelihoods

Problem 2: What is the transition probability between states? Priors

Hand labeling
◦ Training corpus of isolated words in a wav file
◦ Start and stop time of each phone is marked by hand
◦ Can compute the observation likelihoods by counting (like ice cream)
◦ But requires 400 hours to label an hour of speech
◦ Humans are bad at labeling units smaller than a phone

Embedded training
◦ Wav file + corresponding transcription
◦ Pronunciation lexicon
◦ Raw (untrained) HMM
◦ Baum-Welch sums over all possible segmentations of words and phones

ACOUSTIC LIKELIHOOD/TRANSITION PROBABILITY COMPUTATION
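The "compute the observation likelihoods by counting" idea can be sketched as relative-frequency estimation. The labeled (phone, observation) pairs below are hypothetical, with observations coarsened to discrete symbols purely for illustration; a real system estimates densities over 39-feature vectors.

```python
from collections import Counter

# Hypothetical hand-labeled corpus: (phone state, observed symbol) pairs
labeled = [("iy", "v1"), ("iy", "v1"), ("iy", "v2"),
           ("s", "v3"), ("s", "v3"), ("s", "v1")]

pair_counts = Counter(labeled)                  # count(q, o)
state_counts = Counter(q for q, _ in labeled)   # count(q)

def likelihood(o, q):
    """Maximum-likelihood estimate of p(o|q) by relative frequency."""
    return pair_counts[(q, o)] / state_counts[q]

print(likelihood("v1", "iy"))   # 2/3
```

Transition priors can be estimated the same way, by counting state-to-state transitions in the labeled sequences; embedded training replaces these hard counts with Baum-Welch's expected counts over all segmentations.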

Page 51

THE RAW MATERIALS

Page 52

THE LANGUAGE MODEL

Recall:

[Diagram of an LVCSR system: digital signal processing in the Feature Extractor turns the signal into a representation of the acoustic signal (the feature set); the Acoustic Model supplies p(O|W), the Language Model supplies p(W), and a Viterbi Decoder combines them to output word symbols.]

Page 53

P(W)

Page 54

<s> Alex wrote his book </s>
<s> James wrote a different book </s>
<s> Alex wrote a book suggested by Judith </s>

BIGRAM PROBABILITIES

P(Alex|<s>) = 2/3   P(wrote|Alex) = 2/2   P(a|wrote) = 2/3

P(book|a) = 1/2   P(</s>|book) = 2/3

P(Alex wrote a book) = P(Alex|<s>)P(wrote|Alex)P(a|wrote)P(book|a)P(</s>|book) = .148

Independence Assumption
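The bigram estimates above can be reproduced by counting over the three training sentences, then multiplying along the test sentence:

```python
from collections import Counter

corpus = [
    "<s> Alex wrote his book </s>",
    "<s> James wrote a different book </s>",
    "<s> Alex wrote a book suggested by Judith </s>",
]
bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks[:-1])            # history counts (</s> never a history)
    bigrams.update(zip(toks, toks[1:]))   # adjacent word pairs

def p(w, prev):
    """MLE bigram probability p(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

sentence = ["<s>", "Alex", "wrote", "a", "book", "</s>"]
prob = 1.0
for prev, w in zip(sentence, sentence[1:]):
    prob *= p(w, prev)
print(round(prob, 3))   # 0.148
```

The product (2/3)(2/2)(2/3)(1/2)(2/3) = 8/54 ≈ .148 matches the slide; each factor depends only on the previous word, which is exactly the independence assumption.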

Page 55

HMMs require independence assumptions
◦ Researchers are now experimenting with deep belief networks: stacks of Boltzmann machines

Non-global languages: the Google phenomenon

Detection of physical and emotional states
◦ anger, frustration, sleepiness, intoxication, blame classification among married couples

ISSUES

Page 56

LANGUAGE IS OUR DEFINING FEATURE

And We Haven’t Even Begun to Talk About Understanding