Application 3: AUTOMATIC SPEECH RECOGNITION. THE MALEVOLENT HAL.
Application 3: AUTOMATIC SPEECH RECOGNITION
THE MALEVOLENT HAL
THE TURING TEST: A PHILOSOPHICAL INTERLUDE
THE CHINESE ROOM
WHAT DO THE TWO THOUGHT EXPERIMENTS HAVE IN COMMON?
TYPES OF PERFORMANCE
THE MODEL
Do You Believe This?
WHY ASR IS HARD
◦ [t] in "tunafish": word-initial; vocal cords don't vibrate; produces a puff of air
◦ [t] in "starfish": [t] preceded by [s]; vocal cords vibrate; no air puff
◦ [k]: vocal cords don't vibrate; produces a puff of air
◦ [g]: vocal cords vibrate; no air puff
◦ But an initial [s] changes things: now the [k] vibrates
◦ Leads to the mishearing of [sk]/[sg]:
"the sky" / "this guy"
◦ There’s more going on: which hearing is more probable?
A TALE OF ASPIRATION
Acoustic-Phonetic: Let us pray / Lettuce spray
Syntactic: Meet her at home / Meter at home
Semantic: Is the baby crying / Is the bay bee crying
Discourse: It is easy to recognize speech / It is easy to wreck a nice beach
Prosody: I'm FLYING to Spokane / I'm flying to SPOKANE
AMBIGUITY EXISTS AT DIFFERENT LEVELS (JIM GLASS, 2007)
Language is not a system of rules
◦ [t] makes a certain sound
◦ "to whom" is correct; "to who" is incorrect
Language is a collection of probabilities
WHAT TO DO?
What is the most likely sequence of words W out of all word sequences in a language L given some acoustic input O?
where O is a sequence of observations, O = o1, o2, o3, ..., ot,
each oi a floating-point value representing ~10 ms worth of energy in that slice of O,
and W = w1, w2, w3, ..., wn,
each wi a word in L
GOAL OF A PROBABILISTIC NOISY CHANNEL ARCHITECTURE
ASR AS A CONDITIONAL PROBABILITY
Ŵ = argmax_{W ∈ L} P(W | O)
ASR Is as Old as Computing. 50s: Bell Labs, RCA Research, Lincoln Labs
Discoveries in acoustic phonetics applied to recognition of single digits, syllables, vowels
60s: Pattern recognition techniques used in US, Japan, Soviet Union
Two Developments in the 80s: DARPA funding for LVCSR; application of HMMs to speech recognition
AN HISTORICAL ASIDE
Recall the Decoding Task: given an HMM M, a hidden state sequence Q, and an observation sequence O, determine p(Q|O)
Recall the Learning Task: given O and Q, create M, where M consists of two matrices:
Priors: A = a11, ..., a1n, ..., an1, ..., ann, where aij = p(qj | qi)
Likelihoods: B, where each entry is p(oi | qi)
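A toy sketch of what M's two matrices amount to, with entirely hypothetical states, observations, and probabilities (real acoustic models use many states and continuous observation densities):

```python
# Priors A[qi][qj] = p(qj | qi); likelihoods B[qi][oi] = p(oi | qi).
# All states ("q1"...) and values are hypothetical toy numbers.

A = {  # transition priors; each row sums to 1
    "q1": {"q1": 0.6, "q2": 0.4},
    "q2": {"q2": 0.7, "q3": 0.3},
    "q3": {"q3": 1.0},
}
B = {  # observation likelihoods over toy discrete observations
    "q1": {"lo": 0.8, "hi": 0.2},
    "q2": {"lo": 0.3, "hi": 0.7},
    "q3": {"lo": 0.5, "hi": 0.5},
}

def joint_prob(states, obs, start="q1"):
    """p(Q, O): product of transition priors and observation likelihoods."""
    p = B[start][obs[0]]
    prev = start
    for q, o in zip(states[1:], obs[1:]):
        p *= A[prev][q] * B[q][o]
        prev = q
    return p

print(joint_prob(["q1", "q1", "q2"], ["lo", "lo", "hi"]))  # 0.10752
```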
But how do we get from the argmax formulation to our likelihoods and priors?
A SENTIMENTAL JOURNEY
Ŵ = argmax_{W ∈ L} P(W | O)
PARSON BAYES TO THE RESCUE
Author of: Divine Benevolence, or an Attempt to Prove That the Principal End of the Divine Providence and Government is the Happiness of His Creatures (1731)
BAYES RULE
Bayes' rule lets us transform:

Ŵ = argmax_{W ∈ L} P(W | O)

to:

Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)

In fact, since P(O) is the same for every candidate W, we need only:

Ŵ = argmax_{W ∈ L} P(O | W) P(W)

p(O|W): likelihoods, referred to as the acoustic model
p(W): priors, referred to as the language model
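The decomposition can be sketched as scoring each candidate transcription by the product of the two models. The candidate strings and every probability below are hypothetical toy values chosen only to show the mechanics:

```python
# Hypothetical scores for one acoustic input O: each candidate W gets
# (acoustic model p(O|W), language model p(W)). Toy numbers throughout.
candidates = {
    "it is easy to recognize speech":   (0.0020, 1e-6),
    "it is easy to wreck a nice beach": (0.0022, 1e-9),
}

def decode(cands):
    """argmax over W of p(O|W) * p(W)."""
    return max(cands, key=lambda w: cands[w][0] * cands[w][1])

print(decode(candidates))
```

Even though the second candidate fits the acoustics slightly better, the language model prior tips the decision to the first.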
LVCSR
ANOTHER VIEW
[Diagram of an LVCSR system: the signal passes through digital signal processing to a feature extractor, yielding a feature set (a representation of the acoustic signal); a Viterbi decoder combines the acoustic model p(O|W) and the language model p(W) to output symbols.]
Digitize the analog signal through sampling
Decide on a window size and perform an FFT
Output: the amount of energy at each frequency range: the spectrum
log(FFT) gives mel-scale values
Take the FFT of the previous value: the cepstrum
The cepstrum is a model of the vocal tract
Save 13 values
Compute the change in these 13 over the next window
Compute the change in the 13 deltas
Total: a 39-feature vector
CREATING FEATURE VECTORS: DSP
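A heavily simplified sketch of this pipeline in pure Python. Real front ends add pre-emphasis, overlapping windows, and mel filter banks; the window size, the naive DFT, and the DCT used for the cepstral step here are illustrative stand-ins:

```python
import math

def dft_mag(frame):
    """Spectrum: energy at each frequency bin (naive DFT, fine for a sketch)."""
    n = len(frame)
    return [abs(sum(x * complex(math.cos(2 * math.pi * k * t / n),
                                -math.sin(2 * math.pi * k * t / n))
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]

def cepstrum(frame, keep=13):
    """log of the spectrum, then a second transform (here a DCT); keep 13."""
    logspec = [math.log(m + 1e-10) for m in dft_mag(frame)]
    n = len(logspec)
    return [sum(s * math.cos(math.pi * k * (t + 0.5) / n)
                for t, s in enumerate(logspec))
            for k in range(keep)]

def delta(a, b):
    """Coefficient-wise change from one window to the next."""
    return [y - x for x, y in zip(a, b)]

# three successive toy windows of a sine wave (64 samples each)
frames = [[math.sin(2 * math.pi * 5 * (t + shift) / 64) for t in range(64)]
          for shift in (0, 8, 16)]
c0, c1, c2 = (cepstrum(f) for f in frames)
d01, d12 = delta(c0, c1), delta(c1, c2)
features = c0 + d01 + delta(d01, d12)  # 13 cepstra + 13 deltas + 13 delta-deltas
print(len(features))  # 39
```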
Computing the likelihood of feature vectors Given an HMM state The HMM state is a partial representation of a linguistic unit p(ot|qi)
But First: What Are These Speech Sounds?
LEFT OUT
FUNDAMENTAL THEORY OF PHONETICS
A spoken word is composed of smaller units of speech
Called phones
Def: A phone is a speech sound
Phones are represented by symbols:
IPA / ARPABET
ENGLISH VOWELS
HUMAN VOCAL ORGANS
CLOSE-UP
TYPES OF SOUND
Glottis: space between vocal folds. Glottis vibrates/doesn’t vibrate:
Voiced consonants like [b], [d], [g], [v], [z], and all English vowels
Unvoiced consonants like [p], [t], [k], [f], [s]
Sounds passing through nose: nasals [m], [n], [ng]
PHONE CLASSES
Consonants produced by restricting the airflow
Vowels unrestricted, usually voiced, and longer lasting
Semivowels, like [y]: voiced but shorter
CONSONANT: PLACE OF ARTICULATION
Labial: [b], [m]
Labiodental: [v], [f]
Dental: [th]
Alveolar: [s], [z], [t], [d]
Palatal: [sh], [ch], [zh] (Asian), [jh] (jar)
Velar: [k] (cuckoo), [g] (goose), [ng] (kingfisher)
CONSONANT: MANNER OF ARTICULATION
How the airflow is restricted:
Stop or plosive: [b], [d], [g]
Nasal: air passes into the nasal cavity: [n], [m], [ng]
Fricative: air flow is not cut off completely: [f], [v], [th], [dh], [s], [z]
Affricate: a stop followed by a fricative: [ch] (chicken), [jh] (giraffe)
Approximant: two articulators are close together but not close enough to cause turbulent air flow: [y], [w], [r], [l]
VOWELS
Characterized by height and backness:
High front: tongue raised toward the front: [iy] (lily)
High back: tongue raised toward the back: [uw] (tulip)
Low front: [ae] (bat)
Low back: [aa] (poppy)
ACOUSTIC PHONETICS
Based on the sine wave
f = cycles per second (frequency)
A = height of the wave (amplitude)
T = 1/f, the amount of time it takes one cycle to complete
SOUND WAVES
Plot the change in air pressure over time
Imagine an eardrum blocking air pressure waves
The graph measures the amount of compression and rarefaction
[iy] in “She just had a baby”
SHE JUST HAD A BABY
Notice the vowels, fricative [sh], and stop release [b]
FOURIER ANALYSIS
Every complex wave form can be represented as a sum of component sine waves
Two waveforms: 10 Hz and 100 Hz
SPECTRUM
Spectrum of a signal is a representation of each of its frequency components and their amplitudes.
Spectrum of the 10 + 100 Hz wave forms. Note the two spikes
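The two-spike claim can be checked numerically. With a 1-second window sampled at 500 Hz (toy values), DFT bin k corresponds directly to k Hz, so the two largest bins of the summed waveform should be 10 and 100:

```python
import math

rate, dur = 500, 1.0  # 1-second window, so bin k of the DFT is k Hz
n = int(rate * dur)
signal = [math.sin(2 * math.pi * 10 * t / rate) +
          math.sin(2 * math.pi * 100 * t / rate)
          for t in range(n)]

def dft_mag(x, k):
    """Magnitude of the k-th DFT component (naive sum, fine for a sketch)."""
    return abs(sum(v * complex(math.cos(2 * math.pi * k * t / len(x)),
                               -math.sin(2 * math.pi * k * t / len(x)))
                   for t, v in enumerate(x)))

mags = [dft_mag(signal, k) for k in range(n // 2)]
peaks = sorted(range(len(mags)), key=mags.__getitem__, reverse=True)[:2]
print(sorted(peaks))  # [10, 100]: one spike per component sine wave
```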
WAVE FORM FOR [AE] IN “HAD”
Note:
• 10 major waves, with 4 smaller waves within each of the 10 larger
• The frequency of the larger: 10 cycles / .0427 s ≈ 234 Hz
• The frequency of the smaller is about 4 times that, or ~930 Hz
Also:
• Some of the 930 Hz waves have two smaller waves within them
• F ≈ 2 × 930 = 1860 Hz
SPECTRUM FOR [AE]
Notice one of the peaks at just under 1000 Hz, and another at just under 2000 Hz
CONCLUSION
Spectral peaks that are visible in a spectrum are characteristic of different phones
Computing likelihood probability of vectors given a triphone:p(ot|qi)
Language model: p(W)
WHAT REMAINS
SPECTROGRAM
Representation of different frequencies that make up a waveform over time (spectrum was a single point in time)
x axis: time; y axis: frequency in Hz; darkness: amplitude
[Spectrogram segments for [ih], [ae], [ah]]
WE NEED A SEQUENCE CLASSIFIER
Observation Sequence in ASR Acoustic Feature Vectors
39 real-valued features Represents changes in energy in different frequency bands Each vector represents 10ms
Hidden states: words, for simple tasks like digit recognition or yes/no; otherwise phones or (usually) subphones
HMMS IN ACTION
SIX
Bakis Network: Left-Right HMM
Each aij is an entry in the priors matrix
Likelihood probabilities not shown:
• For each state there is a collection of likelihood observations
• Each observation (now a vector of 39 features) has a probability given the state
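A minimal Viterbi sketch over such a left-right network. The states and every probability are hypothetical toy values, not the real model for "six"; each state may only loop or move right, as in a Bakis network:

```python
import math

states = ["s", "ih", "k", "s2"]          # toy subword states
trans = {("s", "s"): 0.3, ("s", "ih"): 0.7,
         ("ih", "ih"): 0.4, ("ih", "k"): 0.6,
         ("k", "k"): 0.5, ("k", "s2"): 0.5,
         ("s2", "s2"): 1.0}              # each aij: loop or move right only

def toy_lik(q, o):
    """Stand-in for p(o|q); real systems score 39-feature vectors."""
    return 0.9 if o == q else 0.1 / 3

def viterbi(obs, lik):
    # v[t][q] = (log prob of best path ending in q at time t, backpointer)
    v = [{states[0]: (math.log(lik(states[0], obs[0])), None)}]
    for o in obs[1:]:
        col = {}
        for q in states:
            best = None
            for prev, (lp, _) in v[-1].items():
                a = trans.get((prev, q))
                if a is not None:
                    cand = lp + math.log(a) + math.log(lik(q, o))
                    if best is None or cand > best[0]:
                        best = (cand, prev)
            if best is not None:
                col[q] = best
        v.append(col)
    path = [max(v[-1], key=lambda q: v[-1][q][0])]  # best final state
    for t in range(len(v) - 1, 0, -1):              # follow backpointers
        path.append(v[t][path[-1]][1])
    return path[::-1]

print(viterbi(["s", "ih", "ih", "k", "s2"], toy_lik))
```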
BUT PHONES CHANGE OVER TIME
NECESSARY TO MODEL SUBPHONES
As before: a Bakis network (left-right HMM)
Each aij is an entry in the priors matrix
Likelihood probabilities not shown:
• For each state there is a collection of likelihood observations
• Each observation (now a vector of 39 features) has a probability given the state: p(ot|qi)
COARTICULATION
Notice the difference in the 2nd formant of [eh] in each context
Triphone: a phone together with its left context and right context
Notation: [y-eh+l] is [eh] preceded by [y] and followed by [l]
Suppose there are 50 phones in a language: 50³ = 125,000 possible triphones
Not all will appear in a corpus. English disallows [ae-eh+ow] and [m-j+t]. A WSJ study needed 55,000 triphones but found only 18,500
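The arithmetic behind the sparsity claim, using the figures above:

```python
# 50 phones, each with 50 possible left and 50 possible right contexts
phones = 50
triphones = phones ** 3
print(triphones)  # 125000

# WSJ figures from above: most needed triphones never occur in training data
needed, found = 55_000, 18_500
print(round(found / needed, 2))  # fraction of needed triphones observed
```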
SOLUTION
Lucky for us: different contexts sometimes have similar effects. Notice [w iy]/[r iy] and [m iy]/[n iy]
DATA SPARSITY
STATE TYING
Initial subphones of triphones with similar contexts (e.g., of [t-iy+n]) share acoustic representations (and likelihood probabilities)
How: Clustering algorithm
Problem 1: Which observation corresponds to which state? p(ot|qi): Likelihoods
Problem 2: What is the transition probability between states Priors
Hand-labeled: a training corpus of isolated words in a wav file; the start and stop time of each phone is marked by hand; observation likelihoods can then be computed by counting (like the ice cream example)
But this requires 400 hours to label one hour of speech, and humans are bad at labeling units smaller than a phone
Embedded training: a wav file + its corresponding transcription, a pronunciation lexicon, and a raw (untrained) HMM; Baum-Welch sums over all possible segmentations of words and phones
ACOUSTIC LIKELIHOOD/TRANSITION PROBABILITY COMPUTATION
THE RAW MATERIALS
THE LANGUAGE MODEL
Recall: [diagram of the LVCSR system: signal → digital signal processing → feature extractor → feature set (a representation of the acoustic signal); a Viterbi decoder combines the acoustic model p(O|W) and the language model p(W) to output symbols]
P(W)
<s> Alex wrote his book </s>
<s> James wrote a different book </s>
<s> Alex wrote a book suggested by Judith </s>
BIGRAM PROBABILITIES
P(Alex|<s>) = 2/3  P(wrote|Alex) = 2/2  P(a|wrote) = 2/3
P(book|a) = 1/2  P(</s>|book) = 2/3
P(Alex wrote a book) = P(Alex|<s>)P(wrote|Alex)P(a|wrote)P(book|a)P(</s>|book) = .148
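The bigram estimates and the final product can be reproduced by counting over the three training sentences above:

```python
from collections import Counter

corpus = [
    "<s> Alex wrote his book </s>",
    "<s> James wrote a different book </s>",
    "<s> Alex wrote a book suggested by Judith </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks[:-1])            # tokens that can start a bigram
    bigrams.update(zip(toks, toks[1:]))   # adjacent pairs

def p(w2, w1):
    """Maximum-likelihood bigram estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

prob = (p("Alex", "<s>") * p("wrote", "Alex") * p("a", "wrote")
        * p("book", "a") * p("</s>", "book"))
print(round(prob, 3))  # 0.148
```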
Independence Assumption
The HMM requires independence assumptions. Researchers are now experimenting with deep belief networks: stacks of Boltzmann machines
Non-global languages: the Google phenomenon
Detection of physical and emotional states: anger, frustration, sleepiness, intoxication, blame classification among married couples
ISSUES
LANGUAGE IS OUR DEFINING FEATURE
And We Haven’t Even Begun to Talk About Understanding