Lirong Xia Speech recognition, machine learning Friday, April 4, 2014.
Lirong Xia
Speech recognition, machine learning
Friday, April 4, 2014
Particle Filtering
• Sometimes |X| is too big to use exact inference
  – |X| may be too big to even store B(X)
  – E.g. X is continuous
  – |X|² may be too big to do updates
• Solution: approximate inference
  – Track samples of X, not all values
  – Samples are called particles
  – Time per step is linear in the number of samples
  – But: the number of samples needed may be large
  – In memory: a list of particles
• This is how robot localization works in practice
• Elapse of time: B'(X_t) = Σ_{x_{t-1}} p(X_t | x_{t-1}) B(x_{t-1})
• Observe: B(X_t) ∝ p(e_t | X_t) B'(X_t)
• Renormalize: make B(X_t) sum up to 1
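The elapse-time / observe / resample cycle above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; `transition_sample` and `emission_prob` are hypothetical helpers standing in for the model's p(X_t | x_{t-1}) and p(e_t | X_t).

```python
import random

def particle_filter_step(particles, evidence, transition_sample, emission_prob):
    """One time step of particle filtering.

    particles: list of sampled states x_{t-1}
    transition_sample(x): draws x_t ~ p(X_t | x_{t-1})   (assumed helper)
    emission_prob(e, x): returns p(e | x)                (assumed helper)
    """
    # Elapse of time: move each particle through the transition model
    moved = [transition_sample(x) for x in particles]
    # Observe: weight each particle by the evidence likelihood
    weights = [emission_prob(evidence, x) for x in moved]
    # Resample: draw N new particles in proportion to the weights
    # (this replaces explicit renormalization)
    return random.choices(moved, weights=weights, k=len(moved))
```

Time per step is linear in the number of particles, as the slide notes: one transition sample, one weight, and one resampling draw per particle.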
Forward algorithm vs. particle filtering

The particle filtering steps mirror the forward algorithm:
• Elapse of time: move each particle x to a successor x'
• Observe: weight each particle by w(x') = p(e_t | x')
• Resample: draw N new particles in proportion to their weights
Dynamic Bayes Nets (DBNs)
• We want to track multiple variables over time, using multiple sources of evidence
• Idea: repeat a fixed Bayes net structure at each time step
• Variables from time t can condition on those from t-1
• DBNs with evidence at leaves are HMMs
HMMs: MLE Queries
• HMMs defined by:
  – States X
  – Observations E
  – Initial distribution: p(X_1)
  – Transitions: p(X_t | X_{t-1})
  – Emissions: p(E | X)
• Query: most likely explanation:
  x*_{1:t} = arg max_{x_{1:t}} p(x_{1:t} | e_{1:t})
Viterbi Algorithm
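The Viterbi algorithm computes this most likely explanation with dynamic programming: keep, for each state, the probability of the best path ending there, then trace back. A minimal dictionary-based sketch (the data layout is an illustrative choice, not from the slides):

```python
def viterbi(states, init, trans, emit, evidence):
    """Most likely state sequence arg max_x p(x_{1:T} | e_{1:T}).

    init[x]: p(X_1 = x); trans[x][x2]: p(x2 | x); emit[x][e]: p(e | x).
    """
    # m[x]: probability of the best path ending in state x at the current step
    m = {x: init[x] * emit[x][evidence[0]] for x in states}
    back = []  # back-pointers, one dict per time step after the first
    for e in evidence[1:]:
        # Best predecessor for each state, using the previous step's m
        prev = {x: max(states, key=lambda xp: m[xp] * trans[xp][x])
                for x in states}
        m = {x: m[prev[x]] * trans[prev[x]][x] * emit[x][e] for x in states}
        back.append(prev)
    # Trace the best final state back through the stored pointers
    best = max(states, key=lambda x: m[x])
    path = [best]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))
```

For speech, the same recurrence runs over the word/phoneme state graph described later in the lecture, just with a much larger state space.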
Today
• Speech recognition
  – A massive HMM!
  – Details of this section not required
  – CSCI 4962: Natural language processing by Prof. Heng Ji
• Start to learn machine learning
  – CSCI 4100: Machine learning by Prof. Malik Magdon-Ismail
Speech and Language
• Speech technologies
  – Automatic speech recognition (ASR)
  – Text-to-speech synthesis (TTS)
  – Dialog systems
• Language processing technologies
  – Machine translation
  – Information extraction
  – Web search, question answering
  – Text classification, spam filtering, etc.
Digitizing Speech
The Input

• Speech input is an acoustic waveform

Graphs from Simon Arnfield's web tutorial on speech, Sheffield:
http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/
• Frequency gives pitch; amplitude gives volume
  – Sampling at ~8 kHz (phone), ~16 kHz (mic)
• Fourier transform of the wave is displayed as a spectrogram
  – Darkness indicates energy at each frequency
Acoustic Feature Sequence
• Time slices are translated into acoustic feature vectors (~39 real numbers per slice)
• These are the observations; now we need the hidden states X
State Space
• p(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
• p(X|X') encodes how sounds can be strung together
• We will have one state for each sound in each word
• From some state x, we can only:
  – Stay in the same state (e.g. speaking slowly)
  – Move to the next position in the word
  – At the end of the word, move to the start of the next word
• We build a little state graph for each word and chain them together to form our state space X
HMMs for Speech
Transitions with Bigrams

Decoding
• While there are some practical issues, finding the words given the acoustics is an HMM inference problem
• We want to know which state sequence x_{1:T} is most likely given the evidence e_{1:T}:
  x*_{1:T} = arg max_{x_{1:T}} p(x_{1:T} | e_{1:T}) = arg max_{x_{1:T}} p(x_{1:T}, e_{1:T})
• From the sequence x, we can simply read off the words
Machine Learning
• Up until now: how to reason in a model and how to make optimal decisions
• Machine learning: how to acquire a model on the basis of data / experience
  – Learning parameters (e.g. probabilities)
  – Learning structure (e.g. BN graphs)
  – Learning hidden concepts (e.g. clustering)
Parameter Estimation
• Estimating the distribution of a random variable
• Elicitation: ask a human (why is this hard?)
• Empirically: use training data (learning!)
  – E.g.: for each outcome x, look at the empirical rate of that value:
    p_ML(x) = count(x) / total samples
  – This is the estimate that maximizes the likelihood of the data:
    L(x, θ) = Π_i p_θ(x_i)
  – Example: p_ML(r) = 1/3
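The relative-frequency estimate above is a one-liner over counts. A minimal sketch (the function name is ours, not from the slides):

```python
from collections import Counter

def p_ml(samples):
    """Maximum likelihood estimate of a discrete distribution:
    the empirical relative frequency of each observed outcome."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}
```

For instance, if one of three drawn samples is `'r'`, the estimate gives p_ML(r) = 1/3, matching the slide's example.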
Estimation: Smoothing
• Relative frequencies are the maximum likelihood estimates (MLEs):
  p_ML(x) = count(x) / total samples
  θ_ML = arg max_θ p(X | θ) = arg max_θ Π_i p_θ(x_i)
• In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution:
  θ_MAP = arg max_θ p(θ | X)
        = arg max_θ p(X | θ) p(θ) / p(X)
        = arg max_θ p(X | θ) p(θ)
• ????
Estimation: Laplace Smoothing
• Laplace's estimate:
  – Pretend you saw every outcome once more than you actually did:
    p_LAP(x) = (c(x) + 1) / (N + |X|)
  – Can derive this as a MAP estimate with Dirichlet priors
Estimation: Laplace Smoothing
• Laplace's estimate (extended):
  – Pretend you saw every outcome k extra times:
    p_{LAP,k}(x) = (c(x) + k) / (N + k|X|)
  – What's Laplace with k = 0?
  – k is the strength of the prior
• Laplace for conditionals:
  – Smooth each condition independently:
    p_{LAP,k}(x|y) = (c(x,y) + k) / (c(y) + k|X|)
• Examples: p_{LAP,0}(X), p_{LAP,1}(X), p_{LAP,100}(X)
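The add-k formula translates directly to code. A minimal sketch (names are ours; the domain must be passed in so unseen outcomes also get their k pseudo-counts):

```python
from collections import Counter

def p_laplace(samples, domain, k=1):
    """Laplace (add-k) estimate: pretend every outcome in the domain
    was seen k extra times.  p(x) = (c(x) + k) / (N + k|X|)."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}
```

With k = 0 this reduces to the maximum likelihood estimate; larger k pulls the estimate toward the uniform distribution, which is why k acts as the strength of the prior.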
Example: Spam Filter
• Input: email
• Output: spam/ham
• Setup:
  – Get a large collection of example emails, each labeled "spam" or "ham"
  – Note: someone has to hand-label all this data!
  – Want to learn to predict labels of new, future emails
• Features: the attributes used to make the ham / spam decision
  – Words: FREE!
  – Text patterns: $dd, CAPS
  – Non-text: senderInContacts
  – …
Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
  – Get a large collection of example images, each labeled with a digit
  – Note: someone has to hand-label all this data!
  – Want to learn to predict labels of new, future digit images
• Features: the attributes used to make the digit decision
  – Pixels: (6,8) = ON
  – Shape patterns: NumComponents, AspectRatio, NumLoops
  – …
A Digit Recognizer
• Input: pixel grids
• Output: a digit 0-9
Naive Bayes for Digits
• Simple version:
  – One feature F_{ij} for each grid position <i,j>
  – Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  – Each input maps to a feature vector, e.g. <F_{0,0}=off, …, F_{i,j}=on, …>
  – Here: lots of features, each is binary valued
• Naive Bayes model:
  p(Y, F_{0,0}, …, F_{n,n}) = p(Y) Π_{i,j} p(F_{i,j} | Y)
• What do we need to learn?
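The thresholding step that turns a pixel grid into binary features can be sketched as follows. This is our illustration of the rule stated above, not code from the lecture; the image is assumed to be a grid of intensities in [0, 1].

```python
def extract_features(image, threshold=0.5):
    """One binary feature F_ij per grid position <i,j>:
    on iff the pixel intensity exceeds the threshold (0.5 per the slide)."""
    return {(i, j): pixel > threshold
            for i, row in enumerate(image)
            for j, pixel in enumerate(row)}
```

Each input image thus maps to one feature vector of on/off values, which are the evidence variables F_{ij} in the naive Bayes model.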
General Naive Bayes
• A general naive Bayes model:
  p(Y, F_1, …, F_n) = p(Y) Π_i p(F_i | Y)
• We only specify how each feature depends on the class
• Total number of parameters is linear in n
Inference for Naive Bayes
• Goal: compute posterior over causes
  – Step 1: get joint probability of causes and evidence
  – Step 2: get probability of evidence
  – Step 3: renormalize
General Naive Bayes
• What do we need in order to use naive Bayes?
  – Inference (you know this part)
    • Start with a bunch of conditionals, p(Y) and the p(F_i|Y) tables
    • Use standard inference to compute p(Y|F_1…F_n)
    • Nothing new here
  – Estimates of local conditional probability tables
    • p(Y), the prior over labels
    • p(F_i|Y) for each feature (evidence variable)
    • These probabilities are collectively called the parameters of the model and denoted by θ
    • Up until now, we assumed these appeared by magic, but…
    • …they typically come from training data: we'll look at this now
Examples: CPTs
30
1 0.1
2 0.1
3 0.1
4 0.1
5 0.1
6 0.1
7 0.1
8 0.1
9 0.1
0 0.1
p Y 3,1 |p F on Y
1 0.05
2 0.01
3 0.90
4 0.80
5 0.90
6 0.90
7 0.25
8 0.85
9 0.60
0 0.80
5,5 |p F on Y
1 0.01
2 0.05
3 0.05
4 0.30
5 0.80
6 0.90
7 0.05
8 0.60
9 0.50
0 0.80
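The three inference steps (joint, evidence, renormalize) can be run directly on CPTs like these. A minimal sketch, with the dictionary layout chosen for illustration:

```python
def naive_bayes_posterior(prior, likelihoods):
    """Posterior p(y | f_1..f_n) for a naive Bayes model.

    prior: dict y -> p(y)
    likelihoods: list of dicts, one per observed feature;
                 likelihoods[i][y] = p(f_i | y) at the observed value.
    """
    # Step 1: joint probability p(y, f_1..f_n) = p(y) * prod_i p(f_i | y)
    joint = {}
    for y, p in prior.items():
        for lik in likelihoods:
            p *= lik[y]
        joint[y] = p
    # Step 2: probability of evidence p(f_1..f_n) = sum_y over the joint
    z = sum(joint.values())
    # Step 3: renormalize so the posterior sums to 1
    return {y: j / z for y, j in joint.items()}
```

Using the uniform p(Y) and the two "on" columns tabulated above (both features observed on), the posterior peaks at digit 6, whose likelihood is 0.90 in both tables.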
Important Concepts
• Data: labeled instances, e.g. emails marked spam/ham
  – Training set
  – Held-out set
  – Test set
• Features: attribute-value pairs which characterize each x
• Experimentation cycle
  – Learn parameters (e.g. model probabilities) on the training set
  – (Tune hyperparameters on the held-out set)
  – Compute accuracy on the test set
  – Very important: never "peek" at the test set!
• Evaluation
  – Accuracy: fraction of instances predicted correctly
• Overfitting and generalization
  – Want a classifier which does well on test data
  – Overfitting: fitting the training data very closely, but not generalizing well
  – We'll investigate overfitting and generalization formally in a few lectures