Transcript of: Lirong Xia, Speech recognition, machine learning, Friday, April 4, 2014 (31 slides).

Page 1:

Lirong Xia

Speech recognition, machine learning

Friday, April 4, 2014

Page 2:

Particle Filtering


• Sometimes |X| is too big to use exact inference
  – |X| may be too big to even store B(X)
  – E.g. X is continuous
  – |X|² may be too big to do updates

• Solution: approximate inference
  – Track samples of X, not all values
  – Samples are called particles
  – Time per step is linear in the number of samples
  – But: number needed may be large
  – In memory: list of particles

• This is how robot localization works in practice

Page 3:

Forward algorithm vs. particle filtering

Forward algorithm:
• Elapse of time: $B'(X_t) = \sum_{x_{t-1}} p(X_t \mid x_{t-1})\, B(x_{t-1})$
• Observe: $B(X_t) \propto p(e_t \mid X_t)\, B'(X_t)$
• Renormalize: make $B(x_t)$ sum up to 1

Particle filtering:
• Elapse of time: each particle x moves to a successor x'
• Observe: weight each particle, $w(x') = p(e_t \mid x')$
• Resample: resample N particles in proportion to the weights (a code sketch follows)
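As a rough illustration of the elapse/observe/resample loop (a sketch, not code from the lecture), here is one particle filtering step in Python; transition_sample and emission_prob are assumed stand-ins for the model's p(X_t | x_{t-1}) and p(e_t | X_t):

```python
import random

def particle_filter_step(particles, evidence, transition_sample, emission_prob):
    """One particle filtering step: elapse time, weight by the observation, resample.

    particles:         list of sampled states (the particles for X_{t-1})
    evidence:          the observation e_t
    transition_sample: function x -> one sample x' from p(X_t | X_{t-1} = x)
    emission_prob:     function (e, x') -> p(e | x')
    """
    # Elapse of time: push each particle through the transition model
    moved = [transition_sample(x) for x in particles]
    # Observe: weight each particle by how well it explains the evidence
    weights = [emission_prob(evidence, x) for x in moved]
    # Resample: draw N new particles in proportion to the weights
    return random.choices(moved, weights=weights, k=len(particles))
```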

Page 4:

Dynamic Bayes Nets (DBNs)


• We want to track multiple variables over time, using multiple sources of evidence

• Idea: repeat a fixed Bayes net structure at each time
• Variables from time t can condition on those from t-1

• DBNs with evidence at leaves are HMMs

Page 5:

HMMs: MLE Queries


• HMMs defined by:
  – States X
  – Observations E
  – Initial distribution: p(X_1)
  – Transitions: p(X_t | X_{t-1})
  – Emissions: p(E | X)

• Query: most likely explanation:

  $\arg\max_{x_{1:t}} p(x_{1:t} \mid e_{1:t})$

Page 6:

Viterbi Algorithm

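As a hedged sketch of the Viterbi algorithm for the most-likely-explanation query above (not the slides' own code; the dictionary-based model tables are assumptions of this sketch):

```python
def viterbi(states, init, trans, emit, observations):
    """Most likely state sequence x_{1:T} for observations e_{1:T} (max-product recursion).

    init:  dict x -> p(X_1 = x)
    trans: dict (x_prev, x) -> p(X_t = x | X_{t-1} = x_prev)
    emit:  dict (x, e) -> p(E_t = e | X_t = x)
    """
    # m[t][x]: probability of the best state sequence ending in x after t+1 observations
    m = [{x: init[x] * emit[(x, observations[0])] for x in states}]
    back = []
    for e in observations[1:]:
        scores, pointers = {}, {}
        for x in states:
            best_prev = max(states, key=lambda xp: m[-1][xp] * trans[(xp, x)])
            pointers[x] = best_prev
            scores[x] = m[-1][best_prev] * trans[(best_prev, x)] * emit[(x, e)]
        m.append(scores)
        back.append(pointers)
    # Recover the best sequence by following back-pointers from the best final state
    last = max(states, key=lambda x: m[-1][x])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```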

Page 7:

Today


• Speech recognition
  – A massive HMM!
  – Details of this section not required
  – CSCI 4962: Natural language processing by Prof. Heng Ji

• Start to learn machine learning
  – CSCI 4100: Machine learning by Prof. Malik Magdon-Ismail

Page 8:

Speech and Language


• Speech technologies
  – Automatic speech recognition (ASR)
  – Text-to-speech synthesis (TTS)
  – Dialog systems

• Language processing technologies
  – Machine translation
  – Information extraction
  – Web search, question answering
  – Text classification, spam filtering, etc.

Page 9:

Digitizing Speech


Page 10:

The Input


• Speech input is an acoustic waveform

Graphs from Simon Arnfield's web tutorial on speech (Sheffield): http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/

Page 11:


Page 12:

The Input


• Frequency gives pitch; amplitude gives volume
  – Sampling at ~8 kHz (phone), ~16 kHz (mic)

• Fourier transform of the wave is displayed as a spectrogram
  – Darkness indicates energy at each frequency (see the sketch below)
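A minimal sketch (not from the slides) of how a spectrogram can be computed from sampled audio with numpy; the 25 ms frame length and 10 ms hop at 16 kHz are illustrative choices:

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude short-time Fourier transform of a 1-D signal array:
    one column of frequency energies per time slice."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # energy at each frequency bin
    return np.array(frames).T  # rows: frequency bins, columns: time slices
```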

Page 13:

Acoustic Feature Sequence


• Time slices are translated into acoustic feature vectors (~39 real numbers per slice)

• These are the observations; now we need the hidden states X

Page 14:

State Space


• p(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)

• p(X|X') encodes how sounds can be strung together
• We will have one state for each sound in each word
• From some state x, can only (sketched in code after this list):
  – Stay in the same state (e.g. speaking slowly)
  – Move to the next position in the word
  – At the end of the word, move to the start of the next word

• We build a little state graph for each word and chain them together to form our state space X
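A rough sketch (not from the slides) of the little per-word state graph just described; the phoneme list, probabilities, and function name are illustrative assumptions:

```python
def word_state_graph(word, phonemes, p_self=0.5):
    """Little state graph for one word: state (word, i) is the i-th sound of the word.

    From each state you can stay put (self-loop) or advance to the next position;
    at the last position the remaining probability mass is reserved for moving to
    the start of the next word (filled in when word graphs are chained together,
    e.g. using bigram transition probabilities between words).
    """
    transitions = {}
    for i in range(len(phonemes)):
        state = (word, i)
        if i + 1 < len(phonemes):
            transitions[state] = {state: p_self, (word, i + 1): 1.0 - p_self}
        else:
            transitions[state] = {state: p_self}  # 1 - p_self goes to next-word start states
    return transitions

# e.g. a toy three-sound word
graph = word_state_graph("nine", ["n", "ay", "n"])
```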

Page 15:

HMMs for Speech


Page 16:

Transitions with Bigrams


Page 17:

Decoding


• While there are some practical issues, finding the words given the acoustics is an HMM inference problem

• We want to know which state sequence x_{1:T} is most likely given the evidence e_{1:T}:

  $x^*_{1:T} = \arg\max_{x_{1:T}} p(x_{1:T} \mid e_{1:T}) = \arg\max_{x_{1:T}} p(x_{1:T}, e_{1:T})$

• From the sequence x, we can simply read off the words

Page 18:

Machine Learning


• Up until now: how to reason in a model and how to make optimal decisions

• Machine learning: how to acquire a model on the basis of data / experience
  – Learning parameters (e.g. probabilities)
  – Learning structure (e.g. BN graphs)
  – Learning hidden concepts (e.g. clustering)

Page 19:

Parameter Estimation


• Estimating the distribution of a random variable
• Elicitation: ask a human (why is this hard?)
• Empirically: use training data (learning!)

  – E.g.: for each outcome x, look at the empirical rate of that value:

    $p_{ML}(x) = \frac{\text{count}(x)}{\text{total samples}}$,   e.g. $p_{ML}(r) = 1/3$

  – This is the estimate that maximizes the likelihood of the data (a code sketch follows):

    $L(x, \theta) = \prod_i p_\theta(x_i)$
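A minimal sketch of the relative-frequency (maximum likelihood) estimate; the toy sample is only illustrative of how an estimate like p_ML(r) = 1/3 arises:

```python
from collections import Counter

def ml_estimate(samples):
    """Maximum likelihood estimate of a discrete distribution: empirical rates."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}

# e.g. one "r" and two "b" draws give p_ML(r) = 1/3
print(ml_estimate(["r", "b", "b"]))  # {'r': 0.333..., 'b': 0.666...}
```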

Page 20:

Estimation: Smoothing


• Relative frequencies are the maximum likelihood estimates (MLEs):

  $p_{ML}(x) = \frac{\text{count}(x)}{\text{total samples}}$

  $\theta_{ML} = \arg\max_\theta p(X \mid \theta) = \arg\max_\theta \prod_i p(x_i \mid \theta)$

• In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution:

  $\theta_{MAP} = \arg\max_\theta p(\theta \mid X) = \arg\max_\theta \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \arg\max_\theta p(X \mid \theta)\, p(\theta)$

Page 21:

Estimation: Laplace Smoothing


• Laplace's estimate:
  – Pretend you saw every outcome once more than you actually did:

    $p_{LAP}(x) = \frac{c(x) + 1}{N + |X|}$

  – Can derive this as a MAP estimate with Dirichlet priors

  (slide example: $p_{ML}(X)$ vs. $p_{LAP}(X)$)

Page 22:

Estimation: Laplace Smoothing


• Laplace's estimate (extended):
  – Pretend you saw every outcome k extra times
  – What's Laplace with k = 0?
  – k is the strength of the prior

• Laplace for conditionals:
  – Smooth each condition independently:

  $p_{LAP,k}(x) = \frac{c(x) + k}{N + k\,|X|}$

  $p_{LAP,k}(x \mid y) = \frac{c(x, y) + k}{c(y) + k\,|X|}$   (see the sketch below)

  (slide example: $p_{LAP,0}(X)$, $p_{LAP,1}(X)$, $p_{LAP,100}(X)$)
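A minimal sketch of the Laplace-smoothed estimate with strength k, assuming the full outcome set is known; the names and toy data are illustrative:

```python
from collections import Counter

def laplace_estimate(samples, outcomes, k=1):
    """Laplace-smoothed estimate: pretend every outcome was seen k extra times."""
    counts = Counter(samples)
    n = len(samples)
    denom = n + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

# k = 0 recovers the maximum likelihood estimate; larger k pulls toward uniform
print(laplace_estimate(["r", "b", "b"], outcomes=["r", "b"], k=1))  # {'r': 0.4, 'b': 0.6}
```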

Page 23:

Example: Spam Filter


• Input: email
• Output: spam/ham
• Setup:
  – Get a large collection of example emails, each labeled "spam" or "ham"
  – Note: someone has to hand label all this data!
  – Want to learn to predict labels of new, future emails

• Features: the attributes used to make the ham / spam decision
  – Words: FREE!
  – Text patterns: $dd, CAPS
  – Non-text: senderInContacts
  – …

Page 24:

Example: Digit Recognition


• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
  – Get a large collection of example images, each labeled with a digit
  – Note: someone has to hand label all this data!
  – Want to learn to predict labels of new, future digit images

• Features: the attributes used to make the digit decision
  – Pixels: (6,8) = ON
  – Shape patterns: NumComponents, AspectRatio, NumLoops
  – …

Page 25:

A Digit Recognizer


• Input: pixel grids

• Output: a digit 0-9

Page 26:

Naive Bayes for Digits


• Simple version:
  – One feature F_{i,j} for each grid position <i,j>
  – Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  – Each input maps to a feature vector, e.g.
  – Here: lots of features, each is binary valued

• Naive Bayes model:

• What do we need to learn?

Page 27:

General Naive Bayes


• A general naive Bayes model:

  $p(Y, F_1, \ldots, F_n) = p(Y) \prod_{i=1}^{n} p(F_i \mid Y)$

• We only specify how each feature depends on the class

• Total number of parameters is linear in n

Page 28:

Inference for Naive Bayes


• Goal: compute posterior over causes (a code sketch follows this list)
  – Step 1: get joint probability of causes and evidence
  – Step 2: get probability of evidence
  – Step 3: renormalize
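A minimal sketch of these three steps for a naive Bayes model with binary features; the dictionary layout of the tables is an assumption of the sketch, not the slides' notation:

```python
def nb_posterior(prior, cond, evidence):
    """Posterior p(Y | f_1, ..., f_n) under a naive Bayes model with binary features.

    prior:    dict label -> p(Y = label)
    cond:     dict (feature, label) -> p(F = on | Y = label)
    evidence: dict feature -> True/False (observed feature values)
    """
    # Step 1: joint probability of each cause (label) with the evidence
    joint = {}
    for y, p_y in prior.items():
        p = p_y
        for f, on in evidence.items():
            p_on = cond[(f, y)]
            p *= p_on if on else (1.0 - p_on)
        joint[y] = p
    # Step 2: probability of the evidence = sum of the joint over causes
    p_evidence = sum(joint.values())
    # Step 3: renormalize to get the posterior
    return {y: p / p_evidence for y, p in joint.items()}
```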

Page 29:

General Naive Bayes


• What do we need in order to use naive Bayes?
  – Inference (you know this part)
    • Start with a bunch of conditionals, p(Y) and the p(F_i|Y) tables
    • Use standard inference to compute p(Y|F_1…F_n)
    • Nothing new here
  – Estimates of local conditional probability tables
    • p(Y), the prior over labels
    • p(F_i|Y) for each feature (evidence variable)
    • These probabilities are collectively called the parameters of the model and denoted by θ
    • Up until now, we assumed these appeared by magic, but…
    • …they typically come from training data: we'll look at this now (a sketch follows)
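A rough sketch of estimating those tables from labeled training data by relative frequencies; the data layout is assumed, and the Laplace smoothing from the earlier slide could be dropped in:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate p(Y) and p(F = on | Y) by relative frequencies from labeled data.

    examples: list of (label, features) pairs, features mapping feature name -> True/False.
    """
    label_counts = Counter(label for label, _ in examples)
    on_counts = defaultdict(Counter)          # label -> how often each feature was "on"
    feature_names = set()
    for label, features in examples:
        feature_names.update(features)
        for f, on in features.items():
            if on:
                on_counts[label][f] += 1

    prior = {y: c / len(examples) for y, c in label_counts.items()}
    cond = {(f, y): on_counts[y][f] / label_counts[y]
            for y in label_counts for f in feature_names}
    return prior, cond
```

The resulting prior and cond tables plug directly into the inference sketch on the previous slide.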

Page 30:

Examples: CPTs


p(Y):
  1: 0.1   2: 0.1   3: 0.1   4: 0.1   5: 0.1
  6: 0.1   7: 0.1   8: 0.1   9: 0.1   0: 0.1

p(F_{3,1} = on | Y):
  1: 0.05  2: 0.01  3: 0.90  4: 0.80  5: 0.90
  6: 0.90  7: 0.25  8: 0.85  9: 0.60  0: 0.80

p(F_{5,5} = on | Y):
  1: 0.01  2: 0.05  3: 0.05  4: 0.30  5: 0.80
  6: 0.90  7: 0.05  8: 0.60  9: 0.50  0: 0.80

Page 31:

Important Concepts


• Data: labeled instances, e.g. emails marked spam/ham
  – Training set
  – Held-out set
  – Test set

• Features: attribute-value pairs which characterize each x

• Experimentation cycle (sketched in code below)
  – Learn parameters (e.g. model probabilities) on the training set
  – (Tune hyperparameters on the held-out set)
  – Compute accuracy on the test set
  – Very important: never "peek" at the test set!

• Evaluation
  – Accuracy: fraction of instances predicted correctly

• Overfitting and generalization
  – Want a classifier which does well on test data
  – Overfitting: fitting the training data very closely, but not generalizing well
  – We'll investigate overfitting and generalization formally in a few lectures
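As an illustrative sketch of this experimentation cycle (the split fractions and function names are assumptions, not from the slides):

```python
import random

def split_data(examples, train_frac=0.8, heldout_frac=0.1, seed=0):
    """Shuffle labeled examples and split them into training, held-out, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_heldout = int(heldout_frac * len(shuffled))
    train = shuffled[:n_train]
    heldout = shuffled[n_train:n_train + n_heldout]
    test = shuffled[n_train + n_heldout:]   # never "peek" at this while developing
    return train, heldout, test

def accuracy(predict, labeled):
    """Fraction of (x, y) instances predicted correctly."""
    return sum(predict(x) == y for x, y in labeled) / len(labeled)
```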