16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression...

Midterm Review Instructor: Wei Xu


Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent

Classification


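Text Classification

[Figure: a test document containing "parser, language, label, translation, …" must be assigned to one of the classes Machine Learning, NLP, Garbage Collection, Planning, GUI. Example training words shown in the figure: "learning, training, algorithm, shrinkage, network, ..."; "parser, tag, training, translation, language, ..."; "garbage, collection, memory, optimization, region, ..."; "planning, temporal, reasoning, plan, language, ...". Which class does the test document belong to?]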


Classification Methods: Supervised Machine Learning

• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
  – a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

• Output:
  – a learned classifier γ: d → c


Naïve Bayes Classifier

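MAP is “maximum a posteriori” = most likely class:

$$c_{MAP} = \arg\max_{c \in C} P(c \mid d)$$

Bayes Rule:

$$= \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$$

Dropping the denominator:

$$= \arg\max_{c \in C} P(d \mid c)\,P(c)$$

Representing the document d by its features x1, x2, …, xn:

$$= \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

Bag of words and conditional independence assumption:

$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$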


Multinomial Naïve Bayes: Learning

• maximum likelihood estimates – simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$

• Laplace (add-1) smoothing to avoid zero probabilities

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\left(\sum_{w \in V} \mathrm{count}(w, c)\right) + |V|}$$

• Unknown words (words that never occur in the training set) can simply be ignored.
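The estimates above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function names `train_nb` and `predict_nb` and the toy documents are assumptions made for the example. It works in log space to avoid underflow and skips unknown words at test time.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label) pairs (hypothetical format)."""
    class_docs = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    for words, c in docs:
        word_counts[c].update(words)
    vocab = {w for words, _ in docs for w in words}
    # Prior: fraction of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    # Likelihood with add-1 smoothing: (count + 1) / (total + |V|).
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(model, words):
    log_prior, log_lik, vocab = model
    # Unknown words (not in the training vocabulary) are ignored.
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in words if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

docs = [("Chinese Beijing Chinese".split(), "c"),
        ("Chinese Chinese Shanghai".split(), "c"),
        ("Chinese Macao".split(), "c"),
        ("Tokyo Japan Chinese".split(), "j")]
model = train_nb(docs)
print(predict_nb(model, "Chinese Chinese Chinese Tokyo Japan".split()))
```

With these four toy documents, the smoothed estimate of P(Chinese | c) is (5 + 1) / (8 + 6) = 3/7.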


Weaknesses of Naïve Bayes

• Assuming conditional independence

• Correlated features -> double counting evidence – Parameters are estimated independently

• This can hurt classifier accuracy and calibration


Logistic Regression

• Doesn’t assume conditional independence of features
  – Better calibrated probabilities
  – Can handle highly correlated, overlapping features


Logistic vs. Softmax Regression

Generalization to Multi-class Problems

• Logistic regression: one estimated probability, good for binary classification
• Softmax regression: one estimated probability for each class; predict the class with the highest probability


Maximum Entropy Models (MaxEnt)

• a.k.a. logistic regression, multinomial logistic regression, multiclass logistic regression, or softmax regression

• belongs to the family of classifiers known as log-linear classifiers

• Math proof of “LR = MaxEnt”:
  – [Klein and Manning 2003]
  – [Mount 2011] http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf


softmax function: converts the class scores into probabilities in [0, 1]
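As a small illustration of the softmax function, the sketch below exponentiates each score and divides by the sum, so the outputs lie in [0, 1] and add up to 1. The function name and the max-subtraction trick for numerical stability are choices made for this example, not taken from the slides.

```python
import math

def softmax(scores):
    """Map a list of real-valued class scores to a probability distribution."""
    m = max(scores)                      # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                        # normalizing constant
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The predicted class is the index with the highest probability, which is also the index with the highest raw score.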


The log-likelihood objective is a concave function!


Gradient Ascent


Gradient Ascent

Loop while not converged:
    For all features j, compute and add derivatives

L(w): Training set log-likelihood (objective function)

∇L(w): Gradient vector
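The loop above can be sketched for binary logistic regression as follows. This is only an illustration under stated assumptions: labels are 0/1, the first feature is a constant bias term, and the learning rate `eta`, tolerance `tol`, and toy data are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def log_likelihood(w, data):
    """Training set log-likelihood for labels y in {0, 1}."""
    return sum(math.log(sigmoid(dot(w, x)) if y == 1 else 1.0 - sigmoid(dot(w, x)))
               for x, y in data)

def gradient(w, data):
    """Gradient vector: one partial derivative per feature j."""
    g = [0.0] * len(w)
    for x, y in data:
        err = y - sigmoid(dot(w, x))
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g

def gradient_ascent(data, n_features, eta=0.1, tol=1e-6, max_iters=10000):
    w = [0.0] * n_features
    for _ in range(max_iters):
        g = gradient(w, data)
        # Add the derivatives: step uphill on the log-likelihood.
        w = [wj + eta * gj for wj, gj in zip(w, g)]
        if max(abs(gj) for gj in g) < tol:   # converged
            break
    return w

# Toy, non-separable data: (bias, x) -> label
data = [([1, 0], 0), ([1, 1], 0), ([1, 2], 1), ([1, 1], 1), ([1, 3], 1), ([1, 2], 0)]
w = gradient_ascent(data, n_features=2)
print(w, log_likelihood(w, data))
```

Because the objective is concave, a small enough step size moves toward the single global maximum; the step size still has to be chosen by hand.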


LR Gradient

logistic function
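For reference, the gradient for binary logistic regression can be written out as below. The notation is an assumption for this sketch (labels y_i in {0, 1}, feature vectors x_i, σ the logistic function) and may differ from the symbols used on the original slide.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(w \cdot x)

L(w) = \sum_{i} \Big[ y_i \log \sigma(w \cdot x_i) + (1 - y_i) \log\big(1 - \sigma(w \cdot x_i)\big) \Big]

\frac{\partial L(w)}{\partial w_j} = \sum_{i} \big( y_i - \sigma(w \cdot x_i) \big)\, x_{ij}
```

The last line follows from σ'(z) = σ(z)(1 − σ(z)): each example contributes its prediction error times its j-th feature value.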


Perceptron vs. Logistic Regression

[Figure: a single unit combining inputs (features) with weights (parameters); the two models differ in their choices of activation function]


Perceptron Algorithm
• Very similar to logistic regression
• Not exactly computing gradient

Initialize weight vector w = 0
Loop for K iterations
    Loop for all training examples x_i
        if sign(w * x_i) != y_i
            w += y_i * x_i

Error-driven!
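A runnable sketch of the binary perceptron is given below. It assumes labels in {-1, +1} and a constant bias feature, and it treats a score of exactly 0 as a mistake (a common convention, chosen here because sign(0) is ambiguous); the function names and toy data are hypothetical.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_perceptron(data, n_features, K=20):
    """data: list of (feature_list, label) with label in {-1, +1}."""
    w = [0.0] * n_features
    for _ in range(K):
        for x, y in data:
            # Error-driven: update only when the example is misclassified.
            if y * dot(w, x) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]
    return w

# Toy linearly separable data: logical AND with a bias feature in position 0.
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = train_perceptron(data, n_features=3)
print(w, [1 if dot(w, x) > 0 else -1 for x, _ in data])
```

The only setting to choose is the number of passes K; there is no learning rate.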


MultiClass Perceptron Algorithm

Initialize weight vector w = 0
Loop for K iterations
    Loop for all training examples x_i
        y_pred = argmax_y w_y * x_i
        if y_pred != y_i
            w_y_gold += x_i    (increase score for right answer)
            w_y_pred -= x_i    (decrease score for wrong answer)
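The multiclass update can be sketched in Python as below, keeping one weight vector per class. The function names, the tie-breaking rule (first class in the list wins), and the toy data are assumptions made for this example.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def predict(w, classes, x):
    # argmax over classes of w_y * x; ties go to the earliest class in the list.
    return max(classes, key=lambda y: dot(w[y], x))

def train_multiclass_perceptron(data, classes, n_features, K=10):
    w = {y: [0.0] * n_features for y in classes}
    for _ in range(K):
        for x, y_gold in data:
            y_pred = predict(w, classes, x)
            if y_pred != y_gold:
                # Increase the score for the right answer ...
                w[y_gold] = [wj + xj for wj, xj in zip(w[y_gold], x)]
                # ... and decrease the score for the wrong answer.
                w[y_pred] = [wj - xj for wj, xj in zip(w[y_pred], x)]
    return w

classes = ["a", "b", "c"]
data = [([1, 0, 0], "a"), ([0, 1, 0], "b"), ([0, 0, 1], "c")]
w = train_multiclass_perceptron(data, classes, n_features=3)
print([predict(w, classes, x) for x, _ in data])
```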


Perceptron vs. Logistic Regression

• Only hyperparameter of perceptron is maximum number of iterations (LR also needs learning rate)

• Perceptron is guaranteed to converge if the data is linearly separable (LR always converges)


Perceptron vs. Logistic Regression
• The Perceptron is an online learning algorithm.
• Logistic Regression is not:

this update is effectively the same as “w += y_i * x_i”


Multinomial LR Gradient

gradient = empirical feature count − expected feature count
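Written out, the two terms look as follows. The feature-function notation f_j(x, y) is an assumption for this sketch and may not match the symbols on the original slide.

```latex
P(y \mid x) = \frac{\exp\big(w \cdot f(x, y)\big)}{\sum_{y'} \exp\big(w \cdot f(x, y')\big)}

\frac{\partial L(w)}{\partial w_j}
  = \underbrace{\sum_{i} f_j(x_i, y_i)}_{\text{empirical feature count}}
  \; - \;
  \underbrace{\sum_{i} \sum_{y} P(y \mid x_i)\, f_j(x_i, y)}_{\text{expected feature count}}
```

At the maximum the gradient is zero, so the model's expected feature counts match the counts observed in the training data.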


MAP-based learning (Perceptron)

Maximum A Posteriori: approximate (≈) using maximization


Markov Assumption, Perplexity, Interpolation, Backoff, and Kneser-Ney Smoothing

Language Modeling


Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5 … wn)

• Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4)

• A model that computes either of these, P(W) or P(wn | w1, w2 … wn-1), is called a language model or LM


The Chain Rule applied to compute joint probability of words in sentence

Markov Assumption


Bigram model
• Condition on the previous word:
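As a reminder of how the pieces fit together, the chain rule and the bigram (first-order Markov) approximation can be written as below. This is the standard formulation; the exact notation on the original slides may differ.

```latex
% Chain rule: exact decomposition of the joint probability
P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})

% Markov assumption (bigram): condition only on the previous word
P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})

% Resulting bigram model
P(w_1 w_2 \ldots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```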


Add-one estimation

• Also called Laplace smoothing

• Pretend we saw each word one more time than we did
  • Just add one to all the counts!

• MLE estimate:

$$P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

• Add-1 estimate:

$$P_{Add\text{-}1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
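The two estimates can be compared with a short Python sketch. The function names, the `<s>` / `</s>` boundary markers, and the toy corpus are assumptions for this example; V is taken to be the number of word types that can be predicted, including `</s>`.

```python
from collections import Counter

def bigram_counts(sentences):
    """sentences: list of token lists. Returns (context counts, bigram counts, vocab)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        contexts.update(tokens[:-1])               # c(w_{i-1})
        bigrams.update(zip(tokens, tokens[1:]))    # c(w_{i-1}, w_i)
        vocab.update(tokens[1:])                   # predictable word types
    return contexts, bigrams, vocab

def p_mle(prev, word, contexts, bigrams):
    return bigrams[(prev, word)] / contexts[prev]

def p_add1(prev, word, contexts, bigrams, vocab):
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

corpus = ["I am Sam".split(), "Sam I am".split(),
          "I do not like green eggs and ham".split()]
contexts, bigrams, vocab = bigram_counts(corpus)
print(p_mle("<s>", "I", contexts, bigrams))            # 2/3
print(p_add1("<s>", "I", contexts, bigrams, vocab))    # (2 + 1) / (3 + 11)
print(p_add1("<s>", "ham", contexts, bigrams, vocab))  # unseen bigram, nonzero
```

Note how much probability mass moves from the seen bigram to unseen ones (2/3 drops to 3/14), which is why add-1 is considered too blunt for N-gram models.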


Add-1 estimation is a blunt instrument

• So add-1 isn’t used for N-grams:
  • We’ll see better methods

• But add-1 is used to smooth other NLP models
  • For text classification
  • In domains where the number of zeros isn’t so huge.


Better Language Models

• Linear interpolation
• Backoff
• Absolute Discounting
• Kneser-Ney Smoothing


Perplexity

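The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, “normalized” by the number of words:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

Chain Rule:

$$PP(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})} \right)^{\frac{1}{N}}$$

for bigram:

$$PP(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})} \right)^{\frac{1}{N}}$$

Minimizing perplexity is the same as maximizing probability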
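A minimal sketch of the perplexity computation is shown below. It assumes the per-word conditional probabilities have already been produced by some language model; the function name and input format are hypothetical. The product is computed in log space to avoid underflow on long test sets.

```python
import math

def perplexity(word_probs):
    """word_probs: P(w_i | history) for each of the N words in the test set."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)   # log P(w_1 ... w_N)
    return math.exp(-log_prob / n)                    # P(W) ** (-1 / N)

# A model that assigns every word probability 1/10 has perplexity 10.
print(perplexity([0.1] * 5))
# Higher probabilities on the test words give lower perplexity.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```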


Hidden Markov Models, Maximum Entropy Markov Models (Log-linear Models for Tagging), and Viterbi Algorithm

Tagging


Tagging (Sequence Labeling)

• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Many NLP problems can be viewed as sequence labeling:
  - POS Tagging
  - Chunking
  - Named Entity Tagging
• Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors

Example: Plays/VBZ well/RB with/IN others/NNS


Emission parameters

Trigram parameters


Andrew Viterbi, 1967


running time for trigram HMM is O(n|S|^3)

running time for bigram HMM is O(n|S|^2)
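The O(n|S|^2) cost for a bigram HMM is visible in the loop structure of a Viterbi sketch: n positions, and for each of the |S| current tags a max over |S| previous tags. The code below is an illustration under stated assumptions: the dictionaries `q` (transitions) and `e` (emissions), the start symbol `*`, and the toy probabilities are made up, and the final STOP transition is omitted for brevity.

```python
def viterbi(words, tags, q, e, start="*"):
    """q[(prev_tag, tag)]: transition prob; e[(tag, word)]: emission prob."""
    # pi[i][s]: max probability of any tag sequence for words[:i+1] ending in s.
    pi = [{s: q.get((start, s), 0.0) * e.get((s, words[0]), 0.0) for s in tags}]
    back = [{}]
    for i in range(1, len(words)):                  # n positions
        pi.append({})
        back.append({})
        for s in tags:                              # |S| current tags
            # |S| previous tags: best way to arrive at tag s
            best = max(tags, key=lambda sp: pi[i - 1][sp] * q.get((sp, s), 0.0))
            pi[i][s] = (pi[i - 1][best] * q.get((best, s), 0.0)
                        * e.get((s, words[i]), 0.0))
            back[i][s] = best
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda s: pi[-1][s])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), pi[-1][last]

tags = ["N", "V"]
q = {("*", "N"): 0.8, ("*", "V"): 0.2, ("N", "N"): 0.3,
     ("N", "V"): 0.7, ("V", "N"): 0.6, ("V", "V"): 0.4}
e = {("N", "fish"): 0.6, ("V", "fish"): 0.3, ("N", "sleep"): 0.1, ("V", "sleep"): 0.5}
print(viterbi(["fish", "sleep"], tags, q, e))
```

For a trigram HMM the table is indexed by the last two tags, which adds one more factor of |S| to the running time.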


Decoding

§ Linear Perceptron

$$s^* = \arg\max_{s} \; w \cdot \Phi(x, s)$$

§ Features must be local, for x = x1…xm, and s = s1…sm

$$\Phi(x, s) = \sum_{j=1}^{m} \phi(x, j, s_{j-1}, s_j)$$

§ Define π(i, s_i) to be the max score of a sequence of length i ending in tag s_i

$$\pi(i, s_i) = \max_{s_{i-1}} \; w \cdot \phi(x, i, s_{i-1}, s_i) + \pi(i-1, s_{i-1})$$

§ Viterbi algorithm (HMMs):

$$\pi(i, s_i) = \max_{s_{i-1}} \; e(x_i \mid s_i)\, q(s_i \mid s_{i-1})\, \pi(i-1, s_{i-1})$$

§ Viterbi algorithm (Maxent):

$$\pi(i, s_i) = \max_{s_{i-1}} \; p(s_i \mid s_{i-1}, x_1 \ldots x_m)\, \pi(i-1, s_{i-1})$$
