Gaussian and Linear Discriminant Analysis; Multiclass Classification
Professor Ameet Talwalkar
CS260 Machine Learning Algorithms
January 30, 2017
Outline
1 Administration
2 Review of last lecture
3 Generative versus discriminative
4 Multiclass classification
Announcements
Homework 2: due on Wednesday
Outline
1 Administration
2 Review of last lecture: Logistic regression
3 Generative versus discriminative
4 Multiclass classification
Logistic classification
Setup for two classes
Input: x ∈ RD
Output: y ∈ {0, 1}
Training data: D = {(xn, yn), n = 1, 2, . . . , N}
Model of conditional distribution
p(y = 1|x; b,w) = σ[g(x)]
where g(x) = b + ∑d wd xd = b + wTx
Why the sigmoid function?
What does it look like?
σ(a) = 1 / (1 + e^(−a)), where a = b + wTx
(Plot: σ(a) over a ∈ [−6, 6]; the curve rises from 0 to 1.)
Properties
Bounded between 0 and 1 ← thus, interpretable as probability
Monotonically increasing ← thus, usable to derive classification rules
- σ(a) > 0.5: positive (classify as ’1’)
- σ(a) < 0.5: negative (classify as ’0’)
- σ(a) = 0.5: undecidable
Nice computational properties ← the derivative has a simple form: σ′(a) = σ(a)[1 − σ(a)]
Linear or nonlinear classifier?
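The properties above can be checked numerically; a minimal sketch in Python (NumPy assumed available):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a)); maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-6, 6, 121)
p = sigmoid(a)

assert np.all((p > 0) & (p < 1))   # bounded between 0 and 1
assert np.all(np.diff(p) > 0)      # monotonically increasing
assert sigmoid(0.0) == 0.5         # the decision threshold sits at a = 0

# "Derivative is in a simple form": sigma'(a) = sigma(a) * (1 - sigma(a))
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, p * (1 - p), atol=1e-8)
```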
Linear or nonlinear?
σ(a) is nonlinear; however, the decision boundary is determined by
σ(a) = 0.5⇒ a = 0⇒ g(x) = b+wTx = 0
which is a linear function in x
We often call b the offset term.
Likelihood function
Probability of a single training sample (xn, yn)
p(yn|xn; b, w) =
- σ(b + wTxn) if yn = 1
- 1 − σ(b + wTxn) otherwise

Compact expression, exploiting that yn is either 1 or 0:

p(yn|xn; b, w) = σ(b + wTxn)^yn [1 − σ(b + wTxn)]^(1−yn)
Maximum likelihood estimation
Cross-entropy error (negative log-likelihood)
E(b, w) = −∑n {yn log σ(b + wTxn) + (1 − yn) log[1 − σ(b + wTxn)]}
Numerical optimization
Gradient descent: simple, scalable to large-scale problems
Newton method: fast but not scalable
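The cross-entropy error translates directly into code; a sketch (the toy data is made up for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(b, w, X, y):
    """E(b, w) = -sum_n { y_n log sigma(b + w^T x_n)
                          + (1 - y_n) log[1 - sigma(b + w^T x_n)] }"""
    p = sigmoid(b + X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# At b = 0, w = 0 every p_n = 0.5, so the error is N * log 2 regardless of y.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
assert np.isclose(cross_entropy(0.0, np.zeros(1), X, y), len(y) * np.log(2))
```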
Numerical optimization
Gradient descent
Choose a proper step size η > 0
Iteratively update the parameters following the negative gradient to minimize the error function

w(t+1) ← w(t) − η ∑n {σ(wTxn) − yn} xn
Remarks
Gradient is the direction of steepest ascent; hence we step along the negative gradient.
The step size needs to be chosen carefully to ensure convergence.
The step size can be adaptive (i.e. varying from iteration to iteration).
Variant called stochastic gradient descent (later this quarter).
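The update above, as a runnable sketch (the step size and toy data are illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_descent(X, y, eta=0.1, iters=1000):
    """Batch gradient descent for logistic regression without the offset b,
    matching the slide: w <- w - eta * sum_n (sigma(w^T x_n) - y_n) x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * X.T @ (sigmoid(X @ w) - y)
    return w

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = gradient_descent(X, y)
p = sigmoid(X @ w)
assert p[0] < 0.5 < p[-1]   # learned rule classifies the toy data correctly
```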
Intuition for Newton’s method
Approximate the true function with an easy-to-solve optimization problem
(Figure: f(x) and its quadratic approximation fquad(x) around the current iterate xk; minimizing fquad gives the next iterate xk + dk.)
In particular, we can approximate the cross-entropy error function around w(t) by a quadratic function (its second-order Taylor expansion), and then minimize this quadratic function
Update Rules
Gradient descent
w(t+1) ← w(t) − η ∑n {σ(wTxn) − yn} xn
Newton method
w(t+1) ← w(t) −H(t)−1∇E(w(t))
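For the cross-entropy error, the gradient is Xᵀ(p − y) and the Hessian is H = Xᵀ S X with S = diag(pn(1 − pn)); these forms are standard for logistic regression but not derived on this slide. A one-step sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step(w, X, y):
    """One Newton update w <- w - H^{-1} grad E(w) for logistic regression."""
    p = sigmoid(X @ w)
    grad = X.T @ (p - y)
    H = X.T @ (X * (p * (1 - p))[:, None])   # H = X^T diag(p(1-p)) X
    return w - np.linalg.solve(H, grad)

# Toy data with a bias column; labels deliberately not separable.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])

def error(w):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w0 = np.zeros(2)
w1 = newton_step(w0, X, y)
assert error(w1) < error(w0)   # a single Newton step already reduces E
```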
Contrast gradient descent and Newton’s method
Similar
Both are iterative procedures.
Different
Newton’s method requires second-order derivatives (less scalable, but faster convergence)
Newton’s method does not have a magic step size η to be set
Outline
1 Administration
2 Review of last lecture
3 Generative versus discriminative
Contrast Naive Bayes and logistic regression
Gaussian and Linear Discriminant Analysis
4 Multiclass classification
Naive Bayes and logistic regression: two different modelling paradigms
Consider spam classification problem
First Strategy:
- Use training set to find a decision boundary in the feature space that separates spam and non-spam emails
- Given a test point, predict its label based on which side of the boundary it is on.

Second Strategy:
- Look at spam emails and build a model of what they look like. Similarly, build a model of what non-spam emails look like.
- To classify a new email, match it against both the spam and non-spam models to see which is the better fit.

First strategy is discriminative (e.g., logistic regression)
Second strategy is generative (e.g., naive Bayes)
Generative vs Discriminative
Discriminative
Requires only specifying a model for the conditional distribution p(y|x), and thus maximizes the conditional likelihood ∑n log p(yn|xn).
Models that try to learn mappings directly from feature space to the labels are also discriminative, e.g., perceptron, SVMs (covered later)
Generative
Aims to model the joint probability p(x, y) and thus maximize the joint likelihood ∑n log p(xn, yn).
The generative models we’ll cover do so by modeling p(x|y) and p(y)

Let’s look at two more examples: Gaussian (or Quadratic) Discriminant Analysis and Linear Discriminant Analysis
Determining sex based on measurements
(Scatter plot: weight vs. height; red = female, blue = male)
Generative approach
Model joint distribution of (x = (height, weight), y = sex)
our data:

| Sex | Height | Weight |
| --- | ------ | ------ |
| 1 | 6′ | 175 |
| 0 | 5′2″ | 120 |
| 1 | 5′6″ | 140 |
| 1 | 6′2″ | 240 |
| 0 | 5′7″ | 130 |
| · · · | · · · | · · · |

(Scatter plot: weight vs. height; red = female, blue = male)
Intuition: we will model how heights vary (according to a Gaussian) in each sub-population (male and female).
Model of the joint distribution (1D)
p(x, y) = p(y) p(x|y) =
- p0 · (1/(√(2π) σ0)) e^(−(x − µ0)²/(2σ0²)) if y = 0
- p1 · (1/(√(2π) σ1)) e^(−(x − µ1)²/(2σ1²)) if y = 1
p0 and p1 (with p0 + p1 = 1) are prior probabilities, and p(x|y) is a class-conditional distribution
(Scatter plot: weight vs. height; red = female, blue = male)
What are the parameters to learn?
Parameter estimation
Log likelihood of training data D = {(xn, yn)}, n = 1, . . . , N, with yn ∈ {0, 1}

log P(D) = ∑n log p(xn, yn)
= ∑(n: yn=0) log( p0 · (1/(√(2π) σ0)) e^(−(xn − µ0)²/(2σ0²)) )
+ ∑(n: yn=1) log( p1 · (1/(√(2π) σ1)) e^(−(xn − µ1)²/(2σ1²)) )

Max log likelihood: (p0∗, p1∗, µ0∗, µ1∗, σ0∗, σ1∗) = argmax log P(D)

Max likelihood (D = 2): (p0∗, p1∗, µ0∗, µ1∗, Σ0∗, Σ1∗) = argmax log P(D)
For Naive Bayes we assume Σ∗i is diagonal
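In 1-D the maximization has a closed form: the priors are class frequencies, and (µ, σ) are the per-class sample mean and standard deviation. A sketch (the height numbers are made up):

```python
import numpy as np

def fit_gda_1d(x, y):
    """Closed-form MLE for the 1-D generative model p(x, y) = p(y) p(x|y)
    with one Gaussian class-conditional per class."""
    params = {}
    for c in (0, 1):
        xc = x[y == c]
        params[c] = {
            "prior": xc.size / x.size,   # p_c
            "mu": xc.mean(),             # class mean
            "sigma": xc.std(),           # MLE std (divides by N_c, ddof=0)
        }
    return params

x = np.array([63.0, 65.0, 64.0, 70.0, 72.0, 71.0])   # heights (inches)
y = np.array([0, 0, 0, 1, 1, 1])                     # 0 = female, 1 = male
params = fit_gda_1d(x, y)
assert params[0]["prior"] == 0.5
assert params[0]["mu"] == 64.0 and params[1]["mu"] == 71.0
```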
Decision boundary
As before, the Bayes optimal one under the assumed joint distribution depends on
p(y = 1|x) ≥ p(y = 0|x)
which is equivalent to
p(x|y = 1)p(y = 1) ≥ p(x|y = 0)p(y = 0)
Namely,
−(x − µ1)²/(2σ1²) − log(√(2π) σ1) + log p1 ≥ −(x − µ0)²/(2σ0²) − log(√(2π) σ0) + log p0

⇒ ax² + bx + c ≥ 0 ← the decision boundary is not linear!
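Expanding the inequality gives explicit coefficients: a = 1/(2σ0²) − 1/(2σ1²), b = µ1/σ1² − µ0/σ0², and c collects the remaining constants. A sketch:

```python
import numpy as np

def boundary_coeffs(p0, mu0, s0, p1, mu1, s1):
    """Coefficients of a x^2 + b x + c >= 0 (predict class 1), obtained by
    expanding the log-posterior inequality for two 1-D Gaussians."""
    a = 1.0 / (2 * s0**2) - 1.0 / (2 * s1**2)
    b = mu1 / s1**2 - mu0 / s0**2
    c = (mu0**2 / (2 * s0**2) - mu1**2 / (2 * s1**2)
         + np.log(s0 / s1) + np.log(p1 / p0))
    return a, b, c

# Equal variances and equal priors: a = 0, so the boundary is linear and
# crosses at the midpoint of the two means.
a, b, c = boundary_coeffs(0.5, 0.0, 1.0, 0.5, 2.0, 1.0)
assert a == 0.0
assert np.isclose(-c / b, 1.0)   # boundary at x = (mu0 + mu1) / 2
```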
Example of nonlinear decision boundary
(Figure: “Parabolic Boundary”, a quadratic decision boundary separating the two classes in 2D.)
Note: the boundary is characterized by a quadratic function, giving rise to the shape of a parabolic curve.
Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 30, 2017 22 / 40
![Page 37: Professor Ameet Talwalkaratalwalk/teaching/winter17/cs260/...Let’s look at two more examples: Gaussian (or Quadratic) Discriminative Analysis and Linear Discriminative Analysis Professor](https://reader034.fdocuments.in/reader034/viewer/2022052006/601a45b9e0ec0e0dcd4a17b1/html5/thumbnails/37.jpg)
A special case: what if we assume the two Gaussians have the same variance?
−(x − µ1)²/(2σ1²) − log(√(2π) σ1) + log p1 ≥ −(x − µ0)²/(2σ0²) − log(√(2π) σ0) + log p0

with σ0 = σ1

We get a linear decision boundary: bx + c ≥ 0
Note: equal variances across two different categories could be a very strong assumption.
(Scatter plot: weight vs. height; red = female, blue = male)
For example, from the plot, it does seem that the male population has slightly bigger variance (i.e., a bigger ellipse) than the female population. So the assumption might not be applicable.
Mini-summary
Gaussian discriminant analysis
A generative approach, assuming the data are modeled by
p(x, y) = p(y)p(x|y)
where p(x|y) is a Gaussian distribution.
Parameters (of Gaussian distributions) estimated by max likelihood
Decision boundary
- In general, nonlinear functions of x (quadratic discriminant analysis)
- Linear under various assumptions about Gaussian covariance matrices:
  - Single arbitrary matrix (linear discriminant analysis)
  - Multiple diagonal matrices (Gaussian Naive Bayes (GNB))
  - Single diagonal matrix (GNB in HW2 Problem 1)
So what is the discriminative counterpart?
Intuition

The decision boundary in Gaussian discriminant analysis is

ax^2 + bx + c = 0

Let us model the conditional distribution analogously:

p(y|x) = σ[ax^2 + bx + c] = 1 / (1 + e^{-(ax^2 + bx + c)})

Or, even simpler, going after the decision boundary of linear discriminant analysis:

p(y|x) = σ[bx + c]

Both look very similar to logistic regression; that is, we focus on writing down the conditional probability, not the joint probability.
Does this change how we estimate the parameters?
First change: a smaller number of parameters to estimate
The model is only parameterized by a, b, and c. There are no prior probabilities (p_0, p_1) or Gaussian distribution parameters (μ_0, μ_1, σ_0, and σ_1).

Second change: maximize the conditional likelihood p(y|x)

(a*, b*, c*) = argmin_{a,b,c} − Σ_n { y_n log σ(a x_n^2 + b x_n + c) + (1 − y_n) log[1 − σ(a x_n^2 + b x_n + c)] }

There is no closed-form solution!
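Since there is no closed-form solution, the parameters are typically found by an iterative method such as gradient descent. A minimal sketch on synthetic 1-D data (the toy labels, learning rate, and step count are illustrative assumptions, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data whose true boundary is quadratic: label 1 inside an interval.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = ((x > -1.0) & (x < 2.0)).astype(float)

a, b, c = 0.0, 0.0, 0.0
lr = 0.05
for _ in range(2000):
    p = sigmoid(a * x**2 + b * x + c)
    err = p - y  # derivative of the negative log-likelihood w.r.t. the logit
    a -= lr * np.mean(err * x**2)
    b -= lr * np.mean(err * x)
    c -= lr * np.mean(err)

preds = sigmoid(a * x**2 + b * x + c) > 0.5
```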
How easy is it for our Gaussian discriminant analysis?
Example
p_1 = (# of training samples in class 1) / (# of training samples)

μ_1 = ( Σ_{n : y_n = 1} x_n ) / (# of training samples in class 1)

σ_1^2 = ( Σ_{n : y_n = 1} (x_n − μ_1)^2 ) / (# of training samples in class 1)

Note: see the textbook for a detailed derivation (including the generalization to higher dimensions and multiple classes)
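These closed-form estimates are one-liners in code. A sketch with hypothetical toy data (the numbers are made up purely for illustration):

```python
import numpy as np

# Hypothetical 1-D labeled data.
x = np.array([1.2, 0.8, 3.1, 2.9, 3.4, 1.0])
y = np.array([0,   0,   1,   1,   1,   0])

x1 = x[y == 1]                   # training samples in class 1
p1 = x1.size / x.size            # prior p_1
mu1 = x1.mean()                  # mean mu_1
var1 = ((x1 - mu1) ** 2).mean()  # variance sigma_1^2 (MLE divides by the count)
```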
Generative versus discriminative: which one to use?
There is no fixed rule
Selecting which type of method to use is dataset/task specific
It depends on how well your modeling assumption fits the data
For instance, as we show in HW2, when data follows a specific variant of the Gaussian Naive Bayes assumption, p(y|x) necessarily follows a logistic function. However, the converse is not true.
- Gaussian Naive Bayes makes a stronger assumption than logistic regression
- When data follows this assumption, Gaussian Naive Bayes will likely yield a model that better fits the data
- But logistic regression is more robust and less sensitive to incorrect modeling assumptions
Outline
1 Administration
2 Review of last lecture
3 Generative versus discriminative
4 Multiclass classification
  Use binary classifiers as building blocks
  Multinomial logistic regression
Setup
Predict multiple classes/outcomes: C1, C2, . . . , CK
Weather prediction: sunny, cloudy, raining, etc
Optical character recognition: 10 digits + 26 characters (lower andupper cases) + special characters, etc
Studied methods
Nearest neighbor classifier
Naive Bayes
Gaussian discriminant analysis
Logistic regression
Logistic regression for predicting multiple classes? Easy
The approach of “one versus the rest”
For each class C_k, change the problem into binary classification:
  1. Relabel training data with label C_k as positive (or '1')
  2. Relabel all the rest of the data as negative (or '0')

This step is often called 1-of-K encoding. That is, only one class is nonzero and everything else is zero.

Example: for class C_2, the data go through the following change:

(x_1, C_1) → (x_1, 0), (x_2, C_3) → (x_2, 0), . . . , (x_n, C_2) → (x_n, 1), . . .

Train K binary classifiers using logistic regression to differentiate the two classes.

When predicting on x, combine the outputs of all binary classifiers:
  1. What if all the classifiers say negative?
  2. What if multiple classifiers say positive?
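A minimal sketch of one-versus-the-rest using a plain gradient-descent logistic regression (the helper names, toy data, learning rate, and step count are illustrative assumptions). The two prediction questions are resolved here by simply picking the class with the largest score:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, steps=500):
    # Plain gradient-descent binary logistic regression (no regularization).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def one_vs_rest(X, y, K):
    # Relabel class k as positive, everything else as negative, for each k.
    return [train_logistic(X, (y == k).astype(float)) for k in range(K)]

def predict_ovr(ws, X):
    # "All negative" / "multiple positive" is resolved by the largest score.
    scores = np.column_stack([X @ w for w in ws])
    return scores.argmax(axis=1)

# Toy 3-class data: well-separated 2-D Gaussian clusters plus a bias feature.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in means])
X = np.hstack([X, np.ones((len(X), 1))])
y = np.repeat(np.arange(3), 30)

ws = one_vs_rest(X, y, K=3)
preds = predict_ovr(ws, X)
```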
Yet, another easy approach
The approach of “one versus one”
For each pair of classes C_k and C_k', change the problem into binary classification:
  1. Relabel training data with label C_k as positive (or '1')
  2. Relabel training data with label C_k' as negative (or '0')
  3. Disregard all other data

Ex: for classes C_1 and C_2,

(x_1, C_1), (x_2, C_3), (x_3, C_2), . . . → (x_1, 1), (x_3, 0), . . .

Train K(K − 1)/2 binary classifiers using logistic regression to differentiate the two classes.

When predicting on x, combine the outputs of all binary classifiers. There are K(K − 1)/2 votes!
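One-versus-one can be sketched the same way; here the K(K − 1)/2 votes are combined by simple majority (the helper names and toy data are illustrative assumptions):

```python
import numpy as np
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, steps=500):
    # Plain gradient-descent binary logistic regression.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def one_vs_one(X, y, K):
    # One classifier per pair (k, k'); all other data are disregarded.
    models = {}
    for k, kp in combinations(range(K), 2):
        mask = (y == k) | (y == kp)
        models[(k, kp)] = train_logistic(X[mask], (y[mask] == k).astype(float))
    return models

def predict_ovo(models, X, K):
    # Each of the K(K-1)/2 classifiers casts one vote per example.
    votes = np.zeros((len(X), K))
    for (k, kp), w in models.items():
        says_k = sigmoid(X @ w) > 0.5
        votes[says_k, k] += 1
        votes[~says_k, kp] += 1
    return votes.argmax(axis=1)

# Toy 3-class data, as before: separated clusters plus a bias feature.
rng = np.random.default_rng(2)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in means])
X = np.hstack([X, np.ones((len(X), 1))])
y = np.repeat(np.arange(3), 30)

models = one_vs_one(X, y, K=3)  # 3 * 2 / 2 = 3 pairwise classifiers here
preds = predict_ovo(models, X, K=3)
```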
Contrast these two approaches
Pros of each approach
One versus the rest: only needs to train K classifiers.
  - Makes a big difference if you have a lot of classes to go through.

One versus one: each classifier only needs to train on a smaller subset of the data (only the data labeled with its two classes are involved).
  - Makes a big difference if you have a lot of data to go through.

Bad about both of them
Combining the classifiers' outputs seems to be a bit tricky.
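The classifier counts can be checked in a couple of lines. Taking the earlier OCR example with 10 digits + 52 letters = 62 classes (ignoring the special characters):

```python
# Number of binary classifiers each scheme must train for K classes.
def num_one_vs_rest(K):
    return K                 # one per class

def num_one_vs_one(K):
    return K * (K - 1) // 2  # one per unordered pair of classes

# num_one_vs_rest(62) -> 62, num_one_vs_one(62) -> 1891
```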
Any other good methods?
Multinomial logistic regression
Intuition: from the decision rule of our naive Bayes classifier
y* = argmax_k p(y = C_k|x) = argmax_k log [ p(x|y = C_k) p(y = C_k) ]
   = argmax_k [ log π_k + Σ_i z_i log θ_ki ] = argmax_k w_k^T x

Essentially, we are comparing

w_1^T x, w_2^T x, · · · , w_K^T x

with one score for each category.
First try
So, can we define the following conditional model?
p(y = C_k|x) = σ[w_k^T x]

This would not work because

Σ_k p(y = C_k|x) = Σ_k σ[w_k^T x] ≠ 1

as each summand can be any number (independently) between 0 and 1. But we are close! We can learn the K linear models jointly to ensure this property holds!
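A quick numeric check of this failure (the scores are made-up values standing in for w_k^T x):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical per-class scores w_k^T x for K = 3 classes.
scores = np.array([2.0, -1.0, 0.5])

independent = sigmoid(scores)  # each value lies in (0, 1)...
total = independent.sum()      # ...but the sum is generally not 1
```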
Definition of multinomial logistic regression
Model
For each class Ck, we have a parameter vector wk and model the posteriorprobability as
p(C_k|x) = e^{w_k^T x} / Σ_{k'} e^{w_{k'}^T x}   ← This is called the softmax function

Decision boundary: assign x the label with the maximum posterior probability

argmax_k P(C_k|x) → argmax_k w_k^T x
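A sketch of the softmax function in code. Subtracting the maximum score before exponentiating is a standard numerical-stability trick (not mentioned on the slide); it leaves the result unchanged because the common factor cancels in the ratio:

```python
import numpy as np

def softmax(scores):
    # Shift by the max so np.exp never overflows; the result is identical.
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

# The scores 100, 50, -20 from the lecture's example.
p = softmax(np.array([100.0, 50.0, -20.0]))
```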
How does the softmax function behave?
Suppose we have
w_1^T x = 100,  w_2^T x = 50,  w_3^T x = −20

We would pick the winning class label 1.

Softmax translates these scores into well-formed conditional probabilities

p(y = 1|x) = e^100 / (e^100 + e^50 + e^−20) < 1
preserves relative ordering of scores
maps scores to values between 0 and 1 that also sum to 1
Sanity check
The multinomial model reduces to binary logistic regression when K = 2:

p(C_1|x) = e^{w_1^T x} / (e^{w_1^T x} + e^{w_2^T x}) = 1 / (1 + e^{−(w_1 − w_2)^T x}) = 1 / (1 + e^{−w^T x})

where w = w_1 − w_2. The multinomial model thus generalizes (binary) logistic regression to deal with multiple classes.
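This reduction is easy to verify numerically (the random weights and input below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w1, w2, x = rng.normal(size=(3, 4))  # hypothetical weights and input, d = 4

# Two-class softmax probability of class 1...
softmax_p1 = np.exp(w1 @ x) / (np.exp(w1 @ x) + np.exp(w2 @ x))

# ...equals the sigmoid of the score difference (w_1 - w_2)^T x.
sigmoid_p1 = 1.0 / (1.0 + np.exp(-(w1 - w2) @ x))
```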
Parameter estimation
Discriminative approach: maximize conditional likelihood
log P(D) = Σ_n log P(y_n|x_n)

We will change y_n to y_n = [y_n1 y_n2 · · · y_nK]^T, a K-dimensional vector using 1-of-K encoding:

y_nk = 1 if y_n = k, and 0 otherwise

Ex: if y_n = 2, then y_n = [0 1 0 0 · · · 0]^T.

⇒ Σ_n log P(y_n|x_n) = Σ_n log Π_{k=1}^K P(C_k|x_n)^{y_nk} = Σ_n Σ_k y_nk log P(C_k|x_n)
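The 1-of-K encoding in code (note this sketch uses 0-indexed classes, while the slide's example is 1-indexed):

```python
import numpy as np

def one_hot(y, K):
    # 1-of-K encoding: row n has a single 1 in column y_n.
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

Y = one_hot(np.array([1, 0, 2]), K=3)
# -> [[0,1,0],
#     [1,0,0],
#     [0,0,1]]
```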
Cross-entropy error function
Definition: negative log likelihood
E(w_1, w_2, . . . , w_K) = − Σ_n Σ_k y_nk log P(C_k|x_n)
Properties
Convex, therefore unique global optimum
Optimization requires numerical procedures, analogous to those usedfor binary logistic regression
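Putting the pieces together, a minimal gradient-descent sketch for minimizing the cross-entropy error (the toy data, learning rate, and step count are illustrative assumptions; the gradient X^T(P − Y) is the standard one for softmax with cross-entropy):

```python
import numpy as np

def softmax(S):
    # Row-wise, numerically stable softmax.
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def cross_entropy(W, X, Y):
    # E(w_1, ..., w_K) = -sum_n sum_k y_nk log P(C_k | x_n)
    return -np.sum(Y * np.log(softmax(X @ W)))

def fit_multinomial(X, Y, lr=0.1, steps=500):
    # Plain gradient descent; gradient of E w.r.t. W is X^T (P - Y).
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(steps):
        W -= lr * X.T @ (softmax(X @ W) - Y) / len(X)
    return W

# Toy 3-class problem with a bias feature.
rng = np.random.default_rng(3)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in means])
X = np.hstack([X, np.ones((len(X), 1))])
Y = np.zeros((90, 3))
Y[np.arange(90), np.repeat(np.arange(3), 30)] = 1.0  # 1-of-K labels

W = fit_multinomial(X, Y)
```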