Machine Learning
Support Vector Machines: Training with
Stochastic Gradient Descent
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
SVM objective function

$\min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

Regularization term: $\frac{1}{2}\mathbf{w}^T\mathbf{w}$
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms that impose other preferences

Empirical loss: $C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions that impose other preferences

$C$: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
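A minimal numpy sketch (my own, not from the slides) of evaluating this objective on a tiny hand-made dataset; the names `svm_objective`, `X`, `y` are illustrative:

```python
import numpy as np

def svm_objective(w, X, y, C):
    """J(w) = 1/2 w.w + C * sum_i max(0, 1 - y_i w.x_i)."""
    margins = y * (X @ w)                    # y_i * w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge loss
    return 0.5 * w @ w + C * hinge.sum()

# Tiny sanity check: two separable points
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
print(svm_objective(np.zeros(2), X, y, C=1.0))           # w = 0: hinge is 1 per example
print(svm_objective(np.array([2.0, 0.0]), X, y, C=1.0))  # both margins are 2: hinge is 0
```

At `w = 0` the whole objective is hinge loss; at `w = (2, 0)` it is all regularizer, which is the tradeoff the hyper-parameter $C$ controls.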
Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Solving the SVM optimization problem

$\min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

This function is convex in $\mathbf{w}$.
Recall: Convex functions

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have

$f(\lambda\mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$

From a geometric perspective: every tangent plane lies below the function.

[Figure: a convex curve with points $u$, $v$ and values $f(u)$, $f(v)$; the chord between them lies above the curve]
Convex functions

Examples: linear functions are convex; the max of convex functions is convex.

Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the Hessian is positive semi-definite (for multivariate functions)
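Method 1, the definition, can be spot-checked numerically. A quick sketch (mine, not from the slides) for the one-dimensional hinge loss as a function of the margin:

```python
import random

def hinge(z):
    # Hinge loss as a function of the margin z = y * w.x
    return max(0.0, 1.0 - z)

# Check the definition f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v)
# at many random points; a single counterexample would disprove convexity.
random.seed(0)
for _ in range(10000):
    u, v = random.uniform(-5, 5), random.uniform(-5, 5)
    lam = random.random()
    lhs = hinge(lam * u + (1 - lam) * v)
    rhs = lam * hinge(u) + (1 - lam) * hinge(v)
    assert lhs <= rhs + 1e-12
print("no violations found")
```

This is only evidence, not a proof; the actual proof is that the hinge loss is a max of two linear functions.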
Not all functions are convex

Some functions are concave: they satisfy the reverse inequality

$f(\lambda\mathbf{u} + (1-\lambda)\mathbf{v}) \ge \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$

Other functions are neither convex nor concave.
Convex functions are convenient

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have

$f(\lambda\mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$

In general: a necessary condition for $\mathbf{x}$ to be a minimum of the function $f$ is that the gradient $\nabla f(\mathbf{x}) = 0$.

For convex functions, this condition is both necessary and sufficient.
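A one-dimensional illustration of necessity and sufficiency (mine, not from the slides), using a hypothetical convex quadratic:

```python
# For the convex quadratic f(w) = (w - 3)^2 + 1, the gradient is
# f'(w) = 2(w - 3). Setting it to zero gives w = 3, and because f is
# convex this stationary point is guaranteed to be the global minimum.
def f(w):
    return (w - 3.0) ** 2 + 1.0

def grad_f(w):
    return 2.0 * (w - 3.0)

w_star = 3.0
assert grad_f(w_star) == 0.0
# No nearby point does better:
assert all(f(w_star) <= f(w_star + d) for d in (-2.0, -0.5, 0.5, 2.0))
print(f(w_star))  # 1.0
```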
Solving the SVM optimization problem

$\min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

This function is convex in $\mathbf{w}$.

• This is a quadratic optimization problem because the objective is quadratic
• Older methods used techniques from quadratic programming: very slow
• There are no constraints, so we can use gradient descent: still very slow!
Gradient descent

General strategy for minimizing a function $J(\mathbf{w})$:
• Start with an initial guess for $\mathbf{w}$, say $\mathbf{w}_0$
• Iterate until convergence:
  – Compute the gradient of $J$ at $\mathbf{w}_t$
  – Update $\mathbf{w}_t$ to get $\mathbf{w}_{t+1}$ by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

[Figure: a convex curve $J(\mathbf{w})$ with successive iterates $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3$ stepping down toward the minimum]

We are trying to minimize

$J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$
Gradient descent for SVM

We are trying to minimize

$J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

1. Initialize $\mathbf{w}_0$
2. For t = 0, 1, 2, …:
   1. Compute the gradient of $J(\mathbf{w})$ at $\mathbf{w}_t$. Call it $\nabla J(\mathbf{w}_t)$
   2. Update: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - r\,\nabla J(\mathbf{w}_t)$

$r$: the learning rate
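A compact numpy sketch (my own, not from the slides) of this full-batch procedure on a toy dataset; the hinge term is handled with a sub-gradient at its kink, which the deck justifies later:

```python
import numpy as np

def svm_gradient(w, X, y, C):
    """(Sub)gradient of J(w) = 1/2 w.w + C * sum_i max(0, 1 - y_i w.x_i),
    summed over the FULL training set -- one expensive pass per update."""
    margins = y * (X @ w)
    active = margins < 1                       # examples with nonzero hinge loss
    return w - C * (y[active, None] * X[active]).sum(axis=0)

def gradient_descent(X, y, C=1.0, r=0.01, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= r * svm_gradient(w, X, y, C)      # step against the gradient
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = gradient_descent(X, y)
print((np.sign(X @ w) == y).all())  # True: training set separated
```

Note that `svm_gradient` touches every training example on every update, which is exactly the scaling problem the next slides raise.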
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient descent for SVM

We are trying to minimize

$J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

1. Initialize $\mathbf{w}_0$
2. For t = 0, 1, 2, …:
   1. Compute the gradient of $J(\mathbf{w})$ at $\mathbf{w}_t$. Call it $\nabla J(\mathbf{w}_t)$
   2. Update: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - r\,\nabla J(\mathbf{w}_t)$

$r$: the learning rate

The gradient of the SVM objective requires summing over the entire training set: slow, does not really scale.
Stochastic gradient descent for SVM

We are trying to minimize

$J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

Given a training set $S = \{(\mathbf{x}_i, y_i)\}$, $\mathbf{x} \in \Re^n$, $y \in \{-1, 1\}$:
1. Initialize $\mathbf{w}_0 = 0 \in \Re^n$
2. For epoch = 1 … T:
   1. Pick a random example $(\mathbf{x}_i, y_i)$ from the training set $S$
   2. Treat $(\mathbf{x}_i, y_i)$ as a full dataset and take the gradient of the SVM objective $\hat{J}$ at the current $\mathbf{w}_{t-1}$. Call it $\nabla \hat{J}(\mathbf{w}_{t-1})$, where
      $\hat{J}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$
   3. Update: $\mathbf{w}_t \leftarrow \mathbf{w}_{t-1} - \gamma_t \nabla \hat{J}(\mathbf{w}_{t-1})$
3. Return final $\mathbf{w}$

This algorithm is guaranteed to converge to the minimum of $J$ if $\gamma_t$ is small enough. Why? The objective $J(\mathbf{w})$ is a convex function.
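A numpy sketch (my own, not from the slides) of this loop: one random example per update, with the two-case gradient that the sub-gradient discussion later makes precise, and a fixed step size for simplicity:

```python
import numpy as np

def sgd_svm(X, y, C=1.0, gamma=0.01, epochs=100, seed=0):
    """One random example per update, as on the slides. A constant
    step size is used here; decaying schedules come up later."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        i = rng.integers(n)                    # pick a random example
        if y[i] * (X[i] @ w) < 1:              # hinge term is active
            grad = w - C * y[i] * X[i]
        else:                                  # only the regularizer contributes
            grad = w
        w -= gamma * grad
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_svm(X, y)
print((np.sign(X @ w) == y).all())  # True
```

Each update costs $O(d)$ instead of $O(Nd)$, which is the entire point of the stochastic variant.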
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient Descent vs SGD

Gradient descent: [Figure: the sequence of full-gradient updates traces a smooth path down the objective toward the minimum]

Stochastic gradient descent: [Figure: the sequence of single-example updates traces a noisy path that still heads toward the minimum]

Many more updates than gradient descent, but each individual update is less computationally expensive.
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic gradient descent for SVM

Given a training set $S = \{(\mathbf{x}_i, y_i)\}$, $\mathbf{x} \in \Re^n$, $y \in \{-1, 1\}$:
1. Initialize $\mathbf{w}_0 = 0 \in \Re^n$
2. For epoch = 1 … T:
   1. Pick a random example $(\mathbf{x}_i, y_i)$ from the training set $S$
   2. Treat $(\mathbf{x}_i, y_i)$ as a full dataset and take the derivative of the SVM objective at the current $\mathbf{w}_{t-1}$ to be $\nabla J(\mathbf{w}_{t-1})$
   3. Update: $\mathbf{w}_t \leftarrow \mathbf{w}_{t-1} - \gamma_t \nabla J(\mathbf{w}_{t-1})$
3. Return final $\mathbf{w}$
Hinge loss is not differentiable!

$\hat{J}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

What is the derivative of the hinge loss with respect to $\mathbf{w}$?
Detour: Sub-gradients

Generalization of gradients to non-differentiable functions. (Remember that for convex functions, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of that hyperplane.
Sub-gradients

Formally, a vector $\mathbf{g}$ is a sub-gradient of $f$ at a point $\mathbf{x}$ if

$f(\mathbf{z}) \ge f(\mathbf{x}) + \mathbf{g}^T(\mathbf{z} - \mathbf{x})$ for all $\mathbf{z}$

[Figure, example from Boyd: a piecewise-linear convex function $f$; $f$ is differentiable at $x_1$, so the slope $g_1$ of the tangent there is the unique sub-gradient; at the kink $x_2$, $g_2$ and $g_3$ are both sub-gradients]
Sub-gradient of the SVM objective

$\hat{J}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

General strategy: first solve the max and compute the gradient for each case.

• If $y_i\mathbf{w}^T\mathbf{x}_i \le 1$ (the hinge term is active): $\nabla \hat{J} = \mathbf{w} - C y_i\mathbf{x}_i$
• Otherwise: $\nabla \hat{J} = \mathbf{w}$
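The two cases translate directly into code; a small sketch (mine, not from the slides) with illustrative names:

```python
import numpy as np

def subgrad_J_hat(w, x_i, y_i, C):
    """Sub-gradient of J_hat(w) = 1/2 w.w + C max(0, 1 - y_i w.x_i).
    The regularizer always contributes w; the hinge term contributes
    -C * y_i * x_i only when the margin y_i w.x_i is at most 1."""
    if y_i * (w @ x_i) <= 1:
        return w - C * y_i * x_i
    return w

w = np.array([0.5, 0.5])
x1, y1 = np.array([2.0, 1.0]), 1.0
print(subgrad_J_hat(w, x1, y1, C=1.0))           # margin 1.5 > 1: just w
print(subgrad_J_hat(np.zeros(2), x1, y1, C=1.0))  # margin 0 <= 1: -C*y*x
```

At the kink (margin exactly 1) either case is a valid sub-gradient; this sketch arbitrarily takes the active one.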
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM
Given a training set 𝑆 = (𝐱# , 𝑦#) , 𝐱 ∈ ℜ> , 𝑦 ∈ {−1,1}1. Initialize 𝐰 = 0 ∈ ℜ>
2. For epoch = 1 … T:For each training example 𝐱0, 𝑦0 ∈ 𝑆:
If 𝑦$𝐰(𝐱$ ≤ 1:𝐰 ← 1 − 𝛾! 𝐰+ 𝛾!𝐶𝑦$𝐱$
else: 𝐰 ← 1 − 𝛾! 𝐰
3. Return 𝐰
56
![Page 57: Support Vector Machines: Training with Stochastic Gradient ...Support vector machines •Training by maximizing margin •The SVM objective •Solving the SVM optimization problem](https://reader033.fdocuments.in/reader033/viewer/2022051606/6029122d967cf234db21d17f/html5/thumbnails/57.jpg)
Stochastic sub-gradient descent for SVM
Given a training set 𝑆 = (𝐱# , 𝑦#) , 𝐱 ∈ ℜ> , 𝑦 ∈ {−1,1}1. Initialize 𝐰 = 0 ∈ ℜ>
2. For epoch = 1 … T:For each training example 𝐱0, 𝑦0 ∈ 𝑆:
If 𝑦$𝐰(𝐱$ ≤ 1:𝐰 ← 1 − 𝛾! 𝐰+ 𝛾!𝐶𝑦$𝐱$
else: 𝐰 ← 1 − 𝛾! 𝐰
3. Return 𝐰
57
![Page 58: Support Vector Machines: Training with Stochastic Gradient ...Support vector machines •Training by maximizing margin •The SVM objective •Solving the SVM optimization problem](https://reader033.fdocuments.in/reader033/viewer/2022051606/6029122d967cf234db21d17f/html5/thumbnails/58.jpg)
Stochastic sub-gradient descent for SVM
Given a training set 𝑆 = (𝐱# , 𝑦#) , 𝐱 ∈ ℜ> , 𝑦 ∈ {−1,1}1. Initialize 𝐰 = 0 ∈ ℜ>
2. For epoch = 1 … T:For each training example 𝐱0, 𝑦0 ∈ 𝑆:
If 𝑦$𝐰(𝐱$ ≤ 1:𝐰 ← 1 − 𝛾! 𝐰+ 𝛾!𝐶𝑦$𝐱$
else: 𝐰 ← 1 − 𝛾! 𝐰
3. Return 𝐰
58
Update 𝐰 ← 𝐰− 𝛾!∇𝐽
![Page 59: Support Vector Machines: Training with Stochastic Gradient ...Support vector machines •Training by maximizing margin •The SVM objective •Solving the SVM optimization problem](https://reader033.fdocuments.in/reader033/viewer/2022051606/6029122d967cf234db21d17f/html5/thumbnails/59.jpg)
Stochastic sub-gradient descent for SVM
Given a training set 𝑆 = (𝐱# , 𝑦#) , 𝐱 ∈ ℜ> , 𝑦 ∈ {−1,1}1. Initialize 𝐰 = 0 ∈ ℜ>
2. For epoch = 1 … T:For each training example 𝐱0, 𝑦0 ∈ 𝑆:
If 𝑦$𝐰(𝐱$ ≤ 1:𝐰 ← 1 − 𝛾! 𝐰+ 𝛾!𝐶𝑦$𝐱$
else: 𝐰 ← 1 − 𝛾! 𝐰
3. Return 𝐰
59
![Page 60: Support Vector Machines: Training with Stochastic Gradient ...Support vector machines •Training by maximizing margin •The SVM objective •Solving the SVM optimization problem](https://reader033.fdocuments.in/reader033/viewer/2022051606/6029122d967cf234db21d17f/html5/thumbnails/60.jpg)
Stochastic sub-gradient descent for SVM
Given a training set 𝑆 = (𝐱# , 𝑦#) , 𝐱 ∈ ℜ> , 𝑦 ∈ {−1,1}1. Initialize 𝐰 = 0 ∈ ℜ>
2. For epoch = 1 … T:For each training example 𝐱0, 𝑦0 ∈ 𝑆:
If 𝑦$𝐰(𝐱$ ≤ 1:𝐰 ← 1 − 𝛾! 𝐰+ 𝛾!𝐶𝑦$𝐱$
else: 𝐰 ← 1 − 𝛾! 𝐰
3. Return 𝐰
60
𝛾$: learning rate, many tweaks possible
![Page 61: Support Vector Machines: Training with Stochastic Gradient ...Support vector machines •Training by maximizing margin •The SVM objective •Solving the SVM optimization problem](https://reader033.fdocuments.in/reader033/viewer/2022051606/6029122d967cf234db21d17f/html5/thumbnails/61.jpg)
Stochastic sub-gradient descent for SVM
Given a training set 𝑆 = (𝐱# , 𝑦#) , 𝐱 ∈ ℜ> , 𝑦 ∈ {−1,1}1. Initialize 𝐰 = 0 ∈ ℜ>
2. For epoch = 1 … T:For each training example 𝐱0, 𝑦0 ∈ 𝑆:
If 𝑦$𝐰(𝐱$ ≤ 1:𝐰 ← 1 − 𝛾! 𝐰+ 𝛾!𝐶𝑦$𝐱$
else: 𝐰 ← 1 − 𝛾! 𝐰
3. Return 𝐰
61
γ_t: learning rate; many tweaks possible
It is important to shuffle the examples at the start of each epoch
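The pseudocode above can be sketched in NumPy. The function name `sgd_svm`, the decaying schedule γ_t = γ₀/(1 + γ₀t), and the toy hyper-parameters are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sgd_svm(X, y, C=1.0, gamma0=0.1, epochs=100, seed=0):
    """Stochastic sub-gradient descent for the SVM objective
    J(w) = 1/2 w.w + C * sum_i max(0, 1 - y_i w.x_i)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                           # 1. initialize w = 0
    t = 0
    for _ in range(epochs):                   # 2. for each epoch
        order = rng.permutation(n)            # shuffle at the start of each epoch
        for i in order:
            gamma = gamma0 / (1 + gamma0 * t) # decaying learning rate
            if y[i] * (w @ X[i]) <= 1:        # inside the margin, or a mistake
                w = (1 - gamma) * w + gamma * C * y[i] * X[i]
            else:                             # correct with margin: only shrink w
                w = (1 - gamma) * w
            t += 1
    return w                                  # 3. return w
```

On a small linearly separable data set this recovers a separating weight vector after a few hundred epochs.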
Convergence and learning rates
With enough iterations, SGD will converge to the optimum in expectation

Provided the step sizes are “square summable, but not summable”

• Step sizes γ_t are positive
• The sum of squares of the step sizes over t = 1 to ∞ is finite: Σ_t γ_t² < ∞
• The sum of the step sizes over t = 1 to ∞ is infinite: Σ_t γ_t = ∞
• Some examples: γ_t = γ₀/(1 + γ₀t) or γ_t = γ₀/(1 + t)
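As a sanity check on these conditions, the partial sums for the schedule γ_t = γ₀/(1 + t) behave as required: the sum of squares stays bounded, while the plain sum keeps growing (roughly like log T):

```python
import numpy as np

gamma0 = 1.0
t = np.arange(1, 1_000_001)           # t = 1 .. 10^6
gammas = gamma0 / (1 + t)             # the schedule gamma_t = gamma0 / (1 + t)

sum_of_squares = np.sum(gammas ** 2)  # bounded: stays below gamma0^2 * pi^2 / 6
sum_of_steps = np.sum(gammas)         # unbounded: grows without limit as T grows
```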
Convergence and learning rates
• Number of iterations to get to accuracy within ε

• For strongly convex functions, N examples, d dimensional:
– Gradient descent: O(Nd ln(1/ε))
– Stochastic gradient descent: O(d/ε)

• More subtleties involved, but SGD is generally preferable when the data size is huge

• Recently, many variants that are based on this general strategy
– Examples: Adagrad, momentum, Nesterov’s accelerated gradient, Adam, RMSProp, etc.
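One way to see where the N in the gradient-descent cost comes from: a single full-batch sub-gradient step must touch every example, while a stochastic step touches one, so its per-iteration cost is independent of N. A minimal NumPy sketch (the function names are mine, not from the slides):

```python
import numpy as np

def full_gradient_step(w, X, y, C, gamma):
    """One (sub-)gradient descent step: touches all N examples."""
    margins = y * (X @ w)                  # N dot products
    active = margins <= 1                  # examples inside the margin
    grad = w - C * (y[active, None] * X[active]).sum(axis=0)
    return w - gamma * grad

def stochastic_step(w, xi, yi, C, gamma):
    """One SGD step: touches a single example, cost independent of N."""
    if yi * (w @ xi) <= 1:
        return (1 - gamma) * w + gamma * C * yi * xi
    return (1 - gamma) * w
```

With N = 1 the two steps coincide, which is a quick check that the stochastic update really is the per-example sub-gradient.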
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x_i ∈ ℝᵈ, y_i ∈ {−1, 1}
1. Initialize w = 0 ∈ ℝᵈ
2. For epoch = 1 … T:
   For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1: w ← (1 − γ_t)w + γ_t C y_i x_i
      else: w ← (1 − γ_t)w
3. Return w

Compare with the Perceptron update: if y_i wᵀx_i ≤ 0, update w ← w + γ_t y_i x_i
Perceptron vs. SVM
• Perceptron: stochastic sub-gradient descent for a different loss
– No regularization, though

• SVM: optimizes the hinge loss
– With regularization
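The two updates can be put side by side in NumPy (function names are illustrative). The Perceptron reacts only to mistakes (y wᵀx ≤ 0) and never shrinks w; the SVM update also reacts to correct predictions inside the margin (y wᵀx ≤ 1) and always applies the (1 − γ_t) shrinkage coming from the regularizer:

```python
import numpy as np

def perceptron_update(w, x, y, gamma):
    """Mistake-driven: update only when y * w.x <= 0; w is never shrunk."""
    if y * (w @ x) <= 0:
        return w + gamma * y * x
    return w

def svm_sgd_update(w, x, y, gamma, C):
    """Margin-driven: update whenever y * w.x <= 1, and always shrink w
    by (1 - gamma) -- the shrinkage is the regularizer at work."""
    if y * (w @ x) <= 1:
        return (1 - gamma) * w + gamma * C * y * x
    return (1 - gamma) * w
```

On an example that is correctly classified but inside the margin, the Perceptron leaves w unchanged while the SVM update still moves it.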
SVM summary from optimization perspective
• Minimize the regularized hinge loss

• Solve using stochastic gradient descent
– Very fast: the cost of each update does not depend on the number of examples
– Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width, though Perceptron variants can force a margin
– The convergence criterion is an issue: SGD can be aggressive in the beginning and get to a reasonably good solution fast, but convergence to a very accurate weight vector is slow

• Other successful optimization algorithms exist
– E.g. dual coordinate descent, implemented in LIBLINEAR

Questions?