Support vector machines
CS 446 / ECE 449
2021-02-16 14:33:55 -0600 (d94860f)
Another algorithm for linear prediction. Why?!
1 / 37
“pytorch meta-algorithm”.
...
5. Tweak 1-4 until training error is small.
6. Tweak 1-5, possibly reducing model complexity, until testing error is small.
Recall: step 5 is easy, step 6 is complicated.
Nice way to reduce complexity: favorable inductive bias.
- Support vector machines (SVMs) popularized a very influential inductive bias: maximum margin predictors.
  This concept has consequences beyond linear predictors and beyond supervised learning.
- Nonlinear SVMs, via kernels, are also highly influential.
  E.g., we will revisit them in the deep network lectures.
2 / 37
Plan for SVM
- Hard-margin SVM.
- Soft-margin SVM.
- SVM duality.
- Nonlinear SVM: kernels.
3 / 37
Linearly separable data
Linearly separable data means there exists u ∈ R^d which perfectly classifies all training points:
    min_i y_i x_i^T u > 0.
We have many ways to solve for u:
- Logistic regression.
- Perceptron.
- Convex programming (convex function subject to convex set constraint).
4 / 37
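For concreteness, here is a minimal perceptron sketch (a hedged illustration, not code from the course; the function name and epoch cap are assumptions): on linearly separable data it halts with some u satisfying min_i y_i x_i^T u > 0, though not necessarily a large-margin one.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Find some u with min_i y_i x_i^T u > 0 (assumes linearly separable data)."""
    n, d = X.shape
    u = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ u) <= 0:   # mistake: nudge u toward y_i x_i
                u += y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                # u now separates all of S
            return u
    return u
```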
Maximum margin solution
[Figure: three panels showing the best linear classifier on the population; an arbitrary linear separator on training data S; and the maximum margin solution on training data S.]
Why use the maximum margin solution?
(i) Uniquely determined by S, even if there are many perfect linear classifiers.
(ii) It is a particular inductive bias (i.e., an assumption about the problem) that seems to be commonly useful.
- We’ve seen inductive bias before: least squares and logistic regression choose different predictors.
- This particular bias (margin maximization) is common in machine learning, and has many nice properties.
Key insight: we can express this as another convex program.
5 / 37
Distance to decision boundary
Suppose w ∈ R^d satisfies min_{(x,y)∈S} y x^T w > 0.
- “Maximum margin” shouldn’t care about scaling; w and 10w should be equally good.
- Thus for each direction w/‖w‖, we can fix a scaling.
Let (x, y) be any example in S that achieves the minimum min_i y_i x_i^T w.
[Figure: hyperplane H with normal w; the point yx lies at distance y x^T w / ‖w‖₂ from H.]
- Rescale w so that y x^T w = 1. (Now the scaling is fixed.)
- Distance from yx to H is y x^T w / ‖w‖ = 1/‖w‖.
  This is the (normalized minimum) margin.
- This gives the optimization problem
      max 1/‖w‖   subj. to   min_{(x,y)∈S} y x^T w = 1.
Refinements: (a) the constraint can be written ∀i, y_i x_i^T w ≥ 1; (b) max 1/‖w‖ can be replaced with min ‖w‖².
6 / 37
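As a numerical companion to the rescaling argument above, this hedged sketch fixes the scaling so that the minimum of y_i x_i^T w is 1, then reads off the margin 1/‖w‖ (names are illustrative):

```python
import numpy as np

def normalized_margin(w, X, y):
    """Margin 1/||w'|| after rescaling w' = w/m so that min_i y_i x_i^T w' = 1."""
    m = np.min(y * (X @ w))              # min_i y_i x_i^T w; positive iff w separates S
    assert m > 0, "w does not separate the data"
    return 1.0 / np.linalg.norm(w / m)   # distance from the closest yx to H
```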
Maximum margin linear classifier
The solution w to the following mathematical optimization problem:
    min_{w ∈ R^d} (1/2)‖w‖₂²   s.t.   y x^T w ≥ 1 for all (x, y) ∈ S
gives the linear classifier with the largest minimum margin on S, i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.
- This is a convex optimization problem: minimization of a convex function, subject to a convex constraint.
- There are many solvers for this problem; it is an area of active research.
  You will code up an easy one in homework 2!
- The feasible set is nonempty when S is linearly separable; in this case, the solution is unique.
- Note: some presentations explicitly include and encode the appended “1” feature on x; we will not.
7 / 37
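Since this is a standard quadratic program, an off-the-shelf convex solver handles it directly. Below is a minimal sketch using the cvxpy library (the use of cvxpy is an assumption of this example; homework 2 uses its own solver):

```python
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||^2 s.t. y_i x_i^T w >= 1; infeasible if S is not separable."""
    n, d = X.shape
    w = cp.Variable(d)
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                         [cp.multiply(y, X @ w) >= 1])
    problem.solve()
    return w.value
```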
Soft-margin SVM
The preceding convex program makes no sense if the data are not linearly separable.
It is sometimes called the hard-margin SVM.
Now we develop a soft-margin SVM for non-separable data.
8 / 37
Soft-margin SVMs (Cortes and Vapnik, 1995)
Start from the hard-margin program:
    min_{w ∈ R^d} (1/2)‖w‖₂²   s.t.   y_i x_i^T w ≥ 1 for all i = 1, 2, …, n.
Suppose it is infeasible (no linear separators).
Introduce slack variables ξ_1, …, ξ_n ≥ 0, and a trade-off parameter C > 0:
    min_{w ∈ R^d, ξ_1,…,ξ_n ∈ R} (1/2)‖w‖₂² + C Σ_{i=1}^n ξ_i
    s.t. y_i x_i^T w ≥ 1 − ξ_i for all i = 1, 2, …, n,
         ξ_i ≥ 0 for all i = 1, 2, …, n,
which is always feasible. This is called the soft-margin SVM.
(Slack variables are auxiliary variables; they are not needed to form the linear classifier.)
9 / 37
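The slack-variable program can be transcribed into a solver the same way; this hedged sketch mirrors the display above (again assuming cvxpy):

```python
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Solve min (1/2)||w||^2 + C sum_i xi_i s.t. y_i x_i^T w >= 1 - xi_i, xi_i >= 0."""
    n, d = X.shape
    w, xi = cp.Variable(d), cp.Variable(n)
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                         [cp.multiply(y, X @ w) >= 1 - xi, xi >= 0])
    problem.solve()
    return w.value
```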
Interpretation of slack variables
    min_{w ∈ R^d, ξ_1,…,ξ_n ∈ R} (1/2)‖w‖₂² + C Σ_{i=1}^n ξ_i
    s.t. y_i x_i^T w ≥ 1 − ξ_i for all i = 1, 2, …, n,
         ξ_i ≥ 0 for all i = 1, 2, …, n.
[Figure: hyperplane H, with slack distances marked for points inside the margin.]
For a given w, ξ_i/‖w‖₂ is the distance that x_i would have to move to satisfy y_i x_i^T w ≥ 1.
10 / 37
Another interpretation of slack variables
Constraints with non-negative slack variables:
    min_{w ∈ R^d, ξ_1,…,ξ_n ∈ R} (1/2)‖w‖₂² + C Σ_{i=1}^n ξ_i
    s.t. y_i x_i^T w ≥ 1 − ξ_i and ξ_i ≥ 0 for all i = 1, 2, …, n.
Equivalent unconstrained form: given w, the optimal ξ_i is max{0, 1 − y_i x_i^T w}, thus
    min_{w ∈ R^d} (1/2)‖w‖₂² + C Σ_{i=1}^n [1 − y_i x_i^T w]₊.
Notation: [a]₊ := max{0, a} (ReLU!).
[1 − y x^T w]₊ is the hinge loss of w on example (x, y).
11 / 37
Unconstrained soft-margin SVM
This form is a regularized ERM:
    min_{w ∈ R^d} (1/2)‖w‖₂² + C Σ_{i=1}^n ℓ_hinge(y_i x_i^T w),   where ℓ_hinge(z) = max{0, 1 − z}.
- We can equivalently set C = 1 and write the regularizer as (λ/2)‖w‖².
- C is a hyperparameter, with no good search procedure.
12 / 37
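Because the unconstrained form is just regularized ERM, plain (sub)gradient descent applies; [1 − z]₊ is non-differentiable at z = 1, so we use a subgradient. A minimal numpy sketch (step size and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_margin_subgradient(X, y, C=1.0, lr=1e-3, epochs=500):
    """Subgradient descent on (1/2)||w||^2 + C sum_i [1 - y_i x_i^T w]_+."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = y * (X @ w) < 1                 # examples with positive hinge loss
        subgrad = w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * subgrad
    return w
```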
SVM duality
A convex program is an optimization problem in which a convex objective is minimized (or a concave objective maximized) over a convex constraint (feasible) set.
Every convex program has a corresponding dual program.
There is a rich theory about this correspondence.
For the SVM, the dual has many nice properties:
- Clarifies the role of support vectors.
- Leads to a nice nonlinear approach via kernels.
- Gives another choice for optimization algorithms.
  (Used in homework 2!)
13 / 37
SVM hard-margin duality.
Define the two optimization problems
    min { (1/2)‖w‖² : w ∈ R^d, ∀i, 1 − y_i x_i^T w ≤ 0 }   (primal),
    max { Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j : α ∈ R^n, α ≥ 0 }   (dual).
If the primal is feasible, then primal optimal value = dual optimal value.
Given a primal optimum w and a dual optimum α, then
    w = Σ_{i=1}^n α_i y_i x_i.
- The dual variables α have dimension n, the same as the number of examples.
- We can write the primal optimum as a linear combination of examples.
- The dual objective is a concave quadratic.
- We will derive this duality using Lagrange multipliers.
14 / 37
Lagrange multipliers
Move constraints into the objective using Lagrange multipliers.
Original problem:
    min_{w ∈ R^d} (1/2)‖w‖₂²   s.t.   1 − y_i x_i^T w ≤ 0 for all i = 1, …, n.
- For each constraint 1 − y_i x_i^T w ≤ 0, associate a dual variable (Lagrange multiplier) α_i ≥ 0.
- Move the constraints into the objective by adding Σ_{i=1}^n α_i (1 − y_i x_i^T w) and maximizing over α = (α_1, …, α_n) ∈ R^n s.t. α ≥ 0 (i.e., α_i ≥ 0 for all i).
Lagrangian L(w, α):
    L(w, α) := (1/2)‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w).
Maximizing over α ≥ 0 recovers the primal problem: for any w ∈ R^d,
    P(w) := sup_{α ≥ 0} L(w, α) = (1/2)‖w‖₂² if min_i y_i x_i^T w ≥ 1, and ∞ otherwise.
What if we instead leave α fixed, and minimize over w?
15 / 37
Dual problem
Lagrangian:
    L(w, α) := (1/2)‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w).
Primal hard-margin SVM:
    P(w) = sup_{α ≥ 0} L(w, α) = sup_{α ≥ 0} [ (1/2)‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w) ].
Dual problem D(α) = min_w L(w, α): given α ≥ 0, the map w ↦ L(w, α) is a convex quadratic with minimum w = Σ_{i=1}^n α_i y_i x_i, giving
    D(α) = min_{w ∈ R^d} L(w, α) = L(Σ_{i=1}^n α_i y_i x_i, α)
         = Σ_{i=1}^n α_i − (1/2) ‖Σ_{i=1}^n α_i y_i x_i‖₂²
         = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.
16 / 37
Summarizing,
    L(w, α) = (1/2)‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w)   (Lagrangian),
    P(w) = max_{α ≥ 0} L(w, α)   (primal problem),
    D(α) = min_w L(w, α)   (dual problem).
- For general Lagrangians, we have weak duality P(w) ≥ D(α), since
      P(w) = max_{α′ ≥ 0} L(w, α′) ≥ L(w, α) ≥ min_{w′} L(w′, α) = D(α).
- By convexity, we have strong duality min_w P(w) = max_{α ≥ 0} D(α), and an optimum α for D gives an optimum w for P via
      w = Σ_{i=1}^n α_i y_i x_i = argmin_{w′} L(w′, α).
17 / 37
Support vectors
Optimal solutions w and α = (α_1, …, α_n) satisfy:
- w = Σ_{i=1}^n α_i y_i x_i = Σ_{i : α_i > 0} α_i y_i x_i,
- α_i > 0 ⇒ y_i x_i^T w = 1 for all i = 1, …, n (complementary slackness).
The y_i x_i with α_i > 0 are called support vectors.
[Figure: hyperplane H, with support vectors lying exactly on the margin boundaries.]
- Support vector examples satisfy the “margin” constraints with equality.
- We get the same solution if the non-support vectors are deleted.
- The primal optimum is a linear combination of the support vectors.
18 / 37
Proof of complementary slackness
For the optimal (feasible) solutions w and α, we have
    P(w) = D(α) = min_{w′ ∈ R^d} L(w′, α)   (by strong duality)
         ≤ L(w, α)
         = (1/2)‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w)
         ≤ (1/2)‖w‖₂²   (constraints are satisfied)
         = P(w).
Therefore every term in the sum Σ_{i=1}^n α_i (1 − y_i x_i^T w) must be zero:
    α_i (1 − y_i x_i^T w) = 0 for all i = 1, …, n.
If α_i > 0, then we must have 1 − y_i x_i^T w = 0. (Not iff!)
19 / 37
SVM (hard-margin) duality summary
Lagrangian:
    L(w, α) = (1/2)‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w).
The primal maximum margin problem was
    P(w) = sup_{α ≥ 0} L(w, α) = sup_{α ≥ 0} [ (1/2)‖w‖₂² + Σ_{i=1}^n α_i (1 − y_i x_i^T w) ].
Dual problem:
    D(α) = min_{w ∈ R^d} L(w, α) = Σ_{i=1}^n α_i − (1/2) ‖Σ_{i=1}^n α_i y_i x_i‖₂².
Given a dual optimum α:
- the corresponding primal optimum is w = Σ_{i=1}^n α_i y_i x_i;
- strong duality holds: P(w) = D(α);
- α_i > 0 implies y_i x_i^T w = 1, and these y_i x_i are support vectors.
20 / 37
SVM soft-margin dual
Similarly,
    L(w, ξ, α) = (1/2)‖w‖₂² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n α_i (1 − ξ_i − y_i x_i^T w)   (Lagrangian),
    P(w, ξ) = sup_{α ≥ 0} L(w, ξ, α)   (primal)
            = (1/2)‖w‖₂² + C Σ_{i=1}^n ξ_i if ∀i, 1 − ξ_i − y_i x_i^T w ≤ 0, and ∞ otherwise,
    D(α) = min_{w ∈ R^d, ξ ∈ R^n_{≥0}} L(w, ξ, α)   (dual),
yielding the dual problem
    max_{α ∈ R^n, 0 ≤ α_i ≤ C} Σ_{i=1}^n α_i − (1/2) ‖Σ_{i=1}^n α_i y_i x_i‖₂².
Remarks.
- The dual solution α still gives the primal solution w = Σ_{i=1}^n α_i y_i x_i.
- We can take C → ∞ to recover the hard-margin case.
- The dual is still a constrained concave quadratic (used in many solvers).
- Some presentations include a bias in the primal (x_i^T w + b); this introduces a constraint Σ_{i=1}^n y_i α_i = 0 in the dual.
21 / 37
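To make the dual concrete, here is a hedged sketch of projected gradient ascent on the box-constrained dual above (in the spirit of, but not necessarily identical to, the homework 2 solver; step size and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_margin_dual_ascent(X, y, C=1.0, lr=1e-3, iters=2000):
    """Maximize sum_i alpha_i - (1/2)||sum_i alpha_i y_i x_i||^2 over 0 <= alpha <= C."""
    Q = (y[:, None] * X) @ (y[:, None] * X).T       # Q_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(len(y))
    for _ in range(iters):
        grad = 1.0 - Q @ alpha                      # gradient of the concave dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, C)  # project onto the box [0, C]^n
    w = (alpha * y) @ X                             # recover the primal solution
    return w, alpha                                 # support vectors: alpha_i > 0
```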
Nonlinear SVM: feature mapping annoying in the primal?
SVM hard-margin primal, with a feature mapping φ : R^d → R^p:
    min { (1/2)‖w‖² : w ∈ R^p, ∀i, y_i φ(x_i)^T w ≥ 1 }.
Now the search space has p dimensions; potentially p ≫ d.
Can we do better?
22 / 37
Feature mapping in the dual
SVM hard-margin dual, with a feature mapping φ : R^d → R^p:
    max_{α_1,…,α_n ≥ 0} Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j).
Given a dual optimum α, since w = Σ_{i=1}^n α_i y_i φ(x_i), we can predict on a future x with
    x ↦ φ(x)^T w = Σ_{i=1}^n α_i y_i φ(x)^T φ(x_i).
- The dual form never needs φ(x) ∈ R^p, only inner products φ(x)^T φ(x_i) ∈ R.
- Kernel trick: replace every φ(x)^T φ(x′) with a kernel evaluation k(x, x′).
  Sometimes k(·, ·) is much cheaper to compute than φ(x)^T φ(x′).
- This idea started with the SVM, but appears in many other places.
- Downside: implementations usually store the Gram matrix G ∈ R^{n×n}, where G_ij := k(x_i, x_j).
23 / 37
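A hedged sketch of the kernelized version: the projected ascent from the soft-margin dual carries over with x_i^T x_j replaced by a precomputed Gram matrix, and prediction only ever evaluates k:

```python
import numpy as np

def kernel_dual_ascent(G, y, C=1.0, lr=1e-3, iters=2000):
    """Projected ascent on the dual with Gram matrix G_ij = k(x_i, x_j)."""
    Q = (y[:, None] * y[None, :]) * G
    alpha = np.zeros(len(y))
    for _ in range(iters):
        alpha = np.clip(alpha + lr * (1.0 - Q @ alpha), 0.0, C)
    return alpha

def kernel_predict(alpha, y, X_train, x, k):
    """x -> sum_i alpha_i y_i k(x, x_i); the sign is the predicted label."""
    return sum(a * yi * k(x, xi) for a, yi, xi in zip(alpha, y, X_train))
```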
Kernel example: affine features
Affine features: φ : R^d → R^{1+d}, where
    φ(x) = (1, x_1, …, x_d).
Kernel form:
    φ(x)^T φ(x′) = 1 + x^T x′.
24 / 37
Kernel example: quadratic features
Consider the re-normalized quadratic features φ : R^d → R^{1+2d+d(d−1)/2}, where
    φ(x) = (1, √2 x_1, …, √2 x_d, x_1², …, x_d², √2 x_1 x_2, …, √2 x_1 x_d, …, √2 x_{d−1} x_d).
Just writing this down takes time O(d²). Meanwhile,
    φ(x)^T φ(x′) = (1 + x^T x′)²,
which takes time O(d).
Tweaks:
- What if we change the exponent “2”?
- What if we replace the additive “1” with 0?
25 / 37
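The identity φ(x)^T φ(x′) = (1 + x^T x′)² is easy to spot-check numerically; a hedged sketch:

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Re-normalized quadratic features: dimension 1 + 2d + d(d-1)/2."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

x, z = np.random.randn(5), np.random.randn(5)
assert np.isclose(phi_quadratic(x) @ phi_quadratic(z), (1 + x @ z) ** 2)
```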
RBF kernel
For any σ > 0, there is an infinite feature expansion φ : R^d → R^∞ such that
    φ(x)^T φ(x′) = exp(−‖x − x′‖₂² / (2σ²)),
which can be computed in O(d) time.
This is called a Gaussian kernel or RBF kernel.
It has some similarities to nearest neighbor methods (later lecture).
φ maps to an infinite-dimensional space, but we never need to compute φ itself.
26 / 37
Defining kernels without φ
A (positive definite) kernel function k : X × X → R is a symmetric function such that for any n and any data examples (x_i)_{i=1}^n, the corresponding Gram matrix G ∈ R^{n×n} with G_ij := k(x_i, x_j) is positive semi-definite.
- There is a ton of theory about this formalism; e.g., keywords: RKHS, representer theorem, Mercer's theorem.
- Given any such k, there always exists a corresponding φ.
- This definition ensures the SVM dual is still concave.
27 / 37
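This definition can be sanity-checked empirically: draw random data, build the Gram matrix, and inspect its eigenvalues. A hedged sketch with the RBF kernel:

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

X = np.random.randn(20, 3)
G = np.array([[rbf(a, b) for b in X] for a in X])   # Gram matrix
assert np.allclose(G, G.T)                          # symmetric
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)      # positive semi-definite
```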
[Figure: four contour plots of SVM decision values on two-dimensional data; panels: source data; quadratic SVM; RBF SVM (σ = 1); RBF SVM (σ = 0.1).]
28 / 37
Summary for SVM
- Hard-margin SVM.
- Soft-margin SVM.
- SVM duality.
- Nonlinear SVM: kernels.
29 / 37
(Appendix.)
30 / 37
Soft-margin dual derivation
Let's derive the final dual form:
    L(w, ξ, α) = (1/2)‖w‖₂² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n α_i (1 − ξ_i − y_i x_i^T w),
    D(α) = min_{w ∈ R^d, ξ ∈ R^n_{≥0}} L(w, ξ, α), with dual problem max_{0 ≤ α_i ≤ C} Σ_{i=1}^n α_i − (1/2) ‖Σ_{i=1}^n α_i y_i x_i‖₂².
Given α and ξ, the minimizing w is still w = Σ_{i=1}^n α_i y_i x_i; plugging in,
    D(α) = min_{ξ ∈ R^n_{≥0}} L(Σ_{i=1}^n α_i y_i x_i, ξ, α)
         = Σ_{i=1}^n α_i − (1/2) ‖Σ_{i=1}^n α_i y_i x_i‖₂² + min_{ξ ∈ R^n_{≥0}} Σ_{i=1}^n ξ_i (C − α_i).
The goal is to maximize D; if α_i > C, then ξ_i ↑ ∞ gives D(α) = −∞. Otherwise, the minimum is attained at ξ_i = 0. Therefore the dual problem is
    max_{α ∈ R^n, 0 ≤ α_i ≤ C} Σ_{i=1}^n α_i − (1/2) ‖Σ_{i=1}^n α_i y_i x_i‖₂².
31 / 37
Gaussian kernel feature expansion
First consider d = 1, meaning φ : R → R^∞.
What φ satisfies φ(x)^T φ(y) = e^{−(x−y)²/(2σ²)}?
Reverse engineer using a Taylor expansion:
    e^{−(x−y)²/(2σ²)} = e^{−x²/(2σ²)} · e^{−y²/(2σ²)} · e^{xy/σ²}
                      = e^{−x²/(2σ²)} · e^{−y²/(2σ²)} · Σ_{k=0}^∞ (1/k!) (xy/σ²)^k.
So let
    φ(x) := e^{−x²/(2σ²)} (1, x/σ, (1/√(2!)) (x/σ)², (1/√(3!)) (x/σ)³, …).
How to handle d > 1?
    e^{−‖x−y‖²/(2σ²)} = e^{−‖x‖²/(2σ²)} · e^{−‖y‖²/(2σ²)} · e^{x^T y/σ²}
                      = e^{−‖x‖²/(2σ²)} · e^{−‖y‖²/(2σ²)} · Σ_{k=0}^∞ (1/k!) (x^T y/σ²)^k.
32 / 37
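For d = 1 the expansion can be verified by truncating the series; a hedged sketch (30 terms is an arbitrary truncation):

```python
import numpy as np
from math import factorial

def phi_gauss(x, sigma=1.0, terms=30):
    """Truncation of the infinite Gaussian-kernel feature map (d = 1)."""
    return np.exp(-x ** 2 / (2 * sigma ** 2)) * np.array(
        [(x / sigma) ** k / np.sqrt(factorial(k)) for k in range(terms)])

x, y, sigma = 0.7, -0.3, 1.0
exact = np.exp(-(x - y) ** 2 / (2 * sigma ** 2))
assert np.isclose(phi_gauss(x) @ phi_gauss(y), exact)   # truncation converges quickly
```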
Kernel example: products of all subsets of coordinates
Consider φ : R^d → R^{2^d}, where
    φ(x) = (∏_{i∈S} x_i)_{S ⊆ {1,2,…,d}}.
It takes time O(2^d) just to write this down. A kernel evaluation takes time O(d):
    φ(x)^T φ(x′) = ∏_{i=1}^d (1 + x_i x_i′).
33 / 37
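Again the O(d) kernel form can be checked against the O(2^d) feature map for small d; a hedged sketch:

```python
import numpy as np
from itertools import combinations

def phi_subsets(x):
    """One coordinate per subset S of {1,...,d}: the product of x_i over i in S."""
    d = len(x)
    return np.array([np.prod(x[list(S)])            # empty product = 1
                     for r in range(d + 1)
                     for S in combinations(range(d), r)])

x, z = np.random.randn(4), np.random.randn(4)
assert np.isclose(phi_subsets(x) @ phi_subsets(z), np.prod(1 + x * z))
```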
Other kernels
Suppose k_1 and k_2 are positive definite kernel functions.
1. Does k(x, y) := k_1(x, y) + k_2(x, y) define a positive definite kernel?
2. Does k(x, y) := c · k_1(x, y) (for c ≥ 0) define a positive definite kernel?
3. Does k(x, y) := k_1(x, y) · k_2(x, y) define a positive definite kernel?
Another approach: random features, k(x, x′) = E_w F(w, x)^T F(w, x′) for some F; we will revisit this with deep networks and the neural tangent kernel (NTK).
34 / 37
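All three answers are "yes" (kernels are closed under sums, nonnegative scaling, and products; the product case is the Schur product theorem). A hedged numerical spot check on random data:

```python
import numpy as np

def is_psd(G, tol=1e-10):
    return np.all(np.linalg.eigvalsh(G) >= -tol)

k1 = lambda a, b: a @ b                                 # linear kernel
k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2)     # RBF kernel
X = np.random.randn(15, 3)
G1 = np.array([[k1(a, b) for b in X] for a in X])
G2 = np.array([[k2(a, b) for b in X] for a in X])
assert is_psd(G1 + G2)        # sum of kernels
assert is_psd(3.0 * G1)       # nonnegative scaling
assert is_psd(G1 * G2)        # elementwise (Hadamard) product
```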
Kernel ridge regression
Kernel ridge regression:
    min_w (1/(2n)) ‖Xw − y‖² + (λ/2) ‖w‖².
Solution:
    w = (X^T X + λnI)^{−1} X^T y.
Linear algebra fact:
    (X^T X + λnI)^{−1} X^T y = X^T (XX^T + λnI)^{−1} y.
Therefore predict with
    x ↦ x^T w = (Xx)^T (XX^T + λnI)^{−1} y = Σ_{i=1}^n (x_i^T x) [ (XX^T + λnI)^{−1} y ]_i.
Kernel approach:
- Compute α := (G + λnI)^{−1} y, where G ∈ R^{n×n} is the Gram matrix: G_ij = k(x_i, x_j).
- Predict with x ↦ Σ_{i=1}^n α_i k(x_i, x).
35 / 37
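A hedged sketch of the kernel approach in numpy (function names are illustrative):

```python
import numpy as np

def krr_fit(G, y, lam):
    """alpha := (G + lambda*n*I)^{-1} y for Gram matrix G_ij = k(x_i, x_j)."""
    n = len(y)
    return np.linalg.solve(G + lam * n * np.eye(n), y)

def krr_predict(alpha, X_train, x, k):
    """x -> sum_i alpha_i k(x_i, x)."""
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train))
```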
Multiclass SVM
There are a few ways; one is to use one-against-all, as in lecture.
Many researchers have proposed various multiclass SVM methods, but some of them can be shown to fail in trivial cases.
36 / 37
Supplemental reading
- Shalev-Shwartz/Ben-David: chapter 15.
- Murphy: chapter 14.
37 / 37