Lecture 3: Support Vector Machines
Tuo Zhao
Schools of ISYE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Support Vector Machines
Numerous books, papers, and tutorials... Once upon a time, SVM was considered the ultimate recipe for classification and regression.
Tuo Zhao — Lecture 3: Support Vector Machines 2/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Support Vector Machines
SVMs have somewhat faded from the spotlight, but they remain an important chapter in any elementary machine learning textbook.
Tuo Zhao — Lecture 3: Support Vector Machines 3/47
Margins: Intuition
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Margins: Intuition
Consider a standard linear classification problem:
Feature: X ∈ R^d
Response: Y ∈ {−1, 1}
Linear Classifier: Y = sign(X^⊤θ)
Our confidence is represented by |X^⊤θ|
Logistic Regression: P(Y = 1) = 1 / (1 + exp(−X^⊤θ)).
Tuo Zhao — Lecture 3: Support Vector Machines 5/47
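To make the notion of confidence concrete, here is a minimal numpy sketch (θ and x are illustrative values, not from the lecture) showing how the score X^⊤θ drives both the sign prediction and the logistic probability.

```python
import numpy as np

# Minimal sketch: confidence of a linear classifier vs. logistic probability.
# theta and the test points are illustrative values.
theta = np.array([2.0, -1.0])

def predict(x, theta):
    score = x @ theta                         # X^T theta
    label = np.sign(score)                    # linear classifier: sign(X^T theta)
    confidence = np.abs(score)                # |X^T theta| as a notion of confidence
    prob_pos = 1.0 / (1.0 + np.exp(-score))   # logistic model: P(Y = 1)
    return label, confidence, prob_pos

for x in [np.array([3.0, 0.5]), np.array([0.2, 0.3])]:
    print(predict(x, theta))   # larger |score| -> probability closer to 0 or 1
```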
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Margin: Intuition
We have three test samples: A, B, and C.
For which one should we be most confident in the classification?
[Figure: x's denote positive training examples, o's denote negative training examples; the decision boundary θ^⊤x = 0 (the separating hyperplane) is shown together with three labeled points A, B, and C.]
Informally, we would like y_i x_i^⊤θ to be large and positive for every training example, since this reflects a confident and correct classification; this idea is formalized by the functional margin. Geometrically, point A is far from the decision boundary, so we should be quite confident that y = 1 there. Point C is very close to the boundary: although it lies on the side where we predict y = 1, a small change to the boundary could easily flip the prediction. Point B lies in between. In general, the farther a point is from the separating hyperplane, the more confident we are in its prediction, so we would like a decision boundary that classifies all training examples both correctly and with a large margin. This is formalized by the geometric margin.
Tuo Zhao — Lecture 3: Support Vector Machines 6/47
Geometric Margin
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Geometric Margin
Functional margin: γ_i = y_i (x_i^⊤ θ).
Scale Invariance: enforce ‖θ‖₂ = 1.
Minimum functional margin: γ_min(θ) = min_i y_i (x_i^⊤ θ).
Maximize the minimum margin:
θ̂ = argmax_θ γ_min(θ) subject to ‖θ‖₂ = 1.
Tuo Zhao — Lecture 3: Support Vector Machines 8/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Convexification
Maximize the minimum margin:
θ̂ = argmax_{θ,γ} γ subject to ‖θ‖₂ = 1, y_i (x_i^⊤ θ) ≥ γ, i = 1, ..., n.
The constraint ‖θ‖₂ = 1 is nonconvex.
Rescale the margin:
θ̂ = argmax_{θ,γ} γ / ‖θ‖₂ subject to y_i (x_i^⊤ θ) ≥ γ, i = 1, ..., n.
Tuo Zhao — Lecture 3: Support Vector Machines 9/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Convexification
The scale of γ does not matter: enforce γ = 1.
Equivalent to minimizing ‖θ‖₂:
θ̂ = argmin_θ (1/2) ‖θ‖₂²
subject to 1 − y_i (x_i^⊤ θ) ≤ 0, i = 1, ..., n.
Quadratic Programming with Linear Constraints
Efficient Optimization Solvers
Tuo Zhao — Lecture 3: Support Vector Machines 10/47
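Since the hard-margin problem is a quadratic program with linear constraints, an off-the-shelf convex solver handles it directly. Below is a hedged sketch using cvxpy on toy separable data; the data, variable names, and solver choice are illustrative assumptions, not part of the lecture.

```python
import numpy as np
import cvxpy as cp

# Hard-margin SVM as a QP: minimize (1/2)||theta||_2^2 s.t. y_i x_i^T theta >= 1.
# Toy separable data; assumes cvxpy is installed (any QP solver would do).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

theta = cp.Variable(2)
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)),
                  [cp.multiply(y, X @ theta) >= 1])
prob.solve()
# With the functional margin normalized to 1, the geometric margin is 1/||theta||_2.
print("theta =", theta.value, "margin =", 1 / np.linalg.norm(theta.value))
```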
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Optimization Software
Existing Convex Programming Solvers:
CPLEX: IBM ILOG CPLEX Optimization Studio.
Gurobi: Zonghao Gu, Edward Rothberg and Robert Bixby.
Programming Language: C++, Java, Python, MATLAB, R
Modeling Language: AIMMS, AMPL, GAMS, MPL
Free Academic Licenses.
Tuo Zhao — Lecture 3: Support Vector Machines 11/47
Lagrangian Duality
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Lagrangian Multiplier Method
Dual variables: λ_i ≥ 0, i = 1, ..., n.
L(θ, λ) = (1/2) ‖θ‖₂² + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ θ)).
Min-Max problem:
min_θ max_{λ≥0} L(θ, λ).
Strong duality:
min_θ max_{λ≥0} L(θ, λ) = max_{λ≥0} min_θ L(θ, λ).
Tuo Zhao — Lecture 3: Support Vector Machines 13/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Penalty of Constraint Violation
Min-Max Problem:
min_θ max_{λ≥0} (1/2) ‖θ‖₂² + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ θ)).
Given (1 − y_i (x_i^⊤ θ)) ≤ 0,
max_{λ_i ≥ 0} λ_i (1 − y_i (x_i^⊤ θ)) = 0.
Given (1 − y_i (x_i^⊤ θ)) > 0,
max_{λ_i ≥ 0} λ_i (1 − y_i (x_i^⊤ θ)) = +∞.
Tuo Zhao — Lecture 3: Support Vector Machines 14/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Convex Concave Saddle Point Problem
Tuo Zhao — Lecture 3: Support Vector Machines 15/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Karush-Kuhn-Tucker Condition
KKT Condition:
∂L(θ, λ)/∂θ |_{θ=θ̂} = 0, ∂L(θ, λ)/∂λ |_{λ=λ̂} = 0.
Complementary Slackness:
λ̂_i = 0 ⇒ 1 − y_i x_i^⊤ θ̂ ≤ 0,
λ̂_i > 0 ⇒ 1 − y_i x_i^⊤ θ̂ = 0.
λ̂_i > 0: support vector.
Tuo Zhao — Lecture 3: Support Vector Machines 16/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Dual Problem
Solve the minimization problem:
∂L(θ, λ)/∂θ = 0 ⇒ θ = Σ_{i=1}^n λ_i y_i x_i.
Maximization problem:
max_{λ≥0} L(Σ_{i=1}^n λ_i y_i x_i, λ)
= max_{λ≥0} (1/2) ‖Σ_{i=1}^n λ_i y_i x_i‖₂² + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ Σ_{k=1}^n λ_k y_k x_k)).
Tuo Zhao — Lecture 3: Support Vector Machines 17/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Simplification
‖Σ_{i=1}^n λ_i y_i x_i‖₂² = (Σ_{i=1}^n λ_i y_i x_i^⊤)(Σ_{k=1}^n λ_k y_k x_k) = Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^⊤ x_k,
Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ Σ_{k=1}^n λ_k y_k x_k)) = Σ_{i=1}^n λ_i − Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^⊤ x_k.
Tuo Zhao — Lecture 3: Support Vector Machines 18/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Dual Problem
After simplification, we have
λ̂ = argmax_{λ≥0} Σ_{i=1}^n λ_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^⊤ x_k.
We define Q by Q_ik = y_i y_k x_i^⊤ x_k.
Quadratic Minimization:
λ̂ = argmin_λ (1/2) λ^⊤ Q λ − λ^⊤ 1
subject to λ_i ≥ 0, i = 1, ..., n.
Tuo Zhao — Lecture 3: Support Vector Machines 19/47
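The same toy setup can be used to solve the dual directly. The sketch below (again using cvxpy; all names and data are illustrative) exploits the identity λ^⊤Qλ = ‖Σ_i λ_i y_i x_i‖₂² and recovers θ̂ = Σ_i λ̂_i y_i x_i from the optimal multipliers.

```python
import numpy as np
import cvxpy as cp

# Dual hard-margin SVM: min_{lambda >= 0} (1/2) lambda^T Q lambda - 1^T lambda,
# with Q_ik = y_i y_k x_i^T x_k. A sketch on toy separable data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)), rng.normal([-2, -2], 0.3, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

Z = y[:, None] * X                            # rows are y_i x_i
lam = cp.Variable(len(y))
# lambda^T Q lambda = || sum_i lambda_i y_i x_i ||_2^2 = ||Z^T lambda||_2^2
dual = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(Z.T @ lam) - cp.sum(lam)),
                  [lam >= 0])
dual.solve()
theta = Z.T @ lam.value                       # theta = sum_i lambda_i y_i x_i
print("support vectors:", np.flatnonzero(lam.value > 1e-6), "theta =", theta)
```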
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Support Vector Machines
Let S denote the set of support vectors. Given x∗, we predict
ŷ* = sign(θ̂^⊤ x*) = sign(Σ_{i=1}^n λ̂_i y_i x_i^⊤ x*) = sign(Σ_{i∈S} λ̂_i y_i x_i^⊤ x*).
The support vectors are exactly the training points with the smallest margin, i.e., those lying on the two hyperplanes y_i x_i^⊤ θ̂ = 1 parallel to the decision boundary; only their multipliers λ̂_i are nonzero at the optimal solution. The number of support vectors is often much smaller than the size of the training set. Note also that both the dual problem and the prediction rule depend on the data only through inner products x_i^⊤ x_k, which is the key fact behind the kernel trick.
Tuo Zhao — Lecture 3: Support Vector Machines 20/47
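A small sketch of this prediction rule: only the support vectors (indices with λ̂_i > 0) contribute to sign(Σ_{i∈S} λ̂_i y_i x_i^⊤ x*). The numbers below are placeholder values standing in for a solved dual.

```python
import numpy as np

# Predicting with support vectors only: y* = sign(sum_{i in S} lambda_i y_i x_i^T x*).
# lam, X, y would normally come from a solved dual; these values are placeholders.
lam = np.array([0.0, 0.7, 0.0, 1.3])
X = np.array([[2.0, 1.0], [1.0, 1.0], [-2.0, -1.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

S = np.flatnonzero(lam > 1e-8)                    # indices of support vectors

def predict(x_new):
    return np.sign(np.sum(lam[S] * y[S] * (X[S] @ x_new)))

print(predict(np.array([1.5, 0.5])), predict(np.array([-1.0, -0.2])))
```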
SVM with Soft Margin
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Linearly Separable and Non-separable
The derivation of the SVM so far assumed that the data are linearly separable. Mapping the data to a high-dimensional feature space via φ generally increases the likelihood that they are separable, but we cannot guarantee it. Moreover, a strictly separating hyperplane may not be what we want, since it is susceptible to outliers: adding a single outlier can cause the decision boundary to swing dramatically and the resulting margin to shrink.
[Figure: an optimal margin classifier on separable data (left); adding one outlier in the upper-left region swings the boundary and yields a much smaller margin (right).]
To make the algorithm work for non-linearly separable datasets and to be less sensitive to outliers, the optimization is reformulated with slack variables (an ℓ₁ penalty on the slacks):
min_{w,b,ξ} (1/2) ‖w‖² + C Σ_{i=1}^m ξ_i
subject to y_i (w^⊤ x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., m.
Examples are now permitted to have functional margin less than 1; an example with functional margin 1 − ξ_i (with ξ_i > 0) increases the objective by C ξ_i. The parameter C controls the relative weighting between keeping ‖w‖² small (which makes the margin large) and ensuring that most examples have functional margin at least 1.
Tuo Zhao — Lecture 3: Support Vector Machines 22/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Soft Margin
Slack variables: ξ_i ≥ 0, i = 1, ..., n.
θ̂ = argmin_{θ,ξ} (1/2) ‖θ‖₂² + C Σ_{i=1}^n ξ_i
subject to 1 − y_i (x_i^⊤ θ) − ξ_i ≤ 0, ξ_i ≥ 0, i = 1, ..., n,
where C > 0 is a regularization parameter.
Lagrangian Form: Given λ ≥ 0 and β ≥ 0,
L(θ, ξ, λ, β) = (1/2) ‖θ‖₂² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ θ) − ξ_i) − Σ_{i=1}^n β_i ξ_i.
Tuo Zhao — Lecture 3: Support Vector Machines 23/47
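In practice this QP is rarely solved by hand; for instance, scikit-learn's LinearSVC minimizes the same hinge-loss objective. A hedged usage sketch (the dataset and the specific C values are illustrative; fit_intercept=False matches the lecture's model without a bias term):

```python
from sklearn.svm import LinearSVC
from sklearn.datasets import make_blobs

# Soft-margin SVM in practice: C trades off margin width against slack (hinge violations).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y - 1                                     # relabel {0, 1} -> {-1, +1}

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C, loss="hinge", fit_intercept=False, max_iter=10000)
    clf.fit(X, y)
    print(C, clf.score(X, y))                     # larger C -> fewer margin violations
```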
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Dual Formulation
Quadratic Minimization:
λ̂ = argmin_λ (1/2) λ^⊤ Q λ − λ^⊤ 1
subject to 0 ≤ λ_i ≤ C, i = 1, ..., n.
KKT condition ⇒ Complementary Slackness:
λ̂_i = 0 ⇒ y_i x_i^⊤ θ̂ ≥ 1,
0 < λ̂_i < C ⇒ y_i x_i^⊤ θ̂ = 1,
λ̂_i = C ⇒ y_i x_i^⊤ θ̂ ≤ 1.
λ̂_i > 0: support vector.
Tuo Zhao — Lecture 3: Support Vector Machines 24/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Soft Margin
Tuo Zhao — Lecture 3: Support Vector Machines 25/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Hinge Loss Function
Tuo Zhao — Lecture 3: Support Vector Machines 26/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Unconstrained Formulation
SVM with soft margin is equivalent to
θ̂ = argmin_{θ∈R^d} C Σ_{i=1}^n max{1 − y_i x_i^⊤ θ, 0} + (1/2) ‖θ‖₂².
Logistic loss: f(z) = log(1 + exp(−z))
Squared Hinge Loss: f(z) = max{1 − z, 0}²
Huberized Hinge Loss: f(z) = 1/2 − z if z ≤ 0; (1/2)(z − 1)² if 0 < z ≤ 1; 0 otherwise.
0-1 Loss: f(z) = 1(z ≤ 0)
Tuo Zhao — Lecture 3: Support Vector Machines 27/47
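The surrogate losses above are easy to compare numerically as functions of the margin z = y x^⊤θ. A small sketch (the function names are my own):

```python
import numpy as np

# The surrogate losses on this slide, as functions of the margin z = y * x^T theta.
def hinge(z):        return np.maximum(1 - z, 0)
def sq_hinge(z):     return np.maximum(1 - z, 0) ** 2
def logistic(z):     return np.log1p(np.exp(-z))
def huber_hinge(z):
    return np.where(z <= 0, 0.5 - z,
           np.where(z <= 1, 0.5 * (z - 1) ** 2, 0.0))
def zero_one(z):     return (z <= 0).astype(float)

z = np.linspace(-2, 3, 7)
for f in (hinge, sq_hinge, logistic, huber_hinge, zero_one):
    print(f.__name__, np.round(f(z), 3))
```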
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Huberized Hinge Loss
Tuo Zhao — Lecture 3: Support Vector Machines 28/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Nonsmooth (Stochastic) Optimization
Given a convex function L(θ), the subdifferential of L at θ is a set of vectors, denoted ∂L(θ), such that for any θ, θ′ ∈ R^d,
L(θ′) ≥ L(θ) + ζ^⊤(θ′ − θ) for all ζ ∈ ∂L(θ).
f(z) = |z|: ∂f(z) = [−1,+1] at z = 0
f(z) = max{1− z, 0}: ∂f(z) = [−1, 0] at z = 1.
Tuo Zhao — Lecture 3: Support Vector Machines 29/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Nonsmooth (Stochastic) Optimization
We consider a nonsmooth optimization problem
min_θ L(θ), where L(θ) = (1/n) Σ_{i=1}^n ℓ_i(θ).
Subgradient Algorithm: At the (t+1)-th iteration, we take
θ^(t+1) = θ^(t) − η_t ζ^(t), where ζ^(t) ∈ ∂L(θ^(t)).
Stochastic Variant: At the (t+1)-th iteration, we randomly select i from {1, ..., n} and take
θ^(t+1) = θ^(t) − η_t ζ^(t), where ζ^(t) ∈ ∂ℓ_i(θ^(t)).
Tuo Zhao — Lecture 3: Support Vector Machines 30/47
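Applied to the soft-margin SVM objective, the stochastic variant gives a Pegasos-style solver. The sketch below uses η_t = 1/t (matching μ = 1 for the (1/2)‖θ‖₂² term) and the (t+1)-weighted average appearing on the next slide; the data and constants are illustrative assumptions.

```python
import numpy as np

# Stochastic subgradient sketch for the soft-margin SVM objective
#   C * sum_i max{1 - y_i x_i^T theta, 0} + (1/2)||theta||_2^2,
# written as (1/n) sum_i l_i(theta) with l_i(theta) = n*C*max{...} + (1/2)||theta||^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (50, 2)),
               rng.normal([-1.5, -1.5], 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
C, n = 1.0, len(y)

theta = np.zeros(2)
theta_bar, wsum = np.zeros(2), 0.0
for t in range(1, 2001):
    i = rng.integers(n)
    margin = y[i] * (X[i] @ theta)
    # a subgradient of l_i at theta: theta - n*C*y_i*x_i if the hinge is active, else theta
    g = theta - (n * C * y[i] * X[i] if margin < 1 else 0.0)
    theta -= (1.0 / t) * g
    theta_bar, wsum = theta_bar + (t + 1) * theta, wsum + (t + 1)
theta_bar /= wsum
print("averaged theta:", theta_bar,
      "train accuracy:", np.mean(np.sign(X @ theta_bar) == y))
```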
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Stochastic Subgradient Algorithms
How many iterations do we need?
A sequence of decreasing step size parameters: η_t ≍ 1/(tμ).
‖ζ‖₂ ≤ M for any ζ ∈ ∂L(θ).
Given a pre-specified error ε, we need
T = O(M² / (μ²ε))
iterations such that
E‖θ̄^(T) − θ̂‖₂² ≤ ε, where θ̄^(T) = 2 Σ_{t=0}^T (t+1) θ^(t) / ((T+1)(T+2)).
Slower than gradient descent for the squared hinge loss SVM.
Tuo Zhao — Lecture 3: Support Vector Machines 31/47
SVM with Kernel
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Nonlinear SVM
Feature Mapping: φ(x) = (φ₁(x), ..., φ_m(x))^⊤.
SVM with Feature Mapping:
θ̂ = argmin_{θ∈R^m} C Σ_{i=1}^n max{1 − y_i φ(x_i)^⊤ θ, 0} + (1/2) ‖θ‖₂².
Dual Problem:
λ̂ = argmin_{0 ≤ λ ≤ C1} (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k φ(x_i)^⊤ φ(x_k) − λ^⊤ 1.
Tuo Zhao — Lecture 3: Support Vector Machines 33/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Feature Mapping
Tuo Zhao — Lecture 3: Support Vector Machines 34/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Kernel Trick
Kernel Function: K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
Dual Problem:
λ̂ = argmin_{0 ≤ λ ≤ C1} (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k K(x_i, x_k) − λ^⊤ 1.
Gaussian/RBF Kernel (infinite-dimensional mapping): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖₂²).
Polynomial Kernel (finite-dimensional mapping): K(x_i, x_j) = (1 + x_i^⊤ x_j)^r.
Tuo Zhao — Lecture 3: Support Vector Machines 35/47
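A small numpy sketch of the two kernels on this slide, computing the Gram matrix K that replaces x_i^⊤x_k in the dual; the values of γ and r are illustrative hyperparameters.

```python
import numpy as np

# Kernel functions from this slide; gamma and r are illustrative hyperparameters.
def rbf_kernel(Xi, Xj, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||_2^2), computed for all pairs of rows
    sq = np.sum(Xi**2, 1)[:, None] + np.sum(Xj**2, 1)[None, :] - 2 * Xi @ Xj.T
    return np.exp(-gamma * sq)

def poly_kernel(Xi, Xj, r=3):
    # K(x, z) = (1 + x^T z)^r
    return (1 + Xi @ Xj.T) ** r

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel(X, X))       # Gram matrix with K_ij = K(x_i, x_j)
print(poly_kernel(X, X))
# These Gram matrices replace x_i^T x_k in the dual, so no explicit phi(x) is needed.
```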
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
More on Kernels
Reproducing Kernel Hilbert Space
Representer Theorem
Mercer Theorem
K ⪰ 0, where K_ij = K(x_i, x_j).
Nonparametric SVM:
f̂ = argmin_{f∈H} C Σ_{i=1}^n max{1 − y_i f(x_i), 0} + (1/2) ‖f‖²_H.
Tuo Zhao — Lecture 3: Support Vector Machines 36/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Randomized Coordinate Minimization
Solve the dual problem:
λ̂ = argmin_λ (1/2) λ^⊤ Q λ − λ^⊤ 1
subject to 0 ≤ λ_i ≤ C, i = 1, ..., n.
Minimize over λ_j with λ_1, ..., λ_{j−1}, λ_{j+1}, ..., λ_n fixed.
At the (t+1)-th iteration, we select i ∈ {1, ..., n} and take
λ_i^(t+1) = argmin_{0 ≤ λ_i ≤ C} (1/2) Q_ii λ_i² + λ_i Σ_{j≠i} Q_ij λ_j^(t) − λ_i,
and λ_j^(t+1) = λ_j^(t) for all j ≠ i.
Tuo Zhao — Lecture 3: Support Vector Machines 37/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Computational Cost per Iteration
An analytical updating formula:
λ_i^(t+1) = argmin_{0 ≤ λ_i ≤ C} (1/2) Q_ii λ_i² + λ_i Σ_{j≠i} Q_ij λ_j^(t) − λ_i
= C if λ̃_i ≥ C; 0 if λ̃_i ≤ 0; λ̃_i otherwise,
where λ̃_i = Q_ii^{−1} (1 − Σ_{j≠i} Q_ij λ_j^(t)).
What is the computational cost per iteration? O(n)
Better than gradient descent? O(n2)
Tuo Zhao — Lecture 3: Support Vector Machines 38/47
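Putting the last two slides together, here is a hedged sketch of randomized coordinate minimization for the box-constrained dual on toy data (all names and constants are illustrative); each update reads one row of Q, i.e., O(n) work.

```python
import numpy as np

# Randomized coordinate minimization for the dual (1/2) lam^T Q lam - 1^T lam,
# 0 <= lam_i <= C, using the clipped analytic update from this slide.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (30, 2)), rng.normal([-2, -2], 0.5, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])
Q = np.outer(y, y) * (X @ X.T)
n, C = len(y), 1.0

lam = np.zeros(n)
for t in range(5000):
    i = rng.integers(n)
    # unconstrained minimizer over lam_i: (1 - sum_{j != i} Q_ij lam_j) / Q_ii, then clip
    resid = Q[i] @ lam - Q[i, i] * lam[i]
    lam[i] = np.clip((1.0 - resid) / Q[i, i], 0.0, C)

theta = (lam * y) @ X                      # recover theta = sum_i lam_i y_i x_i
print("support vectors:", np.flatnonzero(lam > 1e-6).size, "theta =", theta)
```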
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Geometric Illustration
Optimal Solution
Tuo Zhao — Lecture 3: Support Vector Machines 39/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Coordinate Strong Smoothness
There exists a constant L_j such that for any λ and λ′ with λ_i = λ′_i for all i ≠ j, we have
F(λ′) − F(λ) − ∇_j F(λ)(λ′_j − λ_j) ≤ (L_j / 2)(λ′_j − λ_j)².
There exists a constant μ such that for any λ and λ′, we have
F(λ′) − F(λ) − ∇F(λ)^⊤(λ′ − λ) ≥ (μ / 2) ‖λ′ − λ‖₂².
Coordinate vs. global strong smoothness:
L / (nμ) ≤ max_j L_j / μ ≤ L / μ.
Tuo Zhao — Lecture 3: Support Vector Machines 40/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Iteration Complexity
How many iterations do we need?
Given a pre-specified error ε, we need
T = O((n L_max / μ) log(1/ε))
iterations such that
E‖λ^(T) − λ̂‖₂² ≤ ε.
Faster than the gradient descent algorithm? Yes!
Tuo Zhao — Lecture 3: Support Vector Machines 41/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Randomized v.s. Cyclic v.s. Shuffled
Cyclic Order: 1, 2, ..., d, 1, 2, ..., d, ...
Shuffled and Random Shuffled Order
Randomized Order
[Figure 1 (from the cited coordinate descent experiments): relative error (f(x_k) − f*)/(f(x_0) − f*) vs. iterations for cyclic CD (C-CD), cyclic CGD with small step size 1/λ_max(A), randomly permuted CD, randomized CD, and GD, minimizing f(x) = x^⊤ A x with n = 100 and A = A_c, c = 0.8. In these experiments, cyclic CD is typically slower than randomized and randomly permuted CD unless A has small off-diagonal entries.]
Worst Case v.s. Average Case
Tuo Zhao — Lecture 3: Support Vector Machines 42/47
Bias Variance Tradeoff
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Bias Variance Tradeoff
Given a regression model:
Y = f*(X) + ε, where Eε = 0 and Eε² = σ².
Bias-Variance Decomposition:
E_{ε,X}(Y − f̂(X))²
= E_{ε,X}(Y − f*(X) + f*(X) − E_{ε|X} f̂(X) + E_{ε|X} f̂(X) − f̂(X))²
= σ² + E_X(f*(X) − E_{ε|X} f̂(X))² [Bias²] + E_X(f̂(X) − E_{ε|X} f̂(X))² [Variance].
Bias ↑ + Variance ↓ or Bias ↓ + Variance ↑.
Tuo Zhao — Lecture 3: Support Vector Machines 44/47
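The decomposition can be checked by simulation: refit a model many times on fresh noisy samples and average. A sketch with illustrative choices of f*, σ, and polynomial degrees (not from the lecture):

```python
import numpy as np

# Monte Carlo sketch of the bias-variance decomposition for polynomial regression.
rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)
sigma, n, reps = 0.3, 30, 500
x_grid = np.linspace(0, 1, 100)

for degree in (1, 3, 9):
    preds = np.empty((reps, x_grid.size))
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        y = f_star(x) + sigma * rng.normal(size=n)        # Y = f*(X) + eps
        coef = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coef, x_grid)
    bias2 = np.mean((preds.mean(axis=0) - f_star(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```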
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Bias Variance Tradeoff
Tuo Zhao — Lecture 3: Support Vector Machines 45/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Overfitting
Tuo Zhao — Lecture 3: Support Vector Machines 46/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Optimal Model Complexity
Simultaneously control bias and variance!
Next Lecture: Model Selection and Regularization
Tuo Zhao — Lecture 3: Support Vector Machines 47/47