Lecture 3: Support Vector Machines
Tuo Zhao
Schools of ISYE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Support Vector Machines
Numerous books, papers, and tutorials... Once upon a time, SVM was considered the ultimate recipe for classification and regression.
Tuo Zhao — Lecture 3: Support Vector Machines 2/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Support Vector Machines
SVMs have somewhat faded from the spotlight, but they remain an important chapter in any elementary machine learning textbook.
Tuo Zhao — Lecture 3: Support Vector Machines 3/47
Margins: Intuition
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Margins: Intuition
Consider a standard linear classification problem:
Feature: X ∈ R^d
Response: Y ∈ {−1, 1}
Linear Classifier: Y = sign(X^⊤θ)
Our confidence is represented by |X^⊤θ|
Logistic Regression: P(Y = 1) = 1 / (1 + exp(−X^⊤θ)).
Tuo Zhao — Lecture 3: Support Vector Machines 5/47
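To make the notion of confidence concrete, here is a minimal numpy sketch (θ and x are illustrative values, not from the lecture) showing how the score X^⊤θ drives both the sign prediction and the logistic probability.

```python
import numpy as np

# Minimal sketch: confidence of a linear classifier vs. logistic probability.
# theta and the test points are illustrative values.
theta = np.array([2.0, -1.0])

def predict(x, theta):
    score = x @ theta                         # X^T theta
    label = np.sign(score)                    # linear classifier: sign(X^T theta)
    confidence = np.abs(score)                # |X^T theta| as a notion of confidence
    prob_pos = 1.0 / (1.0 + np.exp(-score))   # logistic model: P(Y = 1)
    return label, confidence, prob_pos

for x in [np.array([3.0, 0.5]), np.array([0.2, 0.3])]:
    print(predict(x, theta))   # larger |score| -> probability closer to 0 or 1
```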
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Margin: Intuition
We have three test samples: A, B, and C.
For which one should we be most confident in the classification?
[Figure: x's denote positive training examples, o's denote negative training examples; the decision boundary θ^⊤x = 0 (the separating hyperplane) is shown together with three labeled points A, B, and C.]
Informally, we would like y_i x_i^⊤θ to be large and positive for every training example, since this reflects a confident and correct classification; this idea is formalized by the functional margin. Geometrically, point A is far from the decision boundary, so we should be quite confident that y = 1 there. Point C is very close to the boundary: although it lies on the side where we predict y = 1, a small change to the boundary could easily flip the prediction. Point B lies in between. In general, the farther a point is from the separating hyperplane, the more confident we are in its prediction, so we would like a decision boundary that classifies all training examples both correctly and with a large margin. This is formalized by the geometric margin.
Tuo Zhao — Lecture 3: Support Vector Machines 6/47
Geometric Margin
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Geometric Margin
Functional margin: γ_i = y_i (x_i^⊤ θ).
Scale Invariance: enforce ‖θ‖₂ = 1.
Minimum functional margin: γ_min(θ) = min_i y_i (x_i^⊤ θ).
Maximize the minimum margin:
θ̂ = argmax_θ γ_min(θ) subject to ‖θ‖₂ = 1.
Tuo Zhao — Lecture 3: Support Vector Machines 8/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Convexification
Maximize the minimum margin:
θ̂ = argmax_{θ,γ} γ subject to ‖θ‖₂ = 1, y_i (x_i^⊤ θ) ≥ γ, i = 1, ..., n.
The constraint ‖θ‖₂ = 1 is nonconvex.
Rescale the margin:
θ̂ = argmax_{θ,γ} γ / ‖θ‖₂ subject to y_i (x_i^⊤ θ) ≥ γ, i = 1, ..., n.
Tuo Zhao — Lecture 3: Support Vector Machines 9/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Convexification
The scale of γ does not matter: enforce γ = 1.
Equivalent to minimizing ‖θ‖₂:
θ̂ = argmin_θ (1/2) ‖θ‖₂²
subject to 1 − y_i (x_i^⊤ θ) ≤ 0, i = 1, ..., n.
Quadratic Programming with Linear Constraints
Efficient Optimization Solvers
Tuo Zhao — Lecture 3: Support Vector Machines 10/47
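Since the hard-margin problem is a quadratic program with linear constraints, an off-the-shelf convex solver handles it directly. Below is a hedged sketch using cvxpy on toy separable data; the data, variable names, and solver choice are illustrative assumptions, not part of the lecture.

```python
import numpy as np
import cvxpy as cp

# Hard-margin SVM as a QP: minimize (1/2)||theta||_2^2 s.t. y_i x_i^T theta >= 1.
# Toy separable data; assumes cvxpy is installed (any QP solver would do).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

theta = cp.Variable(2)
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)),
                  [cp.multiply(y, X @ theta) >= 1])
prob.solve()
# With the functional margin normalized to 1, the geometric margin is 1/||theta||_2.
print("theta =", theta.value, "margin =", 1 / np.linalg.norm(theta.value))
```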
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Optimization Software
Existing Convex Programming Solvers:
CPLEX: IBM ILOG CPLEX Optimization Studio.
Gurobi: Zonghao Gu, Edward Rothberg and Robert Bixby.
Programming Language: C++, Java, Python, MATLAB, R
Modeling Language: AIMMS, AMPL, GAMS, MPL
Free Academic Licenses.
Tuo Zhao — Lecture 3: Support Vector Machines 11/47
Lagrangian Duality
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Lagrangian Multiplier Method
Dual variables: λ_i ≥ 0, i = 1, ..., n.
L(θ, λ) = (1/2) ‖θ‖₂² + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ θ)).
Min-Max problem:
min_θ max_{λ≥0} L(θ, λ).
Strong duality:
min_θ max_{λ≥0} L(θ, λ) = max_{λ≥0} min_θ L(θ, λ).
Tuo Zhao — Lecture 3: Support Vector Machines 13/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Penalty of Constraint Violation
Min-Max Problem:
min_θ max_{λ≥0} (1/2) ‖θ‖₂² + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ θ)).
Given (1 − y_i (x_i^⊤ θ)) ≤ 0,
max_{λ_i ≥ 0} λ_i (1 − y_i (x_i^⊤ θ)) = 0.
Given (1 − y_i (x_i^⊤ θ)) > 0,
max_{λ_i ≥ 0} λ_i (1 − y_i (x_i^⊤ θ)) = +∞.
Tuo Zhao — Lecture 3: Support Vector Machines 14/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Convex Concave Saddle Point Problem
Tuo Zhao — Lecture 3: Support Vector Machines 15/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Karush-Kuhn-Tucker Condition
KKT Condition:
∂L(θ, λ)/∂θ |_{θ=θ̂} = 0, ∂L(θ, λ)/∂λ |_{λ=λ̂} = 0.
Complementary Slackness:
λ̂_i = 0 ⇒ 1 − y_i x_i^⊤ θ̂ ≤ 0,
λ̂_i > 0 ⇒ 1 − y_i x_i^⊤ θ̂ = 0.
λ̂_i > 0: support vector.
Tuo Zhao — Lecture 3: Support Vector Machines 16/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Dual Problem
Solve the minimization problem:
∂L(θ, λ)/∂θ = 0 ⇒ θ = Σ_{i=1}^n λ_i y_i x_i.
Maximization problem:
max_{λ≥0} L(Σ_{i=1}^n λ_i y_i x_i, λ)
= max_{λ≥0} (1/2) ‖Σ_{i=1}^n λ_i y_i x_i‖₂² + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ Σ_{k=1}^n λ_k y_k x_k)).
Tuo Zhao — Lecture 3: Support Vector Machines 17/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Simplification
‖Σ_{i=1}^n λ_i y_i x_i‖₂² = (Σ_{i=1}^n λ_i y_i x_i^⊤)(Σ_{k=1}^n λ_k y_k x_k) = Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^⊤ x_k,
Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ Σ_{k=1}^n λ_k y_k x_k)) = Σ_{i=1}^n λ_i − Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^⊤ x_k.
Tuo Zhao — Lecture 3: Support Vector Machines 18/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Dual Problem
After simplification, we have
λ̂ = argmax_{λ≥0} Σ_{i=1}^n λ_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^⊤ x_k.
We define Q by Q_ik = y_i y_k x_i^⊤ x_k.
Quadratic Minimization:
λ̂ = argmin_λ (1/2) λ^⊤ Q λ − λ^⊤ 1
subject to λ_i ≥ 0, i = 1, ..., n.
Tuo Zhao — Lecture 3: Support Vector Machines 19/47
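The same toy setup can be used to solve the dual directly. The sketch below (again using cvxpy; all names and data are illustrative) exploits the identity λ^⊤Qλ = ‖Σ_i λ_i y_i x_i‖₂² and recovers θ̂ = Σ_i λ̂_i y_i x_i from the optimal multipliers.

```python
import numpy as np
import cvxpy as cp

# Dual hard-margin SVM: min_{lambda >= 0} (1/2) lambda^T Q lambda - 1^T lambda,
# with Q_ik = y_i y_k x_i^T x_k. A sketch on toy separable data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)), rng.normal([-2, -2], 0.3, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

Z = y[:, None] * X                            # rows are y_i x_i
lam = cp.Variable(len(y))
# lambda^T Q lambda = || sum_i lambda_i y_i x_i ||_2^2 = ||Z^T lambda||_2^2
dual = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(Z.T @ lam) - cp.sum(lam)),
                  [lam >= 0])
dual.solve()
theta = Z.T @ lam.value                       # theta = sum_i lambda_i y_i x_i
print("support vectors:", np.flatnonzero(lam.value > 1e-6), "theta =", theta)
```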
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Support Vector Machines
Let S denote the set of support vectors. Given x∗, we predict
ŷ* = sign(θ̂^⊤ x*) = sign(Σ_{i=1}^n λ̂_i y_i x_i^⊤ x*) = sign(Σ_{i∈S} λ̂_i y_i x_i^⊤ x*).
The support vectors are exactly the training points with the smallest margin, i.e., those lying on the two hyperplanes y_i x_i^⊤ θ̂ = 1 parallel to the decision boundary; only their multipliers λ̂_i are nonzero at the optimal solution. The number of support vectors is often much smaller than the size of the training set. Note also that both the dual problem and the prediction rule depend on the data only through inner products x_i^⊤ x_k, which is the key fact behind the kernel trick.
Tuo Zhao — Lecture 3: Support Vector Machines 20/47
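A small sketch of this prediction rule: only the support vectors (indices with λ̂_i > 0) contribute to sign(Σ_{i∈S} λ̂_i y_i x_i^⊤ x*). The numbers below are placeholder values standing in for a solved dual.

```python
import numpy as np

# Predicting with support vectors only: y* = sign(sum_{i in S} lambda_i y_i x_i^T x*).
# lam, X, y would normally come from a solved dual; these values are placeholders.
lam = np.array([0.0, 0.7, 0.0, 1.3])
X = np.array([[2.0, 1.0], [1.0, 1.0], [-2.0, -1.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

S = np.flatnonzero(lam > 1e-8)                    # indices of support vectors

def predict(x_new):
    return np.sign(np.sum(lam[S] * y[S] * (X[S] @ x_new)))

print(predict(np.array([1.5, 0.5])), predict(np.array([-1.0, -0.2])))
```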
SVM with Soft Margin
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Linearly Separable and Non-separable
The derivation of the SVM so far assumed that the data are linearly separable. Mapping the data to a high-dimensional feature space via φ generally increases the likelihood that they are separable, but we cannot guarantee it. Moreover, a strictly separating hyperplane may not be what we want, since it is susceptible to outliers: adding a single outlier can cause the decision boundary to swing dramatically and the resulting margin to shrink.
[Figure: an optimal margin classifier on separable data (left); adding one outlier in the upper-left region swings the boundary and yields a much smaller margin (right).]
To make the algorithm work for non-linearly separable datasets and to be less sensitive to outliers, the optimization is reformulated with slack variables (an ℓ₁ penalty on the slacks):
min_{w,b,ξ} (1/2) ‖w‖² + C Σ_{i=1}^m ξ_i
subject to y_i (w^⊤ x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., m.
Examples are now permitted to have functional margin less than 1; an example with functional margin 1 − ξ_i (with ξ_i > 0) increases the objective by C ξ_i. The parameter C controls the relative weighting between keeping ‖w‖² small (which makes the margin large) and ensuring that most examples have functional margin at least 1.
Tuo Zhao — Lecture 3: Support Vector Machines 22/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Soft Margin
Slack variables: ξ_i ≥ 0, i = 1, ..., n.
θ̂ = argmin_{θ,ξ} (1/2) ‖θ‖₂² + C Σ_{i=1}^n ξ_i
subject to 1 − y_i (x_i^⊤ θ) − ξ_i ≤ 0, ξ_i ≥ 0, i = 1, ..., n,
where C > 0 is a regularization parameter.
Lagrangian Form: Given λ ≥ 0 and β ≥ 0,
L(θ, ξ, λ, β) = (1/2) ‖θ‖₂² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n λ_i (1 − y_i (x_i^⊤ θ) − ξ_i) − Σ_{i=1}^n β_i ξ_i.
Tuo Zhao — Lecture 3: Support Vector Machines 23/47
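In practice this QP is rarely solved by hand; for instance, scikit-learn's LinearSVC minimizes the same hinge-loss objective. A hedged usage sketch (the dataset and the specific C values are illustrative; fit_intercept=False matches the lecture's model without a bias term):

```python
from sklearn.svm import LinearSVC
from sklearn.datasets import make_blobs

# Soft-margin SVM in practice: C trades off margin width against slack (hinge violations).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y - 1                                     # relabel {0, 1} -> {-1, +1}

for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C, loss="hinge", fit_intercept=False, max_iter=10000)
    clf.fit(X, y)
    print(C, clf.score(X, y))                     # larger C -> fewer margin violations
```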
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Dual Formulation
Quadratic Minimization:
λ̂ = argmin_λ (1/2) λ^⊤ Q λ − λ^⊤ 1
subject to 0 ≤ λ_i ≤ C, i = 1, ..., n.
KKT condition ⇒ Complementary Slackness:
λ̂_i = 0 ⇒ y_i x_i^⊤ θ̂ ≥ 1,
0 < λ̂_i < C ⇒ y_i x_i^⊤ θ̂ = 1,
λ̂_i = C ⇒ y_i x_i^⊤ θ̂ ≤ 1.
λ̂_i > 0: support vector.
Tuo Zhao — Lecture 3: Support Vector Machines 24/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Soft Margin
Tuo Zhao — Lecture 3: Support Vector Machines 25/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Hinge Loss Function
Tuo Zhao — Lecture 3: Support Vector Machines 26/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Unconstrained Formulation
SVM with soft margin is equivalent to
θ̂ = argmin_{θ∈R^d} C Σ_{i=1}^n max{1 − y_i x_i^⊤ θ, 0} + (1/2) ‖θ‖₂².
Logistic loss: f(z) = log(1 + exp(−z))
Squared Hinge Loss: f(z) = max{1 − z, 0}²
Huberized Hinge Loss: f(z) = 1/2 − z if z ≤ 0; (1/2)(z − 1)² if 0 < z ≤ 1; 0 otherwise.
0-1 Loss: f(z) = 1(z ≤ 0)
Tuo Zhao — Lecture 3: Support Vector Machines 27/47
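The surrogate losses above are easy to compare numerically as functions of the margin z = y x^⊤θ. A small sketch (the function names are my own):

```python
import numpy as np

# The surrogate losses on this slide, as functions of the margin z = y * x^T theta.
def hinge(z):        return np.maximum(1 - z, 0)
def sq_hinge(z):     return np.maximum(1 - z, 0) ** 2
def logistic(z):     return np.log1p(np.exp(-z))
def huber_hinge(z):
    return np.where(z <= 0, 0.5 - z,
           np.where(z <= 1, 0.5 * (z - 1) ** 2, 0.0))
def zero_one(z):     return (z <= 0).astype(float)

z = np.linspace(-2, 3, 7)
for f in (hinge, sq_hinge, logistic, huber_hinge, zero_one):
    print(f.__name__, np.round(f(z), 3))
```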
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Huberized Hinge Loss
Tuo Zhao — Lecture 3: Support Vector Machines 28/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Nonsmooth (Stochastic) Optimization
Given a convex function L(θ), the subdifferential of L at θ is a set of vectors, denoted ∂L(θ), such that for any θ, θ′ ∈ R^d,
L(θ′) ≥ L(θ) + ζ^⊤(θ′ − θ) for all ζ ∈ ∂L(θ).
f(z) = |z|: ∂f(z) = [−1,+1] at z = 0
f(z) = max{1− z, 0}: ∂f(z) = [−1, 0] at z = 1.
Tuo Zhao — Lecture 3: Support Vector Machines 29/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Nonsmooth (Stochastic) Optimization
We consider a nonsmooth optimization problem
min_θ L(θ), where L(θ) = (1/n) Σ_{i=1}^n ℓ_i(θ).
Subgradient Algorithm: At the (t+1)-th iteration, we take
θ^(t+1) = θ^(t) − η_t ζ^(t), where ζ^(t) ∈ ∂L(θ^(t)).
Stochastic Variant: At the (t+1)-th iteration, we randomly select i from {1, ..., n} and take
θ^(t+1) = θ^(t) − η_t ζ^(t), where ζ^(t) ∈ ∂ℓ_i(θ^(t)).
Tuo Zhao — Lecture 3: Support Vector Machines 30/47
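Applied to the soft-margin SVM objective, the stochastic variant gives a Pegasos-style solver. The sketch below uses η_t = 1/t (matching μ = 1 for the (1/2)‖θ‖₂² term) and the (t+1)-weighted average appearing on the next slide; the data and constants are illustrative assumptions.

```python
import numpy as np

# Stochastic subgradient sketch for the soft-margin SVM objective
#   C * sum_i max{1 - y_i x_i^T theta, 0} + (1/2)||theta||_2^2,
# written as (1/n) sum_i l_i(theta) with l_i(theta) = n*C*max{...} + (1/2)||theta||^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (50, 2)),
               rng.normal([-1.5, -1.5], 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
C, n = 1.0, len(y)

theta = np.zeros(2)
theta_bar, wsum = np.zeros(2), 0.0
for t in range(1, 2001):
    i = rng.integers(n)
    margin = y[i] * (X[i] @ theta)
    # a subgradient of l_i at theta: theta - n*C*y_i*x_i if the hinge is active, else theta
    g = theta - (n * C * y[i] * X[i] if margin < 1 else 0.0)
    theta -= (1.0 / t) * g
    theta_bar, wsum = theta_bar + (t + 1) * theta, wsum + (t + 1)
theta_bar /= wsum
print("averaged theta:", theta_bar,
      "train accuracy:", np.mean(np.sign(X @ theta_bar) == y))
```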
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Stochastic Subgradient Algorithms
How many iterations do we need?
A sequence of decreasing step size parameters: η_t ≍ 1/(tμ).
‖ζ‖₂ ≤ M for any ζ ∈ ∂L(θ).
Given a pre-specified error ε, we need
T = O(M² / (μ²ε))
iterations such that
E‖θ̄^(T) − θ̂‖₂² ≤ ε, where θ̄^(T) = 2 Σ_{t=0}^T (t+1) θ^(t) / ((T+1)(T+2)).
Slower than gradient descent for the squared hinge loss SVM.
Tuo Zhao — Lecture 3: Support Vector Machines 31/47
SVM with Kernel
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Nonlinear SVM
Feature Mapping: φ(x) = (φ₁(x), ..., φ_m(x))^⊤.
SVM with Feature Mapping:
θ̂ = argmin_{θ∈R^m} C Σ_{i=1}^n max{1 − y_i φ(x_i)^⊤ θ, 0} + (1/2) ‖θ‖₂².
Dual Problem:
λ̂ = argmin_{0 ≤ λ ≤ C1} (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k φ(x_i)^⊤ φ(x_k) − λ^⊤ 1.
Tuo Zhao — Lecture 3: Support Vector Machines 33/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Feature Mapping
Tuo Zhao — Lecture 3: Support Vector Machines 34/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Kernel Trick
Kernel Function: K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
Dual Problem:
λ̂ = argmin_{0 ≤ λ ≤ C1} (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k K(x_i, x_k) − λ^⊤ 1.
Gaussian/RBF Kernel (infinite-dimensional mapping): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖₂²).
Polynomial Kernel (finite-dimensional mapping): K(x_i, x_j) = (1 + x_i^⊤ x_j)^r.
Tuo Zhao — Lecture 3: Support Vector Machines 35/47
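A small numpy sketch of the two kernels on this slide, computing the Gram matrix K that replaces x_i^⊤x_k in the dual; the values of γ and r are illustrative hyperparameters.

```python
import numpy as np

# Kernel functions from this slide; gamma and r are illustrative hyperparameters.
def rbf_kernel(Xi, Xj, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||_2^2), computed for all pairs of rows
    sq = np.sum(Xi**2, 1)[:, None] + np.sum(Xj**2, 1)[None, :] - 2 * Xi @ Xj.T
    return np.exp(-gamma * sq)

def poly_kernel(Xi, Xj, r=3):
    # K(x, z) = (1 + x^T z)^r
    return (1 + Xi @ Xj.T) ** r

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel(X, X))       # Gram matrix with K_ij = K(x_i, x_j)
print(poly_kernel(X, X))
# These Gram matrices replace x_i^T x_k in the dual, so no explicit phi(x) is needed.
```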
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
More on Kernels
Reproducing Kernel Hilbert Space
Representer Theorem
Mercer Theorem
K ⪰ 0, where K_ij = K(x_i, x_j).
Nonparametric SVM:
f̂ = argmin_{f∈H} C Σ_{i=1}^n max{1 − y_i f(x_i), 0} + (1/2) ‖f‖²_H.
Tuo Zhao — Lecture 3: Support Vector Machines 36/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Randomized Coordinate Minimization
Solve the dual problem:
λ̂ = argmin_λ (1/2) λ^⊤ Q λ − λ^⊤ 1
subject to 0 ≤ λ_i ≤ C, i = 1, ..., n.
Minimize over λ_j with λ_1, ..., λ_{j−1}, λ_{j+1}, ..., λ_n fixed.
At the (t+1)-th iteration, we select i ∈ {1, ..., n} and take
λ_i^(t+1) = argmin_{0 ≤ λ_i ≤ C} (1/2) Q_ii λ_i² + λ_i Σ_{j≠i} Q_ij λ_j^(t) − λ_i,
and λ_j^(t+1) = λ_j^(t) for all j ≠ i.
Tuo Zhao — Lecture 3: Support Vector Machines 37/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Computational Cost per Iteration
An analytical updating formula:
λ_i^(t+1) = argmin_{0 ≤ λ_i ≤ C} (1/2) Q_ii λ_i² + λ_i Σ_{j≠i} Q_ij λ_j^(t) − λ_i
= C if λ̃_i ≥ C; 0 if λ̃_i ≤ 0; λ̃_i otherwise,
where λ̃_i = Q_ii^{−1} (1 − Σ_{j≠i} Q_ij λ_j^(t)).
What is the computational cost per iteration? O(n)
Better than gradient descent? O(n2)
Tuo Zhao — Lecture 3: Support Vector Machines 38/47
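Putting the last two slides together, here is a hedged sketch of randomized coordinate minimization for the box-constrained dual on toy data (all names and constants are illustrative); each update reads one row of Q, i.e., O(n) work.

```python
import numpy as np

# Randomized coordinate minimization for the dual (1/2) lam^T Q lam - 1^T lam,
# 0 <= lam_i <= C, using the clipped analytic update from this slide.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (30, 2)), rng.normal([-2, -2], 0.5, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])
Q = np.outer(y, y) * (X @ X.T)
n, C = len(y), 1.0

lam = np.zeros(n)
for t in range(5000):
    i = rng.integers(n)
    # unconstrained minimizer over lam_i: (1 - sum_{j != i} Q_ij lam_j) / Q_ii, then clip
    resid = Q[i] @ lam - Q[i, i] * lam[i]
    lam[i] = np.clip((1.0 - resid) / Q[i, i], 0.0, C)

theta = (lam * y) @ X                      # recover theta = sum_i lam_i y_i x_i
print("support vectors:", np.flatnonzero(lam > 1e-6).size, "theta =", theta)
```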
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Geometric Illustration
Optimal Solution
Tuo Zhao — Lecture 3: Support Vector Machines 39/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Coordinate Strong Smoothness
There exists a constant L_j such that for any λ and λ′ with λ_i = λ′_i for all i ≠ j, we have
F(λ′) − F(λ) − ∇_j F(λ)(λ′_j − λ_j) ≤ (L_j / 2)(λ′_j − λ_j)².
There exists a constant μ such that for any λ and λ′, we have
F(λ′) − F(λ) − ∇F(λ)^⊤(λ′ − λ) ≥ (μ / 2) ‖λ′ − λ‖₂².
Coordinate vs. global strong smoothness:
L / (nμ) ≤ max_j L_j / μ ≤ L / μ.
Tuo Zhao — Lecture 3: Support Vector Machines 40/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Iteration Complexity
How many iterations do we need?
Given a pre-specified error ε, we need
T = O((n L_max / μ) log(1/ε))
iterations such that
E‖λ^(T) − λ̂‖₂² ≤ ε.
Faster than the gradient descent algorithm? Yes!
Tuo Zhao — Lecture 3: Support Vector Machines 41/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Randomized v.s. Cyclic v.s. Shuffled
Cyclic Order: 1, 2, ..., d, 1, 2, ..., d, ...
Shuffled and Random Shuffled Order
Randomized Order
[Figure 1 (from the cited coordinate descent experiments): relative error (f(x_k) − f*)/(f(x_0) − f*) vs. iterations for cyclic CD (C-CD), cyclic CGD with small step size 1/λ_max(A), randomly permuted CD, randomized CD, and GD, minimizing f(x) = x^⊤ A x with n = 100 and A = A_c, c = 0.8. In these experiments, cyclic CD is typically slower than randomized and randomly permuted CD unless A has small off-diagonal entries.]
Worst Case v.s. Average Case
Tuo Zhao — Lecture 3: Support Vector Machines 42/47
Bias Variance Tradeoff
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Bias Variance Tradeoff
Given a regression model:
Y = f*(X) + ε, where Eε = 0 and Eε² = σ².
Bias-Variance Decomposition:
E_{ε,X}(Y − f̂(X))²
= E_{ε,X}(Y − f*(X) + f*(X) − E_{ε|X} f̂(X) + E_{ε|X} f̂(X) − f̂(X))²
= σ² + E_X(f*(X) − E_{ε|X} f̂(X))² [Bias²] + E_X(f̂(X) − E_{ε|X} f̂(X))² [Variance].
Bias ↑ + Variance ↓ or Bias ↓ + Variance ↑.
Tuo Zhao — Lecture 3: Support Vector Machines 44/47
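The decomposition can be checked by simulation: refit a model many times on fresh noisy samples and average. A sketch with illustrative choices of f*, σ, and polynomial degrees (not from the lecture):

```python
import numpy as np

# Monte Carlo sketch of the bias-variance decomposition for polynomial regression.
rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)
sigma, n, reps = 0.3, 30, 500
x_grid = np.linspace(0, 1, 100)

for degree in (1, 3, 9):
    preds = np.empty((reps, x_grid.size))
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        y = f_star(x) + sigma * rng.normal(size=n)        # Y = f*(X) + eps
        coef = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coef, x_grid)
    bias2 = np.mean((preds.mean(axis=0) - f_star(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```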
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Bias Variance Tradeoff
Tuo Zhao — Lecture 3: Support Vector Machines 45/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Overfitting
Tuo Zhao — Lecture 3: Support Vector Machines 46/47
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Optimal Model Complexity
Simultaneously control bias and variance!
Next Lecture: Model Selection and Regularization
Tuo Zhao — Lecture 3: Support Vector Machines 47/47