ECS289: Scalable Machine Learning
Transcript of chohsieh/teaching/ECS289G_Fall2016/lecture9.pdf (48 pages)
Cho-Jui Hsieh, UC Davis, Oct 18, 2016

Page 1

ECS289: Scalable Machine Learning

Cho-Jui Hsieh, UC Davis

Oct 18, 2016

Page 2

Outline

Kernel Methods

Solving the dual problem

Kernel approximation

Page 3

Support Vector Machines (SVM)

SVM is a widely used classifier.

Given:

Training data points x1, · · · , xn.
Each xi ∈ Rd is a feature vector.
Consider a simple case with two classes: yi ∈ {+1, −1}.

Goal: Find a hyperplane to separate these two classes of data: if yi = 1, wᵀxi ≥ 1 − ξi; if yi = −1, wᵀxi ≤ −1 + ξi.

Page 4

Support Vector Machines (SVM)

Given training data x1, · · · , xn ∈ Rd with labels yi ∈ {+1, −1}.
SVM primal problem (find optimal w ∈ Rd):

min_{w,ξ}   (1/2) wᵀw + C ∑_{i=1}^n ξi

s.t.  yi(wᵀxi) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n,

SVM dual problem (find optimal α ∈ Rn):

min_α   (1/2) αᵀQα − eᵀα

s.t.  0 ≤ αi ≤ C,  ∀i = 1, . . . , n

Q: n-by-n kernel matrix, Qij = yi yj xiᵀxj

Each αi corresponds to one training data point.
Primal-dual relationship: w = ∑i αi yi xi
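A minimal Python sketch (not from the slides) of the primal-dual relationship above, using scikit-learn's SVC with a linear kernel; the dataset and parameter values are illustrative.

```python
# Minimal sketch: verify w = sum_i alpha_i y_i x_i with a linear-kernel SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                         # 200 samples, d = 5 features
y = np.sign(X[:, 0] + 0.5 * rng.randn(200))   # labels in {+1, -1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # = sum_i alpha_i y_i x_i
print(np.allclose(w_from_dual, clf.coef_))            # primal w recovered from the dual
```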

Page 5

Non-linearly separable problems

What if the data is not linearly separable?

Solution: map the data xi to a higher-dimensional (maybe infinite-dimensional) feature space ϕ(xi), where the two classes are linearly separable.

Page 6

Support Vector Machines (SVM)

SVM primal problem:

min_{w,ξ}   (1/2) wᵀw + C ∑_{i=1}^n ξi

s.t.  yi(wᵀϕ(xi)) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n,

The dual problem for SVM:

min_α   (1/2) αᵀQα − eᵀα,

s.t.  0 ≤ αi ≤ C, for i = 1, . . . , n,

where Qij = yi yj ϕ(xi)ᵀϕ(xj) and e = [1, . . . , 1]ᵀ.

Kernel trick: define K (xi , xj) = ϕ(xi )Tϕ(xj).

At optimum: w = ∑_{i=1}^n αi yi ϕ(xi).

Page 7

Various types of kernels

Gaussian kernel: K(xi, xj) = exp(−γ‖xi − xj‖²);

Polynomial kernel: K(xi, xj) = (γ xiᵀxj + c)^d (a short numpy sketch of both kernels follows the list below).

Other kernels for specific problems:

Graph kernels (Vishwanathan et al., “Graph Kernels”. JMLR, 2010)

Pyramid match kernel for image matching (Grauman and Darrell, “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features”. ICCV, 2005)

String kernel (Lodhi et al., “Text classification using string kernels”. JMLR, 2002)
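A minimal numpy sketch (not from the slides) of the Gaussian and polynomial kernel matrices defined above; the hyperparameter values gamma, c, and degree are illustrative.

```python
# Minimal sketch: compute the two kernel matrices from the slide with numpy.
import numpy as np

def gaussian_kernel(X, Y, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2)
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

def polynomial_kernel(X, Y, gamma=1.0, c=1.0, degree=3):
    # K[i, j] = (gamma * x_i^T y_j + c)^degree
    return (gamma * X @ Y.T + c) ** degree

X = np.random.randn(6, 4)
print(gaussian_kernel(X, X).shape, polynomial_kernel(X, X).shape)  # (6, 6) (6, 6)
```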

Page 8

General Kernelized ERM

L2-Regularized Empirical Risk Minimization:

min_w   (1/2)‖w‖² + C ∑_{i=1}^n ℓi(wᵀϕ(xi))

x1, · · · , xn: training samples
ϕ(xi): nonlinear mapping to a higher-dimensional space

Dual problem:

min_α   (1/2) αᵀQα + ∑_{i=1}^n ℓi*(−αi)

where Q ∈ Rn×n, Qij = ϕ(xi)ᵀϕ(xj) = K(xi, xj), and ℓi* is the convex conjugate of the loss ℓi.

Page 9

Kernel Ridge Regression

Given training samples (xi , yi ), i = 1, · · · , n.

min_w   (1/2)‖w‖² + (1/2) ∑_{i=1}^n (wᵀϕ(xi) − yi)²

Dual problem:

min_α   αᵀQα + ‖α‖² − 2αᵀy
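A minimal numpy sketch (not from the slides) of solving the kernel ridge regression dual in closed form, (K + λI)α = y; setting the stand-in regularizer lam to 1 matches the objective above, and the rbf helper and data are illustrative.

```python
# Minimal sketch: closed-form kernel ridge regression in the dual,
# alpha = (K + lam*I)^{-1} y, prediction f(x) = sum_i alpha_i K(x_i, x).
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(100)

lam = 1.0
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # O(n^3) exact solve

X_test = rng.randn(5, 3)
y_pred = rbf(X_test, X) @ alpha                        # f(x) = sum_i alpha_i K(x_i, x)
print(y_pred.shape)                                    # (5,)
```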

Page 10

Scalability

Challenge for solving kernel SVMs (for dataset with n samples):

Space: O(n²) for storing the n-by-n kernel matrix (can be reduced in some cases);
Time: O(n³) for computing the exact solution.

Page 11

Greedy Coordinate Descent for Kernel SVM

Page 12

Nonlinear SVM

SVM primal problem:

min_{w,ξ}   (1/2) wᵀw + C ∑_{i=1}^n ξi

s.t.  yi(wᵀϕ(xi)) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n,

w: a (possibly infinite-dimensional) vector, so the primal is very hard to solve directly

The dual problem for SVM:

min_α   (1/2) αᵀQα − eᵀα,

s.t.  0 ≤ αi ≤ C, for i = 1, . . . , n,

where Qij = yi yj ϕ(xi)ᵀϕ(xj) and e = [1, . . . , 1]ᵀ.

Kernel trick: define K (xi , xj) = ϕ(xi )Tϕ(xj).

Example: Gaussian kernel K(xi, xj) = exp(−γ‖xi − xj‖²)

Page 13

Nonlinear SVM

Can we solve the problem by dual coordinate descent?

The vector w = ∑i yi αi ϕ(xi) may have infinite dimensionality:

Cannot maintain w explicitly
Closed-form solution for a single coordinate needs O(n) computational time:

δ∗ = max( −αi, min( C − αi, (1 − (Qα)i) / Qii ) )

(Assume Q is stored in memory)

Can we improve coordinate descent using the same O(n) time complexity?

Page 14

Greedy Coordinate Descent

The Greedy Coordinate Descent (GCD) algorithm:

For t = 1, 2, . . .

1. Compute δ∗i := argminδ f (α + δei ) for all i = 1, . . . , n

2. Find the best i∗ according to the following criterion:

i∗ = argmaxi|δ∗i |

3. αi∗ ← αi∗ + δ∗i∗

Page 15

Greedy Coordinate Descent

Other variable selection criteria:

The coordinate with the maximum step size:

i∗ = argmaxi|δ∗i |

The coordinate with maximum objective function reduction:

i∗ = argmax_i ( f(α) − f(α + δ∗i ei) )

The coordinate with the maximum projected gradient.

. . .

Page 16

Greedy Coordinate Descent

How to compute the optimal coordinate?

Closed form solution of best δ:

δ∗i = max( −αi, min( C − αi, (1 − (Qα)i) / Qii ) )

Observations:

1. Computing all δ∗i needs O(n²) time
2. If Qα is stored in memory, computing all δ∗i only needs O(n) time;
   maintaining Qα also needs O(n) time after each update

Page 17

Greedy Coordinate Descent

Initial: α, z = Qα

For t = 1, 2, . . .

  For all i = 1, . . . , n, compute
    δ∗i = max( −αi, min( C − αi, (1 − zi) / Qii ) )
  Let i∗ = argmax_i |δ∗i|
  αi∗ ← αi∗ + δ∗i∗
  z ← z + qi∗ δ∗i∗   (qi∗ is the i∗-th column of Q)

(This is a simplified version of the Sequential Minimal Optimization (SMO) algorithm proposed by Platt, 1998)
(A similar version is implemented in LIBSVM)
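A minimal Python sketch (not from the slides) of the greedy coordinate descent loop above, assuming the full matrix Q fits in memory (the caching scheme on the next page removes this assumption); n_iters and the stopping tolerance are illustrative.

```python
# Minimal sketch of greedy coordinate descent for the kernel SVM dual.
# z = Q @ alpha is maintained so each greedy sweep costs O(n).
import numpy as np

def greedy_cd_svm(Q, C=1.0, n_iters=200):
    n = Q.shape[0]
    alpha = np.zeros(n)
    z = Q @ alpha                        # z = Q alpha (all zeros initially)
    diag = np.diag(Q)
    for _ in range(n_iters):
        # closed-form step for every coordinate, O(n) given z
        delta = np.clip((1.0 - z) / diag, -alpha, C - alpha)
        i = np.argmax(np.abs(delta))     # greedy choice: largest step size
        if abs(delta[i]) < 1e-12:
            break
        alpha[i] += delta[i]
        z += Q[:, i] * delta[i]          # maintain z = Q alpha in O(n)
    return alpha
```

With Q built as (y yᵀ) elementwise-multiplied by the kernel matrix, this solves the boxed dual above; predictions then use f(x) = ∑i αi yi K(xi, x).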

Page 18

How to solve problems with millions of samples?

Q ∈ Rn×n cannot be fully stored

Have a fixed size of memory to “cache” the computed columns of Q

For each coordinate update:

If qi is in memory, directly use it to update

If qi is not in memory:
1. Kick out the “Least Recently Used” column
2. Recompute qi and store it in memory
3. Update αi
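A minimal Python sketch (not from the slides) of the column-caching idea above, using an OrderedDict as the LRU structure; the capacity and the kernel callback signature are assumptions.

```python
# Minimal sketch: LRU cache for kernel columns q_i = y * y_i * K(., x_i).
from collections import OrderedDict
import numpy as np

class KernelColumnCache:
    def __init__(self, X, y, kernel, capacity=100):
        self.X, self.y, self.kernel = X, y, kernel
        self.capacity = capacity
        self.cache = OrderedDict()           # i -> q_i, ordered by recency

    def column(self, i):
        if i in self.cache:
            self.cache.move_to_end(i)        # mark as most recently used
            return self.cache[i]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used column
        # recompute q_i: Q[:, i] = y_j * y_i * K(x_j, x_i) for all j
        q_i = self.y * self.y[i] * self.kernel(self.X, self.X[i:i + 1]).ravel()
        self.cache[i] = q_i
        return q_i
```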

Page 19

LIBSVM

Implemented in LIBSVM:

https://www.csie.ntu.edu.tw/~cjlin/libsvm/

Other functionalities:

Multi-class classification
Support vector regression
Cross-validation

Page 20

Kernel Approximation

Page 21

Kernel Approximation

Kernel methods are hard to scale up because

Kernel matrix G is an n-by-n matrix

Can we approximate the kernel using a low-rank representation?

Two main algorithms:

Nystrom approximation: approximate the kernel matrix
Random features: approximate the kernel function

Page 22

Kernel Matrix Approximation

We want to form a low rank approximation of G ∈ Rn×n,

where Gij = K (xi , xj)

Can we do SVD?

No, SVD needs to form the n-by-n matrix

O(n²) space and O(n³) time

Can we do matrix completion or matrix sketching?

Main problem: need to generalize to new points


Page 26

Nystrom Approximation

Nystrom approximation:

Randomly sample m columns of G :

G = [ W    G12 ]
    [ G21  G22 ]

W: m-by-m square matrix (observed)
G21: (n − m)-by-m matrix (observed)
G12 = G21ᵀ (observed)
G22: (n − m)-by-(n − m) matrix (not observed)

Form a kernel approximation based on these mn elements

G ≈ G̃ = [ W   ]  W†  [ W  G12 ]
         [ G21 ]

W†: pseudo-inverse of W

Page 27

Nystrom Approximation

Why W †?

Exactly recovers the top-left m-by-m block

The kernel approximation:

[ W    G12 ]     [ W   ]                    [ W     G12        ]
[ G21  G22 ]  ≈  [ G21 ]  W†  [ W  G12 ]  = [ G21   G21 W† G12 ]

Page 28

Nystrom Approximation

Algorithm:
1. Sample m “landmark points”: v1, · · · , vm
2. Compute the kernel values between all training data and the landmark points:

C = [ W   ]  =  [ K(x1, v1)  · · ·  K(x1, vm) ]
    [ G21 ]     [ K(x2, v1)  · · ·  K(x2, vm) ]
                [     ...               ...   ]
                [ K(xn, v1)  · · ·  K(xn, vm) ]

3. Form the kernel approximation G̃ = C W† Cᵀ

(No need to explicitly form the n-by-n matrix)

Time complexity: mn kernel evaluations (O(mnd) time for standard kernels on d-dimensional data)

How to choose m?

Trade-off between accuracy and computational time / memory space (a small numpy sketch of the construction follows)
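A minimal numpy sketch (not from the slides) of the Nystrom construction above with uniformly sampled landmark points; the rbf helper, m, and the explicit error check are illustrative (in practice the n-by-n approximation is never formed).

```python
# Minimal sketch: sample m landmarks, form C and W, approximate G ~ C W^+ C^T.
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(500, 10)
m = 50
landmarks = X[rng.choice(len(X), m, replace=False)]   # uniform sampling

C = rbf(X, landmarks)            # n x m: kernel between all points and landmarks
W = rbf(landmarks, landmarks)    # m x m: kernel among landmarks
G_approx = C @ np.linalg.pinv(W) @ C.T                # formed here only to check the error

G_exact = rbf(X, X)
print(np.linalg.norm(G_exact - G_approx) / np.linalg.norm(G_exact))
```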

Page 29

Nystrom Approximation: Training

Solve the dual problem using low-rank representation.

Example: Kernel ridge regression

min_α   αᵀG̃α + λ‖α‖² − 2yᵀα

Solve a linear system (G̃ + λI )α = y

Use iterative methods (such as CG), with fast matrix-vector multiplication:

(G̃ + λI)p = C W†(Cᵀp) + λp

O(nm) time complexity per iteration
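A minimal scipy sketch (not from the slides) of the CG solve with the implicit matrix-vector product above; C, W_pinv, and y are random stand-ins for the quantities produced by the Nystrom construction.

```python
# Minimal sketch: solve (G~ + lam*I) alpha = y by CG without forming G~.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.RandomState(0)
n, m, lam = 500, 50, 1.0
C = rng.randn(n, m)        # stand-in for the Nystrom factor C
W_pinv = np.eye(m)         # stand-in for W^+
y = rng.randn(n)

def matvec(p):
    # (G~ + lam*I) p = C (W^+ (C^T p)) + lam*p, costs O(nm) per product
    return C @ (W_pinv @ (C.T @ p)) + lam * p

A = LinearOperator((n, n), matvec=matvec)
alpha, info = cg(A, y)
print(info)                # 0 means CG converged
```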

Page 30

Nystrom Approximation: Training

Another approach: reduce to linear ERM problems!

Rewrite as G̃ = C W† Cᵀ = C (W†)^{1/2} (W†)^{1/2} Cᵀ

So the approximate kernel can be represented by a linear kernel with feature matrix C (W†)^{1/2}:

K̃(xi, xj) = G̃ij = uiᵀuj,

where uj = (W†)^{1/2} [K(xj, v1), . . . , K(xj, vm)]ᵀ

The problem is equivalent to a linear model with features u1, · · · , un ∈ Rm:

Kernel SVM ⇒ linear SVM with m features
Kernel ridge regression ⇒ linear ridge regression with m features
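A minimal sketch (not from the slides) of this reduction: form ui = (W†)^{1/2}[K(xi, v1), . . . , K(xi, vm)]ᵀ for every training point and feed the features to a linear SVM; the rbf helper, m, and the eigendecomposition-based pseudo-inverse square root are assumptions.

```python
# Minimal sketch: Nystrom feature map followed by a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(500, 10)
y = np.sign(X[:, 0] + 0.3 * rng.randn(500))

m = 50
V = X[rng.choice(len(X), m, replace=False)]        # landmark points
W = rbf(V, V)
s, U_eig = np.linalg.eigh(W)                       # W = U diag(s) U^T
inv_sqrt = np.where(s > 1e-10, 1.0 / np.sqrt(np.clip(s, 1e-10, None)), 0.0)
W_pinv_sqrt = (U_eig * inv_sqrt) @ U_eig.T         # (W^+)^{1/2}

features = rbf(X, V) @ W_pinv_sqrt                 # rows are u_i^T, shape (n, m)
clf = LinearSVC(C=1.0).fit(features, y)            # kernel SVM -> linear SVM
print(clf.score(features, y))
```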

Page 31

Nystrom Approximation: Prediction

Given the dual solution α.

Prediction for testing sample x :

∑_{i=1}^n αi K̃(xi, x)    or    ∑_{i=1}^n αi K(xi, x) ?

Page 32

Nystrom Approximation: Prediction

Given the dual solution α.

Prediction for testing sample x :

∑_{i=1}^n αi K̃(xi, x)

Need to use the approximate kernel instead of the original kernel!
The approximate kernel gives much better accuracy,
because training & testing should use the same kernel.

Generalization bounds:
(Alaoui and Mahoney, “Fast Randomized Kernel Methods With Statistical Guarantees”. 2014)
(Rudi et al., “Less is More: Nystrom Computational Regularization”. NIPS 2015)
(Using a small number of landmark points can be viewed as a form of regularization.)

Page 33

Nystrom Approximation: Prediction

How to define K̃ (x , xi )?

G̃ = [ K(x1, v1)  · · ·  K(x1, vm)
          ...               ...
      K(xn, v1)  · · ·  K(xn, vm)      W†     [ K(v1, x1)  · · ·  K(v1, xn)  K(v1, x)
      K(x,  v1)  · · ·  K(x,  vm) ]                ...                ...        ...
                                                K(vm, x1)  · · ·  K(vm, xn)  K(vm, x) ]

Compute u = (W†)^{1/2} [K(v1, x), . . . , K(vm, x)]ᵀ

(need m kernel evaluations)

Approximate kernel can be defined by

K̃ (x , xi ) = uTui

Page 34

Nystrom Approximation: Prediction

Prediction:

f(x) = ∑_{i=1}^n αi uᵀui = uᵀ ( ∑_{i=1}^n αi ui )

( ∑_{i=1}^n αi ui ) can be pre-computed
⇒ prediction time is m kernel evaluations

Original kernel method: need n kernel evaluations for prediction.
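A tiny sketch (not from the slides) of this prediction shortcut, reusing names from the earlier Nystrom training sketch (features with rows uiᵀ, W_pinv_sqrt, landmarks V, and the rbf kernel); beta is an illustrative name for ∑i αi ui.

```python
# Tiny sketch: beta = sum_i alpha_i u_i is computed once after training,
# so each test prediction needs only m kernel evaluations.
import numpy as np

def nystrom_prediction_weights(features, alpha):
    return features.T @ alpha                          # beta = sum_i alpha_i u_i, shape (m,)

def nystrom_predict(x, beta, W_pinv_sqrt, V, kernel):
    u = W_pinv_sqrt @ kernel(x[None, :], V).ravel()    # m kernel evaluations
    return float(u @ beta)                             # f(x) = u^T (sum_i alpha_i u_i)
```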

Summary:

                   Quality   Training time                            Prediction time
Original kernel    exact     n^{1.x} kernel evals + kernel training   n kernel evals
Nystrom (m)        rank m    nm kernel evals + linear training        m kernel evals

Page 35

Nystrom Approximation

How to select the landmark points (v1, · · · , vm)?

Traditional approach: uniform random sampling from the training data

Importance sampling (leverage scores):
Drineas and Mahoney, “On the Nystrom method for approximating a Gram matrix for improved kernel-based learning”. JMLR 2005.
Gittens and Mahoney, “Revisiting the Nystrom method for improved large-scale machine learning”. ICML 2013.

K-means sampling (performs extremely well): run k-means clustering and set v1, · · · , vm to be the cluster centers
Zhang et al., “Improved Nystrom low rank approximation and error analysis”. ICML 2008.

Subspace distance:
Lim et al., “Double Nystrom method: An efficient and accurate Nystrom scheme for large-scale data sets”. ICML 2015.

Page 36

Nystrom Approximation (other related papers)

Distributed setting: (Kumar et al., “Ensemble Nystrom method”. NIPS 2009)

Block diagonal + low-rank approximation: (Si et al., “Memory efficient kernel approximation”. ICML 2014)

Nystrom method for fast prediction: (Hsieh et al., “Fast prediction for large-scale kernel machines”. NIPS 2014)

Structured landmark points: (Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016)

Page 37

Kernel Approximation (random features)

Page 38

Random Features

Directly approximate the kernel function

(Data-independent)

Generate nonlinear “features”

⇒ reduce the problem to linear model training

Several examples:

Random Fourier features:
(Rahimi and Recht, “Random features for large-scale kernel machines”. NIPS 2007)
Random features for polynomial kernels:
(Kar and Karnick, “Random feature maps for dot product kernels”. AISTATS 2012)

Page 39

Random Fourier Features

Random features for “Shift-invariant kernel”:

A continuous kernel is shift-invariant if K (x , y) = k(x − y).

Gaussian kernel: K(x, y) = exp(−‖x − y‖₂²)

Laplacian kernel: K(x, y) = exp(−‖x − y‖₁)

Shift invariant kernel (if positive definite) can be written as:

k(x − y) = ∫_{Rd} P(w) e^{jwᵀ(x−y)} dw

for some probability distribution P(w).

Page 40

Random Fourier Features

Taking the real part we have

k(x − y) = ∫_{Rd} P(w) cos(wᵀx − wᵀy) dw

         = ∫_{Rd} P(w) ( cos(wᵀx) cos(wᵀy) + sin(wᵀx) sin(wᵀy) ) dw

         = E_{w∼P(w)}[ zw(x)ᵀ zw(y) ]

where zw(x)ᵀ = [cos(wᵀx), sin(wᵀx)].

zw(x)ᵀzw(y) is an unbiased estimator of K(x, y) if w is sampled from P(·)

Page 41

Random Fourier Features

Sample m vectors w1, · · · ,wm from P(·) distribution

Generate random features for each sample:

ui = (1/√m) [ sin(w1ᵀxi), cos(w1ᵀxi), sin(w2ᵀxi), cos(w2ᵀxi), . . . , sin(wmᵀxi), cos(wmᵀxi) ]ᵀ

K(xi, xj) ≈ uiᵀuj

(larger m leads to better approximation)
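A minimal numpy sketch (not from the slides) of random Fourier features for the Gaussian kernel K(x, y) = exp(−γ‖x − y‖²); sampling w ∼ N(0, 2γI) is the corresponding distribution P(w), and the values of n, d, m, γ are illustrative.

```python
# Minimal sketch: random Fourier features for the Gaussian kernel.
import numpy as np

rng = np.random.RandomState(0)
n, d, m, gamma = 300, 10, 500, 0.5
X = rng.randn(n, d)

W = rng.randn(m, d) * np.sqrt(2.0 * gamma)                 # w_1, ..., w_m ~ P(w)
proj = X @ W.T                                             # (n, m) of w_k^T x_i
U = np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(m)   # (n, 2m) features

K_exact = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K_approx = U @ U.T
print(np.abs(K_exact - K_approx).max())                    # shrinks as m grows
```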

Page 42

Random Features

G ≈ UUᵀ, where U = [ u1ᵀ ]
                   [ ... ]
                   [ unᵀ ]

Rank 2m approximation.

The problem can be reduced to linear classification/regression for u1, · · · , un.

Time complexity: O(nmd) time + linear training

(Nystrom: O(nmd + m³) time + linear training)

Prediction time: O(md) per sample

(Nystrom: O(md) per sample)

Page 43

Fastfood

Random Fourier features require O(md) time to generate m features:

Main bottleneck: computing wiᵀx for all i = 1, · · · , m

Can we compute this in O(m log d) time?

Fastfood: proposed in (Le et al., “Fastfood - Approximating Kernel Expansions in Loglinear Time”. ICML 2013)

Reduce time by using “structured random features”


Page 45

Fastfood

Tool: fast matrix-vector multiplication for the Hadamard matrix (Hd ∈ Rd×d)

Hdx requires O(d log d) time

Hadamard matrix:

H1 = [ 1 ]

H2 = [ 1   1 ]
     [ 1  −1 ]

H_{2^k} = [ H_{2^{k−1}}   H_{2^{k−1}} ]
          [ H_{2^{k−1}}  −H_{2^{k−1}} ]

Hd x can be computed efficiently by the Fast Hadamard Transform (dynamic programming)

Page 46

Fastfood

Sample m = d random features by

W = Hd  [ v1  0   · · ·  0  ]
        [ 0   v2  · · ·  0  ]
        [          ...      ]
        [ 0   0   · · ·  vd ]

So W x only needs O(d log d) time.

Instead of sampling W (d² elements), we just sample v1, · · · , vd from N(0, σ)

Each row of W is still N(0, σ) (but rows are not independent)

Faster computation, but performance is slightly worse than the original random features.
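A minimal numpy sketch (not from the slides) of the Fast Hadamard Transform and the structured product Wx = Hd(v ⊙ x) above; it assumes d is a power of two and omits the permutation and scaling matrices of the full Fastfood construction.

```python
# Minimal sketch: fast Walsh-Hadamard transform in O(d log d), and the
# structured product W x = H_d (v * x) with W = H_d diag(v).
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^k vector (unnormalized)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

d = 8
rng = np.random.RandomState(0)
x = rng.randn(d)
v = rng.randn(d)                       # v_1, ..., v_d ~ N(0, sigma)

# Explicit Hadamard matrix, built only to check the transform (never formed in practice)
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])

print(np.allclose(fwht(x), H @ x))                       # FWHT matches H_d x
print(np.allclose(fwht(v * x), (H @ np.diag(v)) @ x))    # W x with W = H_d diag(v)
```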

Page 47

Extensions

Fastfood: structured random features to improve computational speed:
(Le et al., “Fastfood - Approximating Kernel Expansions in Loglinear Time”. ICML 2013)

Structured landmark points for Nystrom approximation:
(Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016)

Doubly stochastic gradient with random features:
(Dai et al., “Scalable Kernel Methods via Doubly Stochastic Gradients”. NIPS 2015)

Generalization bound (?)

Page 48

Coming up

Choose the paper for presentation (due this Sunday, Oct 23).

Questions?