ECS289: Scalable Machine Learning
Transcript of chohsieh/teaching/ECS289G_Fall2016/lecture9.pdf (48 pages)
Cho-Jui Hsieh, UC Davis, Oct 18, 2016

Page 1

ECS289: Scalable Machine Learning

Cho-Jui Hsieh, UC Davis

Oct 18, 2016

Page 2

Outline

Kernel Methods

Solving the dual problem

Kernel approximation

Page 3

Support Vector Machines (SVM)

SVM is a widely used classifier.

Given:

Training data points x1, · · · , xn.
Each xi ∈ Rd is a feature vector.
Consider a simple case with two classes: yi ∈ {+1, −1}.

Goal: Find a hyperplane to separate these two classes of data: if yi = 1, wᵀxi ≥ 1 − ξi; if yi = −1, wᵀxi ≤ −1 + ξi.

Page 4

Support Vector Machines (SVM)

Given training data x1, · · · , xn ∈ Rd with labels yi ∈ {+1, −1}.
SVM primal problem (find optimal w ∈ Rd):

min_{w,ξ}   (1/2) wᵀw + C ∑_{i=1}^n ξi

s.t.  yi(wᵀxi) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n,

SVM dual problem (find optimal α ∈ Rn):

min_α   (1/2) αᵀQα − eᵀα

s.t.  0 ≤ αi ≤ C,  ∀i = 1, . . . , n

Q: n-by-n kernel matrix, Qij = yi yj xiᵀxj

Each αi corresponds to one training data point.
Primal-dual relationship: w = ∑i αi yi xi
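A minimal Python sketch (not from the slides) of the primal-dual relationship above, using scikit-learn's SVC with a linear kernel; the dataset and parameter values are illustrative.

```python
# Minimal sketch: verify w = sum_i alpha_i y_i x_i with a linear-kernel SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                         # 200 samples, d = 5 features
y = np.sign(X[:, 0] + 0.5 * rng.randn(200))   # labels in {+1, -1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # = sum_i alpha_i y_i x_i
print(np.allclose(w_from_dual, clf.coef_))            # primal w recovered from the dual
```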

Page 5

Non-linearly separable problems

What if the data is not linearly separable?

Solution: map the data xi to a higher-dimensional (maybe infinite-dimensional) feature space ϕ(xi), where the two classes are linearly separable.

Page 6

Support Vector Machines (SVM)

SVM primal problem:

min_{w,ξ}   (1/2) wᵀw + C ∑_{i=1}^n ξi

s.t.  yi(wᵀϕ(xi)) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n,

The dual problem for SVM:

min_α   (1/2) αᵀQα − eᵀα,

s.t.  0 ≤ αi ≤ C, for i = 1, . . . , n,

where Qij = yi yj ϕ(xi)ᵀϕ(xj) and e = [1, . . . , 1]ᵀ.

Kernel trick: define K (xi , xj) = ϕ(xi )Tϕ(xj).

At optimum: w = ∑_{i=1}^n αi yi ϕ(xi).

Page 7

Various types of kernels

Gaussian kernel: K(xi, xj) = exp(−γ‖xi − xj‖²);

Polynomial kernel: K(xi, xj) = (γ xiᵀxj + c)^d (a short numpy sketch of both kernels follows the list below).

Other kernels for specific problems:

Graph kernels (Vishwanathan et al., “Graph Kernels”. JMLR, 2010)

Pyramid match kernel for image matching (Grauman and Darrell, “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features”. ICCV, 2005)

String kernel (Lodhi et al., “Text classification using string kernels”. JMLR, 2002)
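A minimal numpy sketch (not from the slides) of the Gaussian and polynomial kernel matrices defined above; the hyperparameter values gamma, c, and degree are illustrative.

```python
# Minimal sketch: compute the two kernel matrices from the slide with numpy.
import numpy as np

def gaussian_kernel(X, Y, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2)
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

def polynomial_kernel(X, Y, gamma=1.0, c=1.0, degree=3):
    # K[i, j] = (gamma * x_i^T y_j + c)^degree
    return (gamma * X @ Y.T + c) ** degree

X = np.random.randn(6, 4)
print(gaussian_kernel(X, X).shape, polynomial_kernel(X, X).shape)  # (6, 6) (6, 6)
```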

Page 8

General Kernelized ERM

L2-Regularized Empirical Risk Minimization:

min_w   (1/2)‖w‖² + C ∑_{i=1}^n ℓi(wᵀϕ(xi))

x1, · · · , xn: training samples
ϕ(xi): nonlinear mapping to a higher-dimensional space

Dual problem:

min_α   (1/2) αᵀQα + ∑_{i=1}^n ℓi*(−αi)

where Q ∈ Rn×n, Qij = ϕ(xi)ᵀϕ(xj) = K(xi, xj), and ℓi* is the convex conjugate of the loss ℓi.

Page 9

Kernel Ridge Regression

Given training samples (xi , yi ), i = 1, · · · , n.

min_w   (1/2)‖w‖² + (1/2) ∑_{i=1}^n (wᵀϕ(xi) − yi)²

Dual problem:

min_α   αᵀQα + ‖α‖² − 2αᵀy
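A minimal numpy sketch (not from the slides) of solving the kernel ridge regression dual in closed form, (K + λI)α = y; setting the stand-in regularizer lam to 1 matches the objective above, and the rbf helper and data are illustrative.

```python
# Minimal sketch: closed-form kernel ridge regression in the dual,
# alpha = (K + lam*I)^{-1} y, prediction f(x) = sum_i alpha_i K(x_i, x).
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(100)

lam = 1.0
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # O(n^3) exact solve

X_test = rng.randn(5, 3)
y_pred = rbf(X_test, X) @ alpha                        # f(x) = sum_i alpha_i K(x_i, x)
print(y_pred.shape)                                    # (5,)
```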

Page 10

Scalability

Challenge for solving kernel SVMs (for dataset with n samples):

Space: O(n²) for storing the n-by-n kernel matrix (can be reduced in some cases);
Time: O(n³) for computing the exact solution.

Page 11

Greedy Coordinate Descent for Kernel SVM

Page 12

Nonlinear SVM

SVM primal problem:

min_{w,ξ}   (1/2) wᵀw + C ∑_{i=1}^n ξi

s.t.  yi(wᵀϕ(xi)) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n,

w: a (possibly infinite-dimensional) vector, so the primal is very hard to solve directly

The dual problem for SVM:

min_α   (1/2) αᵀQα − eᵀα,

s.t.  0 ≤ αi ≤ C, for i = 1, . . . , n,

where Qij = yi yj ϕ(xi)ᵀϕ(xj) and e = [1, . . . , 1]ᵀ.

Kernel trick: define K (xi , xj) = ϕ(xi )Tϕ(xj).

Example: Gaussian kernel K(xi, xj) = exp(−γ‖xi − xj‖²)

Page 13

Nonlinear SVM

Can we solve the problem by dual coordinate descent?

The vector w = ∑i yi αi ϕ(xi) may have infinite dimensionality:

Cannot maintain w explicitly
Closed-form solution for a single coordinate needs O(n) computational time:

δ∗ = max( −αi, min( C − αi, (1 − (Qα)i) / Qii ) )

(Assume Q is stored in memory)

Can we improve coordinate descent using the same O(n) time complexity?

Page 14

Greedy Coordinate Descent

The Greedy Coordinate Descent (GCD) algorithm:

For t = 1, 2, . . .

1. Compute δ∗i := argminδ f (α + δei ) for all i = 1, . . . , n

2. Find the best i∗ according to the following criterion:

i∗ = argmaxi|δ∗i |

3. αi∗ ← αi∗ + δ∗i∗

Page 15

Greedy Coordinate Descent

Other variable selection criteria:

The coordinate with the maximum step size:

i∗ = argmaxi|δ∗i |

The coordinate with maximum objective function reduction:

i∗ = argmax_i ( f(α) − f(α + δ∗i ei) )

The coordinate with the maximum projected gradient.

. . .

Page 16

Greedy Coordinate Descent

How to compute the optimal coordinate?

Closed form solution of best δ:

δ∗i = max( −αi, min( C − αi, (1 − (Qα)i) / Qii ) )

Observations:

1. Computing all δ∗i needs O(n²) time
2. If Qα is stored in memory, computing all δ∗i only needs O(n) time;
   maintaining Qα also needs O(n) time after each update

Page 17

Greedy Coordinate Descent

Initial: α, z = Qα

For t = 1, 2, . . .

  For all i = 1, . . . , n, compute
    δ∗i = max( −αi, min( C − αi, (1 − zi) / Qii ) )
  Let i∗ = argmax_i |δ∗i|
  αi∗ ← αi∗ + δ∗i∗
  z ← z + qi∗ δ∗i∗   (qi∗ is the i∗-th column of Q)

(This is a simplified version of the Sequential Minimal Optimization (SMO) algorithm proposed by Platt, 1998)
(A similar version is implemented in LIBSVM)
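A minimal Python sketch (not from the slides) of the greedy coordinate descent loop above, assuming the full matrix Q fits in memory (the caching scheme on the next page removes this assumption); n_iters and the stopping tolerance are illustrative.

```python
# Minimal sketch of greedy coordinate descent for the kernel SVM dual.
# z = Q @ alpha is maintained so each greedy sweep costs O(n).
import numpy as np

def greedy_cd_svm(Q, C=1.0, n_iters=200):
    n = Q.shape[0]
    alpha = np.zeros(n)
    z = Q @ alpha                        # z = Q alpha (all zeros initially)
    diag = np.diag(Q)
    for _ in range(n_iters):
        # closed-form step for every coordinate, O(n) given z
        delta = np.clip((1.0 - z) / diag, -alpha, C - alpha)
        i = np.argmax(np.abs(delta))     # greedy choice: largest step size
        if abs(delta[i]) < 1e-12:
            break
        alpha[i] += delta[i]
        z += Q[:, i] * delta[i]          # maintain z = Q alpha in O(n)
    return alpha
```

With Q built as (y yᵀ) elementwise-multiplied by the kernel matrix, this solves the boxed dual above; predictions then use f(x) = ∑i αi yi K(xi, x).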

Page 18

How to solve problems with millions of samples?

Q ∈ Rn×n cannot be fully stored

Have a fixed size of memory to “cache” the computed columns of Q

For each coordinate update:

If qi is in memory, directly use it to update

If qi is not in memory:
1. Kick out the “Least Recently Used” column
2. Recompute qi and store it in memory
3. Update αi
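A minimal Python sketch (not from the slides) of the column-caching idea above, using an OrderedDict as the LRU structure; the capacity and the kernel callback signature are assumptions.

```python
# Minimal sketch: LRU cache for kernel columns q_i = y * y_i * K(., x_i).
from collections import OrderedDict
import numpy as np

class KernelColumnCache:
    def __init__(self, X, y, kernel, capacity=100):
        self.X, self.y, self.kernel = X, y, kernel
        self.capacity = capacity
        self.cache = OrderedDict()           # i -> q_i, ordered by recency

    def column(self, i):
        if i in self.cache:
            self.cache.move_to_end(i)        # mark as most recently used
            return self.cache[i]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used column
        # recompute q_i: Q[:, i] = y_j * y_i * K(x_j, x_i) for all j
        q_i = self.y * self.y[i] * self.kernel(self.X, self.X[i:i + 1]).ravel()
        self.cache[i] = q_i
        return q_i
```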

Page 19

LIBSVM

Implemented in LIBSVM:

https://www.csie.ntu.edu.tw/~cjlin/libsvm/

Other functionalities:

Multi-class classification
Support vector regression
Cross-validation

Page 20

Kernel Approximation

Page 21

Kernel Approximation

Kernel methods are hard to scale up because

Kernel matrix G is an n-by-n matrix

Can we approximate the kernel using a low-rank representation?

Two main algorithms:

Nystrom approximation: approximate the kernel matrix
Random features: approximate the kernel function

Page 22

Kernel Matrix Approximation

We want to form a low rank approximation of G ∈ Rn×n,

where Gij = K (xi , xj)

Can we do SVD?

No, SVD needs to form the n-by-n matrix

O(n²) space and O(n³) time

Can we do matrix completion or matrix sketching?

Main problem: need to generalize to new points


Page 26

Nystrom Approximation

Nystrom approximation:

Randomly sample m columns of G :

G = [ W    G12 ]
    [ G21  G22 ]

W: m-by-m square matrix (observed)
G21: (n − m)-by-m matrix (observed)
G12 = G21ᵀ (observed)
G22: (n − m)-by-(n − m) matrix (not observed)

Form a kernel approximation based on these mn elements

G ≈ G̃ = [ W   ]  W†  [ W  G12 ]
         [ G21 ]

W†: pseudo-inverse of W

Page 27

Nystrom Approximation

Why W †?

Exactly recovers the top-left m-by-m block

The kernel approximation:

[ W    G12 ]     [ W   ]                    [ W     G12        ]
[ G21  G22 ]  ≈  [ G21 ]  W†  [ W  G12 ]  = [ G21   G21 W† G12 ]

Page 28

Nystrom Approximation

Algorithm:
1. Sample m “landmark points”: v1, · · · , vm
2. Compute the kernel values between all training data and the landmark points:

C = [ W   ]  =  [ K(x1, v1)  · · ·  K(x1, vm) ]
    [ G21 ]     [ K(x2, v1)  · · ·  K(x2, vm) ]
                [     ...               ...   ]
                [ K(xn, v1)  · · ·  K(xn, vm) ]

3. Form the kernel approximation G̃ = C W† Cᵀ

(No need to explicitly form the n-by-n matrix)

Time complexity: mn kernel evaluations (O(mnd) time for standard kernels on d-dimensional data)

How to choose m?

Trade-off between accuracy and computational time / memory space (a small numpy sketch of the construction follows)
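A minimal numpy sketch (not from the slides) of the Nystrom construction above with uniformly sampled landmark points; the rbf helper, m, and the explicit error check are illustrative (in practice the n-by-n approximation is never formed).

```python
# Minimal sketch: sample m landmarks, form C and W, approximate G ~ C W^+ C^T.
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(500, 10)
m = 50
landmarks = X[rng.choice(len(X), m, replace=False)]   # uniform sampling

C = rbf(X, landmarks)            # n x m: kernel between all points and landmarks
W = rbf(landmarks, landmarks)    # m x m: kernel among landmarks
G_approx = C @ np.linalg.pinv(W) @ C.T                # formed here only to check the error

G_exact = rbf(X, X)
print(np.linalg.norm(G_exact - G_approx) / np.linalg.norm(G_exact))
```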

Page 29

Nystrom Approximation: Training

Solve the dual problem using low-rank representation.

Example: Kernel ridge regression

min_α   αᵀG̃α + λ‖α‖² − 2yᵀα

Solve a linear system (G̃ + λI )α = y

Use iterative methods (such as CG), with fast matrix-vector multiplication:

(G̃ + λI)p = C W†(Cᵀp) + λp

O(nm) time complexity per iteration
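A minimal scipy sketch (not from the slides) of the CG solve with the implicit matrix-vector product above; C, W_pinv, and y are random stand-ins for the quantities produced by the Nystrom construction.

```python
# Minimal sketch: solve (G~ + lam*I) alpha = y by CG without forming G~.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.RandomState(0)
n, m, lam = 500, 50, 1.0
C = rng.randn(n, m)        # stand-in for the Nystrom factor C
W_pinv = np.eye(m)         # stand-in for W^+
y = rng.randn(n)

def matvec(p):
    # (G~ + lam*I) p = C (W^+ (C^T p)) + lam*p, costs O(nm) per product
    return C @ (W_pinv @ (C.T @ p)) + lam * p

A = LinearOperator((n, n), matvec=matvec)
alpha, info = cg(A, y)
print(info)                # 0 means CG converged
```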

Page 30

Nystrom Approximation: Training

Another approach: reduce to linear ERM problems!

Rewrite as G̃ = C W† Cᵀ = C (W†)^{1/2} (W†)^{1/2} Cᵀ

So the approximate kernel can be represented by a linear kernel with feature matrix C (W†)^{1/2}:

K̃(xi, xj) = G̃ij = uiᵀuj,

where uj = (W†)^{1/2} [K(xj, v1), . . . , K(xj, vm)]ᵀ

The problem is equivalent to a linear model with features u1, · · · , un ∈ Rm:

Kernel SVM ⇒ linear SVM with m features
Kernel ridge regression ⇒ linear ridge regression with m features
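A minimal sketch (not from the slides) of this reduction: form ui = (W†)^{1/2}[K(xi, v1), . . . , K(xi, vm)]ᵀ for every training point and feed the features to a linear SVM; the rbf helper, m, and the eigendecomposition-based pseudo-inverse square root are assumptions.

```python
# Minimal sketch: Nystrom feature map followed by a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(500, 10)
y = np.sign(X[:, 0] + 0.3 * rng.randn(500))

m = 50
V = X[rng.choice(len(X), m, replace=False)]        # landmark points
W = rbf(V, V)
s, U_eig = np.linalg.eigh(W)                       # W = U diag(s) U^T
inv_sqrt = np.where(s > 1e-10, 1.0 / np.sqrt(np.clip(s, 1e-10, None)), 0.0)
W_pinv_sqrt = (U_eig * inv_sqrt) @ U_eig.T         # (W^+)^{1/2}

features = rbf(X, V) @ W_pinv_sqrt                 # rows are u_i^T, shape (n, m)
clf = LinearSVC(C=1.0).fit(features, y)            # kernel SVM -> linear SVM
print(clf.score(features, y))
```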

Page 31

Nystrom Approximation: Prediction

Given the dual solution α.

Prediction for testing sample x :

∑_{i=1}^n αi K̃(xi, x)    or    ∑_{i=1}^n αi K(xi, x) ?

Page 32

Nystrom Approximation: Prediction

Given the dual solution α.

Prediction for testing sample x :

∑_{i=1}^n αi K̃(xi, x)

Need to use the approximate kernel instead of the original kernel!
The approximate kernel gives much better accuracy,
because training & testing should use the same kernel.

Generalization bounds:
(Alaoui and Mahoney, “Fast Randomized Kernel Methods With Statistical Guarantees”. 2014)
(Rudi et al., “Less is More: Nystrom Computational Regularization”. NIPS 2015)
(Using a small number of landmark points can be viewed as a form of regularization.)

Page 33

Nystrom Approximation: Prediction

How to define K̃ (x , xi )?

G̃ = [ K(x1, v1)  · · ·  K(x1, vm)
          ...               ...
      K(xn, v1)  · · ·  K(xn, vm)      W†     [ K(v1, x1)  · · ·  K(v1, xn)  K(v1, x)
      K(x,  v1)  · · ·  K(x,  vm) ]                ...                ...        ...
                                                K(vm, x1)  · · ·  K(vm, xn)  K(vm, x) ]

Compute u = (W†)^{1/2} [K(v1, x), . . . , K(vm, x)]ᵀ

(need m kernel evaluations)

Approximate kernel can be defined by

K̃ (x , xi ) = uTui

Page 34

Nystrom Approximation: Prediction

Prediction:

f(x) = ∑_{i=1}^n αi uᵀui = uᵀ ( ∑_{i=1}^n αi ui )

( ∑_{i=1}^n αi ui ) can be pre-computed
⇒ prediction time is m kernel evaluations

Original kernel method: need n kernel evaluations for prediction.
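A tiny sketch (not from the slides) of this prediction shortcut, reusing names from the earlier Nystrom training sketch (features with rows uiᵀ, W_pinv_sqrt, landmarks V, and the rbf kernel); beta is an illustrative name for ∑i αi ui.

```python
# Tiny sketch: beta = sum_i alpha_i u_i is computed once after training,
# so each test prediction needs only m kernel evaluations.
import numpy as np

def nystrom_prediction_weights(features, alpha):
    return features.T @ alpha                          # beta = sum_i alpha_i u_i, shape (m,)

def nystrom_predict(x, beta, W_pinv_sqrt, V, kernel):
    u = W_pinv_sqrt @ kernel(x[None, :], V).ravel()    # m kernel evaluations
    return float(u @ beta)                             # f(x) = u^T (sum_i alpha_i u_i)
```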

Summary:

                   Quality   Training time                            Prediction time
Original kernel    exact     n^{1.x} kernel evals + kernel training   n kernel evals
Nystrom (m)        rank m    nm kernel evals + linear training        m kernel evals

Page 35

Nystrom Approximation

How to select the landmark points (v1, · · · , vm)?

Traditional approach: uniform random sampling from the training data

Importance sampling (leverage scores):
Drineas and Mahoney, “On the Nystrom method for approximating a Gram matrix for improved kernel-based learning”. JMLR 2005.
Gittens and Mahoney, “Revisiting the Nystrom method for improved large-scale machine learning”. ICML 2013.

K-means sampling (performs extremely well): run k-means clustering and set v1, · · · , vm to be the cluster centers
Zhang et al., “Improved Nystrom low rank approximation and error analysis”. ICML 2008.

Subspace distance:
Lim et al., “Double Nystrom method: An efficient and accurate Nystrom scheme for large-scale data sets”. ICML 2015.

Page 36

Nystrom Approximation (other related papers)

Distributed setting: (Kumar et al., “Ensemble Nystrom method”. NIPS 2009)

Block diagonal + low-rank approximation: (Si et al., “Memory efficient kernel approximation”. ICML 2014)

Nystrom method for fast prediction: (Hsieh et al., “Fast prediction for large-scale kernel machines”. NIPS 2014)

Structured landmark points: (Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016)

Page 37

Kernel Approximation (random features)

Page 38

Random Features

Directly approximate the kernel function

(Data-independent)

Generate nonlinear “features”

⇒ reduce the problem to linear model training

Several examples:

Random Fourier features:
(Rahimi and Recht, “Random features for large-scale kernel machines”. NIPS 2007)
Random features for polynomial kernels:
(Kar and Karnick, “Random feature maps for dot product kernels”. AISTATS 2012)

Page 39

Random Fourier Features

Random features for “Shift-invariant kernel”:

A continuous kernel is shift-invariant if K (x , y) = k(x − y).

Gaussian kernel: K(x, y) = exp(−‖x − y‖₂²)

Laplacian kernel: K(x, y) = exp(−‖x − y‖₁)

Shift invariant kernel (if positive definite) can be written as:

k(x − y) = ∫_{Rd} P(w) e^{jwᵀ(x−y)} dw

for some probability distribution P(w).

Page 40

Random Fourier Features

Taking the real part we have

k(x − y) = ∫_{Rd} P(w) cos(wᵀx − wᵀy) dw

         = ∫_{Rd} P(w) ( cos(wᵀx) cos(wᵀy) + sin(wᵀx) sin(wᵀy) ) dw

         = E_{w∼P(w)}[ zw(x)ᵀ zw(y) ]

where zw(x)ᵀ = [cos(wᵀx), sin(wᵀx)].

zw(x)ᵀzw(y) is an unbiased estimator of K(x, y) if w is sampled from P(·)

Page 41

Random Fourier Features

Sample m vectors w1, · · · ,wm from P(·) distribution

Generate random features for each sample:

ui = (1/√m) [ sin(w1ᵀxi), cos(w1ᵀxi), sin(w2ᵀxi), cos(w2ᵀxi), . . . , sin(wmᵀxi), cos(wmᵀxi) ]ᵀ

K(xi, xj) ≈ uiᵀuj

(larger m leads to better approximation)
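A minimal numpy sketch (not from the slides) of random Fourier features for the Gaussian kernel K(x, y) = exp(−γ‖x − y‖²); sampling w ∼ N(0, 2γI) is the corresponding distribution P(w), and the values of n, d, m, γ are illustrative.

```python
# Minimal sketch: random Fourier features for the Gaussian kernel.
import numpy as np

rng = np.random.RandomState(0)
n, d, m, gamma = 300, 10, 500, 0.5
X = rng.randn(n, d)

W = rng.randn(m, d) * np.sqrt(2.0 * gamma)                 # w_1, ..., w_m ~ P(w)
proj = X @ W.T                                             # (n, m) of w_k^T x_i
U = np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(m)   # (n, 2m) features

K_exact = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K_approx = U @ U.T
print(np.abs(K_exact - K_approx).max())                    # shrinks as m grows
```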

Page 42

Random Features

G ≈ UUᵀ, where U = [ u1ᵀ ]
                   [ ... ]
                   [ unᵀ ]

Rank 2m approximation.

The problem can be reduced to linear classification/regression for u1, · · · , un.

Time complexity: O(nmd) time + linear training

(Nystrom: O(nmd + m³) time + linear training)

Prediction time: O(md) per sample

(Nystrom: O(md) per sample)

Page 43

Fastfood

Random Fourier features require O(md) time to generate m features:

Main bottleneck: computing wiᵀx for all i = 1, · · · , m

Can we compute this in O(m log d) time?

Fastfood: proposed in (Le et al., “Fastfood - Approximating Kernel Expansions in Loglinear Time”. ICML 2013)

Reduce time by using “structured random features”


Page 45

Fastfood

Tool: fast matrix-vector multiplication for the Hadamard matrix (Hd ∈ Rd×d)

Hdx requires O(d log d) time

Hadamard matrix:

H1 = [ 1 ]

H2 = [ 1   1 ]
     [ 1  −1 ]

H_{2^k} = [ H_{2^{k−1}}   H_{2^{k−1}} ]
          [ H_{2^{k−1}}  −H_{2^{k−1}} ]

Hd x can be computed efficiently by the Fast Hadamard Transform (dynamic programming)

Page 46

Fastfood

Sample m = d random features by

W = Hd  [ v1  0   · · ·  0  ]
        [ 0   v2  · · ·  0  ]
        [          ...      ]
        [ 0   0   · · ·  vd ]

So W x only needs O(d log d) time.

Instead of sampling W (d² elements), we just sample v1, · · · , vd from N(0, σ)

Each row of W is still N(0, σ) (but rows are not independent)

Faster computation, but performance is slightly worse than the original random features.
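A minimal numpy sketch (not from the slides) of the Fast Hadamard Transform and the structured product Wx = Hd(v ⊙ x) above; it assumes d is a power of two and omits the permutation and scaling matrices of the full Fastfood construction.

```python
# Minimal sketch: fast Walsh-Hadamard transform in O(d log d), and the
# structured product W x = H_d (v * x) with W = H_d diag(v).
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^k vector (unnormalized)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

d = 8
rng = np.random.RandomState(0)
x = rng.randn(d)
v = rng.randn(d)                       # v_1, ..., v_d ~ N(0, sigma)

# Explicit Hadamard matrix, built only to check the transform (never formed in practice)
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])

print(np.allclose(fwht(x), H @ x))                       # FWHT matches H_d x
print(np.allclose(fwht(v * x), (H @ np.diag(v)) @ x))    # W x with W = H_d diag(v)
```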

Page 47

Extensions

Fastfood: structured random features to improve computational speed:
(Le et al., “Fastfood - Approximating Kernel Expansions in Loglinear Time”. ICML 2013)

Structured landmark points for Nystrom approximation:
(Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016)

Doubly stochastic gradient with random features:
(Dai et al., “Scalable Kernel Methods via Doubly Stochastic Gradients”. NIPS 2015)

Generalization bound (?)

Page 48

Coming up

Choose the paper for presentation (due this Sunday, Oct 23).

Questions?