Difficulties with Nonlinear SVM for Large Problems


Page 1: Difficulties with Nonlinear SVM for Large Problems

Difficulties with Nonlinear SVM for Large Problems

• The nonlinear kernel K(A, A′) ∈ R^{m×m} is fully dense
• Computational complexity depends on m
• Separating surface depends on almost the entire dataset
• Complexity of nonlinear SSVM ≈ O((m+1)³)
• Runs out of memory while storing the kernel matrix
• Long CPU time to compute the dense kernel matrix
• Need to generate and store O(m²) entries
• Need to store the entire dataset even after solving the problem
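The O(m²) storage cost is worth a quick back-of-envelope check. The sketch below (an illustration, not from the slides; the function name `kernel_bytes` is invented) uses the sizes from the UCI Adult experiment later in the deck: the full dense kernel for m = 32562 needs roughly 8.5 GB at 8 bytes per entry, while the reduced m × m̄ kernel with m̄ = 326 fits in under 100 MB.

```python
# Back-of-envelope memory cost of the dense kernel K(A, A') in R^{m x m},
# versus the reduced m x m_bar kernel K(A, A_bar') used by RSVM.
def kernel_bytes(m: int, n_cols: int, bytes_per_entry: int = 8) -> int:
    """Memory needed to store an m x n_cols matrix of doubles."""
    return m * n_cols * bytes_per_entry

m, m_bar = 32562, 326  # sizes from the UCI Adult experiment below
full_gb = kernel_bytes(m, m) / 1e9
reduced_mb = kernel_bytes(m, m_bar) / 1e6
print(f"full kernel: {full_gb:.1f} GB, reduced kernel: {reduced_mb:.1f} MB")
```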

Page 2:

Reduced Support Vector Machine

(i) Choose a random subset matrix Ā ∈ R^{m̄×n} of the entire data matrix A ∈ R^{m×n} (m̄ << m).

(ii) Solve the following problem by the Newton method, with the corresponding D̄ ⊂ D:

min_{(ū,γ) ∈ R^{m̄+1}} (ν/2) ||p(e − D(K(A, Ā′)D̄ū − eγ), α)||₂² + (1/2) ||(ū, γ)||₂²

(iii) The nonlinear classifier is defined by the optimal (ū, γ) solution in step (ii):

Nonlinear classifier: K(x′, Ā′)D̄ū = γ

Using K(Ā, Ā′) gives lousy results!
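The three steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: a Gaussian kernel is assumed, scipy's BFGS stands in for the Newton method named on the slide, and all names (`gaussian_kernel`, `rsvm_fit`, `mu`, `nu`) are invented for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, mu=1.0):
    """K(A, B')_ij = exp(-mu * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

def p(x, alpha=5.0):
    """Smooth plus function: p(x, alpha) -> (x)_+ as alpha grows."""
    return x + np.logaddexp(0.0, -alpha * x) / alpha

def rsvm_fit(A, d, m_bar=10, nu=100.0, mu=1.0, seed=0):
    # step (i): random subset A_bar of the rows of A
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(A), size=m_bar, replace=False)
    A_bar, d_bar = A[idx], d[idx]
    KD = gaussian_kernel(A, A_bar, mu) * d_bar        # K(A, A_bar') D_bar

    # step (ii): minimize the smoothed objective (BFGS in place of Newton)
    def obj(z):
        u, g = z[:-1], z[-1]
        r = p(1.0 - d * (KD @ u - g))
        return 0.5 * nu * (r @ r) + 0.5 * (u @ u + g * g)

    z = minimize(obj, np.zeros(m_bar + 1), method="BFGS").x
    u, g = z[:-1], z[-1]
    # step (iii): classifier x -> sign(K(x', A_bar') D_bar u - g)
    return lambda X: np.sign(gaussian_kernel(X, A_bar, mu) * d_bar @ u - g)

# toy usage: two well-separated Gaussian blobs labeled +1 / -1
rng = np.random.default_rng(1)
A = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
d = np.r_[np.ones(40), -np.ones(40)]
clf = rsvm_fit(A, d)
print((clf(A) == d).mean())
```

Note that only the m × m̄ rectangular kernel is ever formed, which is exactly what makes the method feasible when the full m × m kernel would not fit in memory.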

Page 3:

A Nonlinear Kernel Application

Checkerboard Training Set: 1000 points in R²; separate 486 asterisks from 514 dots.

Page 4:

Conventional SVM Result on Checkerboard Using 50 Randomly Selected Points Out of 1000

K(Ā, Ā′) ∈ R^{50×50}

Page 5:

RSVM Result on Checkerboard Using the SAME 50 Random Points Out of 1000

K(A, Ā′) ∈ R^{1000×50}

Page 6:

RSVM on Moderate Sized Problems (Best Test Set Correctness %, CPU seconds)

Dataset (m × n, m̄)              K(A,Ā′) ∈ R^{m×m̄}   K(A,A′) ∈ R^{m×m}   K(Ā,Ā′) ∈ R^{m̄×m̄}
Cleveland Heart (297 × 13, 30)   86.47 / 3.04         85.92 / 32.42       76.88 / 1.58
BUPA Liver (345 × 6, 35)         74.86 / 2.68         73.62 / 32.61       68.95 / 2.04
Ionosphere (351 × 34, 35)        95.19 / 5.02         94.35 / 59.88       88.70 / 2.13
Pima Indians (768 × 8, 50)       78.64 / 5.72         76.59 / 328.3       57.32 / 4.64
Tic-Tac-Toe (958 × 9, 96)        98.75 / 14.56        98.43 / 1033.5      88.24 / 8.87
Mushroom (8124 × 22, 215)        89.04 / 466.20       N/A / N/A           83.90 / 221.50

(Each cell: test set correctness % / CPU seconds.)

Page 7:

RSVM on Large UCI Adult Dataset (A ∈ R^{m×123})

Average Test Correctness % and Standard Deviation over 50 Runs

(Train, Test)      K(A,Ā′) ∈ R^{m×m̄}: % / Std.Dev.   K(Ā,Ā′) ∈ R^{m̄×m̄}: % / Std.Dev.   m̄     m̄/m
(6414, 26148)      84.47 / 0.001                      77.03 / 0.014                      210    3.2%
(11221, 21341)     84.71 / 0.001                      75.96 / 0.016                      225    2.0%
(16101, 16461)     84.90 / 0.001                      75.45 / 0.017                      242    1.5%
(22697, 9865)      85.31 / 0.001                      76.73 / 0.018                      284    1.2%
(32562, 16282)     85.07 / 0.001                      76.95 / 0.013                      326    1.0%

Page 8:

[Figure: CPU time (sec.) vs. training set size for RSVM, SMO, and PCGC.]

Page 9:

Support Vector Regression (Linear Case: f(x) = x′w + b)

Given the training set: S = {(xᵢ, yᵢ) | xᵢ ∈ Rⁿ, yᵢ ∈ R, i = 1, …, l}

Find a linear function f(x) = x′w + b, where (w, b) is determined by solving a minimization problem that guarantees the smallest overall empirical error made by f(x) = x′w + b.

Motivated by SVM: ||w||₂² should be as small as possible

Some tiny errors should be discarded

Page 10:

ε-Insensitive Loss Function

The ε-insensitive loss made by the estimation function f at the data point (xᵢ, yᵢ) is

|yᵢ − f(xᵢ)|_ε = max{0, |yᵢ − f(xᵢ)| − ε}

In general,

|ξ|_ε = max{0, |ξ| − ε} = 0 if |ξ| ≤ ε; |ξ| − ε otherwise

If ξ ∈ Rⁿ then |ξ|_ε ∈ Rⁿ is defined componentwise:

(|ξ|_ε)ᵢ = |ξᵢ|_ε, i = 1, …, n
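The componentwise definition above is a one-liner in code; this small sketch (the name `eps_loss` is invented) just restates the formula:

```python
import numpy as np

def eps_loss(xi, eps):
    """|xi|_eps = max{0, |xi| - eps}, applied elementwise."""
    return np.maximum(0.0, np.abs(xi) - eps)

# errors inside the eps-tube cost nothing; outside, the excess is charged
print(eps_loss(np.array([-2.0, -0.5, 0.3, 1.5]), eps=1.0).tolist())  # [1.0, 0.0, 0.0, 0.5]
```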

Page 11:

ε-Insensitive Linear Regression

[Figure: data points scattered around the line f(x) = x′w + b with an ε-tube; a point above the tube incurs error yⱼ − f(xⱼ) − ε, a point below incurs error f(x_k) − y_k − ε.]

Find (w, b) with the smallest overall error

Page 12:

ε-Insensitive Support Vector Regression Model

Motivated by SVM: ||w||₂² should be as small as possible

Some tiny errors should be discarded

min_{(w,b,ξ) ∈ R^{n+1+m}} (1/2)||w||₂² + C e′|ξ|_ε

where |ξ|_ε ∈ R^m, (|ξ|_ε)ᵢ = max(0, |Aᵢw + b − yᵢ| − ε)

Page 13:

Reformulated ε-SVR as a Constrained Minimization Problem

min_{(w,b,ξ,ξ*) ∈ R^{n+1+2m}} (1/2) w′w + C e′(ξ + ξ*)

subject to
y − Aw − eb ≤ eε + ξ
Aw + eb − y ≤ eε + ξ*
ξ, ξ* ≥ 0

This is a minimization problem with n+1+2m variables and 2m constraints: the reformulation enlarges the problem size and the computational complexity of solving it.

Page 14:

SV Regression by Minimizing Quadratic ε-Insensitive Loss

We minimize ||(w, b)||₂² at the same time (Occam's razor: the simplest is the best)

We have the following (nonsmooth) problem:

min_{(w,b,ξ) ∈ R^{n+1+l}} (1/2)(||w||₂² + b²) + (C/2) || |ξ|_ε ||₂²

where (|ξ|_ε)ᵢ = |yᵢ − (w′xᵢ + b)|_ε

This gives strong convexity of the problem.

Page 15:

ε-Insensitive Loss Function

|x|_ε = (x − ε)₊ + (−x − ε)₊

[Figure: the curves (−x − ε)₊, (x − ε)₊, and |x|_ε.]

Page 16:

Quadratic ε-Insensitive Loss Function

|x|²_ε = ((x − ε)₊ + (−x − ε)₊)² = (x − ε)₊² + (−x − ε)₊²

since (x − ε)₊ · (−x − ε)₊ = 0
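The cross term vanishing is what lets the square split cleanly: x − ε and −x − ε cannot both be positive. A quick numerical check of the identity (an illustration; `plus` is an invented helper name):

```python
import numpy as np

def plus(x):
    """(x)_+ = max(x, 0), elementwise."""
    return np.maximum(x, 0.0)

eps = 1.0
x = np.linspace(-3, 3, 601)
lhs = np.maximum(0.0, np.abs(x) - eps) ** 2          # |x|_eps^2
rhs = plus(x - eps) ** 2 + plus(-x - eps) ** 2
print(np.allclose(lhs, rhs))                          # the identity holds
print(np.allclose(plus(x - eps) * plus(-x - eps), 0)) # cross term is zero
```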

Page 17:

Replace the Quadratic ε-Insensitive Function by the p²_ε-Function

p²_ε(x, β) = (p(x − ε, β))² + (p(−x − ε, β))²

where the p-function is defined by

p(x, β) = x + (1/β) log(1 + exp(−βx))

[Figure: the p-function with β = 10: p(x, 10), x ∈ [−3, 3].]
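The p-function is a smooth overestimate of the plus function (x)₊ that converges to it as β grows; its maximum gap, log 2 / β, occurs at x = 0. The sketch below (an illustration; `logaddexp` is used for a numerically stable log(1 + exp(·))) checks this:

```python
import numpy as np

def p(x, beta):
    """p(x, beta) = x + (1/beta) * log(1 + exp(-beta * x)), computed stably."""
    return x + np.logaddexp(0.0, -beta * x) / beta

x = np.linspace(-3, 3, 121)
plus = np.maximum(x, 0.0)
for beta in (1.0, 10.0, 100.0):
    # gap shrinks like log(2)/beta, attained at x = 0
    print(beta, np.max(np.abs(p(x, beta) - plus)))
```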

Page 18:

[Figure: |x|²_ε compared with p²_ε(x, β) for ε = 1, β = 5.]

Page 19:

ε-Insensitive Smooth Support Vector Regression

min_{(w,b) ∈ R^{n+1}} Φ_{ε,β}(w, b) := (1/2)(w′w + b²) + (C/2) Σ_{i=1}^{m} p²_ε(Aᵢw + b − yᵢ, β)

≈ min_{(w,b) ∈ R^{n+1}} (1/2)(w′w + b²) + (C/2) Σ_{i=1}^{m} |Aᵢw + b − yᵢ|²_ε

This is a strongly convex minimization problem without any constraints. The objective function is twice differentiable, so we can use a fast Newton-Armijo method to solve it.
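Because the smoothed objective is unconstrained and differentiable, any smooth optimizer can minimize it directly. The sketch below is a minimal illustration, not the slides' solver: scipy's BFGS replaces Newton-Armijo, and `ssvr_fit` is an invented name.

```python
import numpy as np
from scipy.optimize import minimize

def p(x, beta):
    """Smooth plus function."""
    return x + np.logaddexp(0.0, -beta * x) / beta

def p2_eps(x, eps, beta):
    """Smoothed quadratic eps-insensitive loss."""
    return p(x - eps, beta) ** 2 + p(-x - eps, beta) ** 2

def ssvr_fit(A, y, C=100.0, eps=0.1, beta=5.0):
    n = A.shape[1]

    def obj(z):
        w, b = z[:-1], z[-1]
        r = A @ w + b - y
        return 0.5 * (w @ w + b * b) + 0.5 * C * p2_eps(r, eps, beta).sum()

    z = minimize(obj, np.zeros(n + 1), method="BFGS").x
    return z[:-1], z[-1]

# usage: recover a noiseless linear trend y = 2x + 1 (up to the eps-tube
# and the regularization on (w, b))
rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, (50, 1))
y = 2.0 * A[:, 0] + 1.0
w, b = ssvr_fit(A, y)
print(w[0], b)
```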

Page 20:

Nonlinear ε-SVR

Based on the duality theorem and the KKT optimality conditions:

w = A′α, α ∈ R^m

In the nonlinear case:

y ≈ Aw + eb
y ≈ AA′α + eb
y ≈ K(A, A′)α + eb

Page 21:

Nonlinear SVR

min_{(α,b) ∈ R^{m+1}} (1/2)||α||₂² + C Σ_{i=1}^{m} |K(Aᵢ, A′)α + b − yᵢ|_ε

Let A ∈ R^{m×n} and B ∈ R^{n×l}; then K(A, B): R^{m×n} × R^{n×l} ⇒ R^{m×l}, and K(Aᵢ, A′) ∈ R^{1×m}

Nonlinear regression function: f(x) = K(x′, A′)α + b

Page 22:

Nonlinear Smooth Support Vector ε-Insensitive Regression

min_{(α,b) ∈ R^{m+1}} (1/2)(α′α + b²) + (C/2) Σ_{i=1}^{m} p²_ε(K(Aᵢ, A′)α + b − yᵢ, β)

≈ min_{(α,b) ∈ R^{m+1}} (1/2)(α′α + b²) + (C/2) Σ_{i=1}^{m} |K(Aᵢ, A′)α + b − yᵢ|²_ε
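The nonlinear version is the same unconstrained smooth problem with (w, b) replaced by (α, b) and rows Aᵢ replaced by kernel rows K(Aᵢ, A′). The sketch below assumes a Gaussian kernel and again uses BFGS in place of Newton-Armijo; all names and parameter values are illustrative. For large m, the full K(A, A′) here is exactly what the reduced kernel K(A, Ā′) would replace.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, mu=10.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

def p2_eps(x, eps, beta):
    p = lambda t: t + np.logaddexp(0.0, -beta * t) / beta
    return p(x - eps) ** 2 + p(-x - eps) ** 2

def nonlinear_ssvr_fit(A, y, C=1000.0, eps=0.02, beta=20.0, mu=10.0):
    K = gaussian_kernel(A, A, mu)          # K(A, A') in R^{m x m}
    m = len(y)

    def obj(z):
        a, b = z[:-1], z[-1]
        r = K @ a + b - y
        return 0.5 * (a @ a + b * b) + 0.5 * C * p2_eps(r, eps, beta).sum()

    z = minimize(obj, np.zeros(m + 1), method="BFGS").x
    a, b = z[:-1], z[-1]
    # nonlinear regression function f(x) = K(x', A') a + b
    return lambda X: gaussian_kernel(X, A, mu) @ a + b

# usage: fit a smooth nonlinear target and report the 2-norm relative error
x = np.linspace(-1, 1, 40)[:, None]
y = np.sin(3 * x[:, 0])
f = nonlinear_ssvr_fit(x, y)
err = np.linalg.norm(y - f(x)) / np.linalg.norm(y)
print(err)
```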

Page 23:

Numerical Results

Training set and testing set (slice method)

A Gaussian kernel is used to generate the nonlinear ε-SVR in all experiments

The reduced kernel technique is utilized when the training dataset is bigger than 1000

Error measure: 2-norm relative error ||y − ŷ||₂ / ||y||₂, where y: observations, ŷ: predicted values
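The error measure translates directly to code (a trivial illustration; `relative_error` is an invented name):

```python
import numpy as np

def relative_error(y, y_hat):
    """2-norm relative error: ||y - y_hat||_2 / ||y||_2."""
    return np.linalg.norm(y - y_hat) / np.linalg.norm(y)

print(relative_error(np.array([3.0, 4.0]), np.array([3.0, 0.0])))  # 0.8
```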

Page 24:

101 Data Points in R × R; Nonlinear SSVR with kernel exp(−μ||xᵢ − xⱼ||₂²)

f(x) = 0.5 sin(10πx)/(10πx) + noise, x ∈ [−1, 1], 101 points

Noise: mean = 0, σ = 0.04

Parameters: ν = 50, μ = 5, ε = 0.02

Training time: 0.3 sec.

Page 25:

First Artificial Dataset

f(x) = 0.5 sin(30πx)/(30πx) + random noise with mean 0 and standard deviation 0.04

ε-SSVR: training time 0.016 sec., error 0.059
LIBSVM: training time 0.015 sec., error 0.068

Page 26:

481 Data Points in R² × R

[Figure: original function vs. estimated function.]

Noise: mean = 0, σ = 0.4

Parameters: ν = 50, μ = 1, ε = 0.5

Training time: 9.61 sec.; Mean Absolute Error (MAE) of 49×49 mesh points: 0.1761

Page 27:

[Figure: original function vs. estimated function.]

Noise: mean = 0, σ = 0.4

Using the reduced kernel: K(A, Ā′) ∈ R^{28900×300}

Parameters: C = 10000, μ = 1, ε = 0.2

Training time: 22.58 sec.; MAE of 49×49 mesh points: 0.0513

Page 28:

Real Datasets

Page 29:

Linear ε-SSVR: Tenfold Numerical Result

Page 30:

Nonlinear ε-SSVR: Tenfold Numerical Result (1/2)

Page 31:

Nonlinear ε-SSVR: Tenfold Numerical Result (2/2)
