Support Vector Machines for Data Fitting and Classification, by David R. Musicant with Olvi L. Mangasarian
Support Vector Machines for Data Fitting and Classification
David R. Musicant
with Olvi L. Mangasarian
UW-Madison Data Mining Institute Annual Review, June 2, 2000
Overview
Regression and its role in data mining
Robust support vector regression
– Our general formulation
Tolerant support vector regression
– Our contributions
– Massive support vector regression
– Integration with data mining tools
Active support vector machines
Other research and future directions
What is regression?
Regression forms a rule for predicting an unknown numerical feature from known ones.
Example: predicting purchase habits. Can we use...
– age, income, level of education
To predict...
– purchasing patterns?
And simultaneously...
– avoid the “pitfalls” that standard statistical regression falls into?
Regression example
Can we use:

| Age | Income | Years of Education | $ spent on software |
|-----|--------|--------------------|---------------------|
| 30 | $56,000/yr | 16 | $800 |
| 50 | $60,000/yr | 12 | $0 |
| 16 | $2,000/yr | 11 | $200 |

To predict:

| Age | Income | Years of Education | $ spent on software |
|-----|--------|--------------------|---------------------|
| 40 | $48,000/yr | 17 | ? |
| 29 | $60,000/yr | 18 | ? |
Role in data mining
Goal: find new relationships in data
– e.g. customer behavior, scientific experimentation
Regression explores the importance of each known feature in predicting the unknown one.
– Feature selection
Regression is a form of supervised learning.
– Use data where the predictive value is known for given instances, to form a rule
Massive datasets
Regression is a fundamental task in data mining.
Part I: Robust Regression
a.k.a. Huber Regression
“Standard” Linear Regression
Find w, b such that: $Aw + be \approx y$
Predicted values: $\hat{y} = Aw + be$
[Figure: data matrix A, response y, intercept b, and the fitted line]
Optimization problem
Find w, b such that: $Aw + be \approx y$
Bound the error by s: $-s \le Aw + be - y \le s$
Minimize the error:

$$\min \sum s_i^2 \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$
Traditional approach: minimize squared error.
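As a concrete sketch, the squared-error fit can be computed in closed form. This is an illustrative stand-in (plain least squares via NumPy on the toy purchase numbers from the earlier slide), not the talk's constrained formulation:

```python
import numpy as np

# Toy data from the purchase-habits slide: age, income, years of education -> $ spent.
A = np.array([[30.0, 56000.0, 16.0],
              [50.0, 60000.0, 12.0],
              [16.0,  2000.0, 11.0]])
y = np.array([800.0, 0.0, 200.0])

# Append a column of ones (the vector e) so the intercept b is fitted jointly with w.
Ae = np.hstack([A, np.ones((A.shape[0], 1))])

# Minimize the sum of squared errors s_i^2, where s = Ae @ [w; b] - y.
wb, *_ = np.linalg.lstsq(Ae, y, rcond=None)
w, b = wb[:-1], wb[-1]
s = Ae @ wb - y  # the bounded errors from the formulation above
```

With only three points and four unknowns the fit is exact here; the interesting behavior of this loss shows up on larger, noisier data, as the next slides discuss.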
Examining the loss function
Standard regression uses a squared error loss function.
– Points which are far from the predicted line (outliers) are overemphasized.
Alternative loss function
Instead of squared error, try the absolute value of the error:

$$\min \sum |s_i| \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$
This is called the 1-norm loss function.
1-Norm Problems And Solution
– Overemphasizes error on points close to the predicted line
Solution: the Huber loss function, a hybrid approach
– Quadratic for small errors, linear for large errors
Many practitioners prefer the Huber loss function.
Mathematical Formulation
γ indicates the switchover from quadratic to linear:

$$\rho(t) = \begin{cases} t^2/2 & \text{if } |t| \le \gamma \\ \gamma |t| - \gamma^2/2 & \text{if } |t| > \gamma \end{cases}$$

Larger γ means “more quadratic.”
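A direct transcription of ρ as code (a minimal sketch; `gamma` is the switchover parameter γ above):

```python
import numpy as np

def huber(t, gamma):
    """Huber loss: quadratic for |t| <= gamma, linear with matched slope beyond."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma,
                    t**2 / 2.0,
                    gamma * np.abs(t) - gamma**2 / 2.0)
```

At |t| = γ both branches equal γ²/2, so the loss and its slope are continuous, which is what makes the hybrid well behaved.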
Regression Approach Summary
Quadratic loss function
– Standard method in statistics
– Over-emphasizes outliers
Linear loss function (1-norm)
– Formulates well as a linear program
– Over-emphasizes small errors
Huber loss function (hybrid approach)
– Appropriate emphasis on large and small errors
Previous attempts complicated
Earlier efforts to solve Huber regression:
– Huber: Gauss-Seidel method
– Madsen/Nielsen: Newton method
– Li: conjugate gradient method
– Smola: dual quadratic program
Our new approach: a convex quadratic program:

$$\min_{w \in R^d,\, z \in R^l,\, t \in R^l} \sum z_i^2/2 + \gamma \sum |t_i| \quad \text{s.t.} \quad z - t \le Aw + be - y \le z + t$$
Our new approach is simpler and faster.
Experimental Results: Census20k
20,000 points, 11 features
[Bar chart: CPU time for the Li, Madsen/Nielsen, Huber, Smola, and MM methods at γ = 0.1, 1, and 1.345; MM is fastest]
Experimental Results: CPUSmall
8,192 points, 12 features
[Bar chart: CPU time for the Li, Madsen/Nielsen, Huber, Smola, and MM methods at γ = 0.1, 1, and 1.345; MM is fastest]
Introduce nonlinear kernel!
Begin with the previous formulation:

$$\min_{w \in R^d,\, z \in R^l,\, t \in R^l} \sum z_i^2/2 + \gamma \sum |t_i| \quad \text{s.t.} \quad z - t \le Aw + be - y \le z + t$$

Substitute $w = A'\alpha$ and minimize over $\alpha$ instead:

$$z - t \le AA'\alpha + be - y \le z + t$$

Substitute $K(A, A')$ for $AA'$:

$$z - t \le K(A, A')\alpha + be - y \le z + t$$

A kernel is a nonlinear function.
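The talk's experiments use a Gaussian radial basis kernel; here is a minimal sketch of computing such a K(A, B) (the bandwidth `mu` is an illustrative choice of ours, not a value from the talk):

```python
import numpy as np

def gaussian_kernel(A, B, mu=0.1):
    """K(A, B)_{ij} = exp(-mu * ||A_i - B_j||^2), the Gaussian (RBF) kernel."""
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-mu * np.maximum(sq, 0.0))  # clamp tiny negatives from rounding
```

With `K = gaussian_kernel(A, A)`, the linear constraint term $AA'\alpha$ is simply replaced by $K\alpha$.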
Nonlinear results

| Dataset | Kernel | Training Accuracy | Testing Accuracy |
|---------|--------|-------------------|------------------|
| CPUSmall | Linear | 94.50% | 94.06% |
| CPUSmall | Gaussian | 97.26% | 95.90% |
| Boston Housing | Linear | 85.60% | 83.81% |
| Boston Housing | Gaussian | 92.36% | 88.15% |
Nonlinear kernels improve accuracy.
Part II: Tolerant Regression
a.k.a. Tolerant Training
Regression Approach Summary
Quadratic loss function
– Standard method in statistics
– Over-emphasizes outliers
Linear loss function (1-norm)
– Formulates well as a linear program
– Over-emphasizes small errors
Huber loss function (hybrid approach)
– Appropriate emphasis on large and small errors
Optimization problem (1-norm)
Find w, b such that: $Aw + be \approx y$
Bound the error by s: $-s \le Aw + be - y \le s$
Minimize the error:

$$\min \sum |s_i| \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$
Minimize the magnitude of the error.
The overfitting issue
Noisy training data can be fitted “too well”
– leads to poor generalization on future data
Prefer simpler regressions, i.e. where
– some w coefficients are zero
– the line is “flatter”
[Figure: noisy data A and the fitted line $\hat{y} = Aw + be$]
Reducing overfitting
To achieve both goals
– minimize the magnitude of the w vector as well as the error:

$$\min \sum |w_i| + C \sum s_i \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$

C is a parameter to balance the two goals
– Chosen by experimentation
Reduces overfitting due to points far from the surface
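To see the trade-off in the objective $\sum |w_i| + C \sum |s_i|$ in action, here is a small sketch that minimizes it by plain subgradient descent on synthetic data. This is only an illustration: the talk solves the equivalent linear program exactly, and the data, step size, and iteration count here are arbitrary choices of ours.

```python
import numpy as np

def objective(w, b, A, y, C):
    # The regularized 1-norm objective: |w|_1 plus C times the 1-norm error.
    s = A @ w + b - y
    return np.sum(np.abs(w)) + C * np.sum(np.abs(s))

# Synthetic data with a known sparse coefficient vector.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 3))
y = A @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=40)

C, step = 10.0, 1e-3
w, b = np.zeros(3), 0.0
best = (w.copy(), b, objective(w, b, A, y, C))
for _ in range(2000):
    s = A @ w + b - y
    gw = np.sign(w) + C * (A.T @ np.sign(s))  # subgradient of the objective in w
    gb = C * np.sum(np.sign(s))               # subgradient in b
    w, b = w - step * gw, b - step * gb
    f = objective(w, b, A, y, C)
    if f < best[2]:
        best = (w.copy(), b, f)
w_best, b_best, f_best = best
```

Larger C weights fitting the data more heavily; smaller C pushes more w coefficients toward zero, the “flatter” regressions the previous slide prefers.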
Overfitting again: “close” points
“Close” points may be wrong due to noise only
– The line should be influenced by “real” data, not noise
Ignore errors from those points which are close (within ε of the line)!
[Figure: fitted line $\hat{y} = Aw + be$ with tolerance ε]
Tolerant regression
Begin with the regularized 1-norm formulation:

$$\min \sum |w_i| + C \sum s_i \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$

Allow an interval of size ε with uniform error: $e\varepsilon \le s$
How large should ε be?
– As large as possible, while preserving accuracy:

$$\min \sum |w_i| + C \sum s_i - C\mu\varepsilon \quad \text{s.t.} \quad -s \le Aw + be - y \le s,\; e\varepsilon \le s$$
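The effect of the tolerance is an ε-insensitive loss: residuals inside the interval cost nothing, and only the excess beyond ε is penalized. A one-line sketch:

```python
import numpy as np

def tolerant_loss(r, eps):
    """Penalize only the part of the residual r that exceeds the tolerance eps."""
    return np.maximum(np.abs(np.asarray(r, dtype=float)) - eps, 0.0)
```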
How about a nonlinear surface?
Introduce nonlinear kernel!
Begin with the previous formulation:

$$\min \sum |w_i| + C \sum s_i - C\mu\varepsilon \quad \text{s.t.} \quad -s \le Aw + be - y \le s,\; e\varepsilon \le s$$

Substitute $w = A'\alpha$ and minimize over $\alpha$ instead:

$$-s \le AA'\alpha + be - y \le s$$

Substitute $K(A, A')$ for $AA'$:

$$-s \le K(A, A')\alpha + be - y \le s$$

A kernel is a nonlinear function.
Our improvements
This formulation and interpretation is new!
– Improves intuition over prior results
– Uses fewer variables
– Solves faster!
Computational tests run on DMI Locop2
– Dell PowerEdge 6300 server
– Four gigabytes of memory, 36 gigabytes of disk space
– Windows NT Server 4.0
– CPLEX 6.5 solver
– Donated to UW by Microsoft Corporation
Comparison Results
(Runs for μ = 0, 0.1, 0.2, ..., 0.7; intermediate columns elided in the original.)

Census
– Tuning set error: 5.10% (μ = 0) → 4.74%
– ε: 0.00 → 0.02
– SSR time (sec): 980, 935, ...; total 5086
– MM time (sec): 199, 294, ...; total 3765
– Time improvement: max 79.7%, avg 26.0%

CompActiv
– Tuning set error: 6.60% → 6.32%
– ε: 0.00 → 3.09
– SSR time (sec): 1364, 1286, ...; total 7604
– MM time (sec): 468, 660, ...; total 6533
– Time improvement: max 65.7%, avg 14.1%

Boston Housing
– Tuning set error: 14.69% → 14.62%
– ε: 0.00 → 0.42
– SSR time (sec): 36, 34, ...; total 170
– MM time (sec): 17, 23, ...; total 140
– Time improvement: max 52.0%, avg 17.6%
Problem size concerns
How does the problem scale?
– m = number of points
– n = number of features
For a linear kernel, the problem size is O(mn): $-s \le Aw + be - y \le s$
For a nonlinear kernel, the problem size is O(m²): $-s \le K(A, A')\alpha + be - y \le s$
Thousands of data points ==> massive problem!
Need an algorithm that will scale well.
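A back-of-the-envelope check of the scaling claim, using the Census20k shape from the earlier slide (m = 20,000 points, n = 11 features):

```python
m, n = 20_000, 11
linear_entries = m * n     # linear kernel: constraint matrix A is m x n, O(mn)
nonlinear_entries = m * m  # nonlinear kernel: K(A, A') is m x m, O(m^2)
ratio = nonlinear_entries / linear_entries  # how much larger the kernelized problem is
```

The kernelized problem is m/n, roughly 1800 times, larger, which is why the chunking approach below is needed.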
Chunking approach
Idea: use a chunking method
– Bring as much into memory as possible
– Solve this subset of the problem
– Retain solution and integrate into next subset
Explored in depth by Paul Bradley and O.L. Mangasarian for linear kernels
Solve in pieces, one chunk at a time.
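The loop above can be sketched schematically. To stay self-contained, this sketch uses an unconstrained least-squares solve as a stand-in for the LP solved per chunk, and retains the worst-fit rows as a crude proxy for the support-vector rows the real method carries forward; every concrete choice here (chunk size, retention rule) is illustrative only:

```python
import numpy as np

def chunked_fit(A, y, chunk_rows=100, keep=20):
    """Fit w one chunk of rows at a time, carrying retained rows between chunks."""
    kept_A = np.empty((0, A.shape[1]))
    kept_y = np.empty(0)
    w = np.zeros(A.shape[1])
    for start in range(0, A.shape[0], chunk_rows):
        # Current subproblem: retained rows plus the next chunk of fresh rows.
        Ac = np.vstack([kept_A, A[start:start + chunk_rows]])
        yc = np.concatenate([kept_y, y[start:start + chunk_rows]])
        w, *_ = np.linalg.lstsq(Ac, yc, rcond=None)
        # Retain the rows that matter most for the next round.
        resid = np.abs(Ac @ w - yc)
        order = np.argsort(-resid)[:keep]
        kept_A, kept_y = Ac[order], yc[order]
    return w
```

Only one chunk (plus the small retained set) is ever in memory at once, which is the whole point of the approach.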
Row-Column Chunking
Why column chunking also?
– If a non-linear kernel is used, chunks are very wide.
– A wide chunk must have a small number of rows to fit in memory.
Both these chunks use the same memory!
Chunking Experimental Results
– Dataset: 16,000-point subset of Census in $R^{11}$ + noise
– Kernel: Gaussian radial basis kernel
– LP size: 32,000 nonsparse rows and columns
– Problem size: 1.024 billion nonzero values
– Time to termination: 18.8 days
– Number of SVs: 1621 support vectors
– Solution variables: 33 nonzero components
– Final tuning set error: 9.8%
– Tuning set error on first chunk (1,000 points): 16.2%
Objective Value & Tuning Set Error for Billion-Element Matrix
[Plots: objective value and tuning set error vs. row-column chunk iteration number, with time in days; the tuning set error falls from about 16% to under 10% over the run]
Given enough time, we find the right answer!
Integration into data mining tools
The method runs as a stand-alone application, with data resident on disk
With minimal effort, it could sit on top of an RDBMS to manage data input/output
– Queries select a subset of data - easily SQLable
Database queries occur “infrequently”
– Data mining can be performed on a different machine from the one maintaining the DBMS
Licensing of a linear program solver is necessary
The algorithm can integrate with data mining tools.
Part III: Active Support Vector Machines
a.k.a. ASVM
The Classification Problem
Separating surface: $x'w = \gamma$
Bounding planes: $x'w = \gamma + 1$ (class A+) and $x'w = \gamma - 1$ (class A−)
Find surface to best separate two classes.
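Once w and γ are found, the decision rule is just a sign test against the separating surface; a tiny sketch with made-up numbers for w and γ:

```python
import numpy as np

def classify(X, w, gamma):
    """Label each row of X by which side of the surface x'w = gamma it lies on."""
    return np.where(X @ w - gamma >= 0.0, 1, -1)

w = np.array([1.0, -1.0])   # illustrative weights
gamma = 0.0                 # illustrative threshold
X = np.array([[2.0, 0.5],   # x'w = 1.5  -> class +1 (A+)
              [0.5, 2.0]])  # x'w = -1.5 -> class -1 (A-)
labels = classify(X, w, gamma)
```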
Active Support Vector Machine
Features
– Solves classification problems
– No special software tools necessary! No LP or QP!
– FAST. Works on very large problems.
– Web page: www.cs.wisc.edu/~musicant/asvm
• Available for download and can be integrated into data mining tools
• MATLAB integration already provided

| # of points | Features | Iterations | Time (CPU min) |
|-------------|----------|------------|----------------|
| 4 million | 32 | 5 | 38.04 |
| 7 million | 32 | 5 | 95.57 |
Summary and Future Work
Summary
– Robust regression can be modeled simply and efficiently as a quadratic program
– Tolerant regression can be used to solve massive regression problems
– ASVM can solve massive classification problems quickly
Future work
– Parallel approaches
– Distributed approaches
– ASVM for various types of regression
Questions?