SVM—Support Vector Machines
• A new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a higher dimension
• With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
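The idea above can be sketched with an off-the-shelf implementation. This is a minimal illustration, not the slides' own code; it assumes scikit-learn's SVC, and the toy data and the large C value are made up for the example.

```python
# Minimal sketch: fitting a linear SVM on a toy 2-D dataset
# (assumes scikit-learn; data and parameters are illustrative).
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [3, 4], [4, 3]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]

clf = SVC(kernel="linear", C=1e3)   # large C approximates a hard margin
clf.fit(X, y)

print(clf.predict([[0.2, 0.1], [3.5, 3.5]]))  # one query point per class
print(clf.support_vectors_)   # the "essential" training tuples
```

Only the support vectors are kept by the fitted model; they are exactly the training tuples that pin down the maximum-margin hyperplane.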
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
• Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications:
– handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
SVM—Linearly Separable
• A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved by quadratic programming (QP) using Lagrangian multipliers
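The margin constraints above can be checked numerically. A small sketch (the hyperplane, points, and labels here are hand-chosen for illustration): every training tuple should satisfy yi(w0 + w·xi) ≥ 1, and the width between H1 and H2 is 2/‖w‖.

```python
# Sketch: checking the H1/H2 margin constraints for a hand-chosen
# hyperplane w0 + w1*x1 + w2*x2 = 0 (all values are illustrative).
import numpy as np

w = np.array([1.0, 1.0])    # weight vector (w1, w2)
w0 = -3.0                   # bias
X = np.array([[1.0, 1.0], [0.0, 1.0],    # class -1
              [2.0, 2.0], [3.0, 2.0]])   # class +1
y = np.array([-1, -1, 1, 1])

scores = w0 + X @ w
assert np.all(y * scores >= 1)    # each tuple lies on its side of the margin
margin = 2 / np.linalg.norm(w)    # geometric width between H1 and H2
print(margin)
```

Points where yi(w0 + w·xi) equals exactly 1 lie on H1 or H2 and are the support vectors.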
Support vectors
• This means the hyperplane can be written as
x = w0 + w1 a1 + w2 a2
and, in terms of the support vectors, as
x = b + Σ_{i is supp. vector} αi yi a(i) · a
• The support vectors define the maximum margin hyperplane!
– All other instances can be deleted without changing its position and orientation
Finding support vectors
• Support vector: training instance for which αi > 0
• Determine αi and b?—A constrained quadratic optimization problem
– Off-the-shelf tools exist for solving these problems
– However, special-purpose algorithms are faster
– Example: Platt's sequential minimal optimization algorithm (implemented in WEKA)
• Note: all this assumes separable data!
x = b + Σ_{i is supp. vector} αi yi a(i) · a
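The support-vector expansion can be verified on a fitted model. A sketch assuming scikit-learn (whose solver is SMO-based, like Platt's algorithm): `dual_coef_` stores the products αi·yi for each support vector, so the decision value for a new point a can be rebuilt as b + Σ αi yi (a(i)·a).

```python
# Sketch: rebuilding the decision value from the support vectors alone,
# x = b + sum_i alpha_i * y_i * (a(i) . a).  Assumes scikit-learn;
# dual_coef_ holds alpha_i * y_i for each support vector.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

a = np.array([2.5, 2.5])                      # an arbitrary query point
rebuilt = clf.intercept_[0] + np.sum(
    clf.dual_coef_[0] * (clf.support_vectors_ @ a))
print(np.isclose(rebuilt, clf.decision_function([a])[0]))
```

Deleting the non-support-vector instances and refitting would give the same hyperplane, which is why only the support vectors appear in the sum.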
Extending linear classification
• Linear classifiers can't model nonlinear class boundaries
• Simple trick:
– Map attributes into a new space consisting of combinations of attribute values
– E.g.: all products of n factors that can be constructed from the attributes
• Example with two attributes and n = 3:
x = w1 a1³ + w2 a1² a2 + w3 a1 a2² + w4 a2³
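The mapping for the two-attribute, n = 3 example can be written out directly. A minimal sketch (the function name is made up for illustration): each instance (a1, a2) becomes the four products of three factors.

```python
# Sketch: mapping two attributes into all products of n = 3 factors,
# producing the four "pseudo attributes" a1^3, a1^2*a2, a1*a2^2, a2^3.
import numpy as np

def products_n3(a):
    a1, a2 = a
    return np.array([a1**3, a1**2 * a2, a1 * a2**2, a2**3])

print(products_n3((2.0, 1.0)))   # the mapped instance for (2, 1)
```

A linear boundary in this new four-dimensional space corresponds to a cubic boundary in the original two attributes.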
Nonlinear SVMs
• "Pseudo attributes" represent attribute combinations
• Overfitting is not a problem because the maximum margin hyperplane is stable
– There are usually few support vectors relative to the size of the training set
• Computation time is still an issue
– Each time the dot product is computed, all the "pseudo attributes" must be included
A mathematical trick
• Avoid computing the "pseudo attributes"!
• Compute the dot product before doing the nonlinear mapping
• Example: for
x = b + Σ_{i is supp. vector} αi yi a(i) · a
compute
x = b + Σ_{i is supp. vector} αi yi (a(i) · a)ⁿ
• Corresponds to a map into the instance space spanned by all products of n attributes
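The trick can be verified for n = 2 with two attributes. A sketch (the mapping φ below is the standard degree-2 monomial map, chosen here for illustration): φ(a) = (a1², √2·a1·a2, a2²) satisfies φ(a)·φ(b) = (a·b)², so the dot product can be raised to the power n in the original space instead of ever forming the pseudo attributes.

```python
# Sketch: the kernel trick for n = 2 with two attributes.
# phi(a) = (a1^2, sqrt(2)*a1*a2, a2^2) gives phi(a).phi(b) = (a.b)^2.
import numpy as np

def phi(a):
    a1, a2 = a
    return np.array([a1**2, np.sqrt(2) * a1 * a2, a2**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

lhs = phi(a) @ phi(b)     # dot product in the pseudo-attribute space
rhs = (a @ b) ** 2        # same value computed in the original space
print(np.isclose(lhs, rhs))
```

For larger n the explicit space grows combinatorially, while the kernelized form stays a single dot product plus one exponentiation.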
Other kernel functions
• The mapping is called a "kernel function"
• Polynomial kernel:
x = b + Σ_{i is supp. vector} αi yi (a(i) · a)ⁿ
• We can use others:
x = b + Σ_{i is supp. vector} αi yi K(a(i) · a)
• Only requirement:
K(xi, xj) = Φ(xi) · Φ(xj)
• Examples:
K(xi, xj) = (xi · xj + 1)^d
K(xi, xj) = e^(−‖xi − xj‖² / 2σ²)
K(xi, xj) = tanh(κ xi · xj + b)
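The three example kernels can be written out directly. A sketch in NumPy (the parameter values d, sigma, kappa, and b are free parameters chosen only for illustration):

```python
# Sketch of the three kernels above, written out in NumPy
# (d, sigma, kappa, b are illustrative parameter choices).
import numpy as np

def poly_kernel(x, y, d=2):
    return (x @ y + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=0.1, b=0.0):
    return np.tanh(kappa * (x @ y) + b)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```

Each of these implicitly defines a Φ whose dot products the kernel computes; the learner never materializes Φ itself.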
Problems with this approach
• 1st problem: speed
– 10 attributes and n = 5 already give more than 2000 coefficients
– Use linear regression with attribute selection
– Run time is cubic in the number of attributes
• 2nd problem: overfitting
– Number of coefficients is large relative to the number of training instances
– The curse of dimensionality kicks in
Sparse data
• SVM algorithms speed up dramatically if the data is sparse (i.e., many values are 0)
• Why? Because they compute lots and lots of dot products
• With sparse data, dot products can be computed very efficiently
– Iterate only over non-zero values
• SVMs can process sparse datasets with tens of thousands of attributes
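The speedup comes from storing and iterating over only the non-zero entries. A sketch assuming SciPy's CSR sparse format (the 10,000-attribute vectors here are illustrative):

```python
# Sketch: dot products over sparse vectors touch only non-zero entries.
# Assumes SciPy's CSR format; data is illustrative.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((2, 10_000))
dense[0, 5] = 3.0
dense[1, 5] = 2.0
dense[1, 9_999] = 1.0

X = csr_matrix(dense)   # stores only the 3 non-zero values
dots = X @ X.T          # pairwise dot products, computed sparsely
print(dots.toarray())
```

Each pairwise dot product costs time proportional to the number of overlapping non-zeros, not to the 10,000 attributes.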
Applications
• Machine vision: e.g. face identification
– Outperforms alternative approaches (1.5% error)
• Handwritten digit recognition: USPS data
– Comparable to the best alternative (0.8% error)
• Bioinformatics: e.g. prediction of protein secondary structure
• Text classification
• Can modify the SVM technique for numeric prediction problems
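The numeric-prediction variant is known as support vector regression. A sketch assuming scikit-learn's SVR (the linear target function and parameters are made up for illustration):

```python
# Sketch: the same machinery adapted to numeric prediction
# (support vector regression).  Assumes scikit-learn's SVR;
# the target y = 2x + 1 is illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = 2.0 * X.ravel() + 1.0          # noiseless linear target

reg = SVR(kernel="linear", C=10.0).fit(X, y)
print(reg.predict([[2.0]]))        # should be close to 2*2 + 1 = 5
```

Instead of a margin between classes, SVR fits a tube of width epsilon around the targets, and the support vectors are the instances on or outside that tube.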