Page 1

Kernel Logistic Regression and the Import Vector Machine

Ji Zhu and Trevor Hastie, Journal of Computational and Graphical Statistics, 2005

Presented by Mingtao Ding, Duke University

December 8, 2011

Mingtao Ding Kernel Logistic Regression and the Import Vector Machine December 8, 2011 1 / 24

Page 2

Summary

The authors propose a new approach for classification, called the import vector machine (IVM).

It provides estimates of the class probabilities, which are often more useful than the classifications themselves.

It generalizes naturally to M-class classification through kernel logistic regression (KLR).

Page 3

Problem and Objective

Supervised learning problem: a set of training data $\{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^p$ is an input vector, and $y_i$ (dependent on $x_i$) is a univariate continuous output for the regression problem or a binary output for the classification problem.

Objective: learn a predictive function $f(x)$ from the training data,

$$\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \frac{\lambda}{2} \Phi(\|f\|_{\mathcal{F}}) \right\}.$$

In this presentation, $\mathcal{F}$ is assumed to be a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$.
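To make the Loss + Regularization form concrete, here is a small illustrative sketch (not from the slides): for $f$ in an RKHS, $f(x_i) = (Ka)_i$ and $\|f\|_{\mathcal{H}_K}^2 = a^T K a$, so the objective can be evaluated directly. The squared loss, the RBF kernel, and $\Phi(t) = t^2$ are my assumptions for illustration; the slides leave $L$ and $\Phi$ general.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def objective(a, K, y, lam):
    # (1/n) * sum_i L(y_i, f(x_i)) + (lam/2) * Phi(||f||), with squared
    # loss, Phi(t) = t^2, f(x_i) = (K a)_i, and ||f||^2 = a^T K a
    f = K @ a
    loss = np.mean((y - f) ** 2)
    penalty = 0.5 * lam * (a @ K @ a)
    return loss + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0])
K = rbf_kernel(X, X)
print(objective(np.zeros(20), K, y, lam=0.1))  # with a = 0: mean(y^2) = 1.0
```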

Page 4

SVM and KLR

The standard SVM can be fitted via Loss + Regularization

$$\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{n} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \frac{\lambda}{2} \|f\|_{\mathcal{H}_K}^2 \right\}.$$

Under very general conditions, the solution has the form

$$f(x) = \sum_{i=1}^{n} a_i K(x, x_i).$$

Example conditions: an arbitrary loss $L((x_1, y_1, f(x_1)), (x_2, y_2, f(x_2)), \dots, (x_n, y_n, f(x_n)))$ and a strictly monotonically increasing function $\Phi$.

The points with $y_i f(x_i) > 1$ have no influence on the loss function. As a consequence, it often happens that a sizeable fraction of the $n$ values of $a_i$ are zero. The points corresponding to nonzero $a_i$ are called support points.

Page 5

SVM and KLR

The loss function $(1 - yf)_+$ is plotted in Fig. 1, along with the negative log-likelihood (NLL) of the binomial distribution (of $y$ over $\{1, -1\}$):

$$\mathrm{NLL} = \ln(1 + e^{-yf}) = \begin{cases} -\ln p, & \text{if } y = 1, \\ -\ln(1 - p), & \text{if } y = -1, \end{cases}$$

where $p \equiv P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-f}}$.
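The two losses can be compared numerically. A small sketch of my own (natural logs, as on the slide): the hinge loss vanishes once the margin $yf$ exceeds 1, while the NLL stays strictly positive everywhere.

```python
import numpy as np

def hinge(y, f):
    # SVM loss (1 - y f)_+; exactly zero once the margin y f exceeds 1
    return np.maximum(0.0, 1.0 - y * f)

def nll(y, f):
    # binomial NLL ln(1 + e^{-y f}) for y in {1, -1}
    return np.log1p(np.exp(-y * f))

margins = np.array([-2.0, 0.0, 1.0, 3.0])
print(hinge(1, margins))  # [3. 1. 0. 0.]
print(nll(1, margins))    # smooth and strictly positive, ~0 at large margins
```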

Page 6

SVM and KLR

The SVM only estimates $\mathrm{sign}[p(x) - 1/2]$ (by calculating the distances between $x$ and the hyperplanes), without providing the class probability $p(x)$.

The NLL of y has a similar shape to that of the SVM.

If we let $y \in \{0, 1\}$, then

$$\mathrm{NLL} = -\left(y \ln p + (1 - y) \ln(1 - p)\right) = -\left(yf - \ln(1 + e^f)\right).$$

This is the loss function of the classical KLR.
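A quick numerical check of my own that the $\{0,1\}$-coded loss above agrees with the $\{1,-1\}$-coded loss $\ln(1 + e^{-\tilde{y}f})$ under the recoding $\tilde{y} = 2y - 1$:

```python
import numpy as np

def nll_pm1(y_pm, f):
    # y in {1, -1}: ln(1 + e^{-y f})
    return np.log1p(np.exp(-y_pm * f))

def nll_01(y01, f):
    # y in {0, 1}: -(y f - ln(1 + e^f))
    return -(y01 * f - np.log1p(np.exp(f)))

f = np.linspace(-4, 4, 9)
for y01 in (0.0, 1.0):
    y_pm = 2 * y01 - 1            # recode {0, 1} -> {-1, 1}
    assert np.allclose(nll_01(y01, f), nll_pm1(y_pm, f))
print("both codings give the same loss")
```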

Page 7

SVM and KLR

If we replace $(1 - yf)_+$ with $\ln(1 + e^{-yf})$, the SVM becomes a KLR problem with the objective function

$$\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ln\left(1 + e^{-y_i f(x_i)}\right) + \frac{\lambda}{2} \|f\|_{\mathcal{H}_K}^2 \right\}.$$

Advantages:

It offers a natural estimate of the class probability $p(x)$.

It can naturally be generalized to the M-class case through kernel multi-logit regression.

Disadvantage: for the KLR solution $f(x) = \sum_{i=1}^{n} a_i K(x, x_i)$, all the $a_i$'s are nonzero.

Page 8

SVM and KLR

Page 9

KLR as a Margin Maximizer

Suppose the basis functions of the transformed feature space $h(x)$ are rich enough, so that the hyperplane $f(x) = h(x)^T \beta + \beta_0 = 0$ can separate the training data.

Theorem 1: Denote by $\hat{\beta}(\lambda)$ the solution to the KLR problem

$$\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ln\left(1 + e^{-y_i f(x_i)}\right) + \frac{\lambda}{2} \|f\|_{\mathcal{H}_K}^2 \right\},$$

then $\lim_{\lambda \to 0} \hat{\beta}(\lambda) = \beta^*$, where $\beta^*$ is the margin-maximizing SVM solution.

Page 10

Import Vector Machine

The objective function of KLR can be written as

$$H = \frac{1}{n} \mathbf{1}^T \ln\left(1 + e^{-y \cdot (K_1 a)}\right) + \frac{\lambda}{2} a^T K_2 a.$$

To find $a$, we set the derivative of $H$ with respect to $a$ equal to 0 and use Newton's method to solve the score equation iteratively. The Newton update takes the form of an iteratively reweighted least squares (IRLS) step.
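The update equation itself did not survive extraction; the following is a standard Newton (IRLS) sketch of my own for regularized KLR in the full-model case $K_1 = K_2 = K$, with $y \in \{0,1\}$: gradient $K(p - y)/n + \lambda K a$ and Hessian $K W K/n + \lambda K$, where $W = \mathrm{diag}(p(1-p))$. The toy data and kernel bandwidth are illustrative choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_klr(K, y01, lam=0.1, n_iter=20):
    # Newton / IRLS for regularized KLR: minimize
    # (1/n) sum_i [-y_i f_i + ln(1 + e^{f_i})] + (lam/2) a^T K a, with f = K a
    n = len(y01)
    a = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(K @ a)                       # current probabilities
        grad = K @ (p - y01) / n + lam * (K @ a)
        W = p * (1 - p)                          # IRLS weights
        hess = (K * W) @ K / n + lam * K         # K diag(W) K / n + lam K
        a -= np.linalg.solve(hess + 1e-10 * np.eye(n), grad)
    return a

# toy usage on an RBF kernel
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y01 = (X[:, 0] + X[:, 1] > 0).astype(float)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2)
a = fit_klr(K, y01)
acc = np.mean((sigmoid(K @ a) > 0.5) == (y01 == 1))
print("training accuracy:", acc)
```

Note that every $a_i$ produced this way is generically nonzero, which is exactly the KLR disadvantage the IVM addresses.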

Page 11

Import Vector Machine

The computational cost of KLR is $O(n^3)$. To reduce this cost, the IVM algorithm finds a sub-model that approximates the full model given by KLR.

The sub-model has the form

$$f(x) = \sum_{x_i \in S} a_i K(x, x_i),$$

where $S$ is a subset of the training data, and the points in $S$ are called import points.

Page 12

Import Vector Machine

Algorithm 1:
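The algorithm figure did not survive extraction. Based on the paper's description, Algorithm 1 grows $S$ greedily, at each step adding the training point whose inclusion most reduces the regularized NLL $H$. The sketch below is my own naive version: it refits the sub-model coefficients for every candidate with a few Newton steps, which a practical implementation avoids; the stopping rule is the relative-improvement criterion from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def submodel_obj(K, y01, S, lam, n_newton=8):
    # Fit the sub-model f(x) = sum_{i in S} a_i K(x, x_i) with a few Newton
    # steps and return the regularized NLL (the objective H) it attains.
    n = len(y01)
    K1 = K[:, S]                      # n x |S| regression matrix
    K2 = K[np.ix_(S, S)]              # |S| x |S| penalty matrix
    a = np.zeros(len(S))
    for _ in range(n_newton):
        p = sigmoid(K1 @ a)
        grad = K1.T @ (p - y01) / n + lam * (K2 @ a)
        W = p * (1 - p)
        hess = (K1.T * W) @ K1 / n + lam * K2
        a -= np.linalg.solve(hess + 1e-8 * np.eye(len(S)), grad)
    p = np.clip(sigmoid(K1 @ a), 1e-12, 1 - 1e-12)
    nll = -np.mean(y01 * np.log(p) + (1 - y01) * np.log(1 - p))
    return nll + 0.5 * lam * (a @ K2 @ a)

def ivm_greedy(K, y01, lam=0.1, max_points=5, eps=1e-3):
    # Greedily add the import point that most reduces H; stop once the
    # relative improvement falls below eps (the slides' ratio criterion).
    S, H_prev = [], np.inf
    while len(S) < max_points:
        cand = [j for j in range(len(y01)) if j not in S]
        scores = [submodel_obj(K, y01, S + [j], lam) for j in cand]
        H_best = min(scores)
        if np.isfinite(H_prev) and abs(H_prev - H_best) / abs(H_best) < eps:
            break
        S.append(cand[int(np.argmin(scores))])
        H_prev = H_best
    return S

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y01 = (X[:, 0] > 0).astype(float)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = ivm_greedy(np.exp(-d2), y01)
print("import points:", S)
```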

Page 13

Import Vector Machine

Algorithm 1:

Page 14

Import Vector Machine

The algorithm can be accelerated by revising Step (2).

Stopping rule for adding points to $S$: run the algorithm until $\frac{|H_k - H_{k - \Delta k}|}{|H_k|} < \epsilon$.

Choosing the regularization parameter $\lambda$: we can split all the data into a training set and a tuning set, and use the misclassification error on the tuning set as a criterion for choosing $\lambda$ (Algorithm 3).
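The tuning-set criterion can be sketched as follows (my own sketch, not the paper's Algorithm 3, which decreases $\lambda$ gradually from a large starting value; here I simply scan a small grid from large to small and keep the $\lambda$ with the lowest tuning-set misclassification error):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_klr(K, y01, lam, n_iter=15):
    # a few Newton steps for the full regularized KLR model f = K a
    n = len(y01)
    a = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(K @ a)
        W = p * (1 - p)
        grad = K @ (p - y01) / n + lam * (K @ a)
        hess = (K * W) @ K / n + lam * K
        a -= np.linalg.solve(hess + 1e-10 * np.eye(n), grad)
    return a

def rbf(X, Z):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y01 = (X[:, 0] > 0).astype(float)
train, tune = np.arange(40), np.arange(40, 60)   # simple split

best_lam, best_err = None, np.inf
for lam in [10.0, 1.0, 0.1]:                      # start large, then decrease
    a = fit_klr(rbf(X[train], X[train]), y01[train], lam)
    p_tune = sigmoid(rbf(X[tune], X[train]) @ a)  # probabilities on tuning set
    err = np.mean((p_tune > 0.5) != (y01[tune] == 1))
    if err < best_err:
        best_lam, best_err = lam, err
print("chosen lambda:", best_lam, "tuning error:", best_err)
```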

Page 15

Import Vector Machine

Algorithm 3:

1. Start with a large regularization parameter λ.

Page 16

Simulation Results

Page 17

Simulation Results

Page 18

Real Data Results

Page 19

Real Data Results

Page 20

Real Data Results

Page 21

Generalization to M-Class Case

Similar to kernel multi-logit regression, we define the class probabilities (assuming the usual multi-logit link) as

$$p_c(x) = \frac{e^{f_c(x)}}{\sum_{c'=1}^{C} e^{f_{c'}(x)}}, \quad c = 1, \dots, C, \quad \text{subject to} \quad \sum_{c=1}^{C} f_c(x) = 0.$$
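A small sketch of my own, assuming the usual multi-logit (softmax) link: adding a constant to every $f_c(x)$ leaves the probabilities unchanged, which is exactly why the sum-to-zero constraint is needed to identify the $f_c$.

```python
import numpy as np

def class_probs(F):
    # multi-logit (softmax) probabilities from per-class functions f_c(x);
    # rows index samples, columns index classes
    E = np.exp(F - F.max(axis=1, keepdims=True))   # numerically stabilized
    return E / E.sum(axis=1, keepdims=True)

F = np.array([[2.0, 0.5, -2.5]])    # per-class f_c values; they sum to zero
P = class_probs(F)
# shifting every f_c by a constant changes the f_c but not the probabilities
assert np.allclose(class_probs(F + 7.0), P)
assert np.allclose(P.sum(axis=1), 1.0)
print(P)
```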

Page 22

Generalization to M-Class Case

The M-class KLR fits a model that minimizes the regularized NLL of the multinomial distribution.

The approximate solution can be obtained by an M-class IVM procedure, similar to the two-class case.

Page 23

Generalization to M-Class Case

Page 24

Conclusion

The IVM not only performs as well as the SVM in two-class classification, but also generalizes naturally to the M-class case.

Computational cost: KLR $O(n^3)$; SVM $O(n^2 n_s)$; IVM $O(n^2 n_I^2)$ (two-class) and $O(C n^2 n_I^2)$ (M-class), where $n_s$ and $n_I$ are the numbers of support and import points.

IVM has limiting optimal margin properties.
