A BAYESIAN APPROACH TO LOCALIZED MULTI-KERNEL LEARNING USING THE RELEVANCE VECTOR MACHINE
R. Close, J. Wilson, P. Gader
Outline
• Benefits of kernel methods
• Multi-kernels and localized multi-kernels
• Relevance Vector Machines (RVM)
• Localized multi-kernel RVM (LMK-RVM)
• Application of LMK-RVM to landmine detection
• Conclusions
Kernel Methods Overview
Using a non-linear mapping, a decision surface can become linear in a transformed space.
Kernel Methods Overview
If the mapping satisfies Mercer's theorem (i.e., it is finitely positive-definite), then it corresponds to an inner-product kernel:
$$k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$$
Kernel Methods
• Feature transformations increase dimensionality to create a linear separation between classes
• Utilizing the kernel trick, kernel methods construct these feature transformations in an infinite-dimensional space that can be finitely characterized
• The accuracy and robustness of the model depend directly on the kernel's ability to represent the correlation between data points
– A side benefit is an increased understanding of the latent relationships between data points once the kernel parameters are learned
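To make the kernel trick concrete, here is a small illustration (not from the original slides): a degree-2 polynomial kernel evaluated directly in input space agrees with an inner product in the explicitly expanded feature space, so the expansion never has to be formed.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
explicit = phi(x) @ phi(y)          # inner product in the expanded space
trick = (x @ y) ** 2                # kernel trick: never form phi explicitly
assert np.isclose(explicit, trick)  # both equal (x . y)^2 = 16.0
```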
Multi-Kernel Learning
• When using kernel methods, a specific form of kernel function is chosen (e.g., a radial basis function).
• Multi-kernel learning instead uses a linear combination of kernel functions (see the equation below).
– The weights may be constrained if desired.
• As the model is trained, the weights yielding the best input-space to kernel-space mapping are learned.
• Any kernel function whose weight approaches 0 is pruned out of the multi-kernel function.
$$k(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{K} w_i\, k_i(\mathbf{x}, \mathbf{y})$$
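As a rough sketch of the idea (an illustration, not code from the presentation), a two-component multi-kernel can be built as a weighted sum of RBF kernels; the weights and spreads below are hypothetical placeholders.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """RBF kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def multi_kernel(X, Y, sigmas, weights):
    """Weighted sum of base kernels: k(x, y) = sum_i w_i k_i(x, y)."""
    return sum(w * rbf_kernel(X, Y, s) for w, s in zip(weights, sigmas))

X = np.random.randn(20, 3)
K = multi_kernel(X, X, sigmas=[0.3, 0.8], weights=[0.5, 0.5])
```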
Localized Multi-Kernel Learning
• Localized multi-kernel (LMK) learning allows different kernels (or different kernel parameters) to be used in separate areas of the feature space. Thus the model is not limited to the assumption that one kernel function can effectively map the entire feature space.
• Many LMK approaches attempt to simultaneously partition the feature space and learn the multi-kernel.
[Figure: different multi-kernels applied in different regions of the feature space]
LMK-RVM
• A localized multi-kernel relevance vector machine (LMK-RVM) uses the ARD (automatic relevance determination) prior of the RVM to select the kernels to use over a given feature-space.
• This allows greater flexibility in the localization of the kernels and increased sparsity.
RVM Overview
[Figure: Gaussian density plots illustrating the likelihood, the ARD prior, and the resulting posterior]

Model: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{k}(\mathbf{x})$
Likelihood: $p(t \mid \mathbf{x}) = N(t \mid y(\mathbf{x}), \beta^{-1})$
ARD prior: $p(\mathbf{w} \mid \alpha) = N(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} I)$
Posterior: $p(\mathbf{w} \mid \mathbf{t}) = N(\mathbf{w} \mid \mathbf{m}, \Sigma)$, with
$$\mathbf{m} = \beta\, \Sigma K^T \mathbf{t}, \qquad \Sigma = (A + \beta K^T K)^{-1}$$
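A minimal numpy sketch of these posterior equations (assuming, as is standard for the RVM, a fixed noise precision `beta` and a diagonal precision matrix `A` built from the ARD hyper-parameters `alpha`):

```python
import numpy as np

def rvm_posterior(K, t, alpha, beta):
    """Gaussian posterior over weights for RVM regression.

    K     : (N, M) design/kernel matrix
    t     : (N,) targets
    alpha : (M,) ARD precisions (diagonal of A)
    beta  : scalar noise precision
    Returns the posterior mean m and covariance Sigma:
        Sigma = (A + beta K^T K)^{-1},  m = beta Sigma K^T t
    """
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * K.T @ K)
    m = beta * Sigma @ K.T @ t
    return m, Sigma
```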
RVM Overview (continued)

[Figure: the same likelihood / ARD prior / posterior plots as the previous slide]

Note the vector hyper-parameter: each weight has its own precision $\alpha_m$, so the prior is $N(\mathbf{w} \mid \mathbf{0}, A^{-1})$ with $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_M)$.
Automatic Relevance Determination
• Values for $\alpha$ and $\beta$ are determined by integrating over the weights and maximizing the resulting marginal distribution.
• Those training samples that do not help predict the output of other training samples have $\alpha$ values that tend toward infinity. Their associated $w$ priors become $\delta$ functions with mean 0; that is, their weight in predicting outcomes at other points should be exactly 0. Thus, these training vectors can be removed.
• We can use the remaining, relevant vectors to estimate the outputs associated with new data.
• The design matrix $K = \Phi$ is now $N \times M$, where $M \ll N$.
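A schematic sketch of the re-estimation and pruning loop, assuming the fixed-point updates from Tipping's RVM formulation (the threshold `prune_at` is a hypothetical choice, not a value from the slides):

```python
import numpy as np

def ard_update(K, t, alpha, beta, prune_at=1e6):
    """One fixed-point re-estimation of the ARD precisions.

    gamma_m = 1 - alpha_m * Sigma_mm measures how well-determined
    weight m is by the data; alpha_m <- gamma_m / m_m^2.
    Columns whose alpha exceeds `prune_at` are pruned (weight -> 0).
    """
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * K.T @ K)
    m = beta * Sigma @ K.T @ t
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha_new = gamma / (m ** 2)
    beta_new = (len(t) - gamma.sum()) / np.sum((t - K @ m) ** 2)
    keep = alpha_new < prune_at          # relevant vectors survive
    return alpha_new, beta_new, keep
```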
RVM for Classification
• Start with a two-class problem: $t \in \{0, 1\}$
• Model: $y(\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))$
– $\sigma(\cdot)$ is the logistic sigmoid
• Same as RVM for regression, except IRLS must be used to find the mode of the posterior distribution:
$$\mathbf{w}^* = A^{-1} \Phi^T (\mathbf{t} - \mathbf{y}), \qquad \Sigma = (\Phi^T B \Phi + A)^{-1}$$
$$B = \mathrm{diag}(b_1, \ldots, b_N), \quad b_n = y_n (1 - y_n)$$
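A minimal sketch of the IRLS step described above; the Newton iteration is the standard Laplace-approximation procedure, while the fixed iteration count is an assumption for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_mode(Phi, t, alpha, n_iter=25):
    """Find the posterior mode w* for RVM classification by IRLS.

    Newton steps on log p(w | t): gradient Phi^T (t - y) - A w,
    negative Hessian Phi^T B Phi + A with B = diag(y_n (1 - y_n)).
    """
    A = np.diag(alpha)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        B = np.diag(y * (1.0 - y))
        H = Phi.T @ B @ Phi + A            # negative Hessian
        g = Phi.T @ (t - y) - A @ w        # gradient of the log-posterior
        w = w + np.linalg.solve(H, g)      # Newton update
    Sigma = np.linalg.inv(H)               # Laplace covariance at the mode
    return w, Sigma
```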
LMK-RVM
• Using the multi-kernel with the RVM model, we start with:
$$y(\mathbf{x}) = \sum_{n=1}^{N} w_n \sum_{i=1}^{K} w_i\, k_i(\mathbf{x}, \mathbf{x}_n)$$
where $w_n$ is the weight on the multi-kernel associated with vector $n$ and $w_i$ is the weight on the $i$th component of each multi-kernel.
• Unlike some kernel methods (e.g., the SVM), the RVM is not constrained to use a positive-definite kernel matrix; thus, there is no requirement that the weights be factorized as $w_n w_i$. So, in this setting,
$$y(\mathbf{x}) = \sum_{n=1}^{N} \sum_{i=1}^{K} w_{ni}\, k_i(\mathbf{x}, \mathbf{x}_n)$$
• We show a sample application of LMK-RVM using two radial basis kernels at each training point with different spreads.
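One plausible way to realize this model (a sketch under assumed details, not the authors' exact implementation) is to give the RVM a design matrix with K columns per training point, one per base kernel, so ARD can prune each (n, i) pair independently and thereby localize the multi-kernel:

```python
import numpy as np

def lmk_design_matrix(X, sigmas):
    """Stack one RBF column per (training point, base kernel) pair.

    X      : (N, D) training inputs
    sigmas : list of K spreads, one per base kernel
    Returns an (N, N*K) design matrix; the standard RVM/ARD
    machinery can then prune kernel/point pairs independently.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    cols = [np.exp(-sq / (2.0 * s ** 2)) for s in sigmas]
    return np.hstack(cols)

X = np.random.randn(50, 3)
Phi = lmk_design_matrix(X, sigmas=[0.3, 0.8])   # shape (50, 100)
```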
Toy Dataset Example
[Figure: toy dataset with kernels with larger σ selected in one region and kernels with smaller σ in another]
GPR Data Experiments

GPR experiments using data with 120-dimensional spectral features.

[Figure: sum of squared error for ground penetrating radar, results of the validation set using 10-fold cross-validation training; axes are sigma for kernel 1 vs. sigma for kernel 2, each ranging from 0.25 to 0.8]

[Figure: area under curve for ground penetrating radar, results of the hold-out set of 10-fold cross-validation training, over the same sigma grid]

Improvements in classification happen off-diagonal.
GPR ROC
[Figure: ROC curves for the GPR experiments]
WEMI Data Experiments

WEMI experiments using data with 3-dimensional GRANMA features.

[Figure: area under curve for wideband EMI, results of the hold-out set of 10-fold cross-validation training; axes are sigma for kernel 1 vs. sigma for kernel 2, each ranging from 0.25 to 0.8]

[Figure: sum of squared error for wideband EMI, results of the validation set using 10-fold cross-validation training, over the same sigma grid]

Improvements in classification happen off-diagonal.
WEMI ROC
[Figure: ROC curves for the WEMI experiments]
Number of Relevant Vectors
[Figure: mean number of relevant vectors for WEMI; axes are sigma for kernel 1 vs. sigma for kernel 2, each ranging from 0.25 to 0.8]

[Figure: mean number of relevant vectors for ground penetrating radar, averaged over the training set, over the same sigma grid]

Number of relevant vectors averaged over all ten folds. The off-diagonal shows a potentially sparser model.
Conclusions
• The experiment using GPR data features showed that LMK-RVM can provide a definite improvement in SSE, AUC, and the ROC
• The experiment using the lower-dimensional WEMI GRANMA features showed that the same LMK-RVM method provided some improvement in SSE and AUC and an inconclusive ROC
• Both sets of experiments show the potential for sparser models when using the LMK-RVM
• Question: is there an effective way to learn values for spreads in our simple class of localized multi-kernels?
Backup Slides
Expanded Kernels Discussion
Kernel Methods Example: The Masked Class Problem

[Figure: two scatter plots of a three-class dataset over x1 and x2]

In both these problems linear classification methods have difficulty discriminating the blue class from the others!

What is the actual problem here?
– No one line can separate the blue class from the other datapoints! Similar to the "single-layer" perceptron problem (the XOR problem)!
Decision Surface in Feature Space
[Figure: learned decision surfaces over (x1, x2) for the green and black classes]

Can classify the green and black class with no problem!

[Figure: learned decision surface over (x1, x2) for the blue class]

Problems when we try to classify the blue class!!!!
Revisit Masked Class Problem
[Figure: the masked-class dataset plotted over x1 and x2]

Are linear methods completely useless on this data?
– No, we can perform a non-linear transformation on the data via fixed basis functions!
– Many times when we perform this transformation, features that were not linearly separable in the original feature space become linearly separable in the transformed feature space.
Basis Functions

Models can be extended by using fixed basis functions, which allows for linear combinations of nonlinear functions of the input variables:
$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$
– Gaussian (or RBF) basis function: $\phi_j(x) = \exp\{-\frac{(x - \mu_j)^2}{2s^2}\}$
– Basis vector: $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0, \ldots, \phi_{M-1})^T$
– Dummy basis function used for bias parameter: $\phi_0(\mathbf{x}) = 1$
– Basis function center ($\mu_j$) governs location in input space
– Scale parameter ($s$) determines spatial scale
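A small sketch of this basis expansion for scalar inputs (the centers and scale below are hypothetical placeholders):

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Map scalar inputs through Gaussian basis functions.

    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a dummy
    phi_0(x) = 1 column for the bias weight w_0.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    phi = np.exp(-(x - centers.reshape(1, -1)) ** 2 / (2.0 * s ** 2))
    bias = np.ones((x.shape[0], 1))
    return np.hstack([bias, phi])       # (N, M) design matrix

centers = np.linspace(-1, 1, 9)         # hypothetical centers mu_j
Phi = gaussian_basis(np.random.randn(50), centers, s=0.25)
```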
Features in Transformed Space are Linearly Separable
[Figure: transformed datapoints plotted in the new feature space (theta1 vs. theta2), where the classes are linearly separable]
Transformed Decision Surface in Feature Space
[Figure: decision surfaces over (x1, x2) for the green and black classes after the transformation]

Again, we can classify the green and black class with no problem!

[Figure: decision surface over (x1, x2) for the blue class after the transformation]

Now we can classify the blue class with no problem!!!
Common Kernels
• Squared exponential: $k(\mathbf{x}_n, \mathbf{x}_m) = \exp\{-\frac{\|\mathbf{x}_n - \mathbf{x}_m\|^2}{2\sigma^2}\}$
• Gaussian process kernel: $k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0 \exp\{-\frac{\theta_1}{2}\|\mathbf{x}_n - \mathbf{x}_m\|^2\} + \theta_2 + \theta_3\, \mathbf{x}_n^T \mathbf{x}_m$
• Automatic relevance determination (ARD) kernel: $k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0 \exp\{-\frac{1}{2}\sum_{i=1}^{D} \eta_i (x_{ni} - x_{mi})^2\}$
• Other kernels:
– Neural network
– Matérn
– γ-exponential, etc.
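A sketch of these kernel functions in numpy (parameter values are hypothetical placeholders; the Matérn and γ-exponential forms are omitted):

```python
import numpy as np

def sq_exp(xn, xm, sigma=0.5):
    """Squared exponential: exp(-||xn - xm||^2 / (2 sigma^2))."""
    d2 = np.sum((xn - xm) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def gp_kernel(xn, xm, th=(1.0, 4.0, 0.0, 0.0)):
    """GP kernel: th0 exp(-th1/2 ||xn - xm||^2) + th2 + th3 xn.xm."""
    d2 = np.sum((xn - xm) ** 2)
    return th[0] * np.exp(-0.5 * th[1] * d2) + th[2] + th[3] * (xn @ xm)

def ard_kernel(xn, xm, theta0=1.0, eta=None):
    """ARD kernel: one inverse length-scale eta_i per input dimension."""
    eta = np.ones_like(xn) if eta is None else eta
    return theta0 * np.exp(-0.5 * np.sum(eta * (xn - xm) ** 2))
```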