Transcript of the Sixth Italian Workshop on Machine Learning and Data Mining (MLDM)
Sixth Italian Workshop on Machine Learning and Data Mining (MLDM)
Kernel-based non-parametric activation functions for neural networks
Authors: S. Scardapane, S. Van Vaerenbergh and A. Uncini
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Basic NN architecture
The basic layer of a neural network alternates a linear projection with a pointwise nonlinearity:

h_l = g_l(W_l h_{l−1} + b_l).    (1)
There is a huge literature on the linear component, e.g., initialization, compression, fast multiplication...
In most cases, the weights {W_l, b_l} are the only adaptable components of the network.
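As a concrete illustration, here is a minimal PyTorch sketch of the layer in Eq. (1); the class name, the sizes, and the choice of tanh are ours, not part of the slides:

    import torch
    import torch.nn as nn

    class BasicLayer(nn.Module):
        """One layer of Eq. (1): a linear projection followed by a pointwise nonlinearity."""
        def __init__(self, in_features, out_features, nonlinearity=torch.tanh):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)  # W_l and b_l: the adaptable parts
            self.g = nonlinearity                                # fixed pointwise g_l

        def forward(self, h):
            return self.g(self.linear(h))

    h = torch.randn(32, 10)              # a batch of 32 inputs with 10 features
    print(BasicLayer(10, 20)(h).shape)   # torch.Size([32, 20])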
What about the nonlinearity?
The choice of the nonlinearity is crucial:
• Having differentiable activations was the basic ingredient for back-propagation.
• In the last decade, the ReLU function g(s) = max(0, s) made it possible to train deep NNs with hundreds of layers.
• Many recent papers propose new activation functions, e.g., the Swish function [1]:
g(s) = s · sigmoid(s) . (2)
Can we learn the activation functions?
[1] Ramachandran, P., Zoph, B. and Le, Q.V., 2017. Swish: a Self-Gated Activation Function. arXiv preprint arXiv:1710.05941.
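For reference, Swish from Eq. (2) is a one-liner (a sketch; torch.sigmoid is the logistic function):

    import torch

    def swish(s):
        # Eq. (2): s * sigmoid(s)
        return s * torch.sigmoid(s)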
Parametric activation functions
Making a single activation function parametric is relatively simple, e.g., we can add a learnable scale and bandwidth to a tanh:

g(s) = a (1 − exp{−bs}) / (1 + exp{−bs}).    (3)
Or learn the slope for the negative part of the ReLU (PReLU):

g(s) = { s    if s ≥ 0,
       { αs   otherwise.    (4)
These parametric AFs add only a few trainable parameters, but their flexibility is severely limited.
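As an example, a minimal PyTorch sketch of the PReLU in Eq. (4), with one learnable slope per neuron (PyTorch also provides nn.PReLU with the same behaviour):

    import torch
    import torch.nn as nn

    class PReLU(nn.Module):
        """Eq. (4): identity for s >= 0, learnable slope alpha for s < 0."""
        def __init__(self, num_features, init_alpha=0.25):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((num_features,), init_alpha))

        def forward(self, s):
            return torch.where(s >= 0, s, self.alpha * s)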
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Adaptive piecewise linear units
An APL nonlinearity is the sum of S linear segments:
g(s) = max{0, s} + ∑_{i=1}^{S} a_i max{0, −s + b_i}.    (5)
This is non-parametric because S is a user-defined hyper-parameter controlling the flexibility of the unit.
The APL introduces S + 1 points of non-differentiability for each neuron, which may hinder the optimization algorithm. Also, in practice, choosing S > 3 seems to add little to the resulting shapes.
[1] Agostinelli, F., Hoffman, M., Sadowski, P. and Baldi, P., 2014. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830.
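A possible PyTorch sketch of the APL in Eq. (5), with S learnable hinges per neuron (parameter shapes and initialization are our own choices):

    import torch
    import torch.nn as nn

    class APL(nn.Module):
        """Eq. (5): ReLU plus a sum of S learnable hinge functions per neuron."""
        def __init__(self, num_features, S=3):
            super().__init__()
            self.a = nn.Parameter(0.1 * torch.randn(S, num_features))
            self.b = nn.Parameter(torch.randn(S, num_features))

        def forward(self, s):
            # s: (batch, num_features); broadcast against the S hinges
            hinges = torch.clamp(-s.unsqueeze(1) + self.b, min=0)   # (batch, S, num_features)
            return torch.clamp(s, min=0) + (self.a * hinges).sum(dim=1)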
Spline activation functions
A SAF uses cubic interpolation over a set of adaptable control points:
[Figure: example SAF output as a function of the input activation.]
However, regularizing the control points is non-trivial, and SAFs cannot be easily accelerated on GPU.
[1] Vecci, L., Piazza, F. and Uncini, A., 1998. Learning and approximation capabilities of adaptive spline activation function neural networks. Neural Networks, 11(2), pp. 259-270.
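As an illustration (not the exact local interpolation scheme of the SAF), a spline activation can be sketched with SciPy by interpolating adaptable control-point values; the control-point grid and the perturbed tanh are arbitrary choices:

    import numpy as np
    from scipy.interpolate import CubicSpline

    # Control points: fixed locations on the x-axis, adaptable y-values (here a perturbed tanh)
    x = np.linspace(-2.0, 2.0, 15)
    y = np.tanh(x) + 0.1 * np.random.randn(15)

    saf = CubicSpline(x, y)   # cubic interpolation through the control points
    print(saf(0.5))           # evaluate the activation at s = 0.5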
Maxout neurons
A Maxout replaces an entire neuron by taking the maximum over K separate linear projections:

g(h) = max_{i=1,…,K} { w_i^T h + b_i }.    (6)
With two maxout neurons, a NN with one hidden layer remains a universal approximator, provided K is sufficiently large.
However, the resulting functions cannot be easily visualized for inputs with more than two or three dimensions, and the number of parameters can increase drastically with K.
[1] Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y., 2013. Maxout networks. Proc. 30th Int. Conf. on Machine Learning.
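A compact PyTorch sketch of a Maxout layer implementing Eq. (6); packing the K projections into a single nn.Linear is our own choice:

    import torch
    import torch.nn as nn

    class Maxout(nn.Module):
        """Eq. (6): each output is the maximum over K separate linear projections of the input."""
        def __init__(self, in_features, out_features, K=4):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features * K)
            self.out_features, self.K = out_features, K

        def forward(self, h):
            z = self.linear(h).view(-1, self.out_features, self.K)
            return z.max(dim=-1).values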
Visualization of a Maxout neuron
[Figure: Maxout activation as a function of a 1D input.]
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Basic structure of the KAF
We model each activation function as a kernel expansion over D terms:

g(s) = ∑_{i=1}^{D} α_i κ(s, d_i),    (7)
where:
1 {α_i}_{i=1}^{D} are the mixing coefficients;
2 {d_i}_{i=1}^{D} are the dictionary elements;
3 κ(·, ·) : R × R → R is a 1D kernel function.
To make everything tractable, we only adapt the mixing coefficients, and for the dictionary we sample D values over the x-axis, uniformly around zero.
[1] Scardapane, S., Van Vaerenbergh, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
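In code, Eq. (7) amounts to a fixed dictionary plus one trainable coefficient vector per neuron. A minimal PyTorch sketch (the kernel κ is passed in as a function; the Gaussian choice appears on the next slide, and all sizes here are illustrative):

    import torch
    import torch.nn as nn

    class KAF(nn.Module):
        """Eq. (7): g(s) = sum_i alpha_i * kappa(s, d_i); only the alpha_i are trained."""
        def __init__(self, num_features, kernel, D=20, boundary=3.0):
            super().__init__()
            # Fixed dictionary: D points sampled uniformly around zero (not trained)
            self.register_buffer('d', torch.linspace(-boundary, boundary, D))
            # Trainable mixing coefficients, one set of D values per neuron
            self.alpha = nn.Parameter(0.3 * torch.randn(num_features, D))
            self.kernel = kernel

        def forward(self, s):
            # s: (batch, num_features) -> kernel values: (batch, num_features, D)
            K = self.kernel(s.unsqueeze(-1), self.d)
            return (self.alpha * K).sum(dim=-1)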
Kernel selection
For our experiments, we use the 1D Gaussian kernel defined as:
κ(s, d_i) = exp{−γ (s − d_i)²},    (8)

where γ ∈ R is called the kernel bandwidth. Based on some preliminary experiments, we use the following rule of thumb for selecting the bandwidth:

γ = 1 / (6∆²),    (9)
where ∆ is the distance between the grid points.
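With the Gaussian kernel of Eq. (8) and the rule of thumb of Eq. (9), the missing pieces of the KAF sketch from the previous slide become (again only a sketch; D and the dictionary boundary are arbitrary):

    import torch

    def gaussian_kernel(s, d, gamma):
        # Eq. (8): exp(-gamma * (s - d)^2), evaluated elementwise via broadcasting
        return torch.exp(-gamma * (s - d) ** 2)

    # Eq. (9): gamma from the spacing Delta between the dictionary points
    D, boundary = 20, 3.0
    delta = 2 * boundary / (D - 1)
    gamma = 1.0 / (6 * delta ** 2)

    # Plug the kernel into the KAF sketch from the previous slide
    kaf = KAF(num_features=20, kernel=lambda s, d: gaussian_kernel(s, d, gamma),
              D=D, boundary=boundary)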
Choosing the bandwidth
[Figure: three example KAFs, (a) γ = 2.0, (b) γ = 0.5, (c) γ = 0.1; KAF output vs. activation.]
Figure 1: Examples of KAFs. In all cases we sample 20 points uniformly on the x-axis, while the mixing coefficients are sampled from a normal distribution. The three plots show three different choices for γ.
Initialization of the mixing coefficients
Besides initializing the mixing coefficients randomly, we can also approximate any desired initial function using kernel ridge regression (KRR):
α = (K + εI)^{−1} t,    (10)
where K ∈ R^{D×D} is the kernel matrix computed between the desired points t and the elements of the dictionary d.
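A short sketch of the initialization in Eq. (10): solve the regularized linear system so that the KAF reproduces a desired function (here a tanh) on the dictionary points. The helper name and the dictionary grid are our own choices:

    import torch

    def init_alpha_krr(d, target_fn, gamma, eps=1e-6):
        """Eq. (10): alpha = (K + eps*I)^{-1} t, with K the Gaussian kernel matrix on the dictionary."""
        t = target_fn(d)                                          # desired values at the dictionary points
        K = torch.exp(-gamma * (d[:, None] - d[None, :]) ** 2)    # (D, D) kernel matrix
        return torch.linalg.solve(K + eps * torch.eye(len(d)), t)

    d = torch.linspace(-4, 4, 20)
    delta = d[1] - d[0]
    alpha = init_alpha_krr(d, torch.tanh, gamma=1.0 / (6 * delta ** 2))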
Examples of initialization
[Figure: KAFs initialized via KRR to (a) tanh and (b) ELU; KAF output vs. activation.]
Figure 2: Two examples of initializing a KAF using KRR, with ε = 10⁻⁶. (a) A hyperbolic tangent. (b) The ELU function. The red dots indicate the corresponding initialized values of the mixing coefficients.
Multi-dimensional KAFs
We also consider a two-dimensional variant (2D-KAF), which acts on a pair of activation values:

g(s) = ∑_{i=1}^{D²} α_i κ(s, d_i),    (11)

where d_i is the i-th element of the dictionary, and we now have D² adaptable coefficients {α_i}_{i=1}^{D²} sampled over the plane.
In this case, we consider the 2D Gaussian kernel:
κ(s, d_i) = exp{−γ ‖s − d_i‖²}.    (12)
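A sketch of the 2D Gaussian kernel of Eq. (12), evaluated for a batch of activation pairs against a D × D grid of dictionary points (grid size and range are illustrative):

    import torch

    def gaussian_kernel_2d(s, d, gamma):
        """Eq. (12): exp(-gamma * ||s - d_i||^2) for every dictionary element d_i."""
        # s: (batch, 2) pairs of activation values; d: (D*D, 2) grid over the plane
        sq_dist = ((s[:, None, :] - d[None, :, :]) ** 2).sum(dim=-1)   # (batch, D*D)
        return torch.exp(-gamma * sq_dist)

    # Dictionary: a uniform D x D grid over [-3, 3]^2
    D = 10
    axis = torch.linspace(-3.0, 3.0, D)
    d = torch.cartesian_prod(axis, axis)   # (D*D, 2)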
Advantages of the framework
1 Universal approximation properties.
2 Very simple to vectorize and to accelerate on GPUs.
3 Smooth over the entire domain.
4 Mixing coefficients can be regularized easily, including the use of sparse penalties (see the sketch below).
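For instance, a sparse (ℓ1) penalty on the mixing coefficients of the earlier KAF sketch is a single extra term in the loss; the penalty weight 1e-4 is arbitrary:

    # l1 (sparse) regularization of the mixing coefficients of the KAF sketch above
    l1_penalty = 1e-4 * kaf.alpha.abs().sum()
    # add l1_penalty to the task loss before calling backward()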
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Visualizing the functions
[Figure: six trained KAFs, panels (a)-(f); KAF output vs. activation.]
Figure 3: Examples of 6 trained KAFs (with random initialization) on the Sensorless dataset. On the y-axis we plot the output value of the KAF. The KAF after initialization is shown with a dashed red line, while the final KAF is shown with a solid green line. The distribution of activation values is shown as a reference in light blue.
Results on the SUSY benchmark
Activation function            Testing AUC       Trainable parameters
ReLU (five hidden layers)      0.8739 (0.001)    367201
ELU (five hidden layers)       0.8739 (0.001)    367201
SELU (five hidden layers)      0.8745 (0.002)    367201
PReLU (five hidden layers)     0.8748 (0.001)    368701
Maxout (one layer)             0.8744 (0.001)    17401
Maxout (two layers)            0.8744 (0.002)    288301
APL (one layer)                0.8744 (0.002)    7801
APL (two layers)               0.8757 (0.002)    99901
KAF (one layer)                0.8756 (0.001)    12001
KAF (two layers)               0.8758 (0.001)    108301

Table 1: Results on the SUSY benchmark.
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Conclusions and future work
1 We proposed a novel family of non-parametric activation functions, framed as a kernel expansion of the input value.
2 KAFs combine several advantages of previous approaches, without introducing an excessive number of additional parameters.
3 Networks trained with these activations can obtain a higher accuracy while being significantly smaller.
4 Alternative choices for the kernel expansion are possible, e.g., dictionary selection strategies, alternative kernels (e.g., periodic kernels), and several others.
5 The framework provides a further link between neural networks and kernel methods, opening up a large number of variations with respect to our initial approach.
Thanks for your attention. Questions?