Transcript of the Sixth Italian Workshop on Machine Learning and Data Mining (MLDM)
Sixth Italian Workshop on Machine Learning and Data Mining (MLDM)
Kernel-based non-parametric activation functions for neural networks
Authors: S. Scardapane, S. Van Vaerenbergh and A. Uncini
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Basic NN architecture
The basic layer of a neural network alternates a linear projection with a pointwise nonlinearity:

h_l = g_l(W_l h_{l−1} + b_l).    (1)
There is a huge literature on the linear component, e.g., initialization, compression, fast multiplication...
In most cases, the weights {W_l, b_l} are the only adaptable components of the network.
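As a concrete illustration, here is a minimal PyTorch sketch of the layer in Eq. (1); the class name, the sizes, and the choice of tanh are ours, not part of the slides:

    import torch
    import torch.nn as nn

    class BasicLayer(nn.Module):
        """One layer of Eq. (1): a linear projection followed by a pointwise nonlinearity."""
        def __init__(self, in_features, out_features, nonlinearity=torch.tanh):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)  # W_l and b_l: the adaptable parts
            self.g = nonlinearity                                # fixed pointwise g_l

        def forward(self, h):
            return self.g(self.linear(h))

    h = torch.randn(32, 10)              # a batch of 32 inputs with 10 features
    print(BasicLayer(10, 20)(h).shape)   # torch.Size([32, 20])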
What about the nonlinearity?
The choice of the nonlinearity is crucial:
• Having differentiable activations was the basic ingredient for back-propagation.
• In the last decade, the ReLU function g(s) = max(0, s) made it possible to train deep NNs with hundreds of layers.
• Many recent papers propose new activation functions, e.g., the Swish function [1]:
g(s) = s · sigmoid(s) . (2)
Can we learn the activation functions?
[1] Ramachandran, P., Zoph, B. and Le, Q.V., 2017. Swish: a Self-Gated Activation Function. arXiv preprint arXiv:1710.05941.
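For reference, Swish from Eq. (2) is a one-liner (a sketch; torch.sigmoid is the logistic function):

    import torch

    def swish(s):
        # Eq. (2): s * sigmoid(s)
        return s * torch.sigmoid(s)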
Parametric activation functions
Making a single activation function parametric is relatively simple, e.g., we can add a learnable scale and bandwidth to a tanh:

g(s) = a (1 − exp{−bs}) / (1 + exp{−bs}).    (3)
Or learn the slope for the negative part of the ReLU (PReLU):

g(s) = { s    if s ≥ 0,
       { αs   otherwise.    (4)
These parametric AFs add only a few trainable parameters, but their flexibility is severely limited.
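As an example, a minimal PyTorch sketch of the PReLU in Eq. (4), with one learnable slope per neuron (PyTorch also provides nn.PReLU with the same behaviour):

    import torch
    import torch.nn as nn

    class PReLU(nn.Module):
        """Eq. (4): identity for s >= 0, learnable slope alpha for s < 0."""
        def __init__(self, num_features, init_alpha=0.25):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((num_features,), init_alpha))

        def forward(self, s):
            return torch.where(s >= 0, s, self.alpha * s)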
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Adaptive piecewise linear units
An APL nonlinearity is the sum of S linear segments:
g(s) = max{0, s} + ∑_{i=1}^{S} a_i max{0, −s + b_i}.    (5)
This is non-parametric because S is a user-defined hyper-parameter controlling the flexibility of the unit.
The APL introduces S + 1 points of non-differentiability for each neuron, which may hinder the optimization algorithm. Also, in practice, choosing S > 3 seems to add little to the resulting shapes.
[1] Agostinelli, F., Hoffman, M., Sadowski, P. and Baldi, P., 2014. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830.
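A possible PyTorch sketch of the APL in Eq. (5), with S learnable hinges per neuron (parameter shapes and initialization are our own choices):

    import torch
    import torch.nn as nn

    class APL(nn.Module):
        """Eq. (5): ReLU plus a sum of S learnable hinge functions per neuron."""
        def __init__(self, num_features, S=3):
            super().__init__()
            self.a = nn.Parameter(0.1 * torch.randn(S, num_features))
            self.b = nn.Parameter(torch.randn(S, num_features))

        def forward(self, s):
            # s: (batch, num_features); broadcast against the S hinges
            hinges = torch.clamp(-s.unsqueeze(1) + self.b, min=0)   # (batch, S, num_features)
            return torch.clamp(s, min=0) + (self.a * hinges).sum(dim=1)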
Spline activation functions
A SAF uses cubic interpolation over a set of adaptable control points:
[Figure: example SAF output as a function of the input activation.]
However, regularizing the control points is non-trivial, and SAFs cannot be easily accelerated on GPU.
[1] Vecci, L., Piazza, F. and Uncini, A., 1998. Learning and approximation capabilities of adaptive spline activation function neural networks. Neural Networks, 11(2), pp. 259-270.
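As an illustration (not the exact local interpolation scheme of the SAF), a spline activation can be sketched with SciPy by interpolating adaptable control-point values; the control-point grid and the perturbed tanh are arbitrary choices:

    import numpy as np
    from scipy.interpolate import CubicSpline

    # Control points: fixed locations on the x-axis, adaptable y-values (here a perturbed tanh)
    x = np.linspace(-2.0, 2.0, 15)
    y = np.tanh(x) + 0.1 * np.random.randn(15)

    saf = CubicSpline(x, y)   # cubic interpolation through the control points
    print(saf(0.5))           # evaluate the activation at s = 0.5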
Maxout neurons
A Maxout replaces an entire neuron by taking the maximum over K separate linear projections:

g(h) = max_{i=1,…,K} { w_i^T h + b_i }.    (6)
With two maxout neurons, a NN with one hidden layer remains a universal approximator, provided K is sufficiently large.
However, the resulting functions cannot be easily visualized for inputs with more than two or three dimensions, and the number of parameters can increase drastically with K.
[1] Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y., 2013. Maxout networks. Proc. 30th Int. Conf. on Machine Learning.
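A compact PyTorch sketch of a Maxout layer implementing Eq. (6); packing the K projections into a single nn.Linear is our own choice:

    import torch
    import torch.nn as nn

    class Maxout(nn.Module):
        """Eq. (6): each output is the maximum over K separate linear projections of the input."""
        def __init__(self, in_features, out_features, K=4):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features * K)
            self.out_features, self.K = out_features, K

        def forward(self, h):
            z = self.linear(h).view(-1, self.out_features, self.K)
            return z.max(dim=-1).values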
Visualization of a Maxout neuron
[Figure: Maxout activation as a function of a 1D input.]
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Basic structure of the KAF
We model each activation function as a kernel expansion over D terms:

g(s) = ∑_{i=1}^{D} α_i κ(s, d_i),    (7)
where:
1 {α_i}_{i=1}^{D} are the mixing coefficients;
2 {d_i}_{i=1}^{D} are the dictionary elements;
3 κ(·, ·) : R × R → R is a 1D kernel function.
To make everything tractable, we only adapt the mixing coefficients, and for the dictionary we sample D values over the x-axis, uniformly around zero.
[1] Scardapane, S., Van Vaerenbergh, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
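In code, Eq. (7) amounts to a fixed dictionary plus one trainable coefficient vector per neuron. A minimal PyTorch sketch (the kernel κ is passed in as a function; the Gaussian choice appears on the next slide, and all sizes here are illustrative):

    import torch
    import torch.nn as nn

    class KAF(nn.Module):
        """Eq. (7): g(s) = sum_i alpha_i * kappa(s, d_i); only the alpha_i are trained."""
        def __init__(self, num_features, kernel, D=20, boundary=3.0):
            super().__init__()
            # Fixed dictionary: D points sampled uniformly around zero (not trained)
            self.register_buffer('d', torch.linspace(-boundary, boundary, D))
            # Trainable mixing coefficients, one set of D values per neuron
            self.alpha = nn.Parameter(0.3 * torch.randn(num_features, D))
            self.kernel = kernel

        def forward(self, s):
            # s: (batch, num_features) -> kernel values: (batch, num_features, D)
            K = self.kernel(s.unsqueeze(-1), self.d)
            return (self.alpha * K).sum(dim=-1)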
Kernel selection
For our experiments, we use the 1D Gaussian kernel defined as:
κ(s, d_i) = exp{−γ (s − d_i)²},    (8)

where γ ∈ R is called the kernel bandwidth. Based on some preliminary experiments, we use the following rule of thumb for selecting the bandwidth:

γ = 1 / (6∆²),    (9)
where ∆ is the distance between the grid points.
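With the Gaussian kernel of Eq. (8) and the rule of thumb of Eq. (9), the missing pieces of the KAF sketch from the previous slide become (again only a sketch; D and the dictionary boundary are arbitrary):

    import torch

    def gaussian_kernel(s, d, gamma):
        # Eq. (8): exp(-gamma * (s - d)^2), evaluated elementwise via broadcasting
        return torch.exp(-gamma * (s - d) ** 2)

    # Eq. (9): gamma from the spacing Delta between the dictionary points
    D, boundary = 20, 3.0
    delta = 2 * boundary / (D - 1)
    gamma = 1.0 / (6 * delta ** 2)

    # Plug the kernel into the KAF sketch from the previous slide
    kaf = KAF(num_features=20, kernel=lambda s, d: gaussian_kernel(s, d, gamma),
              D=D, boundary=boundary)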
Choosing the bandwidth
[Figure: three example KAFs, (a) γ = 2.0, (b) γ = 0.5, (c) γ = 0.1; KAF output vs. activation.]
Figure 1: Examples of KAFs. In all cases we sample 20 points uniformly on the x-axis, while the mixing coefficients are sampled from a normal distribution. The three plots show three different choices for γ.
Initialization of the mixing coefficients
Besides initializing the mixing coefficients randomly, we can also approximate any desired initial function using kernel ridge regression (KRR):
α = (K + εI)^{−1} t,    (10)
where K ∈ R^{D×D} is the kernel matrix computed between the desired points t and the elements of the dictionary d.
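A short sketch of the initialization in Eq. (10): solve the regularized linear system so that the KAF reproduces a desired function (here a tanh) on the dictionary points. The helper name and the dictionary grid are our own choices:

    import torch

    def init_alpha_krr(d, target_fn, gamma, eps=1e-6):
        """Eq. (10): alpha = (K + eps*I)^{-1} t, with K the Gaussian kernel matrix on the dictionary."""
        t = target_fn(d)                                          # desired values at the dictionary points
        K = torch.exp(-gamma * (d[:, None] - d[None, :]) ** 2)    # (D, D) kernel matrix
        return torch.linalg.solve(K + eps * torch.eye(len(d)), t)

    d = torch.linspace(-4, 4, 20)
    delta = d[1] - d[0]
    alpha = init_alpha_krr(d, torch.tanh, gamma=1.0 / (6 * delta ** 2))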
Examples of initialization
[Figure: KAFs initialized via KRR to (a) tanh and (b) ELU; KAF output vs. activation.]
Figure 2: Two examples of initializing a KAF using KRR, with ε = 10⁻⁶. (a) A hyperbolic tangent. (b) The ELU function. The red dots indicate the corresponding initialized values of the mixing coefficients.
Multi-dimensional KAFs
We also consider a two-dimensional variant (2D-KAF), which acts on a pair of activation values:

g(s) = ∑_{i=1}^{D²} α_i κ(s, d_i),    (11)

where d_i is the i-th element of the dictionary, and we now have D² adaptable coefficients {α_i}_{i=1}^{D²} sampled over the plane.
In this case, we consider the 2D Gaussian kernel:
κ(s, d_i) = exp{−γ ‖s − d_i‖²}.    (12)
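A sketch of the 2D Gaussian kernel of Eq. (12), evaluated for a batch of activation pairs against a D × D grid of dictionary points (grid size and range are illustrative):

    import torch

    def gaussian_kernel_2d(s, d, gamma):
        """Eq. (12): exp(-gamma * ||s - d_i||^2) for every dictionary element d_i."""
        # s: (batch, 2) pairs of activation values; d: (D*D, 2) grid over the plane
        sq_dist = ((s[:, None, :] - d[None, :, :]) ** 2).sum(dim=-1)   # (batch, D*D)
        return torch.exp(-gamma * sq_dist)

    # Dictionary: a uniform D x D grid over [-3, 3]^2
    D = 10
    axis = torch.linspace(-3.0, 3.0, D)
    d = torch.cartesian_prod(axis, axis)   # (D*D, 2)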
Advantages of the framework
1 Universal approximation properties.
2 Very simple to vectorize and to accelerate on GPUs.
3 Smooth over the entire domain.
4 Mixing coefficients can be regularized easily, including the use of sparse penalties (see the sketch below).
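For instance, a sparse (ℓ1) penalty on the mixing coefficients of the earlier KAF sketch is a single extra term in the loss; the penalty weight 1e-4 is arbitrary:

    # l1 (sparse) regularization of the mixing coefficients of the KAF sketch above
    l1_penalty = 1e-4 * kaf.alpha.abs().sum()
    # add l1_penalty to the task loss before calling backward()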
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Visualizing the functions
[Figure: six trained KAFs, panels (a)-(f); KAF output vs. activation.]
Figure 3: Examples of 6 trained KAFs (with random initialization) on the Sensorless dataset. On the y-axis we plot the output value of the KAF. The KAF after initialization is shown with a dashed red line, while the final KAF is shown with a solid green line. The distribution of activation values is shown as a reference in light blue.
Results on the SUSY benchmark
Activation function            Testing AUC       Trainable parameters
ReLU (five hidden layers)      0.8739 (0.001)    367201
ELU (five hidden layers)       0.8739 (0.001)    367201
SELU (five hidden layers)      0.8745 (0.002)    367201
PReLU (five hidden layers)     0.8748 (0.001)    368701
Maxout (one layer)             0.8744 (0.001)    17401
Maxout (two layers)            0.8744 (0.002)    288301
APL (one layer)                0.8744 (0.002)    7801
APL (two layers)               0.8757 (0.002)    99901
KAF (one layer)                0.8756 (0.001)    12001
KAF (two layers)               0.8758 (0.001)    108301

Table 1: Results on the SUSY benchmark.
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Conclusions and future work
1 We proposed a novel family of non-parametric activation functions, framed as a kernel expansion of the input value.
2 KAFs combine several advantages of previous approaches, without introducing an excessive number of additional parameters.
3 Networks trained with these activations can obtain a higher accuracy while being significantly smaller.
4 Alternative choices for the kernel expansion are possible, e.g., dictionary selection strategies, alternative kernels (e.g., periodic kernels), and several others.
5 The framework provides a further link between neural networks and kernel methods, opening up a large number of variations with respect to our initial approach.
Thanks for your attention. Questions?