
Universal Approximation with Convex Optimization: Gimmick or Reality?

Jose C. Principe1, 2, Fellow, IEEE, Badong Chen2, Senior Member, IEEE

1. Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
2. Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China

[email protected], [email protected]

Abstract

This paper surveys in a tutorial fashion the recent history of universal learning machines, starting with the multilayer perceptron. The big push in recent years has been on the design of universal learning machines using optimization methods linear in the parameters, such as the Echo State Network, the Extreme Learning Machine and the Kernel Adaptive Filter. We call this class of learning machines convex universal learning machines, or CULMs. The purpose of the paper is to compare the methods behind these CULMs, highlighting their features using concepts of vector spaces (i.e. basis functions and projections), which are easy to understand by the computational intelligence community. We illustrate how two of the CULMs behave in a simple example, and we conclude that indeed it is practical to create universal mappers with convex adaptation, which is an improvement over backpropagation.

Keywords: Convex Universal Learning Machines (CULMs), Echo State Network (ESN), Extreme Learning Machine (ELM), Kernel Adaptive Filter (KAF), Kernel Least Mean Square (KLMS)

Introduction

In the 1980s, multilayer perceptrons (MLPs) were central players in a revolution to design universal approximators directly from data. The now famous backpropagation algorithm was discovered and applied to solve practical regression and classification problems at a time when statisticians were concentrating on solving low dimensional problems with linear methods. The lesson learned was that in order to achieve universal approximation directly from data, we had to pay the price of non-convex optimization. Thirty years later, a new page is turning with the Echo State Network (ESN) and Liquid State Machine (LSM) [22],[24], the now popular Extreme Learning Machines (ELM) [27-31], and Kernel Adaptive Filters (KAF) [32]. Can we really design and quantify the performance of universal learning machines with convex optimization algorithms? For short, we call this class CULMs.

Learning machines are built from three different sub-modules: the mapper, the cost function, and the learning algorithm. Each has a very specific role, and a full understanding of the learning system requires a clear picture of each, as well as of how they interact. This paper provides the context and an unbiased perspective on the advances obtained since MLPs trained with backpropagation. Hopefully, it brings a more realistic and encompassing framework that will elucidate the adaptation of universal approximators with simple gradient descent algorithms. We start with a brief presentation of universal approximators and a review of the literature on how to adapt the mapping with neural networks. Next, the many attempts to simplify their training are briefly addressed, their types are catalogued, and we present some of the problems that still need to be solved to achieve the goal of universal approximation with convex optimization algorithms.

Universal Approximation and Multilayer Architectures

Regression or adaptive filtering can be framed very simply as finding an optimal subspace approximation for the cloud of points in the multidimensional joint space of inputs and target signals [1]. Normally, the input is used to define the subspace where the solution will lie, and the goal is to optimally project the joint space data such that the distances to the input subspace are minimized in some sense. In classification the problem is similar, except that the goal is to shatter (i.e. divide into non-overlapping groups) the cloud of points in the joint space such that the number of mislabels is minimized [2]. Regression is related to David Hilbert's famous 13th problem of finding solutions of polynomial equations: "Can every continuous function of two or more real variables be written as the superposition of continuous functions of one real variable along with addition?" [3], which shows its importance in mathematics. We now know, since Kolmogorov [4], that any real-valued continuous function f on an n-dimensional hypercube can be represented exactly (no free parameters) as

$$f(x_1,\ldots,x_n) = \sum_{q} \chi_q\!\left(\sum_{p} \psi_{p,q}(x_p)\right) \qquad (1)$$

where the nonlinear functions χ(·) and ψ(·) are typically non-smooth, univariate functions and the sums are finite. It was known from the 1960s that a parametric single layer of smooth nonlinear processing elements (PEs), called the perceptron by Rosenblatt [5], was unable to solve arbitrary partition problems in R^n [6]. It was also clear to many researchers that a parametric multilayer arrangement of smooth nonlinear PEs was much more powerful, but nobody was able to train its parameters. The book by Nilsson [7] is a wonderful account of the state of the art in the early days of adaptive learning machines, which includes neural networks.

Mapping Function

Let us start with the analysis of the mapping function. Multilayer systems are an embodiment of a class of function approximators that is very different from the mainstream of approximation theory, which is based on polynomial expansions. We all learned how any function can be approximated in the neighborhood of an operating point by the Taylor series, but the problem we are interested in here is to approximate any function over its entire domain with an error less than some predetermined quantity! However, very few of us are formally exposed to nested nonlinear functions [8], which are exactly what a multilayer architecture implements. We will cut corners in the mathematical formalism to present the main ideas to the broader audience. Let y = f(x) be a continuous function in R^n. The goal is to approximate f in R^n by a function f̂(x) that is built in the following way:

$$\hat{f}(x) = \phi\!\left(\sum_i w_i\, \phi\!\Big(\sum_j w_{i,j}\, x_j + b_i\Big) + b\right) \qquad (2)$$

where φ(·) are smooth nonlinear functions, w and b are real-valued parameters (weights and biases), j is an index over the input dimension, and i is the index over the number of functions φ in the composition, called the PEs. The importance of (2), apart from its similarity with (1), is that it is exactly the mapping function of a single hidden layer MLP and, according to the Stone-Weierstrass theorem, it is universal when the number of PEs is infinite [9]. This formula and similar results justified the popularity of single hidden layer perceptrons and radial basis function networks, which differ simply in the choice of the nonlinearity (the former use ridge functions and the latter Gaussians). Of course, we cannot use an infinite number of PEs in practice, but if a large number is employed, one will be closer to achieving the goal. In fact, Cover proved a theorem showing that the probability of shattering an arbitrary set of points approaches one as the dimensionality of the projection space approaches infinity [10]. The number of layers in (2) can also be increased, which gives rise to the deep networks that have emerged as the next big topic in neural networks. But from a function approximation perspective the single hidden layer is quite adequate as the basic topology. So we should understand what the merit of (2) is for function approximation, and for simplicity we drop the external nonlinearity,

$$\hat{f}(x) = \sum_i \omega_i\, \phi\!\Big(\sum_j w_{i,j}\, x_j + b_i\Big) + b. \qquad (3)$$
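To make the notation concrete, the short NumPy sketch below (our own illustration, not code from the paper) evaluates the single hidden layer map of (3) for a batch of inputs; the names W, b_hidden, omega and b_out are placeholders for the first-layer weights w_{i,j}, the biases b_i, the output weights ω_i and the output bias b, and tanh is one possible choice of smooth ridge nonlinearity φ.

```python
import numpy as np

def hidden_layer(X, W, b_hidden):
    """Hidden PE outputs: tanh ridge functions of the projected inputs."""
    # X: (num_samples, input_dim), W: (num_PEs, input_dim), b_hidden: (num_PEs,)
    return np.tanh(X @ W.T + b_hidden)

def single_hidden_layer_map(X, W, b_hidden, omega, b_out):
    """Evaluates f_hat(x) = sum_i omega_i * phi(sum_j w_ij x_j + b_i) + b, as in eq. (3)."""
    return hidden_layer(X, W, b_hidden) @ omega + b_out

# Tiny usage example with random parameters (10 PEs, 2-dimensional input).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
W = rng.normal(size=(10, 2))
b_hidden = rng.normal(size=10)
omega = rng.normal(size=10)
b_out = 0.0
print(single_hidden_layer_map(X, W, b_hidden, omega, b_out))
```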

Let us start with the linear regression problem (Fig. 1) to understand the role of projections in mappings [1]. The linear model for the same setting is a special case of (3), with a single layer and φ(x) = x,

$$\hat{f}(x) = \sum_j w_j\, x_j + b = x \cdot w + b. \qquad (4)$$

If one interprets the multidimensional x as a vector, (4) is an inner product (projection) of the input x onto the weight vector w, plus a bias term. In function approximation, we would like to decrease the error between y and f̂(x) in some sense, e.g. the mean square error E[(y − f̂(x))²], by controlling the value of the parameters w, i.e.

$$\min_{w}\; E\big[(y - \hat{f}(x))^2\big]. \qquad (5)$$


Kolmogorov provided a beautiful geometric interpretation of this result in functional spaces, as the orthogonal projection of the function f onto the space spanned by the input vectors {x}, which we will call the bases. So in linear function approximation we can see that the input signal provides the bases of the projection space, and the goal of regression is to find a weight vector that yields the orthogonal projection of y onto the input space. Since the model is linear in the parameters, the optimization is convex and has an analytic solution, the least squares solution [11]. Alternatively, gradient descent can always provide a reasonable approximation (i.e. the final weights always end up within a ball centered at the optimum, with radius controlled by the stepsize), with the advantage of computational efficiency (two multiplications per parameter) [1].
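As an illustration of this convexity (our own sketch, not from the paper), the optimal projection weights of (5) can be obtained either analytically from the normal equations, w = R⁻¹p, where R is the input autocorrelation matrix and p the input-target cross-correlation vector, or iteratively by a sample-by-sample gradient descent; the data and variable names below are placeholders, and the bias is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                 # input vectors (the bases of the projection space)
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=500)   # noisy targets

# Analytic least-squares solution: w = R^{-1} p (orthogonal projection of y onto the input space).
R = X.T @ X / len(X)                          # input autocorrelation matrix
p = X.T @ y / len(X)                          # input-target cross-correlation vector
w_ls = np.linalg.solve(R, p)

# Gradient descent (LMS-style): converges to a ball around w_ls whose radius depends on the stepsize.
w = np.zeros(3)
eta = 0.01                                    # stepsize
for x_i, y_i in zip(X, y):
    e_i = y_i - x_i @ w                       # instantaneous error
    w += eta * e_i * x_i                      # two multiplications per parameter
print(w_ls, w)
```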

Figure 1. Illustration of the problem of regression in the joint space. The input space created by the input signal vectors (assumed horizontal) is used to find the best approximation (the orthogonal projection) of the data into this space.

This explanation is textbook material, but the basic framework can also be extended to understand (3). Let us denote φ_i as

$$\phi_i(x) = \phi\!\Big(\sum_j w_{i,j}\, x_j + b_i\Big) \qquad (6)$$

and substituting in (3) we obtain

$$\hat{f}(x) = \sum_i \omega_i\, \phi_i(x) + b. \qquad (7)$$

If the same geometric interpretation of regression is used in (7), we see that the output of the one hidden layer machine is nothing but a projection onto the space created by the outputs φ_i(·) of the hidden layer processing units. The only problem is that these bases are controlled by the input data as well as by the parameters w_{i,j}, as shown in (6), so they change during learning. Moreover, because of the nonlinear PEs, the space spanned by these bases is no longer limited to the input span. The projection space can be placed where it makes more sense, depending on both the input and target data, by adapting the first layer weights w_{i,j}, which is exactly the reason why the one hidden layer machine is a universal approximator. Since the optimization problem remains the same, we can now understand better the role of each of the layers of our learning machine (Fig. 2): the output weights are still finding the orthogonal projection onto a subspace spanned by the φ_i, and this optimization is convex in the parameters, provided that the output processing unit is linear. Moreover, the projection space is movable during learning, because the bases themselves are a function of the first layer weights, which change during learning. This understanding calls for different learning rates in each layer to save adaptation time, because it is obvious that the optimal projection can only be determined once the basis functions stabilize. This is also textbook material [1] that was never assimilated by non-practitioners who keep declaring that MLPs are black boxes, which they are not, as this explanation shows!

Figure 2. Top panel shows the single hidden layer MLP and the bottom panel shows the interpretation of the hidden layer PE outputs as the projection space where the optimal solution of the input-output map is obtained.


This signal processing approach can also be applied to other topologies such as the functional link network proposed by Pao [18], where the first layer is a product of functions.

$$\hat{f}(x) = \sum_i \omega_i \prod_j g(w_{i,j}\, x_j + b_{i,j}) \qquad (8)$$

Parameter Learning and Generalization

The procedure to change both the bases and the projection with the target signal information is achieved with the backpropagation algorithm [12],[13],[14], which is a gradient descent based procedure that is easy to describe: backpropagate the output error through the nonlinearities until it reaches the first layer, to tune the parameters and minimize the cost function. Because the optimization is nonlinear in the first layer parameters, it is non-convex and therefore the optimal solution cannot be guaranteed. Therefore, the MLP trained with backpropagation is not a CULM. Practically, in batch mode there are many ways to control convergence to achieve good solutions, but in online learning mode it is virtually impossible to guarantee optimal solutions. Moreover, training times are long because of the computation involved and the difficult performance landscape, which may require a high number of iterations. Generalization ability is also a function of the training schedule; therefore, backpropagation training of hidden layer networks for optimal performance is still more of an art than a science [15]. However, for over 25 years, it was believed that difficult training was the price to be paid for universal approximation capabilities.

Because of these difficulties, the first wave of improvements was directed towards improving gradient descent learning [16] with second order methods [2], and finally global optimization methods. Indeed, when the optimization problem is nonlinear in the parameters, we should seek global optimizers. Genetic algorithms and, more generally, evolutionary algorithms [17] have been widely used to train the parameters of MLPs with repeatable results. However, the training is still quite long and requires batch mode. Other approaches were proposed to avoid the backpropagation algorithm and still achieve good, practical results. Perhaps the first approach predated neural networks and was proposed by Wiener with his Wiener model [19], where a linear dynamic model is followed by a static nonlinearity. The most interesting was the functional link network [18], which used random weights for the basis and only adapted the projection weights in (8), although the authors proposed to average the system outputs. Another very interesting idea was to extend the multi-dimensional inputs with pairwise input products [7]. This idea was reinvented many times and can be extended to the polynomial neural network [20], but the issue is that there is no guarantee of universality for these one-layer architectures.

Finally, let us provide a very quick explanation of the difficulty caused by correlated bases, which is the general case in regression and classification, because the data (and, to a certain extent, the training methods) control the correlation among the basis functions. Correlated bases create a compromise between the solution space of mappers (which affects the accuracy on training data) and the ability to generalize (which is a loss of accuracy on test data). This is the generalization problem, which is different from universal approximation. We will keep the geometric framework in our explanation, instead of using the conventional statistical formulation [8]. First, we need to define the concept of the reachable set of solutions, akin to machine capacity, with a constraint on the norm of the projection parameters. Let us assume, for ease of illustration, that we impose a constraint on the absolute value of all the projection weights (i.e. |ω_i| ≤ 1/2) as in Fig. 3. Now we can talk about the "volume" of the reachable set of solutions, and we will see that not all bases that span a given subspace provide the same volume. In fact, if the basis is orthonormal, it provides a cube of volume 1 in the space of solutions. Now let us assume that the basis vectors are correlated, but still linearly independent, so that they span the full space. As we can expect, the matrix whose columns are the bases is still full rank, but because the bases are correlated, the volume of the space for a norm-1/2 weight constraint will be smaller than 1 (Fig. 3). So correlated bases have a smaller reachable set when compared with the orthogonal span, i.e. the achievable training set error is not the same for all universal mappers with the same span.
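A quick numerical illustration of this "volume" argument (our own sketch, not from the paper): the image of the weight cube [-1/2, 1/2]^n under a basis matrix B has volume |det B|, so an orthonormal basis gives volume 1 while a correlated but still full-rank basis gives a smaller reachable set.

```python
import numpy as np

def reachable_volume(B):
    """Volume of {B w : |w_i| <= 1/2}, i.e. the image of the unit weight cube under basis B."""
    return abs(np.linalg.det(B))  # unit cube volume (1) times |det B|

# Orthonormal basis in R^2: reachable volume 1.
B_orth = np.eye(2)

# Correlated but linearly independent basis: same span, smaller reachable volume.
B_corr = np.array([[1.0, 0.9],
                   [0.9, 1.0]])

print(reachable_volume(B_orth))  # 1.0
print(reachable_volume(B_corr))  # 0.19 (= 1 - 0.81)
```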

Figure 3. Visualization of the "volume" of the achievable solution set for an orthogonal span (top left) versus a correlated span, i.e. vectors that are linearly independent but correlated (top right). The bottom panel gives an equivalent interpretation of the vectors created by correlated data in an orthogonal space: the eigenfunctions create the orthogonal axes, and the data correlation shows up as different scales along each axis. The higher the eigenvalue spread, the larger the difference in scales, and hence the smaller the volume of the solution set when a weight norm constraint is imposed.


In reality, during learning with the mean square error the weight norm is not controlled, so it can be as high as necessary for good matching, because the orthogonal projection involves the inverse of the input autocorrelation matrix R times the cross correlation vector (ω = R⁻¹p). In such a case, the difficulty of correlated bases results in poor generalization. This has to do with the eigenvalue spread of the input data autocorrelation matrix (lower panel of Fig. 3). Along the data's smallest eigen-direction the data variance is small, so a very large weight is needed (multiplication by R⁻¹) compared with the weight along the largest eigen-direction. After training, the model parameters are frozen and the output of the system will replace the unknown target value. For any test sample that is in the neighborhood of one of the training samples, the output is still close to the corresponding target value (good generalization). However, any input that deviates significantly from the training samples will produce a projected output with an error amplified along the direction of the smallest data eigenvector, where the weight is very large. This geometric vector space interpretation agrees with statistical learning, which teaches us that the generalization error is proportional to the model weight norm [21]. So, to conclude, not all universal mappers, i.e. systems that can potentially approximate any input-output map, are equally good for generalization, and the culprit is the correlation amongst the basis vectors: either they generalize poorly when no weight norm constraint is imposed, or the achievable training set error is far from the optimum. In regression, the user has no control over the correlation amongst the bases (dictated by the data source), and in MLPs this control is also nonexistent, because backpropagation is blind to how big the weights become. Therefore, regularized cost functions are the widespread solution [1].

CULMs with Random Projections

Echo State Networks (ESN): The first attempts to design CULMs appeared in the 1990s but never took off [18],[34],[44]. Jaeger [22] simplified the hidden layer training with the ESN, as did Maass [23] with the Liquid State Machine (LSM); both are commonly referred to as reservoir computing [24]. Both are recurrent networks, one (ESN) using continuous-valued inputs and the other (LSM) using spike train inputs. The inputs of these learning machines are functions (e.g. time series) instead of static data as in most applications of MLPs. They both have a hidden layer of parameterized recurrent connections, and use a sparse, fixed first layer and fixed feedback parameters in the hidden layer, obtained by random number selection to simplify the training. The only adaptation occurs in the output weights. We will concentrate here on the analysis of the ESN (Fig. 4), but a similar discussion could be developed for LSMs.


Figure 4. Block diagram of a simplified echo state network (ESN). Notice that the hidden layer is recurrent, so this is also called a recurrent MLP.
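To make the block diagram of Fig. 4 concrete, here is a minimal reservoir sketch of our own (not code from the paper or from any ESN library): a fixed, sparse random recurrent layer is scaled to a chosen spectral radius, the reservoir states play the role of the basis functions, and only the output weights are fit by least squares. The toy task (one-step prediction of a sine wave) and all names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_res, spectral_radius, sparsity = 100, 0.9, 0.1

# Fixed random input and recurrent weights (the reservoir); only W_out will be trained.
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W_res = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < sparsity)
W_res *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_res)))  # echo state scaling

def run_reservoir(u):
    """Collect the reservoir states (the basis functions) for an input sequence u."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(W_in * u_t + W_res @ x)
        states.append(x.copy())
    return np.array(states)

# Toy task: one-step prediction of a sine wave.
t = np.arange(0, 60, 0.1)
u, d = np.sin(t[:-1]), np.sin(t[1:])          # input sequence and target (next sample)
H = run_reservoir(u)

# Convex part: least-squares (lightly regularized) fit of the output weights only.
W_out = np.linalg.solve(H.T @ H + 1e-6 * np.eye(n_res), H.T @ d)
print("train MSE:", np.mean((H @ W_out - d) ** 2))
```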

The framework established in the previous section is useful to understand the ESN, because the mapper is still a recurrent MLP. The outputs of the hidden PEs can still be interpreted as the bases of the projection space, while the optimal projection, controlled by the trained output weights, can be found using least squares or gradient descent learning. The difference is that the location of the bases is no longer a function of the desired response. This different strategy avoids training the first layer weights, but brings a new problem, which is the selection of the basis functions, i.e. the fixed recurrent connections. The proponent of ESNs [22] suggests using random weights and a high number of hidden PEs to create a "large" projection space. One can understand these choices because the ESN is still a recurrent MLP (RMLP), so the higher the projection space dimensionality, the better the approximation ability, because the RMLP is closer to a universal mapper. Since the hidden parameters are obtained as realizations of a random number generator (RNG) that defines the placement of the basis vectors, the problems are potential instability caused by the feedback, and the correlation amongst the basis vectors, i.e. not all reservoirs that span the same space (i.e. with the same number of hidden PEs) are equally good for learning. Regarding stability, if we choose the free parameters of the RNG arbitrarily, the reservoir can exhibit divergent dynamics, and the reservoir outputs become saturated and independent of the input values. Therefore the largest absolute eigenvalue of the recurrent connections' weight matrix (the spectral radius) must be smaller than 1 (the echo state condition). Experience also shows that the performance of the reservoir varies with the value of this spectral radius. This is a little more difficult to understand, but it has to do with two things: the memory depth associated with the basis functions and the "richness" of the basis functions, i.e. how different they are in time. RMLPs are myopic mappers [25], and as such their finite memory affects the quality of the approximation. The locations of the poles of the linearized system control the memory depth, as electrical engineers surely understand, because the real part of a pole controls the decay time of the impulse response. Moreover, when one looks at the response of the hidden layer PEs when an input is applied, we see that even for nonlinear PEs their correlation over time is large, and the more the PEs saturate, the less correlated their time responses become. Both of these aspects are controlled by the weight vector norm, which is in fact a user-defined free parameter. Therefore different reservoirs provide different generalization, as we explained in the previous section. The proponents also suggest sparsifying the connectivity of the feedback layer weights, which is a second free parameter that the user has to select [22]. In spite of these three aspects, when the parameters are "well tuned for the problem", the ESN works very well for function approximation. This was an eye-opener, because effectively it was the first instance of a dynamic learning machine that was a CULM, i.e. at the same time a universal function approximator and one whose free trainable parameters (the projection in the space) are determined via a convex optimization algorithm. However, the ESN was not studied from a statistical learning point of view. We proposed an information theoretic metric to describe the richness of the ESN dynamics, called the average state entropy (ASE), and an adaptive bias that avoids the weight norm normalization and effectively reduces one of the free parameters in the design [26].

Extreme Learning Machine (ELM): The ELM [27] was proposed in 2004 and exploits the same basic idea of the ESN's random projections. ELMs do not have a stability problem because they are static mappers. Since the ELM is formally an MLP, we can understand its mapping capabilities very well with the model presented in the previous section: the ELM is a universal mapper for a large number of hidden PEs, and the projection space is randomly selected, fixed, and independent of the target. The choice of the random projection for the input layer parameters guarantees that the PE outputs form a basis [28], i.e. a set of vectors that are in general position (avoiding collinearity); otherwise we would be wasting computation (i.e. we can have 100 PEs in the hidden layer but the span can be much narrower), see [29]. Therefore, the only adaptive mapping is the optimal projection onto this randomly positioned projection space, which can be accomplished by least squares or its equivalents, and this explains why the ELM both is a CULM and adapts much faster than an MLP trained with backpropagation. However, not all ELMs with a given number of hidden PEs generalize equally well, because the correlation amongst the bases depends upon the first layer weights and is not directly controlled in this RNG approach. Moreover, the random selection of the first layer parameters makes the ELM output a random variable, even if the input / target data are deterministic, which means that we can expect a large difference in performance amongst realizations. Note that this source of randomness is imposed by the design and is different from the imprecision in the first layer parameters caused by backpropagation, although the ELM usually displays a smaller error standard deviation in its output than an MLP trained with backpropagation [29]. This is achieved by imposing a constraint on the output weight norm during learning [30], which controls the end projection, but not the cause of the problem.
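The two steps of the ELM can be sketched in a few lines (again our own illustration with placeholder names, not the reference ELM implementation): draw the first-layer weights once from an RNG, compute the hidden-layer outputs, and solve a lightly regularized least squares problem for the output weights, which is the convex part of the design.

```python
import numpy as np

def elm_fit(X, y, n_hidden=20, rng_std=1.0, reg=1e-6, seed=0):
    """Random fixed hidden layer + least-squares output weights (the convex step)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=rng_std, size=(n_hidden, X.shape[1]))  # fixed random first layer
    b = rng.normal(scale=rng_std, size=n_hidden)
    H = np.tanh(X @ W.T + b)                                    # hidden-layer basis outputs
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W.T + b) @ beta

# Toy regression: a noisy sinc-like target, similar to the example used later in the paper
# (np.sinc is the normalized sinc; the exact target here is our own choice).
rng = np.random.default_rng(3)
X = rng.uniform(-4, 4, size=(500, 1))
y = np.sinc(X[:, 0]) + rng.uniform(-0.4, 0.4, size=500)
W, b, beta = elm_fit(X, y)
print("train MSE:", np.mean((elm_predict(X, W, b, beta) - y) ** 2))
```

Rerunning elm_fit with different seeds gives different bases and therefore different errors, which is exactly the realization-to-realization variability discussed next.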
The proponents of the ELM have been very successful in showing its excellent performance on practical problems [31], but they have relegated to secondary importance how to guarantee good generalization with random projection spaces. From the point of view of learning machines, the theoretical generalization ability of the ELM is in principle the mean performance for a given basis size, which removes the uncertainty in the random mappings. But there are better alternatives. Instead of using the mean performance, the proponents exploit today's cheap computer power to find the best weights (bases) for the problem amongst realizations, by repeatedly generating different weights from the same distribution. However, this approach is insufficient, since one needs to cross validate the performance of each basis set to guarantee performance on a test set, because the first layer weights are effectively free parameters.

CULMs with Data Centered Functional Bases

As we have seen with the ESN and ELM, random projection spaces implemented by the hidden layer yield MLPs that are CULMs; however, this does not mean that all CULMs need to be built with random projection spaces. As an example, the recently introduced kernel adaptive filters (KAF) fulfill the two conditions being discussed, i.e. mapping universality with convex optimization, so they are CULMs with bases that are functions [32]. Here we just present the simplest member of the family, the kernel least mean square (KLMS) algorithm [33], to show how it differs from the ESN and the ELM. The KLMS is basically a radial basis function (RBF) network with a different training procedure and a different assignment of centers. In RBF networks we first cluster the data to place a fixed number of Gaussians in the input space [1], or simply place the centers at pre-determined positions [34], while in KLMS (Fig. 5) we place one Gaussian on each sample, with a predetermined bandwidth σ, and simply weight their contributions by the corresponding local error e_i multiplied by the step size (or learning rate) η, i.e.

$$\hat{f}(x) = \sum_{i=1}^{n} \omega_i\, G_\sigma(x - x_i) = \sum_{i=1}^{n} \eta\, e_i\, G_\sigma(x - x_i). \qquad (9)$$

As we can immediately recognize, (9) has the same form as (7); however, it is simpler, because the bases are solely dependent upon the input data, as we can conclude when comparing with (6), which differs from both MLPs and ESN/ELM. Pao [18] also proposed a similar architecture, but he still uses weights for the center placement. In order to understand this deceptively simple expression, we have to pose the problem as a nonlinear mapping from the input space to a special Hilbert space of functions [35], which is called a reproducing kernel Hilbert space (RKHS). Because of the nonlinear mapping, we can show that (9) is universal for certain kernels, including the Gaussian. Furthermore, we can derive (9) as gradient descent on the square loss in the RKHS, i.e. this is an extension to functional spaces of the famed LMS algorithm proposed by Widrow [36]. Therefore, the KLMS is a CULM because there is no local minimum in the parameter optimization. We can think of the value of the mapping at a certain point x in the space as obtained by combining the values of the Gaussians placed at the x_i (the dictionary centers), as in Fig. 5.
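A minimal online KLMS sketch in the spirit of (9) (our own illustration, not the authors' code; all names and settings are placeholders): each new sample adds one Gaussian centered at that sample, weighted by η times the prediction error it produced.

```python
import numpy as np

def gaussian_kernel(x, c, sigma):
    return np.exp(-np.sum((x - c) ** 2) / (2 * sigma ** 2))

def klms(X, y, eta=0.1, sigma=0.4):
    """Online KLMS, eq. (9): the centers are the past samples, the weights are eta * local errors."""
    centers, weights = [], []
    for x_i, y_i in zip(X, y):
        # Evaluate the current estimate f_hat(x_i) with the dictionary built so far.
        f_hat = sum(w * gaussian_kernel(x_i, c, sigma) for w, c in zip(weights, centers))
        e_i = y_i - f_hat                    # local error
        centers.append(x_i)                  # one new Gaussian per sample
        weights.append(eta * e_i)            # its weight is eta * e_i
    return centers, weights

def klms_predict(x, centers, weights, sigma=0.4):
    return sum(w * gaussian_kernel(x, c, sigma) for w, c in zip(weights, centers))

# Toy usage on a noisy sinc-like target (our own choice of example).
rng = np.random.default_rng(4)
X = rng.uniform(-4, 4, size=(200, 1))
y = np.sinc(X[:, 0]) + rng.uniform(-0.4, 0.4, size=200)
centers, weights = klms(X, y, eta=0.1, sigma=0.4)
print(klms_predict(np.array([0.5]), centers, weights))
```

The quantized variant (QKLMS) discussed below would merge a new center into an existing one when it falls within a quantization radius δ, curbing the growth of the dictionary.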


Figure 5. Functional bases in KLMS. Gaussians are placed on each sample of the input data, and the filter output is calculated along a line as a sum of the Gaussians (four terms) weighted by the local errors multiplied by the stepsize.

The interesting point is that the weighting (the parameters) is nothing but the error obtained by evaluating the mapping at each center, multiplied by the learning rate. Therefore KLMS uses the functional structure of the RKHS as the bases (representer theorem [37]), and the optimization is online, linear in the parameters, and has complexity O(n) for each update, where n is the iteration number. The online adaptation is crucial in many engineering problems and in data streaming because it allows optimal solution tracking, i.e. it is truly an adaptive solution. Hence, there is no need for random bases, and if the data is deterministic, the input-output map is also deterministic, which makes the analysis of the algorithm much simpler. In fact, one big appeal of KAF is its formulation, which is well rooted in the theory of Hilbert spaces and of adaptive filtering, so one can mathematically prove mean square convergence [41], prove self-regularization [33], extend the simple KLMS to the kernel affine projection algorithm (KAPA), which includes in the limiting case the kernel recursive least squares (KRLS) [40][42], and even apply KAF to other mathematical objects besides continuous amplitude time series, such as spike trains [43]. In KLMS the size of the projection space is controlled by the data set cardinality. If the data set is small, only a few bases are available to construct the projection space, but remember that they are functions, so the span is universal but the accuracy may suffer. At the other extreme, for time series analysis, where the input data is streaming (a potentially unbounded number of samples), the sum grows linearly with the number of samples, which is computationally infeasible. Fortunately, we now know how to curb the bases to sub-linear growth using the quantized KLMS (QKLMS) [38], which clusters the dictionary centers on a lattice, or even to predefine the system order using a fixed budget algorithm [39] that still has complexity O(n) per update. The only "randomness" in the algorithm comes from the "noise" in the errors produced by the gradient estimate, which in a sense is the dual of the ESN/ELM random basis approach. As is well known [1], the weight vector converges in the mean square to the optimal solution, i.e. after convergence, the solution is always contained in a ball centered at the optimal solution and of radius given by the stepsize η. This means that when the iterative process ends, the solution is somewhere inside this ball, which normally is a reasonable approximation of the optimum.

Quantifying the Role of the RNG Parameters in the Basis Set

The effect of different realizations of the RNG on the performance of the ELM (and also of the ESN) has not been fully characterized. Following well-established statistical procedures to handle free parameters, the most general methodology would require validation of the quality of each realization on a cross validation set to find the best projection space. However, this approach is time consuming, requires more data, and should be accounted for in the ELM computation time, because it is an intrinsic part of the design. Here we propose a simpler method that can help designers gauge the quality of the random projection based solely on selecting the least correlated bases over the training set. It still requires testing several realizations of the RNG with the input training data, but it is conceptually simpler than cross validation, although not as general, because it does not take into consideration the target information. We demonstrate the procedure in the design of the ELM and the QKLMS on a simple problem of approximating a one-dimensional sinc function, as in [27]. Both a training set (x_i, y_i) and a testing set with 5000 samples are generated, where the x_i are uniformly distributed in the interval [-4,4]. Uniform noise distributed over [-0.4, 0.4] is added to all the functional values, and 200 Monte Carlo runs are conducted. In order to quantify the correlation amongst the basis set, we calculate

the hidden layer correlation matrix across the training set as

$$Q = \sum_{i=1}^{N_{tr}} h_i h_i^{T},$$

where N_tr is the training set size in samples and h_i represents the hidden layer output vector for a given input sample x_i. We calculate the eigenvalues of Q, find the maximum eigenvalue λ_max and the minimum eigenvalue λ_min, and finally calculate their ratio ρ = log(λ_max/λ_min). We should pick the realization with the smallest eigenvalue spread as quantified by ρ. In the example, we plot the histogram of the ρ values for different RNG realizations, both for the KLMS and the ELM (Fig. 6). An ELM with 20 sigmoid hidden PEs is implemented, and the weights and biases are generated from a zero mean Gaussian distribution. We change the variance of the random number generator to 0.25, 1.0, and 4.0 to illustrate the impact on the quality of the basis. For the QKLMS algorithm, we select the quantization distance threshold δ = 0.3 to obtain a network size close to 20 PEs to match the ELM size. We choose the kernel size σ = 0.4 and the step size η = 0.01 from experimentation. As we can see in Fig. 6 (top panel), the basis set of the QKLMS produces a very peaky distribution with a very small eigenvalue spread, which is good because it means that the bases are not very correlated.
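The eigenvalue-spread measure is straightforward to compute. The sketch below (our own, with placeholder names and a toy data setup mirroring the text) takes the matrix H of hidden-layer outputs over the training set, forms Q, and returns ρ for one realization; repeating it over RNG seeds gives histograms like those of Fig. 6.

```python
import numpy as np

def eigenvalue_spread(H):
    """rho = log(lambda_max / lambda_min) of Q = sum_i h_i h_i^T over the training set."""
    Q = H.T @ H                           # (n_PEs, n_PEs) correlation matrix of the bases
    eigvals = np.linalg.eigvalsh(Q)       # Q is symmetric positive semi-definite
    return np.log(eigvals[-1] / eigvals[0])

# One ELM realization on noisy sinc-like data (placeholder setup for illustration).
rng = np.random.default_rng(5)
X = rng.uniform(-4, 4, size=(5000, 1))
W = rng.normal(scale=1.0, size=(20, 1))   # the RNG variance is the free parameter under study
b = rng.normal(scale=1.0, size=20)
H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # 20 sigmoid hidden PEs
print("rho =", eigenvalue_spread(H))
```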

Figure 6 (Top Panel). Histograms of the eigenvalue ratio of the QKLMS and the ELM for the sinc data set, when the ELM uses Gaussian PEs and a Gaussian random number generator with zero mean and different variances to automatically construct the hidden layer. As we can see, the eigenvalue ratio improves (gets smaller) when the variance of the RNG increases, but the dispersion across realizations gets larger. For the QKLMS, the eigenvalue ratio across the data is very small, which is expected for a proper kernel size.

Figure 6 (Bottom Panel). Histograms of the eigenvalue ratio for the sinc function regression. The only difference is that the random distribution is uniform. As we can see, the same basic trends are present.


The picture for the ELM is vastly different. The uncertainty in the mapping created by the RNG shows up in the performance difference amongst realizations, which is seldom reported (usually only the performance with a given, very likely the best, set of parameters is tested). When the random number generator variance is small, the eigenvalue spread of the matrix created by the projection bases across the data set is very high, which means that the bases are highly correlated, because the Gaussian PEs are clustered (or, for sigmoid PEs, are working in their linear regions) and so do not provide good coverage of the joint space. This is similar to the ESN problem of the richness of the reservoir. Moreover, when the variance of the RNG is increased, the coverage of the space improves, and so does the condition number of the basis vector matrix, which is best for a variance of 4. The distribution is skewed towards the lower eigenvalue ratios, which is good, but it is still worse than that of the QKLMS bases. Another disturbing aspect is that the variance across the Monte Carlo runs also increases with the RNG variance, i.e. the differences in quality amongst the different basis sets become less and less manageable, so an increasing number of trials needs to be run until a solution with a small eigenvalue spread is found. We then change the RNG to a uniform distribution over the intervals [-1,1], [-2,2], [-4,4], and [-10,10] (Fig. 6, bottom panel). The picture is very similar to the Gaussian RNG case, with basically the same results and trends. We should mention that the uniform number generator produces less dispersion amongst the trials, as can be expected because the dynamic range of the Gaussian distribution is unbounded. Fig. 7 shows the results of a simulation for the mean square error using the best ELM case, i.e. the uniform RNG over [-10,10] with least squares. As we can expect from Fig. 6, the MSE of the ELM reflects the uncertainty in the bases, i.e. there is a large discrepancy amongst the test set MSEs across realizations. The table shows that the best ELM and QKLMS solutions are comparable, but that the mean testing MSE for the QKLMS is much smaller than for the ELM.

Algorithm   Best testing MSE   Average testing MSE
ELM         0.00017028         0.00092244
QKLMS       0.00014414         0.00040414

Figure 7. Histogram of the errors across realizations of QKLMS and ELM, and the best and average results


If more training data are used, the best-case results improve for both methods, but the average results stabilize (not shown). Note that this dispersion of performance for the ELM is obtained with the same parameters for the RNG and is due to the random generation of the input layer weights. Even if the RNG parameters are reported in a publication, the best results will be very difficult to replicate, unless the weights themselves are all listed. The QKLMS results are more concentrated, in spite of the stochastic gradient and of the input being a random variable, as in this case.

Conclusions

This paper raises, and provides answers for, an important question in learning machines. For many years, the ability to design universal learning machines directly from data was thought to require the solution of a non-convex optimization problem, as illustrated by the multilayer perceptron trained with backpropagation. Recent work shows that this is no longer the case. We have summarized three classes of CULMs: the reservoir computing architectures (ESN and LSM), the extreme learning machines (ELM), and the kernel adaptive filters (KAF) (specifically the KLMS), which are universal and can all be adapted with linear-in-the-parameters optimization algorithms. This is something that our community should be aware of and very proud of, because it was unthinkable a few years ago. The three solutions have the great appeal that training is much faster than for MLPs using backpropagation; therefore, they are particularly important for computational intelligence solutions in the big data era. In Table 1 we show that the computational complexity is linear in the number of data samples. In practice, however, they are quite different. The ESN and ELM use random projection spaces, while the KLMS uses data centered functional bases. The ESN and ELM still suffer from design choices, translated into free parameters, which are difficult to set optimally with the current mathematical framework, so in practice they involve many trials and cross validation to find a good projection space, on top of the selection of the number of hidden PEs and the nonlinear functions. On the other hand, the KLMS algorithm answers this question easily: just map the data nonlinearly and deterministically to a Hilbert space, use the Hilbert space functions selected by the input data as the bases (the representer theorem), and adapt the projection online. The KAF and its data centered basis functions have the great advantage of concentrating the bases on the part of the functional space where the input data exist. Therefore, KAFs are more parsimonious and can span the cloud of points in the joint space with fewer bases, at least for some applications (e.g. for prediction, where the targets and inputs come from the same cloud). Moreover, we can expect the eigenvalue spread of the basis vector matrix to be reasonable, because there is at least one data sample per basis. For the Gaussian kernel, the basis functions are local and can be controlled by a single parameter, the kernel bandwidth (kernel size). In the Hilbert space, the generalization is mainly controlled by two parameters. Smaller kernel sizes decrease the correlation between the basis functions, so in practice the fundamental compromise of correlated bases is easy to control in KLMS without affecting the computational complexity. The second parameter is the stepsize, which controls the weight norm; a small stepsize also improves the generalization [33]. However, a decrease in the learning rate achieves a better solution at the expense of using more samples to converge. To sparsify the bases, a third parameter is needed to control the quantization radius. The ESN is the only dynamic network of the three, but it turns out that the KAF has recently been extended to recurrent structures [48]. The KAFs have mostly been applied in adaptive filtering (regression for static data), but recent results show that their characteristics for classification are also very competitive with those of ELMs. Therefore, KAFs are also ready to be applied by the engineering and machine learning communities.

Table 1. Comparison of CULMs

                                     ESN            ELM            QKLMS
Dynamic (D) / Static (S) mapping     D              S              S
Random bases / cross validation      Yes            Yes            No
Complexity for N samples / M bases   O(NM + M^3)    O(NM + M^3)    O(NM)
Free parameters:
  ESN:   distribution, RNG, order, nonlinearity, spectral radius, sparseness
  ELM:   distribution, RNG, order, nonlinearity
  QKLMS: kernel, kernel size, step-size, quantization radius

The current literature on ELMs fails to acknowledge the dependency of the results on the careful offline selection of the basis functions. Therefore, the reported results should be interpreted as best case. Moreover, the computation time spent finding the best basis functions, which can be considerable, should also be included in the ELM training time. Theoretically, three important questions remain in the ELM approach. How good is the best basis set? How does one find it reliably without cross validation? And is there a strategy to help select the size of the space and the RNG parameters for the problem? Here it is proper to recall Chen's framework [44] of incrementally adding random PEs after successive linearized projections to the original problem, which avoids selecting all the weights randomly and should decrease the correlation amongst the basis functions. In terms of the KAF, the randomness in KLMS appears because of the stochastic gradient and can only be eliminated by the heavy computation of the KRLS. So ways to decrease the rattling in the KLMS for optimal performance, without increasing the computation, should be investigated, inspired by the ESN/ELM out-of-the-box solution. Another issue worth studying is the role of the kernel size in regression and classification. Most of the work borrows ideas from density estimation to select the kernel size, while in CULMs we are interested in spanning the joint space of inputs and targets, which is a different problem that may benefit from a different approach and also benefits the ELM [45]. Multi-kernels [46] or kernel adaptation [47] are possible solutions, but further research is necessary. Finally, it would be interesting to investigate how to create other members of the CULM class, for instance by selecting a deterministic basis set. When Broomhead introduced the RBFs, he mentioned briefly that a tessellation that decreases the correlation amongst the bases of the space should be pursued [34], but as far as these authors know, it was never investigated systematically. Our concern is that this approach alleviates the randomness of the ELM bases, but may still be wasteful, because the corresponding Voronoi tessellation, or the orthogonal span for ridge functions, will cover the full space, while the data will very likely exist in a sub-manifold of the full space. This will inevitably create very small eigenvalues along some dimensions. Therefore, the optimal design of a deterministic projection basis for a data set is still an open problem, and the ELM performance shows that it may be a very productive research direction. These issues will keep the community busy for many years to come.

Acknowledgements

The authors would like to thank Mr. Ren Wang, a student at Xi'an Jiaotong University, for performing the simulations. This work was supported in part by the 973 program (no. 2015CB351703) and the National Natural Science Foundation of China (no. 61372152).

References

1. J. Principe, N. Euliano, and C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulations, John Wiley, 2000.

2. C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, 1995.

3. D. Hilbert, "Mathematische Probleme", Nachr. Akad. Wiss. Gottingen, pp. 290-329, 1900.

4. A. N. Kolmogorov, "On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition", Dokl. Akad. Nauk SSSR, vol. 114, pp. 953-956, 1957.

5. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books, 1962.

6. M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.

7. N. Nilsson, The Mathematical Foundations of Learning Machines, Morgan Kaufmann Publishers, San Francisco, 1965.

8. F. Girosi and T. Poggio, "Networks for learning: a view from the theory of approximation of functions", in Neural Networks: Concepts, Applications, and Implementations, P. Antognetti and V. Milutinovic, Eds., vol. I, no. 6, pp. 110-155, Prentice Hall, Englewood Cliffs, New Jersey, 1991.

9. G. Cybenko, "Approximation by superposition of a sigmoidal function", Math. Control Systems Signals, vol. 2, no. 4, pp. 303-314, 1989.

10. T. M. Cover, "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition", IEEE Trans. on Electronic Computers, vol. EC-14, pp. 326-334, 1965.

11. A. N. Kolmogorov, "Stationary sequences in Hilbert space" (in Russian), Bull. Moscow Univ., vol. 2, no. 6, pp. 1-40, 1941. English translation in Linear Least Squares Estimation, T. Kailath, Ed., Dowden, Hutchinson & Ross, 1977.


12. P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

13. D. Rumelhart, G. Hinton, and R. Williams, "Learning internal representations by error propagation", in Parallel Distributed Processing, chapter 8, pp. 318-362, MIT Press, Cambridge, MA, 1986.

14. Y. LeCun, "Une procédure d'apprentissage pour réseau à seuil asymétrique (A Learning Scheme for Asymmetric Threshold Networks)", Proceedings of Cognitiva, vol. 85, pp. 599-604, Paris, France, 1985.

15. G. Orr and K.-R. Muller, Eds., Neural Networks: Tricks of the Trade, Springer, 1998.

16. S. Haykin, Neural Networks: A Comprehensive Foundation, McMillan, 1995.

17. X. Yao, "Evolutionary artificial neural networks", International Journal of Neural Systems, vol. 4, no. 3, pp. 203-222, 1993.

18. B. Igelnik and Y.-H. Pao, "Stochastic Choice of Basis Functions in Adaptive Function Approximation and the Functional-Link Net", IEEE Trans. Neural Netw., vol. 6, no. 6, pp. 1320-1329, Nov. 1995.

19. N. Wiener, Nonlinear Problems in Random Theory, MIT Press & Wiley, 1958.

20. S.-K. Oh, W. Pedrycz, and B.-J. Park, "Polynomial neural networks architecture: analysis and design", Computers & Electrical Engineering, vol. 29, no. 6, pp. 703-725, August 2003.

21. V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1999.

22. H. Jaeger, "The echo state approach to analyzing and training recurrent neural networks", Technical Report GMD Report 148, German National Research Center for Information Technology, 2001.

23. W. Maass, "Liquid state machines: Motivation, theory, and applications", in Computability in Context: Computation and Logic in the Real World, B. Cooper and A. Sorbi, Eds., pp. 275-296, Imperial College Press, 2010.

24. H. Jaeger, W. Maass, and J. Principe, “Special issue on echo state networks and liquid state machines”, Neural Networks, vol. 20, no. 3, pp.287-289, 2007.

25. I. Sandberg and L. Xu, “Uniform approximation of multidimensional myopic maps”, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 44, no.6, pp.477–500, 1997.

26. M. Ozturk, D. Xu, and J. Principe, “Analysis and Design of Echo State Networks for Function Approximation” Neural Computation, vol. 19, no. 1, pp. 111-138, 2007.

27. G-B. Huang, Q-Y. Zhu, and C-K. Siew, “Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks”, International Joint Conference on Neural Networks, vol. 2, pp. 985-990, 2004.

28. T. Tao and V. Vu, "Random matrices: Universal properties of eigenvectors", Random Matrices: Theory Appl., vol. 01, 2012.

29. G-B. Huang, Q-Y. Zhu, and C-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, pp.489-501, 2006.

30. G-B. Huang, “An Insight into Extreme Learning Machines: Random Neurons, Random Features and Kernels”, Cognitive Computation, vol. 6, pp. 376-390, 2014.


31. G.-B. Huang, S. Song, and K. You, "Trends in Extreme Learning Machines: A Review", Neural Networks, vol. 61, pp. 32-48, 2015.

32. W. Liu, J. Principe, and S. Haykin, Kernel Adaptive Filtering: a Comprehensive Introduction, John Wiley, 2010.

33. W. Liu, P. Pokarel, and J. Principe, “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, vol. 56, no. 2, pp.543 - 554, 2008.

34. D. Broomhead and D. Lowe, “Multivariable functional interpolation and adaptive networks”, Complex Systems, vol. 2, pp. 321-355, 1988.

35. N. Aronszajn, "Theory of Reproducing Kernels". Transactions of the American Mathematical Society, vol. 68,no. 3, pp. 337–404, 1950.

36. B. Widrow, Adaptive Signal Processing, Prentice Hall, 1985.

37. G. Wahba, Spline Models for Observational Data, SIAM, 1990.

38. B. Chen, S. Zhao, P. Zhu, and J. Príncipe, "Quantized Kernel Least Mean Square Algorithm", IEEE Trans. Neural Netw. Learning Syst., vol. 23, no. 1, pp. 22-32, 2012.

39. S. Zhao, B. Chen, P. Zhu, and J. Principe, "Fixed Budget Quantized Kernel Least-Mean-Square Algorithm", Signal Processing, vol. 93, no. 9, pp. 2759-2770, Sep. 2013.

40. B. Chen, S. Zhao, P. Zhu, and J. Príncipe, “Quantized Kernel Recursive Least Squares Algorithm”, IEEE Trans. Neural Netw. Learning Syst. Vol.24, no. 9, pp.1484-1491,2013.

41. B. Chen, S. Zhao, P. Zhu and J. Príncipe, “Mean square convergence analysis for kernel least mean square algorithm”, Signal Processing, vol. 92, no.11,pp. 2624-2632, 2012.

42. W. Liu, I. Park, Y. Wang, and J. Principe, "Extended Kernel Recursive Least Squares Algorithm", IEEE Trans. Signal Proc., vol. 57, no. 10, pp. 3801-3814, 2009.

43. I. Park, S. Seth, M. Rao and J. Principe, “Strictly positive definite kernels for point process divergences,” Neural Computation, vol. 24,no. 8, pp. 2223-2250, Aug. 2012.

44. C. Chen, “A rapid supervised learning neural network for function interpolation and approximation,” IEEE Transactions on Neural Networks, vol. 7, no. 5, pp. 1220–1230, 1996.

45. G-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme Learning Machine for Regression and Multiclass Classification,” IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 42, no. 2, pp. 513-529, 2012

46. R. Pokharel, S. Seth and J. Principe, “Quantized Mixture Kernel Least Mean Square”, in Proc. IEEE WCCI 2014, Beijing, China

47. B. Chen, J. Liang, N. Zheng and J. Principe, “Kernel Least Mean Square with Adaptive Kernel Size,” ArXiv preprint arXiv: 1401.5899, 2014.

48. P. Zhu, B. Chen and J. Principe, “Learning Nonlinear Generative Models of Time Series with a Kalman Filter in RKHS”, IEEE Trans. Signal Proc., vol. 62, no. 1, pp. 141-155, 2014.