
Periodic Kernel Approximation by Index Set Fourier Series Features

Anthony Tompkins
School of Computer Science
The University of Sydney
[email protected]

Fabio Ramos
School of Computer Science
NVIDIA USA, The University of Sydney
[email protected]

Abstract

Periodicity is often studied in timeseries modelling with autoregressive methods but is less popular in the kernel literature, particularly for multi-dimensional problems such as in textures, crystallography, quantum mechanics, and robotics. Large datasets often make modelling periodicity untenable for otherwise powerful non-parametric methods like Gaussian Processes (GPs), which typically incur an O(N^3) computational cost, while approximate feature methods are impeded by their approximation accuracy. We introduce a method that efficiently decomposes multi-dimensional periodic kernels into a set of basis functions by exploiting multivariate Fourier series. Termed Index Set Fourier Series Features (ISFSF), we show that our approximation produces significantly less generalisation error than alternative approximations, such as those based on random and deterministic Fourier features, on regression problems with periodic data.

1 INTRODUCTION

The phenomenon of periodicity permeates a vast number of natural and artificial processes [9, 16, 4]. However, it is rare to come across its enquiry in machine learning, and more specifically with regard to periodic kernels, kernel methods on manifolds, and their feature-space approximations. Almost all existing work focuses on non-parametric full-kernel methods. Although non-parametric methods [45] are exceptionally flexible methods for statistical modelling, they inherently lack scalability. A particular non-parametric method using kernel functions is the GP [36].

However, the inability to truly scale GP inference to large datasets is a major limitation of such methods.

While there have been various efforts to approximate GPs with lower rank solutions based on inducing points [40, 17], these methods are still constrained by their data-dependence. Inspired by the applicability of feature-space kernels for scalable GP regression [23], we note the real-world significance of effective periodic kernel representations for tasks such as texture in-painting [48], predictive representations of infinitely periodic crystal lattices and machine-learning aided discovery of materials [32, 37, 11], and in robotics for problems involving periodic systems [30, 10]. Scalable multivariate periodic kernel approximations, unaddressed in the literature, motivate the main contribution of this paper.

Specifically, in our contributions we provide:

• Fourier series approximations of multivariate stationary periodic kernels with an efficient sparse construction; and

• a general bound for the cardinality of the resulting full and sparse feature sets, as well as an upper bound on the truncation error for the multivariate feature approximation.

We compare in detail the proposed method against recent state-of-the-art kernel approximations in terms of both the kernel approximation and, more importantly, predictive generalisation error. Empirical results on real datasets and robot simulations further demonstrate that deterministic index set based features provide significantly improved convergence and generalisation properties by reducing both the data samples and the number of features required for equivalently accurate predictions.

2 RELATED WORK

While much work has been done for data-independent kernel approximations such as RFFs, as opposed to Nyström [46, 14], there is limited work on such approximations of


periodic kernels. Two recent works [41, 42] explore approximations for periodic kernels in univariate timeseries where some response varies periodically with respect to time. However, it is not clear how to tractably generalise such decompositions into multiple dimensions where the response varies periodically as a function of multiple inputs.

The work of [33, 34], termed Random Fourier Features (RFFs), is the idea of explicit data-independent feature maps using ideas from harmonic analysis and sketching theory. By approximating kernels, these maps allow scalable inference with simple linear models. Various approximations to different kernels have followed and include polynomial kernels [31, 29], dot-product kernels [21], histogram and γ-homogenous kernels [25, 44], approximations based upon the so-called triple-spin [6, 24, 50, 12], and operator-valued kernels [5].

A recent work of note, quasi-Monte Carlo features (QMC) [49], uses deterministic sequences on the hypercube to approximate shift invariant kernels. In [42], the key idea is that periodic kernels can be harmonically decomposed in a deterministic manner using Fourier series decompositions. However, while tractable in the univariate case, it is not immediately extensible to multiple dimensions due to exponential complexity in the number of approximating coefficients.

Connected to our index set based features are quadrature rule based kernel approximations. Often based on grid-based solutions, these similarly have exponential dependencies on the input dimension [8, 28, 27], which are countered to some extent via other assumptions such as additivity [28]. Also, works on quadrature are explored only for very specific families of kernels (sub-Gaussian, Gaussian) and explore numerical optimizations therein (e.g. the butterfly algorithm, structured matrices). Our work is further distinct in that we explore the use of the feature set for parametric Bayesian modelling (regression) in linear feature space as opposed to the non-parametric form in kernel space. ISFSF are specifically constructed for the space of periodic kernels in the sense of the data lying on some manifold, rather than requiring an explicit warping of the input data using standard aperiodic kernels; this has not been investigated previously in the multivariate case. ISFSF can naturally be seen as using an unweighted quadrature scheme, akin to MC and QMC Fourier Features. Thus, our sparse feature construction would directly benefit from quadrature rules applied to periodic spaces [19, 22, 18]; however this is not the focus of our study in this paper and deserves future independent investigation.

We compare our body of work with the highly performing Halton and generalised Halton sequences [35]. We also stress that our method is inherently different from

methods such as the Spectral Mixture kernel [47] which operate in the full kernel space. Our goal is to represent kernels for inference in a fashion similar to Sparse Spectrum Gaussian Processes [23], which makes them further amenable to Bayesian inference in streaming domains under hard computational constraints such as in robotics [13]. Specifically, we may perform inference in O(NM^2) time for M features where M ≪ N.

3 PRELIMINARIES

Notation. Let R represent the set of real numbers, Z the set of integers, Z_0^+ the set of non-negative integers, and N^+ the set of positive integers. For any arbitrary set Y ≠ ∅, let Y^D be its Cartesian product Y × ... × Y repeated D times, where D ≥ 1, D ∈ N. Let T^D := [a, b]^D represent the D-dimensional torus, or the circle when D = 1. Throughout this paper, D represents the spatial dimension and R ∈ Z_0^+ is the maximum refinement. The refinement may be interpreted as the multidimensional set of integers that support the fundamental frequency in the Fourier series expansion for each dimension. In the next section we introduce univariate Fourier series with an illustrative example for deriving an expansion for a univariate periodic kernel.

3.1 UNIVARIATE FOURIER SERIES FOR KERNELS

We demonstrate first how one constructs a stationary periodic kernel with its corresponding Fourier series decomposition. This step is crucial because it is required to obtain Fourier series coefficients corresponding to the kernel being approximated. It is possible to construct a periodic kernel from any stationary kernel by applying the warping u(x) = [cos(x), sin(x)] to data x and then passing the result into any standard stationary kernel [26]. By applying the warping to a stationary kernel with the general squared distance metric $\|x - x'\|^2$ and replacing x with u(x) we have:

$$\|u(x) - u(x')\|^2 = (\sin(x) - \sin(x'))^2 + (\cos(x) - \cos(x'))^2 = 2(1 - \cos(x - x')). \quad (1)$$

Example. Consider the well known Squared Exponential (SE) kernel [36] $\kappa_{\mathrm{SE}}(x - x') = \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)$ with lengthscale l. After performing (1) we recover the periodic SE kernel:

$$\kappa_{\mathrm{perSE}}(x, x') = \exp\left(\frac{\cos(\omega\tau) - 1}{l^2}\right),$$

where τ = x − x′ and ω is the fundamental periodic frequency.
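The warping construction can be checked numerically. The following minimal NumPy sketch (our own illustration, not code from the paper) verifies that feeding the warped squared distance of (1) into the SE kernel reproduces the analytic periodic SE kernel above; the particular values of x, x' and l are arbitrary.

```python
import numpy as np

def se_kernel(sq_dist, lengthscale):
    """Squared Exponential kernel evaluated on a squared distance."""
    return np.exp(-sq_dist / (2.0 * lengthscale**2))

def periodic_se_kernel(tau, lengthscale):
    """Analytic periodic SE kernel: exp((cos(tau) - 1) / l^2), with omega = 1."""
    return np.exp((np.cos(tau) - 1.0) / lengthscale**2)

# Warping u(x) = [cos(x), sin(x)] and the identity ||u(x) - u(x')||^2 = 2(1 - cos(x - x')).
u = lambda x: np.array([np.cos(x), np.sin(x)])
x, x_prime, l = 0.3, 1.7, 0.8
sq_dist_warped = np.sum((u(x) - u(x_prime))**2)

assert np.isclose(se_kernel(sq_dist_warped, l), periodic_se_kernel(x - x_prime, l))
```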

Firstly, κ_perSE is both periodic and symmetric over τ. Since it is periodic, the kernel can be represented as a Fourier series over the interval [−L, L], where L is the half period and ω = π/L is the fundamental frequency. Note the Fourier series representation of some function:

$$f(t) \approx \mathcal{F}_k[f(t)] = \sum_{k=-\infty}^{\infty} c_k e^{ik\omega t}, \quad (2)$$

with coefficients

$$c_0 = \frac{1}{2L}\int_{-L}^{L} f(t)\,dt, \quad (3)$$

$$c_k = \frac{1}{2L}\int_{-L}^{L} f(t)\, e^{-ik\omega t}\,dt, \quad \forall k \in \mathbb{N}^+. \quad (4)$$

This reduces to a series of cosines from (2) for even functions, such as stationary periodic kernels. The Fourier series is defined at integer multiples k of the fundamental periodic frequency ω where k ∈ N^+. To find the kth coefficient c_k, we evaluate the integral:

$$c_k = \frac{1}{2L}\int_{-L}^{L} e^{l^{-2}(\cos(\omega\tau) - 1)}\, e^{-ik\omega\tau}\,d\tau = \frac{e^{-l^{-2}}}{2L}\int_{-L}^{L} e^{l^{-2}\cos(\omega\tau)}\cos(k\omega\tau)\,d\tau = \frac{I_k(l^{-2})}{e^{l^{-2}}}, \quad (5)$$

using the substitution ω = π/L with L = π, where $I_n(z)$ is the modified Bessel function of the first kind of integer order n and argument z. We obtain the solution using the special function identity $I_n(z) = \frac{1}{\pi}\int_0^{\pi} e^{z\cos(\theta)}\cos(n\theta)\,d\theta$ [1], which collapses the integral.

We now have a representation of the kernel as an infinite Fourier series κ(τ) ≈ F_k[κ(τ)]:

$$\kappa_{\mathrm{perSE}}(\tau) = \mathcal{F}_k[\kappa(\tau)] = \sum_{k=-\infty}^{\infty} \frac{I_k(l^{-2})}{\exp(l^{-2})}\cos(k\omega\tau). \quad (6)$$

Thus, for the periodic SE kernel, we have Fourier series feature coefficients $q_k^2$,

$$q_k^2 = \begin{cases} \dfrac{I_k(l^{-2})}{\exp(l^{-2})} & \text{if } k = 0, \\[4pt] \dfrac{2 I_k(l^{-2})}{\exp(l^{-2})} & \text{if } k = 1, 2, \ldots, K, \end{cases} \quad (7)$$

where K is the truncation factor of the Fourier series and I is the modified Bessel function of the first kind. These coefficients $q_k^2$ are used on a per-dimension basis for the multivariate feature construction.
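As a concrete illustration (ours, not the paper's code), the coefficients (7) and the truncated series (6) can be evaluated with SciPy's modified Bessel function of the first kind, scipy.special.iv:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind, I_n(z)

def periodic_se_coefficients(K, lengthscale):
    """Fourier series coefficients q_k^2, k = 0..K, of the periodic SE kernel, eq. (7)."""
    z = lengthscale**-2
    q2 = iv(np.arange(K + 1), z) / np.exp(z)
    q2[1:] *= 2.0  # terms with k >= 1 appear twice in the two-sided series
    return q2

def truncated_periodic_se(tau, K, lengthscale, omega=1.0):
    """Truncated cosine series (6) of the periodic SE kernel."""
    k = np.arange(K + 1)
    return np.sum(periodic_se_coefficients(K, lengthscale) * np.cos(k * omega * tau))

tau, l = 0.4, 0.7
exact = np.exp((np.cos(tau) - 1.0) / l**2)
print(abs(truncated_periodic_se(tau, K=20, lengthscale=l) - exact))  # small truncation error
```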

3.2 FOURIER SERIES IN MULTIPLE DIMENSIONS

Our goal is to represent multi-dimensional periodic kernels. In the space of the full kernel, such a composition can be represented as a product of D independent periodic kernels on each dimension, since it is known that product compositions in the space of the kernel have an equivalent Cartesian product operation in the feature space [39]. We have various results from harmonic theory on weighted subspaces of the Wiener algebra [3, 20] which allow us to use sparse sampling grids (i.e. index sets) to efficiently represent multivariate periodic kernels. That is to say, if we have functions with Fourier series coefficients that decay sufficiently fast, we can obtain sufficiently accurate approximations with vastly fewer terms.

Consider a sufficiently smooth multivariate periodic function f : T^D → C, D ∈ N^+. A function f can formally be represented as its multivariate Fourier series:

$$f(\mathbf{x}) = \sum_{\mathbf{k} \in \mathbb{Z}^D} f_{\mathbf{k}}\, e^{2\pi j \mathbf{k}\cdot\mathbf{x}}, \quad (8)$$

with its Fourier series coefficients $f_{\mathbf{k}}$, k ∈ Z^D, defined as $f_{\mathbf{k}} := \int_{\mathbb{T}^D} f(\mathbf{x})\, e^{-2\pi j \mathbf{k}\cdot\mathbf{x}}\,d\mathbf{x}$. In essence this results in a tensor product of univariate Fourier series. Let Π_I denote the space of all multivariate trigonometric polynomials supported on some arbitrary index set I ⊂ Z^D, which is a finite set of integer vectors. We denote the cardinality of the set I as |I|. Practically, we are interested in a good approximating Fourier partial sum S_I f ∈ Π_I supported on some suitable index set I. One may think of index sets as a multi-dimensional indicator variable. In order to construct such an index set we must have a construction rule or weight function w which tells us which index set coordinates to discard.

More formally, we define weighted function spaces of the Wiener algebra: $\mathcal{A}_w(\mathbb{T}^D) := \{f \in L_1(\mathbb{T}^D) : f(\mathbf{x}) = \sum_{\mathbf{k}\in\mathbb{Z}^D} f_{\mathbf{k}}\, e^{2\pi j\mathbf{k}\cdot\mathbf{x}},\ \sum_{\mathbf{k}\in\mathbb{Z}^D} w(\mathbf{k})\,|f_{\mathbf{k}}| < \infty\}$, where we assume the function f is in the function space $L_p(\mathbb{T}^D) := \{f : \mathbb{T}^D \to \mathbb{C} : \int_{\mathbb{T}^D} |f(\mathbf{x})|^p\,d\mathbf{x} < \infty\}$ with 1 ≤ p ≤ ∞, and we define a weight function w : Z^D → [1, ∞], which characterises the decay of the Fourier coefficients $f_{\mathbf{k}}$ of all functions f ∈ A_w(T^D), such that $f_{\mathbf{k}}$ decreases faster than the weight function in terms of k. That is to say that the decay of the coefficients defines the smoothness of the function f we are approximating with a partial sum. In the next section we will introduce explicit index sets with their corresponding weight functions.

3.3 INDEX SETS

Figure 1: Visualisation of two common instances of the LPB index set in D = 2. The left image depicts a sparser index set while the right image depicts a dense tensor index set. Each solid square represents an index set coordinate I_i = [r_1, r_2] for integer refinement r.

While a naive multivariate Fourier series expansion of a univariate Fourier series appears plausible, for explicit tensor products the cardinality grows exponentially fast in the dimension D and is therefore computationally intractable in terms of the cardinality of the supporting index set. This computational burden is amplified when we consider the expanded representation required for the separable feature decomposition, when the products of harmonic terms of the Fourier series themselves must be expanded into sums of cosines. It would thus be desirable to: i) maintain high function approximation accuracy; and ii) minimise the total number of coefficients.

To this end we introduce a variety of D-dimensional weighted index sets I with their formal definitions. At a high level, one may think of an index set as a generalisation of indicator variables for multi-dimensional tensors which mask only the most important frequencies for the multivariate Fourier series decomposition. The reason index sets are useful is because it is often unnecessary to completely expand all supporting integers due to the exponentially decaying coefficients of the function one is approximating. For instance, in Figure 1, we can see two index sets. On the right is the full tensor product index set which is dense in the refinement R, while the left index set has significantly fewer components. If the function we are trying to approximate has coefficients that decay sufficiently fast under the coverage of the sparser index set, we make a significant saving in the number of terms needed to represent the function. There are various explicit index sets [51, 15, 43, 38], and the first set we introduce is the l_p-ball (LPB), 0 < p ≤ ∞, index set $\mathcal{I}^{D,\gamma,p}_{\mathrm{LPB}}(R)$. This is a generalised index set and it is constructed with the following weight function,

$$w^{D,\gamma}_{\mathrm{LPB}}(\mathbf{k}) = \max\big(1, \big\|\mathbf{k} \,\big|\, l^{D,\gamma}_p\big\|\big), \quad \text{for } 0 < p \leq \infty, \quad (9)$$

with construction parameter $\gamma = (\gamma_d)_{d=1}^{\infty} \in [0,1]^{\mathbb{N}^+}$ controlling the approximation depth for a given dimension d, and where

$$\big\|\mathbf{k} \,\big|\, l^{D,\gamma}_p\big\| = \begin{cases} \left( \sum_{d=1}^{D} \left( \gamma_d^{-1} |k_d| \right)^p \right)^{1/p} & \text{for } 0 < p < \infty, \\[4pt] \max_{d=1,\ldots,D} \gamma_d^{-1} |k_d| & \text{for } p = \infty. \end{cases} \quad (10)$$

We also have the Energy Norm Hyperbolic Cross (ENHC) index set $\mathcal{I}^{D,\gamma,\zeta}_{\mathrm{ENHC}}(R)$ with sparsity parameter ζ ∈ [0, 1). It has weight function

$$w^{D,\gamma,\zeta}_{\mathrm{ENHC}}(\mathbf{k}) = \max(1, \|\mathbf{k}\|_1)^{\frac{\zeta}{\zeta - 1}} \prod_{d=1}^{D} \max\big(1, \gamma_d^{-1}|k_d|\big)^{\frac{1}{1-\zeta}}. \quad (11)$$

The ENHC is more suitable for approximating functions of dominating mixed smoothness.
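To make the index set construction concrete, the sketch below (our own illustration) enumerates an index set by thresholding a weight function at the refinement, assuming the common construction I(R) = {k ∈ Z^D : w(k) ≤ R}; the weight shown is the LPB weight of (9)-(10). The γ values and the candidate box [−R, R]^D are illustrative choices.

```python
import itertools
import numpy as np

def lpb_weight(k, gamma, p=1.0):
    """l_p-ball weight of eqs. (9)-(10): max(1, ||(gamma_d^{-1} |k_d|)_d||_p)."""
    scaled = np.abs(np.asarray(k, dtype=float)) / np.asarray(gamma, dtype=float)
    norm = np.max(scaled) if np.isinf(p) else np.sum(scaled**p) ** (1.0 / p)
    return max(1.0, norm)

def build_index_set(weight_fn, R, D):
    """Assumed construction: I(R) = {k in Z^D : weight_fn(k) <= R}, enumerated over [-R, R]^D."""
    candidates = itertools.product(range(-R, R + 1), repeat=D)
    return [k for k in candidates if weight_fn(k) <= R]

gamma = [1.0, 1.0]
sparse = build_index_set(lambda k: lpb_weight(k, gamma, p=1.0), R=4, D=2)    # l_1 "cross"-like set
dense = build_index_set(lambda k: lpb_weight(k, gamma, p=np.inf), R=4, D=2)  # full tensor grid
print(len(sparse), len(dense))  # 41 vs. 81: the sparser set needs far fewer coefficients
```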

4 INDEX SET FOURIER SERIES FEATURES

The goal of our work is to show how multivariate Fourier series representations of kernels with sparse approximation lattices allow efficient and deterministic feature decompositions of multivariate periodic kernels. Formally, we define the shift invariant multivariate periodic kernel approximation as a Fourier series expansion supported on an arbitrary index set I:

$$\kappa_{\mathrm{per}}(\mathbf{x}, \mathbf{x}') \approx \sum_{\mathbf{k}\in\mathcal{I}} f_{\mathbf{k}}\, e^{2\pi j \mathbf{k}\cdot(\mathbf{x} - \mathbf{x}')} = \langle \Phi(\mathbf{x}), \Phi(\mathbf{x}') \rangle_{\mathbb{C}^M}, \quad (12)$$

for some explicit feature map Φ and multivariate Fourier series coefficients $f_{\mathbf{k}}$.

We now continue with our feature construction, which we term Index Set Fourier Series Features (ISFSF), and introduce an additionally sparse construction that reduces the feature count with no loss of accuracy.

4.1 ISFSF FEATURE CONSTRUCTION

This section presents our main contribution for approximating multi-dimensional periodic kernels. We have seen that simply using multivariate Fourier series is not sufficient for tractable decomposition due to an exponential tensor product in the refinement level. Using index sets for multivariate Fourier series, we present a feature construction using frequency grids, noting that the resulting feature admits an additionally sparse construction.

We can write the general form of the product expansion for any particular ith index set coordinate I_i from some index set I as:

$$\varrho(\mathcal{I}_i) = \prod_{d=1}^{D} q^2_{r_d} \cos\big(r_d \omega_d (x_d - x'_d)\big), \quad (13)$$


Algorithm 1: ISFSF feature construction

Input: I ⊂ Z^D frequency index set, |I| < ∞;
       C = |I| set cardinality (Lemma 1, 2);
       J = number of rows in Ξ;
       x ∈ R^D raw data to embed into features;
       Ξ ∈ R^{J×D} Cartesian product sign matrix;
       ω ∈ R^D fundamental frequencies.
Initialize: Φ_I ∈ R^{2CJ}
for i := 1, ..., C do
    set I_i as the ith set coordinate
    ρ_i = ∏_{d=1}^{D} q^2_{r_d}
    for j := 1, ..., J do
        set Ξ_j as the jth row of Ξ
        r_i = [r_1, r_2, ..., r_D]
        prod = (r_i ⊙ ω ⊙ Ξ_j) x^T
        append(Φ_I, √(ρ_i / J) [cos(prod), sin(prod)])
    end
end
Output: Φ_I

where ω_d is the dth dimension's fundamental frequency, I_i is the ith index set coordinate, and r_d is the dth dimension integer refinement for a given I_i. For our feature expansion we are interested in the data-dependent trigonometric term made up of a product of data-dependent cosines. It is these that allow us to decompose the series into a sum of cosines. For this we require the product-of-cosines trigonometric identity cos(u) cos(v) = ½[cos(u − v) + cos(u + v)]. Applying this identity recursively to (13), we obtain the following decomposable form:

$$\varrho(\mathcal{I}_i) = \frac{1}{J} \sum_{j=1}^{J} \rho_i \cos\big((\mathbf{r} \odot \boldsymbol{\omega} \odot \boldsymbol{\Xi}_j)\,\boldsymbol{\Delta}^{T}\big), \quad (14)$$

with

$$\rho_i = \prod_{d=1}^{D} q^2_{r_d}, \quad (15)$$
$$\mathbf{r} = [r_1, r_2, \ldots, r_D], \quad (16)$$
$$\boldsymbol{\omega} = [\omega_1, \omega_2, \ldots, \omega_D], \quad (17)$$
$$\boldsymbol{\Xi} = \big[\, {+1} \frown (+1, -1)^{(D-1)} \,\big], \quad (18)$$
$$\boldsymbol{\Delta} = [x_1 - x'_1,\ x_2 - x'_2,\ \ldots,\ x_D - x'_D], \quad (19)$$

where J = 2^{(D−1)} is the number of rows in Ξ, ρ_i refers to the product of per-dimension Fourier series coefficients corresponding to a given I_i, (+1, −1)^{(D−1)} is the (D − 1)-times Cartesian combination of the ordered integer set (+1, −1), U ⌢ Λ denotes a horizontal concatenation between matrix U of length |Λ| and every element in some ordered set Λ, and ⊙ refers to the element-wise (Hadamard) product. To clarify Ξ, observe how a two dimensional cosine product cos(A) cos(B) expands to cos(A + B) + cos(A − B), giving $\boldsymbol{\Xi} = \begin{bmatrix} +1 & +1 \\ +1 & -1 \end{bmatrix}$. In three dimensions one obtains cos(A + B + C) + cos(A + B − C) + cos(A − B + C) + cos(A − B − C), giving $\boldsymbol{\Xi} = \begin{bmatrix} +1 & +1 & +1 \\ +1 & +1 & -1 \\ +1 & -1 & +1 \\ +1 & -1 & -1 \end{bmatrix}$. Noting equation (12) and using the relation $e^{-j\boldsymbol{\tau}\cdot\boldsymbol{\omega} k} = \cos(\boldsymbol{\omega}\cdot\boldsymbol{\tau}) - j\sin(\boldsymbol{\omega}\cdot\boldsymbol{\tau})$ and the fact that real kernels have no imaginary part, we can exploit the cosine difference-of-angles identity cos(u − v) = cos(u) cos(v) + sin(u) sin(v) to obtain the complete decomposed feature as:

$$\Phi^{\mathrm{full}}_{\mathcal{I}}(\mathbf{x}) = \left[ \sqrt{\tfrac{\rho_i}{J}} \cos\big((\mathbf{r}_i \odot \boldsymbol{\omega} \odot \boldsymbol{\Xi}_j)\,\mathbf{x}^{T}\big),\ \sqrt{\tfrac{\rho_i}{J}} \sin\big((\mathbf{r}_i \odot \boldsymbol{\omega} \odot \boldsymbol{\Xi}_j)\,\mathbf{x}^{T}\big) \right]^{C,J}_{i=1,j=1}, \quad (20)$$

where x = [x_1, x_2, ..., x_D].

The feature construction algorithm is depicted explicitly in Algorithm 1 and consists of two loops that iterate over the index set coordinates I_i and each row of the Cartesian combination sign matrix Ξ. The construction is embarrassingly parallelizable and is straightforward to implement.
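A compact NumPy sketch of Algorithm 1 under our reading of (18)-(20) is given below (ours, not the authors' implementation). It assumes the per-dimension coefficients q²_{r_d} are supplied as an array, for example the periodic SE coefficients from Section 3.1, and that refinements are non-negative so that they can index that array. The inner product ⟨Φ(x), Φ(x')⟩ then approximates κ_per(x, x') as in (12).

```python
import itertools
import numpy as np

def sign_matrix(D):
    """Cartesian-product sign matrix Xi of eq. (18): first column fixed to +1, shape (2^(D-1), D)."""
    rows = itertools.product([1.0, -1.0], repeat=D - 1)
    return np.array([[1.0, *row] for row in rows])

def isfsf_features(x, index_set, q2, omega):
    """Full ISFSF map of eq. (20) for a single input x in R^D.

    x:         (D,) input point
    index_set: iterable of integer refinement vectors I_i = [r_1, ..., r_D]
    q2:        (D, R) array of per-dimension coefficients q^2_{r_d}
    omega:     (D,) fundamental frequencies
    """
    Xi = sign_matrix(len(x))
    J = Xi.shape[0]
    features = []
    for r in index_set:
        rho_i = np.prod([q2[d, rd] for d, rd in enumerate(r)])   # eq. (15)
        amp = np.sqrt(rho_i / J)
        for xi_j in Xi:
            arg = np.dot(np.asarray(r) * omega * xi_j, x)        # (r_i ⊙ ω ⊙ Ξ_j) xᵀ
            features.extend([amp * np.cos(arg), amp * np.sin(arg)])
    return np.array(features)   # at most |I| · 2^D entries, cf. Lemma 1
```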

It is useful to determine the computational budget of the feature map, and for this it is necessary to determine the number of features in the expanded feature map. To do this we can give the cardinality C_I of the resulting decomposed feature map Φ^full_I(x) over index set I.

Lemma 1 (Cardinality of the Index Set Fourier series feature map for some arbitrary index set I(R)). Let C_I = |I(R)| be the cardinality of some given index set of refinement R, and let the dimension D ∈ N^+ be given. Let $C^{\mathrm{full}}_{\Phi} = |\Phi^{\mathrm{full}}_{\mathcal{I}}(\mathbf{x})|$ denote the cardinality of the decomposed feature. The following holds (see supplementary for proof):

$$|\mathcal{I}(R)| \leq |\Phi^{\mathrm{full}}_{\mathcal{I}}(\mathbf{x})| \leq C_{\mathcal{I}}\, 2^D.$$

4.2 SPARSE CONSTRUCTION

Although suitable, the decomposable form for the multivariate Fourier series features can be improved. An ideal feature representation should not just approximate our kernel well but should do so efficiently. For ISFSF, this involves the data-dependent term cos(·), occurrences of which we want to minimise. Scrutinising the product form in (13), the term r_d is an integer r ∈ Z_0^+, which clearly includes the value 0. This means that for all refinement coordinates at the 0th refinement for any dimension we have cos(0·) = 1, and therefore the "feature" is simply 1 times some data-independent coefficient. Furthermore, since any single cos(·) term in the product (13) contributes a multiplier of 2 features before exponentiation due to the trigonometric product identity recursion, we do not want to unnecessarily include features that will simply evaluate to a constant. We now define a masking function κ over some function g(r) with r ∈ Z_0^+:

$$\kappa\big(g(r)\big) = \begin{cases} 1 & \text{for } r = 0, \\ g(r) & \text{otherwise.} \end{cases} \quad (21)$$

This mask acts to identify which redundant harmonic terms to ignore in the feature construction stage. Continuing the decomposition as in the previous section, the sparse feature decomposition is thus:

$$\Phi^{\mathrm{sparse}}_{\mathcal{I}}(\mathbf{x}) = \left[ \sqrt{\tfrac{\rho_i}{J}}\, \kappa\Big(\cos\big((\mathbf{r}_i \odot \boldsymbol{\omega} \odot \boldsymbol{\Xi}_j)\,\mathbf{x}^{T}\big)\Big),\ \sqrt{\tfrac{\rho_i}{J}}\, \kappa\Big(\sin\big((\mathbf{r}_i \odot \boldsymbol{\omega} \odot \boldsymbol{\Xi}_j)\,\mathbf{x}^{T}\big)\Big) \right]^{C,J}_{i=1,j=1}. \quad (22)$$

We now give an improved cardinality for the decomposed sparse ISFSF feature map. We emphasise that this improved feature map cardinality is for exactly the same reconstruction accuracy. To determine the cardinality of the sparse feature map, let

$$\eta(\mathcal{I}_i) = \sum_{d=1}^{D} \big[\, \mathcal{I}_{i,d} \neq 0 \,\big], \quad (23)$$

define a function that counts, for a particular coordinate I_i, the non-zero indexes. Essentially, this counts which cos(·) terms to keep, which always occur at coordinates with per-dimension refinement not equal to 0.

Lemma 2 (Cardinality of the sparse Index Set feature map for an arbitrary index set I(R)). Let C_I = |I(R)| be the cardinality of some given index set of refinement R, and let the dimension D ∈ N^+ be given. Let $C^{\mathrm{sparse}}_{\Phi} = |\Phi^{\mathrm{sparse}}_{\mathcal{I}}(\mathbf{x})|$ then denote the cardinality of the decomposed index set Fourier series feature. The following holds (see supplementary for proof):

$$|\mathcal{I}(R)| \leq |\Phi^{\mathrm{sparse}}_{\mathcal{I}}(\mathbf{x})| = \sum_{i=1}^{|\mathcal{I}|} 2^{\eta(\mathcal{I}_i)} \leq |\Phi^{\mathrm{full}}_{\mathcal{I}}(\mathbf{x})| \leq C_{\mathcal{I}}\, 2^D.$$
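The two cardinalities are easy to compute for a given index set; the following small sketch (our own example with an arbitrary toy index set) contrasts the upper bound of Lemma 1 with the exact sparse count of Lemma 2.

```python
import numpy as np

def full_cardinality(index_set, D):
    """Lemma 1 upper bound: each index set coordinate expands to at most 2^D terms."""
    return len(index_set) * 2**D

def sparse_cardinality(index_set):
    """Lemma 2: exact sparse count, sum_i 2^eta(I_i) with eta = number of non-zero refinements."""
    return sum(2**int(np.count_nonzero(coord)) for coord in index_set)

# Toy 3-dimensional index set with many zero refinements.
index_set = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 0, 0), (1, 1, 0)]
print(full_cardinality(index_set, D=3), sparse_cardinality(index_set))  # 48 vs. 13 features
```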

4.3 MULTIVARIATE TRUNCATION ERROR

To better understand the effect of Fourier series approximations on kernels we can analyse the truncation error as a function of kernel hyperparameters. We know that the univariate truncation error [41] for k ∈ N^+ satisfies $|\cos(\omega k \tau)| \leq 1$ and $\sum_{k=0}^{\infty} q_k^2 = 1$, since the sum of the coefficients converges to 1. We extend this to the multivariate case by considering the tensor index set product expansion. We have $\prod_{d=1}^{D} |\cos(\omega_d r_d \tau_d)| \leq 1$. Since max(κ(τ)) = 1, we obtain the multivariate truncation error:

$$\varepsilon(R, l) = 1 - \prod_{d=1}^{D}\left[ \sum_{r=0}^{R-1} q^2_{r_d} \right], \quad (24)$$

where R is the refinement and l is the kernel lengthscale. $q_{r_d}$ refers to the approximated kernel's Fourier coefficient at refinement index r, with subscript d referring to evaluation on the dth dimension.

Figure 2: Visualisation of the multivariate truncation error, for the periodic SE kernel, for refinements R = [0, 20], dimensions D = {2, 12} and isotropic lengthscales ls = {0.1, 1.0}.

A visualization of the truncation error for different dimensions, for the periodic Squared Exponential kernel, is presented in Figure 2. An intuitive explanation is that, with the same refinement factor (and therefore the same number of features), larger lengthscales are easier to approximate than smaller lengthscales. This can be understood from the frequency domain perspective, in which a kernel with a smaller lengthscale has a larger spectrum of frequencies. Our kernel Gram approximation as well as downstream error corroborate these results: data that require larger lengthscales converge in predictive error faster with fewer features.
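For the periodic SE kernel the truncation error (24) can be evaluated directly from the coefficients (7); the sketch below (our own illustration, repeating the coefficient helper from Section 3.1 for self-containedness) reproduces the qualitative trend of Figure 2 for an isotropic lengthscale.

```python
import numpy as np
from scipy.special import iv

def periodic_se_coefficients(K, lengthscale):
    """q_k^2 for k = 0..K of the periodic SE kernel, eq. (7)."""
    z = lengthscale**-2
    q2 = iv(np.arange(K + 1), z) / np.exp(z)
    q2[1:] *= 2.0
    return q2

def truncation_error(R, lengthscale, D):
    """Multivariate truncation error of eq. (24) with an isotropic lengthscale."""
    partial_sum = np.sum(periodic_se_coefficients(R - 1, lengthscale))  # r = 0..R-1
    return 1.0 - partial_sum**D

for ls in (0.1, 1.0):
    print(ls, [round(truncation_error(R, ls, D=2), 4) for R in (1, 5, 10, 20)])
```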

5 EXPERIMENTS

We evaluate three aspects of our proposed periodic kernel approximation with the ENHC index set. First, we compare our approximation with the full kernel in terms of Gram matrix reconstruction. Second, we compare our features to state of the art approximations on large scale real world textured image data. Next, we perform a comparison of predictive qualities against an analytic periodic function in higher dimensions. Finally, we demonstrate the kernel on predicting a periodic trajectory of various walking robots used commonly in Reinforcement Learning and Control tasks.


Figure 3: Reconstruction error on simulated data with the normalised Frobenius error between Fourier Feature methods (RFF, Hal, GHal) with periodic warpings, and Index Set Fourier Series with various index sets. Row 1: D = 3, ls = {0.5, 1.0, 1.5}; Row 2: D = 9, ls = {1.5, 2.0, 2.5}. ENHC with weighting γ = 2/3 for D = 9.

Figure 4: Reconstruction error on real texture datasets using learned hyperparameters. Note how smaller lengthscales in Pores and Rubber require more features than larger lengthscales in Tread.

5.1 QUALITY OF KERNEL APPROXIMATION

We first analyse the proposed feature in terms of the reconstruction error between a true Gram matrix K, using the analytic periodic kernel, and its approximated Gram matrix $\tilde{K}_{i,j} = \tilde{\kappa}(x_i, x_j)$. For all comparisons the metric we use is the normalized Frobenius error $\|K - \tilde{K}\|_F / \|K\|_F$, using N = 4000 uniformly drawn samples from [−2, 2]. The primary comparison in Figure 3 compares the effects of various index set constructions, RFFs, QMC (Halton, Generalised Halton), and the following index sets: Energy Norm Hyperbolic Cross (ENHC), Total Order (TOT), Hyperbolic (HYP), and Euclidean (EUC). The supplementary contains an extended comparison of index set parameters, dimensionality, and nuances of the reconstruction. The first observation we can make from Figure 3 is that for lower dimensions D ≈ 3 the best performing features are those with the Euclidean degree or Total order index sets. Of the index sets, the Hyperbolic and Energy Norm Hyperbolic Cross perform the worst, in particular for smaller lengthscales. Overall the index sets all perform significantly better than the warped Fourier Feature methods, amongst which the original MC based RFF performs the worst and the standard Halton sequence appears to perform marginally better than the generalised Halton.

As the number of dimensions increases, the Total and Euclidean index sets become intractable due to their heavy dependency on cross-dimensional products of data-dependent harmonic terms. Considering FF methods, the approximation accuracy of the standard Halton sequence falls behind even RFFs, while the generalised Halton remains consistently ahead. As we have seen, as the dimensionality increases, the Total and Euclidean index sets have no parameterisation that allows them to scale properly; indeed their flexibility is due entirely to being a specific instance of the LPBall index set, which can give sparser Hyperbolic index sets. On the other hand, the ENHC can be parameterised by sparsity ζ and weighting γ, giving additional flexibility.

An interesting observation in the Frobenius norms of the real texture datasets is the errors and their connection to the truncation error defined in (24). We can see how the smaller lengthscales (Pores and Rubber) result in more difficult inference, in terms of the number of features required for better predictions, than larger lengthscales (Tread); this can be seen in the generalisation experiments in the next section, where we evaluate predictive error over an increasing range of features.

We conjecture that the significantly improved Gram matrix approximation performance of ISFSF is not just from our deterministic construction, but also due to a large suppression of negative covariances which we have observed in reconstruction plots. This is a known issue for RFFs with Gaussian Processes which can negatively affect predictive uncertainties. An extended discourse on this behaviour is provided in the supplementary.

5.2 GENERALISATION ERROR

We evaluate predictive performance with Root Mean Square Error (RMSE) and Mean Negative Log Loss (MNLL). The MNLL accounts for the model's predictive mean and uncertainty. It is defined as

$$\mathrm{MNLL} = \frac{1}{N}\sum_{i=1}^{N} \left[ \frac{1}{2}\log\big(2\pi\sigma^{*2}_i\big) + \frac{(\mu^*_i - f^*_i)^2}{2\sigma^{*2}_i} \right],$$

where σ*_i, μ*_i, and f*_i are respectively the predictive standard deviation, predictive mean, and true value at the ith test point. This section demonstrates the generalisation performance of a single multidimensional periodic kernel on image texture data. We use images from [48] with the same 12675 train and 4225 test set pixel locations x ∈ R^2.
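For reference, the two metrics as we read them can be computed as follows (a minimal sketch of our own; sigma denotes the predictive standard deviation):

```python
import numpy as np

def rmse(mu, f_true):
    """Root mean square error of the predictive mean."""
    return np.sqrt(np.mean((mu - f_true)**2))

def mnll(mu, sigma, f_true):
    """Mean negative log loss of Gaussian predictions with mean mu and standard deviation sigma."""
    var = sigma**2
    return np.mean(0.5 * np.log(2 * np.pi * var) + (mu - f_true)**2 / (2 * var))
```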


Figure 5: Predicted missing area for the Pores and Rubber datasets. Left to right, each column represents predictions made using ISFSF, RFF, and GHAL. Top to bottom, each row represents an increasing number of features used at 49, 201, 793 respectively.

Figure 6: Comparison of predictive RMSE and MNLL with our method alongside Fourier Feature methods, for an increasing number of components.

A Bayesian Linear Regression [2] model is used as the regression model. We fix the hyperparameters across different kernel representations in order for the comparison to be consistent for the same underlying kernel. Overall, for the same kernel, both qualitatively and quantitatively the results show clear advantages of ISFSF over the alternative feature methods. The visual effect is demonstrated in Figure 7 and empirically in Figure 5.
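For completeness, a minimal sketch of Bayesian linear regression on an explicit feature map is shown below (our own illustration following [2], not the authors' code; the noise variance and prior precision are illustrative hyperparameters). The cost is dominated by forming and factorising an M × M system, giving the O(NM²) scaling noted in Section 2.

```python
import numpy as np

def blr_fit(Phi, y, noise_var=0.1, prior_prec=1.0):
    """Posterior over weights for Bayesian linear regression on features Phi of shape (N, M)."""
    M = Phi.shape[1]
    A = prior_prec * np.eye(M) + Phi.T @ Phi / noise_var   # posterior precision, O(N M^2)
    Sigma = np.linalg.inv(A)                                # posterior covariance, O(M^3)
    mu = Sigma @ (Phi.T @ y) / noise_var                    # posterior mean
    return mu, Sigma

def blr_predict(Phi_star, mu, Sigma, noise_var=0.1):
    """Predictive mean and standard deviation at test features Phi_star."""
    mean = Phi_star @ mu
    var = np.sum((Phi_star @ Sigma) * Phi_star, axis=1) + noise_var
    return mean, np.sqrt(var)
```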

Textures. In both the pores and rubber datasets we can see the RMSE and MNLL for the ISFSF based features (using the ENHC) perform the best in all cases. The RMSE performs exceedingly well even with only 49 features, almost equaling the performance of RFF and QMC methods which require 794 features. In both RMSE and MNLL, the ISFSF with 201 features outperforms FF based methods using 794. In the pores dataset the generalised Halton marginally outperforms all methods in the case of 794 features, and the Halton at 94 features. For the tread dataset the resulting performance is interesting because, for all feature counts, the RMSE performances are alike across all methods, with the ISFSF slightly outperforming for lower feature counts. The asymptotic performance of both ISFSF and Fourier feature methods is similar, and for rubber and tread becomes marginally worse as we increase the feature count. This is expected because the datasets contain small amounts of non-stationary information which the stationary periodic RBF is unable to completely capture. As the feature count increases, the modelling fidelity of the approximated kernel increases, resulting in the slightly larger error.

Figure 7: Higher dimensional comparison of predictive RMSE and MNLL with our method using the ENHC alongside Fourier Feature methods, for an increasing number of components. We train on 8000 points from [−5, 5] and test on 4000 points.

High Dimensional Tensor Function. To analyse the method's efficacy in higher dimensions we perform experiments on the D dimensional tensor-product function $G_D(\mathbf{x}) := \prod_{d=1}^{D} g(x_d)$ from [20], where the one-dimensional function g is defined as

$$g(x) := 8\sqrt{6\pi / (6369\pi - 4096)}\,\big(4 + \operatorname{sgn}(x \bmod 1 - \tfrac{1}{2})\big)\big(\sin(2\pi x)^3 + \sin(2\pi x)^4\big). \quad (25)$$

In this experiment we also observe improved performance of the ISFSF over standard Fourier Feature methods. Since we know the function is stationary, we observe a steady convergence in predictive accuracy, unlike the non-stationary texture datasets. As we increase the dimensionality the predictive error degrades slightly; this is expected when we consider the multivariate truncation error, which is affected by dimension.

Figure 8: "Hopper" multi-jointed robot with visualisation of joint and robot trajectories.

Figure 9: "Ant" multi-jointed robot with visualisation of joint and robot trajectories.

Periodic trajectory tracking of jointed robots. Robotics is an area in which various hardware platforms of interest are often constructed with jointed actuators. We demonstrate an application of multivariate periodic kernels for periodic motion tracking in two simulated jointed robots commonly used in Reinforcement Learning (RL) and Control tasks [10]. We consider the problem of regressing on the vertical trajectory of the robot as a function of the input joints. 500 timesteps of trajectory were collected alongside a 6-dimensional ("Hopper", Figure 8) and 16-dimensional ("Ant", Figure 9) position and orientation vector of the joints from a pre-learned policy for the "Hopper" and "Ant" environments [7]. 300 steps were used for training. Simulation was performed in the open source simulator PyBullet. The results are visualized in Figure 10. For both the "Hopper" and "Ant" robots, we can see a significant improvement of the ISFSF features over the standard periodic kernel formulation. This is most significant in the RMSE, and holds in the MNLL. The early convergence of predictions suggests that ISFSFs are both better able to represent the signal as well as the uncertainty of the prediction, as evidenced in the MNLL. An explanation for the faster convergence of the "Ant" task for both feature methods could be the lengthscale of the kernels used in each: "Ant" uses an isotropic lengthscale of 13 and "Hopper" an isotropic lengthscale of 0.5. These results suggest that ISFSF may be used to improve policy learning in RL, as well as play a role in improved system identification and prediction of periodic systems.

Figure 10: Periodic trajectory prediction error for increasing number of features on various robots.

6 CONCLUSION

Feature approximations have been a large component of scaling kernel methods such as GPs. An important issue with kernel approximation methods is the efficiency of their approximation, and this has been a focus of more deterministic construction methods. Having effective periodic kernel approximations enables numerous applications in various domains in a much more efficient manner. Crucially, we introduce an effective sparse approximation to multivariate periodic kernels using multivariate Fourier series with sparse index set based sampling grids for efficient feature-space periodic kernel decompositions. We demonstrate experimentally on a range of datasets that our features result in predictive models of greater accuracy with vastly fewer components. Future directions include how to construct kernel-dependent index sets as well as direct application to learning policies for jointed robotic systems.

Acknowledgements

We are grateful to Richard Scalzo for early meandering discussions and Lionel Ott for valuable guidance. We would also like to thank the anonymous reviewers for their comments and helpful suggestions.


References

[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, section "Modified Bessel Functions I and K", §9.6. 1972.
[2] C. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, New York, 2007.
[3] F. F. Bonsall and J. Duncan. Complete Normed Algebras. Springer Science & Business Media, 2012.
[4] W. Boomsma and J. Frellsen. Spherical convolutions and their application in molecular modelling. In Advances in Neural Information Processing Systems, 2017.
[5] R. Brault, M. Heinonen, and F. Buc. Random Fourier features for operator-valued kernels. In Asian Conference on Machine Learning, 2016.
[6] K. Choromanski, F. Fagan, C. Gouy-Pailler, A. Morvan, T. Sarlos, and J. Atif. TripleSpin - a generic compact paradigm for fast machine learning computations. arXiv preprint arXiv:1605.09046, 2016.
[7] E. Coumans and Y. Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016.
[8] T. Dao, C. M. De Sa, and C. Ré. Gaussian quadrature for kernel features. In Advances in Neural Information Processing Systems, 2017.
[9] J. C. Dunlap, J. J. Loros, and P. J. DeCoursey. Chronobiology: Biological Timekeeping. Sinauer Associates, 2004.
[10] T. Erez, Y. Tassa, and E. Todorov. Infinite horizon model predictive control for nonlinear periodic tasks. Manuscript under review, 4, 2011.
[11] F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento. Crystal structure representations for machine learning models of formation energies. International Journal of Quantum Chemistry, 2015.
[12] X. Y. Felix, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, 2016.
[13] A. Gijsberts and G. Metta. Real-time model learning using incremental sparse spectrum Gaussian process regression. Neural Networks, 2013.
[14] A. Gittens and M. W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 2013.
[15] K. Hallatschek. Fouriertransformation auf dünnen Gittern mit hierarchischen Basen. Numerische Mathematik, 1992.
[16] G. W. Henry and J. N. Winn. The rotation period of the planet-hosting star HD 189733. The Astronomical Journal, 2008.
[17] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence. Citeseer, 2013.
[18] C. Kacwin, J. Oettershagen, M. Ullrich, and T. Ullrich. Numerical performance of optimized Frolov lattices in tensor product reproducing kernel Sobolev spaces. arXiv preprint arXiv:1802.08666, 2018.
[19] L. Kämmerer. Multiple rank-1 lattices as sampling schemes for multivariate trigonometric polynomials. Journal of Fourier Analysis and Applications, 2018.
[20] L. Kämmerer, D. Potts, and T. Volkmer. Approximation of multivariate periodic functions by trigonometric polynomials based on rank-1 lattice sampling. Journal of Complexity, 2015.
[21] P. Kar and H. Karnick. Random feature maps for dot product kernels. In International Conference on Artificial Intelligence and Statistics, 2012.
[22] P. Kritzer, F. Y. Kuo, D. Nuyens, and M. Ullrich. Lattice rules with random n achieve nearly the optimal O(n^{-α-1/2}) error independently of the dimension. Journal of Approximation Theory, 2018.
[23] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 2010.
[24] Q. Le, T. Sarlós, and A. Smola. Fastfood - approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.
[25] F. Li, C. Ionescu, and C. Sminchisescu. Random Fourier approximations for skewed multiplicative histogram kernels. In Joint Pattern Recognition Symposium. Springer, 2010.
[26] D. J. MacKay. Introduction to Gaussian processes. NATO ASI Series F Computer and Systems Sciences, 1998.

[27] M. Munkhoeva, Y. Kapushev, E. Burnaev, and I. Oseledets. Quadrature-based features for kernel approximation. In Advances in Neural Information Processing Systems, 2018.
[28] M. Mutny and A. Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, 2018.
[29] J. Pennington, X. Y. Felix, and S. Kumar. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems, 2015.
[30] L. Peternel, T. Noda, T. Petric, A. Ude, J. Morimoto, and J. Babic. Adaptive control of exoskeleton robots for periodic assistive behaviours based on EMG feedback minimisation. PLoS ONE, 2016.
[31] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013.
[32] C. L. Phillips and G. A. Voth. Discovering crystals using shape matching and machine learning. Soft Matter, 2013.
[33] A. Rahimi and B. Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 2007.
[34] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, 2008.
[35] D. Rainville, C. Gagné, O. Teytaud, D. Laurendeau, et al. Evolutionary optimization of low-discrepancy sequences. ACM Transactions on Modeling and Computer Simulation (TOMACS), 2012.
[36] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[37] K. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. Müller, and E. Gross. How to represent crystal structures for machine learning: Towards fast prediction of electronic properties. Physical Review B, 2014.
[38] P. Seshadri, A. Narayan, and S. Mahadevan. Effectively subsampled quadratures for least squares polynomial approximations. SIAM/ASA Journal on Uncertainty Quantification, 2017.
[39] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[40] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.
[41] A. Solin and S. Särkkä. Explicit link between periodic covariance functions and state space models. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014.
[42] A. Tompkins and F. Ramos. Fourier feature approximations for periodic kernels in time-series modelling. In AAAI Conference on Artificial Intelligence, 2018.
[43] L. Trefethen. Multivariate polynomial approximation in the hypercube. Proceedings of the American Mathematical Society, 2017.
[44] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[45] L. A. Wasserman. All of Nonparametric Statistics: With 52 Illustrations. Springer, 2006.
[46] C. K. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2001.
[47] A. Wilson and R. Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, 2013.
[48] A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, 2014.
[49] J. Yang, V. Sindhwani, H. Avron, and M. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning, 2014.
[50] Z. Yang, A. Wilson, A. Smola, and L. Song. À la carte - learning fast kernels. In Artificial Intelligence and Statistics, 2015.
[51] S. Zaremba. La méthode des "bons treillis" pour le calcul des intégrales multiples. In Applications of Number Theory to Numerical Analysis. Elsevier, 1972.