
Extended Hamiltonian Learning on Riemannian Manifolds: Numerical Aspects

Simone Fiori

Abstract—The present paper delivers the second part of a study initiated with the contribution S. Fiori, Extended Hamiltonian Learning on Riemannian Manifolds: Theoretical Aspects, IEEE Transactions on Neural Networks, Vol. 22, No. 5, pp. 687 – 700, May 2011, which aimed at introducing a general framework to develop a theory of learning on differentiable manifolds by an extended Hamiltonian stationary-action principle. The present paper discusses the numerical implementation of the extended Hamiltonian learning paradigm by making use of notions from geometric numerical integration to numerically solve differential equations on manifolds. The general-purpose integration schemes as well as the discussion of several cases-of-interest show that the implementation of the dynamical learning equations exhibits a rich structure. The behavior of the discussed learning paradigm is illustrated via several numerical examples and discussions of cases-of-study. The numerical examples confirm the theoretical developments presented in this paper as well as those presented in the first part of the present study.

Index Terms—Extended Hamiltonian (second-order) learning; Riemannian manifold; Learning by constrained criterion optimization; Geometric numerical integration.

I. INTRODUCTION

A CLASSICAL learning theory paradigm is based on criterion optimization over a Euclidean space by a gradient-based (first-order) learning algorithm. A more advanced instance of parameter learning is by constrained optimization. In such a context, the constraints on the parameters' values reflect the natural constraints presented by the learning problem. In this event, differential geometry is an appropriate mathematical instrument to formulate and to implement a learning theory. The contribution [21] aimed at introducing a general framework to develop a theory of learning on differentiable manifolds by an extended Hamiltonian stationary-action principle, which leads to a second-order learning paradigm. The first part of the paper [21] presents a derivation of general equations of learning on Riemannian manifolds and discusses the relationship between the theory of dynamical learning by second-order differential equations on manifolds and the classical learning theory based on gradient-steepest-descent optimization. Over a parameter manifold M, the extended Hamiltonian learning dynamics is described by the system of differential equations:

ẋ = (1/M) v,
v̇ = −M Γ_x(v, v) − ∇_xV − μ v,    (1)

where the variable x ∈ M represents a set of learning system's learnable parameters, the variable v is proportional to the instantaneous learning velocity, the constant M ≥ 0 denotes a mass term, the function Γ_x denotes the Christoffel form associated to the metric of the real Riemannian manifold M, the function V : M → R describes a potential energy field, the vector field ∇_xV denotes the Riemannian gradient of the potential energy function and the constant μ > 0 denotes a viscosity term. The potential energy function V arises from the casting of a given learning problem into an optimization problem. The learning dynamics of the system (1), described by the state-pair (x, v), evolves toward a stationary state (x⋆, 0). The point x⋆ ∈ M coincides with a local minimum of the potential energy function V over the manifold M. An interesting finding of the paper [21] is the resemblance of dynamical learning to classical learning with a 'momentum term'. A convergence analysis was carried out in terms of a Lyapunov function, and a fine-convergence analysis in the proximity of a stationary point of the learning-goal function, represented by a potential energy function for the dynamical system, was conducted as well. The fine-convergence analysis shows that second-order learning may improve the convergence ability of first-order learning if certain conditions are met. The second part of the paper [21] discusses several cases of learning by extended Hamiltonian systems on manifolds of interest in the scientific literature. The purpose of the calculations was to show how to derive the Christoffel form, the Riemannian gradient and the geodesic expressions by working in intrinsic coordinates only, by means of the variational principle stated in the first part of the paper.

Copyright © 2012 IEEE. Personal use of this material is permitted. Cite this paper as: S. Fiori, Extended Hamiltonian learning on Riemannian manifolds: Numerical aspects, IEEE Transactions on Neural Networks and Learning Systems, Vol. 23, No. 1, pp. 7 – 21, January 2012. The author is with Dipartimento di Ingegneria dell'Informazione, Facoltà di Ingegneria, Università Politecnica delle Marche, Via Brecce Bianche, Ancona I-60131, Italy. (eMail: [email protected])

The formulation of the extended Hamiltonian learning equations on differentiable manifolds requires instruments from differential geometry. Likewise, the numerical implementation of the extended Hamiltonian systems on a computation platform requires instruments from geometric numerical integration [26]. The reason is that the numerical integration techniques which insist on flat spaces are unable (in general) to cope with curved parameter spaces. For a general parameter manifold, the classical Euler-type stepping method should be replaced with a numerical method that is capable of coping with the curved nature of the parameter space. The first fundamental equation of the system (1) may be solved numerically by a step-forward algorithm based on the notion of manifold retraction, which will be explained in section II. The second fundamental equation of the system (1) may be solved numerically by a step-forward algorithm based on the differential-geometric notion of parallel translation (or by a numerical technique known as vector translation), whose definition will be recalled in section II. The general-purpose integration schemes presented in section II, as well as their particularization to several cases-of-interest discussed in section III, will show that the implementation of the dynamical learning equations exhibits a rich structure in terms of mathematical analysis and implementation issues. In particular, section III discusses cases of learning by extended Hamiltonian systems on manifolds of interest in the scientific literature, namely, the Stiefel manifold, the special orthogonal group, the unit hypersphere, the Grassmann manifold, the group of symmetric positive definite matrices, the flag manifold and the real symplectic group of matrices. Section IV illustrates the features of the discussed learning algorithms via comparative numerical experiments and section V concludes the paper and suggests further research topics of interest along the line of the present research.

II. NUMERICAL INTEGRATION OF EXTENDED HAMILTONIAN SYSTEMS

In general, the extended Hamiltonian learning system (1) is not solvable in closed form; therefore, it is necessary to resort to a numerical approximation of the exact solution.


The present section aims at explaining the fundamental mathematical tools used to solve differential equations on manifolds and how to apply these tools to the solution of the extended Hamiltonian learning system in subsection II-D. In particular, the numerical implementation makes use of the notion of parallel translation and of the concept of manifold retraction, which are recalled in subsections II-A, II-B and II-C. Subsection II-E discusses a relationship between the extended Hamiltonian learning method and the gradient method. For the theory of differentiable manifolds and Lie groups, readers may refer to [49].

A. Definitions and notation

Let M be a real differentiable manifold of dimension r. At a point x ∈ M, the tangent space to the manifold M is denoted as T_xM. The symbol TM denotes the tangent bundle, defined as TM := {(x, v) | x ∈ M, v ∈ T_xM}.

Let the algebraic group (G, m, i, e) be a Lie group, namely, let G be endowed with a differentiable manifold structure, which is further supposed to be Riemannian. Here, the operator m : G × G → G denotes group multiplication, the operator i : G → G denotes group inversion and e ∈ G denotes the group identity, namely m(x, i(x)) = e for every x ∈ G. The algebraic and the differential-geometric structures need to be compatible, namely, the map (x, y) ↦ m(x, i(y)) needs to be smooth for every x, y ∈ G. To the Lie group G, a Lie algebra g := T_eG is associated.

A Riemannian manifold M is endowed with an inner product ⟨·, ·⟩_x : T_xM × T_xM → R. The symbol E^r denotes a real Euclidean space of dimension r and the symbol ⟨·, ·⟩_E denotes a Euclidean inner product in E^r. Recall that any tangent space T_xM of a real differentiable manifold M ∋ x of dimension r is isomorphic to a Euclidean space E^r. A local metric ⟨·, ·⟩_x also defines a local norm ‖v‖_x := √⟨v, v⟩_x, for v ∈ T_xM. Let ψ : M → R denote a differentiable function. The symbol ∇_xψ ∈ T_xM denotes the Riemannian gradient of the function ψ with respect to a metric ⟨·, ·⟩_x. The Riemannian gradient is defined as the unique tangent vector in T_xM that satisfies the metric compatibility condition

⟨∇_xψ, v⟩_x = ⟨∂_xψ, v⟩_E  for any v ∈ T_xM,    (2)

where the symbol ∂_x denotes the Euclidean gradient evaluated at x.

In the following, an over-dot will denote the derivative d/dt, while a double over-dot will denote the derivative d²/dt². The standard notation for covariant and contravariant tensor indices, as well as Einstein's convention on summation indices, is used throughout the present section.

For u, v ∈ T_xM, it holds ⟨u, v⟩_x = g_{ij} u^i v^j, where g_{ij} denote the components of the metric tensor associated to the inner product ⟨·, ·⟩_x. The functions g_{ij} depend on the coordinates x^k and the symmetry property g_{ij} = g_{ji} holds. The functions g^{ij} denote the components of the dual metric tensor, namely, g^{ik} g_{kj} = δ^i_j. The Christoffel symbols of the first kind associated to a metric tensor of components g_{ij} are defined as:

Γ_{kji} := (1/2) ( ∂g_{ji}/∂x^k + ∂g_{ik}/∂x^j − ∂g_{kj}/∂x^i ),

in the Levi-Civita formulation. The Christoffel symbols of the second kind are defined as Γ^h_{ik} := g^{hj} Γ_{ikj}. The Christoffel symbols of the second kind are symmetric in the covariant indices, namely, Γ^h_{ij} = Γ^h_{ji}. The Christoffel form Γ_x : T_xM × T_xM → E^r is defined in extrinsic coordinates by [Γ_x(v, w)]^k := Γ^k_{ij} v^i w^j.
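As a concrete illustration of the compatibility condition (2), in coordinates it reduces to the linear system g_{ij}(∇_xψ)^j = (∂_xψ)_i, so the Riemannian gradient is obtained by applying the inverse metric tensor to the Euclidean gradient. The following Python/NumPy sketch uses made-up illustrative values for the metric tensor and the Euclidean gradient (they are not taken from the paper) and verifies condition (2) on a few random tangent vectors.

```python
import numpy as np

# Illustrative metric tensor g_ij at a point x of a 3-dimensional manifold
# (any symmetric positive-definite matrix serves as an example).
g = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.5, 0.2],
              [0.0, 0.2, 1.0]])

# Illustrative Euclidean gradient (partial derivatives) of psi at x.
d_psi = np.array([1.0, -0.5, 2.0])

# Condition (2) in coordinates: g @ grad_psi = d_psi, hence grad_psi = g^{-1} d_psi.
grad_psi = np.linalg.solve(g, d_psi)

# Check <grad_psi, v>_x = <d_psi, v>_E for a few random tangent vectors v.
rng = np.random.default_rng(0)
for _ in range(3):
    v = rng.standard_normal(3)
    assert np.isclose(grad_psi @ g @ v, d_psi @ v)
print("Riemannian gradient:", grad_psi)
```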

On a Riemannian manifold M with Christoffel form Γ_·(·, ·), a curve c : [0, 1] → M that satisfies the second-order differential equation

c̈ + Γ_c(ċ, ċ) = 0    (3)

is termed a geodesic. A geodesic curve that satisfies the initial conditions c(0) = x ∈ M and ċ(0) = v ∈ T_xM is denoted as c_{x,v}(t). For some manifolds of interest in machine learning and for some specific metric structures, geodesic curves may be expressed in closed form. However, in some cases either the closed-form expression requires extensive numerical computations to be evaluated or the closed form is unknown. For these reasons, some constructions have been envisaged to approximate geodesic curves. See, for example, the method based on repeated manifold projections explained in [32], tailored to the case of shape manifolds, or the more general method explained in [40] that solves the problem of finding a geodesic arc joining given endpoints in a connected complete Riemannian manifold.

Given two points x₁, x₂ ∈ M, and provided they may be joined by a geodesic arc c_{x,v}(t), with t ∈ [0, 1], c_{x,v}(0) = x₁ and c_{x,v}(1) = x₂, the Riemannian distance between the given points in M may be defined as:

d(x₁, x₂) := ∫₀¹ ⟨ċ_{x,v}, ċ_{x,v}⟩^{1/2}_{c_{x,v}} dt.    (4)

In fact, the notion of geodesics on a manifold M turns the manifold into a metric space.

The kinetic energy functional associated to the system (1) is the symmetric bilinear form (1/2) M ⟨ẋ, ẋ⟩_x, namely K_x(ẋ, ẋ) := (1/2) M g_{ij} ẋ^i ẋ^j. On a Riemannian manifold, the metric tensor is positive-definite, hence K_x(ẋ, ẋ) ≥ 0 on every trajectory x(t) ∈ M. (Such a property does not hold true on other manifolds of interest in neural learning, such as pseudo-Riemannian manifolds [19].) The potential energy function V depends on the coordinates x^k only. The notation used throughout the paper to denote matrices, namely, by lower-case letters instead of upper-case letters, is used to keep consistency with the notation introduced in the paper [21].

B. Parallel translation of a tangent vector along a geodesic arc

Given a curve on a Riemannian manifold M, a point x ∈ M on the curve and a tangent vector v ∈ T_xM to the curve, if such a vector is moved to another point on the curve by a parallel translation in the sense of ordinary geometry, in general it will not be tangent to the curve anymore. This observation suggests that it is necessary to define a notion of "parallel translation along a curve" that is compatible with the geometry of the manifold that the curve belongs to¹.

In the present work, the main interest lies in the parallel translation of a tangent vector along a geodesic curve on a Riemannian manifold only, thus the present subsection focuses on seeking a solution to such a problem. The basic requirement of parallel translation of a vector along a curve is that it keeps the translated vector tangent to the curve at any point. Clearly, this condition by itself is too weak to give rise to a unique solution to the problem of parallel translation. In Riemannian geometry, it is additionally required that parallel translation preserves the inner product between the translated tangent vector and the curve's tangent. A parallel-translation rule may be constructed as follows. Let x(t) ∈ M denote a given curve on a Riemannian manifold M endowed with a metric tensor of components g_{ij}, with x(0) being simply denoted as x, and let the curve v(t) ∈ T_{x(t)}M, t ∈ [0, 1], denote the parallel translation of a tangent vector u ∈ T_xM identified with v(0).

¹Indeed, the notion of parallel translation is fundamental in differential geometry, because it enables comparing vectors that belong to different tangent spaces. It is intimately related to the fundamental concept of covariant derivative of a vector field with respect to a vector.


The angle-preservation property of parallel translation implies that the sought-for curve v(t) must meet the formal condition:

⟨v(t), ẋ(t)⟩_{x(t)} = constant over the curve x(t).    (5)

Differentiating the above equation with respect to the parameter t yields:

d/dt ( g_{ij} v^i ẋ^j ) = 0,    (6)

namely:

(∂g_{ij}/∂x^k) ẋ^k ẋ^j v^i + g_{ij} ẋ^j v̇^i + g_{ih} ẍ^h v^i = 0.    (7)

Now, assume that the components x^h describe a geodesic arc on the Riemannian manifold M. Therefore, it holds ẍ^h = −Γ^h_{kj} ẋ^k ẋ^j. Using such identity in the equation (7) yields:

( (∂g_{ij}/∂x^k − g_{ih} Γ^h_{kj}) ẋ^k v^i + g_{ij} v̇^i ) ẋ^j = 0.    (8)

Straightforward calculations with Christoffel symbols show that the terms within the inner parentheses sum up to Γ_{ikj}, therefore, the above equation may be rewritten as:

g_{hj} ( Γ^h_{ik} v^i ẋ^k + v̇^h ) ẋ^j = 0,    (9)

which is equivalent to:

⟨Γ_x(v, ẋ) + v̇, ẋ⟩_x = 0.    (10)

A sufficient condition for the curve v(t) to represent an angle-preserving parallel translation of the given vector u ∈ T_xM along the given curve x(t) ∈ M is:

v̇ + Γ_x(v, ẋ) = 0,  v(0) = u.    (11)

The above Levi-Civita parallel translation equation may be interpreted as an evolution law of the tangent vector v(0) that keeps the vector tangent to the manifold along a geodesic curve and that preserves the angle with the geodesic curve.

The parallel translation of a vector u ∈ T_xM along a geodesic arc c_{x,v}(t) is denoted by τ_{c_{x,v}(t)}(u) and is illustrated in Figure 1.

Fig. 1. Notion of parallel translation on a Riemannian manifold M. The tangent vector u ∈ T_xM is parallel-translated along the geodesic arc c_{x,v}(t).

A noticeable property of any geodesic curve c_{x,v}(t), which will be invoked later in the paper, is that it parallel-translates itself. Such a property may be verified by observing that setting v = ẋ in the equation (11) yields equation (3).

In order to set up the parallel translation equation (11), in the case that the symmetric Christoffel form Γ_x(v, v), with x ∈ M and v ∈ T_xM, is known, it is necessary to calculate from it the bilinear Christoffel form Γ_x(u, w), with x ∈ M and u, w ∈ T_xM. Such a result may be achieved via the following identity:

4 Γ_x(u, w) = Γ_x(u + w, u + w) − Γ_x(u − w, u − w),    (12)

where clearly it makes sense to compute the quantities u ± w, as the summands belong to the same tangent space.
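The polarization identity (12) is straightforward to apply numerically. The following minimal Python/NumPy sketch uses, as a test case, the unit-hypersphere Christoffel form Γ_x(v, v) = −x‖v‖² recalled in Section III-B, and checks that (12) recovers the bilinear form Γ_x(u, w) = −x(u^T w) reported there; the vectors involved are illustrative random tangent vectors.

```python
import numpy as np

def bilinear_from_quadratic(gamma_quad, x, u, w):
    """Recover the bilinear Christoffel form from its quadratic restriction via identity (12)."""
    return 0.25 * (gamma_quad(x, u + w) - gamma_quad(x, u - w))

# Example: Christoffel form of the unit hypersphere, Gamma_x(v, v) = -x ||v||^2
# (see Section III-B); identity (12) should return Gamma_x(u, w) = -x (u^T w).
gamma_sphere = lambda x, v: -x * (v @ v)

rng = np.random.default_rng(1)
x = rng.standard_normal(4); x /= np.linalg.norm(x)      # point on S^3
u = rng.standard_normal(4); u -= (u @ x) * x            # tangent vectors at x
w = rng.standard_normal(4); w -= (w @ x) * x

assert np.allclose(bilinear_from_quadratic(gamma_sphere, x, u, w), -x * (u @ w))
print("polarization identity verified")
```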

The parallel translation equation (11) is, in general, difficult to solve, hence the operator τ_{c_{x,v}(t)} may not be available in closed form for manifolds of interest. Approximations of the exact parallel translation are available, such as the 'Schild's ladder' construction [35] and the 'vector translation' method [44]. In practice, vector translation is based on the following idea:

1) Embed the manifold M of dimension r into the Euclidean ambient space E^r.
2) Parallel-translate the vector u ∈ T_xM in the sense of ordinary geometry in the ambient space to a point y ∈ M.
3) Project the vector u into the tangent space T_yM by means of a suitable projector π_y : E^r → T_yM, for y ∈ M.

The vector-translation operator associated to the above procedure is, thus, π_y(u), which moves the vector u from T_xM to T_yM. The projection operator π_y : E^r → T_yM is defined by the two conditions π_y(z) ∈ T_yM for y ∈ M and z ∈ E^r, and ⟨v, z⟩_E = ⟨v, π_y(z)⟩_y for all v ∈ T_yM. Depending on the manifold's geometric structure, vector translation may be much less expensive to compute than parallel translation.

C. Manifold retractions and exponential maps

A retraction map R : TM → M is defined such that any restriction R_x : T_xM → M, for x ∈ M, satisfies the conditions [44]:

1) Any restriction R_x is defined in some open ball B(0, r_x) of radius r_x > 0 about 0 ∈ T_xM and is continuously differentiable;
2) It holds R_x(v) = x if v = 0;
3) Let v(t) ∈ B(0, r_x) ⊂ T_xM denote any smooth curve, with t ∈ I ∋ 0 and v(0) = 0. The curve y(t) = R_x(v(t)), for t ∈ I, lies in a neighborhood of x = y(0). It holds ẏ = dR_x|_v(v̇), where dR_x|_v : T_vT_xM → T_{R_x(v)}M denotes a tangent map. For t = 0, it holds dR_x|_0 : T_0T_xM → T_{R_x(0)}M. Identify T_0T_xM ≅ T_xM and T_{R_x(0)}M = T_xM. In order for R_x to be a retraction, the map dR_x|_0 must equate the identity map in T_xM.

In practice, a retraction R_x(v) sends a tangent vector v ∈ T_xM to a manifold M ∋ x into a neighborhood of the point x, as exemplified in Figure 2.

On a Riemannian manifold, given a geodesic curve c_{x,v}(t), t ∈ [0, 1], associated to the metric structure, a map termed exponential map is defined as c_{x,v}(1). Any exponential map of a Riemannian manifold is a retraction (the converse is not necessarily true, though). Another class of retractions is the one based on the projection of Euclidean retractions onto the manifold of interest [52].
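To make the notion concrete, the sketch below compares two retractions on the unit hypersphere S^{n−1}: the exponential map c_{x,v}(1), whose closed form is recalled in Section III-B (equation (34)), and a projection-based retraction obtained by normalizing x + v back onto the sphere. The latter is a standard example of the projection-of-Euclidean-retractions class mentioned above and is not a construction taken from the paper; the numerical values are illustrative.

```python
import numpy as np

def exp_retraction_sphere(x, v):
    """Exponential-map retraction on S^{n-1}: c_{x,v}(1) = x cos(||v||) + v sin(||v||)/||v||."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x.copy()
    return x * np.cos(nv) + v * np.sin(nv) / nv

def proj_retraction_sphere(x, v):
    """Projection-based retraction: map x + v back onto the sphere by normalization."""
    y = x + v
    return y / np.linalg.norm(y)

rng = np.random.default_rng(2)
x = rng.standard_normal(5); x /= np.linalg.norm(x)
v = rng.standard_normal(5); v -= (v @ x) * x      # make v a tangent vector at x
v *= 0.1                                          # small step

y1, y2 = exp_retraction_sphere(x, v), proj_retraction_sphere(x, v)
print(np.linalg.norm(y1), np.linalg.norm(y2))     # both results are unit-norm vectors
print(np.linalg.norm(y1 - y2))                    # the two retractions agree to first order in v
```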

D. Numerical integration methods

In the massive case (namely, whenever M ≠ 0), the extended Hamiltonian learning system (1) may be re-normalized by setting:

μ := μ/M,  V := V/M,  v := v/M.    (13)

The equations of learning stay the same except that the mass term M disappears. It may be assumed, thus, that M = 1 without loss of generality, so that the equations of learning become:

ẋ = v,
v̇ + Γ_x(v, v) = −∇_xV − μ v.    (14)


Fig. 2. Notion of retraction on a manifold M. The dashed circle represents the neighborhood of the point x ∈ M that the retracted tangent vector v ∈ T_xM, namely R_x(v), belongs to.

The system (14) represents an instance of a differential system over a tangent bundle (see, e.g., [7]).

The basic idea to solve numerically the general extended Hamiltonian system of equations over the manifold M is to replace the continuous-time state variables x(t) ∈ M, v(t) ∈ T_{x(t)}M with discrete-time state variables x_k ∈ M, v_k ∈ T_{x_k}M, by ensuring that the state variables belong to the respective spaces. This operation requires a numerical integration scheme for the fundamental equations.

The first fundamental equation in (14) may be solved by the help of the notion of retraction, namely, by the discrete-time learning rule:

x_{k+1} = R_{x_k}(η v_k),  k = 0, 1, 2, 3, ...,    (15)

where η > 0 denotes an integration step-size (usually η ≪ 1). The retraction-based stepping rule (15) may be regarded as an extension of the Euler method from Euclidean spaces to curved manifolds. In fact, for a Euclidean space E^r, the rule (15) would read as a Euler method, because for x ∈ E^r it holds T_xE^r = E^r, hence R_x(v) = x + v. The numerical solution of differential equations on Euclidean spaces may benefit from other methods than Euler's, such as, for instance, the Runge-Kutta family of stepping methods. Likewise, the numerical solution of differential equations on manifolds may benefit from the extension of such methods to curved spaces [9], [38]. In the present context, however, the retraction-based stepping rule (15) is suitable, because the purpose of the numerical implementation is to find an optimal configuration of the learning system and not to approximate the solution of the learning differential equation (14) precisely at any time, unlike in some applications (e.g., in astronomy [26]) where the purpose of a stepping method is to approximate the whole trajectory x(t) ∈ M in a satisfactory way. The second fundamental equation in (14) may be solved in different ways, according to the structure of the tangent bundle of the base manifold M. In particular, three cases were individuated, which give rise to the three different types of solution described below.

Type I solution: In those cases in which the tangent bundle is trivial, namely, the velocity variable v evolves over a single tangent space (or may be replaced with a variable which evolves over a single linear space, as is the case for Lie groups), the second fundamental equation in (14) may be solved numerically by Euler stepping on the common tangent space. In fact, the common tangent space to a manifold M of dimension r is a vector space isomorphic to E^r, which affords Euler stepping. The complete stepping method reads:

x_{k+1} = R_{x_k}(η v_k),
v_{k+1} = (1 − ημ) v_k − η Γ_{x_k}(v_k, v_k) − η ∇_{x_k}V,    (16)

for k = 0, 1, 2, 3, .... The retraction R : TM → M may be chosen arbitrarily.

Type II solution: In those cases where the tangent bundle is non-trivial and closed-form expressions for geodesic curves and for parallel translation along a geodesic curve are known, the second fundamental equation of dynamics in (14) may be implemented by a parallel-translation-based rule. Namely, the velocity v_{k+1} ∈ T_{x_{k+1}}M is computed by the parallel translation of the Euler-stepped velocity along the geodesic curve c_{x_k,v_k}(t) extended for the short time t = η. In order for the procedure to be consistent, it is necessary that the endpoint of the geodesic arc c_{x_k,v_k}(η) coincides exactly with the stepped-forward point x_{k+1} computed by the retraction (15). A straightforward way to ensure that such consistency condition holds is to take R_{x_k}(η v_k) = c_{x_k,v_k}(η) in the stepping rule (15). It is further necessary to pay attention to another aspect related to parallel translation. In some circumstances the formula for the parallel translation τ_{c_{x,v}(t)}(u) of an arbitrary tangent vector u ∈ T_xM, different from the initial tangent vector v ∈ T_xM of the geodesic, is available. In this case, the complete stepping method reads:

x_{k+1} = c_{x_k,v_k}(η),
v_{k+1} = τ_{c_{x_k,v_k}(η)}((1 − ημ) v_k − η ∇_{x_k}V),    (17)

for k = 0, 1, 2, 3, .... In some circumstances, however, the parallel translation τ_{c_{x,v}(t)}(u) of a tangent vector u ∈ T_xM different from the initial tangent vector v ∈ T_xM of the geodesic is unavailable and the only available operation is the self-parallel-translation τ_{c_{x,v}(t)}(v) of the initial tangent vector v ∈ T_xM of the geodesic. As recalled in section II, any geodesic curve parallel-translates itself, therefore, whenever the closed form of a geodesic arc on a given manifold with a given metric is known, the self-parallel-translation formula is known too, as it holds:

τ_{c_{x,v}(t)}(v) = ċ_{x,v}(t).    (18)

In this case, the complete stepping method (17) may be approximated as follows:

a_k := (1 − ημ) v_k − η ∇_{x_k}V,
x_{k+1} = c_{x_k,a_k}(η),
v_{k+1} = ċ_{x_k,a_k}(η),    (19)

for k = 0, 1, 2, 3, .... Such a scheme ensures that only self-parallel-translation is invoked and that v_{k+1} ∈ T_{x_{k+1}}M indeed.

Type III solution: In the case that the tangent bundle is non-trivial and no closed-form expression for the parallel translation is available, it is possible to resort to the notion of vector translation, which replaces the parallel translation operation invoked in the Type II solution. The complete stepping method reads:

x_{k+1} = R_{x_k}(η v_k),
v_{k+1} = π_{x_{k+1}}((1 − ημ) v_k − η ∇_{x_k}V),    (20)

where k = 0, 1, 2, 3, ... and the operator π_y : E^r → T_yM, for y ∈ M, denotes projection. Namely, the velocity v_{k+1} ∈ T_{x_{k+1}}M is computed by projecting the Euler-stepped velocity. The retraction R : TM → M may be chosen arbitrarily.
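The three schemes share a common skeleton and differ only in how the Euler-stepped velocity is carried to the next tangent space. The following Python sketch is an illustrative template, not code from the paper: the callables grad_V, christoffel, retraction and transport are placeholders to be supplied for each manifold, and the final lines only check the template on the flat space E², where it degenerates to the Euler iteration discussed above.

```python
import numpy as np

def extended_hamiltonian_step(x, v, eta, mu, grad_V, christoffel, retraction, transport):
    """Generic stepping template covering (16), (17)/(19) and (20).

    `christoffel` is the explicit term of the Type-I rule (16); for Types II and III
    the curvature is accounted for by `transport` (parallel translation, respectively
    projection), so `christoffel` is passed as zero in those cases."""
    x_new = retraction(x, eta * v)                    # point update, rule (15)
    a = (1.0 - eta * mu) * v - eta * christoffel(x, v) - eta * grad_V(x)
    v_new = transport(x, x_new, a)                    # identity / parallel translation / projection
    return x_new, v_new

# Degenerate check on E^2: R_x(w) = x + w, zero Christoffel form, identity transport.
A = np.diag([1.0, 3.0])                               # illustrative potential V(x) = x^T A x / 2
x, v = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(2000):
    x, v = extended_hamiltonian_step(
        x, v, eta=1e-2, mu=4.0,
        grad_V=lambda x: A @ x,
        christoffel=lambda x, v: np.zeros_like(x),
        retraction=lambda x, w: x + w,
        transport=lambda x_old, x_new, a: a)
print(x)                                              # approaches the minimizer (the origin)
```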

E. Relationship with the retracted-gradient method

Recall from [21] that the conditions which allow the extended Hamiltonian system to improve over the gradient method may be met by properly selecting the damping coefficient of the dynamical system. In fact, let H denote the Hessian of the potential energy function V at a point of stationarity and let λ_max denote the largest eigenvalue of the symmetric positive-definite matrix H.


If the condition μ² > 2λ_max holds, the dynamical system converges faster than the gradient-steepest-descent one in the proximity of such a stationarity point.

The gradient-steepest-descent rule for a learning problem formulated on the parameter manifold M and described in terms of a cost function V : M → R reads:

ẋ = −∇_xV.    (21)

The above differential equation on the manifold M may be solved numerically by the help of a retraction scheme, namely, by the algorithm:

x_{k+1} = R_{x_k}(−η ∇_{x_k}V),    (22)

with η > 0 playing the role of learning stepsize and k = 0, 1, 2, .... The retraction-based learning rule (22) generalizes the exponential-map-based Euler method on manifolds studied in [24] (see also [3] for an analysis of the error structure of the exponential-map-based Euler method on manifolds).

In principle, it is possible to simplify the expressions for the implementation of the second extended Hamiltonian equation over nontrivial tangent bundles, by choosing the learning stepsize parameter and the damping coefficient such that ημ = 1. In this case, equations (17), (19) and (20) reduce, respectively, to the parallel translation or vector translation of the scaled Riemannian gradient of the potential energy function −(1/μ)∇_xV. In the further hypothesis that the steps along the geodesic curve are short enough, consider the further simplification that the parallel/vector translation of ∇_{x_k}V to the point x_{k+1} approximately equals ∇_{x_{k+1}}V, namely that v_k ≈ −(1/μ)∇_{x_k}V. In such an over-simplified scenario, equation (15) would read:

x_{k+1} ≈ R_{x_k}( −(1/μ²) ∇_{x_k}V ),    (23)

which shows that, under the above hypotheses, the iteration (15) collapses into a retracted-gradient iteration.

Hence, denoting again with λ_max the largest eigenvalue of the Hessian matrix associated with the potential energy function V at a point of stationarity, an extended Hamiltonian learning algorithm with stepsize η and damping coefficient μ such that

√(2λ_max) < μ < η⁻¹    (24)

is expected to improve over a retracted Riemannian gradient algorithm with learning stepsize η² and hence it is worth invoking.
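For reference, the retracted-gradient rule (22) is straightforward to implement once a retraction and the Riemannian gradient are available. The following sketch runs (22) on the unit hypersphere S^{n−1} with the exponential-map retraction (34) and the Riemannian gradient (35) recalled in Section III-B; the quadratic potential and all numerical values are illustrative placeholders, not an experiment from the paper.

```python
import numpy as np

def retracted_gradient_step(x, eta, grad_V_euclid):
    """One iteration of rule (22) on S^{n-1}, using the exponential-map retraction (34)
    and the Riemannian gradient (35): grad V = (I - x x^T) dV."""
    g = grad_V_euclid(x)
    rg = g - x * (x @ g)                  # Riemannian gradient on the sphere
    w = -eta * rg
    nw = np.linalg.norm(w)
    if nw < 1e-15:
        return x
    return x * np.cos(nw) + w * np.sin(nw) / nw

A = np.diag([4.0, 2.0, 0.5])              # illustrative potential V(x) = x^T A x / 2
x = np.ones(3) / np.sqrt(3.0)
for _ in range(500):
    x = retracted_gradient_step(x, eta=0.05, grad_V_euclid=lambda x: A @ x)
print(x)   # close to +/- e_3, the eigenvector of A with the smallest eigenvalue
```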

III. NUMERICAL IMPLEMENTATION ON SPECIAL MANIFOLDS

The present section explains in detail how the numerical integration methods of section II-D can be applied to manifolds of interest. In particular, on the line of the previous contribution [21], the present section deals with the Stiefel manifold, the special orthogonal group, the unit hypersphere, the Grassmann manifold, the group of symmetric positive definite matrices, the flag manifold and the real symplectic group of matrices. An explanation of the circumstances where the specific manifolds studied in the following are important was reported in the manuscript [21].

A. Implementation of the extended Hamiltonian system over the compact Stiefel manifold

The compact Stiefel manifold is defined as St(n, p) := {x ∈ R^{n×p} | x^T x = e_p}, where p ≤ n, the superscript T denotes matrix transpose and the symbol e_p denotes the p × p identity matrix. In the case of the compact Stiefel manifold St(n, p), two metrics are considered, namely, the Euclidean and the canonical metric. In both cases, the expressions of the geodesic arcs are available. Hence, the Hamiltonian dynamical system (14) may be implemented either by a Type II or by a Type III solution. An expression of the projection operator π_y : R^{n×p} → T_ySt(n, p), for y ∈ St(n, p), is:

π_y(u) = (e_n − (1/2) y y^T) u − (1/2) y u^T y   for the Euclidean metric,
π_y(u) = u − y u^T y                             for the canonical metric.    (25)

The simplest expressions arise in the case of the canonical metric, therefore, the canonical metric is the metric of choice in the present implementation. Moreover, in principle the optimal learning point does not depend on the chosen metric [15]. In the case of the canonical metric, it holds [21]:

Γ_x(v, v) = −v v^T x − x v^T (e_n − x x^T) v,    (26)

c_{x,v}(t) = [x  q] exp( t [ x^T v, −r^T ; r, 0_p ] ) [ e_p ; 0_p ],    (27)

∇_xV = ∂_xV − x ∂_x^T V x,    (28)

where q and r denote the factors of the compact QR decomposition of the matrix (e_n − x x^T) v and 0_p denotes the p × p zero matrix. The self-parallel-translation formula, as obtained from equation (18), is given by:

τ_{c_{x,v}(t)}(v) = [v  −x r^T] exp( t [ x^T v, −r^T ; r, 0_p ] ) [ e_p ; 0_p ].    (29)

By the expressions of the Riemannian gradient of the potential energy function and of the Christoffel form, the following extended Hamiltonian system is obtained:

ẋ = v,
v̇ = v v^T x + x v^T (e_n − x x^T) v − (∂_xV − x (∂_xV)^T x) − μ v.    (30)

According to the expression of the geodesic on the Stiefel manifold endowed with the canonical metric and of the self-parallel-translation formula (29) corresponding to the canonical metric, the extended Hamiltonian system (30) may be implemented by the Type-II solution:

a_k := (1 − ημ) v_k − η ( ∂_{x_k}V − x_k (∂_{x_k}V)^T x_k ),
(q_k, r_k) := qr((e_n − x_k x_k^T) a_k, 0),
P_k := exp( η [ x_k^T a_k, −r_k^T ; r_k, 0_p ] ) [ e_p ; 0_p ],
x_{k+1} = [x_k  q_k] P_k,
v_{k+1} = [a_k  −x_k r_k^T] P_k,    (31)

where η > 0 is a stepsize for the extended Hamiltonian learning system, qr(·, 0) denotes the compact QR factorization operator and k = 0, 1, 2, ....

Moreover, according to the expression of the geodesic on the Stiefel manifold endowed with the canonical metric and of the projector (25) corresponding to the canonical metric, the extended Hamiltonian system (30) may be implemented by the Type-III solution:

(q_k, r_k) := qr((e_n − x_k x_k^T) v_k, 0),
x_{k+1} = [x_k  q_k] exp( η [ x_k^T v_k, −r_k^T ; r_k, 0_p ] ) [ e_p ; 0_p ],
a_k := (1 − ημ) v_k − η ( ∂_{x_k}V − x_k (∂_{x_k}V)^T x_k ),
v_{k+1} = a_k − x_{k+1} a_k^T x_{k+1},    (32)

with η > 0 being a stepsize for the extended Hamiltonian learning system and k = 0, 1, 2, ....

In both cases, the learning state is represented by the pair (x_k, v_k) ∈ TSt(n, p) for k ∈ N.
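As a concrete illustration, the following Python/NumPy sketch implements the Type-III scheme (32) for the compact Stiefel manifold with the canonical metric; the potential and all numerical values are illustrative placeholders (not an experiment from the paper), and scipy.linalg.expm is used for the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

def stiefel_type3_step(x, v, eta, mu, dV):
    """One iteration of the Type-III scheme (32) on St(n, p), canonical metric."""
    n, p = x.shape
    q, r = np.linalg.qr((np.eye(n) - x @ x.T) @ v)            # compact QR factorization
    blk = np.block([[x.T @ v, -r.T], [r, np.zeros((p, p))]])
    x_new = np.hstack([x, q]) @ expm(eta * blk)[:, :p]        # geodesic retraction, formula (27)
    g = dV(x)
    a = (1.0 - eta * mu) * v - eta * (g - x @ g.T @ x)        # Euler-stepped velocity
    v_new = a - x_new @ a.T @ x_new                           # canonical-metric projector (25)
    return x_new, v_new

# Illustrative potential V(x) = trace(x^T A x) / 2 on St(n, p) (a placeholder choice).
n, p = 6, 2
rng = np.random.default_rng(3)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
dV = lambda x: A @ x
x, _ = np.linalg.qr(rng.standard_normal((n, p)))              # random initial point on St(n, p)
v = np.zeros((n, p))
for _ in range(300):
    x, v = stiefel_type3_step(x, v, eta=0.05, mu=3.0, dV=dV)
print(np.linalg.norm(x.T @ x - np.eye(p)))                    # feasibility: stays at round-off level
```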


B. Implementation of the extended Hamiltonian system over the unit hypersphere

The unit hypersphere is defined as S^{n−1} := {x ∈ R^n | x^T x = 1}. Although the dynamics over the unit hypersphere S^{n−1} might be viewed as a special case of the dynamics over a Stiefel manifold (namely, St(n, 1)), it is worth discussing the numerical implementation of the dynamical-learning system (14) over a unit hypersphere as a separate issue, because the expression of the geodesic curve is particularly simple and because a closed-form solution of the parallel-translation equation (11) is available. Hence, in the case of the manifold S^{n−1}, both a Type-II and a Type-III numerical implementation are available.

Recall that, for the unit hypersphere endowed with the canonical metric, it holds [21]:

Γ_x(v, v) = −x ‖v‖²,    (33)

c_{x,v}(t) = x cos(‖v‖t) + v ‖v‖⁻¹ sin(‖v‖t),    (34)

∇_xV = (e_n − x x^T) ∂_xV,    (35)

where the symbol ‖·‖ denotes the vector 2-norm. By gathering the expressions of the Christoffel operator and of the Riemannian gradient of the potential energy, the following extended Hamiltonian system is obtained:

ẋ = v,
v̇ = ‖v‖² x − (e_n − x x^T) ∂_xV − μ v.    (36)

By making use of the identity (12), it is found that, for two arbitrary vectors v, w ∈ T_xS^{n−1}, it holds:

Γ_x(v, w) = −x (v^T w),    (37)

hence the parallel-translation equation (11) for a vector u ∈ T_xS^{n−1} along a geodesic arc c_{x,v}(t) particularizes to:

ẇ(t) − c_{x,v}(t) ċ_{x,v}^T(t) w(t) = 0,  w(0) = u,    (38)

whose solution w(t) represents the parallel translation of the tangent vector u ∈ T_xM. The solution of such an equation may be adapted from [30] (for a non-unitary initial tangent vector v) and the corresponding parallel-translation operator reads:

τ_{c_{x,v}(t)}(u) = (e_n − ‖v‖⁻² v v^T) u + u^T v ‖v‖⁻² ċ_{x,v}(t).    (39)

The component of the vector u orthogonal to the initial velocity of the geodesic c_{x,v} is mapped by an affine translation, while the component of the vector u parallel to the initial velocity of the geodesic is 'copied' along the geodesic. By the expression (34), the geodesic's tangent vector field computes as:

ċ_{x,v}(t) = −x ‖v‖ sin(‖v‖t) + v cos(‖v‖t).    (40)

Hence, the parallel translation on the hypersphere S^{n−1} of the tangent vector u ∈ T_xS^{n−1} along the geodesic arc c_{x,v} of an extent t computes as:

τ_{c_{x,v}(t)}(u) = [ e_n + ( ‖v‖⁻²(cos(‖v‖t) − 1) v − ‖v‖⁻¹ sin(‖v‖t) x ) v^T ] u.    (41)

In view of the numerical implementation, an expression of a projector π_y : R^n → T_yS^{n−1}, for y ∈ S^{n−1}, is of use. According to the canonical metric, it may be chosen as:

π_y(u) = (e_n − y y^T) u.    (42)

According to the expression of the geodesic over the unit hypersphere associated to the canonical metric and of the parallel-translation operator (41), the extended Hamiltonian system (36) may be implemented by a Type-II solution as:

x_{k+1} = x_k cos(η‖v_k‖) + v_k ‖v_k‖⁻¹ sin(η‖v_k‖),
v_{k+1} = [ e_n + ‖v_k‖⁻²(cos(‖v_k‖η) − 1) v_k v_k^T − ‖v_k‖⁻¹ sin(‖v_k‖η) x_k v_k^T ] [ (1 − ημ) v_k − η (e_n − x_k x_k^T) ∂_{x_k}V ],    (43)

with η > 0 being a stepsize for the extended Hamiltonian learning system and k = 0, 1, 2, ....
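A compact Python/NumPy rendering of the Type-II iteration (43) follows; the parallel translation is applied in the form (39), using the geodesic tangent (40). The quadratic potential and all numerical values are illustrative placeholders, not one of the experiments of Section IV.

```python
import numpy as np

def sphere_type2_step(x, v, eta, mu, dV):
    """One iteration of the Type-II scheme (43) on the unit hypersphere S^{n-1}."""
    n = x.size
    a = (1.0 - eta * mu) * v - eta * (np.eye(n) - np.outer(x, x)) @ dV(x)   # Euler-stepped velocity
    nv = np.linalg.norm(v)
    if nv < 1e-12:                       # degenerate case: zero velocity, nothing to translate along
        return x, a
    c, s = np.cos(eta * nv), np.sin(eta * nv)
    x_new = x * c + v * s / nv                                  # geodesic (34) at t = eta
    cdot = -x * nv * s + v * c                                  # geodesic tangent (40) at t = eta
    v_new = a - v * (v @ a) / nv**2 + (a @ v) / nv**2 * cdot    # parallel translation (39) of a
    return x_new, v_new

A = np.diag([3.0, 1.0, 0.2])                                    # illustrative potential V(x) = x^T A x / 2
x, v = np.ones(3) / np.sqrt(3.0), np.zeros(3)
for _ in range(800):
    x, v = sphere_type2_step(x, v, eta=0.02, mu=3.0, dV=lambda x: A @ x)
print(x, np.linalg.norm(x))              # near a minor eigenvector of A, with unit norm preserved
```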

In addition, according to the expression of the geodesic over the unit hypersphere associated to the canonical metric and of the projector (42), the extended Hamiltonian system (36) may be implemented by a Type-III solution as:

x_{k+1} = x_k cos(η‖v_k‖) + v_k ‖v_k‖⁻¹ sin(η‖v_k‖),
v_{k+1} = ( e_n − x_{k+1} x_{k+1}^T ) [ η (x_k x_k^T − e_n) ∂_{x_k}V + (1 − ημ) v_k ],    (44)

with η > 0 being a learning stepsize and k = 0, 1, 2, .... In both cases, the learning state is represented by the variable pair (x_k, v_k) ∈ TS^{n−1} for any k ∈ N.

A differentiable manifold closely related to the unit hypersphere is the oblique manifold [51], defined as:

OB(n) := {x ∈ R^{n×n} | diag(x^T x) = e_n},    (45)

where the operator diag(·) returns the zero matrix except for the main diagonal, which is copied from the main diagonal of its argument. The geometry of the oblique manifold may be easily studied on the basis of the geometry of the unit hypersphere.

C. Implementation of the extended Hamiltonian system over the special orthogonal group

The manifold of special orthogonal matrices is defined as SO(n) := {x ∈ R^{n×n} | x^T x = e_n, det(x) = 1}. The manifold SO(n) of special orthogonal matrices may be regarded as a Lie group, with Lie algebra:

so(n) = {ω ∈ R^{n×n} | ω^T + ω = 0},    (46)

hence it affords an implementation of Type I. The geometrical expressions of use in the present section, associated with the special orthogonal group endowed with the canonical metric, read [21]:

Γ_x(xω, xω) = −x ω²,    (47)

c_{x,xω}(t) = x exp(tω),    (48)

∇_xV = (1/2) ( ∂_xV − x ∂_x^T V x ).    (49)

By replacing the expressions of the Christoffel form and of the Riemannian gradient into the general extended Hamiltonian learning equations (14), straightforward calculations lead to the special orthogonal group dynamics:

ẋ = x ω,
ω̇ = −(1/2) ( x^T ∂_xV − ∂_x^T V x ) − μ ω.    (50)

The kinetic energy has expression K = −(1/2) tr(ω²).

It is worth noting that the second equation of the system (50) describes a dynamics over the Lie algebra so(n), which is a linear space (namely, the tangent space to the manifold at the origin), while the second equation of the general system (14) describes a dynamics over a collection of tangent spaces, different from each other. This fact noticeably simplifies the numerical integration of the equations (50) with respect to the general case. In particular, the second equation may be integrated numerically by a Euler stepping method, while the first one may be integrated via a suitable retraction of the special orthogonal group.


Namely, the system (50) may be implemented by a Type-I solution as:

x_{k+1} = R_{x_k}(η x_k ω_k),
ω_{k+1} = ω_k + η ( −(1/2) ( x_k^T ∂_{x_k}V − (∂_{x_k}V)^T x_k ) − μ ω_k ),    (51)

where η > 0 plays the role of learning stepsize for the extended Hamiltonian learning system. The learning state is represented by the pair (x_k, ω_k), with x_k ∈ SO(n) and ω_k ∈ so(n) for k ∈ N.
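A Python/NumPy sketch of the Type-I iteration (51) follows, using the exponential map (48) as the retraction (an admissible choice, since any exponential map is a retraction). The potential V(x) = −trace(a^T x), whose Euclidean gradient is ∂_xV = −a, is an illustrative placeholder rather than an experiment from the paper.

```python
import numpy as np
from scipy.linalg import expm

def so_type1_step(x, omega, eta, mu, dV):
    """One iteration of the Type-I scheme (51) on SO(n), with R_x(x w) = x expm(w)."""
    g = dV(x)
    omega_new = omega + eta * (-0.5 * (x.T @ g - g.T @ x) - mu * omega)   # Euler step in so(n)
    x_new = x @ expm(eta * omega)                                         # retraction of eta * x * omega
    return x_new, omega_new

n = 3
rng = np.random.default_rng(4)
a = rng.standard_normal((n, n))
dV = lambda x: -a                       # illustrative potential V(x) = -trace(a^T x)
x, omega = np.eye(n), np.zeros((n, n))
for _ in range(1000):
    x, omega = so_type1_step(x, omega, eta=0.02, mu=3.0, dV=dV)
print(np.linalg.norm(x.T @ x - np.eye(n)))     # x stays orthogonal (round-off level)
print(np.trace(a.T @ x))                       # objective approaches a (local) maximum over SO(n)
```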

For a list of suitable retractions of the orthogonal group (besides the exponential map associated to the geodesic), readers might see the publication [17]. For example, an alternative approach to the exponential-map-based retraction would be the Cayley retraction. However, such a solution does not look appealing because of some numerical problems related to the singularities of the Cayley map, which were recently underlined in [29].

A differentiable manifold closely related to the special orthogonal group is the special Euclidean manifold, denoted as SE(n), which finds application, primarily, in computer graphics and in robotics (see, e.g., [54]). The special Euclidean manifold is a set of (n+1) × (n+1) matrices defined as:

SE(n) := { [ r  δ ; 0  1 ] | r ∈ SO(n), δ ∈ R^n }.    (52)

The geometrical features of the special Euclidean manifold may be deduced from those of the special orthogonal group for what concerns the geometrical integration of differential equations on the tangent bundle TSE(n).

D. Implementation of the extended Hamiltonian system over the Grassmann manifold

A Grassmann manifold Gr(n, p) is a set of subspaces of R^n spanned by p independent vectors. A representation of any such subspace may be assumed as the equivalence class [x] = {xρ | x ∈ St(n, p), ρ ∈ R^{p×p}, ρ^T ρ = e_p}. In practice, an element [x] of the Grassmann manifold Gr(n, p) is represented by a matrix in St(n, p) whose columns span the space [x]. In the case of learning on the Grassmann manifold Gr(n, p), closed-form expressions for the geodesic curves as well as for the parallel translation operator may be taken advantage of, hence an implementation of Type II may be summoned. The geometrical expressions of use in the present section, associated with the Grassmann manifold endowed with the canonical metric, read [21]:

c_{[x],v}(t) = [xβ  α] [ cos(σt) ; sin(σt) ] β^T,    (53)

∇_xV = (e_n − x x^T) ∂_xV,    (54)

where α σ β^T denotes the compact singular value decomposition of the matrix v. The parallel translation of a vector u ∈ T_{[x]}Gr(n, p) along the geodesic c_{[x],v}(t), with v ∈ T_{[x]}Gr(n, p), is given by [10]:

τ_{c_{[x],v}(t)}(u) = ( [xβ  α] [ −sin(σt) ; cos(σt) ] α^T + e_n − α α^T ) u.    (55)

According to the expression of the geodesic over the Grassmann manifold associated to the canonical metric and of the parallel-translation operator (55), the extended Hamiltonian system may be implemented by a Type-II solution as:

(α_k, σ_k, β_k) := svd(v_k, 0),
x_{k+1} = [x_k β_k  α_k] [ cos(η σ_k) ; sin(η σ_k) ] β_k^T,
a_k := (1 − ημ) v_k − η (e_n − x_k x_k^T) ∂_{x_k}V,
v_{k+1} = ( [x_k β_k  α_k] [ −sin(η σ_k) ; cos(η σ_k) ] α_k^T + e_n − α_k α_k^T ) a_k,    (56)

where η > 0 is a stepsize for the extended Hamiltonian learning system, svd(·, 0) denotes the compact singular value decomposition operator and k = 0, 1, 2, .... The learning state is represented by the pair ([x_k], v_k) ∈ TGr(n, p) for k ∈ N. An alternative retraction for the Grassmann manifold to the geodesic-based one is discussed in [34] and is based on the QR decomposition instead of the SVD decomposition.
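The Type-II iteration (56) translates almost verbatim into code. The sketch below, with an illustrative quadratic potential and made-up sizes (not an experiment from the paper), uses NumPy's compact SVD and checks that the representative matrix stays on St(n, p).

```python
import numpy as np

def grassmann_type2_step(x, v, eta, mu, dV):
    """One iteration of the Type-II scheme (56) on Gr(n, p), canonical metric."""
    n, p = x.shape
    # Compact SVD of the (horizontal) velocity: v = alpha @ diag(sigma) @ beta^T.
    alpha, sigma, beta_t = np.linalg.svd(v, full_matrices=False)
    beta = beta_t.T
    cs, sn = np.diag(np.cos(eta * sigma)), np.diag(np.sin(eta * sigma))
    base = np.hstack([x @ beta, alpha])                                   # [x beta  alpha]
    x_new = base @ np.vstack([cs, sn]) @ beta.T                           # geodesic (53) at t = eta
    a = (1.0 - eta * mu) * v - eta * (np.eye(n) - x @ x.T) @ dV(x)        # Euler-stepped velocity
    transport = base @ np.vstack([-sn, cs]) @ alpha.T + np.eye(n) - alpha @ alpha.T
    return x_new, transport @ a                                           # parallel translation (55)

n, p = 6, 2
rng = np.random.default_rng(5)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
dV = lambda x: A @ x                     # illustrative potential V = trace(x^T A x) / 2
x, _ = np.linalg.qr(rng.standard_normal((n, p)))
v = 1e-2 * (np.eye(n) - x @ x.T) @ rng.standard_normal((n, p))   # small horizontal velocity
for _ in range(400):
    x, v = grassmann_type2_step(x, v, eta=0.05, mu=2.5, dV=dV)
print(np.linalg.norm(x.T @ x - np.eye(p)))     # the representative stays on St(n, p)
```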

E. Implementation of the extended Hamiltonian system over a flag manifold

A flag in a finite-dimensional vector space Ψ ⊂ R^n is a sequence of subspaces Φ_i, where each subspace Φ_i is a proper subspace of the next, namely 0 ⊂ Φ_1 ⊂ Φ_2 ⊂ ... ⊂ Φ_r = Ψ. Setting d_i := dim Φ_i, it holds that 0 = d_0 < d_1 < d_2 < ··· < d_r = p, where p ≤ n is the dimension of the space Ψ. Any flag may be split into a direct sum of vector spaces Ψ_i ⊂ R^n of dimensions n_i := dim Ψ_i > 0 such that n_i = d_i − d_{i−1}, for i = 1, ..., r, namely Ψ := Ψ_1 ⊕ Ψ_2 ⊕ Ψ_3 ⊕ ... ⊕ Ψ_r. The dimension of the space Ψ satisfies p = Σ_{i=1}^r n_i ≤ n. The set of all such vector spaces is termed flag manifold² and is denoted by Fl(n, n_1, n_2, ..., n_r). Points on the flag manifold [x] ∈ Fl(n, n_1, n_2, ..., n_r) are represented by matrices x ∈ St(n, p) that obey the following rule: all the matrices x diag(ρ_1, ρ_2, ..., ρ_r) with ρ_i ∈ R^{n_i×n_i}, ρ_i^T ρ_i = e_{n_i}, i = 1, 2, ..., r, represent the same point on the flag manifold as the matrix x. It is convenient to partition the matrix representing a given point [x] ∈ Fl(n, n_1, n_2, ..., n_r) as x = [x_(1) x_(2) ... x_(r)], x_(i) ∈ R^{n×n_i}, i = 1, 2, ..., r. It is worth remarking the following properties:

• If n_1 = n_2 = ... = n_r = 1, then the flag manifold Fl(n, n_1, n_2, ..., n_r) reduces to the Stiefel manifold St(n, r).
• If r = 1, then the flag manifold Fl(n, n_1, n_2, ..., n_r) reduces to the Grassmann manifold Gr(n, n_1).

The geometrical expressions of interest in the present section, associated with the flag manifold endowed with the geometrical structure devised in [39], read:

c_{[x],v}(t) = exp( t (ṽ x^T − x ṽ^T) ) x,   ṽ := ( e_n − (1/2) x x^T ) v,    (57)

(∇_xV)_(i) = ( e_n − x_(i) x_(i)^T ) ∂_{x_(i)}V − Σ_{j≠i} x_(j) ∂_{x_(j)}^T V x_(i).    (58)

The above expressions are associated to the canonical metric of the Stiefel manifold, which is assumed as the metric of choice for the flag manifold [39]. In order to implement the extended Hamiltonian equations for the generalized flag manifold by the retraction/vector-translation scheme, it is useful to write an expression for the projector π_{[y]} : R^{n×p} → T_{[y]}Fl(n, n_1, n_2, ..., n_r), for [y] ∈ Fl(n, n_1, n_2, ..., n_r). Such a projector may be expressed, block-wise, as:

π_{[y]}(u)_(i) = ( e_n − y_(i) y_(i)^T ) u_(i) − Σ_{j≠i} y_(j) u_(i)^T y_(i).    (59)

The dynamical system associated to the flag manifold Fl(n, n_1, n_2, ..., n_r) may be numerically implemented via the Type-III solution:

ṽ_k := ( e_n − (1/2) x_k x_k^T ) v_k,
x_{k+1} = exp( η (ṽ_k x_k^T − x_k ṽ_k^T) ) x_k,
a_k := (1 − ημ) v_k − η (e_n − x_k x_k^T) ∂_{x_k}V,
v_{k+1} = π_{[x_{k+1}]}(a_k),    (60)


where η > 0 is a stepsize for the extended Hamiltonian learning system and k = 0, 1, 2, .... The learning state is represented by the pair ([x_k], v_k) ∈ TFl(n, n_1, n_2, ..., n_r) for k ∈ N.

²The definition given here amends the incorrect definition given in the paper [21].

F. Implementation of the extended Hamiltonian system over the manifold of symmetric positive-definite matrices

The manifold of symmetric positive definite matrices is defined as S⁺(n) := {x ∈ R^{n×n} | x^T − x = 0, x > 0}. The manifold S⁺(n) of symmetric positive-definite matrices, endowed with its canonical metric, has associated the following geometric quantities:

Γ_x(v, v) = −v x⁻¹ v,    (61)

c_{x,v}(t) = x^{1/2} exp(t x^{−1/2} v x^{−1/2}) x^{1/2},    (62)

∇_xV = (1/2) x ( ∂_xV + ∂_x^T V ) x.    (63)

The geodesic distance between two given symmetric positive-definite matrices may be calculated according to the definition (4). The distance associated to the canonical metric of the manifold of symmetric positive-definite matrices is given by d(x, y) = ‖log(x⁻¹ y)‖_F, for x, y ∈ S⁺(n) (see, e.g., [36]). In the above expression, the symbol ‖·‖_F denotes the Frobenius norm.

Recall that T_xS⁺(n) = S(n), namely, the set of n × n symmetric matrices. Hence, all tangent spaces coincide with each other. Such an observation leads to an implementation of Type I. By gathering the expressions of the Christoffel operator and of the Riemannian gradient of the potential energy function, the following extended Hamiltonian learning system equations are obtained:

ẋ = v,
v̇ = v x⁻¹ v − (1/2) x ( ∂_xV + ∂_x^T V ) x − μ v.    (64)

The kinetic energy function has expression K = (1/2) tr((v x⁻¹)²).

The second equation of the system (64) describes a dynamics over the vector space of symmetric matrices and may be integrated numerically by a Euler stepping method, while the first one may be integrated via a retraction of the manifold of symmetric positive definite matrices. Namely, the system (64) may be implemented by a Type-I solution:

x_{k+1} = R_{x_k}(η v_k),
v_{k+1} = η [ v_k x_k⁻¹ v_k − (1/2) x_k ( ∂_{x_k}V + ∂_{x_k}^T V ) x_k ] + (1 − ημ) v_k,    (65)

where η > 0 plays the role of learning step-size for the extended Hamiltonian learning system and k = 0, 1, 2, .... The learning state is represented by x_k ∈ S⁺(n) and v_k ∈ S(n) for k ∈ N.

A suitable retraction map of the manifold of symmetric positive definite matrices is the exponential map associated to the geodesic, namely R_x(v) = x^{1/2} exp( x^{−1/2} v x^{−1/2} ) x^{1/2}.
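The following Python/NumPy sketch implements the Type-I iteration (65) with the exponential-map retraction just recalled. The potential, which simply drives x toward a fixed target matrix, and all numerical values are illustrative placeholders, not one of the averaging experiments of Section IV.

```python
import numpy as np
from scipy.linalg import expm

def _sqrtm_spd(x):
    """Symmetric square root of an SPD matrix via its eigendecomposition."""
    w, u = np.linalg.eigh(x)
    return u @ np.diag(np.sqrt(w)) @ u.T

def spd_type1_step(x, v, eta, mu, dV):
    """One iteration of the Type-I scheme (65) on S+(n), exponential-map retraction."""
    g = dV(x)
    xinv = np.linalg.inv(x)
    v_new = (1.0 - eta * mu) * v + eta * (v @ xinv @ v - 0.5 * x @ (g + g.T) @ x)
    xs = _sqrtm_spd(x)
    xs_inv = np.linalg.inv(xs)
    x_new = xs @ expm(eta * xs_inv @ v @ xs_inv) @ xs            # R_x(eta v)
    return x_new, v_new

a = np.array([[2.0, 0.3], [0.3, 1.0]])                           # illustrative target matrix
dV = lambda x: x - a                                             # placeholder V(x) = ||x - a||_F^2 / 2
x, v = np.eye(2), np.zeros((2, 2))
for _ in range(400):
    x, v = spd_type1_step(x, v, eta=0.05, mu=2.0, dV=dV)
print(x)                                                         # approaches the target matrix a
print(np.linalg.eigvalsh(x))                                     # eigenvalues remain positive
```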

A related problem arises about the manifold S⁺(n, p) of symmetric fixed-rank positive-semidefinite matrices, which may be defined as:

S⁺(n, p) := {U R² U^T | U ∈ St(n, p), R² ∈ S⁺(p)},  p < n.    (66)

The interest in symmetric fixed-rank positive-semidefinite matrices stems from the observation that, with the growing use of low-rank approximations of matrices as a way to retain tractability in large-scale applications, there is a need to extend the calculus of positive-definite matrices to their low-rank counterparts [4].

G. Implementation of the extended Hamiltonian system over the group of real symplectic matrices

The manifold of real symplectic matrices is defined as follows:

Sp(2n, R) = {x ∈ R^{2n×2n} | x^T q x = q},  q = [ 0_n  e_n ; −e_n  0_n ].    (67)

The manifold Sp(2n, R) is regarded as a Lie group with Lie algebra:

sp(2n, R) = {h ∈ R^{2n×2n} | h^T q + q h = 0_{2n}}.    (68)

For simplicity, the dimension n is not indicated in the notation of the fundamental skew-symmetric matrix q. The real symplectic group is supposed to be endowed with the Bloch-Crouch-Marsden-Sanyal metric, as explained in [21]. Upon such a choice, the real symplectic group has associated the following geometric quantities [21]:

Γ_x(v, v) = −v x⁻¹ v + x v^T q x q x⁻¹ v − v v^T q x q,    (69)

∇_xV = (1/2) x q ( ∂_x^T V x q − q x^T ∂_xV ).    (70)

The complete extended Hamiltonian system describing a learning dynamics over the real symplectic group may now be obtained by inserting the expression of the Christoffel matrix function Γ_x and the expression of the Riemannian gradient ∇_xV of the potential energy function into the general extended Hamiltonian system (14). Calculations lead to the following expressions:

ẋ = x h,
ḣ = h^T h − h h^T − (1/2) q ( (∂_xV)^T x q − q x^T ∂_xV ) − μ h.    (71)

The kinetic energy function has expression K = (1/2) tr(h^T h).

As in the case of the orthogonal group SO(n), the second equation describes a dynamics over the Lie algebra sp(2n, R). Hence, the second equation may be integrated numerically by a Euler stepping method, while the first one may be integrated via a suitable retraction of the symplectic group. Namely, the system (71) may be implemented by a Type-I solution as:

x_{k+1} = R_{x_k}(η x_k h_k),
h_{k+1} = η ( h_k^T h_k − h_k h_k^T ) + (1 − ημ) h_k − (1/2) η q ( ∂_{x_k}^T V x_k q − q x_k^T ∂_{x_k}V ),    (72)

where the parameter η > 0 again plays the role of learning stepsize for the extended Hamiltonian learning system and k = 0, 1, 2, .... The learning state is represented by the couple x_k ∈ Sp(2n, R) and h_k ∈ sp(2n, R) for k ∈ N. Possible retractions of the real symplectic group are R_x(v) = x exp(x⁻¹ v), whose properties were studied in the contributions [20], [27], and the Cayley map, as explained in [1].
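A Python/NumPy sketch of the Type-I iteration (72) follows, using the retraction R_x(v) = x exp(x⁻¹ v) recalled above, so that R_{x_k}(η x_k h_k) = x_k expm(η h_k). The potential V(x) = ‖x − e_{2n}‖_F²/2, whose Euclidean gradient is x − e_{2n}, and all numerical values are illustrative placeholders, not an experiment from the paper.

```python
import numpy as np
from scipy.linalg import expm

def sp_type1_step(x, h, q, eta, mu, dV):
    """One iteration of the Type-I scheme (72) on Sp(2n, R), with R_x(x h) = x expm(h)."""
    g = dV(x)
    h_new = (eta * (h.T @ h - h @ h.T) + (1.0 - eta * mu) * h
             - 0.5 * eta * q @ (g.T @ x @ q - q @ x.T @ g))
    x_new = x @ expm(eta * h)                        # retraction of eta * x * h
    return x_new, h_new

n = 2
q = np.block([[np.zeros((n, n)), np.eye(n)], [-np.eye(n), np.zeros((n, n))]])
dV = lambda x: x - np.eye(2 * n)                     # illustrative potential V(x) = ||x - I||_F^2 / 2
rng = np.random.default_rng(6)
w = 0.2 * rng.standard_normal((2 * n, 2 * n))
h0 = q @ (w + w.T)                                   # a random element of sp(2n, R): q times a symmetric matrix
x, h = expm(h0), np.zeros((2 * n, 2 * n))            # start from a symplectic matrix away from the identity
for _ in range(600):
    x, h = sp_type1_step(x, h, q, eta=0.02, mu=3.0, dV=dV)
print(np.linalg.norm(x.T @ q @ x - q))               # x remains symplectic (round-off level)
print(np.linalg.norm(x - np.eye(2 * n)))             # should be much smaller than initially
```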

IV. NUMERICAL EXPERIMENTS

The present section aims at illustrating the numerical behavior of the extended Hamiltonian learning paradigm. In the present manuscript, no particular attention is paid to the question of the computational complexity of the learning algorithms explained in section III nor to the optimization of matrix-type expressions to attain minimal redundancy. The implementation of the learning algorithms of section III requires the repeated computation of QR and SVD factorizations as well as of matrix exponentiation, which makes their complexity of order O(p³) whenever matrices of size p × p are involved. The numerical issues behind QR and SVD factorizations are well known [25]. Advancements on the exponentiation of structured matrices may be read in the papers [6], [46], [48] and references therein. The present section discusses the following cases: 1) Experiments of learning over the manifold M = R² that aim at illustrating numerically the behavior of the extended Hamiltonian system when the damping condition (24) is/is-not verified, in comparison to the behavior of the gradient learning system. 2) An experiment of learning over the manifold M = R that aims at illustrating the benefit of the initial momentum in escaping from the plateaus of the potential energy function landscape. 3) A discussion of learning over the manifold M = SO(2) that aims at illustrating the need of appropriate geometrical integration of the extended Hamiltonian learning equations.

Page 9: Extended Hamiltonian Learning on Riemannian Manifolds ...web.dibet.univpm.it/fiori/publications/dynamical_part_2.pdf · initiated with the contribution S. Fiori, Extended Hamiltonian

FIORI: EXTENDED HAMILTONIAN LEARNING ON RIEMANNIAN MANIFO LDS: NUMERICAL ASPECTS 9

learning equations. 4) Experiments of learning over the manifoldM = Sn−1 on applications such as minor component analysisand blind channel deconvolution. 5) Experiments of learning overthe manifoldM = S+(n) on positive-definite symmetric matrixaveraging. In the following subsections, the parametert is interpretedas a time parameter ranging in the interval[0, 1] and is measured inseconds. All the experiments were coded in MATLAB.

A. Experiments of learning on the space R^2

In order to illustrate numerically the theoretical developments about the comparison between a dynamical system and a gradient-based system recalled in subsection II-E, the present subsection deals with a learning problem in the manifold M = R^2, which affords an easy graphical rendering of the behavior of such methods. The potential energy function is V(x) = (1/2) x^T A x, with A ∈ S^+(2), and the sought-for optimal connection pattern x⋆ in the present unconstrained case coincides with the origin of the plane R^2. It is of particular interest here to illustrate numerically the behavior of the dynamical system when the condition (24) is not verified and when it is verified. In the present case, the eigenvalue λmax coincides with the maximum eigenvalue of the matrix A. The Figure 3 corresponds to the case that µ = 0.2√(2λmax). Such a choice of the damping parameter corresponds to an under-damped dynamical system that exhibits a slowly-converging oscillating behavior.

Fig. 3. Experiment of learning on the plane R^2: Under-damped dynamical system. Left-hand panel: Course of the criterion function V(x(t)). Right-hand panel: Learning seen over the parameter-manifold R^2. (Solid line: Extended Hamiltonian learning algorithm. Dashed line: Gradient algorithm.)

The Figure 4, conversely, corresponds to the case that µ = 1.2√(2λmax).

Such a choice of parameters meets condition (24) and corresponds to a damped dynamical system that exhibits a non-oscillating behavior and that converges faster than the gradient-based learning system (both with the same learning stepsize η = 0.001).
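For reference, a minimal MATLAB sketch of this comparison is reported below; since R^2 is flat, the Christoffel term vanishes and the Euler step acts as a legitimate retraction. The matrix A, the initial state and the iteration count are illustrative assumptions, not the values of the reported experiment, and a unit mass M = 1 is assumed.

% Extended Hamiltonian learning versus gradient descent on M = R^2
% for V(x) = 0.5*x'*A*x (illustrative data, unit mass).
A   = [2.0 0.3; 0.3 1.0];              % assumed symmetric positive-definite matrix
eta = 0.001;                           % learning stepsize
mu  = 1.2*sqrt(2*max(eig(A)));         % damping parameter meeting condition (24)
x  = [-1.5; 0.6];  v = zeros(2,1);     % state-pair (x, v) of the dynamical system
xg = x;                                % state of the gradient-based system
for k = 1:5000
    v  = v + eta*(-A*x - mu*v);        % velocity update (flat space: no Christoffel form)
    x  = x + eta*v;                    % position update (Euler step = retraction on R^2)
    xg = xg - eta*(A*xg);              % gradient steepest-descent update
end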

B. Experiments of learning on the space R

In order to illustrate the benefit of the initial momentum in escaping from the plateaus of the potential energy function landscape, an experiment of learning over the manifold M = R is discussed. The potential energy function used in the present numerical test is illustrated in the Figure 5 in the top-right panel. The results reported in the Figure 5 correspond to the case that x1 = (98/100) x0. The initial point x0 corresponds to a flat zone of the potential energy function, and the shown numerical results confirm numerically that the gradient method is unable to escape the plateau, while the dynamical system is able to move out of the plateau thanks to the initial kinetic energy corresponding to the impressed initial momentum.
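The mechanism may be reproduced with the toy one-dimensional MATLAB sketch below; the sigmoidal potential, the initial point and the impressed initial momentum are assumptions made only for illustration and do not coincide with the data of the reported experiment.

% Toy illustration of plateau escape on M = R (assumed potential, unit mass).
V   = @(x) 2 + tanh(4*(x - 1));   % plateau for x >> 1, lower basin for x << 1
dV  = @(x) 4*sech(4*(x - 1)).^2;  % derivative of the potential
eta = 0.01;  mu = 0.5;            % stepsize and viscous damping
x  = 3;  v = -2;                  % dynamical state with impressed initial momentum
xg = 3;                           % gradient-descent state (zero initial momentum)
for k = 1:2000
    v  = v + eta*(-dV(x) - mu*v); % the kinetic energy carries x across the plateau
    x  = x + eta*v;
    xg = xg - eta*dV(xg);         % the gradient step remains stuck on the plateau
end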

Fig. 4. Experiment of learning on the plane R^2: Damped dynamical system that meets condition (24). Left-hand panel: Course of the criterion function V(x(t)). Right-hand panel: Learning seen over the parameter-manifold R^2. (Solid line: Extended Hamiltonian learning algorithm. Dashed line: Gradient algorithm.)

Fig. 5. Experiment of learning on the line R: Effect of the initial momentum in escaping from the plateaus of the potential energy function landscape. Top-left panel: Course of the criterion function V(x(t)). Top-right panel: Shape of the potential function V(x) and learning course represented by a dotted curve (the starting point x0 is denoted by an open circle). Bottom-left panel: Kinetic energy. Bottom-right panel: Course of x(t). (Solid line: Extended Hamiltonian learning algorithm. Dashed line: Gradient algorithm.)

C. About learning on the space SO(2)

To render the problem that arises about the numerical implementation of the dynamical learning equations on manifolds, it is worth revisiting the explanatory example discussed in the concluding section of paper [21] about the low-dimensional manifold SO(2). By embedding the space SO(2) into the space R^{2×2}, any element of SO(2) may be regarded as a 2-by-2 real-valued matrix x = [x11 x12; x21 x22] whose entries must satisfy the constraints x11^2 + x21^2 = 1, x12^2 + x22^2 = 1, x11 x12 + x21 x22 = 0 and x11 x22 − x12 x21 = 1.


The extended Hamiltonian system learning equation for the variable x ∈ SO(2) in (50) is a special case of a general differential equation ẋ = F(x) on the manifold SO(2), with F : x ∈ SO(2) ↦ F(x) ∈ T_x SO(2); namely, it may be broken down into a set of four differential equations of the type ẋij(t) = Fij(x11(t), x12(t), x21(t), x22(t)). In the present case, it holds that F(x) = xω with ω = [0 Ω; −Ω 0], where Ω = Ω(x(t)) ∈ R. The Euler stepping technique of numerical calculus to solve the above system of differential equations would read x_{k+1} = x_k + ηF(x_k), with η > 0 denoting a learning stepsize. As explained in the conclusions of paper [21], such a numerical stepping method does not take into account the constraints on the entries of the matrix x; namely, it generates a trajectory k ↦ x_k in the ambient space R^{2×2} rather than in the feasible space SO(2). Namely, starting from a point x_k ∈ SO(2), it would produce a new point x_{k+1} ∉ SO(2). The reason for such behavior is that the Euler numerical integration technique insists on the flat space R^n and does not cope with curved manifolds. An appropriate numerical stepping method is the one reported in equations (51), based on the notion of manifold retraction.

It is instructive to investigate in detail the effect of the Euler stepping method in the solution of the differential equation ẋ = xω. In such a context, the Euler stepping equation reads:

x_{k+1} = x_k (e_2 + \eta \omega_k), \quad \text{with } \omega_k = \begin{bmatrix} 0 & \Omega_k \\ -\Omega_k & 0 \end{bmatrix}. (73)

As the starting point x_0 satisfies x_0^T x_0 = e_2, it holds:

x_1^T x_1 = (e_2 + \eta \omega_0)^T x_0^T x_0\, (e_2 + \eta \omega_0) = e_2 - \eta^2 \omega_0^2. (74)

Note that ω_k^2 = −Ω_k^2 e_2, hence x_1^T x_1 = (1 + η^2 Ω_0^2) e_2. The first step x_1 loses the normality of the columns by an additive amount η^2 Ω_0^2 and changes its determinant from 1 to √(1 + η^2 Ω_0^2). However, it keeps the orthogonality of the columns of the matrix x (such a phenomenon is peculiar to the case SO(2) only and does not carry over to the general case SO(n) with n > 2). For the next step, it holds that:

x_2^T x_2 = (e_2 + \eta \omega_1)^T x_1^T x_1\, (e_2 + \eta \omega_1) = (1 + \eta^2 \Omega_0^2)(e_2 - \eta^2 \omega_1^2) = (1 + \eta^2 \Omega_0^2)(1 + \eta^2 \Omega_1^2)\, e_2. (75)

By reasoning by induction, it is readily verified that the matrix x_k keeps monotonically losing the normality of its two columns by an identical amount, while it retains the mutual orthogonality of the columns. On the other hand, the geometric integration paradigm applied to the present case, with geodesic-based retraction, reads:

x_{k+1} = x_k \exp(\eta \omega_k), \quad \text{with } \omega_k = \begin{bmatrix} 0 & \Omega_k \\ -\Omega_k & 0 \end{bmatrix}. (76)

Note that \exp(\eta\omega_k) = \begin{bmatrix} \cos(\eta\Omega_k) & \sin(\eta\Omega_k) \\ -\sin(\eta\Omega_k) & \cos(\eta\Omega_k) \end{bmatrix}. It is straightforward to verify that the rule (76) implies that x_{k+1}^T x_{k+1} = x_k^T x_k = e_2.
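The drift predicted by (74)-(75) and its absence under the geodesic rule (76) can be verified directly with the MATLAB check below; the sequence Ω_k used here is an arbitrary illustrative choice.

% Numerical check of the Euler rule (73) versus the geodesic rule (76) on SO(2).
eta = 0.1;  x_euler = eye(2);  x_geo = eye(2);
for k = 1:100
    Omega = sin(k/10);                          % arbitrary illustrative angular speed
    w = [0 Omega; -Omega 0];                    % element of the Lie algebra so(2)
    x_euler = x_euler*(eye(2) + eta*w);         % Euler step (73): leaves SO(2)
    x_geo   = x_geo*expm(eta*w);                % geodesic step (76): stays on SO(2)
end
drift_euler = norm(x_euler'*x_euler - eye(2));  % grows monotonically with k
drift_geo   = norm(x_geo'*x_geo   - eye(2));    % remains at machine precision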

D. Experiments of learning on the unit hypersphere

The first experiment of learning on the unit hypersphere concerns minor component analysis (MCA). Minor component analysis is a statistical data analysis technique that aims at determining the eigenvector of a given symmetric positive-definite matrix corresponding to its smallest eigenvalue. Minor component analysis has several applications in machine learning and signal processing (see, e.g., [53]). The potential energy function associated to minor component analysis is:

V(x) = \tfrac{1}{2} x^T A x, (77)

with A ∈ S^+(n) being a covariance matrix whose minor eigenpair is sought for. In the present experiment n = 10, η = 0.1 and µ = 1.01√(2λmax), with λmax being the maximal eigenvalue of the matrix A, and a Type-III implementation was selected. Any stationary point x⋆ ∈ S^{n−1} of the extended Hamiltonian learning system satisfies ∇_{x⋆}V = 0, namely (e_n − x⋆ x⋆^T) A x⋆ = 0. The latter condition may be written equivalently as A x⋆ = 2V(x⋆) x⋆, which confirms that the extended Hamiltonian system (14) over the manifold S^{n−1} with the potential energy function (77) evolves toward the eigenvector x⋆ ∈ S^{n−1} of the covariance matrix A corresponding to the minimal eigenvalue 2V(x⋆). The Figure 6 shows learning curves of the extended Hamiltonian algorithm compared to the retracted Riemannian gradient algorithm, averaged over 100 independent trials. In particular, the top panel shows the value of the criterion function V(x(t)) as well as the value of the minimal criterion function V(x⋆). The middle panel shows the kinetic energy K_{x(t)}(ẋ(t), ẋ(t)), while the bottom panel shows the value of the consistency index C(t), defined as C(t) = x^T(t) v(t). The consistency index measures the numerical precision of parallel translation and should take values as close as possible to zero. The results show that the extended Hamiltonian learning algorithm is effective and converges much faster than the corresponding retracted gradient-based learning algorithm endowed with the same learning stepsize value η.
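As a term of comparison, the retracted Riemannian gradient baseline for the criterion (77) may be sketched in MATLAB as follows; the normalization-based retraction and the randomly generated covariance matrix are assumptions made for illustration, while the Type-III extended Hamiltonian implementation follows section II and is not reproduced here.

% Retracted Riemannian gradient baseline for MCA on the unit hypersphere.
n = 10;  B = randn(n);  A = (B*B')/n;      % assumed covariance matrix in S+(n)
eta = 0.1;
x = randn(n,1);  x = x/norm(x);            % initial point on S^(n-1)
for k = 1:500
    g = (eye(n) - x*x')*(A*x);             % Riemannian gradient of V(x) = 0.5*x'*A*x
    x = x - eta*g;                         % gradient step in the tangent space
    x = x/norm(x);                         % normalization-based retraction
end
lambda_estimate = 2*(0.5*x'*A*x);          % at a stationary point, A*x = 2*V(x)*x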

Fig. 6. Experiment of learning on the unit hypersphere: Minor component analysis. Top panel: Criterion function V(x(t)). Middle panel: Kinetic energy K_{x(t)}(ẋ(t), ẋ(t)). Bottom panel: Consistency index C(t). (Solid line: Extended Hamiltonian learning algorithm. Dot-dashed line: Retracted Riemannian gradient algorithm. Dashed line: Target learning criterion value (minimal potential energy function).) Curves averaged over 100 independent trials on the same learning problem.

The second experiment of learning on the unit hypersphere concerns blind channel deconvolution. Blind deconvolution is a signal processing technique that aims at recovering a signal distorted by a transmission system. Deconvolution arises naturally when dealing with finite multi-path interference on a signal, such as in marine seismic exploration [41] and in speech dereverberation [8]. Recent notable applications are to bar code reconstruction [11], image integrity verification in forensics [50], high-resolution mapping of transcription factor binding sites in DNA chains [33] and to the thermodynamic characterization of thin films [13]. In particular, the 'Bussgang' family of blind deconvolution algorithms was introduced in [2]. For a detailed explanation of the problem, of the blind deconvolution theory and of the experiments see, e.g., [14], [16]. The blind deconvolution ability of an algorithm is measured in terms of the Inter-Symbol Interference (ISI) figure, which should take values as close as possible to zero. In the present experiment n = 14, η = 0.5 and µ = 1 as the result of validation, and a Type-III implementation was selected.


The Figure 7 shows a detail of the behavior of the extended Hamiltonian learning algorithm on a noisy channel with signal-to-noise ratio ranging in {1, 5, 10, 20, ∞} dB. The Figure 8 shows the ISI figures as well as the runtime figures of the Extended Bussgang algorithm, the Bussgang algorithm, the Bussgang algorithm with natural gradient, the Extended Bussgang algorithm with natural gradient and the Extended Hamiltonian learning algorithm. These figures confirm that the extended-Hamiltonian-learning algorithm applied to a real-world blind channel deconvolution problem converges steadily and that its learning performances are comparable to those of existing algorithms at a lower computational cost: in fact, the extended-Hamiltonian-learning algorithm applied to a blind channel deconvolution problem exhibits the lowest ISI index and is the fastest to run.

Fig. 7. Experiment of learning on the unit hypersphere: Blind channel deconvolution. Top panel: ISI figure during learning corresponding to signal-to-noise ratios (SNR) of 1, 5, 10, 20, ∞ dB. Bottom panel: Criterion function V(x(t)).

Fig. 8. Experiment of learning on the unit hypersphere: Blind channel deconvolution. Left-hand panel: ISI figures. Right-hand panel: Runtime figures (in seconds). (1 – Extended Bussgang algorithm, 2 – Bussgang algorithm, 3 – Bussgang algorithm with natural gradient, 4 – Extended Bussgang algorithm with natural gradient, 5 – Extended Hamiltonian learning algorithm.)

E. Experiments of learning on the manifold of symmetric positive-definite matrices

Symmetric positive-definite matrices play an important role in machine learning and applications (see, e.g., [18], [36]). Notable applications are to medical imagery analysis [23], data clustering and template matching [43] as well as automatic and intelligent control [5]. The ensemble statistical features of data and the algorithms to estimate them are of prime importance in intelligent data processing. In particular, computing the mean value of a set of data is a widely used technique to smooth out irregularities in the data and to filter out the noise and the measurement errors [22]. Learning the average matrix out of a set of N matrices s_k ∈ S^+(n) may be achieved by seeking the matrix x ∈ S^+(n) that minimizes the spread function

V(x) = \frac{1}{N} \sum_{k=1}^{N} d^2(x, s_k), (78)

where d : S^+(n) × S^+(n) → R denotes a distance function (see subsection III-F). The function (78) measures the variance of the samples s_k around the point x, and the empirical mean value of the distribution is defined as the point that minimizes such variance. In the present experiment N = 50, n = 5, η = 0.1 and µ = 2, and a Type-I implementation was selected. The Figure 9 shows learning curves of the extended Hamiltonian algorithm compared to the retracted Riemannian gradient algorithm.
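For completeness, the spread function (78) may be evaluated as sketched below in MATLAB, under the assumption that d is the affine-invariant distance d(x, s) = ‖logm(x^{-1/2} s x^{-1/2})‖_F; the actual distance is the one chosen in subsection III-F, which is not restated here.

% Evaluation of the spread criterion (78) on S+(n), assuming the
% affine-invariant distance (S is a cell array of N matrices in S+(n)).
function val = spread(x, S)
  N   = numel(S);
  xih = inv(sqrtm(x));              % inverse principal square root of x
  val = 0;
  for k = 1:N
    D   = logm(xih*S{k}*xih);       % log-map of the sample S{k} at the point x
    val = val + norm(D, 'fro')^2;   % squared affine-invariant distance
  end
  val = val/N;
end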

Fig. 9. Experiment of learning on the manifold of symmetric positive-definite matrices: Averaging. Left-hand panel: Criterion function V(x(t)). Right-hand panel: Kinetic energy K_{x(t)}(ẋ(t), ẋ(t)). (Solid line: Extended Hamiltonian learning algorithm. Dot-dashed line: Retracted Riemannian gradient algorithm.)

V. CONCLUSION

The present paper aims at discussing the numerical implementation of the extended Hamiltonian learning paradigm introduced in the contribution [21]. The main idea is to make use of two different notions to integrate the differential equations of the system (14). In particular, the differential equation in the state-variable x ∈ M is integrated numerically with the help of retraction maps, while the differential equation in the velocity-variable v ∈ T_xM is integrated numerically with the help of parallel-translation-based or vector-translation-based step-forward methods. The general-purpose discussion of section II as well as the discussion of the cases-of-interest of section III show that the implementation of the dynamical learning equations requires


a large effort in terms of mathematical analysis and exhibits a rich structure in terms of implementation possibilities.

The contribution [21] and the present contribution aim together at illustrating the author's current knowledge about the formulation of a learning theory based on the extended Hamiltonian stationary-action principle and at discussing its implementation on a computation platform. Such contributions open new perspectives in the theory of extended Hamiltonian learning that are briefly discussed below.

Keeping fixed the discretization parameter η, hence with the same numerical precision, the dynamical learning algorithm may converge faster than the gradient-steepest-descent algorithm. The theory also provides a hint about how to choose the learning parameters on the basis of a fine-convergence analysis. It is to be noted that the same learning stepsize η is used within the expression of the numerical integration rule of the first dynamical equation via retraction and within the expression of the numerical integration rule of the second dynamical equation via parallel/vector translation. In principle, two discretization stepsizes might be used instead. Moreover, it is known that the appropriate manipulation of the learning stepsize parameter during the training process can improve the learning ability of a learning machine [37]. A special emphasis is to be put on those stepsize-adaptation methods that are parameter-independent, that is, that do not imply the need for the user to tune parameters whose values exert influence on the performance of the algorithm.

Considerable research into the training of neural networks by gradient-based learning rules has been undertaken in the past years, and it has been known that the introduction of a momentum term into the training equation can accelerate the training process considerably [42], [45]. However, it is still not clear how to choose the initial momentum, which corresponds, in the context of extended Hamiltonian learning, to the initial speed v0.

Much of the discussion in the physics and engineering literature concerning damped systems focuses on systems subject to viscous damping −µv, even though viscous damping occurs rarely in real physical systems. Other types of dissipative forces, such as the Rayleigh drag, exist in real systems. Frictional or drag forces that describe the motion of an object through a fluid or gas are rather complicated. One of the simplest empirical mathematical models takes the form −µ‖v‖^{ǫ−1} v, where ǫ denotes a damping exponent [47]. Such a model might replace the linear damping term in the system (14) to explore richer damping phenomena, and it needs careful theoretical investigation in order to understand how to embed a non-linear damping term in the equations of dynamics without violating any underlying geometrical constraint.

The dynamical system (14) described by the state-pair (x, v) ∈ TM evolves toward a stationarity state (x⋆, 0), where the point x⋆ ∈ M coincides with a local minimum of the potential energy function V over the manifold M. Such an observation may be exploited in neural learning, where the potential energy function is set to a learning goal: The optimal parameters of a machine are those that minimize the goal function. Speculations were raised about the possibility of choosing the potential energy function V from physics and of understanding what the behavior of the extended Hamiltonian learning system will be under any such potential field. As an instance, the contribution [12] discusses a particle-interaction-type potential function, termed information potential, in the context of machine learning. Also, the contribution [28] shows that electrostatic-type potential energy functions give new perspectives to support vector machines. Moreover, recently, attention has been paid to the study of non-smooth Hamiltonian systems [31]: Such analysis would be worth extending to the case of Hamiltonian systems whose state-variables evolve on manifolds.

To end with, it is worth mentioning an open question of seemingly pure speculative interest as yet. The differential system (1) on the tangent bundle TM describes the state of a system that evolves until it attains a local minimum of the potential energy function V. If M = R and a Euclidean metric is selected, then the system (1) collapses to a free damped oscillator such as, for instance, the mass-spring-damper system:

M \ddot{x} + \kappa x + \mu \dot{x} = 0, (79)

where κ > 0 denotes the spring's constant of elasticity. What makes the oscillator free is that the right-hand side of the above equation equals zero. By replacing the right-hand side with a function of time, for example F cos(ωt), with F, ω ∈ R+, and the linear term κx with a non-linear term, one obtains a nonlinear oscillator [47]. Nonlinear oscillator theory has several applications, and speculation arose about the possibility of extending the theory of non-linear oscillators to manifolds by adding a time dependency to the system (1).

VI. ACKNOWLEDGMENTS

Part of the present work was written when I was a visiting scientist at the Bioinformatics Institute (BII-A∗STAR, Biopolis, Singapore). I wish to gratefully thank Dr. Hwee-Kwan Lee for the opportunity to visit BII and the Institute members for the warm hospitality and for the constructive comments about the present research topic. I wish to gratefully thank the anonymous reviewers and the associate editor for the careful and stimulating comments on the present manuscript.

REFERENCES

[1] V.I. Arnol'd and A.B. Givental', Symplectic geometry, in "Dynamical Systems IV: Symplectic Geometry & Its Applications" (V.I. Arnol'd and S.P. Novikov, Ed.s), Encyclopaedia of Mathematical Sciences, Vol. 4, pp. 1 – 138, Springer-Verlag (Second Edition), 2001
[2] S. Bellini, Blind equalization, Alta Frequenza, Vol. 57, pp. 445 – 450, 1988
[3] A. Bielecki, Estimation of the Euler method error on a Riemannian manifold, Communications in Numerical Methods in Engineering, Vol. 18, pp. 757 – 763, 2002
[4] S. Bonnabel and R. Sepulchre, Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank, SIAM Journal of Matrix Analysis and Applications, Vol. 31, No. 3, pp. 1055 – 1070, 2009
[5] Y. Chen and J.E. McInroy, Estimation of symmetric positive-definite matrices from imperfect measurements, IEEE Transactions on Automatic Control, Vol. 47, No. 10, pp. 1721 – 1725, October 2002
[6] E. Celledoni and S. Fiori, Neural learning by geometric integration of reduced 'rigid-body' equations, Journal of Computational and Applied Mathematics (JCAM), Vol. 172, No. 2, pp. 247 – 269, December 2004
[7] J. Cortes, A. Van Der Schaft and P.E. Crouch, Characterization of gradient control systems, SIAM Journal on Control and Optimization, Vol. 44, No. 4, pp. 1192 – 1214, 2005
[8] M.J. Daly and J.P. Reilly, Blind deconvolution using Bayesian methods with application to the dereverberation of speech, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Vol. 2, pp. 1009 – 1012, May 2004
[9] N. del Buono and L. Lopez, Runge-Kutta type methods based on geodesics for systems of ODEs on the Stiefel manifold, BIT Numerical Mathematics, Vol. 41, No. 5, pp. 912 – 923, 2001
[10] A. Edelman, T.A. Arias and S.T. Smith, The geometry of algorithms with orthogonality constraints, SIAM Journal on Matrix Analysis and Applications, Vol. 20, No. 2, pp. 303 – 353, 1998
[11] S. Esedoglu, Blind deconvolution of bar code signals, Inverse Problems, Vol. 20, pp. 121 – 135, 2004
[12] D. Erdogmus and J.C. Principe, Generalized information potential criterion for adaptive system training, IEEE Transactions on Neural Networks, Vol. 13, No. 5, pp. 1035 – 1044, September 2002
[13] E.F.S. Filho, D.B. Haddad and L.A.L. de Almeida, Thermal hysteresis characterization through blind deconvolution, Proceedings of the 17th International Conference on Systems, Signals and Image Processing (IWSSIP 2010, Rio de Janeiro, Brazil, 17-19 June 2010), pp. 352 – 355, 2010
[14] S. Fiori, A fast fixed-point neural blind deconvolution algorithm, IEEE Transactions on Neural Networks, Vol. 15, No. 2, pp. 455 – 459, March 2004


[15] S. Fiori, Formulation and integration of learning differential equations on the Stiefel manifold, IEEE Transactions on Neural Networks, Vol. 16, No. 6, pp. 1697 – 1701, November 2005
[16] S. Fiori, Geodesic-based and projection-based neural blind deconvolution algorithms, Signal Processing, Vol. 88, No. 3, pp. 521 – 538, March 2008
[17] S. Fiori, Lie-group-type neural system learning by manifold retractions, Neural Networks (Elsevier), Vol. 21, No. 10, pp. 1524 – 1529, December 2008
[18] S. Fiori, Learning the Frechet mean over the manifold of symmetric positive-definite matrices (Invited keynote), Cognitive Computation (Springer), Vol. 1, No. 4, pp. 279 – 291, December 2009
[19] S. Fiori, Learning by natural gradient on noncompact matrix-type pseudo-Riemannian manifolds, IEEE Transactions on Neural Networks, Vol. 21, No. 5, pp. 841 – 852, May 2010
[20] S. Fiori, Averaging over the Lie group of optical systems transference matrices, Frontiers of Electrical and Electronic Engineering in China (Springer), Special issue of the "Sino foreign-interchange Workshop on Intelligence Science and Intelligent Data Engineering" Part A, Vol. 6, No. 1, pp. 137 – 145, March 2011
[21] S. Fiori, Extended Hamiltonian learning on Riemannian manifolds: Theoretical aspects, IEEE Transactions on Neural Networks, Vol. 22, No. 5, pp. 687 – 700, May 2011
[22] S. Fiori and T. Tanaka, An algorithm to compute averages on matrix Lie groups, IEEE Transactions on Signal Processing, Vol. 57, No. 12, pp. 4734 – 4743, December 2009
[23] P.T. Fletcher and S. Joshi, Riemannian geometry for the statistical analysis of diffusion tensor data, Signal Processing, Vol. 87, No. 2, pp. 250 – 262, February 2007
[24] D. Gabay, Minimizing a differentiable function over a differentiable manifold, Journal of Optimization Theory and Applications, Vol. 37, No. 2, pp. 177 – 219, 1982
[25] G. Golub and C. van Loan, Matrix Computations, The Johns Hopkins University Press, 3rd Edition, 1996
[26] E. Hairer, C. Lubich and G. Wanner, Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, Springer Series in Computational Mathematics, 2nd Edition, 2006
[27] W.F. Harris and J.R. Cardoso, The exponential-mean-log-transference as a possible representation of the optical character of an average eye, Ophthalmic and Physiological Optics, Vol. 26, No. 4, pp. 380 – 383, 2006
[28] S. Hochreiter, M.C. Mozer and K. Obermayer, Coulomb classifiers: Generalizing support vector machines via an analogy to electrostatic systems, Advances in Neural Information Processing Systems (NIPS, MIT Press) 15, pp. 545 – 552, 2003
[29] G. Hori and T. Tanaka, Pivoting in Cayley transform-based optimization on orthogonal groups, Proceedings of the Second APSIPA Annual Summit and Conference, pp. 181 – 184, Biopolis, Singapore, 14-17 December 2010
[30] S. Huckemann, T. Hotz and A. Munk, Intrinsic MANOVA for Riemannian manifolds with an application to Kendall's space of planar shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 4, pp. 593 – 603, April 2010
[31] D.M. Kaufman and D.K. Pai, Geometric numerical integration of inequality constrained, nonsmooth Hamiltonian systems, 2010, arXiv preprint available at http://arxiv.org/abs/1007.2233
[32] E. Klassen, A. Srivastava, W. Mio and S.H. Joshi, Analysis of planar shapes using geodesic paths on shape spaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 3, pp. 372 – 383, March 2004
[33] D.S. Lun, A. Sherrid, B. Weiner, D.R. Sherman and J.E. Galagan, A blind deconvolution approach to high-resolution mapping of transcription factor binding sites from ChIP-seq data, Genome Biology, Vol. 10, No. 12, pp. 142 – 144, December 2009
[34] G. Meyer, S. Bonnabel and R. Sepulchre, Regression on fixed-rank positive semidefinite matrices: A Riemannian approach, Journal of Machine Learning Research, Vol. 12, pp. 593 – 625, February 2011
[35] C.W. Misner, K.S. Thorne and J.A. Wheeler, Gravitation, W.H. Freeman Publisher, 1973
[36] M. Moakher and P.G. Batchelor, Symmetric Positive-Definite Matrices: From Geometry to Applications and Visualization, Chapter in 'Visualization and Processing of Tensor Fields' (Joachim Weickert and Hans Hagen, Ed.s), Springer Series in Mathematics and Visualization, pp. 285 – 298, 2006
[37] M. Moreira and E. Fiesler, Neural networks with adaptive learning rate and momentum terms, Technical report 95-04 of the Institut Dalle Molle d'Intelligence Artificielle Perceptive, 1995
[38] H. Munthe-Kaas, High order Runge-Kutta methods on manifolds, Applied Numerical Mathematics, Vol. 29, No. 1, pp. 115 – 127, January 1999
[39] Y. Nishimori, S. Akaho and M.D. Plumbley, Riemannian optimization method on the flag manifold for independent subspace analysis, Proc. of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA'06), Lecture Notes in Computer Science, Vol. 3889, pp. 295 – 302, Springer, Berlin, 2006
[40] L. Noakes, A global algorithm for geodesics, Journal of the Australian Mathematical Society (Series A), Vol. 64, pp. 37 – 50, 1998
[41] B. Nsiri, J.-M. Boucher and T. Chonavel, Multichannel blind deconvolution application to marine seismic, Proceedings of the OCEANS 2003 (22-26 September 2003, San Diego, CA, USA), Vol. 5, pp. 2761 – 2766, September 2003
[42] V.V. Phansalkar and P.S. Sastry, Analysis of the back-propagation algorithm with momentum, IEEE Transactions on Neural Networks, Vol. 5, No. 3, pp. 505 – 506, May 1994
[43] N. Prabhu, H.-C. Chang and M. Deguzman, Optimization on Lie manifolds and pattern recognition, Pattern Recognition, Vol. 38, No. 12, pp. 2286 – 2300, December 2005
[44] C.-H. Qi, K.A. Gallivan and P.-A. Absil, Riemannian BFGS algorithm with applications, Recent Advances in Optimization and its Applications in Engineering (M. Diehl et al., Ed.s), Part 3, pp. 183 – 192, Springer-Verlag Berlin Heidelberg, 2010
[45] G. Qiu, M.R. Varley and T.J. Terrell, Accelerated training of backpropagation networks by using adaptive momentum step, Electronics Letters, Vol. 28, No. 4, pp. 377 – 378, February 1992
[46] V. Ramakrishna and F. Costa, On the exponentials of some structured matrices, Journal of Physics A: Mathematical and General, Vol. 37, pp. 11613 – 11627, 2004
[47] M.A.F. Sanjuan, The effect of nonlinear damping on the universal escape oscillator, International Journal of Bifurcation and Chaos, Vol. 9, No. 4, pp. 735 – 744, 1999
[48] F. Silva-Leite and P. Crouch, Closed forms for the exponential mapping on matrix Lie groups, based on Putzer's method, Journal of Mathematical Physics, Vol. 40, No. 7, pp. 3561 – 3568, July 1999
[49] M. Spivak, A Comprehensive Introduction to Differential Geometry, 2nd Edition, Berkeley, CA: Publish or Perish Press, 1979
[50] A. Swaminathan, M. Wu and K.J.R. Liu, Digital image forensics via intrinsic fingerprints, IEEE Transactions on Information Forensics and Security, Vol. 3, No. 1, pp. 101 – 117, March 2008
[51] N.T. Trendafilov and R.A. Lippert, The multimode Procrustes problem, Linear Algebra and Its Applications, Vol. 349, pp. 245 – 264, 2002
[52] B. Vandereycken and S. Vandewalle, A Riemannian optimization approach for computing low-rank solutions of Lyapunov equations, SIAM Journal on Matrix Analysis and Applications, Vol. 31, No. 5, pp. 2553 – 2579, 2010
[53] L. Xu, E. Oja, and C.Y. Suen, Modified Hebbian learning for curve and surface fitting, Neural Networks, Vol. 5, pp. 441 – 457, 1992
[54] J. Wallner and N. Dyn, Convergence and C1 analysis of subdivision schemes on manifolds by proximity, Computer Aided Geometric Design, Vol. 22, pp. 593 – 622, 2005

Simone Fiori received the Italian Laurea (Dr. Eng.) cum laude in Electronic Engineering in July 1996 from the University of Ancona (Italy) and the Ph.D. degree in Electrical Engineering (circuit theory) in March 2000 from the University of Bologna (Italy). He is currently serving as Adjunct Professor at the Faculty of Engineering of the Universita Politecnica delle Marche (Ancona, Italy). His research interests include unsupervised learning theory for artificial neural networks, linear and non-linear adaptive discrete-time filter theory, vision and image processing by neural networks, continuous-time and discrete-time circuits for stochastic information processing, and geometrical methods for machine learning and signal processing. He is author of more than 145 refereed journal and conference papers. Dr. Fiori was the recipient of the 2001 "E.R. Caianiello Award" for the best Ph.D. dissertation in the artificial neural network field and the 2010 "Rector Award" as a proficient researcher of the Faculty of Engineering of the Universita Politecnica delle Marche. He is currently serving as Associate Editor for Neurocomputing (Elsevier), Computational Intelligence and Neuroscience (Hindawi) and Cognitive Computation (Springer).