
Canonical Correlation Analysis on Riemannian Manifolds and Its Applications

Hyunwoo J. Kim1, Nagesh Adluru1, Barbara B. Bendlin1, Sterling C. Johnson1, Baba C. Vemuri2, and Vikas Singh1

1 University of Wisconsin–Madison, USA; 2 University of Florida, USA

http://pages.cs.wisc.edu/~hwkim/projects/riem-cca

D. Fleet et al. (Eds.): ECCV 2014, Part II, LNCS 8690, pp. 251–267, 2014.
© Springer International Publishing Switzerland 2014

Abstract. Canonical correlation analysis (CCA) is a widely used statistical technique to capture correlations between two sets of multi-variate random variables and has found a multitude of applications in computer vision, medical imaging and machine learning. The classical formulation assumes that the data live in a pair of vector spaces, which makes its use in certain important scientific domains problematic. For instance, the set of symmetric positive definite matrices (SPD), rotations and probability distributions all belong to certain curved Riemannian manifolds where vector-space operations are in general not applicable. Analyzing the space of such data via the classical versions of inference models is rather sub-optimal. But perhaps more importantly, since the algorithms do not respect the underlying geometry of the data space, it is hard to provide statistical guarantees (if any) on the results. Using the space of SPD matrices as a concrete example, this paper gives a principled generalization of the well-known CCA to the Riemannian setting. Our CCA algorithm operates on the product Riemannian manifold representing SPD matrix-valued fields to identify meaningful statistical relationships on the product Riemannian manifold. As a proof of principle, we present results on an Alzheimer's disease (AD) study where the analysis task involves identifying correlations across diffusion tensor images (DTI) and Cauchy deformation tensor fields derived from T1-weighted magnetic resonance (MR) images.

1 Introduction

Canonical correlation analysis (CCA) is a powerful statistical technique to extract linear components that capture correlations between two multi-variate random variables [15]. CCA provides an answer to the following question: suppose we are given data of the form $(x_i \in \mathcal{X}, y_i \in \mathcal{Y})_{i=1}^{N} \subset \mathcal{X}\times\mathcal{Y}$, where $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}^n$; find a model that explains both of these observations. More precisely, CCA answers this question by identifying a pair of directions where the projections (namely, $u$ and $v$) of the random variables $x$ and $y$ yield maximum correlation $\rho_{u,v} = \mathrm{COV}(u, v)/(\sigma_u \sigma_v)$. Here, $\mathrm{COV}(u, v)$ denotes the covariance function and $\sigma_\cdot$ gives the standard deviation. During the last decade, the CCA formulation has been broadly applied to various unsupervised learning problems in computer vision and machine learning, including image retrieval [11], face/gait recognition [38], super-resolution [19] and action classification [24].

Beyond the applications described above, a number of works have recently investigated the use of CCA in analyzing neuroimaging data [3], which is a main focus of this paper. Here, for each participant in a clinical study, we acquire different types of images such as Magnetic Resonance (MRI), Computed Tomography (CT) and functional MRI. It is expected that each imaging modality captures a unique aspect of the underlying disease pathology. Therefore, given a group of N subjects and their corresponding brain images, we may want to identify strong relationships (e.g., anatomical/functional correlations) across different image types. When performed across different diseases, such an analysis will reveal insights into what is similar and what is different across diseases even when their symptomatic presentation may be similar. Alternatively, CCA may serve a feature extraction role. That is, the brain regions found to be strongly correlated can be used directly in downstream statistical analysis. In a study of a large number of subjects, rather than performing a hypothesis test on all brain voxels independently for each imaging modality, restricting the number of tests only to the set of 'relevant' voxels (found via CCA) is known to improve statistical power (since the False Discovery Rate correction will be less severe).

The classical version of CCA described above concurrently seeks two linear subspaces (straight lines) in the vector spaces $\mathbb{R}^m$ and $\mathbb{R}^n$ for the two multi-variate random variables $x$ and $y$. The projection onto a straight line (linear subspace) is obtained by an inner product. This formulation is broadly applicable but encounters problems for manifold-valued data, which are becoming increasingly important in present day research. For example, diffusion tensor magnetic resonance images (DTI) allow one to infer the diffusion tensor characterizing the anisotropy of water diffusion at each voxel in an image volume. This tensorial feature can be visualized as an ellipsoid and represented by a $3\times 3$ symmetric positive definite (SPD) matrix at each voxel in the acquired image volume. Neither the individual SPD matrices nor the field of these SPD matrices lie in a vector space; instead they are elements of a negatively curved Riemannian manifold where standard vector space operations are not valid. Hence, classical CCA is not applicable in this setting. For T1-weighted magnetic resonance images (MRI), we are frequently interested in analyzing not just the 3D intensity image on its own, but rather a quantity that captures the deformation field between each image and a population template. A registration between the image and the template yields the deformation field required to align the image pairs, and the determinant of the Jacobian $J$ of this deformation at each voxel is a commonly used feature that captures local volume changes [6,17]. Quantities such as the Cauchy deformation tensor, defined as $\sqrt{J^T J}$, have been reported in the literature for use in morphometric analysis [18]. The input to the statistical analysis is then a 3D image of voxels, where each voxel corresponds to a matrix $\sqrt{J^T J} \succ 0$ (the Cauchy deformation tensor). Another example of manifold-valued fields is derived from high angular resolution diffusion images (HARDI) and can be used to compute the ensemble average propagators (EAPs) at each voxel of the given HARDI data. The EAP is a probability density function that is related to the diffusion-sensitized MR signal via the Fourier transform [5]. Since an EAP is a probability density function, by using a square-root parameterization of this density function, it is possible to identify it with a point on the unit Hilbert sphere. Once again, to perform any statistical analysis of these data-derived features, we cannot apply standard vector-space operations since the unit Hilbert sphere is a positively curved manifold. When analyzing real brain imaging data, it is entirely possible that no meaningful correlations exist in the data. The key difficulty is that we do not know whether the experiment (i.e., inference) failed because there is in fact no statistically meaningful signal in the dataset or because the algorithms being used are sub-optimal.

Related Work. There are two somewhat distinct bodies of work that are related to and motivate this work. The first relates to the extensive study of classical CCA and its non-linear variants. These include various interesting results based on kernelization [1,4,12], neural networks [25,16], and deep architectures [2]. Most, if not all, of these strategies extend CCA to arbitrary nonlinear spaces. However, this flexibility brings with it the associated issues of model selection (and thereby regularization), controlling the complexity of the neural network structure, choosing an appropriate activation function and so on. It is an interesting question, though not completely clear to us, what type of regularizer should be used if one were to explicitly impose a Riemannian structure on the objectives described in the works above. As opposed to regularization, the second line of work incorporates the specific geometry of the data directly within the estimation problem. Various statistical constructs have been generalized to Riemannian manifolds: these include regression [39,31], classification [36], kernel methods [21], margin-based and boosting classifiers [26], interpolation, convolution, filtering [10] and dictionary learning [14,27]. Among the most closely related are ideas related to projective dimensionality reduction methods, for instance, the generalization of Principal Components Analysis (PCA) via the so-called Principal Geodesic Analysis (PGA) [9], Geodesic PCA [20], Exact PGA [33], Horizontal Dimension Reduction [32] with frame bundles, and an extension of PGA to the product space of Riemannian manifolds, namely, tensor fields [36]. It is important to note that, except for the non-parametric method of [34], most of these strategies focus on one rather than two sets of random variables (as is the case in CCA). Even in this setting, the first successful generalizations of parametric regression models to Riemannian manifolds are relatively recent: geodesic regression [8,29] and polynomial regression [13] (note that the adaptive CCA formulation in [37] seems related to our work but is not designed for manifold-valued data).

This paper provides a parametric model between two different tensor fields on a Riemannian manifold, which is a significant step beyond these recent works. The CCA formulation we present requires the optimization of functions over either a single product manifold or a pair of product manifolds (of different dimensions) concurrently. The latter problem, involving product manifolds of different dimensions, will not be addressed in this paper. Note that, in general, the projection operation on manifolds does not have a nice closed form solution; we therefore need to perform projections via an optimization scheme on the two manifolds and find the best pair of geodesic subspaces. We provide a precise solution to this problem. To our knowledge, this is the first extension of CCA to Riemannian manifolds. Our approach has two advantages relative to other non-linear extensions of CCA: first, no model selection is required; second, our method incorporates the known geometry of the data space. Our key contributions are: a) a principled generalization of CCA for Riemannian manifolds; b) first, a numerical optimization scheme for identifying the subspaces and, later, a single-path algorithm with approximate projections (both ideas may be applicable beyond the CCA formulation); c) experimental evidence of how the Riemannian CCA formulation expands the operating range of statistical analysis of neuroimaging data.

2 Canonical Correlation in Euclidean Space

First, we briefly review classical CCA in Euclidean space to motivate the rest of our presentation. Recall that Pearson's product-moment correlation coefficient measures the relationship between two random variables, $x \in \mathbb{R}$ and $y \in \mathbb{R}$. For one-dimensional random variables,

\[
\rho_{x,y} = \frac{\mathrm{COV}(x, y)}{\sigma_x \sigma_y} = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)}{\sqrt{\sum_{i=1}^{N}(x_i-\mu_x)^2}\,\sqrt{\sum_{i=1}^{N}(y_i-\mu_y)^2}} \tag{1}
\]
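As a quick numerical check of (1), the sample correlation can be computed directly; this is a minimal sketch using NumPy, and the data values here are arbitrary illustrative numbers (not from the paper).

import numpy as np

# Sample correlation coefficient as in (1); x, y are arbitrary illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
xc, yc = x - x.mean(), y - y.mean()
rho = (xc @ yc) / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum()))
print(rho)  # agrees with np.corrcoef(x, y)[0, 1]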

For high dimensional data, $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, we cannot perform a direct calculation as above. So we need to project each set of variables onto a special axis in each space $\mathcal{X}$ and $\mathcal{Y}$. CCA generalizes the concept of correlation to random vectors (potentially of different dimensions). It is convenient to think of CCA as a measure of correlation between two multivariate data sets based on the best projection, the one which maximizes their mutual correlation.

The canonical correlation for $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ is given by
\[
\max_{w_x, w_y} \operatorname{corr}\bigl(\pi_{w_x}(x), \pi_{w_y}(y)\bigr) = \max_{w_x, w_y} \frac{\sum_{i=1}^{N} w_x^T(x_i-\mu_x)\, w_y^T(y_i-\mu_y)}{\sqrt{\sum_{i=1}^{N}\bigl(w_x^T(x_i-\mu_x)\bigr)^2}\,\sqrt{\sum_{i=1}^{N}\bigl(w_y^T(y_i-\mu_y)\bigr)^2}} \tag{2}
\]

where $\pi_{w_x}(x) := \arg\min_{t\in\mathbb{R}} d(t w_x, x)^2$. We call $\pi_{w_x}(x)$ the projection coefficient for $x$ (similarly for $y$). Define $S_{w_x}$ as the subspace spanned by $w_x$. The projection of $x$ onto $S_{w_x}$ is given by $\Pi_{S_{w_x}}(x)$. One can then verify the relationship between the projection and the projection coefficient:
\[
\Pi_{S_{w_x}}(x) := \arg\min_{x'\in S_{w_x}} d(x, x')^2 = \frac{w_x^T x}{\|w_x\|}\,\frac{w_x}{\|w_x\|} = \frac{w_x^T x}{\|w_x\|^2}\, w_x = \pi_{w_x}(x)\, w_x \tag{3}
\]

In Euclidean space, $\Pi_{S_{w_x}}(x)$ has a closed-form solution; it is obtained via an inner product, $w_x^T x$. Hence, by replacing the projection coefficient $\pi_{w_x}(x)$ with $w_x^T x/\|w_x\|^2$ and after a simple calculation, one obtains the form in (2). Without loss of generality, assume that $x, y$ are centered. Then the optimization problem can be written as
\[
\max_{w_x, w_y}\; w_x^T X^T Y w_y \quad \text{subject to} \quad w_x^T X^T X w_x = w_y^T Y^T Y w_y = 1 \tag{4}
\]
where $x, w_x \in \mathbb{R}^m$, $y, w_y \in \mathbb{R}^n$, $X = [x_1 \ldots x_N]^T$ and $Y = [y_1 \ldots y_N]^T$. The only difference here is that the denominator is removed; instead, we have two equality constraints (note that correlation is scale-invariant).
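For reference, here is a minimal NumPy sketch of the first canonical pair for (2)/(4), obtained by whitening the covariance matrices and taking an SVD of the cross-covariance. The function name, variable names and the small ridge term are our own illustrative choices, not from the paper.

import numpy as np

def cca_first_pair(X, Y, eps=1e-8):
    """First canonical directions for data matrices X (N x m) and Y (N x n)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx = X.T @ X + eps * np.eye(X.shape[1])   # small ridge for numerical stability
    Syy = Y.T @ Y + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y

    def inv_sqrt(S):
        d, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(d)) @ V.T

    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(K)
    wx = inv_sqrt(Sxx) @ U[:, 0]               # satisfies wx^T Sxx wx = 1, as in (4)
    wy = inv_sqrt(Syy) @ Vt[0]                 # satisfies wy^T Syy wy = 1
    return wx, wy, s[0]                        # s[0] is the canonical correlation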

3 Mathematical Preliminaries

We now briefly summarize certain basic concepts [7] which we will use later.

Riemannian Manifolds. A differentiable manifold [7] of dimension $n$ is a set $\mathcal{M}$ together with a family of injective mappings $\varphi_i : U_i \subset \mathbb{R}^n \to \mathcal{M}$ of open sets $U_i$ of $\mathbb{R}^n$ into $\mathcal{M}$ such that: (1) $\cup_i \varphi_i(U_i) = \mathcal{M}$; (2) for any pair $i, j$ with $\varphi_i(U_i) \cap \varphi_j(U_j) = W \neq \emptyset$, the sets $\varphi_i^{-1}(W)$ and $\varphi_j^{-1}(W)$ are open sets in $\mathbb{R}^n$ and the mappings $\varphi_j^{-1} \circ \varphi_i$ are differentiable, where $\circ$ denotes function composition. In other words, a differentiable manifold $\mathcal{M}$ is a topological space that is locally similar to Euclidean space and has a globally defined differential structure. The tangent space at a point $p$ on the manifold, $T_p\mathcal{M}$, is a vector space that consists of the tangent vectors of all possible curves passing through $p$.

A Riemannian manifold is equipped with a smoothly varying inner product. The family of inner products on all tangent spaces is known as the Riemannian metric of the manifold. The geodesic distance between two points on $\mathcal{M}$ is the length of the shortest geodesic curve connecting the two points, analogous to straight lines in $\mathbb{R}^n$. The geodesic curve from $y_i$ to $y_j$ can be parameterized by a tangent vector in the tangent space at $y_i$ with an exponential map $\mathrm{Exp}(y_i, \cdot) : T_{y_i}\mathcal{M} \to \mathcal{M}$. The inverse of the exponential map is the logarithm map, $\mathrm{Log}(y_i, \cdot) : \mathcal{M} \to T_{y_i}\mathcal{M}$. Separate from these notations, the matrix exponential (and logarithm) are written $\exp(\cdot)$ (and $\log(\cdot)$).

Intrinsic Mean. Let $d(\cdot,\cdot)$ denote the geodesic distance between two points. The intrinsic (or Karcher) mean of a set of points $\{y_i\}$ with non-negative weights $\{w_i\}$ is the minimizer of
\[
\bar{y} = \arg\min_{y\in\mathcal{M}} \sum_{i=1}^{N} w_i\, d(y, y_i)^2, \tag{5}
\]
which may be an arithmetic, geometric or harmonic mean depending on $d(\cdot,\cdot)$. On manifolds, the Karcher mean with distance $d(y_i, y_j) = \|\mathrm{Log}(y_i, y_j)\|$ satisfies $\sum_{i=1}^{N} \mathrm{Log}(\bar{y}, y_i) = 0$. This identity states that $\bar{y}$ is a local minimum with a zero-norm gradient [22], i.e., the sum of all tangent vectors corresponding to geodesic curves from the mean $\bar{y}$ to all points $y_i$ is zero in the tangent space $T_{\bar{y}}\mathcal{M}$. On manifolds, the existence and uniqueness of the Karcher mean is not guaranteed unless we assume, for uniqueness, that the data lie in a small neighborhood.
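A common way to compute the Karcher mean is gradient descent on (5): average the Log-mapped data in the tangent space at the current estimate and shoot back with Exp. The sketch below assumes user-supplied exp_map(p, v) and log_map(p, q) for the manifold at hand (for SPD matrices, (19) gives closed forms); these helper names are hypothetical and not part of the paper.

import numpy as np

def karcher_mean(points, exp_map, log_map, max_iter=50, tol=1e-10, step=1.0):
    """Gradient descent for the (unweighted) Karcher mean in (5)."""
    mu = points[0]
    for _ in range(max_iter):
        # Mean of tangent vectors; the Karcher condition says this vanishes at the mean.
        grad = sum(log_map(mu, p) for p in points) / len(points)
        if np.linalg.norm(grad) < tol:
            break
        mu = exp_map(mu, step * grad)
    return mu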


Fig. 1. CCA on Riemannian manifolds. CCA searches for geodesic submanifolds (subspaces) $S_{w_x}$ and $S_{w_y}$ at the Karcher mean of the data on each manifold. The correlation between the projected points $\{\Pi_{S_{w_x}}(x_i)\}_{i=1}^N$ and $\{\Pi_{S_{w_y}}(y_i)\}_{i=1}^N$ is equivalent to the correlation between the projection coefficients $\{t_i\}_{i=1}^N$ and $\{u_i\}_{i=1}^N$. Although $x$ and $y$ belong to the same manifold, we show them in different plots for ease of explanation.

Geodesically Convex. A subset $C$ of $\mathcal{M}$ is said to be a geodesically convex set if there is a minimizing geodesic curve in $C$ between any two points in $C$. This assumption is commonly used [8] and is essential to ensure that Riemannian operations such as the exponential and logarithm maps are well-defined.

4 A Model for CCA on Riemannian Manifolds

We now present a step-by-step derivation of our Riemannian CCA model. Classical CCA finds the mean of each data modality and then maximizes the correlation between the data projected onto a subspace at each mean. Similarly, CCA on manifolds must first compute the intrinsic mean (i.e., Karcher mean) of each data set. It must then identify a 'generalized' version of a subspace at each Karcher mean so as to maximize the correlation of the projected data. The generalized form of a subspace on Riemannian manifolds has been studied in the literature [33,26,20,9]. The so-called geodesic submanifold [9,36,23], which has been used for geodesic regression, serves our purpose well and is defined as $S = \mathrm{Exp}(\mu, \mathrm{span}(\{v_i\}) \cap U)$, where $U \subset T_\mu\mathcal{M}$ and $v_i \in T_\mu\mathcal{M}$ [9]. When $S$ has only one tangent vector $v$, the geodesic submanifold is simply a geodesic curve; see Figure 1.

We can now formulate the precise form of projection onto a geodesic submanifold. Recall that, given a point, its projection onto a set is the closest point in the set. So the projection onto a geodesic submanifold $S$ must be a function satisfying this behavior. This is given by
\[
\Pi_S(x) = \arg\min_{x'\in S} d(x, x')^2 \tag{6}
\]
In Euclidean space, the projection onto a convex set (e.g., a subspace) is unique. It is also unique on some manifolds under special conditions, e.g., the quaternion sphere [30]. However, the uniqueness of the projection onto geodesic submanifolds cannot be ensured under general conditions. Like other methods, we assume that, given the specific manifold and the data, the projection is well-posed.


Finally, the correlation of points (after projection) can be measured by the distance from the mean to the projected points. To be specific, the projection onto the geodesic submanifold corresponding to $w_x$ in classical CCA is given by
\[
\Pi_{S_{w_x}}(x) := \arg\min_{x'\in S_{w_x}} \|\mathrm{Log}(x, x')\|_x^2 \tag{7}
\]
where $S_{w_x} := \mathrm{Exp}(\mu_x, \mathrm{span}\{w_x\} \cap U)$, $w_x$ is a basis tangent vector, and $U \subset T_{\mu_x}\mathcal{M}_x$ is a small neighborhood of $\mu_x$. The projection coefficients can now be written as
\[
t_i = \pi_{w_x}(x_i) := \arg\min_{t_i'\in(-\varepsilon,\varepsilon)} \|\mathrm{Log}(\mathrm{Exp}(\mu_x, t_i' w_x), x_i)\|_{\mu_x}^2 \tag{8}
\]
where $x_i, \mu_x \in \mathcal{M}_x$, $w_x \in T_{\mu_x}\mathcal{M}_x$, and $t_i \in \mathbb{R}$. The term $u_i = \pi_{w_y}(y_i)$ is defined analogously. The coefficient $t_i$ is the real value that gives the point $\Pi_{S_{w_x}}(x_i) = \mathrm{Exp}(\mu_x, t_i w_x)$. As mentioned above, $x$ and $y$ belong to the same manifold; we nonetheless use two different notations, $\mathcal{M}_x$ and $\mathcal{M}_y$, to indicate that they are differently distributed, for ease of discussion.

Notice that $d(\mu_x, \Pi_{S_{w_x}}(x_i)) = \|\mathrm{Log}(\mu_x, \mathrm{Exp}(\mu_x, w_x t_i))\|_{\mu_x} = t_i \|w_x\|_{\mu_x}$. By inspection, this shows that the projection coefficient is proportional to the length of the geodesic curve from the base point $\mu_x$ to the projection $\Pi_{S_{w_x}}(x_i)$. Correlation is scale invariant, as expected. Therefore, the correlation between the projected points $\{\Pi_{S_{w_x}}(x_i)\}_{i=1}^N$ and $\{\Pi_{S_{w_y}}(y_i)\}_{i=1}^N$ reduces to the correlation between the quantities that serve as projection coefficients here, $\{t_i\}_{i=1}^N$ and $\{u_i\}_{i=1}^N$. Putting these pieces together, we obtain our generalized formulation for CCA,

\[
\rho_{x,y} = \operatorname{corr}\bigl(\pi_{w_x}(x), \pi_{w_y}(y)\bigr) = \max_{w_x, w_y, t, u} \frac{\sum_{i=1}^{N}(t_i - \bar{t})(u_i - \bar{u})}{\sqrt{\sum_{i=1}^{N}(t_i - \bar{t})^2}\,\sqrt{\sum_{i=1}^{N}(u_i - \bar{u})^2}} \tag{9}
\]
where $t_i = \pi_{w_x}(x_i)$, $t := \{t_i\}$, $u_i = \pi_{w_y}(y_i)$, $u := \{u_i\}$, $\bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i$ and $\bar{u} = \frac{1}{N}\sum_{i=1}^{N} u_i$. Expanding out the components in (9) further, it takes the form

\[
\begin{aligned}
\rho_{x,y} = \max_{w_x, w_y, t, u}\;& \frac{\sum_{i=1}^{N}(t_i - \bar{t})(u_i - \bar{u})}{\sqrt{\sum_{i=1}^{N}(t_i - \bar{t})^2}\,\sqrt{\sum_{i=1}^{N}(u_i - \bar{u})^2}} \\
\text{s.t. }\;& t_i = \arg\min_{t_i\in(-\varepsilon,\varepsilon)} \|\mathrm{Log}(\mathrm{Exp}(\mu_x, t_i w_x), x_i)\|^2, \quad \forall i \in \{1,\ldots,N\} \\
& u_i = \arg\min_{u_i\in(-\varepsilon,\varepsilon)} \|\mathrm{Log}(\mathrm{Exp}(\mu_y, u_i w_y), y_i)\|^2, \quad \forall i \in \{1,\ldots,N\}
\end{aligned} \tag{10}
\]

We see directly that (10) is a multilevel optimization: solutions of the nested sub-problems may be needed to solve the higher-level problem. It turns out that deriving the first-order optimality conditions suggests a cleaner approach.

Define $f(t, u) := \frac{\sum_{i=1}^{N}(t_i - \bar{t})(u_i - \bar{u})}{\sqrt{\sum_{i=1}^{N}(t_i - \bar{t})^2}\sqrt{\sum_{i=1}^{N}(u_i - \bar{u})^2}}$, $g(t_i, w_x) := \|\mathrm{Log}(\mathrm{Exp}(\mu_x, t_i w_x), x_i)\|^2$, and $g(u_i, w_y) := \|\mathrm{Log}(\mathrm{Exp}(\mu_y, u_i w_y), y_i)\|^2$. Then we may replace the equality constraints in (10) by their optimality conditions rather than by another optimization problem for each $i$. Using this idea, we have
\[
\rho(w_x, w_y) = \max_{w_x, w_y, t, u} f(t, u) \quad \text{s.t.}\;\; \nabla_{t_i} g(t_i, w_x) = 0,\; \nabla_{u_i} g(u_i, w_y) = 0,\; \forall i \in \{1,\ldots,N\} \tag{11}
\]

5 Optimization Schemes

We present two different algorithms to solve the problem of computing CCA on Riemannian manifolds. The first algorithm is based on a numerical optimization of (11); we summarize only the main model here and provide all technical details in the extended version for space reasons. Subsequently, we present a second approach that is based on an approximation, yielding a more efficient algorithm.

5.1 An Augmented Lagrangian Method

The augmented Lagrangian technique is a well-known variation of the penalty method for constrained optimization. Given a constrained problem $\max f(x)$ s.t. $c_i(x) = 0,\ \forall i$, the augmented Lagrangian method solves a sequence of the following models while increasing $\nu^k$:
\[
\max_x\; f(x) + \sum_i \lambda_i c_i(x) - \nu^k \sum_i c_i(x)^2 \tag{12}
\]
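The outer loop of this method can be sketched generically; the penalty schedule for ν and the inner solver below (BFGS on the negated objective) are illustrative choices on our part, not prescribed by the paper, while the multiplier update mirrors lines 9-10 of Algorithm 1.

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian_max(f, constraints, x0, nu0=1.0, rho=4.0, n_outer=20, tol=1e-8):
    """Maximize f(x) subject to c_i(x) = 0 with an augmented Lagrangian as in (12)/(13)."""
    x, lam, nu = np.asarray(x0, float), np.zeros(len(constraints)), nu0
    for _ in range(n_outer):
        def neg_LA(z):
            c = np.array([ci(z) for ci in constraints])
            return -(f(z) + lam @ c - 0.5 * nu * (c ** 2).sum())
        x = minimize(neg_LA, x, method="BFGS").x          # inner unconstrained solve
        c = np.array([ci(x) for ci in constraints])
        if np.abs(c).max() < tol:                          # feasibility reached
            break
        lam = lam - nu * c                                 # multiplier update (cf. Algorithm 1)
        nu *= rho                                          # increase the penalty weight
    return x, lam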

The augmented Lagrangian formulation for our CCA problem is
\[
\max_{w_x, w_y, t, u} L_A(w_x, w_y, t, u, \lambda^k; \nu^k) = \max_{w_x, w_y, t, u}\; f(t, u) + \sum_{i=1}^{N} \lambda^k_{t_i} \nabla_{t_i} g(t_i, w_x) + \sum_{i=1}^{N} \lambda^k_{u_i} \nabla_{u_i} g(u_i, w_y) - \frac{\nu^k}{2}\sum_{i=1}^{N}\Bigl(\nabla_{t_i} g(t_i, w_x)^2 + \nabla_{u_i} g(u_i, w_y)^2\Bigr) \tag{13}
\]

The pseudocode for our algorithm is summarized in Algorithm 1.

Remarks. Note that Algorithm 1 requires the second derivatives of $g$, in particular $\frac{d^2}{dw\,dt} g$ and $\frac{d^2}{dt^2} g$. The literature does not provide a great deal of guidance on second derivatives of functions involving $\mathrm{Log}(\cdot)$ and $\mathrm{Exp}(\cdot)$ maps on general Riemannian manifolds. However, depending on the manifold, they can be obtained analytically or numerically (see the extended version of the paper).

Approximate strategies. Clearly, the core difficulty in deriving the algorithm above was the lack of a closed-form solution for projections onto geodesic submanifolds. If an approximate form of the projection leads to significant gains in computational efficiency with little sacrifice in accuracy, it is worth considering. The simplest approximation is to use a Log-Euclidean model, but it is well known that the Log-Euclidean approximation is only reasonable for data that are tightly clustered on the manifold. Further, the Log-Euclidean metric lacks the important property of affine invariance. We can obtain a more accurate projection using the submanifold expression given in [36]; the form of this projection is given in (14) below.


Algorithm 1. Riemannian CCA based on the Augmented Lagrangian method

1: Input: $x_1,\ldots,x_N \in \mathcal{M}_x$, $y_1,\ldots,y_N \in \mathcal{M}_y$
2: Given $\nu^0 > 0$, $\tau^0 > 0$, starting points $(w_x^0, w_y^0, t^0, u^0)$ and $\lambda^0$
3: for $k = 0, 1, 2, \ldots$ do
4:   Start at $(w_x^k, w_y^k, t^k, u^k)$
5:   Find an approximate minimizer $(w_x^k, w_y^k, t^k, u^k)$ of $L_A(\cdot, \lambda^k; \nu^k)$, and terminate when $\|\nabla L_A(w_x^k, w_y^k, t^k, u^k, \lambda^k; \nu^k)\| \le \tau^k$
6:   if a convergence test for (11) is satisfied then
7:     Stop with an approximate feasible solution
8:   end if
9:   $\lambda^{k+1}_{t_i} = \lambda^k_{t_i} - \nu^k \nabla_{t_i} g(t_i, w_x),\ \forall i$
10:  $\lambda^{k+1}_{u_i} = \lambda^k_{u_i} - \nu^k \nabla_{u_i} g(u_i, w_y),\ \forall i$
11:  Choose a new penalty parameter $\nu^{k+1} \ge \nu^k$
12:  Set the starting point for the next iteration
13:  Select tolerance $\tau^{k+1}$
14: end for

Algorithm 2. CCA with approximate projection

1: Input: $X_1,\ldots,X_N \in \mathcal{M}_x$, $Y_1,\ldots,Y_N \in \mathcal{M}_y$
2: Compute the intrinsic means $\mu_x, \mu_y$ of $\{X_i\}, \{Y_i\}$
3: Compute $\tilde{X}_i = \mathrm{Log}(\mu_x, X_i)$, $\tilde{Y}_i = \mathrm{Log}(\mu_y, Y_i)$
4: Transform (using the group action) $\{\tilde{X}_i\}, \{\tilde{Y}_i\}$ to $T_I\mathcal{M}_x, T_I\mathcal{M}_y$
5: Perform CCA between $T_I\mathcal{M}_x$ and $T_I\mathcal{M}_y$ to obtain axes $W_a \in T_I\mathcal{M}_x$, $W_b \in T_I\mathcal{M}_y$
6: Transform (using the group action) $W_a, W_b$ to $T_{\mu_x}\mathcal{M}_x, T_{\mu_y}\mathcal{M}_y$

The approximate projection is
\[
\Pi_S(x) \approx \mathrm{Exp}\!\left(\mu, \sum_{i=1}^{d} v_i \langle v_i, \mathrm{Log}(\mu, x)\rangle_\mu\right) \tag{14}
\]
where $\{v_i\}$ is an orthonormal basis of $T_\mu\mathcal{M}$. The CCA algorithm with this approximate projection is summarized in Algorithm 2.
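A sketch of the approximate projection (14): lift the point to the tangent space at μ, keep only its component in the span of the basis, and map back with Exp. The argument inner_mu stands for the Riemannian inner product at μ; all helper names here are hypothetical and assume an orthonormal basis, as in (14).

def approx_project(mu, x, basis, exp_map, log_map, inner_mu):
    """Approximate projection of x onto S = Exp(mu, span(basis)), following (14)."""
    v = log_map(mu, x)                                    # lift x to the tangent space at mu
    coeffs = [inner_mu(b, v) for b in basis]              # <v_i, Log(mu, x)>_mu
    tangent = sum(c * b for c, b in zip(coeffs, basis))   # component of v inside span({v_i})
    return exp_map(mu, tangent)                           # map back to the manifold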

Finally, we provide a brief remark on one remaining issue: why do we use the group action rather than other transformations such as parallel transport? Observe that Algorithm 2 sends the data from the tangent space at the Karcher mean of the samples to the tangent space at the identity $I$. The purpose of this transformation is to put all samples at the identity of the SPD manifold in order to obtain a more accurate projection, which can be understood from (14): the projection and inner product depend on the anchor point $\mu$, and if $\mu$ is the identity, there is no discrepancy between the Euclidean and the Riemannian inner products. Of course, one may instead use parallel transport. However, the group action may be substantially more efficient than parallel transport, since the former does not require computing a geodesic curve (which parallel transport needs). Interestingly, it turns out that on SPD manifolds with a GL-invariant metric, parallel transport from an arbitrary point $p$ to the identity $I$ is equivalent to the transform using a group action. So one can parallel transport tangent vectors from $p$ to $I$ using the group action more efficiently. The proof of Theorem 1 is available in the extended version.

Theorem 1. On the SPD manifold, let $\Gamma_{p\to I}(w)$ denote the parallel transport of $w \in T_p\mathcal{M}$ along the geodesic from $p \in \mathcal{M}$ to $I \in \mathcal{M}$. This parallel transport is equivalent to the group action $p^{-1/2} w p^{-T/2}$, where the inner product is $\langle u, v\rangle_p = \mathrm{tr}(p^{-1/2} u p^{-1} v p^{-1/2})$.
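Theorem 1 can be sanity-checked numerically: since $p$ is symmetric, $p^{-T/2} = p^{-1/2}$, and the group action $p^{-1/2} w p^{-1/2}$ should carry the metric at $p$ to the plain trace inner product at $I$. The following NumPy check of this isometry on random SPD data is our own illustration, not the authors' code.

import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

def inv_sqrt(p):
    d, V = np.linalg.eigh(p)
    return V @ np.diag(1.0 / np.sqrt(d)) @ V.T

p = random_spd(3)
u = rng.standard_normal((3, 3)); u = (u + u.T) / 2        # symmetric tangent vectors at p
v = rng.standard_normal((3, 3)); v = (v + v.T) / 2
pis = inv_sqrt(p)

lhs = np.trace(pis @ u @ np.linalg.inv(p) @ v @ pis)      # <u, v>_p from Theorem 1
Gu, Gv = pis @ u @ pis, pis @ v @ pis                     # group action p^{-1/2} w p^{-1/2}
rhs = np.trace(Gu @ Gv)                                   # <.,.>_I is the plain trace inner product
print(np.isclose(lhs, rhs))                               # True: the transport is an isometry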

5.2 Extensions to the Product Riemannian Manifold

In the types of imaging datasets of interest in this paper, we seek to perform an analysis on an entire population of images (of multiple types). For such data, each image must be treated as a single entity, which necessitates extending the formulation above to a Riemannian product space.

Let us define a Riemannian metric on the product space $\mathcal{M} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_m$. A natural choice is the following idea from [36]:
\[
\langle \boldsymbol{X}_1, \boldsymbol{X}_2\rangle_{\boldsymbol{P}} = \sum_{j=1}^{m} \langle X_1^j, X_2^j\rangle_{P^j} \tag{15}
\]
where $\boldsymbol{X}_1 = (X_1^1, \ldots, X_1^m) \in \mathcal{M}$, $\boldsymbol{X}_2 = (X_2^1, \ldots, X_2^m) \in \mathcal{M}$ and $\boldsymbol{P} = (P^1, \ldots, P^m) \in \mathcal{M}$. Once we have the exponential and logarithm maps, CCA on a Riemannian product space can be performed directly by Algorithm 2. The exponential map $\mathrm{Exp}(\boldsymbol{P}, \boldsymbol{V})$ and the logarithm map $\mathrm{Log}(\boldsymbol{P}, \boldsymbol{X})$ are given by
\[
\bigl(\mathrm{Exp}(P^1, V^1), \ldots, \mathrm{Exp}(P^m, V^m)\bigr) \quad \text{and} \quad \bigl(\mathrm{Log}(P^1, X^1), \ldots, \mathrm{Log}(P^m, X^m)\bigr) \tag{16}
\]
respectively, where $\boldsymbol{V} = (V^1, \ldots, V^m) \in T_{\boldsymbol{P}}\mathcal{M}$. The length of a tangent vector is $\|\boldsymbol{V}\| = \sqrt{\|V^1\|_{P^1}^2 + \cdots + \|V^m\|_{P^m}^2}$, where $V^j \in T_{P^j}\mathcal{M}_j$. The geodesic distance $d(\boldsymbol{X}_1, \boldsymbol{X}_2)$ between two points on the Riemannian product space is likewise measured by the length of the tangent vector from one point to the other, so we have
\[
d(\boldsymbol{\mu}_x, \boldsymbol{X}) = \sqrt{d(\mu_x^1, X^1)^2 + \cdots + d(\mu_x^m, X^m)^2} \tag{17}
\]
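Since the product-space maps in (16) act factor-wise, they reduce to simple componentwise maps over the per-factor Exp/Log. The sketch below represents a point on the product manifold as a list of per-voxel matrices and assumes per-factor helper functions (hypothetical names), such as the SPD maps sketched after (19).

def product_exp(P, V, exp_map):
    """Exp on M_1 x ... x M_m as in (16): apply the per-factor map componentwise."""
    return [exp_map(Pj, Vj) for Pj, Vj in zip(P, V)]

def product_log(P, X, log_map):
    """Log on the product manifold, componentwise as in (16)."""
    return [log_map(Pj, Xj) for Pj, Xj in zip(P, X)]

def product_dist(P, X, dist):
    """Geodesic distance (17): root of the sum of squared per-factor distances."""
    return sum(dist(Pj, Xj) ** 2 for Pj, Xj in zip(P, X)) ** 0.5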

From our earlier discussion (Section 4) of the relationship between projection coefficients and the distance from the mean to the projected points, we have $t_i = d(\boldsymbol{\mu}_x, \Pi_{S_{\boldsymbol{W}_x}}(\boldsymbol{X}_i))/\|\boldsymbol{W}_x\|_{\boldsymbol{\mu}_x}$ and $t_i^j = d(\mu_x^j, \Pi_{S_{W_x^j}}(X_i^j))/\|W_x^j\|_{\mu_x^j}$. By substitution, the projection coefficients on the Riemannian product space are given by
\[
t_i = \frac{d(\boldsymbol{\mu}_x, \Pi_{S_{\boldsymbol{W}_x}}(\boldsymbol{X}_i))}{\|\boldsymbol{W}_x\|_{\boldsymbol{\mu}_x}} = \sqrt{\frac{\sum_{j=1}^{m}\bigl(t_i^j\, \|W_x^j\|_{\mu_x^j}\bigr)^2}{\sum_{j=1}^{m}\|W_x^j\|_{\mu_x^j}^2}} \tag{18}
\]
We can now mechanically substitute these "product space" versions of the terms in (18) to derive a CCA on the Riemannian product space. The full model is provided in the extended version.


6 Experiments

6.1 CCA on SPD Manifolds

Diffusion tensors are symmetric positive definite matrices at each voxel in DTI. Let $\mathrm{SPD}(n)$ be the manifold of symmetric positive definite matrices of size $n\times n$. It forms the quotient space $GL(n)/O(n)$, where $GL(n)$ denotes the general linear group and $O(n)$ the orthogonal group. The inner product of two tangent vectors $u, v \in T_p\mathcal{M}$ is given by $\langle u, v\rangle_p = \mathrm{tr}(p^{-1/2} u p^{-1} v p^{-1/2})$. Here $T_p\mathcal{M}$, the tangent space at $p$ (a vector space), is the space of symmetric matrices of dimension $n(n+1)/2$. The geodesic distance is $d(p, q)^2 = \mathrm{tr}(\log^2(p^{-1/2} q p^{-1/2}))$. The exponential and logarithm maps are
\[
\mathrm{Exp}(p, v) = p^{1/2}\exp(p^{-1/2} v p^{-1/2})\, p^{1/2}, \qquad \mathrm{Log}(p, q) = p^{1/2}\log(p^{-1/2} q p^{-1/2})\, p^{1/2} \tag{19}
\]
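The maps in (19) are straightforward to implement with a symmetric eigendecomposition; the sketch below is ours, not the authors' code, and uses NumPy only.

import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix S."""
    d, V = np.linalg.eigh(S)
    return V @ np.diag(fun(d)) @ V.T

def spd_exp(p, v):
    """Exp(p, v) = p^{1/2} exp(p^{-1/2} v p^{-1/2}) p^{1/2}, as in (19)."""
    ph = _sym_fun(p, np.sqrt)
    phi = _sym_fun(p, lambda d: 1.0 / np.sqrt(d))
    return ph @ _sym_fun(phi @ v @ phi, np.exp) @ ph

def spd_log(p, q):
    """Log(p, q) = p^{1/2} log(p^{-1/2} q p^{-1/2}) p^{1/2}, as in (19)."""
    ph = _sym_fun(p, np.sqrt)
    phi = _sym_fun(p, lambda d: 1.0 / np.sqrt(d))
    return ph @ _sym_fun(phi @ q @ phi, np.log) @ ph

def spd_dist(p, q):
    """Geodesic distance d(p, q) = || log(p^{-1/2} q p^{-1/2}) ||_F."""
    phi = _sym_fun(p, lambda d: 1.0 / np.sqrt(d))
    return np.linalg.norm(_sym_fun(phi @ q @ phi, np.log))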

The first derivative of $g$ in (11) on $\mathrm{SPD}(n)$ is given by
\[
\frac{d}{dt_i} g(t_i, W_x) = \frac{d}{dt_i}\|\mathrm{Log}(\mathrm{Exp}(\mu_x, t_i W_x), X_i)\|^2 = \frac{d}{dt_i}\mathrm{tr}\bigl[\log^2(X_i^{-1} S(t_i))\bigr] = 2\,\mathrm{tr}\bigl[\log(X_i^{-1} S(t_i))\, S(t_i)^{-1}\dot{S}(t_i)\bigr] \tag{20}
\]
according to Prop. 2.1 in [28], where $S(t_i) = \mathrm{Exp}(\mu_x, t_i W_x) = \mu_x^{1/2} e^{t_i A}\mu_x^{1/2}$, $\dot{S}(t_i) = \mu_x^{1/2} A\, e^{t_i A}\mu_x^{1/2}$ and $A = \mu_x^{-1/2} W_x \mu_x^{-1/2}$. The derivatives of the equality constraints, namely $\frac{d^2}{dW\,dt} g$ and $\frac{d^2}{dt^2} g$, are computed numerically. Embedding the tangent vectors in the $n(n+1)/2$-dimensional space spanned by an orthonormal basis of the tangent space enables numerical differentiation. Details are provided in the extended paper.
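The analytic derivative (20) can be cross-checked against a central difference in the scalar $t$. This sketch reuses the spd_exp/spd_log helpers sketched after (19) (our own, hypothetical names); the squared GL-invariant norm below matches the geodesic distance used in (20).

import numpy as np

def g(t, mu, W, X, spd_exp, spd_log):
    """g(t, W) = || Log(Exp(mu, t W), X) ||^2, the squared geodesic distance in (8)."""
    S = spd_exp(mu, t * W)
    V = spd_log(S, X)
    Si = np.linalg.inv(S)
    return float(np.trace(Si @ V @ Si @ V))   # squared GL-invariant norm at S

def dg_dt_numeric(t, mu, W, X, spd_exp, spd_log, h=1e-5):
    """Central-difference estimate of d/dt g(t, W), to compare with (20)."""
    return (g(t + h, mu, W, X, spd_exp, spd_log) - g(t - h, mu, W, X, spd_exp, spd_log)) / (2 * h)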

6.2 Synthetic Experiments

In this section we provide experimental results using a synthetic dataset to evaluate the performance of Riemannian CCA. The samples are generated to be spread far apart on the manifold $\mathcal{M}(\equiv \mathrm{SPD}(3))$ so that the curvature of the manifold plays a key role in the maximization of the correlation function. In order to sample data from different regions of the manifold, we generate data around two well-separated means $\mu_{x_1}, \mu_{x_2} \in \mathcal{X}$, $\mu_{y_1}, \mu_{y_2} \in \mathcal{Y}$ by perturbing the data randomly (see the extended version) in the corresponding tangent spaces. Fig. 2 shows the CCA results obtained by the Riemannian and Euclidean methods. We can clearly see the improvement from the manifold approach by inspecting the correlation coefficients $\rho_{x,y}$ in the respective titles.

6.3 CCA for Multi-modal Risk Analysis

Motivation: We collected multi-modal magnetic resonance imaging (MRI) data to investigate the effects of risk for Alzheimer's disease (AD) on the white and gray matter in the brain. One of the central goals in analyzing this rich dataset is to find statistically significant AD risk ↔ brain relationships.


Fig. 2. Synthetic experiments showing the benefits of Riemannian CCA. The top row shows the projected data using Euclidean CCA and the bottom row using Riemannian CCA. PX and PY denote the projected axes. Each column represents a synthetic experiment with a specific set of $\{\mu_{x_j}, \varepsilon_{x_j}; \mu_{y_j}, \varepsilon_{y_j}\}$. The first column presents results with 100 samples while the three columns on the right use 1000 samples. The improvements in the correlation coefficients $\rho_{x,y}$ can be clearly seen from the corresponding titles.

We can adopt many different ways of modeling these relationships, but a potentially useful approach is to analyze multi-modality imaging data simultaneously, using CCA.

Risk for AD is characterized by familial history (FH) status as well as the APOE genotype risk factor. In the current experiments, we include a subset of 343 subjects and first investigate the effects of age and gender in a multimodal fashion, since these variables are also important factors in healthy aging.

Brain structure is characterized by diffusion weighted images (DWI) for the white matter and T1-weighted (T1W) image data for the gray matter. DWI data provide information about the microstructure of the white matter; we use the diffusion tensor ($D \in \mathrm{SPD}(3)$) model to represent the diffusivity in the microstructure. T1W data can be used to obtain volumetric properties of the gray matter. The volumetric information is obtained from the Jacobian matrices ($J$) of the diffeomorphic mapping to a population-specific template. These Jacobian matrices can be used to obtain the Cauchy deformation tensors, which also belong to $\mathrm{SPD}(3)$.

The hippocampus and cingulum bundle (shown in Fig. 3) are two important regions in the brain. They are a priori believed to be significant in AD ↔ brain structure relationships, primarily due to the role of the hippocampus in memory function and the projections of the cingulum onto the hippocampus. However, detecting risk–brain relationships before memory/cognitive function is impaired is difficult due to several factors (such as noise in the data, small sample and effect sizes, and type I error due to multiple comparisons). One approach to improve the statistical power in such a setting is to perform tests on average properties in regions of interest (ROIs) in the brain. This procedure reduces both noise and the number of comparisons/tests. However, taking averages also dampens the signal of interest, which is already weak in such pre-clinical studies.


Fig. 3. Shown on the left are the bilateral cingulum bundles (green) inside a brain surface obtained from a population DTI template. Similarly, on the right are the bilateral hippocampi. The gray and white matter ROIs are also shown on the right.

CCA can take the multi-modal information from the imaging data and project the voxels into a space where the signal of interest is likely to be stronger.

Experimental Design: The key multimodal linear relations we examine are
\[
\begin{aligned}
Y_{\mathrm{DTI}} &= \beta_0 + \beta_1\,\mathrm{Gender} + \beta_2\, X_{\mathrm{T1W}} + \beta_3\, X_{\mathrm{T1W}}\cdot\mathrm{Gender} + \epsilon,\\
Y_{\mathrm{DTI}} &= \beta'_0 + \beta'_1\,\mathrm{AgeGroup} + \beta'_2\, X_{\mathrm{T1W}} + \beta'_3\, X_{\mathrm{T1W}}\cdot\mathrm{AgeGroup} + \epsilon,
\end{aligned}
\]
where AgeGroup is a categorical variable equal to 0 (middle aged) if the subject's age is at most 65 and 1 (old) otherwise. The sample under investigation is between 43 and 75 years of age. The statistical tests ask whether we can reject the null hypotheses $\beta_3 = 0$ and $\beta'_3 = 0$ using our data at $\alpha = 0.05$. We report the results from the following four sets of analyses. (i) Classical ROI-average analysis: the standard setting where the brain measurements in an ROI are averaged; here $Y_{\mathrm{DTI}} = \mathrm{MD}$, i.e., the average mean diffusivity in the cingulum bundle, and $X_{\mathrm{T1W}} = \log|J|$, i.e., the average volumetric change (relative to the population template) in the hippocampus. (ii) Euclidean CCA using scalar measures (MD and $\log|J|$) in the ROIs: here the voxel data are projected using the classical CCA approach [35], i.e., $Y_{\mathrm{DTI}} = w_{\mathrm{MD}}^T\, \mathrm{MD}$ and $X_{\mathrm{T1W}} = w_{\log|J|}^T \log|J|$. (iii) Euclidean CCA using $D$ and $J$ in the ROIs: this setting improves on the previous one in that the projections are performed using the full tensor data [35]; here $Y_{\mathrm{DTI}} = w_D^T D$ and $X_{\mathrm{T1W}} = w_J^T J$. (iv) Riemannian CCA using $D$ and $J$ in the ROIs: here $Y_{\mathrm{DTI}} = \langle w_D, D\rangle_{\mu_D}$ and $X_{\mathrm{T1W}} = \langle w_J, J\rangle_{\mu_J}$.

The findings are shown in Fig. 4. We can see that CCA using the full tensor information improves the statistical significance for both the Euclidean and Riemannian approaches. The weight vectors in the different settings for both Euclidean and Riemannian CCA are shown in the top row of Fig. 5. We note that there are several different ways of using the output of CCA, and we also performed experiments with the full gray matter and white matter regions of the brain, whose results are included in the extended version. We show representative weight vectors (Fig. 5, bottom row) obtained from the full-brain analyses. Interestingly, the weight vectors are spatially cohesive even without enforcing any spatial constraints. What is even more remarkable is that the regions picked between the DTI and T1W modalities are complementary in a biological sense. Specifically, when performing our CCA on the ROIs, although the cingulum bundle extends into the superior mid-brain regions, the weights are non-zero in its hippocampal projections.


Fig. 4. Experimental evidence showing the improvements in the statistical significance of finding the multi-modal risk–brain interaction effects. The top row shows the gender, volume and diffusivity interactions. The second row shows the interaction effects of the middle/old age groups.

Fig. 5. Weight vectors (in red-yellow color) obtained from our Riemannian CCA approach. The weights are in arbitrary units. The top row is from applying Riemannian CCA on data from the cingulum and hippocampus ROIs (Fig. 3), while the bottom row is obtained using data from the entire white and gray matter regions of the brain. The left block (three columns) shows the results in orthogonal view for DTI and the right block for T1W. The corresponding underlays are the population averages of the fractional anisotropy and T1W contrast images, respectively.

In the case of the entire white and gray matter regions, the volumetric difference (from the population template) in the inferior part of the corpus callosum seems to be highly cross-correlated with the diffusivity in the corpus callosum. Our CCA finds these projections without any a priori constraints in the optimization, suggesting that performing CCA that respects the intrinsic nature of the data can reveal biologically meaningful patterns. Due to space constraints, we refer the interested reader to the extended version of the paper for additional details.


7 Conclusion

The classical CCA assumes that data live in a pair of vector spaces. However, many modern scientific disciplines require the analysis of data which belong to curved spaces where classical CCA is no longer applicable. Motivated by the properties of imaging data from neuroimaging studies, we generalize CCA to Riemannian manifolds. We employ differential geometry tools to extend the operations in CCA to the manifold setting. Such a formulation results in a multi-level optimization problem. We derive solutions using the first-order condition of the projection and an augmented Lagrangian method. In addition, we also develop an efficient single-path algorithm with approximate projections. Finally, we propose a generalization to the product space of SPD(n), namely tensor fields, allowing us to treat a full brain image as a point on the product manifold. On the experimental side, we presented neuroimaging findings using our proposed CCA on DTI and T1W imaging modalities in an Alzheimer's disease (AD) dataset focused on risk factors for this disease. Here, the proposed methods perform well and yield scientifically meaningful results. In closing, we note that our core optimization methods can be readily applied when maximizing correlation between data from two different types of Riemannian manifolds; this may open the doors to various other types of analysis not explicitly investigated in this paper.

Acknowledgments. This work was supported in part by NIH grants AG040396 (VS), AG037639 (BBB), AG021155 (SCJ), AG027161 (SCJ), NS066340 (BCV) and NSF CAREER award 1252725 (VS). Partial support was also provided by UW ADRC, UW ICTR, and Waisman Core grant P30 HD003352-45. The contents do not represent the views of the Dept. of Veterans Affairs or the United States Government.

References

1. Akaho, S.: A kernel method for canonical correlation analysis. In: International Meeting on Psychometric Society (2001)
2. Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: ICML (2013)
3. Avants, B.B., Cook, P.A., Ungar, L., Gee, J.C., Grossman, M.: Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with SCCA. NeuroImage 50(3), 1004–1016 (2010)
4. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. The Journal of Machine Learning Research 3, 1–48 (2003)
5. Callaghan, P.T.: Principles of nuclear magnetic resonance microscopy. Oxford University Press (1991)
6. Chung, M., Worsley, K., Paus, T., Cherif, C., Collins, D., Giedd, J., Rapoport, J., Evans, A.: A unified statistical approach to deformation-based morphometry. NeuroImage 14(3), 595–606 (2001)
7. Do Carmo, M.P.: Riemannian Geometry (1992)
8. Fletcher, P.T.: Geodesic regression and the theory of least squares on Riemannian manifolds. International Journal of Computer Vision 105(2), 171–185 (2013)
9. Fletcher, P.T., Lu, C., Pizer, S.M., Joshi, S.: Principal geodesic analysis for the study of nonlinear statistics of shape. Medical Imaging 23(8), 995–1005 (2004)
10. Goh, A., Lenglet, C., Thompson, P.M., Vidal, R.: A nonparametric Riemannian framework for processing high angular resolution diffusion images (HARDI), pp. 2496–2503 (2009)
11. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: CCA: An overview with application to learning methods. Neural Computation 16(12), 2639–2664 (2004)
12. Hardoon, D.R., et al.: Unsupervised analysis of fMRI data using kernel canonical correlation. NeuroImage 37(4), 1250–1259 (2007)
13. Hinkle, J., Fletcher, P.T., Joshi, S.: Intrinsic polynomials for regression on Riemannian manifolds. Journal of Mathematical Imaging and Vision, 1–21 (2014)
14. Ho, J., Xie, Y., Vemuri, B.: On a nonlinear generalization of sparse coding and dictionary learning. In: ICML, pp. 1480–1488 (2013)
15. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)
16. Hsieh, W.W.: Nonlinear canonical correlation analysis by neural networks. Neural Networks 13(10), 1095–1105 (2000)
17. Hua, X., Gutman, B., et al.: Accurate measurement of brain changes in longitudinal MRI scans using tensor-based morphometry. NeuroImage 57(1), 5–14 (2011)
18. Hua, X., Leow, A.D., Parikshak, N., Lee, S., Chiang, M.C., Toga, A.W., Jack Jr., C.R., Weiner, M.W., Thompson, P.M.: Tensor-based morphometry as a neuroimaging biomarker for Alzheimer's disease: an MRI study of 676 AD, MCI, and normal subjects. NeuroImage 43(3), 458–469 (2008)
19. Huang, H., He, H., Fan, X., Zhang, J.: Super-resolution of human face image using canonical correlation analysis. Pattern Recognition 43(7), 2532–2543 (2010)
20. Huckemann, S., Hotz, T., Munk, A.: Intrinsic shape analysis: Geodesic PCA for Riemannian manifolds modulo isometric Lie group actions. Statistica Sinica 20, 1–100 (2010)
21. Jayasumana, S., Hartley, R., Salzmann, M., Li, H., Harandi, M.: Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In: CVPR, pp. 73–80 (2013)
22. Karcher, H.: Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics 30(5), 509–541 (1977)
23. Kim, H.J., Adluru, N., Collins, M.D., Chung, M.K., Bendlin, B.B., Johnson, S.C., Davidson, R.J., Singh, V.: Multivariate general linear models (MGLM) on Riemannian manifolds with applications to statistical analysis of diffusion weighted images. In: CVPR (2014)
24. Kim, T.K., Cipolla, R.: CCA of video volume tensors for action categorization and detection. PAMI 31(8), 1415–1428 (2009)
25. Lai, P.L., Fyfe, C.: A neural implementation of canonical correlation analysis. Neural Networks 12(10), 1391–1397 (1999)
26. Lebanon, G., et al.: Riemannian geometry and statistical machine learning. Ph.D. thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science (2005)
27. Li, P., Wang, Q., Zuo, W., Zhang, L.: Log-Euclidean kernels for sparse representation and dictionary learning. In: ICCV, pp. 1601–1608 (2013)
28. Moakher, M.: A differential geometric approach to the geometric mean of symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications 26(3), 735–747 (2005)
29. Niethammer, M., Huang, Y., Vialard, F.X.: Geodesic regression for image time-series. In: MICCAI, pp. 655–662 (2011)

30. Said, S., Courty, N., Le Bihan, N., Sangwine, S.J., et al.: Exact principal geodesic analysis for data on SO(3). In: Proceedings of the 15th European Signal Processing Conference, pp. 1700–1705 (2007)
31. Shi, X., Styner, M., Lieberman, J., Ibrahim, J.G., Lin, W., Zhu, H.: Intrinsic regression models for manifold-valued data. In: Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009, Part II. LNCS, vol. 5762, pp. 192–199. Springer, Heidelberg (2009)
32. Sommer, S.: Horizontal dimensionality reduction and iterated frame bundle development. In: Nielsen, F., Barbaresco, F. (eds.) GSI 2013. LNCS, vol. 8085, pp. 76–83. Springer, Heidelberg (2013)
33. Sommer, S., Lauze, F., Nielsen, M.: Optimization over geodesics for exact principal geodesic analysis. Advances in Computational Mathematics, 1–31
34. Steinke, F., Hein, M., Scholkopf, B.: Nonparametric regression between general Riemannian manifolds. SIAM Journal on Imaging Sciences 3(3), 527–563 (2010)
35. Witten, D.M., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009)
36. Xie, Y., Vemuri, B.C., Ho, J.: Statistical analysis of tensor fields. In: Jiang, T., Navab, N., Pluim, J.P.W., Viergever, M.A. (eds.) MICCAI 2010, Part I. LNCS, vol. 6361, pp. 682–689. Springer, Heidelberg (2010)
37. Yger, F., Berar, M., Gasso, G., Rakotomamonjy, A.: Adaptive canonical correlation analysis based on matrix manifolds. In: ICML (2012)
38. Yu, S., Tan, T., Huang, K., Jia, K., Wu, X.: A study on gait-based gender classification. IEEE Transactions on Image Processing 18(8), 1905–1910 (2009)
39. Zhu, H., Chen, Y., Ibrahim, J.G., Li, Y., Hall, C., Lin, W.: Intrinsic regression models for positive-definite matrices with applications to diffusion tensor imaging. Journal of the American Statistical Association 104(487) (2009)