Doubly Decomposing Nonparametric Tensor Regression
Tensor regression problem · Our Approach · Convergence Analysis · Experiments · Summary
Doubly Decomposing Nonparametric Tensor Regression
M. Imaizumi (Univ. of Tokyo), K. Hayashi (AIST / JST ERATO)
2016/06/21 (ICML 2016)
Outline
Topic
- Regression with tensor input
- Nonparametric approach

Method
- Propose a nonparametric regression model with a Bayes estimator
- It avoids "the curse of dimensionality" of tensor regression
1 Tensor regression problem
2 Our Approach
3 Convergence Analysis
4 Experiments
5 Summary
Tensor Regression Problem
Tensor data: X ∈ R^{I_1 × ... × I_K}
- K: number of modes of tensor X
- I_k: dimension of the k-th mode

Tensor regression: n observations D_n = {(X_i, Y_i)}_{i=1}^n
- Input (tensor): X_i ∈ R^{I_1 × ... × I_K}
- Output (real): Y_i ∈ R

D_n is generated with a function f : R^{I_1 × ... × I_K} → R as

    Y_i = f(X_i) + ε_i,    for i = 1, ..., n,

where ε_i is Gaussian noise.
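The data-generating model above can be sketched in a few lines; the tensor sizes, sample size, noise level, and the stand-in for the unknown f below are illustrative assumptions, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a K = 3 mode tensor with (I_1, I_2, I_3) = (5, 4, 3)
shape = (5, 4, 3)
n = 100  # number of observations

# Hypothetical stand-in for the unknown regression function f
def f_true(X):
    return np.tanh(X.sum())

X = rng.standard_normal((n, *shape))            # inputs X_i
eps = 0.1 * rng.standard_normal(n)              # Gaussian noise eps_i
Y = np.array([f_true(Xi) for Xi in X]) + eps    # Y_i = f(X_i) + eps_i
```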
Application of Tensor Regression Problem
- Predict health conditions from medical 3D images: X_i is the medical 3D image of patient i, Y_i is the health condition of patient i
- Predict the spread of epidemics on networks: X_i is the adjacency matrix of network i, Y_i is the number of infected nodes
The curse of dimensionality
The performance of an estimator of f degrades with tensor input.

A naive estimator f̃_n satisfies

    ‖f̃_n − f‖² = O(n^{−2β/(2β+∏_k I_k)}),

where β is the smoothness of f and ∏_k I_k is the number of elements in X.

We propose a method that obtains a better convergence rate.
Double Decomposition
Idea
- Reduce the complexity of f by decomposing f into nonparametric local functions
- Control the bias-variance trade-off through the estimation

We decompose both the function f and the input tensor X.
Double Decomposition 1
1. Tensor (CP) Decomposition

Consider X ∈ R^{I_1 × ... × I_K}. There exists a set of normalized vectors {x_r^{(k)} ∈ R^{I_k}}_{r=1,k=1}^{R*,K} and scale terms λ_r for r = 1, ..., R*, such that

    X = Σ_{r=1}^{R*} λ_r x_r^{(1)} ⊗ x_r^{(2)} ⊗ ... ⊗ x_r^{(K)}.

R* is the tensor rank.
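The CP form above is easy to build directly; the rank, dimensions, and random factors in this sketch are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

R, dims = 2, (5, 4, 3)  # tensor rank R* and mode dimensions I_1, I_2, I_3

# Normalized factor vectors x_r^(k) and scale terms lambda_r
factors = [[v / np.linalg.norm(v)
            for v in (rng.standard_normal(I) for I in dims)]
           for _ in range(R)]
lam = rng.standard_normal(R)

# X = sum_r lambda_r * x_r^(1) (x) x_r^(2) (x) x_r^(3)
X = sum(lam[r] * np.einsum("i,j,k->ijk", *factors[r]) for r in range(R))
```

The mode-1 unfolding of such a tensor has matrix rank at most R*, which is a quick sanity check on the construction.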
Double Decomposition 2
2. Functional Decomposition

For each r, consider a function f(x_r^{(1)}, x_r^{(2)}, ..., x_r^{(K)}), where the x_r^{(k)} are I_k-dimensional vectors. There exist M* ∈ Z_+ ∪ {∞} and a set of local functions {f_m^{(k)}}_{k=1,m=1}^{K,M*} satisfying

    f(x_r^{(1)}, x_r^{(2)}, ..., x_r^{(K)}) = Σ_{m=1}^{M*} ∏_{k=1}^{K} f_m^{(k)}(x_r^{(k)}).

M* is the model complexity.
Proposed Framework
Assumption: f is additively separable with respect to r = 1, ..., R*.

Consider the doubly decomposed form of f:

    f(X) = Σ_{m=1}^{M*} Σ_{r=1}^{R*} λ_r ∏_{k=1}^{K} f_m^{(k)}(x_r^{(k)})

Additive-Multiplicative Nonparametric Regression (AMNR)
- Composed of f_m^{(k)} for all m = 1, ..., M* and k = 1, ..., K
- f_m^{(k)} takes an I_k-dimensional vector as input
- M* (model complexity) and R* (tensor rank) are tuning parameters
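A minimal sketch of evaluating the AMNR form numerically; the local functions, factors, and scales below are hypothetical toy choices, not estimated quantities.

```python
import numpy as np

def amnr(factors, lam, local_fns):
    """Evaluate f(X) = sum_m sum_r lam[r] * prod_k f_m^(k)(x_r^(k)).

    factors[r][k]  : factor vector x_r^(k) of the input tensor
    lam[r]         : CP scale lambda_r
    local_fns[m][k]: local function f_m^(k), mapping a vector to a scalar
    """
    return sum(
        lam[r] * np.prod([fns_m[k](x) for k, x in enumerate(x_r)])
        for fns_m in local_fns
        for r, x_r in enumerate(factors)
    )

# Toy instance with M* = 2, R* = 1, K = 2 (hypothetical local functions)
factors = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])]]
lam = [2.0]
local_fns = [
    [lambda x: np.tanh(x.sum()), lambda x: x @ x],
    [lambda x: np.cos(x[0]),     lambda x: np.sin(x[0])],
]
y = amnr(factors, lam, local_fns)  # 2 * tanh(1) here, since the m = 2 term vanishes
```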
Estimation Method
The Bayes method with a Gaussian process prior.

Prior:

    π(f) = ∏_m ∏_k GP^{(k)}(f_m^{(k)})

Posterior:

    π(f | D_n) = [ exp(−Σ_{i=1}^n (Y_i − G[f](X_i))²) / ∫ exp(−Σ_{i=1}^n (Y_i − G[f'](X_i))²) π(df') ] π(f),

where

    G[f](X_i) := Σ_{m=1}^{M} Σ_{r=1}^{R} ∏_{k=1}^{K} f_m^{(k)}(x_{r,i}^{(k)}).

Implementation: the estimation is based on Gibbs sampling.
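The GP prior over each local function can be sketched by drawing one sample at finitely many inputs; the squared-exponential kernel and its length-scale are assumptions here, as the slides do not specify the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, ell=1.0):
    # Squared-exponential covariance between rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

Ik, n_pts = 3, 50
xs = rng.standard_normal((n_pts, Ik))             # I_k-dimensional inputs x_r^(k)
Kmat = rbf_kernel(xs, xs) + 1e-8 * np.eye(n_pts)  # jitter for numerical stability
f_sample = rng.multivariate_normal(np.zeros(n_pts), Kmat)  # one draw of f_m^(k)
```

A Gibbs sweep would alternate such conditional GP draws over the m and k indices given the other local functions.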
Analyze Convergence Theoretically
In the AMNR form, M* controls the size of the bias: as M* increases, the bias decreases.

We focus on the distance between f̂_n and f* for a given M*. We start with the case where a finite M* is sufficient to represent f; then we consider the case where M* is infinite.
Finite M* Case

In the following, we assume that
(i) f belongs to a Sobolev space with order β;
(ii) the parameters of the prior are appropriately selected.

Theorem 1
Let M* < ∞. Then, with some finite constant C > 0,

    E‖f̂_n − f*‖²_n ≤ C n^{−2β/(2β+max_k I_k)}.

Recall that the naive estimator f̃_n has convergence rate n^{−2β/(2β+∏_k I_k)}.
Infinite M* Case

With infinite M*, we estimate the first M components under an additional assumption.

Theorem 2
Assume that, with some constant γ ≥ 1,

    ‖Σ_r λ_r ∏_k f_m^{(k)}‖_2 = o(m^{−γ−1}),  as m → ∞.

Suppose we construct the estimator with complexity M chosen such that

    M ≍ (n^{−2β/(2β+max_k I_k)})^{1/(1+γ)}.

Then, with some finite constant C > 0,

    E‖f̂_n − f*‖²_n ≤ C (n^{−2β/(2β+max_k I_k)})^{γ/(1+γ)}.
Convergence Rate
Comparison of nonparametric methods for tensor regression. For the example column, we set K = 3, I_k = 100, β = γ = 2.

    Method                Conv. Rate                             Example
    Naive                 n^{−2β/(2β+∏_k I_k)}                   n^{−1/2501}
    AMNR (finite M*)      n^{−2β/(2β+max_k I_k)}                 n^{−1/26}
    AMNR (infinite M*)    (n^{−2β/(2β+max_k I_k)})^{γ/(1+γ)}     n^{−1/39}
AMNR achieves a better convergence rate by reducing the size of the model space via the double decomposition.
Experiment Outline
We present three experiments:
1. Prediction performance
2. Convergence analysis
3. Real data analysis

Methods
- AMNR (our method)
- TGP (Tensor Gaussian Process; Zhou et al. (2014)): close to the naive estimator
- TLR (Tensor Linear Regression; Zhao et al. (2013)): not a nonparametric method
Prediction Performance
Generate synthetic data with a low-rank tensor as

    f(X) = Σ_{r=1}^{2} λ_r ∏_{k=1}^{K} (1 + exp(−γ^T x_r^{(k)}))^{−1}
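A sketch of this generator, under assumed dimensions, scales λ_r, and coefficient vector γ (illustrative choices; the talk does not report its exact settings).

```python
import numpy as np

rng = np.random.default_rng(0)

K_modes, Ik, R = 3, 10, 2
lam = np.array([1.0, 0.5])           # hypothetical scales lambda_r
gamma = rng.standard_normal(Ik)      # hypothetical coefficient vector

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(factors):
    """factors[r][k]: CP factor vector x_r^(k) of the input tensor X."""
    return sum(
        lam[r] * np.prod([sigmoid(gamma @ factors[r][k]) for k in range(K_modes)])
        for r in range(R)
    )

factors = [[rng.standard_normal(Ik) for _ in range(K_modes)] for _ in range(R)]
y = f(factors)  # lies in (0, lam.sum()) since each sigmoid is in (0, 1)
```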
[Figure: test MSE vs. sample size n (100-500) for each method; a full view (MSE up to 2.5) and an enlarged view (MSE 0.04-0.20).]
Convergence analysis
Generate synthetic data with a smoothness-controlled process:

    f(X) = Σ_{r=1}^{R} ∏_{k=1}^{K} Σ_l μ_l φ_l(γ^T x)
[Figure: log MSE vs. log n for TGP and AMNR, on a symmetric tensor (10×10×10) and an asymmetric tensor (10×3×3).]
Real Data Analysis
Epidemic spreading data
- X_i: adjacency matrix of network i
- Y_i: total number of infected nodes in network i
[Figure: testing MSE vs. n (up to 200) for each method.]
Conclusion
- Proposed a nonparametric regression model with tensor input
- AMNR controls the complexity of the regression function by decomposing both the function and the input tensor
- This controls the bias-variance trade-off and avoids the curse of dimensionality of tensor input
- Its main limitation is computational complexity