Estimating structured vector autoregressive models


Page 1: Estimating structured vector autoregressive models

ICML2016

Estimating Structured Vector Autoregressive Models

NEC

ICML2016 2016/7/21

Page 2: Estimating structured vector autoregressive models


Estimating Structured Vector Autoregressive Models

Igor Melnyk [email protected], Arindam Banerjee [email protected]

Department of Computer Science and Engineering, University of Minnesota, Twin Cities

Abstract

While considerable advances have been made in estimating high-dimensional structured models from independent data using Lasso-type models, limited progress has been made for settings when the samples are dependent. We consider estimating structured VAR (vector auto-regressive model), where the structure can be captured by any suitable norm, e.g., Lasso, group Lasso, ordered weighted Lasso, etc. In the VAR setting with correlated noise, although there is strong dependence over time and covariates, we establish bounds on the non-asymptotic estimation error of structured VAR parameters. The estimation error is of the same order as that of the corresponding Lasso-type estimator with independent samples, and the analysis holds for any norm. Our analysis relies on results in generic chaining, sub-exponential martingales, and spectral representation of VAR models. Experimental results on synthetic and real data with a variety of structures are presented, validating theoretical results.

1. Introduction

The past decade has seen considerable progress on approaches to structured parameter estimation, especially in the linear regression setting, where one considers regularized estimation problems of the form:

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^q}{\operatorname{argmin}}\; \frac{1}{M}\|y - Z\beta\|_2^2 + \lambda_M R(\beta), \qquad (1)$$

where $\{(y_i, z_i),\ i = 1, \ldots, M\}$, $y_i \in \mathbb{R}$, $z_i \in \mathbb{R}^q$, such that $y = [y_1, \ldots, y_M]^T$ and $Z = [z_1^T, \ldots, z_M^T]^T$, is the training set of $M$ independently and identically distributed (i.i.d.) samples, $\lambda_M > 0$ is a regularization parameter, and $R(\cdot)$ denotes a suitable norm (Tibshirani, 1996; Zou & Hastie, 2005; Yuan & Lin, 2006). Specific choices of $R(\cdot)$ lead to certain types of structured parameters to be estimated. For example, the decomposable norm $R(\beta) = \|\beta\|_1$ yields Lasso, estimating sparse parameters, $R(\beta) = \|\beta\|_G$ gives Group Lasso, estimating group sparse parameters, and $R(\beta) = \|\beta\|_{\mathrm{owl}}$, the ordered weighted $L_1$ norm (OWL) (Bogdan et al., 2013), gives the sorted $L_1$-penalized estimator, clustering correlated regression parameters (Figueiredo & Nowak, 2014). Non-decomposable norms, such as the K-support norm (Argyriou et al., 2012) or the overlapping group sparsity norm (Jacob et al., 2009), can be used to uncover more complicated model structures. Theoretical analysis of such models, including sample complexity and non-asymptotic bounds on the estimation error, relies on the design matrix Z, usually assumed (sub)-Gaussian with independent rows, and the specific norm $R(\cdot)$ under consideration (Raskutti et al., 2010; Rudelson & Zhou, 2013). Recent work has generalized such estimators to work with any norm (Negahban et al., 2012; Banerjee et al., 2014) with i.i.d. rows in Z.

The focus of the current paper is on structured estimation in vector auto-regressive (VAR) models (Lutkepohl, 2007), arguably the most widely used family of multivariate time series models. VAR models have been applied widely, ranging from describing the behavior of economic and financial time series (Tsay, 2005) to modeling dynamical systems (Ljung, 1998) and estimating brain function connectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is defined as

$$x_t = A_1 x_{t-1} + A_2 x_{t-2} + \cdots + A_d x_{t-d} + \epsilon_t, \qquad (2)$$

where $x_t \in \mathbb{R}^p$ denotes a multivariate time series, $A_k \in \mathbb{R}^{p \times p}$, $k = 1, \ldots, d$, are the parameters of the model, and $d \ge 1$ is the order of the model. In this work, we assume that the noise $\epsilon_t \in \mathbb{R}^p$ follows a Gaussian distribution, $\epsilon_t \sim \mathcal{N}(0, \Sigma)$, with $E(\epsilon_t \epsilon_t^T) = \Sigma$ and $E(\epsilon_t \epsilon_{t+\tau}^T) = 0$ for $\tau \neq 0$. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix $\Sigma$ is assumed to be positive definite with bounded largest eigenvalue, i.e., $\Lambda_{\min}(\Sigma) > 0$ and $\Lambda_{\max}(\Sigma) < \infty$.
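As a concrete illustration of the model in (2), the following minimal Python sketch (not from the paper; the dimensions, coefficient matrices and noise covariance are arbitrary illustrative choices) simulates a small stable VAR(2) process with Gaussian noise $\epsilon_t \sim \mathcal{N}(0, \Sigma)$.

```python
import numpy as np

rng = np.random.default_rng(0)

p, d, T = 3, 2, 500                      # dimension, VAR order, number of time steps
A = [0.3 * np.eye(p), 0.2 * np.eye(p)]   # illustrative coefficient matrices A_1, A_2 (stable)
Sigma = 0.5 * np.eye(p)                  # noise covariance (positive definite)

x = np.zeros((T + 1, p))                 # x[t] holds the observation x_t
for t in range(d, T + 1):
    # VAR(d) recursion: x_t = A_1 x_{t-1} + ... + A_d x_{t-d} + eps_t
    x[t] = sum(A[k] @ x[t - k - 1] for k in range(d))
    x[t] += rng.multivariate_normal(np.zeros(p), Sigma)
```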

In the current context, the parameters $\{A_k\}$ are assumed

characterizing sample complexity and error bounds.

2.1. Regularized Estimator

To estimate the parameters of the VAR model, we transform the model in (2) into the form suitable for the regularized estimator (1). Let $(x_0, x_1, \ldots, x_T)$ denote the $T + 1$ samples generated by the stable VAR model in (2); stacking them together, we obtain

$$\begin{bmatrix} x_d^T \\ x_{d+1}^T \\ \vdots \\ x_T^T \end{bmatrix} = \begin{bmatrix} x_{d-1}^T & x_{d-2}^T & \ldots & x_0^T \\ x_d^T & x_{d-1}^T & \ldots & x_1^T \\ \vdots & \vdots & \ddots & \vdots \\ x_{T-1}^T & x_{T-2}^T & \ldots & x_{T-d}^T \end{bmatrix} \begin{bmatrix} A_1^T \\ A_2^T \\ \vdots \\ A_d^T \end{bmatrix} + \begin{bmatrix} \epsilon_d^T \\ \epsilon_{d+1}^T \\ \vdots \\ \epsilon_T^T \end{bmatrix}$$

which can also be compactly written as

Y = XB + E, (3)

where $Y \in \mathbb{R}^{N \times p}$, $X \in \mathbb{R}^{N \times dp}$, $B \in \mathbb{R}^{dp \times p}$, and $E \in \mathbb{R}^{N \times p}$ for $N = T - d + 1$. Vectorizing (column-wise) each matrix in (3), we get

$$\mathrm{vec}(Y) = (I_{p \times p} \otimes X)\,\mathrm{vec}(B) + \mathrm{vec}(E)$$
$$y = Z\beta + \epsilon,$$

where $y \in \mathbb{R}^{Np}$, $Z = (I_{p \times p} \otimes X) \in \mathbb{R}^{Np \times dp^2}$, $\beta \in \mathbb{R}^{dp^2}$, $\epsilon \in \mathbb{R}^{Np}$, and $\otimes$ is the Kronecker product. The covariance matrix of the noise $\epsilon$ is now $E[\epsilon\epsilon^T] = \Sigma \otimes I_{N \times N}$. Consequently, the regularized estimator takes the form

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^{dp^2}}{\operatorname{argmin}}\; \frac{1}{N}\|y - Z\beta\|_2^2 + \lambda_N R(\beta), \qquad (4)$$

where $R(\beta)$ can be any vector norm, separable along the rows of the matrices $A_k$. Specifically, if we denote $\beta = [\beta_1^T \ldots \beta_p^T]^T$ and $A_k(i,:)$ as the $i$-th row of matrix $A_k$ for $k = 1, \ldots, d$, then our assumption is equivalent to

$$R(\beta) = \sum_{i=1}^{p} R(\beta_i) = \sum_{i=1}^{p} R\Big(\big[A_1(i,:)^T \ldots A_d(i,:)^T\big]^T\Big). \qquad (5)$$

To reduce clutter and without loss of generality, we assume the norm $R(\cdot)$ to be the same for each row $i$. Since the analysis decouples across rows, it is straightforward to extend our analysis to the case when a different norm is used for each row of $A_k$, e.g., $L_1$ for row one, $L_2$ for row two, the K-support norm (Argyriou et al., 2012) for row three, etc. Observe that within a row, the norm need not be decomposable across columns.
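To make the construction in (3)-(5) concrete, here is a minimal sketch (our own illustration, not the authors' code) that continues from the simulated series x of the previous snippet, stacks the samples into Y and X, and estimates each column of B, i.e., the combined rows of the A_k, with a separate Lasso fit, matching the row-wise separable choice R = L1. Note that scikit-learn's Lasso puts a 1/(2N) factor on the squared loss, so its alpha corresponds to the lambda_N of (4) only up to scaling, and the value used here is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# assumes x, T, d, p from the previous VAR(2) simulation snippet
N = T - d + 1
Y = x[d:T + 1]                                         # N x p responses: x_d, ..., x_T
X = np.hstack([x[d - k - 1:T - k] for k in range(d)])  # N x dp design: lags 1, ..., d

# column j of B stacks the j-th rows of A_1, ..., A_d, so with a row-wise
# separable L1 norm the estimator (4) decouples into p independent Lasso fits
B_hat = np.column_stack([
    Lasso(alpha=0.1, fit_intercept=False).fit(X, Y[:, j]).coef_
    for j in range(p)
])
A_hat = [B_hat[k * p:(k + 1) * p, :].T for k in range(d)]  # estimates of A_1, ..., A_d
```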

The main difference between the estimation problem in (1) and the formulation in (4) is the strong dependence between the samples $(x_0, x_1, \ldots, x_T)$, violating the i.i.d. assumption on the data $\{(y_i, z_i),\ i = 1, \ldots, Np\}$. In particular, this leads to correlations between the rows and columns of the matrix $X$ (and consequently of $Z$). To deal with such dependencies, following (Basu & Michailidis, 2015), we utilize the spectral representation of the autocovariance of VAR models to control the dependencies in the matrix $X$.

2.2. Stability of VAR Model

Since VAR models are (linear) dynamical systems, for the analysis we need to establish conditions under which the VAR model (2) is stable, i.e., the time-series process does not diverge over time. For understanding stability, it is convenient to rewrite the VAR model of order $d$ in (2) as an equivalent VAR model of order 1:

$$\begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-(d-1)} \end{bmatrix} = \underbrace{\begin{bmatrix} A_1 & A_2 & \ldots & A_{d-1} & A_d \\ I & 0 & \ldots & 0 & 0 \\ 0 & I & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & I & 0 \end{bmatrix}}_{A} \begin{bmatrix} x_{t-1} \\ x_{t-2} \\ \vdots \\ x_{t-d} \end{bmatrix} + \begin{bmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad (6)$$

where $A \in \mathbb{R}^{dp \times dp}$. Therefore, the VAR process is stable if all the eigenvalues of $A$, i.e., the solutions $\lambda \in \mathbb{C}$ of $\det(\lambda I_{dp \times dp} - A) = 0$, satisfy $|\lambda| < 1$. Equivalently, expressed in terms of the original parameters $A_k$, stability holds if all solutions of $\det\big(I - \sum_{k=1}^{d} A_k \frac{1}{\lambda^k}\big) = 0$ satisfy $|\lambda| < 1$ (see Section 2.1.1 of (Lutkepohl, 2007) and Section 1.3 of the supplement for more details).
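A quick numerical way to check the stability condition is to form the companion matrix A of (6) and verify that its spectral radius is below one. The sketch below is our own illustration, reusing the toy coefficient matrices from the earlier simulation snippet.

```python
import numpy as np

p = 3
A1, A2 = 0.3 * np.eye(p), 0.2 * np.eye(p)        # illustrative VAR(2) coefficients
A_comp = np.block([[A1, A2],                     # companion matrix A of (6) for d = 2
                   [np.eye(p), np.zeros((p, p))]])

spectral_radius = max(abs(np.linalg.eigvals(A_comp)))
print("stable:", spectral_radius < 1)            # True for this choice of A_1, A_2
```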

2.3. Properties of Design Matrix X

In what follows, we analyze the covariance structure of the matrix X in (3) using spectral properties of the VAR model. The results will then be used in establishing the high probability bounds for the estimation guarantees in problem (4).

Define any row of $X$ as $X_{i,:} \in \mathbb{R}^{dp}$, $1 \le i \le N$. Since we assumed that $\epsilon_t \sim \mathcal{N}(0, \Sigma)$, it follows that each row is distributed as $X_{i,:} \sim \mathcal{N}(0, C_X)$, where the covariance matrix $C_X \in \mathbb{R}^{dp \times dp}$ is the same for all $i$:

$$C_X = \begin{bmatrix} \Gamma(0) & \Gamma(1) & \ldots & \Gamma(d-1) \\ \Gamma(1)^T & \Gamma(0) & \ldots & \Gamma(d-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma(d-1)^T & \Gamma(d-2)^T & \ldots & \Gamma(0) \end{bmatrix}, \qquad (7)$$

where $\Gamma(h) = E(x_t x_{t+h}^T) \in \mathbb{R}^{p \times p}$. It turns out that since $C_X$ is a block-Toeplitz matrix, its eigenvalues can be bounded as (see (Gutierrez-Gutierrez & Crespo, 2011))

$$\inf_{\substack{1 \le j \le p \\ \omega \in [0,2\pi]}} \Lambda_j[\rho(\omega)] \;\le\; \Lambda_k[C_X]\big|_{1 \le k \le dp} \;\le\; \sup_{\substack{1 \le j \le p \\ \omega \in [0,2\pi]}} \Lambda_j[\rho(\omega)], \qquad (8)$$

where $\Lambda_k[\cdot]$ denotes the $k$-th eigenvalue of a matrix and, for $i = \sqrt{-1}$, $\rho(\omega) = \sum_{h=-\infty}^{\infty} \Gamma(h) e^{-h i \omega}$, $\omega \in [0, 2\pi]$, is the spectral density, i.e., a Fourier transform of the autocovariance matrix $\Gamma(h)$.


Page 3: Estimating structured vector autoregressive models

- Lasso-based VAR estimation [Basu15]; general norms with i.i.d. samples [Banerjee14]; this work: general norms for VAR

Page 4: Estimating structured vector autoregressive models



3.1. High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties and an illustration of some special cases based on the popular Lasso and Group Lasso regularization norms. In Section 3.4 we will present the main ideas of our proof technique, with all the details delegated to the supplement.

To establish a lower bound on the regularization parameter $\lambda_N$, we derive an upper bound $R^*\big[\tfrac{1}{N} Z^T \epsilon\big] \le \alpha$, for some $\alpha > 0$, which will establish the required relationship $\lambda_N \ge \alpha \ge R^*\big[\tfrac{1}{N} Z^T \epsilon\big]$.

Theorem 3.3 Let $\Omega_R = \{u \in \mathbb{R}^{dp} \mid R(u) \le 1\}$, and define $w(\Omega_R) = E\big[\sup_{u \in \Omega_R} \langle g, u \rangle\big]$ to be the Gaussian width of the set $\Omega_R$ for $g \sim \mathcal{N}(0, I)$. For any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, with probability at least $1 - c\exp(-\min(\epsilon_2^2, \epsilon_1) + \log(p))$ we can establish that

$$R^*\!\Big[\tfrac{1}{N} Z^T \epsilon\Big] \;\le\; c_2(1+\epsilon_2)\,\frac{w(\Omega_R)}{\sqrt{N}} + c_1(1+\epsilon_1)\,\frac{w^2(\Omega_R)}{N^2},$$

where $c$, $c_1$ and $c_2$ are positive constants.
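The Gaussian width $w(\Omega_R)$ appearing in Theorem 3.3 can be approximated by Monte Carlo: for a norm ball $\Omega_R$, $\sup_{u \in \Omega_R}\langle g, u\rangle$ equals the dual norm of $g$, so averaging it over draws $g \sim \mathcal{N}(0, I)$ estimates $w(\Omega_R)$. The sketch below is our own illustration for the $L_1$ case (dual norm = infinity norm), with the unknown constants $c_1$, $c_2$ simply set to one to show the scale of the resulting lower bound on $\lambda_N$.

```python
import numpy as np

rng = np.random.default_rng(0)
dp, N = 60, 400                        # per-row dimension and sample size (illustrative)

# For R = L1, Omega_R is the unit L1 ball and sup_{R(u) <= 1} <g, u> = ||g||_inf
g = rng.standard_normal((2000, dp))
w_R = np.abs(g).max(axis=1).mean()     # Monte Carlo estimate of w(Omega_R)

c1 = c2 = 1.0                          # unknown absolute constants, set to 1 here
lambda_N = c2 * w_R / np.sqrt(N) + c1 * w_R**2 / N**2
print(f"w(Omega_R) ~ {w_R:.2f}, lambda_N scale ~ {lambda_N:.4f}")
```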

To establish the restricted eigenvalue condition, we will show that $\inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p \times p} \otimes X)\Delta\|_2}{\|\Delta\|_2} \ge \nu$ for some $\nu > 0$, and then set $\kappa\sqrt{N} = \nu$.

Theorem 3.4 Let $\Theta = \mathrm{cone}(\Omega_{E_j}) \cap S^{dp-1}$, where $S^{dp-1}$ is the unit sphere. The error set $\Omega_{E_j}$ is defined as $\Omega_{E_j} = \big\{\Delta_j \in \mathbb{R}^{dp} \mid R(\beta^*_j + \Delta_j) \le R(\beta^*_j) + \tfrac{1}{r} R(\Delta_j)\big\}$ for $r > 1$, $j = 1, \ldots, p$, where $\Delta = [\Delta_1^T, \ldots, \Delta_p^T]^T$ with each $\Delta_j$ of size $dp \times 1$, and $\beta^* = [\beta_1^{*T} \ldots \beta_p^{*T}]^T$ with $\beta^*_j \in \mathbb{R}^{dp}$. The set $\Omega_{E_j}$ is a part of the decomposition $\Omega_E = \Omega_{E_1} \times \cdots \times \Omega_{E_p}$, due to the assumption on the row-wise separability of the norm $R(\cdot)$ in (5). Also define $w(\Theta) = E\big[\sup_{u \in \Theta} \langle g, u \rangle\big]$ to be the Gaussian width of the set $\Theta$ for $g \sim \mathcal{N}(0, I)$ and $u \in \mathbb{R}^{dp}$. Then with probability at least $1 - c_1\exp(-c_2\eta^2 + \log(p))$, for any $\eta > 0$,

$$\inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p \times p} \otimes X)\Delta\|_2}{\|\Delta\|_2} \;\ge\; \nu,$$

where $\nu = \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \eta$, the quantities $c$, $c_1$, $c_2$ are positive constants, and $L$, $M$ are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose $\eta = \tfrac{1}{2}\sqrt{NL}$ and set $\kappa\sqrt{N} = \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \eta$, and since $\kappa\sqrt{N} > 0$ must be satisfied, we can establish a lower bound on the number of samples $N$:

$$\sqrt{N} > \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L}/2} = O(w(\Theta)).$$

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if $\Lambda_{\min}(A)$ and $\Lambda_{\min}(\Sigma)$ are larger and $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\max}(\Sigma)$ are smaller. In turn, this means that the matrices defined in (10) and (12) must be well conditioned and the VAR process must be stable, with eigenvalues well inside the unit circle (see Section 2.2). Alternatively, we can also understand the bound on $N$ as showing that large values of $M$ and small values of $L$ indicate stronger dependency in the data, thus requiring more samples for the RE condition to hold with high probability.
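For a rough sense of scale (the numbers below are hypothetical and only meant to illustrate how the bound behaves; they are not taken from the paper), suppose $L = 0.5$, $M = 4$ and $c\,w(\Theta) = 10$. Then

$$\sqrt{N} > \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L}/2} = \frac{2\cdot 2 + 10}{\sqrt{0.5}/2} \approx 39.6,$$

i.e., roughly $N > 1570$ samples are required; since the requirement is quadratic in $w(\Theta)$, stronger temporal dependence (larger $M$, smaller $L$) or a richer structure (larger $w(\Theta)$) pushes it up quickly.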

Analyzing Theorems 3.3 and 3.4, we can interpret the established results as follows. As the size and dimensionality $N$, $p$ and $d$ of the problem increase, we emphasize the scale of the results and use order notation to absorb the constants. Select a number of samples at least $N \ge O(w^2(\Theta))$ and let the regularization parameter satisfy $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\big)$. With high probability the restricted eigenvalue condition $\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \kappa\sqrt{N}$ for $\Delta \in \mathrm{cone}(\Omega_E)$ then holds, so that $\kappa = O(1)$ is a positive constant. Moreover, the norm of the estimation error in the optimization problem (4) is bounded by $\|\Delta\|_2 \le O\big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\big)\,\Psi(\mathrm{cone}(\Omega_{E_j}))$. Note that the norm compatibility constant $\Psi(\mathrm{cone}(\Omega_{E_j}))$ is assumed to be the same for all $j = 1, \ldots, p$, which follows from our assumption in (5).

Consider now Theorem 3.3 and the bound on the regularization parameter $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\big)$. As the dimensionality of the problem $p$ and $d$ grows and the number of samples $N$ increases, the first term $\frac{w(\Omega_R)}{\sqrt{N}}$ will dominate the second one $\frac{w^2(\Omega_R)}{N^2}$. This can be seen by computing the $N$ for which the two terms become equal, $\frac{w(\Omega_R)}{\sqrt{N}} = \frac{w^2(\Omega_R)}{N^2}$, which happens at $N = w^{2/3}(\Omega_R) < w(\Omega_R)$. Therefore, we can rewrite our results as follows: once the restricted eigenvalue condition holds and $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}}\big)$, the error norm is upper-bounded by $\|\Delta\|_2 \le O\big(\frac{w(\Omega_R)}{\sqrt{N}}\big)\,\Psi(\mathrm{cone}(\Omega_{E_j}))$.

3.3. Special Cases

While the presented results are valid for any norm $R(\cdot)$ separable along the rows of the $A_k$, it is instructive to specialize our analysis to a few popular regularization choices, such as the $L_1$, Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. LASSO

To establish results for the $L_1$ norm, we assume that the parameter $\beta^*$ is $s$-sparse, which in our case is meant to represent the largest number of non-zero elements in any $\beta_i$, $i = 1, \ldots, p$, i.e., the combined $i$-th rows of each $A_k$,


The advantage of utilizing the spectral density is that it has a closed form expression (see Section 9.4 of (Priestley, 1981))

$$\rho(\omega) = \Big(I - \sum_{k=1}^{d} A_k e^{-k i \omega}\Big)^{-1} \Sigma \bigg[\Big(I - \sum_{k=1}^{d} A_k e^{-k i \omega}\Big)^{-1}\bigg]^{*},$$

where $*$ denotes the Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound

$$\Lambda_{\min}[C_X] \ge \Lambda_{\min}(\Sigma)/\Lambda_{\max}(\mathcal{A}) = L, \qquad (9)$$

where we defined $\Lambda_{\max}(\mathcal{A}) = \max_{\omega \in [0,2\pi]} \Lambda_{\max}(\mathcal{A}(\omega))$ for

$$\mathcal{A}(\omega) = \Big(I - \sum_{k=1}^{d} A_k^T e^{k i \omega}\Big)\Big(I - \sum_{k=1}^{d} A_k e^{-k i \omega}\Big). \qquad (10)$$

In establishing high probability bounds we will also need information about the vector $q = Xa \in \mathbb{R}^N$ for any $a \in \mathbb{R}^{dp}$, $\|a\|_2 = 1$. Since each element $X_{i,:}^T a \sim \mathcal{N}(0, a^T C_X a)$, it follows that $q \sim \mathcal{N}(0, Q_a)$ with a covariance matrix $Q_a \in \mathbb{R}^{N \times N}$. It can be shown (see Section 2.1.2 of the supplement) that $Q_a$ can be written as

$$Q_a = (I \otimes a^T)\, C_U\, (I \otimes a), \qquad (11)$$

where $C_U = E(UU^T)$ for $U = [X_{1,:}^T \ldots X_{N,:}^T]^T \in \mathbb{R}^{Ndp}$, which is obtained from the matrix $X$ by stacking all the rows in a single vector, i.e., $U = \mathrm{vec}(X^T)$. In order to bound the eigenvalues of $C_U$ (and consequently of $Q_a$), observe that $U$ can be viewed as a vector obtained by stacking $N$ outputs from the VAR model in (6). Similarly as in (8), if we denote the spectral density of the VAR process in (6) as $\rho_X(\omega) = \sum_{h=-\infty}^{\infty} \Gamma_X(h) e^{-h i \omega}$, $\omega \in [0, 2\pi]$, where $\Gamma_X(h) = E[X_{j,:} X_{j+h,:}^T] \in \mathbb{R}^{dp \times dp}$, then we can write

$$\inf_{\substack{1 \le l \le dp \\ \omega \in [0,2\pi]}} \Lambda_l[\rho_X(\omega)] \;\le\; \Lambda_k[C_U]\big|_{1 \le k \le Ndp} \;\le\; \sup_{\substack{1 \le l \le dp \\ \omega \in [0,2\pi]}} \Lambda_l[\rho_X(\omega)].$$

The closed form expression of this spectral density is

$$\rho_X(\omega) = \big(I - A e^{-i\omega}\big)^{-1} \Sigma_E \Big[\big(I - A e^{-i\omega}\big)^{-1}\Big]^{*},$$

where $\Sigma_E$ is the covariance matrix of the noise vector and $A$ is as defined in expression (6). Thus, an upper bound on $C_U$ can be obtained as $\Lambda_{\max}[C_U] \le \frac{\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(A)}$, where we defined $\Lambda_{\min}(A) = \min_{\omega \in [0,2\pi]} \Lambda_{\min}(A(\omega))$ for

$$A(\omega) = \big(I - A^T e^{i\omega}\big)\big(I - A e^{-i\omega}\big). \qquad (12)$$

Referring back to the covariance matrix $Q_a$ in (11), we get

$$\Lambda_{\max}[Q_a] \le \Lambda_{\max}(\Sigma)/\Lambda_{\min}(A) = M. \qquad (13)$$

We note that for a general VAR model, there might not exist closed-form expressions for $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\min}(A)$. However, for some special cases there are results establishing bounds on these quantities (e.g., see Proposition 2.2 in (Basu & Michailidis, 2015)).
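When closed-form expressions are unavailable, $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\min}(A)$ can be approximated numerically by evaluating (10) and (12) on a grid of frequencies. The sketch below is our own illustration for the same toy VAR(2) used in the earlier snippets, computing the resulting constants $L$ and $M$ from (9) and (13).

```python
import numpy as np

p = 3
A_list = [0.3 * np.eye(p), 0.2 * np.eye(p)]      # illustrative VAR(2) coefficients
Sigma = 0.5 * np.eye(p)                          # noise covariance
A_comp = np.block([[A_list[0], A_list[1]],       # companion matrix of (6), d = 2
                   [np.eye(p), np.zeros((p, p))]])

lam_max_10, lam_min_12 = 0.0, np.inf
for w in np.linspace(0.0, 2.0 * np.pi, 400):
    G = np.eye(p) - sum(Ak * np.exp(-1j * (k + 1) * w) for k, Ak in enumerate(A_list))
    lam_max_10 = max(lam_max_10, np.linalg.eigvalsh(G.conj().T @ G).max())   # via (10)
    H = np.eye(2 * p) - A_comp * np.exp(-1j * w)
    lam_min_12 = min(lam_min_12, np.linalg.eigvalsh(H.conj().T @ H).min())   # via (12)

L = np.linalg.eigvalsh(Sigma).min() / lam_max_10   # lower-bound constant in (9)
M = np.linalg.eigvalsh(Sigma).max() / lam_min_12   # upper-bound constant in (13)
print(f"L ~ {L:.3f}, M ~ {M:.3f}")
```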

3. Regularized Estimation Guarantees

Denote by $\Delta = \hat{\beta} - \beta^*$ the error between the solution of the optimization problem (4) and $\beta^*$, the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: $\|\Delta\|_2 \le \delta$ for some known $\delta$. To establish such conditions, we utilize the framework of (Banerjee et al., 2014). Specifically, the estimation error analysis is based on the following known results, adapted to our setting. The first one characterizes the restricted error set $\Omega_E$, to which the error $\Delta$ belongs.

Lemma 3.1 Assume that

$$\lambda_N \ge r\, R^*\!\Big[\tfrac{1}{N} Z^T \epsilon\Big], \qquad (14)$$

for some constant $r > 1$, where $R^*\big[\tfrac{1}{N} Z^T \epsilon\big]$ is the dual form of the vector norm $R(\cdot)$, defined as $R^*\big[\tfrac{1}{N} Z^T \epsilon\big] = \sup_{R(U) \le 1} \big\langle \tfrac{1}{N} Z^T \epsilon,\, U \big\rangle$ for $U \in \mathbb{R}^{dp^2}$, where $U = [u_1^T, u_2^T, \ldots, u_p^T]^T$ and $u_i \in \mathbb{R}^{dp}$. Then the error vector $\Delta$ belongs to the set

$$\Omega_E = \Big\{\Delta \in \mathbb{R}^{dp^2} \,\Big|\, R(\beta^* + \Delta) \le R(\beta^*) + \tfrac{1}{r} R(\Delta)\Big\}. \qquad (15)$$
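For a concrete instance of condition (14): with the row-wise $L_1$ norm, the dual norm $R^*$ is the $L_\infty$ norm, so (14) reduces to $\lambda_N \ge r\,\|\tfrac{1}{N} Z^T \epsilon\|_\infty$. The sketch below is our own illustration with randomly generated stand-ins for $X$ and $\epsilon$ (purely synthetic, in place of the quantities of Section 2.1) and an arbitrary $r$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, dp, p = 200, 30, 5                     # illustrative sizes
X = rng.standard_normal((N, dp))          # stands in for the lagged design matrix
Z = np.kron(np.eye(p), X)                 # Z = I_{p x p} (Kronecker) X, as in Section 2.1
eps = rng.standard_normal(N * p)          # stands in for vec(E)

r = 2.0                                   # any constant r > 1
dual = np.abs(Z.T @ eps / N).max()        # R*[(1/N) Z^T eps] when R is the L1 norm
print(f"R*[(1/N) Z^T eps] ~ {dual:.4f}; condition (14) needs lambda_N >= {r * dual:.4f}")
```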

The second condition in (Banerjee et al., 2014) establishes the upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds:

$$\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \kappa\sqrt{N}, \qquad (16)$$

for $\Delta \in \mathrm{cone}(\Omega_E)$ and some constant $\kappa > 0$, where $\mathrm{cone}(\Omega_E)$ is the cone of the error set. Then

$$\|\Delta\|_2 \le \frac{1 + r}{r}\, \frac{\lambda_N}{\kappa^2}\, \Psi(\mathrm{cone}(\Omega_E)), \qquad (17)$$

where $\Psi(\mathrm{cone}(\Omega_E))$ is a norm compatibility constant, defined as $\Psi(\mathrm{cone}(\Omega_E)) = \sup_{U \in \mathrm{cone}(\Omega_E)} \frac{R(U)}{\|U\|_2}$.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are defined in terms of quantities involving $Z$ and $\epsilon$, which are random. Therefore, in the following we establish high probability bounds on the regularization parameter in (14) and the RE condition in (16).


Page 5: Estimating structured vector autoregressive models

• Setting: p-dimensional vector VAR(d) model

Estimating Structured Vector Autoregressive Models

Igor Melnyk [email protected] Banerjee [email protected]

Department of Computer Science and Engineering, University of Minnesota, Twin Cities

AbstractWhile considerable advances have been madein estimating high-dimensional structured mod-els from independent data using Lasso-type mod-els, limited progress has been made for settingswhen the samples are dependent. We consider es-timating structured VAR (vector auto-regressivemodel), where the structure can be captured byany suitable norm, e.g., Lasso, group Lasso, or-der weighted Lasso, etc. In VAR setting withcorrelated noise, although there is strong de-pendence over time and covariates, we establishbounds on the non-asymptotic estimation error ofstructured VAR parameters. The estimation erroris of the same order as that of the correspondingLasso-type estimator with independent samples,and the analysis holds for any norm. Our anal-ysis relies on results in generic chaining, sub-exponential martingales, and spectral represen-tation of VAR models. Experimental results onsynthetic and real data with a variety of structuresare presented, validating theoretical results.

1. IntroductionThe past decade has seen considerable progress on ap-proaches to structured parameter estimation, especially inthe linear regression setting, where one considers regular-ized estimation problems of the form:

ˆ� = argmin

�2Rq

1

M

ky � Z�k2

2

+ �

M

R(�) , (1)

where {(y

i

, z

i

), i = 1, . . . , M}, y

i

2 R, z

i

2 Rq , suchthat y = [y

T

1

, . . . , y

T

M

]

T and Z = [z

T

1

, . . . , z

T

M

]

T , isthe training set of M independently and identically dis-tributed (i.i.d.) samples, �

M

> 0 is a regularizationparameter, and R(·) denotes a suitable norm (Tibshirani,

Proceedings of the 33 rdInternational Conference on Machine

Learning, New York, NY, USA, 2016. JMLR: W&CP volume48. Copyright 2016 by the author(s).

1996; Zou & Hastie, 2005; Yuan & Lin, 2006). Specificchoices of R(·) lead to certain types of structured parame-ters to be estimated. For example, the decomposable normR(�) = k�k

1

yields Lasso, estimating sparse parameters,R(�) = k�k

G

gives Group Lasso, estimating group sparseparameters, and R(�) = k�k

owl

, the ordered weightedL

1

norm (OWL) (Bogdan et al., 2013), gives sorted L

1

-penalized estimator, clustering correlated regression pa-rameters (Figueiredo & Nowak, 2014). Non-decomposablenorms, such as K-support norm (Argyriou et al., 2012) oroverlapping group sparsity norm (Jacob et al., 2009) can beused to uncover more complicated model structures. The-oretical analysis of such models, including sample com-plexity and non-asymptotic bounds on the estimation er-ror rely on the design matrix Z, usually assumed (sub)-Gaussian with independent rows, and the specific normR(·) under consideration (Raskutti et al., 2010; Rudelson &Zhou, 2013). Recent work has generalized such estimatorsto work with any norm (Negahban et al., 2012; Banerjeeet al., 2014) with i.i.d. rows in Z.

The focus of the current paper is on structured estimationin vector auto-regressive (VAR) models (Lutkepohl, 2007),arguably the most widely used family of multivariate timeseries models. VAR models have been applied widely,ranging from describing the behavior of economic and fi-nancial time series (Tsay, 2005) to modeling the dynamicalsystems (Ljung, 1998) and estimating brain function con-nectivity (Valdes-Sosa et al., 2005), among others. A VARmodel of order d is defined as

x

t

= A

1

x

t�1

+ A

2

x

t�2

+ · · · + A

d

x

t�d

+ ✏

t

, (2)

where x

t

2 Rp denotes a multivariate time series, A

k

2Rp⇥p

, k = 1, . . . , d are the parameters of the model, andd � 1 is the order of the model. In this work, we as-sume that the noise ✏

t

2 Rp follows a Gaussian distribu-tion, ✏

t

⇠ N (0, ⌃), with E(✏

t

T

t

) = ⌃ and E(✏

t

T

t+⌧

) = 0,for ⌧ 6= 0. The VAR process is assumed to be stable andstationary (Lutkepohl, 2007), while the noise covariancematrix ⌃ is assumed to be positive definite with boundedlargest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1.

In the current context, the parameters {A

k

} are assumed

Estimating Structured Vector Autoregressive Models

Igor Melnyk [email protected] Banerjee [email protected]

Department of Computer Science and Engineering, University of Minnesota, Twin Cities

AbstractWhile considerable advances have been madein estimating high-dimensional structured mod-els from independent data using Lasso-type mod-els, limited progress has been made for settingswhen the samples are dependent. We consider es-timating structured VAR (vector auto-regressivemodel), where the structure can be captured byany suitable norm, e.g., Lasso, group Lasso, or-der weighted Lasso, etc. In VAR setting withcorrelated noise, although there is strong de-pendence over time and covariates, we establishbounds on the non-asymptotic estimation error ofstructured VAR parameters. The estimation erroris of the same order as that of the correspondingLasso-type estimator with independent samples,and the analysis holds for any norm. Our anal-ysis relies on results in generic chaining, sub-exponential martingales, and spectral represen-tation of VAR models. Experimental results onsynthetic and real data with a variety of structuresare presented, validating theoretical results.

1. IntroductionThe past decade has seen considerable progress on ap-proaches to structured parameter estimation, especially inthe linear regression setting, where one considers regular-ized estimation problems of the form:

ˆ� = argmin

�2Rq

1

M

ky � Z�k2

2

+ �

M

R(�) , (1)

where {(y

i

, z

i

), i = 1, . . . , M}, y

i

2 R, z

i

2 Rq , suchthat y = [y

T

1

, . . . , y

T

M

]

T and Z = [z

T

1

, . . . , z

T

M

]

T , isthe training set of M independently and identically dis-tributed (i.i.d.) samples, �

M

> 0 is a regularizationparameter, and R(·) denotes a suitable norm (Tibshirani,

Proceedings of the 33 rdInternational Conference on Machine

Learning, New York, NY, USA, 2016. JMLR: W&CP volume48. Copyright 2016 by the author(s).

1996; Zou & Hastie, 2005; Yuan & Lin, 2006). Specificchoices of R(·) lead to certain types of structured parame-ters to be estimated. For example, the decomposable normR(�) = k�k

1

yields Lasso, estimating sparse parameters,R(�) = k�k

G

gives Group Lasso, estimating group sparseparameters, and R(�) = k�k

owl

, the ordered weightedL

1

norm (OWL) (Bogdan et al., 2013), gives sorted L

1

-penalized estimator, clustering correlated regression pa-rameters (Figueiredo & Nowak, 2014). Non-decomposablenorms, such as K-support norm (Argyriou et al., 2012) oroverlapping group sparsity norm (Jacob et al., 2009) can beused to uncover more complicated model structures. The-oretical analysis of such models, including sample com-plexity and non-asymptotic bounds on the estimation er-ror rely on the design matrix Z, usually assumed (sub)-Gaussian with independent rows, and the specific normR(·) under consideration (Raskutti et al., 2010; Rudelson &Zhou, 2013). Recent work has generalized such estimatorsto work with any norm (Negahban et al., 2012; Banerjeeet al., 2014) with i.i.d. rows in Z.

The focus of the current paper is on structured estimationin vector auto-regressive (VAR) models (Lutkepohl, 2007),arguably the most widely used family of multivariate timeseries models. VAR models have been applied widely,ranging from describing the behavior of economic and fi-nancial time series (Tsay, 2005) to modeling the dynamicalsystems (Ljung, 1998) and estimating brain function con-nectivity (Valdes-Sosa et al., 2005), among others. A VARmodel of order d is defined as

x

t

= A

1

x

t�1

+ A

2

x

t�2

+ · · · + A

d

x

t�d

+ ✏

t

, (2)

where x

t

2 Rp denotes a multivariate time series, A

k

2Rp⇥p

, k = 1, . . . , d are the parameters of the model, andd � 1 is the order of the model. In this work, we as-sume that the noise ✏

t

2 Rp follows a Gaussian distribu-tion, ✏

t

⇠ N (0, ⌃), with E(✏

t

T

t

) = ⌃ and E(✏

t

T

t+⌧

) = 0,for ⌧ 6= 0. The VAR process is assumed to be stable andstationary (Lutkepohl, 2007), while the noise covariancematrix ⌃ is assumed to be positive definite with boundedlargest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1.

In the current context, the parameters {A

k

} are assumed

Estimating Structured VAR

characterizing sample complexity and error bounds.

2.1. Regularized Estimator

To estimate the parameters of the VAR model, we trans-form the model in (2) into the form suitable for regularizedestimator (1). Let (x

0

, x

1

, . . . , x

T

) denote the T + 1 sam-ples generated by the stable VAR model in (2), then stack-ing them together we obtain2

6

6

6

4

x

T

d

x

T

d+1

...x

T

T

3

7

7

7

5

=

2

6

6

6

4

x

T

d�1

x

T

d�2

. . . x

T

0

x

T

d

x

T

d�1

. . . x

T

1

......

. . ....

x

T

T�1

x

T

T�2

. . . x

T

T�d

3

7

7

7

5

2

6

6

6

4

A

T

1

A

T

2

...A

T

d

3

7

7

7

5

+

2

6

6

6

4

T

d

T

d+1

...✏

T

T

3

7

7

7

5

which can also be compactly written as

Y = XB + E, (3)

where Y 2 RN⇥p, X 2 RN⇥dp, B 2 Rdp⇥p, and E 2RN⇥p for N = T �d+1. Vectorizing (column-wise) eachmatrix in (3), we get

vec(Y ) = (I

p⇥p

⌦ X)vec(B) + vec(E)

y = Z� + ✏,

where y 2 RNp, Z = (I

p⇥p

⌦ X) 2 RNp⇥dp

2

, � 2 Rdp

2

,✏ 2 RNp, and ⌦ is the Kronecker product. The covari-ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ I

N⇥N

.Consequently, the regularized estimator takes the form

ˆ� = argmin�2Rdp

2

1

N

||y � Z�||22

+ �

N

R(�), (4)

where R(�) can be any vector norm, separable along therows of matrices A

k

. Specifically, if we denote � =

[�

T

1

. . . �

T

p

]

T and A

k

(i, :) as the row of matrix A

k

fork = 1, . . . , d, then our assumption is equivalent to

R(�)=

p

X

i=1

R

i

=

p

X

i=1

R

h

A

1

(i, :)

T

. . .A

d

(i, :)

T

i

T

. (5)

To reduce clutter and without loss of generality, we assumethe norm R(·) to be the same for each row i. Since theanalysis decouples across rows, it is straightforward to ex-tend our analysis to the case when a different norm is usedfor each row of A

k

, e.g., L

1

for row one, L

2

for row two,K-support norm (Argyriou et al., 2012) for row three, etc.Observe that within a row, the norm need not be decompos-able across columns.

The main difference between the estimation problem in (1)and the formulation in (4) is the strong dependence betweenthe samples (x

0

, x

1

, . . . , x

T

), violating the i.i.d. assump-tion on the data {(y

i

, z

i

), i = 1, . . . , Np}. In particular,this leads to the correlations between the rows and columns

of matrix X (and consequently of Z). To deal with suchdependencies, following (Basu & Michailidis, 2015), weutilize the spectral representation of the autocovariance ofVAR models to control the dependencies in matrix X .

2.2. Stability of VAR Model

Since VAR models are (linear) dynamical systems, for theanalysis we need to establish conditions under which theVAR model (2) is stable, i.e., the time-series process doesnot diverge over time. For understanding stability, it is con-venient to rewrite VAR model of order d in (2) as an equiv-alent VAR model of order 1

2

6

6

6

4

x

t

x

t�1

...x

t�(d�1)

3

7

7

7

5

=

2

6

6

6

6

6

4

A

1

A

2

. . . A

d�1

A

d

I 0 . . . 0 0

0 I . . . 0 0

......

. . ....

...0 0 . . . I 0

3

7

7

7

7

7

5

| {z }

A

2

6

6

6

4

x

t�1

x

t�2

...x

t�d

3

7

7

7

5

+

2

6

6

6

4

t

0

...0

3

7

7

7

5

(6)

where A 2 Rdp⇥dp. Therefore, VAR process is stable ifall the eigenvalues of A satisfy det(�I

dp⇥dp

� A) = 0

for � 2 C, |�| < 1. Equivalently, if expressed in termsof original parameters A

k

, stability is satisfied if det(I �P

d

k=1

A

k

1

k

) = 0 (see Section 2.1.1 of (Lutkepohl, 2007)and Section 1.3 of the supplement for more details).

2.3. Properties of Design Matrix X

In what follows, we analyze the covariance structure of ma-trix X in (3) using spectral properties of VAR model. Theresults will then be used in establishing the high probabilitybounds for the estimation guarantees in problem (4).

Define any row of X as X

i,:

2 Rdp, 1 i N . Sincewe assumed that ✏

t

⇠ N (0, ⌃), it follows that each rowis distributed as X

i,:

⇠ N (0, CX), where the covariancematrix CX 2 Rdp⇥dp is the same for all i

CX =

2

6

6

4

�(0) �(1) . . . �(d � 1)

�(1)

T

�(0) . . . �(d � 2)

......

. . ....

�(d � 1)

T

�(d � 2)

T

. . . �(0)

3

7

7

5

, (7)

where �(h) = E(x

t

x

T

t+h

) 2 Rp⇥p. It turns out thatsince CX is a block-Toeplitz matrix, its eigenvalues can bebounded as (see (Gutierrez-Gutierrez & Crespo, 2011))

inf

1jp

!2[0,2⇡]

j

[⇢(!)] ⇤

k

[CX]

1kdp

sup

1jp

!2[0,2⇡]

j

[⇢(!)], (8)

where ⇤

k

[·] denotes the k-th eigenvalue of a matrix and fori =

p�1, ⇢(!) =

P1h=�1 �(h)e

�hi!

, ! 2 [0, 2⇡], isthe spectral density, i.e., a Fourier transform of the autoco-variance matrix �(h). The advantage of utilizing spectral

Estimating Structured VAR

characterizing sample complexity and error bounds.

2.1. Regularized Estimator

To estimate the parameters of the VAR model, we trans-form the model in (2) into the form suitable for regularizedestimator (1). Let (x

0

, x

1

, . . . , x

T

) denote the T + 1 sam-ples generated by the stable VAR model in (2), then stack-ing them together we obtain2

6

6

6

4

x

T

d

x

T

d+1

...x

T

T

3

7

7

7

5

=

2

6

6

6

4

x

T

d�1

x

T

d�2

. . . x

T

0

x

T

d

x

T

d�1

. . . x

T

1

......

. . ....

x

T

T�1

x

T

T�2

. . . x

T

T�d

3

7

7

7

5

2

6

6

6

4

A

T

1

A

T

2

...A

T

d

3

7

7

7

5

+

2

6

6

6

4

T

d

T

d+1

...✏

T

T

3

7

7

7

5

which can also be compactly written as

Y = XB + E, (3)

where Y 2 RN⇥p, X 2 RN⇥dp, B 2 Rdp⇥p, and E 2RN⇥p for N = T �d+1. Vectorizing (column-wise) eachmatrix in (3), we get

vec(Y ) = (I

p⇥p

⌦ X)vec(B) + vec(E)

y = Z� + ✏,

where y 2 RNp, Z = (I

p⇥p

⌦ X) 2 RNp⇥dp

2

, � 2 Rdp

2

,✏ 2 RNp, and ⌦ is the Kronecker product. The covari-ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ I

N⇥N

.Consequently, the regularized estimator takes the form

ˆ� = argmin�2Rdp

2

1

N

||y � Z�||22

+ �

N

R(�), (4)

where R(�) can be any vector norm, separable along therows of matrices A

k

. Specifically, if we denote � =

[�

T

1

. . . �

T

p

]

T and A

k

(i, :) as the row of matrix A

k

fork = 1, . . . , d, then our assumption is equivalent to

R(�)=

p

X

i=1

R

i

=

p

X

i=1

R

h

A

1

(i, :)

T

. . .A

d

(i, :)

T

i

T

. (5)

To reduce clutter and without loss of generality, we assumethe norm R(·) to be the same for each row i. Since theanalysis decouples across rows, it is straightforward to ex-tend our analysis to the case when a different norm is usedfor each row of A

k

, e.g., L

1

for row one, L

2

for row two,K-support norm (Argyriou et al., 2012) for row three, etc.Observe that within a row, the norm need not be decompos-able across columns.

The main difference between the estimation problem in (1)and the formulation in (4) is the strong dependence betweenthe samples (x

0

, x

1

, . . . , x

T

), violating the i.i.d. assump-tion on the data {(y

i

, z

i

), i = 1, . . . , Np}. In particular,this leads to the correlations between the rows and columns

of matrix X (and consequently of Z). To deal with suchdependencies, following (Basu & Michailidis, 2015), weutilize the spectral representation of the autocovariance ofVAR models to control the dependencies in matrix X .

2.2. Stability of VAR Model

Since VAR models are (linear) dynamical systems, for theanalysis we need to establish conditions under which theVAR model (2) is stable, i.e., the time-series process doesnot diverge over time. For understanding stability, it is con-venient to rewrite VAR model of order d in (2) as an equiv-alent VAR model of order 1

2

6

6

6

4

x

t

x

t�1

...x

t�(d�1)

3

7

7

7

5

=

2

6

6

6

6

6

4

A

1

A

2

. . . A

d�1

A

d

I 0 . . . 0 0

0 I . . . 0 0

......

. . ....

...0 0 . . . I 0

3

7

7

7

7

7

5

| {z }

A

2

6

6

6

4

x

t�1

x

t�2

...x

t�d

3

7

7

7

5

+

2

6

6

6

4

t

0

...0

3

7

7

7

5

(6)

where A 2 Rdp⇥dp. Therefore, VAR process is stable ifall the eigenvalues of A satisfy det(�I

dp⇥dp

� A) = 0

for � 2 C, |�| < 1. Equivalently, if expressed in termsof original parameters A

k

, stability is satisfied if det(I �P

d

k=1

A

k

1

k

) = 0 (see Section 2.1.1 of (Lutkepohl, 2007)and Section 1.3 of the supplement for more details).

2.3. Properties of Design Matrix X

In what follows, we analyze the covariance structure of ma-trix X in (3) using spectral properties of VAR model. Theresults will then be used in establishing the high probabilitybounds for the estimation guarantees in problem (4).

Define any row of X as X

i,:

2 Rdp, 1 i N . Sincewe assumed that ✏

t

⇠ N (0, ⌃), it follows that each rowis distributed as X

i,:

⇠ N (0, CX), where the covariancematrix CX 2 Rdp⇥dp is the same for all i

CX =

2

6

6

4

�(0) �(1) . . . �(d � 1)

�(1)

T

�(0) . . . �(d � 2)

......

. . ....

�(d � 1)

T

�(d � 2)

T

. . . �(0)

3

7

7

5

, (7)

where �(h) = E(x

t

x

T

t+h

) 2 Rp⇥p. It turns out thatsince CX is a block-Toeplitz matrix, its eigenvalues can bebounded as (see (Gutierrez-Gutierrez & Crespo, 2011))

inf

1jp

!2[0,2⇡]

j

[⇢(!)] ⇤

k

[CX]

1kdp

sup

1jp

!2[0,2⇡]

j

[⇢(!)], (8)

where ⇤

k

[·] denotes the k-th eigenvalue of a matrix and fori =

p�1, ⇢(!) =

P1h=�1 �(h)e

�hi!

, ! 2 [0, 2⇡], isthe spectral density, i.e., a Fourier transform of the autoco-variance matrix �(h). The advantage of utilizing spectral




Page 6: Estimating structured vector autoregressive models

$\Delta \in \Omega_E$


3.1. High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties and an illustration of some special cases based on the popular Lasso and Group Lasso regularization norms. In Section 3.4 we present the main ideas of our proof technique, with all the details delegated to the supplement.

To establish a lower bound on the regularization parameter $\lambda_N$, we derive an upper bound $R^*\big[\frac{1}{N} Z^T \epsilon\big] \le \alpha$ for some $\alpha > 0$, which establishes the required relationship $\lambda_N \ge \alpha \ge R^*\big[\frac{1}{N} Z^T \epsilon\big]$.

Theorem 3.3 Let $\Omega_R = \{u \in \mathbb{R}^{dp} \mid R(u) \le 1\}$, and define $w(\Omega_R) = \mathbb{E}[\sup_{u \in \Omega_R} \langle g, u\rangle]$ to be the Gaussian width of the set $\Omega_R$ for $g \sim \mathcal{N}(0, I)$. For any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, with probability at least $1 - c\exp(-\min(\epsilon_2^2, \epsilon_1) + \log(p))$ we can establish that

$$
R^*\!\left[\frac{1}{N} Z^T \epsilon\right] \;\le\; c_2(1+\epsilon_2)\,\frac{w(\Omega_R)}{\sqrt{N}} \;+\; c_1(1+\epsilon_1)\,\frac{w^2(\Omega_R)}{N^2},
$$

where $c$, $c_1$ and $c_2$ are positive constants.
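To give a concrete sense of the scale prescribed by Theorem 3.3, consider the Lasso case treated in Section 3.3.1, where $\Omega_R$ is the unit $L_1$ ball and $w(\Omega_R) = \mathbb{E}\|g\|_\infty \approx \sqrt{2\log(dp)}$. The sketch below is only an illustration: the constants $c_1, c_2$ are unknown and set to 1, and the width is estimated by Monte Carlo.

```python
import numpy as np

def gaussian_width_l1_ball(dim, n_mc=5000, rng=None):
    """Monte-Carlo estimate of w({u in R^dim : ||u||_1 <= 1}) = E ||g||_inf,
    which scales like sqrt(2 log dim)."""
    rng = rng or np.random.default_rng(0)
    g = rng.standard_normal((n_mc, dim))
    return np.abs(g).max(axis=1).mean()

def lambda_lower_bound(N, dim, c1=1.0, c2=1.0):
    """Scale of the bound in Theorem 3.3: c2 * w / sqrt(N) + c1 * w^2 / N^2
    (absolute constants unknown; treated as 1 for illustration)."""
    w = gaussian_width_l1_ball(dim)
    return c2 * w / np.sqrt(N) + c1 * w ** 2 / N ** 2

# hypothetical problem size: p = 20 series, lag d = 3, N = 500 usable samples
print(lambda_lower_bound(N=500, dim=20 * 3))
```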

To establish the restricted eigenvalue condition, we will show that $\inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p\times p} \otimes X)\Delta\|_2}{\|\Delta\|_2} \ge \nu$ for some $\nu > 0$, and then set $\sqrt{N\kappa} = \nu$.

Theorem 3.4 Let $\Theta = \mathrm{cone}(\Omega_{E_j}) \cap S^{dp-1}$, where $S^{dp-1}$ is the unit sphere. The error set $\Omega_{E_j}$ is defined as

$$
\Omega_{E_j} = \Big\{\Delta_j \in \mathbb{R}^{dp} \;\Big|\; R(\beta^*_j + \Delta_j) \le R(\beta^*_j) + \tfrac{1}{r} R(\Delta_j)\Big\},
$$

for $r > 1$, $j = 1, \dots, p$, and $\Delta = [\Delta_1^T, \dots, \Delta_p^T]^T$, where each $\Delta_j$ is of size $dp \times 1$, and $\beta^* = [\beta_1^{*T} \dots \beta_p^{*T}]^T$, with $\beta^*_j \in \mathbb{R}^{dp}$. The set $\Omega_{E_j}$ is a part of the decomposition $\Omega_E = \Omega_{E_1} \times \cdots \times \Omega_{E_p}$ due to the assumption on the row-wise separability of the norm $R(\cdot)$ in (5). Also define $w(\Theta) = \mathbb{E}[\sup_{u \in \Theta} \langle g, u\rangle]$ to be the Gaussian width of the set $\Theta$ for $g \sim \mathcal{N}(0, I)$ and $u \in \mathbb{R}^{dp}$. Then with probability at least $1 - c_1\exp(-c_2\eta^2 + \log(p))$, for any $\eta > 0$,

$$
\inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p\times p} \otimes X)\Delta\|_2}{\|\Delta\|_2} \;\ge\; \nu, \qquad \text{where } \nu = \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \eta,
$$

and $c$, $c_1$, $c_2$ are positive constants, and $L$, $M$ are defined in (9) and (13).
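Theorem 3.4 controls an infimum over the error cone, which cannot be computed exactly; still, a crude Monte-Carlo probe of the per-block ratio $\|X\Delta_j\|_2/(\sqrt{N}\|\Delta_j\|_2)$ over random sparse directions can serve as a sanity check. The sketch below is only such a probe (the minimum over sampled directions upper-bounds the true restricted infimum); it is not a verification of the theorem.

```python
import numpy as np

def re_probe(X, s, n_dirs=2000, rng=None):
    """Evaluate ||X delta||_2 / (sqrt(N) ||delta||_2) over random s-sparse directions.
    The minimum over the sampled directions is only an upper bound on the true
    restricted infimum -- a sanity probe, not a certificate of the RE condition."""
    rng = rng or np.random.default_rng(0)
    N, q = X.shape
    ratios = []
    for _ in range(n_dirs):
        delta = np.zeros(q)
        idx = rng.choice(q, size=s, replace=False)
        delta[idx] = rng.standard_normal(s)
        ratios.append(np.linalg.norm(X @ delta) / (np.sqrt(N) * np.linalg.norm(delta)))
    return min(ratios)
```

Applied to the design matrix $X$ from (3) for a stable VAR, the probe typically stays bounded away from zero once $N$ is large enough, consistent with the sample-size requirement discussed next.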

3.2. Discussion

From Theorem 3.4, we can choose $\eta = \frac{1}{2}\sqrt{NL}$ and set $\sqrt{N\kappa} = \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \eta$. Since $\sqrt{N\kappa} > 0$ must be satisfied, we can establish a lower bound on the number of samples $N$: $\sqrt{N} > \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L}/2} = O(w(\Theta))$. Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if $\Lambda_{\min}(A)$ and $\Lambda_{\min}(\Sigma)$ are larger and $\Lambda_{\max}(A)$ and $\Lambda_{\max}(\Sigma)$ are smaller. In turn, this means that the matrices in (10) and (12) must be well conditioned and the VAR process must be stable, with eigenvalues well inside the unit circle (see Section 2.2). Alternatively, we can also understand the bound on $N$ as showing that large values of $M$ and small values of $L$ indicate stronger dependency in the data, thus requiring more samples for the RE condition to hold with high probability.
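For completeness, the sample-size bound above follows by substituting the choice $\eta = \frac{1}{2}\sqrt{NL}$ into the requirement $\nu > 0$:

$$
\sqrt{N\kappa} \;=\; \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \tfrac{1}{2}\sqrt{NL}
\;=\; \tfrac{1}{2}\sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) \;>\; 0
\;\;\Longleftrightarrow\;\;
\sqrt{N} \;>\; \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L}/2}.
$$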

Analyzing Theorems 3.3 and 3.4, we can interpret the established results as follows. As the size and dimensionality $N$, $p$ and $d$ of the problem increase, we emphasize the scale of the results and use order notation to absorb the constants. Select a number of samples at least $N \ge O(w^2(\Theta))$ and let the regularization parameter satisfy $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\big)$. With high probability the restricted eigenvalue condition $\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \sqrt{N\kappa}$ for $\Delta \in \mathrm{cone}(\Omega_E)$ then holds, so that $\kappa = O(1)$ is a positive constant. Moreover, the norm of the estimation error in optimization problem (4) is bounded by $\|\Delta\|_2 \le O\big(\big[\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\big]\,\Psi(\mathrm{cone}(\Omega_{E_j}))\big)$. Note that the norm compatibility constant $\Psi(\mathrm{cone}(\Omega_{E_j}))$ is assumed to be the same for all $j = 1, \dots, p$, which follows from our assumption in (5).
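Putting the pieces together for the Lasso case, the following end-to-end sketch (not the paper's experimental code; scikit-learn's Lasso objective differs from (4) by a constant factor in the penalty, and the regularization is set only to the theoretical scale with unknown constants dropped) simulates a sparse stable VAR, forms $Y = XB + E$ as in (3), fits each row with an $L_1$ penalty, and reports the estimation error.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
p, d, T, s = 10, 2, 2000, 3
noise_std = 0.1

# sparse ground truth: column i of B holds the combined i-th rows of A_1, ..., A_d;
# magnitudes stay below 0.15 so that sum_k ||A_k||_inf < 1, which guarantees stability
B_true = np.zeros((d * p, p))
for i in range(p):
    idx = rng.choice(d * p, size=s, replace=False)
    B_true[idx, i] = rng.uniform(-0.15, 0.15, size=s)

# simulate x_t = A_1 x_{t-1} + ... + A_d x_{t-d} + eps_t
x = np.zeros((T, p))
for t in range(d, T):
    lags = np.concatenate([x[t - k - 1] for k in range(d)])   # [x_{t-1}; ...; x_{t-d}]
    x[t] = lags @ B_true + noise_std * rng.standard_normal(p)

# stack the data as Y = X B + E in (3)
N = T - d
X = np.column_stack([x[d - k - 1: T - k - 1] for k in range(d)])   # N x dp
Y = x[d:]                                                          # N x p

# regularization at the theoretical scale w(Omega_R)/sqrt(N) ~ sqrt(2 log(dp)/N),
# multiplied by the noise level (absolute constants are unknown)
lam = noise_std * np.sqrt(2 * np.log(d * p) / N)

# the row-wise separable L1 norm decouples (4) into p independent Lasso problems
B_hat = np.column_stack([Lasso(alpha=lam, fit_intercept=False).fit(X, Y[:, j]).coef_
                         for j in range(p)])

print("estimation error ||B_hat - B_true||_F =", np.linalg.norm(B_hat - B_true))
```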

Consider now Theorem 3.3 and the bound on the regularization parameter $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\big)$. As the dimensionality of the problem, $p$ and $d$, grows and the number of samples $N$ increases, the first term $\frac{w(\Omega_R)}{\sqrt{N}}$ will dominate the second one, $\frac{w^2(\Omega_R)}{N^2}$. This can be seen by computing the $N$ for which the two terms become equal, $\frac{w(\Omega_R)}{\sqrt{N}} = \frac{w^2(\Omega_R)}{N^2}$, which happens at $N = w^{2/3}(\Omega_R) < w(\Omega_R)$. Therefore, we can rewrite our results as follows: once the restricted eigenvalue condition holds and $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}}\big)$, the error norm is upper-bounded by $\|\Delta\|_2 \le O\big(\frac{w(\Omega_R)}{\sqrt{N}}\,\Psi(\mathrm{cone}(\Omega_{E_j}))\big)$.
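The crossover point quoted above is simply the algebra of equating the two terms:

$$
\frac{w(\Omega_R)}{\sqrt{N}} \;=\; \frac{w^2(\Omega_R)}{N^2}
\;\;\Longleftrightarrow\;\;
N^{3/2} \;=\; w(\Omega_R)
\;\;\Longleftrightarrow\;\;
N \;=\; w^{2/3}(\Omega_R),
$$

so for any larger $N$ the first term dominates.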

3.3. Special Cases

While the presented results are valid for any norm $R(\cdot)$ separable along the rows of $A_k$, it is instructive to specialize our analysis to a few popular regularization choices, such as the $L_1$, Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. LASSO

To establish results for the $L_1$ norm, we assume that the parameter $\beta^*$ is $s$-sparse, which in our case is meant to represent the largest number of non-zero elements in any $\beta_i$, $i = 1, \dots, p$, i.e., the combined $i$-th rows of each $A_k$,

Thm4.3

Thm4.4 Lem.4.1

Lem.4.2



Page 7: Estimating structured vector autoregressive models

$\Delta \in \Omega_E$


Thm4.3

Thm4.4 Lem.4.1


Lem.4.2


Page 8: Estimating structured vector autoregressive models

Restricted Eigenvalue Condition

(well-conditioned)

• cone Lasso-type


Page 9: Estimating structured vector autoregressive models

Restricted Error Set

• r = 2
• (slide figure: the true parameter $\beta^*$, the shifted restricted error set $\beta^* + \Omega_E$ for the $L_1$ norm, and its cone $\mathrm{cone}(\Omega_E)$)

… density is that it has a closed form expression (see Section 9.4 of (Priestley, 1981))
$$\rho(\omega) = \Big(I - \sum_{k=1}^{d} A_k e^{-ki\omega}\Big)^{-1} \Sigma \left[\Big(I - \sum_{k=1}^{d} A_k e^{-ki\omega}\Big)^{-1}\right]^{*},$$
where $*$ denotes the Hermitian transpose of a matrix. Therefore, from (8) we can establish the following lower bound
$$\Lambda_{\min}[C_X] \ge \Lambda_{\min}(\Sigma)/\Lambda_{\max}(\mathcal{A}) = L, \qquad (9)$$
where we defined $\Lambda_{\max}(\mathcal{A}) = \max_{\omega \in [0,2\pi]} \Lambda_{\max}(\mathcal{A}(\omega))$ for
$$\mathcal{A}(\omega) = \Big(I - \sum_{k=1}^{d} A_k^{T} e^{ki\omega}\Big)\Big(I - \sum_{k=1}^{d} A_k e^{-ki\omega}\Big). \qquad (10)$$

In establishing high probability bounds we will also need information about the vector $q = Xa \in \mathbb{R}^{N}$ for any $a \in \mathbb{R}^{dp}$ with $\|a\|_2 = 1$. Since each element $X_{i,:}^{T}a \sim \mathcal{N}(0, a^{T}C_X a)$, it follows that $q \sim \mathcal{N}(0, Q_a)$ with a covariance matrix $Q_a \in \mathbb{R}^{N \times N}$. It can be shown (see Section 2.1.2 of the supplement) that $Q_a$ can be written as
$$Q_a = (I \otimes a^{T})\,C_U\,(I \otimes a), \qquad (11)$$
where $C_U = E(UU^{T})$ for $U = [X_{1,:}^{T} \ldots X_{N,:}^{T}]^{T} \in \mathbb{R}^{Ndp}$, which is obtained from the matrix $X$ by stacking all the rows into a single vector, i.e., $U = \mathrm{vec}(X^{T})$. In order to bound the eigenvalues of $C_U$ (and consequently of $Q_a$), observe that $U$ can be viewed as a vector obtained by stacking $N$ outputs of the VAR model in (6). Similarly as in (8), if we denote the spectral density of the VAR process in (6) as $\rho_X(\omega) = \sum_{h=-\infty}^{\infty} \Gamma_X(h)e^{-hi\omega}$, $\omega \in [0, 2\pi]$, where $\Gamma_X(h) = E[X_{j,:}X_{j+h,:}^{T}] \in \mathbb{R}^{dp \times dp}$, then we can write
$$\inf_{\substack{1 \le l \le dp \\ \omega \in [0,2\pi]}} \Lambda_l[\rho_X(\omega)] \;\le\; \Lambda_k[C_U] \;\le\; \sup_{\substack{1 \le l \le dp \\ \omega \in [0,2\pi]}} \Lambda_l[\rho_X(\omega)], \qquad 1 \le k \le Ndp.$$
The closed form expression of this spectral density is
$$\rho_X(\omega) = \big(I - Ae^{-i\omega}\big)^{-1} \Sigma_E \left[\big(I - Ae^{-i\omega}\big)^{-1}\right]^{*},$$
where $\Sigma_E$ is the covariance matrix of the noise vector and $A$ is as defined in expression (6). Thus, an upper bound on $C_U$ can be obtained as $\Lambda_{\max}[C_U] \le \frac{\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\tilde{\mathcal{A}})}$, where we defined $\Lambda_{\min}(\tilde{\mathcal{A}}) = \min_{\omega \in [0,2\pi]} \Lambda_{\min}(\tilde{\mathcal{A}}(\omega))$ for
$$\tilde{\mathcal{A}}(\omega) = \big(I - A^{T}e^{i\omega}\big)\big(I - Ae^{-i\omega}\big). \qquad (12)$$
Referring back to the covariance matrix $Q_a$ in (11), we get
$$\Lambda_{\max}[Q_a] \le \Lambda_{\max}(\Sigma)/\Lambda_{\min}(\tilde{\mathcal{A}}) = M. \qquad (13)$$
We note that for a general VAR model there might not exist closed-form expressions for $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\min}(\tilde{\mathcal{A}})$. However, for some special cases there are results establishing bounds on these quantities (e.g., see Proposition 2.2 in (Basu & Michailidis, 2015)).
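When closed forms are unavailable, both extremal quantities can be approximated by scanning a frequency grid. The sketch below (our own illustration on an arbitrary toy VAR(2), not an example from the paper) builds the companion matrix, evaluates the quadratic forms behind (10) and (12) on a grid of $\omega$ values, and returns the resulting $L$ and $M$ for a given noise covariance.

```python
import numpy as np

def L_and_M(A_list, Sigma, n_grid=400):
    """Evaluate L = Lam_min(Sigma)/Lam_max(A(omega)) and
    M = Lam_max(Sigma)/Lam_min(A_tilde(omega)) on a frequency grid,
    following the quadratic forms behind (9), (10), (12) and (13)."""
    d, p = len(A_list), A_list[0].shape[0]
    # companion (VAR(1)) form of the VAR(d) process, as in (6)
    A_comp = np.zeros((d * p, d * p))
    A_comp[:p, :] = np.hstack(A_list)
    if d > 1:
        A_comp[p:, :-p] = np.eye((d - 1) * p)
    lam_max_A, lam_min_At = 0.0, np.inf
    for w in np.linspace(0.0, 2.0 * np.pi, n_grid):
        B = np.eye(p) - sum(Ak * np.exp(-1j * k * w) for k, Ak in enumerate(A_list, 1))
        lam_max_A = max(lam_max_A, np.linalg.eigvalsh(B.conj().T @ B).max())
        C = np.eye(d * p) - A_comp * np.exp(-1j * w)
        lam_min_At = min(lam_min_At, np.linalg.eigvalsh(C.conj().T @ C).min())
    sig = np.linalg.eigvalsh(Sigma)
    return sig.min() / lam_max_A, sig.max() / lam_min_At

# toy example: a small VAR(2) with modest coefficients and isotropic noise
rng = np.random.default_rng(3)
A1, A2 = 0.3 * rng.standard_normal((4, 4)), 0.1 * rng.standard_normal((4, 4))
print(L_and_M([A1, A2], np.eye(4)))
```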

3. Regularized Estimation Guarantees

Denote by $\Delta = \hat{\beta} - \beta^*$ the error between the solution of the optimization problem (4) and $\beta^*$, the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: $\|\Delta\|_2 \le \delta$ for some known $\delta$. To establish such conditions, we utilize the framework of (Banerjee et al., 2014). Specifically, the estimation error analysis is based on the following known results, adapted to our settings. The first one characterizes the restricted error set $\Omega_E$ to which the error $\Delta$ belongs.

Lemma 3.1 Assume that
$$\lambda_N \ge r\,R^{*}\!\Big[\tfrac{1}{N}Z^{T}\epsilon\Big], \qquad (14)$$
for some constant $r > 1$, where $R^{*}\big[\tfrac{1}{N}Z^{T}\epsilon\big]$ is the dual form of the vector norm $R(\cdot)$, defined as $R^{*}\big[\tfrac{1}{N}Z^{T}\epsilon\big] = \sup_{R(U) \le 1} \big\langle \tfrac{1}{N}Z^{T}\epsilon,\, U \big\rangle$ for $U \in \mathbb{R}^{dp^2}$, where $U = [u_1^{T}, u_2^{T}, \ldots, u_p^{T}]^{T}$ and $u_i \in \mathbb{R}^{dp}$. Then the error vector $\Delta$ belongs to the set
$$\Omega_E = \Big\{\Delta \in \mathbb{R}^{dp^2} \,\Big|\, R(\beta^* + \Delta) \le R(\beta^*) + \tfrac{1}{r}R(\Delta)\Big\}. \qquad (15)$$
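For the row-wise $\ell_1$ norm the dual norm is the $\ell_\infty$ norm, so condition (14) can be evaluated directly from data. The snippet below is a minimal sketch under that assumption (function and variable names are our own); with $Z = I_{p\times p} \otimes X$ and the noise stacked column-wise in $E$, the dual norm is just the largest entry of $|X^{T}E|/N$.

```python
import numpy as np

def lasso_dual_norm(X, E):
    """R*[(1/N) Z^T eps] for the row-wise L1 norm: the largest entry of |X^T E| / N,
    where E holds the per-series noise (or, in practice, residual) columns."""
    N = X.shape[0]
    return np.abs(X.T @ E).max() / N

# illustrative check of (14) on synthetic noise
rng = np.random.default_rng(4)
N, dp, p, r = 400, 60, 20, 2.0
X, E = rng.standard_normal((N, dp)), 0.1 * rng.standard_normal((N, p))
print(r * lasso_dual_norm(X, E))   # any lambda_N at least this large satisfies (14)
```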

The second condition in (Banerjee et al., 2014) establishes the upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds,
$$\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \kappa\sqrt{N}, \qquad (16)$$
for $\Delta \in \mathrm{cone}(\Omega_E)$ and some constant $\kappa > 0$, where $\mathrm{cone}(\Omega_E)$ is the cone of the error set. Then
$$\|\Delta\|_2 \le \frac{1+r}{r}\,\frac{\lambda_N}{\kappa^2}\,\Psi(\mathrm{cone}(\Omega_E)), \qquad (17)$$
where $\Psi(\mathrm{cone}(\Omega_E))$ is a norm compatibility constant, defined as $\Psi(\mathrm{cone}(\Omega_E)) = \sup_{U \in \mathrm{cone}(\Omega_E)} \frac{R(U)}{\|U\|_2}$.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are stated in terms of quantities involving $Z$ and $\epsilon$, which are random. Therefore, in the following we establish high probability bounds on the regularization parameter in (14) and the RE condition in (16).
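As a toy numerical instance of the deterministic bound (17), one can plug the pieces together directly; every value below is hypothetical (for the $\ell_1$ norm the compatibility constant over an $s$-sparse error cone scales like $\sqrt{s}$, and $\kappa$, $r$ and $\lambda_N$ are simply assumed).

```python
import numpy as np

def error_bound(r, lam_N, kappa, psi):
    """Right-hand side of (17): ((1 + r) / r) * lam_N / kappa**2 * Psi(cone(Omega_E))."""
    return (1.0 + r) / r * lam_N / kappa ** 2 * psi

s, N = 5, 400
lam_N = np.sqrt(2 * np.log(60)) / np.sqrt(N)   # the w(Omega_R)/sqrt(N) order for dp = 60
print(error_bound(r=2.0, lam_N=lam_N, kappa=0.5, psi=np.sqrt(s)))
```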



Page 10: Estimating structured vector autoregressive models

$\Delta \in \Omega_E$


Thm4.3

Thm4.4 Lem.4.1

Lem.4.2


Page 11: Estimating structured vector autoregressive models

• 4.2

$\|\Delta\|_2$

Estimating Structured VAR

density is that it has a closed form expression (see Section9.4 of (Priestley, 1981))

⇢(!)=

I�d

X

k=1

A

k

e

�ki!

!�1

2

4

I�d

X

k=1

A

k

e

�ki!

!�1

3

5

,

where ⇤ denotes a Hermitian of a matrix. Therefore, from(8) we can establish the following lower bound

min

[CX] � ⇤

min

(⌃)/⇤

max

(A) = L, (9)

where we defined ⇤max

(A) = max

!2[0,2⇡]

max

(A(!)) for

A(!)=

I �d

X

k=1

A

T

k

e

ki!

!

I �d

X

k=1

A

k

e

�ki!

!

. (10)

In establishing high probability bounds we will also needinformation about a vector q = Xa 2 RN for any a 2 Rdp,kak

2

= 1. Since each element X

T

i,:

a ⇠ N (0, a

T

CXa),it follows that q ⇠ N (0, Q

a

) with a covariance matrixQ

a

2 RN⇥N . It can be shown (see Section 2.1.2 of thesupplement) that Q

a

can be written as

Q

a

= (I ⌦ a

T

)CU (I ⌦ a), (11)

where CU = E(UUT

) for U =

X

T

1,:

. . . X

T

N,:

T 2RNdp which is obtained from matrix X by stacking all therows in a single vector, i.e, U = vec(XT

). In order tobound eigenvalues of CU (and consequently of Q

a

), ob-serve that U can be viewed as a vector obtained by stackingN outputs from VAR model in (6). Similarly as in (8),if we denote the spectral density of the VAR process in(6) as ⇢X(!) =

P1h=�1 �X(h)e

�hi! , ! 2 [0, 2⇡], where�X(h) = E[X

j,:

X

T

j+h,:

] 2 Rdp⇥dp, then we can write

inf

1ldp

!2[0,2⇡]

l

[⇢X(!)] ⇤

k

[CU ]

1kNdp

sup

1ldp

!2[0,2⇡]

l

[⇢X(!)].

The closed form expression of spectral density is

⇢X(!) =

I � Ae

�i!

��1

⌃E

h

I � Ae

�i!

��1

i⇤,

where ⌃E is the covariance matrix of a noise vector and Aare as defined in expression (6). Thus, an upper bound onCU can be obtained as ⇤

max

[CU ] ⇤

max

(⌃)

min

(A)

, where wedefined ⇤

min

(A) = min

!2[0,2⇡]

min

(A(!)) for

A(!) =

I � AT

e

i!

� �

I � Ae

�i!

. (12)

Referring back to covariance matrix Q

a

in (11), we get

max

[Q

a

] ⇤

max

(⌃)/⇤

min

(A) = M. (13)

We note that for a general VAR model, there might not existclosed-form expressions for ⇤

max

(A) and ⇤min

(A). How-ever, for some special cases there are results establishingthe bounds on these quantities (e.g., see Proposition 2.2 in(Basu & Michailidis, 2015)).

3. Regularized Estimation Guarantees

Denote by � =

ˆ� � �⇤ the error between the solution ofoptimization problem (4) and �⇤, the true value of the pa-rameter. The focus of our work is to determine conditionsunder which the optimization problem in (4) has guaran-tees on the accuracy of the obtained solution, i.e., the errorterm is bounded: ||�||

2

� for some known �. To estab-lish such conditions, we utilize the framework of (Banerjeeet al., 2014). Specifically, estimation error analysis is basedon the following known results adapted to our settings. Thefirst one characterizes the restricted error set⌦

E

, where theerror � belongs.

Lemma 3.1 Assume that

N

� rR

1

N

Z

T ✏

, (14)

for some constant r > 1, where R

⇤ ⇥ 1

N

Z

T ✏⇤

is a

dual form of the vector norm R(·), which is defined as

R

⇤[

1

N

Z

T ✏] = sup

R(U)1

1

N

Z

T ✏, U↵

, for U 2 Rdp

2

, where

U = [u

T

1

, u

T

2

, . . . , u

T

p

]

T

and u

i

2 Rdp

. Then the error

vector k�k2

belongs to the set

E

=

� 2 Rdp

2

R(�⇤+�) R(�⇤

)+

1

r

R(�)

. (15)

The second condition in (Banerjee et al., 2014) establishesthe upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE)

condition holds

||Z�||2

||�||2

�p

N, (16)

for � 2 cone(⌦

E

) and some constant > 0, where

cone(⌦

E

) is a cone of the error set, then

||�||2

1 + r

r

N

(cone(⌦

E

)), (17)

where (cone(⌦

E

)) is a norm compatibility constant, de-

fined as (cone(⌦

E

)) = sup

U2cone(⌦

E

)

R(U)

||U ||2

.

Note that the above error bound is deterministic, i.e., if (14)and (16) hold, then the error satisfies the upper bound in(17). However, the results are defined in terms of the quan-tities, involving Z and ✏, which are random. Therefore, inthe following we establish high probability bounds on theregularization parameter in (14) and RE condition in (16).

,

Estimating Structured VAR

density is that it has a closed form expression (see Section9.4 of (Priestley, 1981))

⇢(!)=

I�d

X

k=1

A

k

e

�ki!

!�1

2

4

I�d

X

k=1

A

k

e

�ki!

!�1

3

5

,

where ⇤ denotes a Hermitian of a matrix. Therefore, from(8) we can establish the following lower bound

min

[CX] � ⇤

min

(⌃)/⇤

max

(A) = L, (9)

where we defined ⇤max

(A) = max

!2[0,2⇡]

max

(A(!)) for

A(!)=

I �d

X

k=1

A

T

k

e

ki!

!

I �d

X

k=1

A

k

e

�ki!

!

. (10)

In establishing high probability bounds we will also needinformation about a vector q = Xa 2 RN for any a 2 Rdp,kak

2

= 1. Since each element X

T

i,:

a ⇠ N (0, a

T

CXa),it follows that q ⇠ N (0, Q

a

) with a covariance matrixQ

a

2 RN⇥N . It can be shown (see Section 2.1.2 of thesupplement) that Q

a

can be written as

Q

a

= (I ⌦ a

T

)CU (I ⌦ a), (11)

where CU = E(UUT

) for U =

X

T

1,:

. . . X

T

N,:

T 2RNdp which is obtained from matrix X by stacking all therows in a single vector, i.e, U = vec(XT

). In order tobound eigenvalues of CU (and consequently of Q

a

), ob-serve that U can be viewed as a vector obtained by stackingN outputs from VAR model in (6). Similarly as in (8),if we denote the spectral density of the VAR process in(6) as ⇢X(!) =

P1h=�1 �X(h)e

�hi! , ! 2 [0, 2⇡], where�X(h) = E[X

j,:

X

T

j+h,:

] 2 Rdp⇥dp, then we can write

inf

1ldp

!2[0,2⇡]

l

[⇢X(!)] ⇤

k

[CU ]

1kNdp

sup

1ldp

!2[0,2⇡]

l

[⇢X(!)].

The closed form expression of spectral density is

⇢X(!) =

I � Ae

�i!

��1

⌃E

h

I � Ae

�i!

��1

i⇤,

where ⌃E is the covariance matrix of a noise vector and Aare as defined in expression (6). Thus, an upper bound onCU can be obtained as ⇤

max

[CU ] ⇤

max

(⌃)

min

(A)

, where wedefined ⇤

min

(A) = min

!2[0,2⇡]

min

(A(!)) for

A(!) =

I � AT

e

i!

� �

I � Ae

�i!

. (12)

Referring back to covariance matrix Q

a

in (11), we get

max

[Q

a

] ⇤

max

(⌃)/⇤

min

(A) = M. (13)

We note that for a general VAR model, there might not existclosed-form expressions for ⇤

max

(A) and ⇤min

(A). How-ever, for some special cases there are results establishingthe bounds on these quantities (e.g., see Proposition 2.2 in(Basu & Michailidis, 2015)).

3. Regularized Estimation Guarantees

Denote by � =

ˆ� � �⇤ the error between the solution ofoptimization problem (4) and �⇤, the true value of the pa-rameter. The focus of our work is to determine conditionsunder which the optimization problem in (4) has guaran-tees on the accuracy of the obtained solution, i.e., the errorterm is bounded: ||�||

2

� for some known �. To estab-lish such conditions, we utilize the framework of (Banerjeeet al., 2014). Specifically, estimation error analysis is basedon the following known results adapted to our settings. Thefirst one characterizes the restricted error set⌦

E

, where theerror � belongs.

Lemma 3.1 Assume that

N

� rR

1

N

Z

T ✏

, (14)

for some constant r > 1, where R

⇤ ⇥ 1

N

Z

T ✏⇤

is a

dual form of the vector norm R(·), which is defined as

R

⇤[

1

N

Z

T ✏] = sup

R(U)1

1

N

Z

T ✏, U↵

, for U 2 Rdp

2

, where

U = [u

T

1

, u

T

2

, . . . , u

T

p

]

T

and u

i

2 Rdp

. Then the error

vector k�k2

belongs to the set

E

=

� 2 Rdp

2

R(�⇤+�) R(�⇤

)+

1

r

R(�)

. (15)

The second condition in (Banerjee et al., 2014) establishesthe upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE)

condition holds

||Z�||2

||�||2

�p

N, (16)

for � 2 cone(⌦

E

) and some constant > 0, where

cone(⌦

E

) is a cone of the error set, then

||�||2

1 + r

r

N

(cone(⌦

E

)), (17)

where (cone(⌦

E

)) is a norm compatibility constant, de-

fined as (cone(⌦

E

)) = sup

U2cone(⌦

E

)

R(U)

||U ||2

.

Note that the above error bound is deterministic, i.e., if (14)and (16) hold, then the error satisfies the upper bound in(17). However, the results are defined in terms of the quan-tities, involving Z and ✏, which are random. Therefore, inthe following we establish high probability bounds on theregularization parameter in (14) and RE condition in (16).

Estimating Structured VAR

density is that it has a closed form expression (see Section9.4 of (Priestley, 1981))

⇢(!)=

I�d

X

k=1

A

k

e

�ki!

!�1

2

4

I�d

X

k=1

A

k

e

�ki!

!�1

3

5

,

where ⇤ denotes a Hermitian of a matrix. Therefore, from(8) we can establish the following lower bound

min

[CX] � ⇤

min

(⌃)/⇤

max

(A) = L, (9)

where we defined ⇤max

(A) = max

!2[0,2⇡]

max

(A(!)) for

A(!)=

I �d

X

k=1

A

T

k

e

ki!

!

I �d

X

k=1

A

k

e

�ki!

!

. (10)

In establishing high probability bounds we will also needinformation about a vector q = Xa 2 RN for any a 2 Rdp,kak

2

= 1. Since each element X

T

i,:

a ⇠ N (0, a

T

CXa),it follows that q ⇠ N (0, Q

a

) with a covariance matrixQ

a

2 RN⇥N . It can be shown (see Section 2.1.2 of thesupplement) that Q

a

can be written as

Q

a

= (I ⌦ a

T

)CU (I ⌦ a), (11)

where CU = E(UUT

) for U =

X

T

1,:

. . . X

T

N,:

T 2RNdp which is obtained from matrix X by stacking all therows in a single vector, i.e, U = vec(XT

). In order tobound eigenvalues of CU (and consequently of Q

a

), ob-serve that U can be viewed as a vector obtained by stackingN outputs from VAR model in (6). Similarly as in (8),if we denote the spectral density of the VAR process in(6) as ⇢X(!) =

P1h=�1 �X(h)e

�hi! , ! 2 [0, 2⇡], where�X(h) = E[X

j,:

X

T

j+h,:

] 2 Rdp⇥dp, then we can write

inf

1ldp

!2[0,2⇡]

l

[⇢X(!)] ⇤

k

[CU ]

1kNdp

sup

1ldp

!2[0,2⇡]

l

[⇢X(!)].

The closed form expression of spectral density is

⇢X(!) =

I � Ae

�i!

��1

⌃E

h

I � Ae

�i!

��1

i⇤,

where ⌃E is the covariance matrix of a noise vector and Aare as defined in expression (6). Thus, an upper bound onCU can be obtained as ⇤

max

[CU ] ⇤

max

(⌃)

min

(A)

, where wedefined ⇤

min

(A) = min

!2[0,2⇡]

min

(A(!)) for

A(!) =

I � AT

e

i!

� �

I � Ae

�i!

. (12)

Referring back to covariance matrix Q

a

in (11), we get

max

[Q

a

] ⇤

max

(⌃)/⇤

min

(A) = M. (13)

We note that for a general VAR model, there might not existclosed-form expressions for ⇤

max

(A) and ⇤min

(A). How-ever, for some special cases there are results establishingthe bounds on these quantities (e.g., see Proposition 2.2 in(Basu & Michailidis, 2015)).

3. Regularized Estimation Guarantees

Denote by � =

ˆ� � �⇤ the error between the solution ofoptimization problem (4) and �⇤, the true value of the pa-rameter. The focus of our work is to determine conditionsunder which the optimization problem in (4) has guaran-tees on the accuracy of the obtained solution, i.e., the errorterm is bounded: ||�||

2

� for some known �. To estab-lish such conditions, we utilize the framework of (Banerjeeet al., 2014). Specifically, estimation error analysis is basedon the following known results adapted to our settings. Thefirst one characterizes the restricted error set⌦

E

, where theerror � belongs.

Lemma 3.1 Assume that

N

� rR

1

N

Z

T ✏

, (14)

for some constant r > 1, where R

⇤ ⇥ 1

N

Z

T ✏⇤

is a

dual form of the vector norm R(·), which is defined as

R

⇤[

1

N

Z

T ✏] = sup

R(U)1

1

N

Z

T ✏, U↵

, for U 2 Rdp

2

, where

U = [u

T

1

, u

T

2

, . . . , u

T

p

]

T

and u

i

2 Rdp

. Then the error

vector k�k2

belongs to the set

E

=

� 2 Rdp

2

R(�⇤+�) R(�⇤

)+

1

r

R(�)

. (15)

The second condition in (Banerjee et al., 2014) establishesthe upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE)

condition holds

||Z�||2

||�||2

�p

N, (16)

for � 2 cone(⌦

E

) and some constant > 0, where

cone(⌦

E

) is a cone of the error set, then

||�||2

1 + r

r

N

(cone(⌦

E

)), (17)

where (cone(⌦

E

)) is a norm compatibility constant, de-

fined as (cone(⌦

E

)) = sup

U2cone(⌦

E

)

R(U)

||U ||2

.

Note that the above error bound is deterministic, i.e., if (14)and (16) hold, then the error satisfies the upper bound in(17). However, the results are defined in terms of the quan-tities, involving Z and ✏, which are random. Therefore, inthe following we establish high probability bounds on theregularization parameter in (14) and RE condition in (16).

Estimating Structured VAR

density is that it has a closed form expression (see Section 9.4 of (Priestley, 1981)):

$$\rho(\omega)=\Big(I-\sum_{k=1}^{d}A_k e^{-ki\omega}\Big)^{-1}\,\Sigma\,\Big[\Big(I-\sum_{k=1}^{d}A_k e^{-ki\omega}\Big)^{-1}\Big]^{*},$$

where $(\cdot)^{*}$ denotes the Hermitian (conjugate transpose) of a matrix. Therefore, from (8) we can establish the following lower bound

$$\Lambda_{\min}[C_X] \;\ge\; \Lambda_{\min}(\Sigma)/\Lambda_{\max}(\mathcal{A}) \;=\; L, \qquad (9)$$

where we defined $\Lambda_{\max}(\mathcal{A}) = \max_{\omega\in[0,2\pi]}\Lambda_{\max}(\mathcal{A}(\omega))$ for

$$\mathcal{A}(\omega)=\Big(I-\sum_{k=1}^{d}A_k^{T}e^{ki\omega}\Big)\Big(I-\sum_{k=1}^{d}A_k e^{-ki\omega}\Big). \qquad (10)$$

In establishing high probability bounds we will also need information about the vector $q = Xa \in \mathbb{R}^{N}$ for any $a \in \mathbb{R}^{dp}$, $\|a\|_2 = 1$. Since each element $X_{i,:}^{T}a \sim \mathcal{N}(0, a^{T}C_X a)$, it follows that $q \sim \mathcal{N}(0, Q_a)$ with a covariance matrix $Q_a \in \mathbb{R}^{N\times N}$. It can be shown (see Section 2.1.2 of the supplement) that $Q_a$ can be written as

$$Q_a = (I \otimes a^{T})\,C_U\,(I \otimes a), \qquad (11)$$

where $C_U = \mathbb{E}(UU^{T})$ for $U = [X_{1,:}^{T}\ \dots\ X_{N,:}^{T}]^{T} \in \mathbb{R}^{Ndp}$, which is obtained from the matrix $X$ by stacking all of its rows in a single vector, i.e., $U = \mathrm{vec}(X^{T})$. In order to bound the eigenvalues of $C_U$ (and consequently of $Q_a$), observe that $U$ can be viewed as a vector obtained by stacking $N$ outputs of the VAR model in (6). Similarly as in (8), if we denote the spectral density of the VAR process in (6) as $\rho_X(\omega) = \sum_{h=-\infty}^{\infty}\Gamma_X(h)e^{-hi\omega}$, $\omega\in[0,2\pi]$, where $\Gamma_X(h) = \mathbb{E}[X_{j,:}X_{j+h,:}^{T}] \in \mathbb{R}^{dp\times dp}$, then we can write

$$\inf_{\substack{1\le l\le dp\\ \omega\in[0,2\pi]}}\Lambda_{l}[\rho_X(\omega)] \;\le\; \Lambda_{k}[C_U] \;\le\; \sup_{\substack{1\le l\le dp\\ \omega\in[0,2\pi]}}\Lambda_{l}[\rho_X(\omega)], \qquad 1\le k\le Ndp.$$

The closed form expression of this spectral density is

$$\rho_X(\omega) = \big(I - Ae^{-i\omega}\big)^{-1}\,\Sigma_E\,\big[\big(I - Ae^{-i\omega}\big)^{-1}\big]^{*},$$

where $\Sigma_E$ is the covariance matrix of the noise vector and $A$ is the matrix defined in expression (6). Thus, an upper bound on $C_U$ can be obtained as $\Lambda_{\max}[C_U] \le \Lambda_{\max}(\Sigma)/\Lambda_{\min}(\widetilde{\mathcal{A}})$, where we defined $\Lambda_{\min}(\widetilde{\mathcal{A}}) = \min_{\omega\in[0,2\pi]}\Lambda_{\min}(\widetilde{\mathcal{A}}(\omega))$ for

$$\widetilde{\mathcal{A}}(\omega) = \big(I - A^{T}e^{i\omega}\big)\big(I - Ae^{-i\omega}\big). \qquad (12)$$

Referring back to the covariance matrix $Q_a$ in (11), we get

$$\Lambda_{\max}[Q_a] \;\le\; \Lambda_{\max}(\Sigma)/\Lambda_{\min}(\widetilde{\mathcal{A}}) \;=\; M. \qquad (13)$$

We note that for a general VAR model there might not exist closed-form expressions for $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\min}(\widetilde{\mathcal{A}})$. However, for some special cases there are results establishing bounds on these quantities (e.g., see Proposition 2.2 in (Basu & Michailidis, 2015)).
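Although closed-form expressions for $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\min}(\widetilde{\mathcal{A}})$ may not exist, both extrema can be approximated on a frequency grid. The following Python sketch (an illustration only; the coefficient matrices, the noise covariance, and the companion-form construction used for (6) are assumptions of this example, not the authors' code) computes such approximations to $L$ in (9) and $M$ in (13).

# A minimal numerical sketch of the spectral bounds
#   L = Lambda_min(Sigma) / Lambda_max(A(omega))   and
#   M = Lambda_max(Sigma) / Lambda_min(A_tilde(omega)),
# approximated by evaluating the frequency-dependent matrices on a grid of omega.
# The VAR coefficients A_k and the noise covariance Sigma are illustrative placeholders.
import numpy as np

def spectral_bounds(A_list, Sigma, n_grid=500):
    """A_list: list of d coefficient matrices (p x p); Sigma: noise covariance (p x p)."""
    p = A_list[0].shape[0]
    d = len(A_list)
    # Companion (VAR(1)) form assumed for (6): A_comp is dp x dp.
    A_comp = np.zeros((d * p, d * p))
    A_comp[:p, :] = np.hstack(A_list)
    A_comp[p:, :-p] = np.eye((d - 1) * p)

    lam_max_A, lam_min_At = -np.inf, np.inf
    for omega in np.linspace(0, 2 * np.pi, n_grid):
        # A(omega) from (10), built from the original A_k.
        S = np.eye(p) - sum(A_list[k] * np.exp(-1j * (k + 1) * omega) for k in range(d))
        A_om = S.conj().T @ S
        lam_max_A = max(lam_max_A, np.linalg.eigvalsh(A_om).max())
        # A_tilde(omega) from (12), built from the companion matrix.
        St = np.eye(d * p) - A_comp * np.exp(-1j * omega)
        At_om = St.conj().T @ St
        lam_min_At = min(lam_min_At, np.linalg.eigvalsh(At_om).min())

    sig_eigs = np.linalg.eigvalsh(Sigma)
    L = sig_eigs.min() / lam_max_A      # lower bound in (9)
    M = sig_eigs.max() / lam_min_At     # upper bound in (13)
    return L, M

A1 = np.array([[0.5, 0.1], [0.0, 0.4]])       # a stable VAR(1), placeholder values
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
print(spectral_bounds([A1], Sigma))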

3. Regularized Estimation Guarantees

Denote by $\Delta = \hat{\beta} - \beta^{*}$ the error between the solution of the optimization problem (4) and $\beta^{*}$, the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: $\|\Delta\|_2 \le \delta$ for some known $\delta$. To establish such conditions, we utilize the framework of (Banerjee et al., 2014). Specifically, the estimation error analysis is based on the following known results, adapted to our setting. The first one characterizes the restricted error set $\Omega_E$ to which the error $\Delta$ belongs.

Lemma 3.1 Assume that

$$\lambda_N \;\ge\; r\,R^{*}\!\Big[\tfrac{1}{N}Z^{T}\epsilon\Big], \qquad (14)$$

for some constant $r > 1$, where $R^{*}\big[\tfrac{1}{N}Z^{T}\epsilon\big]$ is the dual form of the vector norm $R(\cdot)$, defined as $R^{*}\big[\tfrac{1}{N}Z^{T}\epsilon\big] = \sup_{R(U)\le 1}\big\langle \tfrac{1}{N}Z^{T}\epsilon,\, U\big\rangle$ for $U \in \mathbb{R}^{dp^2}$, where $U = [u_1^{T}, u_2^{T}, \dots, u_p^{T}]^{T}$ and $u_i \in \mathbb{R}^{dp}$. Then the error vector $\Delta$ belongs to the set

$$\Omega_E = \Big\{\Delta \in \mathbb{R}^{dp^2}\;\Big|\; R(\beta^{*}+\Delta) \le R(\beta^{*}) + \tfrac{1}{r}R(\Delta)\Big\}. \qquad (15)$$

The second condition in (Banerjee et al., 2014) establishes the upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds:

$$\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \;\ge\; \kappa\sqrt{N}, \qquad (16)$$

for $\Delta \in \mathrm{cone}(\Omega_E)$ and some constant $\kappa > 0$, where $\mathrm{cone}(\Omega_E)$ is the cone of the error set. Then

$$\|\Delta\|_2 \;\le\; \frac{1+r}{r}\,\frac{\lambda_N}{\kappa^{2}}\,\Psi(\mathrm{cone}(\Omega_E)), \qquad (17)$$

where $\Psi(\mathrm{cone}(\Omega_E))$ is the norm compatibility constant, defined as $\Psi(\mathrm{cone}(\Omega_E)) = \sup_{U\in \mathrm{cone}(\Omega_E)} \frac{R(U)}{\|U\|_2}$.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are stated in terms of quantities involving $Z$ and $\epsilon$, which are random. Therefore, in the following we establish high probability bounds on the regularization parameter in (14) and on the RE condition in (16).
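For intuition about the estimator whose error $\Delta$ these lemmas control, the following is a minimal proximal-gradient sketch (not the authors' implementation; the data matrices, step size and penalty level are placeholders) of a least-squares problem with a norm penalty of the kind analyzed here, instantiated with the $L_1$ norm and solved separately for each output coordinate, as permitted by the row-wise separability assumption in (5).

# Proximal-gradient sketch: for each output coordinate j the objective is
#   (1/N) * ||Y[:, j] - X @ beta_j||_2^2 + lam * R(beta_j),
# which is the shape problem (4) takes once Z = I_p kron X is split row-wise via (5).
# Instantiated with R = L1, whose proximal operator is soft-thresholding.
# X, Y, lam, and the number of iterations are hypothetical placeholders.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_lasso(X, y, lam, n_iter=500):
    N, q = X.shape
    beta = np.zeros(q)
    step = N / (2.0 * np.linalg.norm(X, 2) ** 2)   # inverse Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2.0 / N * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def estimate_var_rowwise(X, Y, lam):
    # One regularized regression per output coordinate; column j holds the estimate of beta_j.
    return np.column_stack([prox_grad_lasso(X, Y[:, j], lam) for j in range(Y.shape[1])])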




3.1. High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties and by illustrations of some special cases based on popular regularization norms such as Lasso and Group Lasso. In Section 3.4 we present the main ideas of our proof technique, with all the details delegated to the supplement.

To establish a lower bound on the regularization parameter $\lambda_N$, we derive an upper bound $R^{*}\big[\tfrac{1}{N}Z^{T}\epsilon\big] \le \alpha$ for some $\alpha > 0$, which establishes the required relationship $\lambda_N \ge \alpha \ge R^{*}\big[\tfrac{1}{N}Z^{T}\epsilon\big]$.

Theorem 3.3 Let $\Omega_R = \{u \in \mathbb{R}^{dp} \mid R(u) \le 1\}$, and define $w(\Omega_R) = \mathbb{E}\big[\sup_{u\in\Omega_R}\langle g, u\rangle\big]$ to be the Gaussian width of the set $\Omega_R$ for $g \sim \mathcal{N}(0, I)$. For any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, with probability at least $1 - c\exp(-\min(\epsilon_2^{2}, \epsilon_1) + \log(p))$ we can establish that

$$R^{*}\Big[\frac{1}{N}Z^{T}\epsilon\Big] \;\le\; c_2(1+\epsilon_2)\,\frac{w(\Omega_R)}{\sqrt{N}} \;+\; c_1(1+\epsilon_1)\,\frac{w^{2}(\Omega_R)}{N^{2}},$$

where $c$, $c_1$ and $c_2$ are positive constants.
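The Gaussian width $w(\Omega_R)$ can be approximated by straightforward Monte Carlo. The sketch below (an illustration, not part of the analysis) does this for the unit $L_1$ ball, where the supremum has the closed form $\sup_{\|u\|_1\le 1}\langle g, u\rangle = \|g\|_\infty$, so the estimate comes out around $\sqrt{2\log(dp)}$; the dimensions used are arbitrary placeholder values.

# Monte Carlo estimate of w(Omega_R) = E[ sup_{u in Omega_R} <g, u> ] for the unit L1 ball,
# where the supremum equals ||g||_inf.
import numpy as np

def gaussian_width_l1_ball(dim, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_samples, dim))
    return np.abs(g).max(axis=1).mean()    # sample mean of ||g||_inf

d, p = 3, 20                               # placeholder lag order and series dimension
print(gaussian_width_l1_ball(d * p), np.sqrt(2 * np.log(d * p)))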

To establish the restricted eigenvalue condition, we will show that $\inf_{\Delta\in \mathrm{cone}(\Omega_E)} \|(I_{p\times p}\otimes X)\Delta\|_2 / \|\Delta\|_2 \ge \nu$ for some $\nu > 0$, and then set $\kappa\sqrt{N} = \nu$.

Theorem 3.4 Let $\Theta = \mathrm{cone}(\Omega_{E_j}) \cap S^{dp-1}$, where $S^{dp-1}$ is the unit sphere. The error set $\Omega_{E_j}$ is defined as $\Omega_{E_j} = \big\{\Delta_j \in \mathbb{R}^{dp} \,\big|\, R(\beta^{*}_j + \Delta_j) \le R(\beta^{*}_j) + \tfrac{1}{r}R(\Delta_j)\big\}$ for $r > 1$, $j = 1,\dots,p$, where $\Delta = [\Delta_1^{T},\dots,\Delta_p^{T}]^{T}$ with each $\Delta_j$ of size $dp\times 1$, and $\beta^{*} = [\beta_1^{*T}\ \dots\ \beta_p^{*T}]^{T}$ with $\beta^{*}_j \in \mathbb{R}^{dp}$. The set $\Omega_{E_j}$ is a part of the decomposition $\Omega_E = \Omega_{E_1} \times \dots \times \Omega_{E_p}$, which holds due to the assumption on the row-wise separability of the norm $R(\cdot)$ in (5). Also define $w(\Theta) = \mathbb{E}\big[\sup_{u\in\Theta}\langle g, u\rangle\big]$ to be the Gaussian width of the set $\Theta$ for $g \sim \mathcal{N}(0, I)$ and $u \in \mathbb{R}^{dp}$. Then, with probability at least $1 - c_1\exp(-c_2\eta^{2} + \log(p))$, for any $\eta > 0$,

$$\inf_{\Delta\in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p\times p}\otimes X)\Delta\|_2}{\|\Delta\|_2} \;\ge\; \nu, \qquad \text{where } \nu = \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \eta,$$

$c$, $c_1$, $c_2$ are positive constants, and $L$, $M$ are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose $\eta = \tfrac{1}{2}\sqrt{NL}$ and set $\kappa\sqrt{N} = \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \eta$. Since $\kappa\sqrt{N} > 0$ must be satisfied, we can establish a lower bound on the number of samples $N$:

$$\sqrt{N} \;>\; \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L}/2} \;=\; O(w(\Theta)).$$

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if $\Lambda_{\min}(\widetilde{\mathcal{A}})$ and $\Lambda_{\min}(\Sigma)$ are larger and $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\max}(\Sigma)$ are smaller. In turn, this means that the matrices $\mathcal{A}$ and $\widetilde{\mathcal{A}}$ in (10) and (12) must be well conditioned and the VAR process must be stable, with eigenvalues well inside the unit circle (see Section 2.2). Alternatively, we can also read the bound on $N$ as saying that large values of $M$ and small values of $L$ indicate stronger dependency in the data, thus requiring more samples for the RE condition to hold with high probability.

Analyzing Theorems 3.3 and 3.4, we can interpret the established results as follows (as the size and dimensionality $N$, $p$ and $d$ of the problem increase, we emphasize the scale of the results and use order notation for the constants). Select a number of samples at least $N \ge O(w^{2}(\Theta))$ and let the regularization parameter satisfy $\lambda_N \ge O\big(\tfrac{w(\Omega_R)}{\sqrt{N}} + \tfrac{w^{2}(\Omega_R)}{N^{2}}\big)$. With high probability the restricted eigenvalue condition $\|Z\Delta\|_2/\|\Delta\|_2 \ge \kappa\sqrt{N}$ for $\Delta\in \mathrm{cone}(\Omega_E)$ then holds, so that $\kappa = O(1)$ is a positive constant. Moreover, the norm of the estimation error in the optimization problem (4) is bounded by $\|\Delta\|_2 \le O\big(\tfrac{w(\Omega_R)}{\sqrt{N}} + \tfrac{w^{2}(\Omega_R)}{N^{2}}\big)\,\Psi(\mathrm{cone}(\Omega_{E_j}))$. Note that the norm compatibility constant $\Psi(\mathrm{cone}(\Omega_{E_j}))$ is assumed to be the same for all $j = 1,\dots,p$, which follows from our assumption in (5).

Consider now Theorem 3.3 and the bound on the regularization parameter $\lambda_N \ge O\big(\tfrac{w(\Omega_R)}{\sqrt{N}} + \tfrac{w^{2}(\Omega_R)}{N^{2}}\big)$. As the dimensionality $p$ and $d$ of the problem grows and the number of samples $N$ increases, the first term $\tfrac{w(\Omega_R)}{\sqrt{N}}$ dominates the second term $\tfrac{w^{2}(\Omega_R)}{N^{2}}$. This can be seen by computing the $N$ for which the two terms become equal, $\tfrac{w(\Omega_R)}{\sqrt{N}} = \tfrac{w^{2}(\Omega_R)}{N^{2}}$, which happens at $N = w^{2/3}(\Omega_R) < w(\Omega_R)$. Therefore, we can restate our results as follows: once the restricted eigenvalue condition holds and $\lambda_N \ge O\big(\tfrac{w(\Omega_R)}{\sqrt{N}}\big)$, the error norm is upper-bounded by $\|\Delta\|_2 \le O\big(\tfrac{w(\Omega_R)}{\sqrt{N}}\big)\,\Psi(\mathrm{cone}(\Omega_{E_j}))$.
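A quick numeric check of this crossover (illustrative only, with an arbitrary width value): the two terms coincide exactly at $N = w^{2/3}(\Omega_R)$, and the first term dominates beyond it.

# Check that w/sqrt(N) = w^2/N^2 exactly at N = w^(2/3), and that the first term
# dominates for larger N. The width value w is an arbitrary illustration.
import numpy as np

w = 8.0                                   # placeholder Gaussian width
N_cross = w ** (2.0 / 3.0)
for N in [N_cross, 10 * N_cross, 100 * N_cross]:
    print(N, w / np.sqrt(N), w ** 2 / N ** 2)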

3.3. Special Cases

While the presented results are valid for any norm $R(\cdot)$ that is separable along the rows of the $A_k$, it is instructive to specialize our analysis to a few popular regularization choices, such as the $L_1$, Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. LASSO

To establish results for the $L_1$ norm, we assume that the parameter $\beta^{*}$ is $s$-sparse, which in our case is meant to represent the largest number of non-zero elements in any $\beta_i$, $i = 1,\dots,p$, i.e., in the combined $i$-th rows of each $A_k$,


$k = 1,\dots,d$. Since the $L_1$ norm is decomposable, it can be shown that $\Psi(\mathrm{cone}(\Omega_{E_j})) \le 4\sqrt{s}$. Next, since $\Omega_R = \{u\in\mathbb{R}^{dp} \mid R(u)\le 1\}$, using Lemma 3 in (Banerjee et al., 2014) and the Gaussian width results in (Chandrasekaran et al., 2012) we can establish that $w(\Omega_R) \le O(\sqrt{\log(dp)})$. Therefore, based on Theorem 3.3 and the discussion at the end of Section 3.2, the bound on the regularization parameter takes the form $\lambda_N \ge O\big(\sqrt{\log(dp)/N}\big)$. Hence, the estimation error is bounded by $\|\Delta\|_2 \le O\big(\sqrt{s\log(dp)/N}\big)$ as long as $N > O(\log(dp))$.
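As a quick empirical illustration of this rate (an illustrative sketch, not an experiment from the paper), one can simulate a small sparse VAR(1), estimate it row-wise with an $L_1$ penalty whose level follows the $\sqrt{\log(dp)/N}$ scaling, and observe the error shrink as $N$ grows; the dimension, sparsity level, constant in front of the rate, and the use of scikit-learn's Lasso are all illustrative choices.

# Simulate a sparse VAR(1), fit it row-wise with an L1 penalty following the
# lambda_N ~ sqrt(log(dp)/N) scaling, and report the estimation error for growing N.
# All concrete numbers (dimension, sparsity, rate constant) are placeholders.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, d = 10, 1
A_true = np.zeros((p, p))
idx = rng.choice(p * p, size=2 * p, replace=False)        # s-sparse transition matrix
A_true.flat[idx] = rng.uniform(-0.3, 0.3, size=2 * p)
ev = np.abs(np.linalg.eigvals(A_true)).max()
A_true *= 0.8 / max(ev, 0.8)                               # keep the process stable

def simulate(T):
    X = np.zeros((T + 1, p))
    for t in range(T):
        X[t + 1] = A_true @ X[t] + rng.standard_normal(p)
    return X[:-1], X[1:]                                   # lagged design and targets

for N in [200, 800, 3200]:
    X, Y = simulate(N)
    lam = 0.5 * np.sqrt(np.log(d * p) / N)                 # lambda_N ~ sqrt(log(dp)/N)
    A_hat = np.column_stack(
        [Lasso(alpha=lam, fit_intercept=False).fit(X, Y[:, j]).coef_ for j in range(p)]
    ).T
    print(N, np.linalg.norm(A_hat - A_true))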

3.3.2. GROUP LASSO

To establish results for the Group Lasso norm, we assume that for each $i = 1,\dots,p$ the vector $\beta_i \in \mathbb{R}^{dp}$ can be partitioned into a set of $K$ disjoint groups, $G = \{G_1,\dots,G_K\}$, with the size of the largest group $m = \max_k |G_k|$. The Group Lasso norm is defined as $\|\beta\|_{GL} = \sum_{k=1}^{K}\|\beta_{G_k}\|_2$. We assume that the parameter $\beta^{*}$ is $s_G$-group-sparse, which means that the largest number of non-zero groups in any $\beta_i$, $i = 1,\dots,p$, is $s_G$. Since the Group Lasso norm is decomposable, as was established in (Negahban et al., 2012), it can be shown that $\Psi(\mathrm{cone}(\Omega_{E_j})) \le 4\sqrt{s_G}$. Similarly to the Lasso case, using Lemma 3 in (Banerjee et al., 2014) we get $w(\Omega_{R_{GL}}) \le O(\sqrt{m + \log(K)})$. The bound on $\lambda_N$ takes the form $\lambda_N \ge O\big(\sqrt{(m + \log(K))/N}\big)$. Combining these derivations, we obtain the bound $\|\Delta\|_2 \le O\big(\sqrt{s_G(m + \log(K))/N}\big)$ for $N > O(m + \log(K))$.

3.3.3. SPARSE GROUP LASSO

Similarly to Section 3.3.2, we assume that we have $K$ disjoint groups of size at most $m$. The Sparse Group Lasso norm enforces sparsity not only across but also within the groups, and is defined as $\|\beta\|_{SGL} = \alpha\|\beta\|_1 + (1-\alpha)\sum_{k=1}^{K}\|\beta_{G_k}\|_2$, where $\alpha\in[0,1]$ is a parameter which regulates the convex combination of the Lasso and Group Lasso penalties. Note that since $\|\beta\|_2 \le \|\beta\|_1$, it follows that $\|\beta\|_{GL} \le \|\beta\|_{SGL}$. As a result, $\beta \in \Omega_{R_{SGL}} \Rightarrow \beta \in \Omega_{R_{GL}}$, so that $\Omega_{R_{SGL}} \subseteq \Omega_{R_{GL}}$ and thus $w(\Omega_{R_{SGL}}) \le w(\Omega_{R_{GL}}) \le O(\sqrt{m+\log(K)})$, according to Section 3.3.2. Assuming $\beta^{*}$ is $s$-sparse and $s_G$-group-sparse, and noting that the norm is decomposable, we get $\Psi(\mathrm{cone}(\Omega_{E_j})) \le 4\big(\alpha\sqrt{s} + (1-\alpha)\sqrt{s_G}\big)$. Consequently, the error bound is $\|\Delta\|_2 \le O\big(\sqrt{(\alpha s + (1-\alpha)s_G)(m + \log(K))/N}\big)$.
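For concreteness, the norm can be transcribed directly into code; the groups and the value of $\alpha$ below are arbitrary placeholders.

# Evaluate the Sparse Group Lasso norm  alpha*||b||_1 + (1-alpha)*sum_k ||b_{G_k}||_2
# for a partition of the coordinates into disjoint groups.
import numpy as np

def sparse_group_lasso_norm(b, groups, alpha):
    l1 = np.abs(b).sum()
    gl = sum(np.linalg.norm(b[g]) for g in groups)
    return alpha * l1 + (1.0 - alpha) * gl

b = np.array([1.0, -2.0, 0.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3])]     # K = 2 disjoint groups
print(sparse_group_lasso_norm(b, groups, alpha=0.3))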

3.3.4. OWL NORM

The ordered weighted $L_1$ norm is a recently introduced regularizer, defined as $\|\beta\|_{\mathrm{owl}} = \sum_{i=1}^{dp}c_i|\beta|_{(i)}$, where $c_1 \ge \dots \ge c_{dp} \ge 0$ is a predefined non-increasing sequence of weights and $|\beta|_{(1)} \ge \dots \ge |\beta|_{(dp)}$ is the sequence of absolute values of $\beta$, ranked in decreasing order. In (Chen & Banerjee, 2015) it was shown that $w(\Omega_R) \le O\big(\sqrt{\log(dp)}/\bar{c}\big)$, where $\bar{c}$ is the average of $c_1,\dots,c_{dp}$, and that the norm compatibility constant satisfies $\Psi(\mathrm{cone}(\Omega_{E_j})) \le 2c_1^{2}\sqrt{s}/\bar{c}$. Therefore, based on Theorem 3.3, we get $\lambda_N \ge O\big(\sqrt{\log(dp)/(\bar{c}N)}\big)$ and the estimation error is bounded by $\|\Delta\|_2 \le O\big(2c_1\sqrt{s\log(dp)/(\bar{c}N)}\big)$.
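Since OWL is the least standard of these regularizers, a short sketch of evaluating it may help; the weight sequence below is an arbitrary non-increasing example.

# Evaluate the ordered weighted L1 (OWL) norm  sum_i c_i * |b|_(i),
# where |b|_(1) >= |b|_(2) >= ... are the absolute entries of b in decreasing order.
import numpy as np

def owl_norm(b, c):
    abs_sorted = np.sort(np.abs(b))[::-1]     # |b| ranked in decreasing order
    return np.dot(c, abs_sorted)

b = np.array([0.2, -1.5, 0.7, 0.0])
c = np.array([1.0, 0.75, 0.5, 0.25])          # c_1 >= ... >= c_dp >= 0
print(owl_norm(b, c))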

We note that the bounds obtained for Lasso and Group Lasso are similar to those obtained in (Song & Bickel, 2011; Basu & Michailidis, 2015; Kock & Callot, 2015). Moreover, these results are also similar to those in works that dealt with independent observations, e.g., (Bickel et al., 2009; Negahban et al., 2012), the difference being the constants, which reflect the correlation between the samples, as discussed in Section 3.2. The explicit bounds for Sparse Group Lasso and OWL are a novel aspect of our non-asymptotic recovery guarantees for the norm-regularized VAR estimation problem, and follow as a simple consequence of our more general framework.

3.4. Proof Sketch

In this section we outline the steps of the proofs of Theorems 3.3 and 3.4; all details can be found in Sections 2.2 and 2.3 of the supplement.

3.4.1. BOUND ON REGULARIZATION PARAMETER

Recall that our objective is to establish, for some $\alpha > 0$, a probabilistic statement that $\lambda_N \ge \alpha \ge R^{*}\big[\tfrac{1}{N}Z^{T}\epsilon\big] = \sup_{R(U)\le 1}\big\langle \tfrac{1}{N}Z^{T}\epsilon,\, U\big\rangle$, where $U = [u_1^{T},\dots,u_p^{T}]^{T} \in \mathbb{R}^{dp^2}$ for $u_j \in \mathbb{R}^{dp}$, and $\epsilon = \mathrm{vec}(E)$ for $E$ in (3). We denote by $E_{:,j} \in \mathbb{R}^{N}$ a column of the noise matrix $E$, and note that, since $Z = I_{p\times p}\otimes X$, the row-wise separability assumption in (5) lets us split the overall probability statement into $p$ parts, which are easier to work with. Thus, our objective is to establish

$$P\Big[\sup_{R(u_j)\le r_j}\big\langle \tfrac{1}{N}X^{T}E_{:,j},\, u_j\big\rangle \le \alpha_j\Big] \;\ge\; \pi_j, \qquad (18)$$

for $j = 1,\dots,p$, where $\sum_{j=1}^{p}\alpha_j = \alpha$ and $\sum_{j=1}^{p}r_j = 1$.

The overall strategy is to first show that the random variable $\big\langle \tfrac{1}{N}X^{T}E_{:,j}, u_j\big\rangle$ has sub-exponential tails. Based on the generic chaining argument, we then use Theorem 1.2.7 in (Talagrand, 2006) to bound $\mathbb{E}\big[\sup_{R(u_j)\le r_j}\big\langle \tfrac{1}{N}X^{T}E_{:,j}, u_j\big\rangle\big]$. Finally, using Theorem 1.2.9 in (Talagrand, 2006), we establish the high probability bound on the concentration of $\sup_{R(u_j)\le r_j}\big\langle \tfrac{1}{N}X^{T}E_{:,j}, u_j\big\rangle$ around its mean, i.e., we derive the bound in (18).
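To visualize the quantity controlled in (18), the sketch below (an illustration under the simplifying choice $R = L_1$, for which the dual norm of $\tfrac{1}{N}X^{T}E_{:,j}$ is its largest absolute entry) simulates a stable VAR(1) and records how that quantity concentrates and decays roughly like $1/\sqrt{N}$; all concrete values are placeholders.

# Empirical look at R*[(1/N) X^T E_{:, j}] for R = L1, i.e. the max absolute entry of
# (1/N) X^T E_{:, j}, across repetitions of a stable VAR(1). The statistic concentrates
# around its mean and shrinks roughly like 1/sqrt(N), matching the lambda_N scaling.
# Dimensions and coefficients are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)
p = 5
A = 0.5 * np.eye(p)                       # a simple stable VAR(1)

def dual_norm_samples(N, reps=200):
    out = []
    for _ in range(reps):
        X = np.zeros((N + 1, p))
        E = rng.standard_normal((N, p))
        for t in range(N):
            X[t + 1] = A @ X[t] + E[t]
        out.append(np.abs(X[:-1].T @ E[:, 0] / N).max())   # j = 0 column of the noise
    return np.array(out)

for N in [100, 400, 1600]:
    s = dual_norm_samples(N)
    print(N, s.mean(), s.std())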

Estimating Structured VAR

k = 1, . . . , d. Since L

1

is decomposable, it can be shownthat (cone(⌦

E

j

)) 4

ps. Next, since ⌦

R

= {u 2Rdp|R(u) 1}, then using Lemma 3 in (Banerjee et al.,2014) and Gaussian width results in (Chandrasekaran et al.,2012), we can establish that w(⌦

R

) O(

p

log(dp)).Therefore, based on Theorem 4.3 and the discussion at theend of Section 3.2, the bound on the regularization parame-ter takes the form �

N

� O⇣

p

log(dp)/N

. Hence, the es-

timation error is bounded by k�k2

O⇣

p

s log(dp)/N

as long as N > O(log(dp)).

3.3.2. GROUP LASSO

To establish results for Group norm, we assume that foreach i = 1, . . . , p, the vector �

i

2 Rdp can be parti-tioned into a set of K disjoint groups, G = {G

1

, . . . , G

K

},with the size of the largest group m = max

k

|Gk

|. Group

Lasso norm is defined as k�kGL =

P

K

k=1

k�

G

k

k2

. Weassume that the parameter �⇤ is s

G

-group-sparse, whichmeans that the largest number of non-zero groups in any�

i

, i = 1, . . . , p is s

G

. Since Group norm is decompos-able, as was established in (Negahban et al., 2012), it canbe shown that (cone(⌦

E

j

)) 4

ps

G

. Similarly as inthe Lasso case, using Lemma 3 in (Banerjee et al., 2014),we get w(⌦

RGL) O(

p

m + log(K)). The bound onthe �

N

takes the form �

N

� O⇣

p

(m + log(K))/N

.Combining these derivations, we obtain the bound k�k

2

O⇣

p

s

G

(m + log(K))/N

for N > O(m + log(K)).

3.3.3. SPARSE GROUP LASSO

Similarly as in Section 3.3.2, we assume that we haveK disjoint groups of size at most m. The Sparse GroupLasso norm enforces sparsity not only across but alsowithin the groups and is defined as k�kSGL = ↵k�k

1

+

(1 � ↵)

P

K

k=1

k�

G

k

k2

, where ↵ 2 [0, 1] is a parame-ter which regulates a convex combination of Lasso andGroup Lasso penalties. Note that since k�k

2

k�k1

,it follows that k�kGL k�kSGL. As a result, for� 2 ⌦

RSGL ) � 2 ⌦

RGL , so that ⌦RSGL ✓ ⌦

RGL

and thus w(⌦

RSGL) w(⌦

RGL) O(

p

m + log(K)),according to Section 3.3.2. Assuming �⇤ is s-sparseand s

G

-group-sparse and noting that the norm is de-composable, we get (cone(⌦

E

j

)) 4(↵

ps + (1 �

↵)

ps

G

)). Consequently, the error bound is k�k2

O⇣

p

(↵s + (1 � ↵)s

G

)(m + log(K))/N

.

3.3.4. OWL NORM

Ordered weighted L

1

norm is a recently introduced regu-larizer and is defined as k�kowl =

P

dp

i=1

c

i

|�|(i)

, wherec

1

� . . . � c

dp

� 0 is a predefined non-increasing se-

quence of weights and |�|(1)

� . . . � |�|(dp)

is the se-quence of absolute values of �, ranked in decreasing order.In (Chen & Banerjee, 2015) it was shown that w(⌦

R

) O(

p

log(dp)/c̄), where c̄ is the average of c

1

, . . . , c

dp

and the norm compatibility constant is (cone(⌦E

j

)) 2c

2

1

ps/c̄. Therefore, based on Theorem 4.3, we get �

N

�O⇣

p

log(dp)/(c̄N)

and the estimation error is bounded

by k�k2

O⇣

2c

1

p

s log(dp)/(c̄N)

.

We note that the bound obtained for Lasso and Group Lassois similar to the bound obtained in (Song & Bickel, 2011;Basu & Michailidis, 2015; Kock & Callot, 2015). More-over, this result is also similar to the works, which dealtwith independent observations, e.g., (Bickel et al., 2009;Negahban et al., 2012), with the difference being the con-stants, reflecting correlation between the samples, as wediscussed in Section 3.2. The explicit bound for SparseGroup Lasso and OWL is a novel aspect of our work forthe non-asymptotic recovery guarantees for the VAR esti-mation problem with norm regularization, being just a sim-ple consequence from our more general framework.

3.4. Proof Sketch

In this Section we outline the steps of the proof for Theo-rem 3.3 and 3.4, all the details can be found in Sections 2.2and 2.3 of the supplement.

3.4.1. BOUND ON REGULARIZATION PARAMETER

Recall that our objective is to establish for some ↵ > 0

a probabilistic statement that �

N

� ↵ � R

⇤[

1

N

Z

T ✏] =

supR(U)1

1

N

Z

T ✏, U↵

, where U = [u

T

1

, . . . , u

T

p

]

T 2 Rdp

2

for u

j

2 Rdp and ✏ = vec(E) for E in (3). We denoteE

:,j

2 RN as a column of noise matrix E and note thatsince Z = I

p⇥p

⌦ X , then using the row-wise separabilityassumption in (5) we can split the overall probability state-ment into p parts, which are easier to work with. Thus, ourobjective would be to establish

P

supR(u

j

)r

j

1

N

X

T

E

:,j

, u

j

↵ ↵

j

� ⇡

j

, (18)

for j = 1, . . . , p, whereP

p

j=1

j

= ↵ andP

p

j=1

r

j

= 1.

The overall strategy is to first show that the ran-dom variable 1

N

X

T

E

:,j

, u

j

has sub-exponential tails.Based on the generic chaining argument, we thenuse Theorem 1.2.7 in (Talagrand, 2006) and bound

E"

sup

R(u

j

)r

j

1

N

X

T

E

:,j

, u

j

#

. Finally, using Theo-

rem 1.2.9 in (Talagrand, 2006) we establish the high proba-bility bound on concentration of sup

R(u

j

)r

j

1

N

X

T

E

:,j

, u

j

around its mean, i.e., derive the bound in (18).

Estimating Structured VAR

k = 1, . . . , d. Since L

1

is decomposable, it can be shownthat (cone(⌦

E

j

)) 4

ps. Next, since ⌦

R

= {u 2Rdp|R(u) 1}, then using Lemma 3 in (Banerjee et al.,2014) and Gaussian width results in (Chandrasekaran et al.,2012), we can establish that w(⌦

R

) O(

p

log(dp)).Therefore, based on Theorem 4.3 and the discussion at theend of Section 3.2, the bound on the regularization parame-ter takes the form �

N

� O⇣

p

log(dp)/N

. Hence, the es-

timation error is bounded by k�k2

O⇣

p

s log(dp)/N

as long as N > O(log(dp)).

3.3.2. GROUP LASSO

To establish results for Group norm, we assume that foreach i = 1, . . . , p, the vector �

i

2 Rdp can be parti-tioned into a set of K disjoint groups, G = {G

1

, . . . , G

K

},with the size of the largest group m = max

k

|Gk

|. Group

Lasso norm is defined as k�kGL =

P

K

k=1

k�

G

k

k2

. Weassume that the parameter �⇤ is s

G

-group-sparse, whichmeans that the largest number of non-zero groups in any�

i

, i = 1, . . . , p is s

G

. Since Group norm is decompos-able, as was established in (Negahban et al., 2012), it canbe shown that (cone(⌦

E

j

)) 4

ps

G

. Similarly as inthe Lasso case, using Lemma 3 in (Banerjee et al., 2014),we get w(⌦

RGL) O(

p

m + log(K)). The bound onthe �

N

takes the form �

N

� O⇣

p

(m + log(K))/N

.Combining these derivations, we obtain the bound k�k

2

O⇣

p

s

G

(m + log(K))/N

for N > O(m + log(K)).

3.3.3. SPARSE GROUP LASSO

Similarly as in Section 3.3.2, we assume that we haveK disjoint groups of size at most m. The Sparse GroupLasso norm enforces sparsity not only across but alsowithin the groups and is defined as k�kSGL = ↵k�k

1

+

(1 � ↵)

P

K

k=1

k�

G

k

k2

, where ↵ 2 [0, 1] is a parame-ter which regulates a convex combination of Lasso andGroup Lasso penalties. Note that since k�k

2

k�k1

,it follows that k�kGL k�kSGL. As a result, for� 2 ⌦

RSGL ) � 2 ⌦

RGL , so that ⌦RSGL ✓ ⌦

RGL

and thus w(⌦

RSGL) w(⌦

RGL) O(

p

m + log(K)),according to Section 3.3.2. Assuming �⇤ is s-sparseand s

G

-group-sparse and noting that the norm is de-composable, we get (cone(⌦

E

j

)) 4(↵

ps + (1 �

↵)

ps

G

)). Consequently, the error bound is k�k2

O⇣

p

(↵s + (1 � ↵)s

G

)(m + log(K))/N

.

3.3.4. OWL NORM

Ordered weighted L

1

norm is a recently introduced regu-larizer and is defined as k�kowl =

P

dp

i=1

c

i

|�|(i)

, wherec

1

� . . . � c

dp

� 0 is a predefined non-increasing se-

quence of weights and |�|(1)

� . . . � |�|(dp)

is the se-quence of absolute values of �, ranked in decreasing order.In (Chen & Banerjee, 2015) it was shown that w(⌦

R

) O(

p

log(dp)/c̄), where c̄ is the average of c

1

, . . . , c

dp

and the norm compatibility constant is (cone(⌦E

j

)) 2c

2

1

ps/c̄. Therefore, based on Theorem 4.3, we get �

N

�O⇣

p

log(dp)/(c̄N)

and the estimation error is bounded

by k�k2

O⇣

2c

1

p

s log(dp)/(c̄N)

.

We note that the bound obtained for Lasso and Group Lassois similar to the bound obtained in (Song & Bickel, 2011;Basu & Michailidis, 2015; Kock & Callot, 2015). More-over, this result is also similar to the works, which dealtwith independent observations, e.g., (Bickel et al., 2009;Negahban et al., 2012), with the difference being the con-stants, reflecting correlation between the samples, as wediscussed in Section 3.2. The explicit bound for SparseGroup Lasso and OWL is a novel aspect of our work forthe non-asymptotic recovery guarantees for the VAR esti-mation problem with norm regularization, being just a sim-ple consequence from our more general framework.

3.4. Proof Sketch

In this Section we outline the steps of the proof for Theo-rem 3.3 and 3.4, all the details can be found in Sections 2.2and 2.3 of the supplement.

3.4.1. BOUND ON REGULARIZATION PARAMETER

Recall that our objective is to establish for some ↵ > 0

a probabilistic statement that �

N

� ↵ � R

⇤[

1

N

Z

T ✏] =

supR(U)1

1

N

Z

T ✏, U↵

, where U = [u

T

1

, . . . , u

T

p

]

T 2 Rdp

2

for u

j

2 Rdp and ✏ = vec(E) for E in (3). We denoteE

:,j

2 RN as a column of noise matrix E and note thatsince Z = I

p⇥p

⌦ X , then using the row-wise separabilityassumption in (5) we can split the overall probability state-ment into p parts, which are easier to work with. Thus, ourobjective would be to establish

P

supR(u

j

)r

j

1

N

X

T

E

:,j

, u

j

↵ ↵

j

� ⇡

j

, (18)

for j = 1, . . . , p, whereP

p

j=1

j

= ↵ andP

p

j=1

r

j

= 1.

The overall strategy is to first show that the ran-dom variable 1

N

X

T

E

:,j

, u

j

has sub-exponential tails.Based on the generic chaining argument, we thenuse Theorem 1.2.7 in (Talagrand, 2006) and bound

E"

sup

R(u

j

)r

j

1

N

X

T

E

:,j

, u

j

#

. Finally, using Theo-

rem 1.2.9 in (Talagrand, 2006) we establish the high proba-bility bound on concentration of sup

R(u

j

)r

j

1

N

X

T

E

:,j

, u

j

around its mean, i.e., derive the bound in (18).

Estimating Structured VAR

k = 1, . . . , d. Since L

1

is decomposable, it can be shownthat (cone(⌦

E

j

)) 4

ps. Next, since ⌦

R

= {u 2Rdp|R(u) 1}, then using Lemma 3 in (Banerjee et al.,2014) and Gaussian width results in (Chandrasekaran et al.,2012), we can establish that w(⌦

R

) O(

p

log(dp)).Therefore, based on Theorem 4.3 and the discussion at theend of Section 3.2, the bound on the regularization parame-ter takes the form �

N

� O⇣

p

log(dp)/N

. Hence, the es-

timation error is bounded by k�k2

O⇣

p

s log(dp)/N

as long as N > O(log(dp)).

3.3.2. GROUP LASSO

To establish results for Group norm, we assume that foreach i = 1, . . . , p, the vector �

i

2 Rdp can be parti-tioned into a set of K disjoint groups, G = {G

1

, . . . , G

K

},with the size of the largest group m = max

k

|Gk

|. Group

Lasso norm is defined as k�kGL =

P

K

k=1

k�

G

k

k2

. Weassume that the parameter �⇤ is s

G

-group-sparse, whichmeans that the largest number of non-zero groups in any�

i

, i = 1, . . . , p is s

G

. Since Group norm is decompos-able, as was established in (Negahban et al., 2012), it canbe shown that (cone(⌦

E

j

)) 4

ps

G

. Similarly as inthe Lasso case, using Lemma 3 in (Banerjee et al., 2014),we get w(⌦

RGL) O(

p

m + log(K)). The bound onthe �

N

takes the form �

N

� O⇣

p

(m + log(K))/N

.Combining these derivations, we obtain the bound k�k

2

O⇣

p

s

G

(m + log(K))/N

for N > O(m + log(K)).

3.3.3. SPARSE GROUP LASSO

Similarly as in Section 3.3.2, we assume that we haveK disjoint groups of size at most m. The Sparse GroupLasso norm enforces sparsity not only across but alsowithin the groups and is defined as k�kSGL = ↵k�k

1

+

(1 � ↵)

P

K

k=1

k�

G

k

k2

, where ↵ 2 [0, 1] is a parame-ter which regulates a convex combination of Lasso andGroup Lasso penalties. Note that since k�k

2

k�k1

,it follows that k�kGL k�kSGL. As a result, for� 2 ⌦

RSGL ) � 2 ⌦

RGL , so that ⌦RSGL ✓ ⌦

RGL

and thus w(⌦

RSGL) w(⌦

RGL) O(

p

m + log(K)),according to Section 3.3.2. Assuming �⇤ is s-sparseand s

G

-group-sparse and noting that the norm is de-composable, we get (cone(⌦

E

j

)) 4(↵

ps + (1 �

↵)

ps

G

)). Consequently, the error bound is k�k2

O⇣

p

(↵s + (1 � ↵)s

G

)(m + log(K))/N

.

3.3.4. OWL NORM

The ordered weighted $L_1$ norm is a recently introduced regularizer, defined as $\|\beta\|_{owl} = \sum_{i=1}^{dp} c_i|\beta|_{(i)}$, where $c_1 \ge \ldots \ge c_{dp} \ge 0$ is a predefined non-increasing sequence of weights and $|\beta|_{(1)} \ge \ldots \ge |\beta|_{(dp)}$ is the sequence of absolute values of $\beta$, ranked in decreasing order. In (Chen & Banerjee, 2015) it was shown that $w(\Omega_R) \le O(\sqrt{\log(dp)/\bar{c}})$, where $\bar{c}$ is the average of $c_1, \ldots, c_{dp}$, and the norm compatibility constant satisfies $\Psi(\mathrm{cone}(\Omega_{E_j})) \le 2c_1^2\sqrt{s}/\bar{c}$. Therefore, based on Theorem 3.3, we get $\lambda_N \ge O(\sqrt{\log(dp)/(\bar{c}N)})$ and the estimation error is bounded by $\|\Delta\|_2 \le O\big(2c_1\sqrt{s\log(dp)/(\bar{c}N)}\big)$.
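The OWL norm itself is straightforward to evaluate; the following minimal sketch (with an arbitrary, linearly decaying weight sequence $c$, an assumption of this example) computes $\|\beta\|_{owl}$ and the average weight $\bar{c}$ appearing in the bounds.

```python
# Sketch of the OWL norm ||beta||_owl = sum_i c_i * |beta|_(i).
import numpy as np

def owl_norm(beta, c):
    """Ordered weighted L1 norm; weights c must be non-increasing and non-negative."""
    return np.sort(np.abs(beta))[::-1] @ c       # pair the largest |beta| with the largest weight

dp = 10
c = np.linspace(1.0, 0.1, dp)                    # c_1 >= ... >= c_dp >= 0 (example choice)
beta = np.random.default_rng(2).normal(size=dp)
print(f"||beta||_owl = {owl_norm(beta, c):.3f}, c_bar = {c.mean():.2f}")
```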

We note that the bounds obtained for Lasso and Group Lasso are similar to the bounds obtained in (Song & Bickel, 2011; Basu & Michailidis, 2015; Kock & Callot, 2015). Moreover, these results are also similar to the ones for works which dealt with independent observations, e.g., (Bickel et al., 2009; Negahban et al., 2012), the difference being the constants, which reflect the correlation between the samples, as discussed in Section 3.2. The explicit bounds for Sparse Group Lasso and OWL are a novel aspect of our work on non-asymptotic recovery guarantees for the VAR estimation problem with norm regularization, and they follow as a simple consequence of our more general framework.

3.4. Proof Sketch

In this section we outline the steps of the proof for Theorems 3.3 and 3.4; all the details can be found in Sections 2.2 and 2.3 of the supplement.

3.4.1. BOUND ON REGULARIZATION PARAMETER

Recall that our objective is to establish, for some $\alpha > 0$, a probabilistic statement that $\lambda_N \ge \alpha \ge R^*\big[\frac{1}{N}Z^T\epsilon\big] = \sup_{R(U)\le 1}\big\langle\frac{1}{N}Z^T\epsilon, U\big\rangle$, where $U = [u_1^T, \ldots, u_p^T]^T \in \mathbb{R}^{dp^2}$ for $u_j \in \mathbb{R}^{dp}$ and $\epsilon = \mathrm{vec}(E)$ for $E$ in (3). We denote by $E_{:,j} \in \mathbb{R}^N$ a column of the noise matrix $E$ and note that since $Z = I_{p\times p}\otimes X$, then using the row-wise separability assumption in (5) we can split the overall probability statement into $p$ parts, which are easier to work with. Thus, our objective is to establish
$$P\Big[\sup_{R(u_j)\le r_j}\big\langle\tfrac{1}{N}X^T E_{:,j},\, u_j\big\rangle \le \alpha_j\Big] \ge \pi_j, \qquad (18)$$
for $j = 1, \ldots, p$, where $\sum_{j=1}^p \alpha_j = \alpha$ and $\sum_{j=1}^p r_j = 1$.
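For the $L_1$ case, this splitting has a simple numerical counterpart: because $Z = I_{p\times p}\otimes X$, the dual-norm quantity $R^*\big[\frac{1}{N}Z^T\epsilon\big]$ equals the largest of the per-column quantities $\big\|\frac{1}{N}X^T E_{:,j}\big\|_\infty$. The hedged sketch below (with an i.i.d. stand-in for $X$, unlike the dependent design analysed in the paper) simply verifies this identity.

```python
# Numerical check (L1 case): R*[(1/N) Z^T eps] = max_j ||(1/N) X^T E[:, j]||_inf.
import numpy as np

rng = np.random.default_rng(3)
N, p, d = 200, 8, 2
X = rng.normal(size=(N, d * p))                  # stand-in design; the paper's X has dependent rows
E = rng.normal(size=(N, p))                      # noise matrix E from (3)

per_column = [np.max(np.abs(X.T @ E[:, j]) / N) for j in range(p)]
Z = np.kron(np.eye(p), X)                        # Z = I_{pxp} (kron) X
dual_full = np.max(np.abs(Z.T @ E.flatten(order="F")) / N)   # ||(1/N) Z^T vec(E)||_inf

print(f"max_j ||X^T E[:,j]||_inf / N = {max(per_column):.4f}")
print(f"||Z^T vec(E)||_inf / N      = {dual_full:.4f}")      # identical by construction
```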

The overall strategy is to first show that the random variable $\big\langle\frac{1}{N}X^T E_{:,j}, u_j\big\rangle$ has sub-exponential tails. Based on the generic chaining argument, we then use Theorem 1.2.7 in (Talagrand, 2006) to bound $\mathbb{E}\big[\sup_{R(u_j)\le r_j}\big\langle\frac{1}{N}X^T E_{:,j}, u_j\big\rangle\big]$. Finally, using Theorem 1.2.9 in (Talagrand, 2006) we establish the high-probability bound on the concentration of $\sup_{R(u_j)\le r_j}\big\langle\frac{1}{N}X^T E_{:,j}, u_j\big\rangle$ around its mean, i.e., we derive the bound in (18).


Page 13: Estimating structured vector autoregressive models


3.1. High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties, illustrating some special cases based on the popular Lasso and Group Lasso regularization norms. In Section 3.4 we present the main ideas of our proof technique, with all the details delegated to the supplement.

To establish the lower bound on the regularization parameter $\lambda_N$, we derive an upper bound $R^*\big[\frac{1}{N}Z^T\epsilon\big] \le \alpha$, for some $\alpha > 0$, which will establish the required relationship $\lambda_N \ge \alpha \ge R^*\big[\frac{1}{N}Z^T\epsilon\big]$.

Theorem 3.3 Let $\Omega_R = \{u \in \mathbb{R}^{dp}\,|\,R(u) \le 1\}$, and define $w(\Omega_R) = \mathbb{E}[\sup_{u\in\Omega_R}\langle g, u\rangle]$ to be the Gaussian width of the set $\Omega_R$ for $g \sim \mathcal{N}(0, I)$. For any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, with probability at least $1 - c\exp(-\min(\epsilon_2^2, \epsilon_1) + \log(p))$ we can establish that
$$R^*\Big[\frac{1}{N}Z^T\epsilon\Big] \le c_2(1+\epsilon_2)\frac{w(\Omega_R)}{\sqrt{N}} + c_1(1+\epsilon_1)\frac{w^2(\Omega_R)}{N^2},$$
where $c$, $c_1$ and $c_2$ are positive constants.
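The Gaussian width in Theorem 3.3 can be approximated by Monte Carlo whenever the supremum has a closed form; for the $L_1$ unit ball, $\sup_{\|u\|_1\le 1}\langle g, u\rangle = \|g\|_\infty$. The sketch below is such an approximation (the comparison with $\sqrt{2\log(dp)}$ is the standard asymptotic, not a statement of the theorem).

```python
# Monte Carlo estimate of w(Omega_R) = E[sup_{u in Omega_R} <g, u>] for the L1 unit ball.
import numpy as np

rng = np.random.default_rng(4)
dp, n_mc = 500, 2000
g = rng.normal(size=(n_mc, dp))
w_l1_ball = np.mean(np.max(np.abs(g), axis=1))   # sup_{||u||_1 <= 1} <g, u> = ||g||_inf
print(f"w(Omega_R) ~ {w_l1_ball:.2f} vs sqrt(2*log(dp)) = {np.sqrt(2 * np.log(dp)):.2f}")
```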

To establish the restricted eigenvalue condition, we will show that $\inf_{\Delta\in\mathrm{cone}(\Omega_E)}\frac{\|(I_{p\times p}\otimes X)\Delta\|_2}{\|\Delta\|_2} \ge \nu$ for some $\nu > 0$, and then set $\kappa\sqrt{N} = \nu$.

Theorem 3.4 Let $\Theta = \mathrm{cone}(\Omega_{E_j}) \cap S^{dp-1}$, where $S^{dp-1}$ is the unit sphere. The error set $\Omega_{E_j}$ is defined as
$$\Omega_{E_j} = \Big\{\Delta_j \in \mathbb{R}^{dp} \,\Big|\, R(\beta^*_j + \Delta_j) \le R(\beta^*_j) + \tfrac{1}{r}R(\Delta_j)\Big\},$$
for $r > 1$, $j = 1, \ldots, p$, and $\Delta = [\Delta_1^T, \ldots, \Delta_p^T]^T$, where $\Delta_j$ is of size $dp\times 1$, and $\beta^* = [\beta_1^{*T}\ldots\beta_p^{*T}]^T$, for $\beta^*_j \in \mathbb{R}^{dp}$. The set $\Omega_{E_j}$ is a part of the decomposition $\Omega_E = \Omega_{E_1}\times\cdots\times\Omega_{E_p}$ due to the assumption on the row-wise separability of the norm $R(\cdot)$ in (5). Also define $w(\Theta) = \mathbb{E}[\sup_{u\in\Theta}\langle g, u\rangle]$ to be the Gaussian width of the set $\Theta$ for $g \sim \mathcal{N}(0, I)$ and $u \in \mathbb{R}^{dp}$. Then with probability at least $1 - c_1\exp(-c_2\eta^2 + \log(p))$, for any $\eta > 0$,
$$\inf_{\Delta\in\mathrm{cone}(\Omega_E)}\frac{\|(I_{p\times p}\otimes X)\Delta\|_2}{\|\Delta\|_2} \ge \nu,$$
where $\nu = \sqrt{NL} - 2\sqrt{M} - cw(\Theta) - \eta$, $c$, $c_1$, $c_2$ are positive constants, and $L$, $M$ are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose $\eta = \frac{1}{2}\sqrt{NL}$ and set $\kappa\sqrt{N} = \sqrt{NL} - 2\sqrt{M} - cw(\Theta) - \eta$; since $\kappa\sqrt{N} > 0$ must be satisfied, we can establish a lower bound on the number of samples $N$: $\sqrt{N} > \frac{2\sqrt{M} + cw(\Theta)}{\sqrt{L}/2} = O(w(\Theta))$. Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if $\Lambda_{\min}(A)$ and $\Lambda_{\min}(\Sigma)$ are larger and $\Lambda_{\max}(A)$ and $\Lambda_{\max}(\Sigma)$ are smaller. In turn, this means that the matrices $A$ and $\mathcal{A}$ in (10) and (12) must be well conditioned and the VAR process must be stable, with eigenvalues well inside the unit circle (see Section 2.2). Alternatively, we can also read the bound on $N$ as saying that large values of $M$ and small values of $L$ indicate stronger dependency in the data, thus requiring more samples for the RE condition to hold with high probability.

Analyzing Theorems 3.3 and 3.4, we can interpret the established results as follows. As the size and dimensionality $N$, $p$ and $d$ of the problem increase, we emphasize the scale of the results and use order notation to denote the constants. Select a number of samples at least $N \ge O(w^2(\Theta))$ and let the regularization parameter satisfy $\lambda_N \ge O\Big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\Big)$. Then, with high probability, the restricted eigenvalue condition $\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \kappa\sqrt{N}$ holds for $\Delta \in \mathrm{cone}(\Omega_E)$, so that $\kappa = O(1)$ is a positive constant. Moreover, the norm of the estimation error in the optimization problem (4) is bounded by $\|\Delta\|_2 \le O\Big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\Big)\Psi(\mathrm{cone}(\Omega_{E_j}))$. Note that the norm compatibility constant $\Psi(\mathrm{cone}(\Omega_{E_j}))$ is assumed to be the same for all $j = 1, \ldots, p$, which follows from our assumption in (5).

Consider now Theorem 3.3 and the bound on the regularization parameter $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\big)$. As the dimensionality $p$ and $d$ of the problem grows and the number of samples $N$ increases, the first term $\frac{w(\Omega_R)}{\sqrt{N}}$ dominates the second term $\frac{w^2(\Omega_R)}{N^2}$. This can be seen by computing the $N$ at which the two terms become equal, $\frac{w(\Omega_R)}{\sqrt{N}} = \frac{w^2(\Omega_R)}{N^2}$, which happens at $N = w^{2/3}(\Omega_R) < w(\Omega_R)$. Therefore, we can restate our results as follows: once the restricted eigenvalue condition holds and $\lambda_N \ge O\big(\frac{w(\Omega_R)}{\sqrt{N}}\big)$, the error norm is upper-bounded by $\|\Delta\|_2 \le O\big(\frac{w(\Omega_R)}{\sqrt{N}}\big)\Psi(\mathrm{cone}(\Omega_{E_j}))$.
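A quick numerical check of the crossover point mentioned above: the two terms agree at $N = w^{2/3}(\Omega_R)$ and the first one dominates for larger $N$ (the value of $w$ below is arbitrary).

```python
# Check that w/sqrt(N) and w^2/N^2 are equal at N = w^(2/3) and that the first dominates beyond it.
import numpy as np

w = 50.0                                          # an arbitrary Gaussian width value
N_star = w ** (2.0 / 3.0)
for N in [N_star, 10 * N_star]:
    t1, t2 = w / np.sqrt(N), w ** 2 / N ** 2
    print(f"N = {N:8.2f}: w/sqrt(N) = {t1:.3f}, w^2/N^2 = {t2:.3f}")
```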

3.3. Special Cases

While the presented results are valid for any norm $R(\cdot)$ separable along the rows of the $A_k$, it is instructive to specialize our analysis to a few popular regularization choices, such as the $L_1$, Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. LASSO

To establish results for the $L_1$ norm, we assume that the parameter $\beta^*$ is $s$-sparse, which in our case means that the largest number of non-zero elements in any $\beta_i$, $i = 1, \ldots, p$ (i.e., in the combined $i$-th rows of the matrices $A_k$, $k = 1, \ldots, d$) is $s$.


Page 14: Estimating structured vector autoregressive models


characterizing sample complexity and error bounds.

2.1. Regularized Estimator

To estimate the parameters of the VAR model, we transform the model in (2) into the form suitable for the regularized estimator (1). Let $(x_0, x_1, \ldots, x_T)$ denote the $T+1$ samples generated by the stable VAR model in (2); stacking them together we obtain
$$\begin{bmatrix} x_d^T \\ x_{d+1}^T \\ \vdots \\ x_T^T \end{bmatrix} = \begin{bmatrix} x_{d-1}^T & x_{d-2}^T & \cdots & x_0^T \\ x_d^T & x_{d-1}^T & \cdots & x_1^T \\ \vdots & \vdots & \ddots & \vdots \\ x_{T-1}^T & x_{T-2}^T & \cdots & x_{T-d}^T \end{bmatrix} \begin{bmatrix} A_1^T \\ A_2^T \\ \vdots \\ A_d^T \end{bmatrix} + \begin{bmatrix} \epsilon_d^T \\ \epsilon_{d+1}^T \\ \vdots \\ \epsilon_T^T \end{bmatrix},$$
which can also be compactly written as
$$Y = XB + E, \qquad (3)$$
where $Y \in \mathbb{R}^{N\times p}$, $X \in \mathbb{R}^{N\times dp}$, $B \in \mathbb{R}^{dp\times p}$, and $E \in \mathbb{R}^{N\times p}$ for $N = T - d + 1$. Vectorizing (column-wise) each matrix in (3), we get
$$\mathrm{vec}(Y) = (I_{p\times p}\otimes X)\,\mathrm{vec}(B) + \mathrm{vec}(E), \qquad \text{i.e.,} \qquad y = Z\beta + \epsilon,$$
where $y \in \mathbb{R}^{Np}$, $Z = (I_{p\times p}\otimes X) \in \mathbb{R}^{Np\times dp^2}$, $\beta \in \mathbb{R}^{dp^2}$, $\epsilon \in \mathbb{R}^{Np}$, and $\otimes$ is the Kronecker product. The covariance matrix of the noise $\epsilon$ is now $\mathbb{E}[\epsilon\epsilon^T] = \Sigma\otimes I_{N\times N}$. Consequently, the regularized estimator takes the form
$$\hat{\beta} = \underset{\beta\in\mathbb{R}^{dp^2}}{\operatorname{argmin}}\ \frac{1}{N}\|y - Z\beta\|_2^2 + \lambda_N R(\beta), \qquad (4)$$

where $R(\beta)$ can be any vector norm, separable along the rows of the matrices $A_k$. Specifically, if we denote $\beta = [\beta_1^T \ldots \beta_p^T]^T$ and $A_k(i,:)$ as the $i$-th row of the matrix $A_k$ for $k = 1, \ldots, d$, then our assumption is equivalent to
$$R(\beta) = \sum_{i=1}^p R(\beta_i) = \sum_{i=1}^p R\Big(\big[A_1(i,:)\ \ldots\ A_d(i,:)\big]^T\Big). \qquad (5)$$

To reduce clutter and without loss of generality, we assume the norm $R(\cdot)$ to be the same for each row $i$. Since the analysis decouples across rows, it is straightforward to extend our analysis to the case when a different norm is used for each row of the $A_k$, e.g., $L_1$ for row one, $L_2$ for row two, the K-support norm (Argyriou et al., 2012) for row three, etc. Observe that within a row, the norm need not be decomposable across columns.
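As a hedged sketch of the row-wise separability assumption in (5), the function below stacks the $i$-th rows of $A_1, \ldots, A_d$ and sums a per-row norm over $i$; using the $L_2$ norm within each row is just an example choice, not a requirement of the framework.

```python
# Row-wise separable norm R(beta) = sum_i R([A_1(i,:), ..., A_d(i,:)]) as in (5).
import numpy as np

def rowwise_norm(A_list, row_norm=np.linalg.norm):
    """Sum a within-row norm over the stacked i-th rows of the p x p matrices A_1..A_d."""
    p = A_list[0].shape[0]
    return sum(row_norm(np.concatenate([A[i, :] for A in A_list])) for i in range(p))

rng = np.random.default_rng(5)
p, d = 6, 3
A_list = [rng.normal(size=(p, p)) for _ in range(d)]
print(f"R(beta) = {rowwise_norm(A_list):.3f}")
```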

The main difference between the estimation problem in (1) and the formulation in (4) is the strong dependence between the samples $(x_0, x_1, \ldots, x_T)$, violating the i.i.d. assumption on the data $\{(y_i, z_i), i = 1, \ldots, Np\}$. In particular, this leads to correlations between the rows and columns of the matrix $X$ (and consequently of $Z$). To deal with such dependencies, following (Basu & Michailidis, 2015), we utilize the spectral representation of the autocovariance of VAR models to control the dependencies in the matrix $X$.
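Putting the pieces of this subsection together, the following sketch (an illustration under arbitrary stable coefficients, not the authors' code) builds $Y$ and $X$ as in (3) from a simulated sample path, forms $Z = I_{p\times p}\otimes X$, and checks the dimensions $N = T-d+1$, $X \in \mathbb{R}^{N\times dp}$, $Z \in \mathbb{R}^{Np\times dp^2}$.

```python
# Construct Y, X and Z = I_p (kron) X from a simulated VAR(d) path, as in Section 2.1.
import numpy as np

rng = np.random.default_rng(6)
p, d, T = 4, 2, 60
A = [rng.normal(scale=0.2, size=(p, p)) for _ in range(d)]   # illustrative A_1, ..., A_d

x = np.zeros((T + 1, p))                                     # samples x_0, ..., x_T
for t in range(d, T + 1):
    x[t] = sum(A[k] @ x[t - 1 - k] for k in range(d)) + rng.normal(size=p)

N = T - d + 1
Y = x[d:T + 1]                                               # rows x_d^T, ..., x_T^T
X = np.hstack([x[d - 1 - k:T - k] for k in range(d)])        # rows [x_{t-1}^T ... x_{t-d}^T]
Z = np.kron(np.eye(p), X)                                    # Z = I_{pxp} (kron) X

assert Y.shape == (N, p) and X.shape == (N, d * p) and Z.shape == (N * p, d * p * p)
print("Y:", Y.shape, "X:", X.shape, "Z:", Z.shape)
```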

2.2. Stability of VAR Model

Since VAR models are (linear) dynamical systems, for the analysis we need to establish conditions under which the VAR model (2) is stable, i.e., under which the time-series process does not diverge over time. For understanding stability, it is convenient to rewrite the VAR model of order $d$ in (2) as an equivalent VAR model of order 1:
$$\begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-(d-1)} \end{bmatrix} = \underbrace{\begin{bmatrix} A_1 & A_2 & \cdots & A_{d-1} & A_d \\ I & 0 & \cdots & 0 & 0 \\ 0 & I & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I & 0 \end{bmatrix}}_{A} \begin{bmatrix} x_{t-1} \\ x_{t-2} \\ \vdots \\ x_{t-d} \end{bmatrix} + \begin{bmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad (6)$$

where $A \in \mathbb{R}^{dp\times dp}$. Therefore, the VAR process is stable if all eigenvalues of $A$ satisfy $\det(\lambda I_{dp\times dp} - A) = 0$ only for $\lambda \in \mathbb{C}$ with $|\lambda| < 1$. Equivalently, expressed in terms of the original parameters $A_k$, stability is satisfied if $\det\big(I - \sum_{k=1}^d A_k\frac{1}{\lambda^k}\big) = 0$ only for $|\lambda| < 1$ (see Section 2.1.1 of (Lutkepohl, 2007) and Section 1.3 of the supplement for more details).
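A minimal sketch of this stability check: build the companion matrix of (6) and verify that its spectral radius is strictly less than one (the coefficient matrices below are arbitrary examples, not from the paper).

```python
# Stability check via the companion (order-1) form of a VAR(d) model, cf. (6).
import numpy as np

def companion(A_list):
    """Companion matrix with top block row [A_1 ... A_d] and shifted identities below."""
    d, p = len(A_list), A_list[0].shape[0]
    top = np.hstack(A_list)
    if d == 1:
        return top
    bottom = np.hstack([np.eye(p * (d - 1)), np.zeros((p * (d - 1), p))])
    return np.vstack([top, bottom])

def is_stable(A_list):
    return np.max(np.abs(np.linalg.eigvals(companion(A_list)))) < 1.0

rng = np.random.default_rng(7)
p, d = 5, 3
A_list = [rng.normal(scale=0.15, size=(p, p)) for _ in range(d)]
radius = np.max(np.abs(np.linalg.eigvals(companion(A_list))))
print("stable:", is_stable(A_list), "spectral radius:", round(float(radius), 3))
```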

2.3. Properties of Design Matrix X

In what follows, we analyze the covariance structure of the matrix $X$ in (3) using spectral properties of the VAR model. The results will then be used in establishing the high-probability bounds for the estimation guarantees in problem (4).

Define any row of $X$ as $X_{i,:} \in \mathbb{R}^{dp}$, $1 \le i \le N$. Since we assumed that $\epsilon_t \sim \mathcal{N}(0, \Sigma)$, it follows that each row is distributed as $X_{i,:} \sim \mathcal{N}(0, C_X)$, where the covariance matrix $C_X \in \mathbb{R}^{dp\times dp}$ is the same for all $i$:
$$C_X = \begin{bmatrix} \Gamma(0) & \Gamma(1) & \cdots & \Gamma(d-1) \\ \Gamma(1)^T & \Gamma(0) & \cdots & \Gamma(d-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma(d-1)^T & \Gamma(d-2)^T & \cdots & \Gamma(0) \end{bmatrix}, \qquad (7)$$
where $\Gamma(h) = \mathbb{E}(x_t x_{t+h}^T) \in \mathbb{R}^{p\times p}$. It turns out that since $C_X$ is a block-Toeplitz matrix, its eigenvalues can be bounded as (see (Gutierrez-Gutierrez & Crespo, 2011))
$$\inf_{\substack{1\le j\le p\\ \omega\in[0,2\pi]}}\Lambda_j[\rho(\omega)] \;\le\; \Lambda_k[C_X] \;\le\; \sup_{\substack{1\le j\le p\\ \omega\in[0,2\pi]}}\Lambda_j[\rho(\omega)], \qquad 1 \le k \le dp, \qquad (8)$$
where $\Lambda_k[\cdot]$ denotes the $k$-th eigenvalue of a matrix and, for $i = \sqrt{-1}$, $\rho(\omega) = \sum_{h=-\infty}^{\infty}\Gamma(h)e^{-hi\omega}$, $\omega \in [0, 2\pi]$, is the spectral density, i.e., the Fourier transform of the autocovariance matrix $\Gamma(h)$.
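The following hedged sketch connects (7) and (8) numerically: it estimates the lag-$h$ autocovariances $\Gamma(h)$ from a simulated path, assembles the block-Toeplitz matrix $C_X$, and compares its eigenvalue range with that of the spectral density $\rho(\omega)$ evaluated on a grid (a finite-lag, finite-grid approximation, not the exact bound).

```python
# Empirical Gamma(h), block-Toeplitz C_X as in (7), and the spectral density rho(omega) of (8).
import numpy as np

rng = np.random.default_rng(8)
p, d, T, H = 3, 2, 5000, 50                                 # H truncates the lag sum in rho
A1 = 0.5 * np.eye(p); A1[0, 1] = 0.2                        # an arbitrary stable VAR(1)
x = np.zeros((T + 1, p))
for t in range(1, T + 1):
    x[t] = A1 @ x[t - 1] + rng.normal(size=p)

def gamma(h):                                               # empirical Gamma(h) = E[x_t x_{t+h}^T]
    return (x[:T + 1 - h].T @ x[h:]) / (T + 1 - h)

G = [gamma(h) for h in range(H)]
C_X = np.block([[G[abs(i - j)].T if i > j else G[j - i] for j in range(d)]
                for i in range(d)])                         # block-Toeplitz as in (7)
cx_eigs = np.linalg.eigvalsh((C_X + C_X.T) / 2)             # symmetrize numerically

rho_eigs = []
for w in np.linspace(0, 2 * np.pi, 200):
    rho = G[0].astype(complex)
    for h in range(1, H):
        rho += G[h] * np.exp(-1j * h * w) + G[h].T * np.exp(1j * h * w)
    rho_eigs.extend(np.linalg.eigvalsh((rho + rho.conj().T) / 2))

print(f"eigenvalues of C_X    in [{cx_eigs.min():.2f}, {cx_eigs.max():.2f}]")
print(f"eigenvalues of rho(w) in [{min(rho_eigs):.2f}, {max(rho_eigs):.2f}]")
```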



Recall the VAR model of order $d$ from (2):
$$x_t = A_1 x_{t-1} + A_2 x_{t-2} + \cdots + A_d x_{t-d} + \epsilon_t, \qquad (2)$$
where $x_t \in \mathbb{R}^p$ denotes a multivariate time series, $A_k \in \mathbb{R}^{p\times p}$, $k = 1, \ldots, d$, are the parameters of the model, and $d \ge 1$ is the order of the model. In this work, we assume that the noise $\epsilon_t \in \mathbb{R}^p$ follows a Gaussian distribution, $\epsilon_t \sim \mathcal{N}(0, \Sigma)$, with $\mathbb{E}(\epsilon_t\epsilon_t^T) = \Sigma$ and $\mathbb{E}(\epsilon_t\epsilon_{t+\tau}^T) = 0$ for $\tau \ne 0$. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix $\Sigma$ is assumed to be positive definite with bounded largest eigenvalue, i.e., $\Lambda_{\min}(\Sigma) > 0$ and $\Lambda_{\max}(\Sigma) < \infty$.
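As a hedged simulation sketch of model (2), the code below draws correlated Gaussian noise $\epsilon_t \sim \mathcal{N}(0,\Sigma)$ and generates a VAR($d$) path; the particular $A_k$ and $\Sigma$ are arbitrary choices satisfying the stated assumptions (stability, positive definite $\Sigma$).

```python
# Simulate a VAR(d) process x_t = A_1 x_{t-1} + ... + A_d x_{t-d} + eps_t with eps_t ~ N(0, Sigma).
import numpy as np

rng = np.random.default_rng(9)
p, d, T = 3, 2, 500
A = [np.diag(rng.uniform(0.2, 0.4, size=p)) for _ in range(d)]   # small diagonal blocks, stable
Sigma = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
assert np.all(np.linalg.eigvalsh(Sigma) > 0)                      # Lambda_min(Sigma) > 0

x = np.zeros((T + 1, p))
eps = rng.multivariate_normal(np.zeros(p), Sigma, size=T + 1)     # serially uncorrelated noise
for t in range(d, T + 1):
    x[t] = sum(A[k] @ x[t - 1 - k] for k in range(d)) + eps[t]

print("sample mean ~ 0:", np.round(x[d:].mean(axis=0), 2))
```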


The second condition in [4] establishes the upper bound on the estimation error.

Lemma 4.2 Assume that the restricted eigenvalue (RE) condition holds:
$$\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \kappa\sqrt{N}, \qquad (16)$$
for $\Delta \in \mathrm{cone}(\Omega_E)$ and some constant $\kappa > 0$, where $\mathrm{cone}(\Omega_E)$ is the cone of the error set; then
$$\|\Delta\|_2 \le \frac{1+r}{r}\,\frac{\lambda_N}{\kappa^2}\,\Psi(\mathrm{cone}(\Omega_E)), \qquad (17)$$
where $\Psi(\mathrm{cone}(\Omega_E))$ is a norm compatibility constant, defined as $\Psi(\mathrm{cone}(\Omega_E)) = \sup_{U\in\mathrm{cone}(\Omega_E)}\frac{R(U)}{\|U\|_2}$.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are stated in terms of quantities involving $Z$ and $\epsilon$, which are random. Therefore, in the following we establish high-probability bounds on the regularization parameter in (14) and on the RE condition in (16).

4.1 High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties, illustrating some special cases based on the popular Lasso and Group Lasso regularization norms. In Section 4.4 we present the main ideas of our proof technique, with all the details delegated to Appendices C and D. To establish the lower bound on the regularization parameter $\lambda_N$, we derive an upper bound $R^*\big[\frac{1}{N}Z^T\epsilon\big] \le \alpha$, for some $\alpha > 0$, which will establish the required relationship $\lambda_N \ge \alpha \ge R^*\big[\frac{1}{N}Z^T\epsilon\big]$.

Theorem 4.3 Let $\Omega_R = \{u \in \mathbb{R}^{dp}\,|\,R(u) \le 1\}$, and define $w(\Omega_R) = \mathbb{E}[\sup_{u\in\Omega_R}\langle g, u\rangle]$ to be the Gaussian width of the set $\Omega_R$ for $g \sim \mathcal{N}(0, I)$. For any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, with probability at least $1 - c\exp(-\min(\epsilon_2^2, \epsilon_1) + \log(p))$ we can establish that
$$R^*\Big[\frac{1}{N}Z^T\epsilon\Big] \le c_2(1+\epsilon_2)\frac{w(\Omega_R)}{\sqrt{N}} + c_1(1+\epsilon_1)\frac{w^2(\Omega_R)}{N^2},$$
where $c$, $c_1$ and $c_2$ are positive constants.

To establish the restricted eigenvalue condition, we will show that $\inf_{\Delta\in\mathrm{cone}(\Omega_E)}\frac{\|(I_{p\times p}\otimes X)\Delta\|_2}{\|\Delta\|_2} \ge \nu$ for some $\nu > 0$, and then set $\kappa\sqrt{N} = \nu$.

Theorem 4.4 Let $\Theta = \mathrm{cone}(\Omega_{E_j}) \cap S^{dp-1}$, where $S^{dp-1}$ is the unit sphere. The error set $\Omega_{E_j}$ is defined as $\Omega_{E_j} = \big\{\Delta_j \in \mathbb{R}^{dp}\,\big|\,R(\beta^*_j + \Delta_j) \le R(\beta^*_j) + \tfrac{1}{r}R(\Delta_j)\big\}$, for $r > 1$, $j = 1, \ldots, p$, and $\Delta = [\Delta_1^T, \ldots, \Delta_p^T]^T$, where $\Delta_j$ is of size $dp\times 1$, and $\beta^* = [\beta_1^{*T}\ldots\beta_p^{*T}]^T$, for $\beta^*_j \in \mathbb{R}^{dp}$. The set $\Omega_{E_j}$ is a part of the decomposition $\Omega_E = \Omega_{E_1}\times\cdots\times\Omega_{E_p}$ due to the assumption on the row-wise separability of the norm $R(\cdot)$ in (5).

714

Page 15: Estimating structured vector autoregressive models

• 3.4

Estimating Structured VAR

3.1. High Probability Bounds

In this Section we present the main results of our work,followed by the discussion on their properties and illustrat-ing some special cases based on popular Lasso and GroupLasso regularization norms. In Section 3.4 we will presentthe main ideas of our proof technique, with all the detailsdelegated to the supplement.

To establish lower bound on the regularization parameter�

N

, we derive an upper bound on R

⇤[

1

N

Z

T ✏] ↵, forsome ↵ > 0, which will establish the required relationship�

N

� ↵ � R

⇤[

1

N

Z

T ✏].

Theorem 3.3 Let ⌦

R

= {u 2 Rdp|R(u) 1}, and define

w(⌦

R

) = E[ sup

u2⌦

R

hg, ui] to be a Gaussian width of set

R

for g ⇠ N (0, I). For any ✏

1

> 0 and ✏

2

> 0 with

probability at least 1 � c exp(� min(✏

2

2

, ✏

1

) + log(p)) we

can establish that

R

1

N

Z

T ✏

c

2

(1+✏

2

)

w(⌦

R

)pN

+ c

1

(1+✏

1

)

w

2

(⌦

R

)

N

2

where c, c

1

and c

2

are positive constants.

To establish restricted eigenvalue condition, we will showthat inf

�2cone(⌦E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫, for some ⌫ > 0 and

then setp

N = ⌫.

Theorem 3.4 Let ⇥ = cone(⌦

E

j

) \ S

dp�1

, where S

dp�1

is a unit sphere. The error set ⌦

E

j

is defined as ⌦

E

j

=

n

j

2 Rdp

R(�

⇤j

+�

j

) R(�

⇤j

) +

1

r

R(�

j

)

o

, for r >

1, j = 1, . . . , p, and� = [�

T

1

, . . . ,�

T

p

]

T

, for�

j

is of size

dp ⇥ 1, and �⇤= [�

⇤T1

. . . �

⇤Tp

]

T

, for �

⇤j

2 Rdp

. The set

E

j

is a part of the decomposition in ⌦

E

= ⌦

E

1

⇥ · · · ⇥⌦

E

p

due to the assumption on the row-wise separability of

norm R(·) in (5). Also define w(⇥) = E[sup

u2⇥

hg, ui] to

be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2Rdp

. Then with probability at least 1 � c

1

exp(�c

2

2

+

log(p)), for any ⌘ > 0, inf

�2cone(⌦

E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫,

where ⌫ =

pNL � 2

pM � cw(⇥) � ⌘ and c, c

1

, c

2

are

positive constants, and L, M are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose ⌘ =

1

2

pNL and setp

N =

pNL� 2

pM� cw(⇥) � ⌘ and since

pN > 0

must be satisfied, we can establish a lower bound on thenumber of samples N :

pN >

2

pM+cw(⇥)p

L/2

= O(w(⇥)).Examining this bound and using (9) and (13), we can con-clude that the number of samples needed to satisfy therestricted eigenvalue condition is smaller if ⇤

min

(A) and

min

(⌃) are larger and⇤max

(A) and⇤max

(⌃) are smaller.In turn, this means that matrices A and A in (10) and (12)must be well conditioned and the VAR process is stable,with eigenvalues well inside the unit circle (see Section2.2). Alternatively, we can also understand the bound onN as showing that large values of M and small values ofL indicate stronger dependency in the data, thus requiringmore samples for the RE conditions to hold with high prob-ability.

Analyzing Theorems 3.3 and 3.4 we can interpret the es-tablished results as follows. As the size and dimension-ality N , p and d of the problem increase, we emphasizethe scale of the results and use the order notations to de-note the constants. Select a number of samples at leastN � O(w

2

(⇥)) and let the regularization parameter sat-isfy �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. With high probability

then the restricted eigenvalue condition ||Z�||2

||�||2

� pN

for � 2 cone(⌦E

) holds, so that = O(1) is a pos-itive constant. Moreover, the norm of the estimation er-ror in optimization problem (4) is bounded by k�k

2

O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

(cone(⌦E

j

)). Note that the normcompatibility constant (cone(⌦

E

j

)) is assumed to be thesame for all j = 1, . . . , p, which follows from our assump-tion in (5).

Consider now Theorem 3.3 and the bound on the regularization parameter $\lambda_N \ge O\!\left(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N^2}\right)$. As the dimensionality $p$ and $d$ of the problem grows and the number of samples $N$ increases, the first term $\frac{w(\Omega_R)}{\sqrt{N}}$ dominates the second term $\frac{w^2(\Omega_R)}{N^2}$. This can be seen by computing the $N$ at which the two terms become equal: $\frac{w(\Omega_R)}{\sqrt{N}} = \frac{w^2(\Omega_R)}{N^2}$ gives $N^{3/2} = w(\Omega_R)$, i.e., $N = w^{2/3}(\Omega_R) < w(\Omega_R)$. Therefore, we can restate our results as follows: once the restricted eigenvalue condition holds and $\lambda_N \ge O\!\left(\frac{w(\Omega_R)}{\sqrt{N}}\right)$, the error norm is upper-bounded by $\|\Delta\|_2 \le O\!\left(\frac{w(\Omega_R)}{\sqrt{N}}\right)\Psi(\mathrm{cone}(\Omega_{E_j}))$.

3.3. Special Cases

While the presented results are valid for any norm $R(\cdot)$ that is separable along the rows of the $A_k$, it is instructive to specialize our analysis to a few popular regularization choices, such as the $L_1$, Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. LASSO

To establish results for the $L_1$ norm, we assume that the parameter $\beta^*$ is $s$-sparse, which in our case means that $s$ is the largest number of non-zero elements in any $\beta_i$, $i = 1, \dots, p$, i.e., in the combined $i$-th rows of the matrices $A_k$.
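A minimal end-to-end sketch of this special case, assuming a synthetic sparse VAR(1) and using scikit-learn's Lasso row by row with the $\lambda_N \propto \sqrt{\log(dp)/N}$ scaling suggested above (constants are illustrative, not those of the theorems, and sklearn's internal scaling of the penalty differs by constant factors):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of the Lasso special case: simulate a stable s-sparse VAR(1), build
# the lagged design, and estimate each row of A_1 with an L1-regularized
# regression using lambda ~ sqrt(log(d*p)/N) (here d = 1).
rng = np.random.default_rng(4)
p, s, N = 20, 3, 1000
A = np.zeros((p, p))
for i in range(p):                                  # s non-zeros per row
    cols = rng.choice(p, size=s, replace=False)
    A[i, cols] = rng.uniform(-1, 1, size=s)
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))     # make the VAR stable

x = np.zeros(p)
data = []
for _ in range(N + 1):
    x = A @ x + rng.standard_normal(p)
    data.append(x.copy())
data = np.array(data)
X, Y = data[:-1], data[1:]                          # lagged design, responses

lam = np.sqrt(np.log(p) / N)                        # ~ w(Omega_R)/sqrt(N)
A_hat = np.vstack([Lasso(alpha=lam, fit_intercept=False).fit(X, Y[:, i]).coef_
                   for i in range(p)])
print(f"estimation error ||A_hat - A||_F = {np.linalg.norm(A_hat - A):.3f}")
```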


The second condition in [4] establishes the upper bound on the estimation error.

Lemma 4.2 Assume that the restricted eigenvalue (RE) condition holds,

$$\frac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \kappa\sqrt{N}, \qquad (16)$$

for $\Delta \in \mathrm{cone}(\Omega_E)$ and some constant $\kappa > 0$, where $\mathrm{cone}(\Omega_E)$ is the cone of the error set; then

$$\|\Delta\|_2 \le \frac{1+r}{r}\,\frac{\lambda_N}{\kappa^2}\,\Psi(\mathrm{cone}(\Omega_E)), \qquad (17)$$

where $\Psi(\mathrm{cone}(\Omega_E))$ is the norm compatibility constant, defined as $\Psi(\mathrm{cone}(\Omega_E)) = \sup_{U \in \mathrm{cone}(\Omega_E)} \frac{R(U)}{\|U\|_2}$.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, these conditions are stated in terms of quantities involving $Z$ and $\epsilon$, which are random. Therefore, in the following we establish high-probability bounds on the regularization parameter in (14) and on the RE condition in (16).
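As reconstructed above, (17) is a simple plug-in once $\kappa$, $\lambda_N$, $r$ and $\Psi$ are available; for the $L_1$ norm over (approximately) $s$-sparse error directions, $\Psi$ is of order $\sqrt{s}$. A hedged sketch of just the arithmetic (all numbers illustrative):

```python
import numpy as np

# Plug-in evaluation of the deterministic error bound (17):
#   ||Delta||_2 <= (1 + r)/r * lambda_N / kappa^2 * Psi(cone(Omega_E)).
# For R = ||.||_1 and s-sparse targets, Psi is taken to be of order sqrt(s)
# (constant factors depending on r are omitted).
def error_bound(lambda_N: float, kappa: float, r: float, psi: float) -> float:
    return (1.0 + r) / r * lambda_N / kappa**2 * psi

N, dim, s = 1000, 60, 5
lambda_N = np.sqrt(2 * np.log(dim)) / np.sqrt(N)     # ~ w(Omega_R)/sqrt(N)
print(f"bound ~ {error_bound(lambda_N, kappa=0.5, r=2.0, psi=np.sqrt(s)):.3f}")
```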


Page 16: Estimating structured vector autoregressive models

• p

• •

Estimating Structured VAR

3.1. High Probability Bounds

In this Section we present the main results of our work,followed by the discussion on their properties and illustrat-ing some special cases based on popular Lasso and GroupLasso regularization norms. In Section 3.4 we will presentthe main ideas of our proof technique, with all the detailsdelegated to the supplement.

To establish lower bound on the regularization parameter�

N

, we derive an upper bound on R

⇤[

1

N

Z

T ✏] ↵, forsome ↵ > 0, which will establish the required relationship�

N

� ↵ � R

⇤[

1

N

Z

T ✏].

Theorem 3.3 Let ⌦

R

= {u 2 Rdp|R(u) 1}, and define

w(⌦

R

) = E[ sup

u2⌦

R

hg, ui] to be a Gaussian width of set

R

for g ⇠ N (0, I). For any ✏

1

> 0 and ✏

2

> 0 with

probability at least 1 � c exp(� min(✏

2

2

, ✏

1

) + log(p)) we

can establish that

R

1

N

Z

T ✏

c

2

(1+✏

2

)

w(⌦

R

)pN

+ c

1

(1+✏

1

)

w

2

(⌦

R

)

N

2

where c, c

1

and c

2

are positive constants.

To establish restricted eigenvalue condition, we will showthat inf

�2cone(⌦E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫, for some ⌫ > 0 and

then setp

N = ⌫.

Theorem 3.4 Let ⇥ = cone(⌦

E

j

) \ S

dp�1

, where S

dp�1

is a unit sphere. The error set ⌦

E

j

is defined as ⌦

E

j

=

n

j

2 Rdp

R(�

⇤j

+�

j

) R(�

⇤j

) +

1

r

R(�

j

)

o

, for r >

1, j = 1, . . . , p, and� = [�

T

1

, . . . ,�

T

p

]

T

, for�

j

is of size

dp ⇥ 1, and �⇤= [�

⇤T1

. . . �

⇤Tp

]

T

, for �

⇤j

2 Rdp

. The set

E

j

is a part of the decomposition in ⌦

E

= ⌦

E

1

⇥ · · · ⇥⌦

E

p

due to the assumption on the row-wise separability of

norm R(·) in (5). Also define w(⇥) = E[sup

u2⇥

hg, ui] to

be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2Rdp

. Then with probability at least 1 � c

1

exp(�c

2

2

+

log(p)), for any ⌘ > 0, inf

�2cone(⌦

E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫,

where ⌫ =

pNL � 2

pM � cw(⇥) � ⌘ and c, c

1

, c

2

are

positive constants, and L, M are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose ⌘ =

1

2

pNL and setp

N =

pNL� 2

pM� cw(⇥) � ⌘ and since

pN > 0

must be satisfied, we can establish a lower bound on thenumber of samples N :

pN >

2

pM+cw(⇥)p

L/2

= O(w(⇥)).Examining this bound and using (9) and (13), we can con-clude that the number of samples needed to satisfy therestricted eigenvalue condition is smaller if ⇤

min

(A) and

min

(⌃) are larger and⇤max

(A) and⇤max

(⌃) are smaller.In turn, this means that matrices A and A in (10) and (12)must be well conditioned and the VAR process is stable,with eigenvalues well inside the unit circle (see Section2.2). Alternatively, we can also understand the bound onN as showing that large values of M and small values ofL indicate stronger dependency in the data, thus requiringmore samples for the RE conditions to hold with high prob-ability.

Analyzing Theorems 3.3 and 3.4 we can interpret the es-tablished results as follows. As the size and dimension-ality N , p and d of the problem increase, we emphasizethe scale of the results and use the order notations to de-note the constants. Select a number of samples at leastN � O(w

2

(⇥)) and let the regularization parameter sat-isfy �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. With high probability

then the restricted eigenvalue condition ||Z�||2

||�||2

� pN

for � 2 cone(⌦E

) holds, so that = O(1) is a pos-itive constant. Moreover, the norm of the estimation er-ror in optimization problem (4) is bounded by k�k

2

O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

(cone(⌦E

j

)). Note that the normcompatibility constant (cone(⌦

E

j

)) is assumed to be thesame for all j = 1, . . . , p, which follows from our assump-tion in (5).

Consider now Theorem 3.3 and the bound on the regu-larization parameter �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. Asthe dimensionality of the problem p and d grows andthe number of samples N increases, the first term w(⌦

R

)pN

will dominate the second one w

2

(⌦

R

)

N

2

. This can be seenby computing N for which the two terms become equalw(⌦

R

)pN

=

w

2

(⌦

R

)

N

2

, which happens at N = w

2

3

(⌦

R

) <

w(⌦

R

). Therefore, we can rewrite our results as fol-lows: once the restricted eigenvalue condition holds and�

N

� O⇣

w(⌦

R

)pN

, the error norm is upper-bounded by

k�k2

O⇣

w(⌦

R

)pN

(cone(⌦E

j

)).

3.3. Special Cases

While the presented results are valid for any norm R(·),separable along the rows of A

k

, it is instructive to special-ize our analysis to a few popular regularization choices,such as L

1

and Group Lasso, Sparse Group Lasso andOWL norms.

3.3.1. LASSO

To establish results for L

1

norm, we assume that the pa-rameter �⇤ is s-sparse, which in our case is meant to rep-resent the largest number of non-zero elements in any �

i

,i = 1, . . . , p, i.e., the combined i-th rows of each A

k

,

Estimating Structured VAR

3.1. High Probability Bounds

In this Section we present the main results of our work,followed by the discussion on their properties and illustrat-ing some special cases based on popular Lasso and GroupLasso regularization norms. In Section 3.4 we will presentthe main ideas of our proof technique, with all the detailsdelegated to the supplement.

To establish lower bound on the regularization parameter�

N

, we derive an upper bound on R

⇤[

1

N

Z

T ✏] ↵, forsome ↵ > 0, which will establish the required relationship�

N

� ↵ � R

⇤[

1

N

Z

T ✏].

Theorem 3.3 Let ⌦

R

= {u 2 Rdp|R(u) 1}, and define

w(⌦

R

) = E[ sup

u2⌦

R

hg, ui] to be a Gaussian width of set

R

for g ⇠ N (0, I). For any ✏

1

> 0 and ✏

2

> 0 with

probability at least 1 � c exp(� min(✏

2

2

, ✏

1

) + log(p)) we

can establish that

R

1

N

Z

T ✏

c

2

(1+✏

2

)

w(⌦

R

)pN

+ c

1

(1+✏

1

)

w

2

(⌦

R

)

N

2

where c, c

1

and c

2

are positive constants.

To establish restricted eigenvalue condition, we will showthat inf

�2cone(⌦E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫, for some ⌫ > 0 and

then setp

N = ⌫.

Theorem 3.4 Let ⇥ = cone(⌦

E

j

) \ S

dp�1

, where S

dp�1

is a unit sphere. The error set ⌦

E

j

is defined as ⌦

E

j

=

n

j

2 Rdp

R(�

⇤j

+�

j

) R(�

⇤j

) +

1

r

R(�

j

)

o

, for r >

1, j = 1, . . . , p, and� = [�

T

1

, . . . ,�

T

p

]

T

, for�

j

is of size

dp ⇥ 1, and �⇤= [�

⇤T1

. . . �

⇤Tp

]

T

, for �

⇤j

2 Rdp

. The set

E

j

is a part of the decomposition in ⌦

E

= ⌦

E

1

⇥ · · · ⇥⌦

E

p

due to the assumption on the row-wise separability of

norm R(·) in (5). Also define w(⇥) = E[sup

u2⇥

hg, ui] to

be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2Rdp

. Then with probability at least 1 � c

1

exp(�c

2

2

+

log(p)), for any ⌘ > 0, inf

�2cone(⌦

E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫,

where ⌫ =

pNL � 2

pM � cw(⇥) � ⌘ and c, c

1

, c

2

are

positive constants, and L, M are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose ⌘ =

1

2

pNL and setp

N =

pNL� 2

pM� cw(⇥) � ⌘ and since

pN > 0

must be satisfied, we can establish a lower bound on thenumber of samples N :

pN >

2

pM+cw(⇥)p

L/2

= O(w(⇥)).Examining this bound and using (9) and (13), we can con-clude that the number of samples needed to satisfy therestricted eigenvalue condition is smaller if ⇤

min

(A) and

min

(⌃) are larger and⇤max

(A) and⇤max

(⌃) are smaller.In turn, this means that matrices A and A in (10) and (12)must be well conditioned and the VAR process is stable,with eigenvalues well inside the unit circle (see Section2.2). Alternatively, we can also understand the bound onN as showing that large values of M and small values ofL indicate stronger dependency in the data, thus requiringmore samples for the RE conditions to hold with high prob-ability.

Analyzing Theorems 3.3 and 3.4 we can interpret the es-tablished results as follows. As the size and dimension-ality N , p and d of the problem increase, we emphasizethe scale of the results and use the order notations to de-note the constants. Select a number of samples at leastN � O(w

2

(⇥)) and let the regularization parameter sat-isfy �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. With high probability

then the restricted eigenvalue condition ||Z�||2

||�||2

� pN

for � 2 cone(⌦E

) holds, so that = O(1) is a pos-itive constant. Moreover, the norm of the estimation er-ror in optimization problem (4) is bounded by k�k

2

O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

(cone(⌦E

j

)). Note that the normcompatibility constant (cone(⌦

E

j

)) is assumed to be thesame for all j = 1, . . . , p, which follows from our assump-tion in (5).

Consider now Theorem 3.3 and the bound on the regu-larization parameter �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. Asthe dimensionality of the problem p and d grows andthe number of samples N increases, the first term w(⌦

R

)pN

will dominate the second one w

2

(⌦

R

)

N

2

. This can be seenby computing N for which the two terms become equalw(⌦

R

)pN

=

w

2

(⌦

R

)

N

2

, which happens at N = w

2

3

(⌦

R

) <

w(⌦

R

). Therefore, we can rewrite our results as fol-lows: once the restricted eigenvalue condition holds and�

N

� O⇣

w(⌦

R

)pN

, the error norm is upper-bounded by

k�k2

O⇣

w(⌦

R

)pN

(cone(⌦E

j

)).

3.3. Special Cases

While the presented results are valid for any norm R(·),separable along the rows of A

k

, it is instructive to special-ize our analysis to a few popular regularization choices,such as L

1

and Group Lasso, Sparse Group Lasso andOWL norms.

3.3.1. LASSO

To establish results for L

1

norm, we assume that the pa-rameter �⇤ is s-sparse, which in our case is meant to rep-resent the largest number of non-zero elements in any �

i

,i = 1, . . . , p, i.e., the combined i-th rows of each A

k

,

define w(⇥) = E[sup

u2⇥hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2 Rdp

. Then with

probability at least 1 � c

1

exp(�c

2

2

+ log(p)), for any ⌘ > 0

inf

�2cone(⌦E

)

||(Ip⇥p

⌦ X)�||2

||�||2

� ⌫,

where ⌫ =

pNL� 2

pM� cw(⇥) � ⌘ and c, c

1

, c

2

are positive constants, and L and M are defined in (9)and (13).

4.2 Discussion

From Theorem 4.4, we can choose ⌘ =

1

2

pNL and set

pN =

pNL � 2

pM � cw(⇥) � ⌘ and sincep

N > 0 must be satisfied, we can establish a lower bound on the number of samples N

pN >

2

pM + cw(⇥)p

L/2

= O(w(⇥)). (18)

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfythe restricted eigenvalue condition is smaller if ⇤

min

(A) and ⇤min

(⌃) are larger and ⇤max

(A) and ⇤max

(⌃)

are smaller. In turn, this means that matrices A and A in (10) and (12) must be well conditioned and theVAR process is stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we canalso understand (18) as showing that large values of M and small values of L indicate stronger dependencyin the data, thus requiring more samples for the RE conditions to hold with high probability.Analyzing Theorems 4.3 and 4.4 we can interpret the established results as follows. As the size and dimen-sionality N , p and d of the problem increase, we emphasize the scale of the results and use the order notationsto denote the constants. Select a number of samples at least N � O(w

2

(⇥)) and let the regularization pa-rameter satisfy �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. With high probability then the restricted eigenvalue condition||Z�||

2

||�||2

� pN for� 2 cone(⌦

E

) holds, so that = O(1) is a positive constant. Moreover, the norm of the

estimation error in optimization problem (4) is bounded by k�k2

O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

(cone(⌦E

j

)).Note that the norm compatibility constant (cone(⌦

E

j

)) is assumed to be the same for all j = 1, . . . , p,which follows from our assumption in (5).

Consider now Theorem 4.3 and the bound on the regularization parameter �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

.As the dimensionality of the problem p and d grows and the number of samples N increases, the first termw(⌦

R

)pN

will dominate the second one w

2

(⌦

R

)

N

2

. This can be seen by computing N for which the two terms

become equal w(⌦

R

)pN

=

w

2

(⌦

R

)

N

2

, which happens at N = w

2

3

(⌦

R

) < w(⌦

R

). Therefore, we can rewrite our

results as follows: once the restricted eigenvalue condition holds and �

N

� O⇣

w(⌦

R

)pN

, the error norm is

upper-bounded by k�k2

O⇣

w(⌦

R

)pN

(cone(⌦E

j

)).

4.3 Special Cases

While the presented results are valid for any norm R(·), separable along the rows of A

k

, it is instructive tospecialize our analysis to a few popular regularization choices, such as L

1

and Group Lasso, Sparse GroupLasso and OWL norms.

8

Estimating Structured VAR

3.1. High Probability Bounds

In this Section we present the main results of our work,followed by the discussion on their properties and illustrat-ing some special cases based on popular Lasso and GroupLasso regularization norms. In Section 3.4 we will presentthe main ideas of our proof technique, with all the detailsdelegated to the supplement.

To establish lower bound on the regularization parameter�

N

, we derive an upper bound on R

⇤[

1

N

Z

T ✏] ↵, forsome ↵ > 0, which will establish the required relationship�

N

� ↵ � R

⇤[

1

N

Z

T ✏].

Theorem 3.3 Let ⌦

R

= {u 2 Rdp|R(u) 1}, and define

w(⌦

R

) = E[ sup

u2⌦

R

hg, ui] to be a Gaussian width of set

R

for g ⇠ N (0, I). For any ✏

1

> 0 and ✏

2

> 0 with

probability at least 1 � c exp(� min(✏

2

2

, ✏

1

) + log(p)) we

can establish that

R

1

N

Z

T ✏

c

2

(1+✏

2

)

w(⌦

R

)pN

+ c

1

(1+✏

1

)

w

2

(⌦

R

)

N

2

where c, c

1

and c

2

are positive constants.

To establish restricted eigenvalue condition, we will showthat inf

�2cone(⌦E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫, for some ⌫ > 0 and

then setp

N = ⌫.

Theorem 3.4 Let ⇥ = cone(⌦

E

j

) \ S

dp�1

, where S

dp�1

is a unit sphere. The error set ⌦

E

j

is defined as ⌦

E

j

=

n

j

2 Rdp

R(�

⇤j

+�

j

) R(�

⇤j

) +

1

r

R(�

j

)

o

, for r >

1, j = 1, . . . , p, and� = [�

T

1

, . . . ,�

T

p

]

T

, for�

j

is of size

dp ⇥ 1, and �⇤= [�

⇤T1

. . . �

⇤Tp

]

T

, for �

⇤j

2 Rdp

. The set

E

j

is a part of the decomposition in ⌦

E

= ⌦

E

1

⇥ · · · ⇥⌦

E

p

due to the assumption on the row-wise separability of

norm R(·) in (5). Also define w(⇥) = E[sup

u2⇥

hg, ui] to

be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2Rdp

. Then with probability at least 1 � c

1

exp(�c

2

2

+

log(p)), for any ⌘ > 0, inf

�2cone(⌦

E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫,

where ⌫ =

pNL � 2

pM � cw(⇥) � ⌘ and c, c

1

, c

2

are

positive constants, and L, M are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose ⌘ =

1

2

pNL and setp

N =

pNL� 2

pM� cw(⇥) � ⌘ and since

pN > 0

must be satisfied, we can establish a lower bound on thenumber of samples N :

pN >

2

pM+cw(⇥)p

L/2

= O(w(⇥)).Examining this bound and using (9) and (13), we can con-clude that the number of samples needed to satisfy therestricted eigenvalue condition is smaller if ⇤

min

(A) and

min

(⌃) are larger and⇤max

(A) and⇤max

(⌃) are smaller.In turn, this means that matrices A and A in (10) and (12)must be well conditioned and the VAR process is stable,with eigenvalues well inside the unit circle (see Section2.2). Alternatively, we can also understand the bound onN as showing that large values of M and small values ofL indicate stronger dependency in the data, thus requiringmore samples for the RE conditions to hold with high prob-ability.

Analyzing Theorems 3.3 and 3.4 we can interpret the es-tablished results as follows. As the size and dimension-ality N , p and d of the problem increase, we emphasizethe scale of the results and use the order notations to de-note the constants. Select a number of samples at leastN � O(w

2

(⇥)) and let the regularization parameter sat-isfy �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. With high probability

then the restricted eigenvalue condition ||Z�||2

||�||2

� pN

for � 2 cone(⌦E

) holds, so that = O(1) is a pos-itive constant. Moreover, the norm of the estimation er-ror in optimization problem (4) is bounded by k�k

2

O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

(cone(⌦E

j

)). Note that the normcompatibility constant (cone(⌦

E

j

)) is assumed to be thesame for all j = 1, . . . , p, which follows from our assump-tion in (5).

Consider now Theorem 3.3 and the bound on the regu-larization parameter �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. Asthe dimensionality of the problem p and d grows andthe number of samples N increases, the first term w(⌦

R

)pN

will dominate the second one w

2

(⌦

R

)

N

2

. This can be seenby computing N for which the two terms become equalw(⌦

R

)pN

=

w

2

(⌦

R

)

N

2

, which happens at N = w

2

3

(⌦

R

) <

w(⌦

R

). Therefore, we can rewrite our results as fol-lows: once the restricted eigenvalue condition holds and�

N

� O⇣

w(⌦

R

)pN

, the error norm is upper-bounded by

k�k2

O⇣

w(⌦

R

)pN

(cone(⌦E

j

)).

3.3. Special Cases

While the presented results are valid for any norm R(·),separable along the rows of A

k

, it is instructive to special-ize our analysis to a few popular regularization choices,such as L

1

and Group Lasso, Sparse Group Lasso andOWL norms.

3.3.1. LASSO

To establish results for L

1

norm, we assume that the pa-rameter �⇤ is s-sparse, which in our case is meant to rep-resent the largest number of non-zero elements in any �

i

,i = 1, . . . , p, i.e., the combined i-th rows of each A

k

,

define w(⇥) = E[sup

u2⇥hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2 Rdp

. Then with

probability at least 1 � c

1

exp(�c

2

2

+ log(p)), for any ⌘ > 0

inf

�2cone(⌦E

)

||(Ip⇥p

⌦ X)�||2

||�||2

� ⌫,

where ⌫ =

pNL� 2

pM� cw(⇥) � ⌘ and c, c

1

, c

2

are positive constants, and L and M are defined in (9)and (13).

4.2 Discussion

From Theorem 4.4, we can choose ⌘ =

1

2

pNL and set

pN =

pNL � 2

pM � cw(⇥) � ⌘ and sincep

N > 0 must be satisfied, we can establish a lower bound on the number of samples N

pN >

2

pM + cw(⇥)p

L/2

= O(w(⇥)). (18)

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfythe restricted eigenvalue condition is smaller if ⇤

min

(A) and ⇤min

(⌃) are larger and ⇤max

(A) and ⇤max

(⌃)

are smaller. In turn, this means that matrices A and A in (10) and (12) must be well conditioned and theVAR process is stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we canalso understand (18) as showing that large values of M and small values of L indicate stronger dependencyin the data, thus requiring more samples for the RE conditions to hold with high probability.Analyzing Theorems 4.3 and 4.4 we can interpret the established results as follows. As the size and dimen-sionality N , p and d of the problem increase, we emphasize the scale of the results and use the order notationsto denote the constants. Select a number of samples at least N � O(w

2

(⇥)) and let the regularization pa-rameter satisfy �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

. With high probability then the restricted eigenvalue condition||Z�||

2

||�||2

� pN for� 2 cone(⌦

E

) holds, so that = O(1) is a positive constant. Moreover, the norm of the

estimation error in optimization problem (4) is bounded by k�k2

O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

(cone(⌦E

j

)).Note that the norm compatibility constant (cone(⌦

E

j

)) is assumed to be the same for all j = 1, . . . , p,which follows from our assumption in (5).

Consider now Theorem 4.3 and the bound on the regularization parameter �

N

� O⇣

w(⌦

R

)pN

+

w

2

(⌦

R

)

N

2

.As the dimensionality of the problem p and d grows and the number of samples N increases, the first termw(⌦

R

)pN

will dominate the second one w

2

(⌦

R

)

N

2

. This can be seen by computing N for which the two terms

become equal w(⌦

R

)pN

=

w

2

(⌦

R

)

N

2

, which happens at N = w

2

3

(⌦

R

) < w(⌦

R

). Therefore, we can rewrite our

results as follows: once the restricted eigenvalue condition holds and �

N

� O⇣

w(⌦

R

)pN

, the error norm is

upper-bounded by k�k2

O⇣

w(⌦

R

)pN

(cone(⌦E

j

)).

4.3 Special Cases

While the presented results are valid for any norm R(·), separable along the rows of A

k

, it is instructive tospecialize our analysis to a few popular regularization choices, such as L

1

and Group Lasso, Sparse GroupLasso and OWL norms.

8

Estimating Structured VAR

3.1. High Probability Bounds

In this Section we present the main results of our work,followed by the discussion on their properties and illustrat-ing some special cases based on popular Lasso and GroupLasso regularization norms. In Section 3.4 we will presentthe main ideas of our proof technique, with all the detailsdelegated to the supplement.

To establish lower bound on the regularization parameter�

N

, we derive an upper bound on R

⇤[

1

N

Z

T ✏] ↵, forsome ↵ > 0, which will establish the required relationship�

N

� ↵ � R

⇤[

1

N

Z

T ✏].

Theorem 3.3 Let ⌦

R

= {u 2 Rdp|R(u) 1}, and define

w(⌦

R

) = E[ sup

u2⌦

R

hg, ui] to be a Gaussian width of set

R

for g ⇠ N (0, I). For any ✏

1

> 0 and ✏

2

> 0 with

probability at least 1 � c exp(� min(✏

2

2

, ✏

1

) + log(p)) we

can establish that

R

1

N

Z

T ✏

c

2

(1+✏

2

)

w(⌦

R

)pN

+ c

1

(1+✏

1

)

w

2

(⌦

R

)

N

2

where c, c

1

and c

2

are positive constants.

To establish restricted eigenvalue condition, we will showthat inf

�2cone(⌦E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫, for some ⌫ > 0 and

then setp

N = ⌫.

Theorem 3.4 Let ⇥ = cone(⌦

E

j

) \ S

dp�1

, where S

dp�1

is a unit sphere. The error set ⌦

E

j

is defined as ⌦

E

j

=

n

j

2 Rdp

R(�

⇤j

+�

j

) R(�

⇤j

) +

1

r

R(�

j

)

o

, for r >

1, j = 1, . . . , p, and� = [�

T

1

, . . . ,�

T

p

]

T

, for�

j

is of size

dp ⇥ 1, and �⇤= [�

⇤T1

. . . �

⇤Tp

]

T

, for �

⇤j

2 Rdp

. The set

E

j

is a part of the decomposition in ⌦

E

= ⌦

E

1

⇥ · · · ⇥⌦

E

p

due to the assumption on the row-wise separability of

norm R(·) in (5). Also define w(⇥) = E[sup

u2⇥

hg, ui] to

be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2Rdp

. Then with probability at least 1 � c

1

exp(�c

2

2

+

log(p)), for any ⌘ > 0, inf

�2cone(⌦

E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫,

where ⌫ =

pNL � 2

pM � cw(⇥) � ⌘ and c, c

1

, c

2

are

positive constants, and L, M are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose η = ½√(NL) and set κ√N = √(NL) − 2√M − c w(Θ) − η. Since κ√N > 0 must be satisfied, we can establish a lower bound on the number of samples N:

√N > (2√M + c w(Θ)) / (√L / 2) = O(w(Θ)).

Examining this bound and using (9) and (13), we conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if Λ_min(𝒜) and Λ_min(Σ) are larger and Λ_max(𝒜) and Λ_max(Σ) are smaller. In turn, this means that the matrices 𝒜(ω) in (10) and (12) must be well conditioned and the VAR process must be stable, with eigenvalues well inside the unit circle (see Section 2.2). Alternatively, we can also read the bound on N as saying that large values of M and small values of L indicate stronger dependency in the data, thus requiring more samples for the RE condition to hold with high probability.
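As a rough illustration of how this sample bound behaves, the sketch below plugs hypothetical values of L, M, the absolute constant c and the width w(Θ) (all made up for illustration) into √N > (2√M + c·w(Θ))/(√L/2):

import numpy as np

# Hypothetical quantities, for illustration only:
L, M = 0.5, 4.0        # spectral bounds from (9) and (13)
c, w_theta = 1.0, 8.0  # absolute constant and Gaussian width w(Theta)

sqrt_N_min = (2 * np.sqrt(M) + c * w_theta) / (np.sqrt(L) / 2)
print(f"RE condition needs roughly N > {sqrt_N_min ** 2:.0f} samples")
print("which scales as O(w(Theta)^2) and grows as the dependence (large M, small L) grows")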

Analyzing Theorems 3.3 and 3.4, we can interpret the established results as follows. As the size and dimensionality N, p and d of the problem increase, we emphasize the scale of the results and use order notation to absorb the constants. Select a number of samples at least N ≥ O(w²(Θ)) and let the regularization parameter satisfy λ_N ≥ O(w(Ω_R)/√N + w²(Ω_R)/N²). With high probability, the restricted eigenvalue condition ||ZΔ||_2/||Δ||_2 ≥ κ√N for Δ ∈ cone(Ω_E) then holds, so that κ = O(1) is a positive constant. Moreover, the norm of the estimation error in optimization problem (4) is bounded by ||Δ||_2 ≤ O(w(Ω_R)/√N + w²(Ω_R)/N²) Ψ(cone(Ω_{E_j})). Note that the norm compatibility constant Ψ(cone(Ω_{E_j})) is assumed to be the same for all j = 1, . . . , p, which follows from our assumption in (5).

Consider now Theorem 3.3 and the bound on the regularization parameter λ_N ≥ O(w(Ω_R)/√N + w²(Ω_R)/N²). As the dimensionality p and d of the problem grows and the number of samples N increases, the first term w(Ω_R)/√N will dominate the second one, w²(Ω_R)/N². This can be seen by computing the N for which the two terms become equal, w(Ω_R)/√N = w²(Ω_R)/N², which happens at N = w^{2/3}(Ω_R) < w(Ω_R). Therefore, we can restate our results as follows: once the restricted eigenvalue condition holds and λ_N ≥ O(w(Ω_R)/√N), the error norm is upper-bounded by ||Δ||_2 ≤ O(w(Ω_R)/√N) Ψ(cone(Ω_{E_j})).

3.3. Special Cases

While the presented results are valid for any norm R(·) separable along the rows of A_k, it is instructive to specialize our analysis to a few popular regularization choices, such as L_1 and Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. LASSO

To establish results for the L_1 norm, we assume that the parameter β* is s-sparse, which in our case is meant to represent the largest number of non-zero elements in any β_i, i = 1, . . . , p, i.e., the combined i-th rows of each A_k,




Page 17: Estimating structured vector autoregressive models


VAR


density is that it has a closed form expression (see Section 9.4 of (Priestley, 1981))

ρ(ω) = (I − ∑_{k=1}^d A_k e^{−kiω})^{−1} Σ [(I − ∑_{k=1}^d A_k e^{−kiω})^{−1}]^∗,

where ∗ denotes the Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound

Λ_min[C_X] ≥ Λ_min(Σ)/Λ_max(𝒜) = L,   (9)

where we defined Λ_max(𝒜) = max_{ω∈[0,2π]} Λ_max(𝒜(ω)) for

𝒜(ω) = (I − ∑_{k=1}^d A_kᵀ e^{kiω})(I − ∑_{k=1}^d A_k e^{−kiω}).   (10)

In establishing high probability bounds we will also need information about a vector q = Xa ∈ R^N for any a ∈ R^{dp}, ||a||_2 = 1. Since each element X_{i,:}ᵀ a ∼ N(0, aᵀ C_X a), it follows that q ∼ N(0, Q_a) with a covariance matrix Q_a ∈ R^{N×N}. It can be shown (see Section 2.1.2 of the supplement) that Q_a can be written as

Q_a = (I ⊗ aᵀ) C_U (I ⊗ a),   (11)

where C_U = E(UUᵀ) for U = [X_{1,:}ᵀ . . . X_{N,:}ᵀ]ᵀ ∈ R^{Ndp}, which is obtained from the matrix X by stacking all the rows in a single vector, i.e., U = vec(Xᵀ). In order to bound the eigenvalues of C_U (and consequently of Q_a), observe that U can be viewed as a vector obtained by stacking N outputs from the VAR model in (6). Similarly as in (8), if we denote the spectral density of the VAR process in (6) as ρ_X(ω) = ∑_{h=−∞}^{∞} Γ_X(h) e^{−hiω}, ω ∈ [0, 2π], where Γ_X(h) = E[X_{j,:} X_{j+h,:}ᵀ] ∈ R^{dp×dp}, then we can write

inf_{1≤l≤dp, ω∈[0,2π]} Λ_l[ρ_X(ω)] ≤ Λ_k[C_U] ≤ sup_{1≤l≤dp, ω∈[0,2π]} Λ_l[ρ_X(ω)],   1 ≤ k ≤ Ndp.

The closed form expression of the spectral density is

ρ_X(ω) = (I − A e^{−iω})^{−1} Σ_E [(I − A e^{−iω})^{−1}]^∗,

where Σ_E is the covariance matrix of the noise vector and A is as defined in expression (6). Thus, an upper bound on C_U can be obtained as Λ_max[C_U] ≤ Λ_max(Σ)/Λ_min(𝒜), where we defined Λ_min(𝒜) = min_{ω∈[0,2π]} Λ_min(𝒜(ω)) for

𝒜(ω) = (I − Aᵀ e^{iω})(I − A e^{−iω}).   (12)

Referring back to the covariance matrix Q_a in (11), we get

Λ_max[Q_a] ≤ Λ_max(Σ)/Λ_min(𝒜) = M.   (13)
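The quantities L and M in (9) and (13) are easy to evaluate numerically for a given model. The sketch below is an illustration under our own assumptions: it uses a VAR(1), for which the companion matrix equals A_1 and the matrices in (10) and (12) coincide, so a single frequency scan of 𝒜(ω) yields both extremes:

import numpy as np

rng = np.random.default_rng(0)
p = 4
# Assumed stable VAR(1) with identity noise covariance, for illustration only.
A1 = 0.4 * rng.standard_normal((p, p))
A1 /= max(1.0, 1.1 * np.max(np.abs(np.linalg.eigvals(A1))))   # force spectral radius < 1
Sigma = np.eye(p)

def curly_A(omega):
    # curly-A(omega) = (I - A1^T e^{i omega})(I - A1 e^{-i omega})
    S = np.eye(p, dtype=complex) - A1 * np.exp(-1j * omega)
    return S.conj().T @ S

grid = np.linspace(0.0, 2.0 * np.pi, 512)
eigs = np.array([np.linalg.eigvalsh(curly_A(w)) for w in grid])

L = np.linalg.eigvalsh(Sigma).min() / eigs.max()   # as in (9)
M = np.linalg.eigvalsh(Sigma).max() / eigs.min()   # as in (13)
print(f"Lambda_max = {eigs.max():.3f}, Lambda_min = {eigs.min():.3f}, L = {L:.3f}, M = {M:.3f}")

A nearly unstable A_1 pushes Λ_min(𝒜) toward zero and M up, matching the discussion that stronger temporal dependence demands more samples.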

We note that for a general VAR model there might not exist closed-form expressions for Λ_max(𝒜) and Λ_min(𝒜). However, for some special cases there are results establishing bounds on these quantities (e.g., see Proposition 2.2 in (Basu & Michailidis, 2015)).

3. Regularized Estimation Guarantees

Denote by Δ = β̂ − β* the error between the solution of optimization problem (4) and β*, the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: ||Δ||_2 ≤ δ for some known δ. To establish such conditions, we utilize the framework of (Banerjee et al., 2014). Specifically, the estimation error analysis is based on the following known results, adapted to our setting. The first one characterizes the restricted error set Ω_E, to which the error Δ belongs.

Lemma 3.1 Assume that

λ_N ≥ r R*[(1/N) Zᵀε],   (14)

for some constant r > 1, where R*[(1/N) Zᵀε] is the dual form of the vector norm R(·), defined as R*[(1/N) Zᵀε] = sup_{R(U)≤1} ⟨(1/N) Zᵀε, U⟩ for U ∈ R^{dp²}, where U = [u_1ᵀ, u_2ᵀ, . . . , u_pᵀ]ᵀ and u_i ∈ R^{dp}. Then the error vector Δ belongs to the set

Ω_E = { Δ ∈ R^{dp²} | R(β* + Δ) ≤ R(β*) + (1/r) R(Δ) }.   (15)
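Condition (14) is straightforward to evaluate once a norm is fixed. The sketch below (our own illustration; the design matrix is plain Gaussian rather than the Kronecker-structured Z of the VAR problem, and all sizes are made up) checks it for R = L_1, whose dual norm is L_∞:

import numpy as np

rng = np.random.default_rng(1)
N, q = 200, 60                      # q plays the role of dp^2, the length of beta
Z = rng.standard_normal((N, q))     # stand-in design; the real Z is I_p (x) X
eps = rng.standard_normal(N)

r = 2.0                             # any constant r > 1 from Lemma 3.1
# For R = L1 the dual norm R* is L-infinity, so (14) reads
#   lambda_N >= r * || Z^T eps / N ||_inf
dual_value = np.max(np.abs(Z.T @ eps) / N)
print(f"R*[(1/N) Z^T eps] = {dual_value:.4f}; pick lambda_N >= {r * dual_value:.4f}")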

The second condition in (Banerjee et al., 2014) establishes the upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE) condition

||ZΔ||_2 / ||Δ||_2 ≥ κ√N   (16)

holds for Δ ∈ cone(Ω_E) and some constant κ > 0, where cone(Ω_E) is the cone of the error set. Then

||Δ||_2 ≤ (1 + r)/r · λ_N/κ² · Ψ(cone(Ω_E)),   (17)

where Ψ(cone(Ω_E)) is a norm compatibility constant, defined as Ψ(cone(Ω_E)) = sup_{U∈cone(Ω_E)} R(U)/||U||_2.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the conditions are stated in terms of quantities involving Z and ε, which are random. Therefore, in the following we establish high probability bounds on the regularization parameter in (14) and on the RE condition in (16).
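Before moving on, here is a crude Monte Carlo probe of the RE quantity in (16). This is purely our own sanity-check sketch: it samples random s-sparse directions as a stand-in for cone(Ω_E), so it can only over-estimate the true infimum and does not verify the condition:

import numpy as np

rng = np.random.default_rng(2)
N, q, s = 200, 60, 5                 # s-sparse directions as a proxy for cone(Omega_E)
Z = rng.standard_normal((N, q))      # stand-in design matrix

ratios = []
for _ in range(2000):
    delta = np.zeros(q)
    support = rng.choice(q, size=s, replace=False)
    delta[support] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(Z @ delta) / np.linalg.norm(delta))

print(f"empirical min of ||Z delta||/||delta||: {min(ratios):.2f}  (compare with sqrt(N) = {np.sqrt(N):.2f})")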

3.3 Properties of Data Matrix X

In what follows, we analyze the covariance structure of the matrix X in (3) using spectral properties of the VAR model (see Appendix B for additional details). The results will then be used in establishing the high probability bounds for the estimation guarantees in problem (4). Define any row of X as X_{i,:} ∈ R^{dp}, 1 ≤ i ≤ N. Since we assumed that ε_t ∼ N(0, Σ), it follows that each row is distributed as X_{i,:} ∼ N(0, C_X), where the covariance matrix C_X ∈ R^{dp×dp} is the same for all i:

C_X = [ Γ(0)       Γ(1)       . . .   Γ(d−1)
        Γ(1)ᵀ      Γ(0)       . . .   Γ(d−2)
        . . .      . . .      . . .   . . .
        Γ(d−1)ᵀ    Γ(d−2)ᵀ    . . .   Γ(0)   ],   (7)

where Γ(h) = E(x_t x_{t+h}ᵀ) ∈ R^{p×p}.
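The block-Toeplitz structure in (7) is easy to assemble from the lag autocovariances Γ(h). A small sketch, under our own assumptions: Γ(h) is estimated empirically from a simulated stable VAR(1), and the model sizes are made up:

import numpy as np

rng = np.random.default_rng(3)
p, d, T = 3, 2, 20000
A1 = np.diag([0.5, -0.3, 0.2])                  # assumed stable VAR(1), used only to generate data
x = np.zeros((T, p))
for t in range(1, T):
    x[t] = x[t - 1] @ A1.T + rng.standard_normal(p)

def Gamma(h):
    # empirical autocovariance Gamma(h) = E[x_t x_{t+h}^T]
    return (x[:T - h].T @ x[h:]) / (T - h)

# Block (i, j) of C_X equals Gamma(j - i), with Gamma(-h) = Gamma(h)^T, as in (7).
C_X = np.block([[Gamma(j - i) if j >= i else Gamma(i - j).T for j in range(d)]
                for i in range(d)])
print(C_X.shape, "min eigenvalue:", round(np.linalg.eigvalsh(C_X).min(), 3))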

It turns out that since C_X is a block-Toeplitz matrix, its eigenvalues can be bounded as (see [13])

inf_{1≤j≤p, ω∈[0,2π]} Λ_j[ρ(ω)] ≤ Λ_k[C_X] ≤ sup_{1≤j≤p, ω∈[0,2π]} Λ_j[ρ(ω)],   1 ≤ k ≤ dp,   (8)

where Λ_k[·] denotes the k-th eigenvalue of a matrix and, for i = √−1, ρ(ω) = ∑_{h=−∞}^{∞} Γ(h) e^{−hiω}, ω ∈ [0, 2π], is the spectral density, i.e., a Fourier transform of the autocovariance matrix Γ(h). The advantage of utilizing the spectral density is that it has a closed form expression (see Section 9.4 of [24])

ρ(ω) = (I − ∑_{k=1}^d A_k e^{−kiω})^{−1} Σ [(I − ∑_{k=1}^d A_k e^{−kiω})^{−1}]^∗,

where ∗ denotes the Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound

Λ_min[C_X] ≥ Λ_min(Σ)/Λ_max(𝒜) = L,   (9)

where we defined Λ_max(𝒜) = max_{ω∈[0,2π]} Λ_max(𝒜(ω)) for

𝒜(ω) = (I − ∑_{k=1}^d A_kᵀ e^{kiω})(I − ∑_{k=1}^d A_k e^{−kiω}),   (10)

see Appendix B.1 for additional details. In establishing high probability bounds we will also need information about a vector q = Xa ∈ R^N for any a ∈ R^{dp}, ||a||_2 = 1. Since each element X_{i,:}ᵀ a ∼ N(0, aᵀ C_X a), it follows that q ∼ N(0, Q_a) with a covariance matrix Q_a ∈ R^{N×N}. It can be shown (see Appendix B.3 for more details) that Q_a can be written as

Q_a = (I ⊗ aᵀ) C_U (I ⊗ a),   (11)

where C_U = E(UUᵀ) for U = [X_{1,:}ᵀ . . . X_{N,:}ᵀ]ᵀ ∈ R^{Ndp}, which is obtained from the matrix X by stacking all the rows in a single vector, i.e., U = vec(Xᵀ). In order to bound the eigenvalues of C_U (and consequently of


Q_a), observe that U can be viewed as a vector obtained by stacking N outputs from the VAR model in (6). Similarly as in (8), if we denote the spectral density of the VAR process in (6) as ρ_X(ω) = ∑_{h=−∞}^{∞} Γ_X(h) e^{−hiω}, ω ∈ [0, 2π], where Γ_X(h) = E[X_{j,:} X_{j+h,:}ᵀ] ∈ R^{dp×dp}, then we can write

inf_{1≤l≤dp, ω∈[0,2π]} Λ_l[ρ_X(ω)] ≤ Λ_k[C_U] ≤ sup_{1≤l≤dp, ω∈[0,2π]} Λ_l[ρ_X(ω)],   1 ≤ k ≤ Ndp.

The closed form expression of the spectral density is

ρ_X(ω) = (I − A e^{−iω})^{−1} Σ_E [(I − A e^{−iω})^{−1}]^∗,

where Σ_E is the covariance matrix of the noise vector and A is as defined in expression (6). Thus, an upper bound on C_U can be obtained as Λ_max[C_U] ≤ Λ_max(Σ)/Λ_min(𝒜), where we defined Λ_min(𝒜) = min_{ω∈[0,2π]} Λ_min(𝒜(ω)) for

𝒜(ω) = (I − Aᵀ e^{iω})(I − A e^{−iω}).   (12)

Referring back to the covariance matrix Q_a in (11), we get

Λ_max[Q_a] ≤ Λ_max(Σ)/Λ_min(𝒜) = M.   (13)

We note that for a general VAR model there might not exist closed-form expressions for Λ_max(𝒜) and Λ_min(𝒜). However, for some special cases there are results establishing bounds on these quantities (e.g., see Proposition 2.2 in [5]).

4 Regularized Estimation Guarantees

Denote by Δ = β̂ − β* the error between the solution of optimization problem (4) and β*, the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: ||Δ||_2 ≤ δ for some known δ. To establish such conditions, we utilize the framework of [4]. Specifically, the estimation error analysis is based on the following known results, adapted to our setting. The first one characterizes the restricted error set Ω_E, to which the error Δ belongs.

Lemma 4.1 Assume that

λ_N ≥ r R*[(1/N) Zᵀε],   (14)

for some constant r > 1, where R*[(1/N) Zᵀε] is the dual form of the vector norm R(·), defined as R*[(1/N) Zᵀε] = sup_{R(U)≤1} ⟨(1/N) Zᵀε, U⟩ for U ∈ R^{dp²}, where U = [u_1ᵀ, u_2ᵀ, . . . , u_pᵀ]ᵀ and u_i ∈ R^{dp}. Then the error vector Δ belongs to the set

Ω_E = { Δ ∈ R^{dp²} | R(β* + Δ) ≤ R(β*) + (1/r) R(Δ) }.   (15)
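Membership in the restricted error set (15) is a simple inequality to test once a norm is fixed. A tiny sketch with R = L_1 and a hypothetical sparse β* (both made up for illustration):

import numpy as np

dp, r = 30, 2.0
beta_star = np.zeros(dp)
beta_star[:3] = [1.0, -0.5, 0.8]               # hypothetical sparse true parameter
R = lambda v: np.linalg.norm(v, 1)             # the row norm, here L1

def in_error_set(delta):
    # Omega_E membership from (15): R(beta* + delta) <= R(beta*) + R(delta)/r
    return R(beta_star + delta) <= R(beta_star) + R(delta) / r

shrink = np.zeros(dp); shrink[:3] = -0.1 * np.sign(beta_star[:3])   # moves beta* toward zero
spread = np.zeros(dp); spread[10:20] = 0.1                          # puts mass off the support
print(in_error_set(shrink), in_error_set(spread))                   # True, False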



which can also be compactly written as

Y = XB + E,   (3)

where Y ∈ R^{N×p}, X ∈ R^{N×dp}, B ∈ R^{dp×p}, and E ∈ R^{N×p} for N = T − d + 1. Vectorizing (column-wise) each matrix in (3), we get

vec(Y) = (I_{p×p} ⊗ X) vec(B) + vec(E),   i.e.,   y = Zβ + ε,

where y ∈ R^{Np}, Z = (I_{p×p} ⊗ X) ∈ R^{Np×dp²}, β ∈ R^{dp²}, ε ∈ R^{Np}, and ⊗ is the Kronecker product. The covariance matrix of the noise ε is now E[εεᵀ] = Σ ⊗ I_{N×N}. Consequently, the regularized estimator takes the form

β̂ = argmin_{β∈R^{dp²}} (1/N) ||y − Zβ||_2² + λ_N R(β),   (4)

where R(β) can be any vector norm separable along the rows of the matrices A_k. Specifically, if we denote β = [β_1ᵀ . . . β_pᵀ]ᵀ and A_k(i,:) as the i-th row of the matrix A_k for k = 1, . . . , d, then our assumption is equivalent to

R(β) = ∑_{i=1}^p R(β_i) = ∑_{i=1}^p R([A_1(i,:)ᵀ . . . A_d(i,:)ᵀ]ᵀ).   (5)

To reduce clutter and without loss of generality, we assume the norm R(·) to be the same for each row i. Since the analysis decouples across rows, it is straightforward to extend our analysis to the case when a different norm is used for each row of A_k, e.g., L_1 for row one, L_2 for row two, K-support norm [2] for row three, etc. Observe that within a row, the norm need not be decomposable across columns. The main difference between the estimation problem in (1) and the formulation in (4) is the strong dependence between the samples (x_0, x_1, . . . , x_T), violating the i.i.d. assumption on the data {(y_i, z_i), i = 1, . . . , Np}. In particular, this leads to correlations between the rows and columns of the matrix X (and consequently of Z). To deal with such dependencies, following [5], we utilize the spectral representation of the autocovariance of VAR models to control the dependencies in the matrix X.
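Because the norm in (5) is separable along the rows, the estimator (4) with R = L_1 decouples into p independent lagged regressions, one per component series. The sketch below is a minimal illustration of that idea using scikit-learn's Lasso as a stand-in solver (its objective uses a 1/(2N) scaling of the squared loss, so its alpha is only proportional to λ_N; all model sizes and the ground-truth matrices are made up):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
p, d, T = 6, 2, 400
# Simulate a sparse, stable VAR(d) as illustrative ground truth.
A_true = [np.zeros((p, p)) for _ in range(d)]
A_true[0][np.arange(p), np.arange(p)] = 0.4        # diagonal lag-1 effects
A_true[1][0, 1] = 0.3                              # one cross-series lag-2 effect
x = np.zeros((T, p))
for t in range(d, T):
    x[t] = sum(x[t - k - 1] @ A_true[k].T for k in range(d)) + 0.5 * rng.standard_normal(p)

# Build the regression in (3): each row of X holds (x_{t-1}, ..., x_{t-d}).
Y = x[d:]
X = np.hstack([x[d - k - 1:T - k - 1] for k in range(d)])     # shape (N, dp), N = T - d

# Row-separable L1 norm: solve p independent Lasso problems, one per target series.
lam = 0.1
B_hat = np.column_stack([Lasso(alpha=lam, fit_intercept=False).fit(X, Y[:, i]).coef_
                         for i in range(p)])                   # dp x p estimate of B in (3)
print(B_hat.shape, "nonzeros:", int(np.sum(np.abs(B_hat) > 1e-8)))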

3.2 Stability of VAR Model

Since VAR models are (linear) dynamical systems, for the analysis we need to establish conditions under which the VAR model (2) is stable, i.e., the time-series process does not diverge over time. For understanding stability, it is convenient to rewrite the VAR model of order d in (2) as an equivalent VAR model of order 1:

[ x_t         ]     [ A_1   A_2   . . .   A_{d−1}   A_d ] [ x_{t−1} ]   [ ε_t ]
[ x_{t−1}     ]  =  [ I     0     . . .   0         0   ] [ x_{t−2} ] + [ 0   ]
[ . . .       ]     [ . . .               . . .         ] [ . . .   ]   [ ... ]
[ x_{t−(d−1)} ]     [ 0     0     . . .   I         0   ] [ x_{t−d} ]   [ 0   ]     (6)

where the block coefficient matrix, denoted A, satisfies A ∈ R^{dp×dp}. Therefore, the VAR process is stable if all the eigenvalues of A, i.e., all λ ∈ C with det(λ I_{dp×dp} − A) = 0, satisfy |λ| < 1. Equivalently, expressed in terms of the original parameters A_k, stability is satisfied if all roots of det(I − ∑_{k=1}^d A_k λ^{−k}) = 0 lie strictly inside the unit circle (see Appendix A for more details).
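The companion-form stability check in (6) translates directly into a few lines of code. The sketch below is our own helper with an arbitrary two-lag example; it builds A and tests whether its spectral radius is below one:

import numpy as np

def companion(A_lags):
    """Companion matrix A of a VAR(d) as in (6): top block row [A_1 ... A_d], shifted identity below."""
    d, p = len(A_lags), A_lags[0].shape[0]
    A = np.zeros((d * p, d * p))
    A[:p, :] = np.hstack(A_lags)
    if d > 1:
        A[p:, :-p] = np.eye((d - 1) * p)
    return A

def is_stable(A_lags):
    # Stable iff every eigenvalue of the companion matrix lies strictly inside the unit circle.
    return np.max(np.abs(np.linalg.eigvals(companion(A_lags)))) < 1.0

p = 3
A1, A2 = 0.5 * np.eye(p), 0.2 * np.eye(p)      # illustrative lag matrices
print(is_stable([A1, A2]))                      # True: roots of lambda^2 - 0.5*lambda - 0.2 lie inside the unit circle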




Page 18: Estimating structured vector autoregressive models

Δ ∈ Ω_E


Thm4.3

Thm4.4 Lem.4.1

Lem.4.2



Estimating Structured VAR

3.1. High Probability Bounds

In this Section we present the main results of our work,followed by the discussion on their properties and illustrat-ing some special cases based on popular Lasso and GroupLasso regularization norms. In Section 3.4 we will presentthe main ideas of our proof technique, with all the detailsdelegated to the supplement.

To establish lower bound on the regularization parameter�

N

, we derive an upper bound on R

⇤[

1

N

Z

T ✏] ↵, forsome ↵ > 0, which will establish the required relationship�

N

� ↵ � R

⇤[

1

N

Z

T ✏].

Theorem 3.3 Let ⌦

R

= {u 2 Rdp|R(u) 1}, and define

w(⌦

R

) = E[ sup

u2⌦

R

hg, ui] to be a Gaussian width of set

R

for g ⇠ N (0, I). For any ✏

1

> 0 and ✏

2

> 0 with

probability at least 1 � c exp(� min(✏

2

2

, ✏

1

) + log(p)) we

can establish that

R

1

N

Z

T ✏

c

2

(1+✏

2

)

w(⌦

R

)pN

+ c

1

(1+✏

1

)

w

2

(⌦

R

)

N

2

where c, c

1

and c

2

are positive constants.

To establish restricted eigenvalue condition, we will showthat inf

�2cone(⌦E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫, for some ⌫ > 0 and

then setp

N = ⌫.

Theorem 3.4 Let ⇥ = cone(⌦

E

j

) \ S

dp�1

, where S

dp�1

is a unit sphere. The error set ⌦

E

j

is defined as ⌦

E

j

=

n

j

2 Rdp

R(�

⇤j

+�

j

) R(�

⇤j

) +

1

r

R(�

j

)

o

, for r >

1, j = 1, . . . , p, and� = [�

T

1

, . . . ,�

T

p

]

T

, for�

j

is of size

dp ⇥ 1, and �⇤= [�

⇤T1

. . . �

⇤Tp

]

T

, for �

⇤j

2 Rdp

. The set

E

j

is a part of the decomposition in ⌦

E

= ⌦

E

1

⇥ · · · ⇥⌦

E

p

due to the assumption on the row-wise separability of

norm R(·) in (5). Also define w(⇥) = E[sup

u2⇥

hg, ui] to

be a Gaussian width of set ⇥ for g ⇠ N (0, I) and u 2Rdp

. Then with probability at least 1 � c

1

exp(�c

2

2

+

log(p)), for any ⌘ > 0, inf

�2cone(⌦

E

)

||(Ip⇥p

⌦X)�||2

||�||2

� ⌫,

where ⌫ =

pNL � 2

pM � cw(⇥) � ⌘ and c, c

1

, c

2

are

positive constants, and L, M are defined in (9) and (13).

3.2. Discussion

From Theorem 3.4, we can choose ⌘ =

1

2

pNL and setp

N =

pNL� 2

pM� cw(⇥) � ⌘ and since

pN > 0

must be satisfied, we can establish a lower bound on thenumber of samples N :

pN >

2

pM+cw(⇥)p

L/2

= O(w(⇥)).Examining this bound and using (9) and (13), we can con-clude that the number of samples needed to satisfy therestricted eigenvalue condition is smaller if ⇤

min

(A) and

min

(⌃) are larger and⇤max

(A) and⇤max

(⌃) are smaller.In turn, this means that matrices A and A in (10) and (12)must be well conditioned and the VAR process is stable,with eigenvalues well inside the unit circle (see Section2.2). Alternatively, we can also understand the bound onN as showing that large values of M and small values ofL indicate stronger dependency in the data, thus requiringmore samples for the RE conditions to hold with high prob-ability.

Analyzing Theorems 3.3 and 3.4, we can interpret the established results as follows. As the size and dimensionality N, p and d of the problem increase, we emphasize the scale of the results and use order notation to denote the constants. Select a number of samples at least N ≥ O(w²(Θ)) and let the regularization parameter satisfy λ_N ≥ O( w(Ω_R)/√N + w²(Ω_R)/N² ). With high probability the restricted eigenvalue condition ‖ZΔ‖₂/‖Δ‖₂ ≥ κ√N then holds for Δ ∈ cone(Ω_E), so that κ = O(1) is a positive constant. Moreover, the norm of the estimation error in optimization problem (4) is bounded by ‖Δ‖₂ ≤ O( w(Ω_R)/√N + w²(Ω_R)/N² ) Ψ(cone(Ω_{E_j})). Note that the norm compatibility constant Ψ(cone(Ω_{E_j})) is assumed to be the same for all j = 1, …, p, which follows from our assumption in (5).

Consider now Theorem 3.3 and the bound on the regularization parameter λ_N ≥ O( w(Ω_R)/√N + w²(Ω_R)/N² ). As the dimensionality of the problem p and d grows and the number of samples N increases, the first term w(Ω_R)/√N will dominate the second one, w²(Ω_R)/N². This can be seen by computing the N for which the two terms become equal, w(Ω_R)/√N = w²(Ω_R)/N², which happens at N = w^{2/3}(Ω_R) < w(Ω_R). Therefore, we can rewrite our results as follows: once the restricted eigenvalue condition holds and λ_N ≥ O( w(Ω_R)/√N ), the error norm is upper-bounded by ‖Δ‖₂ ≤ O( w(Ω_R)/√N ) Ψ(cone(Ω_{E_j})).

3.3. Special Cases

While the presented results are valid for any norm R(·) that is separable along the rows of A_k, it is instructive to specialize our analysis to a few popular regularization choices, such as the L₁ and Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. LASSO

To establish results for the L₁ norm, we assume that the parameter β* is s-sparse, which in our case is meant to represent the largest number of non-zero elements in any β_i, i = 1, …, p, i.e., the combined i-th rows of each A_k, …
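As a concrete illustration of the Lasso special case, the row-wise separability of the norm means the VAR estimate can be obtained by solving p independent Lasso problems, one per output coordinate. The sketch below is our own, using scikit-learn's Lasso rather than any code from the paper, and shows a first-order (d = 1) sparse VAR fit in exactly this row-by-row fashion.

import numpy as np
from sklearn.linear_model import Lasso

def fit_sparse_var1(series, lam):
    """Row-wise Lasso estimate of A_1 in x_t = A_1 x_{t-1} + eps_t.

    series: array of shape (T, p). Returns the estimated (p, p) matrix.
    Because the L1 norm separates over rows, each row of A_1 is a separate
    Lasso regression of one coordinate of x_t on the full lagged vector.
    """
    X, Y = series[:-1], series[1:]          # N = T - 1 samples
    p = series.shape[1]
    A_hat = np.zeros((p, p))
    for i in range(p):                      # one Lasso problem per row
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        A_hat[i] = model.fit(X, Y[:, i]).coef_
    return A_hat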

18

Page 19: Estimating structured vector autoregressive models

• 4.3

The second condition in [4] establishes the upper bound on the estimation error.

Lemma 4.2 Assume that the restricted eigenvalue (RE) condition holds,

‖ZΔ‖₂ / ‖Δ‖₂ ≥ κ√N,   (16)

for Δ ∈ cone(Ω_E) and some constant κ > 0, where cone(Ω_E) is the cone of the error set; then

‖Δ‖₂ ≤ ((1 + r)/r) · (λ_N / κ²) · Ψ(cone(Ω_E)),   (17)

where Ψ(cone(Ω_E)) is the norm compatibility constant, defined as Ψ(cone(Ω_E)) = sup_{U∈cone(Ω_E)} R(U)/‖U‖₂.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are defined in terms of quantities involving Z and ε, which are random. Therefore, in the following we establish high probability bounds on the regularization parameter in (14) and on the RE condition in (16).
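Plugging numbers into the deterministic bound (17) is simple arithmetic; the helper below does exactly that, with every input (λ_N, κ, r, Ψ) supplied by the user, and mirrors the reconstructed form of (17) above rather than any code from the paper.

def error_bound(lam_N, kappa, r, psi):
    """Right-hand side of the deterministic bound (17):
    ||Delta||_2 <= (1 + r)/r * lam_N / kappa**2 * psi."""
    return (1.0 + r) / r * lam_N / kappa ** 2 * psi

# A tighter RE constant (larger kappa) or a smaller lambda_N shrinks the bound.
print(error_bound(lam_N=0.1, kappa=0.5, r=2.0, psi=4.0))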

4.1 High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties and an illustration of some special cases based on the popular Lasso and Group Lasso regularization norms. In Section 4.4 we present the main ideas of our proof technique, with all the details delegated to Appendices C and D.

To establish a lower bound on the regularization parameter λ_N, we derive an upper bound R*[(1/N) Zᵀε] ≤ α, for some α > 0, which establishes the required relationship λ_N ≥ α ≥ R*[(1/N) Zᵀε].

Theorem 4.3 Let Ω_R = {u ∈ R^{dp} | R(u) ≤ 1}, and define w(Ω_R) = E[ sup_{u∈Ω_R} ⟨g, u⟩ ] to be the Gaussian width of the set Ω_R for g ∼ N(0, I). For any ε₁ > 0 and ε₂ > 0, with probability at least 1 − c·exp(−min(ε₂², ε₁) + log(p)) we can establish that

R*[ (1/N) Zᵀε ] ≤ c₂(1 + ε₂) w(Ω_R)/√N + c₁(1 + ε₁) w²(Ω_R)/N²,

where c, c₁ and c₂ are positive constants.

To establish the restricted eigenvalue condition, we will show that inf_{Δ∈cone(Ω_E)} ‖(I_{p×p} ⊗ X)Δ‖₂ / ‖Δ‖₂ ≥ ν for some ν > 0, and then set κ√N = ν.

Theorem 4.4 Let Θ = cone(Ω_{E_j}) ∩ S^{dp−1}, where S^{dp−1} is the unit sphere. The error set Ω_{E_j} is defined as Ω_{E_j} = { Δ_j ∈ R^{dp} | R(β*_j + Δ_j) ≤ R(β*_j) + (1/r) R(Δ_j) }, for r > 1, j = 1, …, p, and Δ = [Δ₁ᵀ, …, Δ_pᵀ]ᵀ, where Δ_j is of size dp × 1, and β* = [β*₁ᵀ … β*_pᵀ]ᵀ, for β*_j ∈ R^{dp}. The set Ω_{E_j} is a part of the decomposition Ω_E = Ω_{E₁} × ⋯ × Ω_{E_p} due to the assumption on the row-wise separability of the norm R(·) in (5).



Also define w(Θ) = E[ sup_{u∈Θ} ⟨g, u⟩ ] to be the Gaussian width of the set Θ for g ∼ N(0, I) and u ∈ R^{dp}. Then with probability at least 1 − c₁·exp(−c₂η² + log(p)), for any η > 0,

inf_{Δ∈cone(Ω_E)} ‖(I_{p×p} ⊗ X)Δ‖₂ / ‖Δ‖₂ ≥ ν,

where ν = √(NL) − 2√M − c·w(Θ) − η and c, c₁, c₂ are positive constants, and L and M are defined in (9) and (13).

4.2 Discussion

From Theorem 4.4, we can choose η = (1/2)√(NL) and set κ√N = √(NL) − 2√M − c·w(Θ) − η, and since κ√N > 0 must be satisfied, we can establish a lower bound on the number of samples N:

√N > (2√M + c·w(Θ)) / (√L/2) = O(w(Θ)).   (18)

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if Λ_min(A) and Λ_min(Σ) are larger and Λ_max(A) and Λ_max(Σ) are smaller. In turn, this means that the matrices in (10) and (12) must be well conditioned and the VAR process must be stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we can also understand (18) as showing that large values of M and small values of L indicate stronger dependency in the data, thus requiring more samples for the RE condition to hold with high probability.

Analyzing Theorems 4.3 and 4.4, we can interpret the established results as follows. As the size and dimensionality N, p and d of the problem increase, we emphasize the scale of the results and use order notation to denote the constants. Select a number of samples at least N ≥ O(w²(Θ)) and let the regularization parameter satisfy λ_N ≥ O( w(Ω_R)/√N + w²(Ω_R)/N² ). With high probability the restricted eigenvalue condition ‖ZΔ‖₂/‖Δ‖₂ ≥ κ√N then holds for Δ ∈ cone(Ω_E), so that κ = O(1) is a positive constant. Moreover, the norm of the estimation error in optimization problem (4) is bounded by ‖Δ‖₂ ≤ O( w(Ω_R)/√N + w²(Ω_R)/N² ) Ψ(cone(Ω_{E_j})). Note that the norm compatibility constant Ψ(cone(Ω_{E_j})) is assumed to be the same for all j = 1, …, p, which follows from our assumption in (5).

Consider now Theorem 4.3 and the bound on the regularization parameter λ_N ≥ O( w(Ω_R)/√N + w²(Ω_R)/N² ). As the dimensionality of the problem p and d grows and the number of samples N increases, the first term w(Ω_R)/√N will dominate the second one, w²(Ω_R)/N². This can be seen by computing the N for which the two terms become equal, w(Ω_R)/√N = w²(Ω_R)/N², which happens at N = w^{2/3}(Ω_R) < w(Ω_R). Therefore, we can rewrite our results as follows: once the restricted eigenvalue condition holds and λ_N ≥ O( w(Ω_R)/√N ), the error norm is upper-bounded by ‖Δ‖₂ ≤ O( w(Ω_R)/√N ) Ψ(cone(Ω_{E_j})).

4.3 Special Cases

While the presented results are valid for any norm R(·) that is separable along the rows of A_k, it is instructive to specialize our analysis to a few popular regularization choices, such as the L₁ and Group Lasso, Sparse Group Lasso and OWL norms.
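For the Group Lasso special case, the rows of the stacked parameter are penalized by a sum of Euclidean norms over predefined groups. A minimal sketch of that penalty (our own notation; the grouping into K equal-size groups mirrors the group sparse experiments reported later) is:

import numpy as np

def group_lasso_norm(beta, groups):
    """R(beta) = sum_g ||beta_g||_2 for a row vector beta of length dp.

    `groups` is a list of index arrays partitioning {0, ..., dp-1}.
    """
    return sum(np.linalg.norm(beta[g]) for g in groups)

beta = np.array([0.0, 0.0, 1.5, -0.5, 0.0, 2.0])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
print(group_lasso_norm(beta, groups))   # only the two active groups contribute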

19

Page 20: Estimating structured vector autoregressive models

• Synthetic data: estimation error as a function of N and p (d = 1)
• Dependence of the regularization parameter λ_N on N

(Figure 1: eight panels of plots omitted; axes include ‖Δ‖₂, N, p, K, λ/λ_max, and rescaled sample sizes N/[s log(pd)] and N/[s(m + log K)].)

Figure 1: Results for estimating parameters of a stable first order sparse VAR (top row) and group sparse VAR (bottom row). Problem dimensions: p ∈ [10, 600], N ∈ [10, 5000], λ_N/λ_max ∈ [0, 1], K ∈ [2, 60] and d = 1. Figures (a) and (e) show the dependency of errors on sample size for different p; in Figure (b) N is scaled by (s log p) and plotted against ‖Δ‖₂ to show that errors scale as (s log p)/N; in (f) the graph is similar to (b) but for group sparse VAR; in (c) and (g) we show the dependency of λ_N on p (or the number of groups K in (g)) for fixed sample size N; finally, Figures (d) and (h) display the dependency of λ_N on N for fixed p.
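A minimal version of the synthetic experiment behind Figure 1 can be reproduced as follows. The generator is our own sketch, not the authors' code: it draws a stable sparse VAR(1), simulates N samples, fits a row-wise Lasso, and records ‖Â₁ − A₁‖_F so the error can be plotted against N.

import numpy as np
from sklearn.linear_model import Lasso

def make_stable_sparse_var1(p, s, rng, target_radius=0.9):
    """Random p x p transition matrix with s non-zeros per row, spectral radius < 1."""
    A = np.zeros((p, p))
    for i in range(p):
        A[i, rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
    return A * target_radius / max(abs(np.linalg.eigvals(A)))

def simulate_var1(A, N, rng, sigma=0.1):
    """Generate N + 1 observations of x_t = A x_{t-1} + eps_t."""
    x = np.zeros((N + 1, A.shape[0]))
    for t in range(1, N + 1):
        x[t] = A @ x[t - 1] + sigma * rng.standard_normal(A.shape[0])
    return x

rng = np.random.default_rng(0)
A_true = make_stable_sparse_var1(p=20, s=3, rng=rng)
for N in (100, 500, 2000):
    x = simulate_var1(A_true, N, rng)
    # row-wise Lasso estimate of A_1: one regression per coordinate
    A_hat = np.vstack([Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
                       .fit(x[:-1], x[1:, i]).coef_ for i in range(20)])
    print(N, np.linalg.norm(A_hat - A_true))   # error should shrink as N grows (roughly like sqrt(s log p / N))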

(Figure 2: eight panels of plots omitted; axes include ‖Δ‖₂, N, p, λ/λ_max, and rescaled sample sizes such as N/[(αs + (1 − α)s_G)(m + log K)] and c̄N/[s log p].)

Figure 2: Results for estimating parameters of a stable first order Sparse Group Lasso VAR (top row) and OWL-regularized VAR (bottom row). Problem dimensions for Sparse Group Lasso: p ∈ [10, 410], N ∈ [10, 5000], λ_N/λ_max ∈ [0, 1], K ∈ [2, 60] and d = 1. Problem dimensions for OWL: p ∈ [10, 410], N ∈ [10, 5000], λ_N/λ_max ∈ [0, 1], s ∈ [4, 260] and d = 1. All results are shown after averaging across 50 runs.


20

Page 21: Estimating structured vector autoregressive models

• Real data (aviation): fit a first-order VAR (d = 1) and evaluate one-step-ahead prediction error



R(·)            Lasso        OWL          Group Lasso   Sparse Group Lasso   Ridge
nMSE (%)        32.3 (6.5)   32.2 (6.6)   32.7 (6.5)    32.2 (6.4)           33.5 (6.1)
Non-zeros (%)   32.7 (7.9)   44.5 (15.6)  75.3 (8.4)    38.4 (9.6)           99.9 (0.2)
A₁              (typical sparsity patterns shown as images in the paper; omitted here)

Table 1. Mean squared error (row 2) of the five methods used in fitting the VAR model, evaluated on the aviation dataset (MSE is computed using one-step-ahead prediction errors). Row 3 shows the average number of non-zeros (as a percentage of the total number of elements) in the VAR matrix. The last row shows a typical sparsity pattern in A₁ for each method (darker dots indicate stronger dependencies, lighter dots weaker dependencies). The values in parentheses denote one standard deviation after averaging the results over 300 flights.

Table 1 shows the results after averaging across 300 flights.

From the table we can see that the considered problem exhibits a sparse structure, since all the methods detected similar patterns in the matrix A₁. In particular, the analysis of such patterns revealed a meaningful relationship among the flight parameters (darker dots), e.g., normal acceleration had a high dependency on vertical speed and angle-of-attack, altitude depended mainly on fuel quantity, vertical speed on aircraft nose pitch angle, etc. The results also showed that sparse regularization helps in recovering more accurate and parsimonious models, as is evident from comparing the performance of Ridge regression with the other methods. Moreover, while all four Lasso-based approaches performed similarly to each other, their sparsity levels were different, with Lasso producing the sparsest solutions. As was also expected, Group Lasso had a larger number of non-zeros since it did not enforce sparsity within the groups, as compared to the sparse version of this norm.
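The evaluation metric in Table 1 is one-step-ahead prediction error. A generic way to compute it for any fitted transition matrix is sketched below; the normalization by the target variance is one common convention and may differ from the paper's exact nMSE definition.

import numpy as np

def one_step_nmse(A_hat, test_series):
    """Normalized MSE of one-step-ahead predictions x_hat_t = A_hat x_{t-1}.

    test_series: held-out array of shape (T, p). Returns a percentage.
    """
    preds = test_series[:-1] @ A_hat.T
    targets = test_series[1:]
    return 100.0 * np.mean((targets - preds) ** 2) / np.mean((targets - targets.mean(0)) ** 2)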

5. Conclusions

In this work we present a set of results characterizing the non-asymptotic estimation error in estimating structured vector autoregressive models. The analysis holds for any norm that is separable along the rows of the parameter matrices. Our analysis is general, as it is expressed in terms of Gaussian widths, a geometric measure of the size of suitable sets, and it includes as special cases many of the existing results focused on structured sparsity in VAR models.

Acknowledgements

The research was supported by NSF grants IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, NASA grant NNX12AQ39A, and gifts from Adobe, IBM, and Yahoo.


• … taking advantage of its atomic formulation.
• Instantiation of the CG algorithm to handle the Ivanov formulation of OWL regularization; more specifically:
  – we show how the atomic formulation of the OWL norm allows solving efficiently the linear programming problem in each iteration of the CG algorithm;
  – based on results from [25], we show convergence of the resulting algorithm and provide explicit values for the constants.
• A new derivation of the proximity operator of the OWL norm, arguably simpler than those in [10] and [44], highlighting its connection to isotonic regression and the pool adjacent violators (PAV) algorithm [3], [8].
• An efficient method to project onto an OWL norm ball, based on a root-finding scheme.
• Tackling the Ivanov formulation under OWL regularization using projected gradient algorithms, based on the proposed OWL projection.

The paper is organized as follows. Section II, after reviewing the OWL norm and some of its basic properties, presents the atomic formulation and derives its dual norm. The two key computational tools for using OWL regularization, the proximity operator and the projection onto a ball, are addressed in Section III. Section IV instantiates the CG algorithm and accelerated projected gradient algorithms to tackle the constrained optimization formulation of OWL regularization for linear regression. Finally, Section V reports experimental results illustrating the performance and comparison of the proposed approaches, and Section VI concludes the paper.

Notation. Lower-case bold letters, e.g., x, y, denote (column) vectors, their transposes are xᵀ, yᵀ, and the i-th and j-th components are written as x_i and y_j. Matrices are written in upper case bold, e.g., A, B. The vector with the absolute values of the components of x is written as |x|. For a vector x, x_[i] is its i-th largest component (i.e., for x ∈ Rⁿ, x_[1] ≥ x_[2] ≥ ⋯ ≥ x_[n], with ties broken by some arbitrary rule); consequently, |x|_[i] is the i-th largest component of x in magnitude. The vector obtained by sorting (in non-increasing order) the components of x is denoted as x_↓, thus |x|_↓ denotes the vector obtained by sorting the components of x in non-increasing order of magnitude. We denote by P(x) a permutation matrix (thus P(x)⁻¹ = P(x)ᵀ) that sorts the components of x in non-increasing order, i.e., x_↓ = P(x)x; naturally, |x|_↓ = P(|x|)|x|. Finally, 1 is a vector with all entries equal to 1, whereas 0 = 0·1 is a vector with all entries equal to zero, and ⊙ is the entry-wise (Hadamard) product.

II. THE OWL, ITS ATOMIC FORMULATION, AND ITS DUAL

A. The OWL Norm. The ordered weighted ℓ₁ (OWL) norm [10], [44], denoted Ω_w : Rⁿ → R₊, is defined as

Ω_w(x) = Σ_{i=1}^{n} |x|_[i] w_i = wᵀ |x|_↓,   (1)

where w ∈ K_{m+} is a vector of non-increasing weights, i.e., belonging to the so-called monotone non-negative cone [13],

K_{m+} = { x ∈ Rⁿ : x₁ ≥ x₂ ≥ ⋯ ≥ x_n ≥ 0 } ⊂ Rⁿ₊.   (2)

OWL balls in R² and R³, for different choices of the weight vector w, are illustrated in Figs. 1 and 2. (Fig. 1: OWL balls in R² with different weights: (a) w₁ > w₂ > 0; (b) w₁ = w₂ > 0; (c) w₁ > w₂ = 0. Fig. 2: OWL balls in R³ with different weights: (a) w₁ > w₂ > w₃ > 0; (b) w₁ > w₂ = w₃ > 0; (c) w₁ = w₂ > w₃ > 0; (d) w₁ = w₂ > w₃ = 0; (e) w₁ > w₂ = w₃ = 0; (f) w₁ = w₂ = w₃ > 0.)

The fact that, if w ∈ K_{m+} \ {0}, then Ω_w is indeed a norm (thus convex and homogeneous of degree 1) was shown in [10], [44]. It is clear that Ω_w is lower bounded by the (appropriately scaled) ℓ_∞ norm:

Ω_w(x) ≥ w₁ |x|_[1] = w₁ ‖x‖_∞,   (3)

with the inequality becoming an equality if w₁ = 1 and w₂ = ⋯ = w_n = 0. Moreover, Ω_w satisfies the following pair of inequalities with respect to the ℓ₁ norm:

w̄ ‖x‖₁ ≤ Ω_w(x) ≤ w₁ ‖x‖₁,   (4)

where w̄ = ‖w‖₁/n is the average of the weights; the first inequality is a corollary of Chebyshev's sum inequality¹, and the second inequality is trivial from the definition of Ω_w. Of course, if w = w₁·1, both inequalities in (4) become equalities. Finally, as shown in [46], the following specific choice for …

¹ According to Chebyshev's sum inequality [24], if x, y ∈ K_{m+}, then (1/n) xᵀy ≥ ((1/n) Σ_{i=1}^n x_i)((1/n) Σ_{i=1}^n y_i) = (1/n²) ‖x‖₁‖y‖₁.
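To make the definition concrete, the sketch below (ours, not code from either paper) evaluates Ω_w(x) = wᵀ|x|_↓ and, using the isotonic-regression/PAV connection mentioned in the bullets above, a proximity operator of Ω_w. The prox construction follows the commonly cited recipe (sort, shift by w, project onto the monotone cone via PAV, clip at zero, undo the sort) and should be read as an assumption-laden illustration rather than the authors' algorithm.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def owl_norm(x, w):
    """Omega_w(x) = sum_i w_i * |x|_[i], for w non-increasing and non-negative."""
    return np.dot(w, np.sort(np.abs(x))[::-1])

def owl_prox(v, w):
    """prox_{Omega_w}(v): sort |v| decreasingly, subtract w, project onto the
    monotone non-negative cone via PAV, then undo sorting and restore signs."""
    abs_v = np.abs(v)
    order = np.argsort(-abs_v)                     # indices sorting |v| decreasingly
    u = abs_v[order] - w
    # decreasing isotonic regression = PAV projection onto the monotone cone
    iso = IsotonicRegression(increasing=False).fit_transform(np.arange(len(v)), u)
    z = np.maximum(iso, 0.0)
    out = np.zeros_like(v, dtype=float)
    out[order] = z                                 # undo the sort
    return np.sign(v) * out

x = np.array([0.3, -2.0, 1.1, 0.0])
w = np.array([1.0, 0.8, 0.5, 0.2])
print(owl_norm(x, w))        # 2.0*1.0 + 1.1*0.8 + 0.3*0.5 + 0.0*0.2 = 3.03
print(owl_prox(x, w))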


[Zheng15]

21

Page 22: Estimating structured vector autoregressive models

• -

VAR

• i.i.d.

• :


22

Page 23: Estimating structured vector autoregressive models

• •

• •

• -

• [Negahban09]

• ARX

23