
Lecture 15: Matrix deviation inequality

May 29 - June 3, 2020


$A$ is an $m \times n$ random matrix whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$.

(If you find it helpful to think in terms of concrete examples, let the entries of $A$ be independent $N(0,1)$ random variables.)

For a fixed vector $x \in \mathbb{R}^n$ we have
\[
\mathbb{E}\|Ax\|_2^2
= \mathbb{E}\sum_{j=1}^{m}(A_j x)^2
= \sum_{j=1}^{m}\mathbb{E}(A_j x)^2
= \sum_{j=1}^{m} x^T\,\mathbb{E}(A_j^T A_j)\,x
= m\|x\|_2^2,
\]
where $A_j$ denotes the $j$-th row of $A$ and the last step uses isotropy, $\mathbb{E}(A_j^T A_j) = I_n$.

Further, if we assume that concentration about the mean holds here (and in fact it does), we should expect that
\[
\|Ax\|_2 \approx \sqrt{m}\,\|x\|_2
\]
with high probability.
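As a quick sanity check, here is a minimal numerical sketch (assuming NumPy; the dimensions and the test vector are arbitrary choices, not from the lecture) of the identity $\mathbb{E}\|Ax\|_2^2 = m\|x\|_2^2$ and of the concentration of $\|Ax\|_2$ around $\sqrt{m}\,\|x\|_2$ for a Gaussian matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 200, 50, 2000

x = rng.standard_normal(n)               # an arbitrary fixed test vector
norms_sq = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((m, n))      # rows are independent isotropic N(0, I_n) vectors
    norms_sq[t] = np.sum((A @ x) ** 2)

print("empirical E ||Ax||_2^2 :", norms_sq.mean())
print("m * ||x||_2^2          :", m * np.sum(x ** 2))
# concentration: ||Ax||_2 should hover around sqrt(m) * ||x||_2
print("relative spread of ||Ax||_2 :", np.sqrt(norms_sq).std() / (np.sqrt(m) * np.linalg.norm(x)))
```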


Similarly to the Johnson-Lindenstrauss Lemma, our next goal is to make the last formula hold simultaneously over all vectors $x$ in some fixed set $T \subset \mathbb{R}^n$. Precisely, we may ask: how large is the average uniform deviation
\[
\mathbb{E}\sup_{x \in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr|\,?
\]
This quantity should clearly depend on some notion of the size of $T$: the larger $T$, the larger the uniform deviation should be. So how can we quantify the size of $T$ for this problem? In the next section we will do precisely this: introduce a convenient geometric measure of the sizes of sets in $\mathbb{R}^n$, which is called Gaussian width.


1 Gaussian width

Definition. Let $T \subset \mathbb{R}^n$ be a bounded set and $g$ be a standard normal random vector in $\mathbb{R}^n$, i.e. $g \sim N(0, I_n)$. Then the quantities
\[
\omega(T) = \mathbb{E}\sup_{x \in T}\langle g, x\rangle
\quad \text{and} \quad
\gamma(T) = \mathbb{E}\sup_{x \in T}|\langle g, x\rangle|
\]
are called the Gaussian width of $T$ and the Gaussian complexity of $T$, respectively.

Gaussian width and Gaussian complexity are closely related. Indeed (Exercise),
\[
2\omega(T) = \omega(T - T)
= \mathbb{E}\sup_{x,y \in T}\langle g, x - y\rangle
= \mathbb{E}\sup_{x,y \in T}|\langle g, x - y\rangle|
= \gamma(T - T).
\]


Gaussian width has a natural geometric interpretation. Suppose $g$ is a unit vector in $\mathbb{R}^n$. Then a moment's thought reveals that $\sup_{x,y \in T}\langle g, x - y\rangle$ is simply the width of $T$ in the direction of $g$, i.e. the distance between the two hyperplanes with normal $g$ that touch $T$ on both sides, as shown in the figure. Then $2\omega(T)$ can be obtained by averaging the width of $T$ over all directions $g$ in $\mathbb{R}^n$.


This reasoning is valid except where we assumed that $g$ is a unit vector. Instead, for $g \sim N(0, I_n)$ we have $\mathbb{E}\|g\|_2^2 = n$ and
\[
\|g\|_2 \approx \sqrt{n} \quad \text{with high probability.}
\]
(Check both these claims using Bernstein's inequality.) Thus we need to scale by the factor $\sqrt{n}$. Ultimately, the geometric interpretation of the Gaussian width becomes the following: $\omega(T)$ is approximately $\sqrt{n}/2$ times the usual geometric width of $T$ averaged over all directions.

A good exercise is to compute the Gaussian width and complexity for some simple sets, such as the unit balls of the $\ell_p$ norms in $\mathbb{R}^n$, which we denote by $B_p^n = \{x \in \mathbb{R}^n : \|x\|_p \le 1\}$. We have
\[
\gamma(B_2^n) \sim \sqrt{n}, \qquad \gamma(B_1^n) \sim \sqrt{\log n}.
\]
For any finite set $T \subset B_2^n$ we have $\gamma(T) \lesssim \sqrt{\log|T|}$. The same holds for the Gaussian width $\omega(T)$.
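These Gaussian complexities are easy to estimate by Monte Carlo, since $\sup_{x \in B_2^n}|\langle g, x\rangle| = \|g\|_2$ and $\sup_{x \in B_1^n}|\langle g, x\rangle| = \|g\|_\infty$. A minimal sketch (NumPy assumed; the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, samples = 400, 5000
G = rng.standard_normal((samples, n))        # rows are independent draws of g ~ N(0, I_n)

gamma_B2 = np.linalg.norm(G, axis=1).mean()  # E sup_{x in B_2^n} |<g, x>| = E ||g||_2
gamma_B1 = np.abs(G).max(axis=1).mean()      # E sup_{x in B_1^n} |<g, x>| = E ||g||_inf

print("gamma(B_2^n) estimate:", gamma_B2, "  reference sqrt(n) =", np.sqrt(n))
print("gamma(B_1^n) estimate:", gamma_B1, "  reference sqrt(2 log n) =", np.sqrt(2 * np.log(n)))
```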


2 Matrix deviation inequality

Theorem 1 (Matrix deviation inequality)

Let $A$ be an $m \times n$ matrix whose rows $A_i$ are independent, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$. Let $T \subset \mathbb{R}^n$ be a fixed bounded set. Then
\[
\mathbb{E}\sup_{x \in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr| \le C K^2\gamma(T),
\]
where
\[
K = \max_i \|A_i\|_{\psi_2}
\]
is the maximal sub-gaussian norm of the rows of $A$.
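The theorem can be illustrated numerically. The sketch below (my own illustration, assuming NumPy, Gaussian rows, and $T$ a finite set of random unit vectors) compares the empirical uniform deviation with a Monte Carlo estimate of $\gamma(T)$; the constant $CK^2$ is not tracked, only the proportionality is shown:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, card_T = 300, 60, 40

# T: a finite set of unit vectors, stored as the columns of X
X = rng.standard_normal((n, card_T))
X /= np.linalg.norm(X, axis=0)

# Monte Carlo estimate of the Gaussian complexity gamma(T)
G = rng.standard_normal((5000, n))
gamma_T = np.abs(G @ X).max(axis=1).mean()

# empirical E sup_{x in T} | ||Ax||_2 - sqrt(m) ||x||_2 |  (here ||x||_2 = 1 for all x in T)
devs = []
for _ in range(200):
    A = rng.standard_normal((m, n))
    devs.append(np.abs(np.linalg.norm(A @ X, axis=0) - np.sqrt(m)).max())

print("expected uniform deviation ~", np.mean(devs))
print("gamma(T) estimate          ~", gamma_T)   # the deviation should be at most C*K^2 times this
```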

2.1 Tail bound

It is often useful to have results that hold with high probability rather than in expectation. There exists a high-probability version of the matrix deviation inequality, and it states the following.


Let $u \ge 0$. Then the event
\[
\sup_{x \in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr| \le C K^2\bigl[\gamma(T) + u\cdot\mathrm{rad}(T)\bigr]
\]
holds with probability at least $1 - 2\exp(-u^2)$. Here $\mathrm{rad}(T)$ is the radius of $T$, defined as
\[
\mathrm{rad}(T) = \sup_{x \in T}\|x\|_2.
\]

Since $\mathrm{rad}(T) \lesssim \gamma(T)$, we can simplify the bound to
\[
\sup_{x \in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr| \lesssim K^2 u\,\gamma(T)
\]
for all $u \ge 1$, with probability at least $1 - 2\exp(-u^2)$. This is a weaker, but still useful, inequality. For example, we can use it to bound all higher moments of the deviation:
\[
\Bigl(\mathbb{E}\sup_{x \in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr|^p\Bigr)^{1/p} \le C_p K^2\gamma(T),
\]
where $C_p \le C\sqrt{p}$ for $p \ge 1$.


2.2 Deviation of squares

It is sometimes helpful to bound the deviation of the square $\|Ax\|_2^2$, rather than $\|Ax\|_2$ itself.

We can easily deduce the deviation of squares by using the identity
\[
a^2 - b^2 = (a - b)^2 + 2b(a - b)
\]
for $a = \|Ax\|_2$ and $b = \sqrt{m}\,\|x\|_2$. We have
\[
\mathbb{E}\sup_{x \in T}\bigl|\,\|Ax\|_2^2 - m\|x\|_2^2\,\bigr| \le C K^4\gamma(T)^2 + C K^2\sqrt{m}\,\mathrm{rad}(T)\,\gamma(T).
\]
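To spell out the algebra (a sketch; constants are not tracked): write $D = \sup_{x \in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr|$ and note that $\sqrt{m}\,\|x\|_2 \le \sqrt{m}\,\mathrm{rad}(T)$ for $x \in T$. The identity then gives
\[
\sup_{x \in T}\bigl|\,\|Ax\|_2^2 - m\|x\|_2^2\,\bigr| \le D^2 + 2\sqrt{m}\,\mathrm{rad}(T)\,D,
\]
and taking expectations, using $\mathbb{E}D^2 \lesssim K^4\gamma(T)^2$ (the $p = 2$ moment bound above) and $\mathbb{E}D \lesssim K^2\gamma(T)$, yields the stated inequality.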


2.3 Deriving the Johnson-Lindenstrauss Lemma

Let $\mathcal{X} \subset \mathbb{R}^n$ and $T = \bigl\{(x - y)/\|x - y\|_2 : x, y \in \mathcal{X},\ x \ne y\bigr\}$. Then $T$ is finite and we have
\[
\gamma(T) \lesssim \sqrt{\log|T|} \le \sqrt{\log|\mathcal{X}|^2} \lesssim \sqrt{\log|\mathcal{X}|}.
\]

The matrix deviation inequality and $m \ge C\varepsilon^{-2}\log N$ (with $N = |\mathcal{X}|$) then yield
\[
\sup_{x,y \in \mathcal{X}}\Bigl|\,\frac{\|A(x - y)\|_2}{\|x - y\|_2} - \sqrt{m}\,\Bigr| \lesssim \sqrt{\log N} \le \varepsilon\sqrt{m}
\]
with high probability, say $0.99$. Multiplying both sides by $\|x - y\|_2/\sqrt{m}$, we can write the last bound as follows. With probability at least $0.99$, we have
\[
(1 - \varepsilon)\|x - y\|_2 \le \frac{1}{\sqrt{m}}\|Ax - Ay\|_2 \le (1 + \varepsilon)\|x - y\|_2
\]
for all $x, y \in \mathcal{X}$.
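Here is a small numerical illustration of this Johnson-Lindenstrauss-type statement (NumPy assumed; the point cloud, $\varepsilon$, and the constant in the choice of $m$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, eps = 1000, 50, 0.3
m = int(np.ceil(8 * np.log(N) / eps**2))      # m ~ eps^{-2} log N; the constant 8 is an ad hoc choice

X = rng.standard_normal((N, n))               # N points in R^n
A = rng.standard_normal((m, n))               # Gaussian measurement matrix
Y = (X @ A.T) / np.sqrt(m)                    # projected points (1/sqrt(m)) A x

# compare all pairwise distances before and after projection
ratios = []
for i in range(N):
    for j in range(i + 1, N):
        ratios.append(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]))
ratios = np.array(ratios)
print("min / max distance ratio:", ratios.min(), ratios.max())  # should lie in [1 - eps, 1 + eps]
```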


3 Covariance estimation

We already showed that $N \sim n\log n$ samples are enough to estimate the covariance matrix of a general distribution in $\mathbb{R}^n$.

We can do better if the distribution is sub-gaussian: we can get rid of the logarithmic oversampling and of the boundedness condition.

Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Suppose $X$ is sub-gaussian, and more specifically, for any $x \in \mathbb{R}^n$,
\[
\|\langle X, x\rangle\|_{\psi_2} \lesssim \|\langle X, x\rangle\|_{L^2} = \|\Sigma^{1/2}x\|_2.
\]
Then for every $N \ge 1$ we have
\[
\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\Bigl(\sqrt{\frac{n}{N}} + \frac{n}{N}\Bigr),
\]
where $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_iX_i^T$ is the sample covariance matrix formed from independent samples $X_1, \ldots, X_N$ distributed as $X$.

This result shows that $N \sim \varepsilon^{-2}n$ samples give $\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \varepsilon\|\Sigma\|$.


Proof. We first bring the random vectors $X, X_1, \ldots, X_N$ to the isotropic position. This can be done by a suitable linear transformation: you can easily check that there exist isotropic random vectors $Z, Z_1, \ldots, Z_N$ such that
\[
X = \Sigma^{1/2}Z, \qquad X_i = \Sigma^{1/2}Z_i, \quad i = 1, \ldots, N.
\]
The sub-gaussian assumption implies that
\[
\|Z\|_{\psi_2} \lesssim 1.
\]
Then
\[
\|\Sigma_N - \Sigma\| = \|\Sigma^{1/2}R_N\Sigma^{1/2}\| = \max_{\|x\|_2 = 1}\bigl|\langle \Sigma^{1/2}R_N\Sigma^{1/2}x,\, x\rangle\bigr|,
\]
where
\[
R_N = \frac{1}{N}\sum_{i=1}^N Z_iZ_i^T - I_n.
\]


Let $T = \{\Sigma^{1/2}x : x \in \mathbb{R}^n,\ \|x\|_2 = 1\}$. We can rewrite $\|\Sigma_N - \Sigma\|$ as
\[
\|\Sigma_N - \Sigma\| = \max_{x \in T}\bigl|\langle R_N x, x\rangle\bigr|
= \max_{x \in T}\Bigl|\,\frac{1}{N}\sum_{i=1}^N\langle Z_i, x\rangle^2 - \|x\|_2^2\,\Bigr|
= \frac{1}{N}\max_{x \in T}\bigl|\,\|Ax\|_2^2 - N\|x\|_2^2\,\bigr|,
\]
where $A$ denotes the $N \times n$ random matrix whose rows are $Z_i^T$.

Now apply the matrix deviation inequality for squares (with $m = N$ and $K \lesssim 1$) to conclude that
\[
\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N}\Bigl(\gamma(T)^2 + \sqrt{N}\,\mathrm{rad}(T)\,\gamma(T)\Bigr).
\]
The radius and Gaussian width of the ellipsoid $T$ are easy to compute:
\[
\mathrm{rad}(T) = \|\Sigma\|^{1/2} \quad \text{and} \quad \gamma(T) \le (\mathrm{tr}\,\Sigma)^{1/2}.
\]

Using $\mathrm{tr}(\Sigma) \le n\|\Sigma\|$, we obtain
\[
\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\Bigl(\sqrt{\frac{n}{N}} + \frac{n}{N}\Bigr).
\]


3.1 Low-dimensional distributions

We can show that many fewer samples are needed for covariance estimation of low-dimensional sub-gaussian distributions. Indeed, the proof above actually yields
\[
\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\Bigl(\sqrt{\frac{r}{N}} + \frac{r}{N}\Bigr),
\]
where
\[
r = r(\Sigma^{1/2}) = \frac{\mathrm{tr}\,\Sigma}{\|\Sigma\|}
\]
is the stable rank of $\Sigma^{1/2}$. This means that covariance estimation is possible with
\[
N \sim r
\]
samples.
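A small simulation consistent with this bound (NumPy assumed; the spectrum of $\Sigma$ is an arbitrary choice, picked so that the stable rank $r$ is much smaller than $n$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
# covariance with a fast-decaying spectrum, so the stable rank r = tr(Sigma)/||Sigma|| is small
spectrum = 1.0 / np.arange(1, n + 1) ** 2
Sigma = np.diag(spectrum)
sqrt_Sigma = np.diag(np.sqrt(spectrum))
r = spectrum.sum() / spectrum.max()
print("stable rank r =", r)

for N in (50, 200, 800):
    errs = []
    for _ in range(50):
        Z = rng.standard_normal((N, n))       # isotropic (Gaussian, hence sub-gaussian) samples
        X = Z @ sqrt_Sigma                    # X_i = Sigma^{1/2} Z_i
        Sigma_N = X.T @ X / N                 # sample covariance matrix
        errs.append(np.linalg.norm(Sigma_N - Sigma, 2))
    bound = spectrum.max() * (np.sqrt(r / N) + r / N)
    print(f"N={N}:  E||Sigma_N - Sigma|| ~ {np.mean(errs):.4f}   ||Sigma||(sqrt(r/N) + r/N) = {bound:.4f}")
```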


4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system of linear equations: say, we have $m$ equations in $n \gg m$ variables,
\[
Ax = y.
\]
When the linear system is underdetermined, we cannot find $x$ with any accuracy unless we know something extra about $x$. So let us assume that we do have some a priori information. We can describe this situation mathematically by assuming that
\[
x \in K,
\]
where $K \subset \mathbb{R}^n$ is some known set that describes anything we know about $x$ a priori.

Summarizing, here is the problem we are trying to solve. Determine a solution $\widehat{x} = \widehat{x}(A, y, K)$ to the underdetermined linear equation $Ax = y$ as accurately as possible, assuming that $x \in K$.


4.1 An optimization approach

We convert the set $K$ into a function on $\mathbb{R}^n$ which is called the Minkowski functional of $K$. This is basically a function whose level sets are multiples of $K$.

To define it formally, assume that $K$ is star-shaped, which means that together with any point $x$, the set $K$ must contain the entire interval that connects $x$ with the origin; see the figure for an illustration.

[Figure: the set on the left (whose boundary is shown) is star-shaped; the set on the right is not.]


The Minkowski functional of $K$ is defined as
\[
\|x\|_K = \inf\{t > 0 : x/t \in K\}, \qquad x \in \mathbb{R}^n.
\]
If the set $K$ is convex and symmetric about the origin, then $\|x\|_K$ is actually a norm on $\mathbb{R}^n$ (Exercise).
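The definition translates directly into a numerical routine: for a bounded star-shaped $K$ given by a membership oracle, $\|x\|_K$ can be approximated by bisection in $t$. A minimal sketch (my own illustration, not part of the lecture; the oracle for a scaled $\ell_1$ ball is an arbitrary example):

```python
import numpy as np

def minkowski_functional(x, in_K, t_max=1e6, tol=1e-9):
    """Approximate ||x||_K = inf{t > 0 : x/t in K} by bisection.

    Assumes K is star-shaped (so t -> indicator(x/t in K) is monotone)
    and that x/t_max already lies in K.  `in_K` is a membership oracle.
    """
    if not in_K(x / t_max):
        raise ValueError("increase t_max: x/t_max is not in K")
    lo, hi = 0.0, t_max
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        if in_K(x / mid):
            hi = mid      # x/mid is in K, so ||x||_K <= mid
        else:
            lo = mid      # x/mid is outside K, so ||x||_K > mid
    return hi

# example: K = 2 * B_1^n (convex and symmetric), for which ||x||_K = ||x||_1 / 2
x = np.array([1.0, -2.0, 0.5])
print(minkowski_functional(x, lambda v: np.abs(v).sum() <= 2.0))  # approx 1.75
print(np.abs(x).sum() / 2.0)                                      # exact value
```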

Now we propose the following way to solve the recovery problem: solve the optimization program
\[
\min \|x'\|_K \quad \text{subject to} \quad y = Ax'.
\]
Note that this is a very natural program: it looks at all solutions of the equation $y = Ax'$ and tries to "shrink" the solution $x'$ toward $K$. (This is what minimization of the Minkowski functional is about.)

Also note that if $K$ is convex, this is a convex optimization program, and thus can be solved effectively by one of the many available numerical algorithms.


The main question we should now be asking is: does the solution of this program approximate the original vector $x$? The following result bounds the approximation error for a probabilistic model of linear equations.

Theorem 3 (Recovery by optimization)

Assume that $A$ is an $m \times n$ random matrix whose rows $A_i$ are independent, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$. The solution $\widehat{x}$ of the optimization program satisfies
\[
\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \frac{\omega(K)}{\sqrt{m}},
\]
where $\omega(K)$ is the Gaussian width of $K$.

Proof. Both the original vector $x$ and the solution $\widehat{x}$ are feasible vectors for the optimization program. Then $\|\widehat{x}\|_K \le \|x\|_K \le 1$, the second inequality holding since $x \in K$. Thus both $\widehat{x}, x \in K$.


Since $A\widehat{x} = Ax = y$, we have $A(\widehat{x} - x) = 0$. Let us apply the matrix deviation inequality for $T = K - K$. It gives
\[
\mathbb{E}\sup_{u,v \in K}\bigl|\,\|A(u - v)\|_2 - \sqrt{m}\,\|u - v\|_2\,\bigr| \lesssim \gamma(T) = 2\omega(K).
\]
Substitute $u = \widehat{x}$ and $v = x$ here; we may do this since, as we noted above, both these vectors belong to $K$. But then the term $\|A(u - v)\|_2$ equals zero. It disappears from the bound and we get
\[
\mathbb{E}\,\sqrt{m}\,\|\widehat{x} - x\|_2 \lesssim \omega(K).
\]
Dividing both sides by $\sqrt{m}$, we complete the proof.

This theorem says that a signal $x \in K$ can be efficiently recovered from
\[
m \sim \omega(K)^2
\]
random linear measurements.


5 Sparse recovery

Suppose we know that the signal $x$ is sparse, which means that only a few coordinates of $x$ are nonzero. As before, our task is to recover $x$ from the random linear measurements given by the vector
\[
y = Ax,
\]
where $A$ is an $m \times n$ random matrix.

The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of $x$, is often denoted $\|x\|_0$. This is reminiscent of the notation for the $\ell_p$ norm $\|x\|_p = \bigl(\sum_{i=1}^n |x_i|^p\bigr)^{1/p}$, and for a reason: you can quickly check that
\[
\|x\|_0 = \lim_{p \to 0^+}\|x\|_p^p.
\]
Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ is actually a norm on $\mathbb{R}^n$: $\|\cdot\|_0$ fails homogeneity, and $\|\cdot\|_p$ with $0 < p < 1$ fails the triangle inequality.


Our first attempt to recover $x$ is to try the following optimization problem:
\[
\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'.
\]
This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous. There is simply no known algorithm to solve this optimization problem efficiently.

To overcome this difficulty, we use instead
\[
\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'.
\]
This is a convexification of the non-convex program, and a variety of numerical convex optimization methods are available to solve it efficiently (one such route, via linear programming, is sketched below).
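One standard way to solve this $\ell_1$ program is to recast it as a linear program via the split $x' = x^+ - x^-$ with $x^\pm \ge 0$. A minimal sketch (assuming NumPy and SciPy; the problem sizes and sparsity are arbitrary):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
m, n, s = 60, 200, 5

# s-sparse ground truth and Gaussian measurements y = A x
x_true = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_true[support] = rng.standard_normal(s)
A = rng.standard_normal((m, n))
y = A @ x_true

# min ||x'||_1  s.t.  A x' = y, as a linear program in (x+, x-) >= 0:
#     min  1^T (x+ + x-)   s.t.   [A, -A] [x+; x-] = y
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:]

# with m on the order of s log n, recovery is typically (near-)exact
print("recovery error ||x_hat - x||_2 =", np.linalg.norm(x_hat - x_true))
```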

We will now show that an $s$-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from $m \sim s\log n$ random linear measurements.


Theorem 4 (Sparse recovery by optimization)

Assume $A$ is a random matrix as in Theorem 3. If an unknown vector $x \in \mathbb{R}^n$ has at most $s$ non-zero coordinates, i.e. $\|x\|_0 \le s$, then the solution $\widehat{x}$ of the $\ell_1$ optimization program satisfies
\[
\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \sqrt{\frac{s\log n}{m}}\,\|x\|_2.
\]

Proof. The Cauchy-Schwarz inequality, applied to the at most $s$ nonzero coordinates of $x$, shows that $\|x\|_1 \le \sqrt{s}\,\|x\|_2$. Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$. Then we can rewrite $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ as the inclusion
\[
x \in \sqrt{s}\,\|x\|_2 \cdot B_1^n =: K.
\]
By the Gaussian width of $B_1^n$ computed earlier, we have
\[
\omega(K) = \sqrt{s}\,\|x\|_2\cdot\omega(B_1^n) \le \sqrt{s}\,\|x\|_2\cdot\gamma(B_1^n) \lesssim \sqrt{s}\,\|x\|_2\cdot\sqrt{\log n}.
\]
Substituting this into Theorem 3 completes the proof.


Page 2: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

A is an mtimes n random matrix whose rows are independent meanzero isotropic and sub-gaussian random vectors in Rn

(If you find it helpful to think in terms of concrete examples letthe entries of A be independent N (0 1) random variables)

For a fixed vector x isin Rn we have

E983042Ax98304222 = Em983131

j=1

(Ajx)2 =

m983131

j=1

E(Ajx)2

=

m983131

j=1

xTE(ATjAj)x = m983042x98304222

Further if we assume that concentration about the mean holdshere (and in fact it does) we should expect that

983042Ax9830422 asympradicm983042x9830422

with high probability

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 2 22

Similarly to Johnson-Lindenstrauss Lemma our next goal is tomake the last formula hold simultaneously over all vectors x insome fixed set T sub Rn Precisely we may ask ndash how large is theaverage uniform deviation

E supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055

This quantity should clearly depend on some notion of the size ofT the larger T the larger should the uniform deviation be Sohow can we quantify the size of T for this problem In the nextsection we will do precisely this introduce a convenient geometricmeasure of the sizes of sets in Rn which is called Gaussian width

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 3 22

1 Gaussian width

Definition Let T sub Rn be a bounded set and g be a standardnormal random vector in Rn ie g sim N (0 In) Then thequantities

ω(T ) = E supxisinT

〈gx〉 and γ(T ) = E supxisinT

|〈gx〉|

are called the Gaussian width of T and the Gaussian complexity ofT respectively

Gaussian width and Gaussian complexity are closely relatedIndeed (Exercise)

2ω(T ) = ω(T minus T ) = E supxyisinT

〈gxminus y〉

= E supxyisinT

|〈gxminus y〉|

= γ(T minus T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 4 22

Gaussian width has a natural geometric interpretation Suppose gis a unit vector in Rn Then a momentrsquos thought reveals thatsupxyisinT 〈gxminus y〉 is simply the width of T in the direction of gie the distance between the two hyperplanes with normal g thattouch T on both sides as shown in the figure Then 2ω(T ) can beobtained by averaging the width of T over all directions g in Rn

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 5 22

This reasoning is valid except where we assumed that g is a unitvector Instead for g sim N (0 In) we have E983042g9830422 = n and

983042g9830422 asympradicn with high probability

(Check both these claims using Bernsteinrsquos inequality) Thus weneed to scale by the factor

radicn Ultimately the geometric

interpretation of the Gaussian width becomes the following ω(T )is approximately

radicn2 larger than the usual geometric width of

T averaged over all directions

A good exercise is to compute the Gaussian width and complexityfor some simple sets such as the unit balls of the ℓp norms in Rnwhich we denote by Bn

p = x isin Rn 983042x983042p le 1 We have

γ(Bn2 ) sim

radicn γ(Bn

1 ) sim983155

log n

For any finite set T sub Bn2 we have γ(T ) ≲

983155log |T | The same

holds for Gaussian width ω(T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 6 22

2 Matrix deviation inequality

Theorem 1 (Matrix deviation inequality)

Let A be an mtimes n matrix whose rows Ai are independent isotropicand sub-gaussian random vectors in Rn Let T sub Rn be a fixed boundedset Then

E supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 le CK2γ(T )

whereK = max

i983042Ai983042ψ2

is the maximal sub-gaussian norm of the rows of A

21 Tail bound

It is often useful to have results that hold with high probabilityrather than in expectation There exists a high-probability versionof the matrix deviation inequality and it states the following

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 7 22

Let u ge 0 Then the event

supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 983249 CK2[γ(T ) + u middot rad(T )]

holds with probability at least 1minus 2 exp(minusu2) Here rad(T ) is theradius of T defined as

rad(T ) = supxisinT

983042x9830422

Since rad(T ) ≲ γ(T ) we improve the bound as

supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 ≲ K2uγ(T )

for all u ge 1 This is a weaker but still a useful inequality Forexample we can use it to bound all higher moments of thedeviation

983061E sup

xisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055p9830621p

983249 CpK2γ(T )

where Cp le Cradicp for p ge 1

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 8 22

22 Deviation of squares

It is sometimes helpful to bound the deviation of the square983042Ax98304222 rather than 983042Ax9830422 itself

We can easily deduce the deviation of squares by using the identity

a2 minus b2 = (aminus b)2 + 2b(bminus a)

for a = 983042Ax9830422 and b =radicm983042x9830422

We have

E supxisinT

983055983055983042Ax98304222 minusm983042x98304222983055983055 le CK4γ(T )2 + CK2radicmrad(T )γ(T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 9 22

23 Deriving Johnson-Lindenstrauss Lemma

X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have

γ(T ) ≲983155

log |T | le983155

log |X |2 ≲983155

log |X |

Matrix deviation inequality and m ge Cεminus2 logN then yield

supxyisinX

983055983055983055983055983042A(xminus y)9830422983042xminus y9830422

minusradicm

983055983055983055983055 ≲983155

logN le εradicm

with high probability say 099 Multiplying both sides by983042xminus y9830422

radicm we can write the last bound as follows With

probability at least 099 we have

(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422

for all xy isin X

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22

3 Covariance estimation

We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn

We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition

Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn

983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422

Then for every N ge 1 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22

Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that

X = Σ12Z Xi = Σ12Zi i = 1 N

The sub-gaussian assumption implies that

983042Z983042ψ2 ≲ 1

Then

983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1

〈Σ12RNΣ12xx〉

where

RN =1

N

N983131

i=1

ZiZTi minus In

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 3: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

Similarly to Johnson-Lindenstrauss Lemma our next goal is tomake the last formula hold simultaneously over all vectors x insome fixed set T sub Rn Precisely we may ask ndash how large is theaverage uniform deviation

E supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055

This quantity should clearly depend on some notion of the size ofT the larger T the larger should the uniform deviation be Sohow can we quantify the size of T for this problem In the nextsection we will do precisely this introduce a convenient geometricmeasure of the sizes of sets in Rn which is called Gaussian width

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 3 22

1 Gaussian width

Definition Let T sub Rn be a bounded set and g be a standardnormal random vector in Rn ie g sim N (0 In) Then thequantities

ω(T ) = E supxisinT

〈gx〉 and γ(T ) = E supxisinT

|〈gx〉|

are called the Gaussian width of T and the Gaussian complexity ofT respectively

Gaussian width and Gaussian complexity are closely relatedIndeed (Exercise)

2ω(T ) = ω(T minus T ) = E supxyisinT

〈gxminus y〉

= E supxyisinT

|〈gxminus y〉|

= γ(T minus T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 4 22

Gaussian width has a natural geometric interpretation Suppose gis a unit vector in Rn Then a momentrsquos thought reveals thatsupxyisinT 〈gxminus y〉 is simply the width of T in the direction of gie the distance between the two hyperplanes with normal g thattouch T on both sides as shown in the figure Then 2ω(T ) can beobtained by averaging the width of T over all directions g in Rn

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 5 22

This reasoning is valid except where we assumed that g is a unitvector Instead for g sim N (0 In) we have E983042g9830422 = n and

983042g9830422 asympradicn with high probability

(Check both these claims using Bernsteinrsquos inequality) Thus weneed to scale by the factor

radicn Ultimately the geometric

interpretation of the Gaussian width becomes the following ω(T )is approximately

radicn2 larger than the usual geometric width of

T averaged over all directions

A good exercise is to compute the Gaussian width and complexityfor some simple sets such as the unit balls of the ℓp norms in Rnwhich we denote by Bn

p = x isin Rn 983042x983042p le 1 We have

γ(Bn2 ) sim

radicn γ(Bn

1 ) sim983155

log n

For any finite set T sub Bn2 we have γ(T ) ≲

983155log |T | The same

holds for Gaussian width ω(T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 6 22

2 Matrix deviation inequality

Theorem 1 (Matrix deviation inequality)

Let A be an mtimes n matrix whose rows Ai are independent isotropicand sub-gaussian random vectors in Rn Let T sub Rn be a fixed boundedset Then

E supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 le CK2γ(T )

whereK = max

i983042Ai983042ψ2

is the maximal sub-gaussian norm of the rows of A

21 Tail bound

It is often useful to have results that hold with high probabilityrather than in expectation There exists a high-probability versionof the matrix deviation inequality and it states the following

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 7 22

Let u ge 0 Then the event

supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 983249 CK2[γ(T ) + u middot rad(T )]

holds with probability at least 1minus 2 exp(minusu2) Here rad(T ) is theradius of T defined as

rad(T ) = supxisinT

983042x9830422

Since rad(T ) ≲ γ(T ) we improve the bound as

supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 ≲ K2uγ(T )

for all u ge 1 This is a weaker but still a useful inequality Forexample we can use it to bound all higher moments of thedeviation

983061E sup

xisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055p9830621p

983249 CpK2γ(T )

where Cp le Cradicp for p ge 1

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 8 22

22 Deviation of squares

It is sometimes helpful to bound the deviation of the square983042Ax98304222 rather than 983042Ax9830422 itself

We can easily deduce the deviation of squares by using the identity

a2 minus b2 = (aminus b)2 + 2b(bminus a)

for a = 983042Ax9830422 and b =radicm983042x9830422

We have

E supxisinT

983055983055983042Ax98304222 minusm983042x98304222983055983055 le CK4γ(T )2 + CK2radicmrad(T )γ(T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 9 22

23 Deriving Johnson-Lindenstrauss Lemma

X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have

γ(T ) ≲983155

log |T | le983155

log |X |2 ≲983155

log |X |

Matrix deviation inequality and m ge Cεminus2 logN then yield

supxyisinX

983055983055983055983055983042A(xminus y)9830422983042xminus y9830422

minusradicm

983055983055983055983055 ≲983155

logN le εradicm

with high probability say 099 Multiplying both sides by983042xminus y9830422

radicm we can write the last bound as follows With

probability at least 099 we have

(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422

for all xy isin X

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22

3 Covariance estimation

We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn

We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition

Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn

983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422

Then for every N ge 1 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22

Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that

X = Σ12Z Xi = Σ12Zi i = 1 N

The sub-gaussian assumption implies that

983042Z983042ψ2 ≲ 1

Then

983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1

〈Σ12RNΣ12xx〉

where

RN =1

N

N983131

i=1

ZiZTi minus In

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 4: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

1 Gaussian width

Definition Let T sub Rn be a bounded set and g be a standardnormal random vector in Rn ie g sim N (0 In) Then thequantities

ω(T ) = E supxisinT

〈gx〉 and γ(T ) = E supxisinT

|〈gx〉|

are called the Gaussian width of T and the Gaussian complexity ofT respectively

Gaussian width and Gaussian complexity are closely relatedIndeed (Exercise)

2ω(T ) = ω(T minus T ) = E supxyisinT

〈gxminus y〉

= E supxyisinT

|〈gxminus y〉|

= γ(T minus T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 4 22

Gaussian width has a natural geometric interpretation Suppose gis a unit vector in Rn Then a momentrsquos thought reveals thatsupxyisinT 〈gxminus y〉 is simply the width of T in the direction of gie the distance between the two hyperplanes with normal g thattouch T on both sides as shown in the figure Then 2ω(T ) can beobtained by averaging the width of T over all directions g in Rn

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 5 22

This reasoning is valid except where we assumed that g is a unitvector Instead for g sim N (0 In) we have E983042g9830422 = n and

983042g9830422 asympradicn with high probability

(Check both these claims using Bernsteinrsquos inequality) Thus weneed to scale by the factor

radicn Ultimately the geometric

interpretation of the Gaussian width becomes the following ω(T )is approximately

radicn2 larger than the usual geometric width of

T averaged over all directions

A good exercise is to compute the Gaussian width and complexityfor some simple sets such as the unit balls of the ℓp norms in Rnwhich we denote by Bn

p = x isin Rn 983042x983042p le 1 We have

γ(Bn2 ) sim

radicn γ(Bn

1 ) sim983155

log n

For any finite set T sub Bn2 we have γ(T ) ≲

983155log |T | The same

holds for Gaussian width ω(T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 6 22

2 Matrix deviation inequality

Theorem 1 (Matrix deviation inequality)

Let A be an mtimes n matrix whose rows Ai are independent isotropicand sub-gaussian random vectors in Rn Let T sub Rn be a fixed boundedset Then

E supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 le CK2γ(T )

whereK = max

i983042Ai983042ψ2

is the maximal sub-gaussian norm of the rows of A

21 Tail bound

It is often useful to have results that hold with high probabilityrather than in expectation There exists a high-probability versionof the matrix deviation inequality and it states the following

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 7 22

Let u ge 0 Then the event

supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 983249 CK2[γ(T ) + u middot rad(T )]

holds with probability at least 1minus 2 exp(minusu2) Here rad(T ) is theradius of T defined as

rad(T ) = supxisinT

983042x9830422

Since rad(T ) ≲ γ(T ) we improve the bound as

supxisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055 ≲ K2uγ(T )

for all u ge 1 This is a weaker but still a useful inequality Forexample we can use it to bound all higher moments of thedeviation

983061E sup

xisinT

983055983055983042Ax9830422 minusradicm983042x9830422

983055983055p9830621p

983249 CpK2γ(T )

where Cp le Cradicp for p ge 1

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 8 22

22 Deviation of squares

It is sometimes helpful to bound the deviation of the square983042Ax98304222 rather than 983042Ax9830422 itself

We can easily deduce the deviation of squares by using the identity

a2 minus b2 = (aminus b)2 + 2b(bminus a)

for a = 983042Ax9830422 and b =radicm983042x9830422

We have

E supxisinT

983055983055983042Ax98304222 minusm983042x98304222983055983055 le CK4γ(T )2 + CK2radicmrad(T )γ(T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 9 22

23 Deriving Johnson-Lindenstrauss Lemma

X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have

γ(T ) ≲983155

log |T | le983155

log |X |2 ≲983155

log |X |

Matrix deviation inequality and m ge Cεminus2 logN then yield

supxyisinX

983055983055983055983055983042A(xminus y)9830422983042xminus y9830422

minusradicm

983055983055983055983055 ≲983155

logN le εradicm

with high probability say 099 Multiplying both sides by983042xminus y9830422

radicm we can write the last bound as follows With

probability at least 099 we have

(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422

for all xy isin X

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22

3 Covariance estimation

We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn

We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition

Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn

983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422

Then for every N ge 1 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22

Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that

X = Σ12Z Xi = Σ12Zi i = 1 N

The sub-gaussian assumption implies that

983042Z983042ψ2 ≲ 1

Then

983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1

〈Σ12RNΣ12xx〉

where

RN =1

N

N983131

i=1

ZiZTi minus In

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof. Since x has at most s nonzero coordinates, the Cauchy–Schwarz inequality applied to those coordinates shows that ‖x‖_1 ≤ √s ‖x‖_2. Denote the unit ball of the ℓ_1 norm in R^n by B_1^n. Then we can rewrite ‖x‖_1 ≤ √s ‖x‖_2 as the inclusion

x ∈ √s ‖x‖_2 · B_1^n =: K.

By the Gaussian width of B_1^n computed earlier,

ω(K) = √s ‖x‖_2 · ω(B_1^n) ≤ √s ‖x‖_2 · γ(B_1^n) ≲ √s ‖x‖_2 · √(log n).

Substituting this into Theorem 3 completes the proof.
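Explicitly, the substitution gives

E ‖x̂ − x‖_2 ≲ ω(K)/√m ≲ √s ‖x‖_2 √(log n)/√m = √((s log n)/m) · ‖x‖_2,

so m ∼ ε^{-2} s log n measurements give relative error ε in expectation.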


Page 9: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

22 Deviation of squares

It is sometimes helpful to bound the deviation of the square983042Ax98304222 rather than 983042Ax9830422 itself

We can easily deduce the deviation of squares by using the identity

a2 minus b2 = (aminus b)2 + 2b(bminus a)

for a = 983042Ax9830422 and b =radicm983042x9830422

We have

E supxisinT

983055983055983042Ax98304222 minusm983042x98304222983055983055 le CK4γ(T )2 + CK2radicmrad(T )γ(T )

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 9 22

23 Deriving Johnson-Lindenstrauss Lemma

X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have

γ(T ) ≲983155

log |T | le983155

log |X |2 ≲983155

log |X |

Matrix deviation inequality and m ge Cεminus2 logN then yield

supxyisinX

983055983055983055983055983042A(xminus y)9830422983042xminus y9830422

minusradicm

983055983055983055983055 ≲983155

logN le εradicm

with high probability say 099 Multiplying both sides by983042xminus y9830422

radicm we can write the last bound as follows With

probability at least 099 we have

(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422

for all xy isin X

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22

3 Covariance estimation

We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn

We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition

Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn

983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422

Then for every N ge 1 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22

Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that

X = Σ12Z Xi = Σ12Zi i = 1 N

The sub-gaussian assumption implies that

983042Z983042ψ2 ≲ 1

Then

983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1

〈Σ12RNΣ12xx〉

where

RN =1

N

N983131

i=1

ZiZTi minus In

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof. Since x has at most s nonzero coordinates, the Cauchy–Schwarz inequality shows that ‖x‖_1 ≤ √s ‖x‖_2. Denote the unit ball of the ℓ1 norm in R^n by B_1^n. Then we can rewrite ‖x‖_1 ≤ √s ‖x‖_2 as the inclusion

x ∈ √s ‖x‖_2 · B_1^n =: K.

By the Gaussian width of B_1^n we have

ω(K) = √s ‖x‖_2 · ω(B_1^n) ≤ √s ‖x‖_2 · γ(B_1^n) ≲ √s ‖x‖_2 · √(log n).

Substituting this into Theorem 3 completes the proof.
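For completeness, here is one standard way to see the last bound on γ(B_1^n) (a sketch; constants are not optimized). Since the supremum of |⟨g, x⟩| over the unit ℓ1 ball is ‖g‖_∞,

$$
\gamma(B_1^n) = \mathbb{E}\max_{i\le n}|g_i|
\;\le\; \frac{1}{\lambda}\log \mathbb{E}\, e^{\lambda \max_{i\le n}|g_i|}
\;\le\; \frac{1}{\lambda}\log\Big(\sum_{i\le n} 2\,\mathbb{E}\, e^{\lambda g_i}\Big)
\;=\; \frac{\log(2n)}{\lambda} + \frac{\lambda}{2},
$$

and choosing $\lambda = \sqrt{2\log(2n)}$ gives $\gamma(B_1^n) \le \sqrt{2\log(2n)} \lesssim \sqrt{\log n}$.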

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 10: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

23 Deriving Johnson-Lindenstrauss Lemma

X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have

γ(T ) ≲983155

log |T | le983155

log |X |2 ≲983155

log |X |

Matrix deviation inequality and m ge Cεminus2 logN then yield

supxyisinX

983055983055983055983055983042A(xminus y)9830422983042xminus y9830422

minusradicm

983055983055983055983055 ≲983155

logN le εradicm

with high probability say 099 Multiplying both sides by983042xminus y9830422

radicm we can write the last bound as follows With

probability at least 099 we have

(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422

for all xy isin X

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22

3 Covariance estimation

We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn

We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition

Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn

983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422

Then for every N ge 1 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22

Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that

X = Σ12Z Xi = Σ12Zi i = 1 N

The sub-gaussian assumption implies that

983042Z983042ψ2 ≲ 1

Then

983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1

〈Σ12RNΣ12xx〉

where

RN =1

N

N983131

i=1

ZiZTi minus In

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 11: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

3 Covariance estimation

We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn

We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition

Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn

983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422

Then for every N ge 1 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22

Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that

X = Σ12Z Xi = Σ12Zi i = 1 N

The sub-gaussian assumption implies that

983042Z983042ψ2 ≲ 1

Then

983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1

〈Σ12RNΣ12xx〉

where

RN =1

N

N983131

i=1

ZiZTi minus In

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 12: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that

X = Σ12Z Xi = Σ12Zi i = 1 N

The sub-gaussian assumption implies that

983042Z983042ψ2 ≲ 1

Then

983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1

〈Σ12RNΣ12xx〉

where

RN =1

N

N983131

i=1

ZiZTi minus In

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 13: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as

983042ΣN minus Σ983042 = maxxisinT

〈RNxx〉 = maxxisinT

9830559830559830559830551

N

983131N

i=1〈Zix〉2 minus 983042x98304222

983055983055983055983055

=1

NmaxxisinT

983055983055983042Ax98304222 minusN983042x98304222983055983055

Now apply the matrix deviation inequality for squares to conclude that

E 983042ΣN minus Σ983042 ≲ 1

N

983059γ(T )2 +

radicN rad(T )γ(T )

983060

The radius and Gaussian width of the ellipsoid T are easy to compute

rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12

By using tr(Σ) le n983042Σ983042 we have

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

n

N+

n

N

983062

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 14: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

31 Low-dimensional distributions

We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields

E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157

r

N+

r

N

983062

where

r = r(Σ12) =trΣ

983042Σ983042

is the stable rank of Σ12 This means that covariance estimationis possible with

N sim r

samples

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 15: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables

Ax = y

When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that

x isin K

where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori

Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 16: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

41 An optimization approach

We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K

To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration

The set on the left (whose boundary is shown) is star-shaped theset on the right is not

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote

the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite

983042x9830421 leradics983042x9830422 as the inclusion

x isinradics983042x9830422 middot Bn

1 = K

By the Gaussian width of Bn1 we have

ω(K) =radics983042x9830422 middot ω(Bn

1 ) leradics983042x9830422 middot γ(Bn

1 ) leradics983042x9830422 middot

983155log n

Substitute this in Theorem 3 and complete the proof

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22

Page 17: Lecture 15: Matrix deviation inequalitymath.xmu.edu.cn/group/nona/damc/Lecture15.pdfMatrix deviation inequality DAMC Lecture 15 May 29 - June 3, 2020 16 / 22 The Minkowski functional

The Minkowski functional of K is defined as

983042x983042K = inft gt 0 xt isin K x isin Rn

If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)

Now we propose the following way to solve the recovery problemsolve the optimization program

min 983042xprime983042K subject to y = Axprime

Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)

Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22

The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations

Theorem 3 (Recovery by optimization)

Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies

E983042983141xminus x9830422 ≲ω(K)radic

m

where ω(K) is the Gaussian width of K

Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22

By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives

supuvisinK

983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422

983055983055 ≲ γ(T ) = 2ω(K)

Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get

Eradicm983042983141xminus x9830422 ≲ ω(K)

Dividing both sides byradicm we complete the proof

This theorem says that a signal x isin K can be efficiently recovered from

m sim ω(K)2

random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22

5 Sparse recovery

Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector

y = Ax

where A is an mtimes n random matrix

The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (

983123ni=1 |xi|p)1p and for a reason

You can quickly check that

983042x9830420 = limprarr0

983042x983042p

Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22

Our first attempt to recover x is to try the following optimizationproblem

min 983042xprime9830420 subject to y = Axprime

This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently

To overcome this difficulty we use

min 983042xprime9830421 subject to y = Axprime

This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently

We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22

Theorem 4 (Sparse recovery by optimization)

Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies

E983042983141xminus x9830422 ≲983155

(s log n)m983042x9830422

Proof. The Cauchy–Schwarz inequality shows that ‖x‖₁ ≤ √s ‖x‖₂. Denote the unit ball of the ℓ₁ norm in ℝⁿ by B₁ⁿ. Then we can rewrite ‖x‖₁ ≤ √s ‖x‖₂ as the inclusion

x ∈ √s ‖x‖₂ · B₁ⁿ = K.

By the bound on the Gaussian width of B₁ⁿ, we have

ω(K) = √s ‖x‖₂ · ω(B₁ⁿ) ≤ √s ‖x‖₂ · γ(B₁ⁿ) ≲ √s ‖x‖₂ · √(log n).

Substituting this bound into Theorem 3 completes the proof.
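The bound γ(B₁ⁿ) ≲ √(log n) used in the last step can be checked numerically: since sup_{x∈B₁ⁿ} |⟨g, x⟩| = ‖g‖_∞, we have γ(B₁ⁿ) = E maxᵢ |gᵢ| ≤ √(2 log(2n)). A quick Monte Carlo sketch (not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 5000
for n in [10, 100, 1000]:
    # gamma(B_1^n) = E sup_{x in B_1^n} |<g, x>| = E ||g||_inf
    g = rng.standard_normal((trials, n))
    gamma_hat = np.abs(g).max(axis=1).mean()
    print(f"n = {n:5d}   estimated gamma(B_1^n) = {gamma_hat:.3f}   "
          f"upper bound sqrt(2 log(2n)) = {np.sqrt(2 * np.log(2 * n)):.3f}")
```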

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
