Page 1

Monte Carlo Linear Algebra: A Review and Recent Results

Dimitri P. Bertsekas

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

Caradache, France, 2012

Page 2

Monte Carlo Linear Algebra

An emerging field combining Monte Carlo simulation and algorithmic linear algebra

Plays a central role in approximate DP (policy iteration, projected equation and aggregation methods)

Advantage of Monte Carlo

Can be used to approximate sums of a huge number of terms, such as high-dimensional inner products

A very broad scope of applications

Linear systems of equations

Least squares/regression problems

Eigenvalue problems

Linear and quadratic programming problems

Linear variational inequalities

Other quasi-linear structures

Page 3

Monte Carlo Estimation Approach for Linear Systems

We focus on solution of Cx = d

Use simulation to compute Ck → C and dk → d

Estimate the solution by matrix inversion: Ck⁻¹dk ≈ C⁻¹d (assuming C is invertible)

Alternatively, solve Ck x = dk iteratively

Why simulation?

C may be of small dimension, but may be defined in terms of matrix-vector products of huge dimension

What are the main issues?

Efficient simulation design that matches the structure of C and d

Efficient and reliable algorithm design

What to do when C is singular or nearly singular

Page 4

References

Collaborators: Huizhen (Janey) Yu, Mengdi Wang

D. P. Bertsekas and H. Yu, “Projected Equation Methods for Approximate Solution of Large Linear Systems,” Journal of Computational and Applied Mathematics, Vol. 227, 2009, pp. 27-50.

H. Yu and D. P. Bertsekas, “Error Bounds for Approximations from Projected Linear Equations,” Mathematics of Operations Research, Vol. 35, 2010, pp. 306-329.

D. P. Bertsekas, “Temporal Difference Methods for General Projected Equations,” IEEE Trans. on Aut. Control, Vol. 56, pp. 2128-2139, 2011.

M. Wang and D. P. Bertsekas, “Stabilization of Stochastic Iterative Methods for Singular and Nearly Singular Linear Systems,” Lab. for Information and Decision Systems Report LIDS-P-2878, MIT, December 2011 (revised March 2012).

M. Wang and D. P. Bertsekas, “Convergence of Iterative Simulation-Based Methods for Singular Linear Systems,” Lab. for Information and Decision Systems Report LIDS-P-2879, MIT, December 2011 (revised April 2012).

D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II: Approximate Dyn. Programming, Athena Scientific, Belmont, MA, 2012.

Page 5

Outline

1 Motivating Framework: Low-Dimensional Approximation
  Projected Equations
  Aggregation
  Large-Scale Regression

2 Sampling Issues
  Simulation for Projected Equations
  Multistep Methods
  Constrained Projected Equations

3 Solution Methods and Singularity Issues
  Invertible Case
  Singular and Nearly Singular Case
  Deterministic and Stochastic Iterative Methods
  Nullspace Consistency
  Stabilization Schemes

Page 6

Low-Dimensional Approximation

Start from a high-dimensional equation y = Ay + b

Approximate its solution within a subspace S = {Φx | x ∈ ℜ^s}. Columns of Φ are basis functions

Equation approximation approach

Approximate solution y∗ with the solution Φx∗ of an equation defined on S

Important example: Projection/Galerkin approximation

Φx = Π(AΦx + b)

Figure: Galerkin approximation of the equation y∗ = Ay∗ + b by the projected equation Φx∗ = Π(AΦx∗ + b), with projection Π onto the subspace S spanned by the basis functions.

Page 7

Matrix Form of Projected Equation

Let Π be projection with respect to a weighted Euclidean norm ‖y‖Ξ = √(y′Ξy)

The Galerkin solution is obtained from the orthogonality condition

Φx∗ − (AΦx∗ + b) ⊥ (Columns of Φ)

or

Cx = d

where

C = Φ′Ξ(I − A)Φ, d = Φ′Ξb

Motivation for simulation

If y is high-dimensional, C and d involve high-dimensional matrix-vector operations
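
As a concrete illustration of the matrix form, the following minimal sketch forms C = Φ′Ξ(I − A)Φ and d = Φ′Ξb exactly for a small instance and solves Cx = d. The matrices A, b, Φ and the weights ξ below are illustrative assumptions; in the motivating setting these products are too large to form directly, which is what the simulation approach avoids.

```python
import numpy as np

# Small illustrative instance: form C = Phi' Xi (I - A) Phi and d = Phi' Xi b
# exactly and solve Cx = d.
rng = np.random.default_rng(0)
n, s = 100, 5                        # "high" dimension n, low dimension s
A = 0.9 * rng.random((n, n)) / n     # illustrative matrix with small spectral radius
b = rng.random(n)
Phi = rng.random((n, s))             # columns of Phi are the basis functions
xi = rng.random(n)
xi /= xi.sum()                       # projection weights (diagonal of Xi)
Xi = np.diag(xi)

C = Phi.T @ Xi @ (np.eye(n) - A) @ Phi
d = Phi.T @ Xi @ b
x_star = np.linalg.solve(C, d)       # low-dimensional Galerkin solution
print(x_star)
```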

Page 8

Another Example: Aggregation

Let D and Φ be matrices whose rows are probability distributions.

Aggregation equation

By forming convex combinations of variables (i.e., y ≈ Φx) and equations (using D), we obtain an aggregate form of the fixed point problem y = Ay + b:

x = D(AΦx + b)

or Cx = d with C = I − DAΦ, d = Db

Connection with projection/Galerkin approximation

The aggregation equation yields

Φx = ΦD(AΦx + b)

ΦD is an oblique projection in some of the most interesting types of aggregation [if DΦ = I, so that (ΦD)² = ΦD]; it is also a seminorm projection.
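
A minimal numeric sketch of the aggregation equation, assuming hard aggregation of n = 6 states into s = 2 groups (the matrices A, b, D, Φ below are all illustrative):

```python
import numpy as np

# Hard-aggregation sketch: rows of Phi and D are probability distributions;
# the aggregate equation x = D(A Phi x + b) is solved as (I - D A Phi) x = D b.
rng = np.random.default_rng(1)
n, s = 6, 2
A = 0.9 * rng.dirichlet(np.ones(n), size=n)          # substochastic n x n matrix
b = rng.random(n)
Phi = np.kron(np.eye(s), np.ones((n // s, 1)))       # each row of Phi is a unit distribution
D = np.kron(np.eye(s), np.full((1, n // s), s / n))  # each row of D is uniform over one group

x = np.linalg.solve(np.eye(s) - D @ A @ Phi, D @ b)  # aggregate solution
print(x, Phi @ x)                                    # x and its lifted version Phi x
```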

Page 9

Another Example: Large-Scale Regression

Weighted least squares problem

Consider

min_{y∈ℜ^n} ‖Wy − h‖²Ξ,

where W and h are given, ‖ · ‖Ξ is a weighted Euclidean norm, and y is high-dimensional.

We approximate y within the subspace S = {Φx | x ∈ ℜ^s}, to obtain

min_{x∈ℜ^s} ‖WΦx − h‖²Ξ.

Equivalent linear system Cx = d

C = Φ′W ′ΞW Φ, d = Φ′W ′Ξh

Page 10

Key Idea for Simulation

Critical Problem

Compute sums ∑_{i=1}^n ai for very large n (or n = ∞)

Convert Sum to an Expected Value

Introduce a sampling distribution ξ and write

∑_{i=1}^n ai = ∑_{i=1}^n ξi (ai/ξi) = Eξ{a}

where the random variable a has distribution

P{a = ai/ξi} = ξi,  i = 1, . . . , n

We “invent” ξ to convert a “deterministic” problem to a “stochastic” problem that can be solved by simulation.

Complexity advantage: Running time is independent of the number n of terms in the sum; it depends only on the distribution of a.

Importance sampling idea: Use a sampling distribution that matches the problem for efficiency (e.g., make the variance of a small).
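
The following sketch illustrates the key idea on an illustrative sum: the terms ai, the sampling distribution ξ, and the sample count are assumptions of the example, and the full arrays are built here only so the estimate can be checked against the exact sum.

```python
import numpy as np

# Estimate sum_i a_i by sampling indices i ~ xi and averaging a_i / xi_i.
rng = np.random.default_rng(0)
n = 10**6
a = 1.0 / (1.0 + np.arange(n))       # terms whose sum we want
xi = a + a.mean()                    # a sampling distribution roughly matched to a ...
xi /= xi.sum()                       # ... which keeps the variance of a_i/xi_i small

k = 5000                             # number of samples; does not grow with n
idx = rng.choice(n, size=k, p=xi)    # i.i.d. index samples i_0, ..., i_{k-1}
estimate = np.mean(a[idx] / xi[idx])
print(estimate, a.sum())             # Monte Carlo estimate vs exact sum
```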

Page 11

Row and Column Sampling for System Cx = d

Figure: Row sampling generates an index sequence i0, i1, . . . , ik, ik+1, . . . according to ξ (may use a Markov chain Q); column sampling generates index pairs (i0, j0), (i1, j1), . . . according to a Markov chain P ∼ |A|.

Row sampling: Generate sequence {i0, i1, . . .} according to ξ (the diagonal of Ξ), i.e., relative frequency of each row i is ξi

Column sampling: Generate sequence {(i0, j0), (i1, j1), . . .} according to some transition probability matrix P with

pij > 0 if aij ≠ 0,

i.e., for each i, the relative frequency of (i, j) is pij

Row sampling may be done using a Markov chain with transition matrix Q (unrelated to P)

Row sampling may also be done without a Markov chain - just sample rows according to some known distribution ξ (e.g., a uniform)

Page 12

Simulation Formulas for Matrix Form of Projected Equation

Approximation of C and d by simulation:

C = Φ′Ξ(I − A)Φ ∼ Ck = (1/(k + 1)) ∑_{t=0}^k φ(it) (φ(it) − (a_{it jt}/p_{it jt}) φ(jt))′,

d = Φ′Ξb ∼ dk = (1/(k + 1)) ∑_{t=0}^k φ(it) b_{it}

We have, by the law of large numbers, Ck → C and dk → d.

Equation approximation: Solve the equation Ck x = dk in place of Cx = d.

Algorithms

Matrix inversion approach: x∗ ≈ Ck⁻¹dk (if Ck is invertible for large k)

Iterative approach: xk+1 = xk − γGk (Ck xk − dk )
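
A minimal end-to-end sketch of row/column sampling and the estimates Ck, dk above, on a small illustrative instance: A, b, Φ, ξ are assumptions of the sketch, the column-sampling chain is taken proportional to |A| as suggested on the sampling slide, and the exact C, d are formed only for comparison (in practice n is too large to enumerate).

```python
import numpy as np

# Row/column sampling and the simulation formulas for C_k and d_k.
rng = np.random.default_rng(0)
n, s = 50, 3
A = 0.8 * rng.dirichlet(np.ones(n), size=n)      # |A| row-substochastic
b = rng.random(n)
Phi = rng.random((n, s))
xi = rng.dirichlet(np.ones(n))                   # row-sampling distribution (diagonal of Xi)
P = A / A.sum(axis=1, keepdims=True)             # column sampling with P proportional to |A|

k = 20000
C_k, d_k = np.zeros((s, s)), np.zeros(s)
for t in range(k):
    i = rng.choice(n, p=xi)                      # row sample i_t ~ xi
    j = rng.choice(n, p=P[i])                    # column sample j_t ~ p_{i_t, .}
    C_k += np.outer(Phi[i], Phi[i] - (A[i, j] / P[i, j]) * Phi[j])
    d_k += Phi[i] * b[i]
C_k /= k
d_k /= k

C = Phi.T @ np.diag(xi) @ (np.eye(n) - A) @ Phi  # exact C and d, for comparison only
d = Phi.T @ np.diag(xi) @ b
print(np.linalg.solve(C_k, d_k), np.linalg.solve(C, d))
```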

Page 13

Multistep Methods - TD(λ)-Type

Instead of solving (approximately) the equation y = T(y) = Ay + b, consider the multistep equivalent

y = T^(λ)(y)

where for λ ∈ [0, 1)

T^(λ) = (1 − λ) ∑_{ℓ=0}^∞ λ^ℓ T^{ℓ+1}

Special multistep sampling methods

Bias-variance tradeoff

Figure: Bias-variance tradeoff for the multistep projected equation Φx = ΠT^(λ)(Φx): solutions of the multistep projected equation on the subspace spanned by the basis functions are shown for λ = 0, 0 < λ < 1, and λ = 1, with the simulation error and the bias relative to ΠJµ indicated.

Page 14

Constrained Projected Equations

Consider

Φx = ΠT(Φx) = Π(AΦx + b)

where Π is the projection operation onto a closed convex subset Ŝ of the subspace S (with respect to the weighted norm ‖ · ‖Ξ; Ξ: positive definite).

Figure: Constrained projected equation: T(Φx∗) is projected onto the closed convex set Ŝ within the subspace S, so that ΠT(Φx∗) = Φx∗.

From the properties of projection,

(Φx∗ − T(Φx∗))′ Ξ (y − Φx∗) ≥ 0,  ∀ y ∈ Ŝ

This is a linear variational inequality: Find x∗ such that

f(Φx∗)′(y − Φx∗) ≥ 0,  ∀ y ∈ Ŝ,

where f(y) = Ξ(y − T(y)) = Ξ(y − (Ay + b)).

Page 15

Equivalence Conclusion

Two equivalent problems

The projected equation

Φx = ΠT(Φx)

where Π is projection with respect to ‖ · ‖Ξ on the convex set Ŝ ⊂ S

The special-form VI

f (Φx∗)′Φ(x − x∗) ≥ 0, ∀ x ∈ X ,

where

f(y) = Ξ(y − T(y)),  X = {x | Φx ∈ Ŝ}

Special linear cases: T (y) = Ay + b

Ŝ = ℜ^n: VI <==> f(Φx∗) = Ξ(Φx∗ − T(Φx∗)) = 0 (linear equation)

Ŝ = subspace: VI <==> f(Φx∗) ⊥ Ŝ (e.g., projected linear equation)

f(y) the gradient of a quadratic, Ŝ polyhedral (e.g., approx. LP and QP)

Linear VI case (e.g., cooperative and zero-sum games with approximation)

Page 16

Deterministic Solution Methods - Invertible Case of Cx = d

Matrix Inversion Method

x∗ = C−1d

Generic Linear Iterative Method

xk+1 = xk − γG(Cxk − d)

where:

G is a scaling matrix, γ > 0 is a stepsize

Eigenvalues of I − γGC within the unit circle (for convergence)

Special cases:

Projection/Richardson’s method: C positive semidefinite, G positive definite symmetric

Proximal method (quadratic regularization)

Splitting/Gauss-Seidel method
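
A minimal sketch of the generic linear iteration on a small invertible system; C, d, G, and the stepsize γ below are illustrative and chosen so that the eigenvalues of I − γGC lie inside the unit circle.

```python
import numpy as np

# Generic linear iteration x_{k+1} = x_k - gamma * G (C x_k - d).
C = np.array([[2.0, 0.5],
              [0.3, 1.5]])
d = np.array([1.0, 2.0])
G = np.eye(2)                     # scaling matrix (identity here)
gamma = 0.4                       # stepsize

# Convergence check: spectral radius of I - gamma*G*C must be below 1.
assert max(abs(np.linalg.eigvals(np.eye(2) - gamma * G @ C))) < 1

x = np.zeros(2)
for _ in range(200):
    x = x - gamma * G @ (C @ x - d)
print(x, np.linalg.solve(C, d))   # iterate approaches C^{-1} d
```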

Page 17

Simulation-Based Solution Methods - Invertible Case

Given sequences Ck → C and dk → d

Matrix Inversion Method

xk = Ck⁻¹dk

Iterative Method

xk+1 = xk − γGk (Ck xk − dk )

where:

Gk is a scaling matrix with Gk → G

γ > 0 is a stepsize

xk → x∗ if and only if the deterministic version is convergent

Page 18

Solution Methods - Singular Case (Assuming a Solution Exists)

Given sequences Ck → C and dk → d. The matrix inversion method does not apply

Iterative Method

xk+1 = xk − γGk (Ck xk − dk )

Need not converge to a solution, even if the deterministic version does

Questions:

Under what conditions is the stochastic method convergent?

How to modify the method to restore convergence?

Page 19

Simulation-Based Solution Methods - Nearly Singular Case

The theoretical view

If C is nearly singular, we are in the nonsingular case

The practical view

If C is nearly singular, we are essentially in the singular case (unless the simulation is extremely accurate)

The eigenvalues of the iteration

xk+1 = xk − γGk (Ck xk − dk )

move in and out of the unit circle for a long time (until the “size” of the simulation noise becomes comparable to the “stability margin” of the iteration)

Think of roundoff error affecting the solution of ill-conditioned systems (simulation noise is far worse)

Page 20

Deterministic Iterative Method - Convergence Analysis

Assume that C is invertible or singular (but Cx = d has a solution)

Generic Linear Iterative Method

xk+1 = xk − γG(Cxk − d)

Standard Convergence Result

Let C be singular and denote by N(C) the nullspace of C. Then {xk} is convergent (for all x0 and sufficiently small γ) to a solution of Cx = d if and only if:

(a) Each eigenvalue of GC either has a positive real part or is equal to 0.

(b) The dimension of N(GC) is equal to the algebraic multiplicity of the eigenvalue 0 of GC.

(c) N(C) = N(GC).
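
The three conditions can be checked numerically for a given pair (G, C). The sketch below does so for a small singular example; the matrices are illustrative, and the rank-based test for (c) uses the fact that N(C) ⊆ N(GC) always holds.

```python
import numpy as np

# Check conditions (a)-(c) of the convergence result for a singular example.
C = np.array([[1.0, 0.0],
              [0.0, 0.0]])                 # singular: N(C) = span{(0, 1)}
G = np.eye(2)
GC = G @ C
tol = 1e-10
eigvals = np.linalg.eigvals(GC)

cond_a = all(ev.real > tol or abs(ev) < tol for ev in eigvals)
alg_mult_zero = sum(abs(ev) < tol for ev in eigvals)
dim_null_GC = GC.shape[1] - np.linalg.matrix_rank(GC, tol=tol)
cond_b = dim_null_GC == alg_mult_zero
# N(C) is always contained in N(GC), so the sets are equal iff the ranks match.
cond_c = np.linalg.matrix_rank(GC, tol=tol) == np.linalg.matrix_rank(C, tol=tol)
print(cond_a, cond_b, cond_c)              # all True for this example
```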

Page 21

Proof Based on Nullspace Decomposition for Singular Systems

For any solution x∗, rewrite the iteration as

xk+1 − x∗ = (I − γGC)(xk − x∗)

Linearly transform the iteration

Introduce a similarity transformation involving N(C) and N(C)⊥

Let U and V be orthonormal bases of N(C) and N(C)⊥:

[U V]′(I − γGC)[U V] = I − γ [ U′GCU, U′GCV; V′GCU, V′GCV ] = I − γ [ 0, U′GCV; 0, V′GCV ] ≡ [ I, −γN; 0, I − γH ],

where H has eigenvalues with positive real parts. Hence for some γ > 0,

ρ(I − γH) < 1,

so I − γH is a contraction ... □

Page 22

Nullspace Decomposition of Deterministic Iteration


Figure: Iteration decomposition into components on N(C) and N(C)⊥.

xk = x∗ + Uyk + Vzk

Nullspace component: yk+1 = yk − γNzk

Orthogonal component: zk+1 = zk − γHzk CONTRACTIVE

Page 23

Stochastic Iterative Method May Diverge

The stochastic iteration

xk+1 = xk − γGk (Ck xk − dk )

approaches the deterministic iteration

xk+1 = xk − γG(Cxk − d), where ρ(I − γGC) ≤ 1.

However, since ρ(I − γGkCk) → 1, ρ(I − γGkCk) may cross above 1 too frequently, and we can have divergence.

The difficulty is that the orthogonal component is now coupled to the nullspace component with simulation noise

Page 24

Divergence of the Stochastic/Singular Iteration


Figure: NOISE LEAKAGE FROM N(C) to N(C)⊥

xk = x∗ + Uyk + Vzk

Nullspace component: yk+1 = yk − γNzk + Noise(yk , zk )

Orthogonal component: zk+1 = zk − γHzk + Noise(yk , zk )

Page 25

Divergence Example for a Singular Problem

2× 2 Example

Let the noise {ek} be MC averages with mean 0, so ek → 0, and let

xk+1 = [ 1 + ek   0
         ek      1/2 ] xk

Nullspace component yk = xk(1) diverges:

∏_{t=1}^k (1 + et) = O(e^√k) → ∞

Orthogonal component zk = xk(2) diverges:

xk+1(2) = (1/2) xk(2) + ek ∏_{t=1}^k (1 + et),

where

ek ∏_{t=1}^k (1 + et) = O(e^√k / √k) → ∞.
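
A minimal simulation of this 2×2 example; the noise model (averaged standard normals) is an illustrative choice of a Monte Carlo average with mean 0.

```python
import numpy as np

# e_k -> 0, yet the products prod_t (1 + e_t) fluctuate on the scale exp(+-sqrt(k)),
# so the iterate typically reaches huge magnitudes along the run.
rng = np.random.default_rng(0)
K = 10000
x = np.array([1.0, 1.0])
running_sum, max_norm = 0.0, 0.0
for k in range(1, K + 1):
    running_sum += rng.standard_normal()
    e_k = running_sum / k                   # Monte Carlo average, e_k -> 0 like 1/sqrt(k)
    x = np.array([[1.0 + e_k, 0.0],
                  [e_k,       0.5]]) @ x
    max_norm = max(max_norm, float(np.linalg.norm(x)))
print("final e_k:", e_k, " max ||x_k|| over the run:", max_norm)
```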

Page 26

What Happens in Nearly Singular Problems?

“Divergence" until Noise << “Stability Margin" of the iteration

Compare with roundoff error problems in inversion of nearly singular matrices

A Simple Example

Consider the inversion of a scalar c > 0, with simulation error η. The absolute and relative errors are

E = 1/(c + η) − 1/c,   Er = E/(1/c).

By a Taylor expansion around η = 0:

E ≈ ∂(1/(c + η))/∂η |_{η=0} · η = −η/c²,   Er ≈ −η/c.

For the estimate 1/(c + η) to be reliable, it is required that

|η| << |c|.

Number of i.i.d. samples needed: k >> 1/c².
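
A quick numeric check of these first-order error estimates; the values of c and η below are illustrative.

```python
# Absolute and relative error of inverting a nearly singular scalar c with noise eta.
c, eta = 0.05, 0.004
E = 1.0 / (c + eta) - 1.0 / c
Er = E / (1.0 / c)
print(E, -eta / c**2)        # absolute error vs first-order estimate -eta/c^2
print(Er, -eta / c)          # relative error vs first-order estimate -eta/c
# If eta is an average of k i.i.d. unit-variance samples, its size is about 1/sqrt(k),
# so |eta| << c requires k >> 1/c^2:
print("1/c^2 =", 1.0 / c**2)
```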

Page 27

Nullspace Consistent Iterations

Nullspace Consistency and Convergence of Residual

If N(Gk Ck ) ≡ N(C), we say that the iteration is nullspace-consistent.

A nullspace-consistent iteration generates convergent residuals (Cxk − d → 0) iff the deterministic iteration converges.

Proof Outline:

xk = x∗ + Uyk + Vzk

Nullspace component: yk+1 = yk − γNzk + Noise(yk , zk )

Orthogonal component: zk+1 = zk − γHzk + Noise(zk ) DECOUPLED

LEAKAGE FROM N(C) IS ANNIHILATED by V, so

Cxk − d = CVzk → 0

Page 28

Interesting Special Cases

Proximal/Quadratic Regularization Method

xk+1 = xk − (Ck′Ck + βI)⁻¹Ck′(Ck xk − dk)

Can diverge even in the nullspace consistent case.

In the nullspace consistent case, under favorable conditions xk → somesolution x∗.

In these cases the nullspace component yk stays constant.

Approximate DP (projected equation and aggregation)

The estimates often take the form

Ck = Φ′Mk Φ, dk = Φ′hk ,

where Mk → M for some positive definite M.

If Φ has dependent columns, the matrix C = Φ′MΦ is singular.

The iteration using such Ck and dk is nullspace consistent.

In typical methods (e.g., LSPE) xk → some solution x∗.
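
A small numeric illustration of the approximate DP case above: when Φ has dependent columns, C = Φ′MΦ is singular, and estimates of the form Ck = Φ′MkΦ share the nullspace N(Φ), which is what makes the iteration nullspace consistent. The matrices Φ, M, Mk below are illustrative assumptions.

```python
import numpy as np

# Phi with dependent columns => C = Phi' M Phi singular, and N(C_k) = N(C) = N(Phi).
rng = np.random.default_rng(0)
n, s = 8, 3
Phi = rng.random((n, s))
Phi[:, 2] = Phi[:, 0] + Phi[:, 1]            # dependent third column
M = np.diag(rng.random(n) + 0.1)             # positive definite limit matrix
M_k = M + 0.05 * np.diag(rng.random(n))      # a noisy positive definite estimate

C = Phi.T @ M @ Phi
C_k = Phi.T @ M_k @ Phi
print(np.linalg.matrix_rank(C), np.linalg.matrix_rank(C_k))   # both 2: singular
v = np.array([1.0, 1.0, -1.0])               # a vector in N(Phi)
print(Phi @ v, C @ v, C_k @ v)               # all (numerically) zero
```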

Page 29

Stabilization of Divergent Iterations

A Stabilization Scheme

Shifting the eigenvalues of I − γGk Ck by −δk :

xk+1 = (1 − δk)xk − γGk(Ck xk − dk).

Convergence of Stabilized Iteration

Assume that the eigenvalue shifts δk decay more slowly than the convergence rate of the simulation:

(Ck − C, dk − d, Gk − G)/δk → 0,   ∑_{k=0}^∞ δk = ∞

Then the stabilized iteration generates xk → some x∗ iff the deterministic iteration without δk does.

Stabilization is interesting even in the nonsingular case

It provides a form of “regularization"
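
A minimal sketch of the stabilized iteration on a small singular but consistent system; γ, the noise model, and δk = k^(-1/3) (as in the plot on the next slide) are illustrative choices.

```python
import numpy as np

# Stabilized iteration x_{k+1} = (1 - delta_k) x_k - gamma * G_k (C_k x_k - d_k), with G_k = I.
rng = np.random.default_rng(0)
C = np.array([[1.0, 0.0],
              [0.0, 0.0]])                          # singular
d = np.array([1.0, 0.0])                            # consistent: solutions are (1, anything)
gamma = 0.5
x = np.array([5.0, 5.0])
for k in range(1, 50001):
    C_k = C + rng.standard_normal((2, 2)) / np.sqrt(k)   # simulation error ~ 1/sqrt(k)
    d_k = d + rng.standard_normal(2) / np.sqrt(k)
    delta_k = k ** (-1.0 / 3.0)                     # decays more slowly than the noise
    x = (1.0 - delta_k) * x - gamma * (C_k @ x - d_k)
print(x, "residual:", C @ x - d)                    # the iterate settles near a solution of Cx = d
```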

Page 30

Stabilization of the Earlier Divergent Example

Figure: Modulus of iteration ‖(1 − δk)I − GkAk‖ versus k (log scale, k = 10^0 to 10^4), for (i) δk = k^(-1/3) and (ii) δk = 0.

Page 31

Thank You!