Second Exam Presentation: Low Rank Matrix Approximation
John Svadlenka
City University of New York, Graduate Center
Date Pending
Outline
1 Introduction
2 Classical Results
3 Approximation and Probabilistic Results
4 Randomized Algorithms - Strategies and Benefits
5 Research Activity
6 Open Problems and Future Research Directions
Given an m × n matrix A, we are often interested in approximating A as the product of an m × k matrix B and a k × n matrix C.

A ≈ B · C

Why?

Provided it is true that k ≪ min(m, n):

Arithmetic cost of a matrix-vector product is 2(m + n)k
Storage space of the factors B and C is (m + n)k
(m + n)k ≪ m × n

We denote the product B · C as a rank-k approximation of A.
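As a concrete illustration (a minimal numpy sketch; the sizes and rank below are arbitrary example choices):

import numpy as np

m, n, k = 5000, 4000, 20            # example sizes with k << min(m, n)
B = np.random.randn(m, k)           # m x k factor
C = np.random.randn(k, n)           # k x n factor
x = np.random.randn(n)

# Matrix-vector product in factored form: about 2(m + n)k flops,
# versus about 2mn if A = B @ C were formed explicitly.
y = B @ (C @ x)

# Storage: (m + n)k numbers for the factors versus mn for A itself.
print((m + n) * k, "vs", m * n)     # 180000 vs 20000000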
More formally, we seek a rank-k matrix approximation Ã of matrix A for some ε > 0 such that:

‖A − Ã‖ ≤ (1 + ε)‖A − A_k‖

A_k is the theoretical best rank-k approximation of A
Matrix norms are the Frobenius norm ‖·‖_F or the spectral norm ‖·‖_2:

‖A‖_F² := ∑_{i,j=1}^{m,n} |a_ij|²        ‖A‖_2 := sup_{‖v‖_2=1} ‖Av‖_2

A_k can be computed from the SVD with cost O((m + n)mn). So we seek less costly approaches. Why?
Suppose m = n and compare mn(m + n) = 2n³ with n² log n:

n         n³          n² log n
10        1,000       332
100       1.00e+06    66,400
1,000     1.00e+09    1.00e+07
10,000    1.00e+12    1.33e+09
Consider the above statistics in light of some recent trends:
Conventional LRA does not scale for Big Data purposes
Approximation algorithms are increasingly preferred
Applications utilizing numerical linear algebra are expanding beyond traditional scientific and engineering disciplines
Conventional LRA algorithms generate decompositions; the most important of these are the SVD, Rank-Revealing QR (RRQR), and Rank-Revealing LU (RRLU):
Singular Value Decomposition (SVD) [Eckart-Young]
Let A be an m × n matrix with r = rank(A) whose elements may be complex. Then there exist two unitary matrices U and V, and an m × n diagonal matrix Σ with nonnegative elements σ_i, where σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0 and σ_j = 0 for j > r, such that:

A = UΣV∗

U and V are m × m and n × n, respectively.
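For reference, the decomposition and its rank-k truncation are a few lines in numpy (a hedged sketch; the matrix and k are arbitrary):

import numpy as np

A = np.random.randn(200, 100)
k = 10

U, s, Vh = np.linalg.svd(A, full_matrices=False)    # A = U @ diag(s) @ Vh
Ak = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]          # rank-k truncated SVD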
QR Decomposition
Let A be an m × n matrix with m ≥ n whose elements may be complex. Then there exist an m × n matrix Q and an n × n matrix R such that

A = QR

where the columns of Q are orthonormal and R is upper triangular.

Cost O(mn min(m, n)) is lower than that for the SVD.
There are several efficient strategies to orthogonalize A.
Column i of A is a linear combination of the columns of Q with the coefficients given by column i of R.
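A small numpy check of this column relation (illustrative only):

import numpy as np

A = np.random.randn(6, 4)
Q, R = np.linalg.qr(A)          # reduced QR: Q is 6 x 4, R is 4 x 4

i = 2                           # any column index
assert np.allclose(A[:, i], Q @ R[:, i])   # column i of A = Q times column i of R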
The LRA problem is also significant for these related subjects:
Principal Component Analysis
Clustering Algorithms
Tensor Decomposition
Rank Structured Matrices
But a series of recent trends has provided impetus for new approaches to LRA...
Consider these examples of Emerging Applications and Big Data:
New disciplines: Machine Learning, Data Science, Image Processing
Modern Massive Data Sets from Physical Systems Modelling, Sensor Measurements, the Internet
New Fields: Recommender Systems, Complex Systems Science
Classical LRA algorithms and their implementations, though well-developed over many years, are characterized by:
Limited parallelization opportunities
Relatively high computational complexity
Memory bottlenecks with out-of-core data sets
[Eckart-Young Theorem]
Let A ∈ C^{m×n} and let A_k be the truncated SVD of rank k, where U_k, V_k, and Σ_k are m × k, n × k, and k × k, respectively. We have:

A_k = U_k Σ_k V_k∗

Then the approximation errors are defined as below. Furthermore, these are the smallest errors of any rank-k approximation of A:

‖A − A_k‖_2 = σ_{k+1}

‖A − A_k‖_F = √( ∑_{j=k+1}^{min(m,n)} σ_j² )
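Both identities can be checked numerically (a sketch on a random test matrix; note that s[k] below is σ_{k+1} under zero-based indexing):

import numpy as np

A = np.random.randn(50, 40)
U, s, Vh = np.linalg.svd(A, full_matrices=False)
k = 5
Ak = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]

assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])           # spectral error
assert np.isclose(np.linalg.norm(A - Ak, 'fro'),
                  np.sqrt(np.sum(s[k:] ** 2)))               # Frobenius error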
Given a rank-k SVD representation of a matrix we may generate its low rank format:

A_k = U_k · (Σ_k V_k∗)

Other decompositions consist of matrix factors that are orthogonal or that form a row and/or column subset of the original matrix:

RRQR
UTV
CUR
Interpolative Decomposition (ID) (one-sided and two-sided)

We may generate a low rank format similarly with:
W = CUR = [CU] · R = C · [UR]
W = UTV = (UT) · V = U · (TV)
Existence of a QR factorization for any matrix can be proven in many ways. For example, it follows from Gram-Schmidt orthogonalization:
Theorem
Suppose (a_1, a_2, . . . , a_n) is a linearly independent list of vectors of a fixed dimension. Then there is an orthonormal list of vectors (q_1, q_2, . . . , q_n) such that span(a_1, a_2, . . . , a_n) = span(q_1, q_2, . . . , q_n).
Shortcomings of the Gram-Schmidt QR algorithm with respect to LRA:

Problem: The algorithm may fail if rank(A) < n
Solution: Introduce a column pivoting strategy
Impact: A = QRP where P is a permutation matrix

Problem: Rounding error impacts orthogonalization
Solution: Normalize q_i before computing q_{i+1}
Solution: Compute the q_i's up to some epsilon tolerance
Skeleton (CUR) Decomposition Theorem
Let A be an m × n real matrix with rank(A) = k. Then there exists a nonsingular k × k submatrix Â of A.
Moreover, let I and J be the index sets of the rows and columns of Â, respectively, in A. Then A = CUR, where U = Â^{-1}, C = A(1..m, J), and R = A(I, 1..n).

A set of k columns and rows captures A's column and row spaces
The skeleton is in contrast to the SVD's left and right singular vectors
Can use QRP or LUP algorithms to find the submatrix Â
Interpolative Decomposition Lemma
Suppose A is an m × n matrix of rank k whose elements may be complex. Then there exist an m × k matrix B consisting of a subset of the columns of A and a k × n matrix P such that:

A = B · P

The identity matrix I_k appears in some column subset of P
|p_ij| ≤ 1 for all i and j

The ID is more appropriate for data analysis purposes
Also appropriate if properties of A are required in the decomposition
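In practice a one-sided ID can be computed from a pivoted QR factorization. The sketch below is illustrative (one_sided_id is my own helper name); plain pivoted QR does not enforce the |p_ij| ≤ 1 bound of the lemma, whereas strong RRQR [Gu and Eisenstat 1996] does, at extra cost:

import numpy as np
from scipy.linalg import qr

def one_sided_id(A, k):
    # Rank-k ID: A ~ A[:, J] @ P, with J chosen by QR column pivoting.
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    T = np.linalg.solve(R[:k, :k], R[:k, k:])   # coefficients for remaining columns
    P = np.zeros((k, A.shape[1]))
    P[:, piv[:k]] = np.eye(k)                   # I_k sits in the selected columns
    P[:, piv[k:]] = T
    return A[:, piv[:k]], P

A = np.random.randn(100, 8) @ np.random.randn(8, 80)   # exactly rank 8
B, P = one_sided_id(A, 8)
print(np.linalg.norm(A - B @ P))                        # ~ machine precision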
What type of decomposition is better? It depends...

The NLA theoretician's point of view: orthogonal matrices are better.
Input error propagation is minimized
Orthogonal bases reduce the amount of arithmetic
They preserve vector and matrix properties in multiplication
But they are not easy to understand for data analysis

The data analyst's perspective: submatrices are better.
They preserve structural properties of the original matrix
They are easier to understand in application terms
But they may not be well-conditioned
The case for approximation approaches to LRA? A large set of results concerning:
Random Matrices and subspace projections
Existential Results for rank k approximations
Column and/or Row Sampling
Matrix skeletons (CUR) and volume maximization
New algorithmic approaches:
Process some matrix much smaller than the original
Provide arbitrary accuracy up to machine precision
Employ adaptive and non-adaptive strategies
Separate randomized and deterministic processing
Johnson-Lindenstrauss Lemma [1984]
Let X_1, X_2, . . . , X_n ∈ R^d. Then for ε ∈ (0, 1) there exists Φ ∈ R^{k×d} for k = O(ε^{-2} log n) such that:

(1 − ε)‖X_i − X_j‖_2 ≤ ‖ΦX_i − ΦX_j‖_2 ≤ (1 + ε)‖X_i − X_j‖_2

Distances among vectors in Euclidean space are approximately preserved in the lower dimensional space, independent of d
Matrix-vector multiplication is O(d log n) for each X_i
Dasgupta and Gupta (2003) proved that standard Gaussian matrices with i.i.d. N(0, 1) entries can be used for Φ
Achlioptas (2003) showed that random {+1, −1} entries suffice.
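A minimal numpy illustration of the lemma with a Gaussian Φ (the constant inside k and all sizes are arbitrary example choices):

import numpy as np

n, d, eps = 500, 10000, 0.25
k = int(np.ceil(8 * np.log(n) / eps ** 2))   # k = O(log n / eps^2); constant illustrative
X = np.random.randn(n, d)

Phi = np.random.randn(k, d) / np.sqrt(k)     # scaled i.i.d. N(0, 1) entries
Y = X @ Phi.T                                # embed all n points at once

for _ in range(5):                           # spot-check a few pairwise distances
    i, j = np.random.choice(n, size=2, replace=False)
    ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
    print(ratio)                             # should lie roughly in [1 - eps, 1 + eps]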
The next major result: matrix-vector multiplication in O(d log d + |P|).

Fast Johnson-Lindenstrauss Transform [Ailon Chazelle 2006]
Let Φ = PHD, where P ∈ R^{k×d}, H, D ∈ R^{d×d}, and d = 2^l:

P_ij ∼ N(0, q^{-1}) with probability q, and P_ij = 0 with probability 1 − q, where q = min(Θ((log² n)/d), 1)

H is defined recursively by

H_2 = [ d^{-1/2}  d^{-1/2} ; d^{-1/2}  −d^{-1/2} ]   and   H_{2q} := [ H_q  H_q ; H_q  −H_q ],   q = 2^h, h = 1, . . . , l

D is a diagonal matrix with d_ii drawn uniformly from {1, −1}.

Then we have with probability 2/3 that:

(1 − ε)k‖X_i‖_2 ≤ ‖ΦX_i‖_2 ≤ (1 + ε)k‖X_i‖_2
Relative-Error Bound (Frobenius norm) [Sarlos 2006]
Let A ∈ R^{m×n}. If Φ is an r × n J-L transform with i.i.d. zero-mean entries {−1, +1} for r = Θ(k/ε + k log k), and if ε ∈ (0, 1), then with probability ≥ .5 we have that:

‖A − Proj_{AΦᵀ,k}(A)‖_F ≤ (1 + ε)‖A − A_k‖_F

where Proj_{AΦᵀ,k}(A) is the best rank-k approximation of the projection of A onto the column space of AΦᵀ.

Papadimitriou et al. (2000) first applied random projections for Latent Semantic Indexing (LSI) and derived an additive error bound result.
A relative-error bound in the spectral norm uses a power iteration to offset any slow singular value decay.

Relative-Error Bound (Spectral norm) [Halko et al. 2011]
Let A ∈ R^{m×n}. If B is an n × 2k Gaussian matrix and Y = (AA∗)^q AB, where q is a small non-negative integer and 2k is the target rank of the approximation with 2 ≤ k ≤ 0.5 min{m, n}, then:

E‖A − Proj_{Y,2k}(A)‖_2 ≤ [ 1 + 4 √( 2 min(m, n) / (k − 1) ) ]^{1/(2q+1)} ‖A − A_k‖_2

The power iteration raises A's singular values to the power 2q + 1, sharpening their decay and improving accuracy
A refined proof [Woodruff 2014] gave a rank-k approximation
From the Relative-Error Bound results of Sarlos and Halko et al.:

With l > k random linear combinations of A's columns, we can obtain a rank-k approximation of A.

How and why?

Multiplying A by a random vector x gives y ∈ colspace(A)
With high probability the resulting y's are linearly independent
We get an approximate basis of dimension l for the column space of A
Project A onto this basis
Get a rank-k matrix approximation of this projection
Consider the existence result of Ruston (1962) for a collection C of k columns of A ∈ R^{m×n}:

‖A − CC†A‖_2 ≤ √(1 + k(n − k)) ‖A − A_k‖_2

The CX approximation is A ≈ CX where X := C†A
Sampling with Euclidean norms of matrix columns [Frieze Kannan Vempala 2004] gives additive error bounds
Sampling according to the top right singular vectors [Boutsidis Mahoney Drineas 2010] gives relative error bounds
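A sketch of the norm-squared column sampling just cited (sampling with replacement and the usual 1/√(c p_j) rescaling; sample_columns is my own illustrative helper, and all sizes are arbitrary):

import numpy as np

def sample_columns(A, c, rng=np.random.default_rng()):
    # p_j = ||A_j||^2 / ||A||_F^2: sample columns proportionally to squared norms.
    p = np.sum(A * A, axis=0)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    return A[:, idx] / np.sqrt(c * p[idx]), idx

A = np.random.randn(300, 200)
C, idx = sample_columns(A, 40)
X = np.linalg.pinv(C) @ A                    # CX approximation: A ~ C @ X
print(np.linalg.norm(A - C @ X, 'fro') / np.linalg.norm(A, 'fro'))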
Another approach to LRA extends column sampling to also include row sampling:
Extensions to both CX probability distribution approaches
Approximation error proportional to square of the CX error
General Approach:
Sample c columns of A to get C as in CX
Sample r rows from A using a probability distributionconstructed from C
Re-scale the selected rows and columns
Additional processing steps to get an LRA
More recent directions include CUR with volume sampling:

Pseudo-Skeleton Approximation [Goreinov et al 1997]
Suppose A ∈ R^{m×n}. Then there exist a set of k columns and rows, C and R, of A, given by their index sets c and r respectively, and a matrix U ∈ R^{k×k} such that:

‖A − CUR‖_2 ≤ O(√k (√m + √n)) ‖A − A_k‖_2

Maximal Volume for LRA [Goreinov and Tyrtyshnikov 2001]
Suppose Ã is a CUR approximation of the form given above with U = A(r, c)^{-1}. If A(r, c) has maximal determinant modulus among all k × k submatrices of A, then:

‖A − Ã‖_C ≤ (k + 1)‖A − A_k‖_2

where ‖·‖_C denotes the Chebyshev norm, the largest entry in absolute value.
CUR approximation of A depends on finding a sufficiently large volume submatrix:
Submatrix is the intersection of C and R in the CUR
Volume quantifies the orthogonality of matrix columns
It is NP-hard to find a submatrix of maximal volume
Greedy algorithms find approximate maximal volume
This random projection algorithm follows from the J-L Lemma and the Relative-Error Bound results:

Input: A ∈ R^{m×n}, rank k, oversampling parameter p
Output: B ∈ R^{m×(k+p)}, C ∈ R^{(k+p)×n}

1. l ← k + p
2. Construct a random Gaussian matrix G ∈ R^{n×l}
3. Y ← A · G
4. Get an orthogonal basis matrix Q for Y
5. B ← Q
6. C ← Q∗ · A
7. Output B, C

Algorithm 1: Dimension Reduction [Halko et al 2011]
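A direct numpy rendering of Algorithm 1 (a sketch; dimension_reduction is my own helper name, and the optional power iteration of Halko et al. is exposed via q, defaulting to 0 so it matches the listing above; practical codes re-orthonormalize between power steps for stability):

import numpy as np

def dimension_reduction(A, k, p=10, q=0):
    # Returns B (m x l) and C (l x n) with A ~ B @ C, where l = k + p.
    m, n = A.shape
    l = k + p
    G = np.random.randn(n, l)          # random Gaussian multiplier
    Y = A @ G                          # sample the range of A
    for _ in range(q):                 # optional power iterations: Y = (A A^T)^q A G
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)             # orthogonal basis for range(Y)
    return Q, Q.T @ A                  # B = Q, C = Q* A

A = np.random.randn(400, 300)
B, C = dimension_reduction(A, k=20)

# Rank-l SVD of the approximation, as described on the next slide:
U, s, Vh = np.linalg.svd(C, full_matrices=False)
U = B @ U                              # left singular vectors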
To get a rank-l SVD approximation for A using the algorithm output:

1. Run an SVD algorithm on the matrix C = UΣV∗
2. U ← B · U

Comments on the algorithm:

The algorithm itself uses conventional steps on smaller matrices
Matrix-matrix multiplication (a block operation) is preferable for A
The costliest step is Y ← A · G, requiring O(mnl) ops
The QR factorization may avoid the overhead of column pivoting
The oversampling parameter is typically higher with other random matrices
Other possibilities:

Introduce parallelism for matrix-matrix multiplication
SRFT/SRHT random multipliers reduce the multiplication cost to O(mn log l)
Superfast abridged (sparse) versions of SRFT/SRHT allow further cost reduction, though with no probability guarantee.
The Subsampled Random Hadamard Transform (SRHT) is √(n/l) · DHR, where:

D ∈ C^{n×n} is a diagonal matrix of random {−1, +1} entries
H is the n × n Hadamard matrix
R ∈ R^{n×l} has l random columns from the identity matrix

Gaussian random matrices:
Have to generate n × l entries; multiplication is also expensive
Probability of failure is 3e^{−p}

Fast SRFT/SRHT:
Recursive divide and conquer ⇒ smaller complexity cost
Only n + l random entries needed
Probability of failure rises: O(1/k) for a rank-k approximation
Non-sequential memory access ⇒ memory bottlenecks
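An illustrative dense sketch of the SRHT multiplier (srht_multiplier is my own helper name; scipy.linalg.hadamard forms the ±1 Hadamard matrix explicitly, whereas a true fast version applies H by an O(n log n) recursion instead of dense multiplication):

import numpy as np
from scipy.linalg import hadamard

def srht_multiplier(n, l, rng=np.random.default_rng()):
    # Dense n x l sketching matrix sqrt(n/l) * D H R; n must be a power of 2.
    d = rng.choice([-1.0, 1.0], size=n)              # diagonal of D: random signs
    H = hadamard(n) / np.sqrt(n)                     # orthonormal Hadamard matrix
    cols = rng.choice(n, size=l, replace=False)      # R: random identity columns
    return np.sqrt(n / l) * (d[:, None] * H[:, cols])

n, l = 1024, 32
A = np.random.randn(500, n)
Y = A @ srht_multiplier(n, l)                        # sketched 500 x 32 matrix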
In general, desirable properties of random multipliers include:
Orthogonal
Sparse (but not too sparse)
Structured
Questions to consider with regard to SRFT/SRHT:
Are there alternatives that do not have the memory issues?
Concerns of FFT with limited parallelization
Alternatives - trade off arithmetic complexity for better memory performance and parallelization?
Can we have the best of both worlds?

Results on different multipliers from my own research will be shown...
CUR Cross-Approximation

[Figure: the first three recursive steps of a Cross Approximation algorithm, outputting three striped matrices W1, W2, and W3]

Adapted from Low Rank Approximation: New Insights, Accurate Superfast Algorithms, Pre-processing and Extensions, Victor Y. Pan, Qi Luan, John Svadlenka, Liang Zhao 2017
To complete the CUR approximation:

1. Form the matrix U by taking the inverse of A(I, J)
2. Set C = A(:, J) and R = A(I, :)

How to approximate the maximum volume:

Use RRLU or RRQR algorithms
Example: LU factorization to generate an upper triangular matrix [CT Pan 2000]
For a triangular matrix T ∈ R^{n×n}: det(T) = ∏_{i=1}^{n} t_ii
The goal is to maximize the absolute values on T's diagonal
This involves column interchanges and searching for maximum absolute-valued elements
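For illustration, pivoted QR can play the role of the rank-revealing step; the sketch below (cur_via_pivoted_qr is my own helper name) selects k columns, then k rows, and inverts their intersection. This is a generic heuristic, not the specific [CT Pan 2000] RRLU procedure:

import numpy as np
from scipy.linalg import qr

def cur_via_pivoted_qr(A, k):
    # Choose k columns of A, then k rows, via QR column pivoting.
    _, _, col_piv = qr(A, mode='economic', pivoting=True)
    J = col_piv[:k]                        # column index set
    _, _, row_piv = qr(A[:, J].T, mode='economic', pivoting=True)
    I = row_piv[:k]                        # row index set
    U = np.linalg.inv(A[np.ix_(I, J)])     # inverse of the k x k intersection
    return A[:, J], U, A[I, :]

A = np.random.randn(200, 12) @ np.random.randn(12, 150)   # exactly rank 12
C, U, R = cur_via_pivoted_qr(A, 12)
print(np.linalg.norm(A - C @ U @ R))                       # ~ machine precision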
Some comments on the CUR Cross Approximation:

As with Dimension Reduction, it runs an algorithm on a smaller matrix than A
Each pass through the algorithm's loop requires only O((m + n)k²) ops
Implications of not using all matrix entries in the algorithm?
How to parallelize this algorithm? Perhaps a divide and conquer approach with small blocks.
Formulate random multipliers with the strategy:

1. Utilize structured, sparse primitive matrices of random (Gaussian, Bernoulli) variables to form families of random multipliers B
2. B ∈ R^{n×l}, B = ∑_{i=1}^{t} B_i, where t is a small constant
3. The B_i are chosen and applied from the following classes:
   Abridged and Permuted Hadamard APH (with optional scaling S)
   Orthogonal Permutation matrix P
   Inverse bidiagonal matrix IBD: (I + SZ)^{-1}, where S is a diagonal matrix and Z is the down-shift matrix, with ones on the first subdiagonal and zeros elsewhere
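A sketch of applying an IBD multiplier under the definition above (apply_ibd_multiplier is my own helper name; the inverse is never formed, a triangular solve applies it, and exploiting the bidiagonal structure would make the solve O(nl); the random subdiagonal stands in for S, and all sizes are illustrative):

import numpy as np
from scipy.linalg import solve_triangular

def apply_ibd_multiplier(A, l, rng=np.random.default_rng()):
    # Y = A @ B_ibd, where B_ibd = first l columns of (I + S Z)^{-1}.
    n = A.shape[1]
    sub = rng.standard_normal(n - 1)            # subdiagonal entries of S Z
    M = np.eye(n) + np.diag(sub, k=-1)          # I + S Z: unit lower bidiagonal
    B = solve_triangular(M, np.eye(n, l), lower=True, unit_diagonal=True)
    return A @ B

A = np.random.randn(300, 256)
Y = apply_ibd_multiplier(A, 32)                 # sketched 300 x 32 matrix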
Numerical Experiments: Relative errors with various multipliers

                   SVD-generated Matrices     Laplacian Matrices
Multiplier Sum     Mean        Std            Mean        Std
Gaussian           1.07E-08    3.82E-09       2.05E-13    1.62E-13
ASPH, 2 IBD        1.23E-08    5.84E-09       1.69E-13    1.34E-13
ASPH, 3 IBD        1.33E-08    1.00E-08       1.98E-13    1.30E-13
3 IBD              1.18E-08    6.23E-09       1.78E-13    1.42E-13
APH, 3 IBD         1.28E-08    1.40E-08       2.33E-13    3.44E-13
APH, 2 IBD         1.43E-08    1.87E-08       1.78E-13    1.61E-13
ASPH, 1 P          1.22E-08    1.26E-08       2.21E-13    2.83E-13
ASPH, 2 P          1.51E-08    1.18E-08       3.57E-13    9.27E-13
ASPH, 3 P          1.19E-08    6.93E-09       2.24E-13    1.76E-13
APH, 3 P           1.26E-08    1.16E-08       2.15E-13    1.70E-13
APH, 2 P           1.31E-08    1.18E-08       1.25E-14    5.16E-14
Investigate novel approaches that decrease computation:
Sum of IBD’s without APH, ASPH
IBD is a rank structured matrix: low rank off-diagonal blocks
Matrix Matrix Multiplication with IBD is O((n + l)m) ops
Good spatial and temporal locality (unlike SRFT/SRHT)
Generalize to other rank structured matrices?
Our numerical experiments are promising, but new directions remain to be investigated from a computational perspective:

Incorporate approximate leverage scores
Avoid random memory access (max element searching, column and row interchanges)
Look instead for matrix-matrix multiplication possibilities
Extensions to tensors?
CUR Cross Approximation Benchmark Results

Inputs             rank    mean        std
baart              6       1.94e-07    3.57e-09
shaw               12      3.02e-07    6.84e-09
gravity            25      3.35e-07    1.97e-07
wing               4       1.92e-06    8.78e-09
foxgood            10      7.25e-06    1.09e-06
inverse Laplace    25      2.40e-07    6.88e-08

Table: CUR approximation of benchmark 1000 × 1000 input matrices (at the numerical rank of the input matrices) of discretized Integral Equations from the San Jose State University Singular Matrix Database
Open Problems:

Do there exist random multipliers for Dimension Reduction such that matrix-matrix multiplication can be done faster than O(mn log n)?
Does there exist a CUR approximation algorithm with a relative error (1 + ε) bound in the spectral norm?

Future Research Directions:

Theoretical, algorithmic, and computational research in Low Rank Approximation, its applications, and related problem areas
Acknowledgements

I would like to thank my mentor, Professor Victor Pan, for his thoughtful guidance, insight, and support throughout my doctoral education. I am also grateful to Professors Feng Gu and xxxxx for their participation and interest as committee members for my Second Exam. Thank you.
References I
N. Halko, P. G. Martinsson, J. A. Tropp, Finding Structure with Randomness: Probabilistic Algorithms for Approximate Matrix Decompositions, SIAM Review, 53, 2, 217–288, 2011.
M. W. Mahoney, Randomized Algorithms for Matrices and Data, Foundations and Trends in Machine Learning, NOW Publishers, 3, 2, 2011. Preprint: arXiv:1104.5557 (2011) (Abridged version in: Advances in Machine Learning and Data Mining for Astronomy, edited by M. J. Way et al., 647–672, 2012.)
Woodruff, David P., Sketching as a tool for numerical linear algebra, Foundations and Trends in Theoretical Computer Science, 10, 1–2, 1–157, 2014.
References II
T. Sarlos, Improved Approximation Algorithms for Large Matrices via Random Projections, Proceedings of IEEE Symposium on Foundations of Computer Science (FOCS), 143–152, 2006.
Golub, Gene H., and Christian Reinsch, Singular value decomposition and least squares solutions, Numerische Mathematik, 14, 5, 403–420, 1970.
Axler, Sheldon Jay, Linear Algebra Done Right, Springer, New York, NY, 1997 (second edition).
References III
S. A. Goreinov, E. E. Tyrtyshnikov and N. L. Zamarashkin, A theory of pseudo-skeleton approximation, Linear Algebra and Its Applications, 261, 1–21, 1997.
E. Liberty, F. Woolfe, P. G. Martinsson, V. Rokhlin and M. Tygert, Randomized algorithms for the low rank approximation of matrices, PNAS, 104, 51, 20167-20172, 2007.
F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, A fast randomized algorithm for the approximation of matrices, Technical Report YALEU/DCS/TR-1380, Yale University Department of Computer Science, New Haven, CT, 2007.
References IV
W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Proc. of modern analysis and probability, Contemporary Mathematics, 26, 189-206, 1984.
N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform, STOC 2006: Proc. 38th Ann. ACM Theory of Computing, 557-563, 2006.
Drineas, Petros, Michael W. Mahoney, and S. Muthukrishnan, Relative-error CUR matrix decompositions, SIAM Journal on Matrix Analysis and Applications, 30, 2, 844-881, 2008.
References V
C. H. Papadimitriou, P. Raghavan, H. Tamaki and S. Vempala, Latent Semantic Indexing: A probabilistic analysis, Journal of Computer and System Sciences, 61, 2, 217-235, 2000.
S. A. Goreinov, N. L. Zamarashkin and E. E. Tyrtyshnikov, Pseudo-skeleton approximations by matrices of maximal volume, Mathematical Notes, 62, 4, 515-519, 1997.
S. A. Goreinov and E. E. Tyrtyshnikov, The maximal-volume concept in approximation by low-rank matrices, Contemporary Mathematics, 208, 47-51, 2001.
References VI
D. Achlioptas, Database-friendly random projections, Proc. ACM Symp. on the Principles of Database Systems, 274-281, 2001.
C.-T. Pan, On the existence and computation of rank-revealing LU factorizations, Linear Algebra and its Applications, 316, 199–222, 2000.
V. Y. Pan, Structured Matrices and Polynomials: Unified Superfast Algorithms, Birkhauser/Springer, Boston/New York, 2001.
Pan, Victor, John Svadlenka, and Liang Zhao, Fast Derandomized Low-rank Approximation and Extensions, CoRR, abs/1607.05801, 2016.
References VII
Rudelson, Mark, and Roman Vershynin, Non-asymptotic theory of random matrices: extreme singular values, CoRR, abs/1003.2990v2, 2010.
Dasgupta, Sanjoy, and Anupam Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss, Random Structures and Algorithms, 22, 1, 60-65, 2003.
A. Frieze, R. Kannan and S. Vempala, Fast Monte-Carlo algorithms for finding low-rank approximations, Journal of the ACM, 51, 6, 1025-1041, 2004.
References VIII
Akin, Berkin, Franz Franchetti, and James C. Hoe, FFTs with near-optimal memory access through block data layouts, Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, IEEE, 2014.
Barba, Lorena A., and Rio Yokota, How will the fast multipole method fare in the exascale era, SIAM News, 46, 6, 1-3, 2013.
M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM Journal of Scientific Computing, 17, 4, 848-869, 1996.
References IX
Lindtjorn, Olav, et al., Beyond traditional microprocessors for geoscience high-performance computing applications, IEEE Micro, 31, 2, 41-49, 2011.
Ruston, A., Auerbach's theorem and tensor products of Banach spaces, Mathematical Proceedings of the Cambridge Philosophical Society, 58, 3, doi:10.1017/S0305004100036744, 476-480, 1962.
Cheng, Hongwei, et al., On the compression of low rank matrices, SIAM Journal on Scientific Computing, 26, 4, 1389-1404, 2005.
APPENDIX: Traditional Applications
Applications of matrix computations have typically included:
Physical Sciences and Engineering
Data Collection and Analysis
Computer Graphics
Biological and Life Sciences
The Theoretical Computer Science (TCS) perspective is increasingly important:
Cross-fertilization of research in both fields
Demands of new applications of interest to TCS
Shortcomings of conventional LRA algorithms
APPENDIX: Two-sided Interpolative Decomposition
Two-sided Interpolative Decomposition Theorem [Cheng et al 2005]
Let A be an m × n matrix and k ≤ min(m, n). Then there exists a k × k submatrix A_s of A such that:

A = P_L [ I_k ; S ] A_s [ I_k | T ] P_R∗ + X

where [ I_k ; S ] stacks I_k above S and [ I_k | T ] places them side by side. P_L and P_R are permutation matrices, and S ∈ C^{(m−k)×k}, T ∈ C^{k×(n−k)}, and X satisfy:

‖S‖_F ≤ √(k(m − k))
‖T‖_F ≤ √(k(n − k))
‖X‖_2 ≤ σ_{k+1}(A) √(1 + k(min(m, n) − k))
APPENDIX: Deterministic Algorithms
Theorem
Gram-Schmidt and QR Factorization: Suppose (a_1, a_2, . . . , a_n) is a linearly independent list of vectors in an inner product space V. Then there is an orthonormal list of vectors (q_1, q_2, . . . , q_n) such that span(a_1, a_2, . . . , a_n) = span(q_1, q_2, . . . , q_n).
Proof.
Let proj(r, s) := (⟨s, r⟩/⟨s, s⟩) s denote the projection of r onto s.

w_1 := a_1
w_2 := a_2 − proj(a_2, w_1)
⋮
w_n := a_n − proj(a_n, w_1) − proj(a_n, w_2) − · · · − proj(a_n, w_{n−1})
q_1 = w_1/‖w_1‖, q_2 = w_2/‖w_2‖, . . . , q_n = w_n/‖w_n‖
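The construction transcribes directly into code (a classical Gram-Schmidt sketch; classical_gram_schmidt is my own helper name, and modified Gram-Schmidt is preferred numerically, per the shortcomings discussed earlier):

import numpy as np

def classical_gram_schmidt(A):
    # Columns of A assumed linearly independent; returns Q with orthonormal columns.
    m, n = A.shape
    Q = np.zeros((m, n))
    for i in range(n):
        w = A[:, i].copy()
        for j in range(i):                        # subtract projections on earlier q's
            w -= (Q[:, j] @ A[:, i]) * Q[:, j]
        Q[:, i] = w / np.linalg.norm(w)
    return Q

Q = classical_gram_schmidt(np.random.randn(8, 5))
print(np.allclose(Q.T @ Q, np.eye(5)))            # True up to rounding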
APPENDIX: Deterministic Algorithms
Re-arranging the equations for w_1, w_2, . . . , w_n to be equations with a_1, a_2, . . . , a_n on the left-hand side and replacing the w_i with q_i gives A = Q · R, where

A = [a_1, a_2, . . . , a_n]
Q = [q_1, q_2, . . . , q_n]

R = [ ⟨q_1, a_1⟩   ⟨q_1, a_2⟩   ⟨q_1, a_3⟩   . . .   ⟨q_1, a_n⟩
      0            ⟨q_2, a_2⟩   . . .        ⟨q_2, a_{n−1}⟩   ⟨q_2, a_n⟩
      0            0            ⟨q_3, a_3⟩   . . .   ⟨q_3, a_n⟩
      ⋮            ⋮            ⋮            ⋱       ⋮
      0            0            0            0       ⟨q_n, a_n⟩ ]
APPENDIX: Deterministic Algorithms
As an alternative to Gram-Schmidt QR, consider an orthogonal matrix product Q_1 Q_2 . . . Q_n that transforms A to upper triangular form R:

(Q_n . . . Q_2 Q_1) A = R

Multiplying both sides by (Q_n . . . Q_2 Q_1)^{-1}, we have that:

(Q_n . . . Q_2 Q_1)^{-1} (Q_n . . . Q_2 Q_1) A = (Q_n . . . Q_2 Q_1)^{-1} R
A = Q_1 Q_2 . . . Q_n R

A product of orthogonal matrices is also orthogonal, so allowing for column pivoting we have that:

AΠ = Q_1 Q_2 . . . Q_n R

A Householder reflection matrix is used for each Q_i, i = 1, 2, . . . , n, to transform A to R column-wise...
APPENDIX: Deterministic Algorithms
A Householder matrix-vector multiplication Hx = (I − 2vvᵀ)x reflects a vector x across the hyperplane normal to v.

A unit vector v is constructed for each Q_i Householder matrix so that the entries of column i below the diagonal of A vanish:

x = (a_{i,i}, a_{i+1,i}, . . . , a_{m,i}) is the part of column i on and below the diagonal
v depends upon x and the standard basis vector e_i
The matrix product Q_i · A is applied
The above steps are repeated for each column of A

Impact to the QR algorithm:

Householder matrices improve numerical stability
But each matrix Q_i is applied separately to A
Therefore, parallelism options are limited
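A compact numpy sketch of Householder QR along these lines (householder_qr is my own helper name; the reflector acts on the trailing subcolumn, and accumulating Q explicitly is for illustration only):

import numpy as np

def householder_qr(A):
    # QR of an m x n matrix with m >= n: returns Q (m x m) and R (m x n).
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for i in range(n):
        x = R[i:, i]                                         # column i, on/below diagonal
        v = x.copy()
        v[0] += (np.sign(x[0]) or 1.0) * np.linalg.norm(x)   # v = x + sign(x_1)||x|| e_1
        v /= np.linalg.norm(v)
        R[i:, :] -= 2.0 * np.outer(v, v @ R[i:, :])          # left-apply H = I - 2 v v^T
        Q[:, i:] -= 2.0 * np.outer(Q[:, i:] @ v, v)          # accumulate Q = Q_1 Q_2 ... Q_n
    return Q, R

A = np.random.randn(6, 4)
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A))                                 # True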
APPENDIX: Deterministic Algorithms
The SVD decomposition A = UΣV∗ is computed in two distinct steps:

1st Step: Use two sequences of Householder transformations to reduce A to upper bidiagonal form:

B = Q_n . . . Q_2 Q_1 A P_1 P_2 . . . P_{n−2}

Therefore, we have that: A = Q_1 Q_2 . . . Q_n B P_{n−2} . . . P_2 P_1

2nd Step: Use two sequences of Givens rotations (orthogonal transformations) to reduce B to diagonal form Σ:

Σ = G_{n−1} . . . G_2 G_1 B F_1 F_2 . . . F_{n−1}

Likewise, we have that: B = G_1 G_2 . . . G_{n−1} Σ F_{n−1} . . . F_2 F_1

Set U := Q_1 Q_2 . . . Q_n G_1 G_2 . . . G_{n−1}
Set V := (F_1 F_2 . . . F_{n−1})∗ (P_1 P_2 . . . P_{n−2})∗
APPENDIX: Deterministic Algorithms
SVD cost is O(mn max(m, n))
QR cost is O(kmn) for a rank-k approximation
Random memory access (e.g., column pivoting) contributes to memory bottlenecks
This is especially the case for out-of-core data sets
The standard QR algorithm forms Q from a product of Householder reflector matrices, which permits better numerical stability.