Page 1:

Matrix Computations (Chapter 1-2)

Mehrdad Sheikholeslami

Department of Electrical Engineering

University at Buffalo

The Best Group

Winter 2016

Page 2:

Chapter 1

1.1 Basic Algorithms and Notation

1.2 Exploiting Structure: If a matrix has structure, then it is usually possible to exploit it.

1.3 Block Matrices and Algorithms: A block matrix is a matrix with matrix entries.

1.4 Vectorization and Re-Use Issues

Page 3:

1.1 Basic Algorithms and Notation

Matrix Notation:

Let ℝ denote the set of real numbers.

$$A \in \mathbb{R}^{m \times n} \iff A = (a_{ij}) = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}, \qquad a_{ij} \in \mathbb{R}$$

Matrix Operations:

• Transposition (ℝ^{m×n} → ℝ^{n×m}): C = A^T ⟹ c_ij = a_ji
• Addition (ℝ^{m×n} + ℝ^{m×n} → ℝ^{m×n}): C = A + B ⟹ c_ij = a_ij + b_ij
• Scalar-matrix multiplication (ℝ × ℝ^{m×n} → ℝ^{m×n}): C = aA ⟹ c_ij = a·a_ij
• Matrix-matrix multiplication (ℝ^{m×p} × ℝ^{p×n} → ℝ^{m×n}): $C = AB \Rightarrow c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$

Vector Notation:

$$x \in \mathbb{R}^n \iff x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad x_i \in \mathbb{R} \quad (\text{column vector})$$

$$x \in \mathbb{R}^{1 \times n} \iff x = [x_1 \; \cdots \; x_n], \quad x_i \in \mathbb{R} \quad (\text{row vector})$$

If x is a column vector, then y = x^T is a row vector.

Page 4:

Vector Operations:

• Scalar-vector multiplication (ℝ × ℝ^n → ℝ^n): z = ax ⟹ z_i = a x_i
• Vector addition (ℝ^n + ℝ^n → ℝ^n): z = x + y ⟹ z_i = x_i + y_i
• The dot product (inner product) (ℝ^{1×n} × ℝ^n → ℝ): $c = x^T y \Rightarrow c = \sum_{i=1}^{n} x_i y_i$

c = 0
for i = 1:n
    c = c + x(i)y(i)
end

It involves n multiplications and n additions. It is an "O(n)" operation, meaning that the amount of work is linear in the dimension.

• Vector multiply (Hadamard product) (ℝ^n .* ℝ^n → ℝ^n): z = x .* y ⟹ z_i = x_i y_i
• An important update form (saxpy): y = ax + y ⟹ y_i = a x_i + y_i ("=" is being used to denote assignment)

for i = 1:n
    y(i) = ax(i) + y(i)
end
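As an illustrative sketch (ours, not from the slides), the dot product and saxpy update can be written in NumPy; the function names dot_product and saxpy below are our own:

import numpy as np

def dot_product(x, y):
    # c = sum_i x_i * y_i : an O(n) (level-1) operation
    c = 0.0
    for xi, yi in zip(x, y):
        c += xi * yi
    return c

def saxpy(a, x, y):
    # y <- a*x + y, component by component
    for i in range(len(y)):
        y[i] = a * x[i] + y[i]
    return y

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
assert np.isclose(dot_product(x, y), x @ y)
assert np.allclose(saxpy(2.0, x, y.copy()), 2.0 * x + y)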

Page 5:

Generalized saxpy or gaxpy:

Suppose A ∈ ℝ^{m×n} and we wish to compute the update y = Ax + y. A standard way is to update the components one at a time: $y_i = \sum_{j=1}^{n} a_{ij} x_j + y_i, \quad i = 1{:}m$.

• gaxpy (Row Version)

for i = 1:m

for j = 1:n

y(i) = A(i,j)x(j)+y(i)

end

end

• gaxpy (Column Version)

for j = 1:n

for i = 1:m

y(i) = A(i,j)x(j)+y(i)

end

end

• The inner loop in either gaxpy algorithm carries out a saxpy algorithm.
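A hedged NumPy sketch of the two loop orderings above (our own illustration; the function names are not from the slides):

import numpy as np

def gaxpy_row(A, x, y):
    # row version: each pass over j accumulates a dot product into y(i)
    m, n = A.shape
    for i in range(m):
        for j in range(n):
            y[i] += A[i, j] * x[j]
    return y

def gaxpy_col(A, x, y):
    # column version: each pass over i is a saxpy with column A(:,j)
    m, n = A.shape
    for j in range(n):
        for i in range(m):
            y[i] += A[i, j] * x[j]
    return y

A = np.arange(6.0).reshape(2, 3)
x = np.array([1.0, 2.0, 3.0]); y0 = np.array([1.0, 1.0])
assert np.allclose(gaxpy_row(A, x, y0.copy()), A @ x + y0)
assert np.allclose(gaxpy_col(A, x, y0.copy()), A @ x + y0)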

Page 6:

Partitioned matrices: From the row point of view, a matrix is a stack of row vectors:

$$A \in \mathbb{R}^{m \times n} \iff A = \begin{bmatrix} r_1^T \\ \vdots \\ r_m^T \end{bmatrix}, \quad r_k \in \mathbb{R}^n$$

• Then the row version of gaxpy becomes:

for i = 1:m
    y(i) = r_i^T x + y(i)
end

• Alternatively, a matrix is a collection of column vectors:

$$A \in \mathbb{R}^{m \times n} \iff A = [c_1 \; \cdots \; c_n], \quad c_k \in \mathbb{R}^m$$

Then the column version of gaxpy becomes:

for j = 1:n
    y = x(j)c_j + y
end

Page 7:

The colon notation: To specify a column or row of a matrix:

if A ∈ ℝ^{m×n}, then $A(k,:) = [a_{k1} \; \cdots \; a_{kn}]$ and $A(:,k) = \begin{bmatrix} a_{1k} \\ \vdots \\ a_{mk} \end{bmatrix}$

• Then the row version of gaxpy becomes:

for i = 1:m
    y(i) = A(i,:)x + y(i)
end

• And the column version of gaxpy becomes:

for j = 1:n
    y = x(j)A(:,j) + y
end

• With the colon notation we are able to suppress iteration details.

Page 8:

Outer Product

𝐴 = 𝐴 + 𝑥𝑦𝑇 , 𝐴 ∈ ℝ𝑚×𝑛, x ∈ ℝ𝑚, 𝑦 ∈ ℝ𝑛

for i = 1:m

for j = 1:n

A(i,j) = A(i,j)+x(i) y(j)

end

end

To eliminate the j-loop by colon:

for i = 1:m

A(i,:) = A(i,:) + x(i)y^T

end

Or if we make the i-loop the inner loop:

for j = 1:n

A(:,j) = A(:,j)+y(j)x

end
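A small NumPy sketch of the rank-1 update (our own illustration), using the column-oriented loop above:

import numpy as np

def outer_update(A, x, y):
    # A <- A + x y^T, one column saxpy at a time
    for j in range(A.shape[1]):
        A[:, j] += y[j] * x
    return A

A = np.zeros((2, 3)); x = np.array([1.0, 2.0]); y = np.array([3.0, 4.0, 5.0])
assert np.allclose(outer_update(A.copy(), x, y), A + np.outer(x, y))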

Page 9:

Matrix-Matrix Multiplication

Methods:

• Dot product
• Saxpy
• Outer product

Although mathematically equivalent, these can have different levels of performance.

Scalar Level Specification

The “ijk” variant (Identify rows of C (and A) with “i”, columns of C (and B) with “j” and

summation index with “k”):

𝐶 = 𝐴𝐵 + 𝐶 , 𝐴 ∈ ℝ𝑚×𝑝, 𝐵 ∈ ℝ𝑝×𝑛, 𝐶 ∈ ℝ𝑚×𝑛

for i = 1:m

for j = 1:n

for k = 1:p

C(i,j) = A(i,k)B(k,j)+C(i,j)

end

end

end

Page 10:

So we have 3! = 6 variations. Each of the six possibilities features an inner loop operation (dot product or saxpy) and has its own pattern of data flow; for example:

The "ijk" variant involves a dot product, with access to a row of A and a column of B.

The "jki" variant involves a saxpy, with access to a column of C and a column of A.

Loop Order   Inner Loop   Middle Loop   Inner Loop Data Access

ijk Dot vector×matrix A by row, B by column

jik Dot matrix×vector A by row, B by column

ikj Saxpy row gaxpy B by row, C by row

jki Saxpy column gaxpy A by column, C by column

kij Saxpy row outer product B by row, C by row

kji Saxpy column outer product A by column, C by column

Page 11:

Matrix Multiplication: Dot product version

If 𝐴 ∈ ℝ𝑚×𝑝, 𝐵 ∈ ℝ𝑝×𝑛, 𝐶 ∈ ℝ𝑚×𝑛 are given, this algorithm replaces C by AB+C:

for i = 1:m

for j = 1:n

C(i,j) = A(i,:)B(:,j)+C(i,j)

end

end

Inner two loops of ijk variant define a row-oriented gaxpy operation.

Matrix Multiplication: A saxpy formulation

Suppose A and C are column partitioned. If A ∈ ℝ^{m×p}, B ∈ ℝ^{p×n}, C ∈ ℝ^{m×n} are given, this algorithm overwrites C by AB+C:

for j = 1:n

for k = 1:p

C(:,j) = A(:,k)B(k,j)+C(:,j)

end

end

Note that the k-loop oversees a gaxpy operation:

for j = 1:n

C(:,j) = AB(:,j)+C(:,j)

end

Page 12:

Matrix Multiplication: Outer product version

Consider the “kij” variant:

for k = 1:p

for j = 1:n

for i = 1:m

C(i,j) = A(i,k)B(k,j)+C(i,j)

end

end

end

The inner two loops oversee the outer product update C = a_k b_k^T + C, where A is column partitioned (a_k is the kth column of A) and B is row partitioned (b_k^T is the kth row of B). If A ∈ ℝ^{m×p}, B ∈ ℝ^{p×n}, C ∈ ℝ^{m×n} are given, this algorithm overwrites C by AB+C:

for k = 1:p

C = A(:,k)B(k,:)+C

end
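The following NumPy sketch (ours, for illustration) spells out the dot-product, saxpy, and outer-product formulations of C = AB + C; all three produce the same result:

import numpy as np

def matmul_dot(A, B, C):
    m, n = C.shape
    for i in range(m):
        for j in range(n):
            C[i, j] += A[i, :] @ B[:, j]        # innermost work is a dot product
    return C

def matmul_saxpy(A, B, C):
    p, n = B.shape
    for j in range(n):
        for k in range(p):
            C[:, j] += A[:, k] * B[k, j]        # inner loop is a saxpy on C(:,j)
    return C

def matmul_outer(A, B, C):
    p = A.shape[1]
    for k in range(p):
        C += np.outer(A[:, k], B[k, :])         # rank-1 (outer product) update
    return C

A = np.random.rand(3, 4); B = np.random.rand(4, 2); C = np.random.rand(3, 2)
ref = A @ B + C
for f in (matmul_dot, matmul_saxpy, matmul_outer):
    assert np.allclose(f(A, B, C.copy()), ref)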

Page 13:

The notion of “level”:

• The dot product and saxpy are examples of level-1 operations. The amount of data and the amount of arithmetic are linear in the dimension in these operations. An m-by-n outer product or gaxpy operation involves a quadratic amount of data (O(mn)) and a quadratic amount of work (O(mn)); they are examples of level-2 operations. The matrix update C = AB + C is a level-3 operation; it involves a quadratic amount of data and a cubic amount of work.

• Theorem: If 𝐴 ∈ ℝ𝑚×𝑝, 𝐵 ∈ ℝ𝑝×𝑛, then (𝐴𝐵)𝑇= 𝐵𝑇𝐴𝑇

The proof is in the textbook.

• Complex Matrices:

The scaling, addition, and multiplication of complex matrices correspond to the real case. However, transposition becomes conjugate transposition: $C = A^H \Rightarrow c_{ij} = \bar{a}_{ji}$. The dot product of complex n-vectors x and y is prescribed as $s = x^H y = \sum_{i=1}^{n} \bar{x}_i y_i$. Finally, if A = B + iC, then we designate the real and imaginary parts of A by Re(A) = B and Im(A) = C, respectively.

Page 14:

1.2 Exploiting Structure

The efficiency of a given matrix algorithm depends on many things. In this section we treat the amount of required arithmetic and storage.

Band Matrices and x-0 Notation

A ∈ ℝ^{m×n} has lower bandwidth p if a_ij = 0 whenever i > j + p, and upper bandwidth q if j > i + q implies a_ij = 0. Below is an example of an 8-by-5 matrix that has lower bandwidth 1 and upper bandwidth 2:

$$\begin{bmatrix}
x & x & x & 0 & 0 \\
x & x & x & x & 0 \\
0 & x & x & x & x \\
0 & 0 & x & x & x \\
0 & 0 & 0 & x & x \\
0 & 0 & 0 & 0 & x \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

Diagonal matrix manipulation

Matrices with upper and lower bandwidth zero are diagonal. If D ∈ ℝ^{m×n} is diagonal, then

D = diag(d_1, …, d_q), q = min(m, n) ⇔ d_i = d_ii

If D is diagonal and A is a matrix, then DA is a row scaling of A and AD is a column scaling of A.

Page 15:

Type of Matrix Lower Bandwidth Upper Bandwidth

diagonal 0 0

upper triangular 0 n-1

lower triangular m-1 0

tridiagonal 1 1

upper bidiagonal 0 1

lower bidiagonal 1 0

upper Hessenberg 1 n-1

lower Hessenberg m-1 1

Band terminology for m-by-n matrices

Page 16:

Triangular Matrix Multiplication

If A and B are both n-by-n and upper triangular, the 3-by-3 case of C = AB looks like this:

$$C = \begin{bmatrix}
a_{11}b_{11} & a_{11}b_{12} + a_{12}b_{22} & a_{11}b_{13} + a_{12}b_{23} + a_{13}b_{33} \\
0 & a_{22}b_{22} & a_{22}b_{23} + a_{23}b_{33} \\
0 & 0 & a_{33}b_{33}
\end{bmatrix}$$

It suggests that the product is triangular and that its upper triangular entries are the result of abbreviated inner products.

If 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 are upper triangular, then for C=AB we have:

C = 0
for i = 1:n
    for j = i:n
        for k = i:j
            C(i,j) = A(i,k)B(k,j) + C(i,j)
        end
    end
end
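An illustrative NumPy version (ours) of the triangular multiply, checked against the full product:

import numpy as np

def upper_tri_matmul(A, B):
    # C = AB for upper triangular A, B; only the abbreviated inner products are formed
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            for k in range(i, j + 1):
                C[i, j] += A[i, k] * B[k, j]
    return C

n = 4
A = np.triu(np.random.rand(n, n)); B = np.triu(np.random.rand(n, n))
assert np.allclose(upper_tri_matmul(A, B), A @ B)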

Page 17:

To quantify the savings in the last slide algorithm we need some tools for measuring the amount of work:

Flops: A flop is a floating point operation.

• A dot product or saxpy operation of length n requires 2n flops (n multiplications and n additions).
• The gaxpy y = Ax + y, A ∈ ℝ^{m×n}, and the outer product update A = A + xy^T each require 2mn flops.
• The matrix multiply update C = AB + C, A ∈ ℝ^{m×p}, B ∈ ℝ^{p×n}, C ∈ ℝ^{m×n}, requires 2mnp flops.

Now to count the flops in triangular matrix multiplication: c_ij (i ≤ j) requires 2(j − i + 1) flops, so the total number of flops is

$$\sum_{i=1}^{n} \sum_{j=i}^{n} 2(j - i + 1) \approx \frac{n^3}{3}$$

We find that this multiplication requires one-sixth the number of flops of full matrix multiplication.

• Flop counting is a necessarily crude approach to measuring program efficiency, and we must not infer too much from it because it captures only one of several dimensions of the efficiency issue.

Page 18:

Band Storage

Suppose A ∈ ℝ^{n×n} has lower bandwidth p and upper bandwidth q, and p and q are much smaller than n. Such a matrix can be stored in a (p+q+1)-by-n array A.band with the convention that a_ij = A.band(i − j + q + 1, j). For example, with p = 1 and q = 2,

$$A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & 0 & 0 & 0 \\
a_{21} & a_{22} & a_{23} & a_{24} & 0 & 0 \\
0 & a_{32} & a_{33} & a_{34} & a_{35} & 0 \\
0 & 0 & a_{43} & a_{44} & a_{45} & a_{46} \\
0 & 0 & 0 & a_{54} & a_{55} & a_{56} \\
0 & 0 & 0 & 0 & a_{65} & a_{66}
\end{bmatrix}
\quad\text{then}\quad
A.band = \begin{bmatrix}
0 & 0 & a_{13} & a_{24} & a_{35} & a_{46} \\
0 & a_{12} & a_{23} & a_{34} & a_{45} & a_{56} \\
a_{11} & a_{22} & a_{33} & a_{44} & a_{55} & a_{66} \\
a_{21} & a_{32} & a_{43} & a_{54} & a_{65} & 0
\end{bmatrix}$$

• Band gaxpy: With this data structure, our column oriented gaxpy algorithm transforms to the following:

Suppose 𝐴 ∈ ℝ𝑛×𝑛 has lower bandwidth 𝑝 and upper bandwidth 𝑞 and is stored in A.band format. If 𝑥, 𝑦 ∈ ℝ𝑛,

then the algorithm overwrites y with Ax+y:

for j = 1:n
    ytop = max(1, j-q)
    ybot = min(n, j+p)
    atop = max(1, q+2-j)
    abot = atop + ybot - ytop
    y(ytop:ybot) = x(j)A.band(atop:abot, j) + y(ytop:ybot)
end

• This algorithm involves just 2n(p + q + 1) flops, under the assumption that p and q are much smaller than n.
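A hedged NumPy translation (ours) of the band-storage gaxpy; indices are shifted to 0-based:

import numpy as np

def band_store(A, p, q):
    # pack A (n-by-n, lower bandwidth p, upper bandwidth q) into a (p+q+1)-by-n array
    n = A.shape[0]
    Aband = np.zeros((p + q + 1, n))
    for j in range(n):
        for i in range(max(0, j - q), min(n, j + p + 1)):
            Aband[i - j + q, j] = A[i, j]      # 0-based version of A.band(i-j+q+1, j)
    return Aband

def band_gaxpy(Aband, p, q, x, y):
    # y <- A x + y using only the stored band
    n = x.size
    y = y.copy()
    for j in range(n):
        ytop, ybot = max(0, j - q), min(n, j + p + 1)
        atop = max(0, q - j)
        abot = atop + (ybot - ytop)
        y[ytop:ybot] += x[j] * Aband[atop:abot, j]
    return y

n, p, q = 6, 1, 2
A = np.tril(np.triu(np.random.rand(n, n), -p), q)   # random band matrix
x, y = np.random.rand(n), np.random.rand(n)
assert np.allclose(band_gaxpy(band_store(A, p, q), p, q, x, y), A @ x + y)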

Page 19:

Symmetry

We say that 𝐴 ∈ ℝ𝑛×𝑛 is symmetric if 𝐴𝑇 = 𝐴.

Thus

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}$$

is symmetric. The storage requirement can be halved if we store only the lower triangle of elements, e.g. A.vec = [1 2 3 4 5 6]. In general, with this data structure we have

a_ij = A.vec((j − 1)n − j(j − 1)/2 + i), i ≥ j

• Symmetric storage gaxpy: Suppose 𝐴 ∈ ℝ𝑛×𝑛 is symmetric and is stored in A.vec format. If 𝑥, 𝑦 ∈ ℝ𝑛, then

the algorithm overwrites y with Ax+y:

for j = 1:n
    for i = 1:j-1
        y(i) = A.vec((i-1)n - i(i-1)/2 + j)x(j) + y(i)
    end
    for i = j:n
        y(i) = A.vec((j-1)n - j(j-1)/2 + i)x(j) + y(i)
    end
end

• This algorithm requires the same 2𝑛2 flops that an ordinary gaxpy requires. Halving the storage requirement

is purchased with some awkward subscripting.
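A NumPy sketch (ours, with 0-based indexing) of the packed A.vec storage and the symmetric gaxpy; the helper names sym_pack and sym_gaxpy are our own:

import numpy as np

def sym_pack(A):
    # pack the lower triangle of symmetric A column by column (the A.vec format)
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def sym_gaxpy(avec, x, y):
    # y <- A x + y using only the packed lower triangle
    n = x.size
    y = y.copy()
    for j in range(n):
        for i in range(j):                       # a_ij with i < j, stored as a_ji
            y[i] += avec[i * n - i * (i + 1) // 2 + j] * x[j]
        for i in range(j, n):                    # a_ij with i >= j
            y[i] += avec[j * n - j * (j + 1) // 2 + i] * x[j]
    return y

A = np.array([[1.0, 2, 3], [2, 4, 5], [3, 5, 6]])
x, y = np.array([1.0, -1.0, 2.0]), np.zeros(3)
assert np.allclose(sym_gaxpy(sym_pack(A), x, y), A @ x)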

Page 20:

• Store by diagonal

Symmetric matrices can also be stored by diagonal. In this structure we represent A with the vector A.diag. If

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}, \quad \text{then} \quad A.diag = [1 \; 4 \; 6 \; 2 \; 5 \; 3]$$

In general, a_{i+k,i} = a_{i,i+k} = A.diag(i + nk − k(k − 1)/2), k ≥ 0.

• Store by diagonal gaxpy: Suppose 𝐴 ∈ ℝ𝑛×𝑛 is symmetric and is stored in A.diag format. If 𝑥, 𝑦 ∈ ℝ𝑛, then the

algorithm overwrites y with Ax+y:

for i = 1:n

y(i) = A.diag(i)x(i) + y(i)

end

for k = 1:n-1

t= nk-k(k-1)/2

for i = 1:n-k

y(i) = A.diag(i+t)x(i+k) + y(i)

end

for i = 1:n-k

y(i+k)=A.diag(i+t)x(i)+y(i+k)

end

end

Page 21:

A Note on Overwriting and Workspaces:

Overwriting input data is another way to control the amount of memory that a matrix computation requires. Consider the n-by-n matrix multiplication problem C = AB with the proviso that the "input matrix" B is to be overwritten by the "output matrix" C. We cannot simply overwrite B, because in

C(1:n, 1:n) = 0
for j = 1:n
    for k = 1:n
        C(:,j) = C(:,j) + A(:,k)B(k,j)
    end
end

B(:,j) is needed throughout the entire k-loop. A linear workspace is needed to hold the jth column of the product until it is "safe" to overwrite B(:,j):

for j = 1:n
    w(1:n) = 0
    for k = 1:n
        w(:) = w(:) + A(:,k)B(k,j)
    end
    B(:,j) = w(:)
end

The need for this linear workspace is usually not important in a matrix computation that involves a 2-dimensional array of the same order.
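An illustrative NumPy version (ours) of overwriting B with AB using a single column workspace:

import numpy as np

def overwrite_with_product(A, B):
    # B <- A @ B, using one length-m workspace column w instead of a full output array
    for j in range(B.shape[1]):
        w = np.zeros(A.shape[0])
        for k in range(B.shape[0]):
            w += A[:, k] * B[k, j]
        B[:, j] = w              # safe: B(:,j) is no longer needed
    return B

A = np.random.rand(3, 3); B = np.random.rand(3, 3)
assert np.allclose(overwrite_with_product(A, B.copy()), A @ B)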

Page 22:

1.3 Block Matrices and Algorithms

Block algorithms are increasingly important in high performance computing. By a block algorithm we essentially mean an algorithm that is rich in matrix-matrix multiplication. Algorithms of this type turn out to be more efficient in many computing environments than those that are organized at a lower linear algebraic level.

Block Matrix Notation

Column and row partitioning are special cases of matrix blocking. In general we can partition both the rows and columns of an m-by-n matrix A to obtain

$$A = \begin{bmatrix} A_{11} & \cdots & A_{1r} \\ \vdots & \ddots & \vdots \\ A_{q1} & \cdots & A_{qr} \end{bmatrix}$$

where the row blocks have heights m_1, …, m_q with m_1 + … + m_q = m, the column blocks have widths n_1, …, n_r with n_1 + … + n_r = n, and A_{αβ} designates the (α, β) block or submatrix. This block has dimension m_α-by-n_β.

Page 23:

Block Matrix Manipulation

Block matrices combine just like matrices with scalar entries, as long as certain dimension requirements are met. Let

$$B = \begin{bmatrix} B_{11} & \cdots & B_{1r} \\ \vdots & \ddots & \vdots \\ B_{q1} & \cdots & B_{qr} \end{bmatrix}$$

be partitioned conformably with the matrix A on the last slide. The sum C = A + B can then also be regarded as a q-by-r block matrix:

$$C = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & \ddots & \vdots \\ C_{q1} & \cdots & C_{qr} \end{bmatrix}
= \begin{bmatrix} A_{11}+B_{11} & \cdots & A_{1r}+B_{1r} \\ \vdots & \ddots & \vdots \\ A_{q1}+B_{q1} & \cdots & A_{qr}+B_{qr} \end{bmatrix}$$

Multiplication is a little trickier. Lemma: If A ∈ ℝ^{m×p} and B ∈ ℝ^{p×n} are partitioned as

$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_q \end{bmatrix} \; (\text{row blocks of heights } m_1,\dots,m_q), \qquad B = [B_1 \; \cdots \; B_r] \; (\text{column blocks of widths } n_1,\dots,n_r)$$

then

$$AB = C = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & \ddots & \vdots \\ C_{q1} & \cdots & C_{qr} \end{bmatrix}, \qquad C_{\alpha\beta} = A_\alpha B_\beta \quad \text{for } \alpha = 1{:}q, \; \beta = 1{:}r$$

Page 24:

Another lemma: If A ∈ ℝ^{m×p} and B ∈ ℝ^{p×n} are partitioned as

$$A = [A_1 \; \cdots \; A_s] \; (\text{column blocks of widths } p_1,\dots,p_s), \qquad B = \begin{bmatrix} B_1 \\ \vdots \\ B_s \end{bmatrix} \; (\text{row blocks of heights } p_1,\dots,p_s)$$

then

$$AB = C = \sum_{\gamma=1}^{s} A_\gamma B_\gamma$$

Theorem 1.3: Based on the two lemmas, for general block matrix multiplication we have: if

$$A = \begin{bmatrix} A_{11} & \cdots & A_{1s} \\ \vdots & \ddots & \vdots \\ A_{q1} & \cdots & A_{qs} \end{bmatrix}, \qquad
B = \begin{bmatrix} B_{11} & \cdots & B_{1r} \\ \vdots & \ddots & \vdots \\ B_{s1} & \cdots & B_{sr} \end{bmatrix}, \qquad
C = AB = \begin{bmatrix} C_{11} & \cdots & C_{1r} \\ \vdots & \ddots & \vdots \\ C_{q1} & \cdots & C_{qr} \end{bmatrix}$$

(with row-block heights m_1, …, m_q for A and C, column-block widths p_1, …, p_s for A matching the row-block heights of B, and column-block widths n_1, …, n_r for B and C), then

$$C_{\alpha\beta} = \sum_{\gamma=1}^{s} A_{\alpha\gamma} B_{\gamma\beta}, \qquad \alpha = 1{:}q, \; \beta = 1{:}r$$

Page 25:

Submatrix Designation

Suppose A ∈ ℝ^{m×n} and that i = (i_1, …, i_r) and j = (j_1, …, j_c) are integer vectors with the property that i_1, …, i_r ∈ {1, 2, …, m} and j_1, …, j_c ∈ {1, 2, …, n}.

We let A(i, j) denote the r-by-c submatrix

$$A(i, j) = \begin{bmatrix} A(i_1, j_1) & \cdots & A(i_1, j_c) \\ \vdots & \ddots & \vdots \\ A(i_r, j_1) & \cdots & A(i_r, j_c) \end{bmatrix}$$

Then A(i_1:i_2, j_1:j_2) is the submatrix obtained by extracting rows i_1 through i_2 and columns j_1 through j_2, e.g.

$$A(3{:}5, 1{:}2) = \begin{bmatrix} a_{31} & a_{32} \\ a_{41} & a_{42} \\ a_{51} & a_{52} \end{bmatrix}$$

Page 26:

Block Matrix Multiplication

Multiplication of block matrices can be arranged in several possible ways, just as ordinary, scalar-level matrix multiplication can. Different blockings for A, B and C can set the stage for block versions of the dot product, saxpy and outer product algorithms. We assume that these three matrices are all n-by-n and that n = Nl. If A = (A_{αβ}), B = (B_{αβ}) and C = (C_{αβ}) are N-by-N block matrices with l-by-l blocks, then from Theorem 1.3:

$$C_{\alpha\beta} = \sum_{\gamma=1}^{N} A_{\alpha\gamma} B_{\gamma\beta} + C_{\alpha\beta}, \qquad \alpha = 1{:}N, \; \beta = 1{:}N$$

If we organize a matrix multiplication procedure around this summation we obtain:

for α = 1:N
    i = (α-1)l+1 : αl
    for β = 1:N
        j = (β-1)l+1 : βl
        for γ = 1:N
            k = (γ-1)l+1 : γl
            C(i,j) = A(i,k)B(k,j) + C(i,j)
        end
    end
end

• Block saxpy and block outer product versions are now obtainable; a small sketch follows.
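A small NumPy sketch (ours) of the blocked update C = AB + C with l-by-l blocks:

import numpy as np

def block_matmul_update(A, B, C, l):
    # C <- A B + C, with all matrices n-by-n and n = N*l
    n = C.shape[0]
    N = n // l
    for a in range(N):
        i = slice(a * l, (a + 1) * l)
        for b in range(N):
            j = slice(b * l, (b + 1) * l)
            for g in range(N):
                k = slice(g * l, (g + 1) * l)
                C[i, j] += A[i, k] @ B[k, j]     # level-3 block update
    return C

n, l = 6, 2
A, B, C = (np.random.rand(n, n) for _ in range(3))
assert np.allclose(block_matmul_update(A, B, C.copy(), l), A @ B + C)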

Page 27:

Complex Matrix Multiplication

Consider the complex matrix multiplication update

$$C_1 + iC_2 = (A_1 + iA_2)(B_1 + iB_2) + (C_1 + iC_2)$$

It can be expressed in real arithmetic as

$$\begin{bmatrix} C_1 \\ C_2 \end{bmatrix} = \begin{bmatrix} A_1 & -A_2 \\ A_2 & A_1 \end{bmatrix} \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} + \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}$$
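A quick NumPy check (ours) that the real 2-by-2 block form reproduces the complex update:

import numpy as np

m = 3
A1, A2, B1, B2, C1, C2 = (np.random.rand(m, m) for _ in range(6))

# complex update computed directly
Z = (A1 + 1j * A2) @ (B1 + 1j * B2) + (C1 + 1j * C2)

# the same update via the real block formulation
top = A1 @ B1 - A2 @ B2 + C1      # real part
bot = A2 @ B1 + A1 @ B2 + C2      # imaginary part
assert np.allclose(top, Z.real) and np.allclose(bot, Z.imag)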

A Divide and Conquer Matrix Multiplication

$$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$

In the ordinary algorithm, C(i,j) = A(i,1)B(1,j) + A(i,2)B(2,j); there are 8 block multiplies and 4 block adds. Strassen (1969) showed how to compute C with just 7 multiplies and 18 adds. A single application of Strassen's method involves 7/8ths the arithmetic of the fully conventional algorithm (Page 31).

Page 28:

1.4 Vectorization and Re-Use Issues

Vector pipeline computers are able to perform vector operations very fast because of special hardware that is able to exploit the fact that a vector operation is a very regular sequence of scalar operations. The performance of such a computer depends upon the length of the vector operands, vector stride, vector loads and stores, and the level of data re-use.

Pipelining Arithmetic Operations

[Figures: a 3-cycle adder and pipelined addition]

Page 29:

Vector Operations

A vector pipeline computer comes with a collection of vector instructions that take place in vector registers. Vectors travel between the registers and memory. An important attribute of a vector processor is the length of its vector registers, v_L. A length-n vector operation must be broken down into subvector operations of length v_L or less.

The Vector Length Issue

Suppose the pipeline for the vector operation "op" takes τ_op cycles to set up. Assume that one component of the result is obtained per cycle once the pipeline is filled. The required time for an n-dimensional "op" is then

$$T_{op}(n) = (\tau_{op} + n)\mu, \qquad n \le v_L$$

where μ is the cycle time and v_L is the length of the vector hardware. If the vectors are longer than the vector hardware length, they must be broken down; writing

$$n = n_1 v_L + n_0, \qquad 0 \le n_0 < v_L$$

we have

$$T_{op}(n) = \begin{cases} n_1(\tau_{op} + v_L)\mu & n_0 = 0 \\ \big(n_1(\tau_{op} + v_L) + \tau_{op} + n_0\big)\mu & n_0 \ne 0 \end{cases}$$

We simplify the above to

$$T_{op}(n) = \left(n + \tau_{op}\,\mathrm{ceil}\!\left(\frac{n}{v_L}\right)\right)\mu$$

where ceil(a) is the smallest integer such that a ≤ ceil(a).

Page 30:

If ρ flops per component are involved, then the effective rate of computation for general n is given by

$$R_{op}(n) = \frac{\rho n}{T_{op}(n)}$$

If μ is in seconds, then R_op is in flops per second.

Now let us see the performance of the matrix multiply update 𝐶 = 𝐴𝐵 + 𝐶.

for i = 1:m

for j = 1:n

for k = 1:p

C(i,j) = A(i,k)B(k,j)+C(i,j)

end

end

end

This is the "ijk" variant, and its innermost loop oversees a length-p dot product, thus:

$$T_{ijk} = mnp + mn \cdot \mathrm{ceil}\!\left(\frac{p}{v_L}\right)\tau_{dot}$$

Page 31:

A similar analysis for each of the other variants leads to the following table:

Assume that τ_dot and τ_sax are equal. If m, n, and p are all smaller than v_L, the most efficient variants will have the longest inner loops. Otherwise the distinction between the six options is small.

Variant   Cycles
ijk       mnp + mn·ceil(p/v_L)·τ_dot
jik       mnp + mn·ceil(p/v_L)·τ_dot
ikj       mnp + mp·ceil(n/v_L)·τ_sax
jki       mnp + np·ceil(m/v_L)·τ_sax
kij       mnp + mp·ceil(n/v_L)·τ_sax
kji       mnp + np·ceil(m/v_L)·τ_sax

The Stride Issue

The Vector Touch Issue: The time required to read or write a vector to memory is comparable to the time required to engage the vector in a dot product or saxpy.

Page 32:

Chapter 2

2.1 Basic Ideas from Linear Algebra

2.2 Vector Norms

2.3 Matrix Norms

2.4 Finite Precision Matrix Computation

2.5 Orthogonality and the SVD

2.6 Projection and the CS Decomposition

2.7 The Sensitivity of Square Linear Systems

Page 33:

2.1 Basic Ideas from Linear Algebra

• Independence: A set of vectors {a_1, …, a_n} in ℝ^m is linearly independent if $\sum_{j=1}^{n} \alpha_j a_j = 0$ implies α(1:n) = 0. Otherwise, a nontrivial combination of the a_i is zero and {a_1, …, a_n} is said to be linearly dependent.

• Subspace: A subspace of ℝ^m is a subset that is also a vector space. Given a collection of vectors a_1, …, a_n ∈ ℝ^m, the set of all linear combinations of these vectors is a subspace referred to as the span of {a_1, …, a_n}:

$$\mathrm{span}\{a_1, \dots, a_n\} = \left\{ \sum_{j=1}^{n} \beta_j a_j : \beta_j \in \mathbb{R} \right\}$$

• Basis: The subset {a_{i1}, …, a_{ik}} is a maximal linearly independent subset of {a_1, …, a_n} if it is linearly independent and is not properly contained in any linearly independent subset of {a_1, …, a_n}. If {a_{i1}, …, a_{ik}} is maximal, then span{a_1, …, a_n} = span{a_{i1}, …, a_{ik}} and {a_{i1}, …, a_{ik}} is a basis for span{a_1, …, a_n}.

• Dimension: If S ⊆ ℝ^m is a subspace, then it is possible to find independent vectors a_1, …, a_k ∈ S such that S = span{a_1, …, a_k}. All bases for a subspace S have the same number of elements. This number is the dimension and is denoted by dim(S).

Page 34:

• Range: There are two important subspaces associated with an m-by-n matrix A. The range of A is defined by

ran(A) = {y ∈ ℝ^m : y = Ax for some x ∈ ℝ^n}

• Null Space: null(A) = {x ∈ ℝ^n : Ax = 0}

If A = [a_1, …, a_n] is a column partitioning, then ran(A) = span{a_1, …, a_n}.

• Rank: rank(A) = dim(ran(A))

It can be shown that rank(A) = rank(A^T). We say that A ∈ ℝ^{m×n} is rank deficient if rank(A) < min{m, n}. If A ∈ ℝ^{m×n}, then dim(null(A)) + rank(A) = n.

Page 35:

Matrix Inverse

The n-by-n identity (unit) matrix I_n is defined by column partitioning: I_n = [e_1, …, e_n], where e_k is the kth canonical vector: e_k = (0, …, 0, 1, 0, …, 0)^T, with the 1 in position k.

If 𝐴 and 𝑋 are in ℝ𝑛×𝑛 and satisfy 𝐴𝑋 = 𝐼, then 𝑋 is the inverse of 𝐴 and is denoted by 𝐴−1. If 𝐴−1 exists, then 𝐴 is

said to be nonsingular, otherwise it is singular.

Matrix Inverse Properties:

• The inverse of a product is the reverse product of the inverses: (𝐴𝐵)−1= 𝐵−1𝐴−1

• The transpose of inverse is the inverse of the transpose: (𝐴−1)𝑇 = (𝐴𝑇)−1 = 𝐴−𝑇

• Sherman-Morrison-Woodbury formula: (𝐴 + 𝑈𝑉𝑇)−1= 𝐴−1 − 𝐴−1𝑈(𝐼 + 𝑉𝑇𝐴−1𝑈)−1𝑉𝑇𝐴−1

Where 𝐴 ∈ ℝ𝑛×𝑛 and U and V are n-by-k. A rank 𝑘 correction to a matrix results in a rank 𝑘 correction of the

inverse. Here we assumed that both 𝐴 and 𝐼 + 𝑉𝑇𝐴−1𝑈 are nonsingular.

Page 36:

The Determinant

If A = (a) ∈ ℝ^{1×1}, then det(A) = a. The determinant of A ∈ ℝ^{n×n} is defined in terms of order n−1 determinants:

$$\det(A) = \sum_{j=1}^{n} (-1)^{j+1} a_{1j} \det(A_{1j})$$

Here A_{1j} is the (n−1)-by-(n−1) matrix obtained by deleting the first row and jth column of A.

Determinant Properties:

• det(AB) = det(A)det(B),  A, B ∈ ℝ^{n×n}
• det(A^T) = det(A),  A ∈ ℝ^{n×n}
• det(cA) = c^n det(A),  c ∈ ℝ, A ∈ ℝ^{n×n}
• det(A) ≠ 0 ⟺ A is nonsingular,  A ∈ ℝ^{n×n}

Page 37:

Differentiation

Suppose α is a scalar and that A(α) is an m-by-n matrix with entries a_ij(α). If a_ij(α) is a differentiable function of α for all i and j, then by Ȧ(α) we mean the matrix

$$\dot{A}(\alpha) = \frac{dA(\alpha)}{d\alpha} = \left( \frac{da_{ij}(\alpha)}{d\alpha} \right) = \big( \dot{a}_{ij}(\alpha) \big)$$

The differentiation of a parameterized matrix turns out to be a handy way to examine the sensitivity of various matrix problems.

Page 38:

2.2 Vector Norms

Norms serve the same purpose on vector spaces that absolute value does on the real line: they furnish a measure of distance. More precisely, ℝ^n together with a norm on ℝ^n defines a metric space.

A vector norm on ℝ^n is a function f: ℝ^n → ℝ that satisfies the following properties:

f(x) ≥ 0,  x ∈ ℝ^n  (with f(x) = 0 iff x = 0)
f(x + y) ≤ f(x) + f(y),  x, y ∈ ℝ^n
f(ax) = |a| f(x),  a ∈ ℝ, x ∈ ℝ^n

This function is denoted by the double bar notation: f(x) = ‖x‖. Subscripts on the double bar are used to distinguish between various norms. A useful class of vector norms are the p-norms:

$$\|x\|_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}$$

Of these, the 1, 2 and ∞ norms are the most important:

$$\|x\|_1 = |x_1| + \cdots + |x_n|$$
$$\|x\|_2 = \left( |x_1|^2 + \cdots + |x_n|^2 \right)^{1/2} = (x^T x)^{1/2}$$
$$\|x\|_\infty = \max_{1 \le i \le n} |x_i|$$

A unit vector with respect to the norm ‖·‖ is a vector that satisfies ‖x‖ = 1.
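For concreteness, a brief NumPy sketch (ours) of the 1-, 2- and ∞-norms:

import numpy as np

x = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(x, 1))        # |3| + |-4| + |1| = 8
print(np.linalg.norm(x, 2))        # sqrt(9 + 16 + 1)
print(np.linalg.norm(x, np.inf))   # max(|3|, |-4|, |1|) = 4

# a unit vector in the 2-norm
u = x / np.linalg.norm(x, 2)
assert np.isclose(np.linalg.norm(u, 2), 1.0)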

Page 39:

Some Vector Properties:

• A classic result concerning p-norms is the Hölder inequality:

$$|x^T y| \le \|x\|_p \|y\|_q, \qquad \frac{1}{p} + \frac{1}{q} = 1$$

• All norms on ℝ^n are equivalent, meaning that if ‖·‖_α and ‖·‖_β are norms on ℝ^n, then there exist positive constants c_1 and c_2 such that c_1‖x‖_α ≤ ‖x‖_β ≤ c_2‖x‖_α. For example, for all x ∈ ℝ^n: ‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2.

Absolute and Relative Error

Suppose x̂ ∈ ℝ^n is an approximation to x ∈ ℝ^n. For a given vector norm, E_abs = ‖x̂ − x‖ is the absolute error and E_rel = ‖x̂ − x‖ / ‖x‖ is the relative error.

Convergence:

We say that a sequence {x^(k)} of n-vectors converges to x if

$$\lim_{k \to \infty} \|x^{(k)} - x\| = 0$$

Page 40:

2.3 Matrix Norms

The definition of a matrix norm is equivalent to the definition of a vector norm. In particular, f: ℝ^{m×n} → ℝ is a matrix norm if it satisfies the following properties:

f(A) ≥ 0,  A ∈ ℝ^{m×n}
f(A + B) ≤ f(A) + f(B),  A, B ∈ ℝ^{m×n}
f(aA) = |a| f(A),  a ∈ ℝ, A ∈ ℝ^{m×n}

Double bar notation with subscripts is used to designate matrix norms: f(A) = ‖A‖.

Frobenius norm:

$$\|A\|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 }$$

p-norms:

$$\|A\|_p = \sup_{x \ne 0} \frac{\|Ax\|_p}{\|x\|_p}$$

Note that the 2-norm on ℝ^{3×2} is a different function from the 2-norm on ℝ^{5×6}.

Page 41:

Some Matrix Norm Properties

The Frobenius and p-norms (especially p = 1, 2 and ∞) satisfy certain inequalities that are frequently used in the analysis of matrix computations:

$$\|A\|_2 \le \|A\|_F \le \sqrt{n}\,\|A\|_2$$
$$\max_{i,j} |a_{ij}| \le \|A\|_2 \le \sqrt{mn}\,\max_{i,j} |a_{ij}|$$
$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|$$
$$\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}|$$
$$\frac{1}{\sqrt{n}}\|A\|_\infty \le \|A\|_2 \le \sqrt{m}\,\|A\|_\infty$$
$$\frac{1}{\sqrt{m}}\|A\|_1 \le \|A\|_2 \le \sqrt{n}\,\|A\|_1$$

If A ∈ ℝ^{m×n}, 1 ≤ i_1 ≤ i_2 ≤ m and 1 ≤ j_1 ≤ j_2 ≤ n, then

$$\|A(i_1{:}i_2,\, j_1{:}j_2)\|_p \le \|A\|_p$$

Page 42:

2.4 Finite Precision Matrix Computations

The Floating Point Numbers

When calculations are performed on a computer, each arithmetic operation is generally affected by roundoff error. This error arises because the machine hardware can only represent a subset of the real numbers. We denote this subset by F and refer to its elements as floating point numbers.

The floating point number system on a particular computer is characterized by four integers: the base β, the precision t, and the exponent range [L, U]. In particular, F consists of all numbers f of the form

$$f = \pm .d_1 d_2 \dots d_t \times \beta^e, \qquad 0 \le d_i < \beta, \; d_1 \ne 0, \; L \le e \le U$$

together with zero. Notice that for a nonzero f ∈ F we have m ≤ |f| ≤ M, where

m = β^{L−1} and M = β^U(1 − β^{−t})

As an example, if β = 2, t = 3, L = 0 and U = 2, then the non-negative elements of F can be represented by hash marks on the real axis (figure omitted here).

Typical values for (β, t, L, U) might be (2, 56, −64, 64).

Page 43:

A Model of Floating Point Arithmetic

To make general pronouncements about the effect of rounding errors on a given algorithm, it is necessary to have a model of computer arithmetic on F. To this end, define the set G by

G = {x ∈ ℝ : m ≤ |x| ≤ M} ∪ {0}

and the operator fl: G → F by

fl(x) = the nearest c ∈ F to x, with ties handled by rounding away from zero.

The fl operator can be shown to satisfy

fl(x) = x(1 + ε), |ε| ≤ u

where u is the unit roundoff defined by u = (1/2)β^{1−t}.

Let a and b be any two floating point numbers and let "op" denote any of the four arithmetic operations +, −, ×, ÷. If (a op b) ∈ G, then in our model of floating point arithmetic we assume that the computed version of (a op b) is given by fl(a op b). It follows that fl(a op b) = (a op b)(1 + ε) with |ε| ≤ u, thus

$$\left| \frac{fl(a \,\mathrm{op}\, b) - (a \,\mathrm{op}\, b)}{a \,\mathrm{op}\, b} \right| \le u, \qquad a \,\mathrm{op}\, b \ne 0$$
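As a concrete, hedged illustration: in IEEE double precision (β = 2, t = 53) the unit roundoff is u = 2^{−53}, and NumPy exposes the machine epsilon 2u as np.finfo(float).eps. The check below verifies the model for one multiplication:

import numpy as np
from fractions import Fraction

u = 0.5 * 2.0 ** (1 - 53)                  # u = (1/2) beta^(1-t) with beta = 2, t = 53
assert u == np.finfo(np.float64).eps / 2   # machine epsilon is 2u

# check |fl(a*b) - a*b| / |a*b| <= u for one pair of floating point numbers
a, b = 1.0 / 3.0, 2.0 / 7.0
exact = Fraction(a) * Fraction(b)          # exact rational product of the two floats
rel_err = abs(Fraction(a * b) - exact) / abs(exact)
assert float(rel_err) <= u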

Page 44:

Cancelation

Another important aspect of finite precision arithmetic is the phenomenon of catastrophic cancellation. This term

refers to the extreme loss of correct significant digits when small numbers are additively computed from larger

numbers. A well known example is the computation of 𝑒−𝛼 via Taylor series with 𝛼 > 0. The roundoff error

associated with this method is approximately u times the largest sum.

The absolute Value Notation

Suppose 𝐴 ∈ ℝ𝑚×𝑛. We wish to quantify the errors associated with its floating point representation.

[fl(A)]_ij = fl(a_ij) = a_ij(1 + ε_ij),  |ε_ij| ≤ u

Roundoff in Dot Product

The standard dot product algorithm:

s = 0
for k = 1:n
    s = s + x(k)y(k)
end

Here, x and y are n-by-1 floating point vectors.

Problem: the distinction between computed and exact quantities. We shall use the fl(·) operator to signify computed quantities.

Page 45:

Thus fl(x^T y) denotes the computed output of the last algorithm. We now bound |fl(x^T y) − x^T y|. If

$$s_p = fl\!\left( \sum_{k=1}^{p} x_k y_k \right)$$

then s_1 = x_1 y_1 (1 + δ_1) with |δ_1| ≤ u, and for p = 2:n

$$s_p = fl\big(s_{p-1} + fl(x_p y_p)\big) = \big(s_{p-1} + x_p y_p (1 + \delta_p)\big)(1 + \epsilon_p), \qquad |\delta_p|, |\epsilon_p| \le u$$

With a little algebra,

$$fl(x^T y) = s_n = \sum_{k=1}^{n} x_k y_k (1 + \gamma_k), \qquad \text{where} \qquad 1 + \gamma_k = (1 + \delta_k) \prod_{j=k}^{n} (1 + \epsilon_j)$$

with the convention that ε_1 = 0. Thus

$$|fl(x^T y) - x^T y| \le \sum_{k=1}^{n} |x_k y_k|\,|\gamma_k|$$

Now to bound the quantities |γ_k| in terms of u. Lemma: If $1 + \alpha = \prod_{k=1}^{n} (1 + \alpha_k)$ where |α_k| ≤ u and nu ≤ .01, then |α| ≤ 1.01 nu.

Applying the lemma to our problem:

$$|fl(x^T y) - x^T y| \le 1.01\, nu\, |x|^T |y|$$

Page 46:

Dot Product Accumulation

If a dot product is accumulated in higher precision, the computed result has good relative error: fl(x^T y) = x^T y (1 + δ), where |δ| ≈ u. Thus the ability to accumulate dot products is very appealing.

Forward and Backward Error Analyses

Every roundoff bound given above is the consequence of a forward error analysis. An alternative style of

characterizing the roundoff errors in an algorithm is backward error analysis. Here, the rounding errors are related

to the data of the problem rather than to its solution.

Page 47:

2.5 Orthogonality and the SVD

Orthogonality

A set of vectors {x_1, …, x_p} in ℝ^m is orthogonal if x_i^T x_j = 0 whenever i ≠ j, and orthonormal if x_i^T x_j = δ_ij. Orthogonal vectors are maximally independent, for they point in totally different directions. A collection of subspaces S_1, …, S_p in ℝ^m is mutually orthogonal if x^T y = 0 whenever x ∈ S_i and y ∈ S_j for i ≠ j. The orthogonal complement of a subspace S ⊆ ℝ^m is defined by

S^⊥ = {y ∈ ℝ^m : y^T x = 0 for all x ∈ S}

A matrix Q ∈ ℝ^{m×m} is said to be orthogonal if Q^T Q = QQ^T = I. If Q = [q_1, …, q_m] is orthogonal, then the q_i form an orthonormal basis for ℝ^m.

Theorem: If V_1 ∈ ℝ^{n×r} has orthonormal columns, then there exists V_2 ∈ ℝ^{n×(n−r)} such that V = [V_1 V_2] is orthogonal. Note that ran(V_1)^⊥ = ran(V_2).

Page 48:

Norms and Orthogonal Transformations

The 2-norm is invariant under orthogonal transformation, for if Q^T Q = I, then ‖Qx‖_2² = x^T Q^T Q x = x^T x = ‖x‖_2². The matrix 2-norm and the Frobenius norm are also invariant with respect to orthogonal transformations. It is easy to show that for all orthogonal Q and Z of appropriate dimensions we have

‖QAZ‖_F = ‖A‖_F

and

‖QAZ‖_2 = ‖A‖_2

Page 49:

Singular Value Decomposition (SVD)

• Theorem: If A is a real m-by-n matrix, then there exist orthogonal matrices

U = [u_1, …, u_m] ∈ ℝ^{m×m} and V = [v_1, …, v_n] ∈ ℝ^{n×n}

such that

U^T A V = diag(σ_1, …, σ_p) ∈ ℝ^{m×n},  p = min(m, n)

where σ_1 ≥ σ_2 ≥ … ≥ σ_p ≥ 0.

The σ_i are the singular values of A, and the vectors u_i and v_i are the ith left singular vector and the ith right singular vector, respectively.

The SVD reveals a great deal about the structure of a matrix. If the SVD of A is given, we define r by

σ_1 ≥ … ≥ σ_r > σ_{r+1} = … = σ_p = 0.

Then

rank(A) = r
null(A) = span{v_{r+1}, …, v_n}
ran(A) = span{u_1, …, u_r}

and we have the SVD expansion

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$

Page 50:

The Thin SVD

If A = UΣV^T ∈ ℝ^{m×n} is the SVD of A and m ≥ n, then

A = U_1 Σ_1 V^T

where

U_1 = U(:, 1:n) = [u_1, …, u_n] ∈ ℝ^{m×n}

and

Σ_1 = Σ(1:n, 1:n) = diag(σ_1, …, σ_n) ∈ ℝ^{n×n}

This trimmed down version of the SVD is referred to as the thin SVD.

Rank Deficiency and the SVD

One of the most valuable aspects of the SVD is that it enables us to deal sensibly with the concept of matrix rank.

Rounding errors and fuzzy data make rank determination a nontrivial exercise.

Theorem: Let the SVD of A ∈ ℝ^{m×n} be given. If k < r = rank(A) and

$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$

then

$$\min_{\mathrm{rank}(B) = k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$$
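A brief NumPy illustration (ours) of the SVD expansion and the best rank-k approximation property:

import numpy as np

A = np.random.rand(5, 4)
U, s, Vt = np.linalg.svd(A)          # A = U diag(s) V^T, s sorted in decreasing order

# SVD expansion: A = sum_i sigma_i u_i v_i^T
expansion = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(expansion, A)

# best rank-k approximation A_k and the 2-norm error sigma_{k+1}
k = 2
Ak = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])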

Page 51:

The previous theorem says that the smallest singular value of A is the 2-norm distance of A to the set of all rank-deficient matrices. It also follows that the set of full rank matrices in ℝ^{m×n} is both open and dense.

Unitary Matrices

Over the complex field the unitary matrices correspond to the orthogonal matrices. In particular, Q ∈ ℂ^{n×n} is unitary if Q^H Q = QQ^H = I_n. Unitary matrices preserve the 2-norm. The SVD of a complex matrix involves unitary matrices: if A ∈ ℂ^{m×n}, then there exist unitary matrices U ∈ ℂ^{m×m} and V ∈ ℂ^{n×n} such that

U^H A V = diag(σ_1, …, σ_p) ∈ ℝ^{m×n},  p = min(m, n)

where σ_1 ≥ σ_2 ≥ … ≥ σ_p ≥ 0.

Page 52:

2.6 Projections and the CS Decomposition

If the object of a computation is to compute a matrix or a vector, then norms are useful for assessing the accuracy of the answer or for measuring progress during an iteration. If the object of a computation is to compute a subspace, then to make similar statements we need to be able to quantify the distance between two subspaces. Orthogonal projections are critical in this regard.

Orthogonal Projection:

Let S ⊆ ℝ^n be a subspace. P ∈ ℝ^{n×n} is the orthogonal projection onto S if ran(P) = S, P² = P and P^T = P. From this definition it is easy to show that if x ∈ ℝ^n, then Px ∈ S and (I − P)x ∈ S^⊥.

SVD-Related Projections

There are several important orthogonal projections associated with the singular value decomposition. Suppose A = UΣV^T ∈ ℝ^{m×n} is the SVD of A and that r = rank(A). Partition U and V as

U = [U_r | Ū_r] (r and m−r columns), V = [V_r | V̄_r] (r and n−r columns)

Then

V_r V_r^T = projection onto null(A)^⊥ = ran(A^T)
V̄_r V̄_r^T = projection onto null(A)
U_r U_r^T = projection onto ran(A)
Ū_r Ū_r^T = projection onto ran(A)^⊥ = null(A^T)

Page 53:

Distance between Subspaces

Suppose S_1 and S_2 are subspaces of ℝ^n and that dim(S_1) = dim(S_2). We define the distance between the two spaces by

dist(S_1, S_2) = ‖P_1 − P_2‖_2

where P_i is the orthogonal projection onto S_i. The distance between a pair of subspaces can be characterized in terms of the blocks of a certain orthogonal matrix.

Theorem: Suppose

W = [W_1 | W_2] and Z = [Z_1 | Z_2] (with W_1, Z_1 having k columns and W_2, Z_2 having n−k columns)

are n-by-n orthogonal matrices. If S_1 = ran(W_1) and S_2 = ran(Z_1), then

dist(S_1, S_2) = ‖W_1^T Z_2‖_2

Note that if S_1 and S_2 are subspaces of ℝ^n with the same dimension, then

0 ≤ dist(S_1, S_2) ≤ 1

Page 54:

The CS Decomposition

The blocks of an orthogonal matrix partitioned into 2-by-2 block form have highly related SVDs. This is the gist of the CS decomposition.

Theorem (The CS Decomposition, Thin Version): Consider the matrix

$$Q = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix}, \qquad Q_1 \in \mathbb{R}^{m_1 \times n}, \; Q_2 \in \mathbb{R}^{m_2 \times n}$$

where m_1 ≥ n and m_2 ≥ n. If the columns of Q are orthonormal, then there exist orthogonal matrices U_1 ∈ ℝ^{m_1×m_1}, U_2 ∈ ℝ^{m_2×m_2} and V_1 ∈ ℝ^{n×n} such that

$$\begin{bmatrix} U_1 & 0 \\ 0 & U_2 \end{bmatrix}^T \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} V_1 = \begin{bmatrix} C \\ S \end{bmatrix}$$

where

C = diag(cos θ_1, …, cos θ_n), S = diag(sin θ_1, …, sin θ_n)

and

0 ≤ θ_1 ≤ θ_2 ≤ … ≤ θ_n ≤ π/2

Page 55:

Using the same sort of techniques:

Theorem (CS Decomposition, General Version): If

$$Q = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix}$$

is a 2-by-2 block partitioning of an n-by-n orthogonal matrix, then there exist orthogonal

$$U = \begin{bmatrix} U_1 & 0 \\ 0 & U_2 \end{bmatrix} \quad \text{and} \quad V = \begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix}$$

such that

$$U^T Q V = \begin{bmatrix}
I & 0 & 0 & 0 & 0 & 0 \\
0 & C & 0 & 0 & S & 0 \\
0 & 0 & 0 & 0 & 0 & I \\
0 & 0 & 0 & I & 0 & 0 \\
0 & S & 0 & 0 & -C & 0 \\
0 & 0 & I & 0 & 0 & 0
\end{bmatrix}$$

where C = diag(c_1, …, c_p) and S = diag(s_1, …, s_p) are square diagonal matrices with 0 ≤ c_i, s_i ≤ 1.

Page 56:

2.7 The Sensitivity of Square Linear Systems

We now use some of the tools developed in the previous sections to analyze the linear system problem Ax = b, where A ∈ ℝ^{n×n} is nonsingular and b ∈ ℝ^n. We want to examine how perturbations in A and b affect the solution x.

SVD Analysis

If

$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T = U\Sigma V^T$$

is the SVD of A, then

$$x = A^{-1}b = (U\Sigma V^T)^{-1} b = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i}\, v_i$$

This expansion shows that small changes in A or b can induce relatively large changes in x if σ_n is small.

σ_n is the distance from A to the set of singular matrices. As the matrix of coefficients approaches this set, it is intuitively clear that the solution x should be increasingly sensitive to perturbations.

Page 57:

Condition

A precise measure of linear system sensitivity can be obtained by considering the parameterized system

(A + εF) x(ε) = b + εf,  x(0) = x

where F ∈ ℝ^{n×n} and f ∈ ℝ^n. If A is nonsingular, then it is clear that x(ε) is differentiable in a neighborhood of zero. Moreover, ẋ(0) = A^{-1}(f − Fx), and thus the Taylor series expansion for x(ε) has the form

x(ε) = x + ε ẋ(0) + O(ε²)

Using any vector norm and consistent matrix norm we obtain

$$\frac{\|x(\epsilon) - x\|}{\|x\|} \le |\epsilon| \, \|A^{-1}\| \left( \frac{\|f\|}{\|x\|} + \|F\| \right) + O(\epsilon^2)$$

For square matrices A, define the condition number κ(A):

κ(A) = ‖A‖ ‖A^{-1}‖

with the convention that κ(A) = ∞ for singular A. Using the inequality ‖b‖ ≤ ‖A‖ ‖x‖:

$$\frac{\|x(\epsilon) - x\|}{\|x\|} \le \kappa(A)\,(\rho_A + \rho_b) + O(\epsilon^2)$$

where

ρ_A = |ε| ‖F‖ / ‖A‖ and ρ_b = |ε| ‖f‖ / ‖b‖

represent the relative errors in A and b, respectively. Thus the relative error in x can be κ(A) times the relative errors in A and b. In this sense the condition number κ(A) quantifies the sensitivity of Ax = b.
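An illustrative NumPy experiment (ours) showing how κ(A) bounds the amplification of a relative perturbation of b; the test matrix is an arbitrary ill-conditioned choice:

import numpy as np

n = 6
A = np.vander(np.linspace(1, 2, n))       # a moderately ill-conditioned matrix
b = np.random.rand(n)
x = np.linalg.solve(A, b)

db = 1e-10 * np.random.rand(n)            # small perturbation of b only
x_pert = np.linalg.solve(A, b + db)

rel_err_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
rel_err_b = np.linalg.norm(db) / np.linalg.norm(b)
kappa = np.linalg.cond(A)                 # 2-norm condition number ||A|| ||A^{-1}||

# perturbing only b: ||x_pert - x|| / ||x|| <= kappa(A) * ||db|| / ||b||
assert rel_err_x <= 1.01 * kappa * rel_err_b
print(kappa, rel_err_x / rel_err_b)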

Page 58:

Note that κ(·) depends on the underlying norm, and subscripts are used accordingly. If κ(A) is large, then A is said to be an ill-conditioned matrix. This is a norm-dependent property. However, any two condition numbers κ_α(·) and κ_β(·) on ℝ^{n×n} are equivalent in that constants c_1 and c_2 can be found for which

c_1 κ_α(A) ≤ κ_β(A) ≤ c_2 κ_α(A),  A ∈ ℝ^{n×n}

Matrices with small condition numbers are said to be well-conditioned.

Determinants and Nearness to Singularity

Now we want to find out how well determinant size measures ill-conditioning. We find that there is little correlation between det(A) and the condition of Ax = b. For example, the matrix

$$B_n = \begin{bmatrix}
1 & -1 & \cdots & -1 \\
0 & 1 & \cdots & -1 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}$$

has determinant 1, but κ_∞(B_n) = n·2^{n−1}.

Page 59:

A Rigorous Norm Bound

The previous analysis is a little unsatisfying because it is contingent on ε being "small enough" and because it sheds no light on the size of the O(ε²) term. We now establish a perturbation theorem that is completely rigorous.

Lemma: Suppose

Ax = b,  A ∈ ℝ^{n×n}, 0 ≠ b ∈ ℝ^n
(A + ΔA)y = b + Δb,  ΔA ∈ ℝ^{n×n}, Δb ∈ ℝ^n

with ‖ΔA‖ ≤ ε‖A‖ and ‖Δb‖ ≤ ε‖b‖. If εκ(A) = r < 1, then A + ΔA is nonsingular and

$$\frac{\|y\|}{\|x\|} \le \frac{1 + r}{1 - r}$$

Theorem: If the conditions of the lemma hold, then

$$\frac{\|y - x\|}{\|x\|} \le \frac{2\epsilon}{1 - r}\,\kappa(A)$$

Page 60:

A More Refined Perturbation Theory

Theorem: Suppose

Ax = b,  A ∈ ℝ^{n×n}, 0 ≠ b ∈ ℝ^n
(A + ΔA)y = b + Δb,  ΔA ∈ ℝ^{n×n}, Δb ∈ ℝ^n

and that |ΔA| ≤ δ|A| and |Δb| ≤ δ|b| componentwise. If δκ_∞(A) = r < 1, then A + ΔA is nonsingular and

$$\frac{\|y - x\|_\infty}{\|x\|_\infty} \le \frac{2\delta}{1 - r}\, \big\| \, |A^{-1}|\,|A| \, \big\|_\infty$$

We refer to the value of ‖ |A^{-1}| |A| ‖_∞ as the Skeel condition number.