The University of Texas at Austin, Department of Electrical and Computer Engineering

EE381V: Large Scale Learning — Spring 2013

Assignment Two

Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013.

Computational Problems.

1. Mixture model problem

This problem shows that projecting onto the proper subspace helps greatly in learning mixture models compared to projecting onto a random subspace.

(a) Create a mixture model with the following specifications.

X ∈ R^d is a random variable whose distribution is a mixture of 4 Gaussian random variables with means at e1, e2, e3 and e4 (the ei are standard basis vectors, of magnitude 1) and variance 0.1. All 4 components are chosen uniformly at random. Draw ⌊4 · d · log(d)⌋ samples of X for d = 10, 100 and 500.

(b) Cluster the samples based on nearest neighbour. One way to do this is to classify any two points as belonging to the same cluster if the distance between them is less than a threshold r (look for the function 'pdist' in MATLAB). Count the fraction of pairs in error. The error will depend on the threshold r chosen. One good value for the threshold is the diameter of a cluster, which is √(2 · σ² · d).

(c) Now project all the points onto a low-dimensional subspace containing the means of the four Gaussian random variables. Take the space spanned by {e1, . . . , e10} for this case. Compute the same error for different values of d.

(d) This time project all points onto a randomly generated subspace of dimension 10 (one way is to use the range space of a random matrix). Compute the error in this case for different values of d.

(e) (15 points) Give a plot comparing errors in the above three scenarios.

Solution: The plot is shown in Figure 1. The errors have been normalized by n(n − 1)/2 (the maximum number of possible pairs), where n is the number of samples in each case.
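For reference, a minimal Python sketch of parts (a)-(d) follows. The assignment points to MATLAB's pdist; this sketch uses the SciPy equivalent. The random seed, the helper pair_error, and the choice to reuse the threshold formula with dimension 10 after projection are illustrative assumptions, not the original solution code.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
sigma2 = 0.1                                   # variance of each Gaussian component

def pair_error(points, labels, r):
    """Fraction of point pairs whose same-cluster decision (distance < r)
    disagrees with the true component labels (normalized by n(n-1)/2)."""
    same_pred = pdist(points) < r
    same_true = pdist(labels[:, None]) == 0    # zero distance <=> same label
    return np.mean(same_pred != same_true)

for d in (10, 100, 500):
    # for d = 500 this builds roughly 8e7 pairwise distances; subsample if memory is tight
    n = int(4 * d * np.log(d))
    labels = rng.integers(0, 4, size=n)        # components chosen uniformly at random
    means = np.eye(d)[:4]                      # e_1, ..., e_4 as rows
    X = means[labels] + np.sqrt(sigma2) * rng.standard_normal((n, d))

    err_full = pair_error(X, labels, np.sqrt(2 * sigma2 * d))

    # (c) proper subspace: span{e_1, ..., e_10} contains all four means
    err_proper = pair_error(X[:, :10], labels, np.sqrt(2 * sigma2 * 10))

    # (d) random 10-dimensional subspace: range of a random d x 10 matrix
    Q, _ = np.linalg.qr(rng.standard_normal((d, 10)))
    err_random = pair_error(X @ Q, labels, np.sqrt(2 * sigma2 * 10))

    print(d, err_full, err_proper, err_random)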

2. Planted Model

Recall the planted model from class. Let n = 200 and k = 5. Form the matrix P as in class, which has p on the five equal blocks on the diagonal and q = 1 − p everywhere else. Let A be the random matrix, as in class, where each entry Aij is a Bernoulli random variable with probability Pij. Note that you will have to construct A as a symmetric matrix, so generate the elements above the diagonal and then just replicate them below.

(a) (5 points) For p = 0.8, 0.7, 0.6 and 0.55, generate the eigenvalues of P and of (the random matrix) A and plot them. How many eigenvalues of P are non-zero?



Figure 1: Plot showing the normalized pair-wise error in the 3 cases for different values of d

(b) (5 points) Now run the spectral clustering algorithm on A (rather than the Laplacian). The points that you get will be 5-dimensional. Pick a random projection onto two dimensions, and project all of the resulting five-dimensional points onto these two randomly chosen dimensions. Plot the results for the different values of p.

(c) (5 points) Cluster according to k-Means.

Solution: The various plots are shown in Figures 2-9. In all cases the top 5 eigenvalues of P are non-zero.
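A minimal Python sketch of parts (a)-(c) follows. The embedding and clustering details (random seed, use of SciPy's kmeans2, handling of the diagonal of A) are illustrative choices, not the original solution code.

import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
n, k = 200, 5

for p in (0.8, 0.7, 0.6, 0.55):
    q = 1 - p
    blocks = np.repeat(np.arange(k), n // k)
    P = np.where(blocks[:, None] == blocks[None, :], p, q)

    # symmetric adjacency matrix: A_ij ~ Bernoulli(P_ij), upper triangle replicated below
    coin = (rng.random((n, n)) < P).astype(float)
    A = np.triu(coin, 1) + np.triu(coin, 1).T
    np.fill_diagonal(A, np.diag(coin))

    # (a) eigenvalues of P and A
    eig_P = np.sort(np.linalg.eigvalsh(P))[::-1]
    eig_A = np.sort(np.linalg.eigvalsh(A))[::-1]
    print(f"p = {p}: top six eigenvalues of P: {np.round(eig_P[:6], 2)}")

    # (b) spectral embedding: rows of the matrix of top-k eigenvectors of A,
    #     then a random projection of the 5-dimensional points onto 2 dimensions
    vals, vecs = np.linalg.eigh(A)
    embedding = vecs[:, np.argsort(vals)[-k:]]          # n x 5
    R, _ = np.linalg.qr(rng.standard_normal((k, 2)))
    projected = embedding @ R                           # points to plot

    # (c) k-means on the 5-dimensional embedding
    _, assignment = kmeans2(embedding, k, minit="++")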

Linear Algebra.

The following problems focus on important concepts and ideas from linear algebra that are critical in this class. Please read up on singular value decomposition and other topics as necessary. Problems marked with a ∗ are optional.

1. Range and Nullspace of Matrices: Recall from class the definition of the null space and the range of a linear transformation, T : V → W:

null(T) = {v ∈ V : Tv = 0}
range(T) = {Tv ∈ W : v ∈ V}

• (5 points) Suppose A is a 10-by-10 matrix of rank 5, and B is also a 10-by-10 matrix of rank 5. What are the smallest and largest values the rank of the matrix C = AB could take?

• (5 points) Now suppose A is a 10-by-15 matrix of rank 7, and B is a 15-by-11 matrix of rank 8. What is the largest that the rank of the matrix C = AB can be?



Figure 2: Plot showing the eigenvalues of the P and A matrices for p = .8.

Figure 3: Plot showing the eigenvalues of the P and A matrices for p = .7.



Figure 4: Plot showing the eigenvalues of the P and A matrices for p = .6.

Figure 5: Plot showing the eigenvalues of the P and A matrices for p = .55.



Figure 6: (a) Points projected onto two random directions; (b) clusters found by k-means, for p = .8

Figure 7: (a) Points projected onto two random directions; (b) clusters found by k-means, for p = .7



Figure 8: (a) Points projected onto two random directions; (b) clusters found by k-means, for p = .6

Figure 9: (a) Points projected onto two random directions; (b) clusters found by k-means, for p = .55



Solution: We know that for any two matrices A (m × n) and B (n × k),

rank(A) + rank(B) − n ≤ rank(AB) ≤ min(rank(A), rank(B)).

Part I: Using the above bounds,

rank(C) ≥ 5 + 5− 10 = 0

rank(C) ≤ min(5, 5) = 5

The bounds are tight. For example, take

A = B = [ I5 0 ; 0 0 ].

Then C = AB = A, therefore rank(C) = 5. Now if we take

A = [ I5 0 ; 0 0 ],   B = [ 0 0 ; 0 I5 ],

then C = AB = 0, the all-zero matrix, hence rank(C) = 0.

Part II: rank(C) ≤ min(7, 8) = 7. The bound is also tight. Let

A = [ I7 0_{7×8} ; 0_{3×7} 0_{3×8} ],   B = [ I8 0_{8×3} ; 0_{7×8} 0_{7×3} ].

Then

C = AB = [ I7 0_{7×4} ; 0_{3×7} 0_{3×4} ],

therefore rank(C) = 7.
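A quick numeric sanity check of the block examples above (a sketch, not part of the original solution; the helper corner_identity is purely illustrative):

import numpy as np

def corner_identity(r, m, n):
    """m x n matrix with I_r in the top-left corner and zeros elsewhere."""
    M = np.zeros((m, n))
    M[:r, :r] = np.eye(r)
    return M

# Part I: both extremes for 10x10 factors of rank 5
A = B = corner_identity(5, 10, 10)
print(np.linalg.matrix_rank(A @ B))              # 5
B2 = np.zeros((10, 10)); B2[5:, 5:] = np.eye(5)  # I_5 in the bottom-right corner
print(np.linalg.matrix_rank(A @ B2))             # 0

# Part II: a 10x15 rank-7 matrix times a 15x11 rank-8 matrix attaining rank 7
print(np.linalg.matrix_rank(corner_identity(7, 10, 15) @ corner_identity(8, 15, 11)))  # 7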

2. Riesz Representation Theorem: (5 points) Consider the standard basis for Rn: e1 = (1, 0, . . . , 0), e2 = (0, 1, 0, . . . , 0), etc. Recall that the inner product of two vectors w1 = (α1, . . . , αn), w2 = (β1, . . . , βn) ∈ Rn is given by:

〈w1, w2〉 = ∑_{i=1}^n αi βi.

Let f : Rn → R be a linear map. Show that there exists a vector x ∈ Rn such that

f(w) = 〈x, w〉

for any w ∈ Rn.

Remark: It turns out that this result is true in much more generality. For example, consider the vector space of square-integrable functions (something we will see much more of later in the course). Let F denote a linear map from square-integrable functions to R. Then, as a consequence similar to the finite-dimensional exercise here, there exists a square-integrable function g such that:

F(f) = ∫ f g.

Solution: Let x = ∑_{i=1}^n f(ei) ei and w = ∑_{i=1}^n wi ei. Then, using the linearity of f, we have

f(w) = f( ∑_{i=1}^n wi ei ) = ∑_{i=1}^n wi f(ei) = 〈x, w〉.
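A small numeric illustration of the construction x = ∑ f(ei) ei (not part of the original solution; the linear map f below is an arbitrary example):

import numpy as np

rng = np.random.default_rng(1)
n = 6
c = rng.standard_normal(n)
f = lambda w: float(c @ w)                 # an arbitrary linear map f : R^n -> R

x = np.array([f(e) for e in np.eye(n)])    # x_i = f(e_i), the construction above
w = rng.standard_normal(n)
print(np.isclose(f(w), x @ w))             # True: f(w) = <x, w>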

3. (10 points) Recall from class that the spectral theorem for symmetric n × n real matrices says, among other things, that if A is a symmetric (real) n × n matrix, then it has a basis of orthonormal eigenvectors, {v1, . . . , vn}. Use A and {vi} to construct a matrix T such that the matrix T⊤AT is diagonal.

Solution: Let T = [v1 v2 . . . vn], i.e. the matrix with the orthonormal eigenvectors of A as columns. Let λi be the eigenvalue corresponding to the eigenvector vi. Then

T⊤AT = [v1 . . . vn]⊤ [Av1 Av2 . . . Avn] = [v1 . . . vn]⊤ [λ1v1 λ2v2 . . . λnvn] = diag(λ1, . . . , λn),

since ||vi||2 = 1 and vi⊤vj = 0 for all i ≠ j.
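A one-line numeric check of this construction (numpy's eigh already returns such an orthonormal T; the test matrix is arbitrary):

import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                        # a symmetric test matrix
eigvals, T = np.linalg.eigh(A)           # columns of T: orthonormal eigenvectors of A
print(np.allclose(T.T @ A @ T, np.diag(eigvals)))   # True: T'AT is diagonal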

4. (10 points) Let A be a rank n matrix (of any dimensions, not necessarily square). Let its singular value decomposition be given by

A = UΣV∗,

where Σ is the matrix of singular values given in descending order: σ1 ≥ σ2 ≥ · · · ≥ σn. Let Σk denote the matrix with σk+1, . . . , σn set to zero, and let Â = UΣkV∗. Show that Â solves the optimization problem:

min_{Â : rank(Â) ≤ k} ‖A − Â‖F.

Hint: you can use the fact that the optimal solution should satisfy (A − Â) ⊥ Â, where orthogonality is defined with respect to the natural matrix inner product compatible with the Frobenius norm:

〈M, N〉 = ∑_{i,j} Mij Nij = Trace(M∗N).

If you choose to use this hint, please do show that A − Â and Â satisfy the orthogonality, as claimed.

Solution: There can be several proofs. The following has been adapted from [1]. Let x∗ be the optimal value of the optimization problem. Now, Â = UΣkV∗ is feasible and ||A − UΣkV∗||F² = ∑_{i=k+1}^n σi². Hence,

x∗ ≤ √( ∑_{i=k+1}^n σi² ).   (1)

Now we find a lower bound for x∗. First we prove the following lemmas.

Lemma 1: For any two matrices A,B ∈ Cm×n with rank(B) ≤ k,

σ1(A−B) ≥ σk+1(A).

Proof: For any unit vector v we have σ1(A−B)² = ||A−B||₂² ≥ v∗(A−B)∗(A−B)v. Now rank(B) ≤ k implies dim(Null(B)) ≥ n−k. Let U = span{v1, . . . , vk+1}, the span of the top k+1 right singular vectors of A. Since dim(Null(B)) + dim(U) ≥ n + 1 > n, the intersection Null(B) ∩ U contains a non-zero vector. So let v = ∑_{i=1}^{k+1} ai vi ∈ Null(B) ∩ U be a unit vector (so ∑_{i=1}^{k+1} ai² = 1). Then, since Bv = 0, v∗(A−B)∗(A−B)v = v∗A∗Av = ∑_{i=1}^{k+1} σi(A)² ai² ≥ σk+1(A)². Hence σ1(A−B) ≥ σk+1(A).

Lemma 2: Let A = B + C, then σi+j−1(A) ≤ σi(B) + σj(C).

Proof: For i = j = 1 we have σ1(A) = u1∗Av1 = u1∗(B + C)v1 = u1∗Bv1 + u1∗Cv1 ≤ σ1(B) + σ1(C), where u1, v1 are the top left and right singular vectors of A. Now for any matrix X, let Xk be the matrix with the last n−k singular values of X set to 0; then σ1(X − Xk) = σk+1(X). Using the i = j = 1 case, Lemma 1, and the fact that rank(Bi−1 + Cj−1) ≤ i + j − 2, we get

σi(B) + σj(C) = σ1(B − Bi−1) + σ1(C − Cj−1) ≥ σ1(B + C − (Bi−1 + Cj−1)) ≥ σi+j−1(A).

Lemma 3: For any two matrices A,B ∈ Cm×n with rank(B) ≤ k, σi(A−B) ≥ σi+k(A).

Proof: As rank(B) ≤ k, σk+1(B) = 0. Therefore, applying Lemma 2 to the matrices A − B and B, we get

σi(A−B) = σi(A−B) + σk+1(B) ≥ σi+k+1−1(A−B+B) = σi+k(A).

Now we show the lower bound. For any matrix Â with rank(Â) ≤ k, using Lemma 3 we have

||A − Â||F² = ∑_i σi(A − Â)² ≥ ∑_{i=1}^{n−k} σi(A − Â)² ≥ ∑_{i=1}^{n−k} σi+k(A)² = ∑_{i=k+1}^n σi(A)².   (2)

Therefore x∗ ≥ √( ∑_{i=k+1}^n σi(A)² ). Combining with the upper bound in equation (1), we have x∗ = √( ∑_{i=k+1}^n σi(A)² ). Hence Â = UΣkV∗ is an optimal solution.
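A numeric illustration of the result (a sketch; the matrix sizes and the alternative rank-k matrix B are arbitrary choices, not part of the original solution):

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]            # truncated SVD, rank k
err_opt = np.linalg.norm(A - A_hat, "fro")
print(np.isclose(err_opt, np.sqrt(np.sum(s[k:] ** 2))))  # matches the value in (1)/(2)

B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))   # some other rank-k matrix
print(np.linalg.norm(A - B, "fro") >= err_opt)                  # True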

5. ∗ Prove the above hint, namely, that the optimal solution must satisfy the orthogonality condition (in the previous problem, you are assuming that this is true).

Solution: Let A be an m × n matrix. We have ||A − Â||F² = ∑_{i=1}^m |Ai − Âi|₂², the sum of the squared distances between the rows of A and Â. The rows of Â span a subspace Vk of dimension at most k. Let projVk(A) be the matrix whose rows are the projections of the corresponding rows of A onto Vk. We argue that the optimal solution always has its rows equal to the projections of the rows of A onto the optimal subspace Vk: by Pythagoras' theorem, for any matrix B whose rows lie in Vk we have ||A − B||F² ≥ ||A − projVk(A)||F², and projVk(A) is also a matrix of rank at most k, hence it does at least as well. Now let Â be the optimal solution and Vk the optimal subspace. Then the rows of A − Â are perpendicular to the subspace Vk, hence to the rows of Â. Therefore 〈A − Â, Â〉 = ∑_{i=1}^m 〈Ai − Âi, Âi〉 = 0. Hence proved.

6. (10 points) Let A ∈ Rn×n be symmetric. Recall that by the spectral theorem, A will have real eigenvalues. Therefore we can order the eigenvalues of A: λ1(A) ≥ λ2(A) ≥ · · · ≥ λn(A). Show that:

λ1(A) = max_{y ≠ 0} 〈y, Ay〉 / 〈y, y〉,
⋮
λk(A) = max_{V : dim V = k}  min_{0 ≠ y ∈ V} 〈y, Ay〉 / 〈y, y〉.

Solution: Let {v1, . . . , vn} be the set of orthonormal eigenvectors of A, with λi(A) the eigenvalue corresponding to vi. Let V be any subspace of dimension k, and let U = span{vk, vk+1, . . . , vn}. Then V ∩ U contains a non-zero vector: otherwise dim(V) + dim(U) = k + (n − k + 1) = n + 1 > n, a contradiction. Let w = ∑_{i=k}^n wi vi ∈ V ∩ U be non-zero. Now,

〈w, Aw〉 / 〈w, w〉 = ( ∑_{i=k}^n wi² λi ) / ( ∑_{i=k}^n wi² ) ≤ λk,  and therefore

min_{0 ≠ y ∈ V} 〈y, Ay〉 / 〈y, y〉 ≤ λk,  so that

max_{V : dim V = k}  min_{0 ≠ y ∈ V} 〈y, Ay〉 / 〈y, y〉 ≤ λk.   (3)

Now take V = span{v1, . . . , vk}. Then for any y = ∑_{i=1}^k yi vi ∈ V,

〈y, Ay〉 / 〈y, y〉 = ( ∑_{i=1}^k yi² λi ) / ( ∑_{i=1}^k yi² ) ≥ λk.

Equality is achieved when y = vk, hence the inequality in equation (3) is achieved with equality. Therefore,

max_{V : dim V = k}  min_{0 ≠ y ∈ V} 〈y, Ay〉 / 〈y, y〉 = λk.
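A numeric illustration of the second half of the argument, using V = span{v1, . . . , vk} (a sketch with an arbitrary symmetric test matrix; the random sampling only probes the minimum, it does not certify it):

import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2
vals, vecs = np.linalg.eigh(A)
lam, V = vals[::-1], vecs[:, ::-1]        # eigenvalues and eigenvectors, descending order

k = 3
rayleigh = lambda y: (y @ A @ y) / (y @ y)
# sample many y in span{v_1, ..., v_k}: the Rayleigh quotient never drops below lambda_k
samples = [rayleigh(V[:, :k] @ rng.standard_normal(k)) for _ in range(5000)]
print(min(samples) >= lam[k - 1] - 1e-9)                   # True
print(np.isclose(rayleigh(V[:, k - 1]), lam[k - 1]))       # equality at y = v_k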

7. ∗ The spectral radius of a matrix A is defined as:

ρ(A) = max{ |λ| : λ is an eigenvalue of A }.

Note that the spectral radius is invariant under similarity transformations, and thus we can speak of the spectral radius of a linear operator.

(a) Show that ρ(A) ≤ σ1(A).



(b) Show that ρ(A) = inf_{S : det S ≠ 0} σ1(S⁻¹AS).

Solution: Let v be the normalized eigenvector corresponding to the maximum-magnitude eigenvalue λ.

(a) σ1(A) = max_{u1, u2 : ||u1|| = ||u2|| = 1} |u1∗Au2| ≥ |v∗Av| = |v∗(λv)| = |λ| = ρ(A).

(b) Note that the eigenvalues of a matrix remain the same under similarity transforms. Hence for any non-singular matrix S, the eigenvalues of A and S⁻¹AS are the same, and therefore ρ(A) = ρ(S⁻¹AS). Now,

ρ(S⁻¹AS) ≤ σ1(S⁻¹AS), so
ρ(A) = inf_{S : det S ≠ 0} ρ(S⁻¹AS) ≤ inf_{S : det S ≠ 0} σ1(S⁻¹AS).   (4)

Now let A = U∗∆U be the Schur decomposition of A, where ∆ is an upper triangular matrix and U is a unitary matrix. The diagonal entries of ∆ are the eigenvalues of A. Let Dt = diag(t, t², . . . , tⁿ). Then

At = Dt ∆ Dt⁻¹

is upper triangular with the same diagonal λ1, . . . , λn, and its (i, j) entry above the diagonal is t^{i−j} ∆ij, i.e., each off-diagonal entry of ∆ is multiplied by a negative power of t. So for any ε > 0, by making t large enough we can ensure that the sum of the magnitudes of all off-diagonal entries of At is less than ε; choose t such that ||At||₁ ≤ ρ(A) + ε and ||At||∞ ≤ ρ(A) + ε. Define St = U∗Dt⁻¹. Then

inf_{S : det S ≠ 0} σ1(S⁻¹AS) ≤ σ1(St⁻¹ A St) = σ1(Dt ∆ Dt⁻¹) = ||At||₂ ≤ √( ||At||₁ ||At||∞ ) ≤ ρ(A) + ε.

Combining with equation (4) and taking ε → 0, we get the required result.

8. (10 points) A linear operator N is called nilpotent if for some integer k, N^k = 0. Show that if N is nilpotent, then (I + N) has a square root. (Hint: consider the Taylor expansion of √(1 + x), and use that as a starting point.)

Solution: The Taylor series expansion of √(1 + x) about x = 0 is given by

√(1 + x) = a0 + ∑_{i=1}^∞ ai x^i,

where a0 = 1. Since 1 + x = (√(1 + x))² = (1 + ∑_{i=1}^∞ ai x^i)², we have

2 a1 = 1,   ∑_{j=0}^i aj ai−j = 0 for all i ≥ 2.

Define AN = a0 I + ∑_{i=1}^∞ ai N^i = a0 I + ∑_{i=1}^{k−1} ai N^i (since N^i = 0 for all i ≥ k). The sum converges since it has only a finite number of non-zero terms, hence AN is well defined. By the same coefficient identities, AN · AN = (a0 I + ∑_{i=1}^∞ ai N^i)² = I + N. Hence I + N has a square root, given by AN.
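A short Python sketch of this construction, truncating the binomial series for √(1 + x) at the nilpotency index (the example N below is an arbitrary strictly upper triangular matrix, not part of the original solution):

import numpy as np

def sqrt_I_plus_N(N, k):
    """Square root of I + N for nilpotent N with N^k = 0: the Taylor series of
    sqrt(1 + x) applied to N, which terminates after k terms."""
    S = np.zeros_like(N, dtype=float)
    term = np.eye(N.shape[0])
    coeff = 1.0                               # a_0 = 1
    for i in range(k):
        S += coeff * term                     # add a_i * N^i
        term = term @ N                       # N^(i+1)
        coeff *= (0.5 - i) / (i + 1)          # a_(i+1) = a_i * (1/2 - i) / (i + 1)
    return S

N = np.triu(np.ones((4, 4)), 1)               # strictly upper triangular, so N^4 = 0
S = sqrt_I_plus_N(N, 4)
print(np.allclose(S @ S, np.eye(4) + N))      # True: S is a square root of I + N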

9. (5 points) Find an example of a matrix that is diagonalizable, but not unitarily so. That is, produce an example of an n × n matrix A (n up to you) for which there is some invertible matrix T that satisfies T⁻¹AT = D for some diagonal matrix D, but the columns of T cannot be taken to be orthonormal. Hint: related to one of the problems above.

Solution: We have T⁻¹AT = D. Take any non-orthogonal but invertible matrix T and a diagonal matrix D. Then the matrix A which is diagonalized by T is given by A = TDT⁻¹. For example,

T = [ 0 1 ; 1 1 ],   D = [ 1 0 ; 0 2 ]

gives

A = [ 2 0 ; 1 1 ].
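A quick check of this example (illustrative only):

import numpy as np

T = np.array([[0.0, 1.0], [1.0, 1.0]])         # invertible but not orthogonal
D = np.diag([1.0, 2.0])
A = T @ D @ np.linalg.inv(T)
print(A)                                        # [[2. 0.] [1. 1.]], as claimed
print(np.allclose(T.T @ T, np.eye(2)))          # False: columns of T are not orthonormal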

10. (10 points) Suppose A ∈ Cm×n has full column rank n. Show that:

min_{∆ ∈ Cm×n} { ||∆||₂ | rank(A + ∆) < n } = σn(A),

where σn(A) denotes the smallest singular value of A.

Solution: Let the SVD of A be given by A = UΣV∗. Note that we can equivalently write this as a sum of rank-one matrices that are outer products of the left and right singular vectors:

A = ∑_{i=1}^n σi ui vi∗.

Then letting ∆ = −σn un vn∗, we conclude that the value of the minimization above is no more than σn(A) (since we have exhibited a perturbation that causes A to drop rank and has spectral norm σn(A)). We have left to show that it is not possible to do any better.

Since A has rank n, by rank-nullity its null space contains only the zero vector. Suppose ∆ is any perturbation that causes A to drop rank. Let v be any unit vector in the null space of (A + ∆) (note that this vector is not in the null space of A). Then we have (A + ∆)v = 0, and hence ‖Av‖₂ = ‖∆v‖₂, whence

‖∆‖₂ ≥ ‖∆v‖₂ = ‖Av‖₂ ≥ min_{‖z‖₂ = 1} ‖Az‖₂ = σn(A).
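A numeric illustration of the rank-one perturbation used above (the matrix dimensions are arbitrary; this is a sketch, not part of the original solution):

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((7, 4))                 # full column rank with probability 1
U, s, Vt = np.linalg.svd(A, full_matrices=False)

Delta = -s[-1] * np.outer(U[:, -1], Vt[-1, :])  # the rank-one perturbation from the proof
print(np.isclose(np.linalg.norm(Delta, 2), s[-1]))       # spectral norm equals sigma_n
print(np.linalg.matrix_rank(A + Delta) < A.shape[1])     # True: A + Delta drops rank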

References

[1] M. Raghupati, "Low Rank Approximation in the Frobenius Norm and Perturbation of Singular Values".
