
Lecture 14: Principal Components Analysis

Introduction to Learning and Analysis of Big Data

Kontorovich and Sabato (BGU)


Unsupervised Learning

Until now: supervised
- receive labeled data $(x, y)$
- try to learn a mapping $x \mapsto y$

Current topic: unsupervised

Receive unlabeled data only

What’s a reasonable “learning” goal? Discover some structure in the data:
- does it clump nicely in space? (clustering)
- is it well-represented by a small set of points? (sample compression/condensing)
- does it have a sparse representation in some basis? (compressed sensing)
- does it lie on a low-dimensional manifold? (dimensionality reduction)



Discovering Structure in Data: Motivation

Computational: “simpler” data is faster to search/store/process

Statistical: “simpler” data allows for better generalization

Dimensionality reduction has a denoising effect

Example: human faces
- 16×16 grayscale images: a 256-dimensional space
- curse of dimensionality: sample size exponential in $d$
- is learning faces hopeless? Not if they have structure
- low-dimensional manifold [next slide]

Common theme: learning of a high-dimensional signal is possible when its intrinsic structure is low-dimensional



Faces manifold

[credit: N. Vasconcelos and A. Lippman] http://www.svcl.ucsd.edu/projects/manifolds



Principal Components Analysis (PCA)

Dimensionality reduction technique for data in $\mathbb{R}^d$

Problem statement [draw on board]:
- given $m$ points $x_1, \ldots, x_m$ in $\mathbb{R}^d$
- and target dimension $k < d$
- find the “best” $k$-dimensional subspace approximating the data

Formally: find matrices $U \in \mathbb{R}^{d \times k}$ and $V \in \mathbb{R}^{k \times d}$ that minimize
\[ f(U, V) = \sum_{i=1}^m \| x_i - U V x_i \|_2^2 \]

$V : \mathbb{R}^d \to \mathbb{R}^k$ is the “compressor”, $U : \mathbb{R}^k \to \mathbb{R}^d$ is the “decompressor”

Is $f : \mathbb{R}^{d \times k} \times \mathbb{R}^{k \times d} \to \mathbb{R}$ convex?

No! $f(U, \cdot)$ and $f(\cdot, V)$ are both convex, but $f(\cdot, \cdot)$ is not.
Claim: an optimal solution is achieved at $U = V^\top$ with $U^\top U = I$ (columns of $U$ are orthonormal).

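As a concrete illustration of the objective $f(U, V)$ above, here is a minimal NumPy sketch (not from the lecture; the function name reconstruction_error and the random data are illustrative):

    import numpy as np

    def reconstruction_error(X, U, V):
        # X: (m, d) data, one point x_i per row
        # U: (d, k) "decompressor", V: (k, d) "compressor"
        # returns f(U, V) = sum_i ||x_i - U V x_i||_2^2
        X_hat = X @ V.T @ U.T                  # row i is (U V x_i)^T
        return np.sum((X - X_hat) ** 2)

    rng = np.random.default_rng(0)
    m, d, k = 100, 5, 2
    X = rng.normal(size=(m, d))
    U = rng.normal(size=(d, k))
    V = rng.normal(size=(k, d))
    print(reconstruction_error(X, U, V))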


Principal Components Analysis: auxiliary claim

Optimization problem:
\[ \min_{U \in \mathbb{R}^{d \times k},\ V \in \mathbb{R}^{k \times d}} \; \sum_{i=1}^m \| x_i - U V x_i \|_2^2 \]

Claim: an optimal solution is achieved at $U = V^\top$ with $U^\top U = I$. Proof:

- For any $U, V$, the linear map $x \mapsto UVx$ has range $R$ of dimension at most $k$.
- Let $w_1, \ldots, w_k$ be an orthonormal basis for $R$ (extended to $k$ orthonormal vectors if $\dim R < k$); arrange them as the columns of $W \in \mathbb{R}^{d \times k}$.
- Hence, for each $x_i$ there is $z_i \in \mathbb{R}^k$ such that $UVx_i = Wz_i$.
- Note: $W^\top W = I$.
- Which $z$ minimizes $f(x_i, z) := \|x_i - Wz\|_2^2$?
- For all $x \in \mathbb{R}^d$, $z \in \mathbb{R}^k$:
  \[ f(x, z) = \|x\|_2^2 + z^\top W^\top W z - 2 z^\top W^\top x = \|x\|_2^2 + \|z\|_2^2 - 2 z^\top W^\top x. \]
- Minimize w.r.t. $z$: $\nabla_z f = 2z - 2W^\top x = 0 \implies z = W^\top x$.
- Therefore
  \[ \sum_{i=1}^m \|x_i - UVx_i\|_2^2 = \sum_{i=1}^m \|x_i - Wz_i\|_2^2 \;\geq\; \sum_{i=1}^m \|x_i - WW^\top x_i\|_2^2. \]
- $U, V$ are optimal, so $\sum_{i=1}^m \|x_i - UVx_i\|_2^2 = \sum_{i=1}^m \|x_i - WW^\top x_i\|_2^2$.
- So instead of $U, V$ we can take $W, W^\top$.

$WW^\top x$ is the orthogonal projection of $x$ onto $R$. [board]

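A small numerical check of the key step in the proof (a sketch, not from the lecture): for $W$ with orthonormal columns, $z = W^\top x$ does at least as well as randomly drawn $z$'s for $\|x - Wz\|_2^2$.

    import numpy as np

    rng = np.random.default_rng(1)
    d, k = 6, 3
    # W with orthonormal columns (W^T W = I), via QR of a random matrix
    W, _ = np.linalg.qr(rng.normal(size=(d, k)))
    x = rng.normal(size=d)

    z_star = W.T @ x                            # claimed minimizer
    best = np.sum((x - W @ z_star) ** 2)
    for _ in range(1000):
        z = rng.normal(size=k)
        assert np.sum((x - W @ z) ** 2) >= best - 1e-12
    print("z = W^T x is (empirically) optimal; error =", best)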


PCA: reformulated

Optimization problem:
\[ \min_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \sum_{i=1}^m \| x_i - U U^\top x_i \|_2^2 \qquad (*) \]

For every $x \in \mathbb{R}^d$ and $U \in \mathbb{R}^{d \times k}$ with $U^\top U = I$:
\[ \begin{aligned}
\| x - U U^\top x \|^2 &= \|x\|^2 - 2\, x^\top U U^\top x + x^\top U U^\top U U^\top x \\
&= \|x\|^2 - x^\top U U^\top x \\
&= \|x\|^2 - \mathrm{trace}(x^\top U U^\top x) \\
&= \|x\|^2 - \mathrm{trace}(U^\top x x^\top U).
\end{aligned} \]

Recall the trace operator:
- defined for any square matrix $A$: the sum of $A$'s diagonal entries
- for any $A \in \mathbb{R}^{p \times q}$, $B \in \mathbb{R}^{q \times p}$, we have $\mathrm{trace}(AB) = \mathrm{trace}(BA)$
- $\mathrm{trace} : \mathbb{R}^{p \times p} \to \mathbb{R}$ is a linear map

Reformulating (*):
\[ \max_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \mathrm{trace}\Big( U^\top \sum_{i=1}^m x_i x_i^\top \, U \Big). \]

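A sketch verifying this reformulation numerically (illustrative code, not from the lecture): for orthonormal $U$, the reconstruction error equals $\sum_i \|x_i\|_2^2 - \mathrm{trace}(U^\top A U)$ with $A = \sum_i x_i x_i^\top$, so minimizing one is the same as maximizing the other.

    import numpy as np

    rng = np.random.default_rng(2)
    m, d, k = 50, 6, 2
    X = rng.normal(size=(m, d))                   # rows are x_i
    U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # U^T U = I

    A = X.T @ X                                   # A = sum_i x_i x_i^T
    err = np.sum((X - X @ U @ U.T) ** 2)          # sum_i ||x_i - U U^T x_i||^2
    via_trace = np.sum(X ** 2) - np.trace(U.T @ A @ U)
    print(np.allclose(err, via_trace))            # True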


Spectral theorem refresher

Suppose $A \in \mathbb{R}^{n \times n}$ is symmetric, $A = A^\top$, and $Av = \lambda v$ for some $v \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}$.
We say that $v$ is an eigenvector of $A$ with eigenvalue $\lambda$.

Fact: for symmetric $A$, eigenvectors of distinct eigenvalues are orthogonal. Proof:
- if $Av = \lambda v$, $Au = \mu u$, and $A = A^\top$
- then $u^\top A v = \lambda \langle u, v \rangle$ and $u^\top A v = (Au)^\top v = \mu \langle u, v \rangle$
- hence $\lambda \neq \mu \implies \langle u, v \rangle = 0$

Theorem: for every symmetric $A \in \mathbb{R}^{n \times n}$ there is an orthogonal $V \in \mathbb{R}^{n \times n}$ (i.e., $V^\top = V^{-1}$) s.t. $V^\top A V = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.

$V$ diagonalizes $A$

$V = [v_1, \ldots, v_n]$ and $v_i$ is the eigenvector of $A$ corresponding to $\lambda_i$

The $v_i$'s are orthonormal ($V^\top V = I$) and span $\mathbb{R}^n$

$A$ of the form $X^\top X$ or $XX^\top$ is positive semidefinite: $\Lambda \geq 0$

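A NumPy illustration of the theorem (a sketch; numpy.linalg.eigh is one way to diagonalize a symmetric matrix, with eigenvalues returned in ascending order):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5
    Xr = rng.normal(size=(n, n))
    A = Xr.T @ Xr                                  # symmetric, of the form X^T X

    lam, V = np.linalg.eigh(A)                     # eigenvalues ascending, V orthogonal
    print(np.allclose(V.T @ A @ V, np.diag(lam)))  # V^T A V = Lambda
    print(np.allclose(V.T @ V, np.eye(n)))         # columns of V are orthonormal
    print(np.all(lam >= -1e-10))                   # positive semidefinite: Lambda >= 0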


Maximizing the trace by k top eigenvalues

We want to
\[ \max_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \mathrm{trace}\Big( U^\top \sum_{i=1}^m x_i x_i^\top \, U \Big). \]

Put $A = \sum_{i=1}^m x_i x_i^\top$.

Diagonalize: $A = V \Lambda V^\top$
- $V^\top V = V V^\top = I$
- $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$

So the goal is:
\[ \max_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \mathrm{trace}\big( U^\top V \Lambda V^\top U \big). \]



Maximizing the trace by k top eigenvalues

Goal:
\[ \max_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \mathrm{trace}\big( U^\top V \Lambda V^\top U \big). \]

We now show an upper bound on this value.
Fix any $U \in \mathbb{R}^{d \times k}$ with $U^\top U = I$ and put $B = V^\top U$.
Then $U^\top V \Lambda V^\top U = B^\top \Lambda B$.
Hence the goal equals
\[ \mathrm{trace}(\Lambda B B^\top) = \sum_{j=1}^d \lambda_j \sum_{i=1}^k (B_{ji})^2. \]
Define $\beta_j = \sum_{i=1}^k (B_{ji})^2$.
$B^\top B = U^\top V V^\top U = U^\top U = I$. Therefore
\[ \|\beta\|_1 = \sum_{j=1}^d \beta_j = \mathrm{trace}(B B^\top) = \mathrm{trace}(B^\top B) = \mathrm{trace}(I) = k. \]
Claim: $\beta_j \leq 1$. Proof:
- choose $\bar{B} \in \mathbb{R}^{d \times d}$ such that $\bar{B}[:, 1{:}k] = B$ and $\bar{B}^\top \bar{B} = I$; hence $\bar{B}^\top = \bar{B}^{-1}$
- so $\bar{B}\bar{B}^\top = I$; hence for all $j \in [d]$, $\sum_{i=1}^d (\bar{B}_{ji})^2 = 1 \implies \sum_{i=1}^k (B_{ji})^2 \leq 1$

Hence:
\[ \mathrm{trace}(U^\top V \Lambda V^\top U) \;\leq\; \max_{\beta \in [0,1]^d : \|\beta\|_1 \leq k} \; \sum_{j=1}^d \lambda_j \beta_j \;=\; \sum_{j=1}^k \lambda_j. \]

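A quick numerical sanity check of this bound (illustrative sketch, not from the lecture): for many random orthonormal $U$, $\mathrm{trace}(U^\top A U)$ never exceeds $\lambda_1 + \ldots + \lambda_k$.

    import numpy as np

    rng = np.random.default_rng(9)
    m, d, k = 100, 7, 3
    X = rng.normal(size=(m, d))
    A = X.T @ X
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]        # eigenvalues, descending
    top_k_sum = lam[:k].sum()

    for _ in range(1000):
        U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # random U with U^T U = I
        assert np.trace(U.T @ A @ U) <= top_k_sum + 1e-9
    print("bound holds; sum of top-k eigenvalues =", top_k_sum)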


PCA: solution

We have:
- $A = \sum_{i=1}^m x_i x_i^\top$
- $A = V \Lambda V^\top$
- $V^\top V = V V^\top = I$
- $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$
- $U \in \mathbb{R}^{d \times k}$ with $U^\top U = I$

Under the above, we showed $\mathrm{trace}(U^\top A U) \leq \sum_{j=1}^k \lambda_j$.

Get this with equality: set $U = [u_1, u_2, \ldots, u_k]$ where $A u_i = \lambda_i u_i$, i.e., the $i$-th column of $U$ is $A$'s eigenvector corresponding to $\lambda_i$; then $\mathrm{trace}(U^\top A U) = \sum_{j=1}^k \lambda_j$.

So $U$ is the optimal solution.

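Putting the solution together, a minimal NumPy sketch (the helper name pca_components is an assumption, not from the lecture): form $A = \sum_i x_i x_i^\top$, diagonalize, and keep the $k$ eigenvectors with the largest eigenvalues.

    import numpy as np

    def pca_components(X, k):
        # X: (m, d) with rows x_i; returns U (d, k) = top-k eigenvectors of A,
        # plus all eigenvalues in descending order
        A = X.T @ X                        # A = sum_i x_i x_i^T
        lam, V = np.linalg.eigh(A)         # ascending eigenvalues
        order = np.argsort(lam)[::-1]      # indices sorted by descending eigenvalue
        return V[:, order[:k]], lam[order]

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 10)) @ np.diag([5, 4, 3] + [0.1] * 7)
    U, lam = pca_components(X, k=3)
    print(U.shape, np.allclose(U.T @ U, np.eye(3)))       # (10, 3) True
    print(np.trace(U.T @ (X.T @ X) @ U), lam[:3].sum())   # equal up to float error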


PCA: solution value

Recall our original goal:
\[ \min_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \sum_{i=1}^m \| x_i - U U^\top x_i \|_2^2. \]

We showed, for $A = \sum_{i=1}^m x_i x_i^\top$:
\[ \operatorname*{argmin}_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \sum_{i=1}^m \| x_i - U U^\top x_i \|_2^2 \;=\; \operatorname*{argmax}_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \mathrm{trace}(U^\top A U). \]

We just showed that $U$ (the top-$k$ eigenvectors of $A$) maximizes $\mathrm{trace}(U^\top A U)$.

Value of the solution:
- $U = [u_1, u_2, \ldots, u_k]$ (the $k$ top eigenvectors), so $\mathrm{trace}(U^\top A U) = \sum_{j=1}^k \lambda_j$.
- $\sum_{i=1}^m \|x_i - UU^\top x_i\|_2^2 = \sum_{i=1}^m \|x_i\|_2^2 - \mathrm{trace}(U^\top A U)$
- $\sum_{i=1}^m \|x_i\|_2^2 = \mathrm{trace}(A) = \mathrm{trace}(V\Lambda V^\top) = \mathrm{trace}(V^\top V\Lambda) = \mathrm{trace}(\Lambda) = \sum_{i=1}^d \lambda_i$.
- Optimal objective value: $\sum_{i=1}^m \|x_i - UU^\top x_i\|_2^2 = \sum_{i=k+1}^d \lambda_i$.

This is the “distortion” — how much data we throw away.

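A quick check of the solution value (illustrative sketch, not from the lecture): with $U$ the top-$k$ eigenvectors, the distortion equals the sum of the $d-k$ smallest eigenvalues of $A$.

    import numpy as np

    rng = np.random.default_rng(5)
    m, d, k = 300, 8, 3
    X = rng.normal(size=(m, d))

    A = X.T @ X
    lam, V = np.linalg.eigh(A)                       # ascending eigenvalues
    U = V[:, -k:]                                    # top-k eigenvectors
    distortion = np.sum((X - X @ U @ U.T) ** 2)      # sum_i ||x_i - U U^T x_i||^2
    print(np.allclose(distortion, lam[:-k].sum()))   # equals sum of discarded eigenvalues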


PCA: the variance view

Suppose the data $S = \{x_i,\ i \leq m\}$ is centered: $\frac{1}{m} \sum_{i=1}^m x_i = 0$

Let’s project it onto a 1-dim subspace w/max variance [board]

1-dim projection operator: $uu^\top$, for $u \in \mathbb{R}^d$, $u^\top u = 1$

\[ \mathrm{Var}_{X \sim S}[u^\top X] = \frac{1}{m} \sum_{i=1}^m (u^\top x_i)^2 = \frac{1}{m} \sum_{i=1}^m u^\top (x_i x_i^\top) u \]

Optimization problem:
\[ \max_{u \in \mathbb{R}^d : u^\top u = 1} \; \sum_{i=1}^m u^\top (x_i x_i^\top) u \]

Looks familiar? PCA!

Maximized by the eigenvector $u_1$ of $A = \sum_{i=1}^m x_i x_i^\top$ corresponding to the top eigenvalue $\lambda_1$.
What about $u_2$?

Subtract off data component spanned by u1 and repeat.

PCA ≡ choosing the dimensions that maximize sample variance.

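A sketch of the variance view (illustrative, not from the lecture): after centering, the projected sample variance along the top eigenvector is at least that along any other unit direction.

    import numpy as np

    rng = np.random.default_rng(6)
    m, d = 500, 5
    X = rng.normal(size=(m, d)) * np.array([3.0, 1.0, 1.0, 0.5, 0.2])
    X = X - X.mean(axis=0)                    # center the data

    A = X.T @ X
    lam, V = np.linalg.eigh(A)
    u1 = V[:, -1]                             # top eigenvector
    var_u1 = np.mean((X @ u1) ** 2)           # projected sample variance along u1

    for _ in range(1000):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                # random unit direction
        assert np.mean((X @ u) ** 2) <= var_u1 + 1e-12
    print("top eigenvector maximizes projected variance:", var_u1)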


PCA computational complexity

Computing $A = \sum_{i=1}^m x_i x_i^\top$ costs $O(md^2)$ time

Diagonalizing $A = V\Lambda V^\top$ costs $O(d^3)$ time

What if $d \gg m$?

Write $A = X^\top X$, where $X = [x_1^\top; x_2^\top; \ldots; x_m^\top] \in \mathbb{R}^{m \times d}$

Put $B = XX^\top \in \mathbb{R}^{m \times m}$; then $B_{ij} = \langle x_i, x_j \rangle$.
Diagonalize $B$ instead of $A$: get $B$'s eigenvalues and eigenvectors.

Suppose $Bu = \lambda u$ for some $u \in \mathbb{R}^m$, $\lambda \in \mathbb{R}$;
then $A(X^\top u) = X^\top X X^\top u = \lambda X^\top u$.

Hence, $X^\top u$ is an eigenvector of $A$ with eigenvalue $\lambda$

Conclusion: can diagonalize $B$ instead of $A$

Pay $O(m^2 d)$ time to compute $B$ and $O(m^3)$ time to diagonalize it

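A sketch of the $d \gg m$ trick (illustrative): diagonalize the $m \times m$ matrix $B = XX^\top$ and map its eigenvectors back to eigenvectors of $A = X^\top X$ via $X^\top u$, without ever forming the $d \times d$ matrix $A$.

    import numpy as np

    rng = np.random.default_rng(7)
    m, d = 20, 1000                            # many more features than points
    X = rng.normal(size=(m, d))

    B = X @ X.T                                # (m, m) Gram matrix, B_ij = <x_i, x_j>
    mu, W = np.linalg.eigh(B)                  # cheap: m x m instead of d x d
    u = W[:, -1]                               # top eigenvector of B
    v = X.T @ u                                # candidate eigenvector of A = X^T X
    v /= np.linalg.norm(v)

    Av = X.T @ (X @ v)                         # A v, computed without forming A
    print(np.allclose(Av, mu[-1] * v))         # A v = lambda v, same eigenvalue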


PCA and generalization

PCA is often used as a pre-processing step for supervised learning (e.g., SVM)

Q: How to choose the target dimension k? (At least in the supervised setting.)

Theorem

Suppose $S = \{(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\},\ i \leq m\}$ is drawn from $D^m$. Consider the hypothesis space $H$ of all hyperplanes $w \in \mathbb{R}^d$, and define the hinge loss $\ell(y, y') = \max\{0, 1 - yy'\}$.
Suppose that for some $k \leq d$, running PCA on the sample yields distortion $\eta := \sum_{i=k+1}^d \lambda_i$.
Then with high probability, for all $w \in H$,
\[ \mathrm{risk}_\ell(w, D) \;\leq\; \mathrm{risk}_\ell(w, S) + O\left( \sqrt{\frac{k}{m}} + \sqrt{\frac{\eta}{m}} \right). \]

The bound does not depend on the full dimension $d$!

The bound holds even without running PCA (it suffices that $w$, $k$, $\eta$ as above exist)

The result suggests reasonable values for $k$

Other analyses use random matrix theory



PCA summary

Unsupervised learning technique

Often a pre-processing step for supervised learning; has a denoising effect

Performs dimensionality reduction from dim d to dim k < d

Optimization problem: find the rank-$k$ orthogonal projection $UU^\top$ that minimizes the squared distortion on the data, $\sum_{i=1}^m \| x_i - U U^\top x_i \|_2^2$

Solution: define the matrix $A = \sum_{i=1}^m x_i x_i^\top$, diagonalize it, and choose $U = [u_1, \ldots, u_k]$ to be the top $k$ eigenvectors of $A$

Equivalently: find $k$ orthogonal directions that capture most of the data variance

Computational cost:
- $O(md^2 + d^3)$ if $m \gg d$
- $O(m^2 d + m^3)$ if $d \gg m$

Generalization (in some cases) depends only on $k$ and the distortion but not on $d$

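To tie the summary together, a compact end-to-end sketch (the helper names pca_fit, compress, decompress are assumptions, not from the lecture; centering is shown only because it is common in practice, cf. the variance view):

    import numpy as np

    def pca_fit(X, k):
        # top-k eigenvectors of A = sum_i x_i x_i^T; X has one point per row
        lam, V = np.linalg.eigh(X.T @ X)
        return V[:, np.argsort(lam)[::-1][:k]]      # U in R^{d x k}, U^T U = I

    def compress(X, U):
        return X @ U                                # rows are U^T x_i in R^k

    def decompress(Z, U):
        return Z @ U.T                              # rows are U U^T x_i in R^d

    rng = np.random.default_rng(8)
    X = rng.normal(size=(100, 20)) @ rng.normal(size=(20, 20)) * 0.3
    X = X - X.mean(axis=0)                          # optional centering
    U = pca_fit(X, k=5)
    X_hat = decompress(compress(X, U), U)
    print("distortion:", np.sum((X - X_hat) ** 2))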