Topic 2: Principal Component Analysis

gauss.stat.su.se/gu/mm/F2.pdf

Ying Li

Stockholm University

September 11, 2012


Introduction

Examples

• Have commodity prices gone up during the last few years in Sweden?

  → It helps to reduce all the commodity prices to a few indices, each a linear combination of the original prices.

• A marketing manager is interested in developing a regression model to forecast sales. However, the independent variables (x) are correlated.

  → It helps to form 'new' variables that are linear combinations of the old variables, such that the new variables are uncorrelated among themselves.


Principal Component Analysis

Principal Component Analysis (PCA) is a technique for forming new variables that are linear combinations of the original variables.

• The new variables are called 'principal components' (PCs), denoted ξ.

• PCs are uncorrelated.

• No. of ξ ≤ No. of x.

• One measure of the amount of information conveyed by a PC is its variance: information(ξ) = var(ξ)

• var(ξ1) ≥ var(ξ2) ≥ ...


Geometry of PCA

        Example Data     Mean-centered Data
Obs      X1     X2         X1     X2
1        16      8          8      5
2        12     10          4      7
3        13      6          5      3
4        11      2          3     -1
5        10      8          2      5
6         9     -1          1     -4
7         8      4          0      1
8         7      6         -1      3
9         5     -3         -3     -6
10        3     -1         -5     -4
11        2     -3         -6     -6
12        0      0         -8     -3
Mean      8      3          0      0
Var      23     21         23     21
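The mean-centering step can be checked with a short NumPy sketch. The raw values below are chosen to agree with the mean-centered columns and the stated means (so observation 5 has X2 = 8 and observation 9 has centered X2 = -6). The variable names are illustrative, and the sample variance with divisor n - 1 is an assumption; it gives about 23.09 and 21.09, which round to the 23 and 21 in the table.

```python
import numpy as np

# Raw example data (X1, X2) for the 12 observations.
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

means = X.mean(axis=0)                # column means
Xc = X - means                        # mean-centered data
variances = Xc.var(axis=0, ddof=1)    # sample variances (divisor n - 1, assumed)

print("means:", means)
print("variances:", variances.round(2))
print("mean-centered data:")
print(Xc)
```

Centering shifts the cloud of points to the origin but leaves the variances unchanged, which is why the last row of the table is the same for both pairs of columns.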


Figure: mean-centered data (horizontal axis x1, vertical axis x2)


Figure: mean-centered data and one new axis (horizontal axis x1, vertical axis x2)


Consider p variables, which span a p-dimensional space:

1 Find the first new axis, which gives the first principal component, such that it accounts for the maximum of the total variance.

2 Then find a second axis, orthogonal to the first new axis, that accounts for the maximum of the variance not accounted for by the first component.

3 ...

4 Continue this procedure until all the new axes have been identified.
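For two variables, the search for the first new axis can be sketched as a brute-force scan over candidate angles. This is only an illustration of the geometric idea, not how PCA is computed in practice. The 0.1 degree grid, the divisor n - 1 and all variable names are assumptions, and the data are the two-variable example (with observation 5 taken as X2 = 8, consistent with its mean-centered value).

```python
import numpy as np

X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)
Xc = X - X.mean(axis=0)                       # mean-centered data

# Candidate axes: unit vectors (cos t, sin t) for angles t in [0, 180) degrees.
angles = np.arange(0.0, 180.0, 0.1)
theta = np.deg2rad(angles)
directions = np.vstack([np.cos(theta), np.sin(theta)])   # shape (2, n_angles)

# Project every observation on every candidate axis and measure the spread.
projections = Xc @ directions                 # shape (12, n_angles)
proj_var = projections.var(axis=0, ddof=1)

best = int(np.argmax(proj_var))
best_angle = angles[best]                     # angle of the first new axis
best_var = proj_var[best]                     # variance along the first new axis

# The second axis is orthogonal to the first one.
t = np.deg2rad(best_angle)
second_direction = np.array([-np.sin(t), np.cos(t)])
second_var = (Xc @ second_direction).var(ddof=1)

total_var = Xc.var(axis=0, ddof=1).sum()
print("first axis angle (degrees):", round(best_angle, 1))
print("variance along first axis:", round(best_var, 3))
print("variance along second axis:", round(second_var, 3))
print("total variance:", round(total_var, 3))
```

The two variances along the new axes add up to the total variance of the original variables, so rotating the axes redistributes the variance without changing its total.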


PCA as a dimension-reduction technique.

Question:

How well can a few new variables represent the information contained in the data?


Analytical Approach

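Assume the p new variables are formed from the following p linear combinations:

$$
\begin{aligned}
\xi_1 &= \omega_{11} x_1 + \omega_{12} x_2 + \cdots + \omega_{1p} x_p \\
\xi_2 &= \omega_{21} x_1 + \omega_{22} x_2 + \cdots + \omega_{2p} x_p \\
&\;\;\vdots \\
\xi_p &= \omega_{p1} x_1 + \omega_{p2} x_2 + \cdots + \omega_{pp} x_p
\end{aligned}
$$

The weights $\omega_{ij}$ are estimated such that

1 $\max(\mathrm{var}(\xi_1)), \max(\mathrm{var}(\xi_2)), \cdots$

2 $\omega_{i1}^2 + \omega_{i2}^2 + \cdots + \omega_{ip}^2 = 1, \quad i = 1, \ldots, p$

3 $\omega_{i1}\omega_{j1} + \omega_{i2}\omega_{j2} + \cdots + \omega_{ip}\omega_{jp} = 0 \quad \text{for all } i \neq j$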


Now, the mathematical problem is: how do we obtain the weights?


Result 1

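Let $\Sigma_{p \times p}$ be the covariance matrix associated with the random vector $X^T = (x_1, x_2, \cdots, x_p)$, and let $\Sigma$ have the eigenvalue-eigenvector pairs $(\lambda_1, \gamma_1), (\lambda_2, \gamma_2), \cdots, (\lambda_p, \gamma_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. The $i$th principal component is given by:

$$\xi_i = \gamma_i^T X = \gamma_{1i} x_1 + \gamma_{2i} x_2 + \cdots + \gamma_{pi} x_p, \quad i = 1, 2, \cdots, p,$$

with

$$\mathrm{var}(\xi_i) = \lambda_i, \quad i = 1, 2, \cdots, p$$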
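Result 1 can be checked numerically on the two-variable example. The sketch below uses the sample covariance matrix S in place of Σ, which is an assumption, as are the variable names. It relies on `numpy.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order, so they are reordered here to match λ1 ≥ λ2.

```python
import numpy as np

X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)
Xc = X - X.mean(axis=0)

S = np.cov(Xc, rowvar=False)            # sample covariance matrix (divisor n - 1)

eigvals, eigvecs = np.linalg.eigh(S)    # ascending eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]       # reorder so that lambda_1 >= lambda_2
lam = eigvals[order]
Gamma = eigvecs[:, order]               # column i holds the weights gamma_i

scores = Xc @ Gamma                     # column i holds xi_i for each observation
score_cov = np.cov(scores, rowvar=False)

print("eigenvalues:", lam.round(3))
print("weights (columns are gamma_1, gamma_2):")
print(Gamma.round(3))
print("covariance matrix of the components:")
print(score_cov.round(3))
```

The covariance matrix of the components is diagonal with the eigenvalues on the diagonal: the components are uncorrelated and var(ξi) = λi. The columns of `Gamma` have unit length and are mutually orthogonal, which are the two constraints placed on the weights. The sign of each eigenvector is arbitrary, so a column may come out multiplied by -1.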


Result 2

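Let $\Sigma_{p \times p}$ be the covariance matrix associated with the random vector $X^T = (x_1, x_2, \cdots, x_p)$, and let $\Sigma$ have the eigenvalue-eigenvector pairs $(\lambda_1, \gamma_1), (\lambda_2, \gamma_2), \cdots, (\lambda_p, \gamma_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Let $\xi_1 = \gamma_1^T X$, $\xi_2 = \gamma_2^T X$, $\cdots$. Then

$$\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2 = \sum_{i=1}^{p} \mathrm{Var}(x_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p.$$

The proportion of the total variance due to the $k$th principal component is:

$$\frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}$$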
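A short sketch of Result 2 on the two-variable example, again using the sample covariance matrix as a stand-in for Σ (an assumption). The names `proportion` and `cumulative` are illustrative.

```python
import numpy as np

X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

S = np.cov(X, rowvar=False)                    # sample covariance matrix
lam = np.sort(np.linalg.eigvalsh(S))[::-1]     # eigenvalues, largest first

total_variance = np.trace(S)                   # sum of the variances of x1 and x2
proportion = lam / lam.sum()                   # share of variance per component
cumulative = np.cumsum(proportion)             # share covered by the first k components

print("total variance:", round(total_variance, 3))
print("sum of eigenvalues:", round(lam.sum(), 3))
print("proportion per component:", proportion.round(3))
print("cumulative proportion:", cumulative.round(3))
```

For these data the first component alone carries roughly 87% of the total variance, which is what makes a one-dimensional summary of the two variables reasonable.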


Result 3

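If $\xi_1 = \gamma_1^T X$, $\xi_2 = \gamma_2^T X$, $\cdots$ are the principal components obtained from $\Sigma$, then the correlation coefficient between the $i$th variable and the $k$th principal component is

$$\rho_{x_i, \xi_k} = \frac{\gamma_{ik} \sqrt{\lambda_k}}{\sqrt{\sigma_i^2}},$$

where $\gamma_{ik}$ is the weight of $x_i$ in $\xi_k$. This is also defined as $x_i$'s loading on $\xi_k$.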
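The loading formula can be compared with correlations computed directly from the data. In this sketch x_i is the ith variable, ξ_k the kth component, and the loading is taken as γ_ik √λ_k / σ_i, where γ_ik is the weight of x_i in ξ_k. The sample covariance matrix replaces Σ, and the variable names are assumptions.

```python
import numpy as np

X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)
Xc = X - X.mean(axis=0)
p = X.shape[1]

S = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
Gamma = eigvecs[:, order]                  # Gamma[i, k] = weight of x_i in xi_k

sigma = np.sqrt(np.diag(S))                # standard deviations of the variables

# loadings[i, k] = Gamma[i, k] * sqrt(lam[k]) / sigma[i]
loadings = Gamma * np.sqrt(lam) / sigma[:, None]

# Direct check: correlate each variable with each component score.
scores = Xc @ Gamma
corr = np.corrcoef(np.hstack([Xc, scores]), rowvar=False)[:p, p:]

print("loadings from the formula:")
print(loadings.round(3))
print("correlations computed from the data:")
print(corr.round(3))
```

The two matrices agree. When all p components are kept, the squared loadings of each variable sum to one, because the components together reproduce the variable exactly.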


Example Data


Considerations when performing PCA

Q1:

What effect does the type of data (original or standardized) have on PCA?

→ The weights used to form the PCs are affected by the relative variances of the variables.

→ Standardized data are usually recommended.
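The effect of the data type can be illustrated by changing the unit of one variable. In the sketch below X1 is multiplied by 100, a hypothetical change of unit introduced only for this illustration. PCA on the covariance matrix corresponds to the original data, PCA on the correlation matrix to standardized data. The helper `first_pc_weights` is also an assumption of this sketch.

```python
import numpy as np

X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)
X_rescaled = X * np.array([100.0, 1.0])    # hypothetical: X1 measured in a smaller unit


def first_pc_weights(matrix):
    """Return the unit eigenvector of the largest eigenvalue, with a positive first entry."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    w = eigvecs[:, np.argmax(eigvals)]
    return w if w[0] >= 0 else -w          # the sign of an eigenvector is arbitrary


# Original data: covariance matrix.
cov_w = first_pc_weights(np.cov(X, rowvar=False))
cov_w_rescaled = first_pc_weights(np.cov(X_rescaled, rowvar=False))

# Standardized data: correlation matrix.
corr_w = first_pc_weights(np.corrcoef(X, rowvar=False))
corr_w_rescaled = first_pc_weights(np.corrcoef(X_rescaled, rowvar=False))

print("covariance PCA, original units:", cov_w.round(3))
print("covariance PCA, X1 rescaled:   ", cov_w_rescaled.round(3))
print("correlation PCA, original units:", corr_w.round(3))
print("correlation PCA, X1 rescaled:   ", corr_w_rescaled.round(3))
```

After rescaling, the covariance-based first component is almost entirely X1, simply because X1 now has by far the larger variance. The correlation-based weights do not change with the unit of measurement.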


Q2:

How many principal components should be retained?

→ One common cutoff is to retain enough PCs to account for 80% of the total variance.

→ Scree plot.

→ Eigenvalue-greater-than-one rule (only for standardized data).
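The two numerical rules can be sketched as follows. The eigenvalues are hypothetical, made up for this illustration; they sum to 5, as the eigenvalues of a correlation matrix of five standardized variables would. The names `k_80` and `k_eigen_gt_one` are assumptions.

```python
import numpy as np

# Hypothetical eigenvalues of a 5 x 5 correlation matrix, largest first.
eigenvalues = np.array([2.4, 1.3, 0.7, 0.4, 0.2])

proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

# Rule 1: smallest number of PCs whose cumulative share reaches 80%.
k_80 = int(np.argmax(cumulative >= 0.80)) + 1

# Rule 2: keep the PCs whose eigenvalue exceeds one (standardized data only).
k_eigen_gt_one = int((eigenvalues > 1.0).sum())

# A scree plot shows these eigenvalues against the component number;
# here they are only listed as text.
for k, (lam, cum) in enumerate(zip(eigenvalues, cumulative), start=1):
    print(f"PC{k}: eigenvalue = {lam:.2f}, cumulative proportion = {cum:.2f}")

print("PCs retained by the 80% rule:", k_80)
print("PCs retained by the eigenvalue-greater-than-one rule:", k_eigen_gt_one)
```

The rules need not agree: with these hypothetical eigenvalues the 80% rule keeps three components and the eigenvalue rule keeps two. The eigenvalue rule only makes sense for standardized data because each standardized variable has variance one, so a component with an eigenvalue below one carries less variance than a single original variable.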


Q3:

How to interpret the principal components?

→ Use loadings.


Loadings   Bread    Hamburger   Milk     Oranges   Tomatoes
PC1        0.772    0.896       0.529    0.350     0.788
PC2       -0.324   -0.046      -0.453    0.837     0.302
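One way to read such a loadings table is to list, for each PC, the variables that load strongly on it. The sketch below uses the loadings from the table; the cutoff of 0.5 in absolute value is an assumption of this sketch, not a rule from the slides. Because loadings are correlations with uncorrelated components, the sum of a variable's squared loadings gives the share of its variance reproduced by the two PCs.

```python
# Loadings from the table above, one dictionary per principal component.
loadings = {
    "PC1": {"Bread": 0.772, "Hamburger": 0.896, "Milk": 0.529,
            "Oranges": 0.350, "Tomatoes": 0.788},
    "PC2": {"Bread": -0.324, "Hamburger": -0.046, "Milk": -0.453,
            "Oranges": 0.837, "Tomatoes": 0.302},
}

threshold = 0.5   # assumed cutoff for a "high" loading

# Variables with a high loading (in absolute value) on each component.
salient = {
    pc: [name for name, value in values.items() if abs(value) >= threshold]
    for pc, values in loadings.items()
}

# Share of each variable's variance accounted for by PC1 and PC2 together.
explained = {
    name: sum(loadings[pc][name] ** 2 for pc in loadings)
    for name in loadings["PC1"]
}

for pc, names in salient.items():
    print(pc, "high loadings:", names)
for name, share in explained.items():
    print(f"{name}: {share:.3f}")
```

With this cutoff, PC1 is tied to bread, hamburger, milk and tomatoes, so it can be read as a general price level, while PC2 is tied mainly to oranges.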


Q4:

How to use the principal component scores?

→ PC scores can be plotted to help interpret the results.

→ PC scores can be used as input variables for further analysis, such as cluster analysis, regression and discriminant analysis (DA).
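A minimal sketch of using PC scores as regression inputs, on the two-variable example. The response `y` is hypothetical, generated at random only to have something to regress, and the variable names are assumptions. The point is that the scores are uncorrelated, so each regression coefficient can be estimated separately from the others.

```python
import numpy as np

X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)
Xc = X - X.mean(axis=0)
n = X.shape[0]

# PC scores: project the centered data on the eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Gamma = eigvecs[:, np.argsort(eigvals)[::-1]]
scores = Xc @ Gamma                         # one row of scores per observation

# Hypothetical response, used only for illustration.
rng = np.random.default_rng(0)
y = rng.normal(size=n)

# Multiple regression of y on an intercept and both PC scores.
design = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

# One-at-a-time slopes: valid here because the score columns are orthogonal.
separate_slopes = scores.T @ y / (scores ** 2).sum(axis=0)

print("joint fit (intercept, PC1, PC2):", beta.round(4))
print("separate slopes (PC1, PC2):     ", separate_slopes.round(4))
```

The joint and separate estimates coincide, which is not the case for the correlated original variables. This is the practical benefit of the 'new' uncorrelated variables in the sales forecasting example from the introduction.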


Summary

• The main objective of PCA.

• How to interpret the PCA results: no. of PCs, PCs, PC scores

• Attention: the results of PCA can be affected by the type of data used.
