Principal Component Analysis and its use in MA clustering, Lecture 12


Transcript of Principal Component Analysis and its use in MA clustering, Lecture 12.

Page 1:

Principal Component Analysis and its use in MA clustering

Lecture 12

Page 2:

What is PCA?

• This is a MATHEMATICAL procedure that transforms a set of correlated responses into a smaller set of uncorrelated variables called PRINCIPAL COMPONENTS.

• Uses:
– Data screening
– Clustering
– Discriminant analysis
– Regression (combating multicollinearity)

Page 3:

Objectives of PCA

• It is an exploratory technique meant to give researchers a better FEEL for their data

• Reduce dimensionality, or rather, try to understand the TRUE dimensionality of the data

• Identify “meaningful” variables
• If you have a VARIANCE-COVARIANCE MATRIX, S, PCA returns new variables called principal components that are:
– Uncorrelated
– The first component explains MOST of the variability
– The remaining PCs explain decreasing amounts of variability

Page 4:

Idea of PCA

• Consider x to be a random vector with mean vector μ and variance-covariance matrix S.

• The first PC variable is defined by y1 = a1'(x − μ), where a1 is chosen so that Var(a1'(x − μ)) is maximized over all vectors a1 satisfying a1'a1 = 1.

• It can be shown that the maximum value of Var(a1'(x − μ)) over all such vectors is λ1, the first (largest) eigenvalue of the matrix S. This implies that a1 is the eigenvector corresponding to the eigenvalue λ1.

• The second PC is given by the eigenvector corresponding to the second largest eigenvalue λ2, and so on down to the pth eigenvalue.
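A minimal sketch of this idea in R, using simulated data (the variable names are illustrative, not from the lecture): the eigenvectors of the sample covariance matrix, taken in order of decreasing eigenvalue, reproduce the loadings that princomp() reports.

set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)        # 100 observations on 3 variables
x[, 2] <- x[, 1] + 0.5 * x[, 2]              # induce some correlation
S <- cov(x)                                  # sample variance-covariance matrix
e <- eigen(S)                                # eigenvalues (decreasing) and eigenvectors
e$values                                     # lambda_1 >= lambda_2 >= lambda_3
e$vectors[, 1]                               # a1, the first PC direction
unclass(princomp(x)$loadings)[, 1]           # same direction (possibly with flipped sign)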

Page 5:

Supplementary Info:

• What are Eigenvalues and Eigenvectors?

• Also called characteristic roots (latent roots), eigenvalues are the roots of the polynomial equation defined by:

• |S − λI| = 0
• This leads to an equation of the form:
• c1λ^p + c2λ^(p−1) + … + cpλ + cp+1 = 0

• If S is symmetric then the eigenvalues are real numbers and can be ordered.
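As a small numerical illustration (an assumed 2 by 2 matrix, not lecture data): for a symmetric S the equation |S − λI| = 0 expands to λ^2 − tr(S)λ + |S| = 0, and eigen() returns its real roots in decreasing order.

S <- matrix(c(4, 2,
              2, 3), nrow = 2, byrow = TRUE)      # a small symmetric matrix
eigen(S)$values                                   # real roots, ordered largest first
polyroot(c(det(S), -sum(diag(S)), 1))             # same roots, from the characteristic polynomial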

Page 6:

Supplementary Info: II

• What are Eigenvectors?
• Similarly, eigenvectors are the vectors satisfying the equation:
• Sa − λa = 0
• If S is symmetric then there will be p eigenvectors corresponding to the p eigenvalues.
• They are generally not unique and are normalized so that aj'aj = 1.

• Remarks: if two eigenvalues are NOT equal, their eigenvectors will be orthogonal to each other. When two eigenvalues are equal, their eigenvectors are CHOSEN to be orthogonal to each other (in this case they are non-unique).

• tr(S) = Σ λi
• |S| = Π λi
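Both identities, and the orthogonality of the eigenvectors, are easy to check numerically in R (a sketch with an arbitrary symmetric matrix, not lecture data).

S <- matrix(c(4, 2, 1,
              2, 3, 0,
              1, 0, 2), nrow = 3, byrow = TRUE)   # symmetric 3 x 3 example
e <- eigen(S)
c(sum(diag(S)), sum(e$values))     # tr(S) equals the sum of the eigenvalues
c(det(S), prod(e$values))          # |S| equals the product of the eigenvalues
round(crossprod(e$vectors), 10)    # aj'aj = 1 and aj'ak = 0: orthonormal eigenvectors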

Page 7:

Idea of PCA contd…

• Hence the p principal components are a1, a2, …, ap, the eigenvectors corresponding to the ordered eigenvalues of S.

• Here, λ1 ≥ λ2 ≥ … ≥ λp.
• Result: two principal components are uncorrelated if and only if their defining eigenvectors are orthogonal to each other. Hence the PCs are placed on an orthogonal axis system on which the data fall.
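A quick check of this result in R (simulated data, names illustrative): the covariance matrix of the principal component scores is diagonal, so the components are indeed uncorrelated.

set.seed(2)
x <- matrix(rnorm(50 * 3), ncol = 3)
x[, 3] <- x[, 1] + x[, 2] + 0.2 * x[, 3]   # make the original variables correlated
scores <- princomp(x)$scores               # the data re-expressed on the PC axes
round(cov(scores), 10)                     # diagonal matrix: the components are uncorrelated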

Page 8:

Idea of PCA contd…

• The variance of the jth component is λj, j = 1, …, p.

• Remember: tr(S) = s11+s22+…+spp.

• Also, tr(S) = λ1 + λ2 + … + λp.

• Hence, a measure of the “importance” of the jth principal component is often given by λj/tr(S).
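In R this importance measure is just the eigenvalues scaled by their sum (a sketch with simulated data; summary() on a princomp fit reports the same quantity as the proportion of variance).

set.seed(3)
x <- matrix(rnorm(60 * 4), ncol = 4)
lam <- eigen(cov(x))$values
lam / sum(lam)            # importance of each component: lambda_j / tr(S)
cumsum(lam) / sum(lam)    # cumulative proportion of the total variability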

Page 9:

Comments

• To actually do PCA we need to compute the principal component scores or the values of the principal component variable for each unit in the data set.

• These scores give the locations of the observations in the data set with respect to the principal component axes.

• Generally eigenvectors are normalized to length 1, aj'aj = 1.

• Often, to aid comparison across components, each element of the eigenvector is multiplied by the square root of the corresponding eigenvalue (these are called component vectors),

cj = λj^(1/2) aj.
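A sketch of both calculations in R (simulated data, names illustrative): the scores are the mean-centred data projected onto the eigenvectors, and the component vectors rescale each eigenvector by the square root of its eigenvalue.

set.seed(4)
x <- matrix(rnorm(40 * 3), ncol = 3)
e <- eigen(cov(x))
xc <- scale(x, center = TRUE, scale = FALSE)      # subtract the sample mean vector
scores <- xc %*% e$vectors                        # PC scores: each row locates one unit on the PC axes
cvec <- sweep(e$vectors, 2, sqrt(e$values), "*")  # component vectors cj = sqrt(lambda_j) * aj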

Page 10:

Estimating PC

• Life would be easy if μ and S were known: all we would have to do is compute the normalized eigenvectors and the corresponding eigenvalues.

• But most of the time we DO NOT know μ and S, so we need to estimate them; hence the PCs are the sample values corresponding to the estimated μ and S.

• Determining the # of PCs:
– Look for eigenvalues that are much smaller than the others.
– Use plots such as the SCREE plot (a plot of each eigenvalue against its component number); see the sketch below.
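A scree plot is simple to draw by hand (a sketch with simulated data; screeplot() applied to a princomp fit gives an equivalent picture).

set.seed(5)
x <- matrix(rnorm(80 * 5), ncol = 5)
lam <- eigen(cov(x))$values
plot(seq_along(lam), lam, type = "b",
     xlab = "Component number", ylab = "Eigenvalue", main = "Scree plot")
# look for the point where the eigenvalues level off (the "elbow")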

Page 11:

Caveats

• The whole idea of PCA is to transform a set of correlated variables into a set of uncorrelated variables; hence, if the data are already uncorrelated, there is not much additional advantage in doing PCA.

• One can do PCA on the correlation matrix or on the covariance matrix.

• When the correlation matrix is used, the component correlation vectors cj = λj^(1/2) aj give the correlations between the original variables and the jth principal component variable.
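A sketch of both options with princomp() on simulated data (names illustrative): cor = TRUE carries out the analysis on the correlation matrix, and the rescaled eigenvectors can then be read as variable-component correlations.

set.seed(6)
x <- matrix(rnorm(70 * 3), ncol = 3)
x[, 1] <- 10 * x[, 1]                      # put one variable on a much larger scale
pc.cov <- princomp(x, cor = FALSE)         # PCA on the covariance matrix (scale-dependent)
pc.cor <- princomp(x, cor = TRUE)          # PCA on the correlation matrix
e <- eigen(cor(x))
sweep(e$vectors, 2, sqrt(e$values), "*")   # correlations between the variables and each component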

Page 12:

PCA and Multidimensional Scaling

• Essentially, what PCA does is what is called the SINGULAR VALUE DECOMPOSITION (SVD) of a matrix:

• X = UDV'
• where X is n by p, with n << p (in MA)
• U is n by n
• D is n by n, a diagonal matrix with decreasing diagonal entries d1 ≥ d2 ≥ … ≥ dn
• V is a p by n matrix, which rotates X into a new set of co-ordinates, such that XV = UD
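A small numerical check of this decomposition in R (dimensions chosen purely for illustration; note that svd() returns the diagonal of D as the vector d).

set.seed(7)
X <- matrix(rnorm(5 * 20), nrow = 5)       # n = 5 samples, p = 20 "genes" (n << p)
s <- svd(X)                                # X = U D V'
c(dim(s$u), length(s$d), dim(s$v))         # U is 5 x 5, d has 5 values, V is 20 x 5
all.equal(X %*% s$v, s$u %*% diag(s$d))    # the rotation XV = UD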

Page 13:

SVD and MDS

• SVD is a VERY memory-hungry procedure; for MA data in particular, where there are a large number of genes, it is very slow and often needs HUGE amounts of memory to work.

• Multidimensional Scaling (MDS) is a collection of methods that do not use the full data matrix but rather the matrix of pairwise distances between the observations. This reduces the computation from n by p to n by n (quite a reduction!).
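Classical MDS in R works directly from such a distance matrix (a sketch on simulated data; cmdscale() is in the base stats package).

set.seed(8)
x <- matrix(rnorm(10 * 100), nrow = 10)    # 10 samples measured on 100 "genes"
d <- dist(x)                               # pairwise distances: a 10 x 10 problem, not 10 x 100
mds <- cmdscale(d, k = 2)                  # classical MDS in two dimensions
plot(mds, xlab = "Dimension 1", ylab = "Dimension 2")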

Page 14:

Sammon Mapping

• A common method used in MA is SAMMON mapping, which aims to find the two-dimensional representation whose inter-point distances match the original dissimilarity matrix as closely as possible.

• PCA has the advantage that it represents the samples in a scatterplot whose axes are linear combinations of the most variable genes.

• Sammon mapping treats all genes equivalently and hence is a bit “duller” than PCA based clustering.

Page 15:

PCA in Microarrays

• Useful technique to understand the TRUE dimensionality of the data.

• Useful for clustering.
• In R (princomp comes with the base stats package; sammon is in the MASS package) you can use:

library(MASS)                                   # provides sammon()
my.data1 <- read.table("cluster.csv", header = TRUE, sep = ",")
princomp(my.data1)                              # principal components of the data
myd.sam <- sammon(dist(my.data1))               # Sammon mapping from the distance matrix
plot(myd.sam$points, type = "n")
text(myd.sam$points, labels = as.character(1:nrow(my.data1)))
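A possible follow-up (a sketch, assuming the same cluster.csv layout with samples in rows): plot the first two principal component scores so that samples lying close together can be read as candidate clusters.

my.pca <- princomp(my.data1)
plot(my.pca$scores[, 1:2], type = "n", xlab = "PC 1", ylab = "PC 2")
text(my.pca$scores[, 1:2], labels = as.character(1:nrow(my.data1)))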
