Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.

Organization

Association of PCA and this paper
Approach of this paper
Data sets
Clustering algorithms and similarity metrics
Results and discussion

The Functions of PCA?

PCA can reduce the dimensionality of the data set.

A few PCs may capture most of the variation in the original data set.

PCs are uncorrelated and ordered.

We expect that the first few PCs may ‘extract’ the cluster structure in the original data set.
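The properties listed above can be checked on a small example. The following is a minimal sketch, assuming NumPy and PCA computed by a singular value decomposition of the centered data; the toy matrix `X` is made up for illustration and is not one of the data sets in the paper.

```python
import numpy as np

def pca(X):
    """Return (scores, explained_variance_ratio); rows of X are observations."""
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                               # coordinates on the PCs, ordered by variance
    var = s ** 2 / (X.shape[0] - 1)              # variance along each PC
    return scores, var / var.sum()

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 observations of 5 correlated variables
# (a rank-2 signal plus a little noise).
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X = X + 0.05 * rng.normal(size=(200, 5))

scores, ratio = pca(X)
cov = np.cov(scores, rowvar=False)               # covariance of the PC scores
print("variance fraction per PC:", np.round(ratio, 4))
print("fraction captured by first 2 PCs:", round(float(ratio[:2].sum()), 4))
```

The off-diagonal entries of `cov` are zero up to rounding (the PCs are uncorrelated), and `ratio` is non-increasing (the PCs are ordered by the variance they capture).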

This Paper’s Point of View

A theoretical result shows that the first few PCs may not contain cluster information (Chang, 1983).

Chang’s example.

A motivating example (coming next).

A Motivating Example

Data: A subset of the sporulation data (477 genes), classified into seven temporal patterns (Chu et al., 1998).

The first 2 PCs contain 85.9% of the variation in the data (Figure 1a).

The first 3 PCs contain 93.2% of the variation in the data (Figure 1b).

Sporulation Data

The patterns overlap around the origin in (1a). The patterns are much more separated in (1b).

The Goal

EMPIRICALLY investigate the effectiveness of clustering gene expression data using PCs instead of the original variables.

Outline of Methods

Genes are to be clustered, and the experimental conditions are the variables.

The effectiveness of clustering with the original data and with different sets of PCs is measured by comparing the clustering results to an objective external criterion.

Assume the number of clusters is known.

Agreement Between Two Partitions

The Rand index (Rand, 1971):

Given a set S of n objects, let U and V be two different partitions of S. Let:

a = # of pairs that are placed in the same cluster in U and in the same cluster in V

d = # of pairs that are placed in different clusters in U and in different clusters in V

Rand index = $\frac{a + d}{\binom{n}{2}}$
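The definition can be turned into a direct pair count. The sketch below uses only the Python standard library; the label lists `u` and `v` are a made-up example of two partitions of six objects.

```python
from itertools import combinations

def rand_index(u, v):
    """u, v: cluster labels of the same n objects under two partitions."""
    pairs = list(combinations(range(len(u)), 2))   # all n-choose-2 pairs
    # a: pairs placed together in both partitions
    a = sum(1 for i, j in pairs if u[i] == u[j] and v[i] == v[j])
    # d: pairs placed apart in both partitions
    d = sum(1 for i, j in pairs if u[i] != u[j] and v[i] != v[j])
    return (a + d) / len(pairs)

u = [0, 0, 0, 1, 1, 1]   # hypothetical partition U
v = [0, 0, 1, 1, 2, 2]   # hypothetical partition V
print(rand_index(u, v))  # a = 2, d = 8, 15 pairs in total
```

This quadratic pair loop is meant for illustration only; for large n a contingency-table formulation is preferable.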

Agreement (Cont’d)

The adjusted Rand index (ARI; Hubert & Arabie, 1985):

$\text{ARI} = \frac{\text{index} - \text{expected index}}{\text{maximum index} - \text{expected index}}$

Note: A higher ARI means higher correspondence between the two partitions.
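A sketch of the adjusted Rand index in its usual contingency-table form (Hubert & Arabie, 1985), using only the standard library. Here "index" is the number of pairs placed together in both partitions, and the expected and maximum values are computed from the cluster sizes. Returning 1.0 in the degenerate case where the denominator is zero is a convention assumed here, not something stated in the slides.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(u, v):
    """u, v: cluster labels of the same n objects under two partitions."""
    n = len(u)
    # Pairs placed together in both partitions (from the contingency table).
    index = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_u = sum(comb(c, 2) for c in Counter(u).values())  # pairs together in U
    sum_v = sum(comb(c, 2) for c in Counter(v).values())  # pairs together in V
    expected = sum_u * sum_v / comb(n, 2)
    maximum = (sum_u + sum_v) / 2
    if maximum == expected:      # degenerate partitions; assumed convention
        return 1.0
    return (index - expected) / (maximum - expected)

u = [0, 0, 0, 1, 1, 1]   # hypothetical partition U
v = [0, 0, 1, 1, 2, 2]   # hypothetical partition V
print(adjusted_rand_index(u, v))
```

For this example index = 2, expected index = 1.2 and maximum index = 4.5, so the ARI is 0.8 / 3.3, well below the unadjusted Rand index of the same pair of partitions.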

Subset of PCs

Motivated by Chang’s example, it is possible to find other subsets of PCs that preserve the cluster structure better than the first few PCs.

How?

--- The greedy approach.

--- The modified greedy approach.

The Greedy Approach

Let $m_0$ be the minimum number of PCs to be clustered, and p be the number of variables in the data.

1) Search for a set of $m_0$ PCs with maximum ARI, denoted as $s_{m_0}$.

2) For each m = $m_0$+1, …, p, add another PC to $s_{m-1}$ and calculate the ARI. The PC giving the maximum ARI is then added to get $s_m$.
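The two steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `score` is a hypothetical caller-supplied function that clusters the data projected on the given PC indices and returns the ARI against the external criterion, and step 1 is assumed to be an exhaustive search over all sets of m0 PCs. The toy score at the bottom is made up so that the example runs on its own.

```python
from itertools import combinations

def greedy_pc_subsets(p, m0, score):
    """Return {m: (subset, score)} for m = m0, ..., p (PC indices are 0-based)."""
    # Step 1: best starting set of m0 PCs (exhaustive search assumed).
    best = max(combinations(range(p), m0), key=score)
    result = {m0: (best, score(best))}
    # Step 2: repeatedly add the single PC that gives the highest score.
    for m in range(m0 + 1, p + 1):
        candidates = [tuple(sorted(best + (j,)))
                      for j in range(p) if j not in best]
        best = max(candidates, key=score)
        result[m] = (best, score(best))
    return result

# Hypothetical stand-in for "cluster on these PCs and compute the ARI".
weights = [0.1, 0.5, 0.05, 0.4, 0.2]

def toy_score(subset):
    return sum(weights[j] for j in subset) - 0.15 * len(subset)

result = greedy_pc_subsets(p=5, m0=2, score=toy_score)
for m, (subset, value) in result.items():
    print(m, subset, round(value, 3))
```

In a real run, each call to `score` involves one clustering of the n genes, so step 1 dominates the cost when the number of candidate starting sets is large.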

The Modified Greedy Approach

In each step of the greedy approach (# of PCs = m), retain the k best subsets of PCs for the next step (# of PCs = m+1).

If k=1, this is just the greedy approach.

k=3 in this paper.
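Retaining the k best subsets at each step amounts to a small beam search. The sketch below makes the same assumptions as a plain greedy search would: `score` is a hypothetical function returning the ARI for a tuple of PC indices, and the starting sets are ranked exhaustively. The toy score is constructed so that k=1 and k=3 give different answers.

```python
from itertools import combinations

def modified_greedy(p, m0, score, k=3):
    """Return {m: (best subset, score)}, keeping the k best subsets per step."""
    def rank(subsets):
        # Sort by descending score; ties are broken by the subset itself.
        return sorted(subsets, key=lambda s: (-score(s), s))[:k]

    beam = rank(combinations(range(p), m0))
    result = {m0: (beam[0], score(beam[0]))}
    for m in range(m0 + 1, p + 1):
        # Extend every retained subset by one unused PC; the set drops duplicates.
        candidates = {tuple(sorted(s + (j,)))
                      for s in beam for j in range(p) if j not in s}
        beam = rank(candidates)
        result[m] = (beam[0], score(beam[0]))
    return result

# Hypothetical scores: the best 3-PC set does not contain the best 2-PC set.
special = {(0, 1): 0.5, (2, 3): 0.45, (0, 2, 3): 0.9}

def toy_score(subset):
    return special.get(subset, 0.1 * len(subset))

print(modified_greedy(4, 2, toy_score, k=1)[3])  # plain greedy misses (0, 2, 3)
print(modified_greedy(4, 2, toy_score, k=3)[3])  # k=3 finds it via (2, 3)
```

With k=1 the search is locked into extensions of (0, 1); with k=3 the runner-up (2, 3) survives one more step and leads to the better 3-PC subset.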

The Scheme of the Study

Given a gene expression data set with n genes (subjects) and p experimental conditions (variables), apply a clustering algorithm to

1) the given data set, and compute the ARI with the external criterion.
2) the first m PCs, where m = $m_0$, …, p.
3) the subset of PCs found by the (modified) greedy approach.
4) 30 sets of random PCs.
5) 30 sets of random orthogonal projections.
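Items 4) and 5) can be sketched as below. The slides do not say how the random sets were generated, so both constructions are assumptions: random PCs are drawn as index subsets without replacement, and a random orthogonal projection is obtained from the QR decomposition of a Gaussian matrix. The matrix `X` is made-up toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pc_subset(p, m):
    """Indices of m of the p PCs, chosen uniformly at random without replacement."""
    return np.sort(rng.choice(p, size=m, replace=False))

def random_orthogonal_projection(X, m):
    """Project the rows of X onto m random orthonormal directions."""
    # QR of a p x m Gaussian matrix gives Q with orthonormal columns.
    Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1], m)))
    return X @ Q, Q

X = rng.normal(size=(10, 24))          # hypothetical data: 10 genes, 24 conditions
idx = random_pc_subset(24, 5)
Y, Q = random_orthogonal_projection(X, 5)
print(idx, Y.shape)
```

Repeating either call 30 times with different random draws would give the 30 sets mentioned in the list.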

Data Sets

“Class” refers to a group in the external criterion. “Cluster” refers to clusters obtained by a clustering algorithm.

There are two real data sets and three synthetic data sets in this study.

The Ovary Data

The data contains 235 clones and 24 tissue samples.

Of the 24 tissue samples, 7 are from normal tissues, 4 from blood samples, and 13 from ovarian cancers.

The 235 clones were found to correspond to four different genes (classes), with 58, 88, 57 and 32 clones respectively.

The data for each clone was normalized across the 24 experiments to have mean 0 and variance 1.
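The per-clone normalization can be sketched as a row-wise standardization. This assumes NumPy and the sample variance (divisor n-1); the slides do not say which variance estimator was used, and the matrix `X` is made-up toy data.

```python
import numpy as np

def standardize_rows(X):
    """Scale each row (clone) to mean 0 and sample variance 1 across columns."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)   # ddof=1 is an assumption
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(4, 24))  # hypothetical: 4 clones, 24 samples
Z = standardize_rows(X)
print(np.round(Z.mean(axis=1), 6), np.round(Z.var(axis=1, ddof=1), 6))
```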

The Yeast Cell Cycle Data

The data set shows the fluctuation of expression levels over two cell cycles.

380 genes were classified into five phases (classes).

The data for each gene was normalized to have mean 0 and variance 1 across each cell cycle.

Mixture of Normal Distributions on Ovary Data

For each gene (class), the sample covariance matrix and the mean vector are computed.

Sample (58, 88, 57, 32) clones from the multivariate normal distribution (MVN) of each class. 10 replicates.

It preserves the mean and covariance of the original data, but relies on the MVN assumption.
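A minimal sketch of this construction, assuming NumPy: for each class, estimate the mean vector and sample covariance matrix, then draw as many multivariate normal samples as the class has members. The toy data and class sizes below are made up; with the ovary data the class sizes would be 58, 88, 57 and 32.

```python
import numpy as np

def sample_mvn_mixture(X, labels, rng):
    """Draw one synthetic replicate with the same class sizes as X."""
    samples, classes = [], []
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean = Xc.mean(axis=0)                   # class mean vector
        cov = np.cov(Xc, rowvar=False)           # class sample covariance matrix
        samples.append(rng.multivariate_normal(mean, cov, size=len(Xc)))
        classes.append(np.full(len(Xc), c))
    return np.vstack(samples), np.concatenate(classes)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                     # hypothetical data
labels = np.repeat([0, 1, 2], [12, 10, 8])       # hypothetical class sizes
Xs, ys = sample_mvn_mixture(X, labels, rng)
print(Xs.shape, np.bincount(ys))
```

Calling the function 10 times would give the 10 replicates.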

Marginal Normality

Randomly Resample Ovary Data

For each class c (c=1,…,4) under experimental condition j (j=1,…,24), resample the expression levels with replacement. Retain the size of each class. 10 replicates.

No MVN assumption. Sampling independently for different experimental conditions appears reasonable on inspection of the data.
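The resampling scheme can be sketched as follows, assuming NumPy. Within each class, every column (experimental condition) is resampled with replacement independently of the other columns, so class sizes are retained but the dependence between conditions is not. The toy data are made up.

```python
import numpy as np

def resample_within_class(X, labels, rng):
    """Resample each (class, condition) block of values with replacement."""
    out = np.empty_like(X)
    for c in np.unique(labels):
        rows = np.flatnonzero(labels == c)       # members of class c
        for j in range(X.shape[1]):              # each condition independently
            out[rows, j] = rng.choice(X[rows, j], size=len(rows), replace=True)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                     # hypothetical data
labels = np.repeat([0, 1, 2], [12, 10, 8])       # hypothetical class sizes
R = resample_within_class(X, labels, rng)
print(R.shape)
```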

Cyclic Data

This data set models the cyclic behavior of genes over different time points.

The behavior of genes is modeled by the sine function.

A drawback of this model is the arbitrary choice of several parameters.

Clustering Algorithms and Similarity Metrics

Clustering algorithms:
Cluster affinity search technique (CAST)
Hierarchical average-link algorithm
K-means algorithm

Similarity metrics:
Euclidean distance ($m_0$=2)
Correlation coefficient ($m_0$=3)
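The two similarity metrics can be written out directly with the standard library. The example also illustrates one plausible reason, our reading rather than something stated on the slide, why the minimum number of PCs is 3 for the correlation coefficient: between any two 2-component vectors the correlation is always +1 or -1 (when it is defined), so it carries no information with only two components. The vectors are made up.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

print(euclidean([0.0, 0.0], [3.0, 4.0]))
print(correlation([1.0, 4.0], [2.5, 2.0]))            # two components: always +/-1
print(correlation([1.0, 4.0, 2.0], [2.5, 2.0, 3.0]))  # three components: informative
```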

Table 1

Table 2

One-sided Wilcoxon signed rank test. CAST always favors ‘no PCA’. The two significant results for PCA are not clear successes.

Conclusion

1) The quality of clustering results on the data after PCA is not necessarily higher than that on the original data, and is sometimes lower.

2) The first m PCs do not give the highest adjusted Rand index, i.e., another set of PCs gives a higher ARI.

Conclusion (Cont’d)

3) There are no clear trends regarding the choice of the optimal number of PCs over all the data sets, over all the clustering algorithms, and over the different similarity metrics. There is no obvious relationship between cluster quality and the number or set of PCs used.

Conclusion (Cont’d)

4) On average, the quality of clusters obtained by clustering random sets of PCs tends to be slightly lower than that obtained by clustering random sets of orthogonal projections, especially when the number of components is small.

Grand Conclusion

In general, we recommend AGAINST using PCA to reduce the dimensionality of the data before applying clustering algorithms unless external information is available.
