
Exploring Data using Dimension Reduction and Clustering

Naomi Altman

Nov. 06

Spellman Cell Cycle data

Yeast cells were synchronized by arrest of a cdc15 temperature-sensitive mutant.

Samples were taken every 10 minutes and one array was hybridized for each sample using a reference design. 2 complete cycles are in the data.

I downloaded the data and normalized using loess. (Print tip data were not available.)

I used the normalized value of M as the primary data.
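A minimal sketch of this normalization step in limma, assuming the two-color intensities have already been read into an RGList object called RG (a hypothetical name); with no print-tip information, only array-wide loess is possible:

library(limma)
MA=normalizeWithinArrays(RG,method="loess")  # array-wide loess, no print-tip groups
M.norm=MA$M                                  # normalized log-ratios M, one column per array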

What they did

Supervised dimension reduction = regression.

They were looking for genes that have cyclic behavior - i.e. a sine or cosine wave in time.

They regressed M for each gene on sine and cosine waves and selected genes for which the R2 was high.

The period of the wave was known (from observing the cells?), so they regressed against sin(wt) and cos(wt), where w is set to give the appropriate period.

If the period is unknown, a method called Fourier analysis can be used to discover it.
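As a sketch for a single gene, assuming m is its vector of normalized M values and time is the vector of sampling times (both placeholders here), with an assumed period chosen only for illustration:

period=80                      # assumed value; use the known cycle length
w=2*pi/period
fit=lm(m~sin(w*time)+cos(w*time))
summary(fit)$r.squared         # keep genes for which this R2 is high
# if the period were unknown, a periodogram (e.g. spec.pgram(m)) could suggest it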

Regression

Suppose we are looking for genes that are associated with a particular quantitative phenotype, or have a pattern that is known in advance.

E.g. Suppose we are interested in genes that change linearly with temperature and quadratically with pH.

Y = b0 + b1*Temp + b2*pH + b3*pH^2 + noise

We might fit this model for each gene (assuming that the arrays came from samples subjected to different levels of Temp and pH).

This is similar to differential expression analysis - we have a multiple comparisons problem.

Regression

We might compute an adjusted p-value, or goodness-of-fit statistic, to select genes based on the fit to a pattern.

If we have many "conditions" we do not need to replicate as much as in differential expression analysis because we consider any deviation from the "pattern" to be random variation.
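A hedged sketch of this kind of per-gene regression in limma, assuming an expression matrix expr (genes x arrays) and vectors Temp and pH describing the condition of each array (all hypothetical names):

library(limma)
design=model.matrix(~Temp+pH+I(pH^2))        # Y = b0 + b1*Temp + b2*pH + b3*pH^2
fit.tp=lmFit(expr,design)
# as in the eigengene analysis below, drop the intercept before the moderated F-test
contr=cbind(c(0,1,0,0),c(0,0,1,0),c(0,0,0,1))
efit.tp=eBayes(contrasts.fit(fit.tp,contr))
p.adj=p.adjust(efit.tp$F.p.value,method="BH")  # adjusted p-values for fit to the pattern
sig=which(p.adj<0.05)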

What I did

Unsupervised dimension reduction:

I used SVD on the 832 genes x 24 time points.

We can see that eigengene 5 has the cyclic genes.

For class

I extracted the 304 spots with variance greater than 0.25.

To my surprise, several of these were empty or control spots. I removed these.

This leaves 295 genes which are in yeast.txt.

Read these into R.


yeast=read.delim("yeast.txt",header=T)

time=c(10,30,50,10*(7:25),270,290)

M.yeast=as.matrix(yeast[,2:25]) # strip off the gene names and keep a numeric matrix

svd.m=svd(M.yeast) # svd

#scree plot

plot(1:24,svd.m$d)

par(mfrow=c(4,4)) # plot the first 16 "eigengenes"

for (i in 1:16) plot(time,svd.m$v[,i],main=paste("Eigen",i),type="l")

par(mfrow=c(1,1))

plot(time,svd.m$v[,1],type="l",ylim=c(min(svd.m$v),max(svd.m$v)))

for (i in 2:4) lines(time,svd.m$v[,i],col=i)

#It looks like "eigengenes" 2-4 have the periodic components.

# Reduce dimension by finding genes that are linear combinations

# of these 3 patterns by regression

# We can use limma to fit a regression to every gene and use e.g.

# the F or p-value to pick significant genes

library(limma)

design.reg=model.matrix(~svd.m$v[,2:4])

fit.reg=lmFit(M.yeast,design.reg)

# The "reduced dimension" version of the genes are the fitted

# values: b0+ b1v2 + b2v3 +b3v4 vi is the ith column of svd.m$v

# bi are the coefficients

# Let's look at gene 1 (not periodic) and genes 5, 6, 7
i=1   # also try i=5, 6, 7
plot(time,M.yeast[i,],type="l")
lines(time,fit.reg$coef[i,1]+fit.reg$coef[i,2]*svd.m$v[,2]+
  fit.reg$coef[i,3]*svd.m$v[,3]+fit.reg$coef[i,4]*svd.m$v[,4])

# Select the genes with a strong periodic component.
# We could use R2, but in limma it is simplest to compute the
# moderated F-test for regression and then use the p-values.
# Limma requires us to remove the intercept from the coefficients
# to get this test :(

contrast.matrix=cbind(c(0,1,0,0),c(0,0,1,0),c(0,0,0,1))

fit.contrast=contrasts.fit(fit.reg,contrast.matrix)

efit=eBayes(fit.contrast)

# We will use the Bonferroni method to pick a significance level

# a=0.05/#genes = 0.00017

sigGenes=which(efit$F.p.value<0.00017)

# plot a few of these genes
# You might also want to plot a few genes with p-value > 0.5
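One way to do the suggested plots, reusing the fitted values from fit.reg (showing the first four significant genes is an arbitrary choice):

par(mfrow=c(2,2))
for (i in sigGenes[1:4]) {
  plot(time,M.yeast[i,],type="l",main=paste("gene",i))
  lines(time,fit.reg$coef[i,1]+fit.reg$coef[i,2]*svd.m$v[,2]+
    fit.reg$coef[i,3]*svd.m$v[,3]+fit.reg$coef[i,4]*svd.m$v[,4],col=2)
}
par(mfrow=c(1,1))
notSig=which(efit$F.p.value>0.5)   # genes with little periodic signal, for comparison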

Note that we used the normalized but uncentered unscaled data for this exercise.

Things might look very different if the data were transformed.

Clustering

We might ask which genes have similar expression patterns.

Once we have expressed (dis)similarity as a distance measure, we can use this measure to cluster genes that are similar.

There are many methods. We will discuss 2: hierarchical clustering and k-means clustering.


Hierarchical Clustering (agglomerative)

1. Choose a distance function for points d(x1,x2).
2. Choose a distance function for clusters D(C1,C2) (for clusters formed by just one point, D reduces to d).
3. Start from N clusters, each containing one data point. At each iteration:
   a) Using the current matrix of cluster distances, find the two closest clusters.
   b) Update the list of clusters by merging the two closest.
   c) Update the matrix of cluster distances accordingly.
4. Repeat until all data points are joined in one cluster.

Remarks:
1. The method is sensitive to anomalous data points/outliers.
2. Mergers are irreversible: "bad" mergers occurring early on affect the structure of the nested sequence.
3. If two pairs of clusters are equally (and maximally) close at a given iteration, we have to choose arbitrarily; the choice will affect the structure of the nested sequence.

F. Chiaromonte Sp 06

Defining cluster distance: the linkage function

D(C1,C2) is a function f of the point distances { d(x1i,x2j) }, for x1i in C1 and x2j in C2:

1. Single linkage (string-like, long clusters): f = min
2. Complete linkage (ball-like, compact clusters): f = max
3. Average linkage: f = average
4. Centroid linkage: d( ave(x1i), ave(x2j) )

Single and complete linkage produce nested sequences invariant under monotone transformations of d - this is not the case for average linkage. However, average linkage is a compromise between the "long", "stringy" clusters produced by single linkage and the "round", "compact" clusters produced by complete linkage.

Example

Agglomeration step in constructing the nested sequence (first iteration):

1. 3 and 5 are the closest, and are therefore merged in cluster "35".
2. A new distance matrix is computed with complete linkage.

Ordinate: distance, or height, at which each merger occurred. Horizontal ordering of the data points is any order preventing intersections of branches.


[Figure: dendrograms of the example under single linkage and complete linkage]
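The toy distance matrix behind this example is not reproduced in the transcript; as a sketch, a small made-up data set shows the same single vs. complete linkage comparison:

set.seed(1)
x=matrix(rnorm(10),ncol=2)          # 5 made-up points in 2 dimensions
d=dist(x)
par(mfrow=c(1,2))
plot(hclust(d,method="single"),main="single linkage")
plot(hclust(d,method="complete"),main="complete linkage")
par(mfrow=c(1,1))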

Hierarchical Clustering

Hierarchical clustering, per se, does not dictate a partition and a number of clusters.

It provides a nested sequence of partitions (this is more informative than just one partition).

To settle on one partition, we have to “cut” the dendrogram.

Usually we pick a height and cut there - but the most informative cuts are often at different heights for different branches.


hc.single=hclust(dist(M.yeast),method="single")
plot(hc.single)
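To settle on one partition as described above, cutree can cut the tree either into a chosen number of clusters or at a chosen height (the values 6 and 0.5 here are arbitrary illustration choices):

groups.k=cutree(hc.single,k=6)     # cut into 6 clusters
groups.h=cutree(hc.single,h=0.5)   # or cut at height 0.5
table(groups.k)                    # cluster sizes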

Partitioning algorithms: K-means.

1. Choose a distance function for points d(xi,xj).

2. Choose K = number of clusters.

3. Initialize the K cluster centroids (with points chosen at random).

4. Use the data to iteratively relocate centroids, and reallocate points to closest centroid.

At each iteration:

a) Compute distance of each data point from each current centroid.

b) Update current cluster membership of each data point, selecting the centroid to which the point is closest.

c) Update current centroids, as averages of the new clusters formed in b).

5. Repeat until cluster memberships, and thus centroids, stop changing.

Remarks:
1. This method is sensitive to anomalous data points/outliers.
2. Points can move from one cluster to another, but the final solution depends strongly on centroid initialization (so we usually restart several times to check; see the sketch after these remarks).
3. If two centroids are equally (and maximally) close to an observation at a given iteration, we have to choose arbitrarily (the problem here is not so serious because points can move later).
4. There are several "variants" of the k-means algorithm, using e.g. the median.
5. K-means converges to a local minimum of the total within-cluster square distance (total within-cluster sum of squares) - not necessarily a global one.
6. Clusters tend to be ball-shaped with respect to the chosen distance.
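A small sketch of the multiple-restart check mentioned in remarks 2 and 5, using the nstart argument of kmeans (25 restarts is an arbitrary choice):

# kmeans keeps the restart with the smallest total within-cluster sum of squares
k.multi=kmeans(M.yeast,centers=6,nstart=25)
k.multi$tot.withinss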

Starting from the arbitrarily chosen open rectangles:
1. Assign every data value to a cluster defined by the nearest centroid.
2. Recompute the centroids based on the most current clustering.
3. Reassign data values to clusters and repeat.

Remarks:

The algorithm does not indicate how to pick K.

To change K, redo the partitioning. The clusters are not necessarily nested.

Here is the yeast data (4 runs). To display the clusters, we often use the main eigendirections (svd.m$u).

These do show that much of the clustering is defined by these 2 directions, but it is not clear that there really are clusters.

[Figure: k-means results plotted in the first two eigendirections, with 6 clusters and with 4 clusters]

k.out=kmeans(M.yeast,centers=6)
plot(svd.m$u[,1],svd.m$u[,2],col=k.out$cluster)
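A sketch reproducing the two panels of the figure (6 and 4 clusters) side by side; cluster labels and colors will differ between runs because of the random initialization:

par(mfrow=c(1,2))
for (k in c(6,4)) {
  k.run=kmeans(M.yeast,centers=k)
  plot(svd.m$u[,1],svd.m$u[,2],col=k.run$cluster,main=paste(k,"clusters"))
}
par(mfrow=c(1,1))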

Other partitioning methods

1. Partitioning around medoids (PAM): instead of averages, use multidimensional medians as centroids (cluster "prototypes"). Dudoit and Fridlyand (2002). (See the sketch after this list.)

2. Self-organizing maps (SOM): add an underlying "topology" (neighboring structure on a lattice) that relates cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999).

3. Fuzzy k-means: allow for a "gradation" of points between clusters; soft partitions. Gasch and Eisen (2002).

4. Mixture-based clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides soft partitioning, and allows for modeling of cluster centroids and shapes. Yeung et al. (2001), McLachlan et al. (2002).
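As an illustration of option 1, PAM is implemented in the R cluster package; k=6 here is an arbitrary choice:

library(cluster)
pam.out=pam(M.yeast,k=6)           # partitioning around medoids
table(pam.out$clustering)          # cluster sizes
plot(svd.m$u[,1],svd.m$u[,2],col=pam.out$clustering)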


Assessing the Clusters Computationally

The bottom line is that the clustering is "good" if it is biologically meaningful (but this is hard to assess).

Computationally we can:

1) Use a goodness-of-cluster measure, such as the within-cluster distances compared to the between-cluster distances (see the sketch after this list).

2) Perturb the data and assess cluster changes:

a) add noise (maybe residuals after ANOVA)

b) resample (genes, arrays)
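A sketch of 1) using silhouette widths from the cluster package and of 2a) adding noise, with k.out from the k-means run above; the noise standard deviation of 0.1 is an arbitrary choice:

library(cluster)
d.yeast=dist(M.yeast)
sil=silhouette(k.out$cluster,d.yeast)
summary(sil)                       # average silhouette width per cluster
plot(sil)                          # values near 1 = well separated; near 0 = borderline

# 2a) perturb with noise and see how much the cluster memberships change
noise=matrix(rnorm(length(M.yeast),sd=0.1),nrow=nrow(M.yeast))
k.noisy=kmeans(M.yeast+noise,centers=6)
table(k.out$cluster,k.noisy$cluster)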