
Homework 4

36-350: Data Mining

SOLUTIONS

I loaded the data and transposed it thus:

library(ElemStatLearn)
data(nci)
nci.t = t(nci)

This means that rownames(nci.t) contains the cell classes, or equivalently colnames(nci). I’ll use that a lot, so I’ll abbreviate it: nci.classes = rownames(nci.t).

1. Similarity searching for cells

(a) How many features are there in the raw data? Are they discrete or continuous? (If there are some of each, which is which?) If some are continuous, are they necessarily positive?
Answer: This depends on whether the cell classes are counted as features or not; you could defend either answer. If you don’t count the cell class, there are 6830 continuous features. Since they are logarithms, they do not have to be positive, and indeed they are not. If you do count the class, there are 6831 features, one of which is discrete (the class). I am inclined not to count the class as really being a feature, since when we get new data it may not have the label.

(b) Describe a method for finding the k cells whose expression profiles are most similar to that of a given cell. Carefully explain what you mean by “similar”. How could you evaluate the success of the search algorithm?
Answer: The simplest thing is to call two cells similar if their vectors of expression levels have a small Euclidean distance. The most similar cells are the ones whose expression-profile vectors are closest, geometrically, to the target expression-profile vector.
There are several ways the search could be evaluated. One would be to use the given cell labels as an indication of whether the search results are “relevant”. That is, one would look at the k closest matches and see (a) what fraction of them are of the same type as the target, which is the precision, and (b) what fraction of cells of that type are in the search results, which is the recall.


Plotting the precision and recall against k gives an indication of how efficient the search is. This could then be averaged over all the labeled cells.
Normalizing the vectors by the Euclidean length is possible here, though a little tricky because the features are the logarithms of the concentration levels, so some of them can be negative numbers, meaning that very little of that gene is being expressed. Also, the reason for normalizing bag-of-words vectors by length was the idea that two documents might have similar meanings or subjects even if they had different sizes. A cell which is expressing all the genes at a low level, however, might well be very different from one which is expressing the same genes at a high level.
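To make this concrete, here is a minimal sketch (not part of the original solution) of such a search and its evaluation; the helper names nearest.cells and precision.recall are made up for this illustration, and Euclidean distance on the rows of nci.t is assumed.

# Sketch: k most similar cells by Euclidean distance, plus the
# precision and recall of the results judged by the known classes.
# (Illustrative helpers, not from the original solutions.)
nearest.cells <- function(data, target, k) {
  # Euclidean distance from the target row to every row
  d = sqrt(rowSums((data - matrix(data[target,], nrow(data),
                                  ncol(data), byrow=TRUE))^2))
  d[target] = Inf    # don't count the target as its own neighbor
  return(order(d)[1:k])
}

precision.recall <- function(data, classes, target, k) {
  hits = nearest.cells(data, target, k)
  relevant = (classes[hits] == classes[target])
  precision = mean(relevant)
  recall = sum(relevant)/(sum(classes == classes[target]) - 1)
  return(c(precision=precision, recall=recall))
}

# e.g. precision.recall(nci.t, nci.classes, target=1, k=5)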

(c) Would it make sense to weight genes by “inverse cell frequency”? Explain.
Answer: No; or at least, I don’t see how. Every gene is expressed to some degree by every cell (otherwise the logarithm of its concentration would be −∞), so every gene would have an inverse cell frequency of zero. You could fix a threshold and just count whether the expression level is over the threshold, but this would be pretty arbitrary.
However, a related idea would be to scale genes’ expression levels by something which indicates how much they vary across cells, such as the range or the variance. Since we have labeled cell types here, we could even compute the average expression level for each type and then scale genes by the variance of these type-averages. That is, genes with a small variance (or range, etc.) should get small weights, since they are presumably uninformative, and genes with a large variance in expression level should get larger weights.
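For example, one reading of this weighting scheme (an illustration, not from the original solution; the names gene.sd and nci.weighted are made up) is to multiply each gene’s column by the standard deviation of its expression level across cells:

# Up-weight genes that vary a lot across cells, down-weight the rest.
gene.sd = apply(nci.t, 2, sd)                  # one weight per gene
nci.weighted = sweep(nci.t, 2, gene.sd, "*")   # rescale each column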


2. k-means clustering

Use the kmeans function in R to cluster the cells, with k = 14 (to match the number of true classes). Repeat this three times to get three different clusterings.

Answer:

clusters.1 = kmeans(nci.t,centers=14)$cluster
clusters.2 = kmeans(nci.t,centers=14)$cluster
clusters.3 = kmeans(nci.t,centers=14)$cluster

I’m only keeping the cluster assignments, because that’s all this problem calls for.
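(A side note, not in the original write-up: kmeans starts from randomly chosen centers, so each run can give a different clustering. If you want the three runs to be reproducible, you could seed the random number generator before each call; the seed values below are arbitrary.)

set.seed(101); clusters.1 = kmeans(nci.t,centers=14)$cluster
set.seed(102); clusters.2 = kmeans(nci.t,centers=14)$cluster
set.seed(103); clusters.3 = kmeans(nci.t,centers=14)$cluster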

(a) Say that k-means makes a lumping error whenever it assigns two cells of different classes to the same cluster, and a splitting error when it puts two cells of the same class in different clusters. Each pair of cells can give an error. (With n objects, there are n(n-1)/2 distinct pairs.)

i. Write a function which takes as inputs a vector of classes and a vector of clusters, and gives as output the number of lumping errors. Test your function by verifying that when the classes are (1, 2, 2), the three clusterings (1, 2, 2), (2, 1, 1) and (4, 5, 6) all give zero lumping errors, but the clustering (1, 1, 1) gives two lumping errors.
Answer: The most brute-force way (Code Example 1) uses nested for loops. This works, but it’s not the nicest approach: (1) for loops, let alone nested loops, are extra slow and inefficient in R; (2) we only care about pairs which are in the same cluster, yet we waste time checking all pairs; (3) it’s a bit obscure what we’re really doing.
Code Example 2 is faster, cleaner, and more R-ish, using the utility function outer, which (rapidly!) applies a function to pairs of objects constructed from its first two arguments, returning a matrix. Here we’re seeing which pairs of objects belong to different classes, and which to the same clusters. (The solutions to the first homework include a version of this trick.) We restrict ourselves to pairs in the same cluster and count how many are of different classes. We divide by two because outer produces rectangular matrices and works with ordered pairs, so (i, j) ≠ (j, i), but the problem asks for unordered pairs.

> count.lumping.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.lumping.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.lumping.errors(c(1,2,2),c(4,5,6))
[1] 0


# Count number of errors made by lumping different classes into
# one cluster
# Uses explicit, nested iteration

# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to different classes
#   lumped into one cluster

count.lumping.errors.1 <- function(true.class,guessed.cluster) {
  n = length(true.class)
  # Check that the two vectors have the same length!
  stopifnot(n == length(guessed.cluster))
  # Note: don’t require them to have same data type
  # Initialize counter
  num.lumping.errors = 0
  # check all distinct pairs
  for (i in 1:(n-1)) {
    # think of above-diagonal entries in a matrix
    for (j in (i+1):n) {
      if ((true.class[i] != true.class[j])
          & (guessed.cluster[i] == guessed.cluster[j])) {
        # lumping error: really different but put in same cluster
        num.lumping.errors = num.lumping.errors + 1
      }
    }
  }
  return(num.lumping.errors)
}

Code Example 1: The first way to count lumping errors.

# Count number of errors made by lumping different classes into
# one cluster
# Vectorized

# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to different classes
#   lumped into one cluster

count.lumping.errors <- function(true.class,guessed.cluster) {
  n = length(true.class)
  stopifnot(n == length(guessed.cluster))
  different.classes = outer(true.class, true.class, "!=")
  same.clusters = outer(guessed.cluster, guessed.cluster, "==")
  num.lumping.errors = sum(different.classes[same.clusters])/2
  return(num.lumping.errors)
}

Code Example 2: Second approach to counting lumping errors.


# Count number of errors made by splitting a single class into
# clusters

# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to the same class
#   split into different clusters

count.splitting.errors <- function(true.class,guessed.cluster) {
  n = length(true.class)
  stopifnot(n == length(guessed.cluster))
  same.classes = outer(true.class, true.class, "==")
  different.clusters = outer(guessed.cluster, guessed.cluster, "!=")
  num.splitting.errors = sum(different.clusters[same.classes])/2
  return(num.splitting.errors)
}

Code Example 3: Counting splitting errors.

> count.lumping.errors(c(1,2,2),c(1,1,1))
[1] 2

ii. Write a function, with the same inputs as the previous one, that calculates the number of splitting errors. Test it on the same inputs. (What should the outputs be?)
Answer: A simple modification of the code from the last part (Code Example 3). Before we can test this, we need to know what the test results should be. The classes (1, 2, 2) and the clustering (1, 2, 2) are identical, so obviously there will be no splitting errors. Likewise for comparing (1, 2, 2) and (2, 1, 1). Comparing (1, 2, 2) and (4, 5, 6), the pair of the second and third items are in the same class but split into two clusters — one splitting error. On the other hand, comparing (1, 2, 2) and (1, 1, 1), there are no splitting errors, because everything is put in the same cluster.

> count.splitting.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.splitting.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.splitting.errors(c(1,2,2),c(4,5,6))
[1] 1
> count.splitting.errors(c(1,2,2),c(1,1,1))
[1] 0

iii. How many lumping errors does k-means make on each of your three runs? How many splitting errors?

> count.lumping.errors(nci.classes,clusters.1)
[1] 79
> count.lumping.errors(nci.classes,clusters.2)
[1] 96
> count.lumping.errors(nci.classes,clusters.3)
[1] 110


> count.splitting.errors(nci.classes,clusters.1)
[1] 97
> count.splitting.errors(nci.classes,clusters.2)
[1] 97
> count.splitting.errors(nci.classes,clusters.3)
[1] 106

For comparison, there are 64 × 63/2 = 2016 distinct pairs, so the error rate here is pretty good: each kind of error afflicts only around 5% of all pairs.

(b) Are there any classes which seem particularly hard for k-means to pick up?
Answer: Well-argued qualitative answers are fine, but there are ways of being quantitative. One is the entropy of the cluster assignments for the classes. Start with a confusion matrix:

> confusion = table(nci.classes,clusters.1)
> confusion

            clusters.1
              1 2 3 4 5 6 7 8 9 10 11 12 13 14
  BREAST      2 0 0 0 2 0 0 0 0  0  0  2  1  0
  CNS         0 0 0 0 3 0 0 0 2  0  0  0  0  0
  COLON       0 0 1 0 0 0 0 6 0  0  0  0  0  0
  K562A-repro 0 1 0 0 0 0 0 0 0  0  0  0  0  0
  K562B-repro 0 1 0 0 0 0 0 0 0  0  0  0  0  0
  LEUKEMIA    0 1 0 0 0 0 0 0 0  0  0  0  0  5
  MCF7A-repro 0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MCF7D-repro 0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MELANOMA    7 0 0 1 0 0 0 0 0  0  0  0  0  0
  NSCLC       0 0 4 1 0 0 2 0 0  0  1  0  1  0
  OVARIAN     0 0 3 1 0 0 0 0 0  2  0  0  0  0
  PROSTATE    0 0 2 0 0 0 0 0 0  0  0  0  0  0
  RENAL       0 0 0 1 1 7 0 0 0  0  0  0  0  0
  UNKNOWN     0 0 0 1 0 0 0 0 0  0  0  0  0  0

> signif(apply(confusion,1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.97        0.59        0.00        0.00        0.65
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
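(The entropy function used here comes from the lecture code and is not reproduced in these solutions; as a reminder of what it is assumed to do, a minimal sketch, taking a vector of counts and returning the Shannon entropy in bits:)

# Sketch of the assumed entropy helper: Shannon entropy, in bits,
# of the distribution given by a vector of counts.
entropy <- function(counts) {
  p = counts/sum(counts)   # turn counts into probabilities
  p = p[p > 0]             # 0 log 0 = 0 by convention
  return(-sum(p*log2(p)))
}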

Of course that’s just one k-means run. Here are a few more.

> signif(apply(table(nci.classes,clusters.2),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.00        0.59        0.00        0.00        1.50
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00

> signif(apply(table(nci.classes,clusters.3),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        1.50        0.59        0.00        0.00        1.90
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        1.10        1.80        0.65        0.00
      RENAL     UNKNOWN
       0.99        0.00

BREAST and NSCLC have consistently high cluster entropy, indicating that cells of these types tend not to be clustered together.

(c) Are there any pairs of cells which are always clustered together, and if so, are they of the same class?
Answer: Cells number 1 and 2 are always clustered together in my three runs, and are CNS tumors. However, cells 3 and 4 are always clustered together, too, and one is CNS and one is RENAL.

(d) Variation-of-information metric

i. Calculate, by hand, the variation-of-information distance between the partition (1, 2, 2) and (2, 1, 1); between (1, 2, 2) and (2, 2, 1); and between (1, 2, 2) and (5, 8, 11).
Answer: Write X for the cell of the first partition in each pair and Y for the second.
In the first case, when X = 1, Y = 2 and vice versa. So H[X|Y] = 0, H[Y|X] = 0, and the distance between the partitions is also zero. (The two partitions are the same, they’ve just swapped the labels on the cells.)
For the second pair, we have Y = 2 when X = 1, so H[Y|X=1] = 0, but Y = 1 or 2 equally often when X = 2, so H[Y|X=2] = 1. Thus H[Y|X] = (0)(1/3) + (1)(2/3) = 2/3. Symmetrically, X = 2 when Y = 1, so H[X|Y=1] = 0, but X = 1 or 2 equiprobably when Y = 2, so again H[X|Y] = 2/3, and the distance is 4/3.
In the final case, H[X|Y] = 0, because for each Y there is a single value of X, but H[Y|X] = 2/3, as in the previous case. So the distance is 2/3.

ii. Write a function which takes as inputs two vectors of class or cluster assignments, and returns as output the variation-of-information distance for the two partitions. Test the function by checking that it matches your answers in the previous part.
Answer: The most straightforward way is to make the contingency table or confusion matrix for the two assignment vectors, calculate the two conditional entropies from that, and add them. But I already have a function to calculate mutual information,


# Variation-of-information distance between two partitions
# Input: two vectors of partition-cell assignments of equal length
# Calls: entropy, word.mutual.info from lecture-05.R
# Output: variation of information distance
variation.of.info <- function(clustering1, clustering2) {
  # Check that we’re comparing equal numbers of objects
  stopifnot(length(clustering1) == length(clustering2))
  # How much information do the partitions give about each other?
  mi = word.mutual.info(table(clustering1, clustering2))
  # Marginal entropies
  # entropy function takes a vector of counts, not assignments,
  # so call table first
  h1 = entropy(table(clustering1))
  h2 = entropy(table(clustering2))
  v.o.i. = h1 + h2 - 2*mi
  return(v.o.i.)
}

Code Example 4: Calculating variation-of-information distance between partitions, based on vectors of partition-cell assignments.

so I’ll re-use that via a little algebra:

I[X;Y] ≡ H[X] - H[X|Y]
H[X|Y] = H[X] - I[X;Y]

H[X|Y] + H[Y|X] = H[X] + H[Y] - 2 I[X;Y]

implemented in Code Example 4. Let’s check this out on the test cases.

> variation.of.info(c(1,2,2),c(2,1,1))
[1] 0
> variation.of.info(c(1,2,2),c(2,2,1))
[1] 1.333333
> variation.of.info(c(1,2,2),c(5,8,11))
[1] 0.6666667
> variation.of.info(c(5,8,11),c(1,2,2))
[1] 0.6666667

I threw in the last one to make sure the function is symmetric in its two arguments, as it should be.
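(Similarly, word.mutual.info comes from lecture-05.R and is not reproduced here; a minimal sketch of what it is assumed to compute, the mutual information in bits from a contingency table of counts:)

# Sketch of the assumed word.mutual.info helper: mutual information,
# in bits, of the joint distribution given by a table of counts.
word.mutual.info <- function(counts) {
  joint = counts/sum(counts)                     # joint probabilities
  indep = outer(rowSums(joint), colSums(joint))  # product of marginals
  nonzero = (joint > 0)                          # 0 log 0 = 0
  return(sum(joint[nonzero]*log2(joint[nonzero]/indep[nonzero])))
}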

iii. Calculate the distances between the k-means clusterings and the true classes, and between the k-means clusterings and each other.
Answer:

> variation.of.info(nci.classes,clusters.1)
[1] 1.961255
> variation.of.info(nci.classes,clusters.2)
[1] 2.026013


> variation.of.info(nci.classes,clusters.3)
[1] 2.288418
> variation.of.info(clusters.1,clusters.2)
[1] 0.5167703
> variation.of.info(clusters.1,clusters.3)
[1] 0.8577165
> variation.of.info(clusters.2,clusters.3)
[1] 0.9224744

The k-means clusterings are all roughly the same distance to the true classes (two bits or so), and noticeably closer to each other than to the true classes (around three-quarters of a bit).


3. Hierarchical clustering

(a) Produce dendrograms using Ward’s method, single-link clustering and complete-link clustering. Include both the commands you used and the resulting figures in your write-up. Make sure the figures are legible. (Try the cex=0.5 option to plot.)
Answer: hclust needs a matrix of distances, handily produced by the dist function.

nci.dist = dist(nci.t)

plot(hclust(nci.dist,method="ward"),cex=0.5,xlab="Cells",
  main="Hierarchical clustering by Ward’s method")

plot(hclust(nci.dist,method="single"),cex=0.5,xlab="Cells",
  main="Hierarchical clustering by single-link method")

plot(hclust(nci.dist,method="complete"),cex=0.5,xlab="Cells",
  main="Hierarchical clustering by complete-link method")

producing Figures 1, 2 and 3.

(b) Which cell classes seem to be best captured by each clustering method? Explain.
Answer: Ward’s method does a good job of grouping together COLON, MELANOMA and RENAL; LEUKEMIA is pretty good too but mixed up with some of the lab cell lines. Single-link is good with RENAL and decent with MELANOMA (though confused with BREAST). There are little sub-trees of cells of the same type, like COLON or CNS, but mixed together with others. The complete-link method has sub-trees for COLON, LEUKEMIA, MELANOMA, and RENAL, which look pretty much as good as Ward’s method.

(c) Which method best recovers the cell classes? Explain.
Answer: Ward’s method looks better — scanning down the figure, a lot more of the adjacent cells are of the same type. Since the dendrogram puts items in the same sub-cluster together, this suggests that the clustering more nearly corresponds to the known cell types.
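(One way to make this comparison quantitative, not asked for in the problem and not part of the original solution: cut each tree into 14 clusters with cutree and measure its variation-of-information distance to the true classes, re-using the function from problem 2.)

# Hypothetical check: 14-cluster cut of each tree vs. the true classes.
for (m in c("ward","single","complete")) {
  tree = hclust(nci.dist, method=m)
  cat(m, variation.of.info(nci.classes, cutree(tree, k=14)), "\n")
}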

(d) The hclust command returns an object whose height attribute is the sum of the within-cluster sums of squares. How many clusters does this suggest we should use, according to Ward’s method? Explain. (You may find diff helpful.)
Answer:

> nci.wards = hclust(nci.dist,method="ward")
> length(nci.wards$height)
[1] 63
> nci.wards$height[1]
[1] 38.23033

There are 63 “heights”, corresponding to the 63 joinings, i.e., not including the height (sum-of-squares) of zero when we have 64 clusters, one for each data point.



Figure 1: Hierarchical clustering by Ward’s method.



Figure 2: Hierarchical clustering by the single-link method.



Figure 3: Hierarchical clustering by the complete-link method.



Figure 4: Merging costs in Ward’s method.

Since the merging costs are differences, we want to add an initial 0 to the sequence:

merging.costs = diff(c(0,nci.wards$height))
plot(64:2,merging.costs,xlab="Clusters remaining",ylab="Merging cost")

The function diff takes the difference between successive components in a vector, which is what we want here (Figure 4). The heuristic is to stop joining clusters when the cost of doing so goes up a lot; this suggests using only 10 clusters, since going from 10 clusters to 9 is very expensive. (Alternately, use 63 clusters; but that would be silly.)
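(Not part of the original solution: if you want the actual 10-cluster assignment this heuristic points to, cutree will extract it from the fitted tree, and a confusion matrix against the known classes shows how the cut lines up with the cell types.)

# Cut the Ward's-method tree into the suggested 10 clusters.
wards.10 = cutree(nci.wards, k=10)
table(nci.classes, wards.10)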

(e) Suppose you did not know the cell classes. Can you think of any reason to prefer one clustering method over another here, based on their outputs and the rest of what you know about the problem?


Answer: I can’t think of anything very compelling. Ward’s method has a better sum-of-squares, but of course we don’t know in advance that sum-of-squares picks out differences between cancer types with any biological or medical importance.


4. (a) Use prcomp to find the variances (not the standard deviations) associated with the principal components. Include a print-out of these variances, to exactly two significant digits, in your write-up.
Answer:

> nci.pca = prcomp(nci.t)
> nci.variances = (nci.pca$sdev)^2
> signif(nci.variances,2)
 [1] 6.3e+02 3.5e+02 2.8e+02 1.8e+02 1.6e+02 1.5e+02 1.2e+02 1.2e+02
 [9] 1.1e+02 9.2e+01 8.9e+01 8.5e+01 7.8e+01 7.5e+01 7.1e+01 6.8e+01
[17] 6.7e+01 6.2e+01 6.2e+01 6.0e+01 5.8e+01 5.5e+01 5.4e+01 5.0e+01
[25] 5.0e+01 4.7e+01 4.5e+01 4.4e+01 4.3e+01 4.2e+01 4.1e+01 4.0e+01
[33] 3.7e+01 3.7e+01 3.6e+01 3.5e+01 3.4e+01 3.4e+01 3.2e+01 3.2e+01
[41] 3.1e+01 3.0e+01 3.0e+01 2.9e+01 2.8e+01 2.7e+01 2.6e+01 2.5e+01
[49] 2.3e+01 2.3e+01 2.2e+01 2.1e+01 2.0e+01 1.9e+01 1.8e+01 1.8e+01
[57] 1.7e+01 1.5e+01 1.4e+01 1.2e+01 1.0e+01 9.9e+00 8.9e+00 1.7e-28

(b) Why are there only 64 principal components, rather than 6830? (Hint: Read the lecture notes.)
Answer: Because, when n < p, it is necessarily true that the data lie on an n-dimensional subspace of the p-dimensional feature space, so there are only n orthogonal directions along which the data can vary at all. (Actually, one of the variances is very, very small because any 64 points lie on a 63-dimensional surface, so we really only need 63 directions, and the last variance is within numerical error of zero.) Alternately, we can imagine that there are 6830 − 64 = 6766 other principal components, each with zero variance.

(c) Plot the fraction of the total variance retained by the first q components against q. Include both a print-out of the plot and the commands you used.
Answer:

plot(cumsum(nci.variances)/sum(nci.variances),xlab="q",
  ylab="Fraction of variance",
  main = "Fraction of variance retained by first q components")

producing Figure 5. The handy cumsum function takes a vector and returns the cumulative sums of its components (i.e., another vector of the same length). (The same effect can be achieved through lapply or sapply, or through an explicit loop, etc.)
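For instance, the sapply version of the same calculation (just as an illustration; cum.var is a made-up name) would be:

# Same as cumsum(nci.variances): for each q, sum the first q variances.
cum.var = sapply(1:length(nci.variances),
                 function(q) sum(nci.variances[1:q]))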

(d) Roughly how much variance is retained by the first two principal components?
Answer: From the figure, about 25%. More exactly:

> sum(nci.variances[1:2])
[1] 986.1434
> sum(nci.variances[1:2])/sum(nci.variances)
[1] 0.2319364



Figure 5: R^2 vs. q (fraction of variance retained by the first q components).


so 23% of the variance.

(e) Roughly how many components must be used to keep half of the variance? To keep nine-tenths?
Answer: From the figure, about 10 components keep half the variance, and about 40 components keep nine-tenths. More exactly:

> min(which(cumsum(nci.variances) > 0.5*sum(nci.variances)))
[1] 10
> min(which(cumsum(nci.variances) > 0.9*sum(nci.variances)))
[1] 42

so we need 10 and 42 components, respectively.

(f) Is there any point to using more than 50 components? (Explain.)
Answer: There’s very little point from the point of view of capturing variance. The error in reconstructing the expression levels from using 50 components is already extremely small. However, there’s no guarantee that, in some other application, the information carried by the 55th component (say) wouldn’t be crucial.
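(As a quick check of this claim, not in the original solution, the fraction of total variance left out by the first 50 components can be computed directly:)

# Fraction of the total variance NOT captured by the first 50 components
1 - cumsum(nci.variances)[50]/sum(nci.variances)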

(g) How much confidence should you have in results from visualizing the first two principal components? Why?
Answer: Very little; the first two components represent only a small part of the total variance.



Figure 6: PC projections of the cells, labeled by cell type.

5. (a) Plot the projection of each cell on to the first two principal components. Label each cell by its type. Hint: See the solutions to the first homework.
Answer: Take the hint!

plot(nci.pca$x[,1:2],type="n")
text(nci.pca$x[,1:2],nci.classes, cex=0.5)

Figure 6 is the result.

(b) One tumor class (at least) forms a cluster in the projection. Say which, and explain your answer.
Answer: MELANOMA; most of the cells cluster in the bottom-center of the figure (PC1 between about 0 and −20, PC2 around −40), with only a few non-MELANOMA cells nearby, and a big gap to the rest of the data.


One could also argue for LEUKEMIA in the center left, CNS in the upper left, RENAL above it, and COLON in the top right.

(c) Identify a tumor class which does not form a compact cluster.
Answer: Most of them don’t. What I had in mind, though, was BREAST, which is very widely spread.

(d) Of the two classes of tumors you have just named, which will be more easily classified with the prototype method? With the nearest neighbor method?
Answer: BREAST will be badly classified using the prototype method. The prototype will be around the middle of the plot, where there are no breast-cancer cells. Nearest-neighbors can hardly do worse. On the other hand, MELANOMA should work better with the prototype method, because it forms a compact blob.


6. Combining dimensionality reduction and clustering

(a) Run k-means with k = 14 on the projections of the cells on to the first two principal components. Give the commands you used to do this, and get three clusterings from three different runs of k-means.

pc.clusters.1 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.2 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.3 = kmeans(nci.pca$x[,1:2],centers=14)$cluster

(b) Calculate the number of errors, as with k-means clustering based on all genes.
Answer:

> count.lumping.errors(nci.classes,pc.clusters.1)
[1] 96
> count.lumping.errors(nci.classes,pc.clusters.2)
[1] 92
> count.lumping.errors(nci.classes,pc.clusters.3)
[1] 111
> count.splitting.errors(nci.classes,pc.clusters.1)
[1] 126
> count.splitting.errors(nci.classes,pc.clusters.2)
[1] 128
> count.splitting.errors(nci.classes,pc.clusters.3)
[1] 134

The number of lumping errors has increased a bit, though not very much. The number of splitting errors has gone up more, by roughly 30%.

(c) Are there any pairs of cells which are always clustered together? If so, do they have the same cell type?
Answer: For me, the first three cells are all clustered together, and are all CNS.

(d) Does k-means find a cluster corresponding to the cell type you thought would be especially easy to identify in the previous problem? (Explain your answer.)
Answer: Make confusion matrices and look at them.

> table(nci.classes,pc.clusters.1)
            pc.clusters.1
              1 2 3 4 5 6 7 8 9 10 11 12 13 14
  BREAST      2 1 0 0 0 0 1 0 0  0  1  1  0  1
  CNS         0 0 0 0 0 0 0 0 0  0  0  0  0  5
  COLON       0 0 0 2 0 0 2 1 0  0  2  0  0  0
  K562A-repro 0 0 0 0 1 0 0 0 0  0  0  0  0  0
  K562B-repro 0 0 0 0 1 0 0 0 0  0  0  0  0  0
  LEUKEMIA    0 0 0 0 4 2 0 0 0  0  0  0  0  0
  MCF7A-repro 0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MCF7D-repro 0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MELANOMA    1 0 0 0 0 0 0 0 0  0  0  1  6  0
  NSCLC       0 0 0 0 0 0 0 2 2  1  0  3  0  1
  OVARIAN     0 0 0 0 0 0 0 0 1  3  0  2  0  0
  PROSTATE    0 0 0 0 0 0 0 0 0  2  0  0  0  0
  RENAL       0 1 2 0 0 0 0 0 5  0  0  1  0  0
  UNKNOWN     0 0 0 0 0 0 0 0 0  0  0  1  0  0

Six of the eight MELANOMA cells are in one cluster here, which has no other members — this matches what I guessed from the projection plot. This is also true in one of the other two runs; in the third, that group of six is split into two clusters of 4 cells and 2.

(e) Does k-means find a cluster corresponding to the cell type you thought would be especially hard to identify? (Again, explain.)
Answer: I thought BREAST would be hard to classify, and it is, being spread across six clusters, all containing cells from other classes. This remains true in the other k-means runs. On the other hand, it wasn’t well-clustered with the full data, either.


7. Combining dimensionality reduction and classification

(a) Calculate the error rate of the prototype method using all the gene features, using leave-one-out cross-validation. You may use the solution code from previous assignments, or your own code. Remember that in this data set, the class labels are the row names, not a separate column, so you will need to modify either your code or the data frame. (Note: this may take several minutes to run. [Why so slow?])
Answer: I find it simpler to add another column for the class labels than to modify the code.

> nci.frame = data.frame(labels = nci.classes, nci.t)
> looCV.prototype(nci.frame)
[1] 0.578125

(The frame-making command complains about duplicated row-names, but I suppressed that for simplicity.) 58% accuracy may not sound that great, but remember there are 14 classes — it’s a lot better than chance.
Many people tried to use matrix here. The difficulty, as mentioned in the lecture notes, is that a matrix or array, in R, has to have all its components be of the same data type — it can’t mix, say, numbers and characters. When you try, it coerces the numbers into characters (because it can do that but can’t go the other way). With data.frame, each column can be of a different data type, so we can mix character labels with numeric features.
This takes a long time to run because it has to calculate 14 class centers, with 6830 dimensions, 64 times. Of course most of the class prototypes are not changing with each hold-out (only the one center of the class to which the hold-out point belongs), so there’s a fair amount of wasted effort computationally. A slicker implementation of leave-one-out for the prototype method would take advantage of this, at the cost of specializing the code to that particular classifier method.
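(looCV.prototype is the solution code from the earlier assignment and is not reproduced in this write-up. As a rough sketch of what it is assumed to do, leave-one-out cross-validation of the nearest-prototype classifier with the class labels in the first column of the data frame, something like the following; the details of the real function may differ.)

# Rough sketch of the assumed looCV.prototype (details may differ from
# the actual solution code): leave-one-out CV accuracy of the
# prototype (nearest class-mean) classifier.
looCV.prototype <- function(frame) {
  labels = as.character(frame[,1])
  features = as.matrix(frame[,-1])
  correct = logical(nrow(features))
  for (i in 1:nrow(features)) {
    # class prototypes (means) computed without the held-out row
    prototypes = apply(features[-i,,drop=FALSE], 2,
                       function(col) tapply(col, labels[-i], mean))
    # squared distance from the held-out cell to each prototype
    dists = rowSums((prototypes - matrix(features[i,], nrow(prototypes),
                                         ncol(prototypes), byrow=TRUE))^2)
    correct[i] = (rownames(prototypes)[which.min(dists)] == labels[i])
  }
  return(mean(correct))   # fraction of cells classified correctly
}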

(b) Calculate the error rate of the prototype method using the first two, ten and twenty principal components. Include the R commands you used to do this.
Answer: Remember that the x attribute of the object returned by prcomp gives the projections on to the components. We need to turn these into data frames along with the class labels, but since we don’t really care about keeping them around, we can do that in the argument to the CV function.

> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:2]))
[1] 0.421875
> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:10]))
[1] 0.515625


proto.with.q.components = function(q, pcs, classes) {
  pcs = pcs[,1:q]
  frame = data.frame(class.labels=classes, pcs)
  accuracy = looCV.prototype(frame)
  return(accuracy)
}

Code Example 5: Function for calculating the accuracy of a variable number of components.

> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:20]))
[1] 0.59375

So the accuracy is only slightly reduced, or even slightly increased, by using the principal components rather than the full data.

(c) (Extra credit.) Plot the error rate, as in the previous part, against the number q of principal components used, for q from 2 to 64. Include your code and comment on the graph. (Hint: Write a function to calculate the error rate for arbitrary q, and use sapply.)
Answer: Follow the hint (Code Example 5). This creates the frame internally — strictly the extra arguments to the function aren’t needed, but assuming that global variables will exist and have the right properties is always bad programming practice. I could add default values to them (pcs=nci.pca$x, etc.), but that’s not necessary because sapply can pass extra named arguments on to the function.

> pca.accuracies = sapply(2:64, proto.with.q.components,
+                         pcs=nci.pca$x, classes=nci.classes)
> pca.accuracies[c(1,9,19)]
[1] 0.421875 0.515625 0.593750
> plot(2:64, pca.accuracies, xlab="q", ylab="Prototype accuracy")

The middle command double checks that the proto.with.q.components function works properly, by giving the same results as we got by hand earlier. (Remember that the first component of the vector pca.accuracies corresponds to q = 2, etc.) The plot is Figure 7. There’s a peak in the accuracy for q between about 15 and 20.



Figure 7: Accuracy of the prototype method as a function of the number ofcomponents q.
