Solutions 04


Homework 4
36-350: Data Mining
SOLUTIONS

    I loaded the data and transposed it thus:

library(ElemStatLearn)
data(nci)
nci.t = t(nci)

This means that rownames(nci.t) contains the cell classes, or equivalently colnames(nci). I'll use that a lot, so I'll abbreviate it: nci.classes = rownames(nci.t).

    1. Similarity searching for cells

(a) How many features are there in the raw data? Are they discrete or continuous? (If there are some of each, which is which?) If some are continuous, are they necessarily positive?
Answer: This depends on whether the cell classes are counted as features or not; you could defend either answer. If you don't count the cell class, there are 6830 continuous features. Since they are logarithms, they do not have to be positive, and indeed they are not. If you do count the class, there are 6831 features, one of which is discrete (the class). I am inclined not to count the class as really being a feature, since when we get new data it may not have the label.

(b) Describe a method for finding the k cells whose expression profiles are most similar to that of a given cell. Carefully explain what you mean by "similar". How could you evaluate the success of the search algorithm?
Answer: The simplest thing is to call two cells similar if their vectors of expression levels have a small Euclidean distance. The most similar cells are the ones whose expression-profile vectors are closest, geometrically, to the target expression-profile vector.
There are several ways the search could be evaluated. One would be to use the given cell labels as an indication of whether the search results are "relevant". That is, one would look at the k closest matches and see (a) what fraction of them are of the same type as the target, which is the precision, and (b) what fraction of cells of that type


are in the search results (the recall). Plotting the precision and recall against k gives an indication of how efficient the search is. This could then be averaged over all the labeled cells. (A sketch of such a search is given below.)
Normalizing the vectors by the Euclidean length is possible here, though a little tricky, because the features are the logarithms of the concentration levels, so some of them can be negative numbers, meaning that very little of that gene is being expressed. Also, the reason for normalizing bag-of-words vectors by length was the idea that two documents might have similar meanings or subjects even if they had different sizes. A cell which is expressing all the genes at a low level, however, might well be very different from one which is expressing the same genes at a high level.
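To make this concrete, here is a small, hypothetical sketch of such a search (not part of the original solutions), using the nci.t matrix and nci.classes labels defined above; the function name nearest.cells and its arguments are made up for illustration.

# Hypothetical sketch: indices of the k cells closest (in Euclidean distance)
# to a given cell, plus an example precision calculation against the labels.
nearest.cells = function(target, k, data = nci.t) {
  differences = sweep(data, 2, data[target, ], "-")   # subtract the target profile from every row
  dists = sqrt(rowSums(differences^2))                # Euclidean distance to the target
  dists[target] = Inf                                 # exclude the target cell itself
  order(dists)[1:k]                                   # indices of the k nearest cells
}

# Example: precision of a 5-nearest search for cell 1
# mean(nci.classes[nearest.cells(1, k = 5)] == nci.classes[1])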

(c) Would it make sense to weight genes by inverse cell frequency? Explain.
Answer: No; or at least, I don't see how. Every gene is expressed to some degree by every cell (otherwise the logarithm of its concentration would be −∞), so every gene would have an inverse cell frequency of zero. You could fix a threshold and just count whether the expression level is over the threshold, but this would be pretty arbitrary.
However, a related idea would be to scale genes' expression levels by something which indicates how much they vary across cells, such as the range or the variance. Since we have labeled cell types here, we could even compute the average expression level for each type and then scale genes by the variance of these type-averages. That is, genes with a small variance (or range, etc.) should get small weights, since they are presumably uninformative, and genes with a large variance in expression level should get larger weights.
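As one illustration of that variance-weighting idea, here is a minimal, hypothetical sketch (not from the original solutions) that re-weights each gene by the standard deviation of its expression across cells; gene.sds and nci.weighted are made-up names.

# Weight each gene (column of nci.t) by how much it varies across cells,
# so nearly-constant, presumably uninformative genes count for little.
gene.sds = apply(nci.t, 2, sd)                  # per-gene standard deviation
nci.weighted = sweep(nci.t, 2, gene.sds, "*")   # up-weight high-variance genes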


2. k-means clustering. Use the kmeans function in R to cluster the cells, with k = 14 (to match the number of true classes). Repeat this three times to get three different clusterings.

    Answer:

clusters.1 = kmeans(nci.t,centers=14)$cluster
clusters.2 = kmeans(nci.t,centers=14)$cluster
clusters.3 = kmeans(nci.t,centers=14)$cluster

I'm only keeping the cluster assignments, because that's all this problem calls for.

(a) Say that k-means makes a lumping error whenever it assigns two cells of different classes to the same cluster, and a splitting error when it puts two cells of the same class in different clusters. Each pair of cells can give an error. (With n objects, there are n(n − 1)/2 distinct pairs.)

i. Write a function which takes as inputs a vector of classes and a vector of clusters, and gives as output the number of lumping errors. Test your function by verifying that when the classes are (1, 2, 2), the three clusterings (1, 2, 2), (2, 1, 1) and (4, 5, 6) all give zero lumping errors, but the clustering (1, 1, 1) gives two lumping errors.
Answer: The most brute-force way (Code 1) uses nested for loops. This works, but it's not the nicest approach. (1) for loops, let alone nested loops, are extra slow and inefficient in R. (2) We only care about pairs which are in the same cluster, so checking all pairs wastes time. (3) It's a bit obscure what we're really doing.
Code 2 is faster, cleaner, and more R-ish, using the utility function outer, which (rapidly!) applies a function to pairs of objects constructed from its first two arguments, returning a matrix. Here we're seeing which pairs of objects belong to different classes, and which belong to the same clusters. (The solutions to the first homework include a version of this trick.) We restrict ourselves to pairs in the same cluster and count how many are of different classes. We divide by two because outer produces rectangular matrices and works with ordered pairs, so (i, j) ≠ (j, i), but the problem asks for unordered pairs.
> count.lumping.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.lumping.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.lumping.errors(c(1,2,2),c(4,5,6))
[1] 0
> count.lumping.errors(c(1,2,2),c(1,1,1))
[1] 2


# Count number of errors made by lumping different classes into
# one cluster
# Uses explicit, nested iteration
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to different classes
#   lumped into one cluster
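The function bodies for Code 1 and Code 2 did not survive transcription. Below is a minimal sketch of what they might look like, reconstructed from the descriptions and test output above: a nested-loop version, count.lumping.errors.1, and an outer-based version, count.lumping.errors. Treat it as an illustration rather than the original course code.

# Code 1 (sketch): brute-force nested loops over all distinct pairs
count.lumping.errors.1 = function(classes, clusters) {
  n = length(classes)
  errors = 0
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      # a lumping error: two different classes put into the same cluster
      if ((classes[i] != classes[j]) && (clusters[i] == clusters[j])) {
        errors = errors + 1
      }
    }
  }
  return(errors)
}

# Code 2 (sketch): the same count via outer(); divide by 2 because outer
# works with ordered pairs
count.lumping.errors = function(classes, clusters) {
  different.classes = outer(classes, classes, "!=")
  same.clusters = outer(clusters, clusters, "==")
  sum(different.classes & same.clusters) / 2
}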

# Count number of errors made by splitting a single class into
# clusters
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to the same class
#   split into different clusters
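Again the body was lost in transcription; here is a sketch consistent with the header comments above and the test output below, mirroring the outer-based lumping counter (an assumption, not the original code).

# Sketch: count ordered pairs with the same class but different clusters,
# then divide by 2 to get unordered pairs
count.splitting.errors = function(classes, clusters) {
  same.classes = outer(classes, classes, "==")
  different.clusters = outer(clusters, clusters, "!=")
  sum(same.classes & different.clusters) / 2
}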

ii. Write a function, with the same inputs as the previous one, that calculates the number of splitting errors. Test it on the same inputs. (What should the outputs be?)
Answer: A simple modification of the code from the last part. Before we can test this, we need to know what the test results are. The classes (1, 2, 2) and clustering (1, 2, 2) are identical, so obviously there will be no splitting errors. Likewise for comparing (1, 2, 2) and (2, 1, 1). Comparing (1, 2, 2) and (4, 5, 6), the pair of the second and third items are in the same class but split into two clusters: one splitting error. On the other hand, comparing (1, 2, 2) and (1, 1, 1), there are no splitting errors, because everything is put in the same cluster.
> count.splitting.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.splitting.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.splitting.errors(c(1,2,2),c(4,5,6))
[1] 1
> count.splitting.errors(c(1,2,2),c(1,1,1))
[1] 0

iii. How many lumping errors does k-means make on each of your three runs? How many splitting errors?
> count.lumping.errors(nci.classes,clusters.1)
[1] 79
> count.lumping.errors(nci.classes,clusters.2)
[1] 96
> count.lumping.errors(nci.classes,clusters.3)
[1] 110
> count.splitting.errors(nci.classes,clusters.1)
[1] 97
> count.splitting.errors(nci.classes,clusters.2)
[1] 97
> count.splitting.errors(nci.classes,clusters.3)
[1] 106

For comparison, there are 64 × 63/2 = 2016 distinct pairs, so the error rate here is pretty good.

(b) Are there any classes which seem particularly hard for k-means to pick up?
Answer: Well-argued qualitative answers are fine, but there are ways of being quantitative. One is the entropy of the cluster assignments for the classes (the entropy function used below comes from the lecture code, lecture-05.R; a sketch is given at the end of this part). Start with a confusion matrix:

> confusion = table(nci.classes,clusters.1)
> confusion
             clusters.1
nci.classes    1  2  3  4  5  6  7  8  9 10 11 12 13 14
  BREAST       2  0  0  0  2  0  0  0  0  0  0  2  1  0
  CNS          0  0  0  0  3  0  0  0  2  0  0  0  0  0
  COLON        0  0  1  0  0  0  0  6  0  0  0  0  0  0
  K562A-repro  0  1  0  0  0  0  0  0  0  0  0  0  0  0
  K562B-repro  0  1  0  0  0  0  0  0  0  0  0  0  0  0
  LEUKEMIA     0  1  0  0  0  0  0  0  0  0  0  0  0  5
  MCF7A-repro  0  0  0  0  0  0  0  0  0  0  0  1  0  0
  MCF7D-repro  0  0  0  0  0  0  0  0  0  0  0  1  0  0
  MELANOMA     7  0  0  1  0  0  0  0  0  0  0  0  0  0
  NSCLC        0  0  4  1  0  0  2  0  0  0  1  0  1  0
  OVARIAN      0  0  3  1  0  0  0  0  0  2  0  0  0  0
  PROSTATE     0  0  2  0  0  0  0  0  0  0  0  0  0  0
  RENAL        0  0  0  1  1  7  0  0  0  0  0  0  0  0
  UNKNOWN      0  0  0  1  0  0  0  0  0  0  0  0  0  0

> signif(apply(confusion,1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.97        0.59        0.00        0.00        0.65
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00

Of course that's just one k-means run. Here are a few more.
> signif(apply(table(nci.classes,clusters.2),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.00        0.59        0.00        0.00        1.50
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
> signif(apply(table(nci.classes,clusters.3),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        1.50        0.59        0.00        0.00        1.90
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        1.10        1.80        0.65        0.00
      RENAL     UNKNOWN
       0.99        0.00

BREAST and NSCLC have consistently high cluster entropy, indicating that cells of these types tend not to be clustered together.
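The entropy helper used above comes from the course's lecture code (lecture-05.R) and is not reproduced in this transcript. A minimal sketch of a compatible function, assuming it takes a vector of counts and returns the entropy in bits, would be:

# Sketch of an entropy function compatible with the calls above (assumed
# interface: a vector of counts in, entropy in bits out)
entropy = function(counts) {
  p = counts[counts > 0] / sum(counts)   # normalize, dropping empty cells
  -sum(p * log2(p))
}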

(c) Are there any pairs of cells which are always clustered together, and if so, are they of the same class?
Answer: Cells number 1 and 2 are always clustered together in my three runs, and are CNS tumors. However, cells 3 and 4 are always clustered together, too, and one is CNS and one is RENAL. (One way to check this is sketched below.)
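A hypothetical way to check for such pairs programmatically (not part of the original solutions; same.cluster and always.together are made-up names):

# For each pair of cells, were they in the same cluster in all three runs?
same.cluster = function(clusters) { outer(clusters, clusters, "==") }
always.together = same.cluster(clusters.1) & same.cluster(clusters.2) &
                  same.cluster(clusters.3)
diag(always.together) = FALSE              # ignore trivial self-pairs
which(always.together, arr.ind = TRUE)     # indices of always-co-clustered pairs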

    (d) Variation-of-information metric

i. Calculate, by hand, the variation-of-information distance between the partition (1, 2, 2) and (2, 1, 1); between (1, 2, 2) and (2, 2, 1); and between (1, 2, 2) and (5, 8, 11).
Answer: Write X for the cell of the first partition in each pair and Y for the second.
In the first case, when X = 1, Y = 2 and vice versa. So H[X|Y] = 0, H[Y|X] = 0, and the distance between the partitions is also zero. (The two partitions are the same; they've just swapped the labels on the cells.)
For the second pair, we have Y = 2 when X = 1, so H[Y|X = 1] = 0, but Y = 1 or 2 equally often when X = 2, so H[Y|X = 2] = 1. Thus H[Y|X] = (0)(1/3) + (1)(2/3) = 2/3. Symmetrically, X = 2 when Y = 1, so H[X|Y = 1] = 0, but X = 1 or 2 equiprobably when Y = 2, so again H[X|Y] = 2/3, and the distance is 4/3.
In the final case, H[X|Y] = 0, because for each Y there is a single value of X, but H[Y|X] = 2/3, as in the previous case. So the distance is 2/3.

ii. Write a function which takes as inputs two vectors of class or cluster assignments, and returns as output the variation-of-information distance for the two partitions. Test the function by checking that it matches your answers in the previous part.
Answer: The most straightforward way is to make the contingency table or confusion matrix for the two assignment vectors, calculate the two conditional entropies from that, and add them. But I already have a function to calculate mutual information (word.mutual.info, from lecture-05.R), so it is easier to build on that and the entropy function.


# Variation-of-information distance between two partitions
# Input: two vectors of partition-cell assignments of equal length
# Calls: entropy, word.mutual.info from lecture-05.R
# Output: variation of information distance

> variation.of.info(c(1,2,2),c(2,1,1))
[1] 0
> variation.of.info(c(1,2,2),c(2,2,1))
[1] 1.333333
> variation.of.info(c(1,2,2),c(5,8,11))
[1] 0.6666667
> variation.of.info(c(5,8,11),c(1,2,2))
[1] 0.6666667

I threw in the last one to make sure the function is symmetric in its two arguments, as it should be.
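The body of variation.of.info itself was lost in transcription; a self-contained sketch that reproduces the test values above is given below (the original, per its header comments, instead reused the entropy and word.mutual.info helpers from lecture-05.R).

# Sketch of a variation-of-information function: VI(X,Y) = H[X|Y] + H[Y|X]
#                                                        = 2 H[X,Y] - H[X] - H[Y]
variation.of.info = function(x, y) {
  joint = table(x, y) / length(x)                         # joint distribution of the labelings
  H = function(p) { p = p[p > 0]; -sum(p * log2(p)) }     # entropy (in bits) of a distribution
  2 * H(joint) - H(rowSums(joint)) - H(colSums(joint))
}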

iii. Calculate the distances between the k-means clusterings and the true classes, and between the k-means clusterings and each other.
Answer:

> variation.of.info(nci.classes,clusters.1)
[1] 1.961255
> variation.of.info(nci.classes,clusters.2)
[1] 2.026013
> variation.of.info(nci.classes,clusters.3)
[1] 2.288418
> variation.of.info(clusters.1,clusters.2)
[1] 0.5167703
> variation.of.info(clusters.1,clusters.3)
[1] 0.8577165
> variation.of.info(clusters.2,clusters.3)
[1] 0.9224744

The k-means clusterings are all roughly the same distance to the true classes (two bits or so), and noticeably closer to each other than to the true classes (around three-quarters of a bit).


3. Hierarchical clustering

(a) Produce dendrograms using Ward's method, single-link clustering and complete-link clustering. Include both the commands you used and the resulting figures in your write-up. Make sure the figures are legible. (Try the cex=0.5 option to plot.)
Answer: hclust needs a matrix of distances, handily produced by the dist function.

nci.dist = dist(nci.t)
plot(hclust(nci.dist,method="ward"),cex=0.5,xlab="Cells",
     main="Hierarchical clustering by Ward's method")
plot(hclust(nci.dist,method="single"),cex=0.5,xlab="Cells",
     main="Hierarchical clustering by single-link method")
plot(hclust(nci.dist,method="complete"),cex=0.5,xlab="Cells",
     main="Hierarchical clustering by complete-link method")

producing Figures 1, 2 and 3.

(b) Which cell classes seem best captured by each clustering method? Explain.
Answer: Ward's method does a good job of grouping together COLON, MELANOMA and RENAL; LEUKEMIA is pretty good too, but mixed up with some of the lab cell lines. Single-link is good with RENAL and decent with MELANOMA (though confused with BREAST). There are little sub-trees of cells of the same type, like COLON or CNS, but mixed together with others. The complete-link method has sub-trees for COLON, LEUKEMIA, MELANOMA, and RENAL, which look pretty much as good as Ward's method.

(c) Which method best recovers the cell classes? Explain.
Answer: Ward's method looks better: scanning down the figure, a lot more of the adjacent cells are of the same type. Since the dendrogram puts items in the same sub-cluster together, this suggests that the clustering more nearly corresponds to the known cell types.

(d) The hclust command returns an object whose height attribute is the sum of the within-cluster sums of squares. How many clusters does this suggest we should use, according to Ward's method? Explain. (You may find diff helpful.)
Answer:

> nci.wards = hclust(nci.dist,method="ward")
> length(nci.wards$height)
[1] 63
> nci.wards$height[1]
[1] 38.23033

There are 63 heights, corresponding to the 63 joinings, i.e., not including the height (sum-of-squares) of zero when we have 64 clusters, one for each data point.

Figure 1: Hierarchical clustering by Ward's method.

Figure 2: Hierarchical clustering by the single-link method.

Figure 3: Hierarchical clustering by the complete-link method.

Figure 4: Merging costs in Ward's method (merging cost vs. clusters remaining).

Since the merging costs are differences, we want to add an initial 0 to the sequence:

merging.costs = diff(c(0,nci.wards$height))
plot(64:2,merging.costs,xlab="Clusters remaining",ylab="Merging cost")

The function diff takes the difference between successive components in a vector, which is what we want here (Figure 4). The heuristic is to stop joining clusters when the cost of doing so goes up a lot; this suggests using only 10 clusters, since going from 10 clusters to 9 is very expensive. (Alternately, use 63 clusters; but that would be silly.)

(e) Suppose you did not know the cell classes. Can you think of any reason to prefer one clustering method over another here, based on their outputs and the rest of what you know about the problem?
Answer: I can't think of anything very compelling. Ward's method has a better sum-of-squares, but of course we don't know in advance that sum-of-squares picks out differences between cancer types with any biological or medical importance.


4. (a) Use prcomp to find the variances (not the standard deviations) associated with the principal components. Include a print-out of these variances, to exactly two significant digits, in your write-up.
Answer:

> nci.pca = prcomp(nci.t)
> nci.variances = (nci.pca$sdev)^2
> signif(nci.variances,2)
 [1] 6.3e+02 3.5e+02 2.8e+02 1.8e+02 1.6e+02 1.5e+02 1.2e+02 1.2e+02
 [9] 1.1e+02 9.2e+01 8.9e+01 8.5e+01 7.8e+01 7.5e+01 7.1e+01 6.8e+01
[17] 6.7e+01 6.2e+01 6.2e+01 6.0e+01 5.8e+01 5.5e+01 5.4e+01 5.0e+01
[25] 5.0e+01 4.7e+01 4.5e+01 4.4e+01 4.3e+01 4.2e+01 4.1e+01 4.0e+01
[33] 3.7e+01 3.7e+01 3.6e+01 3.5e+01 3.4e+01 3.4e+01 3.2e+01 3.2e+01
[41] 3.1e+01 3.0e+01 3.0e+01 2.9e+01 2.8e+01 2.7e+01 2.6e+01 2.5e+01
[49] 2.3e+01 2.3e+01 2.2e+01 2.1e+01 2.0e+01 1.9e+01 1.8e+01 1.8e+01
[57] 1.7e+01 1.5e+01 1.4e+01 1.2e+01 1.0e+01 9.9e+00 8.9e+00 1.7e-28

(b) Why are there only 64 principal components, rather than 6830? (Hint: Read the lecture notes.)
Answer: Because, when n < p, it is necessarily true that the data lie on an n-dimensional subspace of the p-dimensional feature space, so there are only n orthogonal directions along which the data can vary at all. (Actually, one of the variances is very, very small, because any 64 points lie on a 63-dimensional surface, so we really only need 63 directions, and the last variance is within numerical error of zero.) Alternately, we can imagine that there are 6830 − 64 = 6766 other principal components, each with zero variance.

(c) Plot the fraction of the total variance retained by the first q components against q. Include both a print-out of the plot and the commands you used.
Answer:

plot(cumsum(nci.variances)/sum(nci.variances),xlab="q",
     ylab="Fraction of variance",
     main = "Fraction of variance retained by first q components")

producing Figure 5. The handy cumsum function takes a vector and returns the cumulative sums of its components (i.e., another vector of the same length). (The same effect can be achieved through lapply or sapply, or through an explicit loop, etc.)

(d) Roughly how much variance is retained by the first two principal components?
Answer: From the figure, about 25%. More exactly:

> sum(nci.variances[1:2])
[1] 986.1434
> sum(nci.variances[1:2])/sum(nci.variances)
[1] 0.2319364

so 23% of the variance.

Figure 5: R² vs. q (fraction of variance retained by the first q components).

(e) Roughly how many components must be used to keep half of the variance? To keep nine-tenths?
Answer: From the figure, about 10 components keep half the variance, and about 40 components keep nine-tenths. More exactly:

> min(which(cumsum(nci.variances) > 0.5*sum(nci.variances)))
[1] 10
> min(which(cumsum(nci.variances) > 0.9*sum(nci.variances)))
[1] 42

    so we need 10 and 42 components, respectively.

(f) Is there any point to using more than 50 components? (Explain.)
Answer: There's very little point from the point of view of capturing variance. The error in reconstructing the expression levels from using 50 components is already extremely small. However, there's no guarantee that, in some other application, the information carried by the 55th component (say) wouldn't be crucial.

(g) How much confidence should you have in results from visualizing the first two principal components? Why?
Answer: Very little; the first two components represent only a small part of the total variance.

Figure 6: PC projections of the cells (PC1 vs. PC2), labeled by cell type.

5. (a) Plot the projection of each cell on to the first two principal components. Label each cell by its type. Hint: See the solutions to the first homework.
Answer: Take the hint!

plot(nci.pca$x[,1:2],type="n")
text(nci.pca$x[,1:2],nci.classes, cex=0.5)

    Figure 6 is the result.

(b) One tumor class (at least) forms a cluster in the projection. Say which, and explain your answer.
Answer: MELANOMA; most of the cells cluster in the bottom-center of the figure (PC1 between about 0 and 20, PC2 around −40), with only a few non-MELANOMA cells nearby, and a big gap to the rest of the data. One could also argue for LEUKEMIA in the center left, CNS in the upper left, RENAL above it, and COLON in the top right.

(c) Identify a tumor class which does not form a compact cluster.
Answer: Most of them don't. What I had in mind, though, was BREAST, which is very widely spread.

(d) Of the two classes of tumors you have just named, which will be more easily classified with the prototype method? With the nearest neighbor method?
Answer: BREAST will be badly classified using the prototype method. The prototype will be around the middle of the plot, where there are no breast-cancer cells. Nearest-neighbors can hardly do worse. On the other hand, MELANOMA should work better with the prototype method, because it forms a compact blob.


6. Combining dimensionality reduction and clustering

(a) Run k-means with k = 14 on the projections of the cells on to the first two principal components. Give the commands you used to do this, and get three clusterings from three different runs of k-means.

pc.clusters.1 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.2 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.3 = kmeans(nci.pca$x[,1:2],centers=14)$cluster

(b) Calculate the number of errors, as with k-means clustering based on all genes.
Answer:

> count.lumping.errors(nci.classes,pc.clusters.1)
[1] 96
> count.lumping.errors(nci.classes,pc.clusters.2)
[1] 92
> count.lumping.errors(nci.classes,pc.clusters.3)
[1] 111
> count.splitting.errors(nci.classes,pc.clusters.1)
[1] 126
> count.splitting.errors(nci.classes,pc.clusters.2)
[1] 128
> count.splitting.errors(nci.classes,pc.clusters.3)
[1] 134

The number of lumping errors has increased a bit, though not very much. The number of splitting errors has gone up more substantially, by roughly 30%.

(c) Are there any pairs of cells which are always clustered together? If so, do they have the same cell type?
Answer: For me, the first three cells are all clustered together, and are all CNS.

(d) Does k-means find a cluster corresponding to the cell type you thought would be especially easy to identify in the previous problem? (Explain your answer.)
Answer: Make confusion matrices and look at them.

> table(nci.classes,pc.clusters.1)
             pc.clusters.1
nci.classes    1  2  3  4  5  6  7  8  9 10 11 12 13 14
  BREAST       2  1  0  0  0  0  1  0  0  0  1  1  0  1
  CNS          0  0  0  0  0  0  0  0  0  0  0  0  0  5
  COLON        0  0  0  2  0  0  2  1  0  0  2  0  0  0
  K562A-repro  0  0  0  0  1  0  0  0  0  0  0  0  0  0
  K562B-repro  0  0  0  0  1  0  0  0  0  0  0  0  0  0
  LEUKEMIA     0  0  0  0  4  2  0  0  0  0  0  0  0  0
  MCF7A-repro  0  0  0  0  0  0  0  0  0  0  1  0  0  0
  MCF7D-repro  0  0  0  0  0  0  0  0  0  0  1  0  0  0
  MELANOMA     1  0  0  0  0  0  0  0  0  0  0  1  6  0
  NSCLC        0  0  0  0  0  0  0  2  2  1  0  3  0  1
  OVARIAN      0  0  0  0  0  0  0  0  1  3  0  2  0  0
  PROSTATE     0  0  0  0  0  0  0  0  0  2  0  0  0  0
  RENAL        0  1  2  0  0  0  0  0  5  0  0  1  0  0
  UNKNOWN      0  0  0  0  0  0  0  0  0  0  0  1  0  0

Six of the eight MELANOMA cells are in one cluster here, which has no other members; this matches what I guessed from the projection plot. This is also true in one of the other two runs; in the third, that group of six is split into two clusters of 4 and 2 cells.

(e) Does k-means find a cluster corresponding to the cell type you thought would be especially hard to identify? (Again, explain.)
Answer: I thought BREAST would be hard to classify, and it is, being spread across six clusters, all containing cells from other classes. This remains true in the other k-means runs. On the other hand, it wasn't well-clustered with the full data, either.


7. Combining dimensionality reduction and classification

(a) Calculate the error rate of the prototype method using all the gene features, using leave-one-out cross-validation. You may use the solution code from previous assignments, or your own code. Remember that in this data set, the class labels are the row names, not a separate column, so you will need to modify either your code or the data frame. (Note: this may take several minutes to run. [Why so slow?])
Answer: I find it simpler to add another column for the class labels than to modify the code.

> nci.frame = data.frame(labels = nci.classes, nci.t)
> looCV.prototype(nci.frame)
[1] 0.578125

(The frame-making command complains about duplicated row-names, but I suppressed that for simplicity.) 58% accuracy may not sound that great, but remember there are 14 classes, so it's a lot better than chance.
Many people tried to use matrix here. The difficulty, as mentioned in the lecture notes, is that a matrix or array, in R, has to have all its components be of the same data type: it can't mix, say, numbers and characters. When you try, it coerces the numbers into characters (because it can do that but can't go the other way). With data.frame, each column can be of a different data type, so we can mix character labels with numeric features.
This takes a long time to run because it has to calculate 14 class centers, with 6830 dimensions, 64 times. Of course most of the class prototypes are not changing with each hold-out (only the one center of the class to which the hold-out point belongs), so there's a fair amount of wasted computational effort. A slicker implementation of leave-one-out for the prototype method would take advantage of this, at the cost of specializing the code to that particular classifier method.
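The looCV.prototype function comes from earlier solution code and is not reproduced here. For concreteness, here is a hypothetical sketch of such a function, assuming the data frame's first column holds the class labels and that it returns the fraction of correctly classified points; it is an illustration, not the course's actual implementation.

# Sketch: leave-one-out CV accuracy of the prototype (nearest-centroid) classifier
looCV.prototype.sketch = function(frame) {
  labels = as.character(frame[, 1])
  features = as.matrix(frame[, -1])
  n = nrow(features)
  predictions = character(n)
  for (i in 1:n) {
    # class centroids computed without the held-out row
    centroids = apply(features[-i, , drop=FALSE], 2,
                      function(col) tapply(col, labels[-i], mean))
    # assign the held-out row to the class with the nearest centroid
    dists = apply(centroids, 1, function(ctr) sqrt(sum((features[i, ] - ctr)^2)))
    predictions[i] = names(which.min(dists))
  }
  mean(predictions == labels)   # fraction classified correctly
}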

(b) Calculate the error rate of the prototype method using the first two, ten and twenty principal components. Include the R commands you used to do this.
Answer: Remember that the x attribute of the object returned by prcomp gives the projections on to the components. We need to turn these into data frames along with the class labels, but since we don't really care about keeping them around, we can do that in the argument to the CV function.

> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:2]))
[1] 0.421875
> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:10]))
[1] 0.515625


proto.with.q.components = function(q, pcs, classes) {
  pcs = pcs[,1:q]
  frame = data.frame(class.labels=classes, pcs)
  accuracy = looCV.prototype(frame)
  return(accuracy)
}

Code Example 5: Function for calculating the accuracy with a variable number of components.

> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:20]))
[1] 0.59375

So the accuracy is only slightly reduced, or even slightly increased, by using the principal components rather than the full data.

(c) (Extra credit.) Plot the error rate, as in the previous part, against the number q of principal components used, for q from 2 to 64. Include your code and comment on the graph. (Hint: Write a function to calculate the error rate for arbitrary q, and use sapply.)
Answer: Follow the hint (Code Example 5). This creates the frame internally. Strictly speaking, the extra arguments to the function aren't needed, but assuming that global variables will exist and have the right properties is always bad programming practice. I could add default values to them (pcs=nci.pca$x, etc.), but that's not necessary, because sapply can pass extra named arguments on to the function.

> pca.accuracies = sapply(2:64, proto.with.q.components,
+                         pcs=nci.pca$x, classes=nci.classes)
> pca.accuracies[c(1,9,19)]
[1] 0.421875 0.515625 0.593750
> plot(2:64, pca.accuracies, xlab="q", ylab="Prototype accuracy")

The middle command double-checks that the proto.with.q.components function works properly, by giving the same results as we got by hand earlier. (Remember that the first component of the vector pca.accuracies corresponds to q = 2, etc.) The plot is Figure 7. There's a peak in the accuracy for q between about 15 and 20.

Figure 7: Accuracy of the prototype method as a function of the number of components q.