
Strategies for Winnowing Microarray Data

D.B. Skillicorn, School of Computing, Queen’s University

Kingston, Canada, skill@cs.queensu.ca

Simeon Simoff, Paul Kennedy

Faculty of Information Technology, University of Technology Sydney, Australia

{simeon,paulk}@it.uts.edu.au

Daniel Catchpoole, Westmead Children’s Hospital

Sydney, Australia, [email protected]

Abstract

The analysis of microarray datasets is complicated by the magnitude of the available information. Most data mining techniques are significantly hampered by irrelevant or redundant information. Hence it is useful to reduce datasets to manageable size by discarding such useless information. We present techniques for winnowing microarray datasets using singular value decomposition and semidiscrete decomposition, and show how they can be tuned to extract some information about the internal correlative structure of large datasets.

1 Introduction

Microarrays have the potential to explicate the relationships between biochemical pathways and higher-level biological processes such as disease and drug activity. At present, knowledge extraction from microarrays is the bottleneck to fully exploiting their potential, since a substantial amount of human interaction is still required. This is slowly changing as data mining is applied and customized for the microarray setting, and as more experience is gained of how much can be automated.

1.1 Technology Background. A microarray captures information about the levels of expression of a large number of genes (and other proteins) in a single sample. Collecting the results from identical microarrays processed with different samples produces datasets with a large number of rows (corresponding to the number of genes on the microarray, typically in the tens of thousands) and relatively few columns (corresponding to samples from patients, typically in the tens to hundreds).

There are two main kinds of microarrays: those in which each location measures the absolute expression level of a particular gene, and which are therefore used with a single sample; and those in which each location measures the relative expression level of a particular gene, and which are therefore used with a mixture of two samples, one representing the background (‘normal’) expression and one representing the foreground (‘disease’), each labelled with a different dye. These two kinds of microarrays produce datasets with different properties that must be taken into account in subsequent analysis.

1.2 Gene Data Issues. The main application of microarrays is to solve an inverse problem: which changes in gene expression account for observed differences among arrays (representing an organism at different times, or different organisms with different conditions, e.g. patients)? Such information can reveal previously unknown subclasses of conditions (perhaps corresponding to new subtypes of diseases), can be used to build predictors of conditions, and may reveal the biochemical pathways that result in different conditions. For example, many cancers contain subclasses that are difficult to distinguish and for which different treatments have dramatically different success rates. Correct identification of subclass is therefore critical to successful treatment.

In some cases, it is possible to classify patients, and so samples, based on clinical knowledge (e.g. diagnosis), and the problem becomes connecting this classification to its gene expression origins. In other cases, the appropriate classification may not be known in advance (e.g. prognosis for disease course); even the appropriate number of classes may not be known. Here the problem is to discover both appropriate clusters of patients and genes, and also the connections between them. In many clinical settings, the problem is somewhere between these two extremes: pragmatic classifications are known, but assigning patients to classes is not always completely accurate, and the classes may contain as yet unrecognized subclasses.

Consider two genes that are both expressed differently for different target classes. The magnitudes of their relative expression may be different: for example, one may have a moderate expression level for one class and a high expression level for the other class, while the second has a low expression level for the first class and a moderate expression level for the second. If each row is considered as a point in a metric space, then the points corresponding to these genes may be far apart. Of course, this may be mitigated by suitable normalization of the expression levels, but such normalization is necessarily affected by all of the values for a particular gene and hence susceptible to noise. It is also possible that the two genes are strongly negatively correlated, in which case they will necessarily be far apart (on opposite sides of the origin) in a metric space representation of the dataset. In any case, normalization will change the shape of a localized set of genes, which may mislead clustering algorithms that assume, for example, spherical clusters.

From this we conclude that both proximity-based methods and density-based methods should be treated with caution as techniques for clustering gene expression data. Such clusters are not localized, and so are not easily visible to techniques that treat the dataset as a representation of a metric space. (This may partially explain why some attempts at clustering have found large numbers of clusters in settings where quite small numbers might have been expected.)

On the other hand, gene expressions that are unrelated to target classes tend to be uncorrelated with each other, while gene expressions that are related to target classes are correlated (positively or negatively), even if they are not localized. In other words, even though clusters of genes related to classes are not localized, those genes that are unrelated are unlocalized, and the remaining correlation among the interesting genes may be visible against that background. Hence data-mining techniques that are based on correlation are likely to detect gene expression clusters.

One obvious initial strategy is to improve datasets by discarding genes and/or samples that are (probably) irrelevant to any interesting biochemistry. When target classes are known, the expression level of a relevant gene should differ between the columns corresponding to each class. Significance tests can be applied to try to detect such genes: for example, Significance Analysis of Microarrays (SAM) [13], thresholding [11], or neighborhood analysis [4]. In a setting where higher-order correlation is commonplace, discarding genes on the basis of their individual properties runs the risk of discarding a significant set of genes whose expression, although small in magnitude, is tightly correlated.

PCA and SVD are standard techniques for dimensionality reduction and factor analysis. They have been used to analyze microarrays in three related ways:

• To reduce the dimensionality of the data and/or remove ‘noise’ from the data;


• To generate ‘eigengenes’ and ‘eigensamples’ that capture concerted behavior of a number of genes or samples [1];

• To cluster genes or samples using spectral clustering methods [7].

1.3 Proposed Approach. In this paper we show how two matrix decompositions, singular value decomposition (SVD) and semidiscrete decomposition (SDD), can be used to winnow a set of genes to a much smaller set. Unlike the standard application of SVD, the goal is not to produce some set of eigengenes, but to remove from consideration genes whose correlative relationship to the other genes suggests that they are not of interest in the context of a given set of samples. The reduced dataset can then be analyzed using any of a number of other data mining techniques. The techniques described here can also be used when the target classes are unknown, so they can be applied even when thresholding cannot.

Both SVD and SDD map the original data into a structure with the following two important properties:

1. Genes that are similar and/or correlated are mapped to close locations within the structure; and

2. The structure contains a neutral point to which genes that are either correlated with everything or correlated with nothing (and hence are probably ‘uninteresting’) are mapped.

The second property is the key to winnowing, since these uninteresting genes can be discarded with little risk of losing useful information. Notice that this is sharply different from most attribute selection techniques, which try to find the genes with the most predictive power, a strategy that does not work well when many genes each have a small amount of predictive power. The first property can help with prediction since, for example, those genes that predict a given target gene well will be close to it within the structure.

A further advantage of these matrix decompositions is that they are symmetric with respect to genes and samples, so that identical techniques can be used to discover relationships among samples (including outlier samples that may correspond either to unexpected subclasses of disease or to process flaws).

We will illustrate these techniques using a dataset comparing pediatric patients diagnosed with acute lymphocytic leukemia (ALL) and healthy subjects. This dataset was gathered using cDNA technology and postprocessed using an Axon GenePix scanner. Per-sample analysis of the quality of the measured expression levels showed high reliability, so only four attributes per patient sample were used: the median contrasts at each laser frequency. We used a total of 24 samples from 9 different patients. The coordinates of each spot were also included so that we could check for location effects.

The paper is organized as follows: Section 2 outlines some of the related work. Section 3 defines the singular value decomposition and shows how it can be used for winnowing based on correlation. Section 4 defines the semidiscrete decomposition and shows how it is used for winnowing based on difference. Section 5 illustrates how weighting can be used to focus attention on particular structures. Finally, we draw some conclusions.

2 Related Work

There is a vast amount of work related to data mining of microarray datasets, and we sketch only a few techniques that have been used for leukemia-based data. A useful survey is Yang and Speed [14]. The unusual difficulties of mining such datasets are (a) the large number of genes compared to the number of samples, which makes it likely that spurious correlations will be observed (see [12] for a summary of these issues); (b) the wide error bounds for microarray readings arising from the complex processes of preparation, hybridization, and microarray preparation and reading [2]; and (c) the fact that most biochemical processes are the result of multiple weak influences rather than one strong one, whereas most data-mining techniques, for example decision trees, look for the smallest set of influences.

Most techniques are supervised; that is, it is assumed that the classification of samples (with respect to subclasses of disease or prognosis) is known. Golub et al. [4] used a measure based on correlation with the target class to determine the predictive power of each individual gene, and provided lists of genes predictive of AML (acute myeloid leukemia) and ALL (acute lymphoblastic leukemia). Yeoh et al. [15] use two-dimensional hierarchical classification trees to learn the genes most predictive of six subclasses of ALL.

3 SVD Winnowing of Microarray Data

3.1 Singular Value Decomposition. The singular value decomposition (SVD) [3] is a well-known matrix transformation that is often used to reduce the dimensionality of data. Suppose that a microarray dataset is a matrix A with n rows (corresponding to genes) and m columns (corresponding to measurements). Then the SVD expresses A in the form

A = USV ′

where U is an n × m orthogonal matrix, S is an m × m diagonal matrix whose r non-negative entries (where A has rank r) are in decreasing order, and V is an m × m orthogonal matrix. The superscript dash indicates matrix transpose. The diagonal entries of S are called the singular values of the matrix A.

One way to understand SVD is as an axis transformation to new orthogonal axes (represented by V), with stretching in each dimension specified by the values on the diagonal of S. The rows of U give the coordinates of each original row in the coordinate system of the new axes. Hence U is a new representation for the genes. SVD measures variation with respect to the origin, so it is usual to transform the matrix A so that the attributes (i.e. columns) have zero mean. If this is not done, the first singular vector (the first axis of the transformed space) represents the vector from the origin to the center of the data, and this information is not usually particularly useful.
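As a concrete (toy) illustration of the decomposition and the zero-mean transformation just described, a minimal NumPy sketch might look like the following. The matrix here is random, not real microarray data, and the sizes (1000 genes, 24 samples) are merely suggestive of the shapes involved.

```python
import numpy as np

# Toy stand-in for a microarray matrix: n genes (rows) by m samples (columns).
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 24))

# Center each column (attribute) to zero mean, so the first singular vector
# does not simply point from the origin to the center of the data.
A = A - A.mean(axis=0)

# Thin SVD: rows of U are the gene coordinates on the new axes (rows of Vt),
# and s holds the singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert U.shape == (1000, 24) and s.shape == (24,) and Vt.shape == (24, 24)
assert np.allclose(U * s @ Vt, A)   # A = U S V'
```

Each row of U is then the new representation of the corresponding gene, as used throughout the rest of this section.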

The most powerful property of SVD is that the maximal variation among genes is captured in the first dimension, as much of the remaining variation as possible in the second dimension, and so on. Hence, truncating the matrices so that Uk is n × k, Sk is k × k, and Vk is m × k gives a representation for the dataset in a lower-dimensional space. Moreover, such a representation is the best possible with respect to both the Frobenius and L2 norms. (Note that SVD is a decomposition into linearly independent components, not statistically independent components, a distinction that is sometimes important.)

SVD has often been used for dimensionality reduction in data mining. When m is large, Euclidean distance between objects represented as points in m-dimensional space is badly behaved, in the sense that the expected difference between the distances to the farthest and nearest neighbors of a given point is very small. Choosing some smaller value for k allows a faithful representation in which Euclidean distance is practical as a similarity metric. When k = 2 or 3, visualization is possible.

3.2 SVD-based Information Extraction from Microarray Data. The property of most interest for winnowing is the following: view each row of U as a vector in the transformed m-dimensional space. Two vectors that are close together are positively correlated, and the angle between them is small (their dot product is a large positive number). Similarly, two vectors that are negatively correlated are on opposite sides of the origin, and their dot product is a large negative number. Two vectors that are at right angles to each other are completely uncorrelated, so their dot product is (close to) zero. However, the space has only m dimensions and there are many more vectors than this. Vectors that are uncorrelated with all of the others must have (almost) zero dot products with all of them; the only way this can happen is if all of their coordinates are close to zero. In other words, genes that are uncorrelated with most of the others will tend to be positioned close to the origin in the transformed space, while genes with significant correlation to other genes will tend to be positioned further from the origin.

The structure to which SVD maps the genes in the dataset is thus a low-dimensional space in which proximity corresponds to similarity or correlation, and whose neutral point is the origin.

SVD transforms the given dataset into a structure in which distance from the origin is a surrogate for interestingness, in the sense of possessing significant correlative structure. We can draw the following conclusions about genes from their positions in the transformed SVD space:

• Genes that are uncorrelated with any other genes will be close to the origin. Intuitively, the point corresponding to such a gene is being pushed in towards the origin by the other genes. It is unlikely that processes of biological interest will involve only a single gene, so such genes may be removed from the dataset with low risk of losing information.

• Genes that are correlated with many other genes will also be close to the origin. Intuitively, the point corresponding to such a gene is being pulled towards many of the other genes and reaches equilibrium near the center. Again, such genes are unlikely to be of great interest since they have little predictive power.

• Genes that are correlated with a moderate number of other genes will tend to be far from the origin. The distance from the origin provides some information about the tightness of the correlation with other ‘interesting’ genes. Since proximity corresponds to correlation, the direction from the origin is also significant.

It is worth distinguishing two structures that commonly occur. When a group of genes are correlated among themselves, but also correlated with many other genes, they may appear as a bulge in a central spheroid. If the group is less correlated with other genes, such bulges may actually be separated from the central spheroid, and may appear as distinct clusters. This is, of course, the most interesting case, since it corresponds to a clear subset of genes with mutually correlated behavior.

Note that we are not using SVD as a dimensionality reduction technique in the usual way; we are in fact reducing the dimensionality of the patient/sample space.


Figure 1: 3-dimensional plot of genes after SVD

3.3 Example of the application of SVD. We now show how this strategy works using a dataset of samples from pediatric leukemia patients. Data related to housekeeping spots were removed, leaving a total of 10752 genes, only a few of which were duplicates. The dataset was normalized by column to zero mean and unit standard deviation. Figure 1 shows a plot in 3 dimensions of the entire set of genes.

The main cluster is elliptical, oriented along the U1 axis, the axis of maximal variation in the transformed space. This axis would typically represent the magnitude of normalized gene expression, with relatively highly expressed genes at one end and relatively weakly expressed genes at the other. The points far from the origin along the U2 axis represent a process that is uncorrelated with the process expressed along the U1 axis. The top twenty known extremal genes in the plot of the entire microarray dataset are (in decreasing order): B2M, RPS27, RPL13A, SERPINB6, HLA-C, RPL30, RPS6, KIAA0404, HLA-A, RPS8, LYZ, RPL9, and HBE1. Note that the main cluster is hard to analyze in any further depth because of the sheer number of points within it.

Figure 2 shows the singular values for this decomposition. The magnitude of these values indicates how much variation is being captured by each dimension. In particular, much of the variation is captured by the first 10 dimensions.

We now remove genes that are unlikely to be interesting, by removing those points closer to the origin than the median Euclidean distance. We could compute the distances in all 98 dimensions but, having transformed the dataset using SVD and knowing the magnitudes of the singular values, we can instead compute the median using only 10 dimensions, with confidence that the points discarded will be essentially the same.

Figure 2: Magnitude of the singular values

Figure 3: 3-d plot of genes further than the median distance from the origin

Figure 3 shows that the points removed are, as expected, not extremal in any direction; indeed it is difficult to see that half of the points have been removed.
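The winnowing step itself is simple once the SVD is in hand. The following sketch uses synthetic data (the real dataset has 10752 genes); the choice of 10 dimensions mirrors the discussion above.

```python
import numpy as np

# Synthetic stand-in for the gene-by-sample matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(1000, 24))
A = A - A.mean(axis=0)                    # zero-mean columns, as before
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10                                    # enough dimensions, judging by the
                                          # magnitudes of the singular values
dist = np.linalg.norm(U[:, :k], axis=1)   # each gene's distance from the origin

keep = dist > np.median(dist)             # discard genes closer than the median
A_winnowed = A[keep]
print(A_winnowed.shape)                   # (500, 24): half the genes survive
```

Raising the threshold (e.g. to 1.5 times the median, as below) winnows more aggressively while preserving the extremal points.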

Removing points closer than 1.5 times the median distance leaves only 1660 genes but, as Figure 4 shows, the extremal points remain intact, and it is possible to see some of the structure of the main cluster. Note, for example, the small tight cluster just to the left of the top of the main cluster in this view.


Figure 4: 3-d plot of genes further than 1.5 timesthe median distance from the origin

The list that remains after winnowing has been examined using congo, which uses the GO gene ontology to determine whether sets of genes share common function. The list does show substantial functional coherence [6].

Winnowing based on distance from the origin could have been applied directly to the original matrix A to discard genes whose differential expression is very small. It is instructive to consider the different effect of applying this technique to the U matrix instead. Winnowing based on the U matrix will keep, as interesting, groups of genes with small differential expression, provided that they occur as part of a group with similar, though small, expression. Conversely, single genes with large differential expression will be preserved by winnowing on A directly, but probably not by winnowing on U. Hence, winnowing on U is more discriminating. This argument shows some of the dangers of blindly thresholding during early analysis.

4 SDD Winnowing of Microarray Data

SVD views the data in a geometrical way, in which the rows and columns of matrices are regarded as coordinates in vector spaces. In contrast, the semidiscrete decomposition works with the matrix A itself, looking for ‘bumps’: rectilinearly aligned regions of similar value.

4.1 SemiDiscrete Decomposition. The semidiscrete decomposition [8–10] expresses the matrix A as a sum

A = Σi Ai

where each Ai is the product of a column vector xi of length n, a row vector yi of length m, and a scalar di, like this:

Ai = di xi yi

where the entries of xi and yi are constrained to be only −1, 0, or +1. Note that the product xi yi is n × m, that is, the same shape as A itself. This product forms a stencil or footprint that describes the locations in A that are accounted for by this ‘bump’, while di describes the magnitude of the value that is accounted for at each of these locations. In other words, the SDD describes how to recreate A as the sum of a set of submatrices, each of which is (negatively or positively) constant at the locations described by the footprint and zero elsewhere.

Note that SDD regards values in the matrix as correlated if they are of about the same magnitude, whether positive or negative (corresponding to locations where the entries of xi and yi are +1 or −1).

SDD can be expressed as a matrix equation similar in form to that of SVD as follows:

A = XDY

where X is the horizontal concatenation of the xi, Y the vertical concatenation of the yi, and D a diagonal matrix whose entries are the di. Each row of X corresponds to a row of A, and hence to a gene. Collectively, X defines a ternary hierarchical decomposition of the genes: they are divided into 3 groups according to the value in the first column of X; each of these groups is further subdivided into three subgroups according to the value in the second column of X; and so on.
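The ternary hierarchy induced by the columns of X can be read off directly. A small sketch follows; since computing an actual SDD is not shown here, a random matrix stands in for X, and the gene count of 12 is purely illustrative.

```python
import numpy as np

# Hypothetical X matrix from an SDD: one row per gene, entries in {-1, 0, +1}.
# (In a real analysis X would come from an SDD routine; here it is random.)
rng = np.random.default_rng(2)
X = rng.choice([-1, 0, 1], size=(12, 3))

# The first two columns induce a two-level ternary classification: genes
# sharing the same (x1, x2) pair fall in the same leaf of the tree.
leaves = {}
for gene, row in enumerate(X):
    leaves.setdefault((int(row[0]), int(row[1])), []).append(gene)

# Genes in the (0, 0) leaf took part in none of the first two 'bumps' and
# are candidates for discarding as uninteresting.
for leaf, members in sorted(leaves.items()):
    print(leaf, members)
```

Descending further levels of the tree simply extends the key with more columns of X.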

Because SDD uses the volume of bumps to decide which to remove next, it has one tunable parameter: the relative magnitudes of the matrix entries. If the entries are increased, say by signed squaring, the ‘height’ of each location in the matrix changes while the area each bump covers remains the same; hence small, high bumps are likely to be considered more significant. On the other hand, if the entries are decreased, then the effect is to make low bumps covering many locations seem more important. This can be exploited to look for either outlier structure or mainstream structure in a dataset.

Genes whose expression values have similar magnitudes will tend to fall into the same bumps. The rows of X corresponding to genes of little interest will contain zeroes; hence the zero branch of the hierarchical classification represents the neutral point.

4.2 SDD-based Winnowing of Microarray Data. In an SDD, the number of columns of the X matrix may become quite large, even larger than m, the number of columns of the original matrix. In other words, A can be expressed as the sum of more than m matrices. However, just as in SVD, the earlier terms of this sum tend to be more important; the magnitudes of the di decrease. This has two implications for the hierarchical classification induced by the columns of X: first, the higher levels of the tree represent more important distinctions; second, the 0 branches of the tree represent genes that have not participated in any ‘bumps’ and so are perhaps less interesting.

Some care is needed. Because the decision about the next bump to include in the sum is based on the volume of the bump, bumps of low height that occur in a large number of locations can be considered more important than localized high bumps. Hence, bumps at level 1 may not be more ‘important’ than those at level 2, although they are certainly more important than those at level 10, say.

It is useful to visualize the classification induced by SDD by overlaying it on plots of points from SVD. This combined plot shows how SVD and SDD agree and disagree about the classification of genes.

Figure 5 shows the SDD classification superimposed on the positions from SVD. Here color is used to indicate the first column of X (green = 0, red = +1; the −1 branch is empty), and shape is used to indicate the second column of X (dot = −1, circle = 0, cross = +1).


Figure 5: SDD classification imposed on SVDpositions

This figure shows two interesting things about the SDD classification. First, it is clear where the boundary between ordinary genes and those differentially expressed in leukemia might be (those colored green and red, respectively). The genes related to leukemia form a bump that is strongly visible in the data. Second, further information is provided about how unusual each of the genes related to leukemia is. For example, those genes labelled with red dots are clearly more unusual than those labelled with red crosses, according to both SVD and SDD.

Golub et al. [4] published an early paper analyzing a dataset for genes predictive of the difference between AML and ALL, and data from this paper has been extensively analyzed by others [5, 15]. This dataset is in some ways easy, since the two classes can be correctly predicted based on a single gene (zyxin). However, Golub et al. provide a list of the top fifty genes predictive of each case. Figure 6 shows the SVD plot for this dataset with these 100 genes labelled. It is clear that SVD produces the same kind of results for this dataset as for the dataset above: there is a dimension that discriminates the classes well (in both cases the U2 dimension), and the extremal points appear to be the most interesting. The original paper makes it clear that discrimination between the classes could be done well using other genes, and this is clear from the plot: there are two well-defined opposite ‘arms’ of points, each of which is predictive of one leukemia type. Indeed, others [15] have defined smaller sets of discriminating genes.

Figure 6: SVD plot of the Golub dataset; blue circles: genes predictive of AML, red circles: genes predictive of ALL

5 Focus as Fine Tuning of Winnowing Techniques

When certain genes (or patients) are known to be of interest, both matrix decompositions allow this information to be used to focus the analysis.

For SVD, adding scalar weights to rows or columns of the (previously normalized) data has the effect of moving them further from the origin, and hence making them seem more interesting. However, as a side-effect, the points corresponding to other genes that are correlated with the weighted genes also move further from the origin because of their correlation. This often reveals a cluster of correlated genes that was previously hidden because it was inside another larger cluster and hence hard to see.
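Mechanically, focusing amounts to scaling the chosen rows after normalization and recomputing the SVD. A sketch on synthetic data follows; the gene indices and the weight of 2 are purely illustrative, not taken from the experiments reported here.

```python
import numpy as np

# Synthetic stand-in for a normalized gene-by-sample matrix.
rng = np.random.default_rng(3)
A = rng.normal(size=(200, 24))
A = A - A.mean(axis=0)            # normalize first, then weight

focus = [5, 17, 42]               # genes suspected to be of interest
W = A.copy()
W[focus] *= 2.0                   # push these rows away from the origin

U, s, Vt = np.linalg.svd(W, full_matrices=False)
dist = np.linalg.norm(U, axis=1)  # distance of each gene from the origin

# The weighted genes, and genes correlated with them, now rank as more
# extremal; the most distant genes are candidates for closer inspection.
print(np.argsort(dist)[::-1][:5])
```

Weighting columns (samples) instead of rows works the same way, and is how the high-risk-patient analysis below is carried out.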

The amount of weight to be applied to the 20 most extremal genes in the U2 direction in the original plot requires some experimentation. If a weight of 10 is applied to these genes, the resulting plot has these genes as outliers, and the remaining genes in one undifferentiated cluster. However, it is clear from this weighting that these genes divide themselves into four subclusters: HLA-A and HLA-C; LYZ; HBE1; and the rest. Such a clustering is at least superficially plausible. When a weight of 2 is applied to these genes, there is some movement of related genes outward from the main cluster. Although this improves the visualization of these genes, no new information results, because they were detected by sorting the data positions. Some examples of such genes are: RPS7, VRP, and TPT1.

Weighting of the columns (corresponding to patients) can also reveal properties of genes. For this dataset, we have physician ratings of patient risk (normal versus high risk). Increasing the weight on high-risk patients induces different positions for genes in the SVD plot. In effect, we are now able to examine those genes that are associated with high-risk ALL, rather than just with ALL in general.

Figure 7 shows the SVD plot when the weight on the high-risk patients is increased to 8. There is little visible change from the unweighted case. However, when the SDD classification is overlaid, as shown in Figure 8, a new set of genes, labelled by green dots, begins to separate from the main cluster. There are also an increasing number of genes separating from the main cluster along its long axis.

Figure 7: SVD plot of genes (axes U1, U2, U3) with high-risk patients weighted

Figure 8: SDD labelling of SVD plot of genes (axes U1, U2, U3) with high-risk patients weighted

6 Conclusions

Microarray datasets plausibly contain a great deal of information that is not helpful for determining the biochemical structures underlying particular conditions. Most data mining techniques, even those with built-in abilities to perform attribute selection, do not perform well in the presence of irrelevant or redundant information. We have presented two techniques, based on singular value decomposition and semidiscrete decomposition, that are able to generate a kind of ranking of the presumptive interestingness of genes. This ranking can be used to discard substantial fractions of the genes, allowing more sophisticated techniques to be applied robustly to what remains. The two matrix decompositions interact in ways that are more revealing than using them separately; each also allows for some tuning to allow particular structures to be searched for. We have illustrated these techniques on a dataset of patients with pediatric ALL, where more than 80% of the genes appear not to be significant.

Matrix decompositions have the potential to extract clustering information from datasets, but background knowledge and fairly sophisticated use of the techniques are required. Here we only consider their use as a preprocessing step to reduce dataset size and complexity for other, downstream model-building techniques.

References

[1] O. Alter, P.O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 97(18):10101–10106, 2000.

[2] M.E. Futschik, A. Reeve, and N. Kasabov. Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue. Artificial Intelligence in Medicine, 28:165–189, 2003.

[3] G.H. Golub and C.F. van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

[4] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 15 October 1999.

[5] J. Hardin, D.M. Rocke, and D.L. Woodruff. Robust model-based clustering of genes in microarray data: Are there gene clusters? In Proceedings of CAMDA 2000, 2000.

[6] P.J. Kennedy, S.J. Simoff, D.B. Skillicorn, and D. Catchpoole. Extracting and explaining biological knowledge in microarray data. In Pacific Asia Knowledge Discovery and Data Mining Conference (PAKDD2004), Sydney, May 2004.

[7] Y. Kluger, R. Basri, J.T. Chang, and M. Gerstein. Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research, 13:703–716, 2003.

[8] T.G. Kolda and D.P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16:322–346, 1998.

[9] T.G. Kolda and D.P. O'Leary. Computation and uses of the semidiscrete matrix decomposition. ACM Transactions on Information Processing, 1999.

[10] S. McConnell and D.B. Skillicorn. Semidiscrete decomposition: A bump hunting technique. In Australasian Data Mining Workshop, pages 75–82, December 2002.

[11] D. Nguyen and D. Rocke. Classification of acute leukemia based on DNA microarray gene expressions using partial least squares. In Methods of Microarray Data Analysis, pages 109–124. 2002.

[12] G. Piatetsky-Shapiro, T. Khabaza, and S. Ramaswamy. Capturing best practise for microarray gene expression data analysis. In P. Domingos, C. Faloutsos, T. Senator, H. Kargupta, and L. Getoor, editors, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD2003, pages 407–415. ACM Press, 2003.

[13] V.G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116–5121, 2001.

[14] Y.H. Yang and T. Speed. Design issues for cDNA microarray experiments. Nature Reviews Genetics, 3:579–588, 2002.

[15] E.J. Yeoh, M.E. Ross, S.A. Shurtleff, W.K. Williams, D. Patel, R. Mahfouz, F.G. Behm, S.C. Raimondi, M.V. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C.H. Pui, W.E. Evans, C. Naeve, L. Wong, and J.R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, March 2002.