
arXiv:0707.0481v3 [stat.ME] 25 Jul 2008

The Annals of Applied Statistics

2008, Vol. 2, No. 2, 435–471. DOI: 10.1214/07-AOAS137. © Institute of Mathematical Statistics, 2008

TREELETS—AN ADAPTIVE MULTI-SCALE BASIS FOR SPARSE UNORDERED DATA

By Ann B. Lee,² Boaz Nadler³ and Larry Wasserman²

Carnegie Mellon University, Weizmann Institute of Science and Carnegie Mellon University

In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered—with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets—a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.

Received May 2007; revised August 2007.
¹Discussed in 10.1214/08-AOAS137A, 10.1214/08-AOAS137B, 10.1214/08-AOAS137C, 10.1214/08-AOAS137D, 10.1214/08-AOAS137E and 10.1214/08-AOAS137F; rejoinder at 10.1214/08-AOAS137REJ.
²Supported in part by NSF Grants CCF-0625879 and DMS-0707059.
³Supported by the Hana and Julius Rosen fund and by the William Z. and Eda Bess Novick Young Scientist fund.
Key words and phrases. Feature selection, dimensionality reduction, multi-resolution analysis, local best basis, sparsity, principal component analysis, hierarchical clustering, small sample sizes.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2008, Vol. 2, No. 2, 435–471. This reprint differs from the original in pagination and typographic detail.

1. Introduction. For many modern data sets (e.g., DNA microarrays, financial and consumer data, text documents and internet web pages), the collected data are high-dimensional, noisy, and unordered, with no particular meaning to the given order of the variables. In this paper we introduce a new methodology for the analysis of such data. We describe the theoretical properties of the method, and illustrate the proposed algorithm on hyperspectral image data, internet advertisements, and DNA microarray data. These data sets contain structure in the form of complex groupings of correlated variables. For example, the internet data include more than a thousand binary variables (various features of an image) and a couple of thousand observations (an image in an internet page). Some of the variables are exactly linearly related, while others are similar in more subtle ways. The DNA microarray data include the expression levels of several thousand genes but less than 100 samples (patients). Many sets of genes exhibit similar expression patterns across samples. The task in both cases here is classification. The results can therefore easily be compared with those of other classification algorithms. There is, however, a deeper underlying question that motivated our work: Is there a simple general methodology that, by construction, captures intrinsic localized structures, and that as a consequence improves inference and prediction of noisy, high-dimensional data when sample sizes are small? The method should be powerful enough to describe complex structures on multiple scales for unordered data, yet be simple enough to understand and analyze theoretically. Below we give some more background to this problem.

The key property that allows successful inference and prediction in high-dimensional settings is the notion of sparsity. Generally speaking, there are two main notions of sparsity. The first is sparsity of various quantities related either to the learning problem at hand or to the representation of the data in the original given variables. Examples include a sparse regression or classification vector [Tibshirani (1996)], and a sparse structure to the covariance or inverse covariance matrix of the given variables [Bickel and Levina (2008)]. The second notion is sparsity of the data themselves. Here we are referring to a situation where the data, despite their apparent high dimensionality, are highly redundant with underlying structures that can be represented by only a few features. Examples include data where many variables are approximately collinear or highly related, and data that lie on a nonlinear manifold [Belkin and Niyogi (2005), Coifman et al. (2005)].¹ While the two notions of sparsity are different, they are clearly related. In fact, a low intrinsic dimensionality of the data typically implies, for example, sparse regression or classification vectors, as well as low-rank covariance matrices. However, this relation may not be directly transparent, as the sparsity of these quantities sometimes becomes evident only in a different basis representation of the data.

¹A referee pointed out that another issue with sparsity is that very high-dimensional spaces have very simple structure [Hall, Marron and Neeman (2005), Murtagh (2004), Ahn and Marron (2008)].

In either case, to take advantage of sparsity, one constrains the set of possible parameters of the problem. For the first kind of sparsity, two key tools are graphical models [Whittaker (2001)] that assume statistical dependence between specific variables, and regularization methods that penalize nonsparse solutions [Hastie, Tibshirani and Friedman (2001)]. Examples of such regularization methods are the lasso [Tibshirani (1996)], regularized covariance estimation methods [Bickel and Levina (2008), Levina and Zhu (2007)] and variable selection in high-dimensional graphs [Meinshausen and Buhlmann (2006)]. For the second type of sparsity, where the goal is to find a new set of coordinates or features of the data, two standard "variable transformation" methods are principal component analysis [Jolliffe (2002)] and wavelets [Ogden (1997)]. Each of these two methods has its own strengths and weaknesses which we briefly discuss here.

PCA has gained much popularity due to its simplicity and the unique property of providing a sequence of best linear approximations in a least squares sense. The method has two main limitations. First, PCA computes a global representation, where each basis vector is a linear combination of all the original variables. This makes it difficult to interpret the results and detect internal localized structures in the data. For example, in gene expression data, it may be difficult to detect small subsets of highly correlated genes. The second limitation is that PCA constructs an optimal linear representation of the noisy observations, but not necessarily of the (unknown) underlying noiseless data. When the number of variables p is much larger than the number of observations n, the true underlying principal factors may be masked by the noise, yielding an inconsistent estimator in the joint limit p, n → ∞, p/n → c [Johnstone and Lu (2008)]. Even for a finite sample size n, this property of PCA and other global methods (such as partial least squares and ridge regression) can lead to large prediction errors in regression and classification [Buckheit and Donoho (1995), Nadler and Coifman (2005b)]. Equation (25) in our paper, for example, gives an estimate of the finite-n regression error for a linear mixture error-in-variables model.

In contrast to PCA, wavelet methods describe the data in terms of localized basis functions. The representations are multi-scale, and for smooth data, also sparse [Donoho and Johnstone (1995)]. Wavelets are used in many nonparametric statistics tasks, including regression and density estimation. In recent years wavelet expansions have also been combined with regularization methods to find regression vectors which are sparse in an a priori known wavelet basis [Candes and Tao (2007), Donoho and Elad (2003)]. The main limitation of wavelets is the implicit assumption of smoothness of the (noiseless) data as a function of its variables. In other words, standard wavelets are not suited for the analysis of unordered data. Thus, some work suggests first sorting the data, and then applying fixed wavelets to the reordered data [Murtagh, Starck and Berry (2000), Murtagh (2007)].

In this paper we propose an adaptive method for multi-scale representation and eigenanalysis of data where the variables can occur in any given order. We call the construction treelets, as the method is inspired by both hierarchical clustering trees and wavelets. The motivation for the treelets is two-fold: One goal is to find a "natural" system of coordinates that reflects the underlying internal structure of the data and that is robust to noise. A second goal is to improve the performance of conventional regression and classification techniques in the "large p, small n" regime by finding a reduced representation of the data prior to learning. We pay special attention to sparsity in the form of groupings of similar variables. Such low-dimensional structure naturally occurs in many data sets; for example, in DNA microarray data where genes sharing the same pathway can exhibit highly correlated expression patterns, and in the measured spectra of a chemical compound that is a linear mixture of certain simpler substances. Collinearity of variables is often a problem for a range of existing dimensionality reduction techniques—including least squares, and variable selection methods that do not take variable groupings into account.

The implementation of the treelet transform is similar to the classical Jacobi method from numerical linear algebra [Golub and van Loan (1996)]. In our work we construct a data-driven multi-scale basis by applying a series of Jacobi rotations (PCA in two dimensions) to pairs of correlated variables. The final computed basis functions are orthogonal and supported on nested clusters in a hierarchical tree. As in standard PCA, we explore the covariance structure of the data but—unlike PCA—the analysis is local and multi-scale. As shown in Section 3.2.2, the treelet transform also has faster convergence properties than PCA. It is therefore more suitable as a feature extraction tool when sample sizes are small.

Other methods also relate to treelets. In recent years hierarchical clustering methods have been widely used for identifying diseases and groups of co-expressed genes [Eisen et al. (1998), Tibshirani et al. (1999)]. Many researchers are also developing algorithms that combine gene selection and gene grouping; see, for example, Hastie et al. (2001), Dettling and Buhlmann (2004), Zou and Hastie (2005) among others, and see Fraley and Raftery (2002) for a review of model-based clustering.

The novelty and contribution of our approach is the simultaneous construction of a data-driven multi-scale orthogonal basis and a hierarchical cluster tree. The introduction of a basis enables application of the well-developed machinery of orthonormal expansions, wavelets, and wavelet packets for nonparametric smoothing, data compression, and analysis of general unordered data. As with any orthonormal expansion, the expansion coefficients reflect the effective dimension of the data, as well as the significance of each coordinate. In our case, we even go one step further: The basis functions themselves contain information on the geometry of the data, while the corresponding expansion coefficients indicate their importance; see examples in Sections 4 and 5.

The treelet algorithm has some similarities to the local Karhunen–Loève basis for smooth ordered data by Coifman and Saito (1996), where the basis functions are data-driven but the tree structure is fixed. Our paper is also related to recent independent work on the Haar wavelet transform of a dendrogram by Murtagh (2007). The latter paper also suggests basis functions on a data-driven cluster tree but uses fixed wavelets on a pre-computed dendrogram. The treelet algorithm offers the advantages of both approaches, as it incorporates adaptive basis functions as well as a data-driven tree structure. As shown in this paper, this unifying property turns out to be of key importance for statistical inference and prediction: The adaptive tree structure allows analysis of unordered data. The adaptive treelet functions lead to results that reflect the internal localized structure of the data, and that are stable to noise. In particular, when the data contain subsets of co-varying variables, the computed basis is sparse, with the dominant basis functions effectively serving as indicator functions of the hidden groups. For more complex structure, as illustrated on real data sets, our method returns "softer," continuous-valued loading functions. In classification problems, the treelet functions with the most discriminant power often compute differences between groups of variables.

The organization of the paper is as follows: In Section 2 we describe the treelet algorithm. In Section 3 we examine its theoretical properties. The analysis includes the general large-sample properties of treelets, as well as a specific covariance model with block structure. In Section 4 we discuss the performance of the treelet method on a linear mixture error-in-variables model and give a few illustrative examples of its use in data representation and regression. Finally, in Section 5 we apply our method to classification of hyperspectral data, internet advertisements, and gene expression arrays.

A preliminary version of this paper was presented at AISTATS-07 [Lee and Nadler (2007)].

2. The treelet transform. In many modern data sets the data are not only high-dimensional but also redundant, with many variables related to each other. Hierarchical clustering algorithms [Jain, Murty and Flynn (1999), Xu and Wunsch (2005)] are often used for the organization and grouping of the variables of such data sets. These methods offer an easily interpretable description of the data structure in terms of a dendrogram, and only require the user to specify a measure of similarity between groups of observations or variables. So-called agglomerative hierarchical methods start at the bottom of the tree and, at each level, merge the two groups with highest inter-group similarity into one larger cluster. The novelty of the proposed treelet algorithm is in constructing not only clusters or groupings of variables, but also functions on the data. More specifically, we construct a multi-scale orthonormal basis on a hierarchical tree. As in standard multi-resolution analysis [Mallat (1998)], the treelet algorithm provides a set of "scaling functions" defined on nested subspaces V_0 ⊃ V_1 ⊃ · · · ⊃ V_L, and a set of orthogonal "detail functions" defined on residual spaces {W_ℓ}, ℓ = 1, . . . , L, where V_ℓ ⊕ W_ℓ = V_{ℓ−1}. The treelet decomposition scheme represents a multi-resolution transform but, technically speaking, not a wavelet transform. (In terms of the tiling of "time-frequency" space, the method is more similar to local cosine transforms, which divide the time axis in intervals of varying sizes.) The details of the treelet algorithm are in Section 2.1.

In this paper we measure the similarity M_ij between two variables s_i and s_j with the correlation coefficient

    \rho_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}},    (1)

where Σ_ij = E[(s_i − Es_i)(s_j − Es_j)] is the usual covariance. Other information-theoretic or graph-theoretic similarity measures are also possible. For some applications, one may want to use absolute values of correlation coefficients, or a weighted sum of covariances and correlations as in M_ij = |ρ_ij| + λ|Σ_ij|, where the parameter λ is a nonnegative number.
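In code, this family of similarity measures is a one-liner. The sketch below computes the plain correlation similarity of equation (1) and the weighted variant; the function name and arguments are our own choices, not the authors'.

```python
import numpy as np

def similarity_matrix(X, lam=None):
    """Correlation similarity of eq. (1); with lam set, |rho_ij| + lam*|Sigma_ij|."""
    Sigma = np.cov(X, rowvar=False)          # sample covariance matrix
    d = np.sqrt(np.diag(Sigma))
    rho = Sigma / np.outer(d, d)             # correlation coefficients rho_ij
    return rho if lam is None else np.abs(rho) + lam * np.abs(Sigma)
```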

2.1. The algorithm: Jacobi rotations on pairs of similar variables. The treelet algorithm is inspired by the classical Jacobi method for computing eigenvalues of a matrix [Golub and van Loan (1996)]. There are also some similarities with the Grand Tour [Asimov (1985)], a visualization tool for viewing multidimensional data through a sequence of orthogonal projections. The main difference from Jacobi's method—and the reason why the treelet transform, in general, returns an orthonormal basis different from standard PCA—is that treelets are constructed on a hierarchical tree.

The idea is simple. At each level of the tree, we group together the most similar variables and replace them by a coarse-grained "sum variable" and a residual "difference variable." The new variables are computed by a local PCA (or Jacobi rotation) in two dimensions. Unlike Jacobi's original method, difference variables are stored, and only sum variables are processed at higher levels of the tree. Hence the multi-resolution analysis (MRA) interpretation. The details of the algorithm are as follows:

• At level ℓ = 0 (the bottom of the tree), each observation or "signal" x is represented by the original variables x^(0) = [s_{0,1}, . . . , s_{0,p}]^T, where s_{0,k} = x_k. Associate to these coordinates the Dirac basis B_0 = [φ_{0,1}, φ_{0,2}, . . . , φ_{0,p}], where B_0 is the p × p identity matrix. Compute the sample covariance and similarity matrices Σ^(0) and M^(0). Initialize the set of "sum variables," S = {1, 2, . . . , p}.

• Repeat for ℓ = 1, . . . , L:

1. Find the two most similar sum variables according to the similarity matrix M^(ℓ−1). Let

    (\alpha, \beta) = \arg\max_{i,j \in \mathcal{S}} M^{(\ell-1)}_{ij},    (2)

where i < j, and maximization is only over pairs of sum variables that belong to the set S. As in standard wavelet analysis, difference variables (defined in step 3) are not processed.

2. Perform a local PCA on this pair. Find a Jacobi rotation matrix

    J(\alpha, \beta, \theta_\ell) =
    \begin{pmatrix}
    1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
    \vdots & \ddots & \vdots & & \vdots & & \vdots \\
    0 & \cdots & c & \cdots & -s & \cdots & 0 \\
    \vdots & & \vdots & \ddots & \vdots & & \vdots \\
    0 & \cdots & s & \cdots & c & \cdots & 0 \\
    \vdots & & \vdots & & \vdots & \ddots & \vdots \\
    0 & \cdots & 0 & \cdots & 0 & \cdots & 1
    \end{pmatrix},    (3)

where c = cos(θ_ℓ) and s = sin(θ_ℓ), that decorrelates x_α and x_β; more specifically, find a rotation angle θ_ℓ such that |θ_ℓ| ≤ π/4 and Σ^(ℓ)_{αβ} = Σ^(ℓ)_{βα} = 0, where Σ^(ℓ) = J^T Σ^(ℓ−1) J. This transformation corresponds to a change of basis B_ℓ = B_{ℓ−1} J, and new coordinates x^(ℓ) = J^T x^(ℓ−1). Update the similarity matrix M^(ℓ) accordingly.

3. Multi-resolution analysis. For ease of notation, assume that Σ^(ℓ)_{αα} ≥ Σ^(ℓ)_{ββ} after the Jacobi rotation, where the indices α and β correspond to the first and second principal components, respectively. Define the sum and difference variables at level ℓ as s_ℓ = x^(ℓ)_α and d_ℓ = x^(ℓ)_β. Similarly, define the scaling and detail functions φ_ℓ and ψ_ℓ as columns α and β of the basis matrix B_ℓ. Remove the difference variable from the set of sum variables, S = S \ {β}. At level ℓ, we have the orthonormal treelet decomposition

    x = \sum_{i=1}^{p-\ell} s_{\ell,i}\,\phi_{\ell,i} + \sum_{i=1}^{\ell} d_i\,\psi_i,    (4)

where the new set of scaling vectors {φ_{ℓ,i}}_{i=1}^{p−ℓ} is the union of the vector φ_ℓ and the scaling vectors {φ_{ℓ−1,j}}_{j≠α,β} from the previous level, and the new coarse-grained sum variables {s_{ℓ,i}}_{i=1}^{p−ℓ} are the projections of the original data onto these vectors. As in standard multi-resolution analysis, the first sum is the coarse-grained representation of the signal, while the second sum captures the residuals at different scales.

The output of the algorithm can be summarized in terms of a hierarchical tree with a height L ≤ p − 1 and an ordered set of rotations and pairs of indices, {(θ_ℓ, α_ℓ, β_ℓ)}, ℓ = 1, . . . , L. Figure 1 (left) shows an example of a treelet construction for a "signal" of length p = 5, with the data representations x^(ℓ) at the different levels of the tree shown on the right. The s-components (projections in the main principal directions) represent coarse-grained "sums." We associate these variables to the nodes in the cluster tree. Similarly, the d-components (projections in the orthogonal directions) represent "differences" between node representations at two consecutive levels in the tree. For example, in the figure d_1 ψ_1 = (s_{0,1} φ_{0,1} + s_{0,2} φ_{0,2}) − s_1 φ_{1,1}.

Fig. 1. (Left) A toy example of a hierarchical tree for data of dimension p = 5. At ℓ = 0, the signal is represented by the original p variables. At each successive level ℓ = 1, 2, . . . , p − 1, the two most similar sum variables are combined and replaced by the sum and difference variables s_ℓ, d_ℓ corresponding to the first and second local principal components. (Right) Signal representation x^(ℓ) at different levels. The s- and d-coordinates represent projections along scaling and detail functions in a multi-scale treelet decomposition. Each such representation is associated with an orthogonal basis in R^p that captures the local eigenstructure of the data.

We now briefly consider the complexity of the treelet algorithm on a general data set with n observations and p variables. For a naive implementation with an exhaustive search for the optimal pair (α, β) in equation (2), the overall complexity is m + O(Lp²) operations, where m = O(min(np², pn²)) is the cost of computing the sample covariance matrix by singular value decomposition, and L is the height of the tree. However, by storing the similarity matrices Σ^(0) and M^(0) and keeping track of their local changes, the complexity can be further reduced to m + O(Lp). In other words, the computational cost is comparable to hierarchical clustering algorithms.
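To make the procedure concrete, here is a minimal sketch of the treelet transform in Python/NumPy, using the correlation similarity of equation (1) and a naive exhaustive search for the most similar pair. The names (treelet_transform, sum_vars, and so on) are our own; the paper specifies the algorithm but does not supply an implementation, so this should be read as an illustration rather than the authors' code.

```python
import numpy as np

def treelet_transform(X, max_level=None):
    """Sketch of the treelet transform: one Jacobi rotation on the most
    correlated pair of sum variables at each level of the tree."""
    n, p = X.shape
    L = p - 1 if max_level is None else max_level
    C = np.cov(X, rowvar=False)          # sample covariance Sigma^(0)
    B = np.eye(p)                        # Dirac basis B_0
    sum_vars = set(range(p))             # indices of current sum variables
    rotations = []                       # (level, sum index, diff index, theta)

    for level in range(1, L + 1):
        # Step 1: most similar pair of sum variables (correlation similarity).
        idx = sorted(sum_vars)
        best, (a, b) = -np.inf, (idx[0], idx[1])
        for i, ai in enumerate(idx):
            for bj in idx[i + 1:]:
                rho = C[ai, bj] / np.sqrt(C[ai, ai] * C[bj, bj])
                if rho > best:
                    best, (a, b) = rho, (ai, bj)

        # Step 2: rotation angle that zeroes C[a, b]; any theta satisfying
        # tan(2*theta) = 2*C_ab/(C_aa - C_bb) works, folded into |theta| <= pi/4.
        theta = 0.5 * np.arctan2(2.0 * C[a, b], C[a, a] - C[b, b])
        if theta > np.pi / 4:
            theta -= np.pi / 2
        elif theta < -np.pi / 4:
            theta += np.pi / 2
        c, s = np.cos(theta), np.sin(theta)
        J = np.eye(p)
        J[a, a], J[a, b], J[b, a], J[b, b] = c, -s, s, c
        C = J.T @ C @ J                  # Sigma^(level) = J^T Sigma^(level-1) J
        B = B @ J                        # B_level = B_{level-1} J

        # Step 3: keep the higher-variance coordinate as the sum variable and
        # retire the other one as a difference variable.
        if C[a, a] < C[b, b]:
            a, b = b, a
        sum_vars.discard(b)
        rotations.append((level, a, b, theta))

    return B, rotations                  # orthonormal basis and tree description
```

The coordinates of an observation x in the level-L basis are then simply B.T @ x; the retired indices recorded in rotations identify the difference coordinates of the multi-scale decomposition in equation (4).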

2.2. Selecting the height L of the tree and a "best K-basis." The default choice of the treelet transform is a maximum height tree with L = p − 1; see examples in Sections 5.1 and 5.3. This choice leads to a fully parameter-free decomposition of the data and is also faithful to the idea of a multi-resolution analysis. For more complexity, one can alternatively choose any of the orthonormal (ON) bases at levels ℓ < p − 1 of the tree. The data are then represented by coarse-grained sum variables for a set of clusters in the tree, and difference variables that describe the finer details. In principle, any of the standard techniques in hierarchical clustering can be used in deciding when to stop "merging" clusters (e.g., use a preset threshold value for the similarity measure, or use hypothesis testing for homogeneity of clusters, etc.). In this work we propose a rather different method that is inspired by the best basis paradigm [Coifman and Wickerhauser (1992), Saito and Coifman (1995)] in wavelet signal processing. This approach directly addresses the question of how well one can capture information in the data.

Consider IID data x_1, . . . , x_n, where x_i ∈ R^p is a p-dimensional random vector. Denote the candidate ON bases by B_0, . . . , B_{p−1}, where B_ℓ is the basis at level ℓ in the tree. Suppose now that we are interested in finding the "best" K-dimensional treelet representation for data representation and compression, where the dimension K < p has been determined in advance. It then makes sense to use a scoring criterion that measures the percentage of explained variance for the chosen coordinates. Thus, we propose the following greedy scoring and selection approach:

For a given orthonormal basis B = (w_1, . . . , w_p), assign a normalized energy score E to each vector w_i according to

    \mathcal{E}(w_i) = \frac{E\{|w_i \cdot x|^2\}}{E\{\|x\|^2\}}.    (5)

The corresponding sample estimate is \hat{\mathcal{E}}(w_i) = \sum_{j=1}^{n} |w_i \cdot x_j|^2 / \sum_{j=1}^{n} \|x_j\|^2. Sort the vectors according to decreasing energy, w_(1), . . . , w_(p), and define the score Γ_K of the basis B by summing the K largest terms, that is, let Γ_K(B) ≡ \sum_{i=1}^{K} \mathcal{E}(w_{(i)}).

The best K-basis is the treelet basis with the highest score

    B_L = \arg\max_{B_\ell\,:\,0 \le \ell \le p-1} \Gamma_K(B_\ell).    (6)

It is the basis that best compresses the data with only K components. In case of degeneracies, we choose the coordinate system with the smallest ℓ. Furthermore, to estimate the score Γ_K for a particular data set, we use cross-validation (CV); that is, the treelets are constructed using subsets of the original data set and the score is computed on independent test sets to avoid overfitting. Both theoretical calculations (Section 3.2) and simulations (Section 4.1) indicate that an energy-based measure is useful for detecting natural groupings of variables in data. Many alternative measures (e.g., Fisher's discriminant score, classification error rates, entropy, and other sparsity measures) can also be used. For the classification problem in Section 5.1, for example, we define a discriminant score that measures how well a coordinate separates data from different classes.
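As an illustration of the energy criterion, the sketch below scores a list of candidate bases (for instance, the per-level bases recorded while running the treelet construction) on held-out data and returns the level maximizing Γ_K. The function names are hypothetical and the cross-validation loop over folds is omitted for brevity.

```python
import numpy as np

def energy_scores(B, X):
    """Sample estimate of the normalized energy E(w_i) for each column of B."""
    proj = X @ B                                    # entries are w_i . x_j
    return (proj ** 2).sum(axis=0) / (X ** 2).sum()

def best_K_basis(bases, X_test, K):
    """Pick the level whose basis maximizes Gamma_K, as in equation (6)."""
    best_level, best_score = 0, -np.inf
    for level, B in enumerate(bases):
        E = np.sort(energy_scores(B, X_test))[::-1]  # decreasing energies
        score = E[:K].sum()                          # Gamma_K(B_level)
        if score > best_score:                       # ties keep the smallest level
            best_level, best_score = level, score
    return best_level, bases[best_level]
```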

3. Theory.

3.1. Large sample properties of the treelet transform. In this section we examine the large sample properties of treelets. We introduce a more general definition of consistency that takes into account the fact that the treelet operator (based on correlation coefficients) is multi-valued, and study the method under the stated conditions. We also describe a bootstrap algorithm for quantifying the stability of the algorithm in practical applications. The details are as follows.

First some notation and definitions: Let T(Σ) = J^T Σ J denote the covariance matrix after one step of the treelet algorithm when starting with covariance matrix Σ. Let T^ℓ(Σ) denote the covariance matrix after ℓ steps of the treelet algorithm. Thus, T^ℓ = T ◦ · · · ◦ T corresponds to T applied ℓ times. Define ‖A‖_∞ = max_{j,k} |A_{jk}| and let

    \mathcal{T}_n(\Sigma, \delta_n) = \bigcup_{\|\Lambda - \Sigma\|_\infty \le \delta_n} T(\Lambda).    (7)

Define \mathcal{T}^1_n(\Sigma, \delta_n) = \mathcal{T}_n(\Sigma, \delta_n), and

    \mathcal{T}^\ell_n(\Sigma, \delta_n) = \bigcup_{\Lambda \in \mathcal{T}^{\ell-1}_n} T(\Lambda), \qquad \ell \ge 2.    (8)

Let Σ̂_n denote the sample covariance matrix. We make the following assumptions:

(A1) Assume that x has finite variance and satisfies one of the following three assumptions: (a) each x_j is bounded or (b) x is multivariate normal or (c) there exist M and s such that E(|x_j x_k|^q) ≤ q! M^{q−2} s/2 for all q ≥ 2.

(A2) The dimension p_n satisfies p_n ≤ n^c for some c > 0.

Theorem 1. Suppose that (A1) and (A2) hold. Let δ_n = K√(log n / n), where K > 2c. Then, as n, p_n → ∞,

    P\big(T^\ell(\hat\Sigma_n) \in \mathcal{T}^\ell_n(\Sigma, \delta_n),\ \ell = 1, \ldots, p_n\big) \to 1.    (9)

Some discussion is in order. The result says that T^ℓ(Σ̂_n) is not too far from T^ℓ(Λ) for some Λ close to Σ. It would perhaps be more satisfying to have a result that says that ‖T^ℓ(Σ̂_n) − T^ℓ(Σ)‖_∞ converges to 0. This would be possible if one used covariances to measure similarity, but not in the case of correlation coefficients.


For example, it is easy to construct a covariance matrix Σ with the following properties:

1. ρ_12 is the largest off-diagonal correlation,
2. ρ_34 is nearly equal to ρ_12,
3. the 2 × 2 submatrix of Σ corresponding to x_1 and x_2 is very different than the 2 × 2 submatrix of Σ corresponding to x_3 and x_4.

In this case there is a nontrivial probability that ρ̂_34 > ρ̂_12 due to sample fluctuations. Therefore T(Σ) performs a rotation on the first two coordinates, while T(Σ̂_n) performs a rotation on the third and fourth coordinates. Since the two corresponding submatrices are quite different, the two rotations will be quite different. Hence, T(Σ̂_n) can be quite different from T(Σ). This does not pose any problem since inferring T(Σ) is not the goal. Under the stated conditions, we would consider both T(Σ) and T(Σ̂_n) to be reasonable transformations. We examine the details and include the proof of Theorem 1 in Appendix A.1.

Because T(Σ_1) and T(Σ_2) can be quite different even when the matrices Σ_1 and Σ_2 are close, it might be of interest to study the stability of T(Σ̂_n). This can be done using the bootstrap. Construct B bootstrap replications of the data and corresponding sample covariance matrices Σ̂*_{n,1}, . . . , Σ̂*_{n,B}. Let δ_n = Ĵ_n^{−1}(1 − α), where Ĵ_n is the empirical distribution function of {‖Σ̂*_{n,b} − Σ̂_n‖_∞, b = 1, . . . , B} and α is the confidence level. If F has finite fourth moments and p is fixed, then it follows from Corollary 1 of Beran and Srivastava (1985) that

    \lim_{n\to\infty} P_F(\Sigma \in C_n) = 1 - \alpha,

where C_n = {Λ : ‖Λ − Σ̂_n‖_∞ ≤ δ_n}. Let

    A_n = \{T(\Lambda) : \Lambda \in C_n\}.

It follows that P(T(Σ) ∈ A_n) → 1 − α. The set A_n can be approximated by applying T to all Σ̂*_{n,b} for which ‖Σ̂*_{n,b} − Σ̂_n‖_∞ < δ_n. In Section 4.1 (Figure 3) we use the bootstrap method to estimate confidence sets for treelets.
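The bootstrap recipe translates almost directly into code. Below is a minimal sketch, assuming a helper one_step_treelet(Sigma) that applies the operator T (one Jacobi rotation on the most correlated pair); the helper and all other names are ours, not the paper's.

```python
import numpy as np

def bootstrap_treelet_stability(X, one_step_treelet, B=1000, alpha=0.05, rng=None):
    """Approximate the confidence set A_n for T(Sigma) by the bootstrap."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, _ = X.shape
    Sigma_hat = np.cov(X, rowvar=False)
    dists, candidates = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # resample observations
        Sigma_b = np.cov(X[idx], rowvar=False)
        d = np.abs(Sigma_b - Sigma_hat).max()         # sup-norm distance
        dists.append(d)
        candidates.append((d, one_step_treelet(Sigma_b)))
    delta_n = np.quantile(dists, 1.0 - alpha)         # delta_n = J_n^{-1}(1 - alpha)
    # Keep T(Sigma_b) for the resamples within delta_n of the sample covariance.
    return [T_Sigma for d, T_Sigma in candidates if d < delta_n]
```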

3.2. Treelets on covariance matrices with block structures.

3.2.1. An exact analysis in the limit n → ∞. Many real life data sets, including gene arrays, consumer data sets, and word-documents, display covariance matrices with approximate block structures. The treelet transform is especially well suited for representing and analyzing such data—even for noisy data and small sample sizes.

Here we show that treelets provide a sparse representation when covariance matrices have inherent block structures, and that the loading functions themselves contain information about the inherent groupings. We consider an ideal situation where variables within the same group are collinear, and variables from different groups are weakly correlated. All calculations are exact and computed in the limit of the sample size n → ∞. An analysis of convergence rates later appears in Section 3.2.2.

We begin by analyzing treelets on p random variables that are indistinguishable with respect to their second-order statistics. We show that the treelet algorithm returns scaling functions that are constant on groups of indistinguishable variables. In particular, the scaling function on the full set of variables in a block is a constant function. Effectively, this function serves as an indicator function of a (sometimes hidden) set of similar variables in data. These results, as well as the follow-up main results in Theorem 2 and Corollary 1, are due to the fully adaptive nature of the treelet algorithm—a property that sets treelets apart from methods that use fixed wavelets on a dendrogram [Murtagh (2007)], or adaptive basis functions on fixed trees [Coifman and Saito (1996)]; see Remark 2 for a concrete example.

Lemma 1. Assume that x = (x_1, x_2, . . . , x_p)^T is a random vector with distribution F, mean 0, and covariance matrix Σ = σ_1² 1_{p×p}, where 1_{p×p} denotes a p × p matrix with all entries equal to 1. Then, at any level 1 ≤ ℓ ≤ p − 1 of the tree, the treelet operator T^ℓ (defined in Section 3.1) returns (for the population covariance matrix Σ) an orthogonal decomposition

    T^\ell(\Sigma) = \sum_{i=1}^{p-\ell} s_{\ell,i}\,\phi_{\ell,i} + \sum_{i=1}^{\ell} d_i\,\psi_i,    (10)

with sum variables s_{ℓ,i} = (1/√|A_{ℓ,i}|) \sum_{j∈A_{ℓ,i}} x_j and scaling functions φ_{ℓ,i} = (1/√|A_{ℓ,i}|) I_{A_{ℓ,i}}, which are defined on disjoint index subsets A_{ℓ,i} ⊆ {1, . . . , p} (i = 1, . . . , p − ℓ) with lengths |A_{ℓ,i}| and \sum_{i=1}^{p−ℓ} |A_{ℓ,i}| = p. The expansion coefficients have variances V{s_{ℓ,i}} = |A_{ℓ,i}| σ_1², and V{d_i} = 0. In particular, for ℓ = p − 1,

    T^{p-1}(\Sigma) = s\,\phi + \sum_{i=1}^{p-1} d_i\,\psi_i,    (11)

where s = (1/√p)(x_1 + · · · + x_p) and φ = (1/√p)[1 . . . 1]^T.

Remark 1. Uncorrelated additive noise in (x_1, x_2, . . . , x_p) adds a diagonal perturbation to the 2 × 2 covariance matrices Σ^(ℓ), which are computed at each level in the tree [see (35)]. Such noise may affect the order in which variables are grouped, but the asymptotic results of the lemma remain the same.


Remark 2. The treelet algorithm is robust to noise because it computes data-driven rotations on variables. On the other hand, methods that use fixed transformations on pre-computed trees are often highly sensitive to noise, yielding inconsistent results. Consider, for example, a set of four statistically indistinguishable variables {x_1, x_2, x_3, x_4}, and compare treelets to a Haar wavelet transform on a data-driven dendrogram [Murtagh (2004)]. The two methods return the same results if the variables are merged in the order {{x_1, x_2}, {x_3, x_4}}; that is, s = (1/2)(x_1 + x_2 + x_3 + x_4) and φ = (1/2)[1, 1, 1, 1]^T. Now, a different realization of the noise may lead to the order {{{x_1, x_2}, x_4}, x_3}. A fixed rotation angle of π/4 (as in Haar wavelets) would then return the sum variable s_Haar = (1/√2)((1/√2)((1/√2)(x_1 + x_2) + x_4) + x_3) and scaling function φ_Haar = [1/(2√2), 1/(2√2), 1/√2, 1/2]^T.

Next we consider data where the covariance matrix is a K × K block matrix with white noise added to the original variables. The following main result states that, if variables from different blocks are weakly correlated and the noise level is relatively small, then the K maximum variance scaling functions are constant on each block (see Figure 2 in Section 4 for an example). We make this precise by giving a sufficient condition [equation (13)] in terms of the noise level, and within-block and between-block correlations of the original data. For illustrative purposes, we have reordered the variables. A p × p identity matrix is denoted by I_p, and a p_i × p_j matrix with all entries equal to 1 is denoted by 1_{p_i×p_j}.

Theorem 2. Assume that x = (x_1, x_2, . . . , x_p)^T is a random vector with distribution F, mean 0 and covariance matrix Σ = C + σ² I_p, where σ² represents the variance of white noise in each variable and

    C = \begin{pmatrix}
    C_{11} & C_{12} & \ldots & C_{1K} \\
    C_{12} & C_{22} & \ldots & C_{2K} \\
    \vdots & \vdots & \ddots & \vdots \\
    C_{1K} & C_{2K} & \ldots & C_{KK}
    \end{pmatrix}    (12)

is a K × K block matrix with "within-block" covariance matrices C_{kk} = σ_k² 1_{p_k×p_k} (k = 1, . . . , K) and "between-block" covariance matrices C_{ij} = σ_{ij} 1_{p_i×p_j} (i, j = 1, . . . , K; i ≠ j). If

    \max_{1 \le i,j \le K} \left(\frac{\sigma_{ij}}{\sigma_i \sigma_j}\right) < \frac{1}{1 + 3\max(\delta^2, \delta^4)},    (13)

where δ = σ/min_k σ_k, then the treelet decomposition at level ℓ = p − K has the form

    T^{p-K}(\Sigma) = \sum_{k=1}^{K} s_k\,\phi_k + \sum_{i=1}^{p-K} d_i\,\psi_i,    (14)

where s_k = (1/√p_k) \sum_{j∈B_k} x_j, φ_k = (1/√p_k) I_{B_k}, and B_k represents the set of indices of variables in block k (k = 1, . . . , K). The expansion coefficients have means E{s_k} = E{d_i} = 0, and variances V{s_k} = p_k σ_k² + σ² and V{d_i} = O(σ²), for i = 1, . . . , p − K.

Note that if the conditions of the theorem are satisfied, then all treelets (both scaling and difference functions) associated with levels ℓ > p − K are constant on groups of similar variables. In particular, for a full decomposition at the maximum level ℓ = p − 1 of the tree we have the following key result, which follows directly from Theorem 2:

Corollary 1. Assume that the conditions in Theorem 2 are satisfied. A full treelet decomposition then gives T^{p−1}(Σ) = s φ + \sum_{i=1}^{p−1} d_i ψ_i, where the scaling function φ and the K − 1 detail functions ψ_{p−K+1}, . . . , ψ_{p−1} are constant on each of the K blocks. The coefficients s and d_{p−K+1}, . . . , d_{p−1} reflect between-block structures, as opposed to the coefficients d_1, . . . , d_{p−K}, which only reflect noise in the data with variances V{d_i} = O(σ²) for i = 1, . . . , p − K.

The last result is interesting. It indicates a parameter-free way of finding K, the number of blocks, namely, by studying the energy distribution of a full treelet decomposition. Furthermore, the treelet transform can uncover the block structure even if it is hidden amidst a large number of background noise variables (see Figure 3 for a simulation with finite sample size).

Remark 3. Both Theorem 2 and Corollary 1 can be directly generalized to include p_0 uncorrelated noise variables, so that x = (x_1, . . . , x_{p−p_0}, x_{p−p_0+1}, . . . , x_p)^T, where E(x_i) = 0 and E(x_i x_j) = 0 for i > p − p_0 and j ≠ i. For example, if equation (13) is satisfied, then the treelet decomposition at level ℓ = p − p_0 is

    T^{p-p_0}(\Sigma) = \sum_{k=1}^{K} s_k\,\phi_k + \sum_{i=1}^{p-p_0-K} d_i\,\psi_i + (0, \ldots, 0, x_{p-p_0+1}, \ldots, x_p)^T.

Furthermore, note that according to equation (41) in the Appendix A.3, within-block correlations are smallest ("worst-case scenario") when singletons are merged. Thus, the treelet transform is a stabilizing algorithm; once a few correct coarse-grained variables have been computed, it has the effect of denoising the data.
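To see what the block model and condition (13) look like numerically, here is a short sketch that builds Σ = C + σ²I_p for given block sizes and checks the sufficient condition. The function names and argument conventions are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def block_covariance(block_sizes, sigmas, sigma_between, sigma_noise):
    """Sigma = C + sigma_noise^2 * I for the K-block model of Theorem 2.
    sigma_between is a K x K array of between-block covariances (diagonal unused)."""
    K, p = len(block_sizes), sum(block_sizes)
    edges = np.cumsum([0] + list(block_sizes))
    C = np.zeros((p, p))
    for i in range(K):
        for j in range(K):
            value = sigmas[i] ** 2 if i == j else sigma_between[i][j]
            C[edges[i]:edges[i + 1], edges[j]:edges[j + 1]] = value
    return C + sigma_noise ** 2 * np.eye(p)

def condition_13_holds(sigmas, sigma_between, sigma_noise):
    """Check max_{i != j} sigma_ij/(sigma_i*sigma_j) < 1/(1 + 3*max(delta^2, delta^4))."""
    K = len(sigmas)
    lhs = max(sigma_between[i][j] / (sigmas[i] * sigmas[j])
              for i in range(K) for j in range(K) if i != j)
    delta = sigma_noise / min(sigmas)
    return lhs < 1.0 / (1.0 + 3.0 * max(delta ** 2, delta ** 4))
```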

3.2.2. Convergence rates. The aim of this section is to give a rough estimate of the sample size required for treelets to discover the inherent structures of data. For covariance matrices with block structures, we show that treelets find the correct groupings of variables if the sample size n ≫ O(log p), where p is the dimension of the data. This is a significant result, as standard PCA—on the other hand—is consistent if and only if p/n → 0 [Johnstone and Lu (2008)], that is, when n ≫ O(p). The result is also comparable to that in Bickel and Levina (2008) for regularization of sparse nearly diagonal covariance matrices. One main difference is that their paper assumes an a priori known ordered set of variables in which the covariance matrix is sparse, whereas treelets find such an ordering and coordinate system as part of the algorithm. The argument for treelets and a block covariance model goes as follows.

Assume that there are K blocks in the population covariance matrix Σ. Define A_{L,n} as the event that the K maximum variance treelets, constructed at level L = p − K of the tree, for a data set with n observations, are supported only on variables from the same block. In other words, let A_{L,n} represent the ideal case where the treelet transform finds the exact groupings of variables. Let E_ℓ denote the event that at level ℓ of the tree, the largest between-block sample correlation is less than the smallest within-block sample correlation,

    E_\ell = \{\max \hat\rho^{(\ell)}_B < \min \hat\rho^{(\ell)}_W\}.

According to equations (31)–(32), the corresponding population correlations satisfy

    \max \rho^{(\ell)}_B < \rho_1 \equiv \max_{1 \le i,j \le K} \left(\frac{\sigma_{ij}}{\sigma_i \sigma_j}\right), \qquad \min \rho^{(\ell)}_W > \rho_2 \equiv \frac{1}{\sqrt{1 + 3\max(\delta^2, \delta^4)}},

where δ = σ/min_k σ_k, for all ℓ. Thus, a sufficient condition for E_ℓ is that {max |ρ̂^(ℓ)_B − ρ^(ℓ)_B| < t} ∩ {max |ρ̂^(ℓ)_W − ρ^(ℓ)_W| < t}, where t = (ρ_2 − ρ_1)/2 > 0. We have that

    P(A_{L,n}) \ge P\left(\bigcap_{0 \le \ell < L} E_\ell\right) \ge P\left(\bigcap_{0 \le \ell < L} \{\max |\hat\rho^{(\ell)}_B - \rho^{(\ell)}_B| < t\} \cap \{\max |\hat\rho^{(\ell)}_W - \rho^{(\ell)}_W| < t\}\right).

If (A1) holds, then it follows from Lemma 3 that

    P(A^C_{L,n}) \le \sum_{0 \le \ell < L} \left(P(\max |\hat\rho^{(\ell)}_B - \rho^{(\ell)}_B| > t) + P(\max |\hat\rho^{(\ell)}_W - \rho^{(\ell)}_W| > t)\right) \le L\,c_1 p^2 e^{-n c_2 t^2}

for positive constants c_1, c_2. Thus, the requirement P(A^C_{L,n}) < α is satisfied if the sample size

    n \ge \frac{1}{c_2 t^2} \log\left(\frac{L c_1 p^2}{\alpha}\right).


From the large-sample properties of treelets (Section 3.1), it follows that treelets are consistent if n ≫ O(log p).
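The bound can be turned into a back-of-the-envelope sample-size calculation. In the sketch below the constants c1, c2 and the correlation gap t are placeholders, since the proof only guarantees their existence; the point is simply that the required n grows like log p rather than p.

```python
import numpy as np

def required_sample_size(p, L, t, alpha=0.05, c1=1.0, c2=1.0):
    """Smallest n with L*c1*p^2*exp(-n*c2*t^2) < alpha, per the bound above."""
    return int(np.ceil(np.log(L * c1 * p ** 2 / alpha) / (c2 * t ** 2)))

# Doubling p only adds a constant to the required sample size, e.g. compare
# required_sample_size(1000, 999, t=0.2) with required_sample_size(2000, 1999, t=0.2).
```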

4. Treelets and a linear error-in-variables mixture model. In this section we study a simple error-in-variables linear mixture model (factor model) which, under some conditions, gives rise to covariance matrices with block structures. Under this model, we compare treelets with PCA and variable selection methods. An advantage of introducing a concrete generative model is that we can easily relate our results to the underlying structures or components of real data; for example, different chemical substances in spectroscopy data, genes from the same pathway in microarray data, etc.

In light of this, consider a linear mixture model with K components and additive noise. Each multivariate observation x ∈ R^p has the form

    x = \sum_{j=1}^{K} u_j v_j + \sigma z.    (15)

The components or "factors" u_j are random (but not necessarily independent) variables with variances σ_j². The "loading vectors" v_j are fixed, but typically unknown, linearly independent vectors. In the last term, σ represents the noise level, and z ∼ N_p(0, I) is a p-dimensional random vector.

In the unsupervised setting, we are given a training set {x_i}_{i=1}^n sampled from equation (15). Unsupervised learning tasks include, for example, inference on the number of components K, and on the underlying vectors v_j. In the supervised setting, we consider a data set {x_i, y_i}_{i=1}^n, where the response value y of an observation x is a linear combination of the variables u_j with a random noise term ε,

    y = \sum_{j=1}^{K} \alpha_j u_j + \varepsilon.    (16)

The standard supervised learning task in regression and classification is prediction of y for new data x, given a training set {x_i, y_i}_{i=1}^n.

Linear mixture models are common in many fields, including spectroscopy and gene expression analysis. In spectroscopy, equation (15) is known as Beer's law, where x is the logarithmic absorbance spectrum of a chemical substance measured at p wavelengths, u_j are the concentrations of constituents with pure absorbance spectra v_j, and the response y is typically one of the components, y = u_i. In gene data, x is the measured expression level of p genes, u_j are intrinsic activities of various pathways, and each vector v_j represents the set of genes in a pathway. The quantity y is typically some measure of severity of a disease, such as time until recurrence of cancer. A linear relation between y and the values of u_j, as in equation (16), is commonly assumed.
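For the simulations that follow, data from the model (15)-(16) can be generated with a few lines of code. The sketch below is a generic sampler; sample_factors is a user-supplied function returning the n × K matrix of factors u_j, and all names are ours rather than the paper's.

```python
import numpy as np

def sample_mixture(n, loading_vectors, sample_factors, sigma,
                   alphas=None, noise_eps=0.0, rng=None):
    """Draw X (and optionally y) from x = sum_j u_j v_j + sigma*z and
    y = sum_j alpha_j u_j + eps, as in equations (15) and (16)."""
    rng = np.random.default_rng(0) if rng is None else rng
    V = np.asarray(loading_vectors)            # K x p matrix of loading vectors v_j
    K, p = V.shape
    U = sample_factors(n, rng)                 # n x K matrix of factors u_j
    X = U @ V + sigma * rng.standard_normal((n, p))
    if alphas is None:
        return X
    y = U @ np.asarray(alphas) + noise_eps * rng.standard_normal(n)
    return X, y
```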


4.1. Treelets and a linear mixture model in the unsupervised setting. Consider data {x_i}_{i=1}^n from the model in equation (15). Here we analyze an illustrative example with K = 3 components and loading vectors v_k = I(B_k), where I is the indicator function, and B_k ⊂ {1, 2, . . . , p} are sets of variables with sizes p_k = |B_k| (k = 1, 2, 3). A more general analysis is possible but may not provide more insight.

The unsupervised task is to uncover the internal structure of the linear mixture model from data, for example, to infer the unknown structure of the vectors v_k, including the sizes p_k of the sets B_k. The difficulty of this problem depends, among other things, on possible correlations between the random variables u_j, the variances of the components u_j, and interferences (overlap) between the loading vectors v_k. We present three examples with increasing difficulty. Standard methods, such as principal component analysis, succeed only in the simplest case (Example 1), whereas more sophisticated methods, such as sparse PCA (elastic nets), sometimes require oracle information to correctly fit tuning parameters in the model. The treelet transform seems to perform well in all three cases. Moreover, the results are easy to explain by computing the covariance matrix of the data.

Example 1 (Uncorrelated factors and nonoverlapping loading vectors). The simplest case is when the random variables u_j are all uncorrelated for j = 1, 2, 3, and the loading vectors v_j are nonoverlapping. The population covariance matrix of x is then given by Σ = C + σ² I_p, where the noise-free matrix

    C = \begin{pmatrix}
    C_{11} & 0 & 0 & 0 \\
    0 & C_{22} & 0 & 0 \\
    0 & 0 & C_{33} & 0 \\
    0 & 0 & 0 & 0
    \end{pmatrix}    (17)

is a 4 × 4 block matrix with the first three blocks C_{kk} = σ_k² 1_{p_k×p_k} (k = 1, 2, 3), and the last diagonal block having all entries equal to zero.

Assume that σ_k ≫ σ for k = 1, 2, 3. This is a specific example of a spiked covariance model [Johnstone (2001)], with the three components corresponding to distinct large eigenvalues or "spikes" of a model with background noise. As n → ∞ with p fixed, PCA recovers the hidden vectors v_1, v_2, and v_3, since these three vectors exactly coincide with the principal eigenvectors of Σ. A treelet transform with a height L determined by cross-validation and a normalized energy criterion returns the same results—which is consistent with Section 3.2 (Theorem 2 and Corollary 1).

The difference between PCA and treelets becomes obvious in the "small n, large p" regime. In the joint limit p, n → ∞, standard PCA computes consistent estimators of the vectors v_j (in the presence of noise) if and only if p(n)/n → 0 [Johnstone and Lu (2008)]. For an analysis of PCA for finite p, n, see, for example, Nadler (2007). As described in Section 3.2.2, treelets require asymptotically far fewer observations, with the condition for consistency being log p(n)/n → 0.

Example 2 (Correlated factors and nonoverlapping loading vectors). If the random variables u_j are correlated, treelets are far better than PCA at inferring the underlying localized structure of the data—even asymptotically. Again, this is easy to explain and quantify by studying the data covariance structure. For example, assume that the loading vectors v_1, v_2, and v_3 are nonoverlapping, but that the corresponding factors are dependent according to

    u_1 \sim N(0, \sigma_1^2), \qquad u_2 \sim N(0, \sigma_2^2), \qquad u_3 = c_1 u_1 + c_2 u_2.    (18)

The covariance matrix is then given by Σ = C + σ² I_p, where

    C = \begin{pmatrix}
    C_{11} & 0 & C_{13} & 0 \\
    0 & C_{22} & C_{23} & 0 \\
    C_{13} & C_{23} & C_{33} & 0 \\
    0 & 0 & 0 & 0
    \end{pmatrix}    (19)

with C_{kk} = σ_k² 1_{p_k×p_k} (note that σ_3² = c_1² σ_1² + c_2² σ_2²), C_{13} = c_1 σ_1² 1_{p_1×p_3}, and C_{23} = c_2 σ_2² 1_{p_2×p_3}. Due to the correlations between the u_j, the loading vectors of the block model no longer coincide with the principal eigenvectors, and it is difficult to extract them with PCA.

We illustrate this problem by the example in Zou, Hastie and Tibshirani (2006). Specifically, let

    v_1 = [\,\overbrace{1\ 1\ 1\ 1}^{B_1}\ \ \overbrace{0\ 0\ 0\ 0}^{B_2}\ \ \overbrace{0\ 0}^{B_3}\,]^T,
    v_2 = [\,0\ 0\ 0\ 0\ \ 1\ 1\ 1\ 1\ \ 0\ 0\,]^T,    (20)
    v_3 = [\,0\ 0\ 0\ 0\ \ 0\ 0\ 0\ 0\ \ 1\ 1\,]^T,

where there are p = 10 variables total, and the sets B_j are disjoint with p_1 = p_2 = 4 and p_3 = 2 variables, respectively. Let σ_1² = 290, σ_2² = 300, c_1 = −0.3, c_2 = 0.925, and σ = 1. The corresponding variance σ_3² of u_3 is 282.8, and the covariances of the off-diagonal blocks are σ_13 = −87 and σ_23 = 277.5.

The first three PCA vectors for a training set of 1000 samples are shown in Figure 2 (left). It is difficult to infer the underlying vectors v_i from these results, as ideally, we would detect that, for example, the variables (x_5, x_6, x_7, x_8) are all related and extract the latent vector v_2 from only these variables. Simply thresholding the loadings and discarding small values also fails to achieve this goal [Zou, Hastie and Tibshirani (2006)]. The example illustrates the limitations of a global approach even with an infinite number of observations. In Zou, Hastie and Tibshirani (2006) the authors show by simulation that a combined L1- and L2-penalized least squares method, which they call sparse PCA or elastic nets, correctly identifies the sets of important variables if given "oracle information" on the number of variables p_1, p_2, p_3 in the different blocks. Treelets are similar in spirit to elastic nets, as both methods tend to group highly correlated variables together. In this example the treelet algorithm is able to find both K, the number of components in the mixture model, and the hidden loading vectors v_i—without any a priori knowledge or parameter tuning. Figure 2 (right) shows results from a treelet simulation with a large sample size (n = 1000) and a height L = 7 of the tree, determined by cross-validation (CV) and an energy criterion. The three maximum energy basis vectors correspond exactly to the hidden loading vectors in equation (20).

Fig. 2. In Example 2 PCA fails to find the important variables in the three-component mixture model, as the computed eigenvectors (left) are sensitive to correlations between different components. On the other hand, the three maximum energy treelets (right) uncover the underlying data structures.
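For reference, the setup of Example 2 can be reproduced in a few lines of NumPy; the PCA loadings of Figure 2 (left) are simply the top eigenvectors of the sample covariance. Variable names below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, p3, sigma = 1000, 4, 4, 2, 1.0

v1 = np.r_[np.ones(p1), np.zeros(p2 + p3)]
v2 = np.r_[np.zeros(p1), np.ones(p2), np.zeros(p3)]
v3 = np.r_[np.zeros(p1 + p2), np.ones(p3)]
V = np.vstack([v1, v2, v3])                      # 3 x 10 loading matrix

u1 = rng.normal(0.0, np.sqrt(290.0), n)
u2 = rng.normal(0.0, np.sqrt(300.0), n)
u3 = -0.3 * u1 + 0.925 * u2                      # correlated third factor
U = np.column_stack([u1, u2, u3])

X = U @ V + sigma * rng.standard_normal((n, V.shape[1]))   # model (15)

# Leading eigenvectors of the sample covariance = PCA loadings of Fig. 2 (left)
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_loadings = evecs[:, ::-1][:, :3]
```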

Example 3 (Uncorrelated factors and overlapping loading vectors). Finally, we study a challenging example where the first two loading vectors v_1 and v_2 are overlapping, the sample size n is small, and the background noise level is high. Let {B_1, . . . , B_4} be disjoint subsets of {1, . . . , p}, and let

    v_1 = I(B_1) + I(B_2), \qquad v_2 = I(B_2) + I(B_3), \qquad v_3 = I(B_4),    (21)

where I(B_k) as before represents the indicator function for subset k (k = 1, . . . , 4). The population covariance matrix is then given by Σ = C + σ² I_p, where the noiseless matrix has the general form

    C = \begin{pmatrix}
    C_{11} & C_{12} & 0 & 0 & 0 \\
    C_{12} & C_{22} & C_{23} & 0 & 0 \\
    0 & C_{23} & C_{33} & 0 & 0 \\
    0 & 0 & 0 & C_{44} & 0 \\
    0 & 0 & 0 & 0 & 0
    \end{pmatrix},    (22)

with diagonal blocks C_{11} = σ_1² 1_{p_1×p_1}, C_{22} = (σ_1² + σ_2²) 1_{p_2×p_2}, C_{33} = σ_2² 1_{p_3×p_3}, C_{44} = σ_3² 1_{p_4×p_4}, and off-diagonal blocks C_{12} = σ_1² 1_{p_1×p_2} and C_{23} = σ_2² 1_{p_2×p_3}.

Consider a numerical example with n = 100 observations, p = 500 variables, and noise level σ = 0.5. We choose the same form for the components u_1, u_2, u_3 as in Bair et al. (2006), but associate the first two components with overlapping loading vectors v_1 and v_2. Specifically, the components are given by u_1 = ±0.5 with equal probability, u_2 = I(U_2 < 0.4), and u_3 = I(U_3 < 0.3), where I(x) is the indicator of x, and the U_j are all independent uniform random variables in [0, 1]. The corresponding variances are σ_1² = 0.25, σ_2² = 0.24, and σ_3² = 0.21. As for the blocks B_k, we consider B_1 = {1, . . . , 10}, B_2 = {11, . . . , 50}, B_3 = {51, . . . , 100}, and B_4 = {201, . . . , 400}.

Inference in this case is challenging for several different reasons. The sample size n < p, the loading vectors v_1 and v_2 are overlapping in the region B_2 = {11, . . . , 50}, and the signal-to-noise ratio is low, with the variance σ² of the noise essentially being of the same size as the variances σ_j² of the u_j. Furthermore, the condition in equation (13) is not satisfied even for the population covariance matrix. Despite these difficulties, the treelet algorithm is remarkably stable, returning results that by and large correctly identify the internal structures of the data. The details are summarized below.

Figure 3 (top center) shows the energy score of the best K-basis at different levels of the tree. We used 5-fold cross-validation; that is, we generated a single data set of n = 100 observations, but in each of the 5 computations the treelets were constructed on a subset of 80 observations, with 20 observations left out for the energy score computation. The five curves as well as their average clearly indicate a "knee" at the level L = 300. This is consistent with our expectations that the treelet algorithm mainly merges noise variables at levels L ≥ |⋃_k B_k|. For a tree with "optimum" height L = 300, as indicated by the CV results, we then constructed a treelet basis on the full data set. Figure 3 (top right) shows the energy of these treelets sorted according to descending energy score. The results indicate that we have two dominant treelets, while the remaining treelets have an energy that is either slightly higher than or of the same order as the variance of the noise. In Figure 3 (bottom left) we plot the loadings of the four highest energy treelets. "Treelet 1" (red) is approximately constant on the set B_4 (the support of v_3), "Treelet 2" (blue) is approximately piecewise constant on blocks B_1, B_2, and B_3 (the support of v_1 and v_2), while the low-energy degenerate treelets 3 (green) and 4 (magenta) seem to take differences between variables in the sets B_1, B_2, and B_3. Finally, we computed 95% confidence bands of the treelets using 1000 bootstrap samples and the method described in Section 3.1. Figure 3 (bottom right) indicates that the treelet results for the two maximum energy treelets are rather stable despite the small sample size and the low signal-to-noise ratio. Most of the time the first treelet selects variables from B_4, and most of the time the second treelet selects variables from B_2 and either B_1 or B_3 or both sets. The low-energy treelets seem to pick up differences between blocks B_1, B_2, and B_3, but the exact order in which they select the variables varies from simulation to simulation. As described in the next section, for the purpose of regression, the main point is that the linear span of the first few highest energy treelets is a good approximation of the span of the unknown loading vectors, Span{v_1, . . . , v_K}.

Fig. 3. Top left: The vectors v_1 (blue), v_2 (green), v_3 (red) in Example 3. Top center: The "score" or total energy of K = 3 maximum variance treelets computed at different levels of the tree with 5-fold cross-validation; dotted lines represent the five different simulations and the solid line the average score. Top right: Energy distribution of the treelet basis for the full data set at an "optimal" height L = 300. Bottom left: The four treelets with highest energy. Bottom right: 95% confidence bands by bootstrap for the two dominant treelets.
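The data of Example 3 can be generated in the same self-contained way; note that the index sets below are written 0-based, while the paper counts variables from 1, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 500, 0.5
B1, B2, B3, B4 = range(0, 10), range(10, 50), range(50, 100), range(200, 400)

def indicator(block):
    v = np.zeros(p)
    v[list(block)] = 1.0
    return v

v1 = indicator(B1) + indicator(B2)              # v1 and v2 overlap on B2
v2 = indicator(B2) + indicator(B3)
v3 = indicator(B4)

u1 = np.where(rng.random(n) < 0.5, 0.5, -0.5)   # +/- 0.5 with equal probability
u2 = (rng.random(n) < 0.4).astype(float)
u3 = (rng.random(n) < 0.3).astype(float)

X = (np.outer(u1, v1) + np.outer(u2, v2) + np.outer(u3, v3)
     + sigma * rng.standard_normal((n, p)))     # model (15)
```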

4.2. The treelet transform as a feature selection scheme prior to regression. Knowing some of the basic properties of treelets, we now examine a typical regression or classification problem with data {x_i, y_i}_{i=1}^n given by equations (15) and (16). As the data x are noisy, this is an error-in-variables type problem. Given a training set, the goal is to construct a linear function f : R^p → R to predict y = f(x) = r · x + b for a new observation x.

Before considering the performance of treelets and other algorithms in thissetting, we review some of the properties of the optimal mean-squared error(MSE) predictor. For simplicity, we consider the case y = u1 in equation(16), and denote by P1 :R

p → Rp the projection operator onto the space

Fig. 3. Top left: The vectors v1 (blue), v2 (green), v3 (red) in Example 3. Top center:The “score” or total energy of K = 3 maximum variance treelets computed at differentlevels of the tree with 5-fold cross-validation; dotted lines represent the five different sim-ulations and the solid line the average score. Top right: Energy distribution of the treeletbasis for the full data set at an “optimal” height L= 300. Bottom left: The four treeletswith highest energy. Bottom right: 95% confidence bands by bootstrap for the two dominanttreelets.


In this setting the unbiased MSE-optimal estimator has a regression vector r = v_y/‖v_y‖^2, where v_y = v_1 − P_1 v_1. The vector v_y is the part of the loading vector v_1 that is unique to the response variable y = u_1, since the projection of v_1 onto the span of the loading vectors of the other components (u_2, . . . , u_K) has been subtracted. For example, in the case of only two components, we have that

$$v_y = v_1 - \frac{v_1 \cdot v_2}{\|v_2\|^2}\, v_2. \tag{23}$$

The vector v_y plays a central role in chemometrics, where it is known as the net analyte signal [Lorber, Faber and Kowalski (1997), Nadler and Coifman (2005a)]. Using this vector for regression yields a mean squared error of prediction

$$E\{(y - \hat{y})^2\} = \frac{\sigma^2}{\|v_y\|^2}. \tag{24}$$

We remark that, similar to shrinkage in point estimation, there exist biased estimators with smaller MSE [Gruber (1998), Nadler and Coifman (2005b)], but for large signal-to-noise ratios (σ/‖v_y‖ ≪ 1), such shrinkage is negligible.
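The construction of the net analyte signal and of the corresponding regression vector can be written out directly. The sketch below (plain numpy, with the loading vectors assumed to be given) simply removes from v_1 its projection onto the span of the remaining loading vectors, as in equation (23), and normalizes.

```python
import numpy as np

def net_analyte_signal(v1, V_other):
    """Net analyte signal v_y = v1 - P1 v1, where P1 is the projection onto
    the column span of V_other (the loading vectors v2, ..., vK)."""
    Q, _ = np.linalg.qr(V_other)           # orthonormal basis for span{v2,...,vK}
    return v1 - Q @ (Q.T @ v1)             # remove the interfering directions

def nas_regression_vector(v1, V_other):
    """Unbiased MSE-optimal regression vector r = v_y / ||v_y||^2, which gives
    the prediction error sigma^2 / ||v_y||^2 of equation (24)."""
    v_y = net_analyte_signal(v1, V_other)
    return v_y / np.dot(v_y, v_y)
```

For two components, this reduces exactly to equation (23).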

Many regression methods [including multivariate least squares, partial least squares (PLS), principal component regression (PCR), etc.] attempt to compute the optimal regression vector or net analyte signal (NAS). It can be shown that in the limit n → ∞, both PLS and PCR are MSE-optimal. However, in some applications the number of variables is much larger than the number of observations (p ≫ n). The question at hand is then what effect a small sample size has on these methods when they are combined with noisy high-dimensional data. Both PLS and PCR first perform a global dimensionality reduction from p to k variables, and then apply least squares linear regression on these k features. As described in Nadler and Coifman (2005b), their main limitation is that in the presence of noisy high-dimensional data, the computed projections are noisy themselves. For fixed p and n, a Taylor expansion of the regression coefficient as a function of the noise level σ shows that these methods have an averaged prediction error

$$E\{(y - \hat{y})^2\} \simeq \frac{\sigma^2}{\|v_y\|^2}\left[1 + \frac{c_1}{n} + \frac{c_2\,\sigma^2}{\mu\|v_y\|^2}\,\frac{p^2}{n^2}\,(1 + o(1))\right]. \tag{25}$$

In equation (25) the coefficients c_1 and c_2 are both O(1) constants, independent of σ, p, and n. The quantity µ depends on the specific algorithm used, and is a measure of the variances and covariances of the different components u_j, and of the amount of interference between their loading vectors v_j. The key point of this analysis is that when p ≫ n, the last term in (25) can dominate and lead to large prediction errors. This emphasizes the limitations of global dimensionality reduction methods, and the need for robust feature selection and dimensionality reduction of the data prior to the application of learning algorithms such as PCR and PLS.

Other common approaches to dimensionality reduction in this setting are variable selection schemes, specifically those that choose a small subset of variables based on their individual correlation with the response y. To analyze their performance, we consider a more general dimensionality reduction transformation T : R^p → R^k defined by k orthonormal projections w_i ∈ R^p,

$$Tx = (x \cdot w_1,\ x \cdot w_2,\ \ldots,\ x \cdot w_k). \tag{26}$$

This family of transformations includes variable subset selection methods, where each projection w_j selects one of the original variables. It also includes wavelet methods and our proposed treelet transform. Since an orthonormal projection of a Gaussian noise vector in R^p is a Gaussian vector in R^k, and a relation similar to equation (15) holds between Tx and y, formula (25) still holds, but with the original dimension p replaced by k, and with v_y replaced by its projection Tv_y,

$$E\{(y - \hat{y})^2\} \simeq \frac{\sigma^2}{\|Tv_y\|^2}\left[1 + \frac{c_1}{n} + \frac{c_2\,\sigma^2}{\mu\|Tv_y\|^2}\,\frac{k^2}{n^2}\,(1 + o(1))\right]. \tag{27}$$

Equation (27) indicates that a dimensionality reduction scheme should ideally preserve the net analyte signal of y (‖Tv_y‖ ≃ ‖v_y‖), while at the same time representing the data by as few features as possible (k ≪ p).

The main problem with PCA is that it optimally fits the noisy data, yielding for the noise-free response ‖Tv_y‖/‖v_y‖ ≃ (1 − cσ^2 p^2/n^2). The main limitation of variable subset selection schemes is that in complex settings with overlapping vectors v_j, such schemes may at best yield ‖Tv_y‖/‖v_y‖ < 1. Due to high dimensionality, the latter methods may still achieve better prediction errors than methods that use all the original variables. However, with a more general variable transformation/compression method, one could potentially better capture the NAS. If the data x are a priori known to be smooth continuous signals, a reasonable choice is wavelet compression, which is known to be asymptotically optimal. In the case of unstructured data, we propose to use treelets.
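A small numerical illustration (with made-up dimensions, not the simulation of Section 4.1) makes the trade-off in equation (27) concrete: an orthonormal projection that averages over the support of v_y can retain the full net analyte signal with a single feature, whereas selecting only part of the support necessarily loses some of it.

```python
import numpy as np

def nas_retention(v_y, W):
    """||T v_y|| / ||v_y|| for the orthonormal projections in the columns of W."""
    return np.linalg.norm(W.T @ v_y) / np.linalg.norm(v_y)

# Toy illustration (hypothetical numbers, not the simulation of Section 4.1):
p = 200
v_y = np.zeros(p)
v_y[:100] = 1.0                            # NAS supported on 100 variables

W_subset = np.eye(p)[:, :50]               # keep only the first 50 coordinates

w_sum = np.zeros(p)
w_sum[:100] = 1.0 / np.sqrt(100)           # one coarse-grained sum feature
W_group = w_sum[:, None]

print(nas_retention(v_y, W_subset))        # about 0.71: half the NAS energy lost
print(nas_retention(v_y, W_group))         # 1.0: a single feature keeps all of it
```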

To illustrate these points, we revisit Example 3 in Section 4.1, and compare treelets to the variable subset selection scheme of Bair et al. (2006) for PLS, as well as to global PLS on all variables. As before, we consider a relatively small training set of size n = 100, but here we include 1500 additional noise variables, so that p = 2000 ≫ n. We furthermore assume that the response is given by y = 2u_1. The vectors v_j are shown in Figure 3 (top left). The two vectors v_1 and v_2 overlap, but v_1 (associated with the response) and v_3 are orthogonal. Therefore, the response vector unique to y (the net analyte signal) is given by equation (23); see Figure 4 (left).


Fig. 4. Left: The vector v_y (only the first 150 coordinates are shown, as the rest are zero). Right: Averaged prediction errors of 20 simulation results for the methods, from top to bottom: PLS on all variables (blue), supervised PLS with variable selection (purple), PLS on treelet features (green), and PLS on projections onto the true vectors v_i (red).

To compute v_y, all of the first 100 coordinates (the set B1 ∪ B2 ∪ B3) are needed. However, a feature selection scheme that chooses variables based on their correlation with the response will pick the first 10 coordinates and then the next 40, that is, only variables in the set B1 ∪ B2 (the support of the loading vector v_1). Variables numbered 51 to 100 (set B3), although critical for prediction of the response y = 2u_1, are uncorrelated with it (as u_1 and u_2 are uncorrelated) and are thus not chosen, even in the limit n → ∞. In contrast, even in the presence of moderate noise and a relatively small sample size of n = 100, the treelet algorithm correctly joins together the subsets of variables 1–10, 11–50, 51–100 and 201–400 (i.e., variables in the sets B1, B2, B3, B4). The rest of the variables, which contain only noise, are combined only at much higher levels of the treelet algorithm, as they are asymptotically uncorrelated. Because of this, using only coarse-grained sum variables in the treelet transform yields near optimal prediction errors. In Figure 4 (right) we plot the mean squared error of prediction (MSEP) for 20 different simulations, with prediction error computed on an independent test set of 500 observations. The different methods are PLS on all variables (MSEP = 0.17), supervised PLS with variable selection as in Bair et al. (2006) (MSEP = 0.09), PLS on the 50 treelet features with highest variance, with the level of the tree determined by leave-one-out cross-validation (MSEP = 0.035), and finally PLS on the projection of the noisy data onto the true vectors v_i, assuming they were known (MSEP = 0.030). In all cases, the optimal number of PLS projections (latent variables) is also determined by leave-one-out cross-validation. Due to the high dimensionality of the data, choosing a subset of the original variables performs better than full-variable methods.


However, choosing a subset of treelet features performs even better, yielding an almost optimal prediction error (σ^2/‖v_y‖^2 ≈ 0.03); compare the green and red curves in the figure.
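The comparison above can be reproduced, at least in outline, with standard tools. The sketch below uses scikit-learn's PLSRegression for the PLS step and chooses the number of latent variables by leave-one-out cross-validation; the treelet features themselves (the 50 highest-variance sum variables at a cross-validated level) are assumed to have been computed separately, so T_train and T_test in the usage comment are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def pls_mse(Z_train, y_train, Z_test, y_test, max_comp=10):
    """PLS with the number of latent variables chosen by leave-one-out CV,
    followed by the test mean squared error of prediction."""
    best_c, best_err = 1, np.inf
    for c in range(1, max_comp + 1):
        pred = cross_val_predict(PLSRegression(n_components=c),
                                 Z_train, y_train, cv=LeaveOneOut())
        err = np.mean((pred.ravel() - y_train) ** 2)
        if err < best_err:
            best_c, best_err = c, err
    model = PLSRegression(n_components=best_c).fit(Z_train, y_train)
    return np.mean((model.predict(Z_test).ravel() - y_test) ** 2)

# Hypothetical usage: X_* holds all p variables, T_* the 50 highest-variance
# treelet sum variables at a cross-validated level (computed elsewhere).
# mse_full    = pls_mse(X_train, y_train, X_test, y_test)
# mse_treelet = pls_mse(T_train, y_train, T_test, y_test)
```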

5. Examples.

5.1. Hyperspectral analysis and classification of biomedical tissue. To illustrate how our method works for data with highly complex dependencies between variables, we use an example from hyperspectral imaging of biomedical tissue. Here we analyze a hyperspectral image of an H&E stained microarray section of normal human colon tissue [see Angeletti et al. (2005) for details on the data collection method]. This is an ordered data set of moderate to high dimension. One scan of the tissue specimen returns a 1024 × 1280 data cube or “hyperspectral image,” where each pixel location contains spectral measurements at 28 known wavelengths between 420 nm and 690 nm. These spectra give information about the chemical structure of the tissue. There is, however, redundancy as well as noise in the spectra. The challenge is to find the right coordinate system for this relatively high-dimensional space, and to extract coordinates (features) that contain the most useful information about the chemicals and substances of interest.

We consider the problem of tissue discrimination using only spectral information. With the help of a pathologist, we manually label about 60000 pixels of the image as belonging to three different tissue types (colon cell nuclei, cytoplasm of colon cells, cytoplasm of goblet cells). Figure 5 shows the locations of the labeled pixels and their tissue-specific transmission spectra. Figure 6 shows an example of how treelets can learn the covariance structure for colon cell nuclei (Tissue type 1). The method learns both the tree structure and a basis through a series of Jacobi rotations (see top right panel). By construction, the basis vectors are localized and supported on nested clusters in the tree (see the bottom left and top left panels). As a comparison, we have also computed the PCA eigenvectors. The latter vectors are global and involve all the original variables (see bottom right panel).

In a similar way, we apply the treelet transform to the training data in a 5-fold cross-validation test on the full data set with labeled spectra: Using a (maximum height) treelet decomposition, we construct a basis for the training set in each fold. To each basis vector, we assign a discriminant score that quantifies how well it distinguishes spectra from two different tissue types. The total score for vector w_i is defined as

$$E(w_i) = \sum_{j=1}^{K} \sum_{\substack{k=1 \\ k \neq j}}^{K} H\big(p_i^{(j)} \,\|\, p_i^{(k)}\big), \tag{28}$$

where K = 3 is the number of classes, and H(p_i^(j) ‖ p_i^(k)) is the Kullback–Leibler distance between the estimated marginal density functions p_i^(j) and p_i^(k) of class-j and class-k signals, respectively, in the direction of w_i. We project our training data onto the K (< 28) most discriminant directions, and build a Gaussian classifier in this reduced feature space. This classifier is finally used to label the test data and to estimate the misclassification error rate. The left panel in Figure 7 shows the average CV error rate as a function of the number of local discriminant features. (As a comparison, we show similar results for Haar–Walsh wavelet packets and a local discriminant basis [Saito, Coifman, Geshwind and Warner (2002)], which use the same discriminant score to search through a library of orthonormal wavelet bases.) The straight line represents the error rate if we apply a Gaussian classifier directly to the 28 components in the original coordinate system. The key point is that, with 3 treelet features, we get the same performance as if we used all the original data. Using more treelet features yields an even lower misclassification rate. (Because of the large sample size, the curse of dimensionality is not noticeable for < 15 features.) These results indicate that a treelet representation has advantages beyond the obvious benefits of a dimensionality reduction. We are effectively “denoising” the data by changing our coordinate system and discarding irrelevant coordinates. The right panel in Figure 7 shows the three most discriminant treelet vectors for the full data set. These vectors resemble continuous-valued versions of the indicator functions in Section 3.2. Projecting onto one of these vectors has the effect of first taking a weighted average of adjacent spectral bands, and then computing a difference between averages of bands in different regions of the spectrum. (In Section 5.3, Figure 10, we will see another example where the loadings themselves contain information about structure in the data.)
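The discriminant score of equation (28) is straightforward to compute once the marginal densities are estimated. The sketch below uses simple histogram estimates of the projected class densities (a stand-in for the empirical density estimation of Saito et al.), and is meant only to illustrate the scoring step, not to reproduce the exact procedure used here.

```python
import numpy as np

def kl_hist(a, b, bins=32, eps=1e-12):
    """Kullback-Leibler distance H(p_a || p_b) between histogram estimates of
    the densities of two 1-D samples (a simple stand-in for the empirical
    density estimates used for the local discriminant basis)."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, edges = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    w = np.diff(edges)
    p, q = p + eps, q + eps
    return float(np.sum(w * p * np.log(p / q)))

def discriminant_score(w, X, labels):
    """Total score E(w) = sum over ordered pairs j != k of H(p^(j) || p^(k)),
    computed from the projections of each class onto the vector w
    (cf. equation (28))."""
    proj = {c: X[labels == c] @ w for c in np.unique(labels)}
    classes = list(proj)
    return sum(kl_hist(proj[j], proj[k])
               for j in classes for k in classes if j != k)

# Rank the columns of a basis W and keep the most discriminant directions:
# scores = [discriminant_score(W[:, i], X_train, y_train) for i in range(W.shape[1])]
# top = np.argsort(scores)[::-1][:n_features]
```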

Fig. 5. Left: Microscopic image of a cross-section of colon tissue. At each pixel position, the spectral characteristics of the tissue are measured at 28 different wavelengths (λ = 420, 430, . . . , 690 nm). For our analysis, we manually label about 60000 individual spectra: Red marks the locations of spectra of “Tissue type 1” (nuclei), green “Tissue type 2” (cytoplasm of colon cells), and blue corresponds to samples of “Tissue type 3” (cytoplasm of goblet cells). Right: Spectral signatures of the 3 different tissue types. Each plot shows the sample mean and standard deviation of the log-transmission spectra.


Fig. 6. Top left: Learned tree structure for nuclei (Tissue type 1). In the dendrogram the height of each U-shaped line represents the distance d_ij = (1 − ρ_ij)/2, where ρ_ij is the correlation coefficient of the two variables combined. The leaf nodes represent the p = 28 original spectral bands. Top right: 2D scatter plots of the data at levels ℓ = 1, . . . , p − 1. Each plot shows 500 randomly chosen data points; the lines indicate the first principal directions and rotations relative to the variables that are combined. (Note that a Haar wavelet corresponds to a fixed π/4 rotation.) Bottom left: Learned orthonormal basis. Each row represents a localized vector, supported on a cluster in the hierarchical tree. Bottom right: Basis computed by a global eigenvector analysis (PCA).


5.2. A classification example with an internet advertisement data set. Here we study an internet advertisement data set from the UCI ML repository [Kushmerick (1999)]. This is an example of an unordered data set of high dimension where many variables are collinear. After removal of the first three continuous variables, this set contains 1555 binary variables and 3279 observations, labeled as belonging to one of two classes. The goal is to predict whether a new observation (an image in an internet page) is an internet advertisement or not, given the values of its 1555 variables (various features of the image).


Fig. 7. Left: Average misclassification rate (in a 5-fold cross-validation test) as a function of the number of top discriminant features retained, for a treelet decomposition (rings), and for Haar–Walsh wavelet packets (crosses). The constant level around 2.5% indicates the performance of a classifier directly applied to the 28 components in the original coordinate system. Right: The top 3 local discriminant basis (LDB) vectors in a treelet decomposition of the full data set.

Table 1
Classification test errors for an internet advertisement data set

Classifier   Full data set      Reduced data set   Final representation with
             (1555 variables)   (760 variables)    coarse-grained treelet features
LDA          5.5%               5.1%               4.5%
1-NN         4.0%               4.0%               3.7%


With standard classification algorithms, one can easily obtain a generalization error of about 5%. The first column in Table 1, labeled “full data set,” shows the misclassification rate for linear discriminant analysis (LDA) (with the additional assumption of a diagonal covariance matrix), and for 1-nearest neighbor (1-NN) classification. The average is taken over 25 randomly selected training and test sets, with 3100 and 179 observations each.

The internet-ad data set has several distinctive properties that are clearly revealed by an analysis with treelets: First of all, several of the original variables are exactly linearly related. As the data are binary (−1 or 1), these variables are either identical or of opposite values. In fact, one can reduce the dimensionality of the data from 1555 to 760 without loss of information. The second column in the table, labeled “reduced data set,” shows the decrease in error rate after a lossless compression where we have simply removed redundant variables.


Fig. 8. Left: The correlation matrix of the first 200 out of 760 variables in the order they were originally given. Right: The corresponding matrix, after sorting all variables according to the order in which they are combined by the treelet algorithm.

Furthermore, of these remaining 760 variables, many are highly related, with subsets of similar variables. The treelet algorithm automatically identifies these groups, as the algorithm reorders the variables during the basis computation, encoding the information in such a group with a coarse-grained sum variable and difference variables for the residuals. Figure 8, left, shows the correlation matrix of the first 200 out of 760 variables in the order they are given. To the right, we see the corresponding matrix after sorting all variables according to the order in which they are combined by the treelet algorithm. Note how the (previously hidden) block structures “pop out.”
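The reordering step can be illustrated with a few lines of Python. Since the full treelet code is not reproduced here, the sketch below uses an ordinary average-linkage clustering on the correlation distance d_ij = (1 − ρ_ij)/2 as a stand-in for the treelet merge order; sorting the variables by the resulting dendrogram leaf order produces the same kind of “popping out” of block structure seen in Figure 8.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def reorder_by_merge_order(X):
    """Reorder variables so that variables merged early end up adjacent.
    Average-linkage clustering on the correlation distance (1 - rho)/2 is
    used here as a stand-in for the treelet merge order."""
    R = np.corrcoef(X, rowvar=False)
    D = (1.0 - R) / 2.0
    iu = np.triu_indices_from(D, k=1)        # condensed distance vector
    order = leaves_list(linkage(D[iu], method="average"))
    return order, R[np.ix_(order, order)]

# order, R_sorted = reorder_by_merge_order(X_binary)  # block structure "pops out"
```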

A more detailed analysis of the reduced data set with 760 variables shows that there are more than 200 distinct pairs of variables with a correlation coefficient larger than 0.95. Not surprisingly, as shown in the right column of Table 1, treelets can further increase the predictive performance on this data set, yielding results competitive with other feature selection methods in the literature [Zhao and Liu (2007)]. All results in Table 1 are averaged over 25 different simulations. As in Section 4.2, the results are achieved at a level L < p − 1, by projecting the data onto the treelet scaling functions, that is, by only using coarse-grained sum variables. The height L of the tree is found by 10-fold cross-validation and a minimum prediction error criterion.

5.3. Classification and analysis of DNA microarray data. We conclude with an application to DNA microarray data. In the analysis of gene expression, many methods first identify groups of highly correlated variables and then choose a few representative genes for each group (a so-called gene signature). The treelet method also identifies subsets of genes that exhibit similar expression patterns, but, in contrast, replaces each such localized group by a linear combination that encodes the information from all variables in that group. As illustrated in previous examples in the paper, such a representation typically regularizes the data, which improves the performance of regression and classification algorithms.

Another advantage is that the treelet method yields a multi-scale data representation well-suited for the application. The benefits of hierarchical clustering in exploring and visualizing microarray data are well recognized in the field [Eisen et al. (1998), Tibshirani et al. (1999)]. It is, for example, known that a hierarchical clustering (or dendrogram) of genes can sometimes reveal interesting clusters of genes worth further investigation. Similarly, a dendrogram of samples may identify cases with similar medical conditions. The treelet algorithm automatically yields such a re-arrangement and interpretation of the data. It also provides an orthogonal basis for data representation and compression.

We illustrate our method on the leukemia data set of Golub et al. (1999). These data monitor expression levels for 7129 genes in 72 patients suffering from acute lymphoblastic leukemia (ALL, 47 cases) or acute myeloid leukemia (AML, 25 cases). The data are known to have a low intrinsic dimensionality, with groups of genes having similar expression patterns across samples (cell lines). The full data set is available at http://www.genome.wi.mit.edu/MPR, and includes a training set of 38 samples and a test set of 34 samples.

Prior to analysis, we use a standard two-sample t-test to select genes that are differentially expressed in the two leukemia types. Using the training data, we perform a full (i.e., maximum height) treelet decomposition of the p = 1000 most “significant” genes. We sort the treelets according to their energy content [equation (5)] on the training samples, and project the test data onto the K treelets with the highest energy score. The reduced data representation of each sample (from p genes to K features) is finally used to classify the samples into the two leukemia types, ALL or AML. We examine two different classification schemes:

In the first case, we apply a linear Gaussian classifier (LDA). As in Section 5.2, the treelet transform serves as a feature extraction and dimensionality reduction tool prior to classification. The appropriate value of the dimension K is chosen by 10-fold cross-validation (CV). We divide the training set at random into 10 approximately equal-size parts, perform a separate t-test in each fold, and choose the K-value that leads to the smallest CV classification error (Figure 9, left).
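This first scheme can be sketched as follows. The code assumes the treelet basis W of the selected training genes has already been computed (it stands in for the maximum-height decomposition described above); gene screening uses a standard two-sample t-test from scipy, and the energy of a treelet is taken here to be the variance of its coefficient on the training samples, which is one simple reading of the score in equation (5).

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def select_genes(X, y, n_genes=1000):
    """Two-sample t-test screening: keep the most differentially expressed genes."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    return np.argsort(np.abs(t))[::-1][:n_genes]

def lda_on_treelets(X_train, y_train, X_test, W, K):
    """Project on the K highest-energy columns of the treelet basis W
    (a placeholder for the maximum-height decomposition of the selected
    genes) and classify with LDA.  Energy is taken here to be the variance
    of the treelet coefficients on the training samples."""
    C_train = X_train @ W
    top = np.argsort(C_train.var(axis=0))[::-1][:K]
    clf = LinearDiscriminantAnalysis().fit(C_train[:, top], y_train)
    return clf.predict((X_test @ W)[:, top])
```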

In the second case, we classify the data using a novel two-way treelet decomposition scheme: we first compute treelets on the genes, then we compute treelets on the samples. As before, each sample (patient) is represented by K treelet features instead of the p original genes. The dimension K is chosen by cross-validation on the training set. However, instead of applying a standard classifier, we construct treelets on the samples using the new patient profiles. The two main branches of the associated dendrogram divide the samples into two classes, which are labeled using the training data and a majority vote. Such a two-way decomposition, of both genes and samples, leads to classification results competitive with other algorithms; see Figure 9, right, and Table 2 for a comparison with benchmark results in Zou and Hastie (2005). Moreover, the proposed method returns orthogonal functions with continuous-valued information on hierarchical groupings of genes or samples.
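The final labeling step of the two-way scheme, splitting the samples at the two main branches of a dendrogram and naming the branches by majority vote over the training labels, can be sketched as below. An ordinary correlation-based hierarchical clustering of the reduced patient profiles is used here in place of the treelet tree on samples, so this is an illustration of the labeling logic rather than the exact construction.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def two_branch_labels(profiles, train_idx, y_train):
    """Cut a dendrogram of the (treelet-reduced) patient profiles into its two
    main branches and name each branch by a majority vote over the training
    samples it contains.  A correlation-based hierarchical clustering is used
    here as a stand-in for the treelet tree on samples."""
    Z = linkage(pdist(profiles, metric="correlation"), method="average")
    branch = fcluster(Z, t=2, criterion="maxclust")     # two main branches
    labels = np.empty(len(profiles), dtype=int)
    for b in (1, 2):
        votes = y_train[branch[train_idx] == b]
        labels[branch == b] = np.bincount(votes).argmax() if len(votes) else 0
    return labels
```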

Fig. 9. Number of misclassified cases as a function of the number of treelet features. Left: LDA on treelet features; ten-fold cross-validation gives the lowest misclassification rate (2/38) for K = 3 treelets; the test error rate is then 3/34. Right: Two-way decomposition of both genes and samples; the lowest CV misclassification rate (0/38) is for K = 4; the test error rate is then 1/34.

Fig. 10. Left: The gene expression data with rows (genes) and columns (samples) ordered according to a hierarchical two-way clustering with treelets. (For display purposes, the expression levels for each gene are here normalized across the samples to zero mean and unit standard deviation.) Right: The three maximum energy treelets on ordered samples. The loadings of the highest-energy treelet (red) are a good predictor of the true labels (blue circles).


Table 2
Leukemia misclassification rates; courtesy of Zou and Hastie (2005)

Method                                                   Ten-fold CV error   Test error
Golub et al. (1999)                                      3/38                4/34
Support vector machines (Guyon et al. (2002))            2/38                1/34
Nearest shrunken centroids (Tibshirani et al. (2002))    2/38                2/34
Penalized logistic regression (Zhu and Hastie (2004))    2/38                1/34
Elastic nets (Zou and Hastie (2005))                     3/38                0/34
LDA on treelet features                                  2/38                3/34
Two-way treelet decomposition                            0/38                1/34


Figure 10 (left) displays the original microarray data, with rows (genes) and columns (samples) ordered according to a hierarchical two-way clustering with treelets. The graph to the right shows the three maximum energy treelets on ordered samples. Note that the loadings are small for the two cases that are misclassified. In particular, “Treelet 2” is a good “continuous-valued” indicator function of the true classes. The results for the treelets on genes are similar. The key point is that whenever there is a group of highly correlated variables (genes or samples), the algorithm tends to choose a coarse-grained variable for that whole group (see, e.g., “Treelet 3” in the figure). The weighting is adaptive, with loadings that reflect the complex internal data structure.

6. Conclusions. In this paper we described a variety of situations where the treelet transform outperforms PCA and some common variable selection methods. The method is especially useful as a feature extraction and regularization tool in situations where variables are collinear and/or the data are noisy, with the number of variables, p, far exceeding the number of observations, n. The algorithm is fully adaptive, and returns both a hierarchical tree and loading functions that reflect the internal localized structure of the data. We showed that, for a covariance model with block structure, the maximum energy treelets converge to a solution where they are constant on each set of indistinguishable variables. Furthermore, the convergence rate of treelets is considerably faster than that of PCA, with the required sample size for consistency being n ≫ O(log p) instead of n ≫ O(p). Finally, we demonstrated the applicability of treelets on several real data sets with highly complex dependencies between variables.


APPENDIX

A.1. Proof of Theorem 1. Let x = (x_1, . . . , x_p)^T be a random vector with distribution F and covariance matrix Σ = Σ_F. Let ρ_ij denote the correlation between x_i and x_j. Let x_1, . . . , x_n be a sample from F, and denote the sample covariance matrix and sample correlations by Σ̂ and ρ̂_ij. Let S_p denote the set of all p × p covariance matrices. Let

$$\mathcal{F}_n(b) = \Big\{ F : \Sigma_F \text{ is positive definite},\ \min_{1 \le j \le p_n} \sigma_j \ge b \Big\}.$$

Any one of the assumptions (A1a), (A1b), or (A1c) is sufficient to guarantee certain exponential inequalities.

Lemma A.1. There exist positive constants c_1, c_2 such that, for every ε > 0,

$$P(\|\hat{\Sigma}_{jk} - \Sigma_{jk}\|_\infty > \epsilon) \le c_1 p_n^2 e^{-n c_2 \epsilon^2}. \tag{29}$$

Hence,

$$\|\hat{\Sigma}_{jk} - \Sigma_{jk}\|_\infty = O_P\!\left(\sqrt{\frac{\log n}{n}}\right).$$

Proof. Under (A1), (29) is an immediate consequence of standard exponential inequalities and the union bound. The last statement follows by setting $\epsilon_n = K\sqrt{\log n/n}$ for sufficiently large K and applying (A2). □

Lemma A.2. Assume either that (i) x is multivariate normal or that (ii) max_{1≤j≤p} |x_j| ≤ B for some finite B and min_j σ_j ≥ b > 0. Then, there exist positive constants c_3, c_4 such that, for every ε > 0,

$$P\Big(\max_{jk} |\hat{\rho}_{jk} - \rho_{jk}| > \epsilon\Big) \le c_3 p^2 e^{-n c_4 \epsilon^2}. \tag{30}$$

Proof. Under normality, this follows from Kalisch and Buhlmann (2007). Under (ii), note that h(σ_1, σ_2, σ_12) = σ_12/(σ_1 σ_2) satisfies

$$|h(\sigma_1, \sigma_2, \sigma_{12}) - h(\sigma_1', \sigma_2', \sigma_{12}')| \le \frac{3\max\{|\sigma_1 - \sigma_1'|,\ |\sigma_2 - \sigma_2'|,\ |\sigma_{12} - \sigma_{12}'|\}}{b^2}.$$

The result then follows from the previous lemma. □

Let J_θ denote the 2 × 2 rotation matrix of angle θ. Let

$$J_\Sigma = \begin{pmatrix} \cos(\theta(\Sigma)) & -\sin(\theta(\Sigma)) \\ \sin(\theta(\Sigma)) & \cos(\theta(\Sigma)) \end{pmatrix} \tag{31}$$


denote the Jacobi rotation, where

$$\theta(\Sigma) = \frac{1}{2}\tan^{-1}\!\left(\frac{2\Sigma_{12}}{\Sigma_{11} - \Sigma_{22}}\right). \tag{32}$$

Lemma A.3. Let F be a bivariate distribution with 2 × 2 covariance matrix Σ. Let J = J_Σ and Ĵ = J_Σ̂. Then,

$$P(\|\hat{J}^T \hat{\Sigma} \hat{J} - J^T \Sigma J\|_\infty > \epsilon) \le c_5 p^2 e^{-n c_6 \epsilon^2}. \tag{33}$$

Proof. Note that θ(Σ) is a bounded, uniformly continuous function of Σ. Similarly, the entries of J_θ are also bounded, uniformly continuous functions of Σ. The result then follows from (29). □

For any pair (α, β), let θ(α, β) denote the angle of the principal component rotation and let J(α, β, θ) denote the Jacobi rotation on (α, β). Define the selection operator

$$\Delta : \mathcal{S}_p \to \{(j,k) : 1 \le j < k \le p\}$$

by ∆(Σ) = (α, β), where ρ_{αβ} = argmax_{ij} ρ_{ij}. In case of ties, define ∆(Σ) to be the set of pairs (α, β) at which the maximum occurs. Hence, ∆ is multivalued on a subset S_p^* ⊂ S_p of measure 0. The one-step treelet operator T : S_p → S_p is defined by

$$T(\Sigma) = \{J^T \Sigma J : J = J(\alpha, \beta, \theta(\alpha, \beta)),\ (\alpha, \beta) \in \Delta(\Sigma)\}. \tag{34}$$

Formally, T is a multivalued map because of potential ties.
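For concreteness, a direct (and simplified) implementation of the selection operator ∆ and of the one-step operator T of equation (34) is sketched below in Python; ties are ignored and the rotation angle follows equation (32), computed with arctan2 as a numerically safe variant.

```python
import numpy as np

def jacobi_angle(C, a, b):
    # Rotation angle of equation (32); arctan2 is used as a numerically
    # safe variant that also handles C[a, a] == C[b, b].
    return 0.5 * np.arctan2(2.0 * C[a, b], C[a, a] - C[b, b])

def treelet_step(C):
    """One application of the operator T of equation (34): select the most
    correlated pair (alpha, beta) and return J^T C J, where J is the Jacobi
    rotation of equation (31).  Ties and the sum/difference bookkeeping of
    the full treelet algorithm are ignored in this sketch."""
    p = C.shape[0]
    R = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))    # correlation matrix
    mask = np.triu(np.ones_like(R, dtype=bool), k=1)     # pairs with j < k
    a, b = np.unravel_index(np.argmax(np.where(mask, R, -np.inf)), R.shape)
    th = jacobi_angle(C, a, b)
    J = np.eye(p)
    J[a, a] = J[b, b] = np.cos(th)
    J[a, b], J[b, a] = -np.sin(th), np.sin(th)
    return J.T @ C @ J, (a, b), th
```

Applying treelet_step repeatedly, while keeping track of which coordinate plays the role of the sum variable, gives the iterative construction described earlier in the paper.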

Proof of Theorem 1. The proof is immediate from the lemmas. For the matrices Σ̂_n, we have that ‖Σ̂_n − Σ‖_∞ < δ_n except on a set A_n^c of probability tending to 0 at rate O(n^{−(K−2c)}). Hence, on the set A_n = {Σ̂_n : ‖Σ^*_{n,b} − Σ_n‖_∞ < δ_n}, we have that T(Σ̂_n) ∈ T_n(Σ). The same holds at each step. □

A.2. Proof of Lemma 1. Consider first the case where at each level in the tree the treelet operator combines a coarse-grained variable with a singleton, according to {{x_1, x_2}, x_3}, . . . . Let s_0 = x_1. For ℓ = 1, the 2 × 2 covariance submatrix is

$$\Sigma^{(0)} \equiv V\{(s_0, x_2)\} = \sigma_1^2 \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.$$

A principal component analysis of Σ^(0) gives θ_1 = π/4 and s_1 = (x_1 + x_2)/√2. By induction, for 1 ≤ ℓ ≤ p − 1,

$$\Sigma^{(\ell-1)} \equiv V\{(s_{\ell-1}, x_{\ell+1})\} = \sigma_1^2 \begin{pmatrix} \ell & \sqrt{\ell} \\ \sqrt{\ell} & 1 \end{pmatrix}.$$

PCA on Σ^(ℓ−1) gives the (unconstrained) rotation angle θ_ℓ = arctan√ℓ and the new sum variable

$$s_\ell = \frac{1}{\sqrt{\ell+1}} \sum_{i=1}^{\ell+1} x_i.$$


More generally, at level ℓ of the tree, the treelet operator combines two sum variables u = (1/√m) ∑_{i∈A_u} x_i and v = (1/√n) ∑_{j∈A_v} x_j, where A_u, A_v ⊆ {1, . . . , p} denote two disjoint index subsets with m = |A_u| and n = |A_v| terms, respectively. The 2 × 2 covariance submatrix is

$$\Sigma^{(\ell-1)} \equiv V\{(u, v)\} = \sigma_1^2 \begin{pmatrix} m & \sqrt{mn} \\ \sqrt{mn} & n \end{pmatrix}. \tag{35}$$

The correlation coefficient ρ_uv = 1 for any pair (u, v); thus, the treelet operator T_ℓ is a multivalued function of Σ. A principal component analysis of Σ^(ℓ−1) gives the eigenvalues λ_1 = m + n, λ_2 = 0, and eigenvectors

$$e_1 = \frac{1}{\sqrt{m+n}}(\sqrt{m},\, \sqrt{n})^T, \qquad e_2 = \frac{1}{\sqrt{m+n}}(-\sqrt{n},\, \sqrt{m})^T.$$

The rotation angle is

$$\theta_\ell = \arctan\sqrt{\frac{n}{m}}. \tag{36}$$

The new sum and difference variables at level ℓ are given by

$$s_\ell = \frac{1}{\sqrt{m+n}}\big(\sqrt{m}\,u + \sqrt{n}\,v\big) = \frac{1}{\sqrt{m+n}} \sum_{i \in A_u \cup A_v} x_i, \tag{37}$$

$$d_\ell = \frac{1}{\sqrt{m+n}}\big(-\sqrt{n}\,u + \sqrt{m}\,v\big) = \frac{1}{\sqrt{m+n}} \left( -\sqrt{\frac{n}{m}} \sum_{i \in A_u} x_i + \sqrt{\frac{m}{n}} \sum_{j \in A_v} x_j \right).$$

The results of the lemma follow.
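A quick numerical check of this calculation (pure numpy, with arbitrary group sizes) confirms that the leading eigenvector of the covariance in equation (35) is proportional to (√m, √n)^T, so that the eigenvalues are m + n and 0 and the rotation angle is arctan√(n/m), as in equation (36):

```python
import numpy as np

def merge_sums(m, n, sigma1=1.0):
    """PCA of the 2 x 2 covariance in equation (35) for two disjoint sum
    variables built from m and n indistinguishable variables."""
    C = sigma1**2 * np.array([[m, np.sqrt(m * n)],
                              [np.sqrt(m * n), n]])
    vals, vecs = np.linalg.eigh(C)
    e1 = vecs[:, np.argmax(vals)]          # leading eigenvector ~ (sqrt(m), sqrt(n))
    e1 = e1 * np.sign(e1[0])
    theta = np.arctan2(e1[1], e1[0])       # should equal arctan(sqrt(n/m))
    return vals.max(), vals.min(), theta

lam1, lam2, theta = merge_sums(m=3, n=5)
print(np.isclose(lam1, 8.0), np.isclose(lam2, 0.0))        # m + n and 0
print(np.isclose(theta, np.arctan(np.sqrt(5 / 3))))        # equation (36)
```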

A.3. Proof of Theorem 2. Assume that variables from different blocks have not been merged at levels ℓ′ < ℓ, where 1 ≤ ℓ ≤ p. From Lemma 1, we then know that any two sum variables at the preceding level ℓ − 1 have the general form u = (1/√m) ∑_{i∈A_u} x_i and v = (1/√n) ∑_{j∈A_v} x_j, where A_u and A_v are two disjoint index subsets with m = |A_u| and n = |A_v| terms, respectively. Let δ_k = σ/σ_k.

If A_u ⊆ B_i and A_v ⊆ B_j, where i ≠ j, that is, the subsets belong to different blocks, then

$$\Sigma^{(\ell-1)} = V\{(u, v)\} = \begin{pmatrix} m\sigma_i^2 & \sqrt{mn}\,\sigma_{ij} \\ \sqrt{mn}\,\sigma_{ij} & n\sigma_j^2 \end{pmatrix} + \sigma^2 I. \tag{38}$$

The corresponding “between-block” correlation coefficient satisfies

$$\rho_B^{(\ell-1)} = \frac{\sigma_{ij}}{\sigma_i\sigma_j}\,\frac{\sqrt{mn}}{\sqrt{m+\delta_i^2}\,\sqrt{n+\delta_j^2}} \le \frac{\sigma_{ij}}{\sigma_i\sigma_j}, \tag{39}$$


with equality (the “worst-case scenario”) if and only if σ = 0. If A_u, A_v ⊂ B_k, that is, the subsets belong to the same block, then

$$\Sigma^{(\ell-1)} = V\{(u, v)\} = \sigma_k^2 \begin{pmatrix} m & \sqrt{mn} \\ \sqrt{mn} & n \end{pmatrix} + \sigma^2 I. \tag{40}$$

The corresponding “within-block” correlation coefficient satisfies

$$\rho_W^{(\ell-1)} = \frac{1}{\sqrt{1 + \frac{m+n}{mn}\,\delta_k^2 + \frac{1}{mn}\,\delta_k^4}} \ge \frac{1}{\sqrt{1 + 3\max(\delta_k^2, \delta_k^4)}}, \tag{41}$$

with the “worst-case scenario” occurring when m = n = 1, that is, when singletons are combined. Finally, the main result of the theorem follows from the bounds in equations (39) and (41), and the fact that

$$\max \rho_B^{(\ell-1)} < \min \rho_W^{(\ell-1)} \tag{42}$$

for ℓ = 1, 2, . . . , p − K is a sufficient condition for not combining variables from different blocks. If the inequality in equation (13) is satisfied, then the coefficients in the treelet expansion have the general form of equation (37) at any level ℓ of the tree. With white noise added, the expansion coefficients have variances

$$V\{s_\ell\} = (m+n)\sigma_k^2 + \sigma^2 \qquad \text{and} \qquad V\{d_\ell\} = \sigma^2\,\frac{m^2 + n^2}{mn(m+n)}.$$

Furthermore, E{s_ℓ} = E{d_ℓ} = 0.
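The two bounds can also be checked numerically. The short sketch below evaluates the between-block and within-block correlations of equations (39) and (41) for an arbitrary (made-up) choice of block parameters and verifies condition (42), namely that the largest possible between-block value stays below the smallest within-block value:

```python
import numpy as np

def rho_between(m, n, sig_i, sig_j, sig_ij, sigma):
    """Between-block correlation of equation (39)."""
    d_i, d_j = sigma / sig_i, sigma / sig_j
    return (sig_ij / (sig_i * sig_j)) * np.sqrt(m * n) / (
        np.sqrt(m + d_i**2) * np.sqrt(n + d_j**2))

def rho_within(m, n, sig_k, sigma):
    """Within-block correlation of equation (41)."""
    d = sigma / sig_k
    return 1.0 / np.sqrt(1.0 + (m + n) / (m * n) * d**2 + d**4 / (m * n))

# Made-up block parameters; the largest between-block value (bounded by
# sig_ij/(sig_i*sig_j)) should stay below the smallest within-block value
# (attained at m = n = 1), which is condition (42).
sig_i = sig_j = sig_k = 1.0
sig_ij, sigma = 0.4, 0.5
worst_between = max(rho_between(m, n, sig_i, sig_j, sig_ij, sigma)
                    for m in range(1, 20) for n in range(1, 20))
worst_within = rho_within(1, 1, sig_k, sigma)
print(worst_between < worst_within)        # True: blocks are never merged first
```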

Acknowledgments. We are grateful to R. R. Coifman and P. Bickel for helpful discussions. We would also like to thank K. Roeder and three anonymous referees for comments that improved the manuscript, Dr. D. Rimm for sharing his database on hyperspectral images, and Dr. G. L. Davis for contributing his expertise on the histology of the tissue samples.

REFERENCES

Ahn, J. and Marron, J. S. (2008). Maximal data piling in discrimination. Biometrika. To appear.
Angeletti, C., Harvey, N. R., Khomitch, V., Fischer, A. H., Levenson, R. M. and Rimm, D. L. (2005). Detection of malignancy in cytology specimens using spectral-spatial analysis. Laboratory Investigation 85 1555–1564.
Asimov, D. (1985). The Grand Tour: A tool for viewing multidimensional data. SIAM J. Sci. Comput. 6 128–143. MR0773286
Bair, E., Hastie, T., Paul, D. and Tibshirani, R. (2006). Prediction by supervised principal components. J. Amer. Statist. Assoc. 101 119–137. MR2252436
Belkin, M. and Niyogi, P. (2005). Semi-supervised learning on Riemannian manifolds. Machine Learning 56 209–239.
Beran, R. and Srivastava, M. (1985). Bootstrap tests and confidence regions for functions of a covariance matrix. Ann. Statist. 13 95–115. MR0773155


Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227. MR2387969
Buckheit, J. and Donoho, D. (1995). Improved linear discrimination using time frequency dictionaries. In Proc. SPIE 2569 540–551.
Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Ann. Statist. 35 2313–2404. MR2382644
Coifman, R., Lafon, S., Lee, A., Maggioni, M., Nadler, B., Warner, F. and Zucker, S. (2005). Geometric diffusions as a tool for harmonics analysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. 102 7426–7431.
Coifman, R. and Saito, N. (1996). The local Karhunen–Loeve basis. In Proc. IEEE International Symposium on Time-Frequency and Time-Scale Analysis 129–132. IEEE Signal Processing Society.
Coifman, R. and Wickerhauser, M. (1992). Entropy-based algorithms for best basis selection. IEEE Trans. Inform. Theory 32 712–718.
Dettling, M. and Buhlmann, P. (2004). Finding predictive gene groups from microarray data. J. Multivariate Anal. 90 106–131. MR2064938
Donoho, D. and Elad, M. (2003). Maximal sparsity representation via l1 minimization. Proc. Natl. Acad. Sci. USA 100 2197–2202. MR1963681
Donoho, D. and Johnstone, I. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200–1224. MR1379464
Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95 14863–14868.
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611–631. MR1951635
Golub, G. and van Loan, C. F. (1996). Matrix Computations, 3rd ed. Johns Hopkins Univ. Press. MR1417720
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Lob, M. L., Downing, J. R., Caliguiri, M., Bloomfield, C. and Lander, E. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 531–537.
Gruber, M. (1998). Improving Efficiency by Shrinkage: The James–Stein and Ridge Regression Estimators. Dekker, New York. MR1608582
Guyon, I., Weston, J., Barnhill, S. and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46 389–422.
Hall, P., Marron, J. S. and Neeman, A. (2005). Geometric representation of high dimension, low sample size data. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 427–444. MR2155347
Hastie, T., Tibshirani, R., Botstein, D. and Brown, P. (2001). Supervised harvesting of expression trees. Genome Biology 2 research0003.1–0003.12.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York. MR1851606
Jain, A. K., Murty, M. N. and Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys 31 264–323.
Johnstone, I. and Lu, A. (2008). Sparse principal component analysis. J. Amer. Statist. Assoc. To appear.
Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal component analysis. Ann. Statist. 29 295–327. MR1863961
Jolliffe, I. T. (2002). Principal Component Analysis, 2nd ed. Springer, New York. MR2036084


Kalisch, M. and Buhlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the pc-algorithm. J. Machine Learning Research 8 613–636.
Kushmerick, N. (1999). Learning to remove internet advertisements. In Proceedings of the Third Annual Conference on Autonomous Agents 175–181.
Lee, A. and Nadler, B. (2007). Treelets—a tool for dimensionality reduction and multi-scale analysis of unstructured data. In Proc. of the Eleventh International Conference on Artificial Intelligence and Statistics (M. Meila and Y. Shen, eds.).
Levina, E. and Zhu, J. (2007). Sparse estimation of large covariance matrices via a hierarchical lasso penalty. Submitted.
Lorber, A., Faber, K. and Kowalski, B. R. (1997). Net analyte signal calculation in multivariate calibration. Anal. Chemometrics 69 1620–1626.
Mallat, S. (1998). A Wavelet Tour of Signal Processing. Academic Press, San Diego, CA. MR1614527
Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
Murtagh, F. (2004). On ultrametricity, data coding, and computation. J. Classification 21 167–184. MR2100389
Murtagh, F. (2007). The Haar wavelet transform of a dendrogram. J. Classification 24 3–32. MR2370773
Murtagh, F., Starck, J.-L. and Berry, M. W. (2000). Overcoming the curse of dimensionality in clustering by means of the wavelet transform. Computer J. 43 107–120.
Nadler, B. (2007). Finite sample approximation results for principal component analysis: A matrix perturbation approach. Submitted.
Nadler, B. and Coifman, R. (2005a). Partial least squares, Beer's law and the net analyte signal: Statistical modeling and analysis. J. Chemometrics 19 45–54.
Nadler, B. and Coifman, R. (2005b). The prediction error in CLS and PLS: The importance of feature selection prior to multivariate calibration. J. Chemometrics 19 107–118.
Ogden, R. T. (1997). Essential Wavelets for Statistical Applications and Data Analysis. Birkhauser, Boston. MR1420193
Saito, N. and Coifman, R. (1995). On local orthonormal bases for classification and regression. In Proc. ICASSP 1529–1532. IEEE Signal Processing Society.
Saito, N., Coifman, R., Geshwind, F. B. and Warner, F. (2002). Discriminant feature extraction using empirical probability density estimation and a local basis library. Pattern Recognition 35 2841–2852.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D. and Brown, P. (1999). Clustering methods for the analysis of DNA microarray data. Technical report, Dept. Statistics, Stanford Univ.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99 6567–6572.
Whittaker, J. (2001). Graphical Models in Multivariate Statistics. Wiley, New York.
Xu, R. and Wunsch, D. (2005). Survey of clustering algorithms. IEEE Trans. Neural Networks 16 645–678.
Zhao, Z. and Liu, H. (2007). Searching for interacting features. In Proceedings of the 20th International Joint Conference on AI (IJCAI-07).
Zhu, J. and Hastie, T. (2004). Classification of gene microarrays by penalized logistic regression. Biostatistics 5 427–444.


Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67 301–320. MR2137327
Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286. MR2252527

A. B. Lee
L. Wasserman
Department of Statistics
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
E-mail: [email protected]@stat.cmu.edu

B. Nadler
Department of Computer Science and Applied Mathematics
Weizmann Institute of Science
Rehovot, Israel
E-mail: [email protected]