
Dimension reduction techniques for classification

Milan, May 2003

Anestis Antoniadis

Laboratoire IMAG-LMC

University Joseph Fourier

Grenoble, France


Outline

Why use dimension reduction methods with microarray data? There are thousands of variables (genes, p) and a very small number of (biological) replicates, so regression analysis is difficult in practice.

Dimension reduction for regression:
  PCA (SVD) based methods
  PLS methods (from chemometrics)
  Sliced Inverse Regression methods (SIR, SAVE, MAVE)

Applications in classification


Notation

Microarray data. The observed data arrive in two parts:

an n × p matrix X = (x_ij), where row i (a 1 × p vector) is the gene expression profile of individual (subject) i, so the microarray data matrix corresponds to X^T. We suppose that the data have already been preprocessed and normalized;

an n-dimensional vector Y of group labels, taking values in {0, . . . , G − 1}.

We will assume that each column of X (each gene) is standardized to have mean zero and variance one.


Two types of problems

Class discovery and class prediction. We will focus here on the class prediction problem: observations are known to belong to prespecified classes, and the task is to build predictors for assigning new observations to these classes.

Standard methods include linear discriminant analysis, diagonal linear discriminant classifiers, classification trees, nearest neighbors (NN), SVMs and aggregated classifiers.

A comprehensive account is given in Dudoit, Fridlyand and Speed (2002).


A regression model based on SVD

Formulate the effects of gene expression on class type using the multinomial logistic regression model:

$$\log \frac{P(Y_i = r)}{P(Y_i = 0)} = X_{i\cdot}\,\beta_{r0}, \qquad r = 1, \ldots, G - 1,$$

where β_{r0} is a p-dimensional vector of unknown regression coefficients.

Since p ≫ n, it is not possible to estimate the parameters of the above model using standard statistical methods. A principal component analysis is then a suitable first step to reduce the dimension of β_{r0}.


Principal Component Regression

We first perform a singular value decomposition of the p × n matrix X^T:

$$X^T = U D V,$$

where

U is a p × n matrix whose columns are orthonormal;

D is the diagonal matrix containing the ordered singular values d_i of X (we will assume that d_i > 0 for i = 1, . . . , n);

V is the n × n singular value decomposition factor matrix, with both orthonormal rows and columns.


PCR

Aim: project the high-dimensional multivariate data onto a lower-dimensional subspace.

By the SVD we now have

$$\log \frac{P(Y_i = r)}{P(Y_i = 0)} = W_{i\cdot}\,\gamma_{r0},$$

where W_{i·} is the ith row of W = DV and γ_{r0} = U^T β_{r0} is now an n-dimensional vector of regression coefficients.

We have therefore reduced the dimension of the predictor space from p to n, making the problem computationally tractable; the model is fitted by maximum likelihood.
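The course's own implementation is in MATLAB (see the Results section); purely as an illustration, here is a minimal R sketch of this reduction, assuming the multinom fitter from the nnet package. The function names and interface are ours, not the course's.

```r
## Minimal sketch of SVD-based dimension reduction for multinomial
## logistic regression (illustration only, not the course's MATLAB code).
## X is the n x p expression matrix, y the class labels in {0, ..., G-1}.
library(nnet)  # provides multinom(), a standard multinomial logit fitter

svd_scores <- function(X) {
  s <- svd(X)     # X = U D V^T, so X %*% s$v = U D, the n x n score matrix
  X %*% s$v       # plays the role of W on this slide
}

fit_svd_logit <- function(X, y, k) {
  W <- svd_scores(X)
  dat <- data.frame(y = factor(y), W[, 1:k, drop = FALSE])
  multinom(y ~ ., data = dat, trace = FALSE)  # ML fit on k components
}
```

Note that a new sample must be projected onto the same right singular vectors before prediction.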


Only a fraction of the full set of genes have real biological activity relevant to the classes, so the number of useful predictor genes is much smaller than p.

Suggestion: reduce the initial number p of genes by ANOVA-based prefiltering techniques to improve predictive performance (see e.g. Golub et al. (1999), Dudoit et al. (2002), Nguyen and Rocke (2002)).

Idea: fit an ANOVA model for gene expression versus class for each gene. For each ANOVA model, compute an overall F-statistic and take the m genes with the largest F-statistics as the potential predictors in the model.
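A hedged R sketch of this prefiltering step (our own illustration; the function name and default m are arbitrary): one one-way ANOVA F-statistic per gene, keeping the m largest.

```r
## One-way ANOVA F statistic for each gene (column of X) against the
## class labels y; return the indices of the m largest statistics.
top_genes_anova <- function(X, y, m = 100) {
  y <- factor(y)
  Fstat <- apply(X, 2, function(gene) anova(lm(gene ~ y))[1, "F value"])
  order(Fstat, decreasing = TRUE)[1:m]
}

## Usage: Xs <- X[, top_genes_anova(X, y, m = 50)]
```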


Choosing the components

A major issue is determining how many components to retain. One way of doing this is leave-one-out cross validation.

One sample is removed from the data set at a time. For a fixed number of components, say k, the regression model is fit to the remaining data and used to predict the withheld sample. An error measure is computed, and the procedure is repeated over all samples to obtain an estimate of the prediction classification error. This is done for each value of k, and the value of k that yields the smallest classification error is chosen.
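The procedure just described can be written generically; the sketch below is our illustration, with placeholder fit/predict functions (for example, the SVD-logit sketch above together with predict()).

```r
## Leave-one-out CV over the number of retained components k.
## fit_fun(Xtr, ytr, k) must return a fitted model and pred_fun(model, x)
## a predicted label; both are placeholders for the user's classifier.
choose_k_loocv <- function(X, y, kmax, fit_fun, pred_fun) {
  n <- nrow(X)
  err <- sapply(1:kmax, function(k)
    mean(sapply(1:n, function(i) {
      model <- fit_fun(X[-i, , drop = FALSE], y[-i], k)
      pred_fun(model, X[i, , drop = FALSE]) != y[i]
    })))
  which.min(err)   # the k with the smallest LOO classification error
}
```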


Caution

Performance of the classification rules for a selected subset of genes is measured by their errors on the test set and also by their leave-one-out cross-validated errors.

If these errors are calculated within the preliminary gene selection process, they carry a selection bias when used as estimates of the prediction error.


Partial Least Squares

SVD produces orthogonal class descriptors ("supergenes") that reduce the high-dimensional data.

This is achieved without regard to the response variation and may be inefficient: this way of reducing the regressor dimensionality is totally independent of the output variable.

One should not treat the predictors separately from the response. This is the spirit of the methods developed by Nguyen and Rocke (2002) and Ghosh (2002), where the PLS components are chosen so that the sample covariance between the response and a linear combination of the p predictors (genes) is maximized.


Partial Least Squares

A popular regression method in chemometrics. It attempts to simultaneously find linear combinations of the predictors whose correlation with the response is maximized and which are uncorrelated over the training sample.

There are several algorithms for numerically fitting partial least squares models.

However, PLS is really designed for continuous responses, and especially for models that do not suffer from conditional heteroscedasticity, unlike binary or multinomial data.


Sample PLS

Given X and y ∈ R^n, an important feature of PLS is that the following two decompositions are carried out together:

$$E_0 = X = \sum_{j=1}^{K} t_j p_j^T + E_K$$

and

$$f_0 = y = \sum_{j=1}^{K} q_j t_j + f_K,$$

where the t_j are n-vector latent variables (scores), the p_j are p-vector loadings, and E_K is a residual matrix. The q_j are scalar coefficients and f_K is an n-vector of residuals.


Sample PLS

Denoting by T the n × K matrix of scores and by P the K × p matrix of loadings whose rows are the p_j^T, one has

$$X = T P + E_K.$$

One may see that the above decomposition is not unique, since for any invertible K × K matrix C we have TP = (TC)(C^{-1}P). The uniqueness of the t_j's and p_j's comes from imposing orthogonality conditions, i.e. P P^T = D_P and T^T T = D_T (diagonal matrices).


Pseudo code

1. Set E_0 = X and f_0 = y. Compute p_1 = argmax{p : ‖p‖ = 1, |⟨E_0 p, y⟩|} (answer: p_1 = E_0^T y / ‖E_0^T y‖).

2. t_1 = E_0 p_1 / ‖E_0 p_1‖

3. X = t_1 (X^T t_1)^T + E_1

4. y = t_1 (y^T t_1)^T + f_1

5. Repeat the above until K. One may show that K ≤ rank(X).


Denote

$$w_k = X^T f_{k-1} / \|X^T f_{k-1}\|$$

and let W_K be the p × K matrix whose columns are the w_k's. The fitted PLS regression is then given by

$$\hat{Y} = X W_K (W_K^T X^T X W_K)^{-1} W_K^T X^T Y$$

and

$$\hat{\beta}_{PLS} = W_K (W_K^T X^T X W_K)^{-1} W_K^T X^T Y.$$

In contrast to PCR, PLS procedures are nonlinear in Y. Again, there are many ways of choosing K.
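As an illustration, here is a runnable R sketch of the sample PLS recursion for a single continuous response (our own conventions, not the course's code). The coefficient vector is recovered via the standard NIPALS identity β̂ = W (PᵀW)⁻¹ q̂, which should agree with the β̂_PLS expression above.

```r
## PLS1 for a univariate response (a sketch, not a reference
## implementation). X is n x p, y has length n, K <= rank(X).
pls1 <- function(X, y, K) {
  Xc <- scale(X, center = TRUE, scale = FALSE)   # work with centered data
  E <- Xc; f <- y - mean(y)
  W <- matrix(0, ncol(X), K); P <- matrix(0, ncol(X), K)
  Tm <- matrix(0, nrow(X), K); q <- numeric(K)
  for (k in 1:K) {
    w <- drop(crossprod(E, f)); w <- w / sqrt(sum(w^2))   # weight vector
    t <- drop(E %*% w)                                    # score
    q[k] <- sum(f * t) / sum(t * t)   # lsfit(f on t), no intercept
    f <- f - t * q[k]                 # deflate the response
    p <- drop(crossprod(E, t)) / sum(t * t)   # loadings: lsfit(E on t)
    E <- E - tcrossprod(t, p)         # deflate the predictors
    W[, k] <- w; P[, k] <- p; Tm[, k] <- t
  }
  beta <- W %*% solve(crossprod(P, W), q)   # beta = W (P^T W)^{-1} q
  list(beta = drop(beta), intercept = mean(y), scores = Tm)
}
```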


Another look

PLS Algorithm (initialize E_0 = X and f_0 = y):

1. Iterate until the change in ŷ is small:
   (a) For k = 1 to K:
       i.   w_k ← E_{k−1}^T f_{k−1} / ‖E_{k−1}^T f_{k−1}‖
       ii.  t_k ← E_{k−1} w_k
       iii. q_k ← coefficient of lsfit (f_{k−1} on t_k, no intercept)
       iv.  f_k ← f_{k−1} − t_k q_k
       v.   p_k ← coefficient of lsfit (E_{k−1} on t_k, no intercept)
       vi.  E_k ← residual of lsfit (E_{k−1} on t_k, no intercept)
   (b) end For
   (c) ŷ ← mean(f_0) + ∑_{k=1}^{K} q_k t_k
2. Choose s ≤ K
3. lm(y ~ t_1 + · · · + t_s)

As noted above, however, PLS is designed for continuous responses that do not suffer from conditional heteroscedasticity, which motivates combining it with generalized linear models for binary or multinomial data.


GLM

The purpose here is to briefly introduce some relevant facts about generalized linear models. For a more thorough description of such models, refer to McCullagh and Nelder (1989) or Fahrmeir and Tutz (1994). Consider a pair (X, Y) of random variables, where Y is real-valued and X is possibly vector-valued; here Y is referred to as the response or dependent variable and X as the vector of covariates or predictor variables. Generalized models (GM for short) are particular regression models which describe the dependence of the response variable of interest Y on one or more predictor variables X.


A basic generalized model analysis starts with a random sample of size n from the distribution of (X, Y), where the conditional distribution of Y given X = x is assumed to be from a one-parameter exponential family with a density of the form

$$\exp\left( \frac{y\,\theta(x) - b(\theta(x))}{\phi} + c(y, \phi) \right).$$

The natural parameter function θ(x) specifies how the response depends on the covariates.
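For concreteness (a standard fact added here as an illustration, not part of the original slide): the Bernoulli distribution used for binary class labels fits this family. Writing µ = µ(x) = P(Y = 1 | X = x),

$$f(y \mid x) = \mu^{y}(1-\mu)^{1-y} = \exp\left( y \log\frac{\mu}{1-\mu} + \log(1-\mu) \right),$$

so that θ(x) = log[µ/(1 − µ)], b(θ) = log(1 + e^θ), φ = 1 and c(y, φ) = 0. Differentiating, ḃ(θ) = e^θ/(1 + e^θ) = µ and b̈(θ) = µ(1 − µ), matching the conditional mean and variance formulas on the next slide.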


The conditional mean and variance of the ith response Y_i are given by

$$E(Y_i \mid X = x_i) = \dot{b}(\theta(x_i)) = \mu(x_i)$$

and

$$\mathrm{Var}(Y_i \mid X = x_i) = \phi\,\ddot{b}(\theta(x_i)).$$

Here a dot denotes differentiation.


In the usual GM framework, the mean is related to the GM regression surface via the link function transformation g(µ(x_i)) = η(x_i), where η(x) is referred to as the predictor function.

A wide variety of distributions can be modelled using this approach, including normal regression with additive normal errors (identity link), logistic regression (logit link) where the response is a binomial variable, and Poisson regression models (log link) where the observations are from a Poisson distribution.


MLE algorithm

Consider the case

$$\eta = X\beta$$

with a monotone and continuously differentiable link function g such that η_i = g(µ_i) (the canonical link if g(µ_i) = θ_i).

One may show that the ML estimators of the parameters may be obtained by an iteratively reweighted least squares procedure, as follows:


Let η_0 be the current estimate of the linear predictor, with corresponding mean µ_0 obtained through the link g. One forms the adjusted response

$$z_0 = \eta_0 + (y - \mu_0)\left(\frac{d\eta}{d\mu}\right)_{\!0},$$

the derivative of the link being computed at µ_0. The weight matrix W is also evaluated at µ_0. One then regresses the vector z_0 on X by weighted least squares with weights W to get β_1, and the procedure is repeated until convergence.
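As an illustration (our sketch, not from the course), the procedure specializes to the familiar logistic-regression IRLS when g is the logit link:

```r
## IRLS for logistic regression (canonical logit link). Implements the
## update above: z0 = eta0 + (y - mu0) * (d eta / d mu) at mu0, weights W.
irls_logit <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    eta <- drop(X %*% beta)
    mu  <- 1 / (1 + exp(-eta))
    w   <- mu * (1 - mu)            # IRLS weights; for the logit, d eta/d mu = 1/w
    z   <- eta + (y - mu) / w       # adjusted response
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

## Should match glm(y ~ X - 1, family = binomial) up to numerical tolerance.
```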


PLS for GLMs

A close look at the PLS algorithm shows that the adjusted dependent vector residuals f_{k−1} (at step k − 1) are regressed on the explanatory variable residuals E_{k−1}. Next, the adjusted dependent vector residuals are regressed on the current latent variable, and the fitted values are subtracted from the residuals to form the next sequence of adjusted dependent vector residuals.

Borrowing the equivalence between MLE and IRLS, Marx (1996) treats the iterated adjusted dependent vector as the current dependent variable in a weighted metric, which allows him to express an IRPLS algorithm as the analog of the familiar PLS algorithm.


Pseudocode

Initialize E_0 = X, f_0 = ψ(y), V = {g′(ψ(y))}² / Var(Y)

1. Iterate until the change in η is small:
   (a) For k = 1 to K:
       i.   w_k = E_{k−1}^T V f_{k−1} / ‖E_{k−1}^T V f_{k−1}‖
       ii.  t_k = E_{k−1} w_k
       iii. q_k = coefficient of lsfit (f_{k−1} on t_k, no intercept, weights V)
       iv.  f_k = f_{k−1} − t_k q_k
       v.   p_k = coefficient of lsfit (E_{k−1} on t_k, no intercept, weights V)
       vi.  E_k = residual of lsfit (E_{k−1} on t_k, no intercept, weights V)
   (b) end For
   (c) η ← wt.mean(f_0, wt = V) + ∑_{k=1}^{K} q_k t_k
   (d) V = {g′(η)}² / Var(Y)
   (e) f_0 = η + diag{1/g′(η_i)} (y − g(η))
   (f) Compute the adjusted residuals E_0
2. Choose s ≤ K
3. glm(y ~ t_1 + · · · + t_s)
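To make the scheme concrete, here is a hedged R sketch of the IRPLS idea for a binary response with the logit link. It illustrates the pseudocode above; it is not Marx's (1996) reference implementation, and the starting values and stopping rule are our own choices.

```r
## IRPLS sketch for binary y (logit link): the weighted least-squares step
## of IRLS is replaced by a weighted PLS fit of the adjusted response on X.
irpls_logit <- function(X, y, K, maxit = 25, tol = 1e-6) {
  n <- nrow(X)
  mu  <- (y + 0.5) / 2                 # crude starting values in (0, 1)
  eta <- log(mu / (1 - mu))
  for (it in 1:maxit) {
    V  <- mu * (1 - mu)                # IRLS weights for the logit link
    f  <- eta + (y - mu) / V           # adjusted response
    E  <- scale(X, center = TRUE, scale = FALSE)
    fk <- f - weighted.mean(f, V)
    eta_new <- rep(weighted.mean(f, V), n)
    for (k in 1:K) {                   # weighted PLS loop (steps i-vi)
      w  <- drop(crossprod(E, V * fk)); w <- w / sqrt(sum(w^2))
      t  <- drop(E %*% w)
      q  <- sum(V * t * fk) / sum(V * t * t)   # weighted lsfit(fk on t)
      p  <- drop(crossprod(E, V * t)) / sum(V * t * t)
      fk <- fk - q * t
      E  <- E - tcrossprod(t, p)
      eta_new <- eta_new + q * t
    }
    if (max(abs(eta_new - eta)) < tol) { eta <- eta_new; break }
    eta <- eta_new
    mu  <- 1 / (1 + exp(-eta))
  }
  list(eta = eta, fitted = 1 / (1 + exp(-eta)))
}
```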


An example

The data consist of 22 cDNA microarrays, each representing 5361 genes, based on biopsy specimens of primary breast tumours from 7 patients with germ-line mutations of BRCA1, 8 patients with germ-line mutations of BRCA2, and 7 sporadic cases. These data were first presented and analysed by Hedenfalk et al. (2001). Information on the data can be found at http://www.nejm.org and http://www.nhgri.nih.gov/DIR/Microarray. The analysis focuses on identifying groups of genes that can be used to predict class membership with respect to the two BRCA mutation classes.


SVD Logistic regression

[Figure: discriminant plot for SVD (Discriminant Var 1 vs. Discriminant Var 2), with samples labelled by class 1, 2, 3.]

Applying SVD based logistic regression to the BRCA data.


PLS regression

[Figure: discriminant plot for PLS (Discriminant Var 1 vs. Discriminant Var 2), with samples labelled by class 1, 2, 3.]

Applying (classical) PLS to the BRCA data.


GPLS regression

[Figure: residuals vs. fitted values (residuals against predicted values) and estimated coefficients, for GPLS on the BRCA data.]


Sliced inverse regression (SIR)

A convenient data reduction formulation, which accounts for the correlation among genes, is to assume there exists a p × k matrix η, k ≤ p, so that

$$F(Y \mid X) = F(Y \mid \eta^T X),$$

where F(·|·) is the conditional distribution function of the response Y given the second argument.

The above states that the p × 1 predictor vector X can be replaced by the k × 1 predictor vector η^T X without loss of information.


Most importantly, if k < p, then a sufficient reduction in the dimension of the regression is achieved. The linear subspace S(η) spanned by the columns of η is a dimension-reduction subspace (Li, 1991), and its dimension is the number of linear combinations of the components of X needed to model Y.

Let S_{Y|X} denote the unique smallest dimension reduction subspace, referred to as the central subspace (Cook, 1996). The dimension d = dim(S_{Y|X}) is called the structural dimension of the regression of Y on X, and can take any value in the set {0, 1, . . . , p}.


Estimation of the central subspace

The estimation of the central subspace is based on finding a kernel matrix M so that S(M) ⊂ S_{Y|X}:

SIR and variations (Li, 1991): M = Cov(E(X|Y));

polynomial inverse regression (Bura and Cook, 2001): M = E(X|Y);

pHd (Li, 1991): M = E((Y − E(Y)) X X^T);

SAVE (Cook and Weisberg, 1991): M = E(Cov(X) − Cov(X|Y))²;

SIRII (Li, 1991): M = E(Cov(X|Y) − E(Cov(X|Y)))².


Two conditions are needed for all these kernel matrices M to span subspaces of the central dimension reduction subspace: E(X|γ^T X) must be linear, and Var(X|γ^T X) must be constant. The conditions can be checked empirically by considering the scatterplot matrix of the predictors:

Linearity of E(X|γ^T X) can be ascertained if the scatterplots look roughly linear or random;

homogeneity of the variance holds if there are no pronounced fluctuations in data density in the scatterplots.
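For instance (our illustration; X is the predictor matrix), one can spot-check these conditions on a random handful of predictors:

```r
## Scatterplot-matrix spot check of the linearity / constant-variance
## conditions on a few randomly chosen (standardized) predictors.
set.seed(1)
pairs(scale(X)[, sample(ncol(X), 5)],
      main = "Panels should look roughly linear or random")
```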


Without loss of generality we use standardised predictors Z = Σ_X^{-1/2} (X − E(X)). Let µ_j = E(Z | Y = j) and Σ_j = Var(Z | Y = j), j = 0, 1, denote the conditional means and variances, respectively, for the binary response Y, and let

$$\nu = \mu_1 - \mu_0, \qquad \Delta = \Sigma_1 - \Sigma_0.$$

The main results of Cook and Lee (1999) state that S_SAVE = S(ν, ∆) ⊂ S_{Y|Z}, and also that S_SIR = S(ν) ⊂ S_{Y|Z}.


Implementation

In implementing the method, ν and ∆ are replaced by the corresponding sample moments,

$$\hat{\nu} = \hat{\Sigma}_x^{-1/2} (\bar{x}_1 - \bar{x}_0)$$

and

$$\hat{\Delta} = \hat{\Sigma}_x^{-1/2} (\hat{\Sigma}_{x|1} - \hat{\Sigma}_{x|0}) \hat{\Sigma}_x^{-1/2},$$

to yield Ŝ_SAVE = S(ν̂, ∆̂), a k × (k + 1) matrix, and Ŝ_SIR = S(ν̂), a k × 1 vector. The latter obviously has dimension of at most 1.
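A minimal R sketch of sample SIR for a categorical response (our illustration, not the course's MATLAB implementation). It assumes n > p so that the sample covariance is invertible; with microarray data one would first prefilter genes, as discussed earlier.

```r
## Sample SIR with a categorical response y: the slices are the classes,
## and the kernel matrix M = Cov(E(Z|Y)) is estimated by the weighted
## covariance of the within-class means of the standardized predictors.
sir_directions <- function(X, y, d = 1) {
  n <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  ev <- eigen(crossprod(Xc) / n, symmetric = TRUE)   # sample covariance
  Sinv2 <- ev$vectors %*% diag(1 / sqrt(ev$values)) %*% t(ev$vectors)
  Z <- Xc %*% Sinv2                                  # standardized predictors
  M <- matrix(0, ncol(X), ncol(X))
  for (g in unique(y)) {
    idx <- which(y == g)
    mg  <- colMeans(Z[idx, , drop = FALSE])          # slice mean of Z
    M   <- M + (length(idx) / n) * tcrossprod(mg)
  }
  dirs <- eigen(M, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
  Sinv2 %*% dirs     # directions mapped back to the original X scale
}
```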


Estimating d

The test statistic for dimension is given by

$$\Lambda_d = n \sum_{\ell = d+1}^{k} \hat{\lambda}_\ell^2,$$

where λ̂_1 ≥ λ̂_2 ≥ · · · ≥ λ̂_k are the singular values of the estimated kernel matrix, M̂_SAVE = (ν̂, ∆̂) or M̂_SIR = ν̂, depending on the method used.

In both cases, the estimation is carried out by performing a series of tests of H_0 : d = m against H_a : d > m, starting at m = 0, which corresponds to independence of Y and Z.

The test statistic for SAVE has an asymptotic weighted chi-squared distribution. The SIR test statistic for dimension has an asymptotic chi-squared distribution (Li, 1991).


Remarks

SIR and SAVE can be applied to problems with multinomial or multi-valued responses.

When X is normally distributed, SIR is equivalent to Linear Discriminant Analysis, in the sense that both estimate the same discriminant linear combinations of the predictors.

In binary regression, both LDA and SIR estimate at most one direction in the central dimension reduction subspace.

When X is a normal vector, SAVE is equivalent to Quadratic Discriminant Analysis.


SIR example

[Figure: projections of the data onto the first two estimated SIR directions, Dir1 vs. Dir2.]

Applying SIR to the BRCA data.


MAVE

A regression-type model for dimension reduction can be written as

$$Y = g(\eta^T X) + \varepsilon,$$

where g is an unknown smooth link function, η = (β_1, . . . , β_k) is a p × k orthogonal matrix (η^T η = I_k) with k < p, and E(ε|X) = 0 almost surely. The last condition allows ε to be dependent on X and covers, in particular, the binary regression case.


The MAVE algorithm of Xia et al. (2002) is devoted to the estimation of the matrix η in E(Y|X) = g(η^T X), with g an unknown smooth function. The estimated η is a solution to

$$\min_B\; E\{Y - E(Y \mid B^T X)\}^2 = E(\sigma_B^2(B^T X)),$$

subject to B^T B = I. To minimize the above expression one first has to estimate the conditional variance σ_B²(B^T X) = E[{Y − E(Y|B^T X)}² | B^T X]. Let g_B(v) = E(Y | B^T X = v).


Given a sample {X_i, Y_i}, a local linear fit is applied to estimate g_B(·), and the EDR directions are estimated by solving the minimization problem

$$\min_{B,\, a_j,\, b_j}\; \sum_{j=1}^{n} \sum_{i=1}^{n} \left( Y_i - \left[ a_j + b_j^T B^T (X_i - X_j) \right] \right)^2 w_{ij},$$

where w_{ij} = K_h{B^T(X_i − X_j)} / ∑_{ℓ=1}^{n} K_h{B^T(X_ℓ − X_j)} are multidimensional kernel weights.
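As an illustration of these weights (our sketch, with a Gaussian kernel; B and h are taken as given at the current iteration):

```r
## Kernel weights w_ij = K_h{B^T(X_i - X_j)} / sum_l K_h{B^T(X_l - X_j)},
## computed with a Gaussian kernel on the reduced predictors Z = X B.
mave_weights <- function(X, B, h) {
  Z  <- X %*% B
  D2 <- as.matrix(dist(Z))^2          # ||B^T(X_i - X_j)||^2
  Kh <- exp(-D2 / (2 * h^2))
  sweep(Kh, 2, colSums(Kh), "/")      # normalize over i for each fixed j
}
```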


We start with the identity matrix as an initial estimator of B to be used in the kernel weights. Then, iteratively, we use the multidimensional kernel weights to obtain an estimator of B by minimization, refine the kernel weights with the updated value of B, and iterate until convergence.

The choices of the bandwidth h and of the EDR dimension d are made through a cross-validation technique.


Results

We demonstrate the usefulness of the methodology described above on a well-known data set: the leukemia data first analyzed in Golub et al. (1999). It consists of absolute measurements from Affymetrix high-density oligonucleotide arrays and contains n = 72 tissue samples on p = 7129 genes (47 cases of acute lymphoblastic leukemia (ALL) and 25 cases of acute myeloid leukemia (AML)).


We used the MATLAB software environment to preprocess the data and to implement the classification methodology. The suite of MATLAB functions implementing the procedure is freely available at http://www-lmc.imag.fr/SMS/Software/microarrays/. We followed exactly the protocol of Dudoit et al. (2002) to pre-process the data by thresholding, filtering, a base 10 logarithmic transformation and standardization, so that the final data are summarized by a 3571 × 72 matrix X = (x_ij), where x_ij denotes the base 10 logarithm of the expression level for gene i in mRNA sample j.


The data are already divided into a learning set of 38 mRNA samples and a test set of 34 mRNA samples. The observations in the two sets came from different labs and were collected at different times. When training the rule, and for the pre-selection of genes, we first reduced the set of available genes to the top p* = 50, 100 and 200 genes, ranked in terms of the BSS/WSS criterion used by Dudoit et al. (2002). For comparison we also applied the two discriminant analysis procedures, DLDA and DQDA, described in Dudoit et al. (2002).


Table 1: Classification rates of the four methods for the leukemia data set with 38 training samples (27 ALL, 11 AML) and 34 test samples (20 ALL, 14 AML). Given are the numbers of correct classifications out of 38 and 34 for the training and test samples, respectively.

Training data (leave-one-out CV)

p*     MAVE-LD   DLDA   DQDA   MAVE-NPLD
50     37        38     37     37
100    38        38     37     38
200    38        38     36     38


Test data (out-of-sample)

p*     MAVE-LD   DLDA   DQDA   MAVE-NPLD
50     33        33     33     33
100    33        33     33     33
200    32        33     32     32


References

Bura, E. and Cook, R. D. (2001). Estimating the structural dimension of regressions via parametric inverse regression. J. Roy. Statist. Soc. Ser. B, 63, 393–410.

Chiaromonte, F. and Martinelli, J. (2002). Dimension reduction strategies for analyzing global gene expression data with a response. Math. Biosciences, 176, 123–144.

Cook, R. D. and Lee, H. (1999). Dimension reduction in binary response regression. J. Amer. Statist. Assoc., 94, 1187–1200.


Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumours using gene expression data. J. Amer. Statist. Assoc., 97, 77–87.

Golub, T. R., Slonim, D. K., Tamayo, P. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Li, K. C. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc., 86, 316–342.


Marx, B. D. (1996). Iteratively reweighted partial least squares estimation for generalized linear regression. Technometrics, 38, 374–392.

Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18, 39–50.