Statistical Applications in Genetics and Molecular Biology · Applications have shown better...

24
Statistical Applications in Genetics and Molecular Biology Volume , Issue Article Parameter estimation for the calibration and variance stabilization of microarray data Wolfgang Huber * Anja von Heydebreck Holger Sueltmann Annemarie Poustka ** Martin Vingron †† * German Cancer Research Center, Heidelberg, Germany, [email protected] Max-Planck-Institute for Molecular Genetics, Berlin, Germany, heyde- [email protected] German Cancer Research Center, Heidelberg, Germany, [email protected] ** German Cancer Research Center, Heidelberg, Germany, [email protected] †† Max-Planck-Institute for Molecular Genetics, Berlin, Germany, Mar- [email protected] Copyright c 2003 by the authors. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, bepress. Statistical Applications in Genet- ics and Molecular Biology is produced by The Berkeley Electronic Press (bepress). http://www.bepress.com/sagmb

Transcript of Statistical Applications in Genetics and Molecular Biology · Applications have shown better...

Page 1: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

Statistical Applications in Geneticsand Molecular Biology

Volume , Issue Article

Parameter estimation for the calibration and

variance stabilization of microarray data

Wolfgang Huber∗ Anja von Heydebreck† Holger Sueltmann‡

Annemarie Poustka∗∗ Martin Vingron††

∗German Cancer Research Center, Heidelberg, Germany, [email protected]†Max-Planck-Institute for Molecular Genetics, Berlin, Germany, heyde-

[email protected]‡German Cancer Research Center, Heidelberg, Germany, [email protected]

∗∗German Cancer Research Center, Heidelberg, Germany, [email protected]††Max-Planck-Institute for Molecular Genetics, Berlin, Germany, Mar-

[email protected]

Copyright c©2003 by the authors. All rights reserved. No part of this publicationmay be reproduced, stored in a retrieval system, or transmitted, in any form or byany means, electronic, mechanical, photocopying, recording, or otherwise, without theprior written permission of the publisher, bepress. Statistical Applications in Genet-ics and Molecular Biology is produced by The Berkeley Electronic Press (bepress).http://www.bepress.com/sagmb

Page 2: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

Parameter estimation for the calibration and

variance stabilization of microarray data

Wolfgang Huber, Anja von Heydebreck, Holger Sueltmann, AnnemariePoustka, and Martin Vingron

Abstract

We derive and validate an estimator for the parameters of a transformation for thejoint calibration (normalization) and variance stabilization of microarray intensity data.With this, the variances of the transformed intensities become approximately indepen-dent of their expected values. The transformation is similar to the logarithm in the highintensity range, but has a smaller slope for intensities close to zero. Applications haveshown better sensitivity and specificity for the detection of differentially expressed genes.In this paper, we describe the theoretical aspects of the method. We incorporate calibra-tion and variance-mean dependence into a statistical model and use a robust variant ofthe maximum-likelihood method to estimate the transformation parameters. Using simu-lations, we investigate the size of the estimation error and its dependence on sample sizeand the presence of outliers. We find that the error decreases with the square root ofthe number of probes per array and that the estimation is robust against the presence ofdifferentially expressed genes. Software is publicly available as an R package through theBioconductor project (http://www.bioconductor.org).

KEYWORDS: microarrays, error model, variance stabilizing transformation, resistantregression, robust estimation, maximum likelihood, simulation

Page 3: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

1 Introduction

Two important topics in the analysis of microarray data are the calibration (normal-ization) of data from different samples and the problem of variance inhomogeneity,in the sense that the variance of the measured intensities depends on their expectedvalue. A family of transformations has been proposed that makes the variance oftransformed intensities roughly independent of their expectation value [1, 2, 3].In the following, we describe and validate an approach to the estimation of theparameters of a transformation for calibration and variance stabilization. In Sec-tion 2, we formulate our assumptions about the data in terms of a statistical model.It is a version of the multiplicative-additive error model that has been introduced inthe context of microarray data by Ideker et al. [4] and Rocke and Durbin [5]. Theframework encompasses two-color data from spotted cDNA arrays, Affymetrixgenechip probe intensities, and radioactive intensities from nylon membranes. InSection3, we use the method of approximate variance stabilization to simplify themodel. In Section4, we derive a robust estimator for the parameters of the cali-bration and the variance stabilizing transformation. Finally, in Section5, we usesimulations to investigate the validity of the variance stabilizing transformation andto evaluate the estimator for different sample sizes and parameter regimes.Applications to real data, demonstrating increased sensitivity and specificity forthe detection of differentially expressed genes compared to other methods, havepreviously been described [3]. The comparison method and program code are pro-vided in a Bioconductor vignette [6], with the intention to make it easy for otherscientists to reproduce this type of comparison.

2 The model

A microarray consists of a set of probes immobilised on a solid support. The probesare chosen such that they bind to specific sample molecules; for DNA arrays, thisis ensured by the sequence-specificity of the hybridization reaction between com-plementary DNA strands. The interesting fraction from the biological sample isprepared in solution, labeled with fluorescent dye and allowed to bind to the array.The abundance of sample molecules can then be compared through comparing thefluorescence intensities at the matching probe sites.The measured intensityyki of probek = 1, . . . , n for samplei = 1, . . . , d may bedecomposed into a specific and an unspecific part,

yki = αki + βkixki. (1)

Here,xki is the abundance of the transcript represented by probek in the sam-ple i, βki is a proportionality factor, andαki subsumes unspecific signal contri-

1Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 4: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

butions which may be caused by effects such as non-specific hybridization, cross-hybridization or background fluorescence.The offsetsαki and gain factorsβki are usually not known, but microarray tech-nologies are designed in such a way that their values for differentk andi are re-lated. This makes it possible to infer statements about the concentrationsxki fromthe measured datayki. Relations between the offsets and gain factors for differentk andi can be expressed in terms of a further decomposition,

βki = βi γk eηki , (2)

αki = ai + νki. (3)

Thus, the gain factor is the product of a probe affinityγk, which is the same for allmeasurements involving probes of typek, times a normalization factorβi, whichapplies to all measurements from samplei. The remainderβki/(βiγk) is accountedfor by eηki . One can choose the units ofβi andγk such that

∑k ηki =

∑i ηki = 0.

The unspecific signal contributionαki can be decomposed into a per-sample offsetai and a remainderνki with

∑k νki = 0.

The probe affinityγk may depend, for example, on the probe sequence, secondarystructure and the abundance of probe molecules on the array. The normalizationfactorβi may depend, for example, on the amount of mRNA in the sample, on thelabeling efficiency, and on dye quantum yield. The idea behind the decomposi-tions (1)–(3) is that while the individual values ofηki andνki may fluctuate aroundzero, they do so in an unsystematic, random manner. Thus, for example, we as-sume that there are no systematic non-linear effects, which would imply trends intheηki or νki dependent on the value ofxki.Now one can reduce the parameter complexity of Eqn. (1) through the followingthree modeling steps:

1. Do not try to explicitly determine the probe affinitiesγk. They can be ab-sorbed intomki = γkxki, which may be considered a measure of the abun-dance of transcriptk in samplei in probe-specific units.

2. Treatηki andνki as “noise terms” coming from appropriate probability dis-tributions.

3. Estimate the values of the normalization factorsβi and offsetsai, as well asparameters of the probability distributions from the data.

Thus, Eqn. (1) leads to the following stochastic model:

Yki − aiβi

= mki eηki + νki, ηki

iid∼ Lη, νkiiid∼ Lν . (4)

2 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 5: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

Here,νki = νki/βi is the additive noise scaled by the normalization factorβi. Theright hand side of Eqn. (4) is a combination of an additive and a multiplicative errorterm. It was proposed by Rocke and Durbin [5], using normal distributionsLη =N(0, σ2

η) andLν = N(0, σ2ν). In the following, we will consider distributionsLη

andLν that are unimodal, roughly symmetric, and have mean zero and variancesσ2η

andσ2ν , respectively, but we do not rely on the assumption of a normal distribution.

The left hand side describes the calibration of the microarray intensitiesYki throughsubtraction of offsetsai and scaling by normalization factorsβi [7, 3].According to Eqn. (4), the variance of the random variableYki is related to its meanthrough

Var(Yki) = c2 (E(Yki)− ai)2 + β2i σ

2ν , (5)

wherec2 = Var(eη)/E2(eη) is a parameter of the distribution ofη ∼ Lη. In thelog-normal case,c2 = exp(σ2

η) − 1. Thus, the relationship of the variance tothe mean is a strictly positive, quadratic function. For a highly expressed gene,the variance Var(Yki) is dominated by the quadratic term and the coefficient ofvariation ofYki is approximatelyc, independent ofk andi. For a weakly expressedor unexpressed gene, the variance Var(Yki) is dominated by the constant term andthe standard deviation ofYki is approximatelyβiσν , which may be interpreted asthe background noise level for thei-th sample, and is independent ofk.

3 Variance stabilizing transformations

Consider a random variableX with expectation value 0 and a differentiable func-tion h defined on the range ofX. Then

h(X) = h(0) + h′(0)X + r(X)X, (6)

wherer is a continuous function withr(0) = 0 and

Var(h(X)) = h′(0)2 Var(X) + Var(r(X)X) + 2h′(0) E(r(X)X2). (7)

If h does not deviate from linearity too strongly within the range of typical valuesof X, thenr(X) is small and the terms involvingr(X) on the right hand side ofEqn. (7) are negligible. Thus, for a family of random variablesYu with expectationvalues E(Yu) = u and variances Var(Yu) = v(u)

Var(h(Yu)) ≈ h′(u)2 v(u). (8)

An approximately variance-stabilizing transformation can be obtained by finding afunctionh for which the right hand side is constant, that is, by integratingh′(u) =

3Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 6: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

v−1/2(u),

h(y) =∫ y

1/√v(u) du. (9)

Note that if h is approximately variance-stabilizing, then so isγ1h + γ2 withγ1, γ2 ∈ R.For a variance-mean dependence as in Eqn. (5), this leads to

hi(y) = arsinhy − aibi

, (10)

with bi = βiσν/c. Thus, on the transformed scale the model (4) takes the simpleform

arsinhYki − ai

bi= µki + εki, εki

iid∼ Lε, (11)

whereµki represents the true abundance on the transformed scale of genek insamplei andLε has mean zero and variancec2. The relation betweenµki andmki

is µki = E(arsinh( cσν

(mkieηki + νki))) ≈ arsinh( c

σνmki).

−2 0 2 4

−2

−1

01

23

x

asinh(x/b)

b=0.5b=1b=2log(x)

Figure 1: Graph of the function (10) for ai = 0 and three different values ofbi. Forcomparison, the dotted line shows the graph of the logarithm function.

The graph of the function (10) is shown in Fig.1. The inverse hyperbolic sine andthe logarithm are related to each other via

arsinh(x) = log(x+√x2 + 1) (12)

log(x) = arsinh12

(x− 1

x

)(13)

4 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 7: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

limx→∞

(arsinhx− log x− log 2) = 0. (14)

Functions that may have, within the range of the data, a graph similar to that ofarsinh((y − a)/b) have been proposed, among them the shifted logarithm and thelinlog-transformation [8, 9]

h(y) = logy − ab

(15)

h(y) ={

log(y/b), y ≥ ay/a+ log(a/b)− 1, y < a

(16)

While transformation (10) corresponds, via Eqn. (9), to a variance-mean depen-dence of the formv(u) ∝ (u− a)2 + const., the two transformations (15) and (16)correspond to variance-mean dependences of the form

v(u) ∝ (u− a)2,

v(u) ∝{u2, u ≥ aa2, u < a

respectively. All of these may fit the data, but in the following we choose trans-formation (10) both for computational convenience and for its interpretability interms of the error model (4).

4 Parameter estimation

Model (11) relates the measured intensitiesYki to the true expression valuesµkiin terms of the calibration and variance stabilization parametersai andbi and thenoise distributionLε. We would like to estimate the parametersai andbi. To thisend, we consider the non-differentially expressed genes, for whichµki = µk forall i. If in addition we require thatLε be normal, we obtain

arsinhYki − ai

bi= µk + εki, εki

iid∼ N(0, c2). (17)

In the following, we derive the maximum likelihood (ML) estimator for the param-etersai andbi. Then, by using a robust procedure similar toleast trimmed sumof squares regression[10], we extend its validity to situations in which there is aminority fraction of differentially expressed genes, for which model (17) is mis-specified, and to distributionsLε that are approximately normal in the center, butmay have different tails.

5Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 8: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

4.1 Maximum likelihood estimation

According to Eqn. (17), the probability of observing a valueyki within an interval[yαki, y

ωki] is

P (yki ∈ [yαki, yωki]) =

hi(yωki)∫

hi(yαki)

ρ

(hi(yki)− µk

c

)dhi(yki)

=

yωki∫yαki

ρ

(hi(yki)− µk

c

)h′i(yki) dyki. (18)

Here,ρ denotes the density of the standard normal distribution. The ML estimatesof the model parameters{ai}, {bi}, c, {µk} are those that maximize the likelihood

n∏k=1

d∏i=1

ρ

(hi(yki)− µk

c

)h′i (yki). (19)

Now the ML estimates of the parameters ofhi are those that maximize theprofilelikelihoodpl(a1, b1, . . . , ad, bd), which is obtained by replacingc2 andµk by theirmaximizing values [11]

µk =1d

d∑i=1

hi(yki)

c2 =1nd

n∑k=1

d∑i=1

(hi(yki)− µk)2.

This results in

pl(a1, b1, . . . , ad, bd) =

=n∏k=1

d∏i=1

1√2πc

exp

(hi(yki)− µk)2

2nd

n∑k=1

d∑i=1

(hi(yki)− µk)2

h′i(yki)

=end/2

(2π)nd/2 cnd

n∏k=1

d∏i=1

h′i(yki). (20)

6 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 9: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

The profile log-likelihood is, up to a constant, given by

pll(a1, b1, . . . , ad, bd) =

= −nd log c+n∑k=1

d∑i=1

log h′i(yki)

= −nd2

log

(n∑k=1

d∑i=1

(hi(yki)− µk)2

)+

n∑k=1

d∑i=1

log h′i(yki), (21)

and can be maximized numerically under the constraintsbi > 0. The first term onthe right hand side involves the sum of squared residuals, while the second termcontains the Jacobi determinant of the transformation. The optimization landscapeis not too different from that of an ordinary quadratic cost function and we havenever encountered multiple local maxima. Our implementation uses the R functionoptim with the methodL-BFGS-B .

4.2 Resistant regression

The maximum likelihood estimator that is obtained by maximizing (21) is sensitiveto deviations from normality and to the presence of differentially expressed genes,for which µki = µk does not hold. In the following, we consider a modificationwhich makes it more robust against outliers.In least sum of squares (LS) regression, the fitted parametersa are those that min-imize the sum of squared residuals,

als = argmina

n∑k=1

rk(a)2, (22)

whererk(a) is the residual between fit and thek-th data point andn is the num-ber of data points. Inleast trimmed sum of squares (LTS) regression[10], this isreplaced by

alts = argmina

minK

∑k∈K

rk(a)2. (23)

The minimization overK runs over all subsets of{1, . . . , n} of sizednqltse, where0.5 < qlts ≤ 1 anddxe is the smallest integer greater or equal tox. While the LSestimate can be arbitrarily off due to even a single outlier, the LTS estimator has abreakdown point of approximately1 − qlts, that means, the estimation error doesnot become too large as long as the fraction of outliers is less than1− qlts.Practically, the exact solution of (23) is possible only for smalln. We use thefollowing heuristic iterative procedure. In the expression (21) for the profile log-likelihood, replace the sums overk = 1, . . . , n by sums overk ∈ K. Start the

7Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 10: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

iteration withK = {1, . . . , n}. This results in an initial parameter estimatea =(a1, b1, . . . , ad, bd) and residuals

rk =d∑i=1

(hi(yki)− µk)2, k = 1, . . . , n. (24)

Now partition the set{1, . . . , n} into 10 slicesTs, such thatT1 contains thosekfor which µk is smaller than the 10%-quantile of theµk, T2 those for whichµkis between the 10%- and 20%-quantile, and so on. LetQs be theqlts-quantile of{rk|k ∈ Ts}. A new selectionK of putative non-outlying probes is found through

K =⋃s

{k ∈ Ts|rk ≤ Qs}. (25)

With this, compute new parameter estimates, and repeat the iteration. Stable esti-mates are usually reached after less than 10 iterations.The iteration procedure is a simple version of the method proposed in [12]. Inaddition, rather than permitting forK any subset of{1, . . . , n} of size dnqltse,we require thatK contains an equal amount of data from every slice, accordingto Eqn. (25). This has the following advantage. Robust estimation involves theweighting of data points as more or less reliable and thus implies a trade-off be-tween being diverted by outliers, and ignoring important data. To find a goodcompromise, it can be useful to consider further context information. Here, wehave invoked the notion that the fraction of outliers should be roughly the sameacross the whole range of average intensitiesµk.

5 Results

5.1 Properties of the variance stabilizing transformation

The derivation of the variance stabilizing transformation (10) involves the approx-imation (8). Fig.2 investigates how well this approximation holds for a familyYmof random variables distributed according to

Ym = meη + ν, η ∼ N(0, σ2η), ν ∼ N(0, 1) (26)

for parametersm,ση > 0. This corresponds to the right hand side of Eqn. (4).In the notation of Eqn. (26), the variance stabilizing transformation (10) takes theform h(y) = arsinh(cy) with c2 = exp(σ2

η) − 1. If m is large,Ym is dominatedby the multiplicative term,Ym ≈ meη, andh(Ym) ≈ log(Ym). The asymptoticstandard deviation ofh(Ym) form→∞ is thus Sd(log(meη)) = ση. In Fig.2, the

8 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 11: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

behavior of Sd(h(Ym)) for finite values ofm is compared against the asymptoticvalue for different choices of the parameterση. For each plot, the function wasnumerically evaluated at 60 values ofm through Monte Carlo integration with106

samples. Even in the case ofση = 0.4, which corresponds to a probability of5% of observing a relative error larger thane2·0.4 ≈ 2.2, the standard deviationSd(h(Ym)) does not depart from the asymptotic value by more than a factor of1.035.

ση = 0.05

m

Sd(

h(Y

m))

σ η

0 20 40 60

0.96

1.00

1.04

ση = 0.1

m

Sd(

h(Y

m))

σ η

0 20 40 600.

961.

001.

04

ση = 0.2

m

Sd(

h(Y

m))

σ η

0 20 40 60

0.96

1.00

1.04

ση = 0.4

m

Sd(

h(Y

m))

σ η

0 20 40 60

0.96

1.00

1.04

Figure 2: Validation of the approximation used in Section3. For a family of ran-dom variables distributed according to the error model (26), the plots show thestandard deviation of the transformed valuesh(Ym) = arsinh(cYm), divided bythe asymptotic valueση that is obtained form→∞.

An example for the effect of the variance stabilizing transformation (10) on realdata is shown in Fig.3. RNA from biopsies of adjacent parts of a kidney tumorwas labeled with red and green dyes, respectively, and hybridized to a cDNA mi-croarray. Fluorescence intensities were measured with a laser scanner. Per-spotsummary intensity values were determined with the ArrayVision software (Imag-ing Research Inc., St. Catharines, Ontario, Canada). A spot’s intensity was ob-

9Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 12: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

0 2000 4000 6000 8000

−10

000

1000

∆y

0 2000 4000 6000 8000

−4

−2

02

4∆l

og(y

)

0 2000 4000 6000 8000

−4

−2

02

4∆h

(y)

Figure 3: Three different transformations applied to data from a cDNA microar-ray. The top panel shows the difference∆yk = yk2 − yk1 between background-subtracted and calibrated red and green intensities on they-axis versus the rank oftheir sumyk1 + yk2 on thex-axis. Similarly, the middle plot shows the log-ratio∆ log(yk) = log(yk2) − log(yk1) versus the rank of the sumlog(yk1) + log(yk2)and the bottom plot shows∆h(yk) = h(yk2) − h(yk1) versus the rank ofh(yk1) + h(yk2). Plotting against the rank distributes the data evenly along thex-axis and thus facilitates the visualization of variance heterogeneity.

10 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 13: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

tained by subtracting the median brightness of surrounding pixels from the medianof those within. The expression levels of almost all genes in the two samples areexpected to be unchanged. Thus, the vertical width of the band of points repre-sents the scale of the distribution of differences. For the untransformed intensities(upper plot), the width increases to the right, in accordance with the error modeldescribed in Section2. For log-ratios (middle plot), the differences are largest atthe left end, for measurements of low intensity. The far left region of the plot cor-responds to spots in which one or both of the intensities were smaller than 1, inwhich case the logarithm was replaced by the value 0. While the log-ratios have aninterquartile range (IQR) of about0.2 in the high-intensity regime, the IQR is aslarge as0.95 for the points with ranks 1500 to 2000. A constant IQR of about0.2along the whole range of intensities is observed for the generalized log-ratio∆h(bottom plot). Note that this value is the same as that in the high-intensity end ofthe log-ratio plot, in agreement with the asymptotic relationship (14).The marginal distribution of∆h is compared a against a normal distribution inthe quantile-quantile (QQ) plot Fig.4. While the distribution is roughly normalin the center and approximately symmetric, its tails are heavier than normal. Thisobservation, which we have made on many data sets, motivates the use of normaltheory and the robust modification of Section4.

5.2 Simulation of data

To verify the computational feasibility of the estimator constructed in Section4and to investigate its behavior for different sample sizes, we ran simulation studies.Simulated data were generated according to model (11) with Lε = N(0, c2).Values for the parametersµki were generated as follows. First, following [13], foreach genek a valueµk was drawn according to

µk = arsinh (mk) , 1/mk ∼ Γ(1, 1). (27)

The density of the reciprocalΓ(1, 1) distribution is shown in Fig.5. To model themixture of non-differentially and differentially expressed genes, indicatorspk ∈{0, 1} were generated withP [pk = 1] = pdiff . For each gene withpk = 1 andfor each samplei ≥ 2 a factorski ∈ {−1, 1} was drawn withP [ski = 1] = pup

and an amplitudezki was drawn from the uniform distributionU(0, zmax). Thus,pk indicates whether or not genek shows differential expression;ski, whether it isup- or down-regulated in samplei compared to sample 1;zki, by how much. Thesewere combined to obtain

µk1 = µk

µki = µk + pkskizki, i ≥ 2. (28)

11Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 14: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

xxx xxx

xx xx xxx x

xxxxxx

xxx xxx

xx xxx

xx

x xxx xxx

x

x xxx xxx

x xxx xxxx

x xx

xx

xxx

xxx

xx xxxx x xx

x xx

x xx xxxxxxx

xxx

xx x x xx x

x

xxx xxx

x xx x

xx

x xx

xx

xxx xxxx

xxx

xxxx xxxx xx x

xxxxxxxx

x xx x

xxxxxx x

xxx

xxxxxxx xxx xxx x

xxxx

xxx

xx

xxx

xx

xx xxxxx

x xx

xx xx

xx

xxx xxx

xx

x x xxxx

xxx

xxx

xxxxxx x

xxx

xxxx

xxxxx

xxx x

x xxx x

xx xxx x x

xx

xx

xxxx xx

xx x

xx

x xx x

xxx xx x

x

xxx

x x xx x xx x

xx

xxxx xx

x

xxx

xxxxxxx x

xxx xxx x x

xx x xx

xxxxxxx x

xxx x

x

xxx

xx xx

xxx

xx

xxxxx

xxx

xx

xxx

xxx

xxxx x

xx

xx xxx

xxx

xxx

xx xxxxxxx xxx

xxxxxx

x xxxxx

xxxx

x xx xxx

x xx xxx xx

xx

xxx

xxxxx

xxxxxxxx

xxxxxxx

xxx

xxxxxx x x

x

xx

xxx

xxxx

xx

xxx

xxx x

xx

xx xxxxxxx

x

xx

xxx

x xx

xxx xxxxxxx xxxxxxxx

x x xxx xxxx

xx

xx

xxxxxx

xxxx x x x x

xxx

x xx

xxxx

xx

x xxx x xxx

xx

xxxxxx

xxxx

xx xx

x xx x

x xxxxx

x

xxxxxxxx xx x

xxxxxxxxxx xx

xxxx

xxxx

xxx x

x x xx

x xxx

x xxxxx xxxxx xx xxxxx xx

xxxx xxxxx

xxxx

x xx xxxxx

x x xxx xx x

x

xxx

xxx

xx

xxxxx

xx xxx xxxxxxx xxx

xx

xxx

xxx xx

xxxxxxx

xx

xx xxxxxxxxxx

x xxxx

x

xx x

xxx

xxxxxxxx

xxx

x xxx

xxx

xx xxxxxxx

xx xx x

xx

xx

xxx xxxx

x xx xx x x xxx

xx

xxx

x

xxxx x

xx

x xxx x

xx xx x

xxxxxx xxx x

xx

xxxxxx x

xxx xxx

xxx

x xxxxxxxxx

xx

xxx

x xxxx

x xx

xxx

xx

x

xxxxx xx

xx xxx

xxx

x

xxxxx

x xx

xxx x

xxxx

xxxx xxxxx

x xxx xxxx xx

xx xxxx xx xxxxx

xxxxxxx

xxxxxx xx

xx

x xxxx

xxxx

xxxx xx xxx

xx

x x xxx xx x xxx

xx

xxx x xxx x

xx x x

xxxx

xx xxx

xxxxx xxxx xxx

xxxxx

xxxxx

x

xxxxx

xx

x

xxxx x

xxxx

x xxx x

xx xxx x

xxxx

xxxx

xxxx x xx xxx

xxxxx

x xxx

xxxxx

xx

xx

xxx

xxxx

xxxx

xx xxxxx

xxxx xxxxxx xxxx xx x x

xx x xxxxx

xx

xxxxxx

xxxxx

xxxxxxx

x xxx

xx xxxxx

x xx

xxxxx

x xx

xxx x xxx xxxx xxxx x

x x xxxxx

xx

xxx xx

xx x

x xx

xx

xx x xxxxxx xxx

xxx

xxxx xxx

xx

x xxxxx xx

x xxxx x x xxxx

xxxx xxxx

xxxx

xxx xx xxxxx x

xxx

xxx

x x xxxxx x

xxx xx

xxx

xx x x

xxxxx xxx

x

xxx xx

xxx x

xxxxx

xx x x

xx

xxxx

xxxxxxx x

xxx

xxxxxx

xxxxxxx

xxx

xx x xx x xxxx xxxxx x

x xxx

xx xx

x x

xx

x xx

x

xx

x xx

xxxx

x xxx x xx

x x x xxxxx xxxxxxx x

xx xx x

xxxxxx

xxxxxx

xxxxx x

x xxx

x

xxx xx xxx

xxxxx xx

xx

xxxxx

xxxxx

x xx x

xxx xxxx

xx

xxx xxxx

x x xxxx x xx

xxx x

xxxx

xx xxxx

x x

xxxxx

x xxx

x x xxxx

xx

xxx x

x xx xxxx xx

x xxxxxxxxx

xx x

xxx

xxxx xxxx

xx

xx

xxxxx

x

xxxxx

xxx

xx

xxxxxxxx x xxxxxx xx

x

xx

x

xxxxxxx xxx

xx x xx xx

xx xxxx

xxx

x xx x xxxx

xx

xxx xx x xxx

xxx

xxxx

xxxxx

xxxx

xx

xxxx

xxxx

xx

xxx

x

xx

xx

xxxx x

xxxx

x xxxxxx

xx

xx

xxx x x x

xxxxxx xx x

xxxxx

xxx

xxx

xx x

xx

xxx

xx

xx xx xx

xxxx

xxx

xx x

xx xxxx x

xxx x

x x xx xx

xx

xxxx xx x xx x x

xxxx

x xx x xx xxx

xx

xxxxx x x

x xx

xxx

xxxx

xx

xxx x

xx xxx

x

xx

xxxx xxxx x

xx

x xxxxxxx x xxx

x

xx xxx xx

xxx

xx

xxx

x

xx xx xxx x

xxx

x

x x

xxxxxxx xxx xxx

xxxxx xx

xx xxxxxxx

x x xxxxx

x xxxxx

xxxxx

xx

xx xxxxx

xxx x

xxxxxxxx

xxxxxxxxxxxx

xx

xxxx xx xxx

xxxxxxxxx

xx

xxxxxx

xxxxx x

x x

xxx x

x xxx

xx

x xxxxx xxx xx

x

xxx

xxxx xxxxxxx

xx

xx

xxxxx

xxxxx

xx

xxx

xxxxxxxxx

xx

x xxxxx

xxxxxxx xxx

xxxx

xx x xxxxxxx

xxxxx x

x xxx x

xxx xx

xxxx

xxx

xxx

x xxx xxx

x xx

xx

x

x xxxxx

xxx

xx xxxxx

xxxxx

x x xxx

xxx

xx

xx

xxx

xx

xx

xxxxxx

xxxxx

xxx

xxx

xx

xx

xxx x

xxx

x

xxxx

x xxx

x xxx

xx

xxxx

xx

x

xx xx

xx x xx

xx x xx x xx

xx

xxxx xx x

x

xx

x xxxxx

xxxxxxx

x

x x x xxxxxx x xx

xxxxxx

x xxx

x xx

xx xx

xx

xxx

xx x

x xxxx x x

xxxxx x xxxxxxx

xxxx

xxx

xx xxxx

xxx xx

xxx

xxx

x

xx

x xxxx x xx

xx xx

xxx

x

x

xxxxx x

xx

xxx

xxxxxxx

xxx

xxxx

xx x

xx

xx x xxx xxx xx

xxx x xx x xx x xxx

xx

xxxx

x xxx

xx

xx x xx x

xx

xxx

xxxx

xxxxxx x xxxxxx

xx

x

xxxx

xxx xx

xx x xxx x x

x xxxx

x

xxx

xx xxx

xxxx xx x xxx x x

x x

xxx xxxx

x xxx

xxx xx

xxxx x

x

xxxx

x xx xx x

xx

xxxx

xx xxxx

xx

x xxx

xxxx xxxxxxx x

xxx

xx

xxx

x

x x xx

xxxx

xxxx xxxxxxx

xxx x xx xx x

xx

xxx

x xxxx xx

xxxx

xx

xx xx x

xxxxxx

xx

x xxxxx x

xx x

xxxxxx x

x x xxxx x xxx

xx

xx

xx

xxx x xx xx x

x

xx xx xx x

xxxxx

xx x

xxx

xxxx

xx

xx

xxxx x

xxx x

xx

x xx xxxxx xxxxx

xxx

xx x

xx

xx

xxxxxxx

xxxxxx x xxxx xxx x

xxxxx xxx

x

xxxxxx

xxx

xxxx

x xxxxx

xxx xxx

xx

xx x

x

xx

xxxxxx

xxx

xxxxx

xxx x

xxx xxxxx

xx xxx xxxxxx

xxx x

xxxx

xx x

x xxx

xx

xx

x xxx

xxxxxx

xxxx x xxxx

xxxx

xxx

xxx

xxxxx

xx

xx

x

xxxx x

xx

xxx

xxxxxx

x

xxx

xxx

xx

xxx x

x

x

x

xx

xx

xx x

x xxxxx x

x x

xx

xxx

xx

x xxxx x

x xxx

xxx x

xxx

xx

xxx x

xx

xxx x x

xx

xxxxx

xx

xx

xxxx x x

x

x

xxxx

xx

x xxxxxxx xxxxxxxx

x xx xxxx

xxx

xxx

xx x

xxxx

xxx

xx

xx xxx

x xxx

xx

xx

x xx

xxx x

xxxx

xxxxxx

xx xxx

xxxxx

xxxxx

xx

xx xx

x xxxx

xx xx xxx

xxx

xx

xxxxxx

xxx xx

xxxx

xxxxx

xx

x xx

x xx x xxx

x xxx

xx xxx

xxx xx

xx x xxx

xxxxxxx

x

xx

xx xxx

x

xxxx

x x xxxxx

xxx

x x

xx

xxxx

xx

xx

xxxx

xxxx

xx

xx

x

x xx

x xxx x

xx xx x x x

xx

xxx x xx x

xx xx

x

xxxx

xxx

xxx

xxxx

xxxxxx xx

xxx

xxx

x

x

xx

xxx

xx

x xxx xxx

xxxx

xxx x

xx xxxx

xx

xxxxxx

xx

xxx

xxxx x xxxxxxx

x xxxx xxx x

x

xxx x x

xx

x xx xxx

x

xxxxxxx

xx

xx

xx

x xx

xxx

xx xxx xxxxx

xxxxx xxxxxx x

xxxx

xxxx

xxxx

xxx

x

xx

xx

xx xxx

xx

xx

x xx

x

x

x

xxxx x

xx xxxx

xx

xx x

xx

x xxxxxxxx

x xxxx x

xx

x xx

xxxx xxx xxxx

xx xx

xx xx

x xx

x xxx

xxx

xxxxx

xxxx x

xxx

xxx

xx

xx

xxxx

xx

xxx

xx

x xx

xx

xxx

xxx x

xxx

xx x x x

xxxx xxx

xxxx xx

x xxxxx x x xx

xxxxxx

xxx x

xx

xxx

x xxxx

x xxxx x

x xx

xxx xx

xx

xxx

xx xxx

xxx

x

xxxxx x

xx x x

x xx xx

x x xxxx xx x

xxx

xxx

xxx xx

xx

x xxxxx

x

xx

xx

xx xxxx

x

xx

x xxxx

xx

xxxx x

xxxxx

xxx xxxx xxxxxx

xx x

x xxxxx

xxxxxx xx

xxxxxxx

xxx x

x xxx

xx xx xx x

xx

xxx

xxxx x

xx x xxxxx

x

xxx

x x

xxx x x

x x xx

xx

x

x x

xx

xxx

xx

xx x

xx

x

xx

xxxxx

xxxxx

x

xx

xx

xx

xxxxx

x x xxxx xx xx

xx

x xxx x

x

xx

x

x

xx x xxxx x

xx

xx

xxx

xx

xxx xx

xx

xxxxx

xxxx xx

x xxx

xx xx x xxx x

xxx xxxx

xx

xx

x

xxx

xx

xx

x

xxx xx

xx

xx

xx

xxx

x

x

xx

x x xxxxx

xxxxxx

xx

xx

x xxxxx xx xxx

xxxxxx x

xx

xxx

xxxx

x

x

xx

xxx

xxxxx

xxx xxxx

xx

xx

xx

xx xxx x

xx

xxxx

xxxxx

xx

xx

x

xx

xx x

xxx

x x

x

xxx xx x x

x

x

xxxx

xxxxxxxx

xx

xx

xxxxxxx

xx xxxxxx xxx

xx x xxxxx

xx

xxx

xx xxx xxx

x

xxxx

xx x

xxxx

xxx

xxxx

x xxx

xxxx

xx

x

xxxx x

xxx

xx xxxx x

xx

x

xxxx xxx xx

x x xx

xx

xxx

xxx

xx xxxxx

xxx

x xxxxxxx

xxx

x xx xx xxxx

xxx

x

xxx xxx

xxxx

xxxx

xxxx

xx xx

x xx

xxxx

xx

x xxxx

xxxx

x xxxxx

xxxxx

xxxxx xxxx x

xxx

xx

x x x xxxx xx

xxx

xx

x xxx

xx xx xx x x

x

xxxxx

xxx

xxx

xxxx

xx

xxxxxx xxxx x

xxx x x xx

xxx

x xx

xx xxx xxx

xx

xx

xxxxxxx

xx xxx

xxxx xx

xxx

x xxxx x

xxx

xxx

xxx

xx

xx

xxxx

xx

xxxx

xxxx

x

xxxx

x xxxxx

xxx

xxxx

xxxx

xxxx

x

xxxx

xx

xxxx x

xx

xxx

x xx

xxxxxx xx

xxxxx

xx xx xxx

xxxx

xx

x xxx

xxxx

xx

x

xxx

xxx

xxx

xxxxx x

xxxxxx x

xx

xxxx

xx xxxxxxxxx

x xxxxxx

xxx

xxxxxxxxx x

xx xx

xx

xxx xx

xxx

x xxxxx

xxx

xxx xx

xx

xxx xxx

x xxxxxx x

x x xxxxxxxx

xx

xx xxx

xx x xxx x

xx

x xxxx

x

xx

x

xx

xxxxx xx xx

x xxx xxxx x xx xxxx

x xx

xxx xx

x

xxx

x x xxx xxx

xxxx x

xx

x xxx x

xx xxxxx

xxxxxxx

xx xx

xx xx

xx

xx x x xx

xxx xxx

xx x

xx

xxxx

x xxxx xx

xx xxxx

xx x

xx

xxx x

xx

x xxxxxxxxx

xxx x xxxx x

xxxxx xx

x

x x x xx x

x xxxx

x xxx

x xx xxx

x xx

x

xx xxxxxx

x xxxx

x xxxxx

xxxxx

xx

xxxx

xx

xxxx x x xxxx

xxxx

xx

x xxxx

xxx x

x xx xxx

x x xxx x xxx

xx

xx

x xx x

xxxxxxxx

xx x xx

xxxxxx

x xxx

xx xxx xxx

xx xx

x xxxx x xx xxx xxx xx

xxx

x x

xxxx

x

xxxx x

xxxxx

x

xx

x xxx

xx

xxx x

xxx x xx

xxx

x xxx

xxxxx

xx xxx

xxx

x xxx

xx

xx

xxxx xx

x

xxxxx

xx

xx xxx

xx

xxxxx

xx x xxx

x

x

x

x

xx

xxx

xx

xxx

xx

xx

xxxx

xxx

xxx

xx

xxxxxxx

xx x

xxxx

xxxxx xxxxx

xxxxxx

xxx

xx xx xxx

xx

x

xxx x

xx

x

xxx

xxx xxxx

xx xx

xx

xxxx xxxxx

x xxx x

xxxx

xxxx

xxxx

xx

x

x

xxxx x

xxxx

xx xx xx

xxx

xx

xx

x xxx

xx

xxxx

xx

xxxxx x

x xxx xxxx

xx x xxxx

xxx x

xxx

x

xx

xx xxx x x xxxxxx

xxx

xx

xxxxxxx xx xxxxx

xxxxx

xx x

x

xxxxx

x xxxx x

x xx

xxxx

xxxx xx

xx

xxx

xxxx

xxxxx

xx

xxxx

xxx

xx

xx x xxx x xxxx

xxx xxxx

x

xx

x

xx

xxx x x xx

xxx x

xxx x

xx

xx

xxx xx

xxx

xx

xx

xxx x

xx

x xxxx

x xxx x

xxx x xx xxxx xx

xxxxx

xxxxxxx

xxx xxxx

x x x xxxx xx

xxxxx

xxxxx

xxx

xx x

x x xx xx

x xx x

x xx xx xxxxxx xxx

xxx

xx

xxxx

xx xxx

xxxxxx

xxx

x

xxxxxx

x xx

xx

xx

xx

xx xxx

xx xxx

xxxxx

xx

x

xxxxxx

x xxxxxxxxx

xx x xxxxxxx

xxx

xxx

xxx

x xxx

xx xx

xxxxx

xxx

xxxx

x xxx

x xxxx x x

x xxx x

xx

x xxxx xxx

xx xx x

x

x

x

xx

xxxxx

xx

xxxx

xxx

xxxx

xx

xxxx x

xx

x xxxxx

xx

xx

xx

xx x xxx

xxx x

xxx xx

xx

x xx

xx

x xxx

xxxx

x x xxx

xx

xx

xx

xxxx x xx

xxxx

x

xxx

xxxxx

x xxxx xxx xxxxx xxx

x

xxxxx

xx

x

xx

xxx

x xx

x x xx x

xx

xxx xx

xx x

xx

xxxx

xxx

xx xx xxxx xxx

xxxxx

xxxx

xx x

xx

x xxx

xxxx x x xxxx xxx xx

x xxxxx x

x xxxxx

x x

xxxxx

xx

xxxx xxxxxx

xx

x xxx

xx

x xx

x xxx

xxx

xxxx

x

xxxx xx

x x xx

xx

xxx

xx x xx

x xxx

xxxxx x

xxxxx

xxx

xxxxx

xxxxxxx

x xx

xxxx x

x xx

xxx xx

xxxx

xxx

x x x

x

xx x

x xx x xx

xx

xx

xxx

xx xxx

x

x

x

xx

xxxxx

x x xxx

xxx

xx

xxxx

xx

xx

xxxx xx

xxxx

xxx

x

x xx

xxxx

xx

xxx

xxxx

xx xx xxx

x

x x

xxx x x

x

xxx x

xx

xx

xx

xxxxx

xxxxx

xx

xx

xx

x

x

xx

xx

x

x

x

xxxxx

xx

xxxx

xx

x x

xxx x

xx

x xxx xx

xx

xxxxxx

xx

xx x xxxx

x xx

x xxxx

xx

x

xx

x

xx

x

x x

x

xx

xxx

xxxxx x

xx

xxxxx

xxxx

xx

xxxx

x

xx xx

xx xx

xx

xx

xx

xxx x

xxx

xxx

x xxxxx

x

x

xx

x

xxx

xx

x

x

xx

x

x

xx

xx xx

xxxx

xx

xx xx xx x x xxx

xx xxx

x xxxx

xx

xxxx

xxx

xxx x

xx

xxxxxxxxxx

x

x xxxx xxxx

xx

x xxx xx

xxx xxx

xx

xxx xxxxx x xx

x

x

xx

x

xxx

xx

xxxxxxx xxxxx x xxxx

xx xxx

xxx

x x xxxx xx xx

xxx

xx xx x xxxxxxxx x x

xxx

xx xxxx x

xxx xx xxxx

xxxx

xx

xx

x

xx xxx

xx xx

xxx xx xx x xx xxxx

x xx

xx x

xx

xx

xx xx

xxx

xxxxx

x

xx

x xxx x

xxx xx xx x

xx

x x

xx x xxxx

xx xx

x

x

x

x

xx x xxxx

xxx

xxxx

xx xxxxxxx

xxx

x xxxx x x xx x

xx

xx

x xxx

x xxxxxxxx

xx xx xx xx

xx x

xx

xxxx

xxx xxxx x

xxxx

xxxx

x

x xxx x

xx x

xxx

xx xxxx x

xxxx xx

xxxxx x

x x xx x

xx

x xxxx

x

x

x

xxxxxx

xxx

xxxxxx

xxxxxxx xxxxxxxx

xxx

xxxx x

x

xx

x

x

xxx xxxxx x

xx

x x

xx x

xxx

xx

x

xx xx

xxxx

x xx

xxxxxx

x xxxx

xx

x

xxx x

x xx xx

xxx

xxx x

x

xxxxx

x x x x

x xx xx

xx

x xxx xx

xxxx

xxx xx x

x xx

xxx x

xxxxxxxxx x

x

xxxxxxx

xxx

xx x

xx

xx

xx

xxx x

xx

x

xx

xx

xxxx xx x

xxxxx

xxxx x x x xxxxxxx

xxx

xx

xx

x

x

xx

xxx xxx

xx

x

xx

xx x

x xxx

xx xx

xxxx xx

xx

xxxx

xx

xxxx

x

xxxx xx

x xx x

x xxxx

xxx x

xx

x

xx

x

xx

xxxx

xx

xx

xx

xxxx

xx

x xxxxx

x

x

xxx x

xx

xx

xxxxx

x

xx

xxxxx

xx

x xxxx xx

x

xxx

x

xxx x

xxxx x

xx

x x

x

x

xxx x x xxx

xxxxxx

xxxx x xxxxxxx xx x x

xxxxxx

x

x

xx

x

xx

xx

x xx

xx

x

xxx

xx

xx xx

xxx

xx

xx

x

xxxxx

xxxx xxx

xxx

xxx

xx

x xxxxx

xxx x

x xx

xxxx

xx

xx

xx x

xx

xxxx

xx

xx

xx xx

x xxx

xxxx

xxx x xxx

x

xx xx

xxx

xxx

x xxx x

xxxxx x

xx xxxxx

xx

x xxx

xx

xx

x x

x x

xxx

x xxx xx

xx

xx

xxx

xx x

xxx

xxx xx

x xx

xx xxx

xxx

x x xxxxx x

xx

xxxx xxxxx

x

xxx xx

xxxxxxxx

x

xxx

x

xx

x

x

xxx

x

xx

xx

xx xx

xx

xx

xx xxx

xx

xx x xx x

xx

xx xxxx

x xxx

xxxxx

x

xxxx x

xxx xxx

xx x

x

xx

xx x x

xx

xx

xx

xx

xx

xx

xx

xxxxx

xxx xxxx

x

xxx

xxxx xxx

xxxx

xx x

x x xx

xxx

x

xxxxx

xxxxxx x x

xx x

x

xxx

xx

x

xx

xx

xxxxx

x

x

x xx

x

xx

xxxx

x xx

xxxxxxx

xx

xx

xxx x

xx x

xxx

xxxxx

x

xxxxx

x

xxxx

x xx

xx x xx x

xxxxxx

xx xxxxx

xxx

xx xx

xxxx x

x xxxx

xxx x x xx xxx

xx

xx

xx xx

xx

xx

x xxx x xx x

xx

xxxx xx

xxx

xxx

x

xxxxxxxxxx

xx x

x

xx

xx

xx

x xx

xxx

x xxx

x xxxx

xx

x xx x

xx x x

xx

x

xx

xxxxx

xx

xxxx xxx

x xx

xx

xx xx

xxxx xxxx x

xxx

xxxxx

xx

xx

x xxxx

x x

xxxxx xx xx

xxx xx

x x

xx

x xxxx

xx

xxxxx

xx

x x xx x

x xxxx

xx x

xx xx

x x xxx xxxxx

xxxxx

x

xx xx

x xxx

x

x x

xxxx

xx

xxx

xxx

x

xx

xx xx

x

xxx x xx

x

xx

xx x

x

xxx

xxxx x

xxx xxx

xxx xxx

xx xxx xx

x

xxx

xxxx

x x x xxxxx

x xxx

xxx

xxxx xxx

x xx

xxxxxx xxx x xx

xxx

x

xx

xxxxx xx

xx

xx xx

xx xx

xx

xxx

xxx

x xx

xxx xx

x

xx xxxx xx

xx

xxxxx

xxx

xx

x xxx

xxx

xxxxx

xxxxxx x

xxx

xx

xxxx

xxx

xxx

xxx

x

xxx x

x xxx

xx

x

xxx xx

x x

xxxx

xx

xx xxxxxxx

xxxx

x x xxxx

x

xxx xxx

xx x

x xxx

xx

x

xx

x

xxxxxx x

xxxxx

xx

xxx

xxx xxx

xxx

x

x xx

x

xx xx

x

xxxx

xx x x xxxx

xxxxxxxxx

xx xxx x xx

xxxxxx

xx x

x xx

xx

xxxxx

xxx

xxxx

xxxxxx x x

x xxx

xxx

xxx

x x

xxxx

xxxx xx

xxxxx

xxxx

xx

x x x x

xxxx

xx x x

xxx

xxx

xxxxx

xxx x

x xxxxxx

xxxx

xxxxx

x x

x xxxxx

xxx

xxxxx x

x xxxx x

xxx

xxx

xx xx

xxx

xx

xx

x xxx

xxx

xx

xxx

x xxxx

xxxx

xxxxx

x xxx x

xxxx

x

xx

xx

xx xx xxxx xxx

xxx xxxxxx

xxx x

xx

xx

xxxxxxx

xx

x

xxxx

xxxxxx

xxx x

x xxx xx

xxxx x x

x

x

x xx

xxxxx xx

xx

xxx

xxxxx x x x

x x

x

xx

xx

x x

x

xxx

xx

xxxx

x

xxxxxx

xx

xxxxx xxx

x

xx

x

xxx

xx

xxxx xxx x

xxx x

x

xxx

xx xxx xx

xx

xx x

xxxxxxxx

xx

xxx x

xx

x

xx xxx

x xxxx x

x xx

xxxxxxx

xx

xxx xx

xxx

xx xxxx

xxxx

xx

xxxx xxx

xxxx

xxx

xxx

xx

xx

xx

xx xx xxxx xxx xx

xx

x xx xxxxx xxxx

x x xxxxx xx

xx x

xxx x

xxxx x

xxxxx xx xxxxxxx xxx

xx

x

xxx xx xx x x

x

xx

xx x

xx

x xxx

xxx

x xxxx

xx

xx x xxx xxx x

x xx

xxxxx xx

xx

xxx

x

xxx

xx xxxx

xx x

x xxx

xx x xxx

xxx

x

xxx x x xxx x

xxxx

xx xxx

xx

xxx

x

−4 −2 0 2 4

−1.

5−

0.5

0.0

0.5

1.0

1.5

Theoretical Quantiles

Sam

ple

Qua

ntile

s

Figure 4: Normal QQ-plot of the distribution of∆h, using the same data as inFig. 3. The dashed lines correspond to the 5% and 95% percentiles of the sampledistribution, respectively.

12 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 15: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

0 2 4 6 8

density of simulated mk

Figure 5: Density of the reciprocalΓ distribution,1/mk ∼ Γ(1, 1). At the rightend, the plot is cut off at the 90%-quantile of the distribution.

Values for the calibration parametersai andbi were generated through

a1 = 0, a2, . . . , adiid∼ U(−∆a,∆a)

b1 = 1, b2, . . . , bdiid∼ LN(0, 1),

where∆a = 0.95 roughly corresponds to the peak of the distribution shown inFig. 5, U(−∆a,∆a) is the uniform distribution on the interval[−∆a,∆a], andLN(0, 1) is the log-normal distribution that corresponds to the standard normaldistribution. This yielded simulated datayki = ai + bi sinh(µki + εki).The matrixyki of simulated data was presented to the software implementationin the Bioconductor [14] packagevsn . It returns the estimated transformationsh1, . . . , hd, parameterized bya1, . . . , ad andb1, . . . , bd (see Eqn. (10)), as well asthe matrix of transformed datahki = arsinh((yki− ai)/bi). Generalized log-ratioswere calculated as

∆hki = hki − hk1 (29)

and compared to the true values

∆hki = hki − hk1, (30)

13Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 16: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

simulation A B C D

number of probes n 384, . . . , 69120 9216 9216 9216number of samples d 2 2, . . . , 64 2 2proportion ofdifferentially expressedgenes

pdiff 0 0 0, . . . , 0.6 0.2

proportion ofup-regulated genes

pup - - 0.5, 1 0, . . . , 1

amplitude of differentialexpression

zmax - - 2 2

trimming quantile qlts 0.5, 0.75, 1asymptotic coefficientof variation

c 0.2

Table 1: Simulation parameters.

with hki = µki + εki, by means of the root mean squared deviation

δ =

√√√√ 1|κ|(d− 1)

d∑i=2

∑k∈κ

(∆hki −∆hki)2, (31)

whereκ is the set ofk for whichpk = 0.For a given set of simulation parameters, this procedure was repeated multipletimes, resulting in a simulation distribution of the root mean squared errorδ. Thiswas used to obtain the error bars shown in Figs.6, 8, and9. The error bars arecentered at the mean and extend by twice the standard error of the mean in eachdirection.

5.3 Simulation results

Four series of simulations were performed to investigate the influence of the num-ber of probes, the number of samples, the proportion of differentially expressedgenes, and the choice of the trimming quantileqlts. The parameter settings aresummarized in Table1.The dependence of the estimation errorδ on the number of probesn and the numberof samplesd was investigated in simulation series A and B. The results are shownin Fig. 6. In the left plot, the number of probesn varies from 384 to 69120. Fromthe plot, a scaling of the root mean squared error approximately as

δ ∝ 1√n

(32)

14 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 17: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

10 12 14 16

−9

−8

−7

−6

−5

−4

log2 n

log 2

δ●

qlts

0.50.751

1 2 3 4 5 6

−8.

0−

7.5

−7.

0−

6.5

−6.

0

log2 d

log 2

δ

● ●

qlts

0.50.751

Figure 6: Dependence of the estimation errorδ on the number of probesn (leftpanel) and the number of samplesd (right panel), for three different choices of thetrimming quantileqlts. The dashed line in the left panel has a slope of−1/2 andcorresponds to a scalingδ ∝ 1/

√n.

can be observed. In the right plot,d varies from 2 to 64, again with three differentvalues ofqlts. While δ does slightly decrease withd, the decrease is much slowerthan that withn and does not show an obvious scaling such as (32). The differencebetween the two plots may be explained by the fact that the number of parametersof the transformations (10) is 2d. Thus, the number of data points per parame-ter remains constant whend is increased, but increases proportionally whenn isincreased.The dependence of the required computation time on the parametersn, d andqlts

is shown in Fig.7. The plots indicate a scaling approximately as

tCPU∝ n× d. (33)

The computation times were measured with the Bioconductor packagevsn version1.0.3 andR version 1.6.1 on a DEC Alpha EV68 processor at 1 GHz. On thissystem, the proportionality factor in (33) is about 2 ms.The effect of the presence of differentially expressed genes on the estimation errorδ was investigated in simulation series C and is shown in Fig.8. With qlts = 1, i. e.,without use of the robust trimming criterion,δ becomes large even in the presenceof only a few differentially expressed genes. Aspdiff increases, the estimationerror remains smaller with trimming at the median (qlts = 0.5) than at the 75%

15Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 18: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

0 10000 30000 50000 70000

050

100

150

n

t CP

U [

s]

●●●

qlts

0.50.751

0 10 20 30 40 50 60

050

010

0015

00

d

t CP

U [

s]

●●

qlts

0.50.751

Figure 7: Dependence of the computation timetCPU on the parametersn, d, andqlts. Parameter values were as in Fig.6.

0.0 0.1 0.2 0.3 0.4 0.5 0.6

0.0

0.1

0.2

0.3

0.4

pdiff

δ

● ●●

●●

qlts

0.50.751pup = 0.5

0.0 0.1 0.2 0.3 0.4 0.5 0.6

0.0

0.1

0.2

0.3

0.4

pdiff

δ

qlts

0.50.751pup = 1

Figure 8: Estimation errorδ for different mixture proportionspdiff of differentiallyexpressed genes and three different choices of the trimming quantileqlts. In the leftplot, half of the differentially expressed genes are up-regulated while the other halfis down-regulated. In the right plot, all of the differentially expressed genes areup-regulated.

16 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 19: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

quantile (qlts = 0.75). Asymmetric situations (pup = 1, right panel) are worse thansymmetric ones (pup = 0.5, left panel), but still can be handled reasonably as longas the proportion of differentially expressed genes is not too large.

0.0 0.2 0.4 0.6 0.8 1.0

0.05

0.10

0.15

0.20

0.25

0.30

pup

δ

●●

● ●

qlts

0.50.751

0.0 0.2 0.4 0.6 0.8 1.0−

0.2

−0.

10.

00.

1pup

θ

●●

●●

●●

qlts

0.50.751

Figure 9: Estimation errorδ and biasθ for different values of the asymmetry pa-rameterpup. Among the set of differentially expressed genes,pup is the fraction ofup-regulated and1− pup the fraction of down-regulated genes.

Another look at the influence of asymmetry between up- and down-regulated genesis shown in Fig.9, which was obtained from simulation series D. Here,pdiff wasfixed at0.2. The right panel shows the average bias

θ =1|κ|∑k∈κ

(∆hk2 −∆hk2

). (34)

The robust procedures (qlts = 0.5 and0.75) perform much better than the unrobustone (qlts = 1). Bias θ and errorδ are smallest when the fractions of up- anddown-regulated genes are about the same. An interesting feature of Fig.9 is theasymmetry of the plots about the vertical linepup = 0.5. The presence of up-regulated genes has a somewhat stronger effect onδ and θ than that of down-regulated ones. This appears to be related to the skewness of the distribution ofµk (see Eqn. (27)), and to the curvature of the function (10).

17Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 20: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

6 Discussion

Models are never correct, but they may be useful. There is a trade-off betweenthe wish to describe as much detail as possible and the problem of overfitting.Model (17) has two parameters per sample, an offsetai and a normalization fac-tor bi, and one parameter for the whole experiment, the standard deviationσε. Inaddition, we assume that for each non-differentially expressed genek the center ofthe distribution of intensities is parameterized byµk for all samplesi. Many exten-sions and variations of the model are possible. Instead of offsets and normalizationfactors that are common for all measurements from a sample, print-tip-specific orplate-specific parameters may be used [15]. Systematic differences between dif-ferent production batches of arrays or reagents could be modeled by allowing fordifferent values ofµk in the different batches. Furthermore, we have assumed thatthe standard deviations of the additive noise terms for different samples are relatedto each other via Sd(νki) = const. We find this to be an acceptable approximationfor the data we have encountered, but other relationships, such as Sd(νki) = const.or Sd(νki) = λi with further parametersλi could also be appropriate.Error modeling and calibration depend on the particularities of the technologiesused. For this article, we have simply considered the probe intensities as given, set-ting aside important questions such as how the labeling and detection are realizedand how the probe intensities are obtained from the fluorescence images. Whetheror not the intensities measured from an experiment accord to the assumptions laidout in Section2 has to be verified case by case; however, we have generally foundgood agreement with data from two-color spotted cDNA arrays, from radioactivenylon membranes, and from Affymetrix genechips. An advantage of the modelingapproach compared to a heuristic, algorithm-oriented approach is that it providescriteria for quality control: by explicitly stating the assumptions made on the data,insufficient data quality can be detected by statistical tools such as residual analy-sis.On cDNA arrays, typically each probe is sensitive for a distinct gene transcript.The calibrated and transformed intensities may be directly used as a measure forthe abundance of transcripts in the samples. On Affymetrix genechips, multipleoligonucleotides of potentially different specificity and sensitivity probe for thesame transcript. There, we recommend to use our method on the individual probeintensities. Approaches to the question of how these then can be combined intoper-gene summary values are described in the references [16, 17, 18].The computation time consumed by our software implementation is generally toolarge for interactive use. For example, withn = 40000 probes,d = 100 samplesand a proportionality factor of 2 ms in Eqn. (33), tCPU≈ 2h. The time-critical partof the computations is the iterative likelihood optimization. There are two ways to

18 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 21: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

reduce the time: First, the proportionality factor could be reduced by implement-ing code in C instead of R. Second, sublinear scaling can be achieved instead ofEqn. (33) by not using the data from all probes, but only from a (quasi-)randomsubset.

From the point of view of the user, two important issues are interpretability andbottom-line performance. In these respects, the method presented here needs to becompared against established approaches that are based on the logarithmic trans-formation and the calculation of (log-)ratios. A comparison on real data showedhigher selectivity and sensitivity for the idenfication of differentially expressedgenes [3]. The (log)-ratio has, at first sight, the advantage that it can be sim-ply and intuitively interpreted in terms of “fold change”. However, the value ofthe log-ratio is highly variable or may be undefined when either the numeratoror the denominator are close to zero. Since many microarray data sets includegenes that are not or only weakly expressed in some of the conditions of interest,the significance of fold changes can be difficult to assess for a large and poten-tially important part of the data. Furthermore, many authors have noted a needfor non-linear normalization transformations to be applied in conjunction with thelog-transformation [15, 18, 19, 20]. These have the goal of removing an intensitydependent bias from the log-ratios. They are often implemented through a scat-terplot smoother or a local regression estimator. The result of that has no longer asimple interpretation in terms of fold changes.The approach presented in this paper offers a rational and practicable solution tothese problems. Through the criterion of variance stabilization, we arrive at a trans-formation that corresponds to the logarithm when the intensity is well above back-ground, but has a smaller slope for intensities close to zero. Thus, the generalizedlog-ratio∆h coincides with the usual log-ratio∆ log when the latter is meaning-ful, but is shrunk towards zero when the numerator or denominator are small. Non-linear calibration transformations are often motivated by the curvilinear appearanceof the scatterplot on the log-log scale. This may be caused, for example, by differ-ences in the overall background between different arrays or color channels. In thatcase, these differences may also be modeled by the family of transformations (10),which allow different offsetsai for different arrays or colorsi. An advantage ofthis over calibration by local regression is that it does not depend on the choice ofa smoothing bandwidth, and that the offsetsai and scaling factorsbi are easier tointerprete.While our approach approximately removes the intensity-dependence of the vari-ance, this does not necessarily mean that the variance of the data for all genes isthe same. There may still be technical or biological reasons for the variance ofone gene being different from that of another, or even from itself under a differ-

19Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 22: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

ent biological condition. For instance, the tightness of regulatory control could bedifferent for a highly networked transcription factor than for a protein with mainlystructural function. However, whether or not such gene- or condition-specific vari-ances play a role, in any case the removal of the intensity-dependence should beadvantageous for subsequent analyses.

Acknowledgements

We thank Gunther Sawitzki and Dirk Buschmann for fruitful discussions, andRobert Gentleman and Jeff Gentry for useful suggestions and practical advice onthe software implementation of the Bioconductor package.

References

[1] Blythe P. Durbin, Johanna S. Hardin, Douglas M. Hawkins, and David M.Rocke. A variance-stabilizing transformation for gene-expression microarraydata.Bioinformatics, 18 Suppl. 1:S105–S110, 2002. ISMB 2002.1

[2] Peter Munson. A ”consistency” test for determining the signif-icance of gene expression changes on replicate samples and twoconvenient variance-stabilizing transformations. Genelogic Work-shop on Low Level Analysis of Affymetrix Genechip data,http://stat-www.berkeley.edu/users/terry/zarray/Affy/GLWorkshop/genelogic2001.html,2001. 1

[3] Wolfgang Huber, Anja von Heydebreck, Holger Sultmann, AnnemariePoustka, and Martin Vingron. Variance stabilization applied to microarraydata calibration and to the quantification of differential expression.Bioinfor-matics, 18 Suppl. 1:S96–S104, 2002. ISMB 2002.1, 3, 19

[4] T. Ideker, V. Thorsson, A.F. Siegel, and L.E. Hood. Testing for differentiallyexpressed genes by maximum-likelihood analysis of microarray data.Journalof Computational Biology, 7:805–818, 2000.1

[5] David M. Rocke and Blythe Durbin. A model for measurement error for geneexpression analysis.Journal of Computational Biology, 8:557–569, 2001.1,3

[6] Wolfgang Huber. Vignette: Robust calibration and variance stabilization withvsn, 2002. The bioconductor project,http://www.bioconductor.org. 1

20 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3

Page 23: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

[7] Tim Beißbarth, Kurt Fellenberg, Benedikt Brors, Rose Arribas-Prat, Ju-dith Maria Boer, Nicole C. Hauser, Marcel Scheideler, Jorg D. Hoheisel,Gunther Schutz, Annemarie Poustka, and Martin Vingron. Processing andquality control of DNA array hybridization data.Bioinformatics, 16:1014–1022, 2000.3

[8] David M. Rocke and Blythe Durbin. Approximate variance-stabilizing trans-formations for gene-expression microarray data.Bioinformatics, in press,2002. 5

[9] Xiangqin Cui, M. Kathleen Kerr, and Gary A. Churchill. Data transforma-tions for cDNA microarray data. Technical report, The Jackson Laboratory,http://www.jax.org/research/churchill, 2002. 5

[10] Peter J. Rousseeuw and Annick M. Leroy.Robust Regression and OutlierDetection. John Wiley and Sons, 1987.5, 7

[11] Susan A. Murphy and A. W. van der Vaart. On profile likelihood.Journal ofthe American Statistical Association, 95:449–465, 2000.6

[12] Peter J. Rousseeuw and Katrien van Driessen. Computing LTS regressionfor large data sets. Technical report, Antwerp Group on Robust & AppliedStatistics, 1999.http://win-www.uia.ac.be/u/statis/abstract/Comlts99.htm. 8

[13] M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W.Tsui. On differential variability of expression ratios: improving statisticalinference about gene expression changes from microarray data.Journal ofComputational Biology, 8(1):37–52, 2001.11

[14] Ross Ihaka and Robert Gentleman. R: A language for data analysis and graph-ics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.http://www.bioconductor.org. 13

[15] Sandrine Dudoit, Yee Hwa Yang, Terence P. Speed, and Matthew J. Callow.Statistical methods for identifying differentially expressed genes in replicatedcDNA microarray experiments.Statistica Sinica, 12:111–139, 2002.18, 19

[16] Affymetrix, Santa Clara, CA.Microarray Suite Version 5.0, User’s Guide.18

[17] Cheng Li and Wing Hung Wong. Model-based analysis of oligonucleotidearrays: Expression index computation and outlier detection.PNAS, 98:31–36, 2001. 18

21Huber et al.: Variance stabilization of microarray data

Produced by The Berkeley Electronic Press, 2004

Page 24: Statistical Applications in Genetics and Molecular Biology · Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In

[18] Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf, and Terence P. Speed. Ex-ploration, normalization, and summaries of high density oligonucleotidearray probe level data. Biostatistics, 2003. Accepted for publication.http://biosun01.biostat.jhsph.edu/∼ririzarr/papers/index.html. 18, 19

[19] Eric E. Schadt, Cheng Li, Byron Ellis, and Wing Hung Wong. Feature ex-traction and normalization algorithms for high-density oligonucleotide geneexpression array data.Journal of Cellular Biochemistry, Supplement 37:120–125, 2001. 19

[20] Thomas B. Kepler, Lynn Crosby, and Kevin T. Morgan. Normalization andanalysis of DNA microarray data by self-consistency and local regression.Genome Biology, 3(7):research0037.1–0037.12, 2002.19

22 Statistical Applications in Genetics and Molecular Biology Vol. 2 [2003], No. 1, Article 3

http://www.bepress.com/sagmb/vol2/iss1/art3