Textual Description of a y - Johns Hopkins Bloomberg ...ririzarr/Raffy/affy.pdf · Various built in...

Textual Description of affy

Rafael A. Irizarry and Laurent Gautier

May 2, 2002

Contents

1 Getting Started: Using Plobs 21.1 If you only want expression values . . . . . . . . . . . . . . . . . . . . . . 31.2 Loading Data into R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Exploring Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.5 Expression Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.6 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Working with Probe Level Data 172.1 Cdf-class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.2 Class Cel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.3 Class Cel.container . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4 Class PPSet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.5 Class PPSet.container . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Extending the package 243.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.2 Measures of Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Introduction

affy is part of the Bioconductor1 project. It is meant to be an extendible, interactiveenvironment for data analysis and exploration of Affymetrix oligonucleotide array probelevel data.

The software utilities provided with the Affymetrix suite summarizes the probe setintensities to form one expression measure for each gene. This is the data available foranalysis. However, as pointed out by Li and Wong (2001), much can be learned from

1http://www.bioconductor.org/

1

studying the individual probe intensities, or as we call them, the probe level data. Thisis why we developed this package.

We assume that the reader is already familiar with oligonucleotide arrays and with thedesign of the Affymetrix GeneChip arrays. If you’re not, we recommend the Appendix ofthe Affymetrix MAS manual Affymetrix (1999). The following terms are used throughoutthis document with a specific meaning.

probe oligonucleotides with length of 25 base pairs used to probe RNA targets.

perfect match probe supposed to match perfectly the target sequence.

PM intensity value read from the perfect matches.

mismatch the probe having one base mismatch with the target sequence.

MM intensity value read from the mis-matches.

probe pair a unit composed of a perfect match and its mismatch.

affyID an identification for a gene or a fraction of a gene on the array.

probe pair set several probes relating to a common affyID.

CEL files contain measured intensities and locations for an array that has been hy-bridized.

CDF file contain the information relating probe pair sets to locations on the array arestored.

The package provides two approaches to working with probe level data. The firstis via probe level objects (Plobs) and is intended for the non-expert R user. One canautomatically read in all the relevant information with one functions call and store it ina Plob. Various built in function permit the user to easily view graphical displays of thedata and compute expression values. Section 1 demonstrates how to work with Plobs.The second approach is more flexible and intended for users with some R experience.This approach is better if you intend to do a lot of probe level analyses. This approachuses objects of the classes Cdf and Cel described through various examples in Section 2.Section 3 gives some details on how to extend and alter the package for your needs.

1 Getting Started: Using Plobs

The first thing you need to do is load the package.

R> library(affy) ##load the affy package

2

This section described the use of Plobs (Probe level objects) through an example.These objects are easy to use. You can read data, explore, and compute expressionvalues with one function call for each. However, if you are going to be doing much probelevel analyses we recommend using the Cel/Cdf classes described in Section 2.

Section 1.1 describes three lines of code that can get you started if all you want isexpression values. The rest of the Section demonstrate the utility of the package via anexample using the dilution experiment data set described in Irizarry et al. (2002).

1.1 If you only want expression values

If all you want is to convert probe level data into expression measures a quick way ofdoing it is like this: Create a directory, move all your CEL files and the appropriateCDF file to that directory. Start R in that directory. Load the library and then type:

R> Data <- ReadAffy() ##read data in working directory

R> e <- express(Data)

Note there must be exactly one CDF file for this to work.This will store expression values, in the object e, as an object of class exprSet (see

the Biobase package). You can either use R and the bioconductor packages to analyzeyour expression data or if you rather use another package you can write it out to a tabdelimited file like this

R> write.exprs(e,file="mydata.txt")

In the mydata.txt file, row will represent genes and columns will represent sam-ples/arrays. The first row will be a header describing the columns. The first column willhave the affyIDs. The write.exprs function is quite flexible on what it writes (see thehelp file).

1.2 Loading Data into R

As described in the previous section, the simplest way to read data is by using thefunction ReadAffy. The first step is to create a directory that contains all the CEL filesyou want to read in and the appropriate CDF file. Then you can type:

R> Dilution <- ReadAffy() ##read data and store it in Dilution

Note there must be exactly one CDF file for this to work.Alternatively the user can choose what files to be read using a graphical display such

as those shown in Figures 1 and 2. In order to do this, the widget argument must beTRUE like this:

R> Dilution <- ReadAffy(widget=TRUE)

The object created, and saved in Dilution will be a probe level object (Plob).

3

Figure 1: Graphical display for selecting CDF file.

4

Figure 2: Graphical display for selecting CEL files.

5

1.3 Exploring Data

For the users convenience we have included a sample data set containing part of the datafrom a Dilution experiment. The full data is publicly available from Gene Locic http:

//qolotus02.genelogic.com/datasets.nsf/ and described in Irizarry et al. (2002).The help file obtained via ?Dilution describes the data set. Section 1.6 presents a casestudy using this Data.

R> data(Dilution)

R> print(Dilution)

Plob object

CDF used=hg_u95a.cdf

number of chips=9

number of genes=100

number of probe pairs=1604

annotation=

description=

notes=This is the dilution data

size of chip=640x640

chip names=

A2.5

B2.5

C2.5

A5

B5

C5

A10

B10

C10

This will create the Dilution object of class Plob. print (or show) will show somerelevant information. These objects are meant to represent data from one experiment.The Plob class combines the information of various CEL files with a common CDF file.This class is designed to keep information of one experiment. It is especially useful forusers who want to obtain expression values without much tinkering with Cel and Cdfobjects. However, the probe level data is available.

The data in Dilution is a small sample of probe sets from 3 sets of triplicate arrayshybridized with different concentrations of the same RNA. This information is part ofthe Plob and can be accessed with the phenoData and pData methods:

R> phenoData(Dilution)

6

http://qolotus02.genelogic.com/datasets.nsf/

http://qolotus02.genelogic.com/datasets.nsf/

phenoData object with 2 variables and 9 cases

varLabels

LiverAmt: Amount of liver RNA hybridized to chip

SN18Amt: Amount of SN18 RNA hybridized to chip (all 0 for this example)

R> pData(Dilution)

LiverAmt SN18Amt

A2.5 2.5 0

B2.5 2.5 0

C2.5 2.5 0

A5 5.0 0

B5 5.0 0

C5 5.0 0

A10 10.0 0

B10 10.0 0

C10 10.0 0

The main slots of the Plob class can be accessed with the pm, mm, and probeNames

methods. The slots pm and mm are matrices with rows representing probe pairs andcolumns representing arrays. The gene name associated with the probe pair in row i canbe found in the ith entry of the name slot (accessed with probeNames).

R> pm(Dilution)[1:5, ]

A2.5 B2.5 C2.5 A5 B5 C5 A10 B10 C10

35612_at4 60.3 36.3 62.0 47.3 51.8 75 74.3 64.0 89.0

35612_at5 49.0 39.0 60.0 59.0 57.0 73 74.8 61.0 89.0

34791_at7 75.8 60.3 64.0 101.5 55.5 91 78.5 91.5 148.0

40621_at3 77.0 54.0 77.0 81.0 66.3 87 91.8 81.5 112.5

40577_at4 143.5 100.3 105.5 128.8 130.8 155 193.3 159.3 231.8

R> mm(Dilution)[1:5, ]

A2.5 B2.5 C2.5 A5 B5 C5 A10 B10 C10

35612_at4 48.5 44.0 61.0 45.3 62.5 84.3 72.3 59.5 80.3

35612_at5 59.0 39.3 58.0 53.0 52.0 79.3 84.0 59.0 89.5

34791_at7 57.3 46.0 65.0 58.0 50.5 76.5 71.0 65.3 85.0

40621_at3 64.0 56.8 77.8 75.0 56.0 84.0 105.8 82.0 124.5

40577_at4 77.3 70.3 93.0 97.0 67.8 108.0 124.0 100.5 152.0

R> probeNames(Dilution)[1:5]

[1] "35612_at" "35612_at" "34791_at" "40621_at" "40577_at"

7

Other information about the experiment is available in this class. See the Plob helpfile for details.

boxplot(), hist(), history(), image(), profile(), summary() are examples ofuseful methods available for objects of this class. For example, Figure 3 shows histogramsof PM (red) and MM (blue) for the 3rd array obtained using hist().

R> hist(Dilution[, 3])

hg_u95a.cdf − C2.5

log intensity

Fre

quen

cy

6 8 10 12 14

010

020

030

040

050

060

070

0

Figure 3: Histogram of PM (red) and MM (blue) of third array

It is important to note that Dilution[i,j] gives back another Plob object repre-senting the ith probe set (as opposed to the ith probe) and the jth sample. Look atwhat happens when you show the PMs and probe names for this with i=2 and j=3:

R> pm(Dilution[2, 3])

C2.5

34791_at7 64.0

8

34791_at3 90.0

34791_at1 207.0

34791_at0 232.0

34791_at13 702.0

34791_at11 60.0

34791_at14 664.5

34791_at10 297.5

34791_at15 931.0

34791_at8 300.3

34791_at9 169.0

34791_at2 65.0

34791_at12 738.3

34791_at4 118.0

34791_at5 307.5

34791_at6 108.3

R> probeNames(Dilution[2, 3])

[1] "34791_at" "34791_at" "34791_at" "34791_at" "34791_at" "34791_at"

[7] "34791_at" "34791_at" "34791_at" "34791_at" "34791_at" "34791_at"

[13] "34791_at" "34791_at" "34791_at" "34791_at"

1.4 Normalization

Various researchers have pointed out the need for normalization of Affymetrix arrays.See for example Bolstad et al. (2002). Let’s look at an example. The first three arraysin Dilution are technical replicates (same RNA), so the intensities obtained from theseshould be about the same. Using the boxplot method for the Plob class, shown inFigure 4 we can see if this is the case.

Figure 4 shows the need for normalization. For example the first array is globallylarger than the second.

Another way to see that normalization is needed is by looking at log ratio versusaverage log intensity (MVA) plots. The method mva.pairs will show all MVA plots ofeach pairwise comparison on the top right half and the interquartile range (IQR) of thelog ratios on the bottom left half. For replicates and cases where most genes are notdifferentially expressed, we want the cloud of points to be around 0 and the IQR to besmall. The method normalize lets one normalize the data.

R> normalized.Dilution <- normalize(Dilution[, 1:3])

Various methods are available for normalization (see the help file). The default is quantilenormalization (see Bolstad et al. (2002)).

Figures 6 and 7 show the boxplot and mva pairs plot after normalization. Thenormalization routine seems to correct the boxplots and mva plots.

9

R> par(mfrow = c(2, 1), mar = c(2, 4, 2, 2))

R> boxplot(Dilution[, 1:3], range = 0)

A2.5 B2.5 C2.5

68

1012

hg_u95a.cdf : MM

A2.5 B2.5 C2.5

68

1012

14

hg_u95a.cdf : PM

Figure 4: Boxplot of first three arrays in dilution data.

10

R> mva.pairs(Dilution[, 1:3])

A2.5

6 8 10 12

0.0

0.5

1.0

1.5

0.286

6 8 10 12 14

−0.

50.

00.

51.

0

0.385

B2.5

6 8 10 12−1.

0−

0.6

−0.

20.

2

0.244 C2.5

A

MVA plot

Figure 5: MVA pairs for first three arrays in dilution data

11

R> par(mfrow = c(2, 1), mar = c(2, 4, 2, 2))

R> boxplot(normalized.Dilution, range = 0)

A2.5 B2.5 C2.5

68

1012

hg_u95a.cdf : MM

A2.5 B2.5 C2.5

68

1012

14

hg_u95a.cdf : PM

Figure 6: Boxplot of first three arrays in normalized dilution data.

12

R> mva.pairs(normalized.Dilution[, 1:3])

A2.5

6 8 10 12

−0.

50.

00.

5

0.199

6 8 10 12−0.

50.

00.

51.

0

0.191

B2.5

6 8 10 12

−0.

50.

00.

5

0.232 C2.5

A

MVA plot

Figure 7: MVA pairs for first three arrays in normalized dilution data

13

1.5 Expression Measures

To perform gene expression analyses we need to summarize the probe set data availablefor each gene into one expression measure. Various approaches on how to do this havebeen proposed, see for example Affymetrix (1999), Li and Wong (2001), and Irizarryet al. (2002). Our package permits the user to construct expression measures using anyof the approaches mentioned. The function express returns expression measures in anobject of class ExprSet (defined in the Biobase package). The exprs slot of this class isa matrix with rows representing genes and columns representing arrays. Standard errorestimates are also provided in the se.exprs slot. Various bioconductor packages can beused to explore and analyze expression data. The write.exprs method is available towrite expression tables to a file for users who wish to use other packages for analyses.

An expression measure for a gene can be considered a summary of the backgroundcorrected and normalized PMs of the probe set representing that gene. The default forexpress is to use the approach presented in Irizarry et al. (2002). However, the functionis quite flexible. The normalization method can be change through the normalize.methodargument. Arbitrary functions can be passed along via the bg and summary.stat ar-guments for background correction and to summarize the 20 PMs into one numberrespectively. Some background correction and summary functions are provided with thepackage but the user can add their own. More details are given in Section 3.2 and read-ing the help file. The next section presents a case study related to comparing expressionmeasures.

1.6 Case Study

Two sources of cRNA, A (human liver tissue) and B (Central Nervous System cell line),have been hybridized to human arrays (HGU95A) in a range of proportions and dilutions.The Dilution data provided with this package is taken from arrays hybridized to sourceA at concentrations of 2.5, 5.0, and 10.0 µg. We have three replicate arrays for eachgenerated cRNA. Three scanners have been used in this study. Each array replicate wasprocessed in a different scanner.

This dilution data set is a great resource for assessing the technology because weknow that measures of expression for genes that are expressed should grow directlyproportional to the amount of RNA hybridized (which is known) and that replicatesshould be like each other.

As seen in Section 1.4, this data needs normalization. However, in this case wecan only normalize with replicates. Normalizing across replicates is not recommendedbecause the assumption needed for most normalization procedures (most genes, or atleast a large group of genes, don’t change much). We therefore normalize the replicatesand then use the union method to put them together.

R> tmp1 <- normalize(Dilution[, 1:3])


14


R> normalized.Dilution <- union(tmp1, tmp2)

R> normalized.Dilution <- union(normalized.Dilution, tmp3)

We are now ready to compute measures of expression. Keep in mind that the expressfunction normalizes by default, so we need to avoid this using the normalize argument.

R> e <- express(normalized.Dilution, normalize = F)

R> f <- express(Dilution, normalize = F)

To see if normalizing makes a difference, let’s look at expression for two randomlyselected genes for the 9 arrays plotted against known concentration. If the gene isexpressed these values should be growing with concentration but the replicates shouldbe close if not equal. Figure 8 shows the advantages of normalization: we still capturethe signal and reduce the variance.

Now let’s assess the variance of 4 different measures of expression.

R> Dilution <- Dilution[, 1:3]

R> e <- express(Dilution)

R> LiWong <- express(Dilution, bg = subtractmm, summary.stat = li.wong)

R> Avdiff <- express(Dilution, normalize = F, bg = subtractmm, summary.stat = avdiff)

R> Avdiff2 <- express(Dilution, normalize = F, bg = subtractmm,

+ summary.stat = function(x) apply(x, 2, mean, trim = 0.5))

R> assess <- function(w) {

+ x <- exprs(w)[, 1]

+ y <- exprs(w)[, 2]

+ z <- exprs(w)[, 3]

+ Index <- x > 0 & y > 0

+ signif(c(IQR(log2(x[Index]/y[Index]), na.rm = T), IQR(log2(x[Index]/z[Index]),

+ na.rm = T), IQR(log2(y[Index]/z[Index]), na.rm = T)),

+ 2)

+ }

R> e@exprs <- 2^exprs(e)

R> assess(e)

[1] 0.037 0.013 0.036

R> assess(LiWong)

[1] 0.28 0.27 0.39

R> assess(Avdiff)

15

R> concentration <- pData(Dilution)[, 1]

R> par(mfrow = c(2, 2))

R> for (j in 1:2) {

+ i <- sample(ngenes(Dilution), 1)

+ expr.with.norm <- 2^exprs(e)[i, ]

+ expr.without.norm <- 2^exprs(f)[i, ]

+ YLIM <- range(c(expr.with.norm, expr.without.norm))

+ MAIN <- rownames(exprs(e))[j]

+ plot(concentration, expr.with.norm, log = "xy", main = MAIN,

+ ylim = YLIM, xaxt = "n")

+ axis(1, concentration)

+ plot(concentration, expr.without.norm, log = "xy", main = MAIN,

+ ylim = YLIM, xaxt = "n")

+ axis(1, concentration)

+ }

●●

●

●●●

●●

●

3040

6080

1219_at

concentration

expr

.with

.nor

m

2.5 5.0 10.0

●

●

●

●

●

●

●

●

●

3040

6080

1219_at

concentration

expr

.with

out.n

orm

2.5 5.0 10.0

●●●

●●●

●●

●

200

300

500

700

126_s_at

concentration

expr

.with

.nor

m

2.5 5.0 10.0

●

●

●

●

●

●

●

●

●

200

300

500

700

126_s_at

concentration

expr

.with

out.n

orm

2.5 5.0 10.0

Figure 8: Comparison of expression measures obtained with and without normalization.

16

[1] 0.49 0.35 0.58

R> assess(Avdiff2)

[1] 0.70 0.55 0.65

Notice the default measure has the smallest IQR between replicates. See Irizarryet al. (2002) for assessments such as the one presented in this Section.

2 Working with Probe Level Data

The class Plob combines the information of various CEL files with a common CDF. Thisclass is designed to keep information of one experiment. It is especially useful for userswho want to obtain only expression values. For the user that is interested in detailedanalyses of the the probe level data the Cdf and Cel classes are available.

The classes Cdf and Cel model the physical files in which data are stored. Thecomplete processing of the data presented in the previous section could be done usingthese objects. Although the coding would be more complex, they provide more flexibility.The user would need to understand the steps of the analysis leading from the probesdata on different arrays/files to the expression values (stored in an exprSet). However,this section should not be considered for advanced users only2. While you may want torely on the steps and methods in the function express, we also encourage you to explorethe raw data contained in the CEL files. It might be informative about problems thatoccurred during the experimental steps leading to the data.

Due to several attributes of the Cdf and Cel objects, matrices are used to store thedata. The look-up of values at particular positions on the chip becomes extremely easy(see figure 2).

The information is extracted from the combination of the Cel and the correspondingCdf. Objects of class Plob have a different data structure but also obtain informa-tion from a Cdf and Cel objects. A convenience function to shuttle between the datastructures is included in the package (try help("convert") in R).

In this Section we give more details about the functions, classes, and methods in theaffy package

2.1 Cdf-class

The Chip Definition Files (CDF files) are used to store information related to Affymetrix’sGeneChip arrays (a type of high density oligonucleotide array sold by the manufacturer).All the arrays belonging to a given type will share this same information. As the quan-tity of information in a CDF can be rather large, this is an important point. This is kept

2Advanced users will not find enough details in the following this section and are recommended toread the help files

17

Figure 9: The data matrix used to store the data can be thought of as a grid. Knowingthe position of probes on the grid, finding the values for them is done by indexing thecorresponding elements.

in the design of the package, as there will be only one Cdf object in memory, to whichwill refer the corresponding Cel files.

For instance, if during an experiment a total of 30 arrays of type Hu6800 are used,the information relative to the array type is common to all the chips. Knowing thatCDF files are usually of size 20 Megabytes, illustrates the reason for avoiding havingmultiple instances of this information.

The Cdf class is designed to store the information in a CDF file. The slots are:

name Each probe (or feature) on an array is an oligonucleotide from a larger sequence ofnucleic acids (generally a gene or a fraction of a gene, we refer to as a probe pairsset). All these large sequences were given a unique name by the chip manufacturer.We call this name affyID. By design of the arrays several probes correspond todifferent parts of the same larger sequence, hence have the same name. Names arestored as factors, and the corresponding factors labels (or levels) are found in theattribute name.levels.

name.level See previous item.

pbase In the CDF files, the column called PBASE holds one of the nucleic acid letters.From trials and errors 3 , the p was guessed to stand for probe.

3Comparing the letter between PBASE and TBASE, it appeared that two cases could appear.

18

pbase.levels The four levels for pbase are the four letters used to designatee nucleicacids.

tbase In the CDF files, the column called TBASE holds also a nucleic acid letter. Fromtrials and errors where the t was assumed to stand for tbase.

tbase.levels The four same letters than for pbase.levels are found here.

atoms Each probe pair in a probe pair set is given a unique integer as an identifier. Thisnumber is refereed as the atom number. The corresponding perfect matches andmismatches are found by using this number.

The function read.cdffile can be used to create an object of class Cdf from a file:

R> cdf <- read.cdffile("filename.CDF")

This function has various useful arguments. For example, the compress argument per-mits you to specify if the file you are reading is compressed. See the help file for moredetails.

In Figure 11 we look for the physical locations on a chip of a particular affyID. Wethen plot them over over a grid.

2.2 Class Cel

Our package defines a class for storing what we consider to be relevant information in aCEL file. The mean intensities (and if desired the SD) are stored in matrices. The i,j

entry in these matrices contain the probe intensity and SD in position i,j on the array.These objects also contain information on the position of masked and outlier probes, thename of the array, and relevant information added by the user. Since it is not clear howthe pixel level SD information is useful, the default behavior was set not to load it.

The slots are:

intensity Object of class ”matrix” containing intensity values in a matrix of dimension(nrow,ncol). The position in the matrix represents the physical position on thearray

sd Object of class ”matrix” containing the standard deviation for the intensity valuesobtained from CEL files.

name Object of class ”character” the name given to this CEL file data. The functionread.celfile uses the file name by default.

masks Object of class ”matrix” containing the coordinates of masked measurements.

outliers Object of class ”matrix” containing the coordinates of outlier measurements.

19

R> data(CDF.example)

R> cdf <- CDF.example

R> affyid <- "D13640_at"

R> l.pm <- locate.name(affyid, cdf, type = "pm")

R> l.mm <- locate.name(affyid, cdf, type = "mm")

R> n.y <- ncol(probeNames(cdf))

R> n.x <- nrow(probeNames(cdf))

R> plot(c(1, n.x), c(1, n.y), xlab = "rows", ylab = "columns", type = "n")

R> grid(n.x, n.y)

R> points(l.pm, col = "red")

R> points(l.mm, col = "blue")

0 20 40 60 80 100

020

4060

8010

0

rows

colu

mns

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

Figure 10: Location of the PM (red) and MM (blue) for the D13640 probe set on theHU6800 chip.

20

history Object of class ”list” containing details about where the file.

The function read.celfile() is used to read CEL file information into R. An objectof class Cel is returned. In this example we read in a CEL file to create a Cel object:

R> cel <- read.celfile("filename.cel")

For the users convenience a Cel.container (see next section to learn about Cel.containersand Biobase to learn about containers in general) is available. For an example of a Cel

object use

R> data(listcel)

R> mycelfile <- listcel[[1]]

The following useful methods are available for this class:

image Display an image of the data in the Cel object. Among the other parameters,transfo=log can be convenient.

show Outputs few general facts about the object on the console.

By default the method image() creates an image of the intensities. Figure 11 showssome results from the the image method. These images are quite useful for qualitycontrol. We recommend looking at these images as a first step in data exploration. Inour experience, the log scale is more useful for detecting biases. This seems to be thecase in Figure 1.

Objects of class Cel contain the information held in CEL files.

2.3 Class Cel.container

The class Cel.container extends the class container of the package Biobase. It allowsthe user to bundle a collection of Cel instances together. Typically, a Cel.container

will contain Cel instances sharing the same Cdf. The listcel example provided withthe package is an example of Cel.containter.

The following useful methods are available.

generateExprSet Compute expression values and return them in an exprSet object.

normalize A Cel.container can be normalized, which means its Cel elements will bescaled to be comparable. More details about normalization can be found in theSection 3.

normalize.methods Return the known normalization methods for the Cel.

show Display global information about the object

21


R> image(mycelfile, sub = "raw values")

R> image(mycelfile, col = rainbow(32), sub = "raw values")

R> image(mycelfile, transf = log, sub = "log-transformed values")

R> image(mycelfile, transf = log, col = rainbow(32), sub = "log-tranfomed values")

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

CEL file 1

raw values

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

CEL file 1

raw values

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

CEL file 1

log−transformed values

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

CEL file 1

log−tranfomed values

Figure 11: Images of PM intensities from a section of a cel file using different colors andconcentrations.

22

2.4 Class PPSet

The probe pair set (PPSet) class holds the information of all the probes related to aaffyID. The components are:

probes a data frame having two column named pm and mm respectively.

name the name of the probe pair set

Useful methods available for this class are:

show Outputs few general facts about the object on the console.

barplot Display a barplot of the probes information

plot Plot the probes information in a lines graph.

The function get.PPSet() can be used to obtain a PPSet object from its Affymetrixidentifier, and from objects of class Cel and Cdf. In the example below, after findingthe number of different identifiers on an array, we pick one at random and extract thePPSet.

R> data(CDF.example)

R> data(listcel)

R> mylevel <- sample(1:100, 1)

R> myname <- name.levels(CDF.example)[mylevel]

R> mypps <- get.PPSet(myname, CDF.example, listcel[[1]])

The methods barplot() and plot() can be used to graphically explore the intensitiesof a PPSet object. See figure 12.


R> barplot(mypps)

R> plot(mypps)

Figure 12: PMs (red) and MMs (blue) for a randomely chosen probe set.

As an illustrative example we show how one can find the intensities of a particulargene, say myname without using get.PPSet. From the name entry number in a Cdf

object, we look for the positions of the related probes. From these positions on the array,we extract intensities from a Cel object. This gives the PM and MM information inone vector.

R> positions <- which(probeNames(CDF.example) == mylevel, arr.ind = TRUE)

R> intensities <- intensity(listcel[[1]])

23

The generation of an expression value from the probe set intensities is straight for-ward. Two lines give us a naive version of Affymetrix’s average difference measure.

R> difference <- mypps$probes$pm - mypps$probes$mm

R> averagediff <- mean(difference)

2.5 Class PPSet.container

A class of PPSet containers is also available. For the users convenience we include anexample containing intensities read from 12 arrays where the concentration of the controlcRNA BioB-5 was added at known concentrations. The concentrations were varied fromchip to chip. These are also available thought the example data set:

R> data(SpikeIn)

R> print(concentrations)

[1] 0.50 0.75 1.00 1.50 2.00 3.00 5.00 12.50 25.00 50.00

[11] 75.00 150.00

Notice the concentrations are growing exponentially. Figure 13 shows, using the exampledata set, that the PMs in fact detect the growth in concentration. This figure also showsthat the MMs detect the growth as well.

3 Extending the package

3.1 Normalization

Different approaches to normalize array data have already been suggested, but newmethods are very likely be developed too. We tried to make the addition of new normal-ization methods straightforward. It should be the case if the following indications arecarefully observed. Brave people may want to add new normalization methods withouttaking these rules into considerations. It should be possible, but they are on their own.

- Normalization method must obey a naming convention. The name of the functionshould be like normalize.XYZ.abc, where abc is the nickname of the method andXYZ is the class of the object the normalization methods works on. Currently XYZcan only be Cel.container or Plob4. You are free to implement a normalizationmethod working on one object without having to implement it for the others.

- The first argument passed to the function must be of class XYZ. Optional parametercan be appended as long as they have default values.

4A general scheme for normalization of arrays, whether Affymetrix or cDNA (or even others in thefuture (?)) is currently evaluated.

24


R> plot(concentrations, concentrations, ylim = c(20, 20000), main = "PM",

+ ylab = "Intensity", log = "xy", type = "n")

R> for (i in 1:12) points(rep(concentrations[i], 20), pm(SpikeIn[[i]]),

+ col = rainbow(20), pch = 1:20)

R> plot(concentrations, concentrations, ylim = c(20, 20000), main = "MM",

+ ylab = "Intensity", log = "xy", type = "n")

R> for (i in 1:12) points(rep(concentrations[i], 20), mm(SpikeIn[[i]]),

+ col = rainbow(20), pch = 1:20)

0.5 2.0 10.0 100.0

2050

100

500

2000

5000

2000

0

PM

concentrations

Inte

nsity

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●●

●●

●

●

●●

●●

●

●

●●

●●

●

●

●●●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

0.5 2.0 10.0 100.0

2050

100

500

2000

5000

2000

0

MM

concentrations

Inte

nsity

●

●●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●●

●

●

●

●●●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●●●

●

●

●

●●

●

●

●

●

Figure 13: PM and MM intensities read from 12 different arrays plotted against knownconcentrations of BioB-5 added to each array. Different symbols/colors are used for the20 different probe pairs.

25

- The nickname of the new normalization method must be registered in the variablenormalize.XYZ.methods (a vector of mode character).

The following example shows how to create a normalization methods for Cel.containerthat does... absolutely nothing:

R> normalize.Cel.container.nothing <- function(object) return(object)

R> normalize.Cel.container.methods <- c(normalize.Cel.container.methods,

+ "nothing")

R> data(listcel)

R> listcel.n <- normalize(listcel, method = "nothing")

3.2 Measures of Expression

As seen in Section 1.5 the function express creates exprSet from Plob objects. Thefunction express is quite flexible.

One can choose from various normalization routines through the argument normalize.method.The default is quantile normalization, but one can also use constant, loess, invariantset,or contrasts normalization (see help file for normalize for more information on these rou-tines). The option not to normalize is also available via the logic argument normalize.

The way in which PM values are “background” corrected can be changed throughan arbitrary function sent via the argument bg. The current default is based on abackground correction method described in Irizarry et al. (2002) and available throuh thefunction bg.adjust. If something other than a function is sent, no background correctionis performed. If a function is sent, it must receive a matrix containing the PMs in thefirst half of the rows and the MMs in the bottom. The function must return a matrixwith the background corrected PMs. For example, the function subtractmm, madeavailable in this package, receives x (typically rbind(pm(plob),mm(plob) wiht plob anobject of class Plob) and returns x[1:(nrow(x)/2),]-x[(nrow(x)/2 + 1):nrow(x),].

Once background corrected PMs are available we compute expression measures viasome summary statistic. The PM data will be a matrix, say mat of dimensions probestimes number of arrays. The default is to return the overall plus column effect of amedian polish. In general one can use any function that returns one value for everycolumn, i.e. one measure of expression for each array. The function express takes careof doing this for every gene. As an example, the function avdiff is available throughthis package, an returns something quite similar to apply(mat,2,mean,trim=.9). Muchcan be learned by looking at the function express.

Expression values can also be obtained for some other classes. A way to include newapproaches is to go through the PPSet objects, or more precisely PPSset.container

objects. As described above, a container is used to bundle several PPSet objects andit used here to collect the PPSet for a given gene in different hybridizations (i.e. ondifferent arrays). Methods that generate expression values accept as first parameter alist of data.frame. There is one data.frame per hybridization. All the data are related

26

to the same probe pair set on a chip. Each data.frame has a column called pm and acolumn called mm (note: each of these data.frame correspond to the slot probes of thePPSet). The method can then accept other optional parameters, but default must bedefined. The name of the met hod must be of the form generateExprVal.method.xyz,where xyz in the name of the method. Then the methods nickname must be registered ingenerateExprSet.methods. The mechanism is not unlike the one used for normalize.As an illustrative example, we present a method called outlieraverage, that generatesexpression values in two steps. In a first step, identical pm probes across the probes setsfrom the different hybridizations are processed. Probe values that are more than 2 timesaway from the median absolute deviation (mad) are replaced by the median value forthese probes. In a second step the average of all the probes in the set is calculated foreach hyridization experiment.

generateExprVal.method.outlieraverage <- function(probes) {

## probes is a list of data.frame

m <- length(probes) # number of hybridizations

n <- length(probes[[1]]$pm) # number of pm probes in a set

modif.probes <- matrix(0, n, m) # modified probes to be stored in a matrix

# one row per probe, one column per hybridization

## loop over the probes

for (i in 1:n) {

## get all the identical probes from all the hybridizations

p <- unlist(lapply(probes, function(x) x$pm[i]))

pmed <- median(p)

pmad <- mad(p)

p[abs(p-pmed) > 2 * pmad)] <- pmed # replace the ’outliers’ by the median value

modif.probes[i,] <- p

}

r <- apply(modif.probes, 1, mean)

return(r)

}

## register the method

generateExprSet.methods <- c(generateExprSet.methods, "outlieraverage")

27

References

Affymetrix. Affymetrix Microarray Suite User Guide. Affymetrix, Santa Clara, CA,version 4 edition, 1999.

B.M. Bolstad, R.A. Irizarry, M. Astrand, and T.P. Speed. A comparison of normaliza-tion methods for high density oligonucleotide array data based on variance and bias.Unpublished Manuscript, 2002.

Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kris-ten J. Antonellis, Uwe Scherf, and Terence P. Speed. Exploration, normaliza-tion, and summaries of high density oligonucleotide array probe level data. http:

// biosun01. biostat. jhsph. edu/ ~ririzarr/ papers , 2002.

C. Li and W.H. Wong. Model-based analysis of oligonucleotide arrays: Expression indexcomputation and outlier detection. Proceedings of the National Academy of ScienceU S A, 98:31–36, 2001.

28

http://biosun01.biostat.jhsph.edu/~ririzarr/papers

http://biosun01.biostat.jhsph.edu/~ririzarr/papers

Textual Description of a y - Johns Hopkins Bloomberg ...ririzarr/Raffy/affy.pdf · Various built in...

Documents

Transcript of Textual Description of a y - Johns Hopkins Bloomberg ...ririzarr/Raffy/affy.pdf · Various built in...