Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop...

76
Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA

Transcript of Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop...

Page 1: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Welcome to lecture 7:An introduction to bioconductor

IGERT – Sponsored Bioinformatics Workshop SeriesMichael Janis and Max Kopelevich, Ph.D.

Dept. of Chemistry & Biochemistry, UCLA

Page 2: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

What is bioconductor?From the bioconductor webpage (www.bioconductor.org):

• Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data.

• The project was started in the Fall of 2001. The Bioconductor core team is based primarily at the Fred Hutchinson Cancer Research Center. Other members come from various US and international institutions.

• Bioconductor is primarily based on the R programming language but we do accept contributions in any programming language. There are two releases of Bioconductor every year (they appear shortly after the corresponding R release). At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs. In addition there are a large number of meta-data packages available. They are mainly, but not solely oriented towards different types of microarrays.

Page 3: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Where to get bioconductor?From the bioconductor website:

Right now, the easiest way to install BioConductor packages is using the biocLite.R installation script. Using the R CLI, simply type the following:

>source("http://www.bioconductor.org/biocLite.R")

>biocLite()

This will install a subset of packages by default. Some optional parameters are accepted by the script:

• pkgs

– character vector of Bioconductor packages to install.

• destdir

– the directory where the downloaded packages will be stored.

• lib

– character vector giving the library directories under which packages may be installed. Recycled as needed.

Page 4: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Bioconductor principlesThe following slides were compiled from the

bioconductor vignettes and short courses available through the bioconductor website.

Of special note are the presentations by Sandrine Dudoit and Robert Gentleman, 2002 (http://www.bioconductor.org/workshops/Summer02Course/RBioC/RBioC.pdf) and the EMBO short course on microarray analysis, with emphasis on the bioconductor project (http://www.bioconductor.org/workshops/EMBO03), also by Sandrine Dudoit and Robert Gentleman, 2003.

Page 5: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Bioconductor designObject-oriented class/method design.

• Allows efficient representation and manipulation of large and complex biological datasets of multiple types.

Widgets.

• Specific, small scale, interactive components providing graphically driven analyses - point & click interface.

Page 6: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Bioconductor tools

Interactive tools for linking experimental results to annotation and literature WWW resources in real time.

• E.g. PubMed, GenBank, LocusLink.

Scenario: For a list of differentially expressed genes obtained from multtest or genefilter, use the annotate package to retrieve PubMed abstracts for these genes and to generate an HTML report with links to LocusLink for each gene.

Page 7: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Bioconductor packages

• General infrastructure– Biobase– annotate, AnnBuilder– tkWidgets.

• Pre-processing for Affymetrix data– affy

• Pre-processing for cDNA data– marrayClasses, marrayInput, marrayNorm,

marrayPlots

• Differential expression

Page 8: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Bioconductor training• R vignettes system

– comprehensive repository of step-by-step tutorials covering a wide variety of computational objectives in /doc subdirectory;– integrated statistical documents intermixing text, code, and

code output (textual and graphical);– documents can be automatically updated if either data or

analyses are changed.

• Modular training segments– short courses: lectures and computer labs;– interactive learning and experimentation with the software

platform and statistical methodology.

Page 9: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

R programming, extended• In order to deliver high quality software the Bioconductor

project relies on a few programming techniques that might not be familiar:– environments and closures;– object oriented programming.

• We review these here for interested programmers (understanding them is not essential but is often very helpful).

Page 10: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Environments and closuresAn environment is an object that contains bindings between

symbols and values.

• It is very similar to a hash table.• Environments can be accessed using the following

functions:– ls(env=e) # get a listing.– get(“x”, env=e) # get the value of the object in ewith name x.– assign(“x”,y,env=e) # assign to the name xthe value y in the environment e.

Page 11: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Environments and closuresSince these operations are used a great deal in Bioconductor we

have provided two helper functions:

– multiget– multiassign

• These functions get and assign multiple values into the specified environment.

Page 12: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Environments and closuresEnvironments can be associated with functions:

• When an environment is associated with a function, then that environment is used to obtain values for any

unbound variables.• The term closure refers to the coupling of the function body

with the enclosing environment.• The annotate, genefilter, and other packages take advantage

of environments

Page 13: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Environments and closures> x <- 4> e1 <- new.env()> assign(“x”,10, env=e1)> f <- function() x> environment(f) <- e1> x # returns 4> f() # returns 10!

Page 14: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

OOPThe Bioconductor project has adopted the OOP paradigm

presented in Programming with Data, J. M. Chambers, 1998.

• Tools for programming using the class/method mechanism are provided in the methods package.

Page 15: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

OOPA class provides a software abstraction of a real world object. It

reflects how we think of certain objects and what information these objects should contain.

• A class defines the structure, inheritance, and initialization of objects.

• Classes are defined in terms of slots which contain the relevant data.

• An object is an instance of a class.

Page 16: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

OOPA method is a function that performs an action on data

(objects).

• A generic function is a dispatcher, it examines its arguments and determines the appropriate method to invoke.

• Examples of generic functions includeplotsummaryprint

Page 17: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

OOPTo obtain documentation (on-line help) about:

– a class: class?classnameso, class?exprSet, will display the help file for the

exprSet class.

– a method: methods?methodnameso, methods?print, will display the help file for the

print methods.

Page 18: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

OOP• The methods package contains a number of functions for

defining new classes and methods (e.g. setClass, setMethod) and for working with these classes and methods.

• A tutorial is available athttp://www.omegahat.org/RSMethods/index.html

Page 19: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

OOP> setClass(“simple",representation(x="numeric",y="matrix“),prototype = list(x=numeric(),y=matrix(0)))

> z <- new("simple", x=1:10,y=matrix(rnorm(50),10,5))

> z@x[1] 1 2 3 4 5 6 7 8 9 10

Page 20: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

BiobaseThe Biobase package provides class definitions and other

infrastructure tools that will be used by other packages.

• The two important classes defined in Biobase are:

– phenoData: sample level covariate data.– exprSet: the sample level covariate data combined with the

expression data and a few other quantities of interest.

Page 21: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: exprSetSlots for the exprSet class:• exprs: a matrix of expression measures

genes are rows, samples are columns.• se.exprs: standard errors for the expression measures, if

available.• phenoData: an object of class phenoData that describes the

samples.• annotation: a character vector.• description: a character vector.• notes: a character vector.

Page 22: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: exprSetOne of the most important tasks is to align the expression data

and the phenotypic data (and to keep that alignment through the analysis).

• To achieve this, the exprSet class combines these two data sources into one object, and provides subsetting and access methods that make it easy to manipulate the data while ensuring that they are correctly aligned.

Page 23: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: exprSetA design principle that was adopted for the exprSet and other

classes was that they should be closed under the subsetoperation.

• So any subsetting, either of rows or columns, will return a valid exprSet object.

• This makes it easier to use exprSet in other software packages

Page 24: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: exprSetSome methods for the exprSet class:

• show: controls the printing (you seldom want a few hundred thousand numbers rolling by).

• subset, [ and $, are both designed to keep correct subsets of the exprs, se.exprs, and phenoData objects.

• split, splits the exprSet into two or more parts depending on the vector used for splitting.

Page 25: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: exprSetSome methods for the exprSet class:

• geneNames, retrieves the gene names (row names of exprs).• phenoData, pData, and sampleNames provide access to the

phenoData slot.• write.exprs, writes the expression values to a file for

processing or storage.

Page 26: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: phenoDataSlots for the phenoData class:

• pData: a dataframe, where the samplesare rows and the variables are columns(this is the standard format).• varLabels: a vector containing thevariable names (as they appear in pData)and a longer description of the variables.

Page 27: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: phenoDataMethods for the phenoData class include:

– [, the subset operator;this method ensures that when a subset is taken,

both the pData object and the varLabels object have the appropriate subsets taken.

– $, extracts the appropriate column of thepData slot (as for a dataframe).

– show, a method to control printing, we showonly the varLabels (and the size).

Page 28: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Biobase: exprSetSome methods for the exprSet class:

• show: controls the printing (you seldom want a few hundred thousand numbers rolling by).

• subset, [ and $, are both designed to keep correct subsets of the exprs, se.exprs, and phenoData objects.

• split, splits the exprSet into two or more parts depending on the vector used for splitting.

Page 29: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

OOPA method is a function that performs an action on data

(objects).

• A generic function is a dispatcher, it examines its arguments and determines the appropriate method to invoke.

• Examples of generic functions includeplotsummaryprint

Page 30: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Let’s take an example simple affymetrix analysis using the Golub data set

– AML/ALL leukemia dataset (1999)

ALL

AML

Hu6800 GeneChip

47

ALL

AML

AML

ALL

25

TEST

Train

Page 31: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Affy arrays

Page 32: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Affymetrix GeneChip Design

5’ 3’

Reference Sequence…TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…

CCCTTACCCAGTCTTCCGGAGGCTA Perfect MatchCCCTTACCCAGTGTTCCGGAGGCTA Mismatch

Page 33: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

TerminologyEach gene or portion of a gene is represented by 1q to 20 oligonucleotides of 25 base-pairs.

• Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.• Perfect match (PM): A 25-mer complementary to a reference sequence of interest

(e.g., part of a gene).• Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (transversion purine <->pyrimidine, G <->C, A <->T) .• Probe-pair: a (PM,MM) pair.• Probe-pair set: a collection of probe-pairs (1q to 20) related to a common gene or fraction of a gene.• Affy ID: an identifier for a probe-pair set.

The purpose of the MM probe design is to measure nonspecific binding and background noise.

Page 34: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Affymetrix filesMain software from Affymetrix company

MicroArray Suite - MAS, now version 5.• DAT file: Image file, ~10^7 pixels, ~50 MB.• CEL file: Cell intensity file, probe level PM and MM values.• CDF file: Chip Description File. Describes which probes go in which probe sets and the location of probe-pair sets (genes, gene fragments, ESTs).

Page 35: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Expression Measures

10-20K genes represented by 11-20 pairs of probe intensities (PM & MM)

• Obtain expression measure for each gene on each array by summarizing these pairs• Background adjustment and normalization are important issues• There are many methods (MAS, RMA, Li/Wong, …)

Page 36: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Let’s take an example the Golub training data set

We can access this as built-in data

For a description of the Golub et al. (1999) dataset type ?

golubTrain and to load these data type:

> library(golubEsets)> data(golubTrain)> data(golubTest)

Page 37: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case studyLet’s look at some of the main functions in the affy package for pre-processing

Affymetrix microarray data. A number of sample datasets are available in the package; to list these, type data(package="affy"). To load the

package:

> library(affy)

The function ReadAffy is available for reading CEL files.

One of the main classes in affy is the AffyBatch class. Other classes include ProbeSet (PM and MM intensities for individual probe sets), Cdf (information contained in a CDF file), and Cel (single array cel intensity data).

The object Dilution is an instance of the class AffyBatch.

Page 38: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case study> class(Dilution)[1] "AffyBatch"> slotNames(Dilution)[1] "cdfName" "nrow" "ncol" "exprs" "se.exprs"[6] "phenoData" "description" "annotation" "notes"> Dilution

AffyBatch objectsize of arrays=640x640 features (12805 kb)cdf=HG_U95Av2 (12625 affyids)number of samples=4number of genes=12625annotation=hgu95av2notes=

> annotation(Dilution)[1] "hgu95av2"

Page 39: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case studyFor a description of the target samples hybridized to the arrays> phenoData(Dilution)

phenoData object with 3 variables and 4 casesvarLabelsliver: amount of liver RNA hybridized to array in

microgramssn19: amount of central nervous system RNA hybridized

to array in microgramsscanner: ID number of scanner used

> pData(Dilution)liver sn19 scanner

20A 20 0 120B 20 0 210A 10 0 110B 10 0 2

Page 40: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case studyThe exprs slot contains a matrix with columns corresponding to arrays and rows toindividual probes on the array. To obtain the matrix of intensities for all four arrays:

> e <- exprs(Dilution)> nrow(Dilution) * ncol(Dilution)[1] 409600> dim(e)[1] 409600 4

You can access probe-level PM and MM intensities using:

> PM <- pm(Dilution)> dim(PM)[1] 201800 4> PM[1:5, ]

20A 20B 10A 10B1000_at1 468.8 282.3 433.0 198.01000_at2 430.0 265.0 308.5 192.81000_at3 182.3 115.0 138.0 86.31000_at4 930.0 588.0 752.8 392.51000_at5 171.0 128.0 152.3 97.8

Page 41: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case studyTo get the probe set names (Affy IDs)> gnames <- geneNames(Dilution)> length(gnames)[1] 12625> gnames[1:5][1] "1000_at" "1001_at" "1002_f_at" "1003_s_at" "1004_at"> nrow(e)/length(gnames)[1] 32.44356As with other microarray objects in Bioconductor packages, you can use subsettingcommands for AffyBatch objects> dil1 <- Dilution[1]> class(dil1)[1] "AffyBatch"> dil1AffyBatch objectsize of arrays=640x640 features (3204 kb)cdf=HG_U95Av2 (12625 affyids)number of samples=1number of genes=12625annotation=hgu95av2notes=

Page 42: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case study

Page 43: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case studyTo produce boxplots of probe log intensities:

> boxplot(Dilution, col = c(2, 2, 3, 3))

Page 44: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case studyThe boxplots and density plots show that the Dilution data needs

normalization. Asdescribed in the dataset help file and in the phenoData slot

(pData(Dilution)), two concentrations of mRNA were used and, for each concentration, two scanners were used.

From the plots, we note that scanner effects seem stronger than concentration effects (different colors). Arrays that should be the same are different; arrays that should be different are similar. Because different mRNA concentrations were used, we perform probe-level normalization within concentration groups:

> Dil20 <- normalize(Dilution[1:2])> Dil10 <- normalize(Dilution[3:4])> normDil <- merge(Dil20, Dil10)

Notice how the boxplot now looks better:

> boxplot(normDil, col = c(2, 2, 3, 3))

Page 45: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

affy: oligo arrays case studyHaving access to PM and MM data can be useful. Let’s look at a

plot of PM vs.MM:

> plot(mm(Dilution[1]), pm(Dilution[1]), pch = ".", log = "xy")

> abline(0, 1, col = "red")

Page 46: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Case study: Golub et. al., 1999The microarray expression measures and clinical variables corresponding to each ofthe 38 training mRNA samples are stored in an object of class exprSet. For

informationon the training data golubTrain:> class(golubTrain)[1] "exprSet“

> slotNames(golubTrain)[1] "exprs" "se.exprs" "phenoData" "description"

"annotation"[6] "notes“

> golubTrainExpression Set (exprSet) with

7129 genes38 samples

phenoData object with 11 variables and 38 cases

varLabelsSamples: SamplesALL.AML: ALL.AMLBM.PB: BM.PBT.B.cell: T.B.cell

Page 47: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Naïve differential expression: Golub et. al., 1999

We will illustrate some of the tools for reasoning about differential expression.

First, we compute two-sample t statistics in this restricted dataset:

> library(genefilter)> ALLinds <- which(small$ALL.AML == "ALL")> AMLinds <- which(small$ALL.AML == "AML")> myt <- fastT(exprs(small), ALLinds, AMLinds)

We will focus on genes for which the signed t statistic exceeds 2:

> bigT <- myt$z > 2> heatmap(exprs(small)[bigT, ])

Page 48: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Case study: Golub et. al., 1999The phenotypic data are stored in a separate, but linked, object

of class phenoData.An object of class phenoData is a combination of a data frame containing a number

of variables for each array and a list that explains what each variable represents.

> phenoTrain <- phenoData(golubTrain)> class(phenoTrain)[1] "phenoData"> slotNames(phenoTrain)[1] "pData" "varLabels"> varLabels(phenoTrain)$Samples[1] "Samples"$ALL.AML[1] "ALL.AML"…> pData(phenoTrain)Samples ALL.AML BM.PB T.B.cell FAB Date Gender pctBlasts

Treatment1 1 ALL BM B-cell <NA> 9/4/1996 M NA <NA>…

Page 49: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Case study: Golub et. al., 1999The $ operator can be used to extract particular variables from an

object of class phenoData. It also can be used directly on the exprSet instance:

> table(phenoTrain$ALL.AML)ALL AML27 11> table(golubTest$ALL.AML)ALL AML20 14

Data on only the first 10 genes in the first 3 chips can be obtained using the subsetting operator "[“

> golubTrain[1:10, 1:3]Expression Set (exprSet) with

10 genes3 samplesphenoData object with 11 variables and 3 cases

Page 50: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999

One of the largest challenges in analyzing genomic data is associating the experimental data with the available biological metadata, e.g., sequence, gene

annotation,chromosomal maps, literature.

Bioconductor provides two main packages for this purpose:

annotate (end-user)AnnBuilder (developer)

These packages can be used for querying databases such as GenBank, GO, LocusLink, and PubMed, from R,

andfor processing the query results in R.

Page 51: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999

Using the Golub et al. (1999) dataset as a case study we will now explore annotation resources for the Affymetrix HU6800 chip used in this study.

The Bioconductor annotation data package for the HU6800 chip is hu6800 and can be downloaded from the ”Data packages” section of the Bioconductor website. To load the packagesneeded for annotation type:

> library(Biobase)> library(annotate)> library(tkWidgets)> library(geneplotter)> library(golubEsets)> library(hu6800)> library(GO)

Page 52: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999Quality control data, counts, etc., for the mappings are available by calling the

function hu6800() :> hu6800()

Quality control information for hu6800:Date built: Thu Feb 13 11:00:27 2003Number of probes: 7129Probe number missmatch: NoneProbe missmatch: NoneMappings found for probe based rda files:

hu6800ACCNUM found 7092 of 7129hu6800CHRLOC found 6467 of 7129hu6800CHRORI found 6467 of 7129hu6800CHR found 6882 of 7129hu6800ENZYME found 1067 of 7129hu6800GENENAME found 6913 of 7129hu6800GO found 5731 of 7129hu6800GRIF found 3533 of 7129hu6800LOCUSID found 6918 of 7129hu6800MAP found 6873 of 7129hu6800PATH found 1448 of 7129hu6800PMID found 6837 of 7129

Page 53: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999

The main R functions we will use for dealing with environments are ls (to list the names of objects in a specified environment) and get (to search for an R object with a given name and return its value, if found, in the specified environment).

We obtain the Affymetrix identifiers using ls and note that there are 7,129 identifiers for the HU6800 chip (this matches the Golub exprSets).

> affyID <- ls(env = hu6800CHR)> length(affyID)[1] 7129

Page 54: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999We now arbitrarily select the probe set with Affymetrix identifier "U18237_at" and

obtain a number of other IDs corresponding to this probe set. Note that there may beseveral PMIDs, i.e., PubMed abstracts, associated with a given gene.

> mygene <- affyID[4001]> mygene[1] "U18237_at"> get(mygene, env = hu6800ACCNUM)[1] "U18237"> get(mygene, env = hu6800LOCUSID)[1] 23456> get(mygene, env = hu6800SYMBOL)[1] "ABCB10"> get(mygene, env = hu6800GENENAME)[1] "ATP-binding cassette, sub-family B (MDR/TAP), member

10"

Page 55: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999In some cases, we need to obtain data on several genes at once. We wrote a specialfunction for this purpose: multiget.

> fivegenes <- affyID[6:10]> fivegenes[1] "AB000409_at" "AB000410_s_at" "AB000449_at" "AB000450_at"[5] "AB000460_at"> multiget(fivegenes, env = hu6800PMID)$"AB000409_at"[1] 10859165 9155018$"AB000410_s_at"[1] 12244119 12189194 12164330 12119232 12117782 12034821

11992556 11927502[9] 11902834 11837743 10449904 10233168 9681819 9348312 9321410

9223306[17] 9223305 9207108 9197244 9190902 9187114$"AB000449_at"[1] 9344656$"AB000450_at"[1] 9344656$"AB000460_at"[1] 9734812

Page 56: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999A pubMedAbst class structure was defined for handling PubMed

abstracts in R. The slots are:

> slotNames("pubMedAbst")[1] "authors" "abstText" "articleTitle" "journal"

"pubDate"[6] "abstUrl"

Page 57: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999The function pm.getabst provides a simpler way to download the specified PubMed

abstracts (stored in XML) and create a list of pubMedAbst objects. The following commands can be used to store the PubMed abstracts for 5 genes in

a list of objects of class pubMedAbst, to compute the number of abstracts retrieved for each gene, and to print the abstract(s) for the first gene.

We can see, for example, that one of the genes has 21 abstracts associated with it:

> absts2 <- pm.getabst(fivegenes, "hu6800")> lapply(absts2, length)$"AB000409_at"[1] 2$"AB000410_s_at"[1] 21

Page 58: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999> absts2[[1]]

[[1]]list()attr(,"authors")[1] "R Cuesta" "G Laroia" "RJ Schneider"attr(,"abstText")[1] "Inhibition of protein synthesis during heat shock

limits accumulation of unfolded attr(,"articleTitle")[1] "Chaperone hsp27 inhibits translation during heat

shock by binding eIF4G and facilitating attr(,"journal")

[1] "Genes Dev"attr(,"pubDate")[1] "Jun 2000"attr(,"abstUrl")[1] "No URL Provided"attr(,"class")[1] "pubMedAbst"

Page 59: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999Another source of data is GenBank. We can interact with

GenBank in much the same way as with PubMed; the main function for this purpose is genbank.

> gbacc <- multiget(fivegenes, hu6800ACCNUM)> genbank(gbacc, disp = "browser")> gb <- genbank(gbacc[1], disp = "data")

Page 60: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999The classes chromLoc and chromLocation are used to keep track of location

informationfor a single gene and a set of genes (entire genome), respectively.

The object hu6800ChrClass is an instance of the class chromLocation for the Affymetrix HU6800 chip.

> slotNames("chromLoc")[1] "chrom" "position" "strand"> data(hu6800ChrClass)> species(hu6800ChrClass)[1] "Human"> chromNames(hu6800ChrClass)[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19"

"2" "20" "21" "22"[16] "3" "4" "5" "6" "7" "8" "9" "X" "Y"

Page 61: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999The geneplotter package provides two main functions for plotting genomic

data: cPlot and alongChrom. These functions operate on instances of the class chromLocation.

With cPlot, we can render all chromosomes the same length or we can scale themby their relative lengths.

> cPlot(hu6800ChrClass, scale = "relative")

Page 62: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Annotation: Golub et. al., 1999We can feed information from these objects into standard R graphing

capabilities, such as image:

> alongChrom(golubTrain, chrom = 22, specChrom = hu6800ChrClass, plotFormat = "image")

Page 63: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

marray packages for spotted arraysThe four main marray functions available are:

marrayClasses. This package contains class definitions and associated methods for

pre- and post-normalization intensity data for batches of arrays. Methods are provided

for the creation and modification of microarray objects, basic computations,printing, subsetting, and class conversions.

marrayInput. This package provides functionality for reading microarray data into R,

such as intensity data from image processing output files (e.g., .spot and .gprfiles for the Spot and GenePix packages, respectively) and textual

information onprobes and targets (e.g., from .gal files and god lists).

marrayPlots. This package provides functions for diagnostic plots of microarray spot

statistics, such as boxplots, scatterplots, and spatial color images.

marrayNorm. This package implements robust adaptive location and scale normalization

procedures, which correct for different types of dye biases (e.g., intensity,spatial, plate biases) and allow the use of control sequences spotted onto the

arrayand possibly spiked into the mRNA samples. Normalization is needed to

ensurethat observed differences in intensities are indeed due to differential

expression andnot experimental artifacts; fluorescence intensities should therefore be

normalizedbefore any analysis that involves comparisons among gene expression

measureswithin or between arrays.

Page 64: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyTo load the packages

> library(marrayNorm)

We will work with the example sample dataset swirl; for a description of swirl, type ? swirl. To load this dataset:

> data(swirl)

Page 65: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyThe object swirl is an instance of the class marrayRaw. Try the

following commands to obtain information on this object:

> class(swirl)[1] "marrayRaw"> slotNames(swirl)[1] "maRf" "maGf" "maRb" "maGb" "maW" "maLayout"[7] "maGnames" "maTargets" "maNotes"> swirl

Pre-normalization intensity data: Object of class marrayRaw.

Number of arrays: 4 arrays.A) Layout of spots on the array:Array layout: Object of class marrayLayout.Total number of spots: 8448Dimensions of grid matrix: 4 rows by 4 cols…

Page 66: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyTo access individual slots:

> maLayout(swirl)Array layout: Object of class marrayLayout.Total number of spots: 8448Dimensions of grid matrix: 4 rows by 4 colsDimensions of spot matrices: 22 rows by 24 colsCurrently working with a subset of 8448 spots.Control spots:There are 2 types of controls :…

Page 67: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyAs with other microarray objects in Bioconductor packages, you can

use subsettingcommands for marrayRaw objects. For data on the first 100 genes in the second array in the swirl batch:

> sw <- swirl[1:100, 2]> class(sw)[1] "marrayRaw"> sw

Pre-normalization intensity data: Object of class marrayRaw.Number of arrays: 1 arrays.A) Layout of spots on the array:Array layout: Object of class marrayLayout.Total number of spots: 8448Dimensions of grid matrix: 4 rows by 4 colsDimensions of spot matrices: 22 rows by 24 colsCurrently working with a subset of 100 spots.…

Page 68: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyYou can access red and green foreground and background

intensities, and log ratios as follows:

> Rf <- maRf(swirl)> dim(Rf)[1] 8448 4

> Rf[1:5, ] swirl.1.spot swirl.2.spot swirl.3.spot swirl.4.spot[1,] 19538.470 16138.720 2895.1600 14054.5400[2,] 23619.820 17247.670 2976.6230 20112.2600[3,] 21579.950 17317.150 2735.6190 12945.8500[4,] 8905.143 6794.381 318.9524 524.0476[5,] 8676.095 6043.542 780.6667 304.6190

Page 69: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyThe marrayPlots package provides functions for diagnostic plots

of microarray spot statistics, such as boxplots, scatterplots, and spatial color images. To produce a spatial image of background intensities for the Cy3 channel in the third array:

> tmp <- maImage(swirl[, 3], x = "maGb", bar = FALSE)

Page 70: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyTo produce a spatial image of log ratios for the first array in the

batch:

> tmp <- maImage(swirl[, 1], col = maPalette(low = "blue", high = "yellow"), bar = FALSE)

Page 71: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyTo produce boxplots of log ratios by sector for the first array in

the batch:

> maBoxplot(swirl[, 1])

Page 72: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyFor boxplots of log ratios for all four arrays:

> maBoxplot(swirl)

Page 73: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case studyThe marrayNorm package implements robust adaptive location

and scale normalization procedures, which correct for different types of dye biases (e.g., intensity, spatial, plate biases).

The main location and scale normalization function is maNormMain.

Simpler wrapper functions are provided in maNorm and maNormScale. The functions operate on objects of class

marrayRaw (or possibly marrayNorm, if normalization is performed in several steps) and return objects of class marrayNorm. For within-print-tip-group loess location normalization of the batch swirl:

> swirl.norm <- maNormMain(swirl)

Page 74: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Marray: spotted arrays case study

Page 75: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Case study: Golub et. al., 1999The microarray expression measures and clinical variables corresponding to each ofthe 38 training mRNA samples are stored in an object of class exprSet. For

informationon the training data golubTrain:> class(golubTrain)[1] "exprSet“

> slotNames(golubTrain)[1] "exprs" "se.exprs" "phenoData" "description"

"annotation"[6] "notes“

> golubTrainExpression Set (exprSet) with

7129 genes38 samples

phenoData object with 11 variables and 38 cases

varLabelsSamples: SamplesALL.AML: ALL.AMLBM.PB: BM.PBT.B.cell: T.B.cell

Page 76: Welcome to lecture 7: An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of.

Thank you!

I hope I have been helpful to you this summer.Please always feel free to post questions on the

bioinformatics mailing list.Have fun with bioinformatics!!!

I need bioinformatics tools… Lots of bioinformatics tools…