Post on 16-Jan-2016
Overview of Bioconductor
Aedín Culhane
aedin@jimmy.harvard.edu
http://bcb.dfci.harvard.edu/~aedin
http://www.hsph.harvard.edu/research/aedin-culhane
Bioconductor
Biannual release (normally April and October) to coincide with the R release.
Current: Bioconductor 2.9 (release coincides with R 2.14)
To install, use the script on the Bioconductor website:
source("http://www.bioconductor.org/biocLite.R")
biocLite()
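The biocLite() script above matches the Bioconductor 2.9 era of these slides. Since Bioconductor 3.8, installation goes through the BiocManager package from CRAN instead. A minimal sketch of the newer route follows; the helper name install_bioc is hypothetical and only wraps the two real calls, so nothing is installed until you call it.

```r
# Hypothetical helper (not part of Bioconductor): wraps the BiocManager route
# that replaced source(".../biocLite.R"); biocLite() in Bioconductor >= 3.8.
install_bioc <- function(pkgs = character()) {
  # BiocManager itself lives on CRAN
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Installs the named Bioconductor packages; with no names given,
  # it only checks the packages that are already installed for updates
  BiocManager::install(pkgs)
}

# Example calls (commented out because they need network access):
# install_bioc(c("Biobase", "affy"))
```

Unlike the old biocLite(), calling the installer with no package names does not pull in a default set of core packages, so name the packages you need.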
Packages Overview
Bioconductor web site
• Bioconductor BiocViews Task view
Software
Annotation Data
Experimental Data
What Packages do I need?
This depends on your data and analysis pipeline, but for examples see:
• Bioconductor Workshops
• Bioconductor Workflows
Main types of Annotation Packages
• Gene centric AnnotationDbi packages:
– Organism: org.Mm.eg.db.
– Technology/Platform: hgu133plus2.db.
– GeneSets and Pathway (biology level): GO.db or KEGG.db
– .db packages can be queried with SQL or accessed using the annotation package functions (toTable, get, mget)
• Genome centric GenomicFeatures packages:
– Transcriptome level: TxDb.Hsapiens.UCSC.hg19.knownGene
– Generic features: can be generated via GenomicFeatures
• biomaRt:
– Query web-based `biomart' resources for genes, sequence, SNPs, etc.
• See http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/AnnotationSlidesBioc2011.pdf
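As an illustration of the get, mget and toTable access route for a platform package, here is a small sketch using hgu133plus2.db. It assumes that package is installed; if it is not, the block does nothing. The variable names probes and probe_symbols are made up for the example.

```r
# Stays NULL when the annotation package is not installed
probe_symbols <- NULL

if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  library(hgu133plus2.db)

  # Two probe set IDs from the HG-U133 Plus 2 array
  probes <- c("1007_s_at", "1053_at")

  # mget(): look up several keys at once, returns a named list
  probe_symbols <- mget(probes, hgu133plus2SYMBOL, ifnotfound = NA)

  # get(): look up a single key
  print(get("1007_s_at", hgu133plus2SYMBOL))

  # toTable(): flatten the whole probe-to-symbol map into a data frame
  print(head(toTable(hgu133plus2SYMBOL)))
}
```

The same pattern should carry over to organism packages such as org.Mm.eg.db, with Entrez Gene IDs as keys instead of probe set IDs.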
Bioconductor resources
• Mailing List (sign up for daily digest)
• Documentation, workshop/course material online
– Slides from talks, pdf of tutorials, R code
• Help available for each software package
– Each package MUST contain a vignette (howto)
• Other resources: www.Rseek.org, www.r-bloggers.com
Vignette
• Tutorials that provide a worked example of the package
• Required in Bioconductor packages
• Written in Sweave (Leisch, 2002).
– LaTeX dynamic reports in which R code is embedded and executable
– All R code in a vignette is checked (and executed) by R CMD check
– http://www.bioconductor.org/docs/vignettes.html
library("Biobase")
library("GOstats") # Load package of interest
openVignette()
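Besides openVignette() from Biobase, base R can list and open vignettes for any installed package. The sketch below uses the grid package only because it ships with R; the Biobase part is guarded and is skipped when Biobase is not installed.

```r
# vignette(package = ...) returns an index; $results is a character matrix
# with one row per vignette (columns include "Item" and "Title")
v <- vignette(package = "grid")
print(head(v$results[, c("Item", "Title"), drop = FALSE]))

# Same call for a Bioconductor package, only if it is installed
if (requireNamespace("Biobase", quietly = TRUE)) {
  print(vignette(package = "Biobase")$results[, "Item"])
}

# Interactive alternatives (commented out because they open a viewer/browser):
# vignette("grid", package = "grid")   # open one vignette by name
# browseVignettes("Biobase")           # HTML index of a package's vignettes
```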
S4 classes and ExpressionSet
• Within Bioconductor, you will encounter packages that are structured around the S4 object-oriented programming system proposed by John Chambers (developer of S)
• A class provides a software abstraction of a real world object.
• A method performs an action on a class
(Think of a class as a noun, and a method as a verb)
Object (S4)
• An object is an instance of a class.
• Descriptions are stored in slots
• slotNames(ob1) lists all slots in object, or use str().
• To access slots
– ob1@slotname
– slotname(ob1), where the package provides an accessor function, or
– slot(ob1, "slotname")
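The slot access routes listed above can be tried without installing anything, using a toy S4 class. This is only a sketch: the class ToyExprs and the accessor toyAnnotation are hypothetical names invented for illustration, not part of Biobase.

```r
library(methods)

# Hypothetical toy class with two slots
setClass("ToyExprs",
         representation(exprs = "matrix", annotation = "character"))

# An object is an instance of a class
ob1 <- new("ToyExprs",
           exprs = matrix(1:6, nrow = 3,
                          dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))),
           annotation = "toychip")

slotNames(ob1)   # names of all slots
str(ob1)         # compact view of the object and its slots

# Direct slot access, two equivalent forms
a1 <- ob1@annotation
a2 <- slot(ob1, "annotation")

# An accessor function is a generic plus a method for the class
setGeneric("toyAnnotation", function(object) standardGeneric("toyAnnotation"))
setMethod("toyAnnotation", "ToyExprs", function(object) object@annotation)
a3 <- toyAnnotation(ob1)
```

Accessor functions are generally preferred over @ in user code, since they keep working if the package author changes how the class stores its data.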
Example: ExpressionSet
library(ALL)
data(ALL)
slotNames(ALL)
ALL@phenoData
phenoData(ALL)
class(ALL)
?ExpressionSet
> ALL
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 128 samples
element names: exprs
protocolData: none
phenoData
sampleNames: 01005 01010 ... LAL4 (128 total)
varLabels: cod diagnosis ... date last seen (21 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
pubMedIds: 14684422 16243790
Annotation: hgu95av2
Methods which act on an S4 class
showMethods(classes = "ExpressionSet")
getMethod("write.exprs", "ExpressionSet")
Or, if you wish to see how the package really works, download and look at the source code
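To see what showMethods() and getMethod() report, it can help to define a generic and a method yourself first. The following is a minimal sketch on a made-up class; ToySet and toySummary are hypothetical names and have nothing to do with ExpressionSet.

```r
library(methods)

# Hypothetical class and generic, defined only for this illustration
setClass("ToySet", representation(values = "numeric"))
setGeneric("toySummary", function(object, ...) standardGeneric("toySummary"))

# The method is the "verb" that acts on the class
setMethod("toySummary", "ToySet", function(object, ...) {
  c(n = length(object@values), mean = mean(object@values))
})

# List every method that has ToySet in its signature
showMethods(classes = "ToySet")

# Retrieve the method definition; printing it shows the source
m <- getMethod("toySummary", "ToySet")
print(m)

# Call the generic; dispatch selects the ToySet method
res <- toySummary(new("ToySet", values = c(2, 4, 6)))
```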
Getting Data into R & Bioconductor
Aedín Culhane
aedin@jimmy.harvard.edu
http://www.hsph.harvard.edu/research/aedin-culhane/
Simple Excel SpreadSheet data
• Simple table
– read.table()
– read.csv()
– scan()
• However, more specialized readers exist for specific data types. See Technologies on BiocViews.
– http://www.bioconductor.org/packages/release/BiocViews.html
• Large data files. Also see http://www.revolutionanalytics.com
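For the simple-table case, the three base R readers can be compared on a small in-memory example. The data below are invented for illustration; in practice you would pass a file path (for example a sheet exported from Excel as CSV) instead of the text argument.

```r
# Made-up expression table: one row per gene, one column per sample
csv_text <- "gene,sample1,sample2
geneA,5.1,6.3
geneB,7.8,7.2
geneC,2.4,3.9"

# read.csv(): comma-separated, header assumed; first column becomes row names
expr <- read.csv(text = csv_text, row.names = 1)
expr_mat <- as.matrix(expr)   # numeric matrix, as most array tools expect

# read.table(): the general form; here on a tab-delimited copy of the same data
tab_text <- gsub(",", "\t", csv_text)
expr2 <- read.table(text = tab_text, header = TRUE, sep = "\t", row.names = 1)

# scan(): low-level reader that returns a plain vector
vals <- scan(text = "5.1 6.3 7.8", quiet = TRUE)
```

read.csv() is a wrapper around read.table() with sep = "," and header = TRUE preset, which is why both calls give the same data frame here.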
Some common data types
• Microarray
• SNP
• NGS
May 2011 14
A Microarray Overview
Reading Affymetrix Data
library(affy)
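require(affy) # Alternative to library()
# Read all CEL files in a directory into an AffyBatch object
affybatch <- ReadAffy(celfile.path="[Location of your data]")
# justRMA() reads the CEL files itself and returns an RMA-normalized ExpressionSet;
# rma(affybatch) does the same starting from the AffyBatch above
eSet <- justRMA(celfile.path="[Location of your data]")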
Sample R code
ExpressionSet Class in R
Assessing Data Quality
Public Microarray Data
• ArrayExpress: 21,997 studies (622,617 profiles)
• GEO: 22,735 studies (558,074 profiles)
Statistics as of May 2011
R Code
More on GEOquery
require(GEOquery)
Let's try to load the GDS810 dataset, which contains data on Alzheimer's disease at various stages of severity.
GDS810 <- getGEO("GDS810")
The getGEO function returns an object of class GEOData. You can get a description of this class like this:
help("GEOData-class")
Meta(GDS810)
Columns(GDS810)
head(Table(GDS810))
Affy SNP Arrays
Process – Affy SNP Arrays (oligo package)
Other Arrays
• Illumina
– lumi package
• 2 color spotted arrays
– limma package
• Other arrays
– http://www.bioconductor.org/help/workflows/oligo-arrays/
Next Generation Sequencing Data
R Code
Exercise
• Install the GEOquery package
• Download the dataset GSE1297 using getGEO
• The data will be downloaded as an eSet (ExpressionSet); use exprs and pData to see the expression data and phenoData
• Use arrayQualityMetrics to assess the quality of these data
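One possible way through the exercise is sketched below. It assumes GEOquery and arrayQualityMetrics are already installed and that network access to GEO is available, so it is wrapped in a function that runs nothing until called. The function name run_geo_exercise and the output directory name are hypothetical. In the GEOquery versions I am aware of, getGEO() on a GSE accession returns a list of ExpressionSets, one per platform, hence the [[1]].

```r
# Hypothetical wrapper: nothing is downloaded until the function is called
run_geo_exercise <- function(accession = "GSE1297",
                             outdir = "GSE1297_quality") {
  library(GEOquery)             # also attaches Biobase (exprs, pData)
  library(arrayQualityMetrics)

  # Download the series; assumed to come back as a list of ExpressionSets
  gse <- getGEO(accession, GSEMatrix = TRUE)
  eset <- gse[[1]]

  # Expression matrix (features x samples) and sample annotation
  print(dim(exprs(eset)))
  print(head(pData(eset)))

  # Writes an HTML quality report into outdir
  arrayQualityMetrics(expressionset = eset, outdir = outdir, force = TRUE)

  invisible(eset)
}

# eset <- run_geo_exercise()   # uncomment to run (needs network access)
```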
R basics: Getting help
• To get help
– ?mean
– help(mean)
• help.search("mean")
• apropos("mean")
• example(mean)
• http://www.bioconductor.org/help/
• With thanks to
• www.bioconductor.org/help/course.../Bioconductor-Introduction-lab.pdf