Date posted: 22-Dec-2015

Page 1: Programming with R and Bioconductor for microarray expression analysis and interpretation Ramsi Haddad Rhaddad@genetics.wayne.edu 577-2569.

Programming with R and Bioconductor for microarray

expression analysis and interpretation

Ramsi Haddad

Rhaddad@genetics.wayne.edu

577-2569

Outline

• Lecture 1: Overview and background for R

• Lecture 2: BioC & microarray diagnostics

• Lecture 3: BioC analysis and interpretation

• Lecture 4: Cluster computing on BlueGene

Lecture 1

Overview and background for R

What is R?

• Open source software that implements the S programming language

• can be used on the command line, or scripted

• Available from www.r-project.org

• Good documentation

• Not so user friendly

– “R is user-friendly, it just chooses its friends carefully”, Kyle Furge

Starting R

• use Putty to connect to genetics.wayne.edu

• then use ssh to connect to bluegene.biosci.wayne.edu

• password is mgb8680

• type R to start

– q() # this is how we shut down

Help!

• If you have a workstation, use help.start() and a web browser will pop up

• For us, help(command) for manual pages

– can also use ?command

• Excellent documentation at web site

• “Introductory Statistics with R”, P. Dalgaard

– great little book

Basics

• R is command driven

– It waits at the “>” prompt for you to give it instructions

• R operates on data structures and objects

• Data structures include vectors, lists, arrays (matrices) and data frames

– No scalars per se, only vectors with 1 element

Command or Assignment

• Everything is either a functional command (i.e. statement) or an assignment

• Assignment operator is an arrow pointing to the variable name (no space allowed within the arrow!): <-

• x <- 10 # i.e. x = 10

• y <- c(1,2,3) # c(), concatenate

• z <- c(1:20) # see also help(rep) and help(seq)

• print(X)

– won’t work since R is case sensitive; use print(x)

• [1] 10 # the [1] is the index of the first element on the line

Functions

• 1 + 1 # answer is a vector of length 1, printed as [1] 2

• z * 2

• jack <- z * 2

• Jack <- 1:40

• length(Jack)

• all function calls include () even if there is nothing inside; to list the workspace: ls()

• if you type ls without (), R prints the source code of the function instead of running it

• some functions behave differently on different objects: polymorphic (generic) functions

• lots of built-in functions

– mean(Jack); median(Jack); sd(Jack)

Don’t use single character names

• avoid these single-letter object names, which R already uses for built-in functions and constants:

• c, q, t, C, D, F, I, T

• try to use other variable names of at least 3 characters

• I like to use names for quick stuff.

Vectors, intro to data structures

• All data structures are objects that can be assigned to variable names

• All objects have a mode and a length

• Vector: an ordered collection of items (all of which are of the same mode)

– jack <- c(1:10)

– mode(jack) # mode is numeric

– typeof(jack) # integer type

– length(jack) # 10

• mode can be character, numeric, complex, logical

• typeof can be integer, double, complex etc.

Vector generating functions

• seq(from, to, by=) # test this out, look at the help page, ?seq

• rep(x, times, length.out, each) # test this out
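A few calls worth trying at the prompt, as a minimal sketch of the two generators. The object names (by.three, quarters and so on) are made up for this illustration and are not from the handout.

```r
# seq(): regular sequences
by.three <- seq(from = 1, to = 10, by = 3)   # 1 4 7 10
quarters <- seq(0, 1, length.out = 5)        # 0.00 0.25 0.50 0.75 1.00

# rep(): repeated values
whole.vec <- rep(c(1, 2), times = 3)         # 1 2 1 2 1 2 (whole vector repeated)
each.elem <- rep(c(1, 2), each = 3)          # 1 1 1 2 2 2 (each element repeated)
cut.short <- rep(c(1, 2), length.out = 5)    # 1 2 1 2 1   (stops at length 5)

print(by.three)
print(each.elem)
```

Naming the arguments (times=, each=, length.out=) makes it obvious which kind of repetition is being asked for.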

Vectorized Arithmetic

• two vectors can be multiplied, one element at a time, if they are the same length

• if they are of different lengths, the shorter vector is “recycled”

• if the length of the longer vector is not a multiple of the length of the shorter vector, there is a warning.

• make two vectors of different length and test this out.
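One way to try the recycling rule, as a small sketch. The vector names (short.vec, long.vec, odd.vec) are hypothetical, chosen only for this example.

```r
short.vec <- c(1, 2)
long.vec  <- 1:6     # length 6 is a multiple of 2
odd.vec   <- 1:5     # length 5 is not a multiple of 2

# short.vec is reused as 1 2 1 2 1 2, so there is no warning
even.product <- long.vec * short.vec   # 1 4 3 8 5 12

# short.vec is reused as 1 2 1 2 1; R still computes the result
# but warns that the longer length is not a multiple of the shorter
odd.product <- odd.vec * short.vec     # 1 4 3 8 5

print(even.product)
print(odd.product)
```

The warning is worth taking seriously in real analyses, since it usually means two vectors that were supposed to line up do not.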

Logical Vectors

• can be written out (rare)

– a <- c(TRUE, TRUE, FALSE, TRUE)

• can be generated

– a <- z > 10 # with z <- 1:20, expect 10 x FALSE, then 10 x TRUE

• print(a); length(a)

• comparison operators that produce logical vectors: <, <=, ==, >, >=, !=

Arrays & Matrices

• arrays are multi-dimensional vectors, matrices are two-dimensional vectors

• can assign dimensions to vector z to create a matrix

– z <- 1:20; dim(z) <- c(5,4) # z matrix has 5 rows and 4 columns

• table.z <- matrix(z, nrow=4, ncol=5)

• table.z # make sure it worked the way we wanted.

– let’s look at ?matrix to see how else to do that

– maybe add byrow=TRUE
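To see what byrow changes, a short sketch comparing the two fill orders. The names by.col and by.row are made up for this comparison.

```r
z <- 1:20

# Default: values fill down the columns
by.col <- matrix(z, nrow = 4, ncol = 5)

# byrow = TRUE: values fill across the rows
by.row <- matrix(z, nrow = 4, ncol = 5, byrow = TRUE)

print(by.col[1, ])   # 1 5 9 13 17
print(by.row[1, ])   # 1 2 3 4 5
print(dim(by.row))   # 4 5
```

Both matrices hold the same 20 values; only their arrangement differs, which matters as soon as you index by [row, column].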

Matrix Functions

• dim(z) <- c(row, column) # previous slide

• matrices have rownames & colnames # attributes

– vectors use names

• rownames(z) <- LETTERS[1:5] # one name per row; z has 5 rows

• t(z) # transposition, so that columns become rows and rows become columns

– very important when generating a matrix from a matrix

• rownames(z) <- c(paste("row", 1:5))

– what happens if you forget the terminal ) ?

Matrix Functions con’t

• cbind, rbind # column bind and row bind

• cart <- matrix(1:100, ncol=20, byrow=TRUE)

• horse <- 10:15

• cart.horse <- rbind(cart, horse)

– this wouldn’t work with cbind, why?

• rownames(cart.horse) <- NULL

– # remove the evidence

Lists

• Lists are generic vectors that can hold any data type (the elements don’t all need to be character, logical or numeric)

• ramsi <- list(name="ramsi", age=36, married=TRUE, children=c("jack", "tommy"))

• str(ramsi) # gives you the structure of the ramsi list

• ramsi$children

– # $ operator gets to variables within a list

Data Frame

• table containing numbers and text

• i.e. the matrix equivalent of a list

– some columns can be numbers, others text, still others logical.

• sounds like microarray data

– probe set ID – character

– expression values – numerical

– treatment – logical…
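A toy data frame along those lines might look like the sketch below. The probe IDs and expression values are invented placeholders, not real array data, and the column names are only an assumption about how one might lay things out.

```r
# Made-up example: three probe sets, one expression value each
arrays <- data.frame(
  probe.id   = c("probe_A", "probe_B", "probe_C"),  # character
  expression = c(8.2, 5.1, 6.7),                    # numeric
  treated    = c(TRUE, FALSE, TRUE),                # logical
  stringsAsFactors = FALSE                          # keep IDs as plain text
)

str(arrays)                                # one line per column, with its type
treated.only <- arrays[arrays$treated, ]   # rows where treated is TRUE
print(treated.only)
```

As with lists, the $ operator pulls out a single column, and each column keeps its own mode.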

Numeric Indexing

• vectors, matrices, lists

• need to have some form of indexing.

• use square brackets []

• z[1] # first element of z (there is no z[0])

• z[1:4] # sub-sequence of z

• z[1,4] # no go for a vector, need to use z[c(1,4)]

– I usually write the closing bracket with the opener, less confusing.

Logical Indexing of vectors, i.e. Conditional Selection

• z <- 1:50

• a <- z > 13 # make a logical vector

• length(a) # still 50, just some FALSE

• z[a] # apply the logical vector

• length(z[a]) # this is 37

• length(z[z>13])

• negative index drops item.
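The same selection written out as a runnable sketch, with the negative-index form alongside for comparison. The names kept and dropped are made up here.

```r
z <- 1:50
a <- z > 13            # logical vector, same length as z

kept <- z[a]           # only the elements where a is TRUE
print(sum(a))          # 37, TRUE counts as 1
print(kept[1:3])       # 14 15 16

# A negative index drops positions instead of selecting them
dropped <- z[-(1:13)]  # drop the first 13 elements
print(identical(kept, dropped))   # TRUE
```

Note the parentheses in -(1:13); without them, -1:13 would mean the sequence from -1 to 13.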

Logical Indexing of matrices

• a must be the same length as z for this to work.

• matrices have index [row, column]

• table.z <- matrix(z[1:20], nrow=4, ncol=5, byrow=TRUE) # the argument is byrow; use the first 20 values of z

• table.z[3,2] # should be a 12, verify it…

• table.z[2,] # all of row 2

• table.z[,4] # all of column 4

• indexes dropped with minus sign

– Try this out
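A sketch of what “try this out” could look like. The result names (no.row1, no.cols12, big.rows) are hypothetical, and the threshold of 6 is arbitrary.

```r
table.z <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE)

print(table.z[3, 2])                  # 12

no.row1   <- table.z[-1, ]            # drop row 1: 3 rows left
no.cols12 <- table.z[, -c(1, 2)]      # drop columns 1 and 2: 3 columns left

# Logical indexing on rows: keep rows whose first column is above 6
# (the first column holds 1, 6, 11, 16, so rows 3 and 4 are kept)
big.rows <- table.z[table.z[, 1] > 6, ]

print(dim(no.row1))     # 3 5
print(dim(no.cols12))   # 4 3
print(big.rows)
```

The logical vector used on the row side must have one entry per row, just as a had to match the length of z for plain vectors.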

Looping through matrices

• R has explicit loops like other languages

– often slower than vectorized code, so they are rarely needed for this

• apply(X, MARGIN, FUN, …) # FUN must work on vectors; MARGIN is 1 for rows, 2 for columns

• let’s look at its help page, make a big table with rnorm() and take the mean of the rows and columns.
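The exercise could be sketched like this. The table size (1000 by 10), the seed and the object names are arbitrary choices for the illustration.

```r
set.seed(1)   # only so the random numbers are reproducible

# A big table of random normal values: 1000 rows, 10 columns
big.table <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)

row.means <- apply(big.table, 1, mean)   # MARGIN = 1: one mean per row
col.means <- apply(big.table, 2, mean)   # MARGIN = 2: one mean per column

print(length(row.means))   # 1000
print(length(col.means))   # 10

# For plain means, base R also has the shortcuts rowMeans() and colMeans()
print(all.equal(row.means, rowMeans(big.table)))
```

apply() earns its keep when FUN is something without a built-in shortcut, for example a function you wrote yourself.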

Functions

• look at the help page for t.test

• this is a typical statistical function

• it requires an input and has an output

• the input usually has default values

• input can be defined by order of appearance (positional matching) or explicitly (by name)

• always better to go explicit, especially when scripting.

• positional and named arguments can be mixed in the same function call.

More on the t.test() function

• t.test(x, y, alternative="two.sided", mu=0, paired=FALSE, var.equal=FALSE, conf.level=0.95, …)

• t.test(x,y,"two.sided", ,,TRUE,0.99) # this is not readable!
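For contrast, here is the same kind of call with named arguments, as a sketch. The data are simulated with rnorm() purely for illustration, and the seed is arbitrary.

```r
set.seed(42)
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(10, mean = 3, sd = 3)

# Only the arguments that differ from their defaults are spelled out,
# and each one says what it is
result <- t.test(x, y,
                 alternative = "two.sided",
                 var.equal   = TRUE,
                 conf.level  = 0.99)

print(result)
```

With names, the arguments can be given in any order, and mu and paired simply keep their defaults instead of needing empty placeholders.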

t.test() results list

• list with multiple values.

• since it’s a list, use the $ operator to extract values

• also possible to use list[[]] (not worth it)

• catherine <- c(rnorm(10, 5, 2), rnorm(10, 3, 3)) # make a vector

• ramsi <- t.test(catherine[1:10], catherine[11:20])

• ramsi # this shows the standard t.test() output

• str(ramsi) # let’s see the variables
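Pulling individual pieces out of the result could look like the sketch below. The seed is an arbitrary addition so the numbers repeat from run to run; t.stat and p.val are made-up names.

```r
set.seed(7)
catherine <- c(rnorm(10, 5, 2), rnorm(10, 3, 3))
ramsi <- t.test(catherine[1:10], catherine[11:20])

print(names(ramsi))         # the components available in the list

t.stat <- ramsi$statistic   # the t-statistic
p.val  <- ramsi$p.value     # the p-value
print(t.stat)
print(p.val)

# The [[ ]] form reaches the same component
print(identical(p.val, ramsi[["p.value"]]))   # TRUE
```

Being able to grab p.value directly is what makes it practical to run a test on thousands of genes and collect the results in a vector.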

library(multtest)

• look at your handout, we’ll fiddle with it.

• this is a library of routines or functions that someone has written.

• must first be loaded with library(multtest)

• one of the functions is mt.sample.teststat()

writing functions

• my.function <- function(a, b, c) { this; that; return(whatever) }

• cubism <- function(a = 3) { b <- (a*a*a); return(b) }

• can also end with b (or c(b)) instead of return(b); the last value is returned

• cubism() # default is 27

• cubism(4) # better be 64!

another not-so-silly function

• count the number of times the values in a series of numbers (variable name series) are bigger than a given number (variable name number)

freq <- function(series=1:20, number=11) {
  instances <- series[series > number]
  answer <- length(instances)
  return(answer)
}
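A quick check of how freq() behaves, as a sketch. The function is repeated here so the block runs on its own, and the test values are made up.

```r
freq <- function(series = 1:20, number = 11) {
  instances <- series[series > number]   # keep the values above number
  answer <- length(instances)            # count them
  return(answer)
}

default.count <- freq()                       # 12 to 20 exceed 11, so 9
small.count   <- freq(c(2, 5, 8), number = 4) # 5 and 8 exceed 4, so 2

print(default.count)
print(small.count)

# The same count in one line: TRUE counts as 1 when summed
print(sum(c(2, 5, 8) > 4))
```

Counting how often one value is exceeded by a series of others is the core step of a permutation test, where the series would be the permuted statistics.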

Permuted t.test()

• take a vector of gene expression data.

• calculate a t-statistic for it using t.test()

• jumble the data and run another t.test()

– rinse and repeat for all possibilities

• how many times is a permuted t-statistic as good as or better than the real one?

– that count, divided by the number of permutations, is your real p-value

– this permuted t-test is distribution free since you are using sampling to generate the actual distribution of the statistic instead of relying on some theoretical normal dist.
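The idea can be sketched in a few lines of base R. This is only an illustration under some assumptions: it draws a fixed number of random shuffles instead of going through all possibilities, it compares absolute t-statistics (a two-sided test), and the function name perm.t.sketch, its arguments and the simulated data are all made up here. It is not the multtest implementation.

```r
# Hypothetical helper: permutation p-value for a two-group comparison
perm.t.sketch <- function(values, groups, n.perm = 1000) {
  # t-statistic for the real group labels
  observed <- t.test(values[groups == 1], values[groups == 2])$statistic

  # t-statistics after jumbling the group labels
  permuted <- replicate(n.perm, {
    shuffled <- sample(groups)
    t.test(values[shuffled == 1], values[shuffled == 2])$statistic
  })

  # fraction of shuffles at least as extreme as the real statistic
  mean(abs(permuted) >= abs(observed))
}

set.seed(123)                                       # reproducible shuffles
expr <- c(rnorm(8, mean = 5), rnorm(8, mean = 7))   # simulated expression values
grp  <- rep(c(1, 2), each = 8)                      # group labels
p.perm <- perm.t.sketch(expr, grp, n.perm = 500)
print(p.perm)
```

Random shuffling is a common shortcut because the number of possible relabelings grows very quickly with sample size; with small groups, enumerating every relabeling is feasible and gives an exact answer.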

making your own function

• homework assignment will be to generate a permuted.t.test() for microarray data

• once this works, you’ll have it forever.

• you’ll forget how to use GeneSpring, but you’ll always have a record of your function or your entire data analysis.

• I have only seen farther because I stood on the shoulders of giants.

– use multtest… can it be done without multtest?

What is Bioconductor?

• set of routines and functions to perform microarray analysis

• all are used within R and can therefore be manipulated and altered.

• tough to use without R

• my analysis is about ½ R and ½ Bioconductor libraries…

• open group with mailing list.

Lecture 2

• you all have 24 arrays of muscle tissue from normal, obese and morbidly obese men

• we want to read these all in, perform some QC/QA and see which genes are differentially expressed using a permuted t test.
