Introduction to R statistical environment › ... › r_nanocourse_6-dec-16_handouts.pdf• General...

Introduction to R statistical environment R Nano Course Series Aishwarya Gogate Computational Biologist I Green Center for Reproductive Biology Sciences


Page 1: Introduction to R statistical environment

R Nano Course Series

Aishwarya Gogate, Computational Biologist I

Green Center for Reproductive Biology Sciences

Page 2: History of R

• R is a free software environment for statistical computing and graphics
• The R language was built as a dialect of the S statistical language (S-plus), developed at Bell Laboratories in the mid-70s
• 2000: R version 1.0.0 was released
• Quickly became popular for bioinformatics and microarray analysis
• New version released every 6 months
• Now: versions for Windows (32 and 64 bit), UNIX/Linux, Mac OS, and RStudio (GUI version)

Page 3: What is R?

• Object-oriented statistical language
• Suite of operators for calculations on arrays and matrices
• Sophisticated graphical facilities for display or output files
• Active R community: R-help and R-devel mailing lists
• ~25 base, or standard, packages
• Thousands of contributed packages in repositories:
  – CRAN: http://CRAN.R-project.org
  – Bioconductor: www.bioconductor.org
  – Many more packages available on personal websites

Page 4: What I plan to cover in this section

• Introduction to R
• R language: the basics
• Strengths of R
• Working with R objects / data types
• Simple math functions using different data types
• Using R functions on vectors and matrices
• Subsetting vectors, data frames and matrices
• R packages, introduction to Bioconductor
• Installing a package (DESeq2)
• Importing a file: input/output with R, text files (csv, tab-delim, etc.)
• Moving between R and Excel

Page 5: R Website

www.r-project.org

Current version is 3.3.2 (Sincere Pumpkin Patch), released 2016-10-31

Page 6:

IDE = integrated development environment

http://www.rstudio.com/

Easy-to-use interface: your code, code execution, the data read into RStudio, and the output window (showing plots), all on one screen!

Page 7:

• www.bioconductor.org
• A group of R packages aimed at high-throughput genomic data analysis and genomic annotations (originally microarrays, but now SNP data, RNA-seq, ChIP-seq, other sequencing, mass spec, flow cytometry, etc.)
• Open source and open development
• Each Bioconductor package has a "vignette" for documentation
• Easy to download Bioconductor packages within R:

    source("http://www.bioconductor.org/biocLite.R")
    biocLite()                 # installs a default set of Bioconductor packages to get started
    biocLite("package.name")

• Packages are grouped as
  – Software (microarray, visualization, statistics, etc.)
  – AnnotationData (organized by organism, microarray)
  – ExperimentData
• Current version is 3.4, which works with R version 3.3.1 (users with older versions: update installations)

Page 8: R tutorials

• Under "Manuals" on the R website: several in-depth tutorials; some basic, some advanced
• Basic introductions to several specific topics in R: http://www.cyclismo.org/tutorial/R/
• Various forums discuss the range of errors that users encounter. When in doubt, just Google and get the syntax!
• Many R books available:
  – General purpose R: e.g., R Cookbook (2011), R in a Nutshell (2010)
  – Specific topics: e.g., Introductory Statistics with R, Applied Statistical Genetics with R, The Art of R Programming (software design), R Graphics Cookbook, Data Mining with R: Learning with Case Studies

Page 9: Getting help in R – Important!

• From the command line:
  – ?function_name (Example: ?mean)
  – Press q to get back to the R command line
  – ??keyword (Example: ??mean)
  – help(function, package=PKG) for a function in a specific package
• Google it
• Package-specific help files
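The help lookups above also have function forms that can be used inside a script. The sketch below is only an illustration; apropos() and args() are base R functions that the slide itself does not mention.

```r
# Find objects whose names contain "mean" (a quick complement to ??mean)
matches <- apropos("mean")
print(head(matches))

# Show the argument list of a function without opening its help page
sd.args <- args(sd)
print(sd.args)

# help() is the function form of "?": help("mean") is equivalent to ?mean
# help("mean")                        # uncomment in an interactive session
# help("t.test", package = "stats")   # help for a function in a specific package
```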

Page 10: Strengths and Weaknesses

• Strengths
  – Strong in statistics
  – Good with matrices
  – Good graphics
  – Access to a wealth of contributed statistical/computational/mathematics/bioinformatics methods
  – Widely used; relatively easy to create and distribute your own R package
• Weaknesses
  – Fairly memory-intensive
  – Not very good for parsing or file handling

Page 11: Working in R

• Can work interactively (line by line)
• In batch mode (run a whole file of code at once)
  – Linux command line:

    Rscript filename.r [possible additional arguments]
    nohup Rscript filename.r &    (to continue running in the background even if you log out)

  – "filename" refers to the R file with the code to run.
• In Linux, type R in a terminal window and hit Enter to launch R
  – You should see some introductory text
• In Windows, you would open the R program with its interface
• Ctrl-C will stop an R command without exiting R (e.g., to get out of an infinite loop)
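As a rough sketch of what a batch script might look like, the example below reads the "possible additional arguments" with commandArgs(). The helper parse_n() and the idea of passing a sample size are assumptions made for illustration, not part of the course material.

```r
# Hypothetical contents of filename.r, run from the shell as:
#   Rscript filename.r 10

# Illustrative helper: use the first argument as a sample size,
# falling back to a default when it is missing or not a positive number.
parse_n <- function(args, default = 5) {
  n <- suppressWarnings(as.integer(args[1]))
  if (length(args) == 0 || is.na(n) || n < 1) default else n
}

args <- commandArgs(trailingOnly = TRUE)  # only the arguments after the script name
n <- parse_n(args)
cat("Mean of", n, "random normal values:", mean(rnorm(n)), "\n")
```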

Page 12: Starting off in R

• When working on your own, it's good practice to work with an R script (for example, my_R_script.R)
• Copy code to the console as needed, so all your code can be saved
• In Linux, you can use any text editor
• In R on Windows, there is a script editor within the program. Simply highlight the code to run and click the run button.
• In Windows connected to a server for R, Notepad++ is free and supports color coding (syntax highlighting) for R.
• Can copy/paste multiple lines of code at a time into the console

Page 13: Creating variables in R

• The entities R operates on are technically known as objects.
• To create variables (objects) in R, use either <- or =
• Examples:

    x <- 1
    x = 1

• This is called making an assignment
• Give objects meaningful names
  – Object names CANNOT start with a number
  – Object names CAN have "." and numbers within them
• Similar to Python, if you try to access an object that hasn't been assigned, R will complain

    > cloud
    Error: object 'cloud' not found
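A small sketch of the points above. The object names gene.count2 and no.such.object are made up for the example; exists() and tryCatch() are base R functions used here to check for an object without stopping the script.

```r
x <- 1                  # assignment with <-
y = 2                   # assignment with = also works
gene.count2 <- x + y    # names may contain "." and digits, but cannot start with a digit

# Check whether an object has been assigned
has.count <- exists("gene.count2")
has.missing <- exists("no.such.object")
print(c(has.count, has.missing))

# Accessing an unassigned object raises an error; tryCatch() captures the message
msg <- tryCatch(no.such.object, error = function(e) conditionMessage(e))
print(msg)
```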

Page 14: Trying some simple R code

• R can be used like a calculator

    > (1+2+3+4+5)/5 # mean of 5 numbers
    [1] 3
    > (1+3)*5
    [1] 20
    > 7/6
    [1] 1.166667
    > (7/6)*6
    [1] 7

The [1] at the start of each output line refers to the index of the first entry printed.

Notice that for display, R rounded to a certain number of significant digits. But the full-precision value is what is actually calculated and stored.
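To see the difference between the displayed and the stored value, one option (a sketch, using the digits argument of print()) is:

```r
x <- 7/6
print(x)                # default display uses 7 significant digits
print(x, digits = 15)   # show more of the stored value

# The stored value, not the rounded display, is used in later calculations
back.to.seven <- isTRUE(all.equal(x * 6, 7))
print(back.to.seven)
```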

Page 15: R Functions

• A function performs some calculation and provides an output
• May take 0, 1 or more parameters as input, separated by commas. Parameter values can be specified directly in the function call, as objects in your workspace, or even as calls to another function

    function_name(param1=x, param2=y)

• Can assign the output to an object

    result <- function_name(param1=x, param2=y)

• Often, the variables/objects containing data are the first parameter(s) passed to functions; additional options follow

Page 16: R Functions

• Syntax for the mean() function:

    mean(x, trim=0, na.rm=FALSE)

• As long as you provide the parameters in the correct order, you don't need to type in parameter names
• Example: result <- mean(x=B) is the same as result <- mean(B), where B is a numeric vector
• Many functions have default parameters that you don't need to specify in the function call, as long as you want to use the defaults.
• Other parameters are required to be specified
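The example vector B below is made up to illustrate named versus positional parameters and the effect of the trim and na.rm defaults.

```r
B <- c(1, 2, 3, 4, 100)

m.named <- mean(x = B)         # parameter named explicitly
m.positional <- mean(B)        # same call, relying on parameter order
print(c(m.named, m.positional))   # both 22

# trim = 0.2 drops the lowest and highest 20% of values before averaging
m.trimmed <- mean(B, trim = 0.2)
print(m.trimmed)                  # 3

# With the default na.rm = FALSE, a missing value makes the result NA
B.na <- c(B, NA)
print(mean(B.na))                 # NA
m.no.na <- mean(B.na, na.rm = TRUE)
print(m.no.na)                    # 22
```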

Page 17: Some functions do not need any arguments

• To call a function without input parameters, you still need parentheses. Examples:

    ls()        # lists all the objects in your workspace
    quit()      # quits R
    library()   # lists all installed R packages
    getwd()     # get your working directory path
    colors()    # lists all available color names

Page 18: Trying some more code

    a <- 2      # assign the number 2 to an object called a
    a           # view the value of the object a
    b <- 3
    a + b       # calculate and print the result of a+b
    a^b         # calculate & print the result of a to the bth power

• Use the function c(), which stands for "combine", to create a vector object called a.vec:

    a.vec <- c(2,4,6)
    a.vec           # view the value of a.vec
    length(a.vec)   # get the length of the vector
    a.vec[1:2]      # get the first 2 entries of the vector

Page 19: Creating a vector

    > b.vec <- c(3,5,7)

Add each entry in a.vec to the corresponding entry in b.vec

    > a.vec + b.vec
    [1]  5  9 13

Multiply each entry in a.vec by the corresponding entry in b.vec

    > a.vec*b.vec
    [1]  6 20 42

R is case sensitive

    > A.vec
    Error: object 'A.vec' not found
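A self-contained version of the vector arithmetic above, as a sketch. The multiplication by a single number is an extra illustration that is not on the slide.

```r
a.vec <- c(2, 4, 6)
b.vec <- c(3, 5, 7)

sum.vec <- a.vec + b.vec     # element-by-element addition: 5 9 13
prod.vec <- a.vec * b.vec    # element-by-element multiplication: 6 20 42
scaled <- a.vec * 10         # a single number is applied to every entry: 20 40 60

print(sum.vec)
print(prod.vec)
print(scaled)
```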

Page 20: Creating a matrix

Create a 3x2 matrix:

    a.mat <- matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=2)
    a.mat

Calculate the square of each element in a.mat (note that this does NOT do traditional matrix multiplication):

    a.mat^2

Calculate the log base 2 of each element in a.mat:

    log2(a.mat)

    a.mat[1:2,1:2]   # display the first 2 rows and first 2 columns of the matrix a.mat
    a.mat[1:2,1]     # display the first 2 rows of the first column
    a.mat[1:2,]      # Q: What does this do?
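To make the contrast with traditional matrix multiplication concrete, this sketch compares the element-wise square with the %*% operator (base R's matrix product, not shown on the slide), and also shows what leaving the column index empty returns.

```r
a.mat <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

elementwise <- a.mat^2            # squares each entry separately
cross <- t(a.mat) %*% a.mat       # true matrix product: (2x3) times (3x2) gives 2x2

print(elementwise)
print(cross)

# Leaving the column index empty keeps all columns
first.two.rows <- a.mat[1:2, ]
print(first.two.rows)
```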

Page 21: Trying some more code

Objects can be of types other than numeric, for example characters (strings) or vectors of strings

    > message <- "Hello world"
    > message
    [1] "Hello world"
    > c.vec <- c('Hello','Goodbye')
    > c.vec
    [1] "Hello"   "Goodbye"

Q: What should we type to get just "Goodbye"?

Note that either single or double quotes can be used

Page 22: Useful functions for working in R

Objects you create are stored in your R workspace (which is written to the working directory if you save it when quitting)

1. ls() # lists the objects in your workspace
2. rm(object1,object2) # removes object1 & object2 from your workspace
3. rm(list=ls()) # completely clears your workspace
4. save(object1,object2,file="C:/myRobjects.RData") # saves object1 and object2 to an RData file
5. save.image("path/my_workspace.RData") # saves entire workspace
6. load("C:/myRobjects.RData") # loads saved RData from a file
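A round trip through save(), rm() and load() might look like the sketch below. It writes to a temporary file created with tempfile() rather than to a path under C:/, purely so the example runs on any system.

```r
object1 <- 1:3
object2 <- "hello"

# Temporary file used only for this demonstration
rdata.path <- tempfile(fileext = ".RData")
save(object1, object2, file = rdata.path)

rm(object1, object2)            # remove both objects from the workspace
gone <- !exists("object2")
print(gone)

loaded <- load(rdata.path)      # restores the objects and returns their names
print(loaded)
print(object1)
```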

Page 23: Useful functions for working in R

7. getwd() # find where your current working directory is
8. setwd('C:/workingDirectory') # change where your working directory is
9. quit() # quit R
10. library() # lists all packages available to load
11. library(package) or require(package) # load an installed R package

Page 24: R mathematical functions

    x <- 2.1

• Common mathematical functions

    exp(x)                   # e to a power
    log(x, base=exp(1))      # natural log, or any base
    log2(x)                  # log base 2
    log10(x)                 # log base 10
    sqrt(x)                  # square root
    abs(x)                   # absolute value
    round(x)                 # round to the nearest integer
    factorial(x)             # factorial of x. Try: factorial(5)
    choose(n, k)             # "n choose k". Try: choose(5,3)
    cos(x); sin(x); tan(x)   # cosine, sine, & tangent
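A few of these functions evaluated on simple inputs, as a quick check. The inputs are chosen here only because their answers are easy to verify by hand.

```r
x <- 2.1

results <- c(
  round     = round(x),          # 2
  abs       = abs(-x),           # 2.1
  sqrt      = sqrt(16),          # 4
  log2      = log2(8),           # 3
  log10     = log10(1000),       # 3
  natural   = log(exp(2)),       # 2
  factorial = factorial(5),      # 120
  choose    = choose(5, 3)       # 10
)
print(results)
```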

Page 25: Types of objects

    x <- c(0,1,0,1)

• Each object has an assigned data type.
• The most common atomic modes are:
  – logical (TRUE or FALSE) (0 and 1 are converted to FALSE and TRUE by as.logical())

    logical(length=3)   # create a logical vector of length 3
    is.logical(x)       # test whether an object is of data type logical
    as.logical(x)       # convert an object to a logical

  – character

    character(length), is.character(x), as.character(x)

  – numeric

    numeric(length), is.numeric(x), as.numeric(x)

  – integer

    integer(length), is.integer(x), as.integer(x)
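The create / test / convert pattern above can be tried on the slide's own vector x. This is only a sketch; typeof() is used to show how R stores each result.

```r
x <- c(0, 1, 0, 1)

print(typeof(x))           # numbers typed this way are stored as "double"
print(is.logical(x))       # FALSE: x is numeric, not logical

x.lgl <- as.logical(x)     # 0 becomes FALSE, 1 becomes TRUE
print(x.lgl)

x.chr <- as.character(x)   # numbers converted to strings
print(x.chr)

x.int <- as.integer(x)     # converted to integer storage
print(typeof(x.int))

empty.lgl <- logical(length = 3)   # a new logical vector, filled with FALSE
print(empty.lgl)
```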

Page 26: Types of objects

  – Vector (an ordered set of objects): the objects in a vector could be logical, integer, character, numeric, or even vectors or matrices themselves, as in lists.

    vector(length=1)   # create a logical vector of length 1
    is.vector(x)       # test whether an object is a vector
    as.vector(x)       # convert an object to a vector

  – Factor: factors are used to describe items that can have a finite number of values (gender, social class, etc.). Each possible value is called a level.

    factor(x, ...), is.factor(x), as.factor(x)

• Can also do these 3 functions for:
  – vector, matrix, list, data.frame, function (function works somewhat differently), environment
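A short factor sketch with made-up values. It also shows a common surprise: converting a factor directly to numbers returns the internal level codes, not the original labels.

```r
social.class <- factor(c("middle", "upper", "lower", "middle"))

lv <- levels(social.class)          # the distinct values, sorted alphabetically
print(lv)
print(table(social.class))          # counts per level

codes <- as.integer(social.class)   # internal codes that index into the levels
print(codes)

# Factor whose labels look like numbers
f <- factor(c("10", "20", "10"))
f.codes <- as.numeric(f)                  # level codes: 1 2 1
f.values <- as.numeric(as.character(f))   # original values: 10 20 10
print(f.codes)
print(f.values)
```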

Page 27: Working with vectors

Three ways to create a vector:

    my.vec <- c('h','e','l','l','o')
    my.vec1 <- seq(from=1,to=10,by=2)
    my.vec2 <- 1:4

Can add to a vector using the c() function:

    my.vec2 <- c(my.vec2, NA)

Subset a vector:

    my.vec1[my.vec1<6]
    subset(my.vec1,my.vec1<6)   # alternative way

Just for fun (graphics preview):

    plot(my.vec2,my.vec1)
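The subsetting above, run end to end as a sketch. The last lines go slightly beyond the slide to show how a missing value (NA) behaves inside a logical subset, using is.na() and which() from base R.

```r
my.vec1 <- seq(from = 1, to = 10, by = 2)   # 1 3 5 7 9
my.vec2 <- c(1:4, NA)                       # 1 2 3 4 NA

small <- my.vec1[my.vec1 < 6]               # keep entries where the test is TRUE
small.alt <- subset(my.vec1, my.vec1 < 6)   # same result via subset()
print(small)

# A comparison involving NA gives NA, which is carried into the subset
with.na <- my.vec2[my.vec2 > 2]             # 3 4 NA
print(with.na)

# Drop missing values explicitly, or ask for positions with which()
no.na <- my.vec2[!is.na(my.vec2)]           # 1 2 3 4
positions <- which(my.vec2 > 2)             # 3 4
print(no.na)
print(positions)
```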

Page 28: Functions to learn about an object

    length(my.vec)       # length of a vector or list
    nchar('abcdefg')     # size of a character string
    dim(a.mat)           # dimensions of a matrix or data frame
    # [1] 3 2            (1st value is number of rows, 2nd value is number of columns)
    mode(my.vec)         # storage mode, i.e. logical, numeric, character, …
    typeof(my.vec)       # logical, integer, double, complex, character, etc. (similar to mode())
    attributes(my.vec)   # useful for learning about unknown complex objects

    Res <- t.test(rnorm(20,1),rnorm(20,2))
    Res                  # display formatted results of t-test
    attributes(Res)      # provides names of list values
    Res$p.value
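A sketch of inspecting the t-test result as a list. The seed value is arbitrary and only makes the random numbers repeatable; names() and class() are base R functions that complement attributes().

```r
set.seed(1)   # arbitrary seed so the random draws are reproducible
Res <- t.test(rnorm(20, 1), rnorm(20, 2))

res.names <- names(Res)     # the pieces stored inside the result
print(res.names)
print(class(Res))           # "htest"
print(mode(Res))            # stored as a list

p <- Res$p.value            # pull out a single piece with $
print(p)
print(Res$conf.int)         # confidence interval for the difference in means
```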

Page 29: Working with matrices

    my.mat1 <- matrix(1:20,nrow=5,ncol=4)

Did the matrix get filled in by rows or by columns? We can use the same subsetting method for matrices that we used for vectors.

    my.mat1[!is.na(my.vec2),]

What does the above line of code do? A useful way to create a matrix (or data frame) is to combine multiple vectors as columns using the function cbind(), which stands for "column bind".

    my.mat2 <- cbind(1:5,seq(1,10,2))

cbind() can also be used to combine a matrix or data frame with additional vectors.
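One way to explore the questions above is sketched below. The byrow = TRUE argument of matrix() and the extra column 5:1 are additions for illustration.

```r
my.mat1 <- matrix(1:20, nrow = 5, ncol = 4)                     # default fill order
my.mat.byrow <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)  # fill row by row instead

print(my.mat1[1, ])        # 1 6 11 16: the default fills column by column
print(my.mat.byrow[1, ])   # 1 2 3 4

# Logical row selection: keep the rows where my.vec2 is not missing
my.vec2 <- c(1:4, NA)
kept <- my.mat1[!is.na(my.vec2), ]   # rows 1 to 4
print(dim(kept))

# Bind vectors as columns, then add one more column to the result
my.mat2 <- cbind(1:5, seq(1, 10, 2))
my.mat3 <- cbind(my.mat2, 5:1)
print(my.mat3)
```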

Page 30: Data Frames (Tables)

• A data frame object in R is like a table of data. Different columns can be different types of objects (e.g. character and numeric)
• When we read data from a text file into R with read.table(), it is read in as a data frame object.
• Like matrices, data frames can have column and row names
• Like matrices, you can easily access or display any subset of rows or columns

Page 31: Working with data frames

A data frame is a tightly coupled collection of variables which share many of the properties of matrices and of lists, and is used as the fundamental data structure in most R code.

    numbers <- 1:4
    letters <- c('a','b','c','d')
    grp <- as.factor(c(1,0,0,1))
    mydata <- data.frame(numbers, letters, grp)
    mydata
    mydata[1:2,]   # commonly used to view top rows of a table

      numbers letters grp
    1       1       a   1
    2       2       b   0

    mode(mydata)

    [1] "list"

    ?mode   # see for list of possible values
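One detail worth checking when building a data frame: passing the vectors through cbind() first converts everything to a single common type, and a factor is replaced by its internal codes, whereas giving the vectors directly to data.frame() keeps each column's own type. The sketch below compares the two; the vector is named lets here (an assumption for the example) to avoid masking R's built-in letters constant.

```r
numbers <- 1:4
lets <- c("a", "b", "c", "d")
grp <- as.factor(c(1, 0, 0, 1))

via.cbind <- data.frame(cbind(numbers, lets, grp))   # columns pass through a character matrix
direct <- data.frame(numbers, lets, grp)             # each column keeps its own type

print(sapply(via.cbind, class))
print(sapply(direct, class))

# In the cbind() version the factor has been replaced by its level codes
print(as.character(via.cbind$grp))   # "2" "1" "1" "2"
print(as.character(direct$grp))      # "1" "0" "0" "1"
```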

Page 32: Working with data frames

    colnames(mydata)   # get the column headers of the data frame
    rownames(mydata)   # get the row names (named by default) of the data frame

• There are 3 different ways to access a column

    mydata$letters
    mydata[,"letters"]
    mydata[,2]

• Display the row(s) where the 2nd column = 'b'

    mydata[mydata[,2]=='b',]
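A self-contained sketch of the column and row access shown above. The data frame is rebuilt here so the block runs on its own, and stringsAsFactors = FALSE is set explicitly so the letters column is plain character on any R version.

```r
mydata <- data.frame(
  numbers = 1:4,
  letters = c("a", "b", "c", "d"),
  grp = factor(c(1, 0, 0, 1)),
  stringsAsFactors = FALSE
)

# Three equivalent ways to pull out the same column
by.dollar <- mydata$letters
by.name <- mydata[, "letters"]
by.position <- mydata[, 2]
print(by.dollar)

# Rows where the 2nd column equals "b"
row.b <- mydata[mydata[, 2] == "b", ]
print(row.b)

# Conditions and column names can be combined
grp.one <- mydata[mydata$grp == 1, c("numbers", "letters")]
print(grp.one)
```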

Page 33: Installing a package

    > source("http://bioconductor.org/biocLite.R")
    > biocLite("DESeq2")            # package for DE analysis of RNA-seq data
    > install.packages("ggplot2")   # used to create plots in R
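biocLite() was the installer at the time of these slides; on more recent Bioconductor releases installation goes through the BiocManager package instead. The sketch below wraps that route in a hypothetical helper, ensure_package(), which only installs when asked to, so running it as shown does not touch the network.

```r
# Hypothetical helper: report whether a package is installed and,
# only when install = TRUE, try to install it through BiocManager.
ensure_package <- function(pkg, install = FALSE) {
  if (requireNamespace(pkg, quietly = TRUE)) return(TRUE)
  if (!install) return(FALSE)
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")   # BiocManager itself comes from CRAN
  }
  BiocManager::install(pkg)           # handles Bioconductor and CRAN packages
  requireNamespace(pkg, quietly = TRUE)
}

print(ensure_package("stats"))    # a base package, always available
print(ensure_package("DESeq2"))   # TRUE only if DESeq2 is already installed
# ensure_package("DESeq2", install = TRUE)   # uncomment to actually install
```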

Page 34: Importing a file

• To read in a file (use sep="\t" for tab-delimited files and sep="," for comma-separated files)

    x <- read.table(file="/path/to/file/file.csv", sep="\t", header=TRUE)

• To check that the file has been properly imported you can:

    head(x)   # displays first few rows of the imported file
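To try the import without a real data file, this sketch first writes a small made-up table to a temporary tab-delimited file and then reads it back. The gene and count columns are invented for the example.

```r
# Write a small example table to a temporary tab-delimited file
example <- data.frame(gene = c("g1", "g2", "g3"), count = c(10, 0, 25))
tab.path <- tempfile(fileext = ".txt")
write.table(example, file = tab.path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; header = TRUE uses the first line as column names
x <- read.table(file = tab.path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

print(head(x))   # first few rows
print(dim(x))    # number of rows and columns
str(x)           # column names and types
```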

Page 35: Moving between R and Excel

• Oftentimes there is confusion about how to open the output files created in R
• The files may be in formats such as .txt, .csv, etc.
• All you have to do: Right-click -> Open with -> Microsoft Excel
• Or you can also copy & paste the contents of the .txt file into Excel
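For the R side of this hand-off, one common option is write.csv(), which produces a comma-separated file that Excel opens directly. The sketch below uses a made-up results table and a file in the temporary directory; read.csv() brings a file saved from Excel as .csv back into R.

```r
results <- data.frame(sample = c("A", "B"), value = c(1.5, 2.5))

# row.names = FALSE avoids an extra unnamed first column in Excel
csv.path <- file.path(tempdir(), "results_for_excel.csv")
write.csv(results, file = csv.path, row.names = FALSE)
print(csv.path)   # open this file with Excel

# Reading a .csv file (for example one saved from Excel) back into R
back <- read.csv(csv.path, stringsAsFactors = FALSE)
print(back)
```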

Page 36: Thank You!

• After the short break…

1. Please have RStudio open on your computer
2. Open the file R_Intro_Workshop_code.r, which has been provided to you online