Introduction to R

Samuel Bosch

What is R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language.

Created in 1993, license: GNU GPL, current version 3.2.3

Interpreted

C-like syntax

Functional programming language semantics (Lisp, APL)

Object oriented (3 different OO systems)

Garbage collector

Mostly call-by-value

Lexical scope

Function closure
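Lexical scope, closures and call-by-value are easier to see in code than in a bullet list. The following is a minimal sketch; make_counter and set_first are hypothetical names chosen for illustration only.

```r
# make_counter() returns a function that remembers `count` between calls
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # `<<-` assigns in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()
counter_a()
counter_a()
counter_b()   # independent of counter_a: each closure has its own `count`

# Call-by-value in practice: changing an argument leaves the caller's copy alone
set_first <- function(v) {
  v[1] <- 99
  v
}
original <- c(1, 2, 3)
modified <- set_first(original)
```

Each call to make_counter creates a fresh environment, which is why the two counters do not share state.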


Popularity

TIOBE: 18

Github: 12

Stackoverflow: 117341 questions (Java: 978006, Python: 507653)

Most popular tool for statistical data analysis


Usage

CRAN Task Views: https://cran.r-project.org/web/views/

Statistics (frequentist and bayesian)

Machine learning and data mining

Science (mathematics, chemistry, physics, medical, ecology, genetics, economy, history, …)

Finance

Natural Language Processing

Data visualization

Analyzing spatial, spatio-temporal data and time series

…


R Markdown

This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Competitors/colleagues

SAS, SPSS, STATA, Mathematica and other statistical software

Python + Numpy + Pandas + matplotlib + …

Matlab/Octave

Julia

K/J and other APL like languages

Java (Weka), Clojure, .NET (F#), …


Calling R

command line

SAS, SPSS, Stata, Statistica, JMP

Java, C++, F#

Python, Perl, Ruby, Julia

PostgreSQL: PL/R


Ecosystem

IDE: RStudio or one of the alternatives (plugins for Eclipse, Visual Studio, Atom, Sublime Text, Vim, …)

Packages: CRAN (6700+ packages), Bioconductor, RForge, GitHub

Learning more and getting help:

Built-in documentation (?, help(), F1) and package vignettes

Official manuals: https://cran.r-project.org/manuals.html

Short reference card: https://cran.r-project.org/doc/contrib/Short-refcard.pdf

(Free) books: Advanced R (http://adv-r.had.co.nz/) and R packages (http://r-pkgs.had.co.nz/) by Hadley Wickham

Courses on edX and Coursera

Stack Overflow (http://stackoverflow.com/) and Cross Validated (http://stats.stackexchange.com/, for statistical questions)

[email protected]

View help

?Filter
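The ?Filter help page documents Filter together with Map and Reduce. As a rough illustration of what that page describes (the variable names are made up for this sketch):

```r
values <- 1:10

evens   <- Filter(function(x) x %% 2 == 0, values)  # keep elements passing the test
squares <- Map(function(x) x^2, evens)              # apply a function, returns a list
total   <- Reduce(`+`, squares)                     # fold the list into a single value
total
```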


Operators

+ - * /
^ or ** for exponentiation
%% modulus
%/% integer division
< <= > >= == !=
!x isTRUE(x) xor(x, y)

# element-wise OR and AND
c(FALSE, FALSE) | c(TRUE, FALSE) & c(TRUE, FALSE)

## [1] TRUE FALSE

# first element OR and AND (since R 4.3.0, || and && raise an error for operands longer than one)
c(FALSE, FALSE) || c(TRUE, FALSE) && c(TRUE, TRUE)

## [1] TRUE
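A few of the less familiar operators in action. This is only a small sketch with arbitrary numbers, and the variable names are illustrative.

```r
int_div   <- 7 %/% 2       # integer division
remainder <- 7 %% 2        # modulus
neg_mod   <- (-7) %% 3     # the result takes the sign of the divisor
same_pow  <- 2^10 == 2**10 # both spellings of exponentiation agree
exclusive <- xor(TRUE, FALSE)
strict    <- isTRUE(c(TRUE, TRUE))  # isTRUE() only accepts a single TRUE
```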

Vectors

Sequence of elements of the same type

a <- c(1,2,5.3,6,-2,4) # numeric vector
a[c(2,4)] # 2nd and 4th element

## [1] 2 6

names(a) <- c("c","d","e","f","g","h")
a

##    c    d    e    f    g    h
##  1.0  2.0  5.3  6.0 -2.0  4.0

Vectors

a[a > 3] ## a[c(F, F, T, T, F, T)]

##   e   f   h
## 5.3 6.0 4.0

a[3:5]

##    e    f    g
##  5.3  6.0 -2.0

a[-1]

##    d    e    f    g    h
##  2.0  5.3  6.0 -2.0  4.0

Vectors

a[c("c","d","e")]

##   c   d   e
## 1.0 2.0 5.3

a[a %in% c(1,2)]

## c d
## 1 2

is.null(c()) & is.null(NULL)

## [1] TRUE

Vectors

c(1,2,3)[c(TRUE,FALSE,NA)]

## [1] 1 NA

c(sum=sum(a), sumna=sum(c(a,NA)),
  sumnona=sum(c(a,NA), na.rm = TRUE),
  mean=mean(a), sd=sd(a), max=max(1,2,a))

##       sum     sumna   sumnona      mean        sd       max
## 16.300000        NA 16.300000  2.716667  2.993604  6.000000

Data Types: numeric vectors

Default type for numbers

class(c(1, 2.3))

## [1] "numeric"

c(is.integer(1), is.numeric(1))

## [1] FALSE TRUE

c(seq(from = 1, to = 5, by = 2), rep(c(6,7), times = c(2,3)))

## [1] 1 3 5 6 6 7 7 7


Data Types: integer vectors

as.integer(c(1,2.3,"4.5","bla"))

## Warning: NAs introduced by coercion

## [1] 1 2 4 NA

as.integer(c(TRUE,FALSE))

## [1] 1 0


Factors

Used to encode a vector as a factor ('category'/'enumerated type')

f <- factor(c(1,1,2,2,3,3,2,1), levels=c(1,2,3), labels=c("a", "b", "c"))
f

## [1] a a b b c c b a
## Levels: a b c

table(f)

## f
## a b c
## 3 3 2

Factors

as.character(f)

## [1] "a" "a" "b" "b" "c" "c" "b" "a"

as.numeric(f)

## [1] 1 1 2 2 3 3 2 1
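Because as.numeric() on a factor returns the internal level codes, factors whose labels look like numbers need an extra step. A small sketch with made-up data:

```r
# A factor whose labels look like numbers
sizes <- factor(c("10", "20", "10", "50"))

codes  <- as.numeric(sizes)                 # internal level codes, not the labels
values <- as.numeric(as.character(sizes))   # the labels converted to numbers
same   <- as.numeric(levels(sizes))[sizes]  # equivalent, converts each level only once
```

Here codes holds positions in the level set, while values and same hold the numbers that the labels spell out.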


Other Vectorial Data Types

Complex

Logical: 1 < 2, TRUE, T, FALSE, F

Character:

as.character(1.2)

## [1] "1.2"

fizz <- paste0("fi", paste(rep("z", 2), collapse = ""))
paste(fizz, "buzz", 1:3, sep="_", collapse = " | ")

## [1] "fizz_buzz_1 | fizz_buzz_2 | fizz_buzz_3"

Matrices

Multiple vector columns of the same type and the same length

m <- matrix(1:10, nrow=5, ncol=2, byrow = TRUE)
m[1,] # 1st row

## [1] 1 2

m[,2] # 2nd column

## [1] 2 4 6 8 10

m[1,2] # 1st row, 2nd column

## [1] 2

Matrices

m * 1:5

##      [,1] [,2]
## [1,]    1    2
## [2,]    6    8
## [3,]   15   18
## [4,]   28   32
## [5,]   45   50

t(m) %*% m

##      [,1] [,2]
## [1,]  165  190
## [2,]  190  220

Matrices

diag(1, nrow = 2, ncol = 2)

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

sum(c(rowSums(m), colSums(m))) == sum(2*m)

## [1] TRUE

apply(m, MARGIN = 1, function(x) { sum(x) }) == rowSums(m)

## [1] TRUE TRUE TRUE TRUE TRUE

Matrices

head(m, n = 3)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

summary(m)

##        V1          V2
##  Min.   :1   Min.   : 2
##  1st Qu.:3   1st Qu.: 4
##  Median :5   Median : 6
##  Mean   :5   Mean   : 6
##  3rd Qu.:7   3rd Qu.: 8
##  Max.   :9   Max.   :10

Arrays

One, two or more dimensions

a <- array(data = t(1:24), dim = c(2,3,4))
a[1,,]

##      [,1] [,2] [,3] [,4]
## [1,]    1    7   13   19
## [2,]    3    9   15   21
## [3,]    5   11   17   23

a[1,1,1]

## [1] 1

Data frames

A data frame combines columns with the same length and different data types

d <- data.frame(number=1:2, bool=c(TRUE, FALSE), string=c("y", "z"))
d$number

## [1] 1 2

d[1,c(2,3)]

##   bool string
## 1 TRUE      y

Data frames

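d[,"string"] # a factor in R < 4.0.0; since R 4.0.0 stringsAsFactors defaults to FALSE and this is a character vector

## [1] y z
## Levels: y z

data.frame(string=c("y", "z"), stringsAsFactors = FALSE)[,1]

## [1] "y" "z"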

dplyr

Lots of operators for manipulating local and database data (SQLite, MySQL and PostgreSQL).

Basic verbs:

select

filter

arrange (= sort)

mutate

summarise

Other goodies:

piping (chaining)

database access as lazy as possible

BigQuery support (Google)

dplyr

library(dplyr)
cars <- mutate(mtcars, hp_mpg = hp/mpg)
cars %>%
  group_by(cyl) %>%
  summarise(mean(disp), mean(hp), mean(mpg), mean(hp_mpg))

## Source: local data frame [3 x 5]
##
##     cyl mean(disp)  mean(hp) mean(mpg) mean(hp_mpg)
##   (dbl)      (dbl)     (dbl)     (dbl)        (dbl)
## 1     4   105.1364  82.63636  26.66364     3.244667
## 2     6   183.3143 122.28571  19.74286     6.231013
## 3     8   353.1000 209.21429  15.10000    14.419146
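For comparison, the same per-cylinder means can be computed without dplyr. This is a sketch in base R only, using transform() and aggregate() with a formula; it is not part of the original slides, and the name means is illustrative.

```r
# Add the derived column, then average every column per value of cyl
cars <- transform(mtcars, hp_mpg = hp / mpg)
means <- aggregate(cbind(disp, hp, mpg, hp_mpg) ~ cyl, data = cars, FUN = mean)
means
```

The dplyr pipeline reads top to bottom, while the base R version packs grouping and summarising into a single call.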

dplyr

## options("samuelb@obisdb-stage.vliz.be" = "<your password here>")
pwd <- getOption("samuelb@obisdb-stage.vliz.be")
src <- src_postgres(dbname = "obis", host = "obisdb-stage.vliz.be", port = "5432",
                    user="samuelb", password = pwd, options="-c search_path=obis")
tbl(src, "positions") %>%
  # filter first so longitude and latitude are still available; `&` is the vectorised AND
  filter(longitude == 0 & latitude == 0) %>%
  select(id, bottomdepth) %>%
  collect()

## Source: local data frame [1 x 2]
##
##        id bottomdepth
##     (int)       (int)
## 1 8667455        4935

Lists

Ordered collection of objects

l <- list(name="Samuel", age=33, workdays=c("Mon","Tues","Wed", "Thurs", "Fri"))
l

## $name
## [1] "Samuel"
##
## $age
## [1] 33
##
## $workdays
## [1] "Mon"   "Tues"  "Wed"   "Thurs" "Fri"

Lists

l$name

## [1] "Samuel"

l[["age"]]

## [1] 33

l[[3]]

## [1] "Mon" "Tues" "Wed" "Thurs" "Fri"
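Lists can also be modified in place, and single versus double brackets behave differently. A short sketch with hypothetical data (the person list is invented for this example):

```r
person <- list(name = "Ada", age = 36, languages = c("R", "C"))

person$age <- person$age + 1     # modify an element
person[["city"]] <- "Ghent"      # add an element
person$languages <- NULL         # remove an element

sub_list <- person["name"]       # single bracket: a list of length 1
value    <- person[["name"]]     # double bracket: the element itself
elem_lengths <- sapply(person, length)  # apply a function to every element
```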


Functions

sumf <- function(x, na.rm = FALSE) {
  x <- ifelse(na.rm, na.omit(x), x)
  Reduce("+", x)
}
sumf(1:3) # 1, not 6: ifelse() returns a result as long as its test (length 1), so only x[1] survives

## [1] 1

sumf

## function(x, na.rm = FALSE) {
##   x <- ifelse(na.rm, na.omit(x), x)
##   Reduce("+", x)
## }

Functions

sumf <- function(x, na.rm = FALSE) {
  if(na.rm) {
    x <- na.omit(x)
  }
  sum <- 0
  for (element in x) {
    sum <- sum + element
  }
  sum
}
sumf(c(1, 2, NA))

## [1] NA

sumf(c(1, 2, NA), na.rm = TRUE)

## [1] 3
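The Reduce-based idea from the first attempt also works once the scalar condition is handled with if instead of ifelse. A possible variant, with sum_reduce as a hypothetical name:

```r
sum_reduce <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]   # drop missing values before folding
  }
  Reduce(`+`, x, 0)     # 0 is the initial value, so an empty input yields 0
}

sum_reduce(1:3)
sum_reduce(c(1, 2, NA))
sum_reduce(c(1, 2, NA), na.rm = TRUE)
```

In practice the built-in sum(x, na.rm = TRUE) does the same job and is much faster; the sketch only illustrates the control flow.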

Errors

try({
  stop("Not supported")
}, silent = TRUE)
tryCatch(expr = {
  qwerty + 1
}, error = function (e) str(e),
   finally = print("Finally"))

## List of 2
##  $ message: chr "object 'qwerty' not found"
##  $ call   : language doTryCatch(return(expr), name, parentenv, handler)
##  - attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
## [1] "Finally"
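A common use of tryCatch is to return a fallback value instead of stopping. The sketch below handles warnings and errors separately; safe_log is a hypothetical helper, not something from the slides.

```r
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {   # e.g. log(-1) warns that NaNs were produced
      NA_real_
    },
    error = function(e) {     # e.g. log("a") fails for a non-numeric argument
      message("could not compute log: ", conditionMessage(e))
      NA_real_
    }
  )
}

safe_log(1)
safe_log(-1)
safe_log("a")
```

The value of the handler that runs becomes the value of the whole tryCatch call.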

Files

f <- file(filename, open = "r")
on.exit(close(f))

readLines writeLines cat sink scan parse

url gzfile

read.table read.csv read.csv2

write.table write.csv write.csv2
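To make the list of file functions more concrete, here is a small round trip through a temporary CSV file. The data frame and variable names are invented for this sketch.

```r
path <- tempfile(fileext = ".csv")
input <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

write.csv(input, path, row.names = FALSE)            # write the data frame
first_lines <- readLines(path, n = 2)                # peek at the raw text
output <- read.csv(path, stringsAsFactors = FALSE)   # read it back

con <- file(path, open = "r")                        # explicit connection
header <- readLines(con, n = 1)
close(con)

unlink(path)                                         # clean up the temporary file
```

Inside a function, on.exit(close(con)) would be the safer way to make sure the connection is closed.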

Short example

aphia_ids <- c()
# pattern is a regular expression, not a glob
for (file in list.files("demo", pattern="[.]txt$", full.names=TRUE)) {
  print(file)
  species <- read.table(file, header=TRUE, sep="\t", quote="", fill=TRUE)
  exact_match <- species[species$Match.type == "exact",]
  aphia_ids <- c(aphia_ids, exact_match$AphiaID_accepted)
}

## [1] "demo/corals_red_sea_matched.txt"
## [1] "demo/red_sea_non_coral_invertebrate_1_matched.txt"
## [1] "demo/red_sea_non_coral_invertebrate_2_matched.txt"
## [1] "demo/red_sea_shore_fish_2_matched.txt"
## [1] "demo/red_sea_shore_fish_matched.txt"

paste(na.omit(aphia_ids[1:6]), collapse = ",")

## [1] "216153,216155,216154,286927,216152,210746"

Data visualization

plot(runif(n=1000, 0, 0.5), runif(n=1000, 0, 1), pch=3, col="red",
     xlab="", ylab="", xlim=0:1, ylim=0:1)
points(runif(n=50, .5, 1), runif(n=50, 0, 1), pch=20, col="blue")

Data visualization

library(ggplot2)
mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5),
                      labels=c("3gears","4gears","5gears"))
mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8),
                     labels=c("4cyl","6cyl","8cyl"))
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
      main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
      ylab="Density")
## linear regression
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
      method="lm", formula=y~x, color=cyl,
      main="Regression of MPG on Weight",
      xlab="Weight", ylab="Miles per Gallon")

Data visualization

# since ggplot2 2.0.0 the movies data set lives in the ggplot2movies package
ggplot(movies, aes(x=rating)) +
  geom_histogram(binwidth = 0.1, aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "green", high = "red")

Objects

Recommended reading: http://adv-r.had.co.nz/OO-essentials.html

S3: generic function OO, very casual system e.g. drawRect(canvas,"blue")

S4: similar to S3 but more rigid, has multiple dispatch

Reference classes: message-passing OO (like Java, C++, etc), objects are mutable

Base classes: defined in C
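A minimal sketch of the S3 system described above. The rect class, the area generic and the constructor new_rect are made up for illustration and do not come from any package.

```r
# Constructor: an S3 object is just a value with a class attribute
new_rect <- function(width, height) {
  structure(list(width = width, height = height), class = c("rect", "shape"))
}

# Generic function: dispatches on the class of its first argument
area <- function(shape, ...) UseMethod("area")
area.rect <- function(shape, ...) shape$width * shape$height
area.default <- function(shape, ...) stop("don't know how to compute the area")

# Methods for existing generics such as print() are defined the same way
print.rect <- function(x, ...) {
  cat("<rect ", x$width, " x ", x$height, ">\n", sep = "")
  invisible(x)
}

r <- new_rect(2, 3)
area(r)
print(r)
```

Nothing enforces that a "rect" actually has width and height fields, which is the casual side of S3; S4 and reference classes add formal definitions.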


Debugging

RStudio setting: Debug -> On Error -> Break in code

DEMO

recover()

traceback()

Packages

install.packages("caret") ## installs caret and its dependencies
devtools::install_github("rstudio/packrat") # install from GitHub
library(caret) # load the library and import all functions
if(!require(raster)) {
  print("raster package could not be loaded")
}
stats::aggregate ## calling a function without importing the full package
dplyr::select ## or handle naming conflicts

Packrat

Per project private package libraries

install.packages("packrat")
packrat::init(project = ".")
install.packages("survival")
packrat::snapshot()

packrat::init()
packrat::snapshot()
packrat::restore()
packrat::clean()
packrat::bundle()
packrat::unbundle()
packrat::on()
packrat::off()

Package development

devtools + roxygen2 + testthat

Advantages:

testing

documentation

versioning

distribution

Disadvantage:

more work

Get started with the book http://r-pkgs.had.co.nz/ by Hadley Wickham

Testing

library(testthat)
test_that("list_datasets result is same as datasets.csv", {
  skip_on_cran()
  original <- read.csv2(data_raw_file("datasets.csv"), stringsAsFactors = FALSE)
  df <- list_datasets()
  expect_equal(nrow(df),nrow(original))
  expect_equal(df, original)
})

Performance

Some resources:

compiler package (byte-code compiler)

parallel package

http://www.noamross.net/blog/2013/4/25/faster-talk.html

http://adv-r.had.co.nz/Performance.html

http://adv-r.had.co.nz/Profiling.html

http://adv-r.had.co.nz/memory.html

http://adv-r.had.co.nz/Rcpp.html

https://cran.r-project.org/web/views/HighPerformanceComputing.html
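As a rough illustration of the first resource and of the usual advice to vectorise, the sketch below times a loop, its byte-compiled version and a vectorised equivalent. The function names are hypothetical and the timings depend entirely on the machine, so no numbers are shown.

```r
library(compiler)   # ships with R

# Loop version, as one might write it in a C-like language
sum_squares_loop <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i^2
  }
  total
}

# Vectorised version: the loop runs in compiled code inside sum()
sum_squares_vec <- function(n) sum(seq_len(n)^2)

# Explicit byte-compilation (recent R versions already compile just in time)
sum_squares_cmp <- cmpfun(sum_squares_loop)

n <- 1e5
timings <- c(
  loop       = system.time(sum_squares_loop(n))[["elapsed"]],
  compiled   = system.time(sum_squares_cmp(n))[["elapsed"]],
  vectorised = system.time(sum_squares_vec(n))[["elapsed"]]
)
timings
```

For anything beyond a toy comparison, the profiling chapter linked above is a better guide than ad hoc system.time calls.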


Parallel

library(parallel)
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores)
on.exit(stopCluster(cl))
clusterExport(cl, "species")
clusterExport(cl, "background")
results <- parLapply(cl, seq(0.1, 0.9, 0.1), function(beta) {
  source("sdmExperiment.R")
  kresults <- lapply(1:10, function(k) {
    data <- species[[paste0("beta",beta)]][[paste0("k",k)]]
    cbind(beta, k, t(build_sdm_rcew(data, background)))
  })
  # rbind_all() is deprecated in later dplyr releases in favour of bind_rows()
  dplyr::rbind_all(kresults)
})
results <- dplyr::rbind_all(results) ## combine list of data.frames

Web

Shiny: http://shiny.rstudio.com/

- interactive web pages

- no need for javascript (at least not for simple things)

- reactive programming

- typically ui.R and a server.R

- example: http://shiny.rstudio.com/gallery/movie-explorer.html

- DEMO

OpenCPU: https://www.opencpu.org/

- HTTP API for data analysis in R

RServe: https://rforge.net/Rserve/doc.html

- Binary R server

Machine learning

https://cran.r-project.org/web/views/MachineLearning.html

caret, rattle

specific libraries for the different machine learning algorithms


Machine Learning example

library(e1071)
train_idx <- sample(1:nrow(mtcars), nrow(mtcars)/2)
train <- mtcars[train_idx,]
test <- mtcars[-train_idx,]
model <- svm(hp ~ mpg + cyl + gear, data = train)
train_results <- predict(model, train)
test_results <- predict(model, test)
rmse <- function(error) {
  sqrt(mean(error^2))
}

Machine learning example

print(paste("training rmse",rmse(train_results - train$hp)))

## [1] "training rmse 22.1904078087206"

print(paste("test rmse",rmse(test_results - test$hp)))

## [1] "test rmse 38.6500402049542"

# label each row by its split; a bare c("train","test") would be recycled alternately
plot_data <- data.frame(hp=c(train$hp,test$hp),
                        predicted=c(train_results,test_results),
                        split=rep(c("train","test"), times=c(nrow(train),nrow(test))))

Machine learning example

ggplot(plot_data, aes(hp, predicted)) +
  geom_point(aes(colour = factor(split), shape = factor(split)))

Questions

ggplot(data.frame(a="?", x=0, y=0), aes(x=x, y=y, label=a)) + geom_text(size=100)
