Introduction to R

Introduction To R Samuel Bosch ( [email protected])


Page 1: Introduction to R


Page 2: Introduction to R

What is R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language.

Created in 1993, license: GNU GPL, current version 3.2.3

Interpreted

C-like syntax

Functional programming language semantics (Lisp, APL)

Object oriented (3 different OO systems)

Garbage collector

Mostly call-by-value

Lexical scope

Function closure
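The lexical scope and closure items above can be illustrated with a small counter factory. This is a minimal sketch; `make_counter` and the variable names are made up for this example and do not come from the slides.

```r
# Hypothetical example: a function that returns a closure.
make_counter <- function() {
  count <- 0                 # lives in the environment of this call
  function() {
    count <<- count + 1      # <<- modifies 'count' in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()  # independent state

counter_a()
counter_a()
counter_b()
```

Each call to `make_counter()` creates a new environment, so the two counters do not share state.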


Page 3: Introduction to R

Popularity

TIOBE: 18

Github: 12

Stackoverflow: 117341 questions (Java: 978006, Python: 507653)

Most popular tool for statistical data analysis


Page 4: Introduction to R

Usage

CRAN Task Views: https://cran.r-project.org/web/views/

Statistics (frequentist and bayesian)

Machine learning and data mining

Science (mathematics, chemistry, physics, medical, ecology, genetics, economy, history, …)

Finance

Natural Language Processing

Data visualization

Analyzing spatial, spatio-temporal data and time series


Page 5: Introduction to R

R Markdown

This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Page 6: Introduction to R

Competitors/colleagues

SAS, SPSS, STATA, Mathematica and other statistical software

Python + Numpy + Pandas + matplotlib + …

Matlab/Octave

Julia

K/J and other APL-like languages

Java (Weka), Clojure, .NET (F#), …


Page 7: Introduction to R

Calling R

command line

SAS, SPSS, Stata, Statistica, JMP

Java, C++, F#

Python, Perl, Ruby, Julia

PostgreSQL: PL/R


Page 8: Introduction to R

Ecosystem

IDE: RStudio or one of the alternatives (plugins for Eclipse, Visual Studio, Atom, Sublime Text, Vim, …)

Packages: CRAN (6700+ packages), Bioconductor, RForge, GitHub

Learning more and getting help:

Built-in documentation (?, help(), F1) and package vignettes

Official manuals: https://cran.r-project.org/manuals.html

Short reference card: https://cran.r-project.org/doc/contrib/Short-refcard.pdf

(Free) books: Advanced R and R packages by Hadley Wickham

Courses on edX and Coursera

Stack Overflow and Cross Validated (for statistical questions)

[email protected]


Page 9: Introduction to R

View help

?Filter
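`?Filter` opens the help page that documents R's higher-order functions (`Filter`, `Map`, `Reduce` and relatives). A short sketch of what they do; the helper `is_even` and the inputs are assumptions made up for illustration.

```r
# Filter keeps the elements for which the predicate returns TRUE.
is_even <- function(n) n %% 2 == 0        # hypothetical helper
evens <- Filter(is_even, 1:10)

# Map applies a function element-wise and returns a list.
squares <- Map(function(n) n^2, 1:3)

# Reduce folds a vector into a single value.
total <- Reduce(`+`, 1:10)
```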


Page 10: Introduction to R

Operators

+  -  *  /   ^ or ** for exponentiation   %% modulus   %/% integer division

<   <=   >   >=   ==   !=

!x   isTRUE(x)   xor(x, y)

# element-wise OR and AND
c(FALSE, FALSE) | c(TRUE, FALSE) & c(TRUE, FALSE)

## [1]  TRUE FALSE

# scalar OR and AND: || and && expect length-one operands
# (R 3.x silently used only the first element; R >= 4.3 raises an error for longer vectors)
c(FALSE, FALSE)[1] || c(TRUE, FALSE)[1] && c(TRUE, TRUE)[1]

## [1] TRUE
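The arithmetic operators above can be checked with a few small values. This is a minimal sketch with arbitrarily chosen numbers; it shows how `%/%` and `%%` relate, and how a logical vector can be collapsed to a single value.

```r
# Integer division and modulus satisfy: x == (x %/% y) * y + (x %% y)
x <- 17
y <- 5
quotient  <- x %/% y           # 3
remainder <- x %% y            # 2

# The result of %% takes the sign of the divisor.
neg_remainder <- (-17) %% 5    # 3
neg_quotient  <- (-17) %/% 5   # -4

# Comparisons are vectorised; any() and all() collapse the result to one value.
flags <- c(1, 5, 10) > 3       # FALSE TRUE TRUE
any(flags)
all(flags)
```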

Page 11: Introduction to R

Vectors

Ordered sequence of elements of the same type

a <- c(1,2,5.3,6,-2,4) # numeric vector
a[c(2,4)] # 2nd and 4th element

## [1] 2 6

names(a) <- c("c","d","e","f","g","h")
a

##    c    d    e    f    g    h
##  1.0  2.0  5.3  6.0 -2.0  4.0

Page 12: Introduction to R

Vectors

a[a > 3] ## a[c(F, F, T, T, F, T)]

##   e   f   h
## 5.3 6.0 4.0

a[3:5]

##    e    f    g
##  5.3  6.0 -2.0

a[-1]

##    d    e    f    g    h
##  2.0  5.3  6.0 -2.0  4.0

Page 13: Introduction to R

Vectors

a[c("c","d","e")]

##   c   d   e
## 1.0 2.0 5.3

a[a %in% c(1,2)]

## c d
## 1 2

is.null(c()) & is.null(NULL)

## [1] TRUE

Page 14: Introduction to R

Vectors

c(1,2,3)[c(TRUE,FALSE,NA)]

## [1]  1 NA

c(sum=sum(a), sumna=sum(c(a,NA)), sumnona=sum(c(a,NA), na.rm = TRUE),
  mean=mean(a), sd=sd(a), max=max(1,2,a))

##       sum     sumna   sumnona      mean        sd       max
## 16.300000        NA 16.300000  2.716667  2.993604  6.000000

Page 15: Introduction to R

Data Types: numeric vectors

Default type for numbers

class(c(1, 2.3))

## [1] "numeric"

c(is.integer(1), is.numeric(1))

## [1] FALSE  TRUE

c(seq(from = 1, to = 5, by = 2), rep(c(6,7), times = c(2,3)))

## [1] 1 3 5 6 6 7 7 7


Page 16: Introduction to R

Data Types: integer vectors

as.integer(c(1,2.3,"4.5","bla"))

## Warning: NAs introduced by coercion

## [1]  1  2  4 NA

as.integer(c(TRUE,FALSE))

## [1] 1 0


Page 17: Introduction to R

Factors

A factor encodes a vector as categorical data ('category'/'enumerated type')

f <- factor(c(1,1,2,2,3,3,2,1), levels=c(1,2,3), labels=c("a", "b", "c"))
f

## [1] a a b b c c b a
## Levels: a b c

table(f)

## f
## a b c
## 3 3 2

Page 18: Introduction to R

Factors

as.character(f)

## [1] "a" "a" "b" "b" "c" "c" "b" "a"

as.numeric(f)

## [1] 1 1 2 2 3 3 2 1
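As the output above shows, `as.numeric()` on a factor returns the internal level codes, not the labels. That matters when the labels themselves look like numbers. A minimal sketch with made-up data:

```r
# Factor whose labels look like numbers (made-up data).
sizes <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(sizes)                             # internal level codes
values <- as.numeric(as.character(sizes))               # the labels as numbers
values_via_levels <- as.numeric(levels(sizes))[sizes]   # same result, via the levels
```

Here `codes` holds the positions in `levels(sizes)`, while `values` recovers the original numbers.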


Page 19: Introduction to R

Other Vectorial Data Types

Complex

Logical: 1 < 2, TRUE, T, FALSE, F

Character:

as.character(1.2)

## [1] "1.2"

fizz <- paste0("fi", paste(rep("z", 2), collapse = ""))
paste(fizz, "buzz", 1:3, sep="_", collapse = " | ")

## [1] "fizz_buzz_1 | fizz_buzz_2 | fizz_buzz_3"

Page 20: Introduction to R

Matrices

Multiple vector columns of the same type and the same length

m <- matrix(1:10, nrow=5, ncol=2, byrow = TRUE)
m[1,] # 1st row

## [1] 1 2

m[,2] # 2nd column

## [1]  2  4  6  8 10

m[1,2] # 1st row, 2nd column

## [1] 2

Page 21: Introduction to R

Matrices

m * 1:5 # element-wise product; 1:5 is recycled down each column

##      [,1] [,2]
## [1,]    1    2
## [2,]    6    8
## [3,]   15   18
## [4,]   28   32
## [5,]   45   50

t(m) %*% m # matrix product

##      [,1] [,2]
## [1,]  165  190
## [2,]  190  220

Page 22: Introduction to R

Matrices

diag(1, nrow = 2, ncol = 2)

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

sum(c(rowSums(m), colSums(m))) == sum(2*m)

## [1] TRUE

apply(m, MARGIN = 1, function(x) { sum(x) }) == rowSums(m)

## [1] TRUE TRUE TRUE TRUE TRUE

Page 23: Introduction to R

Matrices

head(m, n = 3)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

summary(m)

##        V1          V2
##  Min.   :1   Min.   : 2
##  1st Qu.:3   1st Qu.: 4
##  Median :5   Median : 6
##  Mean   :5   Mean   : 6
##  3rd Qu.:7   3rd Qu.: 8
##  Max.   :9   Max.   :10

Page 24: Introduction to R

Arrays

One, two or more dimensions

a <- array(data = t(1:24), dim = c(2,3,4))
a[1,,]

##      [,1] [,2] [,3] [,4]
## [1,]    1    7   13   19
## [2,]    3    9   15   21
## [3,]    5   11   17   23

a[1,1,1]

## [1] 1
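Arrays are filled in column-major order, which explains the indexing results above. The sketch below rebuilds the same 2 x 3 x 4 array and summarises it along the third dimension; the variable name `slice_sums` is an assumption of this example.

```r
a <- array(1:24, dim = c(2, 3, 4))   # same values as on the slide

dim(a[1, , ])                         # fixing the first index leaves a 3 x 4 matrix
slice_sums <- apply(a, MARGIN = 3, FUN = sum)   # one sum per 2 x 3 slice
a[2, 3, 4]                            # last element in column-major order
```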


Page 25: Introduction to R

Data frames

A data frame combines columns with the same length and different data types

d <- data.frame(number=1:2, bool=c(TRUE, FALSE), string=c("y", "z"))
d$number

## [1] 1 2

d[1,c(2,3)]

##   bool string
## 1 TRUE      y

Page 26: Introduction to R

Data frames

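d[,"string"] # a factor in R < 4.0.0; since R 4.0.0 stringsAsFactors defaults to FALSE and this is a character vector

## [1] y z
## Levels: y z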

data.frame(string=c("y", "z"), stringsAsFactors = FALSE)[,1]

## [1] "y" "z"
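A few more everyday data frame operations, as a minimal base-R sketch. The data frame mirrors the one on the slides; the added column `double` and the helper variables are assumptions of this example.

```r
d <- data.frame(number = 1:2, bool = c(TRUE, FALSE), string = c("y", "z"),
                stringsAsFactors = FALSE)

d$double <- d$number * 2            # add a column
true_rows <- d[d$bool, ]            # keep rows where bool is TRUE
column_classes <- sapply(d, class)  # one class per column
str(d)                              # compact overview of the structure
```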


Page 27: Introduction to R

dplyr

Lots of operators for manipulating local and database data (SQLite, MySQL and PostgreSQL).

Basic verbs:

select

filter

arrange (= sort)

mutate

summarise

Other goodies:

piping (chaining)

database access as lazy as possible

BigQuery support (Google)

Page 28: Introduction to R

dplyr

library(dplyr)
cars <- mutate(mtcars, hp_mpg = hp/mpg)
cars %>%
  group_by(cyl) %>%
  summarise(mean(disp), mean(hp), mean(mpg), mean(hp_mpg))

## Source: local data frame [3 x 5]
##
##     cyl mean(disp)  mean(hp) mean(mpg) mean(hp_mpg)
##   (dbl)      (dbl)     (dbl)     (dbl)        (dbl)
## 1     4   105.1364  82.63636  26.66364     3.244667
## 2     6   183.3143 122.28571  19.74286     6.231013
## 3     8   353.1000 209.21429  15.10000    14.419146
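For comparison, the same grouped means can be computed without dplyr. This is a base-R sketch using `transform()` and `aggregate()`; it is an alternative formulation and not part of the original slides.

```r
# Base-R sketch of the grouped means; no packages required.
cars <- transform(mtcars, hp_mpg = hp / mpg)   # add the derived column
by_cyl <- aggregate(cbind(disp, hp, mpg, hp_mpg) ~ cyl, data = cars, FUN = mean)
by_cyl
```

The result has one row per value of `cyl`, with plain column names instead of `mean(disp)` and so on.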


Page 29: Introduction to R

dplyr

## options("samuelb@obisdb-stage.vliz.be" = "<your password here>")
pwd <- getOption("samuelb@obisdb-stage.vliz.be")
src <- src_postgres(dbname = "obis", host = "obisdb-stage.vliz.be",
                    port = "5432", user="samuelb",
                    password = pwd, options="-c search_path=obis")
tbl(src, "positions") %>%
  filter(longitude == 0 & latitude == 0) %>% # filter first: select() drops longitude and latitude
  select(id, bottomdepth) %>%
  collect()

## Source: local data frame [1 x 2]
##
##        id bottomdepth
##     (int)       (int)
## 1 8667455        4935

Page 30: Introduction to R

Lists

Ordered collection of objects

l <- list(name="Samuel", age=33,
          workdays=c("Mon","Tues","Wed", "Thurs", "Fri"))
l

## $name
## [1] "Samuel"
##
## $age
## [1] 33
##
## $workdays
## [1] "Mon"   "Tues"  "Wed"   "Thurs" "Fri"

Page 31: Introduction to R

Lists

l$name

## [1] "Samuel"

l[["age"]]

## [1] 33

l[[3]]

## [1] "Mon"   "Tues"  "Wed"   "Thurs" "Fri"
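The difference between single and double brackets is a common source of confusion with lists. A minimal sketch; the added element `language` is made up for this example.

```r
l <- list(name = "Samuel", age = 33)

sub_list <- l["age"]       # single bracket: a list of length one
value    <- l[["age"]]     # double bracket: the element itself

l$language <- "R"          # add an element (made-up value)
l$age      <- NULL         # assigning NULL removes an element
names(l)
```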


Page 32: Introduction to R

Functions

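# pitfall (fixed on the next slide): ifelse() is vectorised over its test,
# so a length-one test returns only the first element of x
sumf <- function(x, na.rm = FALSE) {
  x <- ifelse(na.rm, na.omit(x), x)
  Reduce("+", x)
}
sumf(1:3)

## [1] 1

sumf

## function(x, na.rm = FALSE) {
##   x <- ifelse(na.rm, na.omit(x), x)
##   Reduce("+", x)
## }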

Page 33: Introduction to R

Functions

sumf <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- na.omit(x)
  }
  sum <- 0
  for (element in x) {
    sum <- sum + element
  }
  sum
}
sumf(c(1, 2, NA))

## [1] NA

sumf(c(1, 2, NA), na.rm = TRUE)

## [1] 3

Page 34: Introduction to R

Errors

try({ stop("Not supported") }, silent = TRUE)

tryCatch(expr = { qwerty + 1 },
         error = function (e) str(e),
         finally = print("Finally"))

## List of 2
##  $ message: chr "object 'qwerty' not found"
##  $ call   : language doTryCatch(return(expr), name, parentenv, handler)
##  - attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
## [1] "Finally"
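`tryCatch()` can also return a fallback value, and it handles warnings as well as errors. A minimal sketch; `safe_as_number` is a hypothetical helper written for this example.

```r
# Hypothetical helper (not from the slides): parse a number, falling back to NA.
safe_as_number <- function(text) {
  tryCatch(
    as.numeric(text),                      # emits a warning for non-numeric text
    warning = function(w) {
      message("could not parse '", text, "': ", conditionMessage(w))
      NA_real_                             # value returned instead of the warning
    },
    error = function(e) NA_real_
  )
}

safe_as_number("4.5")
safe_as_number("bla")
```

The value of the handler becomes the value of the whole `tryCatch()` call, so the caller never sees the warning itself.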


Page 35: Introduction to R

Files

f <- file(filename, open = "r")
on.exit(close(f)) # inside a function: close the connection when the function exits

readLines writeLines cat sink scan parse

url gzfile

read.table read.csv read.csv2 write.table write.csv write.csv2
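A round trip through a file ties several of these functions together. This is a minimal sketch that writes to a temporary file; the data and variable names are assumptions made up for illustration.

```r
# Sketch: write a small data frame to a temporary CSV file and read it back.
path <- tempfile(fileext = ".csv")          # throwaway file name
original <- data.frame(number = 1:2, string = c("y", "z"),
                       stringsAsFactors = FALSE)

write.csv(original, path, row.names = FALSE)
lines <- readLines(path)                     # raw text: header plus two rows
restored <- read.csv(path, stringsAsFactors = FALSE)

unlink(path)                                 # remove the temporary file
```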

Page 36: Introduction to R

Short example

aphia_ids <- c()
# pattern is a regular expression, not a glob
for (file in list.files("demo", pattern="[.]txt$", full.names=TRUE)) {
  print(file)
  species <- read.table(file, header=TRUE, sep="\t", quote="", fill=TRUE)
  exact_match <- species[species$Match.type == "exact",]
  aphia_ids <- c(aphia_ids, exact_match$AphiaID_accepted)
}

## [1] "demo/corals_red_sea_matched.txt"
## [1] "demo/red_sea_non_coral_invertebrate_1_matched.txt"
## [1] "demo/red_sea_non_coral_invertebrate_2_matched.txt"
## [1] "demo/red_sea_shore_fish_2_matched.txt"
## [1] "demo/red_sea_shore_fish_matched.txt"

paste(na.omit(aphia_ids[1:6]), collapse = ",")

## [1] "216153,216155,216154,286927,216152,210746"

Page 37: Introduction to R

Data visualization

plot(runif(n=1000, 0, 0.5), runif(n=1000, 0, 1), pch=3, col="red",
     xlab="", ylab="", xlim=0:1, ylim=0:1)
points(runif(n=50, .5, 1), runif(n=50, 0, 1), pch=20, col="blue")

Page 38: Introduction to R

Data visualization

library(ggplot2)

mtcars$gear <- factor(mtcars$gear, levels=c(3,4,5),
    labels=c("3gears","4gears","5gears"))
mtcars$cyl <- factor(mtcars$cyl, levels=c(4,6,8),
    labels=c("4cyl","6cyl","8cyl"))

qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
    main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
    ylab="Density")

## linear regression
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
    method="lm", formula=y~x, color=cyl,
    main="Regression of MPG on Weight",
    xlab="Weight", ylab="Miles per Gallon")

Page 39: Introduction to R

Data visualization


Page 40: Introduction to R

Data visualization


Page 41: Introduction to R

Data visualization

# the movies data set lives in the ggplot2movies package since ggplot2 2.0.0
library(ggplot2movies)
ggplot(movies, aes(x=rating)) +
  geom_histogram(binwidth = 0.1, aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "green", high = "red")

Page 42: Introduction to R

Objects

Recommended reading: http://adv-r.had.co.nz/OO-essentials.html

S3: generic function OO, very casual system e.g. drawRect(canvas,"blue")

S4: similar to S3 but more rigid, has multiple dispatch

Reference classes: message-passing OO (like Java, C++, etc.), objects are mutable

Base classes: defined in C
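The S3 system in the list above is the one most often met in practice. The sketch below shows a constructor, a generic and two methods; the class `rect` and the generic `area` are made up for this example.

```r
# Constructor: an S3 object is a base object plus a class attribute.
new_rect <- function(width, height) {
  structure(list(width = width, height = height), class = "rect")
}

# Generic function: dispatches on the class of its first argument.
area <- function(shape, ...) UseMethod("area")

# Method for class "rect".
area.rect <- function(shape, ...) shape$width * shape$height

# Fallback for classes without their own method.
area.default <- function(shape, ...) stop("area() is not defined for this class")

# Method for an existing generic.
print.rect <- function(x, ...) {
  cat("<rect ", x$width, " x ", x$height, ">\n", sep = "")
  invisible(x)
}

r <- new_rect(2, 3)
area(r)
print(r)
```

Methods are found purely by naming convention (`generic.class`), which is why the slides call S3 a very casual system.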


Page 43: Introduction to R

Debugging

RStudio setting: Debug -> On Error -> Break in code

DEMO

recover()

traceback()

Page 44: Introduction to R

Packages

    install.packages("caret") ## installs caret and its dependencies
    devtools::install_github("rstudio/packrat") # install from GitHub

    library(caret) # load the library and import all functions
    if(!require(raster)) {
      print("raster package could not be loaded")
    }
    dplyr::summarise ## calling a function without importing the full package
    plyr::summarise ##  or handle naming conflicts

Page 45: Introduction to R

Packrat

Per project private package libraries

    install.packages("packrat")
    packrat::init(project = ".")
    install.packages("survival")
    packrat::snapshot()

    packrat::init()

    packrat::snapshot()
    packrat::restore()
    packrat::clean()

    packrat::bundle()
    packrat::unbundle()

    packrat::on()
    packrat::off()

Page 46: Introduction to R

Package development

devtools + roxygen2 + testthat

Advantages:

testing

documentation

versioning

distribution

Disadvantage:

more work

Get started with the book http://r-pkgs.had.co.nz/ by Hadley Wickham

Page 47: Introduction to R

Testing

library(testthat)

test_that("list_datasets result is same as datasets.csv", {
  skip_on_cran()
  original <- read.csv2(data_raw_file("datasets.csv"), stringsAsFactors = FALSE)
  df <- list_datasets()
  expect_equal(nrow(df), nrow(original))
  expect_equal(df, original)
})

Page 48: Introduction to R

Performance

Some resources:

compiler package (byte-code compiler)

parallel package

http://www.noamross.net/blog/2013/4/25/faster-talk.html

http://adv-r.had.co.nz/Performance.html

http://adv-r.had.co.nz/Profiling.html

http://adv-r.had.co.nz/memory.html

http://adv-r.had.co.nz/Rcpp.html

https://cran.r-project.org/web/views/HighPerformanceComputing.html
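A small experiment with the compiler package from the list above: the same computation as an explicit loop, as a byte-compiled copy and as a vectorised expression. This is a sketch with made-up function names. Timings depend on the machine, and since R 3.4.0 the JIT compiler is enabled by default, so the explicit `cmpfun()` call may make little difference on current versions.

```r
library(compiler)   # ships with R

# Deliberately loop-based sum of squares.
sum_squares_loop <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i^2
  }
  total
}

sum_squares_compiled   <- cmpfun(sum_squares_loop)        # byte-compiled copy
sum_squares_vectorised <- function(n) sum(seq_len(n)^2)   # no explicit loop

n <- 1e5
system.time(sum_squares_loop(n))
system.time(sum_squares_compiled(n))
system.time(sum_squares_vectorised(n))
```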


Page 49: Introduction to R

Parallel

  library(parallel)
  # Calculate the number of cores
  no_cores <- detectCores() - 1
  # Initiate cluster
  cl <- makeCluster(no_cores)
  on.exit(stopCluster(cl)) # inside a function: stop the cluster when the function exits
  clusterExport(cl, "species")
  clusterExport(cl, "background")
  results <- parLapply(cl, seq(0.1, 0.9, 0.1), function(beta) {
    source("sdmExperiment.R")
    kresults <- lapply(1:10, function(k) {
      data <- species[[paste0("beta",beta)]][[paste0("k",k)]]
      cbind(beta, k, t(build_sdm_rcew(data, background)))
    })
    dplyr::bind_rows(kresults) # bind_rows() replaces the older rbind_all()
  })
  results <- dplyr::bind_rows(results) ## combine list of data.frames
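The code above depends on project files that are not part of the slides. A self-contained sketch of the same `makeCluster()` / `parLapply()` pattern follows; `slow_square` is a hypothetical stand-in for the expensive model fit, and the worker count of two is an arbitrary assumption.

```r
library(parallel)   # ships with R

# Hypothetical stand-in for an expensive model fit.
slow_square <- function(beta) {
  Sys.sleep(0.1)
  beta^2
}

cl <- makeCluster(2)                 # two workers; adjust to your machine
betas <- seq(0.1, 0.9, 0.1)
results <- tryCatch(
  parLapply(cl, betas, slow_square), # the function is shipped to the workers
  finally = stopCluster(cl)          # always release the workers
)
squares <- unlist(results)
```

Wrapping the call in `tryCatch(..., finally = ...)` releases the workers even when a task fails, which is the role `on.exit()` plays inside a function.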


Page 50: Introduction to R

Web

Shiny: http://shiny.rstudio.com/

- interactive web pages

- no need for JavaScript (at least not for simple things)

- reactive programming

- typically a ui.R and a server.R

- example: http://shiny.rstudio.com/gallery/movie-explorer.html

- DEMO

OpenCPU: https://www.opencpu.org/

- HTTP API for data analysis in R

RServe: https://rforge.net/Rserve/doc.html

- Binary R server

Page 51: Introduction to R

Machine learning

https://cran.r-project.org/web/views/MachineLearning.html

caret, rattle

specific libraries for the different machine learning algorithms


Page 52: Introduction to R

Machine Learning example

library(e1071)

train_idx <- sample(1:nrow(mtcars), nrow(mtcars)/2)
train <- mtcars[train_idx,]
test <- mtcars[-train_idx,]

model <- svm(hp ~ mpg + cyl + gear, data = train)
train_results <- predict(model, train)
test_results <- predict(model, test)

rmse <- function(error) {
  sqrt(mean(error^2))
}
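The same train/test workflow can be tried without installing e1071. The sketch below swaps the SVM for a plain linear regression baseline (`lm()`), which is an assumption of this example and not the model used in the slides. The seed value is arbitrary and only makes the split reproducible.

```r
set.seed(42)   # arbitrary seed so the split is reproducible

train_idx <- sample(seq_len(nrow(mtcars)), nrow(mtcars) / 2)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Linear regression baseline (an assumption of this sketch, not part of the slides).
baseline <- lm(hp ~ mpg + cyl + gear, data = train)

rmse <- function(error) {
  sqrt(mean(error^2))
}

train_rmse <- rmse(predict(baseline, train) - train$hp)
test_rmse  <- rmse(predict(baseline, test) - test$hp)
c(train = train_rmse, test = test_rmse)
```

With only 16 rows in each half, the two errors vary considerably from one random split to the next.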


Page 53: Introduction to R

Machine learning example

print(paste("training rmse",rmse(train_results - train$hp)))

## [1] "training rmse 22.1904078087206"

print(paste("test rmse",rmse(test_results - test$hp)))

## [1] "test rmse 38.6500402049542"

plot_data <- data.frame(hp=c(train$hp,test$hp),
                        predicted=c(train_results,test_results),
                        # one label per row; c("train","test") alone would be recycled alternately
                        split=rep(c("train","test"), times=c(nrow(train),nrow(test))))

Page 54: Introduction to R

Machine learning example

ggplot(plot_data, aes(hp, predicted)) +
  geom_point(aes(colour = factor(split), shape = factor(split)))

Page 55: Introduction to R

Questions

ggplot(data.frame(a="?", x=0, y=0), aes(x=x, y=y, label=a)) +
  geom_text(size=100)