1/62
Introduction to R
Adrienn Szabó
DMS Group, MTA SZTAKI
Aug 30, 2014
2/62
1 What is R?
    What is R for?
    Who is R for?
2 Basics
    Data Structures
    Control Structures
3 ExtRa stuff
    R packages
    Unit testing in R
3/62
What is R?
R is a dialect of the S language.
... but seriously ...
R is a ...
• Programming language (free, open source)
• Computing environment (like Matlab)
• Community (quite an active one)
• Ecosystem (rapid conversion from data-science knowledge to productivity)
(S was developed at Bell Labs in the 1970s as an internal statistical analysis environment)
4/62
What is R?
Some say that it's "Not Really A Programming Language" ...
... but rather R "is an interactive environment for doing statistics"*
* http://readwrite.com/2013/11/25/python-displacing-r-as-the-programming-language-for-data-science
5/62
What for?
• statistical computing (and statistical tests)
• dataset exploration
• analysis (time series analysis, classification, clustering, etc.)
• linear and nonlinear modelling
• visualization
• recently the favourite tool of data scientists
6/62
What for?
7/62
R is not only for programmers!
A couple of titles from the latest useR! conference (more than 150 talks):
• Teaching R to high school students (and teachers)
• Visualizing Lack of Fit in Complex Regression Mode
• A real time, responsive Quantitative trading analysis Mobile App using R
• eegR: an R package to analyze electrophysiological (EEG) signals (MTA TTK)
8/62
More titles – BD
Because everyone has to deal with Big Data nowadays ...
• PivotalR: A Package for Machine Learning on Big Data
• Massive Predictive Modeling (Oracle)
• Domino: A Platform-as-a-Service for Industrialized Data Analysis
• Plyrmr: a data manipulation DSL for big data
9/62
More titles – ML
Machine Learning is cool ...
• 10 R packages to win Kaggle competitions
• The Arborist: a Scalable Decision Tree Implementation
• Representing Model Ensembles in PMML
• Distributed Matrix Exponentiation in R
10/62
More titles – Networks & Twitter & maps
Who doesn't want to study Twitter or social data?
• Simulating Influenza Transmission with Real Network Data
• Spatial Tweetstistics with R: Geographical Distribution of English Loan Words in Spanish Tweets
• Opportunities through the use of Open-Street-Map data in social sciences
11/62
More titles – RR
These folks seem to care about Reproducible Research as well ...
• R and Reproducibility: a Proposal
• rctrack: An R Package that Automatically Collects and Archives Details for Reproducible Computing
• Fostering the next generation of open science with R
• Teaching data analysis in R through the lens of reproducibility (poster)
12/62
Why?
13/62
Features of R
• Free software
• Runs on almost any standard computing platform/OS
• Active development, about yearly releases + bugfixes
• Sophisticated graphics capabilities
• Useful for interactive work, but contains a powerful programming language for developing new tools (user -> programmer)
14/62
R vs. Python
• Python is more general-purpose, and it is easier to write programs in it
15/62
R vs. Python
• R has more "stats + data analytics" libraries ready-to-use
16/62
1 What is R?
    What is R for?
    Who is R for?
2 Basics
    Data Structures
    Control Structures
3 ExtRa stuff
    R packages
    Unit testing in R
17/62
Getting help
Code: built-in man pages
> ?library
library package:base R Documentation
Loading and Listing of Packages
Description:
‘library’ and ‘require’ load add-on packages.
Usage:
library(package, help, pos = 2, lib.loc = ...
18/62
Data structures
• Vector
• Matrix
• Array
• List
• Data frame
19/62
Numbers, assignment
To assign a single number, use the '<-' operator.
Note: the '=' operator works the same way in almost all cases, but its usage is not advised.
Code: numbers
> a <- 4
> b <- 5
> a + b
[1] 9
> a - b
[1] -1
> a ^ b
[1] 1024
20/62
Vector
Vectors (similar to Lists in Java) can be created with the c() function (short name for "concatenate").
Vectors can hold any kind of thing, but all items of one vector have to be of the same type.
Code: vector examples
> v1 <- c(1, 2, 3, 4, 5, 6)
> v2 <- c(0.8, 0.1)
> v1 + v2
[1] 1.8 2.1 3.8 4.1 5.8 6.1
Magic!
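The "magic" is R's recycling rule: when two vectors of different lengths are combined element-wise, the shorter one is repeated until it matches the longer one. A minimal sketch of this (the names `long`, `short`, `recycled` and `explicit` are illustrative and not from the slides):

```r
long  <- c(1, 2, 3, 4, 5, 6)
short <- c(0.8, 0.1)

# Element-wise addition recycles `short` to the length of `long`
recycled <- long + short

# The same computation with the recycling written out explicitly
explicit <- long + rep(short, times = length(long) / length(short))

print(recycled)
print(identical(recycled, explicit))
```

If the longer length is not a multiple of the shorter one, R still recycles but emits a warning.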
21/62
Vector
You can concatenate items and vectors as you please.
Code: vector examples 2
> v1 <- c(1, 2, 3)
> v2 <- c(0.8, 0.1)
> c(22, v1, -3.9, v2)
[1] 22.0  1.0  2.0  3.0 -3.9  0.8  0.1
> c(v1, "Sponge", "Bob")
[1] "1"      "2"      "3"      "Sponge" "Bob"
Warning: the elements of the resulting vector may be converted to the more general type!
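The conversion follows a fixed order: logical values become numbers, and numbers become character strings, whenever a more general type is present in the same vector. A short sketch that inspects the result with `class()` (variable names are illustrative):

```r
# logical + numeric -> numeric (TRUE becomes 1, FALSE becomes 0)
num_vec <- c(TRUE, 2.5, FALSE)

# numeric + character -> character
chr_vec <- c(1, 2, "Sponge")

print(class(num_vec))
print(class(chr_vec))
print(num_vec)
```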
22/62
Vector
You can select ranges of a vector to get a shorter one.
Indexing begins with 1!
Negative indices exclude the elements at those positions.
Code: vector subsetting
> v1 <- c(1, 2, 3, 4, 5, 6)
> v1[2]
[1] 2
> v1[2:4]
[1] 2 3 4
> v1[c(-2,-5)]
[1] 1 3 4 6
23/62
(Nice random image 1)
24/62
Matrix
A matrix is a vector represented and accessible in two dimensions. It has a fixed type of elements and a fixed number of rows and columns.
Code: matrix examples
> matrix(1:6, byrow=TRUE, nrow=2)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> matrix(c(1,2,13,9,8,17,3,4,5), ncol = 3)
     [,1] [,2] [,3]
[1,]    1    9    3
[2,]    2    8    4
[3,]   13   17    5
25/62
Matrix
You can give names to columns and/or rows.
Code: matrix naming
> matrix(c(1:9),nrow=3,byrow=TRUE,
+ dimnames=list(c("r1","r2","r3"),c("a","b","c")))
   a b c
r1 1 2 3
r2 4 5 6
r3 7 8 9
26/62
Matrix
Subsetting works here as well ...
Code: matrix subsetting
> m1 <- matrix(c(1:9),nrow=3,byrow=TRUE,
+ dimnames=list(c("r1","r2","r3"),c("a","b","c")))
> m1[2,]
a b c
4 5 6
> m1[,-2]
   a c
r1 1 3
r2 4 6
r3 7 9
27/62
Matrix
Matrix operations are quite similar to vector operations. For example, a comparison returns a logical matrix of the same size.
Code: matrix example
> m1 > 5
       a     b     c
r1 FALSE FALSE FALSE
r2 FALSE FALSE  TRUE
r3  TRUE  TRUE  TRUE
> m1[m1>5]
[1] 7 8 6 9
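The order 7 8 6 9 may look surprising. A matrix is stored column by column, so indexing with a logical matrix returns the selected elements in column-major order. A small sketch of this, rebuilding the same values without dimnames (the names `m`, `stored`, `selected` and `positions` are illustrative):

```r
m <- matrix(1:9, nrow = 3, byrow = TRUE)

# The underlying vector is laid out column by column
stored <- as.vector(m)

# Logical indexing walks that stored order
selected <- m[m > 5]

# which() with arr.ind = TRUE reports the row and column of each TRUE cell
positions <- which(m > 5, arr.ind = TRUE)

print(stored)
print(selected)
print(positions)
```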
28/62
(Nice random image 2)
29/62
Array
An array extends the matrix to any number of dimensions. It is a vector that is represented and accessible in a given number of dimensions.
Let's arrange 20 integers from 0 to 19 in three dimensions: 2 x 5 x 2
Code: array example
> a1 <- array(c(0:19), dim = c(2, 5, 2))
30/62
Array
Code: array example
> array(c(0:19), dim = c(2, 5, 2))
, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    2    4    6    8
[2,]    1    3    5    7    9

, , 2

     [,1] [,2] [,3] [,4] [,5]
[1,]   10   12   14   16   18
[2,]   11   13   15   17   19
31/62
Array
Subsetting works similarly to matrices, but we can specify the selected row/col indices for each dimension.
Code: array example
> a1[-1, 1:4, ]
     [,1] [,2]
[1,]    1   11
[2,]    3   13
[3,]    5   15
[4,]    7   17
> a1[-1, 1:4, 2]
[1] 11 13 15 17
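The second result is a plain vector because R drops dimensions of length one by default. A sketch of how the array shape could be kept instead, using the standard `drop = FALSE` argument of the subsetting operator (the names `arr`, `dropped` and `kept` are illustrative):

```r
arr <- array(0:19, dim = c(2, 5, 2))

dropped <- arr[-1, 1:4, 2]                # plain vector, dimensions dropped
kept    <- arr[-1, 1:4, 2, drop = FALSE]  # still a 3-dimensional array

print(dropped)
print(dim(dropped))
print(dim(kept))
```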
32/62
(Nice random image 3)
33/62
List
A list is a generic vector that is allowed to include different types of objects, even other lists.
Code: list example
> list(1, c(TRUE,FALSE), c("a","b","c"))
[[1]]
[1] 1

[[2]]
[1]  TRUE FALSE

[[3]]
[1] "a" "b" "c"
34/62
List
We can assign names to each entry by using named arguments.
Code: list example
> myl <- list(x=1,y=c(TRUE,FALSE),z=c("a","b","c"))
> myl
$x
[1] 1

$y
[1]  TRUE FALSE

$z
[1] "a" "b" "c"
35/62
List
To access the members of a list by name, use the dollar sign:
Code: list example
> myl$x
[1] 1
> myl $ z
[1] "a" "b" "c"
> myl$alma
NULL
36/62
List
To access the N-th member of a list, use double square brackets:
Code: list example
> myl[[1]]
[1] 1
> myl [[ 3 ]]
[1] "a" "b" "c"
> myl[[4]]
Error in myl[[4]] : subscript out of bounds
37/62
List
Even names can be used inside double brackets:
Code: list example
> myl[["x"]]
[1] 1
> elemname <- "z"
> myl[[elemname]]
[1] "a" "b" "c"
38/62
Subsetting a list
Use single-square-bracket notation to extract multiple members from a list and construct a new list:
Code: subsetting a list
> myl["x"]
$x
[1] 1

> myl[c("x","y")]
$x
[1] 1

$y
[1]  TRUE FALSE
39/62
Subsetting a list
Code: more examples of subsetting a list
> myl[1]
$x
[1] 1

> myl[c(TRUE, FALSE, TRUE)]
$x
[1] 1

$z
[1] "a" "b" "c"
40/62
Setting values of a list
Code: setting and adding list members
> myl$x <- 0.6 # overwrite element
> myl$m <- 4 # add a new named element
> myl$y <- NULL # delete by name
> myl
$x
[1] 0.6

$z
[1] "a" "b" "c"

$m
[1] 4
41/62
Setting values of a list
Code: setting and adding list members
> myl[[2]] <- NULL # delete by index ("z")
> myl[[1]] <- 0.8 # overwrite element
> myl[[5]] <- 5 # add a new element
# what do we get?
42/62
Setting values of a list
Code: setting and adding list members
> myl
$x
[1] 0.8

$m
[1] 4

[[3]]
NULL

[[4]]
NULL

[[5]]
[1] 5
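The result comes from how R grows a list: assigning to an index beyond the current length pads the gap with NULL entries that have empty names. The sketch below replays the three steps on a fresh list (named `l` here so it does not interfere with `myl`):

```r
l <- list(x = 0.6, z = c("a", "b", "c"), m = 4)

l[[2]] <- NULL   # removes "z"; "m" moves to position 2
l[[1]] <- 0.8    # overwrites "x"
l[[5]] <- 5      # positions 3 and 4 are filled with NULL

print(length(l))
print(names(l))
print(sapply(l, is.null))
```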
43/62
Other list functions
Code: List functions
> is.list(myl[1]) # [ ] -> sublist
[1] TRUE
> is.list(myl[[1]]) # [[ ]] -> element
[1] FALSE
> l2 <- as.list(c(a=1,b=2,c=3)) # vector to list
> unlist(l2) # list to vector
a b c
1 2 3
> unlist(list(a=1, b=2, c="hello"))
      a       b       c
    "1"     "2" "hello"
44/62
(Nice random image 4)
45/62
Factor
The term factor refers to a statistical data type used to store categorical variables.
Use the function factor() to get factors from a vector of objects.
Code: using factors
> sData <- c("Male", "Female", "Female", "X", "Male")
> sFactors <- factor(sData)
> sFactors
[1] Male   Female Female X      Male
Levels: Female Male X
A factor is stored internally as an integer vector with values 1, 2, ..., k, where k is the number of levels.
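The internal representation can be inspected directly. A minimal sketch using the base functions `as.integer()`, `levels()` and `table()` on the same data (the names `codes`, `labels` and `counts` are illustrative):

```r
sData <- c("Male", "Female", "Female", "X", "Male")
sFactors <- factor(sData)

codes  <- as.integer(sFactors)  # internal integer codes, one per element
labels <- levels(sFactors)      # level labels, sorted by default
counts <- table(sFactors)       # frequency of each level

print(codes)
print(labels)
print(counts)
```

Each code is the position of the element's label in `levels(sFactors)`.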
46/62
Data frame
Data frames are similar to tables in a relational database.
A data frame generalises a matrix and a list: different columns may have different modes, but all elements of a column must have the same mode (all numeric, all factor, or all character).
Typically you'll load a csv file into a data frame object.
47/62
Data frame
A data frame has colnames() and rownames(); the length() of a data frame is the same as ncol(); nrow() gives the number of rows.
Code: data frame example
> df <- data.frame(x = 1:3, y = c("a", "b", "c"))
> str(df)
'data.frame': 3 obs. of 2 variables:
 $ x: int 1 2 3
 $ y: Factor w/ 3 levels "a","b","c": 1 2 3
Note: in R versions before 4.0.0, data.frame() by default turns strings into factors. Use stringsAsFactors = FALSE to keep them as character vectors.
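The size and naming functions mentioned on this slide can be tried on the same small data frame. A sketch, with `stringsAsFactors = FALSE` passed explicitly so that it behaves the same way regardless of the R version:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

print(colnames(df))
print(rownames(df))
print(length(df))  # number of columns, because a data frame is a list of columns
print(ncol(df))
print(nrow(df))

# Columns can be accessed like list members
print(df$y)
print(df[["x"]])
```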
48/62
Data frame
Code: data frame example
> myy <- c("a", "b", "c", "n")
> df <- data.frame(x = 4:7, y = myy,
+ stringsAsFactors = FALSE)
> df
  x y
1 4 a
2 5 b
3 6 c
4 7 n
> str(df)
'data.frame': 4 obs. of 2 variables:
 $ x: int 4 5 6 7
 $ y: chr "a" "b" "c" "n"
49/62
(Nice random image 5)
50/62
Control structures
if, else    test a condition
for         execute a loop a fixed number of times
while       execute a loop while a condition is true
repeat      execute an infinite loop
break       break the execution of a loop
next        skip an iteration of a loop
return      exit a function
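The slides show if/else and for in detail. The remaining keywords can be sketched as follows (all variable names are illustrative):

```r
# while: loop as long as the condition holds
i <- 0
while (i < 3) {
  i <- i + 1
}

# repeat: loops forever unless break is called
total <- 0
repeat {
  total <- total + 10
  if (total >= 30) break
}

# next: skip the rest of the current iteration
odd <- c()
for (n in 1:6) {
  if (n %% 2 == 0) next
  odd <- c(odd, n)
}

print(i)
print(total)
print(odd)
```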
51/62
Condition
This is a valid if/else structure.
Code: if and else
if(x > 3) {
  y <- 10
} else {
  y <- 0
}
So is this one.
Code: alternative if and else
y <- if(x > 3) { 10 } else { 0 }
52/62
Loop
We can iterate over lists or vectors.
Code: for loops
x <- c("a", "b", "c", "d")
for(letter in x) {
  print(letter)
}

# another example
for(i in 1:3) print(x[i])
53/62
Functions
Functions are objects in their own right!
Three components of a function:
• arguments
• body
• environment
Functions can return only a single object. (But they can return a list containing any objects.)
Call-by-value: modifying a function argument does not change the original value (some exceptions exist).
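A short sketch of these points: returning several results through a list, call-by-value behaviour, and inspecting the three components with the base functions `formals()`, `body()` and `environment()`. The function names `minmax` and `addOne` are hypothetical examples, not from the slides:

```r
# Return several results by wrapping them in a named list
minmax <- function(v) {
  list(min = min(v), max = max(v))
}

# Modifying an argument only changes the function's local copy
addOne <- function(v) {
  v[1] <- v[1] + 1
  v
}

orig <- c(1, 2, 3)
res <- minmax(orig)
changed <- addOne(orig)

print(res$min)
print(res$max)
print(orig)     # unchanged by the call to addOne()
print(changed)

# The three components of a function can be inspected directly
print(formals(addOne))
print(body(addOne))
print(environment(addOne))
```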
54/62
Functions
Code: defining functions
isSumAboveTen <- function(x, y) {
  if (x + y > 10) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

# Shorter: the last object will be returned
isSumAbove10 <- function(x, y) {
  if (x + y > 10) TRUE else FALSE
}
55/62
Scripts
And now we can write nice little scripts to do whatever we want!
How to get the examples
$ git clone [email protected]/tutorial
$ ls tutorial/R
example1 example2 example3 example4
56/62
Twitter example
57/62
1 What is R?
    What is R for?
    Who is R for?
2 Basics
    Data Structures
    Control Structures
3 ExtRa stuff
    R packages
    Unit testing in R
58/62
Packages in R
When you install R, you get a base R system with the basic functionality needed to use R.
It includes some basic packages by default (utils, stats, datasets, graphics, methods, tools, parallel, etc.)
For more specific purposes you either:
• find an existing add-on package on CRAN that helps, or maybe on Bioconductor,
• or write your own package (just for yourself, or you can easily publish it as well)
59/62
CRAN
CRAN (The Comprehensive R Archive Network) is a network of ftp and web servers around the world that store identical & up-to-date versions of code and documentation for R.
(Please use the CRAN mirror nearest to you to minimize network load.)
More about packages later. . .
60/62
Unit testing in R
There are some options:
RUnit       is the oldest one
svUnit      has a GUI
testthat    is actively developed and smarter (but isn't compatible with either of the previous two)
61/62
Sources
1 From the Coursera course "R programming" by Roger D. Peng
2 http://renkun.me/learnR/
3 http://adv-r.had.co.nz/
4 http://www.edureka.in/blog/why-learn-r/
5 http://user2014.stat.ucla.edu/files/Abstracts.pdf
6 http://www.slideshare.net/DataRobot/final-10-r-xc-36610234
7 https://www.datacamp.com/courses/introduction-to-r
8 http://www.stat.berkeley.edu/~statcur/Workshop2/Presentations/functions.pdf
62/62
Sources of images
1 http://www.opsrules.com/supply-chain-optimization-blog/bid/349734/Combining-Machine-Learning-and-Optimization-in-Supply-Chain-Analytics
2 http://exploredata.wordpress.com/2012/08/20/importing-a-google-spreadsheet-into-r/
3 http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-study-data-structures-1/
4 http://1.bp.blogspot.com/_FsLa1cMTCWU/TPgxWY_QNZI/AAAAAAAAAjc/ORVJjtoDBvg/s1600/program_language_density_plot.png
5 http://www.gettyimages.com/detail/photo/castor-oil-stem-light-micrograph-of-a-high-res-stock-photography/123790451
6 http://illuminarti.weebly.com/patrick-star.html
7 http://vis.cs.ucdavis.edu/papers/social_networks.pdf