Introduction to the language R

Language R By Franck Benault Created 05/10/2015 Last update 22/12/2015


R introduction

● R is a programming language for statistics and graphics

– Lingua franca of data science

● Easy to use and powerful

● R is free and available on every platform (Windows, Unix)

– Large community

● A shortage of data scientists is expected

● Some material comes from DataCamp tutorials

R in public GitHub repositories

Year  Rank  Number of public repositories

2014  14th  48,574

2013  24th  7,867

2012  25th  5,626

● TIOBE Index

– http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

– R ranks 19th (September 2015)

R links

● DataCamp (R training)

– https://www.datacamp.com/

● According to DataCamp, the number of R users grows by 40% each year

● My examples on GitHub

– https://github.com/franck-benault/test-R

R plan

● Environment

● Data types

– From the basics to data frames

● R and statistics

● Diagrams

R environment

● RStudio

R's fundamental data types

● Logical value (TRUE, FALSE, T, F, NA)

● Numeric (2, 4.5)

● Integer (2L)

● Character

● Complex

● Raw (store raw bytes)

● is.* functions (type tests: is.numeric(), is.integer(), ...)

● as.* functions (conversions: as.numeric(), as.integer(), ...)
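The types and the is.* / as.* functions listed above can be tried directly in the console. This is a minimal sketch; the variable names x and y are chosen here only for illustration.

```r
# One value of each basic type, inspected with class()
class(TRUE)             # "logical"
class(4.5)              # "numeric"
class(2L)               # "integer"
class("hello")          # "character"
class(1+2i)             # "complex"
class(charToRaw("A"))   # "raw"

# is.* functions test the type
x <- 2
is.numeric(x)           # TRUE
is.integer(x)           # FALSE: 2 without the L suffix is stored as a double

# as.* functions convert between types
y <- as.integer(x)
is.integer(y)           # TRUE
as.numeric("4.5")       # 4.5
as.integer(4.9)         # 4 (the fractional part is dropped, not rounded)
```

Note that is.numeric() is TRUE for both doubles and integers, so it cannot be used to tell them apart.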

R datatype vector

● Vector

– Sequence of data elements (one dimension)

– All elements have the same data type

– a <- c(1,2,5.3,6,-2,4)

– b <- c("one","two","three")

– d <- c(1,2.1,"three") # every element is coerced to character

● Functions

– is.vector(v)

– length(v)

– names(v) <- v2 # associates names with the values

● Single values of the basic data types are vectors of length 1

– a <- 2

– is.vector(a) # returns TRUE
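A short sketch of the points above: coercion to a common type, length, indexing, and the fact that a single value is already a vector. The name single is used here only for illustration.

```r
a <- c(1, 2, 5.3, 6, -2, 4)
d <- c(1, 2.1, "three")

class(a)        # "numeric"
class(d)        # "character": the numbers were coerced to strings
d               # "1" "2.1" "three"

length(a)       # 6
is.vector(a)    # TRUE

# Indexing starts at 1
a[2]            # 2
a[c(1, 3)]      # 1.0 5.3

# A single value is a vector of length 1
single <- 2
length(single)  # 1
```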

R datatype vector, methods

● Many functions can be applied to numeric vectors

– mean(v) # average

– median(v)

– sum(v)

● Names on vectors

– a <- c(1,6,5)

– n <- c("Ford","Renault", "Fiat")

– names(a) <- n

– b <- c(Ford=1, Renault=6, Fiat=5) # same result in one step

● If you need a collection of elements with different data types, use a list
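As an illustration of the two ideas above, the named vector from the slide can be summarised and then queried by name. The variable name sales is an assumption made for this sketch.

```r
sales <- c(Ford = 1, Renault = 6, Fiat = 5)

# Summary functions ignore the names and work on the values
mean(sales)        # 4
median(sales)      # 5
sum(sales)         # 12

# Names can be used for selection instead of positions
names(sales)       # "Ford" "Renault" "Fiat"
sales["Renault"]   # named element with value 6

# Logical selection keeps the names of the selected elements
sales[sales > 2]   # Renault 6, Fiat 5
```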

R datatype matrix, names

● Names with rownames() and colnames()

– m <- matrix(1:6, byrow=TRUE, nrow=2)

– rownames(m) <- c("row1", "row2")

– colnames(m) <- c("col1", "col2", "col3")

● Or in one step with dimnames

– matrix(1:6, byrow=TRUE, nrow=2, dimnames=list(c("row1", "row2"), c("col1","col2","col3")))
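Once rows and columns are named, the names can be used for indexing. A minimal sketch using the matrix from the slide:

```r
m <- matrix(1:6, byrow = TRUE, nrow = 2,
            dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))

dim(m)              # 2 3

# Select a single cell, a whole row, or a whole column by name
m["row2", "col3"]   # 6
m["row1", ]         # col1 = 1, col2 = 2, col3 = 3
m[, "col2"]         # row1 = 2, row2 = 5
```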

R datatype matrix

● Matrix

– Two dimensions

– All elements have the same type

● Creation with the matrix() function, with a vector as parameter

– y <- matrix(1:20, nrow=5, ncol=4)

● Creation from two or more vectors with cbind() or rbind()

– cbind(1:4, 1:4, 1:4)

– rbind(1:4, 1:4, 1:4)
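The sketch below shows two details that are easy to miss: matrix() fills column by column unless byrow=TRUE is given, and cbind() and rbind() produce differently shaped results from the same vectors. The names cb and rb are chosen here for illustration.

```r
# Default fill order is by column
y <- matrix(1:20, nrow = 5, ncol = 4)
y[, 1]    # 1 2 3 4 5
y[1, ]    # 1 6 11 16

# cbind() makes each vector a column, rbind() makes each vector a row
cb <- cbind(1:4, 1:4, 1:4)
rb <- rbind(1:4, 1:4, 1:4)
dim(cb)   # 4 3
dim(rb)   # 3 4
```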

R datatype Factor

● Categorical variable

– Limited number of different values

– Each value belongs to a category

● In R, stored in the factor data structure

● Example: blood type

– blood <- c("A","B", "O", "AB","O", "A")

– blood_factor <- factor(blood)

– blood_factor

– # levels are sorted alphabetically by default

– str(blood_factor)

– table(blood_factor)
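The blood type example can be examined a little further. This sketch shows the levels, the integer codes stored behind them, and how to impose a level order other than the alphabetical default; the name blood_factor2 and its level order are assumptions made for the example.

```r
blood <- c("A", "B", "O", "AB", "O", "A")
blood_factor <- factor(blood)

levels(blood_factor)       # "A" "AB" "B" "O"
as.integer(blood_factor)   # 1 3 4 2 4 1 (positions in the level list)
table(blood_factor)        # counts: A 2, AB 1, B 1, O 2

# Levels can be given explicitly to control their order
blood_factor2 <- factor(blood, levels = c("O", "A", "B", "AB"))
levels(blood_factor2)      # "O" "A" "B" "AB"
```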

R datatype List

● List

– One dimension

– Can hold different R objects (even lists, matrices, vectors)

– Loss of some functionality compared with vectors (for example, arithmetic on the whole object)

● Creation of a list

– song <- list("Rsome types", 190, 5)

● Naming a list

– names(song) <- c("title","duration","track")

– song <- list(title="Rsome types", duration=190, track=5)
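Elements of a named list can be read in several ways, and the difference between single and double brackets matters. A minimal sketch; the nested element similar and its values are hypothetical and only illustrate that a list can contain another list.

```r
song <- list(title = "Rsome types", duration = 190, track = 5)

# $ and [[ ]] return the element itself
song$title               # "Rsome types"
song[["duration"]]       # 190

# Single brackets return a sub-list, not the element
class(song["duration"])  # "list"

# A list can hold another list (hypothetical data)
song$similar <- list(title = "R you on time?", duration = 230)
song$similar$duration    # 230
length(song)             # 4
```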

R datatype dataframe

● Datasets

– Observations

– Variables

● Example: people

– Rows = observations

– Properties = variables

● How to store this in R?

– List

– Data frame

R datatype dataframe

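● data.frame

– Designed specifically for datasets

– Rows = observations

– Columns = variables

– Columns can have different types (within one column, all elements share a type)

● Read a CSV file to create a data frame

– people <- read.csv("./people.csv", sep="", header=TRUE) # sep="" splits on whitespace; use sep="," for a comma-separated file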
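Since people.csv is not part of this transcript, the sketch below builds a small data frame by hand and then writes and re-reads it as a CSV file. The column names and values are invented for illustration, and a temporary file stands in for ./people.csv.

```r
# Hypothetical data: three observations, three variables of different types
people <- data.frame(
  name  = c("Anne", "Pete", "Frank"),
  age   = c(28, 30, 21),
  child = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

str(people)                 # one line per variable with its type
nrow(people)                # 3 observations
ncol(people)                # 3 variables

# Columns are vectors; rows can be filtered with a logical condition
people$age                  # 28 30 21
people[people$age > 25, ]   # rows for Anne and Pete

# Round trip through a comma-separated file
path <- tempfile(fileext = ".csv")
write.csv(people, path, row.names = FALSE)
people2 <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
names(people2)              # "name" "age" "child"
```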

R and statistics

● Four types of variables (S. S. Stevens, 1946)

– Nominal (categories)

– Ordinal (ranks: 1st, 2nd, etc.)

– Interval (the interval between consecutive values is equal)

– Ratio (interval + "true" zero)
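In R, one common way to represent the difference between nominal and ordinal variables is a plain factor versus an ordered factor. A minimal sketch; the variable sizes and its levels are invented for the example.

```r
# Ordinal variable: the levels have a meaningful order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

is.ordered(sizes)     # TRUE
sizes[1] < sizes[2]   # TRUE: "small" comes before "large"
max(sizes)            # large
```

For a nominal variable, an unordered factor() is the usual choice, and comparisons such as < are not meaningful on it.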

R and statistics : Data description

● Data description

– Centrality

    ● Mean (average), function mean()

    ● Median (50th percentile), function median()

    ● Mode (peak)

– Spread

    ● Variance and standard deviation, functions var() and sd()

    ● Interquartile range, function IQR()

– scale(): transformation to z-scores (mean = 0, standard deviation = 1)
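The measures above can be computed on a small sample as follows. The data in v are invented for the example. Base R has no function for the statistical mode (mode() returns the storage type), so the sketch derives it from table(); this is one possible approach, not the only one.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Centrality
mean(v)     # 5
median(v)   # 4.5
stat_mode <- as.numeric(names(which.max(table(v))))
stat_mode   # 4, the most frequent value

# Spread
var(v)      # sample variance, 32/7 (about 4.57)
sd(v)       # square root of the variance
IQR(v)      # 1.5

# z-scores: subtract the mean, divide by the standard deviation
z <- scale(v)
mean(z)     # 0 up to rounding error
sd(z)       # 1
```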

R and statistics : main functions

● rnorm()

– Generates a sample from the normal distribution

● summary()

– Gives a lot of information

    ● Minimum, maximum, mean, median, etc.
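A minimal sketch of the two functions together. The sample size, mean, standard deviation, and seed are arbitrary choices for the example; set.seed() only makes the random sample reproducible.

```r
set.seed(42)                          # reproducible random numbers
x <- rnorm(1000, mean = 10, sd = 2)   # 1000 draws from N(10, 2)

length(x)                             # 1000
s <- summary(x)
s                                     # minimum, quartiles, median, mean, maximum
names(s)                              # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
```

With a sample this large, the mean and median reported by summary() should both be close to 10.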

Diagrams for qualitative data

● Qualitative data

– Bar plot

– Pie chart

R Diagrams

● Qualitative data

– Bar plot

– Pie chart

● Quantitative data

– Few numerical values

    ● Dot plot

– A lot of data

    ● Histogram

    ● Box plot
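Each diagram type above has a counterpart in base R graphics. The sketch below is one possible mapping, with invented data; stripchart() is used here for the dot plot, and dotchart() would be another option.

```r
# Qualitative data: counts per category
blood <- factor(c("A", "B", "O", "AB", "O", "A"))
counts <- table(blood)
barplot(counts, main = "Blood types (bar plot)")
pie(counts, main = "Blood types (pie chart)")

# Quantitative data, few values: dot plot
few <- c(3.1, 4.7, 2.8, 5.0, 3.9)
stripchart(few, method = "stack", main = "Dot plot")

# Quantitative data, many values: histogram and box plot
set.seed(1)
many <- rnorm(500)
hist(many, main = "Histogram")
boxplot(many, main = "Box plot")
```

In RStudio the plots appear in the Plots pane; when the script is run non-interactively, R writes them to a file instead.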

R Libraries

● Maps

– install.packages("maps")

– library("maps")

– map("world")

– map("france")

– title("la France")

Conclusion

● When will you start using R?

● It may also be a good idea to take a basic statistics course