R Introduction, Data Structures


Page 1: R Introduction, Data Structures. An Excellent R Book (among many others) R in Action Data Analysis and Graphics with R Robert I. Kabacoff .

R

Introduction, Data Structures


An Excellent R Book (among many others)

R in Action: Data Analysis and Graphics with R, by Robert I. Kabacoff

http://www.manning.com/affiliate/idevaffiliate.php?id=1102_173


Steps in a typical data analysis (Kabacoff, 2011)


R features (Kabacoff, 2011)

• R is free! (SPSS, SAS, etc. cost thousands or tens of thousands of dollars)
• R is a comprehensive statistical platform, offering all manner of data analytic techniques
• R has state-of-the-art graphics capabilities
• R is a powerful platform for interactive data analysis and exploration
• R can easily import data from a wide variety of sources, including text files, database management systems, statistical packages, and specialized data repositories. It can write data out to these systems as well
• R provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner. It's easily extensible and provides a natural language for quickly programming recently published methods
• R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis
• A variety of graphical user interfaces (GUIs) are available, offering the power of R through menus and dialogs
• R runs on a wide array of platforms, including Windows, Unix, and Mac OS X


Data structures in R


Vectors

• Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data
• The combine function c() is used to form the vector
> x = c(1, 3, 5, 7, 25, -13, 47)
> y = c("unu", "doi", "trei", "opt")
• All the data in a vector must be of one type (numeric, character, or logical)
• Elements of a vector can be referred to using a numeric vector of positions within brackets: x[c(4, 6)] refers to the 4th and 6th elements of vector x.
> x = c(1, 3, 5, 7, 25, -13, 47)
> x[3]
[1] 5
> x[c(1, 2, 4)]
[1] 1 3 7
> x[2:6]
[1]   3   5   7  25 -13
• The colon operator in the last statement generates a sequence of numbers; x <- c(2:6) is equivalent to x <- c(2, 3, 4, 5, 6)
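The one-type rule and positional indexing can be tried out with a short sketch. The vector x comes from the slide; the vector mixed is a hypothetical example added here to show that c() silently coerces mixed input to a common type rather than raising an error.

```r
# Build the vector from the slide and pick elements by position.
x <- c(1, 3, 5, 7, 25, -13, 47)
picked <- x[c(4, 6)]        # 4th and 6th elements: 7 and -13
slice <- x[2:6]             # same as x[c(2, 3, 4, 5, 6)]

# Mixing types does not fail: c() coerces everything to one type.
mixed <- c(1, "a", TRUE)    # hypothetical example vector
class(mixed)                # "character"
```

Because of this coercion, a stray string in numeric data turns the whole vector into character data, which is worth checking with class() after import.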


Date type

• Dates are more difficult to handle than the other data types
• Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.
• as.Date( ) converts strings to dates
> mydates <- as.Date(c('2013-10-01', '2013-10-03', '2013-11-10'))
• number of days between 10/11/2013 and 3/10/2013 (dates written as dd/mm/yyyy)
> days <- mydates[3] - mydates[2]
> days
> # notice the way of displaying the result
• print today's date
> today <- Sys.Date()
> format(today, format="%d %B %Y")
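The day-count representation described above can be inspected directly. This is a minimal sketch that reuses mydates from the slide; the variable n_days and the two probe dates around 1970-01-01 are illustrative additions.

```r
mydates <- as.Date(c("2013-10-01", "2013-10-03", "2013-11-10"))

# Subtracting two dates gives a "difftime" object, not a plain number.
days <- mydates[3] - mydates[2]
n_days <- as.numeric(days)          # 38

# The underlying value is the number of days since 1970-01-01.
as.numeric(as.Date("1970-01-10"))   # 9
as.numeric(as.Date("1969-12-31"))   # -1
```

Converting the difference with as.numeric() is convenient when the day count is needed in further arithmetic.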


Symbols used with format( )

Symbol  Meaning                  Example
%d      day as a number (01-31)  01-31
%a      abbreviated weekday      Mon
%A      unabbreviated weekday    Monday
%m      month (01-12)            01-12
%b      abbreviated month        Jan
%B      unabbreviated month      January
%y      2-digit year             07
%Y      4-digit year             2007


Date conversions

• Character to Date: as.Date(x, "format")
> # convert date info in format 'dd/mm/yyyy'
> strDates = c("01/10/2013", "31/10/2013")
> dates = as.Date(strDates, "%d/%m/%Y")
• Date to Character: as.character( )
> # convert dates to character data
> strDates2 = as.character(dates)
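A round trip between strings and dates may make the two conversions clearer. The sketch below reuses the slide's input strings; the names str_dates, iso and back are illustrative. Note that as.character() produces the ISO yyyy-mm-dd form, so format() is needed to get the original layout back.

```r
str_dates <- c("01/10/2013", "31/10/2013")

# Character -> Date, telling R how the input is laid out.
dates <- as.Date(str_dates, "%d/%m/%Y")

# Date -> character: the default output is ISO 8601 (yyyy-mm-dd).
iso <- as.character(dates)          # "2013-10-01" "2013-10-31"

# format() writes the dates back in the original dd/mm/yyyy layout.
back <- format(dates, "%d/%m/%Y")
identical(back, str_dates)          # TRUE
```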


Matrices

• Two-dimensional arrays where each element has the same type (numeric, character, or logical)
• Created with the matrix function. Format:
> mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
+     byrow=logical_value,
+     dimnames=list(char_vector_rownames, char_vector_colnames))
– vector contains the elements for the matrix
– nrow and ncol specify the row and column dimensions
– dimnames contains optional row and column labels stored in character vectors
– byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE); the default is by column.


Creating matrices (1)

• First example (a 5 x 4 matrix)
> m1 <- matrix(1:20, nrow=5, ncol=4)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

• Second example (a 2 x 2 matrix, filled by rows)
> cells <- c(1,26,24,68)
> rownames <- c("Row1", "Row2")
> colnames <- c("Col1", "Col2")
> m2 <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
+     dimnames=list(rownames, colnames))
> m2
     Col1 Col2
Row1    1   26
Row2   24   68


Creating matrices (2)

• Third example (a 2 x 2 matrix, filled by columns)
> m3 <- matrix(cells, nrow=2, ncol=2,
+     byrow=FALSE, dimnames=list(rownames,
+     colnames))
> m3
     Col1 Col2
Row1    1   24
Row2   26   68
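The effect of byrow can be checked side by side. The sketch below uses the same cells vector as the slides but omits the dimnames so that the two results can be compared directly; by_row and by_col are illustrative names.

```r
cells <- c(1, 26, 24, 68)

# Same data, two fill orders.
by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE)
by_col <- matrix(cells, nrow = 2, ncol = 2)   # byrow = FALSE is the default

by_row[1, 2]   # 26: the second value fills the first row
by_col[1, 2]   # 24: the third value starts the second column

# Filling by row gives the transpose of filling by column.
all(by_row == t(by_col))   # TRUE
```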


Accessing matrix elements (1)

• (re)create the matrix
> m1 <- matrix(1:20, nrow=5)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

• display the 3rd row
> m1[3,]
[1]  3  8 13 18

• display the 3rd column
> m1[,3]
[1] 11 12 13 14 15


Accessing matrix elements (2)

• display the element in the 2nd row and 3rd column
> m1[2,3]
[1] 12
• display two elements from the same row: m1[2,3] and m1[2,4]
> m1[2, c(3,4)]
[1] 12 17
• display three elements from the same column: m1[1,2], m1[2,2] and m1[3,2]
> m1[c(1,2,3), 2]
[1] 6 7 8
• display a "submatrix", from m1[2,2] to m1[4,4]
> m1[c(2,3,4), c(2,3,4)]
     [,1] [,2] [,3]
[1,]    7   12   17
[2,]    8   13   18
[3,]    9   14   19
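The indexing forms above can be combined in one short sketch. It rebuilds m1 as on the slide; the names sub and row3 are illustrative. It also shows that selecting a single row returns a plain vector, while selecting several rows and columns keeps the matrix shape.

```r
m1 <- matrix(1:20, nrow = 5)

# 2:4 is shorthand for c(2, 3, 4), so this is the same submatrix.
sub <- m1[2:4, 2:4]
dim(sub)              # 3 3
sub[1, 1]             # 7, i.e. the original m1[2, 2]

# A single row comes back as a vector without dimensions.
row3 <- m1[3, ]
is.null(dim(row3))    # TRUE
```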


Arrays

• Similar to matrices but can have more than two dimensions

• Elements must be of the same type
• Created with the array function:
> myarray <- array(vector,
+     dimensions, dimnames)
– vector contains the data for the array
– dimensions is a numeric vector giving the maximal index for each dimension
– dimnames is an optional list of dimension labels

• Elements in arrays are accessed in the same way as those in matrices


Create and access arrays (1)

> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2",
+     "C3", "C4")
> a1 <- array(1:24, c(2, 3, 4),
+     dimnames=list(dim1, dim2,
+     dim3))
> a1
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24

• display element [2,2,3]
> a1[2,2,3]
[1] 16


Create and access arrays (2)

• display the matrix of A by B elements for the first level of C
> a1[,,1]
   B1 B2 B3
A1  1  3  5
A2  2  4  6

• display the elements of A for the 3rd level of B and the 2nd level of C
> a1[,3,2]
A1 A2 
11 12 

• display a subarray containing all elements from the first two levels of A, B and C
> a1[c(1,2),c(1,2),c(1,2)]
, , C1

   B1 B2
A1  1  3
A2  2  4

, , C2

   B1 B2
A1  7  9
A2  8 10
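It may help to see why a1[2,2,3] is 16. R fills arrays with the first index varying fastest, so a position can be converted to a single linear index. The sketch below rebuilds the array without dimnames; the helper variables i, j, k and linear are illustrative.

```r
a1 <- array(1:24, c(2, 3, 4))

# Position [i, j, k] in a 2 x 3 x 4 array maps to one linear index:
# the first dimension varies fastest, then the second, then the third.
i <- 2; j <- 2; k <- 3
linear <- i + (j - 1) * 2 + (k - 1) * 2 * 3   # 16

a1[i, j, k] == a1[linear]   # TRUE

# Fixing the last index leaves an ordinary 2 x 3 matrix.
dim(a1[, , 1])              # 2 3
```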


Data Frames

• Most important data structure in R (at least for us)
• A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata) and databases

• The columns are variables and the rows are observations

• Variables can have different types (for example, numeric, character) in the same data frame.

• Data frames are the main structures we’ll use to store datasets


data.frame function

• A data frame is created with the data.frame() function:
> mydata <- data.frame(col1, col2, col3,…)
– col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical).
– names for each column can be provided with the names function.

> studentID <- c(1, 2, 3, 4, 5)
> name <- c("Popescu I. Vasile", "Ianos W. Adriana",
+     "Kovacz V. Iosef", "Babadag I. Maria", "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social","Studiu1","Studiu2","Merit","Studiu1")
> lab_assessment <- c("Bine", "Foarte bine", "Excelent", "Bine", "Slab")
> final_grade <- c(9, 9.45, 9.75, 9, 6)
> student_gi <- data.frame(studentID, name, age, scholarship,
+     lab_assessment, final_grade)
> student_gi
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
3         3   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
4         4  Babadag I. Maria  22       Merit           Bine        9.00
5         5        Pop P. Ion  31     Studiu1           Slab        6.00


Accessing elements of a data frame (1)

• display the first two columns (studentID and name)
> student_gi[1:2]
  studentID              name
1         1 Popescu I. Vasile
2         2  Ianos W. Adriana
3         3   Kovacz V. Iosef
4         4  Babadag I. Maria
5         5        Pop P. Ion
• the same operation could be done with
> student_gi[c("studentID", "name")]

• display the final_grade column as a vector
> student_gi$final_grade
[1] 9.00 9.45 9.75 9.00 6.00
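The difference between bracket and dollar access is easy to verify. This is a small sketch on a cut-down version of the slide's data frame (only three of the columns, chosen here for brevity); the names by_position, by_name, grades and one_col are illustrative.

```r
student_gi <- data.frame(
  studentID   = c(1, 2, 3, 4, 5),
  age         = c(23, 19, 21, 22, 31),
  final_grade = c(9, 9.45, 9.75, 9, 6)
)

# Columns can be selected by position or by name; the result is a data frame.
by_position <- student_gi[1:2]
by_name     <- student_gi[c("studentID", "age")]
identical(by_position, by_name)    # TRUE

# $ extracts one column as a plain vector; single brackets keep a data frame.
grades  <- student_gi$final_grade
one_col <- student_gi["final_grade"]
is.numeric(grades)                 # TRUE
is.data.frame(one_col)             # TRUE
```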


Accessing elements of a data frame (2)

• cross tabulate (a sort of pivot table) lab_assessment by final_grade
> table(student_gi$lab_assessment,
+     student_gi$final_grade)
              6 9 9.45 9.75
  Bine        0 2    0    0
  Excelent    0 0    0    1
  Foarte bine 0 0    1    0
  Slab        1 0    0    0

• summary statistics of final_grade
> summary(student_gi$final_grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00    9.00    9.00    8.64    9.45    9.75 

• two plots
> # since R 4.0, data.frame() keeps strings as character, so convert to a factor before plotting
> plot(factor(student_gi$lab_assessment), student_gi$final_grade)
> plot(student_gi$age, student_gi$final_grade)
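The cross tabulation and the mean reported by summary() can be reproduced from the two columns alone. The sketch below copies the slide's values into standalone vectors named lab and grade (illustrative names) and reads one cell out of the resulting table.

```r
lab   <- c("Bine", "Foarte bine", "Excelent", "Bine", "Slab")
grade <- c(9, 9.45, 9.75, 9, 6)

# table() counts how often each (assessment, grade) pair occurs.
tab <- table(lab, grade)

# Cells are addressed by their labels; grade labels are character strings.
tab["Bine", "9"]    # 2: two students with "Bine" received a 9
sum(tab)            # 5: one count per student

mean(grade)         # 8.64, the Mean shown by summary()
```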


attach()

• The attach() function adds the data frame to the R search path
• When a variable name is encountered, data frames in the search path are checked in order to locate the variable.

• But first we'll delete the vectors which formed the data frame (to avoid confusion)
> rm(studentID, name, age, scholarship, lab_assessment,
+     final_grade)

• Now we'll run the previous commands again, but with attach
> attach(student_gi)
> final_grade
> table(lab_assessment, final_grade)
> summary(final_grade)
> # since R 4.0, lab_assessment is stored as character, so convert to a factor before plotting
> plot(factor(lab_assessment), final_grade)
> plot(age, final_grade)

• At the end, detach() removes the data frame from the R search path
> detach(student_gi)
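What attach() and detach() do to the search path can be observed with search(). This is a minimal sketch on a two-row stand-in for the slide's data frame; on_path, off_path and mean_grade are illustrative names.

```r
student_gi <- data.frame(age = c(23, 19), final_grade = c(9, 9.45))

attach(student_gi)
# The data frame now appears as an entry on the search path ...
on_path <- "student_gi" %in% search()
# ... so its columns can be used without the student_gi$ prefix.
mean_grade <- mean(final_grade)

detach(student_gi)
# After detach() the entry is gone again.
off_path <- !("student_gi" %in% search())
```

If a vector with the same name as a column already exists in the workspace, the workspace copy is found first, which is why the slide removes those vectors before calling attach().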


Case Identifiers

• Can be specified with the row.names option of the data.frame() function
• New values for studentID (to avoid confusion with regular row numbers)
> studentID <- c(1001, 1002, 1003, 1004, 1005)
• Vectors name, age, scholarship, lab_assessment and final_grade are the same (recreate them if they were removed earlier with rm())
• (Slightly) new version of the data frame
> student_gi <- data.frame(studentID, name, age,
+     scholarship, lab_assessment, final_grade,
+     row.names = studentID)
• studentID is the variable to use in labeling cases

> student_gi
     studentID              name age scholarship lab_assessment final_grade
1001      1001 Popescu I. Vasile  23      Social           Bine        9.00
1002      1002  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
1003      1003   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
1004      1004  Babadag I. Maria  22       Merit           Bine        9.00
1005      1005        Pop P. Ion  31     Studiu1           Slab        6.00
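Once case identifiers are set, rows can be looked up by identifier instead of by position. The sketch below uses a shortened, three-student version of the slide's data; the data frame name df is illustrative.

```r
studentID   <- c(1001, 1002, 1003)
final_grade <- c(9, 9.45, 9.75)

df <- data.frame(studentID, final_grade, row.names = studentID)

# Row names are stored as character strings.
rownames(df)                  # "1001" "1002" "1003"

# A row can be selected by its identifier rather than its position.
df["1003", "final_grade"]     # 9.75
```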


Factors (1)

• Variables can be described as nominal, ordinal, or continuous
• Nominal variables are categorical, without an implied order. Examples: MaritalStatus, Sex, Job, MasterProgramme
• Ordinal variables imply order but not amount. Examples: Status (poor, improved, excellent), LabAssessment (slab, bine, foarteBine, excelent)
• Continuous variables can take on any value within some range, and both order and amount are implied. Examples: LitersPer100Km, Height, Weight, FinalGrade (with decimals)
• Categorical (nominal) and ordered categorical (ordinal) variables are called factors.
• Factors determine how data will be analyzed and presented visually
• The function factor() stores the categorical values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers


factor function

• A nominal variable
> scholarship <- c("Social","Studiu1","Studiu2","Merit",
+     "Studiu1")

• factor function
> scholarship_f <- factor(scholarship)
> scholarship_f
[1] Social  Studiu1 Studiu2 Merit   Studiu1
Levels: Merit Social Studiu1 Studiu2

• Ordinal variable
> lab_assessment <- c("Bine", "Foarte bine", "Excelent",
+     "Bine", "Slab")
> lab_assessment
[1] "Bine"        "Foarte bine" "Excelent"    "Bine"        "Slab"
> lab_assessment <- factor(lab_assessment, order=TRUE,
+     levels=c("Slab", "Bine", "Foarte bine", "Excelent"))
> lab_assessment
[1] Bine        Foarte bine Excelent    Bine        Slab
Levels: Slab < Bine < Foarte bine < Excelent
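The integer codes behind a factor, and what the ordering buys, can be seen in a short sketch. It rebuilds the ordinal variable from the slide under the illustrative name lab and spells the argument out as ordered, which the slide abbreviates to order.

```r
lab <- factor(c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
              levels  = c("Slab", "Bine", "Foarte bine", "Excelent"),
              ordered = TRUE)

# Each value is stored as the position of its label in levels().
codes <- as.integer(lab)     # 2 3 4 2 1
levels(lab)

# Because the factor is ordered, comparisons between levels are meaningful.
better <- lab > "Bine"       # FALSE TRUE TRUE FALSE FALSE
```

Without the levels argument the labels would be ordered alphabetically, which is rarely the intended ranking for an ordinal variable.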


Data Frame with Factors (1)

• Vectors studentID, name, age and final_grade are the same as before

• scholarship and lab_assessment are factors
> scholarship <- c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1")
> scholarship <- factor(scholarship)
> lab_assessment <- c("Bine", "Foarte bine", "Excelent", "Bine", "Slab")
> lab_assessment <- factor(lab_assessment, order=TRUE, levels=c("Slab",
+     "Bine", "Foarte bine", "Excelent"))

• Another version of the data frame (column studentID is removed and becomes the row identifier)
> student_gi <- data.frame(name, age, scholarship,
+     lab_assessment, final_grade, row.names = studentID)

• Structure of the data frame (in R 4.0 and later, name is reported as chr, because data.frame() no longer converts strings to factors by default)
> str(student_gi)
'data.frame':	5 obs. of  5 variables:
 $ name          : Factor w/ 5 levels "Babadag I. Maria",..: 5 2 3 1 4
 $ age           : num  23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Ord.factor w/ 4 levels "Slab"<"Bine"<..: 2 3 4 2 1
 $ final_grade   : num  9 9.45 9.75 9 6


Data Frame with Factors (2)

• Basic statistics about the variables in the data frame
> summary(student_gi)


Factors and Value Labels

• The factor() function can be used to create value labels for categorical variables

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, order=TRUE)
> gender <- c(1, 2, 2, 1)
> patientdata <- data.frame(patientID, age, diabetes,
+     status, gender)

• Variable gender is coded 1 for male and 2 for female. Create value labels:
> patientdata$gender <- factor(patientdata$gender,
+     levels = c(1,2), labels = c("male", "female"))

• levels indicates the actual values of the variable
• labels refers to a character vector containing the desired labels
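How levels and labels pair up can be checked on the gender codes alone. In the sketch below the fifth value, 3, is a hypothetical undeclared code added to show what happens to values that are missing from levels; gender_f is an illustrative name.

```r
gender <- c(1, 2, 2, 1, 3)   # 3 is a hypothetical code not listed in levels

gender_f <- factor(gender, levels = c(1, 2), labels = c("male", "female"))

# Each level is replaced by the label in the same position;
# values that are not among the levels become NA.
as.character(gender_f)       # "male" "female" "female" "male" NA
sum(is.na(gender_f))         # 1
```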


Lists

• Lists are the most complex of the R data types
• A list is an ordered collection of objects (components).
• A list allows gathering a large variety of (possibly unrelated) objects under one name.
• A list can contain a combination of vectors, matrices, data frames, and even other lists
• Created using the list() function:
> mylist <- list(object1, object2, …)
where the objects are any of the structures seen so far
• Optionally, the objects in a list can be named:
> mylist <- list(name1=object1,
+     name2=object2, …)
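A small named list shows how components of different kinds sit together and how they are retrieved. The component names title, ages and m, and their contents, are made up for this sketch.

```r
mylist <- list(title = "My first list",
               ages  = c(25, 26, 18, 39),
               m     = matrix(1:10, nrow = 5))

length(mylist)        # 3 components

# Double brackets or $ return the component itself.
mylist[["ages"]][2]   # 26
mylist$m[2, 2]        # 7

# Single brackets return a shorter list, not the component.
is.list(mylist["ages"])   # TRUE
```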


Useful functions for Data Objects