Data Management and Statistical Analysis - Data Manipulation
Introduction to R:
Data Manipulation and Statistical Analysis
Data Manipulation
Violeta I. Bartolome
Senior Associate Scientist-Biometrics
Crop Research Informatics Laboratory
International Rice Research Institute
Sample data set
mydata[3,4]
Selecting Variables
• Select variable Y1
o mydata["Y1"]
o mydata[,3]
o mydata[3]
o mydata[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)]
o mydata[as.logical(c(0,0,1,0,0,0))]
o mydata[names(mydata)=="Y1"]
o mydata$Y1
To create a data frame containing Y1
myA <- mydata["Y1"]
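The selection forms above do not all return the same kind of object. The sketch below assumes a small made-up stand-in for the sample data set (the values are hypothetical, only the column layout follows the slides) and checks which forms give a data frame and which give a plain vector.

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

as_frame   <- mydata["Y1"]   # single-bracket list indexing keeps a data frame
as_vector  <- mydata$Y1      # $ extracts the column itself
as_vector2 <- mydata[, 3]    # [row, column] indexing drops to a vector by default

is.data.frame(as_frame)           # data frame with one column
is.numeric(as_vector)             # plain numeric vector
identical(as_vector, as_vector2)  # both vector forms hold the same values
```

Use the bracket form when a later step expects a data frame, and the `$` form when a function expects a vector.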
Selecting Variables
• Select variables Y1, Y2, Y3, Y4
o mydata[c(3,4,5,6)]
o mydata[3:6]
o mydata[-c(1,2)]
o mydata[-I(1:2)] # I() marks 1:2 to be used "as is"; -(1:2) also works
o mydata[c("Y1", "Y2", "Y3", "Y4")]
o mydata[c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]
o mydata[as.logical(c(0,0,1,1,1,1))]
To create a data frame containing Y1, Y2, Y3, Y4
myB <- mydata[c(3,4,5,6)]
Selecting Variables
• Select variables Y1, Y2, Y3, Y4
o myB<-data.frame(mydata$Y1, mydata$Y2,mydata$Y3, mydata$Y4)
this is equivalent to the following, except that the columns are then named Y1 to Y4 instead of mydata.Y1 to mydata.Y4
attach(mydata)
myB<-data.frame(Y1,Y2,Y3,Y4)
detach(mydata)
o myB<-subset(mydata, select=Y1:Y4)
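As a quick check that the different selection forms agree, the sketch below builds a made-up stand-in for the sample data (hypothetical values) and compares the results. It also shows the column names produced by the data.frame(mydata$Y1, ...) form.

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

by_position <- mydata[3:6]
by_name     <- mydata[c("Y1", "Y2", "Y3", "Y4")]
by_drop     <- mydata[-c(1, 2)]
by_subset   <- subset(mydata, select = Y1:Y4)
by_dollar   <- data.frame(mydata$Y1, mydata$Y2, mydata$Y3, mydata$Y4)

# The first four forms give the same data frame.
isTRUE(all.equal(by_position, by_name))
isTRUE(all.equal(by_position, by_drop))
isTRUE(all.equal(by_position, by_subset))

# The data.frame() form holds the same values under different column names.
names(by_dollar)
```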
Selecting Observations
• Select observation numbers 3 to 8
o mydata[3:8, ]
o mydata[-c(1,2), ] # same result only if mydata has exactly 8 observations
• Select observations of Site B
o mydata[mydata$Site=="B", ]
o subset(mydata,subset=Site=="B")
o mydata[which(mydata$Site=="B"),]
To create a data frame
myC <- mydata[mydata$Site=="B", ]
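The three ways of selecting Site B behave differently when Site itself has a missing value. The sketch below uses a tiny made-up data frame (hypothetical values, with one NA in Site) to show the difference: plain logical indexing returns an extra all-NA row, while which() and subset() drop it.

```r
# Hypothetical data with one missing Site value.
mydata <- data.frame(
  Site = c("A", "B", NA, "B"),
  Trt  = 1:4,
  Y1   = c(5.1, 4.8, 6.0, 5.5)
)

by_logical <- mydata[mydata$Site == "B", ]         # NA in the condition yields an all-NA row
by_which   <- mydata[which(mydata$Site == "B"), ]  # which() keeps only the TRUE positions
by_subset  <- subset(mydata, subset = Site == "B") # subset() also treats NA as FALSE

nrow(by_logical)
nrow(by_which)
nrow(by_subset)
```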
Selecting Observations
• Select observations of Sites A and B, and Trt 1 and 2
o attach(mydata)
mydata[(Site=="A" | Site=="B") & (Trt==1 | Trt==2), ]
detach(mydata)
o subset(mydata,subset=((Site=="A" | Site=="B") & (Trt==1 | Trt==2)))
o mydata[which((mydata$Site=="A" | mydata$Site=="B") & (mydata$Trt==1 | mydata$Trt==2)),]
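When several values of the same variable are accepted, %in% is a shorter way to write the chained == and | comparisons. This is an alternative not shown on the slide; the sketch assumes a made-up data frame with three sites and three treatments (hypothetical values).

```r
# Hypothetical data: 3 sites x 3 treatments.
mydata <- data.frame(
  Site = rep(c("A", "B", "C"), each = 3),
  Trt  = rep(1:3, times = 3),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2, 5.8, 5.0, 6.4)
)

# Chained comparisons, as on the slide (written without attach()).
long_form <- mydata[(mydata$Site == "A" | mydata$Site == "B") &
                    (mydata$Trt == 1 | mydata$Trt == 2), ]

# Same selection with %in%.
short_form <- mydata[mydata$Site %in% c("A", "B") & mydata$Trt %in% c(1, 2), ]

isTRUE(all.equal(long_form, short_form))
rownames(short_form)  # original row numbers of the selected observations
```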
Selecting Both Variables and Observations
• Data frame containing Site B and Y1-Y4
o myD<-mydata[4:6, 3:6]
o myD<-mydata[mydata$Site=="B", c("Y1","Y2","Y3","Y4")]
o myD<-subset(mydata,subset=Site=="B",select=Y1:Y4)
Hands-on
Transforming/Creating New Variables
• Using Numerical Expressions
o mydata$Y5 <- mydata$Y3
o mydata$Y6 <- 0
• Using Mathematical Operations (+, -, *, /, **)
o mydata$sum <- mydata$Y1+mydata$Y2+mydata$Y3+mydata$Y4
o attach(mydata)
mydata$sum<-Y1+Y2+Y3+Y4
detach(mydata)
o mydata<-transform(mydata, sum=Y1+Y2+Y3+Y4)
o With more than one transformation
mydata<-transform(mydata,
sum=Y1+Y2+Y3+Y4,
mean=(Y1+Y2+Y3+Y4)/4)
Note: transform() evaluates every expression against the columns that exist before the call, so mean cannot use a sum created in the same call.
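When one new variable depends on another new variable, two safe options are to call transform() twice or to use within(), which evaluates its statements in order. The sketch below uses a made-up stand-in for the sample data (hypothetical values) and checks that both routes give the same mean.

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

# Option 1: two transform() calls, so that sum exists before mean is computed.
two_steps <- transform(mydata, sum = Y1 + Y2 + Y3 + Y4)
two_steps <- transform(two_steps, mean = sum / 4)

# Option 2: within() runs the statements one after another.
one_block <- within(mydata, {
  sum  <- Y1 + Y2 + Y3 + Y4
  mean <- sum / 4
})

isTRUE(all.equal(two_steps$mean, one_block$mean))
```

Note that within() may place the new columns in a different order than transform(); the values are the same.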
Transforming/Creating New Variables
• Using functions
o mydata$sqrtY3 <- sqrt(mydata$Y3)
o mydata$Y4 <- log10(mydata$Y4)
Missing data: using the na.rm option
• Consider the statement
o mydata$sumy<-mydata$Y1+mydata$Y2+mydata$Y3
Note: if any of the Y's is missing (NA) for an observation, sumy will also be missing for that observation
• To get sum of non-missing observations
o myYs<-subset(mydata,select=c(Y1,Y2,Y3))
o mydata$sum<-rowSums(myYs,na.rm=TRUE)
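A small worked example makes the difference visible. The data below are made up for illustration; the second and third observations each have one missing value.

```r
# Hypothetical data with missing values.
mydata <- data.frame(
  Y1 = c(1, 2, NA),
  Y2 = c(4, NA, 6),
  Y3 = c(7, 8, 9)
)

# Plain addition: any NA makes the whole sum NA.
mydata$sumy <- mydata$Y1 + mydata$Y2 + mydata$Y3

# rowSums with na.rm=TRUE adds up only the non-missing values.
myYs <- subset(mydata, select = c(Y1, Y2, Y3))
mydata$sum <- rowSums(myYs, na.rm = TRUE)

# Number of non-missing values behind each sum, useful when interpreting it.
mydata$n_obs <- rowSums(!is.na(myYs))

mydata
```

Keep in mind that a row in which every value is missing gets a sum of 0 with na.rm=TRUE, which is why a count such as n_obs (a name chosen here for illustration) can be worth keeping.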
Missing data: using is.na()
• Selecting observations with at least one missing value
o missing <- subset(mydata,subset=(is.na(Y1)|is.na(Y2)|is.na(Y3)|is.na(Y4)))
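With many variables, complete.cases() is a shorter alternative to chaining is.na() calls. It is not used on the slide; the sketch below applies it to made-up data (hypothetical values) and compares it with the is.na() form.

```r
# Hypothetical data: observations 2 and 4 each have a missing value.
mydata <- data.frame(
  Site = c("A", "A", "B", "B"),
  Y1   = c(5.1, NA, 6.0, 5.5),
  Y2   = c(3.2, 3.0, 3.9, 3.4),
  Y3   = c(7.7, 7.1, 8.3, 7.9),
  Y4   = c(2.0, 1.8, 2.6, NA)
)

# is.na() form, as on the slide.
missing_isna <- subset(mydata, subset = (is.na(Y1) | is.na(Y2) | is.na(Y3) | is.na(Y4)))

# complete.cases() is TRUE for rows without any NA in the given columns.
ys <- mydata[c("Y1", "Y2", "Y3", "Y4")]
missing_cc <- mydata[!complete.cases(ys), ]

isTRUE(all.equal(missing_isna, missing_cc))
rownames(missing_cc)
```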
Keeping and Dropping Variables
• Create a copy of mydata
mysubset <- mydata
• Drop Y3 and Y4 from mysubset
mysubset$Y3 <- mysubset$Y4 <- NULL
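The same result can be reached by naming the variables to keep, or by excluding the unwanted names with setdiff(). The sketch below uses a made-up stand-in for the sample data (hypothetical values) and compares the three approaches.

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

# Dropping by assigning NULL, as on the slide.
mysubset <- mydata
mysubset$Y3 <- mysubset$Y4 <- NULL

# Keeping: list the variables to retain.
kept <- mydata[c("Site", "Trt", "Y1", "Y2")]

# Dropping by name: keep everything except Y3 and Y4.
dropped <- mydata[setdiff(names(mydata), c("Y3", "Y4"))]

isTRUE(all.equal(mysubset, kept))
isTRUE(all.equal(mysubset, dropped))
```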
Renaming Variables
• Rename Y1-Y4 to X1-X4, respectively
o library(reshape)
mydata <- rename(mydata, c(Y1="X1"))
mydata <- rename(mydata, c(Y2="X2"))
mydata <- rename(mydata, c(Y3="X3"))
mydata <- rename(mydata, c(Y4="X4"))
o names(mydata) <- c("Site", "Trt", "X1", "X2", "X3", "X4")
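If the reshape package is not installed, single variables can also be renamed in base R by matching on the old name, which avoids retyping every column name. The sketch below is one possible approach on made-up data (hypothetical values); the pattern in sub() assumes the variables are named exactly Y1 to Y4.

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

# Rename one variable by looking up its current name.
names(mydata)[names(mydata) == "Y1"] <- "X1"

# Rename the remaining Y variables in one step with a pattern.
names(mydata) <- sub("^Y([1-4])$", "X\\1", names(mydata))

names(mydata)
```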
Hands-on
Stacking/Concatenating Data Frames
• Data frame containing Site A only
attach(mydata)
A <- mydata[Site=="A", ]
• Data frame containing Site B only
B <- mydata[Site=="B", ]
• Combine the two data frames
both <- rbind(A,B)
detach(mydata)
Hands-on
Merging Data Frames
• Data frame containing Y1 and Y2
left <- mydata[c("Site","Trt","Y1","Y2")]
• Data frame containing Y3 and Y4
right <- mydata[c("Site","Trt","Y3","Y4")]
• Merge the two data frames
both <- merge(left, right,
by=c("Site","Trt"))
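The merge can be checked by confirming that the result has one row per Site and Trt combination and all four Y variables. The sketch below runs the steps on a made-up stand-in for the sample data (hypothetical values).

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

left  <- mydata[c("Site", "Trt", "Y1", "Y2")]
right <- mydata[c("Site", "Trt", "Y3", "Y4")]

# Rows are matched on the key variables; key columns come first in the result.
both <- merge(left, right, by = c("Site", "Trt"))

nrow(both) == nrow(mydata)
names(both)
```

By default merge() keeps only rows whose keys appear in both data frames; the all, all.x and all.y arguments keep unmatched rows as well.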
Hands-on
Sorting Data Frames
• Sort by Trt and Site
mydataSorted <-mydata[order(mydata$Trt, mydata$Site), ]
Note: Default is ascending order. Prefix a numeric variable with a minus sign to get descending order
mydataSorted <-mydata[order(-mydata$Trt, mydata$Site), ]
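The minus sign only works on numeric variables. For a character or factor variable such as Site, one option is to wrap it in xtfrm(), which converts it to numeric sort keys that can then be negated. The sketch below sorts made-up data (hypothetical values) by Site descending and Trt ascending.

```r
# Hypothetical data.
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2)
)

# xtfrm() turns Site into numeric keys, so the minus sign can be applied.
mydataSorted <- mydata[order(-xtfrm(mydata$Site), mydata$Trt), ]

mydataSorted
```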
Hands-on
Parallel to Serial
data.serial <- reshape(mydata, # object to be reshaped
varying=list(3:6), # if >1 variable -- list(3:4,5:6)
v.names="Y", # v.names=c("Y","X")
idvar=c("Site","Trt"), # be used as rownames
timevar="Rep", # new variable to be created
times=c(1:4), # values of new variable
direction="long")
data.serial
Parallel to Serial
Change row names (by default the idvar values are used as row names)
row.names(data.serial) <- 1:NROW(data.serial)
data.serial
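To see what the reshaped data look like, the sketch below applies the same call to a made-up stand-in for the sample data (hypothetical values). It names the varying columns explicitly instead of using positions 3:6, which is a choice made here so the example does not depend on column order.

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

data.serial <- reshape(mydata,
  varying   = list(c("Y1", "Y2", "Y3", "Y4")),  # columns stacked into one
  v.names   = "Y",                              # name of the stacked variable
  idvar     = c("Site", "Trt"),                 # identifies each original row
  timevar   = "Rep",                            # new variable: which Y it came from
  times     = 1:4,
  direction = "long")

# Replace the generated row names with plain sequence numbers.
row.names(data.serial) <- 1:NROW(data.serial)

nrow(data.serial)   # 6 original rows x 4 stacked columns
head(data.serial)
```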
Parallel to Serial
Hands-on
Serial to Parallel
data.parallel <- reshape(serialdata, # object to be reshaped
v.names=c("yld","dm"), # variables to be converted
idvar=c("plot","date"), # variables to be retained
timevar="rep", # values of which will be affixed to column names
drop=c("var1","var2"), # variables to be removed from the reshaped data
direction="wide")
data.parallel
Serial to Parallel
Remove "." from column names
colnames(data.parallel) <- gsub("[.]", "", colnames(data.parallel))
data.parallel
Serial to Parallel
Change row names
row.names(data.parallel) <- 1:NROW(data.parallel)
data.parallel
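The slides do not show serialdata, so the sketch below assumes a small made-up version with the same variable names (plot, date, rep, yld, dm) plus one extra variable, var1, to be dropped. All values are hypothetical.

```r
# Hypothetical serial (long) data: 2 plots x 3 reps.
serialdata <- data.frame(
  plot = rep(1:2, each = 3),
  date = "2010-06-01",
  rep  = rep(1:3, times = 2),
  yld  = c(4.1, 4.5, 3.9, 5.0, 5.2, 4.8),
  dm   = c(9.0, 9.4, 8.7, 10.1, 10.5, 9.9),
  var1 = 1:6
)

data.parallel <- reshape(serialdata,
  v.names   = c("yld", "dm"),    # variables spread across columns
  idvar     = c("plot", "date"), # one output row per combination
  timevar   = "rep",             # its values are appended to the column names
  drop      = "var1",            # not carried into the result
  direction = "wide")

# Tidy up: yld.1 becomes yld1, and row names become 1, 2, ...
colnames(data.parallel) <- gsub("[.]", "", colnames(data.parallel))
row.names(data.parallel) <- 1:NROW(data.parallel)

data.parallel
```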
Serial to Parallel
Hands-on
Aggregating data
• With only one response variable
meanY <- aggregate(data.serial$Y,
by=list(data.serial$Site,data.serial$Trt),
FUN=mean,
na.rm=TRUE) # gets statistics from nonmissing values
meanY
(Output shown for na.rm=TRUE and for na.rm=FALSE)
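The effect of na.rm is easiest to see on a tiny example. The sketch below uses made-up serial data (hypothetical values) with one missing Y in Site A. It also names the elements of the by list, a small addition here, so the grouping columns are labelled Site and Trt instead of Group.1 and Group.2.

```r
# Hypothetical serial data with one missing value in Site A.
data.serial <- data.frame(
  Site = c("A", "A", "B", "B"),
  Trt  = c(1, 1, 1, 1),
  Y    = c(2, NA, 4, 6)
)

# Means from non-missing values only.
meanY <- aggregate(data.serial$Y,
                   by = list(Site = data.serial$Site, Trt = data.serial$Trt),
                   FUN = mean, na.rm = TRUE)

# Without na.rm=TRUE, a group with any NA gets an NA mean.
meanY_na <- aggregate(data.serial$Y,
                      by = list(Site = data.serial$Site, Trt = data.serial$Trt),
                      FUN = mean, na.rm = FALSE)

meanY
meanY_na
```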
Aggregating data
• With more than one response variable
Ys <- subset(mydata,select=Y1:Y4) # data frame of numerical variables
meanYs <- aggregate(Ys, by=list(mydata$Site), # grouping variables
FUN=mean, # function to be performed
na.rm=TRUE)
meanYs
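aggregate() also has a formula interface, which some find easier to read. The sketch below compares it with the by= form on a made-up stand-in for the sample data (hypothetical values, no missing data). One difference to be aware of: the formula interface drops rows containing NA by default, whereas the by= form passes them to FUN.

```r
# Hypothetical stand-in for the slides' sample data set (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  Trt  = rep(1:3, times = 2),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.1, 4.0),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 7.4, 8.8),
  Y4   = c(2.0, 1.8, 2.6, 2.2, 1.9, 2.7)
)

# by= form, as on the slide (the list element is named so the column reads Site).
Ys <- subset(mydata, select = Y1:Y4)
meanYs <- aggregate(Ys, by = list(Site = mydata$Site), FUN = mean, na.rm = TRUE)

# Formula form: responses on the left, grouping variable on the right.
meanYs_f <- aggregate(cbind(Y1, Y2, Y3, Y4) ~ Site, data = mydata, FUN = mean)

isTRUE(all.equal(meanYs[-1], meanYs_f[-1]))  # same site means
meanYs
```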
Hands-on
Please do the exercise.
Thank You.