Lecture3-R · 2016-07-18 · 7/18/16 1 Lecture 3: Programming Statistics in R Christopher S....
Lecture 3: Programming Statistics in R
Christopher S. Hollenbeak, PhD
Jane R. Schubart, PhD
The Outcomes Research Toolbox
Review
• Questions from last lecture?
  – Problems with Stata?
  – Problems with Excel?
Review of Homework
• Create age and HLA mismatch categories
• Perform t tests for age, cold ischemia time, surgery duration, and HLA mismatches
• Perform chi-square tests for sex and race
• Use summarize command with “if” statement to get summaries
• Move summaries to Excel and format
• Make graph
cd /Volumes/Hollenbeak/Teaching/Residents/OutcomesResearch/
clear
insheet using "ltd_data.csv"

generate age039=0
replace age039=1 if age < 40
generate age4049=0
replace age4049=1 if age >= 40 & age < 50
generate age5059=0
replace age5059=1 if age >= 50 & age < 60
generate age60=0
replace age60=1 if age >= 60

generate ab0=0
replace ab0=1 if abmm==0
generate ab1=0
replace ab1=1 if abmm==1
generate ab2=0
replace ab2=1 if abmm==2
generate ab3=0
replace ab3=1 if abmm==3
generate ab4=0
replace ab4=1 if abmm==4

summarize age age039 age4049 age5059 age60 female male black nonblack coldtime dur_surg abmm ab0 ab1 ab2 ab3 ab4 if ssi==0, sep(0)
summarize age age039 age4049 age5059 age60 female male black nonblack coldtime dur_surg abmm ab0 ab1 ab2 ab3 ab4 if ssi==1, sep(0)

ttest age, by(ssi)
ttest coldtime, by(ssi)
ttest dur_surg, by(ssi)
ttest abmm, by(ssi)

tab female ssi, chi2
tab black ssi, chi2

twoway scatter los age, title("Relationship between Age and LOS") xtitle("Age (Years)") ytitle("Hospital Stay (Days)")
Stata Tip
• To find a variable quickly, use the command
  – lookfor
• This will search all variable names and labels for your text
• For example: lookfor bmi will identify two variables in the data set that contain the text “bmi”
  – bmi, which is the patient’s BMI at listing
  – bmi_d, which is the donor’s BMI at hepatectomy
Overview
• R is a statistical computing environment
• Based on the S language (like S-Plus)
• Free and open source
  – www.r-project.org
• Extensive add-on packages available
  – Also free and open source
Overview
• R has become one of the most widely used statistical software environments
• More flexible than Stata
• More extensible than Stata
• More up-to-date than Stata
• R language is object-oriented
  – Usually requires fewer steps
  – Can be unintuitive
• Did I mention it is free??
Common Tasks in R
• Import data
• Create variables
• Subset data
• Univariate statistics
  – Chi-square tests
  – T-tests
  – ANOVA
• Graphics
  – Histograms
  – Scatterplots
  – Boxplots
R Interface
• We use RStudio to run R
• Process is (almost) the same as Stata
  – Write a script file of commands
    • Please make sure to save your script files
  – Run blocks of commands and retrieve output
  – Move to Word for cleaning
  – Move output to Excel for formatting
R Interface
[Screenshot: RStudio window with panes labeled Script Files, Console / Text output, Browser, and Variables]
Differences Between R and Stata
• Stata holds one data set only, and all commands apply to that data set
  – R can hold multiple data sets
    • Commands and variables must specify which data set
• Stata code is commands, separated by spaces, with options after a single comma
• R code is functions, with arguments in parentheses and options separated by commas
• R is object oriented, so the process is 1) create an object, 2) summarize the object
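A minimal sketch of these differences, assuming two small made-up data frames (the names trial_a and trial_b and their values are hypothetical, not the course data):

```r
# Two data frames can live in the same R session (values are made up).
trial_a <- data.frame(age = c(34, 51, 62), los = c(10, 14, 21))
trial_b <- data.frame(age = c(45, 58), los = c(12, 30))

# Each call names the data frame it refers to via object$variable.
mean_a <- mean(trial_a$los)
mean_b <- mean(trial_b$los)

# Options are named arguments inside the parentheses, separated by commas.
rounded_a <- round(mean_a, digits = 1)

# Step 1: create an object. Step 2: inspect or summarize it.
tt <- t.test(trial_a$los, trial_b$los)
print(tt)
tt$p.value
```

In Stata the equivalent commands would act on whichever single data set is in memory; here nothing runs against a data frame unless it is named.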
Setting the Working Directory
• It is good practice to point R to a directory where it will thereafter retrieve and write files
• The function is setwd()
• For example, if I have my .csv file with liver transplant data sitting in a projects folder
• Mac/Unix: setwd("~/projects/ltd/")
• Windows: setwd("c:/users/chollenbeak/projects/ltd/")
  – Note that directories are separated by forward slashes “/” not backslashes “\” even in Windows
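A short sketch of checking and changing the working directory. tempdir() is used here only as a stand-in for a real project folder such as "~/projects/ltd/", so the example runs on any machine:

```r
# Remember the current working directory so it can be restored later.
old_dir <- getwd()

# tempdir() is a stand-in for a real project folder (an assumption for this sketch).
project_dir <- tempdir()
setwd(project_dir)

# Files are now read from and written to this directory by default.
getwd()
list.files()

# Restore the original working directory.
setwd(old_dir)
```

Calling getwd() after setwd() is a quick way to confirm that R is pointing where you expect before importing any files.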
Running a Command
• To run a command in R:
  – Highlight the line of code in your script file
  – Mac: Cmd + Enter
  – Windows: Ctrl + Enter
Import Data
• Create a comma-separated values (CSV) text file
  – Can be done in Excel
    • Save As…CSV
  – Variable names in first row
• R stores data in objects called “data frames”
• You can import, create, and subset as many data frames as you want during an R session
R Command: read.csv
• Import the data using the command read.csv()
  – e.g. data1 <- read.csv("c:/projects/ltd_data.csv")
  – Or, if you have your working directory set to c:/projects/ already, you can simply use:
  – data1 <- read.csv("ltd_data.csv")
• This creates a data object called “data1”
• When you want to use these data, refer to this object
• The “<-” is called the “assignment operator” and assigns a value to the object
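A self-contained sketch of the import step. Because ltd_data.csv is not available here, the example first writes a tiny made-up CSV to a temporary file; the variable names and values are illustrative only:

```r
# Write a tiny made-up CSV to a temporary file (stand-in for ltd_data.csv).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,male,los",
             "45,1,12",
             "61,0,30",
             "38,1,9"), csv_path)

# Import it; the first row supplies the variable names (header = TRUE is the default).
data1 <- read.csv(csv_path)

# Quick checks that the import worked as expected.
str(data1)    # variable names and types
head(data1)   # first few rows
nrow(data1)   # number of observations
```

Running str() right after an import is a cheap way to catch numeric columns that were read as text.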
Viewing Data
• To view your data, click on the name of the data frame in the Environment window
• The data set will then appear in the script window
Referring to Variables
• All variables are associated with a data frame
• To call a variable, use object$variable
• For example, to compute the mean cost contained in data frame data1, use:
  – mean(data1$cost1)
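A small sketch of referring to a variable inside a data frame. The data are made up; the alternatives to the $ operator shown here are standard base R, included only for comparison:

```r
# Made-up data frame with a cost variable.
data1 <- data.frame(cost1 = c(1200, 950, 2100), age = c(45, 61, 38))

# object$variable, as described above.
mean_dollar <- mean(data1$cost1)

# Equivalent ways to reach the same column.
mean_brackets <- mean(data1[["cost1"]])
mean_with <- with(data1, mean(cost1))

# List the variable names available in the data frame.
names(data1)
```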
Create New Variables
• Variables should be created inside a data frame
• To create a new variable, use the assignment operator
• Example: Assume you have a “male” dummy variable but no female dummy variable
• Create a female dummy variable using:
  – data1$female <- 1 - data1$male
Create New Variables
• Assume you have a sex variable coded as “M” and “F”
• You need to create male and female dummy variables
• Use the ifelse() function
  – data$newvar <- ifelse(condition, value_if_true, value_if_false)
  – data1$male <- ifelse(data1$gender=="M",1,0)
  – data1$female <- 1 - data1$male
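A runnable sketch of the dummy-variable step on made-up data. One missing value is included to show that ifelse() passes NA through rather than coding it as 0:

```r
# Made-up data with one missing value.
data1 <- data.frame(gender = c("M", "F", "F", "M", NA))

# 1 if male, 0 if female; a missing gender stays missing.
data1$male <- ifelse(data1$gender == "M", 1, 0)
data1$female <- 1 - data1$male

# Cross-check the new variable against the original coding.
table(data1$gender, data1$male, useNA = "ifany")
```

Tabulating the new variable against the original is a quick check that every category landed where intended.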
Create New Variables
• Assume you have a race variable coded as “White”, “Black”, “Hispanic”, “Asian”, and “Other”
• You need to create race/ethnicity dummy variables
  – data1$white <- ifelse(data1$race=="White",1,0)
  – data1$black <- ifelse(data1$race=="Black",1,0)
  – data1$hispanic <- ifelse(data1$race=="Hispanic",1,0)
  – data1$asian <- ifelse(data1$race=="Asian",1,0)
  – data1$other <- ifelse(data1$race=="Other",1,0)
Create New Variables
• Assume you have dummy variables for race and need a categorical version
• You can nest the ifelse() functions
  – data1$race <- ifelse(data1$white==1,"white",
        ifelse(data1$black==1,"black",
        ifelse(data1$hispanic==1,"hispanic",
        ifelse(data1$asian==1,"asian", "other"))))
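A sketch of the round trip on made-up data: build the dummies from a race variable, then rebuild a categorical version with nested ifelse() calls. The result is stored under a new, hypothetical name (race_cat) so the original variable is not overwritten and the two can be compared:

```r
# Made-up race variable.
data1 <- data.frame(race = c("White", "Black", "Hispanic", "Asian", "Other", "White"),
                    stringsAsFactors = FALSE)

# Dummy variables, one per category.
data1$white    <- ifelse(data1$race == "White", 1, 0)
data1$black    <- ifelse(data1$race == "Black", 1, 0)
data1$hispanic <- ifelse(data1$race == "Hispanic", 1, 0)
data1$asian    <- ifelse(data1$race == "Asian", 1, 0)

# Nested ifelse(): each "else" branch holds the next test.
data1$race_cat <- ifelse(data1$white == 1, "white",
                  ifelse(data1$black == 1, "black",
                  ifelse(data1$hispanic == 1, "hispanic",
                  ifelse(data1$asian == 1, "asian", "other"))))

# Every original category should map to exactly one new category.
table(data1$race, data1$race_cat)
```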
Subsetting Data
• To pull out a subset of observations, use the subset() function
• Create a new data frame based on the original
  – data_new <- subset(dataset_old, condition)
  – e.g. data2 <- subset(data1, data1$male==1)
• Like Stata, R distinguishes between the logical equals (==) and the assignment equals (=)
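A brief sketch of subsetting on made-up data. Inside subset() the variable names can be written without the data frame prefix, and conditions can be combined with & (and) or | (or); the bracket form at the end is the base R equivalent:

```r
# Made-up data.
data1 <- data.frame(male = c(1, 0, 1, 0, 1),
                    age  = c(34, 51, 62, 47, 29),
                    los  = c(10, 14, 21, 8, 12))

# Men only.
data2 <- subset(data1, male == 1)

# Men aged 30 or older: combine conditions with &.
data3 <- subset(data1, male == 1 & age >= 30)

# The same selection as data2 using bracket indexing (rows, columns).
data4 <- data1[data1$male == 1, ]

nrow(data2)
nrow(data3)
```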
Summary Statistics
• R has functions for all the summary statistics you care to present
  – N: length()
  – Mean: mean()
  – Standard Deviation: sd()
  – Sum: sum()
  – Minimum: min()
  – Maximum: max()
• However, it lacks a single, easy function to create a table of descriptive statistics
  – No equivalent of “summarize” in Stata
Summary Statistics
• The solution is to write your own
• You can create your own functions in R using the function() command
• Here is mine:
summarize <- function(var){
  n <- length(var)   # number of observations
  m <- mean(var)     # mean
  s <- sd(var)       # standard deviation
  v <- sum(var)      # sum
  return(cbind(n, m, s, v))
}
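A usage sketch for this function on made-up data. The function is repeated so the example runs on its own; stacking the results with lapply() and rbind() is one possible way to build a table for several variables, not necessarily how it was done in class:

```r
# The function from the slide, repeated so this example is self-contained.
summarize <- function(var) {
  n <- length(var)
  m <- mean(var)
  s <- sd(var)
  v <- sum(var)
  return(cbind(n, m, s, v))
}

# Made-up data.
data1 <- data.frame(age = c(34, 51, 62, 47), los = c(10, 14, 21, 7))

# One variable at a time.
summarize(data1$age)

# Several variables stacked into one table, one row per variable.
vars <- c("age", "los")
tab <- do.call(rbind, lapply(vars, function(v) summarize(data1[[v]])))
rownames(tab) <- vars
tab
```

Note that length() counts missing values and mean(), sd(), and sum() return NA when any are present, so this function assumes complete data.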
Summary Statistics
• To create a table of summary statistics
  – 1. Run the summarize command in R
  – 2. Copy the output to Word
  – 3. Replace spaces with tabs
    • The special character for tab is “^t”
    • May need to replace two adjacent tabs (^t^t) with a single tab
    • The special character for paragraph is “^p”
  – 4. Copy to Excel
  – 5. Format in Excel
Univariate Statistics
• Categorical variables
  – Contingency tables
  – Chi-square tests
• Continuous variables
  – T-tests
  – ANOVA
Contingency Tables
• To obtain a tabulation of data, use the table() command
• For example, to get a tabulation of HLA-A and -B mismatches
  – table(data1$abmm)
• Note that the output is not formatted neatly
  – Do your formatting in Excel

  0   1   2   3   4
 12  22 109 264 370
Contingency Tables
• To get the proportions (multiply by 100 for percents), find the total number of observations using the length() command and then divide the table by it

table(data1$abmm)/length(data1$abmm)
         0          1          2          3          4
0.01544402 0.02831403 0.14028314 0.33976834 0.47619048
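A sketch of the same calculation on made-up mismatch counts, alongside base R's prop.table(), which divides a table by its own total:

```r
# Made-up mismatch values.
abmm <- c(0, 1, 2, 2, 3, 3, 3, 4, 4, 4)

counts <- table(abmm)

# Manual approach from the slide: divide by the number of observations.
pct_manual <- counts / length(abmm)

# prop.table() divides by the table total instead.
pct_prop <- prop.table(counts)

# Percents rounded to one decimal place.
round(100 * pct_prop, 1)
```

The two approaches agree when there are no missing values. They can differ otherwise, because length() counts NA values while table() drops them by default.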
Contingency Tables
• To get a cross-tabulation, include both variables
• e.g. To study whether mortality differs between men and women:

> table(data1$male, data1$died)
      0   1
  0 297  61
  1 337  82
Chi-square Test
• To test whether the difference in the distribution is significant, use the chisq.test() command
• Note that this command is applied to the table, not the variables

> chisq.test(table(data1$male, data1$died))

        Pearson's Chi-squared test with Yates' continuity correction

data:  table(data1$male, data1$died)
X-squared = 5.4022, df = 1, p-value = 0.02011
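A self-contained sketch of the chi-square test on made-up data (not the course data). It stores the test as an object so individual pieces, such as the p-value and the expected counts, can be pulled out:

```r
# Made-up data: 50 women (male = 0) and 50 men (male = 1).
male <- c(rep(0, 50), rep(1, 50))
died <- c(rep(0, 40), rep(1, 10), rep(0, 30), rep(1, 20))

# Build the table first, then test the table.
tab <- table(male, died)
test <- chisq.test(tab)   # Yates' continuity correction is the default for 2x2 tables

test$p.value              # p-value on its own
test$expected             # expected counts under independence

# The uncorrected Pearson chi-square, for comparison.
chisq.test(tab, correct = FALSE)
```

Checking test$expected is useful because the chi-square approximation is commonly considered unreliable when expected counts are small.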
T-Test
• To compare the means between two groups, use the t.test() command
• Syntax is
  – t.test(depvar ~ indvar)
• For example, to test whether the LOS differs between men and women:
  – t.test(data1$los ~ data1$male)

        Welch Two Sample t-test

data:  data1$los by data1$male
t = 1.0511, df = 617.186, p-value = 0.2936
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.988466  6.568512
sample estimates:
mean in group 0 mean in group 1
       29.13966        26.84964
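A sketch of the t-test on simulated data (not the course data). It uses the data= argument as an alternative to writing data1$ twice, and shows the var.equal option. R's default is the Welch test with unequal variances, whereas Stata's ttest assumes equal variances unless told otherwise, so results from the two programs can differ slightly:

```r
# Simulated data: 20 women and 20 men with similar lengths of stay.
set.seed(1)
data1 <- data.frame(male = rep(c(0, 1), each = 20),
                    los  = c(rnorm(20, mean = 29, sd = 10),
                             rnorm(20, mean = 27, sd = 10)))

# Welch test (R's default), using data= instead of data1$ prefixes.
tt <- t.test(los ~ male, data = data1)
tt$estimate   # group means
tt$conf.int   # 95% confidence interval for the difference
tt$p.value

# Equal-variance (Student's) t-test.
tt_eq <- t.test(los ~ male, data = data1, var.equal = TRUE)
tt_eq$p.value
```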
ANOVA
• To compare a continuous variable across more than two groups, use the command aov()
• For example, to test whether LOS varies over HLA-A and -B mismatches:
anova1 <- aov(data1$los ~ data1$abmm)
summary(anova1)
ANOVA
• aov() creates an ANOVA object called “anova1”
• summary() gives us the results
> anova1 <- aov(data1$los ~ data1$abmm)
> summary(anova1)
             Df Sum Sq Mean Sq F value   Pr(>F)
data1$abmm    1  10700   10700   12.54 0.000423 ***
Residuals   775 661504     854
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
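In the output above, data1$abmm has 1 degree of freedom. That is what aov() reports when the grouping variable is stored as a number: it is treated as a single continuous predictor (a linear trend) rather than as five separate groups. A sketch on simulated data (not the course data) of wrapping the variable in factor() so that each mismatch level is its own group:

```r
# Simulated data: 10 patients at each mismatch level 0 through 4.
set.seed(42)
data1 <- data.frame(abmm = rep(0:4, each = 10))
data1$los <- rnorm(50, mean = 25 + data1$abmm, sd = 5)

# Numeric grouping variable: one slope, so Df = 1.
anova_num <- aov(los ~ abmm, data = data1)

# Factor grouping variable: five group means, so Df = 4.
anova_fac <- aov(los ~ factor(abmm), data = data1)

summary(anova_num)
summary(anova_fac)
```

Which version is appropriate depends on the question: a trend across the number of mismatches, or any difference among the five groups.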
Missing Data
• R stores missing values as “NA”
• Many R functions cannot compute a result when the data contain missing values
  – For example, mean() and sd() simply return NA, with no explanation of why
• You must identify variables with missing data and drop those observations
• Use the is.na() function to identify an observation with missing values
Missing Data
• Example: Assume you can tell that the male dummy variable has some missing values. You need to drop those observations
• Strategy 1: Create a subset with no missing values
  – data2 <- subset(data1, is.na(data1$male)==FALSE)
  – t.test(data2$los ~ data2$male)
• Strategy 2: Use original data set but limit the analysis to non-missing observations
  – t.test(data1$los[is.na(data1$male)==FALSE] ~ data1$male[is.na(data1$male)==FALSE])
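A sketch of finding and handling missing values on made-up data. It uses !is.na(x), which is shorthand for is.na(x)==FALSE, plus two standard base R tools not covered above: the na.rm argument and complete.cases():

```r
# Made-up data with one missing value in each variable.
data1 <- data.frame(male = c(1, 0, NA, 1, 0, 1),
                    los  = c(10, 14, 21, NA, 9, 12))

# Count missing values in one variable, then in every variable.
sum(is.na(data1$male))
colSums(is.na(data1))

# mean() returns NA unless missing values are removed.
mean_raw   <- mean(data1$los)
mean_clean <- mean(data1$los, na.rm = TRUE)

# Strategy 1 with the shorter negation: keep rows where male is not missing.
data2 <- subset(data1, !is.na(male))

# Keep only rows with no missing values in any variable.
data3 <- data1[complete.cases(data1), ]
```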
Graphics in R
• R has fantastic graphing capabilities
• Almost any aspect of the graph can be controlled by the user
  – The opposite of the Stata approach
• We will discuss three types of graphs
  – Histograms
  – Scatterplots
  – Boxplots
Histogram
• To summarize the distribution of a continuous variable, plot a histogram
• The R command is hist()
• To summarize the age of transplant recipients:
  – hist(data1$age)
[Figure: “Histogram of data1$age”; x axis: data1$age (0 to 80), y axis: Frequency (0 to 200)]
Histograms
• Can improve this by changing options:
  – main="" Main Title
  – ylab="" Y axis label
  – xlab="" X axis label
  – col="" Gives the boxes color
  – box() Adds a box around the inner plot
• R graphics are extremely flexible and customizable!
hist(data1$cci_index, ylab="Frequency", xlab="Comorbidities",
     main="Distribution of Charlson Comorbidity Index", col="red", breaks=25)
box()
[Figure: histogram titled “Distribution of Age”; x axis: age (0 to 80), y axis: Frequency (0 to 100)]
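A self-contained sketch of these histogram options, using simulated ages rather than the course data. The color and number of breaks are arbitrary choices for illustration:

```r
# Simulated ages for 200 hypothetical patients.
set.seed(123)
age <- round(rnorm(200, mean = 50, sd = 12))

# hist() draws the plot and also returns an object describing the bins.
h <- hist(age,
          main   = "Distribution of Age",
          xlab   = "Age (Years)",
          ylab   = "Frequency",
          col    = "lightblue",
          breaks = 20)   # a suggestion; R may pick a nearby "pretty" number of bins
box()                    # add a box around the plot region

# The returned object holds the bin boundaries and counts.
h$breaks
h$counts
```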
Scatterplot
• To create a scatterplot use the plot() function
• For example, to look at the relationship between age at transplant and length of stay
plot(data1$age, data1$los)
[Figure: scatterplot; x axis: data1$age (0 to 80), y axis: data1$los (0 to 300)]
Scatterplot
• Again we can customize with additional options
  – pch=x Change the plot symbol
  – xlim=c(lower,upper) Set x axis limits
  – ylim=c(lower,upper) Set y axis limits
  – cex=x Symbol scaling factor
    • Default is 1, half size is .5, double size is 2, etc.
plot(data1$age, data1$los, ylab="Total Length of Stay", xlab="Age at Transplant",
     main="Correlation Between Age and LOS", pch=16, cex=1.25,
     xlim=c(0,80), ylim=c(0,400))
[Figure: scatterplot titled “Correlation Between Age and LOS”; x axis: Age at Transplant (0 to 80), y axis: Total Length of Stay (0 to 400)]
Boxplot
• To summarize distributions across strata
• Use boxplot(variable ~ strata) command
• For example, to plot the distribution of age across HLA-A and -B mismatches:
boxplot(data1$age ~ data1$abmm)
[Figure: boxplots, one box for each mismatch category 0 through 4]
Boxplot
• We can customize
  – outline=FALSE Turn off outliers
  – notch=TRUE Adds notches to the boxes
  – col="color" Fills boxes with color

boxplot(data1$los ~ data1$abmm, outline=FALSE, notch=TRUE, ylim=c(0,60),
        xlab="HLA-A and -B Mismatches", ylab="Inpatient Length of Stay", col="turquoise")
[Figure: notched boxplots by number of mismatches (0 to 4); x axis: HLA-A and -B Mismatches, y axis: Inpatient Length of Stay (0 to 60)]
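A self-contained boxplot sketch on simulated data (not the course data). Like hist(), boxplot() returns an object, which gives the numbers behind each box:

```r
# Simulated data: 30 hypothetical patients at each mismatch level.
set.seed(7)
abmm <- rep(0:4, each = 30)
los  <- rexp(150, rate = 1 / 20)   # right-skewed lengths of stay

# Draw the boxplot without outliers and keep the returned summary.
b <- boxplot(los ~ abmm,
             outline = FALSE,
             col     = "turquoise",
             xlab    = "HLA-A and -B Mismatches",
             ylab    = "Inpatient Length of Stay")

# One column per group: lower whisker, lower hinge, median, upper hinge, upper whisker.
b$stats
b$n   # number of observations in each group
```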
Annotating Graphs
• R has a few functions that make annotating graphs particularly easy
• 1. Add a vertical or horizontal line
  – abline(v=x) puts a vertical line at x
  – abline(h=y) puts a horizontal line at y
  – We usually add these lines at zero or at a null value
• 2. Add text to a graph
  – text(x, y, "text") puts “text” at (x,y)
  – We frequently add p-values to graphs
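A sketch that puts the annotation functions together on simulated data. The p-value here comes from cor.test(), which is an assumption made for illustration; any test result could be pasted onto the graph the same way:

```r
# Simulated ages and lengths of stay.
set.seed(11)
age <- runif(100, min = 18, max = 75)
los <- 10 + 0.3 * age + rnorm(100, sd = 8)

# A p-value to display (from a test of the correlation between age and LOS).
p <- cor.test(age, los)$p.value

plot(age, los, pch = 16,
     xlab = "Age at Transplant", ylab = "Total Length of Stay",
     main = "Correlation Between Age and LOS")

# Reference lines: horizontal at the mean LOS, vertical at age 40.
abline(h = mean(los), lty = 2)
abline(v = 40, lty = 3)

# Place the p-value near the top left of the plot.
text(x = 25, y = max(los), labels = paste("p =", format.pval(p, digits = 2)))
```

The lty argument sets the line type (2 is dashed, 3 is dotted), which helps keep reference lines visually distinct from the data.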
Exporting Graphics
• Graphics can be exported directly from RStudio
• PDF is the preferred format
  – Export | Save as PDF
• Other formats are available (eps, tiff, etc.)
  – Export | Save as image
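Graphs can also be saved from a script rather than through the menus, which makes the export repeatable. A minimal sketch using base R's pdf() device; the file name and the data are made up, and the file is written to a temporary folder so the example runs anywhere:

```r
# Hypothetical output file in a temporary folder.
out_file <- file.path(tempdir(), "age_hist.pdf")

# Open a PDF device, draw the plot, then close the device to write the file.
pdf(out_file, width = 6, height = 4)   # size in inches
hist(c(23, 35, 41, 47, 52, 58, 63, 66, 71),
     main = "Distribution of Age", xlab = "Age (Years)")
dev.off()

file.exists(out_file)
```

The plot does not appear on screen while the PDF device is open; everything drawn between pdf() and dev.off() goes to the file.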
Homework
• Revisit the homework from Lecture 2
• Do it all in R