Lecture3-R · 2016-07-18 · 7/18/16 1 Lecture 3: Programming Statistics in R Christopher S....


Lecture 3: Programming Statistics in R
Christopher S. Hollenbeak, PhD
Jane R. Schubart, PhD
The Outcomes Research Toolbox

Review

•  Questions from last lecture?
    –  Problems with Stata?
    –  Problems with Excel?

Review of Homework

•  Create age and HLA mismatch categories
•  Perform t tests for age, cold ischemia time, surgery duration, and HLA mismatches
•  Perform chi-square tests for sex and race
•  Use the summarize command with an “if” statement to get summaries
•  Move summaries to Excel and format
•  Make graph

cd /Volumes/Hollenbeak/Teaching/Residents/OutcomesResearch/
clear
insheet using "ltd_data.csv"

generate age039=0
replace age039=1 if age < 40
generate age4049=0
replace age4049=1 if age >= 40 & age < 50
generate age5059=0
replace age5059=1 if age >= 50 & age < 60
generate age60=0
replace age60=1 if age >= 60

generate ab0=0
replace ab0=1 if abmm==0

summarize age age039 age4049 age5059 age60 female male black nonblack coldtime dur_surg abmm ab0 ab1 ab2 ab3 ab4 if ssi==0, sep(0)
summarize age age039 age4049 age5059 age60 female male black nonblack coldtime dur_surg abmm ab0 ab1 ab2 ab3 ab4 if ssi==1, sep(0)

ttest age, by(ssi)
ttest coldtime, by(ssi)
ttest dur_surg, by(ssi)
ttest abmm, by(ssi)

tab female ssi, chi2
tab black ssi, chi2

twoway scatter los age, title("Relationship between Age and LOS") xtitle("Age (Years)") ytitle("Hospital Stay (Days)")

Stata Tip

•  To find a variable quickly, use the lookfor command
•  This will search all variable names and labels for your text
•  For example: lookfor bmi will identify two variables in the data set that contain the text “bmi”
    –  bmi, which is the patient’s BMI at listing
    –  bmi_d, which is the donor’s BMI at hepatectomy

Overview

•  R is a statistical computing environment
•  Based on the S language (like S-Plus)
•  Free and open source
    –  www.r-project.org
•  Extensive add-on packages available
    –  Also free and open source

Overview

•  R has become one of the most widely used statistical software environments
•  More flexible than Stata
•  More extensible than Stata
•  More up-to-date than Stata
•  R language is object-oriented
    –  Usually requires fewer steps
    –  Can be unintuitive
•  Did I mention it is free??

Common Tasks in R

•  Import data
•  Create variables
•  Subset data
•  Univariate statistics
    –  Chi-square tests
    –  T-tests
    –  ANOVA
•  Graphics
    –  Histograms
    –  Scatterplots
    –  Boxplots

R Interface

•  We use RStudio to run R
•  Process is (almost) the same as Stata
    –  Write a script file of commands
        •  Please make sure to save your script files
    –  Run blocks of commands and retrieve output
    –  Move to Word for cleaning
    –  Move output to Excel for formatting

R Interface

[Screenshot: RStudio window with panes labeled Script Files, Console / Text output, Browser, and Variables]

Differences Between R and Stata

•  Stata holds one data set only, and all commands apply to that data set
    –  R can hold multiple data sets
        •  Commands and variables must specify which data set they refer to
•  Stata code is commands, separated by spaces, with options after a single comma
•  R code is functions, with arguments and options inside parentheses, separated by commas
•  R is object-oriented, so the process is 1) create an object, 2) summarize the object

Setting the Working Directory

•  It is good practice to point R to a directory where it will thereafter retrieve and write files
•  The function is setwd()
•  For example, if I have my .csv file with liver transplant data sitting in a projects folder
•  Mac/Unix: setwd("~/projects/ltd/")
•  Windows: setwd("c:/users/chollenbeak/projects/ltd/")
    –  Note that directories are separated by forward slashes “/”, not backslashes “\”, even in Windows

Running a Command

•  To run a command in R:
    –  Highlight the line of code in your script file
    –  Mac: Cmd + Enter
    –  Windows: Ctrl + Enter

Import Data

•  Create a comma-separated values (CSV) text file
    –  Can be done in Excel
        •  Save As…CSV
    –  Variable names in first row
•  R stores data in objects called “data frames”
•  You can import, create, and subset as many data frames as you want during an R session

R Command: read.csv

•  Import the data using the command read.csv()
    –  e.g. data1 <- read.csv("c:/projects/ltd_data.csv")
    –  Or, if you have your working directory set to c:/projects/ already, you can simply use:
    –  data1 <- read.csv("ltd_data.csv")
•  This creates a data object called “data1”
•  When you want to use these data, refer to this object
•  The “<-” is called the “assignment operator” and assigns a value to the object
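A minimal sketch of the import step. The lecture file ltd_data.csv is not reproduced here, so the example first writes a tiny made-up CSV to a temporary file; the column names and values are illustrative assumptions, not the real transplant data.

```r
# Hypothetical toy data standing in for ltd_data.csv (values are made up)
toy <- data.frame(age = c(45, 52, 61), male = c(1, 0, 1), los = c(12, 30, 8))

# Write the toy data to a temporary CSV file, variable names in the first row
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Import the CSV into a data frame, as one would with a real file
data1 <- read.csv(csv_path)

str(data1)    # variable names and types
head(data1)   # first few rows
```

Running str() and head() right after an import is a quick way to confirm that the variable names and types came through as expected.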


Viewing Data

•  To view your data, click on the name of the data frame in the Environment window

•  The data set will then appear in the script window

Referring to Variables

•  All variables are associated with a data frame
•  To call a variable, use object$variable
•  For example, to compute the mean cost contained in data frame data1, use:
    –  mean(data1$cost1)

Create New Variables

•  Variables should be created inside a data frame
•  To create a new variable, use the assignment operator
•  Example: Assume you have a “male” dummy variable but no female dummy variable
•  Create a female dummy variable using:
    –  data1$female <- 1 - data1$male


•  Assume you have a sex variable coded as “M” and “F”
•  You need to create male and female dummy variables
•  Use the ifelse() function
    –  data$newvar <- ifelse(condition, value_if_true, value_if_false)
    –  data1$male <- ifelse(data1$gender=="M",1,0)
    –  data1$female <- 1 - data1$male
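A small runnable sketch of the ifelse() recode, using a made-up four-row data frame (the values are assumptions for illustration only):

```r
# Hypothetical data frame with a sex variable coded "M"/"F"
data1 <- data.frame(gender = c("M", "F", "F", "M"))

# ifelse() is vectorized: the condition is evaluated for every row
data1$male   <- ifelse(data1$gender == "M", 1, 0)
data1$female <- 1 - data1$male

# Cross-tabulate old and new variables to check the recode
table(data1$gender, data1$male)
```

Cross-tabulating the original variable against the new dummy is a cheap check that every category landed where it should.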


•  Assume you have a race variable coded as “White”, “Black”, “Hispanic”, “Asian”, and “Other”
•  You need to create race/ethnicity dummy variables
    –  data1$white <- ifelse(data1$race=="White",1,0)
    –  data1$black <- ifelse(data1$race=="Black",1,0)
    –  data1$hispanic <- ifelse(data1$race=="Hispanic",1,0)
    –  data1$asian <- ifelse(data1$race=="Asian",1,0)
    –  data1$other <- ifelse(data1$race=="Other",1,0)


•  Assume you have dummy variables for race and need a categorical version
•  You can nest the ifelse() functions
    –  data1$race <- ifelse(data1$white==1,"white",
                     ifelse(data1$black==1,"black",
                     ifelse(data1$hispanic==1,"hispanic",
                     ifelse(data1$asian==1,"asian","other"))))
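The nested ifelse() can be tried on a toy data frame. The five rows below are made up so that each category appears exactly once:

```r
# Hypothetical dummy variables; each row has at most one dummy equal to 1
data1 <- data.frame(white    = c(1, 0, 0, 0, 0),
                    black    = c(0, 1, 0, 0, 0),
                    hispanic = c(0, 0, 1, 0, 0),
                    asian    = c(0, 0, 0, 1, 0))

# Each ifelse() handles one category and hands the rest to the next one
data1$race <- ifelse(data1$white == 1, "white",
              ifelse(data1$black == 1, "black",
              ifelse(data1$hispanic == 1, "hispanic",
              ifelse(data1$asian == 1, "asian", "other"))))

table(data1$race)
```

Rows that match none of the conditions fall through to the final value, "other".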


Subsetting Data

•  To pull out a subset of observations, use the subset() command
•  Create a new data frame based on the original
    –  data_new <- subset(dataset_old, condition)
    –  e.g. data2 <- subset(data1, data1$male==1)

•  Like Stata, R distinguishes between the logical equals (==) and the assignment equals (=)
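A short sketch of subset() on made-up data. Inside subset(), columns can also be named without the data1$ prefix, and conditions can be combined with & (and) or | (or):

```r
# Hypothetical data (values are made up)
data1 <- data.frame(male = c(1, 0, 1, 0), age = c(45, 52, 61, 38))

# Column names are looked up inside data1, so the prefix is optional here
men       <- subset(data1, male == 1)
older_men <- subset(data1, male == 1 & age >= 50)

nrow(men)
nrow(older_men)
```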

Summary Statistics

•  R has functions for all the summary statistics you care to present
    –  N: length()
    –  Mean: mean()
    –  Standard Deviation: sd()
    –  Sum: sum()
    –  Minimum: min()
    –  Maximum: max()
•  However, it lacks a single, easy function to create a table of descriptive statistics
    –  No equivalent of “summarize” in Stata

Summary Statistics

•  The solution is to write your own
•  You can create your own functions in R using the function() command
•  Here is mine:

summarize <- function(var){
    n <- length(var)
    m <- mean(var)
    s <- sd(var)
    v <- sum(var)
    return(cbind(n, m, s, v))
}
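One possible way to use this function for several variables at once is to stack its one-row results with rbind(). This is a sketch on made-up data; the row labels are added by hand.

```r
# The summarize() function from the slide
summarize <- function(var) {
  n <- length(var)
  m <- mean(var)
  s <- sd(var)
  v <- sum(var)
  return(cbind(n, m, s, v))
}

# Hypothetical data (values are made up)
data1 <- data.frame(age = c(40, 50, 60), los = c(10, 20, 30))

# One row of statistics per variable, stacked into a single table
tab <- rbind(summarize(data1$age), summarize(data1$los))
rownames(tab) <- c("age", "los")
tab
```

Because each call returns a one-row matrix with the same column names, the stacked table can be copied out in one piece.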


Summary Statistics

•  To create a table of summary statistics
    –  1. Run the summarize command in R
    –  2. Copy the output to Word
    –  3. Replace spaces with tabs
        •  The special character for tab is “^t”
        •  May need to replace two adjacent tabs (^t^t) with a single tab
        •  The special character for paragraph is “^p”
    –  4. Copy to Excel
    –  5. Format in Excel

Univariate Statistics

•  Categorical variables
    –  Contingency tables
    –  Chi-square tests
•  Continuous variables
    –  T-tests
    –  ANOVA

Contingency Tables

•  To obtain a tabulation of data, use the table() command
•  For example, to get a tabulation of HLA-A and -B mismatches
    –  table(data1$abmm)
•  Note that the output is not formatted neatly
    –  Do your formatting in Excel

  0   1   2   3   4
 12  22 109 264 370

Contingency Tables

•  To get the proportions, find the total number of observations using the length() command and then divide the table by it (multiply by 100 for percents)

table(data1$abmm)/length(data1$abmm)
         0          1          2          3          4
0.01544402 0.02831403 0.14028314 0.33976834 0.47619048
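Base R also has prop.table(), which divides a table by its total and so gives the same proportions. A sketch with made-up mismatch counts:

```r
# Hypothetical mismatch values (made up, no missing values)
abmm <- c(0, 1, 2, 2, 3, 3, 3, 4, 4, 4)

counts <- table(abmm)
props  <- prop.table(counts)   # same as counts / length(abmm) when nothing is missing

round(100 * props, 1)          # percents, rounded to one decimal place
```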

Contingency Tables

•  To get a cross-tabulation, include both variables
•  e.g. To study whether mortality differs between men and women:

> table(data1$male, data1$died)
      0   1
  0 297  61
  1 337  82

Chi-square Test

•  To test whether the difference in the distribution is significant, use the chisq.test() command

•  Note that this command is applied to the table, not the variables

> chisq.test(table(data1$male, data1$died))

        Pearson's Chi-squared test with Yates' continuity correction

data:  table(data1$male, data1$died)
X-squared = 5.4022, df = 1, p-value = 0.02011
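For 2x2 tables, chisq.test() applies Yates' continuity correction by default, while Stata's tab ..., chi2 reports the uncorrected Pearson statistic, so the two programs can give slightly different answers. The sketch below uses simulated data (not the transplant data) and shows how to store the test as an object, pull out its pieces, and turn the correction off.

```r
set.seed(1)  # simulated data, for illustration only
male <- rbinom(200, 1, 0.5)
died <- rbinom(200, 1, 0.2)

tab     <- table(male, died)
test    <- chisq.test(tab)                   # Yates' correction (default for 2x2)
test_nc <- chisq.test(tab, correct = FALSE)  # uncorrected Pearson chi-square

test$statistic   # chi-square statistic
test$p.value     # p-value as a number, e.g. for annotating a graph
```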


T-Test

•  To compare the means between two groups, use the t.test() command
•  Syntax is
    –  t.test(depvar ~ indvar)
•  For example, to test whether the LOS differs between men and women:
    –  t.test(data1$los ~ data1$male)

        Welch Two Sample t-test

data:  data1$los by data1$male
t = 1.0511, df = 617.186, p-value = 0.2936
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.988466  6.568512
sample estimates:
mean in group 0 mean in group 1
       29.13966        26.84964
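R's default is the Welch test, which does not assume equal variances, whereas Stata's ttest assumes equal variances unless the unequal option is given. The sketch below, on simulated data, shows the data= form of the formula interface and the var.equal=TRUE option, which gives the pooled-variance test.

```r
set.seed(2)  # simulated data, for illustration only
data1 <- data.frame(male = rep(c(0, 1), each = 50),
                    los  = c(rnorm(50, mean = 29, sd = 10),
                             rnorm(50, mean = 27, sd = 10)))

# With data=, variables are named without the data1$ prefix
welch  <- t.test(los ~ male, data = data1)                    # default: Welch
pooled <- t.test(los ~ male, data = data1, var.equal = TRUE)  # pooled variance

welch$p.value
welch$conf.int
```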

ANOVA

•  To compare a continuous variable across more than two groups, use the command aov()
•  For example, to test whether LOS varies over HLA-A and -B mismatches:

anova1 <- aov(data1$los ~ data1$abmm)
summary(anova1)

ANOVA

•  aov() creates an ANOVA object called “anova1”

•  summary() gives us the results

> anova1 <- aov(data1$los ~ data1$abmm)
> summary(anova1)
             Df Sum Sq Mean Sq F value   Pr(>F)
data1$abmm    1  10700   10700   12.54 0.000423 ***
Residuals   775 661504     854
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
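The output above shows 1 degree of freedom for abmm, which is what aov() reports when the grouping variable is stored as a number: it fits a linear trend rather than comparing five group means. If a comparison of group means is wanted, one common approach is to wrap the variable in factor(). A sketch on simulated data:

```r
set.seed(3)  # simulated data, for illustration only
data1 <- data.frame(abmm = rep(0:4, each = 20))
data1$los <- rnorm(100, mean = 25 + 2 * data1$abmm, sd = 8)

# Numeric predictor: linear trend, 1 degree of freedom
trend  <- aov(los ~ abmm, data = data1)

# Factor predictor: compares the five group means, 4 degrees of freedom
groups <- aov(los ~ factor(abmm), data = data1)

summary(trend)
summary(groups)
```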

Missing Data

•  R stores missing values as “NA”
•  Many R functions cannot compute a result when the data contain missing values
    –  Functions such as mean() and sd() simply return NA and do not tell you why
•  You must identify variables with missing data and drop those observations
•  Use the is.na() function to identify an observation with missing values

Missing Data

•  Example: Assume you can tell that the male dummy variable has some missing values. You need to drop those observations

•  Strategy 1: Create a subset with no missing values
    –  data2 <- subset(data1, is.na(data1$male)==FALSE)
    –  t.test(data2$los ~ data2$male)
•  Strategy 2: Use original data set but limit the analysis to non-missing observations
    –  t.test(data1$los[is.na(data1$male)==FALSE] ~ data1$male[is.na(data1$male)==FALSE])
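A few base R tools that can help with these strategies, shown on a made-up data frame: sum(is.na()) counts missing values, na.rm=TRUE tells functions such as mean() to skip them, !is.na(x) is a shorter way to write is.na(x)==FALSE, and complete.cases() flags rows with no missing values in any column.

```r
# Hypothetical data with missing values (values are made up)
data1 <- data.frame(male = c(1, 0, NA, 1, 0, 1),
                    los  = c(10, 20, 15, NA, 30, 25))

sum(is.na(data1$male))          # number of missing values in male
mean(data1$los)                 # NA, because los has a missing value
mean(data1$los, na.rm = TRUE)   # mean of the non-missing values

# Drop rows where male is missing (equivalent to is.na(male)==FALSE)
data2 <- subset(data1, !is.na(male))

# Drop rows with a missing value in any column
data3 <- data1[complete.cases(data1), ]
```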


Graphics in R

•  R has fantastic graphing capabilities
•  Almost any aspect of the graph can be controlled by the user
    –  The opposite of the Stata approach
•  We will discuss three types of graphs
    –  Histograms
    –  Scatterplots
    –  Boxplots

Histogram

•  To summarize the distribution of a continuous variable, plot a histogram

•  The R command is hist()
•  To summarize the age of transplant recipients:
    –  hist(data1$age)

[Figure: “Histogram of data1$age”; x axis: data1$age, y axis: Frequency]

Histograms

•  Can improve this by changing options:
    –  main=""    Main title
    –  ylab=""    Y axis label
    –  xlab=""    X axis label
    –  col=""     Gives the boxes color
    –  box()      Adds a box around the inner plot
•  R graphics are extremely flexible and customizable!

hist(data1$cci_index, ylab="Frequency", xlab="Comorbidities",
     main="Distribution of Charlson Comorbidity Index", col="red", breaks=25)
box()

[Figure: histogram titled “Distribution of Age”; x axis: age, y axis: Frequency]

Scatterplot

•  To create a scatterplot use the plot() function

•  For example, to look at the correlation between age and length of stay

plot(data1$age, data1$los)

[Figure: scatterplot; x axis: data1$age, y axis: data1$los]

Scatterplot

•  Again we can customize with additional options
    –  pch=x                 Change the plot symbol
    –  xlim=c(lower,upper)   Set x axis limits
    –  ylim=c(lower,upper)   Set y axis limits
    –  cex=x                 Symbol scaling factor
        •  Default is 1, half size is .5, double size is 2, etc.

plot(data1$age, data1$los, ylab="Total Length of Stay", xlab="Age at Transplant",
     main="Correlation Between Age and LOS", pch=16, cex=1.25,
     xlim=c(0,80), ylim=c(0,400))

[Figure: scatterplot titled “Correlation Between Age and LOS”; x axis: Age at Transplant, y axis: Total Length of Stay]

Boxplot

•  To summarize distributions across strata
•  Use boxplot(variable ~ strata) command
•  For example, to plot the distribution of age across HLA-A and -B mismatches:

boxplot(data1$age ~ data1$abmm)

[Figure: boxplots by number of HLA-A and -B mismatches (0 to 4)]

Boxplot

•  We can customize
    –  outline=FALSE   Turn off outliers
    –  notch=TRUE      Adds notches to the boxes
    –  col="color"     Fills boxes with color

boxplot(data1$los ~ data1$abmm, outline=FALSE, notch=TRUE, ylim=c(0,60),
        xlab="HLA-A and -B Mismatches", ylab="Inpatient Length of Stay", col="turquoise")

[Figure: notched boxplots; x axis: HLA-A and -B Mismatches, y axis: Inpatient Length of Stay]

Annotating Graphs

•  R has a few functions that make annotating graphs particularly easy

•  1. Add a vertical or horizontal line
    –  abline(v=x) puts a vertical line at x
    –  abline(h=y) puts a horizontal line at y
    –  We usually add these lines at zero or at a null value
•  2. Add text to a graph
    –  text(x, y, "text") puts “text” at (x,y)
    –  We frequently add p-values to graphs
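A sketch that combines both annotations on simulated data. The choice of cor.test() for the p-value and the positions of the lines and the text are illustrative assumptions, not part of the lecture.

```r
set.seed(4)  # simulated data, for illustration only
age <- runif(100, 20, 75)
los <- rnorm(100, mean = 28, sd = 10)

plot(age, los, pch = 16, xlab = "Age at Transplant", ylab = "Total Length of Stay")
abline(h = mean(los), lty = 2)   # dashed horizontal line at the mean LOS
abline(v = 50)                   # vertical line at age 50

# Compute a p-value and print it on the graph
p <- cor.test(age, los)$p.value
text(x = 30, y = max(los), labels = paste("p =", round(p, 3)))
```

abline() and text() draw on the most recent plot, so they must be run after plot().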

Exporting Graphics

•  Graphics can be exported directly from RStudio
•  PDF is the preferred format
    –  Export | Save as PDF
•  Other formats are available (eps, tiff, etc.)
    –  Export | Save as image
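Graphs can also be saved from a script rather than through the menu, using the pdf() graphics device in base R. A minimal sketch, with a hypothetical file name and made-up ages:

```r
# Hypothetical output file in a temporary directory
pdf_path <- file.path(tempdir(), "age_hist.pdf")

pdf(pdf_path, width = 6, height = 4)   # open a PDF device (size in inches)
hist(c(23, 35, 41, 47, 52, 58, 63, 66),
     main = "Distribution of Age", xlab = "Age")
box()
dev.off()                              # close the device so the file is written

file.exists(pdf_path)
```

Saving from the script means the figure can be regenerated exactly whenever the data or the code change.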

Homework

•  Revisit the homework from Lecture 2
•  Do it all in R
