Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204...
Transcript of Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204...
Stat 204, Part 1 Data
Chapter 3: Association: Contingency, Correlation, and Regression
These notes reflect material from our text, Statistics: The Art and Science of Learning from Data,Third Edition, by Alan Agresti and Catherine Franklin, published by Pearson, 2013.
Hubble and the recession of galaxies, 1929
During the 1920’s, Edwin Hubble used the 100 inch telescope of the Mount Wilson Observatory nearLos Angeles, which at that time was the largest telescope in the world, to explore faint objects in thesky known as “spiral nebulae.” Hubble’s observations and measurements of their distances and spectraestablished that they were galaxies, each with its own astronomical number of stars, and that they wereall receding from the Earth and from each other. His results documenting the recession of galaxies werepublished in 1929 and constituted a major advance in our understanding of cosmology.
0.0 0.5 1.0 1.5 2.0
-200
0200
400
600
800
1000
Hubble's data, 1929
distance (Mpc)
rece
ssio
n ve
loci
ty (k
m/s
ec)
The relationship that Hubble discovered is summarized in the equation
V = H0D,
where V is the velocity of the galaxy, D is its distance to earth, and H0 is a constant now known as Hubble’sconstant. The units for Hubble’s constant are strange, km per second per megaparsec, but the meaning ofthe equation is clear: velocity is proportional to distance. Hubble’s constant is the slope of the regressionline in this image.
Spring 2015 Page 1 of 11
Stat 204, Part 1 Data
Categorical data
Contingency tables are compact representations of the values of categorical variables. We need to beable to represent them on paper and in our computers. Wikipedia illustrates the concept of a contingencytable with this example displaying the numbers of right-handed and left-handed people in a small sample ofmen and women. By convention, levels of explanatory variables are in rows and levels of response variablesare in columns.
right leftm 43 9f 44 4
# contingency table : handedness (Wikipedia)
# http://en.wikipedia.org/wiki/Contingency_table
handed <- c(43, 44, 9, 4)
dim(handed) <- c(2, 2)
dimnames(handed) <- list(Gender=c("m", "f"),
Handed=c("right", "left"))
# Handed
# Gender right left
# m 43 9
# f 44 4
mosaicplot(handed,
col=c("orange", "yellow"),
main="Handedness")
Handedness
Gender
Handed
m f
right
left
Spring 2015 Page 2 of 11
Stat 204, Part 1 Data
Summarizing categorical data:
• counts
• percentages, conditional proportions
Use addmargins to add row and col sums to a contingency table.
# margins
handed.margins <- addmargins(handed)
# Handed
# Gender right left Sum
# m 43 9 52
# f 44 4 48
# Sum 87 13 100
Use prop.table to calculate conditional proportions.
# conditional proportions
# divide each row by its row sum
prop.table(handed, 1)
# Handed
# Gender right left
# m 0.8269231 0.17307692
# f 0.9166667 0.08333333
# divide each col by its col sum
prop.table(handed, 2)
# Handed
# Gender right left
# m 0.4942529 0.6923077
# f 0.5057471 0.3076923
Displaying categorical data:
• frequency table (one set of categories)
• bar plot
• contingency table, r × c (two sets of categories)
• mosaic plot
• side-by-side bar plots
• segmented bar plots
Spring 2015 Page 3 of 11
Stat 204, Part 1 Data
A mosaic plot is an effective representation of an r × c contingency table. In effect, the mosaic plottransposes the matrix. The first variable splits a big square into W-E blocks. The second variable splitsthose blocks into N-S sub-blocks. The dimensions of the blocks carry the information (not the areas of therectangles). Designated colors will denote the levels of the last (response) variable.
Social science majors and non-social science majors were asked if they are in a relationship. Are thesocial science majors ... ahem ... more social?
Relationship
Major
Relationship
SS not SS
yes
no
Is happiness associated with family income?
Happiness
above.average average below.average
Not.too.happy
Pretty.happy
Very.happy
Spring 2015 Page 4 of 11
Stat 204, Part 1 Data
Modeling (numerical response)
explanatory variable, response variable
regression line
Linear model - one explanatory variable
Explanatory (predictor) and response (predicted) variables.
Fitting y ∼ x results in a linear model, y = β0 + β1x
Point estimates for β0 and β1 are determined from the sample and are denoted b0 and b1
An association is characterized by its direction (positive or negative), form (linear or non-linear) andstrength (which for linear relationships is measured by the correlation)
Correlation coefficient, ρ
Residual, ei = yi − yi
Least squares line
Conditions for least squares : (1) nearly linear relationship, (2) nearly normal residuals, (3) with nearlyconstant variability.
Calculate b0 and b1
The slope of the line is proportional to the correlation : b1 = ρ sy/sx
The center of mass of the sample lies on the least squares line : (x, y) lies on the regression line.
Use a least squares line to predict y from x : y = b0 + b1x
R2 is the proportion of the response variation explained by the linear model.
Spring 2015 Page 5 of 11
Stat 204, Part 1 Data
Using R for Linear Models
Use Plot to plot the points, lm to create the linear model, and abline to plot the regression line.coefficients will extract the coefficients of the regression line from the linear model.
# population of SC
# plot the population of SC for the years 1900-1960
years <- seq(from=1900, to=1960, by=10)
SC.population <- c(1340, 1515, 1684, 1739, 1900, 2117, 2383)
plot(years, SC.population,
pch=19, col="darkred",
xlab="Year", ylab="Population (in thousands of people)",
main="Population of SC")
SC.population.lm <- lm(SC.population ~ years)
abline(SC.population.lm,
col="orange")
coefficients(SC.population.lm)
# (Intercept) years
# -29544.46429 16.24643
1900 1910 1920 1930 1940 1950 1960
1400
1600
1800
2000
2200
2400
Population of SC
Year
Pop
ulat
ion
(in th
ousa
nds
of p
eopl
e)
Spring 2015 Page 6 of 11
Stat 204, Part 1 Data
Prediction.
# population of SC : problem with extrapolation
# use the regression line to predict the population in 2010
# plot the actual population in 2010
a <- coefficients(SC.population.lm)[1]
b <- coefficients(SC.population.lm)[2]
predict <- function(year){
y.hat <- a + b * year
return(y.hat)
}
predict(2010) # predicted population in 2010 (in green on the graph)
# 3110.857
years.new <- append(years, c(2010))
SC.population.new <- append(SC.population, c(4625))
plot(years.new, SC.population.new,
pch=19, col="darkred",
xlab="Year", ylab="Population (in thousands of people)",
main="Population of SC")
abline(SC.population.lm,
col="orange")
points(x=2010, y=predict(2010),
col="green")
1900 1920 1940 1960 1980 2000
1500
2000
2500
3000
3500
4000
4500
Population of SC
Year
Pop
ulat
ion
(in th
ousa
nds
of p
eopl
e)
Spring 2015 Page 7 of 11
Stat 204, Part 1 Data
Correlation
The correlation measures the strength of an association. Just how tightly do the data points clusterabout the least-squares regression line? Or do the data points form a diffuse cloud with little suggestionof a linear trend? Correlation provides a quantitative answer. It is a number between -1 and 1, and it iscalculated in R with a command like cor(x, y).
Note that the correlation, r, can vary over the range 0 < r ≤ 1 for any fixed regression line with positiveslope, just as it can vary over the range −1 ≤ r < 0 for any fixed regression line with negative slope. If thecorrelation is 0, the regression line is horizontal. Code for the image shown below appeared recently in theR-bloggers forum (Tony Hirst, Sketching Scatterplots to Demonstrate Different Correlations, R-bloggers,December 17, 2014).
-1.0 -0.8 -0.5
-0.2 0.0 0.2
0.5 0.8 1.0
-2
0
2
-2
0
2
-2
0
2
-2 0 2 -2 0 2 -2 0 2x
y
Spring 2015 Page 8 of 11
Stat 204, Part 1 Data
Importing Data into R
The Agresti-Franklin textbook comes with a set of data files, and its exercises sometimes make use ofthem. This note explains how to import the information contained in such data files into R.
First of all, copy the Agresti-Franklin data files into a convenient place on your computer where theywill stay for the rest of the semester. Notice that there are three sets of files contained in three folders: CSVfiles, Text files, and TI 83-84 Files. A csv file is a plain-text file which has its records (observations)on separate rows and the values associated with each observation separated by commas (csv = comma-separated-values). The text files are similar, except that the values of each observation are separated bytabs. You can choose to use either type of file, but the command that you use to import the data into Ris slightly different for each file type.
Launch R. On a Mac, go to the Misc menu and select Get Working Directory. Notice that R answersby displaying the current working directory. R always expects to read files from and write files to the cur-rent working directory. Go to Misc : Change Working Directory ... and navigate to the directorythat contains the file you want to import into R, for instance the folder CSV files which contains the filemountain bike.csv which we will use in this chapter. The R command dir() will now display all the filesthat R can see in that folder.
Since we wish to import a csv file into R which contains data about mountain bikes, an appropriatecommand might be
mountain.bike <- read.csv("mountain_bike.csv", header=TRUE)
The parameter header was set to TRUE because the first row of the data inside the file contains the namesof the variables for each of the columns in that file. Verify that the value of the symbol mountain bike isnow the data frame that was created from that file.
mountain.bike
# price weight suspension price_FE weight_FE price_FU weight_FU
# 1 1000 32 FU 700 29 1000 32
# 2 940 34 FU 600 28 940 34
# 3 1100 30 FU 440 29 1100 30
# ...
On a Mac, an easy way to reset the current working directory to a new folder is to simply “grab” thatfolder with your mouse, drag it across the computer screen, and “drop” it on the icon in the dock that youordinarily click on to launch R. Notice that R responds by executing the setwd() command (set workingdirectory) in the console, with an argument which is the path to that new directory.
To import the analogous text file mountain bike.txt in the folder Text files, begin by making theText files folder the active directory, either by using the commands in the Misc menu or by draggingthe Text files folder onto the R icon in the dock. Now execute a command such as the following:
mountain.bike <- read.table("mountain_bike.txt", header=TRUE, sep="\t")
and verify that the value of the symbol mountain.bike is now the desired data frame.
mountain.bike
# price weight suspension price_FE weight_FE price_FU weight_FU
# 1 1000 32 FU 700 29 1000 32
# 2 940 34 FU 600 28 940 34
# 3 1100 30 FU 440 29 1100 30
# ...
Spring 2015 Page 9 of 11
Stat 204, Part 1 Data
The parameter
sep="\t"
tells read.table that values are separated by tabs.
Loading Packages into R
R consists of a core programming language, which is maintained and improved by a small group of in-ternational collaborators, and an impressive array of some 5000 packages, which are collections of specialtycode written by a much more diverse collection of contributors. The packages that we use most often areautomatically loaded into R, and we don’t even have to think about them. Other packages are distributedwith R and are already on our machines, but we have to request that they be loaded into R’s workspacebefore they can be used. Finally, many more packages are not on our machines but are stored in CRANand can be downloaded over the Internet in just a few moments.
For one of the exercises in chapter 3, it would be nice to use a procedure called scatterplot fromthe package car. This package is distributed with R, and is already on our machines but is not normallyloaded. To load the car package and use the procedure scatterplot, evaluate the following code.
# 3.57 (education)
education <- c(70, 75, 80, 85, 55, 58, 60, 65)
crime.rate <- c(140, 120, 110, 105, 50, 40, 30, 25)
setting <- c(rep("urban", 4), rep("rural", 4))
crime <- data.frame(education, crime.rate, setting)
crime
library(car)
scatterplot(crime.rate ~ education | setting, data=crime,
legend.coords="topleft")
55 60 65 70 75 80 85
4060
80100
120
140
education
crime.rate
settingruralurban
Spring 2015 Page 10 of 11
Stat 204, Part 1 Data
The command
library(car)
loads the car package into R’s workspace so that the command scatterplot becomes available. It hasto be evaluated once per R session in order to be able to use its contents.
Another useful package, named openintro, contains code for a procedure called treeDiag, written byDavid Diez of Google, that we can use in chapter 5 to draw tree diagrams, which are helpful for calculatingcertain probabilities. The openintro package is not distributed with R, but it is available on CRAN. Ona Mac, to download that package to your machine, select the menu item Packages & Data : Package
Installer and type openintro into the Package Search box. Click on Get List. A request is sent to CRANfor the number of the latest version available. When that number comes back, select the line containingit in the package box, select the check box for Install Dependencies, and click on Install Selected. After amoment, you will see some action over on the R console. The package is now inside R on your machine andwill remain there. From now on, in order to use the openintro software, you just execute the command,
library(openintro)
once per session. Download the openintro package now, and test it by executing the following code.
library(openintro)
treeDiag(c(’Job status’,’Self-classification’),
c(0.76,0.24),
list(c(0.48,0.52), c(0.42,0.58)),
textwd=0.2, solwd=0.35, cex.main=1.4,
c(’unemployed’,’actively disengaged’), c(’thriving’,’struggling’),
digits=5, col.main="orangered", showWork=TRUE)
Job status Self-classification
unemployed, 0.76
thriving, 0.48 0.76*0.48 = 0.3648
struggling, 0.52 0.76*0.52 = 0.3952
actively disengaged, 0.24
thriving, 0.42 0.24*0.42 = 0.1008
struggling, 0.58 0.24*0.58 = 0.1392
Spring 2015 Page 11 of 11
Stat 204, Part 1 Data
On Writing
“Work on a good piece of writing proceeds on three levels : a musical one, where it is composed; anarchitectural one, where it is constructed; and finally, a textile one, where it is woven.” Walter Benjamin,quoted by Roger Cohen in the International New York Times, January 20, 2014.
Exercises
We will attempt to solve some of the following exercises as a community project in class today. Finish thesesolutions as homework exercises, write them up carefully and clearly, and hand them in at the beginningof class next Friday.
Exercises for Chapter 3: 3.3 (happy), 3.5 (alcohol), 3.9 (gender), 3.17 (r), 3.21 (bikes), 3.22 (sandwiches),3.39 (study time), 3.41 (bikes), 3.42 (bikes), 3.57 (education), 3.60 (confounder?)
Class work 3a – association
Exercises from Chapter 3: 3.3 (happy), 3.5 (alcohol), 3.9 (gender), 3.17 (r), 3.21 (bikes), 3.22 (sandwiches)
Class work 3b – association
Exercises from Chapter 3: 3.39 (study time), 3.41 (bikes), 3.42 (bikes), 3.57 (education), 3.60 (confounder?)
Spring 2015 Page 12 of 11