Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204...

12
Stat 204, Part 1 Data Chapter 3: Association: Contingency, Correlation, and Regression These notes reflect material from our text, Statistics: The Art and Science of Learning from Data, Third Edition, by Alan Agresti and Catherine Franklin, published by Pearson, 2013. Hubble and the recession of galaxies, 1929 During the 1920’s, Edwin Hubble used the 100 inch telescope of the Mount Wilson Observatory near Los Angeles, which at that time was the largest telescope in the world, to explore faint objects in the sky known as “spiral nebulae.” Hubble’s observations and measurements of their distances and spectra established that they were galaxies, each with its own astronomical number of stars, and that they were all receding from the Earth and from each other. His results documenting the recession of galaxies were published in 1929 and constituted a major advance in our understanding of cosmology. 0.0 0.5 1.0 1.5 2.0 -200 0 200 400 600 800 1000 Hubble's data, 1929 distance (Mpc) recession velocity (km/sec) The relationship that Hubble discovered is summarized in the equation V = H 0 D, where V is the velocity of the galaxy, D is its distance to earth, and H 0 is a constant now known as Hubble’s constant. The units for Hubble’s constant are strange, km per second per megaparsec, but the meaning of the equation is clear: velocity is proportional to distance. Hubble’s constant is the slope of the regression line in this image. Spring 2015 Page 1 of 11

Transcript of Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204...

Page 1: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Chapter 3: Association: Contingency, Correlation, and Regression

These notes reflect material from our text, Statistics: The Art and Science of Learning from Data,Third Edition, by Alan Agresti and Catherine Franklin, published by Pearson, 2013.

Hubble and the recession of galaxies, 1929

During the 1920’s, Edwin Hubble used the 100 inch telescope of the Mount Wilson Observatory nearLos Angeles, which at that time was the largest telescope in the world, to explore faint objects in thesky known as “spiral nebulae.” Hubble’s observations and measurements of their distances and spectraestablished that they were galaxies, each with its own astronomical number of stars, and that they wereall receding from the Earth and from each other. His results documenting the recession of galaxies werepublished in 1929 and constituted a major advance in our understanding of cosmology.

0.0 0.5 1.0 1.5 2.0

-200

0200

400

600

800

1000

Hubble's data, 1929

distance (Mpc)

rece

ssio

n ve

loci

ty (k

m/s

ec)

The relationship that Hubble discovered is summarized in the equation

V = H0D,

where V is the velocity of the galaxy, D is its distance to earth, and H0 is a constant now known as Hubble’sconstant. The units for Hubble’s constant are strange, km per second per megaparsec, but the meaning ofthe equation is clear: velocity is proportional to distance. Hubble’s constant is the slope of the regressionline in this image.

Spring 2015 Page 1 of 11

Page 2: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Categorical data

Contingency tables are compact representations of the values of categorical variables. We need to beable to represent them on paper and in our computers. Wikipedia illustrates the concept of a contingencytable with this example displaying the numbers of right-handed and left-handed people in a small sample ofmen and women. By convention, levels of explanatory variables are in rows and levels of response variablesare in columns.

right leftm 43 9f 44 4

# contingency table : handedness (Wikipedia)

# http://en.wikipedia.org/wiki/Contingency_table

handed <- c(43, 44, 9, 4)

dim(handed) <- c(2, 2)

dimnames(handed) <- list(Gender=c("m", "f"),

Handed=c("right", "left"))

# Handed

# Gender right left

# m 43 9

# f 44 4

mosaicplot(handed,

col=c("orange", "yellow"),

main="Handedness")

Handedness

Gender

Handed

m f

right

left

Spring 2015 Page 2 of 11

Page 3: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Summarizing categorical data:

• counts

• percentages, conditional proportions

Use addmargins to add row and col sums to a contingency table.

# margins

handed.margins <- addmargins(handed)

# Handed

# Gender right left Sum

# m 43 9 52

# f 44 4 48

# Sum 87 13 100

Use prop.table to calculate conditional proportions.

# conditional proportions

# divide each row by its row sum

prop.table(handed, 1)

# Handed

# Gender right left

# m 0.8269231 0.17307692

# f 0.9166667 0.08333333

# divide each col by its col sum

prop.table(handed, 2)

# Handed

# Gender right left

# m 0.4942529 0.6923077

# f 0.5057471 0.3076923

Displaying categorical data:

• frequency table (one set of categories)

• bar plot

• contingency table, r × c (two sets of categories)

• mosaic plot

• side-by-side bar plots

• segmented bar plots

Spring 2015 Page 3 of 11

Page 4: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

A mosaic plot is an effective representation of an r × c contingency table. In effect, the mosaic plottransposes the matrix. The first variable splits a big square into W-E blocks. The second variable splitsthose blocks into N-S sub-blocks. The dimensions of the blocks carry the information (not the areas of therectangles). Designated colors will denote the levels of the last (response) variable.

Social science majors and non-social science majors were asked if they are in a relationship. Are thesocial science majors ... ahem ... more social?

Relationship

Major

Relationship

SS not SS

yes

no

Is happiness associated with family income?

Happiness

above.average average below.average

Not.too.happy

Pretty.happy

Very.happy

Spring 2015 Page 4 of 11

Page 5: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Modeling (numerical response)

explanatory variable, response variable

regression line

Linear model - one explanatory variable

Explanatory (predictor) and response (predicted) variables.

Fitting y ∼ x results in a linear model, y = β0 + β1x

Point estimates for β0 and β1 are determined from the sample and are denoted b0 and b1

An association is characterized by its direction (positive or negative), form (linear or non-linear) andstrength (which for linear relationships is measured by the correlation)

Correlation coefficient, ρ

Residual, ei = yi − yi

Least squares line

Conditions for least squares : (1) nearly linear relationship, (2) nearly normal residuals, (3) with nearlyconstant variability.

Calculate b0 and b1

The slope of the line is proportional to the correlation : b1 = ρ sy/sx

The center of mass of the sample lies on the least squares line : (x, y) lies on the regression line.

Use a least squares line to predict y from x : y = b0 + b1x

R2 is the proportion of the response variation explained by the linear model.

Spring 2015 Page 5 of 11

Page 6: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Using R for Linear Models

Use Plot to plot the points, lm to create the linear model, and abline to plot the regression line.coefficients will extract the coefficients of the regression line from the linear model.

# population of SC

# plot the population of SC for the years 1900-1960

years <- seq(from=1900, to=1960, by=10)

SC.population <- c(1340, 1515, 1684, 1739, 1900, 2117, 2383)

plot(years, SC.population,

pch=19, col="darkred",

xlab="Year", ylab="Population (in thousands of people)",

main="Population of SC")

SC.population.lm <- lm(SC.population ~ years)

abline(SC.population.lm,

col="orange")

coefficients(SC.population.lm)

# (Intercept) years

# -29544.46429 16.24643

1900 1910 1920 1930 1940 1950 1960

1400

1600

1800

2000

2200

2400

Population of SC

Year

Pop

ulat

ion

(in th

ousa

nds

of p

eopl

e)

Spring 2015 Page 6 of 11

Page 7: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Prediction.

# population of SC : problem with extrapolation

# use the regression line to predict the population in 2010

# plot the actual population in 2010

a <- coefficients(SC.population.lm)[1]

b <- coefficients(SC.population.lm)[2]

predict <- function(year){

y.hat <- a + b * year

return(y.hat)

}

predict(2010) # predicted population in 2010 (in green on the graph)

# 3110.857

years.new <- append(years, c(2010))

SC.population.new <- append(SC.population, c(4625))

plot(years.new, SC.population.new,

pch=19, col="darkred",

xlab="Year", ylab="Population (in thousands of people)",

main="Population of SC")

abline(SC.population.lm,

col="orange")

points(x=2010, y=predict(2010),

col="green")

1900 1920 1940 1960 1980 2000

1500

2000

2500

3000

3500

4000

4500

Population of SC

Year

Pop

ulat

ion

(in th

ousa

nds

of p

eopl

e)

Spring 2015 Page 7 of 11

Page 8: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Correlation

The correlation measures the strength of an association. Just how tightly do the data points clusterabout the least-squares regression line? Or do the data points form a diffuse cloud with little suggestionof a linear trend? Correlation provides a quantitative answer. It is a number between -1 and 1, and it iscalculated in R with a command like cor(x, y).

Note that the correlation, r, can vary over the range 0 < r ≤ 1 for any fixed regression line with positiveslope, just as it can vary over the range −1 ≤ r < 0 for any fixed regression line with negative slope. If thecorrelation is 0, the regression line is horizontal. Code for the image shown below appeared recently in theR-bloggers forum (Tony Hirst, Sketching Scatterplots to Demonstrate Different Correlations, R-bloggers,December 17, 2014).

-1.0 -0.8 -0.5

-0.2 0.0 0.2

0.5 0.8 1.0

-2

0

2

-2

0

2

-2

0

2

-2 0 2 -2 0 2 -2 0 2x

y

Spring 2015 Page 8 of 11

Page 9: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

Importing Data into R

The Agresti-Franklin textbook comes with a set of data files, and its exercises sometimes make use ofthem. This note explains how to import the information contained in such data files into R.

First of all, copy the Agresti-Franklin data files into a convenient place on your computer where theywill stay for the rest of the semester. Notice that there are three sets of files contained in three folders: CSVfiles, Text files, and TI 83-84 Files. A csv file is a plain-text file which has its records (observations)on separate rows and the values associated with each observation separated by commas (csv = comma-separated-values). The text files are similar, except that the values of each observation are separated bytabs. You can choose to use either type of file, but the command that you use to import the data into Ris slightly different for each file type.

Launch R. On a Mac, go to the Misc menu and select Get Working Directory. Notice that R answersby displaying the current working directory. R always expects to read files from and write files to the cur-rent working directory. Go to Misc : Change Working Directory ... and navigate to the directorythat contains the file you want to import into R, for instance the folder CSV files which contains the filemountain bike.csv which we will use in this chapter. The R command dir() will now display all the filesthat R can see in that folder.

Since we wish to import a csv file into R which contains data about mountain bikes, an appropriatecommand might be

mountain.bike <- read.csv("mountain_bike.csv", header=TRUE)

The parameter header was set to TRUE because the first row of the data inside the file contains the namesof the variables for each of the columns in that file. Verify that the value of the symbol mountain bike isnow the data frame that was created from that file.

mountain.bike

# price weight suspension price_FE weight_FE price_FU weight_FU

# 1 1000 32 FU 700 29 1000 32

# 2 940 34 FU 600 28 940 34

# 3 1100 30 FU 440 29 1100 30

# ...

On a Mac, an easy way to reset the current working directory to a new folder is to simply “grab” thatfolder with your mouse, drag it across the computer screen, and “drop” it on the icon in the dock that youordinarily click on to launch R. Notice that R responds by executing the setwd() command (set workingdirectory) in the console, with an argument which is the path to that new directory.

To import the analogous text file mountain bike.txt in the folder Text files, begin by making theText files folder the active directory, either by using the commands in the Misc menu or by draggingthe Text files folder onto the R icon in the dock. Now execute a command such as the following:

mountain.bike <- read.table("mountain_bike.txt", header=TRUE, sep="\t")

and verify that the value of the symbol mountain.bike is now the desired data frame.

mountain.bike

# price weight suspension price_FE weight_FE price_FU weight_FU

# 1 1000 32 FU 700 29 1000 32

# 2 940 34 FU 600 28 940 34

# 3 1100 30 FU 440 29 1100 30

# ...

Spring 2015 Page 9 of 11

Page 10: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

The parameter

sep="\t"

tells read.table that values are separated by tabs.

Loading Packages into R

R consists of a core programming language, which is maintained and improved by a small group of in-ternational collaborators, and an impressive array of some 5000 packages, which are collections of specialtycode written by a much more diverse collection of contributors. The packages that we use most often areautomatically loaded into R, and we don’t even have to think about them. Other packages are distributedwith R and are already on our machines, but we have to request that they be loaded into R’s workspacebefore they can be used. Finally, many more packages are not on our machines but are stored in CRANand can be downloaded over the Internet in just a few moments.

For one of the exercises in chapter 3, it would be nice to use a procedure called scatterplot fromthe package car. This package is distributed with R, and is already on our machines but is not normallyloaded. To load the car package and use the procedure scatterplot, evaluate the following code.

# 3.57 (education)

education <- c(70, 75, 80, 85, 55, 58, 60, 65)

crime.rate <- c(140, 120, 110, 105, 50, 40, 30, 25)

setting <- c(rep("urban", 4), rep("rural", 4))

crime <- data.frame(education, crime.rate, setting)

crime

library(car)

scatterplot(crime.rate ~ education | setting, data=crime,

legend.coords="topleft")

55 60 65 70 75 80 85

4060

80100

120

140

education

crime.rate

settingruralurban

Spring 2015 Page 10 of 11

Page 11: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

The command

library(car)

loads the car package into R’s workspace so that the command scatterplot becomes available. It hasto be evaluated once per R session in order to be able to use its contents.

Another useful package, named openintro, contains code for a procedure called treeDiag, written byDavid Diez of Google, that we can use in chapter 5 to draw tree diagrams, which are helpful for calculatingcertain probabilities. The openintro package is not distributed with R, but it is available on CRAN. Ona Mac, to download that package to your machine, select the menu item Packages & Data : Package

Installer and type openintro into the Package Search box. Click on Get List. A request is sent to CRANfor the number of the latest version available. When that number comes back, select the line containingit in the package box, select the check box for Install Dependencies, and click on Install Selected. After amoment, you will see some action over on the R console. The package is now inside R on your machine andwill remain there. From now on, in order to use the openintro software, you just execute the command,

library(openintro)

once per session. Download the openintro package now, and test it by executing the following code.

library(openintro)

treeDiag(c(’Job status’,’Self-classification’),

c(0.76,0.24),

list(c(0.48,0.52), c(0.42,0.58)),

textwd=0.2, solwd=0.35, cex.main=1.4,

c(’unemployed’,’actively disengaged’), c(’thriving’,’struggling’),

digits=5, col.main="orangered", showWork=TRUE)

Job status Self-classification

unemployed, 0.76

thriving, 0.48 0.76*0.48 = 0.3648

struggling, 0.52 0.76*0.52 = 0.3952

actively disengaged, 0.24

thriving, 0.42 0.24*0.42 = 0.1008

struggling, 0.58 0.24*0.58 = 0.1392

Spring 2015 Page 11 of 11

Page 12: Chapter 3: Association: Contingency, Correlation, and Regressioncparrish.sewanee.edu/stat204 S2015/notes/part 01 data/… ·  · 2015-01-19Chapter 3: Association: Contingency, Correlation,

Stat 204, Part 1 Data

On Writing

“Work on a good piece of writing proceeds on three levels : a musical one, where it is composed; anarchitectural one, where it is constructed; and finally, a textile one, where it is woven.” Walter Benjamin,quoted by Roger Cohen in the International New York Times, January 20, 2014.

Exercises

We will attempt to solve some of the following exercises as a community project in class today. Finish thesesolutions as homework exercises, write them up carefully and clearly, and hand them in at the beginningof class next Friday.

Exercises for Chapter 3: 3.3 (happy), 3.5 (alcohol), 3.9 (gender), 3.17 (r), 3.21 (bikes), 3.22 (sandwiches),3.39 (study time), 3.41 (bikes), 3.42 (bikes), 3.57 (education), 3.60 (confounder?)

Class work 3a – association

Exercises from Chapter 3: 3.3 (happy), 3.5 (alcohol), 3.9 (gender), 3.17 (r), 3.21 (bikes), 3.22 (sandwiches)

Class work 3b – association

Exercises from Chapter 3: 3.39 (study time), 3.41 (bikes), 3.42 (bikes), 3.57 (education), 3.60 (confounder?)

Spring 2015 Page 12 of 11