Chapter 4: Describing Relationships: Scatterplots and Correlation

Post on 19-Dec-2015


Chapter 4

Describing Relationships: Scatterplots and Correlation

Objectives (BPS chapter 4)

Relationships: Scatterplots and correlation

Explanatory and response variables

Displaying relationships: scatterplots

Interpreting scatterplots

Adding categorical variables to scatterplots

Measuring linear association (correlation)

Facts about correlation


Scatterplot

A scatterplot is a graph in which paired (x, y) data (usually collected on the same individuals) are plotted with one variable on the horizontal (x) axis and the other on the vertical (y) axis. Each pair (x, y) is plotted as a single point.

Example:

Student   Number of Beers   Blood Alcohol Level
   1             5                 0.1
   2             2                 0.03
   3             9                 0.19
   4             8                 0.12
   5             3                 0.04
   6             7                 0.095
   7             3                 0.07
   8             5                 0.06
   9             3                 0.02
  10             5                 0.05
  11             4                 0.07
  12             6                 0.1
  13             5                 0.085
  14             7                 0.09
  15             1                 0.01
  16             4                 0.05

Here we have two quantitative variables for each of 16 students:

1. How many beers they drank, and

2. Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other?


Scatterplots

In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.

[Figure: scatterplot of blood alcohol content (y, the response or dependent variable) against number of beers (x, the explanatory or independent variable)]
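As a rough illustration of how each (x, y) pair becomes one point, the sketch below draws a text-only scatterplot of the beers and BAC values from the 16-student table. The helper name `text_scatterplot` and the choice of 10 rows are assumptions made for this example; it also assumes x values are small non-negative integers, which holds for this data set.

```python
# Beers/BAC pairs copied from the 16-student table (students 1 to 16).
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def text_scatterplot(xs, ys, rows=10):
    """Return text lines: one column per integer x, one row per band of y."""
    x_max = max(xs)
    y_max = max(ys)
    grid = [[" "] * (x_max + 1) for _ in range(rows)]
    for x, y in zip(xs, ys):
        # Map y onto a band 0..rows-1; the largest y lands in the top band.
        band = min(rows - 1, int(y / y_max * rows))
        grid[rows - 1 - band][x] = "*"  # row 0 is the top of the plot
    lines = ["|" + " ".join(row) for row in grid]
    lines.append("+" + "-" * (2 * (x_max + 1) - 1))
    lines.append(" " + " ".join(str(x) for x in range(x_max + 1)))
    return lines


plot_lines = text_scatterplot(beers, bac)
print("BAC (y) against number of beers (x)")
print("\n".join(plot_lines))
```

Students with the same number of beers and similar BAC can fall in the same cell, so the text plot may show fewer than 16 stars.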

Explanatory and response variables

A response variable measures or records an outcome of a study. An explanatory variable explains changes in the response variable.

Typically, the explanatory or independent variable is plotted on the x axis and the response or dependent variable is plotted on the y axis.

Some plots don’t have clear explanatory and response variables.

Do calories explain sodium amounts?

Does percent return on Treasury bills explain percent return on common stocks?


Examining a Scatterplot

You can describe the overall pattern of a scatterplot by its:

Form: linear, non-linear (quadratic, exponential, etc.), or no apparent pattern.

Direction: positive or negative.

Strength: very strong, strong, moderately strong, weak, etc.

Look for outliers and consider how they affect the correlation.


Scatterplot

Example: Draw a scatterplot for the data below. What is the nature of the relationship between x and y?

x    1    2    3    4    5
y   -4   -2    1    0    2

[Figure: scatterplot of the five (x, y) points]

Answer: Strong, positive, and linear.


Examining a Scatterplot

Two variables are positively correlated when high values of the variables tend to occur together and low values of the variables tend to occur together. The scatterplot slopes upwards from left to right.

Two variables are negatively correlated when high values of one of the variables tend to occur with low values of the other and vice versa. The scatterplot slopes downwards from left to right.
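One informal way to check whether "high values tend to occur together" is to compare each point with the two means. The sketch below does this for the beers and BAC data from the 16-student table; counting points this way is only an illustration chosen for this example, not a method the slides prescribe.

```python
# Beers/BAC pairs copied from the 16-student table (students 1 to 16).
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

mean_beers = sum(beers) / len(beers)
mean_bac = sum(bac) / len(bac)

same_side = 0      # both values above their means, or both below
opposite_side = 0  # one value above its mean, the other below

for x, y in zip(beers, bac):
    product = (x - mean_beers) * (y - mean_bac)
    if product > 0:
        same_side += 1
    elif product < 0:
        opposite_side += 1

print(f"same side of both means: {same_side}")
print(f"opposite sides:          {opposite_side}")
```

For this data, 14 of the 16 students fall on the same side of both means and 2 on opposite sides, which matches a positive association.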


Types of Correlation

[Figure: four scatterplot panels, each with x and y axes]

Negative linear correlation: as x increases, y tends to decrease.

Positive linear correlation: as x increases, y tends to increase.

No correlation

Non-linear correlation


Examples of Relationships

[Figure: four scatterplots: Health Status Measure vs. Income; Health Status Measure vs. Age; Education Level vs. Age; Mental Health Score vs. Physical Health Score]

Caution: Describing the direction of a relationship requires that both variables be quantitative, so that the order of the data points is defined entirely by their values.

Correspondingly, a "positive" or "negative" relationship between categorical variables is meaningless, because the order of the categories is arbitrary.

Example: Beetles trapped on boards of different colors

What association? What relationship?

[Figure: two bar charts of the same data by board color, one ordered Blue, White, Green, Yellow and the other Blue, Green, White, Yellow]

Describe one category at a time.


Thought Question 1

What type of association would the following pairs of variables have: positive, negative, or none?

1. Temperature during the summer and electricity bills

2. Temperature during the winter and heating costs

3. Number of years of education and height (Elementary School)

4. Frequency of brushing and number of cavities

5. Number of churches and number of bars in cities

6. Height of husband and height of wife


Thought Question 2

Consider the two scatterplots below. How does the outlier impact the correlation for each plot? Does the outlier increase the correlation, decrease the correlation, or have no impact?

Strength of the association

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

With a strong relationship, you can get a pretty good estimate of y if you know x.

With a weak relationship, for any x you might get a wide range of y values.
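To see what "a range of y values for a given x" looks like in practice, the sketch below groups the BAC values from the 16-student table by number of beers and prints the spread within each group. Grouping with a dictionary is just one convenient way to do this.

```python
from collections import defaultdict

# Beers/BAC pairs copied from the 16-student table (students 1 to 16).
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

# Collect every observed BAC value for each distinct number of beers.
bac_by_beers = defaultdict(list)
for x, y in zip(beers, bac):
    bac_by_beers[x].append(y)

for x in sorted(bac_by_beers):
    values = bac_by_beers[x]
    print(f"{x} beers: BAC from {min(values)} to {max(values)} "
          f"({len(values)} student(s))")
```

The narrower these ranges are relative to the overall spread of y, the better x lets you estimate y.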

How to scale a scatterplot

Using an inappropriate scale for a scatterplot can give an incorrect impression.

Both variables should be given a similar amount of space:

• Plot roughly square

• Points should occupy all the plot space (no blank space)

Same data in all four plots

Adding categorical variables to scatterplots

Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends.

What may look like a positive linear relationship is in fact a series of negative linear associations. Plotting different habitats in different colors allowed us to make that important distinction.

Comparison of men’s and women’s racing records over time. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization.

Relationship between lean body mass and metabolic rate in men and women. While both men and women follow the same positive linear trend, women show a stronger association. As a group, males typically have larger values for both variables.
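The sketch below shows how a grouping variable can reverse the apparent direction of a relationship. The two groups ("habitat A" and "habitat B") and all of their values are made up for this illustration; they are not the data behind the slides' figures. The helper `pearson_r` is a plain implementation of the sample correlation coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


# Hypothetical data: within each habitat, y falls as x rises.
habitat_a_x, habitat_a_y = [1, 2, 3], [3, 2, 1]
habitat_b_x, habitat_b_y = [4, 5, 6], [6, 5, 4]

r_a = pearson_r(habitat_a_x, habitat_a_y)
r_b = pearson_r(habitat_b_x, habitat_b_y)
# Pooling the habitats hides the grouping variable.
r_all = pearson_r(habitat_a_x + habitat_b_x, habitat_a_y + habitat_b_y)

print(f"habitat A: r = {r_a:.2f}")
print(f"habitat B: r = {r_b:.2f}")
print(f"pooled:    r = {r_all:.2f}")
```

Each hypothetical habitat has a perfectly negative relationship, yet the pooled data give a positive r, which is why plotting the groups in different colors matters.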


Measuring Strength & Direction of a Linear Relationship

How closely does a non-horizontal straight line fit the points of a scatterplot?

The correlation coefficient (often referred to as just correlation): r

– measure of the strength of the relationship: the stronger the relationship, the larger the magnitude of r.

– measure of the direction of the relationship: positive r indicates a positive relationship, negative r indicates a negative relationship.


Correlation Coefficient

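The Greek capital letter sigma (\(\Sigma\)) denotes summation or addition.

\[
r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)
  = \frac{1}{n-1} \sum \frac{(x - \bar{x})(y - \bar{y})}{s_x \, s_y}
\]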

Example: Find the correlation between x and y.

x    1    2    3    4    5
y   -4   -2    1    0    2

x    x - x̄     y    y - ȳ    (x - x̄)(y - ȳ)
1     -2      -4    -3.4         6.8
2     -1      -2    -1.4         1.4
3      0       1     1.6         0
4      1       0     0.6         0.6
5      2       2     2.6         5.2

\(\bar{x} = 3, \quad \bar{y} = -0.6\)

\(s_x = 1.58, \quad s_y = 2.41\)

\[
r = \frac{14}{4 \times 1.58 \times 2.41} = 0.9192
\]
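The worked example can be checked step by step in a few lines of Python. This is a minimal sketch that follows the same columns as the table (deviations from the mean, then their products); the variable names are chosen for this illustration.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [-4, -2, 1, 0, 2]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Columns of the worked table: deviations and their products.
dev_x = [xi - mean_x for xi in x]
dev_y = [yi - mean_y for yi in y]
products = [dx * dy for dx, dy in zip(dev_x, dev_y)]

# Sample standard deviations use n - 1 in the denominator.
s_x = sqrt(sum(dx ** 2 for dx in dev_x) / (n - 1))
s_y = sqrt(sum(dy ** 2 for dy in dev_y) / (n - 1))

r = sum(products) / ((n - 1) * s_x * s_y)

print(f"mean x = {mean_x}, mean y = {mean_y}")
print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}")
print(f"sum of products = {sum(products):.1f}")
print(f"r = {r:.4f}")
```

Carrying full precision gives r of about 0.9191; the 0.9192 in the worked example comes from rounding the two standard deviations to 1.58 and 2.41 before dividing.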


Correlation Coefficient

The correlation coefficient ranges from -1 to 1.

If r = -1, there is a perfect negative linear correlation.

If r = 1, there is a perfect positive linear correlation.

If r is close to 0, there is little or no linear correlation.


Linear Correlation

[Figure: four scatterplot panels, each with x and y axes]

Strong negative correlation: r = -0.91

Strong positive correlation: r = 0.88

Weak positive correlation: r = 0.42

Non-linear correlation: r = 0.07


Correlation Coefficient

Special values for r:

A perfect positive linear relationship would have r = +1.

A perfect negative linear relationship would have r = -1.

If there is no linear relationship, or if the scatterplot points are best fit by a horizontal line, then r = 0.

Note: r must be between -1 and +1, inclusive.

r > 0: as one variable changes, the other variable tends to change in the same direction

r < 0: as one variable changes, the other variable tends to change in the opposite direction
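The two extreme values can be reproduced with points that lie exactly on a straight line. The lines y = 2x + 1 and y = 10 - 3x below are arbitrary choices for this sketch, and `pearson_r` is a plain implementation of the sample correlation coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # points exactly on an upward line
falling = [10 - 3 * x for x in xs]  # points exactly on a downward line

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)

print(f"rising line:  r = {r_up:.4f}")
print(f"falling line: r = {r_down:.4f}")
```

The steepness of the line does not matter: any exact upward line gives r = +1 and any exact downward line gives r = -1, up to floating-point rounding.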


Correlation Coefficient

Because r uses the z-scores for the observations, it does not change when we change the units of measurement of x, y, or both.

Correlation ignores the distinction between explanatory and response variables.

r measures the strength of only linear association between variables.

A large value of r does not necessarily mean that there is a strong linear relationship between the variables: the relationship might not be linear. Always look at the scatterplot.

When r is close to 0, it does not mean that there is no relationship between the variables; it means only that there is no linear relationship.

Outliers can inflate or deflate correlations.
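Three of the facts above can be checked numerically. The sketch below uses the beers and BAC data from the 16-student table for the unit change and the variable swap, and the five-point example for the outlier. The 12-ounces-per-beer conversion and the extra point (10, -5) are assumptions invented for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

# 1. Changing units: ounces instead of beers (assuming 12 oz per beer),
#    and BAC multiplied by 100.
r_original = pearson_r(beers, bac)
r_rescaled = pearson_r([12 * b for b in beers], [100 * v for v in bac])

# 2. Swapping the roles of the two variables.
r_swapped = pearson_r(bac, beers)

# 3. Adding one made-up outlier to the five-point example.
x = [1, 2, 3, 4, 5]
y = [-4, -2, 1, 0, 2]
r_without = pearson_r(x, y)
r_with = pearson_r(x + [10], y + [-5])

print(f"original r = {r_original:.3f}, rescaled r = {r_rescaled:.3f}")
print(f"swapped r  = {r_swapped:.3f}")
print(f"without outlier r = {r_without:.3f}, with outlier r = {r_with:.3f}")
```

In this sketch a single invented point is enough to turn a strong positive correlation into a negative one, which is why the scatterplot should always be inspected alongside r.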


Not all Relationships are Linear

Miles per Gallon versus Speed

Curved relationship (r is misleading)

Speed chosen for each subject varies from 20 mph to 60 mph

MPG varies from trial to trial, even at the same speed

Statistical relationship

[Figure: scatterplot of miles per gallon (y) against speed (x); r = -0.06]
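A small made-up data set shows how a clear curved relationship can produce a correlation near zero. The speeds and mileage figures below are hypothetical values chosen to be symmetric around 40 mph; they are not the data plotted on the slide.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


# Hypothetical values: mileage peaks at 40 mph and drops off on both sides.
speed = [20, 30, 40, 50, 60]
mpg = [22, 28, 30, 28, 22]

r = pearson_r(speed, mpg)
print(f"r = {r:.2f}")
```

Here mileage depends on speed in an obvious way, but the rising and falling halves cancel out in the sum of products, so r alone would suggest no relationship.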


Common Errors Involving Correlation

1. Causation: It is wrong to conclude that correlation implies causality.

2. Averages: Averages suppress individual variation and may inflate the correlation coefficient.

3. Linearity: There may be some relationship between x and y even when there is no linear correlation.
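The "Averages" error in the list above can be illustrated with made-up numbers. In the sketch below, each of five x values has three individuals whose y values are spread widely around x; all values are hypothetical and chosen so the arithmetic is easy to follow.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


# Hypothetical individuals: for each x, three y values at x - 3, x, x + 3.
groups = [1, 2, 3, 4, 5]
offsets = [-3, 0, 3]
individual_x = [x for x in groups for _ in offsets]
individual_y = [x + e for x in groups for e in offsets]

# Replace each group of three individuals by its average y.
average_y = [sum(x + e for e in offsets) / len(offsets) for x in groups]

r_individuals = pearson_r(individual_x, individual_y)
r_averages = pearson_r(groups, average_y)

print(f"individuals: r = {r_individuals:.2f}")
print(f"averages:    r = {r_averages:.2f}")
```

Averaging removes the individual scatter, so the correlation among the group averages (1.0 here) is much higher than the correlation among the individuals (0.5 here).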


Example

A survey of the world’s nations in 2004 shows a strong positive correlation between percentage of the country using cell phones and life expectancy in years at birth.

a) Does this mean that cell phones are good for your health?

No. It simply means that in countries where cell phone use is high, the life expectancy tends to be high as well.

b) What might explain the strong correlation?

The economy could be a lurking variable. Richer countries generally have more cell phone use and better health care.


Example

The correlation between Age and Income as measured on 100 people is r = 0.75. Explain whether or not each of these conclusions is justified.

a) When Age increases, Income increases as well.

b) The form of the relationship between Age and Income is linear.

c) There are no outliers in the scatterplot of Income vs. Age.

d) Whether we measure Age in years or months, the correlation will still be 0.75.


Example

Explain the mistakes in the statements below:

a) “My correlation of -0.772 between GDP and Infant Mortality Rate shows that there is almost no association between GDP and Infant Mortality Rate”.

b) “There was a correlation of 0.44 between GDP and Continent”

c) “There was a very strong correlation of 1.22 between Life Expectancy and GDP”.


Key Concepts

Strength of Linear Relationship

Direction of Linear Relationship

Correlation Coefficient

Common Problems with Correlations

r can only be calculated for quantitative data.