9.1

17
9.1 SCATTER DIAGRAMS AND LINEAR CORRELATION

description

 

Transcript of 9.1

Page 1: 9.1

9.1 SCATTER DIAGRAMS AND LINEAR CORRELATION

Page 2: 9.1

Scatter Diagrams• A scatter diagram is a graph in which data pairs () are

plotted as individual points on a grid with horizontal axis and vertical axis .• Scatter diagrams are used in studies of correlation and regression

of two variables.• is called the explanatory variable• is called the response variable• By looking at a scatter diagram of data pairs, you can observe

whether there seems to be a linear relationship between and

Page 3: 9.1

Example 1: Scatter DiagramPhosphorous is a chemical used in many household and industrial cleaning compounds. Unfortunately, phosphorous tends to find its way into surface water, where it can kill fish, plants, and other wetland creatures. Phosphorous-reduction programs are required by law and are monitored by the Environmental Protection Agency (EPA) (Reference: EPA Case Study 832-R-93-005).

A random sample of eight sites in a California wetlands study gave the following information about phosphorous reduction in drainage water. In this study, x is a random variable that represents phosphorous concentration (in 100 mg/l) at the inlet of a passive biotreatment facility, and y is a random variable that represents total phosphorous concentration (in 100 mg/l) at the outlet of the passive biotreatment facility

a) Make a scatter diagram for these data

b) Comment on the relationship between and shown in the scatter diagram

Page 4: 9.1

Example 1: Scatter DiagramSolutiona) The values in the table make ordered pairs. To make

the scatter diagram, scan the data and choose an appropriate scale for each axis.

Figure 9.1

Phosphorous Reduction (100 mg/l)

The line segment shows the basic trend.

Page 5: 9.1

b) By inspecting the figure, we see that smaller values of are associated with smaller values of and larger values of tend to be associated with larger values of . Roughly speaking, the general trend seems to be reasonably well represented by an upward-sloping line segment, as shown in Figure 9-1.

Example 1: Scatter DiagramSolution

Page 6: 9.1

Scatter Diagrams and Linear Correlation• When using scatter plots, we are interested in the

relationship between (the explanatory variable) and (the response variable).• Specifically we are interested in strength of this relationship and

any deviation from the basic patter of this relationship• Most often, we examine whether the relationship can be reasonably

explained in terms of a linear function (a function whose graph is a straight line), this is called linear correlation.• The “best-fitting line” should be the one that comes closest to each of the

points of the scatter diagram (there are mathematical criteria for this line and a formula that we will learn in 9.2)

Page 7: 9.1

Scatter Diagrams and Linear Correlation• If the points of a scatter diagram are located so that NO line is

realistically a “good” fit, then we say that the points possess no linear correlation.

Figure 9.2

Scatter Diagrams with No Linear Correlation

(a) (b)

Page 8: 9.1

Scatter Diagrams and Linear Correlation• If the points of a scatter diagram seem close to a straight line, we say the

linear correlation is moderate to high, depending on how close the points lie to the line.• If all the points lie on a line, then we have perfect linear correlation.

• In statistics, perfect linear correlation almost never occurs.

• The variables and are said to have positive correlation if low values of are associated with low values of and high values of are associated with high values of .

• If low values of are associated with high values of and high values of are associated with low values of , the variables are said to be negatively correlated.

Page 9: 9.1

Scatter Diagrams and Linear Correlation

Figure 9.4

Scatter Diagrams with Moderate and Perfect Linear Correlation

Page 10: 9.1

Sample Correlation Coefficient r• Whenever you are looking for a relationship between two

variables, making a scatter diagram is a good first step. • The sample correlation coefficient is a numerical

measurement that assess the strength of a linear relationship between two variables and .

Page 11: 9.1

Sample Correlation Coefficient r1. is a unit less measurement between and (). If ,

there is perfect positive linear correlation. If , there is perfect negative linear correlation. If , there is no linear correlation. The closer is to or , the better a line describes the relationship between the two variables and .

2. Positive values of imply that as increases, tends to increase. Negative values of imply that as increases, tends to decrease.

3. The value of is the same regardless of which variable is the explanatory variable and which is the response variable. In other words, the value of is the same for the pairs and the corresponding pairs

4. The value of does not change when either variable is converted to different units.

Page 12: 9.1

Some facts about the Correlation Coefficient• Table 9-2 gives a quick summary of some basic facts about .

Page 13: 9.1

Formulas for r

• For a discussion on the Development of Formula for see page 507 in your text

• The defining formula for is: • It shows how the mean and standard deviation of each variable in

the data pair enter into the formulation of . However, it is difficult to work with because of all the subtraction and products.

• The computation formula for uses the raw data values of and directly. The computation formula for is: .• It requires that you obtain a random sample of data pairs and that

the data pairs should have a bivariate normal distribution. This means that for a fixed value of , the values should have a normal distribution, and for a fixed , the values should have their own (approximately) normal distribution.

Page 14: 9.1

Formulas for rWe will be using the TI-83 and TI-84 to compute r directly.

First use CATALOG, find DiagnosticsON, and press ENTER twice.

1. Enter the values into L1

2. Enter the values into L2

3. Press STAT, and then select CALC. The choice 8:LinReg(a + bx) performs linear regression on the variables in L1 and L2. If the data are in other lists, then specify those lists after the LinReg(a + bx) command.

4. Your solution screen with display the correlation coefficient (round to 3 places)

Page 15: 9.1

Example: Computing r• Sand driven by wind creates large, beautiful dunes at the Great Sand

Dunes National Monument, Colorado. Of course, the same natural forces also create large dunes in the Great Sahara and Arabia. Is there a linear correlation between wind velocity and sand drift rate? Let x be a random variable representing wind velocity (in 10 cm/sec) and let y be a random variable representing drift rate of sand (in 100 gm/cm/sec). A test site at the Great Sand Dunes National Monument gave the following information about x and y (Reference: Hydrologic, Geologic, and Biologic Research at Great Sand Dunes National Monument, Proceedings of the National Park Service Research Symposium).

Page 16: 9.1

Cautions about Correlation• The correlation coefficient can be thought of as a measure

of how well a linear model fits the data points on a scatter diagram. The closer r is to or , the better a line “fits” the data. Values of r close to indicate a poor fit to any line.

• Usually a scatter plot does not contain all possible data points that could be gathered. Most scatter plots represent only a random sample of data pairs taken from a population. Thus we expect the values of to vary from one sample to the next (just like we expect to vary from sample to sample). This brings up the question of the significance of , which will be discussed later in the chapter.

Page 17: 9.1

Causation• The correlation coefficient is a mathematical tool for

measuring the strength of a linear relationship between two variables. • It makes no implication about cause or effect.

• The fact that two variables tend to increase or decrease together does not mean that a change in one is causing a change in the other.

• A strong correlation between x and y is sometimes due to other (either known or unknown) variables. Such variables are called lurking variables.