Finding Areas with Calc

1. ShadeNorm
• Consider WISC data from before: N(100, 15). Suppose we want to find the percent of children whose scores are above 125.
• Specify window: X [55, 145] with Xscl = 15, and Y [-.008, .028] with Yscl = .01.
• Press 2nd VARS (DISTR), then choose Draw and 1: ShadeNorm(.
• Complete the command ShadeNorm(125, 1E99, 100, 15), or ShadeNorm(0, 85, 100, 15) for the area below 85.
• If using Z scores instead of raw scores, the mean 0 and SD 1 will be understood so ShadeNorm (1,2) will give you area from z-score of 1.0 to 2.0. How does this compare to the 68-95-99.7 rule?
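These areas can be double-checked off the calculator; here is a minimal sketch using Python's standard-library `statistics.NormalDist` (not part of the original slides):

```python
from statistics import NormalDist

wisc = NormalDist(mu=100, sigma=15)

# Area above 125, i.e. ShadeNorm(125, 1E99, 100, 15)
above_125 = 1 - wisc.cdf(125)

# Area from z = 1 to z = 2, i.e. ShadeNorm(1, 2) on the standard normal
std = NormalDist()  # mean 0, SD 1
z1_to_z2 = std.cdf(2) - std.cdf(1)

print(round(above_125, 4))  # 0.0478
print(round(z1_to_z2, 4))   # 0.1359, close to (95 - 68)/2 = 13.5%
```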
2. NormalCdf
• Advantage: quicker. Disadvantage: no picture.
• 2nd Vars (DISTR) choose 2: normalcdf(.
• Complete command (125, 1E99, 100, 15) and press enter. You get .0477 or approx. 5%.
• If you have Z scores: normalcdf (-1, 1) = .6827 aka 68%...like our rule!
InvNorm
• InvNorm calculates the raw score or z-score value corresponding to a known area under the curve.
• 2nd Vars (DISTR), choose 3: invNorm(.
• Complete the command invNorm(.9, 100, 15) and press Enter. You get 119.223, so this is the score corresponding to the 90th percentile.
• Compare this with command invNorm (.9) you get 1.28. This is the Z score.
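The same two invNorm calls can be sketched with the stdlib `NormalDist.inv_cdf` (an off-calculator check, not part of the original slides):

```python
from statistics import NormalDist

# invNorm(.9, 100, 15): raw score at the 90th percentile of N(100, 15)
score_90th = NormalDist(mu=100, sigma=15).inv_cdf(0.9)

# invNorm(.9): the corresponding z-score on the standard normal
z_90th = NormalDist().inv_cdf(0.9)

print(round(score_90th, 3))  # 119.223
print(round(z_90th, 2))      # 1.28
```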
Describing Bivariate Relationships
Chapter 3 Summary, YMS3e
AP Stats at RHS, Ms. Nichols
Bivariate Relationships
What is bivariate data? When exploring/describing a bivariate (x, y) relationship:
• Determine the explanatory and response variables
• Plot the data in a scatterplot
• Note the strength, direction, and form
• Note the mean and standard deviation of x and the mean and standard deviation of y
• Calculate and interpret the correlation, r
• Calculate and interpret the least squares regression line in context
• Assess the appropriateness of the LSRL by constructing a residual plot
3.1 Response vs. Explanatory Variables
• A response variable measures an outcome of a study; an explanatory variable helps explain or influence changes in a response variable (like independent vs. dependent).
• Calling one variable explanatory and the other response doesn’t necessarily mean that changes in one CAUSE changes in the other.
• Ex: Alcohol and body temp: one effect of alcohol is a drop in body temperature. To test this, researchers give several amounts of alcohol to mice and measure each mouse's body temperature change. What are the explanatory and response variables?
Scatterplots
• Scatterplot shows the relationship between two quantitative variables measured on the same individuals.
• Explanatory variables go along the X axis, response variables along the Y axis.
• Each individual in the data appears as a point in the plot, fixed by the values of both variables for that individual.
• Example:
Interpreting Scatterplots
• Direction: in previous example, the overall pattern moves from upper left to lower right. We call this a negative association.
• Form: The form is slightly curved and there are two distinct clusters. What explains the clusters? (ACT States)
• Strength: The strength is determined by how closely the points follow a clear form. The example is only moderately strong.
• Outliers: Do we see any deviations from the pattern? (Yes, West Virginia, where 20% of HS seniors take the SAT but the mean math score is only 511).
Association
Introducing Categorical Variables
Calculator Scatterplot
• Enter the beer consumption values in L1 and the BAC values in L2.
• Next, specify a scatterplot in the StatPlot menu (first graph type), with Xlist L1 and Ylist L2 (explanatory and response).
• Use ZoomStat.
• Notice that there are no scales on the axes and they aren't labeled. If you are copying your graph to your paper, make sure you scale and label the axes (use Trace).
Student: 1    2    3    4    5    6     7    8    9    10   11   12   13    14   15   16
Beers:   5    2    9    8    3    7     3    5    3    5    4    6    5     7    1    4
BAC:     0.10 0.03 0.19 0.12 0.04 0.095 0.07 0.06 0.02 0.05 0.07 0.10 0.085 0.09 0.01 0.05
Correlation
• Caution: our eyes can be fooled! Our eyes are not good judges of how strong a linear relationship is. The two scatterplots depict the same data, but drawn with different scales. Because of this we need a numerical measure to supplement the graph.
r
• The Correlation measures the direction and strength of the linear relationship between 2 variables.
• Formula (you don't need to memorize or use it): r = (1/(n-1)) * Σ(z_x * z_y), the average product of the standardized x and y values.
• In Calc: go to Catalog (2nd, zero button), go to DiagnosticOn, Enter, Enter. You only have to do this ONCE! Once this is done:
• Enter the data in L1 and L2 (you can do CALC, 2-Var Stats if you want the mean and SD of each).
• CALC, LinReg(a+bx), Enter.
Interpreting r
• The absolute value of r tells you the strength of the association (0 means no association; 1 is the strongest possible association).
• The sign tells you whether it's a positive or a negative association. So r ranges from -1 to +1.
• Note- it makes no difference which variable you call x and which you call y when calculating correlation, but stay consistent!
• Because r uses standardized values of the observations, r does not change when we change the units of measurement of x, y, or both. (Ex: Measuring height in inches vs. ft. won’t change correlation with weight)
• Values of -1 and +1 occur ONLY in the case of a perfect linear relationship, when the variables lie exactly along a straight line.
Examples
1. Correlation requires that both variables be quantitative.
2. Correlation measures the strength of only LINEAR relationships, not curved...no matter how strong they are!
3. Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot
4. Correlation is not a complete summary of two-variable data, even when the relationship is linear: always give the means and standard deviations of both x and y along with the correlation.
3.2 Least-Squares Regression
The slope here, b = -0.00344, tells us that fat gained goes down by 0.00344 kg for each added calorie of NEA, according to this linear model. The slope of our regression equation is the predicted RATE OF CHANGE in the response y as the explanatory variable x changes.
The y intercept, a = 3.505 kg, is the fat gain estimated by this model if NEA does not change when a person overeats.
Prediction
• We can use a regression line to predict the response y for a specific value of the explanatory variable x.
LSRL
• In most cases, no line will pass exactly through all the points in a scatterplot, and different people will draw different regression lines by eye.
• Because we use the line to predict y from x, the prediction errors we make are errors in y, the vertical direction in the scatterplot.
• A good regression line makes the vertical distances of the points from the line as small as possible.
• Error = observed response - predicted response
LSRL Cont.
Equation of the LSRL
• Example 3.36: The Sanchez household is about to install solar panels to reduce the cost of heating their house. In order to know how much the panels help, they record their consumption of natural gas before the panels are installed. Gas consumption is higher in cold weather, so the relationship between outside temp and gas consumption is important.
• Describe the direction, form, and strength of the relationship
• Positive, linear, and very strong
• About how much gas does the regression line predict that the family will use in a month that averages 20 degree-days per day?
• 500 cubic feet per day
• How well does the least-squares line fit the data?
• The errors of our predictions, or vertical distances from predicted y to observed y, are called residuals because they are the "left-over" variation in the response.
Residuals
One subject's NEA rose by 135 calories. That subject gained 2.7 kg of fat. The predicted gain for 135 calories is
y-hat = 3.505 - 0.00344(135) = 3.04 kg
The residual for this subject is
y - y-hat = 2.7 - 3.04 = -0.34 kg
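The same arithmetic as a quick sketch (the model coefficients are the ones given above for the NEA data):

```python
# Predicted fat gain from the NEA model: y-hat = 3.505 - 0.00344 * NEA
nea_rise = 135       # calories
observed_gain = 2.7  # kg

predicted = 3.505 - 0.00344 * nea_rise
residual = observed_gain - predicted  # residual = observed - predicted

print(round(predicted, 2))  # 3.04
print(round(residual, 2))   # -0.34
```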
Residual Plot
• The sum of the least-squares residuals is always zero, so the mean of the residuals is always zero.
• The horizontal line at zero in the figure helps orient us; this "residual = 0" line corresponds to the regression line.
Residuals List on Calc
• If you want all your residuals listed in L3, highlight L3 (the name of the list, at the top), go to 2nd STAT (LIST), choose RESID, then press Enter and Enter. The list that appears gives the residual for each individual in the corresponding L1 and L2. (If you create a normal scatterplot using this list as your Ylist, so Xlist L1 and Ylist L3, you get exactly the same thing as a residual plot defined with Xlist L1 and Ylist RESID, as we had been doing.)
• This is a helpful list to have for checking your work when asked to calculate an individual's residual.
Examining the Residual Plot
• The residual plot should show no obvious pattern. A curved pattern shows that the relationship is not linear and a straight line may not be the best model.
• Residuals should be relatively small in size. A regression line that fits the data well should "come close" to most of the points.
• A commonly used measure of this is the standard deviation of the residuals, given by:
s = sqrt( Σ(residuals²) / (n - 2) )
For the NEA and fat gain data, s = sqrt(7.663 / 14) = 0.740.
Residual Plot on Calc
• Produce the scatterplot and regression line from the data (let's use the BAC data if it's still entered).
• Turn all plots off.
• Create a new scatterplot with Xlist as your explanatory variable and Ylist as the residuals (2nd STAT, RESID).
• ZoomStat.
R Squared: Coefficient of Determination
If all the points fall directly on the least-squares line, r squared = 1. Then all the variation in y is explained by the linear relationship with x.
So if r squared = .606, that means that about 61% of the variation in y among individual subjects is due to the straight-line relationship with x. The other 39% is "not explained."
r squared is a measure of how successful the regression was in explaining the response.
Facts about Least-Squares Regression
• The distinction between explanatory and response variables is essential in regression. If we reverse the roles, we get a different least-squares regression line.
• There is a close connection between correlation and the slope of the LSRL: slope b = r(Sy/Sx). This says that a change of one standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated (r = +/- 1), the change in the predicted response y-hat is the same (in standard-deviation units) as the change in x.
• The LSRL will always pass through the point (x-bar, y-bar).
• r squared is the fraction of variation in values of y explained by the x variable.
3.3 Influences
• Correlation r is not resistant: one unusual point in the scatterplot can greatly affect the value of r. The LSRL is also not resistant. Extrapolation is not very reliable.
• A point that is extreme in the x direction with no other points near it pulls the line toward itself. This point is influential.
Lurking Variables: Beware!
• Example: A College Board study of HS grads found a strong correlation between the amount of math minority students took in high school and their later success in college. News articles quoted the College Board as saying that "math is the gatekeeper for success in college."
• But minority students from middle-class homes with educated parents no doubt take more high school math courses. They are also more likely to have a stable family, parents who emphasize education, money to pay for college, etc. These students would likely succeed in college even if they took fewer math courses. The family background of students is a lurking variable that probably explains much of the relationship between math courses and college success.
Beware Correlations Based on Averages
• Correlations based on averages are usually too high when applied to individuals.
• Example: if we plot the average height of young children against their age in months, we will see a very strong positive association with correlation near 1. But individual children of the same age vary a great deal in height. A plot of height against age for individual children will show much more scatter and lower correlation than the plot of average height against age.
Chapter Example: Corrosion and Strength
Consider the following data from the article "The Carbonation of Concrete Structures in the Tropical Environment of Singapore" (Magazine of Concrete Research (1996): 293-300), which discusses how the corrosion of steel (caused by carbonation) is the biggest problem affecting concrete strength:
x = carbonation depth in concrete (mm)
y = strength of concrete (MPa)

x: 8    20   20   30   35   40   50   55   65
y: 22.8 17.1 21.5 16.1 13.4 12.4 11.4 9.7  6.8

Define the explanatory and response variables. Plot the data and describe the relationship.
Corrosion and Strength
[Scatterplot: strength (MPa) vs. depth (mm)]
There is a strong, negative, linear relationship between depth of corrosion and concrete strength. As the depth increases, the strength decreases at a constant rate.
Corrosion and Strength
[Scatterplot: strength (MPa) vs. depth (mm)]
The mean depth of corrosion is 35.89 mm with a standard deviation of 18.53 mm. The mean strength is 14.58 MPa with a standard deviation of 5.29 MPa.
Corrosion and Strength
Find the equation of the least squares regression line (LSRL) that models the relationship between corrosion and strength.
[Scatterplot with fitted line: strength (MPa) vs. depth (mm)]
y-hat = 24.52 + (-0.28)x
strength = 24.52 + (-0.28)depth
r = -0.96
Corrosion and Strength
[Scatterplot with fitted line: strength (MPa) vs. depth (mm)]
y-hat = 24.52 + (-0.28)x; strength = 24.52 + (-0.28)depth; r = -0.96
What does "r" tell us? There is a strong, negative, LINEAR relationship between depth of corrosion and strength of concrete.
What does "b = -0.28" tell us? For every increase of 1 mm in depth of corrosion, we predict a 0.28 MPa decrease in the strength of the concrete.
Corrosion and Strength
Use the prediction model (LSRL) to determine the following:
• What is the predicted strength of concrete with a corrosion depth of 25 mm?
strength = 24.52 + (-0.28)(25) ≈ 17.59 MPa (using the calculator's unrounded slope, about -0.277)
• What is the predicted strength of concrete with a corrosion depth of 40 mm?
strength = 24.52 + (-0.28)(40) ≈ 13.44 MPa
How does this prediction compare with the observed strength at a corrosion depth of 40 mm?
Residuals
Note: the predicted strength when corrosion = 40 mm is 13.44 MPa. The observed strength when corrosion = 40 mm is 12.4 MPa.
• The prediction did not match the observation. That is, there was an "error" or "residual" between our prediction and the actual observation.
• RESIDUAL = observed y - predicted y
• The residual when corrosion = 40 mm is: residual = 12.4 - 13.44 = -1.04
Assessing the Model
Is the LSRL the most appropriate prediction model for strength? r suggests it will provide strong predictions... can we do better? To determine this, we need to study the residuals generated by the LSRL:
• Make a residual plot.
• Look for a pattern.
• If no pattern exists, the LSRL may be our best bet for predictions.
• If a pattern exists, a better prediction model may exist...
Residual Plot
Construct a residual plot for the (depth, strength) LSRL.
[Residual plot: residuals vs. depth (mm)]
There appears to be no pattern in the residual plot... therefore, the LSRL may be our best prediction model.
Coefficient of Determination
We know what "r" tells us about the relationship between depth and strength... what about r squared?
[Scatterplot: strength (MPa) vs. depth (mm)]
93.75% of the variability in strength can be explained by the LSRL on depth.
Summary
When exploring a bivariate relationship:
Make and interpret a scatterplot:
Strength, Direction, Form
Describe x and y:
Mean and Standard Deviation in Context
Find the Least Squares Regression Line.
Write in context.
Construct and Interpret a Residual Plot.
Interpret r and r2 in context.
Use the LSRL to make predictions...
Examining Relationships: Regression Review
Regression Basics
When describing a bivariate relationship:
Make a Scatterplot
Strength, Direction, Form
Model: y-hat=a+bx
Interpret slope in context
Make Predictions
Residual = Observed-Predicted
Assess the Model
Interpret “r”
Residual Plot
[The Endangered Manatee scatterplots: manatees killed vs. powerboat registrations (thousands), fitted line Killed = 0.125(Registrations) - 41.4 with r squared = 0.89, and the corresponding residual plot]
Reading Minitab Output

Regression Analysis: Fat gain versus NEA

The regression equation is
FatGain = ****** + ******(NEA)

Predictor   Coef        SE Coef     T       P
Constant    3.5051      0.3036      11.54   0.000
NEA         -0.0034415  0.00074141  -4.04   0.000

S = 0.739853   R-Sq = 60.6%   R-Sq(adj) = 57.8%

Regression equations aren't always as easy to spot as they are on your TI-84. Can you find the slope and intercept above?
Outliers / Influential Points
Does the age of a child's first word predict his/her mental ability? Consider the following data on (age of first word, Gesell Adaptive Score) for 21 children.
Age at First Word and Gesell Score

Child   Age (months)   Score
1       15              95
2       26              71
3       10              83
4        9              91
5       15             102
6       20              87
7       18              93
8       11             100
9        8             104
10      20              94
11       7             113
12       9              96
13      10              83
14      11              84
15      11             102
16      10             100
17      12             105
18      42              57
19      17             121
20      11              86
21      10             100
[Scatterplot: Gesell score vs. age at first word (months), with fitted line Score = -1.13(Age) + 110, r squared = 0.41]
[The same scatterplot with one extreme point highlighted]
Does the highlighted point markedly affect the equation of the LSRL? If so, it is "influential".
Test by removing the point and finding the new LSRL. Influential?
[Scatterplot with that point removed: Score = -0.779(Age) + 106, r squared = 0.11]
Explanatory vs. Response
The distinction between explanatory and response variables is essential in regression. Switching the distinction results in a different least-squares regression line.
[Hubble 1929 data scatterplots: v = 454r - 38 with r squared = 0.63; with the axes swapped, r = 0.00139v + 0.39, r squared = 0.63]
Note: The correlation value, r, does NOT depend on the distinction between Explanatory and Response.
Correlation
The correlation, r, describes the strength of the straight-line relationship between x and y.
Ex: There is a strong, positive, LINEAR relationship between # of beers and BAC.
There is a weak, positive, linear relationship between x and y. However, there is a strong nonlinear relationship.
r measures the strength of linearity...
[Scatterplot: Beer and Blood Alcohol, BAC = 0.0180(Beers) - 0.013, r squared = 0.80]
[Scatterplot of a strongly curved relationship: fitted line y = x - 10, r squared = 0.14]
Coefficient of Determination
The coefficient of determination, r squared, describes the percent of variability in y that is explained by the linear regression on x.
71% of the variability in death rates due to heart disease can be explained by the LSRL on alcohol consumption.
That is, alcohol consumption provides us with a fairly good prediction of death rate due to heart disease, but other factors contribute to this rate, so our prediction will be off somewhat.
[Scatterplot: Wine Consumption and Heart Disease, DeathRate = -23.0(Alcohol) + 260, r squared = 0.71]
Cautions
Correlation and regression are NOT RESISTANT to outliers and influential points!
Correlations based on “averaged data” tend to be higher than correlations based on all raw data.
Extrapolating beyond the observed data can result in predictions that are unreliable.
Correlation vs. Causation
Consider the following historical data:
Collection 1

Year   Ministers   Rum
1860    63          8376
1865    48          6406
1870    53          7005
1875    64          8486
1880    72          9595
1885    80         10643
1890    85         11265
1895    76         10071
1900    80         10547
1905    83         11008
1910   105         13885
1915   140         18559
[Scatterplot: barrels of rum vs. number of ministers, y = 132x + 33, r squared = 1.00]
There is an almost perfect linear relationship between x and y. (r=0.999997)
x = # Methodist Ministers in New England
y = # of Barrels of Rum Imported to Boston
CORRELATION DOES NOT IMPLY CAUSATION!
Summary
[The Endangered Manatee scatterplots repeated from above: Killed = 0.125(Registrations) - 41.4, r squared = 0.89, with the residual plot]