
Chapter 3: Descriptive Analysis and Presentation of Bivariate Data

[Figure: Regression Plot. Weight (vertical axis, 10 to 60) versus Height (horizontal axis, 10 to 50), with fitted line Y = -2.31464 + 1.28722X and R-Sq = 0.559.]

Chapter Goals

• To be able to present bivariate data in tabular and graphic form.

• To gain an understanding of the distinction between the basic purposes of correlation analysis and regression analysis.

• To become familiar with the ideas of descriptive presentation.

3.1: Bivariate Data

Bivariate Data: Consists of the values of two different response variables that are obtained from the same population of interest.

Three combinations of variable types:
1. Both variables are qualitative (attribute).
2. One variable is qualitative (attribute) and the other is quantitative (numerical).
3. Both variables are quantitative (both numerical).

Two Qualitative Variables:
When bivariate data results from two qualitative (attribute or categorical) variables, the data is often arranged on a cross-tabulation or contingency table.

Example: A survey was conducted to investigate the relationship between preferences for television, radio, or newspaper for national news, and gender. The results are given in the table below.

          TV    Radio   NP
Male      280   175     305
Female    115   275     170

This table may be extended to display the marginal totals (or marginals). The total of the marginal totals is the grand total.

Contingency tables often show percentages (relative frequencies). These percentages are based on the entire sample or on the subsample (row or column) classifications.

              TV    Radio   NP    Row Totals
Male          280   175     305   760
Female        115   275     170   560
Col. Totals   395   450     475   1320

Percentages based on the grand total (entire sample):
The previous contingency table may be converted to percentages of the grand total by dividing each frequency by the grand total and multiplying by 100.

For example, 175 becomes 13.3%:

$$\frac{175}{1320} \times 100 = 13.3$$

              TV     Radio   NP     Row Totals
Male          21.2   13.3    23.1   57.6
Female         8.7   20.8    12.9   42.4
Col. Totals   29.9   34.1    36.0   100.0
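The conversion to percentages of the grand total can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the variable names (`counts`, `grand_total`, `percent_of_total`) are chosen here and do not come from the slides.

```python
# Observed frequencies from the news-preference survey
# (rows: gender, columns: preferred medium; NP is the slide's column label).
counts = {
    "Male":   {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}

# Grand total: the sum of every cell in the table.
grand_total = sum(sum(row.values()) for row in counts.values())

# Divide each frequency by the grand total and multiply by 100,
# rounding to one decimal place as in the table.
percent_of_total = {
    gender: {medium: round(100 * freq / grand_total, 1)
             for medium, freq in row.items()}
    for gender, row in counts.items()
}

for gender, row in percent_of_total.items():
    print(gender, row)
```

The same loop structure extends to the marginal totals: summing a row or a column of `counts` before dividing by `grand_total` gives the row and column percentages in the last row and column of the table.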

These same statistics (numerical values describing sample results) can be shown in a (side-by-side) bar graph.

[Figure: Percentages Based on Grand Total. Side-by-side bar graph of Percentage (vertical axis, 0.0 to 25.0) by Media (TV, Radio, NP), with separate bars for Male and Female.]

Percentages based on row (column) totals:
The entries in a contingency table may also be expressed as percentages of the row (column) totals by dividing each row (column) entry by that row’s (column’s) total and multiplying by 100. The entries in the contingency table below are expressed as percentages of the column totals.

              TV      Radio   NP      Row Totals
Male          70.9    38.9    64.2    57.6
Female        29.1    61.1    35.8    42.4
Col. Totals   100.0   100.0   100.0   100.0

These statistics may also be displayed in a side-by-side bar graph.
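As a rough sketch, percentages of the column totals (and, by the same pattern, of the row totals) can be computed as follows. The names `column_percent` and `row_percent` are illustrative only, and the row percentages are not tabulated in the slides.

```python
# Observed frequencies from the news-preference survey.
counts = {
    "Male":   {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}
media = ["TV", "Radio", "NP"]

# Column totals: add the frequencies for each medium over both genders.
column_totals = {m: sum(counts[g][m] for g in counts) for m in media}

# Each entry as a percentage of its column total.
column_percent = {
    g: {m: round(100 * counts[g][m] / column_totals[m], 1) for m in media}
    for g in counts
}

# Row totals and row percentages follow the same pattern.
row_totals = {g: sum(counts[g].values()) for g in counts}
row_percent = {
    g: {m: round(100 * counts[g][m] / row_totals[g], 1) for m in media}
    for g in counts
}

print("Column percentages:", column_percent)
print("Row percentages:", row_percent)
```

Column percentages answer "of those who prefer a given medium, what share is male or female?", while row percentages answer "of the males (or females), what share prefers each medium?".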

One Qualitative and One Quantitative Variable:

1. When bivariate data results from one qualitative and one quantitative variable, the quantitative values are viewed as separate samples.
2. Each set is identified by levels of the qualitative variable.
3. Each sample is described using summary statistics, and the results are displayed for side-by-side comparison.
4. Statistics for comparison: measures of central tendency, measures of variation, 5-number summary.
5. Graphs for comparison: dotplot, boxplot.

Example: A random sample of households from three different parts of the country was obtained and their electric bill for June was recorded. The data is given in the table below.

The part of the country is a qualitative variable with three levels of response. The electric bill is a quantitative variable. The electric bills may be compared with numerical and graphical techniques.

Northeast        Midwest          West
23.75   40.50    34.38   34.35    54.54   65.60
33.65   31.25    39.15   37.12    59.78   45.12
42.55   50.60    36.71   34.39    60.35   61.53
37.70   31.55    35.12   35.80    52.79   47.37
38.85   21.25    37.24   40.01    59.64   37.40

Comparison using dotplots:

[Figure: Dotplots of the June electric bills for the Northeast, Midwest, and West, drawn on a common scale from 24.0 to 64.0.]

The electric bills in the Northeast tend to be more spread out than those in the Midwest. The bills in the West tend to be higher than both those in the Northeast and Midwest.
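The same comparison can be made numerically with summary statistics. The sketch below uses Python's standard `statistics` module and assumes the flattened data table is read as two columns of values per region; the function name `summarize` is made up for this illustration.

```python
import statistics

# June electric bills, assuming two table columns belong to each region.
bills = {
    "Northeast": [23.75, 40.50, 33.65, 31.25, 42.55,
                  50.60, 37.70, 31.55, 38.85, 21.25],
    "Midwest":   [34.38, 34.35, 39.15, 37.12, 36.71,
                  34.39, 35.12, 35.80, 37.24, 40.01],
    "West":      [54.54, 65.60, 59.78, 45.12, 60.35,
                  61.53, 52.79, 47.37, 59.64, 37.40],
}

def summarize(values):
    """Return measures of central tendency and variation for one sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

summary = {region: summarize(values) for region, values in bills.items()}

for region, stats in summary.items():
    print(region, {name: round(value, 2) for name, value in stats.items()})
```

Under that reading of the table, the Northeast range and standard deviation exceed those of the Midwest, and the West has the largest mean and median, which agrees with the conclusions drawn from the dotplots.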

Comparison using Box-and-Whisker plots:

[Figure: Side-by-side box-and-whisker plots of Electric Bill (vertical axis, 20 to 70) for the Northeast, Midwest, and West.]

Two Quantitative Variables:
1. Expressed as ordered pairs: (x, y)
2. x: input variable, independent variable.
   y: output variable, dependent variable.

Scatter Diagram: A plot of all the ordered pairs of bivariate data on a coordinate axis system. The input variable x is plotted on the horizontal axis, and the output variable y is plotted on the vertical axis.

Note: Use scales so that the range of the y-values is equal to or slightly less than the range of the x-values. This creates a window that is approximately square.

Example: In a study involving children’s fear related to being hospitalized, the age and the score each child made on the Child Medical Fear Scale (CMFS) are given in the table below.

Construct a scatter diagram for this data.

Age (x)     8    9    9   10   11    9    8    9    8   11
CMFS (y)   31   25   40   27   35   29   25   34   44   19

Age (x)     7    6    6    8    9   12   15   13   10   10
CMFS (y)   28   47   42   37   35   16   12   23   26   36

Scatter diagram: age = input variable, CMFS = output variable

[Figure: Child Medical Fear Scale. Scatter diagram of CMFS (vertical axis, 10 to 50) versus Age (horizontal axis, 6 to 15).]

3.2: Linear Correlation

• Measure the strength of a linear relationship between two variables.

• As x increases, no definite shift in y: no correlation.
• As x increases, a definite shift in y: correlation.
• Positive correlation: x increases, y increases.
• Negative correlation: x increases, y decreases.
• If the ordered pairs follow a straight-line path: linear correlation.

Example: no correlation.
As x increases, there is no definite shift in y.

[Figure: Scatter diagram of Output (vertical axis) versus Input (horizontal axis) showing no correlation.]

Example: positive correlation.
As x increases, y also increases.

[Figure: Scatter diagram of Output (vertical axis) versus Input (horizontal axis) showing positive correlation.]

Example: negative correlation.
As x increases, y decreases.

[Figure: Scatter diagram of Output (vertical axis) versus Input (horizontal axis) showing negative correlation.]

Note:
1. Perfect positive correlation: all the points lie along a line with positive slope.
2. Perfect negative correlation: all the points lie along a line with negative slope.
3. If the points lie along a horizontal or vertical line: no correlation.
4. If the points exhibit some other nonlinear pattern: no linear relationship, no correlation.
5. Need some way to measure correlation.

Coefficient of linear correlation: r, measures the strength of the linear relationship between two variables.

Pearson’s product moment formula:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

Note:
1. $-1 \le r \le +1$
2. r = +1: perfect positive correlation
3. r = -1: perfect negative correlation

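Alternate formula for r:

$$r = \frac{\mathrm{SS}(xy)}{\sqrt{\mathrm{SS}(x)\,\mathrm{SS}(y)}}$$

$$\mathrm{SS}(x) = \text{“sum of squares for } x\text{”} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$

$$\mathrm{SS}(y) = \text{“sum of squares for } y\text{”} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$

$$\mathrm{SS}(xy) = \text{“sum of squares for } xy\text{”} = \sum xy - \frac{\sum x \sum y}{n}$$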
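The slides state the alternate formula without showing why it agrees with Pearson's product moment formula. A short derivation is sketched below, assuming that $s_x$ and $s_y$ denote the sample standard deviations computed with divisor $n - 1$ (the slides do not restate this definition here).

```latex
\begin{aligned}
\sum (x - \bar{x})(y - \bar{y})
  &= \sum xy - \bar{y}\sum x - \bar{x}\sum y + n\,\bar{x}\,\bar{y}
   = \sum xy - \frac{\sum x \sum y}{n}
   = \mathrm{SS}(xy), \\[4pt]
(n - 1)\, s_x s_y
  &= (n - 1)\sqrt{\frac{\mathrm{SS}(x)}{n - 1}}\,\sqrt{\frac{\mathrm{SS}(y)}{n - 1}}
   = \sqrt{\mathrm{SS}(x)\,\mathrm{SS}(y)}, \\[4pt]
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}
  &= \frac{\mathrm{SS}(xy)}{\sqrt{\mathrm{SS}(x)\,\mathrm{SS}(y)}}.
\end{aligned}
```

The first line uses $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$, so each of the three correction terms equals $\sum x \sum y / n$ in magnitude.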

Example: The table below presents the weight (in thousands of pounds) x and the gasoline mileage (miles per gallon) y for ten different automobiles. Find the linear correlation coefficient.

         x      y      x²       y²       xy
        2.5    40     6.25     1600    100.0
        3.0    43     9.00     1849    129.0
        4.0    30    16.00      900    120.0
        3.5    35    12.25     1225    122.5
        2.7    42     7.29     1764    113.4
        4.5    19    20.25      361     85.5
        3.8    32    14.44     1024    121.6
        2.9    39     8.41     1521    113.1
        5.0    15    25.00      225     75.0
        2.2    14     4.84      196     30.8

Sum    34.1   309   123.73   10665   1010.9
        Σx     Σy     Σx²      Σy²      Σxy

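To complete the calculation for r:

$$\mathrm{SS}(x) = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449$$

$$\mathrm{SS}(y) = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9$$

$$\mathrm{SS}(xy) = \sum xy - \frac{\sum x \sum y}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79$$

$$r = \frac{\mathrm{SS}(xy)}{\sqrt{\mathrm{SS}(x)\,\mathrm{SS}(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$$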
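The hand calculation above can be checked with a short script. This is a minimal sketch using only the standard library; the helper name `sums_of_squares` is invented for the illustration.

```python
import math

# Data from the automobile example.
weight = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]   # x, thousands of pounds
mileage = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]            # y, miles per gallon

def sums_of_squares(x, y):
    """Return SS(x), SS(y), SS(xy) from the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n
    ss_y = sum(v * v for v in y) - sum_y ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return ss_x, ss_y, ss_xy

ss_x, ss_y, ss_xy = sums_of_squares(weight, mileage)

# Alternate formula for the linear correlation coefficient.
r = ss_xy / math.sqrt(ss_x * ss_y)

print(f"SS(x) = {ss_x:.3f}, SS(y) = {ss_y:.1f}, SS(xy) = {ss_xy:.2f}, r = {r:.2f}")
```

Carrying full precision through the intermediate sums and rounding only at the end reproduces the slide's values for SS(x), SS(y), SS(xy), and r.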

Note:
1. r is usually rounded to the nearest hundredth.
2. r close to 0: little or no linear correlation.
3. As the magnitude of r increases, towards -1 or +1, there is an increasingly stronger linear correlation between the two variables.
4. Method of estimating r based on the scatter diagram. Window should be approximately square. Useful for checking calculations.

3.3: Linear Regression

• Regression analysis finds the equation of the line that best describes the relationship between two variables.

• One use of this equation: to make predictions.

Models or prediction equations:
Some examples of various possible relationships.

Linear: $\hat{y} = b_0 + b_1 x$

Quadratic: $\hat{y} = a + bx + cx^2$

Exponential: $\hat{y} = a(b^x)$

Logarithmic: $\hat{y} = a \log_b x$

Note: What would a scatter diagram look like to suggest each relationship?

Method of least squares:

Equation of the best-fitting line: $\hat{y} = b_0 + b_1 x$

Predicted value: $\hat{y}$

Least squares criterion:
Find the constants b0 and b1 such that the sum

$$\sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2$$

is as small as possible.

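Observed and predicted values of y:

[Figure: The line $\hat{y} = b_0 + b_1 x$ with an observed point $(x, y)$ and the corresponding predicted point $(x, \hat{y})$ on the line; the vertical distance between them is $y - \hat{y}$.]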

The equation of the line of best fit:
Determined by

b0: y-intercept
b1: slope

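Values that satisfy the least squares criterion:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\mathrm{SS}(xy)}{\mathrm{SS}(x)}$$

$$b_0 = \frac{\sum y - \left(b_1 \cdot \sum x\right)}{n} = \bar{y} - \left(b_1 \cdot \bar{x}\right)$$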

Example: A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represents the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals.

1. Draw a scatter diagram for this data.
2. Find the equation of the line of best fit.

x   31   33   22   24   35   29   23   37
y   17   20   13   15   18   17   12   21

Preliminary calculations needed to find b1 and b0:

         x     y     x²     xy
        31    17    961    527
        33    20   1089    660
        22    13    484    286
        24    15    576    360
        35    18   1225    630
        29    17    841    493
        23    12    529    276
        37    21   1369    777

Sum    234   133   7074   4009
        Σx    Σy    Σx²    Σxy

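Finding b1 and b0:

$$\mathrm{SS}(x) = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 7074 - \frac{(234)^2}{8} = 229.5$$

$$\mathrm{SS}(xy) = \sum xy - \frac{\sum x \sum y}{n} = 4009 - \frac{(234)(133)}{8} = 118.75$$

$$b_1 = \frac{\mathrm{SS}(xy)}{\mathrm{SS}(x)} = \frac{118.75}{229.5} = 0.5174$$

$$b_0 = \frac{\sum y - \left(b_1 \cdot \sum x\right)}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$$

Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$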
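The slope and intercept for this example can be reproduced with a short script. This is a minimal sketch that follows the SS formulas used above; the variable names are chosen for the illustration.

```python
# Data from the job-satisfaction example.
salary = [31, 33, 22, 24, 35, 29, 23, 37]          # x
satisfaction = [17, 20, 13, 15, 18, 17, 12, 21]    # y

n = len(salary)
sum_x, sum_y = sum(salary), sum(satisfaction)

# Sums of squares from the computational formulas.
ss_x = sum(x * x for x in salary) - sum_x ** 2 / n
ss_xy = sum(x * y for x, y in zip(salary, satisfaction)) - sum_x * sum_y / n

# Least squares slope and y-intercept.
b1 = ss_xy / ss_x
b0 = (sum_y - b1 * sum_x) / n

print(f"SS(x) = {ss_x}, SS(xy) = {ss_xy}")
print(f"y-hat = {b0:.2f} + {b1:.3f}x")
```

Because the script keeps the unrounded slope when computing the intercept, it matches the value 1.4902 shown above; substituting the rounded slope 0.5174 by hand gives a slightly different fourth decimal place.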

Scatter diagram:

[Figure: Scatter diagram of Job Satisfaction (vertical axis, 12 to 22) versus Salary (horizontal axis, 21 to 37).]

Note:
1. Keep at least three extra decimal places while doing the calculations to ensure an accurate answer.
2. When rounding off the calculated values of b0 and b1, always keep at least two significant digits in the final answer.
3. The slope b1 represents the predicted change in y per unit increase in x.
4. The y-intercept is the value of y where the line of best fit intersects the y-axis.
5. The line of best fit will always pass through the point $(\bar{x}, \bar{y})$.
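Two of these properties can be checked numerically on the job-satisfaction data: that the fitted line passes through the point of means, and that it gives a smaller sum of squared errors than nearby lines. The sketch below is illustrative only; the helper names and the particular shifts tried are arbitrary choices, not part of the slides.

```python
salary = [31, 33, 22, 24, 35, 29, 23, 37]          # x
satisfaction = [17, 20, 13, 15, 18, 17, 12, 21]    # y

def fit_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    b1 = ss_xy / ss_x
    b0 = (sum(y) - b1 * sum(x)) / n
    return b0, b1

def sum_squared_errors(b0, b1, x, y):
    """Sum of (y - y-hat)^2 for the line y-hat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = fit_line(salary, satisfaction)

# Property: the line passes through the point of means (x-bar, y-bar).
x_bar = sum(salary) / len(salary)
y_bar = sum(satisfaction) / len(satisfaction)
passes_through_mean = abs((b0 + b1 * x_bar) - y_bar) < 1e-9

# Least squares criterion: shifting the intercept or slope increases the sum.
best = sum_squared_errors(b0, b1, salary, satisfaction)
shifts = [(d0, d1) for d0 in (-0.5, 0.0, 0.5) for d1 in (-0.05, 0.0, 0.05)
          if (d0, d1) != (0.0, 0.0)]
nearby = [sum_squared_errors(b0 + d0, b1 + d1, salary, satisfaction)
          for d0, d1 in shifts]

print(passes_through_mean, all(best < s for s in nearby))
```

Trying a handful of shifted lines is not a proof of minimality, but it makes the least squares criterion concrete: every alternative line tested has a larger sum of squared errors.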

Making predictions:
1. One of the main purposes for obtaining a regression equation is for making predictions.
2. For a given value of x, we can predict a value of y, $\hat{y}$.
3. The regression equation should be used to make predictions only about the population from which the sample was drawn.
4. The regression equation should be used only to cover the sample domain on the input variable. You can estimate values outside the domain interval, but use caution and use values close to the domain interval.
5. Use current data. A sample taken in 1987 should not be used to make predictions in 1999.
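As an illustration of the guidance on the sample domain, the sketch below makes predictions from the job-satisfaction line and flags inputs outside the range of salaries in the sample (22 to 37). The function name `predict` and the test inputs 30 and 45 are hypothetical choices for this example.

```python
# Rounded coefficients of the line of best fit from the job-satisfaction example.
B0, B1 = 1.49, 0.517

# Sample domain: smallest and largest salary observed in the sample.
X_MIN, X_MAX = 22, 37

def predict(x):
    """Return (y_hat, inside_domain) for an input value x."""
    y_hat = B0 + B1 * x
    inside_domain = X_MIN <= x <= X_MAX
    return y_hat, inside_domain

for x in (30, 45):  # 45 lies well outside the sample domain
    y_hat, inside_domain = predict(x)
    note = "" if inside_domain else " (outside sample domain: use caution)"
    print(f"x = {x}: predicted y = {y_hat:.2f}{note}")
```

Returning the domain flag alongside the prediction leaves the decision to the caller, which fits the slides' advice that extrapolation is possible but should be treated with caution.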
