Further Maths Bivariate Data Summary - St Leonard's … · Further Maths Bivariate Data Summary...

Further Mathematics Bivariate Summary

Representing Bivariate Data

Back to Back Stem Plot

A back to back stem plot is used to display bivariate data involving a numerical variable and a categorical variable with two categories.

Example

The data can be displayed as a back to back stem plot.

From the distribution we see that the Labor distribution is symmetric and therefore the mean and the median are very close, whereas the Liberal distribution is negatively skewed. Since the Liberal distribution is skewed the median is a better indicator of the centre of the distribution than the mean. It can be seen that the Liberal party volunteers handed out many more “how to vote” cards than the Labor party volunteers.

Parallel Boxplots

When we want to display a relationship between a numerical variable and a categorical variable with two or more categories, parallel boxplots can be used.

Example

The 5-number summary of each class is determined.

Four boxplots are drawn. Notice that one scale is used.

Based on the medians, 7B did best (median 77.5), followed by 7C (median 69.5), then 7D (median 65) and finally 7A (median 61.5).
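The five-number summary behind each boxplot can be sketched in Python as follows. This is a minimal illustration only: the marks are hypothetical (the class data is not reproduced in this summary), and the quartile rule used, the median of each half with the middle value left out when the count is odd, is an assumption. Calculators and software may use slightly different conventions.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Assumed quartile rule: Q1 and Q3 are the medians of the lower and
    upper halves, leaving out the middle value when the count is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical marks for one class (illustration only, not source data).
marks = [52, 58, 61, 62, 66, 70, 75, 81]
summary = five_number_summary(marks)
print(summary)
```

Running the same function on each class gives the values needed to draw the parallel boxplots on one shared scale.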

Two-way Frequency Tables and Segmented Bar Charts

When we are examining the relationship between two categorical variables, the two-way frequency table can be used.

Example

67 primary and 47 secondary school students were asked about their attitude to the number of school holidays which should be given. They were asked whether there should be fewer, the same number or more school holidays. The results of the survey can be recorded in a two-way frequency table and a percentage two-way table as shown below.

The data can also be represented in a segmented bar chart based on the data in the second table.

Clearly, secondary students were much keener on having more holidays than were primary students.

Dependent and Independent Variables

In a relationship involving two variables, if the values of one variable “depend” on the values of another variable, then the former is referred to as the dependent variable and the latter as the independent variable. When a relationship between two variables is being examined, it is important to know which of the two depends on the other. Most often we can make a judgement about this, although sometimes it may not be possible.

For example, in the case where the ages of company employees are compared with their annual salaries, you might reasonably expect that the annual salary of an employee would depend on the person’s age. In this case, the age of the employee is the independent variable and the salary of the employee is the dependent variable.

We always place the independent variable on the x-axis and the dependent variable on the y-axis in a scatterplot.

Scatterplots

Example

There is a moderate, negative, linear relationship between the two variables.

Pearson’s Correlation Coefficient (r)

Note that outliers can strongly affect the value of r, so it should be interpreted with care when outliers are present.
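For reference, one standard way of writing Pearson's correlation coefficient for n data points is sketched below. It is a textbook definition supplied here as an assumption about what the summary intends, not a formula taken from the source; $\bar{x}$ and $\bar{y}$ denote the means and $s_x$ and $s_y$ the sample standard deviations of the two variables.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

The value of r always lies between -1 and +1. Because every data point contributes to the sum, a single point far from the others can shift r noticeably.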

The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the percentage of variation in one variable which can be explained by the variation in the other variable.

Example.

Linear Regression

If two variables have a moderate or strong association (positive or negative), we can find the equation of the line of best fit of the data and make predictions. The general process of fitting curves to data is called regression and the fitted line is called a regression line.

Lines of Best Fit

By Eye

The simplest method is to plot the data on a scatter graph and place a line over the plot by eye which seems to best represent the pattern in the data values. This method will often give the approximate location of the regression line.

The 3-Median Method

Step 5. Find the equation of the line (general form $y = mx + c$, where $m$ is the gradient and $c$ is the y-intercept). The gradient of the 3-median line can be found by determining the gradient of the line that passes through the upper median and lower median, $(x_U, y_U)$ and $(x_L, y_L)$. Use the formula:

$\text{Gradient} = \frac{\text{rise}}{\text{run}} = \frac{y_U - y_L}{x_U - x_L}$

The y-intercept $c$ can be found from the graph if the scale on the axes begins at zero. Otherwise, take a point on the final line and determine $c$ by substitution.
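The arithmetic in Step 5 can be sketched in Python as follows. The median points and the point on the final line are hypothetical values chosen for illustration, and the function names are not from the source.

```python
def three_median_gradient(lower, upper):
    """Gradient of the line through the lower and upper median points."""
    x_l, y_l = lower
    x_u, y_u = upper
    return (y_u - y_l) / (x_u - x_l)

def intercept_by_substitution(gradient, point):
    """Solve y = m*x + c for c, given a point known to lie on the line."""
    x, y = point
    return y - gradient * x

# Hypothetical values (not taken from the source).
lower_median = (2.0, 5.0)
upper_median = (8.0, 17.0)
point_on_line = (5.0, 12.0)   # assumed to lie on the final 3-median line

m = three_median_gradient(lower_median, upper_median)
c = intercept_by_substitution(m, point_on_line)
print(f"y = {m}x + {c}")
```

Substituting a point works for any axis scale, which is why it is the fallback when the y-intercept cannot be read directly from the graph.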

Using the Calculator to Determine the Equation of the 3-Median Line

The following table gives the winning high jump heights for consecutive Olympic Games from 1956 to 2000.

1. Enter the data into Lists and Spreadsheets View.

2. Press the Home button and enter Data and Statistics View.

3. Press the TAB key and select year to make it the independent variable (x value).

4. Press the TAB key and select height to make it the dependent variable (y value).

You should see the scatter plot shown below.

5. Press Menu, Analyse, Regression and Show Median-Median. You will see the plot with the 3-Median regression line as below.

6. The equation of the 3-Median regression line is given as $y = 0.006094x - 9.7751$.

This equation can be written as:

Winning Height = 0.006094 × Year − 9.7751

7. Using this rule we can predict the winning jump in 2004 by substituting Year = 2004 into the equation.

Winning Height = 0.006094 × 2004 − 9.7751 = 2.43728

The regression model predicts the winning jump to be 2.44 metres in 2004.

It is interesting to note that the actual winning jump in 2004 was 2.36 m, which shows that extrapolating data can be unreliable.
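The substitution in step 7 and the comparison with the actual result can be reproduced with a short Python sketch. The coefficients and the 2004 height are the ones quoted above; the function and variable names are illustrative.

```python
def winning_height(year):
    """3-median model quoted in the text: height in metres for a given year."""
    return 0.006094 * year - 9.7751

predicted_2004 = winning_height(2004)
actual_2004 = 2.36  # actual winning jump quoted in the text, in metres
error = predicted_2004 - actual_2004

print(f"Predicted: {predicted_2004:.2f} m")
print(f"Actual:    {actual_2004:.2f} m")
print(f"Model overestimates by {error:.2f} m")
```

The gap between prediction and outcome comes from using the model for a year beyond the 1956 to 2000 data.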

Interpolation occurs when a value is substituted into the regression equation that is within the bounds of the data given. Interpolation is generally reliable. Predicting the winning jump for the year 1978 in the above example is interpolation.

Extrapolation occurs when a value is substituted into the regression equation that is outside the bounds of the given data. Extrapolation can be unreliable, because there is no evidence that the pattern continues beyond the data. Predicting the winning jump for the year 2004 in the above example is extrapolation.

The Least Squares Regression Method

The Least Squares Regression method finds the line that minimises the sum of the squares of the vertical distances between the data points and the line. We normally use CAS to generate the equation of the least squares regression line. It is given in the form $y = mx + b$ on the calculator.

Example:

You would expect the number of skiers to depend on the depth of snow. The independent variable is the depth of snow and the dependent variable is the number of skiers.

Create a scatterplot of the data. This can be done on the calculator.

1. In Lists and Spreadsheet view, enter the data in the table.

2. Hit the Home button and go to Data and Statistics view.

3. Tab to the horizontal axis and select the independent variable depth, then tab to the vertical axis and select the dependent variable skiers. The scatterplot will form.

4. It can be seen that there is a strong, positive, linear relationship between the depth of snow and the number of skiers. There is evidence to suggest that as the depth of the snow increases the number of skiers increases.

5. Next find $r$, the coefficient of correlation, and $r^2$, the coefficient of determination.

Hit Ctrl and Left Arrow to return to Lists and Spreadsheet View. Hit Menu, Statistics, Stat Calculations, Linear Regression (mx + b). Hit the Click button and select depth from the drop down list for X List. Hit tab and select skiers from the drop down list for the Y List.

There is no need to enter data into the other boxes. Tab to OK and hit Enter.

The coefficient of correlation is $r = 0.88402$.

This indicates that there is a strong, positive correlation between the depth of snow and the number of skiers.

The coefficient of determination is $r^2 = 0.781492$.

We can say that 78% of the variation in the number of skiers can be explained by the variation in the depth of snow.

The output also gives us the line of best fit, the least squares regression equation:

$y = 186.418x + 28.3373$

We can write this more clearly as

Number of skiers = 186.418 × depth of snow in m + 28.3373

6. The equation of the least squares regression line can also be determined in Data and Statistics view. Hit Ctrl + Right Arrow to return to your scatterplot. Hit Menu, Analyse, Regression, Show Linear (mx + b).

The least squares regression equation is $y = 186.418x + 28.3373$.
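As a quick check on the figures reported in step 5, the coefficient of determination is simply the square of the correlation coefficient. The short Python sketch below uses the value of r quoted above; the variable names are illustrative.

```python
r = 0.88402           # correlation coefficient reported by the calculator
r_squared = r ** 2    # coefficient of determination

print(f"r^2 = {r_squared:.4f}")
print(f"About {r_squared:.0%} of the variation in skier numbers is explained")
```

The small difference from the calculator's 0.781492 in the later decimal places arises because r is quoted here to five decimal places only.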

Interpreting the Slope (Gradient) and y-intercept

The slope or gradient 186.418 indicates that, on average, the number of skiers increases by about 186 for every 1 metre increase in depth of snow.

The y-intercept 28.3373 indicates that if the depth of snow is 0, the model predicts about 28 skiers attending the resort. Since a depth of 0 m lies outside the range of the data (0.5 to 3.6 m), this value is an extrapolation and should be treated with caution.

Using the Least Squares Regression Equation to make Predictions

Suppose we want to estimate the number of skiers when the depth of snow is 3 m. Using

Number of skiers = 186.418 × depth of snow + 28.3373

Number of skiers = 186.418 × 3 + 28.3373 = 587.5913

That is 588 skiers.

This result is reliable because we have interpolated. 3 m lies within the bounds of the depth of snow given in the table. That is, it is between 0.5 and 3.6 m.

Suppose we want to estimate the number of skiers when the depth of snow is 4 m.

Number of skiers = 186.418 × 4 + 28.3373 = 774.0093

That is 774 skiers.

The result is unreliable because we have extrapolated. 4 m lies outside the bounds of the depth of snow given in the table. It is outside the range of 0.5 to 3.6 m.
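The two predictions above, together with the interpolation check, can be sketched in Python. The coefficients and the data range (0.5 to 3.6 m) are taken from the text; the function names are illustrative.

```python
DEPTH_MIN, DEPTH_MAX = 0.5, 3.6   # range of snow depths in the data (metres)

def predict_skiers(depth):
    """Least squares model from the text: skiers for a snow depth in metres."""
    return 186.418 * depth + 28.3373

def prediction_type(depth):
    """Label a prediction as interpolation or extrapolation."""
    if DEPTH_MIN <= depth <= DEPTH_MAX:
        return "interpolation"
    return "extrapolation"

for depth in (3, 4):
    skiers = predict_skiers(depth)
    print(f"{depth} m: {skiers:.4f} skiers ({prediction_type(depth)})")
```

Checking the input against the data range before trusting a prediction is the habit this example is meant to encourage.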

Calculating the Least Squares Regression Line Manually

If you are given Pearson’s correlation coefficient, r, the mean values of the independent and dependent variables, and the standard deviations of the independent and dependent variables for a set of data, the least squares regression equation can be calculated.

If the least squares regression equation is $y = mx + c$, then the gradient is

$m = r \frac{s_y}{s_x}$, where

r is the coefficient of correlation,

$s_x$ is the standard deviation of the independent variable,

$s_y$ is the standard deviation of the dependent variable.

Also, the least squares regression line always passes through the point $(\bar{x}, \bar{y})$, so

$\bar{y} = m\bar{x} + c$, where

$\bar{x}$ is the mean of the independent variable,

$\bar{y}$ is the mean of the dependent variable.

This equation can be rearranged to give:

$c = \bar{y} - m\bar{x}$

$c$ can now be found by substitution.
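The two formulas above can be sketched as a small Python function. To have numbers to work with, the sketch uses the five-point data set that appears later in this summary under residual analysis, and it computes r from the standard definition, which is an assumption rather than something shown in the source. The function name is illustrative.

```python
from statistics import mean, stdev

def least_squares_from_summary(r, mean_x, mean_y, s_x, s_y):
    """Return (m, c) for y = m*x + c using m = r*s_y/s_x and c = ybar - m*xbar."""
    m = r * s_y / s_x
    c = mean_y - m * mean_x
    return m, c

# Small data set used later in this summary (residual analysis).
x = [2, 3, 4, 5, 6]
y = [4.8, 5.4, 6.0, 6.5, 6.9]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)          # sample standard deviations
n = len(x)
# Pearson's r computed from its standard definition (assumed, not from source).
r = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / ((n - 1) * s_x * s_y)

m, c = least_squares_from_summary(r, x_bar, y_bar, s_x, s_y)
print(f"r = {r:.4f}, m = {m:.3f}, c = {c:.3f}")
```

In an exam setting the five summary values are usually given directly, in which case only the function body is needed.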

Example:

Residual Analysis

There is no guarantee that a linear regression model will be appropriate for a bivariate data set. One way to check that a regression line is viable is to look at the differences between the actual y values and the predicted y values from the regression equation. These differences are often referred to as the residual values, because they are the bits left over. The residual values can be determined using the formula:

residual value = y (actual value) − y (predicted value)

Each residual is the signed vertical distance between a data point and the regression line: positive when the point lies above the line and negative when it lies below.

We can create a plot of the residuals very easily using CAS.

For example, if the values given are those in the table below:

x    2     3     4     5     6
y    4.8   5.4   6.0   6.5   6.9

1. Enter the data into Lists and Spreadsheets View.

2. Press the Home button and enter the Data and Statistics View.

3. Form the scatter plot, ensuring the x values are along the x-axis and the y values are along the y-axis.

4. Click Menu, Analyse, Regression, Show Linear (mx + b). This will insert the regression line. On first inspection it appears to be a good fit.

5. Tab to the y-axis and click stat.resid.

The residual plot appears.

Notice that the points do not appear to be distributed randomly about zero. A pattern is evident: the points form a curve that looks like a parabola. This means that the regression model has not accounted for this pattern in the relationship. So although the coefficient of determination is high ($r^2 = 0.9933$), other models may need to be tried. Also, if you move the mouse pointer over a point, the coordinates of that point are displayed.
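The residuals for the table above can also be computed directly, which makes the curved pattern visible in the numbers. The Python sketch below fits the least squares line with the usual closed-form sums and is an illustration, not the calculator's own procedure.

```python
x = [2, 3, 4, 5, 6]
y = [4.8, 5.4, 6.0, 6.5, 6.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares gradient and intercept from the usual closed-form sums.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
m = s_xy / s_xx
c = y_bar - m * x_bar

# residual = actual y - predicted y
residuals = [b - (m * a + c) for a, b in zip(x, y)]

for a, res in zip(x, residuals):
    print(f"x = {a}: residual = {res:+.2f}")
```

The residuals are negative at both ends and positive in the middle, which is the parabola-like pattern described above rather than a random scatter about zero.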

Transforming to Linearity

We have seen that there is no guarantee that a linear regression model (straight line model) will be appropriate for a bivariate data set, even when the coefficient of determination is high. It may be that a non-linear model such as a quadratic, logarithmic or reciprocal model is better suited to the data. Another approach is to transform the data to “straighten out” the curve, so that a straight line can be fitted to the transformed data. This gives a non-linear model in terms of the original variables.

Example

Air is trapped in a syringe, and then weights are added to the plunger, which increases the pressure exerted on the trapped air. For each weight, the pressure and resulting volume are recorded.

Volume (mL)      29.5   25.1    22.5   20      18.2    16.7    15.5    14.2    13.4
Pressure (kPa)   98.6   111.3   124    136.7   149.4   162.1   174.8   187.5   200.2

1. Using CAS in Lists and Spreadsheets View, enter the data in the table.

2. Press the Home button and enter the Data and Statistics View. Create the scatter plot, making the volume the independent x value and the pressure the dependent y value. On inspection it appears that the plot is curved and not linear.

3. Press Menu, Analyse, Regression, Show Linear (mx + b) to insert the least squares regression line.

4. Tab to the y-axis and select stat.resid to create a residual plot.

Notice that the residual points form a curved pattern about the x-axis. This suggests a linear model is not the best model for the data, even though the coefficient of determination is high ($r^2 = 0.9425$). Also, if the mouse is moved over a point, the coordinates of that point are displayed.

5. In order to "straighten" the plot we will use the reciprocal transformation $\frac{1}{x}$. We will take the reciprocal of the volume values.

Press Ctrl + Left Arrow to return to Lists and Spreadsheets View. Give column C the heading recvol (short for reciprocal volume). In the grey area just below the new heading in column C enter the formula:

1/vol

Press Enter. Column C should automatically fill with the reciprocal values.

6. Press the Home button and enter Data and Statistics View. Plot the pressure (y values) against recvol (x values) to form a new scatter plot of the reciprocal data. Also insert the least squares regression line. Notice that the line is a very close fit.

7. Remember that we have plotted y (pressure) against $\frac{1}{x}$, that is, $\frac{1}{\text{volume}}$. The graphics calculator reports the model in the form $y = mx + b$, where the symbol x refers to the values used for the independent variable, which we recognise as $\frac{1}{x}$ due to the transformation.

So the transformed equation can be written as:

$y = \frac{2499}{x} + 12.6$

or

$\text{Pressure} = \frac{2499}{\text{Volume}} + 12.6$

Using this formula we can find the pressure if a volume is given.

For example, if the volume is 19 mL,

$\text{Pressure} = \frac{2499}{19} + 12.6 = 144.13$

Pressure is 144.13 kPa.

This result is reliable because we have interpolated. The value 19 mL lies within the range of volumes given in the table.

If the volume is 5 mL, then

$\text{Pressure} = \frac{2499}{5} + 12.6 = 512.4$

Pressure is 512.4 kPa.

This result is unreliable because we have extrapolated. The value 5 mL lies outside the range of volumes given in the table.
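The whole reciprocal transformation can be reproduced away from the calculator with a short Python sketch. It uses the volume and pressure values from the table in this example and fits the line with the usual closed-form least squares sums; the variable and function names are illustrative, and the fitted coefficients should come out close to the 2499 and 12.6 reported above.

```python
volume = [29.5, 25.1, 22.5, 20, 18.2, 16.7, 15.5, 14.2, 13.4]             # mL
pressure = [98.6, 111.3, 124, 136.7, 149.4, 162.1, 174.8, 187.5, 200.2]  # kPa

# Reciprocal transformation of the independent variable.
recvol = [1 / v for v in volume]

n = len(recvol)
x_bar = sum(recvol) / n
y_bar = sum(pressure) / n

# Least squares fit of pressure against 1/volume.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(recvol, pressure))
s_xx = sum((a - x_bar) ** 2 for a in recvol)
m = s_xy / s_xx
c = y_bar - m * x_bar

def predict_pressure(vol_ml):
    """Pressure (kPa) predicted by the transformed model for a volume in mL."""
    return m / vol_ml + c

print(f"Pressure = {m:.0f} / Volume + {c:.1f}")
print(f"At 19 mL: {predict_pressure(19):.2f} kPa")
```

Note that the line is fitted to the transformed values, so the prediction function divides by the volume rather than multiplying by it.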