Bivariate Data analysis
Bivariate Data
In this PowerPoint we look at sets of data which contain two variables.
Scatter plots Correlation
Outliers Causation
Variables
Quantitative (numerical): measurements and counts
• Discrete
• Continuous
Qualitative (categorical): define groups
• Ordinal (fall in natural order)
• Categorical (no idea of order)
We are only going to consider quantitative variables in this AS
Quantitative
Discrete
• Many repeated values
• Age groups
• Marks
Continuous
• Few repeated values
• Height
• Length
• Weight
Qualitative
Categorical
• Gender
• Religious denomination
• Blood types
• Sport’s numbers (e.g. He wears the number ‘8’ jersey)
Ordinal
• Grades
• Places in a race (e.g. 1st, 2nd, 3rd)
We often want to know if there is a relationship between two
numerical variables.
A scatter plot, which gives a visual display of the relationship between two variables, provides a good starting point.
In a relationship involving two variables, if the values of one variable ‘depend’ on the values of another variable, then the former variable is referred to as the dependent (or response) variable and the latter variable is
referred to as the independent (or explanatory) variable.
y - axis dependent (response) variable
x - axis independent (explanatory) variable
Consider data on ‘hours of study’ vs ‘test score’
Hours Score Hours Score Hours Score
18 59 14 54 17 59
16 67 17 72 16 76
22 74 14 63 14 59
27 90 19 72 29 89
15 62 20 58 30 93
28 89 10 47 30 96
18 71 28 85 23 82
19 60 25 75 26 35
22 84 18 63 22 78
30 98 19 61
We may want to see if we could predict
the test score (response variable) based on the
hours of study (explanatory variable).
y - axis: Test score
x - axis: Hours of study
We look for a pattern in the way the points lie.
Certain patterns tell us about the relationship.
This is called correlation.
This point is an outlier.
We could describe the rest of the data as having a linear form.
Scatter plots
• Use hollow circles for points
• Label axes correctly with units
• What you want to predict goes on the y-axis (response variable)
• Title of graph
• No background; no gridlines
• Unless you need to show categories, no legend
• Show different categories on a single graph in different colours rather than on separate graphs.
• Adjust scale and size of font (14pt for pasting)
What to look for in your plot?
• Direction of the relationship - positive or negative
• Form of the graph - linear or curved
• The strength - whether it is strong, moderate or weak
• Scatter - constant scatter, a fan effect…
• Outliers
• Groupings
What do you see in this scatter plot?
• There appears to be a linear trend.
• There appears to be moderate constant scatter.
• Negative Association.
• No outliers or groupings visible.
[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations. x-axis: Latitude (°S); y-axis: Temperature (°C)]
What do you see in this scatter plot?
• There appears to be a non-linear trend.
• There appears to be non-constant scatter about the trend line.
• Positive Association.
• One possible outlier (Large GDP, low % Internet Users).
[Scatter plot: % of population who are Internet Users vs GDP per capita for 202 Countries. x-axis: GDP per capita (thousands of dollars); y-axis: Internet Users (%)]
What do you see in this scatter plot?
• Two non-linear trends (Male and Female).
• Very little scatter about the trend lines
• Negative association until about 1970, then a positive association.
• Gap in the data collection (Second World War).
[Scatter plot: Average Age New Zealanders are First Married. x-axis: Year; y-axis: Age]
Rank these relationships from weakest (1) to strongest (4):
[Four scatter plots, labelled 1 to 4]
Describe these relationships
• Perfect, negative, linear relationship
• Perfect, positive, linear relationship
• No relationship
• Moderate, negative, linear relationship
• Weak, positive, linear relationship
Describe this relationship.
As the hours of study increase, the test score . . . .? . . .
Pearson’s product-moment correlation coefficient, r
Correlation measures the strength of the linear association between two quantitative variables.
[Scatter plots showing r = -1, r = -0.7, r = -0.4, r = 0, r = 0.3, r = 0.8 and r = 1]
r = -1 and r = 1: points fall exactly on a straight line
r = 0: no linear relationship (uncorrelated)
The correlation coefficient may take any value between -1.0 and +1.0.
It tells you how close the points in the scatter plot come to lying on a straight line.
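To make the idea concrete, here is a minimal Python sketch that computes r from its definition. The function name pearson_r and the three small data sets are made up for illustration; they are not from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [3, 5, 7, 9, 11])        # points exactly on a rising line
r_down = pearson_r(x, [10, 8, 6, 4, 2])      # points exactly on a falling line
r_scattered = pearson_r(x, [2, 1, 4, 3, 5])  # rising trend with some scatter

print(r_up, r_down, r_scattered)
```

The first two values come out at the extremes of the scale (1 and -1), and the scattered set lands in between.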
r - what does it tell you?
[Two scatter plots of y against x: one with r = 0.99 and one with r = 0.57]
Interpreting r
• 0.75 to 1: Strong positive linear association
• 0.5 to 0.75: Moderate positive linear association
• 0.25 to 0.5: Weak positive linear association
• -0.25 to 0.25: No association or weak linear association
• -0.5 to -0.25: Weak negative linear association
• -0.75 to -0.5: Moderate negative linear association
• -1 to -0.75: Strong negative linear association
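The bands above can be turned into a small lookup, sketched here in Python. The function name describe_r is hypothetical, and because the bands share their end points, this sketch assumes a boundary value (for example exactly 0.5) goes into the stronger band.

```python
def describe_r(r):
    """Return the wording from the 'Interpreting r' bands for a given r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size < 0.25:
        return "no association or weak linear association"
    # Assumption: a value on a boundary is placed in the stronger band
    if size < 0.5:
        strength = "weak"
    elif size < 0.75:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear association"

print(describe_r(0.85))
print(describe_r(-0.6))
print(describe_r(0.1))
```

These are rules of thumb for wording a description, so a value near a boundary deserves a look at the scatter plot before you commit to a label.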
Useful websites
• http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html Regression by eye
• http://istics.net/stat/Correlations/ Guessing
• http://illuminations.nctm.org/LessonDetail.aspx?ID=L455#whatif effect of outliers
Assumptions
• linear relationship between x and y
• continuous random variables
• the residuals must be normally distributed
• the observations (pairs of x and y values) must be independent of each other
• all individuals must be selected at random from the population
• all individuals must have equal chance of being selected
What is correlation?
A measure of the strength of a LINEAR association between two
quantitative variables.
Sure, you can calculate a correlation coefficient for any pair of variables, but correlation measures the strength only of the linear association and will be misleading if the relationship is not linear.
Do you know that:
• Correlation applies only to quantitative variables. Check you know the units and what they measure.
• Outliers can distort the correlation dramatically.
Some facts about the correlation coefficient
• The sign gives the direction of the association.
• Correlation is always between -1 and 1.
• Correlation treats x and y symmetrically. The correlation of x with y is the same as the correlation of y with x.
• Correlation has no units and is generally given as a decimal.
• r is a multiple of the slope of the fitted line, so r and the slope always have the same sign.
• Note: variables can have a strong association but still have a small correlation if the association isn’t linear.
• Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
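Three of these facts can be checked numerically. The sketch below uses made-up heights and weights (not data from the slides) and hypothetical helper names. It assumes "the slope" means the slope of the least-squares line of y on x, in which case r = slope × (standard deviation of x) / (standard deviation of y).

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_slope(xs, ys):
    """Slope of the least-squares line for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

# Made-up illustrative data
heights_cm = [150, 160, 165, 172, 180]
weights_kg = [52, 58, 63, 70, 77]

r_xy = pearson_r(heights_cm, weights_kg)
r_yx = pearson_r(weights_kg, heights_cm)          # swap x and y
heights_m = [h / 100 for h in heights_cm]         # change the units of x
r_metres = pearson_r(heights_m, weights_kg)
r_from_slope = (least_squares_slope(heights_cm, weights_kg)
                * sample_sd(heights_cm) / sample_sd(weights_kg))

print(r_xy, r_yx, r_metres, r_from_slope)
```

All four printed values agree apart from floating-point rounding: swapping the variables or changing the units leaves r unchanged, and rescaling the slope by the two standard deviations recovers r.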
The sign gives the direction of the association.
Positive Negative
Correlation treats x and y symmetrically. The correlation of x and y is the same as
the correlation of y with x.
r is a multiple of the slope
Variables can have a strong association but still have a small correlation if the association isn’t
linear.
Always plot the data before looking at the correlation!
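A small made-up example shows why plotting first matters. In the Python sketch below, y is completely determined by x (y = x squared), yet the correlation coefficient is zero because the relationship is a curve, not a line. The data and the function name are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]   # a perfect, but curved, relationship

r_curve = pearson_r(x, y)
print(r_curve)
```

A scatter plot of these points shows an obvious U shape, which r alone would never reveal.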
Would it be OK to use a correlation coefficient to
describe the strength of the relationship?
[Four scatter plots:
Distances of Planets from the Sun. x-axis: Position Number; y-axis: Distance (million miles)
Reaction Times (seconds) for 30 Year 10 Students. x-axis: Non-dominant Hand; y-axis: Dominant Hand
Mean January Air Temperatures for 30 New Zealand Locations. x-axis: Latitude (°S); y-axis: Temperature (°C)
Average Weekly Income for Employed New Zealanders in 2001. x-axis: Female ($); y-axis: Male ($)
Two of the plots are marked √ and two are marked X.]
Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a
large one small.
You should be cautious in interpreting the correlation - these
graphs all have the same correlation coefficient (0.817)
Data set 1
Data set 2
Data set 3
Data set 4
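The slides do not say where the four data sets come from. The Python sketch below assumes they are the well-known Anscombe quartet, which is the usual source for this kind of picture. With those values, each r comes out at about 0.816, very close to the figure quoted on the slide, even though the four scatter plots look completely different.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet (assumed to be the four data sets on the slide)
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "Data set 1": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Data set 2": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Data set 3": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Data set 4": (x_4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in r_values.items():
    print(f"{name}: r = {r:.3f}")
```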
Outliers can distort the correlation dramatically. An
outlier can make an otherwise small correlation look big or hide
a large correlation. It can even give an otherwise positive
association a negative correlation coefficient (and vice versa).
What do you see in this scatterplot?
[Scatter plot: Height and Foot Size for 30 Year 10 Students. x-axis: Foot size (cm); y-axis: Height (cm)]
• Appears to be a linear trend, with a possible outlier (tall person with a small foot size).
• Appears to be constant scatter.
• Positive association.
What will happen to the correlation coefficient if the tallest Year 10
student is removed?
[Scatter plot: Height and Foot Size for 30 Year 10 Students. x-axis: Foot size (cm); y-axis: Height (cm)]
• It will get smaller
• It will get bigger
• It will stay the same
What do you see in this scatter plot?
• Appears to be a strong linear trend.
• Outlier in X (the elephant).
• Appears to be constant scatter.
• Positive association.
[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals. x-axis: Gestation (Days); y-axis: Life Expectancy (Years). The elephant is labelled.]
[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals. x-axis: Gestation (Days); y-axis: Life Expectancy (Years). The elephant is labelled.]
What will happen to the correlation coefficient if the elephant is removed?
• It will get smaller
• It will get bigger
• It will stay the same
How does the outlier affect the r - value?
When you see an outlier, it’s often a good idea to report the
correlations with and without the point.
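As a sketch of what that looks like, the Python below takes the hours of study and test score table from earlier in the slides and computes r twice. It assumes the point flagged as the outlier on that scatter plot is the student with 26 hours and a score of 35; the slide does not say which point it is. The function and variable names are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# (hours of study, test score) pairs transcribed from the table in the slides
study_data = [
    (18, 59), (14, 54), (17, 59),
    (16, 67), (17, 72), (16, 76),
    (22, 74), (14, 63), (14, 59),
    (27, 90), (19, 72), (29, 89),
    (15, 62), (20, 58), (30, 93),
    (28, 89), (10, 47), (30, 96),
    (18, 71), (28, 85), (23, 82),
    (19, 60), (25, 75), (26, 35),
    (22, 84), (18, 63), (22, 78),
    (30, 98), (19, 61),
]
outlier = (26, 35)  # assumption: this is the point marked as the outlier

r_with = pearson_r([h for h, _ in study_data], [s for _, s in study_data])

trimmed = [pair for pair in study_data if pair != outlier]
r_without = pearson_r([h for h, _ in trimmed], [s for _, s in trimmed])

print(f"r with the outlier:    {r_with:.2f}")
print(f"r without the outlier: {r_without:.2f}")
```

Reporting both numbers lets the reader see how much of the story depends on a single student.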
Don’t confuse correlation with causation. Scatterplots and correlation never prove causation.
Using the information in the plot, can you
suggest what needs to be done in a country to increase the life expectancy?
Explain.
[Scatter plot: Life Expectancy and Availability of Doctors for a Sample of 40 Countries. x-axis: People per Doctor; y-axis: Life Expectancy]
Perhaps if you have fewer people per Doctor (i.e. more Doctors per person), then the life expectancy will increase.
Using the information in this plot, can you make another suggestion as
to what needs to be done in a country to increase life expectancy?
[Scatter plot: Life Expectancy and Availability of Televisions for a Sample of 40 Countries. x-axis: People per Television; y-axis: Life Expectancy]
It looks like if you decrease the number of people per television (i.e. have more TVs per person), then the life expectancy will increase!
Can you suggest another variable that is linked to life expectancy and
the availability of doctors (and televisions) which explains the
association between the life expectancy and the availability of
doctors
(and televisions)?
Some measure of the wealth of a country, e.g. average income per person or GDP.
Damaged for life by too much TV
• Watching too much television as a child causes serious health problems years later, and raises the risk of heart disease, a New Zealand study of 1000 children has found….
• It links the amount of time spent in front of the box as a child with obesity, high cholesterol, poor fitness and smoking….
[Scatter plot: Health Score against TV watching, r = -0.93]
Causal relationships
• Two general types of studies: experiments and observational studies
• In an experiment, the experimenter determines which experimental units receive which treatments.
• In an observational study, we simply compare units that happen to have received each of the treatments.
• Only properly designed and carefully executed experiments can reliably demonstrate causation.
• An observational study is often useful for identifying possible causes of effects, but it cannot reliably establish causation
• In observational studies, strong relationships are not necessarily causal relationships.
• Correlation does not imply causation.
• Be aware of the possibility of lurking variables.
Watch out for lurking variables. Damage ($) vs number of firemen would show a strong correlation, but damage doesn’t cause firemen, and while firemen do seem to cause damage (spraying water and chopping holes), that is not what drives the association. The underlying variable is the size of the blaze.
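The fire example can be imitated with a small simulation. In the Python sketch below every number is invented: blaze size drives both the number of firemen and the damage, and the firemen are given no effect on damage at all. The two variables still end up strongly correlated, and the correlation all but disappears once the part explained by blaze size is taken out.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Made-up model: blaze size is the lurking variable behind both measurements
blaze_size = [rng.uniform(1, 10) for _ in range(200)]
firemen = [2 * size + rng.gauss(0, 1) for size in blaze_size]
damage = [1000 * size + rng.gauss(0, 500) for size in blaze_size]

r_raw = pearson_r(firemen, damage)

# Subtract the part of each variable that blaze size accounts for
firemen_left_over = [f - 2 * size for f, size in zip(firemen, blaze_size)]
damage_left_over = [d - 1000 * size for d, size in zip(damage, blaze_size)]
r_left_over = pearson_r(firemen_left_over, damage_left_over)

print(f"firemen vs damage:                   r = {r_raw:.2f}")
print(f"after allowing for the blaze size:   r = {r_left_over:.2f}")
```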
Although there was plenty of evidence that increased smoking was associated with increased levels of lung cancer, it took
years to provide evidence that smoking actually causes lung
cancer.
It would be a good idea to read the two pages of notes you have that discuss correlation and causation!
So now you want to know how to calculate the correlation coefficient, r. Here is one version of the formula!
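The formula on the original slide did not survive in this transcript. One standard way of writing Pearson's r, which may not be the exact version the slide showed, is given below; here x-bar and y-bar are the means of the x-values and y-values, and the sums run over all the data points.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```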
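Luckily the computer will calculate R² and you can take the square root of this to get the size of r. The square root is never negative, so give r a minus sign if the trend is decreasing. Remember: only when the association is linear.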
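Here is a minimal Python sketch of that step. The function name r_from_r_squared is made up, and it assumes you also know the slope of the fitted line (or at least whether the trend goes up or down), since the square root alone cannot tell you the sign.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Recover r from R squared, taking the sign from the slope of the trend."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R squared must lie between 0 and 1")
    # copysign gives the square root the same sign as the slope
    return math.copysign(math.sqrt(r_squared), slope)

# Illustrative values: a decreasing trend and an increasing trend
r_decreasing = r_from_r_squared(0.8649, slope=-2.5)
r_increasing = r_from_r_squared(0.7225, slope=1.4)

print(round(r_decreasing, 2), round(r_increasing, 2))
```

Forgetting the sign would turn a strong negative relationship into what looks like a strong positive one.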
r measures the strength of the relationship, NOT R²!
The words you use
• There is a strong, positive, linear relationship between ‘x’ and ‘y’ and when the x- values increase, the y-values increase also. This is indicated by the value of the correlation coefficient i.e. r = 0.85 which is close to 1.
• (Note: Do not use ‘x’ and ‘y’; use what they represent.)