Statistics in Applied Science and Technology Chapter 13, Correlation and Regression Part I, Correlation (Measure of Association)

Page 1:

Statistics in Applied Science and Technology

Chapter 13, Correlation and Regression

Part I, Correlation (Measure of Association)

Page 2:

Key Concepts in This Chapter

• Associations between two variables measured at interval-ratio level

• Scatter diagram, Pearson’s correlation coefficient

• Positive correlation, negative correlation, curvilinear relationships

• Coefficient of Determination

• Associations between two variables measured at Ordinal level

• Associations between two variables measured at Nominal level

• Associations between one quantitative variable and one nominal or ordinal variable

Page 3:

Association Between Two Variables Measured at Interval-Ratio Level

• Positive Correlation

• Negative Correlation

• No Association

• Curvilinear Relationship

Page 4:

Positive Correlation

Insofar as relatively low values are paired with relatively low values, and relatively high values are paired with relatively high values, the relationship is positive.

Example: time studied for a test and test scores

Page 5:

Negative Correlation

Insofar as relatively low values are paired with relatively high values, and relatively high values are paired with relatively low values, the relationship is negative.

Example: life span in years and history of heavy smoking in years

Page 6:

No Association

No regularity is apparent among the pairs of observations - no association.

Example: IQ score and shoe size

Page 7:

Curvilinear Relationship

• A relationship that can be described best with a curved line.

Example: Physical strength and years of age.

Page 8:

Preview a relationship - Scatterplots

• Definition of scatterplots - a graph containing a cluster of dots that represent all pairs of observations.

• Construction of scatterplots - scale each of the two variables along the horizontal (X) and vertical (Y) axes and use each pair of observations to locate a dot within the scatterplot.

Page 9:

Is There an Association?

Student   SAT-V   GPA
John      333     1.0
Janet     756     3.8
Thomas    444     1.9
Scotty    629     3.2
Diana     501     2.3
Hilary    245     0.4

Page 10:

Plot out the data: The Scattergram

[Figure: scattergram of GPA (vertical axis, 0.0 to 4.0) against SAT_V (horizontal axis, 200 to 800)]

Page 11:

The Scattergram

[Figure: scattergram of GPA (vertical axis, 0.0 to 4.0; mean = 2.1) against SAT_V (horizontal axis, 200 to 800; mean = 484.67)]

Page 12:

Anxiety & Skill Performance

Curvilinear Relationship

[Figure: plot of Level of Skill Performance (vertical axis, 30 to 90) against Anxiety Level (horizontal axis, 2 to 14)]

Page 13:

Measurement of strength of correlation - correlation coefficient r

• Intuitively: the more closely the dot cluster approximates a straight line, the stronger the relationship.

• A correlation coefficient is a number between -1 and 1 that describes the relationship between two variables.

• The sign of r indicates the type of linear relationship, whether positive or negative.

Page 14:

Measurement of strength of correlation - correlation coefficient r

• The value of r, without regard to sign, indicates the strength of the linear relationship.

• Types of correlation coefficient:

– Pearson’s correlation coefficient r (appropriate when both variables are normal in distribution)

– Spearman’s correlation coefficient rs (appropriate when one or both variables are not normal in distribution)

Page 15:

Calculation of Pearson’s r

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} $$

In which: $x$, $y$ - a pair of observations; $\bar{x}$, $\bar{y}$ - the means of x and y
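
As a rough illustration, the formula for Pearson's r can be applied to the six SAT-V and GPA pairs listed earlier in the chapter. The sketch below uses plain Python; the function and variable names are illustrative and do not come from the slides, which rely on SPSS for calculation.

```python
import math

# SAT-V and GPA pairs for the six students listed earlier in the chapter
sat_v = [333, 756, 444, 629, 501, 245]
gpa = [1.0, 3.8, 1.9, 3.2, 2.3, 0.4]

def pearson_r(x, y):
    """Pearson's r: the sum of cross-products of deviations divided by the
    square root of the product of the two sums of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

r = pearson_r(sat_v, gpa)
print(f"r = {r:.4f}")
```

For these six students r comes out at roughly 0.996, a very strong positive linear relationship, which matches the tight upward cluster in the scattergram.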

Page 16:

Coefficient of Determination r²

• r²: simply square the r value

• r² tells us what percentage of the variance in each variable is explained by knowledge of the variance of the other variable

– that is, what percentage of the variance within Y is predicted by the variance within X?

Page 17:

Notes about r²

• The coefficient of determination gives the explained variance

• Therefore, 1 - r² is the unexplained variance

• For example, r = 0.70 gives about 50% explained variance (why???)

• Always calculate r² to evaluate the extent of the correlation
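
The arithmetic behind the r = 0.70 example can be checked with a short Python sketch. The function name is illustrative only and is not part of the slides.

```python
def explained_and_unexplained(r):
    """Return (r squared, 1 - r squared) for a correlation coefficient r."""
    r_squared = r ** 2
    return r_squared, 1 - r_squared

explained, unexplained = explained_and_unexplained(0.70)
print(f"explained: {explained:.0%}, unexplained: {unexplained:.0%}")
```

Since 0.70 squared is 0.49, a correlation that looks fairly strong still leaves about half of the variance unexplained.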

Page 18:

Not Causation

Only Association

Thomas Sowell opinion piece

Page 19:

Smoking & LBP

[Diagram: Smoking and Low Back Pain, r = 0.45]

Page 20:

Smoking & LBP

[Diagram: Smoking and Low Back Pain, r = 0.45; a second link between Low Back Pain and Smoking is marked "?"]

Page 21:

Smoking & LBP

[Diagram: Smoking and Low Back Pain, r = 0.45; Lifestyle factors (i.e., strength) added, marked "?"]

Page 22:

Smoking & LBP

[Diagram: Smoking and Low Back Pain, r = 0.45; Lifestyle choices added]

Page 23:

How Would I Measure Association Between Variables Measured at the Ordinal Level?

• Like associations between two variables measured at the interval-ratio level, associations between two variables measured at the ordinal level are of three types:

– Positive association

– Negative association

– No association

Page 24:

Positive association

• Cases tend to be ranked in the same order on both variables. For example, if case A is ranked above case B on one variable, it would tend also to be ranked above case B on the second variable.

• If “burnout” among elementary school teachers is positively associated with the number of years of service, teacher A, who has more years of service than teacher B, would tend to be ranked higher than teacher B on “burnout”.

Page 25:

Negative association

• Cases tend to be ranked in reverse order on the two variables. If case A is ranked above case B on one variable, it would tend to rank below case B on the second variable.

• If “burnout” among elementary school teachers is negatively associated with the number of years of service, teacher A, who has more years of service than teacher B, would tend to be ranked lower than teacher B on “burnout”.

Page 26:

No Association

No regularity is apparent between the two variables - no association.

Page 27:

Preview an Association - Conditional Distribution of Y

                      Length of Service (X)
Burnout (Y)   Low          Moderate     High         Totals
Low           20 (52.6%)   6 (18.8%)    4 (13.3%)    30
Moderate      10 (26.3%)   15 (46.9%)   5 (16.7%)    30
High          8 (21.1%)    11 (34.4%)   21 (70.0%)   40
Totals        38 (100%)    32 (100%)    30 (100%)    100

Table 1. Burnout by Length of Service

Page 28:

Preview an Association - Conditional Distribution of Y

• Table 1 displays the association between burnout and length of service. Each column shows the pattern of scores on the dependent variable for each score on the independent variable. For example, the first column indicates that, of the 38 teachers whose length of service is low, 20 (52.6%) were low on burnout, 10 (26.3%) were moderate on burnout, and 8 (21.1%) were high on burnout. The second column shows that, of the 32 teachers whose length of service is moderate, 6 (18.8%) were low on burnout, 15 (46.9%) were moderate on burnout, and 11 (34.4%) were high on burnout.

Page 29:

Preview an Association - Conditional Distribution of Y

• As in Table 1, these “within-column” frequency distributions are called the conditional distributions of Y, since they display the distribution of scores on the dependent variable for each condition (score) of the independent variable.

• The conditional distributions of Y in Table 1 show that a positive association exists between burnout and length of service (Why?)
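
The column percentages that make up the conditional distributions of Y can be recomputed from the cell counts in Table 1. The following is a minimal Python sketch; the function and variable names are illustrative and are not part of the slides.

```python
# Cell counts from Table 1: rows are burnout (Low, Moderate, High),
# columns are length of service (Low, Moderate, High).
levels = ["Low", "Moderate", "High"]
counts = [
    [20, 6, 4],
    [10, 15, 5],
    [8, 11, 21],
]

def column_percentages(table):
    """Return each cell as a percentage of its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [100 * row[j] / col_totals[j] for j in range(len(row))]
        for row in table
    ]

percentages = column_percentages(counts)
for level, row in zip(levels, percentages):
    print(f"{level:<10}" + "".join(f"{p:8.1f}%" for p in row))
```

Reading across the columns, the largest share of teachers moves from low burnout (low service) to moderate burnout (moderate service) to high burnout (high service), which is the pattern a positive association produces.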

Page 30:

Measurement of strength of association between two variables measured at Ordinal Level - Gamma (G)

• Gamma (G) can be calculated by (SPSS will be used):

$$ G = \frac{N_s - N_d}{N_s + N_d} $$

Where: $N_s$ - the number of pairs of cases ranked in the same order on both variables. $N_d$ - the number of pairs of cases ranked differently on the two variables.

• Interpreting G: similarly to r², G indicates the proportion by which prediction errors can be reduced if we predict the order of pairs of cases on one variable from the order of pairs of cases on the other - as opposed to predicting order while ignoring the other variable
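
Although the slides leave the calculation to SPSS, the gamma formula can be sketched by hand for Table 1. The Python below assumes the rows and columns are both ordered from low to high; the function and variable names are illustrative only.

```python
# Cell counts from Table 1 (rows: burnout Low/Moderate/High,
# columns: length of service Low/Moderate/High), both ordered low to high.
table = [
    [20, 6, 4],
    [10, 15, 5],
    [8, 11, 21],
]

def gamma(table):
    """Gamma from an ordered contingency table: (Ns - Nd) / (Ns + Nd)."""
    n_rows, n_cols = len(table), len(table[0])
    n_same = 0  # pairs ranked in the same order on both variables
    n_diff = 0  # pairs ranked in opposite order on the two variables
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                # cells lower in the table and to the right: same order
                n_same += table[i][j] * sum(table[k][j + 1:])
                # cells lower in the table and to the left: opposite order
                n_diff += table[i][j] * sum(table[k][:j])
    return n_same, n_diff, (n_same - n_diff) / (n_same + n_diff)

n_same, n_diff, g = gamma(table)
print(n_same, n_diff, round(g, 3))
```

For Table 1 this gives 1831 same-order pairs, 499 opposite-order pairs, and G of about 0.57, consistent with the positive association seen in the column percentages.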

Page 31:

How Would I Measure Association Between Variables Measured at the Nominal Level?

• Associations between two variables measured at the nominal level have no direction, so there are only two types:

– Existence of association

– No association

Page 32:

Preview an Association - Conditional Distribution of Y

                   Gender (X)
Height (Y)   Male        Female      Totals
Tall         44 (88%)    8 (16%)     52
Short        6 (12%)     42 (84%)    48
Totals       50 (100%)   50 (100%)   100

Table 2. Height by Gender

Page 33:

Preview an Association - Conditional Distribution of Y

• Table 2 displays the association between height and gender. Each column shows the pattern of scores on the dependent variable for each score on the independent variable. For example, the first column indicates that, of the 50 people who are male, 44 (88%) were tall and 6 (12%) were short. The second column shows that, of the 50 people who are female, 8 (16%) were tall and 42 (84%) were short. Calculating percentages as in Table 2 helps identify the pattern of association: males tend to be tall and females tend to be short.

Page 34:

Measurement of strength of association between two variables measured at Nominal Level - Phi (φ) and Cramer’s V

• Phi (φ) is appropriate for a 2 × 2 table. It is defined as:

$$ \phi = \sqrt{\frac{\chi^2}{N}} $$

Where: χ² - a test statistic based on expected and observed frequencies. N - total number of observations.

• For tables larger than 2 × 2, a more general form, Cramer’s V, should be used. It is defined as:

$$ V = \sqrt{\frac{\chi^2}{N \cdot \min(r - 1,\, c - 1)}} $$

Where: χ² and N are defined as above. min(r - 1, c - 1) = the minimum of r - 1 (# of rows minus 1) and c - 1 (# of columns minus 1).
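
The slides defer these calculations to SPSS, but the two definitions can be sketched directly in Python. The example below applies them to the Table 2 counts; the function names are illustrative, and the chi-square statistic is computed in the usual way from observed and expected frequencies.

```python
import math

def chi_square(table):
    """Chi-square statistic from observed cell counts; returns (chi2, N)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramer's V; for a 2 x 2 table min(r-1, c-1) is 1, so V equals phi."""
    chi2, n = chi_square(table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * k))

# Table 2 (rows: Tall, Short; columns: Male, Female)
height_by_gender = [[44, 8], [6, 42]]
phi = cramers_v(height_by_gender)
print(f"phi = {phi:.3f}")
```

For Table 2 the chi-square statistic is about 51.9 and phi is about 0.72, indicating a strong association between gender and height in this sample.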

Page 35:

Phi (φ) and Cramer’s V

• Calculations of these two measures will be conducted using SPSS (you don’t need to know about the calculation).

• Values of φ and Cramer’s V are both between 0.00 and 1.00. A value close to 1.00 indicates a strong association, and a value close to 0.00 indicates a very weak association.

Page 36:

A limitation of Phi (φ) and Cramer’s V

• A limitation of φ and Cramer’s V is the absence of a direct and meaningful interpretation for values between the extremes of 0.00 and 1.00.

• In other words, the values between 0.00 and 1.00 cannot be interpreted as anything other than an index of the relative strength of association.

Page 37:

Another measurement of strength of association between two variables measured at Nominal Level - Lambda (λ)

• Lambda is calculated based on E1 (prediction errors made while ignoring the independent variable) and E2 (prediction errors made while taking the independent variable into account).

$$ \lambda = \frac{E_1 - E_2}{E_1} $$

Page 38:

Interpretation of Lambda (λ)

• The value of lambda ranges from 0.00 to 1.00.

• When multiplied by 100, the value of lambda indicates the percentage reduction in prediction error if we take the independent variable into account, as opposed to ignoring it; it therefore indicates the strength of the association.
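
As a sketch of how E1 and E2 might be counted, the Python below applies lambda to Table 2 with height as the dependent variable. It assumes the usual convention that each prediction is the most frequent category, so errors are the cases outside that category; the slides do not spell this out, and the function name is illustrative.

```python
# Table 2 (rows: height Tall/Short; columns: gender Male/Female).
# Height (Y) is treated as dependent, gender (X) as independent.
table = [[44, 8], [6, 42]]

def lambda_yx(table):
    """Lambda for predicting the row variable from the column variable."""
    n = sum(sum(row) for row in table)
    # E1: errors when always guessing the largest row category
    e1 = n - max(sum(row) for row in table)
    # E2: errors when guessing the largest category within each column
    e2 = sum(sum(col) - max(col) for col in zip(*table))
    return e1, e2, (e1 - e2) / e1

e1, e2, lam = lambda_yx(table)
print(e1, e2, round(lam, 3))
```

Here E1 is 48 and E2 is 14, so lambda is about 0.71: knowing gender reduces errors in predicting height by roughly 71% in this table.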

Page 39:

Notes about Lambda (λ)

• Lambda (λ) is asymmetric. This means that the value of λ will vary depending on which variable is taken as independent.

• When one of the row totals is much larger than the others, lambda can take on a value of 0.00 even when other measures of association would not be 0.00 and calculating percentages for the table indicates some association between the variables.

Page 40:

Another Popular Way of Measuring the Strength of Association between Two Nominal Variables (Chap 12.9)

• Relative risk is defined as the ratio of the incidence rate for persons exposed to a risk factor to the incidence rate for those not exposed to the risk factor:

$$ \text{Relative Risk (RR)} = \frac{\text{incidence rate among exposed}}{\text{incidence rate among unexposed}} $$

Page 41:

A generalized 2 × 2 table (Table 3)

Risk Factor   Disease Present   Disease Absent   Total
Present       a                 b                a + b
Absent        c                 d                c + d

Page 42:

Relative Risk and Odds Ratio

• Relative risk is also called the risk ratio, and it can be computed using the frequencies listed in Table 3:

$$ RR = \frac{a/(a+b)}{c/(c+d)} $$

• Another commonly used measure of strength of association is the odds ratio (OR), which is defined as:

$$ OR = \frac{a/b}{c/d} $$
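
To make the two ratios concrete, here is a small Python sketch using the cell labels a, b, c, d from Table 3. The counts are hypothetical, invented purely for illustration, and the function names are not from the slides.

```python
def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)], using the cell labels of Table 3."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = (a / b) / (c / d), using the cell labels of Table 3."""
    return (a / b) / (c / d)

# Hypothetical counts, invented for illustration only:
# risk factor present: 30 with disease, 70 without
# risk factor absent:  10 with disease, 90 without
a, b, c, d = 30, 70, 10, 90
rr = relative_risk(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(f"RR = {rr:.2f}, OR = {odds:.2f}")
```

With these made-up counts the exposed group has three times the incidence of the unexposed group (RR of 3), while the odds ratio is somewhat larger, at about 3.86.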

Page 43:

How to measure strength of association between one quantitative variable and one nominal (or ordinal) variable?

• Measuring the strength of association begins with summary statistics such as the mean, median, percentage, etc.

• The measure of strength of association is created by comparing these summary statistics, for example by subtracting one from the other. The bigger the difference, the stronger the association.

Page 44:

An Example

• When exploring the relationship between benzene and white blood cell count, we might use the mean white blood cell counts for the exposed and non-exposed workers and then subtract these two means. The bigger the difference, the stronger the association.