Powerpoint - Regression and Correlation Analysis

Post on 22-Nov-2014

Regression and Correlation Analysis

Violeta Bartolome, Senior Associate Scientist

PBGB-CRIL, v.bartolome@cgiar.org

Correlation Analysis

• A measure of association between two numerical variables.

• Example (positive correlation)
  o As soil fertility increases, rice grain yield also increases

IRRI-PBGB-CRIL 2

Typically, in the summer as the temperature increases people are thirstier. Consider the two numerical variables, temperature and water consumption. We would expect the higher the temperature, the more water a given person would consume. Thus we would say that in the summer time, temperature and water consumption are positively correlated.

Example

For seven randomly selected plots, nitrogen content in the soil and the grain yield were recorded.

Nitrogen Content (%)   Grain Yield (kg/ha)
0.12                   1652
0.14                   2056
0.15                   2598
0.16                   2734
0.19                   3238
0.22                   4824
0.23                   4858

(The data are shown in the table with the nitrogen content placed in increasing order.)

How would you describe the graph?

[Figure: Grain Yield of Rice at Different Levels of Soil Nitrogen Content. Grain Yield (kg/ha) plotted against Nitrogen Content (%).]

How “strong” is the linear relationship?

This graph helps us visualize what appears to be a somewhat linear relationship between soil nitrogen content and grain yield.

Measuring the Relationship

Pearson’s Sample Correlation Coefficient, r, measures the direction and the strength of the linear association between two numerical paired variables.

Direction of Association

[Figure panels: Positive Correlation; Negative Correlation]

Direction of the Association: The association can be either positive or negative. Positive Correlation: as the x variable increases so does the y variable. Example: In the summer, as the temperature increases, so does thirst. Negative Correlation: as the x variable increases, the y variable decreases. Example: As the price of an item increases, the number of items sold decreases.

Strength of Linear Association

r value   Interpretation
 1        perfect positive linear relationship
 0        no linear relationship
-1        perfect negative linear relationship

Strength of the Association: The strength of the linear association is measured by the sample correlation coefficient, r. r can be any value from –1 to +1. The closer r is to one (in magnitude), the stronger the linear association. If r equals zero, then there is no linear association between the two variables.

Strength of Linear Association

[Figure panels: No Linear Correlation; Perfect Linear Positive Correlation]

Other Strengths of Association

r value   Interpretation
0.9       strong association
0.5       moderate association
0.25      weak association

* No other values of r have precise definitions of strength. See the chart below.

Note: All of the values in the second table are positive, so the associations are positive. The same strength interpretations hold for negative values of r; only the direction of the association changes.

Other Strengths of Association

[Figure panels: Strong Positive Linear Correlation; Moderate Negative Linear Correlation]

Formula

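$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x\, s_y} $$

where
$\sum$ = the sum
$n$ = number of paired items
$x_i$ = input variable
$y_i$ = output variable
$\bar{x}$ = x-bar = mean of the x’s
$\bar{y}$ = y-bar = mean of the y’s
$s_x$ = standard deviation of the x’s
$s_y$ = standard deviation of the y’s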
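A minimal Python sketch of this calculation for the seven plots in the nitrogen example. It assumes the usual sample form of Pearson's r, the sum of cross-products divided by (n − 1)·sx·sy, which uses exactly the quantities defined above. The function names are illustrative and are not part of the original slides.

```python
import math

# Nitrogen content (%) and grain yield (kg/ha) for the seven plots in the example.
nitrogen = [0.12, 0.14, 0.15, 0.16, 0.19, 0.22, 0.23]
grain_yield = [1652, 2056, 2598, 2734, 3238, 4824, 4858]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(x, y):
    """Pearson's sample correlation coefficient for paired lists x and y."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross_products / ((n - 1) * sample_sd(x) * sample_sd(y))


r = pearson_r(nitrogen, grain_yield)
print(f"r = {r:.3f}")
```

The result should agree, after rounding to two decimals, with the coefficient reported in the Pearson correlation output later in the slides.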

Correlation Coefficient (r)

r = 0 does not necessarily mean no relationship. The relationship may be nonlinear.
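As a small illustration of this point, the sketch below uses made-up data in which y is exactly x squared. The relationship is perfect but not linear, and r comes out as zero. The data and function name are hypothetical and chosen only for this example.

```python
import math


def pearson_r(x, y):
    """Pearson's sample correlation coefficient for paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y depends on x exactly, but through a symmetric curve.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_curved = pearson_r(x, y)
print(f"r = {r_curved:.3f}")
```

Here the positive and negative halves of the curve cancel in the cross-product sum, which is why a scatter plot is needed alongside r.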


Correlation Coefficient (r)

A significant r does not necessarily mean a strong linear relationship


Correlation Coefficient

[Figure: Yield/plot plotted against Tiller/plant; r = .25**, n = 234]

When the number of observations is large, a low r-value may still be significant.
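One way to see why sample size matters is the t statistic commonly used to test r against zero, t = r·sqrt(n − 2) / sqrt(1 − r²), with n − 2 degrees of freedom. This formula is a standard result and is not stated in the slides, so treat the sketch below as an illustration under that assumption. The second call uses a hypothetical sample of only 10 pairs.

```python
import math


def t_from_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_large = t_from_r(0.25, 234)  # the slide's example: r = .25, n = 234
t_small = t_from_r(0.25, 10)   # hypothetical: the same r from only 10 pairs

print(f"n = 234: t = {t_large:.2f}")
print(f"n = 10:  t = {t_small:.2f}")
```

The same r gives a much larger t when n is large. With 10 pairs (8 degrees of freedom), the significance table in these slides requires |r| above 0.632 at p = 0.05, so r = .25 would not be significant there.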

Correlation Coefficient (r)

To be able to conclude that two variables have a strong linear relationship, r should be both high and significant.

Correlation Coefficient

[Figure: Yield (t/ha) plotted against No. of spikelet/panicle; r = .90**, n = 60]

Test of significance for r

Degrees of    Probability, p
Freedom       0.05     0.01     0.001
 1            0.997    1.000    1.000
 2            0.950    0.990    0.999
 3            0.878    0.959    0.991
 4            0.811    0.917    0.974
 5            0.755    0.875    0.951
 6            0.707    0.834    0.925
 7            0.666    0.798    0.898
 8            0.632    0.765    0.872
 9            0.602    0.735    0.847
10            0.576    0.708    0.823
11            0.553    0.684    0.801
12            0.532    0.661    0.780
13            0.514    0.641    0.760
14            0.497    0.623    0.742
15            0.482    0.606    0.725
16            0.468    0.590    0.708
17            0.456    0.575    0.693
18            0.444    0.561    0.679
19            0.433    0.549    0.665
20            0.423    0.537    0.652

r is significant if its absolute value is greater than the tabular value.
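The tabular values can be related to the t distribution through r_crit = t_crit / sqrt(t_crit² + df), where df = n − 2 for a simple correlation. This relation is a standard result and is not given in the slides. The two-sided 5% t values hard-coded below (2.571 for df = 5 and 2.228 for df = 10) are taken from a standard t table and are an assumption of this sketch.

```python
import math


def critical_r(t_crit, df):
    """Critical value of r implied by a critical t value with df degrees of freedom."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


# Two-sided 5% critical t values from a standard t table (assumed, not from the slides).
T_CRIT_05 = {5: 2.571, 10: 2.228}

for df, t_crit in T_CRIT_05.items():
    print(f"df = {df}: critical r at p = 0.05 is {critical_r(t_crit, df):.3f}")

# Nitrogen example: n = 7 plots, so df = 5; r = 0.99 as reported in the output.
r_reported = 0.99
df_example = 7 - 2
is_significant = abs(r_reported) > critical_r(T_CRIT_05[df_example], df_example)
print("significant at p = 0.05:", is_significant)
```

The computed values can be compared with the 0.05 column of the table for df = 5 and df = 10.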

CORRELATION ANALYSIS

PEARSON CORRELATION ANALYSIS

                             Nitrogen.Content   Grain.Yield
Nitrogen.Content   Coef      1                  0.99
                   P-value   1                  1e-04
Grain.Yield        Coef      0.99               1
                   P-value   1e-04              1

Regression Analysis


Scientific Question

What is the growth rate of a rice plant?

Growth rate can be defined as the change in height per unit of time.

Data Collection

DAS (days after seeding)   Height (cm)
 0                           0
10                          12
30                          55
60                          80
90                         110

Statistical Questions

• What is the relationship between age and height? Linear
• How do I describe or quantify the relationship? Regression
• Is the association significant? Statistical Test

[Figure: Plant Height (cm) plotted against Days after Seeding]

Linear Regression

• A general method for estimating or describing the association between a continuous outcome variable (dependent) and one or multiple predictors in one equation.
  o One predictor: Simple linear regression
  o Multiple predictors: Multiple linear regression

Statistical Model

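Data = Model Fit + Residual

$$ Y_i = \hat{Y}_i + \varepsilon_i $$

$$ \hat{Y}_i = \beta_0 + \beta_1 X_i $$

where $\beta_0$ is the intercept and $\beta_1$ is the slope.

[Figure: Y plotted against X with the fitted line]

$$ Y_i = \mu + \alpha_i + \varepsilon_i $$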

Least Squares Estimates

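$$ Y_i = \hat{Y}_i + \varepsilon_i, \qquad \hat{Y}_i = \beta_0 + \beta_1 X_i $$

To estimate the intercept and slope, minimize the residual sum of squares (RSS):

$$ RSS = \sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2 $$

$$ \frac{\partial RSS}{\partial \beta_0} = \frac{\partial \sum (Y_i - \beta_0 - \beta_1 X_i)^2}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Longrightarrow\; \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} $$

$$ \frac{\partial RSS}{\partial \beta_1} = \frac{\partial \sum (Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i)^2}{\partial \beta_1} = -2 \sum (X_i - \bar{X})(Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i) = 0 \;\Longrightarrow\; \hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} $$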

We don’t have to do the estimation by hand. R/CropStat or other statistical packages can do the work for us.
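For readers who want to see the arithmetic, here is a minimal Python sketch that applies the closed-form least squares estimates to the growth rate data from the Data Collection slide. The function name is illustrative. A statistical package would normally be used instead, as noted above.

```python
# Growth rate data from the Data Collection slide.
das = [0, 10, 30, 60, 90]          # days after seeding
height = [0, 12, 55, 80, 110]      # plant height (cm)


def least_squares(x, y):
    """Return (intercept, slope) of the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx                   # beta_1 hat
    intercept = y_bar - slope * x_bar   # beta_0 hat
    return intercept, slope


intercept, slope = least_squares(das, height)
print(f"Height = {intercept:.6f} + {slope:.6f} * DAS")
```

The printed estimates can be checked against the Parameter Estimates table in the regression output.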

LINEAR REGRESSION ANALYSIS
Dependent Variable: Height

Analysis of Variance
SV          Df   Sum Square    Mean Square   F value     Pr (>F)
DAS         1    8201.389781   8201.389781   95.435198   0.002279
Residuals   3    257.810219    85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279

Example: Growth Rate Data

Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279

Intercept: The height at age 0 is 4.9 cm.
Slope: The height increase per day after seeding is 1.223 cm.

Height = 4.9 + 1.223 DAS, r = 0.98

[Figure: Plant Height (cm) plotted against Days after Seeding, with the fitted regression line]

Prediction

Given the regression line, it can be predicted that the height at 40 days after seeding will be 53.8 cm.

Height = 4.9 + 1.223 DAS, r = 0.98

[Figure: Plant Height (cm) plotted against Days after Seeding, with the fitted regression line]
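The prediction is a direct substitution into the fitted line. The sketch below uses the unrounded estimates from the Parameter Estimates table. The range check is an added convenience, not part of the slides, and flags values of DAS outside the 0 to 90 days actually observed.

```python
# Estimates copied from the Parameter Estimates table.
INTERCEPT = 4.912409
SLOPE = 1.223358

# Range of DAS values in the growth rate data.
DAS_MIN, DAS_MAX = 0, 90


def predict_height(das):
    """Return (predicted height in cm, whether das lies inside the observed range)."""
    within_range = DAS_MIN <= das <= DAS_MAX
    return INTERCEPT + SLOPE * das, within_range


height_40, inside = predict_height(40)
print(f"Predicted height at 40 DAS: {height_40:.1f} cm (within observed range: {inside})")

height_150, inside_150 = predict_height(150)
print(f"Predicted height at 150 DAS: {height_150:.1f} cm (within observed range: {inside_150})")
```

A prediction flagged as outside the observed range is an extrapolation and should not be trusted without further data.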

Example: Growth Rate Data

Analysis of Variance
SV          Df   Sum Square    Mean Square   F value     Pr (>F)
DAS         1    8201.389781   8201.389781   95.435198   0.002279
Residuals   3    257.810219    85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Sums of Squares

$$ \sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i)^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 $$

                     SST     SSM   SSE
Degrees of freedom   n - 1   1     n - 2

$$ R^2 = \frac{SSM}{SST} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} $$

R2 is the fraction of variation in Y explained by X.
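The decomposition can be reproduced for the growth rate data with a few lines of Python. This is a sketch using the closed-form least squares fit. The variable names are illustrative.

```python
das = [0, 10, 30, 60, 90]          # days after seeding
height = [0, 12, 55, 80, 110]      # plant height (cm)

n = len(das)
x_bar = sum(das) / n
y_bar = sum(height) / n

# Least squares fit of height on DAS.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(das, height))
         / sum((x - x_bar) ** 2 for x in das))
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in das]

# Total, model, and error sums of squares.
sst = sum((y - y_bar) ** 2 for y in height)
ssm = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(height, fitted))

r_squared = ssm / sst
f_value = (ssm / 1) / (sse / (n - 2))   # mean square model / mean square error

print(f"SST = {sst:.4f}, SSM = {ssm:.4f}, SSE = {sse:.4f}")
print(f"R squared = {r_squared:.6f}, F = {f_value:.4f}")
```

SSM and SSE correspond to the DAS and Residuals rows of the Analysis of Variance table, and their sum equals SST.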

Linear Regression vs. ANOVA

ANOVA
Dependent: Continuous
Independent: Categorical

Linear regression
Dependent: Continuous
Independent: Continuous

Linear models

ANOVA and regression are the same thing!!!
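A small sketch of what this means in the simplest case of two groups: if the categorical factor is coded as a 0/1 variable, the regression intercept is the mean of the first group and the slope is the difference between the two group means. The yields below are made-up numbers used only for illustration.

```python
# Hypothetical yields (t/ha) for two varieties; not data from the slides.
variety_a = [4.0, 5.0, 6.0]
variety_b = [7.0, 8.0, 9.0]

# Code the categorical factor as a 0/1 dummy variable.
x = [0] * len(variety_a) + [1] * len(variety_b)
y = variety_a + variety_b

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares fit of y on the dummy variable.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

mean_a = sum(variety_a) / len(variety_a)
mean_b = sum(variety_b) / len(variety_b)

print(f"intercept = {intercept:.2f} (mean of variety A = {mean_a:.2f})")
print(f"slope = {slope:.2f} (difference in means = {mean_b - mean_a:.2f})")
```

Testing whether this slope is zero is the same question a one-way ANOVA asks about the two group means.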

Misuse of Regression and Correlation Analysis

• Performing regression and correlation on spurious data could give significant results. But this is not a valid indication of a linear relationship.


Misuse of Regression and Correlation Analysis

• Extrapolation of results
  o scope of data is extended. Example
    § If the relationship between yield of IR8 and stemborer incidence is extended to cover all rice varieties
    § If the relationship between grain yield and protein content from varietal trials is assumed to be applicable to other types of experiments such as fertilizer trials
  o functional relationship is assumed to hold beyond the range of X values tested

Misuse of Regression and Correlation Analysis

[Figure: Grain Yield (kg/ha) plotted against N-rate (kg/ha), with fitted line y = 23.751x + 4307.2, r = 0.987**]

There is no evidence that a linear relationship still holds above N = 180 kg/ha.

Coefficient of Determination (R2)

• Percentage of the total variation that is explained by the linear function.


For example, with an R2 value of 0.64, the implication is 64% [(0.64)(100) = 64] of the variation in the variable Y can be explained by the linear function of the variable X.

Problems with R2

• R2 tends to increase as additional variables are included in a regression equation, regardless of their true importance in determining the values of the dependent variable.

  The adjusted R2 (Ra2) compensates for this effect:

$$ R_a^2 = 1 - \frac{n - 1}{n - (p + 1)} \, (1 - R^2) $$

  where n = no. of observations and p = no. of independent variables.

• R2 gives no information on the appropriateness of the model.
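As a quick check, the adjustment can be applied to the growth rate example, where n = 5 observations, p = 1 independent variable, and R2 = 0.969523. The sketch assumes the usual form Ra2 = 1 − (n − 1)/(n − (p + 1)) · (1 − R2). The function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R squared for n observations and p independent variables."""
    return 1 - (n - 1) / (n - (p + 1)) * (1 - r_squared)


# Growth rate example: 5 observations, 1 predictor (DAS).
adj = adjusted_r_squared(0.969523, n=5, p=1)
print(f"Adjusted R squared = {adj:.6f}")
```

The value can be compared with the Adj. R Squared entry in the Model Summary of the regression output.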

Problems with R2

[Figure: curvilinear data fitted by a straight line with high R2]

[Figure: segregated data fitted by a straight line with high R2]

For detecting these kinds of departures from the regression model, there is no substitute for plotting the data.

Thank you!
