Blended Lean Six Sigma Black Belt Training – ABInBev Correlation and Regression ©2010 ASQ. All...

Blended Lean Six Sigma Black Belt Training –

ABInBev

Correlation and Regression

©2010 ASQ. All Rights Reserved.


Module Objectives

Learn and apply some key Black Belt tools used to analyze your data

• How to develop and interpret the correlation between variables
• Develop a mathematical model expressing the relationship: regression
o Regression
o Simple Linear Regression
o Multiple Linear Regression
o Logistic Regression

This review module is aligned with your Moresteam Web Training, Session 7: Identifying Root Cause


So Where Are We Now?

• We have understood our process using process maps and FMEA.

• Created graphs and charts to visualize what is happening in our process—seven basic tools.

• Validated our measurement system to ensure our data is both precise and accurate—Gage R&R.

• Collected data to establish our process performance using process capability analysis.

• Now we are going to use the statistical tools to infer cause and effect/uncover underlying relationships.


Terms

Correlation

• Used when both Y and X are continuous
• Measures the strength of the linear relationship between Y and X
• Metric: Pearson Correlation Coefficient, r (r varies between -1 and +1)
o Perfect positive relationship: r = 1
o No linear relationship: r = 0
o Perfect negative relationship: r = -1

Regression

• Simple linear regression is used when both Y and X are continuous
• Quantifies the relationship between Y and X (Y = b0 + b1X)
• Metric: Coefficient of Determination, R-Sq (varies from 0.0 to 1.0, or 0% to 100%)
o None of the variation in Y is explained by X: R-Sq = 0.0
o All of the variation in Y is explained by X: R-Sq = 1.0
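The two metrics above can be sketched in a few lines of Python. This is an illustrative sketch only: the data are hypothetical (not taken from the course data files), and `pearson_r` is a helper name introduced here. For simple linear regression with one X, R-Sq equals the square of r.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
x = [98, 99, 100, 101, 102, 103]
y_up = [98, 99, 100, 101, 102, 103]      # perfect positive relationship
y_down = [-v for v in x]                 # perfect negative relationship
y_noisy = [99, 98, 101, 100, 103, 101]   # positive but imperfect

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_noisy = pearson_r(x, y_noisy)

# For simple linear regression (one X), R-Sq is the square of r.
r_sq_noisy = r_noisy ** 2

print(f"r_up={r_up:.3f}  r_down={r_down:.3f}  r_noisy={r_noisy:.3f}  R-Sq={r_sq_noisy:.1%}")
```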


Correlation Coefficients: Illustration

[Figure: three scatterplots of Y versus X illustrating r = -1.0, r = 0.0, and r = +1.0]


Correlation: Minitab Example

• Voltage for the same power supply is measured at Station 1 and Station 2.

• Determine the correlation for voltage between the two stations.

Approach:
• Open Datafile: CORRELAT.mtw (the data are displayed in the Data Window)
• Go to Stat > Basic Statistics > Correlation…


Correlation: Minitab Example (Continued)

1. Select C1 Station 1 and C2 Station 2
2. Select Display p-values

Graph > Scatterplot… > Simple


Correlation: Minitab Example (Continued)

From Minitab Session Window

Null hypothesis: no correlation between Station 1 and Station 2 (H0 is rejected because p is less than 0.05)

[Figure: Scatterplot of Station 1 vs Station 2]
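The p-value reported for a correlation comes from a t test on r with n - 2 degrees of freedom. The sketch below shows that calculation for hypothetical values (r = 0.80 from n = 10 pairs, not the CORRELAT.mtw results); the function name is introduced here for illustration, and the critical value 2.306 is the standard two-sided 5% point of Student's t with 8 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical values for illustration only.
r, n = 0.80, 10
t_stat = correlation_t_statistic(r, n)

# Two-sided 5% critical value of Student's t with n - 2 = 8 degrees of freedom.
T_CRIT_DF8 = 2.306
reject_h0 = abs(t_stat) > T_CRIT_DF8

print(f"t = {t_stat:.3f}, reject H0 at alpha = 0.05: {reject_h0}")
```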


ABI Example 1 – Correlation

This project related to measuring client satisfaction in the BSC.

• Client satisfaction was measured by a monthly survey of five general questions

• Four answers could be given for each question: “very dissatisfied”, “dissatisfied”, “satisfied”, and “very satisfied”.

• The questions were about response time, language knowledge, helpfulness, quality of solution, and knowledge.

• A correlation test was run to determine whether there is a relationship between the questions, that is, whether a low score in one area tends to go with a low score in another.

Isabelle Verdoodt and Matthias Pindur Belt Project, Zone WE

© Anheuser-Busch InBev. All Rights Reserved.


ABI Example 1 – Correlation (Continued)

Is there a correlation between customer satisfaction questions?

Isabelle Verdoodt and Matthias Pindur Belt Project, Zone WE


There is only correlation between the questions Helpfulness and Knowledge.


ABI Example 2 – Correlation

• A POC buyout is a type of trade investment made to a POC in exchange for an agreement on volume commitment, loyalty, or other conditions.

• POC buyout is a key driver of core+ and premium business in the restaurant channel and the nightlife channel.

• It is the single biggest investment in China, accounting for 45% of total China commercial investments (2.4 billion RMB in 2011).

• A correlation test was run to determine if there was any correlation between the volume sold and investment made for four different brands.

Luke Zhou Belt Project, Zone APAC



ABI Example 2 – Correlation (Continued)

Comparison                                 Pearson Correlation   P-Value
Volume vs. Investment / case – Bud SBT     0.819                 0.000
Volume vs. Investment / case – HICE 500    0.594                 0.000
Volume vs. Investment / case – Bud BBT     0.890                 0.000
Volume vs. Investment / case – HICE 600    0.139                 0.312

Is there a correlation?
What is the strength of the relationship?
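One way to work through the two questions is sketched below, using the r and p-values reported for the four brands. The 0.05 threshold is the alpha used elsewhere in this module; reading r squared as the share of variation explained assumes a straight-line fit, and the variable names are introduced here for illustration.

```python
ALPHA = 0.05

# (brand, Pearson r, p-value) as reported for Volume vs. Investment / case.
results = [
    ("Bud SBT", 0.819, 0.000),
    ("HICE 500", 0.594, 0.000),
    ("Bud BBT", 0.890, 0.000),
    ("HICE 600", 0.139, 0.312),
]

# Is there a correlation? Compare each p-value with alpha.
significant = [name for name, r, p in results if p < ALPHA]

# How strong is it? r squared is the share of variation a straight line would explain.
r_squared = {name: round(r ** 2, 3) for name, r, p in results}

for name, r, p in results:
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{name}: r = {r:.3f}, r^2 = {r_squared[name]:.3f}, {verdict}")
```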

Luke Zhou Belt Project, Zone APAC



Testing Method Selection Matrix

Variable Type   Attribute Y                      Count Y                       Continuous Y
Discrete X      1 or 2 Treatments: Proportions   1 or 2 Treatments: Poisson    1 or 2 Treatments: T tests
                3+ Treatments: Chi Square        3+ Treatments: Chi Square     3+ Treatments: ANOVA
Continuous X    Logistic Regression              Logistic Regression           Least Squares Regression


Simple Linear Regression Analysis

• Used to fit lines and curves to data when the parameters (bs) are linear
• The fitted lines:
o Quantify the relationship between the predictor (input) variable (X) and response (output) variable (Y)
o Help to identify the vital few Xs
o Enable predictions of the response Y to be made from a knowledge of the predictor X
o Identify the impact of controlling a process input variable (X) on a process output variable (Y)
• Produces an equation of the form:

$\hat{Y} = b_0 + b_1 X$

where $\hat{Y}$ is an estimate ("fitted value") of the population Y value
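As a rough sketch of what a least squares fit does behind the scenes, the slope and intercept of the fitted line can be computed directly from the data. The data below are hypothetical and chosen to lie exactly on Y = 1 + 2X; `fit_line` is a helper name introduced here, not a Minitab function.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                # slope
    b0 = mean_y - b1 * mean_x     # intercept: the line passes through the means
    return b0, b1

# Hypothetical data lying exactly on Y = 1 + 2X.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = fit_line(x, y)
y_hat = b0 + b1 * 6   # fitted value for a new X = 6

print(f"Y = {b0:.3f} + {b1:.3f} X; fitted value at X = 6 is {y_hat:.3f}")
```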


Regression: Minitab Example 1

• A Black Belt in the Supply department is tracking the output of voltage at two different stations. Voltage is measured at Station 1 and Station 2.

• The Black Belt is given the task of predicting the voltage at Station 2 from the voltage measured at Station 1.

Approach:
• Open Datafile: CORRELAT.mtw (the data are displayed in the Data Window)
• Go to Stat > Regression > Fitted Line Plot…


Regression: Minitab Example 1 (Continued)



Prediction equation

Coefficient of Determination: use R-Sq for simple linear regression (one X)

Fitted line: obeys the prediction equation



• From the Session Window, the regression equation is:

Station 2 = -0.3402 + 1.054 Station 1

o The intercept (b0) is where the fitted line (regression line) crosses the Y-axis when X = 0.

o The slope, b1, is “rise over run”, or ΔY/ΔX.

• The coefficients b0 and b1 are estimates of the population parameters β0 and β1: they are linear coefficients.

Intercept, b0 Slope, b1

Practically, what does this mean?
• You can measure the voltage only at Station 1 and plug it into the equation.
• You can then predict the voltage at Station 2.

As a result of the regression equation, you no longer need to measure the voltage at Station 2.
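Plugging a reading into the equation can be sketched as follows. The coefficients are the ones from the Session Window; the Station 1 reading of 9.0 is a hypothetical value chosen for illustration, and the function name is introduced here.

```python
# Coefficients from the regression equation: Station 2 = -0.3402 + 1.054 Station 1
B0 = -0.3402   # intercept
B1 = 1.054     # slope

def predict_station_2(station_1_voltage):
    """Predicted Station 2 voltage from a Station 1 measurement."""
    return B0 + B1 * station_1_voltage

reading = 9.0  # hypothetical Station 1 measurement
predicted = predict_station_2(reading)

print(f"Station 1 = {reading}, predicted Station 2 = {predicted:.4f}")
```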


Statistical Significance – Minitab Example 2

• An analysis of variance (ANOVA) table informs us about the statistical significance of the regression analysis.

• Hypothesis for Regression:
– H0: The regression results from common cause variation. When H0 is true, there is no statistically significant regression, and the best prediction of Y is the mean of Y.
– Ha: The regression is statistically significant.
– Look at the p-value used to evaluate the null hypothesis; in this case, alpha = 0.05. So if p is less than alpha, then reject the null hypothesis. You can conclude that the regression is statistically significant.

Approach:
• Use Datafile: REGRESSANOVA.mtw
• Go to Stat > Regression… > Regression


ANOVA for Simple Linear Regression – Minitab Example 2 (Continued)

REGRESSANOVA.mtw
Stat > Regression… > Regression


ANOVA for Simple Linear Regression – Minitab Example 2 (Continued)

Regression is significant: p < 0.05

What is R-sq value telling us?


Analysis of Residuals – Minitab Example 2 (Continued)

• Residuals are used to test the adequacy of the prediction equation (model)

• In residual plots, three types of plots indicate model inadequacy
• The plots will be dramatic, not subtle!

1. Fans
2. Bands sloping up or down
3. Curved bands
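A residual is the observed Y minus the fitted Y. The sketch below fits a straight line to hypothetical data that actually follow a curve (Y = X squared) and shows the resulting "curved band": residuals are positive at both ends and negative in the middle. The helper name `fit_line` and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Hypothetical data that follow a curve, not a straight line.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

b0, b1 = fit_line(x, y)

# Residual = observed Y minus fitted Y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Positive, negative, negative, negative, positive: a curved band.
signs = ["+" if r > 0 else "-" for r in residuals]
print(residuals, signs)
```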


Analysis of Residuals – Minitab Example 2 (Continued)

Do you see any patterns in the residuals that might indicate model inadequacy?


Regression: Minitab Example 3

Illustrating the analysis of residuals

Use Datafile: RESIDUALS.mtw
Go to Stat > Regression… > Fitted Line Plot > Linear



• R-Sq is 89.7%.
• The regression is significant.
• Can we do better?
• How do the residuals look?



Not quite random!

What do the Residuals look like? Is the straight line a best fit? What do you suggest?



Continuing with the same example…

Use Datafile: RESIDUALS.mtw
Go to Stat > Regression… > Fitted Line Plot > Quadratic

Illustrating the analysis of residuals



Improving the model adequacy increased R-Sq from 89.7% to 95.0%

How do the residuals look?


ABI Example 1: Correlation and Regression

Is there a relationship between Customer Delivery Performance and Forecast Accuracy?

What is the Regression Equation?

UKI Forecast Accuracy (FA)


Gustavo Burger Belt Project – Zone WE


ABI Example 2: Correlation and Regression

[Figure: fitted line plot of Inv/case (5 to 25) versus Vol (0 to 50,000) for Bud SBT]

Inv/case = 8.123 + 0.000809 Vol - 0.000000 Vol**2

S = 1.58346   R-Sq = 82.1%   R-Sq(adj) = 81.4%

Bud SBT
Pearson correlation: 0.819
P value: 0.00

Legend:
• Volume is units sold
• Investment per case is how much money is paid to the POC

Luke Zhou Belt Project - Zone APAC



ABI Example 3: Correlation and Regression – Spare Parts Inventory

Determine whether there is a correlation between the inventory value of spare parts and the volume packaged at each brewery.

Regression Analysis: Gross Inv Val vs. Volume packaged

The regression equation is: Gross Inv Val = 3,531,232 + 854,979 Vol pack (MM bbl)

Predictor           Coef      SE Coef   T      P
Constant            3531232   1751989   2.02   0.072
Vol pack (MM bbl)   854979    191217    4.47   0.001

S = 2079206   R-Sq = 66.7%   R-Sq(adj) = 63.3%

Analysis of Variance
Source           DF   SS            MS            F       P
Regression       1    8.64274E+13   8.64274E+13   19.99   0.001
Residual Error   10   4.32310E+13   4.32310E+12
Total            11   1.29658E+14

What is the regression equation?
What is the R-Sq(adj) figure telling you?

Katie Shiro Belt Project, Zone NA
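The entries in this regression output are tied together by a few simple ratios, sketched below using the reported values. T is the coefficient divided by its standard error, F is the regression mean square divided by the error mean square, and with a single X the F statistic equals T squared (up to rounding). Variable names are introduced here for illustration.

```python
# Values reported in the spare parts inventory regression output.
ss_regression, df_regression = 8.64274e13, 1
ss_error, df_error = 4.32310e13, 10
coef, se_coef = 854979, 191217   # slope for Vol pack (MM bbl)

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

f_stat = ms_regression / ms_error   # F in the ANOVA table
t_stat = coef / se_coef             # T for the slope

# Share of total variation explained by the regression.
r_sq = ss_regression / (ss_regression + ss_error)

print(f"F = {f_stat:.2f}, T = {t_stat:.2f}, T^2 = {t_stat ** 2:.2f}, R-Sq = {r_sq:.1%}")
```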



Multiple Linear Regression – Exercise 1 (Continued)

Our goal is to fit a multiple regression of the following form:

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + b_5 X_5$

This example will illustrate the following additional aspects of multiple regression:

1. Elimination of X-variables that have no explanatory power
2. Residual analysis


Multiple Factor Correlation and Regression

Data on water usage has been collected along with data on factors that may be used to predict water usage. The factors were average temperature, production volume, number of associates, number of days of plant operation, and number of visitors.

Data is in Water Usage.mtw


Multiple Factor Regression

Stat > Regression > General Regression

Recommend you always turn this option on.


Session Window

Regression Equation
Water Usage = 6805.38 + 17.2286 Average Temp + 0.221781 Production - 138.578 Operating Days - 26.4302 Associates - 1.59134 Visitors

Coefficients

Term             Coef      SE Coef   T          P       VIF
Constant         6805.38   1461.69   4.65583    0.001
Average Temp     17.23     6.64      2.59330    0.025   1.26281
Production       0.22      0.05      4.41450    0.001   6.74070
Operating Days   -138.58   55.09     -2.51543   0.029   1.27287
Associates       -26.43    9.27      -2.85192   0.016   6.77552
Visitors         -1.59     3.19      -0.49900   0.628   1.03867

Visitors are not significant and should be removed from the model.


Reduced Model

Coefficients

Term             Coef      SE Coef   T          P       VIF
Constant         6687.10   1396.48   4.78854    0.000
Average Temp     16.96     6.41      2.64580    0.021   1.25479
Production       0.22      0.05      4.53564    0.001   6.71249
Operating Days   -138.35   53.34     -2.59393   0.023   1.27278
Associates       -25.89    8.91      -2.90510   0.013   6.68148

Variance Inflation Factor (VIF) checks for factors that are co-linear. Co-linear factors may cause invalid models and should be avoided. Rule of thumb: VIFs < 8 are not a problem. If factors are highly correlated, try removing one from the model or using Partial Least Squares Regression.

Regression Equation
Water Usage = 6687.1 + 16.9643 Average Temp + 0.220159 Production - 138.354 Operating Days - 25.8854 Associates
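To see what a VIF means, recall that the VIF for a factor is 1 / (1 - R²), where R² comes from regressing that factor on all the other factors. The sketch below inverts that formula for the VIFs reported in the reduced model and applies the rule of thumb of 8 stated above; variable names are introduced here for illustration.

```python
# VIFs reported for the reduced water usage model.
vif = {
    "Average Temp": 1.25479,
    "Production": 6.71249,
    "Operating Days": 1.27278,
    "Associates": 6.68148,
}

# VIF = 1 / (1 - R^2), so the implied R^2 of each factor
# regressed on the other factors is 1 - 1 / VIF.
implied_r_sq = {term: 1 - 1 / value for term, value in vif.items()}

VIF_LIMIT = 8  # rule of thumb from this module
flagged = [term for term, value in vif.items() if value >= VIF_LIMIT]

for term, value in vif.items():
    print(f"{term}: VIF = {value:.2f}, implied R-Sq vs other factors = {implied_r_sq[term]:.1%}")
print("Factors at or above the limit:", flagged)
```

Production and Associates each have about 85% of their variation explained by the other factors, which is high but still under the rule-of-thumb limit.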


The Rest of the Session Window

Summary of Model

S = 276.626   R-Sq = 76.10%   R-Sq(adj) = 68.14%
PRESS = 1588865   R-Sq(pred) = 58.65%

Analysis of Variance

Source             DF   Seq SS    Adj SS    Adj MS    F         P
Regression         4    2924259   2924259   731065    9.5537    0.0010367
  Average Temp     1    315092    535674    535674    7.0003    0.0213440
  Production       1    1562688   1574213   1574213   20.5720   0.0006830
  Operating Days   1    400666    514877    514877    6.7285    0.0234871
  Associates       1    645813    645813    645813    8.4396    0.0132008
Error              12   918264    918264    76522
Total              16   3842523

S: standard deviation of the error term

R-Sq: $r^2 = 1 - \frac{SS_{Error}}{SS_{Total}}$

R-Sq(adj): $r^2_{Adjusted} = 1 - \frac{MS_{Error}}{MS_{Total}}$, where $MS_{Error} = \frac{SS_{Error}}{n - \text{number of factors} - 1}$ and $MS_{Total} = \frac{SS_{Total}}{n - 1}$

R-Sq(pred): how well the model is expected to predict new observations.
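The summary statistics can be reproduced from the ANOVA table values, as sketched below. The formula R-Sq(pred) = 1 - PRESS / SS Total is the usual definition and is stated here as background rather than taken from the slide; variable names are introduced for illustration.

```python
# Values from the water usage session window.
n = 17               # observations (Total DF + 1)
num_factors = 4      # Average Temp, Production, Operating Days, Associates
ss_error = 918264
ss_total = 3842523
press = 1588865

# R-Sq: share of total variation explained by the model.
r_sq = 1 - ss_error / ss_total

# R-Sq(adj): same idea using mean squares, which penalizes extra factors.
ms_error = ss_error / (n - num_factors - 1)
ms_total = ss_total / (n - 1)
r_sq_adj = 1 - ms_error / ms_total

# R-Sq(pred): uses PRESS in place of SS Error.
r_sq_pred = 1 - press / ss_total

# S: standard deviation of the error term.
s = ms_error ** 0.5

print(f"S = {s:.3f}  R-Sq = {r_sq:.2%}  R-Sq(adj) = {r_sq_adj:.2%}  R-Sq(pred) = {r_sq_pred:.2%}")
```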


Residual Analysis

[Figure: Residual Plots for Water Usage, four panels: Normal Probability Plot (N = 17, AD = 0.159, P-Value = 0.938), Versus Fits, Histogram, Versus Order. Worksheet: Water Usage.MTW]

The residuals are normally distributed with a mean of zero and a constant variance. There is no reason to reject the model.


Let’s Use the Model to Predict Usage

You have been asked to predict the amount of usage for a month with an average temperature of 68, production of 1400, 20 days of operation, and 175 associates.

Press Ctrl+E to bring back the previous dialog box.


The Prediction

Predicted Values for New Observations

New Obs   Fit       SE Fit    95% CI                95% PI
1         851.869   666.567   (-600.455, 2304.19)   (-720.553, 2424.29)

Fit: the predicted value

However, because of the low $r^2_{Predicted}$, the prediction intervals are very wide.
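The Fit value can be checked by plugging the requested settings into the reduced-model regression equation. This sketch reproduces only the point prediction, not the confidence or prediction intervals; the dictionary names are introduced here for illustration.

```python
# Reduced-model regression equation for Water Usage.
INTERCEPT = 6687.1
COEFFICIENTS = {
    "Average Temp": 16.9643,
    "Production": 0.220159,
    "Operating Days": -138.354,
    "Associates": -25.8854,
}

# Settings for the month to be predicted.
new_month = {
    "Average Temp": 68,
    "Production": 1400,
    "Operating Days": 20,
    "Associates": 175,
}

predicted_usage = INTERCEPT + sum(
    COEFFICIENTS[term] * new_month[term] for term in COEFFICIENTS
)

print(f"Predicted water usage: {predicted_usage:.2f}")
```

The result agrees with the reported Fit of 851.869 to within the rounding of the printed coefficients.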


Multiple Regression: ABI Example 1 – Brand Health

Pedro Lozada Belt Project – Zone GHQ

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.82
R Square            0.68
Adjusted R Square   0.65
Standard Error      0.17
Observations        48

ANOVA
             df   SS     MS     F       Significance F
Regression   4    2.46   0.61   22.50   4.45E-10
Residual     43   1.17   0.03
Total        47   3.63

                       Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept              -3.27          2.08             -1.58    0.12      -7.46       0.92
ln(Becks Price)        -1.76          0.44             -3.98    0.00      -2.65       -0.87
ln(Competitor Price)   1.74           0.20             8.57     0.00      1.33        2.15
ln(Becks Media)        0.02           0.01             2.48     0.02      0.00        0.04
ln(Competitor Media)   -0.03          0.04             -0.63    0.53      -0.11       0.06

This output is from Excel.
What is the significance telling us?
How do you interpret the Rsq?

Coefficient of ln(Becks Price) = -1.76: a 10% increase in price decreases the share by approximately 17.6%.

Is there a relationship between price and market share?
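The 17.6% figure is the usual quick reading of a coefficient on a log-transformed input. The sketch below compares it with the exact change, under the assumption (not stated in the output) that the response is also log-transformed, i.e. ln(share), so that the coefficient acts as a price elasticity.

```python
PRICE_COEF = -1.76       # coefficient of ln(Becks Price)
price_increase = 0.10    # a 10% price increase

# Quick reading: percent change in share = coefficient * percent change in price.
approx_change = PRICE_COEF * price_increase

# Exact change if the model is ln(share) = ... + PRICE_COEF * ln(price) (assumed).
exact_change = (1 + price_increase) ** PRICE_COEF - 1

print(f"approximate: {approx_change:.1%}, exact under the log-log assumption: {exact_change:.1%}")
```

The approximation is close for small price changes and drifts further from the exact value as the change grows.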



ABI Example 2 – UK CDP Performance (Multiple Regression Analysis)

What is the prediction model between Customer Delivery Performance in the UK and line efficiency (LEF)?


Gustavo Burger Belt Project – Zone WE


ABI Example 3 – Multiple Regression

Price Change vs. Ad Feature

Use multi-variable regression to separate the impact of a price decrease vs. placing the product in the ad feature.

Source: NC Food Lion Natural Light 24pks


Mike Zacharias Belt Project – Zone NA


ABI Example 3 (Continued)

What is the regression equation?

Practically, what does this mean?

From the regression equation: A $1 price decrease is worth 1.8 share points, and an ad feature is worth 6.0 share points.


Mike Zacharias Belt Project – Zone NA


Logistic Regression

• Logistic regression is a variation of ordinary regression which is used when:
o The dependent (response) variable is a dichotomous variable (i.e., it takes only two values, which usually represent the occurrence or non-occurrence of some outcome event, usually coded as 0 or 1).
o The independent (input) variables are continuous, categorical, or both.


Testing Method Selection Matrix

Variable Type   Attribute Y                      Count Y                       Continuous Y
Discrete X      1 or 2 Treatments: Proportions   1 or 2 Treatments: Poisson    1 or 2 Treatments: T tests
                3+ Treatments: Chi Square        3+ Treatments: Chi Square     3+ Treatments: ANOVA
Continuous X    Logistic Regression              Logistic Regression           Least Squares Regression


Logistic Regression

• Logistic Regression evaluates the occurrence of the event in terms of its probability.
o If an event happens (success), the probability is “p”
o The probability of the event not happening is given by (1-p)
• Odds of success relative to failure is the ratio p/(1-p)
• The logistic regression model is fitted to the natural logarithm of the odds, ln{p/(1-p)}
• The statistical model for logistic regression is:

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$

o where p is a binomial proportion and x is the input factor.
o The parameters of the logistic model are β0 and β1.
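The pieces of the model (odds, log odds, and the conversion back to a probability) can be sketched as small functions. The parameter values β0 = -4.0 and β1 = 0.1 are hypothetical, chosen only so that the event probability is exactly 0.5 at x = 40; the function names are introduced here for illustration.

```python
import math

def odds(p):
    """Odds of success relative to failure."""
    return p / (1 - p)

def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of the logit: converts log odds back to a probability."""
    return 1 / (1 + math.exp(-z))

def event_probability(x, beta0, beta1):
    """Probability of the event for input x under ln(p/(1-p)) = beta0 + beta1 * x."""
    return logistic(beta0 + beta1 * x)

# Hypothetical parameters for illustration only.
beta0, beta1 = -4.0, 0.1

p_at_40 = event_probability(40, beta0, beta1)   # log odds are 0 here
p_at_10 = event_probability(10, beta0, beta1)   # well below the midpoint
p_at_90 = event_probability(90, beta0, beta1)   # well above the midpoint

print(f"p(10) = {p_at_10:.3f}, p(40) = {p_at_40:.3f}, p(90) = {p_at_90:.3f}")
```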


The Logistic Function

[Figure: the logistic function; vertical axis: probability of event, 0 to 1]

The logistic function starts very close to 0, then rises rapidly as the event probability threshold is approached, then asymptotically approaches 1.

Datafile/EXHREG.XLS


An Example

A cereal company wants to determine the factors that increase the probability a consumer will purchase their product. Data was collected on 71 consumers to determine the effect of whether they had seen an advertisement, whether they have children, their income, and if they purchased the cereal. Data is in Logistic Regression Cereal Ad.mtw.


Set Up the Analysis

Discrete factors that are included in the model are entered in the Factors box.

Stat > Regression > Binary Logistic Regression


Option and Graphs


Logistic Regression Output

Variable   Value   Count
Bought     Yes     34      (Event)
           No      37
           Total   71

Logistic Regression Table

                                                  Odds     95% CI
Predictor   Coef        SE Coef     Z       P       Ratio   Lower   Upper
Constant    -5.21059    1.31033     -3.98   0.000
Income      0.0563140   0.0230953   2.44    0.015   1.06    1.01    1.11
Children
 Yes        2.69208     1.13832     2.36    0.018   14.76   1.59    137.43
ViewAd
 Yes        1.76941     0.658335    2.69    0.007   5.87    1.61    21.32

Log-Likelihood = -30.480
Test that all slopes are zero: G = 37.341, DF = 3, P-Value = 0.000

The null hypothesis is that the factor has no effect on the event probability.

All three factors are statistically significant.
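Two things can be read off the coefficients, sketched below. First, each odds ratio is the exponential of its coefficient. Second, the fitted model gives a purchase probability for any consumer profile. The profile used here (income of 40 in the worksheet's units, has children, saw the ad) is a hypothetical example, and the variable names are introduced for illustration.

```python
import math

# Coefficients from the Logistic Regression Table.
CONSTANT = -5.21059
COEF = {"Income": 0.0563140, "Children Yes": 2.69208, "ViewAd Yes": 1.76941}

# Odds ratio = exp(coefficient).
odds_ratio = {term: round(math.exp(b), 2) for term, b in COEF.items()}

# Hypothetical consumer: income 40, has children (1), saw the ad (1).
profile = {"Income": 40, "Children Yes": 1, "ViewAd Yes": 1}

log_odds = CONSTANT + sum(COEF[term] * profile[term] for term in COEF)
p_purchase = 1 / (1 + math.exp(-log_odds))

print(odds_ratio)
print(f"log odds = {log_odds:.3f}, probability of purchase = {p_purchase:.3f}")
```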


Model Integrity

Goodness-of-Fit Tests

Method            Chi-Square   DF   P
Pearson           45.1757      49   0.629
Deviance          44.8648      49   0.641
Hosmer-Lemeshow   6.4373       8    0.598

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs        Number   Percent    Summary Measures
Concordant   1105     87.8       Somers' D               0.76
Discordant   145      11.5       Goodman-Kruskal Gamma   0.77
Ties         8        0.6        Kendall's Tau-a         0.39
Total        1258     100.0

The null hypothesis for goodness of fit is that the model fits. Because all three p-values are well above 0.05, do not reject the null hypothesis; conclude that the model fits.
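The summary measures follow from the pair counts, as sketched below using the standard definitions of these statistics (given here as background, not taken from the slide). Each pair matches one consumer who bought with one who did not, so the total is 34 × 37 = 1258.

```python
# Counts from the logistic regression output.
n_events, n_non_events = 34, 37
concordant, discordant, ties = 1105, 145, 8

n = n_events + n_non_events                 # 71 consumers
total_pairs = n_events * n_non_events       # pairs of one buyer and one non-buyer

# Standard definitions of the three summary measures.
somers_d = (concordant - discordant) / total_pairs
gamma = (concordant - discordant) / (concordant + discordant)
tau_a = (concordant - discordant) / (n * (n - 1) / 2)

print(f"pairs = {total_pairs}, Somers' D = {somers_d:.2f}, "
      f"Gamma = {gamma:.2f}, Tau-a = {tau_a:.2f}")
```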


The Chi-Square vs. Probability Graph

[Figure: Delta Chi-Square versus Probability; Delta Chi-Square (0 to 14) against Probability (0.0 to 0.9). Worksheet: Logistic Regression Cereal Ad.MTW]

Right-click on the graph and brush the outliers.

Note them in the data sheet.


Prepare a Graph of the Results

Press Ctrl+E to bring back the previous dialog box.


Storing the Data


Preparing the Graph

Graph>Scatterplot


Presenting the Results

[Figure: Probability of Purchase vs Income; probability of purchase (0.0 to 0.9) against income (10 to 60), grouped by Children and ViewAd (No/No, No/Yes, Yes/No, Yes/Yes). Worksheet: Logistic Regression Cereal Ad.MTW]


Exercise – Your Turn

Data was collected for the outcome of emergency room admissions. A hospital administrator would like help determining if any of the factors collected could be used to predict the probability of dying in the hospital.

The data is in Datafile/Emergency.MTW.

A definition of the terms is given in Datafile/EmergencyFileTerms.DOC.


What Have We Covered?

Learned and applied key tools to analyze your data

• How to develop and interpret the correlation between variables
• Develop a mathematical model expressing the relationship: regression
o Regression
o Simple Linear Regression
o Multiple Linear Regression
o Logistic Regression


In the Next Module . . .

• We will learn how to determine the proper sample size and the power of the test

• We will use Minitab to determine:
– Sample size
– Delta
– Power


Supplemental Material


Exercise Solution – Emergency Room


Exercise Solution – Emergency Room (Continued)

                                                  Odds     95% CI
Predictor   Coef        SE Coef     Z       P       Ratio   Lower   Upper
Constant    -5.74590    1.27590     -4.50   0.000
Age         0.0342199   0.0117207   2.92    0.004   1.03    1.01    1.06
Sex
 1          -0.374718   0.411645    -0.91   0.363   0.69    0.31    1.54
Race
 2          -1.16640    1.09116     -1.07   0.285   0.31    0.04    2.64
 3          0.269519    0.907951    0.30    0.767   1.31    0.22    7.76
Ser
 1          -0.394346   0.432915    -0.91   0.362   0.67    0.29    1.57
Can
 1          1.83110     0.849745    2.15    0.031   6.24    1.18    33.00
PRE
 1          0.571998    0.546810    1.05    0.296   1.77    0.61    5.17
TYP
 1          2.87674     0.918809    3.13    0.002   17.76   2.93    107.51

Age, TYP, and Can are significant.



Logistic Regression Table

                                                  Odds     95% CI
Predictor   Coef        SE Coef     Z       P       Ratio   Lower   Upper
Constant    -6.20134    1.17173     -5.29   0.000
Age         0.0352979   0.0109595   3.22    0.001   1.04    1.01    1.06
Can
 1          1.57914     0.808289    1.95    0.051   4.85    0.99    23.65
TYP
 1          3.02273     0.873298    3.46    0.001   20.55   3.71    113.79

Even though Can is slightly over .05, let’s keep it in the model.



[Figure: Probability of Dying by Type of Admission vs Age; probability of dying in hospital (0.0 to 0.8) against age (10 to 100), grouped by Can and TYP (0/0, 0/1, 1/0, 1/1). Worksheet: Emergency.MTW]
