QTM Regression Analysis Ch4 RSH


  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    1/40

    How to perform Regression analysis

    Nadia Z Khan

    NUST Business School

    Friday, May 25, 12

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    2/40

    regression analysis

    A very valuable tool for today's manager. Regression analysis is used to:

    Understand the relationship between variables.

    Predict the value of one variable based on

    another variable.

    A regression model has:

    a dependent, or response, variable - Y axis

    an independent, or predictor, variable - X axis

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    3/40

    regression analysis

    Triple A Construction Company renovates old homes in Albany. The company has found that its dollar volume of renovation work depends on the Albany area payroll.

    Local Payroll ($100,000,000's)   Triple A Sales ($100,000's)
    3                                6
    4                                8
    6                                9
    4                                5
    2                                4.5
    5                                9.5

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    4/40

    Scatter plot

    [Figure: scatter plot of Triple A Sales ($100,000's) against Local Payroll ($100,000,000's).]

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    5/40

    regression analysis model

    Create a Scatter Plot

    Perform Regression Analysis

    Regression: Understand & Predict

    $Y = \beta_0 + \beta_1 X + \epsilon$

    Where,

    $Y$ = dependent variable (response)

    $X$ = independent variable (predictor)

    $\beta_0$ = intercept (value of Y when X = 0)

    $\beta_1$ = slope

    $\epsilon$ = some random error that cannot be predicted

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    6/40

    regression analysis model

    Sample data are used to estimate the true values for the intercept and slope.

    $\hat{Y} = b_0 + b_1 X$

    Where,

    $\hat{Y}$ = predicted value of Y

    The difference between the actual value of Y and the predicted value (using sample data) is known as the error.

    Error = (actual value) - (predicted value)

    $e = Y - \hat{Y}$

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    7/40

    regression analysis model

    Sales (Y)   Payroll (X)   $(X - \bar{X})^2$   $(X - \bar{X})(Y - \bar{Y})$
    6           3             1                   1
    8           4             0                   0
    9           6             4                   4
    5           4             0                   0
    4.5         2             4                   5
    9.5         5             1                   2.5

    Summations for each column:
    42          24            10                  12.5

    $\bar{Y} = 42/6 = 7$, $\bar{X} = 24/6 = 4$

    Calculating the required parameters:

    $b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25$

    $b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2$

    So,

    $\hat{Y} = 2 + 1.25 X$
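The hand calculation on this slide can be reproduced in a few lines. The following is a minimal sketch in plain Python, using the Triple A data from the slides; the function name `fit_simple_regression` is an illustrative choice, not something from the deck.

```python
# Triple A Construction data from the slides.
payroll = [3, 4, 6, 4, 2, 5]      # X: local payroll, $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y: Triple A sales, $100,000's


def fit_simple_regression(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of X and sum of cross-products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_regression(payroll, sales)
print(f"Y_hat = {b0:.2f} + {b1:.2f} X")  # prints "Y_hat = 2.00 + 1.25 X"
```

The two intermediate sums match the column totals in the table (10 and 12.5).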

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    8/40

    Measuring the Fit of the linear Regression Model

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    9/40

    Measuring the Fit of the linear

    Regression Model

    To understand how well X predicts Y, we evaluate:

    Variability in the Y variable

    SSR: regression variability that is explained by the relationship between X and Y

    + SSE: unexplained variability, due to factors other than the regression

    = SST: total variability about the mean

    Coefficient of determination ($r^2$, R Sq): proportion of explained variation

    Correlation coefficient (r): strength of the relationship between the Y and X variables

    Standard error: standard deviation of the error around the regression line

    Residual analysis: validation of the model

    Test for linearity: significance of the regression model, i.e. the linear regression model

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    10/40

    Variability

    [Figure: scatter plot of the Triple A data with the fitted regression line y = 1.25x + 2 ($R^2$ = 0.6944) and the mean line $\bar{Y}$; x-axis: Local Payroll ($100,000,000's).]

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    11/40

    Variability

    Sum of Squares Total (SST) measures the total variability in Y.

    Sum of the Squared Error (SSE) is less than the SST because the regression line reduces the variability.

    Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.

    Errors (deviations) may be positive or negative. Summing the errors would be misleading, thus we square the terms prior to summing.

    $SST = \sum (Y - \bar{Y})^2$

    $SSE = \sum e^2 = \sum (Y - \hat{Y})^2$

    $SSR = \sum (\hat{Y} - \bar{Y})^2$

    For Triple A Construction:

    $SST = \sum (Y - \bar{Y})^2 = 22.5$

    $SSE = \sum e^2 = \sum (Y - \hat{Y})^2 = 6.875$

    $SSR = \sum (\hat{Y} - \bar{Y})^2 = 15.625$

    Note:

    SST = SSR + SSE, where SSR is the explained variability and SSE is the unexplained variability.
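The three sums of squares can be checked directly from their definitions. This is a small sketch assuming the fitted line from the earlier slide (intercept 2, slope 1.25); the helper name `sums_of_squares` is hypothetical.

```python
# Triple A Construction data and fitted line from the slides.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25


def sums_of_squares(x, y, b0, b1):
    """Return (SST, SSE, SSR) for the line y_hat = b0 + b1 * x."""
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variability
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained
    return sst, sse, ssr


sst, sse, ssr = sums_of_squares(payroll, sales, b0, b1)
print(f"SST = {sst}, SSE = {sse}, SSR = {ssr}")
# prints "SST = 22.5, SSE = 6.875, SSR = 15.625"
```

The identity SST = SSR + SSE holds here because the line was fitted by least squares with an intercept; for an arbitrary line it generally would not.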

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    12/40

    Coefficient of Determination

    The coefficient of determination ($r^2$) is the proportion of the variability in Y that is explained by the regression equation.

    $r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

    For Triple A Construction:

    $r^2 = \frac{15.625}{22.5} = 0.6944$

    69% of the variability in sales is explained by the regression based on payroll.

    Note: $0 \le r^2 \le 1$

    SST, SSR, and SSE by themselves provide little direct interpretation. The coefficient of determination measures the usefulness of the regression.

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    13/40

    Correlation Coefficient

    The correlation coefficient ($r$) measures the strength of the linear relationship.

    $r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$

    For Triple A Construction, r = 0.8333

    Note: $-1 \le r \le 1$

    Shown as Multiple R in the Excel output.

    [Figure: possible scatter diagrams for values of r.]
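As a check on the value r = 0.8333, the correlation coefficient can be computed from raw sums. The sketch below uses the usual computational form of Pearson's r, which is assumed to be the formula the slide showed; the function name `correlation` is illustrative.

```python
import math

# Triple A Construction data from the slides.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]


def correlation(x, y):
    """Pearson correlation coefficient computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(payroll, sales)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")  # prints "r = 0.8333, r^2 = 0.6944"
```

In simple regression r squared equals the coefficient of determination, and r takes the sign of the slope.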

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    14/40

    Correlation Coefficient


  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    15/40

    Standard error

    The mean squared error (MSE) is the estimate of the error variance of the regression equation.

    $s^2 = MSE = \frac{SSE}{n - k - 1}$

    Where,

    n = number of observations in the sample

    k = number of independent variables

    The standard error of the estimate is $s = \sqrt{MSE}$.

    For Triple A Construction, $s = \sqrt{6.875 / 4} = \sqrt{1.7188} \approx 1.311$
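A short sketch of the standard error calculation, taking SSE = 6.875 from the earlier slide with n = 6 observations and k = 1 independent variable. The function name `standard_error` is an illustrative choice.

```python
import math


def standard_error(sse, n, k):
    """Return (MSE, s) where MSE = SSE / (n - k - 1) and s = sqrt(MSE)."""
    mse = sse / (n - k - 1)
    return mse, math.sqrt(mse)


# Triple A Construction: SSE from the slides, 6 observations, 1 predictor.
mse, s = standard_error(sse=6.875, n=6, k=1)
print(f"MSE = {mse:.4f}, s = {s:.3f}")  # prints "MSE = 1.7188, s = 1.311"
```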

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    16/40

    Test for linearity

    An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. $\beta_1 = 0$).

    If the significance level for the F test is low, we reject $H_0$ and conclude there is a linear relationship.

    $F = \frac{MSR}{MSE}$

    where $MSR = \frac{SSR}{k}$

    For Triple A Construction:

    $MSR = \frac{15.625}{1} = 15.625$

    $F = \frac{15.625}{1.7188} = 9.0909$

    The significance level for F = 9.0909 is 0.0394, indicating we reject $H_0$ and conclude a linear relationship exists between sales and payroll.

    The p-value is the observed significance level.

    alpha = level of significance = 1 - confidence level

    If p
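The F statistic and its significance level for Triple A can be reproduced without a statistics library. This sketch relies on two facts that are additions to the slides: with one independent variable, F equals the square of a t statistic with n - k - 1 degrees of freedom, and the t distribution with exactly 4 degrees of freedom has a closed-form CDF. The helper `t4_cdf` is therefore specific to this six-observation example and is not a general-purpose routine.

```python
import math


def f_statistic(ssr, sse, n, k):
    """F = MSR / MSE with MSR = SSR / k and MSE = SSE / (n - k - 1)."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse


def t4_cdf(t):
    """CDF of Student's t with exactly 4 degrees of freedom (closed form)."""
    scale = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(scale)) * (1.0 - (t * t / scale) / 12.0)


# Triple A Construction: SSR and SSE from the slides, n = 6, k = 1.
f_value = f_statistic(ssr=15.625, sse=6.875, n=6, k=1)

# Valid only for k = 1 and n - k - 1 = 4: P(F > f) = P(|T| > sqrt(f)).
p_value = 2.0 * (1.0 - t4_cdf(math.sqrt(f_value)))

print(f"F = {f_value:.4f}, significance = {p_value:.4f}")
# prints "F = 9.0909, significance = 0.0394"
```

For other sample sizes or more predictors, a statistics package's F distribution would be the usual route.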

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    17/40

    Computer Software for Regression

    In Excel, use Tools/Data Analysis. This is an add-in option.

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    18/40

    Computer Software for

    Regression


  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    19/40

    Computer Software for

    Regression


  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    20/40

    ANOVA table

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    21/40

    Residual Analysis: to verify regression assumptions are correct

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    22/40

    Assumptions of the Regression Model

    We make certain assumptions about the errors in a regression model which allow for statistical testing.

    Assumptions:

    Errors are independent.

    Errors are normally distributed.

    Errors have a mean of zero.

    Errors have a constant variance.

    A plot of the errors (actual value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.

    PITFALLS:

    Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0).

    A linear regression model may not be the best model, even in the presence of a significant F test.

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    23/40

    Constant variance

    Triple A Construction

    Assumption: errors have constant variance

    Plot residuals with respect to X values

    Pattern should be random!

    [Figure: residual plot against X showing non-constant variation in error, a violation of the assumption.]

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    24/40

    Normal distribution

    Histogram of residuals: should look like a bell curve

    Triple A Construction

    Not possible to see the bell curve with just 6 observations. Need more samples.

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    25/40

    zero mean

    Triple A Construction

    Errors have zero mean

    [Figure: residual plot against X.]
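The residuals behind these plots are simple to list. The following sketch computes them for the Triple A data using the fitted line from the earlier slide and confirms that they average to zero, which is always the case for a least-squares line that includes an intercept.

```python
# Triple A Construction data and fitted line from the slides.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25

# Residual = actual value minus predicted value of Y.
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]
mean_residual = sum(residuals) / len(residuals)

for x, e in zip(payroll, residuals):
    print(f"X = {x}: residual = {e:+.2f}")
print(f"mean residual = {mean_residual:.4f}")
```

Plotting these residuals against X (or against time, when the data are time-ordered) gives the residual plots referred to on the surrounding slides.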

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    26/40

    independent errors

    If samples are collected over a period of time and not at the same time, then plot the residuals with respect to time to see if any pattern (autocorrelation) exists.

    If there is substantial autocorrelation, the validity of the regression model becomes doubtful.

    Autocorrelation can also be checked using the Durbin-Watson statistic.

    Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. The data are collected over a period of time, so check for an autocorrelation (pattern) effect.

    [Figure: residuals plotted against time showing a cyclical pattern, a violation of the assumption.]
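The slides name the Durbin-Watson statistic without defining it. As a hedged illustration, the sketch below uses the standard definition, d = sum of squared successive differences of the residuals divided by the sum of squared residuals. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The residual series here is made up purely to show the arithmetic; it is not data from the deck.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a time-ordered list of residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals with a slow cyclical drift (illustration only).
cyclical = [2, 1, -1, -2, -1, 1, 2, 1, -1, -2]
d = durbin_watson(cyclical)
print(f"Durbin-Watson d = {d:.3f}")  # well below 2: positive autocorrelation
```

Judging whether a given d is significant requires Durbin-Watson critical-value tables, which depend on the sample size and the number of predictors.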

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    27/40

    Residual analysis for

    validating assumptions

    [Figure: nonlinear residual plot, a violation of the assumptions.]

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    28/40

    multiple regression


  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    29/40

    multiple regression

    Multiple regression models are similar to simple linear regression models except they include more than one X variable.

    $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$

    where $X_1, \dots, X_n$ are the independent variables and $b_1, \dots, b_n$ are the slopes.

    Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.

    Price    Sq. Feet   Age   Condition
    35000    1926       30    Good
    47000    2069       40    Excellent
    49900    1720       30    Excellent
    55000    1396       15    Good
    58900    1706       32    Mint
    60000    1847       38    Mint
    67000    1950       27    Mint
    70000    2323       30    Excellent
    78500    2285       26    Mint
    79000    3752       35    Good
    87500    2300       18    Good
    93000    2525       17    Good
    95000    3800       40    Excellent
    97000    1740       12    Mint

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    30/40

    multiple regression

    67% of the variation in sales price is explained by size and age.

    $H_0$: no linear relationship is rejected

    $H_0$: $\beta_1 = 0$ is rejected

    $H_0$: $\beta_2 = 0$ is rejected

    $\hat{Y} = 60815.45 + 21.91(\text{size}) - 1449.34(\text{age})$

    Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.

    For a 1900 square foot house that is 10 years old, the following prediction can be made:

    $87,951 = 60815.45 + 21.91(1900) - 1449.34(10)
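The prediction on this slide is a direct substitution into the fitted equation. A minimal sketch, with the coefficients copied from the slide and an illustrative function name `predict_price`:

```python
def predict_price(size_sqft, age_years):
    """Predicted listing price from the Wilson Realty size-and-age model."""
    return 60815.45 + 21.91 * size_sqft - 1449.34 * age_years


price = predict_price(size_sqft=1900, age_years=10)
print(f"Predicted price: ${price:,.0f}")  # prints "Predicted price: $87,951"
```

The sample houses range from 1396 to 3800 square feet and from 12 to 40 years old, so a 10-year-old house sits slightly outside the observed ages; the earlier pitfall about predicting beyond the range of the sample applies.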

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    31/40

    binary or dummy variables

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    32/40

    dummy variables

    Binary (or dummy) variables are special variables that are created for qualitative data.

    A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.

    The number of dummy variables must equal one less than the number of categories of the qualitative variable.

    Return to Wilson Realty, and let's evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.

    $X_3$ = 1 if the house is in excellent condition
          = 0 otherwise

    $X_4$ = 1 if the house is in mint condition
          = 0 otherwise

    Note: If both $X_3$ and $X_4$ = 0, then the house is in good condition.

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    33/40

    dummy variables

    $\hat{Y} = 48329.23 + 28.21(\text{size}) - 1981.41(\text{age}) + 23684.62(\text{if mint}) + 16581.32(\text{if excellent})$

    As more variables are added to the model, the $r^2$ usually increases.
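To make the dummy-variable coding concrete, the sketch below encodes the three condition categories into the two dummies defined on the previous slide and plugs them into the fitted equation above. The function names are hypothetical, and the example house (1900 square feet, 10 years old) is reused from the earlier prediction purely for illustration.

```python
def condition_dummies(condition):
    """Return (x3, x4): x3 = 1 if Excellent, x4 = 1 if Mint, Good = (0, 0)."""
    c = condition.strip().lower()
    if c not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition!r}")
    return (1 if c == "excellent" else 0, 1 if c == "mint" else 0)


def predict_price(size_sqft, age_years, condition):
    """Predicted price from the Wilson Realty model with condition dummies."""
    x3, x4 = condition_dummies(condition)
    return (48329.23 + 28.21 * size_sqft - 1981.41 * age_years
            + 16581.32 * x3 + 23684.62 * x4)


for cond in ("Good", "Excellent", "Mint"):
    print(f"{cond:9s} -> ${predict_price(1900, 10, cond):,.2f}")
```

Because Good is the baseline category, each dummy coefficient reads as the price difference relative to an otherwise identical house in good condition.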

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    34/40

    model building


  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    35/40

    adjusted r-Square

    As more variables are added to the model, the $r^2$ usually increases.

    The adjusted $r^2$ takes into account the number of independent variables in the model.

    The best model is a statistically significant model with a high $r^2$ and few variables.

    Note: When variables are added to the model, the value of $r^2$ can never decrease; however, the adjusted $r^2$ may decrease.
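The slide does not give the formula for the adjusted value. A common definition, assumed here, is adjusted r^2 = 1 - (1 - r^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The sketch applies it to the two models from the deck, using the rounded 67% figure for Wilson Realty.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Triple A Construction: r^2 = 0.6944, 6 observations, 1 predictor.
print(f"{adjusted_r_squared(0.6944, n=6, k=1):.4f}")   # prints "0.6180"

# Wilson Realty size-and-age model: r^2 of about 0.67, 14 houses, 2 predictors.
print(f"{adjusted_r_squared(0.67, n=14, k=2):.4f}")    # prints "0.6100"
```

Holding r^2 fixed, increasing k lowers the adjusted value, which is why an added variable that explains little can make the adjusted r^2 decrease.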

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    36/40

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    37/40

    non-linear regression


  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    38/40

    non-linear regression

    Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).

    Linear regression model:

    $MPG = 47.8 - 8.2(\text{weight})$

    F significance = .0003

    $r^2 = 0.7446$

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    39/40

    non-linear regression

    Nonlinear (transformed variable) regression model:

    $MPG = 79.8 - 30.2(\text{weight}) + 3.4(\text{weight})^2$

    F significance = .0002

    $R^2 = 0.8478$
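To see how the two fitted models differ, the sketch below evaluates both at a few weights. The slides do not state the units of weight or the range of the sample, so the weights used here are hypothetical inputs chosen only to show the arithmetic.

```python
def mpg_linear(weight):
    """Linear model from the slides: MPG = 47.8 - 8.2 * weight."""
    return 47.8 - 8.2 * weight


def mpg_quadratic(weight):
    """Transformed-variable model: MPG = 79.8 - 30.2 * weight + 3.4 * weight^2."""
    return 79.8 - 30.2 * weight + 3.4 * weight ** 2


# Hypothetical weights, in whatever units the models were fitted in.
for w in (2.0, 3.0, 4.0):
    print(f"weight = {w}: linear = {mpg_linear(w):.1f}, "
          f"quadratic = {mpg_quadratic(w):.1f}")
```

The quadratic model is still linear in its coefficients; the squared weight is simply treated as a second X variable, which is why ordinary regression software can fit it.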

  • 7/31/2019 QTM Regression Analysis Ch4 RSH

    40/40

    non-linear regression

    We should not try to interpret the coefficients of the variables due to the correlation between (weight) and (weight squared).

    Normally we would interpret the coefficient for $X_1$ as the change in Y that results from a 1-unit change in $X_1$, while holding all other variables constant.

    Obviously, holding one variable constant while changing the other is impossible in this example, since if $X_1$ (weight) changes, then $X_2$ (weight squared) must change also.

    This is an example of a problem that exists when multicollinearity is present.