Regression Model in Matrix Terms


Page 22: General Linear Regression Model in Matrix Terms

    Suppose we have one response variable Y and (p-1) predictor (explanatory)

    variables X1, X2, . . . , Xp-1, and n observations, so that the dataset looks like

    the following:

X1     X2     ...   Xp-1       Y     (random error)
X11    X12    ...   X1(p-1)    Y1    ε1
X21    X22    ...   X2(p-1)    Y2    ε2
 .      .            .          .     .
 .      .            .          .     .
Xn1    Xn2    ...   Xn(p-1)    Yn    εn

    The general linear regression model is given by

Yi = β0 + β1Xi1 + β2Xi2 + ... + βp-1Xi(p-1) + εi,   i = 1, 2, ..., n.

    In matrix terms this becomes

Y = Xβ + ε, where


    Y is the vector of n responses Y1, Y2, . . . , Yn

X is the n x p matrix with first column all 1s and the values of X1, X2, ..., Xp-1 (assumed to be of rank p)

β is the p x 1 vector of parameters β0, β1, ..., βp-1, and ε is an n x 1 vector of uncorrelated errors ε1, ε2, ..., εn.


The random errors ε1, ε2, ..., εn are assumed to be independent with mean 0 and common variance σ². For the purpose of making statistical inferences, it is further assumed that the errors are normally distributed.
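The design matrix X described above can be assembled as in this minimal Python/numpy sketch (an illustrative helper, with design_matrix a hypothetical name):

import numpy as np

def design_matrix(predictors):
    # predictors: (n, p-1) array holding the columns X1, ..., Xp-1.
    predictors = np.asarray(predictors, dtype=float)
    n = predictors.shape[0]
    # Prepend the column of 1s that multiplies the intercept beta_0.
    return np.column_stack([np.ones(n), predictors])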

    Page 23

    Estimation of Parameters.

    The most commonly used criterion to estimate the parameters in the model

    is the principle of least squares, which involves minimizing

Q = Σ εi² = Σ [Yi - β0 - β1Xi1 - β2Xi2 - ... - βp-1Xi(p-1)]² = (Y - Xβ)'(Y - Xβ),

where the sums run over i = 1, ..., n.

It is easily shown that the value b of β which minimizes Q is the solution of the least squares normal equations

X'Xb = X'Y

    which has the solution

b = (X'X)⁻¹X'Y

Note: If we assume that the errors are normally distributed, then the least squares estimator b is also the maximum likelihood estimator of β (to be discussed later).
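As a minimal numpy sketch of this estimate (assuming X already carries the leading column of 1s; ols_fit is an illustrative name):

import numpy as np

def ols_fit(X, Y):
    # Solve the normal equations X'X b = X'Y for b.
    # Solving the system is numerically safer than forming (X'X)^(-1) explicitly.
    return np.linalg.solve(X.T @ X, X.T @ Y)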

Residuals ei are the difference between observed and fitted values and are given by

ei = Yi - Ŷi,

    or in vector form by

e = Y - Ŷ = Y - Xb = Y - X(X'X)⁻¹X'Y = [I - X(X'X)⁻¹X']Y = (I - H)Y

where H = X(X'X)⁻¹X'. H is called the hat matrix and plays an important role in regression diagnostics (to be discussed below).
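A companion sketch (same numpy assumptions as above) of the hat matrix and the residual vector e = (I - H)Y:

import numpy as np

def hat_matrix(X):
    # H = X (X'X)^(-1) X'
    return X @ np.linalg.solve(X.T @ X, X.T)

def residual_vector(X, Y):
    # e = (I - H) Y = observed minus fitted values
    H = hat_matrix(X)
    return (np.eye(X.shape[0]) - H) @ Y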

The resulting minimum value of Q, called the sum of squared errors SSE, is

    given by


SSE = Σ ei² = (Y - Xb)'(Y - Xb) = Σ (Yi - Ŷi)²

    Page 24

    The fitted values are given by

Ŷ = Xb = X(X'X)⁻¹X'Y = HY

This representation of the vector Ŷ of predicted (fitted) values displays directly the relationship between them and the observations. Letting hij denote the (i,j)th element of H, the fitted values are

Ŷi = Σ hij Yj   (sum over j = 1, ..., n).

High values of the diagonal elements hii indicate that the observation Yi has a high influence on its own fitted value.

Example. One predictor X, n = 4 observations:

X =
[ 1   2 ]
[ 1   3 ]
[ 1   4 ]
[ 1  10 ]

Y = (3, 3, 3, 40)'

H =
[  0.445   0.374   0.303  -0.123 ]
[  0.374   0.329   0.284   0.013 ]
[  0.303   0.284   0.265   0.148 ]
[ -0.123   0.013   0.148   0.961 ]

Diag(H) = (0.445, 0.329, 0.265, 0.961).

The diagonal elements hii measure the influence (leverage) of the individual observations; the fourth observation has very high leverage (h44 = 0.961). For example,

Ŷ4 = -0.123 Y1 + 0.013 Y2 + 0.148 Y3 + 0.961 Y4.
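The numbers in this example can be checked with a few lines of numpy (a sketch using the X and Y shown above):

import numpy as np

X = np.array([[1., 2.], [1., 3.], [1., 4.], [1., 10.]])
Y = np.array([3., 3., 3., 40.])

H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
print(np.round(np.diag(H), 3))             # leverages: 0.445, 0.329, 0.265, 0.961
b = np.linalg.solve(X.T @ X, X.T @ Y)      # intercept and slope: about -11.56 and 5.013
print(np.round(b, 3))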

Here is a plot of the fitted and observed values:

[Fitted Line Plot of y versus x: y = -11.56 + 5.013 x, with S = 5.14750, R-Sq = 94.8%, R-Sq(adj) = 92.3%]


    Page 25

    From the graph it is easy to see why the observation Y4 has a large influence

    (leverage) on its fitted value (and on the fitted regression line as well).

Coefficient of Multiple Determination R².

We ask: How much improvement is obtained by using a predictor to obtain fitted (average) values of the response, versus just using the mean Ȳ? One answer is the following:

    Compare the two ways of getting fitted values:

1. Use the average value (sample mean Ȳ) of the observations, so Ŷi = Ȳ, and compute SSTO = Σ (Yi - Ȳ)² = the sum of squared errors of predictions.

2. Use the fitted regression line, getting fitted values Ŷ = b0 + b1x in the simple linear regression model. Then compute SSE = Σ (Yi - Ŷi)².

    Compare the sum of squared errors of fitted and observed values for the two

    methods. Then

R² = (SSTO - SSE)/SSTO

equals the proportionate reduction in the sum of squared errors using the fitted regression line vs. using the sample mean Ȳ. R² is usually expressed as a percentage reduction. It is also interpreted as the amount of variability in the observations that can be explained (or accounted for) by the predictors. Note that the sample variance of the observations is

Sy² = SSTO/(n - 1)

and the variance of the residuals using the regression equation is given by MSE = SSE/(n - p). s = √MSE is the estimated standard deviation of the random errors εi.
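These quantities follow directly from X, Y, and b; here is a minimal numpy sketch (fit_summary is an illustrative name):

import numpy as np

def fit_summary(X, Y):
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    fitted = X @ b
    SSTO = np.sum((Y - Y.mean()) ** 2)     # squared errors using the sample mean
    SSE = np.sum((Y - fitted) ** 2)        # squared errors using the regression fit
    R2 = (SSTO - SSE) / SSTO
    MSE = SSE / (n - p)
    s = np.sqrt(MSE)                       # estimated std. dev. of the errors
    return R2, MSE, s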


Page 26

It is easily shown that the total sum of squares SSTO can be decomposed as

SSTO = Σ (Yi - Ȳ)² = Σ (Ŷi - Ȳ)² + Σ (Yi - Ŷi)² = SSR + SSE.

    This breakdown of sum of squares can be summarized in an Analysis of

    Variance Table:

Source of     Sum of                 df      MS                  F-Test
Variation     Squares
-----------   --------------------   -----   -----------------   ----------
Regression    SSR = Σ (Ŷi - Ȳ)²      p - 1   MSR = SSR/(p-1)     MSR/MSE
Error         SSE = Σ (Yi - Ŷi)²     n - p   MSE = SSE/(n-p)
-----------   --------------------   -----   -----------------
Total         SSTO = Σ (Yi - Ȳ)²     n - 1

The F-test is used to test the hypothesis that all of the parameters β1, β2, ..., βp-1 are simultaneously zero. Use the p-value of the test to make a decision on this (this is probably practically not an issue!).
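A sketch of this overall F-test (assuming scipy.stats is available for the p-value):

import numpy as np
from scipy import stats

def overall_f_test(X, Y):
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    fitted = X @ b
    SSR = np.sum((fitted - Y.mean()) ** 2)
    SSE = np.sum((Y - fitted) ** 2)
    F = (SSR / (p - 1)) / (SSE / (n - p))
    p_value = stats.f.sf(F, p - 1, n - p)   # upper-tail area of the F distribution
    return F, p_value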

    Confidence Intervals.

Recall (from page 23) that b = (X'X)⁻¹X'Y is the estimate of the vector of parameters β. It can be shown that the variance-covariance matrix of b is given by

Var-Cov(b) = σ²(X'X)⁻¹

    which is estimated by

Est. Var-Cov(b) = (X'X)⁻¹ MSE

The square roots of the diagonal elements of this matrix, s(bi), are the standard errors of the estimated regression parameters b1, b2, ..., bp-1. Confidence intervals for the βi are then given by

bi ± t* s(bi),

where t* is a critical value of the t-distribution with n - p df.
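A minimal sketch of these intervals, with scipy supplying the t critical value (coef_conf_intervals is an illustrative name):

import numpy as np
from scipy import stats

def coef_conf_intervals(X, Y, level=0.95):
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    MSE = np.sum((Y - X @ b) ** 2) / (n - p)
    se = np.sqrt(MSE * np.diag(XtX_inv))            # s(b_i)
    t_star = stats.t.ppf(1 - (1 - level) / 2, n - p)
    return np.column_stack([b - t_star * se, b + t_star * se])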


Tests of hypotheses about individual parameters are also conducted using the t-distribution; refer to the p-values of these tests in regression output.

    Page 27

Similarly, one can construct confidence intervals for the mean response μnew = E(Ynew) corresponding to a given set of values of x1, x2, ..., xp-1. The mean response is estimated by Ŷnew = X'new b, where X'new is the (row) vector of the values of x1, x2, ..., xp-1 (preceded by a 1 for the intercept term). It can be shown that the standard error of the estimated response is given by

s.e.(Ŷnew) = √[ MSE · X'new (X'X)⁻¹ Xnew ]
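A matching sketch for the estimated mean response and its standard error at a new predictor vector x_new (which is assumed to include the leading 1):

import numpy as np

def mean_response(X, Y, x_new):
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    MSE = np.sum((Y - X @ b) ** 2) / (n - p)
    y_hat = x_new @ b                                # estimated E(Y_new)
    se = np.sqrt(MSE * (x_new @ XtX_inv @ x_new))    # its standard error
    return y_hat, se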

    Model Selection Criteria.

If there are (P-1) predictors x1, x2, ..., xP-1, one can conceivably fit 2^(P-1) different models to the data. For example, there are P-1 models with one predictor, (P-1)(P-2)/2 models with two predictors, etc. Some criteria used for comparing models include the following (p as a subscript below refers to the number of predictors in a model):

SSEp, R²p, R²a,p, Cp, AICp, BICp, and PRESSp.

    These can be described as follows:

SSEp or R²p. Note first that SSEp and R²p are equivalent measures, in that

R²p = 1 - SSEp/SSTO

The goal in using either of these statistics is to choose a model where SSEp is small (equivalently, R²p is large). One can plot, e.g., R²p against p and choose a model, or models, where it is asymptoting (not changing).

R²a,p is the same measure as R²p but with an adjustment for sample size. It is given by

R²a,p = 1 - [(n - 1)/(n - p)] · SSEp/SSTO = 1 - MSEp/Sy²

where Sy² = SSTO/(n - 1) is the sample variance of the observations.

Page 28

Thus, R²a,p looks at how the ratio of sample variances for the model with p

    predictors changes in comparison with the model with no predictors (a

    baseline model).
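For a given subset model, this adjustment can be computed directly, as in this small Python sketch (names are illustrative):

def adjusted_r2(sse_p, ssto, n, p):
    # R^2_{a,p} = 1 - [(n - 1)/(n - p)] * SSE_p / SSTO
    return 1.0 - (n - 1) / (n - p) * sse_p / ssto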

    Cp . This criterion is concerned with the total mean squared error of the n

    fitted values for each subset selection model. It is a bit complicated to

describe here. Suffice it to say, most statisticians now prefer to use the BIC

    criterion.

BICp. Schwarz's Bayesian Information Criterion is given by

BICp = n ln(SSEp) - n ln(n) + (ln n) p
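A direct Python translation of this formula (a sketch; the inputs are the residual sum of squares for the candidate model, the sample size, and the number of predictors, and smaller BICp values indicate the preferred model):

import numpy as np

def bic(sse_p, n, p):
    # BIC_p = n ln(SSE_p) - n ln(n) + (ln n) p
    return n * np.log(sse_p) - n * np.log(n) + np.log(n) * p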

We will look at an example using the SDSS (Sloan Digital Sky Survey) quasar data (CASt dataset SDSS_quasar.dat). Here are 8 of the first 10 observations in the dataset (which contains 46420 observations in all). The variables are as follows:

Dec.    z     u_mag  g_mag  r_mag  i_mag  z_mag  Radio  X-ray  J_mag  H_mag  K_mag  M_i
15.30  1.20   19.92  19.81  19.39  19.16  19.32  -1.00  -9.00   0.00   0.00   0.00  -25.08
13.94  2.24   19.22  18.89  18.45  18.33  18.11  -1.00  -9.00   0.00   0.00   0.00  -27.42
14.93  0.46   19.64  19.47  19.36  19.19  19.00  -1.00  -9.00   0.00   0.00   0.00  -22.73
 0.04  0.48   18.24  17.97  18.03  17.96  17.91   0.00  -1.66  16.65  15.82  14.82  -24.05
14.18  0.95   19.52  19.28  19.11  19.16  19.07  -1.00  -9.00   0.00   0.00   0.00  -24.57
-8.86  1.25   19.15  18.72  18.26  18.28  18.26  13.97  -9.00   0.00   0.00   0.00  -26.06
15.33  0.99   19.41  19.18  18.99  19.08  19.13  -1.00  -1.88   0.00   0.00   0.00  -24.71
13.77  0.77   19.35  19.00  18.92  19.01  18.84  -1.00  -9.00   0.00   0.00   0.00  -24.19