Class Notes Sessions5-6


  • 8/14/2019 Class Notes Sessions5-6


    One Way ANOVA

    ANOVA stands for Analysis of Variance

    ANOVA allows us to compare the means from more than two sets of scores.

    A significant ANOVA indicates that changes in the independent variable affect the dependent variable, i.e., that at least one condition mean differs from the others.

    ANOVA does not indicate which pairs of conditions are significantly different.

    Use planned contrasts or unplanned (post hoc) contrasts to assess whether pairs of conditions are significantly different.

    ANOVA Assumptions

    1. Normally distributed populations
    2. Equal population variances
    3. Random sampling used
    4. Dependent variable uses an interval or ratio scale


    Digression on Scales: Levels of Measurement

    Interval: 0 doesn't mean "none" (e.g., IQ score)

    - distances between points on the scale are equal, but ratios aren't meaningful (e.g., temperature)

    Ratio: Same as interval scale, but 0 means "none" and ratios are meaningful (e.g., weight or age: a person who is 50 is twice as old as one who is 25).

    Nominal: Numbers are just labels for attributes (e.g., color)

    Ordinal: Categories have a logical order (e.g., ranks)

    Digression on Scales: Types of Data

    Continuous: Numerical data that can be fractional (e.g., temperature)

    Discrete: Numerical data that cannot be fractional (e.g., number of World Cup trophies)


    One-way ANOVA: National Airlines

    National Airlines recently introduced a daily early-morning nonstop flight between Houston and Chicago. The vice president of marketing for National Airlines decided to perform a statistical test to see whether National's average passenger load on this new flight is different from that of each of its two major competitors (which we will call competitor 1 and competitor 2). Ten early-morning flights were selected at random from each of the three airlines and the percentage of unfilled seats on each flight was recorded. These data are stored in an Excel file on the website at National Airlines (Excel).

    Is there evidence that National's average passenger load on the new flight is different from that of its two competitors? Report a p value and interpret the results of the statistical test.

    Raw Data (in Excel)

    Raw Data (in SPSS)

    Transform Data into Analysis-Ready Form

    Analyze → Compare Means → One-Way ANOVA

    Post Hoc Contrasts

    Results

    Descriptives: Unfilled

                                                           95% Confidence Interval for Mean
                   N    Mean    Std. Deviation  Std. Error  Lower Bound  Upper Bound  Minimum  Maximum
    National       10    9.80   2.044           .646         8.34        11.26        7        13
    Competitor 1   10   11.30   2.003           .633         9.87        12.73        7        13
    Competitor 2   10   12.60   2.011           .636        11.16        14.04        9        15
    Total          30   11.23   2.269           .414        10.39        12.08        7        15

    ANOVA: Unfilled

                     Sum of Squares   df   Mean Square   F       Sig.
    Between Groups    39.267           2   19.633        4.815   .016
    Within Groups    110.100          27    4.078
    Total            149.367          29
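    The ANOVA table can be approximately rebuilt from the Descriptives table alone. The sketch below is not the SPSS procedure; it assumes equal weighting by group size and uses the rounded standard deviations printed above, so the results should land close to, but not exactly on, the SPSS values. The p value is not computed because the F distribution is not in the Python standard library.

```python
# Group summaries copied from the Descriptives table: (n, mean, std. deviation)
groups = {
    "National": (10, 9.80, 2.044),
    "Competitor 1": (10, 11.30, 2.003),
    "Competitor 2": (10, 12.60, 2.011),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between-groups sum of squares: spread of the group means around the grand mean
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# Within-groups sum of squares: pooled spread of scores inside each group
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"SS between = {ss_between:.3f}, df = {df_between}")
print(f"SS within  = {ss_within:.3f}, df = {df_within}")
print(f"F = {f_stat:.3f}")
```

    F is simply the ratio of the two mean squares, which is why a large spread among the group means relative to the spread within groups produces a significant result.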

    Post Hoc Tests

    Multiple Comparisons

    Dependent Variable: Unfilled

    Bonferroni

                                   Mean Difference                        95% Confidence Interval
    (I) Airline    (J) Airline     (I-J)             Std. Error   Sig.    Lower Bound   Upper Bound
    National       Competitor 1    -1.500            .903         .325    -3.81           .81
    National       Competitor 2    -2.800*           .903         .013    -5.11          -.49
    Competitor 1   National         1.500            .903         .325     -.81          3.81
    Competitor 1   Competitor 2    -1.300            .903         .484    -3.61          1.01
    Competitor 2   National         2.800*           .903         .013      .49          5.11
    Competitor 2   Competitor 1     1.300            .903         .484    -1.01          3.61

    * The mean difference is significant at the .05 level.
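    As a rough illustration of where the Std. Error column comes from, the sketch below recomputes it from the Within Groups mean square and the group size. The `bonferroni` helper is a hypothetical name; it reflects the usual Bonferroni adjustment (multiply each unadjusted p value by the number of comparisons, capped at 1), which is assumed here rather than taken from the notes.

```python
import math
from itertools import combinations

ms_within = 4.078      # Mean Square, Within Groups (from the ANOVA table)
n_per_group = 10
means = {"National": 9.80, "Competitor 1": 11.30, "Competitor 2": 12.60}

# Standard error of the difference between two group means (equal group sizes)
se_diff = math.sqrt(ms_within * (1 / n_per_group + 1 / n_per_group))

pairs = list(combinations(means, 2))
n_comparisons = len(pairs)   # number of distinct pairs among the three airlines

for a, b in pairs:
    diff = means[a] - means[b]
    t_stat = diff / se_diff
    print(f"{a} - {b}: diff = {diff:+.2f}, SE = {se_diff:.3f}, t = {t_stat:+.2f}")


def bonferroni(p_unadjusted, m=n_comparisons):
    """Bonferroni-adjusted p value (hypothetical helper, assumed adjustment rule)."""
    return min(1.0, p_unadjusted * m)
```

    Only the National vs. Competitor 2 difference is flagged as significant in the table, which is the pair with the largest t statistic here.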


    The χ² Test

    For data in which each outcome is assigned to a single (and only one) mutually exclusive and exhaustive category

    Derived from square of Z statistic

    Assumes: Independence of observations (can't take several observations from 1 person and analyze with χ²)

    Compares the observed and expected values

    Needs at least 5 expected observations per cell

    Values range from 0 on up (no negative values)


    One-way χ²

    Are There Differences Among the Levels of 1 Variable?

    H0: P1 = P2 = P3 = P4 = 1/4 (where P1 + P2 + P3 + P4 = 1)

    Ha: At least one Population Proportion ≠ 1/4

    $$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \dots + \frac{(O_m - E_m)^2}{E_m}$$

    Where O_i and E_i are the observed and expected # of occurrences for m (exhaustive & mutually exclusive) outcomes

    Squared deviations in which large disparities count for more than small disparities

    df = (# levels in the independent variable - 1)


    100 Analysts Rate an IPO

    Strong Buy   Buy   Hold   Sell   Strong Sell
    24           33    22     16     5

    H0: P_SB = P_B = P_H = P_S = P_SS
    Ha: P_SB ≠ P_B ≠ P_H ≠ P_S ≠ P_SS (i.e., the proportions are not all equal)

    Actual & Expected (in parentheses)

    Strong Buy   Buy       Hold      Sell      Strong Sell
    24 (20)      33 (20)   22 (20)   16 (20)   5 (20)

    Critical Value (df = 4, α = .05): χ²(4) = 9.49

    $$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{4^2}{20} + \frac{13^2}{20} + \frac{2^2}{20} + \frac{4^2}{20} + \frac{15^2}{20} = 21.50$$

    Reject H0 because the test statistic (21.50) is greater than the critical value (9.49).
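    The hand calculation above can be checked with a few lines of Python. This is a minimal sketch: the function name `chi_square_statistic` is made up for illustration, and the critical value is copied from the notes rather than computed.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [24, 33, 22, 16, 5]   # Strong Buy, Buy, Hold, Sell, Strong Sell
n = sum(observed)
# H0: equal proportions, so every category expects n / 5 analysts
expected = [n / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1
critical_value = 9.49            # chi-square(4) at alpha = .05, taken from the notes

print(f"chi-square = {chi2:.2f}, df = {df}")
print("Reject H0" if chi2 > critical_value else "Do not reject H0")
```

    Most of the statistic comes from the Buy and Strong Sell categories, whose squared deviations (13² and 15²) dwarf the others.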


    100 Analysts Rate an IPO: Testing Unequal Categories

    Test the null hypothesis that twice as many analysts will offer some form of buy recommendation (either Strong Buy or Buy) as will offer a hold, and twice as many as will offer some form of sell recommendation (Sell or Strong Sell).

    Collapse analysts' recommendations into 3 categories:

    Buy   Hold   Sell
    57    22     21

    H0: P_B = 2P_H = 2P_S
    Ha: P_B ≠ 2P_H ≠ 2P_S

    Critical Value (df = 2, α = .05): χ²(2) = 5.99

    Actual & Expected (in parentheses)

    Buy       Hold      Sell
    57 (50)   22 (25)   21 (25)

    $$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{49}{50} + \frac{9}{25} + \frac{16}{25} = 1.98$$

    Do not reject H0 because the test statistic (1.98) is less than the critical value (5.99).
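    With unequal expected proportions, the only change is how the expected counts are built. The sketch below also reports a p value, using the fact that for df = 2 (and only df = 2) the χ² upper-tail probability reduces to exp(-x/2); for other degrees of freedom a statistics library or table would be needed.

```python
import math

observed = [57, 22, 21]               # Buy, Hold, Sell
proportions_h0 = [0.50, 0.25, 0.25]   # P_B = 2 * P_H = 2 * P_S
n = sum(observed)
expected = [p * n for p in proportions_h0]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Closed form valid for df = 2 only: P(chi-square > x) = exp(-x / 2)
p_value = math.exp(-chi2 / 2)

print(f"expected = {expected}")
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

    A p value well above .05 agrees with the critical-value comparison: the data are compatible with the 2:1:1 split.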

    100 Analysts Rate an IPO: Testing Unequal Categories in SPSS

    Analyze → Nonparametric → Chi-Square

    Set up the expected values for each category

    Results

    Analysts' Recommendations

             Observed N   Expected N   Residual
    Buy      57           50.0          7.0
    Hold     22           25.0         -3.0
    Sell     21           25.0         -4.0
    Total    100

    Test Statistics: Analysts' Recommendations

    Chi-Square(a)   1.980
    df              2
    Asymp. Sig.     .372

    a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 25.0.


    Two-way χ²

    Are There Differences Between 2 Variables?

    H0: Variables A and B are independent
    Ha: Variables A and B are dependent

    df = (# rows - 1)(# columns - 1)

    Tests nondirectional hypotheses only, using a single tail

    SPSS: 2-Way χ² Tests

    Vioxx data file:

    Industry [Industry ties? 1=no, 2=yes]
    Vioxx [Bring Vioxx back? 1=no, 2=yes]

    Analyze → Descriptive Statistics → Crosstabs

    Click on Statistics; select chi-square

    Click OK. This is the χ² output:

    Crosstabs

    Case Processing Summary

                                          Valid           Missing         Total
                                          N    Percent    N    Percent    N    Percent
    Industry Ties? * Bring Vioxx Back?    32   100.0%     0    .0%        32   100.0%

    Industry Ties? * Bring Vioxx Back? Crosstabulation (Count)

                             Bring Vioxx Back?
                             no     yes    Total
    Industry Ties?   no      14     8      22
                     yes     1      9      10
    Total                    15     17     32


    Notice that the χ² test below is significant (p = .005), but not entirely reliable because the expected cell count in one of the cells is less than 5. (You should be able to verify that the lower left cell in the Crosstabulation above is the one with the undercount.)

    Chi-Square Tests

                                                  Asymp. Sig.   Exact Sig.   Exact Sig.
                                   Value     df   (2-sided)     (2-sided)    (1-sided)
    Pearson Chi-Square             7.942(b)  1    .005
    Continuity Correction(a)       5.935     1    .015
    Likelihood Ratio               8.893     1    .003
    Fisher's Exact Test                                         .007         .006
    Linear-by-Linear Association   7.694     1    .006
    N of Valid Cases               32

    a. Computed only for a 2x2 table
    b. 1 cells (25.0%) have expected count less than 5. The minimum expected count is 4.69.
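    The expected counts and the Pearson χ² in the output above can be verified by hand. The sketch below does so for the Vioxx crosstab; the p value uses a closed form that holds for df = 1 only, namely P(χ² > x) = erfc(√(x/2)).

```python
import math

# Observed counts: rows = Industry Ties? (no, yes); columns = Bring Vioxx Back? (no, yes)
observed = [[14, 8],
            [1, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: (row total * column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form valid for df = 1 only
p_value = math.erfc(math.sqrt(chi2 / 2))

min_expected = min(min(row) for row in expected)
print(f"expected = {expected}")
print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
print(f"minimum expected count = {min_expected:.2f}")
```

    The smallest expected count belongs to the lower left cell (industry ties = yes, bring Vioxx back = no), which is the cell that falls below the 5-per-cell guideline.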

    When one or more of your cells has an expected count less than 5, report Fisher's Exact Test (in the SPSS output). Fisher's Exact Test has no test statistic, no critical value, and no confidence interval. Report it as follows: p = .007, Fisher's Exact Test, 2-tailed.
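    For the curious, Fisher's Exact Test can be computed directly from hypergeometric probabilities. The sketch below is an illustration, not the SPSS code: `fisher_exact_2x2` is a made-up name, and the two-sided rule used here (add up every table that is no more likely than the observed one, with margins held fixed) is a common convention that is assumed to match the SPSS output.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Return (two_sided_p, one_sided_p) for the 2x2 table [[a, b], [c, d]].

    The one-sided value is the tail in the direction of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Two-sided: every table that is no more likely than the observed table
    two_sided = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    upper = sum(prob(x) for x in range(a, hi + 1))
    lower = sum(prob(x) for x in range(lo, a + 1))
    return two_sided, min(upper, lower)


two_sided, one_sided = fisher_exact_2x2(14, 8, 1, 9)
print(f"Fisher's Exact Test: p = {two_sided:.3f} (2-sided), p = {one_sided:.3f} (1-sided)")
```

    Because the probabilities are computed exactly, the test does not rely on the large-sample approximation that breaks down when expected counts are small.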


    Correlation

    How do the scores on one variable change with the scores on another variable?

    Correlations are concerned with measuring the direction and magnitude of a linear relationship between two variables.

    The stronger the correlation, the more accurately we can predict Y from knowing X.

    Scatterplot: A graph containing clusters of dots that represent all X-Y pairs of observations.

    Involves an examination of pairs of X-Y scores (one-sample procedure).


    Correlation Coefficients

    Measures the extent to which the individual Xi-Yi scores that make up a pair occupy the same or opposite positions within their distributions.

    - Pos relation: Pairs tend to occupy similar relative positions in their distributions

    - Neg relation: Pairs tend to occupy opposite relative positions in their distributions

    Two types (there are others as well):
    - Pearson r (continuous data): r_xy
    - Phi Coefficient (binary variables: 2 X 2 Tables): φ

    Range from -1 to 1
    1 = perfect pos relation
    -1 = perfect neg relation
    0 = No relation

    Failure to find strong r may mean:
    (a) chance
    (b) variables are unrelated, or
    (c) the variables are related nonlinearly.


    r Computation (by hand)

    1. Transform each Y score into a Z score (Zy)
    2. Transform each X score into a Z score (Zx)
    3. Determine correspondence between each of the paired Zs

    - r indicates the average correspondence between the paired Zs.

    r = Mean of the cross-product of Z scores.

    Population: $r = \frac{\sum Z_x Z_y}{N}$

    Sample: $r = \frac{\sum Z_x Z_y}{N - 1}$

    (note: Zs will differ for populations & samples because the denominator for computing population Zs is σ and the denominator for computing sample Zs is s.)

    When there is large pos correspondence: Z cross-product is pos & large

    When there is small neg correspondence: Z cross-product is neg & small (lots of + and - canceling each other out)

    Strength of Relationship

    r² = Proportion of variability of Y accounted for by X


    Strong Correlation

    (population computation)

    Student      # High School A's (X)   # College A's (Y)   Zx       Zy
    Alejandro    13                      14                   1.50     0.50
    Bernardo     9                       18                   0.50     1.50
    Carlos       7                       12                   0.00     0.00
    Dominique    5                       10                  -0.50    -0.50
    Enrique      1                       6                   -1.50    -1.50

    $$r = \frac{\sum Z_x Z_y}{N} = \frac{(1.5)(.5) + (.5)(1.5) + (0)(0) + (-.5)(-.5) + (-1.5)(-1.5)}{5} = 0.80$$
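    The three by-hand steps translate directly into code. This sketch uses the population formulas (divide by N), matching the worked example; the helper name `z_scores` is made up for illustration.

```python
from statistics import fmean, pstdev

x = [13, 9, 7, 5, 1]      # number of high school A's
y = [14, 18, 12, 10, 6]   # number of college A's


def z_scores(values):
    """Population z scores: (value - mean) / population standard deviation."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]


zx, zy = z_scores(x), z_scores(y)

# Population formula: r is the mean of the z-score cross-products
r = sum(a * b for a, b in zip(zx, zy)) / len(x)

print("Zx =", zx)
print("Zy =", zy)
print(f"r = {r:.2f}, r squared = {r ** 2:.2f}")
```

    Squaring r gives the proportion of variability in college A's accounted for by high school A's.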


    Strong Correlation: Using SPSS

    Analyze → Correlate → Bivariate

    Correlation Output

    Correlations

                                         High School   College
    High School   Pearson Correlation    1             .800
                  Sig. (2-tailed)        .             .104
                  N                      5             5
    College       Pearson Correlation    .800          1
                  Sig. (2-tailed)        .104          .
                  N                      5             5


    Two Points of Caution with Correlations

    1. Restriction of range (i.e., truncated range) problem

    When the relevant range of X or Y scores is a truncated part of the whole, the truncated X-Y correlation will generally be smaller than the whole X-Y correlation.

    2. Correlation does not mean causation

    - There may be a correlated 3rd variable

    - Even if no 3rd variable is involved, it's not always clear which variable is the cause and which is the effect.


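    Phi Coefficient

    Correlation for Categorical Data (2 X 2 Tables):

    a   b
    c   d

    $$\phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}$$

    Example:

    Yes   No
    50    20
    10    40

    $$\phi = \frac{(50)(40) - (20)(10)}{\sqrt{(70)(50)(60)(60)}} \approx 0.51$$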
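    The φ formula is easy to wrap in a small function. The sketch below (the name `phi_coefficient` is made up) applies it to the 50/20/10/40 table and to the 10/5/5/8 table used in the SPSS example.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]]."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


print(f"phi = {phi_coefficient(50, 20, 10, 40):.3f}")
print(f"phi = {phi_coefficient(10, 5, 5, 8):.3f}")
```

    The sign of φ depends only on whether the diagonal product ad exceeds the off-diagonal product bc, so swapping the two columns flips the sign without changing the magnitude.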

    Phi Coefficient (using SPSS)

    10   5
    5    8

    Analyze → Descriptive Statistics → Crosstabs → Statistics

    Click Statistics and check Phi and Cramer's V

    Symmetric Measures

                                       Value   Approx. Sig.
    Nominal by Nominal   Phi           .282    .136
                         Cramer's V    .282    .136
    N of Valid Cases                   28

    a. Not assuming the null hypothesis.
    b. Using the asymptotic standard error assuming the null hypothesis.

    (ignore Cramer's V)


    Regression

    Regression: The primary purpose of regression is prediction

    Predictions about the linear relationship between independent and dependent variables.

    Independent = predictor = explanatory
    Dependent = response = criterion

    Types of Regression

    1. Linear (least squares regression line)
    - Simple regression: one predictor variable
    - Multiple regression: multiple predictor variables

    2. Nonlinear (can linearize many of these via transformation)
    - Positive curvilinear (e.g., diminishing marginal utility)
    - Polynomial (quadratic: parabola-shaped; cubic)
    - Exponential or negative curvilinear (L-shaped)

    3. Logistic (when dependent variable is categorical)
    - Example: graduate or not; sales are weak/moderate/strong


    Lines: Y = b0 + b1X

    b0 and b1 are regression coefficients
    - can be positive or negative
    - b1 is usually of more interest than b0

    b0 = Y intercept (value of Y when X = 0)
    b1 = Slope (how much Y changes when X changes by 1 unit)

    Example #1: Suppose Aeromexico wants to examine the relation between number of flight delays and number of passenger complaints.

    X = Number of flight delays
    Y = Number of passenger complaints

    Suppose that the data are as follows (X, Y): (0, 1), (1, 3), (2, 5)


    Scatterplot: Delays vs. Complaints

    [Scatterplot: complaints (y-axis, 1 to 5) against delays (x-axis, 0 to 2)]

    The line that fits these data perfectly is: Y = 1 + 2X
    # Complaints = 1 + 2 (# Flight Delays)

    X    Y                (X, Y)
    0    1 + 2(0) = 1     (0, 1)
    1    1 + 2(1) = 3     (1, 3)
    2    1 + 2(2) = 5     (2, 5)
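    The same line can be recovered numerically. The sketch below fits a least-squares line with the standard closed-form slope and intercept; `fit_line` is a made-up helper name, not something from the course materials.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line Y = b0 + b1 * X."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


delays = [0, 1, 2]
complaints = [1, 3, 5]

b0, b1 = fit_line(delays, complaints)
# Residual = actual value minus the value predicted by the line
residuals = [y - (b0 + b1 * x) for x, y in zip(delays, complaints)]

print(f"Complaints = {b0:.1f} + {b1:.1f} * Delays")
print("residuals =", residuals)
```

    Because the three points lie exactly on a line, every residual is zero; real data almost never behave this way.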


    The regression line (also called the least squares regression line) minimizes the sum of squared differences between the observed and predicted values of the response variable (as given by the regression line).

    - The difference between the actual and predicted values is called the residual.

    - Minimizing the sum of squared residuals gives the best-fitting slope for the observed data; under the regression assumptions, it is also the best linear unbiased estimate of the true slope.
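    Written out, the least-squares criterion and its standard closed-form solution for simple regression look as follows. The notation (hats for predicted values, bars for means) is an assumption added here, not taken from the slides.

```latex
\min_{b_0,\, b_1} \sum_{i=1}^{N} \left( Y_i - \hat{Y}_i \right)^2,
\qquad \hat{Y}_i = b_0 + b_1 X_i

b_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2},
\qquad b_0 = \bar{Y} - b_1 \bar{X}
```

    Each term in the sum is a squared residual, so a single large miss costs more than several small ones.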

    [Scatterplot: complaints (y-axis, 0 to 20) against delays (x-axis, 0 to 10) with fitted regression line; R Sq Linear = 0.475]

    (We'll talk about what R Sq Linear means later on)


    Example #2: Suppose UT wants to examine the relation between alumni donations to the school and number of football victories.

    X = Number of football victories
    Y = Amount of alumni donations the following year

    Alumni Donations = $10,000,000 + $200,000 (# Football Victories)

    Y = 10,000,000 + 200,000X

    Caution #1: X-Y relation may not be causal

    Caution #2: Regression line estimates are most trustworthy near the bulk of the data (usually the center).


    Linear Regression Assumptions

    1. Linearity
    - Linear relationship between X and Y
    - Same expected change in Y moving from X1 to X2 vs. moving from X2 to X3

    Test: X-Y Scatterplot (look for nonlinearities)

    Correction: Insert curvilinear term (usually quadratic: X²): Y = b0 + b1X + b2X²
    Correction: Log transformation of X variable (if data are positive)
    - brings large values down, pushes small values further apart

    2. Independence of Observations
    - Residuals across Xs are not correlated

    Test: Durbin-Watson (0-4) (tests residual correlation among Xs)
    0.0 - 1.5 = pos correlation
    1.5 - 2.5 = no correlation
    2.5 - 4.0 = negative correlation

    Correction: Transformation of Y variable (percentages or logs)
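    To make the Durbin-Watson ranges concrete, here is a minimal sketch of the statistic itself: the sum of squared differences between successive residuals divided by the sum of squared residuals. The two residual series are invented purely for illustration and are not course data.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, listed in observation order (not from the course data)
drifting = [1.0, 0.8, 0.5, -0.2, -0.6, -0.9]      # neighbours resemble each other
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbours flip sign

print(f"drifting:    D-W = {durbin_watson(drifting):.2f}")
print(f"alternating: D-W = {durbin_watson(alternating):.2f}")
```

    Residuals that drift slowly give a value well below 1.5 (positive correlation), while residuals that flip sign every step give a value above 2.5 (negative correlation).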


    3. Normality
    - The distrib. at each Xi is normal
    - The errors have a normal distribution

    Test #1: Plot histogram of residuals (should be normal)

    Test #2: Normal Prob. Plot
    - Plot of cumulative probabilities
    - Should follow diagonal (if residuals follow normal distrib.)

    Correction: Log transformation of Y

    4. Constant Variance (homoskedasticity)
    - Each Yi distrib. has the same variance
    - Means that the effects of other factors do not depend on the level of X.
    - Common problem: Variance goes up as X increases (funnel shape)

    Test: Scatterplot of X vs. Residuals
    - Should not show funnel shape pattern

    Correction: Log transformation of Y


    Example: Simple Linear Regression

    Houston Astros Payroll

    Identify a regression equation that predicts the median salary for a Houston Astros baseball player based on knowledge of the total team payroll

    Independent variable: Total Payroll
    Dependent variable: Median Salary

    Here are your data (figures are in thousands)

    You can access this data file on the website as well (Houston Astros salary data)

    1. Create XY Scatterplot

    Graphs → Scatter → Simple → Define → OK

    Median Salary vs. Total Payroll Scatterplot

    [Scatterplot: Median Salary (y-axis, 0 to 1,500) against Total Payroll (x-axis, 10,000 to 80,000)]

    This scatterplot shows that the linearity assumption is OK
    - we'll check the other 3 assumptions shortly

    2. Visual check for outliers (remove if necessary)

    3. Add regression line:

    Double click on graph

    Single click on a data point (it will enlarge and change color)

    Elements → Fit Line At Total

    Fit Line at Total → Linear

    4. Conduct Regression Analysis

    Put Independent and Dependent variables in the right boxes

    Click Statistics

    Click Plots

    Click Save

    By checking these boxes, you will create extra columns on your data file. You will get a Predicted Values (PRE_1) column and a Residual Values (RES_1) column.

    5. Examine Regression Output

    Model Summary(b)

    Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
    1       .754(a)   .569       .540                220.53978

    Change Statistics: R Square Change = .569, F Change = 19.790, df1 = 1, df2 = 15, Sig. F Change = .000

    Durbin-Watson: 2... (column truncated in the original output)

    a. Predictors: (Constant), Total Payroll
    b. Dependent Variable: Median Salary

    ANOVA(b)

    Model 1       Sum of Squares   df   Mean Square   F        Sig.
    Regression    962530.2         1    962530.159    19.790   .000(a)
    Residual      729566.9         15   48637.793
    Total         1692097          16

    a. Predictors: (Constant), Total Payroll
    b. Dependent Variable: Median Salary

    Coefficients(a)

                      Unstandardized Coefficients   Standardized Coefficients
    Model 1           B          Std. Error         Beta                        t       Sig.
    (Constant)        110.736    111.951                                        .989    .338
    Total Payroll     .012       .003               .754                        4.449   .000

    a. Dependent Variable: Median Salary


    6. Is model statistically significant?

    Yes, because F = 19.79, p = .000 (i.e., p < .001).
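    Several numbers in the Model Summary follow directly from the regression ANOVA table. The sketch below recomputes them from the sums of squares; it assumes the standard formulas for R², adjusted R², F, and the standard error of the estimate, and uses the rounded values printed in the output.

```python
import math

# Values copied from the regression ANOVA table
ss_regression, df_regression = 962530.159, 1
ss_residual, df_residual = 729566.9, 15
ss_total = ss_regression + ss_residual

n = df_regression + df_residual + 1   # number of observations

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"N = {n}")
print(f"R squared = {r_squared:.3f}, adjusted = {adj_r_squared:.3f}")
print(f"F = {f_stat:.2f}, std. error of the estimate = {std_error_estimate:.2f}")
```

    In simple regression the F statistic is also the square of the slope's t statistic, which is why the model and the Total Payroll coefficient share the same significance level.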

    7. Identify equation for the simple linear model (i.e., the regression line)

    Coefficients(a)

                      Unstandardized Coefficients   Standardized Coefficients
    Model 1           B          Std. Error         Beta                        t       Sig.
    (Constant)        110.736    111.951                                        .989    .338
    Total Payroll     .012       .003               .754                        4.449   .000

    a. Dependent Variable: Median Salary

    Y = Y Intercept + Slope * (X), where both values come from the unstandardized B column (not the standardized Beta)

    Median Salary = 110.736 + .012 (Total Payroll)

    Or, in actual dollars:
    Median Salary = $110,736 + .012 (Total Payroll)


    8. Check the other 3 linear regression assumptions

    8a. Independence: D-W = 2.346 (OK, because it's between 1.5 and 2.5)

    8b. Normality:
    - Histogram of residuals (is it normal?)
    - Normal Prob. Plot (are points near the diagonal?)

    [Histogram of Regression Standardized Residual. Dependent Variable: Median Salary. Mean = -6.94E-17, Std. Dev. = 0.968, N = 17]

    OK, because residuals have a roughly normal shape

    [Normal P-P Plot of Regression Standardized Residual. Dependent Variable: Median Salary. Expected Cum Prob (y-axis) against Observed Cum Prob (x-axis)]

    OK, because points are near the diagonal


    8c. Constant variance?: Is there an absence of a funnel shape in the scatterplot of X vs. Residuals?

    Go to your Modified Data File:

    Here's a look at your data file ordered from lowest to highest payroll, where some of the columns are rearranged to make it more readable:

    Test the Constant Variance assumption by looking at the X vs. Residuals scatterplot. Check for funnel pattern.

    [Scatterplot: Unstandardized Residual (y-axis, -400 to 600) against Total Payroll (x-axis, 10,000 to 80,000)]

    There's a hint of a funnel pattern here. (Consider a log transformation of the Y variable, Median Salary.)


    9. Search output for Casewise Diagnostics that describe outliers

    None were found here, so nothing shows up in the SPSS output

    But if you changed the Casewise Diagnostics (in Statistics) to show outliers beyond 1 sd, here's what you'd get:

    Casewise Diagnostics(a)

    Case Number   Std. Residual   Median Salary   Predicted Value   Residual
    3              1.060           500.00          266.1707          233.82928
    8             -1.320           185.00          476.0631         -291.06310
    14             2.229          1300.00          808.3513          491.64868
    15            -1.558           500.00          843.7011         -343.70113
    16             1.218          1200.00          931.4056          268.59437
    17            -1.051           750.00          981.7387         -231.73868

    a. Dependent Variable: Median Salary
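    The Std. Residual column is each residual divided by the Std. Error of the Estimate from the Model Summary. The sketch below reproduces it for the six listed cases and shows how the cutoff changes what gets flagged; `flag_cases` is a made-up helper, and the 3 sd cutoff is assumed to be the setting under which nothing was reported.

```python
std_error_estimate = 220.53978   # "Std. Error of the Estimate" from the Model Summary

# (case number, observed median salary, predicted value), in thousands of dollars
cases = [
    (3, 500.00, 266.1707),
    (8, 185.00, 476.0631),
    (14, 1300.00, 808.3513),
    (15, 500.00, 843.7011),
    (16, 1200.00, 931.4056),
    (17, 750.00, 981.7387),
]


def flag_cases(rows, threshold):
    """Return (case, std. residual) pairs whose |std. residual| exceeds threshold."""
    flagged = []
    for case, observed, predicted in rows:
        std_residual = (observed - predicted) / std_error_estimate
        if abs(std_residual) > threshold:
            flagged.append((case, round(std_residual, 3)))
    return flagged


print(flag_cases(cases, threshold=1))   # the 1 sd cutoff used in the table
print(flag_cases(cases, threshold=3))   # assumed stricter cutoff
```

    Raising the threshold shrinks the list quickly: only case 14 is more than 2 standard errors from its predicted value.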


    Don't Trust Your Model TOO Much

    Question:

    The Houston Astros payroll in 2005 = $76,779,000. What does the regression line predict the median salary will be?

    Answer:
    Predicted Median Salary = $110,736 + (.012)($76,779,000) = $1,032,084

    Actual: $500,000
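    The prediction is a single plug-in calculation. The sketch below works in thousands of dollars, as the data file does, using the rounded coefficients from the output; the function name is made up for illustration.

```python
b0 = 110.736   # intercept, in thousands of dollars
b1 = 0.012     # slope: change in median salary per unit of total payroll


def predict_median_salary(total_payroll_thousands):
    """Predicted median salary, in thousands of dollars."""
    return b0 + b1 * total_payroll_thousands


payroll_2005 = 76_779   # $76,779,000 expressed in thousands
actual_2005 = 500       # $500,000 expressed in thousands

predicted = predict_median_salary(payroll_2005)
error = actual_2005 - predicted

print(f"predicted = ${predicted * 1000:,.0f}")
print(f"actual - predicted = ${error * 1000:,.0f}")
```

    The miss is more than half a million dollars, which is large even relative to the model's standard error of the estimate (about $220,540).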

    Question: Why was the model so far off?


    1988 Houston Astros (Total payroll = $13,455,000; Median = $500,000)

    2005 Houston Astros (Total payroll = $76,779,000; Median = $500,000)