Review of observational study design and basic statistics for contingency tables
Basic Review of Statistics
By this point in your college career, the BB students should have taken STAT 171 and perhaps DS 303/ECON 387 (core requirements for the BB degree).
For the BA students, deficiency MA students, and those of you who haven’t completed your statistics requirements, we will review the key topics needed for applications directly related to Econ 330.
Population Parameters vs. Sample Statistics
Population Parameters: descriptive measures of the entire population that you’re interested in examining.
Ex: All US households
Ex: All Illinois households with income (m) > $25,000
In the absence of complete and detailed information on every household you are interested in, you must estimate the population parameters. The most common way is to use sample statistics.
Sample Statistics: descriptive measures of a representative sample, or subset, of the population.
Ex: instead of surveying every US household we send out surveys to a subset of the population and use that basic information to estimate what the values would be for the overall population.
Measures of Central Tendency
1. Mean (or “arithmetic mean” or “average”): the sum of the numbers included in the sample divided by the number of observations, n.
◦ Ex: calculate the average cost per unit (AC) across different firms given cost data: $20.6, $40.3, $15.8, $23.7
◦ Typically written as: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
◦ For the cost data: $\bar{x} = (20.6 + 40.3 + 15.8 + 23.7)/4 = 25.10$
◦ Limitation of the Mean: because it is only an average, you can expect that actual data will rarely coincide exactly with your estimate. If there is high variation in your data, the average may not be very useful in estimation.
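The mean calculation for the cost example can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the course material.

```python
from statistics import mean

# Cost-per-unit data from the example above, in dollars.
costs = [20.6, 40.3, 15.8, 23.7]

# Arithmetic mean: the sum of the observations divided by their count, n.
average_cost = sum(costs) / len(costs)

# The standard library's mean() should agree with the hand calculation.
assert abs(average_cost - mean(costs)) < 1e-9

print(f"Average cost per unit: ${average_cost:.2f}")
```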
Measures of Central Tendency continued
2. Median: the middle observation in your data.
◦ Indicates that half of your observations are above this value and half of your observations are below this value.
◦ To find the value of the median, rank your observations in ascending or descending order by value. The observation in the middle is the median.
◦ Ex: 40, 80, 18, 32, 50
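The ranking procedure for the median can be sketched in Python as follows, using the five example values. The variable names are illustrative, and the shortcut of taking the single middle element assumes an odd number of observations.

```python
from statistics import median

# Example observations from the slide.
observations = [40, 80, 18, 32, 50]

# Step 1: rank the observations in ascending order.
ranked = sorted(observations)

# Step 2: with an odd number of observations, take the one in the middle.
middle_value = ranked[len(ranked) // 2]

# The standard library's median() should agree.
assert middle_value == median(observations)

print("Ranked:", ranked)
print("Median:", middle_value)
```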
Measures of Central Tendency continued
3. Mode: the most frequent value in the sample.
◦ Useful when there is little variation in the data (values tend to be continuous and close to one another, e.g. sales).
◦ Ex: sales data of ice cream in gallons over 8 weeks: 100, 99, 100, 102, 97, 110, 100, 103
◦ Aids in identifying the most common value for marketing purposes, such as the color or size of an item.
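A short Python sketch of finding the mode for the ice cream sales example, by counting how often each value appears. The variable names are illustrative.

```python
from collections import Counter

# Weekly ice cream sales in gallons, from the example above.
weekly_sales = [100, 99, 100, 102, 97, 110, 100, 103]

# Count how many times each value occurs, then take the most frequent one.
counts = Counter(weekly_sales)
most_common_value, frequency = counts.most_common(1)[0]

print(f"Mode: {most_common_value} gallons (appears {frequency} times)")
```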
Measures of Dispersion
1. Range: the difference between the largest and the smallest sample observation value.
◦ Our firm’s highest profit this year was $20 million, and the lowest profit this year was $12 million ___________________________________________
◦ The larger the range, the more variation or dispersion.
◦ Often used for “best case” and “worst case” scenario projections.
◦ Limitation: it focuses only on the extreme values and may not be representative of the entire sample.
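The range is simple enough to compute by hand, but a small Python sketch makes the definition concrete. The helper name `value_range` is illustrative; the profit figures are the two extremes given above, and the second data set is the ice cream sales series from the mode example.

```python
def value_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)

# Highest and lowest profit this year, in millions of dollars.
profit_extremes_millions = [20, 12]
profit_range = value_range(profit_extremes_millions)

# Weekly ice cream sales in gallons, from the mode example.
weekly_sales = [100, 99, 100, 102, 97, 110, 100, 103]
sales_range = value_range(weekly_sales)

print(f"Profit range: ${profit_range} million")
print(f"Sales range: {sales_range} gallons")
```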
2. Variance and Standard Deviation:
Variance (σ² or s²): the arithmetic mean of the squared deviations of each observation from the overall mean.
◦ Measures how far observation values deviate from the average value; whether they are above or below doesn’t matter, because squaring the deviations makes sure positive and negative deviations don’t cancel each other out.
◦ Population variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
◦ Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
◦ Where x_i is a value in your sample; μ is the population average or mean, so (x_i − μ) is how far your value deviates from the average; x̄ is the sample mean; n is the number of observations.
Standard Deviation (σ or s): the square root of the variance.
◦ Often used as a measure of potential risk when there is uncertainty.
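As a worked illustration, the variance and standard deviation can be computed step by step for the cost data used in the mean example. This is a sketch under one assumption worth flagging: the population version divides by n, while the sample version divides by n − 1 (the convention Excel uses for "Sample Variance").

```python
from math import sqrt
from statistics import pvariance, variance

# Cost-per-unit data from the mean example, in dollars.
costs = [20.6, 40.3, 15.8, 23.7]
n = len(costs)
mean_cost = sum(costs) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean_cost) ** 2 for x in costs]

# Population variance divides by n; sample variance divides by n - 1.
population_variance = sum(squared_deviations) / n
sample_variance = sum(squared_deviations) / (n - 1)

# Standard deviation is the square root of the variance.
population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

# Cross-check against the standard library.
assert abs(population_variance - pvariance(costs)) < 1e-9
assert abs(sample_variance - variance(costs)) < 1e-9

print(f"Population variance: {population_variance:.3f}, SD: {population_sd:.3f}")
print(f"Sample variance:     {sample_variance:.3f}, SD: {sample_sd:.3f}")
```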
3. Coefficient of Variation (V): compares the standard deviation to the mean.
◦ Used often by managers because the value is unaffected by the size or the unit of measure (such as thousands of dollars vs. millions of dollars).
◦ For example: a manager is comparing two projects, one that costs thousands of dollars and one that costs millions of dollars, and projecting profits for each. Comparing their standard deviations directly is not an apples-to-apples comparison. The manager needs a measure that isn’t affected by the measurement unit, and the Coefficient of Variation is such a measure.
◦ $V = \sigma / \mu$ (population) or $v = s / \bar{x}$ (sample)
◦ The numerator is a measure of risk; the denominator is a central tendency measure, the average outcome.
◦ Hence, in capital budgeting it is used to compare “risk-reward” ratios for different projects that differ widely in profitability or investment requirements.
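To see why the coefficient of variation is unaffected by the unit of measure, here is a minimal Python sketch. The profit projections are hypothetical numbers invented for illustration, and the function name is likewise an assumption.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return stdev(values) / mean(values)

# Hypothetical projected profits (invented for illustration).
small_project_thousands = [90, 100, 110, 120]   # in thousands of dollars
large_project_millions = [2.0, 3.5, 4.0, 6.5]   # in millions of dollars

cv_small = coefficient_of_variation(small_project_thousands)
cv_large = coefficient_of_variation(large_project_millions)

# Re-expressing the small project in plain dollars changes the mean and the
# standard deviation by the same factor, so the ratio does not change.
small_project_dollars = [v * 1000 for v in small_project_thousands]
cv_small_dollars = coefficient_of_variation(small_project_dollars)

print(f"CV, small project: {cv_small:.3f} (same in dollars: {cv_small_dollars:.3f})")
print(f"CV, large project: {cv_large:.3f}")
```

With these made-up figures the larger project carries more risk per dollar of expected profit, even though its standard deviation is stated in different units.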
Measure of Goodness of Fit
R² or “coefficient of determination”: measures how much of the variation in the dependent variable is explained by our independent variables.
◦ Higher numbers mean greater explanation and that deviations from the equation will be smaller.
◦ Coefficient of determination values are bounded between 0 and 1.
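One common way to compute the coefficient of determination is $R^2 = 1 - SS_{res}/SS_{tot}$, the share of total variation not left in the residuals. The slides do not give this formula, so treat the sketch below as an assumption about the definition being used; the data are hypothetical.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical observed values and the values a fitted equation predicts.
actual_sales = [2.0, 4.0, 6.0, 8.0]
fitted_sales = [2.5, 3.5, 6.5, 7.5]

fit = r_squared(actual_sales, fitted_sales)
print(f"R squared: {fit:.2f}")
```

Smaller deviations between actual and fitted values push the result toward 1; a perfect fit gives exactly 1.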
Variable Significance
t-statistics and p-values are commonly used to measure significance (whether an independent variable has a detectable influence on the dependent variable).
Excel provides both. However, p-values are more commonly used, so this is the measure we will use.
You define your research question: Is there a difference in blood pressure between those in group A (receiving a drug) and those in group B (receiving a sugar pill, no drug)?
The null hypothesis is usually a hypothesis of "no difference".
◦ For example: no difference between blood pressures in group A and group B.
◦ You then test this hypothesis with data, including the blood pressures of members of group A and group B.
The “p-value”, sometimes called the “calculated probability”, is the probability, calculated assuming the null hypothesis (H0) is true, of seeing a difference at least as large as the one in your data. A small p-value means a small estimated chance of rejecting H0 when it is in fact true.
◦ In the example: the probability of saying there is a difference in blood pressures (rejecting the null) when in fact there is not.
◦ Standard practice in the field defines “statistically significant” if _______________ (a smaller number such as 0.01 means greater significance)
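The logic of a p-value can be illustrated with a permutation test, which is not the t-test-based calculation Excel performs but follows the same idea: if the null hypothesis of "no difference" is true, the group labels are interchangeable, so we can ask how often a relabeling produces a difference at least as large as the observed one. The blood pressure readings below are hypothetical, invented for illustration.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    size_a, size_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / size_a - sum(group_b) / size_b)

    at_least_as_extreme = 0
    relabelings = 0
    # Under H0 the labels do not matter, so try every way of assigning
    # size_a of the pooled readings to group A.
    for indices in combinations(range(len(pooled)), size_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / size_a - (total - sum_a) / size_b)
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-9:
            at_least_as_extreme += 1
        relabelings += 1
    return at_least_as_extreme / relabelings

# Hypothetical systolic blood pressure readings (mmHg).
drug_group = [118, 121, 115, 119, 117]      # group A
placebo_group = [130, 128, 133, 127, 131]   # group B

p_value = permutation_p_value(drug_group, placebo_group)
print(f"p-value: {p_value:.4f}")
```

A small p-value here says that a gap this large would rarely arise from relabeling alone, which is the evidence used to reject the null hypothesis.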
Regression Analysis (OLS)
Regression Analysis: uses data to describe how variables are related to one another.
In markets, many variables change simultaneously, and regression analysis accounts for multiple changes.
Example: Q = f(P, Psub, ADV, m, POP, t)
Where:
Q = sales of brand name ice cream (dependent variable)
P = price of brand name ice cream
Psub = price of a substitute, competing brand
ADV = advertising dollars
m = income
POP = population
t = time (sales quarter, to show trends or seasonality)
The right-hand side variables are called “independent variables”.
Using data gathered on all variables, regression analysis allows us to see the relative importance of each independent variable (price, income, etc.) on the dependent variable, sales or quantity.
Sample Data
Year-Quarter  Unit Sales (Q)  Price ($)  Advertising Expenditures ($)  Competitors' Price ($)  Income ($)  Population  Time Variable
2003-1        193,334         6.39       15,827                        6.92                    33,337      4,116,250   1
2003-2        170,041         7.21       20,819                        4.84                    33,390      4,140,338   2
2003-3        247,709         5.75       14,062                        5.28                    33,599      4,218,965   3
2003-4        183,259         6.75       16,973                        6.17                    33,797      4,226,070   4
2004-1        282,118         6.36       18,815                        6.36                    33,879      4,278,912   5
2004-2        203,396         5.98       14,176                        4.88                    34,186      4,359,442   6
2004-3        167,447         6.64       17,030                        5.22                    35,691      4,363,494   7
2004-4        361,677         5.30       14,456                        5.80                    35,950      4,380,084   8
2003-1        401,805         6.08       27,183                        4.99                    34,983      9,184,926   1
2003-2        412,312         6.13       27,572                        6.13                    35,804      9,237,683   2
2003-3        321,972         7.24       34,367                        5.82                    35,898      9,254,182   3
2003-4        445,236         6.08       26,895                        6.05                    36,113      9,272,758   4
2004-1        479,713         6.40       30,539                        5.37                    36,252      9,300,401   5
2004-2        459,379         6.00       26,679                        4.86                    36,449      9,322,168   6
2004-3        444,040         5.96       26,607                        5.29                    37,327      9,323,331   7
2004-4        376,046         7.21       32,760                        4.89                    37,841      9,348,725   8
2003-1        255,203         6.55       19,880                        6.97                    34,870      5,294,645   1
2003-2        270,881         6.11       19,151                        6.25                    35,464      5,335,816   2
2003-3        330,271         5.62       15,743                        6.03                    35,972      5,386,134   3
2003-4        313,485         6.06       17,512                        5.08                    36,843      5,409,350   4
Excel: Summary Stats and Regression Analysis
Show in Excel how to create summary statistics (mean, median, mode, range, etc.)
Show in Excel how to run the regression:
◦ Copy data into Excel
◦ Under the Data tab use “Data Analysis”
◦ Select Regression from the drop-down list
◦ Select the y range of data (dependent variable Q; select only the data, not the title)
◦ Select the x range of data (all independent variable data)
◦ Click OK
◦ Results pop into another window showing coefficients for our variables
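For readers without Excel, the same kind of ordinary least squares fit can be sketched in plain Python by solving the normal equations (X'X)b = X'y. This is an illustrative sketch, not what Excel does internally; the function names and the small data set are hypothetical, and the data are generated exactly from y = 3 + 2·x1 − 1·x2 so the recovered coefficients are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot spot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def ols(x_rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations."""
    design = [[1.0] + list(row) for row in x_rows]  # add the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: two independent variables per observation.
x_rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3 + 2 * x1 - x2 for x1, x2 in x_rows]

coefficients = ols(x_rows, y)
print("Intercept and slopes:", [round(c, 6) for c in coefficients])
```

In practice a statistics package would be used instead, since it also reports standard errors, t-statistics and p-values.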
SUMMARY STATS (1ST 3 VARIABLES)
Statistic           Column1       Column2   Column3
Mean                391917.3125   6.237292  29203.64583
Standard Error      25371.20712   0.091812  1869.317184
Median              356929        6.12      26643
Mode                #N/A          7.02      #N/A
Standard Deviation  175776.8791   0.636091  12951.00935
Sample Variance     30897511243   0.404612  167728643.2
Kurtosis            -0.424039741  -0.8518   -0.38947906
Skewness            0.564081425   0.082874  0.83605057
Range               689728        2.38      46388
Minimum             75396         5.03      13896
Maximum             765124        7.41      60284
Sum                 18812031      299.39    1401775
Count               48            48        48
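Several rows of the summary output are linked by simple formulas, which gives a quick way to sanity-check the numbers. The sketch below uses the Column1 figures reported above; the relationships (mean = sum/count, standard error = standard deviation/√count, variance = standard deviation squared, range = maximum − minimum) are standard definitions rather than anything stated in the slides.

```python
from math import sqrt

# Column1 figures as reported in the summary statistics.
count = 48
total = 18812031
std_dev = 175776.8791
minimum, maximum = 75396, 765124

mean_value = total / count                 # Mean = Sum / Count
standard_error = std_dev / sqrt(count)     # Standard Error of the mean
sample_variance = std_dev ** 2             # Variance = SD squared
value_range = maximum - minimum            # Range = Maximum - Minimum

print(f"Mean:            {mean_value:.4f}")
print(f"Standard error:  {standard_error:.2f}")
print(f"Sample variance: {sample_variance:.0f}")
print(f"Range:           {value_range}")
```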
REGRESSION OUTPUT
Regression Statistics
Multiple R          0.946559307
R Square            0.895974522
Adjusted R Square   0.880751281
Standard Error      60699.98879
Observations        48
Variable              Coefficients  Standard Error  t Stat        P-value      Lower 95%     Upper 95%
Intercept             647041.9264   154303.7628     4.193299726   0.000143148  335419.159    958664.6939
X Variable 1 (P)      -127436.167   15106.87319     -8.435641532  1.68439E-10  -157945.1159  -96927.21787
X Variable 2 (ADV)    5.352343471   1.114830567     4.801037601   2.12221E-05  3.100897491   7.603789452
X Variable 3 (Pcomp)  29339.75679   12388.80657     2.368247224   0.022668872  4320.054607   54359.45896
X Variable 4 (m)      0.340280347   3.184070945     0.106869587   0.915413667  -6.090081309  6.770642002
X Variable 5 (POP)    0.023965899   0.002349065     10.20231336   8.12444E-13  0.019221865   0.028709932
X Variable 6 (t)      4407.716892   4401.822046     1.001339183   0.322536268  -4481.942977  13297.37676
Regression equation (using coefficients above)
Q = 647041 − 127436 P + 5.35 ADV + 29339 Pcomp + 0.3403 m + 0.02 POP + 4407 t
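Two relationships help in reading the regression output: each t Stat is the coefficient divided by its standard error, and Adjusted R Square penalizes R Square for the number of independent variables. The sketch below checks both against the reported figures. The adjusted R² formula, 1 − (1 − R²)(n − 1)/(n − k − 1), is the standard one and is an assumption about how Excel computed it.

```python
# (coefficient, standard error, reported t Stat) copied from the output above.
rows = {
    "Intercept":    (647041.9264, 154303.7628, 4.193299726),
    "X Variable 1": (-127436.167, 15106.87319, -8.435641532),
    "X Variable 2": (5.352343471, 1.114830567, 4.801037601),
    "X Variable 3": (29339.75679, 12388.80657, 2.368247224),
    "X Variable 4": (0.340280347, 3.184070945, 0.106869587),
    "X Variable 5": (0.023965899, 0.002349065, 10.20231336),
    "X Variable 6": (4407.716892, 4401.822046, 1.001339183),
}

# t Stat = coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se, _) in rows.items()}

# Adjusted R Square from R Square, n observations and k independent variables.
r_square = 0.895974522
n_obs = 48
k_vars = 6
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - k_vars - 1)

for name, t in t_stats.items():
    print(f"{name:<13} t = {t:.4f}")
print(f"Adjusted R Square = {adjusted_r_square:.6f}")
```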
Statistically significant variables
The low p-values mean that changes in price have a statistically significant impact on sales (the same holds for competitors’ price and advertising).
◦ Note each coefficient is ∆Q/∆variable
◦ Example: if the firm increased price by $1.00 then the estimated impact on sales is: ____________________________________
◦ If asked for a $0.50 change it would be: _______________________
Income has no statistically discernible effect in this model (p-value of about 0.92), so predictions about changes in income would result in zero impact on quantity.
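Because each coefficient is ∆Q/∆variable, the predicted change in quantity is the coefficient times the change in the variable, holding everything else constant. A minimal sketch using the price coefficient (X Variable 1) from the regression output; the function name is illustrative.

```python
# Price coefficient (X Variable 1) from the regression output.
price_coefficient = -127436.167

def predicted_change_in_quantity(coefficient, change_in_variable):
    """Change in Q = coefficient * change in the variable, all else equal."""
    return coefficient * change_in_variable

change_for_one_dollar = predicted_change_in_quantity(price_coefficient, 1.00)
change_for_fifty_cents = predicted_change_in_quantity(price_coefficient, 0.50)

print(f"$1.00 price increase: {change_for_one_dollar:,.0f} units")
print(f"$0.50 price increase: {change_for_fifty_cents:,.0f} units")
```

The negative sign means unit sales are predicted to fall when price rises, and halving the price change halves the predicted effect.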