Part 21: Statistical Inference 21-1/43 Statistics and Data Analysis Professor William Greene Stern...

Part 21: Statistical Inference

Statistics and Data Analysis

Professor William Greene

Stern School of Business

IOMS Department

Department of Economics

Statistics and Data Analysis

Part 21 – Statistical Inference: Confidence Intervals

Statistical Inference: Point Estimates and Confidence Intervals

Statistical Inference
Estimation Concept
Sampling Distribution
Point Estimates and the Law of Large Numbers
Uncertainty in Estimation
Interval Estimation

Application: Credit Modeling

1992 American Express analysis of:
  Application process: acceptance or rejection
  Cardholder behavior:
    Loan default
    Average monthly expenditure
    General credit usage/behavior

13,444 applications in November, 1992

Modeling Fair Isaacs’s Acceptance Rate

13,444 Applicants for a Credit Card (November, 1992)

Experiment = A randomly picked application.

Let X = 0 if Rejected

Let X = 1 if Accepted

[Figure: rejected vs. approved applications]

The Question They Are Really Interested In: Default

Of 10,499 people whose application was accepted, 996 (9.49%) defaulted on their credit account (loan). We let X denote the behavior of a credit card recipient.

X = 0 if no default
X = 1 if default
(X is a Bernoulli random variable.)

This is a crucial variable for a lender. They spend endless resources trying to learn more about it. Mortgage providers in 2000-2007 could have, but deliberately chose not to.

The data contained many covariates. Do these help explain the interesting variables?

Variables Typically Used By Credit Scorers

Sample Statistics

The population has characteristics: mean, variance, median, percentiles.

A random sample is a “slice” of the population

Populations and Samples

Population features of a random variable:
  Mean = μ = expected value of the random variable
  Standard deviation = σ = square root of the expected squared deviation of the random variable from the mean
  Percentiles such as the median = the value that divides the population in half, so that 50% of the population is below this value

Sample statistics that describe the data:
  Sample mean = x̄ = the average value in the sample
  Sample standard deviation = s tells us where the sample values will be (using our empirical rule, for example)
  Sample median helps to locate the sample data on a figure that displays the data, such as a histogram.

The Overriding Principle in Statistical Inference

The characteristics of a random sample will mimic (resemble) those of the population:
  Mean, median, standard deviation, etc.
  Histogram

The resemblance becomes closer as the number of observations in the (random) sample becomes larger. (The law of large numbers)

Point Estimation

We use sample features to estimate population characteristics.
  The mean of a sample from the population is an estimate of the mean of the population: x̄ is an estimator of μ
  The standard deviation of a sample from the population is an estimator of the standard deviation of the population: s is an estimator of σ

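Point Estimator

A formula used with the sample data to estimate a characteristic of the population (a parameter).

Provides a single value:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} \quad \text{a point estimator of } \mu$$

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}} \quad \text{a point estimator of } \sigma$$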
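The two point estimators above can be sketched in a few lines of Python. This is only an illustration: the function names and the eight 0/1 observations are made up for the example and do not come from the credit card data.

```python
import math

def sample_mean(xs):
    """x-bar: the sum of the observations divided by N."""
    return sum(xs) / len(xs)

def sample_std(xs):
    """s: square root of the sum of squared deviations divided by N - 1."""
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# Hypothetical sample of 8 applications (1 = accepted, 0 = rejected).
accepted = [1, 0, 1, 1, 0, 1, 1, 1]

xbar = sample_mean(accepted)   # point estimate of mu
s = sample_std(accepted)       # point estimate of sigma
print(f"x-bar = {xbar:.4f}, s = {s:.4f}")
```

Each function returns a single number, which is what makes these point estimators rather than interval estimators.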

Sampling Distribution

The random sample is itself random, since each member is random.

Statistics computed from random samples will vary as well.

Estimating Fair Isaacs’s Acceptance Rate

13,444 Applicants for a Credit Card (November, 1992)

Experiment = A randomly picked application.

Let X = 0 if Rejected

Let X = 1 if Accepted

The 13,444 observations are the population. The true proportion is μ = 0.780943. We draw samples of N from the 13,444 and use the observations to estimate μ.

[Figure: rejected vs. approved applications]

The Estimator

The sample proportion we are examining here is a sample mean.

X = 0 if the individual's application is rejected

X = 1 if the individual's application is accepted

The "acceptance rate" is $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.

The population proportion is μ = 0.780943.

x̄ is an estimator of μ, the population mean.

0.780943 is the true proportion in the population we are sampling from.

[Figure: x̄ in 100 samples with N = 144 in each sample]

The Mean is A Good Estimator

Sometimes x̄ is too high, sometimes too low. On average, it seems to be right.

The sample mean of the 100 sample estimates is 0.7844. The population mean (true proportion) is 0.7809.

What Makes it a Good Estimator?

The average of the averages will hit the true mean (on average)

The mean is UNBIASED (no moral connotations).

What Does the Law of Large Numbers Say?

The sampling variability in the estimator gets smaller as N gets larger.

If N gets large enough, we should hit the target exactly. The mean is CONSISTENT.

[Figure: distributions of the sample means for N = 144, N = 1024, and N = 4900, each shown on an axis running from .7 to .88]

Uncertainty in Estimation

How to quantify the variability in the proportion estimator

--------+---------------------------------------------------------------------
Variable|      Mean    Std.Dev.    Minimum    Maximum    Cases  Missing
--------+---------------------------------------------------------------------
Average of the means of the 100 samples of 144 observations
RATES144|    .78444      .03278    .715278    .868056      100        0
Average of the means of the 100 samples of 1024 observations
RATE1024|    .78366      .01293    .754883    .812500      100        0
Average of the means of the 100 samples of 4900 observations
RATE4900|    .78079      .00461    .770000    .792449      100        0
--------+---------------------------------------------------------------------

The population mean (true proportion) is 0.7809.
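The experiment behind this table can be imitated with a short simulation. The sketch below is not the code used to produce the slides. It rebuilds a 0/1 population from the counts quoted in the slides (13,444 applications, 10,499 accepted) and assumes each sample is drawn without replacement. The function name, the seed, and the use of Python's standard library are choices made for this illustration, so the numbers it prints will differ from the table.

```python
import random
import statistics

POP_SIZE = 13_444   # applications in the population
ACCEPTED = 10_499   # accepted applications; 10,499 / 13,444 = 0.780943
population = [1] * ACCEPTED + [0] * (POP_SIZE - ACCEPTED)

def simulate_sample_means(n, reps=100, seed=0):
    """Draw `reps` samples of size n (without replacement, an assumption)
    and return the acceptance rate of each sample."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]

results = {}
for n in (144, 1024, 4900):
    means = simulate_sample_means(n)
    # Average and standard deviation of the 100 sample means.
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"N={n:5d}  mean of means={results[n][0]:.5f}  "
          f"std. dev. of means={results[n][1]:.5f}")
```

The average of the sample means should stay near 0.7809 for every N, while their standard deviation should shrink as N grows, which is the pattern the table reports.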

Range of Uncertainty

The point estimate will be off (high or low).
Quantify uncertainty in ± sampling error.
Look ahead: If I draw a sample of 100, what value(s) should I expect?
  Based on unbiasedness, I should expect the mean to hit the true value.
  Based on my empirical rule, the value should be within plus or minus 2 standard deviations 95% of the time.
What should I use for the standard deviation?

Estimating the Variance of the Distribution of Means

We will have only one sample!

Use what we know about the variance of the mean: Var[mean] = σ²/N

Estimate σ² using the data:

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}$$

Then, divide s² by N.

The Sampling Distribution

For sampling from the population and using the sample mean to estimate the population mean:
  Expected value of x̄ will equal μ
  Standard deviation of x̄ will equal σ/√N
  CLT suggests a normal distribution

The sample mean for a given sample may be very close to the true mean

The sample mean for a given sample may be quite far from the true mean

This is the sampling variability of the mean as an estimator of μ

Recognizing Sampling Variability

To describe the distribution of sample means, use the sample mean, x̄, to estimate the population expected value.

To describe the variability, use the sample standard deviation, s, divided by the square root of N.

To accommodate the distribution, use the empirical rule: 95%, 2 standard deviations.

Estimating the Sampling Variability

For one of the samples, the mean was 0.849, s was 0.358, s/√N = .0298. If this were my estimate, I would use 0.849 ± 2 × 0.0298.

For a different sample, the mean was 0.750, s was 0.433, s/√N = .0361. If this were my estimate, I would use 0.750 ± 2 × 0.0361.
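The arithmetic for these two samples can be checked with a small sketch. It assumes N = 144, which is the sample size implied by the quoted values of s/√N. The function name is invented for this illustration.

```python
import math

TRUE_PROPORTION = 0.780943   # population acceptance rate from the slides
N = 144                      # assumed: implied by s / sqrt(N) = .0298 and .0361

def two_se_interval(mean, s, n, k=2.0):
    """Return (lower, upper) for mean +/- k standard errors, where SE = s / sqrt(n)."""
    se = s / math.sqrt(n)
    return mean - k * se, mean + k * se

for mean, s in [(0.849, 0.358), (0.750, 0.433)]:
    lower, upper = two_se_interval(mean, s, N)
    covers = lower <= TRUE_PROPORTION <= upper
    print(f"mean={mean:.3f}  interval=({lower:.4f}, {upper:.4f})  "
          f"contains 0.7809: {covers}")
```

Under these inputs the interval around 0.849 lies entirely above 0.7809, while the interval around 0.750 contains it. A given interval may or may not cover the true value.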

Estimates plus and minus two standard errors

Will the Interval Contain the True Value?

Uncertain: The midpoint is random; it may be very high or low, in which case, no. Sometimes it will contain the true value.

The degree of certainty depends on the width of the interval.
  Very narrow interval: very uncertain (1 standard error)
  Wide interval: much more certain (2 standard errors)
  Extremely wide interval: nearly perfectly certain (2.5 standard errors)
  Infinitely wide interval: absolutely certain

The Degree of Certainty

The interval is a “Confidence Interval.”
The degree of certainty is the degree of confidence.
The standard in statistics is 95% certainty (about two standard errors).

67% and 95% Confidence Intervals

Monthly Spending Over First 12 Months

Population = 10,239 individuals who
  (1) Received the card
  (2) Used the card at least once
  (3) Monthly spending no more than 2500.

What is the true mean of the population that produced these data?

Estimating the Mean

Given a sample: N = 225 observations, x̄ = 241.242, s = 276.894

Estimate the population mean:
  Point estimate: 241.242
  66⅔% confidence interval: 241.242 ± 1 × 276.894/√225 = 222.78 to 259.70
  95% confidence interval: 241.242 ± 2 × 276.894/√225 = 204.32 to 278.16
  99% confidence interval: 241.242 ± 2.5 × 276.894/√225 = 195.09 to 287.39
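The three intervals for mean monthly spending follow from one standard error, 276.894/√225 = 18.4596, scaled by the empirical-rule multipliers 1, 2 and 2.5. A minimal sketch of that arithmetic, with variable names chosen for this illustration:

```python
import math

n, xbar, s = 225, 241.242, 276.894   # sample size, sample mean, sample std. dev.
se = s / math.sqrt(n)                # standard error of the mean: 276.894 / 15

intervals = {}
for label, k in [("66 2/3%", 1.0), ("95%", 2.0), ("99%", 2.5)]:
    intervals[label] = (xbar - k * se, xbar + k * se)
    lower, upper = intervals[label]
    print(f"{label:>8} interval: {lower:.2f} to {upper:.2f}")
```

With these inputs, the one-standard-error interval has a lower limit of 241.242 − 18.46 = 222.78.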

Where Did the Interval Widths Come From?

Empirical rule of thumb:
  2/3 = 66 2/3% is contained in an interval that is the mean plus and minus 1 standard deviation
  95% is contained in a 2 standard deviation interval
  99% is contained in a 2.5 standard deviation interval

Based exactly on the normal distribution, the exact values would be
  0.9675 standard deviations for 2/3 (rather than 1.00)
  1.9600 standard deviations for 95% (rather than 2.00)
  2.5760 standard deviations for 99% (rather than 2.50)
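The normal-distribution multipliers can be recomputed with Python's standard library, which provides `statistics.NormalDist` from version 3.8. The helper names below are made up for this sketch; it shows the calculation only and is not the source of the numbers on the slide.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

def coverage(k):
    """Probability that a normal value falls within k standard deviations of its mean."""
    return z.cdf(k) - z.cdf(-k)

def multiplier(level):
    """Number of standard deviations needed for a central interval with the given coverage."""
    return z.inv_cdf(0.5 + level / 2)

for level in (2 / 3, 0.95, 0.99):
    print(f"coverage {level:.4f} -> {multiplier(level):.4f} standard deviations")

for k in (1.0, 2.0, 2.5):
    print(f"{k} standard deviations -> coverage {coverage(k):.4f}")
```

The second loop runs the rule of thumb in the other direction, showing how much probability the rounded multipliers 1, 2 and 2.5 cover under an exact normal distribution.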

Large Samples

If the sample is moderately large (over 30), one can use the normal distribution values instead of the empirical rule.

The empirical rule is easier to remember. The values will be very close to each other.

Refinements (Important)

When you have a fairly small sample (under 30) and you have to estimate σ using s, then both the empirical rule and the normal distribution can be a bit misleading. The interval you are using is a bit too narrow.

You will find the appropriate widths for your interval in the “t table.” The values depend on the sample size. (More specifically, on N-1 = the degrees of freedom.)

Critical Values

For 95% and 99% using a sample of 15:
  Normal: 1.960 and 2.576
  Empirical rule: 2.000 and 2.500
  t[14] table: 2.145 and 2.977

Note that the interval based on t is noticeably wider.

The values from “t” converge to the normal values (from above) as N increases.

What should you do in practice? Unless the sample is quite small, you can usually rely safely on the empirical rule. If the sample is very small, use the t distribution.

[Table: critical values of the t distribution by degrees of freedom, n = N-1, with the small-sample and large-sample regions marked]

Application

A sports training center is examining the endurance of athletes. A sample of 17 observations on the number of hours for a specific task produces the following sample: 4.86, 6.21, 5.29, 4.11, 6.19, 3.58, 4.38, 4.70, 4.66, 5.64, 3.77, 2.11, 4.81, 3.31, 6.27, 5.02, 6.12. This being a biological measurement, we are confident that the underlying population is normal.

Form a 95% confidence interval for the mean of the distribution.
  The sample mean is 4.766.
  The sample standard deviation, s, is 1.160.
  The standard error of the mean is 1.16/√17 = 0.281.
  Since this is a small sample from the normal distribution, we use the critical value from the t distribution with N-1 = 16 degrees of freedom. From the t table (previous page), the value of t[.025,16] is 2.120.

The confidence interval is 4.766 ± 2.120(0.281) = [4.170, 5.362]
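The steps of this application can be retraced from the 17 observations. In the sketch below the critical value 2.120 is copied from the slide rather than computed, and the variable names are chosen for this illustration. Carrying full precision gives s ≈ 1.159, which agrees with the slide's 1.160 up to rounding.

```python
import math
import statistics

# The 17 endurance times (hours) listed in the slide.
hours = [4.86, 6.21, 5.29, 4.11, 6.19, 3.58, 4.38, 4.70, 4.66,
         5.64, 3.77, 2.11, 4.81, 3.31, 6.27, 5.02, 6.12]

n = len(hours)
xbar = statistics.fmean(hours)    # sample mean
s = statistics.stdev(hours)       # sample standard deviation (divides by N - 1)
se = s / math.sqrt(n)             # standard error of the mean

T_CRIT = 2.120                    # t[.025, 16], taken from the slide's t table
lower, upper = xbar - T_CRIT * se, xbar + T_CRIT * se

print(f"N={n}  mean={xbar:.3f}  s={s:.3f}  SE={se:.3f}")
print(f"95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```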

Confidence Interval for Regression Coefficient

Coefficient on OwnRent:
  Estimate = +0.040923
  Standard error = 0.007141
  Confidence interval: 0.040923 ± 1.96 × 0.007141 (large sample) = 0.040923 ± 0.013996 = 0.02693 to 0.05492

Form a confidence interval for the coefficient on SelfEmpl. (Left for the reader)

Summary

Methodology: Statistical Inference
Application to credit scoring
Sample statistics as estimators
Point estimation
  Sampling variability
  The law of large numbers
  Unbiasedness and consistency
Sampling distributions
Confidence intervals
  Proportion
  Mean
  Regression coefficient
  Using the normal and t distributions instead of the empirical rule for the width of the interval