1 BA 555 Practical Business Analysis Housekeeping Review of Statistics Exploring Data Sampling...

1

BA 555 Practical Business Analysis

Agenda

• Housekeeping
• Review of Statistics
  • Exploring Data
  • Sampling Distribution of a Statistic
  • Confidence Interval Estimation
  • Hypothesis Testing

2

Definition

“Statistics” is the science of data.

It involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information.

We will learn how to make

based on data

3

Fundamental Elements of Statistics

A population is a set of units (usually people, objects, transactions, or events) that we are interested in studying. It is the totality of items or things under consideration.

A sample is a subset of the units of a population. It is the portion of the population that is selected for analysis.

A parameter is a numerical descriptive measure of a population. It is a summary measure that is computed to describe a characteristic of an entire population.

A statistic is a numerical descriptive measure of a sample. It is a summary measure calculated from the observations in the sample.

4

Example

A manufacturer of computer chips claims that less than 10% of its products are defective. When 1,000 chips were drawn from a large production run, 7.5% were found to be defective.

What is the population of interest?

What is the sample?

What is the parameter?

What is the statistic?

Does the value 10% refer to the parameter or to the statistic?

Is the value 7.5% a parameter or a statistic?

5

Statistical Analysis (p.3)

[Diagram: the statistical analysis cycle, from population to sample and back]

POPULATION
  Parameters: $\mu$, $\sigma^2$, $\sigma$, $p$, etc.

Selecting a random sample: X1, X2, …, Xn

Describing uncertainty: random variables, probability, distributions
  Discrete: binomial distribution
  Continuous: normal distribution, sampling distribution of the sample mean

SAMPLE of size n: x1, x2, …, xn
  Statistics: $\bar{x}$, $s^2$, $s$, $\hat{p}$, etc.

Organizing data: qualitative, quantitative

Drawing conclusions from data: estimation, hypothesis testing, regression analysis, contingency tables

6

Types of Data (p.2)

Numerical (Quantitative) Data
• Regular numerical observations. Arithmetic calculations are meaningful.
• Examples: age, household income, starting salary.

Categorical (Qualitative) Data
• Values are the (arbitrary) names of possible categories.
• Examples: gender (Female = 1 vs. Male = 0), college major.

7

Employee Database (class website, EmployeeDB.sf3)

Quantitative Qualitative

8

Describing Qualitative Data (p.4)

Qualitative Data (Categorical Data), e.g. gender, college major, etc.
  Graphical methods: pie chart, bar chart, line graph
  Numerical methods: frequency tables

Quantitative Data (Numerical Data), e.g. age, income, GMAT scores, etc.
  Graphical methods:
  • Display one variable: histogram, stem-and-leaf display, dot plot
  • Display two variables: scatter plot
  • Display one variable over time: time series plot
  Numerical methods: measures of location, relative standing, association, and spread (listed below)

Measures of Location:
• Mean:
  sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
  population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
• Median: Arrange the observations in ascending order. If n is odd, median = the middle number. If n is even, median = the simple average of the middle two observations.
• Mode: The measurement that occurs most frequently in the data set. It might not be unique, or might not even exist.
• Mid-range: $\frac{\text{minimum} + \text{maximum}}{2}$. It is sensitive to extreme observations.
• Truncated mean: A fixed percentage of the largest and smallest observations are deleted and the mean of the remaining data is calculated.

Measures of Relative Standing:
• Percentiles: The pth percentile is a number such that p% of the n observations fall below it and (100 - p)% fall above it.
• Quartiles:
  Q1 = QL = the lower quartile = 25th percentile
  Q2 = Median = 50th percentile
  Q3 = QU = the upper quartile = 75th percentile
• Z-score: $z = \frac{\text{obs} - \text{mean}}{\text{std}}$. The Z-score tells you how far, in standard deviations, the observation is above or below the mean (the center of a data set).

Measure of Association:
• Correlation: $-1 \le r \le 1$, where $r$ is the sample correlation and $\rho$ is the population correlation. Correlation describes the strength of the linear relationship between two variables.

Measures of Spread/Variability:
• Variance:
  sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
  population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$
• Standard deviation:
  sample std: $s = \sqrt{s^2}$
  population std: $\sigma = \sqrt{\sigma^2}$
• Range = the largest - the smallest
• Interquartile range = IQR = Q3 - Q1

9

Summarizing Qualitative Data

[Figure: Barchart for Gender; axis: frequency (0 to 50); bars: F, M]

[Figure: Piechart for Gender; F: 39.44%, M: 60.56%]

Frequency Table for Gender
--------------------------------------------------------------
                          Relative   Cumulative  Cum. Rel.
Class  Value  Frequency   Frequency  Frequency   Frequency
--------------------------------------------------------------
  1      F       28        0.3944       28        0.3944
  2      M       43        0.6056       71        1.0000
--------------------------------------------------------------
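As a rough illustration of how the columns of the frequency table relate to one another, here is a minimal Python sketch. The individual records of EmployeeDB.sf3 are not shown in the slides, so the list `genders` is rebuilt from the two counts (28 and 43), and the helper name `frequency_table` is hypothetical.

```python
from collections import Counter

# Assumption: the raw records are not available, so the data are rebuilt
# from the counts shown on the slide (28 F, 43 M).
genders = ["F"] * 28 + ["M"] * 43


def frequency_table(values):
    """Return rows of (value, freq, relative freq, cumulative freq, cum. rel. freq)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq
        rows.append((value, freq, freq / total, cumulative, cumulative / total))
    return rows


table = frequency_table(genders)
for value, freq, rel, cum, cum_rel in table:
    print(f"{value}  {freq:3d}  {rel:.4f}  {cum:3d}  {cum_rel:.4f}")
```

The relative frequencies are the same numbers as the pie chart percentages, expressed as proportions.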

10

Describing Quantitative Data (p.4): Graphical Methods

11

Quantitative Data: Histogram

[Figure: Histogram for SALARY; x-axis: Salary (in $000), 23 to 63; y-axis: frequency, 0 to 24]

[Figure: Histogram for AGE; x-axis: Age, 30 to 65; y-axis: frequency, 0 to 20]

Histogram applet

12

Describing Quantitative Data (p.4): Descriptive/Summary Statistics

13

Guessing Correlations

-0.99 -0.29 0.54 0.95

14

Correlation: Be Careful

Correlation value Scatter plot

?

Correlation
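The scatter plots from this slide did not survive extraction, so the following is only an illustrative sketch of one common reason for caution: correlation measures the strength of a linear relationship, and a perfectly predictable but curved relationship can still have a correlation of zero. The data and the helper `pearson_r` are made up for this example and are not from the course materials.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one exactly linear relationship, one exactly quadratic.
xs = [-2, -1, 0, 1, 2]
ys_linear = [2 * x + 1 for x in xs]
ys_curved = [x ** 2 for x in xs]

r_linear = pearson_r(xs, ys_linear)   # perfect linear relationship
r_curved = pearson_r(xs, ys_curved)   # perfect, but non-linear, relationship

print(f"linear: r = {r_linear:.2f}")
print(f"curved: r = {r_curved:.2f}")
```

In the second case y is completely determined by x, yet r is zero, which is why a scatter plot should accompany any correlation value.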

15

Example

Given the data below (in ascending order), complete the following summary statistics table:

10.0, 10.5, 12.2, 13.9, 13.9, 14.1, 14.7, 14.7, 15.1, 15.3, 15.9, 17.7, 18.5

Variable: X
Count                 13
Average               14.3462
Median                14.7
Variance              5.94936
Standard deviation    2.43913
Minimum               10.0
Maximum               18.5
Range                 8.5
Lower quartile        13.9
Upper quartile        15.3
Interquartile range   1.4
Sum                   186.5

[Figure: Box-and-Whisker Plot for Variable X; axis from 10 to 20]

Upper invisible line: 17.4
Lower invisible line: 11.8
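The table can be checked with a short Python sketch using only the standard library. Two points are assumptions rather than statements from the slides: the quartile convention (`method="inclusive"` happens to reproduce the slide's 13.9 and 15.3; other conventions can differ slightly), and the reading of the two "invisible lines" as the box-plot fences Q1 - 1.5·IQR and Q3 + 1.5·IQR, which is consistent with the values 11.8 and 17.4.

```python
import statistics

data = [10.0, 10.5, 12.2, 13.9, 13.9, 14.1, 14.7,
        14.7, 15.1, 15.3, 15.9, 17.7, 18.5]

count = len(data)
total = sum(data)
average = statistics.mean(data)
median = statistics.median(data)
variance = statistics.variance(data)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)       # sample standard deviation
data_range = max(data) - min(data)

# Assumption: the "inclusive" quartile method matches the software output
# shown on the slide.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumption: the "invisible lines" are the usual 1.5 * IQR box-plot fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"mean={average:.4f} median={median} var={variance:.5f} sd={std_dev:.5f}")
print(f"Q1={q1:.1f} Q3={q3:.1f} IQR={iqr:.1f} fences=({lower_fence:.1f}, {upper_fence:.1f})")
```

Observations outside the fences would be flagged as potential outliers; here 10.0, 10.5, 17.7, and 18.5 fall outside the interval from 11.8 to 17.4.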

16

Statgraphics Plus (SG+) Demo (p.1)

Questions to ask when describing and summarizing data:

• Where is the approximate center of the distribution?
• Are the observations close to one another, or are they widely dispersed?
• Is the distribution unimodal, bimodal, or multimodal? If there is more than one mode, where are the peaks, and where are the valleys?
• Is the distribution symmetric? If not, is it skewed? If symmetric, is it bell-shaped?

17

The Empirical Rule (p.5)

1. Approximately 68% of the observations will fall within 1 standard deviation of the mean.
2. Approximately 95% of the observations will fall within 2 standard deviations of the mean.
3. Approximately 99.7% of the observations will fall within 3 standard deviations of the mean.

[Figure: bell curve with the horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$. From left to right, the eight regions contain 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, and 0.15% of the observations; brackets mark the central 68%, 95%, and 99.7%.]
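The three percentages can be reproduced from the normal distribution. The sketch below assumes an exactly normal (bell-shaped) distribution, which is the idealized case behind the rule; real data will only match approximately. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Assumption: an exactly normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

The exact normal values are slightly different from the rounded 68%, 95%, and 99.7% quoted in the rule, which is why the rule is stated with "approximately".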

18

Example

The average salary for employees with similar background/skills/etc. is about $120,000.

Your salary is $122,000. Is it a big deal? Why or why not? What additional information is required to answer this question?

19

What to do next?

• Generalize the results from the empirical rule.
• Justify the use of the mound-shaped distribution.

[Figure: the empirical-rule bell curve, with 68%, 95%, and 99.7% of the observations within 1, 2, and 3 standard deviations of the mean]

20

Example: Warranty Level

[Figure: bell curve; x-axis: 0 to 60; y-axis: 0 to 0.04]

Mean = 30,000 miles
STD = 5,000 miles

Q1: If the level of warranty is set at 15,000 miles, about what % of tires will be returned under warranty?
Q2: If we can accept that up to 2.5% of tires can be returned under warranty, what should be the new warranty level?

21

Example: Warranty Level

[Figure: bell curve; x-axis: 0 to 60; y-axis: 0 to 0.04]

Mean = 30,000 miles
STD = 5,000 miles

Q1: If the level of warranty is set at 12,000 miles, about what % of tires will be returned under the warranty?
Q2: If we can accept that up to 3.0% of tires can be returned under warranty, what should be the warranty level?
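One way to approach both versions of the warranty example is sketched below. It assumes that tire life is exactly normally distributed with the stated mean and standard deviation, and that a tire is "returned" when it fails before the warranty mileage; the function names are hypothetical. The 15,000-mile and 2.5% cases can also be read off the empirical rule (3 and 2 standard deviations below the mean), whereas the 12,000-mile and 3.0% cases need normal probabilities.

```python
from statistics import NormalDist

# Assumption: tire life follows a normal distribution with these parameters.
tire_life = NormalDist(mu=30_000, sigma=5_000)


def share_returned(warranty_miles):
    """Proportion of tires expected to fail before the warranty mileage."""
    return tire_life.cdf(warranty_miles)


def warranty_for_share(max_share):
    """Warranty mileage at which the expected share of returns equals max_share."""
    return tire_life.inv_cdf(max_share)


share_at_15k = share_returned(15_000)       # z = -3
share_at_12k = share_returned(12_000)       # z = -3.6
level_for_2_5 = warranty_for_share(0.025)   # roughly 2 standard deviations below the mean
level_for_3 = warranty_for_share(0.03)

print(f"returned at 15,000 miles: {share_at_15k:.3%}")
print(f"returned at 12,000 miles: {share_at_12k:.4%}")
print(f"warranty level for 2.5% returns: {level_for_2_5:,.0f} miles")
print(f"warranty level for 3.0% returns: {level_for_3:,.0f} miles")
```

The exact normal answers differ a little from the empirical-rule shortcuts (for example, about 0.135% rather than 0.15% at 15,000 miles), because the rule uses rounded percentages.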

22

Normal Probabilities

z    .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
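The entries of this table run from .0000 at z = 0.0 to .4987 at z = 3.0, which is consistent with each entry being the area under the standard normal curve between 0 and z. Under that reading, a table entry can be reproduced as in the sketch below; the function names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def table_entry(z):
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return standard_normal.cdf(z) - 0.5


def lower_tail(z):
    """P(Z < -z) for z >= 0, obtained from a table entry by symmetry."""
    return 0.5 - table_entry(z)


# Row 1.9, column .06 of the table corresponds to z = 1.96.
print(f"entry for z = 1.96: {table_entry(1.96):.4f}")
print(f"entry for z = 1.00: {table_entry(1.00):.4f}")
print(f"P(Z < -1.88): {lower_tail(1.88):.4f}")
```

Because the table only covers areas to the right of zero, left-tail and two-sided probabilities have to be assembled from table entries using the symmetry of the curve, as `lower_tail` does.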

23

Sampling Distribution (p.6)

The sampling distribution of a statistic is the probability distribution for all possible values of the statistic that results when random samples of size n are repeatedly drawn from the population.

When the sample size is large, what is the sampling distribution of the sample mean / the sample proportion / the difference of two sample means / the difference of two sample proportions? Approximately NORMAL!

24

Central Limit Theorem (CLT) (p.6)

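If $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu_{\bar{X}} = \mu,\ \sigma^2_{\bar{X}} = \frac{\sigma^2}{n}\right)$.

Sample: X1, X2, …, Xn → $\bar{X}$

$P(a < \bar{X} < b) = ?$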

25

Central Limit Theorem (CLT) (p.6)

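Sample: X1, X2, …, Xn → $\bar{X}$

$P(a < \bar{X} < b) = ?$

If $X \sim$ any distribution with mean $\mu$ and variance $\sigma^2$, then $\bar{X} \sim N\left(\mu_{\bar{X}} = \mu,\ \sigma^2_{\bar{X}} = \frac{\sigma^2}{n}\right)$ approximately, for large n.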
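A small simulation can make the theorem concrete. The sketch below is illustrative only: the choice of a skewed exponential population (mean 1, standard deviation 1), the sample size n = 50, the 2,000 replications, and the seed are all assumptions made for this example, not part of the course materials.

```python
import math
import random
import statistics

rng = random.Random(555)  # fixed seed so the run is repeatable

# Assumption: a skewed population, exponential with rate 1,
# which has mean 1 and standard deviation 1.
POP_MEAN = 1.0
POP_STD = 1.0
n = 50               # size of each sample
replications = 2000  # number of samples drawn

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.mean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(replications)
]

mean_of_means = statistics.mean(sample_means)
std_of_means = statistics.stdev(sample_means)
theoretical_se = POP_STD / math.sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"std of sample means:  {std_of_means:.3f} (sigma / sqrt(n) = {theoretical_se:.3f})")
```

Even though the population is far from bell-shaped, the sample means cluster around the population mean with a spread close to $\sigma/\sqrt{n}$; a histogram of `sample_means` would look roughly normal.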

26

Standard Deviations

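Population standard deviation: $\sigma_X$ or simply $\sigma$.

Sample standard deviation: $s_X$ or simply $s$.

Standard deviation of sample means (aka standard error): $\sigma_{\bar{X}}$

Standard deviation of sample proportions (aka standard error): $\sigma_{\hat{p}}$

Relationships:

o $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$, or its estimate $\hat{\sigma}_{\bar{X}} = s_{\bar{X}} = \frac{s_X}{\sqrt{n}}$

o $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$, or its estimate $\hat{\sigma}_{\hat{p}} = s_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$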

27

Statistical Inference: Estimation

Research Question: What is the parameter value?

Sample of size n

Population

Tools (i.e., formulas): Point Estimator, Interval Estimator

28

Confidence Interval Estimation (p.7)

29

Example

A random sample of 12 months of a company’s operating expenses produced a sample mean of $5,474 and a standard deviation of $764. Construct a 95% confidence interval for the company’s mean monthly expenses.
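A possible way to work this example is sketched below. Because the sample is small (n = 12) and only the sample standard deviation is given, the sketch uses a t-based interval, which assumes monthly expenses are approximately normally distributed. The critical value 2.201 (t with 11 degrees of freedom, upper-tail area 0.025) is taken from a standard t table, not from the slides.

```python
import math

n = 12
sample_mean = 5474.0
sample_std = 764.0

# Assumption: t critical value for 95% confidence with n - 1 = 11 degrees
# of freedom, read from a standard t table.
t_crit = 2.201

standard_error = sample_std / math.sqrt(n)   # s / sqrt(n)
margin = t_crit * standard_error             # margin of error B
lower = sample_mean - margin
upper = sample_mean + margin

print(f"standard error: {standard_error:.2f}")
print(f"margin of error: {margin:.2f}")
print(f"95% CI: (${lower:,.2f}, ${upper:,.2f})")
```

The interval is the point estimate plus or minus a multiplier times the standard error, so a larger sample or a lower confidence level would narrow it.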

30

Statistical Inference: Hypothesis Testing

Research Question: Is the claim supported?

Sample of size n

Population

Tools (i.e., formulas): z or t statistic

31

Hypothesis Testing (p.9)

32

Example

A bank has set up a customer service goal that the mean waiting time for its customers will be less than 2 minutes. The bank randomly samples 30 customers and finds that the sample mean is 100 seconds. Assuming that the sample is from a normal distribution and the standard deviation is 28 seconds, can the bank safely conclude that the population mean waiting time is less than 2 minutes?
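The bank example can be sketched as a one-sided test of H0: μ = 120 seconds against Ha: μ < 120 seconds. Several choices below are assumptions rather than statements from the slides: the 28-second standard deviation is treated as the known population value (so a z statistic is used), and the significance level is set to 0.05.

```python
import math
from statistics import NormalDist

mu_0 = 120.0          # hypothesized mean: 2 minutes, in seconds
n = 30
sample_mean = 100.0   # seconds
sigma = 28.0          # assumption: treated as the known population std
alpha = 0.05          # assumption: significance level not given on the slide

standard_error = sigma / math.sqrt(n)
z_stat = (sample_mean - mu_0) / standard_error

# Lower-tail test: small values of the sample mean support Ha: mu < 120.
p_value = NormalDist().cdf(z_stat)
reject_null = p_value < alpha

print(f"z = {z_stat:.3f}, p-value = {p_value:.6f}, reject H0: {reject_null}")
```

If the 28 seconds is instead a sample standard deviation, the same statistic would be compared with a t distribution with 29 degrees of freedom (critical value about -1.699 at the 0.05 level), and the conclusion would be the same.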

33

Margin of Error (B)

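B: margin of error

(point estimator) ± (multiplier)(std of point estimator)

$\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$

$\bar{X} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$

$\hat{p} \pm z_{\alpha/2}\,\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

• What does B tell us about the point estimator?

• How do we reduce the value of B?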

34

Relations among B, n, and the Confidence Level

B (margin of error)
n (sample size)

Confidence Level (e.g., 90%, 95%)

How to reduce B?

35

Estimation in Practice

• Determine a confidence level (say, 95%).
• How good do you want the estimate to be? (Define the margin of error.)
• Use formulas (p.8) to find a sample size that satisfies the pre-determined confidence level and margin of error.

Parameter: $\mu$
Sample size needed: $n = \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2$ or $n = \left(\frac{z_{\alpha/2}\,s}{B}\right)^2$
1. Replace $\sigma$ or $s$ with the one from a previous study, or
2. Estimate it by range/4 or range/6.

Parameter: $p$
Sample size needed: $n = \frac{z_{\alpha/2}^2\,p(1-p)}{B^2}$
1. Use the p from a similar study or previous experiment.
2. Be conservative: use p = 0.5.

36

Accuracy Gained by Increasing the Sample Size (p.8)

A 95% Confidence Interval for p:

$n = \frac{(1.96)^2 \left(\frac{1}{2}\right)^2}{B^2}$

Margin of Error (B)   Sample Size (n)
7%                    196
6%                    266
5%                    384
4%                    600
3%                    1067
2%                    2401
1%                    9604
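The table can be reproduced from the conservative formula with p = 0.5. The sketch below uses exact fractions to avoid floating-point edge cases; the function name is hypothetical. Dropping the fractional part reproduces the table's values, while rounding up is the usual practice when the margin of error must be guaranteed, so both are shown.

```python
import math
from fractions import Fraction

Z_95 = Fraction(196, 100)  # z multiplier for 95% confidence


def sample_size_for_proportion(margin_pct, p=Fraction(1, 2)):
    """Exact n = z^2 * p * (1 - p) / B^2, with B given in percentage points."""
    b = Fraction(margin_pct, 100)
    return Z_95 ** 2 * p * (1 - p) / b ** 2


margins = (7, 6, 5, 4, 3, 2, 1)
exact = {b: sample_size_for_proportion(b) for b in margins}
truncated = {b: math.floor(n) for b, n in exact.items()}   # fractional part dropped
rounded_up = {b: math.ceil(n) for b, n in exact.items()}   # guarantees the margin

for b in margins:
    print(f"B = {b}%: exact n = {float(exact[b]):.2f}, "
          f"truncated = {truncated[b]}, rounded up = {rounded_up[b]}")
```

Halving the margin of error from 2% to 1% quadruples the required sample size (2401 to 9604), because n is proportional to 1/B².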
