1 BA 555 Practical Business Analysis Housekeeping Review of Statistics Exploring Data Sampling...
Date posted: 19-Dec-2015
1
BA 555 Practical Business Analysis
Agenda
• Housekeeping
• Review of Statistics
  o Exploring Data
  o Sampling Distribution of a Statistic
  o Confidence Interval Estimation
  o Hypothesis Testing
2
Definition
“Statistics” is the science of data.
It involves collecting, classifying, summarizing, organizing, analyzing,
and interpreting numerical information.
We will learn how to make
based on data
3
Fundamental Elements of Statistics
A population is a set of units (usually people, objects, transactions, or events) that we are interested in studying. It is the totality of items or things under consideration.
A sample is a subset of the units of a population. It is the portion of the population that is selected for analysis.
A parameter is a numerical descriptive measure of a population. It is a summary measure that is computed to describe a characteristic of an entire population.
A statistic is a numerical descriptive measure of a sample. It is a summary measure calculated from the observations in the sample.
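The four definitions above can be illustrated with a small simulation. This is only a sketch: the population size (100,000), the true proportion (8%), and the sample size (1,000) are hypothetical values chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 units, each flagged 1 (has the trait) or 0.
population = [1 if random.random() < 0.08 else 0 for _ in range(100_000)]

# Parameter: a summary of the whole population (usually unknown in practice).
p = sum(population) / len(population)

# Sample: a randomly selected subset of the population.
sample = random.sample(population, 1_000)

# Statistic: the same summary, computed from the sample only.
p_hat = sum(sample) / len(sample)

print(f"parameter p = {p:.4f}, statistic p_hat = {p_hat:.4f}")
```

The statistic changes from sample to sample, while the parameter stays fixed.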
4
Example
A manufacturer of computer chips claims that less than 10% of his products are defective. When 1,000 chips were drawn from a large production run, 7.5% were found to be defective.
What is the population of interest?
What is the sample?
What is the parameter?
What is the statistic?
Does the value 10% refer to the parameter or to the statistic?
Is the value 7.5% a parameter or a statistic?
5
Statistical Analysis (p.3)
[Diagram: the flow of a statistical analysis]
POPULATION. Parameters: $\mu$, $\sigma^2$, $\sigma$, $p$, etc.
Selecting a random sample: X1, X2, …, Xn
Describing uncertainty: random variables, probability, distributions
  Discrete: binomial distribution
  Continuous: normal distribution, sampling distribution of the sample mean
SAMPLE of size n: x1, x2, …, xn. Statistics: $\bar{x}$, $s^2$, $s$, $\hat{p}$, etc.
Organizing data: qualitative, quantitative
Drawing conclusions from data: estimation, hypothesis testing, regression analysis, contingency tables
6
Types of Data (p.2)
Numerical (Quantitative) Data: Regular numerical observations. Arithmetic calculations are meaningful.
• Age
• Household income
• Starting salary
Categorical (Qualitative) Data: Values are the (arbitrary) names of possible categories.
• Gender: Female = 1 vs. Male = 0
• College major
8
Describing Qualitative Data (p.4)
Qualitative Data (Categorical Data), e.g., gender, college major, etc.
• Graphical methods: pie chart, bar chart, line graph
• Numerical methods: frequency tables
Quantitative Data (Numerical Data), e.g., age, income, GMAT scores, etc.
• Graphical methods:
  o Display one variable: histogram, stem-and-leaf display, dot plot
  o Display two variables: scatter plot
  o Display one variable over time: time series plot
• Numerical methods:
Measures of Location:
  o Mean: sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$; population mean $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
  o Median: Arrange the observations in ascending order. If n is odd, median = the middle number. If n is even, median = the simple average of the middle two observations.
  o Mode: The measurement that occurs most frequently in the data set. It might not be unique, or might not even exist.
  o Mid-range = (minimum + maximum) / 2. It is sensitive to extreme observations.
  o Truncated mean: A fixed percentage of the largest and smallest observations are deleted and the mean of the remaining data is calculated.
Measures of Relative Standing:
  o Percentiles: The pth percentile is a number such that p% of the n observations fall below it and (100 − p)% fall above it.
  o Quartiles: Q1 = QL = the lower quartile = 25th percentile; Q2 = Median = 50th percentile; Q3 = QU = the upper quartile = 75th percentile
  o Z-score = (obs − mean) / std. The Z-score tells you how far the observation is above or below the mean (the center of a data set), measured in standard deviations.
Measure of Association:
  o Correlation: $-1 \le r \le 1$, where $r$ is the sample correlation and $\rho$ is the population correlation. Correlation describes the strength of the linear relationship between two variables.
Measures of Spread/Variability:
  o Variance: sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$; population variance $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$
  o Standard deviation: sample std $s = \sqrt{s^2}$; population std $\sigma = \sqrt{\sigma^2}$
  o Range = the largest − the smallest
  o Interquartile range = IQR = Q3 − Q1
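The numerical measures listed on this slide can be computed with Python's standard `statistics` module. The sketch below uses a small hypothetical data set. The quartile rule (`method="inclusive"`) is one of several conventions in use and is an assumption here, so other software may report slightly different quartiles.

```python
import statistics

# Hypothetical data, chosen only for illustration.
x = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(x)            # sample mean, sum(x) / n
median = statistics.median(x)        # middle value of the sorted data
modes = statistics.multimode(x)      # most frequent value(s); may hold several
mid_range = (min(x) + max(x)) / 2    # average of the two extremes

s2 = statistics.variance(x)          # sample variance, divides by n - 1
sigma2 = statistics.pvariance(x)     # population variance, divides by N
s = statistics.stdev(x)              # sample standard deviation, sqrt(s2)
data_range = max(x) - min(x)

q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

# Z-score of the largest observation: standard deviations above the mean.
z_max = (max(x) - mean) / s

print(mean, median, modes, mid_range, s2, round(sigma2, 3), round(s, 3), iqr)
```

Note that the sample variance (divisor n − 1) is always larger than the population variance formula applied to the same numbers (divisor N).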
9
Summarizing Qualitative Data
[Figure: Barchart for Gender. Horizontal bars for F and M; frequency axis from 0 to 50.]
[Figure: Piechart for Gender. F = 39.44%, M = 60.56%.]
Frequency Table for Gender
--------------------------------------------------------------
                           Relative   Cumulative  Cum. Rel.
Class  Value   Frequency   Frequency  Frequency   Frequency
--------------------------------------------------------------
  1      F        28        0.3944       28        0.3944
  2      M        43        0.6056       71        1.0000
--------------------------------------------------------------
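A frequency table like the one above can be rebuilt from raw observations in a few lines. This sketch recreates the raw data from the reported counts (28 F, 43 M); the variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

# Rebuild the raw observations from the counts in the table above (28 F, 43 M).
gender = ["F"] * 28 + ["M"] * 43

counts = Counter(gender)
n = len(gender)
values = sorted(counts)                 # ['F', 'M']
freqs = [counts[v] for v in values]
cum_freqs = list(accumulate(freqs))     # running totals

print(f"{'Value':<6}{'Freq':>6}{'Rel.':>9}{'Cum.':>6}{'Cum. Rel.':>11}")
for v, f, cf in zip(values, freqs, cum_freqs):
    print(f"{v:<6}{f:>6}{f / n:>9.4f}{cf:>6}{cf / n:>11.4f}")
```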
11
Quantitative Data: Histogram
[Figure: Histogram for SALARY. Horizontal axis: Salary (in $000), 23 to 63; vertical axis: frequency, 0 to 24.]
[Figure: Histogram for AGE. Horizontal axis: Age, 30 to 65; vertical axis: frequency, 0 to 20.]
Histogram applet
15
Example
Given the data below, complete the following summary statistics table. (Data are in ascending order): 10.0, 10.5, 12.2, 13.9, 13.9, 14.1, 14.7, 14.7, 15.1, 15.3, 15.9, 17.7, 18.5
Variable              X
Count                 13
Average               14.3462
Median                14.7
Variance              5.94936
Standard deviation    2.43913
Minimum               10.0
Maximum               18.5
Range                 8.5
Lower quartile        13.9
Upper quartile        15.3
Interquartile range   1.4
Sum                   186.5
[Figure: Box-and-Whisker Plot for Variable X; axis from 10 to 20.]
Upper invisible line: 17.4
Lower invisible line: 11.8
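The summary table for this example can be checked with Python's standard library. In this sketch the quartiles use the `inclusive` method, which is an assumption that happens to reproduce the quartiles shown here; other packages may use a different rule. The "invisible lines" are computed as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR.

```python
import statistics

data = [10.0, 10.5, 12.2, 13.9, 13.9, 14.1, 14.7,
        14.7, 15.1, 15.3, 15.9, 17.7, 18.5]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

summary = {
    "Count": len(data),
    "Average": statistics.mean(data),
    "Median": median,
    "Variance": statistics.variance(data),       # sample variance (n - 1)
    "Standard deviation": statistics.stdev(data),
    "Minimum": min(data),
    "Maximum": max(data),
    "Range": max(data) - min(data),
    "Lower quartile": q1,
    "Upper quartile": q3,
    "Interquartile range": iqr,
    "Sum": sum(data),
}

# Fences of the box plot: points beyond them are flagged as possible outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

for name, value in summary.items():
    print(f"{name:<20}{value:>10.4f}")
print(f"Fences: {lower_fence:.1f} to {upper_fence:.1f}")
```

Two observations (10.0 and 10.5) fall below the lower fence, and two (17.7 and 18.5) fall above the upper fence.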
16
Statgraphics Plus (SG+) Demo (p.1)
Questions to ask when describing and summarizing data:
• Where is the approximate center of the distribution?
• Are the observations close to one another, or are they widely dispersed?
• Is the distribution unimodal, bimodal, or multimodal? If there is more than one mode, where are the peaks, and where are the valleys?
• Is the distribution symmetric? If not, is it skewed? If symmetric, is it bell-shaped?
17
The Empirical Rule (p.5)
1. Approximately 68% of the observations will fall within 1 standard deviation of the mean.
2. Approximately 95% of the observations will fall within 2 standard deviations of the mean.
3. Approximately 99.7% of the observations will fall within 3 standard deviations of the mean.
[Figure: bell curve with the horizontal axis marked at −3, −2, −1, 0, 1, 2, 3 standard deviations from the mean ($\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$). Areas of the eight regions, left to right: 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15%. Brackets mark the central 68%, 95%, and 99.7%.]
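The three percentages in the empirical rule are rounded areas under a normal curve. A minimal check, assuming the data follow a normal distribution exactly, uses `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)   # P(-k < Z < k)
    print(f"within {k} standard deviation(s): {within:.4f}")
```

The exact areas are about 68.3%, 95.4%, and 99.7%, which the rule rounds to 68%, 95%, and 99.7%.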
18
Example
The average salary for employees with similar background/skills/etc. is about $120,000.
Your salary is $122,000. Is it a big deal? Why or why not? What additional information is required to answer this question?
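One way to see why the spread matters is to compute the Z-score of the $122,000 salary under different standard deviations. The two standard deviations below ($500 and $5,000) are hypothetical; the example itself does not supply one.

```python
def z_score(obs, mean, std):
    """Standard deviations by which obs lies above (+) or below (-) the mean."""
    return (obs - mean) / std

# The two standard deviations below are hypothetical; the example gives none.
for std in (500, 5_000):
    z = z_score(122_000, 120_000, std)
    print(f"std = ${std:,}: z = {z:.1f}")
```

With a standard deviation of $500 the salary sits 4 standard deviations above the mean, which is unusual under the empirical rule. With $5,000 it is only 0.4 standard deviations above, which is unremarkable.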
19
What to do next?
• Generalize the results from the empirical rule.
• Justify the use of the mound-shaped distribution.
20
Example: Warranty Level
[Figure: bell-shaped density curve; horizontal axis from 0 to 60, vertical axis from 0 to 0.04.]
Mean = 30,000 miles, STD = 5,000 miles
Q1: If the level of warranty is set at 15,000 miles, about what % of tires will be returned under warranty?
Q2: If we can accept that up to 2.5% of tires can be returned under warranty, what should be the new warranty level?
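Both questions can be approached with the empirical rule: 15,000 miles is 3 standard deviations below the mean, so about 0.15% of tires fall below it, and the lowest 2.5% lie more than 2 standard deviations below the mean, which gives a level of about 20,000 miles. The sketch below computes the same quantities from an exact normal model. Normality of tire life is an assumption, and the variable names are illustrative.

```python
from statistics import NormalDist

# Assumes tire life is normally distributed.
tire_life = NormalDist(mu=30_000, sigma=5_000)

# Q1: share of tires that fail before a 15,000-mile warranty expires.
share_returned = tire_life.cdf(15_000)

# Q2: mileage below which only 2.5% of tires fail.
warranty_level = tire_life.inv_cdf(0.025)

print(f"Q1: {share_returned:.2%} returned")
print(f"Q2: warranty level of about {warranty_level:,.0f} miles")
```

The exact answers (about 0.13% and about 20,200 miles) differ slightly from the empirical-rule answers because the rule rounds 1.96 standard deviations to 2.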
21
Example: Warranty Level
[Figure: bell-shaped density curve; horizontal axis from 0 to 60, vertical axis from 0 to 0.04.]
Mean = 30,000 miles, STD = 5,000 miles
Q1: If the level of warranty is set at 12,000 miles, about what % of tires will be returned under the warranty?
Q2: If we can accept that up to 3.0% of tires can be returned under warranty, what should be the warranty level?
22
Normal Probabilities
z    .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
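Each entry in this table is the area under the standard normal curve between 0 and z (for example, .3413 at z = 1.00). The sketch below reproduces a few entries and does a reverse lookup, using `statistics.NormalDist`; the helper name `table_area` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

def table_area(z):
    """Area under the standard normal curve between 0 and z, as tabulated."""
    return std_normal.cdf(z) - 0.5

print(f"{table_area(1.00):.4f}")   # row 1.0, column .00
print(f"{table_area(1.96):.4f}")   # row 1.9, column .06

# Reverse lookup: the z whose lower tail holds 3% of the area.
z_low = std_normal.inv_cdf(0.03)
print(f"{z_low:.2f}")
```

A lower tail of 3% leaves .4700 between z and 0, and the closest table entry is .4699 at z = 1.88. Applied to the warranty example with a mean of 30,000 and a standard deviation of 5,000 miles, that gives 30,000 − 1.88 × 5,000 = 20,600 miles.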
23
Sampling Distribution (p.6)
The sampling distribution of a statistic is the probability distribution for all possible values of the statistic that results when random samples of size n are repeatedly drawn from the population.
When the sample size is large, what is the sampling distribution of the sample mean / the sample proportion / the difference of two sample means / the difference of two sample proportions? Approximately NORMAL!!!
24
Central Limit Theorem (CLT) (p.6)
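If $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu_{\bar{X}} = \mu,\ \sigma_{\bar{X}}^2 = \sigma^2/n)$.
Sample: X1, X2, …, Xn, summarized by $\bar{X}$
$P(a < \bar{X} < b) = ?$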
25
Central Limit Theorem (CLT) (p.6)
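Sample: X1, X2, …, Xn, summarized by $\bar{X}$
$P(a < \bar{X} < b) = ?$
If X follows any distribution with mean $\mu$ and variance $\sigma^2$, then approximately $\bar{X} \sim N(\mu_{\bar{X}} = \mu,\ \sigma_{\bar{X}}^2 = \sigma^2/n)$ for large n.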
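The theorem can be illustrated by simulation. This sketch draws repeated samples from a strongly skewed population and checks that the sample means center on $\mu$ with a standard deviation close to $\sigma/\sqrt{n}$. The exponential population, its mean of 10, the sample size of 50, and the 5,000 repetitions are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)

# Assumed population: exponential with mean 10, which is strongly right-skewed.
# For an exponential distribution the standard deviation equals the mean.
mu = sigma = 10.0
n = 50            # size of each sample
reps = 5_000      # number of repeated samples

sample_means = [
    statistics.fmean([random.expovariate(1 / mu) for _ in range(n)])
    for _ in range(reps)
]

print(f"mean of sample means: {statistics.fmean(sample_means):.3f} (theory: {mu})")
print(f"std of sample means:  {statistics.stdev(sample_means):.3f} "
      f"(theory: {sigma / n ** 0.5:.3f})")
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is far from normal.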
26
Standard Deviations
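Population standard deviation: $\sigma_X$ or simply $\sigma$.
Sample standard deviation: $s_X$ or simply $s$.
Standard deviation of sample means (aka standard error): $\sigma_{\bar{X}}$
Standard deviation of sample proportions (aka standard error): $\sigma_{\hat{p}}$
Relationships:
o $\bar{X}$: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ or $s_{\bar{X}} = \frac{s}{\sqrt{n}}$
o $\hat{p}$: $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$ or $s_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$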
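The two estimated standard errors can be written as small helper functions. This is a sketch with illustrative function names; the inputs reuse figures from the monthly-expenses example (s = 764, n = 12) and the computer-chip example (7.5% defective out of 1,000).

```python
from math import sqrt

def se_mean(s, n):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return s / sqrt(n)

def se_proportion(p_hat, n):
    """Estimated standard error of the sample proportion."""
    return sqrt(p_hat * (1 - p_hat) / n)

print(f"{se_mean(s=764, n=12):.2f}")
print(f"{se_proportion(p_hat=0.075, n=1_000):.4f}")
```

Both standard errors shrink in proportion to $1/\sqrt{n}$, so quadrupling the sample size halves them.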
27
Statistical Inference: Estimation
Research Question: What is the parameter value?
Sample of size n
Population
Tools (i.e., formulas): Point Estimator, Interval Estimator
29
Example
A random sample of 12 months of a company’s operating expenses produced a sample mean of $5,474 and a standard deviation of $764. Construct a 95% confidence interval for the company’s mean monthly expenses.
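One way to work this example is a t interval, $\bar{X} \pm t_{\alpha/2}\, s/\sqrt{n}$, since the sample is small and only the sample standard deviation is known. The sketch assumes monthly expenses are roughly normally distributed. The critical value 2.201 is the tabulated $t_{0.025}$ for 11 degrees of freedom and is hard-coded because the standard library has no t distribution.

```python
from math import sqrt

n, x_bar, s = 12, 5474.0, 764.0

# Critical value t_{0.025} with n - 1 = 11 degrees of freedom, from a t table.
t_crit = 2.201

margin = t_crit * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"95% CI: (${lower:,.2f}, ${upper:,.2f})")
```

The margin of error is about $485, so the interval runs from roughly $4,989 to $5,959.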
30
Statistical Inference: Hypothesis Testing
Research Question: Is the claim supported?
Sample of size n
Population
Tools (i.e., formulas): z or t statistic
32
Example
A bank has set up a customer service goal that the mean waiting time for its customers will be less than 2 minutes. The bank randomly samples 30 customers and finds that the sample mean is 100 seconds. Assuming that the sample is from a normal distribution and the standard deviation is 28 seconds, can the bank safely conclude that the population mean waiting time is less than 2 minutes?
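A sketch of the test for this example, treating the 28 seconds as a known population standard deviation and using a z statistic. The 5% significance level is an assumption; the example does not state one.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 120.0                     # hypothesized mean: 2 minutes, in seconds
n, x_bar, sd = 30, 100.0, 28.0

# H0: mu >= 120 versus Ha: mu < 120 (lower-tailed test).
z = (x_bar - mu_0) / (sd / sqrt(n))
p_value = NormalDist().cdf(z)    # lower-tail probability under H0

alpha = 0.05                     # assumed significance level
print(f"z = {z:.2f}, p-value = {p_value:.6f}, reject H0: {p_value < alpha}")
```

The statistic is about −3.91, far below the 5% critical value of −1.645, so the data support the claim that the mean wait is under 2 minutes. If 28 seconds is instead a sample standard deviation, the same number is compared with a t distribution with 29 degrees of freedom (critical value about −1.699), and the conclusion does not change.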
33
Margin of Error (B)
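B: margin of error
(point estimator) ± (multiplier) × (std of point estimator), where B = (multiplier) × (std of point estimator)
$\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
$\bar{X} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
• What does B tell us about the point estimator?
• How do we reduce the value of B?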
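The effect of the sample size on B can be seen numerically. This sketch uses hypothetical inputs (multiplier 1.96 for 95% confidence, standard deviation 10) and an illustrative function name.

```python
from math import sqrt

def margin_of_error(multiplier, std, n):
    """B = multiplier * std / sqrt(n), for an interval about a mean."""
    return multiplier * std / sqrt(n)

# Hypothetical inputs: z = 1.96 (95% confidence), standard deviation 10.
for n in (25, 100, 400):
    print(n, round(margin_of_error(1.96, 10, n), 2))
```

Each fourfold increase in n halves B. Lowering the confidence level (a smaller multiplier) also reduces B, at the cost of less confidence.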
34
Relations among B, n, and Confidence Level
B (margin of error)
n (sample size)
Confidence Level (e.g., 90%, 95%)
How to reduce B?
35
Estimation in Practice
• Determine a confidence level (say, 95%).
• How good do you want the estimate to be? (Define the margin of error.)
• Use formulas (p.8) to find a sample size that satisfies the pre-determined confidence level and margin of error.
Parameter   Sample Size Needed
$\mu$       $n = \frac{z_{\alpha/2}^2\,\sigma^2}{B^2}$ or $n = \frac{z_{\alpha/2}^2\,s^2}{B^2}$
            1. Replace $\sigma$ or s with the one from a previous study, or
            2. Estimate it by range/4 or range/6.
$p$         $n = \frac{z_{\alpha/2}^2\,p(1-p)}{B^2}$
            1. Use the p from a similar study or previous experiment.
            2. Be conservative. Use p = 0.5.
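The two sample-size formulas can be sketched as functions that round up to the next whole unit. The function names and the target margins ($200 for a mean with a planning value of 764 for the standard deviation, and 0.03 for a proportion) are hypothetical inputs chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def z_multiplier(confidence):
    """z_{alpha/2} for a two-sided interval at the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_mean(sigma, B, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= B."""
    return ceil((z_multiplier(confidence) * sigma / B) ** 2)

def n_for_proportion(B, p=0.5, confidence=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= B; p = 0.5 is conservative."""
    return ceil(z_multiplier(confidence) ** 2 * p * (1 - p) / B ** 2)

print(n_for_mean(sigma=764, B=200))     # hypothetical target margin of $200
print(n_for_proportion(B=0.03))         # hypothetical target margin of 0.03
```

Using p = 0.5 maximizes p(1 − p), so the resulting n is large enough whatever the true proportion turns out to be.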