Chapter 6 - Random Sampling and Data Description
More joy of dealing with large quantities of data
Chapter 6B
You can never have too much data.
Today in Prob & Stat
6-2 Stem-and-Leaf Diagrams
Steps for Constructing a Stem-and-Leaf Diagram
6-2 Stem-and-Leaf Diagrams
Example 6-4
Figure 6-4
Stem-and-leaf diagram for the compressive strength data in Table 6-2.
Figure 6-5
25 observations on batch yields
Stem-and-leaf displays for Example 6-5. Stem: Tens digits. Leaf: Ones digits.
too few
too many
just right
Figure 6-6
Stem-and-leaf diagram from Minitab.
Number of observations in the middle stem
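The construction can be sketched in a few lines of Python. This is only a minimal illustration, assuming non-negative data and a leaf unit of at least 1; the function names `stem_and_leaf` and `format_display` and the sample values are invented for this sketch and are not the textbook data.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the rows of a stem-and-leaf display as (stem, leaves) pairs.

    Each value is truncated to a whole number of leaf units. The last digit
    becomes the leaf and the remaining leading digits become the stem.
    Assumes non-negative values and leaf_unit >= 1.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v / leaf_unit)      # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)  # split off the last digit
        rows[stem].append(leaf)
    # Include empty stems so that gaps in the data stay visible.
    lo, hi = min(rows), max(rows)
    return [(s, rows.get(s, [])) for s in range(lo, hi + 1)]


def format_display(rows):
    """Render the rows as text, one stem per line."""
    return "\n".join(
        f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in rows
    )


# Hypothetical observations (stem: tens digit, leaf: ones digit).
sample = [12, 15, 21, 24, 24, 38, 40, 47]
print(format_display(stem_and_leaf(sample)))
```

Choosing a larger `leaf_unit` merges observations into fewer stems, which is one way to move between the "too few" and "too many" displays.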
6-4 Box Plots
• The box plot is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identification of observations that lie unusually far from the bulk of the data.
• Whisker
• Outlier
• Extreme outlier
Figure 6-13
Description of a box plot.
Figure 6-14
Box plot for compressive strength data in Table 6-2.
Figure 6-15
Comparative box plots of a quality index at three plants.
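The quantities behind a box plot can be computed directly. The sketch below assumes the common convention that whiskers reach the most extreme observations within 1.5 interquartile ranges of the box, that points between 1.5 and 3 interquartile ranges away are outliers, and that points farther out are extreme outliers. The function name `box_plot_summary` and the data are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Compute quartiles, whisker ends, outliers and extreme outliers."""
    # The default "exclusive" method places quartiles at position (n + 1) * p.
    q1, q2, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # outlier limits
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # extreme limits

    inside, outliers, extreme = [], [], []
    for x in sorted(data):
        if x < outer_lo or x > outer_hi:
            extreme.append(x)
        elif x < inner_lo or x > inner_hi:
            outliers.append(x)
        else:
            inside.append(x)

    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
        "extreme_outliers": extreme,
    }


# Hypothetical data with one outlier (30) and one extreme outlier (40).
data = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 40]
summary = box_plot_summary(data)
print(summary)
```

Running the same function on each group separately gives the numbers needed for a comparative display like the one for the three plants.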
6-5 Time Sequence Plots
• A time series or time sequence is a data set in which the observations are recorded in the order in which they occur.
• A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say x) and the horizontal axis denotes the time (which could be minutes, days, years, etc.).
• When measurements are plotted as a time series, we often see
   • trends,
   • cycles, or
   • other broad features of the data.
Figure 6-16
Company sales by year (a) and by quarter (b).
Figure 6-17 gosh! – a stem and leaf diagram combined with a time series plot
A digidot plot of the compressive strength data in Table 6-2.
Figure 6-18
A digidot plot of chemical process concentration readings, observed hourly.
6-6 Probability Plots
• Probability plotting is a graphical method for determining whether sample data conform to a hypothesized distribution based on a subjective visual examination of the data.
• Probability plotting typically uses special graph paper, known as probability paper, that has been designed for the hypothesized distribution. Probability paper is widely available for the normal, lognormal, Weibull, and various chi-square and gamma distributions.
Probability (Q-Q)* Plots
•Forget ‘normal probability paper’
•Plot the z score versus the ranked observations, x(j)
•Subjective, visual technique usually applied to test normality. Can also be adapted to other distributions.
•Method (for normal distribution):
•Rank the observations x(1), x(2), …, x(n) from smallest to largest
•Compute the (j-1/2)/n value for each x(j)
•Plot $z_j = \Phi^{-1}\left((j - 1/2)/n\right)$ versus x(j), where $\Phi$ is the standard normal cumulative distribution function
Parentheses usually indicate ordering of data.
Computing zj, where $z_j = \Phi^{-1}\left((j - 1/2)/n\right)$
29 values

  j    xj values     zj     (j-1/2)/n
  1      4.07      -2.11     0.017
  2      4.88      -1.63     0.052
  3      5.10      -1.36     0.086
  4      5.26      -1.17     0.121
  5      5.27      -1.01     0.155
 ...
 25      5.65       1.01     0.845
 26      5.75       1.17     0.879
 27      5.79       1.36     0.914
 28      5.85       1.63     0.948
 29      5.86       2.11     0.983
xj values are ordered least to greatest
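The method above can be sketched with Python's standard library, where `NormalDist().inv_cdf` plays the role of the inverse standard normal function. The function name `normal_scores` and the demo values are assumptions made for illustration.

```python
from statistics import NormalDist


def normal_scores(data):
    """Pair each ordered observation x(j) with z_j = Phi^{-1}((j - 1/2)/n)."""
    ordered = sorted(data)          # x(1) <= x(2) <= ... <= x(n)
    n = len(ordered)
    std_normal = NormalDist()       # mean 0, standard deviation 1
    return [
        (x, std_normal.inv_cdf((j - 0.5) / n))
        for j, x in enumerate(ordered, start=1)
    ]


# Hypothetical sample; plotting z against x should look roughly linear
# if the data are consistent with a normal distribution.
demo = [5.1, 4.7, 5.6, 5.3, 4.9, 5.4, 5.0]
for x, z in normal_scores(demo):
    print(f"{x:5.2f}  {z:6.2f}")
```

With n = 29 and j = 1 the plotting position is (1 - 1/2)/29 ≈ 0.017 and the corresponding score is about -2.11, which agrees with the first row of the table.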
Example in EXCEL – Table 6-6, p. 214
Cavendish Earth Density Data
[Figure: normal probability plot. Vertical axis: zj values (-2.50 to 2.50); horizontal axis: xj values (4.0 to 6.0).]
zj is the function NORMSINV
Example in EXCEL – Table 6-6, cont’d
Cavendish Earth Density Data (censored)
[Figure: normal probability plot. Vertical axis: zj values (-2.50 to 2.00); horizontal axis: xj values (4.0 to 6.0).]
zj is the function NORMSINV
Example 6-7
Example 6-7 (continued)
Figure 6-19
Normal probability plot for battery life.
Figure 6-20
Normal probability plot obtained from standardized normal scores.
Figure 6-21
Normal probability plots indicating a nonnormal distribution. (a) Light-tailed distribution. (b) Heavy-tailed distribution. (c) A distribution with positive (or right) skew.
The Beginning of a Comprehensive Example
Descriptive Statistics in Action
• see real numbers, real data
• watch as they are manipulated in perverse ways
• be thrilled as they are sorted
• and be amazed as they are compressed into a single number
The Raw Data
As part of a life span study of a particular type of lithium polymer rechargeable battery, 120 batteries were operated and their life spans in operating hours were determined.
1676.5  2386.6  1347.6  1840.2   916.0  1592.0
 895.6  1663.8  2420.6   943.5  1351.9  1395.4
1682.0  2045.9  2450.0  2210.5  1823.0  2401.8
1913.6  1985.2  2319.7  2856.3  1944.9  2968.7
2881.9  1387.6  2560.1   745.0  1641.3  1952.3
2007.8   718.3   884.1  2125.3  1694.0  2430.5
3313.4  1088.7   596.2  1759.9  1378.0   999.1
2156.4  1879.4  1779.7  1297.0   849.4  1608.4
1954.7  2056.6   908.3  2210.1  1882.6   983.8
2210.4  1740.2   955.4   543.4  2323.8  1831.1
1630.3  2791.8  2383.4   891.5   807.0  1307.4
1818.8  2476.0  1577.6  1818.8  2088.8  2139.0
1779.5   845.1  2365.4  1803.7  2940.7  1552.6
 984.2  1581.8  1527.9  1460.3  2004.6  1808.1
1512.6  2713.7  2749.2  1753.3  1714.3  2398.0
2046.1  2238.5  2439.7  2633.1  2039.1  2398.8
1613.3  1314.2  2016.2  4300.8  1760.5  2824.3
2066.1   729.3  1757.8  1250.8   577.8   715.2
2926.9  1898.7  1022.7  1005.2  1945.6  2277.3
1995.7  1377.2  2063.8   667.1  1299.9  1941.2
Data generated from a Weibull distribution with shape parameter = 2.8 and scale parameter = 2000
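A comparable data set can be simulated with Python's standard library. This is a sketch only: it assumes the two parameters are the Weibull shape (2.8) and scale (2000), which is consistent with the ML estimates reported later, and it uses an arbitrary seed, so it will not reproduce the 120 values listed above.

```python
import math
import random

SHAPE = 2.8      # Weibull shape parameter (assumed reading of the slide)
SCALE = 2000.0   # Weibull scale parameter, in operating hours
N = 120          # number of batteries

rng = random.Random(6)  # arbitrary seed, chosen only for repeatability

# Note the argument order: weibullvariate(scale, shape).
lifetimes = [rng.weibullvariate(SCALE, SHAPE) for _ in range(N)]

# Theoretical mean of a Weibull distribution: scale * Gamma(1 + 1/shape).
theoretical_mean = SCALE * math.gamma(1 + 1 / SHAPE)
sample_mean = sum(lifetimes) / N

print(f"theoretical mean: {theoretical_mean:.1f}")
print(f"sample mean:      {sample_mean:.1f}")
```

Comparing the sample mean with the theoretical mean gives a quick check that the simulated values behave like the distribution they were drawn from.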
Descriptive Statistics - Minitab
Variable        N     Mean   Median   TrMean   StDev   SE Mean
Battery Life  120   1789.4   1813.4   1773.9   661.5      60.4

Variable      Minimum   Maximum       Q1       Q3
Battery Life    543.4    4300.8   1348.7   2210.3

TrMean = trimmed mean
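The same summary can be sketched with Python's standard library. Two assumptions are made here: the trimmed mean drops the smallest 5% and largest 5% of the observations (rounded to a whole count), and quartiles use the (n + 1)p position rule, which is the default "exclusive" method of `statistics.quantiles`. The function names and the short data list (the first twelve battery values) are for illustration only.

```python
from math import sqrt
from statistics import mean, median, quantiles, stdev


def trimmed_mean(data, proportion=0.05):
    """Mean after removing `proportion` of the values from each end."""
    ordered = sorted(data)
    k = round(proportion * len(ordered))   # values dropped at each end
    return mean(ordered[k:len(ordered) - k])


def describe(data):
    """Return a Minitab-style set of descriptive statistics."""
    n = len(data)
    s = stdev(data)                        # divides by n - 1
    q1, _, q3 = quantiles(data, n=4)       # "exclusive" method by default
    return {
        "N": n,
        "Mean": mean(data),
        "Median": median(data),
        "TrMean": trimmed_mean(data),
        "StDev": s,
        "SE Mean": s / sqrt(n),            # standard error of the mean
        "Minimum": min(data),
        "Maximum": max(data),
        "Q1": q1,
        "Q3": q3,
    }


# First twelve battery lifetimes from the raw data, used as a small demo.
subset = [1676.5, 2386.6, 1347.6, 1840.2, 916.0, 1592.0,
          895.6, 1663.8, 2420.6, 943.5, 1351.9, 1395.4]
for name, value in describe(subset).items():
    print(f"{name:>8}: {value:.1f}")
```

The SE Mean column is the standard deviation divided by the square root of N; for the full data set, 661.5 / sqrt(120) ≈ 60.4, which matches the Minitab output.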
More Minitab
[Figure: Histogram of Battery Life. Horizontal axis: Battery Life (0 to 4500); vertical axis: Frequency (0 to 20).]
More Minitab
[Figure: Histogram of Battery Life, with Normal Curve. Horizontal axis: Battery Life (0 to 4500); vertical axis: Frequency (0 to 20).]
Stem and Leaf Plot
Leaf Unit = 100
  21   0  555677778888889999999
  36   1  000222333333334
 (40)  1  5555556666666677777777888888888899999999
  44   2  0000000000111222223333333444444
  13   2  56777888999
   2   3  3
   1   3
   1   4  3
[Figure: Dotplot for Battery Life. Horizontal axis: Battery Life (1000 to 4000).]
More Minitab
[Figure: Boxplot of Battery Life. Axis: Battery Life (0 to 4000).]
More Minitab
Descriptive Statistics
Variable: Battery Life

Anderson-Darling Normality Test
A-Squared:  0.566
P-Value:    0.139

Mean:       1789.45
StDev:      661.53
Variance:   437627
Skewness:   0.359617
Kurtosis:   0.736628
N:          120

Minimum:        543.40
1st Quartile:  1348.68
Median:        1813.45
3rd Quartile:  2210.32
Maximum:       4300.80

95% Confidence Interval for Mu:      1669.87 to 1909.03
95% Confidence Interval for Sigma:    587.10 to  757.74
95% Confidence Interval for Median:  1691.57 to 1945.04

[Figure: graphical summary with a histogram of Battery Life (700 to 4300) and interval plots of the 95% confidence intervals for Mu and for the Median (1650 to 1950).]
Time Series Plot
[Figure: time series plot. Vertical axis: Battery Life (0 to 4000); horizontal axis: Index (20 to 120).]
Based upon the order that the data was generated
Time Series Plot
[Figure: time series plot. Vertical axis: sorted Battery Life (0 to 4000); horizontal axis: Index (20 to 120).]
Sorted by failure time
[Figure: Normal Probability Plot for Battery Life. Vertical axis: Percent (1 to 99); horizontal axis: Data (0 to 4000).]
ML Estimates
Mean: 1789.45
StDev: 658.771
[Figure: Weibull Probability Plot for Battery Life. Vertical axis: Percent (1 to 99); horizontal axis: Data (100 to 1000, log scale).]
ML Estimates
Shape: 2.92250
Scale: 2005.35
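A Weibull probability plot can be built from the same plotting positions used for the normal plot. The sketch below assumes the usual linearization: if F(x) = 1 - exp(-(x/scale)^shape), then ln(-ln(1 - F)) is a straight line in ln(x) with slope equal to the shape. It fits that line by ordinary least squares, which is a different technique from the maximum likelihood (ML) estimates Minitab reports, so the numbers will generally differ somewhat. All function names are hypothetical.

```python
import math


def weibull_plot_points(data):
    """Return (ln x(j), ln(-ln(1 - p_j))) pairs with p_j = (j - 1/2)/n."""
    ordered = sorted(data)
    n = len(ordered)
    points = []
    for j, x in enumerate(ordered, start=1):
        p = (j - 0.5) / n
        points.append((math.log(x), math.log(-math.log(1.0 - p))))
    return points


def fit_line(points):
    """Ordinary least-squares slope and intercept through the points."""
    n = len(points)
    mean_u = sum(u for u, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    s_uu = sum((u - mean_u) ** 2 for u, _ in points)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in points)
    slope = s_uv / s_uu
    return slope, mean_v - slope * mean_u


def estimate_weibull(data):
    """Graphical (least-squares) estimates of shape and scale."""
    slope, intercept = fit_line(weibull_plot_points(data))
    shape = slope
    scale = math.exp(-intercept / slope)   # intercept = -shape * ln(scale)
    return shape, scale


# Demo on exact Weibull quantiles (shape 2.8, scale 2000), which fall
# exactly on a straight line, so the fit should recover both parameters.
n = 120
exact = [2000.0 * (-math.log(1 - (j - 0.5) / n)) ** (1 / 2.8)
         for j in range(1, n + 1)]
shape_hat, scale_hat = estimate_weibull(exact)
print(f"shape: {shape_hat:.3f}, scale: {scale_hat:.1f}")
```

With real data the points scatter around the line, and judging how closely they follow it is the subjective, visual part of the method.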
[Figure: Exponential Probability Plot for Battery Life. Vertical axis: Percent (10 to 99); horizontal axis: Data (0 to 10000).]
ML Estimates
Mean: 1789.45
Computer Support
This is easy if you use the computer.
hang on, we are going to Excel…
A Recap …
•Population – the totality of observations with which we are concerned. Issue: conceptual vs. actual.
•Sample – subset of observations selected from a population.
•Statistic – any function of the observations in a sample.
•Sample range – If the n observations in a sample are denoted by x1, x2, …,xn, then the sample range is r = max(xi) – min(xi).
•Sample mean and variance.
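$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$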
Note that these are functions of the observations in a sample and are, therefore, statistics.
More Recapping …
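Population parameters:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$$

Sample statistic:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$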
Note difference in denominators
Sample variance uses an estimate of the mean (xbar) in its calculation. If the sum of squares were divided by n instead of n - 1, the sample variance would be a biased estimate of the population variance (biased low).
Note terminology: ‘population parameter’ vs. ‘sample statistic’
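The two denominators can be compared directly. The sketch below implements the definitional and shortcut formulas for the sample variance and checks them against `statistics.variance` (divides by n - 1) and `statistics.pvariance` (divides by N). The function names and the eight data values are made up for illustration.

```python
from statistics import mean, pvariance, variance


def sample_variance(data):
    """Definitional formula: sum of squared deviations over n - 1."""
    n = len(data)
    xbar = mean(data)
    return sum((x - xbar) ** 2 for x in data) / (n - 1)


def sample_variance_shortcut(data):
    """Shortcut formula: (sum of x_i^2 - n * xbar^2) over n - 1."""
    n = len(data)
    xbar = mean(data)
    return (sum(x * x for x in data) - n * xbar ** 2) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean 5

s2 = sample_variance(data)         # 32 / 7
s2_short = sample_variance_shortcut(data)
divided_by_n = pvariance(data)     # 32 / 8, the "divide by n" version

print(f"s^2 (n - 1 denominator): {s2:.4f}")
print(f"shortcut formula:        {s2_short:.4f}")
print(f"divided by n:            {divided_by_n:.4f}")
```

For these values the sum of squared deviations is 32, so dividing by n = 8 gives 4 while dividing by n - 1 = 7 gives about 4.571. The version divided by n is always smaller by the factor (n - 1)/n.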
Sampling Process
X is a random variable that represents one selection from a population.
• Each observation in the sample is obtained under identical conditions.
• The population does not change during sampling.
• The probability distribution of values does not change during sampling.
• f(x1,x2,…,xn) = f(x1)f(x2)…f(xn) if the observations in the sample are independent.
Notation
• X1, X2,…, Xn are the random variables.
• x1, x2,…, xn are the values of the random variables.
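The factorization of the joint density for an independent sample can be illustrated numerically. This sketch assumes, purely for illustration, that the population is modeled by a normal distribution; the function name `joint_density` and the sample values are hypothetical.

```python
from math import pi, prod
from statistics import NormalDist


def joint_density(sample, dist):
    """f(x1, ..., xn) = f(x1) f(x2) ... f(xn) for independent observations."""
    return prod(dist.pdf(x) for x in sample)


population = NormalDist(mu=0.0, sigma=1.0)   # assumed population model
observed = [0.0, 0.0]                        # values x1, x2 of X1, X2

value = joint_density(observed, population)
print(f"joint density: {value:.5f}")
print(f"1 / (2 * pi):  {1 / (2 * pi):.5f}")
```

Each standard normal density at 0 equals 1/sqrt(2π), so the product for two independent observations at 0 is 1/(2π).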
A Final Recap…
A probability distribution is often a model for a population.
This is often the case when the population is conceptual or infinite.
The histogram should resemble the distribution of population values. The bigger the sample, the stronger the resemblance.
Our Work Here Today is Done
Next Week: The Glorious Midterm
Prob/Stat students discussing stem and leaf plots