Chapter 6 - Random Sampling and Data Description More joy of dealing with large quantities of data...

Post on 13-Jan-2016


Chapter 6 - Random Sampling and Data Description

More joy of dealing with large quantities of data

Chapter 6B

You can never have too much data.

Today in Prob & Stat

6-2 Stem-and-Leaf Diagrams

Steps for Constructing a Stem-and-Leaf Diagram

6-2 Stem-and-Leaf Diagrams

Example 6-4

Figure 6-4

Stem-and-leaf diagram for the compressive strength data in Table 6-2.

Figure 6-5

25 observations on batch yields

Stem-and-leaf displays for Example 6-5. Stem: Tens digits. Leaf: Ones digits.

too few

too many

just right

Figure 6-6

Stem-and-leaf diagram from Minitab.

Number of observations in the middle stem
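A stem-and-leaf display can also be built with a few lines of code. The sketch below is one minimal way to do it in Python, assuming non-negative data, one-digit leaves, and truncation (not rounding) of any digits below the leaf unit. The function names `stem_and_leaf` and `format_stem_and_leaf` and the sample yields are made up for illustration; they are not taken from the textbook or from Minitab.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: sorted list of leaf digits} for non-negative data.

    Assumption: digits below the leaf unit are truncated, not rounded.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)      # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)   # last remaining digit is the leaf
        rows[stem].append(leaf)
    return dict(rows)


def format_stem_and_leaf(values, leaf_unit=1):
    """Render the display as text, one line per stem (empty stems included)."""
    rows = stem_and_leaf(values, leaf_unit)
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return "\n".join(lines)


# Hypothetical batch yields, used only to exercise the functions.
yields = [76, 81, 94, 88, 90, 85, 79, 92, 87, 83]
print(format_stem_and_leaf(yields))
```

With `leaf_unit=1` the stems are the tens digits and the leaves are the ones digits, as in Figure 6-5. Choosing a larger leaf unit gives fewer, longer stems.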

6-4 Box Plots

• The box plot is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identification of observations that lie unusually far from the bulk of the data.

• Whisker

• Outlier

• Extreme outlier

Figure 6-13

Description of a box plot.
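The pieces of a box plot can be computed directly. The sketch below assumes the common convention that whiskers reach the most extreme points within 1.5 interquartile ranges of the box, that points between 1.5 and 3 interquartile ranges away are outliers, and that points farther than 3 are extreme outliers. It uses the default "exclusive" quartile method of Python's `statistics.quantiles`; other packages may interpolate quartiles slightly differently. The function name `box_plot_summary` and the sample values are hypothetical.

```python
import statistics


def box_plot_summary(data):
    """Quartiles, whisker ends, and flagged points for a box plot."""
    q1, median, q3 = statistics.quantiles(data, n=4)  # default method: 'exclusive'
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # assumed 1.5 * IQR fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # assumed 3 * IQR fences
    inside = [x for x in data if inner_low <= x <= inner_high]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": sorted(
            x for x in data
            if outer_low <= x < inner_low or inner_high < x <= outer_high
        ),
        "extreme_outliers": sorted(
            x for x in data if x < outer_low or x > outer_high
        ),
    }


# Made-up sample chosen so that one point falls in each flagged category.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 40]
summary = box_plot_summary(sample)
print(summary)
```

For this sample the box runs from 13 to 19, the whiskers stop at 10 and 19, 30 is flagged as an outlier, and 40 as an extreme outlier.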

Figure 6-14

Box plot for compressive strength data in Table 6-2.

Figure 6-15

Comparative box plots of a quality index at three plants.

6-5 Time Sequence Plots

• A time series or time sequence is a data set in which the observations are recorded in the order in which they occur.

• A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say x) and the horizontal axis denotes the time (which could be minutes, days, years, etc.).

• When measurements are plotted as a time series, we often see

•trends,

•cycles, or

•other broad features of the data

Figure 6-16

Company sales by year (a) and by quarter (b).

Figure 6-17 gosh! – a stem and leaf diagram combined with a time series plot

A digidot plot of the compressive strength data in Table 6-2.

Figure 6-18

A digidot plot of chemical process concentration readings, observed hourly.

6-6 Probability Plots

• Probability plotting is a graphical method for determining whether sample data conform to a hypothesized distribution based on a subjective visual examination of the data.

• Probability plotting typically uses special graph paper, known as probability paper, that has been designed for the hypothesized distribution. Probability paper is widely available for the normal, lognormal, Weibull, and various chi-square and gamma distributions.

Probability (Q-Q)* Plots

•Forget ‘normal probability paper’

•Plot the z score versus the ranked observations, x(j)

•Subjective, visual technique usually applied to test normality. Can also be adapted to other distributions.

•Method (for normal distribution):

•Rank the observations x(1), x(2), …, x(n) from smallest to largest

•Compute the (j-1/2)/n value for each x(j)

•Plot $z_j = \Phi^{-1}\left((j - 1/2)/n\right)$ versus $x_{(j)}$
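The method above can be sketched in Python using the standard library's `statistics.NormalDist`, whose `inv_cdf` plays the role of the inverse standard normal CDF (the same role as Excel's NORMSINV). The function name `normal_scores` is made up for this illustration.

```python
from statistics import NormalDist


def normal_scores(sample):
    """Pair each ordered observation x(j) with z_j = inv_cdf((j - 1/2) / n)."""
    xs = sorted(sample)                 # rank from smallest to largest
    n = len(xs)
    std_normal = NormalDist()           # mean 0, standard deviation 1
    return [
        (x, std_normal.inv_cdf((j - 0.5) / n))
        for j, x in enumerate(xs, start=1)
    ]


# Plotting positions for a sample of n = 29, the sample size used in these slides.
z = [NormalDist().inv_cdf((j - 0.5) / 29) for j in range(1, 30)]
print(round(z[0], 2), round(z[-1], 2))
```

Plotting the second element of each pair against the first gives the normal probability plot. If the points fall close to a straight line, the normal model is plausible.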

Parentheses usually indicate ordering of data.

Computing zj, where $z_j = \Phi^{-1}\left((j - 1/2)/n\right)$

29 values; xj values are ordered least to greatest

 j    xj values   zj      (j-1/2)/n
 1    4.07        -2.11   0.017
 2    4.88        -1.63   0.052
 3    5.10        -1.36   0.086
 4    5.26        -1.17   0.121
 5    5.27        -1.01   0.155
 …
 25   5.65         1.01   0.845
 26   5.75         1.17   0.879
 27   5.79         1.36   0.914
 28   5.85         1.63   0.948
 29   5.86         2.11   0.983

Example in EXCEL – Table 6-6, p. 214

Cavendish Earth Density Data

[Normal probability plot: zj values (vertical axis, -2.50 to 2.50) versus xj values (horizontal axis, 4.0 to 6.0)]

zj is the function NORMSINV

Example in EXCEL – Table 6-6, cont’d

Cavendish Earth Density Data (censored)

[Normal probability plot: zj values (vertical axis, -2.50 to 2.00) versus xj values (horizontal axis, 4.0 to 6.0)]

zj is the function NORMSINV

Example 6-7

Example 6-7 (continued)

Figure 6-19

Normal probability plot for battery life.

Figure 6-20

Normal probability plot obtained from standardized normal scores.

Figure 6-21

Normal probability plots indicating a nonnormal distribution. (a) Light-tailed distribution. (b) Heavy-tailed distribution. (c) A distribution with positive (or right) skew.

The Beginning of a Comprehensive Example

Descriptive Statistics in Action

•see real numbers, real data

•watch as they are manipulated in perverse ways

•be thrilled as they are sorted, and

•be amazed as they are compressed into a single number

The Raw Data

As part of a life span study of a particular type of lithium polymer rechargeable battery, 120 batteries were operated and their life spans in operating hours determined.

1676.5  2386.6  1347.6  1840.2   916.0  1592.0
 895.6  1663.8  2420.6   943.5  1351.9  1395.4
1682.0  2045.9  2450.0  2210.5  1823.0  2401.8
1913.6  1985.2  2319.7  2856.3  1944.9  2968.7
2881.9  1387.6  2560.1   745.0  1641.3  1952.3
2007.8   718.3   884.1  2125.3  1694.0  2430.5
3313.4  1088.7   596.2  1759.9  1378.0   999.1
2156.4  1879.4  1779.7  1297.0   849.4  1608.4
1954.7  2056.6   908.3  2210.1  1882.6   983.8
2210.4  1740.2   955.4   543.4  2323.8  1831.1
1630.3  2791.8  2383.4   891.5   807.0  1307.4
1818.8  2476.0  1577.6  1818.8  2088.8  2139.0
1779.5   845.1  2365.4  1803.7  2940.7  1552.6
 984.2  1581.8  1527.9  1460.3  2004.6  1808.1
1512.6  2713.7  2749.2  1753.3  1714.3  2398.0
2046.1  2238.5  2439.7  2633.1  2039.1  2398.8
1613.3  1314.2  2016.2  4300.8  1760.5  2824.3
2066.1   729.3  1757.8  1250.8   577.8   715.2
2926.9  1898.7  1022.7  1005.2  1945.6  2277.3
1995.7  1377.2  2063.8   667.1  1299.9  1941.2

Data generated from a Weibull distribution with shape parameter β = 2.8 and scale parameter δ = 2000
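A data set like this one can be simulated with Python's standard library. The sketch below assumes that 2.8 is the Weibull shape parameter and 2000 the scale parameter, which is what the Weibull ML estimates reported later in these slides suggest. The function names and the seed are arbitrary choices for illustration, and the simulated values will not reproduce the 120 numbers shown here.

```python
import math
import random

SHAPE = 2.8      # assumed shape parameter
SCALE = 2000.0   # assumed scale parameter, in operating hours


def simulate_lifetimes(n, shape=SHAPE, scale=SCALE, seed=None):
    """Draw n Weibull lifetimes; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    # weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


def weibull_mean(shape=SHAPE, scale=SCALE):
    """Theoretical mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


lifetimes = simulate_lifetimes(120, seed=1)
sample_mean = sum(lifetimes) / len(lifetimes)
print(len(lifetimes), round(sample_mean, 1), round(weibull_mean(), 1))
```

Comparing the simulated sample mean with `weibull_mean()` gives a feel for how much a statistic from 120 observations can wander around the population value.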

Descriptive Statistics - Minitab

Variable       N     Mean     Median   TrMean   StDev   SE Mean
Battery Life   120   1789.4   1813.4   1773.9   661.5   60.4

Variable       Minimum   Maximum   Q1       Q3
Battery Life   543.4     4300.8    1348.7   2210.3

TrMean = trimmed mean
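Two of the entries in this output are easy to reproduce by hand. The sketch below assumes the trimmed mean drops the smallest 5% and largest 5% of the observations before averaging, which is my understanding of how Minitab's TrMean is defined, and that SE Mean is the sample standard deviation divided by the square root of n. The function names are hypothetical.

```python
import math


def trimmed_mean(data, proportion=0.05):
    """Mean after dropping `proportion` of the points from EACH end.

    Assumption: the number dropped per end is proportion * n, rounded
    with Python's round() (which rounds exact halves to the even integer).
    """
    xs = sorted(data)
    k = round(proportion * len(xs))   # points removed from each tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


def standard_error(stdev, n):
    """Standard error of the mean: s / sqrt(n)."""
    return stdev / math.sqrt(n)


# The slide's StDev (661.5) and N (120) should give back its SE Mean of about 60.4.
print(round(standard_error(661.5, 120), 1))

# A made-up sample with one wild value shows why trimming helps.
print(trimmed_mean([1, 2, 3, 4, 100], proportion=0.2))
```

With 120 observations and a 5% trim, six values would be removed from each end, so a single very long-lived battery has no effect on the trimmed mean.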

More Minitab

[Histogram of Battery Life: Frequency (vertical axis, 0 to 20) versus Battery Life (horizontal axis, 0 to 4500)]

More Minitab

[Histogram of Battery Life, with Normal Curve: Frequency (vertical axis, 0 to 20) versus Battery Life (horizontal axis, 0 to 4500)]

Stem and Leaf Plot

Leaf Unit = 100

  21   0  555677778888889999999
  36   1  000222333333334
 (40)  1  5555556666666677777777888888888899999999
  44   2  0000000000111222223333333444444
  13   2  56777888999
   2   3  3
   1   3
   1   4  3

[Dotplot for Battery Life: horizontal axis Battery Life, 1000 to 4000]

More Minitab

[Boxplot of Battery Life: axis Battery Life, 0 to 4000]

More Minitab

Descriptive Statistics

Variable: Battery Life

Anderson-Darling Normality Test
A-Squared:   0.566
P-Value:     0.139

Mean          1789.45
StDev          661.53
Variance       437627
Skewness     0.359617
Kurtosis     0.736628
N                 120

Minimum        543.40
1st Quartile  1348.68
Median        1813.45
3rd Quartile  2210.32
Maximum       4300.80

95% Confidence Interval for Mu:      1669.87 to 1909.03
95% Confidence Interval for Sigma:    587.10 to  757.74
95% Confidence Interval for Median:  1691.57 to 1945.04

Time Series Plot

[Time series plot: Battery Life (vertical axis, 0 to 4000) versus Index (horizontal axis, 20 to 120)]

Based upon the order in which the data were generated

Time Series Plot

[Time series plot: sorted Battery Life (vertical axis, 0 to 4000) versus Index (horizontal axis, 20 to 120)]

Sorted by failure time

Normal Probability Plot for Battery Life

[Probability plot: Percent (vertical axis, 1 to 99) versus Data (horizontal axis, 0 to 4000)]

ML Estimates

Mean: 1789.45

StDev: 658.771

Weibull Probability Plot for Battery Life

[Probability plot: Percent (vertical axis, 1 to 99) versus Data (horizontal axis, 100 to 1000)]

ML Estimates

Shape: 2.92250

Scale: 2005.35

Exponential Probability Plot for Battery Life

[Probability plot: Percent (vertical axis, 10 to 99) versus Data (horizontal axis, 0 to 10000)]

ML Estimates

Mean: 1789.45

Computer Support

This is easy if you use the computer.

hang on, we are going to Excel…

A Recap …

•Population – the totality of observations with which we are concerned. Issue: conceptual vs. actual.

•Sample – subset of observations selected from a population.

•Statistic – any function of the observations in a sample.

•Sample range – If the n observations in a sample are denoted by x1, x2, …, xn, then the sample range is r = max(xi) – min(xi).

•Sample mean and variance.

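$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$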

Note that these are functions of the observations in a sample and are, therefore, statistics.

More Recapping …

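Population mean and variance:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$$

Sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$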

Note difference in denominators

Sample variance uses an estimate of the mean (xbar) in its calculation. If divided by n, the sample variance would be a biased estimate – biased low.
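The bias can be checked exactly on a tiny example. The sketch below takes a made-up population of four values, lists every ordered sample of size 2 drawn with replacement, and averages the two candidate variance estimates over all of them. The helper name `variance` and its `ddof` argument (the amount subtracted from n in the denominator) are illustrative choices, not notation from the slides.

```python
from itertools import product


def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


population = [1, 2, 3, 4]                    # hypothetical population
sigma2 = variance(population, ddof=0)        # population variance, divides by N

# All 16 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)  # divide by n - 1
avg_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)    # divide by n

print(sigma2, avg_unbiased, avg_biased)  # prints: 1.25 1.25 0.625
```

Averaged over every possible sample, dividing by n - 1 recovers the population variance, while dividing by n comes out low by the factor (n - 1)/n.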

Note terminology – ‘population parameter’ vs. ‘sample statistic’

Sampling Process

X is a random variable that represents one selection from a population.

•Each observation in the sample is obtained under identical conditions.

•The population does not change during sampling.

•The probability distribution of values does not change during sampling.

•f(x1, x2, …, xn) = f(x1)f(x2)…f(xn) if the sample is independent.

Notation

•X1, X2, …, Xn are the random variables.

•x1, x2, …, xn are the values of the random variables.

A Final Recap…

A probability distribution is often a model for a population.

This is often the case when the population is conceptual or infinite.

The histogram should resemble the distribution of population values. The bigger the sample, the stronger the resemblance.

Our Work Here Today is Done

Next Week: The Glorious Midterm

Prob/Stat students discussing stem and leaf plots