Unit 4: Summarizing and Exploring Data
Source: leon/stat571/2004SummerPDFs/571Unit4Handout.pdf


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 1

Unit 4: Summarizing and Exploring Data

Statistics 571: Statistical Methods
Ramón V. León

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 2

Variable Types – Scales of Measurement

• The numerical scale of a variable determines the appropriate statistical treatment of the resulting data
• Types of variables
– Categorical (qualitative): Nominal, Ordinal
– Numerical (quantitative): Continuous, Discrete


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 3

Types of Variables

• A categorical (qualitative) variable classifies each study unit into one of several distinct categories
– When the categories are simply distinct labels the variable is nominal
– When the categories can be ordered or ranked the variable is ordinal
• A numerical (quantitative) variable takes values from a set of numbers
– A continuous variable takes values from a continuous set of numbers
– A discrete variable takes values from a discrete set of numbers, typically integers

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 4

Alternative Classification of Numerical Variables

• If two measurements on a variable can be meaningfully compared by their difference, but not their ratio, then it is called an interval variable (e.g. temperature)

• If both the difference and the ratio are meaningful, then the variable is called a ratio variable (e.g. number of crimes, weight)
– A ratio variable has a true zero while an interval variable does not. The zero on a temperature scale depends on the scale. True zeros are independent of scale, e.g. zero weight.
– A 40 kg weight is twice a 20 kg weight even if these weights are converted to pounds. A temperature of 40 °F is not twice a temperature of 20 °F once these temperatures are converted to °C.

• The ratio scale is the strongest scale of measurement
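The weight and temperature example can be checked numerically. The following is a minimal sketch; the helper names `kg_to_lb` and `f_to_c` are illustrative and not part of the handout. It suggests that ratios survive a change of units on a ratio scale, while on an interval scale only ratios of differences do.

```python
def kg_to_lb(kg: float) -> float:
    """Convert kilograms to pounds (pure rescaling, zero stays zero)."""
    return kg * 2.20462


def f_to_c(f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius (rescaling plus a shift)."""
    return (f - 32.0) * 5.0 / 9.0


# Ratio scale: 40 kg is twice 20 kg, and still twice after conversion to pounds.
weight_ratio_kg = 40.0 / 20.0
weight_ratio_lb = kg_to_lb(40.0) / kg_to_lb(20.0)

# Interval scale: the ratio 40 F / 20 F is not preserved in Celsius.
temp_ratio_f = 40.0 / 20.0
temp_ratio_c = f_to_c(40.0) / f_to_c(20.0)

# Differences remain comparable on an interval scale.
diff_ratio_f = (40.0 - 20.0) / (30.0 - 20.0)
diff_ratio_c = (f_to_c(40.0) - f_to_c(20.0)) / (f_to_c(30.0) - f_to_c(20.0))

print(weight_ratio_kg, weight_ratio_lb)
print(temp_ratio_f, temp_ratio_c)
print(diff_ratio_f, diff_ratio_c)
```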


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 5

Summarizing Categorical Data

A frequency table shows the number of occurrences of each category

The relative frequency is the proportion in each category
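As a small illustration of a frequency table, the sketch below counts categories and converts the counts to relative frequencies. The list `observations` is hypothetical and only borrows category names from the casting rejection example; it is not the course data.

```python
from collections import Counter

# Hypothetical observations (not the course data set).
observations = ["Sand", "Shift", "Sand", "Drop", "Sand", "Shift", "Broken", "Sand"]

freq = Counter(observations)          # frequency of each category
n = len(observations)
rel_freq = {category: count / n for category, count in freq.items()}

# Print the table, most frequent category first.
for category, count in freq.most_common():
    print(f"{category:<8}{count:>4}{rel_freq[category]:>8.3f}")
```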

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 6

Bar Chart, Pie Chart, and Pareto Chart

A bar chart with categories arranged from the highest to the lowest is called a Pareto Chart

Used to distinguish the vital few from the trivial many
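The ordering behind a Pareto chart can be sketched in a few lines. The counts in `counts` are hypothetical, chosen only so the arithmetic is easy to follow; JMP produces the actual chart used in the course.

```python
# Hypothetical rejection counts (not the course data set).
counts = {"Sand": 50, "Shift": 20, "Drop": 10, "Broken": 15, "Cut": 5}

# Sort categories from the highest count to the lowest.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
total = sum(counts.values())

# Accumulate the cumulative percent shown on the right-hand axis.
pareto_rows = []
cumulative = 0
for category, count in ordered:
    cumulative += count
    pareto_rows.append((category, count, 100.0 * cumulative / total))

for category, count, cum_pct in pareto_rows:
    print(f"{category:<8}{count:>4}{cum_pct:>8.1f}")
```

In this made-up example the first two categories already account for 70% of the rejections, which is the "vital few" reading of the chart.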


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 7

Bar Chart, Pie Chart, and Pareto Chart

Homework: Go to the course home page and do the first two JMP tutorials. Turn in the Pareto chart produced by JMP. Then fish around in JMP and do a simple (unordered) bar chart and a pie chart. (See next page.) Turn in these charts too. You will need to use the Help menu of JMP.

[Figure: JMP Pareto chart of Cause_of_Rejection. Left axis: Count (0 to 600); right axis: Cum Percent (0 to 100). Categories in decreasing order: Sand, Corebreak, Broken, Drop, Shift, Mis-run, Crush, Run-out, Core-Missing, Blow, Cut.]

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 8

Bar Chart, Pie Chart, and Pareto Chart

[Figure: JMP bar chart of Cause_of_Rejection (unordered). Vertical axis: Frequency (0 to 400). Categories: Sand, Mis-run, Blow, Shift, Cut, Broken, Drop, Corebreak, Run-out, Core-Missing, Crush.]

[Figure: JMP pie chart of Cause_of_Rejection. Legend: Sand, Mis-run, Blow, Shift, Cut, Broken, Drop, Corebreak, Run-out, Core-Missing, Crush.]


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 9

Summarizing Numerical Data
(In Unit 4 measures are used to describe the data, not to infer anything about a population)

Mean:

$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$\sum_{i=1}^{n} (x_i - m)^2$ is minimized when $m = \bar{X}$

Homework: Import into JMP the mileage data of Table 4.2 available in the data set link of the course home page. Go to the Analyze menu of JMP and select Distribution. Look for the mean and the median in the resulting JMP output. Turn in this output.

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 10

Summarizing Numerical Data

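Median: Middle value in the ordered sample

$$x_{\min} = x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} = x_{\max}$$

$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right] & \text{if } n \text{ is even} \end{cases}$$

$\sum_{i=1}^{n} |x_i - m|$ is minimized when $m = \tilde{x}$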
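The two minimization properties can be checked by brute force. This is a rough sketch on a hypothetical five-value sample (with one large value) and a grid of candidate centers `m`; it is a numerical check, not a proof.

```python
import statistics

# Hypothetical sample with one large value.
data = [2, 3, 5, 8, 42]


def sum_sq(m: float) -> float:
    """Sum of squared deviations from a candidate center m."""
    return sum((x - m) ** 2 for x in data)


def sum_abs(m: float) -> float:
    """Sum of absolute deviations from a candidate center m."""
    return sum(abs(x - m) for x in data)


mean = statistics.mean(data)
median = statistics.median(data)

# Candidate centers 0.0, 0.1, ..., 50.0.
grid = [i / 10 for i in range(0, 501)]
best_sq = min(grid, key=sum_sq)    # grid point minimizing squared deviations
best_abs = min(grid, key=sum_abs)  # grid point minimizing absolute deviations

print(mean, best_sq)
print(median, best_abs)
```

The large value pulls the mean well above the median, while the median stays with the bulk of the data.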


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 11

Mean or Median?

• Appropriate summary of the center of the data:
– Mean if the data has a symmetric distribution with light tails (i.e. a relatively small proportion of the observations lie away from the center of the data)
– If the distribution has heavy tails or is asymmetric (skewed) use the median
• Extreme values that are far removed from the main body of the data are called outliers
– Large influence on the mean but not on the median
• Right or positive skewness
– Median < Mean
• Left or negative skewness
– Median > Mean

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 12

Summary Statistics: Measures of Dispersion

• Two data sets may have the same center but quite different dispersions around it

• Two ways to summarize variability
– Give milestones that divide the data into equal parts, e.g. quartiles
– Compute a single number, e.g., range, variance, or standard deviation


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 13

Summary Statistics: Percentiles and Quartiles

• The 100pth percentile or pth quantile, denoted $\tilde{x}_p$, is the value such that a fraction p of the data is below it and 1-p of the data is above it
• Quartiles
– Median is the 50th percentile
– The 25th, 50th, and 75th percentiles are called quartiles and divide the data into four equal parts
– The minimum, maximum, and three quartiles are called the five number summary of the data

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 14

Calculating Percentiles and Quartiles

$$\tilde{x}_p = \begin{cases} x_{(p(n+1))} & \text{if } p(n+1) \text{ is an integer} \\[4pt] x_{(m)} + [p(n+1) - m]\,(x_{(m+1)} - x_{(m)}) & \text{otherwise} \end{cases}$$

m is the integer part of p(n+1)

Example – First Quartile Calculation:

Calculate Q1: $p(n+1) = 0.25(24+1) = 6.25$ is not an integer

$$Q_1 = x_{(6)} + 0.25\,(x_{(7)} - x_{(6)}) = 24 + 0.25(25 - 24) = 24.25$$
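The p(n+1) rule can be written as a short function. This is a sketch under one assumption: the 24 values in `mileage` were read off the stem-and-leaf display shown later in this handout, so they should be checked against Table 4.2 before being relied on. The function name `percentile` is illustrative.

```python
import math


def percentile(sorted_data, p):
    """Return the 100p-th percentile using the p(n+1) interpolation rule.

    sorted_data must be in ascending order, and p(n+1) must lie between 1 and n.
    """
    n = len(sorted_data)
    position = p * (n + 1)
    if position < 1 or position > n:
        raise ValueError("p(n+1) must lie between 1 and n")
    m = math.floor(position)          # integer part of p(n+1)
    if position == m:
        return sorted_data[m - 1]     # x_(m), using 0-based indexing
    lower = sorted_data[m - 1]        # x_(m)
    upper = sorted_data[m]            # x_(m+1)
    return lower + (position - m) * (upper - lower)


# Values read off the stem-and-leaf display in this handout (an assumption).
mileage = [13, 17, 20, 21, 23, 24, 25, 25, 27, 28, 28, 29,
           30, 31, 31, 31, 32, 35, 36, 37, 38, 40, 40, 50]

q1 = percentile(mileage, 0.25)
median = percentile(mileage, 0.50)
q3 = percentile(mileage, 0.75)
print(q1, median, q3)
```

With these values the first quartile agrees with the 24.25 worked out on the slide.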


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 15

Range, Interquartile Range, and Standard Deviation

Range = maximum - minimum

Interquartile range (IQR) = Q3 – Q1

Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$$

Sample standard deviation: $s = \sqrt{s^2}$

Sample mean, variance, and standard deviation are sample analogs of the population mean, variance, and standard deviation $\mu, \sigma^2, \sigma$

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 16

Variance Calculation Example

X_i      X_i − X̄    (X_i − X̄)²
1        −2          4
2        −1          1
3         0          0
4         1          1
5         2          4
Total 15  0          10

$$\bar{X} = \frac{15}{5} = 3$$

$$s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} = \frac{10}{5-1} = 2.5$$
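The variance example can be reproduced in a few lines. This sketch assumes the five data values are 1, 2, 3, 4, 5, as recovered from the example table, and compares the definitional formula, the shortcut formula, and the standard library's `statistics.variance`.

```python
import statistics

# Data values recovered from the example table (an assumption).
x = [1, 2, 3, 4, 5]
n = len(x)
x_bar = sum(x) / n

# Definition: sum of squared deviations divided by n - 1.
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Shortcut: (sum of squares - n * mean^2) divided by n - 1.
s2_shortcut = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)

s = s2_definition ** 0.5  # sample standard deviation

print(x_bar, s2_definition, s2_shortcut, statistics.variance(x), s)
```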


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 17

Sample Variance and Standard Deviation

• $s^2$ and $s$ should only be used to summarize dispersion with symmetric distributions
• For asymmetric distributions a more detailed breakdown of the dispersion must be given in terms of quartiles
• For normal data and large samples
– 50% of the data values fall between $\bar{X} \pm 0.67s$
– 68% of the data values fall between $\bar{X} \pm 1s$
– 95% of the data values fall between $\bar{X} \pm 2s$
– 99.7% of the data values fall between $\bar{X} \pm 3s$

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 18

Relationship between s and IQR for Normal Data

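$$IQR \approx (\bar{x} + 0.67s) - (\bar{x} - 0.67s) = 1.34s \quad\Rightarrow\quad s \approx \frac{IQR}{1.34}$$

[Figure: standard normal density with central area .50 between the z scores -.67 and .67.]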
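The constants 0.67 and 1.34, and the 68%, 95%, and 99.7% figures, can be checked against the standard normal distribution. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `central_coverage` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# The 75th percentile of the standard normal is the z score with area .75 to its left.
z75 = standard_normal.inv_cdf(0.75)

# For normal data, IQR = (mean + z75 * s) - (mean - z75 * s) = 2 * z75 * s.
iqr_over_s = 2 * z75


def central_coverage(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


print(round(z75, 4), round(iqr_over_s, 4))
for k in (0.67, 1, 2, 3):
    print(k, round(central_coverage(k), 4))
```

The slide rounds these values to 0.67 and 1.34.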


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 19

Measures of Dispersion Example

Coefficient of Variation: $CV = s / \bar{X}$

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 20

Coefficient of Variation (CV)

Relative measure of dispersion that expresses the sample SD in terms of the sample mean:

$$CV = \frac{s}{\bar{X}}$$

Group of children study:

Arm circumference: $\bar{x} = 138$ mm, $s = 9$ mm, $CV = 9/138 = 6.5\%$

Body water: $\bar{x} = 60\%$, $s = 4\%$, $CV = 4/60 = 6.7\%$
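The two CV values can be reproduced directly from the summary statistics on the slide. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd: float, mean: float) -> float:
    """Return the sample SD expressed as a fraction of the sample mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean


# Summary statistics from the group of children study on the slide.
cv_arm = coefficient_of_variation(9, 138)    # arm circumference, in mm
cv_water = coefficient_of_variation(4, 60)   # body water, in percent

print(f"{100 * cv_arm:.1f}%  {100 * cv_water:.1f}%")
```

Because the CV is unitless, the two measurements can be compared even though one is in millimeters and the other in percent.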


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 21

Box and Whiskers Plot (Box Plot)
JMP’s (Default) Outlier Box Plot

• Q1 and Q3 are called the hinges
• Center line is the median
• Two lines called whiskers extend to the most extreme data values that are still inside the (inner) fences
• Observations outside the fences are regarded as possible outliers and are denoted by asterisks

Lower (Inner) Fence = Q1 – 1.5 x IQR
Upper (Inner) Fence = Q3 + 1.5 x IQR

Lower Outer Fence = Q1 – 3 x IQR
Upper Outer Fence = Q3 + 3 x IQR
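The fence rules translate directly into code. This sketch uses the quartiles reported in the JMP Quantiles output for the mileage data with the extra value 100 (Q1 = 24.50, Q3 = 36.50); the function names `fences` and `classify` are illustrative and not JMP terminology.

```python
def fences(q1: float, q3: float) -> dict:
    """Return the inner and outer fences computed from the quartiles."""
    iqr = q3 - q1
    return {
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3.0 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }


def classify(x: float, f: dict) -> str:
    """Label a value relative to the fences."""
    if x < f["lower_outer"] or x > f["upper_outer"]:
        return "extreme outlier"
    if x < f["lower_inner"] or x > f["upper_inner"]:
        return "possible outlier"
    return "inside fences"


# Quartiles from the JMP Quantiles output (mileage data plus the value 100).
f = fences(24.5, 36.5)
print(f)
for value in (50, 60, 100):
    print(value, classify(value, f))
```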

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 22

Mileage data + 100 Value in Extra JMP Row

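[Figure: JMP histogram with outlier box plot and quantile box plot of the mileage data plus the extra value 100; axis from 0 to 120.]

Quantiles

100.0%   maximum    100.00
 99.5%              100.00
 97.5%              100.00
 90.0%               44.00
 75.0%   quartile    36.50
 50.0%   median      30.00
 25.0%   quartile    24.50
 10.0%               18.80
  2.5%               13.00
  0.5%               13.00
  0.0%   minimum     13.00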

Outlier Box Plot obtained by going to the Analyze menu and selecting the Distribution option. This is the default box plot. The Quantile Box Plot (innermost) was obtained by going to the red triangle menu on top of the histogram and selecting the option.

Homework: Turn in this JMP output


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 23

Explanation of JMP Output

A and B are called the lower and upper fences. The dashed lines (whiskers) are the most extreme observations still within the fences. A point outside the fences is considered a possible outlier and should be examined with care.

Lower (Inner) Fence = Q1 – 1.5 x IQR
Upper (Inner) Fence = Q3 + 1.5 x IQR

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 24

Explanation of JMP Output

This explanation was obtained by going to the Tools menu of JMP, selecting the Help (?) tool, and then clicking with it on top of the JMP output for which you want the explanation.


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 25

Using Box Plots to Compare

Open circle denotes an extreme outlier, i.e. a data point beyond the upper outer fence Q3 + 3 x IQR

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 26

Using Box Plots to Compare: JMP Output

Homework: Do the JMP tutorial “Using Box Plots to Compare”

[Figure: JMP side-by-side box plots of Cost (0 to 35000) by Group (Control, Treatment).]


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 27

Standardized Z-scores

• Often it is useful to express the data on a standardized scale:

$$z_i = \frac{x_i - \bar{X}}{s}, \quad i = 1, 2, \ldots, n$$

• How many standard deviations a data value is above or below the mean
• Unitless quantities standardized to have mean 0 and SD 1
• Used to flag outliers, e.g. observations with $|z| > 3$
• Used to compare two or more batches of data on the same scale that originally were measured (possibly) on different scales
• If the data are normal, a z-score can be translated into a percentile using Table A.3 of the normal distribution
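Standardizing is a one-line computation. The sketch below uses a hypothetical sample and confirms that the z-scores have mean 0 and SD 1; the cutoff of 3 follows the rule of thumb on the slide.

```python
import statistics

# Hypothetical sample (not course data).
data = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = statistics.mean(data)
s = statistics.stdev(data)  # sample SD, divisor n - 1

# z_i = (x_i - mean) / s
z_scores = [(x - x_bar) / s for x in data]

# Flag observations more than 3 SDs from the mean.
flagged = [x for x, z in zip(data, z_scores) if abs(z) > 3]

print([round(z, 2) for z in z_scores])
print(statistics.mean(z_scores), statistics.stdev(z_scores), flagged)
```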

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 28

Histogram

[Figure: two JMP histograms of the same data with different bin settings; data axis from 10 to 55.]

The two different histograms were obtained by going to the JMP Tools menu, grabbing the Hand tool, and moving it around on top of the histogram.

Relative frequency Histogram:


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 29

Histogram Remarks

• Gives a picture of the distribution of the data

• Area under the (relative frequency) histogram represents sample proportion

• Shows if the distribution is
– Symmetric or skewed
– Unimodal or bimodal
• Gaps in the data may indicate a lack of precision with the measurement process
• Many quality control applications
– Are there two processes?
– Detection of rework or cheating
– Tells if process meets the specifications

[Figure: two histograms shown against the lower and upper specification limits (LSL, USL).]

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 30

Stem and Leaf Diagram

Obtained by going to the red triangle pop-up menu on top of the histogram and selecting Stem and Leaf.

Stem and Leaf

Stem   Leaf      Count
5      0         1
4
4      00        2
3      5678      4
3      01112     5
2      557889    6
2      0134      4
1      7         1
1      3         1

Multiply Stem.Leaf by 10


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 31

Normal Plots

If the data follows a normal distribution, then the percentiles of that normal distribution should plot linearly against the sample percentiles, except for sampling variation

See next page to see how the normal scores were obtained

Sample percentiles: the ordered data. Normal percentiles: $z_p = \Phi^{-1}(p)$

Plot the sample percentiles versus $z_p$

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 32

Standard Normal Table

[Figure: standard normal table lookup; the area to the left of $z$ equals $p$, so $z = \Phi^{-1}(p)$.]


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 33

Normal Plot

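$$\Phi\left(\frac{x_{(i)} - \mu}{\sigma}\right) = p \;\Rightarrow\; \frac{x_{(i)} - \mu}{\sigma} = \Phi^{-1}(p) \;\Rightarrow\; x_{(i)} = \mu + \sigma\,\Phi^{-1}(p)$$

Normal Score = $z = \Phi^{-1}(p)$, with $p = \frac{i}{n+1}$

[Figure: normal score z plotted against the ordered data.]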
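The normal scores can be computed without a table. This sketch uses `statistics.NormalDist().inv_cdf` for $\Phi^{-1}$ and pairs each score with an ordered data value. The `mileage` values were read off the stem-and-leaf display in this handout, which is an assumption; plotting the pairs is left to JMP or any plotting tool.

```python
from statistics import NormalDist


def normal_scores(n: int) -> list:
    """Return the normal scores z_i = inverse-Phi(i / (n + 1)) for i = 1, ..., n."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]


# Values read off the stem-and-leaf display in this handout (an assumption).
mileage = sorted([13, 17, 20, 21, 23, 24, 25, 25, 27, 28, 28, 29,
                  30, 31, 31, 31, 32, 35, 36, 37, 38, 40, 40, 50])

scores = normal_scores(len(mileage))

# Each pair is (normal score, ordered data value); roughly linear pairs suggest normality.
pairs = list(zip(scores, mileage))
for z, x in pairs[:3]:
    print(round(z, 3), x)
```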

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 34

Normal Plot: JMP Output

Homework: Do the JMP tutorial “Normal Quantile Plot.” Turn in your JMP output

[Figure: JMP Normal Quantile Plot of the ordered data (10 to 55) against the normal score z (-2 to 3), with a probability scale from .01 to .99.]

z = normal score, where $\Phi(z) = p = \frac{i}{n+1}$ for the ith ordered data value


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 35

Easier to Interpret JMP Output

[Figure: Normal Quantile Plot with the normal quantile (-3 to 3; probability scale .01 to .99) on the vertical axis and the data (10 to 55) on the horizontal axis.]

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 36

Normal Goodness of Fit Test


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 37

Normal Goodness of Fit Test

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 38

Sampling Variation and Normal Plots

Use JMP’s Random function in the JMP calculator to obtain similar plots

Do the JMP tutorial: “Generating Standard Normal Random Numbers” to see how.


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 39

Generating Random Samples with JMP’s Random Function

1. Open a new data table and add as many rows as you want in the random sample

2. Right click on top of the column and select Formula… . This will bring up the calculator

3. Select the column in the calculator and then from Random menu select Random Normal

Consult JMP’s help under Random Number Functions

Use other random functions to do Problem 4.24, Parts (b) and (c)

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 40

Departures from Normality and Normal Plots


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 41

Normalizing Transformations

• Data can be nonnormal in a number of ways, e.g., the distribution may not be bell shaped or may be heavier tailed than the normal distribution or may not be symmetric
• Only the departure from symmetry can be easily corrected by transforming the data
• If the distribution is positively skewed, then the right tail needs to be shrunk inward. The most common transformation used for this purpose is the logarithmic transformation, i.e., $x \to \log x$
• The square-root transformation, i.e. $x \to \sqrt{x}$, provides a weaker shrinking effect; it is frequently used for (Poisson) count data
• For negatively skewed data use the inverse transformations $x \to e^x$ or $x \to x^2$

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 42

Hospitalization Cost Before and After a Log Transformation

• The transformation facilitates statistical analysis, but the transformed scale might not have a physical meaning.
• Also, the same results are not obtained by transforming back to the original scale as by performing the same statistical analyses on the raw data, e.g., the average of the log is not the same as the log of the average
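The remark about transforming back can be illustrated numerically. The four cost values below are hypothetical, chosen so the arithmetic is easy to follow; they are not the hospitalization data.

```python
import math

# Hypothetical right-skewed costs (not the hospitalization data).
costs = [1000.0, 2000.0, 4000.0, 32000.0]

arithmetic_mean = sum(costs) / len(costs)
mean_of_logs = sum(math.log(c) for c in costs) / len(costs)
log_of_mean = math.log(arithmetic_mean)

# Transforming the mean of the logs back gives the geometric mean,
# which is smaller than the arithmetic mean for skewed data like these.
back_transformed = math.exp(mean_of_logs)

print(arithmetic_mean, round(back_transformed, 2))
print(round(mean_of_logs, 4), round(log_of_mean, 4))
```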


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 43

Using the JMP Calculator To Transform Data

1. Double click to the right of the “Group” column to get a new column and name it “LogCost”

2. Right click on top of the LogCost column and select Formula… . This takes you to JMP’s calculator

3. Select the Log function from the Transcendental Group

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 44

Comparison of the Hospital Cost After Log Transformation

[Figure: JMP side-by-side box plots of LogCost (6 to 10.5) by Group (Control, Treatment).]

Later we will learn a formal statistical procedure, called the t-test, that can be used to test for differences of populations that are normally distributed and have similar variation.


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 45

Runs Chart

• The graphical techniques described so far are inappropriate for time series data because they ignore the time sequence
• A run chart graphs the data against time

Elongation Measurements on Fifty Springs

The histogram hides the time order effect and is therefore an inappropriate method for summarizing time-ordered data.

The Time Series platform of JMP can be used to produce a runs chart.

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 46

Summarizing Bivariate Data

• When two or more variables are measured on each sampling unit the result is multivariate data.
• If only two variables are measured the result is bivariate data. One variable is called the x variable and the other the y variable
• We can analyze the x and y variables separately with the methods we have learned so far, but these methods would not answer questions about the relationship between x and y
– What is the nature of the relationship (if any) between x and y?
– How strong is the relationship?
– How well can one variable be predicted from the other?


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 47

Summarizing Bivariate Categorical Data: Two-Way Table

• The numbers in the cells are the frequencies of each possible combination of categories
• Why would you like to transform the frequencies into percentages?

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 48

Table and Row Percentages

What information can you tell from these two tables?


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 49

What Percentage to Calculate on the Table?

• When a sample is drawn and then cross-classified according to the row and column variables, overall percentages should be reported. Row and column percentages are interpretable too. This is the case in the income and satisfaction example.

• When the sample is drawn by using predetermined sample sizes for the rows (or columns), the row (or column) percentages should be reported. Other percentages do not make sense. (See example on the next page.)

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 50

Appropriate Percentage Example

Here row percentages are meaningful but column percentages are not. It would be misleading to report that 25/30 = 83% of the people with a respiratory illness are smokers, because the proportion of smokers in the sample does not reflect the true proportion of smokers in the population.


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 51

Two-Way Table: JMP Output

Rows: Income ($). Columns: Job Satisfaction. Each cell shows Count / Total % / Col % / Row %; the margins show Count / Total %.

Income ($)       Little Dissatisfied          Moderately Satisfied          Very Dissatisfied           Very Satisfied                Total
15,000-25,000    28 / 3.11 / 25.93 / 11.91    81 / 8.99 / 25.39 / 34.47     13 / 1.44 / 20.97 / 5.53    113 / 12.54 / 27.43 / 48.09   235 / 26.08
6000-15,000      38 / 4.22 / 35.19 / 13.15    104 / 11.54 / 32.60 / 35.99   22 / 2.44 / 35.48 / 7.61    125 / 13.87 / 30.34 / 43.25   289 / 32.08
<6000            24 / 2.66 / 22.22 / 11.65    80 / 8.88 / 25.08 / 38.83     20 / 2.22 / 32.26 / 9.71    82 / 9.10 / 19.90 / 39.81     206 / 22.86
>25,000          18 / 2.00 / 16.67 / 10.53    54 / 5.99 / 16.93 / 31.58     7 / 0.78 / 11.29 / 4.09     92 / 10.21 / 22.33 / 53.80    171 / 18.98
Total            108 / 11.99                  319 / 35.41                   62 / 6.88                   412 / 45.73                   901

Homework: Do the JMP tutorial “Summarizing Bivariate Categorical Data.”

• Percentages can be deleted in JMP
• This table needed some cleaning up in PowerPoint after pasting.
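The three kinds of percentages in the JMP output can be recomputed from the cell counts. This sketch uses the counts read from the JMP table above; the dictionary layout and the function name `cell_percentages` are illustrative choices.

```python
# Cell counts read from the JMP output: income group -> job satisfaction -> count.
counts = {
    "<6000": {"Very Dissatisfied": 20, "Little Dissatisfied": 24,
              "Moderately Satisfied": 80, "Very Satisfied": 82},
    "6000-15,000": {"Very Dissatisfied": 22, "Little Dissatisfied": 38,
                    "Moderately Satisfied": 104, "Very Satisfied": 125},
    "15,000-25,000": {"Very Dissatisfied": 13, "Little Dissatisfied": 28,
                      "Moderately Satisfied": 81, "Very Satisfied": 113},
    ">25,000": {"Very Dissatisfied": 7, "Little Dissatisfied": 18,
                "Moderately Satisfied": 54, "Very Satisfied": 92},
}

row_totals = {income: sum(row.values()) for income, row in counts.items()}
col_totals = {}
for row in counts.values():
    for satisfaction, count in row.items():
        col_totals[satisfaction] = col_totals.get(satisfaction, 0) + count
grand_total = sum(row_totals.values())


def cell_percentages(income: str, satisfaction: str) -> tuple:
    """Return (total %, column %, row %) for one cell, rounded to two decimals."""
    count = counts[income][satisfaction]
    return (
        round(100 * count / grand_total, 2),
        round(100 * count / col_totals[satisfaction], 2),
        round(100 * count / row_totals[income], 2),
    )


print(grand_total, row_totals, col_totals)
print(cell_percentages("15,000-25,000", "Little Dissatisfied"))
```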

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 52

Two-Way Table: JMP Output

Homework: Do the JMP tutorial “Summarizing Bivariate Categorical Data.”

• Percentages can be deleted in JMP
• This table needed some cleaning up in PowerPoint after pasting.


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 53

JMP’s Mosaic Plot

[Figure: JMP mosaic plot of Job Satisfaction (vertical axis, proportions 0.00 to 1.00) by Income ($) (<6000, 6000-15,000, 15,000-25,000, >25,000). Categories: Very Dissatisfied, Little Dissatisfied, Moderately Satisfied, Very Satisfied.]

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 54

Simpson’s Paradox: Collapsing Several Variables into Two Variables

University of California sex-bias study in graduate admissions:

Is there a gender bias?


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 55

Simpson’s Paradox: Collapsing Several Variables into Two Variables

University of California sex-bias study in graduate admissions:

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 56

How Can the Admission Rates be Compared


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 57

Summarizing Bivariate Numerical Data

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 58

Scatter Plot


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 59

Scatter Plot: JMP Output

[Figure: JMP scatter plot of Method B (2 to 9) versus Method A (2 to 9).]

Homework: Do the JMP tutorial “Scatter Plot.”

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 60

Example: Life Expectancy by Race and Gender


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 61

Labeled Scatter Plot

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 62

Labeled Scatter Plot: JMP Output

Homework: Go to the JMP tutorial “Labeled Scatter Plot” and produce the graphs in this page and the next. Turn them in.

[Figure: JMP labeled scatter plot of Years (45 to 80) versus Year (1910 to 2000).]


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 63

Labeled Scatter Plot: JMP Output

[Figure: JMP labeled scatter plot of Years (45 to 80) versus Year (1910 to 2000), with a linear fit for each group: BF, BM, WF, and WM.]

What is the message in this plot?

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 64

Sample Correlation Coefficient

A single numerical summary statistic which measures the strength of a linear relationship between x and y

$$r = \frac{s_{xy}}{s_x s_y}$$

where

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right]$$

is the sample covariance

Note that

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

Average of the products of the standardized scores
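The two expressions for r can be compared on a small example. The paired values in `x` and `y` are hypothetical; the sketch computes r once from the sample covariance and once as the average (with divisor n - 1) of the products of the standardized scores.

```python
import statistics

# Hypothetical paired data (not course data).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Sample covariance from the definition and from the shortcut formula.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_xy_shortcut = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / (n - 1)

# Correlation from the covariance.
r = s_xy / (s_x * s_y)

# Correlation as the average product of standardized scores.
r_from_z = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

print(s_xy, s_xy_shortcut, round(r, 4), round(r_from_z, 4))
```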


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 65

Properties of the Sample Correlation Coefficient r

• Properties similar to the population correlation coefficient ρ
– Unitless quantity
– Takes values between –1 and 1
– The extreme values are attained if and only if the points (xi, yi) fall exactly on a straight line (r = -1 for a line with negative slope and r = +1 for a line with positive slope.)
– Takes values close to zero if there is no linear relationship between x and y

Homework: Do the JMP tutorial “Correlation Coefficient”

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 66

Values of the Correlation Coefficient


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 67

Scatter Plots with Zero Correlation

Lesson: Always plot the data first, before interpreting numerical summaries such as the correlation coefficient!

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 68

Military Draft

r = -0.226

Is the plot on the next page better?

There appears to be no linear relationship between the birth date and the selection number in the military draft lottery.


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 69

Military Draft

r = -0.866

Lesson: If the data is averaged over subsets of the sample, then the correlation coefficient is usually made larger in absolute value.

Another example: The daily closes of the S&P 500 stock index and the S&P bond index exhibit less correlation than the monthly averages of the daily closes of these indexes.

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 70

Correlation and Causation

• High correlation is frequently mistaken for a cause-and-effect relationship. Such a conclusion may not be valid in observational studies, where the variables are not controlled
– A lurking variable may be affecting both variables
– One can only claim association not causation
• Countries with high fat diets tend to have higher incidences of breast and colon cancer. Can we conclude causation?
• A common lurking variable in many studies is time order
– Wealth and health problems go up with age. Does wealth cause health problems?
• Sometimes correlations can be found without any plausible explanation, e.g., sun spots and economic cycles


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 71

Straight Line Regression

• It is convenient to summarize the relationship between two variables, x and y, by an equation to predict y for a given x. The simplest equation is that of a straight line.
• Least squares method:

$$\frac{y - \bar{y}}{s_y} = r\,\frac{x - \bar{x}}{s_x}$$

The equation predicts that if the value of x is one standard deviation away from the mean, then the value of y will be r standard deviations ($r s_y$) away from the mean

Slope-intercept form of the regression line:

$$y = a + bx \quad \text{where} \quad b = r\,\frac{s_y}{s_x} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

Homework: Do the JMP tutorial “Simple Linear Regression.”

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 72

Regression Toward the Mean

• “Regression” means falling back
• Regression towards the mean (Francis Galton and Karl Pearson)
– Many hereditary traits tend to fall back towards the mean in successive generations
– What simple explanation for this phenomenon is there?
• Galton’s data on fathers’ heights (x) and sons’ heights (y): $\bar{x} = 68$, $\bar{y} = 69$, $s_x = 2.7$, $s_y = 2.7$, $r = 0.5$
• It is natural to think that the regression line is y = x + 1, that is, that on the average a son would be taller than his father by 1 inch.


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 73

Regression Toward the Mean

• Actual regression line: y - 69 = 0.5(x - 68)
– Sons of 72-inch-tall fathers are 71 inches tall on the average
– Sons of 64-inch-tall (short?) fathers are 67 inches tall (short?) on the average
• Same phenomenon occurs in pretest-posttest situations
• What happens to the SAT scores of students who retake the SAT exam after doing particularly badly?
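The predictions for tall and short fathers follow from the summary statistics alone. This sketch plugs Galton's values from the handout (means 68 and 69 inches, both SDs 2.7, r = 0.5) into the slope-intercept form; the function name `predict_son_height` is illustrative.

```python
# Summary statistics from Galton's father-son height data, as given in the handout.
x_bar, y_bar = 68.0, 69.0   # mean father height, mean son height (inches)
s_x, s_y = 2.7, 2.7         # standard deviations
r = 0.5                     # sample correlation

# Least squares slope and intercept: b = r * s_y / s_x, a = y_bar - b * x_bar.
b = r * s_y / s_x
a = y_bar - b * x_bar


def predict_son_height(father_height: float) -> float:
    """Predicted average son height from the regression line y = a + b x."""
    return a + b * father_height


for father in (64, 68, 72):
    naive = father + 1  # the tempting line y = x + 1
    print(father, predict_son_height(father), naive)
```

Fathers who are 4 inches above the mean have sons who average only 2 inches above their own mean, which is the falling back toward the mean described on the slide.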

Regression fallacy: Misleading conclusions due to regression towards the mean

6/9/2004 Unit 4 - Stat 571 - Ramón V. León 74

Galton’s Plot of Father-Son Height Data

Dashed Line: y = x + 1

Solid Line: Regression line


6/9/2004 Unit 4 - Stat 571 - Ramón V. León 75

Stephen Senn’s Quote on Regression to the Mean

• “This phenomenon of regression is widespread and general so that, for example, patients which have been selected for a clinical trial on grounds of high blood pressure may, on purely statistical grounds, be expected, other things being equal, to show an improvement in blood pressure when measured again. This omnipresent, but widely ignored feature is an important reason for the unreliability of uncontrolled studies and it is a sad fact that medical researchers are still regular and gullible victims of this phenomenon.”
• Importance of the counterfactual view of causality.

Statistical Issues in Drug Development by Stephen Senn. Wiley