1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.


Page 1: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

1

STAT131 Week 2 Lecture 1b

Making Sense of Data

Anne Porter

Page 2: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

2

Review

1. Learning and Writing: what, why, how, when

2. Statistics is a study of variation throughout a process

Page 3: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

3

Review Statistical Process

Process

• Ethics
• The nature of the question to be answered
• Expertise
• Design
• Sampling
• Measurement
• Description and Analysis (Making sense of data)
• Conclusions & Decision Making

[Diagram of the statistical process. Labels: Problem, Research Question, Ethics, Expertise, Design, Sampling, Measurement, Description & Analysis, Conclusions, VARIATION.]

Page 4: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

4

Where we do what!

In lectures the focus - What are we doing? Why are we doing it? When do we do it?

In labs the focus - How do we do it? Check definitions, do calculations by hand (simple cases) and in SPSS, and make choices about what to use

Page 5: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

5

Making sense of raw data

A shoe seller sets up on campus & collects some data about what size shoes students wear.

What do you see in this data?

Page 6: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

6

Making sense of raw data

What might we do to make sense out of the shoe size data?

Page 7: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

7

What might we do to make sense?

• Order the data
• Calculate the centre
– Mean: average score
– Median: middle score of ordered values
– Mode: most common score
• Find the spread
– Range: from minimum to maximum
• Look for outliers: unusual values
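The steps listed above can be tried on a small data set. The following is a minimal sketch using Python's standard `statistics` module; the shoe sizes are hypothetical values made up for illustration and are not the lecture's data.

```python
from statistics import mean, median, multimode

# Hypothetical shoe sizes for illustration only (not the lecture's data set).
sizes = [9, 7.5, 8, 10, 9, 11, 8, 9, 42, 6]

ordered = sorted(sizes)                 # order the data
centre_mean = mean(ordered)             # average score
centre_median = median(ordered)         # middle score of the ordered values
centre_modes = multimode(ordered)       # most common score(s)
data_range = ordered[-1] - ordered[0]   # maximum minus minimum

print(ordered)
print(centre_mean, centre_median, centre_modes, data_range)
```

Once the values are ordered, the unusual value at the top end is easy to spot, and it is also what makes the range so large.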

Page 8: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

8

Descriptive Statistics (mean, range)

N Minimum Maximum Mean Std. Deviation

SHOESIZE 150 4.0000 42.0000 9.816667 3.2291752

What do these statistics tell us?

Is this what the shoe seller needs to know?

Page 9: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

9

Descriptive Statistics (mean, range)

N Minimum Maximum Mean Std. Deviation

SHOESIZE 150 4.0000 42.0000 9.816667 3.2291752

Range = maximum less minimum = 42 - 4 = 38

What do these statistics tell us?

Is this what the shoe seller needs to know? No

There is an error in the data! The minimum size is 4 and the maximum is 42; the average is 9.82.

Page 10: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

10

Five number summary

SHOESIZE

N Valid           150
Percentiles 25      8.000000
            50      9.500000
            75     11.000000

Five number summary
• Minimum
• Maximum
• Lower quartile or 25th percentile: shoe size with 25% of shoe sizes below it
• Median, 50th percentile or middle shoe size
• Upper quartile or 75th percentile: shoe size with 75% of shoe sizes below it (i.e. 25% above it)

• The interquartile range: shoe size at the 75th percentile minus shoe size at the 25th percentile

• What is a percentile?
• How do you calculate quartiles?
• And is this what the shoe seller wants?

Page 11: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

11

Five number summary

SHOESIZE

N Valid           150
Percentiles 25      8.000000
            50      9.500000
            75     11.000000

Five number summary
• Minimum = 4
• Maximum = 42
• Lower quartile, 25% of shoe sizes below it = 8
• Median, 50% of shoe sizes below it = 9.5
• Upper quartile, 75% of shoe sizes below it = 11

• The interquartile range: 75th percentile minus 25th percentile = 11 - 8 = 3
– (50% of sizes between 8 and 11)

What is a percentile?
Does the shoe seller have what is needed?

Page 12: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

12

Percentiles - definition

• The kth percentile is a number that has k percent of the scores at or below it and (100-k)% above it

• The lower quartile has 25% of scores at or below that score
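The definition can be checked directly by counting. This is a minimal sketch; `percent_at_or_below` is a hypothetical helper name and the scores are made-up values for illustration.

```python
def percent_at_or_below(scores, value):
    """Percentage of scores that are less than or equal to `value`."""
    count = sum(1 for s in scores if s <= value)
    return 100 * count / len(scores)

# Hypothetical scores, for illustration only.
scores = [4, 5, 5, 6, 7, 8, 8, 9, 10, 12, 13, 15]

# 3 of the 12 scores are at or below 5, so 5 acts as a lower quartile here.
print(percent_at_or_below(scores, 5))
```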

Page 13: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

13

Quartiles

• Q1 is the value of the (n+3)/4th observation,

• and Q3 is the value of the (3n+1)/4th observation.

• Interpolate if necessary.
• There are other approaches to calculating quartiles, which may give different answers. If the answers are similar there is no problem.

• The interquartile range = Q3 - Q1

• If we have 17 heights, which observations do we need to get the upper and lower quartiles?
• Which observation will give the median?
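The position rules above can be sketched in code. This is one possible implementation under stated assumptions: the helper names are hypothetical, linear interpolation is used when a position falls between two observations, and the median is taken at the usual (n+1)/2 position, which the slide does not state explicitly.

```python
import math

def value_at_position(ordered, position):
    """Value at a 1-based position, interpolating linearly between neighbours."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(data):
    """Q1, median and Q3 from the (n+3)/4, (n+1)/2 and (3n+1)/4 positions."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 3) / 4)
    med = value_at_position(ordered, (n + 1) / 2)   # assumed median position
    q3 = value_at_position(ordered, (3 * n + 1) / 4)
    return q1, med, q3

n = 17
print((n + 3) / 4, (n + 1) / 2, (3 * n + 1) / 4)   # positions of Q1, median, Q3
```

For 17 observations the three positions are whole numbers, so no interpolation is needed. As the slide notes, other packages may use different rules and give slightly different answers.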

Page 14: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

14

Quartiles

[Figure: 17 ordered heights, observations numbered 1 to 17, with the minimum (130), lower quartile (147), median (150), upper quartile (166) and maximum (170) marked.]

The upper quartile is? 166
The lower quartile is? 147
The interquartile range is? 166 - 147 = 19

Page 15: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

15

What other statistics or graphs might inform the shoe-seller?

• Centre - mean, median

• Spread – Maximum-Minimum = Range

– Upper Quartile-Lower quartile = Interquartile range

– 75th percentile-25th percentile= Interquartile range

• Outliers

Page 16: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

16

Ordering the data

• Shoe Size

• 4

• 5

• 5

• 6

• :

• 42

• Ordering is often useful but we can do better

Page 17: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

17

Frequency or relative frequency table

Frequency table

SHOES SIZE

Class         Frequency   Rel freq   Cumulative
-∞ to …            0      0               0
4.0 to …          20      .1333333       20
8.0 to …         110      .733333       130
12.0 to …         19      .1266667      149
16.0 to …          0      0             149
20.0 to …          0      0             149
24.0 to …          0      0             149
28.0 to …          0      0             149
32.0 to …          0      0             149
36.0 to …          0      0             149
40.0 to …          1      .0066667      150
44.0 to ∞          0      0             150
total            150      1.000000

• What is wrong with this display?

Page 18: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

18

Frequency or relative frequency table

Frequency table

SHOES SIZE

Class         Frequency   Rel freq   Cumulative
-∞ to …            0      0               0
4.0 to …          20      .1333333       20
8.0 to …         110      .733333       130
12.0 to …         19      .1266667      149
16.0 to …          0      0             149
20.0 to …          0      0             149
24.0 to …          0      0             149
28.0 to …          0      0             149
32.0 to …          0      0             149
36.0 to …          0      0             149
40.0 to …          1      .0066667      150
44.0 to ∞          0      0             150
total            150      1.000000

• What is wrong with this display?

The data have been treated as if they were continuous. Some packages will do this, but we want the shoe sizes to be treated as discrete data.
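Treating the sizes as discrete means counting each distinct value rather than grouping values into class intervals. The sketch below shows one way to do this; `frequency_table` is a hypothetical helper and the sizes are made-up values, not the lecture's data.

```python
from collections import Counter

def frequency_table(values):
    """Rows of (value, frequency, relative frequency, cumulative frequency)."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], counts[value] / total, running))
    return rows

# Hypothetical shoe sizes, for illustration only.
sizes = [8, 8.5, 9, 9, 9, 9.5, 10, 10, 11, 8]

for value, freq, rel, cum in frequency_table(sizes):
    print(f"{value:>5} {freq:>3} {rel:6.2f} {cum:>3}")
```

Each row is one shoe size, so half sizes keep their own counts instead of being absorbed into an interval.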

Page 19: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

19

Frequency distribution (order plus count)

We still have an error (42).
But we have the frequency (count) of each shoe size.
What might be better for the shoe seller?

Page 20: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

20

Percentages of each size

Why might this be useful rather than frequency?

Page 21: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

21

Percentages of each size

Why might this be useful rather than frequency?

• We only had a sample, so percentages (or proportions) suggest what share of students wear each size.

•Is this all the shoeseller needs?

Page 22: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

22

There are better ways of looking at distributions

• What else might we do?

Page 23: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

23

Stem-and leaf plot (with error)

[Figure: stem-and-leaf plot with the erroneous value 42 included. Stems: 10's, leaves: 1's.]

• Some packages (SPSS) cut off the outliers and list them as extremes.

• See if you can find a definition for an extreme as used in SPSS and an outlier from the text.

• Different packages and different procedures may use different definitions, so check.
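As a starting point for comparing definitions, here is a sketch of one widely used convention: flag values more than 1.5 interquartile ranges beyond the quartiles. This is an assumption for illustration only and is not necessarily the definition used by the textbook or by SPSS; `iqr_fences` is a hypothetical helper name.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper cut-offs: k interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for SHOESIZE in the five number summary.
low, high = iqr_fences(8, 11)
print(low, high)
print(42 > high)   # is the largest recorded size beyond the upper cut-off?
```

Changing `k` changes which points are flagged, which is one reason different packages can disagree.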

Page 24: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

24

What do we do with outliers?

[Figure: stem-and-leaf plot with the erroneous value 42 included. Stems: 10's, leaves: 1's.]

Page 25: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

25

What do we do with outliers?

[Figure: stem-and-leaf plot with the erroneous value 42 included. Stems: 10's, leaves: 1's.]

• Know the context to see what values are possible
• Check the original data to see if it is a data entry error
• See if it is in different units and transform to the appropriate unit
• If it is an error and you do not know what it should be, delete it and make a note
• If there is no reason to conclude it is an error, leave it in
• Sometimes analyse with the point in and the point out of the data set
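The last suggestion, analysing with the point in and out, can be sketched as follows. The sizes are hypothetical, with 42 standing in for the suspect value.

```python
from statistics import mean, median

# Hypothetical shoe sizes; 42 plays the role of the suspect value.
sizes = [7, 8, 8.5, 9, 9, 10, 11, 42]
without_suspect = [s for s in sizes if s != 42]

# Compare mean, median and range with the point in and with it out.
for label, data in (("with 42", sizes), ("without 42", without_suspect)):
    print(label, round(mean(data), 2), median(data), max(data) - min(data))
```

In this made-up example the median is unchanged while the mean and range move noticeably, which shows how much of each summary depends on the single suspect point.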

Page 26: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

26

Stem-and-leaf plot (42 removed)

[Figure: stem-and-leaf plot of the shoe sizes with the value 42 removed. Stems: 1's (4 to 15), leaves: 0.1's.]

• What does it reveal?

• Could it be better?

Page 27: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

27

Stem-and-leaf plot (42 removed)

[Figure: stem-and-leaf plot of the shoe sizes with the value 42 removed. Stems: 1's (4 to 15), leaves: 0.1's.]

• What does it reveal?

• Could it be better?

• Change stems to focus on whole and half sizes.

• We should have transformed the 42. This is the difference between a lecture and data analysis: I deleted it!
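A display with stems in 1's and leaves in 0.1's can be built in a few lines. This is a minimal sketch; `stem_and_leaf` is a hypothetical helper, the sizes are made-up values, and stems with no observations are simply omitted rather than shown as empty rows.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each whole-number stem to its sorted tenths leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # work in tenths to avoid float noise
        plot[tenths // 10].append(tenths % 10)
    return dict(plot)

# Hypothetical shoe sizes, for illustration only.
sizes = [8, 8.5, 9, 9, 9.5, 10, 10, 10.5, 11, 7.5]

for stem, leaves in stem_and_leaf(sizes).items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

With sizes recorded only as whole and half sizes, every leaf is a 0 or a 5, which is why changing the stems to separate whole from half sizes can be more revealing.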

Page 28: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

28

Stem-and-leaf with different stems

• What do we notice now?

• Do we have what the shoe seller needs?

Page 29: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

29

Stem-and-leaf with different stems

• What do we notice now?

• There is a distribution within a distribution with fewer half sizes

• Do we have what the shoe seller needs?

• We need male and female data (Next lecture)

Page 30: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

30

Graphical Excellence

• Convey the message about the data

• Axes, units, variable names, figure labels

• DO NOT

– Distort the data

– Use pie charts (there is always a better chart)

– Use more dimensions than necessary, 3D instead of 2D

– Use unnecessary pattern, fill, ink, decoration

Page 31: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

31

To reveal

Centre

Spread

Outliers

Distribution

Patterns

Anything unusual

Comparisons (next lecture)

And more

But there are choices to be made

Page 32: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

32

Centre

• Mean

• Median

• Mode

• Trimmed Mean

Mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Median: FIRST arrange the sample values from smallest to largest.
– n odd: the median of 8, 7, 9 is the middle of the ordered scores 7, 8, 9, which is 8
– n even: the median of 4, 7, 8, 9 is (7+8)/2 = 7.5

Mode: the most common score in the data set, e.g. for 1, 2, 3, 3, 4, 5, 6 the mode is 3

Trimmed mean: e.g. a diving score at the Olympics is the average of the judges' scores after tossing out the highest and the lowest scores
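The trimmed mean, as described for the diving example, can be sketched like this. The function name and the judges' scores are hypothetical, and real competition scoring rules may involve further steps not covered here.

```python
def trimmed_mean(scores):
    """Mean after discarding one highest and one lowest score."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    kept = sorted(scores)[1:-1]           # drop the lowest and the highest
    return sum(kept) / len(kept)

# Hypothetical judges' scores, for illustration only.
judges = [7.5, 8.0, 8.0, 8.5, 6.0, 9.5, 8.0]
print(trimmed_mean(judges))
```

Here the 6.0 and the 9.5 are discarded, so a single unusually harsh or generous judge does not move the result.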

Page 33: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

33

Question: mean vs median

• Data A: 60, 2, 3, 5    Data B: 6, 2, 3, 5
• Mean A = 17.5    Mean B = 4
• Median A = 4    Median B = 4
• Which measure best typifies data set A? Why?
• Which measure best typifies data set B? Why?

Page 34: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

34

Question: mean vs median

• Data A: 60, 2, 3, 5    Data B: 6, 2, 3, 5
• Mean A = 17.5    Mean B = 4
• Median A = 4    Median B = 4
• Which measure best typifies data set A? Why?
• Which measure best typifies data set B? Why?

For A, the outlier 60 suggests using the median (4), as the mean (17.5) is dragged up by the outlier.

For B, both are the same. The median (4) uses only the two middle points, whereas the mean (4) uses all the data.
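The figures for data sets A and B can be reproduced with Python's standard `statistics` module:

```python
from statistics import mean, median

data_a = [60, 2, 3, 5]
data_b = [6, 2, 3, 5]

# Mean and median of each data set, in that order.
print(mean(data_a), median(data_a))
print(mean(data_b), median(data_b))
```

Replacing the 6 in B with 60 changes only one value, yet the mean more than quadruples while the median stays where it was.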

Page 35: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

35

Question: mean vs median

• In what sense are the mean and median the same?

• In what sense are the mean and median different?

Page 36: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

36

Question: mean vs median

• In what sense are the mean and median the same?

• In what sense are the mean and median different?

They are both measures of the centre

They may give different numerical values, and for different data sets one may be better as a measure than the other, or both may be required

Page 37: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

37

Making Choices between mean & median

• The mean uses all the information in the sample, because each value is added in the sum.
– The mean is subject to error if spurious values are entered.
– The median is less affected by “wild” values; we say it is robust.

• If the mean is similar to the median
– use the mean as it uses all data.
– it is often easier to work with the mean

• If they are different because of a non-symmetric distribution
– it can be useful to report both

• The context of what the data are used for may also determine what is an appropriate measure

Page 38: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

38

Measures of Spread

• Range = maximum value - minimum value

• Interquartile range = Upper quartile - Lower quartile = Q3 - Q1

• Sums of Squares

$$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

• Variance ($s^2$)

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

• Standard Deviation

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
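The three formulas build on one another: the sum of squares is divided by n - 1 to give the variance, and the square root of the variance is the standard deviation. The sketch below computes each step by hand and compares the results with the standard library; the data values are hypothetical.

```python
import math
from statistics import stdev, variance

def sum_of_squares(xs):
    """Sum of squared deviations from the sample mean."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)

# Hypothetical data, for illustration only.
xs = [2, 4, 4, 4, 5, 5, 7, 9]

ss = sum_of_squares(xs)
s2 = ss / (len(xs) - 1)      # sample variance
s = math.sqrt(s2)            # sample standard deviation
print(ss, s2, s)

# Cross-check against the library's sample variance and standard deviation.
print(math.isclose(s2, variance(xs)), math.isclose(s, stdev(xs)))
```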

Page 39: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

39

Use of standard deviation

• The mean and standard deviation give information about where most of the distribution of values is to be found.

• For many distributions, the range from

mean - 2 standard deviations to mean + 2 standard deviations

(mean ± 2SD)

contains approximately 95% of the distribution.
• (The very least that this range can contain is 75% of the distribution.)
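The 75% figure is the bound given by Chebyshev's inequality: at least 1 - 1/k² of any distribution lies within k standard deviations of the mean, which is 75% for k = 2. The sketch below compares that bound with the observed fraction for a small hypothetical data set. It uses the population standard deviation (`pstdev`), for which the bound holds exactly; the helper names are assumptions for illustration.

```python
from statistics import mean, pstdev

def least_fraction_within(k):
    """Chebyshev lower bound on the fraction within k standard deviations."""
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of values within k standard deviations of the mean."""
    centre, sd = mean(data), pstdev(data)
    return sum(1 for x in data if abs(x - centre) <= k * sd) / len(data)

# Hypothetical data with one wild value, for illustration only.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
print(least_fraction_within(2), fraction_within(data, 2))
```

Even with the wild value included, the observed fraction stays above the guaranteed minimum.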

Page 40: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

40

Criteria for a good measure of spread

• Whatever measure of variability (or spread) is used, it should not be affected by adding a constant to each value so as to change the centre (or location)

• If there is spread in the data it should indicate this
• Should make sense in the context used
• Should be robust, not influenced by outliers or extreme points
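The first criterion can be checked numerically: adding a constant moves the centre but should leave the spread alone. A minimal sketch with hypothetical data:

```python
from statistics import mean, stdev

# Hypothetical data, for illustration only.
data = [4, 7, 8, 9, 12]
shifted = [x + 100 for x in data]   # add a constant to every value

print(mean(data), mean(shifted))                            # the centre moves
print(stdev(data), stdev(shifted))                          # the spread does not
print(max(data) - min(data), max(shifted) - min(shifted))   # nor does the range
```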

Page 41: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

41

Undesirable features of measures of spread

• Sensitive to outliers
• Does not use all data
– e.g. range, based only on two scores
• Difficult to understand
– e.g. sum of squares in this context, as the answer is very big and gets bigger with every additional data point. But it is useful in other contexts

Page 42: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

42

Revealing distributions

• Frequency Distribution Table
• Stem-and-Leaf
• Histograms
• Box-and-whiskers

Page 43: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

43

Box-and-Whiskers plots

• Often just called box plots, they give a pictorial summary of the data for a single variable.

• They use the five-number summary:

– minimum value,

– Q1,

– median,

– Q3,

– maximum value

Page 44: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

44

• Example: If minimum = 3, Q1 = 6, median = 10, Q3 = 12, maximum = 16, the box plot would look like

[Figure: horizontal box plot drawn above a scale marked 2, 4, 6, 8, 10, 12, 14, 16.]

• You must draw a scale for the box plot.

Page 45: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

45

• In a horizontal box plot, a horizontal axis shows the scale. The box’s left and right boundaries are Q1 and Q3, and an inner line shows the median.

• Whiskers are drawn outwards from the box to the minimum and maximum values.

• Often the sample mean is also shown.
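The description above can be turned into a rough text drawing. This is only a sketch with hypothetical helper names, using the five-number summary from the earlier example (minimum 3, Q1 6, median 10, Q3 12, maximum 16). It draws two text columns per unit and does not mark the sample mean.

```python
def column(value, chars_per_unit=2):
    """Text column for a data value (hypothetical helper)."""
    return int(round(value * chars_per_unit))

def ascii_boxplot(minimum, q1, median, q3, maximum):
    """One-line box plot built from the five-number summary."""
    row = [" "] * (column(maximum) + 1)
    for c in range(column(minimum), column(maximum) + 1):
        row[c] = "-"                      # whiskers span minimum to maximum
    for c in range(column(q1), column(q3) + 1):
        row[c] = "="                      # box spans Q1 to Q3
    row[column(minimum)] = row[column(maximum)] = "|"
    row[column(q1)], row[column(q3)] = "[", "]"
    row[column(median)] = "M"             # inner line marks the median
    return "".join(row)

def scale_line(ticks):
    """Axis labels placed under the matching columns."""
    ticks = list(ticks)
    row = [" "] * (column(max(ticks)) + len(str(max(ticks))))
    for t in ticks:
        for offset, ch in enumerate(str(t)):
            row[column(t) + offset] = ch
    return "".join(row)

print(ascii_boxplot(3, 6, 10, 12, 16))
print(scale_line(range(2, 17, 2)))
```

The second line is the scale, which the slides insist must accompany any box plot.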

Page 46: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

46

• What values give rise to the box plot below?

• If minimum = , Q1 = ,

• median = , Q3 = , maximum = ,

• the box plot would look like

[Figure: horizontal box plot drawn above a scale marked 2, 4, 6, 8, 10, 12, 14, 16.]

• You must draw a scale for the box plot.

Page 47: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

47

What do you want to see in data?

• Information
• Meaning
• We must turn data into information in order to have meaning

Page 48: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

48

What can we see in data?

Location (centre)

Spread

Shape

Outliers

Unusual patterns

Gaps, clusters

How do batches differ?

Page 49: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

49

Tools for making meaning from data

Ordering data

Dot plots & jittered dot plots

Stem-and-leaf plots

Histograms, Boxplots, Bar charts

Pie charts

Frequency tables

Numerical summaries

Page 50: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

50

Selecting the tool depends on

The question asked

How the variable is measured

The structure of the data

Utility of the tool

More in the next lecture and labs

Page 51: 1 STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter.

51

Homework

• Textbook reading Utts & Heckard (2004) Chapter 2

Or
• Textbook reading Moore and McCabe pp38-55.

Or
• Textbook reading, Griffiths, Stirling and Weldon, 1998, Chapters 1, 2, 6 (pp. )
• Complete lab and preparation for next week's lab.