
Chapter 2: Descriptive Statistics

2.1 Organizing Qualitative Data

2.2 Organizing Quantitative Data

2.3 Additional Displays

2.4 Misrepresentations of Data


September 10, 2008


Categorical Variables

• Each observation (data point) for a categorical variable belongs to exactly one of a set of categories

• Examples of categorical variables:

– Gender (Categories: male or female)

– Religious Affiliation (Protestant, Catholic, Jew, Muslim, etc.)

– Home State or Country (NJ, AR, CA, FL, Canada, etc.)

– Favorite Singer (Elvis, Sting, Sinatra, etc.)

– Eye Color (brown, green, blue, hazel, black)

– Favorite Type of Music (jazz, country, rock, etc.)

Section 2.1


Frequency Tables for Categorical Data

Consider a population that has $N$ categorical variables: $\{C_1, C_2, \ldots, C_N\}$. For example, consider the population of freshmen at Vanderbilt during the present semester and the categorical variables for this population: $\{C_1, C_2, C_3\}$ = {gender, state, favorite color}.

For each categorical variable $C_j$, we list the possible categories for this variable; say it can have $k$ values, $\{x_{j1}, x_{j2}, \ldots, x_{jk}\}$. For example, $C_1$ has the possible categories {male, female}, i.e., $k = 2$; $C_2$ has 51 possible categories (50 states + other), i.e., $k = 51$.

Definition: For a population or a sample and a particular categorical variable, the number of times that the variable falls in a particular category is called the frequency of this category. The category that has the highest frequency is called the mode for the variable. A table composed of the frequencies for the categories is sometimes called the frequency distribution or simply the distribution of the categorical variable.

Remark: It makes sense to construct frequency tables for a discrete quantitative variable since we can consider each discrete value of the variable a category.


Relative Frequency

Example: The categorical variable is the color of a ball in a population. A sample of 10 balls is drawn, each of which is red, green, or blue.

Category Frequency Relative Frequency

Red 5 5/10 = 0.5

Green 2 2/10 = 0.2

Blue 3 3/10 = 0.3

Definition: Suppose that a categorical variable has $N$ categories. Furthermore, suppose category $k$ has a frequency of $f_k$, and $n = f_1 + f_2 + \cdots + f_N$ is the total number of data points in the sample. Then the relative frequency of the $k$th category is defined as $r_k = f_k / n$, $k = 1, 2, \ldots, N$. The relative frequency is also called the proportion.
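As a minimal sketch of this definition, the following Python snippet rebuilds the ball example and divides each frequency by the sample size. The function name `relative_frequencies` and the list `sample` are illustrative choices, not part of the slides.

```python
from collections import Counter

# Sample from the ball example: 5 red, 2 green, 3 blue (order is arbitrary).
sample = ["red"] * 5 + ["green"] * 2 + ["blue"] * 3

def relative_frequencies(observations):
    """Return {category: (frequency, relative frequency)} for a sample."""
    counts = Counter(observations)
    n = sum(counts.values())  # total number of data points
    return {category: (f, f / n) for category, f in counts.items()}

table = relative_frequencies(sample)
for category, (f, r) in table.items():
    print(f"{category:<6} {f:>2}  {f}/{len(sample)} = {r:.1f}")
```

The relative frequencies always sum to 1, which is a quick check on any table built this way.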


Example

Consider the population of vehicles that are parked in the 25th Avenue Garage and consider the categorical variable for the type of transmission (automatic or manual) in the vehicles. One hundred cars were surveyed. We construct a frequency table.

Category Automatic Manual

Number of Vehicles 73 27

The frequency of automatics is 73 and the frequency of manuals is 27. The mode for the categorical variable and sample is the category automatic, since it has the highest frequency (73). The relative frequency of automatics is 73/100 = 0.73 (73%).


Remarks on Frequency Tables

• A method of organizing data

• Lists all possible categories for a variable along with the number of observations for each value of the variable.

• In addition, we sometimes add columns for the proportion and percentage for each value of the variable.


Example

Florida: 289/735 ≈ 0.3932 and (289/735) × 100 ≈ 39.3%


Example (categorical)

We are interested in the dominant color of cars that are parked on the Vanderbilt campus. Suppose we go to the 25th Avenue Garage and survey the color (black, white, red, blue, green, other) of 110 cars for a sample. In the table below we summarize the counts of this categorical variable.

Color Frequency

Black 20

White 10

Red 15

Blue 35

Green 10

Other 20


Bar Chart

Color Frequency

Black 20

White 10

Red 15

Blue 35

Green 10

Other 20

Bar charts can also be constructed using Excel.

Definition: A bar chart for a categorical variable is a series of horizontal or vertical bars, with the height (or length) of each bar representing the frequency of a particular category of the variable.


Bar Chart for Relative Frequency

Remark: Instead of the bars representing the frequency of a category, they could represent the relative frequency.

Color Frequency Relative Frequency

Black 20 0.182

White 10 0.091

Red 15 0.136

Blue 35 0.318

Green 10 0.091

Other 20 0.182


Pie Chart

Color Frequency

Black 20

White 10

Red 15

Blue 35

Green 10

Other 20

Definition: A pie chart for a categorical variable is a circle divided into sectors, with the size of each sector proportional to the frequency of a category of the variable.
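One way to make the definition concrete is to compute the central angle of each sector as 360° times the relative frequency. The sketch below does this for the car-color counts; the helper name `sector_angles` is an illustrative assumption.

```python
# Car-color counts from the frequency table in this section.
color_counts = {"Black": 20, "White": 10, "Red": 15,
                "Blue": 35, "Green": 10, "Other": 20}

def sector_angles(counts):
    """Return the central angle (in degrees) of each pie-chart sector."""
    total = sum(counts.values())
    return {category: 360.0 * f / total for category, f in counts.items()}

angles = sector_angles(color_counts)
for category, angle in angles.items():
    print(f"{category:<6} {angle:6.1f} degrees")
```

The angles add up to 360 degrees, so the sectors exactly fill the circle.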


Variations of Pie Chart


Pie Chart with Excel

Create a pie chart for the following data using Excel.

Color Frequency

Black 20

White 10

Red 15

Blue 35

Green 10

Other 20


Example (Doctorates)

Year  Physical Sciences  Engineering  Life Sciences  Social Sciences  Humanities  Education
1983  4425               2781         5553           6096             3500        7174
1993  6496               5698         7395           6545             4481        6689
2003  5963               5265         8369           6777             5412        6627

Doctorate Recipients: 1983, 1993, 2003. For each year we have six categories: type of degree.


(continued)

Green - 1983

Red - 1993

Orange - 2003


Pareto Charts

In a bar chart, if we order the bars (categories) from tallest to shortest, then this bar chart is called a Pareto Chart. The reason for doing this is that the “most important” category appears first.

Definition: A Pareto Chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency.
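The ordering step behind a Pareto chart can be sketched in a few lines of Python. The example reuses the car-color counts from earlier in this section and prints a rough text bar chart; the function name `pareto_order` and the scale of one `#` per 5 cars are illustrative assumptions.

```python
# Car-color counts from the frequency table earlier in this section.
color_counts = {"Black": 20, "White": 10, "Red": 15,
                "Blue": 35, "Green": 10, "Other": 20}

def pareto_order(counts):
    """Return (category, frequency) pairs sorted from most to least frequent."""
    # sorted() is stable, so categories with equal counts keep their table order.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

ordered = pareto_order(color_counts)
for category, f in ordered:
    print(f"{category:<6} {'#' * (f // 5)} {f}")
```

Ties (here Black and Other, or White and Green) have no prescribed order in the definition, so keeping the table order is only one reasonable choice.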


Example

Consider the following sample composed of Vanderbilt students who are studying at least one foreign language.

Spanish Chinese Spanish Spanish Spanish

Chinese German Spanish Spanish French

Spanish Spanish Japanese Latin Spanish

German German Spanish Italian Spanish

Italian Japanese Chinese Spanish French

Spanish Spanish Russian Latin French

(a) Construct the frequency distribution for this sample.

(b) Construct the relative frequency distribution.

(c) Construct the bar chart for the frequency.

(d) Construct the bar chart for the relative frequency.

(e) What is the mode of the frequency distribution?


Solution

Category Frequency Relative Frequency

French 3 3/30 = 0.100

Latin 2 2/30 = 0.067

Russian 1 1/30 = 0.033

Japanese 2 2/30 = 0.067

Italian 2 2/30 = 0.067

German 3 3/30 = 0.100

Chinese 3 3/30 = 0.100

Spanish 14 14/30 = 0.467
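The solution table can be reproduced from the raw sample with a short Python sketch. It uses `collections.Counter` from the standard library; the variable names are illustrative.

```python
from collections import Counter

# The 30 responses from the example, entered row by row.
sample = (
    "Spanish Chinese Spanish Spanish Spanish "
    "Chinese German Spanish Spanish French "
    "Spanish Spanish Japanese Latin Spanish "
    "German German Spanish Italian Spanish "
    "Italian Japanese Chinese Spanish French "
    "Spanish Spanish Russian Latin French"
).split()

counts = Counter(sample)                      # frequency distribution
n = len(sample)                               # sample size
mode, mode_frequency = counts.most_common(1)[0]

for language, f in counts.most_common():
    print(f"{language:<9} {f:>2}  {f}/{n} = {f / n:.3f}")
print("mode:", mode)
```

This also answers part (e): the mode is the category Spanish, which has the highest frequency.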


Organizing Quantitative Data

Section 2.2

Two Types of Quantitative Data

• Discrete

– Tables

– Frequency Tables

– Relative Frequency Tables

– Dot Plots

– Stem-and-Leaf Plots

– Histograms

• Continuous

– Histograms


Tables and Discrete Data

Remark: For the purpose of building frequency tables, there is essentially no difference between categorical data and discrete quantitative data: each distinct number can be treated as a category.

Example: Consider a discrete set of quantitative data:

{1,-1,1,0,0,2,3,1,0,2} .

We can construct a frequency table for the numbers in this set of numbers.

Data Point Frequency Relative Frequency

-1 1 1/10 = 0.1

0 3 3/10 = 0.3

1 3 3/10 = 0.3

2 2 2/10 = 0.2

3 1 1/10 = 0.1

Sum 10 1.0


Frequency Chart


Histograms

Definition: A histogram is a special type of bar chart that shows the frequency of quantitative data that is separated into intervals (bins or classes).


Example

Construct a histogram for the data {1.1, 1.8, 0.9, 0.2, 2.5, 1.3, 2.1, 2.1, 2.9, 2.0}, using the bins [0,1), [1,2), [2,3).

[0,1): 0.9, 0.2 (frequency = 2)

[1,2): 1.1, 1.8, 1.3 (frequency = 3)

[2,3): 2.5, 2.1, 2.1, 2.9, 2.0 (frequency = 5)
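The counting step of this example can be sketched as follows. Each bin is half-open, so a value equal to a right endpoint falls in the next bin. The function name `bin_counts` is an illustrative assumption, and values outside all bins are simply ignored in this sketch.

```python
data = [1.1, 1.8, 0.9, 0.2, 2.5, 1.3, 2.1, 2.1, 2.9, 2.0]
edges = [0, 1, 2, 3]  # bins [0,1), [1,2), [2,3)

def bin_counts(values, edges):
    """Count the values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break  # each value belongs to at most one bin
    return counts

frequencies = bin_counts(data, edges)
for lo, hi, f in zip(edges, edges[1:], frequencies):
    print(f"[{lo},{hi}): frequency = {f}")
```

Note that 2.0 lands in [2,3), not [1,2), because the right endpoint of each bin is excluded.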


Dot Plots

• Primarily for discrete quantitative data

• Similar to a bar chart or histogram

• Includes information about frequency, i.e., how many times a data point appears as a single number or in a range of values.

Definition: A dot plot is a chart for discrete quantitative data in which each observation is represented by a dot and the possible values of the data are represented along the horizontal axis.


Example (quantitative)

Suppose we stand at the entrance of the Math Building and count the number of people entering over a 10-minute period in 1-minute increments. Below we have a table that summarizes our sample and the resulting dot plot.

Time Interval Count

1 (0-1) 3

2 (1-2) 1

5 (4-5) 3

6 (5-6) 4

10 (9-10) 7

The table omits the intervals during which no people entered.


Example

This table summarizes the amount of sodium (mg) and sugar (g) for some popular breakfast cereals. It also characterizes the type (adult or child) of cereal. Hence, we have three pieces of data (variables) for each cereal: 2 quantitative and 1 categorical. We will use the dot plot for the sodium.


Dot Plot of Sodium

Notice that a dot plot shows how often each number occurs in a numerical data sample, e.g., 70 occurs once and 200 twice.


Stem-and-Leaf Plots

• A stem-and-leaf plot organizes data to show its shape and distribution.

• Each data point is represented by a stem and a leaf.

• Usually, the leaf is the last digit of the numerical data point and the other digits to the left of the leaf form the stem. For example, if 9834 is a data point, then 4 is the leaf and 983 is the stem (stem | leaf = 983 | 4).

• In a set of data, a stem may have several leaves.

• For one-digit data (0,1,2,…,9), we can represent the data as 00,01,…,09. For a data point 0X, the leaf is X and the stem is 0.

• We usually organize by stems.

• It is sometimes necessary to modify this representation when large numbers are involved. In this case the stem will represent a class of numbers of the form $d \times 10^s$.


Example

Suppose a sample contains the following data points: {9, 15, 17, 24, 50, 65, 101, 170, 171}.

Number Stem Leaf

9 = 09 0 9

15 1 5

17 1 7

24 2 4

50 5 0

65 6 5

101 10 1

170 17 0

171 17 1

Stems Leaves

0 9

1 57

2 4

3

4

5 0

6 5

7

8

9

10 1

11

12

13

14

15

16

17 01
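The construction above can be sketched in Python for nonnegative integers: the leaf is the last digit (`v % 10`) and the stem is everything to its left (`v // 10`). The function name `stem_and_leaf` is an illustrative assumption; stems with no data are kept so the gaps in the plot stay visible.

```python
sample = [9, 15, 17, 24, 50, 65, 101, 170, 171]

def stem_and_leaf(values):
    """Map each stem to its sorted list of leaves (nonnegative integers only)."""
    lowest, highest = min(values) // 10, max(values) // 10
    rows = {stem: [] for stem in range(lowest, highest + 1)}  # keep empty stems
    for v in sorted(values):
        rows[v // 10].append(v % 10)  # stem = leading digits, leaf = last digit
    return rows

plot = stem_and_leaf(sample)
for stem, leaves in plot.items():
    print(f"{stem:>2} {''.join(str(leaf) for leaf in leaves)}")
```

For data with one decimal place, the same function could be applied after multiplying each value by 10 and rounding to an integer.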


Example

Construct a Stem-and-Leaf plot for the data: {5.4, 4.3, 4.1, 8.6, 6.0, 7.9, 9.1, 6.1, 3.1, 14.5, 12.5, 8.3, 10.1, 8.2, 6.8, 10.9, 2.3, 1.0, 8.3, 8.9, 6.1, 6.5, 6.0, 9.4, 0.1, 13.9, 3.7, 10.1, 9.9, 4.9, 6.4, 10.3, 2.3, 11.9, 11.7, 12.1, 9.8, 7.8, 2.9, 6.7}.

We ignore the decimal point or, alternatively, multiply each number by 10.

Stems Leaves

0 1

1 0

2 339

3 17

4 139

5 4

6 00114578

7 89

8 23369

9 1489

10 1139

11 79

12 15

13 9

14 5


Stem-and-leaf Plots and Frequency

Consider a sample {101,103,104,108,109}. If we construct the stem-and-leaf plot for this data, then there is a single stem (10) and five leaves (1,3,4,8,9). Hence, the number of leaves, 5, is the frequency with which the data fall in the interval [100,109]. We can conclude that the number of leaves on a stem equals the number of data points that fall in the corresponding interval of 10 consecutive integers.


Bottom Line

Dot plots and stem-and-leaf plots segregate the data into bins (or numerical ranges or classes) and show the frequency of data within those classes. This is useful information, but these plots are not practical when a sample has a large number of data points.


Remark: Frequency Tables & Dot Plots

Sodium Data: 0, 210, 260, 125, 220, 290, 210, 140, 220, 200, 125, 170, 250, 150, 170, 70, 230, 200, 290, 180

The frequency of a sodium level or interval can be read from the dot plot.

A frequency table and a dot plot give basically the same information.


Continuous Data described by Histograms

Definition: A histogram is a type of bar chart that gives the frequencies or relative frequencies of occurrences of a quantitative variable (either discrete or continuous) in specified intervals.

Interval Frequency

0-39 1

40-79 1

80-119 0

120-159 4

160-199 3

200-239 7

240-279 2

280-319 2
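For integer data and equal-width classes that start at 0, the class index of a value can be found by integer division. The sketch below applies this to the 20 sodium values with a class width of 40; it assumes the first sodium entry in the listing is 0, and the function name `class_counts` is illustrative.

```python
# Sodium values (mg) for the 20 cereals; the first entry is assumed to be 0.
sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]
width = 40  # class width: 0-39, 40-79, ..., 280-319

def class_counts(values, width):
    """Count nonnegative integers in classes [0, width), [width, 2*width), ..."""
    counts = [0] * (max(values) // width + 1)
    for v in values:
        counts[v // width] += 1  # integer division gives the class index
    return counts

frequencies = class_counts(sodium, width)
for i, f in enumerate(frequencies):
    print(f"{i * width}-{(i + 1) * width - 1}: {f}")
```

Under that assumption the counts agree with the interval table, including the empty class 80-119.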


Construction of Histograms

• Define intervals of equal width for the variable under consideration. For example, if the data in our sample are integers ranging from 0 to 50, we might choose the intervals (bins) [0,9], [10,19], [20,29], [30,39], [40,49], [50,59]. The intervals or bins are called classes. The length of a class is called the class width.

• Count the number of data points in each bin. In the above example, we would obtain 6 nonnegative integer counts, one for each bin.

• Construct a bar chart with the intervals specifying the width of the bars and the frequencies giving the height of the bars. Note that the width of the bar is arbitrary as long as we know the length of the intervals over which we do the frequency counting.

• The heights of the bars in the histogram are called the distribution of the sample.

• Histograms could be used for categorical data.

• Remark: Instead of using the frequency counts, we could use the fraction of the total sample size (percentage) as the height.


Example

Construct a histogram (using percentages) for the following sample: {1.1, -1.0, 2.1, 3.5, -2.1, 0.9, 0.75, -0.5, 0.25, 4.5, 4.1}.

Interval  Frequency  Fraction
[-3,-2)   1          1/11 ≈ 0.091
[-2,-1)   0          0/11 = 0
[-1,0)    2          2/11 ≈ 0.182
[0,1)     3          3/11 ≈ 0.273
[1,2)     1          1/11 ≈ 0.091
[2,3)     1          1/11 ≈ 0.091
[3,4)     1          1/11 ≈ 0.091
[4,5)     2          2/11 ≈ 0.182


Histogram for Example


Example (IQ Scores)

IQ Range Frequency

60-69 2

70-79 3

80-89 13

90-99 42

100-109 58

110-119 40

120-129 31

130-139 8

140-149 2

150-159 1


(continued)

IQ Range Frequency

60-69 2

70-79 3

80-89 13

90-99 42

100-109 58

110-119 40

120-129 31

130-139 8

140-149 2

150-159 1

How many students were sampled?

What is the width of the intervals?

Which range of IQ had the highest frequency?

Which range of IQ had the lowest frequency?


Dot, Stem-and-leaf, or Histogram?

• Dot plot and Stem-and-Leaf plot:

– Useful for showing information about small data sets.

– Shows actual data.

• Histogram:

– Useful for showing information about large data sets.

– Can be used for continuous or discrete data.

– Most compact plot.

– Has flexibility in defining intervals.


The Shape of the Distribution

For a histogram, we can associate the graph of a function by drawing a smooth curve through the midpoints of the tops of the bars. The shape of this curve can be used to describe the shape of the histogram.


Unimodal and Bimodal

Unimodal: one hump

Bimodal: two humps


Skewed Distributions

Skewed to the right Skewed to the left

Symmetric


Distribution Terminology

• The value (class) at which a histogram has its highest bar is called the mode of the distribution. Hence, the terminology unimodal and bimodal.

• A distribution is said to be symmetric if there is a vertical line that separates the distribution into two mirror-image pieces.

• A distribution that is not symmetric is said to be skewed.

• The “ends” of a distribution are called the tails of the distribution.


Outliers

A bar that is completely separated from the cluster of bars is called an outlier.


Hours of TV Watching


Wechsler Adult Intelligence Scale (IQ)

Range %

<55 0.15

55-70 1.85

70-85 13.0

85-100 35.0

100-115 33.0

115-130 15.0

130-145 1.80

>145 0.20

The distribution is almost symmetric.


Additional Displays for Quantitative Data

Section 2.3

Alternative to histograms for quantitative data: Frequency Polygons.

Definition: Suppose that an interval, [a,b), represents a class for a set of quantitative data. The class midpoint is defined as (a+b)/2.

Definition: A frequency polygon is a graph that is constructed from the class midpoints and their frequencies.

Bins (class) Class Midpoint Frequency

[a,b) (a+b)/2 f

… … …

… … …
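To illustrate the two definitions, the sketch below computes the class midpoints and pairs them with their frequencies; these pairs are the vertices of the frequency polygon. It reuses the bins [0,1), [1,2), [2,3) with frequencies 2, 3, 5 from the earlier histogram example, and the helper name `class_midpoints` is an illustrative assumption.

```python
edges = [0, 1, 2, 3]     # bins [0,1), [1,2), [2,3)
frequencies = [2, 3, 5]  # frequencies from the earlier histogram example

def class_midpoints(edges):
    """Return the midpoint (a + b) / 2 of each class [a, b)."""
    return [(a + b) / 2 for a, b in zip(edges, edges[1:])]

# Vertices of the frequency polygon: (class midpoint, frequency).
polygon = list(zip(class_midpoints(edges), frequencies))
for midpoint, f in polygon:
    print(f"midpoint {midpoint}: frequency {f}")
```

Connecting these points with line segments gives the polygon. Some texts also add a zero-frequency point one class width beyond each end, which is omitted here.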


Example

Mathematica Demonstration


Cumulative Frequency Distribution

Suppose that $\{f_1, f_2, \ldots, f_k\}$ is the set of frequencies for some data set of size $n$. That is, suppose that we subdivide the interval between the largest and smallest values of the data set into $k$ categories (subintervals). We then count the number of data points that lie in each subinterval. The cumulative frequency of category $j$ is defined as $f_1 + f_2 + \cdots + f_j = \sum_{i=1}^{j} f_i$. Note that the cumulative frequency of category $k$ is $f_1 + f_2 + \cdots + f_k = n$.


Cumulative Frequency


Example

data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}

bins = [0,1), [1,2), [2,3), [3,4)

n = 13

k = 4

Bin Frequency Cumulative Frequency

[0,1) 3 3

[1,2) 6 3+6 = 9

[2,3) 2 3+6+2 = 11

[3,4) 2 3+6+2+2 = 13
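The running sums in the table can be sketched with `itertools.accumulate` from the Python standard library. The variable names are illustrative.

```python
from itertools import accumulate

data = [3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8]
edges = [0, 1, 2, 3, 4]  # bins [0,1), [1,2), [2,3), [3,4)

# Frequency of each half-open bin [lo, hi).
frequencies = [sum(1 for x in data if lo <= x < hi)
               for lo, hi in zip(edges, edges[1:])]

# Cumulative frequency of bin j is f_1 + f_2 + ... + f_j.
cumulative = list(accumulate(frequencies))

for lo, hi, f, c in zip(edges, edges[1:], frequencies, cumulative):
    print(f"[{lo},{hi})  {f}  {c}")
```

The last cumulative frequency equals the sample size n = 13, as the definition requires.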


Cumulative Relative Frequency Distribution

If $\{f_1, f_2, \ldots, f_k\}$ are the frequencies in bins (classes) $\{[a_1, a_2), [a_2, a_3), \ldots, [a_k, a_{k+1})\}$ for a set of data such that $f_1 + f_2 + \cdots + f_k = n$, then we define the relative frequencies $r_j = f_j / n$. We note that $r_1 + r_2 + \cdots + r_k = 1$. The cumulative relative frequency for bin $j$ is defined as $r_1 + r_2 + \cdots + r_j$.


Example

data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}

bins = [0,1), [1,2), [2,3), [3,4)

n = 13

k = 4

Bin    Frequency  Cumulative Frequency  Relative Frequency (rounded)  Cumulative Relative Frequency (rounded)
[0,1)  3          3                     3/13 ≈ 0.231                  3/13 ≈ 0.231
[1,2)  6          3+6 = 9               6/13 ≈ 0.462                  9/13 ≈ 0.692
[2,3)  2          3+6+2 = 11            2/13 ≈ 0.154                  11/13 ≈ 0.846
[3,4)  2          3+6+2+2 = 13          2/13 ≈ 0.154                  13/13 = 1.000
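A minimal Python sketch of the relative columns follows. It computes each cumulative relative frequency as (cumulative count) / n and rounds only at the end, so rounding errors from the individual relative frequencies do not accumulate; that design choice, like the variable names, is an assumption of this sketch.

```python
frequencies = [3, 6, 2, 2]  # bins [0,1), [1,2), [2,3), [3,4)
n = sum(frequencies)        # n = 13

rows = []
running = 0  # cumulative count so far
for f in frequencies:
    running += f
    # (relative frequency, cumulative relative frequency), rounded to 3 places
    rows.append((round(f / n, 3), round(running / n, 3)))

for relative, cumulative_relative in rows:
    print(f"{relative:.3f}  {cumulative_relative:.3f}")
```

The final cumulative relative frequency is exactly 1, matching $r_1 + r_2 + \cdots + r_k = 1$.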


Relative Frequency Distribution (histogram)


Ogive

Definition: An ogive is a graph of the cumulative frequency or the cumulative relative frequency as a function of the bins used to construct it. It is constructed from a cumulative frequency (or cumulative relative frequency) table.


Example

Bin    Frequency  Cumulative Frequency  Relative Frequency (rounded)  Cumulative Relative Frequency (rounded)
[0,1)  3          3                     3/13 ≈ 0.231                  3/13 ≈ 0.231
[1,2)  6          3+6 = 9               6/13 ≈ 0.462                  9/13 ≈ 0.692
[2,3)  2          3+6+2 = 11            2/13 ≈ 0.154                  11/13 ≈ 0.846
[3,4)  2          3+6+2+2 = 13          2/13 ≈ 0.154                  13/13 = 1.000
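The points of an ogive for this table can be sketched as below. The slides do not say where along each bin the cumulative value is plotted; this sketch follows one common convention, plotting each cumulative frequency at the right endpoint of its bin and starting from 0 at the left endpoint of the first bin. The function name `ogive_points` is an illustrative assumption.

```python
from itertools import accumulate

edges = [0, 1, 2, 3, 4]     # bins [0,1), [1,2), [2,3), [3,4)
frequencies = [3, 6, 2, 2]  # frequencies from the table

def ogive_points(edges, frequencies):
    """Pair each bin boundary with the cumulative frequency reached there."""
    cumulative = [0, *accumulate(frequencies)]  # nothing lies below the first edge
    return list(zip(edges, cumulative))

points = ogive_points(edges, frequencies)
print(points)
```

Joining consecutive points with line segments gives a curve that never decreases and ends at the sample size.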


Time-series Data

Definition: Data about a particular variable collected over a period of time is called time-series data.

Example: Closing prices of IBM stock since Jan. 1, 2008.


Bad Graphical Representation of Data

Section 2.4

Problem: Graphs can give an incomplete picture of the sample (data), or even misrepresent it.


The Scale Problem

The number of bachelor’s degrees in engineering for 1999-2003 is given in the following table:

Year Number of Degrees

1999 62,372

2000 63,731

2001 65,113

2002 67,301

2003 70,949
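The effect of the vertical scale on a bar chart of this table can be quantified with a short sketch. It compares the ratio of the 2003 bar height to the 1999 bar height when the axis starts at 0 and when it starts at a truncated baseline. The baseline of 60,000 is a hypothetical value chosen for illustration, not taken from the slides, and the function name `bar_height_ratio` is likewise an assumption.

```python
degrees = {1999: 62372, 2000: 63731, 2001: 65113, 2002: 67301, 2003: 70949}

def bar_height_ratio(first, last, baseline):
    """Ratio of drawn bar heights when the vertical axis starts at `baseline`."""
    return (last - baseline) / (first - baseline)

# Axis starting at 0: bar heights are proportional to the data.
true_ratio = bar_height_ratio(degrees[1999], degrees[2003], baseline=0)

# Hypothetical truncated axis starting at 60,000.
truncated_ratio = bar_height_ratio(degrees[1999], degrees[2003], baseline=60000)

print(f"axis from 0:      2003 bar is {true_ratio:.2f} times the 1999 bar")
print(f"axis from 60,000: 2003 bar is {truncated_ratio:.2f} times the 1999 bar")
```

The data grew by roughly 14% over the period, but with the truncated axis the 2003 bar would be drawn more than four times as tall as the 1999 bar.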


Misleading Bar Chart