Topic 2

Describing Distributions with Graphs and Numbers


[Diagram: Target population (size = N) → sampling/experiment → Data (size = n); the data support summary, visualization, and inference (estimation, testing)]

Parameter and Statistic

• A parameter (in statistics) is a quantity that defines a certain characteristic of a population.
  – Average birthweight of all new-born babies

• Parameters are estimated based on a sample.

• A statistic is a summary measure computed from sample data. Note that a parameter is a summary measure for an entire population.

• A key use of a statistic is as an estimator for a parameter.


Distributions

• When we say that 62% of TAMUK students are Hispanic, 32% are white, 3% are African-American, and 3% are others, we mean the DISTRIBUTION of TAMUK students according to race is

Race               Percent
Hispanic           62%
White              32%
African-American   3%
Others             3%


• The DISTRIBUTION of grades for a class could be

Grade   Percent
A       20%
B       45%
C       22%
D       10%
F       3%


• The DISTRIBUTION of weights of all men aged 30 in Texas could be

Weights             Percent
Less than 130 lb.   3%
130 to 140 lb.      6%
140 to 150 lb.      15%
150 to 160 lb.      25%
160 to 170 lb.      30%
170 to 180 lb.      17%
180 or over         4%


• So, the DISTRIBUTION of a population describes how the population is made up according to some characteristic.

If one is concerned with the characteristic of a population that can be described by a categorical variable, e.g., race, he or she may be interested in what percent of subjects fall in each race category.

If one is concerned with the characteristic of a population that can be described by a continuous variable, e.g., weight, he or she may be interested in what proportion of people fall in a weight interval.


Histograms

• A histogram is a bar graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies (or relative frequencies). The heights of the bars correspond to the frequency (or the relative frequency) values, and the bars are drawn adjacent to each other without gaps.


Example: Construct a histogram for the systolic blood pressures (SBP) of 20 men:

93 104 105 108 109 112 114 115 117 119 119 120 121 123 127 130 135 139 139 158


SBP       Frequency
90-99     1
100-109   4
110-119   6
120-129   4
130-139   4
140-149   0
150-159   1

[Figure: Histogram of SBP; x-axis: SBP (100 to 160), y-axis: Frequency (0 to 6)]

R Codes

SBP = c(93,104,105,108,109,112,114,115,117,119,
        119,120,121,123,127,130,135,139,139,158)
hist(SBP, breaks=c(89.5,99.5,109.5,119.5,129.5,139.5,
                   149.5,159.5,169.5), col=3)

Copy and paste this code into R to see the histogram.
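The frequency table behind the histogram can also be tabulated in R. The following is a minimal sketch, assuming the same half-unit class boundaries as the histogram code; the names classes and freq and the class labels are chosen here for illustration.

```r
# The 20 systolic blood pressures from the example
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)

# Assign each observation to a class of width 10 (assumed boundaries 89.5, 99.5, ...)
classes <- cut(SBP, breaks = seq(89.5, 159.5, by = 10),
               labels = c("90-99", "100-109", "110-119", "120-129",
                          "130-139", "140-149", "150-159"))

freq <- table(classes)      # frequencies (empty classes are kept as 0)
print(freq)
print(freq / length(SBP))   # relative frequencies
```

The relative frequencies sum to 1, so either scale can be used for the vertical axis of the histogram.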


Pie Charts

Pie chart: A circle having a “slice of a pie” for each category. The size of each slice corresponds to the percentage of observations in the category.


[Figure: Pie chart of seats by group. Legend: EUL, PES, EFA, EDD, ELDR, EPP, UEN, Other. Slice labels: 5%, 27%, 6%, 2%, 9%, 38%, 4%, 9%]


Bar Graph for European Parliament in 2004

[Figure: Bar graph; x-axis: group (EUL, PES, EFA, EDD, ELDR, EPP, UEN, Other), y-axis: Seats (0 to 300)]


Pareto Chart

[Figure: Pareto chart; x-axis: Group (EPP, PES, ELDR, Other, EFA, EUL, UEN, EDD), y-axis: Percentage (0 to 40)]

Pareto Chart: Bar Graph with Categories Ordered by Their Frequency from the Tallest Bar to the Shortest
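A Pareto chart is a bar graph drawn after sorting the categories. The following is a rough sketch only: it ASSUMES that the pie-chart percentages pair with the groups in legend order (EUL 5%, PES 27%, and so on), which the slides do not state explicitly. The names pct and pareto are illustrative.

```r
# ASSUMPTION: percentages are matched to groups in the order of the pie-chart legend
pct <- c(EUL = 5, PES = 27, EFA = 6, EDD = 2,
         ELDR = 9, EPP = 38, UEN = 4, Other = 9)

# Order the categories from the largest percentage to the smallest
pareto <- sort(pct, decreasing = TRUE)

# An ordinary bar graph of the sorted values is the Pareto chart
barplot(pareto, xlab = "Group", ylab = "Percentage", main = "Pareto Chart")
```

Replacing barplot(pareto, ...) with pie(pct) would draw the corresponding pie chart under the same assumption.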


Measuring the Center: the Mean and Median


• The distribution of data or a population can be displayed graphically. In practice, we also want to know where the center of a distribution is. The mean and median are common measures of the center of a distribution.

• The mean of n observations x1, x2, …, xn, denoted $\bar{x}$, is defined as $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

• Example: The selling prices ($) of 5 single-family homes are 198000, 219000, 175000, 260000, 630000. Find the mean price.

The Mean is Sensitive to Outliers


• If the 5th home were $360000, then the mean price would be ___. The large difference between the two means is due primarily to the 5th price, which is called an outlier.

• If we construct a histogram or a stem plot for the data of these 5 prices, the distribution of the data can be seen to be skewed to the right. This skewness is caused by the outlier.
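The effect of the 5th price on the mean can be checked directly in R. This is a small sketch using the five prices from the example; the variable names are chosen here for illustration.

```r
# Selling prices ($) of the 5 single-family homes
prices <- c(198000, 219000, 175000, 260000, 630000)
mean_original <- mean(prices)        # mean including the $630000 home

# Replace the 5th price with $360000 and recompute
prices_alt <- replace(prices, 5, 360000)
mean_alt <- mean(prices_alt)

print(mean_original)                 # 296400
print(mean_alt)                      # 242400
print(mean_original - mean_alt)      # change caused by a single observation
```

Changing one observation out of five moves the mean by $54000, which is what "sensitive to outliers" means here.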

The Median


• Another measure of center of a distribution is the median.

• Given n observations x1, x2, …, xn, the median, denoted M, is defined as the number such that half the observations are smaller and half are larger.

• To find the median of n observations, we first sort the observations in order, then pick the “midpoint”.

• Example: Find the median of the 5 prices 198000, 219000, 175000, 260000, and 630000.

• What if we have 6 prices: 198000, 219000, 175000, 260000, 630000, and 230000?
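Both median examples can be worked in R by sorting first and then picking the midpoint. A minimal sketch, with variable names chosen for illustration:

```r
prices5 <- c(198000, 219000, 175000, 260000, 630000)
prices6 <- c(prices5, 230000)

print(sort(prices5))   # the 3rd ordered value is the midpoint
print(sort(prices6))   # the midpoint lies between the 3rd and 4th ordered values

m5 <- median(prices5)  # 219000
m6 <- median(prices6)  # average of 219000 and 230000, i.e. 224500
print(c(m5, m6))
```

Unlike the mean, the median of the five prices stays at 219000 if the $630000 home is replaced by a $360000 home.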

Location of the Median


• Given n observations, the location of the median in the ordered list is always (n+1)/2.

• When is the location of a median an integer? When is it not an integer?

• If the location of a median is 4.5, it means that the median is halfway between the 4th and 5th observations in the ordered list. What does it mean if the location is 7?

• Find the median and its location for data: 2, 5, 1, 0, 9.

• Find the median and its location for data: 0, 3, 1, -2, 7, 4.
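The location rule (n+1)/2 can be tried on the two small data sets above. In this sketch, median_location is a hypothetical helper written for illustration; it is not a built-in R function.

```r
# Hypothetical helper: location of the median in the ordered list
median_location <- function(x) (length(x) + 1) / 2

a <- c(2, 5, 1, 0, 9)
b <- c(0, 3, 1, -2, 7, 4)

loc_a <- median_location(a)   # 3: the median is the 3rd ordered value
loc_b <- median_location(b)   # 3.5: halfway between the 3rd and 4th ordered values

print(sort(a)); print(loc_a); print(median(a))
print(sort(b)); print(loc_b); print(median(b))
```

With an odd n the location is an integer; with an even n it ends in .5 and the median is the average of the two neighboring ordered values.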

Example: Find the Mean and Median from a Stem Plot


1 | 69
2 | 455
3 | 334477
4 | 0255669
5 |
6 |
7 | 3

(a) What are the observations?
(b) Find the mean.
(c) Find the median and its location.
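One way to check answers to this exercise in R is sketched below. It ASSUMES that stems are tens and leaves are units (so "1 | 69" stands for 16 and 19); the vector name obs is illustrative.

```r
# Observations read off the stem plot (assumed: stem = tens, leaf = units)
obs <- c(16, 19,
         24, 25, 25,
         33, 33, 34, 34, 37, 37,
         40, 42, 45, 45, 46, 46, 49,
         73)

n <- length(obs)               # number of observations
print(n)
print(mean(obs))               # (b) the mean
print((n + 1) / 2)             # (c) location of the median
print(median(obs))             # (c) the median
```

The single value 73 sits far from the rest of the data, which is worth keeping in mind when comparing the mean and the median.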

Comparing the Mean and Median


• For a symmetric distribution, mean = median.
• For a right-skewed distribution, mean > median.
• For a left-skewed distribution, mean < median.

[Figure: Mean, Median, and Mode; three panels: the distribution is symmetric, the distribution is skewed to the left, the distribution is skewed to the right]


Measuring the Spread: The Quartiles


• The spread of a distribution describes how widely its values vary.

• The middle half of a distribution is marked out by two quartiles:
  – The 1st quartile Q1 is the number such that 25% of all values are smaller;
  – The 3rd quartile Q3 is the number such that 75% of all values are smaller;
  – The median of a distribution is also called the 2nd quartile, which is the number such that 50% of all values are smaller.

• Note also that quartiles defined this way are not unique: more than one number can satisfy each condition.

• To find these quartiles, we will need to sort the data and find the locations of these quartiles.

Example: Find Quartiles


1. Given data 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45, find Q1, M, and Q3.

2. Given data 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 31, find Q1, M, and Q3.
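Because quartiles are not unique, the answer depends on the rule used. The sketch below ASSUMES one common hand rule: Q1 is the median of the observations below the location of the median and Q3 is the median of those above it. The helper quartiles_by_halves is hypothetical, written for this illustration. R's built-in quantile() uses a different default rule and can give a different Q1.

```r
# Hypothetical helper implementing the assumed "median of each half" rule
quartiles_by_halves <- function(x) {
  s <- sort(x)
  n <- length(s)
  half <- floor(n / 2)            # size of each half (middle value dropped if n is odd)
  lower <- s[1:half]
  upper <- s[(n - half + 1):n]
  c(Q1 = median(lower), M = median(s), Q3 = median(upper))
}

d1 <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
        42, 40, 37, 34, 49, 73, 46, 45, 45)
d2 <- c(d1, 31)

q_d1 <- quartiles_by_halves(d1)
q_d2 <- quartiles_by_halves(d2)
print(q_d1)
print(q_d2)

# R's default rule (interpolation between ordered values), for comparison
print(quantile(d1, probs = c(0.25, 0.5, 0.75)))
```

For the first data set the hand rule and R's default agree on M and Q3 but not on Q1, which illustrates the non-uniqueness noted above.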

The Five-Number Summary and Boxplots


• Q1, M, and Q3 give information about the middle half of a distribution; the tails of a distribution can be described by its smallest and largest values. Together these five values give a quick picture of a distribution and are called the 5-number summary.

• The Five-Number Summary of a distribution describes both the center and the spread of a distribution.

• The 5 numbers can be displayed in an (ordinary) boxplot, which consists of
(a) a central box spanning the quartiles Q1 and Q3,
(b) a line in the box marking the median M, and
(c) two lines extending from the box out to the smallest and largest observations.

• Compared with its competitors, histograms and stem plots, a boxplot shows less detail about the distribution. Boxplots are best used for side-by-side comparison of more than one distribution. The boxplot of a distribution should be interpreted in terms of skewness, the center, and the spread.
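In R, fivenum() returns a five-number summary and boxplot() draws the corresponding plot. The sketch below reuses the 19 observations from the quartile example; note that fivenum() reports Tukey's hinges, which may differ slightly from quartiles found by hand.

```r
d <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
       42, 40, 37, 34, 49, 73, 46, 45, 45)

# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(d)
print(five)

# By default boxplot() marks points beyond 1.5 x the box length separately;
# range = 0 extends the whiskers to the smallest and largest observations,
# which matches the ordinary boxplot described above.
boxplot(d, range = 0, horizontal = TRUE, col = "gray")
```

A long whisker on one side of the box is the visual sign of skewness in that direction.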


[Figure: Side-by-side boxplots of Grade (axis 70 to 100) for Section 01 and Section 02]

Compare the two boxplots in terms of skewness, spread, and center.

The side-by-side boxplot is produced with the following R codes:

x = c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67, 88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
y = c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87, 76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)
# Stack the grades and record which section each grade came from
z = data.frame(Grade = c(x, y),
               Section = c(rep('Section 01', length(x)), rep('Section 02', length(y))))
# One box per section; data = z avoids the need for attach()
boxplot(Grade ~ Section, data = z, col = 2:3)

Spotting Suspected Outliers: The 1.5xIQR Rule


• In a boxplot, the distance between Q1 and Q3 (the range of the center half of the data) is a measure of spread that is more resistant to outliers than the full range. This distance is called the inter-quartile range, denoted IQR; that is,

IQR = Q3 – Q1.

• The 1.5xIQR Rule for outliers: An observation is called a suspected outlier if it falls more than 1.5xIQR above Q3 or below Q1.

• Example: Find Q1, Q3, and IQR of the data: 72 83 91 84 84 78 90 85 67 91 80 85 67 65 95. Identify any suspected outlier.
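The 1.5xIQR rule can be applied in R as sketched below. This version uses quantile() with R's default rule, which is an assumption here: quartiles found by hand may differ somewhat, so the fences may differ too. The variable names are illustrative.

```r
scores <- c(72, 83, 91, 84, 84, 78, 90, 85, 67, 91, 80, 85, 67, 65, 95)

# Quartiles under R's default quantile rule (may differ from hand calculation)
q <- unname(quantile(scores, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: 1.5 x IQR below Q1 and above Q3
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Suspected outliers are the observations outside the fences
suspected <- scores[scores < lower_fence | scores > upper_fence]

print(c(Q1 = q[1], Q3 = q[2], IQR = iqr))
print(c(lower = lower_fence, upper = upper_fence))
print(suspected)   # an empty vector means no suspected outliers
```

For these data every observation lies inside the fences, so the rule flags no suspected outliers.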

[Figure: Modified boxplot (axis 8 to 16) with labels for the minimum, lower hinge, median, upper hinge, and maximum, dashed lines at the lower and upper fences, and observation 25 flagged as a suspected outlier]

A Modified Boxplot


R codes

myBoxPlot = function(x, col = 'gray'){
  boxplot(x, col = col)
  # Label the five-number summary next to the box
  text(rep(1.3, 5), fivenum(x),
       labels = c('minimum', 'lower hinge', 'median', 'upper hinge', 'maximum'),
       col = 'blue')
  # Quartiles and fences for the 1.5xIQR rule
  q = quantile(x, probs = c(0.25, 0.5, 0.75))
  iqr = q[3] - q[1]
  lowerfence = q[1] - 1.5*iqr
  upperfence = q[3] + 1.5*iqr
  abline(h = c(lowerfence, upperfence), col = 'green', lty = 2)
  # One label per fence (two labels, so two x positions)
  text(rep(1.3, 2), c(lowerfence, upperfence),
       labels = c('lower fence', 'upper fence'), col = 'blue')
  # Flag observations that fall outside the fences
  Outliers = which(x < lowerfence | x > upperfence)
  if (length(Outliers) != 0)
    text(rep(0.63, length(Outliers)), x[Outliers],
         labels = paste('Obs.', Outliers), col = 'red')
}

Rainfall = c(9.6, 12.9, 9.9, 8.7, 6.8, 12.5, 13.0, 10.1, 10.1, 10.1, 10.8, 7.8,
             14.1, 10.6, 10.0, 11.5, 13.6, 12.1, 12.0, 9.3, 7.7, 11.0, 6.9, 9.5,
             16.5, 9.3, 9.4, 8.7, 9.5, 11.6, 12.1, 8.0, 10.7, 13.9, 11.3, 11.6, 10.4)

myBoxPlot(Rainfall)


Measuring Spread: the Standard Deviation


• Interestingly, the mean is not one of the numbers in the 5-number summary of a distribution. The closest partner of the mean is the standard deviation, which is another measure of the spread of a distribution.

• The standard deviation measures how far the observations are from their mean.

• The variance s² of a set of observations is an average of the squares of the deviations from the mean:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

• The standard deviation s is the square root of the variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}},$$

where n − 1 is called the degrees of freedom.

Calculation of Standard Deviations


• Example (Calculating the standard deviation s): Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours.

1792 1666 1362 1614 1460 1867 1439

Find the mean first:

$$\bar{x} = \frac{1792 + 1666 + \cdots + 1439}{7} = \frac{11200}{7} = 1600 \text{ calories}$$

The standard deviation: Example


Observations   Deviations        Squared deviations
$x_i$          $x_i - \bar{x}$   $(x_i - \bar{x})^2$
1792           192               36864
1666           66                4356
1362           -238              56644
1614           14                196
1460           -140              19600
1867           267               71289
1439           -161              25921
               sum = 0           sum = 214870

The variance: $s^2 = \frac{214870}{6} = 35811.67$

The standard deviation: $s = \sqrt{35811.67} = 189.24$ calories
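The hand calculation for the metabolic rates can be reproduced step by step in R and compared with the built-in var() and sd(), which also divide by n − 1. A minimal sketch, with variable names chosen for illustration:

```r
# Metabolic rates (calories per 24 hours) of the 7 men
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

n <- length(rates)
xbar <- mean(rates)                 # 1600
dev <- rates - xbar                 # deviations from the mean (they sum to 0)
sq_dev <- dev^2                     # squared deviations

s2 <- sum(sq_dev) / (n - 1)         # variance: 214870 / 6
s <- sqrt(s2)                       # standard deviation

print(data.frame(x = rates, deviation = dev, squared = sq_dev))
print(c(variance = s2, sd = s))

# The built-in functions use the same n - 1 divisor
print(c(var(rates), sd(rates)))
```

Dividing by n − 1 rather than n reflects the degrees of freedom: once the mean is fixed, only n − 1 of the deviations are free to vary.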


Summary of Strategies for Exploring Data on a Single Quantitative Variable


• The 5-number summary is always good for describing the distribution of quantitative data.

• The mean and its partner standard deviation should be used to describe the center and spread of the distribution of quantitative data only when the distribution is known to be symmetric, since both are sensitive to outliers.

• The shape of the distribution of quantitative data is better described using graphical displays such as histograms.
