Topic 2
Describing Distributions with Graphs and Numbers
[Diagram: Target population (size = N) → sampling/experiment → Data (size = n) → summary and visualization; inference (estimation, testing) leads from the data back to the population]
Parameter and Statistic
• A parameter (in statistics) is a quantity that defines a certain characteristic of a population.
  – Example: the average birthweight of all new-born babies.
• Parameters are estimated based on a sample.
• A statistic is a summary measure computed from sample data. Note that a parameter is a summary measure for an entire population.
• A key use of a statistic is as an estimator for a parameter.
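The distinction between a parameter and a statistic can be sketched in R. This is only an illustration: the population is simulated, and its size, mean (3400 g), and standard deviation (500 g) are assumptions, not values from the course.

```r
# Hypothetical population: simulated birthweights (grams) of N newborns.
# The mean of 3400 g and SD of 500 g are illustrative assumptions.
set.seed(1)
N <- 10000
population <- rnorm(N, mean = 3400, sd = 500)

# Parameter: a summary of the entire population (usually unknown in practice).
parameter <- mean(population)

# Statistic: the same summary computed from a random sample of size n.
n <- 50
sample_bw <- sample(population, size = n)
statistic <- mean(sample_bw)

c(parameter = parameter, statistic = statistic)
```

The statistic changes from sample to sample, while the parameter stays fixed; that is why the statistic is used as an estimator.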
Distributions
• When we say that 62% of TAMUK students are Hispanic, 32% are white, 3% are African-American, and 3% are others, we mean that the DISTRIBUTION of TAMUK students according to race is

Race               Percent
Hispanic           62%
White              32%
African-American   3%
Others             3%
• The DISTRIBUTION of grades for a class could be
Grade   Percent
A       20%
B       45%
C       22%
D       10%
F       3%
• The DISTRIBUTION of weights of all men aged 30 in Texas could be
Weights             Percent
Less than 130 lb.   3%
130 to 140 lb.      6%
140 to 150 lb.      15%
150 to 160 lb.      25%
160 to 170 lb.      30%
170 to 180 lb.      17%
180 or over         4%
• So, the DISTRIBUTION of a population describes how the population is made up according to some characteristic.
  – If the characteristic of interest can be described by a categorical variable, e.g., race, one may be interested in what percent of subjects fall in each category.
  – If the characteristic of interest can be described by a continuous variable, e.g., weight, one may be interested in what proportion of subjects fall in each interval.
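As a minimal sketch, the distribution of a categorical variable can be tabulated in R with table() and prop.table(). The vector below is a hypothetical class of 100 students constructed to match the grade table; it is not real data.

```r
# Hypothetical class of 100 students, built to match the grade table.
grades <- rep(c("A", "B", "C", "D", "F"), times = c(20, 45, 22, 10, 3))

counts <- table(grades)                # frequency of each category
percent <- 100 * prop.table(counts)    # distribution as percentages
percent
```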
Histograms
• A histogram is a bar graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies (or relative frequencies). The heights of the bars correspond to the frequency (or the relative frequency) values, and the bars are drawn adjacent to each other without gaps.
Example: Construct a histogram for the systolic blood pressures (SBP) of 20 men.
93 104 105 108 109 112 114 115 117 119 119 120 121 123 127 130 135 139 139 158
SBP       Frequency
90-99     1
100-109   4
110-119   6
120-129   4
130-139   4
140-149   0
150-159   1

[Figure: Histogram of SBP; x-axis: SBP (100 to 160), y-axis: Frequency (0 to 6)]
R code:

    SBP = c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
            119, 120, 121, 123, 127, 130, 135, 139, 139, 158)
    hist(SBP,
         breaks = c(89.5, 99.5, 109.5, 119.5, 129.5, 139.5, 149.5, 159.5, 169.5),
         col = 3)

Copy and paste this code into R to see the histogram.
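The frequency table behind the histogram can also be computed directly. The sketch below uses cut() with class boundaries at 89.5, 99.5, ..., 159.5; the class labels are typed in by hand to match the table, and the variable names are only illustrative.

```r
SBP <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)

# Class boundaries at 89.5, 99.5, ..., 159.5 so no value falls on a boundary.
breaks <- seq(89.5, 159.5, by = 10)
labels <- c("90-99", "100-109", "110-119", "120-129",
            "130-139", "140-149", "150-159")

# Count how many observations fall in each class.
freq <- table(cut(SBP, breaks = breaks, labels = labels))
freq

# hist() returns the same counts without drawing when plot = FALSE.
h <- hist(SBP, breaks = breaks, plot = FALSE)
h$counts
```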
Pie Charts
Pie chart: A circle with a “slice of pie” for each category. The size of each slice corresponds to the percentage of observations in that category.
[Figure: Pie chart of Seats by group (EUL, PES, EFA, EDD, ELDR, EPP, UEN, Other); slice labels: 5%, 27%, 6%, 2%, 9%, 38%, 4%, 9%]
Bar Graph for European Parliament in 2004
[Figure: bar graph; x-axis: group (EUL, PES, EFA, EDD, ELDR, EPP, UEN, Other), y-axis: Seats (0 to 300)]
Pareto Chart
Pareto chart: A bar graph with categories ordered by their frequency, from the tallest bar to the shortest.
[Figure: Pareto chart; x-axis: Group (EPP, PES, ELDR, Other, EFA, EUL, UEN, EDD), y-axis: Percentage (0 to 40)]
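The three displays can be sketched in R with pie() and barplot(). The percentages below are read from the pie chart and paired with the groups in legend order; that pairing is an assumption, although it agrees with the ordering of the bars in the Pareto chart.

```r
# Percentages of seats as read from the pie chart, paired with the legend
# order (an assumption about which slice belongs to which group).
pct <- c(EUL = 5, PES = 27, EFA = 6, EDD = 2,
         ELDR = 9, EPP = 38, UEN = 4, Other = 9)

pie(pct, main = "Seats")                                 # pie chart
barplot(pct, ylab = "Percentage", main = "Bar graph")    # bar graph

# Pareto chart: the same bars sorted from tallest to shortest.
pareto <- sort(pct, decreasing = TRUE)
barplot(pareto, xlab = "Group", ylab = "Percentage", main = "Pareto chart")
```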
Measuring the Center: the Mean and Median
• The distribution of data or a population can be displayed graphically. In practice, we also want to know where the center of a distribution is. The mean and median are common measures of the center of a distribution.
• The mean of n observations x1, x2, …, xn, denoted $\bar{x}$, is defined as $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
• Example: The selling prices ($) of 5 single-family homes are 198000, 219000, 175000, 260000, 630000. Find the mean price.
The Mean is Sensitive to Outliers
• If the 5th home were $360000, then the mean price would be $242400 instead of $296400. The large difference between the two means is due primarily to the 5th price, $630000, which is called an outlier.
• If we construct a histogram or a stem plot for these 5 prices, the distribution is seen to be skewed to the right. This skewness is caused by the outlier.
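The effect of the outlier on the mean can be checked with a few lines of R. The variable names are only illustrative.

```r
prices <- c(198000, 219000, 175000, 260000, 630000)
mean(prices)                            # 296400, with the $630000 home

# Change the 5th price to $360000 and recompute.
prices2 <- replace(prices, 5, 360000)
mean(prices2)                           # 242400
```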
The Median
• Another measure of the center of a distribution is the median.
• Given n observations x1, x2, …, xn, the median, denoted M, is defined as the number such that half the observations are smaller and half are larger.
• To find the median of n observations, we first sort the observations in order, then pick the “midpoint”.
• Example: Find the median of the 5 prices 198000, 219000, 175000, 260000, and 630000.
• What if we have 6 prices: 198000, 219000, 175000, 260000, 630000, and 230000?
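Both median examples can be worked through in R by sorting the prices and taking the midpoint. A minimal sketch, with illustrative variable names:

```r
prices5 <- c(198000, 219000, 175000, 260000, 630000)
sort(prices5)
median(prices5)   # 219000, the 3rd of the 5 sorted prices

prices6 <- c(prices5, 230000)
sort(prices6)
median(prices6)   # 224500, the average of 219000 and 230000
```

With an even number of observations there is no single middle value, so the median is the average of the two middle values.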
Location of the Median
• Given n observations, the location of the median in the ordered list is always (n+1)/2.
• When is the location of the median a whole number? When is it not?
• If the location of a median is 4.5, it means that the median is halfway between the 4th and 5th observations in the ordered list. What does it mean if the location is 7?
• Find the median and its location for data: 2, 5, 1, 0, 9.
• Find the median and its location for data: 0, 3, 1, -2, 7, 4.
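The location rule can be sketched as a small R function. Note that median_location() is a hypothetical helper written for this illustration, not a base R function.

```r
# Hypothetical helper (not a base R function): returns the location
# (n + 1) / 2 of the median in the sorted data, and the median itself.
median_location <- function(x) {
  s <- sort(x)
  loc <- (length(s) + 1) / 2
  # floor(loc) and ceiling(loc) coincide when loc is a whole number;
  # otherwise they pick the two middle observations to average.
  med <- mean(s[c(floor(loc), ceiling(loc))])
  list(location = loc, median = med)
}

median_location(c(2, 5, 1, 0, 9))        # location 3, median 2
median_location(c(0, 3, 1, -2, 7, 4))    # location 3.5, median 2
```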
Example: Find the Mean and Median from a Stem Plot
1 | 69
2 | 455
3 | 334477
4 | 0255669
5 |
6 |
7 | 3

(a) What are the observations?
(b) Find the mean.
(c) Find the median and its location.
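One way to check the answers is to type the observations into R, reading each stem as the tens digit and each leaf as the ones digit (an assumption about the plot's units). A minimal sketch:

```r
# Observations read from the stem plot (stem = tens digit, leaf = ones digit).
obs <- c(16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37,
         40, 42, 45, 45, 46, 46, 49, 73)

length(obs)            # 19 observations
mean(obs)              # 703 / 19 = 37
median(obs)            # 37, at location (19 + 1) / 2 = 10
stem(obs, scale = 2)   # redraw the stem plot; the layout may differ slightly
```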
Comparing the Mean and Median
• For a symmetric distribution, mean = median.
• For a right-skewed distribution, mean > median.
• For a left-skewed distribution, mean < median.

[Figure: Mean, Median, and Mode; three panels: a symmetric distribution, a distribution skewed to the left, and a distribution skewed to the right]
Measuring the Spread: The Quartiles
• The spread of a distribution measures how widely its values are dispersed.
• The middle half of a distribution is marked out by two quartiles:
  – The 1st quartile Q1 is the number such that 25% of all values are smaller.
  – The 3rd quartile Q3 is the number such that 75% of all values are smaller.
  – The median of a distribution is also called the 2nd quartile, which is the number such that 50% of all values are smaller.
• Note that quartiles defined this way are not unique: different conventions can give slightly different values for the same data.
• To find these quartiles, we will need to sort the data and find the locations of these quartiles.
Example: Find Quartiles
1. Given data 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45, find Q1, M, and Q3.
2. Given data 16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 31, find Q1, M, and Q3.
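The sketch below finds the quartiles by one common textbook convention: Q1 is the median of the observations below the location of the median, and Q3 is the median of those above it. Whether this is the convention intended here is an assumption, and quartiles_textbook() is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: quartiles as medians of the lower and upper halves,
# leaving out the middle observation when n is odd.
quartiles_textbook <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)
  lower <- x[seq_len(half)]          # observations below the median location
  upper <- x[(n - half + 1):n]       # observations above the median location
  c(Q1 = median(lower), M = median(x), Q3 = median(upper))
}

d1 <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
        42, 40, 37, 34, 49, 73, 46, 45, 45)
d2 <- c(d1, 31)

quartiles_textbook(d1)   # Q1 = 25, M = 37,   Q3 = 45
quartiles_textbook(d2)   # Q1 = 28, M = 35.5, Q3 = 45

# R's built-in quantile() interpolates by default, so it can disagree:
quantile(d1, 0.25)       # 29 rather than 25
```

The disagreement on Q1 for the first data set shows in practice why quartiles are not unique.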
The Five-Number Summary and Boxplots
• Q1, M, and Q3 give information about the middle half of a distribution; the tails can be described by the smallest and largest values. Together, these five values give a quick picture of a distribution and are called the 5-number summary.
• The Five-Number Summary of a distribution describes both the center and the spread of a distribution.
• The 5 numbers can be displayed in an (ordinary) boxplot, which consists of
  (a) a central box spanning the quartiles Q1 and Q3,
  (b) a line in the box marking the median M, and
  (c) two lines extending from the box out to the smallest and largest observations.
• Compared with its competitors, histograms and stem plots, a boxplot shows less detail about the distribution. Boxplots are best used for side-by-side comparison of more than one distribution. The boxplot of a distribution should be interpreted in terms of skewness, center, and spread.
[Figure: side-by-side boxplots of Grade (y-axis, 70 to 100) for Section 01 and Section 02]
Compare the two boxplots in terms of skewness, spread, and center.
The side-by-side boxplot is produced with the following R code:

    x = c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
          88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
    y = c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
          76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)
    # One row per grade, labeled with its section.
    z = data.frame(Grade = c(x, y),
                   Section = c(rep('Section 01', length(x)),
                               rep('Section 02', length(y))))
    boxplot(Grade ~ Section, data = z, col = 2:3)
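To back up the visual comparison with numbers, the five-number summary of each section can be computed with fivenum(). This is a sketch: fivenum() reports Tukey's hinges, which may differ slightly from quartiles found by hand.

```r
x <- c(86, 91, 72, 79, 74, 83, 73, 92, 76, 72, 67,
       88, 70, 79, 93, 65, 75, 83, 90, 75, 100, 63)
y <- c(74, 84, 86, 90, 78, 85, 75, 72, 97, 84, 87,
       76, 78, 79, 82, 63, 95, 79, 82, 69, 96, 73)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum.
summaries <- sapply(list("Section 01" = x, "Section 02" = y), fivenum)
rownames(summaries) <- c("minimum", "lower hinge", "median",
                         "upper hinge", "maximum")
summaries
```

Section 02 has the higher median (80.5 against 77.5) and the narrower box (75 to 86 against 72 to 88).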
Spotting Suspected Outliers: The 1.5xIQR Rule
• In a boxplot, the distance between Q1 and Q3 (the range of the middle half of the data) is a more resistant measure of spread than the full range. This distance is called the inter-quartile range, denoted IQR; that is,
  IQR = Q3 - Q1.
• The 1.5xIQR Rule for outliers: An observation is called a suspected outlier if it falls more than 1.5xIQR above Q3 or below Q1.
• Example: Find Q1, Q3, and IQR of the data: 72 83 91 84 84 78 90 85 67 91 80 85 67 65 95.
Identify any suspected outlier.
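The 1.5xIQR rule for this example can be sketched in R as follows. The code uses quantile() with R's default interpolation, so Q1 and Q3 may differ slightly from values found by hand with another convention; the variable names are only illustrative.

```r
dat <- c(72, 83, 91, 84, 84, 78, 90, 85, 67, 91, 80, 85, 67, 65, 95)

# Quartiles by R's default rule, then the inter-quartile range.
q1 <- quantile(dat, 0.25, names = FALSE)
q3 <- quantile(dat, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences lie 1.5 x IQR below Q1 and above Q3.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Suspected outliers: observations beyond either fence.
suspected <- dat[dat < lower_fence | dat > upper_fence]
suspected
```

An empty result means no observation is flagged by the rule.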
A Modified Boxplot
[Figure: modified boxplot (y-axis, 8 to 16) with the minimum, lower hinge, median, upper hinge, and maximum labeled, lines at the lower and upper fences, and Obs. 25 marked beyond the upper fence]
R code:

    myBoxPlot = function(x, col = 'gray') {
      boxplot(x, col = col)
      # Label the five-number summary beside the box.
      text(rep(1.3, 5), fivenum(x),
           labels = c('minimum', 'lower hinge', 'median',
                      'upper hinge', 'maximum'),
           col = 'blue')
      # Fences lie 1.5 x IQR below Q1 and above Q3.
      q = quantile(x, probs = c(0.25, 0.5, 0.75))
      iqr = q[3] - q[1]
      lowerfence = q[1] - 1.5 * iqr
      upperfence = q[3] + 1.5 * iqr
      abline(h = c(lowerfence, upperfence), col = 'green', lty = 2)
      # One label per fence.
      text(rep(1.3, 2), c(lowerfence, upperfence),
           labels = c('lower fence', 'upper fence'), col = 'blue')
      # Observations outside the fences are suspected outliers.
      Outliers = which((x - lowerfence) * (x - upperfence) > 0)
      if (length(Outliers) != 0)
        text(rep(0.63, length(Outliers)), x[Outliers],
             labels = paste('Obs.', Outliers), col = 'red')
    }

    Rainfall = c(9.6, 12.9, 9.9, 8.7, 6.8, 12.5, 13.0, 10.1, 10.1, 10.1,
                 10.8, 7.8, 14.1, 10.6, 10.0, 11.5, 13.6, 12.1, 12.0, 9.3,
                 7.7, 11.0, 6.9, 9.5, 16.5, 9.3, 9.4, 8.7, 9.5, 11.6,
                 12.1, 8.0, 10.7, 13.9, 11.3, 11.6, 10.4)

    myBoxPlot(Rainfall)
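The suspected outlier can also be found without drawing anything. As a sketch, base R's boxplot.stats() returns the points lying more than 1.5 times the box length beyond the hinges, which are the points a boxplot draws individually.

```r
Rainfall <- c(9.6, 12.9, 9.9, 8.7, 6.8, 12.5, 13.0, 10.1, 10.1, 10.1,
              10.8, 7.8, 14.1, 10.6, 10.0, 11.5, 13.6, 12.1, 12.0, 9.3,
              7.7, 11.0, 6.9, 9.5, 16.5, 9.3, 9.4, 8.7, 9.5, 11.6,
              12.1, 8.0, 10.7, 13.9, 11.3, 11.6, 10.4)

# Values beyond the whiskers (coef = 1.5 is the default).
out_values <- boxplot.stats(Rainfall)$out
# Positions of those values in the original data.
out_index <- which(Rainfall %in% out_values)

out_values   # 16.5
out_index    # 25
```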
Measuring Spread: the Standard Deviation
• Interestingly, the mean is not among the 5-number summary of a distribution. The closest partner of the mean is the standard deviation, which is another measure of the spread of a distribution.
• The standard deviation measures how far the observations are from their mean.
• The variance of a set of observations is an average of the squared deviations from the mean:
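$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

• The standard deviation s is the square root of the variance:

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where n - 1 is called the degrees of freedom.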
Calculation of Standard Deviations
• Example (calculating the standard deviation s): The metabolic rates of 7 men who took part in a study of dieting, in calories per 24 hours, are
  1792 1666 1362 1614 1460 1867 1439
  Find the mean first:

$$\bar{x} = \frac{1792 + 1666 + \cdots + 1439}{7} = \frac{11200}{7} = 1600 \text{ calories}$$
The standard deviation: Example
Observations   Deviations        Squared deviations
$x_i$          $x_i - \bar{x}$   $(x_i - \bar{x})^2$
1792            192              36864
1666             66               4356
1362           -238              56644
1614             14                196
1460           -140              19600
1867            267              71289
1439           -161              25921
               sum = 0           sum = 214870

The variance:

$$s^2 = \frac{214870}{6} = 35811.67$$

The standard deviation:

$$s = \sqrt{35811.67} = 189.24 \text{ calories}$$
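The hand calculation can be reproduced step by step in R. This is a minimal sketch with illustrative variable names; the built-in var() and sd() functions use the same n - 1 divisor.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)
n <- length(rates)

xbar <- mean(rates)            # 1600
deviations <- rates - xbar     # deviations from the mean; they sum to 0
ss <- sum(deviations^2)        # sum of squared deviations: 214870

s2 <- ss / (n - 1)             # variance, about 35811.67
s <- sqrt(s2)                  # standard deviation, about 189.24

c(variance = s2, sd = s)
c(var(rates), sd(rates))       # same values from the built-in functions
```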
Summary of Strategies for Exploring Data on a Single Quantitative Variable
• The 5-number summary is always good for describing the distribution of quantitative data.
• The mean and its partner, the standard deviation, should be used to describe the center and spread of the distribution of quantitative data only when the distribution is known to be roughly symmetric with no outliers, since both are sensitive to outliers.
• The shape of the distribution of quantitative data is better described using graphical displays such as histograms.