stats powerpoint from the worst prof in the world
Transcript of stats powerpoint from the worst prof in the world
-
8/10/2019 stats powerpoint from the worst prof in the world
Preliminaries Empirical Data Distributions Summary Measures Miscellany
Preliminaries
Topics:
What is Statistics?
Typical Descriptive Statistics Problems
Note for the Student
It is recommended that students read this section in its entirety before coming to class for the lecture to ensure that they have the required background information. [1]
During the lecture I will mainly focus on sections which have a direct bearing on the lecture topic under discussion.
Material in the last section serves to complement what we cover during the lecture.
[1] This also applies to the Preliminaries section of subsequent lecture slides.
Statistics Overview
Topics:
What is Statistics?
Applications of Statistics
Learning Objectives:
Learn the nature of Statistics and study its relevance to Business Research Analysis and Decision Making.
Learn about the different subdisciplines of Statistics concerned with extracting descriptive information from data, assessing uncertainty and making statistical inferences & predictions.
What is Statistics?
Statistics is the discipline which makes use of mathematical and computational techniques to, among other things,
collect data using surveys, observational studies or designed experiments;
describe, summarize and present the collected data;
assess and quantify uncertainty;
draw inferences about population characteristics based on sample information;
assess the statistical significance of observed differences or presence of associations;
construct empirical models to obtain estimates, test hypotheses or for predictive purposes;
make projections using cross-sectional or time series data.
Applications of Statistics
Some Applications:
Marketing Research
E.g., Assessing Brand Preferences for a Given Product
Finance
E.g., Measuring the Credit Risk of a Counterparty
Insurance
E.g., Measuring Risk of an Insurance Portfolio
Reliability Engineering
E.g., Assessing the Reliability of an Aircraft Engine
Medical Research
E.g., Determining the Efficacy of a New Drug
Q: Do you think Statistics is worth learning? If so, why?
Typical Descriptive Statistics Problems
Organizing Data
Forty students in an Introductory Statistics course were asked to state their political affiliations (i.e., whether they favoured the Democratic (D), Republican (R) or Other (O) party). The following results were obtained.
D R O R R R R R
D O R D O O R D
D R O D R R O R
D O D D D R O D
O R D R R R R D
What type of data are we dealing with?
What can we say about the distribution of political affiliations?
Source: Adapted from Weiss (2012, p. 40).
Summarizing Data
Arterial blood pressures (in mm of mercury) for a sample of 16 children of diabetic mothers are given below.
81.6 84.1 87.6 82.8
82.0 88.9 86.7 96.4
84.6 104.9 90.8 94.0
69.4 78.9 75.2 91.0
What do the data tell you about the average blood pressure of a child whose mother is diabetic?
What can we conclude about the variability of the blood pressure measurements?
Source: Adapted from Weiss (2012, p. 95)
Empirical Data Distributions
Topics:
Tabulating Data Distributions
Graphing Data Distributions
Learning Objectives:
Learn tabular and graphical techniques for organizing and presenting data.
Learn how to choose among the available techniques for a given problem in descriptive statistical analysis.
Note:
Much of the material in this and the next section is of a review nature. We'll quickly review such material but spend more time on material students are less familiar with.
Tabulating Data Distributions
Tabulating Categorical Data
The first column of the table contains the possible categories and the second column the corresponding absolute frequencies (optionally, relative frequencies may also be given in another column).
Example
Consider the political affiliation data given in the first illustrative problem. Following is the frequency table for the data.
Affiliation | Abs Freq  Rel Freq
------------+-------------------
Democratic  | 13        0.325
Republican  | 18        0.450
Other       |  9        0.225
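The frequency table above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`responses`, `counts`, `rel_freq`) are illustrative and not part of the original slides.

```python
from collections import Counter

# Responses transcribed from the first illustrative problem
# (D = Democratic, R = Republican, O = Other).
responses = (
    "D R O R R R R R "
    "D O R D O O R D "
    "D R O D R R O R "
    "D O D D D R O D "
    "O R D R R R R D"
).split()

counts = Counter(responses)                        # absolute frequencies
n = len(responses)                                 # sample size
rel_freq = {k: v / n for k, v in counts.items()}   # relative frequencies

for label in ("D", "R", "O"):
    print(f"{label}  {counts[label]:>3}  {rel_freq[label]:.3f}")
```

Relative frequencies are just the absolute frequencies divided by the sample size, so they always sum to 1.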
Tabulating Numerical Data
In an absolute frequency table, the number of observations in each class (i.e., pre-defined sub-interval) is presented.
Class    | Frequency
---------+----------
(l1, u1] | n1
(l2, u2] | n2
(l3, u3] | n3
...      | ...
(lk, uk] | nk
Abs Frequency Table
Class    | Frequency
---------+----------
(10, 20] | 3
(20, 30] | 7
(30, 40] | 4
(40, 50] | 4
(50, 60] | 2
Note: (10, 20] refers to values between 10 (exclusive) and 20 (inclusive), etc.
Example [Frequency Tables]
The absolute frequency table in the previous slide was obtained
from the following raw data
12 13 17 21 24 24 26 27 27 30
32 35 37 38 41 43 44 46 53 58
The corresponding relative and cumulative frequency tables are:
Class    | Rel Freq
---------+---------
(10, 20] | 0.15
(20, 30] | 0.35
(30, 40] | 0.20
(40, 50] | 0.20
(50, 60] | 0.10
Class    | Cum Freq
---------+---------
(10, 20] | 0.15
(20, 30] | 0.50
(30, 40] | 0.70
(40, 50] | 0.90
(50, 60] | 1.00
Q: What can we deduce from each table?
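As a rough illustration of how the three tables relate, the sketch below bins the raw data into the half-open classes (l, u] used on the slides and derives absolute, relative and cumulative relative frequencies in one pass. The helper name `tabulate` is hypothetical.

```python
raw = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
       32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
edges = [10, 20, 30, 40, 50, 60]   # class boundaries taken from the slide

def tabulate(data, edges):
    """Return (label, abs freq, rel freq, cum rel freq) rows for (l, u] classes."""
    n = len(data)
    rows, running = [], 0
    for lower, upper in zip(edges, edges[1:]):
        # Lower limit is exclusive, upper limit is inclusive.
        count = sum(1 for x in data if lower < x <= upper)
        running += count
        rows.append((f"({lower}, {upper}]", count, count / n, running / n))
    return rows

table = tabulate(raw, edges)
for label, abs_f, rel_f, cum_f in table:
    print(f"{label:<9} {abs_f:>2}  {rel_f:.2f}  {cum_f:.2f}")
```

Note that the value 30 falls in (20, 30] rather than (30, 40], which is exactly what the exclusive/inclusive convention on the previous slide prescribes.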
Graphing Data Distributions
Graphing Distributions for Categorical Data
Pie Chart
A circle is divided into pie slices. The area of each slice is proportional to the relative frequency of each category.
Example
For the political affiliation data, we have the following pie chart.
Pie Slice  | Angle
-----------+--------
Democratic | 117 deg
Republican | 162 deg
Other      |  81 deg
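The slice angles follow directly from the relative frequencies: each category gets its share of the full 360 degrees. A short sketch (the dictionary names are illustrative):

```python
counts = {"Democratic": 13, "Republican": 18, "Other": 9}  # from the frequency table
total = sum(counts.values())

# Each slice spans 360 degrees times the category's relative frequency.
angles = {name: 360 * c / total for name, c in counts.items()}

for name, angle in angles.items():
    print(f"{name:<10} {angle:.0f} deg")
```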
Q: How can we improve on this graphical display?
Bar Chart
Each category is represented by a vertical (or horizontal) bar. The height (or width) of each bar is equal or proportional to the absolute or relative frequency of a category.
Example
For the political affiliation data, we have the following bar chart.
Q: Which is preferred? A pie chart or bar chart?
Side-by-Side Bar Chart
This chart may be used to present bivariate categorical data.
Example [Side-by-Side Bar Chart]
Consider the following distribution of student grades by gender.
       | A     B     C     D     E
-------+-----------------------------
Female | 3     9     7     1     1
Male   | 4     6     5     3     1
In relative terms, we have the following table.
       | A     B     C     D     E
-------+-----------------------------
Female | 0.14  0.43  0.33  0.05  0.05
Male   | 0.21  0.32  0.26  0.16  0.05
Example [Side-by-Side Bar Chart] (contd)
Information in the first (second) table may be displayed by the
chart in the left (right) panel of the following figure.
Q: What conclusion(s) can be drawn from the above figure?
Q: Does it matter which chart you base your conclusions on?
Source: Adapted from Chow et al (2007, p. 7).
Graphing Distributions for Numerical Data
Absolute Frequency Histogram
Displays information contained in an absolute frequency table using vertical bars with no gaps between bars.
The height of each bar gives the number of observations that lie in the interval determined by the base of the bar.
Example
Class    | Frequency
---------+----------
(10, 20] | 3
(20, 30] | 7
(30, 40] | 4
(40, 50] | 4
(50, 60] | 2
Relative Frequency Histogram
Displays information in a relative frequency table by vertical bars with no gaps between bars.
The area of each bar gives the fraction of observations that lie in the interval determined by the base of the bar.
Example
Class    | Frequency
---------+----------
(10, 20] | 0.15
(20, 30] | 0.35
(30, 40] | 0.20
(40, 50] | 0.20
(50, 60] | 0.10
Q: What can you conclude from the above figure?
Cumulative Frequency Polygon
Displays a plot of cumulative frequency against upper class limit in an expanded cumulative frequency table (as illustrated below).
Example
Class    | Cum Freq (%)
---------+-------------
(0, 10]  | 0
(10, 20] | 15
(20, 30] | 50
(30, 40] | 70
(40, 50] | 90
(50, 60] | 100
Q: What useful statistic(s) can we deduce from such plots?
Digression: Quartiles
Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations for our study.
Usually, the $x_i$'s are unordered.
For some applications, we need to work with ordered values in the dataset, i.e., with $x_{(i)}$'s such that
\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}. \]
Define
\[
Q_2 = \text{second quartile of the } x_i\text{'s} =
\begin{cases}
\tfrac{1}{2}\left( x_{(k)} + x_{(k+1)} \right), & \text{if } n = 2k, \\
x_{(k+1)}, & \text{if } n = 2k + 1.
\end{cases}
\]
Note that $Q_2$ is also referred to as the median of the $x_i$'s.
The first quartile, denoted $Q_1$, may be defined as the median of the $x_i$ values less than or equal to $Q_2$.
The third quartile, denoted $Q_3$, may be defined as the median of the $x_i$ values greater than or equal to $Q_2$.
Example
For the following set of 5 observations
101.96 109.76 99.63 99.76 100.22
the corresponding ordered sample is
99.63 99.76 100.22 101.96 109.76.
Here, Q1 = 99.76, Q2 = 100.22 and Q3 = 101.96.
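The quartile definitions above translate almost word for word into code. The following is a minimal sketch of that convention; the function names `median` and `quartiles` are illustrative. Statistical packages implement several different quartile rules, so library results may differ slightly from these.

```python
def median(values):
    """Middle ordered value, or the mean of the two middle ordered values."""
    s = sorted(values)
    n = len(s)
    k = n // 2
    return s[k] if n % 2 == 1 else (s[k - 1] + s[k]) / 2

def quartiles(values):
    """Q1 = median of values <= Q2; Q3 = median of values >= Q2."""
    q2 = median(values)
    q1 = median([x for x in values if x <= q2])
    q3 = median([x for x in values if x >= q2])
    return q1, q2, q3

sample = [101.96, 109.76, 99.63, 99.76, 100.22]
print(quartiles(sample))
```

For the five observations in the example this returns Q1 = 99.76, Q2 = 100.22 and Q3 = 101.96, matching the slide.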
Stem and Leaf Diagram
A stem and leaf diagram (like the one shown below) is a graphical display that shows the distribution of a set of numerical values. From it, one can
sometimes recover the original data;
easily infer empirical percentiles;
obtain measures of central tendency and dispersion.
Example
1 | 67788899
2 | 0012257
3 | 28
4 | 2
Ordered data: 16, 17, . . . , 38, 42.
Distribution is right-skewed.
Q1 = 18, Q2 = 20 and Q3 = 23.5
Min = 16 and Max = 42.
Example [Stem and Leaf Display]
For the Cord Strength dataset
25 25 36 31 26 36 29 37 37 20
34 27 21 35 30 41 33 21 26 26
19 25 14 32 30 29 31 26 22 24
34 33 28 26 43 30 40 32 32 31
25 26 27 34 33 27 33 29 30 31
we obtain
1 | 4
1 | 9
2 | 01124
2 | 55556666667778999
3 | 000011112223333444
3 | 56677
4 | 013
Boxplots
We introduce the boxplot via a couple of examples.
Example [Boxplot]
Weekly television viewing times (in hours) of a sample of 20 people are given below.
25 41 27 32 43
66 35 31 15 5
34 26 32 38 16
30 38 30 20 21
To obtain a boxplot, begin by finding the quartiles.
5 15 16 20 21
25 26 27 30 30
31 32 32 34 35
38 38 41 43 66
Q1 = 23
Q2 = 30.5
Q3 = 36.5
Example [Boxplot] (contd)
Then, determine the following limits
\[ \text{Lower Limit} = Q_1 - 1.5 \times \text{IQR} = 2.75, \]
\[ \text{Upper Limit} = Q_3 + 1.5 \times \text{IQR} = 56.75, \]
where $\text{IQR} = 36.5 - 23 = 13.5$. Finally, obtain 5 and 43 as the adjacent values [a] and note that 66 is a potential outlier since it falls outside the interval (2.75, 56.75).
[a] Adjacent values are the most extreme values that lie within the lower and upper limits; they are the most extreme observations that are not potential outliers (Weiss, 2012, p. 120).
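The boxplot steps in this example (quartiles, limits, adjacent values, potential outliers) can be sketched as follows. The function name `boxplot_summary` and the choice to treat a value lying exactly on a limit as inside are assumptions made for illustration; the quartile rule is the one from the Quartiles digression.

```python
def median(values):
    """Middle ordered value, or the mean of the two middle ordered values."""
    s = sorted(values)
    n = len(s)
    k = n // 2
    return s[k] if n % 2 == 1 else (s[k - 1] + s[k]) / 2

def boxplot_summary(values, k=1.5):
    """Limits, adjacent values and potential outliers for a boxplot."""
    s = sorted(values)
    q2 = median(s)
    q1 = median([x for x in s if x <= q2])
    q3 = median([x for x in s if x >= q2])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Assumption: values exactly on a limit count as inside.
    inside = [x for x in s if lower <= x <= upper]
    outliers = [x for x in s if x < lower or x > upper]
    return {
        "quartiles": (q1, q2, q3),
        "limits": (lower, upper),
        "adjacent": (inside[0], inside[-1]),
        "outliers": outliers,
    }

tv_hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
            34, 26, 32, 38, 16, 30, 38, 30, 20, 21]
summary = boxplot_summary(tv_hours)
print(summary)
```

For the television data this gives limits (2.75, 56.75), adjacent values 5 and 43, and 66 as the only potential outlier.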
Example [Parallel Boxplots]
Measurements on skinfold thickness (in mm) for samples of runners and nonrunners in the same age group are given below.
Runners          | Nonrunners
-----------------+-----------------------
7.3 6.7 8.7      | 24.0 19.9 7.5 18.4
3.0 5.1 8.8      | 28.0 29.4 20.3 19.0
7.8 3.8 6.2      | 9.3 18.1 22.8 24.2
5.4 6.4 6.3      | 9.6 19.4 16.3 16.3
3.7 7.5 4.6      | 12.4 5.2 12.2 15.6
Statistics         | Runners                  | Nonrunners
-------------------+--------------------------+------------------------------
5 Num Summary      | 3.0, 4.85, 6.3, 7.4, 8.8 | 5.2, 12.3, 18.25, 21.55, 29.4
Limits             | 1.025, 11.225            | -1.575, 35.425
Adjacent Values    | 3.0, 8.8                 | 5.2, 29.4
Potential Outliers | None                     | None
Example [Parallel Boxplots] (contd)
Q: What conclusions can you draw from the above figure?
Source: Adapted from Weiss (2012, pp. 121-122)
Summary Measures
Topics:
Location & Spread of a Distribution
Measures of Central Tendency
Measures of Dispersion
Summary Measures for Grouped Data
Learning Objectives:
Learn how to measure the location and spread of the distribution of raw data for a single numerical variable.
Learn how to obtain summary measures from grouped data.
Learn how to interpret and choose between the various summary measures.
Learn the role played by robustness in the selection of a summary measure.
Location & Spread of a Distribution
Measures of Central Tendency
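Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations with corresponding ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.
Some measures of central tendency are given below.
Mean
\[ \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}, \text{ say.} \]
Median
\[
\text{median} =
\begin{cases}
\tfrac{1}{2}\left( x_{(k)} + x_{(k+1)} \right), & \text{if } n = 2k, \\
x_{(k+1)}, & \text{if } n = 2k + 1.
\end{cases}
\]
Mode
mode = data value with highest frequency.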
Example
Consider the dataset
101.96, 109.76, 99.63, 99.76, 100.22
with corresponding ordered values
99.63, 99.76, 100.22, 101.96, 109.76.
Here, the mean is
\[ \bar{x} = \frac{101.96 + 109.76 + 99.63 + 99.76 + 100.22}{5} \approx 102.27 \]
and
\[ \text{median} = x_{(3)} = 100.22. \]
Q: What about the mode?
Advantages & Disadvantages
Feature                     | Mean  Median  Mode
----------------------------+-------------------
Always Exists?              | Y     Y       N
Always Unique?              | Y     N       N
Not Affected by Outliers?   | N     Y       Y
Further Analysis Potential? | Y     N       N
Note
Use a robust (i.e., resistant) measure of central tendency when outlying values (assuming these are valid) are present.
The trimmed mean is an example of a robust measure of location; see Exercise 3.54 on p. 101 of Weiss (2012) for a specific illustration.
Q: What about the mean and median?
Example [Robustness]
The mean is not robust since it is affected by outlying (extreme)
observations.
> set.seed(2012)
> x <- ...
> mean(x)
[1] 10.03585
> median(x)
[1] 10.09504
Note that I've decided to stop using R for this course. You may ignore the R code that you see in this and the next three examples.
Example [Robustness] (contd)
> x <- ...
> x[50] <- ...
> mean(x)
[1] 10.37307
> median(x)
[1] 10.09504
The median is not affected by extreme observations and hence it is a robust measure of central tendency.
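The same robustness experiment can be sketched in Python. The data below are an assumption (99 simulated draws from a normal distribution with mean 10 and standard deviation 1, with an arbitrary seed), so the numbers will not match the R output on the slides. To keep the comparison exact, the sketch replaces the largest observation with an extreme value, which leaves the middle of the ordered sample untouched.

```python
import random
import statistics

random.seed(2012)  # arbitrary seed; does not reproduce the slide's R output
x = [random.gauss(10, 1) for _ in range(99)]  # assumed sample: 99 draws from N(10, 1)

mean_before = statistics.mean(x)
median_before = statistics.median(x)

# Replace the largest observation with an extreme value.
y = list(x)
y[y.index(max(y))] = 1000.0

mean_after = statistics.mean(y)
median_after = statistics.median(y)

print(f"mean:   {mean_before:.3f} -> {mean_after:.3f}")
print(f"median: {median_before:.3f} -> {median_after:.3f}")
```

A single extreme value drags the mean upward by roughly ten units here, while the median does not move at all.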
Relative Magnitude of Location Measures
Example
> table(x)
x
1 2 3 4 5 6 7
4 7 23 32 23 7 4
> mean(x)
[1] 4
> median(x)
[1] 4
The above example illustrates the case when
mean = median = mode.
In the next example, we have
mean < median = mode.
Example
> table(x)
x
1 2 3 4 5 6 7
2 4 7 12 15 33 27
> mean(x)
[1] 5.41
> median(x)
[1] 6
It is also possible that
mean > median = mode.
Example
> table(x)
x
1 2 3 4 5 6 7
27 33 15 12 7 4 2
> mean(x)
[1] 2.59
> median(x)
[1] 2
Q: What is the practical significance of these examples?
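The three frequency tables in these examples can be checked with a short Python sketch. The helper `summarize` and the labels `symmetric`, `left_skewed` and `right_skewed` are illustrative names, not taken from the slides.

```python
def summarize(freq):
    """Return (mean, median, mode) for a {value: count} frequency table."""
    n = sum(freq.values())
    mean = sum(v * c for v, c in freq.items()) / n
    # Expand the table back into an ordered sample to locate the median.
    expanded = sorted(v for v, c in freq.items() for _ in range(c))
    k = n // 2
    median = expanded[k] if n % 2 else (expanded[k - 1] + expanded[k]) / 2
    mode = max(freq, key=freq.get)  # first value with the highest count
    return mean, median, mode

values = range(1, 8)
symmetric = dict(zip(values, [4, 7, 23, 32, 23, 7, 4]))
left_skewed = dict(zip(values, [2, 4, 7, 12, 15, 33, 27]))
right_skewed = dict(zip(values, [27, 33, 15, 12, 7, 4, 2]))

for name, table in [("symmetric", symmetric),
                    ("left-skewed", left_skewed),
                    ("right-skewed", right_skewed)]:
    print(name, summarize(table))
```

The mean is pulled toward the longer tail in each case, which is why it falls below the median in the second table and above it in the third.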
Example [Mean vs Median]
The ordered sample and stem and leaf display for some data on arterial blood pressure are given below.
69.4 75.2 78.9 81.6
82.0 82.8 84.1 84.6
86.7 87.6 88.9 90.8
91.0 94.0 96.4 104.9
 6 | 9
 7 | 59
 8 | 22345789
 9 | 1146
10 | 5
Here,
\[ \bar{x} = 86.18 \quad \text{and} \quad \text{median} = 85.65. \]
Q: Which measure do you recommend for the data at hand?
Measures of Dispersion
Some measures of dispersion are given below.
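Range
\[ \text{range} = x_{(n)} - x_{(1)} \]
Interquartile Range
\[ \text{IQR} = \text{Third Quartile} - \text{First Quartile} \]
Variance
\[ \text{variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Standard Deviation
\[ \text{standard deviation} = \sqrt{ \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) } \]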
Example
Consider the (ordered) dataset
99.63, 99.76, 100.22, 101.96, 109.76.
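Here,
\[ \text{range} = 109.76 - 99.63 = 10.13 \]
and
\[ \text{IQR} = 101.96 - 99.76 = 2.2. \]
Furthermore,
\[ \text{variance} = \frac{99.63^2 + \cdots + 109.76^2 - 5 \times 102.27^2}{5 - 1} \approx 18.42 \]
and
\[ \text{standard deviation} \approx \sqrt{18.42} \approx 4.29. \]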
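The sketch below checks these figures and shows that the definitional and computational forms of the sample variance agree. Variable names are illustrative.

```python
import math

data = [99.63, 99.76, 100.22, 101.96, 109.76]
n = len(data)
mean = sum(data) / n

# Definitional form: sum of squared deviations divided by n - 1.
var_def = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational form: (sum of squares - n * mean^2) / (n - 1).
var_short = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

sd = math.sqrt(var_def)
data_range = max(data) - min(data)

print(round(data_range, 2), round(var_def, 2), round(sd, 2))
```

One practical caution: the computational form subtracts two large, nearly equal numbers, so it is sensitive to rounding the mean. Plugging in the rounded value 102.27 instead of the unrounded mean gives about 17.39 rather than 18.42.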
A relative measure of dispersion is
\[ \text{coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}}. \]
Example
For data in the previous example,
\[ \text{coefficient of variation} = \frac{4.29}{102.27} \approx 0.04. \]
Advantages & Disadvantages
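(R = range, V = variance, SD = standard deviation, IQR = interquartile range, CV = coefficient of variation)
Feature                   | R  V  SD  IQR  CV
--------------------------+------------------
Always Exists?            | Y  Y  Y   Y    Y
Always Unique?            | Y  N  N   N    N
Not Affected by Outliers? | N  N  N   Y    N
Absolute Measure?         | Y  Y  Y   Y    N
Same Units?               | Y  N  Y   Y    N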
Example [Comparing Stock Performance]
Following are annual logarithmic returns of Microsoft (MSFT) and Hewlett-Packard (HWP) for the period spanning 1995-1999.
| 1995 1996 1997 1998 1999
-----+------------------------------------
MSFT | 0.3644 0.6622 0.5026 0.7648 0.5290
HWP | 0.5014 0.1836 0.2156 0.1864 0.4921
Some summary statistics for the returns are as follows:
| MSFT HWP
-------------+----------------
Mean | 0.5646 0.3158
Std Dev | 0.1539 0.1657
Median | 0.5290 0.2156
IQR | 0.1596 0.3057
Coef of Var | 0.2727 0.5246
Q: Which of the two stocks performed better over 1995-1999?
Mean & Variance for Grouped Data
Grouped data refers to data in a frequency distribution.
Example
Class | Freq. Percent Cum.
------------+-----------------------------------
(10,15] | 1 2.00 2.00
(15,20] | 2 4.00 6.00
(20,25] | 8 16.00 22.00
(25,30] | 17 34.00 56.00
(30,35] | 15 30.00 86.00
(35,40] | 5 10.00 96.00
(40,45] | 2 4.00 100.00
------------+-----------------------------------
Information in the first and any one of the remaining three columns of the above table constitutes grouped data.
Example
For the grouped data given earlier, we have
Class | ni mi mi*ni mi^2*ni
-----------+------------------------------------------
(10,15] | 1 12.5 12.5 156.25
(15,20] | 2 17.5 35.0 612.50
(20,25] | 8 22.5 180.0 4050.00
(25,30] | 17 27.5 467.5 12856.25
(30,35] | 15 32.5 487.5 15843.75
(35,40] | 5 37.5 187.5 7031.25
(40,45] | 2 42.5 85.0 3612.50
-----------+------------------------------------------
Total | 50 1455.0 44162.50
Hence,
\[ \bar{x}_g = \frac{1455.0}{50} = 29.1 \quad \text{and} \quad s_g^2 = \frac{44162.50}{50} - 29.1^2 = 36.44. \]
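The grouped-data calculation can be reproduced with class midpoints as below. This is a sketch that follows the arithmetic shown on the slide, which divides by n rather than n - 1; the variable names are illustrative.

```python
# (lower, upper] class limits with their frequencies, from the grouped-data table.
classes = [((10, 15), 1), ((15, 20), 2), ((20, 25), 8), ((25, 30), 17),
           ((30, 35), 15), ((35, 40), 5), ((40, 45), 2)]

n = sum(freq for _, freq in classes)
mids = [((lo + hi) / 2, freq) for (lo, hi), freq in classes]  # (m_i, n_i) pairs

sum_mn = sum(m * f for m, f in mids)        # sum of m_i * n_i
sum_m2n = sum(m * m * f for m, f in mids)   # sum of m_i^2 * n_i

mean_g = sum_mn / n
var_g = sum_m2n / n - mean_g ** 2   # divides by n, matching the slide's arithmetic

print(n, sum_mn, sum_m2n, round(mean_g, 2), round(var_g, 2))
```

The column totals only reach 44162.50 when the last class contributes 42.5^2 x 2 = 3612.50, which is a useful cross-check on the table.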
Miscellany
Topics:
Summation Notation
Classification of Statistical Studies
Questions for Class Discussion
Learning Objectives:
Review the notation used for summation.
Learn about different types of statistical studies.
Summation Notation
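Given numerical values $x_1, \ldots, x_n$, we have:
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \]
\[ \sum_{i=1}^{n} (a x_i + b) = (a x_1 + b) + \cdots + (a x_n + b) = a \sum_{i=1}^{n} x_i + nb \]
Example
If the $x_i$'s are given by 1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50, we have
\[ \sum_{i=1}^{7} x_i = 13.75 \quad \text{and} \quad \sum_{i=1}^{7} x_i^2 = 1.75^2 + \cdots + 1.50^2 = 27.5625. \]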
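Both sums, and the linearity rule for summation, can be verified numerically. In this sketch the constants `a` and `b` are arbitrary illustrative choices, not values from the slides.

```python
x = [1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50]
n = len(x)

sum_x = sum(x)                    # sum of x_i
sum_x2 = sum(v * v for v in x)    # sum of x_i^2

# Linearity: sum(a*x_i + b) equals a*sum(x_i) + n*b for any constants a and b.
a, b = 3.0, 2.0                   # arbitrary illustrative constants
lhs = sum(a * v + b for v in x)
rhs = a * sum_x + n * b

print(sum_x, sum_x2, lhs, rhs)
```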
Classification of Statistical Studies
Observational Study
Observed relationships and other inferences apply only to the study subjects (or objects) under investigation.
No control of extraneous sources of variation.
Example [Vasectomies & Prostate Cancer]
A study found an association between vasectomy and prostate cancer: elevated risk after vasectomy.
No information that the study was based on a properly chosen sample or a properly designed experiment.
We cannot infer causation nor generalize the observed association.
Source: Adapted from Weiss (2012, p. 7).
Inferential Study
The study is based on a properly chosen sample (e.g., random sample).
Inferences made from sample information may be generalized to a larger population.
Example [Testing Baseballs]
An independent testing company investigated the liveliness of 85 randomly selected Rawlings baseballs from the 1977 supplies of major league teams.
The Rawlings baseball was found to be more lively than the 1976 Spalding baseball.
Source: Adapted from Weiss (2012, p. 6).
Designed Experiments
A proper randomization technique is used to allocate subjects (or objects) to treatment and control groups.
Relevant sources of extraneous variation are controlled.
Example [Folic Acid & Birth Defects]
4753 women prior to conception were divided randomly into two groups. One group took daily doses of folic acid while the other took only trace elements.
Incidence of major birth defects was much reduced for the group taking folic acid.
Here, we can infer presence of a causal relationship.
Source: Adapted from Weiss (2012, p. 7).
Questions for Class Discussion
Question 1
A stem-and-leaf display of daily protein intake (in grams) for a sample of 51 female vegetarians is shown below.
The decimal point is 1 digit(s) to the right of the |
0 | 1259
1 | 34558
2 | 01889
3 | 013566688899
4 | 001235567
5 | 002234467899
6 | 88
7 |
8 | 05
Question 1 (contd)
A similar display for a sample of 53 female nonvegetarians is given below.
The decimal point is 1 digit(s) to the right of the |
0 | 5
1 | 14
2 | 34557
3 | 4567779
4 | 0112444569
5 | 0003345577
6 | 0113334799
7 | 1157
8 | 1444
Question 1 (contd)
(a) The quartiles for both groups of females are partially given in the following table. Fill in the missing entries in the table.
Group         | 1st Quartile  2nd Quartile  3rd Quartile
--------------+-----------------------------------------
Vegetarian    |               39
Nonvegetarian | 38                          63
Table: Quartiles of Vegetarian and Nonvegetarian Females
(b) Based on information in (the completed) table, compare the location and spread of the two sets of data.
(c) Identify potential outliers, if any, for each dataset. Do you obtain results that are consistent with what you observe in the stem-and-leaf displays?
Question 2
(a) Which of the following is not a property of the coefficient of variation?
(i) It is not always unique.
(ii) It is resistant to outliers.
(iii) It is a relative measure.
(iv) It is not in the same units as the original data.
(b) The (arithmetic) mean computed from raw data is always unique. The same is true of the mean computed from grouped data. True or False?
(c) The sample mid-range is a robust measure of location. True or False?
Question 3
Suppose you obtain the following five number summaries from data on annual (percentage) returns for common stock and government bonds over a fifteen year period.
Investment: Bonds
[1] -10.460 1.035 4.600 14.080 42.980
Investment: Stocks
[1] -25.930 -0.495 10.710 23.760 44.770
(a) What types of statistics do the numbers in each summary represent?
Question 3 (contd)
(b) One of the values given in the five number summary for the bond returns looks unusual. Is it a potential outlier?
(c) Of the two financial instruments, which is preferred if your primary investment objective is to choose the one that gives you the greater level of return on average?
(d) Which is preferred if risk aversion is the key factor influencing your choice of investment to make?
(e) Is there anything wrong with the following statement?
Under appropriate conditions, the coefficient of variation is a useful measure to consider when making risk-reward trade-offs amongst several investment alternatives.
Question 4
Consider the following absolute frequency distribution obtained from data on distance (in miles) travelled to work for a random sample of 50 workers.
Classes   | (10,20] (20,30] (30,40] (40,50]
----------+------------------------------------
Frequency | 3 19 23 5
(a) Determine the grouped data variance using information provided by the above empirical distribution.
(b) Determine one other grouped data measure of dispersion.
Acknowledgements
The current slides are based in part on material from:
Introductory Statistics (9th Edition) by Neil A. Weiss.
Introductory Statistics (2nd Edition) by H. K. Chow, A. Ghosh, D. H. Y. Leung and Y. K. Tse.
The slides were produced using The Beamer Class package and MiKTeX (a public domain document preparation system).
Customized computations and graphics were produced using R (a public domain statistical software package).
I am grateful to the developers of the above resources for making them available.