stats powerpoint from the worst prof in the world

8/10/2019 stats powerpoint from the worst prof in the world


Preliminaries Empirical Data Distributions Summary Measures Miscellany

Preliminaries

Topics:

What is Statistics?

Typical Descriptive Statistics Problems

Note for the Student

It is recommended that students read this section in its entirety before coming to class for the lecture, to ensure that they have the required background information. [1]

During the lecture I will mainly focus on sections which have a direct bearing on the lecture topic under discussion.

Material in the last section serves to complement what we cover during the lecture.

[1] This also applies to the Preliminaries section of subsequent lecture slides.




Statistics Overview

Topics:

What is Statistics?

Applications of Statistics

Learning Objectives:

Learn the nature of Statistics and study its relevance to Business Research Analysis and Decision Making.

Learn about the different subdisciplines of Statistics concerned with extracting descriptive information from data, assessing uncertainty and making statistical inferences & predictions.




What is Statistics?

Statistics is the discipline which makes use of mathematical and computational techniques to, among other things,

collect data using surveys, observational studies or designed experiments;

describe, summarize and present the collected data;

assess and quantify uncertainty;

draw inferences about population characteristics based on sample information;

assess the statistical significance of observed differences or presence of associations;

construct empirical models to obtain estimates, test hypotheses or for predictive purposes;

make projections using cross-sectional or time series data.



Applications of Statistics

Some Applications:

Marketing Research

E.g. Assessing Brand Preferences for a Given Product

Finance

E.g. Measuring the Credit Risk of a Counterparty

Insurance

E.g. Measuring Risk of an Insurance Portfolio

Reliability Engineering

E.g. Assessing the Reliability of an Aircraft Engine

Medical Research

E.g. Determining the Efficacy of a New Drug

Q: Do you think Statistics is worthwhile learning? If so, why?



Typical Descriptive Statistics Problems

Organizing Data

Forty students in an Introductory Statistics course were asked to state their political affiliations (i.e., whether they favoured the Democratic (D), Republican (R) or Other (O) party). The following results were obtained.

D R O R R R R R

D O R D O O R D

D R O D R R O R

D O D D D R O D

O R D R R R R D

What type of data are we dealing with?

What can we say about the distribution of political affiliations?

Source: Adapted from Weiss (2012, p. 40).



Summarizing Data

Arterial blood pressures (in mm of mercury) for a sample of 16 children of diabetic mothers are given below.

81.6 84.1 87.6 82.8

82.0 88.9 86.7 96.4

84.6 104.9 90.8 94.0

69.4 78.9 75.2 91.0

What does the data tell you about the average blood pressure of a child whose mother is diabetic?

What can we conclude about the variability of the blood pressure measurements?

Source: Adapted from Weiss (2012, p. 95)





Empirical Data Distributions

Topics:

Tabulating Data Distributions

Graphing Data Distributions

Learning Objectives:

Learn tabular and graphical techniques for organizing and presenting data.

Learn how to choose among the available techniques for a given problem in descriptive statistical analysis.

Note:

Much of the material in this and the next section is of a review nature. We'll quickly review such material but spend more time on material students are less familiar with.





Tabulating Data Distributions

Tabulating Categorical Data

The first column of the table contains the possible categories and the second column the corresponding absolute frequencies (optionally, relative frequencies may also be given in another column).

Example

Consider the political affiliation data given in the first illustrative problem. Following is the frequency table for the data.

Affiliation | Abs Freq  Rel Freq
------------+-------------------
Democratic  |    13      0.325
Republican  |    18      0.450
Other       |     9      0.225
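The tabulation above can be checked with a few lines of code. What follows is a minimal Python sketch, not part of the original slides; the variable names are illustrative. It counts each category and derives the relative frequencies from the raw responses.

```python
from collections import Counter

# Political affiliation responses of the 40 students (D, R or O).
data = (
    "D R O R R R R R "
    "D O R D O O R D "
    "D R O D R R O R "
    "D O D D D R O D "
    "O R D R R R R D"
).split()

n = len(data)
abs_freq = Counter(data)                              # absolute frequencies
rel_freq = {k: v / n for k, v in abs_freq.items()}    # relative frequencies

for category in ("D", "R", "O"):
    print(f"{category}: {abs_freq[category]:2d}  {rel_freq[category]:.3f}")
```

The same counts could equally be produced by hand or in a spreadsheet; the code only automates the tally.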





Tabulating Numerical Data

In an absolute frequency table, the number of observations in each class (i.e., pre-defined sub-interval) is presented.

Class    | Frequency
---------+----------
(l1, u1] | n1
(l2, u2] | n2
(l3, u3] | n3
...      | ...
(lk, uk] | nk

Abs Frequency Table

Class    | Frequency
---------+----------
(10, 20] | 3
(20, 30] | 7
(30, 40] | 4
(40, 50] | 4
(50, 60] | 2

Note: (10, 20] refers to values between 10 (exclusive) and 20 (inclusive), etc.





Example [Frequency Tables]

The absolute frequency table in the previous slide was obtained from the following raw data.

12 13 17 21 24 24 26 27 27 30

32 35 37 38 41 43 44 46 53 58

The corresponding relative and cumulative frequency tables are:

Class    | Rel Freq
---------+---------
(10, 20] | 0.15
(20, 30] | 0.35
(30, 40] | 0.20
(40, 50] | 0.20
(50, 60] | 0.10

Class    | Cum Freq
---------+---------
(10, 20] | 0.15
(20, 30] | 0.50
(30, 40] | 0.70
(40, 50] | 0.90
(50, 60] | 1.00
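The three tables can be derived from the raw data in one pass. The Python sketch below is an illustration added here (the names `edges`, `abs_freq` and so on are assumptions, not from the slides); it bins the 20 observations into right-closed classes and accumulates the frequencies.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
edges = [10, 20, 30, 40, 50, 60]   # class limits for (10, 20], ..., (50, 60]

# Count observations in each right-closed class (lower, upper].
classes = list(zip(edges[:-1], edges[1:]))
abs_freq = [sum(1 for x in data if lo < x <= hi) for lo, hi in classes]

n = len(data)
rel_freq = [f / n for f in abs_freq]

# Cumulative relative frequency: running count divided by n.
cum_freq = []
running = 0
for f in abs_freq:
    running += f
    cum_freq.append(running / n)

for (lo, hi), a, r, c in zip(classes, abs_freq, rel_freq, cum_freq):
    print(f"({lo}, {hi}]  {a}  {r:.2f}  {c:.2f}")
```

Note the strict inequality on the lower limit: a value of exactly 30 is counted in (20, 30], not in (30, 40].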

Q: What can we deduce from each table?




Graphing Data Distributions

Graphing Distributions for Categorical Data

Pie Chart

A circle is divided into pie slices. The area of each slice is proportional to the relative frequency of each category.

Example

For the political affiliation data, we have the following pie chart.

Pie Slice  | Angle
-----------+--------
Democratic | 117 deg
Republican | 162 deg
Other      |  81 deg

Q: How can we improve on this graphical display?




Bar Chart

Each category is represented by a vertical (or horizontal) bar. The height (or width) of each bar is equal or proportional to the absolute or relative frequency of a category.

Example

For the political affiliation data, we have the following bar chart.

Q: Which is preferred? A pie chart or bar chart?




Side-by-Side Bar Chart

This chart may be used to present bivariate categorical data.

Example [Side-by-Side Bar Chart]

Consider the following distribution of student grades by gender.

       |  A     B     C     D     E
-------+-----------------------------
Female |  3     9     7     1     1
Male   |  4     6     5     3     1

In relative terms, we have the following table.

       |  A     B     C     D     E
-------+-----------------------------
Female | 0.14  0.43  0.33  0.05  0.05
Male   | 0.21  0.32  0.26  0.16  0.05




Example [Side-by-Side Bar Chart] (cont'd)

Information in the first (second) table may be displayed by the chart in the left (right) panel of the following figure.

Q: What conclusion(s) can be drawn from the above figure?

Q: Does it matter which chart you base your conclusions on?

Source: Adapted from Chow et al (2007, p. 7).




Graphing Distributions for Numerical Data

Absolute Frequency Histogram

Displays information contained in an absolute frequency table using vertical bars with no gaps between bars.

The height of each bar gives the number of observations that lie in the interval determined by the base of the bar.

Example

Class    | Frequency
---------+----------
(10, 20] | 3
(20, 30] | 7
(30, 40] | 4
(40, 50] | 4
(50, 60] | 2




Relative Frequency Histogram

Displays information in a relative frequency table by vertical bars with no gaps between bars.

The area of each bar gives the fraction of observations that lie in the interval determined by the base of the bar.

Example

Class    | Rel Freq
---------+---------
(10, 20] | 0.15
(20, 30] | 0.35
(30, 40] | 0.20
(40, 50] | 0.20
(50, 60] | 0.10

Q: What can you conclude from the above figure?



Cumulative Frequency Polygon

Displays a plot of cumulative frequency against upper class limit in an expanded cumulative frequency table (as illustrated below).

Example

Class    | Cum Freq (%)
---------+-------------
(0, 10]  |   0
(10, 20] |  15
(20, 30] |  50
(30, 40] |  70
(40, 50] |  90
(50, 60] | 100

Q: What useful statistic(s) can we deduce from such plots?




Digression: Quartiles

Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations for our study.

Usually, the $x_i$'s are unordered.

For some applications, we need to work with ordered values in the dataset, i.e., with $x_{(i)}$'s such that

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}. \]

Define

\[
Q_2 = \text{second quartile of the } x_i\text{'s} =
\begin{cases}
\tfrac{1}{2}\left( x_{(k)} + x_{(k+1)} \right), & \text{if } n = 2k, \\
x_{(k+1)}, & \text{if } n = 2k + 1.
\end{cases}
\]

Note that $Q_2$ is also referred to as the median of the $x_i$'s.




The first quartile, denoted Q1, may be defined as the median of the xi values less than or equal to Q2.

The third quartile, denoted Q3, may be defined as the median of the xi values greater than or equal to Q2.

Example

For the following set of 5 observations

101.96 109.76 99.63 99.76 100.22

the corresponding ordered sample is

99.63 99.76 100.22 101.96 109.76.

Here, Q1 = 99.76, Q2 = 100.22 and Q3 = 101.96.
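The quartile definition used on these slides translates directly into code. The following Python sketch is an addition for illustration (the function name `quartiles` is hypothetical); it assumes exactly the rule stated above, namely that Q1 and Q3 are medians of the values at or below, and at or above, Q2. Statistical packages often use other quantile conventions, so their output may differ.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) under the definition given on the slides."""
    q2 = median(values)
    q1 = median([x for x in values if x <= q2])   # median of values <= Q2
    q3 = median([x for x in values if x >= q2])   # median of values >= Q2
    return q1, q2, q3

sample = [101.96, 109.76, 99.63, 99.76, 100.22]
print(quartiles(sample))
```

`statistics.median` sorts internally, so the input does not need to be ordered first.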




Stem and Leaf Diagram

A stem and leaf diagram (like the one shown below) is a graphical display that shows the distribution of a set of numerical values. From it, one can

sometimes recover the original data;

easily infer empirical percentiles;

obtain measures of central tendency and dispersion.

Example

1 | 67788899
2 | 0012257
3 | 28
4 | 2

Ordered data: 16, 17, . . . , 38, 42.

Distribution is right-skewed.

Q1 = 18, Q2 = 20 and Q3 = 23.5.

Min = 16 and Max = 42.




Example [Stem and Leaf Display]

For the Cord Strength dataset

25 25 36 31 26 36 29 37 37 20
34 27 21 35 30 41 33 21 26 26
19 25 14 32 30 29 31 26 22 24
34 33 28 26 43 30 40 32 32 31
25 26 27 34 33 27 33 29 30 31

we obtain

1 | 4
1 | 9
2 | 01124
2 | 55556666667778999
3 | 000011112223333444
3 | 56677
4 | 013




Boxplots

We introduce the boxplot via a couple of examples.

Example [Boxplot]

Weekly television viewing times (in hours) of a sample of 20 people are given below.

25 41 27 32 43
66 35 31 15 5
34 26 32 38 16
30 38 30 20 21

To obtain a boxplot, begin by finding the quartiles from the ordered data.

5 15 16 20 21
25 26 27 30 30
31 32 32 34 35
38 38 41 43 66

Q1 = 23

Q2 = 30.5

Q3 = 36.5




Example [Boxplot] (cont'd)

Then, determine the following limits

\[ \text{Lower Limit} = Q_1 - 1.5 \times \text{IQR} = 2.75, \qquad \text{Upper Limit} = Q_3 + 1.5 \times \text{IQR} = 56.75, \]

where $\text{IQR} = 36.5 - 23 = 13.5$. Finally, obtain 5 and 43 as the adjacent values [a] and note that 66 is a potential outlier since it falls outside the interval (2.75, 56.75).

[a] Adjacent values are the most extreme values that lie within the lower and upper limits; they are the most extreme observations that are not potential outliers (Weiss, 2012, p. 120).
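The boxplot ingredients for the television data can be computed as follows. This Python sketch is an illustrative addition; the function name `boxplot_summary` and the dictionary keys are hypothetical, and it assumes the quartile definition from the Quartiles digression and treats values exactly on a limit as lying within it.

```python
from statistics import median

def boxplot_summary(values):
    """Quartiles, limits, adjacent values and potential outliers."""
    q2 = median(values)
    q1 = median([x for x in values if x <= q2])
    q3 = median([x for x in values if x >= q2])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Adjacent values: most extreme observations still inside the limits.
    inside = [x for x in values if lower <= x <= upper]
    outliers = sorted(x for x in values if x < lower or x > upper)
    return {
        "quartiles": (q1, q2, q3),
        "limits": (lower, upper),
        "adjacent": (min(inside), max(inside)),
        "outliers": outliers,
    }

tv_hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
            34, 26, 32, 38, 16, 30, 38, 30, 20, 21]
summary = boxplot_summary(tv_hours)
print(summary)
```

The whiskers of the boxplot run out to the adjacent values, and each potential outlier is plotted individually.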




Example [Parallel Boxplots]

Measurements on skinfold thickness (in mm) for samples of runners and nonrunners in the same age group are given below.

Runners          | Nonrunners
-----------------+-----------------------
7.3 6.7 8.7      | 24.0 19.9 7.5 18.4
3.0 5.1 8.8      | 28.0 29.4 20.3 19.0
7.8 3.8 6.2      | 9.3 18.1 22.8 24.2
5.4 6.4 6.3      | 9.6 19.4 16.3 16.3
3.7 7.5 4.6      | 12.4 5.2 12.2 15.6

                   | Group
Statistics         | Runners                  | Nonrunners
-------------------+--------------------------+------------------------------
5 Num Summary      | 3.0, 4.85, 6.3, 7.4, 8.8 | 5.2, 12.3, 18.25, 21.55, 29.4
Limits             | 1.025, 11.225            | -1.575, 35.425
Adjacent Values    | 3.0, 8.8                 | 5.2, 29.4
Potential Outliers | None                     | None




Example [Parallel Boxplots] (cont'd)

Q: What conclusions can you draw from the above figure?

Source: Adapted from Weiss (2012, pp. 121-122)




Summary Measures

Topics:

Location & Spread of a Distribution

Measures of Central Tendency

Measures of Dispersion

Summary Measures for Grouped Data


Learning Objectives:

Learn how to measure the location and spread of the distribution of raw data for a single numerical variable.

Learn how to obtain summary measures from grouped data.

Learn how to interpret and choose between the various summary measures.

Learn the role played by robustness in the selection of a summary measure.




Location & Spread of a Distribution




Measures of Central Tendency

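Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations with corresponding ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.

Some measures of central tendency are given below.

Mean

\[ \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}, \text{ say.} \]

Median

\[
\text{median} =
\begin{cases}
\tfrac{1}{2}\left( x_{(k)} + x_{(k+1)} \right), & \text{if } n = 2k, \\
x_{(k+1)}, & \text{if } n = 2k + 1.
\end{cases}
\]

Mode

mode = data value with highest frequency.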




Example

Consider dataset

101.96, 109.76, 99.63, 99.76, 100.22

with corresponding ordered values

99.63, 99.76, 100.22, 101.96, 109.76.

Here, the mean is

\[ \bar{x} = \frac{101.96 + 109.76 + 99.63 + 99.76 + 100.22}{5} \approx 102.27 \]

and median $= x_{(3)} = 100.22$.
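These measures are available in most languages' standard libraries. As a hedged illustration (not part of the original slides), the Python sketch below computes them for the five observations; `multimode` lists every value tied for the highest frequency, which is useful when thinking about whether a unique mode exists.

```python
from statistics import mean, median, multimode

data = [101.96, 109.76, 99.63, 99.76, 100.22]

x_bar = mean(data)          # arithmetic mean
med = median(data)          # middle ordered value, since n is odd
modes = multimode(data)     # all values sharing the highest frequency

print(round(x_bar, 2), med, modes)
```

`multimode` requires Python 3.8 or later.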

Q: What about the mode?




Advantages & Disadvantages

Feature                     | Mean  Median  Mode
----------------------------+-------------------
Always Exists?              |  Y      Y      N
Always Unique?              |  Y      N      N
Not Affected by Outliers?   |  N      Y      Y
Further Analysis Potential? |  Y      N      N

Note

Use a robust (i.e., resistant) measure of central tendency when outlying values (assuming these are valid) are present.

The trimmed mean is an example of a robust measure of location - see Exercise 3.54 on p. 101 of Weiss (2012) for a specific illustration.

Q: What about the mean and median?




Example [Robustness]

The mean is not robust since it is affected by outlying (extreme) observations.

> set.seed(2012)
> x <- ...
> mean(x)
[1] 10.03585
> median(x)
[1] 10.09504

Note that I've decided to stop using R for this course. You may ignore the R code that you see in this and the next three examples.




Example [Robustness] (cont'd)

> x <- ...
> x[50] <- ...
> mean(x)
[1] 10.37307
> median(x)
[1] 10.09504

The median is not affected by extreme observations and hence it is a robust measure of central tendency.
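The same experiment can be sketched in Python. Everything specific here is an assumption made for illustration: the sample (100 normal draws with mean 10 and standard deviation 1), the replaced position (the largest observation) and the extreme value (1000). The numbers will therefore not match the R output above; only the qualitative behaviour carries over.

```python
import random
from statistics import mean, median

random.seed(2012)   # seed reused from the slides; Python's stream differs from R's
x = [random.gauss(10, 1) for _ in range(100)]   # assumed sample: N(10, 1), n = 100

mean_before, median_before = mean(x), median(x)

# Replace the largest observation with an extreme value (hypothetical choice).
y = list(x)
y[y.index(max(y))] = 1000.0

mean_after, median_after = mean(y), median(y)

print(f"mean:   {mean_before:.4f} -> {mean_after:.4f}")
print(f"median: {median_before:.4f} -> {median_after:.4f}")
```

Because the altered observation was already above the median and stays above it, the two middle ordered values are untouched and the median does not move, while the mean shifts by roughly (1000 minus the old maximum) / 100.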



Relative Magnitude of Location Measures

Example

> table(x)

x

1 2 3 4 5 6 7

4 7 23 32 23 7 4

> mean(x)

[1] 4

> median(x)

[1] 4

The above example illustrates the case when

mean = median = mode.




In the next example, we have

mean < median = mode.

Example

> table(x)

x

1 2 3 4 5 6 7

2 4 7 12 15 33 27

> mean(x)

[1] 5.41

> median(x)

[1] 6



It is also possible that

mean > median = mode.

Example

> table(x)

x

1 2 3 4 5 6 7

27 33 15 12 7 4 2

> mean(x)

[1] 2.59

> median(x)

[1] 2

Q: What is the practical significance of these examples?




Example [Mean vs Median]

The ordered sample and stem and leaf display for some data on arterial blood pressure are given below.

69.4 75.2 78.9 81.6
82.0 82.8 84.1 84.6
86.7 87.6 88.9 90.8
91.0 94.0 96.4 104.9

 6 | 9
 7 | 59
 8 | 22345789
 9 | 1146
10 | 5

Here,

$\bar{x} = 86.18$ and median $= 85.65$.

Q: Which measure do you recommend for the data at hand?






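Measures of Dispersion

Some measures of dispersion are given below.

Range

\[ \text{range} = x_{(n)} - x_{(1)} \]

Interquartile Range

\[ \text{IQR} = \text{Third Quartile} - \text{First Quartile} \]

Variance

\[ \text{variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Standard Deviation

\[ \text{standard deviation} = \sqrt{ \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) } \]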




Example

Consider the (ordered) dataset

99.63, 99.76, 100.22, 101.96, 109.76.

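Here,

\[ \text{range} = 109.76 - 99.63 = 10.13 \]

and

\[ \text{IQR} = 101.96 - 99.76 = 2.2. \]

Furthermore,

\[ \text{variance} = \frac{99.63^2 + \cdots + 109.76^2 - 5 \times 102.27^2}{5 - 1} \approx 18.42 \]

and

\[ \text{standard deviation} \approx \sqrt{18.42} = 4.29. \]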
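The four dispersion measures can be verified numerically. The Python sketch below is an illustrative addition with assumed variable names; it uses the slides' quartile rule for the IQR and computes the variance both from the definition and from the shortcut form that appears in the standard deviation formula.

```python
from statistics import mean, median, stdev, variance

data = [99.63, 99.76, 100.22, 101.96, 109.76]   # ordered dataset from the example

data_range = max(data) - min(data)

# Quartiles as defined in the Quartiles digression.
q2 = median(data)
q1 = median([x for x in data if x <= q2])
q3 = median([x for x in data if x >= q2])
iqr = q3 - q1

s2 = variance(data)   # sample variance, divisor n - 1
s = stdev(data)       # square root of the sample variance

# Shortcut form: (sum of squares - n * mean^2) / (n - 1), with unrounded mean.
n = len(data)
x_bar = mean(data)
s2_shortcut = (sum(x * x for x in data) - n * x_bar ** 2) / (n - 1)

print(round(data_range, 2), round(iqr, 2), round(s2, 2), round(s, 2))
```

One practical caution: the shortcut form subtracts two large, nearly equal numbers, so the mean should be carried at full precision. Plugging in the rounded value 102.27 gives about 17.39 instead of 18.42.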



A relative measure of dispersion is

\[ \text{coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}}. \]

Example

For data in the previous example,

\[ \text{coefficient of variation} = \frac{4.29}{102.27} \approx 0.04. \]

Advantages & Disadvantages

Feature                   | R   V   SD  IQR  CV
--------------------------+---------------------
Always Exists?            | Y   Y   Y   Y    Y
Always Unique?            | Y   N   N   N    N
Not Affected by Outliers? | N   N   N   Y    N
Absolute Measure?         | Y   Y   Y   Y    N
Same Units?               | Y   N   Y   Y    N




Example [Comparing Stock Performance]

Following are annual logarithmic returns of Microsoft (MSFT) and Hewlett-Packard (HWP) for the period spanning 1995-1999.

| 1995 1996 1997 1998 1999

-----+------------------------------------

MSFT | 0.3644 0.6622 0.5026 0.7648 0.5290

HWP | 0.5014 0.1836 0.2156 0.1864 0.4921

Some summary statistics for the returns are as follows:

| MSFT HWP

-------------+----------------

Mean | 0.5646 0.3158

Std Dev | 0.1539 0.1657

Median | 0.5290 0.2156

IQR | 0.1596 0.3057

Coef of Var | 0.2727 0.5246
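The mean, standard deviation and coefficient of variation rows of the summary table can be recomputed from the returns. The Python sketch below is added for illustration; the helper name `coefficient_of_variation` is hypothetical.

```python
from statistics import mean, stdev

returns = {
    "MSFT": [0.3644, 0.6622, 0.5026, 0.7648, 0.5290],
    "HWP":  [0.5014, 0.1836, 0.2156, 0.1864, 0.4921],
}

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

cv = {name: coefficient_of_variation(r) for name, r in returns.items()}

for name, r in returns.items():
    print(f"{name}: mean={mean(r):.4f}  sd={stdev(r):.4f}  cv={cv[name]:.4f}")
```

A lower coefficient of variation means less variability per unit of average return, which is one possible lens for the question that follows.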

Q: Which of the two stocks performed better over 1995-1999?



Mean & Variance for Grouped Data

Grouped data refers to data in a frequency distribution.

Example

Class | Freq. Percent Cum.

------------+-----------------------------------

(10,15] | 1 2.00 2.00

(15,20] | 2 4.00 6.00

(20,25] | 8 16.00 22.00

(25,30] | 17 34.00 56.00

(30,35] | 15 30.00 86.00

(35,40] | 5 10.00 96.00

(40,45] | 2 4.00 100.00

------------+-----------------------------------

Information in the first column and any one of the remaining three columns of the above table constitutes grouped data.



Example

For the grouped data given earlier, we have

Class      |  ni     mi     mi*ni    mi^2*ni
-----------+------------------------------------------
(10,15]    |   1    12.5     12.5     156.25
(15,20]    |   2    17.5     35.0     612.50
(20,25]    |   8    22.5    180.0    4050.00
(25,30]    |  17    27.5    467.5   12856.25
(30,35]    |  15    32.5    487.5   15843.75
(35,40]    |   5    37.5    187.5    7031.25
(40,45]    |   2    42.5     85.0    3612.50
-----------+------------------------------------------
Total      |  50           1455.0   44162.50

Here, ni is the class frequency and mi the class midpoint. Hence,

\[ \bar{x}_g = \frac{1455.0}{50} = 29.1 \quad\text{and}\quad s_g^2 = \frac{44162.50}{50} - 29.1^2 = 36.44. \]
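The grouped-data computation follows a fixed recipe: take class midpoints, weight them by the class frequencies, and combine the two column totals. A minimal Python sketch of the worked example is given below; it is an illustrative addition, the variable names are assumptions, and it uses the divisor n because that is what the arithmetic in the example uses.

```python
# Class limits (lower, upper] and frequencies from the grouped data table.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]
freqs = [1, 2, 8, 17, 15, 5, 2]

n = sum(freqs)
midpoints = [(lo + hi) / 2 for lo, hi in classes]

sum_mn = sum(m * f for m, f in zip(midpoints, freqs))        # total of mi * ni
sum_m2n = sum(m * m * f for m, f in zip(midpoints, freqs))   # total of mi^2 * ni

mean_g = sum_mn / n
var_g = sum_m2n / n - mean_g ** 2   # divisor n, matching the worked example

print(n, sum_mn, sum_m2n, mean_g, round(var_g, 2))
```

Each observation is treated as if it sat at its class midpoint, so these values approximate, rather than reproduce, the raw-data mean and variance.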


Miscellany



Topics:

Summation Notation

Classification of Statistical Studies

Questions for Class Discussion


Learning Objectives:

Review the notation used for summation.

Learn about different types of statistical studies.



Summation Notation

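Given numerical values $x_1, \ldots, x_n$, we have:

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \]

\[ \sum_{i=1}^{n} (a x_i + b) = (a x_1 + b) + \cdots + (a x_n + b) = a \sum_{i=1}^{n} x_i + nb \]

Example

If the $x_i$'s are given by 1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50, we have

\[ \sum_{i=1}^{7} x_i = 13.75 \quad\text{and}\quad \sum_{i=1}^{7} x_i^2 = 1.75^2 + \cdots + 1.50^2 = 27.5625. \]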
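Both identities can be checked numerically on the example values. In the Python sketch below (an illustrative addition), the constants `a` and `b` are arbitrary choices made only to exercise the second identity.

```python
x = [1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50]
n = len(x)

sum_x = sum(x)                    # sum of the x_i
sum_x2 = sum(v * v for v in x)    # sum of the squared x_i

# Linearity: sum of (a*x_i + b) equals a * sum(x_i) + n*b.
a, b = 3.0, 2.0                   # arbitrary constants, chosen for illustration
lhs = sum(a * v + b for v in x)
rhs = a * sum_x + n * b

print(sum_x, sum_x2, lhs, rhs)
```

Note that the sum of squares (27.5625) is not the square of the sum (13.75 squared is 189.0625); the two are easy to confuse when reading the notation.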


Classification of Statistical Studies



Observational Study

Observed relationships and other inferences apply only to the study subjects (or objects) under investigation.

No control of extraneous sources of variation.

Example [Vasectomies & Prostate Cancer]

A study found an association between vasectomy and prostate cancer - elevated risk after vasectomy.

No information that the study was based on a properly chosen sample or a properly designed experiment.

We can neither infer causation nor generalize the observed association.





Inferential Study

The study is based on a properly chosen sample (e.g., random sample).

Inferences made from sample information may be generalized to a larger population.

Example [Testing Baseballs]

An independent testing company investigated the liveliness of 85 randomly selected Rawlings baseballs from the 1977 supplies of major league teams.

The Rawlings baseball was found to be more lively than the 1976 Spalding baseball.





Designed Experiments

A proper randomization technique is used to allocate subjects (or objects) to treatment and control groups.

Relevant sources of extraneous variation are controlled.

Example [Folic Acid & Birth Defects]

4753 women prior to conception were divided randomly into two groups. One group took daily doses of folic acid while the other took only trace elements.

Incidence of major birth defects was much reduced for the group taking folic acid.

Here, we can infer presence of a causal relationship.



Questions for Class Discussion



Question 1

A stem-and-leaf display of daily protein intake (in grams) for a sample of 51 female vegetarians is shown below.

The decimal point is 1 digit(s) to the right of the |

0 | 1259
1 | 34558
2 | 01889
3 | 013566688899
4 | 001235567
5 | 002234467899
6 | 88
7 |
8 | 05




Question 1 (cont'd)

A similar display for a sample of 53 female nonvegetarians is given below.

The decimal point is 1 digit(s) to the right of the |

0 | 5

1 | 14

2 | 34557

3 | 4567779

4 | 0112444569

5 | 0003345577

6 | 0113334799

7 | 1157

8 | 1444



Question 1 (cont'd)

(a) The quartiles for both groups of females are partially given in the following table. Fill in the missing entries in the table.

Group         | 1st Quartile | 2nd Quartile | 3rd Quartile
--------------+--------------+--------------+-------------
Vegetarian    |              |      39      |
Nonvegetarian |      38      |              |      63

Table: Quartiles of Vegetarian and Nonvegetarian Females

(b) Based on information in (the completed) table, compare the location and spread of the two sets of data.

(c) Identify potential outliers, if any, for each dataset. Do you obtain results that are consistent with what you observe in the stem-and-leaf displays?




Question 2

(a) Which of the following is not a property of the coefficient of variation?

(i) It is not always unique.

(ii) It is resistant to outliers.

(iii) It is a relative measure.

(iv) It is not in the same units as the original data.

(b) The (arithmetic) mean computed from raw data is always unique. The same is true of the mean computed from grouped data. True or False?

(c) The sample mid-range is a robust measure of location. True or False?




Question 3

Suppose you obtain the following five number summaries from data on annual (percentage) returns for common stock and government bonds over a fifteen year period.

Investment: Bonds

[1] -10.460 1.035 4.600 14.080 42.980

Investment: Stocks

[1] -25.930 -0.495 10.710 23.760 44.770

(a) What types of statistics do the numbers in each summary represent?




Question 3 (cont'd)

(b) One of the values given in the five number summary for the bond returns looks unusual. Is it a potential outlier?

(c) Of the two financial instruments, which is preferred if your primary investment objective is to choose the one that gives you the greater level of return on average?

(d) Which is preferred if risk aversion is the key factor influencing your choice of investment to make?

(e) Is there anything wrong with the following statement?

Under appropriate conditions, the coefficient of variation is a useful measure to consider when making risk-reward trade-offs amongst several investment alternatives.




Question 4

Consider the following absolute frequency distribution obtained from data on distance (in miles) travelled to work for a random sample of 50 workers.

Classes   | (10,20] (20,30] (30,40] (40,50]
----------+------------------------------------
Frequency |    3      19      23       5

(a) Determine the grouped data variance using information provided by the above empirical distribution.

(b) Determine one other grouped data measure of dispersion.


Acknowledgements



The current slides are based in part on material from:

Introductory Statistics (9th Edition) by Neil A. Weiss.

Introductory Statistics (2nd Edition) by H. K. Chow, A. Ghosh, D. H. Y. Leung and Y. K. Tse.

The slides were produced using The Beamer Class package and MikTeX (a public domain document preparation system).

Customized computations and graphics were produced using R (a public domain statistical software package).

I am grateful to the developers of the above resources for making them available.
