Post on 03-Apr-2018
7/28/2019 Sampling and Descriptive Statics
1/49
Sampling and Descriptive Statistics
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
Reference:
1. W. Navidi. Statistics for Engineering and Scientists. Chapter 1 & Teaching Material
7/28/2019 Sampling and Descriptive Statics
2/49
Sampling (1/2)
Definition: A o ulation is the entire collection of ob ects
or outcomes about which information is sought
All NTNU students
Definition: A sample is a subset of a population,
containing the objects or outcomes that are actuallyobserved
. .,
Choose the 100 students from the rosters of football or
basketball teams (appropriate?) Choose the 100 students living a certain dorm or enrolled in
the statistics course (appropriate?)
Statistics-Berlin Chen 2
7/28/2019 Sampling and Descriptive Statics
3/49
Sampling (2/2)
Definition: A sim le random sam le SRS of size n is a
sample chosen by a method in which each collection ofn
population items is equally likely to comprise the sample,us as n e o ery
Definition: A sample of convenience is a sample that isno rawn y a we - e ne ran om me o
Things to consider with convenience samples:
Statistics-Berlin Chen 3
Only use when it is not feasible to draw a random sample
7/28/2019 Sampling and Descriptive Statics
4/49
More on SRS (1/3)
Definition: A conce tual o ulation consists of all the
values that might possibly have been observed
It is in contrast to tangible () population E.g., a geologist weighs a rock several times on a sensitive scale.
Each time, the scale gives a slightly different reading
.
that the scale could in principle produce
Statistics-Berlin Chen 4
7/28/2019 Sampling and Descriptive Statics
5/49
More on SRS (2/3)
A SRS is not uaranteed to reflect the o ulation
perfectly
SRSs always differ in some ways from each other,
occasionally a sample is substantially different from the
population
Two different samples from the same population will vary
from each other as well
This phenomenon is known as sampling variation
Statistics-Berlin Chen 5
7/28/2019 Sampling and Descriptive Statics
6/49
More on SRS (3/3)
The items in a sam le are inde endent if knowin thevalues of some of the items does not help to predict the
values of the others
(A Rule of Thumb) Items in a simple random sample
encountered in practice The exception occurs when the population is finite and the
samp e compr ses a su s an a rac on more an o epopulation
However, it is possible to make a population behave asthough it were infinite large, by replacing each item after
Statistics-Berlin Chen 6
Sampling With Replacement
7/28/2019 Sampling and Descriptive Statics
7/49
Other Sampling Methods
Weighting Sampling Some items are given a greater chance of being selected than
others
. .,others
Stratified Sampling
The population is divided up into subpopulations, called strata A simple random sample is drawn from each stratum
Su ervised ?
Cluster Sampling
Items are drawn from the population in groups or clusters E.g., the U.S. government agencies use cluster sampling to
sample the U.S. population to measure sociological factors suchas income and unemployment
Statistics-Berlin Chen 7
Unsupervised (?)
7/28/2019 Sampling and Descriptive Statics
8/49
Types of Experiments
One-Sam le Ex eriment
There is only one population of interest
A single sample is drawn from it
Multi-Sample Experiment
ere are wo or more popu a ons o n eres
A simple is drawn from each population
-
comparisons among populations
Statistics-Berlin Chen 8
7/28/2019 Sampling and Descriptive Statics
9/49
Types of Data
Numerical or uantitative if a numerical uantit is
assigned to each item in the sample
Height Weight
Age
Categorical or qualitative if the sample items are placed
Gender
Hair color
Blood type
Statistics-Berlin Chen 9
7/28/2019 Sampling and Descriptive Statics
10/49
Summary Statistics
The summar statistics are sometimes called descri tive
statistics because they describe the data
Numerical Summaries Sample mean, median, trimmed mean, mode
Sample standard deviation (variance), range
,
Skewness, kurtosis
.
Graphical Summaries
Stem and leaf plot Dotplot
Histogram (more commonly used)
Scatterplot
. Statistics-Berlin Chen 10
7/28/2019 Sampling and Descriptive Statics
11/49
Numerical Summaries (1/4)
Definition: Sam le Mean (thecenterofthedata) Let be a sample. The sample mean isnXX ,,1
n1
Its customary to use a letter with a bar over it to denote a
in 1
Definition: Sample Variance (howspreadoutthedataare) Let be a sam le. The sam le variance isnXX ,,1
n
i XXs
22 1
Which is equivalent to
2
1
22
1
1XnX
ns
n
ii
Statistics-Berlin Chen 11
5040,30,20,,10
3231,30,29,,20
7/28/2019 Sampling and Descriptive Statics
12/49
Numerical Summaries (2/4)
Actually, we are interest in
Population mean
Population deviation: Measuring the spread of the population
mean
, ,
use sample mean to replace it
a ema ca y, e ev a ons aroun e samp e mean
tend to be a bit smaller than the deviations around the
So when calculating sample variance, the quantity divided by
rather than provides the right correctionn1n
Statistics-Berlin Chen 12
To be proved later on !
7/28/2019 Sampling and Descriptive Statics
13/49
Numerical Summaries (3/4)
Definition: Sam le Standard Deviation
Let be a sample. The sample deviation isnXX ,,1
ii XX
ns
11
nnXs 22
1
The sample deviation also measures the degree of spread in a
in 11
samp e av ng esameun sas e a a
Statistics-Berlin Chen 13
7/28/2019 Sampling and Descriptive Statics
14/49
Numerical Summaries (3/4)
If is a sam le, and ,where andnXX ,,1 bXaY a b
are constants, then XbaY
If is a sample, and ,where and
are constants, thennXX ,,1 ii bXaY a b
xyxy sbssbs and222
Definition: Outliers
Sometimes a sample may contain a few points that are much
larger or smaller than the rest (mainlyresultingfromdataentryerrors)
Statistics-Berlin Chen 14
7/28/2019 Sampling and Descriptive Statics
15/49
More on Numerical Summaries (1/2)
Definition: The median is another measure of center of a
sample , like the mean
To compute the median items in the sample have to be ordered
nXX ,,1
y e r va ues
If is odd, the sample median is the number in positionn 2/1n
If is even, the sample median is the average of the numbers inpositions andn
2/n 12/ n
The median is an important (robust) measure of center for
samples containing outliers
Statistics-Berlin Chen 15
7/28/2019 Sampling and Descriptive Statics
16/49
More on Numerical Summaries (2/2)
Definition: The trimmed mean of one-dimensional data
is computed by
First, arranging the sample values in (ascending or descending)or er
Then, trimming an equal number of them from each end, say,
%
Finally, computing the sample mean of those remaining
Statistics-Berlin Chen 16
7/28/2019 Sampling and Descriptive Statics
17/49
More on mean
Arithmetic mean n
XX1
Geometric mean
in 1
nn
1
Harmonic mean
ii
1
1n
1 i iXn1
meanharmonic:1mean;arithmetic:1
minimum:maximum;:
mm
mm
mn
i
m
iXn
X1
meanquadratic:2
meangeometric:0
m
m
Arithmetic mean Geometric mean Harmonic mean
n
Weighted arithmetic meanStatistics-Berlin Chen 17http://en.wikipedia.org/wiki/Mean
ni i
i ii
wX 1
1
7/28/2019 Sampling and Descriptive Statics
18/49
Quartiles
Definition: the quartiles of a sample divides it asnXX ,,1
near y as poss e n o quar ers. e samp e va ues ave
to be ordered from the smallest to the largest
, .
The second quartile found by computing the value 0.5(n+1)
The third quartile found by computing the value 0.75(n+1)
Example 1.14: Find the first and third quartiles of the
.30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 254 247 254 274 384 470
n=24
To find the first quartile, compute (n+1)25=6.25
105+126 /2=115.5
Statistics-Berlin Chen 18
To find the third quartile, compute (n+1)75=18.75
(242+245)/2=243.5
7/28/2019 Sampling and Descriptive Statics
19/49
Percentiles
Definition: Thepth percentile of a sample , for anXX ,,1
number between 0 and 100, divide the sample so that as
nearly as possiblep% of the sample values are less than
.
Order the sample values from smallest to largest
Then com ute the uantit /100 n+1 where n is the sam le
size If this quantity is an integer, the sample value in this position is
. ,
either side
, ,
is the 50th percentile, and the third quartile is the 75th
Statistics-Berlin Chen 19
7/28/2019 Sampling and Descriptive Statics
20/49
Mode and Range
Mode The sample mode is the most frequently occurring values in a
sample
Ran e
The difference between the largest and smallest values in asample
values
Statistics-Berlin Chen 20
7/28/2019 Sampling and Descriptive Statics
21/49
Numerical Summaries for Categorical Data
For cate orical data, each sam le item is assi ned a
category rather than a numerical value
Two Numerical Summaries for Categorical Data
Definition: (Relative) Frequencies
The frequency of a given category is simply the number ofsample items falling in that category
Definition: Sample Proportions (alsocalledrelativefrequency) The sample proportion is the frequency divided by the
sample size
Statistics-Berlin Chen 21
7/28/2019 Sampling and Descriptive Statics
22/49
Sam le Statistics and Po ulation Parameters 1/2
A numerical summar of a sam le is called a statistic
A numerical summary of a population is called a
parameter If a population is finite, the methods used for calculating the
numerical summaries of a sample can be applied for calculating
outcome) occurs with probability? See Chapter 2) Exceptions are the variance and standard deviation (?)
2
2
2
2
1:Normal
x
exf
However, sample statistics are often used to estimate
parameters (to be taken as estimators)
Statistics-Berlin Chen 22
In practice, the entire population is never observed, so the
population parameters cannot be calculated directly
7/28/2019 Sampling and Descriptive Statics
23/49
Sample Statistics and Population Parameters (2/2)
A Schematic Depiction
Population Sample
Statistics
In erence
Parameters
Statistics-Berlin Chen 23
7/28/2019 Sampling and Descriptive Statics
24/49
Graphical Summaries
Recall that the mean, median and standard deviation,
etc., are numerical summaries of a sample of of a
population
On the other hand, the graphical summaries are used as
will to help visualize a list of numbers (or the sampleitems). Methods to be discussed include:
Dotplot
Histo ram more commonl used Boxplot (more commonly used)
Scatterplot
Statistics-Berlin Chen 24
7/28/2019 Sampling and Descriptive Statics
25/49
Stem-and-leaf Plot (1/3)
A sim le wa to summarize a data set
Each item in the sample is divided into two parts
stem, consisting of the leftmost one or two digits leaf, consisting of the next significant digit
The stem-and-leaf plot is a compact way to represent the
data It also gives us some indication of the shape of our data
Statistics-Berlin Chen 25
7/28/2019 Sampling and Descriptive Statics
26/49
Stem-and-leaf Plot (2/3)
Example: Duration of dormant () periods of thee ser Old Faithful in Minutes
e s oo a e rs ne o e s em-an - ea p o . srepresents measurements of 42, 45, and 49 minutes
A good feature of these plots is that they display all the sample
Statistics-Berlin Chen 26
values. One can reconstruct the data in its entirety from a stem-
and-leaf plot (however, the order information that items sampled is lost)
7/28/2019 Sampling and Descriptive Statics
27/49
Stem-and-leaf Plot (3/3)
Another Exam le: Particulate matter PM emissions for
62 vehicles driven at high altitude
Contain the a count of number of
items at or above this line
This stem contains
the medium
Contain the a count of number of
items at or below this line
Statistics-Berlin Chen 27Cumulative frequency columnStem (tens digits)
Leaf (ones digits)
7/28/2019 Sampling and Descriptive Statics
28/49
Dotplot
A dot lot is a ra h that can be used to ive a rou h
impression of the shape of a sample
Where the sample values are concentrated Where the gaps are
It is useful when the sample size is not too large and
Good method, along with the stem-and-leaf plot to
Not generally used in formal presentations
Statistics-Berlin Chen 28Figure 1.7 Dotplot of the geyser data in Table 1.3
7/28/2019 Sampling and Descriptive Statics
29/49
Histogram (1/3)
A ra h ives an idea of the sha e of a sam le
Indicate regions where samples are concentrated or sparse
The first step is to construct a frequency table
Choose boundary points for the class intervals
Compute the frequencies and relative frequencies for eachclass
Relative frequencies: frequency/sample size
Com ute the densit for each class accordin to the formula
Density = relative frequency/class width
Statistics-Berlin Chen 29
ens y can e oug o as e re a ve requency per
unit
7/28/2019 Sampling and Descriptive Statics
30/49
Histogram (2/3)
Table 1.4A frequency table
Draw a rectangle for each class, whose height is equal to thedensity
The total areas of rectangles is equal to 1
Figure 1.8
Statistics-Berlin Chen 30
7/28/2019 Sampling and Descriptive Statics
31/49
Histogram (3/3)
A common rule of thumb for constructin the histo ram
of a sample
It is good to have more intervals rather than fewer But it also to good to have large numbers of sample points in the
intervals
judgment and of trial and error It is reasonable to take the number of intervals roughly equal
o e square roo o e samp e s ze
Statistics-Berlin Chen 31
7/28/2019 Sampling and Descriptive Statics
32/49
Histogram with Equal Class Widths
Default settin of most software acka e
Example: an histogram with equal class widths for Table
.
The total areas of rectangles is equal to 1
Figure 1.9
few (7) data points
Compared to Figure 1.9, Figure 1.8 presents a smoother
Statistics-Berlin Chen 32
appearance and better enables the eye to appreciate the
structure of the data set as a whole
7/28/2019 Sampling and Descriptive Statics
33/49
Histogram, Sample Mean and Sample Variance (1/2)
ii
i valClassInterDensityOflassIntervaCenterOfCl
An approximation to the sample mean
E. ., the center of mass of the histo ram in Fi ure 1.8 is
730.6065.020177.04194.02
While the sample mean is 6.596
The narrower the rectangles (intervals), the closer the
=>items of the same value)
1, 1, 1, 2, 3, 422.1
18
22
16
14
36
52
0.5 3.5 4.5
Statistics-Berlin Chen 33
1, 1, 1, 2, 3, 4
4.50.5 1.5 2.5 3.5
26
12
6
1
46
1
36
1
26
3
1
7/28/2019 Sampling and Descriptive Statics
34/49
Histogram, Sample Mean and Sample Variance (2/2)
Definition: The moment of inertia () for the entire
histogram is
i ramssOfHistogCenterOfMalassIntervaCenterOfCl -2
iallassIntervDensityOfC
An approximation to the sample variance
E.g., the moment of inertia for the entire histogram in Figure 1.8 is
While the sam le mean is 20.42
25.20065.0730.620177.0730.64194.0730.62 222
The narrower the rectangles (intervals) are, the closer the
approximation is
Statistics-Berlin Chen 34
7/28/2019 Sampling and Descriptive Statics
35/49
Symmetry and Skewness (1/2)
A histo ram is erfectl s mmetric if its ri ht half is a
mirror image of its left half
E.g., heights of random men
Histograms that are not symmetric are referred to as
skewed
A histogram with a long right-hand tail is said to be
s ewe o e r g , orpos ve y s ewe
E.g., incomes are right skewed (?)
-skewed to the left, ornegatively skewed
Grades on an eas test are left skewed ?
Statistics-Berlin Chen 35
7/28/2019 Sampling and Descriptive Statics
36/49
Symmetry and Skewness (2/2)
skewed to the left nearly symmetric skewed to the right
ere s a so ano er erm ca e ur os s a s a so
widely used for descriptive statistics
,the distribution of a population
Statistics-Berlin Chen 36
7/28/2019 Sampling and Descriptive Statics
37/49
More on Skewness and Kurtosis (1/3)
Skewness can be used to characterize the s mmetr of
a data set (sample)
Given a sample : nXXX ,,, 21
3n Skewness is defined by
31
1 snSkewness i i
symmetric distribution shape => 0Skewnessi
: Skewed to the right0Skewness
: ewe o e e
Statistics-Berlin Chen 37
0Skewness
7/28/2019 Sampling and Descriptive Statics
38/49
More on Skewness and Kurtosis (2/3)
kurtosis can be used to characterize the flatness of a
data set (sample)
Given a sample : nXXX ,,, 21
Kurtosis is defined by
4
1
4XX
Kurtosisni i
A standard normal distribution has 3Kurtosis
A larger kurtosis value indicates a peaked distribution
A smaller kurtosis value indicates a flat distribution
Statistics-Berlin Chen 38
7/28/2019 Sampling and Descriptive Statics
39/49
More on Skewness and Kurtosis (3/3)
standard normal
Statistics-Berlin Chen 39
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
7/28/2019 Sampling and Descriptive Statics
40/49
Unimodal and Bimodal Histograms (1/2)
Definition: Mode
Can refer to the most frequently occurring value in a sample
Or refer to a peak or local maximum for a histogram or othercurves
A unimodal histogram A bimodal histogram
mo a s ogram, n some cases, n ca es a e
sample can be divides into two subsamples that differ
Statistics-Berlin Chen 40
7/28/2019 Sampling and Descriptive Statics
41/49
Unimodal and Bimodal Histograms (2/2)
Example: Durations of dormant periods (in minutes) and
lon : more than 3 minutes
short: less than 3 minutes
Statistics-Berlin Chen 41
Histogram of all 60 durations Histogram of the durations
following short eruption
Histogram of the durationsfollowing long eruption
7/28/2019 Sampling and Descriptive Statics
42/49
Histogram with Height Equal to Frequency
Till now, we refer the term histo ram to a ra h in
which the heights of rectangles represent densities
,of rectangles equal to the frequencies
.
the heights equal to the frequencies
cf. Figure 1.8Figure 1.13
exaggerate the proportion
of vehicles in the intervals
Statistics-Berlin Chen 42
7/28/2019 Sampling and Descriptive Statics
43/49
Boxplot (1/4)
A box lot is a ra h that resents the median, the first
()1.5 IQR
and third quartiles, and any outliers present in the
sample
1.5 IQR
1.5 IQR
The interquartile range (IQR) is the difference between the third
Statistics-Berlin Chen 43
an rs quar e. s s e s ance nee e o span e m e
half of the data
7/28/2019 Sampling and Descriptive Statics
44/49
Boxplot (2/4)
Ste s in the Construction of a Box lot
Compute the median and the first and third quartiles of the
sample. Indicate these with horizontal lines. Draw vertical lines
Find the largest sample value that is no more than 1.5 IQR
above the third quartile, and the smallest sample value that is
not more than 1.5 IQR below the first quartile. Extend vertical
lines (whiskers) from the quartile lines to these points
1.5 IQR below the first quartile are designated as outliers. Plot
each outlier individually
Statistics-Berlin Chen 44
7/28/2019 Sampling and Descriptive Statics
45/49
Boxplot (3/4)
Exam le: A box lot for the e ser data resented in
Table 1.5
Notice there are no outliers in these data The sample values are comparatively
densely packed between the median and
the third uartile
The lower whisker is a bit longerthan the upper one, indicating that
tail than an upper tail
The distance between the first quartile and the median is greater
than the distance between the median and the third quartile
This boxplot suggests that the data are skewed to the left (?)
Statistics-Berlin Chen 45
B l t (4/4)
7/28/2019 Sampling and Descriptive Statics
46/49
Boxplot (4/4)
Another Exam le: Com arative box lots for PM
emissions data for vehicle driving at high versus low
altitudes
Statistics-Berlin Chen 46
S tt l t (1/2)
7/28/2019 Sampling and Descriptive Statics
47/49
Scatterplot (1/2)
Data for which item consists of a air of values is called
bivariate
The graphical summary for bivariate data is a scatterplot Display of a scatterplot (strength of Titanium () welds
vs. its chemical contents)
Statistics-Berlin Chen 47
S tt l t (2/2)
7/28/2019 Sampling and Descriptive Statics
48/49
Scatterplot (2/2)
Exam le: S eech feature sam le Dimensions 1 & 2 of
male (blue) and female (red) speakers after LDA
transformation
0
.
-0.4
-0.2
e
2
-0.8
-0.6featur
-0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2-1.2
-1
feature 1
Statistics-Berlin Chen 48
S
7/28/2019 Sampling and Descriptive Statics
49/49
Summary
We discussed t es of data
We looked at sam lin mostl SRS
We studied summar descri tive statistics
We learned about numeric summaries We examined graphical summaries (displays of data)
Statistics-Berlin Chen 49