Lesson 2: Descriptive Statistics
Ka-fu WONG
September 15, 2004
There are many ways to describe collected data. Graphical descriptions are often used. They are
easy to learn but difficult to master. Nowadays, standard data management programmes, such as Excel,
allow us to produce graphics on the fly. For this reason, we will not spend time on graphical presentations
such as histograms and stem-and-leaf plots. Most introductory business statistics textbooks have a chapter on
the topic. If you are interested in learning more about the graphical presentation of data, I recommend two books:
1. Cohn, Victor (1989): News & Numbers: A guide to reporting statistical claims and controversies in
health and other fields, Iowa State University Press.
2. Spirer, Herbert F., Louise Spirer, and A.J. Jaffe (1998): Misused Statistics, Marcel Dekker, Inc.
Here, we shall focus on selected descriptive statistics that we often use in the Economics and Finance
discipline. In the following discussion, we do not distinguish between the population statistics and their sample
counterparts. In most cases, population statistics and their sample counterparts use exactly the same
formula. The approach of using the sample counterparts to estimate the population statistics is known as “analog
estimation methods”.[1] In the following discussion, the only exception to this rule is the sample variance
formula.
Definition 1 (Population Parameter): A population parameter is a number calculated from all the
population measurements that describes some aspect of the population. Population parameters
are generally unknown and need to be estimated. Often population parameters are denoted
by Greek letters (α, β, θ, etc.). For example, the population mean, often denoted by µ, is a
population parameter and is the average of the population measurements.
Definition 2 (Estimator): An estimator is a formula or a rule that takes a set of data and
returns an estimate of the quantity we are interested in:
\[ \theta(x_1, x_2, \ldots, x_n) \]
Generally, the formula will return two different numbers for two different samples.
[1] Additional discussion may be found in Charles F. Manski, Analog Estimation Methods in Econometrics. New York: Chapman and Hall, 1988.
Ka-fu WONG, September 15, 2004, ECON1003 Lesson 2: Descriptive Statistics
Definition 3 (Point Estimate): A point estimate is a one-number estimate of the value of a
population parameter. In other words, a point estimate is a realization of the estimator based
on a sample. Often point estimates are denoted by English letters (a, b, c, etc.).
Definition 4 (Sample Statistic): A sample statistic is a number calculated using sample measurements
that describes some aspect of the sample. Thus, a sample statistic could be mechanically
computed and need not be a point estimate of a population parameter we are interested in. Of
course, a statistic is often a point estimate of a population parameter we are interested in.
1 Measure of central tendency
There are three major statistics that measure the central tendency: mean, median and mode. They help us
to answer questions like:
1. What is the likely income of a fresh university graduate?
2. What is the likely age of a first-year university student?
3. What is the likely stock return for a hedge fund?
In essence, we want to find a single number b such that the difference between a randomly drawn
observation and this number is small on average, based on some criterion. Let x be the randomly drawn
observation. Different criteria lead to different measures of central tendency. Historically, when
computing power was limited, we often considered the squared difference, i.e., $(x - b)^2$. Then, the measure of
central tendency is the solution to the minimization problem
\[ \min_b \sum_{i=1}^{n} (x_i - b)^2. \]
It turns out that solving this minimization problem is relatively undemanding in computing resources.
Recently, as computing power has become readily available, we are willing to consider an alternative
criterion, the absolute difference, i.e., $|x - b|$, where $|\cdot|$ denotes “the absolute value of”. Then, the measure of
central tendency is the solution to the minimization problem
\[ \min_b \sum_{i=1}^{n} |x_i - b|. \]
Although solving this minimization problem is more demanding in computing resources, the solution has some
desirable properties, as discussed below.
The major reason for the dominance of the “squared difference” criterion over the “absolute difference” criterion
is that the former minimization problem is smooth, i.e., differentiable. Thus, we may use differentiation to
reduce its demand on computing resources. We may also use differentiation to deduce some properties of
the “estimator”.
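As a short worked step (our notation, applied to the squared-difference criterion above), differentiating with respect to b and setting the derivative to zero gives the minimizer in closed form:

```latex
\[
\frac{d}{db}\sum_{i=1}^{n}(x_i-b)^2
  = -2\sum_{i=1}^{n}(x_i-b) = 0
\quad\Longrightarrow\quad
b = \frac{1}{n}\sum_{i=1}^{n} x_i .
\]
```

The second derivative equals 2n, which is positive, so this stationary point is a minimum. The absolute-difference criterion has a kink at every observation, so the same shortcut is not available there.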
We will see these criteria again in our study of Econometrics (Economic Statistics) later.
These three statistics have different properties. Depending on the question we are asking, the likely
population distribution and the computing resources we have, we may want to focus on one of them or
report all of them. Often, we report all of them unless we are short of computing resources.
Definition 5 (Mean): The mean of a collection of n observations (say, $\{x_1, x_2, \ldots, x_n\}$) is defined
by
\[ m = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \sum_{i=1}^{n} \frac{1}{n} x_i \]
Sometimes, we want to emphasize that the mean depends on the n observations and write
$m(x_1, x_2, \ldots, x_n)$ or $m(s_n)$, where $s_n \equiv \{x_1, x_2, \ldots, x_n\}$. We note that m is the solution to the
minimization problem
\[ \min_b \sum_{i=1}^{n} (x_i - b)^2 \]
and has often been used historically because of its computational convenience.
Mean is a very important concept in Statistics and Economics. Population mean is also known as
“expected value” to students in Economics and Finance. Sample mean is often used to estimate
the population mean.
Example 1 (Simple Mean): Suppose that a sample of 10 observations has been collected: 3,
5, 3, 4, 2, 1, 0, 6, 9, 7. The simple mean is
(3 + 5 + 3 + 4 + 2 + 1 + 0 + 6 + 9 + 7)/10 = 4
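The calculation in Example 1 can be sketched in a few lines of Python. The helper names and the grid of candidate values for b are our own illustrative choices, not part of the lesson; the grid search is only a crude numerical check that the mean minimizes the squared-difference criterion.

```python
# Data from Example 1.
data = [3, 5, 3, 4, 2, 1, 0, 6, 9, 7]


def simple_mean(xs):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(xs) / len(xs)


def sum_squared_diff(xs, b):
    """Squared-difference criterion evaluated at a candidate centre b."""
    return sum((x - b) ** 2 for x in xs)


m = simple_mean(data)

# Hypothetical grid of candidate values 0.0, 0.1, ..., 9.0 for b.
grid = [k / 10 for k in range(0, 91)]
best_b = min(grid, key=lambda b: sum_squared_diff(data, b))

print(m)       # 4.0
print(best_b)  # 4.0
```

The grid search is deliberately naive; in practice the closed-form mean is all that is needed.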
Challenge 1 (Mean of repeated observations): Consider a collection of 781 observations which
take only three values, as follows:
Value                 3     5     7
No. of observations   234   200   347
Can you compute its mean?
For a large number of repeated observations, the following weighted mean formula is handy.
Definition 6 (Weighted Mean): Suppose we have the following collection of repeated observations:
Value                 x1   x2   x3   ...   xn
No. of observations   w1   w2   w3   ...   wn
The weighted mean is defined by
\[ m = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \sum_{i=1}^{n} \frac{w_i}{\sum_{j=1}^{n} w_j} x_i = \sum_{i=1}^{n} w_i^* x_i \]
where $w_i$ is the positive weight for observation $x_i$ and $w_i^* = w_i / \sum_{j=1}^{n} w_j$ is the corresponding normalized weight. It may be demonstrated that $\sum_{i=1}^{n} w_i^* = 1$.
Thus, the simple mean has equal weight ($w_i^* = 1/n$) for all observations.
Example 2 (Mean of repeated observations): Consider a collection of 781 observations which
take only three values, as follows:
Value                 3     5     7
No. of observations   234   200   347
The mean can be computed as
\[ \frac{3 \times 234 + 5 \times 200 + 7 \times 347}{234 + 200 + 347} = \frac{4131}{781} \approx 5.29 \]
Note that we are not explicitly asked to compute a weighted mean, but the weighted mean
formula is convenient.
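A minimal sketch of the weighted mean applied to the repeated observations of Example 2. The function name `weighted_mean` is our own; it is not taken from any library.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight


values = [3, 5, 7]
counts = [234, 200, 347]   # number of observations taking each value

m = weighted_mean(values, counts)

# Normalized weights w*_i = w_i / sum(w); they should add up to one.
normalized = [w / sum(counts) for w in counts]

print(round(m, 2))                 # 5.29
print(round(sum(normalized), 10))  # 1.0
```

Passing equal weights reproduces the simple mean, which matches the remark that the simple mean uses weight 1/n for every observation.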
Example 3 (Proportions interpreted as mean (I)): The proportion of observations with a certain
characteristic in a collection of observations may be computed using the mean formula. Suppose
we are interested in the proportion of males in a class of 40. We can define an indicator variable
for each person i in the class:
\[ x_i = \begin{cases} 1 & \text{if the student is male} \\ 0 & \text{otherwise} \end{cases} \]
The proportion is simply
\[ m = \frac{x_1 + x_2 + \cdots + x_{40}}{40} = \frac{\sum_{i=1}^{40} x_i}{40} \]
Interpreting a proportion as a mean greatly simplifies our calculation. Often, survey
results are coded into numbers. For example, in census data, the sex variable is often coded 1 for
male and 2 for female. While the simple average of this variable is slightly difficult to interpret,
the simple average after a “re-coding” to 1 for male and 0 for female may be interpreted as the
proportion of males. How should we code the data if we want to compute the proportion of females
in a class?
Example 4 (Proportions interpreted as mean (II)): Suppose the following are the grades of the
13 students in a class: {A, B, A−, C, D, B, B−, A−, A, A+, B+, B, B}. What is the proportion
of students who get a B grade or above?
First, we convert the grades into an indicator:
\[ x_i = \begin{cases} 1 & \text{if the student gets a B grade or above} \\ 0 & \text{otherwise} \end{cases} \]
In terms of indicators, we have {1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1}. The sample proportion is simply
\[ m = \frac{x_1 + x_2 + \cdots + x_{13}}{13} = \frac{10}{13} \approx 0.77 \]
Thus, about 77% of the students in the class got a B grade or above.
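The indicator coding of Example 4 can be sketched as follows. The set `B_OR_ABOVE` is our own encoding of "B grade or above" (with B− excluded, as in the example), and grades are written with an ASCII hyphen.

```python
grades = ["A", "B", "A-", "C", "D", "B", "B-", "A-", "A", "A+", "B+", "B", "B"]

# Grades counted as "B or above" (B- is excluded, as in the example).
B_OR_ABOVE = {"A+", "A", "A-", "B+", "B"}

# Indicator variable: 1 if the grade is B or above, 0 otherwise.
indicators = [1 if g in B_OR_ABOVE else 0 for g in grades]

# The proportion is just the simple mean of the indicators.
proportion = sum(indicators) / len(indicators)

print(indicators)            # [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
print(round(proportion, 2))  # 0.77
```

Changing the membership of the set is all that is needed to compute the proportion for a different characteristic.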
Definition 7 (Percentile): The p-th percentile of a set of data is the number such that p% of
the data is less than that number. Operationally, determining the p-th percentile of a collection of
n observations takes two steps.
1. Rank the data in ascending order (from the smallest to the largest).
2. The p-th percentile is the k-th observation in ascending order, where k = ceil(n × p/100) and
ceil(.) means rounding up to the nearest integer.
Example 5 (Percentile): Suppose we have a collection of 1000 observations, arranged in
ascending order: {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, ..., 99.7, 99.8, 99.9, 100.0}. What is
the 75-th percentile?
The 75-th percentile is the 750-th observation (750 = 0.75 × 1000). The 750-th observation is 75.
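The two-step rule of Definition 7 can be sketched in Python as below. This follows the lesson's operational definition (the k-th ordered observation with k = ceil(n × p/100)); statistical software often uses interpolation rules instead, so results may differ slightly from library functions. The function name `percentile` is our own.

```python
import math


def percentile(data, p):
    """p-th percentile as the k-th smallest observation, k = ceil(n * p / 100)."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(data)                 # step 1: rank in ascending order
    k = math.ceil(len(ordered) * p / 100)  # step 2: position of the percentile
    return ordered[k - 1]                  # the k-th observation is at index k - 1


# The 1000 observations 0.1, 0.2, ..., 100.0 from Example 5.
observations = [i / 10 for i in range(1, 1001)]

print(percentile(observations, 75))  # 75.0
```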
Challenge 2 (Percentile): Suppose we have a collection of 2000 observations, arranged in
ascending order: {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, ..., 199.7, 199.8, 199.9, 200.0}. What
is the 75-th percentile?
Definition 8 (Median): The median of a set of data is the number such that 50% of the data is
less than that number. In other words, the sample median is the middle observation of the ordered
data. Suppose the data $\{x_1, x_2, \ldots, x_n\}$ have been arranged in ascending order, $x_1 \le x_2 \le \cdots \le x_n$.
Then:
• If the number of observations, n, is an odd number, the median is the [(n + 1)/2]-th ordered
observation.
• Otherwise, if the number of observations, n, is an even number, the median is the point
halfway between the (n/2)-th observation and the (n/2 + 1)-th observation in the ordered
list. Note that this will not be an observation unless the two observations in question are
equal.
The median is also known as the second quartile, or the 50th percentile.
We note that the median is a solution to the minimization problem
\[ \min_b \sum_{i=1}^{n} |x_i - b|. \]
Example 6 (Median): Consider a collection of 10 observations: 3, 5, 3, 4, 2, 1, 0, 6, 9, 7. To
find the median, we first rank the data from the smallest to the largest: {0, 1, 2, 3, 3, 4, 5, 6, 7, 9}.
Because we have an even number of observations, the median is the average of the two middle values:
(3 + 4)/2 = 3.5
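A minimal sketch of the odd/even rule of Definition 8, checked against Python's standard `statistics.median`. The helper names `median` and `sum_abs_diff` are our own; the last line simply evaluates the absolute-difference criterion at a few candidate values.

```python
import statistics

data = [3, 5, 3, 4, 2, 1, 0, 6, 9, 7]


def median(xs):
    """Middle ordered observation (odd n) or midpoint of the two middle ones (even n)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the [(n + 1) / 2]-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # halfway between the middle pair


def sum_abs_diff(xs, b):
    """Absolute-difference criterion evaluated at a candidate centre b."""
    return sum(abs(x - b) for x in xs)


med = median(data)
print(med)                             # 3.5
print(med == statistics.median(data))  # True

# With an even n, every b between the two middle values (3 and 4) attains the
# same minimum of the criterion; the median is the conventional midpoint.
print(sum_abs_diff(data, 3), sum_abs_diff(data, 3.5), sum_abs_diff(data, 4))  # 22 22.0 22
```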
Example 7 (Effect of an extreme observation on median and mean): Two samples of five
executives received the following bonuses last year ($000):
sample #1   15   17   16   15   200
sample #2   15   17   16   15   18
The executive bonus of 200 in the first sample appears to be an extreme observation. The following are
the sample mean and median of the first and second samples.
         sample #1   sample #2
mean     52.60       16.20
median   16.00       16.00
Another way to see the impact of an extreme observation is to compute the sample mean and
median of the first sample with and without the extreme observation.
         with    without
mean     52.60   15.75
median   16.00   15.50
Thus, the mean can be greatly affected by extreme observations, whereas the median changes only slightly.
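The figures in Example 7 can be reproduced with Python's standard `statistics` module. This is only a sketch; the variable names are our own.

```python
import statistics

sample_1 = [15, 17, 16, 15, 200]   # contains the extreme bonus of 200
sample_2 = [15, 17, 16, 15, 18]

# Mean and median of each sample.
for name, sample in [("sample #1", sample_1), ("sample #2", sample_2)]:
    print(name, statistics.mean(sample), statistics.median(sample))

# Sample #1 with the extreme observation removed.
trimmed = [x for x in sample_1 if x != 200]
print("without 200", statistics.mean(trimmed), statistics.median(trimmed))
```

Dropping a single observation moves the mean from 52.60 to 15.75, while the median moves only from 16.00 to 15.50.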
Example 8 (Monthly income from main employment 2001): Possibly because the median is less
sensitive to extreme observations, the Census and Statistics Department has chosen to report the
median income of employees surveyed in 2001. The following table shows the median monthly income
from main employment of employees by education in 2001, extracted from the 2001 Population
Census Main Report.
                              Male     Female
No schooling/Kindergarten     8,000    4,600
Primary                       9,500    5,500
Lower secondary               10,000   6,500
Upper secondary               12,000   10,000
Matriculation                 14,500   8,800
Tertiary: Non-degree course   20,000   15,300
Tertiary: Degree course       26,250   18,000
From the table, one can easily observe that female employees are paid
less than male employees with the same education.
The following table shows the median monthly income from main employment of employees by industry
in 2001, extracted from the 2001 Population Census Main Report.
                                                                     Male     Female
Manufacturing                                                        12,000   8,500
Construction                                                         10,000   8,600
Wholesale, retail and import/export trades, restaurants and hotels   11,000   8,000
Transport, storage and communications                                10,500   10,000
Financing, insurance, real estate and business services              15,000   13,000
Community, social and personal services                              15,000   6,200
Others                                                               13,048   8,800
The tables have many potential uses. What conclusions could we draw if we were analysts for:
1. The University Admissions Tutor, persuading secondary school students to pursue a degree
course.
2. The Admissions Tutor of the Economics and Finance programmes, persuading students to
choose his programmes.
3. The Equal Opportunities Commission, checking whether there is any violation of equal
opportunities law.
Definition 9 (First Quartile): The first quartile of a set of data is the number such that 25%
of the data is less than that number. Thus, the first quartile is also known as the 25th percentile.
Definition 10 (Third Quartile): The third quartile of a set of data is the number such that 75%
of the data is less than that number. Thus, the third quartile is also known as the 75th percentile.
Note that the operational definitions of the first and third quartiles may vary slightly across textbooks
and software packages.
Definition 11 (Mode): The mode of a set of data is the most common value found in the set of data.
A set of data can have more than one modal value.
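As a small illustration, `statistics.multimode` from the Python standard library (Python 3.8 or later) returns every modal value. The second data set is a hypothetical one chosen to have two modes.

```python
import statistics

# The ten observations used in the earlier examples: 3 appears twice.
data = [3, 5, 3, 4, 2, 1, 0, 6, 9, 7]
print(statistics.multimode(data))     # [3]

# A hypothetical data set with two modal values.
bimodal = [1, 1, 2, 3, 3]
print(statistics.multimode(bimodal))  # [1, 3]
```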
2 Measuring Dispersion
The accuracy of central tendency measures (i.e., how close the point estimate is to the parameter we are interested in)
depends on how spread out the data are, i.e., on their dispersion. If the data are very concentrated, we
will be more confident that the central tendency measure is an accurate answer to the question we have
in mind, say, the likely income earned by a fresh university graduate. If the data are very dispersed, we will be
less confident that the central tendency measure is an accurate answer to that question.
Definition 12 (Minimum, Maximum and Range): Rank the data in ascending order. The first
observation in the ranked data is the minimum. The last observation in the ranked data is the
maximum. The difference between the largest and the smallest value in the dataset is the range.
Suppose the ranked data are: 0, 1, 2, 3, 3, 4, 5, 6, 7, 9. The first observation 0 is the minimum.
The last observation 9 is the maximum. The range is 9− 0 = 9.
Definition 13 (Variance and Standard Deviation): Both variance and standard deviation are
measures of how spread out the data are. The variance of a collection of n observations is computed
as the average squared deviation of each number from the mean:
\[ v = \frac{(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2}{n} = \frac{\sum_{i=1}^{n} (x_i - m)^2}{n} \]
where m is the mean of the n observations.
The standard deviation is the square root of the variance:
\[ s = \sqrt{v}. \]
Note that the above variance formula, when applied to a sample of data as an estimator
of the population variance, is generally biased. The unbiased estimator of the population
variance turns out to be
\[ v^* = \frac{n}{n-1} v = \frac{\sum_{i=1}^{n} (x_i - m)^2}{n-1} \]
where m is the sample mean. We divide by “n − 1” instead of “n” because m has to be
estimated from the same sample. It is often said that “a degree of freedom” is lost in the computation
of the sample variance. The difference between v and v∗ will be negligible when n is large.
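The two variance formulas can be compared numerically. The sketch below uses the ranked data from Definition 12 and checks the hand-written formulas against `statistics.pvariance` (divide by n) and `statistics.variance` (divide by n − 1) from the Python standard library; the variable names are our own.

```python
import math
import statistics

data = [0, 1, 2, 3, 3, 4, 5, 6, 7, 9]
n = len(data)
m = sum(data) / n

# v: average squared deviation from the mean (divide by n).
v = sum((x - m) ** 2 for x in data) / n
# v*: unbiased estimator of the population variance (divide by n - 1).
v_star = sum((x - m) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance.
s = math.sqrt(v)

print(v)                 # 7.0
print(round(v_star, 4))  # 7.7778
print(math.isclose(v, statistics.pvariance(data)))      # True
print(math.isclose(v_star, statistics.variance(data)))  # True
```

With only ten observations the two versions differ noticeably (7.0 versus about 7.78); the gap shrinks as n grows, since v∗ = v × n/(n − 1).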
Variance and standard deviation are convenient measures of dispersion. They differ mainly in the units of
measurement. A larger standard deviation means the data are more dispersed.
Example 9 (Dispersion of asset returns): Dispersion is used as a measure of risk. Consider two
assets with the same expected (mean) return of 2%.
Asset   bad time   normal time   good time
A       0%         2%            4%
B       -2%        2%            6%
The dispersion of returns of the second asset is larger than that of the first. If we were to compute the
standard deviation for the population, we would get a bigger standard deviation for the second asset. Thus,
the second asset is more risky. Knowledge of dispersion is therefore essential for investment
decisions, and so is knowledge of expected (mean) returns.
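A quick numerical check of Example 9 follows. Treating the three states as equally likely is an assumption of this sketch (the example does not state probabilities), and the population standard deviation `statistics.pstdev` is used.

```python
import statistics

# Returns in percent across the three states listed in the table; each state
# is treated as equally likely, which is an assumption of this sketch.
returns = {"A": [0, 2, 4], "B": [-2, 2, 6]}

for asset, r in returns.items():
    mean = statistics.mean(r)
    sd = statistics.pstdev(r)   # population standard deviation (divide by n)
    print(asset, mean, round(sd, 3))
```

Both assets have a mean return of 2%, but the standard deviation of asset B is twice that of asset A.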
The following two theorems relate standard deviations to dispersion.
Theorem 1 (Chebyshev’s theorem): For any set of observations, the proportion of the
values that lie within k standard deviations of the mean is at least
\[ 1 - \frac{1}{k^2} \]
where k is any constant greater than 1. The following table shows the relationship between k
and the minimum proportion of data covered (i.e., coverage). For k = 1 the bound is 0%, so it is uninformative.
k          1       2        3        4        5        6
Coverage   0.00%   75.00%   88.89%   93.75%   96.00%   97.22%
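The coverage table can be regenerated directly from the bound 1 − 1/k². The sketch below also compares the bound with the observed coverage for the first bonus sample of Example 7; the function names are our own, and the population standard deviation is used.

```python
import statistics


def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on the share of data within k standard deviations."""
    return 1 - 1 / k ** 2


def coverage(data, k):
    """Observed share of data within k population standard deviations of the mean."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    inside = [x for x in data if abs(x - m) <= k * s]
    return len(inside) / len(data)


# Reproduce the coverage table for k = 1, ..., 6.
for k in range(1, 7):
    print(k, f"{chebyshev_bound(k):.2%}")

# Illustration with the first bonus sample of Example 7 (one extreme observation).
bonuses = [15, 17, 16, 15, 200]
print(coverage(bonuses, 2) >= chebyshev_bound(2))   # True
```

The bound holds for any data set, which is why it is much weaker than the figures quoted for bell-shaped distributions.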
Chebyshev’s theorem, which holds for any set of observations, contrasts with the empirical rule.
Theorem 2 (Empirical rule): For any symmetrical, bell-shaped distribution:
1. About 68% of the observations will lie within one standard deviation of the mean.
2. About 95% of the observations will lie within two standard deviations of the mean.
3. Virtually all the observations will be within three standard deviations of the mean.
The empirical rule is also known as the normal rule.
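For an exactly normal distribution, the three percentages of the empirical rule can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This sketch assumes normality; real data that are only roughly bell-shaped will match these figures only approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

for k in (1, 2, 3):
    # Probability mass within k standard deviations of the mean.
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, f"{share:.2%}")
# 1 68.27%
# 2 95.45%
# 3 99.73%
```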
Definition 14 (Coefficient of variation): The coefficient of variation is the ratio of the standard
deviation (s) to the arithmetic mean (m), expressed as a percentage:
\[ CV = \frac{s}{m} \times 100\% \]
The coefficient of variation is often used to measure relative dispersion.
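A minimal sketch of the coefficient of variation, using the population standard deviation for s (an assumption; the sample version could be used instead). The data are the second bonus sample of Example 7, expressed in two different units to show that the CV does not depend on the unit of measurement.

```python
import math
import statistics


def coefficient_of_variation(data):
    """CV = s / m * 100, with s the population standard deviation (divide by n)."""
    m = statistics.mean(data)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(data) / m * 100


# Sample #2 of Example 7, in thousands and in single units.
in_thousands = [15, 17, 16, 15, 18]
in_units = [x * 1000 for x in in_thousands]

cv_thousands = coefficient_of_variation(in_thousands)
cv_units = coefficient_of_variation(in_units)

print(round(cv_thousands, 2), round(cv_units, 2))
print(math.isclose(cv_thousands, cv_units))  # True
```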
In finance, under some conditions the well-known Sharpe Ratio can be written as the inverse of CV.[2] If x is
the return of an investment strategy in excess of the market portfolio, m will be the expected excess return
and s will be the standard deviation of the excess return. Of course, an investment strategy with a higher
Sharpe Ratio is preferred.
Definition 15 (Skewness): Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of
the center point.
\[ \text{skewness} = \frac{\sum_{i=1}^{n} (x_i - m)^3}{(n-1) s^3} \]
where m is the sample mean and s is the standard deviation.
A positive skewness implies the distribution is skewed to the right; typically, mode < median < mean.
A negative skewness implies the distribution is skewed to the left; typically, mean < median < mode.
A symmetric distribution has zero skewness and, if it has a single peak, mean = median = mode.
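The skewness formula of Definition 15 can be sketched as follows. The definition does not say which standard deviation s denotes; this sketch assumes the one based on the n − 1 variance, and the function name and test data are our own.

```python
import math


def skewness(data):
    """sum((x - m)^3) / ((n - 1) * s^3), with s from the n - 1 variance (an assumption)."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return sum((x - m) ** 3 for x in data) / ((n - 1) * s ** 3)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [15, 17, 16, 15, 200]   # one large value in the right tail

print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
```

Mirroring the data (multiplying every value by −1) flips the sign of the skewness, which matches the left-skewed case.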
Skewness measures the degree of asymmetry in risk: upside risk versus downside risk. A right-skewed
distribution of asset returns implies higher upside risk than downside risk. A left-skewed distribution of asset
returns implies higher downside risk than upside risk.
[2] The Sharpe ratio is a measure of the performance of investment strategies, with an adjustment for risk. For additional discussion
of the Sharpe ratio, please refer to http://www.stanford.edu/~wfsharpe/art/sr/sr.htm.
Definition 16 (Kurtosis): Kurtosis is a measure of whether the data are peaked or flat relative
to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near
the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have
a flat top near the mean rather than a sharp peak.
\[ \text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - m)^4}{(n-1) s^4} \]
where m is the mean and s is the standard deviation.
Excess kurtosis is defined as
\[ \text{kurtosis} - 3. \]
It is defined such that a normal distribution has an excess kurtosis of zero. A positive excess
kurtosis implies that the distribution has fatter tails than the normal distribution.
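A sketch of the kurtosis and excess kurtosis formulas of Definition 16, again assuming that s is the standard deviation based on the n − 1 variance. Both data sets are hypothetical and chosen only to contrast a flat shape with a heavy-tailed one.

```python
import math


def kurtosis(data):
    """sum((x - m)^4) / ((n - 1) * s^4), with s from the n - 1 variance (an assumption)."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return sum((x - m) ** 4 for x in data) / ((n - 1) * s ** 4)


def excess_kurtosis(data):
    """Kurtosis minus 3, so that a normal distribution scores about zero."""
    return kurtosis(data) - 3


# Hypothetical data: evenly spread values (flat) versus mostly identical
# values with two far-out observations (peaked with heavy tails).
flat = list(range(1, 11))
heavy_tailed = [0] * 18 + [-10, 10]

print(excess_kurtosis(flat) < 0)          # True
print(excess_kurtosis(heavy_tailed) > 0)  # True
```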
Among all descriptive statistics discussed above, we are often interested in the mean and standard
deviation. As we will show in later chapters, under some regularity conditions, the sample mean will be approximately
“normally” distributed, which allows us to compute confidence intervals and to conduct hypothesis tests. The simple sample
mean can also be extended to allow the mean to vary with some other variables. For instance, we may
compute the mean income from a sample. We may also allow the mean income to vary with education,
similar to the median incomes reported in one of the earlier examples.