Probability Review - online.sfsu.edu/mbar/ECON851_files/Probability Review.pdf
Probability Review
Michael Bar1
July 9, 2012
¹San Francisco State University, Department of Economics.
Contents
1 Appendix 1: Probability Review
1.1 Probability Spaces
1.2 Random Variables
1.3 Random Vectors
1.3.1 Marginal pdf
1.3.2 Conditional pdf
1.3.3 Independence of random variables
1.4 Moments
1.4.1 Expected value
1.4.2 Variance and standard deviation
1.4.3 Covariance and correlation
Chapter 1
Appendix 1: Probability Review
In this chapter we review some basic concepts in probability and statistics. This chapter is
not a substitute for an introductory course in statistics, such as ECON 311. My goal here is
to review some key concepts for those who are already familiar with them, but might need
a speedy refresher.
1.1 Probability Spaces
The most fundamental concept in probability theory is a random experiment.
Definition 1 A random experiment is an experiment whose outcome cannot be predicted
with certainty, before the experiment is run.
For example: a coin flip, a toss of a die, getting married, or getting a college degree. Although we cannot predict with certainty the exact outcome of a coin flip, we know that it can be either "heads" or "tails". Similarly, in a toss of a die we know that the outcome is one of the following: 1, 2, 3, 4, 5, 6.
Definition 2 A sample space, Ω, is the set of all possible (and distinct) outcomes of a
random experiment.
Thus, the sample space for the random experiment of a coin flip is Ω = {"heads", "tails"}, and the sample space of the random experiment of a die toss is Ω = {1, 2, 3, 4, 5, 6}. It is less clear what the sample space is for the random experiments of getting married or getting a college degree. In the first, one can look at the number of kids as an outcome, or the duration the marriage will last. In the second, one can look at the types of jobs the person gets, his wage, etc.
As another example, what is the sample space of flipping two coins? Let heads and tails be denoted by H and T. Then the sample space is Ω = {HH, HT, TH, TT}; that is, the sample space consists of 4 outcomes.
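The enumeration above can be sketched in a few lines of Python (a minimal illustration; the variable names are my own):

```python
from itertools import product

# Sample space of flipping two coins: all ordered pairs of H and T.
outcomes = ["".join(pair) for pair in product("HT", repeat=2)]
print(outcomes)  # ['HH', 'HT', 'TH', 'TT']
```

The same call with `repeat=3` enumerates the sample space of three coin flips.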
Exercise 1 Write the sample space of flipping 3 coins.
Exercise 2 Write the sample space of tossing two dice.
Sometimes we don't care about all the particular outcomes of an experiment, but only about groups of outcomes. For example, consider the random experiment of taking a course. The set of all possible numerical grades is Ω = {0, 1, ..., 100}. However, if a letter grade of "A" is assigned to all grades of 92 and above, then we might be particularly interested in the subset of Ω which gives an "A": {92, 93, ..., 100}.
Definition 3 An event A is a subset of the sample space, denoted A ⊆ Ω.
In the previous example of a grade in a course, we can define an event A = {92, 93, ..., 100}. If one of the outcomes in A occurs, we say that event A occurred. Notice that A ⊆ Ω, which reads "A is a subset of Ω". As another example, suppose that a student fails a course if his grade is below 60. We can define an event B = {0, 1, ..., 59}. Again, if one of the outcomes in B occurs, we say that event B occurred.
Exercise 3 Let the random experiment be flipping two coins, and A be the event in which the two coins have identical outcomes. List the outcomes in A.
Exercise 4 Let the random experiment be a toss of a die, and A be the event in which the outcome is odd. List the outcomes in A.
Finally, we would like to calculate the probability of an event. Intuitively, the probability of an event is the "chance" that the event will happen.
Definition 4 The probability of a random event denotes the relative frequency of occur-
rence of an experiment’s outcomes in that event, when repeating the experiment many times.
For example, in a toss of a balanced coin, if we repeat the experiment many times, we expect to get "heads" half of the time, and "tails" half of the time. We then say that "the probability of heads is 1/2 and the probability of tails is also 1/2". Similarly, in tossing a balanced die, we expect each of the outcomes to occur with probability 1/6. Let the event A denote an even outcome in a toss of a die. Thus, A = {2, 4, 6}. What is the probability of A? Intuitively, the probability of this event is the sum of the probabilities of each outcome, i.e. P(A) = 1/6 + 1/6 + 1/6 = 1/2. In general, the probability of event A is calculated by adding up the probabilities of the outcomes in A.
It should be obvious that P(Ω) = 1 from the definition of the sample space. Since the sample space contains all possible outcomes of a random experiment, the probability that some outcome will occur is 1. It should also be obvious that the probability of any event must be between 0 and 1, and your instructors will be very alarmed if you calculate negative probabilities or probabilities that are greater than 1.
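The "add up the probabilities of the outcomes in the event" rule is easy to check numerically. A small sketch (my own variable names; exact fractions avoid rounding noise):

```python
from fractions import Fraction

# Fair die: each outcome in {1, ..., 6} has probability 1/6.
pdf = {x: Fraction(1, 6) for x in range(1, 7)}

# P(A) for the event A = {2, 4, 6} is the sum of its outcome probabilities.
A = {2, 4, 6}
P_A = sum(pdf[x] for x in A)
print(P_A)                # 1/2
print(sum(pdf.values()))  # 1, since P(Omega) = 1
```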
In the examples of a coin flip and a toss of a die, each distinct outcome occurs with equal probability. This need not always be the case. Consider the experiment of picking a person at random and recording his education. The sample space can be, for example, {"no school", "less-than-high school", "high school", "college degree", "MA", "Ph.D."}. Obviously P("high school") is much higher than P("MA").
1.2 Random Variables
Notice that some random experiments have numerical outcomes (e.g. a toss of a die) while others have outcomes that we describe verbally (e.g. "heads" and "tails" in a flip of a coin, or the education level of a person: "high school", "college degree", ...). Calculating probabilities and performing statistical analysis becomes much simpler if we can describe all outcomes in terms of numbers.
Definition 5 A random variable is a function which assigns a real number to each outcome in the sample space. Formally, X : Ω → R is the notation of a function (named X), which maps the sample space Ω into the real numbers R.
From the above definition, note that if X is a random variable and g(·) is some real valued function, then g(X) is another random variable. It is conventional to use capital letters to denote the random variable's name, and lower case letters to denote particular realizations of the random variable. For example, we have seen that the sample space of flipping two coins is Ω = {HH, HT, TH, TT}. One can define a random variable X, which counts the number of times that "heads" is observed. The possible values of X (called the support of X) are x ∈ {0, 1, 2}. We can then calculate the probabilities P(X = 0), P(X = 1) and P(X = 2); in other words, we can find the distribution of the random variable X.
Exercise 5 Calculate the probabilities of the random variable X, which counts the number of "heads" in a two coin flip. That is, find P(X = 0), P(X = 1) and P(X = 2).
As another example, consider the random experiment to be a course with 20 students, with the relevant outcomes being success or failure. Suppose that I want to calculate the probability that all students pass the course, or that 1 student fails and 19 pass, or that 2 students fail and 18 pass, etc. I can define a random variable X to be the number of students who fail the course. Thus, the possible values of X are {0, 1, 2, ..., 20}. If I know from previous experience that with probability p a student fails my course, we can calculate the distribution of X. Roughly speaking, we can calculate P(X = 0), P(X = 1), ..., P(X = 20).
Definition 6 A discrete random variable is one that has a finite or countable number
of possible values.
Definition 7 A continuous random variable is one that has a continuum of possible values.
The distribution of a random variable is usually described by its probability density function (pdf). The probability density function of a discrete random variable assigns probabilities to each value of the random variable:

f(x) = P(X = x)

Here X is the name of the random variable, and x is one possible realization. The probability density function of a continuous random variable can be used to calculate the probability that the random variable gets values in a given interval [a, b], by integrating the pdf over that interval:

∫_a^b f(x) dx = P(a ≤ X ≤ b)
For example, the random variable X that counts the number of failing students, among 20 who take the course, is an example of a discrete random variable. It has 21 possible values (the support is x ∈ {0, 1, 2, ..., 20}) and a binomial distribution. We write X ~ B(n, p) to indicate that there are n students, each of whom fails with probability p. If there are 20 students and p = 5%, we write X ~ B(20, 0.05). In case you are curious, the pdf of X is

f(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)

where

(n choose x) = n! / (x!(n − x)!)

The next figure illustrates the pdf of B(20, 0.05). Notice that a high number of failures is very unlikely, which is good news.
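The binomial pdf formula above is straightforward to evaluate directly; here is a short sketch (the function name `binom_pdf` is my own):

```python
from math import comb

def binom_pdf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# X ~ B(20, 0.05): the number of failing students out of 20.
probs = [binom_pdf(x, 20, 0.05) for x in range(21)]
print(round(probs[0], 4))  # 0.3585 -- most likely nobody fails
```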
Exercise 6 Suppose that 20 students enroll in a course, and their instructor knows from experience that each of them fails with probability 0.05. Calculate the probability that 7 students fail the course.
An example of a continuous random variable is one with the continuous uniform distribution. If X ~ U[a, b], we say that X has a uniform distribution on the interval [a, b]. The probability density function of X is f(x) = 1/(b − a) ∀x ∈ [a, b], and f(x) = 0 otherwise. Thus, the support of this random variable is x ∈ [a, b]. The next figure plots the pdf of a continuous uniform distribution.
Suppose that we have two numbers c, d such that a ≤ c ≤ d ≤ b. The probability that X ∈ [c, d] is

P(c ≤ X ≤ d) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a)

In other words, the probability that a continuous uniform random variable falls within some interval [c, d] is equal to the relative length of that interval to the support.
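The relative-length formula is simple enough to express directly; a sketch with a hypothetical example interval of my own choosing:

```python
def uniform_prob(c, d, a, b):
    """P(c <= X <= d) for X ~ U[a, b], assuming a <= c <= d <= b."""
    return (d - c) / (b - a)

# Hypothetical example: X ~ U[0, 10]; the probability of landing in [2, 4]
# is the relative length of that interval, 2/10.
print(uniform_prob(2, 4, 0, 10))  # 0.2
```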
In general, when the pdf of a random variable is given, the support of that random variable is the set of all the values for which the pdf is positive. Written mathematically,

support(X) = {x | f(x) > 0}

which reads "the support of X is the set of values x such that the pdf is positive". This is the same as saying that the support of X is the set of all possible values of X, because values for which the pdf is zero are not possible.
It should be obvious that the sum (in the discrete case) or integral¹ (in the continuous case) of all the values of a probability density function is 1. This has to be true, because a random variable assigns a number to any outcome in the sample space, and since P(Ω) = 1, the probability of the random variable getting some number is also 1. Formally,

[X is discrete] : Σ_x f(x) = 1
[X is continuous] : ∫_{−∞}^{∞} f(x) dx = 1

In fact, any function which integrates (or sums) to 1 and is nonnegative (never attains negative values) is a probability density function of some random variable.
Exercise 7 Suppose that X is a continuous uniform random variable, with support [a, b]. Find the critical value c such that P(X ≥ c) = 0.05.
¹Mathematicians often don't even bother using the word sum, because an integral is also a sum. The integral symbol, ∫, comes from an elongated letter s, standing for summa (Latin for "sum" or "total").
Exercise 8 Suppose that X is a continuous uniform random variable, with support [a, b]. Find the critical value c such that P(X ≤ c) = 0.05.
Exercise 9 Suppose that X has the exponential distribution, denoted X ~ exp(λ), λ > 0, i.e. the pdf is:

f(x) = { λe^(−λx),  x ≥ 0
       { 0,          x < 0

(a) Prove that the above function is indeed a probability density function.
(b) Calculate the critical value x* such that P(X ≥ x*) = 0.05.
1.3 Random Vectors
Sometimes we want to explore several random variables at the same time, i.e. jointly. For example, suppose that we choose people at random and look at how their education and wages are distributed together. Letting X denote the years of education and Y denote the hourly wage, we can describe the joint distribution of X and Y with a joint probability density function (joint pdf). If X and Y are discrete, their joint pdf is

f(x, y) = P(X = x, Y = y)

If X and Y are continuous, the probability that X ∈ [a, b] and Y ∈ [c, d] is

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx

Obviously this discussion can be generalized to any number of random variables, say X₁, ..., Xₙ, with joint pdf f(x₁, ..., xₙ). For example, when the unit of observation is a randomly chosen household, we might be interested in the joint distribution of the number of people in the household, their genders, their incomes, their education levels, their work experience, their age, their race, etc.
Joint pdfs must also integrate (or sum) to 1, just as in the single random variable case, because we know with certainty that X and Y will attain some values. Formally,

[X and Y are discrete] : Σ_x Σ_y f(x, y) = 1
[X and Y are continuous] : ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

Again, the above can be generalized to the joint pdf of any number of random variables.
1.3.1 Marginal pdf
In relation to the joint pdf f(x, y), the functions f(x) and f(y) are called marginal (or individual) probability density functions. The marginal pdfs are obtained by

[X and Y are discrete] : f(x) = Σ_y f(x, y),  f(y) = Σ_x f(x, y)    (1.1)
[X and Y are continuous] : f(x) = ∫_{−∞}^{∞} f(x, y) dy,  f(y) = ∫_{−∞}^{∞} f(x, y) dx    (1.2)

Intuitively, X can attain a given value x when Y attains the value y₁ or y₂ or any other possible value. Thus, to get P(X = x) we need to sum (integrate) over all these cases of Y getting any of its possible values.
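The summing-out operation in equation (1.1) can be sketched on a small discrete example (the joint probabilities below are hypothetical numbers of my own, chosen only so they sum to 1):

```python
# Hypothetical discrete joint pdf f(x, y), stored as {(x, y): probability}.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.35,
}

# Marginal pdfs per equation (1.1): sum the joint pdf over the other variable.
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})
f_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}
f_Y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
print(f_X[0], f_Y[1])  # marginal probabilities P(X = 0) and P(Y = 1)
```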
A point of caution is needed here. The marginal densities of X and Y need not be the same function. Some texts use the notations f_X(x) and f_Y(y) to make it clear in equations (1.1) and (1.2) that the marginal pdfs are in general not the same function; X has pdf f_X and Y has pdf f_Y. These notes follow other texts, which give up some mathematical precision in favor of ease of notation. Our textbook does not discuss joint and marginal densities, and therefore does not have this problem².
For example, suppose that I am looking at the joint distribution of gender, X, attaining the value of 1 for female and 0 for male, and GPA, Y, attaining values y₁, ..., yₙ, of students in this course. Suppose I want to obtain the marginal distribution of the gender X. Then, using equation (1.1), we have

f(1) = P(X = 1) = f(1, y₁) + ... + f(1, yₙ) = Σ_y f(1, y)
f(0) = P(X = 0) = f(0, y₁) + ... + f(0, yₙ) = Σ_y f(0, y)

Written another way, a female or a male student can have any of the GPA values y₁, ..., yₙ, and therefore

P(female) = P(female and Y = y₁) + ... + P(female and Y = yₙ)
P(male) = P(male and Y = y₁) + ... + P(male and Y = yₙ)
Exercise 10 Consider the function:

f(x, y) = { 2 − x − y,  0 ≤ x ≤ 1; 0 ≤ y ≤ 1
          { 0,          otherwise

(a) Show that

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

In other words, prove that f(x, y) is a probability density function.
(b) Find the marginal pdfs f(x) and f(y).
1.3.2 Conditional pdf
In econometrics, we are often interested in the behavior of one variable, conditional on given values of another variable. For example, I might want to study how well female students are doing in statistics courses, compared with male students. This is the study of the pdf of grades, given that the student is female, and the pdf of grades, given that the student is male. As another example, one might want to study the distribution of wages, given that the level of education is "college", vs. the distribution of wages given that the education level is "MBA". Again, this is the study of the pdf of wages, conditional on education level.
²The reason why I bother you with a detailed discussion of joint densities will become clear when we talk about independence of random variables - a key concept in sampling theory and regression analysis.
Definition 8 The conditional probability density function of Y given X is given by

f(y|x) = f(x, y) / f(x)    (1.3)

In cases when f(x) = 0, we define f(y|x) = 0.

The left hand side of equation (1.3) is sometimes written as f(y|X = x), to emphasize that this is the conditional density of Y given that X is fixed at a given value x. The vertical bar with x following it, |x, means that we are conditioning on x (or it is given that X = x). For example, the conditional pdf of wages (W), given that education is at level 3, (E = 3), is written

f(w|E = 3) = f(3, w) / P(E = 3)

or in short

f(w|3) = f(3, w) / f(3)
One can think of whatever is written after the bar as information. If two variables, X and Y, are related, then information on one of them should restrict the possible values of the other. For example, consider the random experiment of tossing a die, and let X be the result of a toss. Thus, the possible values of X (the support of X) are {1, 2, 3, 4, 5, 6}. The pdf of X is f(x) = 1/6 for x = 1, ..., 6. Now suppose that we provide the information that the result of a toss is an even number. We can provide this information by defining a random variable Y = 1 if X is even and Y = 0 if X is odd. What is the conditional pdf of X given that Y = 1? We know that conditional on Y = 1, X can only attain the values {2, 4, 6}. Intuitively, these values can be attained with equal probabilities, so P(X = 2|Y = 1) = P(X = 4|Y = 1) = P(X = 6|Y = 1) = 1/3. This pdf can be obtained using the definition of conditional density in equation (1.3), and the fact that even numbers in a toss of a die occur with probability 1/2:

f(2|Y = 1) = f(1, 2) / P(Y = 1) = (1/6) / (1/2) = 1/3
f(4|Y = 1) = f(1, 4) / P(Y = 1) = (1/6) / (1/2) = 1/3
f(6|Y = 1) = f(1, 6) / P(Y = 1) = (1/6) / (1/2) = 1/3
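The die-and-parity computation above can be reproduced mechanically from the joint pdf (a sketch; variable names are my own, and exact fractions keep the arithmetic transparent):

```python
from fractions import Fraction

# Die toss X with indicator Y = 1 if X is even, 0 if X is odd.
# Joint pdf: f(y, x) = 1/6 for each matching (parity, roll) pair.
joint = {(1 if x % 2 == 0 else 0, x): Fraction(1, 6) for x in range(1, 7)}

# P(Y = 1) = 1/2; then per equation (1.3), f(x|Y = 1) = f(1, x) / P(Y = 1).
P_Y1 = sum(p for (y, _), p in joint.items() if y == 1)
cond = {x: joint[(1, x)] / P_Y1 for x in (2, 4, 6)}
print(P_Y1, cond[2])  # 1/2 1/3
```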
1.3.3 Independence of random variables
What if I provide information which is not useful? Intuitively, the distribution of a random variable with no information at all should be the same as the conditional distribution given some useless information. For example, suppose that I toss a die, and let X be the random variable which is equal to the result. Suppose that at the same time, my friend flips a coin, and let Y = 1 if heads and Y = 0 if tails. What is the conditional pdf of X given that Y = 1? The information about the result of the coin flip does not reveal anything about the likely outcomes of the die toss. In this case we say that X and Y are independent.
Intuitively, when information is useless, we expect that f(y|x) = f(y). Thus, from the definition (1.3) it follows that when X and Y are independent, we have:

f(y|x) = f(x, y) / f(x) = f(y)  ⇒  f(x, y) = f(x) f(y)

Definition 9 Two random variables are statistically independent if and only if

f(x, y) = f(x) f(y)

that is, the joint pdf can be written as the product of the marginal pdfs.
In general though, if X and Y are dependent, the definition in (1.3) says that the joint pdf can be factored in two ways:

f(y|x) = f(x, y) / f(x)  ⇒  f(x, y) = f(x) f(y|x)
f(x|y) = f(x, y) / f(y)  ⇒  f(x, y) = f(y) f(x|y)

In other words, the joint pdf is the product of a conditional and a marginal pdf. Comparing the two gives the relationship between the two conditional densities, known as the Bayes rule for pdfs:

f(y|x) = f(y) f(x|y) / f(x)

The concept of independence is crucial in econometrics. For starters, a random sample must be such that the units of observation are independent of each other.
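Definition 9 gives a direct, checkable criterion for discrete distributions: compute the marginals and compare f(x)·f(y) with f(x, y) at every point. A sketch (the helper name `is_independent` and both example tables are my own):

```python
def is_independent(joint, tol=1e-12):
    """Definition 9: f(x, y) == f(x) * f(y) at every point of the support."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    f_X = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    f_Y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0) - f_X[x] * f_Y[y]) <= tol
               for x in xs for y in ys)

# Die roll with an independent fair coin: joint pdf is the product of marginals.
indep = {(x, y): (1 / 6) * (1 / 2) for x in range(1, 7) for y in (0, 1)}
# Die roll with Y = parity of the roll: knowing Y restricts X, so dependent.
dep = {(x, x % 2): 1 / 6 for x in range(1, 7)}
print(is_independent(indep), is_independent(dep))  # True False
```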
Exercise 11 Consider the pdf

f(x, y) = { 2 − x − y,  0 ≤ x ≤ 1; 0 ≤ y ≤ 1
          { 0,          otherwise

Check whether X and Y are statistically independent.
1.4 Moments
Often the distribution of a random variable can be summarized in terms of a few of its char-
acteristics, known as the moments of the distribution. The most widely used moments are
mean (or expected value) and variance. In characterizing a random vector, in addition
to mean and variance, we also look at covariance and correlation.
1.4.1 Expected value
Intuitively, expected value or the mean of a random variable, is some average of all the values
that the random variable can attain. In particular, the values of the random variable should
be weighted by their probabilities.
Definition 10 The expected value (or mean) of a random variable X is

[If X is discrete] : E(X) = Σ_x x f(x)
[If X is continuous] : E(X) = ∫_{−∞}^{∞} x f(x) dx

In the discrete case, the sum is over all the possible values of X, and each is weighted by the discrete pdf. No differently, in the continuous case, the integral is over all the possible values of X, weighted by the continuous pdf.
For an example of the discrete case, consider X to be the result of a die toss, with pdf f(x) = 1/6 for all x ∈ {1, 2, 3, 4, 5, 6}. This is called the discrete uniform distribution. The expected value of X is:

E(X) = Σ_x x f(x) = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 3.5
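The probability-weighted sum for the die can be written out directly (a sketch with my own variable names; exact fractions avoid rounding):

```python
from fractions import Fraction

# Discrete uniform die: E(X) = sum over x of x * f(x), with f(x) = 1/6.
f = {x: Fraction(1, 6) for x in range(1, 7)}
EX = sum(x * p for x, p in f.items())
print(EX, float(EX))  # 7/2 3.5
```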
For an example of a continuous case, consider X ~ U[a, b] with pdf f(x) = 1/(b − a) (and f(x) = 0 outside of the interval [a, b]). This is the continuous uniform distribution. The expected value of X is:

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_a^b x · 1/(b − a) dx
     = 1/(b − a) · (x²/2)|_a^b
     = 1/(b − a) · (b² − a²)/2
     = 1/(b − a) · (b − a)(b + a)/2
     = (a + b)/2

This result is not surprising, when you look at the graph of the continuous uniform density and notice that the mean is simply the middle of the support [a, b].
It is important to realize that the expectation of a random variable is always a number. While the toss of a die is a random variable which attains 6 different values, its expectation (mean) is the single number 3.5. Next, we list some general properties of expected values, i.e. properties that apply to discrete and continuous random variables.
Rules of Expected Values
1. Expected value of a constant is the constant itself. Thus, if c is a constant, then

E(c) = c

2. Constants factor out. If c is a constant number, then

E(cX) = cE(X)

3. Expected value of a sum is the sum of expected values. That is, for any random variables X and Y,

E(X + Y) = E(X) + E(Y)

Together with rule 2, this generalizes to any linear combination of random variables. Let X₁, ..., Xₙ be random variables, and let c₁, ..., cₙ be numbers. Then,

E(c₁X₁ + ... + cₙXₙ) = c₁E(X₁) + ... + cₙE(Xₙ)

4. If X and Y are independent, then

E(XY) = E(X)E(Y)
We will prove the 4th rule, and leave the 2nd and 3rd as an exercise.

Proof. (4th rule of expected values). The product XY is another random variable, since any function of random variables is another random variable. Thus, if X and Y are continuous, the expected value of the product is

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f(x, y) dx dy

By definition 9,

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f(x) f(y) dx dy
      = ∫_{−∞}^{∞} x f(x) [∫_{−∞}^{∞} y f(y) dy] dx
      = E(Y) ∫_{−∞}^{∞} x f(x) dx
      = E(X)E(Y)

where the term in brackets is E(Y). The proof is exactly the same for discrete X and Y, with summation instead of integration.
Exercise 12 (Proof of rules 2 and 3 of expected values). Let X and Y be random variables and let c be a number.
(a) Prove that E(cX) = cE(X) when X is discrete.
(b) Prove that E(cX) = cE(X) when X is continuous.
(c) Prove that E(X + Y) = E(X) + E(Y) when X and Y are discrete.
(d) Prove that E(X + Y) = E(X) + E(Y) when X and Y are continuous.
1.4.2 Variance and standard deviation
While the expected value tells us about some average of the distribution, the variance and standard deviation measure the dispersion of the values around the mean. Imagine one uniform random variable with values {9, 10, 11} and another with {0, 10, 20}. They both have the same mean of 10, but in the first the values are concentrated closer to the mean (smaller variance).
Definition 11 Let X be a random variable with mean E(X) = μ. The variance of X is

Var(X) = σ² = E[(X − μ)²]

Definition 12 The positive square root of the variance, sd(X) = σ = √Var(X), is the standard deviation of X.
Thus, the variance is the expected value of the squared deviation of the random variable from its mean. We already know the definitions of the expected value of discrete and continuous random variables, and therefore we can write the definition of variance as follows:

[If X is discrete] : Var(X) = σ² = Σ_x (x − μ)² f(x)
[If X is continuous] : Var(X) = σ² = ∫_{−∞}^{∞} (x − μ)² f(x) dx
As an example of the discrete case, let's compute the variance of a toss of a die, X, with pdf f(x) = 1/6 for all x ∈ {1, 2, 3, 4, 5, 6}. We already computed the mean μ = 3.5. Thus, the variance is:

Var(X) = (1/6)[(1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²] = 2.9167
For an example of a continuous case, consider X ~ U[a, b] with pdf f(x) = 1/(b − a) (and f(x) = 0 outside of the interval [a, b]). This is the continuous uniform distribution. We already found that the mean is μ = (a + b)/2, and therefore the variance of X is:

Var(X) = ∫_a^b (x − (a + b)/2)² · 1/(b − a) dx
       = 1/(b − a) ∫_a^b (x² − (a + b)x + ((a + b)/2)²) dx
       = 1/(b − a) [x³/3 − (a + b)x²/2 + ((a + b)/2)² x]|_a^b
       = 1/(b − a) [(b³ − a³)/3 − (a + b)(b² − a²)/2 + ((a + b)/2)²(b − a)]

Using the rule b³ − a³ = (b − a)(a² + ab + b²), we get

Var(X) = (a² + ab + b²)/3 − (a + b)²/2 + (a + b)²/4
       = (a² + ab + b²)/3 − (a + b)²/4
       = [4(a² + ab + b²) − 3(a² + 2ab + b²)]/12
       = (a² − 2ab + b²)/12
       = (b − a)²/12

The derivation was pretty messy, but the result makes intuitive sense. Notice that the variance of X ~ U[a, b] depends positively on the length of the interval b − a.
Exercise 13 Consider the continuous uniform random variable X ~ U[a, b].
(a) If a = 0 and b = 1, find the pdf of X, its expected value and variance.
(b) If a = 0 and b = 2, find the pdf of X, its expected value and variance.
For computational convenience, the definition of variance can be manipulated to create an equivalent, but easier to use, formula.

Var(X) = E[(X − μ)²]
       = E[X² − 2μX + μ²]
       = E(X²) − 2μE(X) + μ²
       = E(X²) − 2μ² + μ²
       = E(X²) − μ²

Thus, you will often use the formula:

Var(X) = E(X²) − E(X)²    (1.4)
Having found the expected value, the last term is just its square. So we only need to compute the mean of X². For an example of the discrete case, consider again the toss of a die, X, with pdf f(x) = 1/6 for all x ∈ {1, 2, 3, 4, 5, 6}. We found that E(X) = 3.5, so E(X)² = 3.5² = 12.25. The first term is

E(X²) = Σ_x x² f(x) = (1/6)(1² + 2² + 3² + 4² + 5² + 6²) = 15.1667

Thus,

Var(X) = E(X²) − E(X)² = 15.1667 − 12.25 = 2.9167
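The shortcut formula (1.4) for the die can be verified with exact fractions (a sketch; variable names are my own):

```python
from fractions import Fraction

# Die toss: variance via formula (1.4), Var(X) = E(X^2) - E(X)^2.
f = {x: Fraction(1, 6) for x in range(1, 7)}
EX = sum(x * p for x, p in f.items())        # 7/2
EX2 = sum(x**2 * p for x, p in f.items())    # 91/6
var = EX2 - EX**2
print(var, round(float(var), 4))  # 35/12 2.9167
```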
For an example of a continuous case, consider X ~ U[a, b] with pdf f(x) = 1/(b − a) (and f(x) = 0 outside of the interval [a, b]). We already found that E(X) = (a + b)/2, so E(X)² = ((a + b)/2)² = (a + b)²/4. The first term is:

E(X²) = ∫_a^b x² · 1/(b − a) dx = 1/(b − a) · (x³/3)|_a^b = 1/(b − a) · (b³ − a³)/3 = (a² + ab + b²)/3

Thus, the variance is

Var(X) = E(X²) − E(X)²
       = (a² + ab + b²)/3 − (a + b)²/4
       = (b − a)²/12

Notice that the expressions become less messy with the definition of variance in (1.4).
Rules of Variance
1. Variance of a constant is zero. Thus, if c is a constant, then

Var(c) = 0

2. If c is a constant number, then

Var(cX) = c²Var(X)

Thus, for example, if you multiply a random variable by 2, its variance increases by a factor of 4.

3. Adding a constant to a random variable does not change its variance. Thus, if c is a constant number, then

Var(X + c) = Var(X)

Adding a constant to a random variable just shifts all the values and the mean by that number, but does not change their dispersion around the mean.

4. Variance of the sum (difference) is NOT the sum (difference) of the variances. For random variables X and Y, we have

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y)

where Cov(X, Y) is the covariance between X and Y, to be defined in the next section.

The first 3 rules can be easily proved directly from the definition of variance Var(X) = E[(X − μ)²]. We'll get back to the 4th rule in the next section, after we discuss covariance.
Proof. (Rules 1, 2, 3 of variance).
(1) If X = c, then from the rules of expected values E(X) = c. Thus

Var(c) = E[(c − c)²] = 0

(2) From the rules of expected values, E(cX) = cE(X). Thus,

Var(cX) = E[(cX − E(cX))²]
        = E[(cX − cE(X))²]
        = E[c²(X − μ)²]
        = c²E[(X − μ)²]
        = c²Var(X)

(3) From the rules of expected values, E(X + c) = E(X) + c = μ + c. Thus,

Var(X + c) = E[(X + c − (μ + c))²]
           = E[(X − μ)²]
           = Var(X)
Rules of Standard Deviation
Usually, textbooks do not even bother presenting the rules of standard deviations, because they follow directly from the rules of variances. For example, taking square roots of the first 3 rules of variances gives the following rules of standard deviations:

[1] : √Var(c) = √0  ⇒  sd(c) = 0
[2] : √Var(cX) = √(c²Var(X))  ⇒  sd(cX) = |c| · sd(X)
[3] : √Var(X + c) = √Var(X)  ⇒  sd(X + c) = sd(X)

Notice that rule 2 implies that if we double a random variable, we also double its standard deviation.
Besides the importance of variance and standard deviation in statistics (we will be using them everywhere in this course), they are extremely important in finance. Any financial asset (or a portfolio of assets) is a random variable, which has expected return (mean return) and risk (standard deviation of returns). A great deal of modern portfolio theory deals with finding portfolios which maximize expected return for any given level of risk, or alternatively, minimize the level of risk for any given expected return. These are called efficient portfolios. If you take the Finance 350 course (which is highly recommended), these notes will be useful as well.
Exercise 14 Let X be a random variable with mean μ and variance σ². Let Z be the following transformation of X:

Z = (X − μ)/σ

Thus Z is obtained by subtracting the mean and dividing by the standard deviation (this operation is called standardization). Show that Z has mean zero and standard deviation 1 (i.e. μ_Z = 0, and σ_Z = 1). Such a random variable is called standard.
Exercise 15 Can the variance of any random variable be negative?
Exercise 16 Can the standard deviation of any random variable be negative?
Exercise 17 Let X be a random variable with mean μ and variance σ². What is the standard deviation of Y = 2X?
Exercise 18 Let X be a random variable with mean μ and variance σ². What is the standard deviation of Y = −2X?
1.4.3 Covariance and correlation
When analyzing the joint behavior of two random variables, we often start by asking if they are correlated. For example, we would like to know if education level and wages are positively correlated (otherwise, what are we doing here?). As another example, in macroeconomics we often want to know if some variable (say unemployment) is correlated with real GDP or not. What this means in plain language is that we want to investigate whether two variables are moving around their means in the same direction, in opposite directions, or there is no connection between their movements.
Definition 13 Let X and Y be two random variables with means μ_X and μ_Y respectively. The covariance between them is

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

Notice that if X and Y are mostly moving together around their means, then when X is above μ_X, Y also tends to be above μ_Y, and when X is below μ_X, Y also tends to be below μ_Y. In this case the terms (X − μ_X) and (Y − μ_Y) are either both positive or both negative, and Cov(X, Y) > 0. We then say that X and Y are positively correlated. On the other hand, suppose that X and Y are for the most part moving in opposite directions. Then when X is above μ_X, Y tends to be below μ_Y, and when X is below μ_X, Y tends to be above μ_Y. In such a case the terms (X − μ_X) and (Y − μ_Y) will have opposite signs and Cov(X, Y) < 0. We then say that X and Y are negatively correlated. We will see later that when X and Y are independent, then Cov(X, Y) = 0, and we say that X and Y are uncorrelated.
Notice that, just as with the definition of variance, the definition of covariance involves computing an expected value. In order to compute these expected values, we once again have two formulas, for discrete and for continuous random variables.

[If X and Y are discrete] : Cov(X, Y) = Σ_x Σ_y (x − μ_X)(y − μ_Y) f(x, y)
[If X and Y are continuous] : Cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μ_X)(y − μ_Y) f(x, y) dx dy
Also, as we did with variance, we can manipulate the definition of covariance to derive a more convenient formula for computation.

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
          = E[XY − Xμ_Y − Yμ_X + μ_Xμ_Y]
          = E(XY) − μ_Y E(X) − μ_X E(Y) + μ_Xμ_Y
          = E(XY) − μ_Xμ_Y − μ_Xμ_Y + μ_Xμ_Y
          = E(XY) − μ_Xμ_Y

Thus, you will often use the formula

Cov(X, Y) = E(XY) − E(X)E(Y)    (1.5)
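Formula (1.5) is easy to apply to a small discrete joint pdf (the joint probabilities below are hypothetical numbers of my own, chosen so that X and Y tend to match):

```python
# Hypothetical discrete joint pdf for formula (1.5): Cov(X, Y) = E(XY) - E(X)E(Y).
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}
EX = sum(x * p for (x, _), p in joint.items())
EY = sum(y * p for (_, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
cov = EXY - EX * EY
print(round(cov, 2))  # 0.15 -- positive: X and Y tend to move together
```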
Rules of Covariance
1. Covariance of a random variable X with itself is the variance of X:

Cov(X, X) = Var(X)

Notice that we could have discussed covariance before variance, and then defined variance as the covariance of a variable with itself. In fact, this way we can prove the rules of variances in a more elegant and easy way. Our textbook does exactly that, so in these notes I present the moments in the usual order (mean → variance → covariance).

2. Covariance between a random variable and a constant is zero. Thus, if X is a random variable, and c is a constant, then

Cov(X, c) = 0

3. Constants factor out as a product. Let X and Y be random variables and a and b be constant numbers. Then,

Cov(aX, bY) = ab · Cov(X, Y)

4. Distributive property. Let X, Y and Z be random variables. Then,

Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)

5. Adding constants to random variables does not change the covariance between them. Let X and Y be random variables and a and b be constant numbers. Then,

Cov(X + a, Y + b) = Cov(X, Y)

6. Covariance between two independent variables is 0. That is, suppose that random variables X and Y are independent. Then,

Cov(X, Y) = 0

or

E(XY) = E(X)E(Y)

Recall that variables with zero covariance are called uncorrelated. This rule says that independence implies lack of correlation.
Proof. For all the proofs of the covariance rules it is easiest to use the definition in equation
(1.5), that is, Cov(X, Y) = E(XY) − E(X)E(Y).
(Rule 1). This just follows from comparing the definition of covariance (1.5), when
Y = X, with the definition of variance (1.4).
(Rule 2). It should be intuitive that the covariance of any random variable with a constant
is zero, because covariance measures the comovement of the two around their means, and a
constant simply does not move. Formally,

Cov(X, c) = E(cX) − E(X)E(c)
= cE(X) − cE(X)
= 0

(Rule 3). Constants factor out because they do so in expected values. Formally,

Cov(aX, bY) = E(aX · bY) − E(aX)E(bY)
= abE(XY) − abE(X)E(Y)
= ab[E(XY) − E(X)E(Y)]
= ab · Cov(X, Y)
(Rule 4). Distributive property.

Cov(X + Y, Z) = E((X + Y) · Z) − E(X + Y)E(Z)
= E(XZ + YZ) − [E(X) + E(Y)]E(Z)
= E(XZ) + E(YZ) − E(X)E(Z) − E(Y)E(Z)
= [E(XZ) − E(X)E(Z)] + [E(YZ) − E(Y)E(Z)]
= Cov(X, Z) + Cov(Y, Z)
(Rule 5). Adding constants does not affect the covariance.

Cov(X + a, Y + b) = E[(X + a) · (Y + b)] − E(X + a)E(Y + b)
= E[XY + bX + aY + ab] − [E(X) + a][E(Y) + b]
= [E(XY) + bE(X) + aE(Y) + ab] − [E(X)E(Y) + bE(X) + aE(Y) + ab]
= E(XY) − E(X)E(Y)
= Cov(X, Y)
(Rule 6). Covariance between two independent variables is 0. This is exactly rule 4 of
expected values, which we already proved and will not repeat here.
Now we are in a position to prove rule 4 of variance, which gives the variance of a
sum:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

Proof. (Rule 4 of variance). Recall that variance is the covariance of the variable with
itself. Therefore,

Var(X + Y) = Cov(X + Y, X + Y)

By the distributive property of covariance (rule 4), we have

Cov(X + Y, X + Y) = Cov(X, X + Y) + Cov(Y, X + Y)
= Cov(X, X) + Cov(X, Y) + Cov(Y, X) + Cov(Y, Y)
= Var(X) + Var(Y) + 2Cov(X, Y)
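The variance-of-a-sum rule can also be checked numerically. The sketch below computes every moment exactly from a small hypothetical joint pmf (the probabilities are arbitrary, made-up numbers):

```python
# Hypothetical joint pmf of (X, Y); the numbers are arbitrary.
f = {(1, 2): 0.25, (1, 5): 0.25, (3, 2): 0.3, (3, 5): 0.2}

def E(g):
    """Expected value of g(X, Y) under the joint pmf f."""
    return sum(p * g(x, y) for (x, y), p in f.items())

var_x = E(lambda x, y: x**2) - E(lambda x, y: x)**2
var_y = E(lambda x, y: y**2) - E(lambda x, y: y)**2
cov_xy = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

# Var(X + Y), computed directly from the distribution of X + Y
var_sum = E(lambda x, y: (x + y)**2) - E(lambda x, y: x + y)**2

# Rule 4 of variance: Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
print(abs(var_sum - (var_x + var_y + 2 * cov_xy)) < 1e-12)  # True
```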
Exercise 19 Prove that for any two random variables X and Y, we have

Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y)
Exercise 20 Prove that for two independent random variables X and Y, the variance of
a sum is the sum of the variances:

Var(X + Y) = Var(X) + Var(Y)

(and likewise Var(X − Y) = Var(X) + Var(Y)).
Recall that covariance measures the degree of comovement of two random variables
around their means. The main disadvantage of this measure is that it is NOT unit-free.
Suppose researcher A studies the comovement of years of schooling, X, and annual wages
measured in thousands of dollars, Y. He finds a positive covariance:

Cov(X, Y) = 7

Researcher B has the same goal and the same data on wages and schooling, but he chooses
to measure wages in dollars (instead of thousands of dollars). Thus, he finds

Cov(X, 1000 · Y) = 1000 · Cov(X, Y) = 7000

Should researcher B conclude that wages and years of schooling have stronger comovement
than what was previously found by researcher A? NO; if researchers A and B use the same
data, we want them to reach the same conclusion about the degree of comovement between
wages and years of schooling. In order to make the results comparable, researchers report
correlation instead of covariance.
Definition 14 The correlation between random variables X and Y is

Corr(X, Y) = ρXY = Cov(X, Y) / (σX σY)

We define Corr(X, Y) = 0 whenever σX = 0, or σY = 0, or both. In other words, whenever
the covariance between two random variables is zero, we define the correlation to be also zero,
and say that the two variables are uncorrelated.
Thus, the correlation is the covariance divided by the product of the standard deviations. How
does this resolve the problem of covariance depending on the units of the random variables?
Let X and Y be random variables, and let a and b be positive numbers. We know that
Cov(aX, bY) = ab · Cov(X, Y), so if a and b represent different units used by different
researchers, we will have different results. Correlation, however, is unit-free, and does not
change when we scale the random variables by positive numbers.
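A small simulation illustrates the contrast. This is only a sketch with made-up sample data: the seed, sample size, and coefficients are arbitrary, and X and Y loosely mimic the schooling/wages example above.

```python
import math
import random

random.seed(0)
# Made-up data: X mimics years of schooling, Y wages in thousands of dollars.
x = [random.gauss(12, 2) for _ in range(10_000)]
y = [3 * xi + random.gauss(0, 5) for xi in x]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(ui - mu) * (vi - mv) for ui, vi in zip(u, v)])

def corr(u, v):
    return cov(u, v) / math.sqrt(cov(u, u) * cov(v, v))

y_dollars = [1000 * yi for yi in y]  # researcher B's units: dollars

print(cov(x, y_dollars) / cov(x, y))                # scales by 1000
print(abs(corr(x, y) - corr(x, y_dollars)) < 1e-9)  # True: correlation unchanged
```

The covariance inflates by exactly the scaling factor, while the correlation is identical in both unit systems.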
Properties of Correlation
1. For any two random variables X and Y, the sign of the correlation between them is
the same as the sign of the covariance between them:

sign(Corr(X, Y)) = sign(Cov(X, Y))

In other words, correlation gives us the same information as covariance about the qual-
itative nature of the comovement between X and Y around their means. If Cov(X, Y) >
0, meaning that X and Y tend to be above and below their respective means at the
same time, then we will also have Corr(X, Y) > 0.

2. Correlation is unit-free. Let X and Y be random variables, and let a and b be numbers,
both of the same sign. Then,

Corr(aX, bY) = Corr(X, Y)

3. Correlation between any two random variables can only be between −1 and 1. Let X
and Y be random variables. Then,

−1 ≤ Corr(X, Y) ≤ 1
Proof. (Property 1). Since the standard deviation of any random variable is always positive,
dividing by the standard deviations does not change the sign.
(Property 2). Let X and Y be random variables, and let a and b be numbers with the
same sign.

Corr(aX, bY) = Cov(aX, bY) / (σ(aX) · σ(bY))
= ab · Cov(X, Y) / (|a| · σX · |b| · σY)
= (ab / (|a| · |b|)) · Cov(X, Y) / (σX σY)
= Corr(X, Y)

where the last step uses the fact that a and b have the same sign, so ab = |a| · |b|.
We do not provide the proof of property 3, since it involves knowledge of inner products
and the Cauchy-Schwarz inequality. Instead, we discuss this property at an intuitive level. The
highest possible correlation is achieved when Y = X. Then the two random
variables X and Y are always guaranteed to be above or below their means at the same
time. The lowest possible correlation is achieved when Y = −X. In this case, the two
variables always deviate from their means in opposite directions. Thus, we will prove that
Corr(X, X) = 1 and Corr(X, −X) = −1, and these will be the maximal and minimal
correlations, so that in general, for any X and Y, we have −1 ≤ Corr(X, Y) ≤ 1.

Corr(X, X) = Cov(X, X) / (σX · σX)
= Var(X) / Var(X)
= 1

Corr(X, −X) = Cov(X, −X) / (σX · σ(−X))
= −Var(X) / Var(X)
= −1

In the last equation we used rule 3 of covariance, that of constants factoring out, with the
constant in this case being −1 (note also that σ(−X) = σX, since Var(−X) = Var(X)).
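These two boundary cases are easy to confirm on data. The check below uses an arbitrary, made-up sample vector:

```python
import math
import statistics as st

# Arbitrary made-up sample.
x = [1.0, 4.0, 2.0, 8.0, 5.0]
neg_x = [-v for v in x]

def corr(u, v):
    """Sample correlation (population formula, dividing by n)."""
    mu, mv = st.mean(u), st.mean(v)
    c = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / len(u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / len(v))
    return c / (su * sv)

print(corr(x, x))      # 1.0 (up to rounding): Corr(X, X) = 1
print(corr(x, neg_x))  # -1.0 (up to rounding): Corr(X, -X) = -1
```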
Exercise 21 Prove that if two random variables, X and Y, are independent, then they are
uncorrelated, i.e. Corr(X, Y) = 0.

The opposite statement is not true. Two random variables can be uncorrelated, but not
independent. The next exercise gives such an example.

Exercise 22 Let X be a random variable with pdf f(x) = 1/4 for x ∈ {1, 2, −1, −2}. Let Y = X².
(a) Show that X and Y are not correlated, i.e. show that Corr(X, Y) = 0.
(b) Show that X and Y are not independent. (Hint: compare the conditional pdf f(y|x)
with f(y). If X and Y are independent, then we must have f(y|x) = f(y).)
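A numerical check of Exercise 22 is sketched below. It illustrates the answer but does not replace the pencil-and-paper argument the exercise asks for.

```python
# X is uniform on {-2, -1, 1, 2}, with f(x) = 1/4, and Y = X^2.
support = [-2, -1, 1, 2]
p = 0.25

e_x = sum(p * x for x in support)          # E(X)   = 0 by symmetry
e_y = sum(p * x**2 for x in support)       # E(Y)   = E(X^2) = 2.5
e_xy = sum(p * x * x**2 for x in support)  # E(XY)  = E(X^3) = 0 by symmetry
cov = e_xy - e_x * e_y
print(cov)  # 0.0 -> X and Y are uncorrelated

# But Y is a deterministic function of X, so they cannot be independent:
# P(Y = 4 | X = 2) = 1, while the unconditional P(Y = 4) is only 0.5.
p_y4 = sum(p for x in support if x**2 == 4)
print(p_y4)  # 0.5
```

The covariance vanishes by symmetry, yet knowing X pins down Y exactly, which is the strongest possible form of dependence.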
Exercise 23 Let X be a random variable, and let a, b be some numbers. Let Y = a + bX.
(a) Prove that if b > 0, then Corr(X, Y) = 1.
(b) Prove that if b < 0, then Corr(X, Y) = −1.
(c) Prove that if b = 0, then Corr(X, Y) = 0.