Basic Statistics for SGPE Students
Part II: Probability distributions
Nicolai [email protected]
University of Edinburgh
September 2019
Thanks to Achim Ahrens, Anna Babloyan and Erkal Ersoy for creating these slides and allowing me to use them.
Outline
1. Probability theory
   - Conditional probabilities and independence
   - Bayes' theorem
2. Probability distributions
   - Discrete and continuous probability functions
   - Probability density function & cumulative distribution function
   - Binomial, Poisson and Normal distribution
   - E[X] and V[X]
3. Descriptive statistics
   - Sample statistics (mean, variance, percentiles)
   - Graphs (box plot, histogram)
   - Data transformations (log transformation, unit of measure)
   - Correlation vs. Causation
4. Statistical inference
   - Population vs. sample
   - Law of large numbers
   - Central limit theorem
   - Confidence intervals
   - Hypothesis testing and p-values
Random variables

Most of the outcomes or events we have considered so far have been non-numerical, e.g. either head or tail. If the outcome of an experiment is numerical, we call the variable that is determined by the experiment a random variable.

Random variables may be either discrete (e.g. the number of days the sun shines) or continuous (e.g. your salary after graduating from the MSc). In contrast to a continuous random variable, we can list the distinct potential outcomes of a discrete random variable.

Notation
Random variables are usually denoted by capital letters, e.g. X. The corresponding realisations are denoted by small letters, e.g. x.
Should you make the bet?

Example III.1
I propose the following game. We toss a fair coin 10 times. If head appears 4 times or less, I pay you £2. If head appears more than 4 times, you pay me £1. Should you make the bet?

Let's try to formalise the problem. Let the random variables X1, X2, ..., X10 be defined such that

Xi = { 1 if head appears on the ith toss; 0 if tail appears on the ith toss }   for i = 1, ..., 10.
Should you make the bet?

Furthermore, let the random variable Y denote the number of heads. Clearly,

Y = X1 + X2 + ... + X10.

If the realisation of Y is greater than 4, I win.

Let P(Y = y) denote the probability that Y takes the value y. Accordingly, P(Y ≤ 4) is the probability that we obtain 4 or fewer heads and P(Y > 4) is the probability that we obtain more than 4 heads. When would you make the bet?

Your expected value is

E[V] = P(Y ≤ 4) · £2 + P(Y > 4) · (−£1)

where V is the money you get. If E[V] > 0 (and you are risk neutral), you'll choose to play.
Should you make the bet?

Expected value
The expected value of a discrete random variable X is denoted by E[X] and given by

E[X] = x1 P(X=x1) + x2 P(X=x2) + ... + xk P(X=xk) = ∑_{i=1}^{k} xi P(X=xi)

where k is the number of distinct outcomes.
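As a quick illustration (not from the slides), a minimal Python sketch applying this definition to a fair six-sided die, a hypothetical example with the obvious outcomes and probabilities:

# Expected value of a discrete random variable: E[X] = sum_i x_i * P(X = x_i),
# illustrated for a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6
expected_value = sum(x * p for x, p in zip(outcomes, probs))
print(expected_value)  # 3.5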
Should you make the bet?

To solve the problem, we need to find P(Y ≤ 4) and P(Y > 4). From the additive law (Rule 4), we know that

P(Y ≤ 4) = P(Y=0 ∪ Y=1 ∪ Y=2 ∪ Y=3 ∪ Y=4)
         = P(Y=0) + P(Y=1) + P(Y=2) + P(Y=3) + P(Y=4)

P(Y > 4) = P(Y=5) + P(Y=6) + P(Y=7) + P(Y=8) + P(Y=9) + P(Y=10)

Hence, we need to find P(Y = y) for y = 0, 1, ..., 10.
Discrete probability distribution

It is common to denote the probability distribution of a discrete random variable Y by f(y).

Discrete probability distribution
The probability distribution or probability mass function of a discrete random variable X associates with each of the distinct potential outcomes xi (i = 1, ..., k) a probability P(X = xi). That is,

f(xi) = P(X = xi).

The probabilities sum to 1, i.e. ∑_{i=1}^{k} f(xi) = 1.
Discrete probability distribution

Two examples:

Example III.2 (Discrete Uniform Distribution)
Let X be the result from rolling a fair die. The probability distribution is simply

f(x) = P(X = x) = { 1/6 for x ∈ {1, 2, ..., 6}; 0 otherwise }.

This probability distribution is an example of a discrete uniform distribution.

Bernoulli distribution
A random variable X is said to have a Bernoulli distribution with parameter P(X = 1) = p (i.e. probability of success) if X can take only the values 1 (success) and 0 (failure). The probability distribution is given by

f(x) = { p if x = 1; 1 − p if x = 0; 0 otherwise }.
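A minimal Python sketch of these two probability mass functions (the function names are mine, not from the slides); it checks that the probabilities sum to one:

def die_pmf(x):
    # Discrete uniform: f(x) = 1/6 for x in {1, ..., 6}, 0 otherwise.
    return 1/6 if x in {1, 2, 3, 4, 5, 6} else 0.0

def bernoulli_pmf(x, p):
    # Bernoulli: f(1) = p, f(0) = 1 - p, 0 otherwise.
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

print(sum(die_pmf(x) for x in range(1, 7)))           # ~1.0
print(bernoulli_pmf(1, 0.3) + bernoulli_pmf(0, 0.3))  # 1.0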
Binomial coefficient & binomial distribution

Let's start with f(0) = P(Y=0), which is the probability of obtaining no heads. Using the multiplicative law,

P(Y=0) = P(X1=0) P(X2=0) ... P(X10=0) = (1/2)^10 = 0.00097656

Now, f(1) = P(Y=1). Since we are interested in the number of heads, we have to take into account that there is more than one combination that results in 1 head.

P(Y=1) = P(X1=1) P(X2=0) ... P(X10=0)
       + P(X1=0) P(X2=1) ... P(X10=0)
       + ...
       + P(X1=0) P(X2=0) ... P(X10=1)
       = 10 · (1/2)^10 = 0.00976563
Binomial coefficient & binomial distribution

Now, f(2) = P(Y=2). How many combinations are there that yield 2 heads out of 10 tosses? Given that the first toss produces a head, there are 9 combinations that yield two heads in total:

combination \ toss   1  2  3  4  5  6  7  8  9  10
        1            H  H  T  T  T  T  T  T  T  T
        2            H  T  H  T  T  T  T  T  T  T
        3            H  T  T  H  T  T  T  T  T  T
        4            H  T  T  T  H  T  T  T  T  T
        5            H  T  T  T  T  H  T  T  T  T
        6            H  T  T  T  T  T  H  T  T  T
        7            H  T  T  T  T  T  T  H  T  T
        8            H  T  T  T  T  T  T  T  H  T
        9            H  T  T  T  T  T  T  T  T  H

The same count holds when the fixed head is on the second toss, the third toss, and so on. This gives us 10 · 9. This approach has the problem of double counting: each combination appears twice. So we have to divide by 2 and get (10 · 9)/2 distinct combinations. Thus,

P(Y = 2) = (10 · 9)/2 · (1/2)^10 = 0.04394531.
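A short Python sketch (my own check, standard library only) that enumerates the positions of the two heads directly:

from itertools import combinations

# Each outcome with exactly 2 heads is determined by the two toss positions.
positions = list(combinations(range(1, 11), 2))
print(len(positions))            # 45, i.e. (10 * 9) / 2
print(len(positions) * 0.5**10)  # 0.043945..., i.e. P(Y = 2)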
Binomial coefficient & binomial distribution

For P(Y = 3), P(Y = 4), ... this gets even more complicated.

Binomial coefficient
Suppose that there is a set of n distinct elements from which it is desired to choose a subset of k elements (typically 1 ≤ k ≤ n). The binomial coefficient gives the number of ways k elements can be selected from n elements. The binomial coefficient is defined as

C_{n,k} = C^n_k = (n choose k) = n! / (k! (n − k)!)

where k! = k(k − 1)(k − 2) ... 1 and 0! = 1.

Remark
Note that

n! / (n − k)! = n · (n − 1) · (n − 2) · ... · (n − k + 1).

For example,

7! / (7 − 3)! = 7!/4! = (7 · 6 · 5 · 4 · 3 · 2 · 1) / (4 · 3 · 2 · 1) = 7 · 6 · 5.
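A small Python check of these formulas (standard library only; the helper name is mine):

from math import factorial

def binom_coeff(n, k):
    # C(n, k) = n! / (k! * (n - k)!)
    return factorial(n) // (factorial(k) * factorial(n - k))

print(factorial(7) // factorial(7 - 3))  # 210 = 7 * 6 * 5
print(binom_coeff(10, 2))                # 45, the number of combinations above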
Binomial coefficient & binomial distribution

Let's consider another example to get a better understanding of the binomial coefficient.

Example III.3
Imagine a box with four distinct elements (n = 4) denoted as a, b, c, d. We want to randomly pick two elements (k = 2). If the order of selecting elements matters, there exist 4 · 3 different combinations. However, we don't want the order to matter, so we divide by 2, as there are two ways of ordering two elements ({b, a} and {a, b}). Therefore, there are

(4 · 3)/2 = 4! / (2!(4 − 2)!) = (4 choose 2) = 6

different combinations:

{a, b}  {a, c}  {a, d}  {b, c}  {b, d}  {c, d}.
Combination vs. permutation

Example III.3 (continued)
Note the distinction between permutation (order matters) and combination (order does not matter).

If order matters (e.g. we distinguish between {a, b} and {b, a}), the solution to the above problem is simply 4 · 3 = 12.
Binomial coefficient & binomial distribution

Back to our problem: For P(Y = 3),

f(3) = P(Y = 3) = (10 choose 3) (1/2)^10 = 10! / (3!(10 − 3)!) · (1/2)^10
     = (10 · 9 · 8)/(3 · 2 · 1) · (1/2)^10 = 0.1171875.

The binomial coefficient allows us to find a general expression for f(y).

Binomial distribution
If the random variables X1, ..., Xn form n Bernoulli trials with parameter p (i.e. probability of success), then Y = X1 + ... + Xn follows a binomial distribution. The binomial distribution is given by

f(y; n, p) = (n choose y) p^y (1 − p)^(n−y)   for y = 0, 1, ..., n.
Binomial distribution

We now know the specific functional form of f(y) = P(Y = y). Hence, we can obtain the probability that we draw 0, 1, 2, ..., 10 heads.

 y    f(y)
 0    0.00098
 1    0.00977
 2    0.04395
 3    0.11719
 4    0.20508
 5    0.24609
 6    0.20508
 7    0.11719
 8    0.04395
 9    0.00977
10    0.00098

[Figure: Binomial distribution (n=10, p=0.5); y on the horizontal axis, f(y) on the vertical axis.]
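The table above can be reproduced with a few lines of Python; this is a sketch using the pmf formula directly (math.comb requires Python 3.8+):

from math import comb

n, p = 10, 0.5
for y in range(n + 1):
    f_y = comb(n, y) * p**y * (1 - p)**(n - y)
    print(y, round(f_y, 5))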
Cumulative distribution function

However, we are interested in P(Y ≤ 4).

Cumulative distribution function
The cumulative distribution function of a discrete random variable X is denoted by F(x) and is defined as

F(x) = P(X ≤ x)

where −∞ ≤ x ≤ +∞. The cumulative distribution function F(x) gives the probability that the outcome of X in a random trial will be less than or equal to any specified value x.
Binomial distribution

 y    f(y)      F(y)
 0    0.00098   0.00098
 1    0.00977   0.01074
 2    0.04395   0.05469
 3    0.11719   0.17188
 4    0.20508   0.37695
 5    0.24609   0.62305
 6    0.20508   0.82813
 7    0.11719   0.94531
 8    0.04395   0.98926
 9    0.00977   0.99902
10    0.00098   1.00000

[Figure: Cumulative distribution function F(y) of the binomial distribution (n=10, p=0.5).]

For example, F(2) = f(0) + f(1) + f(2) = 0.00098 + 0.00977 + 0.04395 = 0.05469.
Should you make the bet?

Example III.1 (continued)
I propose the following game. We toss a fair coin 10 times. If head appears 4 times or less, I pay you £2. If head appears more than 4 times, you pay me £1. Should you make the bet?

We can finally solve the problem. Your expected value is

E[V] = P(Y ≤ 4) · £2 + P(Y > 4) · (−£1)
     = F(4) · £2 + (1 − F(4)) · (−£1)
     = 0.377 · £2 + 0.623 · (−£1) ≈ £0.131.

You should make the bet (if you are risk neutral)!

What does E[V] mean? If we repeat the game a very large number of times, your average payoff per game will approach £0.131.
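A short Python sketch of this calculation (my own check, not part of the slides):

from math import comb

n, p = 10, 0.5
f = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]
F4 = sum(f[:5])                    # F(4) = P(Y <= 4)
EV = F4 * 2 + (1 - F4) * (-1)      # expected payoff in pounds
print(round(F4, 3), round(EV, 3))  # 0.377 0.131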
Binomial distribution (Simulation)

Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again.

Let's set m = 20. We get: 3 7 5 2 3 3 2 5 3 7 4 4 4 4 5 3 4 5 4 2.

 y               0     1     2     3     4     5     6     7     8     9    10
 frequency       0     0     3     5     6     4     0     2     0     0     0
 rel. frequency  0.00  0.00  0.15  0.25  0.30  0.20  0.00  0.10  0.00  0.00  0.00

[Figure: Empirical binomial distribution (n=10, p=0.5), 20 repetitions, next to the theoretical binomial distribution (n=10, p=0.5).]
Binomial distribution (Simulation)

Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again.

Let's set m = 50.

 y               0     1     2     3     4     5     6     7     8     9    10
 frequency       0     0     3     7    11     8     7     7     4     3     0
 rel. frequency  0.00  0.00  0.06  0.14  0.22  0.16  0.14  0.14  0.08  0.06  0.00

[Figure: Empirical binomial distribution (n=10, p=0.5), 50 repetitions, next to the theoretical binomial distribution (n=10, p=0.5).]
Binomial distribution (Simulation)

Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again.

Let's set m = 100.

 y               0     1     2     3     4     5     6     7     8     9    10
 frequency       0     2     3    17    26    22    14    11     3     2     0
 rel. frequency  0.00  0.02  0.03  0.17  0.26  0.22  0.14  0.11  0.03  0.02  0.00

[Figure: Empirical binomial distribution (n=10, p=0.5), 100 repetitions, next to the theoretical binomial distribution (n=10, p=0.5).]
Binomial distribution (Simulation)

Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again.

Let's set m = 10,000.

 y               0       1       2       3       4       5       6       7       8       9      10
 frequency       4      89     429    1171    2045    2470    2075    1198     411     103       5
 rel. frequency  0.0004  0.0089  0.0429  0.1171  0.2045  0.2470  0.2075  0.1198  0.0411  0.0103  0.0005

[Figure: Empirical binomial distribution (n=10, p=0.5), 10,000 repetitions, next to the theoretical binomial distribution (n=10, p=0.5).]
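A simulation along these lines takes only a few lines of Python; this sketch uses the standard library (the seed is arbitrary, so the exact frequencies will differ from the tables above):

import random

random.seed(1)          # arbitrary seed; results vary from run to run
m, n = 10_000, 10
counts = [0] * (n + 1)
for _ in range(m):
    heads = sum(random.random() < 0.5 for _ in range(n))
    counts[heads] += 1
print([round(c / m, 3) for c in counts])  # relative frequencies, close to f(y)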
Binomial distribution

The binomial distribution has two parameters: n is the number of (Bernoulli) trials and p is the probability of success in each trial. What does the distribution look like for different values of n and p?

[Figure: Six panels showing the binomial distribution for (n=10, p=0.5), (n=10, p=0.7), (n=10, p=0.9), (n=30, p=0.5), (n=30, p=0.7) and (n=30, p=0.9); y on the horizontal axis, f(y) on the vertical axis.]
Binomial distribution: Expected value and variance

In the same way we summarise an observed dataset by the sample average and the sample variance (or standard deviation), we can characterise a probability distribution by its expected value and its variance. From the figures we can see that expected value and variance change with n and p.

To find the expected value of Y, note that:

Linearity of expectation
If Y is the sum of random variables X1, X2, ..., Xn, then:

E[Y] = E[ ∑_{i=1}^{n} Xi ] = ∑_{i=1}^{n} E[Xi].

Furthermore, if c is a constant (i.e., non-random) and X a random variable, then

E[X + c] = E[X] + c    and    E[cX] = c E[X].
Binomial distribution: Expected value and variance

Recall that Y = X1 + X2 + X3 + ... + Xn. Therefore,

E[Y] = E[X1] + E[X2] + E[X3] + ... + E[Xn]

Recall that Xi follows a Bernoulli distribution. The expected value of a Bernoulli variable is

E[Xi] = p · 1 + (1 − p) · 0 = p.

Therefore,

E[Y] = E[X1] + E[X2] + E[X3] + ... + E[Xn] = np.
Binomial distribution: Expected value and variance

Variance
The variance of a discrete random variable X is denoted by V[X] and given by

V[X] = ∑_{i=1}^{k} (xi − E[X])² P(X = xi).

Variance of the sum of uncorrelated random variables
If Y is the sum of independent (!) random variables X1, X2, ..., Xn, then:

V[Y] = V[ ∑_{i=1}^{n} Xi ] = ∑_{i=1}^{n} V[Xi]

The variance of Xi is given by

V[Xi] = (1 − E[Xi])² p + (0 − E[Xi])² (1 − p) = p(1 − p)

Therefore,

V[Y] = np(1 − p).
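A quick numerical check of E[Y] = np and V[Y] = np(1 − p) for n = 10, p = 0.5 (a sketch, not from the slides):

from math import comb

n, p = 10, 0.5
f = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]
EY = sum(y * fy for y, fy in enumerate(f))
VY = sum((y - EY)**2 * fy for y, fy in enumerate(f))
print(EY, VY)  # approximately 5.0 and 2.5, i.e. np and np(1 - p)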
Poisson distribution

Example III.4
Let X be the number of cars passing by in an hour. On average λ cars pass by. What is the probability that 3 cars pass by?

To simplify the problem, we divide each hour into 60 minutes. Since E[X] = λ, the probability that one car passes by in any particular minute is λ/60.

Using this simplification, we can work with the binomial distribution.

f(3) = P(X = 3) ≈ (60 choose 3) (λ/60)³ (1 − λ/60)^(60−3)

However, this approach does not take into account that more than one car may pass by in a minute.
Poisson distribution

We can address this problem by dividing each hour into 3,600 seconds, 3,600,000 milliseconds, and so on. More generally, with n being the number of units we divide the hour into:

f(x) = P(X = x) ≈ (n choose x) (λ/n)^x (1 − λ/n)^(n−x)

Let n → ∞, do some maths, and you'll arrive at:

Poisson distribution
If X follows a Poisson process, then

f(x) = P(X = x) = e^(−λ) λ^x / x!

for x = 0, 1, 2, 3, ..., ∞. Note that e = lim_{n→∞} (1 + 1/n)^n = 2.718....
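A small Python sketch comparing the Poisson pmf with its binomial approximation for λ = 3, dividing the hour into n = 3600 seconds (the numbers are my own illustration):

from math import comb, exp, factorial

lam, n = 3.0, 3600
poisson = exp(-lam) * lam**3 / factorial(3)
binom = comb(n, 3) * (lam / n)**3 * (1 - lam / n)**(n - 3)
print(round(poisson, 5), round(binom, 5))  # both approximately 0.224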
Poisson distribution

[Figure: Poisson distribution for λ = 1 and λ = 3; x on the horizontal axis, f(x) on the vertical axis.]

The Poisson distribution is asymmetric (right-skewed) for E[X] = λ = 1 and λ = 3.
Poisson distribution

[Figure: Poisson distribution for λ = 10 and λ = 1000; x on the horizontal axis, f(x) on the vertical axis.]

The higher λ, the more symmetric the Poisson distribution becomes. Moreover, the distribution looks very similar to the normal distribution!
Continuous distributions: Probability density function (PDF)

If the random variable X is continuous, we use f(x) to denote the probability density function (PDF) of X. The PDF satisfies two requirements:

(1) f(x) ≥ 0    and    (2) ∫_{−∞}^{+∞} f(x) dx = 1

Remark (!)
If the random variable X is continuous, then the probability that X takes a particular value x is zero. That is,

f(x) ≠ P(X = x) = 0.
Continuous distributions

Uniform distribution
Let a and b be two real numbers (a < b) and consider an experiment where a number X is randomly selected from the interval [a, b]. If the probability that X belongs to any subinterval of [a, b] is proportional to the length of the subinterval, we say that X is uniformly distributed. The PDF of X is given by

f(x) = { 1/(b − a) for x ∈ [a, b]; 0 otherwise }

We write that X ∼ u(a, b).
Continuous distributions

Example III.5
Anna and Achim arrange to meet 'between 1pm and 2pm' at Old College. Their arrival time is uniformly distributed and they arrive independently of each other. What is the probability that no one will have to wait more than 15 minutes?

Let X denote Anna's arrival time and Y denote Achim's arrival time.
- Express the event 'no one will wait for more than 15 minutes' in terms of X and Y.
- What is the joint distribution of X and Y?
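A Monte Carlo sketch of this example, assuming arrival times are measured in minutes after 1pm, so that X, Y ~ u(0, 60) independently (the seed is arbitrary):

import random

random.seed(1)  # arbitrary seed
m = 100_000
hits = sum(abs(random.uniform(0, 60) - random.uniform(0, 60)) <= 15
           for _ in range(m))
print(hits / m)  # close to 1 - (45/60)**2 = 0.4375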
Continuous distributions

The normal distribution is by far the single most important probability distribution. Many natural phenomena are (approximately) normally distributed. Another reason for its importance comes from the central limit theorem (to be discussed in the next lecture).

Normal distribution
If the random variable X follows a normal distribution with mean µ (−∞ < µ < ∞) and variance σ² (σ > 0), its PDF is given by

f(x) = 1/√(2πσ²) · e^(−(x−µ)²/(2σ²))

for −∞ < x < ∞. We write that X ∼ N(µ, σ²).
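The PDF can be coded directly; a minimal Python sketch (the function name is mine):

from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma2):
    # f(x) = 1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
    return exp(-(x - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

print(normal_pdf(0, 0, 1))    # 0.3989..., the peak of the standard normal PDF
print(normal_pdf(10, 10, 4))  # 0.1994..., the peak of N(10, 4)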
Continuous distributions

[Figure: Left panel, the PDF of N(0, 1); right panel, the PDFs of u(1, 3) and u(−3, 0); x on the horizontal axis, f(x) on the vertical axis.]
E[X] for continuous distributions

Expected value
The expected value of a continuous random variable is

E[X] = ∫_{−∞}^{+∞} x f(x) dx.

E[X] is the balance point of the probability mass: the probability mass to the left of E[X] balances the probability mass to the right of E[X].
E[X] for continuous distributions

Thus, the expected value of the normal distribution is simply at its highest point (due to symmetry), and the expected value of a uniform distribution is halfway between a and b.

[Figure: The balance points of a normal PDF and of a uniform PDF on [a, b].]
E[X] for the uniform distribution

Let's do this formally for the uniform distribution.

E[X] = ∫_{−∞}^{+∞} x f(x) dx            (1)
     = ∫_{−∞}^{+∞} x · 1/(b − a) dx      (2)
     = 1/(b − a) ∫_a^b x dx              (3)
     = 1/(b − a) [x²/2]_a^b              (4)
     = 1/(b − a) · (b² − a²)/2           (5)
     = 1/2 · (b − a)(b + a)/(b − a)      (6)
     = 1/2 (b + a)                       (7)
E[X] for the uniform distribution

(1) By the definition of the expected value.
(2) By the definition of the uniform distribution.
(3) If k is a constant, then ∫ k f(x) dx = k ∫ f(x) dx; the limits shrink to [a, b] because f(x) = 0 outside that interval.
(4) Since ∫ x^n dx = x^(n+1)/(n + 1) + c   (n ≠ −1).
(4-5) Since ∫_a^b f(x) dx = [F(x)]_a^b = F(b) − F(a), where d/dx F(x) = f(x).
V[X] for continuous distributions

Variance
The variance of a continuous random variable X is

V[X] = ∫_{−∞}^{+∞} (x − E[X])² f(x) dx.

[Figure: PDFs of N(0, 1), N(0, 2) and N(0, 3); x on the horizontal axis, f(x) on the vertical axis.]
V[X] for the uniform distribution

Instead of using the definition above, we can make use of the following:

Variance

V[X] = E[(X − E[X])²]
     = E[X² − 2X E[X] + (E[X])²]
     = E[X²] − 2(E[X])² + (E[X])²
     = E[X²] − (E[X])²

where the third line uses the fact that E[A + B] = E[A] + E[B] and E[E[A]] = E[A].

We know E[X]. Hence, we only need to find E[X²].
V[X] for the uniform distribution

To find E[X²], note that for a function g of X we have E[g(X)] = ∫ g(x) f(x) dx; that is, we integrate x² against the PDF of X.

E[X²] = ∫_{−∞}^{+∞} x² f(x) dx
      = 1/(b − a) ∫_a^b x² dx
      = 1/(b − a) [x³/3]_a^b
      = 1/(b − a) · (b³ − a³)/3
      = 1/3 · (b − a)(a² + ab + b²)/(b − a)
      = 1/3 (a² + ab + b²)

V[X] = E[X²] − (E[X])² = 1/3 (a² + ab + b²) − (1/2 (b + a))² = 1/12 (b − a)²
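A crude numerical check of both results for a = 1, b = 3, using a midpoint Riemann sum (a sketch, not part of the slides):

a, b = 1.0, 3.0
N = 100_000
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]
f = 1 / (b - a)
EX = sum(x * f * dx for x in xs)
EX2 = sum(x * x * f * dx for x in xs)
print(round(EX, 4), round(EX2 - EX**2, 4))  # ~2.0 and ~0.3333 = (b - a)**2 / 12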
PDF and probability

Consider the standard normal distribution depicted in the figure. As you can see, f(0) ≈ 0.4. To be precise, f(0) = 0.3989423.... It is important to understand that this does not mean that P(X = 0) = 0.3989423...! If a random variable is continuous, there are infinitely many distinct values that the random variable can take. Thus, the probability that the random variable takes a specific value is zero.

[Figure: The PDF of N(0, 1).]
PDF and probability

However, we can say that the probability that X is below 0 is equal to the shaded gray area. Since we know that the area under f(x) is 1, we know (due to symmetry) that P(X ≤ 0) = 0.5.

[Figure: The PDF of N(0, 1) with the area to the left of 0 shaded.]
PDF and probability

But what is, say, P(X ≤ −1)?

[Figure: The PDF of N(0, 1) with the area to the left of −1 shaded.]
CDF and probability

Cumulative distribution function
The cumulative distribution function (CDF) of a continuous random variable X is denoted by F(x) and given by

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du   for −∞ ≤ x ≤ +∞.

The CDF gives the probability that the outcome of X in a random experiment is less than or equal to x.
CDF and probability

We can read from the CDF that

F(−1) = P(X ≤ −1) ≈ 0.159...

and

F(0) = P(X ≤ 0) = 0.5.

[Figure: The PDF and CDF of N(0, 1), with F(−1) ≈ 0.159 and F(0) = 0.5 marked.]
CDF and probability

What is the probability that X lies between −1 and 0? That is, what is

P(−1 ≤ X ≤ 0)?

It is simply

F(0) − F(−1) = 0.5 − 0.159 ≈ 0.341.

[Figure: The PDF and CDF of N(0, 1), with the interval [−1, 0] marked.]
CDF and probability

What is the probability that X is below +1? Due to symmetry,

F(1) = 1 − F(−1).

Thus,

1 − F(−1) = 1 − 0.159 ≈ 0.841.

[Figure: The PDF and CDF of N(0, 1), with F(−1) ≈ 0.159 and F(1) ≈ 0.841 marked.]
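These values can be checked with the standard normal CDF from scipy (a sketch, assuming scipy is installed):

from scipy.stats import norm

print(norm.cdf(-1))                # ~0.1587, i.e. F(-1)
print(norm.cdf(0) - norm.cdf(-1))  # ~0.3413, i.e. P(-1 <= X <= 0)
print(norm.cdf(1))                 # ~0.8413, i.e. F(1) = 1 - F(-1)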
Inverse functions and CDF

We will often use the inverse function of the CDF.

Inverse function
In general, if g(x) is an invertible function, then the inverse function is given by

g⁻¹(g(x)) = x.

Intuition: A function works like a machine. It takes x as an input and returns the output g(x) = a. An inverse function works the other way around: g⁻¹(a) = x.

Suppose we are interested in the following question: What is the value of x such that P(X ≤ x) = 0.95?
Inverse functions and CDF

Suppose we are interested in the following question: What is the value of x such that P(X ≤ x) = 0.95? Using the inverse CDF:

F⁻¹(0.95) ≈ 1.64...

This implies that, due to symmetry, 90% of the probability mass is approximately in the ±1.64 interval.

[Figure: The PDF and CDF of N(0, 1), with F⁻¹(0.95) ≈ 1.64 marked.]
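A minimal check using scipy's inverse CDF (the percent point function), again assuming scipy is installed:

from scipy.stats import norm

q = norm.ppf(0.95)                 # inverse CDF at 0.95
print(q)                           # ~1.6449
print(norm.cdf(q) - norm.cdf(-q))  # ~0.90 of the mass lies in [-q, q]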
Standard normal distribution

Standard normal distribution
If X ∼ N(µ, σ²), then

Z = (X − µ)/σ ∼ N(0, 1).

We say that Z follows a standard normal distribution. The PDF and CDF of the standard normal distribution are often denoted by φ(z) and Φ(z).

[Figure: The PDFs of Z ∼ N(0, 1) and X ∼ N(10, 4).]
Standard normal distribution

Example III.6
Suppose X ∼ N(10, 4) and Z = (X − 10)/√4 ∼ N(0, 1). What is P(X ≤ 8)?

P(X ≤ 8) = P((X − µ)/σ ≤ (8 − µ)/σ)
         = P((X − 10)/√4 ≤ (8 − 10)/√4)
         = P(Z ≤ −1)

[Figure: The PDFs of Z ∼ N(0, 1) and X ∼ N(10, 4), with −1 and 8 marked.]
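The same probability can be computed either by standardising or directly (a sketch assuming scipy is installed):

from scipy.stats import norm

z = (8 - 10) / 2                     # standardise: sigma = sqrt(4) = 2
print(norm.cdf(z))                   # ~0.1587 = P(Z <= -1)
print(norm.cdf(8, loc=10, scale=2))  # same probability, computed directly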
Multivariate distributions (discrete)

Joint probability function
The joint probability function of two discrete random variables X and Y is given by

f(x, y) = P(X = x and Y = y)   and   ∑_i ∑_j f(xi, yj) = 1.

Table of Probabilities

 X\Y    1     2     3     4
  1    0.1    0    0.1    0
  2    0.3    0    0.1   0.2
  3     0    0.2    0     0
  4     0     0     0     0

[Figure: A 3D bar chart of the joint probabilities f(x, y).]
Multivariate distributions (continuous)

Joint probability density function
The joint probability density function (or joint PDF) of two continuous random variables X and Y is given by

f(x, y)   with   ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = 1.

[Figure: A surface plot of a joint PDF f(x, y).]
Multivariate distributions (continuous)

Joint probability density function
The joint probability density function of two continuous random variables X and Y is given by

f(x, y)   with   ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = 1.

Example III.7
If X and Y have joint PDF f(x, y), then the probability that X lies between 0 and 2 and that, at the same time, Y lies between 0 and 1 is

P(0 ≤ X ≤ 2 and 0 ≤ Y ≤ 1) = ∫_0^1 ∫_0^2 f(x, y) dx dy
Marginal distributions

Marginal probability function (discrete)
If X and Y are two discrete random variables for which the joint probability function is f(x, y), then the marginal probability function for X is

fX(x) = P(X = x) = ∑_y P(X = x and Y = y) = ∑_y f(x, y)

The marginal probability gives the probability of observing a specific value of X (say X = x). To calculate the probability of observing x, we need to add the probabilities of all events that correspond to X = x. That is,

P(X = x) = f(x, y1) + f(x, y2) + ... + f(x, yn).
Marginal distributions

Example III.8 (Discrete Marginal Distribution)

Table of Probabilities

 X\Y    1     2     3     4
  1    0.1    0    0.1    0
  2    0.3    0    0.1   0.2
  3     0    0.2    0     0
  4     0     0     0     0

What is the marginal probability function for X?

Marginal probability density function (continuous)
If X and Y are two continuous variables for which the joint probability density function is f(x, y), then the marginal probability density function for X is

fX(x) = ∫_{−∞}^{+∞} f(x, y) dy
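A small Python sketch answering the question for X, using the joint table above (the dictionary layout is my own):

# Joint probabilities f(x, y) from the table; zero entries are omitted.
joint = {
    (1, 1): 0.1, (1, 3): 0.1,
    (2, 1): 0.3, (2, 3): 0.1, (2, 4): 0.2,
    (3, 2): 0.2,
}
for x in (1, 2, 3, 4):
    fX = sum(p for (xi, y), p in joint.items() if xi == x)
    print(x, round(fX, 2))  # marginal f_X: 0.2, 0.6, 0.2, 0.0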
Joint distributions and independence

Recall from the last lecture that, if X and Y are two independent events, then P(X and Y) = P(X)P(Y). We can generalise this statement:

Independence
Two continuous (or discrete) random variables X and Y are independent if and only if

f(x, y) = fX(x) fY(y)   ⟺   F(x, y) = FX(x) FY(y)

where fX(x) and fY(y) are the marginal PDFs, and FX(x) and FY(y) denote the marginal CDFs.
Conditional distributions

Recall from the last lecture that the conditional probability is defined as P(X|Y) = P(X, Y)/P(Y). Furthermore, recall that if X and Y are two independent events, then P(X|Y) = P(X).

Conditional probability density function
Suppose that X and Y are two continuous (or discrete) random variables for which the joint PDF is f(x, y) and the marginal PDFs are fX(x) and fY(y). Suppose also that the value y has already been observed. The conditional probability density function of X given that Y = y is given by

fX(x|y) = f(x, y)/fY(y).

Note that if X and Y are independent, we get the relation fX(x|y) = fX(x).
Conditional distributions

[Figure: The surface plot of the joint PDF f(x, y), with the slice f(x, 1) highlighted as a thick black line.]

The black, thick line shows f(x, 1), which is proportional to the conditional distribution of X given Y = 1. More specifically:

fX(x|Y = 1) = f(x, 1)/fY(1).
(Conditional) Expectations and Covariance

Example III.9
Let X and Z be two independently distributed standard normal random variables and let Y = X² + Z. Hence, V(X) = V(Z) = 1 and E(X) = E(Z) = 0.
a) Derive E[Y|X].
b) Derive E[Y]. [Hint: Use V[X] = E[X²] − (E[X])².]
c) Derive E[XY]. [Hint: Since X is standard normal, E[X³] = 0.]
d) Find Cov(X, Y) = E[(X − E(X))(Y − E(Y))].
What is interesting about the results?
Independence and Covariance

Covariance is defined as

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E[XY] − E[X]E[Y].

If X and Y are independent, E[XY] = E[X]E[Y]. Therefore, if X and Y are independent, Cov(X, Y) = 0.

However, Cov(X, Y) = 0 (and Corr(X, Y) = 0) does not imply independence, as demonstrated in the previous example.
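A simulation sketch of Example III.9 (standard library only; the seed is arbitrary): the sample covariance of X and Y = X² + Z is close to zero even though Y clearly depends on X.

import random

random.seed(1)  # arbitrary seed
m = 200_000
xs = [random.gauss(0, 1) for _ in range(m)]
ys = [x * x + random.gauss(0, 1) for x in xs]
mx, my = sum(xs) / m, sum(ys) / m
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / m
print(round(cov, 3))  # close to 0, despite the strong (nonlinear) dependence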
Summary

- Random variables are either discrete or continuous. Random variables are usually denoted by capital letters, e.g. X, and realisations by small letters, e.g. x.
- For a continuous random variable, P(X = x) = 0 and f(x) ≠ P(X = x), where f(x) denotes the probability density function.
- Many probability distributions are closely related. For example, we can derive the Poisson distribution from the Binomial distribution, and the Poisson distribution behaves similarly to the Normal distribution as λ → ∞.
- Independence implies Cov(X, Y) = 0 (and Corr(X, Y) = 0), but not the other way around. Cov(X, Y) and Corr(X, Y) measure the strength of the linear relation between two variables.