Probabilities and Statistics
Transcript of Probabilities and Statistics
Contents
1 COMBINATORIAL ANALYSIS
  1.1 The basic principle of counting
  1.2 Permutations
  1.3 Combinations
    1.3.1 The Binomial Theorem
    1.3.2 Other properties of binomial coefficients
    1.3.3 Multinomial coefficients
    1.3.4 Partitioning

2 AXIOMS OF PROBABILITY
  2.1 Terminology
  2.2 Basic Properties
    2.2.1 Inclusion-Exclusion Formulae
  2.3 Conditional Probability
    2.3.1 Bayes' Theorem
    2.3.2 Prediction Decomposition
  2.4 Independence

3 RANDOM VARIABLES
  3.1 Definition
  3.2 Bernoulli Random Variables
  3.3 Probability Mass Function
    3.3.1 Binomial Random Variable
    3.3.2 Geometric Distribution
    3.3.3 Negative Binomial Distribution
    3.3.4 Hypergeometric Distribution
    3.3.5 Discrete Uniform Distribution
    3.3.6 Poisson Distribution
  3.4 Cumulative distribution function
    3.4.1 Properties
  3.5 Transformations of discrete random variables
  3.6 Expectation
    3.6.1 Properties
    3.6.2 Expected value of a function
    3.6.3 Moments of a distribution
    3.6.4 Properties of variance
  3.7 Conditional Probability Distributions
    3.7.1 Conditional probability mass function
    3.7.2 Conditional expected value
    3.7.3 Law of small numbers

4 CONTINUOUS RANDOM VARIABLES
  4.1 Definition: Probability density function
  4.2 Basic distributions
    4.2.1 Uniform distribution
    4.2.2 Exponential distribution
    4.2.3 Gamma distribution
    4.2.4 Laplace distribution
    4.2.5 Pareto distribution
  4.3 Expectation
  4.4 Conditional Densities
  4.5 Quantiles
  4.6 Transformations
  4.7 Normal Distribution
    4.7.1 Properties
    4.7.2 Moivre-Laplace: Normal approximation of the binomial distribution
  4.8 Q-Q Plots
  4.9 Densities recap

5 SEVERAL RANDOM VARIABLES
  5.1 Discrete Random Variables
  5.2 Continuous Random Variables
  5.3 Exponential Families
  5.4 Marginal and conditional distribution
  5.5 Multivariate Random Variables
    5.5.1 Multinomial Distribution
  5.6 Independence
    5.6.1 Independent and identically distributed variables
  5.7 Joint Moments and Covariance
    5.7.1 Properties of covariance
    5.7.2 Independence and covariance
  5.8 Linear combinations of random variables
  5.9 Correlation
    5.9.1 Properties of correlation
  5.10 Conditional Expectation
    5.10.1 Expectation and Conditioning
  5.11 Generating Functions
    5.11.1 Properties of Moment-Generating Functions
    5.11.2 Linear combinations
    5.11.3 Continuity
    5.11.4 Mean vector and covariance matrix
    5.11.5 Moment-generating function: Multivariate case
    5.11.6 Characteristic function
    5.11.7 Cumulant-generating function
    5.11.8 Cumulants of sums of random variables
    5.11.9 Multivariate cumulant-generating function
  5.12 Multivariate Normal Distribution
    5.12.1 Lemma: Properties of normal variables
    5.12.2 Lemma: Normal density function
    5.12.3 Marginal and conditional distributions
  5.13 Transformation of joint continuous densities
  5.14 Order Statistics
    5.14.1 Theorem
  5.15 Approximation and Convergence
    5.15.1 Inequalities
    5.15.2 Convergence
    5.15.3 Laws of Large Numbers
    5.15.4 Central Limit Theorem (CLT)
    5.15.5 Delta Method
    5.15.6 Sample quantiles

6 EXPLORATORY STATISTICS
  6.1 Types of Data
  6.2 Graphical Study of Variables
    6.2.1 Kernel Density Estimate
  6.3 Numerical Summaries
    6.3.1 Breakdown point
    6.3.2 Quartiles and Sample Quantiles
    6.3.3 Variability and dispersion measures
    6.3.4 Interquartile range
    6.3.5 Sample correlation
  6.4 Boxplot
    6.4.1 Five-number summary
    6.4.2 Boxplot calculations
  6.5 Choice of a model
    6.5.1 Normal Q-Q Plot
  6.6 Statistical Inference
    6.6.1 Statistical model
    6.6.2 Definitions
  6.7 Point Estimation
    6.7.1 Estimator
    6.7.2 Estimation methods
    6.7.3 Method of moments
    6.7.4 Maximum likelihood estimation
    6.7.5 M-estimation
    6.7.6 Bias
    6.7.7 Mean Square Error
    6.7.8 Delta method
  6.8 Interval Estimation
    6.8.1 Pivots
    6.8.2 Confidence intervals
    6.8.3 Construction of a CI
    6.8.4 One- and two-sided intervals
    6.8.5 Standard Errors
    6.8.6 Normal Random Sample
    6.8.7 Unknown variance
    6.8.8 Confidence intervals and tests
    6.8.9 Null and Alternative Hypotheses
    6.8.10 Size and power
    6.8.11 Pearson statistic
    6.8.12 Chi-square distribution with v degrees of freedom
    6.8.13 Evidence and P-values

7 Likelihood
    7.0.1 Relative likelihood
  7.1 Scalar Parameter
    7.1.1 Information
    7.1.2 Limit distribution of the MLE
    7.1.3 Likelihood ratio statistic
    7.1.4 Regularity
    7.1.5 Vector Parameter
    7.1.6 Nested models
    7.1.7 Likelihood ratio statistic
    7.1.8 Simple linear regression

8 BAYESIAN INFERENCE
  8.1 Bayesian Inference
    8.1.1 Application of Bayes' Theorem
    8.1.2 Beta(a, b) density
    8.1.3 Properties of π(θ|y)
    8.1.4 Point estimation and loss functions
    8.1.5 Interval estimation and credibility intervals
    8.1.6 Conjugate densities
    8.1.7 Prediction of a future random variable Z
    8.1.8 Bayesian approach
Chapter 1
COMBINATORIAL ANALYSIS
1.1 The basic principle of counting
If experiment 1 can result in m outcomes and, for each such outcome, experiment 2 has n possible outcomes, then the total number of outcomes is mn. Note that this can be extended to r consecutive experiments.
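The principle can be checked by direct enumeration; the outcome sets below are hypothetical placeholders, a minimal sketch rather than anything from the course.

```python
from itertools import product

# Hypothetical outcome sets: experiment 1 has m = 3 outcomes,
# experiment 2 has n = 4 outcomes.
experiment_1 = ["a", "b", "c"]
experiment_2 = [1, 2, 3, 4]

# Enumerating every pair of outcomes confirms the count m * n = 12.
pairs = list(product(experiment_1, experiment_2))
total = len(pairs)
```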
1.2 Permutations
A permutation of a set of objects is an arrangement of these objects. If there are n distinct objects, then the number of permutations is n!, as there are n objects to choose from for the first position, (n − 1) for the second, and so on.
1.3 Combinations
A combination of r objects among n is an unordered subset of r objects taken from the n original ones. The number of such combinations is

\binom{n}{r} = \frac{n!}{(n-r)!\, r!}
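Both counting formulas are available in Python's standard library; the sizes n and r below are arbitrary illustrative values.

```python
from math import comb, factorial

n, r = 5, 2  # arbitrary example sizes

# Permutations of n distinct objects: n!
n_permutations = factorial(n)  # 5! = 120

# Combinations of r objects among n: n! / ((n - r)! r!)
n_combinations = comb(n, r)
by_formula = factorial(n) // (factorial(n - r) * factorial(r))
```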
1.3.1 The Binomial Theorem
(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}
Proof

Expanding the product gives 2^n terms, each a product of n factors. The number of these terms containing k x's and (n − k) y's is the number of ways of choosing which k of the n factors contribute an x, i.e. \binom{n}{k}.
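The theorem can be verified exactly with integer arithmetic; x, y and n below are arbitrary example values, not from the course.

```python
from math import comb

# Arbitrary integer values, so both sides can be compared exactly.
x, y, n = 3, 5, 7

lhs = (x + y) ** n
rhs = sum(comb(n, k) * x**k * y ** (n - k) for k in range(n + 1))
```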
1.3.2 Other properties of binomial coefficients

\binom{n}{r} = \binom{n}{n-r}

\binom{n+1}{r} = \binom{n}{r-1} + \binom{n}{r}

\sum_{j=0}^{r} \binom{m}{j} \binom{n}{r-j} = \binom{m+n}{r}
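A quick numerical sketch of the three identities (symmetry, Pascal's rule, and the last one, known as Vandermonde's identity); n, r and m are arbitrary example values.

```python
from math import comb

n, r, m = 10, 4, 6  # arbitrary example values

symmetry = comb(n, r) == comb(n, n - r)
pascal = comb(n + 1, r) == comb(n, r - 1) + comb(n, r)
vandermonde = comb(m + n, r) == sum(
    comb(m, j) * comb(n, r - j) for j in range(r + 1)
)
```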
1.3.3 Multinomial coefficients

We consider the following problem: a set of n distinct items is to be divided into r groups of respective sizes n_1, ..., n_r. The total number of such divisions is given by

\binom{n}{n_1, n_2, ..., n_r} = \frac{n!}{n_1!\, n_2! \cdots n_r!}
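The multinomial coefficient follows directly from factorials; the helper below and its group sizes are illustrative, assuming the sizes sum to n.

```python
from math import factorial

def multinomial(n, sizes):
    """n! / (n1! n2! ... nr!) for group sizes summing to n."""
    assert sum(sizes) == n
    result = factorial(n)
    for s in sizes:
        result //= factorial(s)
    return result

# Dividing 10 distinct items into groups of sizes 5, 3 and 2.
ways = multinomial(10, [5, 3, 2])
```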
1.3.4 Partitioning
We have a set of n identical items that we want to distribute into r groups. This is equivalent to finding the number of vectors (n_1, n_2, ..., n_r) such that n_1 + n_2 + ... + n_r = n. An elegant way of solving the problem is to align the n items in a row and to place r − 1 separators between the items. The solution is the number of ways the separator positions can be chosen: there are n − 1 possible gaps (as no group can be empty) and r − 1 separators to place. Hence the solution is simply

\binom{n-1}{r-1}

If we allow empty groups, then there are a total of n + r − 1 objects, of which r − 1 must be separators and the rest items. The solution is then the number of ways to select r − 1 objects out of n + r − 1, in other words:

\binom{n+r-1}{r-1}
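Both stars-and-bars counts can be confirmed by brute-force enumeration for small n and r; the function and the values below are illustrative.

```python
from itertools import product
from math import comb

def count_splits(n, r, allow_empty):
    """Brute-force count of vectors (n1, ..., nr) with n1 + ... + nr = n."""
    lo = 0 if allow_empty else 1
    return sum(
        1 for v in product(range(lo, n + 1), repeat=r) if sum(v) == n
    )

n, r = 7, 3  # small enough for brute force
no_empty = count_splits(n, r, allow_empty=False)   # should equal C(n-1, r-1)
with_empty = count_splits(n, r, allow_empty=True)  # should equal C(n+r-1, r-1)
```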
Chapter 2
AXIOMS OF PROBABILITY
2.1 Terminology
A random experiment is modelled by a probability space (Ω, F, P )
• Sample space (Universe) : Set of all possible outcomes of an experiment, denoted Ω.
• Event Space : A collection of subsets of Ω, denoted F . These subsets are called events.
• Probability Distribution : P : F → [0, 1] which associates to each event A in F a probabilityP (A).
The Event Space F must verify the following properties :
• F is nonempty
• If A ∈ F then Ac ∈ F
• The union of any countable collection of elements of F is in F .
Remark
The largest possible event space of Ω is its power set.
2.2 Basic Properties
All following properties can be mirrored by exchanging unions with intersections.
• Commutative laws : E ∪ F = F ∪ E
• Associative laws : (E ∪ F ) ∪G = E ∪ (F ∪G)
• Distributive laws : (E ∪ F ) ∩G = (E ∩G) ∪ (F ∩G)
• De Morgan's laws: \left( \bigcup_{i=1}^{n} E_i \right)^c = \bigcap_{i=1}^{n} E_i^c
• Countable additivity: P\left( \bigcup_{i=1}^{\infty} E_i \right) = \sum_{i=1}^{\infty} P(E_i) for any mutually exclusive events E_i.
• If E ⊂ F then P (E) ≤ P (F )
• P (E ∪ F ) = P (E) + P (F )− P (E ∩ F )
2.2.1 Inclusion-Exclusion Formulae
If A1, ..., An are events of (Ω, F, P ), then :
P(A_1 ∪ A_2 ∪ A_3) = P(A_1) + P(A_2) + P(A_3) − P(A_1 ∩ A_2) − P(A_1 ∩ A_3) − P(A_2 ∩ A_3) + P(A_1 ∩ A_2 ∩ A_3)

Note that a term enters with a '+' sign when the number of intersected sets is odd. In general,

P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{1 \le i_1 < ... < i_r \le n} P(A_{i_1} ∩ ... ∩ A_{i_r})
The number of terms in the general formula is

\binom{n}{1} + \binom{n}{2} + ... + \binom{n}{n} = 2^n − 1
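Inclusion-exclusion can be checked on finite sets, where probabilities reduce to counts; the three events below are hypothetical subsets chosen for illustration.

```python
from itertools import combinations

# Hypothetical events as subsets of a finite sample space.
A1 = set(range(0, 10))
A2 = set(range(5, 15))
A3 = {0, 3, 18}
events = [A1, A2, A3]
n = len(events)

union_size = len(A1 | A2 | A3)

# Inclusion-exclusion: alternating sum over all non-empty intersections.
incl_excl = sum(
    (-1) ** (r + 1)
    * sum(len(set.intersection(*c)) for c in combinations(events, r))
    for r in range(1, n + 1)
)
```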
2.3 Conditional Probability
Definition
The conditional probability of A given B is
P(A|B) = \frac{P(A ∩ B)}{P(B)}
Theorem
Let (Ω, F, P) be a probability space and B ∈ F such that P(B) > 0, and define Q(A) = P(A|B). Then (Ω, F, Q) is a probability space.
Law of total probability
If the B_i, i ≥ 1, are pairwise disjoint events of (Ω, F, P), and if A is contained in the union of all B_i, then

P(A) = \sum_{i=1}^{\infty} P(A ∩ B_i)
2.3.1 Bayes' Theorem

If the B_i, i ≥ 1, are pairwise disjoint events of (Ω, F, P), and if A is contained in the union of all B_i, then for any j:

P(B_j | A) = \frac{P(A | B_j)\, P(B_j)}{\sum_{i=1}^{\infty} P(A | B_i)\, P(B_i)}
which is really just a detailed way of writing the conditional probability of Bj given A.
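A small worked example with hypothetical diagnostic-test numbers (these figures are not from the course): the denominator is the law of total probability, the ratio is Bayes' theorem.

```python
# Hypothetical numbers: 1% prevalence, 99% sensitivity, 5% false-positive rate.
prior = [0.01, 0.99]        # P(B1) = P(ill), P(B2) = P(healthy)
likelihood = [0.99, 0.05]   # P(A | B1), P(A | B2), where A = "test positive"

# Denominator of Bayes' theorem: law of total probability.
p_A = sum(l * p for l, p in zip(likelihood, prior))

# Posterior probability of illness given a positive test.
posterior = likelihood[0] * prior[0] / p_A
```

Despite the accurate test, the posterior is only about 1/6, because the disease is rare.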
2.3.2 Prediction Decomposition
If Ai for all 1 ≤ i ≤ n are events in a probability space, then :
P (A1 ∩A2) = P (A2|A1)P (A1)
P (A1 ∩A2 ∩A3) = P (A3|A1 ∩A2)P (A2|A1)P (A1)
or in general
P(A_1 ∩ ... ∩ A_n) = P(A_1) \prod_{i=2}^{n} P(A_i | A_1 ∩ ... ∩ A_{i-1})
2.4 Independence
If (Ω, F, P) is a probability space, two events A and B in F are independent, written A ⊥⊥ B, if

P(A ∩ B) = P(A) P(B)

which implies P(A|B) = P(A) whenever P(B) > 0.
Definitions
• The events A_1, ..., A_n are (mutually) independent if for every subset I ⊆ {1, ..., n}:

P\left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} P(A_i)

• The events A_1, ..., A_n are pairwise independent if for all 1 ≤ i < j ≤ n,

P(A_i ∩ A_j) = P(A_i) P(A_j)

• The events A_1, ..., A_n are conditionally independent given B if for every subset I ⊆ {1, ..., n},

P\left( \bigcap_{i \in I} A_i \,\middle|\, B \right) = \prod_{i \in I} P(A_i | B)
Remarks
• Mutual independence implies pairwise independence. The converse is not true if n > 2.
• Mutual independence and conditional independence do not imply one another.
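The first remark can be illustrated with the classic two-coin example (a standard construction, not from the course), checked here by enumeration of the four equally likely outcomes.

```python
from itertools import product

# Two fair coin flips; each of the four outcomes has probability 1/4.
omega = list(product([0, 1], repeat=2))

def P(event):
    return len(event) / len(omega)

A = {w for w in omega if w[0] == 1}       # first flip is heads
B = {w for w in omega if w[1] == 1}       # second flip is heads
C = {w for w in omega if w[0] == w[1]}    # the two flips agree

# Every pair of events is independent...
pairwise = all(
    P(E & F) == P(E) * P(F) for E, F in [(A, B), (A, C), (B, C)]
)
# ...but P(A ∩ B ∩ C) = 1/4, not 1/8, so the triple is not independent.
mutual = P(A & B & C) == P(A) * P(B) * P(C)
```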
Chapter 3
RANDOM VARIABLES
3.1 Definition
Let (Ω, F, P) be a probability space. A random variable is a function X : Ω → R, which associates a value in R to every outcome. The set D_X = {x ∈ R : ∃ω ∈ Ω such that X(ω) = x} is the support of X. If D_X is countable, then X is a discrete random variable.
3.2 Bernoulli Random Variables
A random variable that takes only the values 0 and 1 is called an indicator variable, a Bernoulli random variable, or a Bernoulli trial.
3.3 Probability Mass Function
The probability mass function (PMF) of a discrete random variable X is
fX(x) = P (X = x)
Naturally, f_X(x) ≥ 0 for all x ∈ D_X, where D_X is the support of X, and \sum_{x_i \in D_X} f_X(x_i) = 1.
3.3.1 Binomial Random Variable
A binomial random variable X has the following PMF :
f(x) = \binom{n}{x} p^x (1 − p)^{n−x}, \quad x = 0, 1, ..., n, \quad n ∈ N, \quad 0 ≤ p ≤ 1

We write X ∼ B(n, p) and call n the denominator and p the probability of success. It represents the number of successes in a fixed number n of independent repetitions of a trial.
3.3.2 Geometric Distribution
A geometric random variable X has the following PMF :
f(x) = p (1 − p)^{x−1}, \quad x = 1, 2, ..., \quad 0 ≤ p ≤ 1

We write X ∼ Geom(p) and call p the probability of success. It represents the number of trials up to and including the first success.
Theorem : Lack of memory
If X ∼ Geom(p) then
P (X > n+m|X > m) = P (X > n)
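The lack-of-memory property can be checked numerically, since P(X > k) = (1 − p)^k has a closed form; p, n and m below are arbitrary illustrative values.

```python
p = 0.3  # arbitrary success probability

def tail(k):
    """P(X > k) for X ~ Geom(p): no success in the first k trials."""
    return (1 - p) ** k

n, m = 4, 6
conditional = tail(n + m) / tail(m)  # P(X > n + m | X > m)
unconditional = tail(n)              # P(X > n)
```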
3.3.3 Negative Binomial Distribution
A negative binomial random variable X has the following PMF :
f(x) = \binom{x-1}{n-1} p^n (1 − p)^{x−n}, \quad x = n, n+1, ..., \quad 0 ≤ p ≤ 1

We write X ∼ NegBin(n, p). When n = 1, X ∼ Geom(p). It represents the number of trials until the nth success happens.
Note that there exists an alternative version using the Gamma function Γ(α) which is defined inthe course’s slide 101 page 52.
3.3.4 Hypergeometric Distribution
We draw without replacement m balls from an urn with w white and b black balls. If X is the number of white balls drawn, then

P(X = x) = \frac{\binom{w}{x} \binom{b}{m-x}}{\binom{w+b}{m}}

We write X ∼ HyperGeom(w, b; m). It can be understood as the number of distinct ways of picking x white balls and m − x black balls, divided by the number of all possible draws of m balls.
3.3.5 Discrete Uniform Distribution
A discrete uniform random variable X has the following PMF :
f(x) = \frac{1}{b − a + 1}, \quad x ∈ \{a, a+1, ..., b\} ⊂ Z

We write X ∼ DU(a, b).
3.3.6 Poisson Distribution
A Poisson random variable X has the following PMF :
f(x) = \frac{λ^x}{x!} e^{−λ}, \quad x = 0, 1, 2, ..., \quad λ > 0
We write X ∼ Pois(λ).
Note that since e^{λ} = \sum_{i=0}^{\infty} \frac{λ^i}{i!}, the sum of all probabilities equals 1.
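A truncated sum confirms this numerically; the rate λ below is an arbitrary example value, and 100 terms are far more than the series needs.

```python
from math import exp, factorial

lam = 2.5  # arbitrary rate

# Truncated sum of the Poisson PMF; the exponential series drives it to 1.
total = sum(lam**x / factorial(x) * exp(-lam) for x in range(100))
```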
Poisson Process
Consider point events taking place in a time period [0, T], and let N(I) be the number of events in a subset I ⊂ [0, T].
Suppose :
• Events in disjoint subsets of T are independent.
• The probability that an event happens in an interval of width δ is λδ + o(δ) as δ → 0, for some rate λ > 0.
• The probability for no events to happen in an interval of width δ is 1− δλ+ o(δ).
Then :
• N(I) ∼ Pois(λ|I|)
• If all Ii are disjoint subsets of T , then all N(Ii) are independent Poisson variables.
Note that the probability that the waiting time X_1 until the first event occurs exceeds t is

P(X_1 > t) = P(N([0, t]) = 0) = e^{−λt}
3.4 Cumulative distribution function
The cumulative distribution function CDF of a random variable X is
FX(x) = P (X ≤ x) x ∈ R
If X is discrete, we can write
F_X(x) = \sum_{x_i \in D_X : x_i \le x} P(X = x_i)
where DX is the support of fX(x), i.e. the values that X can take.
3.4.1 Properties
The cumulative distribution function FX of a random variable X satisfies :
• \lim_{x \to −\infty} F_X(x) = 0
• \lim_{x \to \infty} F_X(x) = 1
• F_X is non-decreasing.
• F_X is continuous on the right, i.e. \lim_{t \to 0, t > 0} F_X(x + t) = F_X(x)
• P(X > x) = 1 − F_X(x)
• If x < y, then P(x < X ≤ y) = F_X(y) − F_X(x)
Remark
We can find the PMF of a discrete random variable from the CDF :
f(x) = F(x) − \lim_{y \to x, y < x} F(y)
3.5 Transformations of discrete random variables
If X is a random variable, then Y = g(X) is a random variable too, and
f_Y(y) = \sum_{x : g(x) = y} f_X(x)
3.6 Expectation
Let X be a discrete random variable for which \sum_{x \in D_X} |x| f_X(x) < \infty. The expectation, expected value or mean of X is

E(X) = \sum_{x \in D_X} x f_X(x)
Remark
If \sum_{x \in D_X} |x| f_X(x) is not finite, then E(X) is not well defined.
3.6.1 Properties
Let X be a random variable with a finite expected value E(X), and let a, b ∈ R be constants. Then

• E(·) is a linear operator, meaning that E(aX + b) = aE(X) + b
• If g(X) and h(X) have finite expected values, then E(g(X) + h(X)) = E(g(X)) + E(h(X))
• (E(X))^2 ≤ E(X^2)
3.6.2 Expected value of a function
Let X be a random variable with mass function f and let g be a real-valued function of X such that \sum_{x \in D_X} |g(x)| f(x) < \infty. Then

E(g(X)) = \sum_{x \in D_X} g(x) f(x)
3.6.3 Moments of a distribution
If X has a PMF f(x) such that \sum_x |x|^r f(x) < \infty, then

• the rth moment of X is E(X^r)
• the rth central moment of X is E((X − E(X))^r)
• the variance of X is the second central moment of X, i.e. var(X) = E((X − E(X))^2)
• the standard deviation of X is σ = \sqrt{var(X)}
• the rth factorial moment of X is E(X(X − 1) \cdots (X − r + 1)) = E\left( \frac{X!}{(X − r)!} \right)
)3.6.4 Properties of variance
Let X be a random variable whose variance exists, and let a, b be constants. Then
• var(X) = E(X2)− E(X)2
• var(X) = E(X(X − 1)) + E(X)− E(X)2
• var(aX + b) = a2var(X)• var(X) = 0 ⇒ X is constant with probability 1.
Also, if X takes its values in {0, 1, ...} and E(X) < \infty, then

E(X) = \sum_{x=1}^{\infty} P(X ≥ x)

and more generally, for r ≥ 2,

E(X(X − 1) \cdots (X − r + 1)) = r \sum_{x=r}^{\infty} (x − 1) \cdots (x − r + 1) P(X ≥ x)
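The tail-sum formula for E(X) can be checked on a geometric variable, since P(X ≥ x) = (1 − p)^{x−1} has a closed form; p and the truncation point are illustrative choices.

```python
p = 0.4  # X ~ Geom(p) on {1, 2, ...}, so E(X) = 1/p = 2.5

N = 1000  # truncation point; the geometric tail beyond it is negligible

# Direct definition of the mean: sum of x * f(x).
mean_direct = sum(x * p * (1 - p) ** (x - 1) for x in range(1, N))

# Tail-sum formula: sum over x >= 1 of P(X >= x) = (1 - p)^(x - 1).
mean_by_tails = sum((1 - p) ** (x - 1) for x in range(1, N))
```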
3.7 Conditional Probability Distributions
3.7.1 Conditional Probability mass function
Let (Ω, F, P ) be a probability space on which we define a random variable X, and let B ∈ F be suchthat P (B) > 0. Then the conditional probability mass function of X given B is
f_X(x|B) = P(X = x | B) = \frac{P(\{X = x\} ∩ B)}{P(B)}

We can note that f_X(x|B) is a well-defined mass function:

f_X(x|B) ≥ 0, \quad \sum_x f_X(x|B) = 1
3.7.2 Conditional expected value
Suppose that \sum_x |g(x)| f_X(x|B) < \infty. Then the conditional expected value of g(X) given B is

E(g(X)|B) = \sum_x g(x) f_X(x|B)

Let X be a random variable and B_i, i ∈ N, events that form a partition of Ω. Then

E(X) = \sum_{i=1}^{\infty} E(X|B_i) P(B_i)
Convergence of distribution
Let X_n and X be random variables whose cumulative distribution functions are F_n and F. We say that the random variables X_n converge in distribution, or converge in law, to X if, for all x ∈ R at which F is continuous,

\lim_{n \to \infty} F_n(x) = F(x)

We write X_n →_D X. Also, if f_n(x) → f(x) for all x, then F_n(x) → F(x).
3.7.3 Law of small numbers
Lemma:

\lim_{n \to \infty} \frac{1}{n^r} \binom{n}{r} = \frac{1}{r!}
If Xn ∼ B(n, pn) and limn→∞ npn = λ > 0, then Xn →D X where X ∼ Pois(λ).
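The binomial-to-Poisson approximation can be seen numerically by comparing the two PMFs for large n and small p_n = λ/n; λ and n below are illustrative values.

```python
from math import comb, exp, factorial

lam = 3.0
n = 100_000          # large n, small p = lam / n
p = lam / n

# Largest pointwise gap between the B(n, p) and Pois(lam) PMFs at small x.
max_diff = max(
    abs(
        comb(n, x) * p**x * (1 - p) ** (n - x)   # binomial PMF
        - lam**x * exp(-lam) / factorial(x)      # Poisson PMF
    )
    for x in range(10)
)
```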
Chapter 4
CONTINUOUS RANDOMVARIABLES
4.1 Definition : Probability density function
A random variable X is continuous if there exists a function f(x), called the probability density function (PDF) or density of X, such that

P(X ≤ x) = F(x) = \int_{−\infty}^{x} f(u)\, du

which implies that f(x) ≥ 0 and \int_{−\infty}^{\infty} f(x)\, dx = 1. Note that

f(x) = \frac{dF(x)}{dx}
4.2 Basic distributions
4.2.1 Uniform distribution
The random variable U is called a uniform random variable, written U ∼ U(a, b), if it has the following density:

f(u) = \frac{1}{b − a} for a ≤ u ≤ b, and 0 otherwise.
4.2.2 Exponential distribution

The random variable X is called an exponential random variable, written X ∼ exp(λ), if it has the following density:

f(x) = λ e^{−λx} for x > 0, and 0 otherwise.
4.2.3 Gamma distribution
The random variable X is called a gamma random variable, written X ∼ Gamma(α, λ), if it has the following density:

f(x) = \frac{λ^α}{Γ(α)} x^{α−1} e^{−λx} for x > 0, and 0 otherwise.
Here α is called the shape parameter and λ the rate.
Remark
The Gamma function Γ(α) is defined as

Γ(α) = \int_{0}^{\infty} y^{α−1} e^{−y}\, dy

with the following properties:

Γ(1) = 1
Γ(α) = (α − 1) Γ(α − 1)

For integer n ≥ 1, we have Γ(n) = (n − 1)!
4.2.4 Laplace distribution
The random variable X is called a Laplace random variable or double exponential if it has the following density:

f(x) = \frac{λ}{2} e^{−λ|x−η|}, \quad x, η ∈ R, \quad λ > 0
4.2.5 Pareto distribution
The random variable X is called a Pareto random variable if it has the cumulative distribution function

F(x) = 0 for x < β, and F(x) = 1 − \left( \frac{β}{x} \right)^α for x ≥ β.
4.3 Expectation
Let g(x) be a real-valued function, and X a continuous random variable with density f(x). If E(|g(X)|) < \infty, then the expectation of g(X) is

E(g(X)) = \int_{−\infty}^{\infty} g(x) f(x)\, dx
4.4 Conditional Densities
The conditional cumulative distribution function is given by

F_X(x | X ∈ A) = \frac{\int_{y \le x,\, y \in A} f(y)\, dy}{P(X ∈ A)}

and the conditional density function by

f_X(x | X ∈ A) = \frac{f_X(x)}{P(X ∈ A)} for x ∈ A, and 0 otherwise.

Finally we can also define the conditional expectation as

E(g(X) | X ∈ A) = \frac{E(g(X)\, I(X ∈ A))}{P(X ∈ A)}
4.5 Quantiles
Let 0 < p < 1. We define the p quantile of the cumulative distribution function F(x) to be

x_p = \inf\{x : F(x) ≥ p\}

that is, the smallest x such that F(x) reaches or exceeds p. For most continuous random variables, x_p is unique and equals x_p = F^{−1}(p), which is equivalent to P(X ≤ x_p) = p. In particular, the 0.5 quantile is the median of F.
4.6 Transformations
Let g : R → R and B_y = ]−\infty, y]. Let Y = g(X) be a random variable. Then

F_Y(y) = P(Y ≤ y) = \int_{g^{−1}(B_y)} f_X(x)\, dx if X is continuous, and \sum_{x \in g^{−1}(B_y)} f_X(x) if X is discrete,

where g^{−1}(B_y) = \{x ∈ R : g(x) ≤ y\}. When g is monotone increasing or decreasing, then since f(x) = F′(x),

f_Y(y) = \left| \frac{d g^{−1}(y)}{dy} \right| f_X(g^{−1}(y))
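The monotone-transformation formula can be sanity-checked numerically: take an exponential X and the increasing map g(x) = log(x), and compare the formula with a finite-difference derivative of F_Y(y) = F_X(e^y). The rate and evaluation point are illustrative.

```python
from math import exp

lam = 2.0  # X ~ exp(lam); Y = g(X) = log(X) is monotone increasing

def f_X(x):
    return lam * exp(-lam * x) if x > 0 else 0.0

def F_X(x):
    return 1.0 - exp(-lam * x) if x > 0 else 0.0

# Density of Y by the formula: g^{-1}(y) = e^y, so |d g^{-1}(y)/dy| = e^y.
def f_Y(y):
    return exp(y) * f_X(exp(y))

# Numerical check: f_Y should match the derivative of F_Y(y) = F_X(e^y).
y, h = 0.3, 1e-6
numeric = (F_X(exp(y + h)) - F_X(exp(y - h))) / (2 * h)
```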
4.7 Normal Distribution
A random variable X is called a normal random variable with expectation µ and variance σ² if it has density

f_X(x) = \frac{1}{σ\sqrt{2π}} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right)

We write X ∼ N(µ, σ²).
When µ = 0 and σ² = 1, the corresponding random variable Z is standard normal, with density

φ(z) = \frac{1}{\sqrt{2π}} \exp\left( −\frac{z^2}{2} \right)

yielding

F_Z(z) = P(Z ≤ z) = Φ(z) = \int_{−\infty}^{z} φ(u)\, du = \int_{−\infty}^{z} \frac{1}{\sqrt{2π}} e^{−u^2/2}\, du

Note that f_X(x) = \frac{1}{σ} φ\left( \frac{x − µ}{σ} \right).
)4.7.1 Properties
The density φ(z), the cumulative distribution function Φ(z) and the quantiles z_p of Z ∼ N(0, 1) verify:

• φ(z) = φ(−z)
• P(Z ≤ z) = 1 − P(Z ≥ z)
• z_p = −z_{1−p}
• \lim_{z \to ±\infty} z^r φ(z) = 0 for all r > 0
• φ′(z) = −z φ(z), φ″(z) = (z² − 1) φ(z), φ‴(z) = −(z³ − 3z) φ(z), ...
• If X ∼ N(µ, σ²) then Z = (X − µ)/σ ∼ N(0, 1)
4.7.2 Moivre-Laplace: Normal approximation of the binomial distribution

Let X_n ∼ B(n, p), µ_n = E(X_n) = np, σ_n² = var(X_n) = np(1 − p) and Z ∼ N(0, 1). Then

\lim_{n \to \infty} P\left( \frac{X_n − µ_n}{σ_n} ≤ z \right) = Φ(z)

and, approximately,

P(X_n ≤ r) ≈ Φ\left( \frac{r − µ_n}{σ_n} \right)

which corresponds to X_n being approximately N(np, np(1 − p)).
Continuity correction
A better approximation of P(X_n ≤ r) is given by replacing r by r + 1/2, which is called the continuity correction. This yields

P(X_n ≤ r) ≈ Φ\left( \frac{r + 1/2 − np}{\sqrt{np(1 − p)}} \right)
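The benefit of the correction can be seen by comparing both approximations with the exact binomial CDF; n, p and r below are arbitrary example values, and Φ is computed from the error function.

```python
from math import comb, erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p, r = 50, 0.4, 22
mu, sigma = n * p, sqrt(n * p * (1 - p))

# Exact binomial CDF P(X <= r).
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(r + 1))

plain = Phi((r - mu) / sigma)             # without correction
corrected = Phi((r + 0.5 - mu) / sigma)   # with continuity correction
```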
4.8 Q-Q Plots
When we want to compare a sample of n values X_i of a random variable with a theoretical distribution F, we typically order the n values and plot them against F^{−1}\left( \frac{1}{n+1} \right), F^{−1}\left( \frac{2}{n+1} \right), and so on.
The idea is that the n + 1 subdivisions divide the graph into intervals of equal probability (meaning more cuts around zero for N(0, 1), for example). If the distribution of the samples corresponds to F, then plotting them against the cuts should result in a straight line.
4.9 Densities recap
• Uniform variables lie in a finite interval and give equal probability to each part of the interval.
• Exponential and Gamma variables lie in (0, ∞) and are often used to model waiting times or positive quantities. Gamma has 2 parameters and is more flexible, but exponential is simpler and more elegant.
• Pareto variables lie in (β, ∞) and are often used to model financial losses over some threshold β.
• Normal variables lie in R and are used to model quantities that arise from the averaging of many small effects, or are subject to error.
• Laplace variables lie in R and are often used in place of the normal when outliers might be present.
Chapter 5
SEVERAL RANDOMVARIABLES
5.1 Discrete Random Variables
Let (X,Y ) be a discrete random variable. The joint probability mass function of (X,Y ) is
fX,Y (x, y) = P ((X,Y ) = (x, y))
and the joint cumulative distribution function of (X,Y ) is
FX,Y (x, y) = P (X ≤ x, Y ≤ y)
5.2 Continuous Random Variables
The random variable (X, Y) is said to be jointly continuous if there exists a function f_{X,Y}(x, y), called the joint density of (X, Y), such that

P((X, Y) ∈ A) = \iint_{(u,v) \in A} f_{X,Y}(u, v)\, du\, dv
We can hence also write the joint cumulative distribution function of (X,Y ) as
FX,Y(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(u, v) du dv

and

fX,Y(x, y) = ∂²FX,Y(x, y)/∂x∂y
5.3 Exponential Families
Let (X1, ...Xn) be a discrete or continuous random variable with mass/density function of the form
f(x1, ..., xn) = exp( Σ_{i=1}^{p} si(x)θi − κ(θ1, ..., θp) + c(x1, ..., xn) )

where θ = (θ1, ..., θp) ∈ Θ ⊆ Rp. This is called an exponential family distribution.
5.4 Marginal and conditional distribution
The marginal probability mass/density function of X is
fX(x) = Σ_y fX,Y(x, y) if discrete
fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy if continuous
The conditional probability mass/density function of Y given X is
fY|X(y | x) = fX,Y(x, y) / fX(x)
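As a small sketch (hypothetical probabilities, joint pmf stored as a plain dict), the marginal and conditional mass functions can be computed directly from their definitions:

```python
# Joint pmf of (X, Y) stored as {(x, y): probability}.
joint = {(0, 0): 0.10, (0, 1): 0.30,
         (1, 0): 0.25, (1, 1): 0.35}

def marginal_X(joint, x):
    """f_X(x) = sum over y of f_{X,Y}(x, y)."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def conditional_Y_given_X(joint, y, x):
    """f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)."""
    return joint[(x, y)] / marginal_X(joint, x)

assert abs(sum(joint.values()) - 1.0) < 1e-12       # valid joint pmf
assert abs(marginal_X(joint, 0) - 0.40) < 1e-12
# The conditional pmf sums to 1 over y for each fixed x.
total = conditional_Y_given_X(joint, 0, 1) + conditional_Y_given_X(joint, 1, 1)
assert abs(total - 1.0) < 1e-12
```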
5.5 Multivariate Random Variables
Let X1, ..., Xn be random variables defined on the same space. Their joint cumulative distribution function is

FX1,...,Xn(x1, ..., xn) = P(X1 ≤ x1, ..., Xn ≤ xn)

and their joint mass/density function is

fX1,...,Xn(x1, ..., xn) = P(X1 = x1, ..., Xn = xn) if discrete
fX1,...,Xn(x1, ..., xn) = ∂ⁿFX1,...,Xn(x1, ..., xn)/∂x1...∂xn if continuous
5.5.1 Multinomial Distribution
The random variable (X1, ..., Xk) has the multinomial distribution with denominator m and probabilities (p1, ..., pk) if its mass function is

f(x1, ..., xk) = ( m! / (x1! · ... · xk!) ) p1^{x1} · ... · pk^{xk}
5.6 Independence
Random variables X,Y defined on the same probability space are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)

implying

fX,Y(x, y) = fX(x) fY(y) and FX,Y(x, y) = FX(x) FY(y)
5.6.1 Independent and Identically distributed variables
A random sample of size n from a distribution F with density f is a set of n independent random variables each with distribution F. We say that they are independent and identically distributed (iid) with distribution F, or with density f.
5.7 Joint Moments and Covariance
Let X, Y be random variables with mass/density fX,Y(x, y). Then if E|g(X, Y)| < ∞, we can define the expectation of g(X, Y) to be

E( g(X, Y) ) = Σ_{x,y} g(x, y) fX,Y(x, y) if discrete
E( g(X, Y) ) = ∬ g(x, y) fX,Y(x, y) dx dy if continuous
In particular we define the joint moments and the joint central moments by
E(XrY s) and E((X − E(X))r(Y − E(Y ))s)
If r = s = 1 we call it the covariance of X and Y
cov(X,Y ) = E((X − E(X))(Y − E(Y ))) = E(XY )− E(X)E(Y )
5.7.1 Properties of covariance
Let X,Y, Z be random variables and a, b, c, d ∈ R constants. The covariance satisfies
• cov(X, X) = var(X)
• cov(a, X) = 0
• cov(X, Y) = cov(Y, X)
• cov(a + bX + cY, Z) = b cov(X, Z) + c cov(Y, Z)
• cov(a + bX, c + dY) = bd cov(X, Y)
• var(a + bX + cY) = b² var(X) + 2bc cov(X, Y) + c² var(Y)
• cov(X, Y)² ≤ var(X) var(Y) (Cauchy–Schwarz inequality)
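The bilinearity properties above also hold exactly for the sample covariance, which makes them easy to check numerically. A minimal sketch (hypothetical data, Python standard library only):

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Sample covariance (denominator n), mirroring
    cov(X, Y) = E(XY) - E(X)E(Y)."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 3.0]
z = [0.5, 2.5, 1.5, 4.0]
a, b, c, d = 3.0, 2.0, -1.0, 4.0

# cov(a + bX, c + dY) = bd cov(X, Y)
lhs = cov([a + b * v for v in x], [c + d * w for w in y])
assert abs(lhs - b * d * cov(x, y)) < 1e-9

# cov(a + bX + cY, Z) = b cov(X, Z) + c cov(Y, Z)
s = [a + b * u + c * w for u, w in zip(x, y)]
assert abs(cov(s, z) - (b * cov(x, z) + c * cov(y, z))) < 1e-9

# var(X) = cov(X, X); Cauchy-Schwarz: cov(X, Y)^2 <= var(X) var(Y)
assert cov(x, y) ** 2 <= cov(x, x) * cov(y, y) + 1e-12
```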
5.7.2 Independence and covariance
Recall that if X and Y are independent and g(X) and h(Y) are functions whose expectations exist, then

E( g(X)h(Y) ) = E( g(X) ) E( h(Y) )

Taking g(X) = X − E(X) and h(Y) = Y − E(Y), we see that

cov(X, Y) = 0
5.8 Linear combinations of random variables
The average of the random variables X1, ..., Xn is

X̄ = n⁻¹ Σ_{j=1}^{n} Xj

If a, b1, ..., bn are constants, then

var(a + b1X1 + ... + bnXn) = Σ_{j,k} bj bk cov(Xj, Xk) = Σ_{j=1}^{n} bj² var(Xj) + Σ_{j≠k} bj bk cov(Xj, Xk)
If X1, ..., Xn all have mean µ and variance σ², then

E(X̄) = µ and var(X̄) = σ²/n
5.9 Correlation
The covariance depends on the units of measurement, so we often use the following quantity, which measures linear dependence.

The correlation of X and Y is

corr(X, Y) = cov(X, Y) / √( var(X) var(Y) )
5.9.1 Properties of correlation
Let X and Y be random variables with correlation ρ = corr(X,Y ). Then
• −1 ≤ ρ ≤ 1
• if ρ = ±1, then there exist a, b, c ∈ R such that aX + bY + c = 0 with probability 1. X and Y are then said to be linearly dependent.
• if X and Y are independent, then corr(X, Y) = 0
• the effect of the transformation (X, Y) → (a + bX, c + dY) is corr(X, Y) → sign(bd) corr(X, Y)
5.10 Conditional Expectation
Let g(X,Y ) be a function of a random vector (X,Y ). Its conditional expectation given X = x is
E( g(X, Y) | X = x ) = Σ_y g(x, y) fY|X(y|x) if discrete
E( g(X, Y) | X = x ) = ∫_{−∞}^{∞} g(x, y) fY|X(y|x) dy if continuous
5.10.1 Expectation and Conditioning
E( g(X, Y) ) = E_X( E( g(X, Y) | X = x ) )
var( g(X, Y) ) = E_X( var( g(X, Y) | X = x ) ) + var_X( E( g(X, Y) | X = x ) )

where E_X and var_X are the expectation and variance according to the distribution of X.

For the variance, the second term takes into account the variance from one x to another. The first term computes the average over all x of the variance along Y, but it is oblivious to the potential offsets there can be from one x to another, since the variance ignores constants. If at x1 the distribution is the same as at x0, plus some constant c, the first term alone would not detect this, and that is what the second term is for.
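The variance decomposition can be verified exactly on a small discrete joint distribution (hypothetical probabilities, Python standard library only):

```python
# Check var(Y) = E_X(var(Y|X)) + var_X(E(Y|X)) on a small joint pmf.
joint = {(0, 1): 0.2, (0, 3): 0.1,
         (1, 2): 0.4, (1, 5): 0.3}

def e(f):
    """Expectation of f(x, y) under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

fx = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
# E(Y | X = x) and var(Y | X = x) for each x
ey_x = {x: sum(y * p for (xx, y), p in joint.items() if xx == x) / fx[x] for x in fx}
vy_x = {x: sum(y * y * p for (xx, y), p in joint.items() if xx == x) / fx[x]
           - ey_x[x] ** 2 for x in fx}

var_y = e(lambda x, y: y * y) - e(lambda x, y: y) ** 2
within = sum(fx[x] * vy_x[x] for x in fx)                    # E_X(var(Y|X))
mean_ey = sum(fx[x] * ey_x[x] for x in fx)
between = sum(fx[x] * (ey_x[x] - mean_ey) ** 2 for x in fx)  # var_X(E(Y|X))

assert abs(var_y - (within + between)) < 1e-9
```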
5.11 Generating Functions
We define the moment-generating function of a random variable X by
MX(t) = E(etX)
for t ∈ R such that MX(t) <∞. This definition of the MGF allows us to write the following
MX(t) = E( Σ_{r=0}^{∞} t^r X^r / r! ) = Σ_{r=0}^{∞} (t^r / r!) E(X^r)
from which we can obtain all the moments E(Xr) by differentiation.
5.11.1 Properties of Moment-Generating Functions
If M(t) is the MGF of a random variable X then
• MX(0) = 1
• M_{a+bX}(t) = e^{at} MX(bt)
• E(X^r) = d^r MX(t)/dt^r |_{t=0}
• E(X) = M′X(0)
• var(X) = M″X(0) − M′X(0)²
There exists an injection between the cumulative distribution functions FX(x) and the moment-generating functions MX(t), meaning that if we recognize an MGF then we know to which distribution it corresponds.
5.11.2 Linear combinations
Let a, b1, ..., bn ∈ R and let X1, ..., Xn be independent random variables whose MGFs exist. Then Y = a + b1X1 + ... + bnXn has MGF

MY(t) = e^{ta} Π_{j=1}^{n} MXj(t bj)
5.11.3 Continuity
Let Xn, X be random variables with distribution functions Fn, F whose MGFs Mn(t), M(t) exist for 0 ≤ |t| < b. If Mn(t) → M(t) for |t| < b as n → ∞, then Xn →D X, i.e. Fn(x) → F(x) at each x ∈ R where F is continuous.
5.11.4 Mean vector and covariance matrix
Let X = (X1, ..., Xp)ᵀ be a p × 1 vector of random variables. Then

E(X)_{p×1} = ( E(X1), ..., E(Xp) )ᵀ

and var(X)_{p×p} is the p × p matrix with var(Xj) on the diagonal and cov(Xj, Xk) off the diagonal:

| var(X1)      cov(X1, X2)  ...  cov(X1, Xp) |
| cov(X1, X2)  var(X2)      ...  cov(X2, Xp) |
| ...          ...               ...         |
| cov(X1, Xp)  cov(X2, Xp)  ...  var(Xp)     |
5.11.5 Moment-generating function : Multivariate case
The moment-generating function of a random vector X_{p×1} is

MX(t) = E( e^{tᵀX} ) = E( e^{Σ_{r=1}^{p} tr Xr} )

for t ∈ Rp such that MX(t) < ∞. This implies

E(X)_{p×1} = M′X(0) = ∂MX(t)/∂t |_{t=0}
var(X)_{p×p} = ∂²MX(t)/∂t∂tᵀ |_{t=0} − M′X(0) M′X(0)ᵀ

Note that all we do here is rewrite the already known formulae for single random variables in a form that allows us to compute several "at once".
Independence
If A, B ⊂ {1, ..., p} with A ∩ B = ∅, and we write XA for the subvector of X containing {Xj : j ∈ A}, then XA and XB are independent iff

MX(t) = E( e^{tAᵀXA + tBᵀXB} ) = MXA(tA) MXB(tB)

where t is such that MX(t) < ∞.
5.11.6 Characteristic function
Many distributions don’t have a defined MGF. In this case we define the characteristic functionof X
φX(t) = E( e^{itX} ), t ∈ R

where i = √−1. Characteristic functions share the same properties as MGFs, but they are to be used only if the MGF is not defined, since they require complex analysis.
Theorem
X and Y have the same cumulative distribution function iff they have the same characteristic function. If X is continuous with density f and characteristic function φ, then

f(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} φ(t) dt
5.11.7 Cumulant-generating function
The cumulant-generating function (CGF) of X is KX(t) = logMX(t). The cumulants κr of Xare defined by
KX(t) = Σ_{r=1}^{∞} (t^r / r!) κr,   κr = d^r KX(t)/dt^r |_{t=0}
This implies that E(X) = κ1 and var(X) = κ2.
5.11.8 Cumulants of sums of random variables
If a, b1, ..., bn are constants and X1, ..., Xn are independent random variables, then
K_{a+b1X1+...+bnXn}(t) = ta + Σ_{j=1}^{n} KXj(t bj)

If X1, ..., Xn are independent variables having cumulants κ_{j,r}, then the CGF of S = X1 + ... + Xn is

KS(t) = Σ_{j=1}^{n} KXj(t) = Σ_{j=1}^{n} Σ_{r=1}^{∞} (t^r / r!) κ_{j,r}
5.11.9 Multivariate cumulant-generating function
The cumulant-generating function (CGF) of a random variable Xp×1 = (X1, ..., Xp)T is
KX(t) = logMX(t)
This implies that
E(X)_{p×1} = K′X(0) = ∂KX(t)/∂t |_{t=0}
var(X)_{p×p} = ∂²KX(t)/∂t∂tᵀ |_{t=0}
Independence
If A, B ⊂ {1, ..., p} with A ∩ B = ∅, and we write XA for the subvector of X containing {Xj : j ∈ A}, then XA and XB are independent iff

KX(t) = log E( e^{tAᵀXA + tBᵀXB} ) = KXA(tA) + KXB(tB)

where t is such that MX(t) < ∞.
5.12 Multivariate Normal Distribution
The random vector X = (X1, ..., Xp)T has a multivariate normal distribution if there exist a
p× 1 vector µ = (µ1, ..., µp)T ∈ Rp and a p× p symmetric matrix Ω with elements ωjk such that
uTX ∼ N(uTµ, uTΩu)
where u ∈ Rp. We then write X ∼ Np(µ, Ω). In other words, if every linear combination of the individual random variables has a normal distribution, then so does the vector.
5.12.1 Lemma : Properties of Normal variables
• E(Xj) = µj, var(Xj) = ωjj, cov(Xj, Xk) = ωjk for j ≠ k.
• The moment-generating function of X is MX(t) = exp( tᵀµ + ½ tᵀΩt ) for t ∈ Rp.
• XA and XB are independent iff ΩAB = 0.
• If X1, ..., Xn ∼iid N(µ, σ²), then X_{n×1} = (X1, ..., Xn)ᵀ ∼ Nn(µ1n, σ²In).
• Linear combinations of normal variables are normal : a_{r×1} + B_{r×p}X ∼ Nr(a + Bµ, BΩBᵀ).
5.12.2 Lemma : Normal density function
The random vector X ∼ Np(µ, Ω) has a density function on Rp iff Ω is positive definite, i.e. Ω has rank p. If so, the density function is

f(x; µ, Ω) = (2π)^{−p/2} |Ω|^{−1/2} exp( −½ (x − µ)ᵀ Ω⁻¹ (x − µ) )
5.12.3 Marginal and conditional distributions
Let X ∼ Np(µ_{p×1}, Ω_{p×p}), where |Ω| > 0, and let A, B ⊂ {1, ..., p} with |A| = q < p, |B| = r < p and A ∩ B = ∅. Let µA, ΩA and ΩAB be respectively the q × 1 subvector of µ and the q × q and q × r submatrices of Ω conformable with A, A × A, A × B. Then

• The marginal distribution of XA is normal, XA ∼ Nq(µA, ΩA)
• The conditional distribution of XA given XB = xB is normal,

XA | XB = xB ∼ Nq( µA + ΩAB ΩB⁻¹ (xB − µB), ΩA − ΩAB ΩB⁻¹ ΩBA )

5.13 Transformation of joint continuous densities
Let X = (X1, X2) ∈ R² be a continuous random variable and let Y = (g1(X1, X2), g2(X1, X2)) be such that there exist h1 and h2 with X1 = h1(Y1, Y2) and X2 = h2(Y1, Y2). Let J(x1, x2) be the Jacobian of g1 and g2:

J(x1, x2) = | ∂g1/∂x1  ∂g1/∂x2 |
            | ∂g2/∂x1  ∂g2/∂x2 |

Then

fY1,Y2(y1, y2) = fX1,X2(x1, x2) |J(x1, x2)|⁻¹ evaluated at x1 = h1(y1, y2), x2 = h2(y1, y2)
5.14 Order Statistics
The order statistics of the random variables X1, ...Xn are the ordered values
X(1) ≤ X(2) ≤ ... ≤ X(n−1) ≤ X(n)
If X1, ..., Xn are continuous, then no two of the Xj can be equal. In particular,

min_{1≤j≤n} Xj = X(1)
median = X(m+1) if n = 2m + 1 odd, and ½( X(m) + X(m+1) ) if n = 2m even
max_{1≤j≤n} Xj = X(n)
5.14.1 Theorem
Let X1, ..., Xn ∼iid F, a continuous distribution with density f. Then

P( X(n) ≤ x ) = F(x)ⁿ
P( X(1) ≤ x ) = 1 − (1 − F(x))ⁿ
fX(r)(x) = n!/( (r − 1)!(n − r)! ) F(x)^{r−1} f(x) (1 − F(x))^{n−r},  r = 1, ..., n
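The first identity is easy to check by simulation. A sketch with uniform variables, where F(x) = x on (0, 1) (hypothetical numbers, fixed seed so the run is reproducible):

```python
import random

random.seed(42)
n, trials, x = 5, 20000, 0.7

# For X_j ~ U(0, 1), F(x) = x, so P(X_(n) <= x) should be x**n.
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))
estimate = hits / trials
theory = x ** n   # 0.7**5 ≈ 0.168

# Monte Carlo estimate should land close to the theoretical value.
assert abs(estimate - theory) < 0.015
```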
5.15 Approximation and Convergence
5.15.1 Inequalities
Let X be a random variable, a > 0 a constant, h a non-negative function and g a convex function. Then

• P( h(X) ≥ a ) ≤ E( h(X) )/a (basic inequality)
• P( |X| ≥ a ) ≤ E( |X| )/a (Markov's inequality)
• P( |X| ≥ a ) ≤ E(X²)/a² (Chebyshov's inequality)
• E( g(X) ) ≥ g( E(X) ) (Jensen's inequality)

From Chebyshov's inequality, by replacing X with X − E(X), we get

P( |X − E(X)| ≥ a ) ≤ var(X)/a²

Remark : to prove the basic inequality, from which the others more or less derive, note that for any a > 0 and non-negative Y we have Y ≥ Y I(Y ≥ a) ≥ a I(Y ≥ a), which implies, from the definition of the expectation, that E(Y) ≥ a P(Y ≥ a).
Hoeffding’s inequality
Let Z1, ..., Zn be independent random variables such that E(Zi) = 0 and ai ≤ Zi ≤ bi for constants ai < bi. If ε > 0, then for all t > 0,

P( Σ_{i=1}^{n} Zi ≥ ε ) ≤ e^{−tε} Π_{i=1}^{n} e^{t²(bi − ai)²/8}

This inequality gives much tighter bounds than the ones seen before.
5.15.2 Convergence
• Xn converges almost surely, Xn →a.s. X, if P( lim_{n→∞} Xn = X ) = 1
• Xn converges in mean square, Xn →2 X, if lim_{n→∞} E( (Xn − X)² ) = 0, where E(Xn²), E(X²) < ∞
• Xn converges in probability, Xn →P X, if for all ε > 0, lim_{n→∞} P( |Xn − X| > ε ) = 0
• Xn converges in distribution, Xn →D X, if lim_{n→∞} Fn(x) = F(x) at each point x where F(x) is continuous
Relations between modes of convergence
Xn →a.s. X ⇒ Xn →P X, and Xn →2 X ⇒ Xn →P X; in turn, Xn →P X ⇒ Xn →D X.

The most important ones are →P and →D.
Combinations of convergent sequences
Let x0, y0 be constants, X, Y, Xn, Yn random variables, and h a function continuous at x0. Then

Xn →D x0 ⇒ Xn →P x0
Xn →P x0 ⇒ h(Xn) →P h(x0)
Xn →D X and Yn →P y0 ⇒ Xn + Yn →D X + y0 and Xn Yn →D X y0

the last being Slutsky's lemma.
5.15.3 Laws of Large Numbers
Weak law of large numbers
Let X1, ..., Xn be a sequence of independent identically distributed random variables with finite expectation µ, and average

X̄ = n⁻¹(X1 + ... + Xn)

Then X̄ →P µ, i.e. for all ε > 0,

P( |X̄ − µ| > ε ) → 0 as n → ∞

Remark : when the Xj also have finite variance σ², this is very easily proved using Chebyshov's inequality:

P( |X̄ − µ| > ε ) ≤ var(X̄)/ε² = σ²/(nε²) → 0 as n → ∞
Strong law of large numbers
Under the same conditions we have, in addition, X̄ →a.s. µ, i.e.

P( lim_{n→∞} X̄ = µ ) = 1

This is stronger, since the weak law allows the event |X̄ − µ| > ε to occur an infinite number of times, while the strong law excludes this.
5.15.4 Central Limit Theorem (CLT)
Standardisation of an average
We know that if Xj ∼iid (µ, σ²), then

E(X̄) = µ and var(X̄) = σ²/n

Therefore it is natural to consider

Zn = (X̄ − µ)/√(σ²/n) = √n(X̄ − µ)/σ

which has expectation 0 and variance 1.
Central Limit Theorem
Let X1, ..., Xn be independent identically distributed random variables with expectation µ and variance 0 < σ² < ∞. Then

Zn = √n(X̄ − µ)/σ →D Z as n → ∞

where Z ∼ N(0, 1). Thus for large n,

P( √n(X̄ − µ)/σ ≤ z ) ≈ P(Z ≤ z) = Φ(z)
5.15.5 Delta Method
Let X1, ..., Xn be independent identically distributed random variables with expectation µ and variance 0 < σ² < ∞, and let g be a smooth function with g′(µ) ≠ 0. Then

( g(X̄) − g(µ) ) / √( g′(µ)² σ²/n ) →D N(0, 1) as n → ∞

which is just a more general form of the central limit theorem.
5.15.6 Sample quantiles
Definition
Let X1, ..., Xn ∼iid F and 0 < p < 1. Then the p sample quantile of X1, ..., Xn is the rth order statistic X(r), where r = ⌈np⌉.
Theorem : Asymptotic distribution of order statistics
Let 0 < p < 1, X1, ..., Xn ∼iid F and xp = F⁻¹(p). Then if f(xp) > 0,

( X(⌈np⌉) − xp ) / √( p(1 − p)/(n f(xp)²) ) →D N(0, 1) as n → ∞

which implies that, approximately,

X(⌈np⌉) ∼ N( xp, p(1 − p)/(n f(xp)²) )
Chapter 6
EXPLORATORY STATISTICS
Statistics is the science of extracting information from data. Key points to keep in mind are variation in the data and the consequent uncertainty, and context.
Statistical Cycle
There are four main stages in the statistical method.
• planning
• implementation
• data analysis
• presentation
Study types
There are two main approaches: designed experiments, where we can influence the experiment and hence remove some correlations, and observational studies, where many hidden factors can exist.
6.1 Types of Data
• Population : the entire set of units we might study.
• Sample : a subset of the population.
• Statistical variable : a quantitative or qualitative characteristic of a unit in the population.
• Modes : “bumps” the data exhibits.
6.2 Graphical Study of Variables
6.2.1 Kernel Density Estimate
Let K be a kernel, i.e. a density that is symmetric about 0 and has variance 1, and let y1, ..., yn be a sample of data drawn from some distribution with probability density f. Then the kernel density estimator (KDE) of f, for h > 0, is

f̂h(y) = (1/nh) Σ_{j=1}^{n} K( (y − yj)/h ),  y ∈ R
which gives a nonparametric estimator of the density underlying the sample, depending on the kernel K and, most importantly, on the bandwidth h.

The effect is to replace each sample point with a small normal-like distribution and to add these up, generating a new estimated distribution. h determines the width of each little distribution around a sample point, and hence how smooth the final sum is.
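A minimal sketch of a Gaussian KDE, implementing the formula above directly (hypothetical sample and bandwidth, Python standard library only); we check that the estimate is a genuine density by integrating it numerically over a wide grid.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: symmetric about 0, variance 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(y, sample, h):
    """f_hat_h(y) = (1/nh) * sum_j K((y - y_j) / h)."""
    n = len(sample)
    return sum(gaussian_kernel((y - yj) / h) for yj in sample) / (n * h)

sample = [1.2, 1.9, 2.1, 2.4, 3.3]
h = 0.5

# Riemann sum over a grid wide enough to capture essentially all the mass.
grid = [i * 0.01 for i in range(-500, 1000)]
area = sum(kde(y, sample, h) for y in grid) * 0.01

assert abs(area - 1.0) < 1e-3                       # integrates to ~1
assert kde(2.0, sample, h) > kde(10.0, sample, h)   # more mass near the data
```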
6.3 Numerical Summaries
6.3.1 Breakdown point
We say, for example, that the median has asymptotic breakdown point 50%, because the median would only move an arbitrarily large amount if 50% of the observations were corrupted. The average has breakdown point 0%, since a single bad value can move it arbitrarily far.
6.3.2 Quartiles and Sample Quantiles
We define the p quantile

q(p) = x(⌈np⌉)

where 0 < p ≤ 1. Sometimes p is given in percent, in which case it must be divided by 100. Hence the quartiles are q(0.25) and q(0.75).
6.3.3 Variability, Dispersion measures
Sample standard deviation
s = ( (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² )^{1/2} = ( (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² ) )^{1/2}
The sample variance is then s2. Both have breakdown point 0%.
Range
x(n) − x(1)
The range has breakdown point 0%.
6.3.4 Interquartile range
IQR(x) = q(0.75)− q(0.25)
which has breakdown point 25%.
6.3.5 Sample correlation
The sample correlation rxy is defined in exactly the same way as the correlation, and has the same properties.
6.4 Boxplot
6.4.1 Five-number summary
The five-number summary is the list of the following five values
min = x(1),  q(0.25),  median = q(0.5),  q(0.75),  max = x(n)

which are used for drawing the boxplot.
6.4.2 Boxplot calculations
The additional calculation required for the boxplot serves to flag outliers:

C = 1.5 × IQR(x)

which determines the maximal length of the whiskers on each side of the box.
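The five-number summary and the whisker rule can be sketched as follows (hypothetical data containing one obvious outlier; q(p) = x(⌈np⌉) as defined above):

```python
import math

def quantile(xs, p):
    """q(p) = x_(ceil(n p)), with 1-based order statistics."""
    s = sorted(xs)
    r = math.ceil(len(s) * p)
    return s[r - 1]

def five_number_summary(xs):
    return (min(xs), quantile(xs, 0.25), quantile(xs, 0.5),
            quantile(xs, 0.75), max(xs))

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 8, 40]   # 40 is an outlier
lo, q1, med, q3, hi = five_number_summary(data)
iqr = q3 - q1
c = 1.5 * iqr          # whiskers extend at most this far beyond the box

assert (lo, hi) == (1, 40)
assert q1 <= med <= q3
assert hi > q3 + c     # the outlier lies beyond the upper whisker
```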
6.5 Choice of a model
6.5.1 Normal Q-Q Plot
To verify that data follow a normal distribution, we use normal Q-Q plots, i.e. a graph of the ordered sample values against the normal plotting positions. A graph close to a straight line suggests that the observations can be fitted by a normal model, and the slope and the intercept at x = 0 give estimates of σ and µ respectively.
6.6 Statistical Inference

Having observed an event A, we want to say something about the underlying probability space (Ω, F, P).
6.6.1 Statistical model
Several problems must be addressed when trying to deduce information about a probability spacefrom events :
• specification of a model for the data
• estimation of the unknowns of the model (parameters, ...)
• tests of hypotheses concerning a model
• planning of the data collection and analysis to minimize uncertainty.
6.6.2 Definitions
• A statistical model is a probability distribution f(y) chosen to learn from observed data y or from potential data Y. If f(y) = f(y; θ), then f is a parametric model.
• A statistic T = t(Y ) is a known function of the data Y .
• The sampling distribution of a statistic T = t(Y ) is its distribution when Y ∼ f(y).
• A random sample is a set of independent and identically distributed random variablesY1, ...Yn or their realisations y1, ...yn.
6.7 Point Estimation
6.7.1 Estimator
An estimator is a statistic θ̂ used to estimate a parameter θ of f.
6.7.2 Estimation methods
There are many methods for estimating the parameters of a model. The choice depends on ease ofcalculation, efficiency (precision), and robustness. Common methods are
• method of moments, simple but potentially inefficient
• maximum likelihood estimation, general and often optimal
• M-estimation, even more general, mostly robust but less efficient.
6.7.3 Method of moments
The method of moments estimate of a parameter θ is the value θ̂ that matches the theoretical and the empirical moments. For a model with p unknown parameters, this gives

E(Y^r) = ∫ y^r f(y; θ) dy = (1/n) Σ_{j=1}^{n} yj^r,  r = 1, ..., p

meaning that we need as many finite moments of the underlying model as there are unknown parameters.
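A one-parameter sketch (hypothetical data, fixed seed): for the exponential distribution with rate λ we have E(Y) = 1/λ, so matching the first moment gives λ̂ = 1/ȳ.

```python
import random

rng = random.Random(11)
lam = 2.0
ys = [rng.expovariate(lam) for _ in range(50000)]

# Method of moments: E(Y) = 1/lambda, matched against the sample mean.
ybar = sum(ys) / len(ys)
lam_hat = 1.0 / ybar

assert abs(lam_hat - lam) < 0.05   # close to the true rate for this seed
```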
6.7.4 Maximum likelihood estimation
If y1, ..., yn is a random sample from the density f(y; θ), then the likelihood for θ is
L(θ) = f(y1, ..., yn; θ) = f(y1; θ) · f(y2; θ) · ... · f(yn; θ)
The maximum likelihood estimate (MLE) θ̂ of a parameter θ is the value that gives the highest likelihood, i.e.

L(θ̂) = max_θ L(θ)
Calculation of the MLE
We sometimes simplify the calculations by maximising l(θ) = logL(θ) rather than L(θ).
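A sketch for Bernoulli trials (hypothetical counts s = 7 successes in n = 20 trials): the log likelihood is l(θ) = s log θ + (n − s) log(1 − θ), maximised in closed form at θ̂ = s/n, which we confirm by a crude grid search.

```python
import math

s, n = 7, 20

def loglik(theta):
    """Bernoulli log likelihood l(theta) for s successes in n trials."""
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

# Grid search over (0, 1) versus the closed-form MLE s/n.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=loglik)

assert abs(theta_grid - s / n) < 0.002   # grid maximiser ≈ 0.35 = s/n
assert loglik(s / n) >= loglik(0.2)      # s/n beats any other value
```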
6.7.5 M-estimation
This is a generalisation of the maximum likelihood estimation. We maximise a function of the form
ρ(θ; Y) = Σ_{j=1}^{n} ρ(θ; Yj)
where ρ(θ; y) is, if possible, concave as a function of θ for all y. We choose ρ to obtain estimatorswith suitable properties such as small variance or robustness to outliers.
Note that for ρ(θ; y) = log f(y; θ) we recover the maximum likelihood estimator.
6.7.6 Bias
The bias of an estimator θ̂ of a parameter θ is

b(θ̂) = E(θ̂) − θ

If b(θ̂) < 0 for all θ, then θ̂ tends to underestimate θ, and vice versa. If b(θ̂) = 0 for all θ, then θ̂ is said to be unbiased.
6.7.7 Mean Square Error
The mean square error (MSE) of the estimator θ̂ of θ is

MSE(θ̂) = E[ (θ̂ − θ)² ] = var(θ̂) + b(θ̂)²

This is the average squared distance between θ̂ and θ.

Let θ̂1 and θ̂2 be two unbiased estimators of the same parameter θ. Then if

MSE(θ̂1) = var(θ̂1) ≤ var(θ̂2) = MSE(θ̂2)

we say that θ̂1 is more efficient than θ̂2.
6.7.8 Delta method
Let θ̂ be an estimator based on a sample of size n, such that

θ̂ ∼ N(θ, v/n) as n → ∞

and let g be a smooth function such that g′(θ) ≠ 0. Then

g(θ̂) ∼ N( g(θ) + v g″(θ)/(2n), v g′(θ)²/n ) as n → ∞

This implies that the mean square error of the estimator g(θ̂) of g(θ) is

MSE( g(θ̂) ) = ( v g″(θ)/(2n) )² + v g′(θ)²/n

which implies that, for large n, we can neglect the bias.
6.8 Interval Estimation
6.8.1 Pivots
We want to give an interval that contains the unknown parameter with a given probability. This interval widens when the size of the sample decreases.

Let Y = (Y1, ..., Yn) be sampled from a distribution F with parameter θ. Then a pivot is a function Q = q(Y, θ) of the data and the parameter θ whose distribution FQ is known and does not depend on θ.
6.8.2 Confidence intervals
Let Y = (Y1, ..., Yn) be data from a parametric statistical model with scalar parameter θ. A confidence interval (CI) (L, U) for θ, with lower and upper confidence bounds L and U, is a random interval that contains θ with a specified probability called the confidence level.
If we write P (θ < L) = αL and P (U < θ) = αU , then (L,U) has confidence level
P (L ≤ θ ≤ U) = 1− αL − αU
If αL = αU , we say that we have an equi-tailed (1− αL − αU ) · 100% confidence interval.
6.8.3 Construction of a CI
• We find a pivot Q = q(Y, θ) involving θ.
• We obtain the quantiles q_{αU} and q_{1−αL} of Q.
• We transform the statement

P( q_{αU} ≤ q(Y, θ) ≤ q_{1−αL} ) = 1 − αL − αU

into

P( L ≤ θ ≤ U ) = 1 − αL − αU

where L and U depend on Y, q_{αU} and q_{1−αL}, but not on θ.
6.8.4 One and two-sided intervals
As opposed to a two-sided interval (L, U), we can use one-sided confidence intervals of the form (−∞, U) or (L, ∞), obtained by taking αL = 0 or αU = 0 respectively.
6.8.5 Standard Errors
Let T = t(Y1, ..., Yn) be an estimator of θ, let τn² = var(T) be its variance, and let V = v(Y1, ..., Yn) be an estimator of τn². We call V^{1/2}, or its realisation v^{1/2}, a standard error for T.
Theorem
Let T be an estimator of θ based on a sample of size n, with

(T − θ)/τn →D Z and V/τn² →P 1 as n → ∞

where Z ∼ N(0, 1). Then

(T − θ)/V^{1/2} = ( (T − θ)/τn ) × ( τn/V^{1/2} ) →D Z as n → ∞

which implies that, when basing a confidence interval on the Central Limit Theorem, we can replace τn with V^{1/2}.
6.8.6 Normal Random Sample
If Y1, ..., Yn ∼iid N(µ, σ²), then

Ȳ ∼ N(µ, σ²/n) and (n − 1)S² = Σ_{j=1}^{n} (Yj − Ȳ)² ∼ σ²χ²_{n−1}

are independent, where χ²_v denotes the chi-square distribution with v degrees of freedom.

The first result implies that if σ² is known, then

Z = (Ȳ − µ)/√(σ²/n) ∼ N(0, 1)

is a pivot that provides an exact (1 − αL − αU) confidence interval for µ of the form

(L, U) = ( Ȳ − (σ/√n) z_{1−αL}, Ȳ − (σ/√n) z_{αU} )

where zp is the p quantile of the standard normal distribution, i.e. zp = Φ⁻¹(p).
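A sketch of the known-σ interval (hypothetical simulated data, fixed seed; `NormalDist.inv_cdf` gives zp):

```python
from statistics import NormalDist
import random

# Simulated sample with KNOWN sigma (hypothetical numbers).
rng = random.Random(3)
mu, sigma, n = 10.0, 2.0, 100
ys = [rng.gauss(mu, sigma) for _ in range(n)]
ybar = sum(ys) / n

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{0.975} ≈ 1.96
half = sigma / n ** 0.5 * z
L, U = ybar - half, ybar + half            # equi-tailed 95% CI for mu

# The width is fixed at 2 z sigma / sqrt(n); over repeated samples the
# interval contains the true mu with probability 0.95.
assert abs(z - 1.959964) < 1e-4
assert abs((U - L) - 2 * z * sigma / n ** 0.5) < 1e-12
```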
6.8.7 Unknown variance
Usually, σ² is unknown. In this case,

(Ȳ − µ)/√(S²/n) ∼ t_{n−1} and (n − 1)S²/σ² ∼ χ²_{n−1}

are pivots that provide confidence intervals for µ and σ² respectively, of the form

(L, U) = ( Ȳ − (S/√n) t_{n−1}(1 − αL), Ȳ − (S/√n) t_{n−1}(αU) )

(L, U) = ( (n − 1)S²/χ²_{n−1}(1 − αL), (n − 1)S²/χ²_{n−1}(αU) )

where

• t_v(p) is the p quantile of the Student t distribution with v degrees of freedom
• χ²_v(p) is the p quantile of the chi-square distribution with v degrees of freedom.

Note : for symmetric distributions like the Student t (and unlike the chi-square), the quantiles satisfy zp = −z_{1−p}, so the equi-tailed (1 − α) · 100% confidence intervals have the form Ȳ ± (σ/√n) z_{1−α/2}.
6.8.8 Confidence intervals and tests
We can use CIs to assess the plausibility of a value θ0 of θ :

• If θ0 lies inside a (1 − α) · 100% CI, then we cannot reject the hypothesis that θ = θ0 at significance level α.
• If θ0 lies outside a (1 − α) · 100% CI, then we reject the hypothesis that θ = θ0 at significance level α.

Hence the smaller α is when we do reject, the stronger the evidence against θ0.
6.8.9 Null and Alternative Hypotheses
In general, we use data to decide between two hypotheses :

• The null hypothesis H0, which represents the theory or model we want to test (for coin tosses, H0 is that the coin is fair).
• The alternative hypothesis H1, which represents what happens if H0 is false (the coin is not fair).

There are hence two types of errors when we decide between these two :

• False positive : H0 is true but we reject it.
• False negative : H0 is false but we accept it.
Simple and composite hypotheses
A simple hypothesis entirely fixes the distribution of the data Y, whereas a composite hypothesis does not fix the distribution of Y.
ROC curve
The receiver operating characteristic (ROC) curve of a test plots β(t) = P1(T > t) against α(t) = P0(T > t) as the cut-off value t varies, i.e. it shows (P0(T > t), P1(T > t)) for all t ∈ R.
6.8.10 Size and power
As the difference in µ between the two hypotheses increases, it becomes easier to detect when H0 is false.

Let P0(·) and P1(·) be the probabilities computed under the null and alternative hypotheses H0 and H1 respectively. The size and power of a statistical test of H0 against H1 are

size : α = P0(reject H0),  power : β = P1(reject H0)
6.8.11 Pearson statistic
Let O1, ..., Ok be the numbers of observations of a random sample of size n = n1 + ... + nk falling into the categories 1, ..., k, whose expected numbers are E1, ..., Ek with Ei > 0. Then the Pearson statistic, or chi-square statistic, is

T = Σ_{i=1}^{k} (Oi − Ei)²/Ei
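A sketch with hypothetical die-roll counts: under the fair-die null hypothesis the expected counts are Ei = n/6, and T is computed directly from the formula.

```python
# Fair-die example (hypothetical counts): expected E_i = n/6 under H0.
observed = [22, 17, 18, 13, 19, 11]
n = sum(observed)                        # 100
expected = [n / 6.0] * 6

T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# T = 732/150 = 4.88 exactly for these counts; comparing with the
# chi-square quantile chi2_{5}(0.95) ≈ 11.07 (k - 1 = 5 degrees of
# freedom), this sample gives no strong evidence against fairness.
assert abs(T - 4.88) < 1e-9
assert T < 11.07
```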
6.8.12 Chi-square distribution with v degrees of freedom
Let Z1, ..., Zv ∼iid N(0, 1); then W = Z1² + ... + Zv² follows the chi-square distribution with v degrees of freedom, whose density function is

fW(w) = ( 1/(2^{v/2} Γ(v/2)) ) w^{v/2−1} e^{−w/2},  w > 0, v = 1, 2, ...

where Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du for a > 0.
• If Oi ≈ Ei for all i, then T will be small.
• If the joint distribution of O1, ..., Ok is multinomial with denominator n and probabilities pi = Ei/n, then each Oi ∼ B(n, pi) and

E(Oi) = npi = Ei,  var(Oi) = npi(1 − pi) = Ei(1 − Ei/n) ≈ Ei

thus Zi = (Oi − Ei)/√Ei is approximately N(0, 1) for large n, and

T = Σ_{i=1}^{k} (Oi − Ei)²/Ei = Σ_{i=1}^{k} Zi² ∼ χ²_{k−1}

approximately; the Zi are not independent, since O1 + ... + Ok = n, and this constraint costs one degree of freedom.
6.8.13 Evidence and P-values
A statistical hypothesis test has the following elements :

• A null hypothesis H0 to be tested against an alternative hypothesis H1.
• Data from which we compute a test statistic T, chosen such that large values of T provide evidence against H0.
• The observed value tobs of T, which we compare with the null distribution of T, i.e. its distribution under H0.
• The P-value

pobs = P0(T ≥ tobs)

where a small pobs suggests that H0 is false (or that something unlikely has occurred).
• If pobs < α, we say that the test is significant at level α.
• We reject H0 if pobs < α, and accept it otherwise.
Chapter 7
Likelihood
Basic Idea For a value of θ which is not very credible, the density of the data will be smaller : the higher the density, the more credible the corresponding θ. In other words, if y1, ..., yn are the results of independent trials, we have

f(y1, ..., yn; θ) = Π_{j=1}^{n} f(yj; θ)

which, seen as a function of θ, we call the likelihood L(θ).
7.0.1 Relative likelihood
To compare values of θ, we only need to consider the ratio between them :

L(θ1)/L(θ2) = c

which means that θ1 is c times more plausible than θ2.

The most plausible value θ̂ is called the maximum likelihood estimate and satisfies

L(θ̂) ≥ L(θ) for all θ

We can equivalently maximise the log likelihood

l(θ) = log L(θ)

The relative likelihood RL(θ) = L(θ)/L(θ̂) gives the plausibility of θ with respect to θ̂.
7.1 Scalar Parameter
7.1.1 Information
The observed information J(θ) and the expected information, or Fisher information, I(θ) are

J(θ) = −d²l(θ)/dθ²,  I(θ) = E[ J(θ) ]

They measure the curvature of −l(θ); the larger they are, the more concentrated the likelihood is.
7.1.2 Limit distribution of the MLE
Let Y1, ..., Yn be a random sample from a parametric density f(y; θ) and let θ̂ be the MLE of θ. If f satisfies regularity conditions, then

J(θ̂)^{1/2} (θ̂ − θ) →D N(0, 1) as n → ∞

thus for large n, approximately,

θ̂ ∼ N( θ, J(θ̂)⁻¹ )

We can therefore use this to compute two-sided equi-tailed CIs for θ :

(L, U) = ( θ̂ − J(θ̂)^{−1/2} z_{1−α/2}, θ̂ + J(θ̂)^{−1/2} z_{1−α/2} )

One can show that for large n and a regular model, no estimator has a smaller variance and a narrower CI than such a θ̂.
7.1.3 Likelihood ratio statistic
Sometimes it is unreasonable to use a CI based on the normal limit distribution of θ̂. In this case we use l(θ).

The likelihood ratio statistic is

W(θ) = 2( l(θ̂) − l(θ) )

In addition, if θ0 is the value of θ that generated the data and θ̂ has a normal limit distribution, then

W(θ0) →D χ²₁ as n → ∞

or in other words, W(θ0) ∼ χ²₁ approximately for large n.
7.1.4 Regularity
We talked about regularity conditions, which are quite complicated. Situations where they are falseare often cases where
• one of the parameters is discrete;
• the support of f(y; θ) depends on θ;
• the true θ is on the boundary of its possible values.

In the majority of cases, though, they are satisfied.
7.1.5 Vector Parameter
If θ is a vector of dimension p, then the above definitions hold with some slight changes :

• the MLE θ̂ often satisfies the vector equation dl(θ)/dθ = 0_{p×1}
• J(θ) and I(θ) are p × p matrices
• in regular cases, approximately, θ̂ ∼ Np( θ, J(θ̂)⁻¹ )
7.1.6 Nested models
In some cases where we have multiple parameters, we want to test a model in which one parameter takes a specified value while the other parameters are not restricted. For instance, in the normal model we might want to compare a general model against a simple model, respectively

θ = (µ, σ²) ∈ R × R₊
θ = (µ, σ²) ∈ {µ0} × R₊

In such a situation, where one model can become the other when some parameters are restricted, we say that the simpler model is nested in the general model.
7.1.7 Likelihood ratio statistic
Take two nested models with corresponding MLEs

θ̂ = (φ̂, λ̂) and θ̂0 = (φ0, λ̂0)

where l(θ̂) ≥ l(θ̂0), and write the likelihood ratio statistic

W(φ0) = 2( l(θ̂) − l(θ̂0) )

Then if the simpler model is true, i.e. φ = φ0, we have

W(φ0) →D χ²_q as n → ∞

where q is the dimension of φ.
7.1.8 Simple linear regression
Let Y be a random variable, the response variable, that depends on a variable x, the explanatory variable. A simple model that describes linear dependence of E(Y) on x is
Y ∼ N(β0 + β1x, σ2)
where β0, β1 and σ2 are the unknown parameters.
Chapter 8
BAYESIAN INFERENCE
8.1 Bayesian Inference
Up to now we have supposed that all the information about θ comes from the data y. But if we have prior knowledge about θ in the form of a prior density π(θ), we can use Bayes' Theorem to compute the posterior density for θ conditional on y:

π(θ | y) = f(y | θ) π(θ) / f(y)

The difference from the previous chapters is that the observed data y are fixed and θ is regarded as a random variable.

In order to do this, we need prior information π(θ), which may be based on data separate from y, or on an objective or subjective notion of what we believe about θ.
8.1.1 Application of Bayes’ Theorem
We suppose that θ has density π(θ) and that Y conditional on θ has density f(y|θ). Then the conditional density of θ given Y = y is

π(θ | y) = f(y | θ) π(θ) / f(y)

where we know how to compute f(y) from f(y|θ) and π(θ).

We can use Bayes' Theorem to update the prior density for θ to a posterior density for θ.
We can use Bayes’ Theorem to update the prior density for θ to a posterior density for θ.
8.1.2 Beta(a,b) density
The Beta(a, b) density for θ ∈ (0, 1) has the form

π(θ) = θ^{a−1} (1 − θ)^{b−1} / B(a, b),  0 < θ < 1, a, b > 0

where a and b are parameters, B(a, b) = Γ(a)Γ(b)/Γ(a + b) is the beta function and Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du for a > 0. Note that for a = b = 1 this gives the U(0, 1) distribution.
If θ ∼ Beta(a, b), then

E(θ) = a/(a + b),  var(θ) = ab/( (a + b + 1)(a + b)² )
8.1.3 Properties of π(θ|y)

We can of course compute the posterior expectation and the posterior variance of this density, respectively E(θ|y) and var(θ|y). We can also compute the Maximum A Posteriori (MAP) estimator θ̂ such that for all θ,

π(θ̂ | y) ≥ π(θ | y)
8.1.4 Point estimation and loss functions
The choice of estimate when constructing an estimator based on data y is important, and to make the best decision we might want to minimise the expected loss from a bad decision. If Y ∼ f(y; θ), then the loss function R(y; θ) is a non-negative function of Y and θ. The expected posterior loss is

E( R(y; θ) | y ) = ∫ R(y; θ) π(θ | y) dθ
8.1.5 Interval estimation and credibility intervals
The Bayesian analogue of the (1 − α) × 100% CI for θ is the (1 − α) credibility interval for θ, obtained using the α/2 and 1 − α/2 quantiles of π(θ | y).
8.1.6 Conjugate densities
Particular combinations of data and prior densities give posterior densities of the same form as the prior densities. For example,

θ ∼ Beta(a, b), with data s successes out of n trials, gives θ | y ∼ Beta(a + s, b + n − s)

where the data s ∼ B(n, θ) correspond to s successes out of n independent trials with success probability θ.

We say that the beta density is the conjugate prior density of the binomial distribution : if the likelihood is proportional to θ^s (1 − θ)^{n−s}, then choosing a beta prior for θ ensures that the posterior density of θ is also beta, with updated parameters.
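The conjugate update is a one-line computation. A sketch with hypothetical data (7 heads in 10 tosses, uniform Beta(1, 1) prior):

```python
# Beta(a, b) prior + s successes in n Bernoulli(theta) trials
# => Beta(a + s, b + n - s) posterior.
def posterior(a, b, s, n):
    return a + s, b + n - s

def beta_mean(a, b):
    return a / (a + b)

a, b = 1, 1            # uniform prior, U(0, 1)
s, n = 7, 10           # observed 7 heads in 10 tosses
a_post, b_post = posterior(a, b, s, n)

assert (a_post, b_post) == (8, 4)
assert abs(beta_mean(a_post, b_post) - 8 / 12) < 1e-12
# The posterior mean (2/3) sits between the prior mean (1/2)
# and the MLE s/n = 0.7.
assert beta_mean(a, b) < beta_mean(a_post, b_post) < s / n
```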
8.1.7 Prediction of a future random variable Z

“Will the next result be tails or heads?” Use Bayes' Theorem to calculate the posterior density of Z given Y = y, P(Z = z | Y = y).