Probabilities and Statistics
Transcript of Probabilities and Statistics
Contents
1 COMBINATORIAL ANALYSIS
  1.1 The basic principle of counting
  1.2 Permutations
  1.3 Combinations
    1.3.1 The Binomial Theorem
    1.3.2 Other properties of binomial coefficients
    1.3.3 Multinomial coefficients
    1.3.4 Partitioning

2 AXIOMS OF PROBABILITY
  2.1 Terminology
  2.2 Basic Properties
    2.2.1 Inclusion-Exclusion Formulae
  2.3 Conditional Probability
    2.3.1 Bayes' Theorem
    2.3.2 Prediction Decomposition
  2.4 Independence

3 RANDOM VARIABLES
  3.1 Definition
  3.2 Bernoulli Random Variables
  3.3 Probability Mass Function
    3.3.1 Binomial Random Variable
    3.3.2 Geometric Distribution
    3.3.3 Negative Binomial Distribution
    3.3.4 Hypergeometric Distribution
    3.3.5 Discrete Uniform Distribution
    3.3.6 Poisson Distribution
  3.4 Cumulative distribution function
    3.4.1 Properties
  3.5 Transformations of discrete random variables
  3.6 Expectation
    3.6.1 Properties
    3.6.2 Expected value of a function
    3.6.3 Moments of a distribution
    3.6.4 Properties of variance
  3.7 Conditional Probability Distributions
    3.7.1 Conditional probability mass function
    3.7.2 Conditional expected value
    3.7.3 Law of small numbers

4 CONTINUOUS RANDOM VARIABLES
  4.1 Definition: Probability density function
  4.2 Basic distributions
    4.2.1 Uniform distribution
    4.2.2 Exponential distribution
    4.2.3 Gamma distribution
    4.2.4 Laplace distribution
    4.2.5 Pareto distribution
  4.3 Expectation
  4.4 Conditional Densities
  4.5 Quantiles
  4.6 Transformations
  4.7 Normal Distribution
    4.7.1 Properties
    4.7.2 Moivre-Laplace: Normal approximation of the binomial distribution
  4.8 Q-Q Plots
  4.9 Densities recap

5 SEVERAL RANDOM VARIABLES
  5.1 Discrete Random Variables
  5.2 Continuous Random Variables
  5.3 Exponential Families
  5.4 Marginal and conditional distribution
  5.5 Multivariate Random Variables
    5.5.1 Multinomial Distribution
  5.6 Independence
    5.6.1 Independent and identically distributed variables
  5.7 Joint Moments and Covariance
    5.7.1 Properties of covariance
    5.7.2 Independence and covariance
  5.8 Linear combinations of random variables
  5.9 Correlation
    5.9.1 Properties of correlation
  5.10 Conditional Expectation
    5.10.1 Expectation and Conditioning
  5.11 Generating Functions
    5.11.1 Properties of Moment-Generating Functions
    5.11.2 Linear combinations
    5.11.3 Continuity
    5.11.4 Mean vector and covariance matrix
    5.11.5 Moment-generating function: Multivariate case
    5.11.6 Characteristic function
    5.11.7 Cumulant-generating function
    5.11.8 Cumulants of sums of random variables
    5.11.9 Multivariate cumulant-generating function
  5.12 Multivariate Normal Distribution
    5.12.1 Lemma: Properties of normal variables
    5.12.2 Lemma: Normal density function
    5.12.3 Marginal and conditional distributions
  5.13 Transformation of joint continuous densities
  5.14 Order Statistics
    5.14.1 Theorem
  5.15 Approximation and Convergence
    5.15.1 Inequalities
    5.15.2 Convergence
    5.15.3 Laws of Large Numbers
    5.15.4 Central Limit Theorem (CLT)
    5.15.5 Delta Method
    5.15.6 Sample quantiles

6 EXPLORATORY STATISTICS
  6.1 Types of Data
  6.2 Graphical Study of Variables
    6.2.1 Kernel Density Estimate
  6.3 Numerical Summaries
    6.3.1 Breakdown point
    6.3.2 Quartiles and Sample Quantiles
    6.3.3 Variability and dispersion measures
    6.3.4 Interquartile range
    6.3.5 Sample correlation
  6.4 Boxplot
    6.4.1 Five-number summary
    6.4.2 Boxplot calculations
  6.5 Choice of a model
    6.5.1 Normal Q-Q Plot
  6.6 Statistical Inference
    6.6.1 Statistical model
    6.6.2 Definitions
  6.7 Point Estimation
    6.7.1 Estimator
    6.7.2 Estimation methods
    6.7.3 Method of moments
    6.7.4 Maximum likelihood estimation
    6.7.5 M-estimation
    6.7.6 Bias
    6.7.7 Mean Square Error
    6.7.8 Delta method
  6.8 Interval Estimation
    6.8.1 Pivots
    6.8.2 Confidence intervals
    6.8.3 Construction of a CI
    6.8.4 One- and two-sided intervals
    6.8.5 Standard Errors
    6.8.6 Normal Random Sample
    6.8.7 Unknown variance
    6.8.8 Confidence intervals and tests
    6.8.9 Null and Alternative Hypotheses
    6.8.10 Size and power
    6.8.11 Pearson statistic
    6.8.12 Chi-square distribution with v degrees of freedom
    6.8.13 Evidence and P-values

7 Likelihood
    7.0.1 Relative likelihood
  7.1 Scalar Parameter
    7.1.1 Information
    7.1.2 Limit distribution of the MLE
    7.1.3 Likelihood ratio statistic
    7.1.4 Regularity
    7.1.5 Vector Parameter
    7.1.6 Nested models
    7.1.7 Likelihood ratio statistic
    7.1.8 Simple linear regression

8 BAYESIAN INFERENCE
  8.1 Bayesian Inference
    8.1.1 Application of Bayes' Theorem
    8.1.2 Beta(a, b) density
    8.1.3 Properties of π(θ|y)
    8.1.4 Point estimation and loss functions
    8.1.5 Interval estimation and credibility intervals
    8.1.6 Conjugate densities
    8.1.7 Prediction of a future random variable Z
    8.1.8 Bayesian approach
Chapter 1
COMBINATORIAL ANALYSIS
1.1 The basic principle of counting
If experiment 1 can result in m outcomes and, for each such outcome, experiment 2 has n possible outcomes, then the total number of outcomes is mn. Note that this can be extended to r consecutive experiments.
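The principle can be checked by direct enumeration; the outcome sets below are hypothetical placeholders, a minimal sketch rather than anything from the course.

```python
from itertools import product

# Hypothetical outcome sets: experiment 1 has m = 3 outcomes,
# experiment 2 has n = 4 outcomes.
experiment_1 = ["a", "b", "c"]
experiment_2 = [1, 2, 3, 4]

# Enumerating every pair of outcomes confirms the count m * n = 12.
pairs = list(product(experiment_1, experiment_2))
total = len(pairs)
```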
1.2 Permutations
A permutation of a set of objects is an arrangement of these objects. If there are n distinct objects, then the number of permutations is n!, as there are n objects to choose from for the first position, (n − 1) for the second, and so on.
1.3 Combinations
A combination of r objects among n is an unordered subset of r objects taken from the n original ones. The number of such combinations is

\binom{n}{r} = \frac{n!}{(n-r)!\, r!}
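Both counting formulas are available in Python's standard library; the sizes n and r below are arbitrary illustrative values.

```python
from math import comb, factorial

n, r = 5, 2  # arbitrary example sizes

# Permutations of n distinct objects: n!
n_permutations = factorial(n)  # 5! = 120

# Combinations of r objects among n: n! / ((n - r)! r!)
n_combinations = comb(n, r)
by_formula = factorial(n) // (factorial(n - r) * factorial(r))
```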
1.3.1 The Binomial Theorem
(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}
Proof

Expanding the product gives 2^n terms, each a product of n factors. The number of these terms containing k x's and (n − k) y's is the number of ways of choosing which k of the n factors contribute an x, i.e. \binom{n}{k}.
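The theorem can be verified exactly with integer arithmetic; x, y and n below are arbitrary example values, not from the course.

```python
from math import comb

# Arbitrary integer values, so both sides can be compared exactly.
x, y, n = 3, 5, 7

lhs = (x + y) ** n
rhs = sum(comb(n, k) * x**k * y ** (n - k) for k in range(n + 1))
```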
1.3.2 Other properties of binomial coefficients

\binom{n}{r} = \binom{n}{n-r}

\binom{n+1}{r} = \binom{n}{r-1} + \binom{n}{r}

\sum_{j=0}^{r} \binom{m}{j} \binom{n}{r-j} = \binom{m+n}{r}
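A quick numerical sketch of the three identities (symmetry, Pascal's rule, and the last one, known as Vandermonde's identity); n, r and m are arbitrary example values.

```python
from math import comb

n, r, m = 10, 4, 6  # arbitrary example values

symmetry = comb(n, r) == comb(n, n - r)
pascal = comb(n + 1, r) == comb(n, r - 1) + comb(n, r)
vandermonde = comb(m + n, r) == sum(
    comb(m, j) * comb(n, r - j) for j in range(r + 1)
)
```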
1.3.3 Multinomial coefficients

We consider the following problem: a set of n distinct items is to be divided into r groups of respective sizes n_1, ..., n_r. The total number of such divisions is given by

\binom{n}{n_1, n_2, ..., n_r} = \frac{n!}{n_1!\, n_2! \cdots n_r!}
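The multinomial coefficient follows directly from factorials; the helper below and its group sizes are illustrative, assuming the sizes sum to n.

```python
from math import factorial

def multinomial(n, sizes):
    """n! / (n1! n2! ... nr!) for group sizes summing to n."""
    assert sum(sizes) == n
    result = factorial(n)
    for s in sizes:
        result //= factorial(s)
    return result

# Dividing 10 distinct items into groups of sizes 5, 3 and 2.
ways = multinomial(10, [5, 3, 2])
```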
1.3.4 Partitioning
We have a set of n identical items that we want to distribute into r groups. This is equivalent to finding the number of vectors (n_1, n_2, ..., n_r) such that n_1 + n_2 + ... + n_r = n. An elegant way of solving the problem is to align the n items in a row and to place r − 1 separators between the items. The solution is the number of ways the separator positions can be chosen: there are n − 1 possible gaps (as no group can be empty) and r − 1 separators to place. Hence the solution is simply

\binom{n-1}{r-1}

If we allow empty groups, then there are a total of n + r − 1 objects, of which r − 1 must be separators and the rest items. The solution is then the number of ways to select r − 1 objects out of n + r − 1, in other words:

\binom{n+r-1}{r-1}
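Both stars-and-bars counts can be confirmed by brute-force enumeration for small n and r; the function and the values below are illustrative.

```python
from itertools import product
from math import comb

def count_splits(n, r, allow_empty):
    """Brute-force count of vectors (n1, ..., nr) with n1 + ... + nr = n."""
    lo = 0 if allow_empty else 1
    return sum(
        1 for v in product(range(lo, n + 1), repeat=r) if sum(v) == n
    )

n, r = 7, 3  # small enough for brute force
no_empty = count_splits(n, r, allow_empty=False)   # should equal C(n-1, r-1)
with_empty = count_splits(n, r, allow_empty=True)  # should equal C(n+r-1, r-1)
```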
Chapter 2
AXIOMS OF PROBABILITY
2.1 Terminology
A random experiment is modelled by a probability space (Ω, F, P )
• Sample space (Universe) : Set of all possible outcomes of an experiment, denoted Ω.
• Event Space : A collection of subsets of Ω, denoted F . These subsets are called events.
• Probability Distribution : P : F → [0, 1] which associates to each event A in F a probabilityP (A).
The Event Space F must verify the following properties :
• F is nonempty
• If A ∈ F then Ac ∈ F
• The union of any countable collection of elements of F is in F .
Remark
The largest possible event space of Ω is its power set.
2.2 Basic Properties
All following properties can be mirrored by exchanging unions with intersections.
• Commutative laws : E ∪ F = F ∪ E
• Associative laws : (E ∪ F ) ∪G = E ∪ (F ∪G)
• Distributive laws : (E ∪ F ) ∩G = (E ∩G) ∪ (F ∩G)
• De Morgan's laws: \left( \bigcup_{i=1}^{n} E_i \right)^c = \bigcap_{i=1}^{n} E_i^c
• Countable additivity: P\left( \bigcup_{i=1}^{\infty} E_i \right) = \sum_{i=1}^{\infty} P(E_i) for any mutually exclusive events E_i.
• If E ⊂ F then P (E) ≤ P (F )
• P (E ∪ F ) = P (E) + P (F )− P (E ∩ F )
2.2.1 Inclusion-Exclusion Formulae
If A1, ..., An are events of (Ω, F, P ), then :
P(A_1 ∪ A_2 ∪ A_3) = P(A_1) + P(A_2) + P(A_3) − P(A_1 ∩ A_2) − P(A_1 ∩ A_3) − P(A_2 ∩ A_3) + P(A_1 ∩ A_2 ∩ A_3)

Note that a term enters with a '+' sign when the number of intersected sets is odd. In general,

P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{1 \le i_1 < ... < i_r \le n} P(A_{i_1} ∩ ... ∩ A_{i_r})
The number of terms in the general formula is

\binom{n}{1} + \binom{n}{2} + ... + \binom{n}{n} = 2^n − 1
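Inclusion-exclusion can be checked on finite sets, where probabilities reduce to counts; the three events below are hypothetical subsets chosen for illustration.

```python
from itertools import combinations

# Hypothetical events as subsets of a finite sample space.
A1 = set(range(0, 10))
A2 = set(range(5, 15))
A3 = {0, 3, 18}
events = [A1, A2, A3]
n = len(events)

union_size = len(A1 | A2 | A3)

# Inclusion-exclusion: alternating sum over all non-empty intersections.
incl_excl = sum(
    (-1) ** (r + 1)
    * sum(len(set.intersection(*c)) for c in combinations(events, r))
    for r in range(1, n + 1)
)
```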
2.3 Conditional Probability
Definition
The conditional probability of A given B is
P(A|B) = \frac{P(A ∩ B)}{P(B)}
Theorem
Let (Ω, F, P) be a probability space and B ∈ F such that P(B) > 0, and define Q(A) = P(A|B). Then (Ω, F, Q) is a probability space.
Law of total probability
If the B_i, i ≥ 1, are pairwise disjoint events of (Ω, F, P), and if A is contained in the union of all B_i, then

P(A) = \sum_{i=1}^{\infty} P(A ∩ B_i)
2.3.1 Bayes' Theorem

If the B_i, i ≥ 1, are pairwise disjoint events of (Ω, F, P), and if A is contained in the union of all B_i, then for any j:

P(B_j | A) = \frac{P(A | B_j)\, P(B_j)}{\sum_{i=1}^{\infty} P(A | B_i)\, P(B_i)}
which is really just a detailed way of writing the conditional probability of Bj given A.
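A small worked example with hypothetical diagnostic-test numbers (these figures are not from the course): the denominator is the law of total probability, the ratio is Bayes' theorem.

```python
# Hypothetical numbers: 1% prevalence, 99% sensitivity, 5% false-positive rate.
prior = [0.01, 0.99]        # P(B1) = P(ill), P(B2) = P(healthy)
likelihood = [0.99, 0.05]   # P(A | B1), P(A | B2), where A = "test positive"

# Denominator of Bayes' theorem: law of total probability.
p_A = sum(l * p for l, p in zip(likelihood, prior))

# Posterior probability of illness given a positive test.
posterior = likelihood[0] * prior[0] / p_A
```

Despite the accurate test, the posterior is only about 1/6, because the disease is rare.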
2.3.2 Prediction Decomposition
If Ai for all 1 ≤ i ≤ n are events in a probability space, then :
P (A1 ∩A2) = P (A2|A1)P (A1)
P (A1 ∩A2 ∩A3) = P (A3|A1 ∩A2)P (A2|A1)P (A1)
or in general
P(A_1 ∩ ... ∩ A_n) = P(A_1) \prod_{i=2}^{n} P(A_i | A_1 ∩ ... ∩ A_{i-1})
2.4 Independence
If (Ω, F, P) is a probability space, two events A and B in F are independent, written A ⊥⊥ B, if

P(A ∩ B) = P(A) P(B)

which implies P(A|B) = P(A) whenever P(B) > 0.
Definitions
• The events A_1, ..., A_n are (mutually) independent if for every subset I ⊆ {1, ..., n}:

P\left( \bigcap_{i \in I} A_i \right) = \prod_{i \in I} P(A_i)

• The events A_1, ..., A_n are pairwise independent if for all 1 ≤ i < j ≤ n,

P(A_i ∩ A_j) = P(A_i) P(A_j)

• The events A_1, ..., A_n are conditionally independent given B if for every subset I ⊆ {1, ..., n},

P\left( \bigcap_{i \in I} A_i \,\middle|\, B \right) = \prod_{i \in I} P(A_i | B)
Remarks
• Mutual independence implies pairwise independence. The converse is not true if n > 2.
• Mutual independence and conditional independence do not imply one another.
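The first remark can be illustrated with the classic two-coin example (a standard construction, not from the course), checked here by enumeration of the four equally likely outcomes.

```python
from itertools import product

# Two fair coin flips; each of the four outcomes has probability 1/4.
omega = list(product([0, 1], repeat=2))

def P(event):
    return len(event) / len(omega)

A = {w for w in omega if w[0] == 1}       # first flip is heads
B = {w for w in omega if w[1] == 1}       # second flip is heads
C = {w for w in omega if w[0] == w[1]}    # the two flips agree

# Every pair of events is independent...
pairwise = all(
    P(E & F) == P(E) * P(F) for E, F in [(A, B), (A, C), (B, C)]
)
# ...but P(A ∩ B ∩ C) = 1/4, not 1/8, so the triple is not independent.
mutual = P(A & B & C) == P(A) * P(B) * P(C)
```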
Chapter 3
RANDOM VARIABLES
3.1 Definition
Let (Ω, F, P) be a probability space. A random variable is a function X : Ω → R, which associates a value in R to every outcome. The set D_X = {x ∈ R : ∃ω ∈ Ω such that X(ω) = x} is the support of X. If D_X is countable, then X is a discrete random variable.
3.2 Bernoulli Random Variables
A random variable that takes only the values 0 and 1 is called an indicator variable, a Bernoulli random variable, or a Bernoulli trial.
3.3 Probability Mass Function
The probability mass function (PMF) of a discrete random variable X is
fX(x) = P (X = x)
Naturally, f_X(x) ≥ 0 for all x ∈ D_X, where D_X is the support of X, and \sum_{x_i \in D_X} f_X(x_i) = 1.
3.3.1 Binomial Random Variable
A binomial random variable X has the following PMF :
f(x) = \binom{n}{x} p^x (1 − p)^{n−x}, \quad x = 0, 1, ..., n, \quad n ∈ N, \quad 0 ≤ p ≤ 1

We write X ∼ B(n, p) and call n the denominator and p the probability of success. It represents the number of successes in a fixed number n of independent repetitions of a trial.
3.3.2 Geometric Distribution
A geometric random variable X has the following PMF :
f(x) = p (1 − p)^{x−1}, \quad x = 1, 2, ..., \quad 0 ≤ p ≤ 1

We write X ∼ Geom(p) and call p the probability of success. It represents the number of trials up to and including the first success.
Theorem : Lack of memory
If X ∼ Geom(p) then
P (X > n+m|X > m) = P (X > n)
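The lack-of-memory property can be checked numerically, since P(X > k) = (1 − p)^k has a closed form; p, n and m below are arbitrary illustrative values.

```python
p = 0.3  # arbitrary success probability

def tail(k):
    """P(X > k) for X ~ Geom(p): no success in the first k trials."""
    return (1 - p) ** k

n, m = 4, 6
conditional = tail(n + m) / tail(m)  # P(X > n + m | X > m)
unconditional = tail(n)              # P(X > n)
```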
3.3.3 Negative Binomial Distribution
A negative binomial random variable X has the following PMF :
f(x) = \binom{x-1}{n-1} p^n (1 − p)^{x−n}, \quad x = n, n+1, ..., \quad 0 ≤ p ≤ 1

We write X ∼ NegBin(n, p). When n = 1, X ∼ Geom(p). It represents the number of trials until the nth success happens.
Note that there exists an alternative version using the Gamma function Γ(α) which is defined inthe course’s slide 101 page 52.
3.3.4 Hypergeometric Distribution
We draw without replacement m balls from an urn with w white and b black balls. If X is the number of white balls drawn, then

P(X = x) = \frac{\binom{w}{x} \binom{b}{m-x}}{\binom{w+b}{m}}

We write X ∼ HyperGeom(w, b; m). It can be understood as the number of distinct ways of picking x white balls and m − x black balls, divided by the number of all possible draws of m balls.
3.3.5 Discrete Uniform Distribution
A discrete uniform random variable X has the following PMF :
f(x) = \frac{1}{b − a + 1}, \quad x ∈ \{a, a+1, ..., b\} ⊂ Z

We write X ∼ DU(a, b).
3.3.6 Poisson Distribution
A Poisson random variable X has the following PMF :
f(x) = \frac{λ^x}{x!} e^{−λ}, \quad x = 0, 1, 2, ..., \quad λ > 0
We write X ∼ Pois(λ).
Note that since e^{λ} = \sum_{i=0}^{\infty} \frac{λ^i}{i!}, the sum of all probabilities equals 1.
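A truncated sum confirms this numerically; the rate λ below is an arbitrary example value, and 100 terms are far more than the series needs.

```python
from math import exp, factorial

lam = 2.5  # arbitrary rate

# Truncated sum of the Poisson PMF; the exponential series drives it to 1.
total = sum(lam**x / factorial(x) * exp(-lam) for x in range(100))
```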
Poisson Process
Consider point events taking place in a time period [0, T], and let N(I) be the number of events in a subset I ⊂ [0, T].
Suppose :
• Events in disjoint subsets of T are independent.
• The probability that an event happens in an interval of width δ is λδ + o(δ) as δ → 0, for some rate λ > 0.
• The probability for no events to happen in an interval of width δ is 1− δλ+ o(δ).
Then :
• N(I) ∼ Pois(λ|I|)
• If all Ii are disjoint subsets of T , then all N(Ii) are independent Poisson variables.
Note that the probability that the waiting time X_1 until the first event occurs exceeds t is

P(X_1 > t) = P(N([0, t]) = 0) = e^{−λt}
3.4 Cumulative distribution function
The cumulative distribution function CDF of a random variable X is
FX(x) = P (X ≤ x) x ∈ R
If X is discrete, we can write
F_X(x) = \sum_{x_i \in D_X : x_i \le x} P(X = x_i)
where DX is the support of fX(x), i.e. the values that X can take.
3.4.1 Properties
The cumulative distribution function FX of a random variable X satisfies :
• \lim_{x \to −\infty} F_X(x) = 0
• \lim_{x \to \infty} F_X(x) = 1
• F_X is non-decreasing.
• F_X is continuous on the right, i.e. \lim_{t \to 0, t > 0} F_X(x + t) = F_X(x)
• P(X > x) = 1 − F_X(x)
• If x < y, then P(x < X ≤ y) = F_X(y) − F_X(x)
Remark
We can find the PMF of a discrete random variable from the CDF :
f(x) = F(x) − \lim_{y \to x, y < x} F(y)
3.5 Transformations of discrete random variables
If X is a random variable, then Y = g(X) is a random variable too, and
f_Y(y) = \sum_{x : g(x) = y} f_X(x)
3.6 Expectation
Let X be a discrete random variable for which \sum_{x \in D_X} |x| f_X(x) < \infty. The expectation, expected value or mean of X is

E(X) = \sum_{x \in D_X} x f_X(x)
Remark
If \sum_{x \in D_X} |x| f_X(x) is not finite, then E(X) is not well defined.
3.6.1 Properties
Let X be a random variable with a finite expected value E(X), and let a, b ∈ R be constants. Then

• E(·) is a linear operator, meaning that E(aX + b) = aE(X) + b
• If g(X) and h(X) have finite expected values, then E(g(X) + h(X)) = E(g(X)) + E(h(X))
• (E(X))^2 ≤ E(X^2)
3.6.2 Expected value of a function
Let X be a random variable with mass function f and let g be a real-valued function of X such that \sum_{x \in D_X} |g(x)| f(x) < \infty. Then

E(g(X)) = \sum_{x \in D_X} g(x) f(x)
3.6.3 Moments of a distribution
If X has a PMF f(x) such that \sum_x |x|^r f(x) < \infty, then

• the rth moment of X is E(X^r)
• the rth central moment of X is E((X − E(X))^r)
• the variance of X is the second central moment of X, i.e. var(X) = E((X − E(X))^2)
• the standard deviation of X is σ = \sqrt{var(X)}
• the rth factorial moment of X is E(X(X − 1) \cdots (X − r + 1)) = E\left( \frac{X!}{(X − r)!} \right)
)3.6.4 Properties of variance
Let X be a random variable whose variance exists, and let a, b be constants. Then
• var(X) = E(X2)− E(X)2
• var(X) = E(X(X − 1)) + E(X)− E(X)2
• var(aX + b) = a2var(X)• var(X) = 0 ⇒ X is constant with probability 1.
Also, if X takes its values in {0, 1, ...} and E(X) < \infty, then

E(X) = \sum_{x=1}^{\infty} P(X ≥ x)

and more generally, for r ≥ 2,

E(X(X − 1) \cdots (X − r + 1)) = r \sum_{x=r}^{\infty} (x − 1) \cdots (x − r + 1) P(X ≥ x)
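The tail-sum formula for E(X) can be checked on a geometric variable, since P(X ≥ x) = (1 − p)^{x−1} has a closed form; p and the truncation point are illustrative choices.

```python
p = 0.4  # X ~ Geom(p) on {1, 2, ...}, so E(X) = 1/p = 2.5

N = 1000  # truncation point; the geometric tail beyond it is negligible

# Direct definition of the mean: sum of x * f(x).
mean_direct = sum(x * p * (1 - p) ** (x - 1) for x in range(1, N))

# Tail-sum formula: sum over x >= 1 of P(X >= x) = (1 - p)^(x - 1).
mean_by_tails = sum((1 - p) ** (x - 1) for x in range(1, N))
```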
3.7 Conditional Probability Distributions
3.7.1 Conditional Probability mass function
Let (Ω, F, P ) be a probability space on which we define a random variable X, and let B ∈ F be suchthat P (B) > 0. Then the conditional probability mass function of X given B is
f_X(x|B) = P(X = x | B) = \frac{P(\{X = x\} ∩ B)}{P(B)}

We can note that f_X(x|B) is a well-defined mass function:

f_X(x|B) ≥ 0, \quad \sum_x f_X(x|B) = 1
3.7.2 Conditional expected value
Suppose that \sum_x |g(x)| f_X(x|B) < \infty. Then the conditional expected value of g(X) given B is

E(g(X)|B) = \sum_x g(x) f_X(x|B)

Let X be a random variable and B_i, i ∈ N, events that form a partition of Ω. Then

E(X) = \sum_{i=1}^{\infty} E(X|B_i) P(B_i)
Convergence of distribution
Let X_n and X be random variables whose cumulative distribution functions are F_n and F. We say that the random variables X_n converge in distribution, or converge in law, to X if, for all x ∈ R at which F is continuous,

\lim_{n \to \infty} F_n(x) = F(x)

We write X_n →_D X. Also, if f_n(x) → f(x) for all x, then F_n(x) → F(x).
3.7.3 Law of small numbers
Lemma:

\lim_{n \to \infty} \frac{1}{n^r} \binom{n}{r} = \frac{1}{r!}
If Xn ∼ B(n, pn) and limn→∞ npn = λ > 0, then Xn →D X where X ∼ Pois(λ).
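The binomial-to-Poisson approximation can be seen numerically by comparing the two PMFs for large n and small p_n = λ/n; λ and n below are illustrative values.

```python
from math import comb, exp, factorial

lam = 3.0
n = 100_000          # large n, small p = lam / n
p = lam / n

# Largest pointwise gap between the B(n, p) and Pois(lam) PMFs at small x.
max_diff = max(
    abs(
        comb(n, x) * p**x * (1 - p) ** (n - x)   # binomial PMF
        - lam**x * exp(-lam) / factorial(x)      # Poisson PMF
    )
    for x in range(10)
)
```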
Chapter 4
CONTINUOUS RANDOMVARIABLES
4.1 Definition : Probability density function
A random variable X is continuous if there exists a function f(x), called the probability density function (PDF) or density of X, such that

P(X ≤ x) = F(x) = \int_{−\infty}^{x} f(u)\, du

which implies that f(x) ≥ 0 and \int_{−\infty}^{\infty} f(x)\, dx = 1. Note that

f(x) = \frac{dF(x)}{dx}
4.2 Basic distributions
4.2.1 Uniform distribution
The random variable U is called a uniform random variable, written U ∼ U(a, b), if it has the following density:

f(u) = \frac{1}{b − a} for a ≤ u ≤ b, and 0 otherwise.
4.2.2 Exponential distribution

The random variable X is called an exponential random variable, written X ∼ exp(λ), if it has the following density:

f(x) = λ e^{−λx} for x > 0, and 0 otherwise.
4.2.3 Gamma distribution
The random variable X is called a gamma random variable, written X ∼ Gamma(α, λ), if it has the following density:

f(x) = \frac{λ^α}{Γ(α)} x^{α−1} e^{−λx} for x > 0, and 0 otherwise.
Here α is called the shape parameter and λ the rate.
Remark
The Gamma function Γ(α) is defined as

Γ(α) = \int_{0}^{\infty} y^{α−1} e^{−y}\, dy

with the following properties:

Γ(1) = 1
Γ(α) = (α − 1) Γ(α − 1)

For integer n ≥ 1, we have Γ(n) = (n − 1)!
4.2.4 Laplace distribution
The random variable X is called a Laplace random variable or double exponential if it has the following density:

f(x) = \frac{λ}{2} e^{−λ|x−η|}, \quad x, η ∈ R, \quad λ > 0
4.2.5 Pareto distribution
The random variable X is called a Pareto random variable if it has the cumulative distribution function

F(x) = 0 for x < β, and F(x) = 1 − \left( \frac{β}{x} \right)^α for x ≥ β.
4.3 Expectation
Let g(x) be a real-valued function, and X a continuous random variable with density f(x). If E(|g(X)|) < \infty, then the expectation of g(X) is

E(g(X)) = \int_{−\infty}^{\infty} g(x) f(x)\, dx
4.4 Conditional Densities
The conditional cumulative distribution function is given by

F_X(x | X ∈ A) = \frac{\int_{y \le x,\, y \in A} f(y)\, dy}{P(X ∈ A)}

and the conditional density function by

f_X(x | X ∈ A) = \frac{f_X(x)}{P(X ∈ A)} for x ∈ A, and 0 otherwise.

Finally we can also define the conditional expectation as

E(g(X) | X ∈ A) = \frac{E(g(X)\, I(X ∈ A))}{P(X ∈ A)}
4.5 Quantiles
Let 0 < p < 1. We define the p quantile of the cumulative distribution function F(x) to be

x_p = \inf\{x : F(x) ≥ p\}

that is, the smallest x such that F(x) reaches or exceeds p. For most continuous random variables, x_p is unique and equals x_p = F^{−1}(p), which is equivalent to P(X ≤ x_p) = p. In particular, the 0.5 quantile is the median of F.
4.6 Transformations
Let g : R → R and B_y = ]−\infty, y]. Let Y = g(X) be a random variable. Then

F_Y(y) = P(Y ≤ y) = \int_{g^{−1}(B_y)} f_X(x)\, dx if X is continuous, and \sum_{x \in g^{−1}(B_y)} f_X(x) if X is discrete,

where g^{−1}(B_y) = \{x ∈ R : g(x) ≤ y\}. When g is monotone increasing or decreasing, then since f(x) = F′(x),

f_Y(y) = \left| \frac{d g^{−1}(y)}{dy} \right| f_X(g^{−1}(y))
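The monotone-transformation formula can be sanity-checked numerically: take an exponential X and the increasing map g(x) = log(x), and compare the formula with a finite-difference derivative of F_Y(y) = F_X(e^y). The rate and evaluation point are illustrative.

```python
from math import exp

lam = 2.0  # X ~ exp(lam); Y = g(X) = log(X) is monotone increasing

def f_X(x):
    return lam * exp(-lam * x) if x > 0 else 0.0

def F_X(x):
    return 1.0 - exp(-lam * x) if x > 0 else 0.0

# Density of Y by the formula: g^{-1}(y) = e^y, so |d g^{-1}(y)/dy| = e^y.
def f_Y(y):
    return exp(y) * f_X(exp(y))

# Numerical check: f_Y should match the derivative of F_Y(y) = F_X(e^y).
y, h = 0.3, 1e-6
numeric = (F_X(exp(y + h)) - F_X(exp(y - h))) / (2 * h)
```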
4.7 Normal Distribution
A random variable X is called a normal random variable with expectation µ and variance σ² if it has density

f_X(x) = \frac{1}{σ\sqrt{2π}} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right)

We write X ∼ N(µ, σ²).
When µ = 0 and σ² = 1, the corresponding random variable Z is standard normal, with density

φ(z) = \frac{1}{\sqrt{2π}} \exp\left( −\frac{z^2}{2} \right)

yielding

F_Z(z) = P(Z ≤ z) = Φ(z) = \int_{−\infty}^{z} φ(u)\, du = \int_{−\infty}^{z} \frac{1}{\sqrt{2π}} e^{−u^2/2}\, du

Note that f_X(x) = \frac{1}{σ} φ\left( \frac{x − µ}{σ} \right).
)4.7.1 Properties
The density φ(z), the cumulative distribution function Φ(z) and the quantiles z_p of Z ∼ N(0, 1) verify:

• φ(z) = φ(−z)
• P(Z ≤ z) = 1 − P(Z ≥ z)
• z_p = −z_{1−p}
• \lim_{z \to ±\infty} z^r φ(z) = 0 for all r > 0
• φ′(z) = −z φ(z), φ″(z) = (z² − 1) φ(z), φ‴(z) = −(z³ − 3z) φ(z), ...
• If X ∼ N(µ, σ²) then Z = (X − µ)/σ ∼ N(0, 1)
4.7.2 Moivre-Laplace: Normal approximation of the binomial distribution

Let X_n ∼ B(n, p), µ_n = E(X_n) = np, σ_n² = var(X_n) = np(1 − p) and Z ∼ N(0, 1). Then

\lim_{n \to \infty} P\left( \frac{X_n − µ_n}{σ_n} ≤ z \right) = Φ(z)

and, approximately,

P(X_n ≤ r) ≈ Φ\left( \frac{r − µ_n}{σ_n} \right)

which corresponds to X_n being approximately N(np, np(1 − p)).
Continuity correction
A better approximation of P(X_n ≤ r) is given by replacing r by r + 1/2, which is called the continuity correction. This yields

P(X_n ≤ r) ≈ Φ\left( \frac{r + 1/2 − np}{\sqrt{np(1 − p)}} \right)
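The benefit of the correction can be seen by comparing both approximations with the exact binomial CDF; n, p and r below are arbitrary example values, and Φ is computed from the error function.

```python
from math import comb, erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p, r = 50, 0.4, 22
mu, sigma = n * p, sqrt(n * p * (1 - p))

# Exact binomial CDF P(X <= r).
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(r + 1))

plain = Phi((r - mu) / sigma)             # without correction
corrected = Phi((r + 0.5 - mu) / sigma)   # with continuity correction
```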
4.8 Q-Q Plots
When we want to compare a sample of n values X_i of a random variable with a theoretical distribution F, we typically order the n values and plot them against F^{−1}\left( \frac{1}{n+1} \right), F^{−1}\left( \frac{2}{n+1} \right), and so on.
The idea is that the n + 1 subdivisions divide the graph into intervals of equal probability (meaning more cuts around zero for N(0, 1), for example). If the distribution of the samples corresponds to F, then plotting them against the cuts should result in a straight line.
4.9 Densities recap
• Uniform variables lie in a finite interval and give equal probability to each part of the interval.
• Exponential and Gamma variables lie in (0, ∞) and are often used to model waiting times or positive quantities. Gamma has 2 parameters and is more flexible, but exponential is simpler and more elegant.
• Pareto variables lie in (β, ∞) and are often used to model financial losses over some threshold β.
• Normal variables lie in R and are used to model quantities that arise from the averaging of many small effects, or are subject to error.
• Laplace variables lie in R and are often used in place of the normal when outliers might be present.
Chapter 5
SEVERAL RANDOMVARIABLES
5.1 Discrete Random Variables
Let (X,Y ) be a discrete random variable. The joint probability mass function of (X,Y ) is
fX,Y (x, y) = P ((X,Y ) = (x, y))
and the joint cumulative distribution function of (X,Y ) is
FX,Y (x, y) = P (X ≤ x, Y ≤ y)
5.2 Continuous Random Variables
The random variable (X, Y) is said to be jointly continuous if there exists a function f_{X,Y}(x, y), called the joint density of (X, Y), such that

P((X, Y) ∈ A) = \iint_{(u,v) \in A} f_{X,Y}(u, v)\, du\, dv
We can hence also write the joint cumulative distribution function of (X,Y ) as
FX,Y(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(u, v) du dv

and

fX,Y(x, y) = ∂²FX,Y(x, y)/∂x∂y
5.3 Exponential Families
Let (X1, ...Xn) be a discrete or continuous random variable with mass/density function of the form
f(x1, ..., xn) = exp( Σ_{i=1}^{p} si(x)θi − κ(θ1, ..., θp) + c(x1, ..., xn) )

where θ = (θ1, ..., θp) ∈ Θ ⊆ Rp. This is called an exponential family distribution.
5.4 Marginal and conditional distribution
The marginal probability mass/density function of X is
fX(x) = Σ_y fX,Y(x, y) if discrete
fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy if continuous
The conditional probability mass/density function of Y given X is
fY|X(y | x) = fX,Y(x, y) / fX(x)
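As a small sketch (hypothetical probabilities, joint pmf stored as a plain dict), the marginal and conditional mass functions can be computed directly from their definitions:

```python
# Joint pmf of (X, Y) stored as {(x, y): probability}.
joint = {(0, 0): 0.10, (0, 1): 0.30,
         (1, 0): 0.25, (1, 1): 0.35}

def marginal_X(joint, x):
    """f_X(x) = sum over y of f_{X,Y}(x, y)."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def conditional_Y_given_X(joint, y, x):
    """f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)."""
    return joint[(x, y)] / marginal_X(joint, x)

assert abs(sum(joint.values()) - 1.0) < 1e-12       # valid joint pmf
assert abs(marginal_X(joint, 0) - 0.40) < 1e-12
# The conditional pmf sums to 1 over y for each fixed x.
total = conditional_Y_given_X(joint, 0, 1) + conditional_Y_given_X(joint, 1, 1)
assert abs(total - 1.0) < 1e-12
```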
5.5 Multivariate Random Variables
Let X1, ..., Xn be random variables defined on the same space. Their joint cumulative distribution function is

FX1,...,Xn(x1, ..., xn) = P(X1 ≤ x1, ..., Xn ≤ xn)

and their joint mass/density function is

fX1,...,Xn(x1, ..., xn) = P(X1 = x1, ..., Xn = xn) if discrete
fX1,...,Xn(x1, ..., xn) = ∂ⁿFX1,...,Xn(x1, ..., xn)/∂x1...∂xn if continuous
5.5.1 Multinomial Distribution
The random variable (X1, ..., Xk) has the multinomial distribution with denominator m and probabilities (p1, ..., pk) if its mass function is

f(x1, ..., xk) = ( m! / (x1! · ... · xk!) ) p1^{x1} · ... · pk^{xk}
5.6 Independence
Random variables X,Y defined on the same probability space are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)

implying

fX,Y(x, y) = fX(x) fY(y) and FX,Y(x, y) = FX(x) FY(y)
5.6.1 Independent and Identically distributed variables
A random sample of size n from a distribution F with density f is a set of n independent random variables each with distribution F. We say that they are independent and identically distributed (iid) with distribution F, or with density f.
5.7 Joint Moments and Covariance
Let X, Y be random variables with mass/density fX,Y(x, y). Then if E|g(X, Y)| < ∞, we can define the expectation of g(X, Y) to be

E( g(X, Y) ) = Σ_{x,y} g(x, y) fX,Y(x, y) if discrete
E( g(X, Y) ) = ∬ g(x, y) fX,Y(x, y) dx dy if continuous
In particular we define the joint moments and the joint central moments by
E(XrY s) and E((X − E(X))r(Y − E(Y ))s)
If r = s = 1 we call it the covariance of X and Y
cov(X,Y ) = E((X − E(X))(Y − E(Y ))) = E(XY )− E(X)E(Y )
5.7.1 Properties of covariance
Let X,Y, Z be random variables and a, b, c, d ∈ R constants. The covariance satisfies
• cov(X, X) = var(X)
• cov(a, X) = 0
• cov(X, Y) = cov(Y, X)
• cov(a + bX + cY, Z) = b cov(X, Z) + c cov(Y, Z)
• cov(a + bX, c + dY) = bd cov(X, Y)
• var(a + bX + cY) = b² var(X) + 2bc cov(X, Y) + c² var(Y)
• cov(X, Y)² ≤ var(X) var(Y) (Cauchy–Schwarz inequality)
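The bilinearity properties above also hold exactly for the sample covariance, which makes them easy to check numerically. A minimal sketch (hypothetical data, Python standard library only):

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Sample covariance (denominator n), mirroring
    cov(X, Y) = E(XY) - E(X)E(Y)."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 3.0]
z = [0.5, 2.5, 1.5, 4.0]
a, b, c, d = 3.0, 2.0, -1.0, 4.0

# cov(a + bX, c + dY) = bd cov(X, Y)
lhs = cov([a + b * v for v in x], [c + d * w for w in y])
assert abs(lhs - b * d * cov(x, y)) < 1e-9

# cov(a + bX + cY, Z) = b cov(X, Z) + c cov(Y, Z)
s = [a + b * u + c * w for u, w in zip(x, y)]
assert abs(cov(s, z) - (b * cov(x, z) + c * cov(y, z))) < 1e-9

# var(X) = cov(X, X); Cauchy-Schwarz: cov(X, Y)^2 <= var(X) var(Y)
assert cov(x, y) ** 2 <= cov(x, x) * cov(y, y) + 1e-12
```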
5.7.2 Independence and covariance
Recall that if X and Y are independent and g(X) and h(Y) are functions whose expectations exist, then

E( g(X)h(Y) ) = E( g(X) ) E( h(Y) )

Taking g(X) = X − E(X) and h(Y) = Y − E(Y), we see that

cov(X, Y) = 0
5.8 Linear combinations of random variables
The average of the random variables X1, ..., Xn is

X̄ = n⁻¹ Σ_{j=1}^{n} Xj

If a, b1, ..., bn are constants, then

var(a + b1X1 + ... + bnXn) = Σ_{j,k} bj bk cov(Xj, Xk) = Σ_{j=1}^{n} bj² var(Xj) + Σ_{j≠k} bj bk cov(Xj, Xk)
If X1, ..., Xn all have mean µ and variance σ², then

E(X̄) = µ and var(X̄) = σ²/n
5.9 Correlation
The covariance depends on the units of measurement, so we often use the following quantity, which measures linear dependence.

The correlation of X and Y is

corr(X, Y) = cov(X, Y) / √( var(X) var(Y) )
5.9.1 Properties of correlation
Let X and Y be random variables with correlation ρ = corr(X,Y ). Then
• −1 ≤ ρ ≤ 1
• if ρ = ±1, then there exist a, b, c ∈ R such that aX + bY + c = 0 with probability 1. X and Y are then said to be linearly dependent.
• if X and Y are independent, then corr(X, Y) = 0
• the effect of the transformation (X, Y) → (a + bX, c + dY) is corr(X, Y) → sign(bd) corr(X, Y)
5.10 Conditional Expectation
Let g(X,Y ) be a function of a random vector (X,Y ). Its conditional expectation given X = x is
E( g(X, Y) | X = x ) = Σ_y g(x, y) fY|X(y|x) if discrete
E( g(X, Y) | X = x ) = ∫_{−∞}^{∞} g(x, y) fY|X(y|x) dy if continuous
5.10.1 Expectation and Conditioning
E( g(X, Y) ) = E_X( E( g(X, Y) | X = x ) )
var( g(X, Y) ) = E_X( var( g(X, Y) | X = x ) ) + var_X( E( g(X, Y) | X = x ) )

where E_X and var_X are the expectation and variance according to the distribution of X.

For the variance, the second term takes into account the variance from one x to another. The first term computes the average over all x of the variance along Y, but it is oblivious to the potential offsets there can be from one x to another, since the variance ignores constants. If at x1 the distribution is the same as at x0, plus some constant c, the first term alone would not detect this, and that is what the second term is for.
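The variance decomposition can be verified exactly on a small discrete joint distribution (hypothetical probabilities, Python standard library only):

```python
# Check var(Y) = E_X(var(Y|X)) + var_X(E(Y|X)) on a small joint pmf.
joint = {(0, 1): 0.2, (0, 3): 0.1,
         (1, 2): 0.4, (1, 5): 0.3}

def e(f):
    """Expectation of f(x, y) under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

fx = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
# E(Y | X = x) and var(Y | X = x) for each x
ey_x = {x: sum(y * p for (xx, y), p in joint.items() if xx == x) / fx[x] for x in fx}
vy_x = {x: sum(y * y * p for (xx, y), p in joint.items() if xx == x) / fx[x]
           - ey_x[x] ** 2 for x in fx}

var_y = e(lambda x, y: y * y) - e(lambda x, y: y) ** 2
within = sum(fx[x] * vy_x[x] for x in fx)                    # E_X(var(Y|X))
mean_ey = sum(fx[x] * ey_x[x] for x in fx)
between = sum(fx[x] * (ey_x[x] - mean_ey) ** 2 for x in fx)  # var_X(E(Y|X))

assert abs(var_y - (within + between)) < 1e-9
```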
5.11 Generating Functions
We define the moment-generating function of a random variable X by
MX(t) = E(etX)
for t ∈ R such that MX(t) <∞. This definition of the MGF allows us to write the following
MX(t) = E( Σ_{r=0}^{∞} t^r X^r / r! ) = Σ_{r=0}^{∞} (t^r / r!) E(X^r)
from which we can obtain all the moments E(Xr) by differentiation.
5.11.1 Properties of Moment-Generating Functions
If M(t) is the MGF of a random variable X then
• MX(0) = 1
• M_{a+bX}(t) = e^{at} MX(bt)
• E(X^r) = d^r MX(t)/dt^r |_{t=0}
• E(X) = M′X(0)
• var(X) = M″X(0) − M′X(0)²
There exists an injection between the cumulative distribution functions FX(x) and the moment-generating functions MX(t), meaning that if we recognize an MGF then we know to which distribution it corresponds.
5.11.2 Linear combinations
Let a, b1, ..., bn ∈ R and let X1, ..., Xn be independent random variables whose MGFs exist. Then Y = a + b1X1 + ... + bnXn has MGF

MY(t) = e^{ta} Π_{j=1}^{n} MXj(t bj)
5.11.3 Continuity
Let Xn, X be random variables with distribution functions Fn, F whose MGFs Mn(t), M(t) exist for 0 ≤ |t| < b. If Mn(t) → M(t) for |t| < b as n → ∞, then Xn →D X, i.e. Fn(x) → F(x) at each x ∈ R where F is continuous.
5.11.4 Mean vector and covariance matrix
Let X = (X1, ..., Xp)ᵀ be a p × 1 vector of random variables. Then

E(X)_{p×1} = ( E(X1), ..., E(Xp) )ᵀ

and var(X)_{p×p} is the p × p matrix with var(Xj) on the diagonal and cov(Xj, Xk) off the diagonal:

| var(X1)      cov(X1, X2)  ...  cov(X1, Xp) |
| cov(X1, X2)  var(X2)      ...  cov(X2, Xp) |
| ...          ...               ...         |
| cov(X1, Xp)  cov(X2, Xp)  ...  var(Xp)     |
5.11.5 Moment-generating function : Multivariate case
The moment-generating function of a random vector X_{p×1} is

MX(t) = E( e^{tᵀX} ) = E( e^{Σ_{r=1}^{p} tr Xr} )

for t ∈ Rp such that MX(t) < ∞. This implies

E(X)_{p×1} = M′X(0) = ∂MX(t)/∂t |_{t=0}
var(X)_{p×p} = ∂²MX(t)/∂t∂tᵀ |_{t=0} − M′X(0) M′X(0)ᵀ

Note that all we do here is rewrite the already known formulae for single random variables in a form that allows us to compute several "at once".
Independence
If A, B ⊂ {1, ..., p} with A ∩ B = ∅, and we write XA for the subvector of X containing {Xj : j ∈ A}, then XA and XB are independent iff

MX(t) = E( e^{tAᵀXA + tBᵀXB} ) = MXA(tA) MXB(tB)

where t is such that MX(t) < ∞.
5.11.6 Characteristic function
Many distributions don’t have a defined MGF. In this case we define the characteristic functionof X
φX(t) = E( e^{itX} ), t ∈ R

where i = √−1. Characteristic functions share the same properties as MGFs, but they are to be used only if the MGF is not defined, since they require complex analysis.
Theorem
X and Y have the same cumulative distribution function iff they have the same characteristic function. If X is continuous with density f and characteristic function φ, then

f(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} φ(t) dt
5.11.7 Cumulant-generating function
The cumulant-generating function (CGF) of X is KX(t) = logMX(t). The cumulants κr of Xare defined by
KX(t) = Σ_{r=1}^{∞} (t^r / r!) κr,   κr = d^r KX(t)/dt^r |_{t=0}
This implies that E(X) = κ1 and var(X) = κ2.
5.11.8 Cumulants of sums of random variables
If a, b1, ..., bn are constants and X1, ..., Xn are independent random variables, then
K_{a+b1X1+...+bnXn}(t) = ta + Σ_{j=1}^{n} KXj(t bj)

If X1, ..., Xn are independent variables having cumulants κ_{j,r}, then the CGF of S = X1 + ... + Xn is

KS(t) = Σ_{j=1}^{n} KXj(t) = Σ_{j=1}^{n} Σ_{r=1}^{∞} (t^r / r!) κ_{j,r}
5.11.9 Multivariate cumulant-generating function
The cumulant-generating function (CGF) of a random variable Xp×1 = (X1, ..., Xp)T is
KX(t) = logMX(t)
This implies that
E(X)_{p×1} = K′X(0) = ∂KX(t)/∂t |_{t=0}
var(X)_{p×p} = ∂²KX(t)/∂t∂tᵀ |_{t=0}
Independence
If A, B ⊂ {1, ..., p} with A ∩ B = ∅, and we write XA for the subvector of X containing {Xj : j ∈ A}, then XA and XB are independent iff

KX(t) = log E( e^{tAᵀXA + tBᵀXB} ) = KXA(tA) + KXB(tB)

where t is such that MX(t) < ∞.
5.12 Multivariate Normal Distribution
The random vector X = (X1, ..., Xp)T has a multivariate normal distribution if there exist a
p× 1 vector µ = (µ1, ..., µp)T ∈ Rp and a p× p symmetric matrix Ω with elements ωjk such that
uTX ∼ N(uTµ, uTΩu)
where u ∈ Rp. We then write X ∼ Np(µ, Ω). In other words, if every linear combination of the individual random variables has a normal distribution, then so does the vector.
5.12.1 Lemma : Properties of Normal variables
• E(Xj) = µj, var(Xj) = ωjj, cov(Xj, Xk) = ωjk for j ≠ k.
• The moment-generating function of X is MX(t) = exp( tᵀµ + ½ tᵀΩt ) for t ∈ Rp.
• XA and XB are independent iff ΩAB = 0.
• If X1, ..., Xn ∼iid N(µ, σ²), then X_{n×1} = (X1, ..., Xn)ᵀ ∼ Nn(µ1n, σ²In).
• Linear combinations of normal variables are normal : a_{r×1} + B_{r×p}X ∼ Nr(a + Bµ, BΩBᵀ).
5.12.2 Lemma : Normal density function
The random vector X ∼ Np(µ, Ω) has a density function on Rp iff Ω is positive definite, i.e. Ω has rank p. If so, the density function is

f(x; µ, Ω) = (2π)^{−p/2} |Ω|^{−1/2} exp( −½ (x − µ)ᵀ Ω⁻¹ (x − µ) )
5.12.3 Marginal and conditional distributions
Let X ∼ Np(µ_{p×1}, Ω_{p×p}), where |Ω| > 0, and let A, B ⊂ {1, ..., p} with |A| = q < p, |B| = r < p and A ∩ B = ∅. Let µA, ΩA and ΩAB be respectively the q × 1 subvector of µ and the q × q and q × r submatrices of Ω conformable with A, A × A, A × B. Then

• The marginal distribution of XA is normal, XA ∼ Nq(µA, ΩA)
• The conditional distribution of XA given XB = xB is normal,

XA | XB = xB ∼ Nq( µA + ΩAB ΩB⁻¹ (xB − µB), ΩA − ΩAB ΩB⁻¹ ΩBA )

5.13 Transformation of joint continuous densities
Let X = (X1, X2) ∈ R² be a continuous random variable and let Y = (g1(X1, X2), g2(X1, X2)) be such that there exist h1 and h2 with X1 = h1(Y1, Y2) and X2 = h2(Y1, Y2). Let J(x1, x2) be the Jacobian of g1 and g2:

J(x1, x2) = | ∂g1/∂x1  ∂g1/∂x2 |
            | ∂g2/∂x1  ∂g2/∂x2 |

Then

fY1,Y2(y1, y2) = fX1,X2(x1, x2) |J(x1, x2)|⁻¹ evaluated at x1 = h1(y1, y2), x2 = h2(y1, y2)
5.14 Order Statistics
The order statistics of the random variables X1, ...Xn are the ordered values
X(1) ≤ X(2) ≤ ... ≤ X(n−1) ≤ X(n)
If X1, ..., Xn are continuous, then no two of the Xj can be equal. In particular,

min_{1≤j≤n} Xj = X(1)
median = X(m+1) if n = 2m + 1 odd, and ½( X(m) + X(m+1) ) if n = 2m even
max_{1≤j≤n} Xj = X(n)
5.14.1 Theorem
Let X1, ..., Xn ∼iid F, a continuous distribution with density f. Then

P( X(n) ≤ x ) = F(x)ⁿ
P( X(1) ≤ x ) = 1 − (1 − F(x))ⁿ
fX(r)(x) = n!/( (r − 1)!(n − r)! ) F(x)^{r−1} f(x) (1 − F(x))^{n−r},  r = 1, ..., n
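The first identity is easy to check by simulation. A sketch with uniform variables, where F(x) = x on (0, 1) (hypothetical numbers, fixed seed so the run is reproducible):

```python
import random

random.seed(42)
n, trials, x = 5, 20000, 0.7

# For X_j ~ U(0, 1), F(x) = x, so P(X_(n) <= x) should be x**n.
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))
estimate = hits / trials
theory = x ** n   # 0.7**5 ≈ 0.168

# Monte Carlo estimate should land close to the theoretical value.
assert abs(estimate - theory) < 0.015
```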
5.15 Approximation and Convergence
5.15.1 Inequalities
Let X be a random variable, a > 0 a constant, h a non-negative function and g a convex function. Then

• P( h(X) ≥ a ) ≤ E( h(X) )/a (basic inequality)
• P( |X| ≥ a ) ≤ E( |X| )/a (Markov's inequality)
• P( |X| ≥ a ) ≤ E(X²)/a² (Chebyshov's inequality)
• E( g(X) ) ≥ g( E(X) ) (Jensen's inequality)

From Chebyshov's inequality, by replacing X with X − E(X), we get

P( |X − E(X)| ≥ a ) ≤ var(X)/a²

Remark : to prove the basic inequality, from which the others more or less derive, note that for any a > 0 and non-negative Y we have Y ≥ Y I(Y ≥ a) ≥ a I(Y ≥ a), which implies, from the definition of the expectation, that E(Y) ≥ a P(Y ≥ a).
Hoeffding’s inequality
Let Z1, ..., Zn be independent random variables such that E(Zi) = 0 and ai ≤ Zi ≤ bi for constants ai < bi. If ε > 0, then for all t > 0,

P( Σ_{i=1}^{n} Zi ≥ ε ) ≤ e^{−tε} Π_{i=1}^{n} e^{t²(bi − ai)²/8}

This inequality gives much tighter bounds than the ones seen before.
5.15.2 Convergence
• Xn converges almost surely, Xn →a.s. X, if P( lim_{n→∞} Xn = X ) = 1
• Xn converges in mean square, Xn →2 X, if lim_{n→∞} E( (Xn − X)² ) = 0, where E(Xn²), E(X²) < ∞
• Xn converges in probability, Xn →P X, if for all ε > 0, lim_{n→∞} P( |Xn − X| > ε ) = 0
• Xn converges in distribution, Xn →D X, if lim_{n→∞} Fn(x) = F(x) at each point x where F(x) is continuous
Relations between modes of convergence
Xn →a.s. X ⇒ Xn →P X, and Xn →2 X ⇒ Xn →P X; in turn, Xn →P X ⇒ Xn →D X.

The most important ones are →P and →D.
Combinations of convergent sequences
Let x0, y0 be constants, X, Y, Xn, Yn random variables, and h a function continuous at x0. Then

Xn →D x0 ⇒ Xn →P x0
Xn →P x0 ⇒ h(Xn) →P h(x0)
Xn →D X and Yn →P y0 ⇒ Xn + Yn →D X + y0 and Xn Yn →D X y0

the last being Slutsky's lemma.
5.15.3 Laws of Large Numbers
Weak law of large numbers
Let X1, ..., Xn be a sequence of independent identically distributed random variables with finite expectation µ, and average

X̄ = n⁻¹(X1 + ... + Xn)

Then X̄ →P µ, i.e. for all ε > 0,

P( |X̄ − µ| > ε ) → 0 as n → ∞

Remark : when the Xj also have finite variance σ², this is very easily proved using Chebyshov's inequality:

P( |X̄ − µ| > ε ) ≤ var(X̄)/ε² = σ²/(nε²) → 0 as n → ∞
Strong law of large numbers
Under the same conditions we have, in addition, X̄ →a.s. µ, i.e.

P( lim_{n→∞} X̄ = µ ) = 1

This is stronger, since the weak law allows the event |X̄ − µ| > ε to occur an infinite number of times, while the strong law excludes this.
5.15.4 Central Limit Theorem (CLT)
Standardisation of an average
We know that if Xj ∼iid (µ, σ²), then

E(X̄) = µ and var(X̄) = σ²/n

Therefore it is natural to consider

Zn = (X̄ − µ)/√(σ²/n) = √n(X̄ − µ)/σ

which has expectation 0 and variance 1.
Central Limit Theorem
Let X1, ..., Xn be independent identically distributed random variables with expectation µ and variance 0 < σ² < ∞. Then

Zn = √n(X̄ − µ)/σ →D Z as n → ∞

where Z ∼ N(0, 1). Thus for large n,

P( √n(X̄ − µ)/σ ≤ z ) ≈ P(Z ≤ z) = Φ(z)
5.15.5 Delta Method
Let X1, ..., Xn be independent identically distributed random variables with expectation µ and variance 0 < σ² < ∞, and let g be a smooth function with g′(µ) ≠ 0. Then

( g(X̄) − g(µ) ) / √( g′(µ)² σ²/n ) →D N(0, 1) as n → ∞

which is just a more general form of the central limit theorem.
5.15.6 Sample quantiles
Definition
Let X1, ..., Xn ∼iid F and 0 < p < 1. Then the p sample quantile of X1, ..., Xn is the rth order statistic X(r), where r = ⌈np⌉.
Theorem : Asymptotic distribution of order statistics
Let 0 < p < 1, X1, ..., Xn ∼iid F and xp = F⁻¹(p). Then if f(xp) > 0,

( X(⌈np⌉) − xp ) / √( p(1 − p)/(n f(xp)²) ) →D N(0, 1) as n → ∞

which implies that, approximately,

X(⌈np⌉) ∼ N( xp, p(1 − p)/(n f(xp)²) )
Chapter 6
EXPLORATORY STATISTICS
Statistics is the science of extracting information from data. Key points to keep in mind are variation in the data and the consequent uncertainty, and context.
Statistical Cycle
There are four main stages in the statistical method.
• planning
• implementation
• data analysis
• presentation
Study types
There are two main approaches: designed experiments, where we can influence the experiment and hence remove some correlations, and observational studies, where many hidden factors can exist.
6.1 Types of Data
• Population : the entire set of units we might study.
• Sample : a subset of the population.
• Statistical variable : a quantitative or qualitative characteristic of a unit in the population.
• Modes : “bumps” the data exhibits.
6.2 Graphical Study of Variables
6.2.1 Kernel Density Estimate
Let K be a kernel, i.e. a density that is symmetric about 0 and has variance 1, and let y1, ..., yn be a sample of data drawn from some distribution with probability density f. Then the kernel density estimator (KDE) of f, for h > 0, is

f̂h(y) = (1/nh) Σ_{j=1}^{n} K( (y − yj)/h ),  y ∈ R
which gives a nonparametric estimator of the density underlying the sample, depending on the kernel K and, most importantly, on the bandwidth h.

The effect is to replace each sample point with a small normal-like distribution and to add these up, generating a new estimated distribution. h determines the width of each little distribution around a sample point, and hence how smooth the final sum is.
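A minimal sketch of a Gaussian KDE, implementing the formula above directly (hypothetical sample and bandwidth, Python standard library only); we check that the estimate is a genuine density by integrating it numerically over a wide grid.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: symmetric about 0, variance 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(y, sample, h):
    """f_hat_h(y) = (1/nh) * sum_j K((y - y_j) / h)."""
    n = len(sample)
    return sum(gaussian_kernel((y - yj) / h) for yj in sample) / (n * h)

sample = [1.2, 1.9, 2.1, 2.4, 3.3]
h = 0.5

# Riemann sum over a grid wide enough to capture essentially all the mass.
grid = [i * 0.01 for i in range(-500, 1000)]
area = sum(kde(y, sample, h) for y in grid) * 0.01

assert abs(area - 1.0) < 1e-3                       # integrates to ~1
assert kde(2.0, sample, h) > kde(10.0, sample, h)   # more mass near the data
```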
6.3 Numerical Summaries
6.3.1 Breakdown point
We say, for example, that the median has asymptotic breakdown point 50%, because the median would only move an arbitrarily large amount if 50% of the observations were corrupted. The average has breakdown point 0%, since a single bad value can move it arbitrarily far.
6.3.2 Quartiles and Sample Quantiles
We define the p quantile

q(p) = x(⌈np⌉)

where 0 < p ≤ 1. Sometimes p is given in percent, in which case it must be divided by 100. Hence the quartiles are q(0.25) and q(0.75).
6.3.3 Variability, Dispersion measures
Sample standard deviation
s = ( (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² )^{1/2} = ( (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² ) )^{1/2}
The sample variance is then s2. Both have breakdown point 0%.
Range
x(n) − x(1)
The range has breakdown point 0%.
6.3.4 Interquartile range
IQR(x) = q(0.75)− q(0.25)
which has breakdown point 25%.
6.3.5 Sample correlation
The sample correlation rxy is defined in exactly the same way as the correlation, and has the same properties.
6.4 Boxplot
6.4.1 Five-number summary
The five-number summary is the list of the following five values
min = x(1),  q(0.25),  median = q(0.5),  q(0.75),  max = x(n)

which are used for drawing the boxplot.
6.4.2 Boxplot calculations
The additional calculation required for the boxplot serves to flag outliers:

C = 1.5 × IQR(x)

which determines the maximal length of the whiskers on each side of the box.
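The five-number summary and the whisker rule can be sketched as follows (hypothetical data containing one obvious outlier; q(p) = x(⌈np⌉) as defined above):

```python
import math

def quantile(xs, p):
    """q(p) = x_(ceil(n p)), with 1-based order statistics."""
    s = sorted(xs)
    r = math.ceil(len(s) * p)
    return s[r - 1]

def five_number_summary(xs):
    return (min(xs), quantile(xs, 0.25), quantile(xs, 0.5),
            quantile(xs, 0.75), max(xs))

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 8, 40]   # 40 is an outlier
lo, q1, med, q3, hi = five_number_summary(data)
iqr = q3 - q1
c = 1.5 * iqr          # whiskers extend at most this far beyond the box

assert (lo, hi) == (1, 40)
assert q1 <= med <= q3
assert hi > q3 + c     # the outlier lies beyond the upper whisker
```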
6.5 Choice of a model
6.5.1 Normal Q-Q Plot
To verify that data follow a normal distribution, we use normal Q-Q plots, i.e. a graph of the ordered sample values against the normal plotting positions. A graph close to a straight line suggests that the observations can be fitted by a normal model, and the slope and the intercept at x = 0 give estimates of σ and µ respectively.
6.6 Statistical Inference

Having observed an event A, we want to say something about the underlying probability space (Ω, F, P).
6.6.1 Statistical model
Several problems must be addressed when trying to deduce information about a probability spacefrom events :
• specification of a model for the data
• estimation of the unknowns of the model (parameters, ...)
• tests of hypotheses concerning a model
• planning of the data collection and analysis to minimize uncertainty.
6.6.2 Definitions
• A statistical model is a probability distribution f(y) chosen to learn from observed data y or from potential data Y. If f(y) = f(y; θ), then f is a parametric model.
• A statistic T = t(Y ) is a known function of the data Y .
• The sampling distribution of a statistic T = t(Y ) is its distribution when Y ∼ f(y).
• A random sample is a set of independent and identically distributed random variablesY1, ...Yn or their realisations y1, ...yn.
6.7 Point Estimation
6.7.1 Estimator
An estimator is a statistic θ̂ used to estimate a parameter θ of f.
6.7.2 Estimation methods
There are many methods for estimating the parameters of a model. The choice depends on ease ofcalculation, efficiency (precision), and robustness. Common methods are
• method of moments, simple but potentially inefficient
• maximum likelihood estimation, general and often optimal
• M-estimation, even more general, mostly robust but less efficient.
6.7.3 Method of moments
The method of moments estimate of a parameter θ is the value θ̂ that matches the theoretical and the empirical moments. For a model with p unknown parameters, this gives

E(Y^r) = ∫ y^r f(y; θ) dy = (1/n) Σ_{j=1}^{n} yj^r,  r = 1, ..., p

meaning that we need as many finite moments of the underlying model as there are unknown parameters.
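A one-parameter sketch (hypothetical data, fixed seed): for the exponential distribution with rate λ we have E(Y) = 1/λ, so matching the first moment gives λ̂ = 1/ȳ.

```python
import random

rng = random.Random(11)
lam = 2.0
ys = [rng.expovariate(lam) for _ in range(50000)]

# Method of moments: E(Y) = 1/lambda, matched against the sample mean.
ybar = sum(ys) / len(ys)
lam_hat = 1.0 / ybar

assert abs(lam_hat - lam) < 0.05   # close to the true rate for this seed
```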
6.7.4 Maximum likelihood estimation
If y1, ..., yn is a random sample from the density f(y; θ), then the likelihood for θ is
L(θ) = f(y1, ..., yn; θ) = f(y1; θ) · f(y2; θ) · ... · f(yn; θ)
The maximum likelihood estimate (MLE) θ̂ of a parameter θ is the value that gives the highest likelihood, i.e.

L(θ̂) = max_θ L(θ)
Calculation of the MLE
We sometimes simplify the calculations by maximising l(θ) = logL(θ) rather than L(θ).
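A sketch for Bernoulli trials (hypothetical counts s = 7 successes in n = 20 trials): the log likelihood is l(θ) = s log θ + (n − s) log(1 − θ), maximised in closed form at θ̂ = s/n, which we confirm by a crude grid search.

```python
import math

s, n = 7, 20

def loglik(theta):
    """Bernoulli log likelihood l(theta) for s successes in n trials."""
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

# Grid search over (0, 1) versus the closed-form MLE s/n.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=loglik)

assert abs(theta_grid - s / n) < 0.002   # grid maximiser ≈ 0.35 = s/n
assert loglik(s / n) >= loglik(0.2)      # s/n beats any other value
```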
6.7.5 M-estimation
This is a generalisation of the maximum likelihood estimation. We maximise a function of the form
ρ(θ; Y) = Σ_{j=1}^{n} ρ(θ; Yj)
where ρ(θ; y) is, if possible, concave as a function of θ for all y. We choose ρ to obtain estimatorswith suitable properties such as small variance or robustness to outliers.
Note that for ρ(θ; y) = log f(y; θ) we recover the maximum likelihood estimator.
6.7.6 Bias
The bias of an estimator θ̂ of a parameter θ is

b(θ̂) = E(θ̂) − θ

If b(θ̂) < 0 for all θ, then θ̂ tends to underestimate θ, and vice versa. If b(θ̂) = 0 for all θ, then θ̂ is said to be unbiased.
6.7.7 Mean Square Error
The mean square error (MSE) of the estimator θ̂ of θ is

MSE(θ̂) = E[ (θ̂ − θ)² ] = var(θ̂) + b(θ̂)²

This is the average squared distance between θ̂ and θ.

Let θ̂1 and θ̂2 be two unbiased estimators of the same parameter θ. Then if

MSE(θ̂1) = var(θ̂1) ≤ var(θ̂2) = MSE(θ̂2)

we say that θ̂1 is more efficient than θ̂2.
6.7.8 Delta method
Let θ̂ be an estimator based on a sample of size n, such that

θ̂ ∼ N(θ, v/n) as n → ∞

and let g be a smooth function such that g′(θ) ≠ 0. Then

g(θ̂) ∼ N( g(θ) + v g″(θ)/(2n), v g′(θ)²/n ) as n → ∞

This implies that the mean square error of the estimator g(θ̂) of g(θ) is

MSE( g(θ̂) ) = ( v g″(θ)/(2n) )² + v g′(θ)²/n

which implies that, for large n, we can neglect the bias.
6.8 Interval Estimation
6.8.1 Pivots
We want to give an interval that contains the unknown parameter with a given probability. This interval widens when the size of the sample decreases.

Let Y = (Y1, ..., Yn) be sampled from a distribution F with parameter θ. Then a pivot is a function Q = q(Y, θ) of the data and the parameter θ whose distribution FQ is known and does not depend on θ.
6.8.2 Confidence intervals
Let Y = (Y1, ..., Yn) be data from a parametric statistical model with scalar parameter θ. A confidence interval (CI) (L, U) for θ, with lower and upper confidence bounds L and U, is a random interval that contains θ with a specified probability called the confidence level.
If we write P (θ < L) = αL and P (U < θ) = αU , then (L,U) has confidence level
P (L ≤ θ ≤ U) = 1− αL − αU
If αL = αU , we say that we have an equi-tailed (1− αL − αU ) · 100% confidence interval.
6.8.3 Construction of a CI
• We find a pivot Q = q(Y, θ) involving θ.
• We obtain the quantiles q_{αU} and q_{1−αL} of Q.
• We transform the statement

P( q_{αU} ≤ q(Y, θ) ≤ q_{1−αL} ) = 1 − αL − αU

into

P( L ≤ θ ≤ U ) = 1 − αL − αU

where L and U depend on Y, q_{αU} and q_{1−αL}, but not on θ.
6.8.4 One and two-sided intervals
As opposed to a two-sided interval (L, U), we can use one-sided confidence intervals of the form (−∞, U) or (L, ∞), obtained by taking αL = 0 or αU = 0 respectively.
6.8.5 Standard Errors
Let T = t(Y1, ..., Yn) be an estimator of θ, let τn² = var(T) be its variance, and let V = v(Y1, ..., Yn) be an estimator of τn². We call V^{1/2}, or its realisation v^{1/2}, a standard error for T.
Theorem
Let T be an estimator of θ based on a sample of size n, with

(T − θ)/τn →D Z and V/τn² →P 1 as n → ∞

where Z ∼ N(0, 1). Then

(T − θ)/V^{1/2} = ( (T − θ)/τn ) × ( τn/V^{1/2} ) →D Z as n → ∞

which implies that, when basing a confidence interval on the Central Limit Theorem, we can replace τn with V^{1/2}.
6.8.6 Normal Random Sample
If Y1, ..., Yn ∼iid N(µ, σ²), then

Ȳ ∼ N(µ, σ²/n) and (n − 1)S² = Σ_{j=1}^{n} (Yj − Ȳ)² ∼ σ²χ²_{n−1}

are independent, where χ²_v denotes the chi-square distribution with v degrees of freedom.

The first result implies that if σ² is known, then

Z = (Ȳ − µ)/√(σ²/n) ∼ N(0, 1)

is a pivot that provides an exact (1 − αL − αU) confidence interval for µ of the form

(L, U) = ( Ȳ − (σ/√n) z_{1−αL}, Ȳ − (σ/√n) z_{αU} )

where zp is the p quantile of the standard normal distribution, i.e. zp = Φ⁻¹(p).
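A sketch of the known-σ interval (hypothetical simulated data, fixed seed; `NormalDist.inv_cdf` gives zp):

```python
from statistics import NormalDist
import random

# Simulated sample with KNOWN sigma (hypothetical numbers).
rng = random.Random(3)
mu, sigma, n = 10.0, 2.0, 100
ys = [rng.gauss(mu, sigma) for _ in range(n)]
ybar = sum(ys) / n

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{0.975} ≈ 1.96
half = sigma / n ** 0.5 * z
L, U = ybar - half, ybar + half            # equi-tailed 95% CI for mu

# The width is fixed at 2 z sigma / sqrt(n); over repeated samples the
# interval contains the true mu with probability 0.95.
assert abs(z - 1.959964) < 1e-4
assert abs((U - L) - 2 * z * sigma / n ** 0.5) < 1e-12
```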
6.8.7 Unknown variance
Usually, σ² is unknown. In this case,

(Ȳ − µ)/√(S²/n) ∼ t_{n−1} and (n − 1)S²/σ² ∼ χ²_{n−1}

are pivots that provide confidence intervals for µ and σ² respectively, of the form

(L, U) = ( Ȳ − (S/√n) t_{n−1}(1 − αL), Ȳ − (S/√n) t_{n−1}(αU) )

(L, U) = ( (n − 1)S²/χ²_{n−1}(1 − αL), (n − 1)S²/χ²_{n−1}(αU) )

where

• t_v(p) is the p quantile of the Student t distribution with v degrees of freedom
• χ²_v(p) is the p quantile of the chi-square distribution with v degrees of freedom.

Note : for symmetric distributions like the Student t (and unlike the chi-square), the quantiles satisfy zp = −z_{1−p}, so the equi-tailed (1 − α) · 100% confidence intervals have the form Ȳ ± (σ/√n) z_{1−α/2}.
6.8.8 Confidence intervals and tests
We can use CIs to assess the plausibility of a value θ0 of θ :

• If θ0 lies inside a (1 − α) · 100% CI, then we cannot reject the hypothesis that θ = θ0 at significance level α.
• If θ0 lies outside a (1 − α) · 100% CI, then we reject the hypothesis that θ = θ0 at significance level α.

Hence the smaller α is when we do reject, the stronger the evidence against θ0.
6.8.9 Null and Alternative Hypotheses
In general, we use data to decide between two hypotheses :

• The null hypothesis H0, which represents the theory or model we want to test (for coin tosses, H0 is that the coin is fair).
• The alternative hypothesis H1, which represents what happens if H0 is false (the coin is not fair).

There are hence two types of errors when we decide between these two :

• False positive : H0 is true but we reject it.
• False negative : H0 is false but we accept it.
Simple and composite hypotheses
A simple hypothesis entirely fixes the distribution of the data Y, whereas a composite hypothesis does not fix the distribution of Y.
ROC curve
The receiver operating characteristic (ROC) curve of a test plots β(t) = P1(T > t) against α(t) = P0(T > t) as the cut-off value t varies, i.e. it shows (P0(T > t), P1(T > t)) for all t ∈ R.
6.8.10 Size and power
As the difference in µ between the two hypotheses increases, it becomes easier to detect when H0 is false.

Let P0(·) and P1(·) be the probabilities computed under the null and alternative hypotheses H0 and H1 respectively. The size and power of a statistical test of H0 against H1 are

size : α = P0(reject H0),  power : β = P1(reject H0)
6.8.11 Pearson statistic
Let O1, ..., Ok be the numbers of observations of a random sample of size n = n1 + ... + nk falling into the categories 1, ..., k, whose expected numbers are E1, ..., Ek with Ei > 0. Then the Pearson statistic, or chi-square statistic, is

T = Σ_{i=1}^{k} (Oi − Ei)²/Ei
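A sketch with hypothetical die-roll counts: under the fair-die null hypothesis the expected counts are Ei = n/6, and T is computed directly from the formula.

```python
# Fair-die example (hypothetical counts): expected E_i = n/6 under H0.
observed = [22, 17, 18, 13, 19, 11]
n = sum(observed)                        # 100
expected = [n / 6.0] * 6

T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# T = 732/150 = 4.88 exactly for these counts; comparing with the
# chi-square quantile chi2_{5}(0.95) ≈ 11.07 (k - 1 = 5 degrees of
# freedom), this sample gives no strong evidence against fairness.
assert abs(T - 4.88) < 1e-9
assert T < 11.07
```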
6.8.12 Chi-square distribution with v degrees of freedom
Let Z1, ..., Zv ∼iid N(0, 1); then W = Z1² + ... + Zv² follows the chi-square distribution with v degrees of freedom, whose density function is

fW(w) = ( 1/(2^{v/2} Γ(v/2)) ) w^{v/2−1} e^{−w/2},  w > 0, v = 1, 2, ...

where Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du for a > 0.
• If Oi ≈ Ei for all i, then T will be small.
• If the joint distribution of O1, ..., Ok is multinomial with denominator n and probabilities pi = Ei/n, then each Oi ∼ B(n, pi) and

E(Oi) = npi = Ei,  var(Oi) = npi(1 − pi) = Ei(1 − Ei/n) ≈ Ei

thus Zi = (Oi − Ei)/√Ei is approximately N(0, 1) for large n, and

T = Σ_{i=1}^{k} (Oi − Ei)²/Ei = Σ_{i=1}^{k} Zi² ∼ χ²_{k−1}

approximately; the Zi are not independent, since O1 + ... + Ok = n, and this constraint costs one degree of freedom.
6.8.13 Evidence and P-values
A statistical hypothesis test has the following elements :

• A null hypothesis H0 to be tested against an alternative hypothesis H1.
• Data from which we compute a test statistic T, chosen such that large values of T provide evidence against H0.
• The observed value tobs of T, which we compare with the null distribution of T, i.e. its distribution under H0.
• The P-value

pobs = P0(T ≥ tobs)

where a small pobs suggests that H0 is false (or that something unlikely has occurred).
• If pobs < α, we say that the test is significant at level α.
• We reject H0 if pobs < α, and accept it otherwise.
Chapter 7
Likelihood
Basic Idea For a value of θ which is not very credible, the density of the data will be smaller : the higher the density, the more credible the corresponding θ. In other words, if y1, ..., yn are the results of independent trials, we have

f(y1, ..., yn; θ) = Π_{j=1}^{n} f(yj; θ)

which, seen as a function of θ, we call the likelihood L(θ).
7.0.1 Relative likelihood
To compare values of θ, we only need to consider the ratio between them :

L(θ1)/L(θ2) = c

which means that θ1 is c times more plausible than θ2.

The most plausible value θ̂ is called the maximum likelihood estimate and satisfies

L(θ̂) ≥ L(θ) for all θ

We can equivalently maximise the log likelihood

l(θ) = log L(θ)

The relative likelihood RL(θ) = L(θ)/L(θ̂) gives the plausibility of θ with respect to θ̂.
7.1 Scalar Parameter
7.1.1 Information
The observed information J(θ) and the expected information, or Fisher information, I(θ) are

J(θ) = −d²l(θ)/dθ²,  I(θ) = E[ J(θ) ]

They measure the curvature of −l(θ); the larger they are, the more concentrated the likelihood is.
7.1.2 Limit distribution of the MLE
Let Y1, ..., Yn be a random sample from a parametric density f(y; θ) and let θ̂ be the MLE of θ. If f satisfies regularity conditions, then

J(θ̂)^{1/2} (θ̂ − θ) →D N(0, 1) as n → ∞

thus for large n, approximately,

θ̂ ∼ N( θ, J(θ̂)⁻¹ )

We can therefore use this to compute two-sided equi-tailed CIs for θ :

(L, U) = ( θ̂ − J(θ̂)^{−1/2} z_{1−α/2}, θ̂ + J(θ̂)^{−1/2} z_{1−α/2} )

One can show that for large n and a regular model, no estimator has a smaller variance and a narrower CI than such a θ̂.
7.1.3 Likelihood ratio statistic
Sometimes it is unreasonable to use a CI based on the normal limit distribution of θ̂. In this case we use l(θ).

The likelihood ratio statistic is

W(θ) = 2( l(θ̂) − l(θ) )

In addition, if θ0 is the value of θ that generated the data and θ̂ has a normal limit distribution, then

W(θ0) →D χ²₁ as n → ∞

or in other words, W(θ0) ∼ χ²₁ approximately for large n.
7.1.4 Regularity
We talked about regularity conditions, which are quite complicated. Situations where they are falseare often cases where
• one of the parameters is discrete;
• the support of f(y; θ) depends on θ;
• the true θ is on the boundary of its possible values.

In the majority of cases, though, they are satisfied.
7.1.5 Vector Parameter
If θ is a vector of dimension p, then the above definitions hold with some slight changes :

• the MLE θ̂ often satisfies the vector equation dl(θ)/dθ = 0_{p×1}
• J(θ) and I(θ) are p × p matrices
• in regular cases, approximately, θ̂ ∼ Np( θ, J(θ̂)⁻¹ )
7.1.6 Nested models
In some cases where we have multiple parameters, we want to test a model in which one parameter takes a specified value while the other parameters are not restricted. For instance, in the normal model we might want to compare a general model against a simple model, respectively

θ = (µ, σ²) ∈ R × R₊
θ = (µ, σ²) ∈ {µ0} × R₊

In such a situation, where one model can become the other when some parameters are restricted, we say that the simpler model is nested in the general model.
7.1.7 Likelihood ratio statistic
Take two nested models with corresponding MLEs

θ̂ = (φ̂, λ̂) and θ̂0 = (φ0, λ̂0)

where l(θ̂) ≥ l(θ̂0), and write the likelihood ratio statistic

W(φ0) = 2( l(θ̂) − l(θ̂0) )

Then if the simpler model is true, i.e. φ = φ0, we have

W(φ0) →D χ²_q as n → ∞

where q is the dimension of φ.
7.1.8 Simple linear regression
Let Y be a random variable, the response variable, that depends on a variable x, the explanatory variable. A simple model that describes linear dependence of E(Y) on x is
Y ∼ N(β0 + β1x, σ2)
where β0, β1 and σ2 are the unknown parameters.
Chapter 8
BAYESIAN INFERENCE
8.1 Bayesian Inference
Up to now we have supposed that all the information about θ comes from the data y. But if we have prior knowledge about θ in the form of a prior density π(θ), we can use Bayes' Theorem to compute the posterior density for θ conditional on y:

π(θ | y) = f(y | θ) π(θ) / f(y)

The difference from the previous chapters is that the observed data y are fixed and θ is regarded as a random variable.

In order to do this, we need prior information π(θ), which may be based on data separate from y, or on an objective or subjective notion of what we believe about θ.
8.1.1 Application of Bayes’ Theorem
We suppose that θ has density π(θ) and that Y conditional on θ has density f(y|θ). Then the conditional density of θ given Y = y is

π(θ | y) = f(y | θ) π(θ) / f(y)

where we know how to compute f(y) from f(y|θ) and π(θ).

We can use Bayes' Theorem to update the prior density for θ to a posterior density for θ.
We can use Bayes’ Theorem to update the prior density for θ to a posterior density for θ.
8.1.2 Beta(a,b) density
The Beta(a, b) density for θ ∈ (0, 1) has the form

π(θ) = θ^{a−1} (1 − θ)^{b−1} / B(a, b),  0 < θ < 1, a, b > 0

where a and b are parameters, B(a, b) = Γ(a)Γ(b)/Γ(a + b) is the beta function and Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du for a > 0. Note that for a = b = 1 this gives the U(0, 1) distribution.
If θ ∼ Beta(a, b), then

E(θ) = a/(a + b),  var(θ) = ab/( (a + b + 1)(a + b)² )
8.1.3 Properties of π(θ|y)

We can of course compute the posterior expectation and the posterior variance of this density, respectively E(θ|y) and var(θ|y). We can also compute the Maximum A Posteriori (MAP) estimator θ̂ such that for all θ,

π(θ̂ | y) ≥ π(θ | y)
8.1.4 Point estimation and loss functions
The choice of estimate when constructing an estimator based on data y is important, and to make the best decision we might want to minimise the expected loss from a bad decision. If Y ∼ f(y; θ), then the loss function R(y; θ) is a non-negative function of Y and θ. The expected posterior loss is

E( R(y; θ) | y ) = ∫ R(y; θ) π(θ | y) dθ
8.1.5 Interval estimation and credibility intervals
The Bayesian analogue of the (1 − α) × 100% CI for θ is the (1 − α) credibility interval for θ, obtained using the α/2 and 1 − α/2 quantiles of π(θ | y).
8.1.6 Conjugate densities
Particular combinations of data and prior densities give posterior densities of the same form as the prior densities. For example,

θ ∼ Beta(a, b), with data s successes out of n trials, gives θ | y ∼ Beta(a + s, b + n − s)

where the data s ∼ B(n, θ) correspond to s successes out of n independent trials with success probability θ.

We say that the beta density is the conjugate prior density of the binomial distribution : if the likelihood is proportional to θ^s (1 − θ)^{n−s}, then choosing a beta prior for θ ensures that the posterior density of θ is also beta, with updated parameters.
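The conjugate update is a one-line computation. A sketch with hypothetical data (7 heads in 10 tosses, uniform Beta(1, 1) prior):

```python
# Beta(a, b) prior + s successes in n Bernoulli(theta) trials
# => Beta(a + s, b + n - s) posterior.
def posterior(a, b, s, n):
    return a + s, b + n - s

def beta_mean(a, b):
    return a / (a + b)

a, b = 1, 1            # uniform prior, U(0, 1)
s, n = 7, 10           # observed 7 heads in 10 tosses
a_post, b_post = posterior(a, b, s, n)

assert (a_post, b_post) == (8, 4)
assert abs(beta_mean(a_post, b_post) - 8 / 12) < 1e-12
# The posterior mean (2/3) sits between the prior mean (1/2)
# and the MLE s/n = 0.7.
assert beta_mean(a, b) < beta_mean(a_post, b_post) < s / n
```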
8.1.7 Prediction of a future random variable Z

“Will the next result be tails or heads?” Use Bayes' Theorem to calculate the posterior density of Z given Y = y, P(Z = z | Y = y).