
Probability and Statistical Inference

Probability Spaces

Random Events Outcomes of a random process are elements of the sample space, the set of all possible outcomes. We denote the sample space by Ω, with individual outcomes ω ∈ Ω.

Events are subsets of the sample space for which probability is defined. When we say an event A "occurs", we mean that the realised outcome ω is an element of A.

An event might be defined by a condition on the outcome, e.g. A = {the roll of a die is even}.

Sigma-Algebra A family F of subsets of Ω is said to be a σ-algebra if: Ω ∈ F; A ∈ F implies Aᶜ ∈ F; and A₁, A₂, … ∈ F implies ∪ₙ Aₙ ∈ F (and hence ∩ₙ Aₙ ∈ F).

In words, a sigma-algebra is closed under complementation and countable union and intersection.

Extreme cases are the smallest possible σ-algebra, {∅, Ω},

and the largest possible σ-algebra, the power set of Ω (all subsets of Ω).

The σ-algebra generated by a collection of sets is the smallest σ-algebra that contains all of those sets.

To find the σ-algebra generated by a given RV X, simply take the inverse images X⁻¹(B) of all Borel sets B.

Probability Probability is a set function P : F → [0, 1] defined on the σ-algebra.

It has the following three properties: P(A) ≥ 0 for every A ∈ F; P(Ω) = 1; and countable additivity, P(∪ₙ Aₙ) = Σₙ P(Aₙ)


for any pairwise disjoint events A₁, A₂, … ∈ F.

Boole's Inequality This derives from subadditivity, which states that P(A ∪ B) ≤ P(A) + P(B).

This implies that P(∪ₙ Aₙ) ≤ Σₙ P(Aₙ).

This holds with equality if the Aₙ are pairwise disjoint.

De Morgan's Law (∪ₙ Aₙ)ᶜ = ∩ₙ Aₙᶜ and (∩ₙ Aₙ)ᶜ = ∪ₙ Aₙᶜ.

Borel-Cantelli Lemma If Σₙ P(Aₙ) < ∞, then with probability one only finitely many of the events Aₙ occur.

Proof

Distribution Function The distribution function of a probability measure P on ℝ is the function F(x) = P((−∞, x]).

Set Theory Notation Set difference: A \ B = A ∩ Bᶜ.

Symmetric set difference: A △ B = (A \ B) ∪ (B \ A).

Disjoint sets: A ∩ B = ∅.


Aₙ occurs infinitely often: ∩ₙ ∪ₖ≥ₙ Aₖ (the limsup of the events).

Aₙ occurs only finitely often: the complement of the above, ∪ₙ ∩ₖ≥ₙ Aₖᶜ.

Only one of A₁, …, Aₙ occurred: ∪ᵢ (Aᵢ ∩ ∩_{j≠i} Aⱼᶜ).

Three or more of A₁, …, Aₙ occurred: the union over all triples i < j < k of Aᵢ ∩ Aⱼ ∩ Aₖ.

Binomial Coefficients

Useful Series and Limits

Random Variables

Random Variable A random variable is a function of chance, X : Ω → ℝ,

with the property that for all Borel subsets B ⊆ ℝ the inverse image X⁻¹(B) is an event.


This means that the inverse image under X of any interval of ℝ must be an event, i.e. an element of the σ-algebra F.

Note the following notation:

A function X : Ω → ℝ is a random variable if {ω : X(ω) ≤ x} ∈ F for every x ∈ ℝ.

Properties of preimages, where B is a subset of ℝ:

The σ-algebra generated by X is given by σ(X) = {X⁻¹(B) : B a Borel set}.

Indicator Variables An indicator function can be used instead of sets. It is defined as follows: 1_A(ω) = 1 if ω ∈ A and 0 otherwise.

For an event A, we have:

Simple Random Variable Thus we can define simple random variables as weighted sums of indicators of events in the σ-algebra: X = Σᵢ xᵢ 1_{Aᵢ}.

Random Vectors A random vector has the form X = (X₁, …, Xₙ), where each Xᵢ is a random variable on the same probability space.

Its distribution is given by F(x₁, …, xₙ) = P(X₁ ≤ x₁, …, Xₙ ≤ xₙ).

The density, when it exists, is the function f satisfying F(x₁, …, xₙ) = ∫_{−∞}^{x₁} ⋯ ∫_{−∞}^{xₙ} f(u₁, …, uₙ) duₙ ⋯ du₁.


Note that if the constituent variables are independent, the overall distribution will be given as a

product.

Combinations of Random Variables Any continuous function of a random variable is also a random variable.

A constant times a random variable is still a random variable

The sum of two random variables is also a random variable

The product of two random variables is also a random variable

Convolution Formula The distribution of the sum of two independent integer-valued random variables can be found by P(X + Y = k) = Σⱼ P(X = j) P(Y = k − j).
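A minimal sketch of this discrete convolution in Python with NumPy, assuming both variables take values in {0, 1, 2, …} so their pmfs can be stored as arrays indexed by value:

```python
import numpy as np

def convolve_pmfs(p_x, p_y):
    """pmf of X + Y for independent, non-negative integer-valued X and Y.

    p_x[j] = P(X = j), p_y[j] = P(Y = j); entry k of the result equals
    sum_j P(X = j) * P(Y = k - j), i.e. the convolution formula above.
    """
    return np.convolve(p_x, p_y)

# Illustrative example: two fair dice, with values shifted to start at 0
die = np.full(6, 1 / 6)
pmf_sum = convolve_pmfs(die, die)   # pmf of (X - 1) + (Y - 1)
print(pmf_sum[5])                   # P(X + Y = 7) = 6/36 ≈ 0.1667
```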

Distribution and Density Functions The distribution of X is the measure P_X(B) = P(X ∈ B) on the Borel sets.

The distribution function is defined as F(x) = P(X ≤ x).

The survival function is simply S(x) = 1 − F(x) = P(X > x).

In the multivariate case: F(x₁, …, xₙ) = P(X₁ ≤ x₁, …, Xₙ ≤ xₙ).

For absolutely continuous distributions there is also a density function f satisfying F(x) = ∫_{−∞}^{x} f(u) du.

For discrete random variables the probability mass function p(x) = P(X = x) is often used instead of the density.

Transformations of Random Variables If X is a random variable, g is an increasing and continuous function on the range of X, and Y = g(X), then the transformed random variable Y has the distribution F_Y(y) = F_X(g⁻¹(y)).


Since {g(X) ≤ y} = {X ≤ g⁻¹(y)} when g is increasing.

If we have the relevant densities then we can also represent this as f_Y(y) = f_X(g⁻¹(y)) · (g⁻¹)′(y).

For two absolutely continuous random vectors related by a smooth invertible map, the same formula holds,

where the derivative is replaced by the absolute value of the Jacobian determinant of the inverse transformation.
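A quick numerical check of the univariate transformation rule, assuming (for illustration only) X ~ Exp(1) and the increasing map g(x) = √x: the empirical distribution of Y = g(X) is compared with F_Y(y) = F_X(g⁻¹(y)) = 1 − exp(−y²).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # X ~ Exp(1)
y = np.sqrt(x)                                  # Y = g(X), g increasing

y0 = 1.2
empirical = np.mean(y <= y0)                    # estimate of P(Y <= y0)
theoretical = 1 - np.exp(-y0**2)                # F_X(g^{-1}(y0)), g^{-1}(y) = y^2
print(empirical, theoretical)                   # should agree to ~2-3 decimals
```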

Independent Random Variables Two random variables are independent if P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) for all Borel sets A and B.

Proof

By definition of independence we have

Sum of Independent Poissons If X and Y are independent random variables with Poisson distributions of parameters λ and μ, then their sum is Poisson distributed with parameter λ + μ.
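A short simulation sketch of this fact (the parameter values λ = 2 and μ = 3 are illustrative assumptions), comparing the empirical distribution of the sum with the Poisson(λ + μ) pmf:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
lam, mu, n = 2.0, 3.0, 200_000

s = rng.poisson(lam, n) + rng.poisson(mu, n)    # X + Y, X and Y independent

for k in range(4, 8):
    # empirical frequency of {X + Y = k} vs the Poisson(lam + mu) pmf
    print(k, np.mean(s == k), poisson.pmf(k, lam + mu))
```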

Independent Events Events are called independent if their indicators are independent random variables.

Note that pairwise independence is not the same thing as full (mutual) independence.


Expectations and Correlations

Expectation The expected value of a continuous variable is the integral of the random variable with respect to its

probability measure.

More generally for :

In the discrete case: E[X] = Σᵢ xᵢ P(X = xᵢ).

Relative frequency interpretation:

Simple Random Variables A simple random variable takes on only finitely many values, and has the form X = Σᵢ xᵢ 1_{Aᵢ},

where the xᵢ are (possibly zero) real numbers, and the Aᵢ are events that form a partition of Ω.

The expected value of a simple random variable, which does not depend on the chosen representation of X, is E[X] = Σᵢ xᵢ P(Aᵢ).


Expectation is linear on simple random variables: E[aX + bY] = a E[X] + b E[Y].

Limiting Expectations Any non-negative random variable can be approximated by an increasing sequence of simple

random variables:

It also turns out that such sequences are consistent, meaning that regardless of the choice of approximating sequence, the value of the limit of the expectations will always be the same.

Proof

Fix and let . Let be an increasing sequence of simple random variables.

So obviously as .

By the definition of it obviously follows that:

Thus we have:

Since probabilities are strictly positive and less than one:


as , meaning that . Hence:

Now let be any arbitrary sequence that equals :

By symmetry:

Hence:

Properties of Expectation Integrability: if X is integrable, meaning that E|X| < ∞, then E[X] is well defined and finite.

Monotonicity: if X ≤ Y and both are integrable, then E[X] ≤ E[Y].

Linearity: if X and Y are integrable, then E[aX + bY] = a E[X] + b E[Y].

Corollary:

The Lebesgue Integral For the Riemann integral, we partition the domain of integration, usually part of ℝ or ℝⁿ. However, for the Lebesgue integral, we partition the range of the integrand, which can be defined on an abstract set like Ω. This makes this conception of integration much broader. When integrating nice functions on ℝ, both integrals give the same answer.

Let F be the distribution function of X. Then we have:

Thus we can write the Lebesgue integral:

In the discrete case:


Expectation and Distribution Theorem For X ≥ 0: E[X] = ∫₀^∞ P(X > x) dx.

Proof:

For the general case:

Functions of Random Variables For a random variable X, if g is 'nice' enough, then we have E[g(X)] = ∫ g(x) dF(x).

In the discrete case this becomes E[g(X)] = Σᵢ g(xᵢ) P(X = xᵢ).

If X and Y are independent then E[XY] = E[X] E[Y].

Proof

Let

Now let


Moments The k-th moment of a random variable X is defined as E[X^k].

The k-th central moment of a random variable X is defined as E[(X − E[X])^k].

If you know all the moments of X, you’ll know its distribution as well (under broad conditions).

Jensen's Inequality For an integrable X and g a convex function: E[g(X)] ≥ g(E[X]).

Proof

For a convex function g, for any point x₀ there is a constant c such that g(x) ≥ g(x₀) + c(x − x₀) for all x (a supporting line).

Thus we have, letting x₀ = E[X]:

Lyapunov's Inequality For 0 < s < t: (E|X|^s)^{1/s} ≤ (E|X|^t)^{1/t}.


This implies, in particular, that if the t-th moment is finite, then so is the s-th one for any s < t. Thus, if the second moment is finite, then the expectation must be finite, too.

This is actually a special case of Jensen's inequality with a suitable convex power function.

Markov's/Chebyshev's Inequality If g is a positive, non-decreasing function, then for any random variable X and number a: P(X ≥ a) ≤ E[g(X)] / g(a).
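As a numerical sanity check of the inequality in its Chebyshev form (g(x) = x² applied to |X − EX|, a hypothetical choice for illustration), the sketch below compares the exact tail probability of a standard normal with the bound 1/a²:

```python
import numpy as np
from scipy.stats import norm

for a in [1.0, 2.0, 3.0]:
    tail = 2 * (1 - norm.cdf(a))   # P(|X| >= a) for X ~ N(0, 1)
    bound = 1 / a**2               # Chebyshev bound: Var(X) / a^2 with Var(X) = 1
    print(a, tail, bound)          # the exact tail always sits below the bound
```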

Proof

Cauchy-Bunyakovsky Inequality If E[X²] < ∞ and E[Y²] < ∞, then E|XY| < ∞ and (E[XY])² ≤ E[X²] E[Y²].

For , and knowing that , it follows that:

Correlation Correlation is a measure of the degree of linear association between two variables. Zero correlation does not imply independence, except in special cases such as jointly normal variables. It is defined as ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y),

where covariance is an example of a mixed moment: Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

Correlation can be thought of in terms of orthogonality:

There is also another interesting property of correlation:

Independence is defined as:


So it follows that:

Proof

Let , where are both standardised so and

Now let

. By Markov's inequality:

Since we know that:

So it must hold for all n that , meaning that

. It therefore follows that:
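To illustrate that zero correlation does not imply independence, here is a small sketch (the choice X ~ N(0, 1) and Y = X² is an illustrative assumption): the sample correlation is close to zero even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
y = x**2                                  # completely determined by x

print(np.corrcoef(x, y)[0, 1])            # ~0: E[X^3] = 0, so Cov(X, X^2) = 0
print(np.corrcoef(x, 2 * x + 1)[0, 1])    # ~1: perfect linear association
```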

Covariance Matrices These are used when dealing with random vectors. For a random row vector X, the covariance matrix E[(X − E[X])ᵀ(X − E[X])]

is always symmetric and positive semi-definite (non-negative definite).

Proof

Let

Since
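A quick empirical check of the claim, assuming (for illustration) a 3-dimensional random vector built from correlated normals: the sample covariance matrix is symmetric and all of its eigenvalues are non-negative.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal((10_000, 3))
x = z @ np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 1.0]])        # induce some correlation

cov = np.cov(x, rowvar=False)              # 3x3 sample covariance matrix
print(np.allclose(cov, cov.T))             # symmetric
print(np.linalg.eigvalsh(cov) >= -1e-12)   # eigenvalues non-negative (PSD)
```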


For a multivariate normal distribution where and :

In the more general case:

Note that if matrix is orthogonal then and hence .

Conditional Expectations

Conditional Expectations A conditional expectation is the expected value of a real random variable with respect to a conditional probability distribution. It can be written in the form E[Y | X].

CEs are not numbers but RVs themselves, and in fact are functions of the observed RVs (the

information we have). Specifically, the conditional expectation is always a function of the variable

being conditioned on:

Minimised Mean Square Error To see the motivation for CEs, consider the case where we want to find the expectation of a random variable Y given that event A has occurred.

Differentiate to minimise error:


Thus we see that the definition of conditional expectation is that value which minimises the mean quadratic error for predicting Y given that A has occurred.

Also note that it is only the information contained in the values of the conditioning variable which is needed, not the actual values themselves. Thus for any one-to-one function g, we have E[Y | X] = E[Y | g(X)].

Computing CEs Usually the easiest way to compute conditional expectations is to use the formula E[Y | A] = E[Y 1_A] / P(A).

That is, simply take the average of the values over A, ignoring the rest of the sample space.

An alternative formula is:
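A minimal sketch of the "average over the event" formula E[Y | A] = E[Y 1_A] / P(A), assuming for illustration that Y is a die roll and A is the event that the roll is even:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.integers(1, 7, size=100_000)      # die rolls
a = (y % 2 == 0)                          # event A: roll is even

cond_exp = (y * a).mean() / a.mean()      # E[Y 1_A] / P(A)
print(cond_exp)                           # ~4 = (2 + 4 + 6) / 3
```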

Poisson Sums To find the expected value of :

For :

Since and are independent:


Which, by expanding the binomial coefficients, is equal to:

which can easily be confirmed to be simply:

Conditional Distribution For absolutely continuous random variables the conditional density is f(y | x) = f(x, y) / f_X(x).

The conditional expectation can then be found as E[Y | X = x] = ∫ y f(y | x) dy.

Note that the conditional distribution is a function of both variables. The y will be integrated out when computing the conditional expectation, resulting in a function of x only.

Properties of CEs Linearity

Proof

Monotonicity

Constants of Conditioning

Independence

If Y is independent of X then E[Y | X] = E[Y].

Double Expectation Law E[E[Y | X]] = E[Y].

And equivalently:


Proof

Convolution Formula If X and Y are independent continuous random variables then f_{X+Y}(z) = ∫ f_X(x) f_Y(z − x) dx.

Don't forget to apply relevant domains for variables (e.g. exponential and Poisson distributions) as

endpoints on the integral.

For positive integer-valued independent random variables: P(X + Y = n) = Σₖ P(X = k) P(Y = n − k).

Law of Total Probability P(A) = Σₙ P(A | Bₙ) P(Bₙ) for any partition B₁, B₂, … of Ω.

Statistical Inference

Sufficient Statistics Any measurable function of the sample observation is called a statistic.

An estimator is a statistic that is used to determine unknown parameters of the probability distribution. Statistics are themselves random variables.

A statistic T is said to be sufficient for parameter θ if the conditional distribution of the sample given T does not depend upon θ. All the information about θ is stored in the single value T, so if we only care about estimating θ, we can throw away all the other sample data and need only keep T.

A sufficient statistic is minimal if it can be written as a function of any other sufficient statistic:

Order Statistics The distribution of the first (lowest) order statistic is given by P(X₍₁₎ ≤ x) = 1 − (1 − F(x))ⁿ.

The distribution of the highest order statistic is given by P(X₍ₙ₎ ≤ x) = F(x)ⁿ.
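A small simulation sketch checking both formulas, assuming for illustration an iid Uniform(0, 1) sample of size n = 5 (so F(x) = x):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps, x = 5, 100_000, 0.4
u = rng.uniform(size=(reps, n))           # iid Uniform(0, 1) samples, F(x) = x

print(np.mean(u.min(axis=1) <= x), 1 - (1 - x) ** n)   # first order statistic
print(np.mean(u.max(axis=1) <= x), x ** n)             # highest order statistic
```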


Neyman-Fisher Factorisation For an absolutely continuous probability over some measure, a statistic T will be sufficient for θ

if and only if we can write the density as a product of a factor depending on the data only through T(x) and θ, and a factor depending on the data alone: f(x; θ) = g(T(x), θ) h(x).

Note that one of the factors can be identically equal to 1.
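As an illustration of sufficiency (not of the proof below), a sketch under the assumption of an iid Bernoulli(p) sample: the likelihood p^T (1 − p)^(n − T) depends on the data only through T = Σ xᵢ, matching the factorisation with h(x) = 1, so two samples with the same total give identical likelihood functions.

```python
import numpy as np

def bernoulli_likelihood(sample, p):
    """Likelihood of an iid Bernoulli(p) sample; depends on the data
    only through T(x) = sum(x)."""
    t, n = np.sum(sample), len(sample)
    return p**t * (1 - p) ** (n - t)

x1 = np.array([1, 0, 1, 1, 0])            # T = 3
x2 = np.array([0, 1, 1, 0, 1])            # same T = 3, different ordering
for p in [0.2, 0.5, 0.8]:
    print(bernoulli_likelihood(x1, p), bernoulli_likelihood(x2, p))  # equal
```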

Proof that factorisation implies independence of theta

Since uniquely determines

Since

Which is independent of

Proof that independence of theta implies factorisation

Since by assumption is independent of

Alternative Statement

If T is a sufficient statistic for θ then for any fixed θ:

is a function of T only. In addition, if one can write:


then T is also a minimal sufficient statistic (from slide 136).

Estimator Efficiency The most common method of comparing the efficiency of different estimators is called the mean

quadratic error approach:

In general there does not exist a single estimator which yields a strictly smaller mean quadratic error for all possible θ. To see this, imagine that such an estimator does exist, and compare it with the alternative estimator that is identically equal to some fixed parameter value:

Thus it must be that:

This would mean that we would need to be able to predict θ exactly every time, which is absurd.

However, we can select minimum-error estimators within particular subclasses. Specifically, an estimator from a class K of estimators is called efficient in K if for any other estimator in K:

We often define the class K in terms of the amount of bias present:

When the bias is zero, we say the estimator is unbiased. Efficient estimators in the class of unbiased estimators are called simply efficient.

Uniqueness Efficient estimators in a class K are unique (if they exist).

Proof

Suppose that and

are both efficient in , with error . Then we have:

It can be easily verified that:

Taking expectations:


Since is efficient,

, which means that:

Which implies that:

Hence it follows that

Rao-Blackwell Theorem The Rao-Blackwell Theorem states that the expectation of an estimator conditioned upon a

sufficient statistic is an efficient estimator for theta.

If θ̂ is an estimator and T is a SS for θ, then the efficient estimator is given by the conditional expectation E[θ̂ | T]:

Proof that

Proof that is minimal

Note that the middle term is zero by property of conditional expectations. Hence:

Thus any other estimator in the class must have weakly greater mean quadratic error than the Rao-Blackwellised estimator.

The multivariate version of Rao-Blackwell applies when θ is a vector, using an inner product.

Conditioning on Order Statistics Some important results about conditioning estimators on order statistics, useful in applying the Rao-

Blackwell Theorem.


Similar results for computing probabilities:

Complete Sufficient Statistics A family of distributions of a statistic T is said to be complete if, given a function g, E_θ[g(T)] = 0 for all θ implies g(T) = 0 almost surely.

A statistic is said to be complete if the family of its distributions is complete.

A statistic is complete if and only if every bias class contains only a single unique estimator.

Proof that completeness implies uniqueness

Let and

be from the same bias class. Thus:

Since is complete, it follows that , and hence

Proof that uniqueness implies completeness

Let

be a unique estimator for , and be such that:

Since we can write:

But is unique, so for all theta, meaning is complete.


Minimality and Completeness Any complete sufficient statistic is also a minimal sufficient statistic, but minimal sufficient statistics

need not necessarily be complete.

Proof

Let be complete, and be a minimal sufficient statistic (i.e. ). Consider:

Since , we can write for some function :

However we also know that:

Which means that:

Since is complete it follows that , and thus:

This means that the complete statistic is a function of the minimal statistic, which in turn is a function of the complete one. By the definition of minimality, this implies that they must both be minimal sufficient statistics.

The converse does not hold: e.g. is minimal for but not complete.

Fisher Information The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.

If the function is continuously differentiable in and the integral is non-negative, then:

Rao-Cramer Inequality This theorem relates the amount of information that data can provide about the parameter of a distribution to the variance of estimators of that parameter. For this inequality to hold, the regularity conditions stated above (the density is continuously differentiable in θ and the stated integral condition holds) must be satisfied.

When


For the msq error:

Proof

We will need the following result from slide 144:

Let a statistic , and define :

Differentiate using result from slide 144:

Now consider the case when and :

Since this is zero, just subtract from both sides:

Apply the Cauchy-Bunyakovsky inequality:


By independence:
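As a concrete numerical check of the Cramér-Rao bound in a case where it is attained, assume (for illustration) an iid Bernoulli(p) sample and the unbiased estimator p̂ = X̄: the Fisher information of one observation is I(p) = 1/(p(1 − p)), so Var(p̂) should equal 1/(n I(p)) = p(1 − p)/n.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 50, 200_000

samples = rng.binomial(1, p, size=(reps, n))
p_hat = samples.mean(axis=1)              # unbiased estimator of p

fisher_one_obs = 1.0 / (p * (1 - p))      # I(p) for a single Bernoulli(p) observation
print(p_hat.var(), 1.0 / (n * fisher_one_obs))   # both ~ p(1 - p)/n = 0.0042
```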

Exponential Family The exponential family is an important class of probability distributions sharing a certain form,

specified below.

The exponential families include many of the most common distributions, including the normal,

exponential, gamma, chi-squared, Bernoulli, Poisson, and binomial with a fixed number of trials.

The exponential family is complete, meaning that its sufficient statistics are also minimal.

Convergence of Random Variables

Modes of Convergence For a sequence as means that

Almost Sure Convergence P(Xₙ → X as n → ∞) = 1.

Convergence in Probability P(|Xₙ − X| > ε) → 0 as n → ∞, for every ε > 0.

Quadratic Mean Convergence E[(Xₙ − X)²] → 0 as n → ∞.

Convergence in Mean E|Xₙ − X| → 0 as n → ∞.


Convergence in Distribution F_{Xₙ}(x) → F_X(x)

at all points x where F_X is continuous. Note that the variables do not need to be defined over a common probability space for this to hold.

A sequence of random variables converges in distribution iff for any continuous bounded function f: E[f(Xₙ)] → E[f(X)].

Relationships between Modes

Examples and Counterexamples Counterexample 1: a.s. but not or

Let and

Since

Counterexample 2: a.s. and but not

Let and

Since

Counterexample 3: all except a.s.


Counterexample 4: all except a.s., and

Counterexample 5: convergence in distribution only

This is the case because the variables are not even defined over the same probability space

Convergence of Sums If, for two sequences of random variables Xₙ and Yₙ defined on a common probability space, we have Xₙ → X and Yₙ → Y, then it follows that Xₙ + Yₙ → X + Y.

This applies to all types of convergence except convergence in distribution, as if all we know is that

two variables converge in distribution then they may not even be defined on the same probability

space, in which case asking for the sum is meaningless.

Proof for a.s. case:

Clearly if Xₙ converges on an event A with P(A) = 1 and likewise Yₙ converges on an event B with P(B) = 1, then the sum converges on A ∩ B, which also has probability one.

Proof for P case:


Since the latter two terms converge to zero by assumption, the first term must also converge to zero.

Convergence under Transformations Given any continuous function g, if Xₙ → X then g(Xₙ) → g(X), for all types of convergence.

Proof for P case

The proof uses the fact that any function that is continuous on a closed bounded interval is uniformly continuous there. Uniform continuity states that for every ε > 0 there is a δ > 0 such that |x − y| < δ implies |g(x) − g(y)| < ε. Now for the proof:

Let and . Also define

Choose so large that for some arbitrarily small , we have

Since , for all we can always choose a large enough such that

Thus we have:

However, because g is uniformly continuous on the interval, we can always choose a small enough δ such that whenever |Xₙ − X| < δ we also have |g(Xₙ) − g(X)| < ε. It thus follows that

=0, and hence:

Hence we have proved that .

Proof of dist Case

Let f be a bounded continuous function. Then the composition f ∘ g is also bounded and continuous:

Sums of Bernoulli Random Variables Weak Law of Large Numbers

Let Sₙ be the sum of n Bernoulli random variables, each with success probability p.

Quadratic-mean convergence implies convergence in probability, so prove the former:


Thus

as .
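A small simulation of the statement, assuming Bernoulli(p) summands with p = 0.3 and ε = 0.05 (illustrative choices): the probability that the sample proportion deviates from p by more than ε shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)
p, eps, reps = 0.3, 0.05, 20_000

for n in [10, 100, 1000, 10_000]:
    props = rng.binomial(n, p, size=reps) / n        # S_n / n
    print(n, np.mean(np.abs(props - p) > eps))       # -> 0 as n grows
```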

Strong Law of Large Numbers

Let

. We want to show that this deviation event occurs infinitely often with probability zero.

For this it suffices to show that:

By Markov's inequality:

Let , hence:

Since the Xᵢ are independent and identically distributed, we have:

Thus:


Thus we have:

Thus we have proven that the deviation event occurs infinitely often with probability zero.

Characteristic Functions

Introducing Characteristic Functions The characteristic function of a random variable is given by φ_X(t) = E[e^{itX}].

It always exists, is always finite, and satisfies φ_X(0) = 1 and |φ_X(t)| ≤ 1.

We also know that the characteristic function is real-valued if and only if the distribution of X is symmetric.

The characteristic function is the inverse Fourier transform of the density function.

Example: Standard Normal Distribution
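A quick Monte Carlo confirmation (a sketch, not the derivation) that the standard normal ChF is φ(t) = exp(−t²/2), estimating E[e^{itX}] from samples:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.standard_normal(200_000)

for t in [0.5, 1.0, 2.0]:
    emp = np.mean(np.exp(1j * t * x))      # Monte Carlo estimate of E[e^{itX}]
    print(t, emp.real, np.exp(-t**2 / 2))  # imaginary part ~ 0 by symmetry
```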


Properties of Characteristic Functions Characteristic functions are uniformly continuous across the entire space

(so symmetric if real-valued)

Generating Functions These are an alternative or partner to characteristic functions which can occasionally be useful. For an integer-valued random variable X and an arbitrary number s such that |s| ≤ 1 we have G_X(s) = E[s^X].

It is related to the characteristic function as follows: φ_X(t) = G_X(e^{it}).

If X and Y are independent, then G_{X+Y}(s) = G_X(s) G_Y(s).

If then the th derivative exists and is equal to:

Linearity Let Y = aX + b; then we have φ_Y(t) = e^{itb} φ_X(at).

Uniform Continuity Consider


For all , we can pick large enough such that , and so for

we have:

This proves uniform continuity.

Independence If X and Y are independent random variables, then φ_{X+Y}(t) = φ_X(t) φ_Y(t).

Differentiability

If E|X|ⁿ < ∞ then φ_X is n times continuously differentiable, and φ_X⁽ⁿ⁾(0) = iⁿ E[Xⁿ].

This implies that the 'smoother' φ_X is (i.e. the more times it is differentiable), the lighter are the tails of X, and vice versa.

Inversion Formula and Uniqueness

If ∫ |φ_X(t)| dt < ∞ then X has a continuous density given by f(x) = (1/2π) ∫ e^{−itx} φ_X(t) dt.

Example: double exponential


Now work the other way:

Swap and , and then with

Thus we find that the ChF of the Cauchy distribution is e^{−|t|}.

Continuity Theorems There exists a very straightforward relationship between convergence of random variables and convergence of their characteristic functions:

The forward implication follows from the fact that e^{itx} is a continuous bounded function of x, and from Theorem 5.8 any continuous bounded function of a random variable that converges will also converge.

To prove the converse, use the inversion formula:

Fixing to be a constant we have:


Identifying Characteristic Functions If φ_{Xₙ}(t) → φ(t)

as n → ∞, where φ is continuous at 0, then φ is the ChF of some random

variable X, and Xₙ converges to X in distribution.

Counterexample for discontinuous case: the ChFs converge to 0 for all t ≠ 0, but at t = 0 each ChF gives a value of one. Thus the limiting function is not continuous at 0, and so is not a characteristic function.

Weak Law of Large Numbers If the iid random variables X₁, X₂, … have a finite mean μ, then the sample mean converges to μ:

Using the properties of ChFs:

Using a second order Taylor Series expansion:

Taking the limit as :

Thus by Theorem 6.15, since the ChFs converge, the random variables will also converge.

Central Limit Theorem Given a finite mean μ and variance σ², for iid Xᵢ the standardised sum (Sₙ − nμ)/(σ√n) converges in distribution to N(0, 1):

Proof

Let Yᵢ = (Xᵢ − μ)/σ,

so the Yᵢ are iid with mean 0 and variance 1, and hence:


Expand into a second-order Taylor series:

Thus we have:

Taking the limit as :

In the case where variance is infinite but the mean is still finite, a limiting scaling sequence can still

exist. This is even possible in some cases where the mean is infinite.

Important difference: roughly speaking, the contributions of the individual Xᵢ to the sum are all negligibly small in the case of the CLT, whereas in the case of convergence to a non-normal stable distribution, the main contribution to the sum comes from a small proportion of the Xs (the largest ones!).

One example of such a non-normal stable distribution is the Cauchy distribution, for which the sample mean of n observations has the same distribution as a single observation, meaning that there is no gain in precision in taking many observations compared to a single observation.
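A simulation sketch of the CLT, assuming (purely for illustration) iid Exponential(1) summands, which have mean 1 and variance 1: the standardised sum is compared with the standard normal at a few points.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n, reps = 100, 100_000
x = rng.exponential(1.0, size=(reps, n))          # mean 1, variance 1

z = (x.sum(axis=1) - n) / np.sqrt(n)              # standardised sums
for a in [-1.0, 0.0, 1.0]:
    print(a, np.mean(z <= a), norm.cdf(a))        # empirical CDF vs N(0,1) CDF
```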

Poisson Limit Theorem If the Xᵢ are independent Bernoulli (single-trial binomial) random variables with common success probability pₙ,

and npₙ → λ as n → ∞, then the sum of the Xs, Sₙ, converges in distribution to a Poisson(λ) random variable.
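A quick numerical illustration of the statement, assuming λ = 4 held fixed while n grows and pₙ = λ/n: the Binomial(n, λ/n) pmf approaches the Poisson(λ) pmf.

```python
import numpy as np
from scipy.stats import binom, poisson

lam, k = 4.0, 3
for n in [10, 100, 1000]:
    print(n, binom.pmf(k, n, lam / n), poisson.pmf(k, lam))
# the binomial probabilities converge to the Poisson(4) value as n grows
```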

Proof


Characteristic Functions of Random Vectors The ChF of a random vector X is a function of t ∈ ℝⁿ given by φ_X(t) = E[exp(i tᵀX)].

If Y = AX, where A is some constant matrix, then φ_Y(t) = φ_X(Aᵀt),

where we use the result that tᵀ(AX) = (Aᵀt)ᵀX.

For two independent random variables, the joint ChF must admit the factorisation:

The expectation formula now requires partial differentiation:

A sequence of random vectors will now converge, potentially, to a random vector:

Multivariate Random Walk Let be iid random vectors with and

, such that

Multivariate CLT Let Sₙ be the sum of n random vectors which are iid with mean vector μ

and whose covariance matrix Σ exists. Under these conditions, (Sₙ − nμ)/√n converges in distribution to N(0, Σ).

We also have a bound on the rate of convergence:

Which unfortunately is fairly slow.


Multinomial Distributions The multinomial distribution is a generalization of the binomial distribution. For n independent trials

each of which leads to a success for exactly one of k categories, with each category having a given

fixed success probability, the multinomial distribution gives the probability of any particular

combination of numbers of successes for the various categories.

Let where we have:

Here pⱼ represents the probability that a ball falls into bin j on a given trial. For a fixed j, the indicators of falling into bin j on successive trials are iid, so by the SLLN we have:

We know that (note that this is a scalar), so we can write:

We thus have the elements of the covariance matrix, expanding to fill the matrix we have:

From the multivariate CLT we thus have:
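A short check of the multinomial covariance structure, assuming for illustration 3 categories with probabilities (0.2, 0.3, 0.5) and n = 50 trials: the sample covariance of the counts is compared with n(diag(p) − p pᵀ).

```python
import numpy as np

rng = np.random.default_rng(10)
p = np.array([0.2, 0.3, 0.5])
n, reps = 50, 200_000

counts = rng.multinomial(n, p, size=reps)          # reps x 3 matrix of category counts
emp_cov = np.cov(counts, rowvar=False)             # sample covariance of the counts
theory = n * (np.diag(p) - np.outer(p, p))         # n (diag(p) - p p^T)
print(np.round(emp_cov, 2))
print(np.round(theory, 2))
```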

Application to the Chi-Square Test Suppose we want to test the hypothesis that some distribution is equal to some hypothesised distribution function F₀.

To do this, we could partition the real line into intervals I₁, …, I_k, with νⱼ the number of sample points which fall into the j-th interval. The probability vector here will be p = (p₁, …, p_k) with pⱼ = P₀(Iⱼ) if the hypothesis is true.

Take

. By theorem 5.23 we know that:

Consider now a simplified version of the same test. Let:


Obviously , and so from the multivariate central limit theorem:

And hence by the property of limits of continuous mappings:

The entries of the covariance matrix will be given by:

Thus our test statistic has, asymptotically, a chi-squared distribution with k − 1 degrees of freedom:
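A sketch of the resulting goodness-of-fit test, assuming a fair die as the hypothesised distribution: the statistic Σⱼ (νⱼ − npⱼ)²/(npⱼ) is computed directly and also via scipy.stats.chisquare, which performs the same calculation.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(11)
k, n = 6, 600
p0 = np.full(k, 1 / k)                              # hypothesised probabilities (fair die)

rolls = rng.integers(0, k, size=n)
observed = np.bincount(rolls, minlength=k)          # counts nu_j per category
expected = n * p0

stat = np.sum((observed - expected) ** 2 / expected)
print(stat, chisquare(observed, expected))          # ~ chi^2_{k-1} under the hypothesis
```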

Further Applications to Statistics

Empirical Distribution Function The empirical distribution function, or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample. This cdf is a step function that jumps up by 1/n at each of the n data points. We can write this as F̂ₙ(x) = (1/n) Σᵢ 1{Xᵢ ≤ x}.

The sample mean can easily be extracted from the EDF:

As can the sample variance:


A way of generalising this fact is by introducing the notion of a functional, which is a function of a

function that returns a single number, so any parameter can be written as:
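A minimal sketch of the empirical distribution function and of reading the sample mean off it, assuming an illustrative iid normal sample:

```python
import numpy as np

def edf(sample):
    """Return the empirical CDF x -> (1/n) * #{i : X_i <= x} as a callable."""
    data = np.sort(np.asarray(sample, dtype=float))
    n = len(data)
    return lambda x: np.searchsorted(data, x, side="right") / n

rng = np.random.default_rng(12)
sample = rng.normal(2.0, 1.0, size=1000)
f_hat = edf(sample)

print(f_hat(2.0))         # ~0.5, since 2.0 is the true median
print(sample.mean())      # the mean of the EDF is exactly the sample mean
```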

Glivenko-Cantelli Theorem Let X₁, X₂, … be iid random variables with a distribution function F. Then as n → ∞ we have sup_x |F̂ₙ(x) − F(x)| → 0 almost surely.

That is, the empirical distribution function approaches the actual distribution function in the limit, uniformly in x, with probability 1.

Proof

By the strong law of large numbers:

Hence, for any finite set of we have:

Since is continuous, choose large enough such that:

Goodness-of-fit Testing To use the above identity for goodness-of-fit testing, we need to know the distribution of the statistic, which apparently depends on F. However, it turns out that we can simplify things.

Assume that F is continuous with a quantile function F⁻¹, such that F(F⁻¹(u)) = u. Using proposition 2.47 from slide 49:

So we have:


So therefore:

Treating F as a parameter, we see that the form of the distribution of the statistic will not depend on which F is chosen, so long as it is continuous. That makes the statistic very useful for testing goodness of fit.
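A quick check of the distribution-free property behind this, under the illustrative assumption of an Exponential sample: since F is continuous, F(X) is Uniform(0, 1), so a statistic built from the F(Xᵢ) has the same distribution whatever the (continuous) F.

```python
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(13)
x = rng.exponential(scale=2.0, size=5000)

u = expon.cdf(x, scale=2.0)                   # probability integral transform F(X)
print(np.mean(u <= 0.25), np.mean(u <= 0.5))  # ~0.25, ~0.5: F(X) ~ Uniform(0, 1)
print(kstest(x, expon(scale=2.0).cdf))        # KS goodness-of-fit test against Exp(scale=2)
```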

Maximum Likelihood Estimator The maximum likelihood estimator for θ is the value that maximises the (log-)likelihood of the observed sample:

Let the true value of the parameter be denoted θ₀. Then we have:

by the strong law of large numbers.

Indeed, we can show that:

There is also a relationship between the curvature (second derivative) of the expected value and the

Fisher information:

Thus we conclude that the higher the value of the Fisher information, the closer the maximum likelihood estimator should be to the true parameter value.
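A closing sketch of maximum likelihood in practice, assuming an illustrative iid Exponential sample with rate λ: the log-likelihood is maximised numerically and the result matches the closed-form MLE λ̂ = 1 / X̄.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(14)
true_rate = 2.5
x = rng.exponential(scale=1 / true_rate, size=5000)

def neg_log_likelihood(rate):
    # Exponential density f(x; rate) = rate * exp(-rate * x)
    return -(len(x) * np.log(rate) - rate * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())     # numerical MLE vs closed-form MLE, both ~2.5
```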