EC381_probabilitybasics


  • 8/10/2019 EC381_probabilitybasics


    EXAM 1 MATERIAL

    BASIC PROBABILITY

Probability of an event's complement: P[A^c] = 1 - P[A]

Probability of the union of events: P[A ∪ B] = P[A] + P[B] - P[A ∩ B]

Conditional Probability: P[A|B] = P[A ∩ B] / P[B], for P[B] > 0

which directly implies that: P[A ∩ B] = P[A|B]·P[B] = P[B|A]·P[A]

Law of Total Probability: For event space {B_1, ..., B_n}, P[A] = Σ_i P[A|B_i]·P[B_i]

Bayes' Rule: For event space {B_1, ..., B_n}, P[B_i|A] = P[A|B_i]·P[B_i] / Σ_j P[A|B_j]·P[B_j]

Definition of Independence: P[A ∩ B] = P[A]·P[B]

which implies that: P[A|B] = P[A]

and P[B|A] = P[B]
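The law of total probability and Bayes' rule can be checked numerically. A minimal sketch in Python, using a made-up diagnostic-test example (the numbers 0.01, 0.95, 0.10 are illustrative assumptions, not from the notes):

```python
# Made-up diagnostic-test example: D = "has disease", + = "test positive".
p_disease = 0.01                      # prior P[D]
p_pos_given_disease = 0.95            # P[+|D]
p_pos_given_healthy = 0.10            # P[+|D^c]

# Law of total probability: P[+] = P[+|D]P[D] + P[+|D^c]P[D^c]
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: P[D|+] = P[+|D]P[D] / P[+]
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 4))  # about 0.0876 - small, despite the "95% accurate" test
```

Note how the small prior drags the posterior down; this is exactly the weighting Bayes' rule performs.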


    DISCRETE RANDOM VARIABLES

Probability Mass Function: P_X(x) = P[X = x]

Cumulative Distribution Function: F_X(x) = P[X ≤ x] = Σ_{k ≤ x} P_X(k)

***Expected Value: E[X] = Σ_x x·P_X(x)

NOTES: It's very possible to have an expected value that couldn't actually happen. For example, if a prof only gives out 90s and 100s, and there is a 50% likelihood of each and an even number of students in the class, then the expected value of the grades is 95 - even though the prof will never actually give a 95.
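That 90/100 grade example can be written as a one-line expectation:

```python
# E[X] = sum over outcomes x of x * P[X = x]
pmf = {90: 0.5, 100: 0.5}
expected_grade = sum(x * p for x, p in pmf.items())
print(expected_grade)  # 95.0 - a grade the prof never actually gives
```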

Expected value is LINEAR!!!! This is a beautiful thing, allowing us to do things like this: E[aX + b] = a·E[X] + b, and E[X + Y] = E[X] + E[Y].

***Variance: Var[X] = E[(X - E[X])²] = E[X²] - (E[X])²

Standard Deviation: σ_X = √(Var[X])

Conditional PMF: P_{X|A}(x) = P[X = x | A]

Conditional Expected Value: E[X|A] = Σ_x x·P_{X|A}(x)

Families of Discrete Random Variables (list is not exhaustive, but includes the "most important"):

Bernoulli - single trial with two possible outcomes (e.g., flipping a coin, answering a yes/no question)

For 0 < p < 1: P_X(x) = 1 - p for x = 0; p for x = 1; 0 otherwise. E[X] = p, Var[X] = p(1 - p).


Binomial - repeated trials of Bernoullis (e.g., flipping several coins in sequence, answering several yes/no questions in sequence)

For a positive integer n and 0 < p < 1: P_X(x) = (n choose x)·p^x·(1 - p)^(n - x) for x = 0, 1, ..., n; 0 otherwise. E[X] = np, Var[X] = np(1 - p).

Geometric - number of successes until a (given number of) failure(s), or number of failures until a (given number of) success(es) (e.g., running the Boston marathon every year until the year you manage to finish; cold-calling people for donations until you have six donations)

For 0 < p < 1: P_X(x) = (1 - p)^(x - 1)·p for x = 1, 2, ...; 0 otherwise. E[X] = 1/p, Var[X] = (1 - p)/p².

Discrete Uniform - outcomes in a given range all have an equal likelihood of occurring (e.g., rolling a die)

For integers a and b such that a < b: P_X(x) = 1/(b - a + 1) for x = a, a + 1, ..., b; 0 otherwise. E[X] = (a + b)/2, Var[X] = (b - a)(b - a + 2)/12.
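The pmfs above are easy to code up and sanity-check (every pmf must sum to 1 over its support). A sketch using only the standard library; parameter conventions vary by textbook, so check yours:

```python
import math

def bernoulli_pmf(k, p):            # k in {0, 1}
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):          # k successes in n independent trials
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):            # first success on trial k = 1, 2, ...
    return (1 - p)**(k - 1) * p

def discrete_uniform_pmf(k, a, b):  # integers a <= k <= b
    return 1 / (b - a + 1) if a <= k <= b else 0.0

# Sanity check: each pmf should sum to (essentially) 1 over its support.
total_binomial = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
total_geometric = sum(geometric_pmf(k, 0.3) for k in range(1, 200))
```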


    CONTINUOUS RANDOM VARIABLES

Probability Density Function: The density function for continuous random variables is different from the mass function for discrete random variables in the sense that we are no longer looking for "mass" at a particular integer outcome to indicate the probability of that outcome. Instead, we consider the area under the function's curve within a range of outcomes as the indicator of probability for that range of outcomes; hence, we integrate between these limits to determine probabilities for outcomes of continuous random variables, and the total area under a density function must equal 1:

P[a ≤ X ≤ b] = ∫ from a to b of f_X(x) dx, with f_X(x) ≥ 0 and ∫ over all x of f_X(x) dx = 1

Cumulative Distribution Function: F_X(x) = P[X ≤ x] = ∫ from -∞ to x of f_X(u) du

Because the cumulative distribution function is equal at every point to the value of the probability density function's integral from -∞ up to that point, finding the value of the CDF at a point is equivalent to finding the probability that a random variable's outcome is less than or equal to that point. In other words, its definition has not changed.

***Expected Value: As with all transitions from discrete to continuous values, we should expect summations to be replaced by integrals, and that is the only change you see here: E[X] = ∫ x·f_X(x) dx; and, as you might expect: E[g(X)] = ∫ g(x)·f_X(x) dx.

***Variance: Because we define variance in terms of expected values, and because we have amended our definition of expected value to utilize the necessary integration, the definition of variance is exactly the same as before: Var[X] = E[(X - E[X])²] = E[X²] - (E[X])².

Standard Deviation: Again, this is the same as before: σ_X = √(Var[X]).
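The "sums become integrals" idea can be checked numerically. A sketch that integrates an exponential pdf (λ = 2 is an arbitrary choice) with a simple midpoint rule and recovers area 1, E[X] = 1/λ, and Var[X] = 1/λ²:

```python
import math

# Exponential pdf f(x) = lam * exp(-lam * x) for x >= 0; lam = 2 chosen for illustration.
lam = 2.0

def f(x):
    return lam * math.exp(-lam * x)

def integrate(g, a, b, n=200_000):
    # simple midpoint rule; fine for smooth, rapidly decaying integrands
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

upper = 40.0  # tail mass beyond this point is negligible for lam = 2
total_area = integrate(f, 0.0, upper)                          # should be ~1
mean = integrate(lambda x: x * f(x), 0.0, upper)               # E[X] = 1/lam = 0.5
second_moment = integrate(lambda x: x * x * f(x), 0.0, upper)  # E[X^2]
variance = second_moment - mean ** 2                           # 1/lam^2 = 0.25
```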


Families of Continuous Random Variables (list is not even close to exhaustive, but includes the "most important"):

Uniform - pdf is uniformly distributed on (a,b): For constants a < b, f_X(x) = 1/(b - a) for a < x < b; 0 otherwise. E[X] = (a + b)/2, Var[X] = (b - a)²/12.

Exponential (λ) - For λ > 0: f_X(x) = λ·e^(-λx) for x ≥ 0; 0 otherwise. E[X] = 1/λ, Var[X] = 1/λ².

Erlang (n, λ) - For λ > 0 and a positive integer n: f_X(x) = λ^n·x^(n-1)·e^(-λx)/(n - 1)! for x ≥ 0; 0 otherwise. E[X] = n/λ, Var[X] = n/λ².

Gaussian (μ, σ) - NOTE: Gaussian RVs are HUGELY important!!!!! These babies AREN'T going away, so start loving them now ...

For constants μ and σ > 0: f_X(x) = (1/(σ√(2π)))·e^(-(x - μ)²/(2σ²)). E[X] = μ, Var[X] = σ².
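These pdf formulas translate directly into code. A sketch with the standard textbook forms (conventions vary, so treat the parameterizations as assumptions); it also confirms that an Erlang with n = 1 collapses to an exponential:

```python
import math

def uniform_pdf(x, a, b):
    return 1 / (b - a) if a <= x <= b else 0.0

def exponential_pdf(x, lam):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def erlang_pdf(x, n, lam):  # n a positive integer
    if x < 0:
        return 0.0
    return lam**n * x**(n - 1) * math.exp(-lam * x) / math.factorial(n - 1)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# An Erlang with n = 1 is just an exponential:
check = abs(erlang_pdf(1.3, 1, 2.0) - exponential_pdf(1.3, 2.0)) < 1e-12
```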


    FUNCTIONS OF RANDOM VARIABLES

If random variable Y is a linear transformation of random variable X, i.e., Y = aX + b, then:

E[Y] = a·E[X] + b

    This shouldn't surprise you, because expected value is linear!!

Var[Y] = a²·Var[X]

Again, there is no surprise here! Addition of a constant just shifts the location of the density function - it does NOT affect the spread of the pdf, which is what variance measures. Only the scaling affects the variance, and it shouldn't surprise you that the effect is quadratic, for variance is essentially a quadratic measure (the second moment minus the square of the expected value).

Are the above relationships true for ANY random variable? YES!!!! This is true for ANY random variable, provided that the transformation of that random variable is LINEAR.

NOTE: Linear transformations of any kind on uniform and Gaussian random variables produce uniform and Gaussian random variables, respectively.

This is not so for exponential and Erlang random variables, the distributions for which are constrained to be 0 for values less than 0 and are constrained as well to start at 0. Hence, ONLY linear transformations of the form Y = aX, where a > 0, result in a Y that is an exponential or an Erlang random variable, respectively. Shifting the original X distribution in any direction results in a random variable that is neither exponential nor Erlang (respectively); likewise, scaling the original X distribution by a negative constant will flip the original distribution, again resulting in a random variable that is neither exponential nor Erlang (respectively).
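A quick Monte Carlo check of E[aX + b] = a·E[X] + b and Var[aX + b] = a²·Var[X]; X here is exponential with rate 1 and a, b are arbitrary illustrative constants:

```python
import random

random.seed(0)
a, b = 3.0, -2.0
xs = [random.expovariate(1.0) for _ in range(200_000)]
ys = [a * x + b for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Theory: E[X] = 1 and Var[X] = 1, so E[Y] = a + b = 1 and Var[Y] = a^2 = 9.
```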


    THE STANDARD NORMAL (AND COMPLEMENTARY) CDFs

Standard Normal CDF: Φ(z) = P[Z ≤ z]. Complementary CDF: Q(z) = P[Z > z] = 1 - Φ(z).

NOTE: The standard normal CDF is just the CDF of a Gaussian PDF centered at 0, with variance 1. Because the CDF is always the integral of the PDF, this gives us the following definition:

Φ(z) = ∫ from -∞ to z of (1/√(2π))·e^(-u²/2) du

Notice that the function under integration is exactly the 0 mean, variance 1 Gaussian.

Transformation of X, a Gaussian Random Variable, to a Standard Normal Random Variable: We need to be able to do this transformation because there is no analytical solution for the integration of a Gaussian pdf; thus, we need to be able to linearly transform a general Gaussian (non-zero mean and/or non-unity variance) into the standard Gaussian. To do this: Z = (X - μ_X)/σ_X.

Probability that X is less than a: This is straightforward, given the standard normal CDF. We simply convert our X value (a) to Z, and find the value of the CDF at that point, because the standard normal CDF is, by definition, P[Z ≤ z]:

P[X ≤ a] = Φ((a - μ_X)/σ_X)
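Since Φ has no closed form, it is typically computed via the error function; `math.erf` is in the Python standard library, and Φ(z) = (1/2)(1 + erf(z/√2)). A sketch of the standardization step:

```python
import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_cdf(a, mu, sigma):
    # P[X <= a] for X ~ N(mu, sigma^2), by standardizing first
    return phi((a - mu) / sigma)

print(round(phi(0.0), 4))  # 0.5
print(round(phi(1.0), 4))  # 0.8413
```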


    EXAM 2 MATERIAL

    Pairs of Random Variables

Relationships between Distributions/Mass Functions:

Discrete:

Marginals: P_X(x) = Σ_y P_{X,Y}(x, y) and P_Y(y) = Σ_x P_{X,Y}(x, y)

Probability of Event A: P[A] = Σ over (x, y) in A of P_{X,Y}(x, y)

Joint Conditional (on Event A): P_{X,Y|A}(x, y) = P_{X,Y}(x, y)/P[A] for (x, y) in A; 0 otherwise

NOTE: To find the marginal P_{X|A}(x), can marginalize the expression on the left over the values of y in A!

Conditional (on a Random Variable): P_{X|Y}(x|y) = P_{X,Y}(x, y)/P_Y(y) and P_{Y|X}(y|x) = P_{X,Y}(x, y)/P_X(x)

Bayes' Rule: P_{X|Y}(x|y) = P_{Y|X}(y|x)·P_X(x)/P_Y(y)

Independence: Two RVs are independent if and only if P_{X,Y}(x, y) = P_X(x)·P_Y(y) for all x, y ...

...which directly implies that for independent RVs: P_{X|Y}(x|y) = P_X(x) and P_{Y|X}(y|x) = P_Y(y)

Continuous: (same structure, with pdfs in place of pmfs and integrals in place of sums)

Marginals: f_X(x) = ∫ f_{X,Y}(x, y) dy and f_Y(y) = ∫ f_{X,Y}(x, y) dx

Probability of Event A: P[A] = ∫∫ over A of f_{X,Y}(x, y) dx dy

Joint Conditional (on Event A): f_{X,Y|A}(x, y) = f_{X,Y}(x, y)/P[A] for (x, y) in A; 0 otherwise

NOTE: To find the marginal f_{X|A}(x), can marginalize the expression on the left over the values of y in A!

Conditional (on a Random Variable): f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) and f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)

Bayes' Rule: f_{X|Y}(x|y) = f_{Y|X}(y|x)·f_X(x)/f_Y(y)

Independence: Two RVs are independent if and only if f_{X,Y}(x, y) = f_X(x)·f_Y(y) for all x, y ...

...which directly implies that for independent RVs: f_{X|Y}(x|y) = f_X(x) and f_{Y|X}(y|x) = f_Y(y)
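Marginalizing a joint pmf and testing independence is mechanical. A sketch on a small made-up joint table (values chosen so X and Y happen to be independent):

```python
# joint[(x, y)] = P_{X,Y}(x, y); the numbers are illustrative.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.15, (1, 1): 0.45,
}

# Marginals: P_X(x) = sum over y of P_{X,Y}(x, y), and likewise for Y.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# Independence: P_{X,Y}(x, y) = P_X(x) * P_Y(y) at every (x, y).
independent = all(
    abs(p - px[x] * py[y]) < 1e-12 for (x, y), p in joint.items()
)
```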


Covariance/Correlation: Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]·E[Y]

NOTE: Remember, shifts (addition of scalars) don't affect covariances. Also, scalars can be pulled out, i.e., Cov[aX, bY] = ab·Cov[X, Y].

Correlation coefficient: ρ_{X,Y} = Cov[X, Y]/(σ_X·σ_Y), with -1 ≤ ρ_{X,Y} ≤ 1. NOTE: ρ_{X,Y} = 0 means X and Y are uncorrelated; |ρ_{X,Y}| = 1 means completely correlated (linear relationship, i.e., Y = aX + b).

Uncorrelated vs. Independent RVs: *Independent RVs are uncorrelated, but uncorrelated RVs are NOT necessarily independent, unless they are JOINTLY Gaussian RVs!!

X,Y Independent ==> X,Y Uncorrelated, but X,Y Uncorrelated does NOT imply X,Y Independent

Iterated Expectation: E[X] = E[E[X|Y]]
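Sample covariance and the correlation coefficient are direct translations of the formulas above; this sketch (with made-up data) also checks that shifting the data leaves the covariance unchanged:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly linear in xs, so rho should be near 1

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def corr(u, v):
    return cov(u, v) / math.sqrt(cov(u, u) * cov(v, v))

shifted = [x + 100.0 for x in xs]                        # adding a constant...
same_cov = abs(cov(shifted, ys) - cov(xs, ys)) < 1e-9    # ...doesn't change Cov
```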


    Jointly Gaussian Random Variables

If X,Y are jointly Gaussian random variables (not necessarily independent), then:

Their joint distribution is Gaussian (hence the phrase "jointly Gaussian"). Their marginal distributions f_X(x) and f_Y(y) are Gaussian. And for jointly Gaussian X,Y: X,Y uncorrelated if and only if X,Y independent.

Linear combinations of the Gaussians are Gaussian, i.e.:

W = aX + bY + c is Gaussian, with: E[W] = a·E[X] + b·E[Y] + c and Var[W] = a²·Var[X] + b²·Var[Y] + 2ab·Cov[X, Y]

Linear transformations of X and Y are invertible transformations, and hence the transformed variables are also jointly Gaussian, i.e., they are not only marginally Gaussian, but also are jointly Gaussian!

The conditional f_{X|Y}(x|y) is Gaussian, with: E[X|Y = y] = μ_X + ρ·(σ_X/σ_Y)·(y - μ_Y) and Var[X|Y = y] = σ_X²·(1 - ρ²). The conditional f_{Y|X}(y|x) is Gaussian, with: E[Y|X = x] = μ_Y + ρ·(σ_Y/σ_X)·(x - μ_X) and Var[Y|X = x] = σ_Y²·(1 - ρ²).
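One common way to build a jointly Gaussian pair with a chosen correlation ρ is as a linear combination of two independent standard normals (a 2x2 Cholesky-style construction; ρ = 0.8 below is an arbitrary choice). Since Y is a linear combination of Gaussians, (X, Y) are jointly Gaussian with Corr(X, Y) = ρ:

```python
import math
import random

random.seed(1)
rho = 0.8
xs, ys = [], []
for _ in range(100_000):
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    xs.append(z1)                                      # X = Z1
    ys.append(rho * z1 + math.sqrt(1.0 - rho**2) * z2) # Y = rho*Z1 + sqrt(1-rho^2)*Z2

# Sample correlation should land near rho.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
vx = sum((x - mx) ** 2 for x in xs) / n
vy = sum((y - my) ** 2 for y in ys) / n
sample_corr = cxy / math.sqrt(vx * vy)
```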


    Detection (Binary Hypothesis Testing)

We have two hypotheses, H0 ("null hypothesis"/"nothing's there") and H1 ("hypothesis"/"something's there"). We observe some random variable Y and want to decide, based on its value, which hypothesis is true.

    Errors:

Missed Detection: Choose H0 when H1 is true.

False Alarm: Choose H1 when H0 is true.

Probability of Error: This is always equal to the probability of missed detection times the a priori probability of the hypothesis plus the probability of false alarm times the a priori probability of the null hypothesis, i.e.:

P[err] = P[MD]·P[H1] + P[FA]·P[H0]

Expected Value of the Cost of Errors: All we do here is factor in the associated costs to the probability of error computation, i.e.:

E[C] = C_MD·P[MD]·P[H1] + C_FA·P[FA]·P[H0]

    Detectors:

NOTE: All detectors are written for the continuous case. As always, for the discrete case, the expressions are the same, but with big "P" substituted for little "f".

Maximum Likelihood (ML) Detector: Most basic detector; compares the likelihood ratio to 1 in order to determine which density is larger. If the density in the numerator is larger, then the ratio is larger than 1; if the ratio is smaller than 1, then the density in the denominator must be larger.

Choose H1 if f_{Y|H1}(y)/f_{Y|H0}(y) > 1; otherwise choose H0.

Maximum A Posteriori (MAP) Detector: Minimizes probability of error by weighting the likelihoods with the a priori probabilities. Note that if the a prioris are equal, then the MAP detector simplifies to the ML detector.

Choose H1 if f_{Y|H1}(y)/f_{Y|H0}(y) > P[H0]/P[H1]; otherwise choose H0.

Minimum Cost (Bayes' Risk) Detector: Minimizes expected value of the cost of error by weighting the likelihoods by both the a priori probabilities and the costs of missed detection/false alarm. Note that if the costs are equal, then the minimum cost detector simplifies to the MAP detector.

Choose H1 if f_{Y|H1}(y)/f_{Y|H0}(y) > (C_FA·P[H0])/(C_MD·P[H1]); otherwise choose H0.
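A sketch of all three detectors for two Gaussian hypotheses, H0: Y ~ N(0, 1) vs. H1: Y ~ N(2, 1); the priors (0.7/0.3) and costs (1/10) are illustrative choices, not from the notes:

```python
import math

def gauss_pdf(y, mu):
    return math.exp(-((y - mu) ** 2) / 2) / math.sqrt(2 * math.pi)

def likelihood_ratio(y):
    return gauss_pdf(y, 2.0) / gauss_pdf(y, 0.0)  # f(y|H1) / f(y|H0)

def ml_decide(y):
    return 1 if likelihood_ratio(y) > 1 else 0

def map_decide(y, p0=0.7, p1=0.3):
    # choose H1 when L(y) > P[H0]/P[H1]
    return 1 if likelihood_ratio(y) > p0 / p1 else 0

def min_cost_decide(y, p0=0.7, p1=0.3, c_fa=1.0, c_md=10.0):
    # choose H1 when L(y) > (C_FA * P[H0]) / (C_MD * P[H1])
    return 1 if likelihood_ratio(y) > (c_fa * p0) / (c_md * p1) else 0
```

For these Gaussians, L(y) = exp(2y - 2), so the ML threshold on y itself is the midpoint of the means (decide H1 iff y > 1); the MAP and minimum-cost detectors just slide that threshold toward or away from the costlier/likelier hypothesis.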


    Useful Math Facts:

Simplification of the likelihood ratio used in detection becomes much easier when logarithms are taken. So, remember:

*the log of the product is the sum of the logs: log(ab) = log a + log b

*the log of the quotient is the difference of the logs: log(a/b) = log a - log b

*the log of a power is the exponent times the log: log(a^n) = n·log a

(We engineers know that the log can be taken to any base, including base e, so no need to specify 'ln'.)

*log 1 = 0


    Estimation

We want to know X, but we can only observe Y; so, we have to make an educated guess at the value of X given that we have observed some value of Y.

Biased/Unbiased: bias = E[X̂] - E[X]

NOTE: If bias = 0, then E[X̂] = E[X], and the estimator is said to be "unbiased".

ML Estimation:

Choose the value of X that maximizes the conditional distribution f_{Y|X}(y|x), called the "likelihood function": x̂_ML(y) = argmax_x f_{Y|X}(y|x)

MAP Estimation:

Choose the value of X that maximizes the conditional distribution, but now we multiply by the prior distribution on X, because we no longer assume equal priors (as in the ML case): x̂_MAP(y) = argmax_x f_{Y|X}(y|x)·f_X(x)

Minimum Mean Square Error (MMSE) Estimation:

Choose the value of X that minimizes the mean square error, E[(X - x̂)²], over ALL possible relationships (even nonlinear ones) between X and Y.

Case 1 (Blind Estimate): x̂ = E[X]

Case 2 (A, some attribute of X, is observed, thus restricting the possibilities for X): x̂ = E[X|A]

Case 3 (A dependent random variable's value, Y = y, is observed): x̂ = E[X|Y = y]

Properties of MMSE Estimator:

*Unbiased

*Estimate is orthogonal to (uncorrelated with) the estimation error

*All functions of the data used in the estimate are orthogonal to (uncorrelated with) the estimation error

Linear Least Squares Error Estimation:

Easier than MMSE, because we need the conditional distribution for MMSE! If we don't have that distribution, but know the key statistics of X and Y, then:


Choose the value of X that minimizes the mean square error, E[(X - x̂)²], over ONLY LINEAR relationships between X and Y:

x̂_L(y) = μ_X + ρ_{X,Y}·(σ_X/σ_Y)·(y - μ_Y)

where, clearly, ρ_{X,Y} = Cov[X, Y]/(σ_X·σ_Y).

    Properties of LLSE Estimator:

    *Unbiased

    *Estimate is orthogonal to (uncorrelated with) the estimation error

    *Estimation error is orthogonal to (uncorrelated with) the data used in the estimate

Mean Square Error of the LLSE Estimator:

e_L = σ_X²·(1 - ρ_{X,Y}²)

NOTE: This is also called the "variance of the error", or Var[E], which is the variance in this case because the estimator is unbiased, so the error has zero mean.

Let E = X - X̂_L(Y), the error in the linear estimate. Then the mean square error of the estimate is:

E[E²] = σ_X²·(1 - ρ_{X,Y}²)

BIG NOTE: X,Y Jointly Gaussian:

--> the conditional distribution is Gaussian, and actually, E[X|Y = y] IS linear in y. So, in this case, x̂_MMSE(y) = x̂_L(y)!!!!!
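The LLSE formulas need only the five key statistics, which makes them a two-liner. A sketch (all the numbers passed in are illustrative):

```python
def llse(y, mu_x, mu_y, sigma_x, sigma_y, rho):
    # Xhat_L(y) = mu_X + rho * (sigma_X / sigma_Y) * (y - mu_Y)
    return mu_x + rho * (sigma_x / sigma_y) * (y - mu_y)

def llse_mse(sigma_x, rho):
    # e_L = sigma_X^2 * (1 - rho^2)
    return sigma_x**2 * (1 - rho**2)

# If rho = 0, the best linear guess ignores Y entirely and falls back to the
# blind estimate mu_X, and the MSE is just Var[X].
blind = llse(y=3.7, mu_x=1.0, mu_y=0.0, sigma_x=2.0, sigma_y=1.0, rho=0.0)
```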


    EXAM 3 MATERIAL

    LIMIT THEOREMS

    Sums of Random Variables

Let W = X_1 + X_2 + ... + X_n. Then:

E[W] = E[X_1] + E[X_2] + ... + E[X_n]

by the linearity of expected value. In general, Var[W] = Σ_i Var[X_i] + Σ over i ≠ j of Cov[X_i, X_j], which, if the X_i are independent (or just uncorrelated), simplifies to:

Var[W] = Var[X_1] + Var[X_2] + ... + Var[X_n]

Repeat after me: "If the RVs are independent, then the variance of the sum equals the sum of the variances."

NOTE: PDFs of the sums of INDEPENDENT RVs are convolutions of the individual PDFs!!!

What if N is random? (i.e., we don't know how many variables we're adding ...)

Let W = X_1 + X_2 + ... + X_N, and let the X_i be i.i.d. (independent, identically distributed). Then:

E[W] = E[N]·E[X] and Var[W] = E[N]·Var[X] + Var[N]·(E[X])²

    Average of Random Variables

Let X_1, ..., X_n be i.i.d. (or even just uncorrelated) RVs, and let M_n(X) = (X_1 + ... + X_n)/n be their sample mean. Then:

E[M_n(X)] = E[X] and Var[M_n(X)] = Var[X]/n

    Markov Inequality

For non-negative RV X and any c > 0: P[X ≥ c] ≤ E[X]/c


    Chebyshev Inequality (this gives a tighter bound than the Markov inequality ...)

For any RV X (not necessarily non-negative) with mean μ_X, and any c > 0: P[|X - μ_X| ≥ c] ≤ Var[X]/c²
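Both inequalities are easy to verify empirically. A sketch using an exponential RV with rate 1 (so E[X] = 1 and Var[X] = 1; the thresholds a = c = 3 are arbitrary choices):

```python
import random

random.seed(2)
xs = [random.expovariate(1.0) for _ in range(100_000)]
n = len(xs)

a = 3.0
p_tail = sum(x >= a for x in xs) / n   # empirical P[X >= 3], truly e^-3 ~ 0.05
markov_bound = 1.0 / a                 # Markov: E[X]/a = 1/3

mu, c = 1.0, 3.0
p_dev = sum(abs(x - mu) >= c for x in xs) / n  # empirical P[|X - 1| >= 3]
chebyshev_bound = 1.0 / c**2                   # Chebyshev: Var[X]/c^2 = 1/9
```

Both bounds hold, and both are loose here; they trade tightness for requiring almost no knowledge of the distribution.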

    Laws of Large Numbers

    Weak Law:

Let X_1, X_2, ... be i.i.d. (although the weak law also holds for uncorrelated RVs). Then:

lim as n → ∞ of P[|M_n(X) - μ_X| ≥ ε] = 0

for any ε > 0, as arbitrarily close to 0 as we like. Stated another way: the WLLN says that as the number of RVs we're summing approaches infinity, the sample mean approaches the true mean with a probability that approaches certainty!!!!

    Strong Law:

This is very similar to the weak law, except we require that the RVs be i.i.d., and we are essentially changing the above statement to say that as the number of samples approaches infinity, the sample mean actually equals the true mean with absolute certainty:

P[lim as n → ∞ of M_n(X) = μ_X] = 1

    Central Limit Theorem

Remember, the CLT is your friend!!! It basically says that, when we sum together a bunch of i.i.d. RVs, we get a Gaussian - which means we can apply all of those nice Gaussian properties simply by invoking the CLT.

Formally, this says that in the limit as n approaches infinity, the CDF of the sum (NOT the sample mean!!!) of the RVs approaches a Gaussian CDF. So, we can use:

Z_n = (X_1 + ... + X_n - n·μ_X)/(σ_X·√n), with F_{Z_n}(z) → Φ(z) as n → ∞

for i.i.d. X_i, in which we transform to a standard normal Gaussian CDF, then use the phi or Q function to get the probability we seek. We can of course express the mean and standard deviation above in terms of the mean and standard deviation of the X_i as follows...

Remember: E[X_1 + ... + X_n] = n·μ_X, and the standard deviation of the sum is σ_X·√n.
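The CLT can be seen numerically: the standardized sum of n i.i.d. Uniform(0,1) RVs (mean 1/2, variance 1/12) is already very close to standard normal even at n = 30. A sketch (n and the trial count are arbitrary choices):

```python
import math
import random

random.seed(3)
n, trials = 30, 50_000
mu, var = 0.5, 1.0 / 12.0  # mean and variance of a single Uniform(0,1)

def standardized_sum():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / math.sqrt(n * var)   # Z_n = (sum - n*mu)/(sigma*sqrt(n))

zs = [standardized_sum() for _ in range(trials)]

# Compare the empirical P[Z_n <= 1] against Phi(1) ~ 0.8413.
empirical = sum(z <= 1.0 for z in zs) / trials
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))
```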


    Confidence Intervals

Here, the confidence interval is (â - ε, â + ε); â - ε is called the lower limit, while â + ε is called the upper limit. Moreover, 1 - α is called the confidence coefficient. Notice that if α is small, then we're more sure that the outcome we've estimated lies within the confidence interval.

By Chebyshev's inequality: P[|M_n(X) - μ_X| ≥ c] ≤ Var[X]/(n·c²)

Theorem: X is a Gaussian RV with unknown mean μ_X and known standard deviation σ_X. The relationship between a confidence interval estimate of μ_X, denoted M_n(X) - c ≤ μ_X ≤ M_n(X) + c, and the confidence coefficient 1 - α is given by

1 - α = 2·Φ(c·√n/σ_X) - 1

where Φ is the standard normal CDF.


    MARKOV CHAINS

    Markov Property

Basically, it says we only need to know the current state in order to know the probabilities of the next state, so that there is only a one-step delay dependence:

P[X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0] = P[X_{n+1} = j | X_n = i]

Markov ("Stochastic") Matrices

Let P represent a stochastic matrix. Row numbers are the states we're leaving; column numbers are the states at which we're arriving, and the (i, j)-th element P_ij is the probability of entering state j from state i.

The sum of each row must be 1, because at each time step, some action MUST be taken, whether it's to stay in the same state or move to another.

    Steady State Behavior of Homogeneous Markov Chains

We can iterate the matrix by raising it to the exponent that represents the time of interest to find the stochastic matrix at that time step: P(n) = P^n.

Under certain conditions, the matrix will converge to the steady state, where each row will contain the same steady state probability vector, denoted using the Greek lowercase letter pi:

π = π·P

π is the (left) eigenvector of P corresponding to eigenvalue 1. There can be as many steady state probability vectors as the matrix has eigenvalues equal to 1.

    Definitions

accessible - State j is accessible from state i if there's a directed path from i to j.

communicate - States i and j communicate if j is accessible from i and i is accessible from j. A group whose members communicate with each other is called a communicating class.

irreducible - In the state diagram or matrix, every state communicates with every other, and hence all states are active in the steady state, so none of them can be "reduced out".


transient - State i is transient if i communicates with state j, but j doesn't communicate with i. You can think of transient states as follows: Over time, "things" in that state will leak out to other non-transient states, until eventually there's virtually nothing left to leak out. Hence, the steady state probability of a transient state is 0.

    recurrent- Nontransient. The end.

NOTE: If there are multiple communicating classes with recurrent states, then the steady state probability vector depends on the initial probability vector (where you probably started).

period - Greatest common divisor of the lengths of all possible cycles from a state back to itself.

aperiodic - Period of the Markov chain is equal to 1.

    NOTE: At least one self loop means that the Markov chain must have period 1, and thus must be aperiodic!!!

NOTE 2: If the period is greater than 1, then the chain will oscillate between steady state vectors, and will depend on where you started.

Finding Steady State Vectors (without eigendecomposition)

    Step 1.

    Step 2. Draw the state transition diagram.

Step 3. Identify the transient and recurrent states. If there are any transient states, then you already know the probability of being in that state after reaching the steady state is 0! Hence, the element in the steady state vector that corresponds to that state is equal to 0!!!

Step 4. Determine a system of equations, where the elements of π are the unknowns. (Remember, you need as many equations as you have unknowns in order to solve a system of equations.)

a) You get one of these equations for free: because π is a probability vector, its elements must sum to 1, so: Σ_i π_i = 1

b) Draw a dashed line between one recurrent state and the rest of the states. Then, the probability of flowing into that state must equal the probability of flowing out of it. That is, write an equation where (total probability flowing in) = (total probability flowing out).


NOTE: Be sure not to draw dashed lines around transient states. Their steady state probabilities are zero, so their values won't help you solve systems of equations involving them.

    NOTE 2: Self-loops don't matter, because they're neither flowing into nor out of that state.

c) Repeat part b for a different recurrent state; do this until you have a system of as many equations as you have unknowns.

d) Solve the system for the unknown elements of the steady state vector.

e) Check your work - your results should sum to 1!!!
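The whole procedure can be cross-checked numerically by just iterating the chain until it converges. A sketch on a made-up 2-state chain (transition probabilities chosen for illustration); the balance-equation answer for this chain is π_0·0.1 = π_1·0.5 with π_0 + π_1 = 1, i.e., π = (5/6, 1/6):

```python
# Transition matrix: row = state we're leaving, column = state we're entering.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(dist, P):
    # one time step of the chain: new_j = sum_i dist_i * P[i][j]
    return [sum(dist[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

dist = [1.0, 0.0]      # start in state 0
for _ in range(200):   # iterate until (numerically) converged
    dist = step(dist, P)
```

Iterating like this is the brute-force counterpart of raising P to a large power; the dashed-line balance equations get you the same π without any iteration.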