Intro to Probability


Page 1: Intro to Probability

Intro to Probability

Slides from Professor Pan Yan, SYSU

Page 2: Intro to Probability

Probability Theory
Example of a random experiment

– We poll 60 users who are using one of two search engines and record the following:

[Figure: scatter plot of the 60 users. Horizontal axis: X = number of “good hits” returned by the search engine, 0 to 8. Vertical axis: which of the two search engines was used. Each point corresponds to one of 60 users.]

Page 3: Intro to Probability


Probability Theory
Random variables

– X and Y are called random variables
– Each has its own sample space:

• S_X = {0,1,2,3,4,5,6,7,8}
• S_Y = {1,2}

Page 4: Intro to Probability


Probability Theory
Probability

– P(X=i,Y=j) is the probability (relative frequency) of observing X = i and Y = j
– P(X,Y) refers to the whole table of probabilities
– Properties: 0 ≤ P(X=i,Y=j) ≤ 1, Σ_i Σ_j P(X=i,Y=j) = 1

P(X=i,Y=j):

        X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
Y=1    3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
Y=2    0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
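Below is a minimal NumPy sketch (not part of the original slides) of how this table arises from the raw counts; the counts are read directly off the table above.

```python
# Build the joint probability table P(X=i, Y=j) from the 60 users' counts.
import numpy as np

# counts[j-1][i] = number of users who used search engine Y=j and got X=i good hits
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])

joint = counts / counts.sum()       # relative frequencies; counts.sum() == 60

# The two properties from the slide: 0 <= P <= 1 and the entries sum to 1.
assert np.all((joint >= 0) & (joint <= 1)) and np.isclose(joint.sum(), 1.0)
print(joint)
```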

Page 5: Intro to Probability

Probability Theory
Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y

[Figure: bar plots of the marginal distributions P(X) and P(Y)]

Page 6: Intro to Probability

Probability Theory
Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
– From the table: P(X=i) = Σ_j P(X=i,Y=j)

Note that Σ_i P(X=i) = 1 and Σ_j P(Y=j) = 1

P(X=i,Y=j) with its marginals:

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8   P(Y=j)
Y=1     3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60   34/60
Y=2     0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60   26/60
P(X=i)  3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

SUM RULE
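Continuing the NumPy sketch (an illustration, not from the slides), the sum rule is just a row or column sum of the joint table:

```python
import numpy as np

# Joint table from Page 4 (rows: Y=1, Y=2; columns: X=0..8).
joint = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                  [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60

P_X = joint.sum(axis=0)   # sum rule over Y: [3,6,8,9,9,8,9,6,2]/60
P_Y = joint.sum(axis=1)   # sum rule over X: [34,26]/60

# Each marginal is itself a probability distribution.
assert np.isclose(P_X.sum(), 1.0) and np.isclose(P_Y.sum(), 1.0)
```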

Page 7: Intro to Probability

Probability Theory
Conditional probability

– P(X=i|Y=j) is the probability that X = i, given that Y = j
– From the table: P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j)

[Figure: bar plot of the conditional distribution P(X|Y=1), i.e., the Y=1 row of the joint table divided by P(Y=1)]

Page 8: Intro to Probability

Probability Theory
Conditional probability

– How about the opposite conditional probability, P(Y=j|X=i)?
– P(Y=j|X=i) = P(X=i,Y=j) / P(X=i). Note that Σ_j P(Y=j|X=i) = 1

P(X=i,Y=j) and its marginal P(X=i):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
Y=1     3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
Y=2     0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
P(X=i)  3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

P(Y=j|X=i), each column of the joint table divided by P(X=i):

        X=0  X=1  X=2  X=3  X=4  X=5  X=6  X=7  X=8
Y=1     3/3  6/6  8/8  8/9  5/9  3/8  1/9  0/6  0/2
Y=2     0/3  0/6  0/8  1/9  4/9  5/8  8/9  6/6  2/2
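A NumPy sketch of both conditionals (an illustration built on the same table, not from the slides):

```python
import numpy as np

# Joint table from Page 4 (rows: Y=1, Y=2; columns: X=0..8).
joint = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                  [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60
P_X = joint.sum(axis=0)
P_Y = joint.sum(axis=1)

# Conditioning divides the joint by the appropriate marginal.
P_X_given_Y = joint / P_Y[:, None]   # P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j)
P_Y_given_X = joint / P_X[None, :]   # P(Y=j|X=i) = P(X=i,Y=j) / P(X=i)

# Each conditional sums to 1 over the variable to the left of the bar.
assert np.allclose(P_X_given_Y.sum(axis=1), 1.0)
assert np.allclose(P_Y_given_X.sum(axis=0), 1.0)
```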

Page 9: Intro to Probability

Summary of types of probability

• Joint probability: P(X,Y)

• Marginal probability (ignore other variable): P(X) and P(Y)

• Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)

Page 10: Intro to Probability

Probability Theory
Constructing joint probability

– Suppose we know
• The probability that the user will pick each search engine, P(Y=j), and
• For each search engine, the probability of each number of good hits, P(X=i|Y=j)
– Can we construct the joint probability, P(X=i,Y=j)?
– Yes. Rearranging P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j), we get P(X=i,Y=j) = P(X=i|Y=j) P(Y=j)

PRODUCT RULE
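A short NumPy sketch of the product rule (illustrative, not from the slides): starting from P(Y=j) and P(X=i|Y=j), it reconstructs the joint table.

```python
import numpy as np

# Suppose we only know P(Y=j) and P(X=i|Y=j), as in the slide.
P_Y = np.array([34, 26]) / 60
P_X_given_Y = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                        [0, 0, 0, 1, 4, 5, 8, 6, 2]], dtype=float)
P_X_given_Y /= P_X_given_Y.sum(axis=1, keepdims=True)   # normalize each row

# Product rule: P(X=i,Y=j) = P(X=i|Y=j) * P(Y=j).
joint = P_X_given_Y * P_Y[:, None]
print(joint * 60)    # recovers the original counts table
```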

Page 11: Intro to Probability

Summary of computational rules

• SUM RULE: P(X) = Σ_Y P(X,Y)

P(Y) = Σ_X P(X,Y)

– Notation: we write P(X,Y) as shorthand for P(X=i,Y=j), for clarity

• PRODUCT RULE: P(X,Y) = P(X|Y) P(Y)

P(X,Y) = P(Y|X) P(X)

Page 12: Intro to Probability

Ordinal variables

• In our example, X has a natural order 0…8
– X is a number of hits, and
– For the ordering of the columns in the joint table, nearby X’s have similar probabilities
• Y does not have a natural order

Page 13: Intro to Probability

Probabilities for real numbers

• Can’t we treat real numbers as IEEE doubles with 2^64 possible values?

• Hah, hah. No!

• How about quantizing real variables to a reasonable number of values?

• Sometimes works, but…
– We need to carefully account for ordinality
– Doing so can lead to cumbersome mathematics

Page 14: Intro to Probability

Probability theory for real numbers

• Quantize X using bins of width Δ
• Then, X ∈ {…, −2Δ, −Δ, 0, Δ, 2Δ, …}

• Define P_Q(X=x) = Probability that x < X ≤ x+Δ

• Problem: P_Q(X=x) depends on the choice of Δ
• Solution: Let Δ → 0
• Problem: In that case, P_Q(X=x) → 0

• Solution: Define a probability density

P(x) = lim_{Δ→0} P_Q(X=x)/Δ

     = lim_{Δ→0} (Probability that x < X ≤ x+Δ)/Δ
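A quick numerical illustration (not from the slides, using SciPy's standard normal) that P_Q(X=x)/Δ converges to the density as Δ → 0:

```python
from scipy.stats import norm

# P_Q(X=x)/Delta for a standard Gaussian, with shrinking bin width Delta.
x = 0.5
for delta in [1.0, 0.1, 0.01, 0.001]:
    P_Q = norm.cdf(x + delta) - norm.cdf(x)   # Probability that x < X <= x+Delta
    print(delta, P_Q / delta)                 # approaches the density P(x)

print(norm.pdf(x))                            # the limiting value, ~0.3521
```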

Page 15: Intro to Probability

Probability theory for real numbers

Probability density

– Suppose P(x) is a probability density

– Properties

• P(x) ≥ 0
• It is NOT necessary that P(x) ≤ 1
• ∫_x P(x) dx = 1

– Probabilities of intervals:

P(a < X ≤ b) = ∫_{x=a}^{b} P(x) dx
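These properties can be checked numerically; the sketch below (not from the slides) uses a narrow Gaussian, whose peak exceeds 1 even though it integrates to 1:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# A narrow Gaussian density with standard deviation 0.2 (illustrative choice).
p = lambda x: norm.pdf(x, loc=0.0, scale=0.2)

print(p(0.0))                            # ~1.99: P(x) <= 1 is NOT required
print(quad(p, -np.inf, np.inf)[0])       # ~1.0: total probability is 1
print(quad(p, -0.2, 0.2)[0])             # P(-0.2 < X <= 0.2) ~ 0.68
```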

Page 16: Intro to Probability

Probability theory for real numbers

Joint, marginal and conditional densities

• Suppose P(x,y) is a joint probability density

– ∫_x ∫_y P(x,y) dx dy = 1
– P((X,Y) ∈ R) = ∫∫_R P(x,y) dx dy

• Marginal density: P(x) = ∫_y P(x,y) dy

• Conditional density: P(x|y) = P(x,y) / P(y)

[Figure: a region R in the (x,y) plane]
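A sketch of the continuous sum rule (illustrative; the factorized joint density is an assumption, not from the slides):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# An illustrative joint density with independent standard-normal components.
p_xy = lambda x, y: norm.pdf(x) * norm.pdf(y)

# Marginal density: P(x) = integral over y of P(x,y).
x = 0.7
P_x = quad(lambda y: p_xy(x, y), -np.inf, np.inf)[0]
print(P_x, norm.pdf(x))   # the two values agree
```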

Page 17: Intro to Probability

The Gaussian distribution

N(x|μ,σ²) = (1/√(2πσ²)) exp( −(x−μ)² / (2σ²) )

μ is the mean and σ is the standard deviation

Page 18: Intro to Probability

Mean and variance

• The mean of X is E[X] = Σ_X X P(X)

or E[X] = ∫_x x P(x) dx

• The variance of X is VAR(X) = Σ_X (X−E[X])² P(X)

or VAR(X) = ∫_x (x−E[X])² P(x) dx

• The std dev of X is STD(X) = SQRT(VAR(X))

• The covariance of X and Y is

COV(X,Y) = Σ_X Σ_Y (X−E[X]) (Y−E[Y]) P(X,Y)

or COV(X,Y) = ∫_x ∫_y (x−E[X]) (y−E[Y]) P(x,y) dx dy
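A NumPy sketch (not from the slides) computing these quantities for the 60-user table:

```python
import numpy as np

# Joint table from Page 4; X takes values 0..8, Y takes values 1..2.
joint = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                  [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60
x_vals, y_vals = np.arange(9), np.array([1, 2])
P_X, P_Y = joint.sum(axis=0), joint.sum(axis=1)

E_X = np.sum(x_vals * P_X)                    # mean of X
E_Y = np.sum(y_vals * P_Y)                    # mean of Y
VAR_X = np.sum((x_vals - E_X) ** 2 * P_X)     # variance of X
STD_X = np.sqrt(VAR_X)                        # standard deviation of X
COV_XY = np.sum((x_vals[None, :] - E_X) * (y_vals[:, None] - E_Y) * joint)
print(E_X, VAR_X, STD_X, COV_XY)
```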

Page 19: Intro to Probability

Mean and variance of the Gaussian

E[X] = μ
VAR(X) = σ²
STD(X) = σ
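These identities can be verified by numerical integration of the Gaussian density (a sketch; μ and σ are arbitrary illustrative values):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 1.5, 0.7                        # arbitrary illustrative values
p = lambda x: norm.pdf(x, loc=mu, scale=sigma)

E_X = quad(lambda x: x * p(x), -np.inf, np.inf)[0]
VAR_X = quad(lambda x: (x - E_X) ** 2 * p(x), -np.inf, np.inf)[0]
print(E_X, VAR_X)                           # ~mu and ~sigma**2
```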

Page 20: Intro to Probability

How can we use probability as a framework for machine learning?

Page 21: Intro to Probability

Maximum likelihood estimation

• Say we have a density P(x|θ) with parameter θ

• The likelihood of a set of independent and identically distributed (IID) data x = (x1,…,xN) is

P(x|θ) = Π_{n=1}^N P(xn|θ)

• The log-likelihood is L = ln P(x|θ) = Σ_{n=1}^N ln P(xn|θ)

• The maximum likelihood (ML) estimate of θ is

θ_ML = argmax_θ L = argmax_θ Σ_{n=1}^N ln P(xn|θ)

• Example: For Gaussian likelihood P(x|θ) = N(x|μ,σ²),

L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (xn − μ)²
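For the Gaussian, the ML estimates have closed forms: the sample mean and the (1/N) sample variance. A sketch on synthetic data (the data and the perturbation checks are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic IID sample

mu_ml = data.mean()                      # ML estimate of the mean
var_ml = ((data - mu_ml) ** 2).mean()    # ML estimate of the variance (1/N, not 1/(N-1))

def log_likelihood(x, mu, var):
    """L = -(N/2) ln(2 pi var) - sum_n (x_n - mu)^2 / (2 var)"""
    N = len(x)
    return -0.5 * N * np.log(2 * np.pi * var) - ((x - mu) ** 2).sum() / (2 * var)

# Nudging either estimate away from its ML value can only lower L.
L_ml = log_likelihood(data, mu_ml, var_ml)
assert L_ml >= log_likelihood(data, mu_ml + 0.1, var_ml)
assert L_ml >= log_likelihood(data, mu_ml, var_ml + 0.1)
```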

Page 22: Intro to Probability

Comments on notation from now on

• Instead of Σ_j P(X=i,Y=j), we write Σ_Y P(X,Y)

• P() and p() are used interchangeably

• Discrete and continuous variables are treated the same, so Σ_X, Σ_x, ∫_X dx and ∫_x dx are interchangeable

• θ_ML and θ̂_ML are interchangeable

• argmax_θ f(θ) is the value of θ that maximizes f(θ)

• In the context of data x1,…,xN, the symbols x and X (in plain or bold face) refer to the entire set of data

• N(x|μ,σ²) = (1/√(2πσ²)) exp( −(x−μ)² / (2σ²) )

• log() = ln() and exp(x) = e^x

• p_context(x) and p(x|context) are interchangeable


Page 24: Intro to Probability

Questions?


Page 26: Intro to Probability

Maximum likelihood estimation

• Example: For Gaussian likelihood P(x|θ) = N(x|μ,σ²),

L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (xn − μ)²

Objective of regression: Minimize error

E(w) = ½ Σ_n ( tn − y(xn,w) )²

Maximizing the Gaussian log-likelihood over the mean is the same as minimizing a sum of squared errors, which is how maximum likelihood connects probability to regression.
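A least-squares sketch tying the two together (the synthetic data and the linear model y(x,w) = w0 + w1·x are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)  # targets with Gaussian noise

# Model y(x, w) = w0 + w1*x; least squares minimizes E(w) exactly,
# which is the ML solution under Gaussian noise.
A = np.stack([np.ones_like(x), x], axis=1)
w, *_ = np.linalg.lstsq(A, t, rcond=None)

E = 0.5 * np.sum((t - A @ w) ** 2)   # E(w) = 1/2 sum_n (t_n - y(x_n, w))^2
print(w, E)                          # w is close to [1.0, 2.0]
```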