Database Management Systems: Data Mining. Statistics Review. Jerry Post, Copyright © 2003.



Probability

Relative frequency approach: The number of times that an event occurs out of the total population of events.
Example: You have 3 red balls and 7 white balls in a bag. The probability of drawing a white ball on the first try is 70%.
Example: Your customers are distributed across five cities: 35% in City A, 25% in City B, 20% in City C, 15% in City D, 15% in City E.

Subjective probability: A belief in the likelihood of an outcome. Often subjective because of lack of full information. Generally modified over time based on acquisition of new information. It is important (but difficult) to separate belief from preference, and also important that subjective probability maintain consistency.
Example: There is a 65% chance that the Federal Reserve Board will reduce interest rates at the next meeting.


Probability: Frequency

Need a complete count of events.
Permutations: Order does count.
Combinations: Order does not count.

Basic multiplication rule: If a single action has k ways to be performed, and the action is performed n times, the total number of possible outcomes is k*k*k*…*k = k^n.
Example: Flip a coin five times (n = 5). A single act has two outcomes (k = 2), so there are 2^5 = 32 possible outcomes.

$\Pr(\text{success}) = \dfrac{\#\text{ of ways for success}}{\#\text{ of total possibilities}}$
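The multiplication rule and the relative-frequency ratio can be checked by brute force. The following is a minimal sketch in Python (a language chosen here for illustration; the slides themselves contain no code). The event "exactly two heads" and all variable names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

# Each flip has k = 2 outcomes; the coin is flipped n = 5 times.
outcomes = list(product("HT", repeat=5))
total = len(outcomes)  # k ** n possible sequences

# Relative frequency: (# of ways for success) / (# of total possibilities).
# "Success" is an illustrative event: exactly two heads in five flips.
successes = [seq for seq in outcomes if seq.count("H") == 2]
pr_two_heads = Fraction(len(successes), total)

print(total)         # 32
print(pr_two_heads)  # 5/16
```

Enumerating outcomes only works for small n, but it makes the "complete count of events" requirement concrete.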


Counting: Permutation

How many ways can objects (or actions) be rearranged?
You have four cards: A, K, Q, J. How many ways can they be arranged?
Four items (n) arranged one card at a time (r): 4 * 3 * 2 * 1 = 24

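[Tree diagram: the first card can be any of A, K, Q, J (4 choices); each choice leaves 3 cards for the second position, then 2 for the third, then 1 for the last.]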


Permutation: General

Ways to rearrange n items taken r at a time: n(n-1)(n-2)…(n-r+1)

$_nP_r = \dfrac{n!}{(n-r)!}$
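As a quick check of the permutation formula, the sketch below counts orderings of the four cards two ways: by enumeration and by n!/(n-r)!. It uses only the Python standard library; the variable names are illustrative.

```python
import math
from itertools import permutations

cards = ["A", "K", "Q", "J"]

# Enumerate every ordering of all four cards (n = 4, r = 4).
orderings = list(permutations(cards))
print(len(orderings))  # 24

# The same count from the formula n! / (n - r)!
n, r = 4, 4
formula_count = math.factorial(n) // math.factorial(n - r)
print(formula_count)  # 24

# Taking only r = 2 cards at a time gives 4 * 3 = 12 ordered pairs.
ordered_pairs = list(permutations(cards, 2))
print(len(ordered_pairs))  # 12
```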


Combinations

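Number of ways of selecting items, where order does not count.
Combinations are smaller than permutations: you can divide the number of permutations by the number of ways of arranging the r objects (r!).

$_nC_r = \binom{n}{r} = \dfrac{n!}{(n-r)!\,r!}$

Elect three people from a group of ten: n = 10, r = 3.

$\dfrac{10!}{(10-3)!\,3!} = \dfrac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)(3 \cdot 2 \cdot 1)} = \dfrac{720}{6} = 120$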
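The relationship between permutations and combinations in the election example can be verified directly. This is a small sketch using `math.perm` and `math.comb` (available in Python 3.8 and later); numbering the ten people 0 through 9 is an assumption made for the enumeration.

```python
import math
from itertools import combinations

n, r = 10, 3

ordered = math.perm(n, r)                 # 10 * 9 * 8 = 720 ordered selections
unordered = ordered // math.factorial(r)  # each group of 3 is counted 3! = 6 times
print(unordered)        # 120
print(math.comb(n, r))  # 120

# Brute-force check: list every 3-person group from people numbered 0..9.
groups = list(combinations(range(n), r))
print(len(groups))      # 120
```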


Probability Rules: Complement

Complement (opposite): P(E) + P(E’) = 1

The probability of an event happening or not happening is one.


Probability Rules: Mutually Exclusive

Mutually exclusive: Only one event of a group can happen. The probability of both occurring is zero: P(A ∩ B) = 0
Then the probability of one or the other of the events occurring is the sum of the probabilities: P(A ∪ B) = P(A) + P(B)

Example: pool balls, numbered 1 through 10.
Event A: Draw a ball numbered <= 3
Event B: Draw a ball numbered >= 6
P(A or B) = 3/10 + 5/10 = 8/10
Can also be found as the complement: 1 - 2/10 = 8/10

In general, for mutually exclusive events: P(E1 ∪ E2 ∪ … ∪ En) = Σ P(Ei)
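The pool-ball example can be reproduced with sets, which makes the "no overlap" condition explicit. This is an illustrative sketch; the helper `prob` and the variable names are assumptions, and every ball is taken to be equally likely.

```python
from fractions import Fraction

balls = range(1, 11)  # pool balls numbered 1 through 10
event_a = {b for b in balls if b <= 3}
event_b = {b for b in balls if b >= 6}

def prob(event):
    """Relative-frequency probability of an event over the ten balls."""
    return Fraction(len(event), len(balls))

# Mutually exclusive: the two events share no outcomes.
print(event_a & event_b)  # set()

# Addition rule and the complement give the same answer.
p_union = prob(event_a | event_b)
p_sum = prob(event_a) + prob(event_b)
p_via_complement = 1 - prob(set(balls) - (event_a | event_b))
print(p_union, p_sum, p_via_complement)  # 4/5 4/5 4/5
```

`Fraction` reduces 8/10 to 4/5, so the printed values are the same probability as on the slide.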


Probability Rules: Independence

Events are independent (pairwise) if they have no influence on each other.

If events are independent, the probability of both events occurring is found by multiplying their individual probabilities: P(A ∩ B) = P(A) P(B)

Example: An urn has 3 red balls and 7 white ones. Draw a ball and then flip a coin. What is the probability you draw a white ball and flip heads?
P(A ∩ B) = 0.7 * 0.5 = 0.35
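One way to see why multiplication works for independent events is to list the combined sample space of the urn and the coin. The sketch below assumes every (ball, coin side) pair is equally likely; the helper `prob` is hypothetical and exists only for this example.

```python
from fractions import Fraction
from itertools import product

balls = ["red"] * 3 + ["white"] * 7
coin = ["heads", "tails"]
space = list(product(balls, coin))  # 10 * 2 = 20 equally likely pairs

def prob(predicate):
    """Fraction of outcomes in the combined sample space satisfying predicate."""
    hits = sum(1 for outcome in space if predicate(outcome))
    return Fraction(hits, len(space))

p_white = prob(lambda o: o[0] == "white")         # 7/10
p_heads = prob(lambda o: o[1] == "heads")         # 1/2
p_both = prob(lambda o: o == ("white", "heads"))  # 7/20

print(p_both == p_white * p_heads)  # True
print(float(p_both))                # 0.35
```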


Conditional Probability

The probability that event A will occur given that event B has already happened: P(A | B)

Example 1: An urn has 3 red balls and 7 white ones. On the first draw you pull out a white ball (event B). If you do not replace that ball in the urn, what is the probability of drawing a red ball next (event A)?
Answer: 3/9. Note that these events are not independent.

In general, the probability of two events occurring: P(A ∩ B) = P(A) P(B | A)

Example 2: Draw 2 cards from a 52-card deck without replacement. What is the probability that both are kings?
P(King1) = 4/52
P(King2 | King1) = 3/51
P(King2 ∩ King1) = 4/52 * 3/51 = 1/221
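The two-kings calculation can be confirmed by counting ordered draws. This is a sketch only: the deck is simplified to 4 kings and 48 other cards (marked "x"), an assumption that is sufficient for this question.

```python
from fractions import Fraction
from itertools import permutations

# Chain rule: P(King1 and King2) = P(King1) * P(King2 | King1)
p_king1 = Fraction(4, 52)
p_king2_given_king1 = Fraction(3, 51)
p_both = p_king1 * p_king2_given_king1
print(p_both)  # 1/221

# Brute-force check: every ordered pair of distinct card positions.
deck = ["K"] * 4 + ["x"] * 48
draws = list(permutations(range(52), 2))  # 52 * 51 = 2652 ordered draws
both_kings = sum(1 for i, j in draws if deck[i] == "K" and deck[j] == "K")
print(Fraction(both_kings, len(draws)))   # 1/221
```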


Probability: Joint and Conditional Table

              Female   Male   Total
Married        .42     .18     .60
Not Married    .28     .12     .40
Total          .70     .30    1.00

P(Female) = .70
P(Married ∩ Female) = .42
P(Married | Female) = P(M ∩ F)/P(F) = .42/.70 = .60
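The table lookup above can be expressed as a small computation over the joint probabilities. The sketch stores the four cells in a dictionary and derives the marginal and conditional values; the `marginal` helper is a hypothetical name, and `Fraction` is used so the decimals stay exact.

```python
from fractions import Fraction

# Joint probabilities from the table, keyed by (marital status, sex).
joint = {
    ("Married", "Female"): Fraction("0.42"),
    ("Married", "Male"): Fraction("0.18"),
    ("Not Married", "Female"): Fraction("0.28"),
    ("Not Married", "Male"): Fraction("0.12"),
}

def marginal(position, value):
    """Sum the joint cells whose key at `position` equals `value`."""
    return sum(p for key, p in joint.items() if key[position] == value)

p_female = marginal(1, "Female")    # column total, .70
p_married = marginal(0, "Married")  # row total, .60
p_married_given_female = joint[("Married", "Female")] / p_female

print(float(p_female))                # 0.7
print(float(p_married_given_female))  # 0.6
```

In this particular table P(Married | Female) equals the marginal P(Married) = .60, so the two attributes happen to be independent.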


Joint Probability: Tree Diagram

Manufacturing:
Group A: 4 machines, 5% defect rate
Group B: 6 machines, 10% defect rate
Choose a machine, then a product. What is the probability that it is defective?

[Tree diagram]
P(A) = .4
    P(D | A) = .05    ->  P(A ∩ D) = .02
    P(D’ | A) = .95   ->  P(A ∩ D’) = .38
P(B) = .6
    P(D | B) = .10    ->  P(B ∩ D) = .06
    P(D’ | B) = .90   ->  P(B ∩ D’) = .54
Total: 1.00


Joint Probabilities: Table

Probability    Defective (D)   Non-defective (D’)
P(A) = 0.4         0.05              0.95
P(B) = 0.6         0.10              0.90

Production     Defective (D)   Non-defective (D’)
A                  0.02              0.38
B                  0.06              0.54
Total              0.08              0.92

P(A ∩ D) = P(A)*P(D|A) = 0.4(0.05) = .02
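The joint table follows mechanically from the priors and the conditional defect rates. The sketch below rebuilds it, using `Fraction` to avoid floating-point noise; the dictionary layout and names are assumptions made for illustration.

```python
from fractions import Fraction

prior = {"A": Fraction("0.4"), "B": Fraction("0.6")}               # P(group)
p_defect_given = {"A": Fraction("0.05"), "B": Fraction("0.10")}    # P(D | group)

# Joint probability: P(group and D) = P(group) * P(D | group)
joint = {}
for group in prior:
    joint[(group, "D")] = prior[group] * p_defect_given[group]
    joint[(group, "D'")] = prior[group] * (1 - p_defect_given[group])

# Column totals give the overall probabilities of defective / non-defective.
p_defective = joint[("A", "D")] + joint[("B", "D")]
p_good = joint[("A", "D'")] + joint[("B", "D'")]

print(float(joint[("A", "D")]))  # 0.02
print(float(p_defective))        # 0.08
print(float(p_good))             # 0.92
```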


Bayes’ Theorem

Now, in a sense, work backwards.
We sample a part at random and it is defective.
What is the probability that it came from machine A? Machine B?

$P(A \mid D) = \dfrac{P(A \cap D)}{P(D)}$

P(A | D) = 0.02/0.08 = 1/4
P(B | D) = 0.06/0.08 = 3/4

In this example, the machine is the state of nature we wish to identify, and defective or not is the information.


Bayes’ Theorem in General

We know:
(1) There are n states of nature S1, S2, …, Sn
(2) An initial (a priori) probability for each state
(3) Some type of information I
(4) The conditional probabilities: P(I | Si)

We can compute the posterior probabilities, given the new information:

$P(S_i \mid I) = \dfrac{P(S_i \cap I)}{P(I)} = \dfrac{P(S_i \cap I)}{P(S_1 \cap I) + P(S_2 \cap I) + \cdots + P(S_n \cap I)}$

$P(S_i \mid I) = \dfrac{P(S_i)\,P(I \mid S_i)}{P(S_1)\,P(I \mid S_1) + P(S_2)\,P(I \mid S_2) + \cdots + P(S_n)\,P(I \mid S_n)}$
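The general formula translates into a few lines of code. The function below is a minimal sketch, not a library routine: the name `posterior` and its dictionary-based interface are assumptions. It is applied to the machine example from the earlier slides (priors .4 and .6, defect rates .05 and .10).

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(S_i | I) for every state, given P(S_i) and P(I | S_i)."""
    # Numerators: P(S_i) * P(I | S_i), i.e. the joint probabilities.
    joint = {state: priors[state] * likelihoods[state] for state in priors}
    # Denominator: P(I), the sum of the joint probabilities over all states.
    p_info = sum(joint.values())
    return {state: joint[state] / p_info for state in joint}

# Machine example: states are the machine groups, information is "defective".
priors = {"A": Fraction(4, 10), "B": Fraction(6, 10)}
p_defect = {"A": Fraction(5, 100), "B": Fraction(10, 100)}
result = posterior(priors, p_defect)

print(result["A"], result["B"])  # 1/4 3/4
```

Because the denominator is the sum of the numerators, the posterior probabilities always add up to 1.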


Bayes’ Theorem Example

Chao: Statistics for Management/2e
States of economy: S1: recession, S2: stable, S3: prosperity
P(S1) = .25, P(S2) = .5, P(S3) = .25 (in general/a priori)
We have forecasts as information. The forecasts are either optimistic (I) or pessimistic (I’).
The results of the forecasts in the past are as follows:

Prior Probability   State of Economy   Optimistic (I)   Pessimistic (I’)
P(S1) = .25         S1                 0.1              0.9
P(S2) = .50         S2                 0.5              0.5
P(S3) = .25         S3                 0.8              0.2


Example: Joint Probability

Prior Probability   State of Economy   Optimistic (I): P(I | Si)   Pessimistic (I’): P(I’ | Si)
P(S1) = .25         S1                 0.1                         0.9
P(S2) = .50         S2                 0.5                         0.5
P(S3) = .25         S3                 0.8                         0.2

State   Optimistic (I)        Pessimistic (I’)
S1      P(S1 ∩ I) = 0.025     0.225
S2      P(S2 ∩ I) = 0.250     0.250
S3      P(S3 ∩ I) = 0.200     0.050
Total   P(I) = 0.475          P(I’) = 0.525


Bayes’ Example

State   Optimistic (I)        Pessimistic (I’)
S1      P(S1 ∩ I) = 0.025     0.225
S2      P(S2 ∩ I) = 0.250     0.250
S3      P(S3 ∩ I) = 0.200     0.050
Total   P(I) = 0.475          P(I’) = 0.525

Probability that next year is prosperous (S3) if the forecast is optimistic (I):
P(S3 | I) = P(S3 ∩ I)/P(I) = 0.200/0.475 = .421

$P(S_3 \mid I) = \dfrac{0.25(0.8)}{0.25(0.1) + 0.5(0.5) + 0.25(0.8)} = 0.2/0.475$


Bayes: Prior and Posterior Probabilities

Probability estimates at the start (a priori) are naïve:

P(S1) = 0.25

P(S2) = 0.50

P(S3) = 0.25

Probabilities after the forecast (posterior) reflect the new information:

P(S1 | I) = 0.053 P(S1 | I’) = 0.429

P(S2 | I) = 0.526 P(S2 | I’) = 0.476

P(S3 | I) = 0.421 P(S3 | I’) = 0.095
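The six posterior values can be reproduced from the priors and the forecast track record. This is an illustrative sketch; the dictionary names are assumptions, and the pessimistic likelihoods are derived as 1 - P(I | Si), which matches the table of past forecast results.

```python
# Priors P(S_i) and forecast track record P(I | S_i) from the example.
priors = {"S1": 0.25, "S2": 0.50, "S3": 0.25}
p_optimistic = {"S1": 0.1, "S2": 0.5, "S3": 0.8}
p_pessimistic = {s: 1 - p for s, p in p_optimistic.items()}

posteriors = {}
for label, likelihood in (("I", p_optimistic), ("I'", p_pessimistic)):
    # Joint probabilities P(S_i and forecast), then their total P(forecast).
    joint = {s: priors[s] * likelihood[s] for s in priors}
    p_forecast = sum(joint.values())
    for s in priors:
        posteriors[(s, label)] = joint[s] / p_forecast
        print(f"P({s} | {label}) = {posteriors[(s, label)]:.3f}")
```

The loop prints the optimistic column first (0.053, 0.526, 0.421) and then the pessimistic column (0.429, 0.476, 0.095).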


Mean and Standard Deviation

[Figure: Normal Distribution. Density curves labeled N(0,1), N(3,1), and N(0,5); x-axis from -6 to 6, y-axis from 0 to 0.45. Annotations: Mean = 0; standard deviations: 1, 2, 3.]


Cumulative Normal

[Figure: Cumulative Normal Probability. Cumulative distribution curve of the standard normal; x-axis from -4 to 4, y-axis from 0 to 1.]

P(X<=0) = 0.5000
P(X<=1) = 0.8413
P(X<=2) = 0.9772
P(X<=3) = 0.9987
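The cumulative probabilities marked on the chart can be computed rather than read off a table. The sketch below assumes the chart shows the standard normal distribution (mean 0, standard deviation 1) and uses `statistics.NormalDist` from the Python standard library (3.8 and later).

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (assumed).
standard_normal = NormalDist(mu=0, sigma=1)

cumulative = {x: standard_normal.cdf(x) for x in (0, 1, 2, 3)}
for x, p in cumulative.items():
    print(f"P(X <= {x}) = {p:.4f}")

# By symmetry, the probability of falling within one standard deviation
# of the mean is P(X <= 1) - P(X <= -1).
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(f"{within_one_sd:.4f}")
```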


Hypothesis Testing

[Figure: two density plots over x from -6 to 5.9 (y-axis from 0 to 0.25), with the critical value marked.]