Probability and Statistics for Data Mining
COMP5318
Question 1
• Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Gender   % of credit card holders   % of gender who default
Male     60                         55
Female   40                         35
Probability
• Probability is the mathematical language to understand uncertainty.
• We need to make decisions in the presence of uncertainty, which is ever present.
• Example: The Earth is warming, a phenomenon known as Global Warming (GW). Is modern human activity the cause of GW?
– Physics driven approach
– Data driven approach
Experiments and Observation
• When an experiment is carried out we observe the outcome, which is often uncertain.
– If not uncertain then why carry out the experiment?
• We look into a random shopping basket. Does it contain a packet of “Tofu”?
• We toss a coin. Does it land on “Heads”?
• We ask a question: “Is it raining in Broome, WA, right now?”
Building Blocks of Probability
• The space of all possible outcomes is called the sample space.
– Non-trivial to decide.
• Single Coin Toss. The space is {H,T}.
• Shopping Basket. The space of all possible combinations of all items sold in the store.
• Shopping Basket: {Tofu, Not-Tofu}.
Events
• Events are subsets of the sample space. Events are often defined in familiar terms.
• In the shopping basket scenario:
– A vegetarian shopping basket is an event.
– It is the set of all possible vegetarian item combinations.
• Throw of a die. The event we are looking for could be: Even Number = {2,4,6}, where the sample space = {1,2,3,4,5,6}
Events
• Let G be the set of all galaxies. Characterize each galaxy by three numbers:
– d: distance from earth
– a: major axis
– b: minor axis
• Elliptic Galaxies (EG)
– EG = {(a,b,d) | a/b > 1.5}
• Distant Spiral Galaxies (DSG)
– DSG = {(a,b,d) | a/b <= 1.5 and d > 10}
Events
• Let G be the set of all genes. Each gene can be “on” or “off”. Let E correspond to the event: all genes which are “on” when the skin cells are “starved”.
Events are Sets
• At the most basic level events are sets. Therefore we can carry out set union, difference and intersection on events.
• For example:
– E1: shopping baskets which contain Tofu
– E2: shopping baskets which contain Milk
– E1 U E2: shopping baskets which contain either Tofu or Milk
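Because events are sets, the operations named above map directly onto a set type. A minimal sketch, assuming four made-up baskets (the basket names and contents are hypothetical, chosen only for illustration):

```python
# Hypothetical baskets (not from the lecture): basket id -> items in the basket.
baskets = {
    "b1": {"Tofu", "Rice"},
    "b2": {"Milk", "Bread"},
    "b3": {"Tofu", "Milk"},
    "b4": {"Eggs"},
}

# An event is the set of outcomes (here: basket ids) that satisfy a condition.
e1 = {b for b, items in baskets.items() if "Tofu" in items}  # contains Tofu
e2 = {b for b, items in baskets.items() if "Milk" in items}  # contains Milk

print(sorted(e1 | e2))  # union: Tofu or Milk -> ['b1', 'b2', 'b3']
print(sorted(e1 & e2))  # intersection: Tofu and Milk -> ['b3']
print(sorted(e1 - e2))  # difference: Tofu but not Milk -> ['b1']
```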
Probability
• Let S be the space of all possible elementary outcomes and let Power(S) be the power set of S. Then the probability P is a function P : Power(S) → [0,1] that satisfies the following properties (axioms):
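The axioms themselves are not reproduced in this transcript. As an assumption about what the slide listed, the standard Kolmogorov axioms for events in Power(S) can be written as follows (the first is already implied by the range [0,1]):

```latex
\begin{align*}
&\text{(1)}\quad P(A) \ge 0 \quad \text{for every event } A \subseteq S,\\
&\text{(2)}\quad P(S) = 1,\\
&\text{(3)}\quad P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)
  \quad \text{whenever } A_i \cap A_j = \emptyset \text{ for all } i \ne j.
\end{align*}
```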
Interpretation of Probability
• Physical or Ontological: Long term frequency
– 50% chance that a coin will land on heads.
– 20% of all Woolworth shopping baskets are vegetarian.
– 22% of all Woolworth shopping baskets in Northbridge plaza are vegetarian.
• Epistemological: Degree of Belief
– 20% chance that my neighbours are watering their lawn on “dry” days.
– 99% chance that the green immovable object outside my house is a Tree.
– 90% chance that Australia will win the cricket world cup.
Consequences of Axioms
Example
• Two coin tosses. Let H1 be the event that a head occurs on toss 1 and H2 a head on toss 2. All outcomes are equally likely.
• Sample space = {HH, HT, TH, TT}
– H1 = {HH, HT}
– H2 = {HH, TH}
– P(H1 U H2) = P(H1) + P(H2) - P(H1 ∩ H2) = ½ + ½ - ¼ = 3/4
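The two-toss result can be checked by enumerating the sample space. A minimal sketch, assuming equally likely outcomes as the example states; the helper name `prob` is illustrative:

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT.
sample_space = {"".join(t) for t in product("HT", repeat=2)}

def prob(event):
    """Probability of an event (a subset of the sample space), outcomes equally likely."""
    return Fraction(len(event), len(sample_space))

h1 = {o for o in sample_space if o[0] == "H"}  # heads on toss 1
h2 = {o for o in sample_space if o[1] == "H"}  # heads on toss 2

direct = prob(h1 | h2)                            # count the union directly
by_formula = prob(h1) + prob(h2) - prob(h1 & h2)  # inclusion-exclusion
print(direct, by_formula)  # 3/4 3/4
```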
Example
• Two events A and B are independent if
– P(A ∩ B) = P(A)P(B)
• P(A ∩ B) is also written as P(AB) and P(A,B).
• If A and B are disjoint events with P(A) > 0 and P(B) > 0, then A and B cannot be independent.
– P(A ∩ B) = 0, yet P(A)P(B) > 0.
• Except for this case, you cannot determine independence by looking at a Venn diagram.
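The definition can be tested mechanically on a small sample space. A minimal sketch on a fair die; the three events below are hypothetical choices made for illustration, not taken from the slides:

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}  # fair die, outcomes equally likely

def prob(event):
    """Probability of an event on the fair die."""
    return Fraction(len(event & die), len(die))

def independent(a, b):
    """True when P(A ∩ B) = P(A)P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
low_odd = {1, 3}  # disjoint from `even`, with positive probability

print(independent(even, at_most_four))  # True: 1/3 == 1/2 * 2/3
print(independent(even, low_odd))       # False: 0 != 1/2 * 1/3
```

The first pair overlaps and is independent; the second pair is disjoint and therefore cannot be.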
Question
• A shopping basket can either be kosher or not. The probability that it will be kosher is 3/4. Examine 10 baskets at a checkout counter. What is the probability that there will be at least one kosher basket?
Answer
• Let E be the event “At least one kosher basket.” Let NKi be the event that the i-th basket is non-kosher.
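With these events, E is the complement of "all ten baskets are non-kosher", so P(E) = 1 - P(NK1 ∩ ... ∩ NK10). A minimal sketch of the arithmetic, assuming the ten baskets are independent (the question does not state this explicitly):

```python
from fractions import Fraction

p_kosher = Fraction(3, 4)  # P(a basket is kosher), from the question
n_baskets = 10

# Assumption: baskets are independent, so P(NK1 ∩ ... ∩ NK10) = (1/4)^10.
p_none_kosher = (1 - p_kosher) ** n_baskets
p_at_least_one = 1 - p_none_kosher

print(p_none_kosher)   # 1/1048576
print(p_at_least_one)  # 1048575/1048576
```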
Independence
Example
• For an Online Book Seller (OBS) the conversion rate is 1/100, i.e., on average one in every 100 visitors ends up making a purchase. What is the probability that at least one purchase will be made in 10 consecutive visits (by distinct customers)?
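One way to sanity-check an answer here is to compare the complement-rule formula with a simulation. A rough sketch, assuming each visit converts independently with probability 1/100; the function names, trial count, and seed are illustrative choices:

```python
import random

def at_least_one(p, n):
    """Closed form 1 - (1 - p)^n, assuming n independent visits."""
    return 1 - (1 - p) ** n

def simulate(p, n, trials, seed=0):
    """Monte Carlo estimate of P(at least one purchase in n visits)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(
        any(rng.random() < p for _ in range(n))  # did any visit convert?
        for _ in range(trials)
    )
    return hits / trials

exact = at_least_one(0.01, 10)
estimate = simulate(0.01, 10, trials=100_000)
print(round(exact, 4))  # 0.0956
print(estimate)         # should land close to the exact value
```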
Example
• Two people take turns to sink a basketball. P1 succeeds with probability 1/3 and P2 with ¼. What is the probability that P1 succeeds before P2?
• Requires clever setting up of the events.
– Let E be the event that P1 succeeds before P2.
– Let Ai be the event that P1 succeeds before P2 on the ith trial.
– Ai ∩ Aj = Ø for i ≠ j, and E = A1 U A2 U A3 U ...
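Since the Ai are disjoint, P(E) is the sum of the P(Ai). A minimal sketch of that sum, assuming P1 shoots first in each round and all shots are independent (neither is stated on the slide); under those assumptions P(Ai) = ((2/3)(3/4))^(i-1) × 1/3, a geometric series:

```python
from fractions import Fraction

p1 = Fraction(1, 3)  # P1's success probability per shot
p2 = Fraction(1, 4)  # P2's success probability per shot

# Probability that both players miss in one full round: (2/3)(3/4) = 1/2.
both_miss = (1 - p1) * (1 - p2)

def p_a(i):
    """P(A_i): i-1 rounds in which both miss, then P1 scores in round i."""
    return both_miss ** (i - 1) * p1

closed_form = p1 / (1 - both_miss)               # sum of the geometric series
partial_sum = sum(p_a(i) for i in range(1, 41))  # first 40 terms

print(closed_form)                       # 2/3
print(float(closed_form - partial_sum))  # tiny remainder of the tail
```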
Conditional Probability
• Very Important Concept
• P(A|B) is the “fraction of occurrences of B in which A also occurs”
– P(A|B) = P(A ∩ B)/P(B); P(B) > 0
• For a fixed B, P(.|B) is a probability
– Therefore if A1 and A2 are disjoint then
– P(A1 U A2 |B) = P(A1|B) + P(A2|B)
• Note that, in general, P(A|B U C) ≠ P(A|B) + P(A|C)
• Also, in general, P(A|B) ≠ P(B|A)
Standard Example
        D       Dc
+     0.009   0.099
-     0.001   0.891
D is disease, Dc is no disease; +/- is test positive or negative.
P(+|D) = P(+ ∩ D)/P(D) = 0.009/(0.009 + 0.001) = 0.9
P(-|Dc) = P(- ∩ Dc)/P(Dc) = 0.891/(0.891 + 0.099) = 0.9
Suppose a test is positive. What is the probability of disease?
P(D|+) = P(D ∩ +)/P(+) = 0.009/(0.009 + 0.099) ≈ 0.08
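The three conditional probabilities in this example all come from dividing one cell of the joint table by a row or column total. A minimal sketch using the four joint probabilities from the slide; the dictionary layout and helper names are illustrative:

```python
# Joint probabilities P(test result ∩ disease status) from the slide's table.
joint = {
    ("+", "D"): 0.009, ("+", "Dc"): 0.099,
    ("-", "D"): 0.001, ("-", "Dc"): 0.891,
}

def p_status(status):
    """Marginal P(disease status): sum the joint over test results."""
    return sum(p for (_, s), p in joint.items() if s == status)

def p_test(test):
    """Marginal P(test result): sum the joint over disease status."""
    return sum(p for (t, _), p in joint.items() if t == test)

p_pos_given_d = joint[("+", "D")] / p_status("D")     # P(+|D)
p_neg_given_dc = joint[("-", "Dc")] / p_status("Dc")  # P(-|Dc)
p_d_given_pos = joint[("+", "D")] / p_test("+")       # P(D|+)

print(round(p_pos_given_d, 3), round(p_neg_given_dc, 3), round(p_d_given_pos, 3))
# 0.9 0.9 0.083
```

The test is right 90% of the time in each group, yet a positive result implies disease only about 8% of the time, because the disease itself is rare (P(D) = 0.01).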
Standard Data Mining Example
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Suppose the data above closely resembles the behaviour of the population at large.
What is the chance that those who buy a Diaper will also buy Beer?
P(Beer|Diaper) = P(Diaper ∩ Beer)/P(Diaper) = 0.6/0.8 = 0.75
Is Diaper an Event?
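The Diaper/Beer figure can be recomputed from the five transactions. A minimal sketch that treats "Diaper" as the event "the set of transactions whose basket contains Diaper", which is one reasonable reading of the question above; the helper names `event` and `prob` are illustrative:

```python
from fractions import Fraction

# The five transactions from the table (TID -> items).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def event(*items):
    """Event: the set of TIDs whose basket contains all the given items."""
    return {tid for tid, basket in transactions.items() if set(items) <= basket}

def prob(ev):
    """Relative frequency of an event among the transactions."""
    return Fraction(len(ev), len(transactions))

diaper = event("Diaper")                   # {2, 3, 4, 5}
diaper_and_beer = event("Diaper", "Beer")  # {2, 3, 4}

p_beer_given_diaper = prob(diaper_and_beer) / prob(diaper)
print(prob(diaper), prob(diaper_and_beer), p_beer_given_diaper)  # 4/5 3/5 3/4
```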
Conditional Independence
• If A and B are independent then P(A|B)=P(A)
• P(AB) = P(A|B)P(B)
• Law of Total Probability.
Bayes Theorem
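The formula for this slide is not included in the transcript. Assuming the standard statement, for events B_1, ..., B_n that partition the sample space with each P(B_i) > 0, and any event A with P(A) > 0:

```latex
\begin{align*}
P(A) &= \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)
  && \text{(law of total probability)}\\[4pt]
P(B_k \mid A) &= \frac{P(A \mid B_k)\, P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)}
  && \text{(Bayes theorem)}
\end{align*}
```

The second line follows from P(AB) = P(A|B)P(B) applied in both directions, with the denominator P(A) expanded by the first line.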
Question 1
• Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Gender   % of credit card holders   % of gender who default
Male     60                         55
Female   40                         35
Answer to Question 1
P(G=F|D=Y) = P(D=Y|G=F)P(G=F) / [P(D=Y|G=F)P(G=F) + P(D=Y|G=M)P(G=M)]
= (0.35 × 0.40) / (0.35 × 0.40 + 0.55 × 0.60) ≈ 0.30
But what does G=F and D=Y mean? We have not even formally defined them.
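Leaving the formal definitions aside, the arithmetic for Question 1 can be checked with a few lines. A minimal sketch using the percentages from the question's table; the variable names are illustrative:

```python
# Figures from the Question 1 table, as probabilities.
p_gender = {"F": 0.40, "M": 0.60}                # P(G = g)
p_default_given_gender = {"F": 0.35, "M": 0.55}  # P(D = Y | G = g)

# Law of total probability: P(D = Y) = sum over g of P(D = Y | G = g) P(G = g).
p_default = sum(p_default_given_gender[g] * p_gender[g] for g in p_gender)

# Bayes theorem: P(G = F | D = Y).
p_female_given_default = p_default_given_gender["F"] * p_gender["F"] / p_default

print(round(p_default, 2), round(p_female_given_default, 2))  # 0.47 0.3
```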