Basic Concepts of Probability

Basic Concepts of Probability and Independence of Events

By Amitava Bandyopadhyay

Learning Objectives

• Understand the concepts of events and probability of events

• Understand the notion of conditional probabilities and independence of different kinds

• Understand the concept of inverse probabilities and Bayes’ theorem

• Understand specific concepts of lift, support, sensitivity and specificity

• Develop ability to use these concepts for formulation of business problems and providing solutions to the same

Events

• In business analytics we often study the occurrence of events. A customer coming into a store may or may not buy some items. In this case buying is an event. We may be interested in knowing the amount spent by the customer. In this case the amount spent (or the range of spend) is an event. A particular machine may or may not fail in a given time interval. In this case failure of the machine in the given time interval is an event.

• Loosely speaking, an event is something happening. We attach a numeric value to the outcome being observed.

• An interesting point is that while dealing with events we are dealing with a special kind of variable. The range of all possible values of this variable is known in advance but the exact value that will happen in the next instance is not known. Such a variable is called a random variable.

• Note: An event is a subset of values that the random variable can assume

Events (Continued…)

• Consider the case of a customer walking into a retail store. During her stay in the store, she may or may not buy. Accordingly, we may define a random variable X that takes two values – 0 if nothing was bought and 1, if something was bought. Events of buying or not buying may then be defined.

• Events are usually denoted by capital letters.

• Note: Events are subsets of values of a random variable. An important assumption is that the values of the random variable are being generated under similar conditions. This assumption is often ignored, which is a hazardous practice.

Examples

• You are working for an automobile company. The company wants to know how many times an automobile might fail during the warranty period (say before travelling 10000 miles). The number of failures is a random variable. Suppose we denote this random variable by X. An event may be described as X ≤ 5, i.e. the event that an automobile fails at most 5 times.

• Suppose an automobile has been serviced and you want to know how many miles it will travel before encountering the next failure. Let X denote the number of miles. Here X is a random variable. An event may be X ≥ 5000.

Note

• Many real life situations are actually events that describe some values assumed by a random variable

• Number of vehicles sold during a month, number of accidents in a day, number of customers coming to a retail store in a month, number of telephone calls made by customers in a day to a call center – all these are examples of random variables. An event specifies some values of the random variable.

• In business analytics you must always try to identify the random variable and events.

Probability of an Event

• We will normally use the symbol P(A) to denote the probability that the event A happens

• Let A be the event that a personal loan given to a particular customer turns out to be bad (cannot be recovered)

• Let P(A) = 0.03

• This implies that the past record of the bank shows that 3% of the personal loans turn out to be bad. (Note that we are assuming that all personal loans are sanctioned under more or less similar conditions. In case the past records contain a period of severe recession leading to loss of many jobs, the analyst should be careful in estimating the probability. It may even be reasonable to drop that period.)

Example-cum-Exercise

Suppose a telecom service provider has carried out a survey to find the level of importance customers attach to various aspects of their experience of using the service. Suppose the importance is given on a seven-point scale (1 to 7), where 1 means least importance and 7 stands for the highest importance. One of the aspects of customer experience is accuracy of bills, and suppose that the survey has yielded the following result.

Value   Frequency
1       1
2       3
3       6
4       13
5       72
6       135
7       130

Let A be the event that a randomly selected customer will consider the importance of accurate billing to be 6 or more on a 7 point scale. What is P(A)? How did you arrive at the value? What assumptions did you make?
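One way to approach the exercise above is the relative-frequency estimate. The sketch below assumes that every respondent is equally likely to be the randomly selected customer and that the survey sample is representative of the customer base; the variable names are illustrative only.

```python
from fractions import Fraction

# Survey frequencies from the table above: importance rating -> number of respondents
frequency = {1: 1, 2: 3, 3: 6, 4: 13, 5: 72, 6: 135, 7: 130}

total = sum(frequency.values())

# Event A: the customer rates the importance of accurate billing as 6 or more
favourable = sum(count for rating, count in frequency.items() if rating >= 6)

# Relative-frequency estimate of P(A)
p_a = Fraction(favourable, total)
print(f"P(A) = {favourable}/{total} = {float(p_a):.4f}")
```

Under these assumptions the estimate is 265/360, roughly 0.74. If the respondents were not sampled at random, the estimate need not carry over to the population.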

Some Elementary Properties

• One can derive many properties of the function P from the axioms.

• 0 ≤ P(A) ≤ 1 for all events A.

• Let Ω be the universal set (i.e. the event that contains all possible values of the random variable under consideration). Ω is also called the sample space.

• If Ω is the sample space then P(Ω) = 1.

• Let φ be the empty set (i.e. the event that none of the possible values of the random variable occurs). Then P(φ) = 0.

• P(Ac) = 1 – P(A), where the set Ac consists of all points x that do not belong to the set A. Ac is called the complementary set of A (read as A complement).

• For any event A we have A ∩ Ω = A; A ∩ φ = φ; A ∪ φ = A and A ∪ Ω = Ω.

• A and B are said to be mutually exclusive in case A ∩ B = φ.

• If A and B are two mutually exclusive events, then P(A ∪ B) = P(A) + P(B).

• If A and B are not mutually exclusive, then P(A ∪ B) = P(A) + P(B) – P(A ∩ B).

• A set of events B1, B2, …, Bp is mutually exclusive and collectively exhaustive when Bi ∩ Bj = φ for all i ≠ j and B1 ∪ B2 ∪ … ∪ Bp = Ω.

• Note: In the previous case ∪ (A ∩ Bj) = A, where the union is taken over all j = 1, 2, …, p. Thus Σ P(A ∩ Bj) = P(A) when the Bj’s are mutually exclusive and collectively exhaustive. (Why?)
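The properties above can be checked numerically on a small sample space. The sketch below assumes a single fair die (an illustration that is not part of the slides), so that each of the six faces has probability 1/6; the events A and B and the partition are arbitrary choices.

```python
from fractions import Fraction

# Assumed sample space: the six faces of one fair die, all equally likely
omega = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of omega) under equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({2, 4, 6})   # even face
B = frozenset({4, 5, 6})   # face greater than 3

# Complement rule: P(Ac) = 1 - P(A)
complement_ok = prob(omega - A) == 1 - prob(A)

# Addition rule for events that are not mutually exclusive
addition_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# A mutually exclusive and collectively exhaustive set B1, B2, B3
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
partition_ok = sum(prob(A & Bj) for Bj in partition) == prob(A)

print(complement_ok, addition_ok, partition_ok)
```

The last check works because the pieces A ∩ Bj are themselves mutually exclusive and their union is A, which is one way to answer the "Why?" above.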

Axioms of Probability

• A function P that assigns a real number P(A) to each event A is a probability distribution or a probability measure if it satisfies the following three axioms:

a. P(A) ≥ 0
b. P(Ω) = 1
c. If A1, A2, … are disjoint events, i.e. Ai ∩ Aj = φ for i ≠ j, where φ is the empty set, then P(∪ Aj) = Σ P(Aj)

The axioms of probability provide the theoretical basis, and the elementary properties mentioned in the previous slide follow from the axioms.

Concept of Joint Probability

• Let A and B be two events with probabilities P(A) and P(B). Suppose we are interested in finding out the probability of the event A ∩ B (read A intersection B).

• The event A ∩ B denotes the joint occurrence of A and B.

• Example: Suppose in the context of a retail store, A denotes the event that a customer buys bread and B denotes the event that a customer buys butter. Then A ∩ B denotes the event that the customer buys both bread and butter.

• Another example: Suppose a travel company places online ads for hotel booking, air ticketing and car hire. Let A, B and C be the events that a prospective customer visiting the site books a hotel room, buys an air ticket through the travel company and hires a car respectively. Then A ∩ B indicates that the customer books a hotel room as well as an air ticket through the travel company. What will A ∩ C, B ∩ C and A ∩ B ∩ C indicate?

• In the previous case suppose N people have visited the site. Let NA, NB and NC denote the number of customers who booked a hotel room, bought an air ticket and hired a car respectively. Let NAB, NBC and NAC be the number of cases when the customers have booked the corresponding two services and NABC be the number of cases when the customer booked all three services. Find P(A ∩ B), P(A ∩ C), P(B ∩ C) and P(A ∩ B ∩ C). What assumptions did you make?

Conditional Probability

• The conditional probability of event A given event B, written as P(A│B), is the relative frequency of A given that B has happened.

• Conditional probability P(A│B) = P(AB) / P(B). In terms of counts, P(A│B) = NAB / NB.

• In the table given in the example-cum-exercise slide, what is the conditional probability that a customer gives a rating of 7 given that the rating is greater than 5?

• Suppose a family has three children. What is the conditional probability that the family has three daughters given that at least two of the three children are girls?

• P(A│B) is defined only if P(B) > 0.

• Note that P(A│B) and P(B│A) are in general not the same.
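The three-children question above can be answered by enumerating the sample space. The sketch below assumes that each child is a girl or a boy with equal chance, independently of the others; this assumption is not stated in the slide and the answer changes if it does not hold.

```python
from fractions import Fraction
from itertools import product

# Assumption: each child is a girl (G) or a boy (B) with equal chance, independently,
# so the 8 ordered outcomes are equally likely
outcomes = list(product("GB", repeat=3))

# Conditioning event: at least two of the three children are girls
at_least_two_girls = [o for o in outcomes if o.count("G") >= 2]

# Joint event: three girls (which automatically satisfies the condition)
three_girls = [o for o in at_least_two_girls if o.count("G") == 3]

# Conditional probability as a ratio of counts, N_AB / N_B
p_conditional = Fraction(len(three_girls), len(at_least_two_girls))
print(p_conditional)
```

Four of the eight outcomes have at least two girls and only one of those has three, so the conditional probability is 1/4, compared with an unconditional 1/8.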

An Important Point

• Note that P(A│B) and P(B│A) are not the same. Consider the following example.

• An epidemiologist wants to assess the impact of smoking on the incidence of lung cancer. From hospital records she collected data on 100 patients of lung cancer and she also collected data on 300 persons not suffering from lung cancer. She has classified the 400 samples into smokers and non-smokers and the observations are summarized below.

Smoker    Lung Cancer: Yes    Lung Cancer: No    Total
Yes       69                  137                206
No        31                  163                194
Total     100                 300                400

• Let A be the event that a person has lung cancer and let B be the event that the person is a smoker. Can you estimate P(A│B) from the table given above?
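The sketch below illustrates why the smoking table supports estimating P(B│A) but not P(A│B). The numbers of lung cancer patients (100) and of other persons (300) were fixed by the study design, so the column proportions are meaningful while the row ratio 69/206 simply reflects the 1:3 sampling. The function name and its `prevalence` argument are illustrative; the true prevalence P(A) would have to come from a source outside this study.

```python
# Counts from the table above
cancer_smoker, cancer_nonsmoker = 69, 31
healthy_smoker, healthy_nonsmoker = 137, 163

n_cancer = cancer_smoker + cancer_nonsmoker      # 100, fixed by the study design
n_healthy = healthy_smoker + healthy_nonsmoker   # 300, fixed by the study design

# These can be estimated, because sampling was done separately within each group
p_smoker_given_cancer = cancer_smoker / n_cancer       # P(B│A)
p_smoker_given_healthy = healthy_smoker / n_healthy    # P(B│Ac)

def p_cancer_given_smoker(prevalence):
    """P(A│B) via Bayes' theorem; `prevalence` is P(A), supplied from outside the study."""
    numerator = p_smoker_given_cancer * prevalence
    denominator = numerator + p_smoker_given_healthy * (1 - prevalence)
    return numerator / denominator

# The naive row ratio 69/206 silently assumes P(A) = 100/400
naive = cancer_smoker / (cancer_smoker + healthy_smoker)
print(round(naive, 4), round(p_cancer_given_smoker(100 / 400), 4))
```

The two printed values coincide, which shows that the naive ratio is only correct if one in four people in the population has lung cancer. With a realistic prevalence the same function would return a much smaller value.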

An Interesting Aspect of Conditional Probability

• P(A│B) may be very different from P(A) and we can often use this to our advantage

• Note that the joint occurrence of A and B is a subset of all occurrences of A. For example, A may denote the event that a machine fails. On a random day the chance of failure may be 0.0001, or 1 in 10000.

• However, given certain conditions described through the event B, P(A│B) may increase to 0.01.

• Thus the presence of condition B leads to a 100-fold increase in the probability of failure on any given day. If the condition persists for 10 days, the chance of at least one failure increases considerably. (Can you calculate it, assuming failures across days are independent?)

• Another example: We know that probability of a heart attack during a period of say one year for a randomly selected Indian male may be fairly low. However, this risk may increase significantly for a given combination of age, genetic disposition, smoking habit, BMI, level of blood sugar and LDL.

• In many analytics problems our job is to find the event B that increases / decreases the chance of occurrence of an event of interest significantly.
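For the machine-failure bullet above, the chance of at least one failure over 10 days can be worked out through the complement. This is a minimal sketch that assumes, as the slide suggests, that failures on different days are independent and that the daily probability stays constant.

```python
p_fail_normal = 0.0001    # P(A): failure on a random day
p_fail_given_b = 0.01     # P(A│B): failure on a day when condition B is present
days = 10

# Under independence, P(no failure in 10 days) = (1 - p) ** 10,
# so P(at least one failure) is its complement
p_at_least_one_given_b = 1 - (1 - p_fail_given_b) ** days
p_at_least_one_normal = 1 - (1 - p_fail_normal) ** days

print(f"{p_at_least_one_given_b:.4f} vs {p_at_least_one_normal:.4f}")
```

Under these assumptions the chance of at least one failure in 10 days is about 9.6% when condition B persists, against about 0.1% otherwise.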

Concept of Independence

• We say that events A and B are independent in case P(A│B) = P(A), i.e. the probability of A is not impacted by the presence of event B.

• This definition implies that when A and B are independent, P(AB) = P(A) P(B).

• Example: Suppose a machine may fail for three different reasons and suppose these three reasons happen independently. Let A, B and C denote the events that reason 1, reason 2 and reason 3 are present. Then P(ABC) = P(A) P(B) P(C).

• Note: If A and B are independent, then Ac and Bc are independent. In fact it can easily be shown that Ac and B, and A and Bc, are also independent.

Examples of Independence

• Suppose you are tossing a fair coin. Thus the probability that a toss results in a head is 0.5. Assuming that tosses are independent of each other, what is the chance that 3 tosses will result in 3 heads?

• Suppose a machine has 20 different parts. Suppose the parts fail independently of each other and on any given day a part fails with 1% chance only. Suppose the machine continues to operate if all parts are operational and fails if one or more parts fail. What is the chance that the machine will fail on a randomly selected day?
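Both questions above follow from the multiplication rule for independent events. The sketch below uses only the numbers given in the slide; the variable names are illustrative.

```python
# Three independent tosses of a fair coin: P(HHH) = 0.5 * 0.5 * 0.5
p_head = 0.5
p_three_heads = p_head ** 3

# Machine with 20 parts that fail independently, each with a 1% chance per day.
# The machine works only if all 20 parts work, so it fails with the complement.
n_parts = 20
p_part_fails = 0.01
p_machine_works = (1 - p_part_fails) ** n_parts
p_machine_fails = 1 - p_machine_works

print(p_three_heads, round(p_machine_fails, 4))
```

Although each part fails only 1% of the time, the machine fails on roughly 18% of days, because any one of the 20 parts is enough to stop it.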

Mutually Independent Events

A set of events A1, A2, …, Ap is said to be mutually independent in case, for all combinations 1 ≤ i < j < k < … ≤ p, the following multiplication rules hold:

P(Ai ∩ Aj) = P(Ai) P(Aj) ……… (1)

P(Ai ∩ Aj ∩ Ak) = P(Ai) P(Aj) P(Ak) ……… (2)

…

P(A1 ∩ A2 ∩ … ∩ Ap) = P(A1) P(A2) … P(Ap) ……… (p – 1)

Notes on Mutual Independence

• Mutual independence is a strong condition.

• The condition consists of a set of 2^p – p – 1 equations and looks complicated. In practice, however, its validity is usually evident from the way the experiment is set up and is assumed rather than checked equation by equation.

• Note that the last equation alone does not guarantee the others. All the equations must hold for the events to be mutually independent.

Pairwise Independence

• When the first equation, involving two events, holds for all possible choices of two events, the events are said to be pairwise independent.

• Pairwise independence does not mean mutual independence. Suppose two fair dice are thrown and the following three events are defined:
– A means odd face with first die
– B means odd face with second die
– C means odd sum

Note that A, B and C are pairwise independent but not mutually independent: when A and B both occur the sum is even, so P(A ∩ B ∩ C) = 0, whereas P(A) P(B) P(C) = 1/8.
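The two-dice example above can be verified by enumerating all 36 equally likely outcomes. The sketch below uses exact fractions; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome (d1, d2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def a(o):
    return o[0] % 2 == 1            # odd face on the first die

def b(o):
    return o[1] % 2 == 1            # odd face on the second die

def c(o):
    return (o[0] + o[1]) % 2 == 1   # odd sum

# Check the two-event multiplication rule for every pair
pairs = [(a, b), (a, c), (b, c)]
pairwise = all(
    prob(lambda o, e=e, f=f: e(o) and f(o)) == prob(e) * prob(f) for e, f in pairs
)

# Check the three-event multiplication rule
p_abc = prob(lambda o: a(o) and b(o) and c(o))
mutual = p_abc == prob(a) * prob(b) * prob(c)

print(pairwise, mutual, p_abc)
```

The pairwise check succeeds while the three-event check fails, since two odd faces always give an even sum.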

Concept of Total Probability

• Let B1, B2, …, Bp be a set of mutually exclusive and collectively exhaustive events such that Bj ≠ φ for j = 1, 2, …, p.

• Let A be any other event.

• Then P(A) = Σ P(A│Bj) P(Bj) (Why?)

Exercise

In a certain county 60% of registered voters support party A, 30% support party B and 10% are independents. When those voters were asked about increasing military spending, 40% of supporters of A opposed it, 65% of supporters of B opposed it and 55% of the independents opposed it. What is the probability that a voter selected randomly in this county opposes increased military spending?
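The exercise above is a direct application of the total probability formula, with the three voter groups as the mutually exclusive and collectively exhaustive events. A minimal sketch, with illustrative names:

```python
# For each group Bj: (P(Bj), P(oppose│Bj)), taken from the exercise
groups = {
    "party A": (0.60, 0.40),
    "party B": (0.30, 0.65),
    "independent": (0.10, 0.55),
}

# The groups must be collectively exhaustive: their probabilities sum to 1
total_share = sum(share for share, _ in groups.values())

# Total probability: P(oppose) = sum over j of P(oppose│Bj) P(Bj)
p_oppose = sum(share * p_given for share, p_given in groups.values())

print(round(p_oppose, 2))
```

The three contributions are 0.24, 0.195 and 0.055, giving a probability of 0.49 that a randomly selected voter opposes the increase.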

Bayes’ Theorem

• Bayes’ theorem allows us to look at probability from an inverse perspective.

• Bayes’ theorem states that

P(B│A) = P(A│B) P(B) / P(A)

• Let B1, B2, …, Bp be a set of mutually exclusive and collectively exhaustive events such that Bj ≠ φ for j = 1, 2, …, p. In this set up Bayes’ theorem may be stated as

P(Bj│A) = P(A│Bj) P(Bj) / (Σ P(A│Bi) P(Bi)), j = 1, 2, …, p

where the sum in the denominator is taken over all i = 1, 2, …, p.

• This simple yet intelligent way of looking at probability is often very effective. We may not be able to find P(Bj│A) directly but it may be far easier to estimate P(A│Bj).

• Construct examples of the previous statement. Recall the example of smoking and lung cancer. Can you use Bayes’ theorem to estimate the probability of lung cancer given smoking habit?

Application of Bayes’ Theorem

• Suppose I divide my email into three categories: A1 = spam, A2 = administrative and A3 = technical. From previous experience I find that P(A1) = 0.3, P(A2) = 0.5 and P(A3) = 0.2. Let B be the event that the email contains the word free and has at least one occurrence of the character !. From previous experience I have noted that P(B│A1) = 0.95, P(B│A2) = 0.005 and P(B│A3) = 0.001. I receive an email with the word free and the character !. What is the probability that it is spam? Notice that we have used Bayes’ theorem to construct a spam filter. In the subsequent slides we will see how this simple concept may be extended to construct powerful classification mechanisms.
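The spam question above can be worked through with Bayes’ theorem over the three email categories. The sketch below uses only the probabilities given in the slide; the dictionary names are illustrative.

```python
# Prior probabilities of the three categories, P(Aj)
priors = {"spam": 0.3, "administrative": 0.5, "technical": 0.2}

# P(B│Aj): chance that an email of each category contains "free" and "!"
likelihoods = {"spam": 0.95, "administrative": 0.005, "technical": 0.001}

# Numerators of Bayes' theorem, P(B│Aj) P(Aj)
joint = {category: priors[category] * likelihoods[category] for category in priors}

# Denominator by total probability, P(B)
p_b = sum(joint.values())

# Posterior probabilities P(Aj│B)
posteriors = {category: value / p_b for category, value in joint.items()}

print({category: round(p, 4) for category, p in posteriors.items()})
```

With these inputs P(B) = 0.2877 and the posterior probability of spam is 0.285 / 0.2877, about 0.99, even though only 30% of all emails are spam.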

Sensitivity and Specificity

• While Bayes’ theorem may be used to construct classification mechanisms, it may also be used to evaluate their performance.

• Let B denote the event of interest, say the failure of a machine or the event that a person has a particular disease.

• Let A be the event that the classifier gives a positive response.

• Ac is the event that the classifier gives a negative response.

• Note the difference between actual occurrence and a positive response from the classification technique.

Questions:
a. What is the difference between the two conditional events A│B and B│A?
b. Which probability are we interested in?
c. Is it possible to estimate the probability of interest directly? If yes, how? If not, why not?

Sensitivity and Specificity (Continued…)

• P(A│B) is the conditional probability of a positive response given that the event has actually occurred. This probability is called the sensitivity. The higher the probability of a positive response from the classifier when the underlying condition is truly positive, the higher the sensitivity.

• P(A│Bc) is the probability of a positive response when the underlying condition is actually negative. 1 − P(A│Bc) is called the specificity. The lower the value of P(A│Bc), the higher the specificity. Thus specificity is also given by P(Ac│Bc).

False Positive and False Negative

• P(A│Bc) is the probability of getting a false positive. This gives the probability of a positive response when the event of interest actually did not happen.

• P(Ac│B) is the probability of getting a false negative. This gives the probability of a negative response when the event of interest actually happened.

Note

A highly sensitive instrument rarely gives false negative results and a highly specific instrument rarely gives false positive results.

Events of Interest

• Note that sensitivity and specificity do not give the probabilities of the events of interest.

• We are actually interested in positive and negative predictive values (abbreviated as PPV and NPV respectively), defined as
– PPV = P(B│A) = P(A│B) P(B) / P(A), by Bayes’ theorem
– NPV = P(Bc│Ac) = P(Ac│Bc) P(Bc) / P(Ac), by Bayes’ theorem

• Notice that PPV and NPV usually cannot be found directly, whereas sensitivity and specificity can be.

• Also P(A) = P(A ∩ B) + P(A ∩ Bc) = P(A│B) P(B) + P(A│Bc) P(Bc) = Sensitivity × P(B) + (1 – Specificity) × (1 – P(B))

• Thus we can find PPV and NPV provided we know sensitivity, specificity and the prevalence of the particular event of interest in the population (i.e. if we know P(B)).
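The formulas above translate into a short calculation. In the sketch below the function name and the input values (sensitivity 0.95, specificity 0.90, prevalence 0.01) are hypothetical and chosen only for illustration; they do not come from the slides.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity P(A│B), specificity P(Ac│Bc) and prevalence P(B)."""
    # P(A) = Sensitivity * P(B) + (1 - Specificity) * (1 - P(B))
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_negative = 1 - p_positive                           # P(Ac)

    ppv = sensitivity * prevalence / p_positive           # P(B│A)
    npv = specificity * (1 - prevalence) / p_negative     # P(Bc│Ac)
    return ppv, npv

# Hypothetical classifier and a rare event of interest
ppv, npv = predictive_values(sensitivity=0.95, specificity=0.90, prevalence=0.01)
print(f"PPV = {ppv:.4f}, NPV = {npv:.4f}")
```

With these hypothetical inputs fewer than 9% of positive responses correspond to a real occurrence, although the classifier looks accurate on sensitivity and specificity alone. This is why prevalence has to enter the assessment.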

Why is this Important?

Suppose we are trying to develop a classification model to understand what leads to failure of vehicles. It is not possible to conduct experiments where we observe impact of different conditions on the event of failure of vehicles in a given period of time. However, whenever vehicles fail, the failures will be reported. Suppose the conditions are captured by sensors. Thus we will have data on conditions given that failure has happened. We can, therefore, estimate the probability of different conditions given that failure has happened. From warranty report data we can also estimate the unconditional probability of failure. We can, therefore, use the methodology given above to classify whether vehicles will fail under given conditions

Further Insights

• Note that the previous discussions show how we can estimate P(B│A), where B is the failure event, when P(A│B) and P(B) are known.

• Generally we would like to estimate the conditional probability of failure given many events rather than only one.

• Thus we may like to estimate P(B│A1 A2 … Ak). Note that using Bayes’ theorem, we know

P(B│A1 A2 … Ak) = P(A1 A2 … Ak│B) P(B) / P(A1 A2 … Ak)

• In the next section we will see how the previous concepts, including the Bayes’ optimal classification rules and the concept of conditional independence to be introduced next may be used to solve classification problems

Concept of Conditional Independence

• Let A, B and C be three events.

• A and B are said to be conditionally independent given C in case P(A│B ∩ C) = P(A│C).

• Conditional independence is often a reasonable assumption, as the following example shows.

Consider the following events:

A = Event that the lecture is delivered by Amitava (there are two teachers, Amitava and Boby)
B = Event that the lecturer arrives late
C = Event that the lecture concerns stat theory (theory and practical are taught)

Suppose Amitava has a higher chance of delivering the lecture on stat theory.
Suppose Amitava is likelier to be late.

Notice that once we know that the lecturer is Amitava, the fact that the lecturer arrives late tells us nothing more about whether the lecture is on stat theory. Thus P(C│AB) = P(C│A), i.e. B and C are conditionally independent given A.

Implication of Conditional Independence

• Let A and B be conditionally independent given C. Note that

P(AB│C) = P(ABC) / P(C)
        = P(A│BC) P(BC) / P(C)
        = P(A│C) P(B│C) P(C) / P(C) (Why?)

• Thus we get P(AB│C) = P(A│C) P(B│C).

• In general, when A1, A2, …, Ak are conditionally independent given B,

P(A1 A2 … Ak│B) = P(A1│B) P(A2│B) … P(Ak│B)

Naïve Bayes’ Classification

• The concepts of Bayes’ theorem, Bayes’ optimality criterion for classification and conditional independence may be combined to develop a classification methodology.

• Suppose a response variable R takes k different values. Without loss of generality, let these values be 1, 2, …, k.

• Let A1, A2, …, An be n different events defined in terms of explanatory variables.

• We want to estimate the probability P(R = j│A1 A2 … An) for different combinations of A1, A2, …, An.

• Once these probabilities are estimated for all j for a given combination of A1, A2, …, An, we find the j that maximizes this probability. By Bayes’ optimality criterion, for a given combination of A1, A2, …, An, the response is allocated to the class j that maximizes the probability.

• We have already shown that P(B│A1 A2 … Ak) = P(A1 A2 … Ak│B) P(B) / P(A1 A2 … Ak). Here the event B is R = j and the conditioning events are A1, A2, …, An.

• Since the denominator does not depend on the class j, we allocate to the class for which the numerator is maximum.

• Under the assumption of conditional independence of A1, A2, …, An given R = j, we get P(A1 A2 … An│R = j) = P(A1│R = j) P(A2│R = j) … P(An│R = j). Generally these probabilities can be estimated from data, and consequently a classification mechanism may be developed.

Example

Consider the problem where data were collected for customers of computers. We need to develop a classification mechanism so that customers may be classified as buyers or non-buyers given the profile. We will use Naïve Bayes’ classification methodology to accomplish this objective.

Data Table

Age       Income    Student   Credit Rating   Buys Computer
≤ 30      High      No        Fair            No
≤ 30      High      No        Excellent       No
31 – 40   High      No        Fair            Yes
> 40      Medium    No        Fair            Yes
> 40      Low       Yes       Fair            Yes
> 40      Low       Yes       Excellent       No
31 – 40   Low       Yes       Excellent       Yes
≤ 30      Medium    No        Fair            No
≤ 30      Low       Yes       Fair            Yes
> 40      Medium    Yes       Fair            Yes
≤ 30      Medium    Yes       Excellent       Yes
31 – 40   Medium    No        Excellent       Yes
31 – 40   High      Yes       Fair            Yes
> 40      Medium    No        Excellent       No

Classification Mechanism

• The classifier aims at developing a method such that optimal allocation to one of the classes (buys computer / does not buy computer) is made for any customer with a given combination of age, income, status (student or not) and credit rating.

• Let B be the response variable that takes two values. B = 0 means the customer does not buy a computer and B = 1 means s/he buys a computer.

• Now P(B = 0│Age, Income, Status, Credit Rating) and P(B = 1│Age, Income, Status, Credit Rating) need to be found using the Naïve Bayes’ theory.
– We know that rather than estimating these probabilities, it suffices to find values proportional to them.
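The following is a minimal sketch of how the proportional values could be computed from the data table, assuming conditional independence of the four attributes given the class. The function name, the string encoding of the age bands and the query customer are illustrative choices and are not taken from the slides. No smoothing is applied, so an attribute value never seen within a class would force that class score to zero.

```python
from collections import Counter

# Rows of the data table: (age, income, student, credit rating, buys computer)
data = [
    ("<=30", "High", "No", "Fair", "No"),
    ("<=30", "High", "No", "Excellent", "No"),
    ("31-40", "High", "No", "Fair", "Yes"),
    (">40", "Medium", "No", "Fair", "Yes"),
    (">40", "Low", "Yes", "Fair", "Yes"),
    (">40", "Low", "Yes", "Excellent", "No"),
    ("31-40", "Low", "Yes", "Excellent", "Yes"),
    ("<=30", "Medium", "No", "Fair", "No"),
    ("<=30", "Low", "Yes", "Fair", "Yes"),
    (">40", "Medium", "Yes", "Fair", "Yes"),
    ("<=30", "Medium", "Yes", "Excellent", "Yes"),
    ("31-40", "Medium", "No", "Excellent", "Yes"),
    ("31-40", "High", "Yes", "Fair", "Yes"),
    (">40", "Medium", "No", "Excellent", "No"),
]

def naive_bayes_scores(rows, query):
    """For each class, return P(class) times the product of P(attribute value│class)."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)                    # prior, estimated by relative frequency
        for position, value in enumerate(query):
            matches = sum(
                1 for row in rows if row[-1] == label and row[position] == value
            )
            score *= matches / n_label                 # P(attribute = value│class)
        scores[label] = score
    return scores

# Hypothetical customer: age <= 30, medium income, student, fair credit rating
query = ("<=30", "Medium", "Yes", "Fair")
scores = naive_bayes_scores(data, query)
prediction = max(scores, key=scores.get)

print({label: round(score, 4) for label, score in scores.items()}, prediction)
```

For this hypothetical customer the score for buying is about 0.0282 against about 0.0069 for not buying, so the customer is allocated to the buyer class. Dividing each score by their sum would give the estimated posterior probabilities.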

Exercise

• Develop a classification mechanism for the IRIS database
– Hint: Note that the response variable has three classes. Also observe that there are four explanatory variables
– Divide the explanatory variables into certain classes and find the conditional probabilities of the explanatory variables given the different values of the response variable

Examples of Usage of the Concepts

• A machine has many different sensors that capture data on a number of variables, say temperature, speed, vibration, and so on. Suppose these data are continuous and captured on a ratio scale. The machine may fail in many different modes, including degradation of function or development of fault codes that may not lead to stoppage of function. The failure mode is a categorical variable. Our problem is to allocate the machine to one of these classes given the sensor data.

• Inspection of products may be expensive. Thus we may need to develop a filtering mechanism on the basis of automatically collected data to classify products as good or bad. This application is very similar to the spam filtering we have discussed earlier.

• Suppose a manufacturer has installed an automatic sorting device at a great cost. The concepts of sensitivity and specificity, and in particular positive predictive value and negative predictive value, may be used to assess the justification of the investment.

• A manufacturer may like to estimate the probability of failure of a certain mission given certain conditions, say for example an R&D project under certain conditions. It is usually easier to look at failed and successful projects (called case-control studies) and assess the probabilities of the conditions within each group. Bayes’ theorem may then be used to find the probability of failure. Accordingly the company may be guided about making prudent investments.

Review Questions

• What is a random experiment?

• What is an event? What is the meaning of probability?

• Let A and B be two events. What is meant by “A and B are independent”?

• Define conditional probability. Are P(A│B) and P(B│A) the same? If not, why not?

• Explain the concept of conditional independence. How is it used for classification?
